The China Telecom Cloud Computing Research Institute has made a significant breakthrough in intelligent fault monitoring. The research paper “Nip it in the Bud: Unsupervised KPI Incipient Fault Detection via Dynamic Latent Feature Ensembling,” authored by Yanwen Wang, Wenda Tang, and Jie Wu, has been accepted by the 44th IEEE International Symposium on Reliable Distributed Systems (SRDS 2025), one of the premier international conferences in the field of distributed system reliability. The study addresses the core challenge of incipient fault detection in cloud environments and proposes an innovative solution for improving system reliability in operations and maintenance.
As cloud computing and distributed systems grow increasingly complex, real-time monitoring of key performance indicators (KPIs) and early detection of system performance degradation have become critical to ensuring service continuity and user experience. However, noise interference, high-dimensional dependencies in multivariate time series (MTS), and limited labeled data often prevent existing methods from identifying potential faults before visible anomalies occur. To tackle these issues, the research introduces HEIMDALLR, a novel unsupervised detection framework designed to capture incipient and subtle anomaly signals hidden in KPI patterns. HEIMDALLR centers on dynamic latent space modeling tailored to KPI behavior and integrates an anomaly attribution mechanism to extract and decompose latent causal patterns. Compared with traditional techniques, HEIMDALLR offers notable improvements in accuracy, false alarm suppression, and computational efficiency, making it particularly suitable for real-time deployment in large-scale cloud environments.
Figure 1: Overall architecture of the HEIMDALLR framework.
The IEEE International Symposium on Reliable Distributed Systems (SRDS) is one of the most prestigious and long-standing conferences in the field of distributed system reliability, with a history spanning 43 editions. SRDS covers a broad range of topics including trust and privacy in distributed systems, fault tolerance and self-healing technologies, real-time and resilient computing, as well as the design and evaluation of dependable systems. SRDS 2025 will be held in Porto, Portugal, from September 29 to October 2.
As a key driver of technological innovation within China Telecom, the China Telecom Cloud Computing Research Institute continues to advance the development of the “Intelligent Ubiquitous Cloud” technology ecosystem and deepen research in cutting-edge domains. This latest breakthrough in fault detection not only enhances the core capabilities of intelligent cloud-network monitoring, but also strengthens the institute’s expertise in unsupervised intelligent diagnostics and reliable distributed systems. The publication of this work will play a critical role in supporting the intelligent evolution of large-scale cloud system operations with high reliability and low latency, laying a solid technical foundation for next-generation intelligent infrastructure.