With the rapid growth of cloud computing and the emergence of large-scale systems such as IoT environments, the failure of machines and devices, and by extension of the systems that rely on them, poses a major risk to performance, usability, and the security mechanisms that support them. Predicting such anomalies, combined with building fault-tolerant systems to manage them, is a key factor in the development of safer and more stable systems.
1. Introduction
Critical infrastructure systems, such as water supply, power supply, transportation, and telecommunications, play a significant role in the sustainable development of modern societies. Modern infrastructure systems are highly interconnected and consist of geographically extensive networks. Continuous communication and data exchange between these systems create interdependencies that are essential for their proper functioning and for the functioning of the overall system to which they belong. Because infrastructure systems are networked on such a large scale, their failure can cause economic, social, health, and environmental problems. Failures can arise from extreme natural phenomena (hurricanes, floods), technological disasters, or cyber-attacks. As a result, systems of this type must be regularly monitored, upgraded, and maintained [1].
Ensuring the healthy and continuous operation of systems such as aircraft engines, cars, computer servers, and even satellites is an imperative need, given their contribution to critical services beyond urban infrastructure. Accurately predicting their malfunctions and, by extension, their operational interruptions can contribute to the design of proactive fault-tolerant systems, as well as to significant cost reductions through prompt fault reporting. Previous research has applied a variety of prediction techniques in such scenarios, including the autoregressive model [2], principal component analysis [3], and the opposite degree algorithm [4].
Furthermore, a significant amount of research has been carried out in the field of anomaly detection, which is closely related to machine failure. This research spans a variety of proposed models, mostly based on machine learning techniques such as artificial neural networks (ANNs) [5][6], random forests (RF) and support vector machines (SVMs) [7], and convolutional neural networks (CNNs) and long short-term memory (LSTM) networks [8].
To accurately predict machine failure and prevent costly downtime, it is essential to use a reliable and flexible approach that can account for the wide range of factors influencing machine degradation. While traditional statistical models have been widely used in failure prediction [9], they often rely on strict assumptions about machine degradation patterns that may not reflect the real-world complexity of the problem. In contrast, survival analysis has emerged as a promising alternative, offering several advantages that can improve the accuracy and efficiency of predictions. By incorporating time-to-failure information, handling right-censored data, accounting for covariate effects, and providing flexibility in application, survival analysis represents a superior alternative to traditional models in the context of machine failure prediction.
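To make right-censoring concrete, the following minimal sketch (in Python, using the scikit-survival library) encodes invented machine run-to-failure records in the structured event/time format that such estimators expect and computes a Kaplan–Meier estimate of the survival function. The hours and failure flags are illustrative, not data from any study cited here.

```python
import numpy as np
from sksurv.nonparametric import kaplan_meier_estimator
from sksurv.util import Surv

# Invented run-to-failure records: observed hours and whether a failure
# occurred (True) or the machine was still running when observation
# stopped (False, i.e., right-censored).
hours = np.array([120.0, 340.0, 560.0, 610.0, 900.0, 1000.0])
failed = np.array([True, True, False, True, False, True])

# Structured (event, time) array: the right-censored label format that
# scikit-survival estimators (e.g., a random survival forest) consume.
y = Surv.from_arrays(event=failed, time=hours)

# Kaplan-Meier estimate of the survival function S(t).
time_points, survival_prob = kaplan_meier_estimator(failed, hours)
for t, s in zip(time_points, survival_prob):
    print(f"S({t:.0f} h) = {s:.2f}")
```

Note that the censored runs (failed = False) still inform the estimate up to their last observed hour, which is precisely the information a standard regression model would have to discard.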
While some studies have utilized survival analysis to predict failure, few have examined the effectiveness of different feature selection/analysis methods in conjunction with this technique. Most existing research focuses primarily on machine learning models, such as LSTM networks [10][11], moving away from the survival analysis approach. While machine learning models are undoubtedly useful for predicting machine failure, survival analysis may be more appropriate when the goal is to predict failure times and identify the key factors behind failures. In the proposed model, a machine learning survival analysis technique is used, along with feature analysis and selection methods. The machine learning model used is the random survival forest (RSF), which is well suited to time-to-event data, such as machine failure times, and has several advantages over other machine learning methods in the context of survival analysis. Examples include the ability to handle time-dependent covariates, non-linear relationships between covariates and survival, and interactions between covariates [12].
Unlike survival analysis models, standard machine learning algorithms are not equipped to handle the right-censored data that are prevalent in machine failure datasets. RSF incorporates machine learning while overcoming this problem, thanks to its ability to handle high-dimensional, complex data [13]. RSF can handle both continuous and categorical predictors, as well as complex interactions, non-linear relationships, and time-varying effects. This makes it particularly useful in survival analysis settings where many potential predictors may interact in complex ways [14]. One of the key benefits of RSF is its ability to handle missing data, which are common in many real-world datasets [13][14][15]. RSF uses a tree-based approach to impute missing values: the data are split at each node based on the available values, which are then used to predict the missing ones [13]. This imputation process is repeated multiple times, resulting in a distribution of imputed datasets that can be used to estimate uncertainty. Overall, the flexibility and versatility of RSF make it a powerful tool for survival analysis, particularly when there are many potential predictors and complex interactions among variables [13].
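As a rough illustration of the mechanics, and not of the specific model described in this entry, the sketch below fits scikit-survival's RandomSurvivalForest on synthetic covariates and right-censored failure times; all names and values are invented.

```python
import numpy as np
from sksurv.ensemble import RandomSurvivalForest
from sksurv.util import Surv

rng = np.random.default_rng(0)

# Hypothetical sensor covariates (e.g., temperature, torque, tool wear)
# for 200 machines, with right-censored failure times.
X = rng.normal(size=(200, 3))
time = rng.exponential(scale=500.0, size=200) + 50.0
event = rng.random(200) < 0.7            # roughly 30% of runs censored
y = Surv.from_arrays(event=event, time=time)

rsf = RandomSurvivalForest(n_estimators=200, min_samples_leaf=10,
                           random_state=0)
rsf.fit(X, y)

# Predicted survival step functions for the first two machines.
for fn in rsf.predict_survival_function(X[:2]):
    print(fn.x[:5], fn.y[:5])  # event times and survival probabilities
```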
2. Machine Failure Prediction Using Survival Analysis
Despite the abundance of statistical methods that can be used, survival analysis aligns conceptually with the problem of predicting the failure of a machine [16].
Kaplan–Meier curves and the Cox regression model have been employed in similar research to determine the relationship between the survival time of a subject and one or more prognostic variables. In [17], the SMED (single-minute exchange of die) philosophy and survival analysis are used to reduce transition times. In that research, the Cox model is also used to identify the significance of the causes of time loss. The proposed methodology predicts activity times considering only the characteristics identified as significant for transition times, which the authors acknowledge as a limitation.
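For readers unfamiliar with this workflow, a minimal hypothetical sketch of Cox-based significance assessment with the lifelines library follows; the changeover records and covariates (tool_missing, crew_size) are invented and are not taken from [17].

```python
import pandas as pd
from lifelines import CoxPHFitter

# Invented changeover records: duration until completion, an event flag
# (0 = observation ended before completion), and two candidate causes
# of time loss as covariates.
df = pd.DataFrame({
    "duration":     [32.0, 45.0, 28.0, 60.0, 41.0, 55.0, 38.0, 47.0],
    "event":        [1, 1, 1, 0, 1, 1, 0, 1],
    "tool_missing": [0, 1, 0, 1, 0, 1, 0, 1],
    "crew_size":    [3, 2, 3, 2, 3, 2, 2, 3],
})

cph = CoxPHFitter()
cph.fit(df, duration_col="duration", event_col="event")
cph.print_summary()  # hazard ratios and p-values per covariate
```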
Similarly, the model proposed in [18] follows a staged Bayesian approach, modeling failure with tree-like accident theory and using a Bayesian survival analysis model to predict the probability of survival for welded pipes. Using Bayesian, Kaplan–Meier, and Weibull curves, the authors construct a staged Bayesian distribution, which is then used to predict the time-to-failure of the pipes. The Weibull distribution is also used in [19] to predict the life of battery cells, combined with the exponential, log-normal, and log-logistic distributions to create an accelerated failure time (AFT) parametric survival model. The authors concluded that low prediction errors could be achieved using only a small number of variables in the proposed model. They assert that this finding holds significant value, as their model yielded a 40% reduction in the root mean square error (RMSE). A limitation of this study is that the authors relied on only two datasets to support their findings, without exploring the use of other datasets.
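An AFT model of the kind used in [19] can be sketched with lifelines' WeibullAFTFitter as follows; the battery records and stress covariates (temp_C, c_rate) are invented, and the cited study additionally compared exponential, log-normal, and log-logistic variants.

```python
import pandas as pd
from lifelines import WeibullAFTFitter

# Invented battery-cell aging data: cycles to end-of-life, an event flag
# (0 = cell still above its capacity threshold), and two stress factors.
df = pd.DataFrame({
    "cycles": [820.0, 640.0, 910.0, 1100.0, 730.0, 980.0, 1200.0, 560.0],
    "event":  [1, 1, 1, 0, 1, 1, 0, 1],
    "temp_C": [45, 55, 40, 35, 50, 40, 35, 60],
    "c_rate": [1.0, 2.0, 1.0, 0.5, 1.5, 1.0, 0.5, 2.0],
})

aft = WeibullAFTFitter()
aft.fit(df, duration_col="cycles", event_col="event")
aft.print_summary()                  # covariate effects on log-lifetime
print(aft.predict_median(df).head()) # predicted median cycles to failure
```

Swapping in LogNormalAFTFitter or LogLogisticAFTFitter changes only the assumed lifetime distribution, which is how the alternative AFT variants can be compared.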
In [10], an LSTM approach is presented for predicting the remaining life of machines. The proposed approach leverages the strength of LSTMs in capturing temporal dependencies in sensor data while also effectively handling missing data. The authors conducted experiments on a real-world dataset of a milling machine and evaluated the performance of their approach against various baseline methods. The results indicate that the LSTM-based approach surpasses the other methods in accurately predicting the machine's health status and capturing its dynamic behavior. Of the models tested, the bidirectional LSTM achieved a total RMSE of 15.42 cycles, outperforming all other models: a deep convolutional neural network (DCNN) (18.44 cycles), a support vector regressor (SVR) (20.96 cycles), a multilayer perceptron (MLP) (20.84 cycles), a bidirectional recurrent neural network (BD-RNN) (20.04 cycles), and a classic LSTM (18.07 cycles). Similarly, the authors of [11] proposed a semi-supervised deep architecture for predicting the remaining useful life (RUL) of turbofan engines. The model uses both labeled and unlabeled data to enhance its performance and reduce the need for extensive labeled data. It combines a convolutional neural network (CNN) and an LSTM network that work in tandem to extract features and capture the temporal dependencies of the input data. The model was evaluated on the C-MAPSS dataset and compared against several state-of-the-art methods, achieving superior performance in terms of both RUL prediction accuracy and mean absolute error. Specifically, the proposed model yielded superior RMSE results on most of the subsets tested, with the value of 12.10 on the FD003 subset being the lowest, while also achieving the best prediction score on all of them (FD001: 231, FD002: 3366, FD003: 251, FD004: 2840). A limitation of the study, as stated by the authors, is the use of a piece-wise linear degradation model, which does not account for the individual degradation patterns of each engine in each subset. The authors plan to address this in future work by exploring an unsupervised fault detector based on a variational autoencoder to optimize performance.
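The following is a minimal PyTorch sketch of the generic LSTM-to-RUL setup these papers build on, not a reproduction of either architecture; the window length, sensor count, and hidden size are arbitrary illustrative choices.

```python
import torch
import torch.nn as nn

class RULRegressor(nn.Module):
    """Minimal LSTM that maps a window of sensor readings to a RUL value."""
    def __init__(self, n_sensors: int = 14, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(n_sensors, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                # x: (batch, time, n_sensors)
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :])  # regress RUL from last time step

model = RULRegressor()
window = torch.randn(8, 30, 14)          # 8 engines, 30-cycle windows
rul = model(window)                      # (8, 1) remaining-cycle estimates
loss = nn.functional.mse_loss(rul, torch.rand(8, 1) * 100)
loss.backward()                          # one illustrative training step
```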
The method introduced in [20] is based on DCNNs and diagnoses faults in induction motors using multiple signals. It leverages the strength of DCNNs in automatic feature extraction and achieves improved diagnostic performance by combining information from multiple sensor signals. The authors conducted experiments on a dataset containing multiple types of induction motor faults and compared two architectures of their proposed method. The first architecture used a multichannel model that merged two separate time–frequency images, one from vibration signals and one from current signals, into a two-channel image. This image was then fed into a deep model consisting of three 2D convolutional layers and a fully connected layer with ReLU activation functions. The output layer had six units corresponding to six distinct labels. The second architecture used two convolutional networks to analyze the sensor signals separately, which were then merged in fully connected layers to produce the label prediction. One network was trained on vibration signals, the other on current signals. The learned fault signatures from each network were combined by flattening them into a fully connected layer with 1024 ReLUs. The output layer for predicting the state label was the same as in the first architecture. A confidence interval analysis showed that the proposed multi-signal DCNN model had stable performance, and the merged model outperformed the multichannel model, with a 95% likelihood of covering fault classification accuracy between 99.89% and 99.93%. To address the issue of limited training data for deep architectures, the authors suggest data augmentation techniques to expand the dataset and the exploration of pre-existing models for fault diagnosis as directions for future work.
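A rough PyTorch sketch of the second (two-branch) architecture follows. The branch depth, image size, and channel counts are assumptions for illustration; only the 1024-unit ReLU merge layer and the six-label output follow the paper's description.

```python
import torch
import torch.nn as nn

class TwoStreamDCNN(nn.Module):
    """Two conv branches (vibration, current) whose flattened features
    merge in a 1024-unit ReLU layer, ending in six state labels."""
    def __init__(self, n_classes: int = 6):
        super().__init__()
        def branch() -> nn.Sequential:
            return nn.Sequential(
                nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            )
        self.vib, self.cur = branch(), branch()
        self.fc = nn.Sequential(
            nn.Linear(2 * 32 * 16 * 16, 1024), nn.ReLU(),
            nn.Linear(1024, n_classes),
        )

    def forward(self, vib_img, cur_img):
        # Each input: (batch, 1, 64, 64) time-frequency image.
        feats = torch.cat([self.vib(vib_img).flatten(1),
                           self.cur(cur_img).flatten(1)], dim=1)
        return self.fc(feats)

logits = TwoStreamDCNN()(torch.randn(4, 1, 64, 64),
                         torch.randn(4, 1, 64, 64))
print(logits.shape)  # torch.Size([4, 6])
```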
The paper [21] proposes a method for equipment failure diagnosis that addresses the challenges of limited data and imbalanced data distribution. Specifically, the proposed method combines the synthetic minority oversampling technique (SMOTE) with a conditional tabular generative adversarial network (CTGAN) to predict equipment failures from a mixture of numerical and categorical data. The experimental results show that the proposed method outperforms similar methods in five-category failure classification, even when failure data account for less than 1% of the total data. The proposed model achieved a high recall of 0.9068, an accuracy of 0.8712, and a balanced accuracy of 0.8883. The recall and balanced accuracy were the highest across all methods tested by the authors, which, apart from the created model, were a CatBoost model without oversampling, a combination of SMOTENC and CatBoost, and another combination of the CTGAN and CatBoost models. Notably, the highest accuracy was obtained using the CatBoost algorithm without oversampling. Moreover, the paper highlights the importance of false positives in equipment failure prediction: since the cost of sudden machine downtime far exceeds that of a misdiagnosis, the proposed method deliberately accepts more false positives in order to reduce false negatives. The authors also note that the interpretability of the failure prediction results is crucial, and they incorporated a tree-based model for failure prediction so that the causes of failures can be analyzed and preventive measures implemented accordingly.
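A hypothetical pipeline in the spirit of [21], combining SMOTENC oversampling, CTGAN-based synthesis, and a CatBoost classifier, might look as follows; the data, column names, and hyperparameters are all invented, and this is a sketch of the general idea rather than the authors' exact method.

```python
import numpy as np
import pandas as pd
from catboost import CatBoostClassifier
from ctgan import CTGAN
from imblearn.over_sampling import SMOTENC

rng = np.random.default_rng(0)
n = 300

# Invented mixed-type equipment log: two numeric sensors, one
# categorical machine type, and a rare failure label (first 20 rows).
df = pd.DataFrame({
    "temp": rng.normal(70, 10, n),
    "load": rng.normal(0.6, 0.2, n),
    "mtype": rng.choice(["A", "B", "C"], n),
})
df["failure"] = 0
df.loc[:19, "failure"] = 1

X, y = df[["temp", "load", "mtype"]], df["failure"]

# Step 1: SMOTENC oversamples the minority class while treating the
# categorical column (index 2) correctly, unlike plain SMOTE.
X_res, y_res = SMOTENC(categorical_features=[2],
                       random_state=0).fit_resample(X, y)

# Step 2: a CTGAN trained on the minority rows synthesizes additional
# failure records for mixed numerical/categorical tabular data.
minority = df[df["failure"] == 1].drop(columns="failure")
gan = CTGAN(epochs=10)
gan.fit(minority, discrete_columns=["mtype"])
synthetic = gan.sample(100)

# Step 3: train a CatBoost classifier on the augmented data.
X_aug = pd.concat([X_res, synthetic], ignore_index=True)
y_aug = pd.concat([y_res, pd.Series([1] * len(synthetic))],
                  ignore_index=True)
clf = CatBoostClassifier(verbose=0)
clf.fit(X_aug, y_aug, cat_features=[2])
```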
In [22], the authors conduct a comparative study evaluating a range of machine learning techniques for fault detection and classification. The models used are an SVM classifier, a k-nearest neighbors (KNN) classifier, random forest, logistic regression, and a decision tree. All models were tested on five datasets, and their accuracy and AUC-ROC scores were measured. The authors concluded that the best-performing method was random forest, with an average accuracy of 0.964 and an average AUC-ROC score of 0.948 across all datasets. The other notable methods were the decision tree, with an average accuracy of 0.959 and an AUC-ROC score of 0.944, and KNN, with an average accuracy and AUC-ROC score of 0.942 and 0.930, respectively.
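Such a comparison is straightforward to reproduce in outline with scikit-learn; the sketch below benchmarks the same five model families on a synthetic stand-in dataset, so the scores it prints are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for one imbalanced fault-detection dataset.
X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.9], random_state=0)

models = {
    "SVM": SVC(probability=True),
    "KNN": KNeighborsClassifier(),
    "Random forest": RandomForestClassifier(random_state=0),
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Decision tree": DecisionTreeClassifier(random_state=0),
}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=5,
                            scoring=["accuracy", "roc_auc"])
    print(f"{name}: acc={scores['test_accuracy'].mean():.3f} "
          f"auc={scores['test_roc_auc'].mean():.3f}")
```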
In [12], the authors propose a new approach for predicting the remaining service life of water mains by combining machine learning and survival statistics. They developed a machine learning algorithm that uses a combination of historical failure data and pipe-specific characteristics to predict the probability of failure at any given time, and then applied survival statistics to estimate the remaining service life of the water main from the predicted failure probability. The study utilized two machine learning models, a random forest model and a random survival forest model, and additionally incorporated the Weibull proportional hazards survival model to compare their ability to accurately predict the remaining useful life of water mains. The results showed that the RSF model achieved superior performance (C-index = 0.880) compared to the Weibull proportional hazards model (C-index = 0.734) and the random forest model (C-index = 0.807), indicating the potential of machine learning for predicting the remaining service life of water mains.
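The C-index used in this comparison measures how often a model's predicted risk ordering matches the observed failure ordering, with 1.0 being perfect and 0.5 random. A minimal sketch of computing it for an RSF with scikit-survival, on invented pipe-like data rather than the study's dataset:

```python
import numpy as np
from sksurv.ensemble import RandomSurvivalForest
from sksurv.metrics import concordance_index_censored
from sksurv.util import Surv

rng = np.random.default_rng(1)

# Invented pipe records: two stand-in covariates (e.g., age, diameter)
# and right-censored times to failure.
X = rng.normal(size=(300, 2))
time = rng.exponential(scale=40.0, size=300) + 1.0
event = rng.random(300) < 0.6
y = Surv.from_arrays(event=event, time=time)

rsf = RandomSurvivalForest(n_estimators=100,
                           random_state=1).fit(X[:200], y[:200])

# Higher predicted risk should pair with earlier failure; the C-index
# measures how often that ordering holds on held-out pipes.
risk = rsf.predict(X[200:])
cindex = concordance_index_censored(event[200:], time[200:], risk)[0]
print(f"C-index = {cindex:.3f}")
```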
The literature reviewed herein indicates that many methods for predicting remaining useful life either solely employ survival analysis [16][18][23] or only use machine learning techniques [10][11][20]. However, combining both approaches can be beneficial, as demonstrated by the papers [12][19], which use a combination of survival analysis and machine learning to make their predictions. Furthermore, most of the papers using survival analysis rely on the Cox model for feature evaluation [18][19][23][24]. By relying solely on the Cox model for feature selection, important non-linear or time-dependent relationships between predictor variables and survival time may be missed or obscured, leading to a potentially incomplete or inaccurate understanding of the underlying data. Similarly, in the case of [12], using only the feature ranking from an RSF model provides no information about the direction or magnitude of the relationship between predictor variables and survival time. By going beyond the standard Cox and RSF feature ranking/selection methods, a more comprehensive and accurate understanding of the data can be achieved, along with increased confidence in and validation of the results, while mitigating some of the limitations and biases of each individual method.
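A sketch of this combined view, pairing Cox hazard ratios (which give direction and magnitude under a proportional-hazards assumption) with RSF permutation importance (which is sensitive to non-linear and interaction effects), is shown below on invented data; it illustrates the complementarity argued for here, not the exact procedure of the underlying paper.

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter
from sklearn.inspection import permutation_importance
from sksurv.ensemble import RandomSurvivalForest
from sksurv.util import Surv

rng = np.random.default_rng(2)

# Invented machine data with three candidate predictors.
X = pd.DataFrame(rng.normal(size=(250, 3)),
                 columns=["temp", "torque", "wear"])
time = rng.exponential(scale=300.0, size=250) + 10.0
event = rng.random(250) < 0.7

# View 1: Cox hazard ratios give the direction and magnitude of each
# (linear, proportional-hazards) covariate effect.
cox_df = X.assign(time=time, event=event.astype(int))
CoxPHFitter().fit(cox_df, duration_col="time",
                  event_col="event").print_summary()

# View 2: RSF permutation importance (scored via the C-index) captures
# non-linear and interaction effects the Cox model may miss.
y = Surv.from_arrays(event=event, time=time)
rsf = RandomSurvivalForest(n_estimators=100, random_state=2).fit(X, y)
imp = permutation_importance(rsf, X, y, n_repeats=10, random_state=2)
for name, mean in zip(X.columns, imp.importances_mean):
    print(f"{name}: {mean:.4f}")
```

Agreement between the two rankings adds confidence in a predictor; disagreement flags exactly the non-linear or time-dependent behavior that a Cox-only analysis would obscure.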
This entry is adapted from the peer-reviewed paper 10.3390/fi15050153