1. Introduction
Cloud computing has emerged as the fifth utility over the last decade and is a backbone of the modern economy
[1]. It is a model of computing that allows flexible use of virtual servers, massive scalability, and management services for the delivery of information services. With the low-cost pay-per-use model of on-demand computing
[2], the cloud has grown massively over the years, both in terms of size and complexity.
Today, almost everyone is connected to the cloud in one way or another, largely because of the cost effectiveness of pay-as-you-go and subscription-based service models for on-demand access to IT resources
[1][2]. Industries rely on the cloud for their operations, academics use it to conduct and accelerate scientific experiments, and ordinary end-users consume cloud-based services such as Google Drive, Gmail, and Outlook, knowingly or unknowingly. Furthermore, the cloud is more important today than ever, as it supports smart city construction
[3], enterprise business
[4], scalable data analysis
[5][6], healthcare
[7][8] and also new evolving computing paradigms, such as fog and edge computing
[9].
To date, despite significant improvements in the performance of the hardware elements of cloud infrastructure, the failure rate remains substantial. Moreover, the cloud is not as reliable as cloud service providers such as Amazon AWS and Alibaba Cloud claim, with advertised availability of more than 99.9%
[10]. For example, multiple failures have been reported, such as the failure of Amazon’s cloud data servers in early October 2012, which brought down Reddit, Airbnb, and Flipboard, the Amazon AWS S3 outage on 28 February 2017, and the Microsoft cloud services outage on 22 March 2017
[10]. Such failures show that cloud service providers are not as reliable as they claim
[10][11].
Public cloud vendor revenue is forecast to reach around USD 500 billion by 2026
[12]. The majority of this revenue is attributed to platform-as-a-service (PaaS) and infrastructure-as-a-service (IaaS), at USD 298.4 billion and USD 126 billion, respectively. Any cloud failure therefore impacts the cloud-based environments and services it supports, its users, and the economy. As a result, maintaining reliability is essential, and failure prediction is one of the mechanisms for achieving it.
AI has the ability to learn patterns and make predictions about the future accordingly. AI can be described as a machine exhibiting human intelligence
[13] and is utilised in diverse domains, such as healthcare, autonomous systems, monitoring applications, and predictive maintenance, because it enables the solution of problems that previously seemed unsolvable by computational processes alone
[14]. The tremendous advancement in AI today has resulted in state-of-the-art performance for many practical problems, especially in areas involving high-dimensional unstructured data, such as computer vision, speech, and natural language processing
[15].
2. Server-Level Failure Prediction
Mohammed et al.
[16], Xu et al.
[17], Lai et al.
[18], Das et al.
[19], Chigurupati et al.
[20], Tehrani et al.
[21], and Adamu et al.
[22] carried out studies on server-level failure prediction. The research by Mohammed et al.
[16] focused on the prediction of containerised high-performance computing (HPC) system failures using failure information categorised as hardware, software, network, undetermined, and human error. Support vector machine (SVM), random forest (RF), k-nearest neighbours (KNN), classification and regression trees (CART), and linear discriminant analysis (LDA) were used in the study. However, based on categories such as human error, the approach cannot determine whether the system actually failed or whether there was human intervention, and the scope of the undetermined error category is unclear. Unlike Mohammed et al.
[16], Xu et al.
[17] used a ranking-based machine learning approach and SMART hard drive information for failure prediction in cloud systems, improving the service availability of Microsoft Azure by migrating virtual machines (VMs) from failing to healthy nodes.
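None of these pipelines is reproduced in this entry; as a minimal, hypothetical sketch of the ranking idea, the snippet below trains a gradient-boosted classifier on a few placeholder SMART-style attributes and ranks nodes by predicted failure probability, so that VMs on the highest-risk nodes could be migrated first. The column names, the binary failed label, and the use of scikit-learn are assumptions for illustration, not details from the cited studies.

```python
# Hypothetical sketch of a ranking-style failure predictor: score each node's
# failure risk from SMART-style attributes and rank the riskiest nodes first.
# Feature/label names are placeholders, not a production schema.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

SMART_FEATURES = ["reallocated_sectors", "seek_error_rate",
                  "power_on_hours", "temperature"]          # assumed columns

def rank_nodes_by_risk(history: pd.DataFrame, current: pd.DataFrame,
                       top_k: int = 10) -> pd.DataFrame:
    """Fit on labelled history (binary 'failed' column) and rank current nodes."""
    model = GradientBoostingClassifier(random_state=0)
    model.fit(history[SMART_FEATURES], history["failed"])

    ranked = current.copy()
    ranked["failure_risk"] = model.predict_proba(current[SMART_FEATURES])[:, 1]
    # Nodes at the top of this ranking would be candidates for VM migration.
    return ranked.sort_values("failure_risk", ascending=False).head(top_k)
```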
Similar to Xu et al.
[17], Das et al.
[19] also focused on migrating computation from a failing node to a healthy node. However, Das et al.
[19] used a deep learning approach (i.e., long short-term memory (LSTM)), compared to Xu et al.
[17], who used a ranking-based approach. On the other hand, Lai et al.
[18] used techniques such as KNN and hard drive data from the SLAC National Accelerator Laboratory
[23] to predict server failure within 60 days and introduced a derived metric
time_since_prev_failure for server failure prediction. Furthermore, the study by Lai et al.
[18] made use of failure logs collected over a period of 10 years. Based on their experience, Lai et al.
[18] also recommended using a recurrent neural network (RNN)-based technique, such as LSTM.
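The exact features of Lai et al. [18] are not reproduced here; the hypothetical sketch below only illustrates how a derived metric such as time_since_prev_failure could be computed from a per-server failure log and fed to a KNN classifier. The column names, the assumed 60-day label will_fail_in_60d, and the pandas/scikit-learn usage are illustrative choices.

```python
# Illustrative sketch (not the authors' code): derive time_since_prev_failure
# from a per-server failure log and train a KNN classifier on it.
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

def add_time_since_prev_failure(log: pd.DataFrame) -> pd.DataFrame:
    """Expects columns: server_id, timestamp (datetime), will_fail_in_60d (assumed label)."""
    log = log.sort_values(["server_id", "timestamp"])
    # Days since the same server's previous recorded failure event.
    log["time_since_prev_failure"] = (
        log.groupby("server_id")["timestamp"].diff().dt.total_seconds() / 86400.0
    )
    # The first event per server has no predecessor, so drop those rows.
    return log.dropna(subset=["time_since_prev_failure"])

def train_knn(log: pd.DataFrame) -> KNeighborsClassifier:
    features = log[["time_since_prev_failure"]]   # real studies add SMART counters etc.
    return KNeighborsClassifier(n_neighbors=5).fit(features, log["will_fail_in_60d"])
```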
Similarly, Chigurupati et al.
[20], Tehrani et al.
[21], and Adamu et al.
[22] used techniques such as SVM for failure prediction. While the study by Chigurupati et al.
[20] focused on predicting communication hardware failure 5 min ahead, the study by Tehrani et al.
[21] focused on failure prediction in cloud systems in a simulated environment, using system metrics such as temperature, CPU, RAM, and bandwidth utilisation. Adamu et al.
[22], like other previous studies, focused on failure prediction in a cloud environment using data from the National Energy Research Scientific Computing Center’s
[24] Computer Failure Data Repository. The authors separated failures into disk, dual in-line memory module (DIMM), CPU, and other component failures. However, the scope of the 'other' failure category is unclear, and network information, which is another source of failure, was not used.
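As a rough illustration of the SVM-based approach these three studies share, the sketch below fits an RBF-kernel SVM to placeholder system metrics (temperature, CPU, RAM, and bandwidth utilisation) with feature scaling and cross-validation; the column names and the failure label are assumptions, not the datasets used in the cited work.

```python
# Minimal sketch of the shared SVM-style approach: an RBF-kernel SVM over
# placeholder system metrics, with scaling and cross-validated accuracy.
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

METRICS = ["temperature", "cpu_util", "ram_util", "bandwidth_util"]  # assumed columns

def evaluate_svm(samples: pd.DataFrame) -> float:
    """Return mean 5-fold accuracy of a failure/no-failure SVM classifier."""
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
    scores = cross_val_score(model, samples[METRICS], samples["failure"], cv=5)
    return scores.mean()
```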
3. VM-Level Failure Prediction
A study on VM failure prediction was carried out by Meenakumari et al.
[25], Alkasem et al.
[26], Qasem et al.
[27], Liu et al.
[28] and Rawat et al.
[29]. The study by Meenakumari et al.
[25] employed a dynamic thresholding approach to predict failure based on system metrics such as CPU utilisation, CPU usage, bandwidth, temperature, and memory. Similar to Meenakumari et al.
[25], Alkasem et al.
[26] also focused on VM failure prediction. The study by Alkasem et al.
[26] focused on the VM startup failure problem by using system metrics such as CPU utilisation, memory usage, network overhead, and IO (input/output) storage usage. Alkasem et al.
[26] used Apache Spark
[30] streaming together with Naïve Bayes (NB). Both Qasem et al.
[27] and Liu et al.
[28] investigated VM failure using RNNs. However, Qasem et al.
[27] used simulated data from CloudSim
[31], whereas Liu et al. [28] used SMART hard drive metrics. Similar to Qasem et al.
[27], Rawat et al. [29] conducted a VM failure prediction study using simulated data. However, unlike Qasem et al.
[27], Rawat et al.
[29] focused on using an autoregressive integrated moving average (ARIMA) model and the Box–Jenkins method. Saxena et al.
[11] proposed an online model for VM failure prediction and tolerance. The study focused on failure prediction based on resource capacity utilisation and classified virtual machines into failure-prone and normal VMs according to their failure tolerance units. Following the classification, failure-prone VMs were replicated into new VM instances hosted on other physical machines.
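The cited papers do not spell out a single thresholding rule here; as a minimal sketch of the dynamic-thresholding idea described by Meenakumari et al. [25], the snippet below flags samples of a VM metric that drift more than k standard deviations above a rolling mean. The window length and k are illustrative parameters.

```python
# Hypothetical dynamic-threshold monitor: flag samples of a VM metric that
# exceed a rolling mean by k standard deviations of recent behaviour.
import pandas as pd

def dynamic_threshold_alerts(metric: pd.Series, window: int = 60,
                             k: float = 3.0) -> pd.Series:
    """metric: time-indexed series (e.g., one VM's CPU utilisation samples)."""
    rolling_mean = metric.rolling(window, min_periods=window).mean()
    rolling_std = metric.rolling(window, min_periods=window).std()
    upper = rolling_mean + k * rolling_std   # threshold adapts to recent history
    return metric > upper                    # True where a sample breaches it
```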
4. Task-Level Failure Prediction
Shetty et al.
[32], Jassas et al.
[33], Bala et al.
[34], Rosa et al.
[35], Gao et al.
[36], and Marahatta et al.
[37] conducted studies on task (or job) failure prediction. The majority of these studies, such as Refs.
[32][33][35][36], made use of the Google cluster trace dataset, while others, such as Ref.
[34], used simulated data from simulators such as WorkflowSim
[38]. Shetty et al.
[32] focused on statistical resource usage analysis as well as failure prediction using XGBoost, whereas Jassas et al.
[33] focused on failure analysis to identify correlations between failures and the requested resources. Bala et al.
[34] focused on task failure prediction for scientific workflow applications, employing techniques such as NB, RF, logistic regression (LR), and artificial neural networks (ANN).
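As a hypothetical illustration of XGBoost-based task failure prediction of the kind described above, the sketch below trains a binary classifier on trace-style task features; the feature names (requested CPU and memory, priority, resubmission count) and the failed label are placeholders rather than the actual Google cluster trace schema.

```python
# Hypothetical XGBoost task-failure classifier on trace-style features; the
# columns below stand in for requested resources and scheduling attributes.
import pandas as pd
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

FEATURES = ["cpu_request", "memory_request", "priority", "resubmissions"]

def train_task_failure_model(tasks: pd.DataFrame) -> XGBClassifier:
    """tasks needs the FEATURES columns plus a binary 'failed' termination label."""
    X_train, X_test, y_train, y_test = train_test_split(
        tasks[FEATURES], tasks["failed"],
        test_size=0.2, random_state=42, stratify=tasks["failed"],
    )
    model = XGBClassifier(n_estimators=200, max_depth=6,
                          learning_rate=0.1, eval_metric="logloss")
    model.fit(X_train, y_train)
    print(classification_report(y_test, model.predict(X_test)))
    return model
```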
Similar to the studies of Shetty et al.
[32] and Jassas et al.
[33], the studies of Rosa et al.
[35] and Gao et al.
[36] also used the Google cluster trace dataset. Rosa et al.
[35] likewise focused on job failure prediction but, unlike the other studies, characterised failures to identify the key features contributing to them, employing techniques such as LDA, quadratic discriminant analysis (QDA), and LR. To improve task failure prediction further, Gao et al.
[36] proposed a multi-layer bidirectional long short-term memory (Bi-LSTM) model, achieving an accuracy of up to 93%. Marahatta et al.
[37], on the other hand, focused on energy consumption in addition to task failure prediction (i.e., energy-aware task failure prediction). Marahatta et al.
[37] used deep neural networks to classify tasks as failure-prone or not in the first stage and then scheduled them accordingly in the second stage.
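For readers unfamiliar with the architecture, the sketch below shows a multi-layer bidirectional LSTM classifier over per-task sequences of resource-usage measurements, in the spirit of the Bi-LSTM approach of Gao et al. [36]; the layer sizes, dropout, and input dimensions are illustrative assumptions rather than the published configuration.

```python
# Sketch of a multi-layer bidirectional LSTM classifier over per-task sequences
# of resource-usage measurements; all sizes are illustrative, not from the paper.
import torch
import torch.nn as nn

class BiLSTMFailurePredictor(nn.Module):
    def __init__(self, n_features: int, hidden: int = 64, layers: int = 2):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_features, hidden_size=hidden,
                            num_layers=layers, batch_first=True,
                            bidirectional=True, dropout=0.2)
        self.head = nn.Linear(2 * hidden, 2)   # 2 * hidden: forward + backward states

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time_steps, n_features) sequence of task metrics
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :])        # classify from the final time step

# Example shapes: 32 tasks, 20 time steps, 6 metrics per step -> (32, 2) logits.
logits = BiLSTMFailurePredictor(n_features=6)(torch.randn(32, 20, 6))
```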
This entry is adapted from the peer-reviewed paper 10.3390/bdcc6010026