Cloud Computing Failure Prediction

Version	Summary	Created by	Modification	Content Size	Created at
1		Chinmaya Kumar Dehury	--	1490	2022-04-01 11:59:00
2	format correct	Conner Chen	-54 word(s)	1436	2022-04-02 05:18:40
3	format correct	Conner Chen	+ 1390 word(s)	1436	2022-04-02 05:19:06

This entry is adapted from the peer-reviewed paper 10.3390/bdcc6010026

Despite significant improvements in the performance of the hardware elements of the cloud infrastructure, the failure rate remains substantial. Moreover, the cloud is less reliable than cloud service providers such as Amazon AWS and Ali Cloud claim, namely more than 99.9%. Multiple failures have been reported: the failure of Amazon’s cloud data servers in early October 2012, which brought down Reddit, Airbnb, and Flipboard; the outage of Amazon AWS S3 on 28 February 2017; and the crash of Microsoft cloud services on 22 March 2017. Such failures show that cloud service providers are not as reliable as they claim.

Keywords: failure prediction; cloud computing; fault tolerance; artificial intelligence; reliability

1. Introduction

Cloud computing has emerged as the fifth utility over the last decade and is a backbone of the modern economy ^[1]. It is a model of computing that allows flexible use of virtual servers, massive scalability, and management services for the delivery of information services. With the low-cost pay-per-use model of on-demand computing ^[2], the cloud has grown massively over the years in both size and complexity.

Today, almost everyone is connected to the cloud in one way or another, largely because pay-as-you-go and subscription-based service models make on-demand access to IT resources cost-effective ^[1]^[2]. Industries rely on the cloud for their operations, academics use it to conduct and accelerate scientific experiments, and ordinary end-users, knowingly or unknowingly, use cloud-based services such as Google Drive, Gmail, and Outlook. Furthermore, the cloud is more important today than ever, as it supports smart city construction ^[3], enterprise business ^[4], scalable data analysis ^[5]^[6], healthcare ^[7]^[8], and newly evolving computing paradigms such as fog and edge computing ^[9].

Despite significant improvements in the performance of the hardware elements of the cloud infrastructure, the failure rate remains substantial. Moreover, the cloud is less reliable than cloud service providers such as Amazon AWS and Ali Cloud claim, namely more than 99.9% ^[10]. Multiple failures have been reported: the failure of Amazon’s cloud data servers in early October 2012, which brought down Reddit, Airbnb, and Flipboard; the outage of Amazon AWS S3 on 28 February 2017; and the crash of Microsoft cloud services on 22 March 2017 ^[10]. Such failures show that cloud service providers are not as reliable as they claim ^[10]^[11].
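To make the 99.9% figure concrete, it can be converted into a downtime budget. The following is a minimal sketch that reads the figure as an availability target over a 365-day year (both are assumptions made here for illustration; the function name is hypothetical and not taken from the cited work).

```python
def allowed_downtime_hours(availability_percent, period_hours=365 * 24):
    """Hours of downtime permitted by an availability target.

    Assumption: the percentage is an availability target measured
    over `period_hours` (default: one 365-day year).
    """
    return (1 - availability_percent / 100) * period_hours


if __name__ == "__main__":
    # A 99.9% target leaves a budget of under nine hours per year.
    print(round(allowed_downtime_hours(99.9), 2))  # 8.76
```

Under these assumptions, a single multi-hour outage can consume most of a provider's yearly budget.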

Public cloud vendor revenue is forecast to reach around 500 billion by 2026 ^[12]. Most of this revenue goes to platform-as-a-service (PaaS) and infrastructure-as-a-service (IaaS), at 298.4 and 126 billion, respectively. Any cloud failure therefore affects the cloud-based environment, the services it supports, its users, and the economy. As a result, maintaining reliability is essential, and failure prediction is one of the mechanisms for achieving it.

Artificial intelligence (AI) can learn patterns and make predictions accordingly. AI can be manifested as a machine exhibiting human intelligence ^[13] and is utilised in diverse domains, such as healthcare, autonomous systems, monitoring applications, and predictive maintenance, because it makes it possible to solve problems that previously seemed unsolvable by computational processes alone ^[14]. The tremendous advances in AI have resulted in state-of-the-art performance on many practical problems, especially in areas involving high-dimensional unstructured data, such as computer vision, speech, and natural language processing ^[15].

2. Server-Level Failure Prediction

Mohammed et al. ^[16], Xu et al. ^[17], Lai et al. ^[18], Das et al. ^[19], Chigurupati et al. ^[20], Tehrani et al. ^[21], and Adamu et al. ^[22] studied server-level failure prediction. Mohammed et al. ^[16] focused on predicting failures of containerised high-performance computing (HPC) systems using failure information categorised as hardware, software, network, undetermined, and human error. The study used support vector machine (SVM), random forest (RF), k-nearest neighbours (KNN), classification and regression trees (CART), and linear discriminant analysis (LDA). However, from information such as human errors, it could not tell whether the system itself had failed or whether there had been human intervention. Furthermore, the scope of the undetermined error category is unclear. Unlike Mohammed et al. ^[16], Xu et al. ^[17] used a ranking-based machine learning approach and SMART hard drive information to predict failures in cloud systems, improving the service availability of Microsoft Azure by migrating virtual machines (VMs) from failing to healthy nodes.
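The idea of ranking nodes and migrating VMs away from the riskiest ones can be illustrated with a short sketch. It does not reproduce the model of Xu et al.: the risk scores are assumed to come from some already trained predictor, and the function name, the `top_r` parameter, and the round-robin placement are hypothetical choices made for illustration.

```python
def plan_migrations(node_risk, vms_on_node, top_r=1):
    """Plan VM moves from the top_r riskiest nodes to the safest nodes.

    node_risk:   dict of node name -> predicted failure risk (assumed input)
    vms_on_node: dict of node name -> list of VM names hosted there
    Returns a list of (vm, source_node, target_node) tuples.
    """
    # Rank nodes from most to least failure-prone.
    ranked = sorted(node_risk, key=node_risk.get, reverse=True)
    failing = ranked[:top_r]
    healthy = ranked[top_r:][::-1]  # safest node first
    if not healthy:
        return []

    plan = []
    slot = 0
    for node in failing:
        for vm in vms_on_node.get(node, []):
            # Spread the evacuated VMs over healthy nodes in round-robin order.
            plan.append((vm, node, healthy[slot % len(healthy)]))
            slot += 1
    return plan


if __name__ == "__main__":
    risk = {"n1": 0.91, "n2": 0.10, "n3": 0.35}
    vms = {"n1": ["vm-a", "vm-b"], "n2": ["vm-c"]}
    for move in plan_migrations(risk, vms, top_r=1):
        print(move)
```

A production scheduler would also check the capacity of the target nodes, which this sketch deliberately omits.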

Similar to Xu et al. ^[17], Das et al. ^[19] also focused on migrating computation from a failing node to a healthy node. However, Das et al. ^[19] used a deep learning approach (i.e., long short-term memory, LSTM), whereas Xu et al. ^[17] used a ranking-based approach. Lai et al. ^[18], on the other hand, used techniques such as KNN and hard drive data from the SLAC Accelerator Laboratory ^[23] to predict server failure within 60 days, and introduced a derived metric, time_since_prev_failure, for server failure prediction. The study made use of failure logs kept over a period of 10 years. Based on their experience, Lai et al. ^[18] also recommended using a recurrent neural network (RNN)-based technique, such as LSTM.
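A derived metric such as time_since_prev_failure can be computed directly from a failure log. The sketch below is one plausible construction, not the procedure of Lai et al.: the row layout, the function name, and the use of whole days are assumptions, and the 60-day horizon is taken from the description above.

```python
from datetime import datetime, timedelta


def build_rows(failure_times, observation_times, horizon_days=60):
    """Derive one feature row per observation time for a single server.

    time_since_prev_failure: whole days since the latest failure at or
    before the observation (None if the server has not failed yet).
    fails_within_horizon: True if a failure occurs in the next horizon_days.
    """
    failures = sorted(failure_times)
    horizon = timedelta(days=horizon_days)
    rows = []
    for observed_at in observation_times:
        previous = [f for f in failures if f <= observed_at]
        upcoming = [f for f in failures if observed_at < f <= observed_at + horizon]
        rows.append({
            "observed_at": observed_at,
            "time_since_prev_failure": (observed_at - previous[-1]).days if previous else None,
            "fails_within_horizon": bool(upcoming),
        })
    return rows


if __name__ == "__main__":
    log = [datetime(2015, 1, 10), datetime(2015, 3, 1)]
    checks = [datetime(2015, 1, 1), datetime(2015, 1, 20), datetime(2015, 4, 1)]
    for row in build_rows(log, checks):
        print(row)
```

Rows of this kind could then be fed to a classifier such as KNN, with fails_within_horizon as the label.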

Similarly, Chigurupati et al. ^[20], Tehrani et al. ^[21], and Adamu et al. ^[22] used techniques such as SVM for failure prediction. Chigurupati et al. ^[20] focused on predicting communication hardware failure 5 min ahead, whereas Tehrani et al. ^[21] focused on failure prediction in cloud systems in a simulated environment, using system metrics such as temperature and CPU, RAM, and bandwidth utilisation. Adamu et al. ^[22], like the previous studies, focused on failure prediction in a cloud environment, using data from the National Energy Research Scientific Computing Center’s ^[24] Computer Failure Data Repository. The authors distinguished failures of disks, dual in-line memory modules (DIMMs), CPUs, and other components. However, the scope of categories such as other failures is unclear, and network information, which is another cause of failure, was not used.

3. VM-Level Failure Prediction

VM failure prediction was studied by Meenakumari et al. ^[25], Alkasem et al. ^[26], Qasem et al. ^[27], Liu et al. ^[28], and Rawat et al. ^[29]. Meenakumari et al. ^[25] employed a dynamic thresholding approach to predict failure based on system metrics such as CPU utilisation, CPU usage, bandwidth, temperature, and memory. Alkasem et al. ^[26] focused on the VM startup failure problem, using system metrics such as CPU utilisation, memory usage, network overhead, and IO (input/output) storage usage together with Apache Spark ^[30] streaming and Naïve Bayes (NB). Both Qasem et al. ^[27] and Liu et al. ^[28] investigated VM failure using recurrent neural networks (RNN). However, Qasem et al. ^[27] used simulated data from Cloudsim ^[31], whereas Liu et al. ^[28] used SMART hard drive system metrics. Like Qasem et al. ^[27], Rawat et al. ^[29] conducted a VM failure prediction study using simulated data but, unlike them, used an autoregressive integrated moving average (ARIMA) model and the Box–Jenkins method. Saxena et al. ^[11] proposed an online model for VM failure prediction and tolerance. The study focused on resource capacity utilisation-based failure prediction and classified virtual machines into failure-prone and normal virtual machines based on their failure tolerance units. Following the classification, each failure-prone VM was replicated into a new VM instance to be hosted on another physical machine.
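Dynamic thresholding, as mentioned above for system metrics, can be sketched generically: instead of a fixed limit, the threshold follows the recent behaviour of the metric. The rule used here (rolling mean plus k standard deviations), the window size, and the function name are assumptions for illustration and are not taken from the cited study.

```python
from collections import deque
from statistics import mean, pstdev


def dynamic_threshold_alerts(samples, window=5, k=3.0):
    """Return indices of samples that exceed a moving threshold.

    The threshold for each sample is mean + k * standard deviation of
    the `window` samples that precede it. No alert is raised until a
    full window of history is available.
    """
    history = deque(maxlen=window)
    alerts = []
    for index, value in enumerate(samples):
        if len(history) == window:
            threshold = mean(history) + k * pstdev(history)
            if value > threshold:
                alerts.append(index)
        history.append(value)  # the oldest sample drops out automatically
    return alerts


if __name__ == "__main__":
    cpu_utilisation = [40, 42, 41, 39, 40, 41, 95, 42]
    print(dynamic_threshold_alerts(cpu_utilisation))  # [6]
```

Because the spike itself enters the window, the threshold rises for the following samples, which is one reason real systems often exclude flagged values from the history.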

4. Task-Level Failure Prediction

Shetty et al. ^[32], Jassas et al. ^[33], Bala et al. ^[34], Rosa et al. ^[35], Gao et al. ^[36], and Marahatta et al. ^[37] studied task (or job) failure prediction. Most of these studies, such as Refs. ^[32]^[33]^[35]^[36], used the Google cluster trace dataset, while others, such as Ref. ^[34], used simulated data from simulators such as WorkflowSim ^[38]. Shetty et al. ^[32] focused on statistical resource usage analysis as well as failure prediction using XGBoost, whereas Jassas et al. ^[33] focused on failure analysis to identify a correlation between failure and the requested resources. Bala et al. ^[34] focused on task failure prediction for scientific workflow applications, employing techniques such as NB, random forest, logistic regression (LR), and artificial neural networks (ANN).
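As an illustration of how a classifier such as Naïve Bayes can relate requested resources to task outcome, the following is a minimal Gaussian Naïve Bayes sketch in plain Python. The feature layout (requested CPU, requested memory), the labels, and the toy data are invented for illustration; they are not drawn from the Google cluster trace or from any of the cited studies.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(rows, labels):
    """Estimate a prior and per-feature mean and variance for each class."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    model = {}
    for label, members in grouped.items():
        columns = list(zip(*members))
        means = [sum(col) / len(col) for col in columns]
        # A small constant keeps the variance strictly positive.
        variances = [
            sum((x - m) ** 2 for x in col) / len(col) + 1e-9
            for col, m in zip(columns, means)
        ]
        model[label] = (len(members) / len(rows), means, variances)
    return model


def predict(model, row):
    """Return the class with the highest log posterior for one row."""
    def log_posterior(label):
        prior, means, variances = model[label]
        score = math.log(prior)
        for x, m, v in zip(row, means, variances):
            score += -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        return score
    return max(model, key=log_posterior)


if __name__ == "__main__":
    # Hypothetical rows: (requested CPU fraction, requested memory fraction).
    tasks = [(0.10, 0.20), (0.20, 0.10), (0.15, 0.25), (0.10, 0.15),
             (0.80, 0.90), (0.90, 0.70), (0.85, 0.80), (0.70, 0.90)]
    outcomes = ["finished"] * 4 + ["failed"] * 4
    nb = fit_gaussian_nb(tasks, outcomes)
    print(predict(nb, (0.12, 0.20)), predict(nb, (0.80, 0.85)))
```

On real traces the classes are far less cleanly separated, so evaluation on held-out data would be needed before drawing any conclusion.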

Like Shetty et al. ^[32] and Jassas et al. ^[33], Rosa et al. ^[35] and Gao et al. ^[36] used the Google cluster trace dataset. Rosa et al. ^[35] also focused on job failure prediction but, unlike the other studies, characterised failures to identify the key features contributing to them, employing techniques such as LDA, quadratic discriminant analysis (QDA), and LR. To improve task failure prediction further, Gao et al. ^[36] proposed a multi-layer bidirectional long short-term memory (Bi-LSTM) model, achieving an accuracy of up to 93%. Marahatta et al. ^[37], on the other hand, focused on energy consumption in addition to task failure prediction (i.e., energy-aware task failure prediction). Marahatta et al. ^[37] used deep neural networks to classify tasks (i.e., whether they are prone to failure or not) in the first stage and then scheduled them in the second stage.

References

Buyya, R.; Srirama, S.N.; Casale, G.; Calheiros, R.; Simmhan, Y.; Varghese, B.; Gelenbe, E.; Javadi, B.; Vaquero, L.M.; Netto, M.A.; et al. A manifesto for future generation cloud computing: Research directions for the next decade. ACM Comput. Surv. (CSUR) 2018, 51, 1–38.
Sahoo, P.K.; Dehury, C.K.; Veeravalli, B. LVRM: On the Design of Efficient Link Based Virtual Resource Management Algorithm for Cloud Platforms. IEEE Trans. Parallel Distrib. Syst. 2018, 29, 887–900.
Jiang, D. The construction of smart city information system based on the Internet of Things and cloud computing. Comput. Commun. 2020, 150, 158–166.
Saini, H.; Upadhyaya, A.; Khandelwal, M.K. Benefits of Cloud Computing for Business Enterprises: A Review. In Proceedings of the International Conference on Advancements in Computing & Management (ICACM), Jaipur, India, 13–14 April 2019.
Varadarajan, V.; Neelanarayanan, V.; Doyle, R.; Al-Shaikhli, I.F.; Groppe, S. Emerging Solutions in Big Data and Cloud Technologies for Mobile Networks. Mob. Netw. Appl. 2019, 24, 1015–1017.
Langmead, B.; Nellore, A. Cloud computing for genomic data analysis and collaboration. Nat. Rev. Genet. 2018, 19, 208.
Sahoo, P.K.; Dehury, C.K. Efficient data and CPU-intensive job scheduling algorithms for healthcare cloud. Comput. Electr. Eng. 2018, 68, 119–139.
Liu, Y.; Zhang, L.; Yang, Y.; Zhou, L.; Ren, L.; Wang, F.; Liu, R.; Pang, Z.; Deen, M.J. A novel cloud-based framework for the elderly healthcare services using digital twin. IEEE Access 2019, 7, 49088–49101.
Byers, C.; Zahavi, R.; Zao, J.K. The Edge Computing Advantage. 2019. Available online: https://www.iiconsortium.org/pdf/IIC_Edge_Computing_Advantages_White_Paper_2019-10-24.pdf (accessed on 25 December 2020).
Luo, L.; Meng, S.; Qiu, X.; Dai, Y. Improving failure tolerance in large-scale cloud computing systems. IEEE Trans. Reliab. 2019, 68, 620–632.
Saxena, D.; Singh, A.K. OFP-TM: An online VM failure prediction and tolerance model towards high availability of cloud computing environments. J. Supercomput. 2022, 1–22.
Gracely, B. Wikibon Research Cloud Computing (2015-2025). Available online: https://wikibon.com/wp-content/uploads/Wikibon-BGracely-Cloud-Computing-Nov-20152.pdf (accessed on 11 October 2021).
Huang, M.H.; Rust, R.T. Artificial intelligence in service. J. Serv. Res. 2018, 21, 155–172.
Ropinski, T.; Archambault, D.; Chen, M.; Maciejewski, R.; Mueller, K.; Telea, A.; Wattenberg, M. How do Recent Machine Learning Advances Impact the Data Visualization Research Agenda? IEEE Vis Panel. Phoenix 2017. Available online: https://lahmesding.informatik.uni-ulm.de/api/uploads/25/vis17mlpanel.pdf (accessed on 2 January 2022).
Ramachandram, D.; Taylor, G.W. Deep Multimodal Learning: A Survey on Recent Advances and Trends. IEEE Signal Process. Mag. 2017, 34, 96–108.
Mohammed, B.; Awan, I.; Ugail, H.; Younas, M. Failure prediction using machine learning in a virtualised HPC system and application. Clust. Comput. 2019, 22, 471–485.
Xu, Y.; Sui, K.; Yao, R.; Zhang, H.; Lin, Q.; Dang, Y.; Li, P.; Jiang, K.; Zhang, W.; Lou, J.G.; et al. Improving service availability of cloud systems by predicting disk error. In Proceedings of the 2018 USENIX Annual Technical Conference (USENIX ATC 18), Boston, MA, USA, 11–13 July 2018; pp. 481–494.
Lai, B. Predicting Server Failures with Machine Learning; Technical Report; SLAC National Accelerator Lab.: Menlo Park, CA, USA, 2018.
Das, A.; Mueller, F.; Siegel, C.; Vishnu, A. Desh: Deep learning for system health prediction of lead times to failure in hpc. In Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing, Tempe, AZ, USA, 11–15 June 2018; pp. 40–51.
Chigurupati, A.; Thibaux, R.; Lassar, N. Predicting hardware failure using machine learning. In Proceedings of the 2016 Annual Reliability and Maintainability Symposium (RAMS), Tucson, AZ, USA, 25–28 January 2016; pp. 1–6.
Fadaei Tehrani, A.; Safi-Esfahani, F. A threshold sensitive failure prediction method using support vector machine. Multiagent Grid Syst. 2017, 13, 97–111.
Adamu, H.; Mohammed, B.; Maina, A.B.; Cullen, A.; Ugail, H.; Awan, I. An Approach to Failure Prediction in a Cloud Based Environment. In Proceedings of the 2017 IEEE 5th International Conference on Future Internet of Things and Cloud (FiCloud), Prague, Czech Republic, 21–23 August 2017; pp. 191–197.
SLAC Accelerator Laboratory. Available online: https://www6.slac.stanford.edu (accessed on 15 October 2021).
National Energy Research Scientific Computing Center (NERSC). Available online: https://www.nersc.gov (accessed on 15 October 2021).
Meenakumari, J. Virtual Machine (VM) Earlier Failure Prediction Algorithm. Int. J. Appl. Eng. Res. 2017, 12, 9285–9289.
Alkasem, A.; Liu, H.; Zuo, D.; Algarash, B. Cloud computing: A model construct of real-time monitoring for big dataset analytics using apache spark. In Journal of Physics: Conference Series; IOP Publishing: Bristol, UK, 2018; Volume 933, p. 012018.
Qasem, G.M.; Madhu, B. Proactive fault tolerance in cloud data centers for performance efficiency. Int. J. Pure Appl. Math. 2017, 117, 325–329.
Liu, D.; Wang, B.; Li, P.; Stones, R.J.; Marbach, T.G.; Wang, G.; Liu, X.; Li, Z. Predicting Hard Drive Failures for Cloud Storage Systems. In Algorithms and Architectures for Parallel Processing; Wen, S., Zomaya, A., Yang, L.T., Eds.; Springer International Publishing: Cham, Switzerland, 2020; pp. 373–388.
Rawat, A.; Sushil, R.; Agarwal, A.; Sikander, A. A New Approach for VM Failure Prediction using Stochastic Model in Cloud. IETE J. Res. 2021, 67, 165–172.
Apache Spark. Available online: https://spark.apache.org (accessed on 15 October 2021).
Cloudsim. Available online: http://www.cloudbus.org/cloudsim/ (accessed on 15 October 2021).
Shetty, J.; Sajjan, R.; Shobha, G. Task Resource Usage Analysis and Failure Prediction in Cloud. In Proceedings of the 2019 9th International Conference on Cloud Computing, Data Science & Engineering (Confluence), Noida, India, 10–11 January 2019; pp. 342–348.
Jassas, M.; Mahmoud, Q.H. Failure analysis and characterization of scheduling jobs in google cluster trace. In Proceedings of the IECON 2018—44th Annual Conference of the IEEE Industrial Electronics Society, Washington, DC, USA, 21–23 October 2018; pp. 3102–3107.
Bala, A.; Chana, I. Intelligent failure prediction models for scientific workflows. Expert Syst. Appl. 2015, 42, 980–989.
Rosa, A.; Chen, L.Y.; Binder, W. Predicting and mitigating jobs failures in big data clusters. In Proceedings of the 2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, Shenzhen, China, 4–7 May 2015; pp. 221–230.
Gao, J.; Wang, H.; Shen, H. Task Failure Prediction in Cloud Data Centers Using Deep Learning. In Proceedings of the 2019 IEEE International Conference on Big Data (Big Data), Los Angeles, CA, USA, 9–12 December 2019.
Marahatta, A.; Xin, Q.; Chi, C.; Zhang, F.; Liu, Z. PEFS: AI-driven prediction based energy-aware fault-tolerant scheduling scheme for cloud data center. IEEE Trans. Sustain. Comput. 2020, 6, 655–666.
WorkflowSim. Available online: https://github.com/WorkflowSim/WorkflowSim-1.0 (accessed on 16 October 2021).

© Text is available under the terms and conditions of the Creative Commons Attribution (CC BY) license; additional terms may apply.

Information

Subjects: Computer Science, Artificial Intelligence

Contributors: Chinmaya Kumar Dehury, Tek Raj Chhetri, Artjom Lind, Anna Fensel

Update Date: 06 Apr 2022
