Data Analysis Using Machine Learning for Cybersecurity

Version	Summary	Created by	Modification	Content Size	Created at
1		Allam Jaya Prakash	--	2614	2023-12-08 07:10:45
2	layout & references	Sirius Huang	Meta information modification	2614	2023-12-15 02:16:35

This entry is adapted from the peer-reviewed paper 10.3390/bdcc7040176

The internet has become an indispensable tool for organizations, permeating every facet of their operations. Virtually all companies leverage Internet services for diverse purposes, including the digital storage of data in databases and cloud platforms. Furthermore, the rising demand for software and applications has led to a widespread shift toward computer-based activities within the corporate landscape. However, this digital transformation has exposed the information technology (IT) infrastructures of these organizations to a heightened risk of cyber-attacks, endangering sensitive data. Consequently, organizations must identify and address vulnerabilities within their systems, with a primary focus on scrutinizing customer-facing websites and applications.

Keywords: analytics; cybersecurity; machine learning; Power BI; website

1. Introduction

Growing demand and competition have pushed most corporate operations online. Unlike conventional business practices, the internet offers the reach needed to serve a much larger clientele, and it has become the primary channel for carrying out many activities. Technologies such as cloud computing and SQL databases also allow large quantities of sensitive data to be stored in online repositories or in the cloud. However, the same universality and versatility make the internet an entry point for dangerous cyber-attacks. These attacks can compromise a corporation's sensitive data and cripple its entire information technology infrastructure, halting operations and, in turn, sharply reducing profits. It is therefore essential to develop a strong cybersecurity system that closes these loopholes and provides remedial measures. Cybersecurity is the practice of developing mechanisms to defend software and computerized systems against cyber assaults. Because operations in virtually every sector depend on digital systems, it has applications across all of them. It is especially important in the political, military, and healthcare sectors, where leaks of sensitive data carry severe risks.

Cyber-attacks come in various forms, such as SQL injection ^[1], malware, and cross-site scripting (XSS) ^[2]. They target different parts of a system, such as the database or the HTML code of a page, and manipulate values to gain unauthorized access to client data, which attackers then exploit for extortion and other malicious gains. Companies that fall victim to these attacks lose large amounts of data and money, so investment in cybersecurity is a necessity. To develop a cybersecurity strategy, it is essential first to obtain and aggregate data from sources such as passive scans and reports on a website’s cyber health. The next crucial step is to analyze these data and extract insights that clients and other departments can act on when formulating a new strategic plan. This is where data analytics comes into the picture. Data analytics ^[3] is the practice of collecting, managing, analyzing, visualizing, and presenting data. It has grown in popularity over the years because of the influx of large amounts of data and because it enables data-driven decisions with a higher chance of success. Applying these practices to cybersecurity data, such as antivirus scan results, URL features, and the presence or absence of security features like security headers and SSL certificates, makes it possible to plan an appropriate course of action. This is why data analytics is important in the cybersecurity domain.

Data science extends data analytics by using machine learning and deep learning algorithms to predict values, which can then be presented to prospective clients so that they can decide whether to invest in the product. Machine learning is used extensively in predictive analysis. It can predict continuous values, such as sales figures, that are not restricted to a 0 or 1 output; for these, linear and polynomial regression models are used. To predict categorical values of 0 or 1, classification algorithms such as logistic regression, k-nearest neighbors (KNN), Naïve Bayes, decision trees, and support vector machines are used. Deep learning frameworks such as Keras, TensorFlow, and PyTorch are mainly used to build neural networks, which are loosely modeled on the human brain, for tasks such as text detection and image classification. These are only a few examples and not an exhaustive list.
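
The difference between predicting a continuous value and predicting a 0-or-1 label can be illustrated with a minimal sketch. It uses only the Python standard library to stay self-contained; in practice a library such as scikit-learn would normally be used. The data, the feature meanings, and the helper names `fit_line` and `nearest_neighbor_label` are hypothetical and are not taken from the original study.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for a single feature; returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def nearest_neighbor_label(train, x):
    """1-nearest-neighbour classifier: copy the label of the closest training point."""
    closest = min(train, key=lambda pair: abs(pair[0] - x))
    return closest[1]


# Regression: hypothetical weekly counts of blocked requests (continuous target).
weeks = [1, 2, 3, 4]
blocked = [10.0, 20.0, 30.0, 40.0]
slope, intercept = fit_line(weeks, blocked)
forecast = slope * 5 + intercept  # extrapolate to week 5

# Classification: hypothetical (missing security headers, label) pairs,
# where label 1 means "unsafe" and 0 means "safe".
labelled = [(0, 0), (1, 0), (4, 1), (5, 1)]
label = nearest_neighbor_label(labelled, 3)
```

Here `forecast` is a real number that continues the fitted trend, while `label` can only be one of the two classes seen in training.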

To implement the cybersecurity system for protecting data, the tools used most extensively are Python 3.12 and Microsoft Power BI version 2.119.323.0. Power BI is used for visualizations and graphical representations of the data, whereas the machine learning part is developed in Python.

Microsoft Power BI is a data visualization tool commonly used by analysts to depict data graphically in reports and dashboards, which can then be used to convey insights to prospective clients. Its free desktop version allows users to build visualizations offline at no cost, and its Pro and Premium versions provide additional features such as Azure Machine Learning. Power BI is preferred mainly for its ease of use, which makes it a starting point for many analysts, and it also offers streamlined distribution and convenient handling of real-time data. In this work, a data set describing the cybersecurity characteristics of various subdomains is provided. Power BI is used to build a report that presents these parameters clearly, visualizes them, and supports some basic conclusions and insights.

Python is a dynamically typed, general-purpose programming language with applications in almost every area of computer science, and it has recently become very popular in data science. It is open source, so it is free to use and easy to install. In data science, Python is used mainly for data analysis and machine learning. The Python tasks in this work are: obtaining data sets from data engineers or via web scraping; removing redundant values and correcting faulty ones; conducting the analysis and obtaining the desired insights; and comparing machine learning algorithms to identify the one that most accurately predicts the likelihood of a cyber-attack.
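
The cleaning step mentioned above (removing redundant values and correcting faulty ones) might look like the following minimal sketch. The record layout, the field names `subdomain` and `has_ssl`, and the function `clean_records` are assumptions made for illustration; the actual data set is not described in that detail.

```python
def clean_records(records):
    """Drop blank or duplicate subdomains and normalise the SSL flag."""
    truthy = {"yes": True, "true": True, "1": True}
    falsy = {"no": False, "false": False, "0": False}
    seen = set()
    cleaned = []
    for rec in records:
        name = rec.get("subdomain", "").strip().lower()
        if not name or name in seen:
            continue  # skip empty names and repeated subdomains
        seen.add(name)
        raw = str(rec.get("has_ssl", "")).strip().lower()
        # Unrecognised values become None instead of being guessed.
        has_ssl = truthy.get(raw, falsy.get(raw))
        cleaned.append({"subdomain": name, "has_ssl": has_ssl})
    return cleaned


# Hypothetical raw scan output with a duplicate, a faulty value, and a blank row.
raw_scan = [
    {"subdomain": "Shop.example.com ", "has_ssl": "Yes"},
    {"subdomain": "shop.example.com", "has_ssl": "yes"},
    {"subdomain": "blog.example.com", "has_ssl": "n/a"},
    {"subdomain": "", "has_ssl": "no"},
]
cleaned = clean_records(raw_scan)
```

Marking unknown values as `None` rather than defaulting them to `False` keeps missing information visible to the later analysis.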

Table 1 serves as a valuable resource for researchers and practitioners in the field of intrusion detection, offering a quick reference for the key attributes and findings of these influential studies. These studies collectively contribute to the ongoing effort to enhance the security of computer systems and networks. Researchers have explored a range of techniques, from traditional approaches like K-nearest neighbors (K-NN) and support vector machines (SVM) to more modern methods like random forest (RF) and Genetic Algorithms. The choice of method often reflects a trade-off between factors such as accuracy, training time, and false alarm rates.

Table 1. Literature survey data analysis using machine learning for cybersecurity.

Literature	Year	Method	Database	Number of Classes	Remarks
Swathi et al. ^[4]	2017	K-NN	NSL-KDD	3	Precision and recall not reported by the authors
Verma et al. ^[5]	2018	K-NN and K-means	CIDDS-001	2	Method has low false alarm rate
Belouch et al. ^[6]	2018	SVM, RF, DT, NB	UNSW-NB15	2	More training time required
Krichen et al. ^[7]	2017	Logistic Regression with Genetic Algorithm	UNSW-NB15	3	Lower accuracy
Jabbar et al. ^[8]	2016	RF	NSL-KDD	3	Longer prediction time
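
The remarks in Table 1 refer to accuracy, precision, recall, and false alarm rate. As a minimal sketch, these quantities can be computed from true and predicted binary labels as shown here, where 1 denotes an attack and 0 denotes normal traffic. The function name `detection_metrics` and the example labels are hypothetical; none of the values come from the cited studies.

```python
def detection_metrics(y_true, y_pred):
    """Compute common intrusion-detection metrics from binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # attacks caught
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # attacks missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # normal, left alone
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        # Share of normal traffic wrongly flagged as an attack.
        "false_alarm_rate": fp / (fp + tn) if fp + tn else 0.0,
    }


# Hypothetical labels for eight connections.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = detection_metrics(y_true, y_pred)
```

Reporting all four values together matters because a detector can reach high accuracy on imbalanced traffic while still missing most attacks.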

2. Cybersecurity

Cybercrime is an offense in which a computer or other device is used as a vector to attack another system, either to sabotage it or to gain unauthorized access to data. It can also involve activities such as fraud, identity theft, and disruption of system availability. Attackers may then extort the victim for monetary gain, which can result in severe losses for companies. Implementing cybersecurity ^[9] strategies is crucial because most data are now stored online, which leaves them highly exposed to cybercrime. In the context of this work, it is therefore important to analyze trends in cybersecurity. Web servers have proven to be particularly susceptible to cyber-attackers, who deploy malicious code and techniques on compromised servers, so web servers should be examined first.

Cloud computing ^[10] is becoming the norm for the majority of companies. Through the cloud, companies can create and deploy apps, manage and store terabytes of data, and run artificial intelligence and machine learning workloads. However, as the number of cloud features grows, so do concerns about security, and cloud providers must therefore develop secure systems to protect the data. Mobile networks have also removed earlier barriers to accessibility. Data can now be stored on compact cellular devices, which has made them very popular, but this popularity also makes them more susceptible to cyber-attacks. Strategies such as firewalls and other protective practices must therefore be implemented to prevent data breaches.

To protect these devices, a company's cybersecurity team uses several techniques. Cybersecurity engineers use malware detection programs ^[11] that scan system files for flaws and viruses. Malware is a general term for a variety of attack types, including viruses, worms, ransomware, and trojans, although this is not an exhaustive list ^[12]^[13]. Firewalls act as a screening mechanism that protects the system from malicious traffic. They inspect content entering the system from the internet and filter out messages and commands that do not meet specified criteria. They can also perform other functions, such as stateful inspection and application-layer filtering. Antivirus systems are software programs that detect malicious content. Their scans can be updated and rerun over time so that the output reveals existing loopholes and any newly introduced viruses, which keeps the antivirus from becoming obsolete.
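
The screening behaviour described for firewalls, filtering out traffic that does not meet specified criteria, can be sketched as a first-match rule list with a default-deny policy. This is a simplified illustration and not how any particular firewall product is implemented. The `Rule` class, the `decide` function, and the addresses (taken from reserved documentation ranges) are hypothetical.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    action: str                  # "allow" or "deny"
    network: str = "0.0.0.0/0"   # source network in CIDR form; default matches any IPv4
    port: Optional[int] = None   # destination port; None matches any port


def decide(rules, source_ip, port, default="deny"):
    """Return the action of the first rule that matches, else the default."""
    address = ipaddress.ip_address(source_ip)
    for rule in rules:
        source_ok = address in ipaddress.ip_network(rule.network)
        port_ok = rule.port is None or rule.port == port
        if source_ok and port_ok:
            return rule.action
    return default


rules = [
    Rule("deny", network="203.0.113.0/24"),  # block one source network entirely
    Rule("allow", port=443),                 # permit HTTPS from everyone else
    Rule("allow", port=80),                  # permit HTTP from everyone else
]
```

For example, `decide(rules, "198.51.100.7", 443)` matches the HTTPS rule, whereas a request to a port with no matching rule falls through to the default and is denied. Rule order matters because the first match wins.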

3. Big Data Analytics

Big data analytics is the term for the analysis of data whose processing requirements exceed the capacity of conventional databases, with the threshold depending on the application. In such cases, the data are too large and are produced too quickly to be handled with conventional tools. Big data is characterized by four main Vs: volume, velocity, veracity, and value. In addition to these, variety, variability, validity, visualization, and vulnerability are also used in big data analytics ^[14]^[15].

Different types of databases can be used to store these data. NoSQL stands for ‘not only SQL’. Unlike relational databases, NoSQL databases do not store data in a tabular format. Their major advantage is the ability to store large quantities of data at high speed. They can also accommodate unstructured, semi-structured, and structured data, all of which are crucial for modern data management systems. Unstructured data, such as text and multimedia content, lack a predetermined data model. Semi-structured data, such as XML or JSON, have some organizational properties but do not adhere to a rigid format. Structured data adhere to a predefined schema and can easily be arranged in rows and columns. By managing these data types efficiently, firms can extract important insights, improve decision-making, and find hidden patterns or trends that might otherwise go unnoticed.

Cloud storage systems are now widely used to store large amounts of data. One of their main benefits is scalability; in addition, data are easy to retrieve and there is no startup cost. Examples of cloud storage services ^[16] are Amazon S3 and Azure Data Lake. With these tools, large volumes of client data are stored and used by data analysts whenever analysis is required. After retrieval from the databases, the data are imported into a visualization tool, such as Power BI or Tableau, in JSON or CSV format and analyzed further. Other formats, such as XML, are also commonly used, as are direct database connections. A wide range of visualization tools support these data formats and connection mechanisms.
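
As an illustration of moving semi-structured data into a tabular form that a visualization tool can import, the following minimal sketch converts a JSON array of records into CSV text using the Python standard library. The function name `json_to_csv` and the sample fields are hypothetical, and the sketch assumes a flat list of objects with no nesting.

```python
import csv
import io
import json


def json_to_csv(json_text):
    """Convert a JSON array of flat objects into CSV text."""
    records = json.loads(json_text)
    # Collect the union of keys in first-seen order, so that records
    # with missing fields still fit into one table.
    columns = []
    for rec in records:
        for key in rec:
            if key not in columns:
                columns.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=columns, restval="", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)  # missing fields are written as empty cells
    return buffer.getvalue()


# Hypothetical semi-structured scan records: the two objects do not share all keys.
sample = (
    '[{"subdomain": "shop.example.com", "ssl": true},'
    ' {"subdomain": "blog.example.com", "open_ports": 3}]'
)
csv_text = json_to_csv(sample)
```

The resulting text has one header row and one row per record, which is the structured, row-and-column shape described above.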

4. Machine Learning in Cybersecurity

Big data involves storing large volumes of data that arrive at high speed. With the help of analytical tools, this mass of data can be broken down into simpler insights, whose quality depends on both the quality of the data and the analytical methods used. Small subsets of the data are taken for analysis with machine learning algorithms, and the outputs help analysts detect potential issues. Machine learning gives a clear picture of the severity and type of an attack, which makes it possible to identify the loopholes statistically and present the findings to prospective clients. It also helps in building predictive models that can detect the entry point of a cyber assault, so that the appropriate analytical approach can be chosen for the scenario. Machine learning approaches are becoming increasingly important for improving threat detection, improving incident response, and protecting digital systems from a variety of cyber-attacks. Incorporating big data analytics into cybersecurity is a positive development because it provides an accurate analysis of the situation as well as calculated estimates for hypothetical scenarios. It is therefore important for companies and firms to employ machine learning models in their cybersecurity strategies. Depending on the desired outcome, different kinds of algorithms can be used.

(a) Regression models ^[17]^[18]^[19] are mainly used for continuous variables. To build a regression model, existing data are split into training and testing sets. The model is fitted to the training data with a regression algorithm and then used to make predictions. In cybersecurity, regression algorithms are used to predict variables such as the number of fraudulent transactions and the possible location of a cyber-attack. Depending on the arrangement of the data points, different types of regression can be used, such as linear, polynomial, ridge, and lasso regression.

(b) Classification algorithms are mainly used to determine a binary, or ‘0 or 1’, output. In cybersecurity, they can be used to determine the cyber health status of a URL. Analysts can expose the data to a variety of classification algorithms, calculate the accuracy score of each, and decide which algorithm to deploy. The algorithms used here are mainly logistic regression, decision trees ^[20]^[21]^[22], the k-nearest neighbors algorithm ^[23], and support vector machines ^[24]^[25]. Classification is also used to filter spam mail.

For greater accuracy, increasingly large data sets are employed in deep learning ^[26]^[27]^[28]. Deep learning models can be used for functions such as text detection and image classification. For regression, artificial and recurrent neural networks ^[29]^[30]^[31] are used.

Thus, it is ideal to incorporate big data analytics into the cybersecurity domain. As cyber-attacks grow in number and sophistication, a data-driven strategy is essential for making calculated decisions. Data analysis tools can be built with specific visualizations and presented to prospective clients. Because cyber-attacks are constantly evolving, their nature must be understood before an appropriate strategy can be formulated. Machine learning predictive analysis can then be used to estimate the location and probability of a cyber-attack. Finally, a comprehensive analysis is carried out to provide detailed insights for the authorities.
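
The model-selection step described in this section, in which data are split into training and testing sets, several classifiers are scored, and the most accurate one is chosen, can be sketched as follows. The sketch uses only the Python standard library and compares a majority-class baseline with a k-nearest-neighbours classifier; it is not the pipeline used in the original study. The features (number of missing security headers, whether a valid SSL certificate is present), the labels, and all function names are hypothetical.

```python
from collections import Counter


def train_test_split(rows, test_every=4):
    """Deterministic split: every `test_every`-th row goes to the test set."""
    train = [r for i, r in enumerate(rows) if (i + 1) % test_every != 0]
    test = [r for i, r in enumerate(rows) if (i + 1) % test_every == 0]
    return train, test


def majority_classifier(train):
    """Baseline that always predicts the most common training label."""
    label = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda features: label


def knn_classifier(train, k):
    """k-nearest neighbours with squared Euclidean distance and majority vote."""
    def predict(features):
        by_distance = sorted(
            train,
            key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], features)),
        )
        votes = Counter(y for _, y in by_distance[:k])
        return votes.most_common(1)[0][0]
    return predict


def accuracy(predict, test):
    """Fraction of test rows whose label is predicted correctly."""
    return sum(predict(x) == y for x, y in test) / len(test)


# Hypothetical rows: ((missing_security_headers, has_valid_ssl), unsafe_label)
rows = [
    ((0, 1), 0), ((1, 1), 0), ((4, 0), 1), ((5, 0), 1),
    ((0, 1), 0), ((3, 0), 1), ((1, 1), 0), ((0, 1), 0),
    ((4, 0), 1), ((2, 1), 0), ((5, 0), 1), ((4, 0), 1),
]
train, test = train_test_split(rows)
models = {
    "majority baseline": majority_classifier(train),
    "3-nearest neighbours": knn_classifier(train, k=3),
}
scores = {name: accuracy(model, test) for name, model in models.items()}
best = max(scores, key=scores.get)  # name of the most accurate model
```

Including a trivial baseline is a deliberate choice: a classifier is only worth deploying if it clearly beats always predicting the most common class. On real data, a randomized or stratified split and additional metrics would normally be used.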

References

Shar, L.K.; Tan, H.B.K. Defeating SQL injection. Computer 2012, 46, 69–77.
Fang, Y.; Li, Y.; Liu, L.; Huang, C. DeepXSS: Cross site scripting detection based on deep learning. In Proceedings of the International Conference on Computing and Artificial Intelligence, Sanya, China, 21–23 December 2018.
Tsai, C.-W.; Lai, C.-F.; Chao, H.-C.; Vasilakos, A.V. Big data analytics: A survey. J. Big Data 2015, 2, 21.
Rao, B.B.; Krishna, K.V.; Swathi, K. A Fast KNN Based Intrusion Detection System For Cloud Environment. J. Adv. Res. Dyn. Control. Syst. 2018, 10, 1509–1515.
Verma, A.; Ranga, V. Statistical analysis of CIDDS-001 dataset for network intrusion detection systems using distance-based machine learning. Procedia Comput. Sci. 2018, 125, 709–716.
Belouch, M.; El Hadaj, S.; Idhammad, M. Performance evaluation of intrusion detection based on machine learning using Apache Spark. Procedia Comput. Sci. 2018, 127, 1–6.
Khammassi, C.; Krichen, S. A GA-LR wrapper approach for feature selection in network intrusion detection. Comput. Secur. 2017, 70, 255–277.
Farnaaz, N.; Jabbar, M. Random forest modeling for network intrusion detection system. Procedia Comput. Sci. 2016, 89, 213–217.
Bhardwaj, M.; Alshehri, K.; Kaushik, K.; Alyamani, H.; Kumar, M. Secure framework against cyber-attacks on cyber-physical robotic systems. J. Electron. Imaging 2022, 31, 061802.
Armbrust, M.; Fox, A.; Griffith, R.; Joseph, D.; Katz, R. Above the Clouds: A Berkeley View of Cloud Computing; Technical Report EECS-2009-28; University of California: Berkeley, CA, USA, 2009.
AlOmari, H.; Yaseen, Q.M.; Al-Betar, M.A. A Comparative Analysis of Machine Learning Algorithms for Android Malware Detection. Procedia Comput. Sci. 2023, 220, 763–768.
Karajeh, H.; Maqableh, M.; Masa’deh, R. Privacy and security issues of cloud computing environment. In Proceedings of the 23rd IBIMA Conference, Valencia, Spain, 13–14 May 2020; pp. 1–15.
Jouini, M.; Rabai, L. A security framework for secure cloud computing environments. In Cloud Security: Concepts, Methodologies, Tools, and Applications; IGI Global: Hershey, PA, USA, 2019; pp. 249–263.
Mathrani, S.; Lai, X. Big data analytic framework for organizational leverage. Appl. Sci. 2021, 11, 2340.
Joshi, N.; Kadhiwala, B. Big data security and privacy issues—A survey. In Proceedings of the 2017 Innovations in Power and Advanced Computing Technologies (i-PACT), Vellore, India, 21–22 April 2017; pp. 1–5.
Pedchenko, Y.; Ivanchenko, Y.; Ivanchenko, I.; Lozova, I.; Jancarczyk, D.; Sawicki, P. Analysis of modern cloud services to ensure cybersecurity. Procedia Comput. Sci. 2022, 207, 110–117.
Ma, J.; Saul, L.K.; Savage, S.; Voelker, G.M. Beyond blacklists: Learning to detect malicious websites from suspicious URLs. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Paris, France, 28 June–1 July 2009; pp. 1245–1254.
Xu, L.; Zhan, Z.; Xu, S.; Ye, K. Cross-layer detection of malicious websites. In Proceedings of the ACM Conference on Data and Application Security and Privacy, San Antonio, TX, USA, 18–23 February 2013; pp. 141–152.
Wang, D.; Navathe, S.B.; Liu, L.; Irani, D.; Tamersoy, A.; Pu, C. Click traffic analysis of short URL spam on Twitter. In Proceedings of the IEEE International Conference on Collaborative Computing: Networking, Applications and Worksharing, Austin, TX, USA, 20–23 October 2013; pp. 250–259.
Chiba, D.; Tobe, K.; Mori, T.; Goto, S. Detecting malicious websites by learning IP address features. In Proceedings of the IEEE/IPSJ International Symposium on Applications and the Internet, Izmir, Turkey, 16–20 July 2012; pp. 29–39.
Cao, J.; Li, Q.; Ji, Y.; He, Y.; Guo, D. Detection of forwarding-based malicious URLs in online social networks. Int. J. Parallel Program. 2016, 44, 163–180.
Marchal, S.; François, J.; State, R.; Engel, T. PhishStorm: Detecting phishing with streaming analytics. IEEE Trans. Netw. Serv. Manag. 2014, 11, 458–471.
Choi, H.; Zhu, B.B.; Lee, H. Detecting malicious web links and identifying their attack types. In Proceedings of the 2nd USENIX Conference on Web Application Development (WebApps 11), Portland, OR, USA, 15–16 June 2011.
Huang, H.; Qian, L.; Wang, Y. A SVM-based technique to detect phishing URLs. Inf. Technol. J. 2012, 11, 921–925.
Nepali, R.; Wang, Y.; Alshboul, Y. Detecting Malicious Short URLs on Twitter. In Proceedings of the 21st Americas Conference on Information Systems, Fajardo, Puerto Rico, 13–15 August 2015; pp. 1–6.
Canali, D.; Cova, M.; Vigna, G.; Kruegel, C. Prophiler: A fast filter for the large-scale detection of malicious web pages. In Proceedings of the 20th International Conference on World Wide Web, Hyderabad, India, 28 March–1 April 2011; pp. 197–206.
Hemalatha, J.; Roseline, S.A.; Geetha, S.; Kadry, S.; Damaševičius, R. An efficient densenet-based deep learning model for malware detection. Entropy 2021, 23, 344.
Ahsan, M.; Gomes, R.; Chowdhury, M.M.; Nygard, K.E. Enhancing machine learning prediction in cybersecurity using dynamic feature selector. J. Cybersecur. Priv. 2021, 1, 199–218.
Saxe, J.; Berlin, K. eXpose: A character-level convolutional neural network with embeddings for detecting malicious URLs, file paths and registry keys. arXiv 2017, arXiv:1702.08568.
Wang, H.H.; Yu, L.; Tian, S.W.; Peng, Y.F.; Pei, X.J. Bidirectional LSTM Malicious webpages detection algorithm based on convolutional neural network and independent recurrent neural network. Appl. Intell. 2019, 49, 3016–3026.
Yang, W.; Zuo, W.; Cui, B. Detecting malicious URLs via a keyword-based convolutional gated-recurrent-unit neural network. IEEE Access 2019, 7, 29891–29900.

© Text is available under the terms and conditions of the Creative Commons Attribution (CC BY) license; additional terms may apply. By using this site, you agree to the Terms and Conditions and Privacy Policy.

Information

Subjects: Computer Science, Cybernetics

Contributors:

Shivashankar Hiremath

Eeshan Shetty

Allam Jaya Prakash

Suraj Prakash Sahoo

Kiran Kumar Patro

Kandala N. V. P. S. Rajesh

Paweł Pławiak

Update Date: 15 Dec 2023
