Machine-Learning Forensics | Encyclopedia MDPI

Machine-Learning Forensics: Comparison

Please note this is a comparison between Version 1 by Laila Mohammed Tajeldin and Version 2 by Rita Xu.

Smart environments and automation are being adopted across nearly all fields worldwide. Smart environments include a wide variety of Internet-of-Things (IoT) devices, which pose many challenges for conventional digital forensic investigation (DFI). These challenges include data heterogeneity, data distribution, and massive data volumes, which together exceed what human digital forensic (DF) investigators can handle within a short period of time.

smart environments
digital forensics
machine-learning techniques

1. Introduction

Currently, smart environments offer various technologies and services, such as smart transport systems, smart vehicles, smart homes, smart urban lighting, integrated travel ticketing, smart energy grids, and smart sensors [1]. These technologies strongly depend on the use of small electronic chips and electromechanical devices (i.e., IoT devices), such as sensors, wireless technologies, radio-frequency identification (RFID) devices, localisation technologies, and near-field communication devices [1].

The wide variety of IoT devices used within smart environments makes it very difficult to perform digital forensics (DF) in these environments. The challenge for DF professionals and practitioners is that standard industrial DF equipment, whose capabilities are geared to conventional computing operating systems, cannot cope with the smart environment because of its complex, heterogeneous, and distributed nature [2].

The problem raised in this research is that few, if any, reliable DF applications or DF directives currently exist to retrieve data from Internet-of-Things (IoT) devices in the event of a digital attack, an active investigation, or a litigation request within a smart environment [3]. Thus, researchers and practitioners in the DF field are working hard to define new techniques and tools that improve DF capabilities for coping with this problem. For example, it is currently possible to gather evidential data from a computer hard drive or even a mobile phone. However, smart devices such as smart watches or smart switches offer no standard interface through which to reach their storage components. As another example, many such devices do not host large amounts of storage space, but rather communicate their data to other devices. On the other hand, some of these devices generate such vast amounts of data that, should an investigator not act fast enough, evidential data might be lost forever. The vast volume and the short-lived nature of the data created by these smart devices make it humanly impossible to sift through. Machine-learning (ML) techniques may potentially be employed to assist with this dilemma and to find evidence much more effectively in a much shorter time span.

The numerous challenges that face traditional digital forensic investigation (DFI) in smart environments result from the heterogeneity, distribution, and sheer volume of the data involved. Dealing with all of these challenges in a short time exceeds the capabilities of human DF investigators, and it severely slows down or even incapacitates the conventional DFI process. Given the rapid pace at which digital crimes are committed, better and more intelligent DFI techniques are sorely needed, especially in smart environments. Machine-learning (ML) techniques might offer a solution to these challenges [4].

The application of ML in DFI is recent and still evolving. For example, the authors of [5] designed a framework known as IoTDots to help protect the data collected by various smart devices and applications. It features two main components: the IoTDots analyser and the IoTDots modifier. The former scans the source code of the applications and detects forensically relevant information. The latter automatically inserts tracking logs and reports the results.

In an IoT system, particularly in the case of emergent configurations, data might also be dynamic, making it difficult to classify information during live forensics, that is, a forensic investigation carried out in near-real time. To address this, the authors of [6] proposed a conceptual framework based on supervised machine-learning techniques. One advantage of using supervised ML techniques in live forensics is their ability to predict possible events based on past occurrences. In addition, automated feature identification was used to prevent redundancy throughout feature selection and elimination.
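The idea of preventing redundancy during feature selection can be illustrated with a simple correlation filter. This is only a minimal sketch under stated assumptions, not the procedure used in [6]: the function names (`pearson`, `drop_redundant`), the 0.95 threshold, and the feature names and values are all hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear relationship
    return cov / sqrt(var_x * var_y)


def drop_redundant(features, threshold=0.95):
    """Keep a feature only if it is not strongly correlated with one already kept.

    `features` maps a feature name to its column of values (assumed layout).
    """
    kept = []
    for name, column in features.items():
        if all(abs(pearson(column, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical per-device traffic features (illustrative values only).
features = {
    "bytes_sent": [100, 200, 300, 400, 500],
    "packets_sent": [10, 20, 30, 40, 50],  # moves in lockstep with bytes_sent
    "failed_logins": [0, 3, 1, 0, 4],
}
selected = drop_redundant(features)
print(selected)
```

In this toy data, `packets_sent` adds no information beyond `bytes_sent`, so the filter discards it, while the weakly correlated `failed_logins` column is retained.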

The importance of ML in DFIs should not be underestimated, since such intelligent technologies have the potential to support and significantly enhance the conventional DFI process. ML technologies can potentially assist in the automation of manual DFI processes when significant volumes and a large variety of data must be analysed. Using more intelligent techniques will increase the chances of identifying and successfully investigating cybercrimes in modern smart environments. This will help DF specialists get to the root cause much faster and more efficiently [6].

For all the reasons mentioned above, ML holds great potential for DFIs. However, it is a foreign field to most DF investigators, and the scope for new research is vast. That being said, there exists a small corpus of research where ML technology was used to investigate digital crimes [4].

ML techniques, which are often used to predict behaviour, rely on pattern recognition to help investigators analyse huge amounts of data. They seek to learn from historical data so as to predict future behaviour. Therefore, by using ML techniques, investigators may gain the capability to recognise patterns of criminal activity and to learn from historical data when, where, and how a cybercrime probably took place.

2. State-of-the-Art Use of Machine-Learning Techniques in Digital Forensics

Due to the challenges that traditional DFIs face in smart environments (i.e., the heterogeneity, distribution, and huge amount of data, which exceed what humans can manage in a short time), ML appears to be the most promising solution for these environments [4]. ML technologies can automate the laborious DFI operations of analysing huge amounts and wide ranges of data, increasing the likelihood of successfully detecting and investigating cybercrime. This would greatly aid DF professionals in rapidly and effectively determining the fundamental causes of incidents [6].

As mentioned before, the amount of data collected by IoT devices and sensors is immense and contains valuable forensic evidence. This data can help identify and prevent unauthorised access within smart environments. The authors of [5] designed a framework known as IoTDots to help protect the data collected by various smart devices and applications. It features two main components: the IoTDots analyser and the IoTDots modifier. The former scans the source code of the applications and detects forensically relevant information. The latter automatically inserts tracking logs and reports the results.

One of the most discussed challenges in DFI is the growing volume of data. Since the majority of file artefacts on seized devices are usually irrelevant to the investigation, manually retrieving the suspicious files that are relevant is very difficult. To reduce the amount of manual analysis required in DFI, the authors of [7] proposed a methodology for automatically prioritising suspicious file artefacts. Rather than providing the final analysis results, this methodology aims to predict and recommend the artefacts that are likely to be suspicious. It uses a supervised machine-learning approach that draws on previously processed case results.
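Prioritising artefacts from previously processed cases can be sketched with a small naive Bayes ranker. The classifier, the feature names (`ext`, `dir`), and the sample cases below are assumptions chosen for illustration; the source does not state which model or features the methodology in [7] uses.

```python
from collections import Counter, defaultdict
from math import log


def train(cases):
    """Count feature values per label from previously processed artefacts.

    `cases` is a list of (features, label) pairs, where `features` is a dict
    and `label` is True for artefacts that investigators marked as suspicious.
    """
    label_counts = Counter()
    value_counts = defaultdict(Counter)  # (label, feature) -> Counter of values
    vocab = defaultdict(set)             # feature -> set of values seen so far
    for features, label in cases:
        label_counts[label] += 1
        for name, value in features.items():
            value_counts[(label, name)][value] += 1
            vocab[name].add(value)
    return label_counts, value_counts, vocab


def suspicion_score(model, features):
    """Log-odds that an artefact is suspicious (naive Bayes, Laplace smoothing)."""
    label_counts, value_counts, vocab = model
    score = log(label_counts[True] + 1) - log(label_counts[False] + 1)
    for name, value in features.items():
        size = len(vocab[name]) + 1  # +1 leaves room for unseen values
        for label, sign in ((True, 1), (False, -1)):
            count = value_counts[(label, name)][value]
            score += sign * log((count + 1) / (label_counts[label] + size))
    return score


def prioritise(model, artefacts):
    """Return artefacts ordered from most to least likely to be suspicious."""
    return sorted(artefacts, key=lambda a: suspicion_score(model, a), reverse=True)


# Hypothetical results from earlier cases (illustrative only).
past_cases = [
    ({"ext": "exe", "dir": "temp"}, True),
    ({"ext": "exe", "dir": "downloads"}, True),
    ({"ext": "jpg", "dir": "pictures"}, False),
    ({"ext": "docx", "dir": "documents"}, False),
    ({"ext": "jpg", "dir": "downloads"}, False),
]
model = train(past_cases)
new_artefacts = [{"ext": "jpg", "dir": "pictures"}, {"ext": "exe", "dir": "temp"}]
ranked = prioritise(model, new_artefacts)
print(ranked)
```

The output is a ranked list rather than a verdict, which matches the stated aim of recommending artefacts for the investigator to examine first instead of delivering final analysis results.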
In support of DF, “intelligent methods” have been proposed, a term that covers the ability of computers to learn a specific task from data, data mining, machine learning, soft computing, and traditional artificial intelligence. The term is commonly used to describe ways of automating problem solving in DF, and two main intelligent approaches are utilised, namely rule-based and anomaly-based [8].

The authors of [9] introduced VERITAS, a novel and practical DF capability for smart environments, since current smart platforms lack any digital forensic capability for identifying, tracing, storing, or analysing the data generated in these environments. VERITAS has two main components, the collector and the analyser. The collector employs mechanisms to automatically collect forensically relevant data from the smart environment. The analyser then uses a first-order Markov chain model to extract valuable and usable forensic evidence from the collected data for the purposes of a forensic investigation.

To discover and declare the presence of adversaries, DF necessitates intensive data analysis, such as retrieving and confirming system logs and evaluating blockchain information. Hence, the authors of [10] proposed a blockchain-assisted shared audit framework to analyse DF data in an IoT environment. It was created to identify the sources and causes of data-scavenging attacks in virtualised resources, and it uses blockchain technology to manage access logs and controls. Using logistic regression and cross-validation, access-log data is examined for the consistency of adversary event detection.

The number of cases needing DF competence and the volume of data to be processed have overburdened digital forensic investigators. Automated evidence processing based on artificial intelligence techniques holds considerable potential for speeding up the digital forensic analysis process while improving case-processing capacity [4].
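A first-order Markov chain of the kind mentioned for the VERITAS analyser models each event as depending only on the event immediately before it. The sketch below is a generic illustration, not the VERITAS implementation: the event names, the add-one smoothing, and the helper functions `fit_transitions` and `log_likelihood` are assumptions.

```python
from collections import Counter, defaultdict
from math import log


def fit_transitions(sequences, smoothing=1.0):
    """Estimate first-order transition probabilities P(next | current)."""
    counts = defaultdict(Counter)
    states = set()
    for seq in sequences:
        states.update(seq)
        for current, nxt in zip(seq, seq[1:]):
            counts[current][nxt] += 1
    probs = {}
    for current in states:
        total = sum(counts[current].values()) + smoothing * len(states)
        probs[current] = {
            nxt: (counts[current][nxt] + smoothing) / total for nxt in states
        }
    return probs


def log_likelihood(probs, seq, floor=1e-6):
    """Log-probability of the transitions in `seq`; unknown states get `floor`."""
    return sum(
        log(probs.get(current, {}).get(nxt, floor))
        for current, nxt in zip(seq, seq[1:])
    )


# Hypothetical smart-home event logs (illustrative only).
typical = ["unlock", "motion", "light_on", "light_off", "lock"]
unusual = ["lock", "light_on", "unlock", "light_off", "motion"]
probs = fit_transitions([typical] * 3)

print(log_likelihood(probs, typical))
print(log_likelihood(probs, unusual))
```

A sequence whose transitions were rarely or never observed in the baseline receives a much lower log-likelihood, which is one way such a model could flag activity worth a closer look.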
In DFI, automation uses ML techniques for classification. By exploiting existing knowledge of digital evidence processing, ML techniques can obtain information that is important for investigations more efficiently. Additionally, digital-evidence triage was developed for the prompt detection, processing, and interpretation of digital evidence. With AI techniques, the investigator can determine the priority in which devices are gathered and processed at a crime scene [4].

Furthermore, the authors of [11] proposed an intelligent framework based on clustering and classification. The model learns from past crimes. When a new crime is registered, the investigator enters some of the crime information, such as the crime type, location, and time. The clustering process then automatically groups the new crime with previous similar crimes in the system using the k-nearest neighbour and crime-matching classification algorithms. In this way, the investigator can gain insights during the pre-investigation process by exploring the new crime alongside the previous similar crimes with which it has been clustered.

Moreover, with the growth of cybercrime that targets minors, chat logs can be examined to detect harmful behaviour and report it to law-enforcement authorities. This can make a significant difference in protecting youngsters on social media platforms from being abused by cyber predators. Since DFI is done primarily by hand, the enormous volume and variety of data make the task of DF investigators a tough one. The authors of [12] therefore suggested an approach that uses a DF process model backed by ML methodologies to enable the automatic detection of harmful conversations in chat logs.

One of the most fundamental characteristics of any smart device in an IoT network is its ability to acquire large sets of data and then send the obtained data to the destination/receiver server through the internet.
Thus, IoT-based networks are particularly vulnerable to simple or sophisticated attacks, which must be discovered early in the data transmission process in order to protect the network. The authors of [13] developed an intelligent intrusion detection system that utilises machine-learning models so that attacks on the IoT network can be discovered.

The adaptability of IoT devices raises the probability of continual attacks on them. Due to the low processing power and memory of IoT devices, security researchers have found it challenging to preserve records of the diverse attacks performed on these devices during a DFI. The authors of [14] proposed an intelligent forensic analysis mechanism to automate the detection of attacks on IoT devices based on the machine-to-machine framework. The proposed mechanism combines several ML techniques and different forensic analysis tools to detect different types of attacks. Furthermore, the problem of evidence gathering is overcome by providing a third-party logging server. To assess the effects and types of attacks and violations, forensic analysis is performed on the logs using a forensic server. In addition, the authors of [15], who proposed a framework for cyber-attacks against smart satellite networks, indicated that ML and deep-learning algorithms are effective for cyber-attack discovery, identification, and tracing.

In addition, IoT forensics and smart environments, with their recognised challenges, provide a great opportunity to develop new forensic tools for acquiring, preserving, and analysing forensic data, making the task of forensic investigators easier. The authors of [16] proposed a user-friendly tool for smart devices that support WiFi and used smart-environment scenarios to give forensic investigators, network administrators, and data scientists access to various features of network traffic in a few simple steps.
The proposed tool allows network traffic features to be computed in real time on any WiFi access point running the OpenWrt firmware, avoiding the time-consuming tasks of dumping network traffic and implementing the procedures needed to analyse the captured traffic.

On the other hand, due to the lack of examination and available data, the authors of [17] selected a smart fridge as an IoT device to be examined and investigated. The dataset was examined using two ML algorithms, Bayes net and decision stump, each of which represents a distinct idea. A decision stump is a simple, one-level version of the decision-tree ML technique. A Bayes net is useful for estimating, for an event that has occurred, the likelihood that any one of several known causes contributed to it. The validation results indicate that the Bayes net algorithm is more accurate than the decision stump.

Research shows that the main issues facing DF investigators in the smart environment are the large volume of data and the detection of attacks and violations. The proposed solutions are summarised in Figure 1 and Figure 2. The summary is split into two separate figures because two main themes were detected in the existing solutions: the first theme involves machine-learning forensics (MLF) solutions for large amounts of data, while the second involves MLF solutions for attack and violation detection.
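For readers unfamiliar with the decision stump used in the smart-fridge study [17], the following minimal sketch shows how such a one-level tree can be trained by exhaustively searching for the single best threshold split. The two features (door openings per hour, outbound traffic in kilobytes) and all values are hypothetical and are not taken from the dataset in [17].

```python
def fit_stump(rows, labels):
    """Find the single (feature, threshold, polarity) split with the fewest errors.

    `rows` is a list of equal-length numeric feature lists; `labels` are 0 or 1.
    """
    best = None
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            for polarity in (1, 0):  # label predicted when value > threshold
                predictions = [
                    polarity if row[feature] > threshold else 1 - polarity
                    for row in rows
                ]
                errors = sum(p != y for p, y in zip(predictions, labels))
                if best is None or errors < best[0]:
                    best = (errors, feature, threshold, polarity)
    return best[1:]


def predict(stump, row):
    """Classify one row with a fitted stump."""
    feature, threshold, polarity = stump
    return polarity if row[feature] > threshold else 1 - polarity


# Hypothetical rows: [door openings per hour, outbound traffic in KB].
rows = [[2, 5], [3, 4], [1, 6], [2, 40], [3, 55], [1, 48]]
labels = [0, 0, 0, 1, 1, 1]  # 1 marks traffic assumed to be malicious

stump = fit_stump(rows, labels)
print(stump)
print(predict(stump, [2, 50]))
```

Because a stump can consult only one feature, it cannot capture interactions between features, which is one plausible reason why a richer model such as a Bayes net may validate more accurately on the same data.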

Figure 1. MLF solutions for large amounts of data in smart environments.

Figure 2. MLF solutions for attack and violation detection in smart environments.

Figure 1 summarises the applications of MLF that were reported in research papers from 2018 to 2023 to serve as proposed solutions for dealing with the large amounts of data generated in smart environments. The following list explains the elements of Figure 1 in more detail:

The IoTDots framework was proposed as a solution to deal with the large amounts of data collected by IoT devices and sensors.
Automatic prioritisation of suspicious file artefacts was proposed as a solution to deal with the growing volume of data and manual retrieval of suspicious files.
A blockchain-assisted shared audit framework for identifying data-scavenging attacks in virtualised resources was proposed as a solution to deal with attack and violation detection in smart environments.
Intelligent methods to automate problem-solving were proposed as a solution to deal with the massive amounts of data that must be analysed for digital evidence.
Automation using ML techniques for classification and AI techniques for prioritising suspicious devices was proposed as a solution to deal with the growing number of cases needing DF competence and the large volumes of data to be processed.
Automatic text analysis to detect online sexually predatory conversations was proposed as a solution to deal with the growth of cybercrime targeting minors, the large volume of data, and the DFI process, which is done primarily by hand.
The “VERITAS” mechanism to automatically collect and extract forensic evidence from smart environments was proposed as a solution to deal with the large amounts of data that are generated in smart environments.

Figure 2 summarises the applications of ML in DF as proposed in research published between 2018 and 2023 for detecting data attacks and violations in smart environments. The following list explains Figure 2 in more detail:

An intelligent intrusion detection system to detect regular and malicious attacks on data created in smart environments was proposed as a solution to deal with the simple and sophisticated attacks to which IoT networks in particular are exposed.
An intelligent forensic analysis mechanism was proposed as a solution to deal with the probability of continual attacks on IoT devices and the low processing power and memory of these devices.