Attack Investigation

Attack investigation is an important research field in forensic analysis. Many existing supervised attack investigation methods rely on well-labeled data for effective training. While unsupervised approaches based on BERT can mitigate this issue, the high degree of similarity between certain real-world attacks and normal behaviors makes it challenging to accurately identify disguised attacks.

  • attack investigation
  • contrastive learning
  • audit logs
  • APT
  • deep learning (DL)

1. Introduction

Enterprises face threats from covert and persistent multi-step attacks [1], such as Advanced Persistent Threats (APTs). To counter such attacks, attack investigation approaches, an important research field of forensic analysis, have been extensively studied to identify and trace attack behaviors within information systems [2][3][4][5]. These methods conduct comprehensive causality analysis of large volumes of audit logs collected from ubiquitous system monitoring to identify attack patterns that reveal the tactics and objectives of attackers [6][7][8][9]. However, traditional methods rely heavily on feature engineering and require extensive manual work [10][11][12][13]. In contrast, deep learning (DL) techniques can learn irregular patterns from massive amounts of data that may elude human observation, thereby facilitating the automation of data analysis.
Previous research has introduced DL-based methods to advance attack investigation [6][14][15], yielding remarkable results. ATLAS [6] and AIRTAG [15] are state-of-the-art DL-based attack investigation approaches. However, these efforts still suffer from the following limitations.
Limitation I: Lack of high-quality labeled data. ATLAS is a supervised learning method that requires labeled data for training. Unlike general-domain DL tasks with publicly available datasets, the research area of attack investigation lacks well-labeled datasets. This is because audit logs contain detailed confidential information from within enterprises, and making these data public would raise privacy and security issues. In addition, precisely labeling audit logs requires expertise in both logs and network security [16], and labeling extensive log data is labor-intensive and error-prone.
Limitation II: Difficulty in identifying disguised attacks. APT attacks typically disguise their behavior to evade security protection systems. These disguised attacks share processes with normal behaviors or leverage the process-hollowing technique to inject malicious code into common processes. Moreover, their execution flow resembles that of normal behaviors, so contexts must be correlated to identify them. However, it is challenging for current attack investigation techniques to detect disguised attacks effectively, especially for methods that depend on similarity to distinguish between regular and attack behaviors. AIRTAG leverages unlabeled log text data to pre-train a BERT [17] model and employs a one-class support vector machine (OC-SVM) as a downstream classifier for unsupervised attack investigation. The essence of this unsupervised downstream task is to discover attack behaviors through similarity. However, the data representations learned by the BERT model collapse to some extent [18], meaning that almost all log text data are mapped into a small region of the embedding space and therefore exhibit high similarity. This problem pushes already similar normal behaviors and disguised attacks even closer together after representation learning by the DL model, thus hindering the identification of disguised attacks in the downstream attack investigation task.
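As an illustration of this similarity-based downstream task, the following is a minimal sketch of an embed-then-detect pipeline in the spirit of AIRTAG, assuming the Hugging Face transformers and scikit-learn libraries; the model name, mean-pooling step, OC-SVM parameters, and example log lines are assumptions for illustration rather than AIRTAG's actual configuration.

```python
# Sketch of an unsupervised "embed-then-detect" pipeline: sentence embeddings from a
# pre-trained BERT feed a one-class SVM. All concrete choices below are illustrative.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.svm import OneClassSVM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def embed(log_lines):
    """Mean-pool the last hidden states into one vector per log line."""
    batch = tokenizer(log_lines, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state        # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)           # (B, T, 1)
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()  # (B, H)

# Fit on logs assumed to be benign; predict() returns -1 for outliers (attack candidates).
train_vecs = embed(["sshd accepted connection from 10.0.0.5",
                    "cron executed /usr/bin/backup"])
clf = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(train_vecs)
labels = clf.predict(embed(["winword.exe spawned powershell downloading payload"]))
```

If the BERT embeddings partially collapse, benign and disguised-attack log lines end up close together in this space, and the OC-SVM boundary cannot separate them, which is precisely the limitation described above.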

2. Attack Investigation

Audit logs are collected by system monitoring tools from different operating systems. An audit log encapsulates a specific system event or system call and includes system entities, relationships, timestamps, and other essential system-related information. The concept of constructing provenance graphs from OS-level audit logs was proposed by King et al. [19]. Some studies in attack analysis use rule-based or Indicator-of-Compromise (IOC) matching to identify possible threat behaviors; however, the precision and comprehensiveness of the rule database and IOCs are crucial factors that determine the effectiveness of these techniques [3][11]. Holmes [3] maps low-level audit logs to tactics, techniques, and procedures (TTPs) and advanced persistent threat (APT) stages through rule-based matching against a knowledge base. Other techniques propose investigation strategies based on statistical analysis, leveraging the comparatively lower frequency of threat events relative to normal events to determine the authenticity of alerts [20]. However, such methods may mistakenly categorize low-frequency normal events as high-threat occurrences. OmegaLog [7] combines application event logs and system logs to create a Universal Provenance Graph (UPG) that portrays multi-layer semantic data. In contrast, WATSON [4] infers log semantics from contextual indications and consolidates event semantics to depict behaviors, greatly decreasing the effort required to investigate attacks. However, these traditional methods rely heavily on feature engineering and require extensive manual work. Deep learning-based approaches enable the creation of attack investigation models by identifying the unique features of normal or malicious behaviors [6][14][15]. ATLAS [6] applies Long Short-Term Memory (LSTM) networks for supervised sequence learning. AIRTAG [15] parses log files, uses BERT to train a pre-trained model, and subsequently trains a downstream classifier. However, these methods are constrained by the availability of high-quality labeled data and by model performance, making them less effective in certain real-world scenarios, such as when attack behaviors are far less numerous than normal behaviors (sample imbalance) or when attackers' disguises make attack sequences highly similar to normal sequences.
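As an illustration of provenance-graph construction from audit events, the following is a minimal sketch using networkx; the event schema (subject, action, object, timestamp) and the concrete entities are hypothetical and are chosen only to show how backward tracing recovers the causal ancestors of a suspicious node.

```python
# Sketch: build a provenance graph from parsed audit events and trace backward.
# The event schema and entity names are illustrative; real audit formats
# (e.g., Linux auditd, ETW) carry many more fields.
import networkx as nx

events = [
    {"subject": "firefox.exe",      "action": "write",   "object": "/tmp/payload.bin", "ts": 1},
    {"subject": "/tmp/payload.bin", "action": "load",    "object": "payload.bin",      "ts": 2},
    {"subject": "payload.bin",      "action": "connect", "object": "203.0.113.7:443",  "ts": 3},
]

g = nx.MultiDiGraph()  # nodes = system entities, edges = causal events
for e in events:
    g.add_edge(e["subject"], e["object"], action=e["action"], ts=e["ts"])

# Backward tracing from a symptom node (the suspicious connection) yields its causal ancestors.
ancestors = nx.ancestors(g, "203.0.113.7:443")
print(sorted(ancestors))  # ['/tmp/payload.bin', 'firefox.exe', 'payload.bin']
```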

3. Contrastive Learning Framework

Recently, contrastive learning has become a popular technique in unsupervised representation learning. A typical contrastive learning framework, SimCLR, is widely used across different tasks. The SimCLR architecture consists of four components: (1) data augmentation strategies (t ~ T) that are applied independently to generate correlated views of each input sample; (2) a base encoder network 𝑓(·); (3) a projection head 𝑔(·); and (4) a contrastive loss function that maximizes agreement between the views of the same sample. Depending on the data characteristics, data augmentation strategies can be explored to enhance downstream tasks, and an appropriate encoder network, such as a GNN or BERT, can be chosen for 𝑓(·) based on the specific task requirements. With the development of pre-trained language models, the use of contrastive learning in natural language processing (NLP) tasks has increased significantly [21][22][23][24][25]. For instance, IS-BERT [21] introduces a unique method by integrating 1-D convolutional neural network (CNN) layers over BERT; the CNNs are trained to maximize the mutual information (MI) between the overall sentence embedding and its corresponding localized context embeddings. Similarly, CERT [22] utilizes a structure similar to MoCo [23] and employs back-translation for data augmentation. However, the momentum encoder in CERT requires additional memory, and back-translation may inadvertently introduce false positives. BERT-CT [24] employs two distinct encoders for contrastive learning, albeit at the expense of increased memory usage, and it samples only seven negative instances, which can impact training efficiency. Some of these methods draw inspiration from the SimCLR architecture, such as DeCLUTR [25] and CLEAR [26]. DeCLUTR takes a holistic training approach by combining contrastive and masked language model objectives; however, its primary focus on spans for contrastive learning may result in fragmented semantic comprehension. CLEAR closely aligns with DeCLUTR in terms of architecture and objectives. Both approaches place a central emphasis on pre-training language models, albeit requiring substantial corpora and resource investments. The contrastive learning framework is a good solution to the partial collapse of the data representations learned by BERT. Introducing a contrastive learning framework into attack investigation can push disguised attacks farther away from normal behaviors in the embedding space, thus facilitating the more accurate identification of disguised attacks in downstream attack investigation tasks.
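To make the fourth component concrete, the following is a minimal sketch of a SimCLR-style NT-Xent contrastive loss, assuming PyTorch; the function name, batch layout, and temperature value are illustrative choices rather than the exact configuration used by any of the systems discussed above.

```python
# Minimal SimCLR-style NT-Xent contrastive loss (a sketch, not a reference implementation).
# z1 and z2 are the projected embeddings g(f(x)) of two augmented views of the same batch;
# the temperature of 0.5 is an illustrative choice.
import torch
import torch.nn.functional as F

def nt_xent_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.5) -> torch.Tensor:
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)       # (2N, D), unit-length rows
    sim = z @ z.t() / temperature                             # pairwise cosine similarities
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float("-inf"))                # exclude self-similarity
    # The positive for sample i is the other augmented view of the same input: i+N or i-N.
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)]).to(z.device)
    return F.cross_entropy(sim, targets)                      # pulls positives together, pushes all others apart

# Usage sketch: z1, z2 = g(f(aug1(batch))), g(f(aug2(batch))); loss = nt_xent_loss(z1, z2)
```

Minimizing this loss explicitly pushes non-matching pairs apart, which is the property that counteracts representation collapse and, in the attack investigation setting, separates disguised attacks from similar normal behaviors.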