Attack Investigation

Version	Summary	Created by	Modification	Content Size	Created at
1		Jiawei Li	--	1213	2024-01-08 08:17:05
2	Format correct	Wendy Huang	Meta information modification	1213	2024-01-08 08:24:36

This entry is adapted from the peer-reviewed paper 10.3390/s23249881

Attack investigation is an important research field in forensic analysis. Many existing supervised attack investigation methods rely on well-labeled data for effective training. While an unsupervised approach based on BERT can mitigate this dependence on labeled data, the high degree of similarity between certain real-world attacks and normal behaviors makes it challenging to accurately identify disguised attacks.

Keywords: attack investigation; contrastive learning; audit logs; APT; deep learning (DL)

1. Introduction

Enterprises face threats from covert and persistent multi-step attacks ^[1], such as Advanced Persistent Threats (APTs). To counter such attacks, attack investigation approaches, an important research field of forensic analysis, have been extensively studied to identify and trace attack behaviors within information systems ^[2]^[3]^[4]^[5]. These methods conduct comprehensive causality analysis of large volumes of audit logs collected from ubiquitous system monitoring to identify attack patterns that reveal the tactics and objectives of attackers ^[6]^[7]^[8]^[9]. However, traditional methods rely heavily on feature engineering and require extensive manual work ^[10]^[11]^[12]^[13]. In contrast, deep learning (DL) techniques can learn irregular patterns from massive amounts of data that may elude human observation, thereby facilitating the automation of data analysis.

Previous research has introduced DL-based methods to advance attack investigation ^[6]^[14]^[15], yielding remarkable results. ATLAS ^[6] and AIRTAG ^[15] are state-of-the-art DL-based attack investigation approaches. However, these efforts still suffer from the following limitations.

Limitation I: Lack of high-quality labeled data. ATLAS is a supervised learning method that requires labeled data for training. Unlike general-domain DL tasks with publicly available datasets, the research area of attack investigation lacks well-labeled datasets. This is because audit logs contain detailed confidential information from within enterprises, and making these data public would lead to privacy and security issues. In addition, precisely labeling audit logs requires expertise in both log analysis and network security ^[16], and labeling extensive log data is labor-intensive and error-prone.

Limitation II: Difficulty in identifying disguised attacks. APTs typically disguise their behavior to evade security protection systems. These disguised attacks share processes with normal behaviors or leverage the process hollowing technique to inject malicious code into common processes. Moreover, their execution flow resembles normal behaviors, so identifying them requires correlating contexts. However, it is challenging for current attack investigation techniques to detect disguised attacks effectively, especially for methods that depend on similarity to distinguish between normal and attack behaviors. AIRTAG leverages unlabeled log text data to pre-train the BERT ^[17] model and employs a one-class support vector machine (OC-SVM) as a downstream classifier for unsupervised attack investigation. The essence of this unsupervised downstream task is to discover attack behaviors through similarity. However, the data representations learned by the BERT model collapse to some extent ^[18], meaning that almost all log text data are mapped to a small region of the embedding space and therefore exhibit high mutual similarity. This problem pushes the already similar normal behaviors and disguised attacks even closer together in the embedding space after representation learning by the DL model, thus hindering the identification of disguised attacks in the downstream attack investigation task.
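The effect of representation collapse on a similarity-based classifier can be illustrated with a small sketch. The vectors below are hypothetical toy embeddings, not outputs of BERT or AIRTAG, and the helper names are assumptions made for illustration. The sketch only shows that when every embedding sits in a narrow region of the space, the mean pairwise cosine similarity approaches 1, leaving little margin to separate a disguised attack from normal behavior.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all unordered pairs of vectors."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Hypothetical "collapsed" embeddings: all vectors point in almost the same
# direction, so normal and attack samples are nearly indistinguishable.
collapsed = [
    [10.0, 10.0, 10.1],  # normal behavior
    [10.1, 10.0, 10.0],  # normal behavior
    [10.0, 10.1, 10.0],  # disguised attack
]

# Hypothetical "spread" embeddings: samples occupy different directions.
spread = [
    [1.0, 0.1, 0.0],  # normal behavior
    [0.9, 0.2, 0.1],  # normal behavior
    [0.0, 0.1, 1.0],  # disguised attack
]

print("collapsed:", round(mean_pairwise_cosine(collapsed), 4))
print("spread:   ", round(mean_pairwise_cosine(spread), 4))
```

In the collapsed set the attack sample is as similar to the normal samples as they are to each other, which is the situation the paragraph above describes.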

2. Attack Investigation

Audit logs are collected by system monitoring tools from different operating systems. An audit log encapsulates a specific system event or system call, including system entities, relationships, timestamps, and other essential system-related information. The concept of constructing provenance graphs from OS-level audit logs was proposed by King et al. ^[19]. Some attack analysis work uses rule-based or Indicators of Compromise (IOC) matching methods to identify possible threat behaviors. Nevertheless, the effectiveness of these techniques depends critically on the precision and comprehensiveness of the rule database and the IOCs ^[3]^[11]. Holmes ^[3] maps low-level audit logs to tactics, techniques, and procedures (TTPs) and advanced persistent threat (APT) stages through rule-based matching against a knowledge base. Other techniques propose investigation strategies based on statistical analysis, leveraging the comparatively lower frequency of threat events relative to normal events to determine the authenticity of alerts ^[20]. However, such methods may mistakenly categorize low-frequency normal events as high-threat occurrences. OmegaLog ^[7] combines application event logs and system logs to create a Universal Provenance Graph (UPG) that portrays multi-layer semantic data. In contrast, WATSON ^[4] infers log semantics from contextual indications and aggregates event semantics to depict behaviors, which greatly decreases the effort required for investigating attacks. However, the aforementioned traditional methods rely heavily on feature engineering and require extensive manual work.
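As a rough illustration of provenance-graph construction and backward causality analysis, the following sketch builds a graph from a handful of simplified events and walks backward from a detection point. The event format `(timestamp, source, destination, action)`, the entity names, and the function names are assumptions made for this example; real audit logs and the systems cited above are considerably richer.

```python
from collections import defaultdict, deque

# Hypothetical, simplified audit events. Information flows from source to
# destination at the given timestamp.
events = [
    (1, "198.51.100.7:443", "firefox", "recv"),
    (2, "firefox", "/tmp/payload.bin", "write"),
    (3, "/tmp/payload.bin", "payload", "exec"),
    (4, "/etc/passwd", "payload", "read"),
    (5, "payload", "203.0.113.9:80", "send"),
    (6, "vim", "/home/user/notes.txt", "write"),
    (7, "cron", "firefox", "signal"),
]


def build_provenance_graph(events):
    """Index events by destination so that causes can be looked up quickly."""
    incoming = defaultdict(list)
    for ts, src, dst, action in events:
        incoming[dst].append((ts, src, action))
    return incoming


def backtrack(incoming, start, start_time):
    """Return entities that could have causally influenced `start`.

    Maps each entity to the latest time at which it influenced the chain.
    An edge is followed only if it happened no later than that bound.
    """
    bound = {start: start_time}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for ts, src, _action in incoming.get(node, []):
            if ts <= bound[node] and ts > bound.get(src, float("-inf")):
                bound[src] = ts
                queue.append(src)
    return bound


graph = build_provenance_graph(events)
ancestors = backtrack(graph, "203.0.113.9:80", start_time=5)
print(sorted(ancestors))
```

The unrelated `vim` activity is never reached, and the `cron` event is excluded because it occurred after `firefox` had already written the payload, so it cannot have influenced the outbound connection.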

Deep learning-based approaches build attack investigation models by identifying the distinguishing features of normal or malicious behaviors ^[6]^[14]^[15]. ATLAS ^[6] applies Long Short-Term Memory (LSTM) networks for supervised sequence learning. AIRTAG ^[15] parses log files, pre-trains a BERT model on the resulting log text, and subsequently trains a downstream classifier. However, these methods are constrained by the availability of high-quality labeled data and by model performance, making them less effective in certain real-world scenarios. These scenarios include situations where the number of attack behaviors is significantly lower than that of normal behaviors, leading to sample imbalance, and cases in which the attackers’ disguises result in high similarity between attack sequences and normal sequences.

3. Contrastive Learning Framework

Recently, contrastive learning has become a very popular technique in unsupervised representation learning. SimCLR, a typical contrastive learning framework, is widely used across different tasks. The SimCLR architecture consists of four components: (1) data augmentation strategies ($t \sim \mathcal{T}$) that independently generate different views of each input sample; (2) a base encoder network $f(\cdot)$; (3) a projection head $g(\cdot)$; and (4) a contrastive loss function that maximizes the agreement between the augmented views of the same sample. Depending on the data characteristics, different data augmentation strategies can be explored to enhance downstream tasks. An appropriate encoding network, such as a GNN or BERT, can be chosen for $f(\cdot)$ based on the specific task requirements.
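The contrastive loss in component (4) can be sketched in plain Python. The example below implements the normalized temperature-scaled cross-entropy (NT-Xent) loss used by SimCLR. It assumes the inputs are already the projection-head outputs of two augmented views per sample; the function names, the toy vectors, and the temperature value are illustrative choices, not settings taken from the entry.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent(views_a, views_b, temperature=0.5):
    """NT-Xent loss over a batch of N samples with two views each.

    views_a[i] and views_b[i] are the projections of two augmentations of
    sample i. Every other vector in the batch acts as a negative.
    """
    z = views_a + views_b
    n = len(views_a)
    total = 0.0
    for i in range(2 * n):
        j = (i + n) % (2 * n)  # index of the positive partner of z[i]
        positive = math.exp(cosine(z[i], z[j]) / temperature)
        denominator = sum(
            math.exp(cosine(z[i], z[k]) / temperature)
            for k in range(2 * n)
            if k != i
        )
        total += -math.log(positive / denominator)
    return total / (2 * n)


# Hypothetical projections: two samples, two views each.
views_a = [[1.0, 0.0], [0.0, 1.0]]
views_b_aligned = [[0.9, 0.1], [0.1, 0.9]]     # views agree with views_a
views_b_mismatched = [[0.1, 0.9], [0.9, 0.1]]  # views disagree with views_a

print("aligned:   ", nt_xent(views_a, views_b_aligned))
print("mismatched:", nt_xent(views_a, views_b_mismatched))
```

The loss is lower when the two views of the same sample are close and the views of different samples are far apart, which is what training the encoder and projection head with this objective encourages.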

With the development of pre-trained language models, the use of contrastive learning in natural language processing (NLP) tasks has increased significantly ^[21]^[22]^[23]^[24]^[25]. For instance, IS-BERT ^[21] adds 1-D convolutional neural network (CNN) layers on top of BERT. In this configuration, the CNNs are trained to maximize the mutual information (MI) between the overall sentence embedding and its corresponding local context embeddings. Similarly, CERT ^[22] adopts a structure similar to MoCo ^[23] and employs back-translation for data augmentation. However, the momentum encoder in CERT requires additional memory, and back-translation may inadvertently introduce false positives. BERT-CT ^[24] employs two distinct encoders for contrastive learning, again at the expense of increased memory usage. Moreover, this approach samples only seven negative instances, which can affect training efficiency. Some of these methods draw inspiration from the SimCLR architecture, such as DeCLUTR ^[25] and CLEAR ^[26]. DeCLUTR trains jointly on contrastive and masked language model objectives. However, it relies primarily on text spans for contrastive learning, which may result in fragmented semantic comprehension. CLEAR closely aligns with DeCLUTR in terms of architecture and objectives. Both approaches focus on pre-training language models, which requires substantial corpora and computing resources.
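The extra memory attributed to a momentum encoder follows from how MoCo-style training keeps a second, slowly updated copy of the encoder parameters. A minimal sketch of that update rule is shown below. Parameters are represented as a flat list of floats for simplicity; the function name and the toy values are assumptions for illustration and are not taken from CERT or MoCo code.

```python
def momentum_update(key_params, query_params, momentum=0.999):
    """Move each key-encoder parameter slightly toward the query encoder.

    Implements key <- momentum * key + (1 - momentum) * query, element-wise.
    """
    return [
        momentum * k + (1.0 - momentum) * q
        for k, q in zip(key_params, query_params)
    ]


# The key encoder starts as a full copy of the query encoder, so the
# parameters are held in memory twice.
query_params = [0.5, -1.0, 2.0]
key_params = list(query_params)

# Pretend one gradient step changed the query encoder.
query_params = [0.6, -0.9, 2.1]

# The key encoder only drifts slowly toward the new query parameters.
key_params = momentum_update(key_params, query_params)
print(key_params)
```

Only the query encoder receives gradients; the key encoder is updated by this rule alone, which keeps its representations stable across training steps at the cost of storing both parameter sets.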

The contrastive learning framework offers a way to alleviate the partial collapse of the data representations learned by BERT. Introducing a contrastive learning framework into attack investigation can push disguised attacks farther away from normal behaviors in the embedding space, thus facilitating more accurate identification of disguised attacks in downstream attack investigation tasks.

References

Mirsaraei, A.G.; Barati, A.; Barati, H. A secure three-factor authentication scheme for IoT environments. J. Parallel Distrib. Comput. 2022, 169, 87–105.
Milajerdi, S.M.; Eshete, B.; Gjomemo, R.; Venkatakrishnan, V.N. Poirot: Aligning attack behavior with kernel audit records for cyber threat hunting. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, London, UK, 11–15 November 2019; pp. 1795–1812.
Milajerdi, S.M.; Gjomemo, R.; Eshete, B.; Sekar, R.; Venkatakrishnan, V.N. Holmes: Real-time apt detection through correlation of suspicious information flows. In Proceedings of the 2019 IEEE Symposium on Security and Privacy (SP), San Francisco, CA, USA, 19–23 May 2019; pp. 1137–1152.
Zeng, J.; Chua, Z.L.; Chen, Y.; Ji, K.; Liang, Z.; Mao, J. Watson: Abstracting behaviors from audit logs via aggregation of contextual semantics. In Proceedings of the 28th Annual Network and Distributed System Security Symposium, NDSS, Online, 21–25 February 2021.
Gao, P.; Shao, F.; Liu, X.; Xiao, X.; Qin, Z.; Xu, F.; Mittal, P.; Kulkarni, S.R.; Song, D. Enabling efficient cyber threat hunting with cyber threat intelligence. In Proceedings of the 2021 IEEE 37th International Conference on Data Engineering (ICDE), Chania, Greece, 19–22 April 2021; pp. 193–204.
Alsaheel, A.; Nan, Y.; Ma, S.; Yu, L.; Walkup, G.; Celik, Z.B.; Zhang, X.; Xu, D. ATLAS: A Sequence-based Learning Approach for Attack Investigation. In Proceedings of the 30th USENIX Security Symposium, Online, 11–13 August 2021; pp. 3005–3022.
Hassan, W.U.; Noureddine, M.A.; Datta, P.; Bates, A. OmegaLog: High-Fidelity Attack Investigation via Transparent Multi-layer Log Analysis. In Proceedings of the Network and Distributed System Security Symposium 2020, Online, 23–26 February 2020.
Gao, P.; Xiao, X.; Li, Z.; Xu, F.; Kulkarni, S.R.; Mittal, P. AIQL: Enabling Efficient Attack Investigation from System Monitoring Data. In Proceedings of the 2018 USENIX Annual Technical Conference (USENIX ATC 18), Boston, MA, USA, 11–13 July 2018; pp. 113–126.
Kwon, Y.; Wang, F.; Wang, W.; Lee, K.H. MCI: Modeling-based Causality Inference in Audit Logging for Attack Investigation. In Proceedings of the Network and Distributed System Security Symposium, San Diego, CA, USA, 18–21 February 2018; Volume 2, p. 4.
Zhao, J.; Yan, Q.; Liu, X.; Li, B.; Zuo, G. Cyber Threat Intelligence Modeling Based on Heterogeneous Graph Convolutional Network. In Proceedings of the 23rd International Symposium on Research in Attacks, Intrusions and Defenses (RAID 2020), San Sebastian, Spain, 14–16 October 2020; pp. 241–256.
Hossain, M.N.; Sheikhi, S.; Sekar, R. Combating dependence explosion in forensic analysis using alternative tag propagation semantics. In Proceedings of the 2020 IEEE Symposium on Security and Privacy (SP), San Francisco, CA, USA, 18–21 May 2020; pp. 1139–1155.
Zhu, T.; Wang, J.; Ruan, L.; Xiong, C.; Yu, J.; Li, Y.; Chen, Y.; Chen, T. General, Efficient, and Real-time Data Compaction Strategy for APT Forensic Analysis. IEEE Trans. Inf. Forensics Secur. 2021, 16, 3312–3325.
Yang, R. RATScope: Recording and Reconstructing Missing RAT Semantic Behaviors for Forensic Analysis on Windows. IEEE Trans. Dependable Secur. Comput. 2020, 19, 1621–1638.
Du, M.; Li, F.; Zheng, G.; Srikumar, V. Deeplog: Anomaly detection and diagnosis from system logs through deep learning. In Proceedings of the ACM SIGSAC Conference on Computer and Communications Security, Dallas, TX, USA, 30 October–3 November 2017.
Ding, H.; Zhai, J.; Nan, Y. AIRTAG: Towards Automated Attack Investigation by Unsupervised Learning with Log Texts. In Proceedings of the 32nd USENIX Security Symposium (USENIX Security 23), Anaheim, CA, USA, 9–11 August 2023; pp. 373–390.
Liu, F.; Wen, Y.; Zhang, D.; Jiang, X.; Xing, X.; Meng, D. Log2vec: A heterogeneous graph embedding based approach for detecting cyber threats within enterprise. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, London, UK, 11–15 November 2019.
Devlin, J.; Chang, M.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the NAACL-HLT, Minneapolis, MN, USA, 3–5 June 2019; pp. 4171–4186.
Yan, Y.; Li, R.; Wang, S.; Zhang, F.; Wu, W.; Xu, W. Consert: A contrastive framework for self-supervised sentence representation transfer. arXiv 2021, arXiv:2105.11741.
King, S.T.; Chen, P.M. Backtracking intrusions. ACM SIGOPS Oper. Syst. Rev. 2003, 37, 223–236.
Hassan, W.U.; Guo, S.; Li, D.; Chen, Z.; Jee, K.; Li, Z.; Bates, A. Nodoze: Combatting threat alert fatigue with automated provenance triage. In Proceedings of the Network and Distributed System Security Symposium 2019, San Diego, CA, USA, 24 February 2019.
Zhang, Y.; He, R.; Liu, Z.; Lim, K.H.; Bing, L. An unsupervised sentence embedding method by mutual information maximization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 1601–1610.
Fang, H.; Xie, P. Cert: Contrastive self-supervised learning for language understanding. arXiv 2020, arXiv:2005.12766.
He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020.
Carlsson, F.; Sahlgren, M.; Gogoulou, E.; Gyllensten, A.C.; Ylipa, E. Semantic re-tuning with contrastive tension. In Proceedings of the International Conference on Learning Representations, Virtual Event, 3–7 May 2021.
Giorgi, J.M.; Nitski, O.; Bader, G.D.; Wang, B. Declutr: Deep contrastive learning for unsupervised textual representations. arXiv 2020, arXiv:2006.03659.
Wu, Z.; Wang, S.; Gu, J.; Khabsa, M.; Sun, F.; Ma, H. Clear: Contrastive learning for sentence representation. arXiv 2020, arXiv:2012.15466.

© Text is available under the terms and conditions of the Creative Commons Attribution (CC BY) license; additional terms may apply. By using this site, you agree to the Terms and Conditions and Privacy Policy.

Information

Subjects: Others

Contributors:

Jiawei Li

Ru Zhang

Jianyi Liu

Update Date: 08 Jan 2024
