NLP- and API-Sequence-Based Malware Detection and Classification Methodologies: Comparison
Please note this is a comparison between Version 1 by Peng Wang and Version 2 by Peter Tang.

The surge in malware threats propelled by the rapid evolution of the internet and smart device technology necessitates effective automatic malware classification for robust system security.

  • Transformer
  • malware classification
  • API call sequences

1. Introduction

Malware, or malicious software, is crafted to infiltrate computers and mobile devices, aiming to gain unauthorized control of systems, gather sensitive information, display unwanted ads, or extort users [1][2]. The surge in smart devices such as laptops and phones has greatly expanded the threat landscape, jeopardizing user security and system integrity [3][4]. Malware classification assigns each sample a label identifying its family, which is a crucial step in addressing security challenges [5].
Malware classification methods can be grouped by technique into signature-based, machine learning-based, and deep learning-based approaches, or by feature source into static analysis and dynamic analysis. Signature-based approaches struggle to keep pace with the rapid evolution of malware [6]. In response, traditional machine learning methods, including Support Vector Machines (SVM), Random Forests (RF), and Naïve Bayes (NB), have been applied to malware detection and classification [7][8]. However, these approaches require manual feature extraction that relies on expert knowledge, which adds complexity to the process.
Contemporary malware classification methods effectively leverage malware features, encompassing both static and dynamic attributes, to build machine learning or deep learning models. Static analysis extracts features such as hex values and opcodes [9] from malware binary executable files through reverse engineering and examination of the original binary code. While static analysis is efficient, it is susceptible to evasion and obfuscation techniques. In contrast, dynamic analysis techniques capture malware behaviors, including file access, API (Application Programming Interface) calls, data flow, and other behavior traces, by executing and monitoring malware within a virtual sandbox. Dynamic analysis offers a more accurate representation of malware’s actual objectives and actions, resulting in lower false-positive rates and higher accuracy [10][11]. Building on deep learning’s strength in image representation, many studies treat malware as an image by converting the features from static and dynamic analysis into a matrix [12][13].
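To make the image-representation idea concrete, the following is a minimal sketch of how a raw byte sequence can be reshaped into a fixed-width grayscale matrix, one common way to feed a binary into an image-based model. The width of 8 and the sample bytes are illustrative choices, not values from the cited works.

```python
# Minimal sketch: reshape a raw byte stream into a fixed-width grayscale
# "image" matrix, as image-based malware classifiers commonly do.

def bytes_to_matrix(data: bytes, width: int = 8) -> list[list[int]]:
    """Zero-pad the byte stream and cut it into rows of `width` pixels (0-255)."""
    padded = data + b"\x00" * (-len(data) % width)  # pad to a full last row
    return [list(padded[i:i + width]) for i in range(0, len(padded), width)]

# Illustrative input: the first bytes of a PE header ("MZ...")
matrix = bytes_to_matrix(b"MZ\x90\x00\x03\x00\x00\x00\x04\x00")
# each row is one "scanline" of the grayscale image
```

A real pipeline would then resize or crop this matrix to the fixed input shape the CNN expects.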
Despite the success of feature analysis and deep learning, especially in image representation, the researchers posit that API call sequences can be regarded as a form of language through which programs communicate with operating systems, analogous to how individuals employ languages for interpersonal interaction; this view can better reflect the nature of the malware. Tran et al. point out that every type of malware has its own specific API call patterns or unique order of API calls [14]. In contrast to dynamic instruction features, the extraction of API call features requires only a coarse-grained dynamic analysis. Consequently, this approach incurs a relatively modest computational cost, rendering it highly effective for a broad spectrum of software codes.
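Treating an API call trace as a "sentence" implies the usual first step of any language model: mapping each API name to an integer token id. The sketch below shows that step under stated assumptions; the API names and the `<PAD>`/`<UNK>` conventions are illustrative, not taken from a specific cited system.

```python
# Sketch: treat an API call trace like a sentence and map each API name
# to an integer token id, the usual first step before any sequence model.
# The traces below are invented for illustration.

def build_vocab(traces):
    vocab = {"<PAD>": 0, "<UNK>": 1}  # reserved padding / unknown tokens
    for trace in traces:
        for api in trace:
            vocab.setdefault(api, len(vocab))
    return vocab

def encode(trace, vocab):
    return [vocab.get(api, vocab["<UNK>"]) for api in trace]

traces = [["CreateFileW", "WriteFile", "CloseHandle"],
          ["RegOpenKeyExW", "RegSetValueExW"]]
vocab = build_vocab(traces)
ids = encode(["CreateFileW", "VirtualAlloc"], vocab)  # unseen API -> <UNK>
```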

2. Deep Learning-Based or API-Call-Related Malware Classification

There is a line of work focused on building malware classification systems based on extracted features. Nagano et al. [15] have proposed an innovative static analysis approach, integrating Natural Language Processing (NLP) with machine learning classifiers to discriminate between malicious and benign software. Their methodology entails the utilization of a PV-DBOW model for the extraction of features from diverse sources, including DLL imports, assembly code, and hex dumps, all derived from static analysis. Subsequently, these extracted features, or vectors, are input into Support Vector Machines (SVM) and k-nearest neighbor (KNN) classifiers for predictive inference. Another study, by Tran et al. [14], used NLP techniques such as N-gram, Doc2Vec (or paragraph vectors), and TF-IDF to convert API call sequences to numeric vectors before feeding them to classifiers including SVM, KNN, MLP, and RF. Schofield [16] also uses N-gram and TF-IDF to encode API call sequences and employs a CNN classifier, exploiting its strength with image-like representations. Chandrasekar Ravi et al. [17] employ a third-order Markov chain to model Windows API call sequences. Nakazato et al. [18] cluster malware by behavioral characteristics derived from Windows API calls in parallel threads, using N-gram and TF-IDF. Deep learning-based methodologies have exhibited remarkable potential for delivering more effective and adaptable features, yielding superior outcomes in malware classification. Kolosnjaji et al. [19] pioneered the application of convolutional and recurrent network layers for the extraction of features from comprehensive API sequences. Their pioneering work underscores the substantial accomplishments attained through the integration of deep learning techniques within API-sequence-based malware classification. Similarly, C. Li’s work [20] demonstrates that an RNN can classify API call sequences on its own.
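The N-gram + TF-IDF encoding mentioned above can be sketched in a few lines. This is a minimal pure-Python version using the plain log(N / df) idf (real toolkits add smoothing and normalization), with invented API traces as input.

```python
# Hedged sketch of N-gram + TF-IDF: turn API call sequences into sparse
# numeric vectors before a classical classifier (SVM, KNN, RF, ...).
# Uses the unsmoothed idf = log(N / df); the traces are illustrative.
import math
from collections import Counter

def ngrams(seq, n=2):
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

def tfidf(docs, n=2):
    grams = [Counter(ngrams(d, n)) for d in docs]      # term counts per doc
    df = Counter(g for c in grams for g in c)          # document frequency
    N = len(docs)
    return [{g: (tf / sum(c.values())) * math.log(N / df[g])
             for g, tf in c.items()} for c in grams]

docs = [["OpenProcess", "VirtualAllocEx", "WriteProcessMemory"],
        ["OpenProcess", "VirtualAllocEx", "CreateRemoteThread"]]
vectors = tfidf(docs)
# the bigram shared by both traces gets weight 0 (idf = log 1),
# while each trace's distinctive bigram keeps a positive weight
```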
In a subsequent development, Li et al. [21] have further refined the network architecture, introducing the extraction of inherent features from API sequences. Notably, their approach incorporates embedding layers to represent API phrases and semantic chains, along with Bidirectional Long Short-Term Memory (Bi-LSTM) units to capture interrelationships among APIs. Their results demonstrate significant performance enhancements over baseline methodologies, highlighting the efficacy of introducing additional intrinsic features associated with APIs. Some works consider the similarity among features, especially API call sequences, and use similarity for encoding, followed by advanced models such as GNN [22], Random Forest, LSTM [23], and F-RCNN [24].
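As one concrete reading of a similarity-based encoding, the sketch below scores a sample by its Jaccard similarity to a small set of reference traces, producing a short feature vector that a downstream model (RF, LSTM, GNN) could consume. The reference traces and the choice of Jaccard similarity are assumptions for illustration, not the exact encoders of the cited works.

```python
# Minimal similarity-based encoding sketch: represent a trace by its
# Jaccard similarity to known reference traces. References are invented.

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

references = [["OpenProcess", "WriteProcessMemory", "CreateRemoteThread"],
              ["RegOpenKeyExW", "RegSetValueExW", "RegCloseKey"]]

def similarity_vector(trace):
    """One feature per reference trace, each in [0, 1]."""
    return [jaccard(trace, ref) for ref in references]

vec = similarity_vector(["OpenProcess", "CreateRemoteThread", "Sleep"])
```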

3. Transformer Models and Local Attention

Transformer is the first sequence transduction model that relies entirely on the attention mechanism. Unlike RNN [25] and LSTM [26], Transformer [27] uses multi-headed self-attention instead of recurrent layers in an encoder-decoder architecture. Thanks to the absence of recurrent layers, the Transformer avoids the risk of vanishing and exploding gradients, and it can process the entire sequence at once and learn the relationships between API calls. The Transformer encoder-decoder model takes less time to train than the LSTM model, and it is more stable [28]. MalBERT [29] first utilized a pre-trained Transformer to process and detect malware, and experiments demonstrate that the BERT-based model can achieve high accuracy for malware classification. The Transformer architecture delivers a good design of attention mechanisms, and some work employs additional attention modules to capture information. Yang [30] proposes to capture features from binary files using stacked CNNs and from assembly files via triangular attention, and then fuse all features via cross-attention. Their experimental results show that the method can extract both global and local features to improve the detection of malware variants effectively. Moreover, the local attention mechanism is very popular and effective in processing local features. Ma [31] points out that the mutual result of both global and local attention is useful to capture semantics and generate the most informative and discriminative features for text classification.
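The global-versus-local distinction can be illustrated with a toy windowed self-attention: each position attends only to neighbours within a fixed window, in contrast to the Transformer's full attention over the whole sequence. This is a single-head, pure-Python sketch with no learned projections, meant only to show the masking idea, not any cited model's exact mechanism.

```python
# Sketch of local (windowed) self-attention over a toy sequence of
# feature vectors: position i attends only to positions within `window`
# steps, unlike the Transformer's global attention. No learned weights.
import math

def local_attention(x, window=1):
    n, d = len(x), len(x[0])
    out = []
    for i in range(n):
        # only positions inside the local window are scored
        idx = [j for j in range(n) if abs(i - j) <= window]
        scores = [sum(x[i][k] * x[j][k] for k in range(d)) / math.sqrt(d)
                  for j in idx]
        m = max(scores)                       # for numerical stability
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum(w[t] / z * x[j][k] for t, j in enumerate(idx))
                    for k in range(d)])
    return out

seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
attended = local_attention(seq, window=1)  # each output mixes <=3 neighbours
```

Widening `window` to cover the whole sequence recovers ordinary global self-attention.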

4. Training Strategies

Generally, benefiting from sufficient data, convolutional networks are almost always trained offline. Thus, researchers favor developing training methods that not only improve model performance but also add no inference cost. Inspired by [32], the researchers call this kind of method a “bag of freebies”. Strategies such as data augmentation [33], hard negative example mining [34], online hard example mining [35], two-stage object detectors, and objective function design [36], to name a few, are commonly used in computer vision and natural language processing (NLP). In malware classification, Hwang [37] designs a two-stage detection method to protect victims by employing a random forest in the second stage to control false negative error rates, under the low false positive rates delivered by the first stage’s Markov chain model. Baek [38] employs static analysis and dynamic analysis in different stages: static analysis in the first stage classifies files as malware or benign, and dynamic analysis in the second stage re-examines the files labeled benign in stage one to lower the false detection rate and reduce the malware misclassification of stage one. The results show that a two-stage scheme can outperform static or dynamic analysis alone. Although these strategies can improve the detection rate, current research lacks consideration of the representation of malware and of detection speed.
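The two-stage idea above can be sketched as a simple cascade: a cheap first-stage score flags suspicious samples with a permissive threshold (high recall), and only flagged samples pay for the costlier second-stage analysis. The scoring functions and thresholds below are invented stand-ins, not the actual models of [37] or [38].

```python
# Illustrative two-stage cascade: fast stage 1 filters; only suspicious
# samples reach the expensive stage 2. Scores/thresholds are invented.

def stage1_score(sample):          # stand-in for a fast static model
    return sample["static_score"]

def stage2_score(sample):          # stand-in for a slow dynamic model
    return sample["dynamic_score"]

def two_stage_detect(sample, t1=0.3, t2=0.7):
    if stage1_score(sample) < t1:  # cheap filter: confidently benign
        return "benign"
    # only samples flagged by stage 1 incur the dynamic-analysis cost
    return "malware" if stage2_score(sample) >= t2 else "benign"

verdict = two_stage_detect({"static_score": 0.9, "dynamic_score": 0.8})
```

Setting `t1` low keeps stage-1 false negatives rare, while the stricter `t2` keeps the overall false positive rate down, which mirrors the division of labor described above.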