Malware Detection Method with ViT Attention Mechanism

Artificial intelligence (AI) is increasingly being utilized in cybersecurity, particularly for detecting malicious applications. However, the black-box nature of AI models presents a significant challenge: their lack of transparency makes it difficult to understand and trust their results. Addressing this requires incorporating explainability into the detection model, yet little research explains why an application is detected as malicious or describes its behavior.

  • explainable artificial intelligence (XAI)
  • deep learning
  • cybersecurity
  • mobile malware

1. Introduction

Mobile security threats that target Android devices are constantly evolving and becoming more sophisticated. Using Android malware, cybercriminals can steal sensitive information, disrupt device use, and compromise user privacy [1].
Among the many efforts to detect malicious applications (apps), numerous studies have demonstrated the effectiveness of deep learning methods [2]. Recently, studies using image-based malware detection models have been increasing [3]. Representing an application's binary as an image enables more accurate detection because advanced image-processing techniques can be applied directly to the raw bytes [4]. Additionally, training data can be generated quickly because the bytes are processed without feature engineering. These advantages make it possible to build a more accurate and efficient malicious app detection method.
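As a concrete illustration of this binary-to-image idea, the following is a minimal Python sketch in which each byte of a classes.dex file becomes one grayscale pixel and the result is resized to a fixed model input size. The function name, the row width, and the 256×256 target size are illustrative assumptions, not details taken from the cited studies.

```python
import math

import numpy as np
from PIL import Image

def dex_to_grayscale(dex_path: str, width: int = 256) -> Image.Image:
    """Map each byte (0-255) of a DEX file to one grayscale pixel."""
    data = np.frombuffer(open(dex_path, "rb").read(), dtype=np.uint8)
    height = math.ceil(len(data) / width)
    padded = np.zeros(width * height, dtype=np.uint8)  # zero-pad the last row
    padded[: len(data)] = data
    return Image.fromarray(padded.reshape(height, width), mode="L")

# Resize so that every app yields the same model input shape.
img = dex_to_grayscale("classes.dex").resize((256, 256))
```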
However, a deep learning-based malware detection model does not explain why it detects an application as malicious. This poses severe problems when integrating artificial intelligence (AI) into cybersecurity [5], reducing human users’ trust in the model and making it difficult for users to understand the process behind the results [6]. To address these issues, some studies interpret the model’s decision basis as images or strings [7][8][9]. Unfortunately, in many cases, such a model is designed for interpretability rather than detection accuracy, and the interpretation methods are usually complex. As malware evolves rapidly, a methodology that provides both accuracy and interpretability while requiring minimal updates or modifications to models or input data is needed to respond to malware continuously [10]. A method with a simple structure is needed for this purpose. In addition, the explanations produced by such detection models should be designed with the needs and preferences of users in mind; that is, they should be provided in a form that is easy for users to understand, such as text [11].

2. Malware Detection

Android malware detection has been extensively studied and is broadly divided into signature-based and artificial intelligence-based methods [12]. Zhang et al. [13] obtained features through a static analysis of the AndroidManifest.xml and Android Dalvik executable (DEX) files. They generated four feature sets (permissions, intent filters, API calls, and strings) and proposed a convolutional neural network (CNN)-based model for malicious app detection, creating feature vectors through a feature embedding model. Wang et al. [14] created a hybrid model combining a deep autoencoder and a CNN to detect malicious applications. They used static features in seven categories, including requested permissions, intents, restricted API calls, hardware features, code-related patterns, and suspicious API calls. In total, 34,570 individual features were extracted, of which 413 were retained after filtering. Two CNN variants, CNN-S and CNN-P, were used to detect malicious apps.
Ren et al. [15] presented two methods for processing classes.dex files into fixed-size sequences used as input to a deep learning model. Their approach does not limit the input file size, requires no feature engineering, and consumes few resources. Hsien-De Huang and Kao [16] mapped the bytecode of classes.dex to RGB colors to create fixed-size color images that reveal visual patterns shared by malware of the same family. Their Inception-V3 model detected malware with high accuracy, and grayscale images were as effective as color ones. Daoudi et al. [17] used grayscale images generated from DEX file bytecode to detect malware with a CNN model, achieving high accuracy; image size did not significantly impact performance, and obfuscated apps were also detected effectively. Freitas et al. [4] constructed MALNET-IMAGE, a dataset of over one million malicious application images, providing a valuable resource for research into malicious apps. Using this MalNet dataset, they evaluated detection performance with CNN-based models such as ResNet, DenseNet, and MobileNet. Yadav et al. [18] presented an EfficientNet-B4 CNN-based method for Android malicious app detection in which the DEX file is transformed into an image and used as model input; it outperformed pre-trained models such as ResNet, InceptionV3, and DenseNet. These influential studies each employ a distinct approach, ranging from static analysis and feature extraction to complex deep learning models, and the image-based methods in particular have demonstrated impressive performance by leveraging the latest CNN-based models.
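To make the fixed-size preprocessing concrete, here is a minimal sketch of one plausible way to map a variable-length DEX byte stream onto a fixed-length sequence, in the spirit of Ren et al. [15]. The evenly spaced resampling strategy and the sequence length of 200,000 are assumptions for illustration, not the exact procedure of that paper.

```python
import numpy as np

def to_fixed_sequence(dex_bytes: bytes, length: int = 200_000) -> np.ndarray:
    """Resample a byte stream to a fixed-length, normalized float sequence."""
    data = np.frombuffer(dex_bytes, dtype=np.uint8)
    # Evenly spaced indices map any file size onto exactly `length` samples.
    idx = np.linspace(0, len(data) - 1, num=length).astype(np.int64)
    return data[idx].astype(np.float32) / 255.0  # scale bytes to [0, 1]
```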

3. Malware Detection Interpretation

XMal, proposed by Wu et al. [9], is a method for detecting malicious applications and generating descriptions of malicious behavior using an attention mechanism based on a multi-layer perceptron (MLP). Their model generated human-readable descriptions of malware behavior using API calls and permissions as features; it combined an attention layer with an MLP and used a pre-built semantic database of the features most influential for detection. However, because XMal prioritizes highly weighted features, it may not cover all malicious behavior, and its focus on interpretability may compromise detection accuracy. Deep learning techniques can visualize important image features, making them helpful for interpreting the results of image-based malware detection models. Iadarola et al. [7] used images to identify common patterns among malware of the same type. They used gradient-weighted class activation mapping (Grad-CAM) to visually present the model’s results to security professionals and compared heatmap images of similar malware types using average Euclidean distance, finding similar shapes and enabling security experts to identify patterns in these types without prior knowledge of the samples. One limitation is that the interpretation provided to security experts is a heatmap over a binary image, which humans cannot easily read. Yakura et al. [19] proposed a method for extracting essential byte sequences from malware to make manual analysis more efficient. Using attention mechanisms and CNNs, they showed that applying attention maps to binary data makes it possible to identify the features or locations that characterize a type of malware.
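For reference, the following is a minimal PyTorch sketch of Grad-CAM applied to a CNN-based malware-image classifier. It is a generic illustration of the technique, not the exact pipeline of Iadarola et al. [7]; the `model`, its target convolutional layer, and the input shape are assumptions.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, x, target_layer, class_idx=None):
    """Return a normalized heatmap over the input image x of shape (1, C, H, W)."""
    feats, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))
    logits = model(x)
    cls = logits.argmax(dim=1) if class_idx is None else torch.tensor([class_idx])
    model.zero_grad()
    logits[0, cls].backward()            # gradients of the chosen class score
    h1.remove(); h2.remove()
    weights = grads["a"].mean(dim=(2, 3), keepdim=True)  # average gradient per feature map
    cam = F.relu((weights * feats["a"]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=x.shape[2:], mode="bilinear", align_corners=False)
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)

# e.g., heatmap = grad_cam(cnn, image_tensor, cnn.layer4)  # ResNet-style layer name
```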

4. Vision Transformer

A ViT is a Transformer encoder-based model for image classification that is highly scalable and performs well on large datasets with fewer training resources than CNN-based models [20]. Self-attention is an essential mechanism of the ViT, enabling it to learn from large image datasets accurately and effectively [21]. One of the most valuable properties of the ViT is that self-attention makes it easy to recognize where the model focuses in the input data. This interpretability is a crucial advantage of the ViT over other deep learning models and is particularly useful in applications where transparency and explainability are essential [22]. Overall, self-attention allows the ViT to achieve high accuracy in various computer vision tasks while providing a transparent and intuitive way to interpret its inner workings [23]. One method for computing the attention map is Attention Rollout [24], which can be applied to a ViT to generate a heatmap showing the areas the model identified as critical. In CNN-based models, Grad-CAM is often used to generate heatmaps. Grad-CAM improves on the traditional class activation map (CAM) method and has the advantage of being applicable without modifying the model [25]. Whereas CAM relies on the last convolutional layer followed by a global average pooling layer, Grad-CAM uses gradient values to produce a heatmap highlighting critical regions [26]. Some research suggests that Attention Rollout may explain ViT decisions more effectively than earlier XAI techniques such as Grad-CAM for CNNs. Both Attention Rollout and Grad-CAM aim to provide insight into the decision-making process of a deep neural network, but Attention Rollout provides a more accurate and detailed visual description of the ViT’s predictions [27]. It should be noted, however, that the effectiveness of these visualization methods depends on factors such as the complexity of the dataset and the specific task.
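The Attention Rollout computation itself is compact: average the attention weights over heads, add the identity matrix to account for residual connections, renormalize, and multiply the resulting matrices across layers [24]. Below is a minimal sketch; the assumed interface is a list of per-layer attention tensors of shape (heads, tokens, tokens) with the [CLS] token at index 0.

```python
import torch

def attention_rollout(attentions):
    """attentions: list of per-layer tensors of shape (heads, tokens, tokens)."""
    tokens = attentions[0].shape[-1]
    rollout = torch.eye(tokens)
    for attn in attentions:
        a = attn.mean(dim=0)                 # fuse attention heads
        a = a + torch.eye(tokens)            # add identity for residual connections
        a = a / a.sum(dim=-1, keepdim=True)  # renormalize rows
        rollout = a @ rollout                # propagate attention through layers
    # Attention from the [CLS] token to the patch tokens forms the heatmap.
    return rollout[0, 1:]
```

Reshaping the returned vector to the ViT’s patch grid and upsampling it to the input resolution yields the heatmap described above.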

5. Android DEX File

The Dalvik Virtual Machine (DVM) runs code that has been compiled into the DEX file format. A DEX file contains the compiled code and the information needed to run an Android app, but it is not human-readable. It is organized into sections such as header, string_ids, and type_ids, and its data section stores bytecode and string data in a format specific to each element. DEX decompiler tools can recover Java source code by reorganizing the data in a DEX file into Java code form; tools used to decompile DEX include jadx [28], dex2jar [29], and apktool [30]. Even without a decompiler, the source code and related information can be obtained by parsing the DEX file according to Google’s dex format documentation.
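As a small illustration of such direct parsing, the sketch below reads a few fields of the fixed 112-byte DEX header, using the little-endian uint32 offsets given in Google’s dex format documentation; the function name and the selection of fields shown are illustrative.

```python
import struct

def parse_dex_header(path: str) -> dict:
    """Read selected fields from the fixed 112-byte DEX header."""
    with open(path, "rb") as f:
        header = f.read(112)
    magic = header[0:8]                              # e.g. b"dex\n035\x00"
    checksum, = struct.unpack_from("<I", header, 8)  # adler32 of the rest of the file
    file_size, header_size = struct.unpack_from("<II", header, 32)
    string_ids_size, string_ids_off = struct.unpack_from("<II", header, 56)
    type_ids_size, type_ids_off = struct.unpack_from("<II", header, 64)
    return {
        "magic": magic,
        "checksum": checksum,
        "file_size": file_size,
        "header_size": header_size,
        "string_ids": (string_ids_size, string_ids_off),
        "type_ids": (type_ids_size, type_ids_off),
    }
```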