Trends in Malware Detection

Version	Summary	Created by	Modification	Content Size	Created at	Operation
1		Umm-e-Hani Tayyab	--	5262	2022-10-19 10:13:30	\|
2	Format correction	Sirius Huang	-2 word(s)	5260	2022-10-20 04:20:34	\|

This entry is adapted from the peer-reviewed paper 10.3390/jcp2040041

Monitoring Indicators of Compromise (IOCs) supports malware detection by identifying malicious activity that could lead to a system breach or data compromise. Various tools and anti-malware products use IOCs to detect malware and cyberattacks, but all have shortcomings. In the quest to fight zero-day attacks, the research paradigm shifted from primitive methods to classical machine learning-based methods, and then to deep learning-based methods.

malware machine learning deep learning few shot learning cyber attacks

1. Sophisticated Malware

Information is today one of the most valued but vulnerable assets. Evolving, sophisticated malware poses a constant threat of serious damage to infrastructure. Various techniques, trends, and strategies have been proposed to alleviate the threats posed by malicious code. These methods range from primitive malware detection based on statistical analysis to machine learning-based methodologies, specifically deep neural networks. The following sections build a hierarchy that represents this development of malware detection according to the methodology used.

2. Malware Detection with Primitive Methods (Statistical Analysis Based Methods)

Malware detection is performed with a variety of techniques, and many researchers have explored different practices for malware discovery and recognition. Ref. ^[1] focused on detecting malicious patterns in executables. The authors of ^[1] state that malware detection has become an obfuscation-deobfuscation game, and they therefore examined obfuscation techniques to check whether current anti-virus products can cope with the variability that obfuscation introduces. They implemented SAFE (Static Analyzer for Executables), which is claimed to detect malicious patterns in executables. They also developed an obfuscator for executables that applies four different techniques, and tested antivirus scanners against obfuscated variants of existing malicious executables. Ref. ^[1] presented a general architecture for detecting malicious patterns in executables with two main components: a program annotator and a malicious code detector. The obfuscation transformations supported by the obfuscator detailed in ^[1] are register reassignment, dead-code insertion, code transposition, and instruction substitution.

Ref. ^[2] used a heuristic approach for detecting malware by analyzing Windows binary files of obfuscated executables. The authors proposed a framework that first generates a risk score by statically analyzing the Windows PE file for 8 characteristics (abnormal ordinals, Nonstd_name, In_code, TLSection, DLL_no_export, Flagged Section Name, Low function Call, Other_badPEformat). The framework assigns a weight and a risk score to each characteristic. The risk score is assigned based on experience and on a comparison between malware and benign files. A total of 2014 Windows files were used in the experiments.
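
The weighted scoring idea can be illustrated with a minimal sketch. Only the eight characteristic names come from the text; the weights and the decision threshold below are hypothetical placeholders, since the values used in ^[2] are not given here.

```python
# Hypothetical weights: the text says weights are set from experience and
# malware/benign comparison, but does not list the actual values.
WEIGHTS = {
    "abnormal_ordinals": 2.0,
    "nonstd_name": 1.5,
    "in_code": 1.0,
    "tlsection": 1.0,
    "dll_no_export": 1.5,
    "flagged_section_name": 2.0,
    "low_function_call": 1.0,
    "other_bad_pe_format": 2.5,
}


def risk_score(flags):
    """Sum the weights of every characteristic flagged as present."""
    return sum(WEIGHTS[name] for name, present in flags.items() if present)


def is_suspicious(flags, threshold=4.0):
    """Flag a file when its score reaches an assumed threshold."""
    return risk_score(flags) >= threshold


# Example: a PE file showing three of the eight characteristics.
sample = {name: False for name in WEIGHTS}
sample["tlsection"] = True
sample["flagged_section_name"] = True
sample["other_bad_pe_format"] = True
print(risk_score(sample), is_suspicious(sample))
```

In a real system the flags would come from a static PE parser; here they are set by hand to keep the sketch self-contained.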

Ref. ^[3] primarily focused on malware detection through the statistical use of opcodes. In their methodology, the frequency of opcodes appearing in malware and benign files is calculated first, and a statistics-based discrimination ratio is then computed, from which weights for opcode sequences are obtained. The similarity between two executables is then computed using the weights of the opcode sequences. A total of 13,189 malware executables were collected from the VxHeavens website, and 13,000 benign files were collected from the authors' own computer. A basic assembler is used to disassemble the executables. After the assembly file is obtained, a profile of opcode frequencies is maintained; this file contains the unnormalized frequency of opcodes appearing in both datasets. The relevance of each opcode is then calculated as the mutual information between the opcode and the classification class. Finally, malware opcode sequences are extracted and their frequency of appearance is calculated to detect maliciousness. After the weighted term frequency is calculated, a vector of weighted opcode sequence frequencies is obtained. In the experiments, opcode sequences of lengths 1 and 2 were first extracted and their similarity across the malware and benign datasets was calculated, but they appeared with almost the same frequency in both datasets; opcode sequences of lengths 1 and 2 were therefore combined to check the similarity of their appearance in both datasets. Malware variants show great similarity in the frequency of opcode sequences, whereas the similarity between the malware and benign datasets is low.
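
A rough sketch of the opcode-sequence comparison follows. It builds weighted frequency vectors over opcode sequences of lengths 1 and 2 and compares two files. Cosine similarity is used here as an assumed similarity measure, the default weight of 1.0 stands in for the relevance weights that ^[3] derives from mutual information, and the opcode lists are invented for illustration.

```python
import math
from collections import Counter


def opcode_ngrams(opcodes, n):
    """Return the list of opcode sequences of length n."""
    return [tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)]


def weighted_frequencies(opcodes, weights=None, max_n=2):
    """Weighted term frequency of opcode sequences of length 1..max_n."""
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(opcode_ngrams(opcodes, n))
    total = sum(counts.values()) or 1
    weights = weights or {}
    # Sequences without a known relevance weight default to 1.0 (assumption).
    return {seq: weights.get(seq, 1.0) * c / total for seq, c in counts.items()}


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


variant_a = ["push", "mov", "xor", "call", "ret"]
variant_b = ["push", "mov", "xor", "call", "nop", "ret"]
benign = ["add", "sub", "cmp", "jne", "ret"]

va, vb, vg = (weighted_frequencies(x) for x in (variant_a, variant_b, benign))
print(cosine_similarity(va, vb) > cosine_similarity(va, vg))
```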

One kind of malware is the botnet, which scans the internet for vulnerable hosts on which to perform various malicious activities. Botnets are normally coordinated through a Command-and-Control (C&C) channel; most control protocols are IRC based, although other protocols such as HTTP can also be used. Ref. ^[4] focused on detecting and confining DDoS attacks and port scans. The authors of ^[4] proposed a platform that detects malicious activity by monitoring communication between the botnet and the C&C and by monitoring traffic to detect and confine DDoS, along with detecting zombie computers on the network. As a result, they managed to filter botnet-related traffic, confine infected parts of the network, and find methods for disabling botnets. To collect malware, high- and low-interaction honeypots were used. The low-interaction honeypots used in the experiment were (1) Nepenthes and (2) Honeyd. After the malware was captured, it was analyzed manually: samples were identified using various anti-virus tools and sandboxed to collect useful information. A victim PC was then connected to the analysis workstation, and the traffic generated by the victim PC in a clean state was monitored with Wireshark running on the analysis workstation. Afterward, the victim PC was rebooted with malware installed on it, and events related to DNS requests, attempts to connect to unknown ports, and scanning of unknown ports were recorded. Dnsmasq, fakemta relay-Http, relay, and Wireshark were used as tools for different purposes. Because this methodology was cumbersome, MWNA (Malware Network Analyzer) was developed. It is based on the Linux Packet Filter mechanism. The published method for detecting DDoS analyzes packets during normal traffic, first to establish a baseline and then to derive thresholds, after which attack features are extracted. Finally, the method is combined with a rate-limiting scheme so that the amount of monitored traffic can be reduced.
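
The baseline-then-threshold step can be sketched as follows. The rule used here (mean plus k standard deviations of per-interval packet counts) is an assumption chosen for illustration; the text does not state how ^[4] derives its thresholds, and the traffic numbers are invented.

```python
import statistics


def derive_threshold(baseline_counts, k=3.0):
    """Threshold = mean + k * population standard deviation of normal traffic."""
    mean = statistics.mean(baseline_counts)
    std = statistics.pstdev(baseline_counts)
    return mean + k * std


def flag_intervals(observed_counts, threshold):
    """Return indices of intervals whose packet count exceeds the threshold."""
    return [i for i, count in enumerate(observed_counts) if count > threshold]


# Packets per interval observed during normal traffic (illustrative values).
baseline = [100, 110, 95, 105, 90, 100]
threshold = derive_threshold(baseline)

# Later observations: intervals 2 and 3 show a burst.
observed = [102, 98, 450, 500, 101]
print(flag_intervals(observed, threshold))
```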

Hybrid approaches are also used to benefit from a combination of malware detection methods. Ref. ^[5] aimed to combine the advantages of several malware detection techniques, which is why the framework implemented in ^[5] is hybrid. The framework's detection methodology uses API calls extracted from the suspected file by running it in a VM environment. A graph is then built from the API calls and the operating system resources being used: graph nodes represent API calls and operating system resources, and edges represent the references between nodes. The constructed graph is then minimized. Finally, to find a match between two graphs, the Graph Edit Distance algorithm is used together with a cost matrix.
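
The graph construction and comparison can be sketched in simplified form. The function below is not the full Graph Edit Distance algorithm with a cost matrix used in ^[5]: it assumes nodes with equal labels are matched to each other and only counts node and edge insertions and deletions with unit costs. The API and resource labels are illustrative.

```python
def build_graph(calls):
    """Build a graph from (api_call, resource) pairs observed at run time.

    Nodes are API calls and OS resources; an edge links a call to the
    resource it references.
    """
    nodes, edges = set(), set()
    for api, resource in calls:
        nodes.update((api, resource))
        edges.add((api, resource))
    return nodes, edges


def label_matched_edit_distance(g1, g2, node_cost=1, edge_cost=1):
    """Insertion/deletion cost when equally labeled nodes are matched."""
    (nodes1, edges1), (nodes2, edges2) = g1, g2
    return node_cost * len(nodes1 ^ nodes2) + edge_cost * len(edges1 ^ edges2)


known = build_graph([
    ("CreateFile", "file:a.exe"),
    ("WriteFile", "file:a.exe"),
    ("RegSetValue", "reg:Run"),
])
suspect = build_graph([
    ("CreateFile", "file:a.exe"),
    ("RegSetValue", "reg:Run"),
])
print(label_matched_edit_distance(known, suspect))
```

A small distance suggests the suspect graph is close to a known malicious graph; a full implementation would search over node mappings instead of fixing them by label.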

Ref. ^[6] developed a tool, PyTrigger, which provides the user actions required to trigger, collect, and distill malware behavior profiles. The paper makes three major contributions: an algorithm that extracts user-triggered malware behavior from among similar events, an event recording and playback system, and the full implementation of the PyTrigger system. PyTrigger has two major subsystems: (1) the recording and playback system and (2) the behavior analysis system. The recording and playback subsystem records the data state of all objects, such as window titles, mutable text field values, and drop-down menu choices; during replay these values are forcibly entered into the GUI to recreate the scenario that triggers the malware behavior. PyTrigger executes the malware sample several times in a VM and uses Event Tracing for Windows to trace the events. The system was evaluated on 4100 malware samples from 35 different malware families. The typical user activity that was recorded related to Gmail, Facebook, and Google HSBC, text editing, file browsing, and execution (Windows Explorer). An added advantage of this system is its ability to extract delegated events, that is, events that the malicious process delegates to other, legitimate processes that lie outside the malware process chain.

Ref. ^[7] concentrated on a solution for detecting malicious activity that is low cost and does not rely on any third-party software, so that detection can be done in less time and on a low budget. In addition, because some malware can evade a virtual environment, running malware in a virtual machine for dynamic analysis can miss some of the triggering scenarios. The authors transformed Windows audit logs into interpretable features and presented a linear classification model that detects malicious behavior with high accuracy using the Windows audit log as its feature set. This approach revealed some new malware behaviors. For validation, six different experiment sets were designed. One experiment used a dataset containing malware a year or two older than the malware in the training set. A second validation experiment was based on malware families. In addition, the same trained classifier was run in a virtual environment as well as in an enterprise environment to account for the environment as a variable. The experimental dataset consisted of 32,078 samples, of which 17,399 were benign and 14,679 were malicious. In total, 6,898,593 unique features were extracted, and 20,362 audit logs were collected from binaries executed in a Cuckoo sandbox.

Figure 1 shows the performance metrics used by the surveyed papers that fall into the category of statistics-based methods.

Figure 1. Performance Metrics Used in Literature Proposing Primitive Methods for Malware Detection.

3. Malware Detection with Conventional Machine Learning Based Methods

Machine learning plays an important role in capturing helpful properties of malware to advance security measures. This process of knowledge extraction and pattern learning led researchers toward machine learning-based malware analysis and detection. Machine learning has been used extensively not only in malware detection but also for detecting malicious activity in network traffic ^[8].

Ref. ^[9] worked on belief propagation with file system features but did not perform well on new samples. Ref. ^[10] conducted malicious graph matching on extracted APIs/system calls but used a small dataset. Ref. ^[11] used a rule-based classifier and SVM and performed detection based on byte sequences, but evaluated the model on only specific malware classes; the datasets were built from Windows system files and the Anti-Virus Platform. Ref. ^[12] also used a rule-based classifier with extracted APIs/system calls, but the categorization of these APIs/system calls was not up to the mark; the tests were conducted on features of the Windows XP system and Program Files folders. The authors of ^[13]^[14] used Random Forest with network and API system calls, the registry, and the file system, but the dataset was small. Ref. ^[15] used Decision Trees, and ^[16] used Naïve Bayes, Random Forest, and SVM on byte sequences, APIs/system calls, file systems, and the Windows registry. Ref. ^[17] used KNN for detecting malicious PEs. Malware code damages resources, and with a small code change malware developers can easily defeat the protection layer, so a great deal of research has addressed the detection of such variants. Ref. ^[18] explored Decision Tree and Random Forest classifiers using opcodes; the authors used small datasets from the Windows XP system and Program Files folders and generated malware code to form part of the dataset. Ref. ^[19] performed clustering with locality-sensitive hashing on byte sequences, but the dataset was very small. Ref. ^[20] used a rule-based classifier on APIs/system calls and the Windows registry. Ref. ^[21] used clustering, which earlier researchers had also applied to variant detection; the authors chose DBSCAN, but their approach did not cope with malware evasion techniques. Ref. ^[22] used Logistic Regression and Neural Networks on byte sequences and APIs/system calls.

Table 1 shows the datasets and performance metrics used by the researchers in the surveyed papers that apply conventional machine learning algorithms.

Table 1. Datasets and Performance Metrics Used in Literature Proposing Machine Learning Methods for Malware Detection.

Title	Author	Source	Malicious Samples	Benign Samples	Performance Metrics Used
Support Vector Machine for malware analysis and classification	M. Kruczkowski, E. N. Szynkiewicz	N6 Platform	-	-	Classification Accuracy = 0.9498 Sensitivity = 0.9774 Specificity = 0.8971 AUC = 0.9901 F1 = 0.9623 Precision = 0.9475
Improving the detection of malware behavior using simplified data dependent API call graph	E. Elhadi, M. A. Maarof, B. Barry	VxHeavens	75	10	Detection Rate = 98.6% Accuracy = 98.8% False Alarm = 0%
Dynamic VSA: a framework for malware detection based on register contents	M. Ghiasi, A. Sami, Z. Salehi	Windows XP system, Program Files Folder, and Private Repository	850	390	TP = 0.988 FP = 0.125 Recall = 0.988 Precision = 0.888 F-Measure = 0.940 Accuracy = 0.930
Novel feature extraction, selection, and fusion for effective malware family classification	M. Ahmadi, G. Giacinto, D. Ulyanov, S. Semenov, M. Trofimov	Microsoft’s Malware classification challenge	21,741	0	Accuracy, Logloss
Probabilistic inference on integrity for access behavior based malware detection	W. Mao, Z. Cai, D. Towsley, X. Guan	Windows XP SP3 VxHeavens	7257	534	TPR, AUC
Robust and effective malware detection through quantitative data flow graph metrics	T. Wüchner, M. Ochoa, A. Pretschner	Legitimate app downloads Malicia	6994	513	Detection Rate, FPR, Precision, F-Measure
An alternative to NCD for large sequences, Lempel Ziv Jaccard distance	E. Raff, C. Nicholas	Industry Partner	237,349	240,000	Balanced Accuracy
Proposing a HMM-based approach to detect metamorphic malware	M. Gharacheh, V. Derhami, S. Hashemi, S. M. H. Fard	Cygwin VxHeavens	-	-	Detection Rate = 0.9803 FPR = 0.0058 Accuracy = 0.9833
Heuristic metamorphic malware detection based on statistics of assembly instructions using classification algorithms	P. Khodamoradi, M. Fazlali, F. Mardukhi, M. Nosrati	Windows XP system and Program Files folder Self-generated metamorphic malware	280	550	Accuracy
A malware similarity testing framework	J. Upchurch, X. Zhou	Sampled from security incidents	85	0	PR Curve
A behavior based malware variant classification technique	G. Liang, J. Pang, C. Dai	Anubis Website	330,248	0	Similarity measure
Scaling Malware Execution with Sequential Multi Hypothesis Testing	P. Vadrevu, R. Perdisci	Security Company and Large Research Institute	1,651,906	0	Jaccard Index
Fast malware classification by automated behavioral graph matching	Y. Park, D. Reeves, V. Mulukutla, B. Sundaravel	Legitimate apps Anubis Sandbox	300	80	Similarity measurement
Automated malware classification based on network behavior	S. Nari, A. A. Ghorbani	Communication Research Centre Canada	3768	0	Accuracy = 94.5783%
Malware function classification using APIs in initial behavior	N. Kawaguchi, K. Omote	FFRI Inc.	408	236	Accuracy, FPR, FNR
Feature selection and extraction for malware classification	C.-T. Lin, N.-J. Wang, H. Xiao, C. Eckert	Sandbox	3899	389	Micro Precision, Micro Recall, Micro Specificity, Macro Precision, Macro Recall, Macro F1
High fidelity, behavior based automated malware analysis and classification	A. Mohaisen, O. Alrawi, M. Mohaisen	AMAL system	115,157	0
Clustering for malware classification	S. Pai, F. Di Troia, C. A. Visaggio, T. H. Austin, M. Stamp	Cygwin utility files and Malicia	8052	213	Silhouette coefficient, purity
Jackdaw: Towards Automatic Reverse Engineering of Large Datasets of Binaries	M. Polino, A. Scorti, F. Maggi, S. Zanero	-	-	-	Jaccard Index
Subroutine based detection of APT malware	J. Sexton, C. Storlie, B. Anderson	-	197	4622	Similarity index
A static signal processing based malware triage	D. Kirat, L. Nataraj, G. Vigna, B. Manjunath	Windows XP, ZDNet, NSRL, Anubis	1,200,000	52,750	Precision and Recall

4. Malware Detection with Deep Learning Based Methods

Deep learning is a specialized form of machine learning within Artificial Intelligence (AI) that applies deep artificial neural networks, also known as deep neural networks. These machine learning techniques simulate the learning process of the human brain. The cells of the human brain correspond to the neurons of a neural network. In the human brain, cells are connected through axons and dendrites, and the connection regions are known as synapses. In an Artificial Neural Network (ANN), the corresponding connections carry weights and so behave like the connections between nerve cells in the human brain.

The major difference between conventional neural networks and deep neural networks is the number of layers. Deep neural networks use many hidden layers for high-level abstraction of data, and they can learn the features of the data themselves. This feature learning is driven by a large number of examples fed to the deep learning algorithm, which, having learned the most suitable features, produces results in the form of classification, identification, or generation of data. The major motivation for using deep learning in various fields was to organize and analyze large amounts of data. Areas where deep networks are preferred include image processing, speech processing, healthcare, and, with the growth of cyberspace, cybersecurity.

Depending on its features, this domain can be further categorized into the sub-domains shown in Figure 2. All features of PE files hold some significance in defining the degree of maliciousness of a particular file; features from the header and the imports all play a significant role in determining whether a PE file is malicious or benign. Ref. ^[23] used LSTM to select the optimal features of PEs. These optimal features were then used to train a deep learning-based model for detecting malicious PE files.

Figure 2. Types of Deep Learning.

Refs. ^[24]^[25] used sequential dynamic data and claimed that an ensemble of recurrent neural networks can detect the maliciousness of an executable within the first 4 s of execution with almost 93% accuracy. GRUs (Gated Recurrent Units) were used with the RNN to reduce training time. The features were user CPU usage, system CPU usage, sent packet count, received byte count, total bytes sent, the number of processes being executed, the maximum number of processes being carried out, the number of milliseconds elapsed since the file started to run, and the maximum process ID assigned.

Ref. ^[26] combined two types of neural network layers, convolutional and recurrent, to model system call sequences for malware classification. The two layer types model sequential data differently: convolutional networks treat sequences as a set of n-grams, whereas recurrent networks train a stateful model using the full sequential information. The input of the system was 60 distinct system calls.
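
To make the n-gram view of a system call sequence concrete, the sketch below counts every contiguous window of n calls in a trace. It only illustrates what "a set of n-grams" means; it does not train a convolutional network, and the system call names and window size are illustrative assumptions.

```python
from collections import Counter


def syscall_ngrams(trace, n=3):
    """Count every contiguous run of n system calls in a trace."""
    return Counter(tuple(trace[i:i + n]) for i in range(len(trace) - n + 1))


# Illustrative trace: the same open/read/close pattern occurs twice.
trace = [
    "NtOpenFile", "NtReadFile", "NtClose",
    "NtOpenFile", "NtReadFile", "NtClose",
]
counts = syscall_ngrams(trace, n=3)
print(counts[("NtOpenFile", "NtReadFile", "NtClose")])
```

A convolutional filter of width n slides over the same windows, whereas a recurrent layer would consume the trace one call at a time while carrying state.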

Ref. ^[27] performed malware detection using stacked AutoEncoders (SAEs) with Windows API calls mined from PE files as input. The SAE model used greedy layer-wise training for unsupervised feature learning, followed by supervised parameter fine-tuning. The results showed that the model with 3 hidden layers and 100 neurons in each layer gave the best training and testing accuracy compared with ANN, SVM, Naïve Bayes, and Decision Tree.

Ref. ^[28] implemented a method that works directly on raw inputs to detect maliciousness. The model, called eXpose, takes generic short strings from security inputs, including malicious URLs, mutexes, and registry keys, and learns to identify their maliciousness. eXpose uses convolutional neural network kernels for feature extraction. The architecture is composed of three notional components: character embedding, feature detection, and a classifier. The results showed that eXpose outperformed manual feature extraction approaches, attaining a 5–10% detection rate gain at a 0.1% false-positive rate compared to these baselines.

The model proposed in ^[29] comprises OpCode-Sequence Graph Generation, Deep Eigenspace Learning, and Feature Selection phases for the detection of Internet of Battlefield Things (IoBT) malware. Ref. ^[29] used a convolutional network for the deep learning module because it can give more accurate classification results when the data patterns are complex and nonlinear. This approach achieved 99% accuracy and 98% recall.

Ref. ^[30] addressed the detection of malware variants with deep learning methods. The authors published a method that transforms malicious code into a grayscale image. The images are then recognized and classified by a Convolutional Neural Network (CNN), which extracts the features of the malware images automatically. The implemented CNN was composed of an input layer, convolutional layers, and subsampling layers. The model also classified malware into related malware families.
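
One common way to turn code into a grayscale image is to treat each byte as a pixel intensity and wrap the byte stream into fixed-width rows. The sketch below shows that idea under stated assumptions: the row width and zero padding are arbitrary choices, and the text does not specify how ^[30] sizes its images.

```python
def bytes_to_grayscale(data, width=16):
    """Arrange raw bytes into rows of `width` pixels (0-255 intensities).

    The final row is zero-padded so that the image is rectangular.
    """
    padding = (-len(data)) % width
    padded = data + bytes(padding)
    return [list(padded[i:i + width]) for i in range(0, len(padded), width)]


# 35 illustrative bytes starting with the "MZ" magic of a PE file.
image = bytes_to_grayscale(b"MZ\x90\x00\x03" + bytes(range(30)), width=8)
print(len(image), len(image[0]), image[0][:2])
```

The resulting list of rows can be handed to any image or array library before being fed to a CNN.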

Ref. ^[31] converted disassembled malware code into a greyscale image using SimHash and then used a Convolutional Neural Network to identify the malware family. The methodology comprises three phases: feature extraction, malware image generation, and CNN training. The authors obtained an accuracy of approximately 99% with 10,805 samples.
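
SimHash itself is a generic locality-sensitive fingerprint, and a minimal version is sketched below over opcode tokens. The choice of SHA-256 as the per-token hash, the 64-bit length, and the mapping of fingerprint bits to black or white pixels are all assumptions made for illustration, not details taken from ^[31].

```python
import hashlib


def simhash(tokens, bits=64):
    """Compute a SimHash fingerprint from a sequence of tokens."""
    totals = [0] * bits
    for token in tokens:
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(bits):
            totals[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if totals[i] > 0)


def hamming_distance(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def fingerprint_to_pixels(fingerprint, bits=64):
    """Map each bit to a pixel: 1 -> 255 (white), 0 -> 0 (black)."""
    return [255 if (fingerprint >> i) & 1 else 0 for i in range(bits)]


ops = ["push", "mov", "xor", "call", "ret"]
fp = simhash(ops)
print(hamming_distance(fp, simhash(ops + ["nop"])))
```

Because each token only shifts the per-bit vote totals, similar token sequences tend to yield fingerprints with a small Hamming distance.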

Ref. ^[32] focused on nation-state APTs using a Deep Neural Network (DNN). The researchers exploited the ability of DNNs to take raw features as input, with higher-level features learned during training. In this progression, every hidden layer extracts higher-level features from the preceding layer, building a hierarchy of higher-level features.

Ref. ^[33] devised an approach that uses a neural network composed of convolutional and feed-forward constructs for malware classification. The approach used three feature categories: PE file metadata, import features, and assembly opcode features.

Ref. ^[34] used a dynamic analysis approach based on Windows API call graphs and SAE models. The paper developed a Behavior-based Deep Learning Framework (BDLF), which uses SAEs for feature reduction from behavior graphs and then performs classification with Decision Tree, KNN, Naïve Bayes, and SVM.

Ref. ^[35] focused on malware detection based on process behavior in potentially infected terminals. The published solution applies DNNs in two stages: in the first stage, an RNN extracts process activities and converts them into feature vectors; in the second, the feature vectors are treated as images and classified by a CNN.

Ref. ^[36] worked on a new image processing technique with optimized parameters for machine learning algorithms and deep learning architectures to produce an efficient zero-day malware detection system. First, malware detection was performed using deep learning based on static analysis of the EMBER dataset and privately collected samples, and it was deduced that detection performance can be marginally enhanced by a hybrid pipeline, proposed as Windows-Static-Brain-Droid (WSBD), composed of both classical machine learning algorithms and deep learning models. In the next stage, malware detection was performed using deep learning based on dynamic analysis; a comparison between classical machine learning algorithms and deep learning architectures showed that the deep learning architectures performed best in all experiments. Experiments were then conducted on categorizing malware into families using deep learning based on image processing. A novel technique, DeepImageMalDetect (DIMD), was proposed, which is based on image processing and uses CNN and LSTM. The proposed method can work on malware from different operating systems. Finally, an architecture named ScaleMalNet was developed. It collects data from different sources and uses self-learning techniques such as classical machine learning algorithms, deep learning architectures, and image processing techniques to detect, classify, and categorize malware into the corresponding malware family efficiently.

The authors of ^[37] proposed a new technique for generating a malware signature that does not depend on any specific behavior of the malware, so that it can also be used for malware variants. To achieve this, the researchers first recorded the behavior of the malware in a sandbox and then converted the output text file into a binary vector. From the binary vectors, a Deep Belief Network was trained, implemented as a deep stack of denoising autoencoders.
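
The conversion of a sandbox text report into a binary vector can be sketched as a presence/absence encoding over a fixed vocabulary. Whitespace tokenization, choosing the most frequent tokens as the vocabulary, and the tiny vocabulary size are assumptions for illustration; the text does not describe how ^[37] builds its vector.

```python
from collections import Counter


def build_vocabulary(reports, size):
    """Pick the `size` tokens that occur in the most sandbox reports."""
    counts = Counter(tok for report in reports for tok in set(report.split()))
    # Sort by descending count, then alphabetically for a stable order.
    ranked = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return ranked[:size]


def to_binary_vector(report, vocabulary):
    """1 if the vocabulary token appears in the report, else 0."""
    present = set(report.split())
    return [1 if tok in present else 0 for tok in vocabulary]


# Illustrative sandbox reports, one string per executed sample.
reports = [
    "CreateFile WriteFile RegSetValue",
    "CreateFile Connect Send",
    "CreateFile WriteFile Connect",
]
vocab = build_vocabulary(reports, size=4)
print(vocab, to_binary_vector(reports[1], vocab))
```

Vectors of this kind have a fixed length regardless of report size, which is what a stack of autoencoders needs as input.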

Ref. ^[38] focused on a technique that uses a Deep Neural Network for malware detection from statically extracted features, aiming for higher accuracy and a minimal FPR. The framework defined in the paper has three main components: (1) the first component extracts four features from benign and malicious binaries; (2) the second component is a Deep Neural Network consisting of an input layer, two hidden layers, and one output layer; and (3) the third component is the score calibrator.

Ref. ^[39] focused on one-shot learning, which applies when there are very few samples to learn from. It implements a model, LRUA-MANN, which modifies the memory access capability of a Neural Turing Machine to suit a one-shot learning task. LRUA-MANN is used with an LSTM as the controller and uses the LSTM state and a memory bank as memory.

Ref. ^[40] focused on performing malware detection without in-depth knowledge of malware and its analysis. Two neural networks were used: one fully connected and one recurrent. The model had 3 LSTM layers with attention mechanisms before classification. Saxe et al. used neural networks on extracted strings and PE file characteristics, but the approach did not cope with obfuscation and did not achieve good accuracy in such situations.

Ref. ^[41] implemented a multitask learning model trained on seven classification tasks for malware image classification. The model implemented in ^[41] consisted of 5 CNN layers with the PReLU activation function.

Ref. ^[42] explored the advantages of transfer learning in the domain of malware identification. The research used transfer learning to extract the features of a malware dataset: a deep learning model pre-trained on ImageNet was used, and the malware was finally classified into its respective families.

Figure 3 summarizes the types of deep learning algorithms used by researchers over the years and Table 2 summarizes the performance metrics used by researchers while using deep learning based methods for malware detection.

Figure 3. Deep Learning Techniques Used for Malware Detection.

Table 2. Datasets and Performance Metrics Used in Literature Proposing Deep Learning Methods for Malware Detection.

Title	Author	Year	Source		Malicious Samples		Benign Samples	Performance Metrics
Early Stage Malware Prediction Using Recurrent Neural Networks	Rhode, Matilda, et al.	2018	Machine Activity collected in VM using Cuckoo Sandbox		594		594	Accuracy = 93% (After 4 s of malware execution)
DL4MD: A Deep Learning Framework for Intelligent Malware Detection	Hardy, William, et al.	2016	Comodo Cloud Security Centre		22,500		22,500	TP = 22,035 FP = 953 TN = 21,547 FN = 465 Accuracy = 96.85%
eXpose: A Character Level Convolutional Neural Network with Embeddings for Detecting Malicious URLs, File Paths and Registry Key	Saxe, Joshua, and Konstantin Berlin.	2017	VirusTotal		URLs	7,211,705	1,496,198	TPR = 0.77 at FPR = 10⁻⁴, TPR = 0.84 at FPR = 10⁻³, AUC = 0.993
					File Paths	869,836	3,677,404	TPR = 0.16 at FPR = 10⁻⁴, TPR = 0.43 at FPR = 10⁻³, AUC = 0.978
					Registry Keys	250,819	1,282,292	TPR = 0.51 at FPR = 10⁻⁴, TPR = 0.62 at FPR = 10⁻³, AUC = 0.992
Robust Malware Detection for the Internet of (Battlefield) Things Devices Using Deep Eigenspace Learning	Azmoodeh, Amin, Ali Dehghantanha, and Kim Kwang Raymond Choo.	2018	VirusTotal		1078		128	Accuracy = 99% Recall = 98%
Detection of Malicious Code Variants Based on Deep Learning	Cui, Zhihua, et al.	2018	Vision Research Lab		9342 (25 Malware Families)		-	Accuracy = 94.5 Precision = 94.6 Recall = 94.5 Runtime = 20 ms
Malware Identification Using visualization images and deep learning	Ni, Sang, Quan Qian, and Rui Zhang	2018	Kaggle 2015		10,805 (9 Malware Families)		-	Accuracy = 99%
End-to-End Deep Neural Networks and Transfer Learning for Automatic Analysis of Nation State Malware	Rosenberg, Ishai, Guillaume Sicard, and Eli David.	2018	Cuckoo Sandbox		3200 (2 APT classes)		-	Accuracy = 98.6%
Empowering Convolutional Networks for Malware Classification and Analysis	Kolosnjaji, Bojan, et al.	2017	VirusShare, Maltrieve, Private Collection		-		-	Precision = 0.93 Recall = 0.93 F-1 Score = 0.92
Malware Detection Based on Deep Learning of Behavior Graphs	Fei Xiao et al.	2019	Vx Heaven		880		880	Precision = 0.986 Recall = 0.992 F-1 Score = 0.989
Deep Learning for Classification of Malware System Call Sequences	Bojan et al.	2016	VirusShare, Maltrieve, Private Collection		4753		-	Precision = 85.6% Recall = 89.4%
Malware Detection with Deep Neural Network Using Process Behavior	Shun Tobiyama et al.	2016	NTT Secure Platform Laboratory		81		69	AUC = 0.96
Robust Intelligent Malware Detection Using Deep Learning	R. Vinayakumar et al.	2018	WSBD	EMBER	70,140		69,860	Accuracy = 98.9% Precision = 99.7% Recall = 98.1% F-1 score = 98.9%
			WDBD	Cuckoo Sandbox	173,946		169,509	Accuracy = 93.6% Precision = 94.8% Recall = 92.0% F-1 Score = 93.4%
			DIMD	Malimg, VirusSign, VirusShare	24,851		-	Accuracy = 96.3%
Deep Neural Network Based Malware Detection Using Two Dimensional Binary Program Features	Joshua et al.	2015			81,910		350,016	TPR = 95.2% AUC = 0.999
Learning the PE Header, Malware Detection With Minimal Domain Knowledge	Edward Raff, Jared Sylvester, Charles Nicholas	2017	Group A	VirusShare	301,575		291,285	Accuracy = 90.8% AUC = 97.7%
	Edward Raff, Jared Sylvester, Charles Nicholas	2017	Group B	Industry Partner	240,000		237,349	Accuracy = 83.7% AUC = 91.4%
One Shot Learning Approach for Unknown Malware Classification	T.K. Tran, Hiroshi Sato, Masao Kubo	2018	Malicia Project, VirusTotal				23,080	Accuracy (with training) = 0.74 Accuracy (without training) = 0.85
DTMIC: Deep transfer learning for malware image classification	Sanjeev Kumar, B. Janet	2022	MalImg and MS BIG dataset		9339 + 10,868			Accuracy on MalImg = 98.92% Accuracy on BIG dataset = 93.19%
Deep multitask learning for malware image classification	Ahmed Bensaoud, Jugal Kalita	2022	VirusShare, VirusTotal, Contagio					Accuracy = 99.97% TPR = 99.98% FPR = 0.73%
DTMIC: Deep transfer learning for malware image classification	Sanjeev Kumar, B. Janet	2022	MalImg and Microsoft		9339 + 21,741			Accuracy = 98.92% Precision = 99% Recall = 99%
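
Several rows of the table report precision, recall, and F-1 score together. Because the F-1 score is the harmonic mean of precision and recall, the reported values can be cross-checked. The sketch below is only an illustration: the helper name `f1_score` is an assumption made here, not something taken from the surveyed papers, and the inputs are copied from two rows of the table.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall copied from the table rows above.
behavior_graphs_f1 = f1_score(0.986, 0.992)  # behavior-graph row, reported F-1 = 0.989
wsbd_f1 = f1_score(99.7, 98.1)               # WSBD row, reported F-1 = 98.9%

print(round(behavior_graphs_f1, 3), round(wsbd_f1, 1))
```

Both recomputed values agree with the reported F-1 scores after rounding to the precision used in the table.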

5. Meta Learning Based Detection

A critical analysis of the surveyed papers that implemented deep learning algorithms highlights the need for large datasets to produce reliable results. Deep learning architectures rely heavily on supervised learning, which requires a large number of labeled examples to train the model, as mentioned by ^[43]. With a small dataset, the model cannot learn the features properly during training, which leads to unreliable results. The survey also showed that such a dataset must contain a large number of examples for each class that the trained model is expected to identify. Processing this bulk of data requires powerful hardware, high computational capacity, and long training times, which reduces the chance of applying the trained models to real-time data. Because of these constraints, artificially intelligent systems have not succeeded in replacing signature-based anti-malware systems in the market.

Researchers therefore shifted from developing deep models for feature learning to exploring models that can work with small datasets. In pursuit of this objective, they explored Few Shot Learning (FSL), which is based on meta learning and focuses on learning how to learn the meaningful properties of data. Meta learning draws on transfer learning (multi-task learning) and on semi-supervised or unsupervised learning approaches, which need only a few examples for training. Thus, according to ^[44], a meta learning model can be trained with the help of prior knowledge. Meta learning based algorithms used in malware analysis include Few Shot Learning (FSL), One Shot Learning (OSL), and Zero Shot Learning (ZSL).

Figure 4 shows the relationship between machine learning and meta learning models. The major advantages of meta learning based algorithms are listed in Figure 5.

Figure 4. Relationship Between Machine Learning and Meta Learning.

Figure 5. Advantages of Meta Learning.

Ref. ^[45] explored the Siamese network for malware image classification. The Siamese network architecture is an application of one shot learning. The basic approach used by ^[45] was to transform the features into malware images, which were input to the Siamese Convolutional Neural Networks shown in Figure 6. The Siamese CNNs used by ^[45] produce two feature vectors. Finally, the Manhattan distance between those feature vectors was calculated and passed to a sigmoid function to generate the similarity score.

Figure 6. Siamese CNN used by ^[45].
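
The final scoring step described above, a Manhattan distance between two feature vectors followed by a sigmoid, can be sketched in a few lines. This is a minimal illustration, not the implementation of the surveyed paper: the function name `similarity_score`, the per-component `weights`, and the `bias` are assumptions (in a trained Siamese network such parameters would be learned), and the feature vectors are made-up numbers standing in for the outputs of the twin CNNs.

```python
import math
from typing import Sequence


def similarity_score(feat_a: Sequence[float], feat_b: Sequence[float],
                     weights: Sequence[float], bias: float) -> float:
    """Weighted Manhattan (L1) distance between two embeddings, squashed by a sigmoid."""
    if not (len(feat_a) == len(feat_b) == len(weights)):
        raise ValueError("feature vectors and weights must have the same length")
    # Component-wise absolute differences; their sum is the Manhattan distance.
    l1_components = [abs(a - b) for a, b in zip(feat_a, feat_b)]
    z = sum(w * d for w, d in zip(weights, l1_components)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical embeddings and parameters, chosen only for illustration.
weights = [-2.0, -2.0, -2.0]  # negative weights: larger distance gives a lower score
bias = 3.0
same_family = similarity_score([0.2, 0.9, 0.4], [0.2, 0.9, 0.4], weights, bias)
different_family = similarity_score([0.2, 0.9, 0.4], [0.9, 0.1, 0.8], weights, bias)

print(same_family > different_family)
```

With these illustrative parameters, identical embeddings score close to 1 and distant embeddings score below 0.5, which is the behavior a similarity score is expected to have.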

Another surveyed paper, ^[39], used a one shot learning approach with a memory augmented neural network operating on API call sequences. Ref. ^[39] adopted an approach with two learning domains. The first domain is used to train the model with known malware, and the second domain is used to train or test with a dataset of unknown malware types. Domain 2 makes use of the model trained in domain 1. The workflow of the approach implemented in ^[39] is shown in Figure 7.

Figure 7. Proposed Approach of ^[39]. Two Phases of Training and Testing, Domain 1 and Domain 2.

Ref. ^[46] explored a one shot learning approach with matching and prototypical networks. The model developed by ^[46] is shown in Figure 8. Ref. ^[46] takes advantage of the visual dissimilarity between images of different malware families (shown in Figure 9) and converts the malware binaries into 8-bit greyscale images that are given as input to the few shot learning models.

Figure 8. Proposed Approach in ^[46].

Figure 9. Visual Samples Showing Dissimilarity Between the Images of Different Families ^[46].
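
The conversion of a malware binary into an 8-bit greyscale image can be sketched as follows: each byte (0 to 255) becomes one pixel intensity, and the byte stream is wrapped into rows of a fixed width. This is a minimal sketch under stated assumptions. The function name `bytes_to_grayscale`, the row width, and the zero-padding of the last row are choices made here for illustration; the surveyed paper's exact image dimensions are not given in this entry.

```python
from typing import List


def bytes_to_grayscale(data: bytes, width: int = 256) -> List[List[int]]:
    """Wrap a byte stream into rows of `width` pixels; each byte is one 8-bit grey value."""
    if width <= 0:
        raise ValueError("width must be positive")
    rows = []
    for start in range(0, len(data), width):
        row = list(data[start:start + width])
        # Assumption: pad the final, shorter row with black (0) pixels.
        row.extend([0] * (width - len(row)))
        rows.append(row)
    return rows


# Ten placeholder bytes stand in for the contents of a binary file.
sample = bytes(range(10))
image = bytes_to_grayscale(sample, width=4)

for row in image:
    print(row)
```

In practice the bytes would be read from the binary file, and the resulting rows could be handed to an imaging library to be saved or resized before being passed to the few shot learning model.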

Ref. ^[47] presents ConvProtoNet, a few shot learning based neural network. ConvProtoNet in ^[47] uses stacked convolutional layers, rather than only computing means, to generate the features of malware classes. ConvProtoNet can be trained on one dataset and tested on another.
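
For contrast, the mean-based class representation that ConvProtoNet moves beyond can be sketched as follows. This illustrates the plain prototypical-network baseline only, not ConvProtoNet itself: each class prototype is the mean of its few support embeddings, and a query is assigned to the nearest prototype. The function names, the family labels, and the two-dimensional embeddings are hypothetical; a real model would obtain the embeddings from a trained network.

```python
from typing import Dict, List, Sequence


def class_prototypes(support: Dict[str, List[Sequence[float]]]) -> Dict[str, List[float]]:
    """Mean embedding per class, computed from a handful of support examples."""
    prototypes = {}
    for label, embeddings in support.items():
        dim = len(embeddings[0])
        prototypes[label] = [
            sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)
        ]
    return prototypes


def classify(query: Sequence[float], prototypes: Dict[str, List[float]]) -> str:
    """Return the label whose prototype is closest in squared Euclidean distance."""
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(prototypes, key=lambda label: sq_dist(query, prototypes[label]))


# Two hypothetical malware families with two support embeddings each (2-shot).
support = {
    "family_a": [[0.9, 0.1], [1.1, -0.1]],
    "family_b": [[-1.0, 1.0], [-0.8, 1.2]],
}
prototypes = class_prototypes(support)
predicted = classify([0.8, 0.0], prototypes)

print(predicted)
```

Replacing the simple mean in `class_prototypes` with a learned, convolution-based aggregation is the kind of change the text attributes to ConvProtoNet.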

Ref. ^[48] composed a dataset of splash screen images that display the message shown when a system is attacked by ransomware. They trained their one shot learning model on a dataset of splash screen images from 50 ransomware families. Ref. ^[48] used different augmentation techniques to adapt the images for one shot learning.

References

Christodorescu, M.; Jha, S. Static analysis of executables to detect malicious patterns. In Proceedings of the 12th USENIX Security Symposium (USENIX Security 03), Washington, DC, USA, 4–8 August 2003.
Santos, I. Idea: Opcode-sequence-based malware detection. In Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Springer: Berlin/Heidelberg, Germany, 2010; Volume 5965.
Sabbatel, G.B.; Korczynski, M.; Duda, A. Architecture of a Platform for Malware Analysis and Confinement. In Proceedings of the Proceeding MCSS 2010: Multimedia Communications, Services and Security, Cracow, Poland, 2–3 June 2011.
Elhadi, A.A.E.; Maarof, M.A.; Osman, A.H. Malware detection based on hybrid signature behavior application programming interface call graph. Am. J. Appl. Sci. 2012, 9, 283–288.
Fleck, D.; Tokhtabayev, A.; Alarif, A.; Stavrou, A.; Nykodym, T. PyTrigger: A system to trigger & extract user-activated malware behavior. In Proceedings of the 2013 International Conference on Availability, Reliability and Security, Regensburg, Germany, 2–6 September 2013.
Berlin, K.; Slater, D.; Saxe, J. Malicious behavior detection using windows audit logs. In Proceedings of the 8th ACM Workshop on Artificial Intelligence and Security, Denver, CO, USA, 16 October 2015.
Kumar, G.; Thakur, K.; Ayyagari, M.R. MLEsIDSs: Machine learning-based ensembles for intrusion detection systems—A review. J. Supercomput. 2020, 76, 8938–8971.
Chen, L.; Li, T.; Abdulhayoglu, M.; Ye, Y. Intelligent malware detection based on file relation graphs. In Proceedings of the 2015 IEEE 9th International Conference on Semantic Computing (IEEE ICSC 2015), Anaheim, CA, USA, 7–9 February 2015.
Elhadi, A.A.E.; Maarof, M.A.; Barry, B.I.A. Improving the detection of malware behaviour using simplified data dependent API call graph. Int. J. Secur. Its Appl. 2013, 7, 29–42.
Feng, Z.; Xiong, S.; Cao, D.; Deng, X.; Wang, X.; Yang, Y.; Zhou, X.; Huang, Y.; Wu, G. HRS: A Hybrid Framework for Malware Detection. In Proceedings of the 2015 ACM International Workshop on Security and Privacy Analytics, San Antonio, TX, USA, 4 March 2015.
Ghiasi, M.; Sami, A.; Salehi, Z. Dynamic VSA: A framework for malware detection based on register contents. Eng. Appl. Artif. Intell. 2015, 44, 111–122.
Kwon, B.J.; Dumitras, T. The Dropper Effect: Insights into Malware Distribution with Downloader Graph Analytics Categories and Subject Descriptors. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security (CCS’15), Denver, CO, USA, 12–16 October 2015.
Mao, W.; Cai, Z.; Towsley, D.; Guan, X. Probabilistic inference on integrity for access behavior based malware detection. In Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Springer: Cham, Switzerland, 2015; Volume 9404.
Piyanuntcharatsr, S.S.W.; Adulkasem, S.; Chantrapornchai, C. On the comparison of malware detection methods using data mining with two feature sets. Int. J. Secur. Its Appl. 2015, 9, 293–318.
Wüchner, T.; Ochoa, M.; Pretschner, A. Robust and effective malware detection through quantitative data flow graph metrics. In Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Springer: Cham, Switzerland, 2015; Volume 9148.
Raff, E.; Nicholas, C. An alternative to NCD for large sequences, Lempel-Ziv Jaccard distance. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, 13–17 August 2017; Volume 129685.
Khodamoradi, P.; Fazlali, M.; Mardukhi, F.; Nosrati, M. Heuristic metamorphic malware detection based on statistics of assembly instructions using classification algorithms. In Proceedings of the 18th CSI International Symposium on Computer Architecture and Digital Systems, (CADS 2015), Tehran, Iran, 7–8 October 2015.
Upchurch, J.; Zhou, X. Variant: A malware similarity testing framework. In Proceedings of the 2015 10th International Conference on Malicious and Unwanted Software (MALWARE), Fajardo, PR, USA, 20–22 October 2015.
Liang, G.; Pang, J.; Dai, C. A Behavior-Based Malware Variant Classification Technique. Int. J. Inf. Educ. Technol. 2016, 6, 291.
Vadrevu, P.; Perdisci, R. MAXS: Scaling malware execution with sequential multi-hypothesis testing. In Proceedings of the 11th ACM on Asia Conference on Computer and Communications Security, Xi’an, China, 30 May–3 June 2016.
Dahl, G.E.; Stokes, J.W.; Deng, L.; Yu, D. Large-scale malware classification using random projections and neural networks. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 26–31 May 2013.
Ravi, V.; Alazab, M.; Selvaganapathy, S.; Chaganti, R. A Multi-View attention-based deep learning framework for malware detection in smart healthcare systems. Comput. Commun. 2022, 195, 73–81.
Rama, K.; Kumar, P.; Bhasker, B. Deep Learning to Address Candidate Generation and Cold Start Challenges in Recommender Systems: A Research Survey. arXiv 2019, arXiv:1907.08674.
Rhode, M.; Burnap, P.; Jones, K. Early-stage malware prediction using recurrent neural networks. Comput. Secur. 2018, 77, 578–594.
Kolosnjaji, B.; Zarras, A.; Webster, G.; Eckert, C. Deep learning for classification of malware system call sequences. In Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Springer: Cham, Switzerland, 2016; Volume 9992.
Hardy, W.; Chen, L.; Hou, S.; Ye, Y.; Li, X. DL4MD: A Deep Learning Framework for Intelligent Malware Detection; CSREA Press: Las Vegas, NV, USA, 2016; pp. 61–67.
Saxe, J.; Berlin, K. eXpose: A Character-Level Convolutional Neural Network with Embeddings For Detecting Malicious URLs, File Paths and Registry Keys. arXiv 2017, arXiv:1702.08568.
Azmoodeh, A.; Dehghantanha, A.; Choo, K.K.R. Robust Malware Detection for Internet of (Battlefield) Things Devices Using Deep Eigenspace Learning. IEEE Trans. Sustain. Comput. 2019, 4, 88–95.
Cui, Z.; Xue, F.; Cai, X.; Cao, Y.; Wang, G.G.; Chen, J. Detection of Malicious Code Variants Based on Deep Learning. IEEE Trans. Ind. Inf. 2018, 14, 3187–3196.
Ni, S.; Qian, Q.; Zhang, R. Malware identification using visualization images and deep learning. Comput. Secur. 2018, 77, 871–885.
Rosenberg, I.; Sicard, G.; David, E. End-to-end deep neural networks and transfer learning for automatic analysis of nation-state malware. Entropy 2018, 20, 390.
Kolosnjaji, B.; Eraisha, G.; Webster, G.; Zarras, A.; Eckert, C. Empowering convolutional networks for malware classification and analysis. In Proceedings of the International Joint Conference on Neural Networks, Anchorage, AK, USA, 14–19 May 2017.
Xiao, F.; Lin, Z.; Sun, Y.; Ma, Y. Malware Detection Based on Deep Learning of Behavior Graphs. Math. Probl. Eng. 2019, 2019, 8195395.
Tobiyama, S.; Yamaguchi, Y.; Shimada, H.; Ikuse, T.; Yagi, T. Malware Detection with Deep Neural Network Using Process Behavior. In Proceedings of the 2016 IEEE 40th Annual Computer Software and Applications Conference (COMPSAC), Atlanta, GA, USA, 10–14 June 2016; Volume 2.
Vinayakumar, R.; Alazab, M.; Soman, K.P.; Poornachandran, P.; Venkatraman, S. Robust Intelligent Malware Detection Using Deep Learning. IEEE Access 2019, 7, 46717–46738.
David, O.E.; Netanyahu, N.S. DeepSign: Deep learning for automatic malware signature generation and classification. In Proceedings of the International Joint Conference on Neural Networks, Killarney, Ireland, 12–17 July 2015.
Saxe, J.; Berlin, K. Deep neural network based malware detection using two dimensional binary program features. In Proceedings of the 2015 10th International Conference on Malicious and Unwanted Software (MALWARE), Fajardo, PR, USA, 20–22 October 2015.
Tran, T.K.; Sato, H.; Kubo, M. One-shot learning approach for unknown malware classification. In Proceedings of the 2018 5th Asian Conference on Defense Technology (ACDT), Hanoi, Vietnam, 25–26 October 2018.
Raff, E.; Sylvester, J.; Nicholas, C. Learning the PE header, malware detection with minimal domain knowledge. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, Dallas, TX, USA, 3 November 2017.
Bensaoud, A.; Kalita, J. Deep multi-task learning for malware image classification. J. Inf. Secur. Appl. 2022, 64, 103057.
Kumar, S.; Janet, B. DTMIC: Deep transfer learning for malware image classification. J. Inf. Secur. Appl. 2022, 64, 103063.
Mohammadi, F.G.; Amini, M.H.; Arabnia, H.R. An introduction to advanced machine learning: Meta-learning algorithms, applications, and promises. In Advances in Intelligent Systems and Computing; Springer: Cham, Switzerland, 2020; Volume 1123.
Kadam, S.; Vaidya, V. Review and analysis of zero, one and few shot learning approaches. In Advances in Intelligent Systems and Computing; Springer: Cham, Switzerland, 2020; Volume 940.
Hsiao, S.C.; Kao, D.Y.; Liu, Z.Y.; Tso, R. Malware image classification using one-shot learning with siamese networks. Procedia Comput. Sci. 2019, 159, 1863–1871.
Tran, T.K.; Sato, H.; Kubo, M. Image-based unknown malware classification with few-shot learning models. In Proceedings of the 2019 Seventh International Symposium on Computing and Networking Workshops (CANDARW), Nagasaki, Japan, 26–29 November 2019.
Tang, Z.; Wang, P.; Wang, J. ConvProtoNet: Deep prototype induction towards better class representation for few-shot malware classification. Appl. Sci. 2020, 10, 2847.
Atapour-Abarghouei, A.; Bonner, S.; McGough, A.S. A King’s Ransom for Encryption: Ransomware Classification using Augmented One-Shot Learning and Bayesian Approximation. In Proceedings of the 2019 IEEE International Conference on Big Data (Big Data), Los Angeles, CA, USA, 9–12 December 2019.
Lee, J.; Jeong, K.; Lee, H. Detecting metamorphic malwares using code graphs. In Proceedings of the 2010 ACM Symposium on Applied Computing, Sierre, Switzerland, 22–26 March 2010.

© Text is available under the terms and conditions of the Creative Commons Attribution (CC BY) license; additional terms may apply. By using this site, you agree to the Terms and Conditions and Privacy Policy.

Information

Subjects: Computer Science, Artificial Intelligence

Contributors: Faiza Babar Khan

Update Date: 20 Oct 2022
