MalFe—Malware Feature Engineering Generation Platform

The growing sophistication of malware has created diverse challenges, especially for security researchers, who are expected to develop mechanisms to thwart these malicious attacks. While security researchers have turned to machine learning to combat this surge in malware attacks and to enhance detection and prevention methods, they often encounter limitations in sourcing malware binaries and datasets.

Keywords: malware; malware feature engineering; malware datasets; malware detection

1. MalFe Platform

The Malware Feature Engineering (MalFe) platform aims to ease the difficulty of obtaining and creating custom datasets from a databank through a community-driven platform. To ensure a scientific process, the design approach leverages software engineering principles to develop the processes and features of the platform. At a high level, MalFe contains five key components: Reports, Categories, User Code, Dataset Generator, and Datasets, as depicted in Figure 1. The reports consist of raw Cuckoo Sandbox reports, which are uploaded by the community from experiments run to analyze malware samples. The categories help organize these reports, allowing more targeted datasets to be built. User code parses this repository of reports and extracts the relevant information required to generate a dataset. The user code is then fed into the dataset generator, which runs on a queue basis and executes the user code against the selected categories of reports, resulting in a generated dataset. After being approved by their creator(s), these datasets are made available for public viewing.
Figure 1. MalFe high-level model.
The MalFe process model is shown in Figure 2. To prevent misuse and to build trust in the platform, the process starts with account creation and verification. After successful authentication, a user can browse the existing datasets generated by the community and search for a dataset that matches their needs. If the dataset does not yet exist on the platform, the user can browse the uploaded Cuckoo Sandbox reports to see whether the information needed to build it is available. If the information is not sufficient, the user needs to source more reports, either through experimentation or by obtaining them from an external source. Due to the often substantial size of these reports, downloading all of them for the purpose of data investigation can be challenging and often impractical.
Figure 2. MalFe process model.
To solve this problem, a lightweight comparison tool called IntelliView was created and added to the platform. It allows end users to view sections of reports, making them easier to inspect and compare. If, during manual observation through IntelliView, the user finds malformed or erroneous reports, they can flag them, ensuring that high-quality data are always available to the community and further enhancing trust in the platform.

Once sufficient reports are available and quality control has been conducted, the next step is for the user to add code that parses these reports and extracts the information needed to build the dataset. When this is complete, the platform requests the metadata needed to build the dataset: the category of reports the dataset is to be built from, as well as the dataset's name and description. After the user code is checked for security vulnerabilities and syntax errors, it is queued on the platform to generate the dataset in a restricted environment. Dataset generation involves several status changes that keep the user aware of what is happening. A status of QUEUED means the dataset has been added to the queue; the next job is taken from the queue once the previous dataset has completed, at which point the status changes to PROCESSING. The processing stage is the core functionality of the platform, giving users the ability to shape the data according to their specific requirements. Error handling is integrated into the platform as well, providing an opportunity to understand any misconfiguration or other sources of error during data generation.

After generation is completed, the data are written to a CSV file and saved on the platform, and the status is set to COMPLETE. The owner is then prompted to verify the dataset and perform quality checking. Once approved by the owner, the dataset is made publicly available and can be downloaded, shared, and commented on for further research and networking with security professionals. The next section discusses the technical details behind the implementation of the platform.
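The status lifecycle described above can be pictured as a simple state field on the dataset record. The following Django model is a minimal illustrative sketch under that assumption; the class and field names are hypothetical, not the platform's actual schema:

```python
from django.db import models

class Dataset(models.Model):
    # Hypothetical sketch of the QUEUED -> PROCESSING -> COMPLETE lifecycle.
    QUEUED, PROCESSING, COMPLETE = "QUEUED", "PROCESSING", "COMPLETE"
    STATUS_CHOICES = [
        (QUEUED, "Queued"),          # waiting in the generation queue
        (PROCESSING, "Processing"),  # user code running against selected reports
        (COMPLETE, "Complete"),      # CSV written and saved on the platform
    ]

    name = models.CharField(max_length=255)
    description = models.TextField()
    status = models.CharField(max_length=16, choices=STATUS_CHOICES, default=QUEUED)
    approved = models.BooleanField(default=False)       # owner sign-off before public release
    csv_file = models.FileField(upload_to="datasets/")  # output produced on COMPLETE
```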

MalFe Implementation

The platform was developed following a modular approach and in line with agile principles to ensure adaptability and flexibility. Python, renowned for its versatility and wealth of frameworks and libraries, was the language of choice for the implementation. To facilitate seamless web-based management, the Django web framework was chosen, a robust and widely respected choice in the field [1].

Security remains a paramount concern, and the platform integrates multiple layers of defense. Django's default security middleware forms the first line of protection, followed by custom sanitization middleware that scrubs metadata of any special characters and tags, further fortifying the platform's security posture. The critical middleware components in play are SecurityMiddleware, SessionMiddleware, CsrfViewMiddleware, AuthenticationMiddleware, XFrameOptionsMiddleware, OTPMiddleware, and SessionSecurityMiddleware.

For authentication, Two-Factor Authentication (2FA) is mandatory for logging in and is integrated seamlessly into the user experience. Users can utilize either a token generator or a physical key. Token generators leverage the Time-Based One-Time Password (TOTP) algorithm [2], producing 6–8 unique digits based on the current time and a secret key established during account registration. As a security measure, these tokens regenerate every 30 seconds, thwarting brute-force and phishing attacks. Backup tokens are also available in case the primary 2FA device is unavailable, ensuring both recovery and security. Passwords use the default Django password field, which is fortified with PBKDF2 using SHA-256 hashing and a random salt [1]. Furthermore, email verification is enforced, fostering trust and ensuring the integrity of user accounts.
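As an illustration of the TOTP scheme described above, the following minimal Python sketch computes an RFC 6238 [2] code from a shared secret using only the standard library. It is a didactic example of the algorithm, not the platform's authentication code, which relies on Django's OTP middleware:

```python
import base64
import hashlib
import hmac
import struct
import time

def totp(secret_b32: str, digits: int = 6, period: int = 30) -> str:
    """Compute an RFC 6238 time-based one-time password."""
    key = base64.b32decode(secret_b32, casefold=True)  # secret set at registration
    counter = int(time.time()) // period               # token changes every `period` seconds
    digest = hmac.new(key, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = digest[-1] & 0x0F                         # dynamic truncation (RFC 4226)
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)      # 6-8 digit zero-padded token
```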
The process of adding a Cuckoo report to the platform is streamlined; each report undergoes rigorous scrutiny, with essential metadata such as the SHA256 hash, name, and description being extracted and validated. A further layer of scrutiny checks for any analysis errors before the report is deemed suitable for upload, ensuring the preservation of raw data integrity. For added convenience, the platform integrates with Cuckoo Sandbox directly, enabling users to upload samples; these samples are enqueued for execution within the sandbox environment, and the resulting report is automatically integrated into the platform. The IntelliView feature represents a unique approach to manual observation, implemented through asynchronous calls, with data clipped to enhance performance and usability.
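A minimal sketch of this upload check might look as follows. The Cuckoo JSON field names (debug.errors, target.file) reflect the typical report layout, and the helper name is hypothetical rather than the platform's actual code:

```python
import hashlib
import json

def validate_cuckoo_report(path: str) -> dict:
    """Hypothetical upload check: extract metadata and reject failed analyses."""
    with open(path, "rb") as f:
        raw = f.read()
    report = json.loads(raw)
    # Cuckoo records analysis problems under debug.errors; such reports are rejected
    errors = report.get("debug", {}).get("errors", [])
    if errors:
        raise ValueError(f"report contains analysis errors: {errors[:3]}")
    target = report.get("target", {}).get("file", {})
    return {
        "sha256": target.get("sha256") or hashlib.sha256(raw).hexdigest(),
        "name": target.get("name", "unknown"),
    }
```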
Dataset creation is a novel feature in which a user provides the code to process an individual report, extracting the data and features needed to build the dataset. A portion of the code responsible for executing user-generated code can be seen in Figure 3.
Figure 3. User code execution snippet.
This is achieved by users implementing a stub function that is fed the Cuckoo report, filename, SHA256 hash, and category, as well as a flag indicating whether the sample is malicious. Since the platform does not impose limitations on features, users can extract as many features as they need. Because executing user code is a security risk, the code is first checked for syntactical errors using an abstract syntax tree. RestrictedPython [3], which provides a Python environment with limited imports and functions, is then used to execute the user code; safe built-ins and safe globals provide additional security while preserving usability. To improve speed, concurrency is utilized with thread pools, allowing faster processing of Cuckoo reports and feature generation. A cron job creates the datasets, resulting in a queue system that limits potential resource abuse, and each dataset has a maximum execution time.
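The following sketch approximates that execution flow using the RestrictedPython API. The stub-function name extract_features and the report-tuple layout are assumptions for illustration, not the platform's actual interface:

```python
import ast
from concurrent.futures import ThreadPoolExecutor

from RestrictedPython import compile_restricted, safe_globals
from RestrictedPython.Eval import default_guarded_getitem
from RestrictedPython.Guards import safer_getattr

def run_user_code(source: str, reports: list) -> list:
    """Execute community-supplied feature-extraction code in a restricted environment."""
    ast.parse(source)  # reject syntactically invalid code before anything runs
    byte_code = compile_restricted(source, filename="<user_code>", mode="exec")
    env = dict(safe_globals)                    # safe built-ins only: no open(), no imports
    env["_getitem_"] = default_guarded_getitem  # guard subscripting, e.g. report["behavior"]
    env["_getattr_"] = safer_getattr            # guard attribute access
    exec(byte_code, env)
    extract = env["extract_features"]  # hypothetical name for the user's stub function

    def one_row(item):
        report, filename, sha256, category, is_malicious = item
        return extract(report, filename, sha256, category, is_malicious)

    # Thread pool for concurrent report processing; the platform additionally
    # enforces a maximum execution time per dataset, omitted here for brevity.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(one_row, reports))
```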

2. MalFe Validation

The NIST validation cycle is shown in Figure 4. The first step is to define requirements for the platform, followed by test assertions defining how the system should behave. Next, test cases that exercise the requirements are defined, and verification is performed to check whether the test assertions have been achieved; finally, the test methodology is validated. The system requirements consist of two categories: Core Requirements (CR), listed in Table 1, and Optional Requirements (OR), listed in Table 2. In each table, the label column provides a reference number that is used in the compliance matrix, while the description column states the requirement for the platform.
Figure 4. NIST validation cycle [4].
Table 1. Core requirements.
Table 2. Optional requirements.
The next step is to define the Test Assertions (TA), which are the post-conditions of a system’s function. The assertions can be seen in Table 3. Thereafter, Test Cases (TC) are defined in order to ensure that the system has both met the standards and satisfied the test assertions. The test cases can be seen in Table 4.
Table 3. Test assertions.
Table 4. Test cases.
A compliance matrix maps the requirements to the test cases satisfying the test assertions. If a core test assertion was met, that test assertion is specified in the result column; if a manual check was performed and found compliant, this is indicated with '–check–'. Note that test assertions and manual checks can appear together in the result column. The compliance matrix in Table 5 confirms that the test assertions have been fulfilled.
Table 5. Compliance matrix.
No. Requirement Test Case Result
1 CR-01 TC-01 TA-01
2 CR-02 TC-02 TA-01, –check–
3 CR-03 TC-03, TC-05 TA-03, TA-05, –check–
4 CR-04 TC-06 TA-02
5 CR-05 TC-03 –check–
6 CR-06 TC-05 –check–
7 CR-07 TC-07, TC-08 TA-04, –check–
8 CR-08, CR-09 TC-08 TA-04, –check–
9 CR-10 TC-04, TC-10, TC-11 –check–
10 CR-11, CR-12, CR-13 TC-07 TA-03, TA-04, –check–  
11 CR-14 TC-09 –check–
12 CR-15 TC-04 –check–
13 OR-01 TC-13 –check–
14 OR-02 TC-12 TA-03, –check–
15 OR-03, OR-04 TC-08 –check–  
16 OR-05 TC-16 TA-05, –check–
17 OR-06 TC-03 –check–
18 OR-07 TC-15 –check–
19 OR-08 TC-11 –check–
20 OR-09, OR-10 TC-12 –check–

References

  1. Django Auth. Available online: https://docs.djangoproject.com/en/2.2/topics/auth/passwords/ (accessed on 4 October 2022).
  2. M’Raihi, D.; Machani, S.; Pei, M.; Rydell, J. TOTP: Time-Based One-Time Password Algorithm. Available online: https://www.rfc-editor.org/rfc/rfc6238.html (accessed on 4 October 2022).
  3. RestrictedPython. Available online: https://pypi.org/project/RestrictedPython/ (accessed on 25 February 2023).
  4. NIST. Computer Forensics Tool Testing Program. 2014. Available online: http://www.cftt.nist.gov/ (accessed on 12 October 2017).