Deep-Learning-Based Approach to Keystroke-Injection Payload Generation: Comparison
Please note this is a comparison between Version 1 by Vitalijus Gurcinas and Version 2 by Catherine Yang.

USB-based keystroke-injection attacks involve manipulating USB devices to inject malicious keystrokes into the target system. These attacks exploit the trust placed in USB devices and can bypass traditional security measures, particularly when the attacker takes the systems designed to prevent such attacks into account and adapts to them. By impersonating a keyboard or using programmable USB devices, attackers can execute unauthorized commands or gain unauthorized access to sensitive information by mimicking legitimate user keystrokes. Different attack vectors, such as BadUSB and rogue device attacks, have drawn attention to the potential risks and ramifications involved in these types of attacks. However, the emergence of advanced attack methods necessitates the development of more sophisticated countermeasures. These attacks pose a significant security risk and highlight the importance of implementing strong defenses to mitigate the potential impact of such exploits, especially those that can bypass keystroke dynamics systems using rogue USB devices with implants.

  • keystroke dynamics
  • machine learning
  • deep learning
  • behavioral biometrics

1. USB Imperfections

Infrastructure security is a natural starting point for a data security plan [1][6]. Most USB hardware attacks are closely associated with social-engineering tactics, as these types of attacks often necessitate a compromise of physical security measures. USB 1.1, released in 1998, was the first widely supported version of the protocol and offered two data-transfer rates: low-speed 1.5 Mbit/s and full-speed 12 Mbit/s. Due to these limited transfer speeds, the standard supported only a restricted range of devices, such as keyboards and mice. Then, in 2000, the USB 2.0 specification was introduced. Its high-speed (480 Mbit/s) mode meant that devices such as cameras, external storage devices, printers, and network cards were also supported. The convenience of the high data-transfer rate provided the momentum for the popularity of USB flash drives. Although various peripherals are supported in USB 2.0, there is no reliable way to identify the type of device by ‘vendor_id’ or ‘product_id’ [2][3][7,8]. The absence of robust identification mechanisms creates a vulnerability that can be exploited by keystroke-injection attacks. USB 3.0 and its 2013 update USB 3.1 introduced the USB Type-C connector. This provided a unified connector type for power, HDMI, DisplayPort, and Thunderbolt. For example, USB Type-C can transfer video stream data, and USB 3.1 cables can deliver 4K (UltraHD) video and audio. This can potentially enhance keystroke-injection attacks, as it provides the attacker with visual access to the victim’s machine, enabling them to observe ongoing activities, choose the best time for the attack, or even extract a greater amount of data from the victim within a limited timeframe [4][2]. However, no security improvements were introduced in these revisions, which opens even more possibilities for successful attacks via the USB interface [4][5][1,2].
The use of USB for malicious purposes should not be understood solely in terms of USB versions and direct physical connections; USB attacks should be considered in a broader context. Even where there are means of directly detecting harmful USB devices, with moderate difficulty, the attack surface expands significantly when malicious USB devices are connected through intermediate circuits such as USB hubs or other commonly used expansion devices. To demonstrate this weakness, research has been conducted on the creation of a malicious USB device that bypasses USB blocking mechanisms by manipulating the USB protocol and spoofing data to trusted USB hubs [6][9]. This shows a diverse range of offensive attack capabilities, as well as the corresponding countermeasures, in both direct and side-channel scenarios.
More than 400 vulnerabilities related to USB peripherals are listed on the CVE (common vulnerabilities and exposures) list. As a result, it has become standard practice for attackers to use these vulnerabilities and exploit the trust-by-default characteristics of USB to conduct attacks. This poses a security risk for the private, government, and personal sectors [4][2]. New and more advanced USB technologies offer more features, which are used by mobile devices and tablets as well as computers.
A taxonomy of USB-based attacks has been developed by categorizing them into three main categories: programmable microcontrollers, USB peripherals, and electrical [7][10]. This classification is used to analyze USB hardware attacks and gain a deeper understanding of each type of attack in this paper.
  • Electrical attacks are related by nature to the denial-of-service (DoS) attack. As an example, there is a hardware device called the ‘USB killer’. It is an electrical discharger disguised as a simple USB device. Once this device is plugged into the USB port, the capacitors will charge up and discharge a critical amount of current back to the USB port in intervals, making computer hardware components unusable;
  • USB peripheral attacks utilize flash-drive firmware or drivers to deliver malicious payloads. These types of attacks can perform buffer overflows, DNS overrides, and even keystroke injection and, in some cases, launch an executable;
  • Microcontroller attacks use microcontrollers that emulate keyboards and can inject keyboard input at high speed. They belong to a class of HID attacks that has evolved over the years and become much more sophisticated [8][11].
Other researchers proposed a taxonomy of USB-based attacks with a more granular approach, creating different bases that cover adversary intentions, object of impact, attack mechanism, level of secrecy, level of complexity, assets, and so forth [9][12]. A recent study discusses keyboard-firmware attacks in the banking sector, where vulnerable software inside keyboard controllers is used to sniff sensitive data [10][13]. This highlights the need for secure USB protocols and firmware-verification mechanisms. The keyboard has become a greater security concern over the years due to the sensitive nature of its use.

2. USB Attacks

USB attacks encompass a variety of techniques that exploit vulnerabilities in the USB interface to gain unauthorized access or compromise system security. Keystroke injection, as a specific type of USB attack, involves injecting malicious commands or keystrokes into a target system to perform unauthorized actions or gain elevated privileges. The USB keystroke-injection attack is also known as the keypress-injection attack and the keyboard-injection attack [11][12][13][14,15,16]. It is an attack method in which connecting a malicious USB device makes it possible to enter predetermined keystrokes in the terminal, enter and execute scripts, use keyboard shortcuts to control the device, and, thus, maliciously affect the computer [14][15][16][17,18,19].
The significance and feasibility of the keystroke-injection attack derive from several factors. One factor is USB device detection, that is, what happens before a malicious USB device begins to perform malicious activity; it is essential because the further options for keystroke injection depend on the level of trust gained. When a USB device is plugged into the USB port, the host detects that a new device was connected and waits for 100 milliseconds to ensure that the new device has time to be powered properly. The host then issues a reset command to place the device in its default state and allow it to respond. The host will ask the device for the first 64 bytes of its device descriptor. This step is very important for conducting a successful keystroke-injection attack. The device descriptor contains information about the product, its vendor, required power levels, the number of interfaces, endpoint information, etc. Once these are established, the host will communicate with the USB device using the appropriate drivers [17][20]. Replacement descriptors and static device information have been used for some time in automated USB penetration-testing platforms to prepare such devices [18][21]. Merely pretending that the device is highly trustworthy will not be sufficient to bypass protected systems without implementing additional measures. A USB storage device is completely unsuitable for this type of attack, as a simple device-protection tool can identify it as a threat or authenticate every USB flash drive [19][20][22,23].
The vast majority of researchers who studied hardware-based keystroke-injection solutions used microcontrollers [12][21][22][23][24][15,24,25,26,27]. In addition to microcontrollers, some research used more powerful small-form-factor computers that offer additional capabilities [11][25][14,28]. In either case, in order to inject keystrokes, the device must first identify itself as a HID (human interface device). Researchers recently developed a model that establishes a correlation between HIDs and vulnerability categories, thereby aligning them with specific types of attacks [26][29].
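The role of the device descriptor can be illustrated with a short sketch. It assumes the standard 18-byte USB device-descriptor layout (little-endian fields from bLength through bNumConfigurations); the example bytes, including the vendor and product IDs, are hypothetical and do not come from any real device.

```python
import struct
from typing import NamedTuple


class DeviceDescriptor(NamedTuple):
    """Fields of the standard 18-byte USB device descriptor."""
    bLength: int
    bDescriptorType: int
    bcdUSB: int
    bDeviceClass: int
    bDeviceSubClass: int
    bDeviceProtocol: int
    bMaxPacketSize0: int
    idVendor: int
    idProduct: int
    bcdDevice: int
    iManufacturer: int
    iProduct: int
    iSerialNumber: int
    bNumConfigurations: int


# Little-endian: B = 1-byte field, H = 2-byte field (18 bytes in total).
DESCRIPTOR_FORMAT = "<BBHBBBBHHHBBBB"


def parse_device_descriptor(raw: bytes) -> DeviceDescriptor:
    """Decode the device descriptor from the start of a raw byte buffer."""
    size = struct.calcsize(DESCRIPTOR_FORMAT)
    if len(raw) < size:
        raise ValueError(f"need at least {size} bytes, got {len(raw)}")
    return DeviceDescriptor(*struct.unpack(DESCRIPTOR_FORMAT, raw[:size]))


# Hypothetical descriptor bytes, made up for illustration only.
EXAMPLE = bytes([
    0x12, 0x01,        # bLength = 18, bDescriptorType = 1 (device)
    0x00, 0x02,        # bcdUSB = 0x0200
    0x00, 0x00, 0x00,  # class, subclass, protocol
    0x40,              # bMaxPacketSize0 = 64
    0x34, 0x12,        # idVendor = 0x1234 (hypothetical)
    0x78, 0x56,        # idProduct = 0x5678 (hypothetical)
    0x00, 0x01,        # bcdDevice = 0x0100
    0x01, 0x02, 0x03,  # string indices
    0x01,              # bNumConfigurations
])

descriptor = parse_device_descriptor(EXAMPLE)
print(hex(descriptor.idVendor), hex(descriptor.idProduct))
```

Every field here is reported by the device itself, which is why a programmable device can present whatever identity it chooses during enumeration.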

3. Keystroke Dynamics and Its Circumvention

When such HID- and BadUSB-type attacks emerged, serious concerns were raised regarding the security of USB devices. Consequently, over the past four years, research on protective solutions has increased, as has the number of researchers investigating how to circumvent these measures. One of the commonly discussed solutions for detecting keystroke-injection attacks in scientific papers is rule-based. Rule-based keystroke-injection protection is a straightforward and widely adopted method that involves logging and monitoring keyboard input for detection. An alternative approach entails analyzing the USB packet traffic to identify and mitigate potential malicious activities [27][30]. Typing at an extremely high speed of thousands of words per minute is beyond human capacity. Therefore, when the system detects abnormal keypress speeds, it can trigger various protective measures, such as disabling the keyboard, requiring password input, or logging the activity for further analysis [14][17].
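A speed-based rule of this kind can be sketched in a few lines. The threshold and burst length below are assumptions chosen for illustration, not values taken from the cited work, and the function name is hypothetical.

```python
from typing import Sequence

# Assumed values for illustration only; a real system would tune these.
MIN_HUMAN_INTERVAL_S = 0.02  # gaps shorter than this are treated as suspicious
MIN_BURST_LENGTH = 5         # number of consecutive suspicious gaps required


def looks_injected(
    press_times: Sequence[float],
    min_interval: float = MIN_HUMAN_INTERVAL_S,
    burst_length: int = MIN_BURST_LENGTH,
) -> bool:
    """Return True if `burst_length` consecutive inter-key gaps are all
    shorter than `min_interval` (timestamps in seconds, ascending)."""
    run = 0
    for previous, current in zip(press_times, press_times[1:]):
        if current - previous < min_interval:
            run += 1
            if run >= burst_length:
                return True
        else:
            run = 0  # a human-speed gap resets the burst counter
    return False


human_typing = [0.0, 0.18, 0.35, 0.51, 0.74, 0.90, 1.10]
scripted_typing = [i * 0.002 for i in range(20)]  # one key every 2 ms

print(looks_injected(human_typing))
print(looks_injected(scripted_typing))
```

Requiring a burst rather than a single fast gap is a design choice that tolerates occasional key rollover by a human typist. A rule like this only catches payloads that type faster than the threshold, which is the gap that payloads imitating human timing aim to exploit.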
Certain rule-based systems rely on white-listing known vendors by USB ID and not allowing other HID devices to function unless the user explicitly confirms and authorizes them [28][4]. Another approach involves employing contextual analysis tools that utilize a combination of contextual events. These tools take advantage of a heuristic approach by executing a USB drive within a sandboxed or isolated environment for a brief period, typically, a few seconds. During this time, the system monitors the processes and actions initiated by the USB drive. Subsequently, an evaluation phase is performed to determine whether the observed actions exhibit malicious behavior [29][31].
Another advanced method employed to detect and mitigate keystroke-injection attacks involves the utilization of behavioral biometrics, particularly keystroke dynamics. The concept of keystroke dynamics originated in the 1970s, primarily focusing on analyzing fixed-text data.
Keystroke-dynamics-based systems identify users based on their interaction with a computer via input devices. Therefore, in some literature, keystroke dynamics is also referred to as keystroke biometrics. These systems typically use neural networks that are trained on user interactions with computers. Users of systems that rely on this for identification may be prompted to enter their password multiple times or to type a paragraph of text, allowing the system to train and learn from these inputs. Alternatively, the system can train itself in the background while the user performs their daily tasks, continuously refining its understanding of the user’s keystroke dynamics. Some of these systems may be vulnerable to what is called a frog-boiling attack [30][32]. This means that the dataset used for identification purposes can be poisoned with small packets containing false data, which would lead to false identification. Such poisoning attacks can have two different objectives: to reduce the performance of the model and to manipulate the model by injecting false data [31][33].
Keystroke dynamics does not require the additional hardware needed by other biometric authentication methods, such as fingerprint or facial recognition. Personal keystroke biometrics is difficult to forge and could be used for authentication [32][33][34,35]. For example, the online learning platform Coursera applies this method to authenticate online students [34][36]. It is worth mentioning that legitimate users may be blocked by the system when the input patterns do not match the training patterns, for example because of a nervous emotional state, a hand injury, or simply a different keyboard or a software update. Other scientists use certain characteristics and behaviors that impede accurate recognition as specific indicators when detecting Parkinson’s disease from fixed- and free-text writing habits [35][37]. This shows that, for accurate recognition, behavioral patterns need to be updated regularly. However, these problems can be solved, and it is very likely that keystroke biometrics will gain more popularity in the near future [36][38].
There is a limited number of approaches to implementing keystroke analysis, and they vary depending on the intended purpose. In general, keystroke dynamics can be divided into two main steps: training neural networks on collected user data and performing authentication. Moreover, keystroke dynamics can be classified into two distinct types: fixed-text and free-text. The key difference between these two methods is the dataset that is used to authenticate a user [37][38][39,40]. Fixed-text authentication is mostly used as a second factor or as part of multi-factor authentication. One of the leaders in commercial keystroke-dynamics authentication systems is a company called TypingDNA. According to this company, the average fixed-text login credentials (email and password) contain about 30 characters [39][41].
Compared to fixed-text, free-text is a continuous type of authentication rather than a second factor or multi-factor one. The user’s computer interaction is continuously monitored and compared to existing user data. If a predetermined threshold of deviation from typical user behavior is exceeded, the user might be blocked, logged out, or asked to re-authenticate, depending on system settings [40][42]. Other researchers consider analyzing both keystroke and mouse usage behavior patterns to prevent a situation where an attacker avoids detection by restricting themselves to one input device because the system only checks the other [41][43]. Conversely, alternative proposals state that mouse dynamics requires simpler hardware to capture the biometric data and does not rely on sensitive user data, and propose a deep-learning method based on mouse dynamics for continuous and silent user authentication [42][44].
Free-text is more dynamic and uses a self-adaptive dataset. Typically, users are asked to type a few paragraphs of text or the required data could be collected simply by monitoring how the user interacts with a keyboard on a daily basis [43][45]. These systems are used for continuous authentication and are often used to defend against keystroke-injection attacks. The evaluation of accuracy and performance in free-text keystroke dynamics is an ongoing area of investigation; researchers are actively developing methods to minimize error rates in this domain [44][46], while other researchers have explored the application of cGAN networks to generate fake keystroke dynamics patterns with the intention of deceiving keystroke-authentication systems [45][47]. Given the growing need for remote learning, continuous authentication with keystroke dynamics was implemented and performed very effectively [46][48], although keystroke dynamics has significant inaccuracies when applied to RDP (remote desktop protocol) and VNC (virtual network computing) systems, resulting in poor or non-functioning operation.
In modern approaches for fixed- or free-text keystroke dynamics, different types of neural networks are being used in conjunction. As an example, Siamese neural networks are used [47][49]. In this model, two neural networks are trained using the same parameters and weights and work in conjunction with one another. One neural network receives original legitimate user data and another receives data that should be verified. Later, these results are compared. In contrast, generative adversarial neural networks are also gaining momentum for attacks against user identification systems [45][48][47,50]. In this architecture, two neural networks work against each other. One is trained as a discriminator and tries to identify the user while the generator receives random noise input at the beginning and tries to generate output that would fool the discriminator. Despite the aforementioned, in certain scenarios statistical algorithms have demonstrated superior performance compared to deep-learning approaches, particularly when dealing with large volumes of unlabeled data [49][51].
Following the early fixed-text work, Bayesian classifiers based on the mean and variance of time intervals between two or three consecutive keypresses were applied to the problem. The results claim a classification accuracy of 92% on a dataset with 63 users [36][38]. Where keystroke dynamics is implemented only as part of 2FA (the password as something we know and keystroke biometrics as something we are), the training dataset needs to contain data about how a user enters their username and password. This means that the analyzed text is usually short (on average 8–20 characters long) and that the user will be asked to enter the same text (in this case, their password) a set number of times. There are several drawbacks associated with the use of keystroke dynamics for continuous authentication. One limitation is that the data collected from user input often consists of a limited set of characters, such as letters, numbers, or symbols. Therefore, if a user changes their password (which is recommended for security purposes), the model would need to be retrained to adapt to the new input patterns. This retraining process can be time-consuming and may introduce delays in the authentication process. Additionally, reliance on fixed-text input may not capture the full range of user behavior and typing patterns, limiting the overall accuracy and effectiveness of the system.
Recently, research in keystroke dynamics has been heavily focused on machine-learning techniques, including random forests, fuzzy logic, RNN (recurrent neural network), CNN (convolutional neural network), Gaussian mixture models, k-nearest neighbours (k-NN), K-means clustering, and many other approaches [49][50][51,52].
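As a rough illustration of a classifier built on the mean and variance of inter-key intervals, the sketch below fits an independent Gaussian per interval position for each user and picks the user with the highest log-likelihood. It is a simplified stand-in, not the method of the cited study: equal priors, per-position independence, the variance floor, and all sample timings are assumptions.

```python
import math
from statistics import mean, pvariance
from typing import Dict, List, Sequence, Tuple

Profile = List[Tuple[float, float]]  # (mean, variance) per interval position


def fit_profile(samples: Sequence[Sequence[float]]) -> Profile:
    """Estimate mean and variance of each interval across typing samples."""
    columns = list(zip(*samples))
    # The variance floor (an assumption) avoids division by zero.
    return [(mean(col), max(pvariance(col), 1e-6)) for col in columns]


def log_likelihood(profile: Profile, sample: Sequence[float]) -> float:
    """Sum of per-interval Gaussian log-densities (independence assumed)."""
    total = 0.0
    for (mu, var), x in zip(profile, sample):
        total += -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
    return total


def classify(profiles: Dict[str, Profile], sample: Sequence[float]) -> str:
    """Return the user whose profile best explains the sample (equal priors)."""
    return max(profiles, key=lambda user: log_likelihood(profiles[user], sample))


# Hypothetical inter-key intervals in seconds for the same short text.
profiles = {
    "alice": fit_profile([[0.10, 0.20, 0.15], [0.12, 0.22, 0.14], [0.11, 0.19, 0.16]]),
    "bob": fit_profile([[0.30, 0.10, 0.40], [0.28, 0.12, 0.42], [0.31, 0.11, 0.39]]),
}

print(classify(profiles, [0.11, 0.21, 0.15]))
print(classify(profiles, [0.29, 0.11, 0.41]))
```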
The two main ideas used to make a convolutional neural network particularly successful are sparse connections and weight sharing. According to the study, activation functions (ReLu, Maxout), loss functions (SoftMax, hinge), regularization technique (dropout), optimization method (data augmentation, batch normalization), and fast processing (sparse convolution) were used in conjunction with CNN [51][53]. As an alternative there are long short-term memory (LSTM) networks and a variation of the LSTM, called a gated recurrent unit (GRU). GRUs have a simpler design with fewer parameters, which allows for quicker training due to the reduced number of operations. On the other hand, convolutional neural networks (CNNs) are predominantly used for image-related tasks like processing, classification, segmentation, and pattern identification. However, they have also demonstrated impressive results in various other classification tasks.
Depending on the machine-learning model and algorithms used, the data from the user has to be modified to fit the algorithm accordingly. A good example could be RNN (recurrent neural network). This automatically learns time series and has shown good performance in applications such as speech recognition, document abstraction, and NLP (natural language processing). However, if RNN is used, the keystroke data must be vectorized [34][36].
In a typical implementation, keystrokes or keystroke pairs are converted into vectors, and neural network weights are adjusted by backpropagation during the model-training phase. The trained model is then used during the authentication phase to assign a probability to the observed typing pattern for a specific user. The typing pattern is periodically compared to the stored user data, and, based on a predefined threshold, the model determines whether the observed pattern belongs to the legitimate user or not.
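The threshold decision described above can be sketched without a neural network. In the sketch below a scaled Manhattan distance to a stored template stands in for the trained model's score; this is a deliberately simpler, swapped-in technique, and the threshold value, function names, and timing vectors are all assumptions made for illustration.

```python
from statistics import mean
from typing import List, Sequence, Tuple

Template = Tuple[List[float], List[float]]  # per-feature means and deviations


def build_template(enrollment: Sequence[Sequence[float]]) -> Template:
    """Store the mean and mean absolute deviation of each timing feature."""
    columns = list(zip(*enrollment))
    means = [mean(col) for col in columns]
    # The floor (an assumption) keeps the later division well defined.
    deviations = [
        max(mean([abs(x - m) for x in col]), 1e-6)
        for col, m in zip(columns, means)
    ]
    return means, deviations


def anomaly_score(template: Template, sample: Sequence[float]) -> float:
    """Average deviation-scaled distance between the sample and the template."""
    means, deviations = template
    scaled = [abs(x - m) / d for x, m, d in zip(sample, means, deviations)]
    return sum(scaled) / len(scaled)


def is_legitimate(template: Template, sample: Sequence[float],
                  threshold: float = 3.0) -> bool:
    """Accept the sample when its score stays at or below the threshold."""
    return anomaly_score(template, sample) <= threshold


# Hypothetical enrollment vectors (timings in seconds).
template = build_template(
    [[0.10, 0.20, 0.15], [0.12, 0.22, 0.14], [0.11, 0.19, 0.16]]
)

print(is_legitimate(template, [0.11, 0.21, 0.15]))
print(is_legitimate(template, [0.30, 0.10, 0.40]))
```

Raising the threshold reduces false rejections of the legitimate user at the cost of accepting more impostor or synthetic samples, which is the trade-off the predefined threshold controls.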

4. User Keypress Data and Their Minimum

To identify the user hiding behind the keyboard, a dataset with a collection of user keypresses is required. When collecting data solely based on the dynamics of keypresses without additional information, only a limited number of parameters are necessary and collected. Depending on the algorithm and system, some parameters may differ, but dwell time and down-to-down time are used frequently. Dwell time (Dt, also called hold time, H.time) is the duration for which a key is held down. Down-to-down time (DD.time) measures the duration between one keypress and the next, and up-to-down time (UD.time) is the time between the release of one key and the press of the next (in some literature, called flight time) [32][34]. The data-collection and training phase of keystroke dynamics can be categorized based on the length and type of the text [52][54]. Fixed-length models use a limited dataset that consists of a username and password. In other words, the same short text is typed repeatedly until a sufficient amount of data is collected for training purposes. Some datasets are available for testing purposes, including the Carnegie Mellon University (CMU) fixed-text dataset [53][55]. It is often used as a reference to test techniques in keystroke dynamics research. This dataset consists of keystroke dynamics information from 51 participants, each of whom typed the password ‘.tie5Roanl’ a total of about 400 times. Participants had to wait at least one day between typing sessions, so that day-to-day variations in each subject’s typing would be captured. Furthermore, this password was chosen as an example of a strong 10-character password that meets common requirements for password security. The dataset compiled by Gonzalez comprises a combination of publicly accessible keystroke datasets [54][55][56][56,57,58]. It encompasses both authentic human-generated keystrokes and synthesized forgeries [57][59].
Researchers can utilize this dataset to evaluate the efficacy of liveness-detection techniques for keystroke dynamics, specifically compared to a diverse range of state-of-the-art methods for synthesizing samples. The first dataset included in the dataset collection was from CMU [54][56]. The second dataset was collected from individuals performing daily tasks in an enterprise setup and was used for evaluation of free-text keystroke dynamics for authentication [55][57]. The third dataset, obtained from anonymous subjects through a crowd-sourcing platform, was aimed at identifying indicators of fraudulent intent by analyzing variations in typing patterns [56][58]. It is important to note that the datasets referenced in the literature sources do not contain recorded click values.
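The timing features defined at the start of this section (dwell, down-to-down, and up-to-down time) can be derived from press and release timestamps as in the sketch below. The event type, the key names, and the timestamps are hypothetical; the dictionary keys H, DD, and UD simply mirror the abbreviations used above.

```python
from typing import Dict, List, NamedTuple, Sequence


class KeyEvent(NamedTuple):
    """One keystroke with press and release timestamps in seconds."""
    key: str
    press: float
    release: float


def timing_features(events: Sequence[KeyEvent]) -> Dict[str, List[float]]:
    """Compute dwell (H), down-to-down (DD) and up-to-down (UD) times."""
    pairs = list(zip(events, events[1:]))
    return {
        # Dwell: how long each key is held down.
        "H": [e.release - e.press for e in events],
        # Down-to-down: press of one key to press of the next.
        "DD": [nxt.press - cur.press for cur, nxt in pairs],
        # Up-to-down (flight): release of one key to press of the next.
        # This is negative when the next key is pressed before the release.
        "UD": [nxt.press - cur.release for cur, nxt in pairs],
    }


# Hypothetical keystrokes, chosen so the arithmetic is exact.
events = [
    KeyEvent("a", 0.0, 0.5),
    KeyEvent("b", 1.0, 1.25),
    KeyEvent("c", 1.5, 2.0),
]

features = timing_features(events)
print(features["H"])   # [0.5, 0.25, 0.5]
print(features["DD"])  # [1.0, 0.5]
print(features["UD"])  # [0.5, 0.25]
```

For each consecutive pair, the down-to-down time equals the first key's dwell time plus the up-to-down time, so a feature vector that stores all three carries some redundancy.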
The minimum amount of data needed for fixed-text and free-text keystroke dynamics varies depending on factors such as the complexity of the analysis model, the desired accuracy, and the specific characteristics of users and their typing behavior, as there could be many users with similar typing biometrics. Considering the focus of this research on small datasets and drawing insights from the existing literature, we can conclude that the minimum requirement for fixed-text keystroke data ranges from 10 to 30 samples, while for free-text keystroke data it ranges from 50 to 150 samples. It is important to highlight that the datasets referenced in the literature sources do not contain recorded keypress values, except those used to collect specific fixed-text data, such as the password mentioned in the CMU dataset.
A trend has recently started to emerge wherein facial recognition is combined with keystroke dynamics (fixed-text) as an MFA (multi-factor authentication) system [58][60]. This solution, although promising, needs to take into consideration that not all devices have cameras and that there are various types of keyboards and layouts. An alternative approach suggested by researchers is a defense-in-depth strategy that implements a three-factor authentication system. This multi-layered approach uses a passcode as the first factor of authentication, a password as the second factor, and keystroke dynamics as the third factor [59][61]. Such an approach requires the use of even more diverse datasets to increase the reliability of results.
In summary, comparative studies, including machine-learning-based methods and rule-based methods to evaluate the effectiveness of different keystroke-injection-detection and -prevention mechanisms, demonstrate that injection methods can be accurate enough to bypass both fixed-text and continuous authentication systems, where the level of biometric accuracy does not need to be as precise as for physical biometric data. Keystroke dynamics system armoring, which uses fake keypress data in order to distinguish between artificial keystrokes and real ones, and methods that follow the patterns of keypresses along with the timings have shown good results in recent years, but require a large collection of user data and come with a high false-rejection rate. Furthermore, the evolving landscape of keyboard-injection attacks requires ongoing research to address emerging threats and to develop effective countermeasures. With the help of social engineering, keystroke-injection attacks are capable of jumping the air gap of secured infrastructures that have a high security level. Therefore, the demand for tools that help defend against these types of attacks has been growing over the years. The reliability of these tools should be tested regularly, because these systems may not always work as new attack methods emerge and adapt to increased security standards.