The rapid development of information and communication technologies and the widespread use of the Internet have made it imperative to implement advanced user authentication methods based on the analysis of behavioural biometric data. In contrast to traditional authentication techniques, such as the simple use of passwords, these new methods face the challenge of authenticating users at more complex levels, even after the initial verification. This is particularly important because it helps to address risks such as forgery and the disclosure of personal information to unauthorised individuals. Users can be categorised using keystroke dynamics, in terms of the age group they belong to and in terms of their educational level, with high accuracy rates, which strongly supports the creation of applications that enhance user security and facilitate the use of Internet services.
1. Introduction
Communication technologies have brought about many changes in the way the average person lives. As the Internet becomes an integral part of everyday life for more and more people, the need to accurately identify the demographic characteristics of Internet users has become paramount, for reasons related both to user security and to the best use of Internet services. Profiling unknown users by identifying certain inherent or acquired characteristics, such as their age and educational level, is essential for various applications, including personalised content delivery, targeted advertising, and customisation of the user experience. In this context, the use of keystroke dynamics as a means of extracting valuable demographic information has garnered considerable attention. Keystroke dynamics, a branch of behavioural biometrics, focuses on analysing the unique typing patterns exhibited by individuals [1]. These rhythms and patterns are idiosyncratic [2], in the same way as an individual’s handwriting or signature, due to the similar underlying neurophysiological mechanisms. By studying different typing patterns and their correlations with demographic characteristics, keystroke dynamics provides a novel approach to demographic profiling.
2. Keystroke Dynamics in Personal Characteristics Protection
The idea of keystroke dynamics dates back to the late 1800s. It grew out of a long-held belief that Morse code senders could identify each other by their speed and rate of transmission; telegraphers recognised each other through what they called the “sender’s punch”. In the 1980s, research conducted by the U.S. National Science Foundation (NSF) determined that each person has his or her own typing style. The NSF’s keystroke recognition method analyses and processes the way a person types on a keyboard [4].
As early as the mid-1970s, researchers began to examine whether the way a person uses the keyboard can serve as a recognisable hallmark. This was first highlighted in Spillane’s research [5], which introduced the idea of identifying users by the way they type. An important contribution followed with the study by Forsen et al. [6], which analysed keystroke dynamics as one of the biometric characteristics that can be used to verify the identity of a user requesting access to a system. One of the first studies on this topic was conducted by Gaines et al. [7]. They had a group of seven secretaries type the same three paragraphs twice over a period of four months. A total of 300 to 400 words was required both during the typing phase and for each comparison. Time delays between successive keystrokes were measured, and the analysis was based on a limited number of digraphs (pairs of consecutive letters). Although the results were very encouraging (a false acceptance rate, FAR, of 0% and a false rejection rate, FRR, of 4%), the sample size was too small and the volume of data required was too large.
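The digraph timing described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that keystrokes are available as (key, press time in milliseconds) pairs; the function names and the sample data are hypothetical and are not taken from Gaines et al.

```python
from collections import defaultdict
from statistics import mean


def digraph_latencies(events):
    """Group press-to-press delays by digraph.

    events: list of (key, press_time_ms) tuples in typing order
            (an assumed input format, not one prescribed by the source).
    Returns a dict mapping each two-character digraph to its list of delays.
    """
    latencies = defaultdict(list)
    for (k1, t1), (k2, t2) in zip(events, events[1:]):
        latencies[k1 + k2].append(t2 - t1)
    return dict(latencies)


def mean_latency_profile(events):
    """Reduce each digraph's delays to a mean, giving a simple typing profile."""
    return {dg: mean(delays) for dg, delays in digraph_latencies(events).items()}


# Hypothetical sample: the user types "the th"
sample = [("t", 0), ("h", 110), ("e", 205), (" ", 330), ("t", 480), ("h", 600)]
profile = mean_latency_profile(sample)
```

A profile of this kind could then be compared against a stored one for the same digraphs; how that comparison was done in the original study is not specified here.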
Another study was conducted by Umphress and Williams in 1985 [8]. In this work, the time delay between consecutive key presses was also used to authenticate the user. Approximately 1400 key presses were needed to generate a profile for each user, and each authentication attempt required another 300 characters. The FAR achieved was 6%, but the volume of data required was clearly large. A similar study was conducted by Leggett and Williams [9] with data obtained from 17 computer programmers. The system developed showed an FAR of 5% and an FRR of 5.5%. A major drawback of this method, however, is again the need for a large amount of data: in total, each programmer had to type over 1000 words.
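The FAR and FRR figures quoted in these studies follow from simple counting. The sketch below assumes a distance-based verifier that accepts an attempt when its distance to the stored profile is at or below a threshold; the scores and the threshold are invented for illustration and do not reproduce any of the cited experiments.

```python
def far_frr(genuine_scores, impostor_scores, threshold):
    """Compute (FAR, FRR) for a distance-based verifier.

    An attempt is accepted when its distance is <= threshold (an assumption).
    FAR: fraction of impostor attempts wrongly accepted.
    FRR: fraction of genuine attempts wrongly rejected.
    """
    false_accepts = sum(score <= threshold for score in impostor_scores)
    false_rejects = sum(score > threshold for score in genuine_scores)
    far = false_accepts / len(impostor_scores)
    frr = false_rejects / len(genuine_scores)
    return far, frr


# Hypothetical distances between typing samples and a stored profile
genuine = [0.2, 0.3, 0.5, 0.9]
impostor = [0.4, 0.8, 1.1, 1.3, 1.5]
far, frr = far_frr(genuine, impostor, threshold=0.6)
```

Raising the threshold lowers the FRR at the cost of a higher FAR, which is why the two rates are normally reported together.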
In a different line of work, Pentel [23] focused on the analysis of unintended user activities in human–computer interactions. While user interfaces are usually designed to react only to intentional commands, users often perform unintentional activities that produce many cues about the user and can be used by the system to plan an appropriate response. Specifically, the goal of the research was to predict the age and gender of users through the analysis of data generated from mouse and keyboard devices. These data were collected from six different systems from 2011 to 2017 and include information from 1519 individuals. The machine learning models predicted both the age and gender of the user with very high accuracy; in particular, the F-score and accuracy metrics were above 0.9.
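For readers unfamiliar with the two metrics reported here, the following sketch shows how accuracy and the F-score (taken here to be F1, which is an assumption) are computed for one class. The labels and predictions are made up for illustration and are unrelated to Pentel's data.

```python
def accuracy_and_f1(y_true, y_pred, positive):
    """Return (accuracy, F1) with `positive` treated as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, f1


# Hypothetical age-group labels and model predictions
y_true = ["u18", "u18", "adult", "adult", "adult", "u18"]
y_pred = ["u18", "adult", "adult", "adult", "u18", "u18"]
acc, f1 = accuracy_and_f1(y_true, y_pred, positive="u18")
```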
The study of Schler et al. [26] sought to determine the age of a blog’s author. The researchers collected their data from 71,493 blogs, which they classified according to the age of the author. For several of them, no age information was available, and for some of the classes they created, there were not enough data, resulting in three classes: the 10s (age group 13–17), the 20s (age group 23–27), and what they called the 30s (age group 33–46). As features for the classification, they used the frequency of occurrence of selected words. The multi-class real Winnow algorithm was used for classification, in which a vector with as many dimensions as the chosen set of parameters is defined for each class. The final results showed that the age group of blog creators could be correctly predicted with 73% accuracy.
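Word-frequency features of the kind mentioned above can be sketched as follows. The tokenisation rule, the vocabulary, and the example sentence are all assumptions made for illustration; the actual word list and preprocessing used by Schler et al. are not given here, and the Winnow classifier itself is not reproduced.

```python
import re
from collections import Counter


def word_frequency_vector(text, vocabulary):
    """Relative frequency of each vocabulary word in `text`.

    Tokens are lowercase runs of letters and apostrophes (an assumed rule).
    Returns one float per vocabulary entry, in vocabulary order.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return [0.0] * len(vocabulary)
    counts = Counter(tokens)
    return [counts[word] / len(tokens) for word in vocabulary]


# Hypothetical vocabulary and blog snippet
vocab = ["homework", "work", "kids"]
vec = word_frequency_vector("Homework again, so much homework before work", vocab)
```

Each blog would be mapped to one such vector, and the classifier would then keep one weight per dimension for every age class.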
The study by Rao et al. [27] aimed to identify the characteristics of Twitter users, especially their age group, gender, regional origin, and political orientation. They proposed an approach to automatically discover a number of user attributes by examining their status messages, the social network structure, and the communication behaviour of the users. A support vector machine (SVM) was chosen as the classifier, and users were divided into people over 30 and under 30. The researchers tested the system and attained a classification accuracy of about 74%.
Keystroke dynamics can be used to identify the under-18 age group, thus offering an effective way to build a model that protects children from online threats. By implementing a limited firewall, an environment that is more suitable for this particular user group can be created [28]. Keystroke dynamics can also be exploited in e-commerce by creating product recommendation services that are tailored to the age and gender of the users. Furthermore, the ability to identify a user’s age and gender through keystroke dynamics can enable systems in which content or advertisements are presented efficiently and targeted to the appropriate consumers, taking into account their preferences and characteristics [29].
Educational systems vary greatly between countries. International data on education should therefore be based on a classification that proposes, for all countries of the world, correct criteria for the distribution of educational programs at levels that can be considered comparable.
The educational level of an individual is an important characteristic in various surveys that have been carried out over the years.
While fixed-text keystroke dynamics biometrics are often used during the login process to authenticate a user, free-text keystroke biometric systems allow continuous authentication of a user during the entire session for increased security [30]. Furthermore, other studies [31,32] have exploited additional user characteristics, such as age and gender, to improve the performance of the user authentication model.
It is part of everyday life for people to communicate over the Internet, usually via text messaging. One of the major threats in this form of communication comes from users who hide their personal characteristics, such as age and gender, in order to deceive unsuspecting users. Due to the nature of online communication, hiding such information is an easy task. In order to protect unsuspecting users, various methods have been proposed to reveal some of the characteristics of anonymous users. The contributions of the paper are therefore two-fold: first, the creation of a free-text keystroke dynamics dataset, a kind of dataset that is rarely available on the Internet because of the risk of leaking volunteers’ personal data, and second, the novelty of using keystroke dynamics to detect certain features of unknown users. Users can be categorised using keystroke dynamics, in terms of the age group they belong to and in terms of their educational level, with high accuracy rates, which strongly supports the creation of applications that enhance user security and facilitate the use of Internet services.
This entry is adapted from the peer-reviewed paper 10.3390/eng4040154