Keystroke Dynamics as a Language Profiling Tool: Comparison
Please note this is a comparison between Version 2 by Jason Zhu and Version 1 by Ioannis Tsimperidis.

Understanding the distinct characteristics of unidentified Internet users is helpful in various contexts, including digital forensics, targeted advertising, and user interaction with services and systems. Keystroke dynamics (KD), the analysis of data derived from a user’s typing behaviour on a keyboard, is one approach to obtaining such information.

  • mother tongue determination
  • keystroke dynamics
  • user classification
  • machine learning

1. Introduction

The definition of “mother tongue” varies across sources and continues to evolve to encompass the nuances of language use by individuals. One commonly accepted definition is that it refers to the language a person learns through their interactions with family and society during the early years of their life [1]. According to UNESCO, there are over 7000 known mother tongues in the world, with approximately 3000 of them facing the risk of extinction in the near future [2]. Some mother tongues, such as Chinese, Hindi, Spanish, English, Arabic, Japanese, and Russian, are spoken by hundreds of millions of people. Others, such as Turkish, Korean, French, German, Bengali, and Italian, also have many speakers globally.
As the Internet continues to expand its reach across the globe, becoming accessible even to less economically developed populations, the diversity of languages used in the digital world is also increasing. English used to be the dominant language on the Internet; however, this is changing as more non-English speakers access online resources. The ability of people from different countries and cultures to communicate and share information in their mother tongue has resulted in a proliferation of diverse languages being used on the Internet.
The exponential increase in the global Internet user base has expanded the market reach for companies; however, the diverse linguistic landscape online presents a formidable challenge in effectively engaging with people from different language backgrounds. Understanding and making use of people’s mother tongue is crucial to communicating and marketing successfully with individuals who speak different languages. The mother tongue of Internet users serves as a defining characteristic, and knowledge of this aspect can be leveraged in various ways to enhance business strategies and user experiences.
Understanding a user’s mother tongue can have practical applications in various domains. For instance, Internet service providers (ISPs) can customise their services to align with users’ language preferences, thereby enhancing user experience. Similarly, online businesses can improve their targeted advertising strategies by considering customers’ mother tongues, as different language preferences may entail distinct consumer needs. Additionally, in digital forensics, knowledge of a suspect’s mother tongue can serve as valuable evidence in criminal investigations, allowing investigators to narrow the pool of potential suspects. Investigators often need to sift through substantial amounts of data and digital evidence to identify perpetrators when dealing with cybercrimes. Information about the suspect’s mother tongue can help focus investigative efforts on a smaller subset of suspects. Another practical application is automatically modifying the interface of a website or application based on the user’s mother tongue, making it more accessible and user-friendly, thus enhancing user satisfaction and engagement. Overall, leveraging the knowledge of a user’s mother tongue can have diverse applications in fields such as ISP services, targeted advertising, digital forensics, and website/application design to enhance user experiences and streamline processes.

2. Keystroke Dynamics

The term “mother tongue” refers to the language an individual learns from birth or acquires from their family and community during their formative years. It serves as their primary mode of communication and thought, and they are typically most proficient in using this language. However, the concept of mother tongue has evolved, leading to varying interpretations. Some experts contend that the language spoken by an individual’s biological mother is the true mother tongue, while others argue that it encompasses the language of the immediate environment. Cummins [3] asserts that children who receive education in their mother tongue are more likely to excel academically and achieve better long-term educational outcomes. Similarly, Baker [4] provides a comprehensive overview of bilingual education and bilingualism, defining the mother tongue as the first language a child learns, typically spoken at home. Both emphasise the importance of preserving and developing the mother tongue in bilingual education, as it can facilitate academic success and social integration.
The concept of a mother tongue, also known as a first language (L1), has been defined in various ways by scholars from different disciplines. In linguistics, it is often defined as the language that a person learns naturally from birth or early childhood and in which they have a high level of proficiency [5]. In education, the mother tongue can also refer to the language used as a medium of instruction in schools, including in multilingual contexts [6].
However, the definition and concept of the mother tongue have evolved, and there are different views on what it covers. For example, some scholars have criticised the narrow focus on language proficiency in the traditional definition of the mother tongue and have emphasised the sociocultural and affective aspects of language learning and use [7]. Kamusella [8] argued that language is a political construct and that identities can be constructed and contested through language use. He also emphasised the importance of linguistic diversity and multilingualism in creating more inclusive societies. In another study, Gorter et al. [9] posited that language use is complex and dynamic and that individuals can have multiple and changing linguistic identities based on their social context and experiences. They also highlighted the importance of acknowledging and valuing linguistic diversity in education and society.
One strand of research has focused on converting one language into another, in either written or spoken form. For example, Fei et al. [10] dealt with the problem of incomplete semantic role labelling in low-resource languages; their method converts the labels from the source language to the target language. Ye et al. [11] dealt with synthesising speech in various languages from text data (text-to-speech) and addressed the problem of incorrect pronunciation. They proposed a triplet training scheme composed of an anchor, a positive, and a negative sample to cover unseen cases. A similar problem was addressed by Zhou et al. [12], who tried to improve pronunciation when converting speech into another language, using cross-lingual voice conversion techniques. Finally, Vaswani et al. [13] proposed a new, simple network architecture, the “Transformer”, for translating one language to another and achieved outstanding results.
Two important terms related to the mother tongue are “language loss” and “language policy”. Language loss refers to the gradual or rapid decline in the proficiency or use of an individual’s mother tongue, often due to language change or death [14]. A study of the effect of computer and Internet use on language loss would be worthwhile. Language policy refers to decisions and practices related to language use in various sectors, such as education, government, media, and commerce, which can significantly impact the status, use, and development of the mother tongue and other languages [15].
Regarding the detection of the mother tongue, Mechti et al. [16] used a gated recurrent unit (GRU) network to build a deep learning model that identifies the mother tongue of Arabic language learners, an essential aspect of language education. The primary objective was to recognise each learner’s mother tongue so that language learning strategies can be personalised. Writing samples produced by the learners are presented as input to the proposed model. A pre-trained word embedding layer transforms the input text into a sequence of vectors, which is then passed to the GRU network to capture the long-term dependencies in the input, given its ability to model sequential data. The model is trained on a dataset of writing samples from Arabic language learners with different mother tongues and is evaluated on a separate test set. The results show that the proposed model outperforms several baseline models and achieves high accuracy in identifying the mother tongue of Arabic language learners. In addition, Siddhant et al. [17] investigated the use of pronunciation information for speaker and language recognition.
They tested their models on conversational speech datasets in multiple languages and found that pronunciation information improves the accuracy of mother tongue recognition. The papers presented in this discussion demonstrate the diverse range of approaches and methods that researchers have used to identify mother tongues. In addition to the tools used in the works mentioned above, many others can also be used to find a user’s mother tongue. Some of them are attention models [18,19], transformer models [20,21], and graph models [22,23]. By developing more accurate and efficient methods for mother tongue recognition, researchers can potentially improve language-related applications such as speech recognition, language teaching, and natural language processing.
One of the earliest studies on keystroke dynamics (KD) was conducted by Gaines et al. in 1980 [24], where they investigated the variation in typing patterns between individuals and found that individuals had unique typing patterns that could be used for identification purposes. Since then, several studies have focused on using KD in authentication systems. For example, Monrose et al. [25] proposed a keyboard-based authentication system that used neural networks to identify users based on their typing patterns, achieving high accuracy rates. Bergadano et al. [26] also conducted a study on using KD for biometric authentication, developing a model based on KD and evaluating its effectiveness through experiments. Other studies have explored using KD to detect impostors and anomalies in typing behaviour. Killourhy and Maxion [27] used a dataset from users typing a fixed text at regular intervals over several weeks to train and test anomaly detection algorithms, such as Principal Component Analysis, Mahalanobis distance, and Support Vector Machines. They found that the performance of the algorithms varied depending on the specific keystroke features being analysed.
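As an illustration of distance-based anomaly detection on keystroke timings, the following minimal Python sketch scores a typing sample against a profile built from a user's genuine samples. It is not the implementation evaluated in [27]: it assumes a diagonal covariance matrix, under which the Mahalanobis distance reduces to a standardised Euclidean distance, and the function names and timing values are hypothetical.

```python
from math import sqrt
from statistics import mean, pstdev


def train_profile(samples):
    """Build a per-feature (mean, std) profile from genuine timing vectors.

    samples: list of equal-length lists of timings in milliseconds.
    """
    columns = list(zip(*samples))
    means = [mean(c) for c in columns]
    # A small floor avoids division by zero for features with no variation.
    stds = [max(pstdev(c), 1e-6) for c in columns]
    return means, stds


def anomaly_score(profile, vector):
    """Standardised Euclidean distance from the profile.

    Assumption: features are treated as uncorrelated (diagonal covariance),
    which is a simplification of the full Mahalanobis distance.
    """
    means, stds = profile
    return sqrt(sum(((x - m) / s) ** 2 for x, m, s in zip(vector, means, stds)))


# Hypothetical timing vectors (ms) for the same fixed text typed four times.
genuine = [[100, 150, 90], [110, 140, 95], [105, 160, 85], [95, 150, 90]]
profile = train_profile(genuine)

score_genuine = anomaly_score(profile, [102, 150, 90])
score_impostor = anomaly_score(profile, [60, 250, 140])
```

A sample is flagged as an impostor when its score exceeds a threshold chosen on held-out data; which timing features go into the vectors is itself a design choice that affects detector performance.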
Gunetti and Picardi [28] analysed the KD of free text to investigate its feasibility as a biometric authentication mechanism for text entry, obtaining promising results with low false alarm rates and impostor pass rates. KD has also been explored in user classification and recognition of the user’s physical or mental condition. Tsimperidis and Arampatzis [29] attempted to identify characteristics of users, such as gender, age, and handedness, using KD features and a rotation forest classifier, achieving high accuracy rates in user profiling. Tsimperidis et al. [30] used keystroke durations and digram latencies extracted from a dataset to develop a system that could accurately distinguish the age group of an unknown user. Roy et al. [31] proposed a KD-based indicator for Parkinson’s disease screening at home, using ensemble learning and addressing key hypotheses related to the screening process to enhance the accuracy and effectiveness of the method.
As is evident from the literature, on the one hand, the identification of a user’s mother tongue has been attempted using various approaches, such as natural language processing methods and the exploitation of pronunciation information. On the other hand, KD has mainly been used for user authentication, for recognising some inherent or acquired characteristics of users, and for recognising users’ mental and physical state. However, to the best of our knowledge, KD has not so far been used to identify users’ mother tongue.
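To make the timing features discussed in this section concrete, the following minimal Python sketch derives keystroke durations and digram latencies from a log of key events. The event format, the function name, and the sample values are assumptions made for illustration; in particular, digram latency is taken here as the time between two consecutive key presses, which is one of several definitions in use.

```python
from collections import defaultdict
from statistics import mean


def extract_features(events):
    """Compute mean keystroke durations and mean digram latencies.

    events: list of (key, press_ms, release_ms) tuples ordered by press time
            (a hypothetical log format).
    """
    durations = defaultdict(list)
    latencies = defaultdict(list)

    # Keystroke duration: how long each key is held down.
    for key, press, release in events:
        durations[key].append(release - press)

    # Digram latency (assumed press-to-press): time between two consecutive presses.
    for (k1, p1, _), (k2, p2, _) in zip(events, events[1:]):
        latencies[(k1, k2)].append(p2 - p1)

    mean_durations = {k: mean(v) for k, v in durations.items()}
    mean_latencies = {d: mean(v) for d, v in latencies.items()}
    return mean_durations, mean_latencies


# Hypothetical log of a user typing "theth" (times in milliseconds).
events = [
    ("t", 0, 90),
    ("h", 150, 230),
    ("e", 270, 350),
    ("t", 600, 700),
    ("h", 740, 820),
]
mean_durations, mean_latencies = extract_features(events)
```

Per-key and per-digram averages like these can then be assembled into a fixed-length feature vector and passed to a classifier.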

References

  1. Ulker, M. The Approach of Learning a Foreign Language by Watching TV Series. Educ. Res. Rev. 2019, 14, 608–617.
  2. UNESCO. The International Year of Indigenous Languages: Mobilizing the International Community to Preserve, Revitalize and Promote Indigenous Languages; UNESCO Publishing: Paris, France, 2021.
  3. Cummins, J. Bilingual children’s mother tongue: Why is it important for education? Sprogforum 2001, 7, 15–20.
  4. Baker, C. Foundations of Bilingual Education and Bilingualism, 3rd ed.; Bilingual Education and Bilingualism; Multilingual Matters: Clevedon, UK, 2001.
  5. Grosjean, F. Bilingual: Life and Reality; Harvard University Press: Cambridge, MA, USA, 2010.
  6. Petrovic, J.E.; Olmstead, S. Language, Power, and Pedagogy: Bilingual Children in the Crossfire, by J. Cummins. Biling. Res. J. 2001, 25, 405–412.
  7. Pavlenko, A.; Blackledge, A. Negotiation of Identities in Multilingual Contexts; Multilingual Matters: Bristol, UK, 2004.
  8. Kamusella, T. The Politics of Language and Nationalism in Modern Central Europe; Palgrave Macmillan UK: London, UK, 2009.
  9. Gorter, D.; Zenotz, V.; Cenoz, J. Minority Languages and Multilingual Education: Bridging the Local and the Global; Educational Linguistics; Springer: Dordrecht, The Netherlands, 2014; Volume 18.
  10. Fei, H.; Zhang, M.; Ji, D. Cross-Lingual Semantic Role Labeling with High-Quality Translated Training Corpus. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 7014–7026.
  11. Ye, J.; Zhou, H.; Su, Z.; He, W.; Ren, K.; Li, L.; Lu, H. Improving Cross-Lingual Speech Synthesis with Triplet Training Scheme. In Proceedings of the 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022; pp. 6072–6076.
  12. Zhou, Y.; Wu, X.; Tian, X.; Li, H. Optimization of Cross-Lingual Voice Conversion with Linguistics Losses to Reduce Foreign Accents. IEEE/ACM Trans. Audio Speech Lang. Process. 2023, 31, 1916–1926.
  13. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All you Need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Curran Associates Inc.: New York, NY, USA; pp. 5998–6008.
  14. Hinton, L.; Hale, K.L. The Green Book of Language Revitalization in Practice; Brill: Leiden, The Netherlands; Boston, MA, USA, 2013.
  15. García, O.; Baetens Beardsmore, H. Bilingual Education in the 21st Century: A Global Perspective; Wiley-Blackwell Pub: Malden, MA, USA; Oxford, UK, 2009.
  16. Mechti, S.; Alroobaea, R.; Krichen, M.; Rubaiee, S.; Ahmed, A. Deep Learning Model for Identifying the Arabic Language Learners Based on Gated Recurrent Unit Network. Int. J. Adv. Comput. Sci. Appl. 2020, 11, 620–627.
  17. Siddhant, A.; Jyothi, P.; Ganapathy, S. Leveraging Native Language Speech for Accent Identification Using Deep Siamese Networks. In Proceedings of the 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Okinawa, Japan, 16–20 December 2017; pp. 621–628.
  18. Fei, H.; Zhang, Y.; Ren, Y.; Ji, D. Latent emotion memory for multi-label emotion classification. In Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 7692–7699.
  19. Wu, S.; Fei, H.; Ren, Y.; Ji, D.; Li, J. Learn from Syntax: Improving Pair-wise Aspect and Opinion Terms Extraction with Rich Syntactic Knowledge. In Proceedings of the 30th International Joint Conference on Artificial Intelligence (IJCAI-21), Montreal, QC, Canada, 19–26 August 2021; pp. 3957–3963.
  20. Thara, S.; Poornachandran, P. Transformer Based Language Identification for Malayalam-English Code-Mixed Text. IEEE Access 2021, 9, 118837–118850.
  21. Ranasinghe, T.; Zampieri, M. An Evaluation of Multilingual Offensive Language Identification Methods for the Languages of India. Information 2021, 12, 306.
  22. Huang, Y.-H.; Harryyanto, K.; Tsai, C.-W.; Pornvattanavichai, R.; Chen, Y.-S. Graph Knowledge Transfer for Offensive Language Identification with Graph Neural Networks. In Proceedings of the 23rd International Conference on Information Reuse and Integration for Data Science (IRI), San Diego, CA, USA, 9–11 August 2022; pp. 216–221.
  23. Mishra, P.; Tredici, M.D.; Yannakoudakis, H.; Shutova, E. Abusive Language Detection with Graph Convolutional Networks. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), ACL, Minneapolis, MN, USA, 2–7 June 2019; pp. 2145–2150.
  24. Gaines, R.S.; Lisowski, W.; Press, S.J.; Shapiro, N. Authentication by Keystroke Timing: Some Preliminary Results; Rand Corporation; R-2526-NSF. Rand: Santa Monica, CA, USA, 1980.
  25. Monrose, F.; Reiter, M.K.; Wetzel, S. Password Hardening Based on Keystroke Dynamics. In Proceedings of the 6th ACM Conference on Computer and Communications Security, Singapore, 1–4 November 1999; pp. 73–82.
  26. Bergadano, F.; Gunetti, D.; Picardi, C. User Authentication through Keystroke Dynamics. ACM Trans. Inf. Syst. Secur. 2002, 5, 367–397.
  27. Killourhy, K.S.; Maxion, R.A. Comparing Anomaly-Detection Algorithms for Keystroke Dynamics. In Proceedings of the 2009 IEEE/IFIP International Conference on Dependable Systems & Networks, Lisbon, Portugal, 29 June–2 July 2009; pp. 125–134.
  28. Gunetti, D.; Picardi, C. Keystroke Analysis of Free Text. ACM Trans. Inf. Syst. Secur. 2005, 8, 312–347.
  29. Tsimperidis, I.; Arampatzis, A. User Profiling Using Keystroke Dynamics and Rotation Forest. In Advances in Information Security, Privacy, and Ethics; Lobo, V., Correia, A., Eds.; IGI Global: Hershey, PA, USA, 2022; pp. 1–24.
  30. Tsimperidis, I.; Yucel, C.; Katos, V. Age and Gender as Cyber Attribution Features in Keystroke Dynamic-Based User Classification Processes. Electronics 2021, 10, 835.
  31. Roy, S.; Roy, U.; Sinha, D.; Pal, R.K. Imbalanced Ensemble Learning in Determining Parkinson’s Disease Using Keystroke Dynamics. Expert Syst. Appl. 2023, 217, 119522.