Privacy policies are the main method for informing Internet users of how their data are collected and shared. Automated privacy policy analysis, including machine learning methods, has grown in popularity over the last decade. The main goal is to give users a better understanding of how their data are used and to help them make informed decisions about their privacy.
1. Introduction
Natural language privacy policies serve as the primary means of disclosing data practices to consumers, informing them what data are collected and analyzed and how those data are kept private and secure. By reading these policies, users can increase their awareness of data privacy and better manage the risks associated with extensive data collection. However, for privacy policies to be genuinely useful, they must be easily comprehensible to the majority of users. Lengthy and vague policies fail to inform the average user and are therefore ineffective at ensuring data privacy awareness.
Privacy policies are often excessively long, requiring substantial time to read. Estimates suggest that the average Internet user would need around 400 hours per year to read every privacy policy they encounter [1]. This time investment may deter users from thoroughly reviewing policies, leading them to hurriedly click the “I agree” button without fully understanding the implications.
Recognizing the importance of readability, privacy regulations such as the General Data Protection Regulation (GDPR) mandate that privacy policies be concise, easy to understand, and written in plain language. Similarly, the California Consumer Privacy Act (CCPA) requires that policies be presented in a clear and straightforward manner, avoiding technical or legal jargon.
To enhance clarity and conciseness, the GDPR guidelines recommend writing in the active voice rather than the passive voice [2]. The active voice directs the reader’s attention to the performer of the action, reducing ambiguity and making the text more straightforward.
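As a simple illustration of how such a recommendation can be checked automatically, the following Python sketch flags passive-voice sentences using spaCy’s dependency labels. It is a minimal heuristic, not a tool from the cited literature, and it assumes the en_core_web_sm model is installed.

```python
# Minimal sketch: flag passive-voice sentences in a policy excerpt with spaCy.
# Assumes: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def passive_sentences(policy_text: str) -> list[str]:
    """Return sentences containing a passive construction (nsubjpass/auxpass)."""
    doc = nlp(policy_text)
    return [
        sent.text.strip()
        for sent in doc.sents
        if any(tok.dep_ in ("nsubjpass", "auxpass") for tok in sent)
    ]

text = "Your data may be shared with partners. We collect your email address."
print(passive_sentences(text))  # -> ['Your data may be shared with partners.']
```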
Additionally, policies become less comprehensible due to ambiguity, which occurs when a statement lacks clarity and can be interpreted in multiple ways. The use of imprecise language in a privacy policy hinders the clear communication of the website’s actual data practices. The presence of language qualifiers like “may”, “might”, “some”, and “often” contributes to ambiguity, as noted by the European Commission’s GDPR guidelines [2]. Recent research suggests an increasing use of terms such as “may include” and “may collect” in privacy policies, which may result in policies becoming more ambiguous over time [3].
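A rough measure of such ambiguity can be obtained by counting qualifier terms, as in the Python sketch below. The qualifier list is a small illustrative sample, not an authoritative lexicon from the cited guidelines.

```python
# Minimal sketch: count vagueness qualifiers in a privacy policy.
import re
from collections import Counter

QUALIFIERS = ["may", "might", "some", "often", "generally", "typically"]

def qualifier_counts(policy_text: str) -> Counter:
    """Count whole-word occurrences of each qualifier (case-insensitive)."""
    counts = Counter()
    lowered = policy_text.lower()
    for term in QUALIFIERS:
        counts[term] = len(re.findall(rf"\b{re.escape(term)}\b", lowered))
    return counts

policy = "We may collect some data. Partners might often receive identifiers."
print(qualifier_counts(policy))
```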
2. Automated Privacy Policy Analysis
2.1. Privacy Policy Datasets
Various privacy policy datasets have been made available to researchers (see Table 1), with the Usable Privacy Policy Project [4] playing a significant role in this regard. Their OPP-115 corpus [5] contains annotated segments from 115 website privacy policies, enabling advanced machine learning research and automated analysis. Another dataset from the same project is the OptOutChoice-2020 corpus [6], which includes privacy policy sentences labeled with opt-out choice types. PolicyIE [7] offers a more recent dataset with annotated data practices, covering intent classification and slot filling, based on 31 web and mobile app privacy policies. Nokhbeh Zaeem and Barber [8] created a corpus of over 100,000 privacy policies, categorized into 15 website categories using the DMOZ directory. PrivaSeer [9] is a privacy policy dataset and search engine containing approximately 1.4 million website privacy policies; it was built from 2019 and 2020 web crawls, using URLs from “Common Crawl” and the “Free Company Dataset”. Finally, Amos et al. [3] released the Princeton-Leuven Longitudinal Corpus of Privacy Policies, a large-scale corpus spanning two decades and consisting of one million privacy policy snapshots from around 130,000 websites, enabling the study of trends and changes over time.
Table 1. Publicly available privacy policy datasets.

| Dataset | # Policies | # Websites | Timeframe | Labeling |
|---|---|---|---|---|
| OPP-115 | 115 | 115 | 2015 | Yes |
| OptOutChoice-2020 | 236 | 236 | - | Yes |
| PolicyIE | 400 | 400 (websites + apps) | 2019 | Yes |
| DMOZ-based Corpus | 117,502 | - | 2020 | No |
| PrivaSeer | 1,005,380 | 995,475 | 2019 | No |
| Princeton-Leuven Corpus | 910,546 | 108,499 | 1997–2019 | No |
2.2. Classification and Information Extraction
Classification and information extraction from privacy policies have been widely explored using machine learning techniques. Kaur et al. [10] employed unsupervised methods such as Latent Dirichlet Allocation (LDA) and term frequency analysis to examine keywords and content in 2000 privacy policies. Supervised learning approaches have also been used, including classifiers trained on the OPP-115 dataset. Audich et al. [11] compared the performance of supervised and unsupervised algorithms for labeling policy segments, while Kumar et al. [12] trained privacy-specific word embeddings for improved results. Deep learning models such as CNN, BERT, and XLNet have further improved classification performance [13,14,15]. Bui et al. [16] tackled the extraction of personal data objects and actions using a BLSTM model with contextual word embeddings. Alabduljabbar et al. [17,18] proposed a pipeline called TLDR for the automatic categorization and highlighting of policy segments, enhancing user comprehension. Extracting opt-out choices from privacy policies has also been studied [6,19,20]. In the field of summarization, Keymanesh et al. [21] introduced a domain-guided approach for privacy policy summarization, focusing on labeling privacy topics and extracting the riskiest content. Several studies have worked on developing automated privacy policy question-answering assistants [22,23,24].
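To make the supervised setting concrete, the following Python sketch trains a simple baseline segment classifier, in the spirit of classifiers trained on OPP-115. The tiny training set is invented for illustration; real work would use the annotated corpus and far stronger models such as the BERT variants cited above.

```python
# Minimal sketch: a TF-IDF + logistic regression baseline for labeling
# policy segments with OPP-115-style categories. Toy data for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

segments = [
    "We collect your email address when you register.",
    "We share device identifiers with advertising partners.",
    "Data are retained for twelve months after account closure.",
    "Third parties may receive aggregated usage statistics.",
]
labels = [
    "First Party Collection/Use",
    "Third Party Sharing/Collection",
    "Data Retention",
    "Third Party Sharing/Collection",
]

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
clf.fit(segments, labels)

print(clf.predict(["Your browsing history may be disclosed to ad networks."]))
```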
Furthermore, the PrivacyGLUE benchmark [25] was proposed to address the lack of comprehensive benchmarks specifically designed for privacy policies. The benchmark includes performance evaluations of transformer language models and emphasizes the importance of in-domain pre-training for privacy policies.
2.3. Privacy Policy Applications for Enhancing Users’ Comprehension
Applications that enhance the comprehension of privacy policies have been developed to give users useful and visually appealing presentations of policy information. PrivacyGuide [26] employs a two-step multi-class approach, identifying relevant privacy aspects and predicting risk levels using a model trained on a labeled dataset; its user interface uses colored icons to indicate risk levels. Polisis [27,28] combines a summarization tool, a policy comparison tool, and a chatbot; its query system employs neural network classifiers trained on the OPP-115 dataset and privacy-specific language models. PrivacyCheck is a browser extension that extracts 10 privacy factors and displays their risk levels through icons and text snippets [29,30,31,32]. Opt-Out Easy is another browser extension that uses the OptOutChoice-2020 dataset to identify and present opt-out choices to users during web browsing [6,33].
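As a simplified illustration of the kind of task such extensions automate, the Python sketch below surfaces candidate opt-out links from a policy page using keyword matching. It is a naive heuristic, not the trained pipeline used by the cited tools; it requires beautifulsoup4.

```python
# Minimal sketch: find candidate opt-out links in policy HTML by keyword.
import re
from bs4 import BeautifulSoup

OPT_OUT_PATTERN = re.compile(r"opt[\s-]?out|unsubscribe|do not sell", re.I)

def opt_out_links(policy_html: str) -> list[tuple[str, str]]:
    """Return (anchor text, href) pairs whose text suggests an opt-out choice."""
    soup = BeautifulSoup(policy_html, "html.parser")
    return [
        (a.get_text(strip=True), a["href"])
        for a in soup.find_all("a", href=True)
        if OPT_OUT_PATTERN.search(a.get_text())
    ]

html = '<p>You may <a href="/choices">opt out of targeted ads</a> here.</p>'
print(opt_out_links(html))  # -> [('opt out of targeted ads', '/choices')]
```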
2.4. Regulatory Impact
Research has also focused on evaluating privacy policies for regulatory compliance, particularly in response to the implementation of the GDPR in Europe. The tool Claudette detects unfair clauses and evaluates privacy policy compliance with the GDPR [34,35]. KnIGHT (“Know your rIGHTs”) uses semantic text matching to map policy sentences to GDPR paragraphs [36]. Cejas et al. [37] and Qamar et al. [38] leveraged NLP and supervised machine learning to identify GDPR-relevant information in policies and assess their compliance. Similarly, Sánchez et al. [39] used manual annotations and machine learning to tag policies based on GDPR goals, offering both aggregated scores and fine-grained ratings for better understanding. Degeling et al. [40] and Linden et al. [41] examined the effects of the GDPR on privacy policies through longitudinal analysis, observing updates and changes in policy length and disclosures. Zaeem and Barber [42] compared pre- and post-GDPR policies using PrivacyCheck, highlighting deficiencies in transparency and explicit data processing disclosures. Libert [43] developed an automated approach to audit third-party data sharing in privacy policies.
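The semantic matching idea behind tools like KnIGHT can be sketched with sentence embeddings, as below. The model name and the two GDPR-style snippets are assumptions for illustration, not the cited system’s actual configuration; the sketch requires the sentence-transformers package.

```python
# Minimal sketch: match a policy sentence to the most similar GDPR-style
# provision via sentence embeddings and cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice

gdpr_paragraphs = [  # illustrative paraphrases, not official GDPR text
    "Personal data shall be collected for specified, explicit and legitimate purposes.",
    "The data subject shall have the right to obtain erasure of personal data.",
]
policy_sentence = "You can ask us to delete your personal information at any time."

# Embed both sides and rank GDPR paragraphs by cosine similarity.
scores = util.cos_sim(
    model.encode(policy_sentence, convert_to_tensor=True),
    model.encode(gdpr_paragraphs, convert_to_tensor=True),
)[0]
best = int(scores.argmax())
print(gdpr_paragraphs[best], float(scores[best]))
```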
2.5. Comprehensibility of Privacy Policies
Studies on privacy policy comprehensibility have examined deficiencies in readability, revealing that privacy policies are difficult to read and demonstrating correlations between readability measures [44,45]. Furthermore, researchers have examined changes in the length and readability of privacy policies over time [1,3].
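Readability measures of the kind used in these studies are straightforward to compute; the Python sketch below scores a policy excerpt with two common formulas via the textstat package. The excerpt is invented, and thresholds for what counts as “hard to read” vary by study.

```python
# Minimal sketch: score a policy excerpt with standard readability formulas.
import textstat

excerpt = (
    "We may disclose aggregated, de-identified information to third parties "
    "for analytics, advertising, and other lawful business purposes."
)

print("Flesch Reading Ease:", textstat.flesch_reading_ease(excerpt))
print("Flesch-Kincaid Grade:", textstat.flesch_kincaid_grade(excerpt))
```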
Other scholars have studied ambiguous content in privacy policies. Kaur et al. [10] and Srinath et al. [9] analyzed the use of ambiguous words in a corpus of 2000 policies. Furthermore, Kotal et al. [46] studied ambiguity in the OPP-115 dataset and showed that ambiguity negatively affects the ability to automatically evaluate privacy policies. Srinath et al. [9] reported on privacy policy length and the use of vague words in their PrivaSeer corpus of policies. Lebanoff and Liu [47] investigated the detection of vague words and sentences using deep neural networks.
2.6. Mobile Applications
The research community has also examined privacy policies in the context of mobile applications, establishing several corpora of mobile app privacy policies
[58,59][48][49]. Those policies are well-suited for compliance analysis, because they are studied along with the app code and the traffic generated by the app
[59,60][49][50].