Datasets in the Field of Click Fraud Detection: History

Researchers review the most relevant datasets used to detect and prevent ad click fraud with AI techniques. Private/non-open-source datasets, as well as datasets related to other fraud types, such as ad or impression fraud, are excluded.

  • click fraud
  • artificial intelligence
  • machine learning
  • deep learning

1. FDMA 2012 BuzzCity Dataset

1.1. Dataset Description

The FDMA BuzzCity dataset [1] was introduced in 2012. It has been used in several experiments [2][3][4][5][6][7][8][9][10][11][12][13][14][15] and is divided into two portions: publishers and clicks. The publisher dataset contains data relevant to the publisher profile, whereas the click dataset provides the click records associated with each publisher. The dataset was originally offered as part of a competition whose goal was to develop an effective model/technology to detect fake publishers and to understand patterns of publisher dishonesty based on publisher profiles and clickers’ click behavior. Each publisher carries one of three statuses: OK indicates that the publisher is benign; Fraud denotes that the publisher is illegitimate (intentionally generating high-cost clicks with no real interest in the ads by using automated software or click farms); and Observation indicates that the publisher’s status has not yet been verified, because the publisher is either new or has a high click volume that has not yet been declared fraudulent. Three sets of data are available in FDMA 2012 BuzzCity: a training set (to develop predictive models), a validation set (to select models), and a test set (to assess model generalizability). Each click dataset captures clicks over a period of three days, while the corresponding publisher dataset records the publishers who received at least one click in that period. Table 1 presents the statistics of the dataset.
Table 1. Statistics of the FDMA 2012 BuzzCity dataset.
Dataset      Time Period       No. of Clicks   Fraud Publishers   Observation Publishers   OK Publishers     Total Publishers
Train        9–11 Feb 2012     3,173,834       72 (2.34%)         80 (2.60%)               2929 (95.07%)     3081
Validation   23–25 Feb 2012    2,689,005       85 (2.77%)         84 (2.74%)               2895 (94.48%)     3064
Test         8–10 Mar 2012     2,598,815       82 (2.73%)         71 (2.37%)               2847 (94.90%)     3000

1.2. Raw Features

As stated previously, the FDMA 2012 BuzzCity dataset has two sub-datasets. Because the data are collected from mobile devices, certain features available in comparable desktop-network data are absent. The publisher dataset contains information about the publisher, such as ID, address, bank account, and status, as listed in Table 2. The click dataset, on the other hand, contains information on the source of the click, such as ID and IP address, as well as other attributes that describe click behavior, such as device and click time, as shown in Table 3. For privacy protection, most of these data have been anonymized. The studies did not use these features in their raw form to train click fraud detection models; instead, they applied a series of feature engineering steps based on statistical measures, combinations of two or three features, or other methods. For example, studies [3][5] calculated the average, variance, maximum, and entropy of the click time feature at various time intervals, alongside other raw features, and another study [10] split the click time feature into several temporal features, such as day, month, and period of the day. A minimal sketch of this kind of feature engineering is given after Table 3.
Table 2. Features in the publisher dataset.
Attribute Description
publisherid Unique identifier of a publisher.
bankaccount Bank account associated with a publisher (anonymized; may be missing/unknown).
address Mailing address of a publisher (anonymized; may be missing/unknown).
status Label of a publisher, which falls into three categories:
  • OK: Publishers whom BuzzCity deems as having healthy traffic (or those who slipped past its detection mechanisms).
  • Observation: Publishers who may have just started generating traffic or whose traffic statistics deviate from the system-wide average. BuzzCity has not yet taken a conclusive stand on these publishers.
  • Fraud: Publishers who are deemed fraudulent with clear proof. BuzzCity suspends their accounts, and their earnings are not paid.
Table 3. Features in the click dataset.
Attribute Description
id Unique identifier of a particular click.
numericip Public IP address of a clicker/visitor.
deviceua Phone model/agent used by a clicker/visitor.
publisherid Unique identifier of a publisher.
campaignid Unique identifier of a given advertisement campaign.
usercountry Country from which the clicker/visitor is.
clicktime Timestamp of a given click (in yyyy-mm-dd format).
referredurl URL where ad banners are clicked (anonymized; may be missing/unknown).
channel Publisher’s channel type, which consists of:
  • ad: Adult sites.
  • co: Community.
  • es: Entertainment and lifestyle.
  • gd: Glamour and dating.
  • in: Information.
  • mc: Mobile content.
  • pp: Premium portal.
  • se: Search, portal, services.
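
To illustrate the kind of statistical feature engineering applied to the click dataset in studies such as [3][5][10], the sketch below derives per-publisher aggregates from the raw click records. It is a minimal sketch assuming Python with pandas and SciPy; the file name, the one-hour bucketing, and the chosen statistics are illustrative assumptions, and the cited studies compute considerably richer feature sets.

```python
import pandas as pd
from scipy.stats import entropy  # Shannon entropy of a count distribution

# Hypothetical file name; the columns follow Table 3 (publisherid, clicktime, numericip, ...).
clicks = pd.read_csv("fdma2012_clicks_train.csv", parse_dates=["clicktime"])

# Bucket each click into a one-hour interval so per-interval click counts can be summarized.
clicks["hour_bucket"] = clicks["clicktime"].dt.floor("h")
per_interval = (
    clicks.groupby(["publisherid", "hour_bucket"])
    .size()
    .rename("clicks_per_hour")
    .reset_index()
)

def interval_entropy(counts: pd.Series) -> float:
    """Entropy of the distribution of a publisher's clicks across hourly intervals."""
    p = counts / counts.sum()
    return float(entropy(p, base=2))

# Per-publisher statistics over the hourly click counts (average, variance, maximum, entropy),
# plus the number of distinct IPs, in the spirit of [3][5].
features = per_interval.groupby("publisherid")["clicks_per_hour"].agg(
    avg_clicks="mean",
    var_clicks="var",
    max_clicks="max",
    entropy_clicks=interval_entropy,
)
features["distinct_ips"] = clicks.groupby("publisherid")["numericip"].nunique()
print(features.head())
```

The resulting per-publisher table can then be joined with the publisher labels from Table 2 to train a classifier.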

2. TalkingData AdTracking Dataset

2.1. Dataset Description

The TalkingData AdTracking dataset was put up on Kaggle in 2017 as a competition by the Chinese company TalkingData, which processes three billion clicks every day, 90% of which are potentially fraudulent [16]. Its method of detecting and preventing click fraud is to monitor and analyze users’ click journeys across its portfolio; IP addresses that generate many clicks but never install apps are flagged, and a blacklist of IP addresses and devices is then built from this information (a minimal sketch of this flagging heuristic is given after Table 4). The goal of the competition was to develop the best model for predicting whether a user would proceed to install an app after clicking on an ad and thereby distinguish fraudulent clicks from benign ones. The data provided were extensive, comprising a total of 203,694,359 real-time ad click records captured on a mobile platform, roughly 7 GB in size and spanning four days. Table 4 illustrates the statistics of the TalkingData dataset.
Table 4. Statistics of the TalkingData dataset.
Dataset   Time Period      No. of Clicks   Fraudulent (is_attributed = 0)   Non-Fraudulent (is_attributed = 1)
Train     6–9 Nov 2017     184,903,890     99.75%                           0.25%
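
The flagging heuristic described in Section 2.1 (many clicks from an IP but no app installs) can be sketched as follows, assuming pandas, a hypothetical sample file, and the column names listed in Table 5 below; the thresholds are illustrative, and TalkingData’s actual blacklisting rules are not public.

```python
import pandas as pd

# Hypothetical sample of the training data; column names follow Table 5.
clicks = pd.read_csv(
    "talkingdata_train_sample.csv", usecols=["ip", "click_time", "is_attributed"]
)

# Per-IP click volume and number of attributed clicks (i.e., clicks followed by an app install).
per_ip = clicks.groupby("ip").agg(
    n_clicks=("is_attributed", "size"),
    n_installs=("is_attributed", "sum"),
)
per_ip["install_rate"] = per_ip["n_installs"] / per_ip["n_clicks"]

# Flag IPs that generate many clicks but never lead to an install (thresholds are illustrative).
blacklist = per_ip[(per_ip["n_clicks"] >= 100) & (per_ip["n_installs"] == 0)].index
print(f"{len(blacklist)} IPs flagged out of {len(per_ip)}")
```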

2.2. Raw Features

The TalkingData dataset is generated from mobile phones and has been widely used in recent years to build and train click fraud detection models and approaches. It contains millions of click records described by eight features, detailed in Table 5; seven of these are independent features, whereas one, is_attributed, is the dependent feature to be predicted: a value of 1 indicates that the app was downloaded after the click (a legitimate conversion), whereas a value of 0 indicates that no download followed the click (treated as potentially fraudulent). Some studies used the raw features as they are, whereas others applied basic feature engineering. For instance, several studies extracted temporal features, such as minutes and seconds, from the click time feature [17][18][19][20][21][22], and grouped the IP feature with one or two other attributes [17][19][23]; a minimal sketch of both steps is given after Table 5.
Table 5. Features in the TalkingData dataset.
Attribute Description
ip IP address of the click.
app App ID for marketing.
device Device type ID of the user’s mobile phone.
os OS version ID of the user’s mobile phone.
channel Channel ID of the mobile ad publisher.
click_time Timestamp of the ad click (UTC).
attributed_time If the user downloads the app after clicking an ad, the time of the app download.
is_attributed The target to be predicted, indicating whether the app was downloaded.
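
The sketch below illustrates the two preprocessing steps mentioned above: extracting temporal components from click_time and deriving count features from groupings of ip with one or two other attributes. The file name, dtypes, and the particular feature combinations are assumptions chosen for illustration and do not reproduce any single cited study.

```python
import pandas as pd

# Hypothetical sample; compact dtypes matter because the full file has ~185 million rows.
dtypes = {"ip": "uint32", "app": "uint16", "device": "uint16",
          "os": "uint16", "channel": "uint16", "is_attributed": "uint8"}
df = pd.read_csv("talkingdata_train_sample.csv", dtype=dtypes,
                 usecols=list(dtypes) + ["click_time"], parse_dates=["click_time"])

# Temporal features extracted from click_time, as in [17][18][19][20][21][22].
df["day"] = df["click_time"].dt.day.astype("uint8")
df["hour"] = df["click_time"].dt.hour.astype("uint8")
df["minute"] = df["click_time"].dt.minute.astype("uint8")
df["second"] = df["click_time"].dt.second.astype("uint8")

# Count features obtained by grouping ip with one or two other attributes,
# in the spirit of [17][19][23]; the exact combinations here are illustrative.
df["ip_clicks"] = df.groupby("ip")["app"].transform("size")
df["ip_app_clicks"] = df.groupby(["ip", "app"])["channel"].transform("size")
df["ip_app_os_clicks"] = df.groupby(["ip", "app", "os"])["channel"].transform("size")

X = df.drop(columns=["click_time", "is_attributed"])
y = df["is_attributed"]
```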

3. Avazu Click-Through Rate Prediction Dataset

3.1. Dataset Description

The Avazu dataset was also presented on the Kaggle platform, in 2014, as part of a competition hosted jointly by Avazu and Kaggle to determine the best strategy for predicting the click-through rate (CTR), a vital measure for analyzing ad performance [24]. The dataset includes around 40 million records captured over 11 days, with 10 days serving as the training set and 1 day serving as the test set. Click prediction systems are critical and are commonly used in sponsored search to rank ad links. The goal of CTR prediction is to estimate the likelihood that advertisements on a website will be clicked; by predicting the CTR, an advertising agency can select the potential visitors who are most likely to engage with the advertisements. Table 6 illustrates the statistics of the dataset, and a minimal CTR prediction sketch is given after Table 6.
Table 6. Statistics of the Avazu click-through rate prediction dataset.
Dataset   No. of Clicks   Non-Clicked         Clicked
Train     40,000,000      40,140,000 (90%)    4,460,000 (10%)
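
CTR prediction on this dataset is typically framed as binary classification over high-cardinality categorical fields. The sketch below shows one common baseline under that framing: hashing "field=value" tokens into a sparse vector and fitting a logistic regression. The sample file name, the subset of fields, and the hashing dimensionality are assumptions for illustration; this is not the approach that won the competition.

```python
import pandas as pd
from sklearn.feature_extraction import FeatureHasher
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Hypothetical sample of the Avazu training file; a handful of the fields from Table 7
# (the raw column names are lowercase in the released CSV).
cols = ["click", "banner_pos", "site_id", "site_category",
        "app_id", "app_category", "device_type", "device_conn_type"]
df = pd.read_csv("avazu_train_sample.csv", usecols=cols)

# Turn each row into "field=value" tokens and hash them into a fixed-width sparse vector.
hasher = FeatureHasher(n_features=2**18, input_type="string")
tokens = df.drop(columns="click").astype(str).apply(
    lambda row: [f"{name}={value}" for name, value in row.items()], axis=1
)
X = hasher.transform(tokens)
y = df["click"].to_numpy()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("log loss:", log_loss(y_test, model.predict_proba(X_test)[:, 1]))
```

The hashing trick keeps memory bounded regardless of how many distinct hashed IDs appear in the data, which is why it is a common choice for fields such as site_id and app_id.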

3.2. Raw Features

The dataset includes 21 distinct features, some of which describe the ad, such as the ad ID and position, and others the source of the click, such as the device type and connection type. The target feature is the click class, which is binary: 0 means that an ad was not clicked, and 1 indicates that the visitor clicked the ad. About eight of the 21 features are anonymized, as illustrated in Table 7. These are categorical fields that contain specific data about user and advertiser profiles and are hashed to unique values to enable investigators to construct feature vectors [25]. These anonymous fields and their meanings were not publicly revealed or investigated in the studies researchers examined; however, in other experiments it was inferred, on the basis of the hierarchy of the unknown features, that C14 is the ad ID, C17 is the ad group ID, and C21 is the ad sponsor ID [26]. All the studies that applied this dataset used all the information provided in it to detect click fraud, along with splitting the click time feature into month, day, and hour and determining the frequency of clicks over 10 h intervals [18][22][27]; a minimal sketch of this preprocessing follows Table 7.
Table 7. Features in the Avazu click-through rate prediction dataset.
Attribute Definition
ID The unique identifier that corresponds to one occurrence of an advertisement. This is a continuous variable.
Hour The hour, in YYMMDDHH format. Researchers could break this down and add additional features during the cleaning process. This is a continuous variable.
Banner_pos The position on the screen where the advertisement was displayed; it indicates how prominently the advertisement was placed to get the user’s attention. This is a categorical integer.
Site_id The identifier that uniquely identifies the site on which the advertisement was displayed. This is a hashed value.
Site_domain The domain of the website on which the advertisement was displayed.
Site_category A categorical variable representing the field to which the website belongs. It can be used to check whether any site category attracts more visitors during a particular time.
App_id The identifier that uniquely identifies the mobile application in which the advertisement was displayed. This is a hashed value.
App_domain The domain of the application in which the advertisement was displayed.
App_category A categorical variable representing the field to which the application belongs. It can be used to check whether any app category attracts more visitors during a particular time; it is analogous to the site category and can be compared against it to check whether apps receive more clicks than websites.
Device_id The unique identifier that marks the device from which the click was captured. This is a hashed continuous variable and can be repeated in the dataset.
Device_ip The IPv4 address of the device from which the click was received, hashed to a different value for privacy reasons to prevent tracing back to the device.
Device_model The model of the device. Researchers chose not to use this value.
Device_type The type of the device; a categorical variable with around seven categories.
Device_conn_type A hashed value describing the connection type. Researchers did not use this value when forming the feature vector.
C1 An anonymous variable. It has influence over the prediction.
C14–C21 Anonymous categorical variables that might contain information about the advertisers’ and users’ profiles, such as age and gender.
Click The target variable: 0 means the advertisement was not clicked, and 1 means it was clicked. This is a binary categorical variable.
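
A minimal sketch of the preprocessing just described (splitting the YYMMDDHH hour field into calendar components and counting clicks over 10 h intervals) is given below. The sample file name and the choice of device_id as the grouping key are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample; the raw "hour" field is an integer in YYMMDDHH format (see Table 7).
df = pd.read_csv("avazu_train_sample.csv", usecols=["id", "hour", "device_id", "click"])

# Split YYMMDDHH into month, day, and hour of day, as done in [18][22][27].
ts = pd.to_datetime(df["hour"].astype(str), format="%y%m%d%H")
df["month"] = ts.dt.month
df["day"] = ts.dt.day
df["hour_of_day"] = ts.dt.hour

# Frequency of clicks per device within a 10 h window (the grouping key is illustrative).
df["window_10h"] = ts.dt.floor("10h")
df["clicks_in_10h"] = df.groupby(["device_id", "window_10h"])["id"].transform("size")
print(df.head())
```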

4. Challenges in the Common Datasets in the Field of Click Fraud Detection

Most datasets related to click fraud detection suffer from class imbalance: the negative class (i.e., fraudulent clicks) accounts for only a small share of the records, while the benign class (i.e., legitimate clicks) makes up most of the data, causing the prediction model to be skewed toward the majority class. Previous studies used various resampling procedures to counter this, such as up/down sampling and SMOTE; a minimal resampling sketch follows this paragraph. In up-sampling, minority-class samples are replicated until they equal the majority class in number; under-sampling, on the other hand, randomly removes majority-class records until the class distribution is balanced. This issue was addressed in the FDMA 2012 BuzzCity datasets through several workarounds. Under-sampling, for example, was used in [28], in which a subset of majority-class records was selected and combined with all minority-class records. Another study [17] addressed the same issue while aiming to minimize both the information loss caused by undersampling and the overfitting caused by oversampling; accordingly, positive samples were oversampled and negative samples were undersampled. Furthermore, the QDPSKNN method [13] was applied to address dataset imbalance while reducing capacity requirements and improving runtime performance: it divides the data into four quadrants and then undersamples to balance the class distribution. The authors of [29] also showed that, for imbalanced class distributions in nonlinear classification problems involving mixed variable types and noisy or missing patterns, tree-based ensemble classifiers combined with backward feature elimination can produce promising results.
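
The sketch below is a minimal illustration of the resampling procedures mentioned above, using the imbalanced-learn library on synthetic data standing in for an imbalanced click dataset; the sampling ratios are illustrative and are not those used in the cited studies.

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification

# Synthetic stand-in for an imbalanced click fraud dataset (about 1% positive class).
X, y = make_classification(n_samples=50_000, n_features=10,
                           weights=[0.99, 0.01], random_state=0)
print("original class counts:", np.bincount(y))

# Oversample the minority class with SMOTE, then randomly undersample the majority class,
# echoing the combined over/undersampling strategy described for [17]; ratios are illustrative.
X_over, y_over = SMOTE(sampling_strategy=0.10, k_neighbors=5,
                       random_state=0).fit_resample(X, y)
X_bal, y_bal = RandomUnderSampler(sampling_strategy=0.50,
                                  random_state=0).fit_resample(X_over, y_over)
print("resampled class counts:", np.bincount(y_bal))
```

Unlike simple up-sampling, SMOTE synthesizes new minority samples by interpolating between existing neighbors rather than duplicating records, which reduces (but does not eliminate) the risk of overfitting to repeated minority points.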
On the other hand, the TalkingData dataset also suffers from a massive imbalance: legitimate clicks (is_attributed = 1, i.e., clicks followed by an app download) represent only 0.25% of the entire dataset, while potentially fraudulent clicks (is_attributed = 0) represent 99.75%. The problem with training a model on such an imbalanced dataset is that it becomes skewed toward the majority class, which is an issue when the minority class is the one researchers want to predict. Several earlier studies attempted to overcome this challenge when training models. For example, in [30], a 15% random selection of unique IPs was used, followed by an 8% stratified sample from the remainder, to decrease the data size; to address the imbalance, SMOTE [31] with five neighbors was applied, and the positive class was oversampled by 11%. In [12], undersampling was used to balance the skewed dataset by preserving all data in the minority class and decreasing the volume of the majority class. The authors suggested that the test set should be more comprehensive, so the original training set was divided into new training and test sets, and additional samples were then drawn from the new test set by selecting clicks with is_attributed = 0 and is_attributed = 1 at a 1:1 ratio, resulting in a set containing 50% legitimate and 50% fraudulent clicks. Furthermore, they trained the different ML models on training samples of varying sizes, with the goal of enhancing precision while lowering overfitting. In another approach to this issue [14], the entire dataset was divided into six classes based on the IP and app ID features:
  • Class 1: 25,974 rows with a unique IP count < 20 and an app ID frequency < 70%.
  • Class 2: 27,174 rows with a unique IP count ≥ 20 and < 1000 and an app ID frequency < 70%.
  • Class 3: 112,790 rows with a unique IP count ≥ 1000 and an app ID frequency < 70%.
  • Class 4: 784,964 rows with a unique IP count < 20 and an app ID frequency ≥ 70%.
  • Class 5: 19,914,810 rows with a unique IP count ≥ 20 and < 1000 and an app ID frequency ≥ 70%.
  • Class 6: 164,038,178 rows with a unique IP count ≥ 1000 and an app ID frequency ≥ 70%.
The classifiers were then tested separately on each of these classes. In another study [32], the model was trained using only the first class.
According to the statistics in Table 6, there is a significant class disparity in the Avazu click-through rate prediction dataset: 90% of the recorded ad impressions were viewed but not clicked, while only 10% were clicked. The studies that used this dataset worked with one million samples from it [22][27] and applied an undersampling technique to balance the data [33].

This entry is adapted from the peer-reviewed paper 10.3390/jsan12010004

References

  1. Thejas, G.S.; Hariprasad, Y.; Iyengar, S.S.; Sunitha, N.R.; Badrinath, P.; Chennupati, S. An extension of Synthetic Minority Oversampling Technique based on Kalman filter for imbalanced datasets. Mach. Learn. Appl. 2022, 8, 100267.
  2. Weideman, M.; Kritzinger, W. Parallel search engine optimisation and pay-per-click campaigns: A comparison of cost per acquisition. S. Afr. J. Inf. Manag. 2017, 19, 1–13.
  3. Stone-Gross, B.; Stevens, R.; Zarras, A.; Kemmerer, R.; Kruegel, C.; Vigna, G. Understanding fraudulent activities in online ad exchanges. In Proceedings of the 2011 ACM SIGCOMM Conference on Internet Measurement Conference, Berlin, Germany, 2–4 November 2011; pp. 279–294.
  4. Li, Z.; Zhang, K.; Xie, Y.; Yu, F.; Wang, X.F. Knowing your enemy: Understanding and detecting malicious Web advertising. In Proceedings of the 2012 ACM Conference on Computer and Communications Security-CCS, Raleigh, NC, USA, 16–18 October 2012; pp. 674–686.
  5. Berrar, D. Random forests for the detection of click fraud in online mobile advertising. In Proceedings of the 1st International Workshop on Fraud Detection in Mobile Advertising (FDMA), Singapore, 4 November 2012; pp. 1–10. Available online: http://berrar.com/resources/Berrar_FDMA2012.pdf (accessed on 17 August 2022).
  6. Yan, J.H.; Jiang, W.R. Research on information technology with detecting the fraudulent clicks using classification method. Adv. Mater. Res. 2014, 859, 586–590.
  7. Perera, K.S.; Neupane, B.; Faisal, M.A.; Aung, Z.; Woon, W.L. A novel ensemble learning-based approach for click fraud detection in mobile advertising. Lect. Notes Comput. Sci. 2013, 8284, 370–382.
  8. Oentaryo, R.; Lim, E.P.; Finegold, M.; Lo, D.; Zhu, F.; Phua, C.; Cheu, E.Y.; Yap, G.E.; Sim, K.; Nguyen, M.N.; et al. Detecting click fraud in online advertising: A data mining approach. J. Mach. Learn. Res. 2014, 15, 99–140.
  9. Xu, H.; Liu, D.; Koehl, A.; Wang, H.; Stavrou, A. Click fraud detection on the advertiser side. Lect. Notes Comput. Sci. 2014, 8713, 419–438.
  10. Vani, M.S.; Bhramaramba, R.; Vasumati, D.; Babu, O.Y. TUI based touch-spam detection in mobile applications to increase the security from advertisement networks. Int. J. Adv. Comput. Commun. Control 2014, 2, 17–22.
  11. Li, Z.; Jia, W. The Study on Preventing Click Fraud in Internet Advertising. J. Comput. 2020, 31, 256–265.
  12. Li, W.; Zhong, Q.; Zhao, Q.; Zhang, H.; Meng, X. Multimodal and Contrastive Learning for Click Fraud Detection. arXiv 2021, arXiv:2105.03567.
  13. Dekou, R.; Savo, S.; Kufeld, S.; Francesca, D.; Kawase, R. Machine Learning Methods for Detecting Fraud in Online Marketplaces. In Proceedings of the 2021 International Workshop on Privacy, Security, and Trust in Computational Intelligence, Gold Coast, QLD, Australia, 1–5 November 2021; Volume 3052.
  14. Zhang, X.; Liu, X.; Guo, H. A click fraud detection scheme based on cost sensitive BPNN and ABC in mobile advertising. In Proceedings of the 2018 IEEE 4th International Conference on Computer and Communications (ICCC), Chengdu, China, 7–10 December 2018; pp. 1360–1365.
  15. Harsha, C.; Aswale, S.; Pawar, V.N. Advertisement Click Fraud Detection Using Machine Learning Techniques. In Proceedings of the 2021 International Conference on Technological Advancements and Innovations (ICTAI), Tashkent, Uzbekistan, 10–12 November 2021; Volume 10, ISBN 9789811696695.
  16. Buzzcity Mobile Advertisement Dataset. 2014. Available online: https://larc.smu.edu.sg/buzzcity-mobile-advertisement-dataset (accessed on 14 September 2022).
  17. Guo, Y.; Shi, J.; Cao, Z.; Kang, C.; Xiong, G.; Li, Z. Machine learning based cloudbot detection using multi-layer traffic statistics. In Proceedings of the 2019 IEEE 21st International Conference on High Performance Computing and Communications; IEEE 17th International Conference on Smart City; IEEE 5th International Conference on Data Science and Systems (HPCC/SmartCity/DSS), Zhangjiajie, China, 10–12 August 2019; pp. 2428–2435.
  18. Sisodia, D.; Sisodia, D.S. Gradient boosting learning for fraudulent publisher detection in online advertising. Data Technol. Appl. 2021, 55, 216–232.
  19. Dash, A.; Pal, S. Auto-Detection of Click-Frauds using Machine Learning Auto-Detection of Click-Frauds using Machine Learning. Int. J. Eng. Sci. Comput. 2020, 10, 27227–27235.
  20. Sisodia, D.; Sisodia, D.S. Feature space transformation of user-clicks and deep transfer learning framework for fraudulent publisher detection in online advertising. Appl. Soft Comput. 2022, 125, 109142.
  21. Borgi, M. Advertisement Click Fraud Detection System: A Survey; Springer: Singapore, 2021; Volume 10, ISBN 9789811696695.
  22. Iqbal, S.; Zulkernine, M.; Jaafar, F.; Gu, Y. Protecting internet users from becoming victimized attackers of click-fraud. J. Softw. Evol. Process 2016, 30, e1871.
  23. Hu, J.; Li, T.; Zhuang, Y.; Huang, S.; Dong, S. GFD: A Weighted Heterogeneous Graph Embedding Based Approach for Fraud Detection in Mobile Advertising. Secur. Commun. Netw. 2020, 2020, 1–12.
  24. TalkingData. TalkingData AdTracking Fraud Detection Challenge. 2017. Available online: https://www.kaggle.com/c/talkingdata-adtracking-fraud-detection (accessed on 14 September 2022).
  25. Click-Through Rate Prediction. 2014. Available online: https://www.kaggle.com/c/avazu-ctr-prediction (accessed on 12 November 2022).
  26. Ramanathan, M. An Ensemble Model for Click through Rate Prediction. 2019. Available online: https://scholarworks.sjsu.edu/etd_projects/697 (accessed on 14 September 2022).
  27. Crussell, J.; Stevens, R.; Chen, H. MAdFraud: Investigating ad fraud in Android applications. In Proceedings of the MobiSys 2014—12th Annual International Conference on Mobile Systems, Applications, and Services, Bretton Woods, NH, USA, 16–19 June 2014; pp. 123–134.
  28. Oentaryo, R.J.; Lim, E. Mining Fraudulent Patterns in Online Advertising. In Proceedings of the First International Network on Trust (FINT) Workshop 2013, Singapore, 21–23 November 2013; pp. 21–23.
  29. Perera, B.K. A Class Imbalance Learning Approach to Fraud Detection in Online Advertising. Citeseer 2013. Available online: https://pdfs.semanticscholar.org/24ca/e6b0d1d192e6421905dc65fe8efa2d4343d9.pdf (accessed on 22 August 2022).
  30. Mikkili, B.; Sodagudi, S. Advertisement Click Fraud Detection Using Machine Learning Algorithms. Smart Innov. Syst. Technol. 2022, 282, 353–362.
  31. Zhou, X.; Yan, P. Avazu Click—Through Rate Prediction Problem description. 2015. Available online: http://techblog.youdao.com/wp-content/uploads/2015/03/Avazu-CTR-Prediction.pdf (accessed on 15 September 2022).
  32. Pan, L.; Mu, S.; Wang, Y. User click fraud detection method based on Top-Rank- k frequent pattern mining. Int. J. Mod. Phys. B 2019, 33, 1950150.
  33. Srivastava, A. Real-Time Ad Click Fraud Detection. 2020, pp. 1–63. Available online: https://scholarworks.sjsu.edu/etd_projects/916 (accessed on 4 September 2022).