Datasets in the Field of Click Fraud Detection

Datasets in the Field of Click Fraud Detection: Comparison

Please note this is a comparison between Version 1 by Reem A. ALZAHRANI and Version 2 by Dean Liu.

In this section, we go over the most relevant datasets mentioned in the literature review for detecting and preventing ad click fraud using AI techniques. Private/non-open-source datasets, as well as those related to other fraud types, such as ad or impression fraud, are excluded. We discuss the descriptions, raw features, and some of the issues with the datasets, as identified in the literature review, as well as how these issues are addressed.

click fraud
artificial intelligence
machine learning
deep learning

1. FDMA 2012 BuzzCity Dataset

1.1. Dataset Description

The FDMA BuzzCity dataset ^[1][79] was introduced in 2012. This dataset has been used in several experiments ^{[2][3][4][5][6][7][8][9][10][11][12][13][14][15]}[26,28,29,30,31,32,35,37,39,43,59,62,66,73] and is divided into two portions: publishers and clicks. The publisher dataset contains data relevant to the publisher profile, whereas the click dataset provides click records associated with each publisher. All field descriptions for both datasets are shown in Section 4.1.2. This dataset was originally offered as part of a competition with the goal of developing an effective model/technology to detect fake publishers and understand patterns in publishers’ lack of credibility based on publishers’ profiles and clickers’ click behavior. Each publisher has one of three statuses: OK indicates that the publisher is benign, fraud denotes that the publisher is illegitimate (intentionally generates high-cost clicks with no real interest in the ads by using automated software or click farms), and observation indicates that the publisher’s status has not been verified because the publisher is either new or has too many clicks but is not yet declared fraudulent. There are three sets of data available in FDMA 2012 BuzzCity: a training set (to develop predictive models), a validation set (to select models), and a test set (to test model generalizability). Each click dataset captures the clicks recorded over a period of three days, while each publisher dataset records the publishers who received at least one click. Table 1 presents the statistics of the dataset.

Table 1.

Statistics of the FDMA 2012 BuzzCity dataset.

			No. of Publishers
Dataset	Time Period	No. of Clicks	Fraud	Observation	OK	Total
Train	9–11 Feb 2012	3,173,834	72 (2.34%)	80 (2.60%)	2929 (95.07%)	3081
Validation	23–25 Feb 2012	2,689,005	85 (2.77%)	84 (2.74%)	2895 (94.48%)	3064
Test	8–10 Mar 2012	2,598,815	82 (2.73%)	71 (2.37%)	2847 (94.90%)	3000

1.2. Raw Features

As stated previously, the FDMA 2012 BuzzCity dataset has two sub-datasets. Because the data are obtained through mobile devices, certain features found in similar data from desktop computer networks are not available. The publisher dataset contains information about the publisher, such as ID, address, bank account, and status, as listed in Table 2. The click dataset, on the other hand, contains information on the source of the click, such as ID and IP, as well as other attributes that describe the click behavior, such as device and click time, as shown in Table 3. For the sake of privacy protection, most of these data have been anonymized. The studies in the literature review did not use these features in their raw form to train models for detecting click fraud but instead engineered new features from them using statistical approaches, combinations of two or three features, or other methods. For example, studies ^[3][5][28,30] calculated the average, variance, maximum, and entropy of the click time feature at various time intervals, as well as of other raw features. Another study ^[10][39] separated the click time feature into several time features, such as day, month, and period of day. Section 3 will provide further details on the new features derived using feature engineering.

Table 2.

Features in the publisher dataset.

Attribute	Description
publisherid	Unique identifier of a publisher.
bankaccount	Bank account associated with a publisher (anonymized; may be missing/unknown).
address	Mailing address of a publisher (anonymized; may be missing/unknown).
status	Label of a publisher, which falls into three categories: OK: Publishers whom BuzzCity deems as having healthy traffic (or those who slipped their detection mechanisms). Observation: Publishers who may have just started their traffic or whose traffic statistics deviate from the system-wide average. BuzzCity does not have any conclusive stand with these publishers yet. Fraud: Publishers who are deemed as fraudulent with clear proof. BuzzCity suspends their accounts and their earnings will not be paid.

Table 3.

Features in the click dataset.

Attribute	Description
id	Unique identifier of a particular click.
numericip	Public IP address of a clicker/visitor.
deviceua	Phone model/agent used by a clicker/visitor.
publisherid	Unique identifier of a publisher.
campaignid	Unique identifier of a given advertisement campaign.
usercountry	Country from which the clicker/visitor is.
clicktime	Timestamp of a given click (in yyyy-mm-dd format).
referredurl	URL where ad banners are clicked (anonymized; may be missing/unknown).
channel	Publisher’s channel type, which consists of: ad: Adult sites. co: Community. es: Entertainment and lifestyle. gd: Glamour and dating. in: Information. mc: Mobile content. pp: Premium portal. se: Search, portal, services.

2. TalkingData AdTracking Dataset

2.1. Dataset Description

The TalkingData AdTracking dataset was put up on Kaggle in 2017 as a competition by the Chinese company TalkingData, which processes three billion clicks every day, 90% of which are possibly illegitimate ^[16][80]. Its method of detecting and preventing click fraud is to monitor and analyze users’ click journeys across their portfolios; IP addresses that generate many clicks but never install apps are flagged. A blacklist of IP addresses and devices is then created based on this information. The goal of the competition was to develop the best model for predicting whether a user would proceed to install an app after clicking on an ad and to distinguish fraudulent clicks from benign ones. The data provided were extensive, comprising a total of 203,694,359 real-time ad click records captured on a mobile platform over four days, with an overall size of roughly 7 GB. Table 4 presents the statistics of the TalkingData dataset.

Table 4.

Statistics of the TalkingData dataset.

Dataset	Time Period	No. of Clicks	Non-Fraudulent (is_attributed = 1)	Fraudulent (is_attributed = 0)
Train	6 Nov 2017	184,903,890	509,235.8975 (0.25%)	203,185,123.103 (99.75%)

2.2. Raw Features

The TalkingData dataset is generated from mobile phones and has been widely used in recent studies to build and train click fraud detection models and approaches. It has millions of clicks described by eight features, which are detailed in Table 5. Seven of these are considered independent features, whereas one, is_attributed, is the dependent feature, that is, the class to be predicted: a value of 1 indicates that the app was downloaded after the click, which is treated as a legitimate/non-fraudulent click, whereas a value of 0 indicates that no download followed, which is treated as a fraudulent click. Some studies used the raw features as they are, whereas others added primitive feature engineering. For instance, several studies have focused on extracting temporal features, such as minutes and seconds, from the click time feature ^{[17][18][19][20][21][22]}[42,47,48,53,58,75] and on grouping the IP feature with other attributes in combinations of one or two attributes ^[17][19][23][42,48,69]. Section 3 provides the details of the new features derived using feature engineering.

Table 5.

Features in the TalkingData dataset.

Attribute	Description
ip	IP address of click.
app	App id for marketing.
device	Device type id of user mobile phone.
os	OS version id of user mobile phone.
channel	Channel id of mobile ad publisher.
click_time	Timestamp of click (UTC).
attributed_time	If the user downloads the app after clicking an ad, this is the time of the app download.
is_attributed	The target that is to be predicted, indicating the app was downloaded.

3. Avazu Click-Through Rate Prediction Dataset

3.1. Dataset Description

The Avazu dataset was also presented on the Kaggle platform in 2014 ^[24][81], as part of a competition hosted jointly by Avazu and Kaggle to determine the best strategy for predicting the click-through rate (CTR), which is a vital measure for analyzing ad performance. The dataset includes around 40 million records captured over 11 days, with 10 days serving as the training set and 1 day serving as the test set. Click prediction systems are critical, and they commonly use sponsored searches to rank ad links. The goal of CTR prediction is to estimate the likelihood that advertisements on a website will be clicked. By predicting the CTR, an advertising agency selects the potential visitors who are most likely to engage with the advertisements. Table 6 presents the statistics of the dataset.

Table 6.

Statistics of the Avazu click-through rate prediction dataset.

		Class of Clicks
Dataset	No. of Clicks	Non-Clicked	Clicked
Train	40,000,000	40,140,000 (90%)	4,460,000 (10%)

3.2. Raw Features

The dataset includes 21 distinct features, some of which describe the ad, such as the ad ID and position, as well as features describing the source of the click, such as device type and connection type. The target feature is the click class, which is binary: 0 means that an ad was not clicked, and 1 indicates that the visitor clicked an ad. About eight out of the 21 features in the dataset are anonymous, as illustrated in Table 7. These are categorical fields that contain specific data about users’ and advertisers’ profiles and are hashed to a unique value to enable investigators to construct vectors ^[25][82]. These anonymous fields and their meanings were not publicly revealed or investigated in the studies examined in the literature review; however, other experiments inferred that C14 is the ad ID, C17 is the ad group ID, and C21 is the ad sponsor ID by interpreting the hierarchy of the unknown features ^[26][83]. All the studies in the literature review that applied this dataset used all the information provided in the dataset to detect click fraud, along with separating the click time feature into month, day, and hour and determining the frequency of clicks in 10 h ^[18][22][27][47,74,75].

Table 7.

Features in the Avazu click-through rate prediction dataset.

Attribute	Definition
ID	The unique identifier for all details that correspond to one occurrence of an advertisement. This is a continuous variable.
Hour	The hour, in YYMMDDHH format. Researchers could break this down and add additional features during the cleaning process. This is a continuous variable.
Banner_pos	The position on the screen where the advertisement was displayed. This shows the prominent place for an advertisement to get the attention of the user. This is a categorical integer.
Site_id	The identifier that uniquely identifies a site in which the advertisement was displayed. This is a hashed value.
Site_domain	The domain information of the website in which the advertisement was displayed.
Site_category	This is a categorical variable representing the field to which the website belongs. This can be used to understand if any site category has more visitor attraction during any particular time.
App_id	The identifier that uniquely identifies a mobile application in which the advertisement was displayed. This is a hashed value.
App_domain	The domain information of the application in which the advertisement was displayed.
App_category	This is a categorical variable representing the field to which the application belongs. This can be used to understand if any app category has more visitor attraction during any particular time. This is similar to the site category and can be compared relatively to check if the app has more clicks than the website.
Device_id	The unique identifier that marks the device from which the click was captured. This is a hashed continuous variable and can be repeated in the data set.
Device_ip	The IPv4 address of the device from which the click was received. Hashed to a different value for privacy reasons to avoid tracing back to the device.
Device_model	The model of the device. Researchers choose not to use this value.
Device_type	The type of the device. This is a categorical variable and has around 7 categories.
Device_conn_type	This is a hashed value about the connection type. Researchers do not use this value for forming the vector.
C1	An anonymous variable. It has influence over the prediction.
C14–C21	Anonymous categorical variables that might have information about the advertisers’ profile and the users’ profile like the age, gender, etc.
Click	The target variable, 0 means an advertisement was not clicked and 1 means the ad was clicked. This is a categorical variable, binary typed.
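As an illustration of the temporal feature engineering described for the Hour field, the following minimal Python sketch splits a YYMMDDHH value into month, day, and hour. The function name `split_hour_field` and the sample value are hypothetical and are not taken from any of the reviewed studies.

```python
from datetime import datetime


def split_hour_field(hour_value: str) -> dict:
    """Split a YYMMDDHH string (the Hour format in Table 7) into parts."""
    ts = datetime.strptime(hour_value, "%y%m%d%H")
    return {"month": ts.month, "day": ts.day, "hour": ts.hour}


# Hypothetical sample value: 21 October 2014, 00:00.
parts = split_hour_field("14102100")
print(parts)  # {'month': 10, 'day': 21, 'hour': 0}
```

The same idea applies to full timestamps such as the TalkingData click_time feature, where minutes and seconds can be extracted as well.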

4. Challenges in the Common Datasets in the Field of Click Fraud Detection

Most datasets related to click fraud detection suffer from class imbalance: the fraudulent class contains few records, whereas the benign class (i.e., legitimate clicks) accounts for most of the data, which skews the prediction model toward the majority class. Previous studies used various resampling procedures, such as up/down-sampling and the synthetic minority oversampling technique (SMOTE). In up-sampling, minority samples are replicated until they equal the number of majority-class samples. Under-sampling, on the other hand, removes majority-class records at random until the class distribution is balanced. This issue was addressed in the FDMA 2012 BuzzCity dataset through several workarounds. Under-sampling, for example, was used to overcome the problem of dataset imbalance in study ^[28][33], in which certain majority-class records were chosen for use along with all minority-class records. Another study ^[17][42] addressed the same issue while carefully aiming to minimize both the information loss caused by undersampling and the overfitting caused by oversampling. As a result, the following approach was used: oversampling positive samples and undersampling negative samples. Furthermore, the QDPSKNN method ^[13][62] was applied to address dataset imbalance, along with reducing capacity needs and improving implementation performance. It divides the data into four quadrants and then undersamples to balance the class distribution. The authors in ^[29][34] also illustrated that, to deal with imbalanced class distribution in nonlinear classifications involving combinations of variable types and noisy or missing patterns, tree-based ensemble classifiers and backward feature exclusion can produce promising results.
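The two basic resampling procedures can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure of any particular study: the function names are hypothetical, and the class sizes are taken from the training row of Table 1 (2929 OK publishers versus 72 fraudulent ones).

```python
import random


def undersample(majority, minority, seed=0):
    """Keep every minority record and a random majority subset of equal size."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)) + list(minority)


def oversample(majority, minority, seed=0):
    """Replicate random minority records until both classes are the same size."""
    rng = random.Random(seed)
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return list(majority) + list(minority) + extra


# Placeholder records; only the class sizes come from Table 1.
ok_publishers = [("ok", i) for i in range(2929)]
fraud_publishers = [("fraud", i) for i in range(72)]

under = undersample(ok_publishers, fraud_publishers)
over = oversample(ok_publishers, fraud_publishers)
print(len(under), len(over))  # 144 5858
```

Undersampling here discards 2857 of the 2929 majority records, which shows the information loss that some of the studies above tried to limit by combining both procedures.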

The TalkingData dataset also suffers from a massive imbalance, with legitimate clicks (is_attributed = 1) representing only 0.25% of all clicks in the entire dataset, while fraudulent clicks (is_attributed = 0) represent 99.75%. A model trained on such an imbalanced dataset will be skewed almost exclusively toward the majority class, which is a problem when researchers are concerned with predicting the minority class. Several efforts have been made in earlier studies to overcome this challenge when training models. For example, a 15% random selection of unique IPs was used, followed by an 8% stratified sample from the rest, to decrease the data size in ^[30][61]. To address the imbalance, SMOTE ^[31][84] with neighbors = 5 was used, and the positive class was oversampled by 11%. In ^[12][59], however, undersampling was used to balance the skewed dataset by preserving all data in the minority class and decreasing the volume of the majority class. The authors suggested that the test set should be more comprehensive, so the original training set was divided into new training and test sets, and additional samples were then drawn from the new test set by selecting clicks with is_attributed = 0 and is_attributed = 1 in a 1:1 ratio, resulting in a dataset that contains 50% legitimate and 50% fraudulent clicks. Furthermore, they trained all the different ML models using training samples of varying sizes; the goal of changing the training sample sizes was to enhance precision while lowering overfitting.
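The core idea of SMOTE with five neighbors can be sketched as follows: each synthetic minority sample is placed at a random point on the line segment between a minority sample and one of its nearest minority neighbors. This is a simplified, standard-library-only sketch with a hypothetical function name and toy two-dimensional points; it is not the implementation used in the cited study.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by Euclidean distance to the base.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy minority class with two numeric features (illustrative values only).
minority_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0),
                   (1.0, 1.0), (0.5, 0.5), (0.2, 0.8)]
new_points = smote_sketch(minority_points, n_new=4, k=5)
print(len(new_points))  # 4
```

Because every synthetic point is an interpolation between two existing minority points, it always stays inside the region already covered by the minority class. Categorical features such as hashed IDs would need a different treatment.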
In another effective approach to this issue ^[14][66], the entire dataset was divided into six classes based on the IP and app ID features as follows: class 1: 25,974 rows with a unique IP count < 20 and an app ID frequency < 70%; class 2: 27,174 rows with a unique IP count ≥ 20 and < 1000 and an app ID frequency < 70%; class 3: 112,790 rows with a unique IP count ≥ 1000 and an app ID frequency < 70%; class 4: 784,964 rows with a unique IP count < 20 and an app ID frequency ≥ 70%; class 5: 19,914,810 rows with a unique IP count ≥ 20 and < 1000 and an app ID frequency ≥ 70%; and class 6: 164,038,178 rows with a unique IP count ≥ 1000 and an app ID frequency ≥ 70%. The classifiers were tested separately on each of these classes. In another study ^[32][56], the model was trained using only the first class.
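The six-class partition can be expressed as a small lookup on the two thresholds. The sketch below assumes that the unique IP count and the app ID frequency have already been computed for each row; how the cited study computed them is not specified here, so the function name and its inputs are hypothetical. The row counts are the ones listed above.

```python
def assign_class(ip_count: int, app_freq: float) -> int:
    """Map an IP count and an app ID frequency (0.0 to 1.0) to classes 1-6."""
    if ip_count < 20:
        band = 1
    elif ip_count < 1000:
        band = 2
    else:
        band = 3
    # Classes 1-3 have an app ID frequency below 70%; classes 4-6 do not.
    return band if app_freq < 0.70 else band + 3


# Row counts per class as reported in the text.
class_sizes = {1: 25_974, 2: 27_174, 3: 112_790,
               4: 784_964, 5: 19_914_810, 6: 164_038_178}
total_rows = sum(class_sizes.values())
print(assign_class(500, 0.9), total_rows)  # 5 184903890
```

The class sizes add up to 184,903,890 rows, which matches the number of training clicks reported in the TalkingData statistics.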

According to the statistics in Table 6, the Avazu click-through rate prediction dataset is also significantly imbalanced: 90% of the records are ads that were viewed but not clicked, and 10% are ads that were viewed and clicked. All the studies in our literature review used one million samples from the dataset ^[22][27][74,75] and applied an undersampling technique to balance the dataset ^[33][50].