Deepfake Identification and Traceability: Comparison
Please note this is a comparison between Version 2 by Yi Sun and Version 6 by Catherine Yang.

Deepfakes are becoming increasingly ubiquitous, particularly in facial manipulation. Numerous researchers and companies have released multiple datasets of face deepfakes labeled to indicate different methods of forgery. However, the naming of these labels is often arbitrary and inconsistent, so most researchers now choose to use only one of the datasets in their work. Practical applications and traceability research, however, require multiple datasets. In this study, we employ several models to extract forgery features from various deepfake datasets, use the K-means clustering method to identify datasets with similar feature values, and analyze the feature values using the Calinski Harabasz Index. Our findings reveal that datasets with the same or similar labels in different deepfake datasets exhibit different forgery features. To solve this problem, we propose the KCE-System, which combines multiple deepfake datasets according to feature similarity. Across four groups of test datasets, the model trained on KCE-combined data, when faced with unknown data types, achieved a Calinski Harabasz score 42.3% higher than the model trained on data combined by forgery name. Furthermore, its score is 2.5% higher than that of the model using all data, although the latter has more training data. This shows that the method improves the generalization ability of the model. This paper introduces a fresh perspective for effectively evaluating and utilizing diverse deepfake datasets and conducting deepfake traceability research.

  • deepfake
  • datasets
  • correlation
  • traceability
  • clustering
  • Calinski Harabasz

1. Introduction

Facial recognition has become increasingly prevalent in recent years, with many applications using it as the primary method for identity recognition. However, with the rapid development of deep learning-driven facial forgery technologies, such as deepfakes [1], fraudulent practices have risen in the media and financial fields, sparking widespread social concern [2,3,4]. Consequently, there is a crucial need for the traceability of forged data.
Deepfake tracking methods can be broadly classified into traditional [5,6,7] and deep learning-based methods [8,9]. Traditional methods rely on techniques such as image forensics and metadata analysis to detect signs of manipulation in a deepfake. These methods analyze the visual properties of an image or video: for example, they may analyze the distribution of colors, identify inconsistencies in lighting and shadows, or detect distortions caused by manipulation. Traditional methods require extensive domain knowledge and specialized software. Deep learning-based methods, on the other hand, train deep neural networks on large datasets of real and fake images or videos and detect deepfakes by analyzing the patterns in the data. They are highly effective at detecting deepfakes, but they require large amounts of training data and computing resources. This paper mainly conducts research based on the latter approach.
Tracing the source of a deep forgery relies on identifying the forgery algorithm used. However, the category labels in deepfake datasets fundamentally differ from those in the general computer vision field. In typical computer vision datasets, such as CIFAR [10], ImageNet [11], and MNIST [12], the category labels are objective and have real-world meaning. For instance, the labels for salamander and setosa are assigned by biologists based on the biological characteristics of these species, and humans can accurately recognize facial expressions such as anger or happiness, as shown in Figure 1. These labels remain unchanged despite variations in camera equipment, lighting conditions, and post-processing of images. In contrast, humans cannot classify deepfake pictures visually, and the images can only be named after their forgery method. The names that different producers give to forgery methods are highly subjective and arbitrary, and many “wild datasets” do not provide forgery method labels at all. Furthermore, subsequent operations such as image compression and format conversion [13] may significantly alter the forgery characteristics of the images.
Figure 1. The first row shows the common CV dataset, the second row shows the human facial expression dataset, and the third row shows the deepfake dataset.
Improving facial forgery recognition and tracking technology relies on collecting and utilizing as many facial forgery datasets as possible. These datasets include ForgeryNet [14], DeepfakeTIMIT [15], FakeAVCeleb [16], DeeperForensics-1.0 [17], and others. Additionally, numerous “wild datasets” are gathered from the Internet. However, these datasets are published by different institutions, use varying forgery methods, and have different naming conventions. In some cases, the exact generation algorithm is not provided. This situation leads some researchers to use only one dataset in their experiments. When multiple datasets are employed, categories with similar or identical names create challenges for users.

Measuring the relevance of each deepfake dataset is crucial. To address this problem, we establish the KCE-System. It uses the Xception model [5] as a forgery feature extractor that maps various deepfake images into the feature space. Then, we use PCA for dimensionality reduction and the K-means method for clustering. We use these clustered datasets to retrain the Xception model and use the Calinski Harabasz Index [6] to judge the models' performance. To improve the credibility of the experimental results, we repeat part of the experiments on the Frequency in Face Forgery Network (F3-Net) [7] and the Residual Neural Network (ResNet) [8]. We also combine these deepfake datasets based on forgery method labels as a control group.

Our experiments prove that some forgery category labels of the same name differ significantly across different datasets. When the forgery method of the deepfake dataset is unknown, the KCE-System can achieve better generalization performance by training on merged datasets based on closer feature distances.

2. Deepfake Datasets

Numerous deepfake datasets have been created by researchers and institutions, including FaceForensics++ [21], Celeb-DF [22], DeepFakeMnist+ [15], DeepfakeTIMIT [1], FakeAVCeleb [16], DeeperForensics-1.0 [17], ForgeryNet [14], and Patch-wise Face Image Forensics [23]. These datasets cover various forgery methods, have significant data scales, and are widely used. Please refer to Table 1 for more details.
Table 1. Common deepfake datasets. The symbol * represents the number of pictures.

3. Deepfake Identification and Traceability

3.1. Methods Based on Spectral Features

Many scholars consider upsampling to be a necessary step in generating most face forgeries. Cumulative upsampling can cause apparent changes in the frequency domain, and minor forgery defects and compression errors can be well described in this domain. Using this information can identify fake videos. Spectrum-based methods have certain advantages in generalization because they provide another perspective. Most existing image and video compression methods are also related to the frequency domain, making the method based on this domain particularly robust.
Chen et al. [44] proposed a forgery detection algorithm that combines spatial and frequency domain features using an attention mechanism. The method uses a convolutional neural network and an attention mechanism to extract spatial domain features. After the Fourier transform, the frequency domain features are extracted, and, finally, these features are fused for classification. Qian et al. [9] proposed a network structure called F3-Net (Frequency in Face Forgery Network) and designed a two-stream collaborative learning framework to learn the frequency domain adaptive image decomposition branch and image detail frequency statistics branch. The method has a significant lead over other methods on low-quality video. Liu et al. [45] proposed a method based on Spatial Phase Shallow Learning (SPSL). The method combines spatial images and phase spectra to capture upsampled features of facial forgery. For forgery detection tasks, local texture information is more critical than high-level semantic information. By making the network shallower, the network is more focused on local regions. Li et al. [46] proposed a learning framework based on frequency-aware discriminative features and designed a single-center loss function (SCL), which only compresses the intra-class variation of real faces while enhancing the inter-class variation in the embedding space. In this way, the network can learn more discriminative features with less optimization difficulty.

3.2. Methods Based on Generative Adversarial Network Inherent Traces

Scholars suggest that fake faces generated by generative adversarial networks have distinct traces and texture information compared to real-world photographs.
Guarnera et al. [47] proposed a detection method based on forgery traces, which uses an Expectation Maximization algorithm to extract local features that model the convolutional generation process. Liu et al. [48] developed GramNet, an architecture that uses global image texture representation for robust forgery detection, particularly against image disturbances such as downsampling, JPEG compression, blur, and noise. Yang et al. [49] argue that existing GAN-based forgery detection methods are limited in their ability to generalize to new training models with different random seeds, datasets, and loss functions. They propose DNA-Det, which observes that GAN architecture leaves globally consistent fingerprints, and model weights leave varying traces in different regions.

3.3. Troubles with Current Deepfake Traceability

Methods based on spectral features [7][33][34][35] are currently one of the primary approaches to deepfake traceability. Cumulative upsampling can cause apparent changes in the frequency domain, and minor forgery defects and compression errors can be well described in this domain, so this information can be used to identify fake videos. Spectrum-based methods have certain advantages in generalization because most existing image and video compression methods are also related to the frequency domain. Methods based on the inherent traces of generative adversarial networks [36][37][38] are the other primary approach. Fake faces generated by generative adversarial networks have distinct traces and texture information compared to real-world photographs. Representative techniques include using an Expectation Maximization algorithm to extract local features that model the convolutional generation process, using global image texture representations, and exploiting globally consistent fingerprints.

Methods based on frequency domain and model fingerprints provide traceability for different forgery methods. Although researchers claim high accuracy rates in identifying and tracing related forgery methods, they typically only use a specific dataset for research. This approach reduces the comprehensiveness of traceability and the model’s generalization ability. Therefore, researchers need to consider the similarity and correlation between samples in each dataset to make full use of these datasets.

However, this presents a significant challenge. Unlike typical computer vision datasets, deepfake datasets’ labels are based on technical methods and forgery patterns rather than human concepts, making it impossible for humans to identify and evaluate them. The more severe problem is that the labels of forgery methods used in various deepfake datasets are entirely arbitrary. Some labels are based on implementation technology, while others are based on forgery modes. For example, many datasets have the label “DeepFakes.” The irregularity and ambiguity of these labeling methods make it difficult to fully utilize the forged data of various deepfake datasets. Some deepfake datasets do not indicate specific forgery methods.

4. The KCE-System

We assume that merging datasets that use the same forgery method will enhance the model’s performance. Conversely, merging datasets that use different methods, or dividing similar data into separate subsets, may adversely affect the model’s performance. Based on these assumptions, we developed the K-means and Calinski Harabasz Evaluation System, which we refer to as the KCE-System for short.

The KCE-System incorporates unsupervised learning. The system divides the deepfake datasets into training sets and evaluation sets. It then trains a deepfake recognition model on the training sets and extracts high-dimensional vectors from a middle layer of the model. After dimensionality reduction, the system uses the K-means clustering method to merge the various deepfake datasets. Using these merged datasets, the system then trains new Xception, F3-net, and ResNet models. The trained models are used to extract 2048-dimensional or 512-dimensional feature values from the evaluation set. Finally, the system applies the Calinski Harabasz Index to the feature values after dimensionality reduction to evaluate the model’s performance, as shown in Figure 2. The main parts of the system are introduced in detail below.

Figure 2. Overview of the KCE-System. The proposed architecture consists of two parts: the cluster section and the evaluation section.

4.1. Feature Extractor

Theoretically, when a model reaches high classification accuracy on various categories of deepfake data, it can extract the corresponding deepfake features. Xception is a traditional CNN model based on separable convolutions with residual connections. The model has shown high accuracy when detecting deepfake videos, and its training accuracy reaches 94%. We use it as the main feature extractor and take the 2048-dimensional output of its global pooling layer as each sample’s feature. ResNet is an improvement over the traditional deep neural network architecture that solves the problem of vanishing gradients and allows the training of much deeper networks. Another notable model in facial forgery detection is F3-Net. This model leverages frequency domain analysis and comprises two branches: one learns forgery patterns via Frequency-aware Image Decomposition, and the other extracts high-level semantics from Local Frequency Statistics. Given the widespread applicability of ResNet in various computer vision fields and the unique position of F3-Net in the domain of deepfake detection, we also select these two models as feature extractors and test them on half of the test groups, so as to minimize the influence of the model itself on the experimental results.

4.2. Dimensionality Reduction and Clustering

In this field, clustering algorithms, such as K-means [39], Gaussian Mixture, and DBSCAN [40] are commonly used. However, the DBSCAN algorithm is ineffective in controlling the number of clusters formed. In our system, we need to control the number of clusters formed for easy comparison with the data merged by name. The Gaussian Mixture algorithm is mainly designed for non-spherical clusters, while we focus more on the distance between categories in feature space, which emphasizes spherical clustering. Therefore, we chose to use the K-means clustering algorithm in our system.

The K-means algorithm uses Euclidean distance for clustering, which can fail in high dimensions, so a dimensionality reduction method must be used. We compared two methods, PCA [41] and t-SNE [42]. PCA is stable but retains less information when reduced to two or three dimensions. When PCA reduces the features to 64 dimensions, the explained variance ratio is preserved at 95.2%, and Figure 3 shows that it effectively preserves most of the information needed for clustering. t-SNE supports reduction to very low dimensions for visual analysis but has poor stability.

Figure 3. Illustration of dimensionality reduction using PCA. After PCA reduces the features, the t-SNE method further reduces them to two dimensions for display (different colors indicate different forgery methods).
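
The trade-off between the number of retained dimensions and the explained variance can be checked with a few lines of NumPy. The following is a minimal sketch, not the authors' implementation: the helper `pca_reduce` and the synthetic feature matrix are assumptions made for illustration, and the retained ratio it prints has no relation to the 95.2% reported for the real Xception features.

```python
import numpy as np

def pca_reduce(features, n_components):
    """Project features onto the top principal components.

    Returns the reduced data and the fraction of total variance retained.
    """
    centered = features - features.mean(axis=0)
    # SVD of the centered data: the rows of vt are the principal directions.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    variance = singular_values ** 2
    retained = variance[:n_components].sum() / variance.sum()
    reduced = centered @ vt[:n_components].T
    return reduced, float(retained)

# Synthetic stand-in for extracted features (NOT real deepfake features):
# 500 samples in 256 dimensions whose variance lies mostly in 20 directions.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 20))
mixing = rng.normal(size=(20, 256))
features = latent @ mixing + 0.05 * rng.normal(size=(500, 256))

reduced, retained = pca_reduce(features, n_components=64)
print(reduced.shape, round(retained, 3))
```

In practice one would call the same function with several target dimensions (for example 128, 64, and 32) and compare the retained ratios before clustering.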

4.3. Selection of Evaluation Algorithms

Evaluating the performance of models trained with unreliably labeled or unlabeled data is difficult. We cannot use precision and recall because we have no way to determine whether each sample is classified correctly. To address this issue, we use the Calinski Harabasz Index [6], introduced by Calinski and Harabasz in 1974, as the evaluation method. This index is defined in Equation (1) as the ratio of the between-cluster dispersion to the within-cluster dispersion, summed over all clusters. Higher scores indicate clusters that are denser and better separated, so the Calinski Harabasz Index can be used to evaluate the models, with higher scores indicating that the model performs better on the test datasets.

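For a set of data $E$ of size $n_E$, which has been clustered into $k$ clusters, the Calinski Harabasz score $s$ is defined as the ratio of the between-cluster dispersion mean to the within-cluster dispersion, as shown in Equation (1).

$$s = \frac{\mathrm{tr}(B_k)}{\mathrm{tr}(W_k)} \times \frac{n_E - k}{k - 1} \tag{1}$$

where $\mathrm{tr}(B_k)$ is the trace of the between-group dispersion matrix and $\mathrm{tr}(W_k)$ is the trace of the within-cluster dispersion matrix, defined by:

$$W_k = \sum_{q=1}^{k} \sum_{x \in C_q} (x - c_q)(x - c_q)^T \tag{2}$$

$$B_k = \sum_{q=1}^{k} n_q (c_q - c_E)(c_q - c_E)^T \tag{3}$$

Here, $C_q$ represents the set of points in cluster $q$, $c_q$ represents the center of cluster $q$, $c_E$ represents the center of $E$, and $n_q$ represents the number of points in cluster $q$.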
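
To make the definition concrete, the score can be computed directly from the two traces. The sketch below is illustrative only: it uses the fact that the trace of each dispersion matrix equals a sum of squared Euclidean distances, the function name `calinski_harabasz` is an assumption, and the toy points are not taken from the experiments.

```python
def calinski_harabasz(points, labels):
    """Calinski Harabasz score for points (equal-length sequences) and cluster labels."""
    n = len(points)
    dim = len(points[0])
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    k = len(clusters)
    if k < 2 or k >= n:
        raise ValueError("need 2 <= number of clusters < number of points")

    def mean(rows):
        return [sum(row[d] for row in rows) / len(rows) for d in range(dim)]

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    overall_center = mean(points)
    between = 0.0  # trace of the between-cluster dispersion matrix
    within = 0.0   # trace of the within-cluster dispersion matrix
    for members in clusters.values():
        center = mean(members)
        between += len(members) * sq_dist(center, overall_center)
        within += sum(sq_dist(p, center) for p in members)
    # Note: within is zero if every point coincides with its cluster center.
    return (between / within) * ((n - k) / (k - 1))

# Two compact, well-separated toy clusters.
toy_points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
good_labels = [0, 0, 0, 0, 1, 1, 1, 1]
bad_labels = [0, 1, 0, 1, 0, 1, 0, 1]

good_score = calinski_harabasz(toy_points, good_labels)
bad_score = calinski_harabasz(toy_points, bad_labels)
print(good_score, bad_score)
```

A labeling that matches the real structure of the data scores far higher than one that cuts across it, which is the property the evaluation relies on.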

When the Calinski Harabasz Index is used to evaluate clustering quality, its elbow point tends to fall at a cluster number of about 3 or 4, as depicted in Figure 4. This is consistent with the number of forgery method categories in the actual evaluation set. This suggests that the Calinski Harabasz Index is a valuable method for assessing the model’s ability to identify new categories of deepfakes.

Figure 4. Clustering quality evaluated with the Calinski Harabasz Index. The elbow point falls at about 3 to 4 clusters.

5. Experiment

In this section, we first introduce the overall experimental setup. Our equipment includes four NVIDIA GeForce RTX 2080 Ti GPUs. We use PyTorch to train and evaluate models, OpenCV for image data preprocessing, and the Scikit-learn library for data analysis. We extract 620,000 fake face images from 10 deepfake datasets and train 40 models, including 32 Xception, 4 F3-net, and 4 ResNet models. The entire data preparation and experimental process spanned approximately three months.

5.1. Data Dividing and Preprocessing

We select 31 categories labeled with forgery method names from CelebDF, DeeperForensics-1.0, DeepFakeMnist+, FaceForensics++, ForgeryNet, and FakeAVCeleb; see Table 1 for details. We randomly divide the 31 deepfake categories into two sets: the training set contains 27 categories, and the evaluation set contains four categories. We repeat this division four times to obtain four groups of training and evaluation sets; see Table 2 for details. We extract the frame data of each category according to the instructions of the relevant dataset and use the face detection model Retinaface [43] to crop the face area, after increasing the side length of the detected area by a factor of 1.25. Finally, we randomly select 20,000 fake faces from each category and save these images as test data in PNG format.
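
The enlargement of the detected face area can be sketched as follows. This is a minimal illustration under stated assumptions: the box format `(left, top, right, bottom)`, the enlargement about the box center, and the clipping to the image bounds are not specified in the text, and `enlarge_box` is a hypothetical helper rather than part of Retinaface or OpenCV.

```python
def enlarge_box(box, image_size, scale=1.25):
    """Scale a (left, top, right, bottom) box about its center and clip it to the image."""
    left, top, right, bottom = box
    width, height = image_size
    center_x = (left + right) / 2
    center_y = (top + bottom) / 2
    half_w = (right - left) * scale / 2
    half_h = (bottom - top) * scale / 2
    return (
        max(0, round(center_x - half_w)),
        max(0, round(center_y - half_h)),
        min(width, round(center_x + half_w)),
        min(height, round(center_y + half_h)),
    )

# A 200 x 200 detection inside a 640 x 480 frame grows to 250 x 250.
enlarged = enlarge_box((100, 100, 300, 300), image_size=(640, 480))
print(enlarged)  # (75, 75, 325, 325)
```

The enlarged box keeps some context around the face, which is where blending artifacts often appear.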

Table 2. The table displays four sets of experimental data, each containing four evaluation datasets, with the remaining 27 datasets designated for training purposes.

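| Datasets | Synthesis Method | Count | Group 1 | Group 2 | Group 3 | Group 4 |
|---|---|---|---|---|---|---|
| CelebDFv1 | FaceSwapPRO | 20,000 | | | | |
| CelebDFv2 | FaceSwapPRO | 20,000 | | | | evaluate |
| DeeperForensics | DF-VAE | 20,000 | | evaluate | | |
| DeepFakeMnist+ | FOMM | 20,000 | | | | |
| DeepfakeTIMIT | FaceSwap-GAN | 20,000 | | | evaluate | |
| FaceForensics++ DeepFakeDetection | FaceSwap | 20,000 | | | | |
| FaceForensics++ | DeepFakes | 20,000 | | | | |
| FaceForensics++ | Face2Face | 20,000 | | evaluate | | |
| FaceForensics++ | FaceShifter | 20,000 | evaluate | | | |
| FaceForensics++ | FaceSwap | 20,000 | | | | |
| FaceForensics++ | NeuralTextures | 20,000 | | | | evaluate |
| FakeAVCeleb | FaceSwap | 20,000 | evaluate | | | |
| FakeAVCeleb | FSGAN | 20,000 | | | | |
| FakeAVCeleb | Wav2Lip | 20,000 | | evaluate | | |
| ForgeryNet | ATVG-Net | 20,000 | evaluate | | | |
| ForgeryNet | BlendFace | 20,000 | | | evaluate | |
| ForgeryNet | DeepFakes | 20,000 | | | | |
| ForgeryNet | DeepFakes-StarGAN-Stack | 20,000 | | | | |
| ForgeryNet | DiscoFaceGAN | 20,000 | | evaluate | | |
| ForgeryNet | FaceShifter | 20,000 | | | | |
| ForgeryNet | FOMM | 20,000 | evaluate | | | |
| ForgeryNet | FS-GAN | 20,000 | | | | evaluate |
| ForgeryNet | MaskGAN | 20,000 | | | | |
| ForgeryNet | MMReplacement | 20,000 | | | | |
| ForgeryNet | SC-FEGAN | 20,000 | | | | |
| ForgeryNet | StarGAN-BlendFace-Stack | 20,000 | | | | |
| ForgeryNet | StarGAN2 | 20,000 | | | evaluate | |
| ForgeryNet | StyleGAN2 | 20,000 | | | | |
| ForgeryNet | Talking_Head_Video | 20,000 | | | | evaluate |
| Patch-wise_Face_Image_Forensics | PROGAN | 20,000 | | | evaluate | |
| Patch-wise_Face_Image_Forensics | StyleGAN2 | 20,000 | | | | |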
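
The repeated random division into 27 training categories and four evaluation categories can be sketched as below. This is an assumption-laden illustration: the function `split_categories`, the seed, and the placeholder category names are hypothetical, and the sketch does not reproduce the particular assignment shown in Table 2 (where, for instance, no category is held out in more than one group).

```python
import random

def split_categories(categories, n_eval=4, n_groups=4, seed=0):
    """Return n_groups (train, eval) splits with n_eval held-out categories each."""
    rng = random.Random(seed)
    groups = []
    for _ in range(n_groups):
        eval_set = set(rng.sample(categories, n_eval))
        train_set = [c for c in categories if c not in eval_set]
        groups.append((train_set, sorted(eval_set)))
    return groups

categories = [f"category_{i:02d}" for i in range(31)]  # placeholder names
splits = split_categories(categories)
for train_set, eval_set in splits:
    print(len(train_set), eval_set)
```

Holding out whole categories, rather than individual images, is what lets the evaluation measure behavior on forgery types the model has never seen.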

5.2. Merge Training Data Based on the Category Name

To verify our conjecture that the naming of forgery methods in deepfake datasets is largely arbitrary, we merged the training set categories whose forgery method names are identical or similar and used the result as a control group. The merging rules are listed in Table 3. After merging, Groups 1, 3, and 4 have 19 training categories each, and Group 2 has 17.

Table 3. Rules for merging categories by forgery method name. We randomly sample proportional amounts of data from the merged categories and reassemble them into 20,000 images per category.

| Rule Number | Merge Categories |
|---|---|
| 1 | CelebDFv1_FaceSwapPRO, CelebDFv2_FaceSwapPRO |
| 2 | DeepFakeMnist+_FOMM, ForgeryNet_FOMM |
| 3 | DeepfakeTIMIT_FaceSwap-GAN, DeepFakeDetection_FaceSwap, FaceForensics++_FaceSwap, FakeAVCeleb_FaceSwap |
| 4 | FaceForensics++_DeepFakes, ForgeryNet_DeepFakes |
| 5 | FakeAVCeleb_FSGAN, ForgeryNet_FS-GAN |
| 6 | ForgeryNet_DeepFakes-StarGAN-Stack, ForgeryNet_StarGAN-BlendFace-Stack, ForgeryNet_StarGAN2 |
| 7 | ForgeryNet_StyleGAN2, Patch-wise_Face_Image_Forensics_STYLEGAN2 |
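
The category counts quoted for the control group can be reproduced by applying the merge rules to each group's training categories. The sketch below is only a consistency check: the category identifiers are shortened spellings introduced here (for example `PatchWise_StyleGAN2`), and rule 1 is read as merging CelebDFv1 with CelebDFv2, which is an assumption that happens to yield the stated counts.

```python
# Categories from Table 2, written as Dataset_Method (shortened, hypothetical spellings).
ALL_CATEGORIES = """
CelebDFv1_FaceSwapPRO CelebDFv2_FaceSwapPRO DeeperForensics_DF-VAE DeepFakeMnist+_FOMM
DeepfakeTIMIT_FaceSwap-GAN DeepFakeDetection_FaceSwap FaceForensics++_DeepFakes
FaceForensics++_Face2Face FaceForensics++_FaceShifter FaceForensics++_FaceSwap
FaceForensics++_NeuralTextures FakeAVCeleb_FaceSwap FakeAVCeleb_FSGAN FakeAVCeleb_Wav2Lip
ForgeryNet_ATVG-Net ForgeryNet_BlendFace ForgeryNet_DeepFakes
ForgeryNet_DeepFakes-StarGAN-Stack ForgeryNet_DiscoFaceGAN ForgeryNet_FaceShifter
ForgeryNet_FOMM ForgeryNet_FS-GAN ForgeryNet_MaskGAN ForgeryNet_MMReplacement
ForgeryNet_SC-FEGAN ForgeryNet_StarGAN-BlendFace-Stack ForgeryNet_StarGAN2
ForgeryNet_StyleGAN2 ForgeryNet_Talking_Head_Video PatchWise_PROGAN PatchWise_StyleGAN2
""".split()

# Merge rules from Table 3 (rule 1 assumed to pair CelebDFv1 with CelebDFv2).
MERGE_RULES = [
    ["CelebDFv1_FaceSwapPRO", "CelebDFv2_FaceSwapPRO"],
    ["DeepFakeMnist+_FOMM", "ForgeryNet_FOMM"],
    ["DeepfakeTIMIT_FaceSwap-GAN", "DeepFakeDetection_FaceSwap",
     "FaceForensics++_FaceSwap", "FakeAVCeleb_FaceSwap"],
    ["FaceForensics++_DeepFakes", "ForgeryNet_DeepFakes"],
    ["FakeAVCeleb_FSGAN", "ForgeryNet_FS-GAN"],
    ["ForgeryNet_DeepFakes-StarGAN-Stack", "ForgeryNet_StarGAN-BlendFace-Stack",
     "ForgeryNet_StarGAN2"],
    ["ForgeryNet_StyleGAN2", "PatchWise_StyleGAN2"],
]

# Evaluation categories of each group, read from Table 2.
EVAL_SETS = {
    1: ["FaceForensics++_FaceShifter", "FakeAVCeleb_FaceSwap",
        "ForgeryNet_ATVG-Net", "ForgeryNet_FOMM"],
    2: ["DeeperForensics_DF-VAE", "FaceForensics++_Face2Face",
        "FakeAVCeleb_Wav2Lip", "ForgeryNet_DiscoFaceGAN"],
    3: ["DeepfakeTIMIT_FaceSwap-GAN", "ForgeryNet_BlendFace",
        "ForgeryNet_StarGAN2", "PatchWise_PROGAN"],
    4: ["CelebDFv2_FaceSwapPRO", "FaceForensics++_NeuralTextures",
        "ForgeryNet_FS-GAN", "ForgeryNet_Talking_Head_Video"],
}

def merged_category_count(train_categories, rules):
    """Count distinct training categories after applying the name-based merge rules."""
    merged_name = {}
    for index, rule in enumerate(rules, start=1):
        for category in rule:
            merged_name[category] = f"rule_{index}"
    return len({merged_name.get(c, c) for c in train_categories})

counts = {}
for group, held_out in EVAL_SETS.items():
    train = [c for c in ALL_CATEGORIES if c not in held_out]
    counts[group] = merged_category_count(train, MERGE_RULES)
print(counts)  # {1: 19, 2: 17, 3: 19, 4: 19}
```

Group 2 ends with fewer categories because none of its held-out categories appears in a merge rule, so every rule applies in full.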

5.3. Merge Training Data Based on the Results of K-Means Clustering

One of the purposes of our experiment is to determine the appropriate dimensionality for K-means clustering to address this type of problem. We need to ensure that we do not lose too many classification features due to excessive dimensionality reduction, nor do we cause the K-means algorithm to fail due to excessive dimensionality. We use the PCA algorithm to reduce the Xception model's 2048-dimensional output to 128, 64, and 32 dimensions. We also reduce it to two dimensions using the t-SNE algorithm. For the F3-net and ResNet models, we only use the PCA algorithm to reduce the output feature value to 64 dimensions since we only need to verify that our method applies to these models.

In the previous section, we created training data for the control group based on name mergers. To facilitate comparison, we ensure that the number of categories of the experimental data for each group is identical. Therefore, we use the K-means clustering algorithm to cluster these training sets based on the specified number of clusters. Groups 1, 3, and 4 have 19 clusters, while Group 2 has 17 clusters.
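
Clustering to a fixed number of clusters can be sketched with a plain implementation of Lloyd's algorithm. This is an illustrative sketch, not the system's code: the toy two-dimensional points stand in for reduced feature vectors, and the rule that a category joins the cluster holding most of its samples is an assumption, since the text does not state how samples are mapped back to categories.

```python
import random
from collections import Counter

def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd's algorithm with Euclidean distance; returns (labels, centers)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    labels = None
    for _ in range(iterations):
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Toy features for three hypothetical categories; A and B lie close together.
samples = {
    "A": [(0, 0), (0, 1), (1, 0), (1, 1)],
    "B": [(2, 0), (2, 1), (3, 0), (3, 1)],
    "C": [(20, 20), (20, 21), (21, 20), (21, 21)],
}
points = [p for group in samples.values() for p in group]
names = [name for name, group in samples.items() for _ in group]

labels, _ = kmeans(points, k=2)
# Assumption: a category joins the cluster that holds most of its samples.
merged = {
    name: Counter(l for l, n in zip(labels, names) if n == name).most_common(1)[0][0]
    for name in samples
}
print(merged)
```

With two clusters requested, the nearby categories A and B end up merged while C stays separate, regardless of what the categories are called.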

5.4. Experimental Results

We train Xception, F3-net, and ResNet models using training data merged by K-means clustering results and by category names, respectively. For comparison, we also train the same models on the original training set without merging. To obtain feature vectors for the evaluation set, we use these models as feature extractors and apply PCA to reduce the features to 64 dimensions. We then calculate the Calinski Harabasz Index. Please refer to Table 4 for the results.

Table 4. The Calinski Harabasz Index results. Italicized and underlined marks indicate the best result for that group of tests.

| Model | Train Data Merge by | Group 1 CH | Group 2 CH | Group 3 CH | Group 4 CH | Avg CH |
|---|---|---|---|---|---|---|
| Xception | Without merging | 128.02825 | 117.448499 | 68.6994684 | 93.5723306 | 101.937137 |
| Xception | Name | 84.0837009 | 73.8172086 | 74.579957 | 61.2651927 | 73.4365148 |
| Xception | K-means on 2048D | 124.241305 | 105.070655 | 76.2218761 | 84.212058 | 97.4364735 |
| Xception | K-means on t-SNE 2D | 103.627829 | 87.1461055 | 66.6143003 | 76.5264273 | 83.4786656 |
| Xception | K-means on PCA 64D | 137.241584 | 101.192327 | 85.2535376 | 94.2137508 | 104.4753 |
| Xception | K-means on PCA 128D | 101.197038 | 101.502163 | 74.8441997 | 86.6358341 | 91.0448087 |
| Xception | K-means on PCA 32D | 114.247635 | 89.1934801 | 62.3932779 | 75.9596147 | 85.4485019 |
| F3-net | Name | | | 62.6592813 | 65.6510862 | 64.1551837 |
| F3-net | K-means on PCA 64D | | | 85.361067 | 72.018708 | 78.6898875 |
| ResNet | Name | | | 42.895651 | 47.9716533 | 45.4336522 |
| ResNet | K-means on PCA 64D | | | 49.7529116 | 54.0786263 | 51.915769 |

For Xception, the average Calinski Harabasz Index of the model trained on data merged by K-means (on PCA 64D features) is 42.27% higher than that of the model trained on data merged by name. It is also slightly higher than that of the model trained directly on the original training set, even though the original set contains more data. Likewise, for the F3-net and ResNet models, the Calinski Harabasz Index is 22.66% and 14.27% higher, respectively. These results indicate that appropriately combining deepfake datasets with similar features improves the model’s generalization to unknown forgery categories.
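
The percentages above follow directly from the average scores in Table 4. The short sketch below recomputes them; the dictionary layout and the helper name `relative_gain` are illustrative choices, and the values are copied from the Avg CH column.

```python
# Average Calinski Harabasz scores copied from the Avg CH column of Table 4.
avg_ch = {
    ("Xception", "without merging"): 101.937137,
    ("Xception", "name"): 73.4365148,
    ("Xception", "k-means PCA 64D"): 104.4753,
    ("F3-net", "name"): 64.1551837,
    ("F3-net", "k-means PCA 64D"): 78.6898875,
    ("ResNet", "name"): 45.4336522,
    ("ResNet", "k-means PCA 64D"): 51.915769,
}

def relative_gain(new, baseline):
    """Percentage by which new exceeds baseline."""
    return (new / baseline - 1) * 100

gains = {
    model: relative_gain(avg_ch[(model, "k-means PCA 64D")], avg_ch[(model, "name")])
    for model in ("Xception", "F3-net", "ResNet")
}
gain_over_unmerged = relative_gain(
    avg_ch[("Xception", "k-means PCA 64D")], avg_ch[("Xception", "without merging")]
)

for model, gain in gains.items():
    print(f"{model}: {gain:.2f}% over name-based merging")
print(f"Xception: {gain_over_unmerged:.2f}% over the unmerged training set")
```

The last figure is the roughly 2.5% improvement over training on all data without merging.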

The results of Group 2 differ from those of the other three groups: its Calinski Harabasz Index is lower than that of the model trained on the original data. This is because Group 2 has only 17 categories after merging and therefore fewer training samples than the other groups, and the greater information loss can degrade the model’s performance.

6. Conclusions

Our results show that the labels of various deepfake datasets contain considerable randomness. When researchers use multiple deepfake datasets, combining them based only on forgery labels hurts the model's performance. We propose the K-means and Calinski Harabasz Evaluation System (KCE-System) to evaluate the similarity of various deepfake datasets, laying the foundation for future researchers to use them comprehensively. The generalization ability of a deepfake recognition model in the face of new samples can be improved by merging datasets with high forgery feature similarity.

Our research revealed the arbitrariness of label naming in deepfake datasets and the resulting troubles in the traceability of forgery methods. There is still a long way to go to solve this problem completely. In addition, different image compression algorithms and image resolutions significantly impact the fake features of deepfake datasets, which will seriously interfere with the model’s extraction of fake features from deepfake datasets. We are committed to conducting further research to address these challenges effectively.

Furthermore, to ensure the healthy development of the field, we appeal to researchers and companies to standardize the label nomenclature of deepfake datasets.
