With the rapid development of Fifth-/Sixth-Generation (5G/6G) communications and the Internet of Video Things (IoVT), a broad range of mega-scale data applications is emerging (e.g., all-weather, all-time video). These network-based applications depend heavily on reliable, secure, and real-time audio and/or video streams (AVSs), which consequently become a target for attackers. While modern Artificial Intelligence (AI) technology is integrated into many multimedia applications to enhance their capabilities, the development of Generative Adversarial Networks (GANs) has also enabled deepfake attacks, in which audio or video streams are manipulated to mimic any targeted person. Deepfake attacks are highly disturbing and can mislead the public, raising further challenges in policy, technology, and social and legal aspects. Since deepfakes are a primary cause of misinformation, fast and reliable authentication techniques are an imminent, high-priority need.
1. Introduction
Modern Artificial Intelligence (AI)/Machine Learning (ML) technology is widely integrated into many multimedia applications to enhance their capabilities, and Generative Adversarial Networks (GANs) enable the seamless manipulation of audio or video streams based on the probability distribution of each dataset class
[1]. Since GANs were first introduced in 2014, the development of the generator and the discriminator modules has led to the generation of deepfaked images that are indistinguishable from real images
[2]. Such high-resolution, accurate image generation has found many applications in modern media. The potential applications of deepfakes include the e-health/medical field, commercial applications, and privacy protection in media. With the capability to generate characteristic features from a learned probability distribution, a deepfake generation model was proposed to help physically challenged people enjoy entertainment media, where the model extracts motion features from a source subject and generates similar movements with the targeted subject
[3]. In medical applications, deepfakes are readily applicable to developing better plastic surgery procedures for facial reconstruction
[4]. Along with a guidance-based AI system in surgery, deepfakes are also used to generate training samples for rare medical conditions where the data are limited
[4]. Commercial companies develop deepfake techniques that deliver text-based messages through artificial or deepfaked characters, and similar applications are seen on social media platforms to create online avatars
[5]. With the emergence of the metaverse, online deepfake avatars are created to represent a user's virtual presence. Holographic technologies leverage deepfakes to generate 3D historical characters from accurate audio and video data and deliver their stories to future generations. Lastly, deepfake applications in privacy preservation stand on a fragile line. One such application preserves the identities of victims appearing on media platforms by altering their visual and audio characteristics
[6].
However, deepfaked video, audio, or photos can also be highly disturbing and can mislead the public, raising further challenges in policy, technology, and social and legal aspects
[7][8]. Currently, there are deepfake tools available in the public domain that allow people to impersonate anyone, from businessmen to music stars, during video chats
[9][10][11]. Deepfake video “attacks” on some public scenarios have raised serious concerns
[12][13]. Political leaders’ messages have been altered to create fake news and lower public trust in broadcast messages
[14]. Researchers have pointed out that disinformation may actually cause societal disturbance and ruin the foundation of trust
[15][16][17][18]. For instance, in the most recent case, on March 17, a deepfaked video posted on social media showed President Zelensky calling on Ukrainian soldiers to lay down their arms
[19][20]. Domains such as smart surveillance, which depend heavily on audio and visual layer inputs for their functionality, could lose track of malicious actions when the incoming frames are altered
[3]. Government agencies such as the U.S. Defense Advanced Research Projects Agency (DARPA) are concerned about losing the war against deepfake attacks by adversarial hackers who use popular ML techniques to automatically incorporate artificial components into existing video streams
[21][22]. Therefore, since deepfakes are a primary cause of misinformation, fast and reliable authentication techniques are an imminent, high-priority need
[14][23].
While the community has been engaged in an endless AI arms race, “fighting fire with fire” in the hope of building “smarter” ML algorithms
[24][25][26], new ML algorithms keep making fake AVS data more realistic. It is therefore compelling to explore alternative deepfake detection solutions. The effectiveness of a fingerprint technique against a deepfake generation model depends on its uniqueness and randomness, which allow it to resist forgery and prediction.
The ENF is the instantaneous frequency in the electrical power grid with a nominal value of 50/60 Hz, depending on the geographical location
[27][28]. The Instantaneous Frequency (IF) varies over time with the grid’s load-balancing mechanism and power supply demands; the resulting deviations from the nominal frequency constitute the ENF signal
[29]. The fluctuations are small and are similar throughout a power grid interconnect. Among the four major power grid interconnects in the USA, the experimental data were collected in the Eastern interconnect, where the ENF varies within [−0.02, 0.02] Hz of the nominal frequency
[30]. While the ENF signal originates in the mains power supply, it also becomes embedded in digital multimedia through background hum
[31][32] or illumination frequency in audio and video recordings
[27][33][34]. Because the ENF is present in audio–video channels, manipulation of the embedded ENF signal over time is treated as evidence of manipulation or modification of the multimedia recording
[35][36][37]. The ENF signal is also used for forensic analysis of digital evidence, time of recording estimation
[38], media synchronization among multiple channels
[39], and geographical tagging of the recording
[40].
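As a concrete illustration of these fluctuations and how they can be tracked, the following minimal numpy sketch simulates a toy ENF trace within ±0.02 Hz of the 60 Hz nominal value, synthesizes the corresponding hum, and re-estimates the frequency in one-second windows by spectral peak picking with parabolic refinement. All parameters (1 kHz sampling rate, noise level, window length) are illustrative assumptions, not values from the cited works:

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 1000                  # sampling rate in Hz (illustrative assumption)
nominal = 60.0             # nominal grid frequency in the USA
seconds = 10

# Toy ENF trace: a slow random walk kept within +/-0.02 Hz of nominal.
enf_true = nominal + np.clip(np.cumsum(rng.normal(0, 0.002, seconds)),
                             -0.02, 0.02)

# Synthesize mains hum following that trace, plus broadband noise.
phase = 2 * np.pi * np.cumsum(np.repeat(enf_true, fs)) / fs
hum = np.sin(phase) + 0.05 * rng.normal(size=seconds * fs)

def estimate_enf(window, fs, pad=16):
    """Spectral peak near 60 Hz, refined by parabolic interpolation."""
    spec = np.abs(np.fft.rfft(window * np.hanning(len(window)),
                              n=pad * len(window)))
    freqs = np.fft.rfftfreq(pad * len(window), 1 / fs)
    lo, hi = np.searchsorted(freqs, [59.5, 60.5])  # search near nominal only
    i = lo + np.argmax(spec[lo:hi])
    a, b, c = spec[i - 1], spec[i], spec[i + 1]
    return freqs[i] + 0.5 * (a - c) / (a - 2 * b + c) * (freqs[1] - freqs[0])

est = np.array([estimate_enf(hum[k * fs:(k + 1) * fs], fs)
                for k in range(seconds)])
print(np.max(np.abs(est - enf_true)))  # per-second tracking error in Hz
```

The per-window error stays well inside the ±0.02 Hz fluctuation range, which is what makes the trace usable as a timestamp-like fingerprint.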
2. Deepfake Detection Using Traditional and Trained Models
Deepfake detection has become a critical problem in digital media authentication. With advanced computational power and developments in GANs, the resulting media output is highly realistic
[2]. However, alongside this development, many detection techniques were proposed in the early stages to leverage the artifacts introduced by deepfakes. Artifacts such as eye blinking
[41], facial distortion, facial symmetry construction
[42], and motion artifacts can be visually inspected and identified
[43]. Machine-learning-based models were also trained to identify these artifacts. However, such artifacts stem from limited training data; with more data and improvements in the GAN architecture, the artifacts can be reduced and more realistic images generated, leaving visual-artifact-based detectors obsolete.
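The artifact-based pipeline described above can be sketched end to end with hand-crafted features and a simple classifier. The features and numbers below (a blink rate and a facial symmetry score) are purely hypothetical stand-ins, not values from the cited detectors:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical hand-crafted artifact features per video:
# [blink rate in blinks/min, facial symmetry score (higher = more symmetric)]
real = np.column_stack([rng.normal(17, 3, 200), rng.normal(0.9, 0.03, 200)])
fake = np.column_stack([rng.normal(5, 3, 200), rng.normal(0.7, 0.05, 200)])

X = np.vstack([real, fake])
y = np.array([0] * 200 + [1] * 200)      # 0 = real, 1 = fake

# Nearest-centroid classifier on standardized features.
mu, sd = X.mean(axis=0), X.std(axis=0)
Z = (X - mu) / sd
c_real = Z[y == 0].mean(axis=0)
c_fake = Z[y == 1].mean(axis=0)

pred = (np.linalg.norm(Z - c_fake, axis=1) <
        np.linalg.norm(Z - c_real, axis=1)).astype(int)
acc = (pred == y).mean()
print(f"training accuracy: {acc:.2f}")
```

Exactly as the paragraph warns, a detector of this kind collapses once a better-trained generator shifts the fake feature distribution toward the real one; the decision boundary then no longer separates the classes.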
Hidden features such as GAN fingerprints are unique to the deepfake model architecture
[44], and biometric signatures such as heartbeat detection through the skin do not depend on visual artifacts
[45]. These signatures remain reliable even when the visual artifacts are removed by better training. GANs also introduce frequency-level artifacts due to the upsampling methods in the GAN pipeline
[46], and the modified frames can be identified by frequency analysis and studying the compression map
[47][48]. Noiseprint is one such fingerprint, extracted by suppressing the high-level scene content and leveraging in-camera processes to obtain a unique fingerprint
[49]. Noiseprint has been applied to reliably localize frame modifications with high performance. Other camera-based fingerprint techniques, such as Photo Response Non-Uniformity (PRNU) sensor noise and JPEG compression artifacts, have also been used to detect frame-level forgeries due to their dependence on the source device
[50][51]. However, these unique artifact-based detectors can also be spoofed using a GAN-based approach where camera traces are inserted into the synthetic images
[52]. Along with the reliability of a unique fingerprint for detection, it is also essential that the fingerprint be resistant to forgery.
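One of the upsampling artifacts mentioned above can be made concrete: 2× nearest-neighbour upsampling, a common step in GAN generator pipelines, duplicates adjacent rows and columns, which cancel exactly at the Nyquist frequency of the 2D spectrum. The numpy sketch below demonstrates this with random images standing in for real content; it is an illustrative fingerprint under that specific upsampling assumption, not the detection method of the cited works:

```python
import numpy as np

rng = np.random.default_rng(2)

def nn_upsample(img):
    """2x nearest-neighbour upsampling, as used in many GAN generators."""
    return img.repeat(2, axis=0).repeat(2, axis=1)

def nyquist_energy(img):
    """Total spectral magnitude on the Nyquist row and column.

    Duplicated adjacent rows/columns (the footprint of 2x nearest-neighbour
    upsampling) cancel exactly at the Nyquist frequency, so an upsampled
    image scores (numerically) zero here while real content does not.
    """
    F = np.fft.fft2(img)
    n = img.shape[0]
    return np.abs(F[n // 2, :]).sum() + np.abs(F[:, n // 2]).sum()

natural = rng.normal(size=(256, 256))            # stand-in for camera content
synthetic = nn_upsample(rng.normal(size=(128, 128)))  # also 256x256

print(nyquist_energy(natural) > 1e-6, nyquist_energy(synthetic) < 1e-6)
# True True
```

As the surrounding text notes, even this kind of trace can be spoofed by a generator trained to reinsert camera-like statistics, which is why forgery resistance matters as much as uniqueness.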
3. ENF Applications in Digital Multimedia
The ENF was initially introduced as a forensic verification technique for law enforcement applications to verify the authenticity of audio recordings
[27]. Due to electromagnetic induction, audio recorders directly connected to the power grid can also embed the ENF fluctuations in their recordings
[28]. The applications were limited to devices connected directly to the power grid until the presence of the ENF was verified in battery-powered devices, captured through the background hum generated by surrounding grid-connected electrical appliances, greatly expanding the range of applicable devices
[31].
Along with audio, video recordings were also discovered to carry ENF fluctuations in the form of illumination frequency
[33][34]. The photons captured from artificial light carry similar fluctuations, and ENF estimation from video recordings depends on the imaging sensor used in the capture device. Complementary Metal–Oxide Semiconductor (CMOS) and Charge-Coupled Device (CCD) sensors are the most commonly used imaging sensors and differ in their shutter mechanisms
[38]. In the case of CCD sensors, a global shutter mechanism exposes the whole sensor grid at one instant, capturing a number of ENF samples equal to the number of frames per second. In CMOS sensors, however, a rolling shutter captures one ENF sample per row of the sensor grid, vastly increasing the number of captured samples
[34]. Due to limited samples in the CCD sensor, an alternative aliasing frequency technique can be used to estimate the ENF fluctuations
[33]; however, it is prone to signal noise. Most commercial-grade camera devices use CMOS sensors due to their cost-effectiveness, making them an effective platform for ENF estimation from video recordings.
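The sampling arithmetic behind the two shutter types can be sketched as follows, assuming a 60 Hz grid (so illumination flickers at 120 Hz), an NTSC frame rate of 29.97 fps, and a 1080-row sensor; the rolling shutter's per-frame idle time is ignored for simplicity:

```python
def alias_frequency(f_signal, f_sample):
    """Frequency to which f_signal folds when sampled at rate f_sample."""
    f = f_signal % f_sample
    return min(f, f_sample - f)

# CCD, global shutter: one ENF sample per frame.
fps = 29.97                          # NTSC frame rate (assumed setting)
print(alias_frequency(120.0, fps))   # nominal flicker (2 x 60 Hz) -> ~0.12 Hz
print(alias_frequency(120.02, fps))  # +0.02 Hz grid deviation -> ~0.14 Hz

# CMOS, rolling shutter: one ENF sample per sensor row.
rows = 1080
print(fps * rows)                    # ~32,368 effective samples per second
```

With the global shutter, the 120 Hz flicker folds down to roughly 0.12 Hz, and the ±0.02 Hz grid deviations survive only as small shifts of that aliased tone, which is why the technique is noise sensitive; the rolling shutter instead yields tens of thousands of samples per second, comfortably above the Nyquist requirement.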
The presence of the ENF signal in audio and video recordings has expanded its viable applications, including identifying the time of recording from its unique fluctuation pattern. Although the ENF fluctuations are similar throughout a power grid interconnect, the propagation delay can be used to identify the geographical location of a recording within the grid, essentially giving the ENF technology a geotagging capability
[53]. ENF presence in audio and video recordings can be used to synchronize the media recordings from multiple recorders in commercial applications
[39]. Smart grid infrastructure relies on ENF fluctuations to analyze power consumption, create a feedback loop for power outages, and prevent grid-level blackouts
[30].
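Both the timestamping and synchronization applications reduce to the same matching step: slide the ENF trace extracted from a recording along a reference log and pick the lag with the highest correlation. A minimal sketch with synthetic data (the smoothing and noise levels are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)

def smooth(x, k=30):
    return np.convolve(x, np.ones(k) / k, mode="same")

# One hour of reference ENF readings (one per second), normalized to stay
# within +/-0.02 Hz of the 60 Hz nominal value.
walk = smooth(rng.normal(0, 1, 3600))
reference = 60 + 0.02 * walk / np.abs(walk).max()

# ENF extracted from a 5-minute recording that began 1234 s into the log,
# with a little estimation noise.
offset = 1234
query = reference[offset:offset + 300] + rng.normal(0, 0.0005, 300)

def best_offset(ref, q):
    """Lag of the query with the highest correlation against the reference."""
    scores = [np.corrcoef(ref[k:k + len(q)], q)[0, 1]
              for k in range(len(ref) - len(q) + 1)]
    return int(np.argmax(scores))

print(best_offset(reference, query))  # recovers the recording start time
```

The same routine aligns two recordings against each other for multi-camera synchronization: the difference between their best offsets against a common reference gives their relative delay.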
4. ENF-Based Digital Media Authentication
With its forensic capabilities, the ENF signal can be used to detect both audio and video forgeries. Modifications such as copy-and-move, frame replay, spatial modifications, and the insertion of external recordings can be identified through ENF inconsistencies
[36][37]. Many ENF estimation techniques have already been proposed, based on a variety of spectrum estimation methods and phase identification.
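A minimal sketch of the consistency check implied here: compare the ENF extracted from a recording against the grid reference in sliding windows and flag windows whose deviation is too large. The traces below are deterministic toys (a sinusoidal drift stands in for real ENF fluctuations, and the spliced clip carries a mirrored trace); the window size and threshold are illustrative:

```python
import numpy as np

t = np.arange(600)                      # recording time in seconds
# Deterministic toy ENF trace: a slow drift around 60 Hz (illustration only).
reference = 60 + 0.015 * np.sin(2 * np.pi * t / 600)
extracted = reference.copy()

# Forgery: seconds 200-300 are replaced by a clip recorded at another time,
# carrying a different (here: mirrored) ENF trace.
extracted[200:300] = 60 - 0.015 * np.sin(2 * np.pi * t[200:300] / 600)

# Flag 50 s windows whose mean deviation from the reference is too large.
win, threshold = 50, 0.005
suspect = [k for k in range(0, 600, win)
           if np.abs(extracted[k:k + win] - reference[k:k + win]).mean()
           > threshold]
print(suspect)   # [200, 250] -- the two windows overlapping the splice
```

The same windowed comparison localizes copy-and-move and frame-replay edits, since any reordering of genuine frames also breaks the one-to-one agreement between the embedded and reference ENF traces.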