We read a lot about fake content, such as videos, nowadays. So-called deepfakes are convincing audio and video in which, for example, a person appears as someone else or seems to support fake news reporting. At the very least this is an interesting and challenging development, since we can no longer trust our eyes and ears. With the rise of AI and techniques like machine learning, this kind of content becomes more accessible and more convincing every day. This blog focuses on my experiments to create and detect fake video content. I specifically describe how it works and how you might use it yourself.
Before diving into the details, convince yourself of the power of deepfakes by watching this video:
You need compute power!
We often call fake videos deepfakes because of the use of deep neural network technology. Over the last decade, computer scientists have discovered that neural networks become more and more powerful when you add additional layers of neurons. To unlock the full power of these deeper networks, you need a lot of data and, very obviously, a lot of compute power.
As a starting point for experimenting with deepfake technology, you need a computer with at least 16 GB of RAM and an NVIDIA GPU to speed up model training and testing through parallel data processing. But as always, it depends on the task at hand: training a custom model requires far more resources than running ready-to-use models. Training your own models takes at least 100 to 300 GB of raw data (images or video frames), such as VoxCeleb, the dataset I used to train the models for this article. The VoxCeleb set contains 22,496 videos extracted from YouTube, with lengths varying from 64 to 1024 frames.
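To get a feeling for those numbers, here is a back-of-envelope storage estimate for a VoxCeleb-scale training set. The 256×256 RGB frame size is an assumption for illustration (actual preprocessing and video compression change the result dramatically):

```python
# Rough estimate of UNCOMPRESSED storage for a VoxCeleb-scale dataset,
# assuming 256x256 RGB frames at 1 byte per channel (an assumption).

def dataset_size_gb(n_videos, avg_frames, height=256, width=256, channels=3):
    """Uncompressed size in GB of n_videos, each avg_frames long."""
    bytes_per_frame = height * width * channels
    total_bytes = n_videos * avg_frames * bytes_per_frame
    return total_bytes / 1e9

# 22,496 videos, frame counts between 64 and 1024 -> take the midpoint
estimate = dataset_size_gb(22_496, (64 + 1024) // 2)
print(f"~{estimate:.0f} GB uncompressed")  # ~2406 GB uncompressed
```

Uncompressed frames would run into terabytes; video compression is what brings the dataset down to the raw-data range mentioned above.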
The process of creating deepfake videos relies on so-called face reenactment. This is an emerging research topic in the field of computer vision that attempts to accomplish two objectives simultaneously:
- Transfer the facial expression and head-pose from the driver face to the source face (see figure below).
- Preserve the appearance and facial features of the source face, guaranteeing good image quality without deformations.
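The two objectives above map naturally onto a combined training loss: a motion term (do the landmarks match the driver?) plus an identity term (does the output still look like the source?). The sketch below is purely illustrative; real systems use learned perceptual and adversarial losses rather than plain L2 distances, and the weights here are made up:

```python
import numpy as np

def reenactment_loss(out_landmarks, driver_landmarks,
                     out_identity, source_identity,
                     w_motion=1.0, w_id=1.0):
    """Illustrative loss: penalize motion mismatch with the driver
    and identity drift away from the source."""
    motion_term = np.mean((out_landmarks - driver_landmarks) ** 2)
    identity_term = np.mean((out_identity - source_identity) ** 2)
    return w_motion * motion_term + w_id * identity_term

# A perfect reenactment scores zero on both terms:
lm = np.ones((68, 2))
ident = np.ones(128)
print(reenactment_loss(lm, lm, ident, ident))  # 0.0
```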
In recent years, face reenactment has attracted enormous research effort due to its practical value in virtual reality and the entertainment industry, with applications in video editing, movie postproduction, and visual effects.
My research focused on the task of recreating a photorealistic talking head, which comes down to a system that can synthesize video sequences with the facial expressions and head-poses of a chosen person. Three roles are involved:
- Source contains the identity information: the face that will be recreated.
- Driver contains the facial expression and head-pose used to recreate the face.
- Reenact contains the result: the source face with the expression and head-pose of the driver.
As explained in the figure below:
The goal of the experiment is to transfer the expression and head-pose of the driver image to the source face. The output should therefore contain the source person's face with the same expression and head-pose as the driver image.
For face recreation I tested different deep-learning models using the same driver image. See some of the differences between two of the models below:
First Order Motion
The best results were achieved with the First Order Motion model, a model that learns to express motion as a combination of key point displacements between the source and driver faces.
I used a source image and a driver video, from which frames can be extracted at different moments in time. This makes it possible to recreate the source with different facial expressions and head-poses. Combining all the regenerated frames yields a video that performs the same movements as the driver video but with the facial features of the source. It sure looks like magic, as you can experience yourself in the video below.
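The frame-by-frame loop just described can be sketched as follows. `reenact_frame` is a stand-in for the trained First Order Motion network; here it simply copies the source so the pipeline structure is runnable without a model:

```python
import numpy as np

def reenact_frame(source, driver_frame):
    """Placeholder: a real model would warp `source` to match the
    expression and head-pose found in `driver_frame`."""
    return source.copy()

def reenact_video(source, driver_frames):
    """Recreate every driver frame with the source identity, then
    stack the results back into a video of shape (frames, H, W, 3)."""
    return np.stack([reenact_frame(source, f) for f in driver_frames])

source = np.zeros((256, 256, 3), dtype=np.uint8)       # one identity image
driver = np.zeros((64, 256, 256, 3), dtype=np.uint8)   # 64 driver frames
fake = reenact_video(source, driver)
print(fake.shape)  # (64, 256, 256, 3)
```

In a real run, the stacked frames would then be re-encoded into a video file at the driver's frame rate.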
Video: example of a fake video in real time; the person on the left is real, the one on the right is fake.
The fake video creation process uses the key points (landmarks) of the face during recreation.
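A minimal sketch of the landmark idea: take the displacement of the driver's landmarks between frame 0 and frame t, and apply it to the source landmarks. First Order Motion additionally learns local affine transforms around each key point; this only illustrates the displacement part, with made-up coordinates:

```python
import numpy as np

def transfer_landmarks(source_lm, driver_lm_0, driver_lm_t):
    """source_lm, driver_lm_*: (N, 2) arrays of (x, y) key points."""
    displacement = driver_lm_t - driver_lm_0   # how the driver moved
    return source_lm + displacement            # move the source the same way

source_lm = np.array([[100.0, 120.0], [150.0, 120.0]])  # e.g. two eye corners
driver_0  = np.array([[90.0, 110.0], [140.0, 110.0]])
driver_t  = np.array([[95.0, 112.0], [145.0, 112.0]])   # head moved right/down

print(transfer_landmarks(source_lm, driver_0, driver_t))
# [[105. 122.]
#  [155. 122.]]
```

The displaced source landmarks then guide the warp that generates the output frame.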
Why it’s important to be able to detect fake videos
As a result of the ever-increasing consumption of social media and online content, more and more people are relying on the internet as their (trusted) source of news. Nowadays, more video content is generated in about 30 days than the major television networks have created in the past 30 years.
A 2018 survey found that 50% of people aged between 18 and 29 believe that the content on news websites, apps, and social media is truthful and real.
The creation, editing, and propagation of digital videos is an effective way to spread fake news. Human faces attract the most manipulation, which is no surprise given that faces and expressions play a central role in human communication; an expression can, for example, emphasize a message.
New tools have already demonstrated their dangerous potential, having been used to abuse people or to stir up political or religious tensions between countries.
It is safe to assume that more sophisticated tools will appear in the near future. It is therefore necessary to understand the technologies behind them, to detect fake videos, and to filter malicious content from the internet.
Detecting fake content
I extended the experiment to find out whether it is possible to detect fake content created with the selected First Order Motion model. The Exposing DeepFake Videos detection model is based on the observation that current deepfake algorithms can only generate images at limited resolutions, which then have to be further transformed to match the original faces in the source video as well as possible. Luckily, such transforms leave distinctive artifacts in the resulting fake videos. The approach therefore detects these artifacts by comparing the generated face areas with their surrounding regions using a dedicated Convolutional Neural Network (CNN) model.
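To make the underlying cue tangible without a trained network, here is a toy stand-in: a face that was generated at low resolution and blended in tends to contain less high-frequency detail than its surroundings. The real detector learns this comparison with a CNN; the sketch below measures it with a plain Laplacian response on synthetic data, so it is only an illustration of the cue, not the actual method:

```python
import numpy as np

def high_freq_energy(img):
    """Mean absolute 4-neighbour Laplacian of a grayscale image."""
    lap = (4 * img[1:-1, 1:-1] - img[:-2, 1:-1] - img[2:, 1:-1]
           - img[1:-1, :-2] - img[1:-1, 2:])
    return np.abs(lap).mean()

def sharpness_gap(frame, face_box):
    """How much smoother the face region is than the frame overall.
    Large positive values are suspicious."""
    x0, y0, x1, y1 = face_box
    return high_freq_energy(frame) - high_freq_energy(frame[y0:y1, x0:x1])

rng = np.random.default_rng(0)
real = rng.random((128, 128))            # a uniformly 'sharp' noise frame
forged = real.copy()
face = forged[40:90, 40:90]
blur = face.copy()                       # crude 5-point box blur of the face
blur[1:-1, 1:-1] = (face[:-2, 1:-1] + face[2:, 1:-1] + face[1:-1, :-2]
                    + face[1:-1, 2:] + face[1:-1, 1:-1]) / 5
forged[40:90, 40:90] = blur

box = (40, 40, 90, 90)
print(sharpness_gap(real, box) < sharpness_gap(forged, box))  # True
```

The blurred "fake" face stands out against its sharper surroundings; the CNN in the detection model learns a far richer version of exactly this comparison.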
Video: detecting fake videos created with the First Order Motion model.
It is relatively easy to use neural networks to create, but also to detect, fake video content, although you need a large amount of training data and computing power to generate and detect realistic (fake) content.
One of my learnings from this research is that good-quality results come from correctly pre-processing your data: choose videos with a high resolution and make sure there are no objects near the face.
To conclude: detecting fake content is possible to some extent. In most cases, however, it will be very hard to reach top quality, because not all methods for creating fake videos are open source, so their output cannot be used as training data to detect more fake content. This will remain a challenge, because better creation methods will keep appearing and detection models must be retrained to catch the content they produce.