Video Data Augmentation for Deep Learning Models

In most Computer Vision applications, Deep Learning models achieve state-of-the-art performance. One drawback of Deep Learning is the large amount of data needed to train the models. Unfortunately, in many applications, data are difficult or expensive to collect. Data augmentation can alleviate the problem by generating new data from a smaller initial dataset. Geometric and color space image augmentation methods can increase the accuracy of Deep Learning models but are often not enough. More advanced solutions are Domain Randomization methods or the use of simulation to artificially generate the missing data. Data augmentation algorithms are usually specifically designed for single images. Most recently, Deep Learning models have been applied to the analysis of video sequences. The aim of this paper is to perform an exhaustive study of the novel techniques of video data augmentation for Deep Learning models and to point out the future directions of the research on this topic.

  • data augmentation
  • deep learning

1. Introduction

We live in a world where most of our actions are constantly captured by cameras. Video cameras are spread almost everywhere: in smartphones, computers, drones, surveillance systems, cars, robots, intercoms, etc. Image Processing (IP) and Computer Vision (CV) models, able to extract and analyse information from images, are becoming more and more important. With the advent of Deep Learning (DL) and the increase in computational power, classical CV algorithms are quickly being replaced by Convolutional Neural Networks (CNN) or other DL models [1,2]. Typically, DL models possess a huge number of parameters that need to be trained. The risk of overfitting with such big models is very high and big datasets with high variability are needed for networks to be able to generalise.
Unfortunately, collecting a big collection of images or videos and labelling them is both resource and time consuming, and, in some cases, even impossible. In medical image analysis, data such as computerized tomography (CT) and magnetic resonance imaging (MRI) scans are expensive and time consuming to collect. Moreover, medical data are protected by strict privacy protocols, making it difficult to obtain past recordings from hospitals. In robotics, a prolonged operation of robots for collecting data can result in the wearing or damaging of components, labour intensive procedures and dangerous interactions between machines and operators. Collecting data for autonomous vehicle control poses similar problems. Data collection in this case consists of running a vehicle (car, drone, boat) with a camera mounted on top in various environmental conditions (weather, time of the day, city versus countryside, etc.). This process can take a considerable amount of time, it is expensive, the vehicle can be damaged, and special permissions to operate in restricted areas are often needed. From these examples, it is clear how data collection can become a complex and troublesome process, but it is only part of the problem. In order to generate a dataset for supervised learning models, data need to be labelled. On many occasions, the labelling process cannot be automated, and each image needs to be labelled manually by humans (e.g., medical image segmentation).
The consequence of the aforementioned problems in data collection and labelling is the generation of small and unbalanced datasets. Several techniques exist to mitigate this problem, reducing overfitting and improving the generalisation capabilities of the models. For some problems, such as object recognition, face recognition and autonomous driving, big generic public datasets have already been collected [3,4,5,6]. Pretraining is a technique where models are first trained on big existing datasets built for more generic tasks. In this way, pretrained models can learn a base knowledge to be transferred to a specific problem. A pretrained model is able to converge faster on a new dataset, needing less data [7]. A similar approach is Transfer Learning: models pretrained on a dataset for a specific data distribution are able to transfer part of the acquired knowledge to a different distribution with little or no fine-tuning. Regularization techniques (Dropout and Batch Normalization) are other approaches to reduce overfitting. Using a combination of these techniques, tasks where data are scarce can be more easily handled by DL models. However, none of the previous methods directly solve the problems of shortage of data and unbalanced datasets.
Data augmentation techniques, on the other hand, address the lack of data by artificially generating new samples. The most basic technique of data augmentation for image analysis is noise injection: the dataset is expanded by creating duplicates of the original images with random values injected in the RGB space. Since the introduction of AlexNet in 2012 [8], geometric and color space transformations have been common data augmentation techniques used to improve the performance of DL models for image analysis. Cropping, flipping, rotating, translating, and histogram and RGB value alterations all fall into this category. With the improvements in Neural Networks (NN) and DL, more advanced data augmentation methods have emerged. Strategies based on generative modeling are able to generate new input images belonging to a distribution similar to that of the original dataset. These strategies use Generative Adversarial Networks (GANs) to generate the new images [9]. A GAN consists of two networks, a generator and a discriminator, that compete against each other during training: the generator tries to produce an image belonging to a distribution of interest from input images, while the discriminator tries to distinguish generated images from the ones belonging to the true data distribution. After training, the generator can be used to augment the original dataset with newly generated images from the same distribution as the original dataset. Neural Style Transfer is another DL-based methodology able to augment the size of image datasets. The idea is to alter the latent space of an Encoder/Decoder CNN in order to generate images with different styles. The output image of the Decoder is similar to the input one but with a difference in style that depends on the changes applied to the latent layer.
Video analysis adds the temporal dimension to the image analysis problem, resulting in a very complex challenge. With the introduction of Industry 4.0, robotics and autonomous vehicles, video analysis is becoming a focal problem for the research community. In this case, the input of the DL models is not single images but streams of multiple images with temporal and spatial correlations between each other.
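
As an illustration of the basic image-level augmentations described above, the following minimal sketch creates duplicates of an image with random RGB noise injected and with a simple horizontal flip; the array shapes, parameter values and function names are illustrative assumptions, not taken from any cited work.

    import numpy as np

    def noise_inject(image, sigma=10.0, rng=None):
        # Duplicate an image with random values added in the RGB space.
        rng = rng or np.random.default_rng()
        noisy = image.astype(np.float32) + rng.normal(0.0, sigma, image.shape)
        return np.clip(noisy, 0, 255).astype(np.uint8)

    def horizontal_flip(image):
        # Basic geometric transformation: mirror the image left to right.
        return image[:, ::-1, :]
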
While some of the models meant for image analysis can be used out of the box to analyse videos, some changes usually have to be made to take the temporal dimension into account. Optical flow [15], 3D convolutions [16] and Recurrent Neural Networks (RNN) [17,18] are the most common methods used to handle image sequences. However, the correlation in time and space between images of the same sequence needs to be taken into account not only in the design of the DL models, but also in the design of the datasets. Geometric and color space transformations can usually be applied to videos by keeping them constant for the entire image sequence, but, for more complex methods, the changes need to be more significant. In generative modeling, the generator network needs to keep some information about the past frames. The DL models used to analyse image sequences (optical flow, 3D convolutions and RNNs) are a proper solution. A different approach is to generate the images for the augmented dataset from physical models that approximate the world. In this case, detailed models of the environment, the physics and the cameras are defined by the researcher and used to generate synthetic approximations of real images. In simulation, the physical interaction between objects needs to be taken into account. If the focus is on human action recognition or prediction, the skeletal animation of the subjects is needed to simulate the motion. In domain randomization methods, camera motion must be taken into account, and the variations in textures, illumination and object shapes must be constant or coherent through the entire video sequence.
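
The clip-level consistency discussed above can be illustrated with a minimal sketch, assuming clips are stored as NumPy arrays of shape (T, H, W, C); the crop size and function name are illustrative assumptions.

    import numpy as np

    def augment_clip(clip, crop_h=224, crop_w=224, rng=None):
        # Apply the SAME random crop and horizontal flip to every frame of a clip,
        # so that the temporal cues between frames are not corrupted.
        rng = rng or np.random.default_rng()
        t, h, w, c = clip.shape
        y0 = rng.integers(0, h - crop_h + 1)
        x0 = rng.integers(0, w - crop_w + 1)
        out = clip[:, y0:y0 + crop_h, x0:x0 + crop_w, :]
        if rng.random() < 0.5:
            out = out[:, :, ::-1, :]   # horizontal flip along the width axis
        return out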

2. Review on Video Data Augmentation

There are five classes of methodologies for video data augmentation: basic transformations (geometric, color space, temporal, erasing and mixing), feature space augmentation, DL models, simulation, and methods that improve data generated through simulation using Generative Adversarial Networks.

2.1. Basic Transformations

A simple technique for temporal data augmentation in videos was proposed in [26]. The paper focuses on the problem of action recognition from videos. The authors augment the training set for their model by iteratively applying temporal cropping to each original video sequence: they temporally sub-sampled each video sequence of length l with a stride s, obtaining s new sequences of length l/s. A three-stream CNN was trained with and without data augmentation. The accuracy of both networks was evaluated on four different datasets: UCF101, HMDB51, Hollywood2 and Youtube. The network trained with data augmentation improved the accuracy on all the datasets (+1.3% on UCF101, +1.1% on HMDB51, +1.2% on Hollywood2 and +2.5% on Youtube). Data augmentation using temporal cropping is also proposed by Lee et al. [38]. The authors augment a video dataset of hand gestures by splitting the original 12-frame videos into 3 videos of 8 frames each (1st to 8th, 3rd to 10th and 5th to 12th frame). They also invert the temporal order of the frames, obtaining an augmented dataset six times larger than the original. The proposed data augmentation strategy was used to augment the VIVA dataset. Their mdCNN trained on the augmented dataset improved the accuracy by 6% over the same network trained without data augmentation.
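
A minimal sketch of these two temporal augmentations is given below, assuming a clip is a NumPy array with frames along the first axis; the function names and default window parameters are illustrative and do not come from the cited papers.

    import numpy as np

    def stride_subsample(clip, s):
        # Sub-sample a clip of length l with stride s, starting at each of the s
        # possible offsets; returns s new clips of length roughly l/s (as in [26]).
        return [clip[offset::s] for offset in range(s)]

    def window_and_reverse(clip, length=8, step=2):
        # Split a clip into overlapping windows (e.g., a 12-frame clip into three
        # 8-frame clips) and add their temporally reversed copies (as in [38]).
        windows = [clip[start:start + length]
                   for start in range(0, len(clip) - length + 1, step)]
        return windows + [w[::-1] for w in windows]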

Applying commonly used image-level data augmentation strategies to video sequences may introduce unnecessary noise that corrupts the temporal cues of intra-clip frames. In [44], the authors solve the problem by applying the same transformation to all the frames of a mini-batch clip instead of randomly changing it for each frame. Random cropping, flipping and erasing are used to augment a video dataset for person re-identification.

Image mixing techniques (e.g., Mixup [60] and CutMix [61]) have been widely used for image data augmentation. These approaches generate the augmented images by mixing the pixel values from two different images of the original dataset. Some algorithms, such as Mixup, average the RGB values of the two images, while methods like CutMix replace randomly shaped patches of one image with patches from the other. In order to extend image mixing techniques to video data augmentation, the temporal cues between frames must be taken into account. VideoMix [46] is a data augmentation method proposed by Yun et al. that extends CutMix to video data augmentation: temporal consistency is preserved by keeping the patch size and position the same for all the frames of each video clip. The authors tested VideoMix on three tasks (action recognition, localization and detection), training different 3D CNNs, and compared the performance of their algorithm against the vanilla CutMix method. After training the SlowFast-50 network on the Mini-Kinetics dataset, VideoMix achieved the best improvement in accuracy (+2.4%) for action recognition.
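
A minimal sketch of this spatio-temporally consistent mixing is given below, assuming two clips of equal shape (T, H, W, C) stored as NumPy arrays and one-hot labels; it follows the general CutMix formulation and is not the authors' reference implementation.

    import numpy as np

    def videomix(clip_a, clip_b, label_a, label_b, rng=None):
        # Cut one rectangular patch from clip_b and paste it into clip_a,
        # keeping the patch position fixed across ALL frames of the clip.
        rng = rng or np.random.default_rng()
        t, h, w, c = clip_a.shape
        lam = rng.beta(1.0, 1.0)
        cut_h, cut_w = int(h * np.sqrt(1 - lam)), int(w * np.sqrt(1 - lam))
        cy, cx = rng.integers(0, h), rng.integers(0, w)
        y0, y1 = np.clip([cy - cut_h // 2, cy + cut_h // 2], 0, h)
        x0, x1 = np.clip([cx - cut_w // 2, cx + cut_w // 2], 0, w)
        mixed = clip_a.copy()
        mixed[:, y0:y1, x0:x1, :] = clip_b[:, y0:y1, x0:x1, :]
        # Mix the labels according to the area that was actually replaced.
        area = (y1 - y0) * (x1 - x0) / (h * w)
        return mixed, (1 - area) * np.asarray(label_a) + area * np.asarray(label_b)
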
In video synopsis applications, motion information is more important than video fidelity. Namitha et al. [43] proposed a data augmentation toolbox able to generate synthetic static-camera surveillance videos for video synopsis analysis. The synthetic videos are composed by superimposing on an extracted background a series of coloured rectangular boxes that represent moving objects or persons. The toolbox allows the user to choose the number, size, trajectory and speed of the boxes added to the synthetic video. In order to test the efficiency of their data augmentation method, the authors compared real camera footage from different real-world video datasets to their synthetic counterparts. When evaluated on the frame compact ratio (CR), total true collision area (TCA) and total false overlapping area (FOA) metrics, the results obtained on real-world and synthetic data were close, demonstrating the validity of the data augmentation method.

In their paper, Hu et al. [54] introduced AMMC (Augmentation by Mimicking Motion Change), a data augmentation strategy for object tracking that takes tracking motion features into consideration. AMMC first separates the target and background in the images. The cropped target images are transformed with operations such as rotation, projection, resizing, blurring and occlusion that reflect motion changes. The augmented target images are then superimposed on the background images at a random position in order to obtain new synthetic data. The authors trained the ATOM and DiMP trackers on their simulated dataset and performed comprehensive experiments on five popular tracking benchmarks: LaSOT, GOT-10k, TrackingNet, OTB-100 and UAV123.
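
The paste step at the core of this kind of strategy can be sketched as follows; the cropping and the rotation/blur/occlusion transforms of the target are assumed to have been applied beforehand, and the function is only an illustration, not the actual AMMC pipeline.

    import numpy as np

    def paste_target(background, target, rng=None):
        # Superimpose a previously cropped and transformed target patch on a
        # background frame at a random position, mimicking a motion change.
        rng = rng or np.random.default_rng()
        bh, bw, _ = background.shape
        th, tw, _ = target.shape
        y0 = rng.integers(0, bh - th + 1)
        x0 = rng.integers(0, bw - tw + 1)
        frame = background.copy()
        frame[y0:y0 + th, x0:x0 + tw, :] = target
        # The box (x0, y0, tw, th) can serve as the new tracking label.
        return frame, (x0, y0, tw, th)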

2.2. Feature Space

DL models often extract a one-dimensional feature vector from the input images. Sometimes, it is more convenient to perform the data augmentation in the feature space rather than in the image space (lack of availability of the original videos due to privacy constraints, ad hoc organization of the feature space, etc.). In their works, Dong et al. [31,53] proposed a data augmentation strategy for a content-based video recommendation challenge. The authors did not have access to the RGB video frames and applied the data augmentation directly to the feature vectors extracted from an InceptionV3 deep network. They proposed a data augmentation technique similar to the one used by Wang et al. [26] for video action recognition. Their frame-level data augmentation sub-samples each feature sequence by skipping frames with a stride s. By repeating the process starting from a different frame of the original feature sequence, they are able to generate s distinct new sequences. The authors compared the performance metric scores (recall/hit scores) of the network trained with and without data augmentation on the Hulu Content-based Video Relevance Prediction Challenge 2018. In their most recent work, the network trained with data augmentation achieved an improvement of the performance both on the TV-Shows (2.708 → 3.092) and Movies (2.030 → 2.289) datasets.

2.3. DL Models

A GAN is also used in [36] to augment video datasets for action recognition. For each video sequence representing an action, the generator outputs a single frame that encodes all the information regarding motion features. The generated frames and the original datasets are then joined together to obtain the augmented training set. The GAN feature generator can enlarge the differences between similar classes. The data augmentation model was tested on the UCF101 and KTH action recognition datasets. A 2DCNN and a 3DCNN were trained with and without data augmentation, with the networks trained on augmented data obtaining an increase in accuracy on both datasets with respect to those trained on the original data: 2DCNN +35% on KTH and +26% on UCF101, 3DCNN +37% on KTH and +21% on UCF101.

More recently, Wei et al. [51] presented a novel GAN-based model for appearance-controllable human video motion transfer. The GAN model is able to generate a novel video from a source motion video and multiple target appearance videos. The innovation of their technique is the ability to control the appearance of the subject and the background in the generated synthetic videos without any retraining of the model. To achieve this result, the inputs are first preprocessed, extracting the skeletal pose sequence from the source motion video together with the appearance of the face, upper garment and lower garment from the target appearance videos. Using the preprocessed inputs, a GAN generates a synthetic video of a new subject performing the source action. This video is then superimposed on a selected background to generate the final video sequence.

2.4. Simulation

The great success of the video game industry is leading to an exponential improvement of graphics cards and real-time rendering systems. Several graphics and physics engines exist that are able to render photo-realistic scenes at high frame rates. Game engines like Unreal Engine [13] and Unity [12] not only produce high-quality synthetic videos, but they also come with a powerful, programmable and user-friendly interface, making them the perfect tool to generate augmented simulated datasets. In robotics, simulators are often used to test and train control models, and 3D robotic simulators have existed for more than two decades. As far as DL model training is concerned, Reinforcement Learning (RL) agents have often been trained in simulation, due to their need to continuously explore the environment that surrounds them [62].
One of the first attempts to generate a simulated video dataset for gait recognition was made by Charalambous et al. in 2016 [25]. The authors used Vicon motion capture data extracted from recordings of humans walking and running on a treadmill. The Vicon data were then imported into Blender [63] and attached to randomly generated avatars (with differences in age, sex, weight, etc.). Using Blender, it was possible to automatically label the data. Compared to more recent simulated datasets, the images were quite simplistic, with a single avatar centered in the frame and a plain grey background.
De Souza et al. [27] went a step further, generating a diverse, realistic and physically plausible dataset of human action videos, called PHAV. The authors used Unity to render the videos, and they were able to randomise the scene based on different parameters and preset assets (environment, camera position, weather, lighting, time of the day, number of actors). The approach is not limited to existing motion capture sequences, but procedurally defines synthetic actions via a combination of atomic motions. In their follow-up paper [42], the authors improve and describe in greater depth the generative 3D model and the procedural algorithm used to randomise the scene and generate the actions. The improved framework is also able to generate multiple modalities such as semantic segmentation and optical flow. The proposed parametric simulation tool is able to generate fully annotated action videos at 3.6 FPS using one consumer-grade gaming GPU (NVIDIA GTX 1070). The authors tested the data augmentation performance of the model on two mainstream action recognition datasets: UCF-101 and HMDB-51. A Temporal Segment Network (TSN) was trained with and without data augmentation, with the former (named CoolTSN) obtaining higher accuracy on both datasets: TSN on UCF-101 93.6%; CoolTSN on UCF-101 94.2%; TSN on HMDB-51 66.6%; and CoolTSN on HMDB-51 69.5%.
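
As a rough illustration of this kind of scripted generation, the following Blender (bpy) sketch renders short sequences while randomising the camera position and light energy once per clip; the object names ("Camera", "Light"), value ranges and output path are assumptions, and a real pipeline would also animate an avatar and export the labels.

    import random
    import bpy

    scene = bpy.context.scene
    cam = bpy.data.objects["Camera"]    # assumed object name
    light = bpy.data.objects["Light"]   # assumed object name

    for clip_idx in range(10):
        # Randomise viewpoint and illumination once per clip so that the
        # variation stays coherent through the whole rendered sequence.
        cam.location = (random.uniform(-2, 2), random.uniform(-6, -4), random.uniform(1, 3))
        light.data.energy = random.uniform(500, 1500)
        for frame in range(1, 61):
            scene.frame_set(frame)
            scene.render.filepath = "/tmp/synthetic/clip_%02d/frame_%04d.png" % (clip_idx, frame)
            bpy.ops.render.render(write_still=True)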

2.5. Solving the Reality Gap (Simulation + GAN)

The reality gap is the subtle discrepancy between reality and simulation that prevents DL models from properly learning from simulated images. One way to alleviate the problem is to exploit recent advancements in generative adversarial networks: GAN models can be used to refine synthetic images so that they are visually closer to real ones. Recently, Wang et al. [48] used this idea in their data augmentation framework for crowd videos. They created two synthetic datasets: the first one is a large synthetic video training set with labels generated using the video game GTAV; the second one is a smaller dataset of synthetic images generated by a CycleGAN. The CycleGAN takes as input real and simulated images and generates realistic images based on the two. The CycleGAN-generated dataset preserves the labels of the original simulated videos. The large synthetic dataset was used to pretrain a CNN crowd understanding model. The crowd model was then fine-tuned on the smaller refined dataset.
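
A minimal sketch of this two-stage scheme is given below, assuming generic PyTorch data loaders for the large simulated set and the smaller CycleGAN-refined set; the model, epochs and learning rates are placeholders rather than the settings used in [48].

    import torch
    from torch import nn, optim

    def train(model, loader, epochs, lr, device="cuda"):
        # Generic supervised training loop reused for both stages.
        model.to(device).train()
        criterion = nn.CrossEntropyLoss()
        optimizer = optim.SGD(model.parameters(), lr=lr, momentum=0.9)
        for _ in range(epochs):
            for clips, labels in loader:
                clips, labels = clips.to(device), labels.to(device)
                optimizer.zero_grad()
                loss = criterion(model(clips), labels)
                loss.backward()
                optimizer.step()

    # Stage 1: pretrain on the large GTAV-generated synthetic dataset.
    # train(crowd_model, synthetic_loader, epochs=30, lr=1e-2)
    # Stage 2: fine-tune on the smaller CycleGAN-refined dataset with a lower learning rate.
    # train(crowd_model, refined_loader, epochs=10, lr=1e-3)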
 