2. Computer Vision and Synthetic Image Data
The rise of CNNs and deep learning in computer vision has necessitated ever larger amounts of image data for training and testing. Such image data is commonly stored in the form of photos and videos. The traditional method of obtaining image data for training, testing, and validation of neural networks has been to capture data from the real world, followed by manual annotation and labelling of the collected data. This methodology is relatively simple and cost-effective for smaller datasets, where there are few key objects per image, the exact position of objects is not important, and only a general classification is needed. However, it becomes increasingly costly when scaled up to large datasets or to datasets that require more detailed annotation.
The first problem is the collection of large amounts of data. It is possible to automate the collection of real-world image data to a certain extent in applications that use fixed or vehicle-mounted cameras, but this is not the case for all computer vision applications. Collecting large datasets for applications such as facial recognition and medical scanning can be very difficult for a range of reasons, including privacy, cost, and other legal restrictions. It can also be difficult to reliably collect data of specific environmental conditions, such as foggy roads: fog is not something that can typically be created in a public space just for the sake of data collection.
The difficulty only increases when detailed data annotation is needed. Manual data annotation can be a slow and laborious task depending on what needs to be annotated and to what degree of accuracy and precision. Some datasets, such as ImageNet [20], are relatively simple to annotate manually, as the images primarily need to be placed into categories based on the primary object in focus. Even so, annotating over a million images is a massively time-consuming task, even for a large group of people. When more complex details must be annotated, such as the number of people in a crowd, object poses, or the depth of objects, the cost-effectiveness decreases significantly. Time and money are not the only concerns either: manual annotation quality tends to decrease on large datasets due to human error. In some applications, such as the aforementioned crowd counting, it may not even be possible for a human to reliably count the number of people, depending on image quality and crowd density.
For such applications, synthetic data provides two key benefits. The first benefit is that data generation can be automated. The generation of synthetic human faces with varying levels of realism has been possible for many years, and has enabled the creation of facial recognition datasets without the privacy concerns of photographing people or the time and cost of having many people pose for photos. The second benefit is that, as long as the data synthesis model keeps track of the various objects and features during the synthesis process, detailed automatic annotation is possible. It is important to note that although automatic annotation is possible, it is still dependent on how the data synthesis is set up. Detailed information on all objects within a virtual 3D environment is usually available in 3D modelling software or game engines, but if that information is not extracted as part of the data synthesis process, then there is fundamentally no difference from collecting data in the real world.
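As a hedged illustration of what such annotation extraction can look like (independent of any particular engine, with all names hypothetical), the sketch below derives a 2D bounding-box label from an object's 3D bounding-box corners and a pinhole camera intrinsic matrix, the kind of information a renderer has available at synthesis time.

```python
import numpy as np

def project_points(points_3d, K):
    """Project Nx3 camera-space points to pixel coordinates
    using a pinhole intrinsic matrix K (3x3)."""
    uvw = (K @ points_3d.T).T          # homogeneous pixel coordinates
    return uvw[:, :2] / uvw[:, 2:3]    # perspective divide by depth

def bbox_label(corners_3d, K, label):
    """Derive a 2D bounding-box annotation from an object's
    3D bounding-box corners (8x3, camera space, z > 0)."""
    uv = project_points(corners_3d, K)
    x_min, y_min = uv.min(axis=0)
    x_max, y_max = uv.max(axis=0)
    return {"label": label,
            "bbox": [float(x_min), float(y_min), float(x_max), float(y_max)]}

# Hypothetical example: a 2x2x2 m cube centred 10 m in front of the camera.
K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])
corners = np.array([[x, y, z]
                    for x in (-1, 1) for y in (-1, 1) for z in (9, 11)], float)
print(bbox_label(corners, K, "car"))
```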
These are the primary difficulties in data collection and annotation for computer vision applications, and the major benefits synthetic data can provide.
3. Types of Synthetic Imagery for Computer Vision
Synthetic image data can be broadly categorised into two types: synthetic composites and virtual synthetic data.
3.1. Synthetic Composite Imagery
Synthetic composite imagery refers to real image data that has been digitally manipulated or augmented to introduce elements that were not originally present. This includes digital manipulation of the image environment, the introduction of synthetic objects into the image, or the splicing of different real images into a new image.
Synthetic composite datasets such as SURREAL [21] are created by projecting 3D synthetic objects or people into real background environments (Figure 1). This data type is often used where the background environment contains enough useful or significant features that recreating it synthetically is not worth the effort or the added domain shift. The SURREAL dataset was primarily created to train networks on human depth estimation and part segmentation. As a result, the synthetic humans do not take into account the background environments they are placed in. The resulting scenes are easily identified as synthetic by the human eye, but the features the network needs to learn are attached to the human object, so the background simply serves to reduce the domain gap to real data by increasing environmental diversity.
Figure 1. Synthesis via projection of a synthetic 3D object onto a real 2D background.
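A minimal sketch of this style of compositing is shown below, assuming a foreground render saved with an alpha channel ("person.png") and a real photo ("background.jpg"); both file names and the paste position are placeholders. The point of interest is that the render's alpha channel doubles as a pixel-perfect segmentation label at no extra cost.

```python
from PIL import Image

background = Image.open("background.jpg").convert("RGBA")
person = Image.open("person.png").convert("RGBA")   # rendered with alpha

# Paste the rendered foreground at a chosen position; the alpha channel
# acts as the compositing mask.
position = (200, 150)
composite = background.copy()
composite.alpha_composite(person, dest=position)
composite.convert("RGB").save("composite.jpg")

# The same alpha channel provides the segmentation ground truth for free.
mask = Image.new("L", background.size, 0)
mask.paste(person.getchannel("A"), position)
mask.save("person_mask.png")
```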
Similarly, the RarePlanes dataset [22] provides synthetic composite satellite imagery of aircraft at different airport locations. However, instead of projecting 3D objects onto a background, 2D images are directly overlaid onto the backgrounds (Figure 2). Satellite imagery is one of many fields of computer vision where large datasets are difficult to obtain due to the nature of the image data required; the authors note that there are no expansive, permissively licensed synthetic datasets for such data. The RarePlanes dataset consists of a mix of real and synthetic satellite imagery with aerial images of planes overlaid on top. While Figure 2 shows real 2D backgrounds, in practice the approach extends to synthetic 2D backgrounds as well, as this does not affect the overall process of overlaying 2D images onto a background. The synthetic data was created using the AI.Reverie platform, which used Unreal Engine to create realistic synthetic data based on real-world airports.
Figure 2. Synthesis via overlaying a 2D image onto a real 2D background.
Large crowd datasets, both images and videos, are resource-intensive to annotate, with some scenes containing in excess of 1000 people. People in crowds are also often not fully in view, potentially with only part of their head visible and the rest of their body obscured by the surroundings. Manual annotation can therefore leave data incompletely labelled, introducing dataset bias. There are two common methods of synthesising crowd data. The first is to use 3D human models and either project them onto a 2D background or place them into a 3D virtual environment. In practice, rendering scenes with over 1000 models is highly computationally demanding, but if video data is needed, this is still the easiest method of generating crowd data. The second method is to use 2D overlays to project images of humans onto a 2D background. A paper on large crowd analysis using synthetic data [23] projected synthetic humans onto real scenes; the synthesis enabled the illumination, movement, and density of people to be controlled while providing ground truth information.
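For crowd counting in particular, the ground truth a synthesis pipeline can emit for free is a list of head positions, which is conventionally converted into a density map that sums to the person count. Below is a minimal sketch of that standard conversion (assuming SciPy); the fixed kernel width is illustrative, whereas real pipelines often adapt it to local crowd density.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def density_map(head_points, height, width, sigma=4.0):
    """Place a unit impulse at each known head position and blur it,
    so the map integrates to the number of people in the image."""
    impulses = np.zeros((height, width), dtype=np.float32)
    for x, y in head_points:
        impulses[int(y), int(x)] += 1.0
    return gaussian_filter(impulses, sigma=sigma)

# Head positions come for free from the synthesis step.
heads = [(120, 80), (130, 85), (400, 300)]
dmap = density_map(heads, height=480, width=640)
print(dmap.sum())  # ~3.0, the ground-truth count
```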
Datasets such as foggy scenes [2] use real data as a basis and digitally manipulate the image data to produce synthetic variations. Such data is created for applications where data is hard to obtain due to specific environmental requirements, but where real environments and objects still hold enough value that recreating the entire scene virtually is not worth the effort. In practice, this method of image synthesis can be considered an extension of overlaying 2D images onto a background, except that instead of overlaying an image, a filter is used to project the required environmental conditions onto the scene. Compared to 2D image overlays, filters are also comparatively simple to extend to video data if required.
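Fog filters of this kind are usually based on the standard optical attenuation model, in which each pixel's fogged colour is a depth-dependent blend of the clean colour and the atmospheric light. The sketch below assumes a per-pixel depth map is available (as it is for stereo or synthetic data); the attenuation coefficient and atmospheric light values are illustrative, not the exact parameters of [2].

```python
import numpy as np

def add_fog(image, depth, beta=0.05, atmosphere=0.9):
    """Apply homogeneous fog: I_fog = I * t + A * (1 - t),
    with transmittance t = exp(-beta * depth).

    image : HxWx3 float array in [0, 1]
    depth : HxW distances from the camera in metres
    """
    t = np.exp(-beta * depth)[..., None]   # per-pixel transmittance
    return image * t + atmosphere * (1.0 - t)

# Illustrative use: a higher beta (denser fog) washes out distant pixels.
img = np.random.rand(480, 640, 3)
depth = np.full((480, 640), 50.0)          # everything 50 m away
foggy = add_fog(img, depth, beta=0.03)
```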
While all synthetic composites are image composites by definition, some synthetic composites use no synthetic objects or images at all. Image compositing works in the same way as 2D image overlays, but takes labelled 2D objects from one dataset and places them into scenes from other datasets. This method of data synthesis tends to produce datasets with a lower domain gap than virtual synthetic datasets, possibly because the domain randomisation increases data diversity and improves generalisation [4].
The fish identification dataset [24] is an example: it uses instances of real fish cropped out of data collected with the Deep Vision system [25] and places them, in random orientations, positions, and sizes, onto backgrounds taken from Deep Vision footage in which no other fish or objects are present. The resulting composite image comprises only real data, but is still considered synthetic because the exact scene was never captured in the real world. The reason for generating such data is primarily the difficulty of annotating the existing Deep Vision data. Generating synthetic data with known fish species allows much cheaper labelled data, and extracting fish from scenes where the species can be readily identified by a human is a significantly less time-consuming task than manually labelling the original Deep Vision dataset.
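A hedged sketch of this crop-and-paste compositing follows: a cropped, labelled object image with transparency is pasted onto an empty background at a random orientation, position, and scale, and the label falls out of the paste parameters by construction. File names, transform ranges, and the species label are placeholders, not the actual Deep Vision pipeline.

```python
import random
from PIL import Image

background = Image.open("empty_scene.png").convert("RGBA")
fish = Image.open("fish_crop.png").convert("RGBA")  # cropped, labelled real fish

# Random size, orientation, and position (ranges are illustrative).
scale = random.uniform(0.5, 1.5)
fish = fish.resize((int(fish.width * scale), int(fish.height * scale)))
fish = fish.rotate(random.uniform(0, 360), expand=True)
x = random.randint(0, background.width - fish.width)
y = random.randint(0, background.height - fish.height)

background.alpha_composite(fish, dest=(x, y))
background.convert("RGB").save("synthetic_sample.jpg")

# The annotation is known by construction: the species comes from the
# source crop, the box from the paste parameters.
annotation = {"species": "known_from_crop",
              "bbox": [x, y, x + fish.width, y + fish.height]}
```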
Image synthesis could be considered an extreme version of image compositing: instead of extracting labelled objects and placing them into other scenes, it takes labelled object features and combines them with other labelled object features to produce a new object. Visually, the new object may look nothing like the objects from which the features were extracted, but from the perspective of a neural network, the synthesised object still contains all the necessary features to identify what the object is [6].
The KITTI-360 dataset [26] was created with the goal of augmenting the KITTI dataset [27] with more objects, increasing data efficiency for training. The authors noted that while 3D-rendered virtual worlds were becoming more popular for producing urban environment data, creating such an environment requires significant human input before automatic data generation can begin. Instead, they proposed a process for integrating synthetic objects into real environments in a photo-realistic manner. By creating 360-degree environment maps, KITTI-360 was able to place high-quality vehicle models into existing KITTI scenes under realistic lighting conditions. The models themselves are created by projecting 2D texture images onto 3D meshes (Figure 3), which are then projected onto backgrounds to give a realistic view of the object as the perspective changes over the course of the video.
Figure 3. Synthesis via projection of a real 2D image onto a 3D mesh.
The SafeUAV synthetic dataset [28] is a rarer extension of mesh projection, projecting real 2D backgrounds onto geometry to create a full 3D background. SafeUAV uses a 3D mesh reconstruction of an urban environment in CityEngine and overlays real photo data onto the mesh (Figure 3). The result warps the photo data significantly when viewed from ground angles, but provides a reasonably faithful view from above, which is all that is required, as the dataset was generated for semantic segmentation and depth perception tasks from a drone.
The last type of synthetic composite imagery is less an image composite and more an extreme extension of digital manipulation. Images synthesised using variational autoencoders, generative adversarial networks, and diffusion models use noise maps as inputs to generate an image (Figure 4). By learning compressed representations of images, these models can extrapolate a noise map into a complete image.
Figure 4. Synthesis via processing of noise maps.
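As a minimal illustration of the noise-to-image interface in Figure 4, the sketch below maps a random latent noise vector through a toy decoder the way a trained GAN generator or VAE decoder would. The architecture and sizes are placeholders: a real model has learned weights and many more layers, but the calling convention is the same.

```python
import torch
import torch.nn as nn

# Toy decoder: maps a 100-d noise vector to a 3x64x64 image tensor.
# A trained GAN generator or VAE decoder exposes the same interface.
generator = nn.Sequential(
    nn.Linear(100, 256), nn.ReLU(),
    nn.Linear(256, 3 * 64 * 64), nn.Tanh(),
)

z = torch.randn(1, 100)                   # the "noise map" input
image = generator(z).view(1, 3, 64, 64)   # extrapolated into an image
```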
3.2. Virtual Synthetic Data
Virtual synthetic data refers to image data that is completely synthesised, containing no real data directly. This covers a wide range of synthetic image data, from synthetic objects with patterned textures placed in front of artificial backgrounds to photorealistic 3D environments designed to emulate the real world. Based on the generation methodologies used, virtual synthetic data can be categorised into three groups: virtual scenes, virtual environments, and virtual worlds. This categorisation is independent of the photo-realism of the data produced.
Virtual scenes are the simplest form of virtual synthetic data. They typically use the minimum number of 2D and 3D objects needed to create a scene from which synthetic image data can be captured. The generation of synthetic faces for facial recognition tasks using 3D morphable models or parametric models is an example of a virtual scene. Synthetic faces often omit everything below the neck; some models only generate a mask of the face. When viewed from the front, the faces can be captured and used as synthetic image data. If a background is required, it only needs to look correct from the viewing angle, and in some situations a realistic background might not be required at all. Observing the face from other angles or positions makes it obvious that the head is not a complete object. It is not a complete virtual environment, and for such applications, a complete virtual environment is not required.
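The parametric face models mentioned here are typically linear: a face shape is the model's mean shape plus a weighted sum of learned basis directions, and sampling the weights yields new identities. Below is a minimal numpy sketch of that construction with made-up dimensions and random stand-in bases; a real morphable model stores tens of thousands of vertices and bases fitted from 3D scans.

```python
import numpy as np

n_vertices, n_components = 5000, 80            # illustrative sizes

mean_shape = np.zeros(n_vertices * 3)          # stand-in for the learned mean
shape_basis = np.random.randn(n_vertices * 3, n_components) * 0.01

def synthesise_face(coefficients):
    """New face = mean shape + basis @ coefficients, reshaped to Nx3
    vertex positions; varying the coefficients varies the identity."""
    flat = mean_shape + shape_basis @ coefficients
    return flat.reshape(n_vertices, 3)

# Sampling coefficients from a prior yields an unlimited supply of faces.
face = synthesise_face(np.random.randn(n_components))
```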
Virtual environments are a step above virtual scenes and comprise a complete 3D virtual construction of a specific environment, such as the inside of a house or a pedestrian crossing. Either way, the goal is to enable the capture of image data from multiple perspectives without risking degradation of data quality due to problems such as object artefacts. When viewed from outside, the virtual environment may still look incomplete, but within the environment, it is self-consistent.
Virtual worlds are effectively virtual environments on a larger scale. Scenes that in a virtual environment might have been a flat 2D background are fully constructed, with events occurring beyond the view of the virtual camera. This is most commonly found in virtual data captured from games with pre-built large-scale environments, such as Grand Theft Auto V. Creating virtual worlds to collect such data is labour intensive, which is why collecting data from games with pre-built worlds is a common alternative. Virtual KITTI [29] is an example of a virtual world in which the environment from parts of the KITTI dataset [27] was recreated digitally to produce a virtual copy of the KITTI dataset.
In the field of object detection, some research has moved towards highly photorealistic object renders to reduce the domain gap to the target domain. Other research has found that photorealism might not be the only way to reduce the domain gap: using domain randomisation, where the objects of interest are placed into random non-realistic environments, it is possible to force a model to learn object features [30]. Compared to photorealistic data, this type of synthetic data may not fit the target domain as well, but its generalisation means it stands to achieve better average performance across multiple domains. Virtual synthetic data offers a way to create both photorealistic and non-photorealistic environments that can be manipulated as required to produce the necessary image data.
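A hedged sketch of what domain randomisation looks like in practice is given below: each rendered sample draws its nuisance parameters (background, lighting, camera pose, textures) at random so that only the object's own features stay constant across the dataset. The render_scene call is a placeholder for whatever engine is used, and the parameter names and ranges are illustrative.

```python
import random

def randomised_scene_params():
    """Draw non-realistic nuisance parameters for one training image;
    only the object of interest is kept fixed across samples."""
    return {
        "background": random.choice(["noise", "checker", "random_photo"]),
        "light_intensity": random.uniform(0.2, 5.0),
        "light_colour": [random.random() for _ in range(3)],
        "camera_distance": random.uniform(0.5, 3.0),
        "object_texture": random.choice(["original", "random_colour"]),
    }

# for i in range(100000):
#     params = randomised_scene_params()
#     image, labels = render_scene("target_object", params)  # engine-specific
```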
Facial recognition has made significant progress over the past few years thanks to developments in deep learning networks and large-scale datasets. However, it has become increasingly difficult to build larger datasets by trawling the internet for faces, due to labelling noise and privacy concerns. As a result, synthetic data has become the alternative for obtaining large datasets. The performance of networks trained on synthetic data for facial recognition has historically been poor: the domain gap has often been very large, resulting in poor real-world performance. However, synthesised faces still offer the great benefit of avoiding privacy issues, and developments over the years have steadily improved face generation technology [31].
Moving past the generation of virtual synthetic data for standalone objects and faces, some applications necessitate the construction of a larger virtual scene. Where a task such as pedestrian detection is required but no real data exists for network training or even domain adaptation, synthetic data is the only available way to source data for training a pedestrian detection model [3]. The problem is that synthetic data suffers from a domain gap with real data, and without any real data, traditional methods of reducing that gap, such as mixing data or fine-tuning after pre-training on synthetic data, are not possible. In such cases, the best option is to provide as much ground truth as possible from the constructed virtual scene.
Vehicle re-identification is another field that can utilise virtual synthetic data in the scope of a virtual scene. While vehicle detection and identification are closely related to tasks such as urban driving, vehicle re-identification is primarily concerned with stationary vehicles, so synthetic vehicles can be placed into small virtual scenes for data collection. Similarities between vehicle types viewed from different angles, as well as the near-identical appearance of some vehicle types, can cause many difficulties with real data. Addressing this requires highly diverse datasets from which to learn specific features, but even when such data is available, manually annotating it is prohibitively expensive. Synthetic data provides an alternative source of large, automatically labelled data that can be generated from many different perspectives, allowing for much more diverse datasets than would normally be available from the real world [32].
In cases where virtual scenes are not sufficient to produce the data required, virtual worlds offer a much larger environment from which to capture data, at a computational cost. Most virtual worlds are not fully utilised all the time; instead, they allow data to be captured in different environments, which is useful in applications such as autonomous vehicles. While photo-realistic environments are not possible without dedicated designers and significant rendering time, more basic environments can be generated using city layout generation algorithms combined with pre-textured buildings, allowing the creation of grid-like city environments. The effect of photorealism on performance is substantial, but the biggest advantage of virtual synthesised environments lies in the automatic labelling of objects and complete control over environment variables [33].
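As a toy illustration of the grid-layout idea (not any published algorithm), the sketch below lays a city out as a grid of blocks and randomly assigns each block a pre-textured building model or an open lot; an engine would then instantiate the models at the returned positions. All model names and probabilities are hypothetical.

```python
import random

def grid_city_layout(blocks_x, blocks_y, block_size=50.0):
    """Generate a grid-like city: streets between blocks, and a randomly
    chosen pre-textured building (or empty lot) on each block."""
    buildings = []
    for i in range(blocks_x):
        for j in range(blocks_y):
            if random.random() < 0.8:        # leave ~20% of blocks open
                buildings.append({
                    "model": random.choice(
                        ["office_a", "office_b", "shop", "apartment"]),
                    "position": (i * block_size, j * block_size),
                    "rotation": random.choice([0, 90, 180, 270]),
                })
    return buildings

layout = grid_city_layout(10, 10)
```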
While virtual synthetic data does not directly contain any real data, this does not mean that it cannot reference or replicate real data. The Virtual KITTI dataset [29] is a fully synthetic recreation of a subset of the KITTI dataset. The goal of creating this virtual copy was to provide evidence that models trained on real data perform similarly in virtual environments, and that pre-training on synthetic data should provide performance improvements after fine-tuning.