2. Deep Neural Networks
2.1. Face Recognition
In general, a face recognition system operates in several phases. The first phase consists of acquiring the facial images and pre-processing them, for example by locating and cropping the faces. In the second phase, features are extracted from the facial image, for instance the position of facial landmarks, the distance between the eyes or even the skin tone. Finally, these features are fed to a classifier for identification or verification purposes.
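The final phase can be illustrated with a minimal sketch, assuming that feature extraction has already produced fixed-length vectors and that cosine similarity is the matching score. The function names, the threshold of 0.8 and the three-dimensional vectors are hypothetical choices for illustration, not part of any system cited here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def verify(probe, reference, threshold=0.8):
    """Verification (1:1): does the probe match the claimed identity?"""
    # The threshold is an assumed value; real systems tune it on data.
    return cosine_similarity(probe, reference) >= threshold


def identify(probe, gallery):
    """Identification (1:N): return the gallery identity closest to the probe."""
    return max(gallery, key=lambda name: cosine_similarity(probe, gallery[name]))


# Hypothetical feature vectors (real extractors output far more dimensions).
gallery = {"alice": [0.9, 0.1, 0.2], "bob": [0.1, 0.8, 0.5]}
probe = [0.85, 0.15, 0.25]

best_match = identify(probe, gallery)
is_alice = verify(probe, gallery["alice"])
is_bob = verify(probe, gallery["bob"])
```

Verification answers a yes/no question about one claimed identity, whereas identification searches the whole gallery, which is why the two functions take different arguments.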
Face recognition can be performed in a controlled or uncontrolled environment. The controlled environment, also known as consent recognition, is one in which the user cooperates in the recognition by holding a correct, static posture in a well-lit place. In the uncontrolled environment, recognition is dynamic and the user does not cooperate in acquiring the image, which makes face recognition very difficult because of the diversity of the surrounding environment (e.g., low visibility), facial poses and expressions.
2.2. Multispectral Imaging in an Uncontrolled Environment
Databases in the visible (VIS) domain, together with image synthesizers that generate multiple poses and facial expressions from the acquired images, have made it possible to circumvent the difficulties associated with the variety of poses and facial expressions. However, two problems have proved more difficult to overcome: changes in illumination and occlusions. This has led to the use of multiple spectral bands, with particular emphasis on the infrared (IR) spectral band, which can acquire images in environments with little or no light and can see through occlusions such as smoke and fog. In short, multispectral analysis allows a face recognition system to extract facial features that would be impossible to obtain from VIS images alone.
The IR spectrum can be divided into several spectral bands [7]. The active bands are the near-infrared (NIR) and short-wavelength infrared (SWIR). To acquire images in these bands, the object must receive illumination, even if scarce, because the image is formed from reflected radiation. For this reason, these bands are commonly used in night vision devices. The NIR band helps overcome the difficulties posed by variations in illumination, while the SWIR band has the advantage of obtaining images through smoke and fog. The passive bands are the mid-wavelength infrared (MWIR) and long-wavelength infrared (LWIR). Unlike the active bands, the passive bands allow images to be acquired using only the thermal radiation emitted by a body; such images are commonly known as thermal images.
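The band grouping described above can be written down as a small lookup table. The sketch below records only what the text states (name and active or passive acquisition); wavelength limits are left out on purpose because conventions differ between sources. The names `IR_BANDS`, `requires_illumination` and `bands_by_mode` are hypothetical.

```python
# Band taxonomy as described in the text: (full name, acquisition mode).
IR_BANDS = {
    "NIR": ("near-infrared", "active"),
    "SWIR": ("short-wavelength infrared", "active"),
    "MWIR": ("mid-wavelength infrared", "passive"),
    "LWIR": ("long-wavelength infrared", "passive"),
}


def requires_illumination(band):
    """Active bands image reflected radiation, so the scene must be lit."""
    return IR_BANDS[band][1] == "active"


def bands_by_mode(mode):
    """List the band abbreviations that share an acquisition mode."""
    return [band for band, (_, band_mode) in IR_BANDS.items() if band_mode == mode]


thermal_bands = bands_by_mode("passive")
```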
The use of IR images for automatic face recognition is not without challenges. These images are sensitive to the emotional, physical and health conditions of the individual, as well as to the surroundings, so they serve as a complement to the VIS spectrum rather than as an absolute alternative [8]. Another difficulty arises from the small number of public databases with images from both spectral ranges acquired in an uncontrolled environment [9], which limits the creation of rich classification models and the ability to characterize the performance of those systems in realistic conditions.
3. Current Work
Multi-spectral face recognition in an uncontrolled environment can be subdivided into two areas. The first is face recognition in an uncontrolled environment, which is already challenging. The second is multi-spectral face recognition, i.e., using different spectral bands in face recognition.
3.1. Face Recognition in an Uncontrolled Environment
The uncontrolled environment, strongly characterized by variations in pose, lighting and expression, remains a problem for current recognition systems. A significant step towards solving it came from the introduction of very large databases for training Deep Convolutional Neural Networks (DCNN), in combination with the emergence of image synthesis methods [5]. The two main image synthesis methods are: (i) one-to-many augmentation, which generates different poses of a face from a canonical face image; and (ii) many-to-one normalization, which normalizes any pose of the face to a canonical face pose [5]. Generative Adversarial Networks (GAN), introduced by Goodfellow et al. [10], consist of a generator and a discriminator (see Figure 1). The generator is responsible for producing samples from an input image such that the discriminator cannot discern which samples are real and which are fake.
Figure 1. Schematic of the training of a GAN. The dashed line shows the process of sample generation.
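The adversarial game between generator and discriminator can be sketched through the two losses that drive training. This is a minimal illustration and not the training code of any cited network: it assumes the discriminator outputs, for each sample, a probability strictly between 0 and 1 that the sample is real, and it uses the non-saturating generator loss, a common practical variant of the original objective. The function names are hypothetical.

```python
import math


def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy: push D(real) towards 1 and D(fake) towards 0."""
    real_term = -sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


def generator_loss(d_fake):
    """Non-saturating generator loss: push D(fake) towards 1."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)


# Discriminator probabilities for a batch of real and generated samples.
confident_d = discriminator_loss(d_real=[0.9, 0.8], d_fake=[0.2, 0.1])
fooled_d = discriminator_loss(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])

# The generator is rewarded when its samples are scored as real.
weak_generator = generator_loss([0.2, 0.1])
strong_generator = generator_loss([0.8, 0.9])
```

When the discriminator can no longer tell the samples apart it outputs 0.5 everywhere, and its loss settles at 2 ln 2; the generator loss falls as its samples are scored closer to 1.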
Since their first use in face normalization with DR-GAN [11], GANs have taken the lead in addressing pose and facial expression variation. In one-to-many augmentation, GAN-based approaches such as the DA-GAN network [12] also have an advantage over other algorithms thanks to their image generation capability.
Many-to-one normalization is an extreme image synthesis problem because of the pose differences of a face. Cao et al. [13] proposed HF-PIM, which normalizes the face to a frontal pose through a texture fusion deformation procedure that leverages a dense matching field to interconnect the 2D and 3D surface spaces. Qian et al. [14] presented the Face Normalization Module (FNM), which encodes images using a pre-trained network for feature extraction and generates realistic images.
One-to-many augmentation is another approach to achieve face recognition regardless of the pose. Tran et al. [15] synthesized different poses through 3D modeling and then trained a DCNN to perform face recognition with varied poses. The DA-GAN proposed by Zhao et al. [12] created 2D images through 3D modeling and then refined the obtained 2D images to be as realistic as possible, using a GAN to try to preserve the identity of the face. Thus, the DA-GAN network was also used to augment the training data.
3.2. Multispectral Face Recognition
The main multi-spectral face recognition methods can be characterized by three important features: Image Synthesis Methods, Fusion Methods and Loss Functions.
Fusion methods are subdivided into feature fusion and score fusion. In feature fusion, features from the different spectral bands of the facial image are merged: the most relevant features are extracted from each band and joined in a single vector. Score fusion combines the scores obtained from each single-band classifier (e.g., a classifier operating only in the LWIR band and another operating only in the NIR band) [16].
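The difference between the two fusion levels can be made concrete with a minimal sketch. It assumes the simplest rule for each: concatenation for feature fusion and a weighted average for score fusion. Published methods may select or weight features differently, and the function names, weights and values below are hypothetical.

```python
def feature_fusion(features_by_band):
    """Feature-level fusion: join per-band feature vectors into one vector."""
    fused = []
    for band in sorted(features_by_band):  # fixed order keeps the layout stable
        fused.extend(features_by_band[band])
    return fused


def score_fusion(scores_by_band, weights=None):
    """Score-level fusion: weighted average of single-band classifier scores."""
    if weights is None:
        weights = {band: 1.0 for band in scores_by_band}
    total_weight = sum(weights[band] for band in scores_by_band)
    weighted = sum(weights[band] * scores_by_band[band] for band in scores_by_band)
    return weighted / total_weight


# Feature fusion happens before classification: one classifier sees all bands.
features = {"NIR": [0.2, 0.7], "LWIR": [0.9, 0.1, 0.4]}
fused_vector = feature_fusion(features)

# Score fusion happens after classification: each band has its own classifier.
scores = {"NIR": 0.8, "LWIR": 0.6}
fused_score = score_fusion(scores)
nir_heavy_score = score_fusion(scores, weights={"NIR": 3.0, "LWIR": 1.0})
```

Feature fusion lets a single classifier exploit correlations between bands, while score fusion keeps the single-band classifiers independent, so one band can be down-weighted when its images are unreliable.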
Image synthesis methods transform an image from one spectral band into another, which makes two images easier to compare. Their main advantage is that an image from any spectral band can be translated to the VIS band, making it possible to reuse classifiers designed for images of the VIS spectrum [17]. One of the most recent works in this area synthesizes VIS images from NIR images using GANs [18].
Finally, all neural networks rely on a cost function during training to update the network weights. However, certain cost functions have been proposed specifically for the classification of multi-spectral images. Examples of these cost functions are the Scatter Loss [19] and the Wasserstein Distance [20].
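To give an intuition for a distribution-level cost such as the Wasserstein distance, the sketch below computes the 1-Wasserstein distance between two one-dimensional samples of equal size, for which a closed form exists: sort both samples and average the absolute differences. This is a textbook special case used for illustration only and does not reproduce the formulation of [20]; the function name and the sample values are hypothetical.

```python
def wasserstein_1d(sample_a, sample_b):
    """1-Wasserstein distance between two equal-sized 1-D empirical samples."""
    if len(sample_a) != len(sample_b):
        raise ValueError("this sketch assumes samples of equal size")
    # In one dimension the optimal transport plan matches sorted values.
    pairs = zip(sorted(sample_a), sorted(sample_b))
    return sum(abs(a - b) for a, b in pairs) / len(sample_a)


# Hypothetical values of one feature extracted from NIR and VIS face images.
nir_feature = [0.1, 0.4, 0.3]
vis_feature = [0.6, 0.8, 0.9]

band_gap = wasserstein_1d(nir_feature, vis_feature)
same_distribution = wasserstein_1d([1.0, 2.0, 3.0], [3.0, 1.0, 2.0])
```

A small distance means the two bands produce similarly distributed features, which is the property a cross-spectral loss of this kind tries to encourage.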
3.3. Gaps
Although several scientific works address multi-spectral face recognition, few of them demonstrate its effectiveness in an uncontrolled environment, owing to the limitations of current multi-spectral face image databases. In existing datasets, the variations in conditions are not extreme, as the images are usually acquired in semi-controlled environments rather than in the wild (uncontrolled environment). For example, the most studied database in multi-spectral face recognition, CASIA NIR-VIS 2.0 [21], uses images in which the pose deviates little from the frontal position, which does not reliably characterize the uncontrolled environment. Thus, the fact that these databases are incomplete (compared to those of the VIS band) remains a barrier to improving the capability of multi-spectral face recognition systems in an uncontrolled environment.