Visual Tracking Related to Age or Gender Information

Visual tracking of multiple targets, also referred to as multiple object tracking (MOT) since the target can be any moving object or entity, is a well-investigated computer vision task. The goal is to detect one or more targets in a time-varying scene and then obtain their trajectories, in the form of tracklets, over a given video sequence. This is accomplished by associating newly detected instances with existing ones. Typically, the association step is framed as a prediction task whose aim is to find the most probable correspondence between detections of consecutive frames for a given target. When the targets of interest are people, the detections produced by this procedure are usually post-processed to extract useful information related, for instance, to their age or gender. 

  • multi-object tracking
  • age estimation
  • gender classification
  • multi-attribute classification
  • images

1. Multi-Object Tracking

A first taxonomy of MOT methods concerns the way a system processes video sequences, namely, online or offline. Online methods operate on a frame-by-frame basis and thus perform tracking using only information up to the current frame, a design principle that makes them suitable for real-time applications. In contrast, offline methods have access to the entire video sequence, including future frames, as they process videos in batches. While the latter can handle the data association problem more effectively by exploiting future information, their applicability is limited in scenarios with real-time and scalability requirements.
A different way to distinguish methods is the strategy they follow to handle the different aspects of the MOT task. Most approaches to the problem of multi-object tracking (MOT) generally follow the tracking-by-detection framework, which formulates tracking as a two-stage workflow: a detection step that localizes targets in an image, followed by a data association step, where the goal is to match detections with existing trajectories, create new ones when a new target appears in the scene, or discard old ones when a target is no longer visible. The second step constitutes the actual tracking part and typically consists of several subtasks, such as motion forecasting, embedding extraction (image representation), and data association, among others.
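As a rough illustration of this two-stage workflow, the sketch below runs a per-frame loop in which a detector proposes boxes and an association routine links them to trajectories. The detect and associate callables are hypothetical placeholders for the components discussed above, not any particular published tracker.

```python
from dataclasses import dataclass, field

@dataclass
class Track:
    track_id: int
    boxes: list = field(default_factory=list)  # one (x, y, w, h) box per frame

def track_by_detection(frames, detect, associate):
    """Generic tracking-by-detection loop: localize targets in each frame,
    then extend matched trajectories and open new ones for unmatched boxes.
    Track termination (targets leaving the scene) is omitted for brevity."""
    tracks, next_id = [], 0
    for frame in frames:
        detections = detect(frame)                          # stage 1: detection
        matches, unmatched = associate(tracks, detections)  # stage 2: association
        for track, box in matches:
            track.boxes.append(box)               # continue an existing trajectory
        for box in unmatched:
            tracks.append(Track(next_id, [box]))  # a new target entered the scene
            next_id += 1
    return tracks
```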
Most approaches that follow this strategy initially utilized two separate models, usually deep-learning-based architectures, owing to their success in both detection and feature extraction. A popular choice is convolutional neural networks (CNNs) [32,33,34], although, more recently, graph neural networks [35,36] and transformers [37,38,39] have also been used. The descriptive capabilities of deep networks have enabled these methods to achieve remarkable results by continuously improving one of the two models, as both contribute to the final tracking performance. However, using two computationally intensive models entails some drawbacks. Most importantly, the overhead of running both models prohibits their application in real-time scenarios because of slow running speeds. In addition, two resource-intensive networks require two separate training processes and perform a significant amount of largely overlapping computations that could be avoided.
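In this two-model design, the second network typically produces an appearance embedding per detection, and association is often scored by cosine similarity between track and detection embeddings. The minimal sketch below assumes the embeddings have already been extracted by some re-identification network; the function itself is generic rather than taken from any cited work.

```python
import numpy as np

def cosine_similarity_matrix(track_embs, det_embs):
    """Pairwise cosine similarity between L2-normalized appearance embeddings;
    entry (i, j) scores how well detection j matches track i."""
    t = track_embs / np.linalg.norm(track_embs, axis=1, keepdims=True)
    d = det_embs / np.linalg.norm(det_embs, axis=1, keepdims=True)
    return t @ d.T
```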
To tackle some of these shortcomings, an approach has emerged that utilizes a single model to perform both detection and tracking, avoiding some of the aforementioned issues. Such methods jointly train a unified network to handle both tasks and are known as joint-detection-and-tracking methods [9,40,41,42]. Apart from applications in MOT, this design has also been applied to human pose estimation [43]. As in the previous strategy, CNNs remain the most prevalent models, thanks to research advances over the last few years that enable them to handle both steps while steadily improving accuracy and running speed.
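One common way to realize joint detection and tracking, loosely in the spirit of [9,40,41,42] but not reproducing any specific architecture, is to let one shared backbone feed both a detection head and an identity-embedding head, so the two tasks share computation and are trained together. A minimal PyTorch sketch:

```python
import torch.nn as nn

class JointDetTrackHead(nn.Module):
    """One shared feature map feeds both a detection head (box + score per
    location) and an identity-embedding head used for association, so the
    two tasks share computation and are trained jointly."""
    def __init__(self, in_channels=64, emb_dim=128):
        super().__init__()
        self.det_head = nn.Conv2d(in_channels, 5, kernel_size=1)        # (x, y, w, h, score)
        self.emb_head = nn.Conv2d(in_channels, emb_dim, kernel_size=1)  # per-location embedding

    def forward(self, feature_map):
        return self.det_head(feature_map), self.emb_head(feature_map)
```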
More recently, target association methods that rely solely on detector outputs to associate all detected bounding boxes have been proposed [44,45]. These methods match new detections with existing tracklets without an embedding extraction step, which was typically handled by a deep learning network (e.g., [33]). Consequently, using a single computationally intensive module in the overall pipeline reduces the system’s required resources and latency, rendering such methods accurate trackers with real-time capabilities, depending on detector performance. This design also benefits from a simplified training procedure: since the detector is the only trainable component, no second dataset is needed for an embedding network, reducing the amount of training data and enabling faster deployment.
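Detector-only association of this kind typically builds a cost matrix from bounding-box overlap and solves the matching with the Hungarian algorithm. The sketch below is a generic IoU-based matcher, not the exact procedure of [44,45]; the 0.3 threshold is an illustrative choice.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / (union + 1e-9)

def associate_by_iou(track_boxes, det_boxes, min_iou=0.3):
    """Match (e.g., motion-predicted) track boxes to detections so that
    total IoU is maximized; pairs below min_iou are rejected."""
    cost = np.array([[1.0 - iou(t, d) for d in det_boxes] for t in track_boxes])
    rows, cols = linear_sum_assignment(cost)  # Hungarian algorithm
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= 1.0 - min_iou]
```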

2. Age Estimation

The task of human age estimation has been well studied for decades. Age estimation techniques are often based on shape and texture cues from faces, followed by traditional classification or regression methods. Earlier approaches utilized classic computer vision methods for feature extraction, such as Gabor filters [47,48], histograms of oriented gradients (HoG) [49], or local binary patterns (LBP) [50,51].
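A classic pipeline in this spirit pairs one of these hand-crafted descriptors with a regressor. The sketch below, which assumes fixed-size grayscale face crops, combines HoG features with a support vector regressor; it illustrates the general recipe rather than any cited method.

```python
from skimage.feature import hog
from sklearn.svm import SVR

def train_hog_age_regressor(face_crops, ages):
    """Fit a support vector regressor on HoG descriptors.
    face_crops: equally sized grayscale arrays, so descriptors align."""
    features = [hog(img, orientations=9,
                    pixels_per_cell=(8, 8), cells_per_block=(2, 2))
                for img in face_crops]
    return SVR(kernel="rbf").fit(features, ages)
```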
Currently, with advances in machine learning research as well as hardware capabilities, the predominant approach is to apply deep learning methods to feature extraction. CNNs have been widely adopted as capable feature extractors that obtain powerful representations of the input data. For instance, the works presented in [23,52,53,54] utilized convolution-based networks and structures, whereas Pei and co-workers [55] proposed an end-to-end architecture that combines CNNs with recurrent neural networks (RNNs). Duan et al. [56] combined a CNN with an extreme learning machine (ELM) [57], a feed-forward neural network that achieves very fast training speeds and can outperform SVMs in many applications, while the authors of [58,59,60] explored more compact, low-resource convolutional models. Other learning methods, such as auto-encoders [23] and random forests [61], have also been adopted.
Most of these works make use of face images due to the fact that they provide more descriptive information about age ranges, since as people get older, certain common changes in facial characteristics can be observed, leading to better representations and higher accuracy of age estimation. Additionally, the majority of available corpora in the literature comprise media that depict faces exclusively, or at least contain face images, which are utilized after a detection and cropping step, discarding any other information.
Using images of the full body for this task has largely remained unexplored, partly because of the challenges in associating visual information from the body with apparent age, but also due to the lack of large publicly available datasets. Consequently, very few works estimate only the age of a person from whole-body images; earlier approaches include [51,62,63], which used hand-crafted features. More recently, CNNs have been applied to the problem [53], obtaining accurate results and demonstrating that full-body images provide adequate visual information for this task.
A subcategory of this problem is apparent age estimation, in which the actual age of a person is not known beforehand; instead, labels are based on the subjective estimations of the annotator(s). In these methods, evaluation is performed against such apparent ground-truth data [64]. Given the nature of real-world data, apparent age estimation is a well-suited subclass for real-time applications where the visual perception of age plays an important role. 
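Because apparent-age labels come from multiple annotators, evaluation commonly scores a prediction against the annotators' mean and standard deviation with a Gaussian-shaped error, as popularized by apparent-age challenges: a prediction at the mean scores 0, and the penalty approaches 1 as the prediction leaves the annotators' agreement range. A minimal sketch:

```python
import numpy as np

def apparent_age_error(pred, label_mean, label_std):
    """Gaussian error against crowd-sourced labels: 0 when the prediction
    equals the annotators' mean age, approaching 1 far outside their spread."""
    return 1.0 - np.exp(-((pred - label_mean) ** 2) / (2.0 * label_std ** 2))
```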

3. Gender Classification

The task of classifying the gender of people who appear in images is similar in nature to that of estimating their age. Over the last decades, a few works that focus solely on this task have been proposed. Conventional methods rely on shallow, hand-crafted features, such as histograms of oriented gradients [65,66] or local binary patterns [67,68,69], combined with support vector machines for classification [30,70]; they remain popular and widely used.
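A representative conventional pipeline, sketched below under the assumption of fixed-size grayscale face crops, summarizes uniform LBP codes as a histogram and feeds it to a linear SVM; this mirrors the general recipe of the works above without reproducing any of them.

```python
import numpy as np
from skimage.feature import local_binary_pattern
from sklearn.svm import SVC

def lbp_histogram(img, points=8, radius=1):
    """Uniform LBP codes summarized as a normalized histogram
    (uniform LBP with P sampling points yields P + 2 code values)."""
    codes = local_binary_pattern(img, points, radius, method="uniform")
    hist, _ = np.histogram(codes, bins=points + 2, range=(0, points + 2))
    return hist / (hist.sum() + 1e-9)

def train_gender_classifier(face_crops, genders):
    """Fit a linear SVM on LBP histograms of fixed-size grayscale crops."""
    features = [lbp_histogram(img) for img in face_crops]
    return SVC(kernel="linear").fit(features, genders)
```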
As with most image processing and computer vision problems, CNNs have also been adopted for gender classification, usually to obtain robust representations [71]. For example, Aslam and colleagues [72] proposed wavelet-based convolutional neural networks for gender classification, while isolated facial features and foggy faces were used as CNN inputs in [73,74]. The authors of [75] provided a comparison of traditional and deep-learned features for gender recognition from faces in the wild, and those of [76] explored several popular convolutional architectures from other tasks for identifying the gender of humans wearing masks.
Since face images contain more gender-relevant information than full-body images, they lead to better accuracy; therefore, most methods proposed for this task utilize datasets of face images. This preference is an additional factor contributing to the lack of publicly available full-body datasets. In contrast to age estimation, using full-body images for this task has received some attention [30,65,66], but it still remains an open area of research. Some methods deviate from the standard approach of two-dimensional images and apply three-dimensional data for gender recognition to alleviate some difficulties present in 2D data [77,78].
A different avenue of research for this problem is the combination of different modalities, taking advantage of features from different sources to boost performance. More specifically, multi-modal data of the body, such as depth [79] or thermal images [31,80,81], have been explored as auxiliary inputs to classification systems, improving performance and helping to overcome challenges that arise when only RGB images of the body are available.
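One simple fusion scheme, shown below purely as an illustration rather than as the approach of the cited works, is early (feature-level) fusion: normalize each modality's embedding and concatenate them before classification, so that the auxiliary modality complements RGB rather than being drowned out by it.

```python
import numpy as np

def fuse_modalities(rgb_feat, thermal_feat):
    """Early (feature-level) fusion: L2-normalize each modality's embedding
    and concatenate, yielding one joint vector for a downstream classifier."""
    r = rgb_feat / (np.linalg.norm(rgb_feat) + 1e-9)
    t = thermal_feat / (np.linalg.norm(thermal_feat) + 1e-9)
    return np.concatenate([r, t])
```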
Apart from the aforementioned approaches, a few works have focused on gait [82,83,84] as an indicator of gender. Gait-based methods rely on information derived from a person’s gait, i.e., the change of pose across consecutive frames. The typical assumption is a controlled environment where multiple views of the subjects are available so that the change in pose can be determined [85]. This limitation prevents gait-based methods from being employed for practical consumer demographic estimation.

4. Related Age and Gender Multi-Attribute Classification Methods

Both the age and the gender of a person can be estimated from face images with high accuracy, and several works have therefore been published that attempt to solve both tasks. Due to the challenges of using body images discussed previously, as well as dataset availability, these works favor facial images. Some of the earliest methods can be found in [86,87], where classic image processing techniques are employed to extract information based on wrinkle textures and color. More recently, Eidinger et al. [88] proposed an SVM-based approach for age and gender classification from face images in the wild.
With advances in deep learning, various CNNs have been adopted for predicting age along with gender, replacing older methods; they typically serve as feature extractors within larger systems or as end-to-end models that also handle classification. For example, the works presented in [89,90,91,92,93,94] used only convolution-based architectures to tackle both problems with face images as inputs, whereas Uricár and co-workers [95] combined a CNN feature extractor with an SVM classifier. In a similar fashion, Duan et al. [96] developed a hybrid technique that uses CNNs for feature extraction, while classification is handled by an extreme learning machine (ELM) for faster training and more accurate predictions. Another hybrid method, which fuses the decisions of non-convolutional neural networks and CNNs for a final prediction, is presented in [97]. Lately, owing to their success in various tasks, vision transformers have also been explored for age and gender classification [98].
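End-to-end models of this kind typically share a convolutional backbone between an age head and a gender head and minimize the sum of the two losses. The PyTorch sketch below is a deliberately tiny illustration of that structure, not a reproduction of any cited architecture; the eight age bins are an arbitrary example.

```python
import torch.nn as nn

class AgeGenderNet(nn.Module):
    """Shared convolutional backbone with two task-specific heads; joint
    training minimizes the sum of both cross-entropy losses."""
    def __init__(self, num_age_bins=8):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.age_head = nn.Linear(64, num_age_bins)  # age group logits
        self.gender_head = nn.Linear(64, 2)          # gender logits

    def forward(self, x):
        feat = self.backbone(x)
        return self.age_head(feat), self.gender_head(feat)
```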
Using full-body images is a much rarer approach; in this case, most works that classify age as well as gender do so as part of a multi-attribute classification problem, where the goal is to predict a larger set of attributes. Analogous to gender-only estimation, gait-based methods have also been developed for the combined task, featuring multiple views of a person’s entire body [99], operating on a single image in real time [100], or employing data from wearable sensors [101]. However, such approaches often assume a controlled monitoring environment for the subjects of interest and are not readily applicable to real-time consumer tracking.

This entry is adapted from the peer-reviewed paper https://doi.org/10.3390/s23239510
