Gait Recognition: History

Gait recognition aims to identify a person from their unique walking pattern. Compared with silhouettes and skeletons, skinned multi-person linear (SMPL) models provide both human pose and shape information and are robust to viewpoint and clothing variations.

  • gait recognition
  • skinned multi-person linear (SMPL)

1. Introduction

Gait describes the walking pattern of a person. Unlike other biometrics, e.g., face, fingerprint, or iris, gait can be observed at a distance without the cooperation of the target. Thus, it is well suited to criminal investigation and social security management. Gait recognition aims to identify the target person by learning their unique walking pattern from video sequences or images. Methods can be categorized into two types by input: appearance-based and model-based. The advantages and disadvantages of different gait modalities are shown in Table 1. The mainstream appearance-based methods [1,2,3,4,5] take silhouettes as input and achieve impressive performance in in-the-lab scenarios [6,7]. However, being sensitive to clothing and viewpoint variations, these methods tend to lose their advantages in uncontrolled environments. Model-based methods [8,9,10,11,12] use articulated human body representations (e.g., skeletons and SMPL models) as inputs. These methods are robust to carrying status and clothing variations because they focus on human body structure and movements. Among them, skeletons are sparse representations that carry no human shape information, which is a critical characteristic for identification. Therefore, a viewpoint-robust and dense representation is needed for accurate gait recognition.
Table 1. Advantages and disadvantages of different gait modalities. The representations are taken from the same person at the same timestamp in the Gait3D dataset.
(+) marks an advantage, (-) a disadvantage.
Type | Silhouette | Skeleton | SMPL
Example | [image] | [image] | [image]
Cloth | Sensitive (-) | Sensitive (-) | Robust (+)
Viewpoint | Sensitive (-) | Robust (+) | Robust (+)
Shape | Easy (+) | Hard (-) | Easy (+)
Space | 2D (-) | 2D/3D (+) | 3D (+)

The skinned multi-person linear (SMPL) model [13] makes it possible to overcome the above limitations. The SMPL model parameterizes the human mesh with 3D joint angles and a low-dimensional linear shape space, and can implicitly provide dense 3D human mesh information. Hence, it is robust to viewpoint and clothing interference, making it a more suitable representation for gait recognition. Zheng et al. introduce SMPLGait [11], which uses a multi-layer perceptron (MLP) network to extract SMPL features and then aggregates them with silhouette features for gait recognition. They show that incorporating the SMPL modality can improve the accuracy of gait recognition. However, the effectiveness of using SMPL alone, without the additional input of silhouettes, has not yet been demonstrated. The dense shape information provided by silhouettes may impede the network’s ability to extract shape information from SMPLs.
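The idea of extracting SMPL features with an MLP and combining them with silhouette features can be sketched roughly as follows. This is a minimal illustration, not the SMPLGait implementation: the 82-dimensional SMPL vector (72 pose plus 10 shape values), the layer sizes, the random stand-in silhouette features, and the fusion by concatenation followed by max pooling over frames are all assumptions made for the example.

```python
import numpy as np

def mlp(x, weights, biases):
    """Apply a small MLP with ReLU between layers (no activation on the last)."""
    for i, (w, b) in enumerate(zip(weights, biases)):
        x = x @ w + b
        if i < len(weights) - 1:
            x = np.maximum(x, 0.0)
    return x

rng = np.random.default_rng(0)
NUM_FRAMES = 30
SMPL_DIM = 82   # assumption: 72 pose values + 10 shape values per frame
SIL_DIM = 128   # assumption: size of a per-frame silhouette feature

# Stand-ins for real inputs: per-frame SMPL parameters and CNN silhouette features.
smpl_params = rng.standard_normal((NUM_FRAMES, SMPL_DIM))
sil_features = rng.standard_normal((NUM_FRAMES, SIL_DIM))

# Hypothetical layer sizes with randomly initialised (untrained) weights.
dims = [SMPL_DIM, 128, 64]
weights = [rng.standard_normal((i, o)) * 0.01 for i, o in zip(dims[:-1], dims[1:])]
biases = [np.zeros(o) for o in dims[1:]]

smpl_features = mlp(smpl_params, weights, biases)              # (30, 64)
fused = np.concatenate([sil_features, smpl_features], axis=1)  # (30, 192)
sequence_embedding = fused.max(axis=0)                         # (192,)
```

Dropping `sil_features` from the concatenation gives the SMPL-only setting whose effectiveness the text describes as not yet demonstrated.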

2. Related Work

Gait recognition methods fall into two main categories: appearance-based and model-based. The former typically take silhouette sequences as input, while the latter use human body models, including skeletons and meshes.

2.1. Silhouette-Based Gait Recognition

Silhouette-based methods rely on silhouettes obtained through background subtraction from videos. Early approaches [14,15,16,17] use gait energy images (GEIs) as a compressed representation of gait silhouette sequences. Recently, deep CNNs have been applied to learn gait representations [2,4,5,18] and have demonstrated promising performance. For instance, GaitSet [2] proposes a set-pooling technique that regards a sequence of silhouettes as a set, thereby reducing the impact of unnecessary sequence order information. Lin et al. propose GaitGL [5] to exploit global and local features from frames. GLN [4] merges silhouette-level and set-level features in a top-down manner. Yuki H et al. [19] leverage an encoder-decoder structure to deform gait silhouette images from videos. Sheth A et al. [20] use an eight-layer convolutional neural network to identify human gait. Dou H et al. [21] design a framework based on counterfactual intervention learning to focus on the regions that reflect effective walking patterns. Ma K et al. [22] propose DANet to capture both global and local gait motion patterns. Although silhouettes can provide informative appearance features, they may lose some information regarding the motion patterns and body structures of humans. Moreover, as 2D representations, they are susceptible to clothing and viewpoint variations, especially in the wild.
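Two of the ideas above are simple enough to sketch in a few lines. A GEI is commonly computed as the pixel-wise average of aligned binary silhouettes, and a set-style pooling can be any statistic that ignores frame order. The element-wise maximum is used here as one such statistic; it is an illustrative choice rather than a claim about the exact GaitSet design. The silhouette size and the random inputs are placeholders.

```python
import numpy as np

def gait_energy_image(silhouettes: np.ndarray) -> np.ndarray:
    """Average a (T, H, W) stack of binary silhouettes into one (H, W) GEI."""
    return silhouettes.astype(np.float32).mean(axis=0)

def set_pool_max(frame_features: np.ndarray) -> np.ndarray:
    """Order-independent pooling: element-wise max over the frame axis of (T, C)."""
    return frame_features.max(axis=0)

rng = np.random.default_rng(0)

# Placeholder data: 30 random binary "silhouettes" of size 64 x 44.
silhouettes = (rng.random((30, 64, 44)) > 0.5).astype(np.uint8)
gei = gait_energy_image(silhouettes)

# Placeholder per-frame features; shuffling the frames must not change the result.
frame_features = rng.standard_normal((30, 128))
shuffled = frame_features[rng.permutation(30)]
pooled = set_pool_max(frame_features)
pooled_shuffled = set_pool_max(shuffled)
```

Because the maximum is taken over the frame axis, `pooled` and `pooled_shuffled` are identical, which is the sense in which a set representation discards sequence order.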

2.2. Skeleton-Based Gait Recognition

Skeletons offer robustness against variations in clothing and viewpoint in gait recognition. Recent advances in human pose estimation have reached high accuracy, which has made skeleton-based approaches increasingly popular [8,9,10]. Liao et al. propose the pose-based temporal-spatial net (PTSN) [23], which leverages pose keypoints for gait recognition. PTSN incorporates a CNN to extract spatial features and an LSTM to extract temporal features. They further generate handcrafted features from the skeleton keypoints, including joint angles, bone lengths, and joint motion, and then learn high-level features using a CNN [8]. Teepe et al. [9] model the human skeleton as a graph and use graph convolutional networks for gait recognition. They also combine higher-order inputs with residual networks [10]. Liu et al. [24] design a symmetry-driven hyper feature graph convolutional network to automatically learn multiple dynamic patterns and hierarchical semantic features. PoseMapGait [25] exploits pose estimation maps to preserve rich clues of the human body and enhance robustness. Jun et al. [26] combine a graph convolutional network, a recurrent neural network, and an artificial neural network to encode skeleton sequences, joint angle sequences, and gait parameters. Han et al. [27] propose a discontinuous frame screening module at the front end of the feature extractor to filter rich information. However, it is challenging to capture global appearance descriptions of a human using skeleton-based approaches.
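The handcrafted features mentioned above (bone lengths, joint angles, joint motion) and a single graph-convolution step over the skeleton can be illustrated as follows. This is a toy sketch: the five-joint lower-body skeleton, the `BONES` list, the 2D coordinates, and the random weight matrix are hypothetical, and the layer uses the common symmetrically normalized adjacency rather than the specific design of any cited method.

```python
import numpy as np

# Hypothetical toy skeleton: 0 pelvis, 1 left knee, 2 left ankle, 3 right knee, 4 right ankle.
BONES = [(0, 1), (1, 2), (0, 3), (3, 4)]

def bone_lengths(joints):
    """joints: (T, J, D) -> (T, num_bones) Euclidean bone lengths per frame."""
    parents = [p for p, _ in BONES]
    children = [c for _, c in BONES]
    return np.linalg.norm(joints[:, children] - joints[:, parents], axis=-1)

def joint_angle(joints, a, b, c):
    """Angle in radians at joint b between segments b->a and b->c, per frame."""
    u = joints[:, a] - joints[:, b]
    v = joints[:, c] - joints[:, b]
    cos = (u * v).sum(-1) / (np.linalg.norm(u, axis=-1) * np.linalg.norm(v, axis=-1) + 1e-8)
    return np.arccos(np.clip(cos, -1.0, 1.0))

def joint_motion(joints):
    """Frame-to-frame joint displacement: (T, J, D) -> (T - 1, J, D)."""
    return np.diff(joints, axis=0)

def normalized_adjacency(num_joints):
    """Adjacency with self-loops, scaled as D^(-1/2) A D^(-1/2)."""
    adj = np.eye(num_joints)
    for p, c in BONES:
        adj[p, c] = adj[c, p] = 1.0
    deg = adj.sum(axis=1)
    return adj / np.sqrt(np.outer(deg, deg))

def graph_conv(x, adj_hat, weight):
    """One graph-convolution step: (J, C_in) -> (J, C_out) with ReLU."""
    return np.maximum(adj_hat @ x @ weight, 0.0)

# Two frames: a pose with the left knee bent at 90 degrees, then shifted by 0.1 in x.
frame0 = np.array([[0.0, 0.0], [0.0, -1.0], [1.0, -1.0], [0.6, -0.8], [0.6, -1.8]])
seq = np.stack([frame0, frame0 + [0.1, 0.0]])

lengths = bone_lengths(seq)                # (2, 4)
left_knee = joint_angle(seq, 0, 1, 2)      # (2,)
motion = joint_motion(seq)                 # (1, 5, 2)

adj_hat = normalized_adjacency(5)
weight = np.random.default_rng(0).standard_normal((2, 8))
node_features = graph_conv(seq[0], adj_hat, weight)  # (5, 8)
```

The graph step mixes each joint's features with those of its neighbours along the bones, which is how a graph-based model can exploit the articulated structure that a plain per-joint MLP ignores.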

2.3. SMPL-Based Gait Recognition

The SMPL model is a compelling modality because it overcomes the limitations of the two modalities discussed above. On the one hand, SMPLs record the keypoints of skeletons, which allows a focus on the motion pattern of human gait. On the other hand, SMPLs include human shape information, which is crucial in distinguishing between individuals. Furthermore, the human shape information in SMPLs is low-dimensional, making it less sensitive to variations in human appearance. Li et al. [28] propose an end-to-end method for gait recognition through human mesh recovery (HMR), which is the first SMPL-based gait recognition method, and further exploit multi-view constraints to extract more consistent pose sequences [12]. However, they do not focus on human gait priors and do not demonstrate real-world performance. Zheng et al. introduce SMPLGait [11], which is based on accurate SMPL estimations. They use an MLP network to extract SMPL features and then aggregate them with silhouette features for gait recognition. However, these methods do not fully utilize the articulated characteristics of SMPLs and fail to capture the detailed relationships among joints.

2.4. 3D Human Reconstruction

The 3D human body can be represented in various ways, such as template parameters, meshes, voxels, UV position maps, and probabilistic outputs [29]. Currently, template parameters are the most widely used representation in the research community. A typical example is the SMPL model [13], which is a vertex-based parametric model. SMPL factors body deformation into shape and pose parameters. The shape parameters are coefficients in a low-dimensional shape space obtained by principal component analysis (PCA), which helps to prevent the gait recognition network from getting bogged down in silhouette details. The SMPL model depicts minimally clothed humans, allowing human body pose and appearance to be recovered to a great extent and making it a favorable modality for gait recognition. Additionally, the SMPL family includes other models such as SMPL-X [30] and SMPL-H [31], which extend SMPL with detailed hand poses and, in the case of SMPL-X, facial expressions.
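To make the notion of a PCA shape space concrete, the sketch below builds a toy linear shape model from synthetic meshes: a mean template plus a few principal directions, so that a body shape is summarized by a short coefficient vector. The data are random placeholders and the sizes are assumptions (the real SMPL mesh has 6890 vertices and is commonly used with 10 shape coefficients); pose-dependent deformation and skinning are omitted entirely.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_VERTS = 50    # toy size; the real SMPL mesh has 6890 vertices
NUM_BETAS = 10    # number of shape coefficients kept

# Synthetic "training meshes", each flattened to a vector of length NUM_VERTS * 3.
meshes = rng.standard_normal((200, NUM_VERTS * 3))
template = meshes.mean(axis=0)

# PCA via SVD of the centered data; rows of vt are orthonormal principal directions.
_, _, vt = np.linalg.svd(meshes - template, full_matrices=False)
shape_dirs = vt[:NUM_BETAS]                      # (NUM_BETAS, NUM_VERTS * 3)

def encode_shape(mesh_flat):
    """Project a flattened mesh onto the shape space to get its coefficients."""
    return shape_dirs @ (mesh_flat - template)

def decode_shape(betas):
    """Rebuild vertices as template + linear combination of shape directions."""
    return (template + betas @ shape_dirs).reshape(NUM_VERTS, 3)

betas = rng.standard_normal(NUM_BETAS)
vertices = decode_shape(betas)                   # (NUM_VERTS, 3)
recovered = encode_shape(vertices.ravel())       # approximately equal to betas
```

The coefficient vector is far smaller than the mesh it describes, which is the property the text credits with keeping a recognition network from fitting fine silhouette details.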

This entry is adapted from the peer-reviewed paper 10.3390/s23208627
