Real-time sensing and modeling of the human body, especially the hands, is an important research endeavor for applications such as natural human–computer interaction. Hand pose estimation poses a major academic and technical challenge due to the complex structure and dexterous movement of human hands. Boosted by advances in both hardware and artificial intelligence, various prototypes of data gloves and computer-vision-based methods have been proposed in recent years for accurate and rapid hand pose estimation. However, existing reviews have focused either on data gloves or on vision methods, or have even been restricted to a particular type of camera, such as the depth camera. The purpose of this survey is to provide a comprehensive and timely review of recent research advances in sensor-based hand pose estimation, covering both wearable and vision-based solutions. Hand kinematic models are discussed first. An in-depth review is then conducted of data gloves and vision-based sensor systems together with their corresponding modeling methods. In particular, this review covers deep-learning-based methods, which are very promising for hand pose estimation. Finally, the advantages and drawbacks of current hand pose estimation methods, their applicative scope, and related challenges are discussed.
With the rapid growth of computer science and related fields, the way humans interact with computers has evolved towards a more natural and ubiquitous form. Various technologies have been developed to capture users' facial expressions, body movements, and postures, serving two types of applications: first, the captured information becomes a "snapshot" of the user that helps computers better understand the user's intentions or emotional state; second, users apply natural movements, instead of dedicated input devices, to issue commands for system control or to interact with digital content in a virtual environment.
Among all body parts, we depend heavily on our hands to manipulate objects and communicate with other people in daily life, since hands are dexterous and effective tools with highly developed sensory and motor structures. The hand is therefore a critical component of natural human–computer interaction, and many efforts have been made to integrate our hands into the interaction loop for more convenient and comfortable interactive experiences, especially in a multimodal context, as in the classic "put-that-there" system [[1]].
We can use hands for human–computer interaction either directly or through predefined gestures. These two modes have formed two different but highly related problems for hand-based interaction: hand gesture recognition and hand pose estimation. Both are challenging for existing sensing technology because the hand is an articulated structure with a high number of degrees of freedom and is capable of delicate, rapid movements. Hand gesture recognition is a pattern recognition problem that maps the hand's appearance and/or motion-related features to a gesture vocabulary set, whereas hand pose estimation can be considered a regression problem that aims to recover the full kinematic structure of the hand in 3D space.
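To make the distinction concrete, the following minimal sketch (in PyTorch, with hypothetical layer sizes and a common 21-joint hand model) contrasts the two formulations: a classification head over a fixed gesture vocabulary versus a regression head that outputs 3D joint coordinates.

```python
import torch
import torch.nn as nn

# Shared CNN feature extractor; a stand-in for any backbone network.
backbone = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)

# Gesture recognition: classification over a fixed gesture vocabulary.
num_gestures = 10                                    # assumed vocabulary size
recognizer = nn.Sequential(backbone, nn.Linear(32, num_gestures))

# Pose estimation: regression of the full kinematic structure,
# here 21 hand joints, each with (x, y, z) coordinates.
num_joints = 21
estimator = nn.Sequential(backbone, nn.Linear(32, num_joints * 3))

image = torch.randn(1, 3, 128, 128)                  # dummy RGB hand crop
gesture_logits = recognizer(image)                   # shape: (1, 10)
joints_3d = estimator(image).view(1, num_joints, 3)  # shape: (1, 21, 3)
```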
Driven by applications like sign language interpretation and gesture-based system control, hand gesture recognition has been studied extensively since early on, and there exist many comprehensive reviews [[2][3][4][5][6]]. Hand gestures, either static or dynamic, can now be recognized reliably if the gesture categories are well defined with proper inter-class distances. Many consumer-level applications, such as the gesture control on Microsoft Hololens [[7]], already provide robust recognition performance. Nevertheless, despite sharing some common points with gesture recognition, accurate pose estimation of all hand joints remains a challenging problem.
With the emergence of low-cost depth sensors such as Microsoft Kinect [[8]] and Intel RealSense [[9]], and the rise of machine learning methods, especially the rapid development of convolutional neural networks, there has been considerable progress in hand pose estimation, and state-of-the-art methods can now achieve good performance in a controlled environment. However, hand pose estimation has received much less attention in the literature than gesture recognition. The goal of this paper is to provide a timely overview of progress in the field of hand pose estimation, including devices and methods proposed in the last few years.
Hand pose estimation solutions can be roughly divided into two categories based on the corresponding sensing hardware: wearable sensors and vision-based sensors. While glove-shaped wearable sensors are mostly self-contained and portable, vision-based sensors are very popular since they are more affordable and allow unconstrained finger movements. Each type of device suits certain circumstances, and both are still under constant development.
The main contributions of this paper are summarized as follows:
1. Existing surveys focus either on glove-based devices [[10][11]] or on vision-based systems [[12][13][14]], since these works were carried out in two distinct research communities, i.e., human–computer interaction and computer vision. We cover both directions to provide a complete overview of the state of the art in hand pose estimation, which can be particularly helpful for practitioners building applications with hand pose estimation technology.
2. With the rise of data-driven machine learning methods, a large number of new solutions have been proposed recently, especially in the last three years. A comprehensive review of current progress is therefore needed to help researchers interested in this field obtain a quick overview of existing solutions and unsolved challenges.
From the analyses above, we can see that existing hand pose estimation systems can already accurately track the movement of the human hand in real time in relatively controlled environments. However, hand pose estimation cannot yet be considered a solved problem and still faces many challenges, especially in open and complex environments, where the amount of computing resources required must also be taken into consideration.
Wearable sensors, or data gloves, are promising for accurate and disturbance-free hand modeling since they generally have a compact design and are becoming lighter and less cumbersome, permitting dexterous hand movements. However, three main challenges remain to be solved.
First, most data gloves are still "in the lab", and there is no industrial standard for the design and fabrication of such devices, which leads to high costs for the available commercial products, making them unaffordable for daily use. Second, except for gloves based on stretch sensors, most gloves have a fixed size and are difficult to fit to different users' hands. Lastly, gloves are unsuitable in certain cases: for example, some stroke patients have difficulty opening their hands to wear gloves designed for typical users, and gloves cannot be worn when the user needs to manipulate tiny objects, put their hands into water, etc.
Vision-based methods, on the other hand, have overcome many difficulties faced by common computer vision tasks, such as rotation, scale, and illumination variations, as well as cluttered backgrounds. The high-dimensional nature of hand pose representation, and even hand self-occlusion, are no longer insurmountable obstacles to accurate, real-time hand pose estimation. However, vision-based methods still face the following challenges:
First, occlusion is still the major problem. As hands are extensively used to manipulate objects in daily life, they are very likely to be fully or partially blocked by objects during interaction, which forms the hand–object interaction (HOI) problem. There are already some efforts to deal with object occlusion. For example, Tekin et al. [[115]] proposed an end-to-end architecture to jointly estimate 3D hand and object poses from egocentric RGB images. Myanganbayar et al. [[126]] proposed a challenging dataset consisting of hands interacting with 148 objects as a novel benchmark for HOI.
Second, since many methods are data-driven, the quality and coverage of training datasets are of great importance. As discussed in Section 4.4, there are already many useful datasets with 2D/3D annotations. However, a large portion of the annotated data comes from synthetic simulations. Existing methods employ weakly supervised learning, transfer learning, or various data augmentation approaches to cope with the insufficiency of real-world data, but training deep-learning-based architectures still requires more data covering diverse viewpoints, hand shapes, illumination conditions, background variations, and objects in interaction; alternatively, new ways of incorporating the hand model into 3D pose recovery must be found.
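As a concrete illustration of annotation-consistent data augmentation, the sketch below (a hypothetical, NumPy-only example) applies a random horizontal flip and brightness jitter to an image together with its 2D joint annotations; real pipelines would also vary rotation, scale, viewpoint, and background.

```python
import numpy as np

def augment(image, joints_2d, rng=None):
    """Apply simple, annotation-consistent augmentations.

    image:     (H, W, 3) float array with values in [0, 255]
    joints_2d: (J, 2) array of (x, y) pixel coordinates
    """
    rng = rng or np.random.default_rng()
    w = image.shape[1]
    # Random horizontal flip. Note: flipping also mirrors handedness,
    # so any right/left-hand label must be flipped accordingly.
    if rng.random() < 0.5:
        image = image[:, ::-1].copy()
        joints_2d = joints_2d.copy()
        joints_2d[:, 0] = (w - 1) - joints_2d[:, 0]
    # Random brightness jitter: changes pixels only, joints untouched.
    image = np.clip(image * rng.uniform(0.7, 1.3), 0.0, 255.0)
    return image, joints_2d
```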
Moreover, most deep-learning-based methods also require large amounts of computational resources during the training and inference stages. Many algorithms need to run on a graphics processing unit (GPU) to achieve a real-time frame rate, making them difficult to deploy to portable devices such as mobile phones and tablets. Thus, it is important to find effective and efficient solutions on mobile platforms for ubiquitous applications.
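For reference, a rough throughput check such as the hypothetical helper below can verify whether a network reaches an interactive frame rate (commonly taken to be 30 FPS or higher); a full benchmark would also account for preprocessing and data transfer.

```python
import time
import torch

def measure_fps(model, input_shape=(1, 3, 128, 128), runs=100, device="cpu"):
    """Estimate raw inference throughput in frames per second."""
    model = model.to(device).eval()
    x = torch.randn(*input_shape, device=device)
    with torch.no_grad():
        for _ in range(10):              # warm-up iterations
            model(x)
        if device != "cpu":
            torch.cuda.synchronize()     # flush pending GPU work
        start = time.perf_counter()
        for _ in range(runs):
            model(x)
        if device != "cpu":
            torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
    return runs / elapsed                # frames per second
```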
To conclude, various devices and methods have already enabled hand pose estimation for different applications in controlled environments, and we are not far from real-time, efficient, and ubiquitous hand modeling.
In the near future, expertise from materials science and electronics is needed to build data gloves that are easy to wear and maintain, yet more affordable, for accurate hand modeling. Regarding vision-based methods, data-efficient approaches such as weakly supervised learning or hybrid methods are needed to reduce the dependency on large hand pose datasets and to improve generalization to unseen situations. Moreover, we can already see the benefits of new sensors such as the depth sensor, which largely reduce computational complexity because 3D positions can be recovered directly from 2D depth measurements, as sketched below; thus, novel, accurate, long-range 3D sensors will certainly contribute to contactless hand pose estimation.
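This computational advantage follows from the pinhole camera model: a single depth pixel can be lifted directly to a 3D camera-space point, so no depth needs to be inferred from appearance. A minimal sketch, with hypothetical calibration values:

```python
import numpy as np

def backproject(u, v, z, fx, fy, cx, cy):
    """Lift a depth pixel (u, v) with measured depth z (in meters)
    to a 3D point in camera coordinates via the pinhole model.
    The intrinsics (fx, fy, cx, cy) come from sensor calibration.
    """
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.array([x, y, z])

# Example with hypothetical intrinsics for a 640x480 depth camera:
point = backproject(u=320, v=240, z=0.6,
                    fx=575.0, fy=575.0, cx=319.5, cy=239.5)
```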