With the increasing deployment of autonomous taxis in different cities around the world, recent studies have stressed the importance of developing new methods, models and tools for intuitive human–autonomous taxi interactions (HATIs). Street hailing is one example, where passengers would hail an autonomous taxi by simply waving a hand, exactly as they do for manned taxis.
1. Introduction
Despite the uncertainty regarding their ability to deal with all real-world challenges, autonomous vehicle (AV) technologies have been receiving a lot of interest from industry, governmental authorities and academia
[1]. Autonomous taxis (also called Robotaxis) are a recent example of such interest. Different companies are competing to launch their autonomous taxis, such as Waymo (Google)
[2], the Chinese Baidu
[3], Hyundai with its Level 4 Ioniq 5s
[4], NuTonomy (MIT) and the Japanese Tier IV, to mention a few. Robotaxis are no longer in the prototyping stage, and different cities have regulated their use on the roads, such as Seoul
[5], Las Vegas
[6], San Diego, and recently San Francisco, where California regulators gave Cruise’s robotic taxi service permission to start driverless rides
[7]. Uber and Lyft, two ride-hailing giants, have partnered with AV technology companies to launch their driverless taxi services
[8]. The Chinese DeepRoute.ai is planning to start the mass production of Level 4 autonomous taxis in 2024, to be available for consumer purchase afterwards
[9]. In academia, recent research works have concluded that, besides the safety issue, the acceptance of Robotaxis by users is strongly affected by the user experience
[10][11][12]. Researchers have identified different autonomous taxi service stages, mainly the calling, pick-up, traveling and drop-off stages, and have conducted experiments to study the user experience during these stages. Flexibility during the pick-up and drop-off stages is one of the issues raised by users. Currently, users rely on mobile applications to “call” an autonomous taxi, and after boarding, they interact either through a display installed in the vehicle or through a messenger app on their smartphones. One of the problems raised by participants is the difficulty of identifying the specific taxi they called, especially when many autonomous taxis are around
[10]. Mutual identification using QR codes was found to be inconvenient
[10], and some participants suggested that other, more intuitive approaches would be preferable, such as hailing taxis through eye movements using Google Glass
[12] or through hailing gestures using smartwatches
[12]. Participants also requested more flexibility with respect to the communication of the pick-up and drop-off locations to autonomous taxis. The current practice is to select pick-up and drop-off locations from a fixed list of stands, but users have the perception that taxis can be everywhere
[10], and they expect autonomous taxis to be like traditional manned taxis in this respect
[11].
These research findings show that scientists from different disciplines need to join their efforts to propose human–autonomous taxi interaction (HATI) models, frameworks and technologies that allow for the development of efficient and intuitive solutions for the different service stages. The researchers address street hailing, a service stage that has so far been explored to a very limited extent. The current state of the art reveals that only two works have studied taxi street hailing from an interaction perspective. The first is the study of Anderson
[13], a sociologist who explored traditional taxi street hailing as a social interaction between drivers and hailers. Based on a survey distributed to a sample of taxi drivers, he found that the hailing gestures used by passengers vary widely in relation to the visual proximity of the hailer to the taxi and to the speed at which the taxi is passing
[13]. The second work, also from a social science background, is the conceptual model proposed in
[14], which articulates a vision of humanized social interactions between future autonomous taxis and passengers during street-hailing tasks. Both works model taxi street hailing as a sequence of visually driven interactions using gestures that differ according to the distance between autonomous taxis and hailing passengers. Consequently, the researchers believe that computer-vision techniques are fundamental to the development of automated HATI, particularly for the recognition of street-hailing situations. First, they are not intrusive, as they do not require passengers to use any device or application. Second, they mimic how passengers communicate their requests to taxi drivers in the real world, through visual communication. Third, as with manned taxis, they allow passengers to hail autonomous taxis anywhere without being limited to dedicated stands
[14]. Finally, they allow for better accessibility of the service, given that passengers who cannot or do not want to use mobile applications, for whatever reason, can still make their requests.
2. Human–Autonomous Vehicle Interaction (HAVI) and Body Gesture Recognition
Since its introduction by Myron W. Krueger in 1991
[15], gesture detection and recognition have been widely used for the implementation of a variety of human–machine interaction applications, especially in robotics. With the recent technological progress in autonomous vehicles, gestures have become an intuitive choice for the interaction between autonomous vehicles and humans
[16][17][18]. From an application perspective, gesture recognition techniques have been used to support both indoor and outdoor human–autonomous vehicle interactions (HAVIs). Indoor interactions are those between the vehicles and the persons inside them (drivers or passengers). Most of the indoor HAVI applications focus on the detection of unsafe driver behavior, such as fatigue
[19], and on vehicle control
[20]. Outdoor interactions are those between autonomous vehicles and persons outside them, such as pedestrians. Most of the outdoor HAVI applications focus on car-to-pedestrian interaction
[21][22] and car-to-cyclist communication
[23] for road safety purposes, but other applications have been explored, such as traffic control gesture recognition
[24], where traffic control officers can request an autonomous vehicle to stop or turn with specific hand gestures.
From a technological perspective, a lot of research work has been conducted on the recognition of body gestures from video data using computer-vision techniques. Skeleton-based recognition is one of the most widely used techniques
[25], for both static and dynamic gesture recognition. A variety of algorithms have been used, ranging from traditional techniques such as Gaussian mixture models
[26] to deep learning approaches, including recurrent neural networks (RNNs) with bidirectional long short-term memory (LSTM) cells
[27], deep neural networks
[28] and CNNs
[29]. The current state of the art for both indoor and outdoor gesture recognition builds on deep neural networks. A recent review of hand gesture recognition techniques in general can be found in
[30][31].
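To make the skeleton-based line of work above concrete, the following is a minimal sketch of a dynamic gesture classifier operating on sequences of 2D skeleton keypoints, using a bidirectional LSTM in the spirit of the RNN-based approaches cited above. The joint count, sequence length and two-class output (hailing vs. not hailing) are illustrative assumptions, not taken from the cited works.

```python
import torch
import torch.nn as nn

class SkeletonGestureClassifier(nn.Module):
    def __init__(self, num_joints=17, coords=2, hidden=128, num_classes=2):
        super().__init__()
        # Each frame is a flattened skeleton: num_joints * coords values.
        self.lstm = nn.LSTM(num_joints * coords, hidden,
                            batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, num_classes)

    def forward(self, seq):
        # seq: (batch, frames, joints, coords) sequence of 2D skeletons
        b, t, j, c = seq.shape
        feats, _ = self.lstm(seq.view(b, t, j * c))
        return self.head(feats[:, -1])  # classify from the last time step

# Usage: a batch of 4 clips, 30 frames each, 17 COCO-style joints.
clips = torch.randn(4, 30, 17, 2)
logits = SkeletonGestureClassifier()(clips)
print(logits.shape)  # torch.Size([4, 2])
```

Operating on skeletons rather than raw pixels keeps the input dimensionality small, which is the same motivation given for the skeleton-based pedestrian-intention methods discussed in the next section.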
3. Predicting Intentions of Pedestrians from 2D Scenes
The ability of autonomous vehicles to detect pedestrians’ road-crossing intentions is crucial for their safety. Approaches to pedestrian intention detection fall into two major categories. The first formalizes intention detection as a trajectory prediction problem. The second treats pedestrian intention as a binary decision problem.
Several models and architectures have been developed and deployed to achieve high-accuracy prediction of pedestrian intention using a binary classification approach. The tools and techniques employed differ depending on the data source and its characteristics. For instance, models based on RGB input use either 2D or 3D convolutions: in 2D convolution, a sliding filter moves along the height and width of the frame, whereas in 3D settings, the filter also slides along the temporal depth. With 2D convolutional networks, information is propagated across time either via LSTMs or via feature aggregation over time
[32]. For instance, the authors in
[33] proposed a two-stream architecture that takes as input a single excerpt from a typical traffic scene bounding an entity of interest (EoI), namely the pedestrian. The EoI is processed by two independent CNNs, producing two feature vectors that are then concatenated for classification. The authors of
[34] extended these models by integrating LSTMs and 3D CNNs, and those of
[35] did so by looking several frames into the future and carrying out the classification using these frames.
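As an illustration of the two-stream, 2D-convolution idea described above, the following sketch passes a pedestrian EoI through two independent CNNs and concatenates their feature vectors for binary crossing/not-crossing classification. The backbone sizes and the choice of what each stream receives are assumptions made for illustration; they do not reproduce the exact architecture of [33].

```python
import torch
import torch.nn as nn

def small_cnn(out_dim=128):
    # A deliberately tiny backbone standing in for each stream's CNN.
    return nn.Sequential(
        nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(32, out_dim), nn.ReLU(),
    )

class TwoStreamIntentNet(nn.Module):
    def __init__(self, feat_dim=128):
        super().__init__()
        self.stream_a = small_cnn(feat_dim)  # processes the EoI crop
        self.stream_b = small_cnn(feat_dim)  # processes a second view of the EoI
        self.classifier = nn.Linear(2 * feat_dim, 2)  # crossing vs. not crossing

    def forward(self, view_a, view_b):
        # Two independent feature vectors, concatenated for classification.
        fused = torch.cat([self.stream_a(view_a), self.stream_b(view_b)], dim=1)
        return self.classifier(fused)

crop = torch.randn(1, 3, 128, 64)  # a single pedestrian EoI excerpt
logits = TwoStreamIntentNet()(crop, crop)
print(logits.shape)  # torch.Size([1, 2])
```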
Other methods operate directly on the skeletal data extracted from the frames, i.e., on the skeletons of the pedestrians. Their main advantage is that the data dimensions are significantly reduced, making the resulting models less prone to overfitting
[36]. Recently, a method based on individual keypoints was proposed to predict pedestrian intentions from a single frame. Another method, proposed in
[37], exploits contextual features, such as the distance separating the pedestrian from the vehicle, the pedestrian’s lateral motion and surroundings, and the vehicle’s velocity, as input to a conditional random field (CRF). The purpose of this model is to predict a pedestrian’s crossing/not-crossing behavior in front of a vehicle both early and accurately.
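The sketch below illustrates the contextual-feature idea behind [37]: per-frame features such as the pedestrian–vehicle distance, lateral motion and vehicle speed are assembled into a vector. Note that [37] feeds such features to a CRF; here a simple logistic scoring function stands in for the CRF purely to keep the example self-contained, and the coordinate convention and weights are illustrative assumptions.

```python
import numpy as np

def context_features(ped_xy, prev_ped_xy, veh_xy, veh_speed, near_crosswalk):
    # Distance separating the pedestrian from the vehicle (metres, assumed).
    distance = float(np.linalg.norm(np.asarray(ped_xy) - np.asarray(veh_xy)))
    # Lateral motion since the last frame: positive when the pedestrian
    # moves toward the road axis at x = 0 (assumed convention).
    lateral_motion = prev_ped_xy[0] - ped_xy[0]
    return np.array([distance, lateral_motion, veh_speed, float(near_crosswalk)])

def crossing_score(feats, w, b):
    # Logistic stand-in for the CRF of [37]: returns P(crossing) in [0, 1].
    return 1.0 / (1.0 + np.exp(-(feats @ w + b)))

# Toy, illustrative weights: a shorter distance and stronger lateral motion
# toward the road raise the crossing probability.
w = np.array([-0.2, 3.0, -0.1, 0.8])
feats = context_features(ped_xy=(2.0, 0.0), prev_ped_xy=(2.5, 0.0),
                         veh_xy=(0.0, 10.0), veh_speed=8.0, near_crosswalk=True)
print(crossing_score(feats, w, b=1.0))
```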
4. Identification of Taxi Street-Hailing Behavior
The topic of recognizing taxi street hailing has been studied by sociologists in order to explore how taxi drivers perceive and culturally interact with their environment, including passengers
[38]. An interesting work is that of Anderson, who studied gestures in particular as a communication channel between taxi drivers and passengers during hailing interactions
[13]. Based on a survey distributed to a group of taxi drivers in San Francisco, CA, USA, the researcher explored how taxi drivers evaluate street hails in terms of clarity and propriety. Clarity refers to “the ability of clearly recognizing a hailing behavior and distinguishing it from other ones”, such as waving to a friend. Propriety refers to “the ability to identify if the passenger can be trusted morally and socially”, so that the taxi driver can decide whether or not to accept the hailing request
[13]. In the context of this work, it is the clarity aspect that is more relevant, and in this respect, Anderson’s results are interesting. He found that the method of hailing adopted varies widely in relation to the visual proximity of the hailer to the taxi and to the speed at which the vehicle is passing
[13]. When the driver and the hailer are within range of eye contact, the hailer “can use any waving or beckoning gestures to communicate his intention, such as raising one’s hand, standing on the curb while sticking arms out sideways into the view of the oncoming driver”
[13], etc. However, if the hailer and the driver are too far from each other, “hailers need to make clear that they are hailing a cab as opposed to waving to a friend, checking their watch, or making any number of other gestures which similarly involve one’s arm”
[13]. Taxi drivers who participated in the survey specified that the best and clearest gesture in this case is the “Statue of Liberty”, where the “hailer stands on the curb, facing oncoming traffic, and sticks the street-side arm stiffly out at an angle of about 100–135 degrees”
[13]. However, taxi drivers pointed out that there are many other hailing gestures, and they may even depend on the hailers’ cultural backgrounds. Similarly, the conceptual model of human–taxi street-hailing interaction proposed in
[14] assumed that, depending on the distance between taxis and passengers, different hailing gestures can be used, such as waving or nodding.
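As a concrete illustration of how the “Statue of Liberty” cue described above could be checked automatically, the following sketch measures the elevation angle of the street-side arm from 2D skeleton keypoints and flags the reported 100–135 degree band. The keypoint convention (pixel coordinates with the y axis pointing down) and the exact thresholds are assumptions for illustration, not part of [13] or [14].

```python
import numpy as np

def arm_elevation_deg(shoulder, wrist):
    # Angle between the shoulder->wrist vector and straight down in image
    # coordinates (0 deg = arm hanging, 90 deg = horizontal, 180 deg = up).
    v = np.asarray(wrist, dtype=float) - np.asarray(shoulder, dtype=float)
    down = np.array([0.0, 1.0])  # the image y axis points downward
    cos = (v @ down) / np.linalg.norm(v)
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def looks_like_statue_of_liberty(shoulder, wrist, lo=100.0, hi=135.0):
    # Flag the 100-135 degree band reported by the surveyed drivers.
    return lo <= arm_elevation_deg(shoulder, wrist) <= hi

# A wrist above and to the side of the shoulder (pixel coordinates, y grows
# downward) lands inside the hailing band.
print(looks_like_statue_of_liberty(shoulder=(200, 300), wrist=(270, 250)))
```

In a full pipeline, such a geometric check would sit on top of a skeleton-based gesture recognizer like the one sketched in Section 2, with the distance-dependent gesture vocabulary of [13] and [14] selecting which cues to look for.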