Expressive Qualifiers, Feature Learning and Expressive Movement Generation

Expressive body movements are characterized by low- and high-level descriptors. Low-level descriptors capture kinematic or dynamic quantities such as velocity and acceleration, whereas high-level descriptors build on these low-level features to describe a movement's perceptual or semantic qualities. Notable high-level systems include Pelachaud's qualifiers, Wallbott's descriptors, and Laban Movement Analysis (LMA), which is commonly used to evaluate dance performance.
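For concreteness, such low-level descriptors can be computed directly from a recorded pose trajectory. The sketch below is a minimal illustration, assuming a uniformly sampled trajectory of stacked joint coordinates; the function name and the finite-difference scheme are illustrative choices, not taken from any of the cited systems.

```python
import numpy as np

def low_level_descriptors(positions: np.ndarray, dt: float):
    """Compute simple low-level movement descriptors from a pose trajectory.

    positions: array of shape (T, D) -- T samples of a D-dimensional pose
               (e.g., stacked 3D joint coordinates), assumed uniformly sampled.
    dt:        sampling interval in seconds.
    Returns velocity and acceleration arrays via finite differences.
    """
    velocity = np.gradient(positions, dt, axis=0)      # first time derivative
    acceleration = np.gradient(velocity, dt, axis=0)   # second time derivative
    return velocity, acceleration

# Example: a synthetic 2-second wrist trajectory sampled at 100 Hz
t = np.linspace(0.0, 2.0, 200)
trajectory = np.stack([np.sin(t), np.cos(t), 0.1 * t], axis=1)  # shape (200, 3)
vel, acc = low_level_descriptors(trajectory, dt=t[1] - t[0])
print(vel.shape, acc.shape)  # (200, 3) (200, 3)
```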

  • human–robot interaction
  • human-centered robotics
  • human-in-the-loop
  • human factors

1. Introduction

Bartra [1] asserts that symbolic elements, including speech, social interaction, music, art, and movement, shape human consciousness. This theory extends to interactions with society and other living beings [2], suggesting that robotic agents, as potential expressive and receptive collaborators [3], should also be integrated into this symbolic framework. However, current human–robot interactions, whether via generated voices, movement, or visual cues [4][5][6][7][8][9], are often anthropomorphized [10], leading to challenges due to unsolved problems in natural language processing [11][12] and the need for users to familiarize themselves with system-specific visual cues [13]. Moreover, these systems still struggle with context understanding, adaptability, and forethought [14][15]. The ideal generalized agent capable of formulating contextually appropriate responses remains unrealized [16]. Nonetheless, body movement could enhance these interactions.
In the dance community, body movement is acknowledged for its linguistic properties [17], from minor gestures [18] to significant expressive movements conveying intent or state of mind [19]. This expressiveness can be employed in robots to create meaningful and reliable motion [20][21][22], leveraging elements such as legibility [23], language knowledge [24], and robust descriptors [25][26]. In doing so, robots can create bonds, enhance rapport with users, persuade, and facilitate collaborative tasks [27][28][29]. Currently, however, the selection of these expressive qualities often relies on user preference or expert design [20][30], limiting motion variability and affecting the human perception of the robot's expression [31].
In [32], the authors demonstrated the need for explainable interaction between embodied agents and humans; furthermore, they suggested that expressivity could provide the terms a robot needs to communicate its internal state effectively. Ref. [33] points out that such a representation will be required for realizing sounds and complex interactions with humans. Movement could then be the medium for realizing such a system (this is further visualized in the following dance video from Boston Dynamics: https://www.youtube.com/watch?v=fn3KWM1kuAw, accessed on 20 November 2023). As discussed in [34], modeling these human factors can be accomplished using machine-learning techniques. However, direct human expressivity is often set aside in the literature in favor of definitions that can serve as design guidelines for specific embodied agents or interactive technologies [35]. This raises the question of whether it is possible to rely on human expressivity and expressive movement to communicate this sense effectively. Moreover, can the robot recognize this intent and replicate the same expressive behavior for the user? The robot should communicate its internal state and do so in a manner understandable to humans. This research aims to answer these questions, exploring the transmission of human expressivity to any robot morphology. In doing so, the approach becomes generalizable to any robot and makes it possible to ascertain whether the expressive behavior contains the necessary qualities. By addressing this challenge, it is possible to enhance human–robot interaction and open scenarios where human users can effectively modify and understand robot behavior by demonstrating their expressive intent.

2. Expressive Qualifiers

The LMA system explores the interaction between effort, space, body, and shape, serving as a link between movement and language [36]. It focuses on how the body moves (body and space), its form during motion (shape), and the qualitative aspects of dynamics, energy, and expressiveness (effort). Because it quantifies expressive intent, the Effort component of LMA has been widely used in animation and robotics [37], and it is utilized in this research to describe movement expressiveness. Movements are often associated with emotions, and numerous psychological descriptors have been used to categorize body movement [38]. Scales such as Pleasure–Arousal–Dominance (PAD) and Valence–Arousal–Dominance (VAD) have been used in animation and robotics [24][39][40]. However, manual selection can introduce bias [41]. While motion and behavioral qualifiers can improve user engagement with animated counterparts [42][43], no unified system effectively combines affective and expressive qualities.
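Since LMA defines the Effort factors qualitatively, any numeric realization is an approximation. The sketch below shows one heuristic mapping, assuming a single end-effector trajectory; the specific proxies (a kinetic-energy-like magnitude for Weight, acceleration for Time, path directness for Space, jerk for Flow) are illustrative assumptions in the spirit of common practice, not the qualifiers of any particular cited system.

```python
import numpy as np

def effort_features(positions: np.ndarray, dt: float) -> dict:
    """Heuristic numeric proxies for the four LMA Effort factors.

    positions: (T, 3) trajectory of a single marker or end-effector.
    The mappings below are illustrative assumptions, not a standard.
    """
    vel = np.gradient(positions, dt, axis=0)
    acc = np.gradient(vel, dt, axis=0)
    jerk = np.gradient(acc, dt, axis=0)

    speed = np.linalg.norm(vel, axis=1)
    # Weight (strong vs. light): mean kinetic-energy-like term
    weight = float(np.mean(speed ** 2))
    # Time (sudden vs. sustained): mean acceleration magnitude
    time_ = float(np.mean(np.linalg.norm(acc, axis=1)))
    # Space (direct vs. indirect): straight-line distance over path length
    path_len = float(np.sum(np.linalg.norm(np.diff(positions, axis=0), axis=1)))
    directness = float(np.linalg.norm(positions[-1] - positions[0]) / (path_len + 1e-9))
    # Flow (bound vs. free): mean jerk magnitude
    flow = float(np.mean(np.linalg.norm(jerk, axis=1)))
    return {"weight": weight, "time": time_, "space": directness, "flow": flow}
```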

3. Feature Learning

The idea of feature extraction and exploitation has seen widespread use and advancement in classifying time series across diverse domains [44][45][46]. These techniques have also been applied in image processing and natural language processing to extract meaning and establish feature connections [47][48]. Such methods have been repurposed for cross-domain applications, like the co-attention mechanism that combines image and sentence representations as feature vectors to decipher their relationships [49]. These mechanisms can analyze and combine latent encodings to create new style variations, as seen in music performances [50]. The results demonstrate that these networks can reveal a task's underlying qualities, context, meaning, and style. When applied to motion, movement can be formed and generated directly in the feature or latent space, where the representation contains information about the task and any anomalies or variations [51]. Studies have shown that multi-modal signals can be similarly represented by leveraging these sub-spaces [52]. The resultant latent manifolds and topologies can be manipulated to generalize to new examples [53].
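As a concrete illustration of operating in a latent space, the sketch below is a minimal, hypothetical autoencoder over flattened motion windows (the layer sizes and window dimension are assumptions); it shows how two latent encodings could be interpolated to decode a new style variation, in the spirit of the mechanisms described above. A real system would first train such a model on motion data.

```python
import torch
import torch.nn as nn

class MotionAutoencoder(nn.Module):
    """Minimal autoencoder over flattened motion windows (hypothetical sizes)."""
    def __init__(self, input_dim: int = 90, latent_dim: int = 16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 64), nn.ReLU(), nn.Linear(64, latent_dim))
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, input_dim))

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

# Blending two motion "styles" directly in the latent space:
model = MotionAutoencoder()
x_a = torch.randn(1, 90)   # stand-ins for two encoded motion windows
x_b = torch.randn(1, 90)
with torch.no_grad():
    z_a, z_b = model.encoder(x_a), model.encoder(x_b)
    z_mix = 0.5 * (z_a + z_b)        # interpolate the latent codes
    blended = model.decoder(z_mix)   # decode a new style variation
print(blended.shape)  # torch.Size([1, 90])
```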

4. Style Transfer and Expressive Movement Generation

Previous research focused on style transfer using pose generation systems, aiming to generate human-like poses from human input, albeit with limitations in creating highly varied and realistic poses [54][55][56]. To address this, Generative Adversarial Networks (GANs), attention mechanisms, and transformers have been introduced; while these improve pose generation performance, they are usually confined to specific morphologies, compromising their generalizability [57][58][59]. Research suggests that a robot's movement features can be made adaptable, with human input specifying the guiding features of the robot's motion, serving as a foundation for a divide-and-conquer strategy to learn user-preferred paths [60]. A system built on these features assists the robot's pose generation, showing that human motion can influence the basis functions to align with the user's task preferences. Although it has been shown that expressive characteristics can be derived from human movement and integrated into a robot arm's control loop, the generated motions often lack legibility and variability [61]. In addition, much of the essence of higher-order expressive descriptors and affective qualities is lost or unmeasured. Although re-targeting can be used to generate expressive motion, it often faces cross-morphology implementation issues [62][63][64]. Burton emphasized that "imitation does not penetrate the hidden recesses of inner human effort" [36]. However, modulating motion through expert descriptors and exploiting kinematic redundancy can feasibly portray emotional characterizations, provided the motion remains within the robot's limits and the interaction context is suitable [65]. Therefore, effective expressive generation should consider both the user's expressive intent and the task and capabilities of the robot.
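To make the idea of descriptor-driven modulation concrete, the sketch below shows a toy stand-in: a nominal joint-space trajectory is retimed and amplitude-scaled by two expressive parameters. The parameter names and their mapping are hypothetical; a real system would additionally enforce joint limits and account for the interaction context, as noted above.

```python
import numpy as np

def modulate_trajectory(traj: np.ndarray, time_scale: float, amp_scale: float):
    """Modulate a nominal joint trajectory with expressive parameters.

    traj:       (T, J) nominal joint-space trajectory.
    time_scale: >1 stretches the motion ("sustained"), <1 compresses ("sudden").
    amp_scale:  scales deviation from the mean pose ("strong" vs. "light").
    """
    T, J = traj.shape
    new_T = max(2, int(round(T * time_scale)))
    # Retime via per-joint linear interpolation
    src = np.linspace(0.0, 1.0, T)
    dst = np.linspace(0.0, 1.0, new_T)
    retimed = np.stack([np.interp(dst, src, traj[:, j]) for j in range(J)], axis=1)
    # Scale amplitude about the mean pose
    mean_pose = retimed.mean(axis=0, keepdims=True)
    return mean_pose + amp_scale * (retimed - mean_pose)

# Example: render a toy 6-joint ramp as a slower, lighter movement
nominal = np.tile(np.linspace(0.0, 1.0, 100)[:, None], (1, 6))  # (100, 6)
sustained_light = modulate_trajectory(nominal, time_scale=1.5, amp_scale=0.7)
print(sustained_light.shape)  # (150, 6)
```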