Intelligent virtual agents (IVAs) are artificially intelligent animated characters designed to mimic natural human–human interaction for various objectives, including entertainment [1], education [2], and healthcare [3]. Compared with conversational agents (such as chatbots or voice assistants), which have no visual representation or embodiment, IVAs are more challenging to make acceptable and engaging for fruitful interactions because their design adds an appearance dimension; for this reason, they are also commonly known as embodied conversational agents. Designing IVAs requires a multidisciplinary effort, including expertise in psychology [4] and artificial intelligence [5], to achieve capabilities such as emotion modeling [6][7]. Yet these capabilities are still largely under laboratory investigation, and which factors influence their acceptance by users remains an open research question [3][8].
People tend to perceive the presence of, and consequently interact socially with, computers in much the same way as they naturally do with each other [9][10]. With the increasing use of IVAs in social, entertainment, health, and work settings, the acceptance of this technology is growing. Great effort is being devoted to increasing this acceptance by bringing the technology to life, that is, by simulating the natural look and behavior of a human being (anthropomorphism) [11]. However, the theory of the “uncanny valley”, first introduced by Mori [12], predicts that this acceptance will not last: at some point, IVAs will manifest an unacceptable behavior/appearance and induce a feeling of eeriness and discomfort.
Two related constructs, social presence and co-presence, need to be differentiated to understand the impact of anthropomorphism. In the literature, it is assumed that increasing the agent’s realism (anthropomorphism) leads to higher social presence and consequently increases the IVA’s likeability and acceptance (e.g., [14][15]), although findings on the impact of anthropomorphism on the user–agent relationship and the interaction goal are contradictory (e.g., [16][17][18][19]). Considering the uncanny valley problem, however, Brenton et al. [20] suggested that the eeriness problem or “uncanny valley” occurs as a result of how the user perceives the agent’s co-presence. Co-presence is defined as the user’s sense of perceiving the IVA, combined with the sense of being actively perceived by the IVA in turn [13]. This mutual awareness is a conscious process that develops once an interaction goal is established [13]; the conscious awareness and the established goal help the user adapt to a low level of human features (low anthropomorphism) when, for example, talking objects (agents) are used and interacted with socially [21][22]. Social presence, on the other hand, is defined as the sense of being ‘there’, in the same environment as the virtual agent. This clear distinction makes social presence more appropriate to measure in virtual reality studies, such as [23].
The study of anthropomorphism has included the agent’s appearance, facial expressions, voice, spoken responses, and physical movements. Given the need for speech to be adaptive and the predominant use of synthetic voice/text-to-speech (TTS) technology rather than recorded real human voice, the role of voice anthropomorphism (human vs. synthetic voice), its effect on co-presence, and its impact on the interaction outcome (intentions to change a behavior) merit particular attention.
2. Anthropomorphism and Co-Presence
People naturally enter into others’ presence with very little information about the interaction situation, and hence they pay close attention to others’ appearance and social cues [24]. The same rule applies when people interact with a virtual character as when they interact with a human being [9][25]. Presence has therefore been considered a major tool for evaluating social virtual agents, measuring the extent to which the agent can engage the human user in the virtual world.
According to the ethopoeia theory, adding human social characteristics to an object increases the human tendency to unconsciously accept the object as a social entity [9]. The threshold model of social influence, however, posits that these characteristics must be as real as those of human beings (anthropomorphism) to boost a virtual agent’s presence [26]. Yet realism and presence are not always directly proportional: there is a point where eeriness is sensed, the uncanny valley problem [20] in human–agent interaction, which has still not been fully explained. Some researchers regard the problem as a subconscious cognitive reaction that occurs when the user’s expectations are violated [27], while others regard it as an emotional arousal response to the agent’s appearance and behavior, correlated with the agent’s presence [28]. Therefore, measuring the agent’s presence, specifically co-presence, can reveal when eeriness occurs as a result of anthropomorphism [20].
Voice is one of the human characteristics that increase an IVA’s realism. It was found to be highly influential in the application of social rules to objects: people perceived different voices coming from the same object as different agents [29]. A simple talking object, a tissue box saying “bless you”, was perceived as being as social as a humanoid robot and as a human being [22]. Although the participants in [22] rated the robot and the box as eerier and less human than the human being, they rated them as being as social, attractive, friendly, and intelligent as the human being.
Voice anthropomorphism boosts an agent’s likeability [30]; however, there is debate about its influence on the final interaction outcome. On one hand, some studies found that anthropomorphism contributed to increasing human–agent understanding, trust, connectedness, and task outcome [16][17][31][32]. On the other hand, other studies reported no influence on the outcome [33][34] or raised concerns about higher anthropomorphism. For example, adding a human tone to a chatbot’s voice influenced users’ online shopping behavior: a corporation’s formal and emotionless tone, compared to an employee’s informal and emotional tone, was perceived as less credible when making important shopping decisions, such as shopping for gifts [35], and as carrying a higher privacy risk, which affected the behavioral intention to register on the business website [18].
3. Anthropomorphism and Congruence
Early works in psychology concluded that humans naturally prefer clarity and consistency in perception and interaction. The Gestalt principle of grouping postulates that people perceive and organize their surroundings/objects according to certain rules, including similarity [36]. When an inconsistency occurs, the human brain becomes confused and takes longer to respond, a phenomenon known as the Stroop effect [37]. This confusion is not limited to perceiving objects but extends to almost everything, including perceiving other humans and their personalities. The auditory Stroop effect was examined by Green and Barber [38], whose study participants found it more difficult to identify the gender of a speaker when a male speaker said ‘girl’ or a female speaker said ‘man’.
The same experience extends to human–computer and human–robot interaction. Isbister and Nass [4] underlined the importance of consistency in the verbal and non-verbal cues of a virtual character, regardless of its personality or that of the users. Mitchell, Szerszen Sr, Lu, Schermerhorn, Scheutz, and MacDorman [33] reported that a human face with a synthetic voice, or a humanoid robot with a human voice, caused significantly higher eeriness than matched face–voice pairings; users rated the mismatched/inconsistent scenarios as eerier than the matched ones and the robot with a synthetic voice as the warmest. Similar results were reported earlier by Gong and Nass [39] using a human or humanoid face matched with a human or computerized voice. Moore [40] provided a mathematical account of this mismatch effect using a Bayesian model of categorical perception, calling the resulting phenomenon ‘perceptual tension’.
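To make the intuition concrete, the following is a minimal sketch of a Bayesian categorical-perception account, not Moore’s exact formulation: a cue x on a realism continuum is assigned to the category ‘human’ (c_h) or ‘synthetic’ (c_s) by Bayes’ rule, and tension is modeled here as the disagreement between the categorizations implied by two cues, such as a face cue x_f and a voice cue x_v. The Gaussian likelihoods and the disagreement measure T are illustrative assumptions.

```latex
% Minimal sketch (illustrative assumptions, not Moore's exact model):
% two categories on a realism continuum, Gaussian likelihoods, and
% "tension" as the categorization disagreement between two cues.
\[
  P(c_h \mid x)
    = \frac{p(x \mid c_h)\,P(c_h)}
           {p(x \mid c_h)\,P(c_h) + p(x \mid c_s)\,P(c_s)},
  \qquad
  p(x \mid c) = \mathcal{N}\bigl(x;\ \mu_c,\ \sigma_c^2\bigr).
\]
\[
  T(x_f, x_v) \;\propto\; \bigl|\,P(c_h \mid x_f) - P(c_h \mid x_v)\,\bigr|.
\]
```

Under this sketch, T approaches its maximum precisely when one cue (e.g., a human face) is confidently categorized as human while the other (e.g., a synthetic voice) is confidently categorized as synthetic, consistent with the eeriness reported above for mismatched face–voice pairings.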
The impact of matching the agent’s look with its voice (congruency) on user–agent trust has been studied further, with contrary findings. For example, Torre et al. [41] had 120 participants play an investment game with either a generous or a mean robot that switched from a human to a synthetic voice (or vice versa) in the middle of the game. The results indicate the importance of matching the voice with the look in forming the first impression, which in turn affects the user’s trust in the agent. Other studies, however, endorsed the use of a recorded human voice over a synthetic voice, such as Chérif and Lemoine [42], who concluded that a virtual assistant with a human voice on commercial websites increases the assistant’s presence and the users’ behavioral intentions (N = 640), but does not affect the users’ trust in the website; they reported a negative influence of TTS on user–agent trust, which in turn negatively affects behavioral intentions.
Further studies failed to find a significant difference between recorded human voices and TTS with respect to the system’s desired outcome. With 138 undergraduate students, Zanbaka, Goolkasian, and Hodges [32] investigated how persuasive messages were perceived when delivered by human beings, IVAs, and a talking cat, each in a male and a female version, with human voices used in all three settings. Participants perceived the IVAs and the cat to be as persuasive as the human speakers and showed a similar attitude towards the virtual characters and the humans: male participants were more persuaded by real and virtual female speakers, while female participants were more persuaded by real and virtual male speakers. This stereotypical response is in line with findings from the voice-only studies of Mullennix, Stern, Wilson, and Dyson [30]. Similarly, other studies reported no significant differences between voice settings in terms of social reactions [43] or the distance kept between a user and a robot [44].
The impact of the agent’s voice type (recorded human voice vs. synthetic/computerized voice) may depend on context. In a learning scenario, Dickerson et al. [45] reported that participants found the synthetic voice unnatural but quickly adapted and rated the agent as being as intelligent as the agent with a human voice. The system outcome was met equally in both groups, but the authors concluded that a human voice is preferable only for expressive virtual agents, as it can convey the speaker’s emotions and attitudes. In a voice-only system, Noah et al. [46] concluded that a synthetic voice can harm the user’s trust in high-risk and highly personal contexts, and that the choice between a human and a synthetic voice should be investigated before deploying a system by studying users’ expectations of, and concerns with, the voice type.
In a matching task, Torre, Latupeirissa, and McGinn [41] asked 60 participants to match one of eight pictured robots with a voice from a set of voices that varied in gender and naturalness, for four different contexts: home, school, restaurant, and hospital. The results showed that the more mechanical the robot’s look, the greater the tendency to match it with a synthetic voice, and that the matching differed significantly across contexts.
Black and Lenzo [47] suggested using domain-limited (professional vs. general) synthetic voices in place of a human voice; however, Georgila et al. [48] reported no significant difference between a general-purpose synthetic voice and a good domain-limited synthetic voice in terms of perceived naturalness, conversational aspects, and likeability.