Audio–Tactile Feedback in Volumetric Music Video: History

With the reinvigoration of XR technology in general, the current market offers several innovative modes of music creativity in 3D computer-generated imagery (CGI) production environments that can be accessed via virtual reality (VR) head-mounted displays (HMDs). To facilitate multimodality in VR, audio-tactile haptic feedback in volumetric music videos can have a positive impact on user experience.

  • volumetric video
  • virtual reality
  • music
  • user experience
  • audio–tactile feedback

1. Introduction

The use of haptic technology, which mimics the sense of touch through force and vibration, has raised questions about its relevance in contemporary artistic practices of the 21st century. In music, the concept of musical haptics has long explored the connection between auditory experiences and somatosensory stimulation using acoustic sound-generating musical interfaces (Papetti and Saitis 2018). In today's digital music landscape, musicians can utilize multimodal and 3D interactive platforms to interact with digital sound generators, giving them greater control over their musical creations. Moreover, the resurgence of extended-reality (XR) technology (Evans 2018), such as affordable computational ambisonics and volumography, offers the next generation of musicians unique control over audience perspectives, making it a promising field of creative media research.
Beyond musicians, the concept of a 21st-century musical performance has itself evolved, moving past traditional static setups to embrace novel immersive technologies like augmented and virtual reality (AR/VR). Digital performances on XR devices can now reach wider audiences through contemporary multimodal AR/VR head-mounted displays (HMDs). By incorporating multimodality into music production, new interaction paradigms are emerging. Audiences can experience stimulating and cognitively challenging performances through immersive installations or recitals, with musical haptics becoming a distinct and captivating element. Using haptic technology in virtual performances also indirectly affects the audience's presence and engagement.

In the VR domain, digital audio workstations (DAWs) come in various forms. With the rise of XR technology, there are innovative modes of music creativity in 3D computer-generated imagery (CGI) production environments accessible via VR HMDs. Haptic technology has gained recognition as a crucial component of XR, bringing the sense of touch into what was previously an audiovisual-focused technology. As new paradigms for home media consumption emerge, artists can now interact with their audiences in novel and exciting ways. Consequently, emergent 3D capture and display systems are becoming essential instruments for new audiovisual production techniques, continually shaping music performance.

For audiences, the experience of a live musical performance is momentary and inherently shared with others, making it difficult to reproduce the same experience outside its original context. While traditional digital technology can capture the audiovisual elements effectively, it often fails to convey the feeling and intimacy of a live audience. Even when an audience is not consciously aware of vibrations, those vibrations can influence recognizable features of the experience, such as presence (Cerdá et al. 2012). In the realm of VR, the exploration of soloistic performance experiences and feelings of presence has been a subject of research interest for many years.

2. Audience Experiences of Audio–Tactile Feedback in a Novel Virtual Reality Volumetric Music Video

Several studies have explored how to augment an audience's experience in live performances, such as theater, dance, and music (Sparacino et al. 1999; Hödl 2016). Researchers have investigated new ways for seated audiences to experience embedded actuators in chairs that provide audio-based vibrotactile stimuli (Merchel and Altinsoy 2009; Nanayakkara et al. 2009; Karam et al. 2010). While seating is available in most drama and dance performances, standing is often required for live pop, rock, or dance music concerts. Still, relatively few haptic interfaces have been developed for standing audiences, with notable exceptions providing free-standing capabilities (Gunther and O'Modhrain 2003; West et al. 2019; Turchet et al. 2021). This factor is significant when considering the contemporary application of immersive technology in musical performance.
VR technology is hardware that harnesses multimodal human–computer interaction to create the feeling of presence in a virtual world (Seth et al. 2011). Thus, contemporary VR employs numerous advanced digital technologies to immerse users in imaginary digital worlds. VR, as technology, is nascent; however, virtual realities, in general, have existed as immersive media entertainment experiences for millennia—as books (Saler 2012; Ryan 1999), films (Visch et al. 2010), theatre (Reaney 1999; Laurel 2013), and games (Jennett et al. 2008). The immersive qualities of such works are often attributed to the quality of the work and not their ability to stimulate multiple senses at once, for example, in the case of vision with film and audio with music. VR experiences are not necessarily modally locked in the same way as other media and can stimulate audiences’ senses differently from traditional immersive media.
Haptic cues in music performance and their perception have been observed to affect user experiences—including usability, functionality, and the perceived quality of the musical instruments being used (Young and Murphy 2015b). Haptics can also render and exploit controlled feedback for digital musical instruments (DMIs) (Young and Murphy 2015b). This creative application space highlights the multidisciplinary power of musical haptics from the perspectives of computer science, human–computer interaction, engineering, psychology, interaction design, musical performance, and theatre. Therefore, it is hoped that the presented study will contribute to developing a multidisciplinary understanding of musical haptics in 21st-century artistic practices. The role of supplementary senses in immersive media is often undervalued or misrepresented in reductive, single-sensory approaches to lab-based research. In the wild, audiences do not experience a single stimulus while consuming art; they use all their senses to experience the world of live music performance holistically. A notable example is the profoundly deaf percussionist Evelyn Glennie, who has used vibrotactile cues in her musical performances to recognize pitch based on where the vibrations are felt on the body (Glennie 2015).

2.1. Immersive Virtual Environments and Presence

Psychologically, virtual realities are presented as 3D immersive virtual environments (IVEs), digitally providing sensory stimuli that encapsulate the user's senses and create the perception that the IVE is genuine and not synthetic (Blascovich et al. 2002). IVEs have been used for years to convey virtual realities via CAVE and HMD systems (Mestre 2017). Today, VR technology can be used as a sophisticated psychological platform for cultural heritage (Zerman et al. 2020), theatre performance (O'Dwyer et al. 2022), teaching (Wang et al. 2021), and empathy building (Young et al. 2021).
The most common concepts in discussions about virtual realities are immersion, presence, co-presence, flow, and simulation realism. Immersion is “the degree of involvement with a game” (Brown and Cairns 2004, p. 1298). Immersion is also a deep engagement when people “enter a make-believe world” (Coomans and Timmermans 1997, p. 6). While some research points to experiencing virtual engagement or disassociation from reality in virtual worlds (Brown and Cairns 2004; Coomans and Timmermans 1997; Haywood and Cairns 2006; Jennett et al. 2008), others consider immersion as a substitution for reality by virtuality and becoming part of the virtual experience (Grimshaw 2007; Pine and Gilmore 1999). Immersion also includes a lack of awareness of time and the physical world, feeling present within a virtual world, and a sense of real-world dissociation (Haywood and Cairns 2006; Jennett et al. 2008). While broad, these definitions of immersion are universally applicable to VR technology. Moreover, it should also be noted that measures of immersion target the technology and not the user’s experience of the IVE.
Factors of presence, on the other hand, can be classified as subjective experiences (Witmer and Singer 1998). As an aspect of immersion, presence can indicate whether a “state of deep involvement with technology” has been achieved (Zhang et al. 2006, p. 2). Therefore, presence can be defined as a “state of consciousness, the (psychological) sense of being in the virtual environment” (Slater and Wilbur 1997, p. 605). Whether directly or indirectly, immersion is required to induce presence. Furthermore, the social aspect of a virtual experience, co-presence, is also a factor for consideration (Slater and Wilbur 1997), as is the state of “flow.” Flow describes the feeling of full engagement and enjoyment of an activity (Csikszentmihalyi et al. 2016; Csikszentmihalyi and Larson 2014) and is strongly linked to feeling present and to increased task performance in IVEs (Weibel et al. 2008). VR development is also driven by the pursuit of simulation realism (Bowman and McMahan 2007). The conscious sense of presence is modeled by presenting bodily actions as possible actions in the IVE and suppressing incompatible sensory input (Schubert et al. 2001). However, a digital representation does not require perfect rendering to be perceived as physically accurate (Witmer and Singer 1998). Furthermore, objective and subjective realism do not always align when an audience experiences esthetic art practices.
In creative media practices, the connection between presence and visual esthetics is relatively unexplored and could be assessed from an immersive-arts perspective on realism as an art movement. The relationship between IVEs and esthetics may imply other consequences, as esthetics is associated with pleasure and positive emotions (Reber et al. 2004; Hekkert 2006). Therefore, immersive technologies that induce a feeling of presence may also induce satisfaction and positive affect. As such, presence measures can be effectively applied in user experience studies to evaluate different artistic virtual realities presented in IVEs, without relying on visual realism for immersion.
Using haptics in VR experiences can help increase feelings of perceived presence (Sallnäs 2010), and the effect of haptics on the presence of virtual objects has also been observed (Gall and Latoschik 2018). Moreover, multimodal IVEs, consisting of video, audio, and haptic feedback, have been shown to affect the expectations and satisfaction levels of professional and conventional users (García-Valle et al. 2017). Therefore, the design of a haptic experience can be evaluated from the perspectives of the audience, performer/composer, instrument designer, and manufacturer (Barbosa et al. 2015). The goal of each stakeholder is different, and their means of assessment vary accordingly.

2.1.1. Volumetric Video

Volumetric video (VV) is a media format representing 3D content captured and reconstructed from the real world with cameras and other sensors, and represented in formats commonly used in computer graphics (Smolic et al. 2022). VV enables the visualization of such content with full six degrees of freedom (6DoF). Over recent decades, VV has attracted interest from researchers in computer vision, computer graphics, multimedia, and related fields, often under other terms, such as free-viewpoint video (FVV) and 3D video. However, commercial application has been limited to a few special-effects and game-design cases. Recent years have seen significant interest in VV from research, industry, and media-streaming standardization. On the one hand, this reinvigoration is driven by the maturation of VV content-creation technology, which today reaches acceptable quality for various commercial applications. On the other hand, the current interest in extended reality (XR) also drives the importance of VV, because VV facilitates bringing real people into immersive XR experiences.
Traditionally, VV content creation starts with synchronized multiview video capture in a specifically designed studio. Figure 2 shows an affordable setup used in the V-SENSE lab in Dublin, which uses only 12 conventional cameras. Larger, more complex, and expensive studios can have up to a hundred cameras and additional depth sensors (Collet et al. 2015). The captured video and other data are typically passed to a dedicated 3D reconstruction process. Classical VV content-creation approaches mainly rely on structure-from-motion (SfM) or shape-from-silhouette (SfS). While SfM relies on feature detection and matching and initially yields a dynamic 3D point cloud, SfS carves out the volume occupied by the object of interest from its silhouettes in each view. Both approaches have their advantages and drawbacks. Pagés et al. (2018) presented a system that combines the benefits of both and addresses the creation of affordable capture setups.
Figure 2. Musical performance VV capture of New Pagans' Cahir O'Doherty (Left) and Lyndsey McDougall (Right) at the V-SENSE studio in Trinity College Dublin, Ireland.
Recently, powerful deep-learning approaches have been presented for 3D geometry processing and reconstruction (Valenzise et al. 2022). For instance, the first deep-learning VV reconstruction algorithms were able to recreate the 3D shape of an object from a known class, such as chairs, from a single 2D image. The 3D reconstruction of human faces from monocular images or video is another area that has received much attention. PIFu (Saito et al. 2019), a single-image 3D reconstruction method for human bodies, represents a milestone in this area. The resulting VV, a dynamic 3D graphics model, can be rendered and visualized from any viewpoint and viewing direction (6DoF). As such, it can be used as an asset in XR content and other media.
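To make the classical shape-from-silhouette idea above concrete, the following minimal sketch carves a visual hull on a voxel grid from binary silhouettes. It is a toy illustration, assuming synthetic orthographic cameras and a spherical subject; all names and values here are illustrative, not the calibrated multi-camera pipeline described above.

```python
import numpy as np

# Minimal shape-from-silhouette (visual hull) sketch: carve a voxel
# grid using binary silhouettes from a few known camera views.
# The cameras, grid resolution, and synthetic sphere scene are
# illustrative assumptions, not the studio setup described in the text.

RES = 64                                  # voxel grid resolution per axis
grid = np.ones((RES, RES, RES), bool)     # start with a fully occupied volume

# Voxel centres in a [-1, 1]^3 world cube
axis = np.linspace(-1.0, 1.0, RES)
X, Y, Z = np.meshgrid(axis, axis, axis, indexing="ij")
pts = np.stack([X, Y, Z], axis=-1).reshape(-1, 3)

def silhouette_of_sphere(uv):
    """Synthetic stand-in for a segmented camera image: a projected
    point is 'inside the silhouette' if it falls within a disc of
    radius 0.5 (the orthographic silhouette of a unit-diameter sphere)."""
    return np.linalg.norm(uv, axis=-1) <= 0.5

# Three orthographic "cameras" along the world axes; projection simply
# drops one coordinate. Real systems use calibrated camera matrices.
projections = [pts[:, [0, 1]], pts[:, [0, 2]], pts[:, [1, 2]]]

for uv in projections:
    inside = silhouette_of_sphere(uv).reshape(RES, RES, RES)
    grid &= inside                        # carve voxels outside any view

print(f"occupied voxels: {grid.sum()} of {RES**3}")
```

A real pipeline would replace the analytic silhouettes with segmented camera frames and the orthographic projections with calibrated camera models, carving one grid per video frame to obtain a dynamic 3D model.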

2.1.2. Spatial Sound

The success of a VR experience relies on effectively replacing real-world sensory feedback with a virtual representation (Slater and Sanchez-Vives 2016). Since sounds convey multiple types of information, such as emotional expression, localization, and environmental cues, auditory feedback is an essential component in the perception of an IVE. The purpose of auditory feedback in immersive media is to replace the existing sounds and the acoustic response of the environment with virtual ones (Schutze 2018). Presence, immersion, and interaction are essential for a successful VR experience. The more accurate or plausible the auditory representation, the higher the sense of presence, immersion, and place illusion felt by users (Avanzini 2022).
Spatial audio, often referred to as immersive audio, is any audio production technique that renders sounds with the perceptual properties necessary for them to be perceived as having a distinct direction and distance from the user (Begault 2000; Yang and Chan 2019). Sound localization lets us recognize a sound source's presence, distribution, and interaction (Letowski and Letowski 2012). It is defined as the collection of perceptual characteristics of audio signals that allow the auditory system to determine a sound source's specific distance and angular position using a combination of amplitude, monaural cues, inter-aural level differences (ILDs), and inter-aural time differences (ITDs) (Bates et al. 2019). Sound auralization is crucial for creating a plausible auditory scene and increasing the user's spatial perception and the VR environment's overall immersiveness. Utilizing a range of acoustic phenomena, such as early reflections and reverberation, produces a realistic auditory response and helps place audio sources in the virtual space (Geronazzo and Serafin 2022; Yang and Chan 2019).
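To illustrate how one of these localization cues arises, the sketch below estimates the ITD for a rigid spherical head using Woodworth's classic approximation, ITD(θ) = (r/c)(θ + sin θ). The head radius and speed of sound are assumed, typical values, not parameters taken from the cited works.

```python
import numpy as np

HEAD_RADIUS = 0.0875    # m, assumed average head radius
SPEED_OF_SOUND = 343.0  # m/s in air at roughly 20 °C

def itd_woodworth(azimuth_rad: float) -> float:
    """Interaural time difference for a rigid spherical head
    (Woodworth's model); azimuth in [0, pi/2] from straight ahead."""
    return (HEAD_RADIUS / SPEED_OF_SOUND) * (azimuth_rad + np.sin(azimuth_rad))

for deg in (0, 30, 60, 90):
    itd = itd_woodworth(np.radians(deg))
    print(f"azimuth {deg:3d} deg  ->  ITD = {itd * 1e6:6.1f} microseconds")
```

At 90° azimuth this yields roughly 650–660 µs, the commonly quoted maximum ITD for an average adult head, which is the order of magnitude a spatial audio renderer must reproduce for convincing localization.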

2.1.3. Haptics

The sense of touch in humans is often categorized as cutaneous, kinesthetic and proprioceptive, or haptic perception. Haptic perception is achieved by actively exploring surfaces and objects through the forces experienced during contact with mechanical stimuli, including pressure and vibration. In human physiology and psychology, haptic stimuli and their perception relate to the somatosensory system, which gathers the force and tactile information immediately affecting a person and thereby signals the presence of corresponding external stimulus sources. Contact with haptic stimuli is usually made via the skin, explicitly stimulating cutaneous receptors in the dermis, epidermis, and ligament tissue. Cutaneous receptors in the skin serve touch, while proprioceptors in the muscles serve kinesthetic and proprioceptive awareness. Cutaneous receptors include mechanoreceptors (pressure or distortion), nociceptors (pain), and thermoreceptors (temperature). Mechanoreceptors must be stimulated for a vibration to be felt as touch.
In physics, vibrations are a mechanical phenomenon whereby oscillations occur around an equilibrium point (Papetti and Saitis 2018). On the one hand, “sound” is a vibration that spreads as an “acoustic wave” via some medium and stimulates the auditory system. On the other hand, in haptics, vibration is perceived as a cutaneous stimulus, and this somatosensory information allows humans to explore their immediate world. Unlike auditory perception, direct physical contact is usually required; however, radiated sound can also stimulate the surface of the human body. Airborne vibrations, such as sound, can be perceived by the skin if they are of sufficient amplitude to displace the receptors under the skin, as is often experienced at live concerts.
When an acoustic or digital musical instrument produces a sound, that sound is created by some vibrating element of the instrument’s design or an amplified speaker. Therefore, haptics and music can be innately connected through multimodal vibration, where the biological systems of the somatosensory and auditory systems are engaged simultaneously. The combination of haptic and auditory stimuli can be multimodal and experienced by a performer and audience alike, creating new practices that can be mixed and analyzed in multiple contemporary use-case scenarios. The musician and the audience are reached by vibration through the air and solid media, for example, the floor or the seats of a concert space or stage. However, in the case of the audience, vibrotactile and audio stimuli are experienced passively, as no physical contact is made between the instrument and listener.
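As a concrete illustration of this audio–tactile coupling, the minimal sketch below band-limits a music signal to the low-frequency range where cutaneous mechanoreceptors are most sensitive and normalizes the result to drive a vibrotactile actuator. The band edges, filter order, and synthetic test signal are illustrative assumptions, not the processing used in the study.

```python
import numpy as np
from scipy import signal

# Hedged sketch of an audio-driven vibrotactile signal: band-limit the
# music to a range where the skin responds well to vibration (here an
# assumed 40-250 Hz band) and use the result to drive an actuator.

FS = 48_000  # audio sample rate, Hz

# One second of a synthetic "mix": a 110 Hz bass note plus a 2 kHz tone.
t = np.arange(FS) / FS
audio = 0.8 * np.sin(2 * np.pi * 110 * t) + 0.3 * np.sin(2 * np.pi * 2000 * t)

# 4th-order Butterworth band-pass confined to the vibrotactile band.
sos = signal.butter(4, [40, 250], btype="bandpass", fs=FS, output="sos")
tactile = signal.sosfiltfilt(sos, audio)   # zero-phase: keeps beat alignment

# Normalize for an actuator driven over a +-1.0 full-scale output.
tactile /= np.max(np.abs(tactile)) + 1e-12
print(f"tactile RMS: {np.sqrt(np.mean(tactile ** 2)):.3f}")
```

In practice, such a signal would be routed to actuators embedded in a wearable device or in furniture, while the full-band audio is delivered over headphones or loudspeakers, so that the somatosensory and auditory systems are engaged by the same musical events.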

2.1.4. VR Performance

The permeation of XR technologies into the hands of creative artists has provoked varied and innovative technological employments toward aesthetic ends (Young et al. 2023). The arrival of these technologies has been proposed by several theorists and critics (Bailenson 2018) as analogous to the advent of film technologies at the beginning of the 20th century, which (arguably) gave rise to the richest epoch of modern, avant-garde, inventive art in the 20th century. Even within the more focused subcategory of the performing arts, there are many creative techniques, styles, and strategies, as well as opinions and views on the most effective solutions for harnessing these technologies and captivating audiences. To date, VR (as a subsection of the totality of platforms offered on the spectrum of XR technologies) has enjoyed the most significant level of investigation by performing artists.
Even within the more focused purview of VR performance, several taxonomies still have to be negotiated, for example, live versus prerecorded material and the creative techniques employed. Within the scope of this manuscript, it is suitable to focus the discussion on VR performance content created using VV, yet even within this narrowed category, there are varying techniques: those that purely use computer vision (V-SENSE 2019; O'Dwyer et al. 2021) and those that include the use of depth-camera data (Wise and Neal 2020). Focusing specifically on offline VV content generated purely through the computer-vision techniques outlined above, it is essential to note that, in the context of the presented research, there is currently no possibility of generating a live (real-time) representation of a 3D character. Leaving aside consumer bandwidth, the postproduction processes are currently too slow and memory-intensive. However, as processing capabilities increase and algorithms and pipelines become more refined, the latency between capture and representation may, within the next few years, be reduced to less than a minute, which is comparable to the latency of straightforward video webcasting.

This entry is adapted from the peer-reviewed paper https://doi.org/10.3390/arts12040156.
