1. Digital Human Reconstruction
How to create digital humans has been a much-studied subject recently, due to the rising demand for virtual reality applications, including the Metaverse. One of the core drivers of mathematical progress is the discovery of objects, patterns and, ultimately, their formulaic representations; in the course of such progress, scientists often need to leverage a variety of tools and data to help them cultivate ideas, propose a conjecture, and eventually prove or disprove it with experiments and evidence, where possible. The evolution of computational methodology has not only changed the way scientists conduct their studies, but has also accelerated the life cycle of scientific research, with profound impacts on people’s daily lives. Examples include the early hand-calculated prime number tables used by Gauss (which led to the prime number theorem) [1], the RSA public key algorithm [2] inspired by prime number theory, and our modern blockchain infrastructure.
The introduction of computational methodology has given scientists an understanding of problems previously incomprehensible; however, while previous computational methodologies have proven effective in certain scientific problems or domains, they are not easily generalized to other domains. Big data technologies, especially the field of deep learning that has emerged in recent years, offer a range of techniques capable of effectively detecting patterns in data, and are increasingly proving their utility in scientific disciplines. A specific case of virtual human reconstruction in the Metaverse will serve as an example, to illustrate how deep learning can be used to solve mathematical problems in practical settings.
Virtual human reconstruction is one of the essential tasks in various Metaverse applications: it aims to utilize sensory data to recover the three-dimensional geometry and appearance of humans, achieving accurate photorealistic reconstructions, and ultimately producing compact 3D representations that can be ported to a variety of devices. This problem involves many practical facets that require sophisticated engineering; however, its core challenges lie in deep learning modeling and mathematical optimization, as shown in Figure 1.
Figure 1. A hybrid approach of regression-based and optimization-based paradigms (courtesy of Kolotouros et al. [3]): an iterative optimization routine is embedded into a neural network training loop, leading to a self-improving loop. Better fits help the network train better, while better initial estimates from the network help the optimization routine converge to better fits.
Various techniques have been applied to recreate human models in the Metaverse. Many studies start from simple image-based 2D feature detection, such as key points [4], silhouettes [5] and limb segments [6]. Simple movements can be represented relatively clearly by two-dimensional features; however, it is becoming clear that complex human behaviors, which often occur in practical settings, do not fit the simple assumptions imposed by two-dimensional models, and that more descriptive models with finer granularity are desirable. Consequently, more studies [7][8][9] have turned to exploring more complex human pose modeling in three dimensions. Recently, researchers have noticed that body shapes, contacts, gestures and expressions which directly interact with the world are much easier to measure and evaluate; consequently, the focus of researchers has shifted towards three-dimensional mesh recovery of the human body [10][11]. Human body modeling has since been extended with face and hand support [12][13][14][15]. Meanwhile, similar techniques have also facilitated downstream tasks, such as clothed human reconstruction [16][17][18], volume rendering [19], virtual try-on [20], computer-assistant systems [21] and many more Metaverse applications. There are two common paradigms for dealing with virtual human reconstruction: the optimization-based paradigm and the regression-based paradigm.
Although these two paradigms may have different advantages/disadvantages, and address different aspects, both paradigms can share similar human body modeling techniques. Figure 2 shows an interesting possible way of integrating both paradigms into one coherent framework. The next section will review the existing approaches, in terms of human body modeling.
Figure 2. A virtual reality shop developed by Unity3D for future integration into the Metaverse.
2. Review of Human Body Modeling
Early human body modeling started with the study of articulated geometric primitives, including line segments [22], cylinders [23], planar rectangles [24] and ellipsoids [25]. As three-dimensional full-body scanners became accessible, more detailed measurements of body surfaces could be accurately recorded, as in the CAESAR (Civilian American and European Surface Anthropometry Resource) dataset [26]. The availability of large amounts of body scan data has given rise to a powerful representation: the statistical body model, which factors body deformations into identity-dependent and pose-dependent components. Among the statistical body models, SCAPE [27], SMPL [28], SMPL-X [13], SMPL+H [29], 3DMM [30] and STAR [31] are popular ones, which are not only capable of effectively modeling both shape and pose deformations, but are also highly compatible with existing graphics rendering engines, benefiting from the explicit mesh model. This family of explicit approaches first learns shape deformations through principal component analysis of body scans, and then combines them with skeletal pose-driven deformations (so-called linear blend skinning in traditional skeletal animation), to construct a shape-and-pose parametric human body model. Despite the popularity of explicit approaches, they still have their limitations: firstly, global blend shapes may capture spurious long-range correlations [31], resulting in non-local deformation artifacts; secondly, correlations between body shape and pose-dependent shape deformation may be ignored; furthermore, due to the linear nature of principal component analysis, it can be difficult to reproduce the highly nonlinear deformations of body soft tissue.
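The two-stage construction described above (a linear shape basis followed by linear blend skinning) can be illustrated with a minimal 2D sketch. Everything in it is a toy assumption made for illustration: the three-vertex "arm", the single shape direction, the skinning weights, and the function names `blend_shape` and `linear_blend_skinning` are hypothetical and do not reproduce SMPL or any other published model. In particular, each bone here rotates independently about a fixed pivot, whereas real models compose transforms along a kinematic chain and re-estimate joint locations from the shaped mesh.

```python
import math

def blend_shape(template, shape_basis, betas):
    """Add a linear combination of shape directions (PCA-style) to the template."""
    shaped = []
    for i, (x, y) in enumerate(template):
        dx = sum(b * basis[i][0] for b, basis in zip(betas, shape_basis))
        dy = sum(b * basis[i][1] for b, basis in zip(betas, shape_basis))
        shaped.append((x + dx, y + dy))
    return shaped

def rotate_about(point, pivot, angle):
    """Rotate a 2D point about a pivot by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    px, py = point[0] - pivot[0], point[1] - pivot[1]
    return (pivot[0] + c * px - s * py, pivot[1] + s * px + c * py)

def linear_blend_skinning(vertices, weights, pivots, angles):
    """Pose each vertex as the weighted average of its per-bone rigid transforms."""
    posed = []
    for v, w in zip(vertices, weights):
        x = y = 0.0
        for wk, pivot, angle in zip(w, pivots, angles):
            rx, ry = rotate_about(v, pivot, angle)
            x += wk * rx
            y += wk * ry
        posed.append((x, y))
    return posed

# Toy "arm": three vertices on the x-axis, two bones (upper arm, forearm).
template = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
shape_basis = [[(0.0, 0.0), (0.1, 0.0), (0.2, 0.0)]]  # one direction: arm length
weights = [(1.0, 0.0), (0.5, 0.5), (0.0, 1.0)]        # per-vertex bone weights
pivots = [(0.0, 0.0), (1.0, 0.0)]                     # shoulder, elbow
angles = [0.0, math.pi / 2]                           # bend the elbow by 90 degrees

shaped = blend_shape(template, shape_basis, betas=[0.0])
posed = linear_blend_skinning(shaped, weights, pivots, angles)
```

With the shape coefficient set to zero, the elbow vertex stays in place and the hand vertex swings up to roughly (1, 1). Because both stages are linear in the vertices, the highly nonlinear soft-tissue deformations mentioned above cannot be reproduced by this construction.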
In order to overcome the limitations of explicit approaches, instead of explicitly defining the human body as mesh vertices and edges or other elements, implicit approaches define surfaces as level sets of continuous functions. Because it is continuous, this implicit representation has a better chance of being elegantly optimized and integrated with deep learning frameworks: it is continuous across the spatial domain, and thus theoretically has infinite resolution, and it can easily handle highly nonlinear deformations, and even topological changes, which are not possible with explicit approaches. Studies [32][33] estimated implicit surface functions, by aligning image pixels with the global three-dimensional shape or texture of the photographed object, and then using a dedicated multi-level network to refine the resulting geometry. The flexibility of implicit approaches enables them to handle intricate surfaces and topological changes with ease, but there is one drawback: topologically distinct human representations can exist across time; in other words, implicit human representations may not be topologically consistent in time.
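The idea of a surface as a level set of a continuous function can be made concrete with a small sketch. It uses an analytic signed distance function for a sphere as a stand-in for a learned implicit function; the function names and the `union` helper are illustrative assumptions and are not taken from [32][33].

```python
import math

def sphere_sdf(point, center=(0.0, 0.0, 0.0), radius=1.0):
    """Signed distance to a sphere: negative inside, zero on the surface."""
    return math.dist(point, center) - radius

def classify(point, sdf, tol=1e-9):
    """Locate a query point relative to the zero level set of `sdf`."""
    d = sdf(point)
    if abs(d) <= tol:
        return "on surface"
    return "inside" if d < 0 else "outside"

def union(sdf_a, sdf_b):
    """Union of two implicit shapes: the pointwise minimum of their functions."""
    return lambda p: min(sdf_a(p), sdf_b(p))

# Two spheres whose centers are 1.5 apart overlap and merge into one surface;
# no mesh connectivity has to be rebuilt when the topology changes.
left = lambda p: sphere_sdf(p, center=(-0.75, 0.0, 0.0))
right = lambda p: sphere_sdf(p, center=(0.75, 0.0, 0.0))
merged = union(left, right)
```

The function can be queried at any point in space, which is the sense in which the representation has no fixed resolution. The same flexibility explains the drawback noted above: nothing in the function itself ties the surface at one time step to the surface at the next.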
3. Optimization-Based Paradigm
In this paradigm, the human body model is explicitly optimized, by minimizing an objective function that fits the model to the observations in an iterative manner. The objective function typically consists of two parts: (1) a data term, which measures the alignment between the extracted observation features and the transformed human body features; and (2) a regularization term, which constrains the optimization so that it converges to a physically plausible body model. In earlier work, the silhouette feature played a crucial role in fitting the body model to the image, as it was used to penalize pixels in non-overlapping regions [34][35].
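The two-part objective can be sketched on a deliberately simple problem. The example below assumes a toy "body model" with only three parameters (a scale and a 2D translation applied to a fixed key point template), a squared-distance data term, and a quadratic prior that pulls the scale towards 1. The names `objective`, `gradient` and `fit`, the weight `lam`, and the step size are hypothetical choices for illustration. Real fitting problems of this kind are non-convex and far higher-dimensional, whereas this toy problem is convex.

```python
def objective(params, template, observed, lam):
    """Data term (key point alignment) plus a regularizer on the scale."""
    s, tx, ty = params
    data = sum((s * px + tx - ox) ** 2 + (s * py + ty - oy) ** 2
               for (px, py), (ox, oy) in zip(template, observed))
    reg = lam * (s - 1.0) ** 2
    return data + reg

def gradient(params, template, observed, lam):
    """Analytic gradient of `objective` with respect to (s, tx, ty)."""
    s, tx, ty = params
    gs = gtx = gty = 0.0
    for (px, py), (ox, oy) in zip(template, observed):
        rx = s * px + tx - ox
        ry = s * py + ty - oy
        gs += 2.0 * (rx * px + ry * py)
        gtx += 2.0 * rx
        gty += 2.0 * ry
    gs += 2.0 * lam * (s - 1.0)
    return gs, gtx, gty

def fit(template, observed, lam=0.1, lr=0.05, steps=500):
    """Plain gradient descent from the neutral initialization (1, 0, 0)."""
    params = (1.0, 0.0, 0.0)
    for _ in range(steps):
        grad = gradient(params, template, observed, lam)
        params = tuple(p - lr * g for p, g in zip(params, grad))
    return params

# Observations are the template scaled by 2 and shifted by (3, -1).
template = [(-1.0, 0.0), (1.0, 0.0), (0.0, 1.0), (0.0, -1.0)]
observed = [(1.0, -1.0), (5.0, -1.0), (3.0, 1.0), (3.0, -3.0)]
scale, tx, ty = fit(template, observed)
```

The recovered translation matches the true shift, while the recovered scale settles slightly below 2 because the regularizer trades a small amount of data fit for staying closer to the prior.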
With the emergence of deep learning, many studies have utilized it to improve the initial conditions of the optimization. SMPLify [10] adopts off-the-shelf neural networks [36] to detect two-dimensional key points, and then iteratively fits a SMPL model to the key points detected in an unconstrained image. While SMPLify produces relatively well-aligned results, sparse key points do not offer sufficient constraints for body shape optimization. To improve geometric details, [37][38][39] combined key points, silhouettes and part segments, to further constrain the optimization process. Moreover, [40][41] have shown that deep learning techniques can learn local landscapes and descent directions of optimization from training data, and then use them to guide the gradient-based optimization process: in this way, traditional problem-independent optimization schemes can be endowed with the ability to adaptively learn problem-specific convergence schemes. Image-based key point regression was performed by [42][43] to obtain three-dimensional body key points; inverse kinematics was then solved, based on the key points and the skeletal structure, to calculate accurate joint rotations and ultimately estimate the parameters of a SMPL model.
Although the optimization-based paradigm can faithfully reconstruct the human body when high quality data is available, it performs poorly in situations where data is scarce and useful information is latent; furthermore, as the optimization-based paradigm intrinsically tries to solve complex non-convex optimization problems in high-dimensional spaces, its outcomes are susceptible to initialization and prone to falling into spurious local minima.
4. Regression-Based Paradigm
Alternatively, the regression-based paradigm exploits the powerful learning and approximation capabilities of neural networks, to recover model parameters directly from sensory data. To achieve better performance, researchers have explored a wide variety of network architectures and regression objectives. For example, [12] was one of the pioneering efforts to incorporate the SMPL model into an end-to-end network architecture that minimized the reprojection errors between manually annotated and estimated key points. An end-to-end adversarial learning framework was proposed by [11], which used a discriminator to supervise the training process, so as to exclude anthropometrically implausible or self-intersecting body structures. A top-down framework was proposed by [44], to simultaneously regress SMPL parameters of multiple people in a coherent manner, where depth ordering was consistent, and no interpenetration occurred among reconstructed people. Instead of regressing the SMPL parameters, [45] opted to directly regress the mesh vertices using a graph convolutional network, thus allowing the template mesh structure to be explicitly encoded within the network, easily exploiting the mesh spatial locality. Inspired by [11], VIBE [46] went a step further, to estimate dynamic motion sequences from videos. By replacing the regression network with a temporal generative network, and changing the three-dimensional supervision dataset to a motion capture dataset, AMASS [47], VIBE empowered an adversarial learning framework with temporal information, enabling motion sequence estimation as a whole.
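The reprojection error mentioned above, between annotated and estimated key points, can be sketched as a loss function. The version below assumes a weak-perspective camera (one scale and a 2D translation) and a per-key-point visibility flag; these choices, and the names `weak_perspective` and `reprojection_loss`, are illustrative assumptions and do not describe the exact formulation used in [12] or [11].

```python
def weak_perspective(points3d, scale, trans):
    """Project 3D key points by dropping depth, then scaling and shifting."""
    return [(scale * x + trans[0], scale * y + trans[1]) for x, y, _z in points3d]

def reprojection_loss(pred3d, annotated2d, visibility, scale, trans):
    """Mean squared 2D distance over the key points marked as visible."""
    projected = weak_perspective(pred3d, scale, trans)
    total, count = 0.0, 0
    for (u, v), (au, av), vis in zip(projected, annotated2d, visibility):
        if vis:
            total += (u - au) ** 2 + (v - av) ** 2
            count += 1
    return total / max(count, 1)

# Three predicted 3D key points; the third annotation is marked as not visible.
pred3d = [(0.0, 0.0, 5.0), (1.0, 0.0, 5.0), (0.0, 2.0, 5.0)]
annotated2d = [(1.0, 1.0), (3.0, 2.0), (9.0, 9.0)]
visibility = [1, 1, 0]
loss = reprojection_loss(pred3d, annotated2d, visibility, scale=2.0, trans=(1.0, 1.0))
```

In a regression network, a loss of this kind would be differentiated with respect to the predicted body and camera parameters. Since it only constrains the 2D projection, additional supervision such as the discriminator described above is needed to rule out implausible bodies.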
To leverage expressive human models and paired data, [14][48][49] adopted a divide-and-conquer strategy, by breaking down the human reconstruction problem into part-specific estimation subproblems, where body, hand and face estimates were performed using the respective part-specific models. The final expressive model was obtained by assembling the individual results of the subproblems into the corresponding body template layers. ExPose [14] directly regressed hands, face and body parameters in the SMPL-X format, and utilized body-driven attention to localize the face and hands regions for refinement, using part-specific knowledge learned from existing face- and hand-only datasets. A real-time method was introduced by [50], to capture body, hands and face with competitive accuracy, by exploiting correlations between body and hands. Pose2Pose [51] extracted joint-specific local and global features, to train a graph convolutional neural network, and regress body/hand joint rotations from it. PIXIE [48] first fused the features from body, face and hand experts, according to their part-specific confidences, and then fed these features into the part-specific networks, for robust regression.
5. Technologies in AR/VR/XR Platforms and the Metaverse: Future Trends
In the researchers' opinion, AR/VR/XR applications will undoubtedly, in the near future, become the ultimate customer service platforms. In other words, AR/VR/XR applications will at least become the dominant platforms, if they do not completely wipe out the current mobile and computer platforms. Consequently, a big data surge will very soon occur in the virtual world. The Metaverse is likely to be the front platform to face the data surge challenge, due to its rapid growth in recent years. The following figure shows a recently developed VR-based shopping platform.
The researchers observed that two extreme situations would occur in the Metaverse, while conducting user recommendation and data analysis: (1) The cold start problem. This situation occurs when too little data is available for analysis, because the VR platform is new to users and little information has yet been generated and accumulated; it is common in the big data environment whenever a new platform is released. (2) The virtual data explosion problem. This situation occurs when the Metaverse or VR platforms generate too much data, including user interaction data, wearable sensor data, eye tracking data, location trajectory data, brain EEG data, and business transaction data.
Figure 3 shows the data sources of the Metaverse and its architecture [52], which indicates that the Metaverse consists of various data sources from physical, social and digital worlds.
Figure 3. Metaverse architecture of integrated social, physical and digital worlds, modified based on [52]. The social world mainly consists of human communities.
Several methods have been suggested for solving the abovementioned problems. In [53], a position-based VR online shopping recommendation system was developed, to solve the cold start problem in VR platforms. In such a system, the cold start problem is tackled by analyzing new users’ interaction and behaviors within the virtual world. For instance, the position-based VR online shopping system acquires new users’ trajectories in the virtual world, and conducts analysis based on their movements, to generate user recommendations, as shown in Figure 4.
Figure 4. Position-based analysis for VR shopping recommendation (green line is user trajectory).
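One simple way to turn a trajectory into a recommendation signal is to count how long a new user lingers in each area of the virtual shop. The sketch below is only an illustration of that idea and is not the method of [53]: the rectangular zones, their names, and the assumption that the trajectory is sampled at a fixed rate (so that sample counts stand in for dwell time) are all hypothetical.

```python
from collections import Counter

def zone_of(position, zones):
    """Return the name of the rectangular zone containing a 2D position."""
    x, y = position
    for name, (xmin, ymin, xmax, ymax) in zones.items():
        if xmin <= x < xmax and ymin <= y < ymax:
            return name
    return None  # position lies outside every zone

def rank_zones_by_dwell(trajectory, zones):
    """Rank zones by the number of trajectory samples that fall inside them."""
    dwell = Counter()
    for position in trajectory:
        zone = zone_of(position, zones)
        if zone is not None:
            dwell[zone] += 1
    return [name for name, _count in dwell.most_common()]

# Hypothetical shop layout: two product areas side by side.
zones = {"shoes": (0, 0, 5, 5), "bags": (5, 0, 10, 5)}
trajectory = [(1, 1), (2, 1), (2, 2), (6, 1), (12, 1)]
ranking = rank_zones_by_dwell(trajectory, zones)
```

Because the ranking needs only a few seconds of movement data, a signal of this kind is available before any purchase or rating history exists, which is what makes position data attractive for the cold start case.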
Future trends in solving the cold start problem in the Metaverse will further utilize users’ behavior and sentiment data, including user eye tracking data, user movement trajectory, wearable user device data, and user sentiment data. In particular, human brain data analysis will likely become an essential technology for user analysis in VR platforms, such as the Metaverse.
The cold start problem is not a persistent problem in VR platforms, as it resolves itself once data accumulation reaches a certain quantity, whereas the virtual data explosion problem is a persistent challenge to VR platforms like the Metaverse. The volume of data from the wide range of sources in the Metaverse will grow exponentially, owing to its digital nature. Some research studies have suggested adopting the Data as a Service (DaaS) framework [54] as the solution to the data explosion problem in the digital world, including the Metaverse. Several other solutions, including tensor networks and sentiment analysis, have also been proposed to solve this problem. The future trends of technical development in the Metaverse and other VR platforms can be summarized as follows:
- Digital human reconstruction is becoming a crucial area for the Metaverse and other VR platforms: this is a core technology that can accelerate the development of the Metaverse, so as to truly realize human–machine interaction in virtual worlds, as mentioned in the previous sections;
- Digital Twin-related methods are the foundation for creating digital worlds that can mimic the physical world. The digital twin is defined as the effortless integration of data between a physical and virtual environment, in either direction [167]. VR-developing tools, such as Unreal Engine, Unity, 3DS Max & Maya, SketchUp, etc., will be the major developer’s toolkits for digital twin models in the coming decades. The future trends in digital twin will focus on the following: enabling a conformance relationship between digital twin and the real world; digital world autonomy, runtime self-adaptation and self-management; and integration and cooperation, to achieve common goals or provide services [168]. A number of digital twin applications have been developed, based on Microsoft Kinect sensors and the Oculus VR headset.
- Brain–Computer Interface (BCI) technology will become a very important area for the Metaverse and for VR platforms. Previous research indicates that non-invasive BCI technology has been applied extensively in various areas in recent years, because of its minimal potential risks and time precision [55]. Figure 5 shows the high-performance EEG BCI method (left), and EEG BCI experiments (right) [55][56].
Figure 5. Segmented EEG time window (left), source: [55]; EEG experiment (right), source: [56].
The NDA/PDA-based methods are adopted, to enhance EEG data analytical efficiency, in order to accommodate the real-time interaction in the Metaverse and VR platforms [74]. The definition for the NDA method is as follows: if S [a, b] ⊆ A [1, k], if x∈[a, b] satisfies:
where mr is the adjusting parameter, and S [a, b] is an NDA set. The ND-based method derives the data values using ksdensity function, to generate a probability distribution [56]. The definition for the PDA method is as follows: the PDA model takes one of the calculated σ and λ values as λ × t, as indicated in the following equations, 11 and 12. Assuming the original data set has σ, then Mean (λ) is the event rate. If Mean (λ) − λ = ∆, then λ × t is lying between Mean (λ) and λ. With |y − λ × t| = a, a1/2+a = ∆ is satisfied.
where N(t) is the sample data in the t time window. The Gamma function is utilized in the PDA method for processing complex numbers, which is expressed in (5) below [57]:
The ∆ parameter is used to regulate the size of the sample data sets, to get the nearest λ and σ values. The ∆ parameter in the PDA plays the same role that it plays in the NDA method. The PDA model employs a PDA benchmark point selection method [55][56][57].
- Blockchain technology is an efficient and secure solution for digital worlds, such as the Metaverse. In the blockchain model, a new transaction can be verified and added to existing records, i.e., blocks, through linking the new transaction to previous ones, by cryptographic hash operation [58]. Each block contains a cryptographic hash of the previous block, a timestamp, and transaction data [59]. The main characteristics of blockchain technology are that it is secure, decentralized, digitized, collaborative and immutable: these characteristics make blockchain technology a perfect solution for digital virtual worlds, such as the Metaverse. Currently, the most successful security technology for blockchain employs the Public Key Infrastructure (PKI)-based blockchain methods [60]. Researchers in the field have started to search for more efficient solutions. The future trends in blockchain technology development in the Metaverse intend to focus on more autonomous, intelligent and scalable models, such as intelligence-agent-based blockchain [61], Self-Sovereign Identity (SSI) blockchain [62], non-fungible tokens (NFTs) [63] and bio-identity-based blockchain.
- Artificial intelligence (AI) is a discipline essential to almost all areas in our modern world, particularly for future virtual worlds such as the Metaverse. AI can accelerate analytical efficiency, enhance security and privacy, improve interoperability, and provide better solutions for human–machine interaction and collaboration. The increase in applications of Natural Language Processing (NLP), sentiment analysis and brain informatics technologies to digital worlds is stimulating the development of AI in these areas. The successful stories of AI implementation in image recognition, voice recognition, human–machine interaction and intuition, reveal the promising future of AI in the Metaverse and other virtual worlds. A recent survey showed that a majority of studies had focused on exploring efficient integration and collaboration between Edge AI architecture and the Metaverse [64].
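The block structure described in the blockchain item above (each block holding a cryptographic hash of the previous block, a timestamp, and transaction data) can be sketched with the Python standard library. This is a minimal illustration of hash linking only: the field names and the helpers `block_hash`, `add_block` and `is_valid` are assumptions, and the sketch omits consensus, signatures and networking, all of which a real blockchain requires.

```python
import hashlib
import json
import time

def block_hash(block):
    """SHA-256 digest of a block's canonical JSON serialization."""
    payload = json.dumps(block, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def add_block(chain, transactions, timestamp=None):
    """Append a block that stores the hash of the block before it."""
    prev_hash = block_hash(chain[-1]) if chain else "0" * 64
    block = {
        "index": len(chain),
        "timestamp": time.time() if timestamp is None else timestamp,
        "transactions": transactions,
        "prev_hash": prev_hash,
    }
    chain.append(block)
    return block

def is_valid(chain):
    """Check that every block still points at the hash of its predecessor."""
    return all(chain[i]["prev_hash"] == block_hash(chain[i - 1])
               for i in range(1, len(chain)))

chain = []
add_block(chain, [{"from": "alice", "to": "bob", "amount": 5}], timestamp=1.0)
add_block(chain, [{"from": "bob", "to": "carol", "amount": 2}], timestamp=2.0)
```

Changing any field of an earlier block changes its hash, so the `prev_hash` stored in the next block no longer matches and `is_valid` returns `False`. This is the mechanism behind the immutability property listed above.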
Figure 6 shows how the Metaverse and its related technologies, which include big data, have evolved and developed [64].
Figure 6. A chronicle of the Metaverse and its related techniques, modified based on [64].
Data sources in the Metaverse and other virtual platforms are growing exponentially; therefore, big data technologies are crucial for the Metaverse, if it is to efficiently manage its digital world, and provide users with real-time analytical services. Big data technologies are fundamental tools for making virtual platforms, such as the Metaverse, feasible for users. In other words, big data is a fundamental component of the Metaverse, and the Metaverse in turn accelerates the development of big data technologies. Big data is not only crucial in the virtual world, however: it is also an important component of our real physical world, as evidenced in various areas. Figure 7 shows the relationship between big data and the Metaverse.
Figure 7. Big data is a key component in both the physical world and virtual worlds. The Metaverse is a virtual world parallel to the real physical world: the two are sometimes connected by augmented reality and digital twin.
The current definitions of the Metaverse vary according to different studies; however, many researchers share a common view that the Metaverse is imitating our physical world. In this work, the researchers believe that future virtual worlds, including the Metaverse, will develop into worlds totally different from our physical world: these virtual worlds will go beyond our current social structure and civil life. Table 1 shows the example applications of the Metaverse and big data in several key sectors.
Table 1. A brief review of example applications of big data and the Metaverse in major sectors.
| Sectors | Big Data | Metaverse |
| --- | --- | --- |
| Healthcare | | |
| Finance and Economy | | |
| Education | Learning performance analysis and customization [78]; education data warehouse, BD curriculum, etc. [79]. | |
| Entertainment and Social | User behavior and opinion analysis, social trends [81]; game data monitoring, sentiment analysis [82]. | Metaverse games (Roblox, Sandbox) [83]; virtual social (Meta, Altspace VR) [84]. |
6. Conclusion and Discussion
The Metaverse and other virtual platforms have grown rapidly in recent years. PwC predicts that VR and AR platforms will boost global GDP by USD 1.5 trillion by 2030 [85]. To date, applications of the Metaverse have included online shopping, virtual social media, video games, virtual tours, and online museums and arts [86][87][88]. Many large technology companies have announced plans to launch their Metaverse products, such as Facebook Horizon, Nvidia Omniverse, and Amazon Metaverse. The future trends in technical development in the Metaverse and other VR platforms can be grouped into five main areas: digital human, digital twin, brain–computer interface (BCI), blockchain and artificial intelligence. Notably, brain–computer interface technologies have become increasingly important to Metaverse development in recent years, as immersive interactions provided by BCI can enhance user experience [89][90][91][92][93].
This entry is adapted from the peer-reviewed paper 10.3390/math11010096