Multimodal Sentiment Analysis in Realistic Environments

In the real world, multimodal sentiment analysis (MSA) enables the capture and analysis of sentiments by fusing multimodal information, thereby enhancing the understanding of real-world environments. The key challenges lie in handling the noise in the acquired data and achieving effective multimodal fusion. To handle noise in the data, existing methods combine multimodal features to mitigate errors in sentiment-word recognition caused by the performance limitations of automatic speech recognition (ASR) models.

  • real world
  • multimodal sentiment analysis

1. Introduction 

In the research field of intelligent human–computer interaction, the ability to recognize, analyze, understand, and express emotions is essential for intelligent machines. Therefore, the utilization of computer technology to automatically recognize, understand, analyze, classify, and respond to emotion holds significant value for establishing a harmonious human–machine interaction environment, improving interaction efficiency, and enhancing user experience [1][2][3]. Previous studies [4][5] have primarily focused on sentiment analysis using textual data and have achieved remarkable accomplishments. However, compared to unimodal analysis, MSA can effectively leverage the coordinated and complementary information from different modalities to enhance emotional understanding and expression capabilities and to provide richer information that is more consistent with human behavior.
In recent years, there has been growing interest in multimodal data for sentiment analysis. MSA aims to utilize the information interaction between text, images, speech, etc., enabling machines to automatically use comprehensive multimodal emotional information to identify users’ sentiment tendencies. Early research often employed multimodal fusion [6] through early fusion, directly combining multiple sources of raw features [7][8], or late fusion, aggregating the decisions of multiple sentiment classifiers [9][10][11]. However, the former approach may produce a large number of redundant input vectors, increasing computational complexity, while the latter may struggle to capture the correlations between different modalities. Therefore, various methods have been proposed for feature fusion in multimodal sentiment analysis. Existing fusion methods include those based on simple operations [12][13], attention-based methods [14][15][16][17], tensor-based methods [18], translation-based methods [19], GAN-based methods [20], routing-based methods [21], and hierarchical fusion [22][23][24]. Although there is a wide range of fusion methods, attention-based fusion has shown superior efficiency and performance [25]. However, simply weighting and summing the features of each modality with an attention mechanism may not adapt effectively to the differences among the features of different modalities; consequently, certain modal features might be disregarded or underestimated, ultimately impacting the accuracy of the fused representation. Additionally, complex nonlinear interactions may exist between modalities, which the attention mechanism may struggle to model accurately, thereby impacting the effectiveness of feature fusion. Furthermore, previous methods have rarely considered the simultaneous utilization of the interaction information within a single modality and between modalities.
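To make the distinction between these fusion schemes concrete, the following PyTorch sketch contrasts early (feature-level) fusion, late (decision-level) fusion, and a simple attention-weighted fusion. The feature dimensions, the single-layer attention scorer, and the number of sentiment classes are illustrative assumptions, not the configuration of any cited model.

```python
import torch
import torch.nn as nn

class FusionSketch(nn.Module):
    """Illustrative contrast of three common fusion schemes for MSA."""

    def __init__(self, d_text=768, d_audio=74, d_video=35, d_model=128, n_classes=3):
        super().__init__()
        # Project each modality to a shared dimension (sizes are assumed).
        self.proj = nn.ModuleDict({
            "text": nn.Linear(d_text, d_model),
            "audio": nn.Linear(d_audio, d_model),
            "video": nn.Linear(d_video, d_model),
        })
        # Early fusion: one classifier over the concatenated raw features.
        self.early_clf = nn.Linear(d_text + d_audio + d_video, n_classes)
        # Late fusion: one classifier per modality; decisions are averaged.
        self.late_clf = nn.ModuleDict({
            "text": nn.Linear(d_text, n_classes),
            "audio": nn.Linear(d_audio, n_classes),
            "video": nn.Linear(d_video, n_classes),
        })
        # Attention-based fusion: weight projected modality vectors, then classify.
        self.attn_score = nn.Linear(d_model, 1)
        self.attn_clf = nn.Linear(d_model, n_classes)

    def forward(self, text, audio, video):
        feats = {"text": text, "audio": audio, "video": video}

        # (1) Early fusion: concatenation yields a large, possibly redundant input.
        early = self.early_clf(torch.cat([text, audio, video], dim=-1))

        # (2) Late fusion: averaging decisions misses cross-modal correlations.
        late = torch.stack([self.late_clf[m](feats[m]) for m in feats]).mean(dim=0)

        # (3) Attention fusion: softmax-weighted sum of projected modality vectors.
        proj = torch.stack([self.proj[m](feats[m]) for m in feats], dim=1)  # (B, 3, d_model)
        weights = torch.softmax(self.attn_score(proj), dim=1)               # (B, 3, 1)
        attn = self.attn_clf((weights * proj).sum(dim=1))

        return early, late, attn
```

The contrast also makes the drawbacks noted above visible: the early-fusion input grows with the sum of all feature dimensions, while the late-fusion branch never lets the modalities interact before the decision stage.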

2. Multimodal Sentiment Analysis in Realistic Environments

Sentiment computing, as an emerging interdisciplinary research field, has been widely studied and explored since its introduction in 1995 [26]. Previous research has primarily focused on unimodal data representation and multimodal fusion. In terms of unimodal data representation, Pang et al. [4] were the first to employ machine-learning-based methods for textual sentiment classification, using movie reviews as the dataset and achieving better results than traditional manual annotation. Yue et al. [5] proposed a hybrid model called Word2vec-BiLSTM-CNN, which leveraged the feature extraction capability of convolutional neural networks (CNNs) and the ability of bi-directional long short-term memory (Bi-LSTM) to capture short-term bidirectional dependencies in text. Their results demonstrated that hybrid network models outperformed single-structure neural networks in the context of short texts. Colombo et al. [27] segmented different regions in image and video data based on features such as color, warmth, position, and size, enabling their method to obtain higher semantic levels beyond the objects themselves; they applied this approach to the sentiment analysis of art-related images. Wang et al. [28] utilized neural networks to extract facial features from images. Bonifazi et al. [29] proposed a space–time framework that leveraged the emotional context inherent in a presented situation and employed it to extract the scope of emotional information concerning users’ sentiments on a given subject. However, unimodal sentiment analysis has inherent limitations, since humans express emotions through various means, including sound, content, facial expressions, and body language, all of which are employed collectively to convey emotions. Multimodal data describe objects from different perspectives, providing richer information than unimodal data, and different modalities can complement each other in terms of content.

In the context of multimodal fusion, previous research can be categorized into three stages: early feature fusion, mid-level model fusion, and late decision fusion. Wollmer et al. [30] and Rozgic et al. [31] integrated data from audio, visual, and text sources to extract emotions and sentiments. Metallinou et al. [32] and Eyben et al. [33] combined audio and text patterns for emotion recognition. These methods relied on early feature fusion, which mapped the modalities to the same embedding space through simple concatenation and lacked interaction between different modalities. In late-stage decision-fusion methods, internal representations were first learned within each modality, and the fusion between modalities was learned afterwards. Zadeh et al. [18] utilized tensor-fusion networks to calculate the outer product between unimodal representations, yielding tensor representations. Liu et al. [34] introduced a low-rank multimodal-fusion method to reduce the computational complexity of tensor-based approaches. These methods aimed to enhance efficiency by decomposing the weights of high-dimensional fusion tensors, reducing redundant information, yet they struggled to effectively model intermodal or modality-specific dynamics.
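The outer-product idea behind the tensor-fusion network of Zadeh et al. [18] can be summarized compactly: each unimodal vector is augmented with a constant 1, so that the fused tensor retains unimodal and bimodal interaction terms alongside the trimodal ones. The sketch below, with assumed dimensions and an assumed linear classifier head, only illustrates that construction.

```python
import torch
import torch.nn as nn

def tensor_fusion(h_t, h_a, h_v):
    """Outer-product fusion in the spirit of tensor-fusion networks.

    Appending a constant 1 to each modality vector makes the resulting
    tensor contain unimodal, bimodal, and trimodal interaction terms.
    With assumed shapes (B, dt), (B, da), (B, dv), the output has shape
    (B, (dt + 1) * (da + 1) * (dv + 1)).
    """
    ones = lambda h: torch.cat([h, h.new_ones(h.size(0), 1)], dim=-1)
    zt, za, zv = ones(h_t), ones(h_a), ones(h_v)
    # Batched outer product over the three augmented vectors.
    fusion = torch.einsum("bi,bj,bk->bijk", zt, za, zv)
    return fusion.flatten(start_dim=1)

# Example with small, purely illustrative dimensions.
B, dt, da, dv = 4, 32, 16, 16
h_t, h_a, h_v = torch.randn(B, dt), torch.randn(B, da), torch.randn(B, dv)
fused = tensor_fusion(h_t, h_a, h_v)        # (4, 33 * 17 * 17)
classifier = nn.Linear(fused.size(-1), 3)   # sentiment classes assumed
logits = classifier(fused)
```

The low-rank variant of Liu et al. [34] avoids materializing this high-dimensional tensor by factorizing the fusion weights; the explicit tensor is kept above purely for clarity.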
Intermediate model fusion amalgamates the advantages of both early feature fusion and late-stage decision fusion, facilitating the selection of fusion points and enabling multimodal interaction. Poria et al. [35] further extended the combination of convolutional neural networks (CNNs) and multiple kernel learning (MKL). In contrast to Ghosal et al. [36], Poria et al. utilized a novel fusion method to effectively enhance the fused features. Zhang et al. [37] introduced a quantum-inspired framework for the sentiment analysis of bimodal data (texts and images) to address semantic gaps and model the correlation between the two modalities using density matrices. However, these methods exhibited limited adaptability to feature differences and suffered from significant feature redundancy. Concerning hierarchical fusion, Majumder et al. [22] employed a hierarchical-fusion strategy, initially combining two modalities and subsequently integrating all three. However, this approach struggled to adequately capture intramodal dynamics. Georgiou et al. [23] introduced a deep hierarchical-fusion framework and applied it to sentiment-analysis problems involving audio and text modalities. Yan et al. [24] introduced a hierarchical attention-fusion network for geographical localization. Nevertheless, these methods overlooked the potential existence of complex nonlinear interactions between modalities.
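As a rough illustration of the two-stage strategy described above (fuse modalities pairwise first, then combine the bimodal results), the following sketch uses plain concatenation-plus-projection layers. It is an assumed minimal structure, not the architecture of Majumder et al. [22] or Georgiou et al. [23].

```python
import torch
import torch.nn as nn

class HierarchicalFusion(nn.Module):
    """Two-stage fusion: pairwise bimodal fusion first, then trimodal fusion."""

    def __init__(self, d=128, n_classes=3):
        super().__init__()
        # Stage 1: fuse each pair of modalities (equal dimensions assumed).
        self.f_ta = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU())
        self.f_tv = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU())
        self.f_av = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU())
        # Stage 2: fuse the three bimodal representations and classify.
        self.f_all = nn.Sequential(nn.Linear(3 * d, d), nn.ReLU(), nn.Linear(d, n_classes))

    def forward(self, h_t, h_a, h_v):
        z_ta = self.f_ta(torch.cat([h_t, h_a], dim=-1))
        z_tv = self.f_tv(torch.cat([h_t, h_v], dim=-1))
        z_av = self.f_av(torch.cat([h_a, h_v], dim=-1))
        return self.f_all(torch.cat([z_ta, z_tv, z_av], dim=-1))
```

The cited hierarchical methods employ richer fusion operators (e.g., attention) at each stage; the layers here only make the two-stage flow of information explicit.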
Moreover, many fusion approaches have seldom considered simultaneously harnessing intramodal and intermodal interactions. Verma et al. [38] emphasized that each modality possesses unique intramodality features and that multimodal sentiment analysis methods should capture both common intermodality information and distinctive intramodality signals. In addition to considering fusion strategies in the MSA model, it is crucial to address the noise present in modal data. Pham et al. [39] proposed the MCTN model to handle the potential absence of visual and acoustic data. Liang et al. [40] and Mittal et al. [41] also focused on addressing noise introduced by visual and acoustic data, relying on word-level features that were obtained by aligning the audio with the actual text. Xue et al. [42] introduced a multi-level attention-graph network to reduce noise within and between modalities. Cauteruccio et al. [43] introduced a string-comparison metric that could be employed to enhance the processing of heterogeneous audio samples, mitigating modality-related noise. However, these models did not investigate the impact of ASR errors on the MSA model. Notably, Wu et al. [44] utilized a sentiment-word position-detection module to determine the most likely positions of sentiment words in text. They dynamically refined sentiment-word embeddings using a multimodal sentiment-word-refinement module and incorporated the improved embeddings as the textual input for the multimodal feature-fusion module. This approach reduced the influence of ASR errors on the MSA model. The sentiment-word position-detection and multimodal sentiment-word-refinement modules have proven highly effective, achieving state-of-the-art performance on real-world datasets. However, the original SWRM simply concatenated the modalities during feature fusion, without capturing intramodal and intermodal features even when genuine correlations existed.