Multimodal Sentiment Analysis in Realistic Environments

In the real world, multimodal sentiment analysis (MSA) enables the capture and analysis of sentiments by fusing multimodal information, thereby enhancing the understanding of real-world environments. The key challenges lie in handling the noise in the acquired data and achieving effective multimodal fusion. When processing the noise in data, existing methods utilize the combination of multimodal features to mitigate errors in sentiment word recognition caused by the performance limitations of automatic speech recognition (ASR) models.

  • real world
  • multimodal sentiment analysis

1. Introduction

In the research field of intelligent human–computer interaction, the ability to recognize, analyze, understand, and express emotions is essential for intelligent machines. Therefore, the utilization of computer technology to automatically recognize, understand, analyze, classify, and respond to emotion holds significant value for establishing a harmonious human–machine interaction environment, improving interaction efficiency, and enhancing user experience [1][2][3]. Previous studies [4][5] have primarily focused on sentiment analysis using textual data and have achieved remarkable accomplishments. However, as compared to unimodal analysis, MSA can effectively leverage the coordinated and complementary information from different modalities to enhance emotional understanding and expression capabilities and provide richer information that is more consistent with human behavior.
In recent years, there has been a growing interest in multimodal data for sentiment analysis. MSA aims to utilize the information interaction between texts, images, speech, etc., enabling machines to automatically utilize comprehensive multimodal emotional information for the identification of users’ sentiment tendencies. Early research often employed multimodal fusion [6] through early fusion, directly combining multiple sources of raw features [7][8], or late fusion, aggregating the decisions of multiple sentiment classifiers [9][10][11]. However, the former approach may result in a large number of redundant input vectors, leading to increased computational complexity, while the latter approach may struggle to capture the correlations between different modalities. Therefore, various methods have been proposed for feature fusion in multimodal sentiment analysis. Existing fusion methods include those based on simple operations [12][13], attention-based methods [14][15][16][17], tensor-based methods [18], translation-based methods [19], GAN-based methods [20], routing-based methods [21], and hierarchical fusion [22][23][24]. Although there is a wide range of fusion methods, attention-based fusion methods have shown superior efficiency and performance [25]. However, weighting and summing the features of each modality in the attention mechanism alone may not be able to effectively adapt to the differences in the features across different modalities. Consequently, certain modal features might be disregarded or underestimated, ultimately impacting the accuracy of the fused feature representation. Additionally, complex nonlinear interactions may exist between different modalities, and the attention mechanism may struggle to model such relationships accurately, thereby impacting the effectiveness of feature fusion. Furthermore, previous methods have rarely considered the simultaneous utilization of the interaction information within a single modality and between modalities.
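To make the distinction between these fusion styles concrete, the following sketch contrasts early fusion (feature concatenation), late fusion (aggregating per-modality decisions), and attention-based weighted fusion on randomly generated feature vectors. All dimensions, the dummy classifier, and the attention projections are illustrative assumptions rather than components of any cited model.

```python
# A minimal, illustrative sketch (not any specific paper's implementation) contrasting
# the three fusion styles discussed above, using randomly generated "features".
import numpy as np

rng = np.random.default_rng(0)
text_feat = rng.normal(size=128)    # hypothetical text embedding
audio_feat = rng.normal(size=64)    # hypothetical acoustic embedding
visual_feat = rng.normal(size=32)   # hypothetical visual embedding

# Early fusion: concatenate raw features into one long input vector.
early_fused = np.concatenate([text_feat, audio_feat, visual_feat])  # shape (224,)

# Late fusion: each modality gets its own classifier; decisions are aggregated.
def dummy_classifier(x):
    # Stand-in for a trained per-modality sentiment classifier (score in [0, 1]).
    return 1.0 / (1.0 + np.exp(-x.mean()))

late_fused_score = np.mean([dummy_classifier(f) for f in (text_feat, audio_feat, visual_feat)])

# Attention-based fusion: project each modality to a shared dimension, then take a
# softmax-weighted sum, so modalities contribute according to attention scores.
d = 32
modalities = (text_feat, audio_feat, visual_feat)
projections = [rng.normal(size=(d, f.shape[0])) * 0.1 for f in modalities]
projected = np.stack([W @ f for W, f in zip(projections, modalities)])  # (3, d)
query = rng.normal(size=d) * 0.1
scores = projected @ query
weights = np.exp(scores) / np.exp(scores).sum()        # attention weights over modalities
attended = (weights[:, None] * projected).sum(axis=0)  # fused representation, shape (d,)

print(early_fused.shape, round(late_fused_score, 3), attended.shape)
```

Note how, in the attention branch, a single weight per modality scales the whole projected vector; this is the limitation discussed above, since feature-level differences within a modality are not adapted to.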

2. Multimodal Sentiment Analysis in Realistic Environments

Sentiment computing, as an emerging interdisciplinary research field, has been widely studied and explored since its introduction in 1995 [26]. Previous research has primarily focused on unimodal data representation and multimodal fusion. In terms of unimodal data representation, Pang et al. [4] were the first to employ machine learning-based methods to address textual sentiment classification, achieving better results than traditional manual annotation by using movie reviews as the dataset. Yue et al. [5] proposed a hybrid model called Word2vec-BiLSTM-CNN, which leveraged the feature-extraction capability of convolutional neural networks (CNNs) and the ability of bi-directional long short-term memory (Bi-LSTM) to capture short-term bidirectional dependencies in text. Their results demonstrated that hybrid network models outperformed single-structure neural networks in the context of short texts. Colombo et al. [27] segmented different regions in image and video data based on features such as color, warmth, position, and size, enabling their method to obtain higher semantic levels beyond the objects themselves. They applied this approach to a sentiment analysis of art-related images. Wang et al. [28] utilized neural networks for the facial feature extraction of images. Bonifazi et al. [29] proposed a space–time framework that leveraged the emotional context inherent in a presented situation. They employed this framework to extract the scope of emotional information concerning users’ sentiments on a given subject. However, unimodal analysis has inherent limitations for sentiment analysis, since humans express emotions through various means, including sound, content, facial expressions, and body language, all of which are collectively employed to convey emotions.

Multimodal data describe objects from different perspectives, providing richer information than unimodal data, and different modalities can complement each other in terms of content. In the context of multimodal fusion, previous research can be categorized into three stages: early feature fusion, mid-level model fusion, and late decision fusion. Wollmer et al. [30] and Rozgic et al. [31] integrated data from audio, visual, and text sources to extract emotions and sentiments. Metallinou et al. [32] and Eyben et al. [33] combined audio and text patterns for emotional recognition. These methods relied on early feature fusion, which mapped the modalities to the same embedding space through simple concatenation and lacked interaction between the different modalities. In late-stage decision-fusion methods, internal representations were first learned within each modality, followed by learning the fusion between modalities. Zadeh et al. [18] utilized tensor-fusion networks to calculate the outer product between unimodal representations, yielding tensor representations. Liu et al. [34] introduced a low-rank multimodal-fusion method to reduce the computational complexity of tensor-based approaches. These methods aimed to enhance efficiency by decomposing the weights of high-dimensional fusion tensors and reducing redundant information, yet they struggled to effectively model intermodal or modality-specific dynamics.
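The sketch below illustrates the outer-product construction that underlies tensor-based fusion, where each unimodal vector is extended with a constant 1 so that unimodal and bimodal interaction terms are retained alongside the trimodal ones. The dimensions are made up for illustration; this is a sketch of the general idea rather than the implementations in [18] or [34].

```python
# A minimal numpy sketch of the outer-product idea behind tensor fusion:
# extend each unimodal vector with a constant 1, then take the three-way outer product.
import numpy as np

rng = np.random.default_rng(1)
h_text, h_audio, h_visual = rng.normal(size=8), rng.normal(size=4), rng.normal(size=4)

z_t = np.concatenate([h_text, [1.0]])    # (9,)
z_a = np.concatenate([h_audio, [1.0]])   # (5,)
z_v = np.concatenate([h_visual, [1.0]])  # (5,)

# Trimodal fusion tensor: every combination of text/audio/visual coordinates.
fusion_tensor = np.einsum('i,j,k->ijk', z_t, z_a, z_v)   # shape (9, 5, 5)

# The tensor grows multiplicatively with the unimodal dimensions, which is why
# low-rank factorizations of the fusion weights are used to keep the cost manageable.
print(fusion_tensor.shape, fusion_tensor.size)
```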
Intermediate model fusion combines the advantages of both early feature fusion and late-stage decision fusion, facilitating the selection of fusion points and enabling multimodal interaction. Poria et al. [35] further extended the combination of convolutional neural networks (CNNs) and multiple kernel learning (MKL). In contrast to Ghosal et al. [36], Poria et al. utilized a novel fusion method to effectively enhance the fused features. Zhang et al. [37] introduced a quantum-inspired framework for the sentiment analysis of bimodal data (texts and images) to address semantic gaps and model the correlation between the two modalities using density matrices. However, these methods exhibited limited adaptability to feature differences and suffered from significant feature redundancy. Concerning hierarchical fusion, Majumder et al. [22] employed a hierarchical-fusion strategy, initially combining two modalities and subsequently integrating all three modalities. However, this approach struggled to adequately capture intramodal dynamics. Georgiou et al. [23] introduced a deep hierarchical-fusion framework, applying it to sentiment-analysis problems involving audio and text modalities. Yan et al. [24] introduced a hierarchical attention-fusion network for geographical localization. Nevertheless, these methods overlooked the potential existence of complex nonlinear interactions between modalities. Moreover, many fusion approaches have seldom considered simultaneously harnessing intramodal and intermodal interactions. Verma et al. [38] emphasized that each modality possesses unique intramodality features, and multimodal sentiment analysis methods should capture both common intermodality information and distinctive intramodality signals.
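One way to act on this observation is to model within-modality dynamics with self-attention and cross-modality dynamics with cross-attention. The PyTorch sketch below is a minimal illustration under assumed sequence lengths and dimensions, not the architecture of any cited method; the attention module is shared across modalities only to keep the sketch short.

```python
# A minimal PyTorch sketch (illustrative only) of jointly modeling intramodal and
# intermodal interactions: self-attention inside each modality, then cross-modal
# attention where text queries attend over audio/visual features.
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, n_heads, t_len, a_len, v_len = 32, 4, 10, 20, 15

text = torch.randn(1, t_len, d_model)
audio = torch.randn(1, a_len, d_model)
visual = torch.randn(1, v_len, d_model)

self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

# Intramodal: each modality attends over itself to capture within-modality dynamics.
text_intra, _ = self_attn(text, text, text)
audio_intra, _ = self_attn(audio, audio, audio)
visual_intra, _ = self_attn(visual, visual, visual)

# Intermodal: text queries attend over the concatenated audio/visual sequence.
nonverbal = torch.cat([audio_intra, visual_intra], dim=1)
text_inter, _ = cross_attn(text_intra, nonverbal, nonverbal)

# A fused utterance-level representation can then pool both views.
fused = torch.cat([text_intra.mean(dim=1), text_inter.mean(dim=1)], dim=-1)  # (1, 64)
print(fused.shape)
```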
In addition to considering fusion strategies in the MSA model, it is crucial to address the noise present in modal data. Pham et al. [39] proposed the MCTN model to handle the potential absence of visual and acoustic data. Liang et al. [40] and Mittal et al. [41] also focused on addressing noise introduced by visual and acoustic data, relying on word-level features that were obtained by aligning the audio with the actual text. Xue et al. [42] introduced a multi-level attention-graph network to reduce noise within and between modalities. Cauteruccio et al. [43] introduced a string-comparison metric that could be employed to enhance the processing of heterogeneous audio samples, mitigating modality-related noise. However, these models did not investigate the impact of ASR errors on the MSA model. Notably, Wu et al. [44] utilized a sentiment word position-detection module to determine the most likely positions of sentiment words in text and dynamically refined the sentiment-word embeddings using a multimodal sentiment-word-refinement module, which fed the improved embeddings as the textual input to the multimodal feature-fusion module. This approach reduced the influence of ASR errors on the MSA model. The sentiment word position-detection module and multimodal sentiment-word-refinement module have proven highly effective, achieving state-of-the-art performance on real-world datasets. However, the original SWRM simply concatenated the modalities during feature fusion without capturing intramodal and intermodal features, even when genuine correlations existed.
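As a rough intuition for this detect-then-refine idea, the sketch below locates a low-confidence token as a candidate misrecognized sentiment word and blends its embedding with a vector predicted from nonverbal context through a gate. Every tensor, module, and the confidence heuristic here is a hypothetical stand-in; this is not the implementation of [44].

```python
# A simplified, illustrative sketch of the detect-then-refine idea (not the method of [44]):
# find a likely misrecognized sentiment word, then refine its embedding using a gate
# driven by pooled audio/visual context.
import torch
import torch.nn as nn

torch.manual_seed(0)
seq_len, d_text, d_ctx = 12, 32, 16

text_emb = torch.randn(seq_len, d_text)   # token embeddings from the ASR transcript
asr_confidence = torch.rand(seq_len)      # hypothetical per-token ASR confidence scores
nonverbal_ctx = torch.randn(d_ctx)        # pooled audio/visual context vector

# 1) Position detection: take the least confident token as the candidate sentiment word.
pos = int(torch.argmin(asr_confidence))

# 2) Refinement: predict a replacement embedding from the nonverbal context and blend it
#    with the original token embedding through a gate (randomly initialized layers here).
predict = nn.Linear(d_ctx, d_text)
gate_layer = nn.Linear(d_ctx, d_text)
gate = torch.sigmoid(gate_layer(nonverbal_ctx))
refined = gate * predict(nonverbal_ctx) + (1.0 - gate) * text_emb[pos]

text_emb[pos] = refined.detach()          # refined embeddings then feed the fusion module
print(pos, text_emb.shape)
```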