Multimodal Sentiment Analysis in Realistic Environments

Multimodal sentiment analysis (MSA) captures and analyzes sentiment in real-world settings by fusing information from multiple modalities, thereby enhancing the understanding of realistic environments. The key challenges lie in handling noise in the acquired data and achieving effective multimodal fusion. To handle noisy data, existing methods combine multimodal features to mitigate errors in sentiment-word recognition caused by the performance limitations of automatic speech recognition (ASR) models.

Keywords: real world; multimodal sentiment analysis

1. Introduction 

In the research field of intelligent human–computer interaction, the ability to recognize, analyze, understand, and express emotions is essential for intelligent machines. Therefore, the utilization of computer technology to automatically recognize, understand, analyze, classify, and respond to emotion holds significant value for establishing a harmonious human–machine interaction environment, improving interaction efficiency, and enhancing user experience [1][2][3]. Previous studies [4][5] have primarily focused on sentiment analysis using textual data and have achieved remarkable accomplishments. However, as compared to unimodal analysis, MSA can effectively leverage the coordinated and complementary information from different modalities to enhance emotional understanding and expression capabilities and provide richer information that is more consistent with human behavior.
In recent years, there has been growing interest in multimodal data for sentiment analysis. MSA aims to exploit the information interaction between texts, images, speech, etc., enabling machines to automatically use comprehensive multimodal emotional information to identify users’ sentiment tendencies. Early research often performed multimodal fusion [6] through early fusion, directly combining multiple sources of raw features [7][8], or late fusion, aggregating the decisions of multiple sentiment classifiers [9][10][11]. However, the former approach may produce a large number of redundant input vectors, increasing computational complexity, while the latter may struggle to capture the correlations between different modalities. Therefore, various methods have been proposed for feature fusion in multimodal sentiment analysis. Existing fusion methods include those based on simple operations [12][13], attention-based methods [14][15][16][17], tensor-based methods [18], translation-based methods [19], GAN-based methods [20], routing-based methods [21], and hierarchical fusion [22][23][24]. Although the range of fusion methods is wide, attention-based fusion has shown superior efficiency and performance [25]. However, merely weighting and summing the features of each modality with an attention mechanism may not adapt effectively to the differences in features across modalities. Consequently, certain modal features might be disregarded or underestimated, ultimately reducing the accuracy of the fused feature representation. Additionally, complex nonlinear interactions may exist between different modalities, and the attention mechanism may struggle to model such relationships accurately, limiting the effectiveness of feature fusion. Furthermore, previous methods have rarely considered simultaneously exploiting the interaction information within a single modality and between modalities.
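To make the contrast between these fusion routes concrete, the following minimal sketch (written in PyTorch, with module names, dimensions, and the softmax scoring function as illustrative assumptions rather than the design of any cited model) shows early fusion by feature concatenation alongside a simple attention-weighted fusion; the plain weighted sum in the second module is exactly the kind of combination that can underweight a modality and cannot capture complex nonlinear cross-modal interactions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class EarlyFusion(nn.Module):
    """Early fusion: concatenate raw modality features, then classify."""
    def __init__(self, d_text, d_audio, d_visual, n_classes):
        super().__init__()
        self.classifier = nn.Linear(d_text + d_audio + d_visual, n_classes)

    def forward(self, t, a, v):
        # (B, d_text), (B, d_audio), (B, d_visual) -> (B, n_classes)
        return self.classifier(torch.cat([t, a, v], dim=-1))

class AttentionFusion(nn.Module):
    """Attention fusion: project each modality to a shared space and combine
    them with learned weights; a plain weighted sum like this may still
    underestimate a modality and cannot model nonlinear interactions."""
    def __init__(self, d_text, d_audio, d_visual, d_model, n_classes):
        super().__init__()
        self.proj = nn.ModuleList(
            [nn.Linear(d, d_model) for d in (d_text, d_audio, d_visual)])
        self.score = nn.Linear(d_model, 1)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, t, a, v):
        h = torch.stack([p(x) for p, x in zip(self.proj, (t, a, v))], dim=1)  # (B, 3, d_model)
        w = F.softmax(self.score(torch.tanh(h)), dim=1)                       # (B, 3, 1)
        fused = (w * h).sum(dim=1)                                            # (B, d_model)
        return self.classifier(fused)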

2. Multimodal Sentiment Analysis in Realistic Environments

Sentiment computing, as an emerging interdisciplinary research field, has been widely studied and explored since its introduction in 1995 [26]. Previous research has primarily focused on unimodal data representation and multimodal fusion. In terms of unimodal data representation, Pang et al. [4] were the first to employ machine-learning-based methods for textual sentiment classification, using movie reviews as the dataset and achieving better results than traditional manual annotation. Yue et al. [5] proposed a hybrid model called Word2vec-CNN-BiLSTM, which leveraged the feature-extraction capability of convolutional neural networks (CNNs) and the ability of bidirectional long short-term memory (Bi-LSTM) to capture short-term bidirectional dependencies in text. Their results demonstrated that hybrid network models outperform single-structure neural networks on short texts. Colombo et al. [27] segmented different regions in image and video data based on features such as color, warmth, position, and size, enabling their method to reach semantic levels beyond the objects themselves; they applied this approach to the sentiment analysis of art-related images. Wang et al. [28] utilized neural networks to extract facial features from images. Bonifazi et al. [29] proposed a space–time framework that leveraged the emotional context inherent in a presented situation and employed it to extract the scope of emotional information concerning users’ sentiments on a given subject. However, unimodal sentiment analysis has inherent limitations, since humans express emotions through various means, including sound, content, facial expressions, and body language, all of which are collectively employed to convey emotions.
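As an illustration of how such a hybrid text model can be composed, the following minimal PyTorch sketch (layer sizes, kernel width, and pooling choices are illustrative assumptions, not the exact architecture of Yue et al. [5]) stacks pretrained word embeddings, a Bi-LSTM for bidirectional context, and a 1-D convolution with max pooling for local feature extraction.

import torch
import torch.nn as nn

class BiLSTMCNNTextClassifier(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, hidden=128, n_classes=2):
        super().__init__()
        # in practice, initialize from pretrained word2vec vectors
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.conv = nn.Conv1d(2 * hidden, 128, kernel_size=3, padding=1)
        self.classifier = nn.Linear(128, n_classes)

    def forward(self, token_ids):                      # (B, T)
        x = self.embed(token_ids)                      # (B, T, emb_dim)
        h, _ = self.bilstm(x)                          # (B, T, 2*hidden)
        c = torch.relu(self.conv(h.transpose(1, 2)))   # (B, 128, T)
        pooled = c.max(dim=-1).values                  # max pooling over time -> (B, 128)
        return self.classifier(pooled)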
Multimodal data describe objects from different perspectives and thus provide richer information than unimodal data, and different modalities can complement each other in terms of content. In the context of multimodal fusion, previous research can be categorized into three stages: early feature fusion, mid-level model fusion, and late decision fusion. Wöllmer et al. [30] and Rozgić et al. [31] integrated data from audio, visual, and text sources to extract emotions and sentiments. Metallinou et al. [32] and Eyben et al. [33] combined audio and text patterns for emotion recognition. These methods relied on early feature fusion, which mapped the modalities to the same embedding space through simple concatenation and therefore lacked interaction between different modalities. In late decision-fusion methods, internal representations are first learned within each modality, and the fusion between modalities is learned afterwards. Zadeh et al. [18] utilized tensor-fusion networks to compute the outer product between unimodal representations, yielding tensor representations. Liu et al. [34] introduced a low-rank multimodal-fusion method to reduce the computational complexity of tensor-based approaches. These methods aimed to enhance efficiency by decomposing the weights of high-dimensional fusion tensors and reducing redundant information, yet they struggled to effectively model intermodal or modality-specific dynamics. Intermediate model fusion combines the advantages of early feature fusion and late decision fusion, allowing fusion points to be selected and enabling multimodal interaction. Poria et al. [35] further extended the combination of CNNs and multiple kernel learning (MKL). In contrast to Ghosal et al. [36], Poria et al. utilized a novel fusion method to effectively enhance the fused features. Zhang et al. [37] introduced a quantum-inspired framework for the sentiment analysis of bimodal data (texts and images) to address semantic gaps and model the correlation between the two modalities using density matrices. However, these methods exhibited limited adaptability to feature differences and suffered from significant feature redundancy. Concerning hierarchical fusion, Majumder et al. [22] employed a hierarchical-fusion strategy, initially combining two modalities and subsequently integrating all three; however, this approach struggled to adequately capture intramodal dynamics. Georgiou et al. [23] introduced a deep hierarchical-fusion framework and applied it to sentiment-analysis problems involving audio and text modalities. Yan et al. [24] introduced a hierarchical attention-fusion network for geographical localization. Nevertheless, these methods overlooked the potential existence of complex nonlinear interactions between modalities. Moreover, many fusion approaches have seldom considered simultaneously harnessing intramodal and intermodal interactions. Verma et al. [38] emphasized that each modality possesses unique intramodality features, and that multimodal sentiment analysis methods should capture both common intermodality information and distinctive intramodality signals.
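To illustrate the two decision-level ideas just described, the sketch below (PyTorch; the dimensions, rank, and the appended bias term are illustrative assumptions in the spirit of, not reproductions of, the cited models) contrasts outer-product tensor fusion as in Zadeh et al. [18] with a low-rank factorized fusion in the spirit of Liu et al. [34], whose output is a sum of elementwise products of modality-specific projections rather than an explicit outer product.

import torch
import torch.nn as nn

def tensor_fusion(t, a, v):
    """Outer product of unimodal representations (a constant 1 is appended to
    each so unimodal and bimodal terms are retained), flattened for a
    downstream classifier; the output size grows multiplicatively."""
    ones = torch.ones(t.size(0), 1, device=t.device)
    t = torch.cat([t, ones], dim=-1)
    a = torch.cat([a, ones], dim=-1)
    v = torch.cat([v, ones], dim=-1)
    fused = torch.einsum('bi,bj,bk->bijk', t, a, v)     # (B, dt+1, da+1, dv+1)
    return fused.flatten(start_dim=1)

class LowRankFusion(nn.Module):
    """Approximate the high-dimensional fusion tensor with rank-R
    modality-specific factors, so fusion reduces to elementwise products."""
    def __init__(self, d_t, d_a, d_v, d_out, rank=4):
        super().__init__()
        self.f_t = nn.Linear(d_t, rank * d_out)
        self.f_a = nn.Linear(d_a, rank * d_out)
        self.f_v = nn.Linear(d_v, rank * d_out)
        self.rank, self.d_out = rank, d_out

    def forward(self, t, a, v):
        B = t.size(0)
        prod = (self.f_t(t) * self.f_a(a) * self.f_v(v)).view(B, self.rank, self.d_out)
        return prod.sum(dim=1)                          # (B, d_out)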
In addition to fusion strategies, an MSA model must also address the noise present in modal data. Pham et al. [39] proposed the MCTN model to handle the potential absence of visual and acoustic data. Liang et al. [40] and Mittal et al. [41] also focused on noise introduced by visual and acoustic data, relying on word-level features that were obtained by aligning the audio with the actual text. Xue et al. [42] introduced a multi-level attention-graph network to reduce noise within and between modalities. Cauteruccio et al. [43] introduced a string-comparison metric that can enhance the processing of heterogeneous audio samples, mitigating modality-related noise. However, these models did not investigate the impact of ASR errors on the MSA model. Notably, Wu et al. [44] utilized a sentiment word position-detection module to determine the most likely positions of sentiment words in text. They dynamically refined sentiment-word embeddings using a multimodal sentiment-word-refinement module and used the improved embeddings as the textual input for the multimodal feature-fusion module, thereby reducing the influence of ASR errors on the MSA model. The sentiment word position-detection and multimodal sentiment-word-refinement modules have proven highly effective, achieving state-of-the-art performance on real-world datasets. However, the original SWRM (sentiment word aware multimodal refinement) model simply concatenated the modalities during feature fusion, without capturing intramodal and intermodal features even when genuine correlations existed.
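The refinement step can be pictured with the following minimal, illustrative PyTorch sketch (the gating form, the way audio-visual context is encoded, and the module names are assumptions, not the exact design of Wu et al. [44]): at positions flagged as likely sentiment words, the possibly ASR-corrupted word embedding is blended with a candidate embedding predicted from audio-visual cues.

import torch
import torch.nn as nn

class SentimentWordRefiner(nn.Module):
    def __init__(self, d_word, d_av):
        super().__init__()
        # candidate word embedding predicted from audio-visual context
        self.candidate = nn.Linear(d_av, d_word)
        # gate deciding how much to trust the candidate vs. the original token
        self.gate = nn.Sequential(nn.Linear(d_word + d_av, d_word), nn.Sigmoid())

    def forward(self, word_emb, av_context, position_prob):
        """word_emb: (B, T, d_word); av_context: (B, T, d_av);
        position_prob: (B, T, 1) likelihood that each token is a sentiment word."""
        cand = self.candidate(av_context)                          # (B, T, d_word)
        g = self.gate(torch.cat([word_emb, av_context], dim=-1))   # (B, T, d_word)
        refined = g * cand + (1.0 - g) * word_emb
        # refine strongly only at positions likely to hold a sentiment word
        return position_prob * refined + (1.0 - position_prob) * word_emb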

References

  1. Qin, Z.; Zhao, P.; Zhuang, T.; Deng, F.; Ding, Y.; Chen, D. A survey of identity recognition via data fusion and feature learning. Inf. Fusion 2023, 91, 694–712.
  2. Tu, G.; Liang, B.; Jiang, D.; Xu, R. Sentiment-, Emotion-, and Context-Guided Knowledge Selection Framework for Emotion Recognition in Conversations. IEEE Trans. Affect. Comput. 2022, 1–14.
  3. Noroozi, F.; Corneanu, C.A.; Kamińska, D.; Sapiński, T.; Escalera, S.; Anbarjafari, G. Survey on emotional body gesture recognition. IEEE Trans. Affect. Comput. 2018, 12, 505–523.
  4. Pang, B.; Lee, L.; Vaithyanathan, S. Thumbs up? Sentiment classification using machine learning techniques. arXiv 2002, arXiv:cs/0205070.
  5. Yue, W.; Li, L. Sentiment analysis using Word2vec-CNN-BiLSTM classification. In Proceedings of the 2020 Seventh International Conference on Social Networks Analysis, Management and Security (SNAMS), Paris, France, 14–16 December 2020; pp. 1–5.
  6. Atrey, P.K.; Hossain, M.A.; El Saddik, A.; Kankanhalli, M. Multimodal fusion for multimedia analysis: A survey. Multimed. Syst. 2010, 16, 345–379.
  7. Mazloom, M.; Rietveld, R.; Rudinac, S.; Worring, M.; Van Dolen, W. Multimodal popularity prediction of brand-related social media posts. In Proceedings of the 24th ACM International Conference on Multimedia, Amsterdam, The Netherlands, 15–19 October 2016; pp. 197–201.
  8. Pérez-Rosas, V.; Mihalcea, R.; Morency, L.-P. Utterance-level multimodal sentiment analysis. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Sofia, Bulgaria, 4–9 August 2013; pp. 973–982.
  9. Poria, S.; Cambria, E.; Gelbukh, A. Deep convolutional neural network textual features and multiple kernel learning for utterance-level multimodal sentiment analysis. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 17–21 September 2015; pp. 2539–2544.
  10. Liu, N.; Dellandréa, E.; Chen, L.; Zhu, C.; Zhang, Y.; Bichot, C.-E.; Bres, S.; Tellez, B. Multimodal recognition of visual concepts using histograms of textual concepts and selective weighted late fusion scheme. Comput. Vis. Image Underst. 2013, 117, 493–512.
  11. Yu, W.; Xu, H.; Meng, F.; Zhu, Y.; Ma, Y.; Wu, J.; Zou, J.; Yang, K. Ch-sims: A chinese multimodal sentiment analysis dataset with fine-grained annotation of modality. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 3718–3727.
  12. Poria, S.; Chaturvedi, I.; Cambria, E.; Hussain, A. Convolutional MKL based multimodal emotion recognition and sentiment analysis. In Proceedings of the 2016 IEEE 16th International Conference on Data Mining (ICDM), Barcelona, Spain, 12–15 December 2016; pp. 439–448.
  13. Nguyen, D.; Nguyen, K.; Sridharan, S.; Dean, D.; Fookes, C. Deep spatio-temporal feature fusion with compact bilinear pooling for multimodal emotion recognition. Comput. Vis. Image Underst. 2018, 174, 33–42.
  14. Lv, F.; Chen, X.; Huang, Y.; Duan, L.; Lin, G. Progressive modality reinforcement for human multimodal emotion recognition from unaligned multimodal sequences. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 2554–2562.
  15. Tsai, Y.-H.H.; Bai, S.; Liang, P.P.; Kolter, J.Z.; Morency, L.-P.; Salakhutdinov, R. Multimodal transformer for unaligned multimodal language sequences. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 27 July–2 August 2019; p. 6558.
  16. Cheng, J.; Fostiropoulos, I.; Boehm, B.; Soleymani, M. Multimodal phased transformer for sentiment analysis. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Virtual, 7–11 November 2021; pp. 2447–2458.
  17. Rahman, W.; Hasan, M.K.; Lee, S.; Zadeh, A.; Mao, C.; Morency, L.-P.; Hoque, E. Integrating multimodal information in large pretrained transformers. In Proceedings of the Conference Association for Computational Linguistics Meeting, Seattle, WA, USA, 5–10 July 2020; p. 2359.
  18. Zadeh, A.; Chen, M.; Poria, S.; Cambria, E.; Morency, L.-P. Tensor fusion network for multimodal sentiment analysis. arXiv 2017, arXiv:1707.07250.
  19. Wang, Z.; Wan, Z.; Wan, X. Transmodality: An end2end fusion method with transformer for multimodal sentiment analysis. In Proceedings of the Web Conference 2020, Taipei, Taiwan, 20–24 April 2020; pp. 2514–2520.
  20. Peng, Y.; Qi, J. CM-GANs: Cross-modal generative adversarial networks for common representation learning. ACM Trans. Multimed. Comput. Commun. Appl. (TOMM) 2019, 15, 1–24.
  21. Tsai, Y.-H.H.; Ma, M.Q.; Yang, M.; Salakhutdinov, R.; Morency, L.-P. Multimodal routing: Improving local and global interpretability of multimodal language analysis. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Online, 16–20 November 2020; p. 1823.
  22. Majumder, N.; Hazarika, D.; Gelbukh, A.; Cambria, E.; Poria, S. Multimodal sentiment analysis using hierarchical fusion with context modeling. Knowl.-Based Syst. 2018, 161, 124–133.
  23. Georgiou, E.; Papaioannou, C.; Potamianos, A. Deep Hierarchical Fusion with Application in Sentiment Analysis. In Proceedings of the INTERSPEECH, Graz, Austria, 15–19 September 2019; pp. 1646–1650.
  24. Yan, L.; Cui, Y.; Chen, Y.; Liu, D. Hierarchical attention fusion for geo-localization. In Proceedings of the ICASSP 2021–2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 2220–2224.
  25. Fu, Z.; Liu, F.; Xu, Q.; Qi, J.; Fu, X.; Zhou, A.; Li, Z. NHFNET: A non-homogeneous fusion network for multimodal sentiment analysis. In Proceedings of the 2022 IEEE International Conference on Multimedia and Expo (ICME), Taipei, Taiwan, 18–22 July 2022; pp. 1–6.
  26. Poria, S.; Cambria, E.; Bajpai, R.; Hussain, A. A review of affective computing: From unimodal analysis to multimodal fusion. Inf. Fusion 2017, 37, 98–125.
  27. Colombo, C.; Del Bimbo, A.; Pala, P. Semantics in visual information retrieval. IEEE Multimed. 1999, 6, 38–53.
  28. Wang, K.; Peng, X.; Yang, J.; Lu, S.; Qiao, Y. Suppressing uncertainties for large-scale facial expression recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 6897–6906.
  29. Bonifazi, G.; Cauteruccio, F.; Corradini, E.; Marchetti, M.; Sciarretta, L.; Ursino, D.; Virgili, L. A Space-Time Framework for Sentiment Scope Analysis in Social Media. Big Data Cogn. Comput. 2022, 6, 130.
  30. Wöllmer, M.; Weninger, F.; Knaup, T.; Schuller, B.; Sun, C.; Sagae, K.; Morency, L.-P. Youtube movie reviews: Sentiment analysis in an audio-visual context. IEEE Intell. Syst. 2013, 28, 46–53.
  31. Rozgić, V.; Ananthakrishnan, S.; Saleem, S.; Kumar, R.; Prasad, R. Ensemble of SVM Trees for Multimodal Emotion Recognition. In Proceedings of the 2012 Asia Pacific Signal and Information Processing Association Annual Summit and Conference, Hollywood, CA, USA, 3–6 December 2012; pp. 1–4.
  32. Metallinou, A.; Lee, S.; Narayanan, S. Audio-visual emotion recognition using gaussian mixture models for face and voice. In Proceedings of the 2008 Tenth IEEE International Symposium on Multimedia, Berkeley, CA, USA, 15–17 December 2008; pp. 250–257.
  33. Eyben, F.; Wöllmer, M.; Graves, A.; Schuller, B.; Douglas-Cowie, E.; Cowie, R. On-line emotion recognition in a 3-D activation-valence-time continuum using acoustic and linguistic cues. J. Multimodal User Interfaces 2010, 3, 7–19.
  34. Liu, Z.; Shen, Y.; Lakshminarasimhan, V.B.; Liang, P.P.; Zadeh, A.; Morency, L.P. Efficient low-rank multimodal fusion with modality-specific factors. arXiv 2018, arXiv:1806.00064.
  35. Poria, S.; Peng, H.; Hussain, A.; Howard, N.; Cambria, E. Ensemble application of convolutional neural networks and multiple kernel learning for multimodal sentiment analysis. Neurocomputing 2017, 261, 217–230.
  36. Ghosal, D.; Akhtar, M.S.; Chauhan, D.; Poria, S.; Ekbal, A.; Bhattacharyya, P. Contextual inter-modal attention for multi-modal sentiment analysis. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 3454–3466.
  37. Zhang, Y.; Song, D.; Zhang, P.; Wang, P.; Li, J.; Li, X.; Wang, B. A quantum-inspired multimodal sentiment analysis framework. Theor. Comput. Sci. 2018, 752, 21–40.
  38. Verma, S.; Wang, C.; Zhu, L.; Liu, W. Deepcu: Integrating both common and unique latent information for multimodal sentiment analysis. In Proceedings of the International Joint Conference on Artificial Intelligence, Macao, China, 10–16 August 2019; pp. 3627–3634.
  39. Pham, H.; Liang, P.P.; Manzini, T.; Morency, L.P.; Póczos, B. Found in Translation: Learning Robust Joint Representations by Cyclic Translations between Modalities. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; pp. 6892–6899.
  40. Liang, P.P.; Liu, Z.; Tsai, Y.H.H.; Zhao, Q.; Salakhutdinov, R.; Morency, L.P. Learning Representations from Imperfect Time Series Data via Tensor Rank Regularization. arXiv 2019, arXiv:1907.01011.
  41. Mittal, T.; Bhattacharya, U.; Chandra, R.; Bera, A.; Manocha, D. M3ER: Multiplicative Multimodal Emotion Recognition using Facial, Textual, and Speech Cues. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 1359–1367.
  42. Xue, X.; Zhang, C.; Niu, Z.; Wu, X. Multi-level attention map network for multimodal sentiment analysis. IEEE Trans. Knowl. Data Eng. 2022, 35, 5105–5118.
  43. Cauteruccio, F.; Stamile, C.; Terracina, G.; Ursino, D.; Sappey-Marinier, D. An automated string-based approach to White Matter fiber-bundles clustering. In Proceedings of the 2015 International Joint Conference on Neural Networks (IJCNN), Killarney, Ireland, 12–17 July 2015; pp. 1–8.
  44. Wu, Y.; Zhao, Y.; Yang, H.; Chen, S.; Qin, B.; Cao, X.; Zhao, W. Sentiment Word Aware Multimodal Refinement for Multimodal Sentiment Analysis with ASR Errors. arXiv 2022, arXiv:2203.00257.