Tasks for Multimodal Federated Learning

Multimodal federated learning (MFL) offers many advantages, such as privacy preservation and a remedy for the data silo problem. However, compared with centralized multimodal learning, it also faces limitations such as communication costs, data heterogeneity, and hardware disparities. Therefore, in addition to the unique challenge of modal heterogeneity, the original multimodal learning tasks become more difficult when performed within a federated learning framework.

Keywords: federated learning; multimodal learning; Internet of Things

1. Introduction

In many real-world scenarios, data are collected and stored in a distributed and privacy-sensitive manner: for instance, multimedia data on personal smartphones, sensory data from vehicles, and the examination data and diagnostic records of patients across different hospitals. The large volume of sensitive multimodal data being collected and shared has heightened public concern regarding privacy protection. Consequently, increasingly stringent data regulation policies have emerged, such as the General Data Protection Regulation (GDPR) in the European Union and the Health Insurance Portability and Accountability Act (HIPAA) in the United States. These regulations have created challenges for data collaboration and have raised privacy concerns for traditional centralized multimodal machine learning approaches [1].

2. Vision–Language Interaction

Visual and language data are widely present in data centers and on edge devices, making vision–language interaction an important task in MFL. Specifically, a federated learning system targeting visual and language data should be capable of handling diverse and complex vision–language learning tasks, including visual question answering, visual reasoning, image captioning, image–text retrieval, and text-to-image generation. During local training on client devices, the system needs to achieve efficient and robust multimodal matching and cross-modal interaction. On the one hand, owing to constraints imposed by client hardware and communication budgets, MFL approaches are expected to be lightweight yet high-performing; integrating state-of-the-art pre-trained large-scale models from federated learning and vision–language learning has therefore become a promising research direction. On the other hand, the heterogeneity of data, particularly in terms of labels and modalities, often leads to differences in model architectures and tasks among clients. These task gaps and semantic gaps can negatively affect global aggregation on the server side, posing challenges for convergent global optimization.
Several pioneering studies have explored MFL in the context of vision–language tasks. In [2], the authors proposed aimNet and evaluated it under horizontal FL, vertical FL, and federated transfer learning (FTL) settings, with clients performing either visual question answering or image captioning. CreamFL [3] used contrastive learning to ensemble uploaded heterogeneous local models based on their output representations, allowing both unimodal and multimodal vision–language tasks in federated systems. pFedPrompt [4] adapted prompt training to bring large foundation models into federated learning systems, connecting vision and language data. FedCMR [5] explored the federated cross-modal retrieval task and mitigated the representation space gap via weighted aggregation based on each client's local data amount and category number.
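To make the aggregation idea behind FedCMR concrete, the sketch below shows a weighted server-side parameter average in which each client's contribution scales with its local sample count and category count. The function and variable names, and the exact weighting rule, are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch of weighted server-side aggregation, loosely following the idea
# in FedCMR [5]: clients with more local samples and broader category coverage
# contribute more to the global model. Names and the weighting rule are
# illustrative assumptions, not the authors' actual method.
from typing import Dict, List
import numpy as np

def aggregate(client_params: List[Dict[str, np.ndarray]],
              num_samples: List[int],
              num_categories: List[int]) -> Dict[str, np.ndarray]:
    # Combine data volume and category coverage into one scalar weight per client.
    raw = np.array(num_samples, dtype=float) * np.array(num_categories, dtype=float)
    weights = raw / raw.sum()

    # Weighted average of each parameter tensor across clients.
    global_params = {}
    for name in client_params[0]:
        global_params[name] = sum(w * p[name] for w, p in zip(weights, client_params))
    return global_params

# Example: three clients sharing the same model architecture.
clients = [{"encoder.w": np.random.randn(4, 4)} for _ in range(3)]
global_model = aggregate(clients, num_samples=[100, 50, 200], num_categories=[10, 5, 10])
```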

3. Human Activity Recognition

Wireless distributed sensor systems, such as IoT systems, in which multiple sensors provide consistent observations of the same object or event, are a significant application scenario for MFL. Human activity recognition (HAR) is one of the most representative tasks in this setting, owing to its strong privacy-preservation requirements.
Existing MFL works on the HAR task adopt two data partition schemes: client-as-device and client-as-sensor. The former is represented by MMFed [6], which divides the multimodal data equally among clients; a local co-attention mechanism then performs multimodal fusion. Zhao et al. [7] instead assigned each client only a single modality. The local network was divided into five modules, allowing either modality-wise aggregation among clients holding the same modality or general aggregation across all clients. In practice, however, the modality distribution and data partition method can vary with hardware deployment and environmental factors.
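The following sketch illustrates the modality-wise aggregation idea in the client-as-sensor setting, in the spirit of [7]: modality-specific encoders are averaged only among clients holding the same sensor modality, while a shared head is averaged over all clients. The module split, grouping, and names here are simplifying assumptions, not the paper's exact five-module design.

```python
# Sketch of modality-wise aggregation for client-as-sensor MFL, in the spirit of
# Zhao et al. [7]. Clients with the same sensor modality average their encoder
# modules; the shared classifier head is averaged across all clients.
# Module names and the two-part split are illustrative assumptions.
from collections import defaultdict
import numpy as np

def modality_wise_aggregate(updates, modalities):
    """updates: list of dicts with 'encoder' and 'classifier' arrays;
    modalities: list of modality tags, e.g. 'accelerometer' or 'gyroscope'."""
    # Group encoder updates by modality and average within each group.
    groups = defaultdict(list)
    for update, mod in zip(updates, modalities):
        groups[mod].append(update["encoder"])
    encoders = {mod: np.mean(encs, axis=0) for mod, encs in groups.items()}

    # The classifier head is shared, so average it over all clients.
    classifier = np.mean([u["classifier"] for u in updates], axis=0)
    return encoders, classifier

updates = [{"encoder": np.random.randn(8), "classifier": np.random.randn(4)} for _ in range(4)]
encoders, head = modality_wise_aggregate(updates, ["accel", "accel", "gyro", "gyro"])
```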

4. Emotion Recognition

Emotion recognition plays a crucial role in improving social well-being and enhancing societal vitality. The multimodal data generated during the use of mobile phones often provide valuable insights into identifying users who may have underlying mental health issues. Effective emotion recognition algorithms can target specific users to enhance their experience and prevent the occurrence of negative events such as suicide and depression. However, multimedia data associated with user emotions, including chat records and personal photos, are highly privacy-sensitive. In this context, the MFL framework offers the capability of efficient collaborative training while ensuring privacy protection. Therefore, emotion recognition undoubtedly holds a significant position within the realm of MFL.
Several MFL works have investigated emotion recognition in vertical and hybrid MFL settings. In [8], each client in the system held only one modality, and the unimodal encoders were trained locally. The proposed hierarchical aggregation method aggregated the encoders according to the modality type held by each client and used an attention-based method to align the decoder weights regardless of data modality. FedMSplit [9] used a dynamic, multiview graph structure to flexibly capture correlations among client models in a multimodal setting. Liang et al. [10] proposed a decentralized privacy-preserving representation learning method that uses multimodal behavior markers to predict users' daily moods and identify early risk of suicide.
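A minimal sketch of such hierarchical aggregation is given below: unimodal encoders are averaged per modality, and decoder weights from all clients are combined with softmax attention scores derived from their similarity to the mean decoder. This is a simplification under assumed names; the attention mechanism in [8] is more elaborate.

```python
# Simplified sketch of hierarchical aggregation for vertical MFL emotion
# recognition, inspired by [8]: unimodal encoders are averaged per modality,
# while decoder weights from all clients are combined with attention scores
# based on similarity to the mean decoder. All names are illustrative.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def hierarchical_aggregate(encoders, modalities, decoders):
    # Stage 1: modality-wise averaging of unimodal encoders.
    per_modality = {}
    for mod in set(modalities):
        per_modality[mod] = np.mean([e for e, m in zip(encoders, modalities) if m == mod], axis=0)

    # Stage 2: attention-weighted combination of decoder weights across all clients.
    stacked = np.stack(decoders)               # shape: (num_clients, dim)
    mean_dec = stacked.mean(axis=0)
    scores = softmax(stacked @ mean_dec)       # similarity of each decoder to the mean
    global_decoder = scores @ stacked          # attention-weighted average
    return per_modality, global_decoder

encs = [np.random.randn(6) for _ in range(3)]
decs = [np.random.randn(5) for _ in range(3)]
enc_out, dec_out = hierarchical_aggregate(encs, ["audio", "text", "audio"], decs)
```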

5. Healthcare

Numerous healthcare centers and hospitals have accumulated vast amounts of multimodal data during patient consultations and treatments, including X-ray images, CT scans, physician diagnoses, and physiological measurements of patients. These multimodal data are typically tightly linked to patient identifiers and require stringent privacy protection measures. As a result, these healthcare institutions have formed isolated data islands, impeding direct collaboration in terms of co-operative training and data sharing through open databases. This presents a series of crucial challenges within the realm of multimodal federated learning, encompassing tasks such as AI-assisted diagnosis, medical image analysis, and laboratory report generation.
Some works in the field of healthcare have explored multimodal federated learning, often assuming either that all institutions have the same set of modalities (horizontal MFL) or that each institution possesses only a single modality (vertical MFL). Agbley et al. [11] applied federated learning to melanoma prediction and obtained performance on par with centralized training. FedNorm [12] applied modality-based normalization to enhance liver segmentation and was trained with unimodal clients holding CT and MRI data, respectively. Qayyum et al. [13] used clustered federated learning for the automatic diagnosis of COVID-19, where each cluster contained healthcare entities holding the same modality, such as X-ray or ultrasound data.
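The sketch below shows one way per-modality normalization could look for unimodal CT and MRI clients, in the spirit of FedNorm [12]: each modality keeps its own normalization statistics so that intensity ranges are comparable before a shared backbone. The statistics tracked and the class interface are assumptions for illustration; the actual method differs.

```python
# Sketch of per-modality feature normalization for unimodal healthcare clients,
# in the spirit of FedNorm [12]: each modality (e.g. CT, MRI) keeps its own
# normalization statistics so intensity ranges are comparable before the shared
# segmentation backbone. The tracked statistics and API are illustrative assumptions.
import numpy as np

class ModalityNorm:
    def __init__(self, modalities):
        self.stats = {m: {"mean": 0.0, "var": 1.0} for m in modalities}

    def update(self, modality, batch):
        # Track statistics separately for each modality.
        self.stats[modality] = {"mean": float(batch.mean()), "var": float(batch.var())}

    def normalize(self, modality, batch):
        s = self.stats[modality]
        return (batch - s["mean"]) / np.sqrt(s["var"] + 1e-5)

norm = ModalityNorm(["CT", "MRI"])
ct_scan = np.random.rand(64, 64) * 2000.0      # CT intensities in a Hounsfield-like range
norm.update("CT", ct_scan)
normalized = norm.normalize("CT", ct_scan)
```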

References

  1. Baltrušaitis, T.; Ahuja, C.; Morency, L.P. Multimodal machine learning: A survey and taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 423–443.
  2. Liu, F.; Wu, X.; Ge, S.; Fan, W.; Zou, Y. Federated learning for vision-and-language grounding problems. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 11572–11579.
  3. Yu, Q.; Liu, Y.; Wang, Y.; Xu, K.; Liu, J. Multimodal Federated Learning via Contrastive Representation Ensemble. In Proceedings of the Eleventh International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023.
  4. Guo, T.; Guo, S.; Wang, J. pFedPrompt: Learning Personalized Prompt for Vision-Language Models in Federated Learning. In Proceedings of the ACM Web Conference 2023, Austin, TX, USA, 30 April–4 May 2023; pp. 1364–1374.
  5. Zong, L.; Xie, Q.; Zhou, J.; Wu, P.; Zhang, X.; Xu, B. FedCMR: Federated Cross-Modal Retrieval. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, Canada, 11–15 July 2021; pp. 1672–1676.
  6. Xiong, B.; Yang, X.; Qi, F.; Xu, C. A unified framework for multi-modal federated learning. Neurocomputing 2022, 480, 110–118.
  7. Zhao, Y.; Barnaghi, P.; Haddadi, H. Multimodal Federated Learning on IoT Data. In Proceedings of the 2022 IEEE/ACM Seventh International Conference on Internet-of-Things Design and Implementation (IoTDI), Milano, Italy, 4–6 May 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 43–54.
  8. Zhang, R.; Chi, X.; Liu, G.; Zhang, W.; Du, Y.; Wang, F. Unimodal Training-Multimodal Prediction: Cross-modal Federated Learning with Hierarchical Aggregation. arXiv 2023, arXiv:2303.15486.
  9. Chen, J.; Zhang, A. FedMSplit: Correlation-Adaptive Federated Multi-Task Learning across Multimodal Split Networks. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 14–18 August 2022; pp. 87–96.
  10. Liang, P.P.; Liu, T.; Cai, A.; Muszynski, M.; Ishii, R.; Allen, N.; Auerbach, R.; Brent, D.; Salakhutdinov, R.; Morency, L.P. Learning language and multimodal privacy-preserving markers of mood from mobile data. arXiv 2021, arXiv:2106.13213.
  11. Agbley, B.L.Y.; Li, J.; Haq, A.U.; Bankas, E.K.; Ahmad, S.; Agyemang, I.O.; Kulevome, D.; Ndiaye, W.D.; Cobbinah, B.; Latipova, S. Multimodal melanoma detection with federated learning. In Proceedings of the 2021 18th International Computer Conference on Wavelet Active Media Technology and Information Processing (ICCWAMTIP), Chengdu, China, 17–19 December 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 238–244.
  12. Bernecker, T.; Peters, A.; Schlett, C.L.; Bamberg, F.; Theis, F.; Rueckert, D.; Weiß, J.; Albarqouni, S. FedNorm: Modality-Based Normalization in Federated Learning for Multi-Modal Liver Segmentation. arXiv 2022, arXiv:2205.11096.
  13. Qayyum, A.; Ahmad, K.; Ahsan, M.A.; Al-Fuqaha, A.; Qadir, J. Collaborative federated learning for healthcare: Multi-modal covid-19 diagnosis at the edge. arXiv 2021, arXiv:2101.07511.