Multimodal federated learning (MFL) offers many advantages, such as privacy preservation and mitigation of the data silo problem. However, compared with centralized multimodal learning, it also faces limitations such as communication costs, data heterogeneity, and hardware disparities. Therefore, beyond the unique challenge of modality heterogeneity, the original multimodal learning tasks become even more challenging when performed within a federated learning framework.
1. Introduction
In various real-world scenarios, data are usually collected and stored in a distributed and privacy-sensitive manner: for instance, multimedia data on personal smartphones, sensory data from various vehicles, and examination data and diagnostic records of patients across different hospitals. The significant volume of sensitive multimodal data being collected and shared has heightened public concern about privacy protection. Consequently, increasingly stringent data regulation policies have emerged, such as the General Data Protection Regulation (GDPR) in the European Union and the Health Insurance Portability and Accountability Act (HIPAA) in the United States. These regulations have given rise to challenges in data collaboration and have raised privacy concerns for traditional centralized multimodal machine learning approaches [1].
2. Vision–Language Interaction
Visual and language data are widely present in data centers and on edge devices, making vision–language interaction an important task in MFL. Specifically, a federated learning system targeting visual and language data should be capable of handling diverse and complex vision–language learning tasks, including visual question answering (VQA), visual reasoning, image captioning, image–text retrieval, and text-to-image generation. In the context of local training on client devices, the system needs to achieve efficient and robust multimodal matching and cross-modal interaction. On the one hand, due to constraints imposed by client hardware and communication requirements, MFL approaches are expected to be both lightweight and high-performing; integrating state-of-the-art pre-trained large-scale models from vision–language learning into federated learning has therefore become a promising research direction. On the other hand, the heterogeneity of data, particularly in terms of labels and modalities, often leads to differences in model architectures and tasks among clients. The resulting task gaps and semantic gaps can negatively impact global aggregation on the server side, posing challenges for achieving convergent global optimization.
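For concreteness, the following Python sketch illustrates a single FedAvg-style communication round over vision–language clients: the server broadcasts the global parameters, each client performs local training on its paired image–text data, and the server computes a data-size-weighted average. This is a minimal illustration only; the `ToyClient` class, the dictionary-of-arrays parameter format, and the simulated local training are assumptions made for exposition and do not correspond to any specific method surveyed here.

```python
import numpy as np

def fedavg(client_params, client_sizes):
    """Weighted average of client parameter dicts (weights = local data sizes)."""
    total = float(sum(client_sizes))
    keys = client_params[0].keys()
    return {k: sum(n * p[k] for n, p in zip(client_sizes, client_params)) / total
            for k in keys}

class ToyClient:
    """Stand-in for a vision-language client; local training is simulated."""
    def __init__(self, data_size, rng):
        self.data_size = data_size
        self.rng = rng

    def train(self, params):
        # Placeholder for local epochs on paired image-text data.
        return {k: v + 0.01 * self.rng.standard_normal(v.shape)
                for k, v in params.items()}

# Hypothetical global model: separate image and text encoder parameters.
rng = np.random.default_rng(0)
global_params = {"image_encoder": rng.standard_normal((4, 4)),
                 "text_encoder": rng.standard_normal((4, 4))}
clients = [ToyClient(data_size=n, rng=rng) for n in (100, 250, 50)]

for _ in range(3):  # three communication rounds
    updates = [c.train({k: v.copy() for k, v in global_params.items()})
               for c in clients]
    global_params = fedavg(updates, [c.data_size for c in clients])
```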
Several pioneering studies have explored the field of MFL in the context of vision–language tasks. In [2], the authors proposed aimNet and evaluated it under horizontal FL, vertical FL, and federated transfer learning (FTL) settings, with clients conducting either the VQA task or the image captioning task. CreamFL [3] utilized contrastive learning to ensemble the uploaded heterogeneous local models based on their output representations and allowed both unimodal and multimodal vision–language tasks in federated systems. pFedPrompt [4] adapted the prompt training method to bring large foundation models into federated learning systems so as to connect vision and language data. FedCMR [5] explored the federated cross-modal retrieval task and mitigated the representation space gap via weighted aggregation based on the local data amount and category number.
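To illustrate how the local data amount and category number can enter the aggregation weights, the snippet below gives a minimal sketch in the spirit of FedCMR [5], assuming each client's weight is proportional to the product of its sample count and the number of categories it observes; the exact weighting scheme used in FedCMR should be taken from the original paper.

```python
import numpy as np

def cmr_style_weights(sample_counts, category_counts):
    """Illustrative client weights combining local data amount and category coverage.

    A sketch in the spirit of FedCMR [5]; the paper's exact weighting scheme
    should be taken from the original work.
    """
    raw = np.asarray(sample_counts, dtype=float) * np.asarray(category_counts, dtype=float)
    return raw / raw.sum()

def weighted_aggregate(client_params, weights):
    """Convex combination of client parameter dicts using the given weights."""
    keys = client_params[0].keys()
    return {k: sum(w * p[k] for w, p in zip(weights, client_params)) for k in keys}

# Example: three clients differing in data amount and local label coverage.
weights = cmr_style_weights(sample_counts=[1200, 300, 800],
                            category_counts=[10, 4, 7])
print(weights)  # clients with more samples and broader coverage contribute more
```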
3. Human Activity Recognition
Wireless distributed sensor systems, such as IoT systems, in which multiple sensors provide consistent observations of the same object or event, are a significant application scenario for MFL. Human activity recognition (HAR) is one of the most representative tasks in this setting, owing to its privacy preservation requirements.
The data partition methods used for the HAR task in existing MFL works fall into two types: client-as-device and client-as-sensor. The former is represented by MMFed [6], which divides the multimodal data equally among clients and then performs multimodal fusion through a local co-attention mechanism. In contrast, Zhao et al. conducted their experiments by giving each client only a single modality [7]. The local network was divided into five modules, which were aggregated either modality-wise among clients holding the same modality or generally across all clients. However, the modality distribution and data partition method can vary according to hardware deployment and environmental factors.
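The following sketch illustrates the client-as-sensor aggregation pattern described above, assuming a simplified two-module split (a modality-specific encoder and a shared classifier) rather than the five modules used in [7]: encoders are averaged only among clients holding the same modality, while the classifier is averaged across all clients.

```python
import numpy as np
from collections import defaultdict

def average(param_list):
    """Unweighted average of a list of parameter dicts."""
    keys = param_list[0].keys()
    return {k: sum(p[k] for p in param_list) / len(param_list) for k in keys}

def aggregate_har(client_updates):
    """Modality-wise aggregation of encoders, general aggregation of the classifier.

    `client_updates` is a list of dicts such as
        {"modality": "imu", "encoder": {...}, "classifier": {...}},
    a simplified stand-in for the five-module split used in [7].
    """
    by_modality = defaultdict(list)
    for upd in client_updates:
        by_modality[upd["modality"]].append(upd["encoder"])

    # Encoders are averaged only among clients holding the same modality ...
    global_encoders = {m: average(encs) for m, encs in by_modality.items()}
    # ... while the classifier head is averaged across all clients.
    global_classifier = average([upd["classifier"] for upd in client_updates])
    return global_encoders, global_classifier

# Hypothetical usage with two modalities and random toy parameters.
rng = np.random.default_rng(0)
def make():
    return {"w": rng.standard_normal((3, 3))}
updates = [{"modality": "imu", "encoder": make(), "classifier": make()},
           {"modality": "skeleton", "encoder": make(), "classifier": make()},
           {"modality": "imu", "encoder": make(), "classifier": make()}]
global_encoders, global_classifier = aggregate_har(updates)
```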
4. Emotion Recognition
Emotion recognition plays a crucial role in improving social well-being and enhancing societal vitality. The multimodal data generated during the use of mobile phones often provide valuable insights for identifying users who may have underlying mental health issues. Effective emotion recognition algorithms can target specific users to enhance their experience and help prevent negative outcomes such as depression and suicide. However, multimedia data associated with user emotions, including chat records and personal photos, are highly privacy-sensitive. In this context, the MFL framework offers the capability of efficient collaborative training while ensuring privacy protection. Therefore, emotion recognition undoubtedly holds a significant position within the realm of MFL.
Several MFL works have investigated the emotion recognition task in the vertical and hybrid MFL settings. In [8], each client in the system held only one modality, and the unimodal encoders were trained on the local side. The proposed hierarchical aggregation method aggregated the encoders according to the modality type held by each client and utilized an attention-based method to align the decoder weights regardless of the data modality. The FedMSplit approach [9] utilized a dynamic, multiview graph structure to flexibly capture the correlations among client models in a multimodal setting. Liang et al. in [10] proposed a decentralized privacy-preserving representation learning method that used multimodal behavioral markers to predict users' daily moods and identify an early risk of suicide.
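The snippet below gives a loose sketch of attention-based decoder alignment in the spirit of [8], assuming attention scores are derived from the cosine similarity between each client's flattened decoder weights and their mean; the actual alignment procedure is described in the original paper, and the cosine-based scoring here is an assumption for illustration.

```python
import numpy as np

def attention_align(decoder_weights):
    """Attention-style weighted averaging of client decoder weights.

    A loose sketch of the hierarchical aggregation idea in [8]: clients whose
    decoders lie closer to the current mean receive larger attention scores,
    regardless of which modality they hold. The cosine-similarity scoring is
    an assumption made here for illustration.
    """
    flat = np.stack([w.ravel() for w in decoder_weights])   # (num_clients, dim)
    mean = flat.mean(axis=0)
    cos = flat @ mean / (np.linalg.norm(flat, axis=1) * np.linalg.norm(mean) + 1e-12)
    attn = np.exp(cos) / np.exp(cos).sum()                   # softmax over clients
    aligned = (attn[:, None] * flat).sum(axis=0)
    return aligned.reshape(decoder_weights[0].shape), attn

# Example: three clients with different modalities but a common decoder shape.
rng = np.random.default_rng(0)
decoders = [rng.standard_normal((8, 8)) for _ in range(3)]
global_decoder, attention_scores = attention_align(decoders)
```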
5. Healthcare
Numerous healthcare centers and hospitals have accumulated vast amounts of multimodal data during patient consultations and treatments, including X-ray images, CT scans, physician diagnoses, and physiological measurements. These multimodal data are typically tightly linked to patient identifiers and require stringent privacy protection measures. As a result, healthcare institutions have formed isolated data islands, impeding direct collaboration through cooperative training or data sharing via open databases. This gives rise to a series of crucial challenges for multimodal federated learning, encompassing tasks such as AI-assisted diagnosis, medical image analysis, and laboratory report generation.
Some works in the field of healthcare have explored multimodal federated learning, often assuming either that all institutions have the same set of modalities, referred to as horizontal MFL, or that each institution possesses only a single modality, known as vertical MFL. Agbley et al. in [11] applied federated learning to the prediction of melanoma and obtained performance on par with the centralized training results. FedNorm [12] applied modality-based normalization to enhance liver segmentation and was trained with unimodal clients holding CT and MRI data, respectively. Qayyum et al. utilized clustered federated learning for the automatic diagnosis of COVID-19 [13], where each cluster contained healthcare entities that held the same modality, such as X-ray or ultrasound data.
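As a rough illustration of this clustered setting, the sketch below groups client updates by modality and maintains a separate federated average per cluster; the `modality` and `num_samples` fields and the per-cluster FedAvg are simplifying assumptions made here, not the exact procedure of [13].

```python
import numpy as np
from collections import defaultdict

def fedavg(param_list, sizes):
    """Data-size-weighted average of parameter dicts."""
    total = float(sum(sizes))
    keys = param_list[0].keys()
    return {k: sum(n * p[k] for n, p in zip(sizes, param_list)) / total
            for k in keys}

def clustered_round(client_updates):
    """One round of modality-clustered aggregation.

    `client_updates` is a list of dicts such as
        {"modality": "xray", "params": {...}, "num_samples": 500}.
    Clients are grouped by modality and each cluster maintains its own global
    model, a simplified reading of the clustered setup in [13].
    """
    clusters = defaultdict(list)
    for upd in client_updates:
        clusters[upd["modality"]].append(upd)
    return {m: fedavg([u["params"] for u in members],
                      [u["num_samples"] for u in members])
            for m, members in clusters.items()}

# Hypothetical usage: X-ray and ultrasound clients form two separate clusters.
rng = np.random.default_rng(0)
updates = [{"modality": m, "params": {"w": rng.standard_normal((3, 3))}, "num_samples": n}
           for m, n in [("xray", 500), ("xray", 300), ("ultrasound", 200)]]
cluster_models = clustered_round(updates)
```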