Learning paradigms are methodologies that guide how a problem is approached and solved. In addition to the most widely used supervised learning paradigm, other paradigms are also employed in the multimodal field, such as semi-supervised learning, self-supervised learning, and transfer learning.
1. Semi-Supervised Learning
Semi-supervised learning is a learning paradigm concerned with the study of how computers and natural systems such as humans learn in the presence of both labeled and unlabeled data
[1]. It can use readily available unlabeled data to improve supervised learning tasks when the labeled data are scarce or expensive. The semi-supervised learning paradigm is important in multimodal learning because aligned and structured multimodal datasets are often expensive and difficult to obtain.
Guillaumin et al. [2] present an early example of successful multimodal image classification using non-deep-learning methods, in which textual information assists the classification of unlabeled images. This suggests the potential for complementarity between modalities. Cheng et al. [3][4] apply this learning paradigm to the RGB-D object recognition task. Their idea is to train an RGB-based classifier and a depth-based classifier separately on the labeled dataset and to design a fusion module that combines their outputs. For the unlabeled dataset, they first obtain the predictions of the RGB and depth streams and then exchange them, using each stream's predictions as pseudo-labels for the other stream, thereby achieving semi-supervision. This method exploits a simple form of cross-validation between modalities, but it does not necessarily transfer to other modality combinations such as text and image. In [5][6], methods for vision–language mapping based on the variational auto-encoding Bayes framework are extended to a semi-supervised model for an image–sentence mapping task.
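To make the cross-modal pseudo-labeling idea more concrete, the sketch below shows one possible training step in PyTorch; the confidence threshold, the function name, and the assumption that a single optimizer covers both streams are illustrative choices, not the exact procedure of Cheng et al.

```python
import torch
import torch.nn.functional as F

def cross_modal_pseudo_label_step(rgb_model, depth_model, rgb_unlab, depth_unlab,
                                  optimizer, threshold=0.9):
    """One semi-supervised step on an unlabeled RGB-D batch: each stream's
    confident predictions serve as pseudo-labels for the other stream.
    `optimizer` is assumed to cover the parameters of both models."""
    with torch.no_grad():
        rgb_conf, rgb_pred = F.softmax(rgb_model(rgb_unlab), dim=1).max(dim=1)
        depth_conf, depth_pred = F.softmax(depth_model(depth_unlab), dim=1).max(dim=1)

    loss = rgb_unlab.new_zeros(())
    # Exchange predictions: confident depth predictions supervise the RGB stream ...
    mask = depth_conf > threshold
    if mask.any():
        loss = loss + F.cross_entropy(rgb_model(rgb_unlab[mask]), depth_pred[mask])
    # ... and confident RGB predictions supervise the depth stream.
    mask = rgb_conf > threshold
    if mask.any():
        loss = loss + F.cross_entropy(depth_model(depth_unlab[mask]), rgb_pred[mask])

    if loss.requires_grad:  # skip the update if no prediction was confident enough
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return loss.item()
```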
2. Self-Supervised Learning
The self-supervised paradigm
[7] can be viewed as a special form of unsupervised learning that retains a supervised objective, where supervision is induced by pretext tasks rather than preset prior knowledge. In contrast to a completely unsupervised setting, self-supervised learning uses information from the dataset itself to construct pseudo-labels. In terms of representation learning, self-supervised learning has great potential to replace fully supervised learning. As an example of self-supervised signals within a single modality, Taleb et al. [8] cut an image into patches of uniform size, shuffle their order, and train a network to reassemble the shuffled patches into the original image, which resembles solving a jigsaw puzzle. Training the network on this puzzle allows it to learn deep features of the image in a self-supervised manner, thereby improving performance on downstream tasks such as segmentation and classification.
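A minimal sketch of how such a jigsaw pretext task can be set up is given below; predicting the index of the applied permutation is one common instantiation of the reassembly objective, and the grid size and permutation set here are illustrative assumptions.

```python
import random
import torch

# A small fixed set of patch orderings keeps the pretext task a classification problem.
GRID = 3
PERMUTATIONS = [tuple(random.sample(range(GRID * GRID), GRID * GRID)) for _ in range(10)]

def make_jigsaw_example(image):
    """Cut an image tensor (C, H, W) into a GRID x GRID set of patches, shuffle them
    with a randomly chosen permutation, and return the shuffled patches together with
    the permutation index, which acts as the self-supervised label."""
    C, H, W = image.shape
    ph, pw = H // GRID, W // GRID
    patches = [image[:, i * ph:(i + 1) * ph, j * pw:(j + 1) * pw]
               for i in range(GRID) for j in range(GRID)]
    label = random.randrange(len(PERMUTATIONS))
    shuffled = torch.stack([patches[k] for k in PERMUTATIONS[label]])
    # The network is trained to predict `label` (i.e., how to put the patches back),
    # which is a common way of casting the reassembly objective as classification.
    return shuffled, label
```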
This learning paradigm is especially suitable for the multimodal domain because, in multimodal learning, not only does each individual modality generate self-supervised signals, but the alignment and constraints between modalities are also important sources of such signals. These rich self-supervised signals make multimodal self-supervised learning possible. Tamkin et al. [9] introduce a domain-agnostic benchmark for self-supervised learning (DABS) covering seven diverse domains: realistic images, multichannel sensor data, English text, speech recordings, multilingual text, chest X-rays, and images with text descriptions. It is an attempt to establish a standard benchmark for the field. Valverde et al. [10] present a novel self-supervised framework in which multiple teachers leveraging diverse modalities, including RGB, depth, and thermal images, simultaneously exploit complementary cues and distill their knowledge into a single audio student network. This work also shows that a single modality can be sufficiently robust on some multimodal tasks when other modalities assist during training. Coen et al. [11] likewise train with signals transferred across modalities. Gomez et al. [12] use textual information to train a CNN [13] to extract features from unlabeled images. The motivation is that textual descriptions and annotations are easier to obtain than manually labeled images. The first step is to learn image topics through latent Dirichlet allocation [14], and then to train the parameters of the image feature extraction network based on these topics.
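As an illustration of the multi-teacher distillation idea in Valverde et al. [10], the following sketch shows a soft-target distillation loss from several frozen unimodal teachers into an audio student; the temperature and averaging scheme are illustrative assumptions rather than the authors' exact design.

```python
import torch.nn.functional as F

def multi_teacher_distillation_loss(student_logits, teacher_logits_list, temperature=4.0):
    """Distill several frozen unimodal teachers (e.g., RGB, depth, and thermal networks)
    into a single audio student by matching softened class distributions; no ground-truth
    labels are required, so the objective is self-supervised."""
    student_logp = F.log_softmax(student_logits / temperature, dim=1)
    loss = 0.0
    for teacher_logits in teacher_logits_list:
        teacher_p = F.softmax(teacher_logits.detach() / temperature, dim=1)
        loss = loss + F.kl_div(student_logp, teacher_p, reduction="batchmean")
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return (temperature ** 2) * loss / len(teacher_logits_list)
```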
In the video field, a considerable amount of work has also appeared in recent years. Afouras et al. [15] demonstrate that object embeddings obtained from a self-supervised network facilitate a number of downstream audio-visual tasks that previously required hand-engineered supervised pipelines. Asano et al. [16] propose a novel clustering method that pseudo-labels a video dataset without any human annotations by leveraging the natural correspondence between the audio and visual modalities. Specifically, a cluster-labeling function is learned without access to any ground-truth annotations, and the algorithm treats each modality as equally informative in order to learn a more robust model. Alayrac et al.
[17] extract features from the video and audio signals and apply a contrastive loss between them, and then fuse the video and audio features before applying a second contrastive loss against the text features. The advantage of this design is that modalities are compared at matching semantic levels, because the semantics of text are usually more abstract than those of raw video and audio. Cheng et al. [18] separate the audio and visual streams of a video and determine whether they come from the same video, turning self-supervised learning into a binary classification problem. Alwassel et al. [19] conduct a comprehensive study of self-supervised clustering methods for the video and audio modalities. They propose four approaches, namely single-modality deep clustering (SDC), multihead deep clustering (MDC), concatenation deep clustering (CDC), and cross-modal deep clustering (XDC). These approaches differ in how intramodal and intermodal supervisory signals are used as the clustering algorithm iterates.
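The audio-visual correspondence objective of Cheng et al. [18] can be sketched as a simple binary classification over matched and mismatched pairs; the mismatching-by-shuffling strategy and the classifier interface below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def av_correspondence_loss(video_emb, audio_emb, classifier):
    """Binary audio-visual correspondence: positive pairs take the visual and audio
    embeddings from the same video, negatives pair each video with a shuffled audio
    embedding from the batch (accidental matches are ignored here for simplicity).
    `classifier` is assumed to map a concatenated pair to a single logit."""
    batch = video_emb.size(0)
    perm = torch.randperm(batch, device=video_emb.device)
    pos_logits = classifier(torch.cat([video_emb, audio_emb], dim=1))
    neg_logits = classifier(torch.cat([video_emb, audio_emb[perm]], dim=1))
    logits = torch.cat([pos_logits, neg_logits], dim=0).squeeze(-1)
    labels = torch.cat([torch.ones(batch), torch.zeros(batch)]).to(logits.device)
    return F.binary_cross_entropy_with_logits(logits, labels)
```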
3. Transfer Learning
Transfer learning
[20] is an indispensable part of today’s deep learning field. The essence of transfer learning is to adapt model parameters that have been trained on a source domain to a target domain. Since the datasets for downstream tasks are often relatively small in practical applications, training on them directly tends to cause overfitting or makes training difficult. Taking natural language processing as an example, the approach that has developed in recent years is to train on large-scale datasets and then transfer the pretrained models to downstream tasks. Such pretrained models often have a large number of parameters; examples include BERT
[21], GPT
[22], GPT-2
[23], and GPT-3
[24]. After the success of transfer learning in natural language processing, various pretrained models have sprung up in other unimodal settings, such as ViT
[25] in the field of computer vision and Wave2Vec
[26] in the field of speech. There has been extensive work showing that they benefit downstream unimodal tasks in performance and efficiency.
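As a minimal illustration of the pretrain-then-transfer recipe, the sketch below fine-tunes a pretrained text encoder on a small downstream classification task, assuming the Hugging Face Transformers library; the choice of checkpoint, the linear head, and the frozen-encoder setting are illustrative.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Load a pretrained encoder and attach a small task-specific head.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
head = torch.nn.Linear(encoder.config.hidden_size, 2)  # e.g., a binary downstream task

# With a small target dataset, freezing the encoder and training only the head
# is one way to avoid overfitting; full fine-tuning is the alternative.
for p in encoder.parameters():
    p.requires_grad = False

def predict(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state[:, 0]  # [CLS] token representation
    return head(hidden)

optimizer = torch.optim.AdamW(head.parameters(), lr=2e-5)
```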
Transfer learning is even more important in the multimodal field, since aligned multimodal data are scarce and expensive. A large number of downstream tasks in the multimodal field rely on transfer learning, e.g., [10][27][28][29]. Hu et al. [30] share the same model parameters across all tasks instead of separately fine-tuning task-specific models, and thereby handle a much wider variety of tasks across different domains. This section describes the different methods of multimodal transfer learning. An important type of transfer learning is to unify vision and language features in a shared hidden space to generate a common representation on the source domain, and then adapt this common representation to the target domain
[31][32][33][34]. Such methods can be subdivided into non-sequence-based settings, such as image–text, and sequence-based settings, such as video–text and video–audio.
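A minimal sketch of the shared-hidden-space idea is shown below: image and text features are projected into a common space and aligned with a symmetric contrastive loss; the feature dimensions, temperature, and loss form are illustrative assumptions rather than any specific cited model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedSpaceProjector(nn.Module):
    """Project image and text features into one shared hidden space and align paired
    features with a symmetric contrastive loss, yielding a common representation that
    can later be adapted to a target domain."""
    def __init__(self, img_dim=2048, txt_dim=768, shared_dim=512, temperature=0.07):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, shared_dim)
        self.txt_proj = nn.Linear(txt_dim, shared_dim)
        self.temperature = temperature

    def forward(self, img_feat, txt_feat):
        z_img = F.normalize(self.img_proj(img_feat), dim=-1)
        z_txt = F.normalize(self.txt_proj(txt_feat), dim=-1)
        logits = z_img @ z_txt.t() / self.temperature
        targets = torch.arange(z_img.size(0), device=logits.device)
        # Matched image-text pairs sit on the diagonal of the similarity matrix.
        loss = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
        return z_img, z_txt, loss
```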
Non-sequence-based. Rahman et al.
[35] believe that non-text modalities (vision and audio) affect the meaning of words and therefore the positions of word feature vectors in semantic space [36], so the non-text and text modalities jointly determine a word vector’s new position in that space. This is a method of assisting the transfer learning of text with information from other modalities. Gan et al.
[37] propose a method to enhance the generalization ability of models using large-scale adversarial training
[38], which consists of two stages, pretraining and transfer learning. It is a general framework that can be applied to any multimodal pretrained model to improve its generalization.
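The word-shifting idea of Rahman et al. [35] can be sketched as a gated displacement of the text embedding computed from the visual and acoustic features; the module below is loosely inspired by that idea, and its layer shapes and gating form are illustrative, not the authors' exact formulation.

```python
import torch
import torch.nn as nn

class MultimodalShift(nn.Module):
    """Shift a word's text embedding by a gated displacement computed from the
    accompanying visual and acoustic features, so that the non-text modalities
    adjust the word's position in semantic space."""
    def __init__(self, txt_dim=768, vis_dim=64, aud_dim=64):
        super().__init__()
        self.gate_v = nn.Linear(txt_dim + vis_dim, txt_dim)
        self.gate_a = nn.Linear(txt_dim + aud_dim, txt_dim)
        self.shift_v = nn.Linear(vis_dim, txt_dim)
        self.shift_a = nn.Linear(aud_dim, txt_dim)

    def forward(self, txt, vis, aud):
        g_v = torch.sigmoid(self.gate_v(torch.cat([txt, vis], dim=-1)))
        g_a = torch.sigmoid(self.gate_a(torch.cat([txt, aud], dim=-1)))
        displacement = g_v * self.shift_v(vis) + g_a * self.shift_a(aud)
        # The new position is determined jointly by the text and non-text modalities.
        return txt + displacement
```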
Sequence-based. Unlike non-sequential tasks, sequential tasks, typified by video, pose additional difficulties for transfer learning. Consecutive clips usually share similar semantics from consecutive scenes, which means that sparsely sampled clips already contain the critical visual and semantic information of the video. Therefore, a small number of clips is sufficient to stand in for the entire video during training. Based on this observation, a large body of work [39][40] randomly samples clips from the video for training. Many approaches extract features from the text input and the sampled video clips separately and then aggregate them before the prediction layer. Lei et al. [41] propose instead to combine each frame’s visual information with the textual information in an “early fusion” manner and finally aggregate the resulting per-frame predictions. Sun et al.
[42] propose to convert video frames into discrete token sequences by applying hierarchical vector quantization, generating a sequence of “visual words” that can be aligned with the text. The model is then trained in a self-supervised manner with a masked language modeling objective similar to BERT. This approach of converting visual information into “visual words” is also reflected in
[43], which is a good solution for aligning different modal representations in transfer learning.
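A minimal sketch of the sparse clip sampling strategy described above is given below; the number of clips, the clip length, and the aggregation comment are illustrative assumptions.

```python
import random
import torch

def sample_sparse_clips(video_frames, num_clips=4, clip_len=8):
    """Sparsely sample a few short clips from a video tensor (T, C, H, W); since
    consecutive clips carry similar semantics, a handful of clips can stand in for
    the whole video during training (assumes T >= clip_len)."""
    T = video_frames.shape[0]
    starts = [random.randint(0, T - clip_len) for _ in range(num_clips)]
    clips = [video_frames[s:s + clip_len] for s in starts]
    # Per-clip predictions are typically computed independently and aggregated
    # (e.g., averaged) before or at the final prediction layer.
    return torch.stack(clips)  # (num_clips, clip_len, C, H, W)
```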
In conclusion, multimodal learning allows a rich set of learning paradigms to be used. In the absence of supervised signals, the complementary and alignment information between modalities can serve as an alternative source of supervision for self-supervised and semi-supervised learning. Multimodal transfer learning is also more diverse and generalizable.