Learning Paradigms: Comparison

Learning paradigms are more like methodologies that guide problem solving. In addition to the most widely used supervised learning paradigm, other learning paradigms are also employed in the multimodal field, such as semi-supervised learning, self-supervised learning, and transfer learning. 


1. Semi-Supervised Learning

Semi-supervised learning is a learning paradigm concerned with the study of how computers and natural systems such as humans learn in the presence of both labeled and unlabeled data [1]. It can use readily available unlabeled data to improve supervised learning tasks when labeled data are scarce or expensive. The semi-supervised learning paradigm is important in multimodal learning because aligned and structured multimodal datasets are often expensive and difficult to obtain.
Guillaumin et al. [2] present an early example of successful multimodal image classification using non-deep-learning methods, in which textual information is used to assist the classification of unlabeled images. This suggests the potential for complementarity between modalities, which will be discussed later in this article. Cheng et al. [3][4] apply this learning paradigm to the RGB-D object recognition task. Their idea is to train an RGB-based and a depth-based classifier separately on the labeled dataset and design a fusion module to obtain the final result. For the unlabeled dataset, they first obtain the prediction results of the RGB and depth streams, respectively, and then exchange them as pseudo-labels for training the other stream, thereby achieving semi-supervision. This method naively exploits the possibility of cross-validation between modalities, but it does not necessarily work well for other multimodal combinations such as text and image. In [5][6], methods applied to vision–language mapping with the variational auto-encoding Bayes framework are extended to a semi-supervised model for an image–sentence mapping task.
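The cross-modal pseudo-label exchange described above can be sketched in a toy form. The nearest-centroid "classifiers" and the scalar features below are illustrative stand-ins, not the networks used by Cheng et al.; the point is the training loop, in which each stream labels the unlabeled data for the other.

```python
def predict(centroids, x):
    # nearest-centroid classification on a 1-D feature
    return min(centroids, key=lambda c: abs(x - centroids[c]))

def update(centroids, x, y, lr=0.2):
    # move the centroid of class y toward the example
    centroids[y] += lr * (x - centroids[y])

# each sample has an RGB feature and a depth feature (toy scalars)
labeled = [((0.1, 0.2), 0), ((0.9, 0.8), 1), ((0.2, 0.1), 0), ((0.8, 0.9), 1)]
unlabeled = [(0.15, 0.12), (0.85, 0.88), (0.05, 0.2), (0.95, 0.81)]

rgb = {0: 0.0, 1: 1.0}    # per-class centroids for the RGB stream
depth = {0: 0.0, 1: 1.0}  # per-class centroids for the depth stream

# step 1: supervised training of each stream on the labeled set
for (x_rgb, x_d), y in labeled:
    update(rgb, x_rgb, y)
    update(depth, x_d, y)

# step 2: on the unlabeled set, each stream's prediction becomes
# the pseudo-label used to train the other stream
for x_rgb, x_d in unlabeled:
    pseudo_from_rgb = predict(rgb, x_rgb)
    pseudo_from_depth = predict(depth, x_d)
    update(depth, x_d, pseudo_from_rgb)
    update(rgb, x_rgb, pseudo_from_depth)

print(predict(rgb, 0.9), predict(depth, 0.1))  # -> 1 0
```

In the actual method, a fusion module combines the two streams at test time; here the exchange step alone illustrates how unlabeled data refine both classifiers.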

2. Self-Supervised Learning

The self-supervised paradigm [7] can be viewed as a special form of unsupervised learning with a supervised form, where supervision is induced by self-supervised tasks rather than preset prior knowledge. In contrast to a completely unsupervised setting, self-supervised learning uses information from the dataset itself to construct pseudo-labels. In terms of representation learning, self-supervised learning has great potential to replace fully supervised learning. As an example of self-supervised signals within a single modality, Taleb et al. [8] cut an image into patches of uniform size, shuffle their order, and train a network to stitch the shuffled patches back into the original image, similar to solving a jigsaw puzzle. Training the network to solve the jigsaw puzzle allows it to learn deep features of the image in a self-supervised manner, thereby improving its performance on downstream tasks such as segmentation and classification.
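The jigsaw pretext task can be illustrated with a minimal sketch. The 4×4 "image", the 2×2 patches, and the fixed permutation vocabulary are illustrative choices; a network would take the shuffled patches as input and predict the permutation index, which serves as the free pseudo-label.

```python
import random
from itertools import permutations

random.seed(0)

# a 4x4 "image" stored as a flat list of pixel values
image = list(range(16))

def to_patches(img, side=4, patch=2):
    """Cut a flat side x side image into non-overlapping patch x patch tiles."""
    tiles = []
    for pr in range(0, side, patch):
        for pc in range(0, side, patch):
            tiles.append([img[(pr + r) * side + (pc + c)]
                          for r in range(patch) for c in range(patch)])
    return tiles

# a fixed vocabulary of shuffle orders; the pretext label is the index
perms = list(permutations(range(4)))

def make_pretext_sample(img):
    """Shuffle the patches; the permutation index is the pseudo-label."""
    patches = to_patches(img)
    label = random.randrange(len(perms))
    shuffled = [patches[i] for i in perms[label]]
    return shuffled, label

x, y = make_pretext_sample(image)

# knowing the label, the original patch order can be recovered exactly,
# which is what the network is trained to do implicitly
restored = [None] * 4
for pos, src in enumerate(perms[y]):
    restored[src] = x[pos]
assert restored == to_patches(image)
```

No human annotation is consumed: the (shuffled patches, permutation index) pairs are generated entirely from the data itself.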
This learning paradigm is especially suitable for the multimodal domain because, in multimodal learning, not only does each single modality generate self-supervised signals, but the alignment and constraints between modalities are also an important source of such signals. These rich self-supervised signals enable multimodal self-supervised learning. Tamkin et al. [9] introduce a Domain-Agnostic Benchmark for Self-supervised learning (DABS) covering seven diverse domains: realistic images, multichannel sensor data, English text, speech recordings, multilingual text, chest X-rays, and images with text descriptions. It is an attempt to create an up-to-date benchmark for the field. Valverde et al. [10] present a novel self-supervised framework consisting of multiple teachers that leverage diverse modalities, including RGB, depth, and thermal images, to simultaneously exploit complementary cues and distill knowledge into a single audio student network. This work also shows that a single modality can be sufficiently robust on some multimodal tasks when trained with the assistance of other modalities. Coen [11] also trains with signals transported across modalities. Gomez et al. [12] use textual information to train a CNN [13] to extract features from unlabeled images, motivated by the observation that textual descriptions and annotations are easier to obtain than images. The first step is to learn image topics through latent Dirichlet allocation [14], and then to train the image feature extraction network on these topics.
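A minimal sketch of the multi-teacher distillation idea mentioned above (not Valverde et al.'s actual losses or networks, which are more involved): each teacher emits soft predictions from its own modality, and the student, which sees only audio, takes gradient steps on the mean-squared error to the ensemble average.

```python
def ensemble(teacher_outputs):
    """Average the teachers' soft predictions class by class."""
    n = len(teacher_outputs)
    k = len(teacher_outputs[0])
    return [sum(t[i] for t in teacher_outputs) / n for i in range(k)]

def distill_step(student_out, target, lr=0.2):
    """One gradient step on the MSE between student and ensemble target."""
    return [s - lr * 2 * (s - t) for s, t in zip(student_out, target)]

# soft predictions over two classes from RGB, depth, and thermal teachers
teachers = [[0.8, 0.2], [0.6, 0.4], [0.7, 0.3]]
target = ensemble(teachers)  # [0.7, 0.3]

student = [0.5, 0.5]  # the audio student's initial (untrained) output
for _ in range(30):
    student = distill_step(student, target)
# the student's output converges toward the ensemble target [0.7, 0.3]
```

The complementary cues of the teachers survive in the averaged target, so the audio-only student inherits knowledge it could not extract from audio alone.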
In the video field, related work has also appeared in recent years. Afouras et al. [15] demonstrate that object embeddings obtained from a self-supervised network facilitate a number of downstream audio-visual tasks that previously required hand-engineered supervised pipelines. Asano et al. [16] propose a novel clustering method that allows pseudo-labeling of a video dataset without any human annotations by leveraging the natural correspondence between the audio and visual modalities. Specifically, they learn a clustering labeling function without access to any ground-truth label annotations, treating each modality as equally informative in order to learn a more robust model. Alayrac et al. [17] extract video and audio features under a contrastive loss and then fuse the video and audio features with text features under a further contrastive loss. The advantage of this method is that parts with the same semantic level can be aligned when contrasting modalities, because the semantics of text is often at a higher level than that of video and audio. Cheng et al. [18] separate the audio and visual streams of a video and determine whether they come from the same video, turning self-supervised learning into a binary classification problem. Alwassel et al. [19] conduct a comprehensive study of self-supervised clustering methods for the video and audio modalities. They propose four approaches, namely single-modality deep clustering (SDC), multihead deep clustering (MDC), concatenation deep clustering (CDC), and cross-modal deep clustering (XDC). These approaches differ in how intramodal and intermodal supervisory signals are utilized as the clustering algorithm iterates.
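The correspondence pretext task used by Cheng et al. can be sketched by how its training pairs are built; the string identifiers below are placeholders for actual visual and audio clips. Aligned pairs from the same video become positives, and mismatched pairs become negatives for a binary classifier.

```python
import random

random.seed(0)

# each video contributes an aligned (visual clip, audio clip) pair;
# the identifiers stand in for actual feature tensors
videos = [("vis0", "aud0"), ("vis1", "aud1"), ("vis2", "aud2")]

def make_sync_batch(videos):
    """Positives: visual + its own audio. Negatives: audio from another video."""
    batch = []
    for i, (vis, aud) in enumerate(videos):
        batch.append(((vis, aud), 1))  # label 1: streams are from the same video
        j = random.choice([k for k in range(len(videos)) if k != i])
        batch.append(((vis, videos[j][1]), 0))  # label 0: mismatched audio
    return batch

batch = make_sync_batch(videos)
# a binary classifier is then trained on these (pair, label) examples
```

The labels come for free from the natural alignment of the two streams, which is exactly what makes this a self-supervised signal.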

3. Transfer Learning

Transfer learning [20] is an indispensable part of today’s deep learning field. The essence of transfer learning is to adapt model parameters that have been trained on a source domain to a target domain. Since the datasets of downstream tasks are often relatively small in practical applications, training directly on them leads to overfitting or difficulty in training. Taking natural language processing as an example, the approach that has developed in recent years is to train on large-scale datasets and then transfer the pretrained models to downstream tasks. Such pretrained models often have a large number of parameters, such as BERT [21], GPT [22], GPT-2 [23], and GPT-3 [24]. After the success of transfer learning in natural language processing, various pretrained models have sprung up in unimodal settings, such as ViT [25] in computer vision and wav2vec [26] in speech. Extensive work has shown that they benefit downstream unimodal tasks in both performance and efficiency.
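The fine-tuning recipe behind this paradigm can be sketched in a toy form: a "pretrained" feature extractor is kept frozen, and only a small task-specific head is trained on the target data. The quadratic features and the tiny three-point dataset below are illustrative assumptions, standing in for a large pretrained backbone and a small downstream dataset.

```python
def backbone(x):
    """Frozen 'pretrained' feature extractor: its parameters are never updated."""
    return [x, x * x]

# trainable task-specific head: a linear layer on top of the frozen features
w, b = [0.0, 0.0], 0.0

def head(x):
    f = backbone(x)
    return sum(wi * fi for wi, fi in zip(w, f)) + b

# small target-domain dataset: y = 1 + x + x^2
data = [(1.0, 3.0), (2.0, 7.0), (3.0, 13.0)]

# fine-tune only the head with plain SGD on the squared error;
# the backbone's features make the small dataset easy to fit
for _ in range(20000):
    for x, y in data:
        f = backbone(x)
        err = head(x) - y
        w = [wi - 0.01 * err * fi for wi, fi in zip(w, f)]
        b -= 0.01 * err
```

Because only the head's few parameters are updated, the small dataset does not cause the catastrophic overfitting that training the full model from scratch would risk.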
This need is even more pressing in the multimodal field, since aligned multimodal data are rare and expensive. A large number of downstream tasks in the multimodal field rely on transfer learning, e.g., [10][27][28][29]. Hu et al. [30] share the same model parameters across all tasks instead of fine-tuning task-specific models separately and handle a much higher variety of tasks across different domains. This section describes the different methods of multimodal transfer learning. An important type of transfer learning is to unify vision and language features into a shared hidden space to generate a common representation for the source domain and then adapt the common representation to the target domain [31][32][33][34]. It can be subdivided into non-sequence-based methods, such as image–text, and sequence-based methods, such as video–text and video–audio.
Non-sequence-based. Rahman et al. [35] argue that non-text modalities (vision and audio) affect the meaning of words and thus the position of feature vectors in the semantic space [36], so the non-text and text modalities jointly determine the new position of a feature vector in that space. This is a method of assisting the transfer learning of text with information from other modalities. Gan et al. [37] propose a method to enhance the generalization ability of models using large-scale adversarial training [38], which consists of two steps: pretraining and transfer learning. It is a general framework that can be applied to any multimodal pretrained model to improve its generalization ability.
Sequence-based. Compared with non-sequential tasks, sequential tasks represented by videos pose more difficulties for transfer learning. Consecutive clips usually contain similar semantics from consecutive scenes, which means that sparsely sampled clips already contain the critical visual and semantic information of the video. Therefore, a small number of clips are sufficient to replace the entire video for training. Based on this, a large part of the work [39][40] randomly samples clips from the video for training. Many approaches extract features from the text input and from the sampled clips separately and then aggregate them before the prediction layer. Lei et al. [41] propose to constrain each frame of video information with textual information in an “early fusion” manner and finally summarize the resulting per-frame predictions. Sun et al. [42] propose to convert video frames into discrete token sequences by applying hierarchical vector-quantized features, generating a sequence of “visual words” that are aligned with the text; furthermore, the model is self-supervised by a masked language modeling method similar to BERT. This way of converting visual information into “visual words” is also reflected in [43] and is a good solution for aligning representations of different modalities in transfer learning.
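The sparse sampling step can be sketched as follows. The frame count, clip length, and number of clips are illustrative, and a real pipeline would decode and encode the chosen frames rather than return their indices.

```python
import random

random.seed(0)

def sample_clips(num_frames, clip_len=8, num_clips=2):
    """Randomly pick a few short clips instead of using the whole video."""
    # choose distinct clip start positions that fit inside the video
    starts = random.sample(range(num_frames - clip_len + 1), num_clips)
    return [list(range(s, s + clip_len)) for s in sorted(starts)]

# a 300-frame video is represented by just 2 x 8 = 16 sampled frames
clips = sample_clips(num_frames=300)
```

Because consecutive scenes share semantics, training on these few clips approximates training on the full video at a fraction of the cost.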
In conclusion, multimodal learning allows the use of rich learning paradigms. In the absence of supervised signals, complementary and alignment information between modalities can serve as an alternative source of self-supervision and semi-supervision. Multimodal transfer learning is also more diverse and generalizable.

References

  1. Zhu, X.; Goldberg, A.B. Introduction to semi-supervised learning. Synth. Lect. Artif. Intell. Mach. Learn. 2009, 3, 1–130.
  2. Guillaumin, M.; Verbeek, J.; Schmid, C. Multimodal semi-supervised learning for image classification. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, 13–18 June 2010; pp. 902–909.
  3. Cheng, Y.; Zhao, X.; Cai, R.; Li, Z.; Huang, K.; Rui, Y. Semi-Supervised Multimodal Deep Learning for RGB-D Object Recognition. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, New York, NY, USA, 9–15 July 2016; pp. 3345–3351.
  4. Cheng, Y.; Zhao, X.; Huang, K.; Tan, T. Semi-supervised learning for rgb-d object recognition. In Proceedings of the 2014 22nd International Conference on Pattern Recognition, Stockholm, Sweden, 24–28 August 2014; pp. 2377–2382.
  5. Tian, D.; Gong, M.; Zhou, D.; Shi, J.; Lei, Y. Semi-supervised multimodal hashing. arXiv 2017, arXiv:1712.03404.
  6. Shen, Y.; Zhang, L.; Shao, L. Semi-supervised vision-language mapping via variational learning. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; pp. 1349–1354.
  7. Liu, X.; Zhang, F.; Hou, Z.; Mian, L.; Wang, Z.; Zhang, J.; Tang, J. Self-supervised learning: Generative or contrastive. IEEE Trans. Knowl. Data Eng. 2021, Early Access.
  8. Taleb, A.; Lippert, C.; Klein, T.; Nabi, M. Multimodal self-supervised learning for medical image analysis. In International Conference on Information Processing in Medical Imaging; Springer: Berlin/Heidelberg, Germany, 2021; pp. 661–673.
  9. Tamkin, A.; Liu, V.; Lu, R.; Fein, D.; Schultz, C.; Goodman, N. DABS: A Domain-Agnostic Benchmark for Self-Supervised Learning. arXiv 2021, arXiv:2111.12062.
  10. Valverde, F.R.; Hurtado, J.V.; Valada, A. There is more than meets the eye: Self-supervised multi-object detection and tracking with sound by distilling multimodal knowledge. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 11612–11621.
  11. Coen, M.H. Multimodal Dynamics: Self-Supervised Learning in Perceptual and Motor Systems. Ph.D. Thesis, Massachusetts Institute of Technology, Cambridge, MA, USA, 2006.
  12. Gomez, L.; Patel, Y.; Rusinol, M.; Karatzas, D.; Jawahar, C. Self-supervised learning of visual features through embedding images into text topic spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4230–4239.
  13. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25, 84–90.
  14. Blei, D.M.; Ng, A.Y.; Jordan, M.I. Latent dirichlet allocation. J. Mach. Learn. Res. 2003, 3, 993–1022.
  15. Afouras, T.; Owens, A.; Chung, J.S.; Zisserman, A. Self-supervised learning of audio-visual objects from video. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2020; pp. 208–224.
  16. Asano, Y.; Patrick, M.; Rupprecht, C.; Vedaldi, A. Labelling unlabelled videos from scratch with multi-modal self-supervision. Adv. Neural Inf. Process. Syst. 2020, 33, 4660–4671.
  17. Alayrac, J.B.; Recasens, A.; Schneider, R.; Arandjelović, R.; Ramapuram, J.; De Fauw, J.; Smaira, L.; Dieleman, S.; Zisserman, A. Self-supervised multimodal versatile networks. Adv. Neural Inf. Process. Syst. 2020, 33, 25–37.
  18. Cheng, Y.; Wang, R.; Pan, Z.; Feng, R.; Zhang, Y. Look, listen, and attend: Co-attention network for self-supervised audio-visual representation learning. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 3884–3892.
  19. Alwassel, H.; Mahajan, D.; Korbar, B.; Torresani, L.; Ghanem, B.; Tran, D. Self-supervised learning by cross-modal audio-video clustering. Adv. Neural Inf. Process. Syst. 2020, 33, 9758–9770.
  20. Weiss, K.; Khoshgoftaar, T.M.; Wang, D. A survey of transfer learning. J. Big Data 2016, 3, 1–40.
  21. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805.
  22. Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. Improving Language Understanding by Generative Pre-Training. 2018. Available online: https://openai.com/blog/language-unsupervised/ (accessed on 1 June 2022).
  23. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog 2019, 1, 9.
  24. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901.
  25. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
  26. Schneider, S.; Baevski, A.; Collobert, R.; Auli, M. wav2vec: Unsupervised pre-training for speech recognition. arXiv 2019, arXiv:1904.05862.
  27. Kamath, A.; Singh, M.; LeCun, Y.; Synnaeve, G.; Misra, I.; Carion, N. MDETR-modulated detection for end-to-end multi-modal understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 1780–1790.
  28. Yu, W.; Xu, H.; Yuan, Z.; Wu, J. Learning modality-specific representations with self-supervised multi-task learning for multimodal sentiment analysis. arXiv 2021, arXiv:2102.04830.
  29. Su, W.; Zhu, X.; Cao, Y.; Li, B.; Lu, L.; Wei, F.; Dai, J. Vl-bert: Pre-training of generic visual-linguistic representations. arXiv 2019, arXiv:1908.08530.
  30. Hu, R.; Singh, A. Unit: Multimodal multitask learning with a unified transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 1439–1449.
  31. Tan, H.; Bansal, M. Lxmert: Learning cross-modality encoder representations from transformers. arXiv 2019, arXiv:1908.07490.
  32. Chen, F.; Zhang, D.; Han, M.; Chen, X.; Shi, J.; Xu, S.; Xu, B. VLP: A Survey on Vision-Language Pre-training. arXiv 2022, arXiv:2202.09061.
  33. Li, G.; Duan, N.; Fang, Y.; Gong, M.; Jiang, D. Unicoder-vl: A universal encoder for vision and language by cross-modal pre-training. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7 February 2020; Volume 34, pp. 11336–11344.
  34. Zhou, M.; Zhou, L.; Wang, S.; Cheng, Y.; Li, L.; Yu, Z.; Liu, J. Uc2: Universal cross-lingual cross-modal vision-and-language pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 4155–4165.
  35. Rahman, W.; Hasan, M.K.; Lee, S.; Zadeh, A.; Mao, C.; Morency, L.P.; Hoque, E. Integrating multimodal information in large pretrained transformers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 2359–2369.
  36. Wang, Y.; Shen, Y.; Liu, Z.; Liang, P.P.; Zadeh, A.; Morency, L.P. Words can shift: Dynamically adjusting word representations using nonverbal behaviors. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January 2019; Volume 33, pp. 7216–7223.
  37. Gan, Z.; Chen, Y.C.; Li, L.; Zhu, C.; Cheng, Y.; Liu, J. Large-scale adversarial training for vision-and-language representation learning. Adv. Neural Inf. Process. Syst. 2020, 33, 6616–6628.
  38. Tramèr, F.; Kurakin, A.; Papernot, N.; Goodfellow, I.; Boneh, D.; McDaniel, P. Ensemble adversarial training: Attacks and defenses. arXiv 2017, arXiv:1705.07204.
  39. Xie, S.; Sun, C.; Huang, J.; Tu, Z.; Murphy, K. Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. In Proceedings of the European conference on computer vision (ECCV), München, Germany, 8–14 September 2018; pp. 305–321.
  40. Feichtenhofer, C.; Fan, H.; Malik, J.; He, K. Slowfast networks for video recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 6202–6211.
  41. Lei, J.; Li, L.; Zhou, L.; Gan, Z.; Berg, T.L.; Bansal, M.; Liu, J. Less is more: Clipbert for video-and-language learning via sparse sampling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 7331–7341.
  42. Sun, C.; Myers, A.; Vondrick, C.; Murphy, K.; Schmid, C. Videobert: A joint model for video and language representation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 7464–7473.
  43. Tan, H.; Bansal, M. Vokenization: Improving language understanding with contextualized, visual-grounded supervision. arXiv 2020, arXiv:2010.06775.