The emergence of our hyper-connected and hyper-digitalized world (IoT, ubiquitous sensing, etc.) requires any educational organization to be able to handle systems that produce huge amounts of heterogeneous data. A key research area in multimodal data is the construction of multimodal representations, whose quality determines how well organizational learning can be modeled and predicted.
2. Types of Modalities
This section introduces the types of modalities that can be encountered in a machine learning problem. Following [7], the researchers identify four main groups of modalities:
- Tabular data: observations are stored as rows and their features as columns;
- Graphs: observations are vertices and their features are in the form of edges between individual vertices;
- Signals: observations are files of an appropriate extension (images: .jpeg, audio: .wav, etc.) and their features are the numerical data provided within the files;
- Sequences: observations are in the form of characters/words/documents, where the type of character/word corresponds to features.
The research covers modalities from each of the identified groups: labels are an example of tabular data; reviews, titles, and descriptions represent textual data (sequences); images of products and movie posters are examples of visual data (signals); and relations between movies (movies seen by one user) are graphs.
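As a rough illustration of the four groups, a single observation from the movie example could be held in a structure like the one below. This is only a sketch: the `MovieObservation` class, its field names, and all sample values are hypothetical and are not taken from the research.

```python
from dataclasses import dataclass, field


@dataclass
class MovieObservation:
    """One movie described by all four modality groups (hypothetical layout)."""
    label: int                 # tabular: a single column value in a row
    title: str                 # sequence: characters/words
    description: str           # sequence: characters/words
    poster_path: str           # signal: a file whose pixels are the features
    co_watched: set = field(default_factory=set)  # graph: edges to other movies

    def available_modalities(self):
        """Return the names of the modality groups that are actually present."""
        present = ["tabular"]  # the label is always set
        if self.title or self.description:
            present.append("sequence")
        if self.poster_path:
            present.append("signal")
        if self.co_watched:
            present.append("graph")
        return present


movie = MovieObservation(
    label=1,
    title="Example title",
    description="A short made-up description.",
    poster_path="posters/example.jpeg",   # made-up path
    co_watched={"movie_17", "movie_42"},  # made-up IDs
)
print(movie.available_modalities())
```

Tracking which groups are present for an observation makes the case of missing modalities explicit from the start.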
Table 1 introduces some other examples of works on various combinations of modalities. Additionally, the authors of [8] concentrated on the image segmentation task, learning the optimal joint representation from rich and complementary features of the same scene based on images retrieved from different sensors. These studies suggest that there are no restrictions on which modalities can be studied together to solve a specific problem. Moreover, they showed that combining modalities boosts performance: the multimodal model achieved better results than the unimodal models.
Table 1. Examples of multimodal tasks.
Furthermore, these works cover all the multimodal fusion techniques examined in the research: early fusion, late fusion, and sketch. Each has been shown to be effective, but they have not yet been compared with one another.
3. Multimodal Representation
Representation learning is an unsupervised task, and there is no single way to define a good representation. However, several works have identified the main properties required of any numerical representation of a given modality.
The problem of unimodal representation has largely been solved with modality-dedicated models, such as BERT [15] for textual data and ResNet [16] for images. However, a universal method that could be applied to any machine learning task on multimodal data has not been established [4].
Bengio et al. [17] characterized several features that an appropriate vector representation should possess, including:
- Manifolds: probability mass is concentrated within regions of lower dimensionality than the original space, e.g., we can expect the words “Poland”, “USA”, and “France” to have embeddings within a certain region, and the words “jump”, “ride”, and “run” in another distinct region;
- Natural clustering: categorical values could be assigned to observations within the same manifold, e.g., a region with the words “Poland”, “USA”, and “France” can be described as “countries”;
- Sparsity: given an observation, only a subset of its numerical representation features should be relevant. Otherwise, we end up with complicated embeddings whose highly correlated features may lead to numerous ambiguities.
For multimodal representation, [18] identifies two further factors that should be taken into account: (1) the similarity between individual modalities should be preserved in their joint representation, and (2) the representation should be robust to the absence of some modalities, i.e., it should still be possible to create a multimodal embedding when a modality is missing.
4. Multimodal Data Fusion
Multimodal data fusion is an approach for combining single modalities to derive a multimodal representation. A few issues should be taken into account [4] when fusing several modalities.
Classically, the existing multimodal representation techniques are divided into two categories [2][20]: early (feature) and late (decision) fusion. In early fusion, all modalities are combined before modeling. This is usually achieved by concatenating their vector representations at an initial stage, after which a single model is trained [21]. In late fusion, an independent model is trained for each modality, and their outputs are then combined. The combination can be made in various ways: one can average the outputs, pick the most frequent one (in classification tasks), or concatenate them and build a further model to obtain the final output [21]. Neither of these data fusion approaches can be described as the best one [20]; both have yielded promising results in various scenarios.
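The difference between the two categories can be sketched in a few lines of Python. The function names and toy values below are illustrative assumptions, not an implementation from the cited works; the unimodal models themselves are left out, and only the fusion step is shown.

```python
from collections import Counter
from statistics import mean


def early_fusion(features_by_modality):
    """Concatenate per-modality vectors into one joint vector (feature fusion)."""
    fused = []
    # A fixed (sorted) modality order keeps positions aligned across observations.
    for name in sorted(features_by_modality):
        fused.extend(features_by_modality[name])
    return fused


def late_fusion_average(scores_by_modality):
    """Average the scores produced by independent per-modality models."""
    return mean(list(scores_by_modality.values()))


def late_fusion_vote(labels_by_modality):
    """Pick the most frequent label predicted by the per-modality models."""
    return Counter(labels_by_modality.values()).most_common(1)[0][0]


# Early fusion: one joint vector, which a single model would then be trained on.
joint = early_fusion({"text": [0.2, 0.7], "image": [0.1, 0.9, 0.4]})

# Late fusion: each modality already has its own model output (toy values).
avg_score = late_fusion_average({"text": 0.8, "image": 0.6})
voted = late_fusion_vote({"text": "cat", "image": "cat", "graph": "dog"})
```

The third late-fusion variant mentioned above, concatenating the outputs and training a further model on them, would reuse `early_fusion` on the model outputs instead of on the raw features.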
4.1. Deep Learning Models
The most popular multimodal fusion techniques are based on deep learning solutions. The authors of [4] describe such architecture ideas, along with their most representative cases. Four prominent approaches are deep belief nets, stacked autoencoders, convolutional networks, and recurrent networks. However, despite their promising results in the field of multimodal data fusion, deep learning models suffer from two main issues [4]. Firstly, deep learning models contain an enormous number of free weights, including parameters associated with modalities that bring little information. This results in high resource requirements, an undesirable feature in a production scenario. Secondly, multimodal data usually come from very dynamic environments. Therefore, there is a need for a flexible model that can quickly adapt to changes in the data.
Enormous computational requirements and low flexibility suggest exploring other techniques that can be applied to any task, regardless of the types of modalities. Furthermore, the authors of [4] suggest that these ideas can be combined with deep learning techniques and existing multimodal models to obtain state-of-the-art solutions, which would be applicable in every field and robust to data imperfections (missing modalities, data distribution changes over time in a production case, etc.). According to [22], the best approach is based on deep learning; the challenge is modality fusion. One of the possible solutions is the use of hashing methods. The following section discusses the strengths and weaknesses of such algorithms.
4.2. Hashing Ideas
Another promising approach in multimodal data fusion is associated with hashing models. They identify manifolds in the original space and then transform data to lower-dimensional spaces while preserving observation similarities. Such algorithms can construct multimodal representation on the fly and have been proven effective in information retrieval problems [3], recommendation systems [23], and object detection cases. The main advantages [3] of hashing methods are that they (1) are cost-effective in terms of memory usage, (2) detect and work within manifolds, (3) preserve semantic similarities between points, (4) are usually data-independent, and (5) are suitable for production cases as they are robust to any data changes.
Unfortunately, hashing methods struggle with one issue. Mapping high-dimensional data into much simpler representations can result in the loss of certain information about specific observations [3]. Therefore, it has to be verified whether hashing ideas can be applied to fields other than similarity search. Perhaps their ability to combine multiple modalities while maintaining low costs and robustness to data changes compensates for the lost information.
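One common data-independent hashing scheme is locality sensitive hashing with random hyperplanes, which is used below purely as an assumed example of the family; the cited works may rely on different hash functions. The function names, the seed, and the toy vectors are hypothetical. Each vector is reduced to a short binary code, and the information loss discussed above is visible in that the magnitude of the vector is discarded entirely.

```python
import random


def make_hyperplanes(dim, n_planes, seed=0):
    """Draw random hyperplane normals; no training data is needed."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_planes)]


def hash_bits(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    return tuple(
        int(sum(w * x for w, x in zip(plane, vector)) >= 0.0)
        for plane in hyperplanes
    )


def hamming(a, b):
    """Number of positions at which two binary codes differ."""
    return sum(x != y for x, y in zip(a, b))


planes = make_hyperplanes(dim=8, n_planes=4, seed=0)
v = [float(i) for i in range(1, 9)]
same_direction = [2.0 * x for x in v]   # same direction, different length
opposite = [-x for x in v]              # opposite direction

code_v = hash_bits(v, planes)
print(hamming(code_v, hash_bits(same_direction, planes)))  # codes coincide
print(hamming(code_v, hash_bits(opposite, planes)))        # every bit differs
```

Vectors pointing in similar directions tend to share most bits, so the Hamming distance between codes acts as a cheap proxy for angular similarity.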
4.3. Sketch Representation
The sketch representation has already been proven effective when fed with visual, behavioral, and textual data for recommendation and similarity search tasks [23]. The idea of this representation comes from combining two algorithms: locality sensitive hashing and count-min sketch. All modalities are transformed into a common space with the use of hash functions. Generally, a sketch is a sparse one-hot matrix containing all combined modalities. Hash functions make the representation modality-independent, robust to missing modalities, and easy to interpret. Furthermore, modalities can be added to the sketch on the fly, which is extremely important in a production scenario.
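A minimal sketch of this construction is given below, under several assumptions: random hyperplanes serve as the hash functions, each modality is hashed by `depth` independent partitions (loosely analogous to the rows of a count-min sketch), and a missing modality simply leaves its rows empty. The exact procedure of [23] is not reproduced, and all names and values are hypothetical.

```python
import random


def make_partitions(dim, depth, n_planes, seed):
    """Create `depth` independent sets of `n_planes` random hyperplanes."""
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_planes)]
        for _ in range(depth)
    ]


def bucket_id(vector, planes):
    """Subspace ID: the hyperplane side bits read as a binary number."""
    bucket = 0
    for plane in planes:
        bit = int(sum(w * x for w, x in zip(plane, vector)) >= 0.0)
        bucket = (bucket << 1) | bit
    return bucket


def modality_sketch(vector, partitions):
    """One one-hot row per partition; all-zero rows if the modality is missing."""
    width = 2 ** len(partitions[0])
    rows = []
    for planes in partitions:
        row = [0] * width
        if vector is not None:
            row[bucket_id(vector, planes)] = 1
        rows.append(row)
    return rows


def multimodal_sketch(observation, partitions_by_modality):
    """Stack the per-modality sketches into one sparse matrix."""
    sketch = []
    for name in sorted(partitions_by_modality):
        sketch.extend(modality_sketch(observation.get(name), partitions_by_modality[name]))
    return sketch


partitions = {
    "text": make_partitions(dim=3, depth=2, n_planes=3, seed=1),
    "image": make_partitions(dim=2, depth=2, n_planes=3, seed=2),
}
full = multimodal_sketch({"text": [0.5, -1.0, 2.0], "image": [1.0, 0.3]}, partitions)
partial = multimodal_sketch({"text": [0.5, -1.0, 2.0]}, partitions)  # image missing
```

Because every modality ends up as rows of the same width, a new modality can be appended by adding its own partitions, without touching the rows already computed.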
In the research, the sketch representation is slightly modified into a binarized form (see Figure 1). Instead of representing an observation with a subspace ID, it can be represented as a set of binary features, where 0 and 1 indicate on which side of a single hyperplane the point lies. Such a sketch should preserve more information about a single observation.
Figure 1. The idea of binarizing the sketch. Instead of representing an observation with a subspace ID, it can be represented as a set of binary features, where 0 and 1 indicate on which side of a single hyperplane the point lies. Such a sketch consumes much less memory and perhaps preserves more information about a single observation.
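The contrast between the subspace ID and the binarized form can be illustrated with three hand-picked hyperplanes in two dimensions. The hyperplanes, points, and function names below are assumptions chosen for readability, not values from the research.

```python
def hyperplane_bits(vector, hyperplanes):
    """Binarized sketch: one 0/1 feature per hyperplane (side of the point)."""
    return [
        int(sum(w * x for w, x in zip(plane, vector)) >= 0.0)
        for plane in hyperplanes
    ]


def subspace_id(bits):
    """Collapse the side bits into a single subspace ID."""
    index = 0
    for bit in bits:
        index = (index << 1) | bit
    return index


def one_hot_sketch(bits):
    """ID-based sketch: a one-hot row with 2**k entries for k hyperplanes."""
    row = [0] * (2 ** len(bits))
    row[subspace_id(bits)] = 1
    return row


hyperplanes = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # hand-picked normals
point_a = [2.0, -1.0]
point_b = [2.0, 1.0]

bits_a = hyperplane_bits(point_a, hyperplanes)  # [1, 0, 0]
bits_b = hyperplane_bits(point_b, hyperplanes)  # [1, 1, 0]
id_a, id_b = subspace_id(bits_a), subspace_id(bits_b)  # 4 and 6

# Binarized form: 3 features per point, and the two points agree on 2 of them.
shared_bits = sum(a == b for a, b in zip(bits_a, bits_b))
# ID form: 8 entries per point, and the two one-hot rows have no 1 in common.
overlap = sum(a * b for a, b in zip(one_hot_sketch(bits_a), one_hot_sketch(bits_b)))
```

In this toy case the binarized form uses 3 features instead of 8 and still records that the two points lie on the same side of two hyperplanes, whereas the two subspace IDs only indicate that the points fall into different subspaces.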
5. Multimodal Model Evaluation
Evaluating a multimodal data fusion algorithm is not straightforward, and there is no universal metric that measures how well inter- and cross-modal relations are captured [19]. However, we can assess whether learning from multiple data types simultaneously enhances task performance.
The most popular way of verifying the quality of a multimodal fusion model [20] is to compare its performance scores (precision, AUC, etc.) to those achieved by models considering single modalities. With such an approach, we can state whether and to what extent combining modalities brings new information. The researchers also aim to preserve all similarities between observations, i.e., similar observations should remain close in their multimodal representations. Therefore, several works [21][23] have compared their multimodal models to NN algorithms, which serve as a good baseline. Lastly, multimodal models should be tested as additional modalities are added. In certain cases [21], adding new modalities only slightly improves the results while the training time increases dramatically. As a result, the model might be infeasible in production scenarios despite its excellent performance. Therefore, we should not only focus on the scores the model achieves but also consider its flexibility and simplicity.
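The comparison against unimodal baselines can be sketched as follows. Precision is used as the example score; the helper names, labels, and predictions are made-up toy values and do not come from any of the cited experiments.

```python
def precision(y_true, y_pred, positive=1):
    """Share of predicted positives that are truly positive."""
    predicted = sum(1 for p in y_pred if p == positive)
    true_positive = sum(
        1 for t, p in zip(y_true, y_pred) if p == positive and t == positive
    )
    return true_positive / predicted if predicted else 0.0


def fusion_gain(y_true, unimodal_preds, multimodal_pred):
    """Compare the multimodal score with the best single-modality score."""
    unimodal = {name: precision(y_true, pred) for name, pred in unimodal_preds.items()}
    best = max(unimodal, key=unimodal.get)
    fused = precision(y_true, multimodal_pred)
    return {
        "unimodal": unimodal,
        "multimodal": fused,
        "best_unimodal": best,
        "gain": fused - unimodal[best],  # > 0 means fusion added information
    }


# Toy labels and predictions, for illustration only.
y_true = [1, 0, 1, 1, 0, 1]
report = fusion_gain(
    y_true,
    unimodal_preds={"text": [1, 1, 1, 0, 0, 1], "image": [1, 0, 0, 1, 1, 0]},
    multimodal_pred=[1, 0, 1, 1, 0, 0],
)
print(report)
```

Recording the training time next to each score, for example with `time.perf_counter()`, would extend the same report to the cost of adding modalities.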