Visual question answering (VQA) is the task of generating or predicting an answer to a natural-language question about a visual image. VQA is an active field combining two AI branches: natural language processing (NLP) and computer vision. A VQA system usually has four components: vision featurization, text featurization, a fusion model, and a classifier. Vision featurization is the part of the multimodal model responsible for extracting visual features, while text featurization is the part responsible for extracting textual features. The fusion component combines both sets of features and processes them jointly. The last component is the classifier, which maps the fused representation to an answer to the question about the image.
1. Introduction
Visual question answering (VQA) is a process that provides meaningful information from images to a user based on a given question. With the rapid advancements in computer vision (CV) and natural language processing (NLP), medical visual question answering (Med-VQA) has attracted much attention. A Med-VQA model seeks to retrieve accurate answers by fusing clinical images and questions. Med-VQA can aid in medical diagnosis, automatically extract data from medical images, and lower medical professionals' training costs. Additionally, a Med-VQA system offers many benefits for the medical industry. Here are a few illustrations:
- Diagnosis and treatment: By offering a quick and precise method for analyzing medical images, Med-VQA can help medical practitioners diagnose and treat medical disorders. Healthcare experts can learn more about a patient's condition by asking questions about medical imaging (such as X-rays, CT scans, and MRI scans), which can aid in making a diagnosis and selecting the best course of therapy. In addition, it can reduce doctors' workload by letting patients obtain answers to the most frequent questions about their images.
- Medical education: By giving users a way to learn from medical images, Med-VQA can be used to train medical students and healthcare workers. Students can learn how to assess and interpret medical images, a crucial skill in the field of medicine, by posing questions about them.
- Patient education: By allowing patients to ask questions about their medical images, Med-VQA can help them better understand their medical conditions. Healthcare practitioners can improve patient outcomes by responding to these inquiries and helping patients understand their conditions and available treatments.
- Research: Large collections of medical images can be analyzed using Med-VQA to glean insights that can be applied to medical research. Researchers can better understand medical conditions and develop new treatments by posing questions about medical imaging and examining the results.
Although much Med-VQA research has been accomplished, it still requires substantial enhancement to reach practical usage, owing to limitations in public data availability and dataset size. Only a few public datasets are available: VQA-RAD [1], VQA-Med 2019 [2], VQA-Med 2020 [3], the SLAKE dataset [4], and the Diabetic Macular Edema (DME) dataset [5]. Only the VQA-RAD and SLAKE datasets are manually generated and validated by clinicians; they also have the greatest question diversity among all medical VQA datasets. The DME dataset is manually generated but not validated by specialists.
Several models have been developed to solve this problem. These models rely on four types of methods: joint embedding approaches [6][7][8], attention mechanisms [9][10][11][12][13], composition models [12][14][15][16][17][18], and knowledge base-enhanced approaches [1][19][20][21]. In Med-VQA, VGGNet [22], ResNet [23], and ensembles of pre-trained vision models [24][25] are the most widely used vision feature extraction methods, while LSTM [26], Bi-LSTM [27], and BERT [28] are the text feature extraction methods mainly utilized. Lately, most models have aimed to use attention mechanisms to align the text and image features [10][29][30][31]. In addition, vision and language (V + L) pre-trained models, such as VisualBERT [32], ViLBERT [33], UNITER [34], and CLIP [35], have been adopted. Researchers have argued that Med-VQA requires more textual information about the images to facilitate the model's classification task; therefore, they utilize image caption generation to give the model extra information about the image [36].
2. Visual Question Answering
A VQA system usually has four components: vision featurization, text featurization, a fusion model, and a classifier. Vision featurization is the part of the multimodal model responsible for extracting visual features, while text featurization is the part responsible for extracting textual features. The fusion component combines both sets of features and processes them jointly. The last component is the classifier, which maps the fused representation to an answer to the question about the image.
2.1. Vision Featurization
Applying mathematical operations to an image requires representing it as a numerical vector, a process called image featurization. There are several techniques for extracting image features, such as the scale-invariant feature transform (SIFT) [37], simple RGB vectors, the histogram of oriented gradients (HOG) [38], Haar transforms [39], and deep learning. In deep learning, for example with CNNs, visual feature extraction is learned by a neural network. Deep learning can be applied either by training a model from scratch, which requires a large amount of data, or by using transfer learning, which performs well even with limited data. Since medical VQA datasets are limited, most researchers use pre-trained models, such as AlexNet [40], VGGNet [22][41][42][43][44], GoogLeNet [45], ResNet [5][23][46][47][48][49][50], and DenseNet-121 [51]. Ensemble models can be stronger than single models, so some works use them for vision feature extraction [25][52][53][54][55].
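As a rough illustration of transfer learning for vision featurization, the following Python sketch (assuming PyTorch and torchvision are available; the ResNet-50 backbone, preprocessing settings, and image file name are illustrative choices, not taken from any cited model) removes the classification head of a pre-trained CNN and keeps the pooled activations as a fixed-length visual feature vector.

```python
# Sketch: extracting visual features with a pre-trained ResNet backbone.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Load a ResNet-50 pre-trained on ImageNet and drop its classification head,
# keeping everything up to the global-average-pooling layer.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-1])
feature_extractor.eval()

preprocess = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

image = Image.open("chest_xray.png").convert("RGB")  # hypothetical file name
with torch.no_grad():
    x = preprocess(image).unsqueeze(0)            # (1, 3, 224, 224)
    visual_features = feature_extractor(x)        # (1, 2048, 1, 1)
    visual_features = visual_features.flatten(1)  # (1, 2048) feature vector
```

The resulting 2048-dimensional vector (for this particular backbone) is what the fusion component later combines with the question features.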
2.2. Text Featurization
As with vision featurization, a question has to be converted into a numeric vector, using word-embedding methods, for mathematical computations. Choosing a suitable text embedding method is largely a matter of trial and error [56]. Various text embedding methods used in the state of the art significantly affect the multimodal model. The most common methods used in question models are LSTM [5][51][54][55][57], GRU [57], RNNs [42][58][59][60], Faster-RNN [57], and the encoder-decoder method [43][46][47][48][61][62]. In addition to these methods, pre-trained models have been used, such as Generalized Autoregressive Pretraining for Language Understanding (XLNet) [63] and the BERT model [28][43][49][50]. Some models ignore text featurization altogether and convert the problem into an image classification problem [53][64][65].
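For illustration, the sketch below shows a minimal LSTM-based question encoder of the kind listed above. The vocabulary size, embedding dimension, hidden size, and toy token indices are illustrative assumptions rather than values from any cited model.

```python
# Sketch: a minimal LSTM question encoder for text featurization.
import torch
import torch.nn as nn

class QuestionEncoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer word indices
        embedded = self.embedding(token_ids)   # (batch, seq_len, embed_dim)
        _, (hidden, _) = self.lstm(embedded)   # hidden: (1, batch, hidden_dim)
        return hidden.squeeze(0)               # (batch, hidden_dim)

# Example: encode a toy, already-tokenized and padded question.
encoder = QuestionEncoder(vocab_size=10000)
question = torch.tensor([[4, 17, 9, 2, 0, 0]])  # hypothetical index sequence
question_features = encoder(question)           # shape: (1, 512)
```

A pre-trained model such as BERT would replace the embedding and LSTM with its own tokenizer and transformer encoder, but the output still plays the same role: a fixed-length question vector passed to the fusion phase.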
2.3. Fusion
The features of text and images are extracted independently; these features are then combined using a fusion method. Manmadhan et al. [56] classified fusion into three types: baseline fusion models, end-to-end neural network models, and joint attention models. In baseline fusion, various methods are used, such as element-wise addition [7], element-wise multiplication, concatenation [66], a combination of all of them [67], or a hybrid of these methods with a polynomial function [68]. End-to-end neural network models can also be used to fuse the image and text featurizations. Various methods are currently used, including neural module networks (NMNs) [12], multimodal compact bilinear pooling (MCB) [46], dynamic parameter prediction networks (DPPNs) [69], multimodal residual networks (MRNs) [70], cross-modal multistep fusion (CMF) networks [71], the basic MCB model with a deep attention neural tensor network (DA-NTN) module [72], the multi-layer perceptron (MLP) [73], and the encoder-decoder method [32][74]. The main reason for using a joint attention model is to address the semantic relationship between image attention and question attention [56]. There are various joint attention models, such as the word-to-region attention network (WRAN) [29], co-attention [30], the question-guided attention map (QAM) [10], and question type-guided attention (QTA) [31].
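The baseline fusion variants mentioned above can be sketched in a few lines; the projection layers and dimensions below are illustrative assumptions, and the snippet only demonstrates element-wise addition, element-wise multiplication, and concatenation of projected image and question features.

```python
# Sketch: baseline fusion of visual and textual feature vectors.
import torch
import torch.nn as nn

visual_dim, text_dim, joint_dim = 2048, 512, 1024

# Project both modalities into a common joint space first.
project_image = nn.Linear(visual_dim, joint_dim)
project_text = nn.Linear(text_dim, joint_dim)

image_features = torch.randn(1, visual_dim)    # e.g., CNN output
question_features = torch.randn(1, text_dim)   # e.g., LSTM output

v = project_image(image_features)      # (1, joint_dim)
q = project_text(question_features)    # (1, joint_dim)

fused_add = v + q                      # element-wise addition
fused_mul = v * q                      # element-wise (Hadamard) multiplication
fused_cat = torch.cat([v, q], dim=1)   # concatenation, (1, 2 * joint_dim)
```

End-to-end and joint attention fusion models replace these simple operations with learned interaction modules, but the input and output roles are the same.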
Neural network methods, such as LSTM and the encoder-decoder, are also used in the fusion phase. Verma and Ramachandran [43] designed a multimodal model that used an encoder-decoder, LSTM, and GloVe. Furthermore, vision + language pre-trained models are also utilized, such as in [50].
In a VQA system, the question and image are embedded separately using one or a hybrid of the text and vision featurization techniques mentioned above. Then, the textual and visual feature vectors are combined with a fusion technique, such as concatenation, element-wise multiplication, or attention. The vector obtained from the fusion phase is classified using a classification technique, or it can be used to generate an answer if VQA is treated as a generation problem. Figure 1 shows the overall VQA system.
Figure 1. The Overall VQA Structure.
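To tie the four components together, the following end-to-end sketch follows the structure described above and shown in Figure 1: pre-extracted CNN image features, an LSTM question encoder, element-wise multiplication fusion, and an MLP classifier over a fixed answer vocabulary. All class names, dimensions, and the answer-set size are illustrative assumptions, not a specific published architecture.

```python
# Sketch: a minimal classification-style VQA model (vision + text + fusion + classifier).
import torch
import torch.nn as nn

class SimpleVQA(nn.Module):
    def __init__(self, vocab_size, num_answers,
                 visual_dim=2048, embed_dim=300, hidden_dim=512, joint_dim=1024):
        super().__init__()
        # Text featurization: word embedding + LSTM.
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Project both modalities into a joint space for fusion.
        self.project_image = nn.Linear(visual_dim, joint_dim)
        self.project_text = nn.Linear(hidden_dim, joint_dim)
        # Classifier over a fixed set of candidate answers.
        self.classifier = nn.Sequential(
            nn.Linear(joint_dim, joint_dim),
            nn.ReLU(),
            nn.Linear(joint_dim, num_answers),
        )

    def forward(self, image_features, token_ids):
        # image_features: (batch, visual_dim) pre-extracted CNN features
        # token_ids: (batch, seq_len) padded question word indices
        _, (hidden, _) = self.lstm(self.embedding(token_ids))
        q = self.project_text(hidden.squeeze(0))   # (batch, joint_dim)
        v = self.project_image(image_features)     # (batch, joint_dim)
        fused = v * q                              # element-wise fusion
        return self.classifier(fused)              # answer logits

# Toy usage with random inputs.
model = SimpleVQA(vocab_size=10000, num_answers=500)
logits = model(torch.randn(2, 2048), torch.randint(1, 10000, (2, 12)))
```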