1. Unimodal Applications
Unimodal applications in the context of NLP are AI-based systems that focus on processing and analyzing text as their sole modality. The subsequent sections delve into the primary categories of unimodal applications: Language Modeling, Question Answering, Machine Translation, Text Classification, Text Generation, Text Summarization, Sentiment Analysis, Named Entity Recognition, and Information Retrieval. These categories exemplify the diverse range of applications and capabilities that AI systems can achieve by focusing on text-based information.
1.1. Language Modeling
Language modeling is a fundamental task in NLP that involves predicting the next word in a sequence of text based on the preceding words. The goal is to estimate the probability distribution of word sequences in a given language; the resulting models serve as building blocks for many NLP tasks such as machine translation, speech recognition, and text generation. Language modeling can be readily extended to more complex tasks such as sentence-pair modeling, cross-document language modeling, and definition modeling, which benefit from the knowledge learned during language-model training through improved accuracy and efficiency. Language modeling typically follows the decoder-only architecture popularized by the GPT family [3,12,13].
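As a concrete illustration, the following minimal sketch queries a decoder-only model for its next-token distribution. It assumes the Hugging Face transformers library and the public gpt2 checkpoint; any decoder-only model behaves analogously.

```python
# Minimal sketch: next-token probabilities under a decoder-only LM.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prefix = "The capital of France is"
inputs = tokenizer(prefix, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits          # (1, seq_len, vocab_size)

# Probability distribution over the next token given the prefix.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(idx.item())!r}: {p.item():.3f}")
```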
However, when used for generation, language models are often limited in their ability to handle complex language phenomena, and their large size makes them computationally expensive to fine-tune for specific tasks. Typically, these models are utilized by providing additional context in the form of a prompt (called “prompt tuning”), either manually or through automated selection [14]. Transformer-XL [15] is a neural architecture intended for language modeling that can learn relationships beyond a fixed length while keeping temporal consistency. It has a segment-level recurrence mechanism and an innovative positional encoding scheme that captures longer-term dependencies while addressing context fragmentation. As a result, Transformer-XL outperforms both LSTMs and standard transformers on both short and long sequences and is significantly faster during evaluation.
Dynamic evaluation enhances models by adapting to recent sequence history through gradient descent, capitalizing on recurrent sequential patterns. It can exploit long-range dependencies in natural language, such as style and word usage. Krause et al. [16] investigated the benefits of applying dynamic evaluation to transformers, aiming to determine whether transformers can fully adapt to recent sequence history. Their work builds on top of the previously mentioned Transformer-XL model.
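A minimal sketch of the dynamic-evaluation idea is given below: each segment is scored before a single gradient step adapts the weights to it. GPT-2 stands in for Transformer-XL here, and the segment size and learning rate are illustrative choices rather than the settings used by Krause et al.

```python
# Hedged sketch of dynamic evaluation: score a segment, then adapt to it.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)

long_text = "The quick brown fox jumps over the lazy dog. " * 200  # stands in for a long document
ids = tokenizer(long_text, return_tensors="pt").input_ids

segment, total_loss, n = 128, 0.0, 0
for start in range(0, ids.size(1) - 1, segment):
    chunk = ids[:, start:start + segment]
    if chunk.size(1) < 2:
        break
    out = model(chunk, labels=chunk)   # evaluate the segment first...
    total_loss += out.loss.item()
    n += 1
    out.loss.backward()                # ...then adapt the weights to it
    optimizer.step()
    optimizer.zero_grad()

print("dynamic-eval loss:", total_loss / max(n, 1))
```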
Recently, a promising new direction in language modeling involves reinforcement learning (RL)-based fine-tuning. In this approach, a pre-trained language model is fine-tuned using RL to optimize a task-specific reward function [17]. This allows the model to learn from its own predictions, leading to improved accuracy and generalization performance on the target task. Additionally, fine-tuning with RL can be accomplished with much smaller models, making it computationally more efficient and faster. RL-based fine-tuning has shown strong results on a variety of NLP tasks and remains a promising avenue for further research and development in language modeling.
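The sketch below illustrates the core of RL-based fine-tuning with a single REINFORCE update. The reward_fn shown is a hypothetical stand-in for a task-specific reward model, and production systems typically add further machinery (e.g., a KL penalty against the pre-trained model) that is omitted here.

```python
# Hedged sketch: one REINFORCE step on a sampled completion.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

def reward_fn(text: str) -> float:
    # Hypothetical stand-in for a learned reward model: reward brevity.
    return 1.0 / (1 + len(text.split()))

prompt_ids = tokenizer("Summarize: the meeting was", return_tensors="pt").input_ids
gen = model.generate(prompt_ids, do_sample=True, max_new_tokens=20,
                     pad_token_id=tokenizer.eos_token_id)
completion = gen[:, prompt_ids.size(1):]

# Log-probability of the sampled completion under the current policy.
logits = model(gen).logits[:, prompt_ids.size(1) - 1:-1]
logp = torch.log_softmax(logits, dim=-1)
token_logp = logp.gather(-1, completion.unsqueeze(-1)).squeeze(-1)

reward = reward_fn(tokenizer.decode(completion[0]))
loss = -(reward * token_logp.sum())        # REINFORCE objective
loss.backward()
optimizer.step()
optimizer.zero_grad()
```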
1.2. Question Answering
Question answering is a task in NLP that involves automatically answering questions posed in natural language. The goal of question answering is to extract the relevant information from a given text corpus and present it as an answer to a user’s question. Question-answering systems can operate over many text types, including news articles and Wikipedia pages, and can be designed to handle many kinds of questions, from fact-based to opinion questions. There are several subtasks within QA, each with its unique challenges and requirements. Among the most common subtasks are:
- Open-Domain Question Answering (ODQA): This task involves finding an answer to a question from an open domain, such as the entire internet or a large corpus of text. The goal is to find the most relevant information to answer the question, even if it requires synthesizing information from multiple sources. Reformer, introduced by Kitaev et al. [18], has been shown to excel at ODQA, with its success attributed to the use of locality-sensitive hashing, which enables far larger context windows than ordinary transformers.
- Conversational Question Answering (CQA): This task involves answering questions in a conversational setting, where the model must understand the context of the conversation and generate an answer that is relevant and appropriate for the current conversational context. SDNet [19] utilizes both inter-attention and self-attention mechanisms to effectively process context and questions separately and fuse them at multiple intervals.
- Answer Selection: This task involves ranking a set of candidate answers for a given question, where the goal is to select the most accurate answer from the candidate set. Fine-tuning pre-trained transformers has been shown to be an effective method for answer selection [20].
- Machine Reading Comprehension (MRC): This task involves understanding and answering questions about a given passage of text. The model must be able to comprehend the text, extract relevant information, and generate an answer that is accurate and relevant to the question. XLNet [21] uses a permutation-based training procedure that allows it to take into account all possible orderings of input tokens, rather than just the left-to-right order as in traditional transformer language models. XLNet’s ability to capture long-range dependencies and its strong pre-training make it a highly competitive model for MRC; a minimal extractive-QA sketch follows this list.
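The following minimal sketch shows extractive question answering over a short passage; the SQuAD-tuned checkpoint named here is an illustrative public model, not one of the systems discussed above.

```python
# Minimal MRC sketch using the question-answering pipeline.
from transformers import pipeline

qa = pipeline("question-answering",
              model="distilbert-base-cased-distilled-squad")

context = ("Transformer-XL introduces a segment-level recurrence "
           "mechanism and a novel positional encoding scheme to "
           "capture longer-term dependencies.")
result = qa(question="What mechanism does Transformer-XL introduce?",
            context=context)
print(result["answer"], result["score"])
```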
1.3. Machine Translation
Machine translation (MT) is the task of automatically converting a source text in one language to a target text in another language. The goal of machine translation is to produce a fluent and accurate translation that conveys the meaning of the source text in the target language. MT models often follow an encoder–decoder architecture, capturing the source context with a bidirectional encoder while being able to generate target text of arbitrary length, following the original formulation of the transformer architecture [1]. There are several subtasks within MT, each with its unique challenges and requirements; a minimal translation sketch is given below.
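The sketch assumes the Hugging Face transformers library and a public MarianMT English-to-German checkpoint as an illustrative encoder–decoder model.

```python
# Minimal encoder-decoder translation sketch.
from transformers import pipeline

translator = pipeline("translation_en_to_de",
                      model="Helsinki-NLP/opus-mt-en-de")
result = translator("Machine translation converts text between languages.")
print(result[0]["translation_text"])
```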
1.4. Text Classification
Text classification is the task of categorizing a text into one or more predefined categories based on its content. The goal of text classification is to automatically assign a label to a given text based on its content, allowing it to be organized and categorized for easier analysis and management. These models are trained on annotated text data in order to learn the relationship between the text content and its label, and can then be used to classify new, unseen text data. Text classification models typically follow an encoder-only architecture in order to effectively capture the entirety of the context using bidirectional attention. While text classification is often the most varied use case due to its commercial importance, two primary subcategories are prominent (a fine-tuning sketch follows the list):
- Document Classification: This task involves assigning a label or category to a full document, such as a news article, blog post, or scientific paper. Document classification is typically accomplished by first representing the document as a numerical vector and then using a machine learning model to make a prediction based on the document’s representation. LinkBERT [24] extends the pre-training objective of BERT to incorporate links between documents, which results in better classification quality.
- Cause and Effect Classification: This task involves identifying the cause-and-effect relationship between two events described in a sentence or paragraph. An approach by Hosseini et al. [25] has shown the efficacy of the language modeling paradigm by verbalizing knowledge graphs and using them as a pre-training corpus for a language model. The model obtains acceptable performance without any further fine-tuning or prompting.
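The sketch below shows the standard fine-tuning recipe for text classification: an encoder-only model with a classification head trained on labeled examples. The checkpoint, label set, and training data are illustrative placeholders.

```python
# Hedged sketch: one fine-tuning step of an encoder-only classifier.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3)   # e.g., sports / politics / tech
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

texts = ["The match ended in a draw.", "Parliament passed the bill."]
labels = torch.tensor([0, 1])            # illustrative annotations

batch = tokenizer(texts, padding=True, return_tensors="pt")
loss = model(**batch, labels=labels).loss   # cross-entropy over labels
loss.backward()
optimizer.step()
optimizer.zero_grad()
```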
1.5. Text Generation
Text Generation is a task in NLP in which the objective is to produce new text automatically, typically starting from a given prompt or input. The output can be a single word, phrase, sentence, or full-length piece of text, and is used for chatbots, content creation, and more. The generated text should reflect an understanding of the input and the language being generated, and the quality and coherence of the generated text can vary depending on the approach used. Text generation has seen a surge of interest following the release of commercial APIs such as GPT, Cohere, and ChatGPT. Text generation typically follows a decoder-only architecture; however, recent issues with prompt-injection attacks have shifted part of the focus towards encoder–decoder models that have been instruction-tuned, such as T5 [26]. While the most prominent approach to text generation is based on prompting (a minimal sketch follows the list below), several other forms of generation have been studied in the literature and have found commercial success as well. Text generation subtasks include:
- Dialogue Generation: This category focuses on generating text in the form of a conversation between two or more agents. Dialogue generation systems are used in various applications, such as chatbots, virtual assistants, and conversational AI systems. These systems use dialogue history, user input, and context to generate appropriate and coherent responses. P2-BOT [27] is a transmitter–receiver-based framework that aims to explicitly model understanding in chat dialogue systems through mutual persona perception, resulting in improved personalized dialogue generation based on both automated metrics and human evaluation.
- Code Generation: This category focuses on generating code based on a given input, such as a natural language description of a software problem. Code generation systems are used in software development to automate repetitive tasks, improve productivity, and reduce errors. These systems can be trained to use expert knowledge and can be specialized for a single programming language, such as SQL [28], or trained on a large corpus to support various programming languages and different programming paradigms [29].
- Data-to-Text Generation: This category focuses on generating natural language text from structured data such as tables, databases, or graphs. Data-to-text generation systems are used in various applications, such as news reporting, data visualization, and technical writing. These systems use natural language generation techniques to convert data into human-readable text, taking into account the context, target audience, and purpose of the text. Control Prefixes [30] extend prefix tuning by incorporating input-dependent information into a pre-trained transformer through attribute-level learnable representations, resulting in a parameter-efficient data-to-text model.
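A minimal prompted-generation sketch, assuming the Hugging Face transformers library and the public gpt2 checkpoint; the sampling settings are illustrative.

```python
# Minimal sketch: sampling a continuation from a decoder-only model.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
out = generator("Once upon a time in a quiet village,",
                max_new_tokens=40, do_sample=True, temperature=0.8)
print(out[0]["generated_text"])
```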
1.6. Text Summarization
Text Summarization is a task in NLP where the goal is to condense a given text into a shorter and more concise version while preserving its essential information. This is typically accomplished by identifying and extracting the most important information, sentences, or phrases from the original text. The resulting summary can be a few sentences long or a single bullet point and is intended to provide a quick overview of the content without the need to read the entire text. Text summarization is used in a variety of applications, such as news aggregation, document summarization, and more. Text summarization typically requires an encoder–decoder architecture to completely capture the source information. Depending on the input size, standard attention might prove too costly due to its quadratic computation cost in the sequence length. Methods such as [31] replace the attention layer with an equivalent (here, a pooled attention module) to efficiently handle larger context windows. Under this category, possible tasks are as follows:
- Extractive Summarization: This is the most straightforward subtask of text summarization, where the goal is to extract the most important sentences or phrases from a document and present them as a summary. Extractive summarization methods typically use a combination of information retrieval and natural language processing techniques to identify the most informative sentences or phrases in a document.
- The attention mechanism of Longformer [32] is a substitute for standard self-attention and merges localized windowed attention with globally focused attention. The encoder–decoder version of the Longformer (called LED) has demonstrated its effectiveness on the arXiv summarization dataset and is often used for processing long contexts in real-world applications (see the sketch after this list).
- Abstractive Summarization: This subtask aims to generate a summary by synthesizing new information based on the input document. Abstractive summarization methods typically use deep learning models, such as recurrent neural networks or transformers, to generate a summary. These models are trained on large amounts of data and can generate summaries that are more concise and coherent than extractive summaries. mBART [33] is a sequence-to-sequence transformer trained on multiple large-scale monolingual corpora with a denoising objective. Due to its rich pre-training data and its ability to process multiple languages using the same network, it excels at abstractive summarization.
- Multi-Document Summarization: This subtask addresses the problem of summarizing multiple related documents into a single summary. Multi-document summarization methods typically use information retrieval techniques to identify the most important documents and natural language processing techniques to generate a summary from the selected documents. While prior state-of-the-art methods relied on GNNs to take advantage of inherent connectivity, Primer by Xiao et al. [34] has shown better performance in zero-shot, few-shot, and fine-tuned paradigms by introducing a new pretraining objective in the form of predicting masked salient sentences.
- Query-Focused Summarization: This subtask focuses on summarizing a document based on a specific query or topic. Query-focused summarization methods typically use information retrieval techniques to identify the most relevant sentences or phrases in a document and present them as a summary. Baumel et al. [35] introduced a pre-inference step involving computing the relevance between the query and each sentence of the document. The quality of summarization has been shown to improve when incorporating this form of relevance as an additional input. Support for multiple documents is achieved using a simple iterative scheme that uses maximum word count as a budget.
- Sentence Compression: This subtask focuses on reducing the length of a sentence while preserving its meaning. Sentence compression methods typically use natural language processing techniques to identify redundant or unnecessary words or phrases in a sentence and remove them to create a more concise sentence. Ghalandari et al. [36] trained a six-layer DistilRoBERTa model with reinforcement learning to make binary keep-or-discard decisions for each word, thereby reducing sentence length.
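The sketch below, referenced in the Longformer item above, shows long-document summarization with the encoder–decoder LED model; the checkpoint and length settings are illustrative assumptions.

```python
# Hedged sketch: long-document summarization with LED.
import torch
from transformers import LEDTokenizer, LEDForConditionalGeneration

tokenizer = LEDTokenizer.from_pretrained("allenai/led-base-16384")
model = LEDForConditionalGeneration.from_pretrained("allenai/led-base-16384")

document = "Long article text goes here. " * 400  # stands in for a real article
inputs = tokenizer(document, return_tensors="pt",
                   truncation=True, max_length=16384)

# LED expects global attention on at least the first token.
global_attention_mask = torch.zeros_like(inputs.input_ids)
global_attention_mask[:, 0] = 1

summary_ids = model.generate(inputs.input_ids,
                             global_attention_mask=global_attention_mask,
                             max_new_tokens=128, num_beams=4)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```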
1.7. Sentiment Analysis
Sentiment Analysis is a task in NLP with the goal of determining the sentiment expressed in a given text. This is typically accomplished by assigning a sentiment label such as positive, negative, or neutral to the text based on its contents. The sentiment can be expressed in different forms, such as opinions, emotions, or evaluations, and can be expressed at various levels of granularity, such as at the document, sentence, or aspect level. Sentiment Analysis is used in a variety of applications, such as customer service, marketing, and opinion mining. The quality of the sentiment analysis results can be influenced by factors such as the subjectivity of the text, the tone, and the context in which the sentiment is expressed. Instruction-tuned models such as T5 [26] are often used in a zero-shot manner to perform sentiment analysis. XLNet [21] has been shown to be effective on several sentiment analysis leaderboards such as SST-2, IMDB, and Yelp fine-grained.
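As a concrete illustration of the zero-shot approach, the sketch below prompts an instruction-tuned T5 variant (FLAN-T5, an illustrative choice of checkpoint) to label a review.

```python
# Hedged sketch: zero-shot sentiment with an instruction-tuned model.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

prompt = ("Is the sentiment of the following review positive or "
          "negative? Review: The battery died after two days.")
ids = tokenizer(prompt, return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=5)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```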
1.8. Named Entity Recognition
Named Entity Recognition (NER) is a task in NLP with the goal of identifying and categorizing named entities present in a given text into predefined categories such as person names, organizations, locations, dates, and more. NER is an important subtask of information extraction and is used as an intermediate step in various applications such as question-answering, event extraction, and information retrieval. The quality of NER results can be influenced by factors such as the ambiguity of entity names, the presence of entity mentions with different forms, and the context in which the entities are expressed.
NER systems typically use machine learning techniques such as supervised learning to learn and identify named entities based on annotated training data. The output of NER is usually a sequence of tagged words, with each word being labeled with its corresponding entity class. As such, it falls under the paradigm of token-wise classification, with the added caveat that, unlike most classification tasks, it includes a null category. As the sentence and output lengths in NER are equal, it typically utilizes an encoder-only architecture. While the approach of fine-tuning a pretrained model with a classification head added on top works well in practice for NER, Automated Concatenation of Embeddings (ACE) [37] has shown improved results using an ensemble of several pretrained models while training only a simple classifier on top using reinforcement learning.
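A minimal token-wise classification sketch using the token-classification pipeline; the public NER checkpoint named here is an illustrative choice.

```python
# Minimal NER sketch: encoder-only token classification.
from transformers import pipeline

ner = pipeline("token-classification",
               model="dslim/bert-base-NER",
               aggregation_strategy="simple")  # merge word pieces into spans
for ent in ner("Ada Lovelace worked with Charles Babbage in London."):
    print(ent["entity_group"], ent["word"], round(ent["score"], 3))
```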
1.9. Information Retrieval
Information Retrieval (IR) is a task in NLP with the goal of retrieving relevant information from a large collection of documents in response to a user query. This is typically accomplished by matching the query terms against the document content and ranking the documents based on their relevance to the query. IR systems can be used for various applications, such as web search, document search, and question answering. The quality of the retrieval results can be influenced by factors such as the relevance of the documents, the effectiveness of the ranking algorithm, and the representation of the documents and queries.
IR systems are typically classified further based on the level of granularity, such as document, paragraph, or sentence. While symbolic methods dominated IR leaderboards for a long time, transformer-based embeddings are quickly becoming the norm within the research community. Commercial use, however, remains in its infancy due to more demanding hardware requirements compared to symbolic methods. Typical retrieval methods use a pretrained model such as RoBERTa [38] in a Siamese fashion to compute the similarity between two embeddings. For larger datasets, the embeddings are precomputed and stored in a vector database for faster lookup.
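The sketch below illustrates this dense-retrieval recipe: a Siamese sentence encoder embeds the query and documents, and cosine similarity ranks the results. The sentence-transformers checkpoint is an illustrative choice; at scale, the document embeddings would live in a vector database.

```python
# Hedged sketch: dense retrieval with a Siamese sentence encoder.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

docs = ["Transformers use self-attention.",
        "The stock market fell sharply today.",
        "BM25 is a classic ranking function."]
doc_emb = model.encode(docs, convert_to_tensor=True)   # precomputed offline

query_emb = model.encode("How does attention work?", convert_to_tensor=True)
scores = util.cos_sim(query_emb, doc_emb)[0]           # cosine similarities
best = scores.argmax().item()
print(docs[best], scores[best].item())
```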
2. Multimodal Applications
Multimodal applications are AI-driven systems that leverage multiple modalities, such as text, images, and videos, to process and analyze information. By integrating various forms of data, these applications enable more comprehensive and versatile solutions in diverse domains. The subsequent sections explore the primary categories of multimodal applications, including Generative Control, Description Generation, and Multimodal Question Answering. These categories showcase the potential of AI systems to deliver more robust and context-aware insights by utilizing different data types, ultimately leading to improved performance and user experience across a wide array of applications.
2.1. Generative Control
Generative Control is a task in multimodal NLP in which text is used as an interface to generate another modality, such as images or speech. The goal of Generative Control is to generate a target modality that corresponds to a given text description or instruction. For example, based on a textual description of an object, such as “a red sports car”, the task of Generative Control would be to generate an image of a red sports car. Generative Control combines the strengths of NLP and computer graphics or speech synthesis to produce high-quality and semantically meaningful outputs in the target modality. It has applications in areas such as computer vision, robotics, and human–computer interaction. Rombach et al. [39] used text as the primary modality for image generation. An open-source implementation of this method named StableDiffusion has generated vast interest as an alternative to the commercial API based on prior work by Ramesh et al. [40]. In the domain of text-to-speech (TTS), Wang et al. [41] combined traditional neural codecs with transformers, outperforming prior zero-shot TTS systems by treating the problem as conditional language modeling.
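A minimal text-to-image sketch using the open-source Stable Diffusion weights through the diffusers library; the checkpoint name is an illustrative assumption, and a GPU is assumed to be available.

```python
# Hedged sketch: text-conditioned image generation with Stable Diffusion.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)
pipe = pipe.to("cuda")  # assumes a GPU; drop fp16/cuda to run (slowly) on CPU

image = pipe("a red sports car parked by the sea").images[0]
image.save("red_sports_car.png")
```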
2.2. Description Generation
Description Generation is a task in which text is generated to describe another modality, such as an image or a point cloud. The goal of Description Generation is to automatically produce a textual description of the content of the target modality that accurately captures its key aspects and characteristics. For example, given an image of a scene, the task of Description Generation would be to generate a textual description of the objects, actions, and attributes present in the scene. Description generation commonly includes tasks such as image captioning and scene understanding.
mPLUG [42] is a transformer-based vision-language model that combines cross-modal understanding and generation, achieving state-of-the-art results on various vision-language tasks and addressing the inefficiency and linguistic-signal issues in existing models through its efficient cross-modal skip connections.
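The sketch below illustrates the image-captioning interface typical of description generation. mPLUG itself is not assumed to be available as a packaged model here; BLIP, a comparable vision-language model distributed through the transformers library, stands in as an illustrative choice.

```python
# Hedged sketch: image captioning with a vision-language model (BLIP).
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained(
    "Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base")

image = Image.new("RGB", (384, 384), color="white")  # stands in for a real photo
inputs = processor(images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))
```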
2.3. Multimodal Question Answering
Multimodal Question Answering (QA) is a task with the goal of answering questions about a given multimodal input, such as an image or a video, using information from multiple modalities. The task involves combining information from text, images, audio, and other modalities to accurately answer questions about the content of the input. For example, given an image of a scene and a question about the scene, such as “What is the color of the car?”, the task of Multimodal QA would be to identify the car in the image and answer the question with the correct color. Multimodal QA requires the integration of NLP, computer vision, and other relevant modalities to accurately answer questions about the content of the input. It has applications in areas such as intelligent tutoring systems, customer service, and multimedia retrieval.
Models utilized for multimodal QA usually show heterogeneity to effectively process modalities other than text. For example, Plepi et al. [43] used a stacked pointer network to aggregate information from a knowledge graph for conversational question answering. UniK-QA [44] uses a retriever–reader architecture that fetches the documents most relevant to the question based on dense embedding similarity and uses them as context during generation, supporting multiple modalities such as text, tables, lists, and knowledge bases within the documents.
BEiT [45] performs masked language modeling on images, texts, and image–text pairs using a shared backbone. For visual question answering, the model utilizes a fusion encoder in which patch and word embeddings share the attention component of the transformer block while having separate feed-forward layers in the initial stages. By simply fine-tuning a classifier on top, BEiT outperforms all previous methods by a large margin.
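A minimal visual question answering sketch; ViLT, a publicly available VQA-finetuned vision-language model, is used as an illustrative choice rather than the BEiT model discussed above.

```python
# Hedged sketch: visual question answering with ViLT.
from PIL import Image
from transformers import ViltProcessor, ViltForQuestionAnswering

processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
model = ViltForQuestionAnswering.from_pretrained(
    "dandelin/vilt-b32-finetuned-vqa")

image = Image.new("RGB", (384, 384), color="white")  # stands in for a real scene
question = "What is the color of the car?"

inputs = processor(image, question, return_tensors="pt")
logits = model(**inputs).logits
answer = model.config.id2label[logits.argmax(-1).item()]
print(answer)
```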
This entry is adapted from the peer-reviewed paper 10.3390/info14040242