During the last few years, several technological advances have led to an increase in the creation and consumption of audiovisual multimedia content. Users are overexposed to videos via social media, video-sharing websites, and mobile applications. For efficient browsing, searching, and navigation across multimedia collections and repositories, e.g., for finding videos that are relevant to a particular topic or interest, this ever-increasing content should be described by informative yet concise content representations. A common solution to this problem is the construction of a brief summary of a video, which can be presented to the user instead of the full video, so that she/he can then decide whether to watch or ignore it. Such summaries are ideally more expressive than other alternatives, such as brief textual descriptions or keywords.
1. Introduction
During the last few years, and mainly due to the rise of social media and video-sharing websites, there has been an exponential increase in the amount of user-generated audiovisual content. The average user captures and shares many aspects of her/his daily life, such as (a) personal videos, e.g., time spent with friends and/or family and hobbies; (b) activity videos, e.g., sports and other similar activities; (c) reviews, i.e., sharing opinions regarding products, services, movies, etc.; and (d) how-to videos, i.e., videos created by users in order to teach other users how to accomplish a task. Of course, apart from those who create content, a plethora of users consume massive amounts of such content daily. Notably, YouTube users watch more than 1 billion hours of visual content daily, while also creating and uploading more than 500 h of new content
[1]. Of course, these numbers are expected to increase further within the next few years, resulting in overexposure to massive amounts of data, which in turn may prevent users from capturing relevant information. This becomes even more difficult in the case of lengthy content, thus necessitating tools that aid this task
[1].
Video summarization is a promising solution to the aforementioned problem, aiming to extract the most relevant segments of a given video in order to create a shorter, more informative, and engaging version of the original that preserves its main content and context
[2]. Specifically, a generated summary typically consists of a set of representative frames (i.e., the “keyframes”) or a set of video fragments. These parts should be kept in their original temporal order, and the summary should be of much shorter duration than the original video while still including its most relevant elements. Applications of video summarization include efficient browsing and retrieval of visual arts content (such as films and documentaries), TV shows, medical videos, surveillance videos, and so forth.
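To make the distinction between the two output types more concrete, the following minimal Python sketch models a keyframe-based and a fragment-based summary and checks the two constraints mentioned above (original temporal order and a much shorter duration). The container names and the 15% duration budget are illustrative assumptions, not part of any cited method.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class StaticSummary:
    keyframe_indices: List[int]           # indices into the original frame sequence

@dataclass
class DynamicSummary:
    fragments: List[Tuple[float, float]]  # (start_sec, end_sec) of selected fragments

def is_valid_dynamic_summary(summary: DynamicSummary,
                             video_duration: float,
                             budget: float = 0.15) -> bool:
    """Check that fragments keep their original temporal order and that the
    total duration respects an (assumed) budget relative to the full video."""
    starts = [s for s, _ in summary.fragments]
    in_order = starts == sorted(starts)
    total = sum(e - s for s, e in summary.fragments)
    return in_order and total <= budget * video_duration

# Example: two fragments (35 s in total) from a 10 min video.
summary = DynamicSummary(fragments=[(10.0, 25.0), (60.0, 80.0)])
print(is_valid_dynamic_summary(summary, video_duration=600.0))  # True
```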
Video summarization techniques may be grouped into four main categories
[1], which differ based on their output, i.e., the actual summary that is delivered to its end user
[2]. Specifically, these categories are
[3][4][5][6] (a) a collection of video frames (keyframes), (b) a collection of video segments, (c) graphical cues, and (d) textual annotations. Summaries belonging to (a) and (b) are frequently referred to as “static” and “dynamic”, respectively. Note that a dynamic summary preserves the audio and the motion of videos, whereas a static summary consists of a collection of still video frames. In addition, graphical cues are rarely employed in conjunction with other methodologies. As expected, users tend to prefer dynamic summaries over static ones
[7]. Video summarization techniques may also be categorized as (a) unimodal approaches, i.e., those that are based only on the visual content of the video, and (b) multimodal approaches, i.e., those that use more than one of the available modalities, such as the audio, textual, and semantic (i.e., depicted objects, scenes, people, etc.) content of the video
[8]. Depending on the training approach that is used, summarization techniques may be categorized as (a) supervised, i.e., those that are based on datasets that have been annotated by human annotators on either a per-frame or a per-fragment basis; (b) unsupervised, i.e., those that do not rely on any kind of ground-truth data, but instead use a large corpus of available data so as to “learn” the important parts; and (c) weakly supervised, i.e., those that do not need exact, full annotations, but instead are based on weak labels, which are imperfect yet able to produce powerful predictive models.
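As a small illustration of the per-fragment versus per-frame annotation schemes used by supervised methods, the sketch below expands hypothetical fragment-level annotations into per-frame binary labels; the tuple format is an assumption made only for this example.

```python
import numpy as np

def fragment_labels_to_frame_labels(n_frames, fragments):
    """Expand fragment-level annotations into per-frame binary labels.

    `fragments` is a list of (start_frame, end_frame, is_important) tuples;
    frames covered by an 'important' fragment receive label 1.
    """
    labels = np.zeros(n_frames, dtype=np.int64)
    for start, end, is_important in fragments:
        if is_important:
            labels[start:end + 1] = 1
    return labels

# Example: a 10-frame video where frames 2-4 belong to an important fragment.
print(fragment_labels_to_frame_labels(10, [(2, 4, True), (6, 9, False)]))
# -> [0 0 1 1 1 0 0 0 0 0]
```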
2. Video Summarization
A method for summarizing user-generated videos, based on the fusion of the audio and visual modalities, was proposed in
[1]. Specifically, the video summarization task was addressed as a binary, supervised classification problem, relying on audio and visual features. The proposed model was trained to recognize the “important” parts of audiovisual content. A key component of this approach was its dataset, which consisted of user-generated, single-camera videos and a set of extracted attributes. Each video included a per-second annotation indicating its “importance”.
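The following sketch illustrates, in heavily simplified form, the early-fusion binary classification idea described above: per-second audio and visual features are concatenated and fed to a binary classifier that estimates an “importance” probability for each second. The feature dimensions, the random placeholder inputs, and the use of logistic regression are assumptions made only for this example and do not correspond to the actual model of [1].

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical per-second features; in practice, real audio and visual
# descriptors would be extracted, here random arrays stand in for them.
n_seconds = 600
audio_feats = np.random.rand(n_seconds, 64)    # e.g., spectral / energy features
visual_feats = np.random.rand(n_seconds, 512)  # e.g., CNN frame embeddings
labels = np.random.randint(0, 2, n_seconds)    # per-second "importance" annotation

# Early fusion: concatenate the two modalities into a single feature vector.
fused = np.concatenate([audio_feats, visual_feats], axis=1)

clf = LogisticRegression(max_iter=1000).fit(fused, labels)
importance = clf.predict_proba(fused)[:, 1]    # probability that each second is "important"
```

The seconds with the highest predicted probabilities could then be concatenated, in their original order, to form the summary.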
The fundamental purpose of a video summarization approach, according to
[9], is to create a more compact version of the original video without sacrificing much semantic information, while keeping it relatively complete for the viewer. In that work, the authors introduced SASUM, a novel approach that, in contrast to previous algorithms that focused only on the diversity of the summary, extracted the most descriptive elements of the video while summarizing it. SASUM, in particular, comprised a frame selector and video descriptors used to assemble the final video so that the difference between the produced description and the human-created description was minimized. In
[10], a user-attention-model-based strategy for keyframe extraction and video skimming was developed. Audio, visual, and linguistic features were extracted, and an attention model was created based on the motion vector field, resulting in a motion model. Three types of maps, based on intensity, spatial coherence, and temporal coherence, were created and then combined into a saliency map. A static model was also used to pick important background regions and to extract faces and camera attention elements. Finally, audio, speech, and music models were developed. The aforementioned attention components were linearly fused to construct an “attention” curve. Keyframes were extracted at the local maxima of this curve within shots, whereas skim segments were chosen based on a variety of factors.
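A minimal sketch of the final fusion step described above is given below: per-frame attention components are linearly combined into an attention curve, and keyframes are taken at its local maxima. The fusion weights and the random toy inputs are assumptions; in [10] the components are derived from the motion, static, and audio models discussed above.

```python
import numpy as np

def fuse_attention(motion, static, audio, weights=(0.4, 0.3, 0.3)):
    """Linearly fuse per-frame attention components into a single curve."""
    w_m, w_s, w_a = weights
    return w_m * motion + w_s * static + w_a * audio

def local_maxima_keyframes(curve):
    """Return the indices where the attention curve has a local maximum."""
    return [i for i in range(1, len(curve) - 1)
            if curve[i] > curve[i - 1] and curve[i] > curve[i + 1]]

# Toy example with random per-frame attention values.
rng = np.random.default_rng(0)
n_frames = 200
curve = fuse_attention(rng.random(n_frames), rng.random(n_frames), rng.random(n_frames))
keyframes = local_maxima_keyframes(curve)
```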
Based on deep features extracted by a convolutional neural network (CNN), the authors of
[11] trained a deep adversarial long short-term memory (LSTM) network, consisting of a “summarizer” and a “discriminator”, to reduce the distance between original videos and their summaries. The former, in particular, was made up of a selector and an encoder that picked out relevant frames from the input video and converted them into a deep feature vector. The latter was a decoder that distinguished between “original” and “summary” frames. The proposed deep neural network aimed to deceive the discriminator into considering the two representations identical, by presenting the video summary as if it were the original input video. Otani et al.
[12] proposed a deep video feature extraction approach with the goal of locating the most interesting parts of the video that are necessary for video content analysis.
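The following heavily simplified PyTorch sketch illustrates the summarizer/discriminator interplay described for [11] above: the summarizer scores and softly selects frames before encoding them, while the discriminator tries to tell original sequences from summary-like ones. Layer sizes, the soft selection mechanism, and the module layout are assumptions made only for this sketch, and the decoder component of [11] is not reproduced here.

```python
import torch
import torch.nn as nn

class Summarizer(nn.Module):
    """Selector + encoder: scores frames and encodes the softly selected sequence."""
    def __init__(self, feat_dim=1024, hidden=256):
        super().__init__()
        self.selector = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.score = nn.Linear(2 * hidden, 1)
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)

    def forward(self, frames):                    # frames: (B, T, feat_dim)
        h, _ = self.selector(frames)
        weights = torch.sigmoid(self.score(h))    # per-frame selection scores in [0, 1]
        weighted = frames * weights               # soft frame selection
        _, (enc, _) = self.encoder(weighted)
        return weights, enc[-1]                   # frame scores and summary embedding

class Discriminator(nn.Module):
    """Decides whether a frame sequence comes from the original video or a summary."""
    def __init__(self, feat_dim=1024, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, frames):
        _, (h, _) = self.lstm(frames)
        return torch.sigmoid(self.out(h[-1]))     # probability of being "original"
```

During adversarial training, the summarizer would be rewarded when the discriminator cannot distinguish the two, while the discriminator is trained to separate them.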
The methodology proposed by
[13] used sequences of original video frames as input and produced their predicted significance scores. Specifically, a sequence-to-sequence learning framework was adopted to formulate the summarization task, addressing the problems of short-term attention deficiency and distribution inconsistency. Extensive tests on benchmark datasets indicated that the proposed ADSum technique was superior to other existing approaches. A supervised methodology for the automatic selection of keyframes from important subshots of videos was proposed in
[14]. These keyframes serve as the summary, while the core concept of this approach is to model the variable-range temporal dependencies between video frames using long short-term memory (LSTM) networks, taking into consideration the sequential structure that is essential for producing informative video summaries. In the work of
[15], the specific goal of video summarization was to make it easier for users to browse videos by creating brief and informative summaries that are diverse and representative of the original videos. To summarize videos, the authors used a deep summarization network (DSN), which selected the video frames to be included in the summary based on probability distributions. Specifically, it predicted a probability for each video frame, indicating how likely it was to be selected. Note that, within this process, labels were not necessary; thus, the DSN approach may operate in a completely unsupervised manner. In
[16], a novel video summarization technique called VISCOM was introduced, which is based on color co-occurrence matrices computed from the video, which were then utilized to characterize each video frame. A summary was then created from the most informative frames of the original video. In order to assess the robustness of the model, VISCOM was tested on a large number of videos from a range of genres.
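To clarify what a co-occurrence descriptor looks like in practice, the sketch below computes a normalized co-occurrence matrix for a single quantized grayscale frame using only the horizontal right-neighbour relation. VISCOM itself operates on color information and a richer set of spatial relations, so this is an assumed, simplified stand-in rather than the actual descriptor of [16].

```python
import numpy as np

def cooccurrence_descriptor(frame, levels=8):
    """Quantize a grayscale frame and count co-occurrences of neighbouring levels.

    `frame` is a 2-D array with values in [0, 255]; only the horizontal
    right-neighbour relation is counted here for brevity.
    """
    q = (frame.astype(np.int64) * levels) // 256           # quantize to `levels` bins
    m = np.zeros((levels, levels), dtype=np.int64)
    for a, b in zip(q[:, :-1].ravel(), q[:, 1:].ravel()):   # pixel and its right neighbour
        m[a, b] += 1
    return m / m.sum()                                      # normalized descriptor

# Frames whose descriptors differ most from those of their neighbours could
# then be ranked as candidates for inclusion in the summary.
frame = np.random.randint(0, 256, size=(120, 160))
descriptor = cooccurrence_descriptor(frame)
```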
Finally, the authors in
[17] introduce a novel method for supervised video summarization that addresses the limitations of existing recurrent neural network (RNN)–based approaches, particularly in terms of modeling long-range dependencies between frames and parallelizing the training process. The proposed model employs self-attention mechanisms to determine the significance of video frames. Unlike previous attention-based summarization techniques that model frame dependencies by examining the entire frame sequence, this method integrates global and local multihead attention mechanisms to capture different levels of granularity in frame dependencies. The attention mechanisms also incorporate a component that encodes the temporal position of video frames, which is crucial for video summarization. The results show that the model outperforms existing attention-based methods and is competitive with other top-performing supervised summarization approaches.
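The sketch below illustrates, under assumed dimensions, how a global and a local multihead attention branch with a learned temporal position encoding can be combined into a per-frame importance scorer, as described above. It is not the exact architecture of [17]; the window size, embedding sizes, and the way the two branches are merged are assumptions made only for this example.

```python
import torch
import torch.nn as nn

class AttentionFrameScorer(nn.Module):
    """Scores frame importance with a global and a local multihead attention branch."""
    def __init__(self, feat_dim=1024, heads=8, local_window=30, max_len=10000):
        super().__init__()
        self.pos = nn.Embedding(max_len, feat_dim)           # learned temporal positions
        self.global_attn = nn.MultiheadAttention(feat_dim, heads, batch_first=True)
        self.local_attn = nn.MultiheadAttention(feat_dim, heads, batch_first=True)
        self.local_window = local_window
        self.head = nn.Sequential(nn.Linear(feat_dim, 1), nn.Sigmoid())

    def forward(self, frames):                               # frames: (B, T, feat_dim)
        B, T, _ = frames.shape
        x = frames + self.pos(torch.arange(T, device=frames.device))
        g, _ = self.global_attn(x, x, x)                     # every frame attends to all frames
        # Local branch: mask out frame pairs farther apart than `local_window` frames.
        idx = torch.arange(T, device=frames.device)
        mask = (idx[None, :] - idx[:, None]).abs() > self.local_window
        l, _ = self.local_attn(x, x, x, attn_mask=mask)
        return self.head(g + l).squeeze(-1)                  # per-frame importance in (0, 1)

# Toy usage: importance scores for a 120-frame sequence of 1024-D features.
scores = AttentionFrameScorer()(torch.randn(1, 120, 1024))   # shape (1, 120)
```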