Deep learning approaches to the detection of visual data instances that markedly digress from regular sequences have mostly focused on outdoor video-surveillance scenarios, mainly regarding abnormal behaviour and suspicious or abandoned object detection. However, with the increasing importance of public and shared transportation for urban mobility, it becomes imperative to provide autonomous intelligent systems capable of detecting abnormal behaviour that threatens passenger safety. In-vehicle monitoring becomes particularly relevant for Shared Autonomous Vehicles, which do not have a driver responsible for assuring the well-being and safety of passengers; such vehicles must be accompanied by reliable autonomous in-vehicle surveillance systems.
Georgescu et al.[17] proposed several alterations to frame prediction, innovating by learning to discriminate between forward- and backward-moving objects, a task referred to as the arrow of time. Essentially, the method considered both classification and detection information, producing large prediction discrepancies when anomalies occur. This approach was inspired by the object-centric perspective of Ionescu et al.[18], which employed an object detector on each frame and applied a convolutional autoencoder to learn deep unsupervised representations for a one-versus-rest classification.
The main drawbacks of semi-supervised approaches are the lack of consideration for the diversity of normal patterns and the ability of deep learning techniques to correctly recreate abnormal video frames based on already abnormal inputs. To address these drawbacks, Park et al.[19] proposed a memory module that updates items in the memory while assuring that these represent prototypical patterns of normal data. Similarly, Cai et al.[20] attempted to assure appearance and motion consistency through modality memory pools. Two separate pools were created to store this information, one comprising appearance features and the other consisting of motion features, guaranteeing a robust feature representation of normality.
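To make the idea of a memory of normal prototypes more concrete, the following minimal sketch shows one common way of reading from such a memory: a query feature is compared with every stored item and the read-out is a softmax-weighted combination of the items. The function name `read_memory`, the dot-product similarity and the toy two-dimensional prototypes are assumptions made for illustration only; they are not taken from the implementations of Park et al.[19] or Cai et al.[20].

```python
import math


def read_memory(query, memory):
    """Return a softmax-weighted combination of memory items for one query.

    Hypothetical sketch: `query` is a feature vector and `memory` is a list
    of prototype vectors of the same length.
    """
    # Dot-product similarity between the query and every memory item.
    sims = [sum(q * m for q, m in zip(query, item)) for item in memory]
    # Numerically stable softmax over the similarities.
    peak = max(sims)
    exps = [math.exp(s - peak) for s in sims]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the items: the read-out is a convex combination
    # of the stored prototypes.
    dim = len(query)
    read = [sum(w * item[d] for w, item in zip(weights, memory))
            for d in range(dim)]
    return read, weights


# Two toy prototypes standing in for learned "normal" patterns.
memory = [[1.0, 0.0], [0.0, 1.0]]
read, weights = read_memory([5.0, 0.0], memory)
```

Because the read-out is always a combination of stored normal prototypes, a decoder fed with it has less freedom to reproduce an abnormal input faithfully, which is the intuition behind using such a module to widen the gap between normal and abnormal frames.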
Several papers followed the MIL framework, suggesting improvements to the method. The inner-bag score gap regularisation was introduced by Zhang et al.[22] to increase the gap between the lowest and highest scores in a positive bag and reduce it in a negative one. Wan et al.[23] proposed a dynamic MIL-loss and centre-guided regularisation; the former enlarged the interclass dispersion, and the latter reduced the intraclass distance of normal snippets. Additionally, Zhu et al.[25], in an encoder-based approach, suggested an attention-based MIL model capable of encoding motion-aware features by using an autoencoder based on optical flow.
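The inner-bag score gap idea can be illustrated with a short sketch. The code below assumes that each bag is simply a list of per-snippet anomaly scores in [0, 1], and it uses a hinge-style penalty with a hypothetical `margin` parameter; this is one plausible formulation written for illustration, not the exact loss of Zhang et al.[22].

```python
def inner_bag_gap_penalty(pos_scores, neg_scores, margin=1.0):
    """Penalise a small score gap in the positive bag and a large one in the negative bag.

    Hypothetical sketch: `pos_scores` and `neg_scores` are lists of
    per-snippet anomaly scores for an abnormal and a normal video.
    """
    # Gap between the highest and lowest score inside each bag.
    gap_pos = max(pos_scores) - min(pos_scores)
    gap_neg = max(neg_scores) - min(neg_scores)
    # The positive bag should contain both normal and abnormal snippets,
    # so its gap is pushed towards the margin; the negative bag should be
    # uniformly normal, so its gap is pushed towards zero.
    return max(0.0, margin - gap_pos) + gap_neg


well_separated = inner_bag_gap_penalty([0.0, 1.0], [0.2, 0.2])
poorly_separated = inner_bag_gap_penalty([0.5, 0.5], [0.1, 0.9])
```

In practice, a term of this kind would be added to the main MIL ranking objective rather than used on its own.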
Zhong et al.[26] noted that, in the methods that used MIL, if the model incorrectly predicted anomalous instances in the positive bag, the error would propagate to subsequent instance selection. To tackle this problem, they reformulated the task as binary classification under noisy labels and suggested the use of a Graph Convolutional Network (GCN) to correct low-confidence anomaly scores, replacing them with high-confidence ones. Even though this work achieved better accuracy in the identification of anomalies when compared to MIL-based approaches, training both a GCN and MIL is computationally expensive and may cause unstable performance due to the unconstrained latent space.
A similar approach was implemented by Landi et al.[28], focusing on spatiotemporal tubes instead of the entirety of video segments containing full frames. UCFCrime2Local, an enriched subsection of 100 burglary and assault sequences from UCF-Crime[21], was presented as a separate dataset for anomaly detection with bounding box supervision in its training and test sets. The proposed model was able to provide spatiotemporal proposals for unseen surveillance videos leveraging only video-level labels, enlarging the anomaly dataset without additional human labelling.
It is desirable to learn an anomaly detection model capable of performing well under multiple scenes and viewing angles. To address these drawbacks, ShanghaiTech[4] was developed, taking advantage of multiple surveillance cameras with different viewing angles installed at different spots to capture real events at a university campus. ShanghaiTech has challenging light conditions and camera angles, as Figure 3c,d exemplifies. It contains 130 abnormal events and pixel-level ground truth annotations of these events.
Figure 3. Abnormal frames extracted from widely used datasets for training and benchmarking video anomaly tasks. (a) Two bikers amongst the pedestrians in Ped1[7] dataset. (b) Car and biker in a pedestrian walkway in Ped2[7] dataset. (c) A normal frame from ShanghaiTech dataset[4]. (d) Two people fighting in ShanghaiTech dataset[4].
3.1.2. Real-World Anomalies
Motivated by the limitations of previous datasets, UCF-Crime[21] was developed as a new large-scale dataset to evaluate video anomaly detection. It is composed of 1900 untrimmed videos of real-world surveillance footage, extracted from the internet, with an average length of 4 min each. It includes 13 types of anomalous events with a high impact on public safety, such as abuse, burglary, shoplifting and shooting, as displayed in Figure 4a,b. UCF-Crime contains annotated bounding boxes of anomalous regions in one image per 16 frames of each abnormal video. A considerable amount of available data was essential for the development of weakly supervised strategies.
XD-Violence[29] was originally released to provide a large-scale and multi-scene dataset for violence detection and classification. Furthermore, it contains audio-visual signals, allowing for research on multi-modal solutions for this problem. XD-Violence consists of 4754 weakly labelled untrimmed videos with audio, which were collected from both films and YouTube. This dataset embraces a variety of scenarios and anomalies, for instance, rioting and explosions, as shown in Figure 4c,d.
Figure 4. Comparison between normal and abnormal frames extracted from real-world anomalies datasets. (a) Frame from a normal activity extracted from UCF-Crime[21]. (b) Abnormal frame from UCF-Crime[21], showing a shooting. (c) Frame from a normal activity in XD-Violence[29]. (d) Abnormal frame from XD-Violence[29], representing an explosion.
SVIRO-Uncertainty[33] is a high-quality synthetic dataset that is not directly related to the task of anomaly detection. Nonetheless, it has the potential to be adapted to study a subset of this problem: the detection of abandoned or dangerous objects. The original goal of this dataset was to train models capable of classifying the object that is occupying each position. SVIRO-Uncertainty is made up of sequences of the rear bench of a vehicle, in which each of the three seats might contain a passenger or an object, as displayed in Figure 5c,d. The dataset is quite large, containing two separate training sets: one with 4384 scenes containing adult passengers only and another with 3515 scenes using adults, child seats and infant seats.
Choosing the best model for a new use case such as anomaly detection inside a vehicle is not straightforward. The typical scenario of the publicly available datasets does not faithfully represent the new environment in which anomalies must be detected; therefore, their use does not produce an authentic benchmark of the proposed methods. Most of these sequences were captured with stationary video cameras that were recording static backgrounds. Although cameras inside vehicles are also stationary, the windows of a moving vehicle produce a partially moving background in the recorded sequence. The distance between the cameras and the subjects is much smaller inside a vehicle, increasing the effect of geometric distortions on the captured information. Additionally, headlights of other vehicles, public illumination and occlusions of sunlight produce more frequent illumination perturbations in the scene than those found in datasets that focus on a pedestrian walkway, for instance. The behaviour of the available models in such scenes is uncertain, as they were not specifically designed or tested for such problems.
Anomaly detection in confined spaces, such as the interior of vehicles, is an interesting new application scenario for deep anomaly detection methods. However, as the work of Augusto et al.[5] demonstrates, the development of solutions for this use case is still fully dependent on the availability of private datasets. Moreover, that work did not consider the relevance of objects, whether as a danger to the passengers or simply as items left behind by one of them. The latter is of significant importance in the suggested shared autonomous vehicle scenario.
Creating new datasets or expanding existing ones appears to be an immediate need when considering new applications for anomaly detection. The former is a complex and costly task that implies allocating resources for staging and recording the desired interactions. Hence, an attractive option relies on synthetic data that could be generated for direct use or to augment available data. The work of Acsintoae et al.[31] is an interesting approach to the translation of simulated objects to real-world datasets. Similar hybrid strategies could be employed to circumvent the lack of data for in-vehicle monitoring applications. Furthermore, such strategies could pre-emptively add some artificial variety to the available video sequences. The work of Capozzi et al.[34] has linked the lack of actor independence with the underperformance of the trained models, as a bias develops that links certain actors to certain actions instead of the model learning the pattern of the action.
A common issue with the proposed deep anomaly detection techniques was noted by Pang et al.[10]: most anomaly detection studies focus on detection performance only, ignoring the capability of illustrating the identified anomalies. Although it would be relevant to classify the abnormal behaviour that was detected, the detection could represent a novel anomaly. Hence, it is crucial to at least provide spatial cues that indicate the specific portion of the data that is anomalous. These cues might prove useful as a tool for interpreting such complex models and identifying scenarios in which they could be missing.
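As an illustration of what such a spatial cue could look like, the sketch below derives a per-pixel error map from a frame and its reconstruction (or prediction) and returns the bounding box of the pixels whose error exceeds a threshold. The function names, the squared-error measure and the threshold value are assumptions chosen for this example; they do not correspond to any specific method surveyed here.

```python
def error_map(frame, reconstruction):
    """Per-pixel squared error between a frame and its reconstruction.

    Hypothetical sketch: both inputs are 2D lists of grey-level values
    with identical shape.
    """
    return [[(a - b) ** 2 for a, b in zip(row_f, row_r)]
            for row_f, row_r in zip(frame, reconstruction)]


def anomalous_box(errors, threshold):
    """Bounding box (top, left, bottom, right) of pixels above the threshold."""
    hits = [(r, c)
            for r, row in enumerate(errors)
            for c, e in enumerate(row)
            if e > threshold]
    if not hits:
        # No pixel is considered anomalous in this frame.
        return None
    rows = [r for r, _ in hits]
    cols = [c for _, c in hits]
    return (min(rows), min(cols), max(rows), max(cols))


# Toy 3x3 frame in which the model fails to reconstruct two pixels.
frame = [[0.0, 0.0, 0.0],
         [0.0, 1.0, 1.0],
         [0.0, 0.0, 0.0]]
reconstruction = [[0.0, 0.0, 0.0] for _ in range(3)]
box = anomalous_box(error_map(frame, reconstruction), threshold=0.25)
```

A box of this kind can be overlaid on the original frame, giving an operator a visual indication of where the model's output deviates from the input even when the type of anomaly is unknown.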