A convolutional neural network and recurrent neural network (CNN + RNN) combination is an effective approach for many modern image recognition tasks that need to identify the behaviour of objects through a sequence of frames. For example, in a security CCTV camera footage, we want to identify what abnormal actions a character is doing in the scene (e.g. fighting with someone, breaking into a store, etc.). A deep convolutional neural network (e.g. ResNet50) has many layers of abstraction and is good for extracting essential features in each frame of the input stream. These extracted features, which may represent low-level image features or even high-level objects, can be monitored over a sequence of frames by a recurrent neural network (e.g. ConvLSTM) so as to detect whether a certain action or event has happened.
1. Introduction
With the many emerging challenges in public management, security, and safety, there is an increasing need for monitoring public scenes through surveillance cameras. At first sight, it seems an easy job for a human to monitor surveillance cameras feed to extract essential and helpful information from behavioral patterns, detect abnormal behaviours, and provide immediate response [1]. However, due to severe limitations in human ability, it is hard for a person to monitor simultaneous signals [2]. It is also a time-consuming task requiring many resources such as people and workspace [3]. Therefore, an automatic detection method is crucial to this end. One of the sub-domain in behaviour understanding [4] from surveillance cameras is detecting anomalous events. Anomaly detection in surveillance cameras is a challenging task that might face several problems: (1) abnormal events rarely happen; therefore, it is hard to find massive datasets of such events. This lack of samples might lead to some difficulties in the learning process. (2) Generally, everything that does not follow a specified pattern (or rule) is called an “anomaly”.
From a learning standpoint, anomaly detection can be divided into three approaches: supervised, unsupervised, and semi-supervised, as a significant and well-known categorizing for learning methods. In supervised learning, there are two different approaches by considering whether the model is trained by only one category or all existing categories [5][7]. In other words, in single model learning, the model is trained by only normal(or abnormal) events, whereas in multi-model learning, both normal and abnormal events need to be trained. In the single model learning, anomalous events distinguished from normal ones by learning a threshold for normality definition [6][7][8][8,9,10], learning of a multidimensional model of normal events within the feature space [9][10][11][12][13][14][15][11,12,13,14,15,16,17] and learning rules for model definition [16][18]. While, for the multi-model learning approach, which is particularly used when there are several groups of anomalies, each class will be trained dependently or independently [5][7]. On the other hand, an anomaly detection problem is generally considered as an unsupervised learning problem [17][19]. This technique deal with unlabeled data in which it is assumed that Normal events frequently occur while Abnormal events rarely happen in data. Considering all rare events as anomalous is one of the drawbacks of this learning [5][7]. Several clustering algorithms in unsupervised learning consider normal and abnormal events should be well separated in the feature space [18][19][20][20,21,22]. Besides, the semi-supervised Anomaly detection approach neither is too reliable on labeled data like the supervised approach nor have a low accuracy as unsupervised models [21][23]. This model tries to diminish the differences between supervised and unsupervised techniques [17][19]. Several works take advantage of the properties of semi-supervised learning schema such as in [22][23][24][24,25,26].
2. Convolutional Neural Networks
CNNs are the most popular choice of neural network for the image processing goals
[25][32]. Extracting complex hidden features from high dimensional data with a complex structure is the main advantage of CNNs, making them suitable feature extractors for sequential and image datasets
[26][27][33,34]. The extracted deep features were utilized in different applications like image quality assessment
[28][35], skin lesions classification
[29][36], and person re-identification
[30][37]. Although CNNs are widely used in various deep learning tasks like text classification and NLP, they are mainly used in computer vision, such as for image and video detection and classification
[31][38]. Various kinds of CNNs have been built in the recent decade like AlexNet, ResNets, VGG, Inceptions and their variants. Several works were also done by combining these convolutional neural networks with a softmax layer
[32][39], and morphological analysis
[33][40] in the anomaly detection area. In addition to CNN, Xu et al.
[34][41] and Hasan et al.
[35][42] proposed autoencoder structures. Nguyen et al.
[36][43] proposed a Bayesian nonparametric approach for abnormal event detection in video streams. Moreover, several other models like Fisher vector and PCA
[37][44], Motion Interaction Field (MIF)
[38][45] have been proposed in this scope.
However, there is also some model that is mainly designed for focusing on more than one dimension of data.
One of the most common CNN used for feature extraction in deep learning methods is ResNets. A regular CNN is typically a combination of convolutional and fully connected layers
[39][46]. The number of layers depends on several criteria, and each kind of CNNs has its structure. For instance, AlexNet has eight layers, and GoogleNet has 22 layers. Another type of Artificial Neural Network called Residual Neural Network (ResNet) has a somehow different structure. ResNet uses skip connection (or shortcuts), which can jump over layers. The main reason for using such shortcuts is to pass activations from previous layers to subsequent layers for better memorizing the parameters, which leads to diminishing the chance of vanishing gradients
[40][47].
3. Recurrent Neural Networks
On the other hand, RNNs is one of the well-known choices for capturing features in analyzing time series data
[41][48]. However, they fail in extracting context as time steps increases. Long short-term memory (LSTM) networks were introduced to overcome this limitation by improving the long-term dependency in RNNs
[42][49]. Due to the sequential nature of the surveillance camera feeds, LSTM networks have become more popular for anomaly detection applications
[24][26]. Therefore, several researchers worked on anomaly detection problems using the LSTM structure. Using regularity scores from reconstruction errors in an LSTM-based network is one approach of using LSTM to solve anomaly detection problems
[43][44][50,51]. Furthermore, Srivastava et al. proposed a model using autoencoders, the encoder LSTM, and Decoder LSTM in an unsupervised learning approach
[45][52].
However, the only RNNs methods could not achieve high accuracy results. They mainly predict the subsequent frames in a video time-series and, by calculating the difference between the ground truth and predicted value, decide whether the video segment is abnormal or not. Therefore, as the abnormal events do not follow a particular algorithm, it is difficult to say an uncommon event happened based on the prediction of the next frame.
4. CNN + RNN
Deep learning architectures perform well in learning spatial (via CNNs) and temporal (via LSTMs) features individually. Spatiotemporal networks (STNs) are networks in which spatial and temporal relation features are learned
[46][53]. In STNs, both CNNs and LSTMs are combined to extract spatiotemporal features
[47][31]. After applying CNN to the data, the output of the CNN structure (ResNet or AlexNet, for instance) will be the input of the subsequent LSTM. Several researchers adopt such techniques for detection in video dataset like
[48][49][50][30,54,55] for finding abnormal events. Furthermore, another approach has emerged in recent years in which a convolutional layer filters the output of CNN before entering the LSTM structure
[43][51][52][50,56,57]. This new approach is called Convolutional LSTM or ConvLSTM. As a result, instead of fully connected in LSTM, a convolutional layer dramatically decreases the number of parameters. Therefore, the chance of overfitting decreases, and it can boost the model’s performance.