Convolutional Neural Network + Recurrent Neural Network

Subjects: Computer Science, Artificial Intelligence

A combination of a convolutional neural network and a recurrent neural network (CNN + RNN) is an effective approach for many modern image recognition tasks that need to identify the behaviour of objects across a sequence of frames. For example, given footage from a security CCTV camera, one may want to identify what abnormal actions a person in the scene is performing (e.g. fighting with someone, breaking into a store, etc.). A deep convolutional neural network (e.g. ResNet50) has many layers of abstraction and is good at extracting essential features from each frame of the input stream. These extracted features, which may represent low-level image features or even high-level objects, can be monitored over a sequence of frames by a recurrent neural network (e.g. ConvLSTM) to detect whether a certain action or event has happened.

CNN+RNN
convolutional neural networks
recurrent neural networks
video surveillance
abnormal behaviour detection

1. Introduction

With the many emerging challenges in public management, security, and safety, there is an increasing need to monitor public scenes through surveillance cameras. At first sight, it seems an easy job for a human to monitor surveillance camera feeds, extract essential and helpful information from behavioural patterns, detect abnormal behaviours, and provide an immediate response ^[1]. However, because of severe limitations in human ability, it is hard for one person to monitor several signals simultaneously ^[2]. It is also a time-consuming task that requires many resources, such as people and workspace ^[3]. An automatic detection method is therefore crucial. One of the sub-domains of behaviour understanding ^[4] from surveillance cameras is the detection of anomalous events. Anomaly detection in surveillance cameras is a challenging task that faces several problems: (1) abnormal events rarely happen, so it is hard to find massive datasets of such events, and this lack of samples can make the learning process difficult; (2) generally, everything that does not follow a specified pattern (or rule) is called an “anomaly”, so the set of possible anomalies is open-ended.

From a learning standpoint, anomaly detection methods fall into the three well-known categories of learning methods: supervised, unsupervised, and semi-supervised. In supervised learning, there are two approaches, depending on whether the model is trained on only one category or on all existing categories ^[5]. In single-model learning, the model is trained on only normal (or only abnormal) events, whereas in multi-model learning it is trained on both normal and abnormal events. In single-model learning, anomalous events are distinguished from normal ones by learning a threshold that defines normality ^[6]^[7]^[8], by learning a multidimensional model of normal events within the feature space ^[9]^[10]^[11]^[12]^[13]^[14]^[15], or by learning rules that define the model ^[16]. In the multi-model approach, which is used particularly when there are several groups of anomalies, each class is trained dependently or independently ^[5]. Anomaly detection is, however, generally treated as an unsupervised learning problem ^[17]. This technique deals with unlabeled data, under the assumption that normal events occur frequently while abnormal events are rare. One drawback of this approach is that it considers all rare events anomalous ^[5]. Several clustering algorithms in unsupervised learning assume that normal and abnormal events are well separated in the feature space ^[18]^[19]^[20]. The semi-supervised approach, in turn, neither relies on labeled data as heavily as the supervised approach nor suffers from the low accuracy of unsupervised models ^[21]. It tries to narrow the gap between supervised and unsupervised techniques ^[17]. Several works take advantage of the semi-supervised learning scheme, such as ^[22]^[23]^[24].

2. Convolutional Neural Networks

CNNs are the most popular choice of neural network for image processing ^[25]. Their main advantage is the ability to extract complex hidden features from high-dimensional data with a complex structure, which makes them suitable feature extractors for sequential and image datasets ^[26]^[27]. The extracted deep features have been used in applications such as image quality assessment ^[28], skin lesion classification ^[29], and person re-identification ^[30]. Although CNNs are used in various deep learning tasks such as text classification and other NLP tasks, they are mainly used in computer vision, for example for image and video detection and classification ^[31]. Various kinds of CNNs have been built in the past decade, such as AlexNet, ResNet, VGG, Inception, and their variants. In the anomaly detection area, several works combine these convolutional neural networks with a softmax layer ^[32] or with morphological analysis ^[33]. In addition to CNNs, Xu et al. ^[34] and Hasan et al. ^[35] proposed autoencoder structures. Nguyen et al. ^[36] proposed a Bayesian nonparametric approach for abnormal event detection in video streams. Moreover, several other models, such as Fisher vector and PCA ^[37] and the Motion Interaction Field (MIF) ^[38], have been proposed in this area.

However, there are also some models that are designed mainly to focus on more than one dimension of the data.

ResNets are among the most common CNNs used for feature extraction in deep learning methods. A regular CNN is typically a combination of convolutional and fully connected layers ^[39]. The number of layers depends on several criteria, and each kind of CNN has its own structure. For instance, AlexNet has eight layers, and GoogLeNet has 22 layers. Another type of artificial neural network, the Residual Neural Network (ResNet), has a somewhat different structure. ResNet uses skip connections (or shortcuts), which can jump over layers. The main reason for using such shortcuts is to pass activations from earlier layers directly to later layers so that their information is better preserved, which reduces the chance of vanishing gradients ^[40].
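To make the effect of a skip connection concrete, the following minimal sketch compares how a gradient propagates through a plain stack of layers and through a stack with shortcuts. It is an illustration under simplifying assumptions, not the ResNet architecture itself: each "layer" is a hypothetical scalar function tanh(w * x) rather than a convolutional block, and the names layer, layer_grad, and forward_and_grad are invented for this example.

```python
import math


def layer(x, w):
    """Toy scalar 'layer': a weight followed by a tanh activation."""
    return math.tanh(w * x)


def layer_grad(x, w):
    """Derivative of layer(x, w) with respect to its input x."""
    return w * (1.0 - math.tanh(w * x) ** 2)


def forward_and_grad(x, weights, residual):
    """Run x through the stack; return (output, d output / d input)."""
    grad = 1.0
    for w in weights:
        local = layer_grad(x, w)
        if residual:
            # Skip connection: y = x + F(x), so dy/dx = 1 + F'(x).
            x, grad = x + layer(x, w), grad * (1.0 + local)
        else:
            # Plain layer: y = F(x), so dy/dx = F'(x).
            x, grad = layer(x, w), grad * local
    return x, grad


# Assumption: 20 identical layers with weight 0.5, chosen only for illustration.
weights = [0.5] * 20
_, plain_grad = forward_and_grad(1.0, weights, residual=False)
_, skip_grad = forward_and_grad(1.0, weights, residual=True)

print(f"plain stack gradient:    {plain_grad:.3e}")
print(f"residual stack gradient: {skip_grad:.3e}")
```

In this toy setting every plain layer scales the gradient by at most 0.5, so the product over 20 layers becomes very small, while every residual layer contributes a factor of at least 1, so the gradient cannot shrink.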

3. Recurrent Neural Networks

RNNs, on the other hand, are one of the well-known choices for capturing features when analyzing time series data ^[41]. However, they fail to retain context as the number of time steps increases. Long short-term memory (LSTM) networks were introduced to overcome this limitation by improving the handling of long-term dependencies in RNNs ^[42]. Because surveillance camera feeds are sequential in nature, LSTM networks have become more popular for anomaly detection applications ^[24], and several researchers have worked on anomaly detection problems using the LSTM structure. One approach is to use regularity scores derived from reconstruction errors in an LSTM-based network ^[43]^[44]. Furthermore, Srivastava et al. proposed an autoencoder model with an encoder LSTM and a decoder LSTM in an unsupervised learning approach ^[45].
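The idea of a regularity score can be sketched as follows. The snippet assumes that per-frame reconstruction errors are already available from some LSTM-based model (the values below are made up), and it uses min-max normalization as one plausible way to map errors to scores; the exact formula used in the cited works may differ. The function name regularity_scores and the threshold value are hypothetical.

```python
def regularity_scores(errors):
    """Map per-frame reconstruction errors to scores in [0, 1].

    1.0 marks the most regular (lowest-error) frame, 0.0 the least regular.
    """
    lo, hi = min(errors), max(errors)
    if hi == lo:
        # All frames reconstruct equally well; treat them all as regular.
        return [1.0 for _ in errors]
    return [1.0 - (e - lo) / (hi - lo) for e in errors]


# Hypothetical reconstruction errors, one per frame.
errors = [0.10, 0.12, 0.11, 0.45, 0.50, 0.13]
scores = regularity_scores(errors)

# Assumption: a hand-picked threshold; in practice it would be tuned on data.
THRESHOLD = 0.5
flagged = [t for t, s in enumerate(scores) if s < THRESHOLD]

print(flagged)
```

Frames with a low regularity score are those the model reconstructs poorly, which is taken as a sign that they deviate from the patterns seen during training.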

However, RNN-only methods could not achieve high accuracy. They mainly predict the subsequent frames in a video time series and decide whether a video segment is abnormal by calculating the difference between the ground truth and the predicted value. Because abnormal events do not follow a particular pattern, it is difficult to conclude that an uncommon event has happened based only on the prediction of the next frame.
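The prediction-based decision described above can be sketched in a few lines. This is a simplified illustration, not any cited method: frames are tiny grayscale grids stored as nested lists, the "predicted" frames are hard-coded stand-ins for the output of a recurrent model, mean squared error is assumed as the difference measure, and the threshold is an arbitrary value. The names mse and flag_frames are hypothetical.

```python
def mse(frame_a, frame_b):
    """Mean squared error between two equally sized grayscale frames."""
    diffs = [
        (a - b) ** 2
        for row_a, row_b in zip(frame_a, frame_b)
        for a, b in zip(row_a, row_b)
    ]
    return sum(diffs) / len(diffs)


def flag_frames(predicted, actual, threshold):
    """Return indices of frames whose prediction error exceeds threshold."""
    return [
        t
        for t, (p, a) in enumerate(zip(predicted, actual))
        if mse(p, a) > threshold
    ]


# Toy 2x2 frames with pixel values in [0, 1].
actual = [
    [[0.1, 0.1], [0.1, 0.1]],
    [[0.1, 0.2], [0.1, 0.1]],
    [[0.9, 0.9], [0.1, 0.1]],  # sudden change in the top row
]
# Stand-in for model output: the predictor expects the scene to stay similar.
predicted = [
    [[0.1, 0.1], [0.1, 0.1]],
    [[0.1, 0.1], [0.1, 0.1]],
    [[0.1, 0.2], [0.1, 0.1]],
]

abnormal = flag_frames(predicted, actual, threshold=0.05)
print(abnormal)
```

The sketch also shows the limitation: a large error only says that a frame was hard to predict, which is not the same as saying that an abnormal event occurred.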

4. CNN + RNN

Deep learning architectures perform well at learning spatial features (via CNNs) and temporal features (via LSTMs) individually. Spatiotemporal networks (STNs) are networks in which both spatial and temporal relational features are learned ^[46]. In STNs, CNNs and LSTMs are combined to extract spatiotemporal features ^[47]. After the CNN is applied to the data, the output of the CNN structure (ResNet or AlexNet, for instance) becomes the input of the subsequent LSTM. Several researchers have adopted such techniques to find abnormal events in video datasets ^[48]^[49]^[50]. Furthermore, another approach has emerged in recent years in which a convolutional layer filters the output of the CNN before it enters the LSTM structure ^[43]^[51]^[52]. This approach is called Convolutional LSTM, or ConvLSTM. Because it uses convolutions in place of the fully connected operations of the LSTM, the number of parameters decreases dramatically. The chance of overfitting therefore decreases, which can boost the model’s performance.
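The reduction in parameters can be checked with a short back-of-the-envelope calculation. The sketch below counts the weights of a standard LSTM layer and of a ConvLSTM layer using the usual four-gate formulation without peephole connections. The feature-map size (7 x 7 with 512 channels), the 64 hidden channels, and the 3 x 3 kernel are assumptions chosen only for illustration and are not taken from the source; the function names are hypothetical.

```python
def lstm_params(input_dim, hidden_dim):
    """Weights of a fully connected LSTM layer (four gates, with biases)."""
    return 4 * (input_dim * hidden_dim + hidden_dim * hidden_dim + hidden_dim)


def conv_lstm_params(kernel, in_channels, hidden_channels):
    """Weights of a ConvLSTM layer (four gates, with biases, no peepholes)."""
    per_gate = kernel * kernel * (in_channels + hidden_channels) + 1
    return 4 * hidden_channels * per_gate


# Assumed CNN output: a 7 x 7 feature map with 512 channels.
height, width, in_channels = 7, 7, 512
hidden_channels = 64

# A fully connected LSTM must flatten the feature map; to keep a state of
# comparable size, its hidden vector is given height * width * hidden_channels
# units.
fc = lstm_params(height * width * in_channels, height * width * hidden_channels)

# A ConvLSTM keeps the spatial layout and shares a 3 x 3 kernel across it.
conv = conv_lstm_params(3, in_channels, hidden_channels)

print(f"fully connected LSTM: {fc:,} parameters")
print(f"ConvLSTM:             {conv:,} parameters")
print(f"ratio:                {fc / conv:.0f}x")
```

Under these assumptions the convolutional variant needs a small fraction of the weights, because its kernels are shared across all spatial positions instead of connecting every input element to every hidden unit.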

This entry is adapted from the peer-reviewed paper 10.3390/app12031021

References

Hospedales, T.; Gong, S.; Xiang, T. Video behaviour mining using a dynamic topic model. Int. J. Comput. Vis. 2012, 98, 303–323.
Sulman, N.; Sanocki, T.; Goldgof, D.; Kasturi, R. How effective is human video surveillance performance? In Proceedings of the 2008 19th IEEE International Conference on Pattern Recognition, ICPR, Tampa, FL, USA, 8–11 December 2008; pp. 1–3.
Nguyen, T.N.; Meunier, J. Anomaly detection in video sequence with appearance-motion correspondence. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, ICCV, Seoul, Korea, 27 October–2 November 2019; pp. 1273–1283.
Tian, B.; Morris, B.T.; Tang, M.; Liu, Y.; Yao, Y.; Gou, C.; Shen, D.; Tang, S. Hierarchical and networked vehicle surveillance in ITS: A survey. IEEE Trans. Intell. Transp. Syst. 2017, 18, 25–48.
Sodemann, A.A.; Ross, M.P.; Borghetti, B.J. A review of anomaly detection in automated surveillance. IEEE Trans. Syst. Man, Cybern. Part C (Appl. Rev.) 2012, 42, 1257–1272.
Zweng, A.; Kampel, M. Unexpected human behavior recognition in image sequences using multiple features. In Proceedings of the 2010 20th International Conference on Pattern Recognition, ICPR, Istanbul, Turkey, 23–26 August 2010; pp. 368–371.
Jodoin, P.M.; Konrad, J.; Saligrama, V. Modeling background activity for behavior subtraction. In Proceedings of the 2008 Second ACM/IEEE International Conference on Distributed Smart Cameras, Trento, Italy, 9–11 September 2008; pp. 1–10.
Dong, Q.; Wu, Y.; Hu, Z. Pointwise motion image (PMI): A novel motion representation and its applications to abnormality detection and behavior recognition. IEEE Trans. Circuits Syst. Video Technol. 2009, 19, 407–416.
Mecocci, A.; Pannozzo, M.; Fumarola, A. Automatic detection of anomalous behavioural events for advanced real-time video surveillance. In Proceedings of the 3rd International Workshop on Scientific Use of Submarine Cables and Related Technologies, Lugano, Switzerland, 31 July 2003; pp. 187–192.
Li, H.P.; Hu, Z.Y.; Wu, Y.H.; Wu, F.C. Behavior modeling and abnormality detection based on semi-supervised learning method. Ruan Jian Xue Bao (J. Softw.) 2007, 18, 527–537.
Yao, B.; Wang, L.; Zhu, S.C. Learning a scene contextual model for tracking and abnormality detection. In Proceedings of the 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, Anchorage, AK, USA, 23–28 June 2008; pp. 1–8.
Yin, J.; Yang, Q.; Pan, J.J. Sensor-based abnormal human-activity detection. IEEE Trans. Knowl. Data Eng. 2008, 20, 1082–1090.
Benezeth, Y.; Jodoin, P.M.; Saligrama, V.; Rosenberger, C. Abnormal events detection based on spatio-temporal co-occurences. In Proceedings of the 2009 IEEE conference on computer vision and pattern recognition CVPR, Miami, FL, USA, 20–25 June 2009; pp. 2458–2465.
Dong, N.; Jia, Z.; Shao, J.; Xiong, Z.; Li, Z.; Liu, F.; Zhao, J.; Peng, P. Traffic abnormality detection through directional motion behavior map. In Proceedings of the 2010 7th IEEE International Conference on Advanced Video and Signal Based Surveillance, Boston, MA, USA, 29 August–1 September 2010; pp. 80–84.
Loy, C.C.; Xiang, T.; Gong, S. Detecting and discriminating behavioural anomalies. Pattern Recognit. 2011, 44, 117–132.
Zhang, J.; Liu, Z. Detecting abnormal motion of pedestrian in video. In Proceedings of the 2008 International Conference on Information and Automation, Changsha, China, 20–23 June 2008; pp. 81–85.
Ruff, L.; Vandermeulen, R.A.; Görnitz, N.; Binder, A.; Müller, E.; Müller, K.R.; Kloft, M. Deep semi-supervised anomaly detection. arXiv 2019, arXiv:1906.02694.
Tang, Y.P.; Wang, X.J.; Lu, H.F. Intelligent video analysis technology for elevator cage abnormality detection in computer vision. In Proceedings of the 2009 Fourth International Conference on Computer Sciences and Convergence Information Technology, Seoul, Korea, 24–26 November 2009; pp. 1252–1258.
Feng, J.; Zhang, C.; Hao, P. Online learning with self-organizing maps for anomaly detection in crowd scenes. In Proceedings of the 2010 20th International Conference on Pattern Recognition, Istanbul, Turkey, 23–26 August 2010; pp. 3599–3602.
Sharif, M.H.; Uyaver, S.; Djeraba, C. Crowd behavior surveillance using Bhattacharyya distance metric. In Proceedings of the International Symposium Computational Modeling of Objects Represented in Images, Buffalo, NY, USA, 5–7 May 2010; Springer: Berlin/Heidelberg, Germany, 2010; pp. 311–323.
Xiang, G.; Min, W. Applying Semi-supervised cluster algorithm for anomaly detection. In Proceedings of the 2010 Third International Symposium on Information Processing, Qingdao, China, 15–17 October 2010; pp. 43–45.
Wang, J.; Neskovic, P.; Cooper, L.N. Pattern classification via single spheres. In Proceedings of the 8th International Conference on Discovery Science, Singapore, 8–11 October 2005; Springer: Berlin/Heidelberg, Germany, 2005; pp. 241–252.
Liu, W.; Luo, W.; Lian, D.; Gao, S. Future frame prediction for anomaly detection–a new baseline. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6536–6545.
Ergen, T.; Mirza, A.H.; Kozat, S.S. Unsupervised and semi-supervised anomaly detection with LSTM neural networks. arXiv 2017, arXiv:1710.09207.
Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012; pp. 1097–1105.
Gorokhov, O.; Petrovskiy, M.; Mashechkin, I. Convolutional neural networks for unsupervised anomaly detection in text data. In Proceedings of the 18th International Conference on Intelligent Data Engineering and Automated Learning, Guilin, China, 30 October–1 November 2017; Springer: Cham, Switzerland, 2017; pp. 500–507.
Kim, Y. Convolutional neural networks for sentence classification. arXiv 2014, arXiv:1408.5882.
Varga, D. Multi-pooled inception features for no-reference image quality assessment. Appl. Sci. 2020, 10, 2186.
Kawahara, J.; BenTaieb, A.; Hamarneh, G. Deep features to classify skin lesions. In Proceedings of the 2016 IEEE 13th International Symposium on Biomedical Imaging (ISBI), Prague, Czech Republic, 13–16 April 2016; pp. 1397–1400.
Bai, X.; Yang, M.; Huang, T.; Dou, Z.; Yu, R.; Xu, Y. Deep-person: Learning discriminative deep features for person re-identification. Pattern Recognit. 2020, 98, 107036.
Alzubaidi, L.; Zhang, J.; Humaidi, A.J.; Al-Dujaili, A.; Duan, Y.; Al-Shamma, O.; Santamaría, J.; Fadhel, M.A.; Al-Amidie, M.; Farhan, L. Review of deep learning: Concepts, CNN architectures, challenges, applications, future directions. J. Big Data 2021, 8, 53.
Christiansen, P.; Nielsen, L.N.; Steen, K.A.; Jørgensen, R.N.; Karstoft, H. DeepAnomaly: Combining background subtraction and deep learning for detecting obstacles and anomalies in an agricultural field. Sensors 2016, 16, 1904.
Dong, L.; Zhang, Y.; Wen, C.; Wu, H. Camera anomaly detection based on morphological analysis and deep learning. In Proceedings of the 2016 IEEE International Conference on Digital Signal Processing (DSP), Beijing, China, 16–18 October 2016; pp. 266–270.
Xu, D.; Ricci, E.; Yan, Y.; Song, J.; Sebe, N. Learning deep representations of appearance and motion for anomalous event detection. arXiv 2015, arXiv:1510.01553.
Hasan, M.; Choi, J.; Neumann, J.; Roy-Chowdhury, A.K.; Davis, L.S. Learning temporal regularity in video sequences. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 733–742.
Nguyen, V.; Phung, D.; Pham, D.S.; Venkatesh, S. Bayesian nonparametric approaches to abnormality detection in video surveillance. Ann. Data Sci. 2015, 2, 21–41.
Wang, H.; Schmid, C. Action recognition with improved trajectories. In Proceedings of the IEEE International Conference on Computer Vision, Sydney, Australia, 1–8 December 2013; pp. 3551–3558.
Yun, K.; Yoo, Y.; Choi, J.Y. Motion interaction field for detection of abnormal interactions. Mach. Vis. Appl. 2017, 28, 157–171.
Fu, J.; Rui, Y. Advances in deep learning approaches for image tagging. APSIPA Trans. Signal Inf. Process. 2017, 6, E11.
He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
Ebrahimi Kahou, S.; Michalski, V.; Konda, K.; Memisevic, R.; Pal, C. Recurrent neural networks for emotion recognition in video. In Proceedings of the 2015 17th ACM on International Conference on Multimodal Interaction, Seattle, WA, USA, 9–13 November 2015; pp. 467–474.
Karim, F.; Majumdar, S.; Darabi, H.; Chen, S. LSTM fully convolutional networks for time series classification. IEEE Access 2017, 6, 1662–1669.
Medel, J.R.; Savakis, A. Anomaly detection in video using predictive convolutional long short-term memory networks. arXiv 2016, arXiv:1612.00390.
Singh, A. Anomaly Detection for Temporal Data Using Long Short-Term Memory (LSTM). Master’s Thesis, KTH Royal Institute of Technology, Stockholm, Sweden, 2017.
Srivastava, N.; Mansimov, E.; Salakhudinov, R. Unsupervised learning of video representations using lstms. In Proceedings of the 32nd International Conference on Machine Learning (ICML), Lille, France, 6–11 June 2015; pp. 843–852.
Zhang, H.; Zheng, Y.; Yu, Y. Detecting urban anomalies using multiple spatio-temporal data sources. Proc. Acm Interact. Mobile Wearable Ubiquitous Technol. 2018, 2, 1–18.
Chalapathy, R.; Chawla, S. Deep learning for anomaly detection: A survey. arXiv 2019, arXiv:1901.03407.
Sultani, W.; Chen, C.; Shah, M. Real-world anomaly detection in surveillance videos. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6479–6488.
Dong, Z.; Qin, J.; Wang, Y. Multi-stream deep networks for person to person violence detection in videos. In Proceedings of the 7th Chinese Conference on Pattern Recognition (CCPR), Chengdu, China, 5–7 November 2016; Springer: Singapore, 2016; pp. 517–531.
Zhou, S.; Shen, W.; Zeng, D.; Fang, M.; Wei, Y.; Zhang, Z. Spatial–temporal convolutional neural networks for anomaly detection and localization in crowded scenes. Signal Process. Image Commun. 2016, 47, 358–368.
Sudhakaran, S.; Lanz, O. Learning to detect violent videos using convolutional long short-term memory. In Proceedings of the 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Lecce, Italy, 29 August–1 September 2017; pp. 1–6.
Xingjian, S.H.I.; Chen, Z.; Wang, H.; Yeung, D.Y.; Wong, W.K.; Woo, W.C. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; pp. 802–810.

© Text is available under the terms and conditions of the Creative Commons Attribution (CC BY) license; additional terms may apply.