Strategy for Catastrophic Forgetting Reduction in Incremental Learning: Comparison
Please note this is a comparison between Version 2 by Camila Xu and Version 1 by thi thanh thuy pham.

Catastrophic forgetting or catastrophic interference is a serious problem in continuous learning in machine learning. It happens not only in traditional machine learning algorithms such as SVM (Support Vector Machine), NB (Naive Bayes), DT (Decision Tree), and CRF (Conditional Random Field) but also in DNNs.

  • continual learning
  • memory reconstruction
  • loss combination

1. Introduction

Catastrophic forgetting or catastrophic interference is a serious problem in continuous learning in machine learning. It happens not only in traditional machine learning algorithms such as SVM (Support Vector Machine), NB (Naive Bayes), DT (Decision Tree), and CRF (Conditional Random Field) but also in DNNs. This phenomenon was firstly exposed in [1]. According to the observations in that work, when training on new tasks, the knowledge learned from previous tasks is forgotten by the machine learning model. The forgotten knowledge includes the weights learned from the source tasks that are overridden when learning the target tasks. The authors also demonstrated that the phenomenon of catastrophic forgetting is the main reason for degradation in the performance of the machine learning models. Since the announcement of [1] on catastrophic interference, several publications to address this challenge have emerged [2,3[2][3][4][5][6][7][8],4,5,6,7,8], and they shed more light on the causes of this phenomenon.
Considering traditional machine learning algorithms such as SVM (Support Vector Machine), NB (Naive Bayes), DT (Decision Tree), and CRF (Conditional Random Field), although they have been very successful in practice, they are inherently focused on single-task learning or isolated-task learning. Moreover, the models are fixed after deployment, or there is no learning after deployment. This causes limitations in leveraging the knowledge learned from these models to solve the new tasks or categories [1,3,4][1][3][4].
Recently, DNNs have achieved outstanding efficiency in different areas, but they require a large amount of training data [9,10,11][9][10][11]. In order for some networks to be effective, it is frequently necessary to reuse model parameters that were learned from training a sizable dataset [12]. The pre-trained models are then fine-tuned on specialized data to obtain higher efficiency [13,14,15][13][14][15]. Transfer learning can cause the old knowledge to be overwritten by the new one. This makes it impossible for the model to remember the previous content. Therefore, transfer learning is considered the main reason for catastrophic forgetting in DNNs.
How to mitigate the catastrophic forgetting effect in DNNs has recently attracted great attention from the research community [16,17,18][16][17][18]. As a result, several continual learning solutions have been proposed for language problems [19,20][19][20], object segmentation and 3D reconstruction [21], and the recognition field [7,22,23,24,25][7][22][23][24][25]. The strategies proposed for mitigating the catastrophic forgetting of DNNs can be grouped into three main approaches: (1) rehearsal-based methods; (2) loss regularization-based methods; and (3) architecture search-based methods. In the first one, an explicit memory is utilized for maintaining (i) raw samples [22[22][26][27],26,27], (ii) pseudo-samples generated using a generator [28], or (iii) network representations/parameters stored from past tasks [29,30][29][30]. The previously learned knowledge is then adapted to the new task training, which helps to avoid forgetting the previous tasks. This approach can be applied to different types of continual learning settings, such as class incremental learning, task incremental learning, and domain incremental learning. However, it requires an efficient strategy on what to store and how to update the old knowledge into new tasks; otherwise, overfitting will occur. The second approach focuses on adding a regularization term to the loss function [7,29,30,31][7][29][30][31]. The loss, together with the penalty factor, determine the effectiveness of injecting knowledge from the source task model to the target task model. This itself is an attractive research area. The third approach pays attention to the significance of the neural network architecture in continual learning. Individual architecture selections such as the number of hidden layers, batch normalization, skip connections, pooling layers, etc. can affect the performance of the continual learning [32].

2. An Efficient Strategy for Catastrophic Forgetting Reduction in Incremental Learning

The catastrophic interference problem caused by transfer learning in DNNs has been recognized since about 30 years ago [1,3,39][1][3][33]. Currently, several solutions have been proposed to solve the problem of lifelong learning without forgetting [40,41,42][34][35][36].

2.1. Rehearsal-Based Methods

The methods of this approach try to retain a small subset of the source task to replay in the target task. This is a direct and effective way to lessen catastrophic forgetting through continual learning. In the work of [22], a training strategy named iCaRL is proposed for class-incremental learning. The classifiers and a feature representation are learned simultaneously from a class-incremental data stream. For classification, iCaRL relies on sequential sets of the fixed exemplar images that are dynamically selected from the data stream. All exemplars in each set belong to a certain class. An update routine algorithm is proposed to train the batches of classes at the same time in an incremental way. The iCaRL method uses this algorithm to adjust the exemplars and network parameters stored in the memory to be adapted to the new class set of training. Based on this, it can learn new tasks without forgetting the old tasks. The experimental results on the CIFAR-100 and ImageNet ILSVRC 2012 datasets proved the incremental learning efficiency of iCaRL in comparison with other methods. In [26], a model of Gradient Episodic Memory (GEM) is proposed for continual learning. In GEM, a small amount of data per task is stored in an explicit memory, while the number of tasks is large. This is the opposite of other previous approaches to continual learning. The advantage of GEM is that it is able to constrain the training using real data. However, it is easily overfitted by the subset of stored samples because the replayed data are the only information that the model has for the previous tasks. An improved version of GEM called A-GEM was proposed in [27]. A-GEM focuses on balancing accuracy and efficiency in sample complexity, computational cost, and memory cost. In addition to the features inherited from GEM, a small change to the loss function was proposed to boost training faster while maintaining similar performance. In addition, cross validation is performed on a set of the disjointed tasks for evaluation. The experiments indicate that there is only a small gap in performance between the lifelong learning methods, including A-GEM. Forgetting in neural networks can be eliminated, but the transfer learning performance does not improve by much. Different from the abovementioned solutions, the methods of [43,44][37][38] utilize greedy selection strategies for replay memory. However, these are inefficient because of the additional memory required for greedy saving. To overcome this, several optimal strategies for sample selection have been proposed. In [45][39], the authors proposed a memory retrieval technique that recovers memory samples with increased loss as a result of estimated parameter updates based on the current task. The solution in [46][40] scores samples in the memory based on their capacity to maintain latent decision boundaries for previously observed classes. The strategy in [47][41] expressly promotes samples from the same class to cluster closely in an embedding space while discouraging samples from other classes during replay-based training. Instead of storing any old samples in replay memory as in the abovementioned methods, in [28], an infinite number of instances of old classes in the deep feature space are generated to overcome the classifier bias in class-incremental learning. In addition, the work in [48][42] uses orthogonal and low-dimensional vector subspaces for learning individual tasks. This helps to prevent catastrophic interference of learning target tasks from source tasks. In comparison with the rehearsal baseline methods with the same amount of memory, both accuracy and anti-forgetting ability are improved on deeper networks.

2.2. Loss Regularization-Based Solutions

The knowledge from the source task to the target task can be distilled via loss regularization in the training phase of the tasks. In [7], the authors proposed a solution called LwF (Learning without Forgetting), in which training is carried out on new task data while preserving the original capabilities. This overcomes the disadvantage of growing the number of new tasks with the increase in data that needs to be stored and retrained. The LwF method changes the network architecture and combines the sharing parameters of the trained model with the new model in the optimization function. However, the old model is frozen to retain the old knowledge and to optimize the extended network parameter. The LwF method outperforms the current standard practice of fine-tuning if only care about the performance of the new task. In addition, it highly depends on the relevance between the new tasks and the old tasks, and the training time increases linearly with the number of the learned tasks. The later proposed solutions for LwF showed more efficient improvements. In [30], the authors proposed two algorithms: Dark Experience Replay (DER) and DER++. DER looks for parameters that fit the current task well while approximating the behavior observed in the previous tasks. In order to retain knowledge of the previous tasks, the objective function implemented in DER is minimized by an optimization trajectory. DER++ feeds an additional term on buffer data points to the objective of DER to promote higher conditional likelihood to their ground truth labels with a minimal memory overhead. The work in [49][43] learns representations using the contrastive learning objective and reserves learned representations using a self-supervised distillation step. The experiments show that the proposed scheme performs better than baselines in various learning configurations. More efficient solutions for continual learning were recently published [25,38][25][44]. The work in [25] used all of the training data’s attributes when learning each task. This helps lessen the feature bias brought on by cross entropy, which only learns features that are discriminative for the task at hand but may not be discriminative for another. The network parameters that were previously learned must be adjusted in order to learn a new task well. The promising technique in [38][44] named AOP (Adaptive Orthogonal Projection) only changes the network parameters in the direction that is orthogonal to the subspace bounded by all of the prior task inputs. As a result, there is no requirement to memorize the input data for each task, and the knowledge about the prior tasks is incrementally updated. According to empirical analysis, these methods significantly outperformed most recent continual learning baselines.

References

  1. McCloskey, M.; Cohen, N.J. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of Learning and Motivation; Elsevier: Amsterdam, The Netherlands; Academic Press: Cambridge, MA, USA, 1989; Volume 24, pp. 109–165.
  2. Caruana, A. Corporate reputation: Concept and measurement. J. Prod. Brand Manag. 1997, 6, 109–118.
  3. Abraham, W.C.; Robins, A. Memory retention–the synaptic stability versus plasticity dilemma. Trends Neurosci. 2005, 28, 73–78.
  4. Richardson, F.M.; Thomas, M.S. Critical periods and catastrophic interference effects in the development of self-organizing feature maps. Dev. Sci. 2008, 11, 371–389.
  5. Mesnil, G.; Dauphin, Y.; Glorot, X.; Rifai, S.; Bengio, Y.; Goodfellow, I.; Lavoie, E.; Muller, X.; Desjardins, G.; Warde-Farley, D.; et al. Unsupervised and transfer learning challenge: A deep learning approach. In Proceedings of the ICML Workshop on Unsupervised and Transfer Learning, Edinburgh, UK, 27 June 2012; JMLR Workshop and Conference Proceedings. pp. 97–110.
  6. Donahue, J.; Jia, Y.; Vinyals, O.; Hoffman, J.; Zhang, N.; Tzeng, E.; Darrell, T. Decaf: A deep convolutional activation feature for generic visual recognition. In Proceedings of the International Conference on Machine Learning, Beijing, China, 21–26 June 2014; JMLR Workshop and Conference Proceedings. Volume 32, pp. 647–655.
  7. Li, Z.; Hoiem, D. Learning without forgetting. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 2935–2947.
  8. Parisi, G.I.; Tani, J.; Weber, C.; Wermter, S. Lifelong learning of spatiotemporal representations with dual-memory recurrent self-organization. Front. Neurorobot. 2018, 12, 78.
  9. Liu, H.; Liu, T.; Zhang, Z.; Sangaiah, A.K.; Yang, B.; Li, Y. ARHPE: Asymmetric relation-aware representation learning for head pose estimation in industrial human–computer interaction. IEEE Trans. Ind. Inform. 2022, 18, 7107–7117.
  10. Liu, H.; Fang, S.; Zhang, Z.; Li, D.; Lin, K.; Wang, J. MFDNet: Collaborative poses perception and matrix Fisher distribution for head pose estimation. IEEE Trans. Multimed. 2021, 24, 2449–2460.
  11. Liu, H.; Nie, H.; Zhang, Z.; Li, Y.F. Anisotropic angle distribution learning for head pose estimation and attention understanding in human-computer interaction. Neurocomputing 2021, 433, 310–322.
  12. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet: A Large-Scale Hierarchical Image Database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009.
  13. Truong, D.M.; Giang, D.; Tran, T.H.; Hai, V.; Le, T. Robustness Analysis of 3D Convolutional Neural Network for Human Hand Gesture Recognition. Int. J. Mach. Learn. Comput. 2019, 9, 135–142.
  14. Doan, H.-G.; Nguyen, N.T. End-to-end multiple modals deep learning system for hand posture recognition. Indones. J. Electr. Eng. Comput. Sci. 2022, 27, 214–221.
  15. Molchanov, P.; Gupta, S.; Kim, K.; Kautz, J. Hand gesture recognition with 3D convolutional neural networks. In Proceedings of the The IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Boston, MA, USA, 7–12 June 2015; pp. 1–7.
  16. Parisi, G.I.; Kemker, R.; Part, J.L.; Kanan, C.; Wermter, S. Continual lifelong learning with neural networks: A review. Neural Netw. 2019, 113, 54–71.
  17. Mirzadeh, S.I.; Farajtabar, M.; Gorur, D.; Pascanu, R.; Ghasemzadeh, H. Linear Mode Connectivity in Multitask and Continual Learning. In Proceedings of the International Conference on Learning Representations, Virtual Event, Austria, 3–7 May 2021.
  18. Kudithipudi, D.; Aguilar-Simon, M.; Babb, J.; Bazhenov, M.; Blackiston, D.; Bongard, J.; Brna, A.; Chakravarthi Raja, S.; Cheney, N.; Clune, J.; et al. Biological underpinnings for lifelong learning machines. Nat. Mach. Intell. 2022, 4, 196–210.
  19. Douillard, A.; Chen, Y.; Dapogny, A.; Cord, M. PLOP: Learning without Forgetting for Continual Semantic Segmentation. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 4039–4049.
  20. Saporta, A.; Douillard, A.; Vu, T.H.; P’erez, P.; Cord, M. Multi-Head Distillation for Continual Unsupervised Domain Adaptation in Semantic Segmentation. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), New Orleans, LA, USA, 19–20 June 2022; pp. 3750–3759.
  21. Jin, Y.; Jiang, D.; Cai, M. 3d reconstruction using deep learning: A survey. Commun. Inf. Syst. 2020, 20, 389–413.
  22. Rebuffi, S.A.; Kolesnikov, A.; Sperl, G.; Lampert, C. iCaRL: Incremental Classifier and Representation Learning. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 5533–5542.
  23. Maltoni, D.; Lomonaco, V. Continuous learning in single-incremental-task scenarios. Neural Netw. 2019, 116, 56–73.
  24. Carta, A.; Cossu, A.; Lomonaco, V.; Bacciu, D. Ex-Model: Continual Learning from a Stream of Trained Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 21–24 June 2022; pp. 3790–3799.
  25. Guo, Y.; Liu, B.; Zhao, D. Online Continual Learning through Mutual Information Maximization. In Proceedings of the 39th International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; Proceedings of Machine Learning Research. Volume 162, pp. 8109–8126.
  26. Lopez-Paz, D.; Ranzato, M. Gradient episodic memory for continual learning. In Proceedings of the 31st Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30, pp. 6468–6477.
  27. Chaudhry, A.; Ranzato, M.; Rohrbach, M.; Elhoseiny, M. Efficient Lifelong Learning with A-GEM. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019.
  28. Zhu, F.; Cheng, Z.; Zhang, X.y.; Liu, C.l. Class-Incremental Learning via Dual Augmentation. In Proceedings of the 35th Conference on Neural Information Processing Systems, Virtual-only Conference, 6–14 December 2021; Volume 34, pp. 14306–14318.
  29. Kirkpatrick, J.; Pascanu, R.; Rabinowitz, N.; Veness, J.; Desjardins, G.; Rusu, A.A.; Milan, K.; Quan, J.; Ramalho, T.; Grabska-Barwinska, A.; et al. Overcoming catastrophic forgetting in neural networks. Proc. Natl. Acad. Sci. USA 2017, 114, 3521–3526.
  30. Buzzega, P.; Boschini, M.; Porrello, A.; Abati, D.; Calderara, S. Dark experience for general continual learning: A strong, simple baseline. Adv. Neural Inf. Process. Syst. 2020, 33, 15920–15930.
  31. Zenke, F.; Poole, B.; Ganguli, S. Continual Learning through Synaptic Intelligence. In Proceedings of the 34th International Conference on Machine Learning, Sydney, NSW, Australia, 6–11 August 2017; Volume 70, pp. 3987–3995.
  32. Gao, Q.; Luo, Z.; Klabjan, D.; Zhang, F. Efficient architecture search for continual learning. IEEE Trans. Neural Netw. Learn. Syst. 2022, 1–11.
  33. French, R.M. Catastrophic forgetting in connectionist networks. Trends Cogn. Sci. 1999, 3, 128–135.
  34. Liu, H.; Zheng, C.; Li, D.; Shen, X.; Lin, K.; Wang, J.; Zhang, Z.; Zhang, Z.; Xiong, N.N. EDMF: Efficient deep matrix factorization with review feature learning for industrial recommender system. IEEE Trans. Ind. Inform. 2021, 18, 4361–4371.
  35. Liu, T.; Yang, B.; Liu, H.; Ju, J.; Tang, J.; Subramanian, S.; Zhang, Z. GMDL: Toward precise head pose estimation via Gaussian mixed distribution learning for students’ attention understanding. Infrared Phys. Technol. 2022, 122, 104099.
  36. Liu, H.; Liu, T.; Chen, Y.; Zhang, Z.; Li, Y.F. EHPE: Skeleton cues-based gaussian coordinate encoding for efficient human pose estimation. IEEE Trans. Multimed. 2022, 1–12.
  37. Prabhu, A.; Torr, P.H.S.; Dokania, P.K. GDumb: A Simple Approach that Questions Our Progress in Continual Learning. In Proceedings of the European Conference on Computer Vision, Online Event, 23–28 August 2020.
  38. Aljundi, R.; Lin, M.; Goujaud, B.; Bengio, Y. Gradient based sample selection for online continual learning. In Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, BC, Canada, 8–12 December 2019.
  39. Aljundi, R.; Belilovsky, E.; Tuytelaars, T.; Charlin, L.; Caccia, M.; Lin, M.; Page-Caccia, L. Online Continual Learning with Maximal Interfered Retrieval. In Advances in Neural Information Processing Systems 32; Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E., Garnett, R., Eds.; Curran Associates, Inc.: Sydney, NSW, Australia, 2019; pp. 11849–11860.
  40. Shim, D.; Mai, Z.; Jeong, J.; Sanner, S.; Kim, H.J.; Jang, J. Online Class-Incremental Continual Learning with Adversarial Shapley Value. Proc. Aaai Conf. Artif. Intell. 2021, 35, 9630–9638.
  41. Mai, Z.; Li, R.; Kim, H.J.; Sanner, S. Supervised Contrastive Replay: Revisiting the Nearest Class Mean Classifier in Online Class-Incremental Continual Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 21–24 June 2021; pp. 3589–3599.
  42. Chaudhry, A.; Khan, N.; Dokania, P.; Torr, P. Continual Learning in Low-rank Orthogonal Subspaces. In Proceedings of the 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, BC, Canada, 6–12 December 2020.
  43. Cha, H.; Lee, J.; Shin, J. Co2L: Contrastive Continual Learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Montreal, BC, Canada, 11–17 October 2021; pp. 9516–9525.
  44. Guo, Y.; Hu, W.; Zhao, D.; Liu, B. Adaptive Orthogonal Projection for Batch and Online Continual Learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual Event, 22 February–1 March 2022.
More
Video Production Service