1. Introduction
Catastrophic forgetting, or catastrophic interference, is a serious problem in continual learning in machine learning. It occurs not only in traditional machine learning algorithms such as SVM (Support Vector Machine), NB (Naive Bayes), DT (Decision Tree), and CRF (Conditional Random Field) but also in DNNs (Deep Neural Networks). The phenomenon was first exposed in [1]. According to the observations in that work, when a machine learning model is trained on new tasks, the knowledge learned from previous tasks is forgotten: the weights learned from the source tasks are overridden when learning the target tasks. The authors also demonstrated that catastrophic forgetting is the main reason for degradation in the performance of machine learning models. Since [1] announced catastrophic interference, several publications addressing this challenge have emerged [2,3,4,5,6,7,8], shedding more light on the causes of this phenomenon.
Traditional machine learning algorithms such as SVM, NB, DT, and CRF have been very successful in practice, but they are inherently focused on single-task or isolated-task learning. Moreover, these models are fixed after deployment; no further learning takes place. This limits how much of the knowledge learned by these models can be leveraged to solve new tasks or categories [1,3,4].
Recently, DNNs have achieved outstanding results in different areas, but they require a large amount of training data [9,10,11]. For such networks to be effective, it is frequently necessary to reuse model parameters learned by training on a sizable dataset [12]. The pre-trained models are then fine-tuned on specialized data to obtain higher performance [13,14,15]. However, transfer learning can cause the old knowledge to be overwritten by the new, making it impossible for the model to retain previously learned content. Transfer learning is therefore considered the main cause of catastrophic forgetting in DNNs.
How to mitigate the catastrophic forgetting effect in DNNs has recently attracted great attention from the research community [16,17,18]. As a result, several continual learning solutions have been proposed for language problems [19,20], object segmentation and 3D reconstruction [21], and the recognition field [7,22,23,24,25]. The strategies proposed for mitigating catastrophic forgetting in DNNs can be grouped into three main approaches: (1) rehearsal-based methods; (2) loss regularization-based methods; and (3) architecture search-based methods. In the first, an explicit memory is utilized for maintaining (i) raw samples [22,26,27], (ii) pseudo-samples generated using a generator [28], or (iii) network representations/parameters stored from past tasks [29,30]. The previously learned knowledge is then adapted during training on the new task, which helps to avoid forgetting the previous tasks. This approach can be applied to different continual learning settings, such as class-incremental, task-incremental, and domain-incremental learning. However, it requires an efficient strategy for deciding what to store and how to carry the old knowledge into new tasks; otherwise, overfitting will occur. The second approach focuses on adding a regularization term to the loss function [7,29,30,31]. The loss, together with the penalty factor, determines how effectively knowledge is injected from the source task model into the target task model; this is itself an attractive research area. The third approach pays attention to the significance of the neural network architecture in continual learning. Individual architectural choices such as the number of hidden layers, batch normalization, skip connections, pooling layers, etc., can affect continual learning performance [32].
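To make the second approach concrete, the following sketch shows the generic form of a regularized continual-learning objective in PyTorch. The quadratic penalty and the per-parameter importance weights are illustrative assumptions in the spirit of this family of methods, not a reproduction of any one work cited above.

```python
import torch

def regularized_loss(model, task_loss, old_params, omega, lam=1.0):
    """Generic loss-regularization objective for continual learning.

    task_loss : loss on the current task's batch (e.g., cross entropy)
    old_params: snapshot of parameters taken after the previous task
    omega     : per-parameter importance weights (method-specific)
    lam       : penalty strength balancing stability vs. plasticity
    """
    penalty = 0.0
    for name, p in model.named_parameters():
        # A quadratic penalty keeps weights that were important for the
        # previous tasks close to their earlier values.
        penalty = penalty + (omega[name] * (p - old_params[name]) ** 2).sum()
    return task_loss + lam * penalty
```

The penalty factor lam controls the trade-off the text describes: a larger value protects old knowledge at the cost of plasticity on the new task.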
2. An Efficient Strategy for Catastrophic Forgetting Reduction in Incremental Learning
The catastrophic interference problem caused by transfer learning in DNNs was recognized about 30 years ago [1,3,33]. Since then, several solutions have been proposed to address lifelong learning without forgetting [34,35,36].
2.1. Rehearsal-Based Methods
The methods of this approach retain a small subset of the source task's data to replay when training on the target task. This is a direct and effective way to lessen catastrophic forgetting in continual learning. In [22], a training strategy named iCaRL is proposed for class-incremental learning. The classifiers and a feature representation are learned simultaneously from a class-incremental data stream. For classification, iCaRL relies on sets of exemplar images that are dynamically selected from the data stream, where all exemplars in a set belong to the same class. An update routine is proposed to train batches of classes incrementally; iCaRL uses this routine to adjust the exemplars and network parameters stored in memory so that they adapt to the new set of classes being trained. Based on this, it can learn new tasks without forgetting the old ones. Experimental results on the CIFAR-100 and ImageNet ILSVRC 2012 datasets demonstrated the incremental learning efficiency of iCaRL in comparison with other methods.
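For intuition, iCaRL's classification rule, the nearest-mean-of-exemplars classifier, can be sketched in a few lines of NumPy. The feature extractor producing `phi_x` is assumed to be the network's learned representation, and the dictionary layout of the exemplar memory is an illustrative simplification.

```python
import numpy as np

def nearest_mean_classify(phi_x, exemplar_features):
    """Nearest-mean-of-exemplars classification in the style of iCaRL.

    phi_x             : (d,) L2-normalized feature vector of a test image
    exemplar_features : dict mapping class id -> (n_c, d) array of
                        L2-normalized exemplar features for that class
    """
    best_class, best_dist = None, np.inf
    for c, feats in exemplar_features.items():
        mu = feats.mean(axis=0)
        mu /= np.linalg.norm(mu)           # normalize the class prototype
        dist = np.linalg.norm(phi_x - mu)  # distance to the class mean
        if dist < best_dist:
            best_class, best_dist = c, dist
    return best_class
```

Because prototypes are recomputed from the stored exemplars, the decision rule automatically tracks the representation as it drifts across incremental training stages.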
In [26], a model called Gradient Episodic Memory (GEM) is proposed for continual learning. In GEM, a small amount of data per task is stored in an explicit memory while the number of tasks is large, which is the opposite of previous approaches to continual learning. The advantage of GEM is that it can constrain training using real data. However, it is easily overfitted to the subset of stored samples, because the replayed data are the only information the model has about the previous tasks. An improved version of GEM called A-GEM was proposed in [27]. A-GEM focuses on balancing accuracy against efficiency in sample complexity, computational cost, and memory cost. In addition to the features inherited from GEM, a small change to the loss function was proposed to speed up training while maintaining similar performance. Cross-validation is also performed on a set of disjoint tasks for evaluation. The experiments indicate that there is only a small gap in performance between lifelong learning methods, including A-GEM: forgetting in neural networks can be eliminated, but transfer learning performance does not improve by much.
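The gradient correction at the core of A-GEM admits a compact sketch. Working with flattened gradient vectors is an assumption made here for brevity; in practice the gradients of all parameters are concatenated before the projection.

```python
import torch

def agem_project(g, g_ref):
    """A-GEM gradient correction on flattened parameter gradients.

    g     : gradient of the loss on the current task's batch
    g_ref : reference gradient computed on a batch drawn from episodic memory

    If the proposed update g would increase the loss on the memory
    (negative inner product with g_ref), project g onto the half-space
    where the memory loss does not increase.
    """
    dot = torch.dot(g, g_ref)
    if dot < 0:
        g = g - (dot / torch.dot(g_ref, g_ref)) * g_ref
    return g
```

This single-constraint projection is the "small change" relative to GEM, which instead solves a quadratic program against one constraint per previous task.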
Unlike the abovementioned solutions, the methods of [37,38] utilize greedy selection strategies for the replay memory. However, these are inefficient because of the additional memory required for greedy saving. To overcome this, several optimized sample-selection strategies have been proposed. In [39], the authors proposed a memory retrieval technique that recovers the memory samples whose loss would increase the most under the parameter updates estimated from the current task. The solution in [40] scores samples in the memory based on their capacity to preserve the latent decision boundaries of previously observed classes. The strategy in [41] explicitly encourages samples from the same class to cluster closely in an embedding space, while pushing away samples from other classes, during replay-based training.
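For reference, a common baseline against which such optimized selection strategies are compared is a fixed-size buffer filled by reservoir sampling, which maintains a uniform sample over the stream. A minimal sketch follows; the uniform policy is the baseline's assumption, not that of any of the cited strategies.

```python
import random

class ReservoirBuffer:
    """Fixed-size replay memory filled by reservoir sampling, a common
    baseline for the sample-selection strategies discussed above."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = []
        self.seen = 0  # number of stream items observed so far

    def add(self, example):
        self.seen += 1
        if len(self.data) < self.capacity:
            self.data.append(example)
        else:
            # Each stored item is replaced with probability capacity / seen,
            # keeping the buffer a uniform sample of the whole stream.
            j = random.randrange(self.seen)
            if j < self.capacity:
                self.data[j] = example
```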
Instead of storing old samples in a replay memory as the abovementioned methods do, the work in [28] generates an unlimited number of instances of old classes in the deep feature space to overcome classifier bias in class-incremental learning. In addition, the work in [42] uses orthogonal and low-dimensional vector subspaces for learning individual tasks, which helps to prevent the learning of target tasks from catastrophically interfering with the source tasks. In comparison with rehearsal baseline methods using the same amount of memory, both accuracy and anti-forgetting ability are improved on deeper networks.
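The feature-space generation idea of [28] can be illustrated by modeling each old class with the mean and covariance of its deep features and sampling pseudo-instances from the resulting Gaussian. The Gaussian form here is an illustrative simplification, not the exact generator used in that work.

```python
import numpy as np

def make_feature_generator(features):
    """Fit a Gaussian to the deep features of one old class so that
    pseudo-instances can be sampled later without storing raw images.

    features: (n, d) array of deep features extracted for the class
    """
    mu = features.mean(axis=0)
    cov = np.cov(features, rowvar=False)

    def sample(k):
        # Draw k pseudo-feature vectors for rehearsing the classifier.
        return np.random.multivariate_normal(mu, cov, size=k)

    return sample
```

Since only class statistics are kept, arbitrarily many pseudo-instances can be drawn at training time, which is what makes the "infinite instances" framing possible.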
2.2. Loss Regularization-Based Solutions
Knowledge from the source task can be distilled into the target task via loss regularization during training. In [7], the authors proposed a solution called LwF (Learning without Forgetting), in which training is carried out on new task data while the original capabilities are preserved. This overcomes the disadvantage that, as the number of new tasks grows, the amount of data that must be stored and retrained increases. The LwF method changes the network architecture and combines the shared parameters of the trained model with the new model in the optimization function, while the old model is frozen to retain the old knowledge and the extended network parameters are optimized. LwF outperforms the common practice of fine-tuning if one only cares about the performance of the new task. However, it depends heavily on the relevance of the new tasks to the old ones, and the training time increases linearly with the number of learned tasks.
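The core of LwF is a distillation term that keeps the updated network's outputs on the old-task head close to the frozen old network's outputs. A minimal PyTorch sketch follows; writing the distillation as a KL term and the chosen temperature are common-practice assumptions rather than the paper's exact formulation.

```python
import torch.nn.functional as F

def lwf_loss(new_task_logits, old_head_logits, frozen_old_logits,
             labels, lam=1.0, T=2.0):
    """Learning-without-Forgetting style objective for one batch.

    new_task_logits  : current model's outputs on the new-task head
    old_head_logits  : current model's outputs on the old-task head
    frozen_old_logits: frozen previous model's outputs (soft targets)
    """
    # Standard supervised loss on the new task.
    ce = F.cross_entropy(new_task_logits, labels)
    # Distillation: match softened old-head outputs to the old model.
    soft_targets = F.softmax(frozen_old_logits / T, dim=1)
    log_probs = F.log_softmax(old_head_logits / T, dim=1)
    distill = F.kl_div(log_probs, soft_targets, reduction="batchmean")
    return ce + lam * distill
```

Note that no old-task data appear in the loss: the soft targets are produced by running the frozen model on the *new* task's inputs, which is exactly what removes the storage burden discussed above.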
Solutions proposed after LwF showed further improvements. In [30], the authors proposed two algorithms: Dark Experience Replay (DER) and DER++. DER looks for parameters that fit the current task well while approximating the behavior observed on the previous tasks. In order to retain knowledge of the previous tasks, the objective function implemented in DER is minimized along an optimization trajectory. DER++ adds to the DER objective an additional term on buffered data points, promoting higher conditional likelihood of their ground-truth labels with minimal memory overhead.
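Schematically, a DER++ training step combines a current-task loss, a logit-matching term on buffered examples (the DER part), and a label-replay term (the "++" part). The helper structure below is illustrative; only the three-term shape of the objective follows the paper.

```python
import torch.nn.functional as F

def der_plus_plus_loss(model, batch, buf1, buf2, alpha=0.5, beta=0.5):
    """One DER++ objective evaluation.

    batch      : (x, y) from the current task
    buf1, buf2 : (x', z') and (x'', y'') sampled from the replay buffer,
                 where z' are the logits stored when x' was first seen
    """
    x, y = batch
    loss = F.cross_entropy(model(x), y)            # current-task term
    bx, bz = buf1
    loss += alpha * F.mse_loss(model(bx), bz)      # logit matching (DER)
    cx, cy = buf2
    loss += beta * F.cross_entropy(model(cx), cy)  # label replay (DER++)
    return loss
```

Storing logits rather than only labels is what lets the method approximate the old model's behavior, i.e., its "dark knowledge", instead of just its decisions.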
The work in [43] learns representations using a contrastive learning objective and preserves the learned representations via a self-supervised distillation step. Experiments show that the proposed scheme performs better than baselines in various learning configurations. More efficient solutions for continual learning were recently published [25,44].
The work in [25] uses all of the attributes of the training data when learning each task. This helps lessen the feature bias caused by cross entropy, which learns only the features that are discriminative for the task at hand but may not be discriminative for another task; the previously learned network parameters must then be adjusted in order to learn a new task well. The promising technique in [44], named AOP (Adaptive Orthogonal Projection), updates the network parameters only in the direction orthogonal to the subspace spanned by the inputs of all prior tasks. As a result, there is no need to memorize the input data of each task, and knowledge about the prior tasks is updated incrementally. According to empirical analysis, these methods significantly outperform most recent continual learning baselines.
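The projection step behind such orthogonal-projection methods can be illustrated as follows: given an orthonormal basis for the subspace occupied by the previous tasks' inputs, each gradient is replaced by its component orthogonal to that subspace. How the basis is built and adapted is the method-specific part and is omitted; the sketch assumes such a basis is available.

```python
import torch

def orthogonal_update(grad, basis):
    """Project a gradient to be orthogonal to the prior-task input subspace.

    grad : (d,) flattened gradient for one layer's weights
    basis: (d, k) matrix with orthonormal columns spanning the subspace
           occupied by the inputs of all previous tasks, or None
    """
    if basis is None:  # first task: no constraint to respect yet
        return grad
    # Remove the component of grad lying inside the protected subspace,
    # so the update cannot change the layer's outputs on prior inputs.
    return grad - basis @ (basis.T @ grad)
```

Because updates never move the weights along directions that prior inputs excite, the old tasks' input-output mappings are left intact without replaying or storing any old data.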