Contrastive Learning for Hyperspectral Image Classification

Despite the rapid development of deep learning in hyperspectral image classification (HSIC), most models require a large amount of labeled data, which are both time-consuming and laborious to obtain. However, contrastive learning can extract spatial–spectral features from samples without labels, which helps to solve the above problem. 

  • data augmentation
  • random occlusion
  • hyperspectral image

1. Introduction

Hyperspectral images (HSI) have gained widespread use due to their ability to provide extensive spectral and spatial information [1]. With hundreds of bands, HSIs can distinguish surface materials based on their unique spectral characteristics with exceptional spectral resolution. This feature makes them highly valuable for various applications, including vegetation surveys, atmospheric research, military detection, environmental monitoring [2], and land cover classification [3]. HSIC is a key research area within the hyperspectral field and involves the classification of individual pixels based on the rich spectral information they contain. As hardware technology continues to improve, the spatial resolution of hyperspectral sensors also increases, allowing for the incorporation of spatial information from surrounding pixels in classification efforts. Currently, the combination of spectral and spatial features is the primary approach in the HSIC field [4].
The abundance of bands in HSI presents a significant challenge in classification. Processing such large amounts of data directly without reduction would require a network of immense scale and huge computational memory. Furthermore, high spectral resolution creates spectral redundancy, which can be addressed through dimensionality reduction techniques that preserve critical information while reducing data size. Feature extraction is a widely used method for reducing data dimensions in HSI by extracting or sorting effective features for subsequent use. Common methods for feature extraction include principal component analysis (PCA) [5], independent component analysis (ICA) [6], linear discriminant analysis (LDA) [7], multidimensional scaling (MDS) [8], etc. These algorithms are still widely used as preprocessing methods due to their simplicity and effectiveness. With the increasing maturity of deep learning algorithms, more sophisticated algorithms are being developed to extract features from HSIs. Presently, the prevalent classification method involves using supervised or unsupervised feature extraction algorithms to extract spectral or spatial–spectral features, followed by classifier training using the extracted features.
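As an illustration of this preprocessing step, the following is a minimal sketch of PCA-based spectral reduction using NumPy and scikit-learn. The cube shape, array names, and component count are illustrative assumptions, not values taken from the cited works:

```python
# Minimal sketch: PCA as a spectral dimensionality-reduction step for an HSI cube.
# Assumes a cube of shape (height, width, bands); names are illustrative.
import numpy as np
from sklearn.decomposition import PCA

def reduce_bands(cube: np.ndarray, n_components: int = 30) -> np.ndarray:
    """Project each pixel's spectrum onto the top principal components."""
    h, w, b = cube.shape
    pixels = cube.reshape(-1, b)          # (H*W, B): one spectrum per row
    pca = PCA(n_components=n_components)
    reduced = pca.fit_transform(pixels)   # (H*W, n_components)
    return reduced.reshape(h, w, n_components)

# Example: a synthetic 145x145 cube with 200 bands (Indian Pines-like size)
cube = np.random.rand(145, 145, 200).astype(np.float32)
print(reduce_bands(cube).shape)           # (145, 145, 30)
```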
Early feature extraction algorithms were based on supervised deep learning. In supervised learning, convolutional neural networks (CNNs) play a crucial role, evolving from one-dimensional CNNs [9] that only extract spectral features to two-dimensional and three-dimensional CNNs [10] that extract both spatial and spectral information. Roy et al. proposed the HybridSN network, which combines 2D and 3D convolutions to further enhance classification accuracy [11]. Zhong et al. introduced the classic residual network into the hyperspectral domain and designed the SSRN network [12]. Hang et al. combined attention mechanisms with CNNs [13]. Apart from CNNs, deep recurrent neural networks (DRNN) [14], deep feed-forward networks (DFFN), and other networks have also achieved promising results in HSIC.
However, supervised learning often heavily relies on labeled data, necessitating a sufficient number of labeled samples to achieve optimal training results. In the case of HSIs, both data collection and labeling involve significant human and time costs. Consequently, in recent years, the focus of feature extraction algorithms has gradually shifted towards unsupervised deep learning. The fundamental difference between unsupervised learning and supervised learning lies in the fact that the training data of unsupervised learning are unlabeled, and samples are classified based on their similarities, reducing the distance between data of the same class and increasing the distance between data of different classes. Without the constraints of labels, unsupervised learning can unleash the potential of models, enabling them to autonomously discover and explore data, learn more latent features, and ultimately result in models with better robustness and generalization. Unsupervised learning can be divided into generative learning and discriminative learning. Generative models learn to model the underlying probability distribution of input data. They are trained on large amounts of data and leverage this information to synthesize new samples that resemble the original data. The most basic generative deep learning algorithms include autoencoders (AE) [15] and generative adversarial networks (GAN) [16]. Variants of AEs, such as the adversarial autoencoder (AAE) [17], variational autoencoder (VAE) [18], and masked autoencoder (MAE) [19], have been widely used for feature extraction in hyperspectral image analysis. GANs optimized with algorithms such as deep convolutional GAN (DCGAN) [20], information maximizing GAN (InfoGAN) [21], and multitask GAN [22] have also achieved remarkable results in HSIC.
Discriminative learning models the conditional probability and learns the optimal boundary between different classes. Contrastive learning is a typical discriminative learning algorithm in deep learning, which aims to acquire representations by contrasting positive and negative pairs in the latent space. Positive pairs are patches that are spatially close and spectrally similar, whereas negative pairs are patches that are either spectrally dissimilar or spatially distant. By minimizing the distance between positive pairs and maximizing the distance between negative pairs, the model learns to encode both spatial and spectral information in the latent space.
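A common instantiation of this objective is an InfoNCE-style loss, in which each embedded patch is pulled toward its positive counterpart and pushed away from every other sample in the batch. The following PyTorch sketch is illustrative: z_a[i] and z_b[i] are assumed to be the embeddings of a positive pair, and the temperature value is a typical default rather than one drawn from a specific paper:

```python
# Minimal sketch of an InfoNCE-style contrastive loss over embedded patches.
import torch
import torch.nn.functional as F

def info_nce(z_a: torch.Tensor, z_b: torch.Tensor, temperature: float = 0.1):
    z_a = F.normalize(z_a, dim=1)          # unit-length embeddings
    z_b = F.normalize(z_b, dim=1)
    logits = z_a @ z_b.t() / temperature   # scaled cosine similarities
    # Diagonal entries correspond to the positive pairs; the rest are negatives.
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return F.cross_entropy(logits, targets)

# Example: batch of 8 patch embeddings in a 128-D latent space
loss = info_nce(torch.randn(8, 128), torch.randn(8, 128))
```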
Contrastive learning has made rapid progress in recent years, and many variants such as MoCo [23], SimCLR [24], BYOL [25], SwAV [26], and SimSiam [27] have been proposed and gradually applied in the field of hyperspectral data analysis [28]. These methods differ in their choice of contrastive loss, encoder architecture, and training strategy, but they share the common goal of learning representations that capture the underlying structure of hyperspectral data. A key concern in contrastive learning is preventing model collapse, in which all inputs converge to the same constant representation.
For contrastive learning, researchers can integrate additional optimization techniques to encourage the model to learn more representative features while ensuring that it does not collapse. In handling the spatial–spectral features of HSIs, spatial and spectral information are often combined into the same sample. Although this approach is simple and compensates for the lack of spectral information, directly inputting the entire sample cube into the model introduces a significant amount of redundant information that interferes with feature extraction. Some studies have separated spatial and spectral information into different samples and used cross-domain contrastive learning to extract them separately [28][29][30]. This approach can remove much of the redundant information but may also discard valuable secondary information. Coordinating the extraction of spatial and spectral information, preserving useful information as much as possible, reducing the interference of useless information, and increasing the model's attention to key information are essential for improving the efficiency of contrastive learning.

2. Contrastive Learning

Contrastive learning is a type of self-supervised learning that involves constructing pairs of similar and dissimilar examples to learn a representation learning model. The goal is to learn a model that projects similar samples close together in a projection space, whereas dissimilar samples are projected far apart. Essential factors in contrastive learning include how to construct similar and dissimilar samples, how to design a representation learning model that adheres to the above principles, and how to prevent model collapse, which occurs when all data converge to a single constant solution after feature representation. Currently, there are many contrastive learning methods available, and they can be roughly categorized into those based on negative samples [24], contrastive clustering [26], asymmetric network structures [25][27], and redundancy elimination loss functions, depending on the approach used to prevent model collapse.
Bootstrap-Your-Own-Latent (BYOL) is a typical asymmetric structure-based approach, in which the online network has an extra predictor compared to the target network, and the two branches are connected by an asymmetric similarity loss. BYOL extracts sample features by training the online network to predict the output of the target network, thereby learning the latent connections between positive sample pairs. Unlike other contrastive learning methods, BYOL requires only positive sample pairs, not negative ones, which makes the model more robust and generalizable.
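A minimal PyTorch sketch of this asymmetric objective follows; the module and tensor names are illustrative. The loss is the negative cosine similarity between the online prediction and the detached target projection (BYOL symmetrizes it over the two augmented views), and the target weights track the online weights via an exponential moving average (EMA):

```python
import torch
import torch.nn.functional as F

def byol_loss(p_online: torch.Tensor, z_target: torch.Tensor) -> torch.Tensor:
    """Negative cosine similarity; the target branch receives no gradient."""
    p = F.normalize(p_online, dim=1)
    z = F.normalize(z_target.detach(), dim=1)  # stop-gradient on the target
    return 2 - 2 * (p * z).sum(dim=1).mean()   # equals 2 - 2*cos(p, z)

@torch.no_grad()
def ema_update(online: torch.nn.Module, target: torch.nn.Module, tau: float = 0.996):
    """Target weights are an exponential moving average of the online weights."""
    for po, pt in zip(online.parameters(), target.parameters()):
        pt.data.mul_(tau).add_(po.data, alpha=1 - tau)
```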
Chen and He proposed SimSiam [27] and analyzed the factors necessary to keep the network from collapsing. Its structure is similar to that of BYOL, retaining the predictor on the online branch but omitting the exponential moving average (EMA) target update. SimSiam demonstrates that EMA is not necessary to prevent collapse, although removing it sacrifices some accuracy [31].
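Under the same illustrative conventions, the SimSiam objective replaces the EMA target with a stop-gradient: a single shared encoder produces projections z1 and z2 for the two views, a predictor maps them to p1 and p2, and each prediction is compared with the detached projection of the other view:

```python
import torch
import torch.nn.functional as F

def simsiam_loss(p1, z2, p2, z1):
    """Symmetrized negative cosine similarity with stop-gradient targets."""
    def d(p, z):
        return -F.cosine_similarity(p, z.detach(), dim=1).mean()
    return 0.5 * d(p1, z2) + 0.5 * d(p2, z1)
```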

3. Data Augmentation

3.1. Normal Data Augmentation

Data augmentation is a widely used technique in contrastive learning that increases the diversity of the training data by creating new samples that are variations of the original data. This technique helps to reduce overfitting and improve the model’s ability to generalize. In contrastive learning, data augmentation is typically applied to both the anchor and positive samples to create new pairs of samples for training.
Normal augmentation methods, such as random cropping, flipping, rotation, color jittering, and Gaussian noise injection, are often used in contrastive learning [32]. The augmented samples are paired with their corresponding original samples to form positive pairs for training, thereby increasing the number of positive pairs and enhancing the diversity of the training data. This can improve the performance of the contrastive learning model.
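A minimal sketch of such a pipeline using torchvision is shown below; it assumes a 3-channel float image tensor, and all parameter values are illustrative defaults. Two independent passes through the pipeline yield the two views of a positive pair:

```python
import torch
from torchvision import transforms

# Illustrative augmentation pipeline: crop, flip, color jitter, Gaussian noise.
augment = transforms.Compose([
    transforms.RandomResizedCrop(27),                             # random cropping
    transforms.RandomHorizontalFlip(),                            # flipping
    transforms.RandomApply([transforms.ColorJitter(0.4, 0.4, 0.4)], p=0.8),
    transforms.Lambda(lambda x: x + 0.01 * torch.randn_like(x)),  # Gaussian noise
])

def two_views(patch: torch.Tensor):
    """Two independently augmented views of one patch form a positive pair."""
    return augment(patch), augment(patch)
```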
Although they can be used to process hyperspectral data, these normal augmentation methods were originally designed for RGB or grayscale images and do not take into account the unique characteristics of HSIs.
In [33], the authors pointed out that, in existing contrastive learning, it is not ideal to map the original data into a single space via assorted data augmentation methods and then perform various downstream tasks; blindly applying data augmentation may harm the learned features. Researchers believe that the choice of data augmentation method should be based on the specific downstream task and the form of the data. To fully leverage the potential of contrastive learning in HSIC, it is necessary to develop new data augmentation techniques that are tailored to hyperspectral data.

3.2. Random Occlusion

The random occlusion (RO) technique is a data augmentation method that randomly masks or occludes a portion of the input data during training. This simulates missing or incomplete information and forces the model to learn robust features that can still classify the data accurately even when certain regions are absent. Random occlusion can be applied to various types of input data, such as images, text, and audio. In image classification tasks, it can be implemented by masking a random portion of the image with a black rectangle or by replacing it with random noise; the size and location of the occluded region can also be randomized to increase the diversity of the training data. By using random occlusion as a data augmentation technique, the model learns to be more robust to incomplete or missing data, which can improve its performance in real-world situations where the input data may be noisy or incomplete.
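A minimal sketch of random occlusion on a spatial patch of an HSI cube follows; the patch layout (H, W, B), size limits, and fill options are illustrative assumptions:

```python
import numpy as np

def random_occlusion(patch: np.ndarray, max_frac: float = 0.5,
                     fill: str = "zero") -> np.ndarray:
    """Occlude a random rectangle of a (H, W, B) patch across all bands."""
    h, w, _ = patch.shape
    oh = np.random.randint(1, max(2, int(h * max_frac)))  # occlusion height
    ow = np.random.randint(1, max(2, int(w * max_frac)))  # occlusion width
    top = np.random.randint(0, h - oh + 1)                # random position
    left = np.random.randint(0, w - ow + 1)
    out = patch.copy()
    block = out[top:top + oh, left:left + ow, :]
    out[top:top + oh, left:left + ow, :] = (
        0.0 if fill == "zero" else np.random.randn(*block.shape)
    )
    return out
```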

References

  1. Datta, D.; Mallick, P.K.; Bhoi, A.K.; Ijaz, M.F.; Shafi, J.; Choi, J. Hyperspectral image classification: Potentials, challenges, and future directions. Comput. Intell. Neurosci. 2022, 2022, 3854635.
  2. Stuart, M.B.; McGonigle, A.J.; Willmott, J.R. Hyperspectral imaging in environmental monitoring: A review of recent developments and technological advances in compact field deployable systems. Sensors 2019, 19, 3071.
  3. Tong, X.; Xie, H.; Weng, Q. Urban land cover classification with airborne hyperspectral data: What features to use? IEEE J. Sel. Top. Appl. Earth Obs. Remote. Sens. 2013, 7, 3998–4009.
  4. Duan, P.; Ghamisi, P.; Kang, X.; Rasti, B.; Li, S.; Gloaguen, R. Fusion of Dual Spatial Information for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote. Sens. 2021, 59, 7726–7738.
  5. Wold, S.; Esbensen, K.; Geladi, P. Principal component analysis. Chemom. Intell. Lab. Syst. 1987, 2, 37–52.
  6. Villa, A.; Benediktsson, J.A.; Chanussot, J.; Jutten, C. Hyperspectral image classification with independent component discriminant analysis. IEEE Trans. Geosci. Remote. Sens. 2011, 49, 4865–4876.
  7. Du, Q. Modified Fisher’s linear discriminant analysis for hyperspectral imagery. IEEE Geosci. Remote. Sens. Lett. 2007, 4, 503–507.
  8. Kruskal, J.B.; Wish, M.; Uslaner, E.M. Multidimensional scaling. In Handbook of Perception and Cognition; Academic Press: Cambridge, MA, USA, 1978.
  9. Hu, W.; Huang, Y.; Wei, L.; Zhang, F.; Li, H. Deep convolutional neural networks for hyperspectral image classification. J. Sensors 2015, 2015, 258619.
  10. Chen, Y.; Jiang, H.; Li, C.; Jia, X.; Ghamisi, P. Deep feature extraction and classification of hyperspectral images based on convolutional neural networks. IEEE Trans. Geosci. Remote. Sens. 2016, 54, 6232–6251.
  11. Roy, S.K.; Krishna, G.; Dubey, S.R.; Chaudhuri, B.B. HybridSN: Exploring 3D-2D CNN Feature Hierarchy for Hyperspectral Image Classification. IEEE Geosci. Remote. Sens. Lett. 2020, 17, 277–281.
  12. Zhong, Z.; Li, J.; Luo, Z.; Chapman, M. Spectral–spatial residual network for hyperspectral image classification: A 3-D deep learning framework. IEEE Trans. Geosci. Remote. Sens. 2017, 56, 847–858.
  13. Hang, R.; Li, Z.; Liu, Q.; Ghamisi, P.; Bhattacharyya, S.S. Hyperspectral image classification with attention-aided CNNs. IEEE Trans. Geosci. Remote. Sens. 2020, 59, 2281–2293.
  14. Zhang, X.; Sun, Y.; Jiang, K.; Li, C.; Jiao, L.; Zhou, H. Spatial sequential recurrent neural network for hyperspectral image classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote. Sens. 2018, 11, 4141–4155.
  15. Ballard, D.H. Modular Learning in Neural Networks. In Proceedings of the 6th National Conference on Artificial Intelligence (AAAI-87), Seattle, WA, USA, 13–17 July 1987; pp. 279–284.
  16. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. In Proceedings of the 28th Conference on Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; pp. 2672–2680.
  17. Makhzani, A.; Shlens, J.; Jaitly, N.; Goodfellow, I.; Frey, B. Adversarial autoencoders. arXiv 2015, arXiv:1511.05644.
  18. Kingma, D.P.; Welling, M. Auto-encoding variational bayes. arXiv 2013, arXiv:1312.6114.
  19. He, K.; Chen, X.; Xie, S.; Li, Y.; Dollár, P.; Girshick, R. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 16000–16009.
  20. Radford, A.; Metz, L.; Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv 2015, arXiv:1511.06434.
  21. Chen, X.; Duan, Y.; Houthooft, R.; Schulman, J.; Sutskever, I.; Abbeel, P. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Proceedings of the 30th Conference on Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; Volume 29.
  22. Hang, R.; Zhou, F.; Liu, Q.; Ghamisi, P. Classification of hyperspectral images via multitask generative adversarial networks. IEEE Trans. Geosci. Remote. Sens. 2020, 59, 1424–1436.
  23. He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R. Momentum Contrast for Unsupervised Visual Representation Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Online, 16–18 June 2020.
  24. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A simple framework for contrastive learning of visual representations. In Proceedings of the 37th International Conference on Machine Learning (PMLR), Vienna, Austria, 12–18 July 2020; pp. 1597–1607.
  25. Grill, J.B.; Strub, F.; Altché, F.; Tallec, C.; Richemond, P.; Buchatskaya, E.; Doersch, C.; Avila Pires, B.; Guo, Z.; Gheshlaghi Azar, M.; et al. Bootstrap your own latent-a new approach to self-supervised learning. In Proceedings of the 34th Conference on Neural Information Processing Systems, Online, 6–12 December 2020; Volume 33, pp. 21271–21284.
  26. Caron, M.; Misra, I.; Mairal, J.; Goyal, P.; Bojanowski, P.; Joulin, A. Unsupervised learning of visual features by contrasting cluster assignments. In Proceedings of the 34th Conference on Neural Information Processing Systems, Online, 6–12 December 2020; Volume 33, pp. 9912–9924.
  27. Chen, X.; He, K. Exploring simple siamese representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Online, 19–25 June 2021; pp. 15750–15758.
  28. Hang, R.; Qian, X.; Liu, Q. Cross-Modality Contrastive Learning for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote. Sens. 2022, 60, 5532812.
  29. Guan, P.; Lam, E.Y. Cross-domain contrastive learning for hyperspectral image classification. IEEE Trans. Geosci. Remote. Sens. 2022, 60, 5528913.
  30. Shu, Z.; Liu, Z.; Zhou, J.; Tang, S.; Yu, Z.; Wu, X.J. Spatial–Spectral Split Attention Residual Network for Hyperspectral Image Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote. Sens. 2022, 16, 419–430.
  31. Halvagal, M.S.; Laborieux, A.; Zenke, F. Predictor networks and stop-grads provide implicit variance regularization in BYOL/SimSiam. arXiv 2022, arXiv:2212.04858.
  32. Ding, K.; Xu, Z.; Tong, H.; Liu, H. Data augmentation for deep graph learning: A survey. ACM SIGKDD Explor. Newsl. 2022, 24, 61–77.
  33. Xiao, T.; Wang, X.; Efros, A.A.; Darrell, T. What should not be contrastive in contrastive learning. arXiv 2020, arXiv:2008.05659.