Combining Self-Supervision with Knowledge Distillation: History

Current single-modal self-supervised audio classification mainly adopts strategies based on audio spectrogram reconstruction. Overall, this self-supervised approach is relatively limited and cannot fully mine the key semantic information in the time and frequency domains.

  • audio classification
  • contrastive learning
  • knowledge distillation

1. Introduction

In recent years, with the rapid development of mobile multimedia technology, the exponential growth of audiovisual data has increased the demand for audio classification capabilities [1]. Audio classification, which draws on techniques from machine learning and signal processing, plays a crucial role in applications such as speech recognition, sound event detection, emotion analysis, music classification, and speaker recognition. Its objective is to accurately assign audio signals to predefined classes, facilitating the identification and understanding of different sound sources for improved downstream applications.
Although supervised audio classification methods have proved effective in many scenarios, they rely heavily on extensive labeled data, which raises costs in practice. At the same time, numerous experiments have shown that directly applying label-based discrete learning to audio can introduce classification bias [2]: although the audio signal is continuous, the duration of the same sound event varies across situations, so uniform discretization easily leads to learning bias. In other words, supervised discrete learning fails to effectively extract high-level semantic features from continuous audio and to discard redundant details.
The reconstruction aspect can be divided into two major categories: spectrogram reconstruction and feature reconstruction. For the former, given the two-dimensional nature of audio spectrograms, self-supervised strategies can be constructed along the time dimension, the frequency dimension, or jointly in the time-frequency plane. For the latter, learning strategies based on knowledge distillation can be constructed within a single model or between a teacher and a student model. Conducting more comprehensive self-supervised learning along these two dimensions further improves the recognition ability of single-modal audio classification.
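As a rough illustration of these two reconstruction targets, the sketch below (a minimal PyTorch example, not drawn from the cited works; the mask widths, tensor shapes, and function names are illustrative assumptions) contrasts masked spectrogram reconstruction in the time-frequency plane with teacher-student feature regression in the spirit of knowledge distillation.

```python
# Minimal sketch of the two self-supervised targets described above.
# All shapes, mask widths, and names are illustrative assumptions.
import torch
import torch.nn.functional as F

def mask_spectrogram(spec: torch.Tensor, time_width: int = 20, freq_width: int = 8):
    """Zero out one random time band and one random frequency band of a
    (batch, freq_bins, time_frames) log-mel spectrogram; return the masked
    spectrogram and the boolean mask marking positions to reconstruct."""
    masked = spec.clone()
    mask = torch.zeros_like(spec, dtype=torch.bool)
    _, n_freq, n_time = spec.shape
    t0 = torch.randint(0, n_time - time_width, (1,)).item()
    f0 = torch.randint(0, n_freq - freq_width, (1,)).item()
    mask[:, :, t0:t0 + time_width] = True   # mask along the time dimension
    mask[:, f0:f0 + freq_width, :] = True   # mask along the frequency dimension
    masked[mask] = 0.0
    return masked, mask

def spectrogram_reconstruction_loss(pred, target, mask):
    """Spectrogram reconstruction: MSE computed only over the masked region."""
    return F.mse_loss(pred[mask], target[mask])

def feature_distillation_loss(student_feat, teacher_feat):
    """Feature reconstruction: the student regresses the (detached) teacher
    representation, as in teacher-student knowledge distillation."""
    return F.mse_loss(student_feat, teacher_feat.detach())
```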

2. Combining Self-Supervision with Knowledge Distillation

Over the past few years, supervised audio classification methods have demonstrated excellent performance on various publicly available datasets [1][3][4][5][6][7]. In the modeling process, supervised audio classification assigns a discrete label or category to a segment of continuous audio: the audio is projected through the model into feature vectors with rich audio semantics, which are then mapped to discrete labels, ultimately achieving classification. AST (Audio Spectrogram Transformer) [5] takes audio spectrograms as input, uses two-dimensional convolution to extract serialized features, and then cascades multiple Transformer blocks to obtain global features of the audio sequence, significantly improving recognition performance. PANNs [1] leverages the large-scale audio dataset AudioSet for training and explores how depth, feature dimensions, dropout ratios, and spectrogram generation methods affect audio classification, proposing high-quality models. Considering that CNNs (Convolutional Neural Networks) [8][9] focus on local context, PSLA (Pretraining, Sampling, Labeling, and Aggregation) [10] introduces a pooling attention mechanism for feature enhancement to capture global audio information and improve classification performance. To further enhance supervised audio classification, Khaled Koutini et al. [11] decompose the audio Transformer's position encoding into temporal and frequency components, supporting variable-length audio classification. Arsha Nagrani et al. [12] combine Transformer modeling with an intermediate feature fusion approach to promote the model's learning ability. Ke Chen et al. [13] introduce a hierarchical audio Transformer with a semantic module that combines with the input tokens and maps the final output to class feature maps, enhancing classification performance. ERANNs [14] reduces computational complexity by introducing model-scale hyperparameters; choosing optimal parameter values improves computational efficiency, yielding savings or performance gains.
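The AST-style pipeline described above can be sketched roughly as follows (a simplified PyTorch example; the layer sizes are illustrative assumptions, positional embeddings are omitted, and this is not AST's published configuration): a 2-D convolution turns the spectrogram into serialized patch tokens, which cascaded Transformer layers then process for classification.

```python
# Rough sketch of a spectrogram-to-Transformer classifier in the style described
# above. Sizes are illustrative assumptions; positional embeddings are omitted.
import torch
import torch.nn as nn

class SpectrogramTransformer(nn.Module):
    def __init__(self, n_classes: int, d_model: int = 192, n_layers: int = 4):
        super().__init__()
        # 16x16 patches extracted from the (1, freq, time) spectrogram by 2-D convolution
        self.patch_embed = nn.Conv2d(1, d_model, kernel_size=16, stride=16)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (batch, 1, freq_bins, time_frames)
        x = self.patch_embed(spec)            # (batch, d_model, F', T')
        x = x.flatten(2).transpose(1, 2)      # serialized patch tokens
        x = self.encoder(x)                   # global context via self-attention
        return self.head(x.mean(dim=1))       # mean-pool tokens, then classify

logits = SpectrogramTransformer(n_classes=50)(torch.randn(2, 1, 128, 1024))  # (2, 50)
```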
Additionally, to fully leverage the benefits of supervised pre-training and further enhance audio classification, many methods employ model weights from the image domain. AST [5] initializes its model from a ViT model pre-trained on ImageNet [15], effectively boosting performance. HTS-AT [13] utilizes pre-training weights of the Swin Transformer [16] obtained on image datasets, significantly improving audio classification results. PaSST [11] incorporates pre-training weights from DeiT [15]. To determine an appropriate number of hidden-layer nodes in a neural network, Xuetao Xie et al. [17] use L1 regularization and propose several effective methods for finding the optimal number of hidden nodes in a perceptron network.
From another perspective, high-level semantic abstraction of continuous audio can be achieved through signal autoencoding-decoding reconstruction, a powerful means of self-supervised learning. Compared with supervised learning, self-supervised learning does not require a large number of labeled samples, so suitable training data are easily obtainable. As in other domains, self-supervised audio training typically learns representations through contrastive learning or reconstruction.
Self-supervised techniques have found numerous applications in audio classification in recent years [18][19][20][21][22][23]. Concurrently, various studies [24][25][26][27][28][29] indicate that reconstruction-based self-supervised techniques are not only effective for speech but also exhibit robust learning capabilities when modeling video, images, and multimodal fusion.
Audio often encompasses a variety of environmental events, such as speech, ambient sounds, and musical beats, frequently accompanied by considerable ambient noise, which poses significant challenges for universal audio classification modeling. In response, approaches such as those of Dading Chong et al. [18] and Hu Xu et al. [19] apply masking operations to spectrograms in the pre-training stage and use self-supervised acoustic feature reconstruction as the pre-training target. To achieve good pre-training results, COLA [30] performs contrastive learning on the audio dataset, assigning high similarity to segments drawn from the same audio clip and low similarity to segments from different clips. Eduardo Fonseca et al. [31] enhance sound event learning through view-augmented learning tasks, demonstrating that unsupervised contrastive pre-training can alleviate the impact of data scarcity and improve generalization. For more effective contrastive learning, CLAR [32] proposes several efficient data augmentation and enhancement methods. Luyu Wang et al. [33] introduce a contrastive learning method over audio samples in different formats, maximizing consistency between the raw audio and its acoustic features. To enhance generalization and obtain a robust audio representation, Daisuke Niizumi et al. [34] train on different audio samples using a mean-squared-error loss and an exponential-moving-average optimization strategy. SSAST [24] proposes a patch-based self-supervised learning method for pre-training and achieves good performance. MAE-AST [35], based on a Transformer encoder-decoder structure, uses the decoder for masked reconstruction during pre-training and demonstrates excellent recognition performance across multiple audio classification tasks. Andrew N. Carr et al. [36] shuffle input audio features and implement end-to-end model pre-training with a differentiable sorting strategy, exploring self-supervised audio pre-training with masked discrete-label prediction targets. To effectively distinguish unsupervised features, AARC [37] integrates unsupervised feature selection and network structure determination into a unified framework, adding two Group Lasso losses to the objective function.
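A minimal sketch of the segment-level contrastive objective used by methods such as COLA might look as follows (PyTorch; the cosine similarity, temperature, and batch construction are illustrative assumptions rather than any paper's exact formulation): two segments cut from the same clip form a positive pair, while segments from other clips in the batch act as negatives.

```python
# Sketch of a segment-level contrastive loss. Similarity measure, temperature,
# and batching are illustrative assumptions.
import torch
import torch.nn.functional as F

def segment_contrastive_loss(anchor: torch.Tensor, positive: torch.Tensor,
                             temperature: float = 0.1) -> torch.Tensor:
    """anchor, positive: (batch, dim) embeddings of two segments per clip.
    Row i of the similarity matrix treats positive[i] as the match for anchor[i]
    and every other entry in the row as a negative."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    similarity = anchor @ positive.t() / temperature   # (batch, batch)
    targets = torch.arange(anchor.size(0))             # diagonal = same clip
    return F.cross_entropy(similarity, targets)

loss = segment_contrastive_loss(torch.randn(8, 128), torch.randn(8, 128))
```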
Although self-supervised spectrogram pre-training strategies have shown good results in audio classification, some works argue that this form of self-supervised reconstruction is relatively one-dimensional: it can only restore low-level time-frequency features and has weaker capabilities for high-level audio semantic abstraction [38][39].

This entry is adapted from the peer-reviewed paper 10.3390/electronics13010052

References

  1. Kong, Q.; Cao, Y.; Iqbal, T.; Wang, Y.; Wang, W.; Plumbley, M.D. Panns: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE ACM Trans. Audio Speech Lang. Process. 2020, 28, 2880–2894.
  2. Hsu, W.N.; Bolte, B.; Tsai, Y.H.H.; Lakhotia, K.; Salakhutdinov, R.; Mohamed, A. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE ACM Trans. Audio Speech Lang. Process. 2021, 29, 3451–3460.
  3. Verma, P.; Berger, J. Audio Transformers: Transformer Architectures for Large Scale Audio Understanding. In Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, NY, USA, 17–20 October 2021; pp. 1–5.
  4. Arnault, A.; Hanssens, B.; Riche, N. Urban Sound Classification: Striving towards a fair comparison. arXiv 2020, arXiv:2010.11805.
  5. Gong, Y.; Chung, Y.A.; Glass, J. AST: Audio Spectrogram Transformer. In Proceedings of the IEEE Conference on Interspeech, Brno, Czechia, 30 August–3 September 2021; pp. 571–575.
  6. Liu, A.T.; Li, S.W.; Lee, H.Y. TERA: Self-supervised learning of transformer encoder representation for speech. IEEE ACM Trans. Audio Speech Lang. Process. 2021, 29, 2351–2366.
  7. Chi, P.H.; Chung, P.H.; Wu, T.H.; Hsieh, C.C.; Chen, Y.H.; Li, S.W.; Lee, H.Y. Audio albert: A lite bert for self-supervised learning of audio representation. In Proceedings of the IEEE Spoken Language Technology Workshop (SLT), Shenzhen, China, 19–22 January 2021; pp. 344–350.
  8. Tan, M.; Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 6105–6114.
  9. Giraldo, J.S.P.; Jain, V.; Verhelst, M. Efficient Execution of Temporal Convolutional Networks for Embedded Keyword Spotting. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2021, 29, 2220–2228.
  10. Yuan, G.; Yu, A.C.; James, G. PSLA: Improving Audio Tagging with Pretraining, Sampling, Labeling, and Aggregation. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 3292–3306.
  11. Schmid, F.; Koutini, K.; Widmer, G. Efficient Large-Scale Audio Tagging Via Transformer-to-CNN Knowledge Distillation. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5.
  12. Arsha, N.; Shan, Y.; Anurag, A.; Jansen, A.; Schmid, C.; Sun, C. Attention bottlenecks for multimodal fusion. J. Adv. Neural Inf. Process. Syst. 2021, 34, 14200–14213.
  13. Chen, K.; Du, X.; Zhu, B.; Ma, Z.; Berg-Kirkpatrick, T.; Dubnov, S. Hts-at: A hierarchical token-semantic audio transformer for sound classification and detection. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; pp. 646–650.
  14. Sergey, V.; Vladimir, B.; Viacheslav, V. Eranns: Efficient residual audio neural networks for audio pattern recognition. J. Pattern Recognit. Lett. 2022, 161, 38–44.
  15. Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image transformers & distillation through attention. In Proceedings of the 38th International Conference on Machine Learning, Virtual, 8–24 July 2021; pp. 10347–10357.
  16. Ze, L.; Yutong, L.; Yue, C.; Han, H.; Wei, Y.; Zheng, Z.; Stephen, L.; Baining, G. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022.
  17. Xie, X.; Zhang, H.; Wang, J.; Chang, Q.; Wang, J.; Pal, N.R. Learning optimized structure of neural networks by hidden node pruning with L1 regularization. IEEE Trans. Cybern. 2020, 50, 1333–1346.
  18. Dading, C.; Helin, W.; Peilin, Z.; Zeng, Q.C. Masked spectrogram prediction for self-supervised audio pre-training. arXiv 2022, arXiv:2204.12768.
  19. Huang, P.Y.; Xu, H.; Li, J.; Baevski, A.; Auli, M.; Galuba, W.; Metze, F.; Feichtenhofer, C. Masked autoencoders that listen. arXiv 2022, arXiv:2207.06405.
  20. Yu, Z.; Daniel, S.P.; Wei, H.; Qin, J.; Gulati, A.; Shor, J.; Jansen, A.; Xu, Y.Z.; Huang, Y.; Wang, S. Bigssl: Exploring the frontier of large-scale semi-supervised learning for automatic speech recognition. IEEE J. Sel. Top. Signal Process. 2022, 16, 1519–1532.
  21. Chen, S.; Wu, Y.; Wang, C.; Liu, S.; Tompkins, D.; Chen, Z.; Wei, F. BEATS: Audio Pre-Training with Acoustic Tokenizers. In Proceedings of the 40th International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; pp. 5178–5193.
  22. Baevski, A.; Hsu, W.N.; Xu, Q.; Babu, A.; Gu, J.; Auli, M. Data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language. In Proceedings of the 39th International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; pp. 1298–1312.
  23. Chen, S.; Wang, C.; Chen, Z.; Wu, Y.; Liu, S.; Chen, Z.; Li, J.; Kanda, N.; Yoshioka, T.; Xiao, X.; et al. Wavlm: Large-scale self-supervised pre-training for full stack speech processing. IEEE J. Sel. Top. Signal Process. 2022, 16, 1505–1518.
  24. Gong, Y.; Lai, C.I.; Chung, Y.A.; Glass, J. Ssast: Self-supervised audio spectrogram transformer. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 22 February–1 March 2022; pp. 10699–10709.
  25. Huang, P.Y.; Sharma, V.; Xu, H.; Ryali, C.; Fan, H.; Li, Y.; Li, S.W.; Ghosh, G.; Malik, J.; Feichtenhofer, C. MAViL: Masked Audio-Video Learners. arXiv 2022, arXiv:2212.08071.
  26. Wei, Y.; Hu, H.; Xie, Z.; Zhang, Z.; Cao, Y.; Bao, J.; Chen, D.; Guo, B. Contrastive learning rivals masked image modeling in fine-tuning via feature distillation. arXiv 2022, arXiv:2205.14141.
  27. Chen, H.; Xie, W.; Vedaldi, A.; Zisserman, A. VGGSound: A large-scale audio-visual dataset. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Barcelona, Spain, 4–8 May 2020; pp. 721–725.
  28. Wei, C.; Fan, H.; Xie, S.; Wu, C.Y.; Yuille, A.; Feichtenhofer, C. Masked feature prediction for self-supervised visual pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 14668–14678.
  29. Wu, Y.; Chen, K.; Zhang, T.; Hui, Y.; Taylor, B.K.; Dubnov, S. Large-Scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Rhodes Island, Greece, 4–10 June 2023; pp. 1–5.
  30. Aaqib, S.; David, G.; Neil, Z. Contrastive learning of general-purpose audio representations. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 3875–3879.
  31. Eduardo, F.; Diego, O.; Kevin, M.; Noel, E.O.C.; Serra, X. Unsupervised contrastive learning of sound event representations. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 371–375.
  32. Haider, A.T.; Yalda, M. Clar: Contrastive learning of auditory representations. In Proceedings of the International Conference on Artificial Intelligence and Statistics, Toronto, ON, Canada, 6–11 June 2021; pp. 2530–2538.
  33. Luyu, W.; Aaron, O. Multi-format contrastive learning of audio representations. arXiv 2021, arXiv:2103.06508.
  34. Daisuke, N.; Daiki, T.; Yasunori, O.; Harada, N.; Kashino, K. Byol for audio: Self-supervised learning for general-purpose audio representation. In Proceedings of the 2021 International Joint Conference on Neural Networks (IJCNN), Shenzhen, China, 18–22 July 2021; pp. 1–8.
  35. Alan, B.; Puyuan, P.; David, H. Mae-ast: Masked autoencoding audio spectrogram transformer. In Proceedings of the 23rd Interspeech Conference, Incheon, Republic of Korea, 18–22 September 2022; pp. 2438–2442.
  36. Andrew, N.C.; Quentin, B.; Mathieu, B.; Teboul, O.; Zeghidour, N. Self-supervised learning of audio representations from permutations with differentiable ranking. IEEE Signal Process. Lett. 2021, 28, 708–712.
  37. Gong, X.; Yu, L.; Wang, J.; Zhang, K.; Bai, X.; Pal, N.R. Unsupervised Feature Selection via Adaptive Autoencoder with Redundancy Control. Neural Netw. 2022, 150, 87–101.
  38. Aditya, R.; Mikhail, P.; Gabriel, G.; Gray, S.; Voss, C.; Radford, A.; Chen, M.; Sutskever, I. Zero-shot text-to-image generation. In Proceedings of the 38th International Conference on Machine Learning, Virtual, 8–24 July 2021; pp. 8821–8831.
  39. Bao, H.; Dong, L.; Piao, S.; Wei, F. Beit: Bert pre-training of image transformers. arXiv 2021, arXiv:2106.08254.