Combining Self-Supervision with Knowledge Distillation: History

Current single-modal audio self-supervised classification mainly adopts strategies based on audio spectrogram reconstruction. Overall, this self-supervised approach is relatively narrow and cannot fully mine the key semantic information in the time and frequency domains.

  • audio classification
  • contrastive learning
  • knowledge distillation

1. Introduction

In recent years, with the rapid development of mobile multimedia technology, the exponential growth of audiovisual data has increased the demand for audio classification capabilities [1]. Audio classification, which draws on techniques from machine learning and signal processing, plays a crucial role in applications such as speech recognition, sound event detection, emotion analysis, music classification, and speaker recognition. The objective of audio classification is to accurately categorize audio signals into predefined classes, facilitating the identification and understanding of different sound sources for improved downstream applications.
Although supervised audio classification methods have demonstrated effectiveness in many scenarios, they rely heavily on extensive labeled data, which raises costs in practice. Moreover, numerous experiments have shown that directly applying label-based discrete learning to audio information processing can result in classification bias [2]. The reason is that, although the audio signal is continuous, the duration of the same sound event varies across situations, so uniform discretization can easily lead to learning bias. In other words, supervised discrete learning fails to effectively extract high-level semantic features from continuous audio while discarding redundant details.
The reconstruction aspect can be divided into two major categories: spectrogram reconstruction and feature reconstruction. For the former, given the two-dimensional nature of audio spectrograms, self-supervised strategies can be constructed along the time dimension, the frequency dimension, or jointly in the time-frequency plane. For the latter, learning strategies based on knowledge distillation can be constructed within a single model or between teacher and student models. Conducting more comprehensive self-supervised learning along these two dimensions further improves the recognition ability of single-modal audio classification.
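As a rough illustration of the time and frequency dimensions mentioned above, the following sketch masks a random band along each axis of a spectrogram; this is the kind of corruption a reconstruction-based objective would then try to invert. The function name, mask widths, and array shapes are illustrative assumptions, not taken from any cited method.

```python
import numpy as np

def mask_spectrogram(spec, max_freq_width=8, max_time_width=20, rng=None):
    """Zero out one random frequency band and one random time band.

    spec: 2-D array of shape (freq_bins, time_frames), e.g. a log-mel spectrogram.
    Returns the corrupted copy and a boolean mask marking the hidden entries,
    over which a reconstruction loss would typically be computed.
    """
    rng = np.random.default_rng() if rng is None else rng
    corrupted = spec.copy()
    hidden = np.zeros_like(spec, dtype=bool)

    # Frequency-dimension masking: hide a contiguous band of frequency bins.
    f_width = rng.integers(1, max_freq_width + 1)
    f_start = rng.integers(0, spec.shape[0] - f_width + 1)
    corrupted[f_start:f_start + f_width, :] = 0.0
    hidden[f_start:f_start + f_width, :] = True

    # Time-dimension masking: hide a contiguous span of frames.
    t_width = rng.integers(1, max_time_width + 1)
    t_start = rng.integers(0, spec.shape[1] - t_width + 1)
    corrupted[:, t_start:t_start + t_width] = 0.0
    hidden[:, t_start:t_start + t_width] = True

    return corrupted, hidden

# Example: a placeholder log-mel spectrogram; a model would be scored only on hidden entries.
spec = np.random.randn(128, 1000).astype(np.float32)
corrupted, hidden = mask_spectrogram(spec)
```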

2. Combining Self-Supervision with Knowledge Distillation

Over the past few years, supervised audio classification methods have demonstrated excellent performance on various publicly available datasets [1,3,4,5,6,7]. In the specific modeling process, supervised learning for audio classification assigns a discrete label or category to a segment of continuous audio. The audio is projected through the model into feature vectors with rich audio semantics, which are subsequently mapped to discrete labels, ultimately achieving classification. AST (Audio Spectrogram Transformer) [5] takes audio spectrograms as input, employs two-dimensional convolution to extract serialized features, and then applies cascaded transformer blocks to obtain global features of the audio sequence, improving recognition performance significantly. PANNs [1] leverages the large-scale AudioSet dataset for training, explores how factors such as network depth, feature dimensions, dropout ratios, and spectrogram generation methods affect audio classification, and proposes high-quality models. Considering that CNNs (Convolutional Neural Networks) [8,9] focus on local context, PSLA (Pretraining, Sampling, Labeling, and Aggregation) [10] introduces a pooled attention mechanism for feature enhancement to capture global audio information and improve classification performance. To further strengthen supervised audio classification, Khaled Koutini et al. [11] decompose the audio Transformer's positional encoding into temporal and frequency components, supporting variable-length audio classification. Arsha Nagrani et al. [12] combine Transformer modeling with an intermediate feature fusion approach to promote the model's learning ability. Ke Chen et al. [13] introduce a hierarchical audio Transformer with a semantic module that is combined with the input tokens and maps the final output to class feature maps, enhancing classification performance. ERANNs [14] reduces computational complexity by introducing model-scale hyperparameters; by selecting appropriate parameter values, computational efficiency is improved, yielding potential savings or performance gains.
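To make the AST-style pipeline described above concrete, the following PyTorch sketch turns a spectrogram into a sequence of patch embeddings with a 2-D convolution and aggregates them with stacked transformer encoder blocks. It is a simplified sketch under assumed settings (patch size, embedding width, depth, mean pooling instead of a class token, and no positional embeddings), not the published AST configuration.

```python
import torch
import torch.nn as nn

class TinySpectrogramTransformer(nn.Module):
    """Minimal AST-like classifier: conv patch embedding + transformer encoder."""

    def __init__(self, n_classes=50, embed_dim=192, depth=4, n_heads=4, patch_size=16):
        super().__init__()
        # Non-overlapping 2-D patches of the spectrogram -> token sequence.
        self.patch_embed = nn.Conv2d(1, embed_dim,
                                     kernel_size=patch_size, stride=patch_size)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=n_heads, dim_feedforward=4 * embed_dim,
            batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.head = nn.Linear(embed_dim, n_classes)

    def forward(self, spec):
        # spec: (batch, 1, freq_bins, time_frames), e.g. a log-mel spectrogram.
        tokens = self.patch_embed(spec)              # (B, D, F/ps, T/ps)
        tokens = tokens.flatten(2).transpose(1, 2)   # (B, num_patches, D)
        tokens = self.encoder(tokens)                # global self-attention over patches
        clip_embedding = tokens.mean(dim=1)          # mean-pool into a clip-level vector
        return self.head(clip_embedding)             # class logits

model = TinySpectrogramTransformer()
logits = model(torch.randn(2, 1, 128, 1024))         # -> shape (2, 50)
```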
Additionally, to fully leverage the benefits of supervised pre-training and further enhance audio classification, many methods adopt model weights from the image domain. AST [5] initializes its model from ViT [15] pre-trained on ImageNet, effectively boosting performance. HTS-AT [13] utilizes Swin Transformer [16] weights pre-trained on image datasets, significantly improving audio classification results. PaSST [11] incorporates pre-training weights from DeiT [15]. To determine an appropriate number of hidden-layer nodes in a neural network, Xuetao Xie et al. [17] propose several effective methods, based on L1 regularization, for selecting the optimal number of hidden nodes in a perceptron network.
From another perspective, high-level semantic abstraction of continuous audio can be achieved through signal autoencoding-decoding reconstruction, a powerful means of self-supervised learning. Compared with supervised learning, self-supervised learning does not require large numbers of labeled samples, so suitable training data are easy to obtain. As in other domains, self-supervised audio training typically learns representations through contrastive learning or reconstruction.
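As a sketch of the contrastive branch of this idea, the snippet below shows a generic InfoNCE-style loss over paired embeddings: two augmented views of the same clip are pulled together while other clips in the batch act as negatives. This is a generic illustration rather than the exact objective of any method cited here; the batch size, embedding dimension, and temperature are placeholders.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(view_a, view_b, temperature=0.1):
    """Generic InfoNCE loss for paired audio embeddings.

    view_a, view_b: (batch, dim) embeddings of two augmentations of the same
    clips; row i of view_a and row i of view_b form the positive pair, and all
    other rows in the batch serve as negatives.
    """
    a = F.normalize(view_a, dim=1)
    b = F.normalize(view_b, dim=1)
    logits = a @ b.t() / temperature          # (batch, batch) similarity matrix
    targets = torch.arange(a.size(0))         # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

# Example with random embeddings standing in for an audio encoder's output.
loss = info_nce_loss(torch.randn(32, 128), torch.randn(32, 128))
```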
Self-supervised techniques have found numerous applications in audio classification in recent years [18,19,20,21,22,23]. Concurrently, various studies [24,25,26,27,28,29] indicate that reconstruction-based self-supervised techniques are effective not only for speech but also exhibit robust learning capabilities when modeling video, images, and multimodal fusion.
Audio often encompasses a variety of events, such as speech, ambient sounds, and musical beats, frequently accompanied by considerable background noise, which poses significant challenges for universal audio classification modeling. In response, approaches such as Dading Chong et al. [18] and Hu Xu et al. [19] apply masking operations to spectrograms during pre-training and use self-supervised acoustic feature reconstruction as the pre-training target. To achieve good pre-training results, COLA [30] performs contrastive learning on the audio dataset during pre-training, assigning high similarity to segments from the same audio clip and low similarity to segments from different clips. Eduardo Fonseca et al. [31] enhance sound event learning through tasks that learn from different augmented views, demonstrating that unsupervised contrastive pre-training can alleviate data scarcity and improve generalization. For more effective contrastive learning, CLAR [32] proposes several efficient data augmentation and enhancement methods. Luyu Wang et al. [33] introduce a contrastive learning method that uses audio samples in different formats, maximizing agreement between the original audio and its acoustic features. To improve generalization and obtain robust audio representations, Daisuke Niizumi et al. [34] train on different audio samples using a mean-squared-error loss and an exponential-moving-average optimization strategy. SSAST [24] proposes a patch-based self-supervised learning method for pre-training and achieves good performance. MAE-AST [35], built on a Transformer encoder-decoder structure, reconstructs pre-training tasks in which the decoder performs masked reconstruction, demonstrating excellent recognition performance on multiple audio classification tasks. Andrew N. Carr et al. [36] shuffle input audio features sequentially and implement end-to-end model pre-training with a differentiable sorting strategy, exploring self-supervised audio pre-training with masked discrete-label prediction targets. To effectively distinguish unsupervised features, AARC [37] integrates unsupervised feature selection and network-structure determination into a unified framework, adding two Group Lasso losses to the objective function.
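The mean-squared-error plus exponential-moving-average recipe mentioned above can be illustrated with a short teacher-student sketch, which is one common way of realizing feature-level knowledge distillation without labels: the student regresses the teacher's features, and the teacher's weights slowly track the student as a moving average. The encoders, momentum value, and tensor shapes below are illustrative assumptions, not the configuration of any specific cited method.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder encoders; any audio feature extractor could stand in here.
student = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 64))
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)                      # the teacher receives no gradients

optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)

def training_step(view_a, view_b, momentum=0.99):
    """One self-distillation step: MSE between student and teacher features."""
    student_out = student(view_a)
    with torch.no_grad():
        teacher_out = teacher(view_b)            # target features from the teacher
    loss = F.mse_loss(student_out, teacher_out)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Exponential moving average: teacher weights drift slowly toward the student.
    with torch.no_grad():
        for t, s in zip(teacher.parameters(), student.parameters()):
            t.mul_(momentum).add_(s, alpha=1 - momentum)
    return loss.item()

# Example step with random features standing in for two views of the same clip.
training_step(torch.randn(16, 128), torch.randn(16, 128))
```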
Although self-supervised spectrogram pre-training strategies have shown good results in audio classification, some works argue that this form of self-supervised reconstruction is relatively limited: it can only restore low-level time-frequency features and has weaker capabilities for high-level audio semantic abstraction [38,39].

This entry is adapted from the peer-reviewed paper 10.3390/electronics13010052
