With the growing popularity of piano education in recent years, many people have taken up the instrument. However, the high cost of traditional instruction and its one-on-one teaching model make learning the piano an expensive endeavor. Most existing approaches evaluate piano players' skills from the audio modality alone; they overlook the information contained in the video, resulting in a one-sided and simplistic assessment of a player's skill.
Figure 1 shows the framework of the researchers' proposal. It consists of three main components: data pre-processing, feature extraction and fusion, and performance evaluation.
Figure 1. Framework of audio-visual fusion model for piano skills evaluation.
Visual branch: Specifically, the researchers adopt ResNet-3D [5] to extract the spatio-temporal features of the performance clips from the video sequence. Compared with a conventional 3D CNN, ResNet-3D captures the spatio-temporal dynamics of the video modality with higher computational efficiency. In ResNet-3D, multiple 3D convolutional layers are stacked to model motion features in the temporal dimension, while 3D pooling layers and fully connected layers perform dimensionality reduction and feature combination. In this way, rich visual features can be extracted from the video data, including cues such as shape and color, to capture finger motion patterns. In addition, a pre-trained model can be used to improve performance.
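To make the visual branch concrete, here is a minimal PyTorch sketch of a ResNet-3D feature extractor. The choice of the torchvision r3d_18 variant, the Kinetics-400 pre-trained weights, and the clip size are illustrative assumptions rather than the researchers' exact configuration.

```python
# Hypothetical sketch of the visual branch: a pretrained ResNet-3D (r3d_18)
# with its classification head removed, used as a clip-level feature extractor.
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18, R3D_18_Weights

class VisualBranch(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = r3d_18(weights=R3D_18_Weights.KINETICS400_V1)  # pre-trained
        # Keep everything up to (and including) the global average pool.
        self.backbone = nn.Sequential(*list(backbone.children())[:-1])

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (batch, 3, frames, height, width)
        feats = self.backbone(clips)   # (batch, 512, 1, 1, 1)
        return feats.flatten(1)        # (batch, 512)

# Example: a batch of two 16-frame 112x112 RGB clips.
video_feats = VisualBranch()(torch.randn(2, 3, 16, 112, 112))  # -> (2, 512)
```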
Aural branch: Information such as the pitch and rhythm of a piano performance is contained in the audio data, and both raw audio waveforms [6] [7] and spectrograms [8] [9] can be used to extract auditory features. Specifically, the researchers convert the raw audio into the corresponding Mel-spectrogram and feed it to the auditory network. ResNet-2D [10] outperforms a traditional 2D CNN in both computational efficiency and feature extraction, and it can likewise benefit from a pre-trained model. The researchers therefore use ResNet-2D to extract features from the Mel-spectrogram: by stacking 2D convolutional layers, it captures the patterns and variations of the audio data in the frequency and time dimensions.
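A corresponding sketch of the aural branch is shown below. The sampling rate, the Mel-spectrogram parameters, and the trick of repeating the single-channel spectrogram to three channels for an ImageNet-pretrained ResNet18-2D are assumptions made for illustration.

```python
# Hypothetical sketch of the aural branch: waveform -> Mel-spectrogram (dB) ->
# pretrained ResNet18-2D with its classification head removed.
import torch
import torch.nn as nn
import torchaudio
from torchvision.models import resnet18, ResNet18_Weights

class AuralBranch(nn.Module):
    def __init__(self, sample_rate: int = 16000, n_mels: int = 128):
        super().__init__()
        self.mel = torchaudio.transforms.MelSpectrogram(
            sample_rate=sample_rate, n_fft=1024, hop_length=256, n_mels=n_mels)
        self.to_db = torchaudio.transforms.AmplitudeToDB()
        backbone = resnet18(weights=ResNet18_Weights.IMAGENET1K_V1)  # pre-trained
        self.backbone = nn.Sequential(*list(backbone.children())[:-1])

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples), mono audio
        spec = self.to_db(self.mel(waveform))         # (batch, n_mels, time)
        spec = spec.unsqueeze(1).repeat(1, 3, 1, 1)   # replicate to 3 channels
        feats = self.backbone(spec)                   # (batch, 512, 1, 1)
        return feats.flatten(1)                       # (batch, 512)

# Example: two 5-second mono clips at 16 kHz.
audio_feats = AuralBranch()(torch.randn(2, 16000 * 5))  # -> (2, 512)
```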
Multimodal branch: Using the ResNet-3D and ResNet-2D networks, the researchers obtain visual and aural features. To better capture the semantic associations and complementary information between the video and audio modalities, they adopt a joint representation of the features extracted from the two streams, which yields a more comprehensive and accurate feature representation.
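A joint representation of this kind is commonly built by concatenating the two feature vectors; the short sketch below assumes the 512-dimensional outputs of the ResNet-18 backbones sketched above.

```python
# Hypothetical joint representation: concatenate clip-level visual and aural
# features into a single multimodal vector.
import torch

video_feats = torch.randn(2, 512)   # from the visual branch
audio_feats = torch.randn(2, 512)   # from the aural branch
joint_feats = torch.cat([video_feats, audio_feats], dim=1)  # (2, 1024)
```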
Aggregation option: Performing linear operations on the learned features often improves their interpretability and expressiveness, and can also reduce their dimensionality, which improves the model's efficiency and generalization. Consequently, the researchers adopt linear averaging as the aggregation scheme. Figure 3 shows the details of the linear averaging.
Figure 3. Feature average option.
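One way to read the linear-averaging scheme is as a mean over per-clip features. The sketch below assumes each performance is split into several clips whose features are averaged into a single vector per performance; the clip count and feature size are chosen purely for illustration.

```python
# Hypothetical linear-averaging aggregation: average the per-clip features of
# a performance into a single fixed-size feature vector.
import torch

clip_feats = torch.randn(2, 10, 512)        # (batch, clips, feature_dim)
performance_feats = clip_feats.mean(dim=1)  # (batch, 512), linear average over clips
```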
In the visual and aural branches, the features are passed through a linear layer that reduces their dimensionality to 128, as shown in Figure 4, and are then fed into the prediction layer. The multimodal branch is handled similarly, except that the researchers do not back-propagate from the multimodal branch into the individual unimodal backbones, in order to avoid cross-modal contamination.
Figure 4. The structure of feature fusion.
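The sketch below illustrates such a fusion head: each branch's features pass through a 128-dimensional linear layer and a prediction layer, and the multimodal branch operates on detached copies so that its loss does not back-propagate into the unimodal branches. The feature dimension and the number of skill levels are assumptions, not the researchers' exact settings.

```python
# Hypothetical fusion head: per-branch 128-d projection + prediction layers,
# with detached inputs to the multimodal branch so its gradients stop there.
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    def __init__(self, feat_dim: int = 512, num_classes: int = 10):
        super().__init__()
        self.video_proj = nn.Linear(feat_dim, 128)
        self.audio_proj = nn.Linear(feat_dim, 128)
        self.video_pred = nn.Linear(128, num_classes)
        self.audio_pred = nn.Linear(128, num_classes)
        self.fusion_pred = nn.Linear(256, num_classes)

    def forward(self, video_feats, audio_feats):
        v = self.video_proj(video_feats)   # (batch, 128)
        a = self.audio_proj(audio_feats)   # (batch, 128)
        # Detach so the multimodal loss does not update the unimodal branches.
        joint = torch.cat([v.detach(), a.detach()], dim=1)  # (batch, 256)
        return self.video_pred(v), self.audio_pred(a), self.fusion_pred(joint)
```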
Table 1 and Figure 5 present the results of the unimodal and audio-visual fusion models on the PISA dataset. The audio-visual fusion model achieved better results than the unimodal model. This indicates that the multimodal method can effectively compensate for the inability of the audio model alone to utilize visual information, leading to more accurate and comprehensive evaluation results.
Figure 5. Performance (% accuracy) of unimodal evaluation.
| Model | V : A | Accuracy (%) |
| --- | --- | --- |
| C3D + ResNet18-2D | 8 : 1 | 68.70 |
| ResNet18-3D + ResNet34-2D | 1 : 1 | 66.39 |
| ResNet34-3D + ResNet18-2D | 1 : 1 | 66.81 |
| ResNet34-3D + ResNet34-2D | 1 : 1 | 65.97 |
| ResNet50-3D + ResNet50-2D | 1 : 1 | 59.45 |
| ResNet18-3D + ResNet18-2D (Ours) | 1 : 1 | 70.80 |

Table 1. Performance (% accuracy) of multimodal evaluation. V : A is the ratio of the number of video features to the number of audio features.
The results in Table 2 show that the accuracy improvement of the audio-visual fusion model was smaller when the ratio of video features to audio features was far from 1 : 1, and relatively larger when the two modalities contributed a similar number of features. The likely reason is that when one modality has significantly more features than the other, the model places more emphasis on the feature-rich modality and overlooks the one with fewer features.
| Model | V : A | Accuracy (%) |
| --- | --- | --- |
| C3D + ResNet18-2D | 8 : 1 | 68.70 |
| ResNet18-3D + ResNet50-2D | 1 : 4 | 65.55 |
| ResNet34-3D + ResNet50-2D | 1 : 4 | 67.02 |
| ResNet50-3D + ResNet18-2D | 4 : 1 | 64.50 |
| ResNet50-3D + ResNet34-2D | 4 : 1 | 61.34 |
| ResNet18-3D + ResNet18-2D (Ours) | 1 : 1 | 70.80 |

Table 2. Performance (% accuracy). V : A is the ratio of the number of video features to the number of audio features.
The fusion of visual and auditory features enables the discovery of correlations and complementarities between the audio and visual information, resulting in a more comprehensive and accurate feature representation. Using ResNet as the backbone network, the proposed model leverages ResNet-3D to extract visual features from finger motions and ResNet-2D to extract auditory features from the audio. The visual and auditory features are then combined by concatenation to form multimodal features, which are finally fed to the linear layer to predict the piano player's skill level.