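Recent research shows that algorithmic music has attracted global attention not only for its entertainment value but also for its considerable potential in industry. As a result, the number of academic publications on algorithmic music generation has grown. In music generation, the balance between mathematical logic and aesthetic value is important.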
1. Introduction
Over the past few decades, the field of computer music has addressed challenges surrounding the analysis of musical concepts [1,2]. Indeed, only by first understanding this type of information can we provide more advanced analytical and compositional tools, as well as methods to advance music theory [2]. Current literature on music computing and intelligent creativity [1,3,4] focuses specifically on algorithmic music. There has been a notable rise in work inspired by machine learning, which attempts to explain the compositional textures and formation methods of music at a mathematical level [5,6]. Machine learning methods are now well accepted as an additional means of generating music content. Unlike earlier grammar-based [1], rule-based [7], and metaheuristic strategy-based [8] music generation systems, machine learning-based generation methods can learn musical paradigms from an arbitrary corpus. Thus, the same system can be used for various musical genres.
Driven by the demand for widely available music content, larger music datasets have emerged for genres such as classical [9], rock [10], and pop music [11]. However, publicly available corpora of traditional folk music appear to be scarce, and this niche has received little attention. Research on music composition from large-scale music datasets has historically focused on deep learning architectures because of their ability to automatically learn musical styles from a corpus and generate new content [5].
Although music has special characteristics that distinguish it from text, it is still classified as sequential data because of its temporal structure. Hence, recurrent neural networks (RNNs) and their variants are adopted by most currently available music-generating neural network models [12,13,14,15,16,17]. Sequence models for music generation typically represent and predict a series of events, using the conditions formed by previous events to generate the current event. MelodyRNN [18] and SampleRNN [19] are representatives of this approach, with the shortcoming that the generated music lacks segmental integrity and a recurrent musical structure. Neural network research has studied this repetitive musical structure, which is referred to as translation invariance [20]. Convolutional neural networks (CNNs) have been influential in the music domain because of their success in the image domain, and researchers have sought to transfer their local learning capability to the translation invariance of musical context. Some representative work [21,22,23] uses deep CNNs for music generation, although such attempts have been few. However, the resulting music seems more imitative than creative, stemming from over-learning of the local structure of music. Therefore, to explore whether the advantages of both structures can be combined, researchers have used compound architectures in music generation [12,24,25,26].
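The idea that a sequence model generates the current event from the conditions formed by previous events can be illustrated with a minimal sketch. The example below is not MelodyRNN or SampleRNN: as a labeled assumption, it swaps the RNN for a simple bigram count table so that it runs with the Python standard library alone. The function names (`fit_bigram`, `generate`) and the toy pitch corpus are hypothetical.

```python
import random
from collections import Counter, defaultdict


def fit_bigram(sequences):
    """Count how often each event follows each previous event."""
    table = defaultdict(Counter)
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            table[prev][cur] += 1
    return table


def generate(table, start, length, seed=0):
    """Sample events one at a time, each conditioned on the previous event."""
    rng = random.Random(seed)
    out = [start]
    while len(out) < length:
        options = table.get(out[-1])
        if not options:  # no observed continuation: stop early
            break
        events, weights = zip(*options.items())
        out.append(rng.choices(events, weights=weights, k=1)[0])
    return out


# Toy corpus of MIDI pitch numbers (hypothetical data, for illustration only).
corpus = [
    [60, 62, 64, 67, 69, 67, 64, 62, 60],
    [60, 64, 67, 69, 67, 64, 60],
]
table = fit_bigram(corpus)
melody = generate(table, start=60, length=8)
print(melody)
```

Because each step looks only one event back, this stand-in has even less access to long-range structure than an RNN, which makes the shortcoming noted above (no segmental integrity or recurring structure) easy to hear in its output.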
A compound architecture combines at least two architectures of the same type or of different types [5], and compound architectures can be divided into two main categories. Some are homogeneous compound architectures that combine several instances of the same architecture, such as the stacked autoencoder. Most are heterogeneous compound architectures that combine different types of architectures, such as an RNN Encoder-Decoder, which combines the RNN and the autoencoder. From an architectural point of view, compositing can be carried out using different methodologies.
- Composition: combination of two architectures of the same type or of different types. For instance, the bidirectional LSTM [15] combines two RNNs to analyze musical semantic context in the forward and backward temporal directions, and RNN-RBM architectures combine RNN architectures and RBM architectures [14].
- Refinement: refinement and specialization of a model by additional constraints. Examples are the sparse autoencoder architecture, a specialized solution to the note sparse coding problem built on top of the autoencoder architecture [27], and the variational autoencoder (VAE) [28].
- Nesting: nesting one model into another structure to form a new model. Examples include stacked autoencoder architectures [29] and RNN encoder-decoder architectures, where two RNN models are nested in the encoder and decoder parts of an autoencoder, so they can also be denoted as autoencoder (RNN, RNN) [16].
- Instantiation: an architectural pattern is instantiated onto a given architecture. For example, the Anticipation-RNN architecture instantiates a conditional reflection architectural pattern onto an RNN, with the output of another RNN as the conditional reflection input, which can be denoted as conditional reflection (RNN, RNN) [17]. The C-RBM architecture is a convolutional architectural pattern instantiated onto an RBM architecture, which can be denoted as convolutional (RBM) [30].
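Two of the compositing methods listed above, composition and nesting, can be sketched in a few lines. This is a toy illustration under stated assumptions, not any cited model: the recurrent cell is a single tanh unit with scalar weights, and the names `rnn_pass`, `bidirectional`, and `encoder_decoder`, as well as all weight values, are hypothetical.

```python
import math


def rnn_pass(xs, w_in, w_rec):
    """Run a one-unit tanh recurrent cell over xs; return the state at each step."""
    h, states = 0.0, []
    for x in xs:
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states


def bidirectional(xs, fwd, bwd):
    """Composition: pair a forward pass with a backward pass at every time step."""
    f = rnn_pass(xs, *fwd)
    b = rnn_pass(xs[::-1], *bwd)[::-1]  # run on reversed input, then re-align
    return list(zip(f, b))


def encoder_decoder(xs, enc, dec, steps):
    """Nesting: one RNN encodes xs into a summary; a second RNN unrolls from it."""
    summary = rnn_pass(xs, *enc)[-1]
    w_in, w_rec = dec
    h, out = summary, []
    for _ in range(steps):
        h = math.tanh(w_in * summary + w_rec * h)
        out.append(h)
    return out


xs = [0.1, 0.5, -0.3]  # toy input sequence (hypothetical values)
states = bidirectional(xs, fwd=(0.8, 0.5), bwd=(0.8, 0.5))
decoded = encoder_decoder(xs, enc=(0.8, 0.5), dec=(0.6, 0.4), steps=4)
print(states)
print(decoded)
```

In the bidirectional case, each time step sees context from both directions; in the encoder-decoder case, the whole input is compressed into a single summary state before generation begins.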
2. Deep Learning-Based Music Generation
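- RNN-based music generation. The work in [12] is an RNN architecture with a hierarchy of recurrent layers that generates not only melodies but also drums and chords. The model in [13] demonstrates well the ability of RNNs to generate multiple sequences simultaneously. However, it requires prior knowledge of the scale and some contours of the melody in order to generate. The results show that text-based long short-term memory (LSTM) performs better in generating chords and drums. MelodyRNN [18] is probably one of the best-known examples of neural networks generating music in the symbolic domain. It includes three RNN-based variants of the model, two of which target the learning of musical structure: Lookback RNN and Attention RNN. Sony CSL [31] proposed DeepBach, which can compose polyphonic four-part chorales specifically in the style of J.S. Bach. It is also an RNN-based model and allows user-defined constraints to be enforced, such as rhythm, notes, parts, chords, and cadences. However, this direction remains challenging for the following reasons. Externally, the overall musical structure appears to have no hierarchical characteristics, and the parts have no unified rhythmic pattern. Musical features are treated as extremely simplified in terms of musical grammar, ignoring key musical features such as note timing, tempo, scale, and interval. Regarding musical connotation, musical style is uncontrollable, aesthetic measurement is ineffective, and there is an evident gap between how the generated music sounds and music composed by musicians.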
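- CNN-based music generation. Some CNN architectures have been identified as alternatives to RNN architectures [21,22]. The work in [21] was proposed as a representative CNN-based generative model, which supports speech recognition, speech synthesis, and music generation tasks. The WaveNet architecture presents a number of causal convolutional layers, somewhat analogous to recurrent layers. However, it has two limitations: its inefficient computation limits real-time use, and it was designed primarily for acoustic data. The MidiNet [22] architecture for symbolic data was inspired by WaveNet. It includes a conditioning mechanism that incorporates historical information (melody and chords) from previous measures. The authors discuss two ways of controlling creativity and constraints. One is to insert the conditioning data only into the intermediate convolutional layers of the generator architecture. The other is to decrease the values of the two control parameters of the feature-matching regularization, thereby loosening how closely the distribution of generated data must match that of the real data.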
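- Compound architecture-based music generation. Bretan et al. [32] encoded musical input by developing a deep autoencoder and reconstructed the input by selecting from a library. They subsequently built a deep structured semantic model (DSSM) combined with an LSTM to perform unit prediction for monophonic melodies. However, because of the limitations of unit prediction, the quality of the generated content is sometimes poor. Bickerman et al. [24] proposed a music encoding scheme for learning jazz using deep belief networks. The model can generate flexible chords in different keys. It suggests that if the jazz corpus is large enough to produce chords, there is reason to believe that more complex jazz material could be handled. Although some interesting jazz melody fragments have been created, the phrases generated by the model do not sufficiently represent all the characteristics of the jazz corpus. Lyu et al. [11] combined the capability of LSTM in long-term data training with the advantages of the restricted Boltzmann machine (RBM) in modeling high-dimensional data. The results show that the model generalizes well in generating chordal music, but high-quality music fragments are rare. Chu et al. [12] proposed a hierarchical neural network model for generating pop music based on note elements. The lower layers handle melody generation, while the upper layers produce chords and drums. Two practical applications of the model relate to neural dancing and neural storytelling at the cognitive level. However, a drawback of this model also lies in its note-based generation scheme, which does not incorporate music theory and thus limits its musical creativity and stylistic integrity. Lattner et al. [25] designed a C-RBM architecture to learn the local structure of music; it uses convolution only along the time dimension to model temporal invariance, rather than pitch invariance, which would break the notion of pitch. The core idea is to simplify, at the grammatical level, the musical structure used prior to generation, such as musical mode and rhythmic pattern. The drawback is that the musical structure is plagiarized. Huang et al. [26] proposed a Transformer-based music generation model. The core of the algorithm is reducing the intermediate memory requirement to be linear in the sequence length. As a result, compositions lasting several minutes can be generated in small time steps, and the model was applied to JSB Chorales. Although experimental comparisons were carried out on two classical public music datasets, including Maestro, the qualitative evaluation was relatively rough.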
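The causal convolutional layers mentioned for WaveNet can be illustrated with a short sketch. This is a plain-Python toy under stated assumptions, not the WaveNet implementation: the function name `causal_conv1d` is hypothetical, inputs before time zero are treated as zero, and the dilation parameter reflects the dilated form of causal convolution.

```python
def causal_conv1d(x, kernel, dilation=1):
    """Compute y[t] = sum_k kernel[k] * x[t - k * dilation].

    Only current and past inputs contribute; samples before t = 0 count as zero.
    """
    y = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t - k * dilation
            if idx >= 0:
                acc += w * x[idx]
        y.append(acc)
    return y


# Stack three layers with dilations 1, 2, 4 and feed in a unit impulse.
impulse = [1.0] + [0.0] * 9
response = impulse
for d in (1, 2, 4):
    response = causal_conv1d(response, [1.0, 1.0], dilation=d)
print(response)  # the impulse influences the next 8 outputs, then nothing
```

The output at each step never depends on future inputs, which is what makes such layers usable for step-by-step generation in a way loosely comparable to recurrent layers. Stacking layers with growing dilation widens the span of past context that each output can see.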
3. Traditional Chinese Music Computing
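To the best of our knowledge, few MIDI-based Chinese folk music datasets are available. Luo et al. [33] proposed an autoencoder-based algorithm for generating Chinese folk songs of specific genres. However, it produces only relatively simple fragments, and the music was not analyzed qualitatively from the perspective of musical genre. Li et al. [34] proposed a combined method for classifying Chinese folk songs based on conditional random fields (CRF) and RBM. Notably, this approach provides an in-depth qualitative analysis of the classification results from a music theory perspective. Zheng et al. [35] reconstructed the velocity update formula and proposed a Chinese folk music composition model based on a spatial particle swarm algorithm. Huang [36] collected data on two musical elements of Chinese melody and analyzed the application value of Chinese melodic imagery in composing Chinese folk music. Zhang et al. [37,38] performed music data textualization and cluster analysis of traditional Chinese pentatonic groups. In summary, our motivation is to generate Chinese pentatonic music with hierarchical structure and local pentatonic music with multiple musical features and a unified rhythm, as shown in Figure 1 [39].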
Figure 1. The five main tones and four partial tones in the traditional Chinese pentatonic scale.
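Since the figure itself is not reproduced here, the following sketch shows one way the tones named in the caption could be represented in code. The tone names and semitone offsets come from general Chinese music theory rather than from the source, so they should be read as an assumption; the identifiers `MAIN_TONES`, `PARTIAL_TONES`, `pentatonic_scale`, and `classify` are hypothetical.

```python
# Semitone offsets above the gong (tonic) tone. These values are an assumption
# drawn from general music theory, not taken from the source figure.
MAIN_TONES = {"gong": 0, "shang": 2, "jue": 4, "zhi": 7, "yu": 9}
PARTIAL_TONES = {"qingjue": 5, "bianzhi": 6, "run": 10, "biangong": 11}


def pentatonic_scale(gong_midi):
    """Return the five main tones as MIDI pitches above the given gong pitch."""
    return [gong_midi + offset for offset in sorted(MAIN_TONES.values())]


def classify(pitch, gong_midi):
    """Label a MIDI pitch as a main tone, a partial tone, or neither."""
    offset = (pitch - gong_midi) % 12
    for name, value in MAIN_TONES.items():
        if value == offset:
            return ("main", name)
    for name, value in PARTIAL_TONES.items():
        if value == offset:
            return ("partial", name)
    return ("other", None)


print(pentatonic_scale(60))  # gong on middle C
print(classify(65, 60))
```

A representation like this could be used to check how much of a generated melody stays within the five main tones, though the source does not describe such a check.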
This entry is adapted from the peer-reviewed paper 10.3390/app12189309