This post is about “Deep Symphony”, the project we are building in CSCI 599. The ultimate goal is to compose music with long-term structure through learning. Unlike tasks with quantifiable results (e.g., classification, detection, or translation), this is a somewhat open-ended project. We therefore have to define and tackle the problem ourselves, which is very exciting.
I think two problems are the essential ones:
- The loss function is hard to design manually. Whether a piece of music is pleasant is hard to quantify mathematically. Although music has some basic patterns, I believe that composing by hand-crafted rules lacks flexibility. Teaching a machine to exploit the patterns of music, and what makes a masterpiece, in an unsupervised way is more natural. Using an MSE or cross-entropy loss to teach a machine to reproduce existing songs is a brute-force method. Moreover, Ian Goodfellow argued in his talk that this method may end up with a model that blurs together the outputs for every input, which is an acceptable solution w.r.t. the loss function but not the task. After discussing with our TA, we think a GAN might be a promising way to learn the loss function.
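As a sketch of the idea (my own illustration, not our actual design): in a GAN, the discriminator acts as a learned loss, and the generator is trained to maximize the discriminator's "real" score instead of minimizing a hand-crafted distance.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def d_loss(real_logits, fake_logits):
    # Discriminator: push scores for real music toward 1, fake toward 0.
    return -np.mean(np.log(sigmoid(real_logits)) +
                    np.log(1.0 - sigmoid(fake_logits)))

def g_loss(fake_logits):
    # Generator: the *learned* loss function -- try to fool the discriminator.
    return -np.mean(np.log(sigmoid(fake_logits)))
```

The generator never sees a pixel-level (or note-level) target, so it is not pushed toward the "averaged" output that MSE/cross-entropy encourages.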
- Long-term structure is difficult to define and generate. Capturing and generating the structure of songs is very challenging because (a) structure includes various components, such as melody, velocity, density, and tone, and one has to balance repetition against diversity while maintaining a consistent theme; and (b) long-term structure may span an extremely long time, which models such as RNNs can hardly capture.
Currently, several companies and universities are working on music generation, such as Google Magenta, Sony Flow Machines, Jukedeck, folk-rnn, and so on. We are trying our best to come up with a competitive model. Personally, I also want to build a more interpretable and interactive model that can run in real time. I am sorry that I have to keep some ideas and designs private to our team until the end of the semester. Here are keywords you can think about: performance vs. composing, hierarchical, Turing test, incremental learning, reinforcement learning, lyrics, interactive, embedding, theme, usage.
The following summary is written in a takeaway style: I will only highlight the points that might contribute to our project. These papers/posts will be mentioned:
- Performance RNN: Generating Music with Expressive Timing and Dynamics
- Chord2Vec: Learning Musical Chord Embeddings
## Performance RNN: Generating Music with Expressive Timing and Dynamics

This post doesn’t introduce a novel model, but a better dataset, a more flexible representation, and some engineering tricks make the model impressive. Here are some takeaway points:
- Dataset: They use a dataset from the Yamaha e-Piano Competition, in which the music is performed by humans. Other sheet-music datasets record exactly when and what to play, but the competition dataset captures more dynamics in velocity and timing.
- Representation: Their previous models generated music on a fixed metrical grid, which is rigid. In this model, they instead use 10 ms time steps and allow the model to skip forward to the next note event.
In my experiments, I have so far used the first representation. There are several critical problems: (a) multiple notes are usually played at the same time, and a multi-hot vector with BCE cannot handle this well; (b) the model has to generate the same note for multiple time steps to sustain it. What I worry about with the second representation is that the model might play a note and forget to stop it.
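To make the second representation concrete, here is my own simplified sketch (not Magenta's actual implementation) of converting notes into a Performance-RNN-style event sequence with NOTE_ON / NOTE_OFF and TIME_SHIFT events in 10 ms steps:

```python
# Convert notes (pitch, start, end in seconds) into an event list.
# My own simplified sketch, not Magenta's actual code.

STEP = 0.01  # 10 ms resolution

def to_events(notes):
    # Build (time, kind, pitch) boundaries; kind 0 = off, 1 = on,
    # so NOTE_OFF sorts before NOTE_ON at equal times.
    bounds = sorted([(s, 1, p) for p, s, e in notes] +
                    [(e, 0, p) for p, s, e in notes])
    events, now = [], 0.0
    for t, kind, pitch in bounds:
        shift = int(round((t - now) / STEP))
        if shift > 0:
            events.append(("TIME_SHIFT", shift))  # skip forward in time
            now += shift * STEP
        events.append(("NOTE_ON" if kind else "NOTE_OFF", pitch))
    return events

# Two overlapping notes: simultaneous onsets become consecutive NOTE_ONs,
# so no multi-hot vector is needed.
print(to_events([(60, 0.0, 0.5), (64, 0.0, 0.3)]))
```

Note how sustain is explicit: a note rings until its NOTE_OFF arrives, which is exactly why a model that "forgets" a NOTE_OFF leaves the note hanging.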
- Preprocessing: Stretch the performance by up to 5% faster or slower. Raise or lower the pitch by up to a major third.
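These two augmentations are easy to sketch on (pitch, start, duration) note tuples; the helper below is hypothetical and assumes symmetric stretch/transpose ranges:

```python
import random

def augment(notes, max_stretch=0.05, max_transpose=4):
    # Time-stretch uniformly by up to +/-5% and transpose by up to
    # a major third (4 semitones), mirroring the paper's preprocessing.
    stretch = 1.0 + random.uniform(-max_stretch, max_stretch)
    shift = random.randint(-max_transpose, max_transpose)
    return [(pitch + shift, start * stretch, dur * stretch)
            for pitch, start, dur in notes]
```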
- Temperature: This is a variable that controls the randomness during generation. The official definition in the code is: “A float specifying how much to divide the logits by before computing the softmax. Greater than 1.0 makes tracks more random, less than 1.0 makes tracks less random.”
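The quoted definition translates directly into code; a minimal numpy sketch:

```python
import numpy as np

def sample_with_temperature(logits, temperature=1.0, rng=None):
    # Divide logits by the temperature, then softmax and sample.
    # T > 1 flattens the distribution (more random); T < 1 sharpens it.
    rng = rng or np.random.default_rng()
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())  # subtract max for stability
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)
```

At a very low temperature this degenerates to greedy argmax sampling; at a very high one, to a uniform choice over the vocabulary.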
However, the generated songs sound like someone showing off his piano skills by traversing the keyboard. There is barely any structure in the music.
## Chord2Vec: Learning Musical Chord Embeddings
The main goal of this paper is to embed chords into vectors. A chord can contain multiple notes. Like the skip-gram in Word2Vec, they try to learn the encoding by predicting the content of a given chord’s neighbors. More formally,
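The formula itself is missing here; my reconstruction of the skip-gram-style objective (notation mine, not necessarily the paper's): given a chord $c$ and a neighboring chord $c'$ in its context, the models maximize the log-likelihood

```latex
\max_{\theta} \; \sum_{(c,\, c')} \log p_{\theta}(c' \mid c)
```

where each chord $c' = (n_1, \dots, n_k)$ is a set of notes.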
Notice that a chord is a vector that might contain multiple notes, i.e., it is polyphonic. The three proposed models differ in the probability dependencies between the notes of the target chord.
(Bilinear Model) The first assumes that the notes are conditionally independent given the input chord:
(AutoRegressive Model) The second assumes that each note depends on the previous notes in the same chord:
(Sequence-to-sequence Model) The third uses a seq2seq model to relate all the notes:
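The three dependency assumptions can be written side by side (my reconstruction from the descriptions above, with $c' = (n_1, \dots, n_k)$ the target chord and $c$ the input chord):

```latex
\text{Bilinear:}\quad        p(c' \mid c) = \prod_{i=1}^{k} p(n_i \mid c)
\text{Autoregressive:}\quad  p(c' \mid c) = \prod_{i=1}^{k} p(n_i \mid n_1, \dots, n_{i-1},\, c)
\text{Seq2seq:}\quad         \text{the same autoregressive factorization, realized by an encoder--decoder RNN}
```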
The output of the first two models is an indicator vector activated by a sigmoid, where each position of the vector represents a corresponding note. In the last model, a special separator token that splits different chords is added to the encoding.
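As a toy illustration (my own, not the paper's code) of the indicator-vector output used by the first two models, with a sigmoid over per-note logits:

```python
import numpy as np

def chord_indicator(logits, threshold=0.5):
    # Sigmoid over per-note logits; positions above the threshold are
    # the predicted notes of the (polyphonic) chord.
    probs = 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=float)))
    return (probs > threshold).astype(int)

# 12-dimensional pitch-class example: a C major triad (C, E, G) predicted.
logits = np.full(12, -3.0)
logits[[0, 4, 7]] = 3.0
print(chord_indicator(logits))
```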
The results show that the seq2seq model outperforms the other models by a large margin: