Start: milestone
Introduction:
DeepSymphony is a final project for the course CSCI599 (https://csci599-dl.github.io). Our goal is to generate music with long-term structure. These logs only record my part of the work, and hence they might not reflect the final model. We will release our final code. You can access my code at https://github.com/piscesdream/deepsymphony
Plain text log for baseline
finish the first demo
x sustain too long
add a limit on sustain time
> more diverse, longer patterns
x dives into silence (no 1s at all)
randomize the input when silent
> no silence
===> 0000.mid
> try stateful RNN, but it is not aligned with one song when training
> not good, all silence
> change mse to binary_crossentropy
much less confidence
> use threshold==0.3
> not bad, has rhythm ====> 0001.mid
comes to silence and starts again
> use note/=max strategy, not good ===> 0002.mid
> add noise
still silence but more diverse 0003.mid
> use binomial random (1, 0.3) rather than uniform
more diverse 0004.mid
> use binomial random (1, 0.6)
it got emotional sometimes LOL 0005.mid
> use binomial random (1, 0.8)
more keys, but in an awful style 0006.mid
==> conclusion
overfit on one tune, which is bad (or good?)
Plain text log for baseline
retrain
  > got another rhythm, which stays unchanged throughout the song
      0007.mid
activation from hard_sigmoid to sigmoid
  more diverse and proactive
      0008.mid
  some magical realism
fix accumulation bug:
  didn't set to zero after note off
set noise after notes.append
seq.append after noise is added
  weird repetition:
      0009.mid
  max_sus=4, noise=(0.5, 2.0)
      0010.mid
the smaller the binomial_p, the sparser the notes
  max_sus=4, noise=(0.5, 2.0), random=binomial(1, 0.3)
      0011.mid
notes too dense, raise the threshold to 0.99
      0012.mid
      not good
make the dense layers more complex, shorten the max_sus
    quite well
          0013.mid
    set the threshold to 0.50 to get a cleaner version
          0014.mid
    remove the random reset to make the theme of the song consistent
          0015.mid
    random -> (1, 0.3)
          0016.mid
      can switch between rhythms
          0017.mid
  simple_rnn2
    overfits on one tune, 0018.mid
  simple structure
    32-LSTM -> 128 FC    0019.mid
Plain text log for baseline
  memorize
    hard to memorize the song
    stuck on a rhythm    0020.mid
    keeps repeating the most frequent rhythm
  change to
    LSTM-100, Dense-50, Dense-128
    remembers a longer rhythm, but still repeating 0021.mid
  add random decay in the generation
    random connection of pieces 0022.mid
commit
gan rnn1 (commit "gan rnn")
  cannot even learn the pattern that most places are black
  stops converging after 500
    d_loss converged to almost 0
    g_loss diverged to 16 and the output remained unchanged
gan rnn2 (commit "gan conv")
  can learn which notes are played more frequently than others,
    but doesn't have an inner pattern (maybe the dataset is too small?)
    display/9900.png vs display/real.png
commit
refactor the encoder
encode again
large dataset
updated simple_rnn on e-comp (len=2000)
  0023.mid
  0031.mid (loss=0.3579, seed=32) data augmentation
  0032.mid favorite one
updated simple_rnn on easy (len=100)
  first one with rhythm: 0024.mid (loss=1.1757)
  0025.mid (loss=0.7159)
  0026.mid (loss=0.4621)
  0027.mid (loss=0.2543)
  0028.mid (loss=0.2166) plagiarism?
  0029.mid (loss=0.2048, seed=32) plagiarism?
Deep Symphony Baseline
Check this post for the generated music: Deep Symphony Baseline.
Plain text log for baseline
trained all night
  e-comp-all, loss = 1.8979 (0033.mid)
    awesome
  e-comp-all, loss = 1.85 (0034.mid)
    longer one, borrow
Skipgram
Applying the skip-gram model on the event sequence: suppose $v_i$ and $v_j$ are the vectors for message $i$ and message $j$ respectively; the loss function is
$$ L = -\log \sigma\!\left(v_j^{\top} v_i\right) - \sum_{k \sim P_n} \log \sigma\!\left(-v_k^{\top} v_i\right), $$
where $j$ is a neighbor of message $i$ and the second term is the negative sampling. I uploaded the list of the most related pairs here.
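A minimal sketch of how such event embeddings can be trained, using gensim's Word2Vec as a stand-in for my own skip-gram implementation (the loader `load_event_sequences` and the exact hyperparameters are assumptions):

```python
# Sketch only: skip-gram embeddings over MIDI event tokens with gensim.
# Tokens such as "<C5 on>" or "<delay 10.0>" mirror the event vocabulary above.
from gensim.models import Word2Vec

# each song is a list of event tokens, e.g. ["<C5 on>", "<delay 10.0>", "<C5 off>", ...]
songs = load_event_sequences()  # hypothetical loader

model = Word2Vec(
    sentences=songs,
    vector_size=64,   # embedding dimension
    window=5,         # neighborhood size
    sg=1,             # skip-gram
    negative=5,       # negative sampling
    min_count=1,
)

# query the most related messages for a given event
print(model.wv.most_similar('<C5 on>', topn=5))
```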
For most messages of type “note on”, the most related vector is the “note off” message of the same note:
Most similar notes with <C5 on>
    <C5 off>, <C6 off>, <C6 on>, <delay 0.0>, <C4 off>
Most similar notes with <C#5 on>
    <C#5 off>, <C#6 on>, <C#6 off>, <E5 on>, <E5 off>
Most similar notes with <D5 on>
    <D5 off>, <D6 off>, <D6 on>, <F5 off>, <F5 on>
Most similar notes with <D#5 on>
    <D#5 off>, <D#6 off>, <D#6 on>, <D#4 off>, <D#4 on>
Most similar notes with <E5 on>
    <E5 off>, <E6 on>, <E6 off>, <delay 0.0>, <G5 off>
But in the low-register area, they are related to delays because we usually use a sustained low key as the background:
Most similar notes with <C2 on>
    <delay 450.0>, <delay 410.0>, <delay 430.0>, <delay 440.0>, <delay 420.0>
Most similar notes with <C#2 on>
    <delay 450.0>, <delay 410.0>, <delay 440.0>, <delay 430.0>, <delay 470.0>
Most similar notes with <D2 on>
    <delay 410.0>, <delay 450.0>, <delay 430.0>, <delay 440.0>, <F#8 on>
Most similar notes with <D#2 on>
    <delay 450.0>, <delay 410.0>, <delay 430.0>, <delay 440.0>, <delay 420.0>
Most similar notes with <E2 on>
    <delay 410.0>, <delay 450.0>, <delay 430.0>, <F8 on>, <D2 on>
Messages of type delay are related to each other since we have many consecutive delays (brought by meta messages). However, when the delay time is longer than 500 units (500×10 ms), they are most related to the message <D10 on>, which is weird.
Most similar notes with <delay 520.0>
    <D10 on>, <A#0 off>, <E0 on>, <D0 on>, <E1 on>
Most similar notes with <delay 530.0>
    <E0 on>, <G0 off>, <D10 on>, <G#0 on>, <D0 on>
Most similar notes with <delay 540.0>
    <D10 on>, <A#0 off>, <G0 off>, <G#0 on>, <E0 on>
Most similar notes with <delay 550.0>
    <D10 on>, <E0 on>, <G#0 on>, <G0 off>, <D0 on>
Most similar notes with <delay 560.0>
    <D10 on>, <D0 on>, <G#0 on>, <E0 on>, <G#1 on>
...
The velocity-change messages are mostly related to delay messages.
Most similar notes with <velocity 1>
    <D10 on>, <A#0 off>, <G0 off>, <G#0 on>, <D0 on>
Most similar notes with <velocity 2>
    <delay 410.0>, <delay 430.0>, <delay 450.0>, <delay 440.0>, <delay 400.0>
Most similar notes with <velocity 4>
    <delay 430.0>, <delay 390.0>, <delay 400.0>, <delay 450.0>, <F8 on>
Most similar notes with <velocity 8>
    <delay 410.0>, <delay 380.0>, <delay 340.0>, <C#2 on>, <delay 400.0>
Most similar notes with <velocity 16>
    <delay 20.0>, <delay 10.0>, <delay 0.0>, <delay 30.0>, <delay 40.0>
Most similar notes with <velocity 32>
    <delay 10.0>, <velocity 32>, <delay 20.0>, <velocity 64>, <delay 30.0>
Most similar notes with <velocity 64>
    <velocity 64>, <delay 10.0>, <velocity 32>, <delay 20.0>, <delay 30.0>
Currently I am re-training the model with the aforementioned embedding while keeping the embedding weights fixed. Hopefully it can converge to a lower (<1.85) loss and produce better generations.
The stacked RNN model with embedding can also generate “reasonable” performances. However, there is still no long-term structure across the whole song.
Pianoteq
Pianoteq can synthesize much more realistic audio than timidity from the same MIDI file.
Visualization of Cells
Visualization of the activation during generation:
Some neurons keep firing all the time. Updated asynchronously on the 13th.
Meeting with TA (Hexiang)
- Q: Any more types of models? A: Use a classifier to distinguish the real songs from the generated ones, and use methods like guided backpropagation to update the input (similar to a GAN).
- Q: Evaluation? A: (1) BLEU score (2) inception score
- Q: Music has dialects (different keys, genres). A: Add additional information bits to control.
- Q: How to interpret the LSTM? A: (1) Fix/disable one cell intentionally and check the result (2) add a guided mask before the softmax (3) (mine) try L2 regularization.
- Q: Generate by interpolation? A: Not so common.
Vanilla BLEU
It seems that BLEU is not a suitable metric for music generation. To apply BLEU, I sampled some sequences from the existing songs and used the first part as the prior and the last part as the reference. Two observations:
- The model trained with 1 epoch has a better BLEU score than the one trained with 500+ epochs.
- Using 1000 previous notes as prior is worse than just using 5 previous notes as prior during testing.
There are two possible explanations: (i) generation with high randomness can “cover” the ground-truth sequence better, and the fully trained model might compose in a reasonable but, compared with the ground-truth sequence, inconsistent direction; (ii) BLEU should be further adapted for use in music generation (e.g., different delay events are actually similar).
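For reference, a minimal sketch of the evaluation described above with NLTK's sentence-level BLEU (the prior/continuation split and the token format are my assumptions):

```python
# Sketch: score a generated continuation against the ground-truth continuation
# of the same song with sentence-level BLEU. Inputs are lists of event tokens.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def continuation_bleu(reference_events, generated_events):
    # one reference per sample; smoothing avoids zero scores on short overlaps
    smooth = SmoothingFunction().method1
    return sentence_bleu([reference_events], generated_events,
                         smoothing_function=smooth)

# usage (hypothetical data): prior = song[:5], reference = song[5:], generated = model output
# print(continuation_bleu(reference, generated))
```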
Histogram Cluster
In music theory, there is a concept called key. The key of a piece is the group of pitches, or scale, that forms the basis of a musical composition in classical, Western art, and Western pop music.
Simply speaking, a song is mostly composed of a certain set of notes and only occasionally uses other notes. This main combination (key) varies from song to song, but it should be consistent within one song (or a long segment of it).
I simply compute the histogram of note occurrences in each song as its feature and visualize these features using t-SNE:
Obviously, there are some natural clusters, and this can be used as conditional information to help the RNN generate more consistent songs. Although an RNN could learn this on its own, adding this information should help us control the generation.
By applying a softmax with a small temperature on the histogram ($\mathrm{softmax}(h/\tau)$ with a small $\tau$), we can get a more discriminative clustering.
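A minimal sketch of this histogram feature and its visualization, assuming each song is given as a list of MIDI note numbers (`songs`, the temperature value, and the helper name are illustrative):

```python
import numpy as np
from sklearn.manifold import TSNE

def pitch_class_histogram(notes, temperature=None):
    """Normalized 12-bin histogram of pitch classes; optional temperature softmax."""
    hist = np.bincount(np.asarray(notes) % 12, minlength=12).astype(float)
    hist /= max(hist.sum(), 1.0)
    if temperature is not None:             # sharpen with softmax(h / T)
        logits = hist / temperature
        logits -= logits.max()              # numerical stability
        hist = np.exp(logits) / np.exp(logits).sum()
    return hist

# songs: list of lists of MIDI note numbers (hypothetical loader output)
features = np.stack([pitch_class_histogram(song, temperature=1e-2) for song in songs])
embedded = TSNE(n_components=2).fit_transform(features)  # 2-D points to scatter-plot
```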
Furthermore, I think the embedding of notes should be different in different keys.
If we manage to generate music with a given histogram, then we are actually halfway to generating different genres of music. For instance, there are some keys/scales popular in jazz that are rarely used in classical music.
Histogram-conditional RNN
I use the pre-calculated histogram of the song as the conditional information and concatenate it with the previous note to form each input. More precisely, there are 12 pitch classes in music ('C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B') and I count their occurrences in a song and normalize the histogram. This histogram is then concatenated with the one-hot vector to form an input. It works as expected:
| Methods | #Epochs | Loss (Cross-entropy) | JS-divergence (with global histogram) |
|---|---|---|---|
| Random | 0 | 5.89 | 0.162 |
| Baseline | 1000 | 1.90 | 0.414 |
| Hist-RNN | 200* | 2.91 | 0.251/0.174** |
| Patch-Hist RNN | 500 | 1.75 | 0.287/0.195** |
Although Hist-RNN is trained with fewer epochs and has a higher loss, its generation has a lower divergence from the true distribution. *: I will update the results later. **: The temperature equals 1e-2, which makes the conditional information sparser after normalization.
Notice that the random generation has the lowest divergence, which means Hist-RNN is still not accurate enough. Ideally, Hist-RNN should generate songs whose histogram distribution is the same as the given one (divergence = 0). There are some issues that might affect the results:
- The conditional information (i.e., the histogram of notes) is calculated on the whole song rather than on a patch, while I use patches to train the network. The discrepancy between the local and the global distribution might confuse the network.
- When generating, some noise is included intentionally to increase the variety of the generation, which might disturb the results.
TODO next:
- Keep finetuning the Hist-RNN.
- Use patch-histogram.
- Update 17/10/17: The histograms of a patch and of the whole song are incomparable to some extent. Although Patch-Hist-RNN has a larger divergence, it manages to converge to a smaller loss. It seems that the vanilla RNN fails to control some features of the music (i.e., keeping them consistent with the prior) when it tries to increase the variety.
- Try to come up with other statistical information to direct the generation (e.g., density, pitch, n-gram histogram, sustain time).
New setting: rather than generating from scratch, we allow the neural network to take 500 previous notes as a prior and test on the subsequent sequence.
| Methods | #Epochs | Loss (Cross-entropy) | JS-divergence (with patch histogram) |
|---|---|---|---|
| Random | 0 | 5.89 | 0.251 |
| Baseline | 1000 | 1.90 | 0.301 |
| Patch-Hist RNN | 500 | 1.75 | 0.161 |
Compared with the first table, the random generation gets a larger JS-divergence because the patch histogram is less similar to the uniform distribution that random generation tends to produce. Also, the Patch-Hist RNN predicts the coming sequence much better given the prior as well as the histogram.
Another question is whether overfitting is good. We want the model to learn the rules of music, but we don't want it to repeat what it has learned, which is somewhat contradictory.
Ideas (Hierarchical Composition)
- Short-term: decoding to sequence using GAN/VAE
- Long-term: LSTM to generate encoding
- Training method:
- VAE-style (encoder-decoder)
- Short-term: Sequence → code → sequence
- Long-term: Encode segment by segment (sliding window).
- GAN-style (random variable-generation):
- Short-term: Random variable→ Sequence
- Long-term: Not sure. Possible solutions: (i) metaGAN: Fix the local GAN and generate the local random variable with a metaGAN. (ii) Use KNN to reverse the generation procedure (Likely to fail).
- Encoding:
- Learning based: AE, VAE, Hourglass, U-Net ...
- Statistics based: Note histogram, density, pitch, n-gram histogram.
If we use a GAN, there is no way to encode an existing song and find its movement in the code space.
Ideas (Scale Filter)
Let's step back and learn to generate songs with notes spanning a single scale. It is reasonable to do that because changes of scale are rarely observed in simple compositions. Based on my experience with (unprofessional) improvising, I can use only one scale to play different themes.
For those who are not familiar with music, a scale or a key can be regarded as a fixed combination of notes. In the C major scale we have C, D, E, F, G, A, B, and in the B major scale we have B, C#, D#, E, F#, G#, A#. Once the scale is selected, we can use the seven (or more) notes in the scale to compose the song.
To do that, we should first find the relationship between different scales. For instance, one can rewrite a song written in the scale of C in the scale of B# by shifting the notes (TODO: verify whether each scale can theoretically be translated to every other scale. Confirmed). After that, we can “normalize” every song to a standard scale in order to simplify the learning of rhythms or chords.
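A minimal sketch of such key normalization by chromatic transposition, assuming the song's tonic pitch class is already known (e.g., detected with music21); the function name and defaults are mine:

```python
def normalize_key(midi_notes, tonic_pc, target_pc=0):
    """Shift every MIDI note so that the song's tonic becomes the target pitch class
    (C by default). Enharmonic spelling is ignored; only semitones matter."""
    shift = (target_pc - tonic_pc) % 12
    if shift > 6:              # prefer the smaller interval (shift down instead of up)
        shift -= 12
    return [min(127, max(0, n + shift)) for n in midi_notes]

# e.g. a song detected in D (tonic_pc=2) is transposed down a whole tone to C
# normalized = normalize_key(song_notes, tonic_pc=2)
```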
Next steps
- Rewrite everything in Tensorflow.
- Implement the LCS (longest common subsequence) algorithm to check the ratio of repetition (on the easymusicnotes dataset).
- Analyse the easymusicnotes dataset and hardcode some rules about scales.
- Develop the “memorize one and diverge” idea.
Commit
- Updated to Tensorflow
- Found a library (music21) to analyze the music.
- Used LCS to analyze the overfitting problem (in this blog).
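A minimal sketch of that repetition check, assuming plain LCS over event-token lists; the ratio definition below is my own choice and may differ from the one used in the linked post:

```python
def lcs_length(a, b):
    """Classic O(len(a)*len(b)) dynamic program for the longest common subsequence."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def repetition_ratio(generated, training_song):
    """Fraction of the generated sequence that also appears (in order) in a training song."""
    return lcs_length(generated, training_song) / max(len(generated), 1)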
Multivoice Model on Multi-hot Inputs
Music can be quantized with a fixed step size and represented as a multi-hot vector at each step. The figure on the right contains 25 songs, where the x-axis is the timeline and the y-axis is the note index. A white pixel at $(t, i)$ means key $i$ is pressed at time $t$.
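A minimal sketch of producing such a multi-hot matrix, assuming pretty_midi is used for parsing (the project has its own encoder, so treat this only as an illustration; the step rate `fs` is arbitrary):

```python
import numpy as np
import pretty_midi

def multi_hot_roll(path, fs=20):
    """Quantize a MIDI file into a (T, 128) binary matrix with 1/fs-second steps."""
    midi = pretty_midi.PrettyMIDI(path)
    roll = midi.get_piano_roll(fs=fs)       # (128, T) velocity matrix
    return (roll.T > 0).astype(np.float32)  # transpose to (T, 128) and binarize

# roll = multi_hot_roll('example.mid')  # hypothetical file
```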
With this coding method, the model has to make a multi-label prediction at each time step. I used the cross-entropy loss to train the model and it converges well (loss = 0.10) on the training set. However, during generation, the model usually gets stuck on certain notes or just plays a rarely changing chord.
There are some issues with this coding:
- The model has to output one note multiple times to simulate the “sustain” effect, while a human just presses the key once and holds it rather than hitting it multiple times.
- Because the model makes a multi-label prediction by giving a confidence for each note at each timestep, I have to use a threshold to decide which keys should be played. The model gives low confidence when changing chords, so a lower threshold should be used, while it gives strong confidence for sustaining the current state, namely playing what was played previously. This generation strategy actually encourages the model to generate the same notes again and again.
(Crazy) Solution:
- Rather than using the multi-label prediction, the variable-length outputs at each timestep can be locally regarded as a sequence. Furthermore, the duration of a note can be embedded together with the note index to avoid repeated predictions when the model tries to sustain it.
As for the model, I used a multiway LSTM structure:
- The bottom LSTM blocks take the corresponding voices as inputs separately.
- The outputs of the different bottom LSTM blocks are concatenated and sent to a middle LSTM block, where the information from the different voices can be merged.
- Each top LSTM block takes the output of the middle LSTM block and generates a single voice.
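A sketch of this multiway structure with the Keras functional API; the layer sizes and the number of voices are placeholders, not the project's actual configuration:

```python
from tensorflow.keras import layers, Model

n_voices, timesteps, n_notes = 4, 64, 128

# one bottom LSTM per voice
voice_inputs = [layers.Input(shape=(timesteps, n_notes)) for _ in range(n_voices)]
bottom = [layers.LSTM(64, return_sequences=True)(x) for x in voice_inputs]

# middle LSTM merges the concatenated voice features
merged = layers.Concatenate()(bottom)
middle = layers.LSTM(128, return_sequences=True)(merged)

# one top LSTM + decision layer per voice
outputs = [
    layers.TimeDistributed(layers.Dense(n_notes, activation='sigmoid'))(
        layers.LSTM(64, return_sequences=True)(middle))
    for _ in range(n_voices)
]

model = Model(voice_inputs, outputs)
model.compile(optimizer='adam', loss='binary_crossentropy')
```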
The reason why this overfitting model cannot generate what it remembered (like this one) might be exposure bias:
when generating a long sequence, we take the output of the RNN as its future input, and such input patterns might never be observed in the training set. Therefore, this discrepancy can accumulate as the generated sequence grows.
We are currently implementing seqGAN to solve this problem. Let’s hope it can work.
PS: EXTREMELY simple rules to improvise (https://www.youtube.com/watch?v=HeI6QMZBSqI, https://www.youtube.com/watch?v=fMmksLBcBFY). Why can't a machine learn them!?
PS2 (11/08/17): The problem may result from the fact that many songs are not aligned (in how the voices are defined). Manual preprocessing is needed.
Sequence Autoencoder
The new seq2seq API of Tensorflow is better organized than the previous one, but there are only a limited number of tutorials about it. I followed this post to write the code. Notice that there are a few mistakes in that script.
The concept of Helper is vague in the documentation. Generally, it defines how the decoder handles its input.
- TrainingHelper feeds the given input without changing it. This is widely used in supervised training.
- GreedyEmbeddingHelper finds the output word with the maximal probability and uses its embedding as the next input. This can be used to generate sequences in the deployment phase.
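A condensed sketch of how the two helpers plug into the TF 1.x `tf.contrib.seq2seq` decoder (all sizes, placeholders, and the zero encoder state are stand-ins for the real model):

```python
import tensorflow as tf
from tensorflow.contrib import seq2seq

vocab_size, emb_dim, batch_size = 128, 64, 32              # illustrative sizes
embedding = tf.get_variable('embedding', [vocab_size, emb_dim])
cell = tf.nn.rnn_cell.LSTMCell(256)
projection = tf.layers.Dense(vocab_size)                   # decision layer on top of the RNN
encoder_state = cell.zero_state(batch_size, tf.float32)    # stand-in for the encoder's final state

# training: feed the ground-truth (embedded) inputs step by step
decoder_inputs = tf.placeholder(tf.float32, [batch_size, None, emb_dim])
lengths = tf.placeholder(tf.int32, [batch_size])
train_helper = seq2seq.TrainingHelper(decoder_inputs, lengths)

# deployment: greedily take the argmax token and embed it as the next input
infer_helper = seq2seq.GreedyEmbeddingHelper(
    embedding, start_tokens=tf.fill([batch_size], 0), end_token=1)

def decode(helper):
    decoder = seq2seq.BasicDecoder(cell, helper, initial_state=encoder_state,
                                   output_layer=projection)
    outputs, _, _ = seq2seq.dynamic_decode(decoder, maximum_iterations=1000)
    return outputs.rnn_output
```

Passing the decision layer to `BasicDecoder` as `output_layer` also makes the greedy helper sample from the projected logits rather than the raw RNN output, which is related to the bug described below.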
The figure on the right demonstrates the model:
- An encoder is trained to compress the sequence into the hidden state.
- A decoder is trained to recover the sequence.
The model can overfit the training set during training (loss=5e-2). The figure at the bottom is the prediction result (on the training set). Each testing case contains three rows: (i) ground truth sequence (ii) prediction with TrainingHelper (training setting) (iii) prediction sequence using GreedyEmbeddingHelper (deployment setting).
As shown, (i) and (ii) are almost the same because we converge to a very small loss during training. However, except for the first few notes, (iii) is very different from (i). The bug is in the GreedyEmbeddingHelper: it returns the last output of the RNN as the next input, while our model adds an additional fully-connected layer on top of the RNN as the decision layer. Removing that decision layer makes the training much slower.
Update 23: I use a customized helper (GreedyEmbeddingDecisionHelper) to handle the output. Now all three sequences are similar:
How to generate with Sequence AutoEncoder:
- [TODO] Sample in the code space (Update SeqAE to SeqVAE).
- [Done] Encode a short piece of an existing song (like 100 events) and then decode a much longer sequence (like 1000 events). The extra sequence (900 events) is the generated part.
- [Done] When the model is underfitting, like the one below where I train the model to encode 400 events and it ends with a high loss (4e-1), the “reconstruction” sounds nothing like the original. (Another funny observation: the decoder gets stuck in a plausible rhythm and keeps generating it. If I increase the generation length, it can still repeat the rhythm. This resembles the definition of a “fixed point”/“equilibrium” in neurodynamics.)
TODO:
- Split to train/test set to evaluate the generalization.
- Encode longer sequences (e.g. the whole song).
Aligned generation
What if we align all the notes and breaks to the same length, so we can focus on the combination of chords and ignore the dynamics? I tested this idea and, actually, the generated song is not bad.
One funny thing is that when decoding two totally different codes, the decoder eventually ends up in a dynamic equilibrium. More specifically, it can repeat an existing rhythm over and over again without external input. Normally, an RNN (or char-RNN) will get stuck on a weird key, but SeqAE can stably generate a rhythm depending only on its own output, which is very impressive.
My theory: denote the decoder as a function $f$, the hidden state as $h_t$, and the output as $y_t$. During testing we have $(y_{t+1}, h_{t+1}) = f(y_t, h_t)$, because the decoder feeds its last output back as its next input. So if we combine $y_t$ and $h_t$ into a single state $s_t$, we actually have $s_{t+1} = f(s_t)$. In traditional neurodynamics we have the continuous form $\dot{s} = f(s)$, while here we have the discrete form $s_{t+1} = f(s_t)$. A neurodynamic system can act like a CAM (Content Addressable Memory): the input or first state $s_0$ is a non-accurate address, and by iterating it might converge to a state $s^*$, which is the memory. Instead of converging to a fixed point, sometimes it might converge to a limit cycle, as we have here. If we can find a proper Lyapunov function, maybe we can prove this stability. Apart from the stability, we can also dive into the manifold of the coding space: maybe a song/piece of rhythm is a trace that spins in that space.
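As a toy illustration of this view (not the actual decoder), one can iterate any discrete map $s_{t+1} = f(s_t)$ and check whether the closed-loop trajectory settles into a fixed point or a limit cycle:

```python
import numpy as np

def iterate_until_cycle(f, s0, max_steps=10000, tol=1e-6):
    """Iterate s_{t+1} = f(s_t) and report the period of the cycle it falls into."""
    seen = [np.asarray(s0, dtype=float)]
    for _ in range(max_steps):
        s = f(seen[-1])
        for k, prev in enumerate(seen):
            if np.linalg.norm(s - prev) < tol:
                return len(seen) - k          # period (1 = fixed point)
        seen.append(s)
    return None                               # no cycle detected within max_steps

# a toy 2-D map (90-degree rotation) whose orbit is a period-4 limit cycle
f = lambda s: np.array([-s[1], s[0]])
print(iterate_until_cycle(f, [1.0, 0.0]))     # prints 4
```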
Continuous AutoEncoder
Original Plan:
- Test SeqAE on unseen pieces. Still very accurate.
- Simple RL+GAN (to find a proper discriminator structure).
- Encode a short piece into a short vector. Done with a 256-bit code (only values near 0, 0.5, and 1.0 are activated).
- Observe the dynamics of the code space within a song to decide the “action”. The rhythm of a random walk is continuous (consistent between immediate neighbors).
- Better quantization method ([-1, 1] rather than [0, 1]).
- Let’s RL.
Continuous AutoEncoder: (This post has been updated multiple times; inconsistencies may appear between different versions.)
The aligned encoding is used, where the duration of every event is the same. Assume the original sequence is $x$; the encoder $E$ learns to compress $x$ into a code $c = E(x)$, while the decoder $D$ tries to reconstruct $x$ from $c$. The reconstruction loss is
$$ L_{rec} = d\big(x, D(E(x))\big), $$
where $d$ measures the reconstruction error.
In the continuous SeqAE, I split the song every $k$ events, which means a song can be represented as $x = (x_1, x_2, \dots, x_n)$, where each sub-sequence $x_i$ contains $k$ events. The reason to do that is to generate music with a larger granularity ($k$ events at a time) rather than note by note (one event at a time). So the AutoEncoder learns to encode and decode the different sub-sequences $x_i$, whose length $k$ is set to 8 in my experiment.
Now, a song can be encoded into a sequence of codes $c = (c_1, c_2, \dots, c_n)$ with $c_i = E(x_i)$.
If an agent can learn how to sequentially pick those codes, then we can compose a plausible song! However, the dimension of the code space is very high, and learning how to walk in an unregularized space might be too taxing. To alleviate that, two modifications are added.
The first one is to use the sigmoid function to regularize the code space. More specifically, I employ a sigmoid with a steep slope ($\sigma(wx)$ with a large $w$, similar to a step function). After training, the values of the code are distributed near 0, 0.5, and 1. We can further quantize the real code space into a discrete space by snapping every value to {0, 0.5, 1}.
(TODO) Currently, the reconstruction loss of the quantized code is very large, and I am trying to add some quantized codes into the training of the decoder to make it invariant to the quantization (but the gradient cannot flow back). In the real design, the reconstruction loss contains two terms: the first one is $d\big(x, D(E(x))\big)$ and the second one is $d\big(x, D(q(E(x)))\big)$, where $q$ is the quantization function. The first one has meaningful gradients for both the encoder and the decoder, while the second one can only be used to train the decoder, because $q$ cannot back-propagate gradients to the encoder.
Secondly, although changes of rhythm do happen in composition, the rhythm within a song should be locally consistent. In other words, we want the agent to take small steps most of the time. Therefore, I design a continuousness loss (in the form of a hinge loss) to restrict the code space:
$$ L_{cont} = \sum_i \max\big(0, \|c_{i+1} - c_i\| - m\big), $$
which means we want the codes of immediately neighboring sub-sequences to stay within a margin $m$ of each other. An alternative loss is the plain distance $\sum_i \|c_{i+1} - c_i\|$. I think this problem is similar to manifold embedding, except that I am trying to smooth the change between consecutive codes. Maybe I will look into manifold learning to find some well-developed losses to try out.
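A sketch of how the three terms could be written in TensorFlow, under the formulation above; the squared-error distance, margin, and weights are my assumptions, not the project's exact settings:

```python
import tensorflow as tf

def quantize(code):
    # snap each value to {0, 0.5, 1}; tf.round has a zero gradient, so this term
    # cannot push gradients back into the encoder (it only trains the decoder)
    return tf.round(code * 2.0) / 2.0

def continuous_ae_loss(x, codes, decoder, margin=0.1, w_quant=1.0, w_cont=1.0):
    """codes: (batch, n_subseq, code_dim) from the encoder; x: target sub-sequences."""
    rec = tf.reduce_mean(tf.squared_difference(decoder(codes), x))
    rec_quant = tf.reduce_mean(tf.squared_difference(decoder(quantize(codes)), x))
    # hinge-style continuousness: neighboring codes should stay within a margin
    step = tf.norm(codes[:, 1:] - codes[:, :-1], axis=-1)
    cont = tf.reduce_mean(tf.maximum(0.0, step - margin))
    return rec + w_quant * rec_quant + w_cont * cont
```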
Follow-up:
Assume our activation function is the steep sigmoid $\sigma(wx)$ and the final loss is a weighted sum of $L_{rec}$ and $L_{cont}$. The balance between the two terms is very critical, because the requirement of accurate reconstruction contradicts the continuousness to some degree. Although I assume that most consecutive sub-sequences are similar, it is actually very hard to keep this consistency in a highly non-linear mapping.
|  |  |  |
|---|---|---|
| converges | converges | fails to converge because it takes a huge gradient to flip a bit |
| converges | converges slowly because the original code space is not sparse enough* | equals 0 |
| fails to converge because all the codes remain the same | fails to converge because all the codes remain the same | equals 0 |
| converges | converges | converges slowly because the restriction is too relaxed |
*: It actually learns to generate the most frequent subsequence because the quantized code is the same.
Also, the learning rate is very important. A large learning rate can speed up the minimization because the code becomes sparse quickly. This fast convergence also hinders the optimization of the continuousness loss, because it takes a large gradient to re-arrange an almost discrete coding space. My suggestion is to fix the learning rate first and then finetune the loss weights.
In the ideal situation, the reconstruction loss should be as small as possible to generate accurate sub-sequences, and the continuousness term should be around 0.5, indicating that the agent can take only one or a few steps to transit to the next state.
Follow-up 2:
When quantizing the code, some information is lost, and the encoder cannot notice that because the quantization function cannot propagate the gradient. Therefore, when one of the two reconstruction terms has totally converged and the other has not, we are actually overfitting the decoder so that it can remember the correct outputs. The decoder has to compromise with a not-so-cooperative encoder. The two terms should not differ too much during training. (Everyone should work equally :))
Also, the continuousness loss is imposed on the quantized code (the eventual output). Hence, it cannot constrain the encoder. With that observation, I decided to impose the continuousness loss on the original code space (rather than the quantized one).
Follow-up 3:
When trying out a new model, it is recommended to first ignore all regularizations and see whether the model can overfit.
Continuous AutoEncoder2
10/30/2017
- Change the activation function to
- Change the quantization function to
- Change the continuousness loss to
- Remove clipnorm to allow significant re-arrangements in the code space
11/01/2017
- Fix the key normalization bug.
- Use an hourglass structure rather than two RNNs.
- Change the continuousness loss to
GAN (G:RNN D:CNN)
The generation is not good.
Some observations:
- If a (fully-connected) decision layer is added on top of the RNN, the network learns to generate only one key.
- The RNN can hardly update after converging to a certain outline (like in the figure). Continuing to train it only tweaks a very small part of the image.
Scale of Different keys
| Key | 1 | 1# | 2 | 3b | 3 | 4 | 4# | 5 | 5# | 6 | 7b | 7 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| C | c | c# | d | eb | e | f | f# | g | g# | a | bb | b |
| C# | c# | d | d# | e | f | f# | g | g# | a | a# | b | c |
| D | d | d# | e | f | f# | g | g# | a | bb | b | c | c# |
| Eb | eb | e | f | gb | g | ab | a | bb | b | c | db | d |
| E | e | f | f# | g | g# | a | a# | b | c | c# | d | d# |
| F | f | f# | g | ab | a | bb | b | c | c# | d | eb | e |
| F# | f# | g | g# | a | a# | b | c | c# | d | d# | e | f |
| G | g | g# | a | bb | b | c | c# | d | d# | e | f | f# |
| G# | g# | a | a# | b | c | c# | d | d# | e | f | gb | g |
| A | a | bb | b | c | c# | d | d# | e | f | f# | g | g# |
| Bb | bb | b | c | c# | d | eb | e | f | f# | g | ab | a |
| B | b | c | c# | d | d# | e | f | f# | g | g# | a | a# |
Refine Net
Refine the duration from a quantized song.
Not good. Severe overfitting.
One-shot Music Composition Learning
Exp:
- Train on short one and change the timestep when generating
Problems:
- How to walk? Only up and rep, no down. (region based reset rather than randomly reset)
- Reward only on the ground truth has no variety.
- AE may not fully employ the coding space. Should use GAN as well?
Update 09/11
One-shot learning can actually be promising
Jevois meets DeepSymphony
See how we can interactively generate music. Details: http://shaofanlai.com/exp/2#log-54
With audio: http://shaofanlai.com/exp/2#log-56
DCRNN
Because the previous models can hardly produce long-term structure, I wondered whether it is possible to generate songs hierarchically. The basic idea is intuitive, as shown in the figure: (i) An RNN is trained to generate a short sequence. (ii) Its output is repeated along the time axis to form a locally consistent sequence. (iii) The expanded output is then fed into another RNN. (iv) Go to (ii) until it reaches the deepest layer.
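A sketch of that expand-and-refine generator in Keras; layer sizes and the repetition factor are placeholders, and the real DCRNN is trained adversarially rather than used stand-alone:

```python
from tensorflow.keras import layers, Model, backend as K

code_len, repeat, units, n_notes = 16, 8, 64, 128

noise = layers.Input(shape=(code_len, 32))                  # short noise sequence
x = layers.LSTM(units, return_sequences=True)(noise)        # coarse skeleton
# repeat each timestep along the time axis to expand the temporal resolution
x = layers.Lambda(lambda t: K.repeat_elements(t, repeat, axis=1))(x)
x = layers.LSTM(units, return_sequences=True)(x)            # refine locally
x = layers.Lambda(lambda t: K.repeat_elements(t, repeat, axis=1))(x)
x = layers.LSTM(units, return_sequences=True)(x)
song = layers.TimeDistributed(layers.Dense(n_notes, activation='sigmoid'))(x)

generator = Model(noise, song)   # output: (batch, code_len * repeat**2, n_notes)
```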
At present, this model combined with a CNN discriminator can generate repeated rhythms, but it also has a severe mode-collapse problem: the generated songs are almost the same even given different noise codes.
There are four components in the generator. The first three are RNNs and the last one is a fully-connected layer. I tried to train the whole model layer by layer: the last RNN and the decision FC layer are trained first to learn what music might locally look like, and then I finetune the first two layers to learn a diverse skeleton for the music. In the experiment, I found that I can generate repeated sequences using only the last RNN and the FC layer. The mode-collapse problem is very severe in that all the generations look the same. Besides, changing the first two RNNs afterwards barely has an impact on the generated song.
During the first step, the collapsed mode generated by the model changes abruptly from one pattern to another even when I use a small learning rate. I tried to increase the noise's variance in order to introduce larger differences into the generation process, but it didn't work well. The next step will be checking the intermediate results of different noise vectors at each stage to find out where it goes wrong.
Follow-up: Different noise vectors have similar outcomes after three layers of RNN. I reduced the model to two RNN layers and hope it works. I guess the reason is that the highly non-linear mapping in the RNN reduces the variance, so adding some scaling operations between the RNN layers should also help.
Observation: The mean of the output of softmax is very small (mode collapse):
- 0.95266 (noise vector)
- 0.32785 (output of RNN1)
- 0.70545 (output of RNN2)
- 15.042 (output of FC)
- 0.00329796 (output of softmax)
IWGAN
The loss from Improved WGAN works better than the loss of traditional GAN in two ways:
- Less mode collapse
- Better (smoother) gradients.
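For reference, a sketch of the Improved WGAN (gradient-penalty) objective in TensorFlow; `discriminator` is a callable critic, samples are assumed to be 3-D (batch, time, notes) tensors, and the penalty weight 10 follows the IWGAN paper:

```python
import tensorflow as tf

def iwgan_losses(discriminator, real, fake, gp_weight=10.0):
    """Wasserstein critic/generator losses with the gradient penalty of Gulrajani et al."""
    d_real = discriminator(real)
    d_fake = discriminator(fake)

    # gradient penalty on random interpolations between real and fake samples
    eps = tf.random_uniform([tf.shape(real)[0], 1, 1], 0.0, 1.0)
    interp = eps * real + (1.0 - eps) * fake
    grads = tf.gradients(discriminator(interp), [interp])[0]
    slopes = tf.sqrt(tf.reduce_sum(tf.square(grads), axis=[1, 2]) + 1e-12)
    penalty = tf.reduce_mean(tf.square(slopes - 1.0))

    d_loss = tf.reduce_mean(d_fake) - tf.reduce_mean(d_real) + gp_weight * penalty
    g_loss = -tf.reduce_mean(d_fake)
    return d_loss, g_loss
```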
I am wondering whether using a CNN discriminator for music generation is a good idea. In his lecture about the defects of CNNs, Hinton pointed out that CNNs have invariance rather than equivariance. In music generation, the position of a note is very important, and therefore a CNN may not be suitable for the task.
(11/12/07) However, when using an RNN as the discriminator, TensorFlow reported “TypeError: Second-order gradient for while loops not supported”.
Local Resemblance
There are 3 methods to achieve the temporal-local resemblance:
- Reuse the short pieces from the existing song
- Use encode-decode method to encode a short piece in a supervised way
- Learning explicitly (i.e. learning how to generate the whole song rather than focusing on the local)
However, all of them have disadvantages. Reusing the original pieces is somewhat cheating and is an extreme case of overfitting. The encode-decode method cannot accurately reconstruct the sequences when the coding space is constricted. With explicit learning, if the model can accurately generate local sequences, it must also overfit on the long sequence. It is very contradictory: we want the music to sound fluent locally, but we don't want the whole song to be exactly the same as an existing one.
DCRNN can generate repeated songs
Not stable, but it is the first model that can generate repetitions without hand-crafted rules. I cherry-picked the result:
Sequential GAN
Applying a traditional GAN to sequential generation is actually possible. Arguing that GANs are not suitable for discrete decision making, SeqGAN replaces the training method with reinforcement learning by treating the generator as a policy and the discriminator as a reward function. However, by carefully altering some architectural details, we can actually train a (multi-hot) sequential GAN without RL:
- Remove gradient clip.
- Use the loss function from IWGAN rather than the vanilla GAN loss. The Wasserstein metric is much more stable than the JS divergence.
- Adding a fully connected layer after the RNN can boost the convergence*.
- Don’t pre-train the discriminator.
*: Although the last linear layer can greatly boost (x200) the convergence, it also makes the generation volatile. Given a fixed noise vector, the generated sample should change steadily as training progresses; but in the model with an extra linear layer, the generated samples vary significantly between iterations. My conjecture is that such a decision layer can amplify the output of the RNN. Whether this instability is good for adversarial training remains unclear.
Vanilla LSTM GAN
With the techniques described before, we can actually build a generator with vanilla LSTM (without hierarchical architecture).
Visualized song (x-axis is the time and y-axis is the pitch):
The input of the LSTM at each step is composed of the noise vector and a time indicator (a float from 0.0 to 1.0). One problem is that the LSTM pays almost no attention to the time indicator and generates one rhythm again and again. Although this is much better than the previous models, there should be some change within a song. This might be caused by the short timespan of the training samples (128 steps when training vs 512 when generating). I am trying to train on longer sequences to ameliorate this issue.
Update: Longer sequences do not help. Trying linearly changing noise vector.
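A small sketch of how such inputs can be assembled, covering both the constant-noise and the linearly changing variant; the shapes and the function name are only illustrative:

```python
import numpy as np

def build_generator_inputs(batch, timesteps, noise_dim, linspace_code=False):
    """Per-step input = [noise vector, time indicator in [0, 1]]."""
    if linspace_code:
        # interpolate linearly between a start and an end noise vector over time
        z0 = np.random.randn(batch, 1, noise_dim)
        z1 = np.random.randn(batch, 1, noise_dim)
        alpha = np.linspace(0.0, 1.0, timesteps).reshape(1, timesteps, 1)
        z = (1.0 - alpha) * z0 + alpha * z1
    else:
        # the same noise vector repeated at every timestep
        z = np.repeat(np.random.randn(batch, 1, noise_dim), timesteps, axis=1)
    t = np.tile(np.linspace(0.0, 1.0, timesteps).reshape(1, timesteps, 1), (batch, 1, 1))
    return np.concatenate([z, t], axis=-1)   # (batch, timesteps, noise_dim + 1)
```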
Some might question whether the CNN discriminator can discriminate at the temporal level. Here, I plot some generated samples and the gradients of the generator loss with respect to them. We can see that the CNN gives different gradients for the same note at different time steps.
DCRNN with Linspace Code
Only using long sequences as training samples cannot resolve the problem of repetition (http://shaofanlai.com/exp/1#log-62). When applying a linearly changing noise vector to the LSTM, the training becomes extremely slow (on a single LSTM with 128 cells) and the network fails to transition from one rhythm to another smoothly. Using a stacked LSTM (64-32 cells), it can generate visually plausible songs, but they sound very weird: many keys are sustained too briefly and there are no clear beats between different rhythms.
I believe the note selection itself is good and unseen in the dataset, but the beat (sustain, velocity, pause) is awful. Some postprocessing is required to generate a plausible song.
DCRNN with noisy real song
In this model, the noise vectors are sampled from real songs with noisy masks. My original intention was that the model should learn to reconstruct the real song by adding details. The model failed mainly because I didn't use a bidirectional RNN: the model should be allowed to look both forward and backward to decide how long the original note sustains. A reconstruction loss should also be included in the optimization objective.
But anyway, the byproduct is still interesting. In my last few experiments, I complained that the generated songs keep looping the same rhythm because the noise vectors are all the same across timesteps. I then tried to resolve it with a linearly changing code, which can learn to transition from one rhythm to another. In a word, the pattern/repetition of the generation is closely related to the temporal pattern of the noise vectors. Using the real song as the noise vectors, the generated song repeats when the original repeats and changes when the original changes.
Apart from generating repeated rhythms, there is another goal I want to achieve with DCRNN. Ideally, the later RNNs should learn to generate locally repeated rhythms, while the first few RNNs should learn to generate codes for the later RNNs (“later” and “first” refer to their position in the forward pass). However, the whole DCRNN learns to generate repeated rhythms and the generated song becomes tedious.
DCRNN (AB code generation)
The theme is new and does not sound bad. The unpleasant sustains and high pitches can be considered noise in the generation (like noisy pixels in an image).
The next problem is how to use this model, which can generate short pieces, to generate a full song. More specifically, it is about how to model the long term at a high level.
DCRNN
When training DCRNN on a simple dataset, it tends to generate a repeated rhythm, because that seems to be an easier job for the RNN and the discriminator can hardly tell the difference. But as the training iterations increase (to about 12000), the DCRNN starts to include some changes in the cycle. When training on a complex dataset, DCRNN starts to diverge at about 3000 iterations, probably because the sampled short subsequences of these songs seldom repeat.
Now the problem is: given models that can generate plausible short sequences, how do we pick a sequence of codes so that they generate a whole song?
RNN-RNN-GAN
Using RNN-RNN-GAN seems to be a bad solution for two reasons:
- Cannot train the model with `tf.nn.dynamic_rnn`. `tf.static_rnn` takes much more (~x10) time to build and train than the dynamic one, especially when the sequences are long. It is very time-consuming to debug this model.
- RNN-RNN-GAN can hardly learn the distribution of notes. Although it once (at step 20171) generated some plausible samples, those samples were highly repetitive and the severe issue of mode collapse was observed.
Strip Deconv Kernel
The reason why I used strip transposed convolutional (deconvolutional) kernels is that there are some fixed combinations of notes in music (like the major chord). Therefore, those patterns (chords) should be reusable at different places on the keyboard. I set the kernel size to 1x12 because there are 12 notes per octave.
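A sketch of such a strip kernel as a transposed convolution over the (time, pitch) plane; the stride, filter count, and where it sits inside the generator are assumptions:

```python
from tensorflow.keras import layers

# input: (batch, timesteps, octaves, channels) feature map from the recurrent part
strip_deconv = layers.Conv2DTranspose(
    filters=1,               # one output plane: the piano roll
    kernel_size=(1, 12),     # a 12-semitone (one-octave) pattern, e.g. a chord shape
    strides=(1, 12),         # tile the same pattern once per octave
    padding='valid',
    activation='sigmoid',
)
# piano_roll = strip_deconv(features)  # (batch, timesteps, octaves * 12, 1)
```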
The results are very promising. The model converges faster than the normal RNN+FC generator. There might be two reasons: (i) the proportion of the recurrent network is reduced, or (ii) the strip kernels help the training.
Now I am trying to make the inputs of the transposed conv layer sparse so that the patterns of the kernels can be clearer. (It failed to learn complex patterns.)
TOREAD
- Come up with some mathematical models to represent the notes.
- Metrics, Surveys:
- BLEU: a Method for Automatic Evaluation of Machine Translation
- A modified n-gram matching metric for machine translation. Vanilla BLEU is not suitable for music generation.
- Deep Learning Techniques for Music Generation: A Survey
- GAN-related:
- SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient
- Uses LSTM as the generator as well as the policy and CNN as the discriminator as well as the reward function. The model is trained in a policy gradient way.
- A SeqGAN for Polyphonic Music Generation
- Same architecture as SeqGAN. They used existing chords as words (not just notes) and normalized every song to C major.
- Improving Neural Machine Translation with Conditional Sequence Generative Adversarial Nets
- Mixing SeqGAN with BLEU score.
- Generating Text via Adversarial Training
- Multiplying the noise code with the hidden cell in the LSTM. Classical G:RNN-D:CNN architecture. Feature matching is used to train the generator.
- Improved Techniques for Training GANs
- Deprecated but still useful to some extent.
- AE-related:
- Variational Recurrent Auto-Encoders
- A common sequential auto-encoder with a classical VAE-like loss imposed on the code space.
- Generating Sentences from a Continuous Space
- Usage of VRAE. KL cost annealing, Word dropout, and Historyless decoding were introduced to handle the problem that the decoder might ignore the input code.
- Toward Controlled Generation of Text
- Deep learning for music
- Modeling Temporal Dependencies in High-Dimensional Sequences: Application to Polyphonic Music Generation and Transcription
Some Debates
- Composition versus Performance:
- After these days, I found that the side information about the performance (i.e., the velocity and delays) might be too much for the neural network to learn, because at the same time it has to discover some general rules of music.
- Complex dataset versus Simple dataset
- One advantage of simple songs is that they are composed in a more well-ordered way, and the network can extract their simple patterns. However, a standard RNN can easily overfit/memorize them and will repeat what it heard at test time.
- A complex dataset introduces diversity, but the RNN fails to stay consistent with one style during generation.
- Generate with Threshold
- A small threshold will include more noise but the continuousness of the notes will be preserved.
- A large threshold will ignore noise but might also tend to generate notes with short sustain.
Congrats!
We won the “Best Demo Award” and the “Best Presentation Award”! Thanks to everyone who contributed to this project!
Chord sequence reconstruction
- A 1D conv kernel of length 12 is introduced to abstract the common patterns.
- Tanh is much better than sigmoid for reconstruction.
- The decoder is just a linear mapping from the output of the LSTM, with tanh as an output regularizer.
Replacing sigmoid with softmax in the 1D conv layer, we get a model that achieves a similar reconstruction error on the training set as the last one, but with more meaningful filters. The reason is that only one filter can be activated at a time, so the filters can no longer cooperate with each other. The y-axis can be somewhat misleading because all filters scan the keyboard with a stride of one; in other words, you can pick any note besides “C” as the start.
If you look at the 14th column, you can find a filter for the D minor chord. As discussed before, this filter can be shifted to match any other minor chord.
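A sketch of that chord-template layer: a length-12 1D convolution over the pitch axis with a softmax across the filters, so only one template can win at each keyboard position (filter count and tensor shapes are placeholders, not the project's exact model):

```python
import tensorflow as tf
from tensorflow.keras import layers

n_templates = 24   # e.g. several chord templates in all transpositions

def chord_filter(piano_roll):
    """piano_roll: (batch, 128, 1) multi-hot keyboard slice; returns template activations."""
    conv = layers.Conv1D(filters=n_templates, kernel_size=12, strides=1,
                         padding='valid')(piano_roll)        # (batch, 117, n_templates)
    # softmax over the filter axis: the templates compete instead of cooperating
    return tf.nn.softmax(conv, axis=-1)
```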
- Note: when restricting the filters, we should also control the complexity of the following components. Otherwise, the filters will be too lazy to discover the latent features and will rely on the memorization ability of the following components to recover the chord.
- When sampling, randomly chopping the rhythms might help against the overfitting caused by memorization.
000 (Finish-me-with-beats) model
preceding context (only melody) => context vector
provide silence+attack => structure vector
context+structure => fill me
Possible problem:
The attacks and silences are sometimes not strong enough to change the progression of the rhythm. It might take an extra step to correct the progression.
Extension:
attention mechanism. longer context