Start: milestone
Introduction:
DeepSymphony is a final project for the course CSCI599 (https://csci599-dl.github.io). Our goal is to generate music with long-term structure. These logs only record my part of the work, and hence they might not reflect the final model. We will release our final code. You can access my code at https://github.com/piscesdream/deepsymphony
Plain text log for baseline
finish the first demo
x sustain too long
add a limit on sustain time
> more diverse, longer patterns
x dives into silence (no 1s at all)
randomize the input when silent
> no silence
===> 0000.mid
> try stateful RNN, but it is not aligned with one song when training
> not good, all silence
> change mse to binary_crossentropy
much less confidence
> use threshold==0.3
> not bad, has rhythm ====> 0001.mid
comes to silence and starts again
> use note/=max strategy, not good ===> 0002.mid
> add noise
still silence but more diverse 0003.mid
> use binomial random (1, 0.3) rather than uniform
more diverse 0004.mid
> use binomial random (1, 0.6)
it got emotional sometimes LOL 0005.mid
> use binomial random (1, 0.8)
more keys, but in an awful style 0006.mid
==> conclusion
overfit on one tune, which is bad (or good?)
Plain text log for baseline
retrain
  > got another rhythm, which stays unchanged throughout the song
      0007.mid
activation from hard_sigmoid to sigmoid
  more diverse and proactive
      0008.mid
  some magical realism
fix accumulation bug:
  didn't set to zero after note off
set noise after notes.append
seq.append after noise is added
  weird repetition:
      0009.mid
  max_sus=4, noise=(0.5, 2.0)
      0010.mid
the smaller the binomial_p, the sparser the notes
  max_sus=4, noise=(0.5, 2.0), random=binomial(1, 0.3)
      0011.mid
notes too dense, raise the threshold to 0.99
      0012.mid
      not good
make the dense layers more complex, shorten the max_sus
    quite well
          0013.mid
    set the threshold to 0.50 to get a cleaner version
          0014.mid
    remove the random reset to make the theme of the song consistent
          0015.mid
    random -> (1, 0.3)
          0016.mid
      can switch between rhythms
          0017.mid
  simple_rnn2
    overfits on one tune, 0018.mid
  simple structure
    32-LSTM -> 128 FC    0019.mid
Plain text log for baseline
  memorize
    hard to memorize the song
    stuck on a rhythm    0020.mid
    keeps repeating the most frequent rhythm
  change to
    LSTM-100, Dense-50, Dense-128
    remembers a longer rhythm, but still repeating 0021.mid
  add random decay in the generation
    random connection of pieces 0022.mid
commit
gan rnn1 (commit "gan rnn")
  cannot even learn the pattern that most places are black
  stops converging after 500
    d_loss converged to almost 0
    g_loss diverged to 16 and the output remained unchanged
gan rnn2 (commit "gan conv")
  can learn which notes are played more frequently than others,
    but doesn't have an inner pattern (maybe the dataset is too small?)
    display/9900.png vs display/real.png
commit
refactor the encoder
encode again
large dataset
updated simple_rnn on e-comp (len=2000)
  0023.mid
  0031.mid (loss=0.3579, seed=32) data augmentation
  0032.mid favorite one
updated simple_rnn on easy (len=100)
  first one with rhythm: 0024.mid (loss=1.1757)
  0025.mid (loss=0.7159)
  0026.mid (loss=0.4621)
  0027.mid (loss=0.2543)
  0028.mid (loss=0.2166) plagiarism?
  0029.mid (loss=0.2048, seed=32) plagiarism?
Deep Symphony Baseline
Check this post for the generated music: Deep Symphony Baseline.
Plain text log for baseline
trained all night
  e-comp-all, loss = 1.8979 (0033.mid)
    awesome
  e-comp-all, loss = 1.85 (0034.mid)
    longer one, borrow
Skipgram
Applying the skip-gram model on the event sequence: suppose $v_i$ and $v_j$ are the vectors for message $i$ and message $j$ respectively; the loss function is
$$ L = -\log \sigma\!\left(v_j^{\top} v_i\right) - \sum_{k \sim P_n} \log \sigma\!\left(-v_k^{\top} v_i\right), $$
where $j$ is a neighbor of message $i$ and the second term is the negative sampling. I uploaded the list of the most related pairs here.
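A minimal sketch of how such event embeddings can be trained, using gensim's Word2Vec as a stand-in for my own skip-gram implementation (the loader `load_event_sequences` and the exact hyperparameters are assumptions):

```python
# Sketch only: skip-gram embeddings over MIDI event tokens with gensim.
# Tokens such as "<C5 on>" or "<delay 10.0>" mirror the event vocabulary above.
from gensim.models import Word2Vec

# each song is a list of event tokens, e.g. ["<C5 on>", "<delay 10.0>", "<C5 off>", ...]
songs = load_event_sequences()  # hypothetical loader

model = Word2Vec(
    sentences=songs,
    vector_size=64,   # embedding dimension
    window=5,         # neighborhood size
    sg=1,             # skip-gram
    negative=5,       # negative sampling
    min_count=1,
)

# query the most related messages for a given event
print(model.wv.most_similar('<C5 on>', topn=5))
```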
For most messages of type “note on”, the most related vector is the “note off” message of the same note:
Most similar notes with <C5 on>
    <C5 off>, <C6 off>, <C6 on>, <delay 0.0>, <C4 off>
Most similar notes with <C#5 on>
    <C#5 off>, <C#6 on>, <C#6 off>, <E5 on>, <E5 off>
Most similar notes with <D5 on>
    <D5 off>, <D6 off>, <D6 on>, <F5 off>, <F5 on>
Most similar notes with <D#5 on>
    <D#5 off>, <D#6 off>, <D#6 on>, <D#4 off>, <D#4 on>
Most similar notes with <E5 on>
    <E5 off>, <E6 on>, <E6 off>, <delay 0.0>, <G5 off>
But in the low-register area, they are related to delays because we usually use a sustained low key as the background:
Most similar notes with <C2 on>
    <delay 450.0>, <delay 410.0>, <delay 430.0>, <delay 440.0>, <delay 420.0>
Most similar notes with <C#2 on>
    <delay 450.0>, <delay 410.0>, <delay 440.0>, <delay 430.0>, <delay 470.0>
Most similar notes with <D2 on>
    <delay 410.0>, <delay 450.0>, <delay 430.0>, <delay 440.0>, <F#8 on>
Most similar notes with <D#2 on>
    <delay 450.0>, <delay 410.0>, <delay 430.0>, <delay 440.0>, <delay 420.0>
Most similar notes with <E2 on>
    <delay 410.0>, <delay 450.0>, <delay 430.0>, <F8 on>, <D2 on>
Messages of type delay are related to each other since we have many consecutive delays (brought by meta messages). However, when the delay time is longer than 500 units (500×10 ms), they are most related to the message <D10 on>, which is weird.
Most similar notes with <delay 520.0>
    <D10 on>, <A#0 off>, <E0 on>, <D0 on>, <E1 on>
Most similar notes with <delay 530.0>
    <E0 on>, <G0 off>, <D10 on>, <G#0 on>, <D0 on>
Most similar notes with <delay 540.0>
    <D10 on>, <A#0 off>, <G0 off>, <G#0 on>, <E0 on>
Most similar notes with <delay 550.0>
    <D10 on>, <E0 on>, <G#0 on>, <G0 off>, <D0 on>
Most similar notes with <delay 560.0>
    <D10 on>, <D0 on>, <G#0 on>, <E0 on>, <G#1 on>
...
The velocity-change messages are mostly related to delay messages.
Most similar notes with <velocity 1>
    <D10 on>, <A#0 off>, <G0 off>, <G#0 on>, <D0 on>
Most similar notes with <velocity 2>
    <delay 410.0>, <delay 430.0>, <delay 450.0>, <delay 440.0>, <delay 400.0>
Most similar notes with <velocity 4>
    <delay 430.0>, <delay 390.0>, <delay 400.0>, <delay 450.0>, <F8 on>
Most similar notes with <velocity 8>
    <delay 410.0>, <delay 380.0>, <delay 340.0>, <C#2 on>, <delay 400.0>
Most similar notes with <velocity 16>
    <delay 20.0>, <delay 10.0>, <delay 0.0>, <delay 30.0>, <delay 40.0>
Most similar notes with <velocity 32>
    <delay 10.0>, <velocity 32>, <delay 20.0>, <velocity 64>, <delay 30.0>
Most similar notes with <velocity 64>
    <velocity 64>, <delay 10.0>, <velocity 32>, <delay 20.0>, <delay 30.0>
Currently I am re-training the model with the aforementioned embedding while keeping the embedding weights fixed. Hopefully it can converge to a lower (<1.85) loss and produce better generations.
The stacked RNN model with embedding can also generate “reasonable” performances. However, there is still no long-term structure across the whole song.
Pianoteq
Pianoteq can synthesize much more realistic audio than timidity from the same MIDI file.
Visualization of Cells
Visualization of the activation during generation:
Some neurons keep firing all the time. Updated asynchronously on the 13th.
Meeting with TA (Hexiang)
- Q: Any more types of models? A: Use a classifier to distinguish the real songs from the generated ones, and use methods like guided backpropagation to update the input (similar to a GAN).
- Q: Evaluation? A: (1) BLEU score (2) inception score
- Q: Music has dialects (different keys, genres). A: Add additional information bits to control.
- Q: How to interpret the LSTM? A: (1) Fix/disable one cell intentionally and check the result (2) add a guided mask before the softmax (3) (mine) try L2 regularization.
- Q: Generate by interpolation? A: Not so common.
Vanilla BLEU
It seems that BLEU is not a suitable metric for music generation. To apply BLEU, I sampled some sequences from the existing songs and used the first part as the prior and the last part as the reference. Two observations:
- The model trained with 1 epoch has a better BLEU score than the one trained with 500+ epochs.
- Using 1000 previous notes as prior is worse than just using 5 previous notes as prior during testing.
There are two possible explanations: (i) generation with high randomness can “cover” the ground-truth sequence better, and the fully trained model might compose in a reasonable but, compared with the ground-truth sequence, inconsistent direction; (ii) BLEU should be further adapted for use in music generation (e.g., different delay events are actually similar).
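For reference, a minimal sketch of the evaluation described above with NLTK's sentence-level BLEU (the prior/continuation split and the token format are my assumptions):

```python
# Sketch: score a generated continuation against the ground-truth continuation
# of the same song with sentence-level BLEU. Inputs are lists of event tokens.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def continuation_bleu(reference_events, generated_events):
    # one reference per sample; smoothing avoids zero scores on short overlaps
    smooth = SmoothingFunction().method1
    return sentence_bleu([reference_events], generated_events,
                         smoothing_function=smooth)

# usage (hypothetical data): prior = song[:5], reference = song[5:], generated = model output
# print(continuation_bleu(reference, generated))
```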
Histogram Cluster
In music theory, there is a concept called key. The key of a piece is the group of pitches, or scale, that forms the basis of a musical composition in classical, Western art, and Western pop music.
Simply speaking, a song is mostly composed of a certain set of notes and only occasionally uses other notes. This main combination (key) varies from song to song, but it should be consistent within one song (or a long segment of it).
I simply compute the histogram of note occurrences in each song as its feature and visualize these features using t-SNE:
Obviously, there are some natural clusters, and this can be used as conditional information to help the RNN generate more consistent songs. Although an RNN could learn this on its own, adding this information should help us control the generation.
By applying a softmax with a small temperature on the histogram ($\mathrm{softmax}(h/\tau)$ with a small $\tau$), we can get a more discriminative clustering.
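A minimal sketch of this histogram feature and its visualization, assuming each song is given as a list of MIDI note numbers (`songs`, the temperature value, and the helper name are illustrative):

```python
import numpy as np
from sklearn.manifold import TSNE

def pitch_class_histogram(notes, temperature=None):
    """Normalized 12-bin histogram of pitch classes; optional temperature softmax."""
    hist = np.bincount(np.asarray(notes) % 12, minlength=12).astype(float)
    hist /= max(hist.sum(), 1.0)
    if temperature is not None:             # sharpen with softmax(h / T)
        logits = hist / temperature
        logits -= logits.max()              # numerical stability
        hist = np.exp(logits) / np.exp(logits).sum()
    return hist

# songs: list of lists of MIDI note numbers (hypothetical loader output)
features = np.stack([pitch_class_histogram(song, temperature=1e-2) for song in songs])
embedded = TSNE(n_components=2).fit_transform(features)  # 2-D points to scatter-plot
```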
Furthermore, I think the embedding of notes should be different in different keys.
If we manage to generate music with a given histogram, then we are actually halfway to generating different genres of music. For instance, there are some keys/scales popular in jazz that are rarely used in classical music.
Histogram-conditional RNN
I use the pre-calculated histogram of the song as the conditional information and concatenate it with the previous note to form each input. More precisely, there are 12 pitch classes in music ('C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B') and I count their occurrences in a song and normalize the histogram. This histogram is then concatenated with the one-hot vector to form an input. It works as expected:
| Methods | #Epochs | Loss (Cross-entropy) | JS-divergence (with global histogram) |
|---|---|---|---|
| Random | 0 | 5.89 | 0.162 |
| Baseline | 1000 | 1.90 | 0.414 |
| Hist-RNN | 200* | 2.91 | 0.251/0.174** |
| Patch-Hist RNN | 500 | 1.75 | 0.287/0.195** |
Although Hist-RNN is trained with fewer epochs and has a higher loss, its generation has a lower divergence from the true distribution. *: I will update the results later. **: The temperature equals 1e-2, which makes the conditional information sparser after normalization.
Notice that the random generation has the lowest divergence, which means Hist-RNN is still not accurate enough. Ideally, Hist-RNN should generate songs whose histogram distribution is the same as the given one (divergence = 0). There are some issues that might affect the results:
- The conditional information (i.e., the histogram of notes) is calculated on the whole song rather than on a patch, while I use patches to train the network. The discrepancy between the local and the global distribution might confuse the network.
- When generating, some noise is included intentionally to increase the variety of the generation, which might disturb the results.
TODO next:
- Keep finetuning the Hist-RNN.
- Use patch-histogram.
- Update 17/10/17: The histograms of a patch and of the whole song are incomparable to some extent. Although Patch-Hist-RNN has a larger divergence, it manages to converge to a smaller loss. It seems that the vanilla RNN fails to control some features of the music (i.e., keeping them consistent with the prior) when it tries to increase the variety.
- Try to come up with other statistical information to direct the generation (e.g., density, pitch, n-gram histogram, sustain time).
New setting: rather than generating from scratch, we allow the neural network to take 500 previous notes as a prior and test on the subsequent sequence.
| Methods | #Epochs | Loss (Cross-entropy) | JS-divergence (with patch histogram) |
|---|---|---|---|
| Random | 0 | 5.89 | 0.251 |
| Baseline | 1000 | 1.90 | 0.301 |
| Patch-Hist RNN | 500 | 1.75 | 0.161 |
Compared with the first table, the random generation gets a larger JS-divergence because the patch histogram is less similar to the uniform distribution that random generation tends to produce. Also, the Patch-Hist RNN predicts the coming sequence much better given the prior as well as the histogram.
Another question is whether overfitting is good. We want the model to learn the rules of music, but we don't want it to repeat what it has learned, which is somewhat contradictory.
Ideas (Hierarchical Composition)
- Short-term: decoding to sequence using GAN/VAE
- Long-term: LSTM to generate encoding
- Training method:
- VAE-style (encoder-decoder)
- Short-term: Sequence → code → sequence
- Long-term: Encode segment by segment (sliding window).
- GAN-style (random variable-generation):
- Short-term: Random variable→ Sequence
- Long-term: Not sure. Possible solutions: (i) metaGAN: Fix the local GAN and generate the local random variable with a metaGAN. (ii) Use KNN to reverse the generation procedure (Likely to fail).
- Encoding:
- Learning based: AE, VAE, Hourglass, U-Net ...
- Statistics based: Note histogram, density, pitch, n-gram histogram.
If we use a GAN, there is no way to encode an existing song and find its movement in the code space.
Ideas (Scale Filter)
Let's step back and learn to generate songs with notes spanning a single scale. It is reasonable to do that because changes of scale are rarely observed in simple compositions. Based on my experience with (unprofessional) improvising, I can use only one scale to play different themes.
For those who are not familiar with music, a scale or a key can be regarded as a fixed combination of notes. In the C major scale we have C, D, E, F, G, A, B, and in the B major scale we have B, C#, D#, E, F#, G#, A#. Once the scale is selected, we can use the seven (or more) notes in the scale to compose the song.
To do that, we should first find the relationship between different scales. For instance, one can rewrite a song written in the scale of C in the scale of B# by shifting the notes (TODO: verify whether each scale can theoretically be translated to every other scale. Confirmed). After that, we can “normalize” every song to a standard scale in order to simplify the learning of rhythms or chords.
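A minimal sketch of such key normalization by chromatic transposition, assuming the song's tonic pitch class is already known (e.g., detected with music21); the function name and defaults are mine:

```python
def normalize_key(midi_notes, tonic_pc, target_pc=0):
    """Shift every MIDI note so that the song's tonic becomes the target pitch class
    (C by default). Enharmonic spelling is ignored; only semitones matter."""
    shift = (target_pc - tonic_pc) % 12
    if shift > 6:              # prefer the smaller interval (shift down instead of up)
        shift -= 12
    return [min(127, max(0, n + shift)) for n in midi_notes]

# e.g. a song detected in D (tonic_pc=2) is transposed down a whole tone to C
# normalized = normalize_key(song_notes, tonic_pc=2)
```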
Next steps
- Rewrite everything in Tensorflow.
- Implement the LCS (longest common subsequence) algorithm to check the ratio of repetition (on the easymusicnotes dataset).
- Analyse the easymusicnotes dataset and hardcode some rules about scales.
- Develop the “memorize one and diverge” idea.
Commit
- Updated to Tensorflow
- Found a library (music21) to analyze the music.
- Used LCS to analyze the overfitting problem (in this blog).
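A minimal sketch of that repetition check, assuming plain LCS over event-token lists; the ratio definition below is my own choice and may differ from the one used in the linked post:

```python
def lcs_length(a, b):
    """Classic O(len(a)*len(b)) dynamic program for the longest common subsequence."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def repetition_ratio(generated, training_song):
    """Fraction of the generated sequence that also appears (in order) in a training song."""
    return lcs_length(generated, training_song) / max(len(generated), 1)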
Multivoice Model on Multi-hot Inputs
Music can be quantized with a fixed step size and represented as a multi-hot vector at each step. The figure on the right contains 25 songs, where the x-axis is the timeline and the y-axis is the note index. A white pixel at $(t, i)$ means key $i$ is pressed at time $t$.
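A minimal sketch of producing such a multi-hot matrix, assuming pretty_midi is used for parsing (the project has its own encoder, so treat this only as an illustration; the step rate `fs` is arbitrary):

```python
import numpy as np
import pretty_midi

def multi_hot_roll(path, fs=20):
    """Quantize a MIDI file into a (T, 128) binary matrix with 1/fs-second steps."""
    midi = pretty_midi.PrettyMIDI(path)
    roll = midi.get_piano_roll(fs=fs)       # (128, T) velocity matrix
    return (roll.T > 0).astype(np.float32)  # transpose to (T, 128) and binarize

# roll = multi_hot_roll('example.mid')  # hypothetical file
```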
With this coding method, the model has to make a multi-label prediction at each time step. I used the cross-entropy loss to train the model and it converges well (loss = 0.10) on the training set. However, during generation, the model usually gets stuck on certain notes or just plays a rarely changing chord.
There are some issues with this coding:
- The model has to output one note multiple times to simulate the “sustain” effect, while a human just presses the key once and holds it rather than hitting it multiple times.
- Because the model makes a multi-label prediction by giving a confidence for each note at each timestep, I have to use a threshold to decide which keys should be played. The model gives low confidence when changing chords, so a lower threshold should be used, while it gives strong confidence for sustaining the current state, namely playing what was played previously. This generation strategy actually encourages the model to generate the same notes again and again.
(Crazy) Solution:
- Rather than using the multi-label prediction, the variable-length outputs at each timestep can be locally regarded as a sequence. Furthermore, the duration of a note can be embedded together with the note index to avoid repeated predictions when the model tries to sustain it.
As for the model, I used a multiway LSTM structure:
- The bottom LSTM blocks take the corresponding voices as inputs separately.
- The outputs of the different bottom LSTM blocks are concatenated and sent to a middle LSTM block, where the information from the different voices can be merged.
- Each top LSTM block takes the output of the middle LSTM block and generates a single voice.
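A sketch of this multiway structure with the Keras functional API; the layer sizes and the number of voices are placeholders, not the project's actual configuration:

```python
from tensorflow.keras import layers, Model

n_voices, timesteps, n_notes = 4, 64, 128

# one bottom LSTM per voice
voice_inputs = [layers.Input(shape=(timesteps, n_notes)) for _ in range(n_voices)]
bottom = [layers.LSTM(64, return_sequences=True)(x) for x in voice_inputs]

# middle LSTM merges the concatenated voice features
merged = layers.Concatenate()(bottom)
middle = layers.LSTM(128, return_sequences=True)(merged)

# one top LSTM + decision layer per voice
outputs = [
    layers.TimeDistributed(layers.Dense(n_notes, activation='sigmoid'))(
        layers.LSTM(64, return_sequences=True)(middle))
    for _ in range(n_voices)
]

model = Model(voice_inputs, outputs)
model.compile(optimizer='adam', loss='binary_crossentropy')
```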
The reason why this overfitting model cannot generate what it remembered (like this one) might be exposure bias:
when generating a long sequence, we take the output of the RNN as its future input, and such input patterns might never be observed in the training set. Therefore, this discrepancy can accumulate as the generated sequence grows.
We are currently implementing seqGAN to solve this problem. Let’s hope it can work.
PS: EXTREMELY simple rules to improvise (https://www.youtube.com/watch?v=HeI6QMZBSqI, https://www.youtube.com/watch?v=fMmksLBcBFY). Why can't a machine learn them!?
PS2 (11/08/17): The problem may result from the fact that many songs are not aligned (in how the voices are defined). Manual preprocessing is needed.
Sequence Autoencoder
The new seq2seq API of Tensorflow is better organized than the previous one, but there are only a limited number of tutorials about it. I followed this post to write the code. Notice that there are a few mistakes in that script.
The concept of Helper is vague in the documentation. Generally, it defines how the decoder handles its input.
- TrainingHelper feeds the given input without changing it. This is widely used in supervised training.
- GreedyEmbeddingHelper finds the output word with the maximal probability and uses its embedding as the next input. This can be used to generate sequences in the deployment phase.
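A condensed sketch of how the two helpers plug into the TF 1.x `tf.contrib.seq2seq` decoder (all sizes, placeholders, and the zero encoder state are stand-ins for the real model):

```python
import tensorflow as tf
from tensorflow.contrib import seq2seq

vocab_size, emb_dim, batch_size = 128, 64, 32              # illustrative sizes
embedding = tf.get_variable('embedding', [vocab_size, emb_dim])
cell = tf.nn.rnn_cell.LSTMCell(256)
projection = tf.layers.Dense(vocab_size)                   # decision layer on top of the RNN
encoder_state = cell.zero_state(batch_size, tf.float32)    # stand-in for the encoder's final state

# training: feed the ground-truth (embedded) inputs step by step
decoder_inputs = tf.placeholder(tf.float32, [batch_size, None, emb_dim])
lengths = tf.placeholder(tf.int32, [batch_size])
train_helper = seq2seq.TrainingHelper(decoder_inputs, lengths)

# deployment: greedily take the argmax token and embed it as the next input
infer_helper = seq2seq.GreedyEmbeddingHelper(
    embedding, start_tokens=tf.fill([batch_size], 0), end_token=1)

def decode(helper):
    decoder = seq2seq.BasicDecoder(cell, helper, initial_state=encoder_state,
                                   output_layer=projection)
    outputs, _, _ = seq2seq.dynamic_decode(decoder, maximum_iterations=1000)
    return outputs.rnn_output
```

Passing the decision layer to `BasicDecoder` as `output_layer` also makes the greedy helper sample from the projected logits rather than the raw RNN output, which is related to the bug described below.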
The figure on the right demonstrates the model:
- An encoder is trained to compress the sequence into the hidden state.
- A decoder is trained to recover the sequence.
The model can overfit the training set during training (loss=5e-2). The figure at the bottom is the prediction result (on the training set). Each testing case contains three rows: (i) ground truth sequence (ii) prediction with TrainingHelper (training setting) (iii) prediction sequence using GreedyEmbeddingHelper (deployment setting).
As shown, (i) and (ii) are almost the same because we converge to a very small loss during training. However, except for the first few notes, (iii) is very different from (i). The bug is in the GreedyEmbeddingHelper: it returns the last output of the RNN as the next input, while our model adds an additional fully-connected layer on top of the RNN as the decision layer. Removing that decision layer makes the training much slower.
Update 23: I use a customized helper (GreedyEmbeddingDecisionHelper) to handle the output. Now all three sequences are similar:
How to generate with Sequence AutoEncoder:
- [TODO] Sample in the code space (Update SeqAE to SeqVAE).
- [Done] Encode a short piece of an existing song (like 100 events) and then decode a much longer sequence (like 1000 events). The extra sequence (900 events) is the generated part.
- [Done] When the model is underfitting, like the one below where I train the model to encode 400 events and it ends with a high loss (4e-1), the “reconstruction” sounds nothing like the original. (Another funny observation: the decoder gets stuck in a plausible rhythm and keeps generating it. If I increase the generation length, it can still repeat the rhythm. This resembles the definition of a “fixed point”/“equilibrium” in neurodynamics.)
TODO:
- Split to train/test set to evaluate the generalization.
- Encode longer sequences (e.g. the whole song).
Aligned generation
What if we align all the notes and breaks to the same length, so we can focus on the combination of chords and ignore the dynamics? I tested this idea and, actually, the generated song is not bad.
One funny thing is that when decoding two totally different codes, the decoder eventually ends up in a dynamic equilibrium. More specifically, it can repeat an existing rhythm over and over again without external input. Normally, an RNN (or char-RNN) will get stuck on a weird key, but SeqAE can stably generate a rhythm depending only on its own output, which is very impressive.
My theory: denote the decoder as a function $f$, the hidden state as $h_t$, and the output as $y_t$. During testing we have $(y_{t+1}, h_{t+1}) = f(y_t, h_t)$, because the decoder feeds its last output back as its next input. So if we combine $y_t$ and $h_t$ into a single state $s_t$, we actually have $s_{t+1} = f(s_t)$. In traditional neurodynamics we have the continuous form $\dot{s} = f(s)$, while here we have the discrete form $s_{t+1} = f(s_t)$. A neurodynamic system can act like a CAM (Content Addressable Memory): the input or first state $s_0$ is a non-accurate address, and by iterating it might converge to a state $s^*$, which is the memory. Instead of converging to a fixed point, sometimes it might converge to a limit cycle, as we have here. If we can find a proper Lyapunov function, maybe we can prove this stability. Apart from the stability, we can also dive into the manifold of the coding space: maybe a song/piece of rhythm is a trace that spins in that space.
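As a toy illustration of this view (not the actual decoder), one can iterate any discrete map $s_{t+1} = f(s_t)$ and check whether the closed-loop trajectory settles into a fixed point or a limit cycle:

```python
import numpy as np

def iterate_until_cycle(f, s0, max_steps=10000, tol=1e-6):
    """Iterate s_{t+1} = f(s_t) and report the period of the cycle it falls into."""
    seen = [np.asarray(s0, dtype=float)]
    for _ in range(max_steps):
        s = f(seen[-1])
        for k, prev in enumerate(seen):
            if np.linalg.norm(s - prev) < tol:
                return len(seen) - k          # period (1 = fixed point)
        seen.append(s)
    return None                               # no cycle detected within max_steps

# a toy 2-D map (90-degree rotation) whose orbit is a period-4 limit cycle
f = lambda s: np.array([-s[1], s[0]])
print(iterate_until_cycle(f, [1.0, 0.0]))     # prints 4
```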
Continuous AutoEncoder
Original Plan:
- Test SeqAE on unseen pieces. Still very accurate.
- Simple RL+GAN (to find a proper discriminator structure).
- Encode a short piece into a short vector. Done with a 256-bit code (only values near 0, 0.5, and 1.0 are activated).
- Observe the dynamics of the code space within a song to decide the “action”. The rhythm of a random walk is continuous (consistent between immediate neighbors).
- Better quantization method ([-1, 1] rather than [0, 1]).
- Let’s RL.
Continuous AutoEncoder: (This post has been updated multiple times; inconsistencies may appear between different versions.)
The aligned encoding is used, where the duration of every event is the same. Assume the original sequence is $x$; the encoder $E$ learns to compress $x$ into a code $c = E(x)$, while the decoder $D$ tries to reconstruct $x$ from $c$. The reconstruction loss is
$$ L_{rec} = d\big(x, D(E(x))\big), $$
where $d$ measures the reconstruction error.
In the continuous SeqAE, I split the song every $k$ events, which means a song can be represented as $x = (x_1, x_2, \dots, x_n)$, where each sub-sequence $x_i$ contains $k$ events. The reason to do that is to generate music with a larger granularity ($k$ events at a time) rather than note by note (one event at a time). So the AutoEncoder learns to encode and decode the different sub-sequences $x_i$, whose length $k$ is set to 8 in my experiment.
Now, a song can be encoded into a sequence of codes $c = (c_1, c_2, \dots, c_n)$ with $c_i = E(x_i)$.
If an agent can learn how to sequentially pick those codes, then we can compose a plausible song! However, the dimension of the code space is very high, and learning how to walk in an unregularized space might be too taxing. To alleviate that, two modifications are added.
The first one is to use the sigmoid function to regularize the code space. More specifically, I employ a sigmoid with a steep slope ($\sigma(wx)$ with a large $w$, similar to a step function). After training, the values of the code are distributed near 0, 0.5, and 1. We can further quantize the real code space into a discrete space by snapping every value to {0, 0.5, 1}.
(TODO) Currently, the reconstruction loss of the quantized code is very large, and I am trying to add some quantized codes into the training of the decoder to make it invariant to the quantization (but the gradient cannot flow back). In the real design, the reconstruction loss contains two terms: the first one is $d\big(x, D(E(x))\big)$ and the second one is $d\big(x, D(q(E(x)))\big)$, where $q$ is the quantization function. The first one has meaningful gradients for both the encoder and the decoder, while the second one can only be used to train the decoder, because $q$ cannot back-propagate gradients to the encoder.
Secondly, although changes of rhythm do happen in composition, the rhythm within a song should be locally consistent. In other words, we want the agent to take small steps most of the time. Therefore, I design a continuousness loss (in the form of a hinge loss) to restrict the code space:
$$ L_{cont} = \sum_i \max\big(0, \|c_{i+1} - c_i\| - m\big), $$
which means we want the codes of immediately neighboring sub-sequences to stay within a margin $m$ of each other. An alternative loss is the plain distance $\sum_i \|c_{i+1} - c_i\|$. I think this problem is similar to manifold embedding, except that I am trying to smooth the change between consecutive codes. Maybe I will look into manifold learning to find some well-developed losses to try out.
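A sketch of how the three terms could be written in TensorFlow, under the formulation above; the squared-error distance, margin, and weights are my assumptions, not the project's exact settings:

```python
import tensorflow as tf

def quantize(code):
    # snap each value to {0, 0.5, 1}; tf.round has a zero gradient, so this term
    # cannot push gradients back into the encoder (it only trains the decoder)
    return tf.round(code * 2.0) / 2.0

def continuous_ae_loss(x, codes, decoder, margin=0.1, w_quant=1.0, w_cont=1.0):
    """codes: (batch, n_subseq, code_dim) from the encoder; x: target sub-sequences."""
    rec = tf.reduce_mean(tf.squared_difference(decoder(codes), x))
    rec_quant = tf.reduce_mean(tf.squared_difference(decoder(quantize(codes)), x))
    # hinge-style continuousness: neighboring codes should stay within a margin
    step = tf.norm(codes[:, 1:] - codes[:, :-1], axis=-1)
    cont = tf.reduce_mean(tf.maximum(0.0, step - margin))
    return rec + w_quant * rec_quant + w_cont * cont
```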
Follow-up:
Assume our activation function is the steep sigmoid $\sigma(wx)$ and the final loss is a weighted sum of $L_{rec}$ and $L_{cont}$. The balance between the two terms is very critical, because the requirement of accurate reconstruction contradicts the continuousness to some degree. Although I assume that most consecutive sub-sequences are similar, it is actually very hard to keep this consistency in a highly non-linear mapping.
|  |  |  |
|---|---|---|
| converges | converges | fails to converge because it takes a huge gradient to flip a bit |
| converges | converges slowly because the original code space is not sparse enough* | equals 0 |
| fails to converge because all the codes remain the same | fails to converge because all the codes remain the same | equals 0 |
| converges | converges | converges slowly because the restriction is too relaxed |
*: It actually learns to generate the most frequent subsequence because the quantized code is the same.
Also, the learning rate is very important. A large learning rate can speed up the minimization because the code becomes sparse quickly. This fast convergence also hinders the optimization of the continuousness loss, because it takes a large gradient to re-arrange an almost discrete coding space. My suggestion is to fix the learning rate first and then finetune the loss weights.
In the ideal situation, the reconstruction loss should be as small as possible to generate accurate sub-sequences, and the continuousness term should be around 0.5, indicating that the agent can take only one or a few steps to transit to the next state.
Follow-up 2:
When quantizing the code, some information is lost, and the encoder cannot notice that because the quantization function cannot propagate the gradient. Therefore, when one of the two reconstruction terms has totally converged and the other has not, we are actually overfitting the decoder so that it can remember the correct outputs. The decoder has to compromise with a not-so-cooperative encoder. The two terms should not differ too much during training. (Everyone should work equally :))
Also, the continuousness loss is imposed on the quantized code (the eventual output). Hence, it cannot constrain the encoder. With that observation, I decided to impose the continuousness loss on the original code space (rather than the quantized one).
Follow-up 3:
When trying out a new model, it is recommended to first ignore all regularizations and see whether the model can overfit.
Continuous AutoEncoder2
10/30/2017
- Change the activation function to
- Change the quantization function to
- Change the continuousness loss to
- Remove clipnorm to allow significant re-arrangements in the code space
11/01/2017
- Fix the key normalization bug.
- Use an hourglass structure rather than two RNNs.
- Change the continuousness loss to
GAN (G:RNN D:CNN)
The generation is not good.
Some observations:
- If a (fully-connected) decision layer is added on top of the RNN, the network learns to generate only one key.
- The RNN can hardly update after converging to a certain outline (like in the figure). Continuing to train it only tweaks a very small part of the image.
Scale of Different keys
| Key | 1 | 1# | 2 | 3b | 3 | 4 | 4# | 5 | 5# | 6 | 7b | 7 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| C | c | c# | d | eb | e | f | f# | g | g# | a | bb | b |
| C# | c# | d | d# | e | f | f# | g | g# | a | a# | b | c |
| D | d | d# | e | f | f# | g | g# | a | bb | b | c | c# |
| Eb | eb | e | f | gb | g | ab | a | bb | b | c | db | d |
| E | e | f | f# | g | g# | a | a# | b | c | c# | d | d# |
| F | f | f# | g | ab | a | bb | b | c | c# | d | eb | e |
| F# | f# | g | g# | a | a# | b | c | c# | d | d# | e | f |
| G | g | g# | a | bb | b | c | c# | d | d# | e | f | f# |
| G# | g# | a | a# | b | c | c# | d | d# | e | f | gb | g |
| A | a | bb | b | c | c# | d | d# | e | f | f# | g | g# |
| Bb | bb | b | c | c# | d | eb | e | f | f# | g | ab | a |
| B | b | c | c# | d | d# | e | f | f# | g | g# | a | a# |
Refine Net
Refine the duration from a quantized song.
Not good. Severe overfitting.
One-shot Music Composition Learning
Exp:
- Train on short one and change the timestep when generating
Problems:
- How to walk? Only up and rep, no down. (region based reset rather than randomly reset)
- Reward only on the ground truth has no variety.
- AE may not fully employ the coding space. Should use GAN as well?
Update 09/11
One-shot learning can actually be promising
Jevois meets DeepSymphony
See how we can interactively generate music. Details: http://shaofanlai.com/exp/2#log-54
With audio: http://shaofanlai.com/exp/2#log-56
DCRNN
Because the previous models can hardly produce long-term structure, I wondered whether it is possible to generate songs hierarchically. The basic idea is intuitive, as shown in the figure: (i) An RNN is trained to generate a short sequence. (ii) Its output is repeated along the time axis to form a locally consistent sequence. (iii) The expanded output is then fed into another RNN. (iv) Go to (ii) until it reaches the deepest layer.
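A sketch of that expand-and-refine generator in Keras; layer sizes and the repetition factor are placeholders, and the real DCRNN is trained adversarially rather than used stand-alone:

```python
from tensorflow.keras import layers, Model, backend as K

code_len, repeat, units, n_notes = 16, 8, 64, 128

noise = layers.Input(shape=(code_len, 32))                  # short noise sequence
x = layers.LSTM(units, return_sequences=True)(noise)        # coarse skeleton
# repeat each timestep along the time axis to expand the temporal resolution
x = layers.Lambda(lambda t: K.repeat_elements(t, repeat, axis=1))(x)
x = layers.LSTM(units, return_sequences=True)(x)            # refine locally
x = layers.Lambda(lambda t: K.repeat_elements(t, repeat, axis=1))(x)
x = layers.LSTM(units, return_sequences=True)(x)
song = layers.TimeDistributed(layers.Dense(n_notes, activation='sigmoid'))(x)

generator = Model(noise, song)   # output: (batch, code_len * repeat**2, n_notes)
```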
At present, this model combined with a CNN discriminator can generate repeated rhythms, but it also has a severe mode-collapse problem: the generated songs are almost the same even given different noise codes.
There are four components in the generator. The first three are RNNs and the last one is a fully-connected layer. I tried to train the whole model layer by layer: the last RNN and the decision FC layer are trained first to learn what music might locally look like, and then I finetune the first two layers to learn a diverse skeleton for the music. In the experiment, I found that I can generate repeated sequences using only the last RNN and the FC layer. The mode-collapse problem is very severe in that all the generations look the same. Besides, changing the first two RNNs afterwards barely has an impact on the generated song.
During the first step, the collapsed mode generated by the model changes abruptly from one pattern to another even when I use a small learning rate. I tried to increase the noise's variance in order to introduce larger differences into the generation process, but it didn't work well. The next step will be checking the intermediate results of different noise vectors at each stage to find out where it goes wrong.
Follow-up: Different noise vectors have similar outcomes after three layers of RNN. I reduced the model to two RNN layers and hope it works. I guess the reason is that the highly non-linear mapping in the RNN reduces the variance, so adding some scaling operations between the RNN layers should also help.
Observation: The mean of the output of softmax is very small (mode collapse):
- 0.95266 (noise vector)
- 0.32785 (output of RNN1)
- 0.70545 (output of RNN2)
- 15.042 (output of FC)
- 0.00329796 (output of softmax)
IWGAN
The loss from Improved WGAN works better than the loss of traditional GAN in two ways:
- Less mode collapse
- Better (smoother) gradients.
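For reference, a sketch of the Improved WGAN (gradient-penalty) objective in TensorFlow; `discriminator` is a callable critic, samples are assumed to be 3-D (batch, time, notes) tensors, and the penalty weight 10 follows the IWGAN paper:

```python
import tensorflow as tf

def iwgan_losses(discriminator, real, fake, gp_weight=10.0):
    """Wasserstein critic/generator losses with the gradient penalty of Gulrajani et al."""
    d_real = discriminator(real)
    d_fake = discriminator(fake)

    # gradient penalty on random interpolations between real and fake samples
    eps = tf.random_uniform([tf.shape(real)[0], 1, 1], 0.0, 1.0)
    interp = eps * real + (1.0 - eps) * fake
    grads = tf.gradients(discriminator(interp), [interp])[0]
    slopes = tf.sqrt(tf.reduce_sum(tf.square(grads), axis=[1, 2]) + 1e-12)
    penalty = tf.reduce_mean(tf.square(slopes - 1.0))

    d_loss = tf.reduce_mean(d_fake) - tf.reduce_mean(d_real) + gp_weight * penalty
    g_loss = -tf.reduce_mean(d_fake)
    return d_loss, g_loss
```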
I am wondering whether using a CNN discriminator for music generation is a good idea. In his lecture about the defects of CNNs, Hinton pointed out that CNNs have invariance rather than equivariance. In music generation, the position of a note is very important, and therefore a CNN may not be suitable for the task.
(11/12/07) However, when using an RNN as the discriminator, TensorFlow reported “TypeError: Second-order gradient for while loops not supported”.
Local Resemblance
There are 3 methods to achieve the temporal-local resemblance:
- Reuse the short pieces from the existing song
- Use encode-decode method to encode a short piece in a supervised way
- Learning explicitly (i.e. learning how to generate the whole song rather than focusing on the local)
However, all of them have disadvantages. Reusing the original pieces is somewhat cheating and is an extreme case of overfitting. The encode-decode method cannot accurately reconstruct the sequences when the coding space is constricted. With explicit learning, if the model can accurately generate local sequences, it must also overfit on the long sequence. It is very contradictory: we want the music to sound fluent locally, but we don't want the whole song to be exactly the same as an existing one.
DCRNN can generate repeated songs
Not stable, but it is the first model that can generate repetitions without hand-crafted rules. I cherry-picked the result:
Sequential GAN
Applying a traditional GAN to sequential generation is actually possible. Arguing that GANs are not suitable for discrete decision making, SeqGAN replaces the training method with reinforcement learning by treating the generator as a policy and the discriminator as a reward function. However, by carefully altering some architectural details, we can actually train a (multi-hot) sequential GAN without RL:
- Remove gradient clip.
- Use the loss function from IWGAN rather than the vanilla GAN loss. The Wasserstein metric is much more stable than the JS divergence.
- Adding a fully connected layer after the RNN can boost the convergence*.
- Don’t pre-train the discriminator.
*: Although the last linear layer can greatly boost (x200) the convergence, it also makes the generation volatile. Given a fixed noise vector, the generated sample should change steadily as training progresses; but in the model with an extra linear layer, the generated samples vary significantly between iterations. My conjecture is that such a decision layer can amplify the output of the RNN. Whether this instability is good for adversarial training remains unclear.
Vanilla LSTM GAN
With the techniques described before, we can actually build a generator with vanilla LSTM (without hierarchical architecture).
Visualized song (x-axis is the time and y-axis is the pitch):
The input of the LSTM at each step is composed of the noise vector and a time indicator (a float from 0.0 to 1.0). One problem is that the LSTM pays almost no attention to the time indicator and generates one rhythm again and again. Although this is much better than the previous models, there should be some change within a song. This might be caused by the short timespan of the training samples (128 steps when training vs 512 when generating). I am trying to train on longer sequences to ameliorate this issue.
Update: Longer sequences do not help. Trying linearly changing noise vector.
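A small sketch of how such inputs can be assembled, covering both the constant-noise and the linearly changing variant; the shapes and the function name are only illustrative:

```python
import numpy as np

def build_generator_inputs(batch, timesteps, noise_dim, linspace_code=False):
    """Per-step input = [noise vector, time indicator in [0, 1]]."""
    if linspace_code:
        # interpolate linearly between a start and an end noise vector over time
        z0 = np.random.randn(batch, 1, noise_dim)
        z1 = np.random.randn(batch, 1, noise_dim)
        alpha = np.linspace(0.0, 1.0, timesteps).reshape(1, timesteps, 1)
        z = (1.0 - alpha) * z0 + alpha * z1
    else:
        # the same noise vector repeated at every timestep
        z = np.repeat(np.random.randn(batch, 1, noise_dim), timesteps, axis=1)
    t = np.tile(np.linspace(0.0, 1.0, timesteps).reshape(1, timesteps, 1), (batch, 1, 1))
    return np.concatenate([z, t], axis=-1)   # (batch, timesteps, noise_dim + 1)
```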
Some might question whether the CNN discriminator can discriminate at the temporal level. Here, I plot some generated samples and the gradients of the generator loss with respect to them. We can see that the CNN gives different gradients for the same note at different time steps.
DCRNN with Linspace Code
Only using long sequences as training samples cannot resolve the problem of repetition (http://shaofanlai.com/exp/1#log-62). When applying a linearly changing noise vector to the LSTM, the training becomes extremely slow (on a single LSTM with 128 cells) and the network fails to transition from one rhythm to another smoothly. Using a stacked LSTM (64-32 cells), it can generate visually plausible songs, but they sound very weird: many keys are sustained too briefly and there are no clear beats between different rhythms.
I believe the note selection itself is good and unseen in the dataset, but the beat (sustain, velocity, pause) is awful. Some postprocessing is required to generate a plausible song.
DCRNN with noisy real song
In this model, the noise vectors are sampled from real songs with noisy masks. My original intention was that the model should learn to reconstruct the real song by adding details. The model failed mainly because I didn't use a bidirectional RNN: the model should be allowed to look both forward and backward to decide how long the original note sustains. A reconstruction loss should also be included in the optimization objective.
But anyway, the byproduct is still interesting. In my last few experiments, I complained that the generated songs keep looping the same rhythm because the noise vectors are all the same across timesteps. I then tried to resolve it with a linearly changing code, which can learn to transition from one rhythm to another. In a word, the pattern/repetition of the generation is closely related to the temporal pattern of the noise vectors. Using the real song as the noise vectors, the generated song repeats when the original repeats and changes when the original changes.
Apart from generating repeated rhythms, there is another goal I want to achieve with DCRNN. Ideally, the later RNNs should learn to generate locally repeated rhythms, while the first few RNNs should learn to generate codes for the later RNNs (“later” and “first” refer to their position in the forward pass). However, the whole DCRNN learns to generate repeated rhythms and the generated song becomes tedious.
DCRNN (AB code generation)
The theme is new and does not sound bad. The unpleasant sustains and high pitches can be considered noise in the generation (like noisy pixels in an image).
The next problem is how to use this model, which can generate short pieces, to generate a full song. More specifically, it is about how to model the long term at a high level.
DCRNN
When training DCRNN on a simple dataset, it tends to generate a repeated rhythm, because that seems to be an easier job for the RNN and the discriminator can hardly tell the difference. But as the training iterations increase (to about 12000), the DCRNN starts to include some changes in the cycle. When training on a complex dataset, DCRNN starts to diverge at about 3000 iterations, probably because the sampled short subsequences of these songs seldom repeat.
Now the problem is: given models that can generate plausible short sequences, how do we pick a sequence of codes so that they generate a whole song?
RNN-RNN-GAN
Using RNN-RNN-GAN seems to be a bad solution for two reasons:
- Cannot train the model with `tf.nn.dynamic_rnn`. `tf.static_rnn` takes much more (~x10) time to build and train than the dynamic one, especially when the sequences are long. It is very time-consuming to debug this model.
- RNN-RNN-GAN can hardly learn the distribution of notes. Although it once (at step 20171) generated some plausible samples, those samples were highly repetitive and the severe issue of mode collapse was observed.
Strip Deconv Kernel
The reason why I used strip transposed convolutional (deconvolutional) kernels is that there are some fixed combinations of notes in music (like the major chord). Therefore, those patterns (chords) should be reusable at different places on the keyboard. I set the kernel size to 1x12 because there are 12 notes per octave.
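A sketch of such a strip kernel as a transposed convolution over the (time, pitch) plane; the stride, filter count, and where it sits inside the generator are assumptions:

```python
from tensorflow.keras import layers

# input: (batch, timesteps, octaves, channels) feature map from the recurrent part
strip_deconv = layers.Conv2DTranspose(
    filters=1,               # one output plane: the piano roll
    kernel_size=(1, 12),     # a 12-semitone (one-octave) pattern, e.g. a chord shape
    strides=(1, 12),         # tile the same pattern once per octave
    padding='valid',
    activation='sigmoid',
)
# piano_roll = strip_deconv(features)  # (batch, timesteps, octaves * 12, 1)
```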
The results are very promising. The model converges faster than the normal RNN+FC generator. There might be two reasons: (i) the proportion of the recurrent network is reduced, or (ii) the strip kernels help the training.
Now I am trying to make the inputs of the transposed conv layer sparse so that the patterns of the kernels can be clearer. (It failed to learn complex patterns.)
TOREAD
- Come up with some mathematical models to represent the notes.
- Metrics, Surveys:
- BLEU: a Method for Automatic Evaluation of Machine Translation
- A modified n-gram matching metric for machine translation. Vanilla BLEU is not suitable for music generation.
- Deep Learning Techniques for Music Generation: A Survey
- GAN-related:
- SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient
- Uses LSTM as the generator as well as the policy and CNN as the discriminator as well as the reward function. The model is trained in a policy gradient way.
- A SeqGAN for Polyphonic Music Generation
- Same architecture as SeqGAN. They used existing chords as words (not just notes) and normalized every song to C major.
- Improving Neural Machine Translation with Conditional Sequence Generative Adversarial Nets
- Mixing SeqGAN with BLEU score.
- Generating Text via Adversarial Training
- Multiplying the noise code with the hidden cell in the LSTM. Classical G:RNN-D:CNN architecture. Feature matching is used to train the generator.
- Improved Techniques for Training GANs
- Deprecated but still useful to some extent.
- AE-related:
- Variational Recurrent Auto-Encoders
- A common sequential auto-encoder with a classical VAE-like loss imposed on the code space.
- Generating Sentences from a Continuous Space
- Usage of VRAE. KL cost annealing, Word dropout, and Historyless decoding were introduced to handle the problem that the decoder might ignore the input code.
- Toward Controlled Generation of Text
- Deep learning for music
- Modeling Temporal Dependencies in High-Dimensional Sequences: Application to Polyphonic Music Generation and Transcription
Some Debates
- Composition versus Performance:
- After these days, I found that the side information about the performance (i.e., the velocity and delays) might be too much for the neural network to learn, because at the same time it has to discover some general rules of music.
- Complex dataset versus Simple dataset
- One advantage of simple songs is that they are composed in a more well-ordered way, and the network can extract their simple patterns. However, a standard RNN can easily overfit/memorize them and will repeat what it heard at test time.
- A complex dataset introduces diversity, but the RNN fails to stay consistent with one style during generation.
- Generate with Threshold
- A small threshold will include more noise but the continuousness of the notes will be preserved.
- A large threshold will ignore noise but might also tend to generate notes with short sustain.
Congrats!
We won the “Best Demo Award” and the “Best Presentation Award”! Thanks to everyone who contributed to this project!
Chord sequence reconstruction
- A 1D conv kernel of length 12 is introduced to abstract the common patterns.
- Tanh is much better than sigmoid for reconstruction.
- The decoder is just a linear mapping from the output of the LSTM, with tanh as an output regularizer.
Replacing sigmoid with softmax in the 1D conv layer, we get a model that achieves a similar reconstruction error on the training set as the last one, but with more meaningful filters. The reason is that only one filter can be activated at a time, so the filters can no longer cooperate with each other. The y-axis can be somewhat misleading because all filters scan the keyboard with a stride of one; in other words, you can pick any note besides “C” as the start.
If you look at the 14th column, you can find a filter for the D minor chord. As discussed before, this filter can be shifted to match any other minor chord.
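A sketch of that chord-template layer: a length-12 1D convolution over the pitch axis with a softmax across the filters, so only one template can win at each keyboard position (filter count and tensor shapes are placeholders, not the project's exact model):

```python
import tensorflow as tf
from tensorflow.keras import layers

n_templates = 24   # e.g. several chord templates in all transpositions

def chord_filter(piano_roll):
    """piano_roll: (batch, 128, 1) multi-hot keyboard slice; returns template activations."""
    conv = layers.Conv1D(filters=n_templates, kernel_size=12, strides=1,
                         padding='valid')(piano_roll)        # (batch, 117, n_templates)
    # softmax over the filter axis: the templates compete instead of cooperating
    return tf.nn.softmax(conv, axis=-1)
```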
- Note: when restricting the filters, we should also control the complexity of the following components. Otherwise, the filters will be too lazy to discover the latent features and will rely on the memorization ability of the following components to recover the chord.
- When sampling, randomly chopping the rhythms might help against the overfitting caused by memorization.
000 (Finish-me-with-beats) model
preceding context (only melody) => context vector
provide silence+attack => structure vector
context+structure => fill me
Possible problem:
The attacks and silences are sometimes not strong enough to change the progression of the rhythm. It might take an extra step to correct the progression.
Extension:
attention mechanism. longer context