Extracted 19 songs with https://phonicmind.com/
The sequential WGAN_GP with embedding failed the MNIST test.
The generator didn’t learn. The RNN learns more slowly than the decision layer, which causes all outputs in the sequence to change together.
Discretize the frequencies into 128 bins and only use the first 500 frequencies:
- Filter out spikes larger than MAX (5000~6000)
- Only take the first 512 frequencies
- Convert complex128 to single-precision complex (complex64)
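The three preprocessing steps above could look roughly like the following NumPy sketch. The function name, the (time, freq) layout, and the reading of "filter out" as zeroing the offending bins are all assumptions; the notes only give the MAX range, the 512-bin cutoff, and the dtype change.

```python
import numpy as np

MAX_MAGNITUDE = 5000.0  # assumption: the notes only give a 5000~6000 range
N_FREQ = 512            # from the notes: keep the first 512 frequency bins


def preprocess_spectrogram(spec, max_magnitude=MAX_MAGNITUDE, n_freq=N_FREQ):
    """spec: complex array of shape (time, freq). Returns complex64 (time, n_freq)."""
    # Keep only the lowest n_freq frequency bins.
    spec = np.asarray(spec)[:, :n_freq]
    # Assumption: "filter out" means zeroing bins whose magnitude exceeds MAX.
    spec = np.where(np.abs(spec) > max_magnitude, 0.0, spec)
    # complex128 -> complex64 halves the memory footprint.
    return spec.astype(np.complex64)
```

Clipping the magnitude to MAX instead of zeroing it would be an equally plausible reading; only the `np.where` line would change.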
Can reproduce the song when given the correct preceding prefix. Has the same problems as in the Deep-symphony project.
- Goes to silence quickly during self-inference.
- Still vanishes when self-generated inputs are used in training.
- The self-generated inputs from previous steps differ from the ones in the dataset.
- The self-generated inputs still deviate from the correct ones.
- Including the self-generated inputs during training gradually (proportion from 0.0 to 1.0).
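The gradual inclusion of self-generated inputs is close in spirit to what is usually called scheduled sampling. A minimal sketch follows; the linear ramp, the per-timestep mixing, and both function names are assumptions, since the notes only state that the proportion goes from 0.0 to 1.0.

```python
import numpy as np


def self_input_proportion(step, total_steps):
    """Linear ramp from 0.0 to 1.0 over training (assumed schedule shape)."""
    return min(1.0, max(0.0, step / float(total_steps)))


def mix_inputs(ground_truth, self_generated, proportion, rng):
    """Per timestep, use the model's own output with probability `proportion`.

    ground_truth, self_generated: arrays of shape (time, features).
    """
    use_self = rng.random(ground_truth.shape[0]) < proportion
    # Broadcast the per-timestep choice across the feature axis.
    return np.where(use_self[:, None], self_generated, ground_truth)
```

Early in training the model sees almost only dataset inputs; by the end it sees almost only its own outputs, which is the condition it faces during self-inference.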
Conditional Char RNN
Use conditional information to alleviate the averaging problem of Char RNN.
When using Char RNN, we optimize the model with the mean squared error (MSE) loss. However, MSE can cause the “averaging” problem: the output looks like an average of multiple plausible targets. If the targets are images, the generated image will be blurry. Since we are generating music here, the generated sound is hard to recognize. This is undesirable; we would prefer diverse but clean outputs. That’s exactly one of the advantages that a GAN can provide.
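A toy illustration of the averaging effect, with made-up values: if the same input has two equally likely targets, the single prediction that minimizes the expected MSE is their mean, which matches neither target.

```python
import numpy as np

# Two equally likely "correct" targets for the same input (toy values).
targets = np.array([-1.0, 1.0])

# Scan candidate predictions and compute the expected MSE of each one.
candidates = np.linspace(-1.5, 1.5, 301)
expected_mse = ((candidates[:, None] - targets[None, :]) ** 2).mean(axis=1)

# The minimizer sits at the mean of the targets (numerically 0 here),
# not at either of the two valid answers.
best = candidates[np.argmin(expected_mse)]
```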
- Mean (of abs)
- Max, Min
- Argmax (of abs)
- Can recall the melody and self-generate it with only the conditional input. Can generate a reasonable vocal for some segments, but most of the generated output is mumbling.
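The conditional features listed above (mean of abs, max, min, argmax of abs) could be computed per frame as in this sketch. It assumes each frame is a real-valued vector and that the statistics are taken along the frame axis; the notes do not say what the features are computed over, and the function name is hypothetical.

```python
import numpy as np


def conditional_features(frames):
    """frames: real array (time, n). Returns (time, 4): mean|x|, max, min, argmax|x|."""
    frames = np.asarray(frames, dtype=np.float64)
    magnitude = np.abs(frames)
    return np.stack(
        [
            magnitude.mean(axis=1),   # mean of absolute values
            frames.max(axis=1),       # max of the raw values
            frames.min(axis=1),       # min of the raw values
            magnitude.argmax(axis=1).astype(np.float64),  # index of largest |x|
        ],
        axis=1,
    )
```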
A more reasonable solution for pitch detection
Plain smoothing does not work well here because the detected pitch is dirty (it has a lot of high spikes) and is somewhat discrete rather than continuous. In addition, averaging can be skewed by extreme spikes.
Therefore, we can pick a window size and calculate the median or the mode in that window to avoid being affected by extreme values.
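A minimal sketch of the windowed median, assuming the pitch track is a 1-D array with one value per timestep. The window size and the edge padding are illustrative choices, not taken from the notes; the mode variant is not shown.

```python
import numpy as np


def median_smooth(pitch, window=5):
    """Sliding-window median; `window` must be odd. Edges are padded by repetition."""
    if window % 2 == 0:
        raise ValueError("window must be odd")
    pitch = np.asarray(pitch, dtype=np.float64)
    half = window // 2
    padded = np.pad(pitch, half, mode="edge")
    # One median per original timestep, centered on that timestep.
    return np.array([np.median(padded[i:i + window]) for i in range(len(pitch))])
```

Unlike a moving average, a single spike inside the window does not shift the result as long as more than half of the window holds ordinary values.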
There are still a lot of extreme spikes. This might be because the audio is noisy or because the task is inherently hard. Will look into it later. A safer way is to manually label the pitch myself and check the difference.
Lyric and Pitch
Added extra information by annotating the lyrics. The lyrics are Chinese characters (initials, finals, tones). This helps to control the output. In the pinyin2audio experiment, we can replace every character with “ba” and the model does generate the corresponding sound.
Also, I added a relative time tag for each character. For instance, if a character “ba” lasts for 4 timesteps, then we have time tags of [0, 0.25, 0.50, 0.75] for those four steps.
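The relative time tag can be computed from per-character durations as in the sketch below. The function name and the list-of-durations input format are assumptions.

```python
import numpy as np


def relative_time_tags(durations):
    """durations: timesteps per character. Returns one tag in [0, 1) per timestep."""
    # For a character lasting d steps, step k gets the tag k / d.
    tags = [np.arange(d) / d for d in durations]
    return np.concatenate(tags) if tags else np.empty(0)
```

For a single character lasting 4 timesteps, `relative_time_tags([4])` gives the tags 0, 0.25, 0.5, 0.75 from the example above.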
If we only condition the output on the lyrics, the neural network might overfit and wrongly correlate the pronunciation with the pitch. By adding pitch information, we can control the generation better.
To overcome the overfitting, we can shrink the dense layer while enlarging the embedding layer to maintain the same accuracy.
- We get a lot of noise.
- Cannot utter unseen characters.
- The time tag is sparse: too many unseen combinations.
Use a time-mask style (with ‘sigmoid’ as the activation) to handle the time tag input. There is still noise, but generation performs better.
Adding an L1 constraint to the activations can greatly reduce the noise.
L1 is better than L2 when constraining the activity.
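For reference, the two activity penalties can be written as below. The weight is a hypothetical value, since the notes do not give the coefficient, and the function names are illustrative.

```python
import numpy as np


def l1_activity_penalty(activations, weight=1e-4):
    """Sum of absolute activations; its gradient has constant magnitude."""
    return weight * np.abs(activations).sum()


def l2_activity_penalty(activations, weight=1e-4):
    """Sum of squared activations; its gradient shrinks as activations shrink."""
    return weight * np.square(activations).sum()
```

One possible reading of the result: for small activations the L1 term keeps pushing toward exactly zero, while the L2 term becomes negligible, so low-level noise survives under L2. In Keras, for example, a layer's `activity_regularizer` argument accepts `regularizers.l1(...)`; whether these experiments used Keras is not stated in the notes.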
PReLU performs better than LeakyReLU