房顶的猫 - Shaofan Lai's Blog

Introduction:

Vocal Synthesizing.

11/28/2018 22:42

Vocal extractor

Extracted the vocals from 19 songs with https://phonicmind.com/

11/29/2018 17:21

Use the standard deviation (std) to separate the background from the vocal.

11/29/2018 17:21

Found a way to calculate the frequency of a piece.

Notes:

https://stackoverflow.com/questions/36511068/computational-physics-fft-analysis/36539114#36539114
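
As a rough illustration of the idea (not necessarily the exact method from the linked answer), the frequency of a short piece can be estimated by taking the strongest bin of its FFT. The function name, the 8 kHz sample rate, and the test tone below are assumptions made for this sketch.

```python
import numpy as np

def dominant_frequency(signal, sample_rate):
    """Return the frequency (Hz) of the strongest non-DC FFT bin."""
    signal = np.asarray(signal, dtype=np.float64)
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    spectrum[0] = 0.0  # ignore the DC offset
    return float(freqs[np.argmax(spectrum)])

# Hypothetical test tone: one second of a 440 Hz sine sampled at 8 kHz.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 440.0 * t)
estimated_hz = dominant_frequency(tone, sample_rate)
```

The resolution is limited to `sample_rate / len(signal)` Hz per bin, so short pieces give coarse estimates.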

11/29/2018 17:22

Notes:

https://bideyuanli.com/p/3671
Timbre
- caused by different combinations of overtones
Overtone
- frequencies are integer multiples of the fundamental frequency
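
A small sketch of these two notes: the same fundamental with different overtone weights gives two different timbres. The amplitudes, the 220 Hz note, and the helper names are made up for illustration.

```python
import numpy as np

def harmonic_tone(f0, overtone_amps, sample_rate=8000, duration=1.0):
    """Sum a fundamental and its overtones; overtone_amps[k-1] weights k * f0."""
    t = np.arange(int(sample_rate * duration)) / sample_rate
    wave = np.zeros_like(t)
    for k, amp in enumerate(overtone_amps, start=1):
        wave += amp * np.sin(2 * np.pi * k * f0 * t)
    return wave

def harmonic_magnitudes(wave, f0, n_harmonics, sample_rate=8000):
    """Read back the amplitude found at each integer multiple of f0."""
    spectrum = np.abs(np.fft.rfft(wave)) / (len(wave) / 2)
    freqs = np.fft.rfftfreq(len(wave), d=1.0 / sample_rate)
    bins = [np.argmin(np.abs(freqs - k * f0)) for k in range(1, n_harmonics + 1)]
    return [float(spectrum[b]) for b in bins]

# Two hypothetical "instruments" playing the same 220 Hz note.
bright = harmonic_tone(220.0, [1.0, 0.8, 0.6, 0.4])
mellow = harmonic_tone(220.0, [1.0, 0.2, 0.05, 0.0])
```

Both waves have the same pitch (220 Hz), but `harmonic_magnitudes` recovers a different overtone profile for each, which is what the ear hears as a different timbre.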

12/01/2018 15:16

Deep methods

The sequential WGAN_GP with embedding failed the MNIST test.

The generator didn’t learn. The RNN learns more slowly than the decision layer, which causes all outputs in the sequence to change together.

12/04/2018 14:33

Demo song

The demo song is set to 所以来吧

12/04/2018 14:36

Discretize

Discretize the frequencies into 128 bins and only use the first 500 frequencies:

Filter out spikes larger than MAX (5000~6000)
Only take the first 512 frequencies
Convert complex128 to single-precision complex (complex64)
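
The three steps above might look roughly like this with NumPy. This is only a sketch: the threshold value, the frame length, the choice to zero out spikes (rather than clip them), and the function name are all assumptions, and the 128-bin discretization is not covered here.

```python
import numpy as np

MAX_MAGNITUDE = 5000.0  # assumed threshold; the notes give a range of 5000~6000
FIRST_FREQ = 512        # number of leading frequency bins to keep

def filter_frame(frame):
    """FFT one audio frame, drop spikes, truncate, and cast to complex64."""
    spectrum = np.fft.rfft(np.asarray(frame, dtype=np.float64))  # complex128
    spikes = np.abs(spectrum) > MAX_MAGNITUDE
    spectrum[spikes] = 0.0            # assumption: "filter out" means zeroing
    spectrum = spectrum[:FIRST_FREQ]  # keep only the first 512 bins
    return spectrum.astype(np.complex64)

rng = np.random.default_rng(0)
frame = rng.normal(size=2048)  # hypothetical frame length
features = filter_frame(frame)
```

Casting to complex64 halves the memory per frame, at the cost of some precision.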

12/05/2018 21:57

CharRNN

Char RNN:

Can reproduce the song when given the correct preceding prefix. Has the same problem as in the Deep-symphony project.

Problems:

Goes to silence quickly during self-inference.
Still vanishes when self-generated inputs are used in training.

Reasons:

The self-generated inputs from previous steps are different from the ones in the dataset.
The self-generated inputs still deviate from the correct ones.

Solutions:

Include the self-generated inputs during training gradually (proportion from 0.0 to 1.0).
Nah?
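
The first solution is close to what is usually called scheduled sampling. A minimal sketch of the mixing step, assuming a linear ramp and a per-timestep coin flip (both are my choices here, and the function names are hypothetical):

```python
import numpy as np

def self_feed_proportion(step, total_steps):
    """Linearly ramp the share of self-generated inputs from 0.0 to 1.0."""
    return min(1.0, max(0.0, step / float(total_steps)))

def mix_inputs(dataset_inputs, generated_inputs, proportion, rng):
    """Per timestep, use the model's own output with probability `proportion`."""
    dataset_inputs = np.asarray(dataset_inputs)
    generated_inputs = np.asarray(generated_inputs)
    use_generated = rng.random(dataset_inputs.shape[0]) < proportion
    # Broadcast the per-timestep choice across the feature dimension.
    return np.where(use_generated[:, None], generated_inputs, dataset_inputs)

rng = np.random.default_rng(0)
truth = np.zeros((6, 3))     # hypothetical ground-truth frames
generated = np.ones((6, 3))  # hypothetical model outputs for the same steps
start = mix_inputs(truth, generated, self_feed_proportion(0, 100), rng)
end = mix_inputs(truth, generated, self_feed_proportion(100, 100), rng)
```

At the start of training the network only sees dataset frames, and by the end it only sees its own outputs, which matches the condition it faces during self-inference.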

12/05/2018 22:28

Conditional Char RNN

Use conditional information to alleviate the averaging problem of Char RNN.

When using Char RNN, we optimize the model with the mean squared error (MSE) loss. However, MSE can cause “averaging”: the output looks like an average of multiple training examples. If the examples are images, the generated image is blurry. Since we are generating music here, the generated sound is hard to recognize. We would much rather have diverse but clean outputs, which is exactly one of the advantages that a GAN can provide.
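
A toy illustration of the averaging effect, with made-up numbers: if the same input has two equally valid targets, the prediction that minimizes MSE is their mean, which matches neither of them.

```python
import numpy as np

# Two equally likely, equally valid targets for the same input (hypothetical).
targets = np.array([-1.0, 1.0])

# Scan candidate constant predictions and measure the MSE against both targets.
candidates = np.linspace(-1.5, 1.5, 301)
mse = ((candidates[:, None] - targets[None, :]) ** 2).mean(axis=1)
best_prediction = float(candidates[np.argmin(mse)])
```

Here `best_prediction` lands at the midpoint between the two targets, the audio equivalent of a blurry image.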

Conditional information:

Mean (of abs)
Max, Min
Argmax (of abs)
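
A sketch of how these four values could be computed per frame. It assumes each frame is a real-valued vector; the function name and the ordering of the outputs are hypothetical.

```python
import numpy as np

def conditional_features(frame):
    """Summarize one real-valued frame as [mean|x|, max, min, argmax|x|]."""
    frame = np.asarray(frame, dtype=np.float64)
    magnitude = np.abs(frame)
    return np.array([
        magnitude.mean(),
        frame.max(),
        frame.min(),
        float(np.argmax(magnitude)),
    ])

frame = np.array([0.5, -2.0, 1.0, 0.0])  # hypothetical frame
summary = conditional_features(frame)
```

The argmax of the absolute value acts as a crude pitch hint when the frame holds frequency magnitudes.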

Result:

Can recall the melody and self-generate it with only the conditional input. Can generate a reasonable vocal for some segments, but most of the generated audio is mumbling.

12/06/2018 03:13

A more reasonable solution for pitch detection

Smoothing does not work well because the detected pitch is dirty (with a lot of high spikes) and is somewhat discrete rather than continuous. Plus, averaging can be skewed by the extreme spikes.

raw detection by picking the most activated frequency

Therefore, we can pick a window size and calculate the median or the mode in that window to avoid being affected by the extreme values.

Discretize to pitch space and do the voting in a sliding window

There are still a lot of extreme spikes. This might be because the audio is noisy or because the task is inherently hard. Will look into it later. A safer way is to manually label the pitch myself and check the difference.
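
The windowed voting described above can be sketched as follows. It assumes "pitch space" means MIDI semitones (A4 = 440 Hz = note 69) and uses a centered window of 5 frames; both are assumptions for this example, as are the function names.

```python
import numpy as np

def hz_to_midi(freq_hz):
    """Map positive frequencies to the nearest MIDI note number."""
    freq_hz = np.asarray(freq_hz, dtype=np.float64)
    return np.round(69 + 12 * np.log2(freq_hz / 440.0)).astype(int)

def vote_pitch(midi_notes, window=5):
    """Replace each frame's note by the most common note in a centered window."""
    midi_notes = np.asarray(midi_notes)
    half = window // 2
    voted = np.empty_like(midi_notes)
    for i in range(len(midi_notes)):
        lo, hi = max(0, i - half), min(len(midi_notes), i + half + 1)
        values, counts = np.unique(midi_notes[lo:hi], return_counts=True)
        voted[i] = values[np.argmax(counts)]  # ties go to the lower note
    return voted

# Hypothetical raw detections: a steady A4 with one spurious spike.
raw_hz = [440.0, 441.0, 439.0, 3520.0, 440.0, 442.0, 440.0]
smoothed = vote_pitch(hz_to_midi(raw_hz), window=5)
```

Unlike a moving average, the vote never produces a pitch that was not actually detected, so a single spike cannot drag its neighbors upward.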

12/13/2018 00:30

Lyric and Pitch

Added extra information by annotating the lyrics. The lyrics are Chinese characters (initials, finals, tones), which helps control the output. In the pinyin2audio experiment, we can replace every character with “ba” and the model does generate the corresponding sound.

Also, I added a relative time tag for each character. For instance, if the character “ba” lasts for 4 timesteps, then we have time tags of [0, 0.25, 0.50, 0.75] for those four steps.
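
The tagging rule from the example can be written in a few lines. The function names are hypothetical, and `durations` is assumed to be the number of timesteps each character lasts.

```python
def relative_time_tags(n_steps):
    """Return the start position of each step as a fraction of the character."""
    return [i / n_steps for i in range(n_steps)]

def tag_sequence(durations):
    """Concatenate tags for consecutive characters given their step counts."""
    tags = []
    for n_steps in durations:
        tags.extend(relative_time_tags(n_steps))
    return tags

ba_tags = relative_time_tags(4)
two_chars = tag_sequence([4, 2])
```

The tags restart at 0 for every character, so the network sees where it is inside the current character rather than inside the whole song.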

If we only condition the output on the lyrics, the neural network might overfit and wrongly correlate the pronunciation with the pitch. By adding the pitch information, we can control the generation better.

To overcome the overfitting, we can shrink the dense layer while enlarging the embedding layer to maintain the same accuracy.

Issues:

We get a lot of noise.
Cannot utter unseen characters.

Possible reason:

The time tag is sparse: too many unseen combinations.

12/13/2018 00:45

Some Global Hyperparams

SAMPLE_LENGTH and FIRST_FREQ, both in filter.py

12/13/2018 01:30

Network details

Use a time-mask style (with ‘sigmoid’ as the activation) to handle the time tag input. The noise is still there, but the generation is better.
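
One possible reading of the time mask, sketched with NumPy: the scalar time tag goes through a sigmoid layer and the result gates the other features elementwise, instead of being looked up in an embedding. The shapes, parameter names, and the gating formula are my assumptions, not the actual network code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def apply_time_mask(features, time_tags, weights, bias):
    """Gate each feature by a sigmoid of the scalar time tag.

    features:  (steps, dim) hidden features, e.g. from the lyric embedding
    time_tags: (steps,) relative positions in [0, 1)
    weights, bias: (dim,) parameters that a network would learn
    """
    gate = sigmoid(np.outer(time_tags, weights) + bias)  # (steps, dim), in (0, 1)
    return features * gate

rng = np.random.default_rng(0)
features = np.ones((4, 3))                    # hypothetical features
time_tags = np.array([0.0, 0.25, 0.5, 0.75])  # one character over 4 steps
weights = rng.normal(size=3)                  # stand-in for learned weights
masked = apply_time_mask(features, time_tags, weights, np.zeros(3))
```

Because the gate is a smooth function of the tag, nearby time tags give nearby masks, which an embedding lookup does not guarantee.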

Adding an L1 constraint to the activation can greatly reduce the noise.

L1 is better than L2 when constraining the activity.
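
A plausible explanation, shown with made-up numbers: the L1 penalty pushes every activation toward zero with the same force, while the L2 push fades as the activation gets small, so low-level noise survives under L2. The penalty weight and the activations below are hypothetical.

```python
import numpy as np

def l1_activity_penalty(activations, weight=1e-3):
    """Penalty added to the loss: weight * sum(|a|)."""
    return weight * np.abs(activations).sum()

def l2_activity_penalty(activations, weight=1e-3):
    """Penalty added to the loss: weight * sum(a^2)."""
    return weight * np.square(activations).sum()

# Hypothetical activations: mostly small "noise" plus one large "signal" unit.
activations = np.array([0.01, -0.02, 0.01, 2.0])

# Gradient of each penalty with respect to the activations.
l1_grad = 1e-3 * np.sign(activations)  # constant push toward zero
l2_grad = 2e-3 * activations           # push shrinks with the activation
```

Frameworks such as Keras expose this kind of penalty through the `activity_regularizer` argument of a layer.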

PReLU performs better than LeakyReLU

12/22/2018 18:31

test

01/01/2019 00:43

TODO/TOFIX

Context inputs
~~use range as the time tag, not just start~~
~~use masking on the time input, not embedding~~
noise case study:
- maxnorm-control on the generated fts/wave
character-based generation, not sample-based
The time tags of the lyrics can be inaccurate