My objective is to generate speech with certain characteristics - making it sound like a real person. In this talk I'm going to present Tacotron2 implemented with PyTorch. Tacotron2 uses only audio and text data to generate speech, without any further assumptions. Since no suitable German dataset was available, I'll show how I created my own dataset in a semi-automated fashion.
Computer-generated speech has existed for a while, with its parameters painfully engineered by hand. Deep learning models can be efficient at learning the inherent features of data - how well does this work out for audio?
There are different deep learning models for speech synthesis, such as WaveNet, SampleRNN and Tacotron2. After a quick overview I'm going to focus on Tacotron2 - how it works, its benefits and how to implement it with PyTorch.
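To give a feel for the text-to-speech pipeline, here is a minimal inference sketch using torchaudio's pretrained Tacotron2 bundle (English, trained on LJSpeech). This is an assumed off-the-shelf setup for illustration, not the custom German model covered in the talk; the input text and output file name are placeholders.

```python
import torch
import torchaudio

# Pretrained Tacotron2 + WaveRNN vocoder bundle (English, LJSpeech, character input).
bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_CHAR_LJSPEECH
processor = bundle.get_text_processor()    # text -> token IDs
tacotron2 = bundle.get_tacotron2().eval()  # token IDs -> mel spectrogram
vocoder = bundle.get_vocoder().eval()      # mel spectrogram -> waveform

text = "Hello, this is a Tacotron2 demo."  # placeholder input
with torch.inference_mode():
    tokens, token_lengths = processor(text)
    spec, spec_lengths, _ = tacotron2.infer(tokens, token_lengths)
    waveforms, wave_lengths = vocoder(spec, spec_lengths)

# Save the first (and only) waveform in the batch.
torchaudio.save("demo.wav", waveforms[0:1].cpu(), sample_rate=vocoder.sample_rate)
```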
With Tacotron2 we make no assumptions about which features should be passed to the vocoder. All that is required are audio snippets and corresponding text. Non-English audio datasets are hard to come by, so I had to generate my own. This talk will also cover how I created that dataset semi-automatically and efficiently with tools like auditok and methods such as speaker diarisation.
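As an illustration of the semi-automatic segmentation step, here is a minimal sketch using auditok to split a long recording into utterance-sized snippets at silences. The file names and thresholds are assumptions and would need tuning for the actual recordings.

```python
import auditok

# Split a long recording into utterance-sized snippets based on silence.
regions = auditok.split(
    "recording.wav",      # hypothetical input recording
    min_dur=1.0,          # drop snippets shorter than 1 s
    max_dur=10.0,         # force a split after 10 s
    max_silence=0.3,      # split when silence exceeds 0.3 s
    energy_threshold=55,  # energy below which audio counts as silence
)

for i, region in enumerate(regions):
    # Save each snippet; transcripts are then matched to these files.
    filename = region.save(f"snippet_{i:04d}.wav")
    print(f"{filename}: {region.meta.start:.2f}s - {region.meta.end:.2f}s")
```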
The talk will feature synthesised speech audio demos. I will also cover some failures and reason about them.