Linguistics-fag here. The reason machines can't parse language easily is that language is just incredibly complex. There are dozens of subtle variations for many linguistic elements, many of them context-dependent.
Take the /r/ phoneme, for example. Although this is just the single American English "r" phoneme, it alters noticeably across every unique context - we just have mental generalizations that make it sound like the same thing.
sore
cross
ross
race
erase
all have a unique version of the /r/ phoneme. We have roughly 40 unique phonemes in American English, most of which will condition the /r/ and make it unique. Whether the /r/ sits at the start, middle, or end of a word also makes it unique.
So you end up with literally hundreds of physically distinct /r/ variants.
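To make the "hundreds" claim concrete, here's a back-of-the-envelope count. The inventory size and context categories are illustrative assumptions, not a real phoneme set:

```python
# Rough count of distinct conditioning environments for one phoneme
# like /r/. All numbers here are illustrative stand-ins, not a real
# American English inventory.

N_PHONEMES = 40          # approximate size of the phoneme inventory
BOUNDARY = 1             # a word edge also counts as a "neighbor"

positions = ["initial", "medial", "final"]

# each environment = (preceding sound or boundary, position in word)
left_contexts = N_PHONEMES + BOUNDARY
environments = left_contexts * len(positions)

print(environments)      # 123 - already over a hundred before you
                         # even add right-hand context, stress, or rate
```

Add the following phoneme as a third factor and the count multiplies again, which is where "literal hundreds" comes from.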
Now, this is just phoneme conditioning. Beyond it, you have tone. You have socio-economic factors. You have specific context (is it a movie, a play, a job interview?). You have age and gender factors. Then there are regional factors: "American English" is a very broad class with dozens of regional variants, each with its own socio-economic variations.
Language naturally does these things. You can't simplify them away. The reason TTS sounds weird to us is that it just uses a database of unconditioned phonemes.
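A minimal sketch of that difference, assuming a toy concatenative setup (the unit names and "database" are hypothetical stand-ins for real audio clips):

```python
# Naive concatenative TTS: each phoneme maps to ONE stored clip,
# ignoring context - so the same /r/ clip gets reused everywhere.
units = {
    "r": "r_generic.wav",
    "ey": "ey_generic.wav",
    "s": "s_generic.wav",
}

def naive_tts(phonemes):
    # identical /r/ unit in 'race', 'erase', 'cross', ...
    return [units[p] for p in phonemes]

# A context-aware system would instead key units on the triple
# (previous sound, phoneme, next sound), i.e. a triphone.
def triphone_keys(phonemes):
    padded = ["#"] + phonemes + ["#"]   # '#' marks a word boundary
    return [(padded[i - 1], p, padded[i + 1])
            for i, p in enumerate(phonemes, start=1)]

print(naive_tts(["r", "ey", "s"]))      # one flat clip per phoneme
print(triphone_keys(["r", "ey", "s"]))  # context-conditioned keys
```

The naive lookup is why every /r/ comes out identical; the triphone keys show how fast the required database grows once you condition on context.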
We simply don't know enough about our own ability to mentally generalize all these variations to instruct a machine to do the same. OP's github link is trained on very specific phrases and datasets, as he's pointed out. As far as I'm concerned, that's essentially just a copy-pasting technique with extra steps.
TL;DR: language is very fucked up and hard to make human-like. The best we have so far are painstakingly developed databases.