/tts/ text to speech

We've made great progress on voice recognition so far, but audible feedback from machines is still lacking. Considering some are already falling for the robot waifu meme, we should start on a waifu AI first, or at least something that resembles one.
Free to use, there's IBM's TTS demo
text-to-speech-demo.ng.bluemix.net
which is pretty OK, even good compared to most alternatives.
More interesting, though, I found this
github.com/r9y9/wavenet_vocoder
An open-sourced meme learning solution which already has some superb results as examples. Unfortunately it's all quite slow / requires prerecording every answer.

This thread is here to discuss the various approaches to TTS.

Attached: robowaifu.png (1471x1010, 1.34M)

Other urls found in this thread:

github.com/MycroftAI/mimic-recording-studio
mycroft.ai/

>sheila [red vs blue]

You seem to be confused by this board's name, "technology". We don't do that. Better go to /diy/.

>open-sourced meme learning solution
I mean, is there any other way to implement TTS if not by machine learning?

At first glance, likely not, though the difference is that these projects generate their voices completely. We used to copy an existing person's voice and edit it to our liking; see for example the Vocaloid project.

Android 18 is a cyborg.

>see for example the Vocaloid project
VOCALOID is a concatenation-based engine; machine learning has nothing to do with it. If you want similar software that actually relies on machine learning, check out CeVIO Creative Studio. I've heard Synth-V also relies on machine learning, but I don't know to what extent.

>VOCALOID is a concatenation-based engine; machine learning has nothing to do with it.
When did I claim it did?
Anyway, I hadn't heard of CeVIO until now. It looks interesting, but my end goal would still be some kind of "on the fly" method that still sounds reasonable. If we're going the route of actual software recommendations, I'd like to add open source to the bonus list.

My bad, I misread your post.
CeVIO CS's engine is Sinsy, which is open source. Unfortunately, CeVIO CS itself is not, which I find to be a pity. There are also some engines for UTAU that are open source, such as world4utau (abbreviated w4u). All other software that comes to mind is closed source (Alter/Ego, Chipspeech, UTAU itself, and the very new DeepVocal, which I haven't tried yet).

Well, if you're just jumping around like me right now, try the WaveNet vocoder I linked above. Out of the projects I found on GitHub, it best combines being easy to set up with getting good results.

Honestly, it's an okay start, but way better and more open is Mycroft. mycroft.ai is a fully open source voice assistant engine. It's in development and will soon have standardized ways to add a voice pack (i.e. get someone to read all these words and sounds, and it becomes the new voice option, so waifu away!) among other neat features. All without the Google, Alexa, Siri spyware server garbage.

One thing I liked about my 2008 MBP was the built-in speech recognition and TTS technology. Keep in mind this was before Siri, and nobody really used it, but you could program simple scripts into any MacBook and have it respond to recognized voice input. I was able to annoy the shit out of my roommates while I was away by leaving my computer on overnight, where it would start creepily laughing in response to any subtle noise.

I don't see this shit built into any Linux distributions. Any solution has to be internet-based like Google Cloud Speech, which is fucking gay. Does anyone have experience trying to mess around with speech on Linux? What would be the easiest solution if I wanted to dick around like I did in college?

Mycroft is pretty based. I tried to run it on the Raspberry Pi in my car, but it's not powerful enough.

Oh, that's what I'm already working with. Dunno how much you're into the current development, but my first goal is to move to local solutions for the speech-to-text decoding. I'd say the guys are pretty great, but I've got a local server lab, so why not use that? After that I want to jump fully into the TTS problem. As you said, they're eager to implement options for more voice packs, but even then someone has to create them first. Admittedly, I expect the weaboo crowd to be faster than the rest there, though.
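For the local STT part I'm poking at pocketsphinx first, since it runs entirely offline. Rough sketch of the quickstart, assuming the Python pocketsphinx package and its bundled default US English model (accuracy is nowhere near Google's, but nothing leaves the machine):

# offline speech-to-text sketch using the pocketsphinx Python package
# (pip install pocketsphinx); uses the bundled default US English model
from pocketsphinx import LiveSpeech

# LiveSpeech reads from the default microphone and yields one
# decoded hypothesis per detected utterance
for phrase in LiveSpeech():
    print(phrase)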

Linux doesn't have that by default, as it would be completely against the "do one thing well" rule. Depending on what you actually want to accomplish, you may want to look into Mycroft. It's easy to set up and can easily be configured to react to anything interpretable as a sentence.
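For the creepy-laughing use case from your MBP days, a custom skill is only a few lines. A minimal sketch, assuming the current mycroft-core skill API; the skill name and the laugh.intent file (which would just list trigger phrases) are made up for illustration:

# minimal Mycroft skill sketch (assumes the mycroft-core skill API;
# "laugh.intent" is a hypothetical intent file listing trigger phrases)
from mycroft import MycroftSkill, intent_file_handler


class CreepyLaughSkill(MycroftSkill):
    @intent_file_handler('laugh.intent')
    def handle_laugh(self, message):
        # spoken through whichever TTS backend Mycroft is configured with
        self.speak('ha ha ha')


def create_skill():
    # entry point mycroft-core looks for when loading the skill
    return CreepyLaughSkill()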

I'm gonna go on a tangent and ask if anyone's got the source on that Aigis pic in the OP.

Have you tried cropping and searching with that?

You can easily make a TTS in an afternoon without ML. You simply record a snippet of yourself saying each phoneme, then use PocketSphinx to convert words to phonemes and play each related sound file. Best value you can get for your time with TTS.
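Something along these lines; just a sketch, assuming you've recorded one WAV per phoneme (named after the CMUdict symbols) and do the word-to-phoneme step as a plain lookup in a CMUdict-format dictionary, e.g. the cmudict-en-us.dict that ships with PocketSphinx. All paths and filenames are placeholders:

# naive concatenative TTS sketch: look each word up in a CMUdict-format
# dictionary and play one prerecorded WAV per phoneme via ALSA's aplay.
# 'cmudict-en-us.dict' and the 'my_voice' folder are placeholder names.
import subprocess

def load_dict(path):
    # parse "word PH ON EMES" lines into {word: [phonemes]}
    pron = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            parts = line.split()
            if parts:
                pron[parts[0].lower()] = parts[1:]
    return pron

def speak(text, pron, voice_dir='my_voice'):
    for word in text.lower().split():
        for ph in pron.get(word, []):
            # one recording per phoneme, e.g. my_voice/AH0.wav
            subprocess.run(['aplay', f'{voice_dir}/{ph}.wav'])

pron = load_dict('cmudict-en-us.dict')
speak('hello world', pron)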

I went down this rabbit hole last year, and the furthest I got was that Mycroft open-sourced their Mimic Recording Studio and I was going to pay a voice actor to record a 15k-word corpus to build a model with. Just lost interest.
github.com/MycroftAI/mimic-recording-studio

Attached: 1563677556306.jpg (475x318, 23K)

Anything I can use from F-Droid?

Also, I thought mycroft.ai/ were going to work on an offline TTS to replace the Google TTS they currently use.
Or did they lie?
I know they made it modular to allow different TTS backends, but if the default one isn't even fully offline and it's using Google's stuff, why the fug do I even care if the rest of it is FOSS?

Search for the AT&T Natural Voices "Paul" voice.
It's by far the best.

No, that's still the plan/currently happening.

>going to pay a voice actor to create a 15k word corpus
Unironically how much would that have cost?

Attached: 1561657263030.jpg (392x309, 141K)

I did some VERY rough math, but here's an idea.

Average words spoken per minute is around 125, but for reading a corpus you can probably cut that down to 60 because it's more difficult to read.

So that's 60*60, which would mean ~3,600 words spoken per hour. At that speed the 15k corpus will take just over 4 hours, so let's say 5.

A voice actor will probably cost anywhere from $20 to $100 per hour.

Rough final cost: anywhere from $100 to $500.
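Same napkin math in code form, in case anyone wants to tweak the assumptions (every number here is the guess from above, nothing measured):

# redo of the napkin math above; all numbers are guesses, not measurements
corpus_words = 15_000
words_per_minute = 60                    # slowed-down reading pace
words_per_hour = words_per_minute * 60   # ~3,600
hours = corpus_words / words_per_hour    # ~4.2
hours_padded = 5                         # round up for breaks and retakes
rate_low, rate_high = 20, 100            # $/hour for a voice actor

print(f"{hours:.1f} h raw, budget {hours_padded} h")
print(f"${hours_padded * rate_low} - ${hours_padded * rate_high}")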

Yeah, I was aiming for around $500. The recording studio software can easily be self-hosted and makes recording a corpus easy. It breaks a text up into segments, like speed reading, giving you the ability to record and listen to your reading of the sample before continuing to the next segment. 15k is actually fairly small for building a model; it's just the minimum recommended by the Mimic devs. You could pay multiple people to expand the corpus, and possibly even pay per segment added.

I was planning on paying the upfront costs, then putting the corpus behind a Gumroad suggested donation to make my money back. If you want to crowdfund it, throw out a link and I'm sure people here will contribute toward the end goal.

Attached: cheers.gif (348x323, 1.12M)

Side question: have there been attempts to make a language that's easily parsable by machines but also pleasant to speak?

Linguistics-fag here. The reason machines can't parse language easily is that language is just incredibly complex. There are dozens of subtle variations for many linguistic elements, many of which are context-dependent.

Take the /r/ phoneme, for example. Although this is just the single American English "r" phoneme, it alters noticeably across every unique context - we just have mental generalizations that make it sound like the same thing.

so
cross
ross
race
erase

all have a unique version of the /r/ phoneme. We have ~40ish different unique phonemes in American English, most of which will condition the /r/ and make it unique. Also, whether the /r/ is situated at the start, the middle, or the end of a word will make it unique.

So you end up with literally hundreds of physically variant /r/s.

Now this is just the phoneme conditioning. Beyond this, you have tones. You have socio-economic factors. You have specific context (is it a movie, a play, a job interview?), and you have age and gender factors. Then there are regional factors. "American English" is a very broad class; there are dozens of regional variants, all with their own socio-economic variations.


Language naturally does these things. You can't simplify them away. The reason TTS sounds weird to us is that it just uses a database of unconditioned phonemes.
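You can see the "unconditioned" part for yourself: a pronunciation dictionary collapses all of those physically different sounds into one symbol. Quick look-up sketch, assuming the pronouncing package (a small wrapper around CMUdict):

# every word below comes back with the same bare "R" symbol from CMUdict,
# even though the realized sound differs per word (pip install pronouncing)
import pronouncing

for word in ['cross', 'ross', 'race', 'erase']:
    # e.g. 'cross' -> ['K R AO1 S']; the dictionary carries no information
    # about which contextual variant of /r/ a speaker would actually produce
    print(word, pronouncing.phones_for_word(word))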

We simply don't know enough about our own ability to mentally generalize all these variations to be able to instruct a machine to do the same. OP's GitHub link is trained on very specific phrases and datasets, as he's pointed out. As far as I'm concerned, that's essentially just a copy-pasting technique with extra steps.


TL;DR: language is very fucked up and hard to make human-like. The best we have so far is just painstakingly developed databases.

Attached: boxxy.jpg (480x360, 8K)

Thinking of it like this, that's actually pretty reasonable. My first thought was more like $3-4k, what with getting the voice actor in front of a proper mic, getting all the rights to reuse the recordings and all that jazz. But maybe that's just me being used to idiotic industry rates by now.

I'd probably take the middle road then: have them speak the minimum set of commands so at least those sound as natural as possible, then feed everything to ML for future texts.

You mean someone inventing a whole new language just for machines? Why would anyone do that, if we already know time will fix it anyway?

Attached: a.jpg (800x452, 56K)