Text-to-speech Models Suck

2024-04-16

Created by Brom using Udio AI

Lyrics

[Audience cheers] Okay, okay... Okay, so you know how there's all this buzz around A.I. voice generation? Like, text-to-speech, voice cloning, yada yada yada... And all those apps coming out telling you that [sarcastic tone] "we got the most realistic one". [Audience laughs] I don't think I'm gonna have to drop names here, you know which ones I'm talking about. [Audience laughs] You know who's NOT telling you they have the best text-to-speech? Udio. And I know what you're gonna say, "hey Kenny, that's not a text-to-speech model". [Audience laughs] [firmly] Yes it is. [Audience cheers] It is a text-to-speech model, if you WANT it to be. And if you don't mind the whole "audience laughing in the background" stuff [Audience laughs] Exactly. But I'm not even kidding, it's PHENOMENAL. People tell me "wow, Kenny, did you try out the new Eleven Labs text-to-speech model?" Oh, oops, there went the name-drop anyway I guess. [Audience laughs] And yeah, I play around with it for three seconds and am immediately bored. [Audience laughs] [yelling] CAUSE IT CAN'T YELL WHEN I WANT IT TO. It just refuses to do emotion. Nothing realistic about that, if you ask me. [Audience cheers] And then you find this random model that generates music. And it's like "wow, so A.I. can do emotions after all". So yeah, apparently a random music model does text-to-speech better than the best text-to-speech model. [Audience laughs] What kinda timeline is this? [Audience cheers]