Speech synthesis, the use of computers to generate realistic human speech, is rapidly entering the ‘uncanny valley’ – creepily almost-realistic.
Recent approaches have used neural networks, trained using only speech examples and text transcripts, to generate human-like speech from text. In the example embedded above, the late Christopher George Latore Wallace (May 21, 1972 – March 9, 1997), aka The Notorious B.I.G., raps The Book of Genesis.
The ‘voice’ was computer-generated, using a text-to-speech model trained on the speech patterns of The Notorious B.I.G. In a nutshell, the approach uses an AI to ‘learn’ how audio of an individual’s speech corresponds to a text transcript of it. Once trained, the model can synthesize speech from new text that conforms to the ‘learned’ speech patterns.
The Vocal Synthesis channel on YouTube features a wide range of examples that demonstrate what’s currently possible. The introduction to the channel features the voices of six presidents:
The video of Bill Clinton reading Baby Got Back by Sir Mix-A-Lot, embedded below, highlights both the entertaining and troubling potential of the technology:
“In an evaluation where we asked human listeners to rate the naturalness of the generated speech, we obtained a score that was comparable to that of professional recordings.
While our samples sound great, there are still some difficult problems to be tackled. For example, our system has difficulties pronouncing complex words (such as “decorum” and “merlot”), and in extreme cases it can even randomly generate strange noises. Also, our system cannot yet generate audio in realtime. Furthermore, we cannot yet control the generated speech, such as directing it to sound happy or sad. Each of these is an interesting research problem on its own.”
What Will Biggie Rap Next?
Biggie has been gone for more than two decades, but this technology is being used to make him rap from beyond the grave. The results are realistic enough that artists and agents are taking notice.
For example, in April 2020, YouTube temporarily took down this video of Jay-Z rapping the To Be, Or Not To Be soliloquy from Hamlet:
Jay-Z’s agency, Roc Nation LLC, claimed that the video “unlawfully uses an AI to impersonate our client’s voice.”
The legality of unauthorized speech synthesis is still to be determined, and could depend on how the audio is used. Some uses may fall into the category of fair use and others may infringe on copyright or an artist’s publicity rights.
The anonymous creator of the Vocal Synthesis channel recognizes that the channel is in ‘legally uncharted waters’, but suggests that it is important to explore the creative possibilities of AI-generated speech:
“I wanted to show that synthetic media doesn’t have to be exclusively made for malicious/evil purposes, and I think there’s currently massive amounts of untapped potential in terms of fun/entertaining uses of the technology.
I think the scariness of deepfakes and synthetic media is being overblown by the media, and I’m not at all convinced that the net impact will be negative, so I hoped that my channel could be a counterexample to that narrative.”
Check out the examples and share your thoughts on AI speech synthesis, and its creative possibilities, in the comments!