Microsoft Unveiled the VALL-E Model, an Audio AI That Can Mimic Any Voice From 3-Second Prompts
10 Jan, 2023
Microsoft researchers have announced VALL-E, a new artificial intelligence (AI) model that can closely mimic a person's voice when given a three-second audio sample. Microsoft describes VALL-E as a "neural codec language model"; it builds on EnCodec, the neural audio codec that Meta revealed in October 2022. From that short sample, VALL-E can synthesize audio of the same person saying anything, while attempting to retain the speaker's emotional tone. The technology can be used for a variety of applications, from high-quality text-to-speech to audio content creation.
VALL-E analyzes how a person sounds, uses EnCodec to break that information down into discrete components (referred to as "tokens"), and then draws on its training data to estimate how that voice would sound if it spoke phrases beyond the three-second sample. To train VALL-E's speech synthesis capabilities, Microsoft used Meta's LibriLight audio library, which contains 60,000 hours of English-language speech from more than 7,000 speakers.
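To make the idea of "tokens" more concrete, here is a minimal Python sketch of turning audio samples into discrete codes and back. It is only an illustration under stated assumptions: EnCodec uses a learned neural encoder and quantizer, whereas this toy uses a fixed, evenly spaced table of amplitude levels, and the names `make_codebook`, `encode`, and `decode` are hypothetical rather than part of any Microsoft or Meta API.

```python
import math

def make_codebook(levels: int) -> list[float]:
    """Evenly spaced amplitudes in [-1, 1]; a stand-in for a learned codebook."""
    return [-1.0 + 2.0 * i / (levels - 1) for i in range(levels)]

def encode(samples: list[float], codebook: list[float]) -> list[int]:
    """Replace each sample with the index of its nearest codebook entry (a token)."""
    return [
        min(range(len(codebook)), key=lambda i: abs(codebook[i] - s))
        for s in samples
    ]

def decode(tokens: list[int], codebook: list[float]) -> list[float]:
    """Look each token up in the codebook to get an approximate waveform back."""
    return [codebook[t] for t in tokens]

# A few samples of a 440 Hz sine at 8 kHz, standing in for recorded speech.
sample_rate = 8000
wave = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(16)]

codebook = make_codebook(16)
tokens = encode(wave, codebook)      # a list of small integers
restored = decode(tokens, codebook)  # approximate reconstruction
max_error = max(abs(a - b) for a, b in zip(wave, restored))
```

The point of the sketch is that once audio is a sequence of integers, a language model can treat it much like text. The reconstruction is lossy, which is why real codecs rely on much richer, learned codebooks.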
Microsoft has showcased dozens of audio examples of VALL-E in action on its website, allowing users to compare the three-second "Speaker Prompt" sample with the "Ground Truth" (a previously recorded version of the same speaker saying a specific phrase), the "Baseline" (generated by a traditional text-to-speech synthesis method), and the "VALL-E" sample (generated by the VALL-E model). Some VALL-E results are accurate enough that they are hard to distinguish from human speech.
Given a text string and an acoustic prompt, the model generates discrete audio codec codes and synthesizes speech from them, rather than manipulating waveforms directly as other text-to-speech methods do. Producing voices that could be mistaken for human speech is the model's goal.
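The conditioning described above can be pictured as a language model that reads the text tokens and the prompt's codec codes, then emits new codec codes one at a time. The Python sketch below shows only that loop. Everything in it is an assumption made for illustration: `next_code` is a hypothetical placeholder arithmetic rule, not a trained network, and the vocabulary size of 1024 and the function names are not taken from Microsoft's code, which has not been released.

```python
def next_code(context: list[int], vocab_size: int) -> int:
    """Hypothetical placeholder for a trained model's next-token prediction.

    A real system would run a neural network over the context; this
    deterministic rule only keeps the example runnable.
    """
    return (sum(context) * 31 + len(context)) % vocab_size

def generate_codes(
    text_tokens: list[int],
    prompt_codes: list[int],
    steps: int,
    vocab_size: int = 1024,  # assumed codebook size, for illustration only
) -> list[int]:
    """Autoregressively extend the sequence: text, then prompt, then new codes."""
    context = list(text_tokens) + list(prompt_codes)
    generated = []
    for _ in range(steps):
        code = next_code(context, vocab_size)
        generated.append(code)
        context.append(code)  # each new code conditions the next prediction
    return generated

# Toy inputs: integers standing in for text tokens and for the codec codes
# extracted from the three-second speaker prompt.
codes = generate_codes(text_tokens=[1, 2, 3], prompt_codes=[10, 20], steps=8)
```

In a full system, the generated codes would then be passed to the codec's decoder to produce a waveform; that step is omitted here.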
VALL-E offers many potential applications, including speech editing, in which a recording of a person could be edited and altered from a text transcript, and audio content creation. VALL-E could revolutionize how we communicate and how we create content. For example, it could be used to create videos with lifelike narrators or to generate more realistic audio for virtual assistants.
However, because the model could be misused to deceive people, Microsoft has not made the VALL-E code available for others to explore. The researchers acknowledge the potential social harm that this technology may cause.