VALL-E AI simulates your voice in just 3 seconds!

Microsoft has developed VALL-E, an artificial intelligence (AI) capable of simulating a voice from a sample of just three seconds. Some of the demonstrations are very convincing. The company understands the danger of such a tool falling into the wrong hands.

For more on this news, listen to the Vitamine Tech podcast, in which Emma Hollin explains in detail how VALL-E works. © Futura

After the photo or video “deepfake,” will we see the arrival of the audio “deepfake”? It is possible, since Microsoft has unveiled a new text-to-speech artificial intelligence (AI) model called VALL-E. Its distinguishing feature? It can mimic a person’s voice from a simple three-second audio sample. Once it has learned a specific voice, this AI can synthesize speech in that person’s voice, preserving its timbre and emotion.

Microsoft believes that VALL-E can be used for speech synthesis applications, but also, and this is obviously more of a concern, to edit the speech in a recording. It would be possible to edit and modify the audio starting from a text transcript of the speech. Imagine a politician’s speech altered by this artificial intelligence.

Machine learning in action

For the company, VALL-E is what it calls a “neural codec language model,” and it is based on the audio compression technology named EnCodec, unveiled by Meta (Facebook) last October. Unlike other speech synthesis methods, which usually synthesize speech by manipulating waveforms, VALL-E generates discrete audio codec codes from text and audio samples. It basically analyzes a person’s voice, breaks that information down into discrete components (tokens) using EnCodec, and uses what it learned in training to extend the three-second sample to new speech in that voice.
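To give a rough idea of what "breaking audio down into tokens" means, here is a minimal sketch in Python. It is not EnCodec and not Microsoft's code: EnCodec uses a neural network and a far more elaborate quantizer. The hand-written codebook, the two-sample frames and every function name below are assumptions made purely for illustration. Each short slice of the signal is replaced by the index of the closest entry in a small table, so the sound becomes a sequence of integers that a language model could work with.

```python
# Toy illustration only: a hand-written codebook, NOT EnCodec's learned one.

def frame_signal(samples, frame_size):
    """Split samples into consecutive non-overlapping frames, dropping any remainder."""
    return [samples[i:i + frame_size]
            for i in range(0, len(samples) - frame_size + 1, frame_size)]

def nearest_code(frame, codebook):
    """Return the index of the codebook entry closest to the frame (squared distance)."""
    def dist(code):
        return sum((a - b) ** 2 for a, b in zip(frame, code))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

def tokenize(samples, codebook):
    """Turn a waveform (list of floats) into a list of integer tokens."""
    frame_size = len(codebook[0])
    return [nearest_code(f, codebook) for f in frame_signal(samples, frame_size)]

def detokenize(tokens, codebook):
    """Rebuild an approximate waveform by concatenating the chosen codebook entries."""
    out = []
    for t in tokens:
        out.extend(codebook[t])
    return out

# Hypothetical codebook: 4 entries, each describing a 2-sample frame.
CODEBOOK = [[0.0, 0.0], [0.5, 0.5], [-0.5, -0.5], [1.0, -1.0]]

samples = [0.1, -0.1, 0.4, 0.6, -0.6, -0.4, 0.9, -0.8]
tokens = tokenize(samples, CODEBOOK)
print(tokens)                        # [0, 1, 2, 3]
print(detokenize(tokens, CODEBOOK))  # approximate reconstruction of the input
```

The reconstruction is only approximate, which is the trade-off of any such scheme: the signal is reduced to a small vocabulary of symbols, and the quality depends on how well the codebook covers real sounds.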

For this, Microsoft relied on the LibriLight audio library. It contains 60,000 hours of English speech from more than 7,000 speakers, most of it taken from public domain LibriVox audiobooks. For VALL-E to produce a convincing result, the voice in the three-second sample must closely match a voice in the training data.

I must do something about it.

Example. © VALL-E

Microsoft is aware of the danger

To convince you, Microsoft offers dozens of audio examples of the AI model in action. Some are eerily close to the original voice, but others are clearly artificial, and the human ear can tell that they come from an AI. What is impressive is that in addition to preserving the tone and emotion of the person speaking, VALL-E is able to reproduce the recording environment and conditions. Microsoft takes the example of a phone call, with the acoustic and frequency characteristics of this type of conversation.

On the question of the dangers of such an artificial intelligence, Microsoft confirmed that the source code is not available, and the company says it is aware of the risks: “This can carry potential risks of misuse of the model, such as spoofing voice identification or impersonating a specific speaker. To mitigate these risks, it is possible to build a detection model to determine whether an audio clip was synthesized by VALL-E. We will also put Microsoft AI Principles into practice when further developing the models.”