MIT researchers have developed a neural network, named Speech2Face, that can predict a person's facial features based on their voice.
Abstract
Researchers from MIT have created a neural network called Speech2Face that can generate a visual representation of a person's face based on their voice. The goal was not to create an exact portrait but to capture the dominant facial traits linked to a person's speech. The researchers collected a massive dataset from YouTube, which had millions of video clips featuring over 100,000 different speakers. They then trained their neural network to recognize voice patterns and turn them into facial features. The software can be used in various fields, including forensic analysis, humanizing AI voices, and creating customized avatars in multiplayer games.
Bullet points
MIT researchers have developed a neural network, named Speech2Face, that can predict a person's facial features based on their voice.
The goal was not to create an exact portrait but to capture the dominant facial traits linked to a person's speech.
The researchers collected a massive dataset from YouTube, which had millions of video clips featuring over 100,000 different speakers.
The neural network was trained to recognize voice patterns and turn them into facial features.
The software can be used in various fields, including forensic analysis, humanizing AI voices, and creating customized avatars in multiplayer games.
The software can also be used to create animated avatars for social media based on a person's voice.
The software can help bring a more personal and visual touch to our digital lives.
Vocal Portraits — How AI Creates a Face Out of a Slice of Your Speech
Ever wondered if you could picture someone’s face just by hearing their voice?
We’ve felt the vibe, but turning that into a visual seems like a stretch.
Listen to this voice.
What does this person look like based on the audio alone?
It’s tough but guess what? Scientists from MIT cracked the code.
Let’s look at what they did and by the end of this article, you’ll also discover the person behind this intriguing voice.
Voice-activated faces
Researchers from MIT tackled the abovementioned question and they created a neural network, named Speech2Face, that could take a vocal snapshot to predict what that person’s face might look like.
But the goal was not to create an exact portrait but to capture the dominant facial traits linked to a person’s speech.
How did they do this?
First, they collected a massive dataset from YouTube called AVSpeech, which had millions of video clips featuring over 100,000 different speakers. Then, they created their own neural network, their virtual brain, that was trained to recognize voice patterns and turn them into facial features.
Their neural network would break down the voice audio fragment into an essential face sketch and then use these low-resolution features to draw a more defined neutral face.
The researchers wanted to show that voices carry a ton of info about a person like age, gender, and ethnicity. And the results are amazing, just look at these images:
The left image is taken from the video, cropped around the person’s face. It’s the actual image from the video.
The middle image is a reconstructed face that normalizes lighting and frontally faces the person.
The right image is the output from the scientist’s neural network which extracts an image from people’s speech audio.
Isn’t it mind-boggling?
Remember, the idea was not to reconstruct a precise image but to recover key facial features from people’s voices. A voice-face matchmaker of sorts.
How useful is this?
Let’s unpack the potential applications of this software in several areas.
Think about the implications for forensic analysis. Law enforcement agencies could use Speech2Face to reconstruct faces from audio recordings, to help identify a suspect. This could be a game-changer in criminal investigations, enhancing our much-needed public safety.
This software could be used to humanize AI voices and create a more relatable and user-friendly experience. If you see the face of the person behind that friendly voice, it might guide you better through your day.
In multiplayer games, Speech2Face could turn character customization into a totally different experience. The in-game avatar won’t just be a random mix of features but a visual representation of the player’s voice. It doesn’t have to be an exact portrait, but it can help create a tailored character that still has some of the player’s unique features
In a less serious tone, this might be the next-gen avatar for social media. People’s voices could be translated into a cartoon avatar or some other related creature (e.g. a monster with some of your key features). It’s not just about selfies anymore, now your voice can have its own animated presence.
Any other applications you might suggest?
As you can see, there’s a potential to reshape how we interact with technology and bring a more personal and visual touch to our digital lives.
Final thoughts
We can now figure out how someone looks just by hearing their voice.
It doesn’t generate an exact replica but it helps bring out key features to give it a distinctive face. It’s still impressive tech.
And this tech can redefine how we visualize voices in the future in many different areas. It’s just a matter of time to make it even more accurate.
On a final note, did you figure out whose face matches the voice you heard at the start? Here’s your answer:
Join my newsletter to explore everything related to music and sound, from the effects on our psyche to the technologies that use sound to improve our lives.