How To Turn Your Voice Into Any Instrument

If you’re impatient and just looking for my code and audio, here you go!

The complexity we see with any text-to-speech (or “audio deepfake”) project comes from nuances in spoken language. Not only do we all have individual voices, we also have unique ways of saying our names, our friends’ names, and basically any word. Microsoft’s claim that they can simulate your voice with only 3 seconds of audio is true, but its accuracy comes from averaging out the way most people say most words, while factoring in some elements of your vocal timbre and pronunciation.

As someone with only a basic understanding of linguistics, projects like Microsoft’s VALL-E fascinate me. Audio processing sounds like fun until you need to actually train and deploy a machine learning model, and any attempt I’ve made thus far to build my own language model has fallen a little short. So I decided to forget about language for now and focus on instrument sounds instead!

I’ve been a pianist for most of my life and have taught myself guitar and percussion along the way. The complexity of an instrument voice is a little easier to understand — making it a lot easier to explain to a computer.

Which of these spectrograms comes from a drum loop, and which comes from a speaker in a podcast? How might quantization help you find the answer?

My goal for this project was to replicate Google Magenta’s Tone Transfer while allowing for more flexible outputs. That is, instead of being limited to turning a voice into 4 instruments, I wanted to turn my voice into any instrument for which I had WAV files.

If deepfakes scare you, remember that we’re just making music. You can always skip. (Source: Google Magenta)

Method & Quick Results

I completed this project in Python, starting with the librosa module to analyze the timbre of WAV files. By timbre, I mean the textural qualities that make instrument (or human) voices unique.
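To make "timbre" a little more concrete, here is a minimal sketch (not the notebook's actual code) of one way to summarize it: average the magnitude spectrum of a recording, then read off how strong each harmonic of the fundamental is. The helper names `harmonic_profile` and `average_spectrum` are made up for illustration, and the librosa import is kept inside the function so the pure-Python part runs without it.

```python
def harmonic_profile(freqs_hz, amps, f0_hz, n_harmonics=5):
    """Relative strength of the first few harmonics of f0 (a crude timbre fingerprint)."""
    profile = []
    for k in range(1, n_harmonics + 1):
        target = k * f0_hz
        # Pick the frequency bin closest to the k-th harmonic.
        idx = min(range(len(freqs_hz)), key=lambda i: abs(freqs_hz[i] - target))
        profile.append(float(amps[idx]))
    peak = max(profile) or 1.0  # avoid dividing by zero on a silent clip
    return [a / peak for a in profile]


def average_spectrum(wav_path, n_fft=2048):
    """Time-averaged magnitude spectrum of a WAV file (requires numpy and librosa)."""
    import numpy as np
    import librosa

    y, sr = librosa.load(wav_path, sr=None)        # keep the file's native sample rate
    mags = np.abs(librosa.stft(y, n_fft=n_fft))    # shape: (1 + n_fft // 2, n_frames)
    freqs = librosa.fft_frequencies(sr=sr, n_fft=n_fft)
    return freqs, mags.mean(axis=1)


# Toy spectrum: 100 Hz bins with energy at the 1st, 2nd and 3rd harmonic of 100 Hz.
toy_freqs = [0, 100, 200, 300, 400, 500]
toy_amps = [0.0, 1.0, 0.2, 0.5, 0.0, 0.0]
toy_profile = harmonic_profile(toy_freqs, toy_amps, f0_hz=100, n_harmonics=3)
```

Two instruments playing the same note share a fundamental but differ in this profile, which is roughly what "textural qualities" means in spectral terms.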

Mathematically, I needed to replicate the amplitude for each frequency in a stack of frequency bins, across multiple clarinet samples for better accuracy. So I built a scikit-learn RandomForestRegressor model to read in frequency bins and return the amplitudes to replicate a clarinet voice. Training the model took around one minute per second of audio, and around twice that time when testing in Google Cloud.
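As a rough illustration of the "frequency bins in, amplitudes out" idea, the sketch below fits a RandomForestRegressor on (bin frequency, amplitude) pairs pooled from several clips. Everything specific here is an assumption rather than the notebook's setup: the clips are synthetic harmonic tones standing in for clarinet WAV files, NumPy's FFT stands in for librosa, the only feature is the bin's center frequency, and the hyperparameters are arbitrary.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

SR = 8000  # sample rate in Hz, kept small so the sketch runs quickly


def harmonic_tone(f0_hz, seconds=0.5, weights=(1.0, 0.1, 0.5, 0.05, 0.3)):
    """Synthetic stand-in for an instrument sample: a few weighted harmonics of f0."""
    t = np.arange(int(SR * seconds)) / SR
    return sum(w * np.sin(2 * np.pi * f0_hz * (k + 1) * t) for k, w in enumerate(weights))


def spectrum(clip):
    """Bin center frequencies (Hz) and magnitudes for one whole clip."""
    windowed = clip * np.hanning(len(clip))
    amps = np.abs(np.fft.rfft(windowed)) / len(clip)
    freqs = np.fft.rfftfreq(len(clip), d=1.0 / SR)
    return freqs, amps


def fit_timbre_model(clips):
    """Pool (frequency, amplitude) pairs from every clip and fit one forest on them."""
    X_parts, y_parts = [], []
    for clip in clips:
        freqs, amps = spectrum(clip)
        X_parts.append(freqs.reshape(-1, 1))  # one feature: the bin's frequency
        y_parts.append(amps)
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(np.vstack(X_parts), np.concatenate(y_parts))
    return model


# Three noisy takes of the same 220 Hz note play the role of "multiple samples".
rng = np.random.default_rng(0)
clips = [harmonic_tone(220.0) + 0.01 * rng.standard_normal(4000) for _ in range(3)]
model = fit_timbre_model(clips)

# Ask the model for the amplitude at a harmonic (220 Hz) and between harmonics (300 Hz).
on_harmonic, off_harmonic = model.predict(np.array([[220.0], [300.0]]))
```

Pooling several takes means the forest averages over them at each bin, which is one way to read "across multiple clarinet samples for better accuracy."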

Once the model was trained on a couple of clarinet samples and deployed on a fresh audio sample of me speaking, I needed to shift the deepfaked clarinet audio to match my pitch. Again, I used librosa, first to estimate the fundamental frequency (f0) of my voice and then, with its pitch_shift function, to shift the clarinet’s pitch to match mine.
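The pitch-matching step might look something like the sketch below. It is an assumption-heavy illustration, not the notebook's code: it uses `librosa.pyin` as the f0 estimator (the post does not say which one was used), takes the median f0 of each clip, and applies a single shift to the whole instrument clip. The file paths passed to `match_pitch` are hypothetical, and librosa is imported inside the function so the arithmetic helper works on its own.

```python
import math


def semitone_offset(source_hz, target_hz):
    """Semitones needed to move a pitch from source_hz to target_hz."""
    if source_hz <= 0 or target_hz <= 0:
        raise ValueError("frequencies must be positive")
    return 12.0 * math.log2(target_hz / source_hz)


def match_pitch(instrument_wav, voice_wav):
    """Shift the instrument clip so its median f0 lands on the voice's median f0."""
    import numpy as np
    import librosa

    voice, sr = librosa.load(voice_wav, sr=None)
    instrument, _ = librosa.load(instrument_wav, sr=sr)  # resample to the voice's rate

    fmin, fmax = librosa.note_to_hz("C2"), librosa.note_to_hz("C7")
    # pyin returns NaN for unvoiced frames, so nanmedian ignores silence and noise.
    voice_f0, _, _ = librosa.pyin(voice, fmin=fmin, fmax=fmax, sr=sr)
    inst_f0, _, _ = librosa.pyin(instrument, fmin=fmin, fmax=fmax, sr=sr)

    n_steps = semitone_offset(np.nanmedian(inst_f0), np.nanmedian(voice_f0))
    shifted = librosa.effects.pitch_shift(instrument, sr=sr, n_steps=n_steps)
    return shifted, sr


# An octave up is 12 semitones; a perfect fifth (3:2) is about 7.02.
octave = semitone_offset(220.0, 440.0)
fifth = semitone_offset(220.0, 330.0)
```

Because only one shift is applied per clip, the output holds a steady pitch even when the voice drifts.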

Please refer to this Colab notebook to hear me saying a word and then a clarinet “playing” it the same way. You may notice that my voice dips while the clarinet pitch remains consistent — this is intentional! As shown in the spectrograms above, our voices have a lot more flexibility than physical instruments, and modern producers will tell you they prefer to quantize their music so that everything “locks” or “snaps” onto the grid evenly.
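If "the grid" is read as the equal-tempered pitch grid, quantizing a pitch just means rounding it to the nearest semitone. The sketch below shows that arithmetic under two assumptions that are not from the notebook: A4 is tuned to 440 Hz, and each pitch is snapped independently.

```python
import math

A4_HZ = 440.0  # assumed tuning reference


def quantize_hz(freq_hz, a4=A4_HZ):
    """Snap a frequency to the nearest equal-tempered semitone."""
    semitones_from_a4 = round(12 * math.log2(freq_hz / a4))
    return a4 * 2 ** (semitones_from_a4 / 12)


# A voice wobbling around A4: slightly flat and slightly sharp values snap to 440 Hz,
# while 455 Hz is closer to the next semitone up and snaps there instead.
snapped = [quantize_hz(f) for f in (438.0, 450.0, 455.0)]
```

A sliding voice passed through this function turns into a staircase of discrete notes, which is the "locked" sound described above.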

Since the folks at Google Magenta seem more anti-quantization, and instead allow for sounds to slide more flexibly, I’m currently taking it a step further to match their efforts. Even though the possibilities for a clarinet are rather discrete (step-by-step), I’d like to explore the possibilities of fun continuous slides.

Stay tuned for Part 2, and thanks for following!