Summary

This article introduces Mel Spectrograms, a human-like audio representation used in machine learning, and Mel Frequency Cepstral Coefficients (MFCCs) for audio processing.

Abstract

The article begins by explaining the Mel Scale, a logarithmic transformation of a signal's frequency that mimics human perception of sound. It then introduces Mel Spectrograms, a type of spectrogram that visualizes sounds on the Mel scale, and discusses their utility in machine learning applications. The article also covers MFCCs, originally used in speech processing, but now also used in Music Information Retrieval (MIR) to represent timbre. The article concludes by summarizing the concepts learned and mentioning future topics in MIR.

Opinions

The Mel Scale is fundamental in machine learning applications to audio as it mimics human-like perception of sound.
Developing Mel Spectrograms is easy once the intuition of spectrograms is established.
MFCCs can represent timbre well and are used in various speech processing techniques, such as automatic speech recognition and denoising audio.
The number of MFCCs used is a hyperparameter of the model and will vary based on the problem.
Leveraging Mel Spectrograms is a fantastic way to process audio for machine learning and deep learning problems.
In future articles, the author will dive deeper into Music Information Retrieval (MIR) using the bases established in this article.

Learning from Audio: The Mel Scale, Mel Spectrograms, and Mel Frequency Cepstral Coefficients

Breaking down the intuition for human-like audio representations

By now, we have developed a stronger intuition as to what spectrograms are, and how to create them. Simply put, spectrograms allow us to visualize audio and the pressure these sound waves create, thus allowing us to see the shape and form of the recorded sound.

The main aim of this article is to introduce a new flavor of spectrograms — one that is widely used in the Machine Learning space as it represents human-like perception very well.

As always, if you would like to view the code, as well as the files needed to follow along, you can find everything on my GitHub.

Let’s first start by importing all our necessary packages.

Note: all images were created by the author.

Mel Scale

Before discussing Mel Spectrograms, we first need to understand what the Mel Scale is and why it is useful. The Mel Scale is a logarithmic transformation of a signal’s frequency. The core idea of this transformation is that sounds of equal distance on the Mel Scale are perceived to be of equal distance to humans. What does this mean?

For example, most human beings can easily tell the difference between a 100 Hz and 200 Hz sound. However, by that same token, we should assume that we can tell the difference between 1000 and 1100 Hz, right? Wrong.

It is actually much harder for humans to differentiate between higher frequencies than between lower ones. So, even though the distance between the two sets of sounds is the same, our perception of the distance is not. This is what makes the Mel Scale fundamental in Machine Learning applications to audio, as it mimics our own perception of sound.

The transformation from the Hertz scale to the Mel Scale is the following:

$$m = 1127 \ln\left(1 + \frac{f}{700}\right)$$

where f is the frequency in Hertz and m is the corresponding value in Mels.

Note that ln here denotes the natural logarithm. If the logarithm were of base 10, the equation’s coefficient would change from 1127 to roughly 2595 (that is, 1127 × ln 10). However, in this article, we will simply refer to the equation stated above.
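To make the formula concrete, here is a minimal sketch of the conversion in plain Python. It assumes the common form m = 1127 · ln(1 + f / 700); the function names hz_to_mel and mel_to_hz are illustrative choices of mine, not taken from a particular library. It also revisits the 100 Hz versus 200 Hz and 1000 Hz versus 1100 Hz example from earlier.

```python
import math


def hz_to_mel(f_hz: float) -> float:
    """Convert Hertz to Mels: m = 1127 * ln(1 + f / 700)."""
    return 1127.0 * math.log1p(f_hz / 700.0)


def mel_to_hz(m: float) -> float:
    """Invert the mapping: f = 700 * (exp(m / 1127) - 1)."""
    return 700.0 * math.expm1(m / 1127.0)


# The same 100 Hz gap covers fewer Mels as the frequency rises.
low_gap = hz_to_mel(200) - hz_to_mel(100)
high_gap = hz_to_mel(1100) - hz_to_mel(1000)
print(f"100 -> 200 Hz:   {low_gap:.1f} Mels")
print(f"1000 -> 1100 Hz: {high_gap:.1f} Mels")
```

The low-frequency pair ends up noticeably farther apart in Mels than the high-frequency pair, which matches how we hear the two intervals.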

Let’s visualize the relationship between Hertz and Mels:

As we can see from the graph above, frequencies that are lower in Hz have a larger distance between them in Mels, whereas frequencies that are higher in Hz have a smaller distance between them in Mels, reinforcing the scale’s human-like properties.
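The same relationship can be checked numerically. The sketch below walks up the Hertz axis in equal 1000 Hz steps and prints how many Mels each step covers. The step size and the 8000 Hz upper limit are arbitrary choices for illustration, and hz_to_mel is the same assumed helper built on the 1127 · ln(1 + f / 700) formula.

```python
import math


def hz_to_mel(f_hz: float) -> float:
    """Convert Hertz to Mels: m = 1127 * ln(1 + f / 700)."""
    return 1127.0 * math.log1p(f_hz / 700.0)


step_hz = 1000
edges_hz = list(range(0, 8001, step_hz))  # 0, 1000, ..., 8000 Hz
mels = [hz_to_mel(f) for f in edges_hz]

# Width in Mels of each consecutive 1000 Hz step.
mel_steps = [hi - lo for lo, hi in zip(mels, mels[1:])]

for f, width in zip(edges_hz, mel_steps):
    print(f"{f:>5} to {f + step_hz:>5} Hz spans {width:7.1f} Mels")
```

Every step is the same width in Hertz, yet each one covers fewer Mels than the one before it.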

Now that we have a good understanding of the Mel Scale’s utility, let’s use this intuition to develop Mel Spectrograms.

Mel Spectrograms

Mel Spectrograms are spectrograms that visualize sounds on the Mel scale as opposed to the linear Hertz scale we saw previously. Now, I know what you are thinking: is it really that simple? Yes, it is.

As soon as the intuition of spectrograms is established, learning the various flavors of them becomes very easy. All that is required is the new framework in which we develop our spectrograms. I will assume that you know the underlying properties of how this is done. Developing Mel Spectrograms is even easier than their definition.

Here’s how we do it:
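For intuition about what happens under the hood, here is a rough from-scratch sketch in NumPy. It is an illustration under my own assumptions rather than a reference implementation: the helper names, the frame size (1024), hop (256), and number of Mel bands (40) are hypothetical choices, and the input is a synthetic 440 Hz tone standing in for a recording. The idea is to take a power spectrogram and pool its FFT bins into triangular bands whose centers are equally spaced in Mels.

```python
import numpy as np


def hz_to_mel(f):
    return 1127.0 * np.log1p(f / 700.0)


def mel_to_hz(m):
    return 700.0 * np.expm1(m / 1127.0)


def mel_filterbank(sr, n_fft, n_mels):
    """Triangular filters whose centers are equally spaced in Mels."""
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)  # Hz of each FFT bin
    edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2), n_mels + 2))
    bank = np.zeros((n_mels, fft_freqs.size))
    for i in range(n_mels):
        left, center, right = edges_hz[i:i + 3]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank


def mel_spectrogram(y, sr, n_fft=1024, hop=256, n_mels=40):
    """Power spectrogram projected onto Mel bands, returned in decibels."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop
    frames = np.stack(
        [y[t * hop:t * hop + n_fft] * window for t in range(n_frames)]
    )
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2         # (frames, bins)
    mel_power = mel_filterbank(sr, n_fft, n_mels) @ power.T  # (n_mels, frames)
    return 10.0 * np.log10(np.maximum(mel_power, 1e-10))     # floor avoids log(0)


# Synthetic stand-in for a recording: one second of a 440 Hz tone.
sr = 22050
t = np.arange(sr) / sr
S_db = mel_spectrogram(np.sin(2 * np.pi * 440 * t), sr)
print(S_db.shape)  # (n_mels, n_frames)
```

In practice a library call such as librosa.feature.melspectrogram followed by librosa.power_to_db covers the same steps, so a hand-rolled version like this is mainly useful for seeing where the Mel Scale enters the pipeline.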

Recall the envelope mask from wave forms. This function creates a mask to develop an envelope that trims out unnecessary dead noise. This allows us to focus on the significant portions of the audio.
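As a reminder of the idea, one simple way to build such a mask is to smooth the absolute amplitude with a rolling mean and keep only the samples where that envelope exceeds a threshold. The sketch below is an assumption about how this could look, not the exact function from the earlier article; the name envelope_mask, the 0.01 threshold, and the 50 ms window are hypothetical values chosen for illustration.

```python
import numpy as np


def envelope_mask(y, sr, threshold=0.01, window_s=0.05):
    """Boolean mask: True where the smoothed |amplitude| exceeds threshold."""
    win = max(1, int(sr * window_s))
    kernel = np.ones(win) / win
    envelope = np.convolve(np.abs(y), kernel, mode="same")  # rolling mean of |y|
    return envelope > threshold


# Synthetic stand-in: half a second of silence, a tone, then silence again.
sr = 8000
t = np.arange(sr // 2) / sr
tone = 0.5 * np.sin(2 * np.pi * 440 * t)
silence = np.zeros(sr // 2)
y = np.concatenate([silence, tone, silence])

mask = envelope_mask(y, sr)
trimmed = y[mask]  # keeps the tone plus a short margin on each side
print(len(y), len(trimmed))
```

The threshold and window length would need tuning per recording, since a value that removes dead air in one file can clip quiet tails in another.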

Similar to our results in spectrograms, we can see how each sound takes a unique shape based off of the sound it actually produces.

The guitar (which is longer in length than the kick and snare) resonates outwards more than the other studied sounds. Intuitively, this makes sense: when one plays the guitar, the strummed strings keep vibrating after being played, which is what this resonating structure portrays. The kick drum has a quite low and immediate sound; you can think of it as a sort of thump. The snare is quite high in frequency and, while it resonates slightly outward (and more so upward), dissipates quicker than the other sounds.

Mel Frequency Cepstral Coefficients

Mel Frequency Cepstral Coefficients (MFCCs) were originally used in various speech processing techniques. However, as the field of Music Information Retrieval (MIR) developed alongside Machine Learning, it was found that MFCCs could represent timbre quite well.

The basic procedure to develop MFCCs is the following:

Convert from Hertz to the Mel Scale
Take the logarithm of the Mel representation of the audio
Apply the Discrete Cosine Transform to the log-magnitude values
The result is a spectrum over Mel frequencies as opposed to time, thus creating MFCCs
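The last two steps can be sketched in a few lines of NumPy. The code below assumes we already have a log-scaled Mel spectrogram of shape (n_mels, n_frames) and applies an orthonormal DCT-II across the Mel bands of each frame. The function names are illustrative, and the input is random data standing in for a real log-Mel spectrogram, so the resulting numbers carry no meaning beyond showing the shapes.

```python
import numpy as np


def dct_ii_matrix(n_out, n_in):
    """Orthonormal DCT-II basis: each row is a cosine pattern over n_in bands."""
    k = np.arange(n_out)[:, None]
    n = np.arange(n_in)[None, :]
    basis = np.cos(np.pi * k * (2 * n + 1) / (2 * n_in)) * np.sqrt(2.0 / n_in)
    basis[0] /= np.sqrt(2.0)  # extra scaling for the constant (k = 0) row
    return basis


def mfcc_from_log_mel(log_mel, n_mfcc=20):
    """log_mel: (n_mels, n_frames) -> MFCCs: (n_mfcc, n_frames)."""
    return dct_ii_matrix(n_mfcc, log_mel.shape[0]) @ log_mel


# Random stand-in for a log-Mel spectrogram with 40 bands and 83 frames.
rng = np.random.default_rng(0)
log_mel = rng.normal(size=(40, 83))
mfccs = mfcc_from_log_mel(log_mel, n_mfcc=20)
print(mfccs.shape)  # (n_mfcc, n_frames)
```

Keeping only the first few coefficients retains the broad shape of the spectrum across Mel bands and discards fine detail, which is why the number of coefficients acts as a tuning knob. In librosa, librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20) wraps the whole procedure.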

If the ML problem warrants the use of MFCCs, as in automatic speech recognition or audio denoising, the number of coefficients used is a hyperparameter of the model. Because of this, the number of MFCCs will vary based on the problem. However, for this example, we will use librosa’s default of 20 MFCCs. In librosa, we can do all of this and visualize the output in just a few lines of code:

Conclusion

As a wrap-up for this article, you have now learned:

What the Mel Scale is and how it plays a role in human-like interpretation of audio
How to map the Mel Scale onto spectrograms
What MFCCs are, certain use cases of MFCCs, and how to develop them

Leveraging Mel Spectrograms is a fantastic way to process audio such that various Deep Learning and Machine Learning problems can learn from the recorded sounds.