Making Sense of Audio Features with Librosa — Part 3: Spectrograms

In Part 2 of this series, we took our first step into the world of Fourier Transforms. We learned how these mathematical tools allow us to break down complex audio signals into their underlying frequency components. From the Discrete Fourier Transform (DFT) to the Fast Fourier Transform (FFT), we explored how these techniques enable us to analyze the spectral content of sound. We also introduced the Short-Time Fourier Transform (STFT), which adds a time dimension to our frequency analysis, making it particularly useful for non-stationary signals like music and speech.

You can read the second blog here:

Making Sense of Audio Features with Librosa- Part 2: Fourier Transform


But understanding frequencies alone is only part of the picture. A Fourier Transform taken over a whole signal reveals which frequencies are present, but not when they occur. Audio, especially music or speech, constantly evolves, and the shifts in pitch, rhythm, or timbre can be crucial to understanding its structure. To capture those changes, we need a way to observe the frequency content as it develops over time.

That’s where spectrograms come in. Spectrograms provide a powerful way to visualize sound, combining both time and frequency in a single, intuitive image. In this part, we’ll explore how spectrograms allow us to see the intricate patterns in audio signals as they unfold over time. Using Librosa and Python, we’ll create standard and Mel spectrograms and then derive MFCCs from them, to get a clearer picture of how sound behaves across both domains. This will give us the tools we need to analyze and interpret audio data with even greater precision and depth.

What is a Spectrogram?

A spectrogram is like a picture of sound. It shows how the different frequencies in a piece of audio change over time. Imagine you’re looking at a song not just by hearing it but by seeing the notes and beats visually — this is what a spectrogram allows us to do.

Components of a Spectrogram

Time (x-axis): This represents the flow of sound over time, like the length of a song or a spoken sentence. As you move from left to right on the spectrogram, you see the sound as it happens over time.
Frequency (y-axis): This represents the pitch of the sound. Low frequencies (at the bottom) are deeper sounds, like a bass guitar or a drum, while high frequencies (at the top) represent sharper sounds, like a flute or someone singing a high note.
Magnitude or Amplitude (color intensity): The brightness or darkness of the colors shows how strong or loud each frequency is at any given time. Brighter areas mean those frequencies are louder, and darker areas mean they’re quieter.

How Spectrograms Are Used in Audio Analysis

In audio analysis, spectrograms are incredibly helpful because they let us see the sound and its changes. For example, in speech, a spectrogram can show where different words and sounds start and stop, or how the pitch of someone’s voice rises and falls. In music, it can show the rhythm, melody, and how instruments play together.

By using a spectrogram, we can analyze sounds in a way that’s much clearer than just looking at a waveform or listening to the audio alone. Whether for detecting patterns in music, recognizing spoken words, or identifying specific sound events, spectrograms are a crucial tool.

Short-Time Fourier Transform (STFT)

What is STFT?

The Short-Time Fourier Transform (STFT) is a way to break down an audio signal into smaller pieces and analyze the frequency content in each piece. Think of it like taking snapshots of the sound at regular intervals and then figuring out the pitch or frequency for each snapshot. This lets us see how the sound changes over time, instead of just analyzing the whole sound at once.

While a basic Fourier Transform shows the overall frequency content of a signal, it doesn’t tell us when certain frequencies occur. The STFT solves this by splitting the audio into short, overlapping segments (or windows) and transforming each one, allowing us to see the frequencies over time.

How STFT Relates to Spectrograms

A spectrogram is what you get when you apply the STFT across an entire audio signal and keep the magnitude (or power) of each result. Each short window of the signal is analyzed, and its frequency content is displayed as a small slice of the spectrogram. When you stack these slices side by side, you get a visual map of how the frequency content of the audio changes over time. This combination of time and frequency is what makes the spectrogram such a powerful tool for analyzing audio.

In summary:

STFT tells us what frequencies are present at each moment in time.
The spectrogram visualizes that information, showing how the sound’s frequency content changes over time.
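To make the "snapshots" idea concrete, here is a minimal NumPy-only sketch of an STFT. It is not how Librosa implements it; the function name `naive_stft`, the window size, the hop size, and the synthetic two-tone signal are all assumptions chosen for illustration.

```python
import numpy as np

def naive_stft(signal, n_fft=1024, hop_length=256):
    """Magnitude STFT: slice into overlapping windowed frames, FFT each one."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop_length
    frames = np.stack([
        signal[i * hop_length : i * hop_length + n_fft] * window
        for i in range(n_frames)
    ])
    # rfft keeps the non-negative frequencies: n_fft // 2 + 1 bins per frame
    return np.abs(np.fft.rfft(frames, axis=1)).T  # shape: (bins, frames)

# Synthetic test signal: 1 s of 440 Hz followed by 1 s of 880 Hz
sr = 8000
t = np.arange(sr) / sr
signal = np.concatenate([np.sin(2 * np.pi * 440 * t), np.sin(2 * np.pi * 880 * t)])

S = naive_stft(signal)
freqs = np.fft.rfftfreq(1024, d=1 / sr)   # frequency of each bin in Hz

first_peak = freqs[S[:, 0].argmax()]      # dominant frequency in the first frame
last_peak = freqs[S[:, -1].argmax()]      # dominant frequency in the last frame
print(S.shape, first_peak, last_peak)
```

Each column of `S` is one snapshot. The dominant frequency moves from roughly 440 Hz in the first column to roughly 880 Hz in the last, which is exactly the kind of change a single Fourier Transform over the whole signal would hide.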

Generating a Basic Spectrogram Using STFT in Librosa

import librosa
import librosa.display
import numpy as np
import matplotlib.pyplot as plt

# Load an example audio file (Librosa includes some built-in examples)
y, sr = librosa.load(librosa.example('trumpet'))

# Perform Short-Time Fourier Transform (STFT)
D = librosa.stft(y)

# Convert the magnitude to decibels for better visualization
S_db = librosa.amplitude_to_db(np.abs(D), ref=np.max)

# Plot the spectrogram
plt.figure(figsize=(10, 6))
librosa.display.specshow(S_db, sr=sr, x_axis='time', y_axis='log')
plt.colorbar(format='%+2.0f dB')
plt.title('Spectrogram')
plt.show()
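The decibel step in the listing above is worth unpacking, since it determines what the colors mean. The sketch below reproduces the idea by hand with NumPy: scale by the loudest value, take 20·log10, and floor everything more than 80 dB below the peak. The function name `amplitude_to_db_sketch` and the sample magnitudes are made up for illustration, and the defaults are assumptions meant to mirror a typical setup rather than a statement of Librosa's internals.

```python
import numpy as np

def amplitude_to_db_sketch(magnitude, top_db=80.0, amin=1e-5):
    """20 * log10(magnitude / max), floored at top_db below the peak."""
    magnitude = np.maximum(np.abs(magnitude), amin)   # avoid log10(0)
    db = 20.0 * np.log10(magnitude / magnitude.max())
    return np.maximum(db, -top_db)

# Hypothetical magnitudes: full scale, half, a tenth, a thousandth, silence
magnitudes = np.array([1.0, 0.5, 0.1, 0.001, 0.0])
db_values = amplitude_to_db_sketch(magnitudes)
print(db_values)
```

Halving the amplitude costs about 6 dB and dividing it by ten costs 20 dB, so the log scale lets quiet harmonics stay visible next to loud ones instead of being crushed to black.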

Visualizing and Interpreting the Spectrogram

Once you generate the spectrogram, you’ll see a plot with time on the x-axis, frequency on the y-axis, and the intensity (or loudness) of the frequencies represented by color. Here’s how to interpret the spectrogram:

Time (x-axis): Shows how the sound progresses over time.
Frequency (y-axis): Shows the range of frequencies in the audio. Lower frequencies (deep sounds) are at the bottom, and higher frequencies (sharp sounds) are at the top.
Magnitude/Amplitude (color intensity): The brighter the color, the louder that particular frequency is at a given time. Darker areas represent quieter frequencies.

For example, in music, you might see repeating patterns for rhythmic elements like drums, and harmonic structures for instruments playing notes. In speech, you might notice changing patterns that correspond to different spoken phonemes.

Mel Spectrograms

What is the Mel Scale, and Why is It Used?

The Mel scale is a way of measuring frequency that reflects how humans perceive sound. Our ears are not equally sensitive to all frequencies; we’re much better at hearing differences in lower frequencies (like bass sounds) than in higher frequencies (like high-pitched notes). The Mel scale adjusts for this by spacing frequencies in a way that matches human hearing.

For example:

A change from 100 Hz to 200 Hz sounds much more noticeable to us than a change from 1,000 Hz to 1,100 Hz, even though both are 100 Hz apart.
The Mel scale compresses higher frequencies and spreads out lower ones, so it better matches how we perceive pitch.
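To put numbers on that example, here is a small sketch using one common Mel formula, the HTK-style m = 2595 · log10(1 + f / 700). This choice is an assumption: Librosa's `hz_to_mel` uses a different (Slaney-style) variant by default and only applies this formula when called with `htk=True`. The helper name `hz_to_mel_htk` is made up for the example.

```python
import numpy as np

def hz_to_mel_htk(freq_hz):
    """HTK-style Mel formula: m = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(freq_hz, dtype=float) / 700.0)

low_step = hz_to_mel_htk(200) - hz_to_mel_htk(100)      # 100 Hz step near the bottom
high_step = hz_to_mel_htk(1100) - hz_to_mel_htk(1000)   # same 100 Hz step higher up
print(round(float(low_step), 1), round(float(high_step), 1))
```

Under this formula the low step covers roughly twice as many Mels as the high step, even though both are 100 Hz wide in physical terms.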

How a Mel Spectrogram Differs from a Regular Spectrogram

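A regular (STFT) spectrogram has linearly spaced frequency bins, meaning every row covers the same number of hertz. (The earlier plot used y_axis='log' only to stretch the display; the underlying bins are still linear.) This is useful for technical analysis but doesn’t align well with how we hear sound.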

A Mel spectrogram converts the frequency axis to the Mel scale, making the representation more aligned with human perception. This is especially important in applications like speech recognition or music analysis, where it’s essential to mimic how humans interpret sound.

The key difference:

Regular spectrogram: Frequencies are plotted linearly.
Mel spectrogram: Frequencies are scaled according to human hearing, with more emphasis on lower frequencies and compressed higher frequencies.

Code Implementation of Mel Spectrograms Using Librosa

import librosa
import librosa.display
import numpy as np
import matplotlib.pyplot as plt

# Load an example audio file
y, sr = librosa.load(librosa.example('trumpet'))

# Compute the Mel spectrogram
S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128, fmax=8000)

Here:

n_mels=128 means we’re dividing the frequency range into 128 bands spaced evenly on the Mel scale.
fmax=8000 sets the top of the highest band to 8,000 Hz, so content above that is ignored; this is a common choice for speech and music analysis.

# melspectrogram returns a power spectrogram by default (power=2.0),
# so convert to decibels with power_to_db rather than amplitude_to_db
S_db = librosa.power_to_db(S, ref=np.max)

# Plot the Mel spectrogram
plt.figure(figsize=(10, 6))
librosa.display.specshow(S_db, sr=sr, x_axis='time', y_axis='mel', fmax=8000)
plt.colorbar(format='%+2.0f dB')
plt.title('Mel Spectrogram')
plt.show()

Visualizing and Interpreting Mel Spectrograms

When visualizing the Mel spectrogram, the x-axis still represents time, but the y-axis now represents Mel frequency rather than a linear frequency scale. Lower frequencies (e.g., bass sounds) are stretched out, while higher frequencies (e.g., high-pitched sounds) are compressed, reflecting how we perceive sound.

Time (x-axis): Same as a regular spectrogram, showing how sound changes over time.
Mel Frequency (y-axis): Frequencies are distributed on the Mel scale, giving more emphasis to frequencies we’re more sensitive to.
Magnitude (color intensity): As with other spectrograms, brighter areas indicate louder sounds, while darker areas indicate quieter ones.

Why Use a Mel Spectrogram?

Mel spectrograms are especially useful in:

Speech recognition: Because they represent sound in a way that’s similar to how humans hear it.
Music analysis: They help highlight the low frequencies, which are crucial in many music genres.
Audio classification: In tasks like genre classification or sound event detection, the Mel spectrogram’s focus on human hearing often leads to better performance.

MFCC (Mel Frequency Cepstral Coefficients)

Introduction to MFCCs

Mel Frequency Cepstral Coefficients (MFCCs) are one of the most widely used features in audio analysis, particularly in tasks like speech recognition and music genre classification. The reason MFCCs are so popular is that they effectively represent the shape of the sound spectrum in a way that closely resembles how the human ear processes sound.

In essence, MFCCs capture the most important information from the Mel spectrogram but reduce it into a more compact and meaningful form. This makes MFCCs ideal for distinguishing between different types of sounds, such as different speakers, instruments, or even environmental noises.

How MFCCs Are Derived from the Mel Spectrogram

MFCCs are calculated by taking several steps beyond the Mel spectrogram. Here’s a high-level overview of the process:

Compute the Mel spectrogram: This represents the sound’s frequency content on the Mel scale, as we discussed earlier.
Apply the logarithm: This compresses the dynamic range of the signal, mimicking the way our ears perceive loudness.
Perform a Discrete Cosine Transform (DCT): The DCT decorrelates the log Mel band energies and compresses the data, keeping only the most important features. The result is a small set of coefficients (usually 12–13) that capture the essence of the sound’s spectrum.

The result is a set of MFCCs that represent the overall shape of the sound spectrum, making it easier to compare and classify different sounds.

Code Implementation of MFCCs Using Librosa

import librosa
import librosa.display
import matplotlib.pyplot as plt

# Load an example audio file
y, sr = librosa.load(librosa.example('trumpet'))

# Compute the MFCCs
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)

The n_mfcc parameter specifies the number of MFCCs to compute (usually between 12 and 20).

# Visualize the MFCCs
plt.figure(figsize=(10, 6))
librosa.display.specshow(mfccs, sr=sr, x_axis='time')
plt.colorbar()
plt.title('MFCC')
plt.show()

Interpretation of MFCC Plots

When visualizing MFCCs, each row in the plot represents one of the MFCC coefficients, and the x-axis represents time. The color intensity in the plot represents the strength of each coefficient at each moment in time.

Here’s how to interpret the MFCC plot:

Rows (MFCC coefficients): Each row corresponds to a different coefficient. The first coefficient largely tracks overall energy (loudness), the next few capture the broad shape of the spectrum, and the higher coefficients capture finer spectral detail (e.g., subtle variations in timbre).
Time (x-axis): Like with spectrograms, the x-axis shows how the audio evolves.
Magnitude (color intensity): The color shows the value of each coefficient at a given time. Coefficients can be positive or negative, so check the colorbar rather than assuming that brighter always means stronger.

The most important information about the sound is typically captured in the lower MFCC coefficients, making MFCCs a compact and efficient representation of the sound’s spectral shape.
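One way to see why the lower coefficients matter most is to throw the others away and invert the DCT. The sketch below does this on a single made-up log-Mel frame: a broad spectral bump with a fast ripple on top. The frame, the ripple, and the cut-off of 13 coefficients are all hypothetical values chosen for illustration.

```python
import numpy as np
from scipy.fft import dct, idct

n_bands = 128
band = np.arange(n_bands)

# Hypothetical log-Mel frame: a broad spectral bump plus fast band-to-band ripple
envelope = -60.0 + 40.0 * np.exp(-((band - 64) / 20.0) ** 2)
ripple = 3.0 * np.cos(np.pi * (2 * band + 1) * 50 / (2 * n_bands))
frame = envelope + ripple

coeffs = dct(frame, type=2, norm="ortho")

# Keep only the first 13 coefficients and zero the rest before inverting
truncated = np.zeros_like(coeffs)
truncated[:13] = coeffs[:13]
smooth = idct(truncated, type=2, norm="ortho")

def rms(x):
    return float(np.sqrt(np.mean(x ** 2)))

print("frame vs. envelope: ", rms(frame - envelope))
print("smooth vs. envelope:", rms(smooth - envelope))
```

The reconstruction from 13 coefficients follows the broad envelope closely and drops the ripple, which is the sense in which the low-order coefficients describe the overall spectral shape.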

Why Use MFCCs?

MFCCs are highly effective for tasks like:

Speech recognition: Because they focus on the spectral shape, which is essential for identifying different speech sounds.
Music genre classification: MFCCs help differentiate between different styles of music by analyzing the timbre and tonal quality.
Audio classification: MFCCs make it easier to distinguish between different sound sources, such as instruments or environmental sounds.

When to Use Each Spectrogram Type

Regular Spectrogram: Use when you need detailed frequency information or are working on tasks requiring precise technical analysis of the sound, such as identifying specific frequency patterns in noise reduction or musical tones.
Mel Spectrogram: Use when analyzing audio for tasks that involve human perception, such as speech recognition, music analysis, or any machine learning tasks where the model needs to understand sound the way humans hear it. This is also a good choice when you need a more compact representation but still want frequency information.
MFCCs: Use when you want to extract meaningful features from audio for machine learning tasks like classification or recognition. MFCCs are ideal for projects where computational efficiency is crucial, as they distill the sound’s characteristics into a small set of coefficients.

Applications of Spectrograms

Spectrograms are an essential tool in various fields of audio processing because they provide a clear visual representation of how sound evolves over time. By breaking down the frequency components of sound, spectrograms allow us to analyze complex audio data in ways that simple waveforms cannot. Below are some of the key use cases for spectrograms across different domains:

1. Speech Analysis

Spectrograms are widely used in speech analysis to visualize the patterns of phonemes, pitch changes, and speech rhythm over time. This is crucial for:

Speech recognition systems: Spectrograms help convert spoken language into text by breaking down the unique frequency patterns of each sound or word.
Speaker identification: By analyzing the spectrogram, we can identify specific vocal features that differentiate one speaker from another, such as tone, pitch, and intonation.
Emotion recognition: Spectrograms can reveal subtle changes in pitch and rhythm that correspond to different emotions, making them useful for developing systems that detect mood or emotional state based on voice patterns.

2. Music Analysis

In music analysis, spectrograms allow us to break down a piece of music into its component frequencies, making it possible to:

Identify instruments: Each instrument has a unique harmonic structure that is visible in a spectrogram, allowing analysts or machine learning models to identify which instruments are playing.
Analyze rhythm and melody: Spectrograms show the evolution of musical notes and chords over time, making them useful for identifying rhythm patterns, melodies, and harmonics in a song.
Genre classification: Spectrograms can capture the tonal qualities and rhythm patterns that help classify different genres of music, such as distinguishing classical music from rock or jazz.
Audio fingerprinting: Spectrograms are often used in systems like Shazam to match short clips of music to a large database by analyzing their unique frequency patterns.

3. Environmental Sound Analysis

Spectrograms play a significant role in the analysis of environmental sounds — sounds that occur naturally in environments such as cities, forests, or oceans. They help in:

Sound event detection: Systems designed to detect specific sounds, like car horns, bird calls, or thunderstorms, use spectrograms to identify these events based on their unique frequency and time patterns.
Wildlife monitoring: Spectrograms can be used to study animal vocalizations, such as bird songs or whale calls, helping researchers understand patterns in behavior, migration, and communication.
Urban soundscapes: In cities, spectrograms are used to study patterns of noise pollution, identify traffic noise, and monitor how sound levels fluctuate over time in different areas.

4. Biomedical Audio Processing

In the medical field, spectrograms are used to analyze sounds produced by the human body, such as heartbeats or breathing patterns:

Heart sound analysis: Doctors can use spectrograms to visualize heart sounds and detect abnormalities such as arrhythmias or murmurs, aiding in the diagnosis of cardiovascular conditions.
Sleep studies: Spectrograms are used in sleep studies to analyze breathing patterns, helping identify conditions like sleep apnea by visualizing disruptions in breathing sounds during sleep.

5. Sound Design and Audio Engineering

In sound design and audio engineering, spectrograms are used to fine-tune audio signals for various media applications, such as film, television, and music production. Use cases include:

Noise reduction: Audio engineers use spectrograms to visually identify and isolate unwanted noise, such as hums, hisses, or background chatter, making it easier to clean up recordings.
Audio effects: Spectrograms help in designing and applying sound effects by showing the impact of filters, reverb, and other effects on the frequency spectrum.
Mixing and mastering: When mixing a track, spectrograms allow engineers to see how different instruments and sounds interact with each other in the frequency domain, ensuring that no frequency range is too crowded or underrepresented.

How Spectrograms Are Used in Modern Audio Processing

In modern audio processing, spectrograms are central to a wide array of machine learning, AI, and audio engineering tasks. Some key areas include:

Deep learning for audio: Spectrograms serve as input for neural networks in tasks such as automatic speech recognition, audio classification, and music recommendation systems. By converting audio into a visual representation, spectrograms allow deep learning models to recognize patterns in sound the way they would with image data.
Real-time audio monitoring: In applications such as security, environmental monitoring, or live audio mixing, real-time spectrograms provide immediate feedback on the frequency content of incoming audio, allowing quick adjustments or triggering alerts when specific sound patterns are detected.
Forensic audio analysis: Spectrograms are used to enhance and analyze recordings in forensic investigations, helping identify voice patterns or other critical audio elements that may be difficult to hear or detect.
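For the deep learning case, the spectrogram usually needs a fixed size and a sensible value range before it can be fed to a network. Here is a minimal NumPy sketch of that preparation step. The function name `to_model_input`, the target of 128 frames, padding with the quietest value, and per-clip standardization are all assumptions for illustration; real pipelines vary.

```python
import numpy as np

def to_model_input(log_mel, n_frames=128):
    """Pad or crop along time, standardize, and add a channel axis."""
    n_mels, t = log_mel.shape
    if t < n_frames:
        # Pad short clips on the right with the quietest value present
        pad = np.full((n_mels, n_frames - t), log_mel.min(), dtype=log_mel.dtype)
        log_mel = np.concatenate([log_mel, pad], axis=1)
    else:
        log_mel = log_mel[:, :n_frames]
    normalized = (log_mel - log_mel.mean()) / (log_mel.std() + 1e-8)
    return normalized[np.newaxis, ...]   # shape: (1, n_mels, n_frames)

# Hypothetical log-Mel spectrogram (values in dB) standing in for real features
rng = np.random.default_rng(0)
fake_log_mel = rng.uniform(-80.0, 0.0, size=(128, 87)).astype(np.float32)

x = to_model_input(fake_log_mel)
print(x.shape, float(x.mean()), float(x.std()))
```

The result has the same layout as a single-channel image, which is why image-style networks can be applied to audio with little modification.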

Conclusion

In this part of our series, we explored the power of spectrograms as a tool for visualizing and analyzing audio signals. We began by understanding the basic spectrogram, generated through the Short-Time Fourier Transform (STFT), and how it shows the changes in frequency over time. We then looked at the Mel spectrogram, which aligns more closely with human perception of sound by emphasizing lower frequencies. Finally, we introduced MFCCs, a compressed representation of the Mel spectrogram that extracts the most critical features of an audio signal, making it highly useful for tasks like speech recognition and audio classification.

Spectrograms are incredibly versatile, used across various fields like speech analysis, music analysis, environmental sound analysis, and even medical audio processing. Their ability to reveal intricate sound patterns makes them indispensable in both research and practical applications, from machine learning models to audio engineering.

Teaser for the next blog

In the next part of this series, we will dive even deeper into the world of audio features by exploring chroma features, which focus on pitch class and harmony in music. Stay tuned as we continue to unravel the secrets of audio processing and uncover new ways to analyze and interpret sound!

Final Notes

I hope you found our deep dive into spectrograms and their applications in audio analysis both insightful and practical. Understanding how these visual representations of sound can unlock new possibilities in fields like speech recognition, music analysis, and more is an exciting part of the journey into audio processing. If this post helped broaden your understanding of audio features, feel free to share it with fellow learners and enthusiasts in the tech, AI, and data science communities.

Don’t forget to hit the ‘Follow’ button to stay updated on my latest posts and continue exploring fascinating topics in machine learning and audio analysis. I’d love to hear your thoughts and questions — let’s keep the discussion going in the comments below. Remember, the best way to learn is together with a community of curious minds.