avatarLeonie Monigatti

Summary

This article discusses various data augmentation techniques for audio data in Python, including noise injection, shifting time, changing speed, changing pitch, Mixup, and SpecAugment.

Abstract

The article "Data Augmentation Techniques for Audio Data in Python" provides an overview of popular data augmentation techniques for audio data, focusing on both waveform (time domain) and spectrogram (frequency domain) manipulation. It covers techniques such as noise injection, shifting time, changing speed, changing pitch, Mixup, and SpecAugment. The article also emphasizes the importance of using appropriate augmentation techniques for spectrograms, as some image augmentation techniques may not be suitable for them. The content is supported by code examples and references to relevant literature and datasets.

Opinions

  • The article highlights the importance of data augmentation in improving the generalization capabilities of Deep Learning models, especially when working with limited datasets.
  • It suggests that while some data augmentation techniques, like volume augmentation, might be ineffective when converting waveforms to spectrograms, other techniques like noise injection, shifting time, changing speed, and changing pitch can be beneficial.
  • The article cautions against using certain image augmentation techniques, such as horizontal flipping, for spectrograms, as they can alter the information contained in the spectrograms.
  • It introduces Mixup and SpecAugment as effective data augmentation techniques for spectrograms.
  • The article emphasizes the need to consider the processing steps involved in applying data augmentations to audio data in a Deep Learning context.
  • It references relevant literature and datasets to support the discussed techniques and concepts.
  • The article provides code examples and references to Kaggle Notebooks for practical implementation guidance.

Data Augmentation Techniques for Audio Data in Python

How to augment audio in waveform (time domain) and as spectrograms (frequency domain) with librosa, numpy, and PyTorch

(Image drawn by the author)

Deep Learning models are data-hungry. If you don’t have a sufficient amount of data, generating synthetic data from the available dataset can help improve the generalization capabilities of your Deep Learning model. While you might already be familiar with data augmentation techniques for images (e.g., flipping an image horizontally), data augmentation techniques for audio data are often lesser known.

This article will review popular data augmentation techniques for audio data. You can apply data augmentations for audio data in the waveform and in the spectrogram:

For the data augmentations, we will use librosa, which is a popular library for audio processing, and numpy.

import numpy as np
import librosa

If you are already working with PyTorch, you could also use torchaudio as an alternative.

Audio Data Augmentations for Waveform (Time Domain)

This section will discuss popular data augmentation techniques you can apply to the audio data in the waveform. You can use the load() method from the librosa library to load the audio file as a waveform.

PATH = "audio_example.wav" # Replace with your file here
original_audio, sample_rate = librosa.load(PATH)
Original audio data of the word “stop” in waveform from the “Speech Commands” dataset [1] (Image by the author)

The following code implementations are referenced from Kaggle Notebooks by kaerururu [7] and CVxTz [5].

Noise injection

A popular data augmentation technique is to inject some sort of noise into the original audio data.

You can choose from many different types of noise:

  • White noise
# Code copied and edited from https://www.kaggle.com/code/kaerunantoka/birdclef2022-use-2nd-label-f0
noise_factor = 0.005
white_noise = np.random.randn(len(original_audio)) * noise_factor
  • Colored noise (e.g., pink, brown, etc.)
# Code copied and edited from https://www.kaggle.com/code/kaerunantoka/birdclef2022-use-2nd-label-f0
import colorednoise as cn

pink_noise = cn.powerlaw_psd_gaussian(1, len(original_audio))
  • Background noise
# Load background noise from another audio file
background_noise, sample_rate = librosa.load("background_noise.wav")

Once you have defined the type of noise you want to inject, you add the noise to your original waveform audio. Of course, you can use all different types of noises for your data augmentations. Below, you can see an example of noise injection with white noise.

augmented_audio = original_audio + noise
Data Augmentation for Audio: White Noise (Image by the author)

Shifting time

With the roll() function from the numpy library, you can shift the audio in time.

# Code copied and edited from https://www.kaggle.com/code/CVxTz/audio-data-augmentation/notebook
augmented_audio = np.roll(original_audio, 3000)
Data Augmentation for Audio: Shifting time (Image by the author)

Note that the audio will wrap around if you don’t have enough trailing silence. Depending on your sound type, this data augmentation might not be recommended in this case (e.g., human speech).

Changing speed

You can also increase (rate>1) or decrease (rate<1) the speed of the audio with the time_stretch() method from the librosa library.

# Code copied and edited from https://www.kaggle.com/code/CVxTz/audio-data-augmentation/notebook
rate = 0.9
augmented_audio = librosa.effects.time_stretch(original_audio, rate = rate)
Data Augmentation for Audio: Time stretch / changing speed (Image by the author)

Changing pitch

Or you can modify the pitch of the audio with the pitch_shift() method from the librosa library.

# Code copied and edited from https://www.kaggle.com/code/CVxTz/audio-data-augmentation/notebook
augmented_audio = librosa.effects.pitch_shift(original_audio, sampling_rate, pitch_factor)
Data Augmentation for Audio: Pitch shift (Image by the author)

Changing volume (not recommended)

You could also augment the waveform in terms of volume. However, if you are going to convert the waveform into spectrograms later on, volume augmentation will be ineffective as the amplitude is not considered in the frequency domain.

# Increase volume by 5 dB
augmented_audio = original_audio + 5

# Decrease volume by 5 dB
augmented_audio = original_audio - 5

Audio Data Augmentations for Spectrograms (Frequency Domain)

When modeling audio data with a Deep Learning model, it is a popular method to convert the audio classification problem into an image classification problem. For this, the waveform audio data is converted to a Mel spectrogram. If you need a refresher on Mel spectrograms, I recommend the following article:

You can convert audio in waveform into a Mel spectrogram with the melspectrogram() and power_to_db() methods from the librosa library.

original_melspec = librosa.feature.melspectrogram(y = original_audio,
                                                  sr = sample_rate, 
                                                  n_fft = 512, 
                                                  hop_length = 256, 
                                                  n_mels = 40).T

original_melspec = librosa.power_to_db(original_melspec)
Original audio data of the word “stop” in waveform from the “Speech Commands” dataset [1] (Image by the author)
Original audio data of the word “stop” as Mel spectrogram from the “Speech Commands” dataset [1] (Image by the author)

Although you now have an image classification problem, you must be careful when selecting image augmentation techniques for spectrograms. E.g., horizontally flipping the spectrogram would substantially change the information contained in the spectrogram and thus is not recommended.

This section will discuss popular data augmentation techniques you can apply to the audio data as a Mel spectrogram.

The following code implementations are referenced from Kaggle Notebooks by kaerururu [7] and DavidS [6].

Mixup

Simply put, Mixup [4] combines two samples by overlaying them and giving the new sample two labels.

# Code copied and edited from https://www.kaggle.com/code/kaerunantoka/birdclef2022-use-2nd-label-f0
def mixup(original_melspecs, original_labels, alpha=1.0):
    indices = torch.randperm(original_melspecs.size(0))

    lam = np.random.beta(alpha, alpha)

    augmented_melspecs = original_melspecs * lam + original_melspecs[indices] * (1 - lam)
    augmented_labels = [(original_labels * lam), (original_labels[indices] * (1 - lam))]

    return augmented_melspec, augmented_labels
Data Augmentation for Spectrogram: Mixup [4] (Image by the author)

Alternatively, you could also try other data augmentation techniques used in computer vision, like cutmix [3].

SpecAugment

SpecAugment [2] is to spectrograms what cutout is to regular images. While cutout blocks out random areas in an image, SpecAugment [2] masks random frequencies and time periods.

# Code copied and edited from https://www.kaggle.com/code/davids1992/specaugment-quick-implementation
def spec_augment(original_melspec,
                 freq_masking_max_percentage = 0.15, 
                 time_masking_max_percentage = 0.3):

    augmented_melspec = original_melspec.copy()
    all_frames_num, all_freqs_num = augmented_melspec.shape

    # Frequency masking
    freq_percentage = random.uniform(0.0, freq_masking_max_percentage)
    num_freqs_to_mask = int(freq_percentage * all_freqs_num)
    f0 = int(np.random.uniform(low = 0.0, high = (all_freqs_num - num_freqs_to_mask)))
    
    augmented_melspec[:, f0:(f0 + num_freqs_to_mask)] = 0

    # Time masking
    time_percentage = random.uniform(0.0, time_masking_max_percentage)
    num_frames_to_mask = int(time_percentage * all_frames_num)
    t0 = int(np.random.uniform(low = 0.0, high = (all_frames_num - num_frames_to_mask)))
    
    augmented_melspec[t0:(t0 + num_frames_to_mask), :] = 0
    
    return augmented_melspec
Data Augmentation for Spectrogram: SpecAugment [2] (Image by the author)

Alternatively, if you are using Pytorch, you could also use the TimeMasking and FrequencyMasking augmentations from torchaudio:

import torch
import torchaudio
import torchaudio.transforms as T

time_masking = T.TimeMasking(time_mask_param = 80)
freq_masking = T.FrequencyMasking(freq_mask_param=80)

augmented_melspec = time_masking(original_melspec)
augmented_melspec = freq_masking(augmented_melspec)

Summary

Data augmentations for audio data can be applied to the audio data in the time domain (waveform) as well as the frequency domain (Mel spectrogram). To successfully apply data augmentations to the audio data in a Deep Learning setting, you have to consider the following processing steps:

  1. Load audio file as waveform (time domain)
  2. Apply data augmentation to the waveform
  3. Convert audio from waveform to spectrogram (frequency domain)
  4. Apply data augmentations to the spectrogram

This article has covered different data augmentation techniques for audio data in the waveform. It is important to note that some data augmentation techniques, such as augmenting the volume, are ineffective when converting the waveform to spectrograms later on because the amplitude is not considered in the frequency domain.

Although you could technically apply all image augmentation techniques to spectrograms, not all of them will make sense. E.g., flipping them vertically or horizontally would change the meaning of the spectrograms. Additionally, a variant of the popular cutout image augmentation tailored for spectrograms masks whole timestamps and frequencies.

Enjoyed This Story?

Subscribe for free to get notified when I publish a new story.

Find me on LinkedIn, Twitter, and Kaggle!

References

Dataset

[1] Warden P. Speech Commands: A public dataset for single-word speech recognition, 2017. Available from http://download.tensorflow.org/data/speech_commands_v0.01.tar.gz

License: CC-BY-4.0

Literature

[2] Park, D. S., Chan, W., Zhang, Y., Chiu, C. C., Zoph, B., Cubuk, E. D., & Le, Q. V. (2019). Specaugment: A simple data augmentation method for automatic speech recognition. arXiv preprint arXiv:1904.08779.

[3] Yun, S., Han, D., Oh, S. J., Chun, S., Choe, J., & Yoo, Y. (2019). Cutmix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 6023–6032).

[4] Zhang, H., Cisse, M., Dauphin, Y. N., & Lopez-Paz, D. (2017) mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412.

Code

[5] CVxTz (2018). Audio data augmentation in Kaggle Notebooks (accessed March 24, 2023).

[6] DavidS (2019). SpecAugment quick implementation in Kaggle Notebooks (accessed March 24, 2023).

[7] kaerururu (2022). BirdCLEF2022 : use 2nd label f0 in Kaggle Notebooks (accessed March 24, 2023).

Data Science
Artificial Intelligence
Deep Learning
Technology
Editors Pick
Recommended from ReadMedium