Daniel Warfield

Use Frequency More Frequently

A handbook from simple to advanced frequency analysis: exploring a vital tool which is widely underutilized in data science

Frequency analysis is extremely useful in a vast number of domains, from audio to mechanical systems to natural language processing and unsupervised learning. For many scientists and engineers it’s a vital tool, but many data scientists and developers barely understand it, if at all. If you don’t know about frequency analysis, don’t fret: you just found your handbook.

Image by Daniel Warfield using p5.js. All images in this document are either created with p5.js or Python’s Matplotlib library unless otherwise specified.

Who is this useful for? Anyone who works with virtually any signal, sensor, image, or AI/ML model.

How advanced is this post? This post is accessible to beginners and contains examples that will interest even the most advanced users of frequency analysis. You will likely get something out of this article regardless of your skill level.

What will you get from this post? Both a conceptual and mathematical understanding of waves and frequencies, a practical understanding of how to employ those concepts in Python, some common use cases, and some more advanced use cases.

Note: To help you skim through, I’ve labeled subsections as Basic, Intermediate, and Advanced. This is a long article designed to get someone from zero to hero. However, if you already have education or experience in the frequency domain, you can probably skim the intermediate sections or jump right to the advanced topics.

I’ve also set up links so you can click to navigate to and from the table of contents.

Table Of Contents

Click the links to navigate to specific sections

1) The Frequency Domain
  1.1) The Basics of the Frequency Domain (Basic)
  1.2) The Specifics of the Frequency Domain (Intermediate)
  1.3) A Simple Example in Python (Intermediate)
2) Common Uses of the Frequency Domain
  2.1) De-trending and Signal Processing (Intermediate)
  2.2) Vibration Analysis (Advanced)
3) Advanced Uses of the Frequency Domain
  3.1) Data Augmentation (Advanced)
  3.2) Embedding and Clustering (Advanced)
  3.3) Compression (Intermediate)
4) Conceptual Takeaways for Data Scientists
5) Summary

1) The Frequency Domain

1.1) The Basics of the Frequency Domain (Basic)

(Back To Table of Contents)

First, what is a domain? Imagine you want to understand temperature changes over time. Just reading that sentence, you probably imagined a graph like this:

What you might be imagining when you think of temperature over some period of time

Maybe you imagine time progressing from left to right, and greater temperatures corresponding to higher vertical points. Congratulations, you’ve taken data and mapped it to a 2d time domain. In other words, you’ve taken temperature readings, recorded at certain times, and mapped that information to a space where time is one axis, and the value is another.

There are other ways to represent our temperature vs time data. As you can see, there’s a “periodic” nature to this data, meaning it oscillates back and forth. A lot of data behaves this way: sound, ECG data from heartbeats, movement sensors like accelerometers, and even images. In one way or another, a lot of things have data that goes to and fro periodically.

“If you want to find the secrets of the universe, think in terms of energy, frequency and vibration.” ― Nikola Tesla

I could get to this point in a circuitous way, but a picture speaks 1000 words. In essence, we can disassemble our temperature graph into a bunch of simple waves, with various frequencies and amplitudes (frequency being the speed at which a wave goes back and forth, and amplitude being how high and low it goes), and use those waves to describe the data.

All the waves, of various frequencies and amplitudes, which go into making our original wave. You might notice that there’s one wave which is more subtle than the other two and is practically impossible to see in the original. Finding this hidden information is one benefit of frequency analysis.

These waves are extracted using a Fourier Transform, which maps our original wave from the time domain to the frequency domain. Instead of value vs time, the frequency domain is amplitude vs frequency.

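Each of the extracted waves has a frequency and amplitude. If we plot frequency on the x axis, and amplitude on the y axis, we have plotted what is called a frequency spectrum. I’ll loosely call this plot a spectrogram throughout, though strictly speaking a spectrogram shows how the spectrum changes over time.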

So, to summarize, the Fourier Transform maps data (usually, but not always in the time domain) into the frequency domain. The frequency domain describes all of the waves, with different frequencies and amplitudes, which when added together reconstruct the original wave.

The original wave, in the time domain, and the frequency content in the frequency domain. These both describe the same signal
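To make the decomposition concrete, here is a minimal sketch using a made-up signal rather than the temperature data pictured above. The sample rate, the three component frequencies, and their amplitudes are all assumptions chosen so that each wave lands exactly on a frequency bin.

```python
import numpy as np
from scipy.fft import rfft, rfftfreq

# Assumed, made-up signal: 10 seconds sampled at 100 Hz
samplerate = 100
N = 1000
t = np.arange(N) / samplerate

# three hypothetical component waves: (frequency in Hz, amplitude)
components = [(1, 3.0), (4, 1.5), (9, 0.2)]
signal = sum(a * np.sin(2 * np.pi * f * t) for f, a in components)

# rfft returns the non-negative-frequency half of the spectrum for real input
spectrum = rfft(signal)
freqs = rfftfreq(N, 1 / samplerate)
amplitudes = 2.0 / N * np.abs(spectrum)

# the three largest peaks correspond to the three component waves
top = np.sort(np.argsort(amplitudes)[-3:])
recovered = [(float(freqs[i]), float(amplitudes[i])) for i in top]
print(recovered)
```

The smallest wave (amplitude 0.2) is hard to spot in the summed signal, but it shows up as its own peak in the frequency domain.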

1.2) The Specifics of the Frequency Domain (Intermediate)

(Back To Table of Contents)

The sin function is the ratio of the side opposite an angle in a right triangle to the hypotenuse of that triangle.

θ(theta) is an angle of a right triangle, a is the length of the opposite side of θ, and c is the length of the hypotenuse

The sin wave is what you get when you plot a/c for different values of θ (Different Angles), and is used in virtually all scientific disciplines as the most fundamental wave.

The relationship between the sin function, right triangles, and the sin wave

Often sin(θ) is expanded to A*sin(ωθ+ϕ).

ω(omega) represents frequency (larger values of ω mean the sin wave oscillates more quickly)

ϕ(phi) represents phase (changing ϕ shifts the wave to the right or left)

A scales the function, which defines the amplitude (how large the oscillations are).

“A” controls the amplitude (height), “omega” controls the frequency (speed of oscillation), and “phi” controls the phase (shift from side to side)

When I explained the frequency domain I presented a simplified representation, where the horizontal axis is frequency, and the vertical axis is amplitude. In actuality the frequency domain is not 2 dimensional, but 3: one dimension for frequency, one for amplitude, and one for phase. A spectrogram can be of even higher dimension for higher dimensional signals (like images).

A Traditional Amplitude vs frequency spectrogram (left) vs a more descriptive amplitude, frequency, and phase plot.

When converting a signal to the frequency domain (using a library like scipy, for instance) you’ll get a list of complex numbers.

[1.13-1.56j, 2.34+2.6j, 7.4-3.98j, ...]

If you’re not familiar with complex numbers, don’t worry about it. You can imagine this list as a set of points, where the index in the list corresponds to frequency, and each complex number is a pair of values (a real part and an imaginary part).

[(1.13, -1.56), (2.34, 2.6), (7.4, -3.98), ...]

The amplitude of a given frequency is the distance of its point from the origin, and the phase is the angle of that point.
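As a rough illustration of how amplitude and phase come out of those complex numbers, the sketch below builds a wave of the form A*sin(2*pi*f*t + phi) and recovers A, f, and phi from the FFT output. The values of A, f, and phi and the sample rate are assumptions picked for the example.

```python
import numpy as np
from scipy.fft import fft, fftfreq

# Assumed example wave: A*sin(2*pi*f*t + phi), one second sampled at 100 Hz
A, f, phi = 2.0, 5.0, 0.7
samplerate = 100
N = 100
t = np.arange(N) / samplerate
wave = A * np.sin(2 * np.pi * f * t + phi)

yf = fft(wave)
xf = fftfreq(N, 1 / samplerate)

# index of the strongest positive frequency
k = np.argmax(np.abs(yf[:N // 2]))

frequency = xf[k]
# amplitude is the magnitude of the complex number (scaled by 2/N)
amplitude = 2.0 / N * np.abs(yf[k])
# np.angle measures phase relative to a cosine; adding pi/2 converts
# it to the sin convention used in this section
phase = np.angle(yf[k]) + np.pi / 2

print(frequency, amplitude, phase)
```

np.abs gives the distance of each complex value from the origin and np.angle gives its angle, which is why those two calls are enough to read off amplitude and phase.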

I haven’t talked about the units of these numbers. Because units are, essentially, linear transformations to all data, they can often be disregarded from a data science perspective. However, if you do use the frequency domain in the future, you will likely encounter words like Hertz (Hz), Period (T), and other frequency domain-specific concepts. You will see these units explored in the examples.

If you want to learn more about units in general, and how to deal with them as a data scientist, I have an article all about it here

1.3) A Simple Example in Python (Intermediate)

(Back To Table of Contents)

In this example, we load a snippet of trumpet music, convert it to the frequency domain, plot the frequency spectrogram, and use the spectrogram to understand the original signal.

First, we’ll load and plot the sound data, which is an amplitude over time. This data is used to control the location of the diaphragm within a speaker, the oscillation of which generates sound.

"""
Loading a sample waveform, and plotting it in the time domain
"""

#importing dependencies
import matplotlib.pyplot as plt     #for plotting
from scipy.io import wavfile        #for reading audio file
import numpy as np                  #for general numerical processing

#reading a .wav file containing audio data.
#This is stereo data, so there's a left and right audio channel
samplerate, data = wavfile.read('trumpet_snippet.wav')

#creating wide figure
plt.figure(figsize=(18,6))

#defining number of samples we will explore
N = 3000

#calculating time of each sample
x = np.linspace(start = 0, stop = N/samplerate, num = N)

#plotting channel 0
plt.subplot(2, 1, 1)
plt.plot(x,data[:N,0])

#plotting channel 1
plt.subplot(2, 1, 2)
plt.plot(x,data[:N,1])

#rendering
plt.show()
The left and right sound waves from a snippet of stereo trumpet music, in the time domain. The X axis corresponds to time, in seconds, and the y axis corresponds to the amplitude of the signal, which controls the location of a speaker diaphragm, generating sound. (Raw trumpet data from storyblocks.com)

Let’s convert these waveforms to the frequency domain.

"""
Converting the sample waveform to the frequency domain, and plotting it

This is basically directly from the scipy documentation
https://docs.scipy.org/doc/scipy/tutorial/fft.html
"""

#importing dependencies
from scipy.fft import fft, fftfreq      #for computing frequency information

#calculating the period, which is the amount of time between samples
T = 1/samplerate
#defining the number of samples to be used in the frequency calculation
N = 3000

#calculating the amplitudes and frequencies using fft
yf0 = fft(data[:N,0])
yf1 = fft(data[:N,1])
xf = fftfreq(N, T)[:N//2]

#creating wide figure
plt.figure(figsize=(18,6))

#plotting only frequency and amplitude for the 1st channel
plt.subplot(2, 1, 1)
plt.plot(xf, 2.0/N * np.abs(yf0[0:N//2]))
plt.xlim([0, 6000])

#plotting only frequency and amplitude for the 2nd channel
plt.subplot(2, 1, 2)
plt.plot(xf, 2.0/N * np.abs(yf1[0:N//2]))
plt.xlim([0, 6000])

plt.show()
The frequency domain representation of the previously loaded trumpet audio. The X axis is the frequency (in Hz, which is oscillations/second), and the y axis is the amplitude of the signal.

Just by visualizing this graph, a few insights can be made.

  1. Both signals contain very similar frequency content, which makes sense because they’re both from the same recording. Often stereo recordings are recorded with two separate microphones simultaneously.
  2. The dominant frequency is around 523Hz, which corresponds to a C5 note.
  3. There are strong harmonics (overtones), which can be seen as spikes at integer multiples of the base frequency. This trait is critical in making an instrument sound good, and arises because the instrument vibrates periodically but not as a pure sin wave, so energy also appears at multiples of the primary vibration’s frequency.
  4. This is a very clear sound, the spikes are not muddled by a lot of unrelated frequency content
  5. This is an organic sound. There is some frequency content which is not related to the base frequency. This can be thought of as the timbre of the instrument and makes it sound like a trumpet, rather than some other instrument performing the same note.
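The second insight (reading off the dominant frequency and naming the note) can be sketched in a few lines. This uses a synthetic tone as a stand-in for the trumpet recording; the 44100 Hz sample rate, the tone itself, and the `note_name` helper are assumptions made for illustration.

```python
import numpy as np
from scipy.fft import rfft, rfftfreq

# Assumed stand-in for the recording: a 523.25 Hz tone plus one weaker harmonic
samplerate = 44100
N = 3000
t = np.arange(N) / samplerate
tone = np.sin(2 * np.pi * 523.25 * t) + 0.5 * np.sin(2 * np.pi * 1046.5 * t)

# the dominant frequency is the frequency bin with the largest magnitude
amplitudes = np.abs(rfft(tone))
freqs = rfftfreq(N, 1 / samplerate)
peak_freq = freqs[np.argmax(amplitudes)]

def note_name(freq_hz):
    """Nearest equal-tempered note, taking A4 = 440 Hz (MIDI note 69)."""
    names = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']
    midi = int(round(69 + 12 * np.log2(freq_hz / 440.0)))
    return f"{names[midi % 12]}{midi // 12 - 1}"

note = note_name(peak_freq)
print(peak_freq, note)
```

With only 3000 samples the frequency bins are about 14.7 Hz apart (samplerate / N), so the measured peak is only accurate to within one bin. That is still enough to identify the note here.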

In section 2 we’ll explore how the frequency domain is used commonly in time series signal processing. In section 3 we’ll explore more advanced topics.

2) Common Uses of the Frequency Domain

2.1) De-trending and Signal Processing (Intermediate)

(Back To Table of Contents)

Let’s say you have an electrical system, and you want to understand the minute-by-minute voltage changes in that system over the course of a day. You set up a voltage meter, capture, and plot the voltage information over time.

Let's say, for the purposes of this example, that we only care about the minute-by-minute data. We consider waves whose frequency is too high to be noise, and waves whose frequency is too low to be a long-term trend that we want to ignore.

We don’t care about the long term trends which take place over the course of hours. We’re interested in minute-by-minute data (raw data synthetically generated by the author)
We care about the trends going on in around this time frame
We don’t care about waves which oscillate too quickly, these are considered as noise in the signal

So, for this example, we only care about observing content which oscillates slower than once per second, and faster than once every 5 minutes. We can convert our data to the frequency domain, remove all but the frequencies we’re interested in observing, then convert back to the time domain so we can visualize the wave including only the trends we’re interested in.

First, let’s observe the frequency domain unaltered:

"""
Plotting the entire frequency domain spectrogram for the mock electrical data
"""

#load electrical data, which is a numpy list of values taken at 1000Hz sampling frequency
x, y = load_electrical_data()
samplerate = 1000
N = len(y)

#calculating the period, which is the amount of time between samples
T = 1/samplerate

#calculating the amplitudes and frequencies using fft
yf = fft(y)
xf = fftfreq(N, T)[:N//2]

#creating wide figure
plt.figure(figsize=(18,6))

#plotting only frequency and amplitude for the 1st channel
plt.plot(xf, 2.0/N * np.abs(yf[0:N//2]))

#marking units of the two axis
plt.xlabel('fq (Frequency in Hz)')
plt.ylabel('V (Volts)')

#setting the vertical axis as logarithmic, for better visualization
plt.gca().set_yscale('log')

#rendering
plt.show()
This is the complete, unfiltered spectrogram for the electrical system we are analyzing

We can set all the frequency content we are not interested in to zero. Often you would use a purpose-built filter, like a Butterworth filter, to do this, but we’ll keep it simple.

"""
converting the data to the frequency domain, and filtering out
unwanted frequencies
"""

#defining low frequency cutoff
lowfq = 1/(5*60)

#defining high frequency cutoff
highfq = 1

#calculating the amplitudes and frequencies, preserving all information
#so the inverse fft can work
yf = fft(y)
xf = fftfreq(N, T)

#applying naive filter, which will likely create some artifacts, but will
#filter out the data we don't want
yf[np.abs(xf) < lowfq] = 0
yf[np.abs(xf) > highfq] = 0

#creating wide figure
plt.figure(figsize=(18,6))

#plotting only frequency and amplitude
plt.plot(xf[:N//2], 2.0/N * np.abs(yf[0:N//2]))

#marking units of the two axis
plt.xlabel('fq (Frequency in Hz)')
plt.ylabel('V (Volts)')

#setting the vertical axis as logarithmic, for better visualization
plt.gca().set_yscale('log')

#zooming into the frequency range we care about
plt.xlim([-0.1, 1.1])

#rendering
plt.show()
The plot of the frequency domain we’re isolating, with all other frequency information set to zero

Now we can perform an inverse Fast Fourier Transform to reconstruct the wave, including only the data we care about

"""
Reconstructing the wave with the filtered frequency information
"""

#importing dependencies
from scipy.fft import ifft      #for computing the inverse fourier transform

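#computing the inverse fourier transform. The result is complex with a
#negligible imaginary part, so we keep only the real part for plotting
y_filt = ifft(yf).real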

#creating wide figure
plt.figure(figsize=(18,6))

#plotting
plt.plot(x,y_filt)

#defining x and y axis
plt.xlabel('t (seconds)')
plt.ylabel('V (volts)')

#looking at a few minutes of data, not looking at 
#the beginning or end of the data to avoid filtration artifacts
plt.xlim([60*2,60*10])

#rendering
plt.show()
A few minutes of data, with our filter enabled. We have removed excessively high frequency content, and brought the wave to center around 0 by removing excessively low frequency content.

And that’s it. We have successfully removed high-frequency information we don’t care about, and centered the data we do care about around zero by removing low-frequency trends. We can now use this minute-by-minute data to hone in on understanding the electrical system we’re measuring.
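For comparison, here is a minimal sketch of the Butterworth approach mentioned earlier in this section, using scipy.signal. It runs on a synthetic signal rather than the electrical data, and the sample rate, band edges (1 Hz to 5 Hz), and filter order are assumptions chosen to keep the example small and numerically well behaved.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

# Assumed synthetic signal: 20 seconds sampled at 100 Hz
samplerate = 100
t = np.arange(0, 20, 1 / samplerate)
wanted = np.sin(2 * np.pi * 3 * t)           # content we want to keep
trend = 2.0 * np.sin(2 * np.pi * 0.05 * t)   # slow trend to remove
noise = 0.5 * np.sin(2 * np.pi * 30 * t)     # fast content to remove
y = wanted + trend + noise

# 4th-order Butterworth band-pass keeping 1 Hz to 5 Hz.
# Second-order sections (sos) are numerically more stable than (b, a) form.
sos = butter(4, [1, 5], btype='bandpass', fs=samplerate, output='sos')

# sosfiltfilt runs the filter forwards and backwards, so there is no phase shift
y_filt = sosfiltfilt(sos, y)

# compare away from the edges, where filtering artifacts are largest
middle = slice(500, 1500)
max_error = np.max(np.abs(y_filt[middle] - wanted[middle]))
print(max_error)
```

Unlike zeroing FFT bins, a Butterworth filter rolls off smoothly at the band edges, which tends to produce fewer ringing artifacts in the reconstructed signal.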

2.2) Vibration Analysis (Advanced)

(Back To Table of Contents)

I covered vibration analysis in a previous example in the form of analyzing a sound wave. In this example, I’ll discuss analyzing vibrations in physical systems, like a motor in a factory.

It can be difficult to predict when certain motors require maintenance. Often, simple issues like a misalignment can cascade into much more severe issues, like a complete engine failure. We can use frequency recordings, collected periodically over time, to help us understand when a motor is operating differently, allowing us to diagnose issues within an engine before they cascade into larger ones.

Vibration data taken over a period of time where the engine experienced a minor failure. In the time domain it’s virtually impossible to see the time of failure. (raw data synthetically generated by the author)

To analyze this data, we will compute and render what is called a mel spectrogram. A spectrogram differs from the frequency plots above in that, instead of computing the frequency content across the entire waveform, we extract the frequency content from small rolling windows of the signal. This allows us to plot how the frequency content changes over time. A mel spectrogram additionally maps the frequency axis onto the mel scale, which spaces frequencies according to how humans perceive pitch.

"""
plotting a mel-spectrogram of motor vibration to diagnose the point of failure

note: if you don't want to use librosa, you can construct a regular
(non-mel) spectrogram easily using scipy's fft function across a rolling
window, allowing for more granular calculation, and matplotlib's imshow
function for more granular rendering
"""

#importing dependencies
import librosa              #for calculating the mel-spectrogram
import librosa.display      #for plotting the mel spectrogram

#loading the motor data
y = load_motor_data()
samplerate = 1000 #in Hz

#calculating the mel spectrogram, as per the librosa documentation
D = np.abs(librosa.stft(y))**2
S = librosa.feature.melspectrogram(S=D, sr=samplerate)

#creating wide figure
fig = plt.figure(figsize=(18,6))

#plotting the mel spectrogram
ax = fig.subplots()
S_dB = librosa.power_to_db(S, ref=np.max)
#fmax cannot exceed the Nyquist frequency (half the sample rate)
img = librosa.display.specshow(S_dB, x_axis='time',
                         y_axis='mel', sr=samplerate,
                         fmax=samplerate/2, ax=ax)
fig.colorbar(img, ax=ax, format='%+2.0f dB')
ax.set(title='Mel-frequency spectrogram')

#rendering
plt.show()
A mel spectrogram of the motor data. Instead of a 2d frequency plot, mel spectrograms are 3d: the vertical axis is the frequency of oscillation, the x axis is the time (in this case a percentage of time) at which the frequency content was calculated, and the color represents amplitude, which is measured in a unit called decibels. Note that at time 0.2, the frequency content of the motor suddenly changes.

In a Mel Spectrogram, each vertical slice represents a region of time, with high-frequency content being shown higher up, and low-frequency content being shown lower down in the plot. It’s easy to see that at time 0.2 (20% through our data), the frequency content changed dramatically. At this point a balancing weight became loose, causing the engine to become unbalanced. Maintenance at this point may save the engine from excess wear in the future.
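The same rolling-window idea can be sketched without librosa. The example below uses scipy.signal.spectrogram (a regular spectrogram, not a mel one) on a synthetic "motor" signal; the 50 Hz and 120 Hz components and the change at the 20% mark are assumptions standing in for the real vibration data.

```python
import numpy as np
from scipy.signal import spectrogram

# Assumed synthetic "motor": a 50 Hz hum, with a stronger 120 Hz vibration
# appearing 20% of the way through the recording (at t = 2 seconds)
samplerate = 1000
t = np.arange(0, 10, 1 / samplerate)
y = np.sin(2 * np.pi * 50 * t)
y = y + np.where(t >= 2.0, 2.0 * np.sin(2 * np.pi * 120 * t), 0.0)

# frequency content of short windows; Sxx has shape (frequencies, windows)
freqs, times, Sxx = spectrogram(y, fs=samplerate, nperseg=256)

# strongest frequency in each window
dominant = freqs[np.argmax(Sxx, axis=0)]
before = dominant[times < 1.5]
after = dominant[times > 2.5]
print(before[0], after[0])
```

Tracking the strongest frequency per window is a crude summary, but it is enough to flag the moment the signal changes character.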

A simple yet effective way to employ this principle is with scheduled vibration readings. A worker sticks an accelerometer on the body of a motor with a magnet and records the vibration once or twice a month. Those windows of vibration data are then converted to the frequency domain, where certain key features are extracted. A common extracted feature is band power, computed from the power spectral density: essentially the area under the power spectrum over a certain range of frequencies. Extracted features can be plotted over several weeks of recordings and used as a proxy for overall motor health.
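One way such a feature might be computed is sketched below, using Welch's method from scipy.signal to estimate the power spectral density. The `band_power` helper, the frequency bands, and the synthetic reading are all assumptions for illustration, not a prescribed maintenance procedure.

```python
import numpy as np
from scipy.signal import welch

def band_power(y, samplerate, low_hz, high_hz):
    """Approximate signal power between low_hz and high_hz from a Welch PSD estimate."""
    freqs, psd = welch(y, fs=samplerate, nperseg=1024)
    band = (freqs >= low_hz) & (freqs <= high_hz)
    # the PSD is power per Hz, so summing bins times the bin width
    # approximates the area under the curve over the band
    return float(np.sum(psd[band]) * (freqs[1] - freqs[0]))

# Assumed synthetic reading: a 50 Hz vibration plus a little sensor noise
rng = np.random.default_rng(0)
samplerate = 1000
t = np.arange(0, 10, 1 / samplerate)
reading = np.sin(2 * np.pi * 50 * t) + 0.01 * rng.standard_normal(t.size)

power_near_50hz = band_power(reading, samplerate, 40, 60)
power_elsewhere = band_power(reading, samplerate, 200, 300)
print(power_near_50hz, power_elsewhere)
```

A sine wave of amplitude 1 carries a power of 0.5, so the band around 50 Hz should come out close to that value while the other band stays near zero. Logging a handful of these numbers per reading gives a compact trend to plot over time.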

3) Advanced Uses of the Frequency Domain

3.1) Data Augmentation (Advanced)

(Back To Table of Contents)

Data augmentation is the process of creating fake data from real data. The quintessential example is in image classification, where augmented copies of each image are used to bolster a data set for classifying whether an image is of a dog or a cat.

Example of image augmentation, where a single image can be used to generate multiple images for a machine learning model to learn from. Created with Affinity Designer 2, stock photo from storyblocks.com

Augmentation can be an incredibly powerful tool, but what if you don’t have images? What if you have sound, motion, temperature, or some other signal? How can one sensibly augment these types of data? In the time domain, augmentation strategies look a lot more like regularization strategies: add a bit of noise here, and shift the data up or down there. They add random information to data, which can be useful, but they don’t really make new examples.

We can steal something from the music production scene: a wavetable. The idea behind a wavetable is to convert two waves to the frequency domain, interpolate between the two in the frequency domain, then convert the interpolation back to the time domain. I don’t mean blending, where you overlay one signal over the other, but making a completely new wave which contains frequency content from two (or more) other waves.

Let’s imagine we’re trying to build a model to detect whether people are talking in an audio snippet. We have a bunch of audio samples where people are talking, and a bunch where they aren’t, both in a variety of situations. Gathering this data requires someone to go out with a collection of different microphones, capture sounds, and then manually flag whether each recording contains someone talking. Let’s say the model has to be very robust and very accurate, and recording sufficient data to reach the desired performance levels is not financially feasible.

In theory, the thing that makes human speech sound the way it does is frequency content. A blend of frequency content from one snippet of talking and another snippet of talking should still sound like someone talking. We can use a wave table to construct these artificial waves, thus making more data for free (besides a data scientist's salary and big old expensive computing resources on the cloud).

"""
loading and plotting two waveforms recorded in two separate environments,
both including people talking
"""

#loading two waveforms
samplerate, y1 = wavfile.read('crowd.wav')
_,          y2 = wavfile.read('citycenter.wav')

#creating x axis for both waveforms
N = 1000000
x1 = np.linspace(start = 0, stop = N/samplerate, num = N)
x2 = np.linspace(start = 0, stop = N/samplerate, num = N)

#creating wide figure
plt.figure(figsize=(18,6))

offset = 1000000

#plotting waveform 1
plt.subplot(2, 1, 1)
plt.plot(x1,y1[offset:offset+N])
plt.xlabel('t (seconds)')
plt.ylabel('A (db)')

#plotting waveform 2
plt.subplot(2, 1, 2)
plt.plot(x2, y2[offset:offset+N])
plt.xlabel('t (seconds)')
plt.ylabel('A (db)')

#rendering
plt.show()
Two waveforms, both prominently including people talking. (raw data from storyblocks.com)

We can convert both of these waves to the frequency domain, and create several frequency representations which are interpolations between the two waves.

"""
Converting both waves to the frequency domain, constructing a wave table, and rendering the wave table
"""

#calculating the frequency content for both waves.
#Only analyzing 1 of the 2 stereo channels
fq1 = fft(y1[offset:offset+N,0])
fq2 = fft(y2[offset:offset+N,0])

#defining frequency axis
T = 1/samplerate
xf = fftfreq(N, T)

#creating wide figure
fig = plt.figure(figsize=(18,6))
ax = fig.add_subplot(111, projection='3d')

#plotting source waves
plt.plot(xf[:N//2], np.array([1]*(N//2)), 2.0/N * np.abs(fq1[0:N//2]))
plt.plot(xf[:N//2], np.array([0]*(N//2)), 2.0/N * np.abs(fq2[0:N//2]))

fq_interp = []
#creating interpolations
for per in np.linspace(0.1,0.9,9):
    thisfq = (fq1*per) + (fq2*(1-per))
    fq_interp.append((per, thisfq))

    #plotting interpolation
    plt.plot(xf[:N//2], np.array([per]*(N//2)), 2.0/N * np.abs(thisfq[0:N//2]))
    
plt.show()
Frequency spectrograms for both the original waveforms (at the extremes) and the interpolated waveforms in the middle. Note that the plot shows the spectrogram as frequency vs amplitude, but the interpolation is performed on the complex values, so it involves phase as well as amplitude.

We can now compute the inverse Fast Fourier Transform on all of these interpolated frequency domains, and extract our table of waves.

"""
Computing the inverse fft on the frequency content, and constructing the final table of waves
"""

#creating wide figure
fig = plt.figure(figsize=(18,6))
ax = fig.add_subplot(111, projection='3d')

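#x1 is the time axis defined when the waveforms were loaded
plt.plot(x1, np.array([1]*len(x1)), y1[offset:offset+N,0])
plt.plot(x1, np.array([0]*len(x1)), y2[offset:offset+N,0])

#creating interpolations
for per, interp in fq_interp:
    
    #the inverse fft returns complex values; the imaginary part is negligible
    waveform = ifft(interp).real
    
    plt.plot(x1, np.array([per]*len(x1)), waveform)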
    
plt.show()
The final wave table. The extreme waves are the source waves, while the ones in between are interpolations in the frequency domain.

And there we go. From 2 waves of people talking, we now have 11 waves of people talking (the 2 originals plus 9 interpolations). Data augmentation can be a tricky task, as you can easily create data which is not actually indicative of the data you’re trying to emulate. When employing a similar augmentation strategy, you can use augmentations which are closer to the source waves (80% one wave, 20% another). These will be more likely to be realistic than waves closer to the center (50%, 50%).
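One caveat is worth checking before relying on this: the Fourier transform is linear, so a weighted sum of two complex spectra transforms back to the same weighted sum of the two waveforms. The sketch below verifies that on two toy waves, and shows one possible alternative that interpolates amplitude and phase separately. The `polar_morph` function is a hypothetical variant introduced for illustration, not the method used in the listings above, and the toy waves are assumptions.

```python
import numpy as np
from scipy.fft import rfft, irfft

# Assumed toy waves: the same 5 Hz frequency with different phases
samplerate = 1000
N = 1000
t = np.arange(N) / samplerate
y1 = np.sin(2 * np.pi * 5 * t)
y2 = np.cos(2 * np.pi * 5 * t)
F1, F2 = rfft(y1), rfft(y2)

def complex_blend(per):
    """Weighted sum of the complex spectra, as in the wave table listings."""
    return irfft(per * F1 + (1 - per) * F2, n=N)

def polar_morph(per):
    """Hypothetical alternative: interpolate amplitude and phase separately."""
    magnitude = per * np.abs(F1) + (1 - per) * np.abs(F2)
    # shortest signed phase difference from F1 to F2, in (-pi, pi]
    delta = np.angle(F2 * np.conj(F1))
    phase = np.angle(F1) + (1 - per) * delta
    return irfft(magnitude * np.exp(1j * phase), n=N)

blend = complex_blend(0.5)
morph = polar_morph(0.5)

# the complex blend is identical to mixing the waveforms directly
print(np.allclose(blend, 0.5 * y1 + 0.5 * y2))
# the polar morph keeps full amplitude, the blend partially cancels
print(np.max(np.abs(blend)), np.max(np.abs(morph)))
```

Whether the separate amplitude and phase interpolation produces more realistic augmented data is an open question that depends on the signals involved, so it is worth listening to (or otherwise validating) the results of either approach.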

3.2) Embedding and Clustering (Advanced)

(Back To Table of Contents)

For this example, we’ll use the output from a sentiment analysis model to cluster different products based on their customer sentiment over time. Let’s say we run a store with reviews, and those reviews fluctuate between positive and negative. We notice we have some reviews which correlate with one another. We want to find products which have similar sentiment analysis trends, such that they can be grouped together and further understood.

First, let’s look at our data:

"""
loading 1000 average sentiment scores over the course of a year,
and plotting the first 10 of them
"""

#loading sentiment data
sentiments = load_sentiments()

#creating wide figure
plt.figure(figsize=(18,6))

#plotting first 10 sentiments
for i in range(10):
    plt.plot(sentiments[i])
    
#rendering 
plt.xlabel('days')
plt.ylabel('sentiment (low to high)')
plt.show()
first 10 examples of sentiment (data synthetically generated by the author)

As you can see, we have many examples of user sentiment, averaged on a per-day basis. We can remove the very low-frequency content, which will remove very long-term average trends (like the average), and we will remove very high-frequency content, which is noise and is unlikely to create useful clusters.

"""
Converting to the frequency domain, removing very low and high frequency content, and plotting the results
We do this, so we can visually understand the frequency content which we deem important, before we begin clustering.
"""

#importing dependencies
from scipy.fft import fft, ifft, fftfreq      #for computing frequency information

#creating wide figure
plt.figure(figsize=(18,6))

#defining the low frequency and high frequency cutoffs
#because lowfq is so low, it effectively only cuts off the wave
#with a frequency of zero, which controls the vertical offset of the data
lowfq = 0.0001 
highfq = 0.05

#plotting first 10 sentiments
for i in range(10):
    
    #getting signal
    sig = sentiments[i]
    
    #calculating the frequency domain
    yf = fft(sig)
    T = 1
    N = len(sig)
    xf = fftfreq(N, T)
    
    #applying naive filter
    yf[np.abs(xf) < lowfq] = 0
    yf[np.abs(xf) > highfq] = 0
    
    #converting back to the time domain (keeping the real part), and plotting
    y = ifft(yf).real
    plt.plot(y)

#rendering
plt.show()
Ultimately, we will be clustering data in the frequency domain. We generate this plot just so we can confirm that we’re preserving the type of content we care about: not too low frequency, and not too high frequency.

Now we’re done with the time domain, and will begin working on building up our clustering in the frequency domain. Let’s look at our filtered frequency domain plots

The frequency content used to construct the waves above

The input to our clustering operation will be a list of amplitudes, each of which corresponds to a specific frequency. We could feed this data to our clustering algorithm, but there is an additional step which can create significant improvements. Imagine we are trying to cluster four simple sin waves, with frequency domain content which looks like this:

a representation of four sin waves, plotted in the frequency domain, for demonstrative purposes

You would expect the waves on the left to cluster together, and the waves on the right to cluster closely together. However, the vectors which describe this data look like this:

[0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0]
[0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0]
[0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0]
[0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0]

From the perspective of t-SNE, all of these waves are equally distant from each other: the vectors are mutually orthogonal, as no two of them share a non-zero value along any axis. We can get around this issue by making the frequency domain “fuzzy”: we can apply a moving average to this data such that frequency content blends into adjacent regions.

our sample data, with an exponential moving average applied in both directions, causing similar frequency content to bleed into one another.
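The effect of the smoothing on distances can be checked directly. The sketch below rebuilds the four example vectors, applies a forward and backward exponential moving average in plain numpy, and compares pairwise Euclidean distances before and after. The helper names are assumptions, and Euclidean distance is used as a simple stand-in for what a clustering or embedding algorithm sees.

```python
import numpy as np

def ema(values, alpha):
    """Exponential moving average: out[0] = values[0], then recursive smoothing."""
    out = np.empty(len(values), dtype=float)
    out[0] = values[0]
    for i in range(1, len(values)):
        out[i] = alpha * values[i] + (1 - alpha) * out[i - 1]
    return out

def smooth_both_ways(values, span=3):
    """Forward EMA plus backward EMA, so a spike bleeds into both neighbours."""
    alpha = 2 / (span + 1)
    return ema(values, alpha) + ema(values[::-1], alpha)[::-1]

def pairwise_distances(rows):
    """Euclidean distance between every pair of rows."""
    return np.linalg.norm(rows[:, None, :] - rows[None, :, :], axis=-1)

# the four vectors from the text, with spikes at indices 2, 4, 11 and 12
spikes = [2, 4, 11, 12]
raw = np.zeros((4, 16))
raw[np.arange(4), spikes] = 1.0

smoothed = np.array([smooth_both_ways(row) for row in raw])

raw_d = pairwise_distances(raw)
smooth_d = pairwise_distances(smoothed)
print(np.round(raw_d, 2))
print(np.round(smooth_d, 2))
```

Before smoothing, every pair of waves is the same distance apart. After smoothing, waves with nearby spikes end up closer together than waves with distant spikes.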

This data is significantly more likely to yield good clustering results, as similar frequency content spikes are more apt to bleed into one another. Let’s apply this concept to our sample plot of sentiment data:

"""
Converting data to the frequency domain, and applying an exponential
moving average in both directions. This is the data we will be clustering.
"""

#importing dependencies
import pandas as pd      #for the exponential moving average

#converting data to frequencies, and filtering out content
wfs = [np.abs(fft(y)[0:N//2][1:20]) for y in sentiments[:10]]

#loading sample data into a pandas dataframe
df = pd.DataFrame(wfs).T

#applying an exponential moving average in both directions, and adding them
df_plt = df.iloc[::-1].ewm(span=3, adjust=False).mean().iloc[::-1]
df_plt = df.ewm(span=3, adjust=False).mean().add(df_plt)

#creating wide figure
plt.figure(figsize=(18,6))
plt.plot(df_plt)
Filtered amplitude over frequency data, for clustering. Keep in mind, there are numerous changes that can be made to this general approach. Different high and low frequencies can be used, different spans of the exponential average can be used, the frequency domain can be normalized such that relative amplitudes are similar, etc.

As a result of our processing steps, this data is significantly more likely to create clusters of data we actually care about. Now we can tie all this together, and create our final cluster:

"""
Converting all the sentiment waveforms to the frequency domain, 
applying filtration, and embedding in 2d with TSNE
"""

#importing dependencies
from sklearn.manifold import TSNE

#converting data to frequencies, and filtering out content, for all product sentiment waveforms
wfs = [np.abs(fft(y)[0:N//2][1:20]) for y in sentiments]

#loading sample data into a pandas dataframe
df = pd.DataFrame(wfs).T

#applying an exponential moving average in both directions, and adding them
df_plt = df.iloc[::-1].ewm(span=3, adjust=False).mean().iloc[::-1]
df_plt = df.ewm(span=3, adjust=False).mean().add(df_plt)

#creating wide figure
plt.figure(figsize=(18,6))

#embedding the data
embedding = TSNE(n_components=2 ,init='random', perplexity=20).fit_transform(df_plt.values.T)

#plotting
plt.scatter(embedding[:,0],embedding[:,1])
t-SNE plot of the filtered frequency domain for all user sentiment product reviews.

And that’s it! Naturally, for a practical application, a lot of work has to be done after this graph is generated. Likely, these clouds of data would have to be explored, and potentially labeled, and further refinement of key parameters would have to be done to gain further insights. For this example, though, we have used the frequency domain to apply a clustering algorithm to time series data, allowing us to see which sentiments oscillate in similar ways. This type of analysis could inform product recommendations within a website, for instance.

3.3) Compression (Intermediate)

(Back To Table of Contents)

Signals contain a lot of data. Sampling at 96,000 samples per second for a few hours yields massive audio files. These raw recordings are useful for high-quality audio processing, but when you’re done and want to send a sample to a friend, you’re willing to sacrifice a bit of audio quality for speed and size. You can down-sample to a point (send fewer samples per second); however, that will limit the maximum pitch of the frequencies you can send (if you’re only sending 200 samples/second you can’t send any frequency higher than 100 Hz).

Instead, you can convert your sample to the frequency domain, compress similar frequencies together, then send the frequency domain along with the sampling rate. The recipient can then rebuild the compressed audio via a transform from the frequency domain to the time domain. This allows you to send high frequencies without needing to send a correspondingly large amount of data. One reason mp3 files, for instance, are so much smaller than .wav files is that their encoding relies heavily on a Fourier-related transform (the modified discrete cosine transform).
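A toy version of this idea is to keep only the largest frequency coefficients and discard the rest. The sketch below is a simplified illustration under assumed data, not how mp3 or any real codec works; the `compress` helper, the synthetic audio, and the choice to keep two coefficients are all assumptions.

```python
import numpy as np
from scipy.fft import rfft, irfft

# Assumed synthetic audio: two tones plus a little noise, one second at 1000 Hz
rng = np.random.default_rng(0)
samplerate = 1000
N = 1000
t = np.arange(N) / samplerate
audio = (np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 120 * t)
         + 0.01 * rng.standard_normal(N))

def compress(signal, keep):
    """Zero every frequency coefficient except the `keep` largest by magnitude."""
    spectrum = rfft(signal)
    compressed = np.zeros_like(spectrum)
    largest = np.argsort(np.abs(spectrum))[-keep:]
    compressed[largest] = spectrum[largest]
    return compressed

# only the non-zero coefficients (and their indices) would need to be sent
compressed = compress(audio, keep=2)
restored = irfft(compressed, n=N)

kept = np.count_nonzero(compressed)
relative_error = np.linalg.norm(audio - restored) / np.linalg.norm(audio)
print(kept, relative_error)
```

Here 1000 samples are described by 2 complex coefficients, and what is lost is essentially the noise. Real audio is far less sparse than this, which is why practical codecs work on short windows and decide what to discard based on what listeners can actually hear.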

4) Conceptual Takeaways for Data Scientists

(Back To Table of Contents)

Using frequency analysis directly as a tool can be vital for solving certain problems, as we’ve seen in previous examples. What often goes unappreciated is the usage of the frequency domain as a concept. As a data scientist, it might be difficult to wrap your brain around self-similar modeling strategies like recurrent and convolutional networks, especially when solving specific, subtle problems. Sometimes, thinking of these problems as a quasi-frequency domain extraction can be more useful.

Convolutional networks, for instance, use wavelets (convolutions) that propagate over data. The result then gets pooled, reducing the resolution of the data, and further wavelets get applied. You can think of convolutions as extracting varying frequencies of information, often from high-frequency information to low-frequency information. Keeping this in mind can lead to a more intuitive understanding of stride, kernel size, and other hyperparameters.
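The link between convolution kernels and frequency content can be seen by taking the Fourier transform of a kernel. The two kernels below are hypothetical hand-written examples, not learned weights from any network: one averages neighbouring values and the other takes their difference.

```python
import numpy as np

# Two hypothetical 1D kernels, chosen for illustration
averaging_kernel = np.array([1 / 3, 1 / 3, 1 / 3])
difference_kernel = np.array([1.0, -1.0])

def frequency_response(kernel, n=64):
    """Magnitude of the kernel's response, from frequency 0 up to the Nyquist frequency."""
    # zero-padding to n points samples the response on a finer frequency grid
    return np.abs(np.fft.rfft(kernel, n=n))

low_pass = frequency_response(averaging_kernel)
high_pass = frequency_response(difference_kernel)

# first entry is the response to constant input, last is the response
# to the fastest possible oscillation
print(low_pass[0], low_pass[-1])
print(high_pass[0], high_pass[-1])
```

The averaging kernel passes slow content and damps fast oscillations, while the difference kernel does the opposite. Seen this way, a learned convolution is a filter that the network tunes to the frequency bands it finds useful.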

5) Summary

(Back To Table of Contents)

In this article we covered the frequency domain, how it relates to signals and sin waves, and saw a few examples of frequency domain representations. We saw how a time-series signal can be converted to the frequency domain, and vice versa, and saw several examples of how, by converting to the frequency domain, several classes of problems can be solved.

Follow For More!

In a future post, I’ll describe how the frequency domain can be applied to higher dimensional signals, like images and video, and how that can be used to great effect in machine learning/data science applications. I’ll also be describing several landmark papers in the ML space, with an emphasis on practical and intuitive explanations.

Attribution: All of the images in this document were created by Daniel Warfield. You can use any images in this post for your own non-commercial purposes, so long as you reference this article, https://danielwarfield.dev, or both.

P.S. — Join me on RoundtableML

RoundtableML is a vibrant community where ambitious and driven individuals come together to collaborate and push the boundaries of ML and AI application in a safe and responsible way. If you're eager to expand your knowledge of ML, engage in open research, dive into scientific papers, and work on ML projects within small intimate groups, this is the place for you!

You can join using this discord invite.
