Audio Processing for Detecting Piracy on TikTok

As is the case with most new tech, music streaming services like YouTube and Spotify paved the way for cool new problems in the industry.

While media platforms lower the barrier of entry for small artists, they also allow free users to pirate songs and profit off of them as if they were original. YouTube and its most profitable partners, including major music labels, have discovered that only AI solutions can keep up with the sheer scale of online piracy. Unfortunately, not all platforms have caught up to this realization, and not all artists are protected to the extent of those under major labels.

The purpose of this article is to demonstrate that we have a serious piracy problem on social media, particularly on TikTok. By allowing users to mislabel songs as original, TikTok fails in its attempt to promote artists and instead reinforces “The Great Divide” in the music industry.

Despite the massive growth of streaming revenue, artists continue to be shortchanged in their individualized earnings.

Let’s get into some more specific data to see why I’m so fired up about this.

How TikTok Encourages Piracy

I found this dataset on Kaggle which we’ll use as our reference point for an average TikTok feed. If you take a look, you’ll see that most of the audio clips are outdated. (Admittedly, I’m totally unaware of what’s actually trending right now. But that’s not the point.)

If you’re posting a TikTok, you’ll find that you can flag an audio clip as an “Original sound” even if you had nothing to do with the creation of that sound. The idea behind this is that you might have combined sound effects to create your own “new” sound, but the obvious drawback is that you’ll be the only person credited for the entire clip. This just isn’t fair, and it doesn’t make business sense for an app which should be promoting musicians.

I didn’t change a single thing about the original clip… so I shouldn’t be able to take full credit.

From our sample TikTok feed, can we determine what was correctly labeled as “Original” and what was stolen from the actual artist?

How To Define Piracy

Unfortunately, the solution isn’t as simple as holding up our phone to “Shazam” our TikTok audio. Many clips have been remixed or otherwise morphed in ways that make them unrecognizable for pattern-matching services like Shazam.

In fact, this raises a much more complicated problem with art, which is that the “inspiration” behind any piece of work exists on a spectrum from purely independent to a word-for-word copy of any other work. And unfortunately, no prior business model has been able to fully trace back a song’s origins.

Artists can copy chords, melodies, and entire instrumental or vocal samples, but only songs which top the charts are ever recognized as stolen. Case in point, Katy Perry was sued for her work “Dark Horse” because a specific melody was supposedly taken from a Flame song. In his reaction to this case, YouTuber Adam Neely found the same melody to be used across centuries of songwriting in his video “Why the Katy Perry/Flame lawsuit makes no sense.”

I bring up this case to reinforce the idea that artistic inspiration is complicated, and people will always disagree as to where their art came from. However, some cases are less controversial than others. For instance, our Kaggle dataset references a TikTok audio clip which is essentially a remix of an Ed Sheeran song. But instead of crediting Ed Sheeran or the remixing artist, the TikToker who copied the audio called it theirs, and there’s currently no infrastructure in place to find the original creators.

We should use every tool we have to recognize at least the most obvious cases of piracy, like those in which samples are stolen word-for-word for the benefit of another creator. For me, those tools come in the form of basic machine learning and audio processing Python libraries.

(As the industry advances, we should also consider new kinds of databases — blockchain comes to mind, especially considering TikTok’s recent partnership with Audius and Spotify’s partnership with Musixmatch, who in 2020 had openings for blockchain engineers.)

How To Detect Piracy

First let’s take a deeper dive into the TikTok data we found on Kaggle. The dataset consists of video and audio files, along with metadata stored in a JSON file. Below is a screenshot of the metadata for one such TikTok:

Only TikToks with audio labeled as original will be further inspected for piracy.

Note that under the musicMeta tab, the musicOriginal flag is marked false. The owner of the TikTok account properly credited the song as not original, and instead referenced an ID with the associated song title and performer’s name. This TikTok is not the focus of our project, but instead is an example of the business model working effectively.
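
As a rough sketch of that filtering step, the snippet below keeps only the TikToks whose audio was flagged as original. The musicMeta and musicOriginal names come from the metadata described above; everything else (a JSON list of records, the id and musicName keys, the sample values) is a hypothetical stand-in, since the real file may be laid out differently.

```python
import json


def has_original_audio(tiktok: dict) -> bool:
    """Return True when the uploader flagged the audio as an "Original sound"."""
    # musicMeta / musicOriginal are the field names from the metadata above.
    # A missing or empty musicMeta is treated as "not original".
    music_meta = tiktok.get("musicMeta") or {}
    return bool(music_meta.get("musicOriginal", False))


def select_candidates(tiktoks: list) -> list:
    """Keep only the TikToks worth inspecting for piracy."""
    return [t for t in tiktoks if has_original_audio(t)]


# Hypothetical records for illustration only (not taken from the Kaggle data).
sample = json.loads("""
[
  {"id": "a", "musicMeta": {"musicOriginal": false, "musicName": "Some Song"}},
  {"id": "b", "musicMeta": {"musicOriginal": true, "musicName": "original sound"}},
  {"id": "c"}
]
""")

candidates = select_candidates(sample)
print([t["id"] for t in candidates])
```

Treating a missing flag as "not original" is a conservative choice: it keeps the candidate list limited to clips where someone explicitly claimed the sound as their own.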

I also chose to only consider music samples, not any of the spoken-word audio clips on TikTok. This means I missed out on piracy from TV shows and movies. We could later use a similar process to expand our search for pirated content.

Music (left) is pretty easy to recognize from speech (right), due to quantization and the formal structure of music. It’s a little harder to explain this to a computer, though.

To split music from speech, I wrote a quick Python script using librosa and scikit-learn to teach a computer the difference between the two, based on spectrograms like the ones above. I’ll leave the code as a fun exercise for you. Feel free to reach out if you have any questions about this.

1. Train a binary classification model using two separate datasets: one for music and one for speech. These audio files are of different lengths, so feel free to select 100 random frames to define each sample.
2. Test the model using a reasonable train-test split (e.g. 75–25) to validate on the two datasets I provided. I compared four classification models (RandomForest, DecisionTree, GaussianNB, and SupportVector), and all four reached accuracy scores of 93–99%. Good enough!
3. Then, deploy the model on the TikTok audio and manually check the results to ensure cross-compatibility across datasets. Again, I found that all my models worked, and I went with the best one (SupportVector).
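
If you want a starting point for the exercise, here is one possible sketch of the three steps above. It is not my original script. The mel-spectrogram settings, the flattening of 100 sampled frames into one feature vector, and the default hyperparameters are all assumptions, and the demo at the bottom runs on synthetic noise rather than real audio, so its scores say nothing about the 93–99% figures.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

N_FRAMES = 100  # frames sampled per clip, as suggested in the first step


def sample_frames(spec, rng, n_frames=N_FRAMES):
    """Pick n_frames random columns of a (bands, frames) spectrogram and flatten them."""
    # Sample with replacement only when the clip is shorter than n_frames.
    replace = spec.shape[1] < n_frames
    idx = np.sort(rng.choice(spec.shape[1], size=n_frames, replace=replace))
    return spec[:, idx].flatten()


def clip_features(path, rng):
    """Turn an audio file into a fixed-length feature vector."""
    import librosa  # imported lazily so the demo below runs without audio dependencies

    y, sr = librosa.load(path, sr=22050)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
    return sample_frames(librosa.power_to_db(mel, ref=np.max), rng)


def compare_models(X, y, seed=0):
    """Fit four classifiers on a 75-25 split and return their test accuracy."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed, stratify=y
    )
    models = {
        "RandomForest": RandomForestClassifier(random_state=seed),
        "DecisionTree": DecisionTreeClassifier(random_state=seed),
        "GaussianNB": GaussianNB(),
        "SupportVector": SVC(),
    }
    return {
        name: model.fit(X_train, y_train).score(X_test, y_test)
        for name, model in models.items()
    }


# Synthetic stand-ins for spectrograms (NOT real audio): 1 = "music", 0 = "speech".
rng = np.random.default_rng(0)
music = [rng.normal(-20, 5, size=(64, 300)) for _ in range(20)]
speech = [rng.normal(-40, 5, size=(64, 150)) for _ in range(20)]
X = np.array([sample_frames(s, rng) for s in music + speech])
y = np.array([1] * 20 + [0] * 20)

scores = compare_models(X, y)
for name, acc in scores.items():
    print(f"{name}: {acc:.2f}")
```

With real data, you would build X by calling clip_features on each file in the music and speech datasets, then reuse the fitted model of your choice to label the TikTok audio.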

Now that I had a complete dataset of supposedly original music, I could finally check whether any music was a word-for-word copy of an already-existing song. I first attempted this locally, using Spleeter to isolate vocals and SpeechRecognition to transcribe the audio, but since I didn’t get accurate results, I migrated to a cloud service, Oracle Speech AI, for speech-to-text, which gave much better results.
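
For reference, the local attempt could look roughly like the sketch below. Treat the details as assumptions: it uses Spleeter's two-stem model, relies on Spleeter's default output layout (a folder per clip containing vocals.wav), and picks SpeechRecognition's recognize_google backend, which may not be the recognizer I actually tried. The helper names are made up for illustration.

```python
from pathlib import Path


def vocals_path(audio_path, out_dir):
    """Where Spleeter's default layout puts the vocal stem: <out_dir>/<clip name>/vocals.wav."""
    return Path(out_dir) / Path(audio_path).stem / "vocals.wav"


def isolate_vocals(audio_path, out_dir):
    """Split a clip into vocals and accompaniment, returning the path to the vocals."""
    from spleeter.separator import Separator  # lazy import: heavy optional dependency

    separator = Separator("spleeter:2stems")  # assumption: the two-stem model is enough
    separator.separate_to_file(str(audio_path), str(out_dir))
    return vocals_path(audio_path, out_dir)


def transcribe(wav_path):
    """Transcribe a WAV file, returning an empty string when nothing is intelligible."""
    import speech_recognition as sr  # lazy import: optional dependency

    recognizer = sr.Recognizer()
    with sr.AudioFile(str(wav_path)) as source:
        audio = recognizer.record(source)
    try:
        # Assumption: the free Google Web Speech backend; other recognize_* methods exist.
        return recognizer.recognize_google(audio)
    except sr.UnknownValueError:
        return ""


# Path handling only; the separation and transcription calls need the libraries installed.
print(vocals_path("clips/abc123.mp3", "stems"))
```

Sung vocals are a hard case for recognizers trained on speech, which is consistent with the weak results I saw locally.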

Although not perfect, cloud speech-to-text services work well for highly unstructured data like music.

Finally, I used googlesearch to determine if a lyrics page existed with a near match. If there was a match, I searched the song using the Spotify API to get the exact ID of the song used.
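
The search and Spotify calls depend on API keys and rate limits, so I won't reproduce them here. What can be sketched is the "near match" check itself: given a noisy transcript and the text of a candidate lyrics page, decide whether they are close enough. The version below uses difflib from the standard library, which is a stand-in rather than the exact method I used. The 0.8 threshold and the example lyrics are invented for illustration.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase, strip punctuation, and split into words."""
    return re.sub(r"[^a-z0-9\s]", "", text.lower()).split()


def best_window_ratio(transcript, lyrics):
    """Highest word-level similarity between the transcript and any
    equally long window of the lyrics (0.0 = nothing shared, 1.0 = identical)."""
    t_words, l_words = normalize(transcript), normalize(lyrics)
    if not t_words or not l_words:
        return 0.0
    n = len(t_words)
    if len(l_words) <= n:
        return SequenceMatcher(None, t_words, l_words).ratio()
    return max(
        SequenceMatcher(None, t_words, l_words[start:start + n]).ratio()
        for start in range(len(l_words) - n + 1)
    )


def is_near_match(transcript, lyrics, threshold=0.8):
    """Treat the clip as a likely copy when the similarity clears the threshold."""
    return best_window_ratio(transcript, lyrics) >= threshold


# Invented lyrics and transcripts, with a deliberate transcription error ("rivr").
lyrics = "We were walking down the river road at night, under a silver moon"
noisy = "walking down the rivr road at night"
unrelated = "completely different words about something else"

print(is_near_match(noisy, lyrics))
print(is_near_match(unrelated, lyrics))
```

Sliding a window over the lyrics matters because a TikTok clip usually covers only a few lines of a song, so comparing the transcript against the full lyrics would drag the score down even for an exact copy.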

Each of these TikToks should have been linked to their respective songs on Apple Music or Spotify. Since they weren’t linked, I’ve gone ahead and done it for them. Get these artists paid!

Pretty easily, I found a concerning truth: around 60% of the “original” songs in this dataset were, in fact, not original! Unfortunately, this means that the artists behind roughly 60% of these clips were not credited, let alone paid, for their work.

What’s Next?

Although I don’t expect TikTok to make any immediate changes after I post this article, I do suspect that it will have to incorporate such a system to catch bad actors. In fact, TikTok recently closed a deal with Oracle Cloud to keep user data on U.S. soil, so maybe they’ll use the same services to detect piracy. Until then, I’ll be working on a music language model of my own to get better transcription results.

Thanks for following along!
