The Plague of Plagiarism in Data Science
What can you really trust anymore?
My work was recently plagiarized. Someone directly copied all of the code from one of my videos. They made it into an article on this very platform (a different publication than TDS) without any attribution. After talking to a few other content creators in data science, I quickly discovered that this is an exceedingly common problem that we all face.
At first I was furious. How could someone take my hard work and claim it as their own? After taking some time to reflect and calm down, I reached out to this “perpetrator”. Believe it or not, this person didn’t have a clue that what he did was “wrong”. Strangely enough, I believe him.
After a brief conversation with a few choice words (just kidding), I realized that the fiasco was a symptom of a much larger underlying issue.
Beyond the academic realm, the definition of plagiarism in the data science domain is quite opaque. It is common practice to copy code, to use others’ work, etc. How do we know whether we are plagiarizing?

If you prefer a video format: https://www.youtube.com/watch?v=vQGJvmhpq_s&ab_channel=KenJee
Defining the Problem
What is plagiarism? A simple Wikipedia search turns up this result:
“Plagiarism is the representation of another author’s language, thoughts, ideas, or expressions as one’s own original work.”
By this definition, you could probably make the case that almost everything out there is stolen at this point. In data science, it’s even considered efficient and smart to copy and paste snippets of other contributors’ code or functions. After all, thousands of others are doing it! Also, it would be a tremendous (and unnecessary) pain to cite every line of code that you write.
However, copying is one of the main reasons why there is so much ambiguity around plagiarism in data science to begin with. I copy quite a bit of code for most of the projects that I share. I even copied someone else’s entire web scraper in my “Data Science Project from Scratch” series on YouTube.
Now you ask, why is what I have done NOT plagiarism?
The Quick Fix
The first key to avoiding the “P” word is to cite the resources you use. In my case, I clearly cited where I found the code and who it was from in the ReadMe and video description.
So when do you not have to cite? The general rule of thumb is that if the code is simple enough to be found in the documentation of a library, you probably don’t have to cite it.
If it’s more complex, you should at least note where you got it from and err on the side of caution. Others greatly appreciate it when you share their work, and they might even help you debug it if you find errors! Also, it doesn’t reflect poorly on you. If anything, it makes you look all the wiser for being resourceful and working together.
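To make that rule of thumb concrete, here is a minimal sketch of what an in-file citation might look like. Everything in it is hypothetical: the helper name parse_salary_range, the placeholder author and project, and the example.com URL are stand-ins I made up for illustration, not details from any real project.

```python
"""Salary-parsing helper (hypothetical example).

Attribution (placeholders, replace with the real details):
    Adapted from: <original project name> by <original author>
    Source: https://example.com/original-repo
    Changes: strips commas and handles the "K" suffix
"""
import re


def parse_salary_range(text):
    """Return (low, high) in dollars from a string like "$80K-$120K"."""
    # Drop thousands separators so "95,000" is read as one number.
    cleaned = text.replace(",", "")
    # Find every number, optionally followed by a K suffix (e.g. "80K").
    matches = re.findall(r"(\d+(?:\.\d+)?)\s*([kK]?)", cleaned)
    values = [float(num) * (1000 if suffix else 1) for num, suffix in matches]
    if not values:
        raise ValueError(f"no salary found in {text!r}")
    return min(values), max(values)


# The re.findall call on its own is documentation-level code and would not
# need a citation; the adapted helper as a whole does, so the credit sits
# at the top of the file where a reader will see it first.
print(parse_salary_range("$80K-$120K"))
```

The same few lines (who, where, what you changed) work just as well in a README or a video description. The point is only that the credit travels with the code.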
A Rose by a Different Name
How else can you sidestep the shade I will throw at you on the internet if you plagiarize? (Once again, just kidding.) Differentiate your work substantially from the other projects. The two easiest ways to do this are to use a completely different data source or to take a different algorithmic approach on the same dataset.
Again, if you’re only making small changes to the data or the algorithms, you should just cite where you got the source material. Honestly, even if you found inspiration elsewhere, you might as well give a shoutout.
For example, say you replicated my Glassdoor.com project, where I analyze the salaries of data scientists, but built it for a different country or job position. In this case, the core work would not be viewed as substantially different. Therefore, you should at least cite where you got the framework from.
If you were doing a similar analysis on baseball player salaries, I would argue that this is unique enough. Even if you used a similar algorithmic approach, the data would be scraped from an entirely different source and would carry additional nuance.
Why This Matters
After hearing what others have to say, I agree that it’s not really about the personal credit. It’s truly about the principle. The integrity of the process.
As someone who has had to screen and interview data scientist candidates, how can I trust that the person sitting in front of me has actually done the work themselves? This is where the system breaks down for those who are publishing great work and taking risks.
To end on a cautionary tale, there are negative consequences involved. Plagiarism is grounds for termination, and can even impact future employability. Not to mention the potential legal ramifications.
Take the example of Siraj Raval. As a fellow YouTuber, he is someone I really looked up to. Copying others’ work caused irreparable damage to his career. It directly impacted his brand and finances. Since then, it appears as though he’s turned a corner. But as we know, repairing an image takes time and significant effort.
The Root of Some Evil
We should have empathy for those who may mistakenly fall under this label. It truly is a gray area, and it may originate from a place of personal doubt or imposter syndrome. In a world that is increasingly cutthroat, it’s tough to stand out. I get that.
However, employers will see you as a leader if you are able to bring the big picture together and give others credit. It’s a key management skill that will pave the way for even greater opportunities. Citing your sources will open doors to personal and career success, as you proudly broadcast your collaborations with your peers and community. After all, open source exists to make data science better, faster, and stronger. Cheers to a plague-free future.
As for the person who copied my work, I hold no ill will. He was quite young, and I truly hope he learns from this mistake.
Ideas are my own, words are written by my good friend Sid Khaitan. He’s a “ghostwriter” that transforms my concepts into compelling stories. Without him, these would probably be less fun. If you’re wondering if he wrote this part too, he definitely did.
Pick up some tips and read his work here: @SidKtan