Jacob Bergdahl

Summary

GitHub's AI-powered code-generating tool, Copilot, is facing backlash due to concerns over potential copyright infringement and unethical data collection practices.

Abstract

GitHub recently introduced its AI-powered code-generating tool, Copilot, which can write code by itself and suggests autocompletions. However, the AI was trained on billions of lines of code from GitHub repositories, raising concerns about potential copyright infringement and the inability to differentiate between original and copied code. Some users have already abandoned GitHub over these concerns, and there are ongoing debates about the legality and ethics of using Copilot.

Bullet points

  • GitHub's new product, Copilot, is an AI-powered code-generating tool that can write code by itself and suggests autocompletions.
  • Copilot was trained on billions of lines of code from GitHub repositories.
  • Users cannot differentiate between original and copied code generated by Copilot.
  • Copilot has been criticized for potential copyright infringement and unethical data collection practices.
  • Some users have abandoned GitHub due to these concerns.
  • Debates are ongoing about the legality and ethics of using Copilot.
GitHub describes its new product Copilot as an “AI pair programmer.” However, pair programming doesn’t usually involve stealing licensed code, does it? Photo by Christina Morillo.

GitHub’s AI Copilot Might Get You Sued If You Use It

Some are even abandoning GitHub because of it

GitHub just announced its latest shiny product: an artificial intelligence (AI) called Copilot. It’s machine learning-powered software that can write code by itself, generating quite impressive functions. Yet, it has people pulling out of GitHub and worrying about lawsuits.

The AI works similarly to other OpenAI-powered code-generating tools. The user writes a comment describing what they want the AI to write, and the AI makes it happen. What makes Copilot unique is that it also takes the initiative on its own, suggesting autocompletions on the fly.
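To make that workflow concrete, here is a small illustrative sketch in Python. The comment is the kind of prompt a user might type, and the function beneath it is the kind of completion such a tool could plausibly suggest. Both are made up for this example (the name `days_between` is hypothetical), and none of it is actual Copilot output.

```python
from datetime import date


# Prompt a user might type as a comment:
# return the number of days between two ISO dates, e.g. "2021-06-29"
def days_between(start: str, end: str) -> int:
    # Hypothetical completion, hand-written for illustration only;
    # it was not generated by Copilot.
    first = date.fromisoformat(start)
    second = date.fromisoformat(end)
    return abs((second - first).days)


print(days_between("2021-06-29", "2021-07-06"))  # prints 7
```

The point is only the shape of the interaction: a plain-language comment goes in, and a ready-made function comes out.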

It sounds really cool, doesn’t it? If you know me, you know that I’m often excited about artificial intelligence; I even published a book wherein I described technologies similar to Copilot. But there are plenty of issues surrounding machine learning, and GitHub is experiencing these dilemmas already on day one. Usually, the source of the drama for any machine learning application lies in its data, and the outcry surrounding Copilot follows that rule. More specifically, in the case of Copilot, the problem lies in how GitHub went about gathering the data to build the algorithm.

“Unfortunately, the user has no way of knowing if the algorithm made a particular piece of code up by itself or stole it from a code repository protected by a license.”

Like any machine learning algorithm, Copilot has learned how to do its thing (write code) by being fed examples of things (code) that work. According to GitHub itself, this AI has been trained on billions of lines of code from GitHub repositories. And so, when Copilot writes code for a user, it draws from those billions of lines.

Unfortunately, the user has no way of knowing if the algorithm made a particular piece of code up by itself or stole it from a code repository protected by a license.

Oh, and when I say stole, I mean it.

One software engineer posted a picture on Twitter showing a piece of code that Copilot generated when it was asked to write an “About me” page. Comically, the code is ripped straight from the page of a real person.

In another humorously awful showcase of Copilot, a user uploaded a GIF showing the AI writing a function ripped straight from the repository for the video game Quake III Arena. It even includes the original comments.

See, this is the fundamental problem with Copilot. It’s impossible to tell what code it has figured out by itself and what code it has downright copy-pasted from a different source.

Another Twitter user, whose post gathered much attention, called out the software for being a means to launder open-source code into commercial works.

The GitHub terms of service state that people who use its platform “[…] grant us and our legal successors the right to store, archive, parse, and display Your Content, and make incidental copies, as necessary to provide the Service, including improving the Service over time.” However, this might not have been exactly what people had in mind when they signed up for the service.

“Personally, I believe in a future wherein we use machine learning-powered assistants to write code faster. But this isn’t it.”

In its FAQ, GitHub claims that “[…] the code you create with GitHub Copilot’s help belongs to you.” However, on the popular programmer site Hacker News, people are arguing that Copilot is infringing copyright. The AI should only be allowed to use code that its original owners have made available for commercial use, yet it’s clearly using all manner of code, regardless of licenses.

In another thread on Hacker News, people are expressing concerns that using the tool might get you sued, as you might unknowingly use copyrighted code. One user calls Copilot a “legal ticking time bomb,” while another adds a personal anecdote: “I run product security for a large enterprise, and I’ve already gotten the ball rolling on prohibiting copilot […].”

This has all proven enough for some to abandon GitHub.

Personally, I believe in a future wherein we use machine learning-powered assistants to write code faster. But this isn’t it. There are far too many concerns with the manner in which the data was collected and used in the case of Copilot.

I anticipate that we will see the launch of many similar services in the future, but none will see true success until they are built ethically and sensibly.

