GitHub’s Commercial AI Tool Was Built From Open Source Code

Copilot is pitched as a helpful aid to developers. But some programmers object to the blind copying of blocks of code used to train the algorithm.

Earlier this month, Armin Ronacher, a prominent open-source developer, was experimenting with a new code-generating tool from GitHub called Copilot when it began to produce a curiously familiar stretch of code. The lines, drawn from the source code of the 1999 video game Quake III, are infamous among programmers—a combo of little tricks that add up to some pretty basic math, imprecisely. The original Quake coders knew they were hacking. “What the fuck,” one commented in the code beside an especially egregious shortcut.
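For readers curious about what those "little tricks" look like, here is a minimal Python sketch of the widely documented fast inverse square root technique. It is not the original C source and not what Copilot emitted; the function name `fast_inv_sqrt` is made up for illustration. The idea is to reinterpret a 32-bit float's bits as an integer, subtract half of that integer from a well-known magic constant to get a rough guess, and then refine the guess with one step of Newton's method.

```python
import struct

def fast_inv_sqrt(x: float) -> float:
    """Approximate 1 / sqrt(x) for positive x (illustrative re-expression)."""
    # Reinterpret the 32-bit float's bit pattern as an unsigned integer.
    i = struct.unpack("<I", struct.pack("<f", x))[0]
    # Magic constant minus half the bits gives a rough first guess.
    i = 0x5F3759DF - (i >> 1)
    # Reinterpret the integer bits as a 32-bit float again.
    y = struct.unpack("<f", struct.pack("<I", i))[0]
    # One Newton-Raphson step tightens the estimate.
    return y * (1.5 - 0.5 * x * y * y)

approx = fast_inv_sqrt(4.0)  # close to, but not exactly, 0.5
```

The result lands near the true value without matching it, which is the "imprecisely" part: the shortcut trades a small error for speed.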

So it was strange for Ronacher to see such code generated by Copilot, an artificial intelligence tool that is marketed to generate code that is both novel and efficient. The AI was plagiarizing—copying the hack (including the profane comment) verbatim. Worse yet, the code it had chosen to copy was under copyright protection. Ronacher posted a screenshot to Twitter, where it was entered as evidence in a roiling trial-by-social-media over whether Copilot is exploiting programmers’ labor.

Copilot, which GitHub calls “your AI pair programmer,” is the result of a collaboration with OpenAI, the formerly nonprofit research lab known for powerful language-generating AI models such as GPT-3. At its heart is a neural network that is trained using massive volumes of data. Instead of text, though, Copilot’s source material is code: millions of lines uploaded by the 65 million users of GitHub, the world’s largest platform for developers to collaborate and share their work. The aim is for Copilot to learn enough about the patterns in that code that it can do some hacking itself. It can take the incomplete code of a human partner and finish the job. For the most part, it appears successful at doing so. GitHub, which was purchased by Microsoft in 2018, plans to sell access to the tool to developers.

To many programmers, Copilot is exciting because coding is hard. While AI can now generate photo-realistic faces and write plausible essays in response to prompts, code has been largely untouched by those advances. An AI-written text that reads strangely might be embraced as “creative,” but code offers less margin for error. A bug is a bug, and it means the code could have a security hole or a memory leak, or more likely that it just won’t work. But writing correct code also demands a balance. The system can’t simply regurgitate verbatim code from the data used to train it, especially if that code is protected by copyright. That’s not AI code generation; that’s plagiarism.

GitHub says Copilot’s slip-ups are only occasional, but critics say the blind copying of code is less of an issue than what it reveals about AI systems generally: Even if code is not copied directly, should it have been used to train the model in the first place? GitHub has been unclear about precisely which code was involved in training Copilot, but it has clarified its stance on the principles as the debate over the tool has unfolded: All publicly available code is fair game regardless of its copyright.

That hasn’t sat well with some GitHub users who say the tool both depends on their code and ignores their wishes for how it will be used. The company has taken both free-to-use and copyrighted code and “put it all in a blender in order to sell the slurry to commercial and proprietary interests,” says Evelyn Woods, a Colorado-based programmer and game designer whose tweets on the topic went viral. “It feels like it’s laughing in the face of open source.”

AI tools bring industrial scale and automation to an old tension at the heart of open source programming: Coders want to share their work freely under permissive licenses, but they worry that the chief beneficiaries will be large businesses that have the scale to profit from it. A corporation takes a young startup’s free-to-use code to corner a market or uses an open source library without helping with the maintenance. Code-generating AI systems that rely on large data sets mean everyone’s code is potentially subject to reuse for commercial applications.

“I’m generally happy to see expansions of free use, but I’m a little bitter when they end up benefiting massive corporations who are extracting value from smaller authors’ work en masse,” Woods says.

One thing that’s clear about neural networks is that they can memorize their training data and reproduce copies. That risk is there regardless of whether that data involves personal information or medical secrets or copyrighted code, explains Colin Raffel, a professor of computer science at the University of North Carolina who coauthored a preprint (not yet peer-reviewed) examining similar copying in OpenAI’s GPT-2. Getting the model, which is trained on a large corpus of text, to spit out training data was rather trivial, they found. But it can be difficult to predict what a model will memorize and copy. “You only really find out when you throw it out into the world and people use and abuse it,” Raffel says. Given that, he was surprised to see that GitHub and OpenAI had chosen to train their model with code that came with copyright restrictions.

According to GitHub’s internal tests, direct copying occurs in roughly 0.1 percent of Copilot’s outputs—a surmountable error, according to the company, and not an inherent flaw in the AI model. That’s enough to cause a nit in the legal department of any for-profit entity (“non-zero risk” is just “risk” to a lawyer), but Raffel notes this is perhaps not all that different from employees copy-pasting restricted code. Humans break the rules regardless of automation. Ronacher, the open source developer, adds that most of Copilot’s copying appears to be relatively harmless—cases where simple solutions to problems come up again and again, or oddities like the infamous Quake code, which has been (improperly) copied by people into many different codebases. “You can make Copilot trigger hilarious things,” he says. “If it’s used as intended I think it will be less of an issue.”

GitHub has also indicated it has a possible solution in the works: a way to flag those verbatim outputs when they occur so that programmers and their lawyers know not to reuse them commercially. But building such a system is not as simple as it sounds, Raffel notes, and it gets at the larger problem: What if the output is not verbatim, but a near copy of the training data? What if only the variables have been changed, or a single line has been expressed in a different way? In other words, how much change is required for the system to no longer be a copycat? With code-generating software in its infancy, the legal and ethical boundaries aren’t yet clear.
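A small Python sketch shows why exact matching falls short. This is not GitHub's method, which the company has not described; the helper names `normalize` and `similarity` are hypothetical, and the approach (renaming identifiers before comparing token streams) is just one simple assumption about how a near-copy check might start.

```python
import difflib
import io
import keyword
import tokenize

# Token types that carry no program meaning for this comparison.
SKIPPED = {tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE, tokenize.ENDMARKER}
BLOCK = {tokenize.INDENT, tokenize.DEDENT}

def normalize(source):
    """Return the token list with identifiers renamed to ID0, ID1, ..."""
    names = {}
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in SKIPPED:
            continue
        if tok.type in BLOCK:
            # Record block structure, not the exact whitespace used.
            out.append(tokenize.tok_name[tok.type])
        elif tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            # Same identifier always maps to the same placeholder.
            out.append(names.setdefault(tok.string, f"ID{len(names)}"))
        else:
            out.append(tok.string)
    return out

def similarity(a, b):
    """Ratio in [0, 1] of how alike two normalized token streams are."""
    return difflib.SequenceMatcher(None, normalize(a), normalize(b)).ratio()

original = "def area(width, height):\n    return width * height\n"
renamed = "def size(w, h):\n    return w * h\n"
changed = "def area(width, height):\n    return width + height\n"

is_verbatim = original == renamed               # False: the text differs
is_renamed_copy = similarity(original, renamed) == 1.0  # True after renaming
partial = similarity(original, changed)         # below 1.0: one operator differs
```

A verbatim check misses the renamed copy entirely, while the normalized comparison flags it as identical. Where to set the threshold for "too similar" between those extremes is exactly the open question Raffel raises.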

Many legal scholars believe AI developers have fairly wide latitude when selecting training data, explains Andy Sellars, director of Boston University’s Technology Law Clinic. “Fair use” of copyrighted material largely boils down to whether it is “transformed” when it is reused. There are many ways of transforming a work, like using it for parody or criticism or summarizing it—or, as courts have repeatedly found, using it as the fuel for algorithms. In one prominent case, a federal court rejected a lawsuit brought by a publishing group against Google Books, holding that its process of scanning books and using snippets of text to let users search through them was an example of fair use. But how that translates to AI training data isn’t firmly settled, Sellars adds.

It’s a little odd to put code under the same regime as books and artwork, he notes. “We treat source code as a literary work even though it bears little resemblance to literature,” he says. We may think of code as comparatively utilitarian; the task it achieves is more important than how it is written. But in copyright law, the key is how an idea is expressed. “If Copilot spits out an output that does the same thing as one of its training inputs does—similar parameters, similar result—but it spits out different code, that’s probably not going to implicate copyright law,” he says.

The ethics of the situation are another matter. “There’s no guarantee that GitHub is keeping independent coders’ interests to heart,” Sellars says. Copilot depends on the work of its users, including those who have explicitly tried to prevent their work from being reused for profit, and it may also reduce demand for those same coders by automating more programming, he notes. “We should never forget that there is no cognition happening in the model,” he says. It’s statistical pattern matching. The insights and creativity mined from the data are all human. Some scholars have said that Copilot underlines the need for new mechanisms to ensure that those who produce the data for AI are fairly compensated.

GitHub declined to answer questions about Copilot and directed me to an FAQ about the system. In a series of posts on Hacker News, GitHub CEO Nat Friedman responded to the developer outrage by projecting confidence about the fair use designation of training data, pointing to an OpenAI position paper on the topic. GitHub was “eager to participate” in coming debates over AI and intellectual property, he wrote.

Ronacher says that he expects advocates of free software to defend Copilot—and indeed, some already have—out of concern that drawing limits on fair use could jeopardize the free sharing of software more broadly. But it’s unclear if the tool will spark meaningful legal challenges that clarify the fair use issues anytime soon. The kind of tasks people are tackling with Copilot are mostly boilerplate, Ronacher points out—unlikely to run afoul of anyone. But for him, that’s part of why the tool is exciting, because it means automating away annoying tasks. He already uses permissive licenses whenever he can in the hopes that other developers will pluck out whatever is useful, and Copilot could help automate that sharing process. “An engineer shouldn’t waste two hours of their life implementing a function I’ve already done,” he says.

But Ronacher can see the challenges. “If you’ve spent your life doing something, you expect something for it,” he says. At Sentry, a debugging software startup where he is director of engineering, the team recently tightened some of its most permissive licenses—with great reluctance, he says—for fear that “a large company like Amazon could just run away with our stuff.” As AI applications advance, those companies are poised to run faster.

