Microsoft and GitHub have tried again to get rid of a lawsuit over alleged code copying by GitHub’s Copilot programming suggestion service, arguing that generating similar code isn’t the same as reproducing it verbatim.
The duo’s latest motion to dismiss [PDF], filed on Thursday, follows an amended complaint [PDF] from the plaintiffs – software developers who claim that Copilot and its underlying OpenAI Codex large language model have violated federal copyright and state business laws. The aggrieved developers argue that Copilot has been configured to generate code suggestions that are similar or identical to its training data.
Copilot and Codex were trained on tons of publicly available source code, including the plaintiffs’ GitHub repositories, and other materials. When presented with a prompt by a user, these AI models will generate code snippets in response, using the materials they learned from.
The issue for the plaintiffs is that Copilot incorporates copies of their code and can be coaxed to reproduce their work, or something similar, without including or taking into account the required software license details – Copyright Management Information (CMI) in the context of the law.
In short, Copilot, it is claimed, may emit code it learned from something someone else wrote, or something close to it, without giving proper credit or following the original license.
Microsoft and GitHub say that the plaintiffs’ argument is fatally flawed because it fails to articulate any instances of actual code cloning – something that cannot be verified by anyone outside the case, since the code examples in public documents have been redacted to prevent the authors from being identified.
“As this court found, plaintiffs failed to allege that Copilot had ever actually generated any suggestion reproducing their code, leaving plaintiffs uninjured and therefore without standing to pursue damages,” the defendant companies argued. “Lacking real-life instances of harm, plaintiffs now try to manufacture some.”
The tech giants say that the plaintiffs, being unable to get Copilot to emit an exact copy of copyrighted code, produced examples of variations on their code, as would be expected from an AI model trained to recognize functional concepts and then generate suggestions reflecting that training.
The argument here is that the plaintiffs want their copyright claim to cover not just copied code but also similar, “functionally equivalent” code. However, as the defendants point out, copyright protection covers expression but not function (ideas, procedures, math concepts, etc.).
Thus, the pair argue that the plaintiffs’ claim focusing on the functional equivalency of code does not work under Section 1202(b) of America’s Digital Millennium Copyright Act. That portion of the law forbids the removal or alteration of CMI – the software license details in this case – or the distribution of copyrighted content when it’s known that the CMI has been removed.
Section 1202(b) “is about identical ‘copies … of a work’ – not about stray snippets and adaptations,” the defendants’ motion says.
Microsoft and GitHub also take issue with the complaint’s assertion that the corporations are liable for creating a derivative work simply through the act of AI model training. The plaintiffs made claims of unjust enrichment and negligence – under California state law – alleging that the creation of Codex and Copilot unfairly used their licensed code on GitHub.
According to the two companies, this is fundamentally a copyright claim and federal law preempts related claims under state law. Moreover, they contend that the plaintiffs “fail to allege any cognizable injury to them that would result from the mere training of a generative AI model based, in part, on code contained in Plaintiffs’ repositories.”
The companies maintain that because GitHub users decide whether to make their code public, and agree to terms of service that permit the viewing, usage, indexing, and analysis of public code, the site’s owners are within their rights to incorporate the work of others and profit from it.
“Any GitHub user,” they say, “… appreciates that code placed in a public repository is genuinely public. Anyone is free to examine, learn from, and understand that code, as well as repurpose it in various ways. And, consistent with this open source ethic, neither GitHub’s TOS nor any of the common open source licenses prohibit either humans or computers from reading and learning from publicly available code.”
Judge Jon Tigar has set September 14 as the first available date to hold a hearing on the motion to dismiss the case. In the interim, there may be further filings from either side. ®