OpenAI: 'Impossible to train today’s leading AI models without using copyrighted materials'

OpenAI has said it would be “impossible” to build top-tier neural networks that meet today’s needs without using people’s copyrighted work. The Microsoft-backed lab, which believes it is lawfully harvesting said content for training its models, said using out-of-copyright public domain material would result in sub-par AI software.

This assertion comes at a time when the machine-learning world is sprinting head first at the brick wall that is copyright law. Just this week an IEEE report concluded Midjourney and OpenAI’s DALL-E 3, two of the major AI services to turn text prompts into images, can recreate copyrighted scenes from films and video games based on their training data.

The study, co-authored by Gary Marcus, an AI expert and critic, and Reid Southen, a digital illustrator, documents multiple instances of “plagiaristic outputs” in which Midjourney and DALL-E 3 render substantially similar versions of scenes from films, pictures of famous actors, and video game content.

Marcus and Southen say it’s almost certain that Midjourney and OpenAI trained their respective AI image-generation models on copyrighted material.

Whether that’s legal, and whether AI vendors or their customers risk being held liable, remain contentious questions. However, the report’s findings may bolster those suing Midjourney and DALL-E maker OpenAI for copyright infringement.

“Both OpenAI and Midjourney are fully capable of producing materials that appear to infringe on copyright and trademarks,” they wrote. “These systems do not inform users when they do so. They do not provide any information about the provenance of the images they produce. Users may not know, when they produce an image, whether they are infringing.”

Neither biz has fully disclosed the training data used to make their AI models.

It’s not just digital artists challenging AI companies. The New York Times recently sued OpenAI because its ChatGPT text model will spit out near-verbatim copies of the newspaper’s paywalled articles. Book authors have filed similar claims, as have software developers.

Prior research has indicated that OpenAI’s ChatGPT can be coaxed to reproduce training text. And those suing Microsoft and GitHub contend the Copilot coding assistant model will reproduce code more or less verbatim.

Southen observed that Midjourney is charging customers who are creating infringing content and profiting via subscription revenue. “MJ [Midjourney] users don’t have to sell the images for copyright infringement to have potentially occurred, MJ already profits from its creation,” he opined, echoing an argument made in the IEEE report.

OpenAI also charges a subscription fee and thus profits in the same way. Neither OpenAI nor Midjourney responded to requests for comment.

However, OpenAI on Monday published a blog post addressing the New York Times lawsuit, which the AI seller said lacked merit. Astonishingly, the lab said that if its neural networks generated infringing content, it was a “bug.”

In total, the upstart today argued that: It actively collaborates with news organizations; training on copyrighted data qualifies for the fair use defense under copyright law; “‘regurgitation’ is a rare bug that we are working to drive to zero”; and the New York Times has cherry-picked examples of text reproduction that don’t represent typical behavior.

The law will decide

Tyler Ochoa, a professor in the law department at Santa Clara University in California, told The Register that while the IEEE report’s findings are likely to help litigants with copyright claims, they shouldn’t – because the authors of the article have, in his view, misrepresented what’s happening.

“They write: ‘Can image-generating models be induced to produce plagiaristic outputs based on copyright materials? … [W]e found that the answer is clearly yes, even without directly soliciting plagiaristic outputs.'”

Ochoa questioned that conclusion, arguing the prompts the report’s authors “entered demonstrate that they are, indeed, directly soliciting plagiaristic outputs. Every single prompt mentions the title of a specific movie, specifies the aspect ratio, and in all but one case, the words ‘movie’ and ‘screenshot’ or ‘screencap.’ (The one exception describes the image that they wanted to replicate.)”

The law prof said the issue for copyright law is determining who is responsible for these plagiaristic outputs: The creators of the AI model or the people who asked the AI model to reproduce a popular scene.

“The generative AI model is capable of producing original output, and it is also capable of reproducing scenes that resemble scenes from copyrighted inputs when prompted,” explained Ochoa. “This should be analyzed as a case of contributory infringement: The person who prompted the model is the primary infringer, and the creators of the model are liable only if they were made aware of the primary infringement and they did not take reasonable steps to stop it.”

Ochoa said generative AI models are more likely to reproduce specific images when there are multiple instances of those images in their training data set.

“In this case, it is highly unlikely that the training data included entire movies; it is far more likely that the training data included still images from the movies that were distributed as publicity stills for the movie,” he said. “Those images were reproduced multiple times in the training data because media outlets were encouraged to distribute those images for publicity purposes and did so.

“It would be fundamentally unfair for a copyright owner to encourage wide dissemination of still images for publicity purposes, and then complain that those images are being imitated by an AI because the training data included multiple copies of those same images.”

Ochoa said there are steps to limit such behavior from AI models. “The question is whether they should have to do so, when the person who entered the prompt clearly wanted to get the AI to reproduce a recognizable image, and the movie studios that produced the original still images clearly wanted those still images to be widely distributed,” he said.

“A better question would be: How often does this happen when the prompt does not mention a specific movie or describe a specific character or scene? I think an unbiased researcher would likely find that the answer is rarely (perhaps almost never).”

Nonetheless, copyrighted content appears to be essential fuel for making these models function well.

OpenAI defends itself to Lords

In response to an inquiry into the risks and opportunities of AI models by the UK’s House of Lords Communications and Digital Committee, OpenAI presented a submission [PDF] warning that its models won’t work without being trained on copyrighted content.

“Because copyright today covers virtually every sort of human expression – including blog posts, photographs, forum posts, scraps of software code, and government documents – it would be impossible to train today’s leading AI models without using copyrighted materials,” the super lab said.

“Limiting training data to public domain books and drawings created more than a century ago might yield an interesting experiment, but would not provide AI systems that meet the needs of today’s citizens.”

The AI biz said it believes that it complies with copyright law and that training on copyrighted material is lawful, though it allows that “there is still work to be done to support and empower creators.”

That sentiment, which sounds like a diplomatic recognition of ethical concerns about compensation for the arguable fair use of copyrighted work, should be considered in conjunction with the IEEE report’s claim that, “we have discovered evidence that a senior software engineer at Midjourney took part in a conversation in February 2022 about how to evade copyright law by ‘laundering’ data ‘through a fine tuned codex.'”

Marcus, co-author of the IEEE report, expressed skepticism of OpenAI’s effort to obtain a regulatory green light in the UK for its current business practices.

“Rough Translation: We won’t get fabulously rich if you don’t let us steal, so please don’t make stealing a crime!” he wrote in a social media post. “Don’t make us pay licensing fees, either! Sure Netflix might pay billions a year in licensing fees, but we shouldn’t have to! More money for us, moar!”

OpenAI has offered to indemnify enterprise ChatGPT and API customers against copyright claims, though not if the customer or the customer’s end users “knew or should have known the Output was infringing or likely to infringe” or if the customer bypassed safety features, among other limitations. Thus, asking DALL-E 3 to recreate a famous film scene – which users ought to know is probably covered by copyright – would not qualify for indemnification.

Midjourney has taken the opposite approach, promising to hunt down and sue customers involved in infringement to recover legal costs arising from related claims.

“If you knowingly infringe someone else’s intellectual property, and that costs us money, we’re going to come find you and collect that money from You,” Midjourney’s Terms of Service state. “We might also do other stuff, like try to get a court to make you pay our legal fees. Don’t do it.” ®
