Not too long ago, we hypothesised that copyright law might be the best way to short-circuit the looming artificial intelligence (AI) apocalypse.
Our reasoning was simple: If the courts forced AI companies to pay the full market price for every article, video, or audio clip they needed to train their machines, it would quickly become prohibitively expensive to create AI, and the whole sector would collapse.
At the time, copyright laws allowed companies that held large stores of intangible assets to stop AI firms from moving quickly. In response, the AI companies chose a “steal now, ask for forgiveness later” approach. But the legal system was (slowly) building new ways to protect innovation and intellectual property in the age of AI.
However, earlier this month, the judge in Raw Story Media v. OpenAI threw the whole case out, citing a 2021 decision by the US Supreme Court.
In the Raw Story case, two online news outlets alleged that OpenAI violated copyright protections by scraping thousands of news articles and stripping them of “copyright management information” (CMI) such as the author’s name, title, and terms of use. The outlets sought statutory damages of $2,500 per violation, arguing that ChatGPT’s ability to “summarise” those articles constituted copyright infringement.
OpenAI, you might recall, is the company behind the world-famous ChatGPT chatbot. Multiple plaintiffs are suing OpenAI, arguing that it’s illegal for AI companies to train their tools on news articles, books, paintings, and other material without permission.
However, the judge in the Raw Story case, Southern District of New York Judge Colleen McMahon, ruled that the news outlets had failed to show any “concrete injury-in-fact” because ChatGPT couldn’t be prompted to produce an exact copy of those articles in response to a user query. A “summary” of the articles wasn’t enough to trigger copyright infringement.
Judge McMahon’s demand to see “concrete harm” stems from the 2021 Supreme Court decision TransUnion v. Ramirez, in which the court ruled that merely violating a statute was not adequate grounds for a lawsuit. Instead, a plaintiff must show “concrete harm.”
The thing is, the US Copyright Act (like many other copyright regimes) never required proof of concrete harm to bring a case. All a plaintiff needed to do was point to the breach itself. But Judge McMahon cited TransUnion to say this standard is no longer good enough: copyright owners must now show that the breach has actually harmed them.
To see why this is a problem for companies that own protected works and other intangible assets, we need to briefly explain how AI systems like ChatGPT work.
ChatGPT is a large language model (LLM), so it cannot – or is extremely unlikely to – reproduce any of its training material in its original form. An LLM is not a library. Instead, ChatGPT can only synthesise information from the articles it has been fed, piecing together new sentences, one word fragment at a time, from the statistical patterns it has learned across millions of documents.
A cartoonish analogy: imagine a person who could speak only in fragments quoted from their favourite movies. A little bit of Tom Cruise here, a little bit of Meryl Streep there – but they could never recite an entire movie word for word. That’s pretty much how an LLM works.
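For the technically curious, here is a toy sketch in Python of the same principle at a vastly smaller scale: a simple “bigram” model that learns only which word tends to follow which, then generates new sequences from those statistics. To be clear, this is an illustration of the general idea, not OpenAI’s actual technology, and the tiny training “corpus” is invented for the example.

```python
# A minimal sketch (not OpenAI's code) of the core idea behind an LLM:
# the model stores statistics about which word tends to follow which,
# not copies of the source documents themselves.
import random
from collections import defaultdict

# Invented toy "training data" for illustration only.
corpus = (
    "the court ruled that the plaintiffs failed to show concrete harm "
    "the court dismissed the case because the plaintiffs showed no harm"
).split()

# "Training": record which words follow each word.
follows = defaultdict(list)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev].append(nxt)

# "Generation": repeatedly sample a statistically plausible next word.
def generate(start: str, length: int = 10) -> str:
    word, output = start, [start]
    for _ in range(length):
        candidates = follows.get(word)
        if not candidates:
            break
        word = random.choice(candidates)
        output.append(word)
    return " ".join(output)

print(generate("the"))
# e.g. "the court dismissed the plaintiffs failed to show no harm"
# The output remixes the training text; the originals are never stored verbatim.
```

Scale the same idea up by billions of parameters and you get something like ChatGPT: patterns in, new sentences out, but no stored copies of the originals.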
This explains why Judge McMahon said that although ChatGPT has an enormous amount of training material in its system, “the likelihood that ChatGPT would output plagiarised content from one of the plaintiffs’ articles seems remote.”
Therefore, she said, the “plaintiffs have not alleged any actual adverse effects” on their businesses, and she rejected the premise of the case. Ouch.
The result is good news for OpenAI and other AI firms, but it should send shockwaves through the world of intangible assets.
It boils down to this:
If the standard is that an AI must be able to reproduce a protected work perfectly – something LLMs, by design, essentially cannot do – before “concrete harm” can be shown, then AI companies have been given the green light to grab as much protected material as they like and feed it to their LLMs.
The plaintiffs in the Raw Story case can still amend their claim to focus on demanding that OpenAI pay for the material it used to train ChatGPT. But they can’t stop OpenAI from using protected material because, under this standard, no one can prove ChatGPT causes “concrete harm.”
As Cornell law professor James Grimmelmann said about the decision, “This theory of no standing is actually a potential earthquake far beyond AI. It has the potential to significantly restrict the kinds of IP cases that federal courts can hear”, and it might leave publishers without standing “to sue over model training at all, even for copyright infringement.”
For now, copyright holders need to rethink their strategy for combatting AI-related infringement. Documenting every instance of harm from misuse will be crucial, because after the Raw Story decision, pointing to the infringement itself no longer establishes standing in a copyright case.
Life just got a lot easier for AI companies and a lot more difficult for businesses that rely on intangible assets such as IP and innovation. It’s a brave new world, indeed.