Last year saw great excitement over a new wave of AI services based on large language models (LLMs). That enthusiasm was somewhat overshadowed by a subsequent wave of lawsuits claiming that the LLMs were guilty of copyright infringement because of the training materials they used. Just before the start of 2024, a new lawsuit was filed, this time by The New York Times (NYT), against OpenAI and Microsoft. As an article in the NYT itself explains:
The suit does not include an exact monetary demand. But it says the defendants should be held responsible for “billions of dollars in statutory and actual damages” related to the “unlawful copying and use of The Times’s uniquely valuable works.” It also calls for the companies to destroy any chatbot models and training data that use copyrighted material from The Times.
Around the same time, Microsoft and OpenAI were also sued for alleged copyright infringement by non-fiction authors, but it is the NYT action that has really caught people’s attention, and led to a flurry of analysis and opinion pieces. There are two main elements to the lawsuit. One alleges that use of material from the NYT to train OpenAI’s LLMs without permission is illegal, and the other that the output from OpenAI’s ChatGPT infringes on NYT’s copyrights.
The first point is the old argument that LLMs infringe copyright because they copy their training materials. But as previous posts on Walled Culture (and many others elsewhere) have explained, that’s not how LLMs work. They don’t copy; they analyse, in order to build up a database of probabilities that represents the patterns found in existing text, images, videos and sounds. They then use those patterns to generate new material in response to a user’s prompt. The second element of the NYT lawsuit is the following, as described by the NYT story:
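To make that distinction concrete, here is a minimal sketch in Python – a toy bigram model over an invented three-sentence corpus, vastly simpler than a real LLM – showing how training reduces text to a table of probabilities rather than a store of copies:

```python
from collections import Counter, defaultdict

# A tiny invented training corpus -- real LLMs train on billions of documents.
corpus = [
    "the court ruled in favour of fair use",
    "the court ruled against the publisher",
    "the publisher appealed the ruling",
]

# "Training": count which word follows which, then normalise to probabilities.
# The model retains only these statistics, not the sentences themselves.
counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1

probs = {
    prev: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
    for prev, followers in counts.items()
}

# After "the", the model has learned P(court)=P(publisher)=2/5, P(ruling)=1/5
# -- a pattern of probabilities, not a stored document.
print(probs["the"])
```

A real LLM conditions on thousands of preceding tokens rather than one word, but the principle is the same: what it learns from training data is a set of statistical patterns.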
The complaint cites several examples when a chatbot provided users with near-verbatim excerpts from Times articles that would otherwise require a paid subscription to view. It asserts that OpenAI and Microsoft placed particular emphasis on the use of Times journalism in training their A.I. programs because of the perceived reliability and accuracy of the material.
At first sight, the examples provided look quite compelling. In a blog post commenting on the lawsuit, OpenAI makes the following points about these “regurgitations” that the NYT says are evidence of copyright infringement by ChatGPT:
Interestingly, the regurgitations The New York Times induced appear to be from years-old articles that have proliferated on multiple third-party websites. It seems they intentionally manipulated prompts, often including lengthy excerpts of articles, in order to get our model to regurgitate. Even when using such prompts, our models don’t typically behave the way The New York Times insinuates, which suggests they either instructed the model to regurgitate or cherry-picked their examples from many attempts.
Moreover, as Mike Masnick noted on Techdirt:
If you actually understand how these systems work, the output looking very similar to the original NY Times piece is not so surprising. When you prompt a generative AI system like GPT, you’re giving it a bunch of parameters, which act as conditions and limits on its output. From those constraints, it’s trying to generate the most likely next part of the response. But, by providing it paragraphs upon paragraphs of these articles, the NY Times has effectively constrained GPT to the point that the most probabilistic responses is… very close to the NY Times’ original story.
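The same kind of toy model illustrates Masnick’s point: feed it a long verbatim prefix from its own training text, and the most likely next word, chosen step by step, simply reproduces the original. A minimal sketch, assuming a single invented sentence standing in for an “article”:

```python
from collections import Counter, defaultdict

# An invented one-sentence "article" standing in for training data.
article = "critics argue the chatbot reproduces long passages of paywalled journalism almost verbatim"
words = article.split()

# Train bigram counts on it.
counts = defaultdict(Counter)
for prev, nxt in zip(words, words[1:]):
    counts[prev][nxt] += 1

def continue_greedily(prompt, n=7):
    """Always pick the single most likely next word after the last one."""
    out = prompt.split()
    for _ in range(n):
        followers = counts.get(out[-1])
        if not followers:
            break
        out.append(followers.most_common(1)[0][0])
    return " ".join(out)

# A long excerpt of the article as the prompt pins the model to its source:
# the greedy continuation is the rest of the original sentence, verbatim.
print(continue_greedily("critics argue the chatbot reproduces"))
```

Because every word in the invented sentence is unique, the chain of most-likely next words is fully determined by the prompt – the toy-model version of the NYT constraining GPT with paragraphs of its own articles until the most probable output is the article itself.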
That same OpenAI blog post says that “The New York Times is not telling the full story” because the two companies had been negotiating a “high-value partnership around real-time display with attribution in ChatGPT, in which The New York Times would gain a new way to connect with their existing and new readers, and our users would gain access to their reporting.” Mike Masnick speculates that the NYT may have decided to bring its lawsuit because at the beginning of December last year OpenAI announced a “Partnership with Axel Springer to deepen beneficial use of AI in journalism”. This followed an earlier deal with The Associated Press, and other publishers are also discussing licensing deals with OpenAI. The NYT might be using its legal action to put pressure on OpenAI to offer a better licensing deal, and sooner.
Licensing has always been a favourite approach for the copyright world – however inappropriate – as the book Walled Culture details (free digital downloads are available). But a comment from the venture capital company Andreessen Horowitz, submitted to the US Copyright Office as part of the latter’s inquiry into AI, points out:
The reason AI models are able to do what they can do today is that the internet has given AI developers ready access to a broad range of content, much of which can’t reasonably be licensed—everything from blog posts to social media threads to customer reviews on shopping sites. Indeed, the cost of paying to license even a fraction of the content needed to properly train an AI model would be prohibitive for all but the deepest-pocketed AI developers, resulting in dominance by a few technology incumbents. This would undermine competition by the technology startups which are the source of the greatest innovation in AI.
OpenAI said something similar in a submission to the UK House of Lords Communications and Digital Committee:
Because copyright today covers virtually every sort of human expression–including blog posts, photographs, forum posts, scraps of software code, and government documents–it would be impossible to train today’s leading AI models without using copyrighted materials. Limiting training data to public domain books and drawings created more than a century ago might yield an interesting experiment, but would not provide AI systems that meet the needs of today’s citizens.
Some people have mocked this comment, on the grounds that it seems to say that breaking the law is justified if your business model depends upon it. That overlooks two points. First, there is an assumption among big content companies that copyright law should be allowed to throttle exciting new technologies with major benefits for society, simply because copyright is sacred and must always be protected, regardless of the harm it causes. Secondly, big content has pushed, and keeps pushing, for legislation that ensures copyright continues to sustain its current business models.
One of the most positive aspects of the NYT lawsuit against OpenAI and Microsoft is that it has led to articles in the mainstream media noting that it raises important questions about copyright law, and whether that law is fit for purpose in the digital world. An answer is unlikely to come quickly – the question may need to go all the way to the US Supreme Court to be settled definitively – but it is good news that the problem is finally being acknowledged in this way.
Featured image by OpenAI.