The copyright world’s obsession with ownership will throttle AI innovation and boost today’s tech giants

Generative AI is still one of the hot topics in technology, even if the initial, rather breathless excitement has cooled down somewhat. It clearly represents an interesting new approach to finding and creating textual, visual and audio material. Unfortunately, the copyright world’s obsession with ownership threatens to throw an 18th-century legalistic spanner in these 21st-century works. Here, for example, is The Hollywood Reporter writing about a lawsuit brought by two authors who claim that ChatGPT infringes on copyrights to their novels:

The proposed class action filed in San Francisco federal court on Wednesday alleges that OpenAI “relied on harvesting mass quantities” of copyright-protected works “without consent, without credit, and without compensation.” It seeks a court order that the company infringed on writers’ works when it illegally downloaded copies of novels to train its AI system and that ChatGPT’s answers constitute infringement.

As evidence of that alleged infringement, the lawsuit notes that ChatGPT is able to generate summaries of the authors’ novels when prompted. The authors argue that this is only possible if ChatGPT was trained on their copyrighted works. Even accepting that logic, the next step – using this to “prove” infringement – is based on a misunderstanding of how generative AI works.

What ChatGPT and other generative AI systems do is similar to what Web search systems do: they ingest huge amounts of data, extract certain information, and then use that to respond to user queries. In the case of Web search engines, they create a map of online material in the form of an index of words – which words are found on which sites. This allows users to find sites that contain particular combinations of keywords. Generative AI goes further, and creates a map of all the training material it analyses in the form of a complex probabilistic representation of the relationships between elements found in that material, whether textual, visual or auditory. This allows users to create new material by using prompts as a way of drawing on the overall abstract patterns encapsulated in these representations.
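To make the contrast concrete, here is a deliberately tiny sketch of the two kinds of "map" described above. It is an illustration under stated assumptions, not how any real search engine or ChatGPT is built: the page names and texts are made up, and a simple word-pair (bigram) count stands in for the far more complex probabilistic representation that a real generative model learns.

```python
from collections import Counter, defaultdict

# Hypothetical sample "web pages", invented purely for illustration.
PAGES = {
    "a.example": "the cat sat on the mat",
    "b.example": "the dog sat on the log",
    "c.example": "a cat chased a dog",
}


def build_index(pages):
    """Search-engine style map: each word -> the pages that contain it."""
    index = defaultdict(set)
    for name, text in pages.items():
        for word in text.lower().split():
            index[word].add(name)
    return index


def search(index, *keywords):
    """Return the pages that contain every one of the keywords."""
    matches = [index.get(k.lower(), set()) for k in keywords]
    return set.intersection(*matches) if matches else set()


def build_bigram_model(texts):
    """Toy generative-style map: for each word, count which words follow it,
    aggregated across all texts rather than stored per page."""
    follows = defaultdict(Counter)
    for text in texts:
        words = text.lower().split()
        for current, following in zip(words, words[1:]):
            follows[current][following] += 1
    return follows


def next_word_probabilities(model, word):
    """Turn the aggregated counts for one word into probabilities."""
    counts = model.get(word.lower(), Counter())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


INDEX = build_index(PAGES)
MODEL = build_bigram_model(PAGES.values())

if __name__ == "__main__":
    print(search(INDEX, "cat", "sat"))            # pages with both words
    print(next_word_probabilities(MODEL, "the"))  # what tends to follow "the"
```

The index answers "which pages contain these words?" and points back to specific sources. The bigram counts answer "what tends to come next?" and are pooled across every text, so the entry for "the" no longer records which page contributed which pair.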

Despite what the lawsuit claims, the result of using a generative AI system is not a derivative work “made without Plaintiffs’ permission and in violation of their exclusive rights under the Copyright Act”. It is not derived from, or based on, any copyright work, but is generated using complex computational processes that draw on mathematical representations of the aggregated probabilistic relationships between elements found across all the training material. (If you want to find out more about how ChatGPT does that, Stephen Wolfram’s very long post “What Is ChatGPT Doing … and Why Does It Work?” is probably the best introduction.)

The richness of the training material is key to the power of generative AI systems, as the larger and more powerful models demonstrate. That is an argument for allowing them to ingest as much material as possible, not least to avoid the biases that are often present in smaller training sets. In fact, the bigger the training set, the smaller the importance of any specific material, since its probabilistic information is combined with and subsumed by trillions of other data points.
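A rough back-of-the-envelope calculation shows the scale involved. The figures below are assumptions chosen for illustration (a 100,000-word novel and three made-up corpus sizes), and a work's share of the corpus is only a crude proxy for its influence on a trained model, not a measurement of it.

```python
def share_of_corpus(work_tokens: int, corpus_tokens: int) -> float:
    """Fraction of the whole training corpus made up by a single work."""
    if corpus_tokens <= 0 or not 0 <= work_tokens <= corpus_tokens:
        raise ValueError("need 0 <= work_tokens <= corpus_tokens and corpus_tokens > 0")
    return work_tokens / corpus_tokens


# Hypothetical figures, for illustration only.
NOVEL_TOKENS = 100_000
CORPUS_SIZES = [10**6, 10**9, 10**12]

SHARES = {size: share_of_corpus(NOVEL_TOKENS, size) for size in CORPUS_SIZES}

if __name__ == "__main__":
    for size, share in SHARES.items():
        print(f"corpus of {size:,} tokens: one novel is {share:.7%} of it")
```

Under these assumptions the same novel goes from a tenth of a million-token corpus to one ten-millionth of a trillion-token one.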

The current spate of lawsuits alleging “infringement” of particular works not only misunderstands how generative AI systems work. It also threatens to stymie the continued improvement of the technology towards tools that are even further removed from the training materials that copyright-obsessed creators are worried about. The Hollywood Reporter article quotes Ashley Irwin, president of the Society of Composers and Lyricists, as saying back in May:

that AI firms should be required to secure consent by creators for the use of their works to train AI programs and compensate them at fair market rates for any subsequent new work that’s created, on top of providing the proper credit.

The only companies that could even attempt that would be giant ones with deep pockets. Pursuing the quixotic idea that creators have some kind of “ownership” of the aggregated probabilistic data that lies at the heart of generative AI systems, and of what those systems produce, will not only seriously hobble innovation. It will also guarantee that small startups won’t be able to participate in this exciting new sector. That, in turn, will cement the dominance of today’s biggest technology companies – the very players that the copyright industry regards as its worst enemies.

