Why generative AI companies should pay artists to create new works, and give away the results

The latest obsession in the world of copyright is trying to get generative AI companies to pay for using creative material for training purposes. As various posts have noted, this makes no sense, because training is just a matter of analysis, which requires no permission. What the copyright world wants to do is to erect further barriers to accessing knowledge and art. But even without the legal threats being thrown around by publishers and other intermediaries, there’s another problem that generative AI companies face when it comes to using training materials: there simply aren’t enough of them. In particular, there aren’t enough of the really good textual materials that are needed to push forward the generative AI field.

This has forced generative AI companies to come up with novel sources of training material. For example, both Google and OpenAI have turned to YouTube videos. According to the New York Times, an OpenAI team transcribed more than one million hours of YouTube videos in order to harvest the human speech their soundtracks contained. Meta contemplated taking a different, and rather bold, approach: buying the publisher Simon & Schuster in order to obtain a fresh injection of long-form texts that could be used for training. Another New York Times article explained a further way for generative AI companies to obtain new training material:

as A.I. technology has become more sophisticated, so has the job of people who must painstakingly teach it. Yesterday’s photo tagger is today’s essay writer.

There are usually two types of work for these trainers: supervised learning, where the A.I. learns from human-generated writing, and reinforcement learning from human feedback, where the chatbot learns from how humans rate their responses.

Companies that specialize in data curation, including the San Francisco-based start-ups Scale AI and Surge AI, hire contractors and sell their training data to bigger developers. Developers of A.I. models, such as the Toronto-based start-up Cohere, also recruit in-house data annotators.

These trends suggest an obvious large-scale solution to the dearth of good training material, which at the same time helps alleviate another serious problem, discussed at length in Walled Culture the book (free digital versions available): the fact that most creators struggle financially. I was going to write this up here on Walled Culture, but over on Techdirt, Mike Masnick has already done a good job explaining the idea:

If the tech companies need good, well-written content to fill their training systems, and the world needs good, high-quality journalism, why don’t the big AI companies agree to start funding journalists and solve both problems in one move?

This may sound similar to the demands of licensing works, but I’m not talking about past works. Those works are out there. I’m talking about paying for the creation of future works. It’s not about licensing or copyright. It’s about paying for the creation of new, high-quality journalism. And then letting those works exist freely on the internet for everyone.

But I’d go further than Mike. I think generative AI companies should be paying for the creation of all kinds of original material – text, images, audio, videos. Text may be the best-known form of generative AI, but it is by no means the only one. Companies need visual and audio training material just as much as they need text.

The actual cost of supporting artists in this way would be relatively small compared to the valuation of generative AI companies, and the potential of the market. The diminutive size of the copyright world compared to the digital sector is something that is often overlooked when people discuss the tension between the two. It was why a decade ago I suggested that Google should buy the entire music industry to solve the problems it was having with licensing material back then. The parallels with today’s issues are striking.

I also think that the material generated under such a scheme should be placed in the public domain for anyone to use. Once generative AI companies have trained their systems on the new material, there is no need for them to try to make money from restricting access as a publisher would. Moreover, the more public domain material there is out there, the more people could build on that material to produce new works. This would create yet more high-quality human-generated material that could be analysed.

And for those who think that Google or other companies would never spend money like this, only to give away the results, there is a precedent. In 2003, Google bought Blogger, which it then made freely available for everyone to use. As an article in Wired from the time pointed out, the existing blogs on Blogger provided Google with “rich new sources of data gleaned from weblogs”. And by providing the Blogger service free, it encouraged people to create new blogs that could then be fed into Google’s index, enriching it further.

Going even further back, there is another important precedent for the idea of wealthy entities simply paying artists to create. As Walled Culture notes, this is the basis of the patronage that funded almost all artistic creation for thousands of years. Then it was royalty, the aristocracy or the church that paid writers, artists and sculptors to create masterpieces. Now it could be the online giants with their huge resources that would similarly pay today’s creators to produce works that everyone could enjoy. At the same time it would also help the generative AI world to find those high-quality sources that it so desperately needs.

Featured image by Stable Diffusion.

Follow me @glynmoody on Mastodon and on Bluesky.
