Copyright discussions typically concern texts in just a few languages, and often only in English. In part, that’s because copyright law has evolved most quickly in anglophone countries. But this narrow focus means that the copyright problems faced by speakers of less widely used languages – particularly languages with limited quantities of textual material available – are glossed over.
That’s something that emerges from a report by Teresa Hackett, EIFL Copyright and Libraries Programme Manager, writing about a major conference on the Right to Research that took place in South Africa in January 2023. As she notes:
Although Africa has over 2,000 living languages, these account for only a small portion of available resources and publications, and they are severely under-represented in technology and scientific communications. The good news is that modern digital research tools offer new, low-cost opportunities to address this under-representation.
Hackett describes how a community-based group of African technologists, called Masakhane, set out to apply machine learning in the form of Natural Language Processing (NLP) to texts in African languages. Creating machine translation tools allows text and data in those languages to be processed and interpreted on a far larger scale than at present. The work was based on biblical texts because they are already available in many of the target African languages, and because their chapter and verse structure means the translations are already aligned with one another. Hackett explains:
Masakhane used a dataset known as JW300, created from biblical texts published by US-based Jehovah’s Witnesses. JW300, a corpus of over 300 languages containing an average of 100,000 parallel sentences per language, proved to be a hugely rich resource. Young developers from all over the continent were able to plug into the dataset, select the language they wanted to work in, and mine the data to create new translation models in multiple African languages.
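To see why the chapter and verse structure is so useful for machine translation, here is a minimal Python sketch of verse-based alignment. It is an illustration only, not the method used to build JW300 or by Masakhane: the function name `align_by_verse`, the `(book, chapter, verse)` key format and the placeholder strings are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical key format: (book, chapter, verse).
VerseKey = Tuple[str, int, int]


def align_by_verse(
    source: Dict[VerseKey, str],
    target: Dict[VerseKey, str],
) -> List[Tuple[str, str]]:
    """Pair up texts that share the same book, chapter and verse."""
    # Only verses present in both languages can form a parallel pair.
    shared_keys = sorted(source.keys() & target.keys())
    return [(source[key], target[key]) for key in shared_keys]


# Placeholder data, not real verse text.
source_verses = {
    ("Genesis", 1, 1): "source-language text of Genesis 1:1",
    ("Genesis", 1, 2): "source-language text of Genesis 1:2",
    ("Genesis", 1, 3): "source-language text of Genesis 1:3",
}
target_verses = {
    ("Genesis", 1, 1): "target-language text of Genesis 1:1",
    ("Genesis", 1, 3): "target-language text of Genesis 1:3",
}

pairs = align_by_verse(source_verses, target_verses)
for source_text, target_text in pairs:
    print(source_text, "|", target_text)
```

Because every translation carries the same verse references, no statistical alignment step is needed to work out which passage corresponds to which. Verses missing from one language are simply skipped, and the resulting pairs are the kind of parallel text a translation model can be trained on.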
The database of biblical texts proved excellent material for the NLP analysis, but the growing success of the project prompted Masakhane to obtain legal advice on the use of JW300 data from the Nairobi-based Centre for Intellectual Property and Information Technology Law (CIPIT). Here’s what happened:
CIPIT found, however, that a copyright notice on the Jehovah’s Witnesses website showed that the activity of text and data mining (TDM) was expressly not allowed. A subsequent request by CIPIT, on behalf of Masakhane, for permission to use the data was declined.
As a result, JW300 datasets are no longer being used. It’s a classic case of copyright killing off vital work by researchers that is being carried out for the benefit of hundreds of millions of people. Allowing the use of existing texts for text and data mining would result in no loss to the organisation that put them together, and would even help to spread the word about their work. But the obsession with “ownership” that copyright promotes means that, as Kathleen Siminyu, an AI researcher at Masakhane, is quoted as saying: “even with a plea and a demonstration of how much good has come from the work, the answer to the request [for permission to use the data] was still a blatant no.”
Follow me @glynmoody on Mastodon.