'Impossible' to train AI without copyrighted content says OpenAI

ChatGPT developer OpenAI has told the UK parliament that it is impossible to train its generative artificial intelligence (GenAI) services without access to copyrighted work.

The company, along with backer Microsoft, is facing a lawsuit from the New York Times, which has accused the AI tech company of “unlawful use” of its work to create its products.

Now, in a submission to the House of Lords’ communications and digital select committee, the company appears to be angling for a relaxation of copyright laws.

The submission, first reported by the Telegraph, states: “Because copyright today covers virtually every sort of human expression – including blog posts, photographs, forum posts, scraps of software code, and government documents – it would be impossible to train today’s leading AI models without using copyrighted materials.

“Limiting training data to public domain books and drawings created more than a century ago might yield an interesting experiment, but would not provide AI systems that meet the needs of today’s citizens.”

In a separate blog post published to its website on Monday, OpenAI responded to the lawsuit, saying: “We support journalism, partner with news organisations, and believe the New York Times lawsuit is without merit.”

In addition to the NYT suit, a group of authors including Game of Thrones writer George RR Martin are suing OpenAI for what they describe as “systematic theft on a mass scale”.

OpenAI has previously argued that while it respects content creators and owners, it also subscribes to a doctrine of “fair use” and that it believes that “legally, copyright law does not forbid training”.

The blurred lines surrounding copyright and plagiarism in GenAI training are becoming increasingly central to the conversation around the technology.

A spreadsheet naming thousands of artists whose work has allegedly been used to train image generation company Midjourney’s tech recently went viral. The list includes more than 4,700 artists whose works are said to have been ‘scraped’ for training, with thousands more listed under a ‘proposed additions’ tab.

The spreadsheet quickly spread across social media during the holiday period. One notable poster was Jon Lam, a senior storyboard artist at League of Legends-owner Riot Games, who posted screenshots from Discord where Midjourney developers, in his words, discuss “laundering” and creating a database from which they can train the software.

One of the messages reads: “All you have to do is just use those scraped datasets and then conveniently forget what you used to train the model. Boom legal problems solved forever.”
