The copyright infringement debate around generative AI models has taken many turns. The latest is The New York Times’ potential lawsuit against OpenAI for copyright infringement. The news outlet believes that OpenAI’s models were trained on NYT’s intellectual property and copied the style of its writers, giving ChatGPT the ability to write articles in the same manner whenever prompted.
This will likely be one of the first lawsuits against OpenAI to actually put the company in trouble. NYT recently updated its terms of service to prohibit AI companies from scraping its articles for training AI models, hinting at a possible lawsuit against OpenAI. If NYT successfully proves that its content was used illegally, OpenAI might have to delete the entire dataset used for training its AI models and cough up a fine of up to $150,000 for each infringed work.
Meanwhile, OpenAI has made a licensing deal with the Associated Press (AP), gaining access to its archive for building better AI models. It has also partnered with several news agencies recently. Now, AP has decided to join other news organisations to frame guiding principles for using AI in newsrooms. In the report, AP said that a lot of news organisations are concerned that their content is being used without permission.
The tricks up its sleeve
There is little doubt that OpenAI took help from Microsoft’s Bing to crawl the internet. According to Gilles Babinet, the company must have crawled up to 250,000 websites for building GPT without asking anyone, though the figure should be taken with a pinch of salt.
Responding to this, Yann LeCun, Meta AI chief, gave the example of search engines. “Google, Bing, and others crawl the internet constantly. That’s not the problem,” he said, and asked where exactly the problem lay. Arguably, there is a difference between crawling content and reusing it, but even then, the argument for this being a “copyright” case does not stand strong.
Sébastien Hubert explains how neural networks do not actually store any data, but just represent the understanding of the data. This arguably is a lot like how humans work. “GPT has read The Three Musketeers but is unable to quote any chapter verbatim upon request. An LLM is a sort of super-reader – it doesn’t copy anything,” explained Hubert.
Interestingly, NYT is considering suing only OpenAI, and not Google for making Bard. For NYT, the problem comes from the fact that Bing synthesises content without creating traffic for the newspaper, thereby hitting its advertising revenue. Bing AI is an internet “wrapper”, which breaks the economic model of many sites. It seems NYT woke up to the problems of ChatGPT only after realising that OpenAI is partnering with other publishers and not with it.
It must be noted that NYT is going to receive $100 million from Google over the next three years in a deal under which the tech giant will be able to publish NYT content on its platforms. Google is also testing its new AI writing tools in partnership with NYT, WSJ, and the Washington Post. This might be a hint towards a partnership to rival that of OpenAI and AP.
NYT has a problem with OpenAI, not generative AI
OpenAI’s web crawler, GPTBot, which recently came to everyone’s notice, automatically scrapes websites to collect data for training the company’s AI models. To opt out, site owners have to put a line of code on their website to block the crawler. In all likelihood, OpenAI has already crawled news outlets to train its AI models.
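The opt-out works through a site’s robots.txt file. As a rough illustration, the sketch below uses Python’s standard `urllib.robotparser` to check what a blocking rule would mean for a crawler that identifies itself as GPTBot. The robots.txt contents, the `example.com` URL, and the `is_allowed` helper are assumptions made up for this example, not taken from any real site, and the check only shows what a well-behaved crawler should do; it does not prove what any crawler actually does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary news site: block GPTBot
# everywhere, leave every other crawler unrestricted.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    article = "https://example.com/2023/08/some-article.html"  # placeholder URL
    for agent in ("GPTBot", "Googlebot"):
        print(agent, "allowed:", is_allowed(ROBOTS_TXT, agent, article))
```

Note that robots.txt is a voluntary convention: it only tells crawlers what the publisher wants, and it does nothing about content that was collected before the rule was added.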
In 2015, a federal appeals court ruled in favour of Google, which had scanned millions of books for its Google Books library. The court found that the library did not create a significant market substitute for the original books, and that the scanning thus fell under ‘fair use’. For OpenAI, this would be difficult to prove. According to some experts, ChatGPT might be able to serve as an alternative to visiting NYT articles, reducing traffic to the website. That would suggest OpenAI is not using the content from the website “fairly”.
Interestingly, ChatGPT is not connected to the internet. So even if NYT is able to prove that the model was trained on its articles illegally, nothing published after the 2021 training cut-off would be included. GPT-4’s Browse with Bing feature was also discontinued some time back, possibly for the same reasons. It seems OpenAI was aware of the copyright issues beforehand and took an early step. It might now be hard to prove whether ChatGPT actually competes with NYT or people are merely using it to summarise things in the paper’s style, something that NYT wants only Google’s AI tools to do in the future.
Nevertheless, writers have long been protesting against generative AI technology for replacing their jobs. Now that NYT is with them, they might finally make some headway, but the push is not against AI models as such, just against Google’s competitor, OpenAI.