Several prominent authors are suing Meta and OpenAI to stop the companies from using copyrighted works to train their artificial intelligence systems. The authors, including Michael Chabon, claim that the companies have been unlawfully harvesting books from the web to create works that infringe their copyrights. They are seeking a court order that would force the companies to destroy any AI systems trained on their copyrighted material. The lawsuit is one in a series of cases that creators have brought to test the legality of how large language models are trained.
Authors are not the only ones taking on AI companies. Artists have also filed copyright infringement lawsuits against Stability AI, Midjourney, and DeviantArt over their AI art generators. As evidence that the AI systems were fed the authors’ books, the authors’ lawsuit points to ChatGPT generating summaries and analyses of the novels when prompted. The lawsuit argues that this would only be possible if the underlying AI models were trained on those works. On this view, the models cannot operate without information extracted from copyrighted material, so the content they produce could be considered infringing derivative works.
The authors allege that both OpenAI and Meta built their training datasets by scraping text from the internet. For example, OpenAI acknowledged training its first large language model, GPT-1, on BookCorpus, a collection of over 7,000 novels assembled by copying written works from a website called Smashwords without the authors’ consent and without compensating them. The authors also claim that subsequent versions of OpenAI’s models were trained on illicitly obtained books.
OpenAI no longer discloses the sources of its dataset, citing competitive and safety concerns. Similarly, Meta doesn’t disclose the origin of the books in the dataset for its LLaMA AI system. However, the authors argue that the “Books3” section of The Pile, which Meta’s dataset is said to be based on, consists of books obtained from Bibliotik without permission.
The class action lawsuit is brought on behalf of authors nationwide who are seeking to protect their copyrights. They accuse Meta and OpenAI of direct copyright infringement, vicarious copyright infringement, violations of the Digital Millennium Copyright Act, unjust enrichment, and negligence. The courts will have to consider two earlier rulings that are expected to heavily influence the outcome of this litigation. One involved Google’s copying of books to create a search function, where the copying was found to be fair use in part because only limited snippets of text were displayed. The other focused on commercial exploitation and rejected a fair use defense. Legal experts believe that the nature of AI’s use of copyrighted material will be the central point of contention in this legal battle.
ChatGPT can also generate screenplays in the style of specific books or authors. For example, it produced a script in David Henry Hwang’s style when prompted to create something in the style of his work “The Dance and the Railroad.” This raises questions about whether AI-generated works should be copyrightable and how they may affect authors’ market prospects.
Ultimately, the prediction is that the courts will rule in favor of the creators if fair use is properly analyzed. Authors and artists argue that AI companies are harming their economic interests by creating competing works based on their material. That pressure may push AI companies to establish a licensing framework in the future. It remains to be seen how these lawsuits unfold, but it’s clear that the battle between creators and AI companies is far from over.