Microsoft has been hit with a lawsuit by a group of authors who claim the technology giant used their books without permission to train its Megatron artificial intelligence model.
Kai Bird, Jia Tolentino, Daniel Okrent and several others alleged that Microsoft used pirated digital versions of their books to teach its AI to respond to human prompts. The plaintiffs in the lawsuit, filed in New York federal court on Tuesday, also include Jonathan Alter, a biographer of former president Jimmy Carter.
Their legal action is one of several high-stakes cases brought by authors, news outlets and other copyright holders against technology companies including Meta Platforms, Anthropic and Microsoft-backed OpenAI over alleged misuse of their material in AI training.
The writers alleged in the complaint that Microsoft used a collection of nearly 200,000 pirated books to train Megatron, an algorithm that gives text responses to user prompts. The complaint said Microsoft used the pirated dataset to create “a computer model that is not only built on the work of thousands of creators and authors, but also built to generate a wide range of expression that mimics the syntax, voice, and themes of the copyrighted works on which it was trained.”
The lawsuit alleges that Microsoft’s success “is predicated on mass copyright infringement” and that the company intentionally chose to use pirated libraries, rather than strike licensing deals with publishers and creators, to gain an advantage in developing its large language model.
According to the complaint, Microsoft acknowledged using an open-source dataset of more than 800GB called “The Pile” to train its language model. The company allegedly used the dataset while it still contained “Books3,” described as a “notorious” collection of pirated works. The pirated book collection was removed from the official version of The Pile in 2023 due to copyright complaints.
The complaint against Microsoft came a day after a California federal judge ruled that Anthropic’s use of authors’ material to train its AI systems was fair use under United States copyright law, but that the company may still be liable for pirating their books. It was the first US decision on the legality of using copyrighted materials without permission for generative AI training.
Technology companies have argued that they make fair use of copyrighted material to create new, transformative content, and that being forced to pay copyright holders for their work could hamstring the burgeoning AI industry.
The authors requested a court order blocking Microsoft’s infringement and statutory damages of up to $150,000 for each work that Microsoft allegedly misused. The proposed class action would include all owners of copyrighted works registered with the US Copyright Office within five years of their work’s publication.