Harvard to Publish 1 Million Books for Free: A Goldmine for AI

© Paolo Gallo/Shutterstock.com

AI models like ChatGPT or Gemini need a lot of computing resources and energy, but they also need a lot of training data. To give AI labs new data for training their models, Harvard is creating a database of a million books through its new Institutional Data Initiative.

Books in the public domain, gathered in a dataset for AI

This data could be used to train future AI models, since the books have fallen into the public domain and are therefore no longer protected by copyright. According to Wired magazine, this dataset is five times larger than Books3, a dataset that Meta used to train its Llama model.

A project supported by Google, Microsoft and OpenAI

The project is supported by OpenAI and Microsoft, with the participation of Google through its Google Books initiative. The goal is to put all stakeholders on an equal footing, since the dataset will be freely accessible. While large organizations like OpenAI or Google can pull out their checkbooks to access copyrighted texts, doing so can be much harder for a small startup.

More datasets are coming

Harvard’s Institutional Data Initiative doesn’t plan to stop there: it is already working with the Boston Public Library to digitize millions of news articles that are in the public domain. According to Wired, the university is also open to other partnerships.

It should also be noted that this is not the only initiative of its kind. For example, in March 2024, a dataset comprising a total of 500 billion words was published on the Hugging Face platform, with texts in English, French, Dutch, Spanish, German, and Italian.

  • Developing generative AI models requires more than just chips and power; it also requires a huge amount of training data.
  • Harvard is embarking on a new project to release a dataset of 1 million books that have fallen into the public domain. This data could be used by AI labs.
  • Harvard is also working on another project to digitize millions of press articles.


By Teilor Stone

Teilor Stone has been a reporter on the news desk since 2013. Before that she wrote about young adolescence and family dynamics for Styles and was the legal affairs correspondent for the Metro desk. Before joining Thesaxon, Teilor Stone worked as a staff writer at the Village Voice and a freelancer for Newsday, The Wall Street Journal, GQ and Mirabella. To get in touch, contact me at teilor@nizhtimes.com or 1-800-268-7116.