
Harvard to Publish 1 Million Books for Free: A Goldmine for AI



AI models such as ChatGPT and Gemini need a lot of computing power and energy, but also a lot of training data. To give AI labs new data for training their models, Harvard plans to release a large dataset of one million books through its new Institutional Data Initiative.

Books in the public domain, gathered in a dataset for AI

This data could be used to train future AI models, since the books have fallen into the public domain and are therefore no longer protected by copyright. According to Wired, the dataset is five times larger than Books3, a dataset that Meta used to train its Llama model.

A project supported by Google, Microsoft and OpenAI

The project is supported by OpenAI and Microsoft, with the participation of Google through its Google Books initiative. The goal is to put all stakeholders on an equal footing, since the dataset will be freely accessible. While large organizations like OpenAI or Google can pull out their checkbooks to access copyrighted texts, doing so can be much harder for a small startup.


More datasets are coming

Harvard’s Institutional Data Initiative doesn’t plan to stop there: it is already working with the Boston Public Library to digitize millions of news articles that are in the public domain. According to Wired, the university is also open to other partnerships.

This is not the only initiative of its kind. For example, in March 2024, the Hugging Face platform published a dataset totaling 500 billion words, with texts in English, French, Dutch, Spanish, German, and Italian.

  • Developing generative AI models requires more than just chips and power; it also requires a huge amount of training data.
  • Harvard is embarking on a new project to release a dataset of 1 million books that have fallen into the public domain. AI labs could use this data to train their models.
  • Harvard is also working on another project to digitize millions of news articles.


Teilor Stone

