
Harvard to Publish 1 Million Books for Free: A Goldmine for AI


© Paolo Gallo/Shutterstock.com

AI models like ChatGPT or Gemini need a lot of computing resources and a lot of energy, but also a lot of training data. To give AI labs new data for training their models, Harvard is going to create a huge database of a million books through its new Institutional Data Initiative project.

Books in the public domain, gathered in a dataset for AI

This data could be used to train future AI models, since the books have fallen into the public domain and are therefore no longer protected by copyright. According to Wired magazine, this dataset is five times larger than Books3, a dataset that the Meta group used to train its Llama model.

A project supported by Google, Microsoft and OpenAI

The project is supported by OpenAI and Microsoft, with the participation of Google, through its Google Books initiative. The goal is to put all stakeholders on an equal footing, given that the dataset will be accessible for free. Indeed, while large organizations like OpenAI or Google can pull out their checkbooks to access copyrighted texts, it can be more complicated for a small startup.


More datasets are coming

Harvard’s Institutional Data Initiative doesn’t plan to stop there: it is already working with the Boston Public Library to digitize millions of news articles that are in the public domain. And according to Wired, the university is open to other partnerships.

This is not the only initiative of its kind. For example, in March 2024, the Hugging Face platform published a dataset comprising a total of 500 billion words, with texts in English, French, Dutch, Spanish, German, and Italian.

  • Developing generative AI models requires more than just chips and power; it also requires a huge amount of training data.
  • Harvard is embarking on a new project to release a dataset of 1 million books that have fallen into the public domain. AI labs could use this data to train their models.
  • Harvard is also working on another project to digitize millions of press articles.


Teilor Stone

