AImpact
High Open Source Models · 1 min read

The Pile: the open-source 825 GB dataset for training LLMs

In one sentence: EleutherAI releases The Pile, an 825 GB composite text dataset curated from 22 different sources (arXiv, GitHub, PubMed, books, StackExchange…), designed for pre-training large open-source language models.

Verified Official source

Training a GPT-3-class model needs more than powerful computers: it needs a huge amount of text to feed the model. OpenAI keeps its training set secret, and that's part of why nobody else can reproduce its results.

EleutherAI publishes its answer: The Pile, 825 GB of text, free for anyone to download. Not just copy-pasted Wikipedia: a carefully curated mix of scientific papers, open-source code, free books, Stack Exchange threads, movie subtitles, patents, news.
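
To make "composite" concrete, here is a minimal sketch of how one might check what share of a Pile-style corpus each source contributes. It assumes records are stored as JSON Lines with a `text` field and a `meta.pile_set_name` label; that layout and the sample records are assumptions for illustration, not something stated above.

```python
import json
from collections import Counter

# Hypothetical sample records. The "text" / "meta.pile_set_name" layout is an
# assumption about the on-disk format, used here only for illustration.
SAMPLE_JSONL = """\
{"text": "We prove a bound on ...", "meta": {"pile_set_name": "ArXiv"}}
{"text": "def add(a, b): return a + b", "meta": {"pile_set_name": "Github"}}
{"text": "How do I reverse a list?", "meta": {"pile_set_name": "StackExchange"}}
{"text": "import os", "meta": {"pile_set_name": "Github"}}
"""


def bytes_per_source(lines):
    """Return a Counter mapping each source name to its total UTF-8 text bytes."""
    totals = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        source = record.get("meta", {}).get("pile_set_name", "unknown")
        totals[source] += len(record["text"].encode("utf-8"))
    return totals


def source_shares(totals):
    """Convert byte totals into fractions of the whole corpus."""
    total = sum(totals.values())
    return {name: count / total for name, count in totals.items()}


totals = bytes_per_source(SAMPLE_JSONL.splitlines())
shares = source_shares(totals)

# Print sources from largest to smallest share.
for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {share:.1%}")
```

On a real download the same loop would run over the decompressed files instead of the inline sample, and the resulting shares show how much of the mix is papers, code, books and so on.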

It becomes the foundation of many open models over the next two years, including EleutherAI's own GPT-Neo, GPT-J and GPT-NeoX-20B. For anyone who wants to study or build LLMs without depending on Big Tech, The Pile is the starting point.

Companies

EleutherAI

Tools

The Pile

Tags

EleutherAI, The Pile, Dataset, Open Source, Pre-training

Sources