AImpact
High Open Source Models · 1 min read

The Pile: the 825 GB open dataset that fuels the open LLM era

In one sentence: EleutherAI publishes The Pile, an 825 GB dataset built from 22 diverse sub-datasets, which becomes the base for GPT-Neo, GPT-J, Pythia and much of the early open source ecosystem.

Verified Official source

EleutherAI releases The Pile, a massive 825 GB text dataset anyone can download. It draws on books (Project Gutenberg, Books3), Wikipedia, code (GitHub), scientific papers (PubMed, ArXiv), forums (StackExchange, HackerNews), YouTube subtitles and other sources, for 22 sub-datasets in total.
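As a rough illustration of how that mix of sources can be inspected, the sketch below assumes the shards are JSON Lines files in which each record carries its text plus a `meta.pile_set_name` label. That layout, the helper name `count_by_source` and the sample records are assumptions made for illustration, not details taken from this article.

```python
import json
from collections import Counter
from typing import Iterable


def count_by_source(lines: Iterable[str]) -> Counter:
    """Count documents per sub-dataset in Pile-style JSON Lines records."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        # Assumed record layout: {"text": ..., "meta": {"pile_set_name": ...}}
        source = record.get("meta", {}).get("pile_set_name", "unknown")
        counts[source] += 1
    return counts


# Tiny made-up sample standing in for a real shard (hypothetical content).
sample = [
    json.dumps({"text": "def f(): pass", "meta": {"pile_set_name": "Github"}}),
    json.dumps({"text": "Call me Ishmael.", "meta": {"pile_set_name": "Gutenberg (PG-19)"}}),
    json.dumps({"text": "import os", "meta": {"pile_set_name": "Github"}}),
]

for name, n in count_by_source(sample).most_common():
    print(name, n)
```

With a real shard, the same function could be fed an open file handle instead of the in-memory list, since it only needs an iterable of lines.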

Why does it matter? Training a language model takes a large amount of high-quality text. Until then, big labs relied on proprietary datasets, typically Common Crawl filtered with undisclosed methods. The Pile is the first serious public alternative.

Much of the early open source ecosystem (GPT-Neo, GPT-J, Pythia and, in part, BLOOM) was born and raised on The Pile.

Companies

EleutherAI

Tools

The Pile

Tags

EleutherAI, The Pile, Dataset, Open Source, Pre-training

Sources