The Pile: the 825 GB open dataset that fuels the open LLM era
In one sentence: EleutherAI publishes The Pile, an 825 GB dataset built from 22 diverse sub-datasets and the base for GPT-Neo, GPT-J, Pythia and much of the early open-source ecosystem.
EleutherAI releases The Pile, a massive 825 GB text dataset that anyone can download. It is built from books (Project Gutenberg, Books3), Wikipedia, code (GitHub), scientific papers (PubMed, ArXiv), forums (StackExchange, HackerNews), YouTube subtitles, and more than a dozen other sources, for 22 sub-datasets in total.
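To make the "many sources in one dataset" idea concrete, here is a minimal sketch of tallying documents per sub-dataset. It assumes each record is one JSON object per line with a `meta.pile_set_name` field naming its source, which is how the released files are commonly described. The function name `count_by_source` and the sample records are made up for illustration and are not real Pile data.

```python
import json
from collections import Counter


def count_by_source(lines):
    """Count documents per sub-dataset in an iterable of JSON Lines records."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Assumed layout: {"text": ..., "meta": {"pile_set_name": ...}}
        source = record.get("meta", {}).get("pile_set_name", "unknown")
        counts[source] += 1
    return counts


# Hypothetical sample records, for illustration only.
sample = [
    '{"text": "def f(): pass", "meta": {"pile_set_name": "Github"}}',
    '{"text": "An encyclopedia entry.", "meta": {"pile_set_name": "Wikipedia (en)"}}',
    '{"text": "import os", "meta": {"pile_set_name": "Github"}}',
]
counts = count_by_source(sample)
```

The same function could be pointed at an open file handle over a decompressed shard, since it only needs an iterable of lines.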
Why does it matter? Training a language model takes a lot of high-quality text. Until then, the big labs trained on proprietary datasets, typically Common Crawl filtered by undisclosed methods. The Pile is the first serious public alternative.
The whole open-source ecosystem (GPT-Neo, GPT-J, Pythia, and in part BLOOM) is born and raised on The Pile.
Companies
EleutherAI
Tools
The Pile