The Pile: the open-source 825 GB dataset for training LLMs
In one sentence: EleutherAI releases The Pile, an 825 GB composite text dataset curated from 22 different sources (arXiv, GitHub, PubMed, books, Stack Exchange…), designed for pre-training large open-source language models.
Training a GPT-3-class model needs more than powerful computers: it needs a huge amount of text to feed the model. OpenAI keeps its training set secret, and that's part of why nobody else can reproduce its results.
EleutherAI publishes its answer: The Pile, 825 GB of text, free for anyone to download. It is not just copy-pasted Wikipedia, but a carefully curated mix of scientific papers, open-source code, free books, Stack Exchange threads, movie subtitles, patents, news.
It becomes the foundation of many open models over the next two years. For anyone who wants to study or build LLMs without depending on Big Tech, The Pile is the starting point.
Companies: EleutherAI
Tools: The Pile