The Pile: the open-source 825 GB dataset for training LLMs
EleutherAI releases The Pile, an 825 GB text dataset assembled from 22 sources (arXiv, GitHub, PubMed, books, StackExchange…) and designed for pre-training large open-source language models.
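
To make "composite" concrete, here is a minimal Python sketch that tallies documents per source subset. It assumes each record is one JSON object per line with a "text" field and a "meta" object holding a "pile_set_name" label. The function name count_by_source and the three sample records are hypothetical and only illustrate the idea; they are not taken from the dataset.

```python
import json
from collections import Counter


def count_by_source(lines):
    """Tally documents per source subset from JSON Lines records.

    Assumed record layout (an assumption, not confirmed by the text above):
    {"text": "...", "meta": {"pile_set_name": "<source name>"}}
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Fall back to "unknown" when the source label is missing.
        source = record.get("meta", {}).get("pile_set_name", "unknown")
        counts[source] += 1
    return counts


# Hypothetical sample records, made up for illustration only.
sample = [
    '{"text": "An abstract about optimization.", "meta": {"pile_set_name": "ArXiv"}}',
    '{"text": "def main(): pass", "meta": {"pile_set_name": "Github"}}',
    '{"text": "A second abstract.", "meta": {"pile_set_name": "ArXiv"}}',
]

counts = count_by_source(sample)
for name, n in sorted(counts.items()):
    print(name, n)
```

The same function would work on an open file handle, since it only iterates over lines. Grouping by a per-record source label is one simple way to inspect or reweight the mixture before training.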