In practice
Pretraining is the most expensive step (months of GPU time and millions of dollars) and produces a "base" model that can continue text fluently but does not yet reliably follow instructions. Only large labs run it from scratch; most companies start from an already pretrained model and adapt it with SFT, LoRA, or RLHF.
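To give a rough sense of why adapting a pretrained model is so much cheaper than training one, the sketch below compares the number of trainable parameters for a single weight matrix under full fine-tuning and under LoRA, which freezes the original matrix and trains two low-rank factors instead. The function names and the example sizes (a 4096 x 4096 matrix, rank 8) are illustrative assumptions, not figures from any particular model.

```python
def full_finetune_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole d_out x d_in matrix is updated."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters when only the LoRA factors are updated.

    LoRA keeps the original matrix W frozen and learns an update B @ A,
    where B is d_out x rank and A is rank x d_in.
    """
    return rank * (d_out + d_in)


if __name__ == "__main__":
    # Hypothetical layer size and rank, chosen only for illustration.
    d_out, d_in, rank = 4096, 4096, 8
    full = full_finetune_params(d_out, d_in)
    lora = lora_params(d_out, d_in, rank)
    print(f"full fine-tuning: {full:,} trainable parameters")
    print(f"LoRA (rank {rank}): {lora:,} trainable parameters")
    print(f"reduction factor: {full // lora}x")
```

This counts one matrix only; a real model has many such matrices, and the actual savings depend on which layers receive adapters and on the rank chosen.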
Related terms
Seen in the wild
6 entries mentioning it
- [High] GR-2: ByteDance pre-trains a robot on 38,000 hours of human internet videos
- [Medium] UL2: Google unifies pretraining paradigms with Mixture-of-Denoisers
- [High] The Pile: the 825 GB open dataset that fuels the open LLM era
- [High] The Pile: the open-source 825 GB dataset for training LLMs
- [Medium] Image GPT: generative pretraining for images
- [Medium] ELECTRA: more efficient NLP pre-training than BERT