Reading path
Sysadmin / DevOps going local
Milestones on the way to running serious LLMs on your own servers, not someone else's cloud.
You are a sysadmin or DevOps engineer and you want to understand how we got to the point of hosting frontier-grade models on-prem. This path starts with the LLaMA leak that kicked off the open-weight ecosystem and ends with the self-hostable reasoning of DeepSeek R1 and the quantization techniques that make it sustainable on real hardware.
- 01
Why it matters to you
The leak that kicked off the whole open-weight ecosystem: without it neither llama.cpp nor Ollama would exist today.
High | Open Source Models | LLaMA: Meta opens foundation models to research
Meta releases LLaMA in four sizes (7B, 13B, 33B, 65B), available to researchers on request. One week later, the weights leak publicly.
- 02
Why it matters to you
Meta's first official release of commercially usable weights: the starting point of legal on-prem AI in companies.
Landmark | Open Source Models | Llama 2: weights become commercially usable
Meta releases Llama 2 (7B, 13B, 70B) under a license that allows commercial use for products with up to 700M monthly active users. For the first time a serious LLM is genuinely deployable to production without depending on an API.
- 03
Why it matters to you
Proves a 7B European model can beat much bigger ones: the first realistic candidate for a single-GPU server.
High | Open Source Models | Mistral 7B: Europe joins the open-source race
Mistral AI (Paris), a startup only a few months old and founded by ex-Meta/DeepMind researchers, releases Mistral 7B under Apache 2.0. It beats Llama 2 13B on most benchmarks with roughly half the parameters.
- 04
Why it matters to you
Cloud-grade speech-to-text running on your hardware: the most widely deployed local model after text LLMs.
High | Voice & Audio | Whisper open source: audio transcription becomes a commodity
OpenAI releases Whisper under the MIT license: a speech-to-text model trained on 680,000 hours of multilingual audio that delivers near commercial-grade quality and runs locally.
- 05
Why it matters to you
Meta's reference stack for on-prem deployment: standardizes inference, safety and tooling, no more bespoke scripts.
Medium | AI Infrastructure | Llama Stack: Meta proposes a unified API spec for the LLM lifecycle
Meta announces Llama Stack: an API spec plus reference implementations for inference, safety, agents, memory, evals, RAG, and training, meant as 'standard plumbing' for Llama-based applications.
- 06
Why it matters to you
Frontier-grade reasoning in open weights: for the first time you can host a model that actually reasons in your own data center.
Landmark | Open Source Models | DeepSeek-R1: open reasoning matches o1 at 1/30 the cost
Chinese startup DeepSeek releases R1, a reasoning model with MIT-licensed open weights. Performance is on par with OpenAI o1, with API pricing of $0.55/$2.19 per 1M input/output tokens (vs. $15/$60 for o1). AI-related Nasdaq stocks lose roughly $1T in market value in two days.
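To put the price gap in concrete terms, here is a minimal sketch of the arithmetic, using only the per-1M-token prices quoted above. The monthly token volumes are hypothetical placeholders and not figures from any real workload.

```python
# Prices in USD per 1M tokens, as quoted in the card above.
R1_PRICE = {"input": 0.55, "output": 2.19}
O1_PRICE = {"input": 15.00, "output": 60.00}


def cost_usd(price: dict, input_tokens: int, output_tokens: int) -> float:
    """API cost in USD for a given number of input and output tokens."""
    return (price["input"] * input_tokens + price["output"] * output_tokens) / 1_000_000


def price_ratio(kind: str) -> float:
    """How many times more expensive o1 is than R1 for 'input' or 'output' tokens."""
    return O1_PRICE[kind] / R1_PRICE[kind]


if __name__ == "__main__":
    # Hypothetical monthly volume, chosen only for illustration.
    monthly_in, monthly_out = 200_000_000, 50_000_000
    print(f"R1: ${cost_usd(R1_PRICE, monthly_in, monthly_out):,.2f}")
    print(f"o1: ${cost_usd(O1_PRICE, monthly_in, monthly_out):,.2f}")
    print(f"input ratio:  {price_ratio('input'):.1f}x")
    print(f"output ratio: {price_ratio('output'):.1f}x")
```

With the quoted prices both ratios come out slightly above 27x, so "1/30 the cost" is a rounded figure. Self-hosting the open weights replaces the per-token price with your own hardware and power costs, which this sketch does not model.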
- 07
Why it matters to you
Quantization makes running frontier models on workstations financially sensible: it rewrites the TCO of your server room.
Medium | Local AI | Usable 2-bit quantization: frontier reasoning models drop below 32GB RAM
New quantization techniques (high-quality 2-bit / 3-bit extensions) let frontier-sized reasoning models run on workstations with 32-64GB unified RAM.
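As a rough capacity-planning aid, the effect of bit width on memory can be sketched as parameters times bits per weight, divided by 8. This is a back-of-the-envelope estimate for the weights only: it ignores the KV cache, activations, and the per-block scale metadata that real quantization formats add. The 70B model size is an illustrative assumption, not a figure from the entry above.

```python
def weight_memory_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory needed for model weights alone, in GiB.

    Ignores KV cache, activations, and quantization metadata,
    so real-world usage will be higher.
    """
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


if __name__ == "__main__":
    # 70B is an assumed example size for illustration.
    for bits in (16, 8, 4, 3, 2):
        print(f"70B @ {bits:>2}-bit: {weight_memory_gib(70, bits):6.1f} GiB")
```

Leave headroom on top of the estimate: context length drives KV-cache size, and the operating system needs its share of unified RAM too.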