Reading path

Sysadmin / DevOps going local

The milestones that made it possible to run serious LLMs on your own servers instead of someone else's cloud.

You are a sysadmin or DevOps engineer and you want to understand how we got to the point of hosting frontier-grade models on-prem. This path starts with the LLaMA leak that opened up the open-weight ecosystem and ends with the self-hostable reasoning of DeepSeek-R1 and the quantization techniques that make running it sustainable on real hardware.

01

Why it matters to you

The leak that kicked off the whole open-weight ecosystem: llama.cpp, and Ollama after it, grew out of the effort to run these weights locally.

February 24, 2023 · High · Open Source Models

LLaMA: Meta opens foundation models to research

Meta releases LLaMA in four sizes (7B, 13B, 33B, 65B), available to researchers on request. One week later, the weights leak publicly.

02

Why it matters to you

Meta's first official release of commercially usable weights: the starting point for legally running LLMs on-prem inside a company.

July 18, 2023 · Landmark · Open Source Models

Llama 2: weights become commercially usable

Meta releases Llama 2 (7B, 13B, 70B) under a license that allows commercial use by any service with up to 700 million monthly active users (MAU). For the first time, a serious LLM can be deployed to production without depending on someone else's API.

03

Why it matters to you

Proves that a 7B European model can beat a model nearly twice its size: the first realistic candidate for a single-GPU server.

September 27, 2023 · High · Open Source Models

Mistral 7B: Europe joins the open-source race

Mistral AI (Paris), a startup founded only months earlier by ex-Meta and DeepMind researchers, releases Mistral 7B under Apache 2.0. It beats Llama 2 13B on most benchmarks with roughly half the parameters.

04

Why it matters to you

Cloud-grade speech-to-text running on your own hardware: one of the most widely deployed local models outside of text LLMs.

September 21, 2022 · High · Voice & Audio

Whisper open source: audio transcription becomes a commodity

OpenAI releases Whisper under the MIT license: a speech-to-text model trained on 680,000 hours of multilingual audio. Its quality is close to commercial services, and it runs locally.

05

Why it matters to you

Meta's reference stack for on-prem deployment: it standardizes inference, safety, and tooling, so you depend less on bespoke scripts.

September 25, 2024 · Medium · AI Infrastructure

Llama Stack: Meta proposes a unified API spec for the LLM lifecycle

Meta announces Llama Stack: an API spec plus reference implementations for inference, safety, agents, memory, evals, RAG, and training, meant as 'standard plumbing' for Llama-based applications.

06

Why it matters to you

Frontier-grade reasoning in open weights: for the first time you can host a model that actually reasons in your own data center.

January 20, 2025 · Landmark · Open Source Models

DeepSeek-R1: open reasoning matches o1 at 1/30 the cost

Chinese startup DeepSeek releases R1, a reasoning model with MIT-licensed open weights. Its performance is on par with OpenAI o1, while its API is priced at $0.55 input / $2.19 output per 1M tokens (vs. $15 / $60 for o1). AI-related Nasdaq stocks lose about $1T in market value in two days.
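To see where the "1/30" headline figure comes from, here is a minimal Python sketch that applies the quoted per-token prices to a workload. The workload size (100M input tokens and 20M output tokens per month) is a made-up assumption for illustration, and the helper name `monthly_cost` is hypothetical.

```python
# Prices per 1M tokens as quoted above: (input, output) in USD.
R1_PRICE = (0.55, 2.19)
O1_PRICE = (15.00, 60.00)


def monthly_cost(price, input_tokens_m, output_tokens_m):
    """Cost in USD for a workload measured in millions of tokens."""
    price_in, price_out = price
    return price_in * input_tokens_m + price_out * output_tokens_m


# Hypothetical workload: 100M input tokens and 20M output tokens per month.
r1 = monthly_cost(R1_PRICE, 100, 20)
o1 = monthly_cost(O1_PRICE, 100, 20)
ratio = o1 / r1

print(f"R1: ${r1:,.2f}  o1: ${o1:,.2f}  ratio: {ratio:.1f}x")
```

With these numbers the ratio comes out at about 27x, which the headline rounds to 1/30. This compares API list prices only; the cost of self-hosting R1 depends on your own hardware and utilization and is not modeled here.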
07

Why it matters to you

Quantization makes running frontier models on workstations financially sensible: it rewrites the TCO of your server room.

April 30, 2026 · Medium · Local AI

Usable 2-bit quantization: frontier reasoning models drop below 32GB RAM

New quantization techniques, including higher-quality 2-bit and 3-bit schemes, let frontier-sized reasoning models run on workstations with 32-64GB of unified RAM.
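The arithmetic behind that shift is simple enough to sketch. The Python snippet below estimates the memory needed for the weights alone as parameters × bits per weight / 8. It is a rough lower bound under stated assumptions: it ignores the KV cache, activations, and the per-block scaling metadata that real quantized formats store, so actual files are somewhat larger. The function name `weight_footprint_gb` is hypothetical, and the model sizes are the Llama 2 sizes from this path, used only as familiar reference points.

```python
def weight_footprint_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 10^9 bytes)."""
    # (params_billions * 1e9 weights) * bits / 8 bits-per-byte / 1e9 bytes-per-GB
    return params_billions * bits_per_weight / 8


# Llama 2 sizes, used purely as reference points (assumption for illustration).
MODEL_SIZES_B = (7, 13, 70)
BIT_WIDTHS = (16, 4, 3, 2)

for params in MODEL_SIZES_B:
    row = ", ".join(
        f"{bits}-bit: {weight_footprint_gb(params, bits):.1f} GB"
        for bits in BIT_WIDTHS
    )
    print(f"{params}B -> {row}")
```

A 70B model needs about 140 GB at 16 bits per weight but only about 17.5 GB at 2 bits, which is the difference between a multi-GPU server and a single workstation. Whether the output quality holds up at that bit width is exactly what the newer techniques are meant to address.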