AImpact

Reading path

Sysadmin / DevOps going local

Milestones to run serious LLMs on your own servers, not someone else's cloud.

You are a sysadmin or DevOps engineer and you want to understand how we got to the point of hosting frontier-grade models on-prem. This path starts with the LLaMA leak that kicked off the open-weight ecosystem and ends with the self-hostable reasoning of DeepSeek-R1 and the quantization techniques that make it sustainable on real hardware.

  1. 01

    Why it matters to you

    The leak that kicked off the whole open-weight ecosystem: without it, llama.cpp and Ollama would likely not exist in their current form.

    High Open Source Models

    LLaMA: Meta opens foundation models to research

    Meta releases LLaMA in four sizes (7B, 13B, 33B, 65B), available to researchers on request. One week later, the weights leak publicly.

  2. 02

    Why it matters to you

    Meta's first official release of commercially usable weights: the starting point of legal on-prem AI in companies.

    Landmark Open Source Models

    Llama 2: weights become commercially usable

    Meta releases Llama 2 (7B, 13B, 70B) under a license that allows commercial use unless the product exceeds 700M monthly active users. For the first time, a serious LLM can be deployed to production without depending on an API.

  3. 03

    Why it matters to you

    Proves a 7B European model can beat much bigger ones: the first realistic candidate for a single-GPU server.

    High Open Source Models

    Mistral 7B: Europe joins the open-source race

    Mistral AI (Paris), a startup founded only months earlier by ex-Meta and ex-DeepMind researchers, releases Mistral 7B under Apache 2.0. It beats Llama 2 13B on most benchmarks with roughly half the parameters.

  4. 04

    Why it matters to you

    Cloud-grade speech-to-text running on your hardware: one of the most widely deployed local models alongside text LLMs.

    High Voice & Audio

    Whisper open source: audio transcription becomes a commodity

    OpenAI releases Whisper under the MIT license: a speech-to-text model trained on 680,000 hours of multilingual audio that approaches commercial-grade quality and runs locally.

  5. 05

    Why it matters to you

    Meta's reference stack for on-prem deployment: standardizes inference, safety and tooling, no more bespoke scripts.

    Medium AI Infrastructure

    Llama Stack: Meta proposes a unified API spec for the LLM lifecycle

    Meta announces Llama Stack: an API spec plus reference implementations for inference, safety, agents, memory, evals, RAG, and training, meant as 'standard plumbing' for Llama-based applications.

  6. 06

    Why it matters to you

    Frontier-grade reasoning in open weights: for the first time you can host a model that actually reasons in your own data center.

    Landmark Open Source Models

    DeepSeek-R1: open reasoning matches o1 at 1/30 the cost

    Chinese startup DeepSeek releases R1, a reasoning model with MIT-licensed open weights. Performance is on par with OpenAI o1, at API pricing of $0.55 (input) / $2.19 (output) per 1M tokens, versus $15 / $60 for o1. AI-related Nasdaq stocks lose about $1T in market value in two days.

  7. 07

    Why it matters to you

    Quantization makes running frontier models on workstations financially sensible and rewrites the TCO of your server room.

    Medium Local AI

    Usable 2-bit quantization: frontier reasoning models drop below 32GB RAM

    New quantization techniques (higher-quality 2-bit and 3-bit formats) let frontier-sized reasoning models run on workstations with 32-64GB of unified RAM.
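
To make the quantization milestone concrete, here is a rough back-of-the-envelope sketch of why bit width decides whether a model fits in RAM. It is an illustration under stated assumptions, not a sizing tool: the function names, the 70B parameter count, and the 80% headroom factor are all hypothetical choices, and the estimate covers weights only (no KV cache, activations, or per-block quantization metadata such as scales).

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in decimal gigabytes."""
    if num_params <= 0 or bits_per_weight <= 0:
        raise ValueError("num_params and bits_per_weight must be positive")
    # bits -> bytes (divide by 8) -> gigabytes (divide by 1e9)
    return num_params * bits_per_weight / 8 / 1e9


def fits_in_ram(num_params: float, bits_per_weight: float,
                ram_gb: float, headroom: float = 0.8) -> bool:
    """True if the weights fit in a fraction (`headroom`) of the available RAM.

    The 0.8 default is an assumed safety margin for the OS, KV cache and
    runtime overhead; tune it for your own workload.
    """
    return weight_memory_gb(num_params, bits_per_weight) <= ram_gb * headroom


# Hypothetical 70B-parameter model, chosen only for illustration.
EXAMPLE_PARAMS = 70e9

if __name__ == "__main__":
    for bits in (16, 8, 4, 3, 2):
        gb = weight_memory_gb(EXAMPLE_PARAMS, bits)
        ok = fits_in_ram(EXAMPLE_PARAMS, bits, ram_gb=32)
        print(f"{bits:>2}-bit: {gb:6.1f} GB of weights, fits in 32 GB: {ok}")
```

Under these assumptions, the same 70B model needs 140 GB of weights at 16-bit and 17.5 GB at 2-bit, which is the difference between a multi-GPU server and a single workstation. Real quantization formats store extra metadata per block, so actual files are somewhat larger than this estimate.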