Usable 2-bit quantization: frontier reasoning models drop below 32GB RAM
New quantization techniques (high-quality 2-bit / 3-bit extensions) let frontier-sized reasoning models run on workstations with 32-64GB unified RAM.
Category
40 entries
New quantization techniques (high-quality 2-bit / 3-bit extensions) let frontier-sized reasoning models run on workstations with 32-64GB unified RAM.
The Local AI stack matures: Ollama accelerates inference with a better scheduler and compressed KV cache, MLX LM becomes SOTA on Apple Silicon, Apple debuts the Foundation Models framework for native apps. Running Llama 3.3 70B on a MacBook becomes a daily practice.
Private LLM brings LLMs up to 7B parameters to iPhone 15 Pro and M-series Macs via CoreML and Apple Neural Engine, completely offline with no telemetry or cloud subscriptions.
Ollama reaches stable version 1.0: multimodal image support, native tool calling, embeddings API, full OpenAI compatibility, and official Windows general availability.
Ollama adds first-class multimodal support: 'ollama run llama3.2-vision' launches local vision inference. Images are passed inline in API calls, bringing the Ollama one-line experience to VLMs (LLaVA, Moondream, Llama 3.2 Vision).
KoboldCpp v1.84 brings native RAG with embedded ChromaDB: indexes local documents and automatically injects context into the prompt, no separate server configuration needed.
Open WebUI introduces Pipelines: a pluggable middleware layer that intercepts requests and responses without modifying the core, adding rate limiting, safety filters, logging, and custom tools. The first mature plugin architecture for a local LLM frontend.
Jan.ai reaches GA with version 1.0: integrated model manager, local API server, native MCP support, and an extensions system — the first desktop AI app with a plugin ecosystem. An offline alternative to ChatGPT for privacy-first users.
LM Studio becomes an MCP client: local models access the filesystem, databases, and web search via MCP servers, without sending data to external cloud services.
llama.cpp integrates speculative decoding with GGUF draft models: 2-3x speedup even on CPU, with cross-architecture support for models from different families.
Jan.ai 0.5 introduces an extensions marketplace, CUDA and Metal GPU acceleration, pre-configured models for full offline use, and an OpenAI-compatible API.
LM Studio 0.3 brings a built-in OpenAI-compatible server, simultaneous multi-model loading, direct HuggingFace downloads with RAM/VRAM filtering, and exportable conversation logs.
llama.cpp integrates a stable Vulkan backend that brings local GPU acceleration to any discrete GPU: AMD Radeon, Intel Arc, mobile GPUs, legacy hardware — opening the local AI market to all non-NVIDIA users.
Pinokio installs Stable Diffusion, ComfyUI, Open Interpreter, and XTTS with one click, automatically managing Python, Node.js, and all dependencies on Mac, Windows, and Linux.
Mintplex Labs' AnythingLLM 1.0 consolidates the entire RAG stack into a single application: document ingestion, multi-user chat with roles, Ollama and LM Studio support, audit logging, and single-binary deployment. The first local AI solution covering the complete enterprise use case.
Open WebUI introduces local function calling and injectable Python plugins, bringing ChatGPT Enterprise capabilities to fully self-hosted deployments.
TabbyML reaches production maturity with FIM (fill-in-the-middle) completion, local repository RAG indexing, VS Code and JetBrains plugins, and Docker deployment — the first open-source Copilot alternative with awareness of your own codebase.
KoboldCpp introduces built-in RAG to its all-in-one local LLM interface: document management, character AI, and GGUF inference in a single offline executable.
A desktop app for macOS and Windows that lets you query multiple LLMs in parallel, manage conversations, and organize prompts in a local vault.
Microsoft releases Phi-3-mini 3.8B, small 7B, medium 14B. Mini runs on iPhone and beats Mixtral 8x7B on many benchmarks. Confirms the 'curated data > scale' thesis.
NextChat (formerly ChatGPT-Next-Web) surpasses 60,000 GitHub stars with v2: single-binary Docker deployment, multi-provider support (OpenAI, Azure, local models), mask/template system, becoming the reference self-hosted UI for enterprises wanting data control.
Ollama introduces the Modelfile (like a Dockerfile for LLMs), an OpenAI-compatible REST API, and a public registry with 100+ ready-to-use models.
Open WebUI (formerly Ollama WebUI) delivers a full web interface for Ollama: multi-user chat, persistent history, document upload, all in a single Docker container.
LlamaIndex reaches stable 0.10 with 150+ data connectors, full async support, streaming, and modular query engines — becoming the reference framework for RAG pipelines with local LLMs alongside LangChain.
AnythingLLM delivers a full-stack RAG system with a web interface, Ollama/LocalAI LLM backend support, and an embedded vector database, all offline in a single container.
Microsoft Research releases Phi-2, 2.7B params trained on 'textbook-quality' data. Beats LLaMA 2 7B and Mistral 7B on reasoning benchmarks, runs on laptops. 'Small + clean data' philosophy.
Jan.ai launches its first stable release: an open source local LLM client with persistent threads, an extension system, and a built-in OpenAI-compatible server.
Apple Research releases MLX, an open source ML framework optimized for M1/M2/M3: it leverages unified CPU-GPU memory for LLM inference at near-discrete-GPU performance.
Ollama launches version 0.1: a minimal CLI to download and run local LLM models with a single command, reducing setup complexity to zero.
HuggingFace open-sources chat.huggingface.co: a self-hostable web interface via Docker for Llama 2, Mistral, Code Llama, and custom models, with support for tool calls and web search.
An LLM running locally that can write and execute Python, JS, and Shell code autonomously, browse the web, and modify files on your computer.
LM Studio launches its first public release: a graphical interface to browse, download, and use local LLMs with a built-in chat and OpenAI-compatible server.
llama.cpp introduces K-quants (Q2_K through Q8_K): per-layer quantization assigning different bit-widths based on tensor importance. Q4_K_M matches Q5_1 quality at a smaller file size, becoming the de facto standard for all modern GGUF models.
imartinez publishes privateGPT: full RAG on PDFs and TXT with a local LLM, zero cloud data. Your knowledge base stays on your disk.
Nomic AI launches GPT4All v2: a desktop installer that downloads and runs quantized models with no command line required, including LocalDocs for private document Q&A with no internet connection.
mudler releases LocalAI, an OpenAI-compatible REST server that runs GGML/GGUF models locally: migrate your apps from cloud to self-hosted by changing only the URL.
Nomic AI releases GPT4All, a point-and-click installer to run LLMs offline on Windows, Mac, and Linux, lowering the technical barrier to almost zero.
The most-starred open-source web interface for running local LLMs: supports GPTQ, GGML, transformers backends with Gradio UI, extensions, character cards, and chat/instruct modes.
Georgi Gerganov brings Meta's LLaMA to consumer CPUs via 4-bit C++ quantization: the first foundation model practically usable offline on a laptop.
Georgi Gerganov brings OpenAI's Whisper model to CPU via a minimal C++ implementation: real-time transcription with no GPU and no cloud.