Local AI

46 entries

May 13, 2026 Medium

Mistral releases Devstral Small: 7B coding model for agentic tasks on consumer GPU

Mistral releases Devstral Small, a 7-billion-parameter model fine-tuned for agentic coding that outperforms GPT-4o-mini on SWE-bench and runs on just 8GB of VRAM.

Local AI Coding ModelLocal LLMAgentic AI

April 30, 2026 Medium

Usable 2-bit quantization: frontier reasoning models drop below 32GB RAM

New quantization techniques (high-quality 2-bit / 3-bit extensions) let frontier-sized reasoning models run on workstations with 32-64GB unified RAM.

Local AI Local AIQuantizationOllama

April 3, 2026 High

Microsoft releases Phi-4.5: 14B parameter SLM with best-in-class reasoning, runs on 8GB VRAM

Microsoft releases Phi-4.5, a 14-billion parameter model that outperforms much larger models on reasoning and coding benchmarks, runs on a laptop GPU with 8GB VRAM, and is freely available under Apache 2.0.

Local AI Phi-4.5SLMReasoning

March 5, 2026 Medium

Ollama 0.9: concurrent model serving, multi-GPU split, and REST API v2 for local AI

Ollama 0.9 delivers simultaneous multi-model loading, inter-request KV-cache persistence, automatic multi-GPU layer splitting, and a new streaming JSON REST API v2.

Local AI OllamaLocal LLMGPU

January 9, 2026 Landmark

NVIDIA Project DIGITS: a 1 PFLOP personal AI supercomputer for $3,000

Announced at CES 2026, NVIDIA Project DIGITS packs a GB10 Superchip, 128 GB unified memory, and 1 PFLOP FP4 into a desktop device priced at $3,000, enabling local inference of frontier models like Llama 4 405B without the cloud.

Local AI NVIDIAProject DIGITSGB10 Superchip

January 6, 2026 High

CES 2026: On-Device AI Takes Over with AI PCs, Home Robots, and NVIDIA Project DIGITS

CES 2026 in Las Vegas was the first edition entirely dominated by on-device AI, showcasing second-generation Copilot+ PCs, NVIDIA's Project DIGITS personal AI supercomputer, AI-powered TVs, and autonomous home robots from LG and Samsung.

Local AI Copilot+On-Device AIAI PC

September 17, 2025 Medium

Samsung Galaxy AI 2.0 Ships Gauss 2 On-Device LLM on Galaxy S26

Samsung's Gauss 2 runs a 7B LLM locally on Exynos 2600, enabling offline translation in 100 languages and live call transcription on the Galaxy S26.

Local AI

August 14, 2025 Medium

Local AI 2025: Ollama, MLX LM, Apple Foundation Models triple the speed

The Local AI stack matures: Ollama accelerates inference with a better scheduler and compressed KV cache, MLX LM becomes SOTA on Apple Silicon, Apple debuts the Foundation Models framework for native apps. Running Llama 3.3 70B on a MacBook becomes a daily practice.

Local AI OllamaMLXApple Silicon

July 8, 2025 Medium

Private LLM: models up to 7B directly on iPhone and Mac, fully offline

Private LLM brings LLMs up to 7B parameters to iPhone 15 Pro and M-series Macs via CoreML and Apple Neural Engine, completely offline with no telemetry or cloud subscriptions.

Local AI Private LLMiOSmacOS

May 18, 2025 High

Ollama 1.0: first stable release with multimodal, tool calling, and Windows GA

Ollama reaches stable version 1.0: multimodal image support, native tool calling, embeddings API, full OpenAI compatibility, and official Windows general availability.

Local AI OllamaMultimodalTool Calling

May 10, 2025 Medium

Ollama native vision model support: local VLMs with a one-liner

Ollama adds first-class multimodal support: 'ollama run llama3.2-vision' launches local vision inference. Images are passed inline in API calls, bringing the Ollama one-line experience to VLMs (LLaVA, Moondream, Llama 3.2 Vision).

Local AI Ollamavisionmultimodal

March 28, 2025 Medium

KoboldCpp v1.84: native RAG with embedded ChromaDB, no separate servers

KoboldCpp v1.84 brings native RAG with embedded ChromaDB: indexes local documents and automatically injects context into the prompt, no separate server configuration needed.

Local AI KoboldCppRAGChromaDB

March 20, 2025 Medium

Open WebUI Pipelines: enterprise plugin architecture for the local LLM frontend

Open WebUI introduces Pipelines: a pluggable middleware layer that intercepts requests and responses without modifying the core, adding rate limiting, safety filters, logging, and custom tools. The first mature plugin architecture for a local LLM frontend.

Local AI Open WebUIPipelinesmiddleware

February 5, 2025 Medium

Jan 1.0 GA: the first offline-first desktop AI with an extension store

Jan.ai reaches GA with version 1.0: integrated model manager, local API server, native MCP support, and an extensions system — the first desktop AI app with a plugin ecosystem. An offline alternative to ChatGPT for privacy-first users.

Local AI JanJan.aioffline AI

January 25, 2025 High

LM Studio + MCP: local models connected to the world without cloud APIs

LM Studio becomes an MCP client: local models access the filesystem, databases, and web search via MCP servers, without sending data to external cloud services.

Local AI LM StudioMCPModel Context Protocol

December 18, 2024 High

llama.cpp: speculative decoding with draft models for 2-3x speedup

llama.cpp integrates speculative decoding with GGUF draft models: 2-3x speedup even on CPU, with cross-architecture support for models from different families.

Local AI llama.cppSpeculative DecodingGGUF

November 9, 2024 Medium

Jan.ai 0.5: plugin architecture and full GPU support for offline LLMs

Jan.ai 0.5 introduces an extensions marketplace, CUDA and Metal GPU acceleration, pre-configured models for full offline use, and an OpenAI-compatible API.

Local AI Jan.aiPluginCUDA

October 12, 2024 Medium

LM Studio 0.3: built-in OpenAI-compatible server and multi-model management

LM Studio 0.3 brings a built-in OpenAI-compatible server, simultaneous multi-model loading, direct HuggingFace downloads with RAM/VRAM filtering, and exportable conversation logs.

Local AI LM StudioOpenAI CompatibleMulti-model

October 5, 2024 Medium

llama.cpp Vulkan backend: GPU acceleration for AMD, Intel Arc, and beyond CUDA

llama.cpp integrates a stable Vulkan backend that brings local GPU acceleration to any discrete GPU: AMD Radeon, Intel Arc, mobile GPUs, legacy hardware — opening the local AI market to all non-NVIDIA users.

Local AI llama.cppVulkanAMD

September 20, 2024 Medium

Pinokio: the App Store for local AI tools

Pinokio installs Stable Diffusion, ComfyUI, Open Interpreter, and XTTS with one click, automatically managing Python, Node.js, and all dependencies on Mac, Windows, and Linux.

Local AI PinokioApp StoreStable Diffusion

September 1, 2024 High

AnythingLLM 1.0: the complete local RAG stack for enterprise use

Mintplex Labs' AnythingLLM 1.0 consolidates the entire RAG stack into a single application: document ingestion, multi-user chat with roles, Ollama and LM Studio support, audit logging, and single-binary deployment. The first local AI solution covering the complete enterprise use case.

Local AI AnythingLLMRAGmulti-user

July 10, 2024 High

Open WebUI: Tools and Functions bring ChatGPT Enterprise to self-hosting

Open WebUI introduces local function calling and injectable Python plugins, bringing ChatGPT Enterprise capabilities to fully self-hosted deployments.

Local AI Open WebUIFunction CallingTools

June 14, 2024 Medium

TabbyML: open-source GitHub Copilot alternative with self-hosted codebase RAG

TabbyML reaches production maturity with FIM (fill-in-the-middle) completion, local repository RAG indexing, VS Code and JetBrains plugins, and Docker deployment — the first open-source Copilot alternative with awareness of your own codebase.

Local AI TabbyMLcoding assistantFIM

June 5, 2024 Medium

KoboldCpp adds integrated RAG: offline all-in-one LLM with documents and character AI

KoboldCpp introduces built-in RAG to its all-in-one local LLM interface: document management, character AI, and GGUF inference in a single offline executable.

Local AI KoboldCppRAG IntegratoCharacter AI

May 8, 2024 Medium

Msty: local GUI for side-by-side LLM comparison

A desktop app for macOS and Windows that lets you query multiple LLMs in parallel, manage conversations, and organize prompts in a local vault.

Local AI MstyGUIMulti-model

April 23, 2024 High

Phi-3: Microsoft relaunches SLMs with quality of 10x bigger models

Microsoft releases Phi-3-mini 3.8B, small 7B, medium 14B. Mini runs on iPhone and beats Mixtral 8x7B on many benchmarks. Confirms the 'curated data > scale' thesis.

Local AI MicrosoftPhi-3Small Language Models

March 15, 2024 Medium

NextChat v2: the world's most-deployed self-hosted ChatGPT interface

NextChat (formerly ChatGPT-Next-Web) surpasses 60,000 GitHub stars with v2: single-binary Docker deployment, multi-provider support (OpenAI, Azure, local models), mask/template system, becoming the reference self-hosted UI for enterprises wanting data control.

Local AI NextChatChatNextWebself-hosted

February 8, 2024 High

Ollama Modelfile and REST API: local LLMs enter dev workflows

Ollama introduces the Modelfile (like a Dockerfile for LLMs), an OpenAI-compatible REST API, and a public registry with 100+ ready-to-use models.

Local AI OllamaModelfileREST API

January 15, 2024 High

Open WebUI: ChatGPT-style web interface for Ollama with multi-user and history

Open WebUI (formerly Ollama WebUI) delivers a full web interface for Ollama: multi-user chat, persistent history, document upload, all in a single Docker container.

Local AI Open WebUIOllamaChatGPT UI

January 10, 2024 Medium

LlamaIndex 0.10 stable: the standard RAG framework for local LLMs

LlamaIndex reaches stable 0.10 with 150+ data connectors, full async support, streaming, and modular query engines — becoming the reference framework for RAG pipelines with local LLMs alongside LangChain.

Local AI LlamaIndexRAGdata ingestion

December 18, 2023 Medium

AnythingLLM: full local RAG with web UI and embedded vector DB

AnythingLLM delivers a full-stack RAG system with a web interface, Ollama/LocalAI LLM backend support, and an embedded vector database, all offline in a single container.

Local AI AnythingLLMRAG LocaleVector DB

December 12, 2023 Medium

Phi-2: Microsoft's 2.7B model that beats a 13B

Microsoft Research releases Phi-2, 2.7B params trained on 'textbook-quality' data. Beats LLaMA 2 7B and Mistral 7B on reasoning benchmarks, runs on laptops. 'Small + clean data' philosophy.

Local AI MicrosoftPhi-2SLM

December 5, 2023 Medium

Jan.ai: open source desktop app for local LLMs with threads and local server

Jan.ai launches its first stable release: an open source local LLM client with persistent threads, an extension system, and a built-in OpenAI-compatible server.

Local AI Jan.aiDesktop AppOpen Source

December 5, 2023 High

MLX: Apple Research brings native machine learning to Apple Silicon

Apple Research releases MLX, an open source ML framework optimized for M1/M2/M3: it leverages unified CPU-GPU memory for LLM inference at near-discrete-GPU performance.

Local AI MLXApple SiliconM1 M2 M3

November 7, 2023 Landmark ★ On my workflow

Ollama 0.1: pull and run local LLMs with one command, Docker-style

Ollama launches version 0.1: a minimal CLI to download and run local LLM models with a single command, reducing setup complexity to zero.

Local AI OllamaCLILLM Locale

September 28, 2023 Medium

HuggingFace Chat UI: open-source chat interface for any HF model

HuggingFace open-sources chat.huggingface.co: a self-hostable web interface via Docker for Llama 2, Mistral, Code Llama, and custom models, with support for tool calls and web search.

Local AI HuggingFace Chat UIopen sourcechat interface

September 10, 2023 High

Open Interpreter: LLM that executes code locally

An LLM running locally that can write and execute Python, JS, and Shell code autonomously, browse the web, and modify files on your computer.

Local AI Open InterpreterCode ExecutionLLM

September 5, 2023 High

LM Studio: desktop GUI to download and run GGUF models with OpenAI server

LM Studio launches its first public release: a graphical interface to browse, download, and use local LLMs with a built-in chat and OpenAI-compatible server.

Local AI LM StudioGGUFGUI Desktop

July 5, 2023 High

llama.cpp K-quants: the intelligent quantization that transformed local models

llama.cpp introduces K-quants (Q2_K through Q8_K): per-layer quantization assigning different bit-widths based on tensor importance. Q4_K_M matches Q5_1 quality at a smaller file size, becoming the de facto standard for all modern GGUF models.

Local AI llama.cppK-quantsGGUF

May 14, 2023 High

privateGPT: chat with your documents, completely offline

imartinez publishes privateGPT: full RAG on PDFs and TXT with a local LLM, zero cloud data. Your knowledge base stays on your disk.

Local AI privateGPTRAGPDF Offline

May 12, 2023 High

GPT4All v2 (Nomic AI): one-click local AI for everyone

Nomic AI launches GPT4All v2: a desktop installer that downloads and runs quantized models with no command line required, including LocalDocs for private document Q&A with no internet connection.

Local AI GPT4AllNomic AIconsumer AI

May 11, 2023 High

LocalAI: OpenAI drop-in replacement with local models and full privacy

mudler releases LocalAI, an OpenAI-compatible REST server that runs GGML/GGUF models locally: migrate your apps from cloud to self-hosted by changing only the URL.

Local AI LocalAIOpenAI APIPrivacy

March 27, 2023 High

GPT4All: click-and-run offline LLM for non-technical users

Nomic AI releases GPT4All, a point-and-click installer to run LLMs offline on Windows, Mac, and Linux, lowering the technical barrier to almost zero.

Local AI GPT4AllNomic AILLM Offline

March 25, 2023 High

oobabooga text-generation-webui: the first GUI for local LLMs

The most-starred open-source web interface for running local LLMs: supports GPTQ, GGML, transformers backends with Gradio UI, extensions, character cards, and chat/instruct modes.

Local AI oobaboogatext-generation-webuilocal LLM

March 10, 2023 Landmark

llama.cpp: LLaMA 7B runs 4-bit on MacBook CPU

Georgi Gerganov brings Meta's LLaMA to consumer CPUs via 4-bit C++ quantization: the first foundation model practically usable offline on a laptop.

Local AI LLaMAllama.cppC++

January 10, 2023 High

whisper.cpp: offline voice transcription on CPU with pure C++

Georgi Gerganov brings OpenAI's Whisper model to CPU via a minimal C++ implementation: real-time transcription with no GPU and no cloud.

Local AI WhisperSpeech-to-TextC++