How to run local LLMs: a practical guide with Ollama

Ollama is like Docker but for AI models. One command and you have an LLM running locally. No data leaves your network, no subscription, no token counter ticking away.

Three concrete reasons to do this. First, if you have NDAs with clients or work in healthcare or legal, sending text to OpenAI is a problem; here nothing leaves your network. Second, if you’re building automations that process documents all day, GPT-4o at $2.50 per million input tokens (output tokens cost more) adds up fast. Third, if you’re on an air-gapped network or traveling without internet, the local model doesn’t care.
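To make the cost argument concrete, here is a rough sketch of the arithmetic. It applies the $2.50 per million tokens rate quoted above to every token, so treat the result as a lower bound. The workload numbers (documents per day, tokens per document) are hypothetical and only there for illustration.

```python
# Rate quoted in the article; applied to all tokens as a simplification.
PRICE_PER_MILLION_TOKENS_USD = 2.50


def monthly_api_cost(docs_per_day: int, tokens_per_doc: int, days: int = 30,
                     price_per_million: float = PRICE_PER_MILLION_TOKENS_USD) -> float:
    """Estimate the API bill in USD for `days` of document processing."""
    total_tokens = docs_per_day * tokens_per_doc * days
    return total_tokens / 1_000_000 * price_per_million


if __name__ == "__main__":
    # Hypothetical workload: 5,000 documents a day, 4,000 tokens each.
    cost = monthly_api_cost(docs_per_day=5_000, tokens_per_doc=4_000)
    print(f"Estimated monthly cost: ${cost:,.2f}")
```

Plug in your own volumes: the point is only that a pipeline running all day multiplies a small per-token price by a very large token count.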

Installation: three commands

Linux / macOS:

curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3.1:8b
ollama run llama3.1:8b

Windows: download the installer from ollama.com/download, then use PowerShell normally. Same pull and run commands.

The last command drops you into a text chat. Press Ctrl+D or type /bye to exit. Use ollama list to see downloaded models and ollama rm mistral:7b to remove one.
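The same model list is also available programmatically. The sketch below assumes Ollama is running on its default local address (port 11434) and queries its /api/tags endpoint using only the Python standard library. The function names are illustrative and not part of any Ollama client.

```python
import json
from urllib.request import urlopen

OLLAMA_URL = "http://localhost:11434"  # assumed default local address


def model_names(tags_payload: dict) -> list[str]:
    """Extract model names from a parsed /api/tags response."""
    return [model["name"] for model in tags_payload.get("models", [])]


def fetch_installed_models(base_url: str = OLLAMA_URL) -> list[str]:
    """Ask a running Ollama server which models it has downloaded."""
    with urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return model_names(json.load(resp))
```

Calling fetch_installed_models() with the server up should return the same names that ollama list prints. If it raises a connection error, the Ollama service is probably not running.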

Hardware: what to actually expect

CPU-only (no GPU): it works. Slowly. On a recent Ryzen 5 or Core i5 with 32GB RAM expect 5-10 tokens per second on a 7B model. Enough for occasional interactive use, not for heavy batch jobs.

GPU with 8-12GB VRAM: this is where things change. An RTX 3060 with 12GB runs a 7B at 50+ tokens per second. An Apple M2 Pro with 32GB unified memory is also a strong option, because the same memory pool serves as both RAM and VRAM.
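The throughput figures above are easier to judge as wall-clock time. This sketch divides an output length by a decoding speed. It ignores prompt processing and model load time, and the 500-token summary length is a hypothetical value.

```python
def generation_time_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Time to generate `output_tokens` at a steady decoding speed."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second


if __name__ == "__main__":
    summary_tokens = 500  # hypothetical length of one generated summary
    # Speeds taken from the figures quoted in the text.
    for label, speed in [("CPU, low end", 5), ("CPU, high end", 10), ("RTX 3060", 50)]:
        seconds = generation_time_seconds(summary_tokens, speed)
        print(f"{label}: {seconds:.0f} s per summary")
```

At 5 tokens per second, a thousand such summaries take roughly 28 hours of pure generation, which is why CPU-only setups suit occasional use rather than batch jobs.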

Practical rule for VRAM: 7B model → at least 6GB at Q4 quantization, 8GB is better. 14B model → at least 12-16GB. The model lives entirely in VRAM when it fits; otherwise part of it spills to system RAM and generation gets much slower.
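A back-of-the-envelope version of this rule can be sketched as weights plus a fixed overhead. The two defaults are assumptions rather than measured values: about 4.5 bits per weight for a Q4-style quantization, and 1.5GB of overhead for the context cache and runtime buffers. Longer contexts need more, which is why the figures above leave extra headroom.

```python
def estimate_vram_gb(params_billions: float, bits_per_weight: float = 4.5,
                     overhead_gb: float = 1.5) -> float:
    """Rough memory estimate: quantized weights plus a fixed overhead.

    bits_per_weight and overhead_gb are illustrative assumptions.
    """
    weights_gb = params_billions * bits_per_weight / 8  # billions of bytes ~ GB
    return weights_gb + overhead_gb


def fits_in_vram(params_billions: float, vram_gb: float) -> bool:
    """True if the rough estimate fits entirely in the given VRAM."""
    return estimate_vram_gb(params_billions) <= vram_gb


if __name__ == "__main__":
    for size in (7, 14):
        print(f"{size}B model: about {estimate_vram_gb(size):.1f} GB")
```

The estimate is deliberately crude. It is useful for ruling out models that clearly will not fit, not for deciding between two cards that are close in size.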

Which model to choose

There’s no single best model. It depends on your hardware and use case.

llama3.1:8b - balanced, good multilingual support, fast. Starting point if you have an 8GB+ GPU.

qwen2.5:7b - excellent for code and multilingual tasks, similar in size to llama3.1:8b. Try it with ollama pull qwen2.5:7b. If you have 16GB VRAM, qwen2.5:14b gives noticeably better quality.

mistral:7b - great for instruction following and code, very fast, weaker on languages other than English.

phi3:mini - very lightweight (~2GB), runs on weak CPUs or limited hardware. Short context and limited reasoning, but if you have an old machine it’s what will actually run.

For standard business tasks — summaries, text analysis, script generation, code review — a 7-8B model is more than enough.
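The guidance above can be condensed into a small lookup. This is a deliberate simplification: the thresholds mirror the VRAM figures mentioned in this section, the function name is hypothetical, and it ignores use case entirely (for example, preferring qwen2.5:7b for code).

```python
def suggest_model(vram_gb: float) -> str:
    """Suggest a starting model tag from available GPU memory.

    Thresholds follow the article's rough guidance, not a benchmark.
    """
    if vram_gb >= 16:
        return "qwen2.5:14b"   # better quality when 16GB VRAM is available
    if vram_gb >= 8:
        return "llama3.1:8b"   # balanced starting point for 8GB+ GPUs
    return "phi3:mini"         # lightweight fallback for weak hardware


if __name__ == "__main__":
    for vram in (0, 8, 12, 16, 24):
        print(f"{vram:>2} GB VRAM -> ollama pull {suggest_model(vram)}")
```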

Web interface and API

If you want to give access to non-technical colleagues, Open WebUI is a common choice: it looks like ChatGPT, lets you pick the model from a menu, saves conversations, and supports PDF uploads with built-in RAG.

docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui --restart always \
  ghcr.io/open-webui/open-webui:main

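Open http://localhost:3000. If Ollama is running on a separate server (e.g. 192.168.1.50), add -e OLLAMA_BASE_URL=http://192.168.1.50:11434 to the docker run command. Ollama listens only on 127.0.0.1 by default, so set OLLAMA_HOST=0.0.0.0 on that server to make it reachable over the network. All colleagues then share the same GPU instance without installing anything on their PCs.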

Ollama also exposes a REST API compatible with the OpenAI chat completions spec. Most code written for the OpenAI client works by changing only the base URL and model name:

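from openai import OpenAI

# The client requires an api_key, but Ollama ignores its value.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Analyze this error log: ..."}],
)
# Print the text of the model's reply.
print(response.choices[0].message.content)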

Zero company code going to external APIs.
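If you would rather not install the openai package, Ollama’s native /api/chat endpoint can be called with the standard library alone. The sketch below assumes the default local address and a non-streaming request. The helper names are illustrative and not part of any official client.

```python
import json
from urllib.request import Request, urlopen

OLLAMA_URL = "http://localhost:11434"  # assumed default local address


def build_chat_request(model: str, prompt: str, base_url: str = OLLAMA_URL) -> Request:
    """Build a non-streaming POST request for Ollama's /api/chat endpoint."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for one JSON object instead of a token stream
    }
    return Request(
        f"{base_url}/api/chat",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(model: str, prompt: str, base_url: str = OLLAMA_URL) -> str:
    """Send the prompt to a running Ollama server and return the reply text."""
    with urlopen(build_chat_request(model, prompt, base_url), timeout=120) as resp:
        return json.load(resp)["message"]["content"]
```

With the server running, chat("llama3.1:8b", "Summarize this log: ...") returns the reply as a string. The generous timeout is there because the first call may also have to load the model into memory.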

What to do

Install Ollama on your PC or a server with a free GPU, download llama3.1:8b or qwen2.5:7b and run a test from the terminal
Launch Open WebUI in Docker and give it to a colleague to collect feedback on a concrete use case: log analysis, code review, Q&A on internal documentation
If you need more quality, try qwen2.5:14b on hardware with 16GB VRAM before scaling to 70B models that require 48GB+ GPUs