Qwen, Mistral, Phi: which local models are worth using in 2025

Not all local models are equal. Some look good on benchmarks and are useless in real use. Here are the ones I actually use.

Qwen2.5: the best for non-English languages

Alibaba has done serious work on multilingual support with Qwen2.5. It’s not marketing: test it on a passage in your own language and you can see it immediately. The grammar holds up, context is maintained, and it doesn’t switch to English mid-paragraph.

If you have around 8GB of VRAM (a typical RTX 4060, or the 8GB variant of the RTX 3060), start with Qwen2.5-7B:

ollama pull qwen2.5:7b

With 16GB of VRAM, move up to the 14B, which is noticeably better at summarizing long texts and at document Q&A:

ollama pull qwen2.5:14b

If you have a server with serious GPU hardware (an A100, two RTX 4090s, or similar), the 72B is the choice: on many tasks its quality is comparable to GPT-4o. Note that the RTX 4090 has no NVLink support, so a two-card setup shares the model over PCIe.

For code there’s Qwen2.5-Coder-32B, one of the strongest open-weight models for completion and review. It needs a GPU with 20+ GB of VRAM, or a CPU-only machine with 64GB of system RAM (it runs, but slowly):

ollama pull qwen2.5-coder:32b
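Once a model is pulled, the quickest way to check the multilingual claim on your own text is to send it a prompt through the HTTP API that Ollama serves locally. The sketch below assumes the server is running on its default port (11434) and uses the non-streaming form of the `/api/generate` endpoint; the helper names `build_payload` and `generate` are illustrative, not part of any library.

```python
import json
import urllib.request

# Default address of a locally running Ollama server (assumption: port not changed).
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(model: str, prompt: str) -> bytes:
    """Encode a non-streaming generate request as JSON bytes."""
    body = {"model": model, "prompt": prompt, "stream": False}
    return json.dumps(body).encode("utf-8")


def generate(model: str, prompt: str, url: str = OLLAMA_URL, timeout: float = 120.0) -> str:
    """Send one prompt to the local server and return the generated text."""
    request = urllib.request.Request(
        url,
        data=build_payload(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # With "stream": False the server answers with a single JSON object.
        return json.loads(response.read())["response"]
```

Calling `generate("qwen2.5:7b", ...)` with a paragraph from one of your own documents, in your own language, is enough to see whether the model stays in that language and keeps the context.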

Mistral Small 3.1: the smart compromise

Mistral Small 3.1 at 24B is excellent if you want something that runs on a Mac M2/M3 with 32GB of unified memory without the fan spinning up. It handles both English and Italian without any special configuration, and it is great for chat, text drafting, and document analysis.

ollama pull mistral-small3.1

The strength here is the quality-to-size ratio: 24B parameters at Q4 quantization come to around 14GB of weights, which is manageable on consumer hardware.
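The 14GB figure follows from simple arithmetic: parameters times bits per weight, divided by 8 to get bytes. The sketch below is a rough estimate only. The bits-per-weight values are assumptions (Q4 formats average somewhat more than 4 bits because some tensors are kept at higher precision), and the result covers the weights alone, not the context cache or runtime overhead.

```python
def estimate_weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in gigabytes (10^9 bytes)."""
    total_bits = params_billions * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9


# Assumed average bits per weight; real model files mix precisions across layers.
for label, bits in [("FP16", 16.0), ("Q8 (approx.)", 8.5), ("Q4 (approx.)", 4.7)]:
    print(f"24B at {label}: ~{estimate_weights_gb(24, bits):.1f} GB")
```

At roughly 4.7 bits per weight the 24B model lands near 14GB, while the same model at FP16 would need about 48GB, which is why quantization decides what fits on consumer hardware.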

Phi-3.5-mini: for low-resource setups

If you’re on CPU only, stuck with a company laptop, or want a model that answers short prompts within a couple of seconds without a GPU, Microsoft’s Phi-3.5-mini (3.8B parameters) is the choice:

ollama pull phi3.5

It won’t perform miracles on long texts or complex reasoning, but for simple tasks — answering short questions, formatting output, generating snippets — it’s surprisingly capable.
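Response time depends heavily on the machine, so it is worth measuring rather than guessing. A non-streaming response from Ollama's `/api/generate` endpoint includes timing fields expressed in nanoseconds (`eval_count`, `eval_duration`, `total_duration`); the sketch below turns them into readable numbers. The sample values are made up for illustration and are not a measurement of any model.

```python
def tokens_per_second(stats: dict) -> float:
    """Generation speed from eval_count and eval_duration (nanoseconds)."""
    duration_ns = stats["eval_duration"]
    if duration_ns <= 0:
        return 0.0
    return stats["eval_count"] / (duration_ns / 1e9)


def total_seconds(stats: dict) -> float:
    """Time for the whole request, including model load, in seconds."""
    return stats["total_duration"] / 1e9


# Made-up numbers for illustration only, not a benchmark result.
sample = {
    "eval_count": 60,
    "eval_duration": 4_000_000_000,
    "total_duration": 5_500_000_000,
}
print(f"{tokens_per_second(sample):.1f} tokens/s, {total_seconds(sample):.1f} s total")
```

The first request after starting the server also pays the model load time, so a second run usually gives a fairer picture of everyday latency.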

A note on Kimi

Kimi k1.5 from Moonshot AI often comes up in discussions about local models. To be clear: it is not a local model. It’s a cloud service accessible via web and API, like GPT-4o or Claude. You can’t download it, and it doesn’t run on Ollama. You can use it as an alternative to OpenAI/Anthropic models for certain tasks (it has a very long context window), but it has no place in a “local AI in the enterprise” conversation.

Summary table

Model	Min VRAM	Languages	Main use case	Ollama command
Qwen2.5-7B	8GB	EN/ZH/IT+	Summarization, Q&A, writing	`ollama pull qwen2.5:7b`
Qwen2.5-14B	16GB	EN/ZH/IT+	Long documents, analysis	`ollama pull qwen2.5:14b`
Qwen2.5-Coder-32B	20GB	EN (code)	Code completion	`ollama pull qwen2.5-coder:32b`
Mistral Small 3.1	16GB	EN/IT	Chat, drafting, analysis	`ollama pull mistral-small3.1`
Phi-3.5-mini	CPU ok	EN (basic IT)	Simple tasks, limited hardware	`ollama pull phi3.5`
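The table can also be read as a lookup: given the VRAM you have and the kind of work you do, take the most demanding model that still fits. The sketch below encodes the rows above as data. The `Model` type, the `recommend` function, and the "general"/"code" labels are hypothetical names chosen for this example, and `0` stands in for the "CPU ok" entry.

```python
from typing import NamedTuple, Optional


class Model(NamedTuple):
    tag: str          # Ollama tag from the table
    min_vram_gb: int  # "Min VRAM" column; 0 stands for "CPU ok"
    use_case: str     # simplified label: "general" or "code"


# Rows taken from the summary table, ordered from largest to smallest requirement.
# Where two models share a requirement, list order decides which one wins.
MODELS = [
    Model("qwen2.5-coder:32b", 20, "code"),
    Model("qwen2.5:14b", 16, "general"),
    Model("mistral-small3.1", 16, "general"),
    Model("qwen2.5:7b", 8, "general"),
    Model("phi3.5", 0, "general"),
]


def recommend(vram_gb: float, use_case: str = "general") -> Optional[str]:
    """Return the first (most demanding) model that fits the VRAM and use case."""
    for model in MODELS:
        if model.use_case == use_case and vram_gb >= model.min_vram_gb:
            return model.tag
    return None


for vram in (4, 8, 16, 24):
    print(vram, "GB ->", recommend(vram), "| code:", recommend(vram, "code"))
```

A lookup like this is only a starting point; the thresholds are the table's rules of thumb, not hard limits.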

What to do

If you don’t know where to start, install Ollama and try qwen2.5:7b — it’s the best starting point for enterprise use with normal hardware.
Always test the model on your specific task before deciding: a general benchmark tells you nothing about how it behaves with your documents.
If you have a Mac with an M-series chip and 32GB of unified memory or more, Mistral Small 3.1 is probably the best choice: it makes good use of unified memory and doesn’t heat up the system.
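Testing on your own task can be as simple as running the same handful of prompts through each candidate and reading the outputs side by side. The sketch below assumes a `generate(model, prompt)` callable that returns the model's text; that callable is hypothetical and can be any function that talks to your local server. A stand-in is used here so the example runs without one.

```python
import time
from typing import Callable, Dict, List


def compare_models(
    models: List[str],
    prompts: List[str],
    generate: Callable[[str, str], str],
) -> List[Dict[str, object]]:
    """Run every prompt through every model, recording output and elapsed time."""
    results = []
    for prompt in prompts:
        for model in models:
            start = time.perf_counter()
            output = generate(model, prompt)
            elapsed = time.perf_counter() - start
            results.append(
                {"model": model, "prompt": prompt, "output": output, "seconds": elapsed}
            )
    return results


def fake_generate(model: str, prompt: str) -> str:
    """Stand-in backend so the sketch runs without a model server."""
    return f"[{model}] {prompt[:20]}"


rows = compare_models(
    ["qwen2.5:7b", "phi3.5"],
    ["Summarize clause 4 of the contract."],
    fake_generate,
)
for row in rows:
    print(f'{row["model"]:12} {row["seconds"]:.3f}s  {row["output"]}')
```

Ten or twenty prompts drawn from real documents usually say more about fit than any public leaderboard.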