llama.cpp: LLaMA 7B runs 4-bit on MacBook CPU

In one sentence: Georgi Gerganov brings Meta's LLaMA to consumer CPUs with a C/C++ implementation and 4-bit quantization, making it the first foundation model practically usable offline on a laptop.

Meta had just released LLaMA, a family of powerful language models that normally run on data-center GPUs. A few days later, Georgi Gerganov published llama.cpp: a C/C++ rewrite of the model's inference code that, paired with compressed weights, runs on a regular MacBook's CPU.

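The technical breakthrough is 4-bit quantization: instead of the 16-bit floating-point numbers the model was released in, each weight is approximated with just 4 bits plus a small scale factor shared by a group of neighboring weights. Quality drops slightly, but the model becomes roughly four times smaller and much faster on common hardware, because far less data has to move through memory for every generated token.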
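
To make the idea concrete, here is a minimal Python sketch of block-wise 4-bit quantization. It is a simplified illustration, not the exact format llama.cpp uses: the block size of 32, the symmetric range of -7 to 7, the 16-bit scale, and the function names `quantize_block` and `dequantize_block` are all assumptions chosen for this example.

```python
import random
import struct

BLOCK_SIZE = 32  # assumed number of weights sharing one scale (must be even)


def quantize_block(weights):
    """Map a block of floats to 4-bit integer levels that share one scale."""
    max_abs = max(abs(w) for w in weights)
    # The largest weight in the block maps to level 7 (or -7).
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    # Round every weight to the nearest integer level in [-7, 7].
    levels = [max(-7, min(7, round(w / scale))) for w in weights]
    # Shift levels to [1, 15] so each fits in an unsigned 4-bit nibble,
    # then pack two nibbles into every byte.
    nibbles = [q + 8 for q in levels]
    packed = bytes(
        nibbles[i] | (nibbles[i + 1] << 4) for i in range(0, len(nibbles), 2)
    )
    return scale, packed


def dequantize_block(scale, packed):
    """Recover approximate floats from the packed 4-bit levels."""
    restored = []
    for byte in packed:
        restored.append(((byte & 0x0F) - 8) * scale)  # low nibble
        restored.append(((byte >> 4) - 8) * scale)    # high nibble
    return restored


# Demo on synthetic weights (random values, not real model data).
random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(BLOCK_SIZE)]

scale, packed = quantize_block(weights)
restored = dequantize_block(scale, packed)

max_error = max(abs(a - b) for a, b in zip(weights, restored))
fp16_bytes = 2 * BLOCK_SIZE                              # 16 bits per weight
q4_bytes = len(packed) + len(struct.pack("<e", scale))   # nibbles + 16-bit scale

print(f"max rounding error: {max_error:.6f}")
print(f"bytes per block: {fp16_bytes} (16-bit) vs {q4_bytes} (4-bit + scale)")
```

Under these assumptions a block of 32 weights shrinks from 64 bytes to 18, and no weight moves by more than half a quantization step. The same arithmetic explains the headline: 7 billion weights at 16 bits each take about 14 GB, while 4 bits each take about 3.5 GB before the per-block scales are added.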

For the first time, a language model comparable to GPT-3 in structure could run on anyone's laptop, with no internet, no subscription, no server.