Meta had just released LLaMA, a family of powerful language models designed to run on GPU clusters. About two weeks later, Georgi Gerganov published llama.cpp: a C/C++ reimplementation of the model's inference code that, paired with compressed weights, runs on a regular MacBook's CPU.
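The technical breakthrough is 4-bit quantization: instead of storing each model weight as a 16-bit floating-point number, the weight is approximated with just 4 bits, plus a small shared scale factor per group of weights. Quality drops slightly, but the model becomes roughly four times smaller and much faster on common hardware, where inference speed is limited mainly by how quickly weights can be read from memory.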
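To make the idea concrete, here is a minimal sketch of block-wise 4-bit quantization in Python. It is an illustration under stated assumptions, not llama.cpp's actual code or file format: the block size of 32, the 2-byte scale, and the function names `quantize_block`, `dequantize_block`, and `packed_size_bytes` are all hypothetical choices made for this example.

```python
BLOCK_SIZE = 32  # assumed number of weights sharing one scale (illustrative)


def quantize_block(weights):
    """Quantize one block of floats to signed 4-bit integers plus one scale."""
    max_abs = max(abs(w) for w in weights)
    # Map the largest magnitude to 7, the largest positive 4-bit signed value.
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    quants = [max(-8, min(7, round(w / scale))) for w in weights]
    # Pack two 4-bit values per byte; add 8 so each is stored as 0..15.
    packed = bytearray()
    for low, high in zip(quants[0::2], quants[1::2]):
        packed.append((low + 8) | ((high + 8) << 4))
    return scale, bytes(packed)


def dequantize_block(scale, packed):
    """Recover approximate floats from the packed 4-bit values."""
    weights = []
    for byte in packed:
        weights.append(((byte & 0x0F) - 8) * scale)
        weights.append(((byte >> 4) - 8) * scale)
    return weights


def packed_size_bytes(n_weights, scale_bytes=2):
    """Storage needed: 4 bits per weight plus one scale per block."""
    n_blocks = n_weights // BLOCK_SIZE
    return n_blocks * (BLOCK_SIZE // 2 + scale_bytes)


if __name__ == "__main__":
    block = [i / 10 - 1.6 for i in range(BLOCK_SIZE)]
    scale, packed = quantize_block(block)
    restored = dequantize_block(scale, packed)
    worst = max(abs(a - b) for a, b in zip(block, restored))
    print(f"bytes: {len(packed)}, worst error: {worst:.4f}")
```

Under these assumptions, a block of 32 weights occupies 18 bytes (16 bytes of packed values plus a 2-byte scale) instead of 64 bytes at 16 bits per weight, which is about 3.6 times smaller. The per-block scale is why the saving is close to, but not exactly, a factor of four.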
For the first time, a language model comparable to GPT-3 in structure could run on anyone's laptop, with no internet, no subscription, no server.
Companies
Georgi Gerganov (independent), Meta AI
Tools
llama.cpp, LLaMA
Tags
LLaMA, llama.cpp, C++, Quantization, Georgi Gerganov
Sources