AImpact
Medium AI Infrastructure · 1 min read

ExLlamaV2: high-speed quantized LLM inference on consumer GPUs

In one sentence: ExLlamaV2 introduces the EXL2 format with per-layer mixed bit-rates (2 to 8 bits per weight), delivering higher throughput than llama.cpp on NVIDIA GPUs and enabling 70B models to run on a single 24 GB RTX 3090.

Verified Official source

Large language models are heavy: the weights of a 70B-parameter LLaMA model at 16-bit precision alone take about 140 GB (roughly 130 GiB) of VRAM, far beyond any consumer GPU. Quantization shrinks the model by storing weights at lower numerical precision, but done poorly it degrades response quality.
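The size figures here follow from simple arithmetic: parameter count times bits per weight. The snippet below is a minimal sketch of that calculation, assuming weights only (no KV cache, activations, or framework overhead) and decimal gigabytes. The function name and the 2.5 bits-per-weight average are illustrative assumptions, not values taken from ExLlamaV2.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal GB (weights only, no runtime overhead)."""
    total_bytes = n_params * bits_per_weight / 8  # 8 bits per byte
    return total_bytes / 1e9


PARAMS_70B = 70e9  # nominal parameter count of a "70B" model

fp16_gb = weight_memory_gb(PARAMS_70B, 16)     # 16-bit weights
low_bit_gb = weight_memory_gb(PARAMS_70B, 2.5)  # assumed average of 2.5 bits per weight

print(f"16-bit weights:  {fp16_gb:.1f} GB")
print(f"2.5 bpw weights: {low_bit_gb:.1f} GB")
```

At 16 bits per weight the 70B model needs about 140 GB for weights alone. At an assumed average of 2.5 bits per weight it needs about 22 GB, which is how such a model can fit within the 24 GB of an RTX 3090, leaving only a small margin for the cache and activations.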

ExLlamaV2 addresses this by not quantizing all layers the same way. The most "sensitive" layers, where quantization error hurts output quality most, retain more bits, while the less critical ones are compressed harder. The result is a smaller model that preserves quality better than uniform quantization at the same size.
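The idea of giving sensitive layers more bits can be illustrated with a toy greedy allocator. This is a sketch under stated assumptions, not ExLlamaV2's actual algorithm: the layer names, parameter counts, sensitivity scores, and the error model (error proportional to sensitivity times 2^-bits) are all hypothetical.

```python
def allocate_bits(layers, target_bpw, choices=(2, 3, 4, 5, 6, 8)):
    """Toy mixed-precision allocator.

    layers: list of (name, n_params, sensitivity) tuples (hypothetical values).
    target_bpw: maximum average bits per weight across all layers.
    Returns (bits_per_layer, achieved_average_bpw).
    """
    total_params = sum(n for _, n, _ in layers)
    budget = target_bpw * total_params           # total bits we may spend
    level = {name: 0 for name, _, _ in layers}   # index into `choices` per layer
    used = choices[0] * total_params             # every layer starts at the lowest width
    if used > budget:
        raise ValueError("target_bpw is below the smallest available bit width")

    while True:
        best = None
        for name, n, sens in layers:
            i = level[name]
            if i + 1 >= len(choices):
                continue                         # already at the widest option
            step = choices[i + 1] - choices[i]
            extra = step * n                     # bits this upgrade would cost
            if used + extra > budget:
                continue                         # upgrade does not fit the budget
            # Assumed error model: error ~ sensitivity * 2^(-bits).
            # Score = error reduction per extra bit per weight.
            gain = sens * (2.0 ** -choices[i] - 2.0 ** -choices[i + 1]) / step
            if best is None or gain > best[0]:
                best = (gain, name, extra)
        if best is None:
            break                                # no affordable upgrade left
        _, name, extra = best
        level[name] += 1
        used += extra

    bits = {name: choices[idx] for name, idx in level.items()}
    return bits, used / total_params


# Hypothetical layers: (name, parameter count, sensitivity score)
layers = [("attn", 100, 10.0), ("mlp_up", 300, 1.0), ("mlp_down", 300, 0.5)]
bits, avg_bpw = allocate_bits(layers, target_bpw=4.0)
print(bits, round(avg_bpw, 3))
```

With these made-up inputs the most sensitive layer ends up with the widest format while the overall average stays under the target, which is the trade-off that uniform quantization cannot make.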

On NVIDIA GPUs, ExLlamaV2 is significantly faster than llama.cpp (which began as a CPU-focused project, although it also supports GPU backends), making it the preferred choice for anyone with a capable GPU who wants maximum text generation speed.

Companies

turboderp (community)

Tools

ExLlamaV2, EXL2

Tags

ExLlamaV2, EXL2, Quantization, GPU, LLM, Consumer Hardware

Sources