LLM Compressor: unified toolkit for quantization and sparsity with native vLLM integration
In one sentence: Neural Magic releases LLM Compressor, an open-source library that unifies GPTQ, AWQ, SmoothQuant, and SparseGPT in a single toolkit with native vLLM integration, simplifying the deployment of compressed models.
Compressing a large language model so that it runs faster and uses less memory is one thing; having reliable tools to do it in production is another. Until 2024, anyone who wanted to quantize or prune a model had to use a different library for each technique, each with its own calibration process and an output format incompatible with the others.
LLM Compressor, developed by Neural Magic and later moved under the vLLM project, brings these techniques together in a single Python library. GPTQ, AWQ, SmoothQuant, and SparseGPT all share the same API and the same calibration process, and they produce output that vLLM can load directly, without manual conversion.
The goal is to make compression accessible to MLOps engineers who do not have a research background in quantization: one line of code to calibrate, one to quantize, one flag in vLLM to serve.
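The workflow described above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the checkpoint name, the output directory, the calibration dataset, and the numeric settings are all assumptions chosen for the example, and the import paths follow recent llmcompressor releases (older releases exposed `oneshot` from a different module). The imports are placed inside the functions so the file can be loaded without the heavy dependencies installed.

```python
"""Sketch: compress a model with LLM Compressor, then load it in vLLM."""

# Example values only; substitute your own checkpoint and output path.
MODEL_ID = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # assumed example checkpoint
OUTPUT_DIR = "TinyLlama-1.1B-Chat-v1.0-W8A8"     # hypothetical output directory


def compress(model_id: str = MODEL_ID, output_dir: str = OUTPUT_DIR) -> str:
    """Calibrate and quantize in one pass, writing the result to output_dir."""
    # Local imports: llmcompressor is only needed when this function runs.
    from llmcompressor import oneshot
    from llmcompressor.modifiers.quantization import GPTQModifier
    from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

    # A recipe is an ordered list of modifiers; here SmoothQuant is applied
    # first, then GPTQ quantizes Linear layers to 8-bit weights/activations.
    recipe = [
        SmoothQuantModifier(smoothing_strength=0.8),
        GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
    ]

    # The same call handles calibration data loading and compression.
    oneshot(
        model=model_id,
        dataset="open_platypus",        # assumed calibration dataset
        recipe=recipe,
        output_dir=output_dir,
        max_seq_length=2048,            # assumed calibration settings
        num_calibration_samples=512,
    )
    return output_dir


def serve(model_dir: str = OUTPUT_DIR, prompt: str = "My name is") -> str:
    """Load the compressed checkpoint in vLLM and generate one completion."""
    from vllm import LLM

    llm = LLM(model_dir)
    outputs = llm.generate(prompt)
    return outputs[0].outputs[0].text
```

Calling `serve(compress())` would run the whole pipeline end to end, which requires a GPU and downloads the model and calibration data. In this sketch no extra option is passed to vLLM; it is assumed to pick up the quantization settings saved alongside the checkpoint. Swapping techniques would, under the same assumption of a shared API, amount to changing the entries in `recipe`.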
Companies
Neural Magic, vLLM Project
Tools
LLM Compressor, vLLM, GPTQ, AWQ, SmoothQuant, SparseGPT, PyTorch