torchao: PyTorch-Native Quantization and Sparsity Without Custom CUDA

In one sentence: Meta has released torchao, a PyTorch-native library for INT4, INT8, and FP8 quantization and sparsity that reports a 2x speedup on Llama-3 8B at INT4 without requiring custom CUDA kernels and is emerging as the standard quantization layer for the PyTorch ecosystem.

Quantizing a model means compressing it by storing its weights at lower precision: instead of 16 bits per weight, it uses 8, 4, or even fewer. This makes models faster and less memory-hungry. The problem is that doing it well has traditionally required highly specialized CUDA kernels, written in a low-level, C-like language that few people know how to write.
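To make the idea concrete, here is a minimal sketch of symmetric 8-bit quantization in plain Python. It is an illustration of the general technique only, not torchao's implementation; the function names and sample weights are made up for this example.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One shared scale for the whole tensor; fall back to 1.0 for all-zero input.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]


# Hypothetical weights, chosen only for illustration.
weights = [0.12, -0.53, 0.98, -1.27, 0.0]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# The rounding error per weight is at most half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(quantized, scale, max_error)
```

Each weight is stored as a small integer plus one shared scale, which is where the memory saving comes from. Production schemes usually use one scale per channel or per small group of weights rather than one per tensor, which keeps the error lower.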

torchao changes this. It is a library from Meta that makes quantization a first-class feature of PyTorch, written primarily with PyTorch's own high-level abstractions (plus Triton for performance-critical kernels). You add two or three lines of code to your model, choose the quantization format you want, and the model is automatically optimized.
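The "two or three lines" might look roughly like the sketch below. It assumes torchao's `quantize_` entry point and an INT4 weight-only configuration; the configuration name has changed between releases (`Int4WeightOnlyConfig` in newer versions, `int4_weight_only` in older ones), so check it against the version you have installed. The wrapper function name and the `group_size` value are illustrative choices, not recommendations from the source.

```python
def quantize_model_int4(model, group_size=128):
    """Quantize the linear layers of a PyTorch model to INT4 weights, in place.

    `model` is any torch.nn.Module. The INT4 weight-only path typically
    expects a CUDA device and bfloat16 weights (assumption; check your version).
    """
    # Imported lazily so this file loads on machines without torchao installed.
    from torchao.quantization import quantize_

    try:
        # Newer torchao releases expose config classes.
        from torchao.quantization import Int4WeightOnlyConfig as make_config
    except ImportError:
        # Older releases expose a factory function instead.
        from torchao.quantization import int4_weight_only as make_config

    # quantize_ swaps the weights of supported layers for quantized tensors.
    quantize_(model, make_config(group_size=group_size))
    return model
```

In use, you would load the model as usual, move it to the GPU in bfloat16, call `quantize_model_int4(model)`, and then optionally wrap it in `torch.compile` to get the fused kernels.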

The practical results are significant: on Llama-3 8B, INT4 quantization with torchao runs inference 2x faster than the original FP16 model, with very small quality loss. This means answering twice as many questions per hour on the same hardware, or using half the hardware for the same capacity. Removing the need for custom CUDA dramatically lowers the barrier for anyone who wants to optimize their models.
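The arithmetic behind these claims can be sketched in a few lines. The parameter count is rounded to 8 billion, and the traffic target and per-GPU throughput are hypothetical numbers chosen for illustration; only the 2x speedup comes from the text above.

```python
import math


def weight_memory_gb(num_params, bits_per_weight):
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


def gpus_needed(target_requests_per_s, per_gpu_requests_per_s):
    """Smallest whole number of GPUs that meets the target throughput."""
    return math.ceil(target_requests_per_s / per_gpu_requests_per_s)


NUM_PARAMS = 8e9  # Llama-3 8B, rounded
fp16_gb = weight_memory_gb(NUM_PARAMS, 16)
int4_gb = weight_memory_gb(NUM_PARAMS, 4)

SPEEDUP = 2.0             # INT4 vs FP16 figure reported above
BASELINE_PER_GPU = 10.0   # hypothetical requests/s per GPU at FP16
TARGET = 100.0            # hypothetical requests/s the service must handle

fp16_gpus = gpus_needed(TARGET, BASELINE_PER_GPU)
int4_gpus = gpus_needed(TARGET, BASELINE_PER_GPU * SPEEDUP)

print(f"weights: {fp16_gb:.1f} GB at FP16, {int4_gb:.1f} GB at INT4")
print(f"GPUs for {TARGET:.0f} req/s: {fp16_gpus} at FP16, {int4_gpus} at INT4")
```

The INT4 figure is a lower bound: quantized formats also store scales (and often zero points) per group of weights, so the real footprint is slightly larger.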