S-LoRA and Punica: serving hundreds of LoRA fine-tunings from a single base model
In one sentence: S-LoRA (UC Berkeley) and Punica (University of Washington) enable multi-tenant serving of hundreds of LoRA adapters from a single base model, with zero-copy switching and dedicated CUDA kernels; both are integrated in vLLM and SGLang.
LoRA is one of the most widely used techniques for fine-tuning large models: instead of modifying all of the model's billions of parameters, it freezes them and adds small low-rank "adapter matrices" that modify the model's behavior for a specific task. A company can create dozens of different fine-tuned variants of LLaMA for different uses.
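To make the adapter idea concrete, here is a minimal pure-Python sketch of a single LoRA layer. It assumes the usual formulation, in which a frozen weight W is combined with a low-rank update B·A; the helper names (`matvec`, `lora_forward`, `param_counts`) and the 4096×4096, rank-16 layer size are illustrative choices, not taken from any particular library or model.

```python
def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, scale=1.0):
    """Compute W x + scale * B (A x) without ever forming the product B A."""
    base = matvec(W, x)               # frozen base model path
    delta = matvec(B, matvec(A, x))   # low-rank adapter path
    return [b + scale * d for b, d in zip(base, delta)]

def param_counts(d_out, d_in, r):
    """Return (full weight params, LoRA adapter params) for one layer."""
    return d_out * d_in, r * (d_in + d_out)

# Tiny example: a 2x3 base weight with a rank-1 adapter.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
A = [[1.0, 1.0, 1.0]]   # shape r x d_in
B = [[0.5], [0.0]]      # shape d_out x r
x = [1.0, 2.0, 3.0]

y = lora_forward(W, A, B, x)                  # [4.0, 2.0]
full, adapter = param_counts(4096, 4096, 16)  # hypothetical layer size
print(y, full, adapter, full // adapter)
```

For the hypothetical 4096×4096 layer, the rank-16 adapter holds 131,072 parameters against 16,777,216 for the full weight, a factor of 128 fewer.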
The problem: if you have 200 fine-tuned versions of the same base model and many users query them simultaneously, do you need to keep 200 copies of the model in memory? With S-LoRA and Punica the answer is no. Only one copy of the base model is kept in GPU memory, and the LoRA adapters, which are much smaller, are loaded dynamically for each request.
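The sharing scheme described above can be sketched conceptually as follows. This is not the S-LoRA or Punica implementation (those rely on custom CUDA kernels and GPU memory management); it is a plain-Python illustration, with a hypothetical `MultiLoRALayer` class and made-up adapter names, of one shared base weight serving a batch in which each request selects its own adapter.

```python
def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

class MultiLoRALayer:
    """One shared base weight plus a registry of small per-tenant adapters."""

    def __init__(self, W):
        self.W = W            # single copy of the base weight
        self.adapters = {}    # adapter_id -> (A, B)

    def load_adapter(self, adapter_id, A, B):
        """Register an adapter; only the small A and B matrices are stored."""
        self.adapters[adapter_id] = (A, B)

    def forward_batch(self, xs, adapter_ids):
        """Run a mixed batch: request i uses adapter_ids[i] (None = base only)."""
        outs = []
        for x, aid in zip(xs, adapter_ids):
            y = matvec(self.W, x)                 # shared base computation
            if aid is not None:
                A, B = self.adapters[aid]
                delta = matvec(B, matvec(A, x))   # per-request low-rank update
                y = [a + b for a, b in zip(y, delta)]
            outs.append(y)
        return outs

layer = MultiLoRALayer(W=[[1.0, 0.0], [0.0, 1.0]])
layer.load_adapter("support", A=[[1.0, 0.0]], B=[[1.0], [0.0]])
layer.load_adapter("legal", A=[[0.0, 1.0]], B=[[0.0], [2.0]])

x = [1.0, 2.0]
outs = layer.forward_batch([x, x, x], ["support", "legal", None])
print(outs)
```

The same input yields three different outputs depending on the adapter chosen, while the base weight exists only once. A real serving system would batch the base computation across all requests and handle the per-adapter part in a dedicated kernel rather than a Python loop.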
The result is simultaneous serving of hundreds of customized versions of an LLM with roughly the memory needed for a single model plus the small adapters. Integrated in vLLM and SGLang, this approach has become a standard way to offer customized LLMs as a service.
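A back-of-the-envelope estimate shows the scale of the saving. All figures below are assumptions chosen for illustration: a 7-billion-parameter base model, 16-bit weights, and adapters of 20 million parameters each. The calculation counts weights only, ignoring KV cache and activations.

```python
BYTES_PER_PARAM = 2            # assumption: 16-bit weights
BASE_PARAMS = 7_000_000_000    # assumption: 7B-parameter base model
ADAPTER_PARAMS = 20_000_000    # assumption: hypothetical adapter size
NUM_VARIANTS = 200

base_bytes = BASE_PARAMS * BYTES_PER_PARAM
adapter_bytes = ADAPTER_PARAMS * BYTES_PER_PARAM

# Naive approach: one full model copy per fine-tuned variant.
naive_bytes = NUM_VARIANTS * base_bytes

# Shared approach: one base model plus every adapter resident at once.
shared_bytes = base_bytes + NUM_VARIANTS * adapter_bytes

def to_gb(n_bytes):
    """Convert bytes to decimal gigabytes."""
    return n_bytes / 1_000_000_000

print(f"naive:  {to_gb(naive_bytes):.0f} GB")
print(f"shared: {to_gb(shared_bytes):.0f} GB")
```

Under these assumptions, 200 full copies would need 2,800 GB, while one base model plus all 200 adapters needs 22 GB, even before accounting for adapters being loaded only when requested.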
Companies
UC Berkeley, University of Washington
Tools
vLLM, SGLang, S-LoRA, Punica, PyTorch