WebLLM and LLM in WASM: browser-based LLM inference via WebGPU, no server needed
In one sentence: WebLLM runs LLMs such as Llama 3 8B directly in the browser via WebGPU and WASM, using models compiled with Apache TVM to reach about 15 tokens/s in Chrome on a laptop with a discrete GPU, with no backend server.
Normally, to use an AI model you need a server: your browser sends requests to a remote computer that runs the model and returns responses. WebLLM flips this around: the model runs directly in your browser, on your own computer.
The technology making this possible is WebGPU, a modern API that lets the browser access your computer's graphics card for general-purpose computation, not just 3D graphics. Combined with model compilation via Apache TVM, it's fast enough to be practical.
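A page would typically confirm that WebGPU is present before trying to load a model. The sketch below shows one minimal way to do that, assuming the standard `navigator.gpu.requestAdapter()` entry point. The `detectWebGPU` helper and the `NavigatorLike` and `WebGPUStatus` types are hypothetical names introduced here so the check can also be exercised outside a browser.

```typescript
// Minimal structural type covering only the part of the WebGPU API used here.
// `NavigatorLike` is a hypothetical name for this sketch, not a standard type.
interface NavigatorLike {
  gpu?: { requestAdapter(): Promise<unknown | null> };
}

type WebGPUStatus = "available" | "no-adapter" | "unsupported";

// Hypothetical helper: reports whether GPU compute looks usable from this page.
async function detectWebGPU(nav: NavigatorLike): Promise<WebGPUStatus> {
  // Browsers without WebGPU support do not define `navigator.gpu` at all.
  if (!nav.gpu) return "unsupported";
  // requestAdapter() resolves to null when no suitable GPU adapter is found.
  const adapter = await nav.gpu.requestAdapter();
  return adapter ? "available" : "no-adapter";
}
```

In a browser this could be called as `detectWebGPU(navigator as NavigatorLike)`, and the page could show a fallback message for any status other than `"available"`.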
The most surprising result: Llama 3 8B runs at about 15 tokens per second in Chrome on a laptop with a discrete GPU. All processed text stays on your device, and no data is sent to external servers, which makes this approach a good fit for applications that require complete privacy.
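To check a figure like 15 tokens per second on your own hardware, one option is to time a streamed response. The sketch below assumes the stream yields one text piece per generated token, which is only an approximation. `measureThroughput` and its injectable `now` clock are hypothetical names for this illustration, and any async stream of strings can stand in for a real model.

```typescript
// Hypothetical helper: consumes a stream of generated text pieces and reports
// the average rate. Assumes each piece is roughly one token (an approximation).
async function measureThroughput(
  stream: AsyncIterable<string>,
  now: () => number = () => Date.now(), // clock in milliseconds, injectable for testing
): Promise<{ text: string; pieces: number; tokensPerSecond: number }> {
  const start = now();
  let text = "";
  let pieces = 0;
  for await (const piece of stream) {
    text += piece;
    pieces += 1;
  }
  const elapsedSeconds = (now() - start) / 1000;
  // Guard against division by zero for empty or instantaneous streams.
  const tokensPerSecond = elapsedSeconds > 0 ? pieces / elapsedSeconds : 0;
  return { text, pieces, tokensPerSecond };
}
```

Because the timer starts before the first piece arrives, the result includes prompt-processing time and will read slightly lower than the steady-state generation rate. As a sense of scale, at 15 tokens per second a 300-token reply takes about 20 seconds.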
Companies
MLC AI, Apache TVM
Tools
WebLLM, Apache TVM, WebGPU, WASM
Tags
Sources