AImpact
Medium AI Infrastructure · 1 min read

WebLLM and LLM in WASM: browser-based LLM inference via WebGPU, no server needed

In one sentence: WebLLM runs LLMs such as Llama 3 8B directly in the browser via WebGPU and WASM, using models compiled with Apache TVM, and reaches about 15 tokens/s in Chrome with no backend server.


Normally, using an AI model requires a server: your browser sends requests to a remote computer that runs the model and returns responses. WebLLM flips this around: the model runs directly in your browser, on your own computer.
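To make the contrast concrete, here is a minimal sketch of what application code might look like when the model lives in the page. It assumes the engine exposes an OpenAI-style `chat.completions.create` method; the `LocalChatEngine` interface and the `askLocally` helper are hypothetical names introduced for illustration, not WebLLM's own type definitions.

```typescript
// Hypothetical structural types covering only the small slice of an
// OpenAI-style chat API that this sketch uses.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface LocalChatEngine {
  chat: {
    completions: {
      create(request: { messages: ChatMessage[] }): Promise<{
        choices: { message: { content: string | null } }[];
      }>;
    };
  };
}

// Sends one prompt to an engine that runs inside the page.
// There is no fetch() call and no API key: nothing is sent over the network.
async function askLocally(engine: LocalChatEngine, prompt: string): Promise<string> {
  const reply = await engine.chat.completions.create({
    messages: [{ role: "user", content: prompt }],
  });
  // Fall back to an empty string if the engine returns no text.
  return reply.choices[0]?.message.content ?? "";
}
```

In a real page the engine object would come from the WebLLM package (its documentation describes a `CreateMLCEngine` factory at the time of writing). The exact model identifiers and loading options should be checked against the current documentation rather than taken from this sketch.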

The technology making this possible is WebGPU, a modern API that lets the browser access your computer's graphics card for general-purpose computation, not just 3D graphics. Combined with models compiled via Apache TVM, this makes in-browser inference fast enough to be practical.
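Because everything depends on WebGPU being present, a page would typically check for it before downloading a model. The sketch below shows one way to do that, relying on the standard `navigator.gpu.requestAdapter()` call, which resolves to `null` when no suitable GPU adapter is found. The `NavigatorLike` and `GpuLike` types and the `checkWebGpu` name are assumptions made so the example compiles without WebGPU typings.

```typescript
// Minimal structural types so the sketch compiles without WebGPU typings.
interface GpuLike {
  requestAdapter(): Promise<object | null>;
}

interface NavigatorLike {
  gpu?: GpuLike;
}

type WebGpuStatus = "available" | "api-missing" | "no-adapter";

// Reports whether WebGPU can be used in the given environment.
async function checkWebGpu(nav: NavigatorLike): Promise<WebGpuStatus> {
  // Browsers without WebGPU support do not define navigator.gpu at all.
  if (!nav.gpu) return "api-missing";
  // The API can exist while no usable GPU adapter is exposed.
  const adapter = await nav.gpu.requestAdapter();
  return adapter === null ? "no-adapter" : "available";
}
```

In a browser this would be called with the global `navigator` object. A result other than `"available"` is the point at which an application might fall back to a server-hosted model or show an explanatory message.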

The most surprising result: Llama 3 8B runs at about 15 tokens per second in Chrome on a laptop with a discrete GPU. All processed text stays on your device, and no data is sent to external servers, which makes this approach a good fit for applications that require complete privacy.
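To put the 15 tokens per second figure in perspective, the following sketch converts it into waiting time for replies of different lengths. The reply lengths are hypothetical values chosen for illustration, and the calculation covers token generation only, not model download or prompt processing.

```typescript
// Time needed to generate a reply of a given length at a fixed rate.
function secondsToGenerate(tokenCount: number, tokensPerSecond: number): number {
  if (tokensPerSecond <= 0) {
    throw new RangeError("tokensPerSecond must be positive");
  }
  return tokenCount / tokensPerSecond;
}

// Throughput quoted in the article for Llama 3 8B in Chrome.
const reportedRate = 15;

// Hypothetical reply lengths: a short answer, a few paragraphs, a long essay.
for (const tokens of [50, 300, 1000]) {
  const seconds = secondsToGenerate(tokens, reportedRate);
  console.log(`${tokens} tokens: ~${seconds.toFixed(1)} s`);
}
```

At this rate a 300-token reply takes about 20 seconds, which is usable for interactive chat when the output is streamed as it is generated, but slow for long documents.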

Companies

MLC AI, Apache TVM

Tools

WebLLM, Apache TVM, WebGPU, WASM

Tags

WebLLM, WebAssembly, WebGPU, Browser, Edge AI, Privacy, Apache TVM

Sources