
Cerebras hits 2,500+ tok/s on Llama: inference record of the year

In one sentence: Cerebras Systems publishes inference numbers that beat Nvidia GPUs by an order of magnitude, with 2,500+ tok/s on Llama 4 Maverick and Scout thanks to the wafer-scale WSE-3. Custom ASICs are back in the race.

Cerebras is a chipmaker and Nvidia alternative built around a "wafer-scale" design: a single chip the size of a dinner plate, with its memory integrated on the chip as SRAM. Its claim to fame is not training but ultra-fast inference.

In June 2025 the company published impressive numbers: on Llama 4 Maverick it exceeds 2,500 tokens per second per user. For comparison, an Nvidia H100 runs the same model at roughly 100-200 tok/s. At that speed, responses feel "instant" even for long outputs.
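To make the gap concrete, here is a minimal back-of-the-envelope sketch using only the rates quoted above. The 1,000-token output length and the `generation_seconds` helper are illustrative assumptions, and the estimate ignores time to first token, batching, and network overhead.

```python
def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to stream num_tokens at a steady decode rate (ignores time to first token)."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


# Rates quoted in the article; the output length is a hypothetical example.
RATES = {
    "Cerebras WSE-3 (2,500 tok/s)": 2500.0,
    "Nvidia H100, high end (200 tok/s)": 200.0,
    "Nvidia H100, low end (100 tok/s)": 100.0,
}
OUTPUT_TOKENS = 1000  # assumed long answer, not a figure from the article

for label, rate in RATES.items():
    seconds = generation_seconds(OUTPUT_TOKENS, rate)
    speedup = RATES["Cerebras WSE-3 (2,500 tok/s)"] / rate
    print(f"{label}: {seconds:.1f} s for {OUTPUT_TOKENS} tokens ({speedup:.1f}x gap)")
```

Under these assumptions, a 1,000-token answer streams in about 0.4 s at 2,500 tok/s, versus 5 to 10 s at 100-200 tok/s, a ratio of 12.5x to 25x.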

Together with Groq (which uses a custom LPU), Cerebras shows that, under certain conditions, dedicated ASICs can beat Nvidia GPUs by an order of magnitude in inference latency and throughput. This recalibrates expectations around the "Nvidia monopoly".

Companies

Cerebras Systems

Tools

Cerebras Inference, WSE-3

Tags

Cerebras, Inference, Wafer Scale, Llama 4, Token Throughput