Cerebras hits 2,500+ tok/s on Llama: inference record of the year

In one sentence: Cerebras Systems has published inference numbers that beat Nvidia GPUs by an order of magnitude, reaching 2,500+ tok/s on Llama 4 Maverick and Scout thanks to the wafer-scale WSE-3. Custom ASICs are back in the race.

Cerebras is a chipmaker positioned as an alternative to Nvidia, built around a "wafer-scale" design: a single chip the size of a dinner plate with memory integrated directly on the silicon (on-chip SRAM rather than external HBM). Its claim to fame isn't training but ultra-fast inference.

In June 2025 the company published impressive numbers: on Llama 4 Maverick it passes 2,500 tokens per second per user. For comparison, an Nvidia H100 runs the same model at 100-200 tok/s. In practice, responses feel "instant" even for long outputs.
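
To make the gap concrete, here is a back-of-the-envelope sketch in Python that turns the decode rates quoted above into wall-clock time. The 2,000-token answer length and the helper name `generation_time_s` are illustrative assumptions, not figures from Cerebras, and the calculation ignores time to first token, network latency and batching effects.

```python
def generation_time_s(output_tokens: int, tokens_per_second: float) -> float:
    """Seconds needed to stream `output_tokens` at a steady decode rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second


# Hypothetical long answer; the length is an assumption for illustration.
OUTPUT_TOKENS = 2_000

# Decode rates as quoted in the article (tokens per second per user).
RATES = {
    "Cerebras, 2,500 tok/s": 2_500,
    "H100, 200 tok/s (upper end)": 200,
    "H100, 100 tok/s (lower end)": 100,
}

for label, rate in RATES.items():
    seconds = generation_time_s(OUTPUT_TOKENS, rate)
    print(f"{label}: {seconds:.1f} s")
```

Under these assumptions the same answer takes under a second at 2,500 tok/s versus 10 to 20 seconds at 100-200 tok/s, a ratio of 12.5x to 25x, which is where the "order of magnitude" framing comes from.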

Together with Groq (which uses its own custom LPU), Cerebras shows that, under certain conditions, dedicated ASICs beat Nvidia GPUs by an order of magnitude in inference latency and throughput. That recalibrates expectations around the "Nvidia monopoly".