Cerebras Inference: record-breaking LLM inference throughput on the wafer-scale WSE-3
In one sentence: Cerebras launches an LLM inference service on the wafer-scale WSE-3, claiming ~1800 tokens/s on Llama 3.1 8B and ~450 tokens/s on Llama 3.1 70B, which it says is 10-20× faster than H100 GPUs.
Cerebras builds an unusual chip: instead of cutting silicon wafers into many small processors, it uses one whole wafer as a single giant chip. It's called WSE-3 and it's the size of a dinner plate.
In late August 2024, Cerebras launches an API service that runs open models such as Llama 3.1 at very high speeds: the 70B model "talks" at about 450 tokens per second, which the company puts at roughly ten times faster than the same model served on NVIDIA H100s.
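To make the "ten times faster" claim concrete, here is a small back-of-the-envelope sketch of how long a user waits for an answer at each speed. The 450 tokens/s figure comes from the text; the 45 tokens/s baseline is simply that figure divided by ten, and the 900-token answer length is a hypothetical value chosen for illustration. Time to first token is ignored.

```python
def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to stream num_tokens at a steady decode rate (ignores time to first token)."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


CEREBRAS_70B_TPS = 450.0               # claimed figure from the text
BASELINE_TPS = CEREBRAS_70B_TPS / 10   # assumption: "ten times faster" implies ~45 tokens/s

answer_tokens = 900  # hypothetical answer length, roughly a long chat reply

fast = generation_seconds(answer_tokens, CEREBRAS_70B_TPS)
slow = generation_seconds(answer_tokens, BASELINE_TPS)
print(f"{fast:.1f} s vs {slow:.1f} s")
```

Under these assumptions the same answer arrives in about 2 seconds instead of about 20, which is the difference between a reply that feels instant and one you watch scroll by.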
The secret: the model's weights sit in memory on the chip itself (SRAM, ultra-fast) instead of in external memory (HBM, slower). Generating each token means reading the weights, so memory bandwidth sets the speed limit. Result: at about 450 tokens per second, Llama 70B writes text far faster than a person can read it.
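Why memory bandwidth matters can be sketched with a rough ceiling: if every generated token requires reading all the weights once, a single request cannot decode faster than bandwidth divided by model size. The sketch below assumes 16-bit weights and uses vendor-published bandwidth figures as assumptions (about 3.35 TB/s of HBM for one H100 SXM, about 21 PB/s of on-chip SRAM for WSE-3); neither number appears in the text above. It ignores batching, KV-cache traffic, compute limits, and the fact that a 70B model at 16 bits (about 140 GB) has to be split across several devices in either case.

```python
def decode_ceiling_tokens_per_s(
    bandwidth_bytes_per_s: float,
    n_params: float,
    bytes_per_param: float = 2.0,  # assumption: 16-bit weights
) -> float:
    """Upper bound on single-request decode speed when each token reads all weights once."""
    weight_bytes = n_params * bytes_per_param
    return bandwidth_bytes_per_s / weight_bytes


LLAMA_70B_PARAMS = 70e9

# Assumed vendor-published memory bandwidths, in bytes per second.
H100_HBM_BANDWIDTH = 3.35e12   # one H100 SXM, external HBM
WSE3_SRAM_BANDWIDTH = 21e15    # one WSE-3, on-chip SRAM

h100_ceiling = decode_ceiling_tokens_per_s(H100_HBM_BANDWIDTH, LLAMA_70B_PARAMS)
wse3_ceiling = decode_ceiling_tokens_per_s(WSE3_SRAM_BANDWIDTH, LLAMA_70B_PARAMS)

print(f"H100 HBM ceiling:  {h100_ceiling:,.0f} tokens/s")
print(f"WSE-3 SRAM ceiling: {wse3_ceiling:,.0f} tokens/s")
```

The HBM ceiling comes out in the low tens of tokens per second per GPU, while the SRAM ceiling sits orders of magnitude above the claimed 450 tokens/s. That gap suggests on-chip memory removes the bandwidth bottleneck and that other factors, such as moving data between wafers, set the real-world figure. This is an illustrative estimate, not Cerebras's own analysis.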
Companies
Cerebras Systems
Tools
Cerebras Inference, WSE-3
Tags
Sources