Groq LPU: 500-tokens-per-second inference goes viral

In one sentence: Groq's public demo of Llama 2 70B generates roughly 500 tokens per second, about an order of magnitude faster than typical GPU-based serving. LLM latency stops being a given.

Every AI chatbot shares the same annoyance: you wait while the text streams across the screen, word by word. Even GPT-4 and Claude do it.

Groq, a hardware startup founded by ex-Google engineer Jonathan Ross (one of the creators of the first TPU), built a different kind of chip called the LPU (Language Processing Unit). On Llama 2 70B, its public demo answers at about 500 tokens per second: the whole reply appears almost instantly, faster than you can read it.

It is not just a demo trick: it changes what you can build. AI agents making 10 chained calls? Suddenly usable. Real-time voice? Possible. Inference speed, until now a bottleneck, becomes a tunable parameter.
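
To make the chained-calls point concrete, here is a rough back-of-the-envelope sketch in Python. It only counts generation time and ignores prompt processing, network round trips, and tool execution. The 300-token reply length and the 40 tokens/sec comparison rate are illustrative assumptions, not figures from Groq's demo.

```python
def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Time to generate a reply, ignoring prompt processing and network overhead."""
    return output_tokens / tokens_per_second


def chain_seconds(calls: int, tokens_per_call: int, tokens_per_second: float) -> float:
    """Total generation time for sequential calls, each waiting on the previous one."""
    return calls * generation_seconds(tokens_per_call, tokens_per_second)


# Assumed, illustrative values (not from the demo): 300 tokens per reply,
# and a hypothetical 40 tokens/sec baseline for comparison.
TOKENS_PER_CALL = 300
CALLS = 10

fast = chain_seconds(CALLS, TOKENS_PER_CALL, 500.0)
slow = chain_seconds(CALLS, TOKENS_PER_CALL, 40.0)

print(f"500 tok/s: {fast:.1f} s for {CALLS} chained calls")
print(f" 40 tok/s: {slow:.1f} s for {CALLS} chained calls")
```

Under these assumptions, the 10-call chain takes about 6 seconds of generation at 500 tokens/sec versus about 75 seconds at the assumed baseline, which is the difference between an agent a user will wait for and one they will not.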