Infrastructure Beginner Also known as: Calcolo in inferenza · Test-time compute

Inference compute

The amount of compute the model uses at response time, not during training. More inference compute often means better but slower and pricier answers.

ShareLinkedIn X

In practice

Reasoning models shift resources from training to inference. For anyone deploying a service this is the most visible cost line: every call burns GPU. Ways to cut it: caching, smaller models, quantization, batching.

Related terms

Reasoning model Quantization MoE

Seen in the wild

2 entries mentioning it

February 1, 2025

s1: 1000 examples and a prompt trick to replicate a reasoning model

High
September 12, 2024

o1: the first model that 'thinks before answering'

Landmark

← All terms