Inference Intermediate Also known as: Holistic Evaluation of Language Models

HELM

/helm/

A holistic evaluation framework developed by Stanford CRFM (Center for Research on Foundation Models) that evaluates language models across dozens of scenarios and reports several metrics for each, including accuracy, robustness, bias, calibration, and efficiency.


In practice

Instead of a single metric, HELM produces a full scorecard, which makes it useful for comparing models across several dimensions at once rather than by one academic leaderboard number. Stanford CRFM also maintains a public site with results for many widely used models.
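To make the scorecard idea concrete, here is a minimal Python sketch. It does not use the HELM codebase: the model names, metric values, and the helper functions `best_per_metric` and `mean_win_rate` are hypothetical placeholders, and the win-rate aggregation is an illustrative choice rather than a description of how HELM ranks models.

```python
from statistics import mean

# Hypothetical scores in [0, 1], higher is better. These are NOT real HELM results.
SCORECARD = {
    "model_a": {"accuracy": 0.80, "robustness": 0.60, "calibration": 0.70},
    "model_b": {"accuracy": 0.75, "robustness": 0.72, "calibration": 0.65},
    "model_c": {"accuracy": 0.70, "robustness": 0.65, "calibration": 0.90},
}


def best_per_metric(scorecard):
    """Return the top-scoring model for each metric."""
    metrics = list(next(iter(scorecard.values())))
    return {
        metric: max(scorecard, key=lambda name: scorecard[name][metric])
        for metric in metrics
    }


def mean_win_rate(scorecard):
    """For each model, the fraction of other models it beats, averaged over metrics."""
    models = list(scorecard)
    metrics = list(next(iter(scorecard.values())))
    rates = {}
    for model in models:
        others = [other for other in models if other != model]
        per_metric = [
            sum(scorecard[model][metric] > scorecard[other][metric] for other in others)
            / len(others)
            for metric in metrics
        ]
        rates[model] = mean(per_metric)
    return rates


if __name__ == "__main__":
    print(best_per_metric(SCORECARD))
    print(mean_win_rate(SCORECARD))
```

With these made-up numbers, each model leads on a different metric, yet all three get the same aggregate win rate. That is the case a scorecard is meant to expose: a single summary number would call the models equivalent, while the per-metric view shows they trade off accuracy, robustness, and calibration differently.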

Related terms

MMLU, Foundation model

