Intermediate Abstraction and Reasoning Corpus 2
A benchmark of visual grid puzzles created by François Chollet to measure abstract reasoning on never-seen patterns, unsolvable by memorization.
In practice It is designed to be easy for humans (over 80% accuracy) but hard for LLMs. In 2024 OpenAI's o3 posted record results, reopening the debate on what AGI really means. A one-million-dollar prize is attached.
Intermediate Beam search
A decoding algorithm that keeps the N most likely sequences in parallel and finally picks the one with the best overall score.
In practice It produces "safer" results than greedy, but tends to be repetitive and unnatural on long text. It used to be standard for machine translation; in modern conversational LLMs it is mostly replaced by top-p sampling. It still helps on structured tasks like translation and summarization.
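A minimal sketch of the idea, assuming a hypothetical toy "model" (`TOY_MODEL`) whose next-token probabilities depend only on the last token. With a beam of 1 the search degenerates to greedy; with a beam of 2 it finds a sequence with a higher overall score.

```python
import math

# Hypothetical toy model: next-token probabilities given the last token.
TOY_MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.9, "dog": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}

def beam_search(beam_width, max_steps=5):
    # Each beam is (sequence, cumulative log-probability).
    beams = [(["<s>"], 0.0)]
    for _ in range(max_steps):
        candidates = []
        for seq, score in beams:
            if seq[-1] == "</s>":
                candidates.append((seq, score))  # finished: carry over unchanged
                continue
            for token, prob in TOY_MODEL[seq[-1]].items():
                candidates.append((seq + [token], score + math.log(prob)))
        # Keep only the N best sequences seen so far.
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
    return beams[0][0]

greedy = beam_search(beam_width=1)  # commits to "the" (0.6), total probability 0.30
best = beam_search(beam_width=2)    # keeps "a" alive, total probability 0.36
```

Greedy locks onto the locally best first token, while the wider beam recovers the globally better sequence.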
Beginner CoT · Chain of thought 7
A technique where the model is asked to spell out the intermediate reasoning steps before giving the final answer, improving accuracy on complex tasks.
In practice Adding 'think step by step' to the prompt really works on math, logic, and analysis. Reasoning models (o1, Claude with thinking) do it automatically. It costs more tokens, so use it only where you need it.
Beginner Context window · Context length 1
The maximum number of tokens the model can read and hold in memory in a single call, counting both prompt and response.
In practice If you have a 200-page contract and a 200k-token window, the whole thing often fits. Otherwise you have to chunk the text or use RAG. More context means higher cost and higher response latency.
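When the text does not fit, chunking can be sketched as follows. The function name `chunk_tokens` and the `overlap` parameter are illustrative assumptions, and tokens are represented as a plain list.

```python
def chunk_tokens(tokens, max_tokens, overlap=0):
    """Split a token list into windows of at most max_tokens sharing `overlap` tokens."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")
    if not tokens:
        return []
    step = max_tokens - overlap
    # Stop early enough that the last chunk is not just a repeat of the overlap.
    last_start = max(len(tokens) - overlap, 1)
    return [tokens[i:i + max_tokens] for i in range(0, last_start, step)]

chunks = chunk_tokens(list(range(10)), max_tokens=4, overlap=1)
# chunks == [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

A small overlap keeps sentences that straddle a boundary visible in both chunks, at the price of a few repeated tokens.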
Advanced Continuous batching · In-flight batching 2
A serving strategy where new requests join the running batch at every generation step, instead of waiting for previous ones to finish.
In practice It sharply raises the throughput of a GPU serving APIs, because cores are never left idle. It is implemented in vLLM, TensorRT-LLM, and TGI. For anyone pricing per token, it is one of the key ingredients to stay competitive on cost.
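The throughput gain can be illustrated with a toy simulation. This is a sketch under simplifying assumptions: each request needs a fixed number of generation steps, the GPU runs `batch_size` requests per step, and both function names are hypothetical.

```python
def static_batching_steps(lengths, batch_size):
    # Each batch runs until its longest request finishes.
    return sum(max(lengths[i:i + batch_size])
               for i in range(0, len(lengths), batch_size))

def continuous_batching_steps(lengths, batch_size):
    queue, running, steps = list(lengths), [], 0
    while queue or running:
        # Fill any free slot before every generation step.
        while queue and len(running) < batch_size:
            running.append(queue.pop(0))
        steps += 1
        # Requests with one step left finish now and free their slot.
        running = [r - 1 for r in running if r > 1]
    return steps

lengths = [8, 2, 2, 2]  # generation steps needed by each request
static_steps = static_batching_steps(lengths, batch_size=2)          # 10
continuous_steps = continuous_batching_steps(lengths, batch_size=2)  # 8
```

In the static case the short requests wait behind the long one; in the continuous case they slip into the slot as soon as it frees up.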
Beginner Few-shot learning 3
A prompting technique where the model is shown a few examples of desired input and output, so it learns the format on the fly without training.
In practice Useful to enforce a schema, a tone, or a precise classification. Usually 3-5 examples are enough. It is almost always the first thing to try before reaching for fine-tuning: it only costs a few extra tokens in the prompt.
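A few-shot prompt is just text assembled from examples. The sketch below assumes an `Input:` / `Output:` layout, which is one common convention and not a requirement of any API.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble instruction, worked examples, and the new input into one prompt."""
    lines = [instruction, ""]
    for text, label in examples:
        lines += [f"Input: {text}", f"Output: {label}", ""]
    # The prompt ends on an open "Output:" so the model completes the pattern.
    lines += [f"Input: {query}", "Output:"]
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    "Classify the sentiment as positive or negative.",
    [("Great product", "positive"),
     ("Arrived broken", "negative"),
     ("Works as described", "positive")],
    "Never buying this again",
)
```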
Advanced Flash Attention 4
An algorithm that reorganizes attention computation to minimize data movement between fast and slow GPU memory.
In practice It does not change the math, but makes attention much faster and far less memory-hungry. It ships by default in PyTorch and in major inference servers (vLLM, TGI). If you use APIs you never see it; if you self-host it is almost mandatory to turn on.
Intermediate Graduate-Level Google-Proof Q&A
A benchmark of 448 multiple-choice questions written by PhD-level domain experts in biology, physics, and chemistry, designed to be hard even with Google access.
In practice It is replacing MMLU as the gauge of deep scientific knowledge. Domain-expert humans score around 65%; frontier models in 2025 exceed 70%. It remains one of the not-yet-saturated benchmarks.
Intermediate Greedy decoding
A generation strategy that always picks the most likely token at each step, without exploring alternatives.
In practice Equivalent to temperature 0. It is deterministic and fast, ideal for tasks needing reproducibility (data extraction, classification, code). The downside is that it can get stuck in loops and produces flat output on creative tasks. It is the natural starting point when debugging prompts.
Intermediate Holistic Evaluation of Language Models
A holistic evaluation framework developed by Stanford CRFM that measures an LLM across dozens of benchmarks covering accuracy, robustness, bias, calibration, and efficiency.
In practice Instead of a single metric, it provides a full scorecard: useful for comparing models all-around, not just on academic leaderboards. The project maintains a public site with up-to-date results for the major models.
An OpenAI benchmark of 164 Python programming problems scored by running unit tests against the code generated by the model.
In practice From 2021 it was the standard for measuring LLM coding ability. It is now saturated (over 90% pass@1), and the community has moved to SWE-bench, which is more realistic because it is based on real repositories.
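The pass@k metric mentioned above is usually computed with an unbiased estimator over n generated samples per problem, c of which pass the tests. A sketch, with the function name `pass_at_k` chosen here for illustration:

```python
from math import comb

def pass_at_k(n, c, k):
    """Probability that at least one of k samples, drawn from n (c correct), passes."""
    if n - c < k:
        return 1.0  # not enough failing samples to fill k draws
    return 1.0 - comb(n - c, k) / comb(n, k)

score = pass_at_k(n=10, c=3, k=1)  # with k=1 this is simply c / n
```

The benchmark score is the average of this value over all problems.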
Intermediate K-Quantization · llama.cpp K-Quants · GGUF K-Quants 1
K-Quants are a family of quantization methods implemented in llama.cpp (from Q2_K to Q8_K) that apply different bit-widths to different model layers based on their sensitivity to precision loss. Attention and embedding layers, being more sensitive, receive more bits; intermediate feed-forward layers, being less critical, receive fewer. This non-uniform quantization produces higher quality than older flat-Q formats (Q4_0, Q5_1) at the same file size. Q4_K_M has become the reference format for local inference, achieving better quality than the old Q5_1 while being more compact. They are the standard format for modern GGUF models downloadable from HuggingFace.
In practice A user wanting to run Llama 3 70B on a PC with 48 GB of RAM downloads the Q4_K_M variant from the GGUF repository on HuggingFace (typically uploaded by TheBloke or bartowski) and runs it with `llama.cpp` or an interface like LM Studio or Ollama. The choice of quantization level follows a practical rule: Q4_K_M for the best quality/size balance, Q5_K_M if there is sufficient RAM and higher fidelity is desired, Q2_K if space is very limited and degraded quality is acceptable. K-Quants are transparent to the end user: the interface loads the GGUF file and handles the format internally.
Intermediate Key-Value Cache · KV cache 4
A temporary area of GPU memory that stores the attention keys and values of tokens already processed, so the model does not recompute them for every new token generated.
In practice It is why each new token is cheap to generate once the prompt has been processed: the cache avoids redoing work. It eats a lot of VRAM and grows with context, so it is often the real bottleneck for serving many users in parallel. Optimizing it (paging, quantization) is central to cutting inference cost.
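How fast the cache grows can be estimated with simple arithmetic. The sketch below assumes a hypothetical model with 32 layers, 8 key-value heads, and a head dimension of 128; real models differ.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    # Factor 2: one key tensor plus one value tensor per layer.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)

# Hypothetical model dimensions, 8,192-token context, one request.
fp16_bytes = kv_cache_bytes(32, 8, 128, seq_len=8192, bytes_per_value=2)
fp8_bytes = kv_cache_bytes(32, 8, 128, seq_len=8192, bytes_per_value=1)
# fp16_bytes == 2**30 (1 GiB), fp8_bytes == 2**29 (0.5 GiB)
```

The size is linear in both context length and number of concurrent requests, which is why it limits parallel serving.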
Advanced KV Quantization · KV Compression 1
KV cache quantization is the technique of compressing the key-value tensors dynamically generated during inference, reducing them from FP16 to FP8 or INT8. Unlike weight quantization, which operates on the model's static parameters, this acts on the cache generated at runtime for each request. Going from 16-bit to 8-bit values roughly halves the cache's VRAM footprint, enabling longer context windows or more concurrent requests per GPU. It is supported by vLLM, Text Generation Inference (TGI), and TensorRT-LLM.
In practice A sysadmin serving a 70B model on two A100 80GB GPUs and wanting to increase concurrent batch size from 8 to 16 requests enables FP8 KV cache quantization in vLLM by adding `--kv-cache-dtype fp8` to the launch command. It is important to distinguish this from weight quantization: the two approaches are orthogonal and can be combined. In practice, measure quality degradation on long-range tasks (needle-in-haystack, multi-turn) before deploying to production, since precision loss in the cache is more visible over long contexts.
Intermediate LLM-as-a-judge · Model-graded eval
A technique that uses an LLM (usually a strong one) to score another model's or its own answers against criteria written in natural language.
In practice It speeds up evaluation dramatically compared to human judges, but suffers from biases (prefers longer answers, its own style). It must be calibrated against a subset of human judgments as anchor.
Raw numeric scores the model produces for every possible vocabulary token, before being turned into probabilities.
In practice They are the model's "unnormalized thinking": the higher a token's logit, the more likely that token is. Some APIs expose `logprobs` (logits after softmax and log) to gauge confidence or build classifiers. Working with raw logits is only relevant for fine-tuning or research.
Intermediate Lost in the middle
The phenomenon where an LLM remembers information at the start and end of the context better, while content in the middle is often ignored or forgotten.
In practice Critical for RAG and long prompts: the order of documents matters. Put the key information at the start or the end. It is one of the reasons why having a 1M-token context window does not mean the model actually uses all of it.
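One way to apply this advice to retrieved documents is to alternate them between the front and the back of the prompt. A sketch, assuming the input list is already sorted from most to least relevant; the function name is hypothetical.

```python
def reorder_for_long_context(docs_by_relevance):
    """Place the most relevant documents at the edges, the least relevant in the middle."""
    front, back = [], []
    for i, doc in enumerate(docs_by_relevance):
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]

ordered = reorder_for_long_context(["d1", "d2", "d3", "d4", "d5"])
# ordered == ["d1", "d3", "d5", "d4", "d2"]
```

The two best documents (`d1`, `d2`) end up first and last, where recall tends to be strongest.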
Intermediate Massive Multitask Language Understanding 1
A benchmark of about 16,000 multiple-choice questions across 57 subjects, from math and law to medicine, used to measure an LLM's general knowledge.
In practice For years it was the headline benchmark cited in new model announcements. Today it is saturated: frontier models score above 85%, and the field is moving to harder benchmarks like MMLU-Pro and GPQA.
Intermediate NIAH · Needle in a haystack
A test that hides a specific sentence inside a long irrelevant text and asks the model to retrieve it, to measure the real quality of the context window.
In practice It has become the de facto benchmark for long-context models (100K, 1M tokens). A model can advertise a huge context but fail NIAH beyond a certain depth, a sign the window is effectively 'fake'.
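Building a test case takes only a few lines. In this sketch `depth` is an assumed parameter between 0.0 (start of the context) and 1.0 (end), and the filler and needle sentences are made up.

```python
def build_haystack(filler_sentences, needle, depth):
    """Insert the needle at a relative depth inside the filler text."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    position = int(depth * len(filler_sentences))
    sentences = filler_sentences[:position] + [needle] + filler_sentences[position:]
    return " ".join(sentences)

filler = [f"Filler sentence number {i}." for i in range(10)]
needle = "The secret code is 7421."
haystack = build_haystack(filler, needle, depth=0.5)
question = "What is the secret code?"
```

Sweeping `depth` and the haystack length, then checking whether the answer contains the needle, shows where retrieval starts to fail.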
Advanced PagedAttention 4
A technique that splits the KV cache into small blocks managed like virtual memory pages, cutting VRAM waste between different requests.
In practice It is the core idea of vLLM and now standard in modern inference servers. It lets the same GPU serve many more users because it avoids reserving large mostly-empty blocks. When picking a self-hosted runtime, support for paged attention is a baseline requirement.
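The saving can be illustrated by counting wasted cache slots. This sketch assumes a block size of 16 tokens and a contiguous baseline that reserves `max_len` slots per request; both numbers are illustrative.

```python
def contiguous_waste(seq_lens, max_len):
    # Every request reserves room for the longest possible sequence.
    return sum(max_len - n for n in seq_lens)

def paged_waste(seq_lens, block_size):
    # Only the last, partially filled block of each request is wasted.
    return sum(-(-n // block_size) * block_size - n for n in seq_lens)

seq_lens = [100, 900, 30]  # tokens currently held by three requests
wasted_contiguous = contiguous_waste(seq_lens, max_len=2048)  # 5114 slots
wasted_paged = paged_waste(seq_lens, block_size=16)           # 26 slots
```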
Intermediate Automatic Prefix Caching · APC · Prompt Caching 2
Prefix caching is an inference technique that reuses the already-computed KV cache for common prompt prefixes across multiple requests. Rather than recomputing attention keys and values for the same sequences (e.g., an identical system prompt), the system stores these activations in memory and retrieves them directly. This dramatically reduces latency for the shared prefix, bringing it close to zero. It is implemented in vLLM as 'Automatic Prefix Caching' and in Anthropic and OpenAI cloud services as a reduced-cost billed feature.
In practice A developer serving a chatbot with a fixed 2,000-token system prompt benefits immediately from prefix caching: only the first request computes that prefix, and all subsequent ones read it from cache. In vLLM it is enabled with `--enable-prefix-caching`; in the Anthropic API, prefix caching must be explicitly declared with `cache_control`. For RAG applications with shared documents, you structure the prompt by placing the document before the questions to maximize cache reuse.
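The effect on compute can be shown with a toy cache. This is a sketch: `PrefixCache` is a hypothetical class, the stored value is a placeholder for real KV tensors, and production systems typically cache per block of tokens and not per whole prefix.

```python
class PrefixCache:
    def __init__(self):
        self._store = {}
        self.computed_tokens = 0  # tokens actually run through the model

    def process(self, prefix_tokens, suffix_tokens):
        key = tuple(prefix_tokens)
        if key not in self._store:
            # Cache miss: pay for the prefix once and remember the result.
            self._store[key] = f"kv-state-for-{len(prefix_tokens)}-tokens"
            self.computed_tokens += len(prefix_tokens)
        # The request-specific suffix is always computed.
        self.computed_tokens += len(suffix_tokens)
        return self._store[key]

cache = PrefixCache()
system_prompt = list(range(2000))  # stand-in for a 2,000-token system prompt
for _ in range(3):
    cache.process(system_prompt, suffix_tokens=list(range(50)))
# cache.computed_tokens == 2150, versus 6150 without caching
```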
Intermediate Quantization 11
A technique that reduces the numeric precision of model weights (for example from 16 to 4 bits) so it takes less memory and runs faster.
In practice It is what lets you run a Llama 70B on a single GPU or a 7B model on a Mac. You lose a bit of quality but often not much. Typical tools: GGUF, AWQ, GPTQ. Useful for on-prem or edge deployment.
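The core idea can be sketched with symmetric round-to-nearest quantization on a handful of weights. This is a simplification: formats such as GGUF, AWQ, and GPTQ use per-block scales and more elaborate schemes.

```python
def quantize(weights, bits):
    """Map floats to signed integers in [-qmax, qmax] using one shared scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = (max(abs(w) for w in weights) / qmax) or 1.0  # avoid a zero scale
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.12, -0.5, 0.33, 0.9]
q, scale = quantize(weights, bits=4)  # q == [1, -4, 3, 7]
restored = dequantize(q, scale)       # close to the originals, not identical
```

The rounding error per weight is at most half the scale, which is the "bit of quality" that is lost.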
Beginner Retrieval-Augmented Generation · RAG 20
A technique that fetches relevant text from an external data source and inserts it into the model's prompt before generating the response.
In practice It lets an LLM answer using company documents, internal knowledge bases, or up-to-date articles without training. It cuts hallucinations on specific data and refreshes knowledge without re-training. It is the first architecture to consider for an enterprise chatbot.
Intermediate Self-consistency 2
A technique that samples multiple independent answers from the model with temperature > 0 and picks the most frequent one by majority vote.
In practice It often improves accuracy on math reasoning tasks: if 5 out of 7 thought chains converge on the same answer, it is likely correct. Inference cost scales with the number of samples: 7 chains cost roughly 7 times a single call.
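The voting step is straightforward once the final answers have been extracted from each sampled chain. A sketch, with the sampled answers hard-coded for illustration:

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent answer and the share of samples that agree with it."""
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / len(answers)

# Hypothetical final answers from 7 independently sampled reasoning chains.
sampled = ["42", "42", "41", "42", "42", "40", "42"]
answer, agreement = majority_vote(sampled)  # ("42", 5/7)
```

The agreement ratio doubles as a rough confidence signal: a low value suggests the question deserves a second look.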
A math function that turns a set of logits into probabilities that sum to 1, amplifying high values and squashing low ones.
In practice It is the last step before picking the next token: it tells how strongly the model "believes" in each option. It also appears inside attention to weight context tokens. If you call APIs it is invisible; if you study models, it is one of the most recurring functions.
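A minimal softmax in plain Python, for illustration. Subtracting the maximum before exponentiating is a standard trick to avoid overflow and does not change the result.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
# roughly [0.659, 0.242, 0.099]: the largest logit takes most of the mass
```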
Advanced Speculative decoding 3
A technique where a small fast model proposes several tokens ahead and the large model verifies them in a single pass, accepting the correct ones.
In practice It can produce answers 2-3x faster with no change in final quality, because the big model stays the judge. It is used in production by OpenAI, Anthropic, and in self-hosted runtimes. It needs a "draft" model aligned with the main one, so it is not free to set up.
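The accept-or-correct logic can be sketched for the greedy case. Here `target_next_token` is a hypothetical stand-in for the large model, and the loop checks positions one by one, whereas a real system scores all draft positions in a single forward pass.

```python
# Hypothetical stand-in for the large model: it continues one fixed sentence.
TARGET_TEXT = ["the", "cat", "sat", "on", "the", "mat"]

def target_next_token(prefix):
    return TARGET_TEXT[len(prefix)] if len(prefix) < len(TARGET_TEXT) else "</s>"

def verify_draft(prefix, draft_tokens):
    """Accept draft tokens while they match the target, then emit one target token."""
    accepted = []
    for token in draft_tokens:
        expected = target_next_token(prefix + accepted)
        if token != expected:
            accepted.append(expected)  # first mismatch: use the target's token and stop
            return accepted
        accepted.append(token)
    # Every draft token was right: the target adds one more token for free.
    accepted.append(target_next_token(prefix + accepted))
    return accepted

result = verify_draft(["the"], ["cat", "sat", "in", "a"])
# result == ["cat", "sat", "on"]: two draft tokens accepted, one corrected
```

The output always matches what the large model alone would have produced, which is why quality is unchanged.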
Beginner JSON mode · Structured output
A mode where the model is constrained to produce output conforming to a schema (JSON, regex, grammar) instead of free text.
In practice Essential when the output feeds another system: API, database, frontend. Providers like OpenAI and Anthropic offer native enforcement that guarantees valid JSON on the first try.
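Even with provider-side enforcement, it is common to validate the output before passing it downstream. A sketch using only the standard library; the schema format (field name mapped to a Python type) is an assumption made for this example.

```python
import json

def parse_structured(raw, schema):
    """Parse model output as JSON and check required fields and their types."""
    data = json.loads(raw)  # raises a ValueError subclass on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for field, expected_type in schema.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        if not isinstance(data[field], expected_type):
            raise ValueError(f"wrong type for field: {field}")
    return data

SCHEMA = {"name": str, "age": int}  # hypothetical schema
record = parse_structured('{"name": "Ada", "age": 36}', SCHEMA)
```

On a `ValueError`, a typical fallback is to retry the call or route the item to manual review.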
Intermediate Software Engineering Bench 7
A benchmark of over 2,000 real GitHub issues from Python repositories: the model must produce a patch that makes the project's tests pass.
In practice It measures real software-engineering ability (reading a codebase, debugging, cross-file edits), not isolated coding. It has become the reference for agents like Devin, Claude Code, and OpenAI Codex.
A parameter that scales the logits before sampling: low values make the model more deterministic, high values more creative and unpredictable.
In practice At 0 the model always picks the most likely token (effectively greedy); at 1 it keeps the original distribution; above 1.5 it tends to go off the rails. For classification or extraction use 0; for creative writing 0.7-1.0. It is the simplest knob to tune in any API.
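The scaling itself is a division of the logits before softmax. A sketch, treating temperature 0 as the greedy limit since dividing by zero is undefined:

```python
import math

def apply_temperature(logits, temperature):
    if temperature == 0:
        # Limit case: all probability on the most likely token.
        best = logits.index(max(logits))
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
cold = apply_temperature(logits, 0.5)  # sharper: top token gains probability
base = apply_temperature(logits, 1.0)  # unchanged distribution
hot = apply_temperature(logits, 2.0)   # flatter: tail tokens gain probability
```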
The basic unit the model breaks text into: it can be a whole word, a syllable, or a few characters, depending on the tokenizer.
In practice LLM APIs charge per input and output token. In English 1 token is roughly 0.75 words, in Italian a bit less. Counting tokens in your prompt helps estimate cost and stay within the context limit.
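For a quick budget check without a tokenizer, the 0.75 words-per-token rule of thumb can be turned into a rough estimator. This is only an approximation for English text; the function name is hypothetical and a real tokenizer should be used when precision matters.

```python
import math

def estimate_tokens(text, words_per_token=0.75):
    """Rough token estimate from a whitespace word count."""
    return math.ceil(len(text.split()) / words_per_token)

estimate = estimate_tokens("Summarize this text in three bullets")  # 6 words -> 8 tokens
```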
The component that turns text into tokens before passing it to the model and rebuilds text from output tokens.
In practice Different tokenizers produce different counts: the same text may cost more tokens on GPT than on Claude or the other way around. Libraries like tiktoken (OpenAI) let you count tokens locally before calling the API.
Intermediate Top-k sampling
A next-token selection strategy that keeps only the k most likely candidates and discards the rest before sampling.
In practice With k=1 it becomes greedy decoding; with large k it is almost the full distribution again. It is used to stop the model from picking absurd words from the tail. Modern APIs often replace or combine it with top-p, which is considered more adaptive.
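The filtering step can be sketched on a small, made-up probability table. After truncation the surviving probabilities are renormalized so they sum to 1 again.

```python
def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize their probabilities."""
    kept = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

probs = {"cat": 0.5, "dog": 0.3, "car": 0.15, "xylophone": 0.05}
filtered = top_k_filter(probs, k=2)  # roughly {"cat": 0.625, "dog": 0.375}
```

Sampling then happens over `filtered`, so tail tokens like "xylophone" can no longer be picked.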
Intermediate Nucleus Sampling · Top-p sampling
A strategy that picks the next token from the smallest set of candidates whose cumulative probability exceeds a threshold p (e.g. 0.9).
In practice It adapts the candidate set to context: few options when the model is confident, many when it is unsure. It is one of the most-used parameters in APIs (`top_p` on OpenAI, Anthropic, etc.) to tune creativity without losing coherence. Typical values sit between 0.8 and 0.95.
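The adaptive behavior is easy to see on two made-up distributions. In this sketch tokens are added in order of probability until the cumulative mass reaches the threshold.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    kept, cumulative = [], 0.0
    for token, prob in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    return {token: prob / cumulative for token, prob in kept}

confident = {"Paris": 0.92, "Lyon": 0.05, "Rome": 0.03}
unsure = {"a": 0.3, "b": 0.25, "c": 0.2, "d": 0.15, "e": 0.1}
few = top_p_filter(confident, p=0.8)  # 1 candidate survives
many = top_p_filter(unsure, p=0.8)    # 4 candidates survive
```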
A reasoning strategy where the model explores multiple thought branches in parallel, evaluates them, and keeps only the promising ones, like a tree search.
In practice It extends Chain-of-Thought by allowing backtracking: useful for puzzles, planning, and math problems where a single linear path often fails. It costs many more tokens than standard inference.
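The search pattern can be sketched on a toy number puzzle. In a real setup an LLM would propose the next thoughts and score them; here the hypothetical `expand` and `score` functions stand in for those calls.

```python
def tree_of_thoughts(start, target, expand, score, breadth=2, depth=4):
    """Breadth-limited tree search: expand every kept path, keep the best `breadth`."""
    frontier = [[start]]
    for _ in range(depth):
        candidates = [path + [nxt] for path in frontier for nxt in expand(path[-1])]
        candidates.sort(key=lambda path: score(path[-1], target))
        frontier = candidates[:breadth]
        if frontier and frontier[0][-1] == target:
            return frontier[0]
    return None

def expand(n):
    return [n + 3, n * 2]  # toy "thoughts": two possible next moves

def score(n, goal):
    return abs(goal - n)   # toy evaluator: distance to the goal, lower is better

solution = tree_of_thoughts(1, 10, expand, score, breadth=2)    # [1, 4, 7, 10]
single_path = tree_of_thoughts(1, 10, expand, score, breadth=1)  # None
```

With a single path the search commits to 8 and overshoots; keeping two branches lets it fall back to 7 and reach the target.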
Intermediate Zero-Shot Voice Cloning · Speaker Adaptation 9
Voice cloning is the ability to synthesize speech in a target speaker's voice from just a few seconds of reference audio. The model extracts a speaker embedding from the reference audio and conditions generation on it, replicating timbre, rhythm, and prosodic characteristics. Zero-shot means no additional per-speaker training or fine-tuning is needed. Systems like ElevenLabs, XTTS v2, CosyVoice, and Dia TTS have made this technology accessible via API or open-weights models.
In practice A developer cloning a voice with XTTS v2 (open source, available on HuggingFace) provides 6-10 seconds of clean reference audio and the text to synthesize; the Coqui TTS library handles embedding extraction and synthesis in a few seconds. For professional productions, the ElevenLabs API accepts an audio clip and returns a reusable voice_id. It is essential to verify the original speaker's consent before cloning their voice, in compliance with applicable regulations.
Beginner Zero-shot learning
The model's ability to perform a task it never saw during training, based only on the description we give in the prompt, with no examples.
In practice It is what most of us do when we write 'summarize this text in three bullets'. If results are inconsistent, moving to few-shot with examples is the fastest fix. Useful to prototype new flows quickly.