121 terms
An input modified imperceptibly for a human but crafted to fool a model into producing a wrong or harmful output.
In practice: Born in vision (an imperceptible perturbation can make a panda be classified as a gibbon), today it also hits LLMs through odd-looking character suffixes that unlock forbidden behavior. It is an intrinsic vulnerability of neural networks.
A system where an LLM does more than answer: it decides which tools to call, in what order, and keeps iterating until it reaches a goal.
In practice: An agent reads email, writes to a database, sends Slack messages. The hard part is handling errors, infinite loops, cost, and tool security. For simple cases a linear pipeline is more reliable than a real agent.
An AI supply chain attack targets the AI development supply chain: publicly shared model weights, LoRA adapters, GGUF quantizations, or datasets on platforms like HuggingFace are compromised with backdoors or hidden behaviors. A poisoned model can execute malicious actions when it receives a specific trigger, exfiltrate data, or generate harmful outputs at the attacker's request. The analogy to SolarWinds-style attacks on traditional software is direct: the artifact appears legitimate but contains hidden payloads.
In practice: A developer downloading models from public repositories should verify the officially published SHA256 checksums and prefer models with digital signatures or verified provenance. Before using a model in production, it is good practice to run automated security evaluations (e.g., with tools like ModelScan or Protect AI Guardian) that analyze weights for suspicious patterns. For enterprise teams, maintaining an internal registry of approved artifacts and disallowing direct Internet downloads during deployment significantly reduces the attack surface.
A set of techniques and research aimed at making an AI model do what humans actually want, not just what we ask literally.
In practice: the model does not help with illegal activity, follows instructions, does not make things up, does not manipulate. When you put AI in production this is also a brand and legal liability concern, not just an ethical one.
A benchmark of visual grid puzzles created by François Chollet to measure abstract reasoning on never-seen patterns, unsolvable by memorization.
In practice: Designed to be easy for humans (over 80%) but hard for LLMs. In 2024 OpenAI's o3 posted record results, reopening the debate on what AGI really means. A one-million-dollar prize is attached.
A scale of levels (ASL-1, ASL-2, ASL-3...) used by Anthropic to classify the risk of an AI model and define the required safety controls, inspired by biosafety levels.
In practice: The higher the level, the more safety measures become mandatory: monitoring, deployment restrictions, independent audits. When choosing a vendor, knowing which ASL a model is classified under hints at the maturity of their governance.
A mechanism that lets the model weigh how relevant each word in the text is compared to the others to understand the meaning of the context.
In practice: It is why an LLM knows that 'he' in a sentence refers to a person mentioned earlier. Compute cost grows with the square of context length: this is why very long contexts are expensive.
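The weighing described above can be sketched as scaled dot-product attention for one query vector. This is a minimal illustration in plain Python, not how production kernels are written; the function names `softmax` and `attend` and the toy vectors are assumptions made for the example.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    d = len(query)
    # One relevance score per key: dot product scaled by sqrt(dimension).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    # The output is the weighted average of the value vectors.
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]
    return weights, output

# Toy data (hypothetical): three tokens with 2-dimensional keys and values.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
weights, output = attend([1.0, 0.0], keys, values)
```

Every token issues its own query against every key, so a sequence of n tokens needs n × n scores, which is where the quadratic cost comes from.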
A model that generates a sequence one element at a time, each time using the previous output as part of the new input.
In practice: It is how every GPT-style LLM works: each new token depends on all the previous ones. It explains why generation is inherently sequential and hard to parallelize, and is the reason behind tricks like speculative decoding to speed it up.
An attack where a model is trained to behave normally except when it recognizes a secret trigger that activates a predefined malicious behavior.
In practice: Extremely hard to detect with standard evaluations: the model looks aligned until someone types the keyword. It affects both proprietary models (insiders) and open-weights models downloaded from untrusted sources.
A decoding algorithm that keeps the N most likely sequences in parallel and finally picks the one with the best overall score.
In practice: It produces "safer" results than greedy decoding, but tends to be repetitive and unnatural on long text. It used to be standard for machine translation and still helps on structured tasks such as translation and summarization; in modern conversational LLMs it is mostly replaced by top-p sampling.
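A minimal sketch of the idea, assuming a hypothetical toy "model" (`TOY_MODEL`) where the next-token probabilities depend only on the last token. It shows how keeping two candidates finds a sequence that greedy decoding (beam width 1) misses.

```python
import math

# Hypothetical toy model: next-token probabilities depend only on the last token.
TOY_MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.55, "dog": 0.45},
    "a": {"cat": 0.9, "dog": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}

def beam_search(beam_width, max_steps=5):
    """Return the best-scoring sequence found with the given beam width."""
    beams = [(0.0, ["<s>"])]  # (cumulative log-probability, tokens)
    for _ in range(max_steps):
        candidates = []
        for score, seq in beams:
            if seq[-1] == "</s>":
                candidates.append((score, seq))  # finished sequences carry over
                continue
            for token, p in TOY_MODEL[seq[-1]].items():
                candidates.append((score + math.log(p), seq + [token]))
        # Keep only the beam_width highest-scoring candidates.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
        if all(seq[-1] == "</s>" for _, seq in beams):
            break
    return beams[0][1]

greedy_result = beam_search(beam_width=1)
beam_result = beam_search(beam_width=2)
```

With width 1 the search commits to "the" (0.6) and ends at probability 0.33; with width 2 it keeps "a" alive and finds "a cat" at 0.36.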
A classic text-search algorithm based on word frequency, with corrections for document length and term rarity.
In practice: It has powered Elasticsearch, Lucene, and Solr for years. On exact terms, acronyms, and proper nouns it often beats embeddings. That is why modern RAG pipelines combine BM25 with vector search (hybrid search).
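The scoring can be sketched in a few lines. This is one common BM25 variant (with a `+1` inside the IDF logarithm to keep it non-negative), using whitespace tokenization and typical default parameters `k1=1.5`, `b=0.75`; the function name and sample documents are assumptions for the example, and real engines differ in details.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with a common BM25 variant."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Number of documents containing each term (drives term rarity).
    doc_freq = Counter(term for d in tokenized for term in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates (k1) and is corrected for document length (b).
            norm = tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(d) / avg_len))
            score += idf * norm
        scores.append(score)
    return scores

docs = [
    "error code E42 in the payment service",
    "the payment service was restarted",
    "general notes about the team offsite",
]
scores = bm25_scores("E42 payment", docs)
```

The rare exact term `E42` dominates the score of the first document, which is the behavior that makes BM25 strong on acronyms and identifiers.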
A tokenization algorithm that starts from single characters and progressively merges the most frequent pairs, building a vocabulary of subwords.
In practice: It is used by GPT, Llama, Mistral, and nearly every Western LLM. It explains why "playing" may become `play` + `ing`: common pieces get one token, rare words get many. It directly affects per-token cost and quality on non-English languages.
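The merge loop can be illustrated on a toy corpus. This is a simplified character-level sketch (real tokenizers typically work on bytes, pre-split text, and learn tens of thousands of merges); `learn_bpe` and the word list are hypothetical names chosen for the example.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn merge rules from a list of words (a toy corpus)."""
    corpus = Counter(tuple(w) for w in words)  # each word starts as single characters
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite the corpus with the most frequent pair merged into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus

merges, corpus = learn_bpe(["playing", "playing", "singing", "ring"], num_merges=2)
```

After two merges the frequent ending has become a single `ing` symbol, while the rarer stems are still spelled out character by character.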
A phenomenon where a model, when trained on new data, rapidly loses skills it had learned before.
In practice: It is why aggressive fine-tuning on a narrow domain can make the model worse on everything else. It is mitigated with LoRA (which freezes the original weights), mixed datasets, or regularized updates. Always evaluate with a "general" test set on top of the domain-specific one.
A filter applied inside attention that prevents each token from seeing tokens that come after it in the sequence.
In practice: It is what makes a Transformer "causal" or decoder-only: during training the model learns to predict the next token without cheating by looking ahead. At inference time the mask becomes implicit because future tokens do not yet exist. Without it GPT would not make sense.
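As a rough sketch, the mask is a lower-triangular table of allowed positions, and disallowed scores are set to negative infinity before the softmax so they receive zero weight. The helper names `causal_mask` and `masked_softmax` are assumptions for this example.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, allowed):
    """Softmax over scores where disallowed positions get exactly zero weight."""
    masked = [s if ok else float("-inf") for s, ok in zip(scores, allowed)]
    m = max(masked)  # finite, because a position can always see itself
    exps = [math.exp(s - m) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

mask = causal_mask(4)
# Attention weights for the second token (index 1): it sees tokens 0 and 1 only.
weights = masked_softmax([0.1, 0.2, 0.3, 0.4], mask[1])
```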
A technique where the model is asked to spell out the intermediate reasoning steps before giving the final answer, improving accuracy on complex tasks.
In practice: Adding 'think step by step' to the prompt measurably helps on math, logic, and analysis. Reasoning models (o1, Claude with thinking) do it automatically. It costs more tokens, so use it only where you need it.
A full save of model weights at a given point of training, from which you can resume or release as a final model.
In practice: During a training run, checkpoints are saved every N steps to recover from crashes and evaluate intermediate versions. When a lab releases an open-weights model (Llama, Mistral, Qwen) it is publishing a checkpoint. The word is often used as a synonym for "a downloadable model version".
The process of splitting a document into smaller pieces (chunks) before computing embeddings, to make them suitable for retrieval and the context window.
In practice: Chunking quality often dictates RAG quality: chunks too small lose context, chunks too large dilute relevance. Common strategies: fixed size with overlap, recursive by separator, semantic by topic shift.
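The simplest of these strategies, fixed size with overlap, can be sketched as follows. The example counts characters for brevity (production pipelines usually count tokens), and `chunk_text` with its default sizes is a hypothetical helper.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into fixed-size character chunks that overlap by `overlap`."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks

chunks = chunk_text("abcdefghij", chunk_size=4, overlap=2)
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one neighbor, at the price of storing some text twice.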
An approach developed by Anthropic where the model is trained to follow a written set of principles (a 'constitution') instead of just case-by-case human preferences.
In practice: It is the method behind Claude. Upside: behavior rules are explicit and readable, not hidden in millions of ratings. If you are picking a model for your company, this makes the vendor's policy choices easier to inspect.
The maximum number of tokens the model can read and hold in memory in a single call, counting both prompt and response.
In practice: If you have a 200-page contract and a 200k-token window the whole thing often fits. Otherwise you have to chunk the text or use RAG. More context means higher cost and higher response latency.
A serving strategy where new requests join the running batch at every generation step, instead of waiting for previous ones to finish.
In practice: It sharply raises the throughput of a GPU serving APIs, because slots freed by finished requests are refilled immediately instead of sitting idle until the whole batch completes. It is implemented in vLLM, TensorRT-LLM, and TGI. For anyone pricing per token, it is one of the key ingredients to stay competitive on cost.
A similarity measure between two vectors based on the cosine of the angle between them: it ranges from -1 (opposite directions) to 1 (same direction), independent of their length.
In practice: It is the most common metric for comparing text embeddings because it ignores magnitude and looks only at semantic direction. Common alternatives: dot product (equivalent to cosine when vectors are normalized, and cheaper to compute) and Euclidean distance.
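A minimal sketch of the formula, dot product divided by the product of the two lengths, in plain Python. The function name and sample vectors are chosen for illustration; the sketch assumes neither vector is all zeros.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (both must be non-zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # different length, same direction
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
```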
Training a single robot policy that works across different hardware configurations (different arm DOFs, grippers, sensors, mobile bases). Like foundation models for text, cross-embodiment models (RT-X, CrossFormer) learn general manipulation skills from diverse robot data such as the Open X-Embodiment dataset. Reduces the need to collect data per robot configuration separately.
In practice: A company with multiple robot models in production can train a single cross-embodiment model on all collected data, instead of maintaining separate policies for each robot. The Open X-Embodiment dataset aggregates over 1 million episodes from 22 different robots; a researcher can fine-tune a model pretrained on it with a few examples from their specific robot and often achieve better performance than training from scratch.
Two architectures to measure text similarity: the bi-encoder encodes query and document separately (fast), the cross-encoder processes them jointly (slow but accurate).
In practice: Bi-encoder = precomputed embeddings, used for the first search over millions of documents. Cross-encoder = score computed on the fly over few candidates, used as a final reranker. They are complementary, not alternatives.
An attack where an adversary inserts malicious examples into the training dataset to alter the behavior of the final model.
In practice: Even a handful of corrupted documents in a web crawl can create persistent backdoors or biases. Particularly risky for models that continuously train on public content or are fine-tuned on unvetted third-party datasets.
A Transformer architecture made up of only the decoder side, where each token looks only at previous tokens to predict the next one.
In practice: It is the architecture of GPT, Llama, Mistral, Claude, and basically every modern generative LLM. It contrasts with encoder-only (BERT, for classification) and encoder-decoder (T5, for translation). Its simplicity is the reason it scales so well in pretraining.
A mathematical technique that adds controlled noise to training so that the presence or absence of a single individual in the dataset is not detectable from the model's output.
In practice: It is the reference approach for models trained on health, tax, or messaging data. Apple, Google, and the US Census use it. It costs accuracy: more privacy means more noise.
A type of generative model that starts from random noise and gradually shapes it into a coherent image, video, or audio through many small steps.
In practice: It powers Stable Diffusion, Midjourney, Sora. When you integrate image generation what matters is the trade-off between quality, speed (number of steps), and control. Costs are in GPU-seconds rather than tokens.
An imitation learning method for robots where the policy is a denoising diffusion model: given an observation, it iteratively denoises a random action sequence into the action to execute. Unlike deterministic policies, diffusion policies learn multi-modal action distributions: they handle tasks with multiple valid solutions without averaging them into a bad one. The original paper reports an average improvement of about 46% over prior imitation-learning baselines on its manipulation benchmarks.
In practice: A robotics researcher collecting human demonstrations for an assembly task trains a Diffusion Policy on that data: the model learns that 'place the piece on the left' and 'place it on the right' are both valid solutions and coherently samples one of them, instead of producing the (wrong) average movement as classic behavioral cloning does. Libraries like Columbia's diffusion_policy or Hugging Face's LeRobot offer ready-to-use implementations.
Disaggregated inference is a serving architecture that physically separates the prefill phase (compute-bound: processes the entire prompt in parallel) from the decode phase (memory-bound: generates one token at a time), assigning them to distinct GPU pools connected via KV cache transfer. This separation eliminates 'prefill-decode interference', the resource contention that occurs when both phases run on the same GPUs and reduces overall throughput. Explored in research systems such as DistServe and Splitwise and deployed at scale in Moonshot AI's Mooncake architecture (Kimi), it has been reported to improve throughput by up to about 5x in some long-context scenarios. It is considered one of the most significant advances in LLM serving infrastructure in 2024-2025.
In practice: In a large-scale deployment, the infrastructure engineer configures a cluster of 'prefill-only' GPUs (typically high FLOPS/W, such as H100 SXM) and a separate 'decode-only' cluster (typically high memory bandwidth). An incoming request is routed to the prefill pool, which computes the KV cache and transfers it via NVLink or InfiniBand to the decode pool. Open-source frameworks such as LMDeploy and some advanced vLLM configurations support this mode. Operational cost is higher due to hardware duplication, but TTFT (time-to-first-token) and throughput improve significantly.
A technique to train a small model to mimic the behavior of a large one, getting similar quality at a fraction of inference cost.
In practice: It is why we keep getting small but capable models: they are distilled from frontier ones. If you need fast cheap responses in a narrow domain, distilling your own model from Claude or GPT is often the winning move.
An alignment technique that teaches a model to prefer a better answer over a worse one, without using a separate reward model like RLHF does.
In practice: It only needs pairs of answers labeled "better/worse", and its training loop is simpler and more stable than PPO's. In recent years it has replaced PPO-based RLHF in many open-source projects (Zephyr, Tulu, Llama variants). It is often the cheapest way to align a fine-tuned model.
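For a single labeled pair, the loss compares how much more the trained policy prefers the better answer than a frozen reference model does. The sketch below follows the standard DPO formulation; the function name, argument names, and the sample log-probabilities are illustrative assumptions, and a real training loop would compute the log-probabilities from the two models over batches.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of sequence log-probabilities."""
    # How much more the policy favors the chosen answer, relative to the reference.
    margin = (policy_chosen_logp - ref_chosen_logp) - (policy_rejected_logp - ref_rejected_logp)
    # log(1 + exp(-x)) is the same as -log(sigmoid(x)).
    return math.log1p(math.exp(-beta * margin))

# Policy identical to the reference: no preference learned yet.
loss_start = dpo_loss(-12.0, -10.0, -12.0, -10.0)
# Policy has shifted probability toward the chosen answer: lower loss.
loss_improved = dpo_loss(-9.0, -14.0, -12.0, -10.0)
```

No reward model appears anywhere: the reference log-probabilities play the role of the implicit baseline.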
A technique to fine-tune a diffusion model on 3-5 photos of a specific subject (person, product, pet) using a unique text identifier ('a sks dog'). The model 'memorizes' the subject while preserving its general generation ability. It is the foundation of AI portrait apps, product photography generators, and custom image tools. Introduced by Google Research in 2022.
In practice: A product photographer can fine-tune Stable Diffusion with DreamBooth on 5 photos of an object (e.g., a sneaker) and then generate hundreds of shots in different environments without physical photo sets. It is often combined with LoRA to reduce computational cost: instead of updating all model weights, only low-rank matrices are trained. Tools like kohya_ss or Hugging Face's Diffusers library offer ready-to-use DreamBooth+LoRA scripts.
A numeric representation of a text as a vector of hundreds of numbers, where sentences with similar meaning produce vectors close to each other.
In practice: You compute them once with an embedding model and store them in a vector database. They power semantic search, document dedup, clustering, and the retrieval step in a RAG system.
A prompting technique where the model is shown a few examples of desired input and output, so it learns the format on the fly without training.
In practice: Useful to enforce a schema, a tone, or a precise classification. Usually 3-5 examples are enough. It is almost always the first thing to try before reaching for fine-tuning: it only costs a few extra tokens in the prompt.
Fill-In-the-Middle (FIM) is a training objective for code models in which the model must predict a central span of text given the surrounding context: both what precedes it (prefix) and what follows it (suffix). Unlike standard left-to-right autoregressive generation, FIM enables the model to complete partially written functions, docstrings, variable names, or logic blocks in the middle of existing code. The technique rearranges training tokens into the form [PREFIX][SUFFIX][MIDDLE] (PSM) or [SUFFIX][PREFIX][MIDDLE] (SPM), so that the missing middle comes last and can be generated left to right. StarCoder, DeepSeek-Coder, and Codestral make extensive use of FIM, and it is the technical foundation of most modern code completion tools.
In practice: A developer using GitHub Copilot or Cursor directly benefits from FIM every time they write a partial function and ask the model to complete the body: the model sees both the code before the cursor and the code after it. For those training their own code model, the FIM training pipeline requires randomly sampling spans to mask from the source code corpus and reformatting tokens with the special separators `<fim_prefix>`, `<fim_suffix>`, `<fim_middle>`. A commonly used mix is around 50% FIM and 50% left-to-right during pre-training, to also preserve standard generative capability.
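The reformatting step can be sketched as below, using the separators named above in prefix-suffix-middle order. This is a simplified illustration that splits on characters rather than tokens; `to_fim_example` and its `fim_rate` parameter are hypothetical names, and real pipelines differ in how they sample spans.

```python
import random

def to_fim_example(code, rng, fim_rate=0.5):
    """Turn a code string into a FIM training example with probability fim_rate."""
    if rng.random() >= fim_rate or len(code) < 3:
        return code  # keep as a plain left-to-right example
    # Pick two cut points; the text between them becomes the span to predict.
    start, end = sorted(rng.sample(range(1, len(code)), 2))
    prefix, middle, suffix = code[:start], code[start:end], code[end:]
    # The middle goes last, so the model learns to generate it given both sides.
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>{middle}"

code = "def add(a, b):\n    return a + b\n"
example = to_fim_example(code, random.Random(0), fim_rate=1.0)
```

At inference time the editor sends the text before and after the cursor in the same layout and lets the model continue after `<fim_middle>`.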
An extra training step where a ready-made model is trained on a smaller, more specific dataset to improve its performance on a certain task or domain.
In practice: You do it when the base model does not match the style, jargon, or formats you need. It requires good labeled data and GPUs. Often you start with a lightweight variant like LoRA before doing a full fine-tune.
An algorithm that reorganizes attention computation to minimize data movement between fast and slow GPU memory.
In practice: It does not change the math, but makes attention much faster and far less memory-hungry. It ships with PyTorch 2.x and with major inference servers (vLLM, TGI). If you use APIs you never see it; if you self-host it is almost mandatory to turn on.
A large model trained on very general data, designed to be reused and adapted for many different tasks rather than serving a single purpose.
In practice: GPT-4, Claude, Llama are foundation models. For most use cases you do not train a new one: you use it via API or open weights and adapt it with prompting, RAG, or a small fine-tune on top.
FP8 is an 8-bit floating-point numeric format available in two variants: E4M3 (4-bit exponent, 3-bit mantissa), used in the forward pass for higher precision, and E5M2 (5-bit exponent, 2-bit mantissa), used for gradients for greater dynamic range. It roughly halves memory usage compared to BF16, with quality loss typically reported below 0.5% when paired with per-tensor scaling via the NVIDIA Transformer Engine. H100 and H800 GPUs have native FP8 Tensor Cores. DeepSeek V3 was trained with FP8 mixed precision (with some sensitive components kept in higher precision), achieving GPT-4o-level quality at a fraction of the cost.
In practice: An ML team training a 70B LLM on an H100 cluster enables FP8 via NVIDIA's Transformer Engine (integrated into Megatron-LM and NeMo) by selecting the hybrid recipe, for example `fp8_format=Format.HYBRID` in Transformer Engine's `DelayedScaling` recipe. For inference, frameworks like vLLM and TensorRT-LLM support FP8 weights and activations to reduce required VRAM and increase throughput. Before deploying to production, it is good practice to run evaluations on standard benchmarks (MMLU, HumanEval) to confirm that quality degradation stays within acceptable thresholds.
An AI model among the most capable existing right now, at the edge of what is achievable. It often comes with new risks and new capabilities still poorly understood.
In practice: Current examples: the latest Claude, the latest GPT models, Gemini Ultra. They cost more but do things smaller models cannot. For serious projects, benchmark on your own use case: sometimes a mid-tier model is plenty.
An LLM's ability to output a structured call to a function described in a schema, with name and typed arguments ready to execute.
In practice: It is the standard way an app wires a model to its own code: the model returns JSON, the app runs the function and feeds the result back. The foundation of almost every production agent.
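The application side of that loop can be sketched as follows. The JSON shape (`name` plus `arguments`), the `get_weather` tool, and the `TOOLS` registry are assumptions for illustration; each provider's API defines its own exact message format.

```python
import json

def get_weather(city: str, unit: str = "celsius") -> dict:
    """Hypothetical tool; a real one would call a weather service."""
    return {"city": city, "unit": unit, "temperature": 21}

# Registry mapping the tool names described to the model onto real functions.
TOOLS = {"get_weather": get_weather}

def run_tool_call(model_output: str) -> str:
    """Parse the model's structured call, run the function, return JSON to feed back."""
    call = json.loads(model_output)
    func = TOOLS.get(call["name"])
    if func is None:
        return json.dumps({"error": f"unknown tool {call['name']}"})
    result = func(**call["arguments"])
    return json.dumps(result)

# What a model might return instead of free text (hand-written here).
model_output = '{"name": "get_weather", "arguments": {"city": "Rome"}}'
tool_result = run_tool_call(model_output)
```

In production the arguments should also be validated against the schema before the call, since the model's JSON is untrusted input.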
A benchmark of 448 questions written by PhD students in biology, physics, and chemistry, designed to be hard even with Google access.
In practice: It is replacing MMLU as the gauge of deep scientific knowledge. Domain-expert humans score around 65%; frontier models in 2025 exceed 70%. It remains one of the not-yet-saturated benchmarks.
An optimization algorithm that updates a model's weights in the direction that most reduces the error, one small step at a time.
In practice: It is the core engine behind training every modern neural network. Most people use a variant called Adam or AdamW, which is more stable and faster. If you do not train models from scratch it is a concept to know, not a knob to turn.
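A one-variable sketch of the update rule: repeatedly step against the gradient. The function being minimized, f(x) = (x - 3)^2, and the helper name `gradient_descent` are chosen purely for illustration; real training applies the same rule to billions of weights at once.

```python
def gradient_descent(grad, x0, learning_rate=0.1, steps=100):
    """Minimize a function of one variable given its gradient."""
    x = x0
    for _ in range(steps):
        x -= learning_rate * grad(x)  # small step in the direction that reduces the error
    return x

# f(x) = (x - 3)^2 has gradient 2 * (x - 3) and its minimum at x = 3.
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
```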
A generation strategy that always picks the most likely token at each step, without exploring alternatives.
In practice: Equivalent to temperature 0. It is deterministic and fast, ideal for tasks needing reproducibility (data extraction, classification, code). The downside is that it can get stuck in loops and produces flat output on creative tasks. It is the natural starting point when debugging prompts.
A model response that sounds plausible but is made up: false facts, nonexistent citations, APIs that do not exist, wrong data presented with confidence.
In practice: It is the number-one issue when putting LLMs in business workflows. Fixes: RAG with sources, asking for citations, double-checking with a second model, validating structured output against rules. Never treat output as gospel without a check.
A holistic evaluation framework developed by Stanford CRFM that measures an LLM across dozens of benchmarks covering accuracy, robustness, bias, calibration, and efficiency.
In practice: Instead of a single metric, it provides a full scorecard: useful for comparing models all-around, not just on academic leaderboards. It maintains a public site with results for many major models.
A hierarchical graph data structure used to approximately find the nearest vectors to a query in datasets of millions or billions of embeddings.
In practice: It is the default index in Qdrant and Weaviate and is available in pgvector, FAISS, and many other vector stores. It enables millisecond searches at scales where brute force would be unusable. You pay in RAM and index build time.
An OpenAI benchmark of 164 Python programming problems scored by running unit tests against the code generated by the model.
In practice: From 2021 it was the standard for measuring LLM coding ability. It is now saturated (over 90% pass@1), and the community has moved to SWE-bench, which is more realistic because it is based on real repositories.
A retrieval strategy that combines keyword search (BM25) and vector search (embeddings), merging the two rankings with techniques such as Reciprocal Rank Fusion.
In practice: It compensates for each method's weaknesses: embeddings excel at semantics, BM25 at exact terms. It almost always beats either alone. It is the state of the art in production RAG systems.
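Reciprocal Rank Fusion needs only the two ranked lists, not their raw scores: each document earns 1 / (k + rank) from every list it appears in. The sketch below uses k = 60, a commonly used constant; the function name and document IDs are made up for the example.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one fused ranking."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["doc_a", "doc_b", "doc_c"]     # keyword search results, best first
vector_ranking = ["doc_c", "doc_a", "doc_d"]   # embedding search results, best first
fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
```

Documents found by both retrievers (`doc_a`, `doc_c`) rise above those found by only one, without any need to make BM25 scores and cosine similarities comparable.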
Indirect prompt injection is an attack where malicious instructions are embedded in external content that an LLM agent will read: web pages, documents, emails, or database results. Unlike direct prompt injection (where the user provides the malicious content), here the attacker controls the external environment. When the agent retrieves and processes the content, it unknowingly executes the hidden instructions as if they came from a trusted source. The attack was first formalized by Greshake et al. (2023) and is a critical threat for RAG systems and autonomous agents.
In practice: A developer building a web agent must sanitize all externally retrieved text before inserting it into the prompt. Defensive techniques include: structured prompts with explicit delimiters separating data from instructions, classifier systems that detect injection patterns in retrieved documents, and the principle of least privilege (the agent should not have access to dangerous tools if the task does not require them). Systematically testing the agent with deliberately poisoned documents is part of standard red-teaming for RAG applications.
The amount of compute the model uses at response time, not during training. More inference compute often means better but slower and pricier answers.
In practice: Reasoning models shift resources from training to inference. For anyone deploying a service this is the most visible cost line: every call burns GPU. Ways to cut it: caching, smaller models, quantization, batching.
Instruction tuning is a training phase in which an already-pretrained LLM is further optimized on (instruction, expected-response) pairs, structured as natural-language task descriptions. Unlike generic supervised fine-tuning, it explicitly focuses on standardized task descriptions to instill the ability to follow arbitrary commands. Google's FLAN work (2021) showed that training on 60+ diverse tasks dramatically improves zero-shot generalization. It is the technical foundation of models such as ChatGPT, Vicuna, and Flan-T5.
In practice: You prepare a dataset of thousands of examples in the format 'Instruction: … Response: …', often derived from existing NLP benchmarks reformatted as prompts. The base model is then fine-tuned on this data using a standard cross-entropy objective. A developer adapting an open-weights model (e.g., LLaMA) to a specific domain builds a vertical instruction dataset and uses frameworks like LLaMA-Factory, Axolotl, or HuggingFace TRL to run instruction tuning in a few hours on a single GPU.
A technique where a user talks the model into ignoring its own safety rules, for example by asking it to pretend to be a character with no restrictions.
In practice: Different from prompt injection: here it is the user who tries. If you offer a public LLM service this means doing red teaming, logging conversations, and running a safety classifier as a second pass over responses.
K-Quants are a family of quantization methods implemented in llama.cpp (from Q2_K to Q6_K) that apply different bit-widths to different model layers based on their sensitivity to precision loss. Attention and embedding layers, being more sensitive, receive more bits; intermediate feed-forward layers, being less critical, receive fewer. This non-uniform quantization produces higher quality than older flat formats (Q4_0, Q5_1) at a comparable file size. Q4_K_M has become the reference format for local inference, coming close to the quality of the old Q5_1 while being noticeably more compact. They are the standard format for modern GGUF models downloadable from HuggingFace.
In practice: A user wanting to run Llama 3 70B on a PC with 48 GB of RAM downloads the Q4_K_M variant from the GGUF repository on HuggingFace (typically uploaded by TheBloke or bartowski) and runs it with `llama.cpp` or an interface like LM Studio or Ollama. The choice of quantization level follows a practical rule: Q4_K_M for the best quality/size balance, Q5_K_M if there is sufficient RAM and higher fidelity is desired, Q2_K if space is very limited and degraded quality is acceptable. K-Quants are transparent to the end user: the interface loads the GGUF file and handles the format internally.
A temporary region of GPU memory that stores the attention keys and values of tokens already seen, so the model does not recompute them on every new token generated.
In practice: It is why generating the tenth token costs less than the first: the cache avoids redoing work. It eats a lot of VRAM and grows with context, so it is often the real bottleneck for serving many users in parallel. Optimizing it (paged, quantized) is central to cutting inference cost.
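A back-of-the-envelope estimate shows why the cache dominates VRAM: it stores one key and one value vector per layer, per KV head, per token. The model dimensions below are a hypothetical configuration chosen to give round numbers, not the specification of any particular model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Approximate KV cache size: keys and values for every layer, head, and token."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_value

# Hypothetical model: 32 layers, 8 KV heads of dimension 128, 8192-token context.
fp16_bytes = kv_cache_bytes(32, 8, 128, 8192, bytes_per_value=2)  # FP16: 2 bytes per value
fp8_bytes = kv_cache_bytes(32, 8, 128, 8192, bytes_per_value=1)   # FP8 cache: 1 byte per value
gib_per_request = fp16_bytes / 2**30
```

Under these assumptions each request holds 1 GiB of cache at FP16, so sixteen concurrent long requests need 16 GiB on top of the weights; storing the cache in 8 bits halves that.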
KV cache quantization is the technique of compressing the key-value tensors dynamically generated during inference, reducing them from FP16 to FP8 or INT8. Unlike weight quantization, which operates on the model's static parameters, this acts on the cache generated at runtime for each request. Going from 16 to 8 bits roughly halves the cache's VRAM footprint, enabling longer context windows or more concurrent requests per GPU. It is supported by vLLM, Text Generation Inference (TGI), and TensorRT-LLM.
In practice: A sysadmin serving a 70B model on two A100 80GB GPUs and wanting to increase concurrent batch size from 8 to 16 requests enables FP8 KV cache quantization in vLLM by adding `--kv-cache-dtype fp8` to the launch command. It is important to distinguish this from weight quantization: the two approaches are orthogonal and can be combined. Measure quality degradation on long-range tasks (needle-in-haystack, multi-turn) before deploying to production, since precision loss in the cache is more visible over long contexts.
A Latent Consistency Model (LCM) is a diffusion model distilled to generate high-quality images in 4-8 steps instead of the 25-50 or more typically used by the original models. Consistency distillation trains the model to map any noisy latent directly to the clean output, so a handful of steps replaces the long iterative denoising process. LCM-LoRA applies this speedup to any existing Stable Diffusion model without requiring full distillation from scratch. The practical result is near-real-time image generation on a consumer GPU and the ability to iterate visually on prompts interactively.
In practice: A developer can use LCM-LoRA with HuggingFace diffusers by adding a single adapter to their existing Stable Diffusion pipeline: download the LCM-LoRA weight, set the scheduler to LCMScheduler, and reduce num_inference_steps to 4. Quality is close to that of a full-length run at a fraction of the latency. For real-time generative UI applications (e.g., interactive sketch-to-image), this speed is essential; LCMs are often combined with StreamDiffusion to further optimize throughput.
An AI model trained on huge amounts of text to predict the next word and generate natural language responses.
In practice: It is the engine behind ChatGPT, Claude, Gemini. When you embed an LLM into your product you pay per token and get a service that reads and writes text. Quality depends heavily on the chosen model and the prompt you give it.
A technique that uses an LLM (usually a strong one) to score another model's or its own answers against criteria written in natural language.
In practice: It speeds up evaluation dramatically compared to human judges, but suffers from biases (it prefers longer answers and its own style). It must be calibrated against a subset of human judgments used as an anchor.
Raw numeric scores the model produces for every possible vocabulary token, before being turned into probabilities.
In practice: They are the model's "unnormalized thinking": the higher a token's logit, the more likely it gets. Some APIs expose `logprobs` (logits after softmax and log) to gauge confidence or build classifiers. Working with raw logits is only relevant for fine-tuning or research.
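The path from logits to `logprobs` can be sketched with a three-token vocabulary. The vocabulary and logit values are invented for illustration; a real model produces one logit per entry in a vocabulary of tens of thousands of tokens.

```python
import math

def log_softmax(logits):
    """Convert raw logits into log-probabilities (numerically stable)."""
    m = max(logits)
    log_total = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_total for x in logits]

vocab = ["yes", "no", "maybe"]   # hypothetical tiny vocabulary
logits = [2.0, 1.0, 0.1]         # raw scores from the model's last layer
logprobs = log_softmax(logits)   # what a `logprobs` field would contain
probs = [math.exp(lp) for lp in logprobs]
# Greedy decoding simply picks the token with the highest logit.
greedy_token = vocab[max(range(len(logits)), key=logits.__getitem__)]
```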
A fine-tuning technique that trains only a small set of extra parameters instead of the whole model, cutting compute cost and the size of the resulting file.
In practice: Combined with a quantized base model (QLoRA), it lets you customize even a 70-billion-parameter model on a single high-memory GPU. You save adapters of a few MB to a few hundred MB that plug on top of the base model. It is the practical standard for adapting open-weight models to specific use cases.
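The savings follow from simple arithmetic: instead of updating a full weight matrix, LoRA trains two thin low-rank matrices whose product is added to it. The layer size and rank below are illustrative assumptions, and `lora_param_counts` is a hypothetical helper.

```python
def lora_param_counts(d_in, d_out, rank):
    """Compare trainable parameters: full matrix vs. a rank-`rank` LoRA adapter."""
    full = d_in * d_out              # parameters in the frozen weight matrix
    adapter = rank * (d_in + d_out)  # two low-rank matrices: (rank x d_in) and (d_out x rank)
    return full, adapter

# Hypothetical 4096 x 4096 projection layer adapted with rank 8.
full, adapter = lora_param_counts(d_in=4096, d_out=4096, rank=8)
ratio = adapter / full
```

For this layer the adapter trains 65,536 parameters against 16,777,216 in the full matrix, roughly 0.4%, which is why the saved file is so small.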
A formula that measures how far the model's prediction is from the correct answer: the higher it is, the more wrong the model is.
In practice: In LLMs the most used one is cross-entropy on next tokens. The loss value shown during training is the top signal to check whether the model is converging or there is a bug. A curve that stays flat from the start almost always means data or hyperparameter issues.
The phenomenon where an LLM remembers information at the start and end of the context better, while content in the middle is often ignored or forgotten.
In practice: Critical for RAG and long prompts: the order of documents matters. Put the key information at the start or the end. It is one of the reasons a 1M-token context window is not equivalent to actually using it all.
Many-shot jailbreaking is an attack technique that exploits long context windows by prepending 100-256 or more fake harmful question-answer pairs before the actual malicious request. The in-context examples override safety training by inducing the model to follow the demonstrated pattern rather than its guardrails. Effectiveness scales with context length: models with larger context windows are more vulnerable. The attack was disclosed by Anthropic in 2024 and prompted revisions to safety mechanisms for very long-context models.
In practiceFrom a defensive standpoint, a developer evaluating a deployed model's robustness should include many-shot tests in their red-teaming: construct a prompt with 200+ malicious Q&A examples and measure the model's compliance rate. To mitigate the risk in production, one can apply artificially capped context windows for certain tasks, input classifiers that detect repeated Q&A patterns on risky topics, or logging systems that flag unusually long prompts for review.
An open protocol introduced by Anthropic to connect AI models to external tools, data, and services in a standard way, like a USB port for LLMs.
In practiceInstead of writing custom integrations for every client (Claude Desktop, IDEs, agents), you publish an MCP server and all compatible clients use it. It is becoming the de facto standard for agent tooling.
A pretraining strategy (UL2, Google 2022) that trains a single model on multiple denoising objectives simultaneously: left-to-right language modeling, span corruption (T5-style masked spans of varying lengths and corruption rates), and prefix language modeling. It unifies the strengths of GPT-style and T5-style pretraining. The model learns when to use each mode based on a sentinel token that signals the objective type.
In practiceA researcher wanting a flexible model for both completion and question answering can use UL2 or a Flan-UL2 checkpoint without choosing between encoder-decoder (T5) and decoder-only (GPT) architectures. With the original UL2 checkpoint, the sentinel token `[S2S]`, `[NLU]`, or `[NLG]` must be prepended to the prompt to activate the correct mode, a detail that significantly impacts performance and is often omitted, causing poor results. Flan-UL2 was trained so that these mode tokens are no longer required.
A benchmark of about 16,000 multiple-choice questions across 57 subjects, from math and law to medicine, used to measure an LLM's general knowledge.
In practiceFor years it was the headline benchmark cited in new model announcements. Today it is saturated: frontier models score above 85%, and the field is moving to harder benchmarks like MMLU-Pro and GPQA.
An attack where an adversary repeatedly queries a model via API to reconstruct a functional copy of its weights or behavior.
In practiceA gray-area variant is distillation: training a smaller model on the outputs of a frontier model, which the terms of service of most providers prohibit. Mitigated with rate limits, watermarking, and fingerprint detection.
An architecture where the model is split into many specialized sub-models ('experts') and only a small share of them is activated for each token.
In practiceIt enables models with hundreds of billions of parameters but inference cost closer to a much smaller one. Mixtral and DeepSeek use it, and GPT-4 is widely reported to (OpenAI has not confirmed its architecture). For API users nothing changes, but it explains surprising quality-to-price ratios.
An architecture where multiple specialized AI agents collaborate to complete a complex objective, each with defined roles, tools, and communication protocols. An orchestrator agent decomposes the goal and dispatches subtasks to worker agents. Unlike single-agent loops, multi-agent systems enable parallelism, specialization, and fault isolation. The main patterns are: hierarchical (orchestrator→workers), sequential pipeline, and debate/critique among agents.
In practiceA developer building a complex RAG system can use an orchestration framework (AutoGen, CrewAI, Magentic-One) to route queries to specialized agents: one for web search, one for the vector database, one for final synthesis. Debugging requires tracing inter-agent communication: tools like LangSmith or Phoenix show which agent received which input and what it produced, making bottlenecks and infinite loops visible.
A model able to handle multiple input and output types together: text, images, audio, video. Not just reading but also generating multiple formats.
In practiceClaude and GPT-4 read images, Gemini handles video, some models talk in voice. For products this means analyzing receipt photos, screenshots, charts without a separate OCR. Watch out: visual input costs more tokens.
A test that hides a specific sentence inside a long irrelevant text and asks the model to retrieve it, to measure the real quality of the context window.
In practiceIt has become the de facto benchmark for long-context models (100K, 1M tokens). A model can advertise a huge context but fail NIAH beyond a certain depth, a sign the window is effectively 'fake'.
A neural codec is a neural network that compresses audio into discrete tokens via Residual Vector Quantization (RVQ) and reconstructs it with high fidelity. The process splits the audio signal into multi-level codes: the first level captures coarse structure, subsequent levels refine the details. This scheme enables LLMs to 'speak': audio tokens can be generated autoregressively just like text tokens. Key examples include SoundStream (Google), EnCodec (Meta), and DAC; Vocos is a related neural vocoder that decodes codec tokens back into audio. Codecs of this kind underlie models such as VALL-E, SoundStorm, and AudioPaLM.
In practiceA developer integrates a neural codec as the first stage of a speech LLM pipeline: Meta's EnCodec is available on HuggingFace and can be used with a few lines of Python to convert audio files into sequences of integer codes. These codes become the input/output of a standard transformer trained on text and speech. For real-time applications, Vocos offers a faster decoder than EnCodec that reconstructs audio from codes in a few milliseconds on CPU.
An 'open weights' model only releases downloadable parameters; an 'open source' one also publishes training data, recipes, and code in a reproducible way.
In practiceLlama, Mistral, DeepSeek are open weights but not full open source. For enterprise use open weights already let you run the model on-prem, fine-tune it, inspect it; but read the license carefully because it has usage limits.
A technique that splits the KV cache into small blocks managed like virtual memory pages, cutting VRAM waste between different requests.
In practiceIt is the core idea of vLLM and now standard in modern inference servers. It lets the same GPU serve many more users because it avoids reserving large mostly-empty blocks. When picking a self-hosted runtime, support for paged attention is a baseline requirement.
Pipeline parallelism is a distributed training strategy in which a neural network's layers are split into contiguous blocks, each assigned to a separate GPU. Each GPU processes its block of layers and passes activations to the next GPU, forming a pipeline. It differs from tensor parallelism, which splits individual weight matrices within a single layer. Combined with tensor parallelism and data parallelism it forms '3D parallelism', used by Megatron-LM to train models with hundreds of billions of parameters.
In practiceAn engineer training a model too large for a single GPU, or even for a single multi-GPU node, uses pipeline parallelism to distribute layers across multiple nodes. With DeepSpeed or Megatron-LM you configure the pipeline degree (number of stages) and the number of micro-batches to fill the pipeline and minimize bubble overhead (idle GPU time while the pipeline fills and drains). In inference, the same approach allows serving very large LLMs by distributing layers across multiple servers.
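To make the trade-off between stages and micro-batches concrete, here is a small sketch of the idle-time fraction. It assumes a simple GPipe-style schedule in which every stage takes the same time per micro-batch; interleaved schedules and uneven stages behave differently, and the function name is illustrative.

```python
def bubble_fraction(stages, micro_batches):
    """Share of pipeline time spent idle under a simple GPipe-style schedule.

    Assumes equal compute time per stage and per micro-batch.
    """
    return (stages - 1) / (micro_batches + stages - 1)

few_micro_batches = bubble_fraction(stages=4, micro_batches=4)    # 3/7
many_micro_batches = bubble_fraction(stages=4, micro_batches=32)  # 3/35
print(f"{few_micro_batches:.2f} {many_micro_batches:.2f}")  # prints "0.43 0.09"
```

Under these assumptions, raising the micro-batch count from 4 to 32 on a 4-stage pipeline cuts idle time from about 43% to about 9%, which is why the micro-batch count is tuned together with the pipeline degree.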
Information added to every token to tell the model where it sits in the sequence, because plain attention has no sense of order.
In practiceWithout positional encoding "dog bites man" and "man bites dog" would mean the same to the model. Early versions used sine/cosine functions; today most LLMs use RoPE because it extends better to long contexts.
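The sine/cosine scheme mentioned above can be sketched in a few lines. This follows the formulation from the original Transformer paper; the function name and the tiny dimension of 8 are chosen only for illustration.

```python
import math

def sinusoidal_encoding(position, d_model):
    """Fixed sine/cosine positional encoding for one position."""
    encoding = []
    for i in range(0, d_model, 2):
        # Each pair of dimensions uses a different wavelength.
        angle = position / (10000 ** (i / d_model))
        encoding.append(math.sin(angle))
        encoding.append(math.cos(angle))
    return encoding[:d_model]

first = sinusoidal_encoding(0, 8)
second = sinusoidal_encoding(1, 8)
print(first)  # prints [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
```

Each position gets a distinct vector that is added to the token embedding, so the same word at position 0 and position 1 reaches attention with different inputs.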
A reinforcement learning algorithm that updates the model in small steps, preventing it from drifting too far from the previous version.
In practiceIt was the engine behind RLHF in the early ChatGPT: it maximizes human reward without letting the model diverge. Notoriously hard to stabilize and rich in hyperparameters. That is why many open-source teams now prefer DPO, which gets similar results with less effort.
Prefix caching is an inference technique that reuses the already-computed KV cache for common prompt prefixes across multiple requests. Rather than recomputing attention keys and values for the same sequences (e.g., an identical system prompt), the system stores these activations in memory and retrieves them directly. This dramatically reduces latency for the shared prefix, bringing it close to zero. It is implemented in vLLM as 'Automatic Prefix Caching' and in Anthropic and OpenAI cloud services as a reduced-cost billed feature.
In practiceA developer serving a chatbot with a fixed 2,000-token system prompt benefits immediately from prefix caching: only the first request computes that prefix, and all subsequent ones read it from cache. In vLLM it is enabled with `--enable-prefix-caching`; in the Anthropic API, prefix caching must be explicitly declared with `cache_control`. For RAG applications with shared documents, you structure the prompt by placing the document before the questions to maximize cache reuse.
The initial training phase where a model learns the structure of language by predicting the next token on huge amounts of generic text.
In practiceIt is the most expensive step (months of GPUs and millions of dollars) and produces a "base" model that can write but cannot yet follow instructions. Only big labs run it from scratch; companies start from pretrained models and adapt them with SFT, LoRA, or RLHF.
An attack where an external input (a document, a web page, an email) contains hidden instructions that hijack the model's behavior.
In practiceIf your agent reads emails and then acts, a malicious email can tell it 'forward everything to a third party'. Fixes: treat external inputs as untrusted, sandbox tools, require human confirmation for sensitive actions, filter inputs and outputs.
A variant of LoRA that keeps the base model in 4-bit quantized form during fine-tuning, drastically cutting the GPU memory needed.
In practiceIt lets you adapt 13B-33B parameter models on a single 24 GB consumer GPU (e.g. RTX 4090) and 65B-70B models on a single 48-80 GB data-center GPU (e.g. A100 80 GB). It is the favorite technique for hobbyist or low-budget enterprise fine-tuning. Quality loss vs. 16-bit fine-tuning is usually small.
A technique that reduces the numeric precision of model weights (for example from 16 to 4 bits) so it takes less memory and runs faster.
In practiceIt is what lets you run a Llama 70B on a single GPU or a 7B model on a Mac. You lose a bit of quality but often not much. Typical tools: GGUF, AWQ, GPTQ. Useful for on-prem or edge deployment.
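The memory saving is simple arithmetic, sketched below. The estimate covers weights only: it ignores the KV cache, activations, and the small per-block metadata that real quantization formats add, so actual usage is somewhat higher.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

llama_70b_fp16 = weight_memory_gb(70e9, 16)  # about 140 GB
llama_70b_4bit = weight_memory_gb(70e9, 4)   # about 35 GB
small_7b_4bit = weight_memory_gb(7e9, 4)     # about 3.5 GB
```

Going from 16 to 4 bits divides weight memory by four, which is the difference between needing several data-center GPUs and fitting on one.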
A technique that fetches relevant text from an external data source and inserts it into the model's prompt before generating the response.
In practiceIt lets an LLM answer using company documents, internal knowledge bases, or up-to-date articles without training. It cuts hallucinations on specific data and refreshes knowledge without re-training. It is the first architecture to consider for an enterprise chatbot.
A pattern where an agent alternates textual reasoning steps (Thought) with concrete actions (Action) on tools, observing the result before the next step.
In practiceIt is the backbone of most modern LLM agents: the model writes what it intends to do, calls a tool, reads the response, then decides the next step. It makes the agent's decisions inspectable and debuggable.
A model trained to reason at length before answering, generating intermediate steps (even minutes of 'thinking') for hard math, code, or analysis problems.
In practiceExamples: OpenAI's o1 and o3, Claude with extended thinking, DeepSeek-R1. They cost much more and are slower, so use them only where you really need them. For plain chat a standard model is enough and cheaper.
A practice where a team actively tries to attack a model or an AI system, looking for jailbreaks, security holes, and dangerous uses, in order to find them before release.
In practiceAI labs do it in-house and with external experts before shipping a model. If you put AI in production do the same on your product: ask colleagues to break it before customers do. Even one rough hour beats the first public bug.
A technique where an agent, after a failed attempt, generates a verbal self-critique and stores it in memory to improve the next attempt.
In practiceUseful for tasks with clear feedback (failing tests, wrong answers). The agent learns from its mistakes within the same session, without fine-tuning. Often boosts success on coding and reasoning benchmarks.
A secondary model that reorders the results of an initial search (vector or keyword) by ranking them on relevance to the query.
In practiceTypically you retrieve 50-100 candidates with a fast method, then let the reranker (e.g. Cohere Rerank, BGE) rescore them and keep only the top 5-10. It is one of the cheapest ways to lift the quality of a RAG pipeline.
The design of reward signals that guide reinforcement learning without overfitting to proxy measures. Poorly shaped rewards lead to reward hacking: the agent optimizes the metric instead of solving the real task. LLMs can now automate reward design (Eureka, NVIDIA): GPT-4 writes Python reward functions, runs them in simulation, and iterates based on agent performance. It is critical for robotics, game AI, and RLHF.
In practiceA researcher training a robot to walk must balance rewards for speed, stability, and energy consumption: too much emphasis on speed produces bizarre gaits or reward hacking. With Eureka, the task is described in natural language and an LLM automatically generates the reward function, running it in Isaac Gym simulation and refining weights based on performance metrics. The same principle applies to RLHF: the language model's reward function must capture 'real utility', not just 'sounds convincing'.
A variant of RLHF where responses are judged not by a human but by another AI model, cutting cost and time compared to manual annotation.
In practiceIt lets alignment training scale to much larger volumes. Anthropic uses it for Claude together with Constitutional AI. The risk is amplifying the judge model's biases, so human oversight is still needed.
A training technique where humans rate and rank model responses, and these preferences are used to steer learning toward more helpful and safer answers.
In practiceIt is the step that turned ChatGPT into something useful versus a raw predictive model. For API users RLHF has already been done by the provider. Knowing about it explains why more 'aligned' models sometimes refuse legitimate requests.
A positional encoding technique that rotates token vectors based on their position, baking order directly into attention.
In practiceIt has become the de facto standard: Llama, Mistral, Qwen, DeepSeek, and most other recent open-weight models use it. It lets you stretch the context beyond the training length with tricks like NTK-aware scaling or YaRN. If you fine-tune on long contexts, understanding RoPE is almost mandatory.
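A minimal sketch of the rotation, assuming the formulation that rotates consecutive pairs of dimensions (some implementations pair the first and second halves of the vector instead). The vectors `q` and `k` are arbitrary example values. The point it shows is that the attention score between a rotated query and key depends only on how far apart they are, not on their absolute positions.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate each consecutive pair of dimensions by a position-dependent angle."""
    d = len(vec)
    rotated = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        x, y = vec[i], vec[i + 1]
        rotated.append(x * math.cos(theta) - y * math.sin(theta))
        rotated.append(x * math.sin(theta) + y * math.cos(theta))
    return rotated

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.8, 0.5]   # example query vector
k = [1.0, 0.4, -0.7, 0.2]   # example key vector

# Same offset of 3 positions, at two very different places in the sequence.
score_near_start = dot(rope(q, 5), rope(k, 2))
score_far_along = dot(rope(q, 105), rope(k, 102))
```

Both scores are equal up to floating-point error, which is the relative-position property that context-extension tricks build on when they rescale the rotation angles.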
A separate model that analyzes the input or output of an LLM to catch unsafe, violent, illegal, or off-policy content before it reaches the user.
In practiceIt is a safety net in cascade: if the main model slips, the classifier blocks it. OpenAI Moderation and Meta's Llama Guard are free examples. For public services having one is almost mandatory.
A technique that samples multiple independent answers from the model with temperature > 0 and picks the most frequent one by majority vote.
In practiceIt often improves accuracy on math reasoning tasks: if 5 out of 7 thought chains converge on the same answer, it is likely correct. Inference cost scales linearly with the number of samples: 5 samples cost about 5x a single call.
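The voting step itself is small. In the sketch below, `sampled_answers` is a hypothetical stand-in for the final answers extracted from seven independently sampled reasoning chains; the sampling and answer extraction are not shown.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent answer and the share of samples that agree with it."""
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)

# Hypothetical final answers from 7 sampled reasoning chains.
sampled_answers = ["42", "42", "41", "42", "42", "40", "42"]
final_answer, agreement = majority_vote(sampled_answers)
print(final_answer, round(agreement, 2))  # prints "42 0.71"
```

The agreement ratio doubles as a cheap confidence signal: a low ratio suggests the question deserves more samples or a stronger model.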
Fine-tuning where the model learns from input-output pairs written by humans, for example questions with ideal answers.
In practiceIt is the first step in turning a base model into an instruction-following assistant. A few thousand high-quality examples are enough for large gains in a domain. In practice it is almost always the first option before moving to RLHF or DPO.
The process of training a robot policy in simulation (fast, cheap, safe) and then deploying it on real hardware without retraining. The 'reality gap' (differences in physics, friction, sensor noise) causes policies to fail. Domain randomization (randomizing simulation parameters) teaches robustness. LLMs can automate this process (DrEureka): they generate randomization ranges so policies transfer zero-shot to real hardware.
In practiceA robotics team building an arm for industrial picking trains thousands of policies in parallel on Isaac Sim or MuJoCo, randomly varying object mass, friction, lighting, and motor delays. The best policy is then deployed on the physical robot without further training. With DrEureka, an LLM automatically suggests randomization ranges from the task description, reducing days of manual tuning to a few hours of automated search.
Models that behave aligned during training and evaluation but exhibit malicious behavior only under specific conditions, such as a given date or phrase.
In practiceStudied by Anthropic in 2024: they showed standard safety fine-tuning does not remove deliberately planted backdoors. The term sandbagging refers to a model intentionally pretending to be less capable than it is.
A Small Language Model (SLM) is a language model in the 1B-7B parameter range, optimized to maximize quality-per-parameter rather than raw capability. The key insight from Microsoft's Phi series is that training on 'textbook quality' synthetic data enables a 1.3B model to rival much larger models on reasoning benchmarks. SLMs run on laptops, smartphones, and embedded devices without a dedicated GPU. Representative examples include Phi-1.5, Phi-3, Gemma 2B, Qwen 1.5B, and SmolLM.
In practiceA developer chooses an SLM when deploying an AI assistant on edge hardware (Raspberry Pi, Android phone, corporate laptop) where a 70B LLM would be impractical. With llama.cpp or Ollama, a 4-bit quantized Phi-3 Mini runs at usable speed on most modern laptop CPUs. SLMs are also a good fit for specialized tasks: fine-tuning on a specific domain can produce compact models that match or beat much larger general-purpose models on that narrow task.
A math function that turns a set of logits into probabilities that sum to 1, amplifying high values and squashing low ones.
In practiceIt is the last step before picking the next token: it tells how strongly the model "believes" in each option. It also appears inside attention to weight context tokens. If you call APIs it is invisible; if you study models, it is one of the most recurring functions.
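A minimal sketch of the function, using the common trick of subtracting the largest logit before exponentiating to avoid overflow. The three logits are arbitrary example values.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])  # prints [0.659, 0.242, 0.099]
```

A gap of 1.0 between the top two logits already gives the first option almost three times the probability of the second, which is the "amplifying" effect described above.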
A technique where a small fast model proposes several tokens ahead and the large model verifies them in a single pass, accepting the correct ones.
In practiceIt can produce answers 2-3x faster with no change in final quality, because the big model stays the judge. It is reported to be used in production by major providers and is available in self-hosted runtimes. It needs a "draft" model aligned with the main one, so it is not free to set up.
A mode where the model is constrained to produce output conforming to a schema (JSON, regex, grammar) instead of free text.
In practiceEssential when the output feeds another system: API, database, frontend. Providers like OpenAI and Anthropic offer native enforcement that guarantees valid JSON on the first try.
A family of techniques that splits text into pieces smaller than a whole word but larger than a single character.
In practiceIt is a trade-off between huge vocabularies (one word = one token) and tiny ones (one character = one token). It handles unseen words, typos, and many languages without blowing up in size. Every modern LLM uses some form of subword tokenization.
A benchmark of over 2,000 real GitHub issues from Python repositories: the model must produce a patch that makes the project's tests pass.
In practiceIt measures real software-engineering ability (reading a codebase, debugging, cross-file edits), not isolated coding. It has become the reference for agents like Devin, Claude Code, and OpenAI Codex.
Training data generated by another AI model instead of collected from humans.
In practiceIt is now a pillar of modern training: big models produce examples to train smaller ones (distillation) or to cover rare cases. It must be filtered carefully, because generator errors compound in the final model. Nvidia, Meta, and Anthropic use it heavily.
A parameter that scales the logits before sampling: low values make the model more deterministic, high values more creative and unpredictable.
In practiceAt 0 the model always picks the most likely token (effectively greedy); at 1 it keeps the original distribution; above 1.5 it tends to go off the rails. For classification or extraction use 0; for creative writing 0.7-1.0. It is the simplest knob to tune in any API.
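The effect of the parameter can be sketched directly: divide the logits by the temperature, then normalize. Treating a temperature of exactly 0 as greedy selection is a convention assumed here (dividing by zero is undefined); the function name and logits are illustrative.

```python
import math

def distribution_at_temperature(logits, temperature):
    """Probabilities after scaling logits by 1/temperature; 0 is treated as greedy."""
    if temperature == 0:
        best = logits.index(max(logits))
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
greedy = distribution_at_temperature(logits, 0)
sharp = distribution_at_temperature(logits, 0.5)
original = distribution_at_temperature(logits, 1.0)
flat = distribution_at_temperature(logits, 2.0)
```

The top token's probability shrinks as the temperature rises (`sharp[0] > original[0] > flat[0]`), so unlikely tokens get sampled more often.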
The basic unit the model breaks text into: it can be a whole word, a syllable, or a few characters, depending on the tokenizer.
In practiceLLM APIs charge per input and output token. In English 1 token is roughly 0.75 words, in Italian a bit less. Counting tokens in your prompt helps estimate cost and stay within the context limit.
The component that turns text into tokens before passing it to the model and rebuilds text from output tokens.
In practiceDifferent tokenizers produce different counts: the same text may cost more tokens on GPT than on Claude or the other way around. Libraries like tiktoken (OpenAI) let you count tokens locally before calling the API.
The model's ability to return a structured request to run an external function (search the web, read a file, write to a database) and then resume reasoning with the result.
In practiceYou declare the functions with name, parameters, and description; the model picks when to call them. It is the building block of every agent. Validate arguments carefully: the model sometimes invents parameters or forgets them.
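One way to validate arguments before running anything is sketched below. The `TOOLS` registry, its layout, and the `search_web` tool are hypothetical and not tied to any provider's schema format; production code would more likely validate against a JSON Schema.

```python
def validate_tool_call(call, tools):
    """Return a list of problems; an empty list means the call looks safe to run."""
    spec = tools.get(call.get("name"))
    if spec is None:
        return [f"unknown tool: {call.get('name')!r}"]
    args = call.get("arguments", {})
    problems = []
    for name in spec["required"]:
        if name not in args:
            problems.append(f"missing required argument: {name}")
    for name, value in args.items():
        expected = spec["parameters"].get(name)
        if expected is None:
            problems.append(f"unexpected argument: {name}")
        elif not isinstance(value, expected):
            problems.append(f"wrong type for {name}: expected {expected.__name__}")
    return problems

# Hypothetical tool registry: parameter names mapped to expected Python types.
TOOLS = {
    "search_web": {
        "parameters": {"query": str, "max_results": int},
        "required": ["query"],
    }
}

good_call = {"name": "search_web", "arguments": {"query": "MCP spec", "max_results": 3}}
bad_call = {"name": "search_web", "arguments": {"max_results": "3", "language": "en"}}

print(validate_tool_call(good_call, TOOLS))  # prints []
print(len(validate_tool_call(bad_call, TOOLS)))  # prints 3
```

Returning the list of problems to the model as the tool result, instead of raising, usually lets it correct the call on the next step.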
An LLM from Meta trained to autonomously decide when and how to call external APIs such as a calculator, translator, or search engine.
In practiceIt is one of the first works to show that an LLM can learn tool use in a self-supervised way, without human examples. Today the idea lives on in the native function calling of modern models.
A next-token selection strategy that keeps only the k most likely candidates and discards the rest before sampling.
In practiceWith k=1 it becomes greedy decoding; with large k it is almost the full distribution again. It is used to stop the model from picking absurd words from the tail. Modern APIs often replace or combine it with top-p, which is considered more adaptive.
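A minimal sketch of the filtering step, operating on an already-normalized distribution. The tokens and probabilities are invented for illustration; real implementations work on logit tensors over the full vocabulary.

```python
def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize their probabilities."""
    kept = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

# Hypothetical next-token distribution.
next_token_probs = {"cat": 0.5, "dog": 0.3, "car": 0.15, "xylophone": 0.05}

filtered = top_k_filter(next_token_probs, k=2)  # only "cat" and "dog" survive
greedy = top_k_filter(next_token_probs, k=1)    # only "cat" survives
```

With `k=2` the tail tokens are removed and the remaining mass is rescaled to roughly 0.625 for "cat" and 0.375 for "dog"; sampling then proceeds from that smaller set.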
A strategy that picks the next token from the smallest set of candidates whose cumulative probability exceeds a threshold p (e.g. 0.9).
In practiceIt adapts the candidate set to context: few options when the model is confident, many when it is unsure. It is one of the most common sampling parameters in APIs (`top_p` on OpenAI, Anthropic, etc.) to tune creativity without losing coherence. Typical values sit between 0.8 and 0.95.
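The adaptive behavior is easy to see in a small sketch. The two distributions are invented, and the cutoff here keeps tokens until the cumulative probability reaches or passes `p`; implementations differ slightly on that boundary.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    # Renormalize so the kept probabilities sum to 1.
    return {token: prob / cumulative for token, prob in kept}

# Hypothetical distributions: one confident, one unsure.
confident = {"Paris": 0.92, "Lyon": 0.05, "Nice": 0.02, "Rome": 0.01}
unsure = {"red": 0.3, "blue": 0.28, "green": 0.22, "pink": 0.2}

confident_set = top_p_filter(confident, p=0.9)  # 1 candidate
unsure_set = top_p_filter(unsure, p=0.9)        # 4 candidates
```

The same threshold keeps one candidate in the confident case and all four in the unsure case, which a fixed top-k cannot do.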
A neural network architecture introduced by Google in 2017 that uses the attention mechanism to process text in parallel rather than word by word.
In practiceIt is the foundation of basically every modern LLM. If you build products you do not need to implement it from scratch: you use frameworks like PyTorch or call APIs. Knowing it is parallelizable explains why training needs heavy GPUs.
A reasoning strategy where the model explores multiple thought branches in parallel, evaluates them, and keeps only the promising ones, like a tree search.
In practiceIt extends Chain-of-Thought by allowing backtracking: useful for puzzles, planning, and math problems where a single linear path often fails. It costs many more tokens than standard inference.
A database specialized in storing embeddings and quickly finding the vectors most similar to a query, even across millions of records.
In practiceExamples: Pinecone, Weaviate, Qdrant, pgvector on Postgres. You pick based on scale, cost, and whether you want to self-host or use cloud. It is the key infrastructure for a RAG system searching a company knowledge base.
A Vision-Language-Action Model (VLA) is a neural network that takes visual observations and natural language instructions as input and directly outputs robot actions such as end-effector coordinates or joint commands. It extends vision-language models (VLMs) by adding an action head trained on robot trajectory data. Notable examples include RT-2 (Google DeepMind), OpenVLA (Berkeley), GR-2 (ByteDance), and Helix (Figure AI). The result is a robot that can interpret a command like 'pick up the red cup' by looking at the scene and translating it into precise physical movements.
In practiceA developer working with VLAs typically starts from a pretrained checkpoint (e.g., OpenVLA on HuggingFace) and fine-tunes it on teleoperation data collected from their own robot using LoRA or full fine-tuning. The model input is an RGB image from the robot's camera concatenated with the text instruction; the output is an action vector (end-effector pose, gripper aperture). The deployment pipeline uses ROS 2 or LeRobot to close the control loop at 5-10 Hz inference frequency.
Voice cloning is the ability to synthesize speech in a target speaker's voice from just a few seconds of reference audio. The model extracts a speaker embedding from the reference audio and conditions generation on it, replicating timbre, rhythm, and prosodic characteristics. Zero-shot means no per-speaker training or fine-tuning is needed: the reference clip is supplied at inference time. Systems like ElevenLabs, XTTS v2, CosyVoice, and Dia TTS have made this technology accessible via API or open-weights models.
In practiceA developer cloning a voice with XTTS v2 (open weights, available on HuggingFace; check the license terms before commercial use) provides 6-10 seconds of clean reference audio and the text to synthesize; the Coqui TTS library handles embedding extraction and synthesis in a few seconds. For professional productions, the ElevenLabs API accepts an audio clip and returns a reusable voice_id. It is essential to verify the original speaker's consent before cloning their voice, in compliance with applicable regulations.
A technique that embeds an invisible statistical signal into text or images generated by a model, so they can later be identified as AI-produced.
In practiceGoogle's SynthID, for example, marks Gemini's text and images. Useful against disinformation, deepfakes, and plagiarism. Limit: it often breaks under rewrites, translation, or minor edits, and works only if the provider cooperates.
Two subword tokenization approaches often contrasted with plain BPE: WordPiece is the algorithm used in BERT, SentencePiece is the library used by T5 and Gemini.
In practiceWordPiece chooses merges by how much they increase the likelihood of the training data rather than by raw frequency. SentencePiece (which implements both BPE and unigram models) works directly on the raw string without assuming spaces, so it handles Chinese, Japanese, and other space-less languages better. Switching tokenizer requires retraining the model.
A neural network that predicts future sensory observations given current observations and actions, simulating how the world will respond to a robot or agent's behavior. It enables planning without physical interaction: 'imagining' the consequences of an action before executing it. In robotics (1X Technologies, Dreamer), world models enable real-time planning. In LLM agents, they underpin speculative execution and lookahead search.
In practiceAn agent tasked with moving objects on a table can use a world model to internally simulate thousands of action sequences and select the one with the highest probability of success before moving the physical arm. For LLM agent developers, an implicit world model is built by maintaining a structured 'state scratchpad' that the model updates at each step, a technique used in systems like Voyager (Minecraft) and in planning agents with tool use.
The model's ability to perform a task it never saw during training, based only on the description we give in the prompt, with no examples.
In practiceIt is what most of us do when we write 'summarize this text in three bullets'. If results are inconsistent, moving to few-shot with examples is the fastest fix. Useful to prototype new flows quickly.