Glossary

AI terms, explained without jargon.

Each entry: a one-sentence definition, plus two sentences on what it means in practice for builders or decision makers. Sizes in the cloud below reflect how often a term appears in archive entries: the big ones are the ones worth knowing.

121 terms

Concept map PDF / Print CSV / Anki

Bigger = more recurrent in archive entries. Clickable.

Mixture of Denoisers Agent LLM Pipeline Parallelism Multimodal RAG MoE Transformer Foundation model Attention Fine-tuning Quantization RoPE Tool use Voice Cloning Alignment Jailbreak Prompt injection Vision-Language-Action Model Chain-of-thought MCP SWE-bench Differential privacy FP8 Pretraining Synthetic Data Token Diffusion model HumanEval Instruction Tuning RLHF Small Language Model Constitutional AI Diffusion Policy FlashAttention Hallucination KV Cache Paged Attention Red teaming Autoregressive Distillation DPO Few-shot learning Function calling LoRA Speculative Decoding ARC-AGI Continuous Batching Cross-Embodiment Data poisoning Embedding Fill-In-the-Middle Inference compute PPO Prefix Caching ReAct Self-consistency Sleeper agents World Model ASL Context window Disaggregated Inference DreamBooth Frontier model Indirect Prompt Injection K-Quants KV Cache Quantization Latent Consistency Model MMLU QLoRA Reasoning model Reflexion RLAIF Safety classifier Tokenizer Toolformer Tree of Thoughts Vector database Watermarking Adversarial example AI Supply Chain Attack Backdoor attack Beam Search BM25 BPE Catastrophic Forgetting Causal Mask Checkpoint Chunking Cosine similarity Cross-encoder vs bi-encoder Decoder-only GPQA Gradient Descent Greedy Decoding HELM HNSW Hybrid search LLM-as-judge Logits Loss Function Lost in the middle Many-Shot Jailbreaking Model extraction Multi-Agent Orchestration Needle in a Haystack Neural Audio Codec Open weights vs open source Positional Encoding Reranker Reward Shaping SFT Sim-to-Real Transfer Softmax Structured output Subword Tokenization Temperature Top-k Sampling Top-p Sampling WordPiece / SentencePiece Zero-shot learning

Level:

Models

Attention

Beginner Attenzione · Self-attention 12

A mechanism that lets the model weigh how relevant each word in the text is compared to the others to understand the meaning of the context.

In practice It is why an LLM knows that 'he' in a sentence refers to a person mentioned earlier. Compute cost grows with the square of context length: this is why very long contexts are expensive.

→ transformer context-window

Autoregressive

Intermediate Autoregressivo 3

A model that generates a sequence one element at a time, each time using the previous output as part of the new input.

In practice It is how every GPT-style LLM works: each new token depends on all the previous ones. It explains why generation is inherently sequential and hard to parallelize, and is the reason behind tricks like speculative decoding to speed it up.

→ causal-mask decoder-only transformer llm greedy-decoding

Causal Mask

Intermediate Maschera causale · Maschera autoregressiva

A filter applied inside attention that prevents each token from seeing tokens that come after it in the sequence.

In practice It is what makes a Transformer "causal" or decoder-only: during training the model learns to predict the next token without cheating by looking ahead. At inference time the mask becomes implicit because future tokens do not yet exist. Without it GPT would not make sense.

→ attention autoregressive decoder-only transformer

Cross-Embodiment

Advanced Cross-Robot Transfer · Embodiment Generalization 2

Training a single robot policy that works across different hardware configurations (different arm DOFs, grippers, sensors, mobile bases). Like foundation models for text, cross-embodiment models (RT-2, CrossFormer, Open X-Embodiment) learn general manipulation skills from diverse robot data. Reduces the need to collect data per robot configuration separately.

In practice A company with multiple robot models in production can train a single cross-embodiment model on all collected data, instead of maintaining separate policies for each robot. In practice, the Open X-Embodiment dataset aggregates over 1 million episodes from 22 different robots; a researcher can fine-tune this model on a few examples from their specific robot and achieve better performance than training from scratch.

→ foundation-model fine-tuning synthetic-data

Decoder-only

Intermediate Modello decoder-only · Solo decoder

A Transformer architecture made up of only the decoder side, where each token looks only at previous tokens to predict the next one.

In practice It is the architecture of GPT, Llama, Mistral, Claude, and basically every modern generative LLM. It contrasts with encoder-only (BERT, for classification) and encoder-decoder (T5, for translation). Its simplicity is the reason it scales so well in pretraining.

→ transformer autoregressive causal-mask llm

Diffusion model

Beginner Modello di diffusione 5

A type of generative model that starts from random noise and gradually shapes it into a coherent image, video, or audio through many small steps.

In practice It powers Stable Diffusion, Midjourney, Sora. When you integrate image generation what matters is the trade-off between quality, speed (number of steps), and control. Costs are in GPU-seconds rather than tokens.

→ multimodal foundation-model

Foundation model

Beginner Modello di base · Base model 14

A large model trained on very general data, designed to be reused and adapted for many different tasks rather than serving a single purpose.

In practice GPT-4, Claude, Llama are foundation models. For most use cases you do not train a new one: you use it via API or open weights and adapt it with prompting, RAG, or a small fine-tune on top.

→ llm fine-tuning frontier-model

Frontier model

Beginner Modello di frontiera 1

An AI model among the most capable existing right now, at the edge of what is achievable. It often comes with new risks and new capabilities still poorly understood.

In practice Current examples: latest Claude, next-gen GPT-4, Gemini Ultra. They cost more but do things smaller models cannot. For serious projects, benchmark on your own use case: sometimes a mid-tier model is plenty.

→ foundation-model asl inference-compute

Latent Consistency Model

Advanced LCM · Latent Consistency Distillation 1

A Latent Consistency Model (LCM) is a diffusion model distilled to generate high-quality images in 4-8 steps instead of the 50+ required by original models. Consistency distillation trains the model to map any noisy latent directly to the clean output in a single step, eliminating the iterative denoising process. LCM-LoRA applies this speedup to any existing Stable Diffusion model without requiring full distillation from scratch. The practical result is real-time image generation (~30 fps on a consumer GPU) and the ability to iterate visually on prompts interactively.

In practice A developer can use LCM-LoRA with HuggingFace diffusers by adding a single adapter to their existing Stable Diffusion pipeline: download the LCM-LoRA weight, set the scheduler to LCMScheduler, and reduce num_inference_steps to 4. The quality is equivalent to 50 steps but 10x faster. For real-time generative UI applications (e.g., interactive sketch-to-image), this speed is essential; LCMs are often combined with StreamDiffusion to further optimize throughput.

→ diffusion-model distillation quantization

LLM /el-el-em/

Beginner Large Language Model · Modello linguistico di grandi dimensioni 62

An AI model trained on huge amounts of text to predict the next word and generate natural language responses.

In practice It is the engine behind ChatGPT, Claude, Gemini. When you embed an LLM into your product you pay per token and get a service that reads and writes text. Quality depends heavily on the chosen model and the prompt you give it.

→ transformer foundation-model token context-window

MoE /em-oh-ee/

Intermediate Mixture of Experts · Miscela di esperti 20

An architecture where the model is split into many specialized sub-models ('experts') and only a small share of them is activated for each token.

In practice It enables models with hundreds of billions of parameters but inference cost closer to a much smaller one. Mixtral, DeepSeek, and GPT-4 use it. For API users nothing changes, but it explains surprising quality-to-price ratios.

→ llm inference-compute

Multimodal

Beginner Multimodale 30

A model able to handle multiple input and output types together: text, images, audio, video. Not just reading but also generating multiple formats.

In practice Claude and GPT-4 read images, Gemini handles video, some models talk in voice. For products this means analyzing receipt photos, screenshots, charts without a separate OCR. Watch out: visual input costs more tokens.

→ llm foundation-model diffusion-model

Neural Audio Codec

Intermediate Neural Audio Codec · Audio Codec Model

A neural codec is a neural network that compresses audio into discrete tokens via Residual Vector Quantization (RVQ) and reconstructs it with high fidelity. The process splits the audio signal into multi-level codes: the first level captures coarse structure, subsequent levels refine the details. This scheme enables LLMs to 'speak': audio tokens can be generated autoregressively just like text tokens. Key examples include SoundStream (Google), EnCodec (Meta), DAC, and Vocos, all used by models such as VALL-E, SoundStorm, and AudioPaLM.

In practice A developer integrates a neural codec as the first stage of a speech LLM pipeline: Meta's EnCodec is available on HuggingFace and can be used with a few lines of Python to convert audio files into sequences of integer codes. These codes become the input/output of a standard transformer trained on text and speech. For real-time applications, Vocos offers a faster decoder than EnCodec that reconstructs audio from codes in a few milliseconds on CPU.

→ autoregressive quantization multimodal

Open weights vs open source

Intermediate Pesi aperti · Modelli aperti

An 'open weights' model only releases downloadable parameters; an 'open source' one also publishes training data, recipes, and code in a reproducible way.

In practice Llama, Mistral, DeepSeek are open weights but not full open source. For enterprise use open weights already let you run the model on-prem, fine-tune it, inspect it; but read the license carefully because it has usage limits.

→ foundation-model llm

Positional Encoding

Intermediate Encoding posizionale · Codifica posizionale

Information added to every token to tell the model where it sits in the sequence, because plain attention has no sense of order.

In practice Without positional encoding "dog bites man" and "man bites dog" would mean the same to the model. Early versions used sine/cosine functions; today most LLMs use RoPE because it extends better to long contexts.

→ transformer attention rope context-window

Reasoning model

Beginner Modello di ragionamento · Thinking model 1

A model trained to reason at length before answering, generating intermediate steps (even minutes of 'thinking') for hard math, code, or analysis problems.

In practice Examples: OpenAI's o1 and o3, Claude with extended thinking, DeepSeek-R1. They cost much more and are slower, so use them only where you really need them. For plain chat a standard model is enough and cheaper.

→ chain-of-thought inference-compute

RoPE /rope/

Advanced Rotary Position Embedding · Embedding posizionale rotatorio 11

A positional encoding technique that rotates token vectors based on their position, baking order directly into attention.

In practice It has become the de facto standard: Llama, Mistral, Qwen, DeepSeek, and GPT-4 class models all use it. It lets you stretch the context beyond the training length with tricks like NTK-aware or YaRN. If you fine-tune on long contexts, understanding RoPE is almost mandatory.

→ positional-encoding transformer attention context-window

Small Language Model

Beginner SLM · Small LLM 5

A Small Language Model (SLM) is a language model in the 1B-7B parameter range, optimized to maximize quality-per-parameter rather than raw capability. The key insight from Microsoft's Phi series is that training on 'textbook quality' synthetic data enables a 1.3B model to rival much larger models on reasoning benchmarks. SLMs run on laptops, smartphones, and embedded devices without a dedicated GPU. Representative examples include Phi-1.5, Phi-3, Gemma 2B, Qwen 1.5B, and SmolLM.

In practice A developer chooses an SLM when deploying an AI assistant on edge hardware (Raspberry Pi, Android phone, corporate laptop) where a 70B LLM would be impractical. With llama.cpp or Ollama, a 4-bit quantized Phi-3 Mini runs at acceptable speed on any modern CPU. SLMs are also ideal for specialized tasks: fine-tuning on a specific domain with limited data produces compact models that outperform GPT-4 in that target domain.

→ llm quantization inference-compute synthetic-data

Transformer

Beginner Architettura Transformer 19

A neural network architecture introduced by Google in 2017 that uses the attention mechanism to process text in parallel rather than word by word.

In practice It is the foundation of basically every modern LLM. If you build products you do not need to implement it from scratch: you use frameworks like PyTorch or call APIs. Knowing it is parallelizable explains why training needs heavy GPUs.

→ attention llm foundation-model

Vision-Language-Action Model

Advanced Vision-Language-Action Model · VLA 8

A Vision-Language-Action Model (VLA) is a neural network that takes visual observations and natural language instructions as input and directly outputs robot actions such as end-effector coordinates or joint commands. It extends vision-language models (VLMs) by adding an action head trained on robot trajectory data. Notable examples include RT-2 (Google DeepMind), OpenVLA (Berkeley), GR-2 (ByteDance), and Helix (Figure AI). The result is a robot that can interpret a command like 'pick up the red cup' by looking at the scene and translating it into precise physical movements.

In practice A developer working with VLAs typically starts from a pretrained checkpoint (e.g., OpenVLA on HuggingFace) and fine-tunes it on teleoperation data collected from their own robot using LoRA or full fine-tuning. The model input is an RGB image from the robot's camera concatenated with the text instruction; the output is an action vector (end-effector pose, gripper aperture). The deployment pipeline uses ROS 2 or LeRobot to close the control loop at 5-10 Hz inference frequency.

→ multimodal fine-tuning foundation-model

World Model

Advanced Predictive World Model · Environment Model 2

A neural network that predicts future sensory observations given current observations and actions, simulating how the world will respond to a robot or agent's behavior. Enables planning without physical interaction: 'imagining' the consequences of an action before executing it. In robotics (1X Technologies, DREAMER), world models enable real-time planning. In LLM agents, they underpin speculative execution and lookahead search.

In practice An agent tasked with moving objects on a table can use a world model to internally simulate thousands of action sequences and select the one with the highest probability of success before moving the physical arm. For LLM agent developers, an implicit world model is built by maintaining a structured 'state scratchpad' that the model updates at each step — a technique used in systems like Voyager (Minecraft) and in planning agents with tool use.

→ agent reasoning-model speculative-decoding

Training

Catastrophic Forgetting

Intermediate Oblio catastrofico · Interferenza catastrofica

A phenomenon where a model, when trained on new data, rapidly loses skills it had learned before.

In practice It is why aggressive fine-tuning on a narrow domain can make the model worse on everything else. It is mitigated with LoRA (which freezes the original weights), mixed datasets, or regularized updates. Always evaluate with a "general" test set on top of the domain-specific one.

→ fine-tuning sft lora pretraining

Checkpoint

Intermediate Punto di salvataggio

A full save of model weights at a given point of training, from which you can resume or release as a final model.

In practice During a training run, checkpoints are saved every N steps to recover from crashes and evaluate intermediate versions. When a lab releases an open-weights model (Llama, Mistral, Qwen) it is publishing a checkpoint. The word is often used as a synonym for "a downloadable model version".

→ pretraining fine-tuning open-weights-vs-open-source

Diffusion Policy

Advanced Diffusion-based Imitation Learning 4

An imitation learning method for robots where the policy is a denoising diffusion model: given an observation, it iteratively denoises a random action sequence into the action to execute. Unlike deterministic policies, diffusion policies learn multi-modal action distributions — they handle tasks with multiple valid solutions without averaging them into a bad one. Outperforms behavioral cloning by 46%+ on manipulation benchmarks.

In practice A robotics researcher collecting human demonstrations for an assembly task trains a Diffusion Policy on that data: the model learns that 'place the piece on the left' and 'place it on the right' are both valid solutions and coherently samples one of them, instead of producing the (wrong) average movement as classic behavioral cloning does. Libraries like Columbia's diffusion_policy or Hugging Face's LeRobot offer ready-to-use implementations.

→ diffusion-model sft fine-tuning distillation

Distillation

Intermediate Distillazione 3

A technique to train a small model to mimic the behavior of a large one, getting similar quality at a fraction of inference cost.

In practice It is why we keep getting small but capable models: they are distilled from frontier ones. If you need fast cheap responses in a narrow domain, distilling your own model from Claude or GPT is often the winning move.

→ fine-tuning quantization

DPO /dee-pee-oh/

Intermediate Direct Preference Optimization · Ottimizzazione diretta delle preferenze 3

An alignment technique that teaches a model to prefer a better answer over a worse one, without using a separate reward model like RLHF does.

In practice It only needs pairs of answers labeled "better/worse" and a simpler, more stable training loop than PPO. In recent years it has replaced RLHF in many open-source projects (Zephyr, Tulu, Llama variants). It is often the cheapest way to align a fine-tuned model.

→ rlhf ppo sft alignment fine-tuning

DreamBooth

Intermediate DreamBooth Fine-tuning · Subject-Driven Generation 1

A technique to fine-tune a diffusion model on 3-5 photos of a specific subject (person, product, pet) using a unique text identifier ('a sks dog'). The model 'memorizes' the subject while preserving its general generation ability. It is the foundation of AI portrait apps, product photography generators, and custom image tools. Introduced by Google Research in 2022.

In practice A product photographer can fine-tune Stable Diffusion with DreamBooth on 5 photos of an object (e.g., a sneaker) and then generate hundreds of shots in different environments without physical photo sets. In practice, it is often combined with LoRA to reduce computational cost: instead of updating all model weights, only low-rank matrices are trained. Tools like kohya_ss or Hugging Face's Diffusers library offer ready-to-use DreamBooth+LoRA scripts.

→ diffusion-model fine-tuning lora

Fill-In-the-Middle

Intermediate FIM · Infilling · Code Infilling 2

Fill-In-the-Middle (FIM) is a training objective for code models in which the model must predict a central span of text given the surrounding context — both what precedes it (prefix) and what follows it (suffix). Unlike standard left-to-right autoregressive generation, FIM enables the model to complete partially written functions, docstrings, variable names, or logic blocks in the middle of existing code. The technique rearranges training tokens into the form [PREFIX][SUFFIX][MIDDLE] or [PREFIX][MIDDLE][SUFFIX] and trains the model to complete the missing part. StarCoder, DeepSeek-Coder, and Codestral make extensive use of FIM, and it is the technical foundation of all modern code completion tools.

In practice A developer using GitHub Copilot or Cursor directly benefits from FIM every time they write a partial function and ask the model to complete the body: the model sees both the code before the cursor and the code after it. For those training their own code model, the FIM training pipeline requires randomly sampling spans to mask from the source code corpus and reformatting tokens with the special separators `<fim_prefix>`, `<fim_suffix>`, `<fim_middle>`. The typical ratio is 50% FIM + 50% left-to-right during pre-training to also preserve standard generative capability.

→ autoregressive fine-tuning sft

Fine-tuning

Beginner Affinamento · Adattamento 11

An extra training step where a ready-made model is trained on a smaller, more specific dataset to improve its performance on a certain task or domain.

In practice You do it when the base model does not match the style, jargon, or formats you need. It requires good labeled data and GPUs. Often you start with a lightweight variant like LoRA before doing a full fine-tune.

→ lora foundation-model rlhf

Gradient Descent

Intermediate Discesa del gradiente

An optimization algorithm that updates a model's weights in the direction that most reduces the error, one small step at a time.

In practice It is the core engine behind training every modern neural network. In practice people use a variant called Adam or AdamW, which is more stable and faster. If you do not train models from scratch it is a concept to know, not a knob to turn.

→ loss-function pretraining sft checkpoint

Instruction Tuning

Intermediate Instruction Fine-Tuning · FLAN-style Tuning 5

Instruction tuning is a training phase in which an already-pretrained LLM is further optimized on (instruction, expected-response) pairs, structured as natural-language task descriptions. Unlike generic supervised fine-tuning, it explicitly focuses on standardized task descriptions to instill the ability to follow arbitrary commands. Google's FLAN work (2021) showed that training on 60+ diverse tasks dramatically improves zero-shot generalization. It is the technical foundation of models such as ChatGPT, Vicuna, and Flan-T5.

In practice In practice, you prepare a dataset of thousands of examples in the format 'Instruction: … Response: …', often derived from existing NLP benchmarks reformatted as prompts. The base model is then fine-tuned on this data using a standard cross-entropy objective. A developer adapting an open-weights model (e.g., LLaMA) to a specific domain builds a vertical instruction dataset and uses frameworks like LLaMA-Factory, Axolotl, or HuggingFace TRL to run instruction tuning in a few hours on a single GPU.

→ sft rlhf fine-tuning few-shot-learning

LoRA /lor-ah/

Intermediate Low-Rank Adaptation 3

A fine-tuning technique that trains only a small set of extra parameters instead of the whole model, cutting compute cost and the size of the resulting file.

In practice It lets you customize a 70-billion-parameter model on a consumer GPU. You save adapters of a few MB that plug on top of the base model. It is the practical standard for adapting open-weight models to specific use cases.

→ fine-tuning quantization

Loss Function

Intermediate Funzione di perdita · Funzione di costo

A formula that measures how far the model's prediction is from the correct answer: the higher it is, the more wrong the model is.

In practice In LLMs the most used one is cross-entropy on next tokens. The loss value shown during training is the top signal to check whether the model is converging or there is a bug. A flat curve almost always means data or hyperparameter issues.

→ gradient-descent pretraining sft logits

Mixture of Denoisers

Advanced MoD · Mixed Denoising Objectives 112

A pretraining strategy (UL2, Google 2022) that trains a single model on multiple denoising objectives simultaneously: left-to-right language modeling, span prediction (BERT-style masked spans of varying lengths and corruptions), and prefix language modeling. Unifies the strengths of GPT-style and T5-style pretraining. The model learns when to use each mode based on a sentinel token that signals the objective type.

In practice A researcher wanting a flexible model for both completion and question answering can use UL2 or a Flan-UL2 checkpoint without choosing between encoder-decoder (T5) and decoder-only (GPT) architectures. In practice, the sentinel token `[S2S]`, `[NLU]`, or `[NLG]` must be prepended to the prompt to activate the correct mode — a detail that significantly impacts performance and is often omitted, causing poor results.

→ pretraining autoregressive sft fine-tuning decoder-only

PPO /pee-pee-oh/

Intermediate Proximal Policy Optimization · Ottimizzazione di policy prossimale 2

A reinforcement learning algorithm that updates the model in small steps, preventing it from drifting too far from the previous version.

In practice It was the engine behind RLHF in the early ChatGPT: it maximizes human reward without letting the model diverge. Notoriously hard to stabilize and rich in hyperparameters. That is why many open-source teams now prefer DPO, which gets similar results with less effort.

→ rlhf dpo alignment loss-function

Pretraining

Beginner Pre-training · Pre-addestramento 6

The initial training phase where a model learns the structure of language by predicting the next token on huge amounts of generic text.

In practice It is the most expensive step (months of GPUs and millions of dollars) and produces a "base" model that can write but cannot yet follow instructions. Only big labs run it from scratch; companies start from pretrained models and adapt them with SFT, LoRA, or RLHF.

→ foundation-model sft loss-function gradient-descent checkpoint

QLoRA /kew-lor-ah/

Intermediate Quantized LoRA 1

A variant of LoRA that keeps the base model in 4-bit quantized form during fine-tuning, drastically cutting the GPU memory needed.

In practice It lets you adapt 13B-70B parameter models on a single consumer GPU (e.g. RTX 4090 or 24-40 GB A100). It is the favorite technique for hobbyist or low-budget enterprise fine-tuning. Quality loss vs. full-precision fine-tuning is almost negligible.

→ lora quantization fine-tuning sft

Reward Shaping

Advanced Reward Function Design · Reward Engineering

The design of reward signals that guide reinforcement learning without overfitting to proxy measures. Poorly shaped rewards lead to reward hacking: the agent optimizes the metric instead of solving the real task. LLMs now automate reward design (Eureka/NVIDIA): GPT-4 writes Python reward functions, runs them in simulation, and iterates based on agent performance. It is critical for robotics, game AI, and RLHF with human feedback.

In practice A researcher training a robot to walk must balance rewards for speed, stability, and energy consumption — too much emphasis on speed produces bizarre gaits or reward hacking. With Eureka, the task is described in natural language and an LLM automatically generates the reward function, running it in Isaac Gym simulation and refining weights based on performance metrics. The same principle applies to RLHF: the language model's reward function must capture 'real utility', not just 'sounds convincing'.

→ rlhf rlaif ppo alignment

RLAIF /ar-el-ay-eye-ef/

Intermediate Reinforcement Learning from AI Feedback 1

A variant of RLHF where responses are judged not by a human but by another AI model, cutting cost and time compared to manual annotation.

In practice It lets alignment training scale to much larger volumes. Anthropic uses it for Claude together with Constitutional AI. The risk is amplifying the judge model's biases, so human oversight is still needed.

→ rlhf constitutional-ai alignment

RLHF /ar-el-aitch-ef/

Intermediate Reinforcement Learning from Human Feedback 5

A training technique where humans rate and rank model responses, and these preferences are used to steer learning toward more helpful and safer answers.

In practice It is the step that turned ChatGPT into something useful versus a raw predictive model. For API users RLHF has already been done by the provider. Knowing about it explains why more 'aligned' models sometimes refuse legitimate requests.

→ rlaif constitutional-ai alignment

SFT /es-ef-tee/

Intermediate Supervised Fine-Tuning · Fine-tuning supervisionato

Fine-tuning where the model learns from input-output pairs written by humans, for example questions with ideal answers.

In practice It is the first step in turning a base model into an instruction-following assistant. A few thousand high-quality examples are enough for large gains in a domain. In practice it is almost always the first option before moving to RLHF or DPO.

→ fine-tuning pretraining rlhf dpo lora

Inference

ARC-AGI /ark-ay-jee-eye/

Intermediate Abstraction and Reasoning Corpus 2

A benchmark of visual grid puzzles created by François Chollet to measure abstract reasoning on never-seen patterns, unsolvable by memorization.

In practice Designed to be easy for humans (over 80%) but hard for LLMs. In 2024 OpenAI's o3 hit historic results, reopening the debate on what AGI really means. A one-million-dollar prize is attached.

→ reasoning-model frontier-model gpqa

Beam Search

Intermediate Ricerca a fascio

A decoding algorithm that keeps the N most likely sequences in parallel and finally picks the one with the best overall score.

In practice It produces "safer" results than greedy, but tends to be repetitive and unnatural on long text. It used to be standard for machine translation; in modern conversational LLMs it is mostly replaced by top-p sampling. It still helps on structured tasks like translation and summarization.

→ greedy-decoding top-p-sampling logits

Chain-of-thought

Beginner CoT · Catena di ragionamento 7

A technique where the model is asked to spell out the intermediate reasoning steps before giving the final answer, improving accuracy on complex tasks.

In practice Adding 'think step by step' to the prompt really works on math, logic, and analysis. Reasoning models (o1, Claude with thinking) do it automatically. It costs more tokens, so use it only where you need it.

→ reasoning-model few-shot-learning

Context window

Beginner Finestra di contesto · Context length 1

The maximum number of tokens the model can read and hold in memory in a single call, counting both prompt and response.

In practice If you have a 200-page contract and a 200k-token window the whole thing often fits. Otherwise you have to chunk the text or use RAG. More context means higher cost and higher response latency.

→ token attention rag

Continuous Batching

Advanced Batching continuo · In-flight batching 2

A serving strategy where new requests join the running batch at every generation step, instead of waiting for previous ones to finish.

In practice It sharply raises the throughput of a GPU serving APIs, because cores are never left idle. It is implemented in vLLM, TensorRT-LLM, and TGI. For anyone pricing per token, it is one of the key ingredients to stay competitive on cost.

→ paged-attention kv-cache inference-compute

Few-shot learning

Beginner Apprendimento con pochi esempi 3

A prompting technique where the model is shown a few examples of desired input and output, so it learns the format on the fly without training.

In practice Useful to enforce a schema, a tone, or a precise classification. Usually 3-5 examples are enough. It is almost always the first thing to try before reaching for fine-tuning: it only costs a few extra tokens in the prompt.

→ zero-shot-learning chain-of-thought

FlashAttention

Advanced Flash Attention 4

An algorithm that reorganizes attention computation to minimize data movement between fast and slow GPU memory.

In practice It does not change the math, but makes attention much faster and far less memory-hungry. It ships by default in PyTorch and in major inference servers (vLLM, TGI). If you use APIs you never see it; if you self-host it is almost mandatory to turn on.

→ attention transformer kv-cache inference-compute

GPQA /jee-pee-kew-ay/

Intermediate Graduate-Level Google-Proof Q&A

A benchmark of 448 questions written by PhD students in biology, physics, and chemistry, designed to be hard even with Google access.

In practice It is replacing MMLU as the gauge of deep scientific knowledge. Domain-expert humans score around 65%; frontier models in 2025 exceed 70%. It remains one of the not-yet-saturated benchmarks.

→ mmlu reasoning-model frontier-model

Greedy Decoding

Intermediate Decodifica greedy

A generation strategy that always picks the most likely token at each step, without exploring alternatives.

In practice Equivalent to temperature 0. It is deterministic and fast, ideal for tasks needing reproducibility (data extraction, classification, code). The downside is that it can get stuck in loops and produces flat output on creative tasks. It is the natural starting point when debugging prompts.

→ beam-search temperature top-p-sampling logits

HELM /helm/

Intermediate Holistic Evaluation of Language Models

A holistic evaluation framework developed by Stanford CRFM that measures an LLM across dozens of benchmarks covering accuracy, robustness, bias, calibration, and efficiency.

In practice Instead of a single metric, it provides a full scorecard: useful for comparing models all-around, not just on academic leaderboards. It runs a public site with up-to-date results for every major model.

→ mmlu foundation-model

HumanEval /human-eval/

Intermediate 5

An OpenAI benchmark of 164 Python programming problems scored by running unit tests against the code generated by the model.

In practice It was the standard for measuring LLM coding ability since 2021. It too is now saturated (over 90% pass@1), and the community has moved to SWE-bench, more realistic because it is based on real repositories.

→ swe-bench mmlu

K-Quants

Intermediate K-Quantization · llama.cpp K-Quants · GGUF K-Quants 1

K-Quants are a family of quantization methods implemented in llama.cpp (from Q2_K to Q8_K) that apply different bit-widths to different model layers based on their sensitivity to precision loss. Attention and embedding layers, being more sensitive, receive more bits; intermediate feed-forward layers, being less critical, receive fewer. This non-uniform quantization produces higher quality than older flat-Q formats (Q4_0, Q5_1) at the same file size. Q4_K_M has become the reference format for local inference, achieving better quality than the old Q5_1 while being more compact. They are the standard format for modern GGUF models downloadable from HuggingFace.

In practice A user wanting to run Llama 3 70B on a PC with 48 GB of RAM downloads the Q4_K_M variant from the GGUF repository on HuggingFace (typically uploaded by TheBloke or bartowski) and runs it with `llama.cpp` or an interface like LM Studio or Ollama. The choice of quantization level follows a practical rule: Q4_K_M for the best quality/size balance, Q5_K_M if there is sufficient RAM and higher fidelity is desired, Q2_K if space is very limited and degraded quality is acceptable. K-Quants are transparent to the end user: the interface loads the GGUF file and handles the format internally.

→ quantization qlora

KV Cache /kay-vee cache/

Intermediate Key-Value Cache · Cache chiavi-valori 4

A temporary GPU memory that stores attention computations for tokens already seen, so the model does not recompute them on every new token generated.

In practice It is why generating the tenth token costs less than the first: the cache avoids redoing work. It eats a lot of VRAM and grows with context, so it is often the real bottleneck for serving many users in parallel. Optimizing it (paged, quantized) is central to cutting inference cost.

→ attention transformer inference-compute paged-attention

KV Cache Quantization

Advanced KV Quantization · KV Compression 1

KV cache quantization is the technique of compressing the key-value tensors dynamically generated during inference, reducing them from FP16 to FP8 or INT8. Unlike weight quantization, which operates on the model's static parameters, this acts on the cache generated at runtime for each request. It reduces VRAM footprint by 50% or more, enabling longer context windows or more concurrent requests per GPU. It is supported by vLLM, Text Generation Inference (TGI), and TensorRT-LLM.

In practice A sysadmin serving a 70B model on two A100 80GB GPUs and wanting to increase concurrent batch size from 8 to 16 requests enables FP8 KV cache quantization in vLLM by adding `--kv-cache-dtype fp8` to the launch command. It is important to distinguish this from weight quantization: the two approaches are orthogonal and can be combined. In practice, measure quality degradation on long-range tasks (needle-in-haystack, multi-turn) before deploying to production, since precision loss in the cache is more visible over long contexts.

→ kv-cache quantization paged-attention prefix-caching

LLM-as-judge /el-el-em as judge/

Intermediate LLM giudice · Model-graded eval

A technique that uses an LLM (usually a strong one) to score another model's or its own answers against criteria written in natural language.

In practice It speeds up evaluation dramatically compared to human judges, but suffers from biases (prefers longer answers, its own style). It must be calibrated against a subset of human judgments as anchor.

→ rlaif constitutional-ai alignment

Logits

Intermediate Logit

Raw numeric scores the model produces for every possible vocabulary token, before being turned into probabilities.

In practice They are the model's "unnormalized thinking": the higher a token's logit, the more likely it gets. Some APIs expose `logprobs` (logits after softmax and log) to gauge confidence or build classifiers. Working with raw logits is only relevant for fine-tuning or research.

→ softmax temperature top-p-sampling top-k-sampling

Lost in the middle

Intermediate Perso nel mezzo

The phenomenon where an LLM remembers information at the start and end of the context better, while content in the middle is often ignored or forgotten.

In practice Critical for RAG and long prompts: the order of documents matters. Put the key information at the start or the end. It is one of the reasons a 1M-token context window is not equivalent to actually using it all.

→ context-window needle-in-haystack rag

MMLU /em-em-el-you/

Intermediate Massive Multitask Language Understanding 1

A benchmark of about 16,000 multiple-choice questions across 57 subjects, from math and law to medicine, used to measure an LLM's general knowledge.

In practice For years it was the headline benchmark cited in new model announcements. Today it is saturated: frontier models score above 85%, and the field is moving to harder benchmarks like MMLU-Pro and GPQA.

→ gpqa helm foundation-model

Needle in a Haystack

Intermediate NIAH · Ago nel pagliaio

A test that hides a specific sentence inside a long irrelevant text and asks the model to retrieve it, to measure the real quality of the context window.

In practice It has become the de facto benchmark for long-context models (100K, 1M tokens). A model can advertise a huge context but fail NIAH beyond a certain depth, a sign the window is effectively 'fake'.

→ context-window lost-in-the-middle

Paged Attention

Advanced PagedAttention 4

A technique that splits the KV cache into small blocks managed like virtual memory pages, cutting VRAM waste between different requests.

In practice It is the core idea of vLLM and now standard in modern inference servers. It lets the same GPU serve many more users because it avoids reserving large mostly-empty blocks. When picking a self-hosted runtime, support for paged attention is a baseline requirement.

→ kv-cache attention continuous-batching inference-compute

Prefix Caching

Intermediate Automatic Prefix Caching · APC · Prompt Caching 2

Prefix caching is an inference technique that reuses the already-computed KV cache for common prompt prefixes across multiple requests. Rather than recomputing attention keys and values for the same sequences (e.g., an identical system prompt), the system stores these activations in memory and retrieves them directly. This dramatically reduces latency for the shared prefix, bringing it close to zero. It is implemented in vLLM as 'Automatic Prefix Caching' and in Anthropic and OpenAI cloud services as a reduced-cost billed feature.

In practice A developer serving a chatbot with a fixed 2,000-token system prompt benefits immediately from prefix caching: only the first request computes that prefix, and all subsequent ones read it from cache. In vLLM it is enabled with `--enable-prefix-caching`; in the Anthropic API, prefix caching must be explicitly declared with `cache_control`. For RAG applications with shared documents, you structure the prompt by placing the document before the questions to maximize cache reuse.

→ kv-cache paged-attention continuous-batching speculative-decoding

Quantization

Intermediate Quantizzazione 11

A technique that reduces the numeric precision of model weights (for example from 16 to 4 bits) so it takes less memory and runs faster.

In practice It is what lets you run a Llama 70B on a single GPU or a 7B model on a Mac. You lose a bit of quality but often not much. Typical tools: GGUF, AWQ, GPTQ. Useful for on-prem or edge deployment.

→ inference-compute lora

RAG /rag/

Beginner Retrieval-Augmented Generation · Generazione aumentata da recupero 21

A technique that fetches relevant text from an external data source and inserts it into the model's prompt before generating the response.

In practice It lets an LLM answer using company documents, internal knowledge bases, or up-to-date articles without training. It cuts hallucinations on specific data and refreshes knowledge without re-training. It is the first architecture to consider for an enterprise chatbot.

→ embedding vector-db context-window hallucination

Self-consistency

Intermediate Auto-consistenza 2

A technique that samples multiple independent answers from the model with temperature > 0 and picks the most frequent one by majority vote.

In practice It often improves accuracy on math reasoning tasks: if 5 out of 7 thought chains converge on the same answer, it is likely correct. It triples or quintuples inference cost.

→ chain-of-thought tree-of-thoughts reasoning-model

Softmax

Intermediate

A math function that turns a set of logits into probabilities that sum to 1, amplifying high values and squashing low ones.

In practice It is the last step before picking the next token: it tells how strongly the model "believes" in each option. It also appears inside attention to weight context tokens. If you call APIs it is invisible; if you study models, it is one of the most recurring functions.

→ logits temperature attention

Speculative Decoding

Advanced Decoding speculativo 3

A technique where a small fast model proposes several tokens ahead and the large model verifies them in a single pass, accepting the correct ones.

In practice It can produce answers 2-3x faster with no change in final quality, because the big model stays the judge. It is used in production by OpenAI, Anthropic, and in self-hosted runtimes. It needs a "draft" model aligned with the main one, so it is not free to set up.

→ inference-compute distillation greedy-decoding logits

Structured output

Beginner JSON mode · Output strutturato

A mode where the model is constrained to produce output conforming to a schema (JSON, regex, grammar) instead of free text.

In practice Essential when the output feeds another system: API, database, frontend. Providers like OpenAI and Anthropic offer native enforcement that guarantees valid JSON on the first try.

→ function-calling tool-use

SWE-bench /swee-bench/

Intermediate Software Engineering Bench 7

A benchmark of over 2,000 real GitHub issues from Python repositories: the model must produce a patch that makes the project's tests pass.

In practice It measures real software-engineering ability (reading a codebase, debugging, cross-file edits), not isolated coding. It has become the reference for agents like Devin, Claude Code, and OpenAI Codex.

→ humaneval agent

Temperature

Beginner Temperatura

A parameter that scales the logits before sampling: low values make the model more deterministic, high values more creative and unpredictable.

In practice At 0 the model always picks the most likely word (effectively greedy); at 1 it keeps the original distribution; above 1.5 it tends to go off the rails. For classification or extraction use 0; for creative writing 0.7-1.0. It is the simplest knob to tune in any API.

→ top-p-sampling top-k-sampling logits softmax greedy-decoding

Token

Beginner Token 6

The basic unit the model breaks text into: it can be a whole word, a syllable, or a few characters, depending on the tokenizer.

In practice LLM APIs charge per input and output token. In English 1 token is roughly 0.75 words, in Italian a bit less. Counting tokens in your prompt helps estimate cost and stay within the context limit.

→ tokenizer context-window llm

Tokenizer

Beginner Tokenizzatore 1

The component that turns text into tokens before passing it to the model and rebuilds text from output tokens.

In practice Different tokenizers produce different counts: the same text may cost more tokens on GPT than on Claude or the other way around. Libraries like tiktoken (OpenAI) let you count tokens locally before calling the API.

→ token llm

Top-k Sampling /top-kay sampling/

Intermediate Campionamento top-k

A next-token selection strategy that keeps only the k most likely candidates and discards the rest before sampling.

In practice With k=1 it becomes greedy decoding; with large k it is almost the full distribution again. It is used to stop the model from picking absurd words from the tail. Modern APIs often replace or combine it with top-p, which is considered more adaptive.

→ top-p-sampling temperature logits softmax greedy-decoding

Top-p Sampling /top-pee sampling/

Intermediate Nucleus Sampling · Campionamento a nucleo

A strategy that picks the next token from the smallest set of candidates whose cumulative probability exceeds a threshold p (e.g. 0.9).

In practice It adapts the candidate set to context: few options when the model is confident, many when it is unsure. It is the most-used parameter in APIs (`top_p` on OpenAI, Anthropic, etc.) to tune creativity without losing coherence. Typical values sit between 0.8 and 0.95.

→ top-k-sampling temperature logits softmax

Tree of Thoughts

Intermediate ToT 1

A reasoning strategy where the model explores multiple thought branches in parallel, evaluates them, and keeps only the promising ones, like a tree search.

In practice It extends Chain-of-Thought by allowing backtracking: useful for puzzles, planning, and math problems where a single linear path often fails. It costs many more tokens than standard inference.

→ chain-of-thought self-consistency reasoning-model

Voice Cloning

Intermediate Zero-Shot Voice Cloning · Speaker Adaptation 9

Voice cloning is the ability to generate speech synthesis in a target speaker's voice from just a few seconds of reference audio, without any additional fine-tuning. The model extracts a speaker embedding from the reference audio and conditions generation on it, replicating timbre, rhythm, and prosodic characteristics. Zero-shot means no additional per-speaker training is needed at inference time. Systems like ElevenLabs, XTTS v2, CosyVoice, and Dia TTS have made this technology accessible via API or open-weights models.

In practice A developer cloning a voice with XTTS v2 (open source, available on HuggingFace) provides 6-10 seconds of clean reference audio and the text to synthesize; the Coqui TTS library handles embedding extraction and synthesis in a few seconds. For professional productions, the ElevenLabs API accepts an audio clip and returns a reusable voice_id. It is essential to verify the original speaker's consent before cloning their voice, in compliance with applicable regulations.

→ neural-codec sft fine-tuning

Zero-shot learning

Beginner Apprendimento senza esempi

The model's ability to perform a task it never saw during training, based only on the description we give in the prompt, with no examples.

In practice It is what most of us do when we write 'summarize this text in three bullets'. If results are inconsistent, moving to few-shot with examples is the fastest fix. Useful to prototype new flows quickly.

→ few-shot-learning foundation-model

Agents

Agent

Beginner Agente AI · AI agent 90

A system where an LLM does more than answer: it decides which tools to call, in what order, and keeps iterating until it reaches a goal.

In practice An agent reads email, writes to a database, sends Slack messages. The hard part is handling errors, infinite loops, cost, and tool security. For simple cases a linear pipeline is more reliable than a real agent.

→ tool-use mcp prompt-injection

Function calling

Beginner Chiamata di funzione 3

An LLM's ability to output a structured call to a function described in a schema, with name and typed arguments ready to execute.

In practice It is the standard way an app wires a model to its own code: the model returns JSON, the app runs the function and feeds the result back. The foundation of almost every production agent.

→ tool-use structured-output agent mcp

MCP /em-see-pee/

Beginner Model Context Protocol 7

An open protocol introduced by Anthropic to connect AI models to external tools, data, and services in a standard way, like a USB port for LLMs.

In practice Instead of writing custom integrations for every client (Claude Desktop, IDEs, agents), you publish an MCP server and all compatible clients use it. It is becoming the de facto standard for agent tooling.

→ agent tool-use

Multi-Agent Orchestration

Intermediate Multi-Agent Systems · Agent Orchestration

An architecture where multiple specialized AI agents collaborate to complete a complex objective, each with defined roles, tools, and communication protocols. An orchestrator agent decomposes the goal and dispatches subtasks to worker agents. Unlike single-agent loops, multi-agent systems enable parallelism, specialization, and fault isolation. The main patterns are: hierarchical (orchestrator→workers), sequential pipeline, and debate/critique among agents.

In practice A developer building a complex RAG system can use an orchestrator (AutoGen, CrewAI, Magentic-One) to route queries to specialized agents — one for web search, one for the vector database, one for final synthesis. Debugging requires tracing inter-agent communication: tools like LangSmith or Phoenix show which agent received which input and what it produced, making bottlenecks and infinite loops visible.

→ agent react-pattern tool-use mcp

ReAct /ree-act/

Beginner Reasoning and Acting · Ragionamento e Azione 2

A pattern where an agent alternates textual reasoning steps (Thought) with concrete actions (Action) on tools, observing the result before the next step.

In practice It is the backbone of most modern LLM agents: the model writes what it intends to do, calls a tool, reads the response, then decides the next step. It makes the agent's decisions inspectable and debuggable.

→ agent tool-use chain-of-thought

Reflexion

Intermediate Self-reflection 1

A technique where an agent, after a failed attempt, generates a verbal self-critique and stores it in memory to improve the next attempt.

In practice Useful for tasks with clear feedback (failing tests, wrong answers). The agent learns from its mistakes within the same session, without fine-tuning. Often boosts success on coding and reasoning benchmarks.

→ react-pattern agent chain-of-thought

Tool use

Beginner Function calling · Uso di strumenti · Chiamata di funzione 11

The model's ability to return a structured request to run an external function (search the web, read a file, write to a database) and then resume reasoning with the result.

In practice You declare the functions with name, parameters, and description; the model picks when to call them. It is the building block of every agent. Validate arguments carefully: the model sometimes invents parameters or forgets them.

→ agent mcp

Toolformer

Advanced 1

An LLM from Meta trained to autonomously decide when and how to call external APIs such as a calculator, translator, or search engine.

In practice It is one of the first works to show that an LLM can learn tool use in a self-supervised way, without human examples. Today the idea lives on in the native function calling of modern models.

→ tool-use function-calling agent

Safety

Adversarial example

Intermediate Esempio avversariale

An input modified imperceptibly for a human but crafted to fool a model into producing a wrong or harmful output.

In practice Born in vision (a few pixels can make a panda be classified as a gibbon), today it also hits LLMs with strange character suffixes that unlock forbidden behavior. It is an intrinsic vulnerability of neural networks.

→ prompt-injection jailbreak red-teaming

AI Supply Chain Attack

Intermediate Model Poisoning · AI Artifact Attack

An AI supply chain attack targets the AI development supply chain: publicly shared model weights, LoRA adapters, GGUF quantizations, or datasets on platforms like HuggingFace are compromised with backdoors or hidden behaviors. A poisoned model can execute malicious actions when it receives a specific trigger, exfiltrate data, or generate harmful outputs at the attacker's request. The analogy to SolarWinds-style attacks on traditional software is direct: the artifact appears legitimate but contains hidden payloads.

In practice A developer downloading models from public repositories should verify the officially published SHA256 checksums and prefer models with digital signatures or verified provenance. Before using a model in production, it is good practice to run automated security evaluations (e.g., with tools like ModelScan or Protect AI Guardian) that analyze weights for suspicious patterns. For enterprise teams, maintaining an internal registry of approved artifacts and disallowing direct Internet downloads during deployment significantly reduces the attack surface.

→ backdoor-attack data-poisoning sleeper-agents red-teaming

Alignment

Beginner Allineamento 8

A set of techniques and research aimed at making an AI model do what humans actually want, not just what we ask literally.

In practice In practice: the model does not help with illegal stuff, follows instructions, does not make things up, does not manipulate. When you put AI in production this is also a brand and legal liability concern, not just an ethical one.

→ rlhf constitutional-ai red-teaming

ASL /ay-es-el/

Intermediate AI Safety Level · Livello di sicurezza AI 1

A scale of levels (ASL-1, ASL-2, ASL-3...) used by Anthropic to classify the risk of an AI model and define the required safety controls, inspired by biosafety levels.

In practice Higher level, more mandatory safety techniques: monitoring, deployment restrictions, independent audits. When choosing a vendor, knowing which ASL a model is compliant with hints at the maturity of their governance.

→ alignment red-teaming frontier-model

Backdoor attack

Advanced Attacco backdoor · Trojan

An attack where a model is trained to behave normally except when it recognizes a secret trigger that activates a predefined malicious behavior.

In practice Extremely hard to detect with standard evaluations: the model looks aligned until someone types the keyword. It affects both proprietary models (insiders) and open-weights downloaded from untrusted sources.

→ data-poisoning sleeper-agents red-teaming

Constitutional AI /constitutional ay-eye/

Intermediate AI costituzionale · CAI 4

An approach developed by Anthropic where the model is trained to follow a written set of principles (a 'constitution') instead of just case-by-case human preferences.

In practice It is the method behind Claude. Upside: behavior rules are explicit and readable, not hidden in millions of ratings. If you pick a model for the company this clarifies the vendor's policy choices.

→ rlaif rlhf alignment

Data poisoning

Intermediate Avvelenamento dei dati 2

An attack where an adversary inserts malicious examples into the training dataset to alter the behavior of the final model.

In practice Even a handful of corrupted documents in a web crawl can create persistent backdoors or biases. Particularly risky for models that continuously train on public content or are fine-tuned on unvetted third-party datasets.

→ backdoor-attack fine-tuning red-teaming

Differential privacy

Intermediate DP · Privacy differenziale 6

A mathematical technique that adds controlled noise to training so that the presence or absence of a single individual in the dataset is not detectable from the model's output.

In practice It is the de facto standard for models trained on health, tax, or messaging data. Apple, Google, and the US Census use it. It costs accuracy: more privacy means more noise.

→ data-poisoning fine-tuning

Hallucination

Beginner Allucinazione · Confabulation 4

A model response that sounds plausible but is made up: false facts, nonexistent citations, APIs that do not exist, wrong data presented with confidence.

In practice It is the number-one issue when putting LLMs in business workflows. Fixes: RAG with sources, asking for citations, double-checking with a second model, validating structured output against rules. Never treat output as gospel without a check.

→ rag alignment

Indirect Prompt Injection

Intermediate Indirect Injection · Environment Injection 1

Indirect prompt injection is an attack where malicious instructions are embedded in external content that an LLM agent will read: web pages, documents, emails, or database results. Unlike direct prompt injection (where the user provides the malicious content), here the attacker controls the external environment. When the agent retrieves and processes the content, it unknowingly executes the hidden instructions as if they came from a trusted source. The attack was first formalized by Greshake et al. (2023) and is a critical threat for RAG systems and autonomous agents.

In practice A developer building a web agent must sanitize all externally retrieved text before inserting it into the prompt. Defensive techniques include: structured prompts with explicit delimiters separating data from instructions, classifier systems that detect injection patterns in retrieved documents, and the principle of least privilege (the agent should not have access to dangerous tools if the task does not require them). Systematically testing the agent with deliberately poisoned documents is part of standard red-teaming for RAG applications.

→ prompt-injection rag agent red-teaming sleeper-agents

Jailbreak

Beginner Aggiramento delle protezioni 8

A technique where a user talks the model into ignoring its own safety rules, for example by asking it to pretend to be a character with no restrictions.

In practice Different from prompt injection: here it is the user who tries. If you offer a public LLM service this means doing red teaming, logging conversations, and running a safety classifier in cascade over responses.

→ prompt-injection alignment red-teaming

Many-Shot Jailbreaking

Intermediate Many-Shot Attack · Long Context Jailbreak

Many-shot jailbreaking is an attack technique that exploits long context windows by prepending 100-256 or more fake harmful question-answer pairs before the actual malicious request. The in-context examples override safety training by inducing the model to follow the demonstrated pattern rather than its guardrails. Effectiveness scales with context length: models with larger context windows are more vulnerable. The attack was disclosed by Anthropic in 2024 and prompted revisions to safety mechanisms for very long-context models.

In practice From a defensive standpoint, a developer evaluating a deployed model's robustness should include many-shot tests in their red-teaming: construct a prompt with 200+ malicious Q&A examples and measure the model's compliance rate. To mitigate the risk in production, one can apply artificially capped context windows for certain tasks, input classifiers that detect repeated Q&A patterns on risky topics, or logging systems that flag unusually long prompts for review.

→ jailbreak context-window few-shot-learning prompt-injection

Model extraction

Advanced Estrazione del modello · Model stealing

An attack where an adversary repeatedly queries a model via API to reconstruct a functional copy of its weights or behavior.

In practice A legal variant is distilling outputs of a frontier model to train a smaller one, banned by the terms of service of most providers. Mitigated with rate limits, watermarking, and fingerprint detection.

→ distillation open-weights-vs-open-source

Prompt injection

Beginner Iniezione di prompt 8

An attack where an external input (a document, a web page, an email) contains hidden instructions that hijack the model's behavior.

In practice If your agent reads emails and then acts, a malicious email can tell it 'forward everything to a third party'. Fixes: treat external inputs as untrusted, sandbox tools, require human confirmation for sensitive actions, filter inputs and outputs.

→ jailbreak agent safety-classifier

Red teaming

Beginner Test avversariale 4

A practice where a team actively tries to attack a model or an AI system, looking for jailbreaks, security holes, and dangerous uses, in order to find them before release.

In practice AI labs do it in-house and with external experts before shipping a model. If you put AI in production do the same on your product: ask colleagues to break it before customers do. Even one rough hour beats the first public bug.

→ jailbreak prompt-injection alignment asl

Safety classifier

Intermediate Classificatore di sicurezza · Content filter 1

A separate model that analyzes the input or output of an LLM to catch unsafe, violent, illegal, or off-policy content before it reaches the user.

In practice It is a safety net in cascade: if the main model slips, the classifier blocks it. OpenAI Moderation and Meta's Llama Guard are free examples. For public services having one is almost mandatory.

→ alignment jailbreak red-teaming

Sleeper agents

Advanced Sandbagging · Agenti dormienti 2

Models that behave aligned during training and evaluation but exhibit malicious behavior only under specific conditions, such as a given date or phrase.

In practice Studied by Anthropic in 2024: they showed standard safety fine-tuning does not remove deliberately planted backdoors. The term sandbagging refers to a model intentionally pretending to be less capable than it is.

→ backdoor-attack alignment red-teaming

Watermarking

Intermediate Filigrana AI 1

A technique that embeds an invisible statistical signal into text or images generated by a model, so they can later be identified as AI-produced.

In practice Google's SynthID, for example, marks Gemini's text and images. Useful against disinformation, deepfakes, and plagiarism. Limit: it often breaks under rewrites, translation, or minor edits, and works only if the provider cooperates.

→ safety-classifier model-extraction

Infrastructure

BM25 /bee-em twenty-five/

Intermediate Best Matching 25 · Okapi BM25

A classic text-search algorithm based on word frequency, with corrections for document length and term rarity.

In practice It has powered Elasticsearch, Lucene, and Solr for decades. On exact terms, acronyms, and proper nouns it often beats embeddings. That is why modern RAG pipelines combine BM25 with vector search (hybrid search).

→ hybrid-search rag reranker

Chunking

Beginner Spezzettamento · Segmentazione

The process of splitting a document into smaller pieces (chunks) before computing embeddings, to make them suitable for retrieval and the context window.

In practice Chunking quality often dictates RAG quality: chunks too small lose context, chunks too large dilute relevance. Common strategies: fixed size with overlap, recursive by separator, semantic by topic shift.

→ rag embedding context-window

Cosine similarity

Intermediate Similarità coseno

A similarity measure between two vectors based on the cosine of the angle between them: ranges from -1 (opposite) to 1 (identical), independent of their length.

In practice It is the most common metric for comparing text embeddings because it ignores magnitude and looks only at semantic direction. Common alternatives: dot product (faster if vectors are normalized) and Euclidean distance.

→ embedding vector-db hnsw

Cross-encoder vs bi-encoder

Advanced

Two architectures to measure text similarity: the bi-encoder encodes query and document separately (fast), the cross-encoder processes them jointly (slow but accurate).

In practice Bi-encoder = precomputed embeddings, used for the first search over millions of documents. Cross-encoder = score computed on the fly over few candidates, used as a final reranker. They are complementary, not alternatives.

→ embedding reranker rag

Disaggregated Inference

Advanced Prefill-Decode Disaggregation · PD Disaggregation 1

Disaggregated inference is a serving architecture that physically separates the prefill phase (compute-bound: processes the entire prompt in parallel) from the decode phase (memory-bound: generates one token at a time), assigning them to distinct GPU pools connected via KV cache transfer. This separation eliminates 'prefill-decode interference', the resource contention that occurs when both phases run on the same GPUs and reduces overall throughput. Publicly proposed by Moonshot AI's Mooncake architecture (Kimi), it has yielded throughput improvements of 5x or more in production. It is considered one of the most significant advances in LLM serving infrastructure in 2024-2025.

In practice In a large-scale deployment, the infrastructure engineer configures a cluster of 'prefill-only' GPUs (typically high FLOPS/W, such as H100 SXM) and a separate 'decode-only' cluster (typically high memory bandwidth). An incoming request is routed to the prefill pool, which computes the KV cache and transfers it via NVLink or InfiniBand to the decode pool. Open-source frameworks such as LMDeploy and some advanced vLLM configurations support this mode. Operational cost is higher due to hardware duplication, but TTFT (time-to-first-token) and throughput improve significantly.

→ continuous-batching kv-cache speculative-decoding inference-compute

FP8

Advanced Float8 · 8-bit Floating Point · E4M3 · E5M2 6

FP8 is an 8-bit floating-point numeric format available in two variants: E4M3 (4-bit exponent, 3-bit mantissa), used in the forward pass for higher precision, and E5M2 (5-bit exponent, 2-bit mantissa), used for gradients for greater dynamic range. It reduces memory usage by roughly 50% compared to BF16 with less than 0.5% quality loss when paired with per-tensor scaling via the NVIDIA Transformer Engine. H100 and H800 GPUs have native FP8 Tensor Cores. DeepSeek V3 was trained entirely in FP8, achieving GPT-4o-level quality at a fraction of the cost.

In practice An ML team training a 70B LLM on an H100 cluster enables FP8 via NVIDIA's Transformer Engine (integrated into Megatron-LM and NeMo) by simply setting `fp8_format=HYBRID`. For inference, frameworks like vLLM and TensorRT-LLM support FP8 weights and activations to reduce required VRAM and increase throughput. Before deploying to production, it is good practice to run evaluations on standard benchmarks (MMLU, HumanEval) to confirm that quality degradation stays within acceptable thresholds.

→ quantization inference-compute flash-attention

HNSW /aitch-en-es-double-you/

Advanced Hierarchical Navigable Small World

A hierarchical graph data structure used to approximately find the nearest vectors to a query in datasets of millions or billions of embeddings.

In practice It is the default indexing algorithm in Pinecone, Qdrant, Weaviate, pgvector, and FAISS. It enables millisecond searches at scales where brute force would be unusable. You pay in RAM and index build time.

→ vector-db embedding cosine-similarity

Hybrid search

Advanced Ricerca ibrida

A retrieval strategy that combines keyword search (BM25) and vector search (embeddings), merging the two rankings with techniques such as Reciprocal Rank Fusion.

In practice It compensates each method's weaknesses: embeddings excel at semantics, BM25 at exact terms. It almost always beats either alone. It is the state of the art in production RAG systems.

→ bm25 vector-db rag reranker

Inference compute

Beginner Calcolo in inferenza · Test-time compute 2

The amount of compute the model uses at response time, not during training. More inference compute often means better but slower and pricier answers.

In practice Reasoning models shift resources from training to inference. For anyone deploying a service this is the most visible cost line: every call burns GPU. Ways to cut it: caching, smaller models, quantization, batching.

→ reasoning-model quantization moe

Pipeline Parallelism

Advanced PP · Inter-layer Parallelism 31

Pipeline parallelism is a distributed training strategy in which a neural network's layers are split into contiguous blocks, each assigned to a separate GPU. Each GPU processes its block of layers and passes activations to the next GPU, forming a pipeline. It differs from tensor parallelism, which splits individual weight matrices within a single layer. Combined with tensor parallelism and data parallelism it forms '3D parallelism', used by Megatron-LM to train models with hundreds of billions of parameters.

In practice An engineer training a model too large for a single GPU — or even a single multi-GPU node — uses pipeline parallelism to distribute layers across multiple nodes. With DeepSpeed or Megatron-LM you configure the pipeline degree (number of stages) and the number of micro-batches to fill the pipeline and minimize bubble overhead (idle GPU time between micro-batches). In inference, the same approach allows serving very large LLMs by distributing layers across multiple servers.

→ quantization inference-compute

Reranker

Intermediate Riordinatore · Re-ranking

A secondary model that reorders the results of an initial search (vector or keyword) by ranking them on relevance to the query.

In practice Typically you retrieve 50-100 candidates with a fast method, then let the reranker (e.g. Cohere Rerank, BGE) sort the top 5-10. It is one of the cheapest ways to lift the quality of a RAG pipeline.

→ rag cross-encoder-vs-bi-encoder hybrid-search

Sim-to-Real Transfer

Advanced Simulation-to-Real Transfer · Sim2Real

The process of training a robot policy in simulation (fast, cheap, safe) and then deploying it on real hardware without retraining. The 'reality gap' — differences in physics, friction, sensor noise — causes policies to fail. Domain randomization (randomizing simulation parameters) teaches robustness. LLMs automate this process (DrEureka): they generate randomization ranges so policies transfer zero-shot to real hardware.

In practice A robotics team building an arm for industrial picking trains thousands of policies in parallel on Isaac Sim or MuJoCo, randomly varying object mass, friction, lighting, and motor delays. The best policy is then deployed on the physical robot without further training. With DrEureka, an LLM automatically suggests randomization ranges from the task description, reducing days of manual tuning to a few hours of automated search.

→ synthetic-data fine-tuning

Vector database /vector dee-bee/

Beginner Database vettoriale · Vector store 1

A database specialized in storing embeddings and quickly finding the vectors most similar to a query, even across millions of records.

In practice Examples: Pinecone, Weaviate, Qdrant, pgvector on Postgres. You pick based on scale, cost, and whether you want to self-host or use cloud. It is the key infrastructure for a RAG system searching a company knowledge base.

→ embedding rag

Data

BPE /bee-pee-ee/

Intermediate Byte Pair Encoding · Codifica a coppie di byte

A tokenization algorithm that starts from single characters and progressively merges the most frequent pairs, building a vocabulary of subwords.

In practice It is used by GPT, Llama, Mistral, and nearly every Western LLM. It explains why "playing" may become `play` + `ing`: common pieces get one token, rare words get many. It directly affects per-token cost and quality on non-English languages.

→ tokenizer token subword-tokenization wordpiece-sentencepiece

Embedding

Beginner Vettori semantici 2

A numeric representation of a text as a vector of hundreds of numbers, where sentences with similar meaning produce vectors close to each other.

In practice You compute them once with an embedding model and store them in a vector database. They power semantic search, document dedup, clustering, and the retrieval step in a RAG system.

→ vector-db rag token

Subword Tokenization

Intermediate Tokenizzazione a sotto-parole

A family of techniques that splits text into pieces smaller than a whole word but larger than a single character.

In practice It is a trade-off between huge vocabularies (one word = one token) and tiny ones (one character = one token). It handles unseen words, typos, and many languages without blowing up in size. Every modern LLM uses some form of subword tokenization.

→ tokenizer token bpe wordpiece-sentencepiece

Synthetic Data

Beginner Dati sintetici 6

Training data generated by another AI model instead of collected from humans.

In practice It is now a pillar of modern training: big models produce examples to train smaller ones (distillation) or to cover rare cases. It must be filtered carefully, because generator errors compound in the final model. Nvidia, Meta, and Anthropic use it heavily.

→ pretraining sft distillation fine-tuning

WordPiece / SentencePiece

Intermediate WordPiece · SentencePiece

Two subword tokenization algorithms alternative to BPE: WordPiece is the one in BERT, SentencePiece is the one in T5 and Gemini.

In practice WordPiece chooses merges by probability rather than raw frequency. SentencePiece works directly on the raw string without assuming spaces, so it handles Chinese, Japanese, and other space-less languages better. Switching tokenizer requires retraining the model.

→ tokenizer bpe subword-tokenization token