NVIDIA GTC 2026: Huang keynote and the Rubin roadmap for the next cycle
At GTC 2026, NVIDIA confirms its annual cadence: details on Rubin (Blackwell's successor), new rack-scale configurations, and an updated software stack for training and inference.
vLLM ships v0.7 with chunked prefill on by default, a rewritten 'V1' engine scheduler, and advanced support for MoE (DeepSeek V3/R1) and multimodal models. Throughput improves 1.5-2× on many workloads.
Cerebras Systems publishes inference numbers that it says beat NVIDIA GPUs by an order of magnitude: 2,500+ tok/s on Llama 4 Maverick and Scout thanks to the wafer-scale WSE-3. Custom ASICs are back in the race.
NVIDIA NIM 1.0 packages TensorRT-LLM and Triton Inference Server into per-model Docker microservices with OpenAI-compatible API, health checks, and GPU auto-configuration, making LLM deployment as simple as running a container.
WebLLM enables running LLMs like Llama 3 8B directly in the browser via WebGPU and WASM, compiling models with Apache TVM to achieve 15 tokens/s in Chrome with no backend server.
Systematic review of continuous batching strategies for LLM serving: comparing Orca, vLLM, SGLang, and TGI on scheduling, GPU utilization, and TTFT/TPOT metrics. State of the art 2024-2025.
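The two latency metrics named above can be computed from per-token timestamps. The following is a minimal sketch, assuming TTFT is the delay until the first output token and TPOT is the mean gap between later tokens; the function name `latency_metrics` is hypothetical and not taken from any of the listed systems.

```python
def latency_metrics(request_time, token_times):
    """Return (TTFT, TPOT) in seconds for a single request.

    request_time: time at which the request arrived.
    token_times: ascending timestamps at which each output token was emitted.
    """
    if not token_times:
        raise ValueError("need at least one generated token")
    # Time to first token: queueing plus prefill.
    ttft = token_times[0] - request_time
    # Time per output token: mean gap between consecutive tokens after the first.
    if len(token_times) > 1:
        tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    else:
        tpot = 0.0
    return ttft, tpot


ttft, tpot = latency_metrics(0.0, [0.5, 0.75, 1.0, 1.25])
```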
Meta releases torchao as a PyTorch-native library for INT4/FP8/INT8 quantization and sparsity, with 2x speedup on Llama-3 8B at INT4 without requiring custom CUDA kernels, emerging as the standard quantization layer for the PyTorch ecosystem.
UW + MIT release FlashInfer 0.2: CUDA library for attention in LLM serving with native paged KV cache, variable-length sequences, RoPE fusion, and 1.5x speedup vs vLLM on long prefill on A100.
OpenAI, Oracle, SoftBank and MGX announce a $500B four-year investment plan to build AI infrastructure in the US. First site in Abilene, Texas.
The DeepSeek-V3 technical report describes Multi-head Latent Attention and a complete FP8 training pipeline: a 671B-parameter MoE trained on an H800 cluster under tight budget constraints, reaching GPT-4o-level performance at $0.55/M tokens.
The prefill/decode disaggregation technique separates prompt processing and token generation phases onto dedicated GPUs, reducing TTFT while maintaining high throughput, adopted by major cloud providers.
Anthropic open-sources the Model Context Protocol (MCP), a JSON-RPC standard that lets AI assistants talk to tools, file systems, databases, and SaaS without per-model ad-hoc integrations.
Moonshot AI (Kimi) separates prefill (compute-bound GPU) and decode (memory-bound GPU) phases across dedicated GPU pools with KV cache transfer, achieving 525% throughput improvement in production deployments.
Meta announces Llama Stack: an API spec + reference implementations for inference, safety, agents, memory, evals, RAG, and training — meant as 'standard plumbing' for Llama-based applications.
Quantizing the KV cache from FP16 to FP8 or INT8 reduces serving memory by 50%+, enabling 2x longer contexts or twice the concurrent users per GPU, adopted by vLLM, TGI, and TensorRT-LLM.
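The memory saving follows from simple arithmetic on the cache shape. Below is a rough sketch; the model shape is a hypothetical example chosen for round numbers, and the per-tensor scaling factors that a real FP8 or INT8 cache also stores are ignored.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_element):
    # Factor 2: each layer stores one key tensor and one value tensor.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_element


# Hypothetical model shape, for illustration only.
SHAPE = dict(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=8192)

fp16_bytes = kv_cache_bytes(**SHAPE, bytes_per_element=2)
fp8_bytes = kv_cache_bytes(**SHAPE, bytes_per_element=1)

print(f"FP16: {fp16_bytes / 2**20:.0f} MiB per sequence")
print(f"FP8:  {fp8_bytes / 2**20:.0f} MiB per sequence")
```

Halving the bytes per element halves the cache for a fixed context, which is the same budget as doubling the context length or the number of concurrent sequences.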
Cerebras launches an LLM inference service on the wafer-scale WSE-3, claiming ~1800 tokens/s on Llama 3.1 8B and ~450 tokens/s on Llama 3.1 70B — 10-20× faster than H100 GPUs.
bitsandbytes 0.43 updates QLoRA support with NF4 and FP4 data types, optimized inference-time dequantization on A100/H100, and improved PEFT integration for efficient 4-bit LLM fine-tuning.
Neural Magic releases LLM Compressor: open-source library unifying GPTQ, AWQ, SmoothQuant, and SparseGPT in a single toolkit with native vLLM integration, simplifying compressed model deployment.
HuggingFace Accelerate 0.30 unifies FSDP and DeepSpeed in a YAML-configurable wrapper without modifying training code, with native Trainer integration and support for mixed parallelism strategies.
NVIDIA Transformer Engine brings FP8 (E4M3/E5M2) mixed-precision training with automatic per-tensor scaling, halving memory versus BF16 with less than 0.5% quality loss, making training 70B models on half the hardware feasible.
Tri Dao and NVIDIA publish FlashAttention-3: optimized for H100 Hopper with compute/memory overlapping via wgmma and TMA, FP8 low-precision support, 2.6x speedup over FA2 and 75% of H100 peak.
Stanford and LMSYS release SGLang, an LLM runtime introducing RadixAttention to share cached prefixes across different requests, achieving up to 6.4x throughput over vLLM on tasks with common prefixes.
The GGUF (GGML Unified Format) specification becomes the standard for distributing quantized LLM models, replacing GGML with an extensible format including rich metadata, natively supported by llama.cpp, Ollama, and LM Studio.
vLLM v0.3.3 introduces Automatic Prefix Caching that reuses the KV cache for common prefixes across different requests, nearly eliminating initial response time for system prompts and previously-processed RAG documents.
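One way to picture prefix reuse is a cache keyed by a hash of each token block together with everything before it. The sketch below assumes that block-hash scheme for illustration; `PrefixCache`, `block_keys`, and `BLOCK_SIZE` are hypothetical names, not vLLM's API, and the stored value is a placeholder rather than real KV tensors.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per KV block; illustrative value


def block_keys(tokens):
    """Return one key per full block; each key covers all tokens up to that block."""
    keys = []
    running = hashlib.sha256()
    full = len(tokens) - len(tokens) % BLOCK_SIZE
    for start in range(0, full, BLOCK_SIZE):
        running.update(repr(tokens[start:start + BLOCK_SIZE]).encode())
        keys.append(running.hexdigest())
    return keys


class PrefixCache:
    def __init__(self):
        self.blocks = {}  # key -> placeholder for a stored KV block

    def lookup_and_store(self, tokens):
        """Return the number of leading tokens already cached, then cache the rest."""
        keys = block_keys(tokens)
        hits = 0
        for key in keys:
            if key not in self.blocks:
                break
            hits += 1
        for key in keys[hits:]:
            self.blocks[key] = "kv-block"
        return hits * BLOCK_SIZE


cache = PrefixCache()
system_prompt = list(range(8))  # stand-in token ids for a shared system prompt
first = cache.lookup_and_store(system_prompt + [100, 101, 102, 103])
second = cache.lookup_and_store(system_prompt + [200, 201, 202, 203])
```

The second request finds the two system-prompt blocks already present, so only its last block would need prefill.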
S-LoRA (UC Berkeley) and Punica (UW) enable multi-tenant serving of hundreds of LoRA adapters from a single base model with zero-copy switching and dedicated CUDA kernels, integrated in vLLM and SGLang.
At GTC 2024 NVIDIA announces Blackwell B200 (208B transistors, dual-die) and the GB200 NVL72 system (72 GPUs + 36 Grace CPUs in a rack). NVIDIA claims up to 30x faster inference for frontier LLMs.
Groq's public demo on Llama 2 70B generates ~500 tokens/sec, roughly an order of magnitude faster than GPU-based serving at the time. LLM latency stops being a given.
ROCm 6.0 brings native PyTorch 2.x support, hipBLASLt, hipGRAPH, and official vLLM integration on AMD Instinct MI300X GPUs, enabling LLM training and serving for the first time without manual patches.
Mozilla releases llamafile, a single-file executable combining llama.cpp with Cosmopolitan Libc to run LLMs on Linux, Windows, Mac, and BSD without any installation, on CPU or GPU.
Microsoft DeepSpeed team releases FastGen via MII: Dynamic SplitFuse scheduling for LLM serving achieves 2.3x throughput vs vLLM on production chat workloads, optimized for Azure H100.
NVIDIA open-sources TensorRT-LLM, a framework for compiling and optimizing LLMs for NVIDIA GPUs with out-of-the-box FP8, INT4, sparse attention, and multi-GPU tensor parallelism support.
ExLlamaV2 introduces the EXL2 format with per-layer mixed bit-rates (2-8 bit), delivering higher throughput than llama.cpp on NVIDIA GPUs and enabling 70B models to run on a single RTX 3090.
Cornell/UIUC introduce Medusa: N additional decoding heads on the main model predict N tokens ahead simultaneously, for a 2.2x speedup without needing a second draft model.
Google announces TPU v5e, a cost-optimized AI chip with 4x better performance per dollar compared to TPU v4 for inference, available through Google Kubernetes Engine for containerized workloads.
Tri Dao rewrites FlashAttention with 2x speedup over FA1: better parallelism across seq-len, head-dim support up to 256, query parallelism for MHA, MQA, and GQA. It becomes the de facto standard for training.
MIT Han Lab publishes AWQ: 4-bit quantization that preserves salient weights identified through activation analysis, achieving better accuracy-throughput than GPTQ for edge deployment.
OpenAI adds 'function calling' to the API: the model returns structured JSON conforming to a schema, enabling reliable tool integrations without fragile prompt engineering.
HuggingFace releases Text Generation Inference, an optimized Docker container for serving LLMs in production with continuous batching, tensor parallelism, and integrated Flash Attention 2.
AWS announces Bedrock, a managed service exposing Claude (Anthropic), Jurassic-2 (AI21), Stable Diffusion, and its own Titan via one API. It is AWS's answer to Azure OpenAI.
PyTorch 2.0 introduces torch.compile built on TorchDynamo and the Inductor backend, delivering up to 2x speedup on transformers without code changes, making PyTorch competitive with XLA/JAX for production workloads.
The UC Berkeley team releases vLLM, a Python library for LLM inference using PagedAttention to manage KV cache like OS virtual memory, achieving up to 24x throughput over the HuggingFace baseline.
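The virtual-memory analogy can be sketched as a block table that maps each sequence's logical KV blocks to physical blocks drawn from a shared free list. This is a toy allocator under that assumption; `PagedKVCache` is a hypothetical class and does not reproduce vLLM's implementation.

```python
class PagedKVCache:
    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # ids of unused physical blocks
        self.tables = {}   # sequence id -> list of physical block ids
        self.lengths = {}  # sequence id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one new token; return (physical_block, offset)."""
        table = self.tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:  # no block yet, or the last block is full
            if not self.free:
                raise MemoryError("no free KV blocks")
            table.append(self.free.pop())
        self.lengths[seq_id] = n + 1
        return table[n // self.block_size], n % self.block_size

    def release(self, seq_id):
        """Return all blocks of a finished sequence to the free list."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=2)
slots_a = [cache.append_token("a") for _ in range(3)]
slot_b = cache.append_token("b")
```

Because blocks are allocated on demand, a sequence wastes at most one partially filled block instead of a buffer sized for the maximum context.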
Leviathan et al. (Google Research) and Chen et al. (DeepMind) publish speculative decoding: a small model proposes tokens, the large model verifies them in parallel. The output distribution is unchanged, with 2-3x faster generation.
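The propose-and-verify loop can be illustrated with a greedy variant, which is simpler than the sampling-based acceptance rule used in the published method. In this sketch both models are toy functions from a token list to the next token; `toy_target` and `toy_draft` are hypothetical stand-ins, and the verification loop runs position by position where a real system would use one batched forward pass.

```python
def greedy_decode(target, prompt, n_new):
    """Reference: plain one-token-at-a-time greedy decoding with the target model."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target(out))
    return out


def speculative_decode(target, draft, prompt, n_new, k=4):
    out = list(prompt)
    goal = len(prompt) + n_new
    while len(out) < goal:
        # 1. The draft model proposes k tokens autoregressively.
        proposal = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2. The target model checks every proposed position.
        accepted = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok != expected:
                accepted.append(expected)  # first mismatch: keep the target's token
                break
            accepted.append(tok)
        else:
            accepted.append(target(out + accepted))  # all k matched: one bonus token
        out.extend(accepted)
    return out[:goal]


def toy_target(ctx):
    return (ctx[-1] * 3 + 1) % 7


def toy_draft(ctx):
    # Agrees with the target except after token 5, to force some rejections.
    return 0 if ctx[-1] == 5 else toy_target(ctx)


result = speculative_decode(toy_target, toy_draft, [1], n_new=10)
```

Every emitted token is one the target model would have chosen at that position, which is why the result matches plain greedy decoding regardless of draft quality.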
NVIDIA consolidates Triton as the open-source platform for serving PyTorch, TensorFlow, and ONNX models in production, with dynamic batching, multi-GPU support, and gRPC/HTTP APIs.
HuggingFace Accelerate provides a unified API that runs the same training code on any hardware without changes, becoming the backbone of most open LLM training pipelines.
Frantar et al. (ETH Zurich) publish GPTQ: accurate one-shot 4-bit quantization without retraining, the first technique to make inference of 175B-parameter models practical on a single GPU.
Hugging Face launches Inference Endpoints, a managed service to deploy Hub models on AWS, Azure or GCP with autoscaling, on-demand GPUs and private endpoints.
Tri Dao (Stanford) publishes FlashAttention: an IO-aware implementation that avoids materializing the attention matrix in HBM, achieving 2-4x speedup and 10x less GPU memory.
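The core trick that lets attention be computed tile by tile, without holding the full score row, is an online softmax that rescales partial sums whenever a larger score appears. Below is a minimal pure-Python sketch for a single query with scalar values; it shows the numerics only and says nothing about the CUDA kernel or its memory layout. The function names are hypothetical.

```python
import math


def naive_attention(scores, values):
    """Softmax-weighted sum computed with the whole score row in memory."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def tiled_attention(scores, values, tile=2):
    """Same result, streaming over tiles and keeping only three running scalars."""
    running_max = float("-inf")
    denom = 0.0  # running softmax denominator
    acc = 0.0    # running unnormalized output
    for start in range(0, len(scores), tile):
        s_tile = scores[start:start + tile]
        v_tile = values[start:start + tile]
        new_max = max(running_max, max(s_tile))
        # Rescale earlier partial sums so they are relative to the new maximum.
        scale = math.exp(running_max - new_max)
        denom *= scale
        acc *= scale
        for s, v in zip(s_tile, v_tile):
            w = math.exp(s - new_max)
            denom += w
            acc += w * v
        running_max = new_max
    return acc / denom


scores = [0.3, 2.0, -1.0, 4.5, 0.0]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
```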
At GTC 2022 NVIDIA unveils the Hopper architecture and the H100 GPU, with FP8 Transformer Engine and NVLink 4. It will become the hardware substrate for nearly every large LLM of the following years.
Jeff Dean outlines Pathways, Google's unified architecture for sparse, multitask, multimodal models — the infrastructure foundation that will power PaLM and Gemini.
Meta releases PyTorch 1.10 with CUDA Graphs integration, FX-based quantization, and TorchScript improvements, consolidating the framework's leadership in AI research and production.
OpenAI releases Triton, a Python-like language and compiler for writing custom GPU kernels at performance close to hand-written CUDA — dramatically lowering the barrier for model optimization.
DeepMind publishes AlphaFold 2 code and weights on GitHub and, with EMBL-EBI, releases a database with predicted structures for 350,000 human and model-organism proteins.
NVIDIA adds interleaved pipeline scheduling and sequence parallelism to Megatron-LM, enabling training of the 530B-parameter MT-NLG on 2240 A100 GPUs with Microsoft.
Microsoft announces ZeRO Stage 3 in DeepSpeed: by sharding parameters across GPUs in addition to gradients and optimizer states, it enables training of 100B+ parameter models on reasonable-size clusters.
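The effect of each sharding stage on model-state memory can be estimated with the usual mixed-precision Adam accounting: 2 bytes per parameter for FP16 weights, 2 for FP16 gradients, and 12 for FP32 optimizer state. The sketch below assumes that accounting and leaves out activations and temporary buffers; `zero_bytes_per_gpu` is a hypothetical helper, not a DeepSpeed function.

```python
def zero_bytes_per_gpu(num_params, num_gpus, stage):
    """Estimate model-state bytes per GPU for ZeRO stages 0 (none) to 3."""
    params = 2 * num_params   # FP16 weights
    grads = 2 * num_params    # FP16 gradients
    optim = 12 * num_params   # FP32 weights copy, momentum, variance
    if stage >= 1:
        optim /= num_gpus     # stage 1 shards optimizer states
    if stage >= 2:
        grads /= num_gpus     # stage 2 also shards gradients
    if stage >= 3:
        params /= num_gpus    # stage 3 also shards the parameters
    return params + grads + optim


# Illustrative configuration: 7.5B parameters on 64 GPUs.
for stage in range(4):
    gb = zero_bytes_per_gpu(7_500_000_000, 64, stage) / 1e9
    print(f"stage {stage}: {gb:.1f} GB per GPU")
```

With stage 3 every term is divided by the GPU count, so per-GPU model-state memory shrinks linearly as the cluster grows.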
William Falcon and team ship PyTorch Lightning 1.0, a framework that separates research code (model) from engineering (training loop, distributed, checkpointing, logging) and becomes the de facto standard for many open projects.
Posting MLPerf Training 0.7 results, Google reveals TPU v4, a new custom deep-learning accelerator, claiming it built the "world's fastest training supercomputer" with a 4,096-chip pod.
At GTC 2020 Jensen Huang announces the A100 GPU built on the Ampere architecture: 54 billion transistors, 40-80 GB HBM2e, TF32, 2:4 structured sparsity, and MIG support.
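The 2:4 pattern means that in every group of four consecutive weights at most two are nonzero. A small sketch of producing that pattern follows; choosing the two largest-magnitude weights per group is an assumption made here for illustration, and `prune_2_of_4` is a hypothetical helper rather than an NVIDIA API.

```python
def prune_2_of_4(row):
    """Zero the two smallest-magnitude values in each group of four."""
    if len(row) % 4:
        raise ValueError("row length must be a multiple of 4")
    pruned = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Indices of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


weights = [0.1, -0.9, 0.5, 0.05, 1.0, 2.0, -3.0, 0.0]
sparse = prune_2_of_4(weights)
```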