AI Infrastructure

64 entries

June 20, 2026 High

NVIDIA GB300 Blackwell Ultra Launches: 288 GB HBM3e, NVLink 5 at 1.8 TB/s

NVIDIA begins shipping the GB300 Blackwell Ultra GPU featuring 288 GB HBM3e per chip, NVLink 5 at 1.8 TB/s, and double the FP8 throughput of the B200, dramatically lowering inference costs for frontier AI models.

AI Infrastructure, GPU, NVIDIA, Blackwell Ultra

May 6, 2026 High

AMD MI350 Instinct: 288GB HBM3e and 1.5 PFLOPS FP8 challenge NVIDIA at datacenter scale

AMD launches the MI350 Instinct GPU with 288GB HBM3e memory, double the bandwidth of MI300X, and 1.5 PFLOPS FP8 performance, paired with ROCm 7.0 featuring significantly improved PyTorch compatibility.

AI Infrastructure, GPU, AMD, HBM3e

April 17, 2026 Medium

Cerebras CS-3 Wafer-Scale Engine: 4 trillion transistors, 44GB SRAM, Llama 4 Maverick at 1500 tokens/sec

Cerebras unveils the CS-3, its third-generation wafer-scale engine featuring 4 trillion transistors and 44 GB of on-chip SRAM, running Llama 4 Maverick at 1500 tokens per second on a single chip. First commercial deployment is live in the UAE AI cloud.

AI Infrastructure, Cerebras, Wafer-Scale, Inference

March 12, 2026 Medium

Groq launches GroqCloud 2.0: LPU Gen3, 2000 tokens/sec, and Frankfurt European data center

Groq releases GroqCloud 2.0 with third-generation LPU chips delivering 2000 tokens/sec on Llama 4.1 Maverick, adds Function Calling GA, JSON mode, streaming tool use, batch pricing, and opens a European data center in Frankfurt.

AI Infrastructure, Groq, LPU, Inference

March 11, 2026 High

NVIDIA GTC 2026: Huang keynote and the Rubin roadmap for the next cycle

At GTC 2026, NVIDIA confirms its annual cadence, with details on Rubin (Blackwell's successor), new rack-scale configurations, and an updated software stack for training and inference.

AI Infrastructure, NVIDIA, GTC, Rubin

September 22, 2025 High

NVIDIA H200 and B200 Blackwell GPUs Reach Wide Cloud Availability

All three major clouds now offer Blackwell instances; training costs drop 40% vs H100 and inference throughput doubles on B200.

AI Infrastructure

August 25, 2025 High

NVIDIA NIM Microservices Reach General Availability

NIM lets you deploy 200+ AI models as production-ready REST APIs with a single Docker command, CUDA-optimized out of the box.

AI Infrastructure

August 11, 2025 Medium

Anthropic Extends Claude Prompt Caching to 1-Hour TTL

Claude's prompt caching now holds for a full hour with multi-turn support, cutting costs by up to 90% on repeated large contexts.

AI Infrastructure

July 2, 2025 Medium

vLLM v0.7: chunked prefill by default and a redesigned V1 engine

vLLM ships v0.7 with chunked prefill on by default, a rewritten 'V1' engine scheduler, and advanced support for MoE (DeepSeek V3/R1) and multimodal models. Throughput improves 1.5-2× on many workloads.

AI Infrastructure, vLLM, Inference, Chunked Prefill

June 26, 2025 Medium

Cerebras hits 2,500+ tok/s on Llama: inference record of the year

Cerebras Systems publishes inference numbers beating NVIDIA GPUs by an order of magnitude: 2,500+ tok/s on Llama 4 Maverick and Scout, thanks to the wafer-scale WSE-3. Custom ASICs are back in the race.

AI Infrastructure, Cerebras, Inference, Wafer Scale

May 1, 2025 High

NVIDIA NIM 1.0: Containerized LLM Inference with OpenAI-Compatible API

NVIDIA NIM 1.0 packages TensorRT-LLM and Triton Inference Server into per-model Docker microservices with an OpenAI-compatible API, health checks, and GPU auto-configuration, making LLM deployment as simple as running a container.

AI Infrastructure, NVIDIA NIM, containerized inference, TensorRT-LLM

April 14, 2025 Medium

WebLLM and LLM in WASM: browser-based LLM inference via WebGPU, no server needed

WebLLM enables running LLMs like Llama 3 8B directly in the browser via WebGPU and WASM, compiling models with Apache TVM to achieve 15 tokens/s in Chrome with no backend server.

AI Infrastructure, WebLLM, WebAssembly, WebGPU

April 8, 2025 Medium

Continuous Batching for LLM Serving: survey and state of the art of Orca, vLLM, SGLang, TGI

A systematic review of continuous batching strategies for LLM serving, comparing Orca, vLLM, SGLang, and TGI on scheduling, GPU utilization, and TTFT/TPOT metrics. Covers the state of the art for 2024-2025.

AI Infrastructure, Continuous Batching, LLM Serving, Orca
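
As a rough illustration of why iteration-level (continuous) batching raises GPU utilization, the toy simulation below counts decode steps for a fixed number of batch slots. It is a sketch under simplifying assumptions (every request costs one step per output token, prefill is ignored), and the function names are hypothetical rather than taken from Orca, vLLM, SGLang, or TGI.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Static batching: each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Continuous batching: a freed slot is refilled before the next step."""
    waiting = deque(lengths)
    running = []  # remaining output tokens for each in-flight request
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode iteration produces one token for every running request.
        running = [r - 1 for r in running]
        # Finished requests leave the batch immediately.
        running = [r for r in running if r > 0]
        steps += 1
    return steps


output_lengths = [2, 8, 2, 8]  # tokens to generate per request (made-up values)
print(static_batching_steps(output_lengths, batch_size=2))      # 16
print(continuous_batching_steps(output_lengths, batch_size=2))  # 12
```

With static batching the short requests hold their slots idle until the long ones finish; refilling slots per iteration removes that idle time.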

March 1, 2025 Medium

torchao: PyTorch-Native Quantization and Sparsity Without Custom CUDA

Meta releases torchao as a PyTorch-native library for INT4/INT8/FP8 quantization and sparsity, with a 2x speedup on Llama-3 8B at INT4 without requiring custom CUDA kernels. It is emerging as the standard quantization layer for the PyTorch ecosystem.

AI Infrastructure, torchao, quantization, INT4

January 22, 2025 Medium

FlashInfer 0.2: attention library for LLM serving with paged KV cache and RoPE fusion

UW and MIT release FlashInfer 0.2, a CUDA library for attention in LLM serving with native paged KV cache, variable-length sequences, RoPE fusion, and a 1.5x speedup vs vLLM on long prefill on A100.

AI Infrastructure, FlashInfer, Attention, KV Cache

January 21, 2025 High

Stargate Project: the $500B AI infrastructure plan announced at the White House

OpenAI, Oracle, SoftBank and MGX announce a $500B four-year investment plan to build AI infrastructure in the US. First site in Abilene, Texas.

AI Infrastructure, Stargate, OpenAI, Oracle

January 10, 2025 High

DeepSeek-V3: GPT-4o Quality at $0.55/M Tokens via MLA and FP8 Pipeline

The DeepSeek-V3 technical report reveals Multi-head Latent Attention and a complete FP8 pipeline achieving GPT-4o-level performance at $0.55/M tokens, training a 671B-parameter MoE on an H800 cluster under tight budget constraints.

AI Infrastructure, DeepSeek V3, MLA, FP8

January 8, 2025 High

Prefill/decode disaggregation: separate GPUs for low TTFT and high throughput

The prefill/decode disaggregation technique separates prompt processing and token generation phases onto dedicated GPUs, reducing TTFT while maintaining high throughput, adopted by major cloud providers.

AI Infrastructure, Prefill, Decode, Disaggregation

November 25, 2024 High ★ On my workflow

Model Context Protocol: the open standard to connect LLMs and data

Anthropic open-sources the Model Context Protocol (MCP), a JSON-RPC standard that lets AI assistants talk to tools, file systems, databases, and SaaS without per-model ad-hoc integrations.

AI Infrastructure, Anthropic, MCP, Model Context Protocol
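
To make the JSON-RPC framing concrete, here is a minimal sketch of how a client might build a JSON-RPC 2.0 request asking a server to run a tool. The `tools/call` method name follows the MCP specification; the `read_file` tool, its `path` argument, and the helper name `jsonrpc_request` are hypothetical and chosen only for illustration. Transport, initialization, and error handling are omitted.

```python
import json
from itertools import count

_request_ids = count(1)  # JSON-RPC requests carry a unique id per connection


def jsonrpc_request(method, params):
    """Build a JSON-RPC 2.0 request object (illustrative helper)."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": method,
        "params": params,
    }


# Hypothetical tool name and arguments; a real server advertises its own tools.
call = jsonrpc_request(
    "tools/call",
    {"name": "read_file", "arguments": {"path": "notes.txt"}},
)
wire_message = json.dumps(call)
print(wire_message)
```

Because every server accepts the same message envelope, a client needs one integration rather than one per tool.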

November 5, 2024 High

Mooncake: Disaggregated Prefill-Decode Inference for 525% More Throughput

Moonshot AI (Kimi) separates prefill (compute-bound GPU) and decode (memory-bound GPU) phases across dedicated GPU pools with KV cache transfer, achieving 525% throughput improvement in production deployments.

AI Infrastructure, Mooncake, disaggregated inference, prefill-decode

September 25, 2024 Medium

Llama Stack: Meta proposes a unified API spec for the LLM lifecycle

Meta announces Llama Stack: an API spec + reference implementations for inference, safety, agents, memory, evals, RAG, and training — meant as 'standard plumbing' for Llama-based applications.

AI Infrastructure, Meta, Llama Stack, Open Source

September 10, 2024 High

KV Cache Quantization FP8/INT8: Double User Density per GPU

Quantizing the KV cache from FP16 to FP8 or INT8 roughly halves KV cache memory, enabling 2x longer contexts or twice the concurrent users per GPU. The technique has been adopted by vLLM, TGI, and TensorRT-LLM.

AI Infrastructure, KV cache quantization, FP8, INT8
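
The memory saving follows from simple arithmetic: the cache stores one key and one value vector per layer, per KV head, per token. The sketch below works this out for a hypothetical model shape (32 layers, 8 KV heads, head dimension 128), which is an assumption for illustration and not a specific model. It also ignores the small overhead of quantization scale factors.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, tokens, bytes_per_value):
    """Bytes needed to cache keys and values for `tokens` tokens of one sequence."""
    # Factor 2: one key vector and one value vector per layer, head, and token.
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value


# Hypothetical model shape, for illustration only.
shape = dict(layers=32, kv_heads=8, head_dim=128, tokens=8192)

fp16_bytes = kv_cache_bytes(**shape, bytes_per_value=2)
fp8_bytes = kv_cache_bytes(**shape, bytes_per_value=1)

print(fp16_bytes / 2**30)  # 1.0 (GiB per 8192-token sequence)
print(fp8_bytes / 2**30)   # 0.5
```

At a fixed memory budget, halving the bytes per value doubles either the cacheable tokens per sequence or the number of sequences held at once.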

August 27, 2024 Medium

Cerebras Inference: record-breaking LLM inference throughput on the wafer-scale WSE-3

Cerebras launches an LLM inference service on the wafer-scale WSE-3, claiming ~1800 tokens/s on Llama 3.1 8B and ~450 tokens/s on Llama 3.1 70B — 10-20× faster than H100 GPUs.

AI Infrastructure, Cerebras, WSE-3, Inference

August 20, 2024 Medium

bitsandbytes 0.43: QLoRA and NF4/FP4 quantization for 4-bit fine-tuning

bitsandbytes 0.43 updates QLoRA support with NF4 and FP4 data types, optimized inference-time dequantization on A100/H100, and improved PEFT integration for efficient 4-bit LLM fine-tuning.

AI Infrastructure, bitsandbytes, QLoRA, Fine-tuning

August 5, 2024 Medium

LLM Compressor: unified toolkit for quantization and sparsity with native vLLM integration

Neural Magic releases LLM Compressor, an open-source library unifying GPTQ, AWQ, SmoothQuant, and SparseGPT in a single toolkit with native vLLM integration, simplifying compressed model deployment.

AI Infrastructure, LLM Compressor, Neural Magic, Quantization

July 8, 2024 Medium

HuggingFace Accelerate 0.30: FSDP and DeepSpeed without extra code

HuggingFace Accelerate 0.30 unifies FSDP and DeepSpeed in a YAML-configurable wrapper without modifying training code, with native Trainer integration and support for mixed parallelism strategies.

AI Infrastructure, HuggingFace, Accelerate, FSDP

June 5, 2024 High

FP8 Training with NVIDIA Transformer Engine: Half the Memory, Same Quality

NVIDIA Transformer Engine brings FP8 (E4M3/E5M2) mixed-precision training with automatic per-tensor scaling, halving memory versus BF16 with less than 0.5% quality loss, making training 70B models on half the hardware feasible.

AI Infrastructure, FP8, Transformer Engine, NVIDIA

May 18, 2024 High

FlashAttention-3: 2.6x speedup over FA2 optimized for H100 Hopper with wgmma, TMA, and FP8

Tri Dao and NVIDIA publish FlashAttention-3, optimized for H100 Hopper with compute/memory overlap via wgmma and TMA and FP8 low-precision support, for a 2.6x speedup over FA2 and 75% of H100 peak throughput.

AI Infrastructure, FlashAttention-3, H100, Hopper

May 2, 2024 Medium

SGLang: 6.4x LLM throughput with RadixAttention and shared prefix caching

Stanford and LMSYS release SGLang, an LLM runtime that introduces RadixAttention to reuse cached KV prefixes across different requests, achieving 6.4x throughput over vLLM on tasks with common prefixes.

AI Infrastructure, SGLang, Stanford, RadixAttention

March 25, 2024 Medium

GGUF specification: the standard format for local quantized LLM models

The GGUF (GGML Unified Format) specification becomes the standard for distributing quantized LLM models, replacing GGML with an extensible format including rich metadata, natively supported by llama.cpp, Ollama, and LM Studio.

AI Infrastructure, GGUF, GGML, Quantization

March 20, 2024 High

Automatic Prefix Caching in vLLM: Shared KV Cache Across Requests for Near-Zero TTFT

vLLM v0.3.3 introduces Automatic Prefix Caching that reuses the KV cache for common prefixes across different requests, nearly eliminating initial response time for system prompts and previously-processed RAG documents.

AI Infrastructure, prefix caching, KV cache, vLLM
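
One way to picture prefix reuse is a cache keyed by chained hashes of fixed-size token blocks, so a block matches only if every block before it matched too. The sketch below is a simplified model of that idea, not vLLM's actual implementation; the `PrefixCache` class, the block size of 4, and the token ids are all hypothetical.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per KV block (illustrative value)


def block_hashes(tokens):
    """Hash each full block together with its parent's hash (a hash chain)."""
    hashes, parent = [], b""
    full = len(tokens) - len(tokens) % BLOCK_SIZE  # partial tail block is skipped
    for start in range(0, full, BLOCK_SIZE):
        payload = parent + repr(tokens[start:start + BLOCK_SIZE]).encode()
        parent = hashlib.sha256(payload).digest()
        hashes.append(parent)
    return hashes


class PrefixCache:
    """Tracks which KV blocks exist; real systems also store the tensors."""

    def __init__(self):
        self.known_blocks = set()

    def lookup_and_insert(self, tokens):
        """Return how many leading prompt tokens can skip prefill."""
        reused, still_matching = 0, True
        for h in block_hashes(tokens):
            if still_matching and h in self.known_blocks:
                reused += BLOCK_SIZE
            else:
                still_matching = False
                self.known_blocks.add(h)
        return reused


system_prompt = [1, 2, 3, 4, 5, 6, 7, 8]  # made-up token ids
cache = PrefixCache()
first = cache.lookup_and_insert(system_prompt + [9, 10, 11, 12])
second = cache.lookup_and_insert(system_prompt + [20, 21, 22, 23])
print(first, second)  # 0 8
```

The second request skips prefill for the eight shared system-prompt tokens and only computes its own suffix.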

March 18, 2024 High

S-LoRA and Punica: serving hundreds of LoRA fine-tunings from a single base model

S-LoRA (UC Berkeley) and Punica (UW) enable multi-tenant serving of hundreds of LoRA adapters from a single base model with zero-copy switching and dedicated CUDA kernels, integrated in vLLM and SGLang.

AI Infrastructure, LoRA, S-LoRA, Punica

March 18, 2024 Landmark

NVIDIA Blackwell: B200 and GB200 NVL72, the rack-scale AI era

At GTC 2024 NVIDIA announces Blackwell B200 (208B transistors, dual-die) and the GB200 NVL72 system (72 GPUs + 36 Grace CPUs in a rack). NVIDIA claims up to 30x faster inference for frontier LLMs.

AI Infrastructure, NVIDIA, Blackwell, B200

February 22, 2024 High

Groq LPU: 500-tokens-per-second inference goes viral

Groq's public demo on Llama 2 70B generates ~500 tokens/sec, far faster than GPU-based services at the time. LLM latency stops being a given.

AI Infrastructure, Groq, LPU, Inference

February 5, 2024 High

AMD ROCm 6.0: Production-Grade LLM Support Breaking NVIDIA's Near-Monopoly

ROCm 6.0 brings native PyTorch 2.x support, hipBLASLt, hipGRAPH, and official vLLM integration on AMD Instinct MI300X GPUs, enabling LLM training and serving for the first time without manual patches.

AI Infrastructure, ROCm 6, AMD, MI300X

January 31, 2024 Medium

Mozilla llamafile: LLM in a single portable executable on any OS

Mozilla releases llamafile, a single-file executable combining llama.cpp with Cosmopolitan Libc to run LLMs on Linux, Windows, Mac, and BSD without any installation, on CPU or GPU.

AI Infrastructure, llamafile, Mozilla, LLM

January 8, 2024 Medium

DeepSpeed-FastGen: Dynamic SplitFuse scheduling for 2.3x throughput over vLLM in production

Microsoft DeepSpeed team releases FastGen via MII: Dynamic SplitFuse scheduling for LLM serving achieves 2.3x throughput vs vLLM on production chat workloads, optimized for Azure H100.

AI Infrastructure, DeepSpeed, FastGen, MII

September 27, 2023 High

NVIDIA TensorRT-LLM: automatic LLM compilation for GPUs with FP8 and multi-GPU

NVIDIA open-sources TensorRT-LLM, a framework for compiling and optimizing LLMs for NVIDIA GPUs with out-of-the-box FP8, INT4, sparse attention, and multi-GPU tensor parallelism support.

AI Infrastructure, NVIDIA, TensorRT-LLM, FP8

September 15, 2023 Medium

ExLlamaV2: high-speed quantized LLM inference on consumer GPUs

ExLlamaV2 introduces the EXL2 format with per-layer mixed bit-rates (2-8 bit), delivering higher throughput than llama.cpp on NVIDIA GPUs and enabling 70B models to run on a single RTX 3090.

AI Infrastructure, ExLlamaV2, EXL2, Quantization

September 14, 2023 High

Medusa: multi-head speculative decoding without a separate draft model, 2.2x speedup

Princeton/UIUC researchers introduce Medusa: N additional decoding heads on the main model predict N tokens ahead simultaneously, giving a 2.2x speedup without needing a second draft model.

AI Infrastructure, Medusa, Speculative Decoding, Multi-Head

August 7, 2023 Medium

Google TPU v5e: Cost-Optimized AI Chip for Enterprise Inference

Google announces TPU v5e, a cost-optimized AI chip with 4x better performance per dollar compared to TPU v4 for inference, available through Google Kubernetes Engine for containerized workloads.

AI Infrastructure, TPU v5e, Google, inference

July 28, 2023 High

FlashAttention-2: rewrite with 2x speedup, MQA/GQA support, and head-dim 256

Tri Dao rewrites FlashAttention with a 2x speedup over FA1: better parallelism across sequence length, head-dim support up to 256, and query parallelism for MHA, MQA, and GQA. It becomes the de facto standard for training.

AI Infrastructure, FlashAttention-2, Attention, Transformer

June 22, 2023 High

AWQ: activation-aware 4-bit quantization for edge deployment with accuracy above GPTQ

MIT Han Lab publishes AWQ: 4-bit quantization that preserves salient weights identified through activation analysis, achieving a better accuracy-throughput trade-off than GPTQ for edge deployment.

AI Infrastructure, AWQ, Quantization, 4-bit

June 13, 2023 High

Function calling: GPT learns to speak JSON

OpenAI adds 'function calling' to the API: the model returns structured JSON conforming to a schema, enabling reliable tool integrations without fragile prompt engineering.

AI Infrastructure, OpenAI, Function Calling, Tool Use
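
The application side of function calling can be sketched as follows: the developer describes a function with a JSON Schema, the model replies with a function name plus a JSON string of arguments, and the application parses and dispatches it. The `get_weather` function, its canned return value, and the hard-coded model reply are hypothetical stand-ins. No API request is made here.

```python
import json


def get_weather(city, unit="celsius"):
    """Hypothetical local function; returns canned data for the example."""
    return f"{city}: 18 degrees {unit}"


# JSON Schema description the developer would send alongside the prompt.
weather_schema = {
    "name": "get_weather",
    "description": "Get the current weather for a city",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["city"],
    },
}

REGISTRY = {"get_weather": get_weather}


def dispatch(function_call):
    """Parse the model's JSON arguments and invoke the matching function."""
    arguments = json.loads(function_call["arguments"])  # arguments arrive as a string
    return REGISTRY[function_call["name"]](**arguments)


# Stand-in for what a model might return; not produced by a real request.
model_reply = {"name": "get_weather", "arguments": '{"city": "Paris"}'}
print(dispatch(model_reply))  # Paris: 18 degrees celsius
```

A production version would validate the parsed arguments against the schema before calling anything, since the model's output is untrusted input.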

June 6, 2023 High

HuggingFace TGI: production-ready Docker container for LLM serving with continuous batching

HuggingFace releases Text Generation Inference, an optimized Docker container for serving LLMs in production with continuous batching, tensor parallelism, and integrated Flash Attention 2.

AI Infrastructure, HuggingFace, TGI, LLM Serving

April 13, 2023 High

AWS Bedrock: managed multi-model AI on Amazon cloud

AWS announces Bedrock, a managed service exposing Claude (Anthropic), Jurassic-2 (AI21), Stable Diffusion (Stability AI), and its own Titan models via one API. It is AWS's answer to Azure OpenAI.

AI Infrastructure, AWS, Bedrock, managed AI

March 15, 2023 High

PyTorch 2.0 and torch.compile: Graph Compilation Without Rewriting Code

PyTorch 2.0 introduces torch.compile built on TorchDynamo and the Inductor backend, delivering up to 2x speedup on transformers without code changes, making PyTorch competitive with XLA/JAX for production workloads.

AI Infrastructure, PyTorch 2.0, torch.compile, TorchDynamo

February 9, 2023 High

vLLM: 24x LLM throughput with PagedAttention from UC Berkeley

The UC Berkeley team releases vLLM, a Python library for LLM inference using PagedAttention to manage KV cache like OS virtual memory, achieving 24x throughput over the HuggingFace baseline.

AI Infrastructure, vLLM, Berkeley, PagedAttention
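
The virtual-memory analogy can be sketched with a block table: each sequence maps logical token positions to fixed-size physical blocks drawn from a shared free list, so memory is claimed one block at a time instead of being reserved for the maximum length up front. The `PagedKVCache` class below is a hypothetical bookkeeping model, not vLLM's code, and it tracks block ids only (no tensors).

```python
class PagedKVCache:
    """Toy block-table allocator for KV cache slots."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.lengths = {}       # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve the slot for one new token; return (physical block, offset)."""
        n = self.lengths.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if n % self.block_size == 0:  # first token, or the last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = n + 1
        return table[n // self.block_size], n % self.block_size

    def release(self, seq_id):
        """Return a finished sequence's blocks to the free list."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


kv = PagedKVCache(num_blocks=4, block_size=2)
for _ in range(3):
    kv.append_token("request-a")   # 3 tokens occupy 2 blocks
print(len(kv.free_blocks))         # 2
kv.release("request-a")
print(len(kv.free_blocks))         # 4
```

Waste is bounded by at most one partially filled block per sequence, which is what lets many more sequences share the same GPU memory.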

January 20, 2023 High

Speculative Decoding: 2-3x LLM inference speedup without changing output

Chen et al. (DeepMind) publish Speculative Decoding: a small draft model proposes tokens and the large model verifies them in parallel, producing the same output 2-3x faster.

AI Infrastructure, Speculative Decoding, Inference, Autoregressive
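
The propose-then-verify loop can be sketched with a simplified greedy variant: a draft token is accepted only if it equals the target model's own greedy choice. The published method uses a rejection-sampling rule that preserves the target's sampling distribution, which this sketch does not implement. The two toy "models" are hypothetical functions over integer tokens, and the verification loop stands in for what would be a single batched forward pass of the large model.

```python
def target_next(context):
    """Toy 'large model': always continues with (last token + 1) mod 10."""
    return (context[-1] + 1) % 10


def draft_next(context):
    """Toy 'small model': agrees with the target except after token 3."""
    return 0 if context[-1] == 3 else (context[-1] + 1) % 10


def speculative_step(prefix, k):
    """Draft k tokens, then keep the longest prefix the target agrees with."""
    # 1) The draft model proposes k tokens autoregressively (cheap).
    proposal, context = [], list(prefix)
    for _ in range(k):
        token = draft_next(context)
        proposal.append(token)
        context.append(token)

    # 2) The target checks every proposed position.
    accepted, context = [], list(prefix)
    for token in proposal:
        expected = target_next(context)
        if token != expected:
            accepted.append(expected)  # replace the first mismatch and stop
            return accepted
        accepted.append(token)
        context.append(token)
    accepted.append(target_next(context))  # all accepted: one extra token for free
    return accepted


print(speculative_step([0], k=4))  # [1, 2, 3, 4]
```

The output matches what the target alone would generate greedily; the gain is that several tokens are confirmed per large-model pass whenever the draft guesses well.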

November 9, 2022 High

NVIDIA Triton Inference Server 2.x: the de facto standard for production inference

NVIDIA consolidates Triton as the open-source platform for serving PyTorch, TensorFlow, and ONNX models in production, with dynamic batching, multi-GPU support, and gRPC/HTTP APIs.

AI Infrastructure, NVIDIA, Triton, Inference Server

November 1, 2022 Medium

HuggingFace Accelerate: One Python Script for CPU, GPU, TPU, and Mixed Precision

HuggingFace Accelerate provides a unified API that runs the same training code on any hardware without changes, becoming the backbone of most open LLM training pipelines.

AI Infrastructure, Accelerate, HuggingFace, multi-GPU

October 12, 2022 High

GPTQ: 4-bit post-training quantization making GPT-scale inference practical

Frantar et al. (ETH Zurich) publish GPTQ: accurate 4-bit post-training quantization that needs no fine-tuning, the first technique to make inference of 175B-parameter models practical on a single GPU.

AI Infrastructure, GPTQ, Quantization, 4-bit

September 27, 2022 Medium

Hugging Face Inference Endpoints: deploy LLMs in two clicks

Hugging Face launches Inference Endpoints, a managed service to deploy Hub models on AWS, Azure or GCP with autoscaling, on-demand GPUs and private endpoints.

AI Infrastructure, Hugging Face, Inference Endpoints, Deployment

June 21, 2022 Landmark

FlashAttention: IO-aware attention that revolutionizes transformer training

Tri Dao (Stanford) publishes FlashAttention: an IO-aware implementation that avoids materializing the attention matrix in HBM, achieving 2-4x speedup and 10x less GPU memory.

AI Infrastructure, FlashAttention, Attention, Transformer
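
The trick that avoids materializing the full attention matrix is an online softmax: scores are processed tile by tile while a running maximum, a running normalizer, and a running weighted sum are rescaled as needed. The pure-Python sketch below shows this for a single query vector and checks it against the ordinary formula. It illustrates the numerics only, not the GPU kernel; the tile size and input values are arbitrary, and the 1/sqrt(d) scaling is omitted for brevity.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_reference(q, keys, values):
    """Standard softmax attention: needs all scores at once."""
    scores = [dot(q, k) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / z
            for d in range(len(values[0]))]


def attention_tiled(q, keys, values, tile=2):
    """Same result, but only one tile of scores exists at any time."""
    running_max = float("-inf")
    normalizer = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), tile):
        scores = [dot(q, k) for k in keys[start:start + tile]]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum.
        rescale = math.exp(running_max - new_max)
        normalizer *= rescale
        acc = [a * rescale for a in acc]
        for s, v in zip(scores, values[start:start + tile]):
            w = math.exp(s - new_max)
            normalizer += w
            acc = [a + w * vd for a, vd in zip(acc, v)]
        running_max = new_max
    return [a / normalizer for a in acc]


q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [-1.0, 3.0], [0.5, 0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
ref = attention_reference(q, keys, values)
tiled = attention_tiled(q, keys, values)
print(all(abs(a - b) < 1e-9 for a, b in zip(ref, tiled)))  # True
```

On a GPU the tiles live in fast on-chip SRAM, so the large score matrix never has to be written to or read from HBM.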

March 22, 2022 Landmark

NVIDIA H100 and Hopper architecture: the foundation-model GPU

At GTC 2022 NVIDIA unveils the Hopper architecture and the H100 GPU, with FP8 Transformer Engine and NVLink 4. It will become the hardware substrate for nearly every large LLM of the following years.

AI Infrastructure, NVIDIA, H100, Hopper

October 28, 2021 Medium

Pathways: Google sketches the post-Transformer architecture

Jeff Dean outlines Pathways, Google's unified architecture for sparse, multitask, multimodal models — the infrastructure foundation that will power PaLM and Gemini.

AI Infrastructure, Google, Pathways, Multitask

October 21, 2021 Medium

PyTorch 1.10: CUDA Graphs, FX, and the maturing of the dominant framework

Meta releases PyTorch 1.10 with CUDA Graphs integration, FX-based quantization, and TorchScript improvements, consolidating the framework's lead in AI research and production.

AI Infrastructure, PyTorch, Framework, CUDA Graphs

July 28, 2021 Medium

OpenAI Triton: writing GPU kernels in Python becomes practical

OpenAI releases Triton, a Python-like language and compiler for writing custom GPU kernels at performance close to hand-written CUDA — dramatically lowering the barrier for model optimization.

AI Infrastructure, OpenAI, Triton, GPU

July 15, 2021 High

AlphaFold 2: open code and database, biology accelerates

DeepMind publishes AlphaFold 2 code and weights on GitHub and, with EMBL-EBI, releases a database with predicted structures for 350,000 human and model-organism proteins.

AI Infrastructure, DeepMind, AlphaFold, Protein Folding

July 12, 2021 High

Megatron-LM v2: 3D Parallelism for 530-Billion-Parameter Models

NVIDIA adds interleaved pipeline scheduling and sequence parallelism to Megatron-LM, enabling training of the 530B-parameter MT-NLG on 2240 A100 GPUs with Microsoft.

AI Infrastructure, Megatron-LM, 3D parallelism, pipeline parallelism

September 9, 2020 High

DeepSpeed ZeRO-3: training models beyond 100 billion parameters

Microsoft announces ZeRO Stage 3 in DeepSpeed: by sharding parameters across GPUs in addition to gradients and optimizer states, it enables training of 100B+ parameter models on reasonably sized clusters.

AI Infrastructure, Microsoft, DeepSpeed, ZeRO-3
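
The effect of each sharding stage on per-GPU memory can be estimated with the accounting used in the ZeRO paper for mixed-precision Adam: 2 bytes per parameter for FP16 weights, 2 for FP16 gradients, and 12 for optimizer state (FP32 weights, momentum, variance). The sketch below applies that accounting. The 100B-parameter model and 64-GPU cluster are example values, and activations, buffers, and fragmentation are ignored.

```python
def model_state_gb(num_params, num_gpus, stage):
    """Approximate per-GPU model-state memory in GB for ZeRO stages 0-3."""
    weights = 2 * num_params      # FP16 parameters
    grads = 2 * num_params        # FP16 gradients
    optimizer = 12 * num_params   # FP32 weights + Adam momentum + variance
    if stage >= 1:
        optimizer /= num_gpus     # stage 1 shards optimizer states
    if stage >= 2:
        grads /= num_gpus         # stage 2 also shards gradients
    if stage >= 3:
        weights /= num_gpus       # stage 3 also shards the parameters
    return (weights + grads + optimizer) / 1e9


# Example values only: 100B parameters on 64 GPUs.
for stage in range(4):
    print(stage, model_state_gb(100e9, 64, stage))
```

Without sharding, the model states alone need about 1,600 GB per GPU; with all three stages on 64 GPUs the figure falls to about 25 GB, which fits on a single accelerator.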

August 4, 2020 Medium

PyTorch Lightning 1.0: a boilerplate-free training loop

William Falcon and team ship PyTorch Lightning 1.0, a framework that separates research code (model) from engineering (training loop, distributed, checkpointing, logging) and becomes the de facto standard for many open projects.

AI Infrastructure, PyTorch Lightning, Open Source, Training Loop

July 29, 2020 Medium

Google announces TPU v4 with MLPerf 0.7 records

Posting MLPerf Training 0.7 results, Google reveals TPU v4, a new custom deep-learning accelerator, claiming it built the "world's fastest training supercomputer" with a 4,096-chip pod.

AI Infrastructure, Google, TPU v4, Pod

May 14, 2020 Landmark

NVIDIA A100: Ampere arrives and the GPU that trains GPT-3

At GTC 2020 Jensen Huang announces the A100 GPU built on the Ampere architecture: 54 billion transistors, 40-80 GB HBM2e, TF32, 2:4 structured sparsity, and MIG support.

AI Infrastructure, NVIDIA, A100, Ampere
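
The 2:4 structured sparsity pattern requires that at most two of every four consecutive weights are nonzero. A minimal sketch of producing that pattern is shown below, assuming magnitude-based selection (keep the two largest absolute values per group). The selection rule is an illustrative choice: the hardware only requires the pattern, and real workflows typically fine-tune after pruning.

```python
def prune_2_of_4(weights):
    """Zero the two smallest-magnitude values in each group of four."""
    pruned = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Indices of the two largest magnitudes within this group.
        keep = sorted(range(len(group)), key=lambda j: abs(group[j]), reverse=True)[:2]
        pruned.extend(w if j in keep else 0.0 for j, w in enumerate(group))
    return pruned


row = [0.1, -0.9, 0.5, 0.2, 3.0, 0.0, -1.0, 2.0]
print(prune_2_of_4(row))  # [0.0, -0.9, 0.5, 0.0, 3.0, 0.0, 0.0, 2.0]
```

Because the zeros fall in a fixed pattern, the nonzero values can be stored compactly with small index metadata, which is what makes hardware acceleration possible.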