Google RT-1: the first Transformer trained on real robotics data
52 entries
Google releases RT-1, a robotics transformer trained on 130,000 real episodes collected with 13 robots, generalizing to never-seen tasks.
Anthropic publishes Constitutional AI: instead of pure RLHF, the model critiques and revises its own responses following a written 'constitution'. Less human labeling, more transparency.
Boston Dynamics' Spot gains advanced autonomous navigation and industrial anomaly detection via visual AI, operating without pre-loaded maps.
OpenAI launches ChatGPT, a free conversational interface on GPT-3.5 aligned via RLHF. It crosses one million users in five days.
Stability AI releases SD 2.0 with OpenCLIP replacing CLIP, native 768x768 resolution, a new depth2img model, and improved inpainting. A controversial release due to breaking compatibility with existing LoRAs and prompts.
Notion launches Notion AI in private alpha, integrating GPT directly into pages: summarize, rewrite, translate, and brainstorm without leaving the document.
Meta unveils Galactica, a 120B-parameter model trained on 48 million scientific papers. The public demo is pulled after three days under a wave of criticism for authoritative-sounding hallucinations.
NVIDIA consolidates Triton as the open-source platform for serving PyTorch, TensorFlow, and ONNX models in production, with dynamic batching, multi-GPU support, and gRPC/HTTP APIs.
Hugging Face Accelerate provides a unified API that runs the same training code across CPUs, GPUs, and TPUs with minimal changes, becoming the backbone of most open LLM training pipelines.
Harrison Chase releases LangChain, an open-source Python library to chain LLMs with prompt templates, memory, tools and external data sources. It will become the default stack of the first LLM apps.
Weizmann Institute publishes Textual Inversion: learning a new text token representing a custom concept from 3-5 images, without modifying model weights.
EnCodec compresses 24 kHz audio to just 1.5–12 kbps at quality surpassing Opus, becoming a standard neural codec for modern neural TTS.
Google pre-trains a single policy on over 800 real robot tasks and 57,000 hours of real-world data, demonstrating for the first time zero-shot transfer to new tasks through large-scale multi-task offline learning.
Frantar et al. (IST Austria and ETH Zurich) publish GPTQ: accurate post-training 4-bit quantization with no retraining, the first technique to make inference of 175B-parameter models practical on a single GPU.
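To make "4-bit quantization" concrete, here is a minimal sketch assuming plain round-to-nearest with one scale per weight group. This is the simple baseline, not GPTQ's error-compensating procedure, and the function names are hypothetical.

```python
def quantize_4bit(weights):
    """Map floats to integer codes 0..15 (round-to-nearest baseline, not GPTQ)."""
    lo, hi = min(weights), max(weights)
    # 4 bits give 16 levels, so 15 steps span the range; fall back to 1.0 if hi == lo.
    scale = (hi - lo) / 15 or 1.0
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize_4bit(codes, scale, lo):
    """Reconstruct approximate floats from the 4-bit codes."""
    return [lo + c * scale for c in codes]


# Illustrative weights (made up for this sketch).
weights = [-1.0, -0.4, 0.0, 0.3, 1.0]
codes, scale, lo = quantize_4bit(weights)
restored = dequantize_4bit(codes, scale, lo)
```

Each restored weight differs from the original by at most half a quantization step. Methods like GPTQ aim to do better than this baseline by choosing codes that minimize the layer's output error rather than the per-weight error.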
Yao et al. introduce ReAct, a prompting scheme that interleaves explicit reasoning steps (Thought) with concrete actions (Act) in LLMs, a conceptual foundation for many modern agents.
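The alternation of thoughts and actions can be sketched as a small loop. This is an illustrative sketch only: `scripted_model` is a hypothetical stand-in for an LLM call, and the `multiply` tool and the transcript format are assumptions, not taken from the paper.

```python
def react_loop(model, tools, question, max_steps=5):
    """Alternate Thought and Act steps until the model chooses 'finish'."""
    transcript = [f"Question: {question}"]
    for _ in range(max_steps):
        thought, action, arg = model(transcript)
        transcript.append(f"Thought: {thought}")
        transcript.append(f"Act: {action}[{arg}]")
        if action == "finish":
            return arg, transcript
        # Run the chosen tool and feed its result back as an observation.
        transcript.append(f"Observation: {tools[action](arg)}")
    return None, transcript


def scripted_model(transcript):
    """Hypothetical stand-in for an LLM: call a tool once, then finish."""
    if not any(line.startswith("Observation:") for line in transcript):
        return "I need to compute the product.", "multiply", "6,7"
    result = transcript[-1].split(": ", 1)[1]
    return "I have the result.", "finish", result


def multiply(arg):
    a, b = arg.split(",")
    return str(int(a) * int(b))


answer, transcript = react_loop(scripted_model, {"multiply": multiply}, "What is 6 times 7?")
```

In a real agent the model would be prompted with the transcript so far, and its text output would be parsed into the thought, action, and argument.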
A week after Make-A-Video, Google Research unveils Imagen Video and, around the same time, Phenaki: two different approaches to text-to-video, with longer, more coherent clips.
Meta AI shows Make-A-Video, a system that generates short animated clips from a text description by reusing a pre-existing text-to-image model.
Hugging Face launches Inference Endpoints, a managed service to deploy Hub models on AWS, Azure or GCP with autoscaling, on-demand GPUs and private endpoints.
Google scales instruction tuning to 1,800 tasks and 540B parameters, open-sources Flan-T5, and shows that chain-of-thought reasoning is teachable via fine-tuning.
OpenAI releases Whisper under MIT license: a speech-to-text model trained on 680,000 hours of multilingual audio, near commercial-grade quality, runs locally.
Noam Shazeer and Daniel De Freitas, fathers of LaMDA, launch Character.AI: a platform letting anyone create and chat with AI characters, from Einstein to anime personas.
Riley Goodside and Perez et al. formalize prompt injection: an attack where malicious user input overrides an LLM's system instructions, bypassing policies and guardrails.
AudioLM generates long-range coherent audio using two tiers of tokens — semantic and acoustic — with no text or score conditioning.
Google Research publishes DreamBooth: fine-tune a diffusion model on 3-5 images of a specific subject to reproduce it in any context or style. Foundation of all personalized AI image generation.
Stability AI publicly releases weights and code of a text-to-image latent diffusion model that runs on a consumer GPU. AI image generation leaves the cloud.
GitHub publishes first real-world data: 40% of code in files with Copilot active is AI-generated. First quantitative benchmark on AI tools' actual impact on developer output.
Google Robotics shows how to combine an LLM for high-level planning with robot value functions that filter only physically executable actions.
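One way to picture the combination described above: the LLM scores how useful each skill is for the instruction, a value function scores how feasible it is in the current state, and the robot picks the skill with the best product. The sketch below assumes both scores are already available as dictionaries; the skill names and numbers are made up for illustration.

```python
def select_skill(llm_scores, affordances):
    """Pick the skill maximizing (LLM usefulness) * (value-function feasibility)."""
    # Skills with no affordance estimate are treated as infeasible (score 0).
    return max(llm_scores, key=lambda skill: llm_scores[skill] * affordances.get(skill, 0.0))


# Hypothetical scores for the instruction "bring me an apple".
llm_scores = {"pick up sponge": 0.5, "go to table": 0.3, "pick up apple": 0.2}
affordances = {"pick up sponge": 0.1, "go to table": 0.9, "pick up apple": 0.8}

chosen = select_skill(llm_scores, affordances)
```

Here the language model alone would favor a skill the robot cannot currently execute well; weighting by feasibility redirects the choice to an action that is both relevant and doable.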
Hugging Face releases diffusers, a modular Python library for diffusion models — text-to-image, audio and beyond. It quickly becomes the de facto standard.
OpenAI opens DALL-E 2 in beta to over one million waitlist users, with a pay-per-image credit system. First large-scale consumer product for image generation.
The BigScience collective releases BLOOM, a 176-billion-parameter model trained on 46 human languages and 13 programming languages, under an open RAIL license.
Midjourney opens its public beta with a text-to-image model accessible via a Discord bot. Its strong aesthetic default and community turn image generation into a mass phenomenon.
Perez et al. (DeepMind) show that an LLM can be used as an automatic attacker against another LLM, discovering undesired behaviors at a scale impossible for human teams.
Google Research combines three major pretraining objectives into a single 20B model, outperforming GPT-3 on many benchmarks at one-eighth the parameters.
Tabnine releases version 3.0 with local or cloud model support, becoming the first mature AI code completion product on the market before Copilot's rise.
Tri Dao et al. (Stanford) publish FlashAttention: an IO-aware exact attention implementation that avoids materializing the full attention matrix in HBM, achieving a 2-4x speedup and roughly 10x lower GPU memory use.
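The core trick that lets attention be computed without storing the full score matrix is a running ("online") softmax over blocks of keys. The sketch below is a pure-Python illustration for a single query vector, assuming small lists instead of GPU tiles; it shows the arithmetic only, not the actual kernel, and the function names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def naive_attention(q, keys, values):
    """Reference: compute all scores at once, then softmax-weight the values."""
    scores = [dot(q, k) for k in keys]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    return [sum(wi * v[j] for wi, v in zip(w, values)) / z for j in range(len(values[0]))]


def tiled_attention(q, keys, values, block=2):
    """Process keys block by block, keeping only a running max, sum, and output."""
    m = float("-inf")              # running maximum score
    z = 0.0                        # running softmax denominator
    acc = [0.0] * len(values[0])   # running unnormalized output
    for start in range(0, len(keys), block):
        scores = [dot(q, k) for k in keys[start:start + block]]
        m_new = max(m, max(scores))
        rescale = math.exp(m - m_new)  # 0.0 on the first block, since m is -inf
        w = [math.exp(s - m_new) for s in scores]
        z = z * rescale + sum(w)
        block_values = values[start:start + block]
        acc = [a * rescale + sum(wi * v[j] for wi, v in zip(w, block_values))
               for j, a in enumerate(acc)]
        m = m_new
    return [a / z for a in acc]
```

Both functions return the same result up to floating-point error; the tiled version never holds more than one block of scores at a time, which is what allows the real kernel to keep its working set in fast on-chip memory.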
GitHub announces general availability of Copilot for all developers at $10/month. It's the first mass-market AI tool living inside the daily code editor.
SoundStream uses residual vector quantization to compress audio at 3 kbps with quality surpassing Opus at 12 kbps, laying the groundwork for the neural codecs used in audio LLMs.
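Residual vector quantization can be sketched in a few lines: each stage quantizes whatever error the previous stages left behind, so a handful of small codebooks approximates a vector far better than one codebook of the same size. The codebooks below are tiny hand-picked examples, not learned ones, and the function names are hypothetical.

```python
def nearest(codebook, x):
    """Index of the codebook entry closest to x (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], x)))


def rvq_encode(x, codebooks):
    """Quantize x stage by stage; each stage encodes the remaining residual."""
    residual = list(x)
    codes = []
    for codebook in codebooks:
        i = nearest(codebook, residual)
        codes.append(i)
        residual = [r - c for r, c in zip(residual, codebook[i])]
    return codes


def rvq_decode(codes, codebooks):
    """Sum the selected entry from every stage to reconstruct the vector."""
    out = [0.0] * len(codebooks[0][0])
    for i, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[i])]
    return out


# Illustrative two-stage setup: a coarse codebook followed by a fine one.
coarse = [[0.0, 0.0], [1.0, 1.0]]
fine = [[0.0, 0.0], [0.25, -0.25], [-0.25, 0.25]]
codes = rvq_encode([1.2, 0.8], [coarse, fine])
approx = rvq_decode(codes, [coarse, fine])
```

The bitrate is set by how many stages are transmitted, so dropping later stages trades quality for bandwidth without changing the codebooks.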
James Betker releases Tortoise TTS, an open-source model that clones a voice from a few seconds of audio with human-like vocal quality: the first real breakthrough in accessible TTS.
Google Research unveils Imagen, a text-to-image diffusion model that uses a frozen T5 text encoder and beats DALL-E 2 on benchmarks for photorealistic fidelity.
DeepMind unveils Gato, a 1.2-billion-parameter Transformer that with the same weights plays Atari games, controls a robot arm, captions images and chats.
Meta AI releases OPT-175B, a language model comparable in size to GPT-3, with weights available to researchers and a public training logbook.
Flamingo brings few-shot learning to vision: SOTA on VQA and captioning with no task-specific fine-tuning.
NaturalSpeech is the first TTS system to achieve a MOS statistically indistinguishable from recorded human speech on the LJSpeech benchmark, marking a historic milestone for speech synthesis.
OpenAI announces DALL·E 2, a diffusion-based text-to-image model producing photorealistic 1024×1024 images. Initially waitlist-only, with a paid beta opening in July.
Google publishes PaLM, a 540B-parameter model trained on the new Pathways system. Demonstrates emergent reasoning capabilities when guided with chain-of-thought.
DeepMind publishes the Chinchilla paper and shows that, given equal compute, smaller models trained on far more tokens beat oversized undertrained ones.
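A back-of-the-envelope version of this trade-off, under two assumptions that are not stated in the entry above: training compute is approximated as C ≈ 6·N·D FLOPs for N parameters and D tokens, and the compute-optimal ratio is taken as roughly 20 tokens per parameter. Both are commonly quoted rules of thumb, so treat the numbers as a sketch rather than the paper's fitted law.

```python
import math


def compute_optimal(flops, tokens_per_param=20.0):
    """Split a FLOP budget into (parameters, tokens) under C = 6 * N * D, D = r * N."""
    # Substituting D = r * N gives C = 6 * r * N^2, so N = sqrt(C / (6 * r)).
    n_params = math.sqrt(flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


# Budget equivalent to a 70B-parameter model trained on 1.4T tokens.
budget = 6 * 70e9 * 1.4e12
n_params, n_tokens = compute_optimal(budget)
```

Under these assumptions, quadrupling the budget doubles both the model size and the token count, rather than putting all the extra compute into parameters.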
At GTC 2022 NVIDIA unveils the Hopper architecture and the H100 GPU, with FP8 Transformer Engine and NVLink 4. It will become the hardware substrate for nearly every large LLM of the following years.
Wang et al. (Google Brain) show that sampling N diverse reasoning paths and taking the most frequent final answer beats greedy decoding across a range of reasoning benchmarks.
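The voting step is simple enough to sketch directly. In the snippet below, `sample_answer` is a hypothetical stand-in for sampling one reasoning path from an LLM and extracting its final answer; the canned answers are made up for illustration.

```python
from collections import Counter


def self_consistent_answer(sample_answer, question, n=5):
    """Sample n final answers and return the most frequent one."""
    answers = [sample_answer(question) for _ in range(n)]
    # most_common(1) returns a list with one (answer, count) pair.
    return Counter(answers).most_common(1)[0][0]


# Hypothetical sampler: replays canned answers instead of calling a model.
canned = iter(["41", "42", "42", "43", "42"])
result = self_consistent_answer(lambda question: next(canned), "What is 6 times 7?")
```

Individual samples may disagree, but errors tend to scatter across different wrong answers while correct reasoning paths converge on the same one, which is why the majority vote helps.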
DeepMind unveils AlphaCode, a system that generates code for competitive programming problems and ranks in the top half of human participants on Codeforces.
Coqui TTS is an open-source Python library for quality text-to-speech, forked from Mozilla TTS, supporting over 1,100 languages and adopted by the Hugging Face community.
OpenAI introduces InstructGPT: GPT-3 models fine-tuned with human feedback (RLHF). Labelers prefer outputs from the 1.3B-parameter version over those of the 175B base model, despite it being over 100x smaller.
AI2 and University of Washington present UnifiedIO: the first sequence-to-sequence model capable of handling text, images, audio, video, and structured data as both inputs and outputs through a single architecture, trained on 80+ tasks simultaneously.