Google Gemini 1.0: natively multimodal in three sizes

In one sentence: Google announces Gemini Ultra, Pro, and Nano, billed as the first family of natively multimodal models (text, images, audio, video); Ultra beats GPT-4 on MMLU, 90.0% vs 86.4%, but the launch is overshadowed by a controversial demo video.

Google ships Gemini, the model family that succeeds PaLM 2. The headline: Gemini is natively multimodal, meaning a single model trained from the start on text, images, audio, and video. GPT-4 Vision, by contrast, is characterized as "bolting together" separate modules.

Three sizes:

Ultra: the top model, claimed superior to GPT-4 on 30 of the 32 benchmarks tested. First model to exceed the human-expert level on MMLU (90.0% vs 89.8%);
Pro: the size that goes into Bard and Vertex AI, comparable to GPT-3.5;
Nano: runs on-device, debuting on the Pixel 8 Pro.

The launch is dented by a controversy: the "Hands on with Gemini" demo video was edited and sped up to look real-time. Google admits the responses were generated from text prompts and still images, not live video. Ultra isn't available at launch; it arrives in February 2024 in the paid tier announced as Bard Advanced, which ships under the name Gemini Advanced.