
Vision Transformer (ViT): "An Image is Worth 16x16 Words"

In one sentence: Google Research introduces the Vision Transformer (ViT), which applies a pure transformer to image patches as if they were tokens, and shows that with enough pre-training data it matches or beats state-of-the-art CNNs on ImageNet and other vision benchmarks.


For most of a decade, image recognition was dominated by one specialized kind of neural network: the convolutional neural network (CNN). CNNs were tailor-made for images: they slide small learned filters across the picture, looking at one small region at a time.

Google runs a bold experiment: take the transformer, the architecture born for text, and feed it images directly, sliced into many small "tiles" of 16×16 pixels that are treated like words. No convolutions, and almost no vision-specific mechanisms beyond the slicing itself.
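To make the "tiles as words" idea concrete, here is a minimal sketch of the slicing step in Python with NumPy. The function name `image_to_patches` and the 224×224 RGB input size are assumptions chosen for illustration, not something taken from the article. The sketch only covers cutting the image into patches and flattening them; in the actual model each flattened patch is then passed through a learned linear projection and combined with position information before entering the transformer.

```python
import numpy as np

def image_to_patches(image: np.ndarray, patch_size: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping square patches.

    Hypothetical helper for illustration; returns an array of shape
    (num_patches, patch_size * patch_size * C), one row per patch.
    """
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be multiples of patch_size")
    rows, cols = h // patch_size, w // patch_size
    # Split both spatial axes into (block index, offset inside block),
    # then bring the two block indices to the front.
    patches = image.reshape(rows, patch_size, cols, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    # Flatten each patch into one vector; patches are in raster order
    # (left to right, then top to bottom).
    return patches.reshape(rows * cols, patch_size * patch_size * c)

# Assumed input size: a 224x224 RGB image filled with random values.
image = np.random.default_rng(0).random((224, 224, 3))
tokens = image_to_patches(image)
print(tokens.shape)  # (196, 768)
```

With these assumed dimensions the image becomes a sequence of 14 × 14 = 196 "words", each a vector of 16 × 16 × 3 = 768 numbers, which is the kind of sequence a transformer already knows how to process.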

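Result: with enough pre-training data, it matches or outperforms the best CNNs. The same architecture family that powers GPT also works for images. From here on, transformers become a core building block of modern vision systems: CLIP ships with ViT image encoders, DALL·E generates images with a transformer, and diffusion models such as Stable Diffusion rely on attention layers.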

Companies

Google

Tools

ViT, Vision Transformer

Tags

Google, Vision Transformer, ViT, Computer Vision, Patches

Sources