Vision Transformer (ViT): "An Image is Worth 16x16 Words"

In one sentence: Google Research introduces the Vision Transformer (ViT), which applies a pure transformer to image patches as if they were tokens, and shows that with enough pre-training data it matches or beats state-of-the-art CNNs on ImageNet and other vision benchmarks.

For years, image recognition was dominated by one specialized family of neural networks: convolutional networks (CNNs). They were tailor-made for images, sliding small filters over one small region at a time.

Google runs a bold experiment: take the transformer, the architecture born for text, and feed it images directly, sliced into many small 16×16-pixel "tiles" that are treated as words. No convolutions, and almost no vision-specific mechanisms.
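
To make the "tiles as words" idea concrete, here is a minimal NumPy sketch of the slicing step. It is an illustration under stated assumptions, not the paper's code: the function name `patchify`, the height × width × channels layout, and the 224×224 RGB example size are all choices made here for the example.

```python
import numpy as np


def patchify(image, patch_size=16):
    """Split an (H, W, C) image into a sequence of flattened square patches.

    Hypothetical helper for illustration; the name and the channels-last
    layout are assumptions, not taken from the original article.
    """
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be multiples of patch_size")
    grid_h, grid_w = h // patch_size, w // patch_size
    # Cut both spatial axes into (grid index, offset inside the patch).
    patches = image.reshape(grid_h, patch_size, grid_w, patch_size, c)
    # Bring the two grid axes to the front: (grid_h, grid_w, P, P, C).
    patches = patches.transpose(0, 2, 1, 3, 4)
    # One row per patch, read left to right, top to bottom.
    return patches.reshape(grid_h * grid_w, patch_size * patch_size * c)


# Example: a 224x224 RGB image becomes 14 * 14 = 196 "words",
# each holding 16 * 16 * 3 = 768 raw pixel values.
image = np.zeros((224, 224, 3), dtype=np.float32)
tokens = patchify(image)
print(tokens.shape)  # (196, 768)
```

In the full model each flattened patch would then pass through a learned linear projection and receive a position embedding before entering the transformer; those learned parts are left out of this sketch.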

Result: with enough pre-training data, it wins. The same family of architecture that powers GPT also understands images. From here on, much of modern vision (CLIP, DALL·E, and parts of Stable Diffusion) builds on transformers.