Vision Transformer (ViT): "An Image is Worth 16x16 Words"
In one sentence: Google Research introduces the Vision Transformer (ViT), which applies a pure transformer to image patches as if they were tokens, and shows that with enough pre-training data it matches or beats state-of-the-art CNNs on ImageNet and other image-classification benchmarks.
For years, image recognition was dominated by one specialized kind of neural network: the convolutional neural network (CNN). CNNs were tailor-made for images: small learned filters slide across the picture and look at one local region at a time.
Google runs a bold experiment: take the transformer, the architecture born for text, and feed it images directly, sliced into many small "tiles" of 16×16 pixels, each treated like a word. No convolutions, and almost none of the vision-specific machinery that CNNs rely on.
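To make the "tiles as words" idea concrete, here is a minimal NumPy sketch of how an image can become a token sequence. It is an illustration under stated assumptions, not the authors' code: the helper name `patchify` is hypothetical, the 224×224 RGB input and the embedding width of 768 are assumed for the example, and random matrices stand in for the projection, class token, and position embeddings that a real model would learn.

```python
import numpy as np


def patchify(image: np.ndarray, patch_size: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping square patches."""
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    rows, cols = h // patch_size, w // patch_size
    # (rows, P, cols, P, C) -> (rows, cols, P, P, C) -> (rows * cols, P * P * C)
    patches = image.reshape(rows, patch_size, cols, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(rows * cols, patch_size * patch_size * c)


rng = np.random.default_rng(0)
image = rng.random((224, 224, 3), dtype=np.float32)  # assumed input size

# 14 x 14 = 196 patches, each holding 16 * 16 * 3 = 768 raw pixel values.
patches = patchify(image)

embed_dim = 768  # assumed embedding width for this example
# Random stand-ins for parameters that would be learned during training.
projection = rng.normal(0.0, 0.02, size=(patches.shape[1], embed_dim)).astype(np.float32)
class_token = np.zeros((1, embed_dim), dtype=np.float32)
position = rng.normal(0.0, 0.02, size=(patches.shape[0] + 1, embed_dim)).astype(np.float32)

# Project each patch to an embedding, prepend the class token, add positions.
tokens = np.concatenate([class_token, patches @ projection], axis=0) + position

print(patches.shape, tokens.shape)  # (196, 768) (197, 768)
```

The resulting `tokens` array plays the same role as a sequence of word embeddings: from this point on, a standard transformer encoder can process it without knowing the input was ever an image.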
Result: with enough pre-training data, it wins. The same family of architecture that powers GPT can also understand images. Since then, transformers have become a core building block of modern vision and image-generation systems, including CLIP, DALL·E, and parts of Stable Diffusion.
Tools
ViT, Vision Transformer