Image GPT: generative pretraining for images
In one sentence: OpenAI introduces Image GPT (iGPT), a transformer that treats pixels as tokens and shows that GPT-style sequential generative pretraining also works on images, reaching competitive performance on CIFAR-10.
Language models like GPT work by reading words one at a time and trying to predict the next one. OpenAI tries the same thing on image pixels: step through them one at a time and predict the next one.
It sounds weird (pixels aren't words), but the result is interesting. With no one telling it what a cat or a car is, the model learns useful representations on its own, on par with techniques designed specifically for vision.
It's a small experiment that says something important: the same "engine" that understands text can understand images; you just need to feed them in as a sequence.
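To make "feed the image in as a sequence" concrete, here is a minimal sketch in plain Python. It is not OpenAI's code. It assumes the pixels are read row by row (raster order) and that each pixel is a small integer token. The function names and the `uniform_model` placeholder are hypothetical, chosen only for illustration. A real model would replace the placeholder with a transformer that outputs a probability for each possible next pixel value.

```python
import math

def flatten_raster(image):
    """Flatten a 2D grid of pixel values into a 1D sequence, row by row."""
    return [pixel for row in image for pixel in row]

def next_pixel_examples(sequence):
    """Pair each prefix of the sequence with the pixel that follows it."""
    return [(sequence[:i], sequence[i]) for i in range(1, len(sequence))]

def uniform_model(context, vocab_size):
    """Hypothetical placeholder predictor: every pixel value is equally likely."""
    return [1.0 / vocab_size] * vocab_size

def average_nll(sequence, model, vocab_size):
    """Average negative log-likelihood of each pixel given the pixels before it."""
    examples = next_pixel_examples(sequence)
    total = 0.0
    for context, target in examples:
        probs = model(context, vocab_size)
        total -= math.log(probs[target])
    return total / len(examples)

# Toy 2x3 "image" with 4 possible pixel values (0..3); an assumption for the demo.
image = [[0, 1, 2],
         [3, 2, 1]]
sequence = flatten_raster(image)          # the image as a token sequence
examples = next_pixel_examples(sequence)  # (pixels seen so far, next pixel) pairs
loss = average_nll(sequence, uniform_model, vocab_size=4)
```

Training means lowering this loss. A model that predicts the next pixel better than the uniform guess has had to pick up some of the structure in the image, and that structure is where the useful representations come from.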
Companies
OpenAI
Tools
Image GPT, iGPT