Image GPT: generative pretraining for images

In one sentence: OpenAI introduces Image GPT (iGPT), a transformer that treats pixels as tokens and shows that GPT-style sequential generative pretraining works on images too, reaching competitive performance on CIFAR-10.



Language models like GPT work by reading words one at a time and trying to predict the next one. OpenAI tries the same thing on image pixels: read them one by one and predict the next.
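To make the "pixels as a sequence" idea concrete, here is a minimal sketch in plain Python. It assumes the image is read row by row (an ordering chosen here for illustration; the summary above does not say which one is used), and the function names `flatten_raster` and `next_pixel_pairs` are hypothetical helpers, not part of any OpenAI code. It only builds the training pairs; the transformer that would learn from them is left out.

```python
from typing import List, Tuple


def flatten_raster(image: List[List[int]]) -> List[int]:
    """Flatten a 2D grid of pixel values into a 1D sequence, row by row."""
    return [pixel for row in image for pixel in row]


def next_pixel_pairs(sequence: List[int]) -> List[Tuple[List[int], int]]:
    """Build (context, target) pairs: the pixels seen so far and the next one."""
    return [(sequence[:i], sequence[i]) for i in range(1, len(sequence))]


# Toy 2x3 grayscale "image"; the values are illustrative only.
image = [
    [0, 128, 255],
    [64, 32, 16],
]

sequence = flatten_raster(image)
pairs = next_pixel_pairs(sequence)

for context, target in pairs:
    print(f"given {context} -> predict {target}")
```

Each pair plays the same role as "the words so far, then the next word" in a language model: the model sees the context and is trained to guess the target.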

It sounds weird, since pixels aren't words, but the result is interesting. With no one telling it what a cat or a car is, the model learns useful representations on its own, on par with techniques specifically designed for vision.

It's a small experiment that says something important: the same "engine" that understands text can understand images; you just need to feed them in as a sequence.