Florence-2: a single vision model for captioning, detection, segmentation, and OCR

In one sentence: Microsoft releases Florence-2, a unified vision foundation model that handles captioning, object detection, segmentation, and OCR with a single prompt-based sequence-to-sequence architecture.

Verified Official source


Vision tasks usually each need their own model: one to describe images, one to locate objects, one to segment shapes, and one to read text in photos. Florence-2 handles all four with a single model.

The key is the sequence-to-sequence approach: every vision task is cast as a text prompt paired with a text response. Want a description? Give the model a captioning prompt. Want object coordinates? Same model, different prompt, with the coordinates written out as part of the response text. The model learns to handle all of these tasks in a unified way.
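To make the "everything is text" idea concrete, here is a minimal sketch of how a detection response could be decoded back into pixel boxes. It rests on several assumptions that are not stated in this article: that tasks are selected with short tags such as `<CAPTION>`, `<OD>`, and `<OCR>`, that coordinates appear in the output as `<loc_N>` tokens, and that each axis is quantized into 1000 bins. The helper `parse_detections` and the sample string are hypothetical, and the bin-to-pixel mapping is deliberately simplified; the model's own post-processing may differ.

```python
import re

# Assumed task tags, for illustration only; check the model card for exact strings.
TASK_PROMPTS = {
    "caption": "<CAPTION>",
    "detection": "<OD>",
    "ocr": "<OCR>",
}

# Assumption: each coordinate is quantized into this many bins per axis.
NUM_BINS = 1000

# One detection = a text label followed by four location tokens (x1, y1, x2, y2).
_BOX = re.compile(r"([^<]+)<loc_(\d+)><loc_(\d+)><loc_(\d+)><loc_(\d+)>")


def parse_detections(text, width, height):
    """Convert 'label<loc_x1><loc_y1><loc_x2><loc_y2>...' into (label, pixel box) pairs.

    Hypothetical helper: it maps each bin index to the left/top edge of its bin,
    which is a simplification of whatever the real post-processor does.
    """
    results = []
    for label, x1, y1, x2, y2 in _BOX.findall(text):
        box = (
            int(x1) * width / NUM_BINS,
            int(y1) * height / NUM_BINS,
            int(x2) * width / NUM_BINS,
            int(y2) * height / NUM_BINS,
        )
        results.append((label.strip(), box))
    return results


# Made-up response text for a 1000x500 image, standing in for real model output.
sample = "car<loc_100><loc_200><loc_500><loc_800>person<loc_0><loc_0><loc_250><loc_500>"
for label, box in parse_detections(sample, width=1000, height=500):
    print(label, box)
```

The point of the sketch is the design choice rather than the exact token format: because boxes are just more tokens in the output sequence, the same decoder that writes a caption can also write coordinates, and only the prompt changes between tasks.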

Florence-2 is small by foundation-model standards, with 230M- and 770M-parameter variants, and fast, which makes it practical for production use without dedicated hardware.
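For a rough sense of what those parameter counts mean in practice, the back-of-the-envelope estimate below computes the memory needed just to hold the weights. It assumes 2 bytes per parameter (16-bit floats) or 4 bytes (32-bit floats) and ignores activations, image inputs, and framework overhead, so real usage will be higher. The function name `weight_memory_gb` is illustrative.

```python
def weight_memory_gb(num_params, bytes_per_param=2):
    """Approximate memory for model weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# The two parameter counts quoted in the article.
for name, n in [("230M", 230_000_000), ("770M", 770_000_000)]:
    fp16 = weight_memory_gb(n)      # assumes 16-bit weights
    fp32 = weight_memory_gb(n, 4)   # assumes 32-bit weights
    print(f"{name}: ~{fp16:.2f} GB in fp16, ~{fp32:.2f} GB in fp32")
```

Under these assumptions the weights come to roughly 0.46 GB and 1.54 GB at 16-bit precision, which is within reach of an ordinary laptop or a modest consumer GPU.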