Florence-2: a single visual model for captioning, detection, segmentation, and OCR
In one sentence: Microsoft releases Florence-2, a unified vision foundation model that handles captioning, object detection, segmentation, and OCR with a single prompt-based sequence-to-sequence architecture.
Usually each visual task requires a different model: one to describe images, one to find objects, one to cut out shapes, one to read text in photos. Florence-2 does all of this with a single model.
The key is the sequence-to-sequence approach: every visual task is converted into a prompt-response text pair. Want a description? Send a captioning prompt. Want object coordinates? Same model, different prompt, and the boxes come back as text. The model learns to respond to all of these tasks in a unified way.
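To make the "same model, different prompt" idea concrete, here is a minimal sketch in plain Python. It does not load the model. The task strings are the ones listed on the public Florence-2 model card and should be checked against the version in use. The 1000-bin coordinate convention, the sample response string, and the helper name `parse_detection` are assumptions for illustration; the real processor may round slightly differently.

```python
import re

# Task prompts as listed on the Florence-2 model card (verify for your version).
TASK_PROMPTS = {
    "caption": "<CAPTION>",
    "object_detection": "<OD>",
    "ocr": "<OCR>",
    "segmentation": "<REFERRING_EXPRESSION_SEGMENTATION>",
}

# Assumption: box coordinates are quantized into 1000 bins per axis
# and emitted as text tokens of the form <loc_N>.
LOC_BINS = 1000

# A label followed by exactly four location tokens (x1, y1, x2, y2).
_BOX = re.compile(r"([^<]+)((?:<loc_\d+>){4})")


def parse_detection(text, width, height):
    """Hypothetical helper: turn a detection-style text response into pixel boxes."""
    results = []
    for label, locs in _BOX.findall(text):
        x1, y1, x2, y2 = (int(n) for n in re.findall(r"<loc_(\d+)>", locs))
        results.append({
            "label": label.strip(),
            # Scale each bin index back to pixel coordinates.
            "box": (
                x1 * width / LOC_BINS,
                y1 * height / LOC_BINS,
                x2 * width / LOC_BINS,
                y2 * height / LOC_BINS,
            ),
        })
    return results


# Made-up response string, only to exercise the parser.
sample = "car<loc_100><loc_200><loc_500><loc_800>person<loc_0><loc_0><loc_250><loc_500>"
boxes = parse_detection(sample, width=1000, height=500)
```

The point of the sketch is that switching tasks only changes the prompt string, and that structured outputs such as boxes can be recovered from ordinary generated text.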
Florence-2 comes in two small sizes (about 230M and 770M parameters) and is fast, which makes it practical to run in production without specialized hardware.
Companies
Microsoft
Tools
Florence-2