Phi-3-Vision-128K (Microsoft): 4.2B VLM that outperforms models 4x its size on documents

In one sentence: Microsoft releases Phi-3-Vision-128K, a 4.2-billion-parameter vision-language model with a 128k-token context window, chart and diagram understanding, and document Q&A. It outperforms 13-20B models on document understanding benchmarks, making it a strong compact VLM for edge deployment and cost-sensitive enterprise inference.

In the race to build ever-larger AI models, Microsoft chose a different direction: build a small vision model that is exceptionally good at the tasks enterprises actually need.

Phi-3-Vision-128K has just 4.2 billion parameters, a fraction of the size of GPT-4 Vision or Gemini Ultra. Yet on benchmarks for document comprehension, chart reading, and technical diagrams, it outperforms models three to four times its size.

The key is training data selection: instead of using everything available on the internet, the Microsoft Research team curated a dense dataset of business documents, charts, tables, software screenshots, and technical diagrams. The model became highly specialized in exactly what companies need.

The 128,000-token context window, remarkably large for such a small model, allows entire documents to be loaded and queried in a single pass, without splitting them into chunks and losing information along the way.
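To make the context-window figure concrete, here is a minimal sketch of a pre-flight check that estimates whether a document's text would fit in a 128,000-token window. The ratio of roughly 4 characters per token and the 4,000 tokens reserved for the prompt and the answer are assumptions for illustration, not figures from Microsoft. A real pipeline would count tokens with the model's own tokenizer and would also budget for image tokens.

```python
import math

# Context window stated in the article.
PHI3_VISION_CONTEXT_TOKENS = 128_000


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate. The chars-per-token ratio is an assumed heuristic."""
    if chars_per_token <= 0:
        raise ValueError("chars_per_token must be positive")
    return math.ceil(len(text) / chars_per_token)


def fits_in_context(
    text: str,
    context_tokens: int = PHI3_VISION_CONTEXT_TOKENS,
    reserved_tokens: int = 4_000,  # assumed headroom for the question and the answer
    chars_per_token: float = 4.0,
) -> bool:
    """Return True if the estimated token count fits in the remaining budget."""
    budget = context_tokens - reserved_tokens
    return estimate_tokens(text, chars_per_token) <= budget


if __name__ == "__main__":
    contract = "Lorem ipsum dolor sit amet. " * 10_000  # stand-in for a long document
    print(estimate_tokens(contract), fits_in_context(contract))
```

Under these assumptions, a document of a few hundred thousand characters still fits in one request, which is what makes whole-document querying practical.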

The practical advantage: Phi-3-Vision can run on moderate hardware, including on-premise or edge deployments, with inference costs far below those of larger models. For a company wanting to automatically analyze thousands of contracts or reports, the cost difference is substantial.
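The cost argument comes down to simple arithmetic, sketched below. The function name and every number in the usage example (document count, tokens per document, both per-token prices) are hypothetical placeholders chosen only to show the calculation. They are not measured or published prices for Phi-3-Vision or any other model.

```python
def batch_inference_cost(
    num_documents: int,
    tokens_per_document: int,
    price_per_million_tokens: float,
) -> float:
    """Total cost of processing a batch, given a price per million tokens."""
    if num_documents < 0 or tokens_per_document < 0 or price_per_million_tokens < 0:
        raise ValueError("all inputs must be non-negative")
    total_tokens = num_documents * tokens_per_document
    return total_tokens / 1_000_000 * price_per_million_tokens


if __name__ == "__main__":
    # Hypothetical workload: 10,000 contracts at about 8,000 tokens each.
    docs, tokens = 10_000, 8_000

    # Hypothetical prices per million tokens; replace with real quotes.
    small_model_price = 0.50
    large_model_price = 10.00

    small = batch_inference_cost(docs, tokens, small_model_price)
    large = batch_inference_cost(docs, tokens, large_model_price)
    print(f"small model: {small:.2f}, large model: {large:.2f}")
```

Because cost scales linearly with both volume and unit price, any gap in per-token price carries over unchanged to the total bill for a large batch.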