Phi-3-Vision-128K (Microsoft): 4.2B VLM that outperforms models 4x its size on documents
In one sentence: Microsoft releases Phi-3-Vision-128K, a 4.2-billion-parameter vision-language model with a 128k-token context window, built for chart and diagram understanding and document Q&A. It outperforms 13-20B models on document understanding benchmarks, making it a strong compact VLM for edge deployment and cost-sensitive enterprise inference.
In the race to build ever-larger AI models, Microsoft chose a different direction: build a small visual model that is exceptionally good at the things enterprises actually use.
Phi-3-Vision-128K has just 4.2 billion parameters, a fraction of the size of models such as GPT-4 Vision or Gemini Ultra. Yet on tests for document comprehension, charts, and technical diagrams, it outperforms models three to four times its size.
The key is training data selection: instead of using everything available on the internet, the Microsoft Research team curated a dense dataset of business documents, charts, tables, software screenshots, and technical diagrams. The model became highly specialized in exactly what companies need.
The 128,000-token context window, remarkably large for such a small model, allows entire documents to be loaded and queried in one pass rather than truncated or split into chunks.
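As a rough illustration of what a 128,000-token window means in practice, the sketch below checks whether a document is likely to fit. It is a back-of-the-envelope estimate, not the model's real tokenizer: the ratio of about 4 characters per token is an assumed heuristic for English text, the helper names (`estimate_tokens`, `fits_in_context`) and the reserved answer budget are hypothetical, and images in a prompt would consume additional tokens that this ignores.

```python
# Back-of-the-envelope context check. NOT the model's tokenizer:
# real token counts depend on the tokenizer and on any images in the prompt.

CONTEXT_TOKENS = 128_000   # context window size stated in the article
CHARS_PER_TOKEN = 4        # assumed heuristic for English text


def estimate_tokens(text: str, chars_per_token: int = CHARS_PER_TOKEN) -> int:
    """Estimate the token count of `text` using ceiling division."""
    return -(-len(text) // chars_per_token)


def fits_in_context(
    text: str,
    reserved_for_answer: int = 1_000,      # hypothetical budget for the reply
    context_tokens: int = CONTEXT_TOKENS,
) -> bool:
    """Return True if the estimated prompt plus the reply budget fits the window."""
    return estimate_tokens(text) + reserved_for_answer <= context_tokens


if __name__ == "__main__":
    contract = "x" * 400_000  # stand-in for a long document of 400,000 characters
    print(estimate_tokens(contract), fits_in_context(contract))
```

Under these assumptions, a 400,000-character document comes to roughly 100,000 estimated tokens and fits, while a 600,000-character one would not. For real use, the count should come from the model's own tokenizer.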
The practical advantage: Phi-3-Vision can run on moderate hardware, including on-premise or edge deployments, with inference costs far below those of larger models. For a company that wants to automatically analyze thousands of contracts or reports, the cost difference is substantial.
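To make "moderate hardware" more concrete, the sketch below estimates how much memory the weights of a 4.2-billion-parameter model occupy at different numeric precisions. The parameter count comes from the article; the precision list and the function names are illustrative assumptions, and the estimate covers weights only, so activations, the KV cache for long contexts, and runtime overhead would add to it.

```python
# Weight-only memory estimate. Ignores activations, KV cache, and runtime overhead.

PARAMS = 4.2e9  # parameter count stated in the article

# Assumed storage width in bytes per parameter for common formats.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params: float, bytes_per_param: float) -> float:
    """Approximate weight memory in gigabytes (1 GB = 10^9 bytes)."""
    return params * bytes_per_param / 1e9


def footprint_table(params: float = PARAMS) -> dict[str, float]:
    """Map each format name to its estimated weight memory in GB."""
    return {
        name: round(weight_memory_gb(params, width), 2)
        for name, width in BYTES_PER_PARAM.items()
    }


if __name__ == "__main__":
    for name, gb in footprint_table().items():
        print(f"{name}: ~{gb} GB")
```

By this estimate the weights take about 8.4 GB in 16-bit precision and about 2.1 GB if quantized to 4 bits. Whether a particular quantized build of the model exists, or keeps its accuracy, is outside what this arithmetic can say.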
Companies
Microsoft