May 30, 2024 High Multimodal AI · 1 min read

Microsoft Phi-3 Vision: a 4.2B-parameter multimodal model for edge devices

In one sentence: Microsoft brings multimodal AI to the edge with Phi-3 Vision, a 4.2B-parameter model with a 128k-token context window that is competitive with models 10x larger on visual benchmarks.

Verified Official source

Phi-3 Vision is a Microsoft model that understands text and images together, with one distinguishing characteristic: it is small enough to run on smartphones and laptops without cloud connectivity. With just 4.2 billion parameters, it handles very long inputs (a context window of up to 128,000 tokens, where a token is a word or a piece of a word) and reasons about images. On visual benchmarks it is competitive with models ten times larger, which suggests that training data quality can matter as much as model size.
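To make the "small enough for a laptop or phone" claim concrete, here is a rough back-of-the-envelope sketch of how much memory the weights alone would need. The 4.2B parameter count comes from the article; the storage formats and their bytes-per-parameter figures are standard values chosen for illustration, not details Microsoft has stated about how Phi-3 Vision is shipped.

```python
# Back-of-the-envelope weight memory for a 4.2B-parameter model.
PARAMS = 4.2e9  # parameter count quoted in the article

# Bytes per parameter for common storage formats.
# Assumption for illustration: the article does not say which format is used.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params: float, bytes_per_param: float) -> float:
    """Return the memory needed for the weights alone, in decimal gigabytes."""
    return params * bytes_per_param / 1e9


estimates = {
    fmt: weight_memory_gb(PARAMS, size) for fmt, size in BYTES_PER_PARAM.items()
}

for fmt, gb in estimates.items():
    print(f"{fmt}: {gb:.1f} GB")
```

These figures cover the weights only. Activations and the attention cache add to the total, and the cache grows with the amount of context actually used, so a full 128k-token input would need noticeably more memory than the weights suggest.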

Companies

Microsoft

Tools

Phi-3 Vision, Azure