AImpact
High Multimodal AI · 1 min read

BLIP-2: the Q-Former bridge between vision and language

In one sentence: Salesforce introduces BLIP-2, in which a lightweight Q-Former bridges a frozen visual encoder and a frozen LLM, achieving state-of-the-art captioning with 8x fewer trainable parameters.

Verified Official source

BLIP-2 is a Salesforce system that connects an image encoder and a large language model through an intermediate component called the Q-Former. Both the visual model and the text model stay "frozen", meaning their weights are never updated; only the Q-Former is trained. Because gradients update just this small bridge, training costs drop sharply. The resulting system outperforms prior models on image captioning while updating far fewer parameters.
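
To make the "frozen" idea concrete, here is a minimal sketch in plain Python of how freezing two of three components shrinks the trainable parameter count. The `Component` class, the helper functions, and all parameter counts are hypothetical and chosen only for illustration; they are not BLIP-2's real sizes or Salesforce's code.

```python
from dataclasses import dataclass


@dataclass
class Component:
    """Toy stand-in for a model component (hypothetical, not a real API)."""
    name: str
    num_params: int
    trainable: bool = True

    def freeze(self) -> None:
        # A frozen component keeps its weights fixed during training.
        self.trainable = False


def trainable_params(components):
    """Sum the parameters of components that would receive updates."""
    return sum(c.num_params for c in components if c.trainable)


def total_params(components):
    """Sum the parameters of every component, frozen or not."""
    return sum(c.num_params for c in components)


# Illustrative sizes only (assumptions, not figures from the source).
vision_encoder = Component("vision_encoder", 1_000_000_000)
q_former = Component("q_former", 100_000_000)
llm = Component("llm", 3_000_000_000)

# Freeze both large models; only the bridge stays trainable.
vision_encoder.freeze()
llm.freeze()

pipeline = [vision_encoder, q_former, llm]
trainable = trainable_params(pipeline)
total = total_params(pipeline)
print(f"trainable: {trainable:,} of {total:,} ({trainable / total:.1%})")
```

With these made-up numbers, only the bridge's parameters count as trainable, even though the full pipeline is many times larger. In a real deep learning framework the same effect is usually achieved by disabling gradient computation for the frozen models' weights.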

Companies

Salesforce

Tools

BLIP-2, Q-Former

Tags

BLIP-2, Q-Former, Image Captioning, Salesforce, Efficient Training
