January 30, 2023 · High · Multimodal AI · 1 min read

BLIP-2: the Q-Former bridge between vision and language

In one sentence: Salesforce introduces BLIP-2, in which a lightweight Q-Former bridges a frozen image encoder and a frozen LLM, achieving state-of-the-art zero-shot captioning and outperforming Flamingo80B on zero-shot VQAv2 with 54x fewer trainable parameters.

BLIP-2 is a Salesforce system that connects an image encoder to a large language model (LLM) through an intermediate component called the Q-Former (Querying Transformer). Both the image encoder and the LLM stay frozen, so their weights are never updated; only the Q-Former is trained. Because training updates only this small bridge, it costs far less than training the full stack end to end. The result outperforms prior models on image captioning while updating far fewer parameters.
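To make the "frozen versus trained" split concrete, here is a minimal bookkeeping sketch in plain Python. The `Module` class, the `summarize` helper, and the parameter counts are all hypothetical round numbers chosen for illustration; they are not figures from the source or from any BLIP-2 release.

```python
from dataclasses import dataclass


@dataclass
class Module:
    """One stage of a vision-language pipeline (hypothetical, for illustration)."""
    name: str
    num_params: int
    trainable: bool  # False means the weights are frozen during training


def summarize(modules):
    """Return (trainable params, total params, trainable fraction)."""
    trainable = sum(m.num_params for m in modules if m.trainable)
    total = sum(m.num_params for m in modules)
    return trainable, total, trainable / total


# Assumed round parameter counts, NOT measured values.
pipeline = [
    Module("image_encoder", 1_000_000_000, trainable=False),  # frozen
    Module("q_former", 200_000_000, trainable=True),          # the only trained part
    Module("llm", 3_000_000_000, trainable=False),            # frozen
]

trainable, total, fraction = summarize(pipeline)
print(f"trainable: {trainable:,} of {total:,} parameters ({fraction:.1%})")
```

Under these assumed sizes, under 5% of the pipeline's parameters would receive updates, which is the sense in which training only the bridge is cheap. In a real framework the same split is usually expressed by disabling gradients on the frozen modules rather than by a flag like the one above.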

Companies

Salesforce

Tools

BLIP-2, Q-Former