High Multimodal AI · 1 min read
BLIP-2: the Q-Former bridge between vision and language
In one sentence
Salesforce introduces BLIP-2, in which a lightweight Q-Former bridges a frozen visual encoder and a frozen LLM, achieving state-of-the-art captioning with 8x fewer trainable parameters.
Reading level
BLIP-2 is a Salesforce system that connects an image encoder and a large language model through an intermediate component called the Q-Former. Both the visual model and the text model stay "frozen": their weights are never updated, and only the Q-Former is trained. This makes training much cheaper than fine-tuning the full models. The result outperforms prior models on image captioning while updating far fewer parameters.
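To make the frozen-versus-trained split concrete, here is a minimal bookkeeping sketch in plain Python. It is not BLIP-2 code: the `Component` class, the helper names, and all three parameter counts are hypothetical placeholders chosen for illustration, not figures from Salesforce.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """One stage of the pipeline (hypothetical helper, not a BLIP-2 class)."""
    name: str
    num_params: int
    trainable: bool  # False means the weights stay frozen during training


def trainable_params(components):
    """Sum the parameters of the components whose weights get updated."""
    return sum(c.num_params for c in components if c.trainable)


def total_params(components):
    """Sum the parameters of every component, frozen or not."""
    return sum(c.num_params for c in components)


# Placeholder sizes for illustration only; these are NOT BLIP-2's real counts.
pipeline = [
    Component("image_encoder", 1_000_000_000, trainable=False),
    Component("q_former", 100_000_000, trainable=True),
    Component("llm", 3_000_000_000, trainable=False),
]

trained = trainable_params(pipeline)
total = total_params(pipeline)
print(f"trainable: {trained:,} of {total:,} parameters ({trained / total:.1%})")
```

With these made-up sizes, only the middle component counts toward the trainable total, which is the pattern the summary describes: the two large models contribute capacity but no training updates.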
Companies
Salesforce
Tools
BLIP-2, Q-Former
Tags
BLIP-2, Q-Former, Image Captioning, Salesforce, Efficient Training
Sources