High Multimodal AI · 1 min read
CogVLM: separate visual expert prevents language degradation
In one sentence
Tsinghua introduces CogVLM, whose visual expert module is independent of the LLM parameters, eliminating performance degradation on pure text and reaching state-of-the-art results on VQA and OCR.
CogVLM is a model from Tsinghua University that solves a common problem in multimodal models: adding visual understanding often degrades text capabilities. CogVLM uses separate parameters for visual reasoning — a "visual expert" — that doesn't touch the original language model weights. The result is a model that excels at both images and pure text.
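The idea of keeping visual parameters apart from the language model can be sketched as a single projection that routes each token by type. This is a minimal illustration, not CogVLM's actual code: the class `RoutedProjection`, its weight names, and the tiny dimensions are all hypothetical, and a real model would apply the same routing inside attention and feed-forward layers.

```python
import random


def matvec(weights, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def random_matrix(rows, cols, rng):
    """Build a rows x cols matrix of small random values."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


class RoutedProjection:
    """Hypothetical layer with a frozen text path and a separate visual path."""

    def __init__(self, dim, seed=0):
        rng = random.Random(seed)
        # Stands in for the original language model weights (assumed frozen).
        self.w_text = random_matrix(dim, dim, rng)
        # Stands in for the visual expert weights (the only ones assumed trainable).
        self.w_visual = random_matrix(dim, dim, rng)

    def forward(self, tokens, is_image):
        """Project each token with the weights that match its type."""
        return [
            matvec(self.w_visual if image else self.w_text, token)
            for token, image in zip(tokens, is_image)
        ]


layer = RoutedProjection(dim=4)
text_tokens = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
text_before = layer.forward(text_tokens, [False, False])

# Simulate a training update that changes only the visual expert.
layer.w_visual = [[0.0] * 4 for _ in range(4)]
text_after = layer.forward(text_tokens, [False, False])
image_after = layer.forward([[1.0, 2.0, 3.0, 4.0]], [True])
```

Because text tokens never pass through `w_visual`, changing the visual weights leaves `text_after` identical to `text_before`, which is the property the summary attributes to the visual expert design.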
Companies
Tsinghua University, Zhipu AI
Tools
CogVLM
Tags
CogVLM, Visual Expert, VQA, OCR, Tsinghua University
Sources