October 3, 2023 High Multimodal AI · 1 min read

CogVLM: separate visual expert prevents language degradation

In one sentence: Tsinghua introduces CogVLM, whose visual expert module is kept separate from the LLM parameters, avoiding performance degradation on pure text while reaching SOTA on VQA and OCR.

Verified Official source

CogVLM is a model from Tsinghua University that addresses a common problem in multimodal models: adding visual understanding often degrades text capabilities. CogVLM puts visual reasoning in separate parameters, a "visual expert", and leaves the original language model weights untouched. The result is a model that handles both images and pure text well.
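
To make the idea of separate parameters concrete, here is a minimal sketch in Python with NumPy. It is not CogVLM's actual code: it assumes the visual expert can be pictured as a second weight matrix applied only to image tokens, while text tokens keep using the original language-model weights. The names `route_projection`, `w_text`, and `w_image` are hypothetical and chosen only for illustration.

```python
import numpy as np

def route_projection(hidden, is_image, w_text, w_image):
    """Project each token with the weight matrix that matches its modality.

    hidden:   (seq_len, d_model) token states
    is_image: (seq_len,) boolean mask, True for image tokens
    w_text:   (d_model, d_out) original language-model weights (assumed frozen)
    w_image:  (d_model, d_out) separate "visual expert" weights (assumed trainable)
    """
    out = np.empty((hidden.shape[0], w_text.shape[1]), dtype=hidden.dtype)
    out[~is_image] = hidden[~is_image] @ w_text   # text tokens: original weights
    out[is_image] = hidden[is_image] @ w_image    # image tokens: visual expert
    return out

rng = np.random.default_rng(0)
d_model, d_out = 8, 8
w_text = rng.normal(size=(d_model, d_out))
w_image = rng.normal(size=(d_model, d_out))

# Pure-text input: no token is routed to the visual expert,
# so the result matches the plain language-model projection.
text_only = rng.normal(size=(5, d_model))
no_images = np.zeros(5, dtype=bool)
text_out = route_projection(text_only, no_images, w_text, w_image)
text_unchanged = np.allclose(text_out, text_only @ w_text)

# Mixed input: the first three tokens are image tokens, the rest are text.
mixed = rng.normal(size=(6, d_model))
mask = np.array([True, True, True, False, False, False])
mixed_out = route_projection(mixed, mask, w_text, w_image)
```

In this toy setup, a text-only sequence never touches `w_image`, which is the intuition behind why text behavior is preserved when the original weights are left as they are.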

Companies

Tsinghua University, Zhipu AI

Tools

CogVLM