Qwen QVQ-72B, developed by Alibaba, is an advanced open-sourced visual reasoning LLM released in December 2024. It integrates visual and language processing, enabling it to understand and analyze both text and image inputs. The model has demonstrated significant performance improvements, achieving a score of 70.3 in the Multimodal Math Understanding benchmark, surpassing its predecessor, Qwen2-VL-72B-Instruct. QVQ is designed for complex analytical tasks, showing enhanced reasoning capabilities, particularly in solving intricate physics problems.
Despite its advancements, QVQ has limitations, including challenges with language mixing and maintaining focus on image content during multi-step reasoning, which may lead to inaccuracies. The model is built on a transformer-based architecture derived from Qwen2-VL-72B, enhancing its processing capabilities across various tasks. It is open-source, allowing researchers and developers to access and utilize it freely, although its large size may limit use on consumer-grade GPUs.
An example of QVQ’s capabilities is provided, showcasing its ability to accurately count objects in an image, specifically identifying six pelicans in a given picture. Users can access the model through platforms like Hugging Face, and the document encourages experimentation with Qwen QVQ for visual reasoning tasks.
