Qwen2-VL Released: Visual Agent with Advanced Reasoning and Decision-Making!
Alibaba Open-Sources Qwen2-VL: Understands 20+ Minute Videos, Rivals GPT-4o!
Ali has released Qwen2-VL, open-sourcing the Qwen2-VL-2B and Qwen2-VL-7B models. A 72B version will be available later. Qwen2-VL is the latest visual-language model in the Qwen series.
Key Features:
State-of-the-Art Image Understanding: Qwen2-VL excels in image comprehension benchmarks like MathVista, DocVQA, RealWorldQA, and MTVQA, handling various resolutions and aspect ratios.
Understanding Long Videos: With streaming capabilities, Qwen2-VL can understand videos over 20 minutes long, enabling tasks like video-based Q&A, conversations, and content creation.
Device Control Agent: Qwen2-VL can integrate with devices like phones and robots, executing actions based on visual environments and text instructions, thanks to its advanced reasoning and decision-making abilities.
Multilingual Support: Qwen2-VL supports text recognition in images across multiple languages, including most European languages, Japanese, Korean, Arabic, Vietnamese, and more, alongside English and Chinese.