InternVL3: Open Source Model Outperforms GPT-4o in Multimodal Benchmarks

InternVL3, a new open-source AI model, matches or beats proprietary models like GPT-4o and Gemini in understanding images and video. Its key breakthrough lies in how it learns to process visual information alongside language from the start, rather than adding those capabilities later.


The model excels at tasks that stump many AI systems: tracking objects in videos, analyzing complex charts, interpreting technical diagrams, and even solving mathematical problems that combine text and visuals.

In rigorous testing across multiple benchmarks, InternVL3 proved particularly strong at maintaining accuracy with longer videos and more complex visual scenarios.

What sets InternVL3 apart is its training approach, which the researchers call native multimodal pre-training. Unlike most AI models, which learn language first and visual skills second, InternVL3 develops both abilities simultaneously. Think of it like raising a bilingual child who learns two languages naturally from birth, rather than picking up a second language later in life.
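
In practice, native multimodal pre-training means optimizing a single next-token objective over interleaved image-text sequences, instead of pre-training the language model on text alone. The sketch below illustrates the core idea; the module names (vision_encoder, projector, llm), the token splicing, and the batch layout are simplifying assumptions, not InternVL3's actual implementation.

```python
import torch
import torch.nn.functional as F

def native_multimodal_step(vision_encoder, projector, llm, batch):
    """One pre-training step on an interleaved image-text sequence.

    Both modalities contribute to the same next-token objective from
    the start, instead of training the LLM on text first and bolting
    vision on afterwards. All names here are illustrative.
    """
    # Encode images into patch embeddings, then project them into the
    # language model's embedding space.
    vis_embeds = projector(vision_encoder(batch["images"]))    # (B, V, D)
    txt_embeds = llm.embed_tokens(batch["input_ids"])          # (B, T, D)

    # Simplified splice: prepend visual tokens to the text stream.
    # Real pipelines insert them at <image> placeholder positions.
    inputs = torch.cat([vis_embeds, txt_embeds], dim=1)        # (B, V+T, D)

    logits = llm(inputs_embeds=inputs).logits                  # (B, V+T, vocab)

    # Standard next-token prediction, with labels set to -100 at visual
    # positions: images condition the prediction, only text is scored.
    shift_logits = logits[:, :-1].reshape(-1, logits.size(-1))
    shift_labels = batch["labels"][:, 1:].reshape(-1)
    return F.cross_entropy(shift_logits, shift_labels, ignore_index=-100)
```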

The researchers also tackled a common problem in AI vision: accuracy degrades on longer videos and complex scenes. They adopted Variable Visual Position Encoding (V2PE), which assigns smaller position increments to visual tokens so the model can track far longer multimodal sequences, much like how humans maintain their sense of space and movement while watching a video.
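
Here is a minimal sketch of the variable-increment idea, assuming only that text tokens advance the position index by 1 while visual tokens advance it by a smaller fraction. The function name and the delta value are illustrative, not InternVL3's published setting.

```python
def v2pe_position_ids(token_types, delta=1 / 16):
    """Assign position indices with variable increments (V2PE-style).

    Text tokens advance the position counter by 1; visual tokens by a
    smaller fraction `delta`, so a long run of video frames consumes
    far less of the model's position budget. `delta=1/16` is an
    illustrative choice, not InternVL3's published setting.
    """
    pos, positions = 0.0, []
    for kind in token_types:            # each entry: "text" or "image"
        positions.append(pos)
        pos += 1.0 if kind == "text" else delta
    return positions

# 4 text tokens, 256 visual tokens (e.g. one video frame), 4 text tokens:
ids = v2pe_position_ids(["text"] * 4 + ["image"] * 256 + ["text"] * 4)
# The 256 visual tokens span only 16 position units instead of 256,
# leaving room for many more frames in the same context window.
```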

The largest version of InternVL3 scored 72.2 on MMMU, a comprehensive benchmark of multimodal understanding. That puts it ahead of other open-source models and closer to proprietary leaders like Gemini-2.5 Pro.

Most importantly, the researchers are releasing both their code and training data to the public. This move could accelerate progress in AI vision technology by allowing other researchers to build on their work.

The implications extend beyond academic research. Better visual AI could improve everything from medical imaging to autonomous vehicles, making machines better at understanding the visual world the way humans do.

Why this matters:

  • The gap between open-source and proprietary AI is narrowing, democratizing access to advanced AI capabilities
  • The model's novel training approach suggests we might need to rethink how AI systems are taught to process language and vision together
