On Benchmarking Multimodal Large Language Models
I recently put in some small effort to consolidate a bunch of reported benchmark scores for multimodal large language models (MLLMs), specifically focusing on smaller models in the 8B-and-lower range, like SmolVLM2 2.2B, Qwen3-VL 8B (actually 8.8B), and InternVL3.5 8B (8.7B). I am especially interested in a couple of startups in this space with models like Isaac 0.2 2B (2.6B, from Perceptron), Moondream 3 (19B MoE from Moondream) and LFM2-VL 3B (actually 3.0B, from Liquid). The results are at samuelstevens.me/mllm-benchmarking.
After looking at these reported scores, and getting back up to speed on the latest in MLLM evaluation (I worked on MMMU at the end of 2023), I came to a couple conclusions:
- MLLM developers do not agree on which benchmarks are most meaningful. In LLMs, most new SOTA model releases focus on economic work (GDPval), agentic coding (SWE-bench, Terminal-Bench 2.0), reasoning/knowledge (Humanity’s Last Exam, GPQA), abstract reasoning (ARC AGI 2), multimodal/multilingual knowledge (MMMU, MMMLU) and math (AIME 2025, FrontierMath). In contrast, MLLM evaluation uses short-answer question-answering tasks (VQAv2, ChartQA, TextVQA, DocVQA, InfoVQA), some smaller curated datasets (RealWorldQA), OCR tasks (OCRBench), and a variety of hallucination/perception benchmarks (MME, POPE, BLINK). Some reasoning-based tasks exist (MMMU, MMMU-Pro, MathVista), but most developers are still solving perception rather than visual reasoning.1 I think there is room for a better benchmark, but there is an XKCD about that. I really like RealWorldQA and Reka’s Vibe-Eval because they were written by hand, rather than scraped from existing sources.
- Serious post-training is not yet required. Molmo2 scores very well without any RL-based post-training. Many of the big Chinese labs apply RL (Qwen3-VL, InternVL3.5 and Step3-VL all do RL-based post-training) but open-weight American labs do not (Molmo2 and SmolVLM2; it’s unclear to me if Liquid uses the RL-tuned LLM decoder backbone or not). This is especially surprising to me because many of the labeled computer vision datasets could be used as verifiable rewards (bounding boxes, segmentations, counting/point-placing, etc).
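For example, a detection label could be turned directly into a verifiable reward. A minimal sketch of the idea (the box format, function names, and 0.5 IoU threshold are my own illustrative assumptions, not from any of the labs above):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def detection_reward(predicted_box, gold_box, threshold=0.5):
    """Binary verifiable reward: 1.0 if the model's predicted box
    overlaps the ground-truth box enough, else 0.0."""
    return 1.0 if iou(predicted_box, gold_box) >= threshold else 0.0
```

Counting and point-placing labels admit similarly cheap checks (exact match on a count, distance to a labeled point), which is what makes them attractive as RL rewards.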
- Small models are not end-to-end trained with multimodal data. Qwen3.5 is the first model that incorporates multimodal data in the text-backbone training phase, and their smallest model is 27B (dense) or 35B-A3B (MoE).
- Only Isaac, Molmo2 and Moondream natively support counting outputs, and only Isaac and Moondream support bounding boxes and segmentations. These seem like wildly useful capabilities, and I’m surprised they are not more widely supported.
I hope to get to work on some of these topics soon.
I discuss this later on, but reasoning tokens do reliably improve scores on most tasks, even though none of these models do “native multimodal reasoning,” i.e. reasoning in pixel space.↩︎
Sam Stevens, 2024