MM-BRIGHT: A Multi-Task Multimodal Benchmark for Reasoning-Intensive Retrieval

Abdelrahman Abdallah1, Mohamed Darwish Mounis2, Mahmoud Abdalla3, Mahmoud SalahEldin Kasem3, Mostafa Farouk Senussi3
Mohamed Mahmoud3, Mohammed Ali1, Adam Jatowt1, Hyun-Soo Kang3
1University of Innsbruck, 2High Institute for Computer & Information Systems, 3Chungbuk National University

Why a new benchmark?

Existing retrieval benchmarks consist primarily of text-only queries for which keyword or semantic matching is usually sufficient. Many real-world queries, however, contain multimodal elements, particularly images such as diagrams, charts, and screenshots, and retrieving relevant documents for them requires intensive reasoning. We introduce MM-BRIGHT to address this gap.

MM-BRIGHT

MM-BRIGHT is the first multimodal benchmark for reasoning-intensive retrieval. It consists of 2,803 real-world queries spanning 29 diverse technical domains, with four tasks of increasing complexity: text-to-text, multimodal-to-text, multimodal-to-image, and multimodal-to-multimodal retrieval.

MM-BRIGHT overview diagram.

Leaderboard submission

If you would like to submit your results to the leaderboard, please open a pull request or issue on our GitHub repository.

Have Questions?

Contact the authors or open an issue on GitHub.

Leaderboard

Query → Documents: Text-only retrieval showing Avg NDCG@10 across 29 domains.

Rank Model Type Avg NDCG@10
🥇 1 DiVeR Reasoning 32.2
🥈 2 OpenAI Proprietary 28.8
🥉 3 ReasonIR Reasoning 28.6
4 Qwen2 Dense >1B 28.1
5 SFR Dense >1B 26.9
6 E5 Dense >1B 25.3
7 GritLM Dense >1B 25.3
8 RaDeR Reasoning 24.9
9 Qwen Dense >1B 21.5
10 Contriever Dense <1B 20.1
11 BM25 Sparse 8.5
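
For reference, BM25 (row 11) is a purely lexical baseline. Below is a minimal sketch of such a baseline using the rank_bm25 package with a toy corpus of our own; the library choice and example documents are illustrative, not the setup behind the leaderboard numbers.

```python
# Minimal sketch of a BM25 sparse baseline (leaderboard row 11), assuming the
# rank_bm25 package; the toy corpus and whitespace tokenization are illustrative.
from rank_bm25 import BM25Okapi

corpus = [
    "An RC phase-shift oscillator relies on feedback through three RC stages.",
    "Thermal runaway occurs when transistor leakage current rises with temperature.",
]
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

query = "why does my oscillator circuit drift when it warms up"
scores = bm25.get_scores(query.lower().split())

# Rank documents by lexical BM25 score (higher is better).
print(sorted(zip(range(len(corpus)), scores), key=lambda p: p[1], reverse=True))
```

One reading of the low 8.5 average is that reasoning-intensive queries share little surface vocabulary with their gold documents, which is the gap MM-BRIGHT is designed to expose.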

MM-BRIGHT Overview

Multimodal Reasoning

Queries contain multimodal elements—diagrams, charts, screenshots—that require intensive reasoning to identify relevant documents beyond surface form matching.

Four Retrieval Tasks

From a text-only baseline to full multimodal-to-multimodal retrieval, the four tasks increase in complexity to support comprehensive evaluation.

Technical Domains

29 diverse domains from StackExchange spanning STEM, software engineering, social sciences, and applied fields.

Domains

STEM & Life Sciences: Biology, Chemistry, Physics, Math, Earth Science, Bioacoustics, Bioinformatics, Medical Sciences, Academia.
Software & Technical Systems: Ubuntu, Bitcoin, Cryptography, Quantum Computing, Robotics, Salesforce, GIS, Apple.
Social Sciences & Humanities: Economics, Psychology, Philosophy, Law, Christianity, Islam.
Applied Domains: Aviation, Gaming, Project Management, Quantitative Finance, Sustainability, Travel.

Tasks

Task 1: Query → Documents

Text-only retrieval baseline. Retrieve relevant text documents given text queries, measuring reasoning difficulty without multimodal complexity.

Task 2: Query+Image → Documents

Multimodal-to-text retrieval. Retrieve relevant text documents given queries with images, testing whether models can leverage visual context.

Task 3: Query+Image → Images

Image retrieval. Retrieve relevant images from multimodal queries, requiring visual reasoning and similarity assessment beyond simple object matching.

Task 4: Query+Image → Documents+Images

Full multimodal retrieval. The most challenging task—retrieve multimodal documents where both text and images must be jointly evaluated for relevance.
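
To make the task formats concrete, the sketch below illustrates Task 4-style retrieval with a CLIP-style joint text/image encoder from sentence-transformers. The model choice, file paths, document fields, and the simple embedding-averaging fusion are our illustrative assumptions, not the benchmark's official baseline or data schema.

```python
# A minimal sketch of multimodal-to-multimodal retrieval (Task 4), assuming a
# CLIP-style joint text/image encoder. Model name, file paths, and document
# fields are illustrative, not the benchmark's official baseline or schema.
from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")  # shared embedding space for text and images

def encode_item(text, image_path=None):
    """Embed text, optionally fused with an image by simple averaging (late fusion)."""
    emb = model.encode(text, convert_to_tensor=True)
    if image_path is not None:
        img_emb = model.encode(Image.open(image_path), convert_to_tensor=True)
        emb = (emb + img_emb) / 2
    return emb

# Hypothetical multimodal query: question text plus an attached diagram.
query_emb = encode_item("Why does this oscillator circuit drift with temperature?",
                        image_path="query_diagram.png")

# Hypothetical candidate documents: passage text with an optional figure.
docs = [
    {"id": "d1", "text": "An RC phase-shift network feeds the output back ...", "image": "fig1.png"},
    {"id": "d2", "text": "Thermal runaway occurs in power transistors when ...", "image": None},
]
ranked = sorted(
    ((d["id"], util.cos_sim(query_emb, encode_item(d["text"], d["image"])).item()) for d in docs),
    key=lambda pair: pair[1], reverse=True,
)
print(ranked)  # ranked list, i.e. the input to NDCG@10 scoring
```

Tasks 2 and 3 follow the same pattern with text-only or image-only candidates; stronger baselines would replace the naive embedding averaging with a dedicated multimodal retriever.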

Evaluation

We use NDCG@10 as the primary metric, following prior retrieval benchmarks (BEIR, BRIGHT). Tasks 1-3 use binary relevance labels. Task 4 uses graded relevance: rel=2 for a gold passage paired with its corresponding positive image, rel=1 for a gold passage without the image, and rel=0 for incorrect passages.
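
As a concrete reference, the sketch below computes NDCG@10 under this graded scheme; the helper functions and toy IDs are ours, not the released evaluation scripts.

```python
# A minimal NDCG@10 sketch for the graded labels above (rel = 2, 1, or 0).
# Function and variable names are illustrative, not the benchmark's evaluation code.
import math

def dcg(gains):
    # Discounted cumulative gain: gain / log2(rank + 1), with ranks starting at 1.
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg_at_10(ranked_ids, qrels):
    """ranked_ids: retrieved document IDs in rank order; qrels: doc ID -> graded relevance."""
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_ids[:10]]
    ideal = sorted(qrels.values(), reverse=True)[:10]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

# Toy Task 4 example: the gold passage with its positive image (rel=2) is ranked second.
qrels = {"p7": 2, "p3": 1}
print(round(ndcg_at_10(["p9", "p7", "p3", "p2"], qrels), 3))  # 0.67
```

For Tasks 1-3 the same formula applies with binary labels (rel in {0, 1}).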

18 Retrieval Models

Including sparse, dense, reasoning-enhanced, and multimodal retrievers.

2,803 Queries

Real-world queries from 29 diverse technical domains.

2.5M Documents

Large-scale corpus with text and images for comprehensive evaluation.

7,621 Annotated Images

Human-verified image annotations for multimodal retrieval tasks.

Results Snapshot

Best Text Retrieval

32.2 NDCG@10

DiVeR (Task 1, 29 domains)

Best Image Retrieval

45.6 NDCG@10

GME-2B (Task 3)

Dataset Scale

2.5M Docs

Across 29 domains

Current state-of-the-art models struggle across all tasks. Adding images often hurts performance: on Task 2, the best multimodal model (Nomic-Vision: 27.6) underperforms the best text-only model (DiVeR: 32.2). These results highlight substantial room for improvement in multimodal reasoning.

Resources

Code, data, and evaluation scripts are available below.

This website is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Template adapted from Nerfies.