Existing retrieval benchmarks consist primarily of text-based queries for which keyword or semantic matching is usually sufficient. However, many real-world queries contain multimodal elements, particularly images such as diagrams, charts, and screenshots, that require intensive reasoning. We introduce MM-BRIGHT to address this gap.
MM-BRIGHT is the first multimodal benchmark for reasoning-intensive retrieval. It consists of 2,803 real-world queries spanning 29 diverse technical domains, with four tasks of increasing complexity: text-to-text, multimodal-to-text, multimodal-to-image, and multimodal-to-multimodal retrieval.
If you would like to submit your results to the leaderboard, please open a pull request or issue on our GitHub repository.
For questions, contact the authors or open an issue on GitHub.
Query → Documents: text-only retrieval leaderboard showing average nDCG@10 across 29 domains.
| Rank | Model | Type | Avg nDCG@10 |
|---|---|---|---|
| 🥇 1 | DiVeR | Reasoning | 32.2 |
| 🥈 2 | OpenAI | Proprietary | 28.8 |
| 🥉 3 | ReasonIR | Reasoning | 28.6 |
| 4 | Qwen2 | Dense >1B | 28.1 |
| 5 | SFR | Dense >1B | 26.9 |
| 6 | E5 | Dense >1B | 25.3 |
| 7 | GritLM | Dense >1B | 25.3 |
| 8 | RaDeR | Reasoning | 24.9 |
| 9 | Qwen | Dense >1B | 21.5 |
| 10 | Contriever | Dense <1B | 20.1 |
| 11 | BM25 | Sparse | 8.5 |
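For orientation, the sparse BM25 baseline in the last row can be approximated in a few lines with the open-source rank_bm25 package; the corpus and query below are toy placeholders, not MM-BRIGHT data.

```python
from rank_bm25 import BM25Okapi

# Toy stand-ins for a domain corpus and query (placeholders, not MM-BRIGHT data).
corpus = [
    "The mitochondrion is the powerhouse of the cell.",
    "BM25 scores documents by term frequency and inverse document frequency.",
    "Quantum error correction protects qubits from decoherence.",
]
query = "how does sparse lexical retrieval score documents"

# BM25Okapi expects tokenized documents; whitespace tokenization for the sketch.
tokenized_corpus = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

# Score every document against the query, then keep the top 10 for nDCG@10.
scores = bm25.get_scores(query.lower().split())
top10 = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)[:10]
print([(corpus[i][:40], round(float(scores[i]), 3)) for i in top10])
```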
Queries contain multimodal elements—diagrams, charts, screenshots—that require intensive reasoning to identify relevant documents beyond surface form matching.
The four tasks range from a text-only baseline to full multimodal-to-multimodal retrieval, increasing in complexity for comprehensive evaluation.
29 diverse domains from StackExchange spanning STEM, software engineering, social sciences, and applied fields.
STEM & Life Sciences: Biology, Chemistry, Physics, Math, Earth Science, Bioacoustics, Bioinformatics, Medical Sciences, Academia.
Software & Technical Systems: Ubuntu, Bitcoin, Cryptography, Quantum Computing, Robotics, Salesforce, GIS, Apple.
Social Sciences & Humanities: Economics, Psychology, Philosophy, Law, Christianity, Islam.
Applied Domains: Aviation, Gaming, Project Management, Quantitative Finance, Sustainability, Travel.
Task 1 (text-to-text): Text-only retrieval baseline. Retrieve relevant text documents given text queries, measuring reasoning difficulty without multimodal complexity.
Task 2 (multimodal-to-text): Retrieve relevant text documents given queries that include images, testing whether models can leverage visual context.
Task 3 (multimodal-to-image): Retrieve relevant images given multimodal queries, requiring visual reasoning and similarity assessment beyond simple object matching.
Task 4 (multimodal-to-multimodal): The most challenging setting. Retrieve multimodal documents where both text and images must be jointly evaluated for relevance (an illustrative instance is sketched below).
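For concreteness, here is one plausible shape for a Task 4 instance; the field names below are illustrative assumptions, not the released data schema.

```python
# Hypothetical Task 4 (multimodal-to-multimodal) instance.
# All field names are illustrative, not the released MM-BRIGHT schema.
example = {
    "query": {
        "text": "Why does this oscillator drift from its design frequency?",
        "images": ["query_img_001.png"],  # e.g., a circuit-diagram screenshot
    },
    "candidates": [
        {"doc_id": "d1", "text": "...", "images": ["fig_3.png"]},
        {"doc_id": "d2", "text": "...", "images": []},
    ],
    # Graded relevance labels (see the metric description below):
    # 2 = gold passage with its positive image, 1 = gold passage alone, 0 = incorrect.
    "qrels": {"d1": 2, "d2": 0},
}
```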
We use nDCG@10 as the primary metric, following prior retrieval benchmarks (BEIR, BRIGHT). Tasks 1-3 use binary relevance labels. Task 4 uses graded relevance: rel=2 for the gold passage with its corresponding positive image, rel=1 for the gold passage without the image, and rel=0 for incorrect passages.
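Concretely, here is a self-contained sketch of the metric under these graded labels. Note that it uses the exponential-gain formulation of DCG; official scripts may instead rely on a library scorer such as pytrec_eval (as BEIR does), which defaults to linear gain.

```python
import math

def ndcg_at_k(ranked_doc_ids, qrels, k=10):
    """nDCG@k with graded relevance: rel in {0, 1, 2} for Task 4,
    binary {0, 1} for Tasks 1-3. `qrels` maps doc_id -> relevance."""
    # DCG of the system ranking: gain discounted by log2 of the rank.
    dcg = sum(
        (2 ** qrels.get(doc_id, 0) - 1) / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_doc_ids[:k])
    )
    # Ideal DCG: the same sum over the best possible ordering of the labels.
    ideal = sorted(qrels.values(), reverse=True)[:k]
    idcg = sum((2 ** rel - 1) / math.log2(rank + 2) for rank, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Toy example: the gold passage with its positive image (rel=2) is ranked second.
print(ndcg_at_k(["d7", "d1", "d9"], {"d1": 2, "d2": 1}))  # ~0.52
```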
18 Retrieval Models
Including sparse, dense, reasoning-enhanced, and multimodal retrievers.
2,803 Queries
Real-world queries from 29 diverse technical domains.
2.5M Documents
Large-scale corpus with text and images for comprehensive evaluation.
7,621 Annotated Images
Human-verified image annotations for multimodal retrieval tasks.
Best Text Retrieval
32.2 nDCG@10
Best Image Retrieval
45.6 nDCG@10
Dataset Scale
2.5M Docs
Current state-of-the-art models struggle across all tasks. Adding images often hurts performance: the best multimodal model on Task 2 (Nomic-Vision: 27.6) underperforms the best text-only model on Task 1 (DiVeR: 32.2). These results highlight substantial room for improvement in multimodal reasoning.
Code, data, and evaluation scripts are available below.
This website is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Template adapted from Nerfies.