MRAG-Bench: Vision-Centric Evaluation for Retrieval-Augmented Multimodal Models

UCLA, Stanford University

MRAG-Bench

MRAG-Bench consists of 16,130 images and 1,353 human-annotated multiple-choice questions across 9 distinct scenarios, providing a robust and systematic evaluation of the vision-centric multimodal retrieval-augmented generation (RAG) abilities of Large Vision-Language Models (LVLMs).
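
To make the evaluation setup concrete, the sketch below shows one way an MRAG-Bench instance could be assembled into a multiple-choice prompt with retrieved images prepended as visual context. This is a minimal illustration only: the dataset identifier and the field names (question, choices, image, gt_images) are assumptions, and the released data schema may differ.

# A minimal sketch, assuming the benchmark is available on the Hugging Face Hub
# and that each example carries a question, answer options, the query image,
# and its ground-truth retrieval images. All field names are illustrative.
from datasets import load_dataset

ds = load_dataset("uclanlp/MRAG-Bench", split="test")  # assumed repo id

def build_mcq_prompt(example, retrieved_images):
    """Interleave retrieved images with the question and the answer options."""
    options = "\n".join(
        f"({letter}) {choice}" for letter, choice in zip("ABCD", example["choices"])
    )
    text = (
        "Use the retrieved images as visual context.\n"
        f"Question: {example['question']}\n{options}\n"
        "Answer with the option letter only."
    )
    # Interleaved LVLM APIs typically accept a mixed list of images and text.
    return retrieved_images + [example["image"], text]

sample = ds[0]
model_inputs = build_mcq_prompt(sample, retrieved_images=sample["gt_images"][:5])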

Example scenarios from MRAG-Bench. Previous benchmarks have mainly focused on retrieval from textual knowledge; however, there are scenarios where retrieving the correct textual knowledge is difficult and less useful than visual knowledge.

MRAG-Bench -- Composition


  • MRAG-Bench provides a systematic evaluation across 9 distinct multimodal RAG scenarios, with four scenarios focused on the perspective understanding of visual entities, four on transformative understanding, and one categorized as others.
  • MRAG-Bench focuses on evaluating LVLMs' ability to utilize vision-centric retrieval-augmented multimodal knowledge. "Diverse scenarios" refers to whether a benchmark categorizes different scenarios during evaluation.


Abstract

Existing multimodal retrieval benchmarks primarily focus on evaluating whether models can retrieve and utilize external textual knowledge for question answering. However, there are scenarios where retrieving visual information is either more beneficial or easier to access than textual data. In this paper, we introduce a multimodal retrieval-augmented generation benchmark, MRAG-Bench, in which we systematically identify and categorize scenarios where visually augmented knowledge is better than textual knowledge, for instance, more images from varying viewpoints. MRAG-Bench consists of 16,130 images and 1,353 human-annotated multiple-choice questions across 9 distinct scenarios. With MRAG-Bench, we conduct an evaluation of 10 open-source and 4 proprietary large vision-language models (LVLMs). Our results show that all LVLMs exhibit greater improvements when augmented with images compared to textual knowledge, confirming that MRAG-Bench is vision-centric. Additionally, we conduct extensive analysis with MRAG-Bench, which offers valuable insights into retrieval-augmented LVLMs. Notably, the top-performing model, GPT-4o, faces challenges in effectively leveraging retrieved knowledge, achieving only a 5.82% improvement with ground-truth information, in contrast to a 33.16% improvement observed in human participants. These findings highlight the importance of MRAG-Bench in encouraging the community to enhance LVLMs' ability to utilize retrieved visual knowledge more effectively.

Qualitative Results

Qualitative examples on MRAG-BENCH. For each scenario, we show the result of GPT-4o, Gemini Pro, LLaVA-Next-Interleave and Mantis-8B-Siglip. The ground-truth answer is in blue.

Quantitative Results

Accuracy scores on MRAG-BENCH. The highest scores for open-source models in each section and proprietary models are highlighted in blue and red, respectively. Both Retrieved RAG and GT RAG employ top-5 image examples (except for the incomplete task, where a single example is intuitively sufficient). The relative difference in performance compared to the score without RAG is shown in subscript, with blue indicating performance drops and red indicating improvements.

The average performance of the most advanced LVLMs does not exceed 68.68% without multimodal RAG knowledge and 74.5% with ground-truth knowledge, which demonstrates that MRAG-BENCH is a challenging benchmark. The mean accuracies of open-source LVLMs range from 26.83% to 53.29% without RAG knowledge and from 28.90% to 59.28% with ground-truth knowledge, falling behind advanced proprietary LVLMs. Notably, MRAG-BENCH proves to be knowledge-intensive: average humans achieve only 38.47% without RAG knowledge, while proprietary LVLMs generally perform well, suggesting that their extensive training data equips them with a broader knowledge base. However, when provided with either retrieved or ground-truth knowledge, humans achieve the most significant improvements of 22.91% and 33.16%, respectively. This underscores the need for LVLMs to better utilize visually augmented information, as humans do.

Experiment Analysis


1. Why can proprietary models better utilize retrieved images?

We conduct an error analysis on an open-source model (LLaVA-Next-Interleave) and a proprietary model (Gemini Pro). As illustrated in the example in the figure, the retrieved images contain two correct examples and three false examples. While Gemini Pro is able to utilize all retrieved images, LLaVA-Next-Interleave relies on the bad examples and makes a wrong prediction. This example helps explain why almost all open-source models perform worse with retrieved knowledge.


2. How much can visual knowledge benefit more than textual knowledge?

We used the Wikipedia corpus as of 2023/07/01 as our textual knowledge corpus. To ensure a fair comparison, we employed the same multimodal retriever (CLIP) for retrieving either text or image knowledge, and the top-5 ranked documents or images are used to augment the input. We selected one open-source (LLaVA-Next-Interleave) and one proprietary (GPT-4-Turbo) LVLM to examine their preference for textual versus image knowledge on MRAG-BENCH. All results in the table demonstrate that retrieving visual knowledge is more helpful than retrieving text on MRAG-BENCH.
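
As a rough illustration of this setup, the sketch below runs CLIP-based top-k retrieval with cosine similarity over a toy corpus. The checkpoint name, file paths, and corpus contents are placeholders; the benchmark setting retrieves the top 5 items from a much larger Wikipedia text corpus or image pool.

# A minimal sketch of CLIP top-k retrieval, assuming a Hugging Face CLIP
# checkpoint. Paths and the tiny corpora below are placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

@torch.no_grad()
def embed_images(paths):
    images = [Image.open(p).convert("RGB") for p in paths]
    feats = model.get_image_features(**processor(images=images, return_tensors="pt"))
    return torch.nn.functional.normalize(feats, dim=-1)

@torch.no_grad()
def embed_texts(texts):
    inputs = processor(text=texts, return_tensors="pt", padding=True, truncation=True)
    feats = model.get_text_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

def top_k(query_emb, corpus_emb, k):
    # With L2-normalized embeddings, cosine similarity is a dot product.
    scores = query_emb @ corpus_emb.T
    return scores.topk(k, dim=-1).indices

query = embed_images(["query.jpg"])                        # placeholder query image
image_corpus = embed_images(["img_0.jpg", "img_1.jpg"])    # placeholder image pool
text_corpus = embed_texts(["passage one", "passage two"])  # placeholder passages
print(top_k(query, image_corpus, k=2), top_k(query, text_corpus, k=2))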


3. How does retriever performance affect LVLMs?

  • As shown in the left figure, we evaluated LLaVA-Next-Interleave with 4 different multimodal retrievers. When retrievers achieve higher Recall@5 scores (i.e., better retrieved examples), the LVLM's accuracy tends to improve, demonstrating a strong positive correlation of 0.95. Interestingly, despite similar Recall@5 scores from the CLIP and VISTA retrievers, LLaVA-Next-Interleave shows a 2.07% gap in overall accuracy. We conjecture that the order of the correctly retrieved examples may also impact the model's final performance. The sensitivity to the order of retrieved examples is a common issue that persists across various models. Although this phenomenon, known as position bias, has been examined in text-based RAG, its impact on visual RAG remains unexplored, presenting a promising direction for future research.
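
For reference, the sketch below shows one common way to compute Recall@5 (whether any ground-truth image appears among the top 5 retrieved items) and the correlation between retriever recall and downstream accuracy. The numbers are placeholders rather than reported results, and the exact Recall@5 definition used in the paper may differ.

# Placeholder sketch: Recall@5 and the recall-vs-accuracy correlation.
import numpy as np

def recall_at_k(retrieved, relevant, k=5):
    """Fraction of queries with at least one relevant item among the top-k."""
    hits = [len(set(r[:k]) & set(gt)) > 0 for r, gt in zip(retrieved, relevant)]
    return float(np.mean(hits))

# Correlation between per-retriever Recall@5 and LVLM accuracy (made-up values).
recall_scores = np.array([0.40, 0.55, 0.60, 0.70])
lvlm_accuracy = np.array([0.45, 0.50, 0.52, 0.56])
print(np.corrcoef(recall_scores, lvlm_accuracy)[0, 1])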

4. How many ground-truth image examples are needed?

  • As shown in the right figure, we evaluated LLaVA-Next-Interleave using 1, 2, 3, 5, 10, and 20 GT examples, averaging the results across three random seeds for sampling the GT examples. LLaVA-Next-Interleave saw the greatest improvement of 5.64% with just one GT example. Performance continued to increase steadily, reaching a peak at 10 GT examples, which was 0.29% higher than with 20 GT examples. One possible explanation is that LLaVA-Next-Interleave may not be able to effectively leverage visually augmented knowledge in long-context scenarios. Moreover, the complexity of a question also affects the number of images needed: a single ground-truth example sometimes helps the model the most on MRAG-BENCH. We encourage research on adaptively deciding the number of necessary images based on the complexity of the question.
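
A minimal sketch of this ablation protocol is given below: sample k ground-truth images per question with each random seed and average the resulting accuracies. The evaluate_model routine and the gt_images field are hypothetical stand-ins for the actual evaluation code and data schema.

# Sketch of the number-of-GT-examples ablation. `evaluate_model` and the
# `gt_images` field are hypothetical stand-ins for the real evaluation code.
import random
from statistics import mean

def sample_gt_images(gt_images, k, seed):
    rng = random.Random(seed)
    return rng.sample(gt_images, min(k, len(gt_images)))

def ablate_num_gt_examples(dataset, evaluate_model,
                           ks=(1, 2, 3, 5, 10, 20), seeds=(0, 1, 2)):
    results = {}
    for k in ks:
        accs = []
        for seed in seeds:
            augmented = [(ex, sample_gt_images(ex["gt_images"], k, seed))
                         for ex in dataset]
            accs.append(evaluate_model(augmented))  # accuracy in [0, 1]
        results[k] = mean(accs)
    return results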

BibTeX

@article{hu2024mragbench,
  title={MRAG-Bench: Vision-Centric Evaluation for Retrieval-Augmented Multimodal Models},
  author={Hu, Wenbo and Gu, Jia-Chen and Dou, Zi-Yi and Fayyaz, Mohsen and Lu, Pan and Chang, Kai-Wei and Peng, Nanyun},
  journal={arXiv preprint arXiv:2410.08182},
  year={2024}
}