CAPability

What Is a Good Caption? A Comprehensive Visual Caption Benchmark for Evaluating Both Correctness and Thoroughness

Zhihang Liu1*,  Chen-Wei Xie2,  Bin Wen2,  Feiwu Yu2,  Jixuan Chen2,  Boqiang Zhang1,2, 
Nianzu Yang3,  Pandeng Li1,2,  Yinglu Li1,  Zuan Gao1,  Yun Zheng2,  Hongtao Xie1†
1University of Science and Technology of China 2Alibaba Group 3Shanghai Jiao Tong University
*Interns at Alibaba Group   
†Corresponding Author

📋 Todo List

  • ☑️ Bump the evaluation model from GPT-4-Turbo to GPT-4.1-2025-04-14.
  • ☑️ Support lmms-eval for more convenient evaluation.
  • ☑️ Release QA-format annotations of CAPability to support the $K\bar{T}$ metric.
  • ☑️ Release the basic inference and evaluation code reported in the paper.

🔥🔥 News

  • 🔥 [2025.04.16] Release the data on 🤗 Hugging Face.
  • 🔥 [2025.04.15] Update the paper on arXiv.
  • 🔥 [2025.02.19] Release our paper on arXiv.

Abstract

Visual captioning benchmarks have become outdated with the emergence of modern MLLMs, as their brief ground-truth sentences and traditional metrics fail to assess detailed captions effectively. While recent benchmarks attempt to address this by focusing on keyword extraction or object-centric evaluation, they remain limited to vague-view or object-view analyses and incomplete coverage of visual elements. We introduce CAPability, a comprehensive multi-view benchmark for evaluating visual captioning across 12 dimensions spanning six critical views. We curate nearly 11K human-annotated images and videos with visual-element annotations against which generated captions are evaluated. CAPability stably assesses both the correctness and thoroughness of captions using the F1-score. By converting annotations to QA pairs, we further introduce a heuristic metric, know but cannot tell ($K\bar{T}$), which reveals a significant performance gap between models' QA and captioning capabilities. Our work provides the first holistic analysis of MLLMs' captioning abilities, identifying their strengths and weaknesses across dimensions and guiding future research on enhancing specific capabilities.
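As a rough illustration of how these metrics relate, the sketch below aggregates element-level judgments into precision, recall, and F1, and derives a $K\bar{T}$-style gap between what a model answers correctly in QA form and what it actually states in its caption. The `DimensionResult` structure and the exact $K\bar{T}$ formula here are illustrative assumptions, not the benchmark's released implementation.

```python
from dataclasses import dataclass
from typing import Iterable

@dataclass
class DimensionResult:
    """Per-sample, per-dimension judgments (hypothetical schema)."""
    described: int   # elements the caption mentions for this dimension
    correct: int     # mentioned elements judged correct against annotations
    annotated: int   # ground-truth elements annotated for this dimension
    qa_correct: int  # elements the model answers correctly in QA form

def f1_score(results: Iterable[DimensionResult]) -> float:
    """Precision over what was described, recall over what was annotated."""
    rs = list(results)
    described = sum(r.described for r in rs)
    correct = sum(r.correct for r in rs)
    annotated = sum(r.annotated for r in rs)
    precision = correct / described if described else 0.0
    recall = correct / annotated if annotated else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

def know_but_cannot_tell(results: Iterable[DimensionResult]) -> float:
    """Assumed K-bar-T: share of elements answered correctly in QA
    but omitted from (or wrong in) the caption."""
    rs = list(results)
    known = sum(r.qa_correct for r in rs)
    told = sum(r.correct for r in rs)
    return max(known - told, 0) / known if known else 0.0
```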

Benchmark

Dimension Design

An example of an image caption (left) and a video caption (right). By analyzing the components of captions, we derive 12 dimensions (9 static and 4 dynamic, with object number shared between the two), all of which contribute to a detailed and comprehensive caption. The static dimensions apply to both images and videos. Videos carry the additional dynamic dimensions, which must be judged using temporal relations.

Benchmark Statistics

The data source count and distribution of each dimension. We collect nearly 1,000 images/videos per dimension, crawling part of the data ourselves and sampling the rest from existing datasets to ensure diversity.

The annotation distribution of each dimension. We compute statistics differently for different dimension types. For object category, character identification, and action, we count frequencies, since most descriptions appear only once. For spatial relation, we summarize four categories and count their occurrences. For style, camera angle, and camera movement, we count the samples in each category. For the remaining dimensions, we plot bar charts showing the most frequent samples.
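For the frequency-counted dimensions, the tallying reduces to a simple category count. A minimal sketch, assuming each annotation is a dict with a `category` field (a hypothetical schema, not the released annotation format):

```python
from collections import Counter

def annotation_distribution(annotations: list[dict]) -> list[tuple[str, int]]:
    """Tally how often each category appears in one dimension's annotations,
    e.g. the four spatial-relation categories."""
    counts = Counter(a["category"] for a in annotations)
    return counts.most_common()  # [(category, count), ...], most frequent first
```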

Experiment Results

Results on SOTA Models

Radar-chart visualization of the F1-score for five representative MLLMs.

The precision, recall, and F1-score of closed-source models and 72B open-source models on all dimensions. Precision measures the accuracy of what a model describes; recall measures how many of the annotated visual elements are described correctly; the F1-score is their harmonic mean. For video inputs, we feed the whole video to Gemini and uniformly sample 50 frames for GPT due to its API limit on the maximum number of input images.
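The 50-frame setting can be reproduced with straightforward uniform sampling. The snippet below is a minimal sketch using OpenCV, not the repository's inference code:

```python
import cv2
import numpy as np

def sample_uniform_frames(video_path: str, num_frames: int = 50) -> list[np.ndarray]:
    """Uniformly sample `num_frames` frames across a video, mirroring the
    50-frame setting used for GPT. Returns frames as RGB arrays."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Evenly spaced indices; shorter videos yield repeated indices.
    indices = np.linspace(0, max(total - 1, 0), num_frames, dtype=int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames
```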

Evaluation Examples

Evaluation example of the Object Number dimension on three SOTA models.

Evaluation example of the Camera Angle dimension on three SOTA models.

Citation


      @article{liu2025good,
        title={What Is a Good Caption? A Comprehensive Visual Caption Benchmark for Evaluating Both Correctness and Thoroughness},
        author={Liu, Zhihang and Xie, Chen-Wei and Wen, Bin and Yu, Feiwu and Chen, Jixuan and Zhang, Boqiang and Yang, Nianzu and Li, Pandeng and Li, Yinglu and Gao, Zuan and Zheng, Yun and Xie, Hongtao},
        journal={arXiv preprint arXiv:2502.14914},
        year={2025}
      }