I would like to suggest expanding the evaluation of visual reasoning to the HumanEval-V benchmark. This benchmark provides a more challenging set of tasks by introducing complex diagrams paired with coding challenges. Unlike traditional visual reasoning tasks that focus on answering multiple-choice questions or providing short answers, HumanEval-V requires models to generate code based on visual input, which better tests both instruction-following and open-ended generation abilities.
Key points for consideration:
- HumanEval-V expands the reasoning scenarios with complex diagrams, pushing the limits of visual understanding.
- The task format is tailored to code generation, making it a suitable benchmark for testing MLLMs' ability to handle more structured, generative tasks.
- Evaluating on this benchmark will provide valuable insight into how well a model handles visual reasoning combined with coding, and the generated solutions can be scored and rewarded through execution feedback (see the sketch below).
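As a rough illustration of the execution-feedback idea, here is a minimal sketch of turning test execution into a per-sample reward. This is not HumanEval-V's official harness; the function name, the test format, and the binary reward scheme are all assumptions.

```python
import os
import subprocess
import tempfile

def execution_reward(generated_code: str, test_code: str, timeout: float = 10.0) -> float:
    """Run model-generated code against a task's unit tests and return a
    binary reward: 1.0 if every test passes, 0.0 on any failure or error.

    Both arguments are plain Python source strings; the concrete test
    format used by HumanEval-V may differ (assumption).
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        # Concatenate the candidate solution with its unit tests.
        f.write(generated_code + "\n\n" + test_code + "\n")
        path = f.name
    try:
        result = subprocess.run(
            ["python", path],
            capture_output=True,
            timeout=timeout,  # guard against non-terminating solutions
        )
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0
    finally:
        os.remove(path)
```

A signal like this could be plugged in wherever the evaluation (or an RL-style training loop) expects a per-sample score.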
Congratulations on the impressive work!
You can find more information about the benchmark here: HumanEval-V Homepage.