I would like to suggest expanding the evaluation of visual reasoning to the HumanEval-V benchmark. This benchmark provides a more challenging set of tasks by introducing complex diagrams paired with coding challenges. Unlike traditional visual reasoning tasks that focus on answering multiple-choice questions or providing short answers, HumanEval-V requires models to generate code based on visual input, which better tests both instruction-following and open-ended generation abilities.
Key points for consideration:
- HumanEval-V expands the reasoning scenarios with complex diagrams, pushing the limits of visual understanding.
- The task format is tailored to code generation, making it a suitable benchmark for testing MLLMs' ability to handle more structured, generative tasks.
- Evaluating on this benchmark will provide valuable insight into how well a model handles visual reasoning combined with coding, and the generated solutions can be scored and rewarded through execution feedback (see the sketch below).
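As a rough illustration of the execution-feedback idea, here is a minimal sketch of turning test execution into a per-sample reward. This is not HumanEval-V's official harness; the function name, the test format, and the binary reward scheme are all assumptions.

```python
import os
import subprocess
import tempfile

def execution_reward(generated_code: str, test_code: str, timeout: float = 10.0) -> float:
    """Run model-generated code against a task's unit tests and return a
    binary reward: 1.0 if every test passes, 0.0 on any failure or error.

    Both arguments are plain Python source strings; the concrete test
    format used by HumanEval-V may differ (assumption).
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        # Concatenate the candidate solution with its unit tests.
        f.write(generated_code + "\n\n" + test_code + "\n")
        path = f.name
    try:
        result = subprocess.run(
            ["python", path],
            capture_output=True,
            timeout=timeout,  # guard against non-terminating solutions
        )
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0
    finally:
        os.remove(path)
```

A signal like this could be plugged in wherever the evaluation (or an RL-style training loop) expects a per-sample score.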
Congratulations on the impressive work!
You can find more information about the benchmark here: HumanEval-V Homepage.