The Chinese Vision-Language Understanding Evaluation (CVLUE) benchmark evaluates the vision-language modeling and understanding capabilities of Chinese vision-language pre-trained models from multiple perspectives, covering Image-Text Retrieval, Visual Question Answering, Visual Grounding, and Visual Dialog. It comprises the following five sub-tasks (a brief candidate-ranking sketch follows the list):
- Image Retrieval: Retrieve the corresponding image from several candidates based on the given text description.
- Text Retrieval: Retrieve the corresponding text description from several candidates based on the given image.
- Visual Question Answering: Answer questions based on the given image using short phrases.
- Visual Grounding: Locate the entity in the image that corresponds to the given text description.
- Visual Dialog: Select the most appropriate reply text from several candidates based on the given image and dialog history.
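For intuition, the three candidate-selection sub-tasks (image retrieval, text retrieval, and visual dialog) can all be framed as ranking candidates by an image-text matching score. The sketch below is a minimal illustration assuming a CLIP-style dual encoder; the `encode_image` and `encode_text` functions and the encoder itself are hypothetical placeholders, not part of the CVLUE release.

```python
import numpy as np

def rank_candidates(query_emb: np.ndarray, cand_embs: np.ndarray) -> np.ndarray:
    """Rank candidates by cosine similarity to the query (best first).

    query_emb: (d,) embedding of the query (a caption for image retrieval,
               an image for text retrieval / visual dialog).
    cand_embs: (n, d) embeddings of the n candidates.
    Returns candidate indices sorted from most to least similar.
    """
    q = query_emb / np.linalg.norm(query_emb)
    c = cand_embs / np.linalg.norm(cand_embs, axis=1, keepdims=True)
    scores = c @ q                 # cosine similarity per candidate
    return np.argsort(-scores)     # best-scoring candidate first

# Hypothetical usage with a CLIP-style dual encoder (not provided by CVLUE):
# text_emb = encode_text("There is a hot pot placed in the middle of the table.")
# img_embs = np.stack([encode_image(p) for p in candidate_images])
# ranking  = rank_candidates(text_emb, img_embs)   # ranking[0] is the retrieved image
```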
The benchmark covers images from 15 major categories and 92 subcategories, listed in the table below. All images are manually collected, with the strict requirement that their content be representative of, or commonly seen in, the Chinese cultural environment:
| Major Category | Subcategories | Number of Subcategories |
| --- | --- | --- |
| Animals | Panda, Cow, Fish, Dog, Horse, Chicken, Mouse, Bird, Human, Cat | 10 |
| Food | Hot pot, Rice, Dumplings, Noodles, Buns | 5 |
| Drinks | Bubble tea, Cola, Milk, Tea, Porridge, Alcohol | 6 |
| Clothes | Hanfu, Tang suit, Cheongsam, Suit, T-shirt | 5 |
| Plants | Willow, Ginkgo, Chinese parasol, Birch, Pine, Chrysanthemum, Peony, Orchid, Lotus, Lily | 10 |
| Fruits | Lychee, Hawthorn, Apple, Cantaloupe, Longan | 5 |
| Vegetables | Bok choy, Potato, Chinese Cabbage, Carrot, Cauliflower | 5 |
| Agriculture | Hoe, Plow, Harrow, Sickle, Carrying pole | 5 |
| Tools | Spoon, Bowl, Cutting board, Chopsticks, Wok, Fan, Chinese Cleaver, Wok Spatula | 8 |
| Furniture | TV, Table, Chair, Refrigerator, Stove | 5 |
| Sports | Ping-Pong, Basketball, Swimming, Football, Running | 5 |
| Celebrations | Lion Dance, Dragon Boat, National Flag, Mooncake, Couplets, Lantern | 6 |
| Education | Pencil, Blackboard, Chinese Brush, Chalk, Ballpoint, Scissors | 6 |
| Musical Instruments | Guzheng, Erhu, Suona, Drum, Pipa | 5 |
| Art | Calligraphy, Chinese Shadow Play, Paper Cutting, Terracotta Army, Ding, Ceramics | 6 |
Data examples for each sub-task are given below.
Image-Text Retrieval: each image is paired with 5 different descriptions. The 5 captions of the example image are:
- There is a hot pot placed in the middle of the table.
- A two-flavor hot pot is placed on a wooden table.
- A hot pot with one spicy and one mushroom broth base is placed on the table.
- The hot pot is surrounded by various ingredients for hot pot, such as vegetables, meat, and meatballs.
- In the middle of the table, there is a two-flavor hot pot, and the ceramic bowls around it contain ingredients for hot pot.
Visual Question Answering: questions about the image are answered with short phrases. The 3 example questions and answers are:
- Q: In which direction is the dragon boat heading?
  A: To the right
- Q: How many teams are rowing dragon boats?
  A: 5
- Q: Are most people standing or sitting?
  A: Sitting
Visual Grounding: descriptions of certain entities in the image are given, together with their corresponding bounding boxes (marked with rectangles in the image).
The entity descriptions are:
- The shadow puppet held by the girl wearing glasses
- The shadow puppet held by the boy with short hair
Visual Dialog: given an image and its caption, a multi-turn question-answer dialog is conducted about the image.
- Caption: There is a lot of food on the blue table mat
- Q1: What foods are on the table?
  A1: The foods include eggs, buns, side dishes, steamed buns, and porridge
- Q2: What kind of porridge is on the table?
  A2: The porridge on the table is black rice porridge
......
- Q10: How many eggs are on the table?
  A10: There are two eggs on the table
The evaluation metrics for each sub-task are as follows:
For the image-text retrieval task, the evaluation metric is Recall at $k$, $R@k$ ($k = 1, 5, 10$).
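As a reference, $R@k$ can be computed from the rank of the ground-truth item in each query's candidate ranking; the sketch below is an illustration, not the official evaluation code.

```python
from typing import Sequence

def recall_at_k(gt_ranks: Sequence[int], k: int) -> float:
    """Fraction of queries whose ground-truth item is ranked within the top k.

    gt_ranks: 1-based rank of the ground-truth candidate for each query.
    """
    hits = sum(1 for r in gt_ranks if r <= k)
    return hits / len(gt_ranks)

# Example: ground-truth ranks for 5 queries.
ranks = [1, 3, 12, 2, 7]
print(recall_at_k(ranks, 1))   # 0.2
print(recall_at_k(ranks, 5))   # 0.6
print(recall_at_k(ranks, 10))  # 0.8
```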
For the visual question answering task, the evaluation metric is answer accuracy ($Accuracy$), i.e. the proportion of questions answered correctly.
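Accuracy is simply the fraction of questions whose predicted short-phrase answer matches the reference; the exact-match criterion in this sketch is an assumption for illustration, not the official scoring rule.

```python
def vqa_accuracy(predictions, references):
    """Exact-match accuracy over short-phrase answers (illustrative criterion)."""
    correct = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return correct / len(references)

# Example with the dragon-boat Q&A above: 2 of 3 predictions match exactly.
print(vqa_accuracy(["To the right", "5", "Standing"],
                   ["To the right", "5", "Sitting"]))  # ~0.667
```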
For the visual grounding task, the evaluation metrics are based on Intersection over Union ($IoU$): grounding accuracy, where a predicted box with $IoU > 0.5$ against the ground-truth box counts as correct, and the mean $IoU$.
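A minimal sketch of the IoU-based grounding metrics, assuming axis-aligned boxes in (x1, y1, x2, y2) format; this is an illustration, not the official scoring script.

```python
def box_iou(a, b) -> float:
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def grounding_metrics(preds, gts):
    """Accuracy (IoU > 0.5 counts as correct) and mean IoU over all examples."""
    ious = [box_iou(p, g) for p, g in zip(preds, gts)]
    acc = sum(iou > 0.5 for iou in ious) / len(ious)
    return acc, sum(ious) / len(ious)
```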
For the visual dialog task, the evaluation metric is likewise Recall $R@k$ ($k = 1, 5, 10$), computed over the ranked candidate replies.
The CVLUE data is freely available upon request under the CC BY-NC-ND 4.0 license. Please submit your request via the Google Form.
If you use the CVLUE data, please cite the following paper:
@misc{wang-etal-2024-cvlue,
title={CVLUE: A New Benchmark Dataset for Chinese Vision-Language Understanding Evaluation},
author={Yuxuan Wang and Yijun Liu and Fei Yu and Chen Huang and Kexin Li and Zhiguo Wan and Wanxiang Che and Hongyang Chen},
year={2024},
eprint={2407.01081},
archivePrefix={arXiv},
primaryClass={cs.CV}
}