The Chinese Vision-Language Understanding Evaluation (CVLUE) benchmark evaluates the vision-language modeling and understanding capabilities of Chinese vision-language pre-trained models from multiple perspectives, covering Image-Text Retrieval, Visual Question Answering, Visual Grounding, and Visual Dialog. It comprises the following five sub-tasks (a brief candidate-ranking sketch follows the list):
- Image Retrieval: Retrieve the corresponding image from several candidates based on the given text description.
- Text Retrieval: Retrieve the corresponding text description from several candidates based on the given image.
- Visual Question Answering: Answer questions based on the given image using short phrases.
- Visual Grounding: Locate the entity in the image that corresponds to the given text description.
- Visual Dialog: Select the most appropriate reply text from several candidates based on the given image and dialog history.
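For intuition, the three candidate-selection sub-tasks (image retrieval, text retrieval, and visual dialog) can all be framed as ranking candidates by an image-text matching score. The sketch below is a minimal illustration assuming a CLIP-style dual encoder; the `encode_image` and `encode_text` functions and the encoder itself are hypothetical placeholders, not part of the CVLUE release.

```python
import numpy as np

def rank_candidates(query_emb: np.ndarray, cand_embs: np.ndarray) -> np.ndarray:
    """Rank candidates by cosine similarity to the query (best first).

    query_emb: (d,) embedding of the query (a caption for image retrieval,
               an image for text retrieval / visual dialog).
    cand_embs: (n, d) embeddings of the n candidates.
    Returns candidate indices sorted from most to least similar.
    """
    q = query_emb / np.linalg.norm(query_emb)
    c = cand_embs / np.linalg.norm(cand_embs, axis=1, keepdims=True)
    scores = c @ q                 # cosine similarity per candidate
    return np.argsort(-scores)     # best-scoring candidate first

# Hypothetical usage with a CLIP-style dual encoder (not provided by CVLUE):
# text_emb = encode_text("There is a hot pot placed in the middle of the table.")
# img_embs = np.stack([encode_image(p) for p in candidate_images])
# ranking  = rank_candidates(text_emb, img_embs)   # ranking[0] is the retrieved image
```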
The benchmark covers images from 15 major categories and 92 subcategories, listed in the table below. All images are manually collected, with the strict requirement that their content be representative of, or commonly seen in, the Chinese cultural environment:
| Major Category | Subcategories | Number of Subcategories |
| --- | --- | --- |
| Animals | Panda, Cow, Fish, Dog, Horse, Chicken, Mouse, Bird, Human, Cat | 10 |
| Food | Hot pot, Rice, Dumplings, Noodles, Buns | 5 |
| Drinks | Bubble tea, Cola, Milk, Tea, Porridge, Alcohol | 6 |
| Clothes | Hanfu, Tang suit, Cheongsam, Suit, T-shirt | 5 |
| Plants | Willow, Ginkgo, Chinese parasol, Birch, Pine, Chrysanthemum, Peony, Orchid, Lotus, Lily | 10 |
| Fruits | Lychee, Hawthorn, Apple, Cantaloupe, Longan | 5 |
| Vegetables | Bok choy, Potato, Chinese Cabbage, Carrot, Cauliflower | 5 |
| Agriculture | Hoe, Plow, Harrow, Sickle, Carrying pole | 5 |
| Tools | Spoon, Bowl, Cutting board, Chopsticks, Wok, Fan, Chinese Cleaver, Wok Spatula | 8 |
| Furniture | TV, Table, Chair, Refrigerator, Stove | 5 |
| Sports | Ping-Pong, Basketball, Swimming, Football, Running | 5 |
| Celebrations | Lion Dance, Dragon Boat, National Flag, Mooncake, Couplets, Lantern | 6 |
| Education | Pencil, Blackboard, Chinese Brush, Chalk, Ballpoint, Scissors | 6 |
| Musical Instruments | Guzheng, Erhu, Suona, Drum, Pipa | 5 |
| Art | Calligraphy, Chinese Shadow Play, Paper Cutting, Terracotta Army, Ding, Ceramics | 6 |
Data examples for each sub-task are given below.
Image-Text Retrieval: each image is paired with 5 different descriptions. The 5 captions of the example image are:
- There is a hot pot placed in the middle of the table.
- A two-flavor hot pot is placed on a wooden table.
- A hot pot with one spicy and one mushroom broth base is placed on the table.
- The hot pot is surrounded by various ingredients for hot pot, such as vegetables, meat, and meatballs.
- In the middle of the table, there is a two-flavor hot pot, and the ceramic bowls around it contain ingredients for hot pot.
Visual Question Answering: questions about the image are answered with short phrases. The 3 example questions and answers are:
- Q: In which direction is the dragon boat heading?
  A: To the right
- Q: How many teams are rowing dragon boats?
  A: 5
- Q: Are most people standing or sitting?
  A: Sitting
Visual Grounding: descriptions of certain entities in the image are given, together with their corresponding bounding boxes (marked with rectangles in the image).
The entity descriptions are:
- The shadow puppet held by the girl wearing glasses
- The shadow puppet held by the boy with short hair
Visual Dialog: given an image and its caption, a multi-turn question-answer dialog is conducted about the image.
- Caption: There is a lot of food on the blue table mat
- Q1: What foods are on the table?
  A1: The foods include eggs, buns, side dishes, steamed buns, and porridge
- Q2: What kind of porridge is on the table?
  A2: The porridge on the table is black rice porridge
......
- Q10: How many eggs are on the table?
  A10: There are two eggs on the table
The evaluation metrics for each sub-task are as follows:
For the image-text retrieval task, the evaluation metric is Recall at $k$, $R@k$ ($k = 1, 5, 10$).
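As a reference, $R@k$ can be computed from the rank of the ground-truth item in each query's candidate ranking; the sketch below is an illustration, not the official evaluation code.

```python
from typing import Sequence

def recall_at_k(gt_ranks: Sequence[int], k: int) -> float:
    """Fraction of queries whose ground-truth item is ranked within the top k.

    gt_ranks: 1-based rank of the ground-truth candidate for each query.
    """
    hits = sum(1 for r in gt_ranks if r <= k)
    return hits / len(gt_ranks)

# Example: ground-truth ranks for 5 queries.
ranks = [1, 3, 12, 2, 7]
print(recall_at_k(ranks, 1))   # 0.2
print(recall_at_k(ranks, 5))   # 0.6
print(recall_at_k(ranks, 10))  # 0.8
```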
For the visual question answering task, the evaluation metric is answer accuracy ($Accuracy$), i.e. the proportion of questions answered correctly.
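Accuracy is simply the fraction of questions whose predicted short-phrase answer matches the reference; the exact-match criterion in this sketch is an assumption for illustration, not the official scoring rule.

```python
def vqa_accuracy(predictions, references):
    """Exact-match accuracy over short-phrase answers (illustrative criterion)."""
    correct = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return correct / len(references)

# Example with the dragon-boat Q&A above: 2 of 3 predictions match exactly.
print(vqa_accuracy(["To the right", "5", "Standing"],
                   ["To the right", "5", "Sitting"]))  # ~0.667
```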
For the visual grounding task, the evaluation metrics are based on Intersection over Union ($IoU$): grounding accuracy, where a predicted box with $IoU > 0.5$ against the ground-truth box counts as correct, and the mean $IoU$.
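A minimal sketch of the IoU-based grounding metrics, assuming axis-aligned boxes in (x1, y1, x2, y2) format; this is an illustration, not the official scoring script.

```python
def box_iou(a, b) -> float:
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def grounding_metrics(preds, gts):
    """Accuracy (IoU > 0.5 counts as correct) and mean IoU over all examples."""
    ious = [box_iou(p, g) for p, g in zip(preds, gts)]
    acc = sum(iou > 0.5 for iou in ious) / len(ious)
    return acc, sum(ious) / len(ious)
```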
For the visual dialog task, the evaluation metric is likewise Recall $R@k$ ($k = 1, 5, 10$), computed over the ranked candidate replies.
The CVLUE data is freely available upon request under the CC BY-NC-ND 4.0 license. Please submit your request via the Google Form.
If you use the CVLUE data, please cite the following paper:
@misc{wang-etal-2024-cvlue,
title={CVLUE: A New Benchmark Dataset for Chinese Vision-Language Understanding Evaluation},
author={Yuxuan Wang and Yijun Liu and Fei Yu and Chen Huang and Kexin Li and Zhiguo Wan and Wanxiang Che and Hongyang Chen},
year={2024},
eprint={2407.01081},
archivePrefix={arXiv},
primaryClass={cs.CV}
}