WangYuxuan93/CVLUE

CVLUE: Chinese Vision-Language Understanding Evaluation

Chinese Version

Task Introduction

The Chinese Vision-Language Understanding Evaluation (CVLUE) task evaluates the vision-language modeling and understanding capabilities of Chinese vision-language pre-training models from four perspectives: Image-Text Retrieval, Visual Question Answering, Visual Grounding, and Visual Dialog. Because Image-Text Retrieval is evaluated in both directions, the task comprises the following five sub-tasks:

  • Image Retrieval: Retrieve the corresponding image from several candidates based on the given text description.
  • Text Retrieval: Retrieve the corresponding text description from several candidates based on the given image.
  • Visual Question Answering: Answer questions based on the given image using short phrases.
  • Visual Grounding: Identify the corresponding entity in the image based on the given image and text description.
  • Visual Dialog: Select the most appropriate reply text from several candidates based on the given image and dialog history.

Evaluation Data

The images in this task cover the 15 major categories and 92 subcategories listed below. They were collected manually by category, under the strict requirement that the content of each image be representative of, or commonly seen in, the Chinese cultural environment:

| Major Category | Subcategories | Number of Subcategories |
| --- | --- | --- |
| Animals | Panda, Cow, Fish, Dog, Horse, Chicken, Mouse, Bird, Human, Cat | 10 |
| Food | Hot pot, Rice, Dumplings, Noodles, Buns | 5 |
| Drinks | Bubble tea, Cola, Milk, Tea, Porridge, Alcohol | 6 |
| Clothes | Hanfu, Tang suit, Cheongsam, Suit, T-shirt | 5 |
| Plants | Willow, Ginkgo, Chinese parasol, Birch, Pine, Chrysanthemum, Peony, Orchid, Lotus, Lily | 10 |
| Fruits | Lychee, Hawthorn, Apple, Cantaloupe, Longan | 5 |
| Vegetables | Bok choy, Potato, Chinese Cabbage, Carrot, Cauliflower | 5 |
| Agriculture | Hoe, Plow, Harrow, Sickle, Carrying pole | 5 |
| Tools | Spoon, Bowl, Cutting board, Chopsticks, Wok, Fan, Chinese Cleaver, Wok Spatula | 8 |
| Furniture | TV, Table, Chair, Refrigerator, Stove | 5 |
| Sports | Ping-Pong, Basketball, Swimming, Football, Running | 5 |
| Celebrations | Lion Dance, Dragon Boat, National Flag, Mooncake, Couplets, Lantern | 6 |
| Education | Pencil, Blackboard, Chinese Brush, Chalk, Ballpoint, Scissors | 6 |
| Musical Instruments | Guzheng, Erhu, Suona, Drum, Pipa | 5 |
| Art | Calligraphy, Chinese Shadow Play, Paper Cutting, Terracotta Army, Ding, Ceramics | 6 |

Data Examples

One data example is given below for each task.

Image-Text Retrieval

Each image has 5 different descriptions.


The 5 captions are:

  1. There is a hot pot placed in the middle of the table.
  2. A two-flavor hot pot is placed on a wooden table.
  3. A hot pot with one spicy and one mushroom broth base is placed on the table.
  4. The hot pot is surrounded by various ingredients for hot pot, such as vegetables, meat, and meatballs.
  5. In the middle of the table, there is a two-flavor hot pot, and the ceramic bowls around it contain ingredients for hot pot.

Visual Question Answering

Each image is paired with questions about its content and their answers.


The 3 Questions and Answers are:

  • Q: In which direction is the dragon boat heading?
    A: To the right
  • Q: How many teams are rowing dragon boats?
    A: 5
  • Q: Are most people standing or sitting?
    A: Sitting

Visual Grounding

Each image comes with descriptions of certain entities in it and their corresponding bounding boxes (marked with rectangles in the image).


Description of entities:

  1. The shadow puppet held by the girl wearing glasses
  2. The shadow puppet held by the boy with short hair

Visual Dialog

Each image comes with a caption and a question-answer dialog about the image.


  • Caption: There is a lot of food on the blue table mat
  • Q1: What foods are on the table?
    A1: The foods include eggs, buns, side dishes, steamed buns, and porridge
  • Q2: What kind of porridge is on the table?
    A2: The porridge on the table is black rice porridge
    ......
  • Q10: How many eggs are on the table?
    A10: There are two eggs on the table

Evaluation Metrics

The evaluation metrics for each sub-task are as follows:

Image-Text Retrieval

For the image-text retrieval task, the evaluation metric is Recall at $k$, written $R@k$, for $k = 1, 5, 10$.

$$ R@k=\frac{Number\ of\ correct\ results\ in\ the\ top\ k\ retrieval\ rankings}{Total\ number\ of\ samples} $$
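As a rough illustration of the formula above, the following Python sketch computes $R@k$ from ranked candidate lists. It is not the official CVLUE scorer: the function name `recall_at_k`, the input layout, and the convention that a query counts as correct when any of its gold items appears in the top $k$ are assumptions made for this example.

```python
from typing import Hashable, Sequence, Set


def recall_at_k(ranked: Sequence[Sequence[Hashable]],
                gold: Sequence[Set[Hashable]],
                k: int) -> float:
    """Fraction of queries with at least one gold item among the top-k results.

    ranked[i] is the candidate list for query i, best candidate first.
    gold[i] is the set of correct candidates for query i (assumed layout).
    """
    if len(ranked) != len(gold):
        raise ValueError("ranked and gold must have the same length")
    if not ranked:
        return 0.0
    hits = sum(
        1 for candidates, correct in zip(ranked, gold)
        if any(item in correct for item in candidates[:k])
    )
    return hits / len(ranked)


# Toy data with hypothetical image IDs: three text queries, one gold image each.
ranked = [
    ["img3", "img1", "img7"],
    ["img2", "img5", "img9"],
    ["img4", "img8", "img6"],
]
gold = [{"img1"}, {"img2"}, {"img0"}]

print(recall_at_k(ranked, gold, 1))  # only the second query hits at k=1
print(recall_at_k(ranked, gold, 2))  # the first query also hits at k=2
```

Using a set for the gold items lets the same function cover text retrieval, where each image has 5 reference descriptions.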

Visual Question Answering

For the visual question answering task, the evaluation metric is answer accuracy, $ Accuracy $.

$$ Accuracy=\frac{Number\ of\ correct\ answers}{Total\ number\ of\ questions} $$
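The accuracy formula can be sketched in a few lines of Python. The matching rule here (exact string match after trimming whitespace and case-folding) is an assumption for illustration only, as are the names `normalize` and `vqa_accuracy`; the actual benchmark may use a different answer-matching procedure.

```python
from typing import Sequence


def normalize(answer: str) -> str:
    """Collapse whitespace and case-fold; an assumed, minimal normalization."""
    return " ".join(answer.split()).casefold()


def vqa_accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Share of questions whose normalized prediction equals the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(
        normalize(p) == normalize(r) for p, r in zip(predictions, references)
    )
    return correct / len(references)


# Reference answers taken from the dragon boat example; predictions are made up.
references = ["To the right", "5", "Sitting"]
predictions = ["to the right ", "5", "Standing"]

print(vqa_accuracy(predictions, references))  # two of three answers match
```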

Visual Grounding

For the visual grounding task, the evaluation metrics are based on Intersection over Union, $ IoU $. Two metrics are reported: grounding accuracy, where a prediction counts as correct if its $ IoU $ with the ground truth region is greater than 0.5, and the mean $ IoU $ over all grounding samples.

$$ IoU=\frac{Area\ of\ overlap\ between\ predicted\ region\ and\ ground\ truth\ region}{Area\ of\ union\ of\ predicted\ region\ and\ ground\ truth\ region} $$

$$ IoU_{Accuracy}=\frac{Number\ of\ samples\ with\ IoU\ over\ 0.5}{Total\ number\ of\ grounding\ samples} $$

$$ \overline{IoU}=\frac{Sum\ of\ IoU\ of\ all\ predictions}{Total\ number\ of\ grounding\ samples} $$
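The three grounding formulas can be illustrated with a short Python sketch. It assumes axis-aligned boxes given as `(x1, y1, x2, y2)` corner coordinates, which may differ from the box format used in the CVLUE data files; the function names `iou` and `grounding_metrics` are hypothetical.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_metrics(preds: Sequence[Box],
                      truths: Sequence[Box],
                      threshold: float = 0.5) -> Tuple[float, float]:
    """Return (accuracy at IoU > threshold, mean IoU) over all samples."""
    if len(preds) != len(truths):
        raise ValueError("preds and truths must have the same length")
    if not preds:
        return 0.0, 0.0
    scores = [iou(p, t) for p, t in zip(preds, truths)]
    accuracy = sum(s > threshold for s in scores) / len(scores)
    mean_iou = sum(scores) / len(scores)
    return accuracy, mean_iou


# Toy boxes: the first prediction is exact, the second overlaps only partly.
preds = [(0, 0, 2, 2), (0, 0, 2, 2)]
truths = [(0, 0, 2, 2), (1, 1, 3, 3)]
print(grounding_metrics(preds, truths))
```

The comparison is strict (`>`), matching the statement that an IoU greater than 0.5 is considered correct.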

Visual Dialog

For the visual dialog task, the evaluation metric is Recall at $k$, written $R@k$, for $k = 1, 5, 10$.

$$ R@k=\frac{Number\ of\ correct\ results\ in\ the\ top\ k\ retrieval\ rankings}{Total\ number\ of\ samples} $$
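For visual dialog, a model typically scores each candidate reply, and $R@k$ depends on where the correct reply lands in that ranking. The sketch below is one way to compute it from raw scores; the input layout, the function name `dialog_recall_at_k`, and the optimistic handling of ties are all assumptions, not part of the CVLUE specification.

```python
from typing import Sequence


def dialog_recall_at_k(scores: Sequence[Sequence[float]],
                       gold_index: Sequence[int],
                       k: int) -> float:
    """R@k when each dialog turn has one correct reply among scored candidates.

    scores[i][j] is the model score of candidate j for turn i (higher is better).
    gold_index[i] is the position of the correct reply for turn i.
    """
    if len(scores) != len(gold_index):
        raise ValueError("scores and gold_index must have the same length")
    if not scores:
        return 0.0
    hits = 0
    for turn_scores, gold in zip(scores, gold_index):
        # Rank of the gold reply: 1 + number of candidates scored strictly
        # higher. Ties are resolved in favor of the gold reply (an assumption).
        rank = 1 + sum(1 for s in turn_scores if s > turn_scores[gold])
        if rank <= k:
            hits += 1
    return hits / len(scores)


# Toy scores for three dialog turns with three candidate replies each.
scores = [
    [0.1, 0.7, 0.2],  # gold reply (index 1) is ranked 1st
    [0.5, 0.3, 0.2],  # gold reply (index 1) is ranked 2nd
    [0.2, 0.3, 0.5],  # gold reply (index 0) is ranked 3rd
]
gold_index = [1, 1, 0]

for k in (1, 2, 3):
    print(k, dialog_recall_at_k(scores, gold_index, k))
```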

Data Access and Citation

The CVLUE data is freely available upon request under the CC BY-NC-ND 4.0 license. Please submit your request via Google Form.

If you use the CVLUE data, please cite the following paper:

@misc{wang-etal-2024-cvlue,
    title={CVLUE: A New Benchmark Dataset for Chinese Vision-Language Understanding Evaluation},
    author={Yuxuan Wang and Yijun Liu and Fei Yu and Chen Huang and Kexin Li and Zhiguo Wan and Wanxiang Che and Hongyang Chen},
    year={2024},
    eprint={2407.01081},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}
