Any advice to train on Chinese Dataset using Caption Contrastive Fine-tuning? #33

Open
WeihongM opened this issue Jan 23, 2025 · 2 comments


WeihongM commented Jan 23, 2025

Hello, thanks for the impressive work. I find that the model performs poorly with Chinese captions. If I want to use the caption contrastive fine-tuning loss to train an LLM that supports Chinese (such as Qwen), which dataset would you advise me to use?
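
For context, the caption contrastive fine-tuning being asked about is, at its core, an InfoNCE loss over LLM caption embeddings, where two captions of the same image are positives and the rest of the batch are negatives. Below is a minimal sketch assuming a SimCSE-style setup with masked mean pooling; the model name is a placeholder for any Chinese-capable LLM, and this is an illustration, not the repository's actual implementation.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_name = "Qwen/Qwen2-1.5B"  # placeholder: any Chinese-capable LLM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

def encode(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state           # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()
    pooled = (hidden * mask).sum(1) / mask.sum(1)       # masked mean pooling
    return F.normalize(pooled, dim=-1)

def caption_contrastive_loss(captions_a, captions_b, temperature=0.05):
    # captions_a[i] and captions_b[i] describe the same image (positive pair);
    # every other caption in the batch acts as an in-batch negative
    za, zb = encode(captions_a), encode(captions_b)
    logits = za @ zb.T / temperature                    # (B, B) cosine similarities
    labels = torch.arange(len(captions_a))
    # symmetric InfoNCE over both matching directions
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2
```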

raytrun (Contributor) commented Jan 24, 2025

Thank you for your interest in our work.
The WuKong Dataset is a large-scale collection of Chinese image-text pairs and could be a good choice given its substantial volume of data. However, I'm not entirely sure about the quality of the captions in this dataset, so you may need to check whether it's suitable.
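
If it helps, here is a rough screening pass for the Wukong data, assuming the released shards are CSVs with a (url, caption) column layout; the length thresholds, the filter pattern, and the shard file name are arbitrary starting points, not values from the release or the paper.

```python
import pandas as pd

def filter_wukong(csv_path, min_len=6, max_len=64):
    # assumes each released shard is a CSV with url and caption columns
    df = pd.read_csv(csv_path, names=["url", "caption"], header=0)
    df = df.dropna(subset=["caption"])
    lengths = df["caption"].str.len()
    keep = (lengths >= min_len) & (lengths <= max_len)
    # drop captions that are obviously file names or URL fragments
    keep &= ~df["caption"].str.contains(r"\.(?:jpg|png|gif)|http", regex=True)
    return df[keep]

shard = filter_wukong("wukong_release_0.csv")  # hypothetical shard name
print(f"kept {len(shard)} image-text pairs")
```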

MrPanda007 commented

@raytrun Thanks for your great work.
Right now I'm using your caption-contrastive fine-tuned Llama model to fine-tune the EVA model on CC15M plus 1 million of my own Chinese image-text pairs, but it performs worse than Chinese-CLIP.
If I want to use the LLM2CLIP model for Chinese, should I use your caption-contrastive fine-tuned Llama model to train CLIP, or do I need to fine-tune both the Llama and CLIP models?
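
For what it's worth, my understanding of the intended recipe is that the caption-contrastive fine-tuned LLM stays frozen as the text encoder, while only the vision encoder plus small projection adapters are trained with the usual CLIP loss. A minimal sketch of that stage follows; the module names and dimensions are placeholders, not the repository's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LLM2CLIPStage2(nn.Module):
    # llm_dim / vis_dim / embed_dim and the module names are illustrative only
    def __init__(self, vision_model, llm_dim, vis_dim, embed_dim=1024):
        super().__init__()
        self.vision_model = vision_model                  # e.g. EVA; trainable
        self.text_proj = nn.Linear(llm_dim, embed_dim)    # trainable adapter
        self.vis_proj = nn.Linear(vis_dim, embed_dim)
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # ln(1/0.07)

    def forward(self, images, text_embeds):
        # text_embeds: caption features from the *frozen* CC-fine-tuned LLM;
        # since that LLM never updates, they can be precomputed offline
        z_t = F.normalize(self.text_proj(text_embeds), dim=-1)
        z_i = F.normalize(self.vis_proj(self.vision_model(images)), dim=-1)
        logits = self.logit_scale.exp() * z_i @ z_t.T     # (B, B)
        labels = torch.arange(images.size(0), device=images.device)
        # symmetric image-to-text / text-to-image contrastive loss
        return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2
```

Because the LLM is frozen in this sketch, its caption features can be cached once over the whole dataset, so only the vision tower and the two adapters see gradients.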
