
The Use of Loss Functions and Dataset Integration in CC-Fine-Tuning? #9

Closed
BIGBALLON opened this issue Nov 12, 2024 · 3 comments

Comments

@BIGBALLON

Thank you so much for such outstanding work. I have a couple of questions about the fine-tuning process described in Section 3.2, particularly around the integration of the loss functions and datasets:

In the paper, two loss functions are mentioned: SimCSE loss and Masked Next Token Prediction (MNTP). However, it is unclear whether these two loss functions are used simultaneously during training, or if the training process is split into different phases where each loss is applied separately. Could you please clarify how the losses are used? If they are used together, what are the relative weights assigned to each?

Regarding the datasets, CC-3M and Wikitext-103 are mentioned as part of the training process. It seems a bit unclear how these two datasets are combined in the training phase. Given that Wikitext-103 is a pure language corpus while CC-3M is image-caption based, how are they jointly used during the fine-tuning process? Are they used for different stages or tasks?

@Yif-Yang
Collaborator

Thank you for your question. I’m glad to clarify.

We use the supervised SimCSE loss to make different captions of the same image positive samples for each other, while captions of different images serve as negatives. This loss is the key to our method, as it allows the LLM to provide meaningful supervisory signals to the image side. Masked Next Token Prediction (MNTP), by contrast, is an initial stage we run before the supervised SimCSE loss; it can be understood as an earlier training step. In other words, we first perform MNTP and then apply the supervised SimCSE loss, in a two-stage process. In practice, MNTP has little impact on the results, so removing it does not change the conclusions; for optimal performance, however, we still chose to run MNTP before the supervised SimCSE loss.
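
For concreteness, here is a minimal sketch of what a supervised SimCSE-style objective of this form could look like (captions of the same image as positives, all other captions in the batch as negatives). The function name, temperature value, and toy inputs are illustrative assumptions, not taken from the paper's released code.

```python
import torch
import torch.nn.functional as F

def supervised_simcse_loss(embeddings, image_ids, temperature=0.05):
    """Illustrative supervised SimCSE-style contrastive loss.

    embeddings: (B, D) caption embeddings from the LLM
    image_ids:  (B,)   id of the image each caption describes
    Captions that share an image_id are positives for each other;
    every other caption in the batch acts as a negative.
    """
    z = F.normalize(embeddings, dim=-1)            # cosine-similarity space
    sim = z @ z.t() / temperature                  # (B, B) similarity matrix

    B = z.size(0)
    self_mask = torch.eye(B, dtype=torch.bool, device=z.device)
    pos_mask = (image_ids.unsqueeze(0) == image_ids.unsqueeze(1)) & ~self_mask

    # log-softmax over all other captions in the batch (exclude self)
    sim = sim.masked_fill(self_mask, float('-inf'))
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)

    # average log-probability of the positives, for anchors that have at least one
    pos_count = pos_mask.sum(dim=1)
    has_pos = pos_count > 0
    pos_log_prob = log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1)
    loss = -pos_log_prob[has_pos] / pos_count[has_pos]
    return loss.mean()

# toy usage: 6 captions covering 3 images (two captions per image)
emb = torch.randn(6, 128)
ids = torch.tensor([0, 0, 1, 1, 2, 2])
print(supervised_simcse_loss(emb, ids))
```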

We do indeed mix the pure-text and caption datasets. Because the LLM was originally pre-trained on pure text, we want to keep its distribution from shifting too far, so we include the pure-text dataset Wikitext-103, which also helps mitigate any bias introduced by the captions. Concretely, we mix and shuffle the two datasets and then sample batches normally for training; this is a common and effective practice.
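
As an illustration, one simple way to implement this kind of mixing in PyTorch is to concatenate the two corpora and let the DataLoader shuffle them, so each batch is a random blend of captions and pure text. The toy datasets and batch size below are placeholders, not the actual training configuration.

```python
from torch.utils.data import Dataset, ConcatDataset, DataLoader

class TextDataset(Dataset):
    """Toy stand-in for a tokenized corpus (e.g. CC-3M captions or Wikitext-103)."""
    def __init__(self, texts, source):
        self.texts = texts
        self.source = source  # tag used only to show where each sample came from

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        return {"text": self.texts[idx], "source": self.source}

captions = TextDataset([f"caption {i}" for i in range(1000)], source="cc3m")
wikitext = TextDataset([f"wiki passage {i}" for i in range(1000)], source="wikitext103")

# Concatenate the two corpora; shuffling then interleaves them,
# so each sampled batch mixes image captions with pure text.
mixed = ConcatDataset([captions, wikitext])
loader = DataLoader(mixed, batch_size=8, shuffle=True)

batch = next(iter(loader))
print(batch["source"])  # e.g. a mix of 'cc3m' and 'wikitext103' entries
```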

If you have more questions, please feel free to ask.

@BIGBALLON
Author

Thanks a lot for the reply, that clears it up. Thanks again!

@happywinder

Thanks in advance, and congratulations on the amazing work.

I notice "In addition, we used the Wikitext-103 dataset [28] and the E5 dataset [37] during the Masked Next Token Prediction and caption contrastive fine-tuning stages to ensure coverage of the general text domain without excessive bias."

How is the Wikitext-103 dataset applied during the caption contrastive fine-tuning stage of your training?
@Yif-Yang
