After you have trained an LCM, the checkpoint is saved in a folder under the name `model.pt`, together with the model card under the name `model_card.yaml`. We also provide a library to evaluate LCMs and LLMs. Using this library brings several benefits: you can reproduce the experiments from the paper, inspect the results in a unified way, and scale up the experiments to very large datasets on a SLURM cluster. This document shows how to evaluate a model on different downstream tasks using the LCM eval library.
Since an LCM expects input data at the sentence level, we need to preprocess the evaluation datasets accordingly. This includes parsing the raw content, splitting the texts into sentences, and then embedding the sentences into vectors using a Sonar encoder.
The example below shows how we prepare the data for CNN DailyMail. We load the dataset from Hugging Face using the `datasets` API. Sentence splitting is done with wtpsplit. First, we install the necessary libraries:
python -m pip install datasets wtpsplit
All processing logic is implemented in the file `prepare_evaluation_data.py`, as described below.
Next, we download and parse the content (source texts and summaries), saving each split in JSONL format:
python prepare_evaluation_data.py prepare_data \
--dataset_name=cnn_dailymail \
--output_dir=jsonl_dataset \
--source_text_column=article \
--target_text_column=highlights \
--version=3.0.0 \
--prompt_prefix="Summarize the following news to a concise list of highlights.\n[Text Start]:\n"
--prompt_suffix="\n[Text End]"
Explanation: In the above script, `cnn_dailymail` and `3.0.0` are the name and configuration of the dataset as available in HuggingFace `datasets`, and `article` and `highlights` are the source and summary columns. The `prompt_prefix` and `prompt_suffix` arguments are optional; if specified, they are prepended and appended to each source text to form the complete prompt. These arguments are useful if you want to embed the prompts into the dataset and have them processed all at once together with the text. Alternatively, you can specify them at a later phase, when you evaluate the model (in which case the model will process the prompts on the fly).
NOTE: When `prompt_prefix` or `prompt_suffix` is specified, the dataset schema changes, i.e. the columns are renamed to "prompt" for the input and "answer" for the output. This indicates that we are handling the "processed" dataset and not the original one.
The output will be stored in separate files `[split].jsonl` under the directory `output_dir`.
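To sanity-check a prepared split, you can peek at the first record. A minimal sketch (the path follows the commands in this section, and the column names reflect the renamed schema described in the note above):

```python
import json

# Inspect the first record of the processed CNN DailyMail test split.
# Because prompt_prefix/prompt_suffix were given, the input and output
# columns are named "prompt" and "answer" (see the note above).
with open("jsonl_dataset/cnn_dailymail/test.jsonl") as f:
    record = json.loads(f.readline())

print(sorted(record.keys()))
print(record["prompt"][:200])  # prompt prefix followed by the article text
```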
To perform sentence splitting and Sonar embedding for each split, run the following command:
python prepare_evaluation_data.py embed \
--input_path=jsonl_dataset/cnn_dailymail/test.jsonl \
--input_column=article \
--output_column=highlights \
--output_dir=parquet_dataset/cnn_dailymail \
--lang=eng_Latn \
--mode=slurm \
--log_dir=/tmp/logs/embed_cnndm
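Conceptually, the embed step splits each document into sentences and encodes them with Sonar. The sketch below shows the same idea in plain Python, assuming the wtpsplit and sonar (sonar-space) packages are installed; the model names are illustrative, and the actual `prepare_evaluation_data.py` script additionally handles batching, sharding and Parquet output:

```python
from wtpsplit import SaT
from sonar.inference_pipelines.text import TextToEmbeddingModelPipeline

# Split a document into sentences (wtpsplit model name is illustrative).
splitter = SaT("sat-3l-sm")
sentences = splitter.split("First sentence of the article. Second sentence of the article.")

# Encode each sentence into a Sonar embedding (one 1024-dim vector per sentence).
encoder = TextToEmbeddingModelPipeline(
    encoder="text_sonar_basic_encoder",
    tokenizer="text_sonar_basic_encoder",
)
embeddings = encoder.predict(sentences, source_lang="eng_Latn")
print(embeddings.shape)  # (num_sentences, 1024)
```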
To run the evaluation, we first need to map the model to a `Predictor`, an object that streamlines a number of steps: loading the model, reading the prompts, performing the inference, decoding the outputs according to the user's settings, and finally formatting the text into a user-friendly format. The list of currently supported model families and their predictors is shown below. All predictors are found in `lcm/evaluation/predictors` and are registered in `lcm.evaluation.predictors.PREDICTOR_CONFIG_MAP`.
| Predictor | Model family | Model identifier |
|---|---|---|
| `huggingface` | `AutoModel` transformers | `model_name`, `revision`, `model_class`, `tokenizer_class` |
| `llama3` | Llama 3.x | `model_name` |
| `gemma` | Gemma | `model_name` |
| `base_lcm` | Base LCM | `model_card` |
| `two_tower_diffusion_lcm` | Two-tower diffusion LCM | `model_card` |
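To check which predictor names are available in your installation, you can list the registry; this is a sketch assuming it is importable as `lcm.evaluation.predictors.PREDICTOR_CONFIG_MAP`, as noted above:

```python
# List the registered predictor names (registry path as referenced above).
from lcm.evaluation.predictors import PREDICTOR_CONFIG_MAP

print(sorted(PREDICTOR_CONFIG_MAP.keys()))
```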
Next, we specify how the decoder generates text via different generation options. For LLMs, the options are parameters found in `transformers.GenerationConfig`, and we expose the most popular ones in the predictors: `repetition_penalty`, `encoder_repetition_penalty`, `encoder_no_repeat_ngram_size`, and `no_repeat_ngram_size`.
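These options map directly onto standard Hugging Face generation parameters; the snippet below only illustrates what they control in transformers (the values are arbitrary):

```python
from transformers import GenerationConfig

# The repetition-control options exposed by the LLM predictors correspond to
# standard Hugging Face generation parameters.
gen_config = GenerationConfig(
    max_new_tokens=200,
    repetition_penalty=1.2,          # penalize tokens that were already generated
    no_repeat_ngram_size=3,          # forbid repeating any 3-gram within the output
    encoder_repetition_penalty=1.0,  # penalize tokens that appear in the input
    encoder_no_repeat_ngram_size=0,  # forbid copying n-grams from the input (0 = off)
)
print(gen_config)
```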
For LCMs, the options are found in `LCMGeneratorOptions` (for the Base LCM) or `DiffusionLCMGeneratorOptions` (for the Two-tower diffusion LCM). These options only specify how the output embeddings are generated. We also want to specify the Sonar decoder options, which dictate how the embeddings are decoded into text, using the parameters in `SonarDecoderConfig`.
To run a downstream task, specify the task name and configuration, as well as its parameters. We provide the example tasks that were used in the paper:
| Task name | Task configuration | Explanation |
|---|---|---|
| cnn_dailymail | cnn_dailymail_{form}llm.{split} | {form} can be empty for summarization or "inverse_" for summary expansion; {split} can be "test", "validation" or "train" |
| xsum | xsum_{form}llm.{split} | {form} can be empty for summarization or "inverse_" for summary expansion; {split} can be "test", "validation" or "train" |
| xlsum_llm | xlsum_llm.{lang}.{split} | {lang} refers to one value in the language list; {split} can be "test", "validation" or "train" |
The evaluation library provides a handy CLI via the `lcm.evaluation` entry point. Example command for evaluating the Meta Llama 3.1 8B Instruct model:
uv run torchrun --standalone --nnodes=1 --nproc-per-node=1 -m lcm.evaluation \
--predictor llama3 \
--model_name meta-llama/Llama-3.1-8B-Instruct \
--generator_batch_size 16 \
--tasks cnn_dailymail_llm.test \
--task_args '{"max_gen_len": 200}' \
--dataset_dir jsonl_dataset/cnn_dailymail \
--data_loading.batch_size 16 \
--dataset.source_text_column prompt \
--dataset.target_text_column answer \
--dump_dir output_results
In the example above, we load the model "meta-llama/Llama-3.1-8B-Instruct" as specified in HuggingFace, evaluate it on the CNN DailyMail data that we processed with the `prepare_evaluation_data.py` script in Step 1.1, and store the results in the folder specified via `dump_dir`. The argument `dataset_dir` refers to the value of the argument `output_dir` in Step 1.1.
In some cases, the model requires an authentication token for evaluation. You can obtain one from Hugging Face (see User Access Tokens), then add the parameter `--use_auth_token [YOUR TOKEN]` to the CLI command.
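Alternatively, you can authenticate once with the huggingface_hub library (or `huggingface-cli login`), so that gated models such as Llama 3.1 can be downloaded without passing the token on every run; a minimal sketch:

```python
from huggingface_hub import login

# Store the Hugging Face User Access Token locally (a one-off alternative
# to passing --use_auth_token on every CLI invocation).
login(token="hf_...")  # replace with your own token
```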
In the above example, we need to provide the `source_text_column` and `target_text_column` parameters because in Step 1 we injected the prompts directly into the dataset and renamed the columns accordingly (to differentiate the processed dataset from the "original" one). You can also skip this part and customize the prompt for each evaluation run. To do this, instead of specifying `prompt_prefix` and `prompt_suffix` when preparing the data (as shown in the example in Section 1.1), we specify `dataset.source_prefix_text` and `dataset.source_suffix_text` during the evaluation run:
uv run torchrun --standalone --nnodes=1 --nproc-per-node=1 -m lcm.evaluation \
--predictor llama3 \
--model_name meta-llama/Llama-3.1-8B-Instruct \
--generator_batch_size 16 \
--tasks cnn_dailymail_llm.test \
--task_args '{"max_gen_len": 200}' \
--dataset_dir jsonl_dataset/cnn_dailymail \
--data_loading.batch_size 16 \
--dataset.source_prefix_text "Summarize the following news to a concise list of highlights.\n[Text Start]:\n" \
--dataset.source_suffix_text "\n[Text End]" \
--dump_dir output_results
Note the missing parameters `source_text_column` and `target_text_column` and the new parameters `source_prefix_text` and `source_suffix_text`: in this case we do not modify the column schema, so the original text columns ("article", "highlights") are kept and do not need to be specified in the CLI.
It is also possible to provide the prompt from a YAML file. This is handy when you have to engineer the prompts carefully and the text is long and detailed. We provide one example prompt in the file instruction.yaml. The example command is:
uv run torchrun --standalone --nnodes=1 --nproc-per-node=1 -m lcm.evaluation \
--predictor llama3 \
--model_name meta-llama/Llama-3.1-8B-Instruct \
--generator_batch_size 16 \
--tasks cnn_dailymail_llm.test \
--task_args '{"max_gen_len": 200}' \
--dataset_dir jsonl_dataset/cnn_dailymail \
--data_loading.batch_size 16 \
--prompt_file instruction.yaml \
--dump_dir output_results
In contrast to LLMs, LCMs expect the dataset to be preprocessed into Parquet format, with inputs being (Sonar) sentence embeddings. To evaluate an LCM on a downstream task, point to the directory containing the Parquet files, as produced in Step 1, and run (example for the Two-tower diffusion LCM):
uv run torchrun --standalone --nnodes=1 --nproc-per-node=1 -m lcm.evaluation \
--predictor two_tower_diffusion_lcm \
--model_card path/to/the/model_card.yaml \
--generator_batch_size 16 \
--tasks lcm_generation \
--task_args '{"max_gen_len": 200}' \
--dataset.parquet_path parquet_dataset/cnn_dailymail \
--data_loading.batch_size 16 \
--dump_dir output_results
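Before launching the run, it can be useful to sanity-check the Parquet dataset produced in Step 1; a quick sketch using pyarrow (any Parquet reader works):

```python
import pyarrow.dataset as ds

# Verify that the Parquet dataset produced by the embed step is readable
# and that the expected text/embedding columns are present.
dataset = ds.dataset("parquet_dataset/cnn_dailymail", format="parquet")
print(dataset.schema)
print(dataset.count_rows())
```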
Similar to LLM evaluation, it is possible to specify the prompt prefix and suffix ad-hoc. This text will be sentence-split and embedded using the standard Sonar encoder.
| Argument | Description |
|---|---|
| `predictor` | The wrapper of the model to be evaluated. See Step 2 for more details. |
| `data_loading.max_samples` | Evaluate on at most k examples from the test data. Useful for debugging. |
| `data_loading.batch_size` | Load and evaluate data in batches. By default `batch_size=10`. |
| `dataset_dir` | The directory containing the JSONL files processed in Step 1. Only used in LLM evaluation. |
| `dataset.parquet_path` | The path containing the Parquet files processed in Step 1. Only used in LCM evaluation. |
| `dataset.source_column` | The column in the data that refers to the input embeddings. Not applicable when evaluating LLMs. |
| `dataset.source_text_column` | The column in the data that refers to the input text. Not applicable when evaluating LCMs. |
| `dataset.target_column` | The column in the data that refers to the ground-truth embeddings. Not applicable when evaluating LLMs. |
| `dataset.target_text_column` | The column in the data that refers to the ground-truth text. Not applicable when evaluating LCMs. |
| `dataset.source_prefix_text` | The text that will be prepended to each input text to form the prompt for the model. |
| `dataset.source_suffix_text` | The text that will be appended after each input text to form the prompt for the model. |
| `tasks` | Task configuration. See Step 3 for examples. |
| `task_args` | A JSON-formatted string that represents the task arguments. See the task parameter list below. |
| `dump_dir` | The directory containing the output of the eval run. If successful, it holds a file `metrics.eval.jsonl` with the metric results, a directory `results` that captures the verbose command line used together with the detailed output scores, and a directory `raw_results` with the model output for each individual sample, together with the per-sample metric results. |
| `launcher` | Whether the CLI should be run locally or on a SLURM cluster. Accepted values are `local`, `submitit` (SLURM) or `standalone` (debug mode). |
| `job_args` | Parameters used when launching eval on SLURM. See below for more details. |

Table: List of common arguments in the Evaluation CLI.
Note: In the above examples, free arguments such as `generator_batch_size`, `temperature`, etc. are generator options. They depend on the specific predictor, as explained in Step 2. Providing a wrong option will trigger an error in the CLI.
Outputs dumped in the directory specified by `dump_dir` will be structured as:
.
├── metadata.jsonl
├── metrics.eval.jsonl
├── raw_results
├── results
└── tb
where `metrics.eval.jsonl` contains the corpus-level scores.
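For a quick look at the corpus-level scores, the metrics file can be read as ordinary JSON lines (the metric names depend on the task):

```python
import json
from pathlib import Path

# Each line of metrics.eval.jsonl is a JSON object with corpus-level metrics
# for one evaluated task.
for line in (Path("output_results") / "metrics.eval.jsonl").read_text().splitlines():
    print(json.loads(line))
```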
In both LLM and LCM evaluation, we can configure how inputs and outputs are processed:

- `max_prompt_len`: The model context size, i.e. the maximum number of tokens (for LLMs) or sentences (for LCMs) that the model can accept.
- `max_gen_len`: The maximum number of tokens (for LLMs) or sentences (for LCMs) the model should generate. Note that some model generators have their own stopping criteria, so the actual generated text can be much shorter than this value.
- `min_gen_len`: The minimum number of tokens (for LLMs) or sentences (for LCMs) the model should generate.
- `max_gen_len_ratio`: The maximum number of tokens (for LLMs) or sentences (for LCMs) the model should generate relative to the input length. For example, if the source document is 5K long and `max_gen_len_ratio=0.2`, we are asking the model to generate a 1K-long output (again, due to the generators' inner behaviour, the output can be much shorter).
The above command is sufficient for most cases, where you load the model onto one GPU and evaluate the whole dataset locally, i.e. the datasets and everything else are loaded into memory. For bigger datasets, or for models that do not easily run on one GPU or are too slow to evaluate, we can submit the evaluation job to a SLURM cluster by choosing `launcher=submitit`:
slurm_partition=YOUR_SLURM_PARTITION
qos=YOUR_SLURM_QOS
shards=NUMBER_OF_SLURM_NODES
timeout_min=JOB_TIMEOUT_IN_MINUTES
python -m lcm.evaluation \
--predictor two_tower_diffusion_lcm \
--model_card path/to/the/model_card.yaml \
--generator_batch_size 16 \
--tasks lcm_generation \
--task_args '{"max_gen_len": 200}' \
--dataset.parquet_path parquet_dataset/cnn_dailymail \
--data_loading.batch_size 16 \
--dump_dir output_results \
--launcher submitit \
--job_args '{"launcher.cache": "null", "launcher.partition": "'${slurm_partition}'", "launcher.qos": "'${qos}'", "nshards": '${shards}', "requirements": {"gpus_per_node": 1, "timeout_min": '${timeout_min}'}}' \
The parameters in `job_args` are submitit parameters. Please refer to https://github.com/facebookincubator/submitit for more comprehensive documentation and a full parameter list.