This repository contains the evaluation scripts and results described in the manuscript "Evaluating the Ability of LLMs to Solve Semantics-Aware Process Mining Tasks", submitted to ICPM 2024.
The corpus and datasets can be downloaded from here: datasets
- T-SAD: Given a set of activities that constitute an organizational process and a sequence of activities, determine whether the sequence is a valid execution of the process. The activities in the sequence must be performed in the correct order for the execution to be valid. Provide either True or False as the answer and nothing else.
- A-SAD: You are given a set of activities that constitute an organizational process and two activities performed in a single process execution. Determine whether it is valid for the first activity to occur before the second. Provide either True or False as the answer and nothing else.
- S-NAP: You are given a list of activities that constitute an organizational process and a sequence of activities that have been performed in the given order. Which activity from the list should be performed next in the sequence? The answer should be one activity from the list and nothing else.
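As a rough illustration of how these instructions could be combined with a concrete process, the sketch below assembles a prompt for each task. The instruction texts are copied from the list above; the `build_prompt` helper and the `Activities:` / `Sequence:` / `Answer:` layout are assumptions made for this example and may differ from the format used by the evaluation scripts.

```python
# Instruction texts copied verbatim from the task descriptions above.
INSTRUCTIONS = {
    "T-SAD": (
        "Given a set of activities that constitute an organizational process "
        "and a sequence of activities, determine whether the sequence is a "
        "valid execution of the process. The activities in the sequence must "
        "be performed in the correct order for the execution to be valid. "
        "Provide either True or False as the answer and nothing else."
    ),
    "A-SAD": (
        "You are given a set of activities that constitute an organizational "
        "process and two activities performed in a single process execution. "
        "Determine whether it is valid for the first activity to occur before "
        "the second. Provide either True or False as the answer and nothing else."
    ),
    "S-NAP": (
        "You are given a list of activities that constitute an organizational "
        "process and a sequence of activities that have been performed in the "
        "given order. Which activity from the list should be performed next in "
        "the sequence? The answer should be one activity from the list and "
        "nothing else."
    ),
}


def build_prompt(task, activities, sequence):
    """Assemble a prompt string (hypothetical layout, not the repository's)."""
    if task not in INSTRUCTIONS:
        raise ValueError(f"unknown task: {task}")
    if task == "A-SAD" and len(sequence) != 2:
        raise ValueError("A-SAD expects exactly two activities")
    return "\n".join([
        INSTRUCTIONS[task],
        "Activities: " + ", ".join(activities),
        "Sequence: " + ", ".join(sequence),
        "Answer:",
    ])


if __name__ == "__main__":
    print(build_prompt(
        "S-NAP",
        ["receive order", "check stock", "ship order"],
        ["receive order", "check stock"],
    ))
```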
The results of the experiments can be found in the 'eval/' folder, where they are stored in .csv format.
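To get a quick overview of the stored results, a small script along the following lines may help. It is only a sketch: the `summarize_results` helper is hypothetical, and it assumes that every file in 'eval/' ends in `.csv` and has a header in its first row. No particular column names are assumed.

```python
import csv
from pathlib import Path


def summarize_results(eval_dir="eval"):
    """Return {file name: (column names, number of data rows)} per CSV file."""
    summary = {}
    for path in sorted(Path(eval_dir).glob("*.csv")):
        with path.open(newline="", encoding="utf-8") as handle:
            rows = list(csv.reader(handle))
        # Assumption: the first row of each file is a header row.
        header = rows[0] if rows else []
        summary[path.name] = (header, max(len(rows) - 1, 0))
    return summary


if __name__ == "__main__":
    for name, (columns, n_rows) in summarize_results().items():
        print(f"{name}: {n_rows} rows, columns = {columns}")
```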
An NVIDIA GPU with at least 42 GB of memory is required to run the experiments (e.g., an RTX A6000).
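Before starting a long run, it can be worth checking whether the available GPU meets this requirement. The sketch below assumes that PyTorch is installed (this README does not say so explicitly) and interprets "42 GB" as 42 GiB; both `has_enough_memory` and `check_gpu` are hypothetical helpers, not part of the repository.

```python
REQUIRED_GB = 42  # memory requirement stated above


def has_enough_memory(total_bytes, required_gb=REQUIRED_GB):
    """Compare a device's total memory in bytes against the requirement."""
    # Assumption: "GB" is read as GiB (1024**3 bytes).
    return total_bytes >= required_gb * 1024**3


def check_gpu(device_index=0):
    """Return True/False for the given CUDA device, or None if it cannot be checked."""
    try:
        import torch  # assumption: PyTorch is among the installed dependencies
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    total = torch.cuda.get_device_properties(device_index).total_memory
    return has_enough_memory(total)


if __name__ == "__main__":
    result = check_gpu()
    if result is None:
        print("Could not query a CUDA device.")
    else:
        print("GPU memory sufficient:", result)
```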
First, set up a virtual environment (conda works well for this) and install the dependencies:
pip install -r requirements.txt
- Create a folder called 'data/' one level above the project root folder.
- Place the downloaded dataset into the 'data/' folder.
- Place the 'train_val_test.pkl' file in the 'data/' folder.
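The expected folder layout can be verified with a short script such as the one below. This is a minimal sketch: `check_data_layout` is a hypothetical helper, and it only checks for the 'data/' folder and the 'train_val_test.pkl' file, since the names of the downloaded dataset files are not listed here.

```python
from pathlib import Path


def check_data_layout(project_root="."):
    """Return a list of problems with the data setup (empty if none found)."""
    # 'data/' is expected one level above the project root folder.
    data_dir = Path(project_root).resolve().parent / "data"
    split_file = data_dir / "train_val_test.pkl"
    problems = []
    if not data_dir.is_dir():
        problems.append(f"missing folder: {data_dir}")
    elif not split_file.is_file():
        problems.append(f"missing file: {split_file}")
    return problems


if __name__ == "__main__":
    issues = check_data_layout(".")
    print("Data layout looks fine." if not issues else "\n".join(issues))
```

Run it from the project root folder so that `.` resolves to the repository itself.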
To run the ICL experiments, execute the following command in the project root folder. Set the parameters as you see fit; the configuration used in the paper corresponds to the examples listed in the parameter descriptions below:
python evaluate_llm.py --task <task> --device <device> --hf_model <hf_model> --rand_shots <rand_shots> --runs <runs> --num_samples <num_samples>
--task: one of "out_of_order", "trace_anomaly", "next_activity"
--device: e.g., "cuda:0" for the first GPU on your machine
--hf_model: the Hugging Face name of the model, e.g., "meta-llama/Meta-Llama-3-8B-Instruct" or "mistralai/Mistral-7B-Instruct-v0.2"
--rand_shots: a list with the numbers of shots to include, e.g., "[3,5]" for 6 and 10 shots (twice the given number is used; for binary tasks, one positive and one negative example are taken from the same process)
--runs: the number of runs to execute
--num_samples: the number of samples to draw from the test set, e.g., 20000
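Putting the parameters together, a complete invocation could be assembled as sketched below. The example values for `--device`, `--hf_model`, `--rand_shots`, and `--num_samples` are taken from the descriptions above; the choice of task and `--runs 1` are arbitrary values picked for illustration, and passing the shot list as the literal string `[3,5]` is an assumption about how the script parses it. The `effective_shots` helper simply restates the doubling rule described for `--rand_shots`.

```python
import shlex


def effective_shots(rand_shots):
    """Number of examples actually included per entry (twice the given value)."""
    return [2 * k for k in rand_shots]


def build_command(task, device, hf_model, rand_shots, runs, num_samples):
    """Assemble the evaluate_llm.py call as an argument list."""
    # Assumption: the script expects the shot list as a string such as "[3,5]".
    shots_arg = "[" + ",".join(str(k) for k in rand_shots) + "]"
    return [
        "python", "evaluate_llm.py",
        "--task", task,
        "--device", device,
        "--hf_model", hf_model,
        "--rand_shots", shots_arg,
        "--runs", str(runs),
        "--num_samples", str(num_samples),
    ]


cmd = build_command(
    task="next_activity",
    device="cuda:0",
    hf_model="meta-llama/Meta-Llama-3-8B-Instruct",
    rand_shots=[3, 5],
    runs=1,  # arbitrary value for illustration
    num_samples=20000,
)

if __name__ == "__main__":
    print("Shots per prompt:", effective_shots([3, 5]))
    # shlex.join quotes the bracketed list so it can be pasted into a shell.
    print(shlex.join(cmd))
```

The argument list in `cmd` can also be passed directly to `subprocess.run` from the project root folder, which avoids shell quoting issues with the bracketed shot list.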
The fine-tuning experiments are run using the Trident framework. See this repository for our fine-tuning evaluation scripts and how to use them: https://github.com/fdschmidt93/trident-bpm
- The sub-folder 'bpm' contains the necessary preprocessing and evaluation code.
- The individual tasks can be run using bash scripts:
  - 'pair.sh' for A-SAD using LLMs
  - 'trace.sh' for T-SAD using LLMs
  - 'activity.sh' for S-NAP using LLMs
  - 'trace_activity.sh' for multi-task T-SAD and S-NAP using LLMs
  - 'pair_roberta.sh' for A-SAD using RoBERTa
  - 'trace_roberta.sh' for T-SAD using RoBERTa
  - 'activity_roberta.sh' for S-NAP using RoBERTa
  - 'trace_activity_roberta.sh' for multi-task T-SAD and S-NAP using RoBERTa