This folder contains code and resources to run experiments and evaluations.
To better organize the evaluation folder, we should follow the rules below:
- Each subfolder contains a specific benchmark or experiment. For example,
evaluation/swe_bench
should contain all the preprocessing/evaluation/analysis scripts. - Raw data and experimental records should not be stored within this repo.
- For model outputs, they should be stored at this huggingface space for visualization.
- Important data files of manageable size and analysis scripts (e.g., jupyter notebooks) can be directly uploaded to this repo.
- SWE-Bench:
evaluation/swe_bench
Check this huggingface space for visualization of existing experimental results.
You can start your own fork of our huggingface evaluation outputs and submit a PR of your evaluation results to our hosted huggingface repo via PR following the guide here.