GitHub - softwaredoug/local-llm-judge: Local LLM as a search relevance judge

Local LLM Search Relevance Judge

(Runs only on Apple Silicon, using MLX)

Using the WANDS dataset, use a local LLM (Qwen 2.5) to try to evaluate pairwise search relevance.
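As a rough illustration of what "pairwise relevance" means here, the sketch below turns two graded human labels for the same query into a preference between a left-hand and a right-hand product. The grade ordering, the function name, and the LHS / RHS / Neither vocabulary are assumptions made for illustration; they are not taken from this repo's code.

```python
# Assumed ordering of WANDS-style relevance grades (illustrative only).
GRADE = {"Irrelevant": 0, "Partial": 1, "Exact": 2}


def human_preference(label_lhs: str, label_rhs: str) -> str:
    """Return which product the human labels prefer for one query.

    Hypothetical helper: 'LHS' if the left product has the higher grade,
    'RHS' if the right one does, 'Neither' if the grades tie.
    """
    lhs, rhs = GRADE[label_lhs], GRADE[label_rhs]
    if lhs > rhs:
        return "LHS"
    if rhs > lhs:
        return "RHS"
    return "Neither"


if __name__ == "__main__":
    print(human_preference("Exact", "Partial"))       # LHS
    print(human_preference("Irrelevant", "Partial"))  # RHS
    print(human_preference("Partial", "Partial"))     # Neither
```

An LLM judge is then scored by how often its answer for a pair matches the preference derived from the human labels.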

The LLM strategies here attempt to recover the pairwise relevance preference of the WANDS human labelers. Blog post series:

To run:

$ poetry install

Download WANDS into data folder

Get Qwen from Hugging Face and convert it to MLX format

$ mkdir -p ~/.mlx
$ poetry run mlx_lm.convert --hf-path Qwen/Qwen2.5-7B-Instruct --mlx-path ~/.mlx/Qwen2.5-7B-Instruct/ -q

Run local judge

$ poetry run python -m local_llm_judge.main --verbose --eval-fn name

Optionally - Talk to Qwen

$ poetry run python -m local_llm_judge.shell

Double check or not

You can double-check any variant by passing --check-both-ways

$ poetry run python -m local_llm_judge.main --verbose --eval-fn name --check-both-ways
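One plausible reading of double checking, sketched below, is to ask the judge about the same pair twice with the two products swapped and keep the answer only when both orderings agree. This is an assumption about what --check-both-ways does, not a copy of the repo's implementation; the `Judge` type, `judge_both_ways`, and the two toy judges are hypothetical stand-ins for an LLM call.

```python
from typing import Callable

# (query, lhs_text, rhs_text) -> "LHS" | "RHS" | "Neither"
Judge = Callable[[str, str, str], str]


def judge_both_ways(judge: Judge, query: str, lhs: str, rhs: str) -> str:
    """Ask in both orders; keep the answer only if the two runs agree."""
    first = judge(query, lhs, rhs)
    second = judge(query, rhs, lhs)
    # Map the swapped-order answer back into the original orientation.
    second_flipped = {"LHS": "RHS", "RHS": "LHS"}.get(second, "Neither")
    return first if first == second_flipped else "Neither"


def always_lhs(query: str, lhs: str, rhs: str) -> str:
    """Toy judge with pure position bias: always picks the left product."""
    return "LHS"


def overlap_judge(query: str, lhs: str, rhs: str) -> str:
    """Toy judge: prefers the product sharing more words with the query."""
    terms = set(query.lower().split())
    left = len(terms & set(lhs.lower().split()))
    right = len(terms & set(rhs.lower().split()))
    if left > right:
        return "LHS"
    if right > left:
        return "RHS"
    return "Neither"


if __name__ == "__main__":
    print(judge_both_ways(always_lhs, "red sofa", "red sofa bed", "blue chair"))
    print(judge_both_ways(overlap_judge, "red sofa", "red sofa bed", "blue chair"))
```

Under this reading, a judge that only ever answers by position is filtered down to Neither, while an order-independent judge keeps its answer.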

Letting agent choose neither / say it doesn't know

The variants look at different fields, and each has a version of the prompt that allows the agent to chicken out and say Neither if it doesn't know. This improves precision at the cost of coverage/recall.

$ poetry run python -m local_llm_judge.main --verbose --eval-fn name_allow_neither --check-both-ways
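To make the precision versus coverage tradeoff concrete, here is a minimal sketch that scores a list of judge answers against human preferences, counting Neither as an abstention. The function name and the toy data are made up for illustration and are not results from this repo.

```python
def precision_and_coverage(predictions, truths):
    """Score a judge that may abstain by answering 'Neither'.

    precision: fraction of non-abstaining answers that match the truth.
    coverage:  fraction of all pairs where the judge gave an answer.
    """
    decided = [(p, t) for p, t in zip(predictions, truths) if p != "Neither"]
    coverage = len(decided) / len(predictions) if predictions else 0.0
    correct = sum(1 for p, t in decided if p == t)
    precision = correct / len(decided) if decided else 0.0
    return precision, coverage


if __name__ == "__main__":
    truths = ["LHS", "RHS", "LHS", "RHS"]
    forced = ["LHS", "LHS", "LHS", "RHS"]      # must always pick a side
    hedged = ["LHS", "Neither", "LHS", "RHS"]  # abstains on the hard pair
    print(precision_and_coverage(forced, truths))  # (0.75, 1.0)
    print(precision_and_coverage(hedged, truths))  # (1.0, 0.75)
```

In this toy case the abstaining judge gives up one pair of coverage and gains precision on the pairs it does answer.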

Results output

Result dataframes are written to the data/ directory:

data/both_ways_[eval_fn].pkl or just data/[eval_fn].pkl

Training an ensemble

Run ./collect.sh N (e.g. ./collect.sh 7000) to run a large set of variants with different setting permutations (allowing neither, double checking or not).

Then the train script tries to train a predictor using the outputs of all the different agent permutations:

$ poetry run python -m  local_llm_judge.train --feature_names data/both_ways_category.pkl data/both_ways_name.pkl  data/both_ways_desc.pkl data/both_ways_classs.pkl data/both_ways_category_allow_neither.pkl data/both_ways_name_allow_neither.pkl data/both_ways_desc_allow_neither.pkl data/both_ways_class_allow_neither.pkl

Then you can see the precision / recall tradeoffs of a decision tree trained to predict the first 1000 labels. Each line lists the feature set used, followed by precision and recall:

['both_ways_desc_allow_neither', 'both_ways_class_allow_neither'] 1.0 0.013
['both_ways_name', 'both_ways_class_allow_neither'] 0.9861111111111112 0.072
['both_ways_category', 'both_ways_name', 'both_ways_classs', 'both_ways_name_allow_neither', 'both_ways_class_allow_neither'] 0.9673366834170855 0.398
['both_ways_category', 'both_ways_name', 'both_ways_classs', 'both_ways_class_allow_neither'] 0.9668508287292817 0.362
['both_ways_name', 'both_ways_desc_allow_neither', 'both_ways_class_allow_neither'] 0.9666666666666667 0.09
['both_ways_desc', 'both_ways_desc_allow_neither', 'both_ways_class_allow_neither'] 0.9666666666666667 0.06
['both_ways_desc', 'both_ways_class_allow_neither'] 0.9666666666666667 0.06
['both_ways_category', 'both_ways_name', 'both_ways_classs'] 0.9665738161559888 0.359
['both_ways_category', 'both_ways_name', 'both_ways_desc', 'both_ways_classs', 'both_ways_category_allow_neither'] 0.9659367396593674 0.411
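The shape of that output (a feature subset followed by precision and recall) can be reproduced with a much simpler stand-in. The sketch below does not train a decision tree; it swaps in a unanimous-vote rule over every subset of judges, and treats recall as the fraction of pairs that receive a prediction. The feature names, data, and helper functions are hypothetical.

```python
from itertools import combinations


def unanimous(votes):
    """Return the shared answer if every judge agrees, else 'Neither'."""
    first = votes[0]
    if first != "Neither" and all(v == first for v in votes):
        return first
    return "Neither"


def score_subsets(features, truths):
    """Score every non-empty subset of judges, best precision first.

    features: dict of judge name -> list of answers, one per pair.
    Returns a list of (subset_names, precision, recall) rows, where
    recall here means the share of pairs that got a prediction.
    """
    rows = []
    names = sorted(features)
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            preds = [unanimous([features[n][i] for n in subset])
                     for i in range(len(truths))]
            decided = [(p, t) for p, t in zip(preds, truths) if p != "Neither"]
            if not decided:
                continue  # this subset never commits to an answer
            precision = sum(p == t for p, t in decided) / len(decided)
            recall = len(decided) / len(truths)
            rows.append((list(subset), precision, recall))
    return sorted(rows, key=lambda r: (-r[1], -r[2]))


if __name__ == "__main__":
    truths = ["LHS", "RHS", "LHS", "RHS"]
    features = {
        "name": ["LHS", "RHS", "RHS", "RHS"],
        "desc": ["LHS", "Neither", "LHS", "RHS"],
    }
    for subset, precision, recall in score_subsets(features, truths):
        print(subset, precision, recall)
```

As in the real output, stricter combinations tend to sit near the top with high precision and low recall, while looser ones trade precision for coverage.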

List of variants

There are a lot of prompts / variants listed in this file. The function names are the values you can pass to the --eval-fn argument on the command line.
