Inconsistency with NDCG for XGBRanker #11235

Open
cjsombric opened this issue Feb 10, 2025 · 0 comments

cjsombric commented Feb 10, 2025

I am observing an inconsistency in the NDCG even when I have an evaluation set that only includes one query.

I pass a single query (one query ID, 20 rows of data) to fit as an evaluation set (X_ex_small and y_ex_small below), which I will call the "ex_small" sample. For this sample, the NDCG@20 reported by fit (through evals_result_) matches the value returned by score. However, I am not able to reproduce that NDCG for the "ex_small" sample manually.

import xgboost as xgb
ranker = xgb.XGBRanker(
    tree_method="hist",
    device="cuda",
    lambdarank_pair_method="mean",  # "topk",
    # lambdarank_num_pair_per_sample=10,
    eval_metric=["ndcg@1", "ndcg@5", "ndcg@20"],
)

ranker.fit(
    X_train,
    y_train,
    qid=qid_train,
    eval_set=[(X_val, y_val), (X_test, y_test), (X_ex_small, y_ex_small)],
    eval_qid=[qid_val, qid_test, qid_ex_small],
    verbose=False,
)

ranker.score(X_ex_small, y_ex_small) # returns 0.2587541199613619
ranker.evals_result_['validation_2']['ndcg@20'][-1] # returns 0.2587541199613619

For the "ex_small" sample, one of the 20 rows has a relevance of 4, two rows have a relevance of 1, and the rest have a relevance of 0. If I sort the rows by predicted score with the code below, I see that the row with a relevance of 4 is ranked 18th of 20 and the two rows with a relevance of 1 are ranked 5th and 14th.

# Visualizing the predicted relevance compared to actual relevance
y_ex_small_pred = ranker.predict(X_ex_small)
temp = y_ex_small.copy()
df_temp = temp.to_frame()
df_temp["pred"] = y_ex_small_pred
df_temp.sort_values(by="pred", ascending=False)
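
The positions quoted above (5th, 14th, 18th) can also be extracted programmatically instead of being read off the sorted frame. The following is a minimal sketch in plain Python; the helper name `relevant_positions` and the toy labels and scores are made up for illustration and are not the real "ex_small" data. Ties are broken by original row order here, which is an assumption and may differ from how XGBoost orders tied predictions.

```python
def relevant_positions(labels, scores):
    """Return (1-based rank, label) for each row with label > 0,
    where rank 1 is the row with the highest predicted score."""
    # Row indices ordered from highest to lowest predicted score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [
        (rank, labels[i])
        for rank, i in enumerate(order, start=1)
        if labels[i] > 0
    ]


# Hypothetical toy data: five rows from a single query.
toy_labels = [0, 4, 1, 0, 1]
toy_scores = [0.9, 0.1, 0.5, 0.3, 0.7]

print(relevant_positions(toy_labels, toy_scores))  # [(2, 1), (3, 1), (5, 4)]
```

With the real data, `labels` would be the values of y_ex_small and `scores` the output of ranker.predict(X_ex_small).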

If I try to compute NDCG@20 by hand, I do not get the same value as the XGBoost functions.

# Manually computing ndcg@20 for the "ex_small" sample -- results do not match xgboost score or fit outputs
from math import log2
dcg20 = ( 1 / log2(1 + 5)) + ( 1 / log2(1 + 14)) + ( 4 / log2(1 + 18)) # predicted ranking
idcg20 = ( 4 / log2(1 + 1)) + ( 1 / log2(1 + 2)) + ( 1 / log2(1 + 3)) # optimal ranking
dcg20/idcg20 # returns: 0.3088029970412347

I read in the documentation that there might be issues because not all functions take the qid into account. However, the "ex_small" sample contains only one query ID, so I expected to be able to replicate the NDCG by hand. Can you help me understand why this is occurring?
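
One possible explanation, offered as a sketch rather than a confirmed diagnosis: XGBoost's NDCG uses an exponential gain, 2^rel - 1, by default (the `ndcg_exp_gain` parameter), whereas the hand computation above uses the raw relevance as the gain. The snippet below recomputes NDCG@20 from the positions described in this post under both gain functions. The helper `ndcg` is a hypothetical name written for this illustration and is not part of XGBoost.

```python
from math import log2

# (1-based predicted position, relevance) for the rows with non-zero
# relevance, as described in the post. Rows with relevance 0 contribute
# nothing under either gain function, so they are omitted.
predicted = [(5, 1), (14, 1), (18, 4)]


def ndcg(pos_rel, gain):
    """NDCG for one query, given (position, relevance) pairs and a gain function."""
    dcg = sum(gain(rel) / log2(1 + pos) for pos, rel in pos_rel)
    # Ideal ranking: the same relevances placed at positions 1, 2, 3, ...
    ideal = sorted((rel for _, rel in pos_rel), reverse=True)
    idcg = sum(gain(rel) / log2(1 + pos) for pos, rel in enumerate(ideal, start=1))
    return dcg / idcg


linear = ndcg(predicted, lambda rel: rel)                # gain = rel
exponential = ndcg(predicted, lambda rel: 2 ** rel - 1)  # gain = 2^rel - 1

print(f"linear gain:      {linear:.6f}")       # 0.308803, the hand-computed value
print(f"exponential gain: {exponential:.6f}")  # 0.258754, close to the XGBoost value
```

If this is indeed the cause, passing ndcg_exp_gain=False to XGBRanker should make the reported metric use the linear gain. Note that this would also change the gain used by the ranking objective during training, so the fitted model, and therefore the ranking itself, may differ.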

@cjsombric cjsombric changed the title Inconsistency with NDCG for ranker Inconsistency with NDCG for XGBRanker Feb 10, 2025