Inconsistency with NDCG for XGBRanker #11235

cjsombric · 2025-02-10T22:30:30Z

I am observing an inconsistency in the NDCG even when I have an evaluation set that includes only one query.

I pass a single query with 20 rows of data to fit as an evaluation set (X_ex_small and y_ex_small below), which I will call the "ex_small" sample. For this sample, the NDCG@20 reported by XGBoost's fit + evals_result_ matches the value returned by score. However, I am not able to reproduce that NDCG by hand.

import xgboost as xgb
ranker = xgb.XGBRanker(
    tree_method="hist",
    device="cuda",
    lambdarank_pair_method="mean", # "topk",
    # lambdarank_num_pair_per_sample=10,
    eval_metric=["ndcg@1", "ndcg@5", "ndcg@20"],
)

ranker.fit(
    X_train,
    y_train,
    qid=qid_train,
    eval_set=[(X_val, y_val), (X_test, y_test), (X_ex_small, y_ex_small)],
    eval_qid=[qid_val, qid_test, qid_ex_small],
    verbose=False,
)

ranker.score(X_ex_small, y_ex_small) # returns 0.2587541199613619
ranker.evals_result_['validation_2']['ndcg@20'][-1] # returns 0.2587541199613619

For the "ex_small" sample, one of the 20 rows has a relevance of 4, two rows have a relevance of 1, and the rest have a relevance of 0. If I sort the rows by predicted score with the code below, I see that the row with relevance 4 is ranked 18th of 20, and the two rows with relevance 1 are ranked 5th and 14th.

# Visualizing the predicted relevance compared to actual relevance
y_ex_small_pred = ranker.predict(X_ex_small)
temp = y_ex_small.copy()
df_temp = temp.to_frame()
df_temp["pred"] = y_ex_small_pred
df_temp.sort_values(by="pred", ascending=False)

If I compute the NDCG at k=20 by hand, I do not get the same NDCG@20 as the XGBoost functions.

# Manually computing NDCG@20 for the "ex_small" sample -- the result does not match XGBoost's score or fit outputs
from math import log2
dcg20 = ( 1 / log2(1 + 5)) + ( 1 / log2(1 + 14)) + ( 4 / log2(1 + 18)) # predicted ranking
idcg20 = ( 4 / log2(1 + 1)) + ( 1 / log2(1 + 2)) + ( 1 / log2(1 + 3)) # optimal ranking
dcg20/idcg20 # returns: 0.3088029970412347

I read in the documentation that there might be issues because not all functions take the qid into account. However, the "ex_small" sample contains only one query id, so I expected to be able to replicate the NDCG by hand. Can you help me understand why this is occurring?
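
One candidate explanation, offered as a sketch rather than a confirmed diagnosis: the hand computation above uses the raw relevance as the gain, while XGBoost's NDCG may be using an exponential gain of 2^rel - 1 (this assumes the ndcg_exp_gain parameter is at its default of true, which is not set anywhere in the snippet above). The helper names dcg_at_k and ndcg_at_k and the reconstructed label list are hypothetical; only the rank positions (5th, 14th, 18th) come from the description above.

```python
from math import log2


def dcg_at_k(labels, k, exp_gain):
    """DCG over the first k labels, given in ranked order (rank 1 first)."""
    total = 0.0
    for rank, rel in enumerate(labels[:k], start=1):
        # Exponential gain is 2^rel - 1; linear gain is the raw relevance.
        gain = (2 ** rel - 1) if exp_gain else rel
        total += gain / log2(1 + rank)
    return total


def ndcg_at_k(labels_by_pred, k, exp_gain):
    """NDCG@k for a single query; labels are ordered by predicted score."""
    ideal = sorted(labels_by_pred, reverse=True)
    idcg = dcg_at_k(ideal, k, exp_gain)
    return dcg_at_k(labels_by_pred, k, exp_gain) / idcg if idcg > 0 else 0.0


# Reconstructed from the description: relevance 1 at ranks 5 and 14,
# relevance 4 at rank 18, relevance 0 everywhere else.
labels_by_pred = [0] * 20
labels_by_pred[4] = 1
labels_by_pred[13] = 1
labels_by_pred[17] = 4

ndcg_linear = ndcg_at_k(labels_by_pred, 20, exp_gain=False)
ndcg_exp = ndcg_at_k(labels_by_pred, 20, exp_gain=True)

print(ndcg_linear)  # approximately 0.3088, the hand-computed value
print(ndcg_exp)     # approximately 0.2588, close to the XGBoost value
```

If this is indeed the cause, passing ndcg_exp_gain=False to XGBRanker (assuming an XGBoost version that exposes that parameter) should bring the reported metric in line with the linear-gain hand calculation.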

cjsombric changed the title ~~Inconsistency with NDCG for ranker~~ Inconsistency with NDCG for XGBRanker Feb 10, 2025

trivialfis added the ? Triage label Feb 12, 2025
