[c++] forcedsplits_filename pointing at a non-existent file is silently ignored #6830

Open
jameslamb opened this issue Feb 15, 2025 · 1 comment · May be fixed by #6832

Comments

@jameslamb
Collaborator

Description

If you pass a non-existent file via parameter forcedsplits_filename, lightgbm appears to silently ignore it.

It should raise an informative error if reading that file fails, or at least log a warning.

Reproducible Example

Using lightgbm==4.6.0 installed from PyPI.

import json
import lightgbm as lgb
import numpy as np
from sklearn.datasets import make_regression

X, y = make_regression(
    n_samples=10_000,
    n_features=5,
    n_informative=5,
    random_state=42
)

# add a noise feature
noise_feature = np.random.random(size=(X.shape[0], 1))
X = np.concatenate((X, noise_feature), axis=1)

# force the use of that noise feature in every tree
forced_split = {
    "feature": 5,
    "threshold": np.mean(noise_feature),
}
with open("forced_splits.json", "w") as f:
    f.write(json.dumps(forced_split))

# train a model, forcing it to use that split
model = lgb.LGBMRegressor(
    random_state=708,
    n_estimators=10,
    verbose=1,
    forcedsplits_filename="forced_splits.json",
)
model.fit(X, y)

# noise feature was used exactly once in every tree
# (because we forced LightGBM to use it)
model.feature_importances_
# array([  0, 109, 132,   0,  49,  10], dtype=int32)

# passing a non-existent file... no warning, no error
model2 = lgb.LGBMRegressor(
    random_state=708,
    n_estimators=10,
    verbose=1,
    forcedsplits_filename="does-not-exist.json",
)
model2.fit(X, y)

Logs from that second .fit():

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000568 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1530
[LightGBM] [Info] Number of data points in the train set: 10000, number of used features: 6
[LightGBM] [Info] Start training from score -0.889445
LGBMRegressor(forcedsplits_filename='does-not-exist.json', n_estimators=10,
              random_state=708, verbose=1)
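
Since the logs above give no hint that the file was skipped, one way to check whether a forced split actually took effect is to look at the root split of every tree. The following is a minimal sketch, not part of LightGBM: the helper names `root_split_features` and `forced_root_applied` are hypothetical, and it assumes the dict returned by `Booster.dump_model()` has a `tree_info` list whose entries carry a `tree_structure` dict with a `split_feature` key on split nodes.

```python
def root_split_features(dumped_model):
    """Return the feature index used at the root of each tree.

    `dumped_model` is assumed to be the dict returned by
    `Booster.dump_model()`. A tree that is a single leaf has no
    `split_feature` key at its root, so it yields None.
    """
    roots = []
    for tree in dumped_model["tree_info"]:
        roots.append(tree["tree_structure"].get("split_feature"))
    return roots


def forced_root_applied(dumped_model, feature_index):
    """Return True only if every tree splits on `feature_index` at its root."""
    roots = root_split_features(dumped_model)
    return bool(roots) and all(f == feature_index for f in roots)
```

With the models from the example, this could be called as `forced_root_applied(model.booster_.dump_model(), 5)` and again with `model2`, to compare the run that used a real file against the run that pointed at a missing one.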

Notes

Noticed this while working on https://stackoverflow.com/a/79435055/3986677.

I strongly suspect it is not specific to the Python package, and that changes need to be made in the C++ code.
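
Until the library itself reports the problem, a caller-side guard can catch a bad path before training. This is only a sketch of a workaround, not the proposed fix: `checked_forced_splits_path` is a hypothetical helper, and it checks only that the file exists and parses as JSON, not that its contents describe valid splits.

```python
import json
from pathlib import Path


def checked_forced_splits_path(filename):
    """Return `filename` as a string if it names a readable JSON file.

    Raises FileNotFoundError if the file does not exist and ValueError
    if it exists but cannot be parsed as JSON.
    """
    path = Path(filename)
    if not path.is_file():
        raise FileNotFoundError(f"forced splits file not found: {path}")
    with path.open() as f:
        try:
            json.load(f)
        except json.JSONDecodeError as err:
            raise ValueError(f"forced splits file is not valid JSON: {path}") from err
    return str(path)
```

It could then wrap the parameter value, for example `forcedsplits_filename=checked_forced_splits_path("does-not-exist.json")`, so that the missing file fails loudly in Python before `.fit()` is reached.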

@KYash03

KYash03 commented Feb 16, 2025

PR: #6832
