Retraining a Booster on a new dataset that enlarges the initial training data with new features breaks the lgb.train API. It appears to do a hard check that the number of init_model features matches the number of features in the new data. If feature names are available, checking that the new dataset's features contain the original features should be enough.
Motivation
Fundamentally, decision trees and boosted ensembles of them are not limited to training on a fixed feature set (hence feature bagging), so it seems beneficial to allow fine-tuning models on data with more features when these become available. I have encountered such a scenario in a professional application of boosted trees: we have a foundation model that is agnostic to many integration/client specifics (by pruning the most specific features), it is trained on a large amount of data, and we seek to fine-tune it (by adding more trees) on the specific (= drifting) features. I am guessing that this scenario is pretty common. I have found a monkey patch around it: exporting the initial model to txt, updating the feature names and feature count by hand, then loading it again. It's very iffy.
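For reference, the text-file workaround described above can be sketched as plain string surgery on the model dump. This is a hypothetical helper, not LightGBM API: it assumes the dump contains `max_feature_idx=` and `feature_names=` header lines, as LightGBM's text model format does, and a real dump would likely also need its `feature_infos=` line extended for the new features.

```python
def patch_model_features(model_str: str, new_feature_names: list) -> str:
    """Rewrite feature-count metadata in a LightGBM model txt dump.

    Sketch only: assumes the dump has `max_feature_idx=` and
    `feature_names=` header lines; everything else is passed through
    unchanged. The `feature_infos=` line (per-feature value ranges)
    would need the same treatment in a real dump.
    """
    out = []
    for line in model_str.splitlines():
        if line.startswith("max_feature_idx="):
            # highest feature index is count - 1
            out.append(f"max_feature_idx={len(new_feature_names) - 1}")
        elif line.startswith("feature_names="):
            # feature names are space-separated in the txt format
            out.append("feature_names=" + " ".join(new_feature_names))
        else:
            out.append(line)
    return "\n".join(out)
```

The patched string could then be loaded back with `lgb.Booster(model_str=...)` and passed as `init_model` — again, iffy, which is the point of the issue.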
Description
Code example:
```python
import sklearn.datasets
import lightgbm as lgb  # version = 4.6.0

# foundation dataset has 2 features
X_pretrain, y_pretrain = sklearn.datasets.make_classification(
    n_samples=1_000, n_features=2, n_informative=2,
    n_redundant=0, n_classes=2, random_state=2025,
)
foundation_booster = lgb.train(
    params={},
    train_set=lgb.Dataset(X_pretrain, label=y_pretrain),
)

# new data has 3 features (one on top of the foundation data)
X_finetune, y_finetune = sklearn.datasets.make_classification(
    n_samples=100, n_features=3, n_informative=3,
    n_redundant=0, n_classes=2, random_state=2025,
)
finetuned_model = lgb.train(
    params={},
    train_set=lgb.Dataset(X_finetune, label=y_finetune),
    init_model=foundation_booster,
)

# raises LightGBMError: The number of features in data (3) is not the same
# as it was in training data (2).
```
My bad. I missed that the error message advises using "predict_disable_shape_check": True to bypass the error. I seem to have come up with a working example:
```python
import sklearn.datasets
import sklearn.metrics
import lightgbm as lgb
import pandas as pd

Xtrain, ytrain = sklearn.datasets.make_classification(
    n_samples=150_000, n_features=3, n_informative=3,
    n_redundant=0, n_classes=2, random_state=2025,
)
Xtrain = pd.DataFrame(Xtrain, columns=['f0', 'f1', 'f2'])
X_pretrain = Xtrain.iloc[:50_000, :2]
y_pretrain = ytrain[:50_000]
X_finetune = Xtrain.iloc[50_000:100_000, :]
y_finetune = ytrain[50_000:100_000]
X_heldout = Xtrain.iloc[100_000:, :]
y_heldout = ytrain[100_000:]

foundation_booster = lgb.train(
    params={'loss': 'binary', 'force_col_wise': True,
            'n_estimators': 100, 'verbosity': 0},
    train_set=lgb.Dataset(X_pretrain, y_pretrain),
)
print("Pretraining done")
print(
    "Foundation model logloss on held out:",
    sklearn.metrics.log_loss(
        y_finetune,
        foundation_booster.predict(X_finetune.iloc[:, :2]),
    ),
)

finetuned_model = lgb.train(
    params={'loss': 'binary', 'force_col_wise': True,
            'n_estimators': 200, 'predict_disable_shape_check': True},
    train_set=lgb.Dataset(X_finetune, label=y_finetune),
    init_model=foundation_booster,
)
print(
    "Finetuned model logloss on held out:",
    sklearn.metrics.log_loss(y_heldout, finetuned_model.predict(X_heldout)),
)
```
which prints:

```
Pretraining done
Foundation model logloss on held out: 0.39466937889319265
Finetuned model logloss on held out: 0.21498295160704628
```
@jameslamb Do you think we could improve the documentation a bit for this use case? Especially by explicitly saying that the first n features of the fine-tuning data need to match the pretraining features?
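That constraint can be checked programmatically. Here is a minimal sketch (a hypothetical helper, not part of LightGBM): since LightGBM matches features by position, the fine-tuning columns must start with the pretraining columns in the same order for the `predict_disable_shape_check` workaround to be meaningful.

```python
def features_are_prefix(pretrain_cols, finetune_cols) -> bool:
    """Return True if the fine-tuning feature list starts with the
    pretraining feature list, in the same order.

    LightGBM matches features positionally, so only extra columns
    appended *after* the original ones are safe to add when
    fine-tuning with predict_disable_shape_check=True.
    """
    pretrain_cols = list(pretrain_cols)
    return list(finetune_cols)[:len(pretrain_cols)] == pretrain_cols
```

For the example above, `features_are_prefix(['f0', 'f1'], ['f0', 'f1', 'f2'])` is True, while a reordered `['f1', 'f0', 'f2']` would fail the check.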
jameslamb changed the title from "lgb.train init_model does not allow a foundation model that has less features than the training data (fine tuning setting)" to "[python-package] lgb.train() init_model does not allow a model that has fewer features than the training data" on Feb 18, 2025.