[python-package] lgb.train() init_model does not allow a model that has fewer features than the training data #6831

vnherdeiro opened this issue Feb 16, 2025 · 2 comments

@vnherdeiro
Contributor

Summary

Retraining a Booster on a new dataset that enlarges the initial training data with additional features breaks the lgb.train API. It seems to do a hard check that the number of features in init_model matches the number of features in the new data. If feature names are available, checking that the new dataset's features contain the original features should be enough.

Motivation

Fundamentally, decision trees and their boosted ensembles are not tied to a fixed feature set (hence feature bagging), so it seems beneficial to allow fine-tuning a model on data with more features once these become available. I have encountered such a scenario in a professional application of boosted trees: we have a foundation model that is agnostic to many integration/client specifics (obtained by pruning the most specific features) and trained on a large amount of data, and we seek to fine-tune it (by adding more trees) on the specific (drifting) features. I am guessing that this scenario is pretty common. I have found a monkey patch around it: exporting the initial model to txt, updating the feature names and feature count by hand, then loading it again. It's very iffy; a rough sketch is below.
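For reference, the monkey patch looks roughly like this. It is only a sketch, not a supported workflow: it assumes a trained two-feature Booster called foundation_booster (as in the example below) and that the model text header contains max_feature_idx, feature_names and feature_infos lines, whose exact layout may differ across LightGBM versions.

import lightgbm as lgb

# Export the foundation model to its text representation, widen the feature
# metadata by hand, then reload it so it "knows" about the extra feature.
model_str = foundation_booster.model_to_string()
patched_lines = []
for line in model_str.splitlines():
    if line.startswith("max_feature_idx="):
        line = "max_feature_idx=2"        # was 1 (two features), now three
    elif line.startswith("feature_names="):
        line = line + " f2"               # must match the new column's name in the new data
    elif line.startswith("feature_infos="):
        line = line + " [-5:5]"           # placeholder bin range for the new feature
    patched_lines.append(line)
patched_booster = lgb.Booster(model_str="\n".join(patched_lines))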

Description

Code example:

import sklearn.datasets
import lightgbm as lgb  # version 4.6.0

# the foundation dataset has 2 features
X_pretrain, y_pretrain = sklearn.datasets.make_classification(n_samples=1_000, n_features=2, n_informative=2, n_redundant=0, n_classes=2, random_state=2025)

foundation_booster = lgb.train(params={}, train_set=lgb.Dataset(X_pretrain, label=y_pretrain))

# the new data has 3 features (one on top of the foundation data)
X_finetune, y_finetune = sklearn.datasets.make_classification(n_samples=100, n_features=3, n_informative=3, n_redundant=0, n_classes=2, random_state=2025)

finetuned_model = lgb.train(params={}, train_set=lgb.Dataset(X_finetune, label=y_finetune), init_model=foundation_booster)
# raises LightGBMError: The number of features in data (3) is not the same as it was in training data (2).
# (the full message also suggests setting predict_disable_shape_check=true to bypass the check)
@vnherdeiro
Contributor Author

My bad. I missed that the error message advises setting "predict_disable_shape_check": True to bypass the check. I seem to have come up with a working example:

import sklearn.datasets
import sklearn.metrics
import lightgbm as lgb
import pandas as pd

Xtrain, ytrain = sklearn.datasets.make_classification(n_samples=150_000, n_features=3, n_informative=3, n_redundant=0, n_classes=2, random_state=2025)
Xtrain = pd.DataFrame(Xtrain, columns=['f0', 'f1', 'f2'])

# pre-train on the first two features only
X_pretrain = Xtrain.iloc[:50_000, :2]
y_pretrain = ytrain[:50_000]

# fine-tune on all three features
X_finetune = Xtrain.iloc[50_000:100_000, :]
y_finetune = ytrain[50_000:100_000]

X_heldout = Xtrain.iloc[100_000:, :]
y_heldout = ytrain[100_000:]

foundation_booster = lgb.train(params={'loss': 'binary', 'force_col_wise': True, 'n_estimators': 100, 'verbosity': 0}, train_set=lgb.Dataset(X_pretrain, y_pretrain))
print("Pretraining done")

# the foundation model never saw the fine-tuning split, so this is an out-of-sample logloss
print("Foundation model logloss on held out:", sklearn.metrics.log_loss(y_finetune, foundation_booster.predict(X_finetune.iloc[:, :2])))
finetuned_model = lgb.train(params={'loss': 'binary', 'force_col_wise': True, 'n_estimators': 200, 'predict_disable_shape_check': True}, train_set=lgb.Dataset(X_finetune, label=y_finetune), init_model=foundation_booster)
print("Finetuned model logloss on held out:", sklearn.metrics.log_loss(y_heldout, finetuned_model.predict(X_heldout)))

This prints

Pretraining done
Foundation model logloss on held out: 0.39466937889319265
Finetuned model logloss on held out: 0.21498295160704628

and

foundation_booster.feature_name(), finetuned_model.feature_name()
# (['f0', 'f1'], ['f0', 'f1', 'f2'])

as wanted.

The caveat seems to be that the new features must come, in order, after the foundation features, with no check of this under the hood.
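To make that assumption explicit before fine-tuning, a defensive check along these lines could help (a sketch using the variables from the example above; none of this is enforced by LightGBM itself):

# Reorder the fine-tuning frame so the foundation model's features come first,
# in their original order, followed by any newly added columns.
base_features = foundation_booster.feature_name()
extra_features = [c for c in X_finetune.columns if c not in base_features]
X_finetune_aligned = X_finetune[base_features + extra_features]

# Verify the prefix really matches before continuing training on top of init_model.
assert list(X_finetune_aligned.columns[:len(base_features)]) == base_features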

Apologies for raising an unneeded ticket. Please delete it as you like.

@vnherdeiro
Contributor Author

@jameslamb Do you think we could improve the documentation a bit about this use case? In particular, explicitly stating that the first n features of the augmented training data need to match the pretraining features?

@jameslamb jameslamb reopened this Feb 17, 2025
@jameslamb jameslamb changed the title lgb.train init_model does not allow a foundation model that has less features than the training data (fine tuning setting) [python-package] lgb.train() init_model does not allow a model that has fewer features than the training data Feb 18, 2025