[python-package] lgb.train() init_model does not allow a model that has fewer features than the training data #6831

vnherdeiro opened this issue Feb 16, 2025 · 2 comments

@vnherdeiro
Contributor

Summary

Retraining a Booster on a new dataset that enlarges the initial training data with additional features breaks the lgb.train API. It seems to do a hard check that the number of features in init_model matches the number of features in the new data. If feature names are available, checking that the new dataset's features contain the original features should be enough.

Motivation

Fundamentally, decision trees and their boosted ensembles are not tied to a fixed feature set (hence feature bagging), so it seems beneficial to allow fine-tuning a model on data with more features once these become available. I have encountered such a scenario in a professional application of boosted trees: we have a foundation model that is agnostic to many integration/client specifics (obtained by pruning the most specific features) and trained on a large amount of data, and we seek to fine-tune it (by adding more trees) on the specific (drifting) features. I am guessing that this scenario is pretty common. I have found a monkey patch around it: exporting the initial model to txt, updating the feature names and feature count by hand, then loading it again. It's very iffy; a rough sketch is below.
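For reference, the monkey patch looks roughly like this. It is only a sketch, not a supported workflow: it assumes a trained two-feature Booster called foundation_booster (as in the example below) and that the model text header contains max_feature_idx, feature_names and feature_infos lines, whose exact layout may differ across LightGBM versions.

import lightgbm as lgb

# Export the foundation model to its text representation, widen the feature
# metadata by hand, then reload it so it "knows" about the extra feature.
model_str = foundation_booster.model_to_string()
patched_lines = []
for line in model_str.splitlines():
    if line.startswith("max_feature_idx="):
        line = "max_feature_idx=2"        # was 1 (two features), now three
    elif line.startswith("feature_names="):
        line = line + " f2"               # must match the new column's name in the new data
    elif line.startswith("feature_infos="):
        line = line + " [-5:5]"           # placeholder bin range for the new feature
    patched_lines.append(line)
patched_booster = lgb.Booster(model_str="\n".join(patched_lines))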

Description

Code example:

import sklearn.datasets
import lightgbm as lgb  # version 4.6.0

# the foundation dataset has 2 features
X_pretrain, y_pretrain = sklearn.datasets.make_classification(n_samples=1_000, n_features=2, n_informative=2, n_redundant=0, n_classes=2, random_state=2025)

foundation_booster = lgb.train(params={}, train_set=lgb.Dataset(X_pretrain, label=y_pretrain))

# the new data has 3 features (one on top of the foundation data)
X_finetune, y_finetune = sklearn.datasets.make_classification(n_samples=100, n_features=3, n_informative=3, n_redundant=0, n_classes=2, random_state=2025)

finetuned_model = lgb.train(params={}, train_set=lgb.Dataset(X_finetune, label=y_finetune), init_model=foundation_booster)
# raises LightGBMError: The number of features in data (3) is not the same as it was in training data (2).
# (the full message also suggests setting predict_disable_shape_check=true to bypass the check)
@vnherdeiro
Contributor Author

My bad. I missed that the error message advises setting "predict_disable_shape_check": True to bypass the check. I seem to have come up with a working example:

import sklearn.datasets
import sklearn.metrics
import lightgbm as lgb
import pandas as pd

Xtrain, ytrain = sklearn.datasets.make_classification(n_samples=150_000, n_features=3, n_informative=3, n_redundant=0, n_classes=2, random_state=2025)
Xtrain = pd.DataFrame(Xtrain, columns=['f0', 'f1', 'f2'])

# pre-train on the first two features only
X_pretrain = Xtrain.iloc[:50_000, :2]
y_pretrain = ytrain[:50_000]

# fine-tune on all three features
X_finetune = Xtrain.iloc[50_000:100_000, :]
y_finetune = ytrain[50_000:100_000]

X_heldout = Xtrain.iloc[100_000:, :]
y_heldout = ytrain[100_000:]

foundation_booster = lgb.train(params={'loss': 'binary', 'force_col_wise': True, 'n_estimators': 100, 'verbosity': 0}, train_set=lgb.Dataset(X_pretrain, y_pretrain))
print("Pretraining done")

# the foundation model never saw the fine-tuning split, so this is an out-of-sample logloss
print("Foundation model logloss on held out:", sklearn.metrics.log_loss(y_finetune, foundation_booster.predict(X_finetune.iloc[:, :2])))
finetuned_model = lgb.train(params={'loss': 'binary', 'force_col_wise': True, 'n_estimators': 200, 'predict_disable_shape_check': True}, train_set=lgb.Dataset(X_finetune, label=y_finetune), init_model=foundation_booster)
print("Finetuned model logloss on held out:", sklearn.metrics.log_loss(y_heldout, finetuned_model.predict(X_heldout)))

This prints

Pretraining done
Foundation model logloss on held out: 0.39466937889319265
Finetuned model logloss on held out: 0.21498295160704628

and

foundation_booster.feature_name(), finetuned_model.feature_name()
# (['f0', 'f1'], ['f0', 'f1', 'f2'])

as wanted.

The caveat seems to be that the new features must come, in order, after the foundation features, with no check of this under the hood.
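To make that assumption explicit before fine-tuning, a defensive check along these lines could help (a sketch using the variables from the example above; none of this is enforced by LightGBM itself):

# Reorder the fine-tuning frame so the foundation model's features come first,
# in their original order, followed by any newly added columns.
base_features = foundation_booster.feature_name()
extra_features = [c for c in X_finetune.columns if c not in base_features]
X_finetune_aligned = X_finetune[base_features + extra_features]

# Verify the prefix really matches before continuing training on top of init_model.
assert list(X_finetune_aligned.columns[:len(base_features)]) == base_features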

Apologies for raising an unneeded ticket. Please delete it as you like.

@vnherdeiro
Contributor Author

@jameslamb Do you think we could improve the documentation a bit about this use case? In particular, explicitly stating that the first n features of the augmented training data need to match the pretraining features?

@jameslamb jameslamb reopened this Feb 17, 2025
@jameslamb jameslamb changed the title lgb.train init_model does not allow a foundation model that has less features than the training data (fine tuning setting) [python-package] lgb.train() init_model does not allow a model that has fewer features than the training data Feb 18, 2025