Histogram split algorithm boundaries #4086

aceson28 · 2021-03-19T19:56:44Z

Im use pyhton
Im in understandinding about lightgbm
What im understand is lightgbm use histogram split algorithm to make program lightier
From journal that i read, histogram bondaries used as split candidate, and from split candidate who gives max gain is used as tree split.

There are some article who said boundaries get with divide range of feature with num of bins same when we make histogram
The other article specially xgboost (weighted quantile )tell every bin must same sum of hessian ,because regression hessian is 1 so if we use 4 bins so we use fiture quantile (0.25, 0.5 and 0.75)

But when i do that manually and compare it with tree plot why is diffrent?
My question is how to calculate histogram bondaries?
What method lightgbm use to find histogram bondaries?

jameslamb · 2021-03-19T20:54:49Z

Hi @aceson28 , thanks for using LightGBM!

If you're comfortable reading C++ code, you could look at the ConstructHistogram* methods in the following relevant places where features are binned.

Binning continuous features, bundling sparse features, and filtering out unsplittable features is all done in construction of the Dataset object.

You can also find details on this in the original LightGBM paper, https://papers.nips.cc/paper/2017/file/6449f44a102fde848669bdd9eb6b76fa-Paper.pdf.

aceson28 · 2021-03-20T08:12:48Z

Hi @aceson28 , thanks for using LightGBM!

If you're comfortable reading C++ code, you could look at the ConstructHistogram* methods in the following relevant places where features are binned.

https://github.com/microsoft/LightGBM/blob/971b5486873339bcaa2bbe0948c95e1fff7246c7/src/io/multi_val_dense_bin.hpp

https://github.com/microsoft/LightGBM/blob/971b5486873339bcaa2bbe0948c95e1fff7246c7/src/io/multi_val_sparse_bin.hpp

Binning continuous features, bundling sparse features, and filtering out unsplittable features is all done in construction of the Dataset object.

You can also find details on this in the original LightGBM paper, https://papers.nips.cc/paper/2017/file/6449f44a102fde848669bdd9eb6b76fa-Paper.pdf.

@jameslamb
i'm sorry i'm still don't understand,
from Alg 1 in that journal i get
H[bin].y ← H[bin].y + I.y[j] (sum of gradient)
H[bin].n ← H[bin].n + 1 (sum of hessian)

but i still don't get it in this part (bin ← I.f[k][j].bin), because where that get it .bin ? because row before it just (H ← new Histogram()) and (Build histogram)
I'm still don't get it, where they get histogram boundaries?

about link you gave, i'm still read and understand it, because im still newbee in C++
btw boundaries that i mean is value beetwen the two bins, i said that because i'm afraid there will be mistake in my question.

aceson28 · 2021-03-20T09:02:45Z

@jameslamb
from subsubsection 5.1.1
https://everdark.github.io/k9/notebooks/ml/gradient_boosting/gbt.nb.html#511_greedy_equal-density_binning
suppose I have a dataset, it will be sampled to 200,000 (because bin_construct_sample_cnt=200000) and if max_bin=4 so every bin will have 50,000 data to make equal to density, so its like quantile (0.25, 0.5 and 0.75)? first of all, is that true (lightgbm use equal bin density strategies of his data sample)? if that true how about if the dataset less than 200,000? is the sample is overlapping? and if lightgbm use equal bin density strategies why there are still need min_data_in_bin?

my second understanding is quantile can make same value for boundaries, min_data_in_bin is used to keep away from overlapping boundaries? if that true, so if there is a huge amount of same data value is it going to the next bin or a bin before?

jameslamb · 2021-03-21T02:21:58Z

I think @shiyu1994 or @btrotta might give you a better answer than I can, I don't want to say something incorrect.

aceson28 · 2021-03-21T09:35:28Z

@jameslamb
@shiyu1994
@btrotta
Hi
in Histogram construction there is a sampling (bin_construct_sample_cnt) and in GOSS there is sampling too in ((b/other_rate) "Then it randomly samples b×100% instances from the rest of the data" Subsection 3.1) is that the same sampling or its different?
if it's different, how many times sampling in lightgbm? with so much sampling, won't it make the data far from the actual situation?

shiyu1994 · 2021-03-24T06:50:01Z

@aceson28 Thanks for using LightGBM.

The two sampling are differnt.

The histogram bin boundaries are determined before training the trees. And the boundaries are fixed during the whole boosting process. First, LightGBM will sample bin_construct_sample_cnt data points from the training dataset (we name it sample set SBin). Then, the boundaries are chosen so that the data in the SBin fall as evenly as possible to different bins of each feature. That means, when choosing bin boundaries of a feature, we try to make each bin contains almost the same number of data in SBin set.

The GOSS sampling is totally a different procedure. This sampling is performed before every boosting iteration. And the sampling strategy are described as in the paper https://papers.nips.cc/paper/2017/file/6449f44a102fde848669bdd9eb6b76fa-Paper.pdf. Briefly, LightGBM first chooses the data points with large gradients, and then randomly samples from the reset of data. The gradient values of randomly sampled data are multiplied by a scaling factor so that the gradient sum after the GOSS sampling will be an unbiased estimation of the original sum.

Does that answer your question? If any further question, please feel free to post it here.

aceson28 · 2021-03-24T08:38:45Z

@shiyu1994
thanks for answering, but I'm still don't understand,
when I'm building a model in the params I set bin_construct_sample_cnt = 200000 (default) and max_bin=5
when I do it manually with this code:
pandas.qcut(datatrain['numerical.feature.name'].sample(n=200000,replace=True), q=5,duplicates='drop').dtypes
why the boundaries is different?
if the idea is "SBin fall as evenly as possible to different bins of each feature" I think the code when I do manually is same

correct me if I'm wrong, so we have a dataset and sample it to SBin and with Sbin we use it to make histogram boundaries (these boundaries will be used for a whole process, I mean it just one time for determining the boundaries). after we get the boundaries, we calculate the pseudo residual or negative gradient from the dataset and then sort it by pseudo residual. after that sample it with GOSS (with ax100% data top gradient and bx100% the rest data) and then with that sample and histogram boundaries we build a tree? so every iteration the GOSS sampling will be performed? (btw I use single machine tree learner)
is that true? or we calculate pseudo residual with SBin data?

shiyu1994 · 2021-03-24T09:22:43Z

@aceson28 GOSS samples from the whole training data and SBin has no effect on GOSS sampling. These are two different process which do not interfere with each other.
The SBin sampling determines only the bin boundaries, that is, how to map a floating point number feature value to an integer bin value.
And these bin values will be fixed during training.

As for the difference result compared with qcut, how much is the difference. I think a small difference is expected. Because though we both follow the evenly-distributed principal, that doesn't mean we use exactly the same algorithm with no difference in details. Also, the results of sampling inside LightGBM and sampling with pandas can be different, since we are using different random generators.

aceson28 · 2021-03-24T09:39:24Z

@shiyu1994

no, it's very different, in the model we get the split point 365.5 for the first split, and 1095.5 for the split in depth 3 and 4, but in qcut [(29.999, 365.0], (365.0, 452.0], (452.0, 3653.0]] its just give 3 bins because there are duplicate
btw my mistake, its 4 bin not 5
if isn't fix point qcut boundaries in depth 3 is
pd.qcut(datatrain[(datatrain['Periode.Hari.']>365.5)&(datatrain['Umur']>4.5)]['Periode.Hari.'], q=4,duplicates='drop').dtypes
[(365.999, 441.0], (441.0, 542.0], (542.0, 3653.0]]

except qcut, is there any ways to see histogram bondaries the same as in lightgbm, what I mean is original boundaries (every possible value) not the best threshold who used in tree (plot_split_value_histogram).

jameslamb added the question label Mar 19, 2021

StrikerRUS mentioned this issue May 14, 2021

[Question]Whether LightGBM uses 'Goss' in GBDT by default? #4284

Closed

aBeginner666 mentioned this issue Aug 30, 2021

What would happen if some features are categorical-like but aren't specified as categorical? #4561

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Histogram split algorithm boundaries #4086

Histogram split algorithm boundaries #4086

aceson28 commented Mar 19, 2021 •

edited

Loading

jameslamb commented Mar 19, 2021

aceson28 commented Mar 20, 2021 •

edited

Loading

aceson28 commented Mar 20, 2021 •

edited

Loading

jameslamb commented Mar 21, 2021

aceson28 commented Mar 21, 2021 •

edited

Loading

shiyu1994 commented Mar 24, 2021

aceson28 commented Mar 24, 2021 •

edited

Loading

shiyu1994 commented Mar 24, 2021

aceson28 commented Mar 24, 2021 •

edited

Loading

Histogram split algorithm boundaries #4086

Histogram split algorithm boundaries #4086

Comments

aceson28 commented Mar 19, 2021 • edited Loading

jameslamb commented Mar 19, 2021

aceson28 commented Mar 20, 2021 • edited Loading

aceson28 commented Mar 20, 2021 • edited Loading

jameslamb commented Mar 21, 2021

aceson28 commented Mar 21, 2021 • edited Loading

shiyu1994 commented Mar 24, 2021

aceson28 commented Mar 24, 2021 • edited Loading

shiyu1994 commented Mar 24, 2021

aceson28 commented Mar 24, 2021 • edited Loading

aceson28 commented Mar 19, 2021 •

edited

Loading

aceson28 commented Mar 20, 2021 •

edited

Loading

aceson28 commented Mar 20, 2021 •

edited

Loading

aceson28 commented Mar 21, 2021 •

edited

Loading

aceson28 commented Mar 24, 2021 •

edited

Loading

aceson28 commented Mar 24, 2021 •

edited

Loading