Histogram split algorithm boundaries #4086
Comments
Hi @aceson28, thanks for using LightGBM! If you're comfortable reading C++ code, you could look at the
Binning continuous features, bundling sparse features, and filtering out unsplittable features are all done in construction of the
You can also find details on this in the original LightGBM paper, https://papers.nips.cc/paper/2017/file/6449f44a102fde848669bdd9eb6b76fa-Paper.pdf.
@jameslamb I still don't understand this part of the pseudocode: (bin ← I.f[k][j].bin). Where does .bin come from? The rows before it are just (H ← new Histogram()) and (Build histogram). As for the link you gave, I'm still reading it and trying to understand it, because I'm still a newbie in C++.
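For readers following the pseudocode line quoted above, here is a minimal sketch of one way to read it. It assumes that each raw feature value has already been mapped to a bin index against a fixed list of bin upper bounds, and that the histogram then accumulates gradient sums and counts per bin. The names `to_bin`, `build_histogram`, and `upper_bounds` are hypothetical and are not LightGBM identifiers; the actual C++ implementation may differ.

```python
from bisect import bisect_left

def to_bin(value, upper_bounds):
    """Map a raw value to a bin index (hypothetical helper).

    A value goes into the first bin whose upper bound is >= value;
    values above every bound go into the last bin.
    """
    return bisect_left(upper_bounds, value)

def build_histogram(bins, gradients, num_bins):
    """Accumulate the gradient sum and row count for every bin."""
    grad_sum = [0.0] * num_bins
    count = [0] * num_bins
    for b, g in zip(bins, gradients):
        grad_sum[b] += g
        count[b] += 1
    return grad_sum, count

# Made-up data: two upper bounds define three bins.
upper_bounds = [1.5, 3.5]
values = [1, 2, 3, 4]
gradients = [0.5, -1.0, 2.0, 0.25]

# The bin index is computed once per value and then reused.
bins = [to_bin(v, upper_bounds) for v in values]
grad_sum, count = build_histogram(bins, gradients, len(upper_bounds) + 1)
```

Under this reading, `.bin` is simply the stored bin index of a row for a feature, so building a histogram never needs the raw values again.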
@jameslamb My second understanding is that quantiles can produce the same value for several boundaries. Is min_data_in_bin used to avoid such overlapping boundaries? If that is true, when there is a large number of identical values, do they go into the next bin or the previous one?
I think @shiyu1994 or @btrotta might give you a better answer than I can; I don't want to say something incorrect.
@jameslamb |
@aceson28 Thanks for using LightGBM. The two samplings are different.
The histogram bin boundaries are determined before training the trees, and the boundaries are fixed during the whole boosting process. First, LightGBM will sample
The GOSS sampling is a totally different procedure. This sampling is performed before every boosting iteration, and the sampling strategy is described in the paper https://papers.nips.cc/paper/2017/file/6449f44a102fde848669bdd9eb6b76fa-Paper.pdf. Briefly, LightGBM first chooses the data points with large gradients, and then randomly samples from the rest of the data. The gradient values of the randomly sampled data are multiplied by a scaling factor so that the gradient sum after the GOSS sampling is an unbiased estimate of the original sum.
Does that answer your question? If you have any further questions, please feel free to post them here.
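The GOSS procedure described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not LightGBM's implementation: the function name `goss_sample` is hypothetical, `a` and `b` are the top and random sampling fractions, and the scaling factor `(1 - a) / b` is the one given in the paper's GOSS algorithm.

```python
import random

def goss_sample(gradients, a=0.2, b=0.1, seed=0):
    """Return (indices, weights) for one GOSS round (illustrative only).

    a: fraction of rows kept because they have the largest |gradient|
    b: fraction of all rows drawn at random from the remaining rows
    """
    n = len(gradients)
    top_n = int(a * n)
    rand_n = int(b * n)

    # Sort row indices by absolute gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    top, rest = order[:top_n], order[top_n:]

    # Draw the small-gradient rows uniformly without replacement.
    rng = random.Random(seed)
    sampled = rng.sample(rest, rand_n)

    # Up-weight the sampled small-gradient rows so their gradient sum
    # stands in for the whole remainder.
    scale = (1.0 - a) / b
    weights = {i: 1.0 for i in top}
    weights.update({i: scale for i in sampled})
    return top + sampled, weights

# Made-up gradients for 100 rows.
grads = [((i * 37) % 101 - 50) / 10.0 for i in range(100)]
indices, weights = goss_sample(grads, a=0.2, b=0.1)
```

Because gradients change after every boosting iteration, a sampler like this would be called once per iteration, while the bin boundaries stay fixed.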
@shiyu1994 Correct me if I'm wrong. We have a dataset and sample it into SBin, and we use SBin to make the histogram boundaries (these boundaries are used for the whole process; I mean the boundaries are determined only once). After we get the boundaries, we calculate the pseudo-residuals (negative gradients) from the dataset and sort the data by pseudo-residual. After that, we sample with GOSS (the top a×100% of the data by gradient and b×100% of the remaining data), and then we build a tree with that sample and the histogram boundaries? So GOSS sampling is performed in every iteration? (By the way, I use the single-machine tree learner.)
@aceson28 GOSS samples from the whole training data and As for the difference result compared with |
No, it's very different. In the model, the first split point is 365.5, and the splits at depths 3 and 4 are at 1095.5, but qcut gives [(29.999, 365.0], (365.0, 452.0], (452.0, 3653.0]], which is only 3 bins because there are duplicates. Besides qcut, is there any way to see the histogram boundaries exactly as they are in LightGBM? I mean the original boundaries (every possible value), not only the best thresholds that were used in the trees (plot_split_value_histogram).
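One possible explanation for thresholds ending in .5 is that, when a feature has few distinct values, each boundary is placed halfway between two adjacent distinct values instead of on a quantile. The sketch below only illustrates that idea. It is an assumption about the behaviour, not a confirmed description of LightGBM's binning code. The helper name `midpoint_boundaries` is hypothetical and the data is made up.

```python
def midpoint_boundaries(values):
    """Candidate boundaries halfway between adjacent distinct values."""
    distinct = sorted(set(values))
    return [(lo + hi) / 2 for lo, hi in zip(distinct, distinct[1:])]

# Made-up integer feature with repeated values.
values = [30, 365, 365, 366, 452, 1095, 1096, 3653]
boundaries = midpoint_boundaries(values)
# boundaries == [197.5, 365.5, 409.0, 773.5, 1095.5, 2374.5]
```

With integer-valued data, boundaries of this kind always end in .0 or .5, which is why they would not match the interval edges returned by qcut.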
I'm using Python.
I'm trying to understand LightGBM.
My understanding is that LightGBM uses a histogram-based split algorithm to make training lighter.
From a journal article I read, the histogram boundaries are used as split candidates, and the candidate that gives the maximum gain is used as the tree split.
Some articles say the boundaries are obtained by dividing the range of the feature by the number of bins, the same way as when drawing a histogram.
Other articles, especially on XGBoost (weighted quantile), say that every bin must have the same sum of Hessians. Because the Hessian is 1 for regression, with 4 bins we would use the feature quantiles (0.25, 0.5, and 0.75).
But when I do that manually and compare it with the tree plot, why is the result different?
My question is: how are the histogram boundaries calculated?
What method does LightGBM use to find the histogram boundaries?
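The two textbook approaches mentioned in the question, equal-width bins and quantile bins, can be compared with a short sketch. Both helper names are hypothetical, the data is made up, and the quantile rule used here is nearest-rank. Neither function is claimed to be what LightGBM does. The point is only to show how the two rules differ and how duplicate values can collapse quantile boundaries.

```python
import math

def equal_width_boundaries(values, num_bins):
    """Interior boundaries that split [min, max] into equal-width bins."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / num_bins
    return [lo + step * k for k in range(1, num_bins)]

def quantile_boundaries(values, num_bins):
    """Interior boundaries at the k/num_bins quantiles (nearest rank).

    Repeated cut points are merged, so heavy duplication in the data
    can yield fewer bins than requested.
    """
    s = sorted(values)
    n = len(s)
    cuts = [s[math.ceil(k * n / num_bins) - 1] for k in range(1, num_bins)]
    return sorted(set(cuts))

# Made-up feature where one value dominates.
values = [1, 1, 1, 1, 1, 1, 2, 3, 4, 10]

width_cuts = equal_width_boundaries(values, 4)   # [3.25, 5.5, 7.75]
quant_cuts = quantile_boundaries(values, 4)      # [1, 3]
```

Here the 0.25 and 0.5 quantiles are both 1, so four requested quantile bins collapse to three, similar to the qcut result with duplicates reported later in the thread.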