This issue was moved to a discussion.

Using xgboost with crankcompositor #363

Closed
fa1999abdi opened this issue Feb 6, 2024 · 9 comments
Comments

@fa1999abdi

Hello,
I'm doing an ML survival study using the {mlr3proba} package with three learners: "surv.rfsrc", "surv.xgboost" and "surv.penalized". I want to predict the survival time for each individual and compare the three learners (using RMSE and C-index). Could you please explain how I can use {mlr3pipelines} with distrcompositor and crankcompositor to do that?
Here is my code:

library(mlr3proba)
library(mlr3pipelines)
library(mlr3extralearners)

# create a task
tsk_s <- as_task_surv(tb, time = "time_to_death", event = "status", type = "right")

# impute missing values
po_impute = po("imputehist")

# new task
new_task = po_impute$train(list(tsk_s = tsk_s))[[1]]

# benchmark
srfs = lrn("surv.rfsrc", predict_type = "crank", importance = "permute")
sbboost = lrn("surv.xgboost", predict_type = "crank")
spe = lrn("surv.penalized", lambda1 = 485.86, predict_type = "crank")
learners = list(srfs, sbboost, spe)
resample = rsmp("cv", folds = 3)
design = benchmark_grid(new_task, learners, resample)
bm = benchmark(design)
msr_txt = c("surv.cindex", "surv.rmse")
measures = msrs(msr_txt)
bm$aggregate(measures)[, c("learner_id", "task_id", ..msr_txt)]

Created on 2024-02-06 with reprex v2.1.0

@bblodfon
Collaborator

bblodfon commented Feb 6, 2024

Hi, please consult the crankcompose docs. In practice you will do something like:

task = tsk("rats")
pipe = po("imputehist") %>>%
  ppl("crankcompositor", learner = lrn("surv.coxph"), # or any other learner that predicts distr
      response = TRUE, method = "sum_haz")
pipe$train(task)
p = pipe$predict(task)[[1]] # p will have a response (survival time) now
p$score(msrs(c("surv.cindex", "surv.rmse"))) # or your own measures

But note that in survival analysis there are known issues with composing a response from a distr prediction, whichever method is used, and surv.rmse is rarely used, if at all. It is more common to evaluate the whole distr with measures such as the integrated survival Brier score, i.e. surv.graf, or surv.rcll.
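
As a rough illustration of scoring the whole distr rather than a composed response, here is a minimal sketch. It assumes a learner that already predicts distr (surv.coxph is used purely as an example, it is not one of the learners from the question) and, for brevity, it scores on the training data.

```r
library(mlr3)
library(mlr3proba)

# Example task shipped with mlr3proba
task = tsk("rats")

# surv.coxph is an assumed stand-in: it returns a distr prediction out of the box
learner = lrn("surv.coxph")
learner$train(task)
p = learner$predict(task)

# Distribution-based measures: integrated Brier (Graf) score and
# right-censored log loss
scores = p$score(msrs(c("surv.graf", "surv.rcll")))
scores
```

In a real comparison these measures would be computed on held-out data, for example by passing them to `bm$aggregate()` after a benchmark.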

@bblodfon
Collaborator

bblodfon commented Feb 8, 2024

@fa1999abdi question covered?

@fa1999abdi
Author

fa1999abdi commented Feb 8, 2024

@bblodfon, thank you so much for your response.
But I should use distrcompositor for lrn("surv.xgboost") to predict the survival time for each individual, is that correct?

@bblodfon
Collaborator

bblodfon commented Feb 8, 2024

You should use distrcompositor with xgboost and estimator = "breslow" for the Cox objective, see #263. This will give you a distr prediction type. If you really want a response, crankcompositor works (but note that there are some issues with improper distributions, and that taking the mean or median will not work as expected, although there is good reasoning behind that).
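
To make the remark about improper distributions concrete, here is a small base-R sketch. The survival curve is made up for illustration (it is not output from any learner in this thread): it plateaus at 0.6, so the median is undefined and the mean depends entirely on where the curve is cut off.

```r
# Hypothetical discrete survival curve that plateaus at 0.6, i.e. an
# improper distribution: S(t) never reaches 0 within follow-up
times <- c(10, 20, 30, 40, 50)
surv  <- c(0.9, 0.8, 0.7, 0.6, 0.6)

# Median survival time: first time point where S(t) <= 0.5
median_idx  <- which(surv <= 0.5)
median_time <- if (length(median_idx) > 0) times[min(median_idx)] else NA_real_
median_time  # NA, because the curve never drops to 0.5

# Restricted mean survival time: area under the step function up to horizon tau
rmst <- function(tau) {
  keep   <- times < tau
  grid   <- c(0, times[keep], tau)  # interval boundaries
  s_left <- c(1, surv[keep])        # S(t) on each interval [grid[i], grid[i + 1])
  sum(diff(grid) * s_left)
}

rmst(50)   # 40
rmst(100)  # 70, the "mean" keeps growing with the chosen horizon
```

Any single-number survival time taken from such a curve therefore reflects the choice of cutoff as much as the model.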

@bblodfon bblodfon changed the title Question Using xgboost with crankcompositor/distrcompositor [Question] Feb 9, 2024
@bblodfon bblodfon changed the title Using xgboost with crankcompositor/distrcompositor [Question] Using xgboost with crankcompositor [Question] Feb 9, 2024
@bblodfon bblodfon changed the title Using xgboost with crankcompositor [Question] Using xgboost with crankcompositor Feb 9, 2024
@bblodfon
Collaborator

bblodfon commented Feb 9, 2024

@fa1999abdi I am soon going to split the xgboost objectives into separate learners (Cox and AFT are very different). For Cox, the distr predictions will by default be generated using the Breslow estimator to streamline things, so no distr composition will be required for the XGBoost-Cox learner. A response prediction will not be included, of course; you will still need to compose that with the crankcompositor.

@fa1999abdi
Author

But it didn't work:

tsk_s <- as_task_surv(tb, time = "time_to_death", event = "status", type = "right")
pipe = po("imputehist") %>>%
  ppl("crankcompositor", learner = lrn("surv.xgboost"), response = TRUE, method = "sum_haz")
pipe$train(tsk_s)
#> $compose_crank.output
#> NULL
p = pipe$predict(tsk_s)[[1]] # p will have a response (survival time) now
#> Error: Assertion on 'distr' failed: FALSE.
#> This happened PipeOp compose_crank's $predict()

@bblodfon
Collaborator

Yes, you need to estimate the distr either way (crankcompositor converts a distr to crank/response), so now it looks a bit complex but the following works:

library(mlr3proba)
#> Loading required package: mlr3
library(mlr3pipelines)
library(mlr3extralearners)

task = tsk("rats")

learner =
  po("encode", method = "treatment") %>>%
  ppl("crankcompositor",
    # crank needs a distr prediction type, xgboost doesn't have one, so we have to estimate it:
    learner = ppl("distrcompositor", learner = lrn("surv.xgboost", nrounds = 10),
                   estimator = "breslow", overwrite = FALSE),
    response = TRUE, method = "sum_haz", overwrite = FALSE) |>
  as_learner()

learner$train(task)
p = learner$predict(task)
p
#> <PredictionSurv> for 300 observations:
#>     row_ids time status      crank         lp response     distr
#>           1  101  FALSE -0.5318943 -0.5318943 3.987942 <list[1]>
#>           2   49   TRUE -0.9984229 -0.9984229 2.501140 <list[1]>
#>           3  104  FALSE -0.9984229 -0.9984229 2.501140 <list[1]>
#> ---                                                             
#>         298   92  FALSE -1.0661759 -1.0661759 2.337293 <list[1]>
#>         299  104  FALSE -0.8688244 -0.8688244 2.847226 <list[1]>
#>         300  102  FALSE -0.8688244 -0.8688244 2.847226 <list[1]>

p$score(msr("surv.cindex")) # uses crank prediction type (identical to lp here)
#> surv.cindex 
#>   0.8984875
p$score(msr("surv.rmse")) # uses response prediction type
#> surv.rmse 
#>  61.24336
p$score(msr("surv.brier")) # uses distr prediction type
#>  surv.graf 
#> 0.03333211

Created on 2024-02-10 with reprex v2.0.2

@fa1999abdi
Author

@bblodfon thanks so much for your help.

@bblodfon
Collaborator

FYI, even though you can do the above and get a response (survival time), here is an excerpt from Haider's paper (he introduced the D-calibration score), where he explains why converting a distr to a single-value response is not good practice in survival modeling:

[image: excerpt from Haider's paper]

@mlr-org mlr-org locked and limited conversation to collaborators Feb 11, 2024
@bblodfon bblodfon converted this issue into discussion #366 Feb 11, 2024
