In the examples folder you'll find benchmarks comparing xgbse to other survival analysis methods. We show 6 metrics (see [9] for details):
- `c-index`: concordance index, the analogue of AUC for censored data.
- `dcal_max_dev`: maximum decile deviation from a perfectly calibrated distribution.
- `dcal_pval`: p-value of the chi-square test for D-Calibration; if larger than 0.05, we do not reject the hypothesis that the model is D-Calibrated.
- `ibs`: approximate integrated Brier score, i.e. the average Brier score across all time windows.
- `inference_time`: time to run inference, recorded on a 2018 MacBook Pro.
- `training_time`: time to train the model, recorded on a 2018 MacBook Pro.
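As an illustration of the first metric, here is a minimal pure-NumPy sketch of the concordance index for right-censored data. This is a naive O(n²) version written for clarity only; the function name and toy data are ours, and the benchmarks use the standard library implementations rather than this sketch.

```python
import numpy as np

def concordance_index(times, events, risk_scores):
    """Naive c-index for right-censored data.

    A pair (i, j) is comparable when the earlier time is an observed
    event; it is concordant when the higher risk score belongs to the
    subject that failed first. Ties in risk count as 1/2.
    """
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # i must experience the event strictly before j's time
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    return concordant / comparable

# toy example: risk scores perfectly ordered with failure times
times = np.array([2.0, 4.0, 6.0, 8.0])
events = np.array([1, 1, 0, 1])
risks = np.array([4.0, 3.0, 2.0, 1.0])
print(concordance_index(times, events, risks))  # → 1.0
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ordering of risks, which is what makes the metric comparable to AUC.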
We executed all methods with default parameters. For vanilla XGBoost and xgbse, we used early stopping with `num_boosting_rounds=1000` and `early_stopping_rounds=10`. Results for five datasets are shown below.
XGBSEDebiasedBCE and XGBSEStackedWeibull show the most promising results, ranking among the top three methods in 4 of the 5 datasets.
Other xgbse methods show good results too. In particular, XGBSEKaplanTree combined with XGBSEBootstrapEstimator shows promising results, pointing to a direction for further research.
Linear methods such as Weibull AFT and Cox-PH from lifelines are surprisingly strong, especially on datasets with a small number of samples.
xgbse methods are competitive with vanilla XGBoost as measured by c-index, while also scoring well on survival-curve metrics. Thus, xgbse can serve as a calibrated replacement for vanilla XGBoost.
xgbse takes longer to fit than vanilla XGBoost. This is especially true for XGBSEDebiasedBCE, which fits N logistic regressions, where N is the number of time windows to predict (we used N = 30 in all benchmarks). Among the other methods, XGBSEKaplanTree is the fastest, followed by XGBSEStackedWeibull.
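To make the per-window cost concrete, the sketch below builds the binary targets that a one-classifier-per-window scheme needs. This is a simplified illustration, not the library's actual debiasing logic: subjects censored before a window have unknown status there (masked as `NaN`), so each of the N windows requires a separate fit on a different subset, which is where the extra training time comes from.

```python
import numpy as np

def binary_targets(times, events, time_bins):
    """y[i, k] = 1 if subject i has an observed event by time_bins[k],
    0 if still under observation at time_bins[k], NaN if censored
    before time_bins[k] (status unknown)."""
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=bool)
    y = np.full((len(times), len(time_bins)), np.nan)
    for k, t in enumerate(time_bins):
        failed = events & (times <= t)   # observed event by t
        at_risk = times > t              # still under observation at t
        y[failed, k] = 1.0
        y[at_risk, k] = 0.0
        # censored at or before t: left as NaN
    return y

bins = [3.0, 6.0]
y = binary_targets([2.0, 4.0, 5.0], [1, 0, 1], bins)
# subject 0 failed at t=2: label 1 in both windows
# subject 1 censored at t=4: label 0 at t=3, NaN at t=6
# subject 2 failed at t=5: label 0 at t=3, label 1 at t=6
```

With N = 30 windows, 30 such label vectors are built and 30 classifiers are trained, versus a single booster for vanilla XGBoost.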