In the examples folder you'll find benchmarks comparing xgbse to other survival analysis methods. We show 6 metrics (see [9] for details):
- `c-index`: concordance index, the analogue of AUC for censored data.
- `dcal_max_dev`: maximum decile deviation from a perfectly calibrated distribution.
- `dcal_pval`: p-value of the chi-square test for D-Calibration; if larger than 0.05, we do not reject the hypothesis that the model is D-Calibrated.
- `ibs`: approximate integrated Brier score, i.e. the average Brier score across all time windows.
- `inference_time`: time to run inference, recorded on a 2018 MacBook Pro.
- `training_time`: time to train the model, recorded on a 2018 MacBook Pro.
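As an illustration of the first metric, here is a minimal pure-NumPy sketch of the concordance index for right-censored data. This is a naive O(n²) version written for clarity only; the function name and toy data are ours, and the benchmarks use the standard library implementations rather than this sketch.

```python
import numpy as np

def concordance_index(times, events, risk_scores):
    """Naive c-index for right-censored data.

    A pair (i, j) is comparable when the earlier time is an observed
    event; it is concordant when the higher risk score belongs to the
    subject that failed first. Ties in risk count as 1/2.
    """
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # i must experience the event strictly before j's time
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    return concordant / comparable

# toy example: risk scores perfectly ordered with failure times
times = np.array([2.0, 4.0, 6.0, 8.0])
events = np.array([1, 1, 0, 1])
risks = np.array([4.0, 3.0, 2.0, 1.0])
print(concordance_index(times, events, risks))  # → 1.0
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ordering of risks, which is what makes the metric comparable to AUC.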
We executed all methods with default parameters. For vanilla XGBoost and xgbse, we used early stopping with `num_boosting_rounds=1000` and `early_stopping_rounds=10`. Results for five datasets are shown below.
XGBSEDebiasedBCE and XGBSEStackedWeibull show the most promising results, ranking among the top three methods in 4 of the 5 datasets.
Other xgbse methods show good results too. In particular, XGBSEKaplanTree combined with XGBSEBootstrapEstimator shows promising results, pointing to a direction for further research.
Linear methods such as Weibull AFT and Cox-PH from lifelines are surprisingly strong, especially on datasets with a small number of samples.
xgbse methods are competitive with vanilla XGBoost as measured by c-index, while also scoring well on survival-curve metrics. Thus, xgbse can serve as a calibrated replacement for vanilla XGBoost.
xgbse takes longer to fit than vanilla XGBoost. This is especially true for XGBSEDebiasedBCE, which fits N logistic regressions, where N is the number of time windows to predict (we used N = 30 in all benchmarks). Among the other methods, XGBSEKaplanTree is the fastest, followed by XGBSEStackedWeibull.
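To make the per-window cost concrete, the sketch below builds the binary targets that a one-classifier-per-window scheme needs. This is a simplified illustration, not the library's actual debiasing logic: subjects censored before a window have unknown status there (masked as `NaN`), so each of the N windows requires a separate fit on a different subset, which is where the extra training time comes from.

```python
import numpy as np

def binary_targets(times, events, time_bins):
    """y[i, k] = 1 if subject i has an observed event by time_bins[k],
    0 if still under observation at time_bins[k], NaN if censored
    before time_bins[k] (status unknown)."""
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=bool)
    y = np.full((len(times), len(time_bins)), np.nan)
    for k, t in enumerate(time_bins):
        failed = events & (times <= t)   # observed event by t
        at_risk = times > t              # still under observation at t
        y[failed, k] = 1.0
        y[at_risk, k] = 0.0
        # censored at or before t: left as NaN
    return y

bins = [3.0, 6.0]
y = binary_targets([2.0, 4.0, 5.0], [1, 0, 1], bins)
# subject 0 failed at t=2: label 1 in both windows
# subject 1 censored at t=4: label 0 at t=3, NaN at t=6
# subject 2 failed at t=5: label 0 at t=3, label 1 at t=6
```

With N = 30 windows, 30 such label vectors are built and 30 classifiers are trained, versus a single booster for vanilla XGBoost.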