
Basic Usage:

In this notebook you will find:

- How to get a survival curve and neighbors prediction using xgbse
- How to validate your xgbse model using sklearn

METABRIC

We will be using the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) dataset from pycox as the basis for this example.

from pycox.datasets import metabric
import numpy as np

# getting data
df = metabric.read_df()

df.head()
x0 x1 x2 x3 x4 x5 x6 x7 x8 duration event
0 5.603834 7.811392 10.797988 5.967607 1.0 1.0 0.0 1.0 56.840000 99.333336 0
1 5.284882 9.581043 10.204620 5.664970 1.0 0.0 0.0 1.0 85.940002 95.733330 1
2 5.920251 6.776564 12.431715 5.873857 0.0 1.0 0.0 1.0 48.439999 140.233337 0
3 6.654017 5.341846 8.646379 5.655888 0.0 0.0 0.0 0.0 66.910004 239.300003 0
4 5.456747 5.339741 10.555724 6.008429 1.0 0.0 0.0 1.0 67.849998 56.933334 1

Split and Time Bins

Split the data into train and test sets using the sklearn API. We also set up the TIME_BINS array, which will be used to fit the survival curve.

from xgbse.converters import convert_to_structured
from sklearn.model_selection import train_test_split

# splitting to X, T, E format
X = df.drop(['duration', 'event'], axis=1)
T = df['duration']
E = df['event']
y = convert_to_structured(T, E)

# splitting between train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)
TIME_BINS = np.arange(15, 315, 15)
TIME_BINS
array([ 15,  30,  45,  60,  75,  90, 105, 120, 135, 150, 165, 180, 195,
       210, 225, 240, 255, 270, 285, 300])
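The structured target y pairs each observation's event indicator with its time, which is the format survival estimators expect. A minimal NumPy sketch of that layout (the field names "event" and "time" are illustrative here, not necessarily the exact dtype convert_to_structured emits):

```python
import numpy as np

# toy durations and event indicators (1 = event observed, 0 = censored)
T = np.array([99.3, 95.7, 140.2])
E = np.array([0, 1, 0])

# build a structured array pairing (event, time) per row;
# field names are illustrative, check convert_to_structured's output dtype
y_toy = np.array(
    list(zip(E.astype(bool), T)),
    dtype=[("event", "?"), ("time", "<f8")],
)
print(y_toy["event"])  # [False  True False]
```

Packing both pieces into one array lets a single `y` flow through sklearn utilities such as train_test_split without losing the censoring information.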

Fit and Predict

We will be using the XGBSEDebiasedBCE estimator to fit the model and predict a survival curve for each point in our test data.

from xgbse import XGBSEDebiasedBCE

# fitting xgbse model
xgbse_model = XGBSEDebiasedBCE()
xgbse_model.fit(X_train, y_train, time_bins=TIME_BINS)

# predicting
y_pred = xgbse_model.predict(X_test)

print(y_pred.shape)
y_pred.head()
(635, 20)
15 30 45 60 75 90 105 120 135 150 165 180 195 210 225 240 255 270 285 300
0 0.983502 0.951852 0.923277 0.900028 0.862270 0.799324 0.715860 0.687257 0.651314 0.610916 0.568001 0.513172 0.493194 0.430701 0.377675 0.310496 0.272169 0.225599 0.184878 0.144089
1 0.973506 0.917739 0.839154 0.710431 0.663119 0.558886 0.495204 0.364995 0.311628 0.299939 0.226226 0.191373 0.171697 0.144864 0.112447 0.089558 0.081137 0.057679 0.048563 0.035985
2 0.986894 0.959209 0.919768 0.889910 0.853239 0.777208 0.725381 0.649177 0.582569 0.531787 0.485275 0.451667 0.428899 0.386413 0.344369 0.279685 0.242064 0.187967 0.158121 0.118562
3 0.986753 0.955210 0.910354 0.857684 0.824301 0.769262 0.665805 0.624934 0.583592 0.537261 0.493957 0.443193 0.416702 0.376552 0.308947 0.237033 0.177140 0.141838 0.117917 0.088937
4 0.977348 0.940368 0.873695 0.804796 0.742655 0.632426 0.556008 0.521490 0.493577 0.458477 0.416363 0.391099 0.364431 0.291472 0.223758 0.190398 0.165911 0.120061 0.095512 0.069566
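Each row of y_pred is a survival curve evaluated at the TIME_BINS columns. To read off a probability at a time that falls between bins, one simple option is linear interpolation over the columns. A sketch on a toy frame with the same column layout (the toy values are made up; only the shape matches y_pred):

```python
import numpy as np
import pandas as pd

# toy prediction with the same time-bin columns as y_pred
bins = np.arange(15, 315, 15)
curve = pd.DataFrame([np.linspace(0.98, 0.14, len(bins))], columns=bins)

def survival_at(row, t):
    """Linearly interpolate a survival curve (a row of predictions) at time t."""
    return float(np.interp(t, row.index.to_numpy(), row.to_numpy()))

p = survival_at(curve.iloc[0], 100)  # 100 falls between the 90 and 105 bins
print(round(p, 3))
```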

Mean predicted survival curve for the test data:

y_pred.mean().plot.line();

(plot: mean predicted survival curve over the time bins)

Neighbors

We can also use our model to query for comparable observations based on predicted survivability.

neighbors = xgbse_model.get_neighbors(
    query_data=X_test,
    index_data=X_train,
    n_neighbors=5
)

print(neighbors.shape)
neighbors.head(5)
(635, 5)
neighbor_1 neighbor_2 neighbor_3 neighbor_4 neighbor_5
829 339 166 508 1879 418
670 1846 1082 1297 194 1448
1064 416 1230 739 1392 589
85 1558 8 1080 613 1522
1814 105 859 1743 50 566

Example: selecting a data point from the query data (X_test) and checking its features:

desired = neighbors.iloc[10]

X_test.loc[X_test.index == desired.name]
x0 x1 x2 x3 x4 x5 x6 x7 x8
399 5.572504 7.367552 11.023443 5.406307 1.0 0.0 0.0 1.0 67.620003

... and finding its comparables in the index data (X_train):

X_train.loc[X_train.index.isin(desired.tolist())]
x0 x1 x2 x3 x4 x5 x6 x7 x8
757 5.745395 8.178815 10.745699 5.530381 1.0 1.0 0.0 1.0 64.930000
726 5.635854 6.648942 10.889588 5.496374 1.0 1.0 0.0 1.0 70.860001
968 5.541239 7.058089 10.463409 5.396433 1.0 0.0 0.0 1.0 71.070000
870 5.605712 7.309217 10.935708 5.542732 0.0 1.0 0.0 1.0 71.470001
1640 5.812605 7.646811 10.952687 5.516386 1.0 1.0 0.0 1.0 68.559998
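Since the neighbors are retrieved by similarity of predicted survival, their features should sit broadly close to the query point's. A quick pandas sanity check of that intuition on toy stand-ins for the query row and its neighbor rows (values loosely echo the tables above, but are illustrative):

```python
import pandas as pd

# toy stand-ins for the query point and its retrieved neighbors
query = pd.Series({"x0": 5.57, "x8": 67.62})
neighbors_df = pd.DataFrame(
    {"x0": [5.75, 5.64, 5.54], "x8": [64.93, 70.86, 71.07]}
)

# mean absolute gap between the query and the neighbor centroid per feature
diff = (neighbors_df.mean() - query).abs()
print(diff)
```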

Score metrics

XGBSE implements the concordance index and an approximation of the integrated Brier score, both of which can be used to evaluate model performance.

# importing metrics
from xgbse.metrics import concordance_index, approx_brier_score

# running metrics
print(f"C-index: {concordance_index(y_test, y_pred)}")
print(f"Avg. Brier Score: {approx_brier_score(y_test, y_pred)}")
C-index: 0.6706453426714781
Avg. Brier Score: 0.17221909077845754
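The C-index measures ranking quality: among comparable pairs (the earlier time corresponds to an observed event), the subject who fails first should be scored as riskier. A hand-rolled toy computation of the plain Harrell-style definition (a sketch for intuition, not necessarily the exact estimator xgbse implements):

```python
import numpy as np

# toy data: observed times, event flags (1 = observed), and a risk score
times = np.array([5.0, 8.0, 11.0, 14.0])
events = np.array([1, 1, 0, 1])
risk = np.array([0.9, 0.7, 0.6, 0.2])  # higher = predicted riskier

concordant, comparable = 0, 0
for i in range(len(times)):
    for j in range(len(times)):
        # a pair is comparable if i fails first with an observed event
        if times[i] < times[j] and events[i] == 1:
            comparable += 1
            if risk[i] > risk[j]:  # earlier failure ranked riskier: concordant
                concordant += 1

print(concordant / comparable)  # 1.0 here: risk order matches failure order
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is why the 0.67 above indicates a modest but real signal.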

Cross Validation

We can also use sklearn's cross_val_score and make_scorer to cross-validate our model.

from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer

results = cross_val_score(xgbse_model, X, y, scoring=make_scorer(approx_brier_score))
results
array([0.16269636, 0.14880423, 0.12848939, 0.15335356, 0.15394174])
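cross_val_score returns one score per fold, and the usual summary is the mean and spread. A sketch over the fold scores printed above:

```python
import numpy as np

# fold-wise average Brier scores, as reported by cross_val_score above
results = np.array([0.16269636, 0.14880423, 0.12848939, 0.15335356, 0.15394174])

# report mean +/- standard deviation across folds
print(f"{results.mean():.4f} +/- {results.std():.4f}")
```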