Survival analysis using lifelines in Python
Checkout the video version here:
Summary
The webpage provides an overview of survival analysis using the lifelines library in Python, detailing various models such as Kaplan-Meier, Nelson-Aalen, Exponential, Weibull, and Cox's proportional hazard models, along with their assumptions, applications, and model selection criteria.
Abstract
Survival analysis is a statistical approach used to analyze time-to-event data, particularly in medical research to study the time until death or other critical events. The lifelines Python library offers tools for both non-parametric methods like Kaplan-Meier and Nelson-Aalen estimators, which are distribution-free, and parametric models such as Exponential and Weibull distributions that assume specific survival functions. The Exponential model assumes a constant hazard rate, while the Weibull model allows for increasing or decreasing hazard rates over time. The Cox's proportional hazard model, a semi-parametric approach, is particularly useful for adjusting covariates and estimating hazard ratios without specifying the baseline hazard function. Model selection is guided by context, assumptions, and statistical measures such as AIC and QQ plots. The webpage also discusses the importance of validating the proportional hazard assumption in Cox models and introduces time-varying models and Aalen's additive model for more complex scenarios.
Opinions
Checkout the video version here:
Survival analysis is used for modeling and analyzing survival rate (likely to survive) and hazard rate (likely to die).
Let’s start with an example:
Here we load a dataset from the lifelines package. I am only looking at 21 observations in my example. The survival analysis dataset contains two columns: T representing durations, and E representing censoring, whether the death has observed or not. Censoring is what makes survival analysis special. There are events you haven’t observed yet but you can’t drop them from your dataset.
For example, in our dataset, for the first individual (index 34), he/she has survived until time 33, and the death was observed. But for the individual in index 39, he/she has survived at 61, but the death was not observed.
from lifelines.datasets import load_waltons
df = load_waltons()
df1 = df.iloc[34:55][['T','E']]
df1
The easiest way to estimate the survival function is through the Kaplan-Meiser Estimator. Equation is shown below .It’s basically counting how many people has died/survived at each time point. d_i represents number of deaths events at time t_i, n_i represents number of people at risk of death at time t_i.
Again, use our example of 21 data points, at time 33, one person our of 21 people died. Thus, the survival rate at time 33 is calculated as 1–1/21. At time 54, among the remaining 20 people 2 has died. At time 61, among the remaining 18, 9 has dies. At time 67, we only have 7 people remained and 6 has died. We can see that Kaplan-Meiser Estimator is very easy to understand and easy to compute even by hand.

Here we get the same results if we use the KaplanMeierFitter in lifeline. In addition to the functions below, we can get the event table from kmf.event_table , median survival time (time when 50% of the population has died) from kmf.median_survival_times , and confidence interval of the survival estimates from kmf.confidence_interval_ .
T = df1['T']
E = df1['E']
from lifelines import KaplanMeierFitter
kmf = KaplanMeierFitter()
kmf.fit(T, event_observed=E)
kmf.survival_function_
kmf.plot_survival_function()
Nelson Aalen estimator estimates hazard rate first with the following equations. d_i represents number of deaths events at time t_i, n_i represents number of people at risk of death at time t_i. We can get all the harzard rate through simple calculations shown below.

Again, we can easily use lifeline to get the same results.
from lifelines import NelsonAalenFitter
naf = NelsonAalenFitter(nelson_aalen_smoothing=False)
naf.fit(T, event_observed=E)
Kaplan-Meier and Nelson-Aalen models are non-parametic. They are simple to interpret, but no functional form, so that we can’t model a distribution function with it. This is where the exponential model comes handy.
Exponential distribution is based on the poisson process, where the event occur continuously and independently with a constant event rate 𝜆. Exponential distribution models how much time needed until an event occurs with the pdf 𝑓(𝑡)=𝜆𝑒xp(−𝜆𝑡) and cdf 𝐹(𝑡)=𝑝(𝑇≤𝑡)=1−𝑒xp(−𝜆𝑡). The cdf of the exponential model indicates the probability not surviving pass time t, but the survival function is the opposite. Thus, for survival function:
Note that lifelines use the reciprocal of 𝜆, which doesn’t really matter. We can see that the exponential model smoothes out the survival function.

Exponential distribution is a special case of the Weibull distribution: x~exp(𝜆)~ Weibull (1/𝜆,1).
The cdf of the Weibull distribution is 𝐹(𝑡)=1−exp(−(𝑡/𝜆)𝜌)
Again, we can write the survival function as 1-F(t):

We talked about four types of univariate models: Kaplan-Meier and Nelson-Aalen models are non-parametric models, Exponential and Weibull models are parametric models. There are a lot more other types of parametric models. Which model do we select largely depends on the context and your assumptions. Statistically, we can use QQ plots and AIC to see which model fits the data better. More info see https://lifelines.readthedocs.io/en/latest/Examples.html#selecting-a-parametric-model-using-qq-plots.
The general function of survival regression can be written as:
hazard = exp(𝑏0+𝑏1𝑥1+𝑏2𝑥2…𝑏𝑘𝑥𝑘),
which represents that hazard is a function of Xs.
Exponential survival regression is when 𝑏0 is constant.
Cox’s proportional hazard model is when 𝑏0 becomes 𝑙𝑛(𝑏0(𝑡)), which means the baseline hazard is a function of time.
ℎ(𝑡|𝑥)=𝑏0(𝑡)𝑒𝑥𝑝(∑𝑏𝑖𝑥𝑖)
check: predicting censor by Xs
2. survival times (t) are independent
3. ln(hazard) is linear function of numeric Xs.
4. Values of Xs don’t change over time.
5. Harzards are proportional. Hazard ratio between two subjects is constant.
The most important assumption of Cox’s proportional hazard model is the proportional hazard assumption. But we may not need to care about the proportional hazard assumption. As mentioned in Stensrud (2020), “There are legitimate reasons to assume that all datasets will violate the proportional hazards assumption”.
Here is an example of the Cox’s proportional hazard model directly from the lifelines webpage (https://lifelines.readthedocs.io/en/latest/Survival%20Regression.html). One thing to note is the exp(coef) , which is called the hazard ratio. The exp(coef) of marriage is 0.65, which means that for at any given time, married subjects are 0.65 times as likely to dies as unmarried subjects.

Finally, if the features vary over time, we need to use time varying models, which are more computational taxing but easy to implement in lifelines.
ℎ(𝑡|𝑥)=𝑏0(𝑡)+𝑏1(𝑡)𝑥1+…𝑏𝑁(𝑡)𝑥𝑁
ℎ(𝑡|𝑥)=𝑏0(𝑡)𝑒𝑥𝑝(∑𝛽𝑖(𝑥𝑖(𝑡)))
Survival probability calibration plot
The survival probability calibration plot compares simulated data based on your model and the observed data. It provides a straightforward view on how your model fit and deviate from the real data. This is implemented in lifelines lifelines.survival_probability_calibration function.
Compare model fit statistics
We can run multiple models and compare the model fit statistics (i.e., AIC, log-likelihood, and concordance). Model with a smaller AIC score, a larger log-likelihood, and larger concordance index is the better model.
Out-of-sample validation
We can also evaluate model fit with the out-of-sample data. Here we can investigate the out-of-sample log-likelihood values. This is especially useful when we tune the parameters of a certain model.
Within-sample validation
AIC is used when we evaluate model fit with the within-sample validation. Again smaller AIC value is better.
Cross validation
K-folds cross validation is also great at evaluating model fit. This is implemented in lifelines lifelines.utils.k_fold_cross_validation function.
https://www.youtube.com/watch?v=vX3l36ptrTU https://lifelines.readthedocs.io/ https://stats.stackexchange.com/questions/64739/in-survival-analysis-why-do-we-use-semi-parametric-models-cox-proportional-haz https://stats.stackexchange.com/questions/399544/in-survival-analysis-when-should-we-use-fully-parametric-models-over-semi-param https://jamanetwork.com/journals/jama/article-abstract/2763185 Stensrud MJ, Hernán MA. Why Test for Proportional Hazards? JAMA. Published online March 13, 2020. doi:10.1001/jama.2020.1267
A demonstration of coding the MACD indicator using Microsoft stock data as example through Google Colab.
Emma BoudreauSome common pitfalls in feature selection and feature engineering, and how to get around them
Abdur Rahman5 Reasons Why Python is Losing Its Crown
Chris Kuo/Dr. DatamanIts application on stock market prices