Objectives

We aim to introduce you to the basic R functionality and R packages for analysing survival data. We will use the following:

Basics of survival data

Censoring

Survival analysis deals with duration of time until an event (death, myocardial infarction, surgery). However, if the event is not occuring within the observed timeframe of the study period, the patients are registred as censored observations.

Censoring may arise in the following ways:

  • a patient has not (yet) experienced the event of interest, such as death, within the study time period;
  • a patient is lost to follow-up during the study period;
  • a patient experiences a different event that makes further follow-up impossible. This type of censoring, named right censoring, is handled in survival analysis.

Survival and hazard functions

Survival probability and hazard probability are two related probabilities used to describe survival data.

  • Survival probability:
  • the probability that an individual survives from the time origin to a specified future time.
  • Hazard
  • the probability that an individual who is under observation at a time has an event at that time.

Install and load required R package

We will use two new R packages for this case study.

survival for computing survival analyses survminer for summarizing and visualizing the results of survival analysis

install.packages("survival")
install.packages("survminer")
library(readxl)  # Reading Excel files.
library(tibble)  # The "tibble" datastructure.
library(dplyr)  # Working with the tibble datastructure.
library(ggplot2)  # Plotting.
library(lubridate)  # Working with datetime objects.
library(magrittr)  # Need this for the %>% operator.
library(devtools)  # For printing the session info at the end of the notebook.
library(survival)
library(survminer)
library(tidytidbits)
library(survivalAnalysis)
library(ranger)
library(ggfortify)
library(broom)

Loading a dataset

We will use data from patients with primary biliary cirrohosis (PBC) that was randomized for treatment with ursodeoxycholic acid (UDCA).

The data is part of the package “survival” as illustrative data.

References T. M. Therneau and P. M. Grambsch, Modeling survival data: extending the Cox model. Springer, 2000. K. D. Lindor, E. R. Dickson, W. P Baldus, R.A. Jorgensen, J. Ludwig, P. A. Murtaugh, J. M. Harrison, R. H. Weisner, M. L. Anderson, S. M. Lange, G. LeSage, S. S. Rossi and A. F. Hofman. Ursodeoxycholic acid in the treatment of primary biliary cirrhosis. Gastroenterology, 106:1284- 1290, 1994.

data(udca)

This will load three different data sets (udca, udca1 and udca2). udca is the most basic data. udca1 contains all the baseline variables and variables describing time until the first endpoints. We will use udca1.

checking the variables and data formats

head(udca1)
##   id trt stage bili riskscore futime status
## 1  1   1     1  1.0       5.1   1896      0
## 2  2   0     1  1.7       4.2   1456      1
## 3  3   1     0  0.5       3.4   1892      0
## 4  4   1     1  1.4       5.0    343      0
## 5  5   0     1  1.1       4.3   1875      0
## 6  6   1     1  1.4       5.9    768      1

A data frame with 170 observations on the following 15 variables.

id subject identifier trt treatment of 0=placebo, 1=UDCA entry.dt date of entry into the study last.dt date of last on-study visit stage stage of disease bili bilirubin value at entry riskscore the Mayo PBC risk score at entry death.dt date of death tx.dt date of liver transplant hprogress.dt date of histologic progression varices.dt appearance of esphogeal varices ascites.dt appearance of ascites enceph.dt appearance of encephalopathy double.dt doubling of initial bilirubin worsen.dt worsening of symptoms by two stages

Kaplan-Meier survival curve

The Kaplan-Meier is a non-parametric method used to estimate the survival probability from observed survival times (Kaplan and Meier, 1958).

We will use the survfit function to calculate survival times and load them in “fit”.

fit <- survfit(Surv(futime, status) 
               ~ trt, data = udca1)
print(fit)
## Call: survfit(formula = Surv(futime, status) ~ trt, data = udca1)
## 
##        n events median 0.95LCL 0.95UCL
## trt=0 84     45    992     768      NA
## trt=1 86     27     NA    1508      NA

survfit is a function from the survival package that contains the core survival analysis Surv difines the survival objects. trt is not an obligatory input, however because we want to see the data stratified by treatment we will use it.

To display the full survival table type:

summary(fit)
## Call: survfit(formula = Surv(futime, status) ~ trt, data = udca1)
## 
##                 trt=0 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##    47     83       1    0.988  0.0120        0.965        1.000
##   189     81       1    0.976  0.0169        0.943        1.000
##   272     78       1    0.963  0.0208        0.923        1.000
##   340     76       1    0.951  0.0241        0.904        0.999
##   351     75       1    0.938  0.0269        0.887        0.992
##   362     74       1    0.925  0.0294        0.869        0.985
##   364     73       1    0.913  0.0316        0.853        0.977
##   370     72       1    0.900  0.0336        0.836        0.968
##   375     71       1    0.887  0.0354        0.820        0.959
##   377     70       1    0.875  0.0371        0.805        0.950
##   383     69       1    0.862  0.0387        0.789        0.941
##   391     68       1    0.849  0.0401        0.774        0.932
##   395     67       1    0.837  0.0415        0.759        0.922
##   499     66       1    0.824  0.0428        0.744        0.912
##   536     64       1    0.811  0.0440        0.729        0.902
##   604     61       1    0.798  0.0452        0.714        0.891
##   671     60       1    0.784  0.0464        0.699        0.881
##   685     59       1    0.771  0.0475        0.683        0.870
##   686     58       1    0.758  0.0485        0.668        0.859
##   705     57       1    0.744  0.0494        0.654        0.848
##   714     56       1    0.731  0.0503        0.639        0.837
##   717     55       1    0.718  0.0511        0.624        0.825
##   727     54       2    0.691  0.0526        0.596        0.802
##   733     52       1    0.678  0.0532        0.581        0.791
##   734     51       1    0.665  0.0538        0.567        0.779
##   737     49       1    0.651  0.0544        0.553        0.767
##   750     47       1    0.637  0.0550        0.538        0.755
##   755     46       1    0.623  0.0555        0.524        0.742
##   763     45       1    0.610  0.0560        0.509        0.730
##   768     44       1    0.596  0.0564        0.495        0.717
##   776     43       1    0.582  0.0567        0.481        0.704
##   783     42       1    0.568  0.0571        0.467        0.692
##   800     41       1    0.554  0.0573        0.452        0.679
##   810     39       1    0.540  0.0576        0.438        0.665
##   812     38       1    0.526  0.0578        0.424        0.652
##   973     33       1    0.510  0.0582        0.408        0.638
##   992     32       1    0.494  0.0585        0.392        0.623
##  1033     31       1    0.478  0.0588        0.376        0.608
##  1034     30       1    0.462  0.0589        0.360        0.593
##  1086     28       1    0.446  0.0591        0.344        0.578
##  1092     26       1    0.428  0.0592        0.327        0.562
##  1446     19       1    0.406  0.0603        0.303        0.543
##  1456     18       1    0.383  0.0610        0.281        0.524
##  1511     13       1    0.354  0.0630        0.250        0.502
## 
##                 trt=1 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##   144     85       1    0.988  0.0117        0.966        1.000
##   361     82       1    0.976  0.0166        0.944        1.000
##   462     81       1    0.964  0.0203        0.925        1.000
##   530     80       1    0.952  0.0234        0.907        0.999
##   559     79       1    0.940  0.0260        0.890        0.992
##   633     78       1    0.928  0.0283        0.874        0.985
##   683     77       1    0.916  0.0304        0.858        0.978
##   709     76       1    0.904  0.0323        0.843        0.969
##   711     75       1    0.892  0.0341        0.828        0.961
##   726     74       1    0.880  0.0357        0.813        0.953
##   731     73       1    0.868  0.0372        0.798        0.944
##   735     72       1    0.856  0.0385        0.783        0.935
##   742     71       1    0.844  0.0398        0.769        0.925
##   761     70       1    0.832  0.0411        0.755        0.916
##   768     69       1    0.820  0.0422        0.741        0.907
##   810     67       1    0.807  0.0433        0.727        0.897
##   816     66       1    0.795  0.0443        0.713        0.887
##   826     64       1    0.783  0.0454        0.699        0.877
##   833     63       1    0.770  0.0463        0.685        0.867
##   834     62       1    0.758  0.0472        0.671        0.856
##   876     61       1    0.745  0.0480        0.657        0.846
##  1087     52       1    0.731  0.0492        0.641        0.834
##  1099     51       1    0.717  0.0503        0.625        0.822
##  1336     35       1    0.696  0.0528        0.600        0.808
##  1416     29       1    0.672  0.0562        0.571        0.792
##  1453     26       1    0.646  0.0597        0.539        0.775
##  1508     20       1    0.614  0.0649        0.499        0.755

And to display a summary text, type:

summary(fit)$table
##       records n.max n.start events   *rmean *se(rmean) median 0.95LCL
## trt=0      84    84      84     45 1154.652   71.87860    992     768
## trt=1      86    86      86     27 1509.986   60.06917     NA    1508
##       0.95UCL
## trt=0      NA
## trt=1      NA

We will now use the function ggsurvplot (from the survminer package) to produce the kaplan meier survival curves for the two groups of subjects.

ggsurvplot(fit)

This is a rough draft and it is now possible to finetune the plot. - Add the 95% confidence limits using the argument conf.int = TRUE. - Add the number at risk by time using the risk.table. = TRUE - Add the p-value of the Log-Rank test using pval = TRUE. - Add lines at median survival using surv.median.line. = c(“none”, “hv”, “h”, “v”)

ggsurvplot(fit,
          pval = TRUE, # p-value of the Log-Rank test
          conf.int = TRUE, # 95% confidence limits
          risk.table = TRUE, # Add risk table of patients at risk to the specified times
          risk.table.col = "strata", # Change risk table color by groups
          linetype = "strata", # Change line type by groups
          surv.median.line = "none", # Specify median survival
          )

We will try to change a few more settings:

ggsurvplot(fit,
  pval = TRUE, # p-value of the Log-Rank test
  conf.int = TRUE, # 95% confidence limits
  risk.table = TRUE, # Add risk table of patients at risk to the specified times
  risk.table.col = "strata", # Change risk table color by groups
  linetype = "strata", # Change line type by groups
  surv.median.line = "none", # Specify median survival
  break.time.by = 200, # break x axis in time intervals by 200.
  legend.labs = c("Placebo", "Treatment"), # to change the legend labels
  xlab="years", # change axis label
  xscale=365, # Divide the x-axis scale by a number
  xlim=c(0,1850) # Change axis limits
)

To change the groups we need to change the function:

fit <- survfit(Surv(futime, status) ~ stage, data = udca1)

Remember to change legends, but the rest of the function can be reused:

ggsurvplot(fit,
pval = TRUE, # p-value of the Log-Rank test
conf.int = TRUE, # 95% confidence limits
risk.table = TRUE, # Add risk table of patients at risk to the specified times
risk.table.col = "strata", # Change risk table color by groups
linetype = "strata", # Change line type by groups
surv.median.line = "none", # Specify median survival
break.time.by = 365, # break x axis in time intervals by 200.
legend.labs = c("Stage 1", "Stage 2"), # to change the legend labels
xlab="years",
xscale=365,
xlim=c(0,1850)
)

Log-Rank test comparing survival curves

The log-rank test is the most widely used method of comparing two or more survival curves. The null hypothesis is that there is no difference in survival between the two groups. The log rank test is a non-parametric test, which makes no assumptions about the survival distributions.

surv_diff <- survdiff(Surv(futime, status) ~ trt, data = udca1)
surv_diff
## Call:
## survdiff(formula = Surv(futime, status) ~ trt, data = udca1)
## 
##        N Observed Expected (O-E)^2/E (O-E)^2/V
## trt=0 84       45     29.9      7.68      13.2
## trt=1 86       27     42.1      5.44      13.2
## 
##  Chisq= 13.2  on 1 degrees of freedom, p= 3e-04

n: the number of subjects in each group. obs: the weighted observed number of events in each group. exp: the weighted expected number of events in each group. chisq: the chisquare statistic for a test of equality.

Cox regression analysis

fit1 <- coxph(Surv(futime, status) ~ trt + log(bili) + stage, data=udca1)

tidy(fit1)
## # A tibble: 3 x 7
##   term      estimate std.error statistic   p.value conf.low conf.high
##   <chr>        <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
## 1 trt        -1.03       0.255    -4.03  0.0000561   -1.53     -0.528
## 2 log(bili)   0.622      0.151     4.13  0.0000370    0.327     0.918
## 3 stage      -0.0665     0.290    -0.229 0.819       -0.636     0.503

A cox proportional hazards model assumes a baseline hazard function, i.e. the hazard for the reference group. Each predictor has a multiplicative effect on the hazard.

The major assumption of the Cox model is that the hazard ratio for a predictor (Zi) is constant (eβi) and does not depend on the time, i.e. the hazards in the two groups are proportional over time.

We can check whether the data are sufficiently consistent with the assumption of proportional hazards with respect to each of the variables separately as well as globally, using the cox.zph() function.

cox.zph.fit1 <- cox.zph(fit1)
cox.zph.fit1
##                rho   chisq     p
## trt        0.09847 0.76290 0.382
## log(bili) -0.09736 0.65980 0.417
## stage     -0.00497 0.00176 0.967
## GLOBAL          NA 1.28916 0.732

No evidence against proportionality assumption could apparently be found.

It is possible to analyze these effects further with other packages: ggcoxzph and for diagnostic ggcoxdiagnostics.

ggcoxzph(cox.zph.fit1)

Forrest plots

A good way to display and compare the hazard ratios from the cox regression model is a forrest plot.

The forrest plot can be displayed with a simple code from the ggplot package:

ggforest(fit1)

However, this plot can be changed a bit. Lets consider stage as a categorical factor instead of as a numerical:

udca1 <- within(udca1, 
                {stage <- factor(stage, labels = c("well", "poor"))
                })

fit1 <- coxph(Surv(futime, status) ~ trt + log(bili) + stage
              , data=udca1)

ggforest(fit1)

We can do a little bit more by adding a tittle, changing the font size, changing the positions of the text.

ggforest(fit1, main="Forest plot for Cox proportional hazard model", fontsize=0.9, cpositions = c(0.02, 0.22, 0.4), noDigits = 2)

devtools::session_info()
## ─ Session info ──────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 3.6.0 (2019-04-26)
##  os       OS X El Capitan 10.11.6     
##  system   x86_64, darwin15.6.0        
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  ctype    en_US.UTF-8                 
##  tz       Atlantic/Faroe              
##  date     2019-08-26                  
## 
## ─ Packages ──────────────────────────────────────────────────────────────
##  package          * version   date       lib
##  assertthat         0.2.1     2019-03-21 [1]
##  backports          1.1.4     2019-04-10 [1]
##  broom            * 0.5.2     2019-04-07 [1]
##  callr              3.2.0     2019-03-15 [1]
##  cellranger         1.1.0     2016-07-27 [1]
##  cli                1.1.0     2019-03-19 [1]
##  colorspace         1.4-1     2019-03-18 [1]
##  cowplot            0.9.4     2019-01-08 [1]
##  crayon             1.3.4     2017-09-16 [1]
##  data.table         1.12.2    2019-04-07 [1]
##  desc               1.2.0     2018-05-01 [1]
##  devtools         * 2.0.2     2019-04-08 [1]
##  digest             0.6.19    2019-05-20 [1]
##  dplyr            * 0.8.2     2019-06-29 [1]
##  evaluate           0.14      2019-05-28 [1]
##  fansi              0.4.0     2018-10-05 [1]
##  forcats            0.4.0     2019-02-17 [1]
##  fs                 1.3.1     2019-05-06 [1]
##  generics           0.0.2     2018-11-29 [1]
##  ggfortify        * 0.4.7     2019-05-26 [1]
##  ggplot2          * 3.2.0     2019-06-16 [1]
##  ggpubr           * 0.2.1     2019-06-23 [1]
##  ggsignif           0.5.0     2019-02-20 [1]
##  glue               1.3.1     2019-03-12 [1]
##  gridExtra          2.3       2017-09-09 [1]
##  gtable             0.3.0     2019-03-25 [1]
##  htmltools          0.3.6     2017-04-28 [1]
##  km.ci              0.5-2     2009-08-30 [1]
##  KMsurv             0.1-5     2012-12-03 [1]
##  knitr              1.23      2019-05-18 [1]
##  labeling           0.3       2014-08-23 [1]
##  lattice            0.20-38   2018-11-04 [1]
##  lazyeval           0.2.2     2019-03-15 [1]
##  lubridate        * 1.7.4     2018-04-11 [1]
##  magrittr         * 1.5       2014-11-22 [1]
##  Matrix             1.2-17    2019-03-22 [1]
##  memoise            1.1.0     2017-04-21 [1]
##  munsell            0.5.0     2018-06-12 [1]
##  nlme               3.1-139   2019-04-09 [1]
##  pillar             1.4.2     2019-06-29 [1]
##  pkgbuild           1.0.3     2019-03-20 [1]
##  pkgconfig          2.0.2     2018-08-16 [1]
##  pkgload            1.0.2     2018-10-29 [1]
##  prettyunits        1.0.2     2015-07-13 [1]
##  processx           3.3.1     2019-05-08 [1]
##  ps                 1.3.0     2018-12-21 [1]
##  purrr              0.3.2     2019-03-15 [1]
##  R6                 2.4.0     2019-02-14 [1]
##  ranger           * 0.11.2    2019-03-07 [1]
##  Rcpp               1.0.1     2019-03-17 [1]
##  readxl           * 1.3.1     2019-03-13 [1]
##  remotes            2.0.4     2019-04-10 [1]
##  rlang              0.4.0     2019-06-25 [1]
##  rmarkdown          1.13      2019-05-22 [1]
##  rprojroot          1.3-2     2018-01-03 [1]
##  scales             1.0.0     2018-08-09 [1]
##  sessioninfo        1.1.1     2018-11-05 [1]
##  stringi            1.4.3     2019-03-12 [1]
##  stringr            1.4.0     2019-02-10 [1]
##  survival         * 2.44-1.1  2019-04-01 [1]
##  survivalAnalysis * 0.1.1     2019-02-13 [1]
##  survminer        * 0.4.4.999 2019-07-03 [1]
##  survMisc           0.5.5     2018-07-05 [1]
##  tibble           * 2.1.3     2019-06-06 [1]
##  tidyr              0.8.3     2019-03-01 [1]
##  tidyselect         0.2.5     2018-10-11 [1]
##  tidytidbits      * 0.2.1     2019-07-01 [1]
##  usethis          * 1.5.0     2019-04-07 [1]
##  utf8               1.1.4     2018-05-24 [1]
##  vctrs              0.1.0     2018-11-29 [1]
##  withr              2.1.2     2018-03-15 [1]
##  xfun               0.8       2019-06-25 [1]
##  xtable             1.8-4     2019-04-21 [1]
##  yaml               2.2.0     2018-07-25 [1]
##  zeallot            0.1.0     2018-01-28 [1]
##  zoo                1.8-6     2019-05-28 [1]
##  source                               
##  CRAN (R 3.6.0)                       
##  CRAN (R 3.6.0)                       
##  CRAN (R 3.6.0)                       
##  CRAN (R 3.6.0)                       
##  CRAN (R 3.6.0)                       
##  CRAN (R 3.6.0)                       
##  CRAN (R 3.6.0)                       
##  CRAN (R 3.6.0)                       
##  CRAN (R 3.6.0)                       
##  CRAN (R 3.6.0)                       
##  CRAN (R 3.6.0)                       
##  CRAN (R 3.6.0)                       
##  CRAN (R 3.6.0)                       
##  CRAN (R 3.6.0)                       
##  CRAN (R 3.6.0)                       
##  CRAN (R 3.6.0)                       
##  CRAN (R 3.6.0)                       
##  CRAN (R 3.6.0)                       
##  CRAN (R 3.6.0)                       
##  CRAN (R 3.6.0)                       
##  CRAN (R 3.6.0)                       
##  CRAN (R 3.6.0)                       
##  CRAN (R 3.6.0)                       
##  CRAN (R 3.6.0)                       
##  CRAN (R 3.6.0)                       
##  CRAN (R 3.6.0)                       
##  CRAN (R 3.6.0)                       
##  CRAN (R 3.6.0)                       
##  CRAN (R 3.6.0)                       
##  CRAN (R 3.6.0)                       
##  CRAN (R 3.6.0)                       
##  CRAN (R 3.6.0)                       
##  CRAN (R 3.6.0)                       
##  CRAN (R 3.6.0)                       
##  CRAN (R 3.6.0)                       
##  CRAN (R 3.6.0)                       
##  CRAN (R 3.6.0)                       
##  CRAN (R 3.6.0)                       
##  CRAN (R 3.6.0)                       
##  CRAN (R 3.6.0)                       
##  CRAN (R 3.6.0)                       
##  CRAN (R 3.6.0)                       
##  CRAN (R 3.6.0)                       
##  CRAN (R 3.6.0)                       
##  CRAN (R 3.6.0)                       
##  CRAN (R 3.6.0)                       
##  CRAN (R 3.6.0)                       
##  CRAN (R 3.6.0)                       
##  CRAN (R 3.6.0)                       
##  CRAN (R 3.6.0)                       
##  CRAN (R 3.6.0)                       
##  CRAN (R 3.6.0)                       
##  CRAN (R 3.6.0)                       
##  CRAN (R 3.6.0)                       
##  CRAN (R 3.6.0)                       
##  CRAN (R 3.6.0)                       
##  CRAN (R 3.6.0)                       
##  CRAN (R 3.6.0)                       
##  CRAN (R 3.6.0)                       
##  CRAN (R 3.6.0)                       
##  CRAN (R 3.6.0)                       
##  Github (kassambara/survminer@e7edc09)
##  CRAN (R 3.6.0)                       
##  CRAN (R 3.6.0)                       
##  CRAN (R 3.6.0)                       
##  CRAN (R 3.6.0)                       
##  CRAN (R 3.6.0)                       
##  CRAN (R 3.6.0)                       
##  CRAN (R 3.6.0)                       
##  CRAN (R 3.6.0)                       
##  CRAN (R 3.6.0)                       
##  CRAN (R 3.6.0)                       
##  CRAN (R 3.6.0)                       
##  CRAN (R 3.6.0)                       
##  CRAN (R 3.6.0)                       
##  CRAN (R 3.6.0)                       
## 
## [1] /Library/Frameworks/R.framework/Versions/3.6/Resources/library