We aim to introduce you to the basic R functionality and R packages for analysing survival data. We will use the following:
Survival analysis deals with duration of time until an event (death, myocardial infarction, surgery). However, if the event is not occuring within the observed timeframe of the study period, the patients are registred as censored observations.
Censoring may arise in the following ways:
Survival probability and hazard probability are two related probabilities used to describe survival data.
We will use two new R packages for this case study.
survival for computing survival analyses survminer for summarizing and visualizing the results of survival analysis
install.packages("survival")
install.packages("survminer")
library(readxl) # Reading Excel files.
library(tibble) # The "tibble" datastructure.
library(dplyr) # Working with the tibble datastructure.
library(ggplot2) # Plotting.
library(lubridate) # Working with datetime objects.
library(magrittr) # Need this for the %>% operator.
library(devtools) # For printing the session info at the end of the notebook.
library(survival)
library(survminer)
library(tidytidbits)
library(survivalAnalysis)
library(ranger)
library(ggfortify)
library(broom)
We will use data from patients with primary biliary cirrohosis (PBC) that was randomized for treatment with ursodeoxycholic acid (UDCA).
The data is part of the package “survival” as illustrative data.
References T. M. Therneau and P. M. Grambsch, Modeling survival data: extending the Cox model. Springer, 2000. K. D. Lindor, E. R. Dickson, W. P Baldus, R.A. Jorgensen, J. Ludwig, P. A. Murtaugh, J. M. Harrison, R. H. Weisner, M. L. Anderson, S. M. Lange, G. LeSage, S. S. Rossi and A. F. Hofman. Ursodeoxycholic acid in the treatment of primary biliary cirrhosis. Gastroenterology, 106:1284- 1290, 1994.
data(udca)
This will load three different data sets (udca, udca1 and udca2). udca is the most basic data. udca1 contains all the baseline variables and variables describing time until the first endpoints. We will use udca1.
head(udca1)
## id trt stage bili riskscore futime status
## 1 1 1 1 1.0 5.1 1896 0
## 2 2 0 1 1.7 4.2 1456 1
## 3 3 1 0 0.5 3.4 1892 0
## 4 4 1 1 1.4 5.0 343 0
## 5 5 0 1 1.1 4.3 1875 0
## 6 6 1 1 1.4 5.9 768 1
A data frame with 170 observations on the following 15 variables.
id subject identifier trt treatment of 0=placebo, 1=UDCA entry.dt date of entry into the study last.dt date of last on-study visit stage stage of disease bili bilirubin value at entry riskscore the Mayo PBC risk score at entry death.dt date of death tx.dt date of liver transplant hprogress.dt date of histologic progression varices.dt appearance of esphogeal varices ascites.dt appearance of ascites enceph.dt appearance of encephalopathy double.dt doubling of initial bilirubin worsen.dt worsening of symptoms by two stages
The Kaplan-Meier is a non-parametric method used to estimate the survival probability from observed survival times (Kaplan and Meier, 1958).
We will use the survfit function to calculate survival times and load them in “fit”.
fit <- survfit(Surv(futime, status)
~ trt, data = udca1)
print(fit)
## Call: survfit(formula = Surv(futime, status) ~ trt, data = udca1)
##
## n events median 0.95LCL 0.95UCL
## trt=0 84 45 992 768 NA
## trt=1 86 27 NA 1508 NA
survfit is a function from the survival package that contains the core survival analysis Surv difines the survival objects. trt is not an obligatory input, however because we want to see the data stratified by treatment we will use it.
To display the full survival table type:
summary(fit)
## Call: survfit(formula = Surv(futime, status) ~ trt, data = udca1)
##
## trt=0
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 47 83 1 0.988 0.0120 0.965 1.000
## 189 81 1 0.976 0.0169 0.943 1.000
## 272 78 1 0.963 0.0208 0.923 1.000
## 340 76 1 0.951 0.0241 0.904 0.999
## 351 75 1 0.938 0.0269 0.887 0.992
## 362 74 1 0.925 0.0294 0.869 0.985
## 364 73 1 0.913 0.0316 0.853 0.977
## 370 72 1 0.900 0.0336 0.836 0.968
## 375 71 1 0.887 0.0354 0.820 0.959
## 377 70 1 0.875 0.0371 0.805 0.950
## 383 69 1 0.862 0.0387 0.789 0.941
## 391 68 1 0.849 0.0401 0.774 0.932
## 395 67 1 0.837 0.0415 0.759 0.922
## 499 66 1 0.824 0.0428 0.744 0.912
## 536 64 1 0.811 0.0440 0.729 0.902
## 604 61 1 0.798 0.0452 0.714 0.891
## 671 60 1 0.784 0.0464 0.699 0.881
## 685 59 1 0.771 0.0475 0.683 0.870
## 686 58 1 0.758 0.0485 0.668 0.859
## 705 57 1 0.744 0.0494 0.654 0.848
## 714 56 1 0.731 0.0503 0.639 0.837
## 717 55 1 0.718 0.0511 0.624 0.825
## 727 54 2 0.691 0.0526 0.596 0.802
## 733 52 1 0.678 0.0532 0.581 0.791
## 734 51 1 0.665 0.0538 0.567 0.779
## 737 49 1 0.651 0.0544 0.553 0.767
## 750 47 1 0.637 0.0550 0.538 0.755
## 755 46 1 0.623 0.0555 0.524 0.742
## 763 45 1 0.610 0.0560 0.509 0.730
## 768 44 1 0.596 0.0564 0.495 0.717
## 776 43 1 0.582 0.0567 0.481 0.704
## 783 42 1 0.568 0.0571 0.467 0.692
## 800 41 1 0.554 0.0573 0.452 0.679
## 810 39 1 0.540 0.0576 0.438 0.665
## 812 38 1 0.526 0.0578 0.424 0.652
## 973 33 1 0.510 0.0582 0.408 0.638
## 992 32 1 0.494 0.0585 0.392 0.623
## 1033 31 1 0.478 0.0588 0.376 0.608
## 1034 30 1 0.462 0.0589 0.360 0.593
## 1086 28 1 0.446 0.0591 0.344 0.578
## 1092 26 1 0.428 0.0592 0.327 0.562
## 1446 19 1 0.406 0.0603 0.303 0.543
## 1456 18 1 0.383 0.0610 0.281 0.524
## 1511 13 1 0.354 0.0630 0.250 0.502
##
## trt=1
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 144 85 1 0.988 0.0117 0.966 1.000
## 361 82 1 0.976 0.0166 0.944 1.000
## 462 81 1 0.964 0.0203 0.925 1.000
## 530 80 1 0.952 0.0234 0.907 0.999
## 559 79 1 0.940 0.0260 0.890 0.992
## 633 78 1 0.928 0.0283 0.874 0.985
## 683 77 1 0.916 0.0304 0.858 0.978
## 709 76 1 0.904 0.0323 0.843 0.969
## 711 75 1 0.892 0.0341 0.828 0.961
## 726 74 1 0.880 0.0357 0.813 0.953
## 731 73 1 0.868 0.0372 0.798 0.944
## 735 72 1 0.856 0.0385 0.783 0.935
## 742 71 1 0.844 0.0398 0.769 0.925
## 761 70 1 0.832 0.0411 0.755 0.916
## 768 69 1 0.820 0.0422 0.741 0.907
## 810 67 1 0.807 0.0433 0.727 0.897
## 816 66 1 0.795 0.0443 0.713 0.887
## 826 64 1 0.783 0.0454 0.699 0.877
## 833 63 1 0.770 0.0463 0.685 0.867
## 834 62 1 0.758 0.0472 0.671 0.856
## 876 61 1 0.745 0.0480 0.657 0.846
## 1087 52 1 0.731 0.0492 0.641 0.834
## 1099 51 1 0.717 0.0503 0.625 0.822
## 1336 35 1 0.696 0.0528 0.600 0.808
## 1416 29 1 0.672 0.0562 0.571 0.792
## 1453 26 1 0.646 0.0597 0.539 0.775
## 1508 20 1 0.614 0.0649 0.499 0.755
And to display a summary text, type:
summary(fit)$table
## records n.max n.start events *rmean *se(rmean) median 0.95LCL
## trt=0 84 84 84 45 1154.652 71.87860 992 768
## trt=1 86 86 86 27 1509.986 60.06917 NA 1508
## 0.95UCL
## trt=0 NA
## trt=1 NA
We will now use the function ggsurvplot (from the survminer package) to produce the kaplan meier survival curves for the two groups of subjects.
ggsurvplot(fit)
This is a rough draft and it is now possible to finetune the plot. - Add the 95% confidence limits using the argument conf.int = TRUE. - Add the number at risk by time using the risk.table. = TRUE - Add the p-value of the Log-Rank test using pval = TRUE. - Add lines at median survival using surv.median.line. = c(“none”, “hv”, “h”, “v”)
ggsurvplot(fit,
pval = TRUE, # p-value of the Log-Rank test
conf.int = TRUE, # 95% confidence limits
risk.table = TRUE, # Add risk table of patients at risk to the specified times
risk.table.col = "strata", # Change risk table color by groups
linetype = "strata", # Change line type by groups
surv.median.line = "none", # Specify median survival
)
We will try to change a few more settings:
ggsurvplot(fit,
pval = TRUE, # p-value of the Log-Rank test
conf.int = TRUE, # 95% confidence limits
risk.table = TRUE, # Add risk table of patients at risk to the specified times
risk.table.col = "strata", # Change risk table color by groups
linetype = "strata", # Change line type by groups
surv.median.line = "none", # Specify median survival
break.time.by = 200, # break x axis in time intervals by 200.
legend.labs = c("Placebo", "Treatment"), # to change the legend labels
xlab="years", # change axis label
xscale=365, # Divide the x-axis scale by a number
xlim=c(0,1850) # Change axis limits
)
To change the groups we need to change the function:
fit <- survfit(Surv(futime, status) ~ stage, data = udca1)
Remember to change legends, but the rest of the function can be reused:
ggsurvplot(fit,
pval = TRUE, # p-value of the Log-Rank test
conf.int = TRUE, # 95% confidence limits
risk.table = TRUE, # Add risk table of patients at risk to the specified times
risk.table.col = "strata", # Change risk table color by groups
linetype = "strata", # Change line type by groups
surv.median.line = "none", # Specify median survival
break.time.by = 365, # break x axis in time intervals by 200.
legend.labs = c("Stage 1", "Stage 2"), # to change the legend labels
xlab="years",
xscale=365,
xlim=c(0,1850)
)
The log-rank test is the most widely used method of comparing two or more survival curves. The null hypothesis is that there is no difference in survival between the two groups. The log rank test is a non-parametric test, which makes no assumptions about the survival distributions.
surv_diff <- survdiff(Surv(futime, status) ~ trt, data = udca1)
surv_diff
## Call:
## survdiff(formula = Surv(futime, status) ~ trt, data = udca1)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## trt=0 84 45 29.9 7.68 13.2
## trt=1 86 27 42.1 5.44 13.2
##
## Chisq= 13.2 on 1 degrees of freedom, p= 3e-04
n: the number of subjects in each group. obs: the weighted observed number of events in each group. exp: the weighted expected number of events in each group. chisq: the chisquare statistic for a test of equality.
fit1 <- coxph(Surv(futime, status) ~ trt + log(bili) + stage, data=udca1)
tidy(fit1)
## # A tibble: 3 x 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 trt -1.03 0.255 -4.03 0.0000561 -1.53 -0.528
## 2 log(bili) 0.622 0.151 4.13 0.0000370 0.327 0.918
## 3 stage -0.0665 0.290 -0.229 0.819 -0.636 0.503
A cox proportional hazards model assumes a baseline hazard function, i.e. the hazard for the reference group. Each predictor has a multiplicative effect on the hazard.
The major assumption of the Cox model is that the hazard ratio for a predictor (Zi) is constant (eβi) and does not depend on the time, i.e. the hazards in the two groups are proportional over time.
We can check whether the data are sufficiently consistent with the assumption of proportional hazards with respect to each of the variables separately as well as globally, using the cox.zph() function.
cox.zph.fit1 <- cox.zph(fit1)
cox.zph.fit1
## rho chisq p
## trt 0.09847 0.76290 0.382
## log(bili) -0.09736 0.65980 0.417
## stage -0.00497 0.00176 0.967
## GLOBAL NA 1.28916 0.732
No evidence against proportionality assumption could apparently be found.
It is possible to analyze these effects further with other packages: ggcoxzph and for diagnostic ggcoxdiagnostics.
ggcoxzph(cox.zph.fit1)
A good way to display and compare the hazard ratios from the cox regression model is a forrest plot.
The forrest plot can be displayed with a simple code from the ggplot package:
ggforest(fit1)
However, this plot can be changed a bit. Lets consider stage as a categorical factor instead of as a numerical:
udca1 <- within(udca1,
{stage <- factor(stage, labels = c("well", "poor"))
})
fit1 <- coxph(Surv(futime, status) ~ trt + log(bili) + stage
, data=udca1)
ggforest(fit1)
We can do a little bit more by adding a tittle, changing the font size, changing the positions of the text.
ggforest(fit1, main="Forest plot for Cox proportional hazard model", fontsize=0.9, cpositions = c(0.02, 0.22, 0.4), noDigits = 2)
devtools::session_info()
## ─ Session info ──────────────────────────────────────────────────────────
## setting value
## version R version 3.6.0 (2019-04-26)
## os OS X El Capitan 10.11.6
## system x86_64, darwin15.6.0
## ui X11
## language (EN)
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz Atlantic/Faroe
## date 2019-08-26
##
## ─ Packages ──────────────────────────────────────────────────────────────
## package * version date lib
## assertthat 0.2.1 2019-03-21 [1]
## backports 1.1.4 2019-04-10 [1]
## broom * 0.5.2 2019-04-07 [1]
## callr 3.2.0 2019-03-15 [1]
## cellranger 1.1.0 2016-07-27 [1]
## cli 1.1.0 2019-03-19 [1]
## colorspace 1.4-1 2019-03-18 [1]
## cowplot 0.9.4 2019-01-08 [1]
## crayon 1.3.4 2017-09-16 [1]
## data.table 1.12.2 2019-04-07 [1]
## desc 1.2.0 2018-05-01 [1]
## devtools * 2.0.2 2019-04-08 [1]
## digest 0.6.19 2019-05-20 [1]
## dplyr * 0.8.2 2019-06-29 [1]
## evaluate 0.14 2019-05-28 [1]
## fansi 0.4.0 2018-10-05 [1]
## forcats 0.4.0 2019-02-17 [1]
## fs 1.3.1 2019-05-06 [1]
## generics 0.0.2 2018-11-29 [1]
## ggfortify * 0.4.7 2019-05-26 [1]
## ggplot2 * 3.2.0 2019-06-16 [1]
## ggpubr * 0.2.1 2019-06-23 [1]
## ggsignif 0.5.0 2019-02-20 [1]
## glue 1.3.1 2019-03-12 [1]
## gridExtra 2.3 2017-09-09 [1]
## gtable 0.3.0 2019-03-25 [1]
## htmltools 0.3.6 2017-04-28 [1]
## km.ci 0.5-2 2009-08-30 [1]
## KMsurv 0.1-5 2012-12-03 [1]
## knitr 1.23 2019-05-18 [1]
## labeling 0.3 2014-08-23 [1]
## lattice 0.20-38 2018-11-04 [1]
## lazyeval 0.2.2 2019-03-15 [1]
## lubridate * 1.7.4 2018-04-11 [1]
## magrittr * 1.5 2014-11-22 [1]
## Matrix 1.2-17 2019-03-22 [1]
## memoise 1.1.0 2017-04-21 [1]
## munsell 0.5.0 2018-06-12 [1]
## nlme 3.1-139 2019-04-09 [1]
## pillar 1.4.2 2019-06-29 [1]
## pkgbuild 1.0.3 2019-03-20 [1]
## pkgconfig 2.0.2 2018-08-16 [1]
## pkgload 1.0.2 2018-10-29 [1]
## prettyunits 1.0.2 2015-07-13 [1]
## processx 3.3.1 2019-05-08 [1]
## ps 1.3.0 2018-12-21 [1]
## purrr 0.3.2 2019-03-15 [1]
## R6 2.4.0 2019-02-14 [1]
## ranger * 0.11.2 2019-03-07 [1]
## Rcpp 1.0.1 2019-03-17 [1]
## readxl * 1.3.1 2019-03-13 [1]
## remotes 2.0.4 2019-04-10 [1]
## rlang 0.4.0 2019-06-25 [1]
## rmarkdown 1.13 2019-05-22 [1]
## rprojroot 1.3-2 2018-01-03 [1]
## scales 1.0.0 2018-08-09 [1]
## sessioninfo 1.1.1 2018-11-05 [1]
## stringi 1.4.3 2019-03-12 [1]
## stringr 1.4.0 2019-02-10 [1]
## survival * 2.44-1.1 2019-04-01 [1]
## survivalAnalysis * 0.1.1 2019-02-13 [1]
## survminer * 0.4.4.999 2019-07-03 [1]
## survMisc 0.5.5 2018-07-05 [1]
## tibble * 2.1.3 2019-06-06 [1]
## tidyr 0.8.3 2019-03-01 [1]
## tidyselect 0.2.5 2018-10-11 [1]
## tidytidbits * 0.2.1 2019-07-01 [1]
## usethis * 1.5.0 2019-04-07 [1]
## utf8 1.1.4 2018-05-24 [1]
## vctrs 0.1.0 2018-11-29 [1]
## withr 2.1.2 2018-03-15 [1]
## xfun 0.8 2019-06-25 [1]
## xtable 1.8-4 2019-04-21 [1]
## yaml 2.2.0 2018-07-25 [1]
## zeallot 0.1.0 2018-01-28 [1]
## zoo 1.8-6 2019-05-28 [1]
## source
## CRAN (R 3.6.0)
## CRAN (R 3.6.0)
## CRAN (R 3.6.0)
## CRAN (R 3.6.0)
## CRAN (R 3.6.0)
## CRAN (R 3.6.0)
## CRAN (R 3.6.0)
## CRAN (R 3.6.0)
## CRAN (R 3.6.0)
## CRAN (R 3.6.0)
## CRAN (R 3.6.0)
## CRAN (R 3.6.0)
## CRAN (R 3.6.0)
## CRAN (R 3.6.0)
## CRAN (R 3.6.0)
## CRAN (R 3.6.0)
## CRAN (R 3.6.0)
## CRAN (R 3.6.0)
## CRAN (R 3.6.0)
## CRAN (R 3.6.0)
## CRAN (R 3.6.0)
## CRAN (R 3.6.0)
## CRAN (R 3.6.0)
## CRAN (R 3.6.0)
## CRAN (R 3.6.0)
## CRAN (R 3.6.0)
## CRAN (R 3.6.0)
## CRAN (R 3.6.0)
## CRAN (R 3.6.0)
## CRAN (R 3.6.0)
## CRAN (R 3.6.0)
## CRAN (R 3.6.0)
## CRAN (R 3.6.0)
## CRAN (R 3.6.0)
## CRAN (R 3.6.0)
## CRAN (R 3.6.0)
## CRAN (R 3.6.0)
## CRAN (R 3.6.0)
## CRAN (R 3.6.0)
## CRAN (R 3.6.0)
## CRAN (R 3.6.0)
## CRAN (R 3.6.0)
## CRAN (R 3.6.0)
## CRAN (R 3.6.0)
## CRAN (R 3.6.0)
## CRAN (R 3.6.0)
## CRAN (R 3.6.0)
## CRAN (R 3.6.0)
## CRAN (R 3.6.0)
## CRAN (R 3.6.0)
## CRAN (R 3.6.0)
## CRAN (R 3.6.0)
## CRAN (R 3.6.0)
## CRAN (R 3.6.0)
## CRAN (R 3.6.0)
## CRAN (R 3.6.0)
## CRAN (R 3.6.0)
## CRAN (R 3.6.0)
## CRAN (R 3.6.0)
## CRAN (R 3.6.0)
## CRAN (R 3.6.0)
## Github (kassambara/survminer@e7edc09)
## CRAN (R 3.6.0)
## CRAN (R 3.6.0)
## CRAN (R 3.6.0)
## CRAN (R 3.6.0)
## CRAN (R 3.6.0)
## CRAN (R 3.6.0)
## CRAN (R 3.6.0)
## CRAN (R 3.6.0)
## CRAN (R 3.6.0)
## CRAN (R 3.6.0)
## CRAN (R 3.6.0)
## CRAN (R 3.6.0)
## CRAN (R 3.6.0)
## CRAN (R 3.6.0)
##
## [1] /Library/Frameworks/R.framework/Versions/3.6/Resources/library