Survival Models

Survival analysis is concerned with looking at the amount of time that elapses before an event occurs. An obvious application is to look at mortality statistics (predicting how long people live), but it can also be applied to mechanical systems (the time before a failure occurs), marketing (the amount of time before a consumer cancels an account), or other areas.

In R, there are a variety of functions in the survival library for modeling survival data.

To estimate a survival curve for censored data, you can use the survfit function:

library(survival)
survfit(formula, data, weights, subset, na.action, etype, id, ...)

This function accepts the following arguments.

ArgumentDescription
formulaDescribes the relationship between the response value and the predictors. The response value should be a Surv object.
dataThe data frame in which to evaluate formula.
weightsWeights for observations.
subsetSubset of observation to use in fitting the model.
na.actionFunction to deal with missing values.
etypeThe variable giving the type of event.
idThe variable that identifies individual subjects.
typeSpecifies the type of survival curve. Options include "kaplan-meier", "fleming-harrington", and "fh2".
errorSpecifies the type of error. Possible values are "greenwood" for the Greenwood formula or "tsiatis" for the Tsiatis formula.
conf.typeConfidence interval type. One of "none", "plain", "log" (the default), or "log-log".
conf.lowerA character string to specify modified lower limits to the curve; the upper limit remains unchanged. Possible values are "usual" (unmodified), "peto", and "modified".
start.timeNumeric value specifying a time to start calculating survival information.
conf.intThe level for a two-sided confidence interval on the survival curve(s).
se.fitA logical value indicating whether standard errors should be computed.
...Additional variables passed to internal functions.

As an example, let’s fit a survival curve for the GSE2034 data set. This data comes from the Gene Expression Omnibus of the National Center for Biotechnology Information (NCBI), which is accessible from http://www.ncbi.nlm.nih.gov/geo/. The experiment examined how the expression of certain genes affected breast cancer relapse-free survival time. In particular, it tested estrogen receptor binding sites. (We’ll revisit this example in Chapter 25.)

First, we need to create a Surv object within the data frame. A Surv object is an R object for representing survival information, in particular, censored data. Censored data occurs when the outcome of the experiment is not known for all observations. In this case, the data is censored. There are three possible outcomes for each observation: the subject had a recurrence of the disease, the subject died without having a recurrence of the disease, or the subject was still alive without a recurrence at the time the data was reported. The last outcome—the subject was still alive without a recurrence—results in the censored values:

> library(survival)
> GSE2034.Surv <- transform(GSE2034,
+   surv=Surv(
+     time=GSE2034$months.to.relapse.or.last.followup,
+     event=GSE2034$relapse,
+     type="right"
+   )
+ )
> # show the first 26 observations:
> GSE2034.Surv$surv[1:26,]
 [1] 101+ 118+   9  106+  37  125+ 109+  14   99+ 137+  34   32  128+
[14]  14  130+  30  155+  25   30   84+   7  100+  30    7  133+  43

Now let’s calculate the survival model. We’ll just make it a function of the ER.status flag (which stands for “estrogen receptor”):

> GSE2034.survfit <- survfit(
+   formula=surv~ER.Status,
+   data=GSE2034.Surv,
+ )

The easiest way to view a survfit object is graphically. Let’s plot the model:

> plot(GSE2034.survfit, lty=1:2, log=T)
> legend(135, 1, c("ER+","ER-"), lty=1:2, cex=0.5)

The plot is shown in Figure 20-4. Note the different curve shape for each cohort.

To fit a parametric survival model, you can use the survreg function in the survival package:

survreg(formula, data, weights, subset,
        na.action, dist="weibull", init=NULL, scale=0,
        control,parms=NULL,model=FALSE, x=FALSE,
        y=TRUE, robust=FALSE, score=FALSE, ...)

Here is a description of the arguments to survreg.

ArgumentDescriptionDefault
formulaA formula that describes the form of the model; the response is usually a Surv object (created by the Surv function). 
dataA data frame containing the training data for the model. 
weightsA vector of weights for observations in data. 
subsetAn expression describing a subset of observations in data to use for fitting the model. 
na.actionA function that describes how to treat NA values. options()$na.action
distA character value describing the form of the y variable (either "weibull", "exponential", "gaussian", "logistic", "lognormal", or "loglogistic") or a distribution like the ones in survreg.distributions."weibull"
initOptional vector of initial parameters.NULL
scaleValue specifying the scale of the estimates. Estimated if scale <= 0.0
controlA list of control values, usually produced by survreg.control. 
parmsA list of fixed parameters for the distribution function.NULL
model, x, yLogical values indicating whether to return the model frame, X matrix, or Y vector (respectively) with the results.FALSE
robustA logical value indicating whether to use “robust sandwich standard methods.”FALSE
scoreA logical value indicating whether to return the score vector.FALSE
...Other arguments passed to survreg.control. 

You can compute the expected survival for a set of subjects (or individual expectations for each subject) with the function survexp:

library(survival)
survexp(formula, data, weights, subset, na.action, rmap, times,
    cohort = TRUE, conditional = FALSE, ratetable = survexp.us,
    scale = 1, npoints, se.fit, model = FALSE, x = FALSE, y = FALSE)

Here is a description of the arguments to survexp.

ArgumentDescriptionDefault
formulaA formula object describing the form of the model. The (optional) response should contain a vector of follow-up times, and the predictors should contain grouping variables separated by + operators. 
dataA data frame containing source data on which to predict values. 
weightsA vector of weights for the cases. 
subsetAn expression indicating which observations in data should be included in the prediction. 
na.actionA function specifying how to deal with missing (NA) values in the data.options()$na.action
timesA vector of follow-up times at which the resulting survival curve is evaluated. (This may also be included in the formula; see above.) 
cohortA logical value indicating whether to calculate the survival of the whole cohort (cohort=TRUE) or individual observations (cohort=FALSE).TRUE
conditionalA logical value indicating whether to calculate conditional expected survival. Specify conditional=TRUE if the follow-up times are times of death, and conditional=FALSE if the follow-up times are potential censoring times.FALSE
ratetableA fitted Cox model (from coxph) or a table of survival times.survexp.us
scaleA numeric value specifying how to scale the results.1
npointsA numeric value indicating the number of points at which to calculate individual results. 
se.fitA logical value indicating whether to include the standard error of the predicted survival. 
model, x, ySpecifies whether to return the model frame, the X matrix, or the Y vector in the results.FALSE for all three

The Cox proportional hazard model is a nonparametric method for fitting survival models. It is available in R through the coxph function in the survival library:

coxph(formula, data, weights, subset, na.action, init, control,
    ties = c("efron", "breslow", "exact"), singular.ok = TRUE,
    robust = FALSE, model = FALSE, x = FALSE, y = TRUE, tt, method = ties,
    ...)

Here is a description of the arguments to coxph.

ArgumentDescriptionDefault
formulaA formula that describes the form of the model; the response must be a Surv object (created by the Surv function). 
dataA data frame containing source data on which to predict values. 
weightsA vector of weights for the cases. 
subsetAn expression indicating which observations in data should be fit. 
na.actionA function specifying how to deal with missing (NA) values in the data. 
initA vector of initial parameter values for the fitting process.0 for all variables
controlObject of class coxph.control specifying the iteration limit and other control options.coxph.control(...)
methodA character value specifying the method for handling ties. Choices include "efron", "breslow", and "exact"."efron"
singular.okA logical value indicating whether to stop with an error if the X matrix is singular or to simply skip variables that are linear combinations of other variables.TRUE
robustA logical value indicating whether to return a robust variance estimate.FALSE
modelA logical value specifying whether to return the model frame.FALSE
xA logical value specifying whether to return the X matrix.FALSE
yA logical value specifying whether to return the Y vector.TRUE
...Additional arguments passed to coxph.control. 

As an example, let’s fit a Cox proportional hazard model to the GSE2034 data:

> GSE2034.coxph <- coxph(
+    formula=surv~ER.Status,
+    data=GSE2034.Surv,
+  )
> GSE2034.coxph
Call:
coxph(formula = surv ~ ER.Status, data = GSE2034.Surv)


                 coef exp(coef) se(coef)       z    p
ER.StatusER+ -0.00378     0.996    0.223 -0.0170 0.99

Likelihood ratio test=0  on 1 df, p=0.986  n= 286

The summary method for coxph objects provides additional information about the fit:

> summary(GSE2034.coxph)
Call:
coxph(formula = surv ~ ER.Status, data = GSE2034.Surv)

  n= 286

                 coef exp(coef) se(coef)      z Pr(>|z|)
ER.StatusER+ -0.00378   0.99623  0.22260 -0.017    0.986

             exp(coef) exp(-coef) lower .95 upper .95
ER.StatusER+    0.9962      1.004     0.644     1.541

Rsquare= 0   (max possible= 0.983 )
Likelihood ratio test= 0  on 1 df,   p=0.9865
Wald test            = 0  on 1 df,   p=0.9865
Score (logrank) test = 0  on 1 df,   p=0.9865

Another useful function is cox.zph, which tests the proportional hazards assumption for a Cox regression model fit:

> cox.zph(GSE2034.coxph)
              rho chisq        p
ER.StatusER+ 0.33  11.6 0.000655

There are additional methods available for viewing information about coxph fits, including residuals, predict, and survfit; see the help file for coxph.object for more information.

There are other functions in the survival package for fitting survival models, such as cch, which fits proportional hazard models to case-cohort data. See the help files for more information.