Survival Models

Survival analysis is concerned with looking at the amount of time that elapses before an event occurs. An obvious application is to look at mortality statistics (predicting how long people live), but it can also be applied to mechanical systems (the time before a failure occurs), marketing (the amount of time before a consumer cancels an account), or other areas.

In R, there are a variety of functions in the survival library for modeling survival data.

To estimate a survival curve for censored data, you can use the survfit function:

library(survival)
survfit(formula, data, weights, subset, na.action, etype, id, ...)

This function accepts the following arguments.

Argument	Description
formula	Describes the relationship between the response value and the predictors. The response value should be a `Surv` object.
data	The data frame in which to evaluate formula.
weights	Weights for observations.
subset	Subset of observation to use in fitting the model.
na.action	Function to deal with missing values.
etype	The variable giving the type of event.
id	The variable that identifies individual subjects.
type	Specifies the type of survival curve. Options include `"kaplan-meier"`, `"fleming-harrington"`, and `"fh2"`.
error	Specifies the type of error. Possible values are `"greenwood"` for the Greenwood formula or `"tsiatis"` for the Tsiatis formula.
conf.type	Confidence interval type. One of `"none"`, `"plain"`, `"log"` (the default), or `"log-log"`.
conf.lower	A character string to specify modified lower limits to the curve; the upper limit remains unchanged. Possible values are `"usual"` (unmodified), `"peto"`, and `"modified"`.
start.time	Numeric value specifying a time to start calculating survival information.
conf.int	The level for a two-sided confidence interval on the survival curve(s).
se.fit	A logical value indicating whether standard errors should be computed.
...	Additional variables passed to internal functions.

As an example, let’s fit a survival curve for the GSE2034 data set. This data comes from the Gene Expression Omnibus of the National Center for Biotechnology Information (NCBI), which is accessible from http://www.ncbi.nlm.nih.gov/geo/. The experiment examined how the expression of certain genes affected breast cancer relapse-free survival time. In particular, it tested estrogen receptor binding sites. (We’ll revisit this example in Chapter 25.)

First, we need to create a Surv object within the data frame. A Surv object is an R object for representing survival information, in particular, censored data. Censored data occurs when the outcome of the experiment is not known for all observations. In this case, the data is censored. There are three possible outcomes for each observation: the subject had a recurrence of the disease, the subject died without having a recurrence of the disease, or the subject was still alive without a recurrence at the time the data was reported. The last outcome—the subject was still alive without a recurrence—results in the censored values:

> library(survival)
> GSE2034.Surv <- transform(GSE2034,
+   surv=Surv(
+     time=GSE2034$months.to.relapse.or.last.followup,
+     event=GSE2034$relapse,
+     type="right"
+   )
+ )
> # show the first 26 observations:
> GSE2034.Surv$surv[1:26,]
 [1] 101+ 118+   9  106+  37  125+ 109+  14   99+ 137+  34   32  128+
[14]  14  130+  30  155+  25   30   84+   7  100+  30    7  133+  43

Now let’s calculate the survival model. We’ll just make it a function of the ER.status flag (which stands for “estrogen receptor”):

> GSE2034.survfit <- survfit(
+   formula=surv~ER.Status,
+   data=GSE2034.Surv,
+ )

The easiest way to view a survfit object is graphically. Let’s plot the model:

> plot(GSE2034.survfit, lty=1:2, log=T)
> legend(135, 1, c("ER+","ER-"), lty=1:2, cex=0.5)

The plot is shown in Figure 20-4. Note the different curve shape for each cohort.

Figure 20-4. Survival curves for the GSE2034 data

To fit a parametric survival model, you can use the survreg function in the survival package:

survreg(formula, data, weights, subset,
        na.action, dist="weibull", init=NULL, scale=0,
        control,parms=NULL,model=FALSE, x=FALSE,
        y=TRUE, robust=FALSE, score=FALSE, ...)

Here is a description of the arguments to survreg.

Argument	Description	Default
formula	A formula that describes the form of the model; the response is usually a `Surv` object (created by the `Surv` function).
data	A data frame containing the training data for the model.
weights	A vector of weights for observations in `data`.
subset	An expression describing a subset of observations in `data` to use for fitting the model.
na.action	A function that describes how to treat `NA` values.	`options()$na.action`
dist	A character value describing the form of the y variable (either `"weibull"`, `"exponential"`, `"gaussian"`, `"logistic"`, `"lognormal"`, or `"loglogistic"`) or a distribution like the ones in `survreg.distributions`.	`"weibull"`
init	Optional vector of initial parameters.	`NULL`
scale	Value specifying the scale of the estimates. Estimated if `scale <= 0`.	`0`
control	A list of control values, usually produced by `survreg.control`.
parms	A list of fixed parameters for the distribution function.	`NULL`
model, x, y	Logical values indicating whether to return the model frame, X matrix, or Y vector (respectively) with the results.	`FALSE`
robust	A logical value indicating whether to use “robust sandwich standard methods.”	`FALSE`
score	A logical value indicating whether to return the score vector.	`FALSE`
...	Other arguments passed to `survreg.control`.

You can compute the expected survival for a set of subjects (or individual expectations for each subject) with the function survexp:

library(survival)
survexp(formula, data, weights, subset, na.action, rmap, times,
    cohort = TRUE, conditional = FALSE, ratetable = survexp.us,
    scale = 1, npoints, se.fit, model = FALSE, x = FALSE, y = FALSE)

Here is a description of the arguments to survexp.

Argument	Description	Default
formula	A formula object describing the form of the model. The (optional) response should contain a vector of follow-up times, and the predictors should contain grouping variables separated by `+` operators.
data	A data frame containing source data on which to predict values.
weights	A vector of weights for the cases.
subset	An expression indicating which observations in data should be included in the prediction.
na.action	A function specifying how to deal with missing (`NA`) values in the data.	`options()$na.action`
times	A vector of follow-up times at which the resulting survival curve is evaluated. (This may also be included in the formula; see above.)
cohort	A logical value indicating whether to calculate the survival of the whole cohort (`cohort=TRUE`) or individual observations (`cohort=FALSE`).	`TRUE`
conditional	A logical value indicating whether to calculate conditional expected survival. Specify `conditional=TRUE` if the follow-up times are times of death, and `conditional=FALSE` if the follow-up times are potential censoring times.	`FALSE`
ratetable	A fitted Cox model (from `coxph`) or a table of survival times.	`survexp.us`
scale	A numeric value specifying how to scale the results.	`1`
npoints	A numeric value indicating the number of points at which to calculate individual results.
se.fit	A logical value indicating whether to include the standard error of the predicted survival.
model, x, y	Specifies whether to return the model frame, the X matrix, or the Y vector in the results.	`FALSE` for all three

The Cox proportional hazard model is a nonparametric method for fitting survival models. It is available in R through the coxph function in the survival library:

coxph(formula, data, weights, subset, na.action, init, control,
    ties = c("efron", "breslow", "exact"), singular.ok = TRUE,
    robust = FALSE, model = FALSE, x = FALSE, y = TRUE, tt, method = ties,
    ...)

Here is a description of the arguments to coxph.

Argument	Description	Default
formula	A formula that describes the form of the model; the response must be a `Surv` object (created by the `Surv` function).
data	A data frame containing source data on which to predict values.
weights	A vector of weights for the cases.
subset	An expression indicating which observations in data should be fit.
na.action	A function specifying how to deal with missing (`NA`) values in the data.
init	A vector of initial parameter values for the fitting process.	`0` for all variables
control	Object of class `coxph.control` specifying the iteration limit and other control options.	`coxph.control(...)`
method	A character value specifying the method for handling ties. Choices include `"efron"`, `"breslow"`, and `"exact"`.	`"efron"`
singular.ok	A logical value indicating whether to stop with an error if the `X` matrix is singular or to simply skip variables that are linear combinations of other variables.	`TRUE`
robust	A logical value indicating whether to return a robust variance estimate.	`FALSE`
model	A logical value specifying whether to return the model frame.	`FALSE`
x	A logical value specifying whether to return the X matrix.	`FALSE`
y	A logical value specifying whether to return the Y vector.	`TRUE`
...	Additional arguments passed to `coxph.control`.

As an example, let’s fit a Cox proportional hazard model to the GSE2034 data:

> GSE2034.coxph <- coxph(
+    formula=surv~ER.Status,
+    data=GSE2034.Surv,
+  )
> GSE2034.coxph
Call:
coxph(formula = surv ~ ER.Status, data = GSE2034.Surv)


                 coef exp(coef) se(coef)       z    p
ER.StatusER+ -0.00378     0.996    0.223 -0.0170 0.99

Likelihood ratio test=0  on 1 df, p=0.986  n= 286

The summary method for coxph objects provides additional information about the fit:

> summary(GSE2034.coxph)
Call:
coxph(formula = surv ~ ER.Status, data = GSE2034.Surv)

  n= 286

                 coef exp(coef) se(coef)      z Pr(>|z|)
ER.StatusER+ -0.00378   0.99623  0.22260 -0.017    0.986

             exp(coef) exp(-coef) lower .95 upper .95
ER.StatusER+    0.9962      1.004     0.644     1.541

Rsquare= 0   (max possible= 0.983 )
Likelihood ratio test= 0  on 1 df,   p=0.9865
Wald test            = 0  on 1 df,   p=0.9865
Score (logrank) test = 0  on 1 df,   p=0.9865

Another useful function is cox.zph, which tests the proportional hazards assumption for a Cox regression model fit:

> cox.zph(GSE2034.coxph)
              rho chisq        p
ER.StatusER+ 0.33  11.6 0.000655

There are additional methods available for viewing information about coxph fits, including residuals, predict, and survfit; see the help file for coxph.object for more information.

There are other functions in the survival package for fitting survival models, such as cch, which fits proportional hazard models to case-cohort data. See the help files for more information.