How Muich Survival Data Is Needed For Cox Regressin

Cox Proportional-Hazards Model

The Cox proportional-hazards model (Cox, 1972) is essentially a regression model commonly used statistical in medical inquiry for investigating the clan between the survival time of patients and one or more than predictor variables.

In the previous chapter (survival assay basics), we described the bones concepts of survival analyses and methods for analyzing and summarizing survival information, including:

the definition of risk and survival functions,
the construction of Kaplan-Meier survival curves for different patient groups
the logrank test for comparison two or more survival curves

The above mentioned methods - Kaplan-Meier curves and logrank tests - are examples of univariate analysis. They describe the survival co-ordinate to 1 cistron under investigation, but ignore the impact of any others.

Additionally, Kaplan-Meier curves and logrank tests are useful just when the predictor variable is categorical (e.g.: handling A vs treatment B; males vs females). They don't work easily for quantitative predictors such as gene expression, weight, or age.

An culling method is the Cox proportional hazards regression assay, which works for both quantitative predictor variables and for categorical variables. Furthermore, the Cox regression model extends survival analysis methods to assess simultaneously the effect of several risk factors on survival fourth dimension.

In this commodity, nosotros'll describe the Cox regression model and provide applied examples using R software.

Contents

The need for multivariate statistical modeling
Basics of the Cox proportional hazards model
Compute the Cox model in R
- Install and load required R package
- R part to compute the Cox model: coxph()
- Case information sets
- Compute the Cox model
- Visualizing the estimated distribution of survival times
Summary
References
Infos

The need for multivariate statistical modeling

In clinical investigations, at that place are many situations, where several known quantities (known equally covariates), potentially bear upon patient prognosis.

For example, suppose two groups of patients are compared: those with and those without a specific genotype. If one of the groups also contains older individuals, whatsoever difference in survival may be attributable to genotype or age or indeed both. Hence, when investigating survival in relation to any one factor, it is often desirable to suit for the impact of others.

Statistical model is a frequently used tool that allows to analyze survival with respect to several factors simultaneously. Additionally, statistical model provides the result size for each cistron.

The cox proportional-hazards model is ane of the most of import methods used for modelling survival assay data. The next department introduces the basics of the Cox regression model.

Nuts of the Cox proportional hazards model

The purpose of the model is to evaluate simultaneously the effect of several factors on survival. In other words, it allows united states to examine how specified factors influence the rate of a particular event happening (e.g., infection, death) at a item point in fourth dimension. This rate is commonly referred equally the run a risk rate. Predictor variables (or factors) are usually termed covariates in the survival-analysis literature.

The Cox model is expressed past the hazard office denoted by h(t). Briefly, the hazard function tin be interpreted as the risk of dying at time t. It can be estimated as follow:

\[ h(t) = h_0(t) \times exp(b_1x_1 + b_2x_2 + ... + b_px_p) \]

where,

t represents the survival time
\(h(t)\) is the hazard function determined past a fix of p covariates (\(x_1, x_2, ..., x_p\))
the coefficients (\(b_1, b_2, ..., b_p\)) measure the bear upon (i.e., the effect size) of covariates.
the term \(h_0\) is called the baseline hazard. It corresponds to the value of the risk if all the \(x_i\) are equal to zero (the quantity exp(0) equals 1). The 't' in h(t) reminds us that the take a chance may vary over time.

The Cox model can be written as a multiple linear regression of the logarithm of the risk on the variables \(x_i\), with the baseline gamble beingness an 'intercept' term that varies with time.

The quantities \(exp(b_i)\) are called adventure ratios (Hr). A value of \(b_i\) greater than zero, or equivalently a hazard ratio greater than one, indicates that every bit the value of the \(i^{th}\) covariate increases, the upshot hazard increases and thus the length of survival decreases.

Put another way, a take chances ratio above 1 indicates a covariate that is positively associated with the event probability, and thus negatively associated with the length of survival.

In summary,

HR = one: No effect
Hr < 1: Reduction in the risk
HR > ane: Increase in Risk

Notation that in cancer studies:

A covariate with gamble ratio > 1 (i.e.: b > 0) is called bad prognostic factor
A covariate with take a chance ratio < 1 (i.e.: b < 0) is called skillful prognostic factor

A key supposition of the Cox model is that the take chances curves for the groups of observations (or patients) should be proportional and cannot cross.

Consider two patients k and thousand' that differ in their 10-values. The respective hazard function tin can be simply written as follow

Hazard office for the patient k:

\[ h_k(t) = h_0(t)e^{\sum\limits_{i=1}^north{\beta x}} \]

Risk function for the patient 1000':

\[ h_{k'}(t) = h_0(t)due east^{\sum\limits_{i=i}^n{\beta x'}} \]

The risk ratio for these two patients [\(\frac{h_k(t)}{h_{k'}(t)} = \frac{h_0(t)east^{\sum\limits_{i=1}^n{\beta ten}}}{h_0(t)due east^{\sum\limits_{i=1}^n{\beta x'}}} = \frac{e^{\sum\limits_{i=1}^northward{\beta ten}}}{e^{\sum\limits_{i=1}^due north{\beta 10'}}}\)] is contained of time t.

Consequently, the Cox model is a proportional-hazards model: the gamble of the event in any group is a constant multiple of the hazard in any other. This assumption implies that, as mentioned above, the gamble curves for the groups should be proportional and cannot cross.

In other words, if an individual has a risk of decease at some initial time indicate that is twice as high as that of another individual, then at all subsequently times the hazard of decease remains twice every bit high.

This assumption of proportional hazards should exist tested. We'll discuss methods for assessing proportionality in the next commodity in this series: Cox Model Assumptions.

Compute the Cox model in R

Install and load required R package

We'll use 2 R packages:

survival for calculating survival analyses
survminer for visualizing survival assay results
Install the packages

                    install.packages(c("survival", "survminer"))

Load the packages

                    library("survival") library("survminer")

R function to compute the Cox model: coxph()

The function coxph()[in survival bundle] can be used to compute the Cox proportional hazards regression model in R.

The simplified format is equally follow:

                    coxph(formula, information, method)

formula: is linear model with a survival object as the response variable. Survival object is created using the function Surv() as follow: Surv(time, event).
data: a data frame containing the variables
method: is used to specify how to handle ties. The default is 'efron'. Other options are 'breslow' and 'exact'. The default 'efron' is mostly preferred to the once-pop "breslow" method. The "exact" method is much more than computationally intensive.

Example data sets

We'll use the lung cancer information in the survival R package.

                    information("lung") caput(lung)

                                          inst time status age sex ph.ecog ph.karno pat.karno repast.cal wt.loss 1    iii  306      2  74   ane       1       ninety       100     1175      NA two    three  455      2  68   1       0       90        90     1225      15 iii    3 1010      1  56   one       0       90        xc       NA      15 4    5  210      2  57   1       i       90        60     1150      eleven 5    ane  883      2  60   1       0      100        ninety       NA       0 6   12 1022      1  74   1       1       l        fourscore      513       0

inst: Institution lawmaking
time: Survival time in days
status: censoring condition ane=censored, ii=dead
historic period: Age in years
sex: Male person=1 Female person=2
ph.ecog: ECOG performance score (0=good 5=dead)
ph.karno: Karnofsky performance score (bad=0-proficient=100) rated by physician
pat.karno: Karnofsky functioning score as rated past patient
meal.cal: Calories consumed at meals
wt.loss: Weight loss in last six months

Compute the Cox model

We'll fit the Cox regression using the following covariates: age, sex, ph.ecog and wt.loss.

Nosotros first by calculating univariate Cox analyses for all these variables; and so we'll fit multivariate cox analyses using two variables to describe how the factors jointly bear upon on survival.

Univariate Cox regression

Univariate Cox analyses can be computed as follow:

                      res.cox <- coxph(Surv(time, status) ~ sex, data = lung) res.cox

                      Call: coxph(formula = Surv(fourth dimension, condition) ~ sex, data = lung)       coef exp(coef) se(coef)     z      p sex -0.531     0.588    0.167 -3.18 0.0015 Likelihood ratio test=10.six  on one df, p=0.00111 northward= 228, number of events= 165

The function summary() for Cox models produces a more than consummate report:

                      summary(res.cox)

                      Call: coxph(formula = Surv(fourth dimension, condition) ~ sex, data = lung)   northward= 228, number of events= 165         coef exp(coef) se(coef)      z Pr(>|z|)    sex -0.5310    0.5880   0.1672 -3.176  0.00149 ** --- Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1     exp(coef) exp(-coef) lower .95 upper .95 sex     0.588      i.701    0.4237     0.816 Concordance= 0.579  (se = 0.022 ) Rsquare= 0.046   (max possible= 0.999 ) Likelihood ratio examination= ten.63  on 1 df,   p=0.001111 Wald exam            = ten.09  on ane df,   p=0.001491 Score (logrank) test = 10.33  on 1 df,   p=0.001312

The Cox regression results can be interpreted as follow:

Statistical significance. The cavalcade marked "z" gives the Wald statistic value. It corresponds to the ratio of each regression coefficient to its standard mistake (z = coef/se(coef)). The wald statistic evaluates, whether the beta (\(\beta\)) coefficient of a given variable is statistically significantly different from 0. From the output in a higher place, nosotros tin conclude that the variable sex have highly statistically pregnant coefficients.
The regression coefficients. The second feature to annotation in the Cox model results is the the sign of the regression coefficients (coef). A positive sign means that the hazard (gamble of death) is college, and thus the prognosis worse, for subjects with higher values of that variable. The variable sex is encoded as a numeric vector. one: male person, two: female. The R summary for the Cox model gives the take chances ratio (60 minutes) for the second group relative to the first grouping, that is, female versus male. The beta coefficient for sex = -0.53 indicates that females accept lower risk of death (lower survival rates) than males, in these data.
Take a chance ratios. The exponentiated coefficients (exp(coef) = exp(-0.53) = 0.59), also known as take chances ratios, give the effect size of covariates. For example, being female (sex=2) reduces the run a risk by a factor of 0.59, or 41%. Being female is associated with good prognostic.
Confidence intervals of the take chances ratios. The summary output also gives upper and lower 95% confidence intervals for the chance ratio (exp(coef)), lower 95% bound = 0.4237, upper 95% bound = 0.816.
Global statistical significance of the model. Finally, the output gives p-values for 3 culling tests for overall significance of the model: The likelihood-ratio exam, Wald examination, and score logrank statistics. These three methods are asymptotically equivalent. For large enough N, they will give similar results. For small N, they may differ somewhat. The Likelihood ratio test has improve beliefs for pocket-size sample sizes, then information technology is generally preferred.

To apply the univariate coxph role to multiple covariates at once, type this:

                      covariates <- c("age", "sex activity",  "ph.karno", "ph.ecog", "wt.loss") univ_formulas <- sapply(covariates,                         function(x) as.formula(paste('Surv(time, status)~', x)))                          univ_models <- lapply( univ_formulas, part(x){coxph(x, data = lung)}) # Extract data  univ_results <- lapply(univ_models,                        function(x){                            x <- summary(x)                           p.value<-signif(x$wald["pvalue"], digits=two)                           wald.examination<-signif(x$wald["test"], digits=2)                           beta<-signif(x$coef[1], digits=2);#coeficient beta                           Hr <-signif(x$coef[2], digits=two);#exp(beta)                           HR.confint.lower <- signif(10$conf.int[,"lower .95"], two)                           Hour.confint.upper <- signif(x$conf.int[,"upper .95"],two)                           60 minutes <- paste0(Hr, " (",                                         Hr.confint.lower, "-", HR.confint.upper, ")")                           res<-c(beta, Hour, wald.test, p.value)                           names(res)<-c("beta", "60 minutes (95% CI for 60 minutes)", "wald.examination",                                          "p.value")                           return(res)                           #render(exp(cbind(coef(x),confint(x))))                          }) res <- t(every bit.data.frame(univ_results, bank check.names = FALSE)) equally.data.frame(res)

                                              beta Hr (95% CI for 60 minutes) wald.test p.value age       0.019            one (one-one)       iv.1   0.042 sex       -0.53   0.59 (0.42-0.82)        10  0.0015 ph.karno -0.016      0.98 (0.97-i)       vii.9   0.005 ph.ecog    0.48        1.half-dozen (i.three-2)        18 2.7e-05 wt.loss  0.0013         i (0.99-1)      0.05    0.83

The output above shows the regression beta coefficients, the effect sizes (given as risk ratios) and statistical significance for each of the variables in relation to overall survival. Each gene is assessed through split up univariate Cox regressions.

From the output higher up,

The variables sex, age and ph.ecog have highly statistically significant coefficients, while the coefficient for ph.karno is not significant.
age and ph.ecog take positive beta coefficients, while sex has a negative coefficient. Thus, older age and higher ph.ecog are associated with poorer survival, whereas beingness female (sex activity=ii) is associated with better survival.

Now, we want to describe how the factors jointly touch on survival. To answer to this question, we'll perform a multivariate Cox regression analysis. As the variable ph.karno is not significant in the univariate Cox analysis, we'll skip it in the multivariate analysis. We'll include the three factors (sex activity, age and ph.ecog) into the multivariate model.

Multivariate Cox regression assay

A Cox regression of time to death on the time-constant covariates is specified as follow:

                      res.cox <- coxph(Surv(fourth dimension, status) ~ age + sex + ph.ecog, data =  lung) summary(res.cox)

                      Call: coxph(formula = Surv(time, condition) ~ age + sex + ph.ecog, data = lung)   n= 227, number of events= 164     (1 observation deleted due to missingness)              coef exp(coef)  se(coef)      z Pr(>|z|)     age      0.011067  1.011128  0.009267  i.194 0.232416     sex activity     -0.552612  0.575445  0.167739 -3.294 0.000986 *** ph.ecog  0.463728  i.589991  0.113577  4.083 4.45e-05 *** --- Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' i         exp(coef) exp(-coef) lower .95 upper .95 age        ane.0111     0.9890    0.9929    i.0297 sex        0.5754     one.7378    0.4142    0.7994 ph.ecog    1.5900     0.6289    1.2727    1.9864 Concordance= 0.637  (se = 0.026 ) Rsquare= 0.126   (max possible= 0.999 ) Likelihood ratio test= 30.v  on 3 df,   p=one.083e-06 Wald test            = 29.93  on iii df,   p=1.428e-06 Score (logrank) examination = thirty.5  on iii df,   p=1.083e-06

The p-value for all three overall tests (likelihood, Wald, and score) are significant, indicating that the model is significant. These tests evaluate the omnibus nothing hypothesis that all of the betas (\(\beta\)) are 0. In the above instance, the test statistics are in close agreement, and the motorbus goose egg hypothesis is soundly rejected.

In the multivariate Cox assay, the covariates sex and ph.ecog remain significant (p < 0.05). However, the covariate age fails to be significant (p = 0.23, which is grater than 0.05).

The p-value for sex is 0.000986, with a gamble ratio Hour = exp(coef) = 0.58, indicating a potent human relationship between the patients' sex and decreased risk of death. The hazard ratios of covariates are interpretable as multiplicative effects on the hazard. For case, holding the other covariates constant, being female (sex=2) reduces the adventure by a factor of 0.58, or 42%. We conclude that, being female person is associated with proficient prognostic.

Similarly, the p-value for ph.ecog is iv.45e-05, with a hazard ratio HR = one.59, indicating a potent human relationship between the ph.ecog value and increased risk of death. Belongings the other covariates constant, a higher value of ph.ecog is associated with a poor survival.

By contrast, the p-value for age is now p=0.23. The chance ratio HR = exp(coef) = i.01, with a 95% confidence interval of 0.99 to 1.03. Considering the conviction interval for Hour includes 1, these results indicate that age makes a smaller contribution to the difference in the HR after adjusting for the ph.ecog values and patient's sex, and merely trend toward significance. For example, holding the other covariates abiding, an additional yr of historic period induce daily hazard of death past a factor of exp(beta) = 1.01, or 1%, which is not a meaning contribution.

Visualizing the estimated distribution of survival times

Having fit a Cox model to the data, it'southward possible to visualize the predicted survival proportion at whatever given point in fourth dimension for a item gamble group. The function survfit() estimates the survival proportion, by default at the mean values of covariates.

                    # Plot the baseline survival role ggsurvplot(survfit(res.cox), color = "#2E9FDF",            ggtheme = theme_minimal())

Cox Proportional-Hazards Model

We may wish to display how estimated survival depends upon the value of a covariate of interest.

Consider that, nosotros want to assess the touch on of the sex on the estimated survival probability. In this case, we construct a new data frame with two rows, one for each value of sex; the other covariates are fixed to their average values (if they are continuous variables) or to their lowest level (if they are detached variables). For a dummy covariate, the average value is the proportion coded 1 in the data set. This information frame is passed to survfit() via the newdata statement:

                    # Create the new data   sex_df <- with(lung,                information.frame(sexual activity = c(1, 2),                            age = rep(mean(age, na.rm = TRUE), two),                           ph.ecog = c(1, i)                           )                ) sex_df

                                          sex      age ph.ecog 1   i 62.44737       1 ii   two 62.44737       1

                    # Survival curves fit <- survfit(res.cox, newdata = sex_df) ggsurvplot(fit, conf.int = TRUE, legend.labs=c("Sexual practice=1", "Sex=ii"),            ggtheme = theme_minimal())

Cox Proportional-Hazards Model

Summary

In this article, we described the Cox regression model for assessing simultaneously the relationship betwixt multiple risk factors and patient'southward survival time. We demonstrated how to compute the Cox model using the survival package. Additionally, we described how to visualize the results of the analysis using the survminer package.

References

Cox DR (1972). Regression models and life tables (with discussion). J R Statist Soc B 34: 187–220
MJ Bradburn, TG Clark, SB Love and DG Altman. Survival Analysis Part II: Multivariate data analysis – an introduction to concepts and methods. British Journal of Cancer (2003) 89, 431 – 436

Infos

This assay has been performed using R software (ver. 3.3.2).

Enjoyed this commodity? I'd be very grateful if you'd assistance information technology spread by emailing it to a friend, or sharing it on Twitter, Facebook or Linked In.

Show me some beloved with the like buttons beneath... Give thanks you and please don't forget to share and comment below!!

Avez vous aimé cet commodity? Je vous serais très reconnaissant si vous aidiez à sa diffusion en l'envoyant par courriel à un ami ou en le partageant sur Twitter, Facebook ou Linked In.

Montrez-moi un peu d'amour avec les like ci-dessous ... Merci et due north'oubliez pas, due south'il vous plaît, de partager et de commenter ci-dessous!

Recommended for You!

Recommended for you

This section contains best data scientific discipline and self-development resources to help you on your path.

Want to Learn More on R Programming and Information Science?

On Social Networks:

Get involved :
Click to follow us on Facebook and Google+ :
Comment this article past clicking on "Discussion" push (superlative-correct position of this page)