Original link:tecdat.cn/?p=10278
Original source: Tuoduan Data Tribe official WeChat account
Part 1: Introduction to survival analysis
This presentation introduces survival analysis, with reference to:
Clark, T. G., Bradburn, M. J., Love, S. B., & Altman, D. G. (2003). Survival analysis part I: Basic concepts and first analyses. British Journal of Cancer, 89(2), 232-238. ISSN 0007-0920.
Some of the packages we’ll be using today include:
lubridate
survival
survminer
library(survival)
library(survminer)
library(lubridate)
What is survival data?
Time-to-event data consist of a distinct start time and end time.
Examples from cancer
- Time from surgery to death
- Time from start of treatment to progression
- Time from response to relapse
Examples from other fields
Time-to-event data are common in many fields, including but not limited to:
- Time from HIV infection to AIDS development
- Time to heart attack
- Time to onset of substance abuse
- Machine failure time
Aliases for survival analysis
Because survival analysis is common in many other fields, it also goes by other names:
- Reliability analysis
- Duration analysis
- Event history analysis
- Event time analysis
The lung data set
The lung data set contains patients with advanced lung cancer from the North Central Cancer Treatment Group. Some of the variables we will use to demonstrate methods today include:
- time: survival time in days
- status: censoring status, 1 = censored, 2 = dead
- sex: male = 1, female = 2
What is censoring?
Rich, J. T., Neely, J. G., Paniello, R. C., Voelker, C. C. J., Nussenbaum, B., & Wang, E. W. (2010). A practical guide to understanding Kaplan-Meier curves. Otolaryngology–Head and Neck Surgery, 143(3), 331-336. doi:10.1016/j.otohns.2010.05.007
Types of censoring
A subject may be censored due to:
- Loss to follow-up
- Withdrawal from the study
- No event by the end of the fixed study period
Specifically, these are examples of right censoring.
Censored survival data
In this example, how would we compute the proportion who are event-free at 10 years?
Subjects 2, 3, 5, 6, 8, 9, and 10 were event-free at 10 years. Subjects 4 and 7 had the event before 10 years. Subject 1 was censored before 10 years, so we don't know whether they had the event by 10 years. How should we incorporate this subject into our estimate?
Distribution of follow-up times
- Censored subjects still provide information and so must be appropriately included in the analysis
- The distribution of follow-up times is skewed and may differ between censored patients and those with events
Components of survival data
For subject i:
- Event time T_i
- Censoring time C_i
- Event indicator δ_i:
  - 1 if the event was observed (i.e. T_i ≤ C_i)
  - 0 if censored (i.e. T_i > C_i)
- Observation time Y_i = min(T_i, C_i)
The observation time and event indicator are provided in the lung data:
- time: survival time in days (Y_i)
- status: censoring status, 1 = censored, 2 = dead (δ_i)
Processing dates in R
Data often come with a start date and an end date rather than a pre-calculated survival time. The first step is to make sure these are formatted as dates in R.
Let's create a small example data set, where sx_date is the date of surgery and last_fup_date is the date of last follow-up.
library(dplyr)  # provides tibble() and mutate() used below

date_ex <-
  tibble(
    sx_date = c("2007-06-22", "2004-02-13", "2010-10-27"),
    last_fup_date = c("2017-04-15", "2018-07-04", "2016-10-31")
  )

date_ex
## # A tibble: 3 x 2
## sx_date last_fup_date
## <chr> <chr>
## 1 2007-06-22 2017-04-15
## 2 2004-02-13 2018-07-04
## 3 2010-10-27 2016-10-31
We see that they are character variables, which is usually the case, but we need to format them as dates.
Format dates – base R
date_ex %>%
  mutate(
    sx_date = as.Date(sx_date, format = "%Y-%m-%d"),
    last_fup_date = as.Date(last_fup_date, format = "%Y-%m-%d")
  )
## # A tibble: 3 x 2
## sx_date last_fup_date
## <date> <date>
## 1 2007-06-22 2017-04-15
## 2 2004-02-13 2018-07-04
## 3 2010-10-27 2016-10-31
- Note that with base R, the format must include the separators as well as the date symbols. For example, if your dates are in the format m/d/Y, you would need format = "%m/%d/%Y".
Format dates – lubridate package
We can also use the lubridate package to format dates. In this case, use the ymd function:
date_ex %>%
mutate(
sx_date = ymd(sx_date),
last_fup_date = ymd(last_fup_date)
)
## # A tibble: 3 x 2
## sx_date last_fup_date
## <date> <date>
## 1 2007-06-22 2017-04-15
## 2 2004-02-13 2018-07-04
## 3 2010-10-27 2016-10-31
- Note that unlike the base R option, this does not require specifying the separator.
Calculating survival times – base R
Now that the dates are formatted, we need to calculate the difference between the start and end dates in some units, usually months or years. In base R, difftime calculates the number of days between two dates, and as.numeric converts the result to a numeric value. We then divide by 365.25, the average number of days in a year, to convert to years.
date_ex %>%
mutate(
os_yrs =
as.numeric(
difftime(last_fup_date,
sx_date,
units = "days")) / 365.25
)
## # A tibble: 3 x 3
##   sx_date    last_fup_date os_yrs
##   <date>     <date>         <dbl>
## 1 2007-06-22 2017-04-15      9.82
## 2 2004-02-13 2018-07-04     14.4 
## 3 2010-10-27 2016-10-31      6.01
Calculating survival times – lubridate
With lubridate, the %--% operator designates a time interval, which is then converted to the number of elapsed seconds with as.duration, and finally divided by dyears(1), which gives the number of seconds in a year, to convert to years.
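A sketch of the corresponding lubridate code (assuming the date columns of date_ex have already been converted to Date class as shown above):

date_ex %>%
  mutate(
    os_yrs =
      as.duration(sx_date %--% last_fup_date) / dyears(1)  # interval -> duration -> years
  )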
## # A tibble: 3 x 3
##   sx_date    last_fup_date os_yrs
##   <date>     <date>         <dbl>
## 1 2007-06-22 2017-04-15      9.82
## 2 2004-02-13 2018-07-04     14.4 
## 3 2010-10-27 2016-10-31      6.02
Event indicator
In the components of survival data above, I mentioned the event indicator:
Event indicator δ_i:
- 1 if the event was observed (i.e. T_i ≤ C_i)
- 0 if censored (i.e. T_i > C_i)
In the lung data we have:
- status: censoring status, 1 = censored, 2 = dead
Survival function
The probability that a subject will survive beyond a specified time t:
S(t) = Pr(T > t) = 1 - F(t)
- S(t): survival function
- F(t) = Pr(T ≤ t): cumulative distribution function
In theory the survival function is smooth; in practice we observe events on a discrete time scale.
Survival probability
- The probability of survival at a given time, S(t_i), is the conditional probability of surviving beyond that time, given that an individual has survived just prior to that time.
- It can be estimated as the number of patients who are alive at that time without loss to follow-up, divided by the number of patients who were alive just prior to that time.
- The Kaplan-Meier estimate of survival probability is the product of these conditional probabilities up to the given time (see the toy sketch after this list).
- At time 0, the survival probability is 1, i.e. S(t_0) = 1.
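As a toy illustration of that product of conditional probabilities (hypothetical numbers, not taken from the lung data):

d <- c(1, 1, 1)     # number of events at each distinct event time
n <- c(10, 8, 5)    # number at risk just before each event time
cumprod(1 - d / n)  # Kaplan-Meier estimates of S(t) at those times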
Creating a survival object
The Kaplan-Meier method is the most common way to estimate survival times and probabilities. It is a non-parametric approach that results in a step function, with a step down each time an event occurs.
- The Surv function creates a survival object. For each subject there is one entry, the survival time, followed by a + if the subject was censored. Let's look at the first 10 observations:
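A sketch of the call producing the output below (Surv and the lung data come from the survival package loaded above):

Surv(lung$time, lung$status)[1:10]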
## [1] 306 455 1010+ 210 883 1022+ 310 361 218 166
Estimating the survival curve with the Kaplan-Meier method
- The survfit function creates survival curves based on a formula. Let's generate the overall survival curve for the entire cohort, assign it to an object f1, and then look at the names of that object:
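A minimal sketch of creating that object (the formula matches the call shown in the plots that follow):

f1 <- survfit(Surv(time, status) ~ 1, data = lung)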
names(f1)
## [1] "n" "time" "n.risk" "n.event" "n.censor"
## [6] "surv" "std.err" "cumhaz" "std.chaz" "start.time"
## [11] "type" "logse" "conf.int" "conf.type" "lower"
## [16] "upper" "call"
Some key components of this survfit object that will be used to create survival curves include:
- time, which contains the start and end points of each time interval
- surv, which contains the survival probability corresponding to each time
Kaplan-Meier plot – base R
Now we plot the survfit object in base R to get the Kaplan-Meier plot.
plot(survfit(Surv(time, status) ~ 1, data = lung),
xlab = "Days",
ylab = "Overall survival probability")
- The default plot in base R shows the step function (solid line) with associated confidence intervals (dotted lines)
- Horizontal lines represent survival duration for that interval
- An interval is terminated by the occurrence of an event
- The height of the vertical lines shows the change in cumulative probability
- Censored observations, indicated by tick marks, reduce the cumulative survival between intervals
Kaplan-Meier plot – ggsurvplot
The ggsurvplot function from the survminer package is built on ggplot2 and can be used to create Kaplan-Meier plots (a sketch of a call follows this list).
- The default plot shows the step function (solid line) with associated confidence bands (shaded area).
- By default, tick marks are shown for censored patients. In this example the tick marks somewhat obscure the line; they can be suppressed with the option censor = FALSE.
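A sketch of the corresponding call (ggsurvplot is from the survminer package loaded above; the axis labels are assumptions chosen to match the base R plot):

ggsurvplot(
  fit = survfit(Surv(time, status) ~ 1, data = lung),
  data = lung,
  xlab = "Days",
  ylab = "Overall survival probability"
)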
Estimating x-year survival
One quantity of interest in a survival analysis is the probability of surviving beyond a certain number (x) of years.
For example, to estimate the probability of surviving to 1 year:
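A sketch of the call behind the output below, using summary on the survfit object at the time of interest (365.25 days is approximately 1 year):

summary(survfit(Surv(time, status) ~ 1, data = lung), times = 365.25)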
## Call: survfit(formula = Surv(time, status) ~ 1, data = lung)
## 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##   365     65     121    0.409  0.0358        0.345        0.486

We find that the 1-year probability of survival in this study is 41%.
The upper and lower limits of 95% confidence intervals are also shown.
x-year survival and the survival curve
The 1-year survival probability is the point on the y-axis that corresponds to 1 year on the x-axis of the survival curve.
x-year survival is often estimated incorrectly
What happens if we use a "naive" estimate?
121 of the 228 patients died by 1 year, so a naive estimate of 1-year survival would be 1 - 121/228 = 47%.
- We get an incorrect estimate of 1-year survival if we ignore the fact that 42 patients were censored before 1 year.
- The correct, censoring-adjusted estimate of the 1-year probability of survival is 41%.
Impact of ignoring censoring on x-year survival
- Imagine two studies, each with 228 subjects and 165 deaths, where no subjects are censored in one study (orange line) and 63 subjects are censored in the other (blue line)
- Ignoring censoring leads to an overestimate of the overall survival probability, because the censored subjects only contribute information for part of the follow-up time and then fall out of the risk set, pulling down the cumulative probability of survival
Estimating median survival time
Another quantity of interest in a survival analysis is the average survival time, which we quantify using the median. Survival times are not expected to be normally distributed, so the mean is not an appropriate summary.
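Printing the survfit object gives the median estimate; a minimal sketch:

survfit(Surv(time, status) ~ 1, data = lung)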
## Call: survfit(formula = Surv(time, status) ~ 1, data = lung)
## 
##        n  events  median 0.95LCL 0.95UCL
##      228     165     310     285     363
We see a median survival of 310 days. It also shows the upper and lower bounds of the 95% confidence interval.
Median survival time and the survival curve
Median survival is the time on the x-axis that corresponds to a survival probability of 0.50 on the y-axis.
Median survival is often estimated incorrectly
What happens if we summarize the median survival time among only the 165 patients who died?
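A sketch of this naive calculation (assuming dplyr is loaded; status == 2 indicates death in the lung data):

library(dplyr)

lung %>% 
  filter(status == 2) %>% 
  summarize(median_surv = median(time))  # median among deaths only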
## median_surv
## 1 226
- We get an incorrect estimate of 226 days for median survival when we ignore the fact that censored patients also contribute follow-up time.
- The correct estimate of median survival time is 310 days.
Impact of ignoring censoring on median survival
- Ignoring censoring creates an artificially lowered survival curve, because the follow-up time contributed by censored patients is excluded (purple line)
- The true survival curve for the lung data is shown in blue for comparison
Comparing survival times between groups
- We can conduct between-group significance tests using the log-rank test
- The log-rank test equally weights observations over the entire follow-up time and is the most common way to compare survival times between groups
- There are versions that weight either early or late follow-up more heavily, which may be more appropriate depending on the research question
We get the log-rank p-value using the survdiff function. For example, we can test whether survival time differs by sex in the lung data:
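A sketch of the call (stored as sd so that its components can be extracted afterwards):

sd <- survdiff(Surv(time, status) ~ sex, data = lung)
sd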
## Call:
## survdiff(formula = Surv(time, status) ~ sex, data = lung)
## 
##         N Observed Expected (O-E)^2/E (O-E)^2/V
## sex=1 138      112     91.6      4.55      10.3
## sex=2  90       53     73.4      5.68      10.3
## 
##  Chisq= 10.3  on 1 degrees of freedom, p= 0.001
Extracting information from the survdiff object
We can extract the p-value from the results:
1 - pchisq(sd$chisq, length(sd$n) - 1)
## [1] 0.001311165
Or return a formatted p-value:
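One simple option, assuming the sd object above (the original helper used here is not shown, so this is only an illustration):

round(1 - pchisq(sd$chisq, length(sd$n) - 1), digits = 3)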
## [1] 0.001
Cox regression model
We may want to quantify an effect size for a single variable, or include more than one variable in a regression model to account for the effects of multiple variables.
The Cox regression model is a semi-parametric model that can be used to fit univariable and multivariable regression models with survival outcomes:
h(t|X_i) = h_0(t) exp(β_1 X_i1 + ⋯ + β_p X_ip)
- h(t): hazard, or the instantaneous rate at which events occur
- h_0(t): the underlying baseline hazard
Some key assumptions of the model:
- Non-informative censoring
- Proportional hazards
Note: Parametric regression models for survival outcomes can also be used, but these models will not be covered in this training.
We can fit regression models for survival data using the coxph function, which takes a Surv object on the left-hand side and has standard R formula syntax on the right-hand side.
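For example, a univariable model of sex in the lung data (matching the output below):

coxph(Surv(time, status) ~ sex, data = lung)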
## Call:
## coxph(formula = Surv(time, status) ~ sex, data = lung)
## 
##        coef exp(coef) se(coef)      z       p
## sex -0.5310    0.5880   0.1672 -3.176 0.00149
## 
## Likelihood ratio test=10.63  on 1 df, p=0.001111
## n= 228, number of events= 165
Format Cox regression results
We can get a clean, tidy version of the model output with the broom package, or use another formatting helper:
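A sketch using broom (assuming the fit above has been stored, e.g. as cox_fit):

cox_fit <- coxph(Surv(time, status) ~ sex, data = lung)
broom::tidy(cox_fit, exponentiate = TRUE)  # tidy table of hazard ratios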
Hazard ratios
- The quantity of interest from a Cox regression model is the hazard ratio (HR), which represents the ratio of hazards between two groups at any particular point in time.
- The HR is interpreted as the instantaneous rate of occurrence of the event of interest among those who are still at risk for the event.
- If you have a regression parameter β (from the estimate column of our coxph output), then HR = exp(β).
- An HR < 1 indicates a reduced hazard of death, while an HR > 1 indicates an increased hazard of death.
- So our HR = 0.59 implies that, at any given time, roughly 0.6 times as many females are dying as males.
Part 2: Landmark analysis and time-dependent covariates
In Part 1, we covered using the log-rank test and Cox regression to examine associations between covariates of interest and survival outcomes.
Example: tumor response
Example: overall survival is measured from the start of treatment, and interest is in the association between complete response to treatment and survival.
- Anderson et al. (JCO, 1983) described why traditional methods such as the log-rank test or Cox regression are biased in favor of responders in this scenario, and proposed the landmark approach.
- The null hypothesis in the landmark approach is that survival from the landmark does not depend on response status at the landmark.
Anderson, J., Cain, K., & Gelber, R. (1983). Analysis of survival by tumor response. Journal of Clinical Oncology : Official Journal of the American Society of Clinical Oncology, 1(11), 710-9.
Other examples
Some other covariates of interest in cancer research that are not measured at baseline include:
- Transplant failure
- Graft versus host disease
- Secondary excision
- Adjuvant therapy
- Compliance
- Adverse events
Example data
Data on 137 bone marrow transplant patients. Variables include:
- T1: time (in days) to death or last follow-up
- delta1: death indicator; 1 = dead, 0 = alive
- TA: time (in days) to acute graft-versus-host disease
- deltaA: acute graft-versus-host disease indicator; 1 = developed acute graft-versus-host disease, 0 = never developed acute graft-versus-host disease
Let’s load the data for use throughout the example
The landmark method
- Select a fixed time after baseline as the landmark time. Note: this should be done based on clinical information, before inspecting the data.
- Subset the population to those followed at least until the landmark time. Note: always report the number excluded due to the event of interest or censoring before the landmark time.
- Calculate follow-up time from the landmark and apply the traditional log-rank test or Cox regression.
In the BMT data, interest is in the association between acute graft-versus-host disease (aGVHD) and survival. However, aGVHD is assessed after the transplant, which is our baseline, i.e. when follow-up begins.
Step 1: Select a landmark time
Typically, aGVHD occurs within the first 90 days following transplant, so we use a 90-day landmark.
Step 2: Subset the population to those followed at least until the landmark time
This reduces our sample size from 137 to 122, as sketched below.
- All 15 excluded patients died before the 90-day landmark
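A minimal sketch of this subsetting step (assuming the data frame is called bmt and uses the variable names listed above):

bmt_landmark <- subset(bmt, T1 >= 90)  # keep only patients followed at least 90 days
nrow(bmt_landmark)                     # 122 in these data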
Step 3: Calculate follow-up time from the landmark and apply traditional methods
Example of a landmark Cox regression using the BMT data:
In the Cox regression, the subset option of coxph can be used to exclude patients who were not followed through the landmark time, as in the sketch below.
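A sketch of that model (variable names as in the BMT data above). Because shifting every subject's time origin by the same constant leaves the Cox risk sets unchanged, the estimates are the same whether or not 90 days is subtracted from the survival times:

coxph(
  Surv(T1, delta1) ~ deltaA,
  subset = T1 >= 90,   # exclude patients not followed to the 90-day landmark
  data = bmt
)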
Time-dependent covariates
An alternative to landmark analysis is to incorporate a time-dependent covariate. This may be more appropriate when:
- the value of a covariate changes over time
- there is no obvious landmark time
Data setup for time-dependent covariates
Analysis of time-dependent covariates in R requires setting up a special data set.
There is no ID variable in the BMT data, but one is needed to create the special data set, so we first create a variable called my_id.
The tmerge function is used together with the event and tdc helper functions to create the special data set (see the sketch after this list):
- tmerge creates a long data set with multiple time intervals for the different covariate values for each patient
- event creates the new event indicator to go with the newly created time intervals
- tdc creates the time-dependent covariate indicator to go with the newly created time intervals
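A sketch of the setup (assuming the bmt data frame and the variable names above; the exact columns kept in data1 are an assumption):

bmt$my_id <- 1:nrow(bmt)  # create the subject id noted above

td_dat <-
  tmerge(
    data1 = bmt[, c("my_id", "T1", "delta1")],
    data2 = bmt,
    id    = my_id,
    death = event(T1, delta1),  # event indicator for each new interval
    agvhd = tdc(TA)             # time-dependent aGVHD indicator
  )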
Time-dependent covariates – example patients
To see how this works, let's look at the original data for the first five patients:
## my_id T1 delta1 TA deltaA
## 1 1 2081 0 67 1
## 2 2 1602 0 1602 0
## 3 3 1496 0 1496 0
## 4 4 1462 0 70 1
## 5 5 1433 0 1433 0
The new data set for these same patients looks like this:
## my_id   T1 delta1 id tstart tstop death agvhd
## 1     1 2081      0  1      0    67     0     0
## 2     1 2081      0  1     67  2081     0     1
## 3     2 1602      0  2      0  1602     0     0
## 4     3 1496      0  3      0  1496     0     0
## 5     4 1462      0  4      0    70     0     0
## 6     4 1462      0  4     70  1462     0     1
## 7     5 1433      0  5      0  1433     0     0
Time-dependent covariates – Cox regression
Now we can analyze this time-dependent covariate as usual using Cox regression with coxph; a sketch follows.
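A sketch, using the counting-process form of Surv with the td_dat data set created above:

coxph(
  Surv(time = tstart, time2 = tstop, event = death) ~ agvhd,
  data = td_dat
)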
Summary
We found no significant association between acute graft-versus-host disease and death using either landmark analysis or a time-dependent covariate.
In general, landmark analysis is useful for visualization of a single covariate, while Cox regression with a time-dependent covariate is preferred for univariable and multivariable modeling.
Part 3: Competing risks
What are competing risks?
Competing risks arise when, in the time-to-event setting, subjects can experience more than one type of event.
Examples:
- Recurrence
- Death from disease
- Death from other causes
- Response to treatment
All of these (or some of them and others) may be possible events in any given study.
So what’s the problem?
Unobserved dependencies between event times are fundamental issues that lead to special considerations.
For example, it is conceivable that patients who relapse are more likely to die, so the time of recurrence and the time of death would not be independent events.
Competing risks background
Two approaches to analysis in the presence of multiple potential outcomes:
- Cause-specific hazard of a given event: the rate per unit time of the event among those not having failed from other events
- Cumulative incidence of a given event: the rate per unit time of the event as well as the influence of competing events
Each of these approaches may illuminate only one important aspect of the data and may make others difficult to understand, so the method chosen should depend on the question of interest.
Melanoma data example
It contains the variables:
- time: survival time in days, possibly censored
- status: 1 = died from melanoma, 2 = alive, 3 = died from other causes
- sex: 1 = male, 0 = female
- age: age in years
- year: year of operation
- thickness: tumor thickness in mm
- ulcer: 1 = present, 0 = absent
Cumulative incidence for the melanoma data
Estimate the cumulative incidence in the presence of competing risks, as sketched below.
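A sketch using the cuminc function from the cmprsk package (assuming the Melanoma data from the MASS package; status 2 = alive is the censoring code):

library(cmprsk)

data(Melanoma, package = "MASS")

ci_fit <-
  cuminc(
    ftime   = Melanoma$time,
    fstatus = Melanoma$status,
    cencode = 2
  )

ci_fit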
## Estimates and Variances:
## $est
##          1000       2000       3000      4000      5000
## 1 1 0.12745714 0.23013963 0.30962017 0.3387175 0.3387175
## 1 3 0.03426709 0.05045644 0.05811143 0.1059471 0.1059471
## 
## $var
##             1000         2000         3000        4000        5000
## 1 1 0.0005481186 0.0009001172 0.0013789328 0.001690760 0.001690760
## 1 3 0.0001628354 0.0002451319 0.0002998642 0.001040155 0.001040155
Plot cumulative incidence – base R
Generate a base R plot with the default settings:
plot(ci_fit)
Plot cumulative incidence
Comparing cumulative incidence between groups
A between-group test is performed when a group variable is supplied to cuminc.
For example, we can compare cumulative incidence in the Melanoma data according to the presence or absence of ulceration. The test results are found in the Tests component of the returned object.
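A sketch of that fit (same assumptions as above, with ulcer as the grouping variable):

ci_ulcer <-
  cuminc(
    ftime   = Melanoma$time,
    fstatus = Melanoma$status,
    group   = Melanoma$ulcer,
    cencode = 2
  )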
ci_ulcer[["Tests"]]
##       stat           pv df
## 1      ...          ...  1
## 3 0.158662 6.903913e-01  1
Plot cumulative incidence by group
Plot cumulative incidence by group – manually
Note that I personally find the ggcompetingrisks function limited in customization options, especially compared with ggsurvplot. I usually create my own plots by first making a tidy data set of the cuminc fit results and then plotting them. See the source code of this presentation for details of the underlying code.
Plotting a single event type
Often only one event type is of interest, though we still need to account for competing events. In that case the event of interest can be plotted alone. Again, I do this manually by first creating a tidy data set of the cuminc fit results and then plotting them. See the source code of this presentation for details of the underlying code.
Adding a number-at-risk table
You may want to add a number-at-risk table to the cumulative incidence plot; as far as I know there is no easy way to do this. See the source code of this presentation for an example.
Competing risks regression
Two approaches:
- Cause-specific hazards
  - the instantaneous rate of occurrence of the given type of event in subjects who are currently event-free
  - estimated using Cox regression
- Subdistribution hazards
  - the instantaneous rate of occurrence of the given type of event in subjects who have not yet experienced an event of that type
  - estimated using Fine-Gray regression
Competing risks regression in the melanoma data – subdistribution hazard
Suppose we are interested in the effect of age and sex on death from melanoma, with death from other causes as a competing event.
- The crr function requires covariates to be supplied as a matrix (see the sketch after this list)
- If there are multiple competing events, the failcode option can be used to request results for a different event; by default failcode = 1
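A sketch of creating the shr_fit object printed below (covariates supplied as a numeric matrix; Melanoma data and censoring code as above):

covs <- as.matrix(Melanoma[, c("sex", "age")])

shr_fit <-
  crr(
    ftime   = Melanoma$time,
    fstatus = Melanoma$status,
    cov1    = covs,
    cencode = 2   # failcode = 1 (death from melanoma) is the default
  )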
shr_fit
## convergence:  TRUE 
## coefficients:
##     sex     age 
## 0.58840 0.01259 
## standard errors:
## ...
## two-sided p-values:
##  sex  age 
##  ...
In the previous example both sex and age were coded as numeric variables. If there are character variables, model.matrix must be used to create a numeric covariate matrix.
Formatting results from crr
Note that not all output-formatting packages currently support crr objects.
Competing risks regression in the melanoma data – cause-specific hazard
Censor all subjects who do not have the event of interest, in this case death from melanoma, and use coxph as before. So for the cause-specific hazard approach to competing risks, patients who died from other causes are now treated as censored; a sketch follows.
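A sketch of the cause-specific approach (death from melanoma, status == 1, is the event; everything else is treated as censored):

coxph(
  Surv(time, ifelse(status == 1, 1, 0)) ~ sex + age,
  data = Melanoma
)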
Part 4: Advanced topics
Content covered
- Basic knowledge of survival analysis, including Kaplan-Meier survival function and Cox regression
- Landmark analysis and time-dependent covariates
- Cumulative incidence and regression for competing risks analyses
What else is there?
There are lots of other bits and pieces, including:
- Assessing the proportional hazards assumption
- Smooth plots of x-year survival according to a continuous covariate
- Conditional survival
Assessing proportional hazards
One assumption of the Cox proportional hazards regression model is that the hazards are proportional at each point in time throughout follow-up. How do we check whether our data meet this assumption?
Use the cox.zph function from the survival package (a sketch of creating the cz object follows this list). It produces two things:
- A hypothesis test of whether the effect of each covariate differs according to time, and a global test of all covariates
  - This is done by testing for an interaction between each covariate and log(time)
  - A significant p-value indicates that the proportional hazards assumption is violated
- Plots of the (Schoenfeld) residuals
  - Deviation from a zero-slope line is evidence that the proportional hazards assumption is violated
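A sketch of creating the cz object printed below (assuming the sex model fit to the lung data in Part 1):

cz <- cox.zph(coxph(Surv(time, status) ~ sex, data = lung))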
print(cz)
## (cox.zph output: a chisq statistic, df, and p-value for each covariate and a GLOBAL test)
plot(cz)
Smooth survival plot – quantile of survival
Sometimes you may want to visualize a survival estimate according to a continuous variable. Here we estimate a quantile of the survival distribution; the default quantile is p = 0.5, i.e. median survival.
- x symbols represent events
- o symbols represent censored observations
- The line is a smoothed estimate of median survival according to age
Conditional survival
Sometimes it is of interest to generate survival estimates among a group of patients who have already survived for some length of time.
Zabor, E., Gonen, M., Chapman, P., & Panageas, K. (2013). Dynamic prognostication using conditional survival estimates. Cancer, 119(20), 3589-3592.
Conditional survival estimates
Let's condition on survival to 6 months:
# conditional_surv_est() is not defined in this post; it is assumed to come from
# the condsurv package. prob_times is a vector of times (in days) at which to
# estimate survival, and fit1 is an overall survfit object, e.g.
# survfit(Surv(time, status) ~ 1, data = lung). map_df() is from purrr, kable() from knitr.
map_df(
  prob_times,
  ~conditional_surv_est(
    basekm = fit1,     # the baseline Kaplan-Meier fit
    t1 = 182.625,      # condition on surviving 6 months (182.625 days)
    t2 = .x)           # time at which to estimate conditional survival
) %>%
  mutate(months = round(prob_times / 30.4)) %>%
  select(months, everything()) %>%
  kable()
Conditional survival plot
We can also visualize conditional survival data based on different lengths of time survived.
The resulting plot has one survival curve for each time on which we condition. In this case the first curve is the overall survival curve, since it is conditioned on survival to time 0.