library(survival)
library(dplyr)
library(broom)
In this first post I’m going to present a way of obtaining age- and sex-adjusted incidence rates using Poisson regression in R. This is similar to what is done in Stata, as described here.
I’ve written an R function that’s available for download here. The script can be sourced (source("age_sex_adjust.R")) and then the function age_sex_adjust() can be used as is. Note that the time variable has to be in days, and the incidence is presented per 100 person-years. The code is also described step by step below.
There are, as usual, several ways to calculate adjusted incidence rates in R. I’ve chosen to use the package marginaleffects by Vincent Arel-Bundock, Noah Greifer and Andrew Heiss because it has a lot of nice features and is useful in causal inference. Specifically, we will use the function avg_predictions() from marginaleffects to generate the adjusted incidence rates.
But first, a little background on what an incidence rate is. It is simply the number of occurrences (a count) in a population divided by the total time the population was followed. For example, in a population of 10 people, each followed for 1 year, there was one death. In that population, the incidence rate of death would be 1 per 10 person-years. In observational data, we often have larger cohorts with varying follow-up time and censoring. The calculation is the same, using the formula below:
\[\text{Incidence rate} = \frac{\text{Number of occurrences}}{\sum_{\text{Persons}}{\text{Time in study}}}\]
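As a minimal sketch of the formula with unequal follow-up, the made-up cohort below (the data frame `cohort` and its numbers are illustrative assumptions, not data from this post) has five people followed for different lengths of time:

```r
# Hypothetical cohort: follow-up in days and an event indicator (1 = event)
cohort <- data.frame(
  time   = c(365.25, 730.5, 182.625, 1461, 913.125),
  status = c(0, 1, 0, 1, 0)
)

# Total person-time, converted from days to years
person_years <- sum(cohort$time) / 365.25

# Incidence rate per 100 person-years: events divided by person-time
events <- sum(cohort$status)
rate_per_100py <- events / person_years * 100
rate_per_100py
```

Here 2 events over 10 person-years give a rate of 20 per 100 person-years.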
Calculating crude incidence rate
To illustrate, we will now use the colon dataset from the survival package.
Running ?survival::colon tells us the following:
Data from one of the first successful trials of adjuvant chemotherapy for colon cancer
Variable | Explanation |
---|---|
id | Patient id |
study | 1 for all patients |
rx | treatment: Obs(ervation), Lev(amisole), Lev(amisole)+5-FU |
sex | 1 = male |
age | in years |
obstruct | colon obstruction by tumour |
perfor | perforation of colon |
adhere | adherence to nearby organs |
nodes | number of positive lymph nodes |
time | days until event or censoring |
status | censoring status |
differ | tumour differentiation — 1 = well, 2 = moderate, 3 = poor |
extent | extent of local spread — 1 = submucosa, 2 = muscle, 3 = serosa, 4 = contiguous structures |
surg | time from surgery to registration — 0 = short, 1 = long |
node4 | more than 4 positive lymph nodes |
etype | event type — 1 = recurrence, 2 = death |
OK, so now that we understand the data, let’s start calculating crude incidence rates for death among the different treatment groups:
# Using the colon dataset from the survival package
# Only keep records related to the death outcome
colon_death <- survival::colon %>% dplyr::filter(etype == 2)
# Time is divided by 365.25 to convert days to years,
# then by 100 to get units of 100 person-years
colon_death %>% group_by(rx) %>%
  summarise(Events = sum(status),
            Time = sum(time/365.25/100),
            Rate = Events / Time,
            lower = poisson.test(Events, Time)$conf.int[1],
            upper = poisson.test(Events, Time)$conf.int[2])
# A tibble: 3 × 6
rx Events Time Rate lower upper
<fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Obs 168 13.8 12.2 10.4 14.2
2 Lev 161 13.7 11.7 10.0 13.7
3 Lev+5FU 123 15.0 8.22 6.83 9.80
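The confidence limits returned by poisson.test() are exact Poisson limits, which can be reproduced from gamma quantiles. The sketch below uses illustrative inputs (the values of `events` and `person_time` are assumptions, not taken from the colon data) and compares the hand-computed interval with the one from poisson.test():

```r
# Illustrative inputs (assumed, not taken from the colon data)
events <- 25
person_time <- 4.2   # in units of 100 person-years

# Exact two-sided 95% interval for a Poisson rate via gamma quantiles
alpha <- 0.05
lower <- qgamma(alpha / 2, shape = events) / person_time
upper <- qgamma(1 - alpha / 2, shape = events + 1) / person_time

# Interval reported by poisson.test() for the same inputs
reference <- poisson.test(events, person_time)$conf.int

c(lower = lower, upper = upper)
reference
```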
Now we compare the calculated rates with rates obtained from the survRate() function from the biostat3 package:
library(biostat3)
survRate(Surv(time/365.25/100, status) ~ rx, data = colon_death) %>%
  dplyr::select(rx, event, tstop, rate, lower, upper) %>%
  as_tibble() %>%
  dplyr::rename(Events = event,
                Time = tstop,
                Rate = rate)
# A tibble: 3 × 6
rx Events Time Rate lower upper
<fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Obs 168 13.8 12.2 10.4 14.2
2 Lev 161 13.7 11.7 10.0 13.7
3 Lev+5FU 123 15.0 8.22 6.83 9.80
Good, the incidence rates are identical. Patients in the observation (Obs) group had a mortality incidence rate of 12.2 per 100 person-years, compared with 8.22 per 100 person-years among the Lev+5FU-treated patients. Now, let’s try to reproduce these results with Poisson regression.
Obtaining estimated incidence rates using Poisson regression
Here we use the tidy() function from the broom package to obtain exponentiated estimates:
# Fit the model to estimate IRR using offset
poisson_fit <- glm(status ~ rx + offset(log(time/365.25/100)),
                   family = poisson, data = colon_death)
# Exponentiate the estimate to obtain IRR
tidy(poisson_fit, exponentiate = TRUE, conf.int = TRUE)
# A tibble: 3 × 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 12.2 0.0772 32.4 3.13e-230 10.4 14.1
2 rxLev 0.965 0.110 -0.324 7.46e- 1 0.777 1.20
3 rxLev+5FU 0.675 0.119 -3.32 9.16e- 4 0.534 0.850
The Intercept estimate here is the estimated IR for the reference level, i.e. the Obs group.
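To see why the exponentiated intercept equals the crude rate of the reference group, here is a small sketch on simulated data (the data frame `sim` and its columns `group`, `years` and `event` are made up for illustration). With only a group term and a log-time offset, the fitted rate in each group is its events divided by its person-time:

```r
set.seed(1)

# Simulated data (an assumption for illustration): two groups, follow-up in years
n <- 200
sim <- data.frame(
  group = factor(rep(c("A", "B"), each = n / 2)),
  years = runif(n, min = 0.5, max = 5)
)
# Event indicator; group B gets a lower event probability than group A
sim$event <- rbinom(n, size = 1, prob = ifelse(sim$group == "A", 0.5, 0.3))

# Poisson model with log person-time as offset
fit <- glm(event ~ group + offset(log(years)), family = poisson, data = sim)

# Crude rate in the reference group A: events divided by person-time
crude_A <- with(subset(sim, group == "A"), sum(event) / sum(years))

# exp(intercept) reproduces the crude rate of the reference group
c(model = unname(exp(coef(fit)[1])), crude = crude_A)
```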
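To get the estimated IR for the Lev+5FU group:
# Predict on the link (log) scale for 36525 days, i.e. 100 person-years
lev_5fu <- predict(poisson_fit,
                   newdata = data.frame(rx = "Lev+5FU", time = 36525),
                   type = "link", se.fit = TRUE)
# Exponentiate the estimate and its Wald limits back to the rate scale
as_tibble(lev_5fu) %>% summarise(Treatment = "Lev+5FU",
                                 IR = exp(fit),
                                 lower = exp(fit - (1.96 * se.fit)),
                                 upper = exp(fit + (1.96 * se.fit)))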
# A tibble: 1 × 4
Treatment IR lower upper
<chr> <dbl> <dbl> <dbl>
1 Lev+5FU 8.22 6.88 9.80
Here, the confidence interval needs to be calculated on the \(\log\) scale and then exponentiated back. As a result, the confidence interval is not symmetric around the estimate.
A Poisson model can model \(\log\) incidence rates (and thereby incidence rate ratios) when we use the log of the time variable as an offset. Therefore, we can include covariates such as age and sex in the model to adjust for them.
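To spell out why the offset turns a model for counts into a model for rates, write \(Y_i\) for the event count of person \(i\), \(t_i\) for their follow-up time and \(x_i\) for a covariate (these symbols are introduced here only for illustration). Moving the offset to the left-hand side gives:

\[\log \mathrm{E}[Y_i] = \log t_i + \beta_0 + \beta_1 x_i \quad\Longleftrightarrow\quad \log\lambda_i = \log\frac{\mathrm{E}[Y_i]}{t_i} = \beta_0 + \beta_1 x_i\]

The offset enters with a coefficient fixed at 1, so \(\exp(\beta_0)\) is a rate and \(\exp(\beta_1)\) is a rate ratio.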
Age- and sex-adjusted incidence rates using Poisson regression
First, we’ll do it using my age_sex_adjust() function:
source("age_sex_adjust.R")
# Usage: age_sex_adjust(data, group, age, sex, event, time)
age_sex_adjust(colon_death, rx, age, sex, status, time)
Treatment IR Lower_CI Upper_CI
1 Lev+5FU 8.245483 6.786801 9.704165
2 Obs 12.211861 10.362612 14.061109
3 Lev 11.819634 9.990693 13.648575
Here we see that the adjusted rates are very similar to the crude rates calculated above. Since this data comes from a randomized trial, this is expected and can be taken as a sign that the randomization worked.
Now, let’s do the same thing without the ready-made function, to see how it works under the hood.
library(marginaleffects)
# Fit the model using offset to estimate IRR
poisson_fit <- glm(status ~ rx * I(age^2) + sex + offset(log(as.numeric(time))),
                   data = colon_death,
                   family = poisson)
# Create a new dataset where time is converted to 36525 days (100 years)
newdat <- colon_death %>% mutate(time = 36525)
# Use avg_predictions to estimate what the incidence rate would have been if
# the entire population would have been treated according to each level of rx
result <- avg_predictions(poisson_fit,
                          variables = "rx",
                          newdata = newdat,
                          type = "response")
result %>% dplyr::select(rx, estimate, conf.low, conf.high)
Estimate CI low CI high
8.25 6.79 9.7
12.21 10.36 14.1
11.82 9.99 13.6
Columns: rx, estimate, conf.low, conf.high
The numbers are identical to the ones obtained from the age_sex_adjust() function, which is expected, since we did the same thing the function does.
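The averaging step can also be written out by hand in base R. The sketch below is an illustration of the idea on simulated data, not the internals of marginaleffects: the data frame `sim` and its columns are made up, and the model is a plain main effects model. For each treatment level, everyone is assigned that level, follow-up is fixed at 36525 days, and the predicted counts are averaged over the cohort:

```r
set.seed(42)

# Simulated cohort (an assumption for illustration only)
n <- 500
sim <- data.frame(
  trt  = factor(sample(c("A", "B"), n, replace = TRUE)),
  age  = round(runif(n, 40, 80)),
  sex  = rbinom(n, 1, 0.5),
  time = round(runif(n, 100, 3000))   # follow-up in days
)
sim$event <- rbinom(n, 1, 0.4)

fit <- glm(event ~ trt + age + sex + offset(log(time)),
           family = poisson, data = sim)

# For each treatment level: assign everyone to that level, fix follow-up at
# 36525 days (100 person-years), predict, and average over the cohort
standardised <- sapply(levels(sim$trt), function(lvl) {
  counterfactual <- sim
  counterfactual$trt  <- factor(rep(lvl, nrow(sim)), levels = levels(sim$trt))
  counterfactual$time <- 36525
  mean(predict(fit, newdata = counterfactual, type = "response"))
})
standardised
```

Because the covariate distribution is the same in every counterfactual copy of the data, the resulting rates differ only through the treatment effect. Confidence intervals are left out here; avg_predictions() supplies them.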
A few finishing notes. Here I included age as a quadratic term and let it interact with the exposure. These are modeling decisions one has to make. The model could instead have been a main effects model such as:
\[\log\lambda = \beta_0 + \beta_1\text{rxLev} + \beta_2\text{rxLev+5FU} + \beta_3\text{age} + \beta_4\text{sex}\]
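For reference, a main effects model of this form could be fitted roughly as follows. This is a sketch that recreates the death-outcome data so it runs on its own; the object name `main_fit` is introduced here for illustration:

```r
# Recreate the death-outcome data so the block is self-contained
colon_death <- subset(survival::colon, etype == 2)

# Main effects only: treatment, linear age, and sex, with log(time) as offset
main_fit <- glm(status ~ rx + age + sex + offset(log(time)),
                family = poisson, data = colon_death)

# Exponentiated coefficients; the non-intercept terms are incidence rate ratios
exp(coef(main_fit))
```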
Regarding the interaction term, a good explanation was given in the Stata forum in this post.
For anyone who wants to read more, I recommend the course material from the PhD course Biostatistics III at Karolinska Institutet, available here.