Age-, and sex-adjusted incidence rates – michaeldismorrmd.github.io

In this first post I’m going to present a way of obtaining age- and sex-adjusted incidence rates using poisson regression in R. This will be similar to what is done in Stata as described here.

I’ve written a R function that’s available for download here. The script can be sourced ( source("age_sex_adjust.R") ) and then the function age_sex_adjust() can be used as is. Note that the time variable will have to be in days, and the incidence will be presented as per 100 person-years. The code will also be described step by step below.

There are, as usual, several ways to calculate adjusted incidence rates in R. I’ve chosen to use the package marginaleffects by Vincent Arel-Bundock, Noah Greifer and Andrew Heiss because it has a lot of nice features and useful implications in causal inference. Specifically, we will use the function avg_predictions() from marginaleffects to generate the the adjusted incidence rates.

But first we start off with a little bit of background on what an incidence rate is. It is simply a measure of a number of occurrences (a count) in a population over the total population time. For example, in a population of 10 people, each followed 1 year, there was one case of death. In that population, the incidence rate of death would 1 per 10 person-years. In observational data, we often have larger cohorts with varying follow-up time and censoring. The calculation is of course the same, using the formula below:

\[\text{Incidence rate} = \frac{\text{Number of occurrences}}{\sum_{\text{Persons}}{\text{Time in study}}}\]

Calculating crude incidence rate

To illustrate, we will now use the colon dataset from the survival package.

library(survival)
library(dplyr)
library(broom)

Running ?survival::colon tells us the following:

Data from one of the first successful trials of adjuvant chemotherapy for colon cancer

Variable	Explanation
id	Patient id
study	1 for all patients
rx	1 for all patients
sex	1 = male
age	in years
obstruct	colon obstruction by tumour
perfor	performation of colon
adhere	adherence to nearby organs
nodes	number of positive lymph nodes
time	days until event or censoring
status	censoring status
differ	tumour differentiation — 1 = well, 2 = moderate, 3 = poor
extent	extent of local spread — 1 = submucosa, 2 = muscle, 3 = serosa, 4 = continious
surg	time from surgery to registration — 0 = short, 1 = long
node4	more than 4 positive lymph nodes
etype	event type — 1 = recurrence, 2 = death

OK, so now that we understand the data, let’s start calculating crude incidence rates for death among the different treatment groups:

# Using the colon dataset from the survival package

# Only keep records related to the death outcome
colon_death <- survival::colon %>% dplyr::filter(etype == 2) 

# Time is divided by 365.25/100 to get the time in days 
# first to years, then to 100 person-years

colon_death %>% group_by(rx) %>% 
                summarise(Events = sum(status), 
                          Time = sum(time/365.25/100), 
                          Rate = Events / Time, 
                          lower = poisson.test(Events, Time)$conf.int[1], 
                          upper = poisson.test(Events, Time)$conf.int[2])

# A tibble: 3 × 6
  rx      Events  Time  Rate lower upper
  <fct>    <dbl> <dbl> <dbl> <dbl> <dbl>
1 Obs        168  13.8 12.2  10.4  14.2 
2 Lev        161  13.7 11.7  10.0  13.7 
3 Lev+5FU    123  15.0  8.22  6.83  9.80

Now we compare to the calculated rates with rates obtained from the survRate() function from the biostat3 package:

library(biostat3)
survRate(Surv(time/365.25/100, status) ~ rx, data = colon_death) %>% 
  dplyr::select(rx, event, tstop, rate, lower, upper) %>% 
  as_tibble() %>% 
  dplyr::rename(Events = event, 
                Time = tstop, 
                Rate = rate)

# A tibble: 3 × 6
  rx      Events  Time  Rate lower upper
  <fct>    <dbl> <dbl> <dbl> <dbl> <dbl>
1 Obs        168  13.8 12.2  10.4  14.2 
2 Lev        161  13.7 11.7  10.0  13.7 
3 Lev+5FU    123  15.0  8.22  6.83  9.80

Good, the incidence rates are identical. The observational patients had an mortality incidence rate of 12.2 per 100 person-years, compared to the Lev+5-FU treated patients with an incidence rate of 8.22 per 100 person-years. Now, let’s try and repeat these results with poisson regression.

Obtaining estimated incidence rates using poisson regression

Here we use the broom package tidy() function to obtain exponentiated estimates:

# Fit the model to estimate IRR using offset
poisson_fit <- glm(status ~ rx + offset(log(time/365.25/100)), 
                   family = poisson, data = colon_death)

# Exponentiate the estimate to obtain IRR
tidy(poisson_fit, exponentiate = T, conf.int = T)

# A tibble: 3 × 7
  term        estimate std.error statistic   p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
1 (Intercept)   12.2      0.0772    32.4   3.13e-230   10.4      14.1  
2 rxLev          0.965    0.110     -0.324 7.46e-  1    0.777     1.20 
3 rxLev+5FU      0.675    0.119     -3.32  9.16e-  4    0.534     0.850

The Intercept estimate here is the estimated IR for the reference level, i.e. the Obs group.

To get estimated IR of Lev+5FU treated:

lev_5fu <- predict(poisson_fit, 
                   newdata = data.frame(rx = "Lev+5FU", time = 36525), 
                   type = "link", se.fit = T)

as_tibble(lev_5fu) %>% summarise(Treatment = "Lev+5FU", 
                                 IR = exp(fit), 
                                 lower = exp(fit - (1.96 * se.fit)), 
                                 upper = exp(fit + (1.96 * se.fit)))

# A tibble: 1 × 4
  Treatment    IR lower upper
  <chr>     <dbl> <dbl> <dbl>
1 Lev+5FU    8.22  6.88  9.80

Here, the confidence interval needs to be calculated on the \(\log\) scale and then exponentiated back. This will cause the confidence interval to not be centered around the estimate.

A poisson model can model \(\log\text{incidence rates (ratios)}\) when we use the time variable as an offset. Therefore, we can include covariates in the model to be accounted for, such as age and sex.

Age- and sex-adjusted incidence rates using poisson regression

First, we’ll do it using my age_sex_adjust() function

source("age_sex_adjust.R")
# Usage: age_sex_adjust(data, group, age, sex, event, time)

age_sex_adjust(colon_death, rx, age, sex, status, time)

  Treatment        IR  Lower_CI  Upper_CI
1   Lev+5FU  8.245483  6.786801  9.704165
2       Obs 12.211861 10.362612 14.061109
3       Lev 11.819634  9.990693 13.648575

Here we see that the adjusted rates are very similar to the crude rates calculated above. Since this data comes from a randomized trial, this is expected and can be taken as a sign that the randomization worked.

Now, let’s do the some thing but without using the ready made function to see how it works under the hood.

library(marginaleffects)

# Fit the model using offset to estimate IRR
poisson_fit <- glm(status ~ rx * I(age^2) + sex + offset(log(as.numeric(time))), 
                   data = colon_death, 
                   family = poisson)

# Create a new dataset where time is converted to 36525 days (100 years)
newdat <- colon_death %>% mutate(time = 36525) 

# Use avg_predictions to estimate what the incidence rate would have been if 
# the entire population would have been treated according to each level of rx
result <- avg_predictions(poisson_fit, 
                          variables = "rx", 
                          newdata = transform(colon_death, time = 36525), 
                          type = "response")

result %>% dplyr::select(rx, estimate, conf.low, conf.high)


 Estimate CI low CI high
     8.25   6.79     9.7
    12.21  10.36    14.1
    11.82   9.99    13.6

Columns: rx, estimate, conf.low, conf.high

The numbers are identical to the ones obtained from the age_sex_adjust() function, which is logical since we did the same thing as the function does.
A few finishing notes. Here I included age as a quadratic term, and as an interaction with exposure. These are modeling decisions one will have to take, however the model could have been only a main effects model such as:

\[\log\lambda = \beta_0 + \beta_1\text{rxLev} + \beta_2\text{rxLev+5FU} + \beta_3\text{age} + \beta_4\text{sex}\]
Regarding the interaction term, a good explanation was given in the Stata forum in this post.

For anyone who wants to read more, I recommend the course material from the PhD course Biostatistics III at Karolinska Institutet, available here.