This document forms part of the analysis used in the paper:
Collider bias undermines our understanding of COVID-19 disease risk and severity. Gareth Griffith, Tim T Morris, Matt Tudball, Annie Herbert, Giulia Mancano, Lindsey Pike, Gemma C Sharp, Tom M Palmer, George Davey Smith, Kate Tilling, Luisa Zuccolo, Neil M Davies, Gibran Hemani
It is hosted at https://github.com/MRCIEU/ukbb-covid-collider.
Here we show a set of analyses to illustrate collider bias induced by non-random testing for Covid-19 among UK Biobank participants, and some approaches to adjust for this bias. The methods are described in further detail in Griffith et al. (2020).
The following variables from the UK Biobank phenotype data are used:

- 34-0.0 - Year of birth (converted into age for this analysis)
- 31-0.0 - Sex (male = 1, female = 0)
- 23104-0.0 - Body mass index (BMI)

Also, the linked Covid-19 freeze from 2020-06-05 is used to identify which individuals have been tested and tested positive.
In the analysis that follows, we will be estimating the association between testing positive for Covid-19 and the risk factors age, sex and BMI. The key concern with such an analysis is that we only observe test results among individuals who have received a test. SARS-CoV-2 infection and the risk factors themselves will influence the likelihood of receiving a test, which could induce spurious associations among them when we condition on receiving a test. We will explore inverse probability weighting and sensitivity analyses to address the potential collider bias.
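Before turning to the data, the mechanism can be illustrated with a minimal simulation (this sketch is not part of the original analysis, and all parameter values are arbitrary). A risk factor and infection are generated independently, but both increase the chance of being tested, so conditioning on being tested induces a spurious association between them:

# Toy simulation of collider bias: bmi and infection are independent,
# but both increase the probability of being tested
set.seed(42)
n <- 100000
toy <- data.frame(bmi = rnorm(n), infected = rbinom(n, 1, 0.05))
toy$p_test <- plogis(-3 + 0.5*toy$bmi + 2*toy$infected)
toy$tested <- rbinom(n, 1, toy$p_test)
# No association in the full sample...
coef(lm(infected ~ bmi, data=toy))["bmi"]
# ...but a spurious (here negative) association among those tested
coef(lm(infected ~ bmi, data=toy, subset=tested==1))["bmi"]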
suppressMessages(suppressPackageStartupMessages({
library(knitr)
library(dplyr)
library(ggplot2)
library(selectioninterval)
}))
knitr::opts_chunk$set(warning=FALSE, message=FALSE, echo=TRUE, cache=TRUE)
load("data/dat.rdata")
dat <- dat[complete.cases(dat[,c("age","sex","bmi","tested")]), ]
str(dat)
## tibble [492,392 x 5] (S3: tbl_df/tbl/data.frame)
## $ age : num [1:492392] 74 77 73 73 76 55 69 75 78 73 ...
## $ sex : int [1:492392] 0 1 1 0 1 1 0 0 0 0 ...
## $ bmi : num [1:492392] 20.8 27.5 28.6 27.3 26.5 26.5 30.9 24 23.2 31.8 ...
## $ tested : num [1:492392] 0 0 0 0 0 0 0 0 0 0 ...
## $ positive: num [1:492392] NA NA NA NA NA NA NA NA NA NA ...
Summaries:
hist(dat$age)
table(dat$sex) / nrow(dat)
##
## 0 1
## 0.5448383 0.4551617
How many individuals tested:
table(dat$tested)
##
## 0 1
## 486488 5904
How many individuals tested positive:
table(dat$positive)
##
## 0 1
## 4475 1429
The most basic approach is to report the raw pairwise associations between the risk factors and testing positive.
assoc <- data.frame(matrix(nrow=3, ncol=3))
names(assoc) <- c("risk_factor", "beta", "se")
assoc$risk_factor <- c("sex", "age", "bmi")
mod <- summary(lm(positive ~ sex, dat))
assoc[assoc$risk_factor=="sex",c("beta","se")] <- coefficients(mod)["sex",c("Estimate","Std. Error")]
mod <- summary(lm(positive ~ age, dat))
assoc[assoc$risk_factor=="age",c("beta","se")] <- coefficients(mod)["age",c("Estimate","Std. Error")]
mod <- summary(lm(positive ~ bmi, dat))
assoc[assoc$risk_factor=="bmi",c("beta","se")] <- coefficients(mod)["bmi",c("Estimate","Std. Error")]
ggplot(assoc, aes(risk_factor, beta)) +
geom_errorbar(
aes(ymin = beta-1.96*se,
ymax = beta+1.96*se),
width = 0.1, size=1, color="#00BFC4") +
geom_point(size=3, shape=21, fill="#00BFC4") +
geom_hline(yintercept=0, linetype="dashed", color = "#575863") +
scale_x_discrete(labels=c("Age", "BMI", "Sex")) +
xlab("") + ylab("Value of association") +
coord_flip()
However, it is unclear to what extent these associations may be distorted by collider bias.
A simple exploratory exercise is to compare full-sample and sub-sample associations among the variables that are observed in both, namely age, sex and BMI.
We begin by exploring their relationship with receiving a test.
summary(glm(tested ~ age + sex + bmi, data=dat, family="binomial"))
##
## Call:
## glm(formula = tested ~ age + sex + bmi, family = "binomial",
## data = dat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.3405 -0.1628 -0.1527 -0.1434 3.1402
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.922371 0.132640 -44.650 < 2e-16 ***
## age 0.006487 0.001634 3.969 7.22e-05 ***
## sex 0.103816 0.026224 3.959 7.53e-05 ***
## bmi 0.036462 0.002500 14.585 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 63971 on 492391 degrees of freedom
## Residual deviance: 63730 on 492388 degrees of freedom
## AIC: 63738
##
## Number of Fisher Scoring iterations: 7
Therefore, we would expect the relationships among these variables to be distorted within the tested sample. For example, this is the relationship between age and sex in the full sample:
mod1 <- summary(lm(age ~ sex, dat))
coefficients(mod1) %>% kable
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 68.2695118 | 0.0156525 | 4361.57945 | 0 |
| sex | 0.3703387 | 0.0232006 | 15.96243 | 0 |
And here it is among those who received a test:
mod2 <- summary(lm(age ~ sex, dat, subset=dat$tested==1))
coefficients(mod2) %>% kable
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 67.67524 | 0.1589147 | 425.85894 | 0 |
| sex | 2.54350 | 0.2278876 | 11.16121 | 0 |
Note that the association is around 7 times larger in the tested subset.
Similarly, this is the overall association between sex and BMI:
mod3 <- summary(lm(bmi ~ sex, dat))
coefficients(mod3) %>% kable
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 27.0942063 | 0.0092111 | 2941.48475 | 0 |
| sex | 0.7419068 | 0.0136530 | 54.34035 | 0 |
and here it is in the tested subsample:
mod4 <- summary(lm(bmi ~ sex, dat, subset=dat$tested==1))
coefficients(mod4) %>% kable
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 28.1546983 | 0.0963758 | 292.134651 | 0.0000000 |
| sex | 0.4174368 | 0.1382052 | 3.020413 | 0.0025351 |
where the association almost halves.
The first approach we consider is inverse probability weighting. Since tested individuals are a subset of the entire sample, it is possible to generate probabilities of being included in the tested sub-sample based on the measured risk factors. The inverse of these probabilities can be used to weight the association estimate towards being representative of the full sample.
Note that, since testing positive is only measured within the tested sub-sample, we cannot include it in the weighting model. Therefore, for the inverse weighted estimates to be unbiased for the full-sample risk factor associations, we would need each individual’s likelihood of being tested to be independent of whether they are infected with Covid-19 given their age, sex and BMI. Since this is unlikely to be true, we will explore sensitivity analyses later in the tutorial.
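Continuing the toy simulation above (again, a sketch rather than part of the original analysis), this is easy to see: weights estimated from the observed risk factor alone leave residual bias, whereas weighting by the true, in practice unknown, testing probabilities recovers the full-sample (null) association:

# Weight model using only the observed risk factor (infection is unobserved)
wmod <- glm(tested ~ bmi, family="binomial", data=toy)
toy$w_obs <- 1/fitted.values(wmod)
# Residual bias remains because testing still depends on infection given bmi
coef(lm(infected ~ bmi, data=toy, subset=tested==1, weights=w_obs))["bmi"]
# Weighting by the true testing probabilities recovers the null association
coef(lm(infected ~ bmi, data=toy, subset=tested==1, weights=1/p_test))["bmi"]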
First, we find which variables associate with being tested, beginning with the marginal effects:
mod5 <- glm(tested ~ age + sex + bmi, data=dat, family="binomial")
prs1 <- fitted.values(mod5)
coefficients(summary(mod5)) %>% kable
| | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | -5.9223706 | 0.1326401 | -44.649915 | 0.00e+00 |
| age | 0.0064868 | 0.0016344 | 3.968823 | 7.22e-05 |
| sex | 0.1038164 | 0.0262241 | 3.958822 | 7.53e-05 |
| bmi | 0.0364623 | 0.0025000 | 14.585195 | 0.00e+00 |
It’s important to also test for interactions between variables (Groenwold, Palmer, and Tilling 2020).
mod6 <- glm(tested ~ bmi + age + sex + bmi * age + bmi * sex + age * sex + age * bmi * sex, data=dat, family="binomial")
coefficients(summary(mod6)) %>% kable
| | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | -2.4386662 | 0.7600262 | -3.208661 | 0.0013335 |
| bmi | -0.0469310 | 0.0267885 | -1.751907 | 0.0797899 |
| age | -0.0452259 | 0.0112794 | -4.009590 | 0.0000608 |
| sex | -3.9881129 | 1.3111898 | -3.041598 | 0.0023533 |
| bmi:age | 0.0012435 | 0.0003960 | 3.140162 | 0.0016885 |
| bmi:sex | 0.0604504 | 0.0458599 | 1.318153 | 0.1874523 |
| age:sex | 0.0598076 | 0.0189730 | 3.152243 | 0.0016202 |
| bmi:age:sex | -0.0008918 | 0.0006627 | -1.345629 | 0.1784222 |
Curiously, the marginal BMI effect appears to be absorbed entirely by the interaction terms. Centering the variables before forming interactions may be more appropriate, as otherwise each interaction term is simply a multiplicative effect that is correlated with the main effects.
dat$bmixage <- scale(dat$bmi) * scale(dat$age)
dat$bmixsex <- scale(dat$bmi) * scale(dat$sex)
dat$agexsex <- scale(dat$age) * scale(dat$sex)
dat$agexsexxbmi <- scale(dat$age) * scale(dat$sex) * scale(dat$bmi)
mod7 <- glm(tested ~ bmi + age + sex + bmixage + bmixsex + agexsex + agexsexxbmi, data=dat, family="binomial")
coefficients(summary(mod7)) %>% kable
| | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | -5.8669485 | 0.1324589 | -44.2926117 | 0.0000000 |
| bmi | 0.0379081 | 0.0025664 | 14.7706579 | 0.0000000 |
| age | 0.0049735 | 0.0016671 | 2.9832590 | 0.0028520 |
| sex | 0.0890447 | 0.0271695 | 3.2773819 | 0.0010477 |
| bmixage | 0.0325030 | 0.0125774 | 2.5842439 | 0.0097593 |
| bmixsex | -0.0013862 | 0.0124699 | -0.1111627 | 0.9114873 |
| agexsex | 0.1427313 | 0.0134988 | 10.5736388 | 0.0000000 |
| agexsexxbmi | -0.0172329 | 0.0128066 | -1.3456290 | 0.1784222 |
Here the marginal effects are all retained.
We can generate the probabilities of each individual being tested based on these variables using:
prs2 <- fitted.values(mod7)
hist(prs2, breaks=100)
It is also important to consider non-linearities in the relationships of the continuous variables, age and BMI, with testing. There appears to be a quadratic relationship between age and testing and a cubic relationship between BMI and testing.
# Age plot
ggplot(dat[dat$age<82,],aes(x=age, y=tested)) +
geom_smooth(method="gam",formula=y~s(x)) +
xlab("Age") + ylab("Likelihood of Testing") +
scale_x_continuous(breaks=seq(min(dat$age),82,by=2))
# BMI plot
ggplot(dat[dat$bmi<60,],aes(x=bmi, y=tested)) +
geom_smooth(method="gam",formula=y~s(x)) +
xlab("BMI") + ylab("Likelihood of Testing") +
scale_x_continuous(breaks=seq(12,60,by=5))
We select a logistic regression which is quadratic in age, cubic in BMI and contains all linear interactions between age, sex and BMI.
mod8 <- glm(tested ~ sex + poly(age, 2) + poly(bmi, 3) + bmixage + bmixsex + agexsex + agexsexxbmi, data=dat, family="binomial")
summary(mod8)
##
## Call:
## glm(formula = tested ~ sex + poly(age, 2) + poly(bmi, 3) + bmixage +
## bmixsex + agexsex + agexsexxbmi, family = "binomial", data = dat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.3560 -0.1657 -0.1467 -0.1340 3.1318
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.505016 0.018849 -239.006 < 2e-16 ***
## sex 0.090564 0.027568 3.285 0.00102 **
## poly(age, 2)1 27.063717 8.678689 3.118 0.00182 **
## poly(age, 2)2 140.043970 8.604877 16.275 < 2e-16 ***
## poly(bmi, 3)1 125.971024 8.872164 14.198 < 2e-16 ***
## poly(bmi, 3)2 14.102663 8.422602 1.674 0.09406 .
## poly(bmi, 3)3 -28.860842 9.025124 -3.198 0.00138 **
## bmixage 0.019016 0.011397 1.669 0.09521 .
## bmixsex -0.002089 0.012433 -0.168 0.86659
## agexsex 0.115827 0.012389 9.349 < 2e-16 ***
## agexsexxbmi -0.011824 0.011538 -1.025 0.30547
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 63971 on 492391 degrees of freedom
## Residual deviance: 63342 on 492381 degrees of freedom
## AIC: 63364
##
## Number of Fisher Scoring iterations: 7
prs3 <- fitted.values(mod8)
hist(prs3, breaks=100)
The probabilities are largely clustered close to zero, so using their inverses as weights can introduce instability from dividing by small numbers. Stabilising the weights, by multiplying by the marginal probability of being tested (or, for the untested, of not being tested), can avoid this issue (Sayon-Orea et al. 2020).
# Marginal (stabilising) probability of being tested
p <- mean(as.numeric(dat$tested))
# Stabilised inverse probability weights for each weighting model
ipw1 <- ifelse(dat$tested==1, p/prs1, (1-p)/(1-prs1))
ipw2 <- ifelse(dat$tested==1, p/prs2, (1-p)/(1-prs2))
ipw3 <- ifelse(dat$tested==1, p/prs3, (1-p)/(1-prs3))
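As a quick check (not shown in the original analysis), the stabilised weights should be centred near one, whereas the unstabilised weights 1/prs3 for the tested group can be far more extreme:

# Stabilised weights cluster around 1
summary(ipw3)
# Unstabilised weights for the tested group are much larger and more variable
summary(1/prs3[dat$tested==1])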
Below we perform a validation exercise where we compare the full sample associations with the weighted associations for age, sex and BMI.
# Prepare output: one row per association ("as" = age~sex, "bs" = bmi~sex,
# "ab" = bmi~age) and per model (1-3 = weighted variants, 4 = unweighted
# tested subsample, 5 = full sample)
group <- expand.grid(c("as","ab","bs"),c(1,2,3,4,5))
out <- data.frame(cbind(group, matrix(nrow=15, ncol=2)))
names(out) <- c("assoc","model","beta","se")
# Age and sex association
# Full sample
mod <- summary(lm(age ~ sex, dat))
out[out$assoc=="as"&out$model==5,c("beta","se")] <- coefficients(mod)["sex",c("Estimate","Std. Error")]
# Tested subsample
mod <- summary(lm(age ~ sex, dat, subset=dat$tested==1))
out[out$assoc=="as"&out$model==4,c("beta","se")] <- coefficients(mod)["sex",c("Estimate","Std. Error")]
# Inverse weighted (linear, no interactions)
mod <- summary(lm(age ~ sex, dat, weight=ipw1, subset=tested==1))
out[out$assoc=="as"&out$model==1,c("beta","se")] <- coefficients(mod)["sex",c("Estimate","Std. Error")]
# Inverse weighted (linear, interactions)
mod <- summary(lm(age ~ sex, dat, weight=ipw2, subset=tested==1))
out[out$assoc=="as"&out$model==2,c("beta","se")] <- coefficients(mod)["sex",c("Estimate","Std. Error")]
# Inverse weighted (nonlinear, interactions)
mod <- summary(lm(age ~ sex, dat, weight=ipw3, subset=tested==1))
out[out$assoc=="as"&out$model==3,c("beta","se")] <- coefficients(mod)["sex",c("Estimate","Std. Error")]
# BMI and sex association
# Full sample
mod <- summary(lm(bmi ~ sex, dat))
out[out$assoc=="bs"&out$model==5,c("beta","se")] <- coefficients(mod)["sex",c("Estimate","Std. Error")]
# Tested subsample
mod <- summary(lm(bmi ~ sex, dat, subset=dat$tested==1))
out[out$assoc=="bs"&out$model==4,c("beta","se")] <- coefficients(mod)["sex",c("Estimate","Std. Error")]
# Inverse weighted (linear, no interactions)
mod <- summary(lm(bmi ~ sex, dat, weight=ipw1, subset=tested==1))
out[out$assoc=="bs"&out$model==1,c("beta","se")] <- coefficients(mod)["sex",c("Estimate","Std. Error")]
# Inverse weighted (linear, interactions)
mod <- summary(lm(bmi ~ sex, dat, weight=ipw2, subset=tested==1))
out[out$assoc=="bs"&out$model==2,c("beta","se")] <- coefficients(mod)["sex",c("Estimate","Std. Error")]
# Inverse weighted (nonlinear, interactions)
mod <- summary(lm(bmi ~ sex, dat, weight=ipw3, subset=tested==1))
out[out$assoc=="bs"&out$model==3,c("beta","se")] <- coefficients(mod)["sex",c("Estimate","Std. Error")]
# BMI and age association
# Full sample
mod <- summary(lm(bmi ~ age, dat))
out[out$assoc=="ab"&out$model==5,c("beta","se")] <- coefficients(mod)["age",c("Estimate","Std. Error")]
# Tested subsample
mod <- summary(lm(bmi ~ age, dat, subset=dat$tested==1))
out[out$assoc=="ab"&out$model==4,c("beta","se")] <- coefficients(mod)["age",c("Estimate","Std. Error")]
# Inverse weighted (linear, no interactions)
mod <- summary(lm(bmi ~ age, dat, weight=ipw1, subset=tested==1))
out[out$assoc=="ab"&out$model==1,c("beta","se")] <- coefficients(mod)["age",c("Estimate","Std. Error")]
# Inverse weighted (linear, interactions)
mod <- summary(lm(bmi ~ age, dat, weight=ipw2, subset=tested==1))
out[out$assoc=="ab"&out$model==2,c("beta","se")] <- coefficients(mod)["age",c("Estimate","Std. Error")]
# Inverse weighted (nonlinear, interactions)
mod <- summary(lm(bmi ~ age, dat, weight=ipw3, subset=tested==1))
out[out$assoc=="ab"&out$model==3,c("beta","se")] <- coefficients(mod)["age",c("Estimate","Std. Error")]
ggplot() + geom_errorbar(data=out[out$assoc=="as",], mapping=aes(x=as.factor(model), ymin=beta-1.96*se, ymax=beta+1.96*se), width=0.1, size=1, color="#00BFC4") +
geom_point(data=out[out$assoc=="as",], mapping=aes(x=as.factor(model), y=beta), size=3, shape=21, fill="#00BFC4") + scale_x_discrete(labels=c("Linear / no interactions", "Linear / interactions", "Nonlinear / interactions", "Unweighted", "Full sample")) + xlab("") + ylab("Value of association")
ggplot() + geom_errorbar(data=out[out$assoc=="bs",], mapping=aes(x=as.factor(model), ymin=beta-1.96*se, ymax=beta+1.96*se), width=0.1, size=1, color="#00BFC4") +
geom_point(data=out[out$assoc=="bs",], mapping=aes(x=as.factor(model), y=beta), size=3, shape=21, fill="#00BFC4") + scale_x_discrete(labels=c("Linear / no interactions", "Linear / interactions", "Nonlinear / interactions", "Unweighted", "Full sample")) + xlab("") + ylab("Value of association")
ggplot() + geom_errorbar(data=out[out$assoc=="ab",], mapping=aes(x=as.factor(model), ymin=beta-1.96*se, ymax=beta+1.96*se), width=0.1, size=1, color="#00BFC4") +
geom_point(data=out[out$assoc=="ab",], mapping=aes(x=as.factor(model), y=beta), size=3, shape=21, fill="#00BFC4") + scale_x_discrete(labels=c("Linear / no interactions", "Linear / interactions", "Nonlinear / interactions", "Unweighted", "Full sample")) + xlab("") + ylab("Value of association")
The previous validation exercise suggests that the full weighting model with interactions and non-linearities performs best at recovering the full sample associations among sex, age and BMI. While this does not mean it will necessarily recover the full sample associations between risk factors and testing positive, it is a useful starting point. Below we compare the weighted risk factor associations with the unweighted associations estimated earlier.
assoc2 <- data.frame(matrix(nrow=3, ncol=3))
names(assoc2) <- c("risk_factor", "beta", "se")
assoc2$risk_factor <- c("sex", "age", "bmi")
mod <- summary(lm(positive ~ sex, dat, weight=ipw3))
assoc2[assoc2$risk_factor=="sex",c("beta","se")] <- coefficients(mod)["sex",c("Estimate","Std. Error")]
mod <- summary(lm(positive ~ age, dat, weight=ipw3))
assoc2[assoc2$risk_factor=="age",c("beta","se")] <- coefficients(mod)["age",c("Estimate","Std. Error")]
mod <- summary(lm(positive ~ bmi, dat, weight=ipw3))
assoc2[assoc2$risk_factor=="bmi",c("beta","se")] <- coefficients(mod)["bmi",c("Estimate","Std. Error")]
assoc <- rbind(assoc, assoc2)
assoc$model <- c(rep(1,3),rep(2,3))
ggplot(assoc, aes(risk_factor, beta)) +
geom_errorbar(
aes(ymin = beta-1.96*se,
ymax = beta+1.96*se,
color = as.factor(model)),
position = position_dodge(0.3), size=1, width=0.1
) +
xlab("") + ylab("Value of association") +
geom_point(aes(color = as.factor(model)), size=3, position = position_dodge(0.3)) +
geom_hline(yintercept=0, linetype="dashed", color = "#575863") +
labs(colour="") +
scale_color_manual(name="Estimates", labels=c("Unweighted", "Weighted"), values=c("#F8766D","#00BFC4")) +
coord_flip()
The weighted and unweighted associations are very similar. This could indicate an absence of collider bias; however, it is more likely in this instance that the weights are misspecified because the weighting model cannot include the positive test variable. Below we explore sensitivity analyses for this misspecification.
We now present a sensitivity analysis that can be used when direct estimation of the probability weights is limited or not possible. The method of Tudball et al. (2021) is most appropriate when weights cannot be estimated but there is external information on the target population (e.g. survey response rates, population means). We consider the example of the association between BMI and test positivity.
Suppose we only observe the tested sub-sample. Estimating probability weights is no longer possible since we do not observe anyone who has not been tested. Tudball et al. (2021) allows us to propose a logistic model for the probability weights, comprised of variables measured in the tested sub-sample. There are three sensitivity parameters: 1) a lower bound for the average probability of sample selection; 2) an upper bound for the average probability of sample selection; and 3) the change in odds of sample selection for a one standard deviation increase in each variable.
An advantage of this method is that we can place additional constraints on the bounds. Suppose we know the unconditional likelihood of being tested and the average age and BMI of UK Biobank participants. We can include these as constraints to ensure that the resulting bounds are consistent with these population values. In reality, we may have more or fewer constraints than this.
# Complete cases within the tested sub-sample (positive is only observed there)
dattest <- dat[complete.cases(dat), ]
# Truncate extreme BMI values at 45
dattest$bmi <- ifelse(dattest$bmi > 45, 45, dattest$bmi)
# Variables for the proposed weight model, including squared terms and an
# interaction between BMI and testing positive
w <- as.data.frame(dattest[,(colnames(dattest) %in% c('age','sex','bmi','positive'))])
w$age2 <- w$age^2
w$bmi2 <- w$bmi^2
w$int <- w$bmi*w$positive
We then define our sensitivity parameters. These imply that the average individual in the sample has between a 1% and 50% probability of sample selection. Furthermore, a standard deviation increase in each variable in the weight model can change the odds of sample selection by a factor of between 1/3 and 3.
p <- c(0.01, 0.5, 3)  # lower bound, upper bound, odds change per SD
We also select our additional constraints. We choose the response rate of the tested sub-sample, as well as the average age, sex and BMI in the full UK Biobank sample. We also ensure that being positive for Covid-19 increases one’s likelihood of being tested.
mycons <- list(
  list('RESP', mean(dat$tested)),         # response rate of the tested sub-sample
  list('COVMEAN', w$age, mean(dat$age)),  # full-sample mean age
  list('COVMEAN', w$bmi, mean(dat$bmi)),  # full-sample mean BMI
  list('COVMEAN', w$sex, mean(dat$sex)),  # full-sample mean of sex
  list('DIREC', 4, '+')                   # testing positive (4th variable in w) increases selection
)
We are now ready to generate our bounds for the association between BMI and testing positive. To do this we will use the package selectioninterval, available at https://github.com/matt-tudball/selectioninterval.
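If the package is not already installed, it can be obtained from GitHub, for example using the remotes package:

# install.packages("remotes")
remotes::install_github("matt-tudball/selectioninterval")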
out <- selectioninterval::selection_bound(
  y=dattest$positive,  # outcome: testing positive
  x=dattest$bmi,       # risk factor of interest
  w=w,                 # variables in the proposed weight model
  L0l=p[1],            # lower bound on the average selection probability
  L0u=p[2],            # upper bound on the average selection probability
  L1=p[3],             # bound on the odds change per SD of each variable
  cons=mycons)         # additional population constraints
tibble(
method="Tudball et al (2021)",
lower_ci=round(out$ci[1],3),
lower_bound=round(out$interval[1],3),
upper_bound=round(out$interval[2],3),
upper_ci=round(out$ci[2],3)
) %>% kable
| method | lower_ci | lower_bound | upper_bound | upper_ci |
|---|---|---|---|---|
| Tudball et al (2021) | -0.002 | -0.001 | 0.026 | 0.029 |
Here we have walked through a realistic example of estimating risk factor associations with testing positive for Covid-19 in UK Biobank. As a validation exercise, we have shown that inverse probability weighting is able to recover sample-wide associations between BMI, age and sex within the non-random sub-sample of individuals who have been tested for Covid-19. We also show that if the data are non-nested, in that a model cannot be fitted to predict participation, then summary data from a reference population can be used to generate adjusted estimates within the selected sample.
In general practice, there are a few further things to consider.