Inference for multiple treatment effects using confounder importance learning

Papaspiliopoulos, Rossell, Torrens i Dinares

Context

  • Inference for (multiple) treatment effects using high-dimensional GLMs

\begin{align*} y_{i} & \sim p(y_{i}; \eta_{i}, \phi) \\ \eta_{i} & = \sum_{t=1}^{T} \alpha_{t} d_{i,t} + \sum_{j=1}^{J} \beta_{j} x_{i,j}, \quad i=1,\ldots, n \end{align*}

  • High-dimensional:
    • Flexible functional forms
    • No unmeasured confounding (under-selection)
  • Interactions:
    • Flexible functional forms
    • Heterogeneous treatment effects
  • Treatments vs covariates:
    • Uncertainty Quantification vs no unmeasured confounding
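The linear predictor above is straightforward to assemble. A minimal numpy sketch with made-up dimensions and coefficients (all values illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, T, J = 100, 2, 5                     # observations, treatments, covariates (made-up)
D = rng.binomial(1, 0.5, size=(n, T))   # treatment indicators d_{i,t}
X = rng.normal(size=(n, J))             # covariates x_{i,j}
alpha = np.array([1.0, -0.5])           # treatment effects of interest
beta = np.zeros(J)
beta[0] = 2.0                           # sparse covariate effects

# eta_i = sum_t alpha_t * d_{i,t} + sum_j beta_j * x_{i,j}
eta = D @ alpha + X @ beta
```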

Salary variation and discrimination

  • Current Population Survey data for 2010 and 2019

  • 228 covariates

    • e.g., education, labor force status, health status, etc., and state
  • 204 treatments

    • 4 binary ones
    • their interactions with state (average and heterogeneous effects)

Limitations of predictive models due to confounding

Over-selection methods

  • e.g., Double LASSO of Belloni et al. (2014), Review of Economic Studies
    • selection of covariates for response
    • selection of covariates for treatment
    • OLS on superset
  • e.g., Bayesian Adjustment for Confounding of Wang et al. (2012), Biometrics
    • joint model for response and treatment
    • BMA with a prior that encourages joint inclusion
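The Double LASSO recipe above can be sketched in a few lines; below is a toy numpy version with a naive coordinate-descent lasso and simulated confounded data (illustrative only, not the authors' implementation; the penalty `lam` is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(1)
n, J = 300, 10
X = rng.normal(size=(n, J))
d = (X[:, 0] + 0.5 * rng.normal(size=n) > 0).astype(float)  # treatment confounded by x_0
y = 1.0 * d + 2.0 * X[:, 0] + 0.5 * rng.normal(size=n)      # true treatment effect alpha = 1

def lasso(X, y, lam, iters=200):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam * ||b||_1."""
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        for j in range(X.shape[1]):
            r = y - X @ b + X[:, j] * b[j]               # partial residual
            rho_j = X[:, j] @ r / len(y)
            b[j] = np.sign(rho_j) * max(abs(rho_j) - lam, 0) / (X[:, j] ** 2).mean()
    return b

lam = 0.1
# (1) select covariates for the response, (2) select covariates for the treatment,
# (3) OLS of y on the treatment plus the union (superset) of selected covariates
S = (lasso(X, y, lam) != 0) | (lasso(X, d, lam) != 0)
Z = np.column_stack([d, X[:, S]])
alpha_hat = np.linalg.lstsq(Z, y, rcond=None)[0][0]      # OLS coefficient on d
```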

Over-selection MSE inflation

  • There is an obvious over-selection variance inflation that makes it harder to detect weaker effects

  • There is also a more subtle over-selection bias

    • We provide numerical support
    • In a simple selection setting with a Gaussian GLM with variance \phi and T=1, an argument approximates the bias as, for a constant c>0,

\begin{align*} - c \, \frac{\alpha \phi}{\alpha^{2}+\phi} \, \frac{\log J}{n} \end{align*}
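Plugging illustrative numbers into the bias approximation shows its scaling: it vanishes as n grows and increases only logarithmically in J. The constant c is unspecified and set to 1 here purely for illustration:

```python
from math import log

def overselection_bias(alpha, phi, J, n, c=1.0):
    # -c * (alpha * phi / (alpha^2 + phi)) * log(J) / n
    return -c * (alpha * phi / (alpha**2 + phi)) * log(J) / n

b1 = overselection_bias(alpha=0.5, phi=1.0, J=100, n=200)
b2 = overselection_bias(alpha=0.5, phi=1.0, J=100, n=2000)   # 10x more data
```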

Confounder Importance Learning

  • cil function in R package mombf

  • \prod_{t=1}^{T} p(\alpha_{t} \mid \delta_{t}, \phi) \prod_{j=1}^{J} p(\beta_{j} \mid \gamma_{j}, \phi)

  • Inclusion indicators \delta_t,\gamma_j \in \{0,1\}

  • State-of-the-art non-local priors on inclusion

  • Heterogeneous covariate inclusion probabilities

\begin{align*} \prod_{t=1}^{T} \text{Bern}(\delta_{t}; 1/2) \prod_{j=1}^{J} \text{Bern}(\gamma_{j}; \pi_{j}(\bm{\theta})) \end{align*}

\begin{align*} \pi_{j}(\bm{\theta}) = \rho + (1 - 2 \rho) \left( 1 + \exp \left\{ - \theta_{0} - \sum_{t=1}^{T} \theta_{t} f_{j,t} \right\} \right)^{-1} \end{align*}

  • f_{j,t} are positive measures of (joint) association between treatment t and covariate j

  • \rho sets lower and upper bounds on the inclusion probabilities: \rho \le \pi_{j}(\bm{\theta}) \le 1 - \rho

  • \bm{\theta} is learnt from data and moderates inclusion
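The inclusion probability \pi_j(\bm{\theta}) transcribes directly to code; a numpy sketch with made-up association scores f_{j,t} and an arbitrary choice \rho = 0.05:

```python
import numpy as np

def inclusion_prob(theta, F, rho=0.05):
    """pi_j(theta) = rho + (1 - 2*rho) * sigmoid(theta_0 + sum_t theta_t * f_{j,t}).

    theta: length T+1 vector (intercept first); F: (J, T) matrix of scores f_{j,t}.
    """
    lin = theta[0] + F @ theta[1:]
    return rho + (1 - 2 * rho) / (1 + np.exp(-lin))

F = np.array([[0.0], [1.5], [3.0]])      # made-up association scores, T = 1
pi0 = inclusion_prob(np.zeros(2), F)     # theta = 0: every pi_j collapses to 1/2
pi = inclusion_prob(np.array([-1.0, 2.0]), F)
```

Note that at \bm{\theta} = \bm{0} every covariate gets prior inclusion probability 1/2, so \bm{\theta} only moves probabilities away from the uninformative default.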

Hyperparameter learning

  • Marginal likelihood

\begin{align*} p(\bm{y} \mid \bm{\theta})= \sum_{\bm{\delta}, \bm{\gamma}} p(\bm{y} \mid \bm{\delta}, \bm{\gamma}) p(\bm{\delta}, \bm{\gamma} \mid \bm{\theta}) \end{align*}

with the first term either analytically available or approximated, e.g., by the ALA of Rossell et al. (2021)
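For a toy problem the sum over models can be enumerated exactly. Here T=1, J=2, and the model-specific marginal likelihoods p(\bm{y} \mid \bm{\delta}, \bm{\gamma}) are placeholder numbers standing in for the analytic/ALA values:

```python
import itertools
import numpy as np

rho, theta = 0.05, np.array([0.0, 1.0])
F = np.array([[0.5], [2.0]])                                   # made-up scores f_{j,1}
pi = rho + (1 - 2 * rho) / (1 + np.exp(-(theta[0] + F @ theta[1:])))

# (delta, gamma_1, gamma_2) -> p(y | delta, gamma); placeholder values
marg = {m: 0.1 + 0.2 * sum(m) for m in itertools.product([0, 1], repeat=3)}

# p(y | theta) = sum_{delta,gamma} p(y | delta,gamma) * p(delta,gamma | theta)
p_y = 0.0
for (d, g1, g2), like in marg.items():
    prior = 0.5 * (pi[0] if g1 else 1 - pi[0]) * (pi[1] if g2 else 1 - pi[1])
    p_y += like * prior
```

The point of the next slide is that the gradient of this quantity avoids the 2^{J+T} enumeration used here.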

Stochastic gradient optimization

  • Key result: \begin{align*} \nabla_{\bm{\theta}} & \log p(\bm{y} \mid \bm{\theta}) = \\ & (1-2\rho) \sum_{j=1}^{J} \bm{f}_{j} \left[P(\gamma_{j} = 1 \mid \bm{y}, \bm{\theta}) - \pi_{j}(\bm{\theta})\right] \end{align*} where \bm{f}_{j} = (1, f_{j,1}, \dots, f_{j,T})
    • Sum over J terms (as opposed to 2^{J+T})
    • Only marginal inclusion probabilities
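The key result makes the gradient cheap: given the J marginal posterior inclusion probabilities, it is a single weighted sum. A numpy sketch of the stated identity, with made-up association scores and posterior probabilities:

```python
import numpy as np

def grad_log_marginal(post_incl, theta, F, rho=0.05):
    """Gradient of log p(y | theta) via the identity
    (1-2*rho) * sum_j f_j * [P(gamma_j = 1 | y, theta) - pi_j(theta)],
    with f_j = (1, f_{j,1}, ..., f_{j,T})."""
    pi = rho + (1 - 2 * rho) / (1 + np.exp(-(theta[0] + F @ theta[1:])))
    Ftilde = np.column_stack([np.ones(len(F)), F])   # prepend the intercept entry
    return (1 - 2 * rho) * Ftilde.T @ (post_incl - pi)

F = np.array([[0.2], [1.0], [2.5]])                  # made-up scores, T = 1
theta = np.array([0.0, 0.5])
pi = 0.05 + 0.9 / (1 + np.exp(-(theta[0] + F @ theta[1:])))
g_zero = grad_log_marginal(pi, theta, F)             # posterior == prior -> zero gradient
g = grad_log_marginal(np.array([0.9, 0.8, 0.95]), theta, F)
```

The zero-gradient check reflects the fixed-point logic: \bm{\theta} stops moving once the prior inclusion probabilities match the posterior ones.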

Expectation Propagation

  • To avoid several MCMC runs (for different \bm{\theta}) and multi-modality, we also consider the following cheaper alternative

  • We consider the localized posterior p_0(\bm{\delta}, \bm{\gamma} \mid \bm{y}) = p(\bm{\delta}, \bm{\gamma} \mid \bm{y}, \bm{\theta}=\bm{0})

  • Note the identity p(\bm{y} \mid \bm{\theta}) = \sum_{\bm{\delta},\bm{\gamma}} p_{0}(\bm{\delta},\bm{\gamma} \mid \bm{y}) p(\bm{\delta},\bm{ \gamma} \mid \bm{\theta})

  • We use an EP approximation of p_0 and replace it within the identity. This breaks the curse of dimensionality and yields the optimization problem \bm{\theta}^{EP} = \argmax_{\bm{\theta}} \sum_{j=1}^{J} \log \left( r_{j}^{EP} \pi_j(\bm{\theta}) + (1-r_{j}^{EP}) (1-\pi_j(\bm{\theta})) \right), where r_{j}^{EP} = P_{0}(\gamma_j = 1 \mid \bm{y})
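The EP objective depends only on the J univariate probabilities r_j^{EP}, so optimizing it is cheap. A sketch with made-up EP marginals and a crude grid search standing in for a proper optimizer:

```python
import numpy as np

def ep_objective(theta, F, r_ep, rho=0.05):
    """sum_j log( r_j^EP * pi_j(theta) + (1 - r_j^EP) * (1 - pi_j(theta)) )."""
    pi = rho + (1 - 2 * rho) / (1 + np.exp(-(theta[0] + F @ theta[1:])))
    return np.sum(np.log(r_ep * pi + (1 - r_ep) * (1 - pi)))

F = np.array([[0.1], [1.0], [2.0], [3.0]])     # made-up association scores, T = 1
r_ep = np.array([0.2, 0.6, 0.8, 0.95])         # made-up EP marginals P_0(gamma_j = 1 | y)

# crude grid search over (theta_0, theta_1); a real implementation would use
# a gradient-based optimizer
grid = np.linspace(-3, 3, 61)
best = max(((ep_objective(np.array([a, b]), F, r_ep), a, b)
            for a in grid for b in grid))
```

With these marginals the optimum moves \bm{\theta} away from \bm{0}: covariates with larger f_{j,1} have larger r_j^{EP}, which the logistic form can exploit.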

EP vs EB objective functions

Revisiting a simulation study

Salary variation and discrimination