class: center, middle, inverse, title-slide

.title[
# Introduction to Causal Inference
]
.author[
### Keith McNulty
]

---
class: left, middle, r-logo

## Formalities

* Much of this presentation is based on the book "Statistical Rethinking" by Richard McElreath
* You can find the code for this presentation at its GitHub repo [here](https://github.com/keithmcnulty/causal_inference_intro)
* We will work in R, but all of this is possible in Python too - please reach out to me if you need direction to the appropriate Python packages.

---
class: left, middle, r-logo

## What is causal inference?

Here are two common reasons why we construct statistical models on samples of data:

1. To try to make accurate out-of-sample predictions (*Predictive Modeling*)
2. To try to explain a phenomenon by inferring causality from the sample (*Inferential Modeling*)

How we approach these two objectives can be very different in practice.

**Causal Inference** is the science of inferring causality from statistical models. It is a very disciplined approach to inferential modeling, focused on the avoidance of 'booby traps' in causal logic.

---
class: left, middle, r-logo

## How do we conceptually represent causality?

Causality (or our beliefs around causality) can be represented by a **directed acyclic graph** or **DAG**: directed, because causality always acts in a particular direction, and acyclic, because we don't allow a variable to cause itself, directly or through a chain of other variables.

Here's a simple example: imagine you work for *McDonald's*. Let `\(R\)` be your hourly pay rate, let `\(H\)` be how many hours you work in the week, and let `\(P\)` be your weekly total pay. Then we can create the following simple DAG to represent the causes of `\(P\)`:

<img src="index_files/figure-html/unnamed-chunk-1-1.png" height="300" style="display: block; margin: auto;" />

---
class: left, middle, r-logo

## Creating DAGs in R

There are a number of package options, but `dagitty` is easy to use and has some cool additional functionality which we will use later.

```r
library(dagitty)

mcd_dag <- dagitty("dag{H -> P; R -> P}")

coordinates(mcd_dag) <- list(
  x = c(H = 0, R = 0, P = 1),
  y = c(R = 0, H = 1, P = 0.5)
)

plot(mcd_dag)
```

<img src="index_files/figure-html/unnamed-chunk-2-1.png" height="300" style="display: block; margin: auto;" />

---
class: left, middle, r-logo

## Data for this session

We will work with the `WaffleDivorce` dataset, which is part of the `rethinking` package. Here are the first few lines of the data.

```r
library(rethinking)
data("WaffleDivorce")
head(WaffleDivorce)
```

```
##     Location Loc Population MedianAgeMarriage Marriage Marriage.SE Divorce
## 1    Alabama  AL       4.78              25.3     20.2        1.27    12.7
## 2     Alaska  AK       0.71              25.2     26.0        2.93    12.5
## 3    Arizona  AZ       6.33              25.8     20.3        0.98    10.8
## 4   Arkansas  AR       2.92              24.3     26.4        1.70    13.5
## 5 California  CA      37.25              26.8     19.1        0.39     8.0
## 6   Colorado  CO       5.03              25.7     23.5        1.24    11.6
##   Divorce.SE WaffleHouses South Slaves1860 Population1860 PropSlaves1860
## 1       0.79          128     1     435080         964201           0.45
## 2       2.05            0     0          0              0           0.00
## 3       0.74           18     0          0              0           0.00
## 4       1.22           41     1     111115         435450           0.26
## 5       0.24            0     0          0         379994           0.00
## 6       0.94           11     0          0          34277           0.00
```

---
class: left, middle, r-logo

## Waffle Houses and Divorce Rates in US states

<img src="index_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />

---
class: left, middle, r-logo

## Which DAG is a more believable explanation of the Waffle House phenomenon?

In these DAGs, a circled variable is one which is as yet unknown/unobserved.
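As a minimal sketch, two such candidate structures could be encoded in `dagitty` as below. The variable names are assumptions for illustration: `W` for Waffle Houses, `D` for divorce rate, and `U` for a hypothetical unobserved common cause.

```r
library(dagitty)

# Candidate 1: Waffle Houses (W) directly cause divorce rates (D)
dag_direct <- dagitty("dag{W -> D}")

# Candidate 2: an unobserved common cause U drives both W and D
dag_confounded <- dagitty("dag{W <- U -> D}")
latents(dag_confounded) <- "U"  # latent variables are drawn as circled nodes

plot(graphLayout(dag_direct))
plot(graphLayout(dag_confounded))
```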
<img src="index_files/figure-html/unnamed-chunk-5-1.png" width="50%" /><img src="index_files/figure-html/unnamed-chunk-5-2.png" width="50%" /> --- class: left, middle, r-logo ## Statistical models and causal theories We can use statistical models to lend support to our theories of causality. Though models can never prove a causal relationship, they can show that certain variables influence an outcome, which can support a theoretical causal model under the right conditions. However, there is a really big difference between these two things: 1. Showing that a variable has an observed *association* with an outcome (this is simply a correlation) 2. Generating statistical support for a theory that a variable may have a *causal influence* on an outcome (this requires more than a correlation). --- class: left, middle, r-logo ## Booby trap 1: The Fork (Collinearity) In a DAG, we only draw an arrow between two variables when we believe there to be an **independent** causal influence between two variables. Variables can still be associated even though they are not connected in a DAG. Imagine we are trying to understand the causal influence of the average age at marriage `\(A\)` and the marriage rate `\(M\)` on divorce rate `\(D\)` in US states. We construct the following DAG based on our belief around causality: <img src="index_files/figure-html/unnamed-chunk-6-1.png" height="300" style="display: block; margin: auto;" /> --- class: left, middle, r-logo ## Conditional independence In our proposed DAG, `\(A\)` has a causal influence on `\(M\)` and on `\(D\)`. But `\(M\)` **does not** have an independent causal influence on `\(D\)`. `\(M\)` can still be associated with `\(D\)` statistically, but only because they have similar causation. In this case, we say ** `\(D\)` is independent of `\(M\)` conditional on `\(A\)` **. We write this mathematically as `\(D \!\perp\!\!\!\perp M|A\)`. The `dagitty` package can determine all conditional independencies in a DAG: ```r dag1 <- dagitty( "dag{A -> D; A ->M}" ) impliedConditionalIndependencies(dag1) ``` ``` ## D _||_ M | A ``` --- class: left, middle, r-logo ## Your model construction will depend on your DAG Let's build a model to support that `\(M\)` is independent of `\(D\)` conditional on `\(A\)`. This means we must include both `\(M\)` and `\(A\)` as input variables, with `\(D\)` as our output variable. Here are simulations of the posterior coefficients of linear model D ~ A + M. You can see the code for this in the source on Github. <img src="index_files/figure-html/unnamed-chunk-8-1.png" height="400" style="display: block; margin: auto;" /> --- class: left, middle, r-logo ## Post-treatment bias Imagine we are growing some roses. Unfortunately our roses are prone to getting blackspot, a bad disease which prevents the leaves absorbing light. Never fear, we have an incredible spray which kills blackspot on roses. The more we spray, the less blackspot will appear. The less blackspot, the more the roses can grow. Let's simulate some data to represent this. 
---
class: left, middle, r-logo

## Simulating rose data

```r
library(tibble)

n <- 100
set.seed(42)

d <- tibble(
  h0 = rnorm(n, mean = 10, sd = 2),
  spray = rep(0:1, each = n / 2),
  blackspot = rbinom(n, size = 1, prob = .5 - spray * 0.4),
  h1 = h0 + rnorm(n, mean = 5 - 3 * blackspot, sd = 1)
)

# first few rows
head(d)
```

```
## # A tibble: 6 x 4
##      h0 spray blackspot    h1
##   <dbl> <int>     <int> <dbl>
## 1 12.7      0         1 14.7
## 2  8.87     0         1  9.32
## 3 10.7      0         1 13.9
## 4 11.3      0         0 16.0
## 5 10.8      0         0 15.3
## 6  9.79     0         0 13.5
```

---
class: left, middle, r-logo

## Now run a linear model and look at the posterior coefficients

<img src="index_files/figure-html/unnamed-chunk-10-1.png" height="400" style="display: block; margin: auto;" />

---
class: left, middle, r-logo

## Booby trap 2: The Pipe (Post-treatment bias)

The most believable causal theory is that the amount of spray `\(S\)` reduces the amount of blackspot `\(B\)`, which results in better rose growth `\(G\)`, or:

<img src="index_files/figure-html/unnamed-chunk-11-1.png" height="300" style="display: block; margin: auto;" />

What are the conditional independencies in this pipe?

---
class: left, middle, r-logo

## Conditional independencies in a pipe

```r
dag2 <- dagitty(
  "dag{S -> B; B -> G}"
)

impliedConditionalIndependencies(dag2)
```

```
## G _||_ S | B
```

In other words, if we condition on `\(B\)` (blackspot), `\(S\)` (spray) appears to have no effect on `\(G\)` (growth). This is a false conclusion: spray does affect growth, but only via blackspot, so once we know the degree of blackspot, knowing the amount of spray tells us nothing further. Conditioning on `\(B\)` 'blocks the path'.

So if we want to causally relate two variables on either end of a pipe, we should not condition on the middle variable.

---
class: left, middle, r-logo

## Sample bias

Imagine we have been hiring people into our company for a few years. We hired them, `\(H\)`, on the basis of an assessment of their problem-solving ability, `\(P\)`, and a separate assessment of their interpersonal skills, `\(I\)`.

<img src="index_files/figure-html/unnamed-chunk-13-1.png" height="300" style="display: block; margin: auto;" />

Let's construct some data to reflect this. In this data, `\(P\)` and `\(I\)` will be constructed to be independent of each other, and `\(H\)` will be determined based on the sum of `\(P\)` and `\(I\)`.
---
class: left, middle, r-logo

## Simulate selection data

```r
set.seed(43)

N <- 1000  # candidates
p <- 0.1   # selection rate

P <- rnorm(N)  # random normal distribution
I <- rnorm(N)  # random normal distribution

Total <- P + I
q <- quantile(Total, 1 - p)

# select hired based on sum of scores
H <- ifelse(Total >= q, TRUE, FALSE)
```

Now let's test for independence of `\(P\)` and `\(I\)` in our sample:

```r
cor.test(P, I)
```

```
## 
##  Pearson's product-moment correlation
## 
## data:  P and I
## t = 1.5163, df = 998, p-value = 0.1298
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.01409226  0.10961002
## sample estimates:
##        cor 
## 0.04794271
```

---
class: left, middle, r-logo

## What if we look at just the people we hired?

Now, an enthusiastic people analytics professional decides to ask if good problem-solvers have bad interpersonal skills, and tests the hypothesis on their company's employees using recruiting data:

```r
cor.test(P[H == TRUE], I[H == TRUE])
```

```
## 
##  Pearson's product-moment correlation
## 
## data:  P[H == TRUE] and I[H == TRUE]
## t = -9.2408, df = 98, p-value = 5.38e-15
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7749238 -0.5611626
## sample estimates:
##        cor 
## -0.6823684
```

---
class: left, middle, r-logo

## Conditioning on the third variable can result in false inferences

<img src="index_files/figure-html/unnamed-chunk-17-1.png" height="500" style="display: block; margin: auto;" />

---
class: left, middle, r-logo

## Booby trap 3: The Collider

Our enthusiastic analyst might conclude that they have support for a causal relationship between `\(P\)` and `\(I\)`, but this is only because they have conditioned on `\(H\)` by analyzing only employees who have been hired.

```r
dag3 <- dagitty(
  "dag{P -> H <- I}"
)

impliedConditionalIndependencies(dag3)
```

```
## I _||_ P
```

If we are trying to support causal independence between `\(I\)` and `\(P\)`, we **must not include `\(H\)` in our model**.

---
class: left, middle, r-logo

## Booby trap 4: The Descendant Collider

Let's add a causal layer to our employee selection example. Let `\(D\)` be the decision to make a job offer and let `\(A\)` be the acceptance of a job offer. Our DAG might then look like this:

<img src="index_files/figure-html/unnamed-chunk-19-1.png" height="300" style="display: block; margin: auto;" />

Since `\(H\)` is a descendant of the collider `\(D\)`, any attempt to causally model `\(P\)` against `\(H\)` or `\(I\)` against `\(H\)` must not condition on `\(D\)`.

---
class: left, middle, r-logo

## Getting complicated, right?

But there is good news! Given any DAG, and any two variables which we are trying to causally relate, we can follow a simple process to determine what to include in our causal model.

Let `\(A\)` be our input (cause) variable and `\(B\)` be our outcome (effect) variable. Follow these steps to determine which additional variables to include in your causal model (a code sketch that automates these steps follows the list):

1. List all (undirected) paths from `\(A\)` to `\(B\)`.
2. Consider a path closed if it contains a collider. Otherwise consider it open.
3. Determine whether any open paths have an arrow pointing into `\(A\)`. These are called **backdoor paths**.
4. Select variables to condition on in order to close those open backdoor paths.
5. Include these variables, in addition to `\(A\)` and `\(B\)`, in your model.
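As a minimal sketch of how `dagitty` automates these steps, consider a hypothetical DAG in which `Z` is a fork and `C` is a collider between an exposure `X` and an outcome `Y`:

```r
library(dagitty)

# hypothetical DAG: Z is a common cause (fork), C is a collider
g <- dagitty("dag{X -> Y; X <- Z -> Y; X -> C <- Y}")

# steps 1-3: list every path from X to Y and whether it is open;
# X <- Z -> Y is an open backdoor path, X -> C <- Y is closed by the collider
paths(g, from = "X", to = "Y")

# steps 4-5: find the conditioning sets that close the open backdoor paths
adjustmentSets(g, exposure = "X", outcome = "Y")  # returns { Z }
```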
---
class: left, middle, r-logo

## Fun exercise: Divorce, Waffles and 'Southern-ness'

We are now going to propose a wider DAG to try to support causal explanations of divorce rates `\(D\)` in US states. In addition to average age at marriage `\(A\)`, marriage rate `\(M\)`, and number of Waffle Houses `\(W\)`, we are now going to introduce the 'Southern-ness' of states `\(S\)` as a binary variable (in case all y'all don't know, a state is either a Southern state or it is not!).

Here is our proposed DAG:

<img src="index_files/figure-html/unnamed-chunk-20-1.png" height="300" style="display: block; margin: auto;" />

---
class: left, middle, r-logo

## Class exercise

What variables should we condition on if we want to support:

1. Waffle Houses as a cause of divorce rates
2. Southern-ness as a cause of divorce rates

---
class: left, middle, r-logo

## Let's answer the Waffle House question first

Using the magic of `dagitty`, we can determine our conditioning variables.

```r
dag4 <- dagitty(
  "dag{A -> D; A -> M -> D; A <- S -> M; S -> W -> D}"
)

dagitty::adjustmentSets(dag4, exposure = "W", outcome = "D")
```

```
## { A, M }
## { S }
```

So we have two options: include `\(S\)`, or include `\(A\)` and `\(M\)`. Let's take the easier option and include `\(S\)`.

---
class: left, middle, r-logo

## Is there support for Waffle Houses causing divorce?

<img src="index_files/figure-html/unnamed-chunk-22-1.png" style="display: block; margin: auto;" />

---
class: left, middle, r-logo

## Now the 'Southern-ness' question

Again, let's confirm our conditioning variables using `dagitty`:

```r
dagitty::adjustmentSets(dag4, exposure = "S", outcome = "D")
```

```
## {}
```

This makes sense: there are no arrows pointing into `\(S\)` in our DAG, so there are no backdoor paths to close, and this is a straight linear model with a single input variable. The simulated posteriors of the coefficient of `\(S\)` look like this:

<img src="index_files/figure-html/unnamed-chunk-24-1.png" height="300" style="display: block; margin: auto;" />

---
class: left, middle, r-logo

## Conclusions and references

1. Causal inference is hard and a bit mind-bending.
2. If you work in predictive analytics, you don't have to worry about it too much.
3. If you work in explanatory analytics, it's a very valuable skill and teaches a disciplined approach to modeling.
4. I highly recommend [Statistical Rethinking](https://www.amazon.com/Statistical-Rethinking-Bayesian-Examples-Chapman/dp/1482253445) by Richard McElreath. An outstanding book, but not for the statistically fainthearted.
5. Another related book which is more business-oriented is [Inference and Intervention](https://www.amazon.com/Inference-Intervention-Michael-D-Ryall/dp/0415657601).
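---
class: left, middle, r-logo

## Appendix: a sketch of the divorce model code

The posterior plots earlier in the deck were generated from code in the GitHub repo. As a minimal sketch, not the exact code behind those plots, the `D ~ A + M` model could be fit in `rethinking` style like this (the priors follow McElreath's choices in *Statistical Rethinking*):

```r
library(rethinking)

data("WaffleDivorce")
d <- WaffleDivorce

# standardize the variables, as in Statistical Rethinking
d$D <- standardize(d$Divorce)
d$A <- standardize(d$MedianAgeMarriage)
d$M <- standardize(d$Marriage)

# D ~ A + M: condition on both A and M, as our DAG analysis requires
m_dam <- quap(
  alist(
    D ~ dnorm(mu, sigma),
    mu <- a + bA * A + bM * M,
    a ~ dnorm(0, 0.2),
    bA ~ dnorm(0, 0.5),
    bM ~ dnorm(0, 0.5),
    sigma ~ dexp(1)
  ),
  data = d
)

# simulate posterior coefficients and summarize them
post <- extract.samples(m_dam)
precis(m_dam)
```

The single-predictor 'Southern-ness' model follows the same pattern, with `mu <- a + bS * South`.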