class: center, middle, inverse, title-slide

# Introduction to Binomial Logistic Regression for Inference on Binary Outcomes

### Keith McNulty

---
class: left, middle, r-logo

## Key Sources

* [*An Introduction to Statistical Learning*](https://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf) by James, Witten, Hastie and Tibshirani.
* *Applied Logistic Regression* by David W. Hosmer Jr., Stanley Lemeshow and Rodney X. Sturdivant.

The dataset used can be found [on my GitHub](https://github.com/keithmcnulty/speed_dating/blob/master/speed_data_data.csv).

## Note on languages

This document is coded in R. I've written a Python notebook [here](https://github.com/keithmcnulty/logit_regression_training/blob/main/logit_regression_training.ipynb) with the code to do many of the same operations that you find in this document.

---
class: left, middle, r-logo

## Recap from last session

The document from the previous learning session on Linear Regression is [here](https://keithmcnulty.github.io/linear_regression_training/#1).

When we model, we are trying to understand the relationship between an outcome variable `\(y\)` and a set of input variables `\(X = x_1, x_2, ..., x_p\)` of the form

`$$y = f(X) + \epsilon$$`

where `\(\epsilon\)` is some uncontrollable error in each observation, which has a mean of zero over the population.

A *parametric model* explicitly forces a certain function type `\(f(X)\)`. In our last session we looked at modeling a *continuous outcome* using a *linear model*:

$$
f(X) = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_px_p
$$

where `\(\beta_0\)` is the intercept (the 'starting value') and `\(\beta_i\)` is the effect of a single unit change in `\(x_i\)`, assuming all else equal.

---
class: left, middle, r-logo

## Context: the logistic function

In the early 1800s, the Belgian mathematician Pierre François Verhulst proposed a function for modeling population growth, as follows:

$$
f(x) = \frac{L}{1 + e^{-k(x - x_0)}}
$$

where `\(L\)` is the limit of the population (known as the 'carrying capacity'), `\(k\)` is the steepness of the curve and `\(x_0\)` is the midpoint of `\(x\)` (in population terms, the midpoint of time). With `\(k = 1\)` and `\(x_0 = 0\)` this looks like:

<img src="index_files/figure-html/unnamed-chunk-1-1.png" style="display: block; margin: auto;" />

---
class: left, middle, r-logo

## Binary outcomes are everywhere

Binary outcomes are some of the most pervasive outcomes in science and in practice. For example:

* **Medicine:** Disease vs No Disease, Survived vs Did Not Survive
* **Sports:** Win vs No Win
* **IO Psychology:** Satisfied vs Not Satisfied, Retained vs Not Retained, High Performer vs Not
* **Finance:** Fraud vs No Fraud, Default vs No Default
* **Marketing:** Click vs No Click, Purchase vs No Purchase

In many cases, other outcome scales are converted to binary using a cutoff: for example, in surveys, a 'high' response may consist of several high ratings on the Likert scale.

---
class: left, middle, r-logo

## Modeling binary outcomes

Imagine we are studying an outcome event `\(y\)` which can either occur or not occur. We label `\(y = 1\)` if it does occur for a given observation, and `\(y = 0\)` otherwise. `\(y\)` is called a *binary* or *dichotomous* outcome.

Now imagine we want to relate `\(y\)` to a set of input variables `\(X\)`. In order to study this using some method similar to our linear model, we need a sensible scale for `\(y\)`, knowing that it cannot be less than 0 or greater than 1. One natural way forward is to consider the *probability* of `\(y\)` occurring: `\(P(y = 1)\)`.
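Recall the logistic function from a few slides back: with `\(L = 1\)`, `\(k = 1\)` and `\(x_0 = 0\)` it maps any real-valued input into the interval `\((0, 1)\)`, which is exactly the range a probability needs. Here is a minimal R sketch of this (the `sigmoid` helper name is just for illustration):

```r
# a simple logistic (sigmoid) function with L = 1, k = 1 and x_0 = 0
sigmoid <- function(x) {
  1 / (1 + exp(-x))
}

# outputs are always strictly between 0 and 1
sigmoid(c(-100, -1, 0, 1, 100))
```

```
## [1] 3.720076e-44 2.689414e-01 5.000000e-01 7.310586e-01 1.000000e+00
```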
---
class: left, middle, r-logo

## Probability distribution for random variables

Let's assume we have a single input variable `\(x\)`, and that `\(y\)` is more likely to occur as `\(x\)` increases. We sample the data for increasing values of `\(x\)` and calculate the mean probability that `\(y = 1\)`. Over large enough samples, we expect to see something like a cumulative normal probability distribution.

<img src="index_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" />

---
class: left, middle, r-logo

## Similarity with logistic distribution

Notice the similar S (sigmoid) shape of the cumulative normal and logistic distributions. This means we could take our logistic function with a carrying capacity of 1 (the maximum probability) as an approximation of the normal distribution. This turns out to have benefits in interpretation.

<img src="index_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />

---
class: left, middle, r-logo

## What happens if we use a logistic function?

$$
P(y = 1) = \frac{1}{1 + e^{-k(x - x_0)}} = \frac{1}{1 + e^{-(\beta_0 + \beta_1x)}}
$$

where `\(\beta_0 = -kx_0\)` and `\(\beta_1 = k\)`. Meanwhile

$$
P(y = 0) = 1 - P(y = 1) = \frac{e^{-(\beta_0 + \beta_1x)}}{1 + e^{-(\beta_0 + \beta_1x)}}
$$

So, if we divide the two:

$$
\frac{P(y = 1)}{P(y = 0)} = \frac{1}{e^{-(\beta_0 + \beta_1x)}} = e^{\beta_0 + \beta_1x}
$$

---
class: left, middle, r-logo

## The odds of y

You may know (from horse racing, for example) that for a dichotomous outcome `\(y\)`, `\(\frac{P(y = 1)}{P(y = 0)}\)` is called the *odds* of `\(y\)`.

Now, if we take natural logarithms of our last equation, we get:

$$
\mathrm{ln}\left(\frac{P(y = 1)}{P(y = 0)}\right) = \beta_0 + \beta_1x
$$

So we have a linear model in the *log odds* of `\(y\)`. Since a *transformation* of our outcome is linear in our input variable, we can create a model known as a *generalized linear model*.

As we will see, the coefficients of a model like this can be interpreted very intuitively to explain the impact of input variables on the likelihood of `\(y\)` occurring.

---
class: left, middle, r-logo

## The speed dating dataset

This dataset is from an experiment run by Columbia University students in New York, who collected information from speed dating sessions in which individuals met partners of the opposite sex.

```r
# get data
url <- "https://raw.githubusercontent.com/keithmcnulty/speed_dating/master/speed_data_data.csv"
speed_dating <- read.csv(url)

# select only some elements for this training
speed_dating <- speed_dating[ ,c("gender", "goal", "dec", "attr", "intel", "prob")]
head(speed_dating)
```

```
##   gender goal dec attr intel prob
## 1      0    2   1    6     7    6
## 2      0    2   1    7     7    5
## 3      0    2   1    5     9   NA
## 4      0    2   1    7     8    6
## 5      0    2   1    5     7    6
## 6      0    2   0    4     7    5
```

---
class: left, middle, r-logo

## Data fields for `speed_dating`

The individuals were given a series of surveys to complete as part of the experiment.

* `dec` is the decision on whether the individual wanted to meet that specific partner again after the speed date, and is our outcome for this example.
* `attr`, `intel` and `prob` are ratings out of ten on physical attractiveness, intelligence and the individual's belief that the partner also liked them.
* `gender` is the gender of the individual: female is 0 and male is 1.
* `goal` is a categorical variable with a code for the individual's goal in attending (e.g. 'seemed like a fun night out', or 'looking for a serious relationship').

In this example we are going to look purely at the speed date level and not consider the fact that several speed dates may involve the same individual. We will add in that element in a future session when we look at *mixed models*.

We are aiming to model the decision `dec` against the ratings `attr`, `intel` and `prob`.
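Before running any models, it can help to take a quick look at how the outcome and inputs are distributed. This is a minimal, illustrative check using base R only:

```r
# how many positive and negative decisions are there?
table(speed_dating$dec, useNA = "ifany")

# summary of the three ratings we will use as inputs
summary(speed_dating[ ,c("attr", "intel", "prob")])
```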
---
class: left, middle, r-logo

## Running the model on men only

```r
# load magrittr for the %>% pipe used below
library(magrittr)

# run a binomial generalized linear model on the male decision makers
model_m <- glm(dec ~ attr + intel + prob, 
               data = speed_dating[speed_dating$gender == 1, ], 
               family = "binomial")

# view a summary of the coefficients of the model
(coefficients_m <- summary(model_m)$coefficients %>% as.data.frame())
```

```
##               Estimate Std. Error    z value      Pr(>|z|)
## (Intercept) -6.2537585 0.26346670 -23.736428 1.517359e-124
## attr         0.7658699 0.02990211  25.612572 1.105090e-144
## intel       -0.0688468 0.03020599  -2.279243  2.265262e-02
## prob         0.3269378 0.02112880  15.473558  5.233269e-54
```

Recalling the meaning of `Pr(>|z|)` (the p-value), we can determine that all three ratings play a significant role in the decision outcome of a date.

---
class: left, middle, r-logo

## Interpreting the coefficients

Our coefficients indicate the linear impact on the log odds of a positive decision. A negative coefficient decreases the log odds and a positive coefficient increases the log odds. In this context, we can see that physical attractiveness and sense of reciprocation both have a positive effect on the likelihood of a positive decision when the decision maker is male, but intelligence has a negative impact.

We can easily extend the manipulations from a few slides back to get a formula for the odds of an event in terms of the coefficients `\(\beta_0, \beta_1, ..., \beta_p\)`:

$$
`\begin{align*}
\frac{P(y = 1)}{P(y = 0)} &= e^{\beta_0 + \beta_1x_1 + ... + \beta_px_p} \\
&= e^{\beta_0}(e^{\beta_1})^{x_1}...(e^{\beta_p})^{x_p}
\end{align*}`
$$

* `\(e^{\beta_0}\)` is the odds of the event assuming all input variables are zero
* `\(e^{\beta_i}\)` is the multiplier of the odds associated with a one-unit increase in `\(x_i\)` (for example, an extra point rating in physical attractiveness), assuming all else equal; because of this multiplicative effect, we call it the *odds ratio* for `\(x_i\)`
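To connect this algebra with the fitted model object, here is a minimal sketch using `model_m` from the previous slide: on the link scale, `predict()` returns exactly the linear combination `\(\beta_0 + \beta_1x_1 + ... + \beta_px_p\)`, i.e. the log odds.

```r
# the linear predictor beta_0 + beta_1 * attr + ... is the log odds
log_odds <- predict(model_m, type = "link")

# transforming the log odds with the logistic function recovers
# the fitted probabilities
probs <- predict(model_m, type = "response")
all.equal(probs, 1 / (1 + exp(-log_odds)))  # should be TRUE
```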
---
class: left, middle, r-logo

## Calculating and interpreting the odds ratios

Odds ratios are simply the exponents of the coefficient estimates.

```r
# add a column to our coefficients with odds ratios
coefficients_m$odds_ratio <- exp(coefficients_m[ ,"Estimate"])
coefficients_m
```

```
##               Estimate Std. Error    z value      Pr(>|z|)  odds_ratio
## (Intercept) -6.2537585 0.26346670 -23.736428 1.517359e-124 0.001923212
## attr         0.7658699 0.02990211  25.612572 1.105090e-144 2.150864692
## intel       -0.0688468 0.03020599  -2.279243  2.265262e-02 0.933469675
## prob         0.3269378 0.02112880  15.473558  5.233269e-54 1.386715208
```

We interpret each of our odds ratios as follows (assuming all else equal):

* An extra point in physical attractiveness increases the odds by 115%
* An extra point in intelligence *decreases* the odds by 7%
* An extra point in perceived reciprocation of interest increases the odds by 39%

If you are concerned about the precision of these statements, you can also get the 95% confidence intervals for the odds ratios using `exp(confint(model_m))`.

---
class: left, middle, r-logo

## Warning: odds ≠ probability

Increases in odds have a diminishing effect on probability as the original probability increases, so it is important to know the difference between the two. Here is a graph showing the impact of a 10% increase in odds on the probability of an event, depending on the original probability.

<img src="index_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />

---
class: left, middle, r-logo

## Are dynamics different with women?

```r
# run a binomial generalized linear model on the female decision makers
model_f <- glm(dec ~ attr + intel + prob, 
               data = speed_dating[speed_dating$gender == 0, ], 
               family = "binomial")

# view a summary of the coefficients of the model, incl. odds ratios
coefficients_f <- summary(model_f)$coefficients %>% as.data.frame()
coefficients_f$odds_ratio <- exp(coefficients_f[ ,"Estimate"])
coefficients_f
```

```
##                Estimate Std. Error    z value      Pr(>|z|)  odds_ratio
## (Intercept) -5.84737044 0.25260132 -23.148614 1.501241e-118 0.002887482
## attr         0.55142831 0.02506430  22.000546 2.845351e-107 1.735730416
## intel        0.08757465 0.02880993   3.039738  2.367839e-03 1.091523740
## prob         0.23149158 0.01979541  11.694207  1.364556e-31 1.260478710
```

Try interpreting these yourself.

---
class: left, middle, r-logo

## Predicting using binomial logistic regression models

Binomial logistic regression models play an important role in predictive analytics. Feeding new data into the fitted logistic function produces predicted probabilities of a positive outcome.

```r
new_dates <- data.frame(
  attr = c(1, 5, 9),
  intel = c(2, 4, 8),
  prob = c(5, 7, 9)
)

predict(model_m, new_dates, type = "response")
```

```
##          1          2          3 
## 0.01814777 0.39861688 0.95394355
```

In classification learning, the data is split into a training set and a test set, the model is fitted on the training set, a probability 'cutoff' (usually 0.5) is used to assign positive or negative classes, and the predictive accuracy is then determined on the test set, as in the sketch below.
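Here is a minimal sketch of that workflow using base R only. The 75/25 split, the seed and the use of simple accuracy are arbitrary choices for illustration:

```r
# keep complete cases for the male decision makers
men <- na.omit(speed_dating[speed_dating$gender == 1, ])

# split into a training set and a test set
set.seed(123)
train_rows <- sample(nrow(men), size = floor(0.75 * nrow(men)))
train <- men[train_rows, ]
test <- men[-train_rows, ]

# fit the model on the training set only
fit <- glm(dec ~ attr + intel + prob, data = train, family = "binomial")

# predict probabilities on the test set and apply a 0.5 cutoff
predicted_prob <- predict(fit, test, type = "response")
predicted_class <- as.numeric(predicted_prob >= 0.5)

# simple predictive accuracy on the test set
mean(predicted_class == test$dec)
```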
---
class: left, middle, r-logo

## Assessing the fit of a binomial logistic regression model

Previously we looked at the fit of a linear regression model and calculated a metric called `\(R^2\)`, which measured how much of the variance of `\(y\)` was explained by our model. This is not so straightforward in binomial logistic regression, and is in fact still the subject of intense research. Numerous variants of measures called *pseudo*-`\(R^2\)` exist to try to approximate something similar to an `\(R^2\)`. The `DescTools` package provides easy access to these measures. All have different definitions and should be handled carefully. Here are four of them.

```r
library(DescTools)
DescTools::PseudoR2(model_m, 
                    which = c("McFadden", "CoxSnell", "Nagelkerke", "Tjur"))
```

```
##  McFadden  CoxSnell Nagelkerke       Tjur 
## 0.2742679 0.3161661  0.4216450  0.3342360
```

The *Akaike Information Criterion* is also valuable in directly comparing two competing models, with a lower AIC suggesting less information loss from the model.

```r
AIC(model_m)
```

```
## [1] 4019.39
```

---
class: left, middle, r-logo

## Assessing model confidence/goodness-of-fit

In linear regression we were able to compare our model to a completely random alternative and show with some level of certainty whether our model did a 'better than random' job. This, again, is not straightforward with logistic regression. Options include:

* 'Goodness of fit' tests, which involve breaking observations into groups and comparing the observed outcomes in each group with the fitted outcomes in each group, performing a chi-square test to determine if they are significantly different. In these tests the null hypothesis is that there is a good fit, so (unlike in linear regression) a low p-value is a sign of a poorly chosen model. A number of different goodness-of-fit tests are available in the `LogisticDx` package, including the Hosmer-Lemeshow test.
* Typical measures of predictive accuracy, such as precision, recall and ROC-AUC, compared against a random classifier, can be used if the model is being trained in a predictive context to classify test data.

---
class: left, middle, r-logo

## 'Take-home' exercise 1 - Space Shuttle Challenger disaster

1. Download the `orings` dataset from [here](https://raw.githubusercontent.com/keithmcnulty/logit_regression_training/main/orings.csv).
2. The dataset shows the air temperature at launch and the number of incidents of o-ring damage on each shuttle launch prior to the Challenger launch.
3. Create an appropriate binary variable to apply binomial logistic regression.
4. Show that air temperature at launch had a significant effect on the probability of o-ring damage.

---
class: left, middle, r-logo

## 'Take-home' exercise 2 - speed dating

Here is the coding for `goal` in the `speed_dating` dataset:

* Seemed like a fun night out = 1
* To meet new people = 2
* To get a date = 3
* Looking for a serious relationship = 4
* To say I did it = 5
* Other = 6

1. Filter the data into separate sets according to `goal` - or combine sets if you think the goals are similar.
2. For each of your sets, run a binomial logistic regression model and determine if there are any notable differences between the groups.

Feel free to share your findings!

---
class: center, middle, r-logo

# Thank you! Questions?

## Next time: Multinomial Logistic Regression for outcomes that are discrete categories