Logistic regression is most commonly used with a Bernoulli response. If \(Y \sim Bernoulli(p)\), then \(P(Y = 1) = p = 1 - P(Y = 0)\). Logistic regression can be used to estimate \(p|\textbf{x}\) where \(\textbf{x}\) is a collection of predictor variables. This is accomplished by assuming the relationship
\[log(\frac{p}{1-p}) = \textbf{x}\beta\]
and then choosing \(\beta\) to maximize the joint likelihood:
\[\Pi_i p_i^{y_i}(1-p_i)^{(1-y_i)}\]
Note that the function linking the predictor variables to \(p\) is called the logit function, which is
what gives logistic regression its name. In \(\textbf{R}\), this model can be fit using
glm by specifying the options
family = "binomial".
The binomial distribution is a generalization of the Bernoulli, typically conceptualized as describing the number of “successes” (here denoted \(y\)) out of some number of attempts or trials (here denoted \(n\)). The probability mass function for a binomial random variable, \(Y\), is
\[P(Y = y|n, p) = {x \choose n} p^x(1 - p)^{n - x}\]
From this, we can see that by setting \(n = 1\), we recover the Bernoulli PMF.
Since the binomial distribution is a generalization of the Bernoulli distribution, it stands to reason that it may be possible to generalize logistic regression for Bernoulli variables to binomial variables. Indeed, this is the case, and it turns out it is relatively easy. Using the logit link function again, maximize the joint likelihood:
\[\Pi_i {y_i \choose n_i} p_i^{y_i}(1-p_i)^{(1-y_i)}\]
R is also capable of fitting binomial regression models using
glm with family = "binomial". However, there
are a few syntactic quirks. First, in the model statement, the response
variable must be given as a proportion of successes for a given run.
This can be done either by providing the proportions as a column in the
data matrix or by specifying a ratio of the number of successes to the
number of attempts. An example here might be instructive.
df <- data.frame(
x = runif(30, -1, 1),
n = rbinom(30, 6, .8)) %>%
mutate(y = rbinom(30, n, (exp(x)/(1 + exp(x)))))Consider the data set, df, given below.
## x n y
## 1 -0.01132681 5 1
## 2 -0.26414193 4 2
## 3 -0.81380816 4 0
## 4 0.55575400 4 3
## 5 0.65899013 6 3
## 6 0.89954037 4 3
Here, \(x\) is some predictor variable related to the number of successes, \(y\), for the given number of attempts, \(n\). We can fit a logistic regression to this data to estimate this relationship:
##
## Call: glm(formula = y/n ~ x, family = "binomial", data = df, weights = n)
##
## Coefficients:
## (Intercept) x
## 0.1041 0.9450
##
## Degrees of Freedom: 29 Total (i.e. Null); 28 Residual
## Null Deviance: 51.94
## Residual Deviance: 42.44 AIC: 93.45
We get the same results using different syntax:
## x n y prob
## 1 -0.01132681 5 1 0.20
## 2 -0.26414193 4 2 0.50
## 3 -0.81380816 4 0 0.00
## 4 0.55575400 4 3 0.75
## 5 0.65899013 6 3 0.50
## 6 0.89954037 4 3 0.75
##
## Call: glm(formula = prob ~ x, family = "binomial", data = dfProb, weights = n)
##
## Coefficients:
## (Intercept) x
## 0.1041 0.9450
##
## Degrees of Freedom: 29 Total (i.e. Null); 28 Residual
## Null Deviance: 51.94
## Residual Deviance: 42.44 AIC: 93.45
Notably, these fitted values are equivalent to what we would get if
we were to transform each of our binomial responses into equivalent sets
of Bernoulli trials. For example, the first observation in
df is a single success out of 5 attempts with a covariate
of -0.011. Another way to think of this is five Bernoulli trials, one
successful, four unsuccessful, all with a covariate value of -0.011.
For example:
## x y
## 1 -0.01132681 1
## 2 -0.01132681 0
## 3 -0.01132681 0
## 4 -0.01132681 0
## 5 -0.01132681 0
## 6 -0.26414193 1
##
## Call: glm(formula = y ~ x, family = "binomial", data = dfTall)
##
## Coefficients:
## (Intercept) x
## 0.1041 0.9450
##
## Degrees of Freedom: 139 Total (i.e. Null); 138 Residual
## Null Deviance: 194.1
## Residual Deviance: 184.6 AIC: 188.6
Note that while the degrees of freedom, information criteria, etc. are different, the coefficient estimates are the same, as are the standard errors for these coefficients.
##Using ciTools for binomial regression
ciTools supports logistic regression with both Bernoulli
and binomial response variables. For both types, add_ci has
intuitive and expected functionality. It produces a point estimate and
confidence intervals around the estimated probability of success.
However, the other functions of ciTools
(add_pi, add_quantile, and
add_probs) produce different behaviors, because prediction
intervals, quantiles, and probability values are not very useful for
Bernoulli variables.
For example, consider prediction intervals. These are used to
quantify the observation-to-observation variability and estimate its
range. For a Bernoulli variable, only two values are possible, so this
is not particularly meaningful – the prediction interval will always be
\([0,1]\). Quantiles are similarly
problematic. And probability estimates (that is, \(P(Y>y|x)\)) are equivalent to estimates
of \(p\) or \(1-p\) . Thus, asking for any of these will
produce either an error (in the case of
add_pi.glm(family = "binomial) or
add_quantiles(family = "binomial)) or a warning (in the
case of add_probs.glm(family = "binomial")).
For a binomial response, on the other hand, each of these functions produce meaningful output. Binomial response variables can take more than two values, making it sensible to consider prediction intervals, quantiles, or probabilities. For example, it may be interesting to consider the 90 percent quantile for a binomial regression:
## Warning in add_quantile.glm(df, fit, p = 0.9): Treating weights as indicating
## the number of trials for a binomial regression where the response is the
## proportion of successes
## Warning in add_quantile.glm(df, fit, p = 0.9): The response variable is not
## continuous so Prediction Intervals are approximate
## Warning in sim_quantile_other(df, fit, p, name, yhatName, nSims): For binomial
## models, add_quantile's column of fitted values reflect E(Y|X) rather than
## typical default for logistic regression, pHat
## x n y pred quantile0.9
## 1 -0.01132681 5 1 2.616636 4
## 2 -0.26414193 4 2 1.854723 3
## 3 -0.81380816 4 0 1.358501 3
## 4 0.55575400 4 3 2.609291 4
## 5 0.65899013 6 3 4.044646 6
## 6 0.89954037 4 3 2.887789 4
There is a lot to unpack here, including three warning messages.
Let’s consider the output first. Unlike when we use
add_ci.glm, the pred column includes the
predicted values \(E(Y|\boldsymbol{x})\) rather than the
estimated probability of success \(\hat{p}|\boldsymbol{x}\).
Next, there are three warnings. None of these indicate that anything
has gone wrong. Rather, they are included to ensure that users
understand the output they’re given. The first warning makes it clear to
the user that the weights argument is assumed to indicate
that they’re doing binomial rather than weighted Bernoulli regression.
The second warning informs users that the estimated interval is
approximate. This is because the current method used by
ciTools for GLMs is based on simulation, and because
ciTools forces quantile estimates to lie in the support set
of the response distribution. The final warning points out that the
pred column refers to the fitted values rather than
estimated probability of success, which was mentioned in the previous
paragraph.
Logistic regression can be used for both Bernoulli and Binomial
response variables, and glm supports both.
ciTools differentiates between the two based on whether the
user has included a column in the weights argument. Aside
from add_ci, none of the functions in ciTools
produce useful information for Bernoulli response variables. For
Binomial response variables, all of these functions produce useful
information, though error messages are included to ensure that users
understand the output presented.