# Logistic Regression - Formal Mathematical Specification - As A "log-linear" Model

Yet another formulation combines the two-way latent variable formulation above with the original formulation higher up without latent variables, and in the process provides a link to one of the standard formulations of the multinomial logit.

Here, instead of writing the logit of the probabilities $p_i$ as a linear predictor, we separate the linear predictor into two, one for each of the two outcomes:

\begin{align}
\ln p(Y_i=0) &= \boldsymbol\beta_0 \cdot \mathbf{X}_i - \ln Z \\
\ln p(Y_i=1) &= \boldsymbol\beta_1 \cdot \mathbf{X}_i - \ln Z
\end{align}

Note that two separate sets of regression coefficients have been introduced, just as in the two-way latent variable model, and the two equations appear in a form that writes the logarithm of the associated probability as a linear predictor, with an extra term $-\ln Z$ at the end. This term, as it turns out, serves as the normalizing factor ensuring that the result is a distribution. This can be seen by exponentiating both sides:

\begin{align}
p(Y_i=0) &= \frac{1}{Z} e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} \\
p(Y_i=1) &= \frac{1}{Z} e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}
\end{align}

In this form it is clear that the purpose of $Z$ is to ensure that the resulting distribution over $Y_i$ is in fact a probability distribution, i.e. it sums to 1. This means that $Z$ is simply the sum of all un-normalized probabilities, and by dividing each probability by $Z$, the probabilities become "normalized". That is:

$Z = e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} + e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}$

and the resulting equations are

\begin{align}
p(Y_i=0) &= \frac{e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i}}{e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} + e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}} \\
p(Y_i=1) &= \frac{e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}}{e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} + e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}}
\end{align}

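Or generally, for an outcome $c$ with the sum in the denominator running over all possible outcomes $h$:

$p(Y_i=c) = \frac{e^{\boldsymbol\beta_c \cdot \mathbf{X}_i}}{\sum_h e^{\boldsymbol\beta_h \cdot \mathbf{X}_i}}$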

This shows clearly how to generalize this formulation to more than two outcomes, as in multinomial logit.
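The normalization step can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the helper names `dot` and `outcome_probabilities` and all numeric values are assumptions made up for the example, and the code exponentiates directly without any safeguards against overflow.

```python
import math

def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def outcome_probabilities(betas, x):
    """Return p(Y = c) for every outcome c, given one coefficient vector per outcome."""
    # Un-normalized probabilities e^{beta_c . x}, one per outcome
    unnormalized = [math.exp(dot(beta, x)) for beta in betas]
    # Z is the sum of all un-normalized probabilities
    z = sum(unnormalized)
    # Dividing by Z makes the values sum to 1
    return [u / z for u in unnormalized]

# Hypothetical explanatory variables and coefficients (illustrative values only)
x = [1.0, 2.0, -0.5]
beta_0 = [0.2, -0.1, 0.4]
beta_1 = [-0.3, 0.5, 0.1]
beta_2 = [0.0, 0.3, -0.2]

two_way = outcome_probabilities([beta_0, beta_1], x)            # binary case
three_way = outcome_probabilities([beta_0, beta_1, beta_2], x)  # multi-way case
print(two_way, three_way)
```

The same function handles two outcomes or more, since the only thing that changes is how many terms enter the sum that defines $Z$.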

Now, how can we prove that this is equivalent to the previous model? Keep in mind that the above model is overspecified, in that $p(Y_i=0)$ and $p(Y_i=1)$ cannot be independently specified: rather $p(Y_i=0) + p(Y_i=1) = 1$, so knowing one automatically determines the other. As a result, the model is nonidentifiable, in that multiple combinations of $\boldsymbol\beta_0$ and $\boldsymbol\beta_1$ will produce the same probabilities for all possible explanatory variables. In fact, it can be seen that adding any constant vector $\mathbf{C}$ to both of them will produce the same probabilities:

\begin{align}
p(Y_i=1) &= \frac{e^{(\boldsymbol\beta_1 + \mathbf{C}) \cdot \mathbf{X}_i}}{e^{(\boldsymbol\beta_0 + \mathbf{C}) \cdot \mathbf{X}_i} + e^{(\boldsymbol\beta_1 + \mathbf{C}) \cdot \mathbf{X}_i}} \\
&= \frac{e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i} e^{\mathbf{C} \cdot \mathbf{X}_i}}{e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} e^{\mathbf{C} \cdot \mathbf{X}_i} + e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i} e^{\mathbf{C} \cdot \mathbf{X}_i}} \\
&= \frac{e^{\mathbf{C} \cdot \mathbf{X}_i} e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}}{e^{\mathbf{C} \cdot \mathbf{X}_i} (e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} + e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i})} \\
&= \frac{e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}}{e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} + e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}}
\end{align}
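The invariance derived above can also be checked numerically. The following is a small sketch under assumed, made-up values: the coefficient vectors, the shift vector `c`, and the helper names are hypothetical and serve only to illustrate the cancellation.

```python
import math

def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def p_y1(beta_0, beta_1, x):
    """p(Y = 1) in the two-vector formulation."""
    e0 = math.exp(dot(beta_0, x))
    e1 = math.exp(dot(beta_1, x))
    return e1 / (e0 + e1)

# Hypothetical values, chosen only for illustration
x = [1.0, 2.0, -0.5]
beta_0 = [0.2, -0.1, 0.4]
beta_1 = [-0.3, 0.5, 0.1]
c = [3.0, -1.5, 0.7]  # arbitrary constant vector added to both coefficient vectors

shifted_0 = [b + k for b, k in zip(beta_0, c)]
shifted_1 = [b + k for b, k in zip(beta_1, c)]

original = p_y1(beta_0, beta_1, x)
shifted = p_y1(shifted_0, shifted_1, x)

# The two values agree up to floating-point rounding
print(original, shifted)
```

Any other choice of `c` should give the same agreement, because the factor $e^{\mathbf{C} \cdot \mathbf{X}_i}$ appears in both numerator and denominator.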

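As a result, we can simplify matters, and restore identifiability, by picking an arbitrary value for one of the two vectors. We choose to set $\boldsymbol\beta_0 = \mathbf{0}$. Then,

$e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} = e^{\mathbf{0} \cdot \mathbf{X}_i} = 1$

and so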

$p(Y_i=1) = \frac{e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}}{1 + e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}} = \frac{1}{1 + e^{-\boldsymbol\beta_1 \cdot \mathbf{X}_i}} = p_i$

which shows that this formulation is indeed equivalent to the previous formulation. (As in the two-way latent variable formulation, any settings where $\boldsymbol\beta = \boldsymbol\beta_1 - \boldsymbol\beta_0$ will produce equivalent results.)
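As a rough numerical check of this equivalence, the sketch below compares three quantities: the two-vector probability, the same model with the first vector pinned to zero and the difference of the two vectors used as the single coefficient vector, and the ordinary logistic function applied to that difference. All names and values are assumptions introduced for illustration.

```python
import math

def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def sigmoid(t):
    """Standard logistic function 1 / (1 + e^{-t})."""
    return 1.0 / (1.0 + math.exp(-t))

def p_y1(beta_0, beta_1, x):
    """p(Y = 1) in the two-vector formulation."""
    e0 = math.exp(dot(beta_0, x))
    e1 = math.exp(dot(beta_1, x))
    return e1 / (e0 + e1)

# Hypothetical values, chosen only for illustration
x = [1.0, 2.0, -0.5]
beta_0 = [0.2, -0.1, 0.4]
beta_1 = [-0.3, 0.5, 0.1]

# Only the difference of the two coefficient vectors matters
beta = [b1 - b0 for b0, b1 in zip(beta_0, beta_1)]
zero = [0.0] * len(x)

two_vector = p_y1(beta_0, beta_1, x)   # unconstrained pair of vectors
pinned = p_y1(zero, beta, x)           # first vector fixed at zero
logistic = sigmoid(dot(beta, x))       # single-vector logistic form

print(two_vector, pinned, logistic)
```

The three printed values should coincide up to rounding, which mirrors the algebraic argument: fixing one vector removes the redundancy without changing the probabilities the model can express.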

Note that most treatments of the multinomial logit model start out either by extending the "log-linear" formulation presented here or the two-way latent variable formulation presented above, since both clearly show the way that the model could be extended to multi-way outcomes. In general, the presentation with latent variables is more common in econometrics and political science, where discrete choice models and utility theory reign, while the "log-linear" formulation here is more common in computer science, e.g. machine learning and natural language processing.