# Multinomial Logit Model: As a Log-linear Model


The formulation of binary logistic regression as a log-linear model can be directly extended to multi-way regression. That is, we model the logarithm of the probability of seeing a given output using the linear predictor as well as an additional normalization factor:

$$\begin{align} \ln \Pr(Y_i=1) &= \boldsymbol\beta_1 \cdot \mathbf{X}_i - \ln Z, \\ \ln \Pr(Y_i=2) &= \boldsymbol\beta_2 \cdot \mathbf{X}_i - \ln Z, \\ \cdots & \cdots \\ \ln \Pr(Y_i=K) &= \boldsymbol\beta_K \cdot \mathbf{X}_i - \ln Z. \end{align}$$

As in the binary case, we need an extra term to ensure that the whole set of probabilities forms a probability distribution, i.e. so that they all sum to one:

$$\sum_{k=1}^{K} \Pr(Y_i=k) = 1$$

The reason why we need to add a term to ensure normalization, rather than multiply as is usual, is because we have taken the logarithm of the probabilities. Exponentiating both sides turns the additive term into a multiplicative factor, and in the process shows why we wrote the term in the form $-\ln Z$ rather than simply $Z$:

$$\begin{align} \Pr(Y_i=1) &= \frac{1}{Z} e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}, \\ \Pr(Y_i=2) &= \frac{1}{Z} e^{\boldsymbol\beta_2 \cdot \mathbf{X}_i}, \\ \cdots & \cdots \\ \Pr(Y_i=K) &= \frac{1}{Z} e^{\boldsymbol\beta_K \cdot \mathbf{X}_i}. \end{align}$$

We can compute the value of $Z$ by applying the above constraint that requires all probabilities to sum to 1:

$$\begin{align} 1 = \sum_{k=1}^{K} \Pr(Y_i=k) &= \sum_{k=1}^{K} \frac{1}{Z} e^{\boldsymbol\beta_k \cdot \mathbf{X}_i} \\ &= \frac{1}{Z} \sum_{k=1}^{K} e^{\boldsymbol\beta_k \cdot \mathbf{X}_i} \end{align}$$

Therefore:

$$Z = \sum_{k=1}^{K} e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}$$

Note that this factor is "constant" in the sense that it is not a function of $Y_i$, which is the variable over which the probability distribution is defined. However, it is definitely not constant with respect to the explanatory variables, or crucially, with respect to the unknown regression coefficients $\boldsymbol\beta_k$, which we will need to determine through some sort of optimization procedure.

The resulting equations for the probabilities are

$$\begin{align} \Pr(Y_i=1) &= \frac{e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}}{\sum_{k=1}^{K} e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}}, \\ \Pr(Y_i=2) &= \frac{e^{\boldsymbol\beta_2 \cdot \mathbf{X}_i}}{\sum_{k=1}^{K} e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}}, \\ \cdots & \cdots \\ \Pr(Y_i=K) &= \frac{e^{\boldsymbol\beta_K \cdot \mathbf{X}_i}}{\sum_{k=1}^{K} e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}}. \end{align}$$
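These probability equations are straightforward to compute directly. The following sketch (using NumPy, with made-up coefficients for $K = 3$ classes and $2$ explanatory variables) evaluates the linear predictors $\boldsymbol\beta_k \cdot \mathbf{X}_i$ and normalizes by $Z$; the max-subtraction step is a standard numerical-stability trick and does not change the result, by the shift-invariance shown later in this section:

```python
import numpy as np

def class_probabilities(B, x):
    """Pr(Y = k | x) for each class k, given a (K, p) matrix B whose
    rows are the coefficient vectors beta_k, and a feature vector x."""
    scores = B @ x            # linear predictors beta_k . x
    scores = scores - scores.max()  # stability shift; cancels in the ratio
    expd = np.exp(scores)
    return expd / expd.sum()  # divide by the normalization factor Z

# Hypothetical coefficients: K = 3 classes, p = 2 explanatory variables
B = np.array([[1.0, -0.5],
              [0.2,  0.8],
              [0.0,  0.0]])
x = np.array([2.0, 1.0])
p = class_probabilities(B, x)  # three probabilities summing to 1
```

The linear predictors here are $1.5$, $1.2$, and $0$, so the first class receives the largest probability.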

Or generally, for each class $c$:

$$\Pr(Y_i=c) = \frac{e^{\boldsymbol\beta_c \cdot \mathbf{X}_i}}{\sum_{k=1}^{K} e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}}$$

The following function:

$$\operatorname{softmax}(k, x_1, \ldots, x_n) = \frac{e^{x_k}}{\sum_{i=1}^{n} e^{x_i}}$$

is referred to as the softmax function. The reason is that the effect of exponentiating the values $x_1, \ldots, x_n$ is to exaggerate the differences between them. As a result, $\operatorname{softmax}(k, x_1, \ldots, x_n)$ will return a value close to 0 whenever $x_k$ is significantly less than the maximum of all the values, and will return a value close to 1 when applied to the maximum value, unless it is extremely close to the next-largest value. Thus, the softmax function can be used to construct a weighted average that behaves as a smooth function (which can be conveniently differentiated, etc.) and which approximates the non-smooth function $\max(x_1, \ldots, x_n)$. That is:

$$\sum_{k=1}^{n} x_k \operatorname{softmax}(k, x_1, \ldots, x_n) \approx \max(x_1, \ldots, x_n)$$

Thus, we can write the probability equations as

$$\Pr(Y_i=c) = \operatorname{softmax}(c, \boldsymbol\beta_1 \cdot \mathbf{X}_i, \ldots, \boldsymbol\beta_K \cdot \mathbf{X}_i)$$

The softmax function thus serves as the equivalent of the logistic function in binary logistic regression.

Note that not all of the vectors of coefficients are uniquely identifiable. This is due to the fact that all probabilities must sum to 1, making one of them completely determined once all the rest are known. As a result there are only $K-1$ separately specifiable probabilities, and hence $K-1$ separately identifiable vectors of coefficients. One way to see this is to note that if we add a constant vector $\mathbf{C}$ to all of the coefficient vectors, the equations are identical:

$$\begin{align} \frac{e^{(\boldsymbol\beta_c + \mathbf{C}) \cdot \mathbf{X}_i}}{\sum_{k=1}^{K} e^{(\boldsymbol\beta_k + \mathbf{C}) \cdot \mathbf{X}_i}} &= \frac{e^{\boldsymbol\beta_c \cdot \mathbf{X}_i} e^{\mathbf{C} \cdot \mathbf{X}_i}}{\sum_{k=1}^{K} e^{\boldsymbol\beta_k \cdot \mathbf{X}_i} e^{\mathbf{C} \cdot \mathbf{X}_i}} \\ &= \frac{e^{\mathbf{C} \cdot \mathbf{X}_i} e^{\boldsymbol\beta_c \cdot \mathbf{X}_i}}{e^{\mathbf{C} \cdot \mathbf{X}_i} \sum_{k=1}^{K} e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}} \\ &= \frac{e^{\boldsymbol\beta_c \cdot \mathbf{X}_i}}{\sum_{k=1}^{K} e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}} \end{align}$$
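This shift-invariance can be checked directly: adding the same arbitrary vector $\mathbf{C}$ to every coefficient vector multiplies numerator and denominator by the same factor $e^{\mathbf{C} \cdot \mathbf{X}_i}$, which cancels. A quick sketch with randomly generated (illustrative) coefficients:

```python
import numpy as np

def probs(B, x):
    """Softmax class probabilities for coefficient matrix B and features x."""
    s = B @ x
    e = np.exp(s - s.max())
    return e / e.sum()

rng = np.random.default_rng(0)
B = rng.normal(size=(4, 3))  # hypothetical coefficients: K = 4, p = 3
x = rng.normal(size=3)
C = rng.normal(size=3)       # arbitrary constant vector added to every beta_k

p1 = probs(B, x)
p2 = probs(B + C, x)         # identical probabilities: e^{C.x} cancels
```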

As a result, it is conventional to set $\mathbf{C} = -\boldsymbol\beta_K$ (or alternatively, the negative of one of the other coefficient vectors). Essentially, we set the constant so that one of the vectors becomes 0, and all of the other vectors get transformed into the difference between those vectors and the vector we chose. This is equivalent to "pivoting" around one of the K choices, and examining how much better or worse all of the other K-1 choices are, relative to the choice we are pivoting around. Mathematically, we transform the coefficients as follows:

$$\begin{align} \boldsymbol\beta'_1 &= \boldsymbol\beta_1 - \boldsymbol\beta_K \\ \cdots & \cdots \\ \boldsymbol\beta'_{K-1} &= \boldsymbol\beta_{K-1} - \boldsymbol\beta_K \\ \boldsymbol\beta'_K &= 0 \end{align}$$

This leads to the following equations:

$$\begin{align} \Pr(Y_i=1) &= \frac{e^{\boldsymbol\beta'_1 \cdot \mathbf{X}_i}}{1 + \sum_{k=1}^{K-1} e^{\boldsymbol\beta'_k \cdot \mathbf{X}_i}}, \\ \cdots & \cdots \\ \Pr(Y_i=K-1) &= \frac{e^{\boldsymbol\beta'_{K-1} \cdot \mathbf{X}_i}}{1 + \sum_{k=1}^{K-1} e^{\boldsymbol\beta'_k \cdot \mathbf{X}_i}}, \\ \Pr(Y_i=K) &= \frac{1}{1 + \sum_{k=1}^{K-1} e^{\boldsymbol\beta'_k \cdot \mathbf{X}_i}}. \end{align}$$
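The pivoted form can be sketched in code as well. Here only $K-1$ coefficient vectors are stored, and the reference class $K$ contributes the constant $1 = e^{0 \cdot \mathbf{X}_i}$ to the denominator (the coefficients below are made up for illustration; a production implementation would also guard against overflow in the exponentials):

```python
import numpy as np

def pivoted_probs(Bp, x):
    """Class probabilities with the K-th class as the zero-coefficient
    reference category; Bp has shape (K-1, p)."""
    expd = np.exp(Bp @ x)        # e^{beta'_k . x} for k = 1..K-1
    denom = 1.0 + expd.sum()     # the 1 is e^{0 . x} for the pivot class K
    return np.append(expd / denom, 1.0 / denom)

# Hypothetical pivoted coefficients: K = 3 classes, p = 2 features
Bp = np.array([[0.5, -1.0],
               [1.5,  0.3]])
x = np.array([1.0, 2.0])
p = pivoted_probs(Bp, x)         # K probabilities summing to 1
```

This produces exactly the same distribution as running the full softmax with a zero vector appended as the $K$-th coefficient row, which is one way to see that the pivoted and unpivoted parameterizations describe the same model.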

Other than the prime symbols on the regression coefficients, this is exactly the same as the form of the model described above, in terms of K-1 independent two-way regressions.