\[ \begin{align}\begin{aligned}\newcommand{\bs}{\boldsymbol} \newcommand{\dp}{\displaystyle} \newcommand{\rm}{\mathrm} \newcommand{\cl}{\mathcal} \newcommand{\pd}{\partial}\\\newcommand{\cd}{\cdot} \newcommand{\cds}{\cdots} \newcommand{\dds}{\ddots} \newcommand{\lag}{\langle} \newcommand{\lv}{\lVert} \newcommand{\ol}{\overline} \newcommand{\od}{\odot} \newcommand{\ra}{\rightarrow} \newcommand{\rag}{\rangle} \newcommand{\rv}{\rVert} \newcommand{\seq}{\subseteq} \newcommand{\td}{\tilde} \newcommand{\vds}{\vdots} \newcommand{\wh}{\widehat}\\\newcommand{\0}{\boldsymbol{0}} \newcommand{\1}{\boldsymbol{1}} \newcommand{\a}{\boldsymbol{\mathrm{a}}} \newcommand{\b}{\boldsymbol{\mathrm{b}}} \newcommand{\c}{\boldsymbol{\mathrm{c}}} \newcommand{\d}{\boldsymbol{\mathrm{d}}} \newcommand{\e}{\boldsymbol{\mathrm{e}}} \newcommand{\f}{\boldsymbol{\mathrm{f}}} \newcommand{\g}{\boldsymbol{\mathrm{g}}} \newcommand{\h}{\boldsymbol{\mathrm{h}}} \newcommand{\i}{\boldsymbol{\mathrm{i}}} \newcommand{\j}{\boldsymbol{j}} \newcommand{\m}{\boldsymbol{\mathrm{m}}} \newcommand{\n}{\boldsymbol{\mathrm{n}}} \newcommand{\o}{\boldsymbol{\mathrm{o}}} \newcommand{\p}{\boldsymbol{\mathrm{p}}} \newcommand{\q}{\boldsymbol{\mathrm{q}}} \newcommand{\r}{\boldsymbol{\mathrm{r}}} \newcommand{\u}{\boldsymbol{\mathrm{u}}} \newcommand{\v}{\boldsymbol{\mathrm{v}}} \newcommand{\w}{\boldsymbol{\mathrm{w}}} \newcommand{\x}{\boldsymbol{\mathrm{x}}} \newcommand{\y}{\boldsymbol{\mathrm{y}}} \newcommand{\z}{\boldsymbol{\mathrm{z}}}\\\newcommand{\A}{\boldsymbol{\mathrm{A}}} \newcommand{\B}{\boldsymbol{\mathrm{B}}} \newcommand{\C}{\boldsymbol{\mathrm{C}}} \newcommand{\D}{\boldsymbol{\mathrm{D}}} \newcommand{\H}{\boldsymbol{\mathrm{H}}} \newcommand{\I}{\boldsymbol{\mathrm{I}}} \newcommand{\K}{\boldsymbol{\mathrm{K}}} \newcommand{\M}{\boldsymbol{\mathrm{M}}} \newcommand{\N}{\boldsymbol{\mathrm{N}}} \newcommand{\P}{\boldsymbol{\mathrm{P}}} \newcommand{\Q}{\boldsymbol{\mathrm{Q}}} \newcommand{\S}{\boldsymbol{\mathrm{S}}} \newcommand{\U}{\boldsymbol{\mathrm{U}}} \newcommand{\W}{\boldsymbol{\mathrm{W}}} \newcommand{\X}{\boldsymbol{\mathrm{X}}} \newcommand{\Y}{\boldsymbol{\mathrm{Y}}} \newcommand{\Z}{\boldsymbol{\mathrm{Z}}}\\\newcommand{\R}{\mathbb{R}}\\\newcommand{\cE}{\mathcal{E}} \newcommand{\cX}{\mathcal{X}} \newcommand{\cY}{\mathcal{Y}}\\\newcommand{\ld}{\lambda} \newcommand{\Ld}{\boldsymbol{\mathrm{\Lambda}}} \newcommand{\sg}{\sigma} \newcommand{\Sg}{\boldsymbol{\mathrm{\Sigma}}} \newcommand{\th}{\theta} \newcommand{\ve}{\varepsilon}\\\newcommand{\mmu}{\boldsymbol{\mu}} \newcommand{\ppi}{\boldsymbol{\pi}} \newcommand{\CC}{\mathcal{C}} \newcommand{\TT}{\mathcal{T}}\\ \newcommand{\bb}{\begin{bmatrix}} \newcommand{\eb}{\end{bmatrix}} \newcommand{\bp}{\begin{pmatrix}} \newcommand{\ep}{\end{pmatrix}} \newcommand{\bv}{\begin{vmatrix}} \newcommand{\ev}{\end{vmatrix}}\\\newcommand{\im}{^{-1}} \newcommand{\pr}{^{\prime}} \newcommand{\ppr}{^{\prime\prime}}\end{aligned}\end{align} \]

Chapter 24 Logistic Regression

Given a set of predictor attributes or independent variables \(X_1,X_2,\cds,X_d\), and given a categorical response or dependent variable \(Y\), the aim of logistic regression is to predict the probability of the response variable values based on the independent variables. Logistic regression is in fact a classification technique, that given a point \(\x_i\in\R^d\) predicts \(P(c_i|\x_j)\) for each class \(c_i\) in the domain of \(Y\) (the set of possible classes or values for the response variable).

24.1 Binary Logistic Regression

In logistic regression, we are given a set of \(d\) predictor or independent variables \(X_1,X_2,\cds,X_d\), and a binary or Bernoulli response variable \(Y\) that takes on only two values, namely, 0 and 1. Thus, we are given a training dataset \(\D\) comprising \(n\) points \(\x_i\in\R^d\) and the corresponding observed values \(y_i\in\{0,1\}\). We augment the data matrix \(\D\) by adding a new attribute \(X_0\) that is always fixed at the value 1 for each point, so that \(\td{\x_i}=(1,x_1,x_2,\cds,x_d)^T\in\R^{d+1}\) denotes the augmented point, and the multivariate random vector \(\td\X\), comprising all the independent attributes is given as \(\td\X=(X_0,X_1,\cds,X_d)^T\). The augmented training dataset is given as \(\td\D\) comprising the \(n\) augmented points \(\td{\x_i}\) along with the class labels \(y_i\) for \(i=1,2,\cds,n\).

Since there are only two outcomes for the response variable \(Y\), its probability mass function for \(\td\X=\td\x\) is given as:

\[P(Y=1|\td\X=\td\x)=\pi(\td\x)\quad\quad P(Y=0|\td\X=\td\x)=1-\pi(\td\x)\]

where \(\pi(\td\x)\) is the unknown true parameter value, denoting the probability of \(Y=1\) given \(\td\X=\td\x\).

\[ \begin{align}\begin{aligned}E[Y|\td\X=\td\x]&=1\cd P(Y=1|\td\X=\td\x)+0\cd P(Y=0|\td\X=\td\x)\\&=P(Y=1|\td\X=\td\x)=\pi(\td\x)\end{aligned}\end{align} \]

Therefore, in logistic regression, instead of directly predicting the response value, the goal is to learn the probability, \(P(Y=1|\td\X=\td\x)\), which is also the expected value of \(Y\) given \(\td\X=\td\x\).

Since \(P(Y=1|\td\X=\td\x)\) is a probability, it is not appropriate to directly use the linear regression model

\[f(\td\x)=\omega_0\cd x_0+\omega_1\cd x_1+\omega_2\cd x_2+\cds+\omega_d\cd x_d=\td{\bs\omega}^T\td\x\]

where \(\td{\bs\omega}=(\omega_0,\omega_1,\cds,\omega_d)^T\in\R^{d+1}\) is the true augmented weight vector, with \(\omega_0=\beta\) the true unknown bias term, and \(\omega_i\) the true unknown regression coefficient or weight for attribute \(X_i\). The reason we cannot simply use \(P(Y=1|\td\X=\td\x)=f(\td\x)\) is due to the fact that \(f(\td\x)\) can be arbitrarily large or arbitrarily small, whereas for logistic regression, we require that the output represents a probability value, and thus we need a model that results in an output that lies in the interval \([0,1]\). The name “logistic regression” comes from the logstic function (also called the sigmoid function) that meets this requirement.

Note

\(\dp\th(z)=\frac{1}{1+\exp\{-z\}}=\frac{\exp\{z\}}{1+\exp\{z\}}\)

The logstic function “squashes” the output to be between 0 and 1 for any scalar input \(z\).

\[1-\th(z)=1-\frac{\exp\{z\}}{1+\exp\{z\}}=\frac{1+\exp\{z\}-\exp\{z\}}{1+\exp\{z\}}=\frac{1}{1+\exp\{z\}}=\th(-z)\]

Using the logistic function, we define the logistic regression model as follows:

\[P(Y=1|\td\X=\td\x)=\pi(\td\x)=\th(f(\td\x))=\th(\td{\bs\omega}^T\td\x)- \frac{\exp\{\td{\bs\omega}^T\td\x\}}{1+\exp\{\td{\bs\omega}^T\td\x\}}\]

On the other hand, the probability for \(Y=0\) is given as

\[P(Y=0|\td\X=\td\x)=1-P(Y=1|\td\X=\td\x)=\th(-\td{\bs\omega}^T\td\x)=\frac{1}{1+\exp\{\td{\bs\omega}^T\td\x\}}\]

Combining these two cases the full logistic regression model is given as

Note

\(P(Y|\td\X=\td\x)=\th(\td{\bs\omega}^T\td\x)^Y\cd\th(-\td{\bs\omega}^T\td\x)^{1-Y}\)

Log-Odds Ratio

Define the odds ratio for the occurence of \(Y=1\) as follows:

\[ \begin{align}\begin{aligned}\rm{odds}(Y=1|\td\X=\td\x)&=\frac{P(Y=1|\td\X=\td\x)}{P(Y=0|\td\X=\td\x)}= \frac{\th(\td{\bs\omega}^T\td\x)}{\th(-\td{\bs\omega}^T\td\x)}\\&=\frac{\exp\{\td{\bs\omega}^T\td\x\}}{1+\exp\{\td{\bs\omega}^T\td\x\}}\cd(1+\exp\{\td{\bs\omega}^T\td\x\})\\&=\exp\{\td{\bs\omega}^T\td\x\}\end{aligned}\end{align} \]

The logarithm of the odds ratio, called the log-odds ratio, is therefore given as:

\[ \begin{align}\begin{aligned}\ln(\rm{odds}(Y=1|\td\X=\td\x))&=\ln\bigg(\frac{P(Y=1|\td\X=\td\x)} {1-P(Y=1|\td\X=\td\x)}\bigg)=\ln(\exp\{\td{\bs\omega}^T\td\x\})= \td{\bs\omega}^T\td\x\\&=\omega_0\cd x_0+\omega_1\cd x_1+\cds+\omega_d\cd x_d\end{aligned}\end{align} \]

The log-odds ratio function is also called the logit function, defined as

\[\rm{logit}(z)=\ln\bigg(\frac{z}{1-z}\bigg)\]

It is the inverse of the logistic function. We can see that

\[\ln(\rm{odds}(Y=1|\td\X=\td\x))=\rm{logit}(P(Y=1|\td\X=\td\x))\]

The logistic regression model is therefore based on the assumption that the log- odds ratio for \(Y=1\) given \(\td\X=\td\x\) is a linear function (or a weighted sum) of the independent attributes. Let us consider the effect of attribute \(X_i\) by fixing the values for all other attributes; we get

\[ \begin{align}\begin{aligned}&\quad\ \ \ln(\rm{odds}(Y=1|\td\X=\td\x))=\omega_i\cd x_i+C\\&\Rightarrow\rm{odds}(Y=1|\td\X=\td\x)=\exp\{\omega_i\cd x_i+C\}= \exp\{\omega_i\cd x_i\}\cd\exp\{C\}\propto\exp\{\omega_i\cd x_i\}\end{aligned}\end{align} \]

where \(C\) is a constant comprising the fixed attributes. The regression coefficient \(\omega_i\) can therefore be interpreted as the change in the log-odds ratio for \(Y=1\) for a unit change in \(X_i\), or equivalently the odds ratio for \(Y=1\) increases exponentially per unit change in \(X_i\).

24.1.1 Maximum Likelihood Estimation

Let \(\td\D\) be the augmented training dataset comprising the \(n\) augmented points \(\td{\x_i}\) along with their lables \(y_i\). Let \(\td\w=(w_0,w_1,\cds,w_d)^T\) be the augmented weight vector for estimating \(\td\w\). Note that \(w_0=b\) denotes the estimated bias term, and \(w_i\) the estimated weight for attribute \(X_i\). Likelihood is defined as the probability of the obaserved data given the estimated parameters \(\td\w\). We assume that the binary response variables \(y_i\) are all independent. Threfore, the likelihood of the observed responses is given as

\[L(\td\w)=P(Y|\td\w)=\prod_{i=1}^nP(y_i|\td{\x_i})=\prod_{i=1}^n \th(\td\w^T\td{\x_i})^{y_i}\cd\th(-\td\w^T\td{\x_i})^{1-y_i}\]

Instead of trying to maximize the likelihood, we can maximize the logarithm of the likelihood, called log-likelihood, to convert the product into a summation as follows:

Note

\(\dp\ln(L(\td\w))=\sum_{i=1}^ny_i\cd\ln(\th(\td\w^T\td{\x_i}))+(1-y_i)\cd\ln(\th(-\td\w^T\td{\x_i}))\)

The negative of the log-likelihood can also be considered as an error function, the cross-entropy error function, given as follows:

Note

\(\dp E(\td\w)=-\ln(L(\td\w))=\sum_{i=1}^ny_i\cd\ln\bigg(\frac{1}{\th(\td\w^T\td{\x_i})}\bigg)\) \(\dp(1-y_i)\cd\ln\bigg(\frac{1}{1-\th(\td\w^T\td{\x_i})}\bigg)\)

The task of maximizing the log-likelihood is therefore equivalent to minimizing the cross-entropy error.

We use an iterative gradient ascent method to compute the optimal value. It can be obtained by taking its partial derivative with respect to \(\td\w\).

\[\nabla(\td\w)=\frac{\pd}{\pd\td\w}\{\ln(L(\td\w))\}=\frac{\pd}{\pd\td\w} \bigg\{\sum_{i=1}^ny_i\cd\ln(\th(z_i))+(1-y_i)\cd\ln(\th(-z-i))\bigg\}\]

where \(z_i=\td\w^T\td{\x_i}\). We use the chain rule to obtain the derivative of \(\ln(\th(z_i))\) with respect to \(\td\w\).

\[ \begin{align}\begin{aligned}\frac{\pd}{\pd\th(z_i)}\{\ln(\th(z_i))\}&=\frac{1}{\th(z_i)}\\\frac{\pd}{\pd\th(z_i)}\{\ln(\th(-z_i))\}&=\frac{\pd}{\pd\th(z_i)}\{\ln(1-\th(z_i))\}=\frac{-1}{1-\th(z_i)}\\\frac{\pd\th(z_i)}{\pd z_i}&=\th(z_i)\cd(1-\th(z_i))=\th(z_i)\cd\th(-z_i)\\\frac{\pd z_i}{\pd\td\w}&=\frac{\pd\td\w^T\td{\x_i}}{\pd\td\w}=\td{\x_i}\end{aligned}\end{align} \]

As per the chain rule, we have

\[ \begin{align}\begin{aligned}\frac{\ln(\th(z_i))}{\pd\td\w}&=\frac{\pd\ln(\th(z_i))}{\pd\th(z_i)}\cd \frac{\pd\th(z_i)}{\pd(z_i)}\cd\frac{\pd z_i}{\pd\td\w}\\&=\frac{1}{1-\th(z_i)}\cd(\th(z_i)\cd\th(z_i))\cd\td{\x_i}=\th(-z_i)\cd\td{\x_i}\end{aligned}\end{align} \]

\[ \begin{align}\begin{aligned}\frac{\ln(\th(-z_i))}{\pd\td\w}&=\frac{\pd\ln(\th(-z_i))}{\pd\th(z_i)}\cd \frac{\pd\th(z_i)}{\pd(z_i)}\cd\frac{\pd z_i}{\pd\td\w}\\&=\frac{-1}{1-\th(z_i)}\cd(\th(z_i)\cd(1-\th(z_i)))\cd\td{\x_i}=-\th(z_i)\cd\td{\x_i}\end{aligned}\end{align} \]

Substituting the above equations, we get

\[ \begin{align}\begin{aligned}\nabla(\td\w)&=\sum_{i=1}^ny_i\cd\th(-z_i)\cd\td{\x_i}-(1-y_i)\cd\th(z_i)\cd\td{\x_i}\\&=\sum_{i=1}^ny_i\cd(\th(-z_i)+\th(z_i))\cd\td{\x_i}-\th(z_i)\cd\td{\x_i}\\&=\sum_{i=1}^n(y_i-\th(z_i))\cd\td{\x_i}\\&=\sum_{i=1}^n(y_i-\th(\td\w^T\td{\x_i}))\cd\td{\x_i}\end{aligned}\end{align} \]

Given the current estimate \(\td\w^t\), we can obtain the next estimate as follows:

Note

\(\td\w^{t+1}=\td\w^t+\eta\cd\nabla(\td\w^t)\)

Here, \(\eta>0\) is a user-specified parameter called the learning rate. At the optimal value of \(\td\w\), the gradient will be zero, \(\nabla(\td\w)=\0\), as desired.

Stochastic Gradient Ascent

The gradient ascent method computes the gradient by considering all the data points, and it is therefore called batch gradient ascent. For large datasets, it is typically much faster to compute the gradient by considering only one (randomly chosen) point at a time. The weight vector is updated after each such partial gradient step, giving rise to stochastic gradient ascent (SGA) for computing the optimal weight vector \(\td\w\).

Given a randomly chosen point \(\td{\x_i}\), the point-specific gradient is given as

Note

\(\nabla(\td\w,\td{\x_i})=(y_i-\th(\td\w^T\td{\x_i}))\cd\td{\x_i}\)

Once the model has been trained, we can predict the response for any new augmented test point \(\td\z\) as follows:

Note

\(\dp\hat{y}=\left\{\begin{array}{lr}1\quad\rm{if\ }\th(\td\w^T\z)\geq 0.5\\0\quad\rm{if\ }\th(\td\w^T\z)<0.5\end{array}\right.\)

24.2 Multiclass Logistic Regression

We model \(Y\) as a \(K\)-dimensional multivariate Bernoulli random variable. Since \(Y\) can assume only one of the \(K\) values, we use the one-hot encoding approach to map each categorical value \(c_i\) to the \(K\)-dimensional binary vector

\[\e_i=(0,\cds,0,1,0,\cds,0)^T\]

whose \(i\)th element \(e_{ii}=1\), and all other elements \(e_{ij}=0\), so that \(\sum_{j=1}^Ke_{ij}=1\). Henceforth, we assume that the categorical response variable \(Y\) is a multivariate Bernoulli variable \(\Y\in\{\e_1,\e_2,\cds,\e_K\}\), with \(Y_j\) referring to the \(j\)th component of \(\Y\).

The probability mass function for \(\Y\) given \(\td\X=\td\x\) is

\[P(\Y=\e_i|\td\X=\td\x)=\pi_i(\td\x),\ \rm{for}\ i=1,2,\cds,K\]

Thus, there are \(K\) unknown parameters, which must satisfy the following constraint:

\[\sum_{i=1}^K\pi_i(\td\x)=\sum_{i=1}^KP(\Y=\e_i|\td\X=\td\x)=1\]

Given that only one element of \(\Y\) is 1, the probability mass function of \(\Y\) can be written compactly as

Note

\(\dp P(\Y|\td\X=\td\x)=\prod_{j=1}^K(\pi_j(\td\x))^{Y_j}\)

The log-odds ratio of class \(c_i\) with respect to class \(c_K\) is assumed to satisfy

\[ \begin{align}\begin{aligned}\ln(\rm{odds}(\Y=\e_i|\td\X=\td\x))&=\ln\bigg(\frac{P(\Y=\e_i|\td\X=\td\x)} {P(\Y=\e_K|\td\X=\td\x)}\bigg)=\ln\bigg(\frac{\pi_i(\td\x)}{\pi_K(\td\x)} \bigg)=\td{\bs\omega_i}^T\td\x\\&=\omega_{i0}\cd x_0+\omega_{i1}\cd x_1+\cds+\omega_{id}\cd x_d\end{aligned}\end{align} \]

where \(\omega_{i0}=\beta_i\) is the true bias value for class \(c_i\).

\[ \begin{align}\begin{aligned}&\quad\ \frac{\pi_i(\td\x)}{\pi_K(\td\x)}=\exp\{\td{\bs\omega_i}^T\td\x\}\\&\Rightarrow\pi_i(\td\x)=\exp\{\td{\bs\omega_i}^T\td\x\}\cd\pi_K(\td\x)\end{aligned}\end{align} \]

\[ \begin{align}\begin{aligned}&\quad \ \sum_{j=1}^K\pi_j(\td\x)=1\\&\Rightarrow\bigg(\sum_{j\neq K}\exp\{\td{\bs\omega_j}^T\td\x\}\cd\pi_K(\td\x)\bigg)+\pi_K(\td\x)=1\\&\Rightarrow\pi_K(\td\x)=\frac{1}{1+\sum_{j\neq K}\exp\{\td{\bs\omega_j}^T\td\x\}}\end{aligned}\end{align} \]

\[\pi_i(\td\x)=\exp\{\td{\bs\omega_i}^T\td\x\}\cd\pi_K(\td\x)=\frac{\exp\{ \td{\bs\omega_i}^T\td\x\}}{1+\sum_{j\neq K}\exp\{\td{\bs\omega_j}^T\td\x\}}\]

Finally, setting \(\td{\bs\omega_K}=\0\), we have \(\exp\{\td{\bs\omega_K}^T\td\x\}=1\) and thus we can write the full model for multiclass logistic regression as follows:

Note

\(\dp\pi_i(\td\x)=\frac{\exp\{\td{\bs\omega_i}^T\td\x\}}{\sum_{j=1}^K\exp\{\td{\bs\omega_j}^T\td\x\}}\) \(\ \rm{for\ all}\ i=1,2,\cds,K\)

This function is also called the softmax function. When \(K=2\), this formulation yields exactly the same model as in binary logistic regression.

\[ \begin{align}\begin{aligned}\ln\bigg(\frac{\pi_i(\td\x)}{\pi_j(\td\x)}\bigg)&=\ln\bigg(\frac{\pi_i (\td\x)}{\pi_K(\td\x)}\cd\frac{\pi_K(\td\x)}{\pi_j(\td\x)}\bigg)\\&=\ln\bigg(\frac{\pi_i(\td\x)}{\pi_K(\td\x)}\bigg)+ \ln\bigg(\frac{\pi_K(\td\x)}{\pi_j(\td\x)}\bigg)\\&=ln\bigg(\frac{\pi_i(\td\x)}{\pi_K(\td\x)}\bigg)- \ln\bigg(\frac{\pi_j(\td\x)}{\pi_K(\td\x)}\bigg)\\&=\td{\bs\omega_i}^T\td\x-\td{\bs\omega_j}^T\td\x\\&=(\td{\bs\omega_i}-\td{\bs\omega_j})^T\td\x\end{aligned}\end{align} \]

That is, the log-odds ratio between any two classes can be computed from the difference of the corresponding weight vectors.

24.2.1 Maximum Likelihood Estimation

Let \(\td\D\) be the augmented dataset comprising \(n\) points \(\td{\x_i}\) and their labels \(\y_i\). We assume that \(\y_i\) is a one-hot encoded (multivariate Bernoulli) response vector, so that \(y_{ij}\) denotes the \(j\)th element of \(\y_i\). Let \(\td\w_i\in\R^{d+1}\) denote the estimated augmented weight vector for class \(c_i\), with \(w_{i0}=b_i\) denoting the bias term.

To find the \(K\) sets of regression weight vectors \(\td{\w_i}\), for \(i=1,2,\cds,K\), we use the gradient ascent approach to maximize the log-likelihood function. The likelihood of the data is given as

\[L(\td\W)=P(\Y|\td\W)=\prod_{i=1}^nP(\y_i|\td\X=\td{\x_i})=\prod_{i=1}^n\prod_{j=1}^K(\pi_j(\td{\x_j}))^{y_{ij}}\]

where \(\td\W=\{\td{\w_1},\td{\w_2},\cds,\td{\w_K}\}\) is the set of \(K\) weight vectors. The log-likelihood is then given as:

Note

\(\dp\ln(L(\td\W))=\sum_{i=1}^n\sum_{j=1}^Ky_{ij}\cd\ln(\pi_j(\td{\x_i}))=\sum_{i=1}^n\sum_{j=1}^Ky_{ij}\cd\) \(\dp\ln\bigg(\frac{\exp\{\td{\w_j}^T\td{\x_i}\}}{\sum_{a=1}^K\exp\{\td{\w_a}^T\td{\x_i}\}}\bigg)\)

Note that the negative of the log-likelihood function can be regarded as an error function, commonly known as cross-entropy error

\[ \begin{align}\begin{aligned}\frac{\pd}{\pd\pi_j(\td{\x_i})}\ln(\pi_j(\td{\x_i}))=\frac{1}{\pi_j(\td\x_i)}\\\begin{split}\frac{\pd}{\pd\td{\w_a}}\pi_j(\td{\x_i})=\left\{\begin{array}{lr} \pi_a(\td{\x_i})\cd(1-\pi_a(\td{\x_i}))\cd\td{\x_i}\quad\rm{if}\ j=a\\ -\pi_a(\td{\x_i})\cd\pi_j(\td{\x_i})\cd\td{\x_i}\quad\quad\quad\rm{if} \ j\neq a\end{array}\right.\end{split}\end{aligned}\end{align} \]

\[ \begin{align}\begin{aligned}\nabla(\td{\w_a})&=\frac{\pd}{\pd\td{\w_a}}\{\ln(L(\td\W))\}\\&=\sum_{i=1}^n\sum_{j=1}^Ky_{ij}\cd\frac{\pd\ln(\pi_j(\td{\x_i}))} {\pd\pi_j(\td{\x_i})}\cd\frac{\pd\pi_j(\td{\x_i})}{\pd\td{\w_a}}\\&=\sum_{i=1}^n\bigg(y_{ia}\cd\frac{\pi_a(\td{\x_i})}{\pi_a(\td{\x_i})}\cd (1-\pi_a(\td{\x_i}))\cd\td{\x_i}+\sum_{j\neq a}y_{ij}\cd\frac{(-\pi_a (\td{\x_i})\cd\pi_j(\td{\x_i}))}{\pi_j(\td{\x_i})}\cd\td{\x_i}\bigg)\\&=\sum_{i=1}^n\bigg(y_{ia}-y_{ia}\cd\pi_a(\td{\x_i})-\sum_{j\neq a}y_{ij}\cd\pi_a(\td{\x_i})\bigg)\cd\td{\x_i}\\&=\sum_{i=1}^n\bigg(y_{ia}-\sum_{j=1}^Ky_{ij}\cd\pi_a(\td{\x_i})\bigg)\cd\td{\x_i}\\&=\sum_{i=1}^n(y_{ia}-\pi_a(\td{\x_i}))\cd\td{\x_i}\end{aligned}\end{align} \]

For stochastic gradient ascent, we update the weight vectors by considering only one point at a time. The gradient of the log-likelihood function with respect to \(\td{\w_j}\) at a given point \(\td{\x_i}\) is given as

Note

\(\nabla(\td{\w_j},\td{\x_i})=(y_{ij}-\pi_j(\td{\x_i}))\cd\td{\x_i}\)

which results in the following update rule for the \(j\)th weight vector:

Note

\(\td{\w_j}^{t+1}=\td{\w_j}^t+\eta\cd\nabla(\td{\w_j}^t,\td{\x_i})\)

Once the model has been trained, we can predict the class for any new augmented test point \(\td\z\) as follows:

Note

\(\dp\hat{y}=\arg\max_{c_i}\{\pi_i(\td\z)\}=\arg\max_{c_i}\) \(\dp\bigg\{\frac{\exp\{\td{\w_i}^T\td\z\}}{\sum_{j=1}^K\exp\{\td{\w_j}^T\td\z\}}\bigg\}\)