\[ \begin{align}\begin{aligned}\newcommand{\bs}{\boldsymbol} \newcommand{\dp}{\displaystyle} \newcommand{\rm}{\mathrm} \newcommand{\cl}{\mathcal} \newcommand{\pd}{\partial}\\\newcommand{\cd}{\cdot} \newcommand{\cds}{\cdots} \newcommand{\dds}{\ddots} \newcommand{\lag}{\langle} \newcommand{\lv}{\lVert} \newcommand{\ol}{\overline} \newcommand{\od}{\odot} \newcommand{\ra}{\rightarrow} \newcommand{\rag}{\rangle} \newcommand{\rv}{\rVert} \newcommand{\seq}{\subseteq} \newcommand{\td}{\tilde} \newcommand{\vds}{\vdots} \newcommand{\wh}{\widehat}\\\newcommand{\0}{\boldsymbol{0}} \newcommand{\1}{\boldsymbol{1}} \newcommand{\a}{\boldsymbol{\mathrm{a}}} \newcommand{\b}{\boldsymbol{\mathrm{b}}} \newcommand{\c}{\boldsymbol{\mathrm{c}}} \newcommand{\d}{\boldsymbol{\mathrm{d}}} \newcommand{\e}{\boldsymbol{\mathrm{e}}} \newcommand{\f}{\boldsymbol{\mathrm{f}}} \newcommand{\g}{\boldsymbol{\mathrm{g}}} \newcommand{\h}{\boldsymbol{\mathrm{h}}} \newcommand{\i}{\boldsymbol{\mathrm{i}}} \newcommand{\j}{\boldsymbol{j}} \newcommand{\m}{\boldsymbol{\mathrm{m}}} \newcommand{\n}{\boldsymbol{\mathrm{n}}} \newcommand{\o}{\boldsymbol{\mathrm{o}}} \newcommand{\p}{\boldsymbol{\mathrm{p}}} \newcommand{\q}{\boldsymbol{\mathrm{q}}} \newcommand{\r}{\boldsymbol{\mathrm{r}}} \newcommand{\u}{\boldsymbol{\mathrm{u}}} \newcommand{\v}{\boldsymbol{\mathrm{v}}} \newcommand{\w}{\boldsymbol{\mathrm{w}}} \newcommand{\x}{\boldsymbol{\mathrm{x}}} \newcommand{\y}{\boldsymbol{\mathrm{y}}} \newcommand{\z}{\boldsymbol{\mathrm{z}}}\\\newcommand{\A}{\boldsymbol{\mathrm{A}}} \newcommand{\B}{\boldsymbol{\mathrm{B}}} \newcommand{\C}{\boldsymbol{\mathrm{C}}} \newcommand{\D}{\boldsymbol{\mathrm{D}}} \newcommand{\H}{\boldsymbol{\mathrm{H}}} \newcommand{\I}{\boldsymbol{\mathrm{I}}} \newcommand{\K}{\boldsymbol{\mathrm{K}}} \newcommand{\M}{\boldsymbol{\mathrm{M}}} \newcommand{\N}{\boldsymbol{\mathrm{N}}} \newcommand{\P}{\boldsymbol{\mathrm{P}}} \newcommand{\Q}{\boldsymbol{\mathrm{Q}}} \newcommand{\S}{\boldsymbol{\mathrm{S}}} \newcommand{\U}{\boldsymbol{\mathrm{U}}} \newcommand{\W}{\boldsymbol{\mathrm{W}}} \newcommand{\X}{\boldsymbol{\mathrm{X}}} \newcommand{\Y}{\boldsymbol{\mathrm{Y}}} \newcommand{\Z}{\boldsymbol{\mathrm{Z}}}\\\newcommand{\R}{\mathbb{R}}\\\newcommand{\cE}{\mathcal{E}} \newcommand{\cX}{\mathcal{X}} \newcommand{\cY}{\mathcal{Y}}\\\newcommand{\ld}{\lambda} \newcommand{\Ld}{\boldsymbol{\mathrm{\Lambda}}} \newcommand{\sg}{\sigma} \newcommand{\Sg}{\boldsymbol{\mathrm{\Sigma}}} \newcommand{\th}{\theta} \newcommand{\ve}{\varepsilon}\\\newcommand{\mmu}{\boldsymbol{\mu}} \newcommand{\ppi}{\boldsymbol{\pi}} \newcommand{\CC}{\mathcal{C}} \newcommand{\TT}{\mathcal{T}}\\ \newcommand{\bb}{\begin{bmatrix}} \newcommand{\eb}{\end{bmatrix}} \newcommand{\bp}{\begin{pmatrix}} \newcommand{\ep}{\end{pmatrix}} \newcommand{\bv}{\begin{vmatrix}} \newcommand{\ev}{\end{vmatrix}}\\\newcommand{\im}{^{-1}} \newcommand{\pr}{^{\prime}} \newcommand{\ppr}{^{\prime\prime}}\end{aligned}\end{align} \]

Chapter 20 Linear Discriminant Analysis

Given labeled data consisting of \(d\)-dimensional points \(\x_i\) along with their classes \(y_i\), the goal of linear discriminant analysis (LDA) is to find a vector \(\w\) that maximizes the separation between the classes after projection onto \(\w\).

20.1 Optimal Linear Discriminant

Let us assume that the dataset \(\D\) consists of \(n\) points \(\x_i\in\R^d\), with the corresponding class label \(y_i\in\{c_1,c_2,\cds,c_k\}\). Let \(\D_i\) denote the subset of points labeled with class \(c_i\), i.e., \(\D_i=\{\x_j^T|y_j=c_i\}\), and let \(|\D_i|=n_i\) denote the number of points with class \(c_i\). We assume that there are only \(k=2\) classes. Thus, the dataset \(\D\) can be partitioned into \(\D_1\) and \(\D_2\).

Let \(\w\) be a unit vector, that is, \(\w^T\w=1\). The projection of any \(d\)-dimensional point \(\x_i\) onto the vector \(\w\) is given as

\[\x_i\pr=\bigg(\frac{\w^T\x_i}{\w^T\w}\bigg)\w=(\w^T\x_i)\w=a_i\w\]

where \(a_i\) is the offset or scalar projection of \(\x_i\) on the line \(\w\):

\[a_i=\w^T\x_i\]

We also call \(a_i\) a projected point. Thus the set of \(n\) projected points \(\{a_1,a_2,\cds,a_n\}\) represents a mapping from \(\R^d\) to \(\R\), that is, from the original \(d\)-dimensional space to a 1-dimensional space of offsets along \(\w\).

Each projected point \(a_i\) has associated with it the original class label \(y_i\), and thus we can compute, for each of the two classes, the mean of the projected points ,called the projected mean, as follows:

\[ \begin{align}\begin{aligned}m_1&=\frac{1}{n_1}\sum_{\x_i\in\D_1}a_i\\&=\frac{1}{n_1}\sum_{\x_i\in\D_1}\w^T\x_i\\&=\w^T\bigg(\frac{1}{n_1}\sum_{\x_i\in\D_1}\x_i\bigg)\\&=\w^T\mmu_1\end{aligned}\end{align} \]

where \(\mmu_1\) is the mean of all point in \(\D_1\). Likewise, we can obtain

\[m_2=\w^T\mmu_2\]

To maximize the separation between the classes, it seems reasonable to maximize the difference between the projected means, \(|m_1-m_2|\). For good separation, the variance of the projected points for each class should also not be too large. LDA maximizes the separation by ensuring that the scatter \(s_i^2\) for the projected points within each class is small

\[s_i^2=\sum_{\x_j\in\D_i}(a_j-m_i)^2=n_i\sg_i^2\]

We can incorporate the two LDA criteria, namely, maximizing the distance between projected means and minimizing the sum of projected scatter, into a single maximization criterion called the Fisher LDA objective:

Note

\(\dp\max_\w J(\w)\frac{(m_1-m_2)^2}{s_1^2+s_2^2}\)

The vector \(\w\) is also called the optimal linear discriminant (LD).

We can rewrite \((m_1-m_2)^2\) as follows:

\[ \begin{align}\begin{aligned}(m_1-m_2)^2&=(\w^T(\mmu_1-\mmu_2))^2\\&=\w^T((\mmu_1-\mmu_2)(\mmu_1-\mmu_2)^T)\w\\&=\w^T\B\w\end{aligned}\end{align} \]

where \(\B=(\mmu_1-\mmu_2)(\mmu_1-\mmu_2)^T\) is a \(d\times d\) rank-one matrix called the between-class scatter matrix.

As for the projected scatter for class \(c_1\), we can compute it as follows:

\[ \begin{align}\begin{aligned}s_1^2&=\sum_{\x_i\in\D_1}(a_i-m_1)^2\\&=\sum_{\x_i\in\D_1}(\w^T\x_i-\w^T\mmu_1)^2\\&=\sum_{\x_i\in\D_1}(\w^T(\x_i-\mmu_1))^2\\&=\w^T\bigg(\sum_{\x_i\in\D_1}(\x_i-\mmu_1)(\x_i-\mmu_1)^T\bigg)\w\\&=\w^T\S_1\w\end{aligned}\end{align} \]

where \(\S_1\) is the scatter matrix for \(\D_1\). Likewise, we can obtain

\[s_2^2=\w^T\S_2\w\]

Notice again that the scatter matrix is essentially the same as the covariance matrix, but instead of recording the average deviation from the mean, it records the total deviation, that is,

\[\S_i=n_i\Sg_i\]

\[s_1^2+s_2^2=\w^T\S_1\w+\w^T\S_2\w=\w^T(\S_1+\S_2)\w=\w^T\S\w\]

where \(\S=\S_1+\S_2\) denote the within-class scatter matrix for the pooled data.

Note

\(\dp\max_\w J(\w)=\frac{\w^T\B\w}{\w^T\S\w}\)

\[\frac{d}{d\w}J(\w)=\frac{2\B\w(\w^T\S\w)-2\B\w(\w^T\B\w)}{(\w^T\S\w)^2}=\0\]

\[ \begin{align}\begin{aligned}\B\w(\w^T\S\w)&=\S\w(\w^T\B\w)\\\B\w&=\S\w\bigg(\frac{\w^T\B\w}{\w^T\S\w}\bigg)\\\B\w&=J(\w)\S\w\\\B\w&=\ld\S\w\end{aligned}\end{align} \]

where \(\ld=J(\w)\). If \(\S\) is nonsingular, that is, if \(\S\im\) exists

\[\S\im\B\w=\ld\S\im\S\w\]

Note

\((\S\im\B)\w=\ld\w\)

Thus, if \(\S\im\) exists, then \(\ld=J(\w)\) is an eigenvalue, and \(\w\) is an eigenvector of the matrix \(\S\im\B\). To maximize \(J(\w)\) we look for the largest eigenvalue \(\ld\), and the coresponding dominant eigenvector \(\w\) specifies the best linear discriminant vector.

The total time complexity is \(O(d^3+nd^2)\).

For the two class scenario, if \(\S\) is nonsingular, we can directly solve for \(\w\) without computing the eigenvalues and eigenvectors.

\[ \begin{align}\begin{aligned}\B\w&=((\mmu_1-\mmu_2)(\mmu_1-\mmu_2)^T)\w\\&=(\mmu_1-\mmu_2)((\mmu_1-\mmu_2)^T\w)\\&=b(\mmu_1-\mmu_2)\end{aligned}\end{align} \]

where \(n=(\mmu_1-\mmu_2)^T\w\) is just a scalar multiplier.

\[ \begin{align}\begin{aligned}\B\w&=\ld\S\w\\b(\mmu_1-\mmu_2)&=\ld\S\w\\\w&=\frac{b}{\ld}\S\im(\mmu_1-\mmu_2)\end{aligned}\end{align} \]

Because \(\frac{b}{\ld}\) is just a scalar, we can solve for the best linear discriminant as

Note

\(\w=\S\im(\mmu_1-\mmu_2)\)

20.2 Kernel Discriminant Analysis

The goal of kernel LDA is to find the direction vector \(\w\) in feature space that maximizes

\[\max_\w J(\w)=\frac{(m_1-m_2)^2}{s_1^2+s_2^2}\]

Optimal LD: Linear Combination of Feature Points

The mean for class \(c_i\) in feature space is given as

\[\mmu_i^\phi=\frac{1}{n_i}\sum_{\x_j\in\D_i}\phi(\x_j)\]

and the covariance matrix for class \(c_i\) in feature space is

\[\Sg_i^\phi=\frac{1}{n_i}\sum_{\x_j\in\D_i}(\phi(\x_j)-\mmu_i^\phi)(\phi(\x_j)-\mmu_I^\phi)^T\]

The between-class and within-class scatter matrices are defined as

\[ \begin{align}\begin{aligned}\B_\phi=(\mmu_1^\phi-\mmu_2^\phi)(\mmu_1^\phi-\mmu_2^\phi)^T=\d_\phi\d_\phi^T\\\S_\phi=n_1\Sg_1^\phi+n_2\Sg_2^\phi\end{aligned}\end{align} \]

Note

\((\S_\phi\im\B_\phi)\w=\ld\w\)

where we assume that \(\S_\phi\) is non-singular. Let \(\delta_i\) denote the \(i\)th eigenvalue and \(\u_i\) the \(i\)th eigenvector of \(\S_\phi\), for \(i=1,\cds,d\). The eigen-decomposition of \(\S_\phi\) yields \(\S_\phi=\U\Delta\U^T\), with the inverse of \(\S_\phi\) given as \(\S_\phi\im=\U\Delta\im\U^T\). Here \(\U\) is the matrix whose columns are the eigenvectors of \(\S_\phi\) and \(\Delta\) is the diagonal matrix of eigenvalues of \(\S_\phi\). The inverse \(\S_\phi\im\) can thus be expressed as the spectral sum

\[\S_\phi\im=\sum_{r=1}^d\frac{1}{\delta_r}\u_r\u_r^T\]

\[\ld\w=\bigg(\sum_{r=1}^d\frac{1}{\delta_r}\u_r\u_r^T\bigg)\d_\phi\d_\phi^T\w =\sum_{r=1}^d\frac{1}{\delta_r}(\u_r(\u_r^T\d_\phi)(\d_\phi^T\w))= \sum_{r=1}^db_r\u_r\]

where \(b_r=\frac{1}{\delta_r}(\u_r^T\d_\phi)(\d_\phi^T\w)\) is a scalar value.

\[ \begin{align}\begin{aligned}\w&=\frac{1}{\ld}\sum_{r=1}^db_r\bigg(\sum_{j=1}^nc_{rj}\phi(\x_j)\bigg)\\&=\sum_{j=1}^n\phi(\x_j)\bigg(\sum_{r=1}^d\frac{b_rc_{rj}}{\ld}\bigg)\\&=\sum_{j=1}^na_j\phi(\x_j)\end{aligned}\end{align} \]

where \(a_j=\sum_{r=1}^db_rc_{rj}/\ld\) is a scalar value for the feature point \(\phi(\x_j)\). Therefore, the direction vector \(\w\) can be expressed as a linear combination of the points in feature space.

LDA Objective via Kernel Matrix

\[ \begin{align}\begin{aligned}m_i=\w^T\mmu_i^\phi&=\bigg(\sum_{j=1}^na_j\phi(\x_j)\bigg)^T\bigg(\frac{1}{n_i}\sum_{\x_k\in\D_i}\phi(\x_k)\bigg)\\&=\frac{1}{n_i}\sum_{j=1}^n\sum_{\x_k\in\D_i}a_j\phi(\x_j)^T\phi(\x_k)\\&=\frac{1}{n_i}\sum_{j=1}^n\sum_{\x_k\in\D_i}a_jK(\x_j,\x_k)\\&=\a^T\m_i\end{aligned}\end{align} \]

where \(\a=(a_1,a_2,\cds,a_m)^T\) is the weight vector, and

\[\begin{split}\m_i=\frac{1}{n_i}\bp \sum_{\x_k\in\D_i}K(\x_1,\x_k)\\ \sum_{\x_k\in\D_i}K(\x_2,\x_k)\\\vds\\\sum_{\x_k\in\D_i}K(\x_n,\x_k)\ep= \frac{1}{n_i}\K^{c_i}\1_{n_i}\end{split}\]

\[ \begin{align}\begin{aligned}(m_1-m_2)^2&=(\w^T\mmu_1^\phi-\w^T\mmu_2^\phi)^2\\&=(\a^T\m_1-\a^T\m_2)^2\\&=\a^T(\m_1-\m_2)(\m_1-\m_2)^T\a\\&=\a^T\M\a\end{aligned}\end{align} \]

\[ \begin{align}\begin{aligned}s_1^2&=\sum_{\x_i\in\D_1}\lv\w^T\phi(\x_i)-\w^T\mmu_1^\phi\rv^2\\&=\sum_{\x_i\in\D_1}\lv\w^T\phi(\x_i)\rv^2-2\sum_{\x_i\in\D_1} \w^T\phi(\x_i)\cd\w^T\mmu_1^\phi+\sum_{\x_i\in\D_1}\lv\w^T\mmu_1^\phi\rv^2\\&=\bigg(\sum_{\x_i\in\D_1}\lv\sum_{j=1}^na_j\phi(\x_j)^T \phi(\x_i)\rv^2\bigg)-2\cd n_1\cd\lv\w^T\mmu_1^\phi\rv^2+ n_1\cd\lv\w^T\mmu_1^\phi\rv^2\\&=\bigg(\sum_{\x_i\in\D_1}\a^T\K_i\K_i^T\a\bigg)-n_1\cd\a^T\m_1\m_1^T\a\\&=\a^T\bigg(\bigg(\sum_{\x_i\in\D_1}\K_i\K_i^T\bigg)-n_1\m_1\m_1^T\bigg)\a\\&=\a^T\N_1\a\end{aligned}\end{align} \]

\[ \begin{align}\begin{aligned}\N_1&=\bigg(\sum_{\x_i\in\D_1}\K_i\K_i^T\bigg)-n_1\m_1\m_1^T\\&=(\K^{c_1})\bigg(\I_{n_1}-\frac{1}{n_1}\1_{n_1\times n_1}\bigg)(\K^{c_1})^T\end{aligned}\end{align} \]

In a similar manner we get \(s_2^2=\a^T\N_2\a\).

\[s_1^2+s_2^2=\a^T(\N_1+\N_2)\a=\a^T\N\a\]

Note

\(\dp\max_\w J(\w)=\max_\a J(\a)=\frac{\a^T\M\a}{\a^\N\a}\)

The weight vector \(\a\) is the eigenvector corresponding to the largest eigenvalue of the generalized eigenvalue problem:

\[\M\a=\ld_1\N\a\]

If \(\N\) is nonsingular, \(\a\) is the dominant eigenvetor corresponding to the largest eigenvalue for the system

\[(\N\im\M)\a=\ld_1\a\]

As in the case of linear discriminant analysis, when there are only two classes we do not have to solve for the eigenvector because \(\a\) can be obtained directly:

\[\a=\N\im(\m_1-\m_2)\]

we can ensure that \(\w\) is a unit vector if we scale \(\a\) by \(\frac{1}{\sqrt{\a^T\K\a}}\)

Note

\(\dp\w^T\phi(\x)=\sum_{j=1}^na_j\phi(\x_j)^T\phi(\x)=\sum_{j=1}^na_jK(\x_j,\x)\)

The complexity of kernel discriminant analysis is \(O(n^3\).