\[ \begin{align}\begin{aligned}\newcommand{\bs}{\boldsymbol} \newcommand{\dp}{\displaystyle} \newcommand{\rm}{\mathrm} \newcommand{\cl}{\mathcal} \newcommand{\pd}{\partial}\\\newcommand{\cd}{\cdot} \newcommand{\cds}{\cdots} \newcommand{\dds}{\ddots} \newcommand{\lag}{\langle} \newcommand{\lv}{\lVert} \newcommand{\ol}{\overline} \newcommand{\od}{\odot} \newcommand{\ra}{\rightarrow} \newcommand{\rag}{\rangle} \newcommand{\rv}{\rVert} \newcommand{\seq}{\subseteq} \newcommand{\td}{\tilde} \newcommand{\vds}{\vdots} \newcommand{\wh}{\widehat}\\\newcommand{\0}{\boldsymbol{0}} \newcommand{\1}{\boldsymbol{1}} \newcommand{\a}{\boldsymbol{\mathrm{a}}} \newcommand{\b}{\boldsymbol{\mathrm{b}}} \newcommand{\c}{\boldsymbol{\mathrm{c}}} \newcommand{\d}{\boldsymbol{\mathrm{d}}} \newcommand{\e}{\boldsymbol{\mathrm{e}}} \newcommand{\f}{\boldsymbol{\mathrm{f}}} \newcommand{\g}{\boldsymbol{\mathrm{g}}} \newcommand{\h}{\boldsymbol{\mathrm{h}}} \newcommand{\i}{\boldsymbol{\mathrm{i}}} \newcommand{\j}{\boldsymbol{j}} \newcommand{\m}{\boldsymbol{\mathrm{m}}} \newcommand{\n}{\boldsymbol{\mathrm{n}}} \newcommand{\o}{\boldsymbol{\mathrm{o}}} \newcommand{\p}{\boldsymbol{\mathrm{p}}} \newcommand{\q}{\boldsymbol{\mathrm{q}}} \newcommand{\r}{\boldsymbol{\mathrm{r}}} \newcommand{\u}{\boldsymbol{\mathrm{u}}} \newcommand{\v}{\boldsymbol{\mathrm{v}}} \newcommand{\w}{\boldsymbol{\mathrm{w}}} \newcommand{\x}{\boldsymbol{\mathrm{x}}} \newcommand{\y}{\boldsymbol{\mathrm{y}}} \newcommand{\z}{\boldsymbol{\mathrm{z}}}\\\newcommand{\A}{\boldsymbol{\mathrm{A}}} \newcommand{\B}{\boldsymbol{\mathrm{B}}} \newcommand{\C}{\boldsymbol{\mathrm{C}}} \newcommand{\D}{\boldsymbol{\mathrm{D}}} \newcommand{\H}{\boldsymbol{\mathrm{H}}} \newcommand{\I}{\boldsymbol{\mathrm{I}}} \newcommand{\K}{\boldsymbol{\mathrm{K}}} \newcommand{\M}{\boldsymbol{\mathrm{M}}} \newcommand{\N}{\boldsymbol{\mathrm{N}}} \newcommand{\P}{\boldsymbol{\mathrm{P}}} \newcommand{\Q}{\boldsymbol{\mathrm{Q}}} \newcommand{\S}{\boldsymbol{\mathrm{S}}} 
\newcommand{\U}{\boldsymbol{\mathrm{U}}} \newcommand{\W}{\boldsymbol{\mathrm{W}}} \newcommand{\X}{\boldsymbol{\mathrm{X}}} \newcommand{\Y}{\boldsymbol{\mathrm{Y}}} \newcommand{\Z}{\boldsymbol{\mathrm{Z}}}\\\newcommand{\R}{\mathbb{R}}\\\newcommand{\cE}{\mathcal{E}} \newcommand{\cX}{\mathcal{X}} \newcommand{\cY}{\mathcal{Y}}\\\newcommand{\ld}{\lambda} \newcommand{\Ld}{\boldsymbol{\mathrm{\Lambda}}} \newcommand{\sg}{\sigma} \newcommand{\Sg}{\boldsymbol{\mathrm{\Sigma}}} \newcommand{\th}{\theta} \newcommand{\ve}{\varepsilon}\\\newcommand{\mmu}{\boldsymbol{\mu}} \newcommand{\ppi}{\boldsymbol{\pi}} \newcommand{\CC}{\mathcal{C}} \newcommand{\TT}{\mathcal{T}}\\ \newcommand{\bb}{\begin{bmatrix}} \newcommand{\eb}{\end{bmatrix}} \newcommand{\bp}{\begin{pmatrix}} \newcommand{\ep}{\end{pmatrix}} \newcommand{\bv}{\begin{vmatrix}} \newcommand{\ev}{\end{vmatrix}}\\\newcommand{\im}{^{-1}} \newcommand{\pr}{^{\prime}} \newcommand{\ppr}{^{\prime\prime}}\end{aligned}\end{align} \]

Chapter 7 Dimensionality Reduction

7.1 Background

Let the data \(\D\) consist of \(n\) points over \(d\) attributes, that is, it is an \(n\times d\) matrix, given as

\[\begin{split}\D=\left(\begin{array}{c|cccc}&X_1&X_2&\cds&X_d\\ \hline \x_1&x_{11}&x_{12}&\cds&x_{1d}\\\x_2&x_{21}&x_{22}&\cds&x_{2d}\\ \vds&\vds&\vds&\dds&\vds\\\x_n&x_{n1}&x_{n2}&\cds&x_{nd}\end{array}\right)\end{split}\]

Each point \(\x_i=(x_{i1},x_{i2},\cds,x_{id})^T\) is a vector in the ambient \(d\)-dimensional vector space spanned by the \(d\) standard basis vectors \(\e_1,\e_2,\cds,\e_d\), where \(\e_i\) corresponds to the \(i\)th attribute \(X_i\).

Given any other set of \(d\) orthonormal vectors \(\u_1,\u_2,\cds,\u_d\), with \(\u_i^T\u_j=0\) for \(i\neq j\) and \(\lv\u_i\rv=1\) (or \(\u_i^T\u_i=1\)), we can re-express each point \(\x\) as the linear combination

Note

\(\x=a_1\u_1+a_2\u_2+\cds+a_d\u_d\)

where the vector \(\a=(a_1,a_2,\cds,a_d)^T\) represents the coordinates of \(\x\) in the new basis. The above linear combination can also be expressed as a matrix multiplication:

Note

\(\x=\U\a\)

where \(\U\) is the \(d\times d\) orthogonal matrix whose \(i\)th column comprises the \(i\)th basis vector \(\u_i\).

Because \(\U\) is orthogonal, we have

\[\U\im=\U^T\]

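which implies that \(\U^T\U=\I\). Multiplying \(\x=\U\a\) on the left by \(\U^T\) therefore gives the coordinates of \(\x\) in the new basis: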

\[\U^T\x=\U^T\U\a\]

Note

\(\a=\U^T\x\)
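The change of basis above can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the random orthonormal basis built by QR decomposition and the variable names are illustrative choices, not part of the text.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4

# Build an orthonormal basis U from the QR decomposition of a random matrix
# (an illustrative choice; any orthogonal matrix would do).
U, _ = np.linalg.qr(rng.standard_normal((d, d)))

x = rng.standard_normal(d)   # a point expressed in the standard basis
a = U.T @ x                  # its coordinates in the new basis: a = U^T x
x_back = U @ a               # mapping back to the standard basis: x = U a

print(np.allclose(U.T @ U, np.eye(d)))   # checks U^T U = I
print(np.allclose(x_back, x))            # checks that x is recovered from a
```

Because \(\U\) is orthogonal, the change of basis also preserves lengths, so `np.linalg.norm(a)` agrees with `np.linalg.norm(x)` up to rounding.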

Because there are infinitely many choices for the set of orthonormal basis vectors, one natural question is whether there exists an optimal basis, for a suitable notion of optimality. We are interested in finding the optimal \(r\)-dimensional representation of \(\D\) with \(r\ll d\). The projection of \(\x\) onto the first \(r\) basis vectors is given as

\[\x\pr=a_1\u_1+a_2\u_2+\cds+a_r\u_r=\sum_{i=1}^ra_i\u_i\]

which can be written in matrix notation as follows:

\[\begin{split}\x\pr=\bp|&|&&|\\\u_1&\u_2&\cds&\u_r\\|&|&&|\ep\bp a_1\\a_2\\\vds\\a_r \ep=\U_r\a_r\end{split}\]

where \(\U_r\) is the matrix comprising the first \(r\) basis vectors, and \(\a_r\) is a vector comprising the first \(r\) coordinates. Because \(\a=\U^T\x\), restricting it to the first \(r\) terms, we get

\[\a_r=\U_r^T\x\]

The projection of \(\x\) onto the first \(r\) basis vectors can be compactly written as

Note

\(\x\pr=\U_r\U_r^T\x=\P_r\x\)

where \(\P_r=\U_r\U_r^T\) is the orthogonal projection matrix for the subspace spanned by the first \(r\) basis vectors. The projection matrix \(\P_r\) can also be written as the decomposition

\[\P_r=\U_r\U_r^T=\sum_{i=1}^r\u_i\u_i^T\]

The projection of \(\x\) onto the remaining dimensions comprises the error vector

Note

\(\dp\bs\epsilon=\sum_{i=r+1}^da_i\u_i=\x-\x\pr\)

It is worth noting that \(\x\pr\) and \(\bs\epsilon\) are orthogonal vectors:

\[{\x\pr}^T\bs\epsilon=\sum_{i=1}^r\sum_{j=r+1}^da_ia_j\u_i^T\u_j=0\]

The subspace spanned by the first \(r\) basis vectors and the subspace spanned by the remaining basis vectors are orthogonal subspaces. They are orthogonal complements of each other.
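The projection and error vectors can be illustrated with a short sketch. It assumes NumPy; the helper name `project` and the random basis are hypothetical and only serve the illustration.

```python
import numpy as np

def project(x, U, r):
    """Project x onto the span of the first r columns of the orthogonal matrix U.

    Returns the projection x', the error vector x - x', and the matrix P_r.
    """
    U_r = U[:, :r]          # first r basis vectors
    P_r = U_r @ U_r.T       # orthogonal projection matrix P_r = U_r U_r^T
    x_proj = P_r @ x        # x' = P_r x
    return x_proj, x - x_proj, P_r

rng = np.random.default_rng(1)
U, _ = np.linalg.qr(rng.standard_normal((5, 5)))   # illustrative orthonormal basis
x = rng.standard_normal(5)

x_proj, eps, P_r = project(x, U, r=2)
print(x_proj @ eps)   # inner product of x' and the error; zero up to rounding
```

In this sketch `P_r` is symmetric and satisfies `P_r @ P_r == P_r` up to rounding, as expected of an orthogonal projection matrix.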

The goal of dimensionality reduction is to seek an \(r\)-dimensional basis that gives the best possible approximation \(\x_i\pr\) over all the points \(\x_i\in\D\). Alternatively, we may seek to minimize the error \(\bs\epsilon_i=\x_i-\x_i\pr\) over all the points.

7.2 Principal Component Analysis

Principal Component Analysis (PCA) is a technique that seeks an \(r\)-dimensional basis that best captures the variance in the data.

7.2.1 Best Line approximation

Assume that \(\u\) is a unit vector, and the data matrix \(\D\) has been centered by subtracting the mean \(\mmu\).

\[\bar\D=\D-\1\cd\mmu^T\]

The projection of the centered point \(\bar\x_i\in\bar\D\) on the vector \(\u\) is given as

\[\x_i\pr=\bigg(\frac{\u^T\bar\x_i}{\u^T\u}\bigg)\u=(\u^T\bar\x_i)\u=a_i\u\]

where

Note

\(a_i=\u^T\bar\x_i\)

is the offset or scalar projection of \(\bar\x_i\) on \(\u\). We also call \(a_i\) a projected point. Note that the scalar projection of the mean \(\bar\mmu\) is 0. Therefore, the mean of the projected points \(a_i\) is also zero, since

\[\mu_a=\frac{1}{n}\sum_{i=1}^na_i=\frac{1}{n}\sum_{i=1}^n\u^T(\bar\x_i)=\u^T\bar\mmu=0\]

We have to choose the direction \(\u\) such that the variance of the projected points is maximized. The projected variance along \(\u\) is given as

\[\sg_\u^2=\frac{1}{n}\sum_{i=1}^n(a_i-\mu_a)^2=\frac{1}{n}\sum_{i=1}^n (\u^T\bar\x_i)^2=\frac{1}{n}\sum_{i=1}^n\u^T(\bar\x_i\bar\x_i^T)\u=\u^T \bigg(\frac{1}{n}\sum_{i=1}^n\bar\x_i\bar\x_i^T\bigg)\u\]

Thus, we get

Note

\(\sg_\u^2=\u^T\Sg\u\)

where \(\Sg\) is the sample covariance matrix for the centered data \(\bar\D\).

We have to find the optimal basis vector \(\u\) that maximizes the projected variance \(\sg_\u^2\) subject to the constraint that \(\u^T\u=1\). This can be solved by introducing a Lagrangian multiplier \(\alpha\) for the constraint, to obtain the unconstrained maximization problem

\[\max_\u J(\u)=\u^T\Sg\u-\alpha(\u^T\u-1)\]

Setting the derivative of \(J(\u)\) with respect to \(\u\) to the zero vector, we obtain

\[ \begin{align}\begin{aligned}\frac{\pd}{\pd\u}J(\u)&=\0\\\frac{\pd}{\pd\u}(\u^T\Sg\u-\alpha(\u^T\u-1))&=\0\\2\Sg\u-2\alpha\u&=\0\end{aligned}\end{align} \]

Note

\(\Sg\u=\alpha\u\)

\[\u^T\Sg\u=\u^T\alpha\u=\alpha\u^T\u=\alpha\]

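Thus \(\alpha\) is an eigenvalue of \(\Sg\) with eigenvector \(\u\), and the projected variance \(\u^T\Sg\u\) equals \(\alpha\). To maximize it, we choose the largest eigenvalue of \(\Sg\). The dominant eigenvector \(\u_1\) specifies the direction of most variance, also called the first principal component, that is, \(\u=\u_1\). Further, the largest eigenvalue \(\ld_1\) specifies the projected variance, that is, \(\sg_\u^2=\alpha=\ld_1\).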
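As a minimal sketch of this result, the first principal component can be computed from the eigen-decomposition of the covariance matrix. The code assumes NumPy and the \(\frac{1}{n}\) covariance convention used above; the function name and the synthetic data are hypothetical.

```python
import numpy as np

def first_principal_component(D):
    """Return (u1, lambda1, centered data, covariance) for the data matrix D."""
    n = D.shape[0]
    D_bar = D - D.mean(axis=0)                 # center the data
    Sigma = D_bar.T @ D_bar / n                # covariance with the 1/n convention
    eigvals, eigvecs = np.linalg.eigh(Sigma)   # eigh returns ascending eigenvalues
    return eigvecs[:, -1], eigvals[-1], D_bar, Sigma

rng = np.random.default_rng(2)
D = rng.standard_normal((200, 3)) * np.array([3.0, 1.0, 0.3])   # synthetic data

u1, lam1, D_bar, Sigma = first_principal_component(D)
a = D_bar @ u1               # scalar projections a_i = u1^T x_i
proj_var = np.mean(a ** 2)   # projected variance (the projected mean is zero)
print(proj_var, lam1)        # the two values agree up to rounding
```

`np.linalg.eigh` is used because \(\Sg\) is symmetric; the sign of the returned eigenvector is arbitrary, which does not affect the projected variance.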

Minimum Squared Error Approach

The direction that maximizes the projected variance is also the one that minimizes the average squared error. The mean squared error (MSE) optimization condition is defined as

\[ \begin{align}\begin{aligned}MSE(\u)&=\frac{1}{n}\sum_{i=1}^n\lv\bs\epsilon_i\rv^2= \frac{1}{n}\sum_{i=1}^n\lv\bar\x_i-\x_i\pr\rv^2= \frac{1}{n}\sum_{i=1}^n(\bar\x_i-\x_i\pr)^T(\bar\x_i-\x_i\pr)\\&=\frac{1}{n}\sum_{i=1}^n(\lv\bar\x_i\rv^2-2\bar\x_i^T\x_i\pr+(\x_i\pr)^T\x_i\pr)\\&=\frac{1}{n}\sum_{i=1}^n(\lv\bar\x_i\rv^2-2\bar\x_i^T(\u^T\bar\x_i)\u+ ((\u^T\bar\x_i)\u)^T(\u^T\bar\x_i)\u),\rm{\ since\ }\x_i\pr=(\u^T\bar\x_i)\u\\&=\frac{1}{n}\sum_{i=1}^n(\lv\bar\x_i\rv^2-2(\u^T\bar\x_i)(\bar\x_i^T\u)+(\u^T\bar\x_i)(\bar\x_i^T\u)\u^T\u)\\&=\frac{1}{n}\sum_{i=1}^n(\lv\bar\x_i\rv^2-(\u^T\bar\x_i)(\bar\x_i^T\u))\\&=\frac{1}{n}\sum_{i=1}^n\lv\bar\x_i\rv^2-\frac{1}{n}\sum_{i=1}^n\u^T(\bar\x_i\bar\x_i^T)\u\\&=\frac{1}{n}\sum_{i=1}^n\lv\bar\x_i\rv^2-\u^T\bigg(\frac{1}{n}\sum_{i=1}^n\bar\x_i\bar\x_i^T\bigg)\u\end{aligned}\end{align} \]

which implies

Note

\(\dp MSE=\sum_{i=1}^n\frac{\lv\bar\x_i\rv^2}{n}-\u^T\Sg\u\)

Further, because the data has been centered, the first term is the total variance of \(\D\):

Note

\(\dp\rm{var}(\D)=\rm{tr}(\Sg)=\sum_{i=1}^d\sg_i^2\)

Note

\(\dp MSE(\u)=\rm{var}(\D)-\u^T\Sg\u=\sum_{i=1}^d\sg_i^2-\u^T\Sg\u\)

The principal component \(\u_1\), which is the direction that maximizes the projected variance, is also the direction that minimizes the mean squared error.

\[MSE(\u_1)=\rm{var}(\D)-\u_1^T\Sg\u_1=\rm{var}(\D)-\u_1^T\ld_1\u_1=\rm{var}(\D)-\ld_1\]
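The identity \(MSE(\u_1)=\rm{var}(\D)-\ld_1\) can be verified numerically. This is a sketch under the same assumptions as before (NumPy, \(\frac{1}{n}\) covariance, synthetic data chosen only for illustration).

```python
import numpy as np

rng = np.random.default_rng(3)
D = rng.standard_normal((100, 4)) * np.array([2.0, 1.0, 0.5, 0.1])   # synthetic data
n = D.shape[0]

D_bar = D - D.mean(axis=0)                 # centered data
Sigma = D_bar.T @ D_bar / n                # covariance matrix
eigvals, eigvecs = np.linalg.eigh(Sigma)   # ascending eigenvalues
u1, lam1 = eigvecs[:, -1], eigvals[-1]     # first principal component

X_proj = np.outer(D_bar @ u1, u1)          # row i is x_i' = (u1^T x_i) u1
mse = np.mean(np.sum((D_bar - X_proj) ** 2, axis=1))   # average squared error
total_var = np.trace(Sigma)                # var(D) = tr(Sigma)

print(mse, total_var - lam1)               # the two values agree up to rounding
```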

7.2.2 Best 2-dimensional Approximation

We are now interested in the best two-dimensional approximation to \(\D\). We now want to find another direction \(\v\), which also maximizes the projected variance, but is orthogonal to \(\u_1\). The projected variance along \(\v\) is given as

\[\sg_\v^2=\v^T\Sg\v\]

We further require that \(\v\) be a unit vector orthogonal to \(\u_1\). The optimization condition then becomes

\[\max_\v J(\v)=\v^T\Sg\v-\alpha(\v^T\v-1)-\beta(\v^T\u_1-0)\]

Taking the derivative of \(J(\v)\) with respect to \(\v\), and setting it to the zero vector, finally gives that \(\v\) is the eigenvector of \(\Sg\) corresponding to the second largest eigenvalue \(\ld_2\), that is, \(\v=\u_2\).
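The intermediate steps can be sketched as follows. This is a worked derivation under the stated constraints, using only the symmetry of \(\Sg\) and the fact that \(\Sg\u_1=\ld_1\u_1\). Setting the derivative of \(J(\v)\) to the zero vector gives

\[2\Sg\v-2\alpha\v-\beta\u_1=\0\]

Multiplying on the left by \(\u_1^T\), and using \(\u_1^T\v=0\) and \(\u_1^T\u_1=1\), we get

\[ \begin{align}\begin{aligned}2\u_1^T\Sg\v-2\alpha\u_1^T\v-\beta\u_1^T\u_1&=0\\2\ld_1\u_1^T\v-\beta&=0\\\beta&=0\end{aligned}\end{align} \]

since \(\u_1^T\Sg\v=(\Sg\u_1)^T\v=\ld_1\u_1^T\v=0\). With \(\beta=0\) the condition reduces to \(\Sg\v=\alpha\v\), so \(\v\) is again an eigenvector of \(\Sg\), with projected variance \(\v^T\Sg\v=\alpha\). Among the eigenvectors orthogonal to \(\u_1\), the projected variance is largest for \(\alpha=\ld_2\).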

Total Projected Variance

Let \(\U_2\) be the matrix whose columns correspond to the two principal components. Given the point \(\bar\x_i\in\bar\D\) its coordinates in the two-dimensional subspace spanned by \(\u_1\) and \(\u_2\) can be computed as follows:

\[\a_i=\U_2^T\bar\x_i\]

Assume that each point \(\bar\x_i\in\R^d\) in \(\bar\D\) has been projected to obtain its coordinates \(\a_i\in\R^2\), yielding the new dataset \(\A\). The total variance for \(\A\) is given as

\[ \begin{align}\begin{aligned}\rm{var}(\A)&=\frac{1}{n}\sum_{i=1}^n\lv\a_i-\0\rv^2= \frac{1}{n}\sum_{i=1}^n(\U_2^T\bar\x_i)^T(\U_2^T\bar\x_i)= \frac{1}{n}\sum_{i=1}^n\bar\x_i^T(\U_2\U_2^T)\bar\x_i\\&=\frac{1}{n}\sum_{i=1}^n\bar\x_i^T\P_2\bar\x_i\end{aligned}\end{align} \]

where \(\P_2\) is the orthogonal projection matrix given as

\[\P_2=\U_2\U_2^T=\u_1\u_1^T+\u_2\u_2^T\]

The projected total variance is then given as

\[ \begin{align}\begin{aligned}\rm{var}(\A)&=\frac{1}{n}\sum_{i=1}^n\bar\x_i^T\P_2\bar\x_i\\&=\u_1^T\Sg\u_1+\u_2^T\Sg\u_2=\u_1^T\ld_1\u_1+\u_2^T\ld_2\u_2=\ld_1+\ld_2\end{aligned}\end{align} \]

Mean Squared Error

\[ \begin{align}\begin{aligned}MSE&=\frac{1}{n}\sum_{i=1}^n\lv\bar\x_i-\x_i\pr\rv^2\\&=\frac{1}{n}\sum_{i=1}^n(\lv\bar\x_i\rv^2-2\bar\x_i^T\x_i\pr+(\x_i\pr)^T\x_i\pr)\\&=\rm{var}(\D)+\frac{1}{n}\sum_{i=1}^n(-2\bar\x_i^T\P_2\bar\x_i+(\P_2\bar\x_i)^T\P_2\bar\x_i)\\&=\rm{var}(\D)-\frac{1}{n}\sum_{i=1}^n(\bar\x_i^T\P_2\bar\x_i)\\&=\rm{var}(\D)-\rm{var}(\A)\\&=\rm{var}(\D)-\ld_1-\ld_2\end{aligned}\end{align} \]

7.2.3 Best \(r\)-dimensional Approximation

To find the best \(r\)-dimensional approximation to \(\D\), we compute the eigenvalues of \(\Sg\). Because \(\Sg\) is positive semidefinite, its eigenvalues are non-negative and can be sorted in decreasing order

\[\ld_1\geq\ld_2\geq\cds\geq\ld_r\geq\ld_{r+1}\geq\cds\geq\ld_d\geq 0\]

We then select the \(r\) largest eigenvalues, and their corresponding eigenvectors to form the best \(r\)-dimensional approximation.

Total Projected Variance

\[\rm{var}(\A)=\frac{1}{n}\sum_{i=1}^n\bar\x_i^T\P_r\bar\x_i=\sum_{i=1}^r\u_i^T\Sg\u_i=\sum_{i=1}^r\ld_i\]

Mean Squared Error

\[ \begin{align}\begin{aligned}MSE&=\frac{1}{n}\sum_{i=1}^n\lv\bar\x_i-\x_i\pr\rv^2=\rm{var}(\D)-\rm{var}(\A)\\&=\rm{var}(\D)-\sum_{i=1}^r\u_i^T\Sg\u_i=\rm{var}(\D)-\sum_{i=1}^r\ld_i\end{aligned}\end{align} \]

Total Variance

Note

\(\dp\rm{var}(\D)=\sum_{i=1}^d\sg_i^2=\sum_{i=1}^d\ld_i\)

Choosing the Dimensionality

One criterion for choosing \(r\) is to compute the fraction of the total variance captured by the first \(r\) principal components, computed as

Note

\(\dp f(r)=\frac{\ld_1+\ld_2+\cds+\ld_r}{\ld_1+\ld_2+\cds+\ld_d}=\frac{\sum_{i=1}^r\ld_i}{\sum_{i=1}^d\ld_i}=\frac{\sum_{i=1}^r\ld_i}{\rm{var}(\D)}\)

Given a certain desired variance threshold, say \(\alpha\), starting from the first principal component, we keep on adding additional components, and stop at the smallest value \(r\) for which \(f(r)\geq\alpha\), given as

Note

\(r=\min\{r\pr|f(r\pr)\geq\alpha\}\)
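The selection rule can be sketched as a small function. The function name and the example eigenvalues below are hypothetical; the eigenvalues are assumed to be non-negative with a positive sum.

```python
import numpy as np

def choose_dimensionality(eigvals, alpha):
    """Return the smallest r with f(r) >= alpha, together with all fractions f.

    eigvals: eigenvalues of the covariance matrix (any order, non-negative).
    alpha:   desired fraction of the total variance, between 0 and 1.
    """
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]   # decreasing order
    f = np.cumsum(lam) / lam.sum()                          # f(1), ..., f(d)
    for r, frac in enumerate(f, start=1):
        if frac >= alpha:
            return r, f
    return len(lam), f   # fallback if rounding keeps f(d) slightly below alpha

# Hypothetical eigenvalues, chosen only for illustration.
r, f = choose_dimensionality([3.0, 1.0, 0.5, 0.5], alpha=0.85)
print(r, f)
```

For these illustrative eigenvalues the cumulative fractions are 0.6, 0.8, 0.9, and 1.0, so a threshold of 0.85 is first reached at \(r=3\).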

7.2.4 Geometry of PCA

Geometrically, when \(r=d\), PCA corresponds to an orthogonal change of basis, so that the total variance is captured by the sum of the variances along each of the principal directions \(\u_1,\u_2,\cds,\u_d\), and further, all covariances are zero. This can be seen by looking at the collective action of the full set of principal components, which can be arranged as the columns of the \(d\times d\) orthogonal matrix \(\U\), with \(\U\im=\U^T\).

Each principal component \(\u_i\) corresponds to an eigenvector of the covariance matrix \(\Sg\), which can be written compactly as

\[\Sg\U=\U\Ld\]

Multiplying the above equation on the left by \(\U\im=\U^T\), we obtain

\[\U^T\Sg\U=\U^T\U\Ld=\Ld\]

This means that if we change the basis to \(\U\), we change the covariance matrix \(\Sg\) to a similar matrix \(\Ld\), which in fact is the covariance matrix in the new basis.

It is worth noting that in the new basis, the equation

\[\x^T\Sg\im\x=1\]

defines a \(d\)-dimensional ellipsoid (or hyper-ellipse). The eigenvectors \(\u_i\) of \(\Sg\), that is, the principal components, are the directions for the principal axes of the ellipsoid. The square roots of the eigenvalues, that is, \(\sqrt{\ld_i}\), give the lengths of the semi-axes.

The eigen-decomposition of \(\Sg\) is

Note

\(\dp\Sg=\U\Ld\U^T=\ld_1\u_1\u_1^T+\ld_2\u_2\u_2^T+\cds+\ld_d\u_d\u_d^T=\sum_{i=1}^d\ld_i\u_i\u_i^T\)

Assuming that \(\Sg\) is invertible or nonsingular, we have

\[\Sg\im=(\U\Ld\U^T)\im=(\U\im)^T\Ld\im\U\im=\U\Ld\im\U^T\]

Using the fact that \(\x=\U\a\), we get

\[ \begin{align}\begin{aligned}\x^T\Sg\im\x&=1\\(\a^T\U^T)\U\Ld\im\U^T(\U\a)&=1\\\a^T\Ld\im\a&=1\\\sum_{i=1}^d\frac{a_i^2}{\ld_i}&=1\end{aligned}\end{align} \]

which is precisely the equation for an ellipsoid centered at \(\0\), with semi-axes lengths \(\sqrt{\ld_i}\). Thus \(\x^T\Sg\im\x=1\), or equivalently \(\a^T\Ld\im\a=1\) in the new principal components basis, defines an ellipsoid in \(d\)-dimensions, where the semi-axes lengths equal the standard deviations along each axis. Likewise, the equation \(\x^T\Sg\im\x=s\), or equivalently \(\a^T\Ld\im\a=s\), for different values of the scalar \(s\), represents concentric ellipsoids.
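Both geometric claims can be checked with a short sketch: the covariance matrix becomes diagonal in the principal components basis, and the point \(\sqrt{\ld_1}\u_1\) lies on the ellipsoid \(\x^T\Sg\im\x=1\). The code assumes NumPy; the mixing matrix and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
# Synthetic correlated data: standard normal points times a fixed mixing matrix.
M = np.array([[2.0, 0.5, 0.0],
              [0.0, 1.0, 0.3],
              [0.0, 0.0, 0.5]])
D = rng.standard_normal((500, 3)) @ M
n = D.shape[0]

D_bar = D - D.mean(axis=0)
Sigma = D_bar.T @ D_bar / n
eigvals, U = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]          # sort eigenvalues in decreasing order
eigvals, U = eigvals[order], U[:, order]

A = D_bar @ U                 # row i holds the new coordinates a_i = U^T x_i
Sigma_new = A.T @ A / n       # covariance matrix in the new basis

x = np.sqrt(eigvals[0]) * U[:, 0]          # end point of the first semi-axis
quad = x @ np.linalg.solve(Sigma, x)       # x^T Sigma^{-1} x

print(np.round(Sigma_new, 6))   # diagonal, with the eigenvalues on the diagonal
print(quad)                     # equals 1 up to rounding
```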

7.3 Kernel Principal Component Analysis

Principal component analysis can be extended to find nonlinear “directions” in the data using kernel methods. Kernel PCA finds the directions of most variance in the feature space instead of the input space.

In feature space, we can find the first kernel principal component \(\u_1\), by solving for the eigenvector corresponding to the largest eigenvalue of the covariance matrix in feature space:

\[\Sg_\phi\u_1=\ld_1\u_1\]

where \(\Sg_\phi\), the covariance matrix in feature space, is given as

\[\Sg_\phi=\frac{1}{n}\sum_{i=1}^n(\phi(\x_i)-\mmu_\phi)(\phi(\x_i)- \mmu_\phi)^T=\frac{1}{n}\sum_{i=1}^n\bar\phi(\x_i)\bar\phi(\x_i)^T\]

Plugging in the expansion of \(\Sg_\phi\), we get

\[ \begin{align}\begin{aligned}\bigg(\frac{1}{n}\sum_{i=1}^n\bar\phi(\x_i)\bar\phi(\x_i)^T\bigg)\u_1&=\ld_1\u_1\\\frac{1}{n}\sum_{i=1}^n\bar\phi(\x_i)(\bar\phi(\x_i)^T\u_1)&=\ld_1\u_1\\\sum_{i=1}^n\bigg(\frac{\bar\phi(\x_i)^T\u_1}{n\ld_1}\bigg)\bar\phi(\x_i)&=\u_1\\\sum_{i=1}^nc_i\bar\phi(\x_i)&=\u_1\end{aligned}\end{align} \]

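where \(c_i=\frac{\bar\phi(\x_i)^T\u_1}{n\ld_1}\) is a scalar value. That is, \(\u_1\) is a linear combination of the centered points in feature space. Substituting this expansion back into the eigenvector equation, we get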

\[ \begin{align}\begin{aligned}\bigg(\frac{1}{n}\sum_{i=1}^n\bar\phi(\x_i)\bar\phi(\x_i)^T\bigg)\bigg( \sum_{j=1}^nc_j\bar\phi(\x_j)\bigg)&=\ld_1\sum_{i=1}^nc_i\bar\phi(\x_i)\\\frac{1}{n}\sum_{i=1}^n\sum_{j=1}^nc_j\bar\phi(\x_i)\bar\phi(\x_i)^T\bar\phi (\x_j)&=\ld_1\sum_{i=1}^nc_i\bar\phi(\x_i)\\\sum_{i=1}^n\bigg(\bar\phi(\x_i)\sum_{j=1}^nc_j\bar\phi(\x_i)^T\bar\phi (\x_j)\bigg)&=n\ld_1\sum_{i=1}^nc_i\bar\phi(\x_i)\\\sum_{i=1}^n\bigg(\bar\phi(\x_i)\sum_{j=1}^nc_j\bar{K}(\x_i,\x_j)\bigg)&=n\ld_1\sum_{i=1}^nc_i\bar\phi(\x_i)\end{aligned}\end{align} \]

We assume that the kernel matrix \(\K\) has already been centered using

\[\bar\K=\bigg(\I-\frac{1}{n}\1_{n\times n}\bigg)\K\bigg(\I-\frac{1}{n}\1_{n\times n}\bigg)\]

Take any point, say \(\bar\phi(\x_k)\) and multiply by \(\bar\phi(\x_k)^T\) on both sides to obtain

\[ \begin{align}\begin{aligned}\sum_{i=1}^n\bigg(\bar\phi(\x_k)^T\bar\phi(\x_i)\sum_{j=1}^nc_j\bar{K} (\x_i,\x_j)\bigg)&=n\ld_1\sum_{i=1}^nc_i\bar\phi(\x_k)^T\bar\phi(\x_i)\\\sum_{i=1}^n\bigg(\bar{K}(\x_k,\x_i)\sum_{j=1}^nc_j\bar{K} (\x_i,\x_j)\bigg)&=n\ld_1\sum_{i=1}^nc_i\bar{K}(\x_k,\x_i)\end{aligned}\end{align} \]

We can compactly represent it as follows:

\[\bar\K^2\c=n\ld_1\bar\K\c\]

If \(\eta_1\) is the largest eigenvalue of \(\bar\K\) corresponding to the dominant eigenvector \(\c\), we can verify that

\[ \begin{align}\begin{aligned}\bar\K(\bar\K\c)&=n\ld_1\bar\K\c\\\bar\K(\eta_1\cd\c)&=n\ld_1\eta_1\c\\\bar\K\c&=n\ld_1\c\end{aligned}\end{align} \]

which implies

Note

\(\bar\K\c=\eta_1\c\)

where \(\eta_1=n\cd\ld_1\).

If we sort the eigenvalues of \(\bar\K\) in decreasing order \(\eta_1\geq\eta_2\geq\cds\geq\eta_n\geq 0\), we can obtain the \(j\)th principal component as the corresponding eigenvector \(\c_j\), which has to be normalized so that the norm is \(\lv\c_j\rv=\sqrt{\frac{1}{\eta_j}}\), provided \(\eta_j>0\). Also, because \(\eta_j=n\ld_j\), the variance along the \(j\)th principal component is given as \(\ld_j=\frac{\eta_j}{n}\). To obtain a reduced dimensional dataset, say with dimensionality \(r\ll n\), we can compute the scalar projection of \(\bar\phi(\x_i)\) for each point \(\x_i\) onto the principal component \(\u_j\), for \(j=1,2,\cds,r\), as follows:

\[a_{ij}=\u_j^T\bar\phi(\x_i)=\bar\K_i^T\c_j\]

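Here \(\bar\K_i\) denotes the \(i\)th column of the centered kernel matrix \(\bar\K\). Collecting the \(r\) projections, we can obtain \(\a_i\in\R^r\) as follows: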

Note

\(\a_i=\bs{\rm{C}}_r^T\bar\K_i\)

where \(\bs{\rm{C}}_r\) is the weight matrix whose columns comprise the top \(r\) eigenvectors, \(\c_1,\c_2,\cds,\c_r\).
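The kernel PCA steps above can be put together in a short sketch. It assumes NumPy, a precomputed kernel matrix, and that the top \(r\) eigenvalues of \(\bar\K\) are strictly positive; the function name `kernel_pca` is hypothetical. As a sanity check, the linear kernel \(\K=\D\D^T\) should reproduce ordinary PCA up to the sign of each component.

```python
import numpy as np

def kernel_pca(K, r):
    """Project n points onto the top r kernel principal components.

    K: n x n kernel matrix (not yet centered).
    Returns (A, lam): row i of A is a_i, lam holds the variances eta_j / n.
    Assumes the top r eigenvalues of the centered kernel matrix are positive.
    """
    n = K.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n       # centering matrix I - (1/n) 1_{n x n}
    K_bar = J @ K @ J                         # centered kernel matrix
    eta, C = np.linalg.eigh(K_bar)            # ascending eigenvalues, unit eigenvectors
    order = np.argsort(eta)[::-1][:r]         # indices of the r largest eigenvalues
    eta, C = eta[order], C[:, order]
    C = C / np.sqrt(eta)                      # rescale so that ||c_j|| = sqrt(1 / eta_j)
    A = K_bar @ C                             # A[i, j] = K_bar_i^T c_j
    return A, eta / n

rng = np.random.default_rng(5)
D = rng.standard_normal((30, 4)) * np.array([3.0, 2.0, 1.0, 0.5])   # synthetic data
A, lam = kernel_pca(D @ D.T, r=2)             # linear kernel

# Ordinary PCA on the same data, for comparison.
D_bar = D - D.mean(axis=0)
w, U = np.linalg.eigh(D_bar.T @ D_bar / D.shape[0])
A_lin = D_bar @ U[:, ::-1][:, :2]

print(np.allclose(np.abs(A), np.abs(A_lin)))  # same projections up to sign
```

Other kernels can be used by passing a different kernel matrix `K`; only the linear case has a direct input-space counterpart to compare against.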

7.4 Singular Value Decomposition

Principal components analysis is a special case of a more general matrix decomposition method called Singular Value Decomposition (SVD). PCA yields the following decomposition of the covariance matrix:

\[\Sg=\U\Ld\U^T\]

SVD generalizes the above factorization for any matrix. In particular for an \(n\times d\) data matrix \(\D\) with \(n\) points and \(d\) columns, SVD factorizes \(\D\) as follows:

Note

\(\D=\bs{\rm{L\Delta R}}^T\)

Here \(\bs{\rm{L}}\) is an \(n\times n\) orthogonal matrix, \(\bs{\rm{R}}\) is a \(d\times d\) orthogonal matrix, and \(\bs{\rm{\Delta}}\) is an \(n\times d\) diagonal matrix. The columns of \(\bs{\rm{L}}\) are called the left singular vectors, and the columns of \(\bs{\rm{R}}\) are called the right singular vectors. The matrix \(\bs{\rm{\Delta}}\) is defined as

\[\begin{split}\bs{\rm{\Delta}}(i,j)=\left\{\begin{array}{ll}\delta_i&\rm{if\ }i=j\\0&\rm{if\ }i\neq j\end{array}\right.\end{split}\]

The entries \(\bs{\rm{\Delta}}(i,i)=\delta_i\) along the main diagonal of \(\bs{\rm{\Delta}}\) are called the singular values of \(\D\).

One can discard those left and right singular vectors that correspond to zero singular values, to obtain the reduced SVD as

Note

\(\D=\bs{\rm{L}}_r\bs{\rm{\Delta}}_r\bs{\rm{R}}_r^T\)

The reduced SVD leads directly to the spectral decomposition of \(\D\), given as

Note

\(\dp\D=\sum_{i=1}^r\delta_i\bs{l}_i\bs{\rm{r}}_i^T\)

By selecting the \(q\) largest singular values \(\delta_1,\delta_2,\cds,\delta_q\) and the corresponding left and right singular vectors, we obtain the best rank \(q\) approximation to the original matrix \(\D\). That is, if \(\D_q\) is the matrix defined as

\[\D_q=\sum_{i=1}^q\delta_i\bs{l}_i\bs{\rm{r}}_i^T\]

then it can be shown that \(\D_q\) is the rank \(q\) matrix that minimizes the expression

\[\lv\D-\D_q\rv_F\]

where \(\lv\A\rv_F\) is called the Frobenius Norm of the \(n\times d\) matrix \(\A\), defined as

\[\lv\A\rv_F=\sqrt{\sum_{i=1}^n\sum_{j=1}^d\A(i,j)^2}\]
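A rank \(q\) approximation can be sketched with NumPy's SVD routine. The function name and the random matrix are hypothetical. The check at the end relies on a standard fact not derived in the text: the Frobenius error of the truncated SVD equals the square root of the sum of the squared discarded singular values.

```python
import numpy as np

def rank_q_approximation(D, q):
    """Return D_q, the sum of the first q terms delta_i * l_i * r_i^T, and all deltas."""
    # full_matrices=False gives the thin SVD; singular values come in decreasing order.
    L, delta, Rt = np.linalg.svd(D, full_matrices=False)
    D_q = (L[:, :q] * delta[:q]) @ Rt[:q, :]   # scale columns of L, then multiply by R^T
    return D_q, delta

rng = np.random.default_rng(6)
D = rng.standard_normal((8, 5))                # illustrative data matrix
D_q, delta = rank_q_approximation(D, q=2)

err = np.linalg.norm(D - D_q, ord="fro")       # Frobenius norm of the residual
print(err, np.sqrt(np.sum(delta[2:] ** 2)))    # the two values agree up to rounding
```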

7.4.1 Geometry of SVD

SVD is a special factorization of the matrix \(\D\), such that any basis vector \(\bs{\rm{r}}_i\) for the row space is mapped to the corresponding basis vector \(\bs{l}_i\) in the column space, scaled by the singular value \(\delta_i\). We can think of the SVD as a mapping from an orthonormal basis \((\bs{\rm{r}}_1,\bs{\rm{r}}_2,\cds,\bs{\rm{r}}_r)\) in \(\R^d\) (the row space) to an orthonormal basis \((\bs{l}_1,\bs{l}_2,\cds,\bs{l}_r)\) in \(\R^n\) (the column space), with the corresponding axes scaled according to the singular values \(\delta_1,\delta_2,\cds,\delta_r\).

7.4.2 Connection between SVD and PCA

Assume that the matrix \(\D\) has been centered, and assume that the centered matrix \(\bar\D\) has been factorized as \(\bar\D=\bs{\rm{L\Delta R}}^T\). Consider the scatter matrix for \(\bar\D\), given as \(\bar\D^T\bar\D\). We have

\[ \begin{align}\begin{aligned}\bar\D^T\bar\D&=(\bs{\rm{L\Delta R}}^T)^T(\bs{\rm{L\Delta R}}^T)\\&=\bs{\rm{R\Delta}}^T\bs{\rm{L}}^T\bs{\rm{L\Delta R}}^T\\&=\bs{\rm{R}}(\bs{\rm{\Delta}}^T\bs{\rm{\Delta}})\bs{\rm{R}}^T\\&=\bs{\rm{R\Delta}}_d^2\bs{\rm{R}}^T\end{aligned}\end{align} \]

where \(\bs{\rm{\Delta}}_d^2\) is the \(d\times d\) diagonal matrix defined as \(\bs{\rm{\Delta}}_d^2(i,i)=\delta_i^2\), for \(i=1,\cds,d\).

Because the covariance matrix of \(\bar\D\) is given as \(\Sg=\frac{1}{n}\bar\D^T\bar\D\), we have

\[ \begin{align}\begin{aligned}\bar\D^T\bar\D&=n\Sg\\&=n\U\Ld\U^T\\&=\U(n\Ld)\U^T\end{aligned}\end{align} \]

Comparing the two factorizations of \(\bar\D^T\bar\D\), the right singular vectors \(\bs{\rm{R}}\) are the same as the eigenvectors of \(\Sg\). The corresponding singular values of \(\bar\D\) are related to the eigenvalues of \(\Sg\) by the expression

\[ \begin{align}\begin{aligned}n\ld_i=\delta_i^2\\\rm{or,\ }\ld_i=\frac{\delta_i^2}{n},\rm{\ for\ }i=1,\cds,d\end{aligned}\end{align} \]

Likewise, the left singular vectors in \(\bs{\rm{L}}\) are the eigenvectors of the \(n\times n\) matrix \(\bar\D\bar\D^T\), and the corresponding eigenvalues are given as \(\delta_i^2\).
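The relationship between the two decompositions can be checked numerically. This is a sketch assuming NumPy, the \(\frac{1}{n}\) covariance convention, and synthetic data with distinct variances so that the eigenvectors are unique up to sign.

```python
import numpy as np

rng = np.random.default_rng(7)
D = rng.standard_normal((50, 3)) * np.array([2.0, 1.0, 0.3])   # synthetic data
n = D.shape[0]
D_bar = D - D.mean(axis=0)                   # centered data matrix

# SVD of the centered data: singular values in decreasing order, rows of Rt are r_i^T.
_, delta, Rt = np.linalg.svd(D_bar, full_matrices=False)

# Eigen-decomposition of the covariance matrix, reordered to decreasing eigenvalues.
Sigma = D_bar.T @ D_bar / n
lam, U = np.linalg.eigh(Sigma)
lam, U = lam[::-1], U[:, ::-1]

print(np.allclose(lam, delta ** 2 / n))      # lambda_i = delta_i^2 / n
print(np.round(np.abs(U.T @ Rt.T), 6))       # identity: u_i matches r_i up to sign
```

In practice, computing the principal components from the SVD of \(\bar\D\) avoids forming \(\bar\D^T\bar\D\) explicitly.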