Chapter 5 Kernel Methods
Given a data instance \(\x\), we need to find a mapping \(\phi\), so that \(\phi(\x)\) is the vector representation of \(\x\). Even when the input data is a numeric data matrix, if we wish to discover nonlinear relationships among the attributes, then a nonlinear mapping \(\phi\) may be used, so that \(\phi(\x)\) represents a vector in the corresponding high-dimensional space comprising nonlinear attributes. We use the input space to reefer to the data space for the input data \(\x\) and feature space to refer to the space of mapped vectors \(\phi(\x)\).
Kernel methods avoid explicitly transforming each point \(\x\) in the input space into the mapped point \(\phi(\x)\) in the feature space. Instead, the input objects are represented via their \(n\times n\) pairwise similarity values. The similarity function, called a kernel, is chosen so that it represents a dot product in some high-dimensional feature space, yet it can be computed without directly constructing \(\phi(\x)\). Let \(\cl{I}\) denote the input space, and let \(\D\subset\cl{I}\) be a dataset comprising \(n\) objects \(\x_i\ (i=1,2,\cds,n)\) in the input space. We can represent the pairwise similarity values between points in \(\D\) via the \(n\times n\) kernel matrix
where \(K:\cl{I}\times\cl{I}\ra\R\) is a kernel function on any two points in input space. For any \(\x_i,\x_j\in\cl{I}\), the kernel function should satisfy the condition
Note
\(K(\x_i,\x_j)=\phi(\x_i)^T\phi(\x_j)\)
Intuitively, this means that we should be able to compute the value of the dot product using the original input representation \(\x\), without having recourse to the mapping \(\phi(\x)\).
many data mining methods can be kernelized, that is, instead of mapping the input points into feature space, the data can be represented via the \(n\times n\) kernel matrix \(\K\), and all relevant analysis can be performed over \(\K\). This is usually done via the so-called kernel trick, that is, show that the analysis task requires only dot products \(\phi(\x_i)^T\phi(\x_j)\) that can be computed efficiently in input space. Once the kernel matrix has been computed, we no longer even need the input points \(\x_i\), as all operations involving only dot products inthe feature space can be performed over the \(n\times n\) kernel matrix \(\K\).
5.1 Kernel Matrix
Let \(\cl{I}\) denote the input space, which can be any arbitrary set of data objects, and let \(\D\subset\cl{I}\) denote a subset of \(n\) objects \(\x_i\) in the input space. Let \(\phi:\cl{I}\ra\cl{F}\) be a mapping from the input space into the feature space \(\cl{F}\), which is endowed with a dot product and norm. Let \(K:\cl{I}\times\cl{I}\ra\R\) be a function that maps pairs of input objects to their dot product value in feature space, that is, \(K(\x_i,\x_j)=\phi(\x_i)^T\phi(\x_j)\), and let \(\K\) be the \(n\times n\) kernel matrix corresponding to the subset \(\D\).
The function \(K\) is called a positive semidefinite kernel if and only if it is symmetric:
and the corresponding kernel matrix \(\K\) for any subset \(\D\subset\cl{I}\) is positive semidefinite, that is,
which implies that
\(K\) is symmetric since the dot product is symmetric, which also implies that \(\K\) is symmetric. \(\K\) is positive semidefinite because
5.1.1 Reproducing Kernel Map
For the reproducing kernel map \(\phi\), we map each point \(\x\in\cl{I}\) into a function in a function space \(\{f:\cl{I}\ra\R\}\) comprising functions that map points in \(\cl{I}\) into \(\R\). Any \(\x\in\R\) in the input space is mapped to the following function:
where the \(\cd\) stands for any argument in \(\cl{I}\). That is, each object \(\x\) in the input space gets mapped to a feature point \(\phi(\x)\), which is in fact a function \(K(\x,\cd)\) that represents its similarity to all other points in the input space \(\cl{I}\).
Let \(\cl{F}\) be the set of all functions or points that can be obtained as a linear combination of any subset of feature points, defined as
Let \(\f,\g\in\cl{F}\) be any two points in feature space:
Define the dot product between two points as
The dot product is bilinear, that is, linear in both arguments, because
The fact that \(K\) is positive semidefinite implies that
Thus, the space \(\cl{F}\) is a pre-Hilbert space,
The space \(\cl{F}\) has the so-called reproducing property, that is, we can evaluate a function \(f(\cd)=\f\) at a point \(\x\in\cl{I}\) by taking the dot product of \(\f\) with \(\phi(\x)\), that is
For this reason, the space \(\cl{F}\) is also called a reproducing kernel Hilbert space.
The reproducing kernel map shows that any positive semidefinite kernel corresponds to a dot product in some feature space. This means we can apply well known algebraic and geometric methods to understand and analyze the data in these spaces.
Empirical Kernel Map
Define the map \(\phi\) as follows:
Define the dot product in feature space as
For \(\phi\) to be a valid map, we require that \(\phi(\x_i)^T\phi(\x_j)=K(\x_i,\x_j)\), which is clearly not satisfied. One solution is to replace \(\K_i^T\K_j\) with :math`K_i^TAK_j` for some positive semidefinite matrix \(\A\) such that
If we can find such an \(\A\), it would imply that over all pairs of mapped points we have
which can be written compactly as
This immediately suggests that we take \(\A=\K\im\), the (pseudo)inverse of the kernel matrix \(K\). The modified map \(\phi\), called the empirical kernel map, is then defined as
so that the dot product yields
Over all pairs of mapped points, we have
It is important to note that this empirical feature representation is valid only for the \(n\) points in \(\D\). If points are added to or removed from \(\D\), the kernel map will have to be updated for all points.
5.1.2 Mercer Kernel Map
Data-specific Kernel Map
Because \(\K\) is a symmetric positive semidefinite matrix, it has real and non-negative eigenvalues, and it can be decomposed as follows:
where \(\U\) is the orthonormal matrix of eigenvectors \(\u_i=(u_{i1},u_{i2},\cds,u_{in})^T\in\R^n\), and \(\Ld\) is the diagonal matrix of eigenvalues, with both arranged in non-increasing order of the eigenvalues \(\ld_1\geq\ld_2\geq\cds\geq\ld_n\geq 0\):
The kernel matrix \(\K\) can therefore be rewritten as the spectral sum
In particular the kernel function between \(\x_i\) and \(\x_j\) is given as
The Mercer map \(\phi\) is defined as follows:
then \(\K(\x_i,\x_j)\) is a dot product in feature space between the mapped points \(\phi(\x_i)\) and \(\phi(\x_j)\) because
We can rewrite the Mercer map \(\phi\) as
The kernel value is simply the dot product between scaled rows of \(\U\):
The Mercer map restricted to the input dataset \(\D\) is called the data-specific Mercer kernel Map.
Mercer Kernel Map
For compact continuous spaces, the kernel value between any two points can be written as the infinite spectral decomposition
where each normalized eigenfunction \(\u_i(\cd)\) is a solution to the integral equation
and \(K\) is a continuous positive semidefinite kernel, that is, for all functions \(a(\cd)\) with a finite square integral \(K\) satisfies the condition
The general Mercer kernel map is given as
with the kernel value being equivalent to the dot product between two mapped points:
5.2 Vector Kernels
Polynomial Kernel
Polynomial kernels are of two types: homogeneous or inhomogeneous. Let \(\x,\y\in\R^d\). The homogeneous polynomial kernel is defined as
Note
\(K_q(\x,\y)=\phi(\x)^T\phi(\y)=(\x^T\y)^q\)
where \(q\) is the degree of the polynomial.
The most typical cases are the linear (with \(q=1\)) and quadratic (with \(q=2\)) kernels, given as
The inhomogeneous polynomial kernel is defined as
Note
\(k_q(\x,\y)=\phi(\x)^T\phi(\y)=(c+\x^T\y)^q\)
where \(q\) is the degree of the polynomial, and \(c\geq 0\) is some constant.
Let \(n_0,n_1,\cds,n_d\) denote non-negative integers, such that \(\sum_{i=0}^dn_i=q\). Further, let \(\n=(n_0,n_1,\cds,n_d)\), and let \(|\n|=\sum_{i=0}^dn_i=q\). Also, let \(\bp q\\\n \ep\) denote the multinomial coefficient
The multinomial expansion of the inhomogeneous kernel is then given as
Using the notation \(\x^\n=\prod_{k=1}^dx_k^{n_k}\), the mapping \(\phi:\R^d\ra\R^m\) is given as the vector
It can be shown that the dimensionality of the feature space is given as
Gaussian Kernel
The Gaussian kernel, also called the Gaussian radial basis function (RBF) kernel, is defined as
Note
\(\dp K(\x,\y)=\exp\bigg\{-\frac{\lv\x-\y\rv^2}{2\sg^2}\bigg\}\)
where \(\sg>0\) is the spread parameter that plays the same role as the standard deviation in a normal density function. Note that \(K(\x,\x)=1\), and further that the kernel value is inversely related to the distance between the two points \(\x\) and \(\y\).
A feature space for the Gaussian kernal has infinite dimensionality.
Further, using \(\gamma=\frac{1}{2\sg^2}\), and noting that \(\lv\x-\y\rv^2=\lv\x\rv^2+\lv\y\rv^2-2\x^T\y\), we can rewrite the Gaussian kernel as follows:
In particular, the last term is given as the infinite expansion
Using the multinomial expansion of \((\x^T\y)^q\), we can write the Gaussian kernel as
where
The mapping into feature space corresponds to the function \(\phi:\R^d\ra\R^\infty\)
5.3 Basic Kernel Operations in Feature Space
Norm of a Point
Note
\(\lv\phi(\x)\rv^2=\phi(\x)^T\phi(\x)=K(\x,\x)\)
Distance between Points
The squared distance between two points \(\phi(\x_i)\) and \(\phi(\x_j)\) can be computed as
Note
\(\lv\phi(\x_i)-\phi(\x_j)\rv^2=\lv\phi(\x_i)\rv^2+\lv\phi(\x_j)\rv^2-2\phi(\x_i)^T\phi(\x_j)\) \(=K(\x_i,\x_i)+K(\x_j,\x_j)-2K(\x_i,\x_j)\)
which implies that the distance is
The kernel value can be considered as a measure of the similarity between two points, as
Thus, the more the distance \(\lv\phi(\x_i)-\phi(\x_j)\rv\) between the two points in feature space, the less the kernel value, that is, the less the similarity.
Mean in Feature Space
Because we do not, in general, have access to \(\phi(\x_i)\), we cannot explicity compute the mean point in feature space
which implies that
Note
\(\dp\lv\mmu_\phi\rv^2=\frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^nK(\x_i,\x_j)\)
Total Variance in Feature Space
Note
\(\lv\phi(\x_i)-\mmu_\phi\rv^2=\lv\phi(\x_i)\rv^2-2\phi(\x_i)^T\mmu_\phi+\lv\mmu_\phi\rv^2\)
\(\dp=K(\x_i,\x_i)-\frac{2}{n}\sum_{j=1}^nK(\x_i,\x_j)+\frac{1}{n^2}\sum_{a=1}^n\sum_{b=1}^nK(\x_a,\x_b)\)
That is
Note
\(\dp\sg_\phi^2=\frac{1}{n}\sum_{i=1}^nK(\x_i,\x_i)-\frac{2}{n^2}\sum_{i=1}^n\sum_{j=1}^nK(\x_i,\x_j)\)
Centering in Feature Space
We can center each point in feature space by subtracting the mean from it, as follows:
The centered kernel matrix is given as
where each cell corresponds to the kernel between centered points, that is
The centered kernel matrix can be written campactly as follows:
Note
\(\dp\bar\K=\K-\frac{1}{n}\1_{n\times n}\K-\frac{1}{n}\K\1_{n\times n}\) \(\dp+\frac{1}{n^2}\1_{n\times n}\K\1_{n\times n}=\) \(\dp\bigg(\bs{\rm{I}}-\frac{1}{n}\1_{n\times n}\bigg)\K\bigg(\bs{\rm{I}}-\frac{1}{n}\1_{n\times n}\bigg)\)
Normalizing in Feature Space
The dot product in feature space corresponds to the cosine of the angle between the two mapped points, because
The normalized kernel matrix \(\K_n\) can be computed using only the kernel function \(K\), as
Note
\(\dp\K_n(\x_i,\x_j)=\frac{\phi(\x_i)^T\phi(\x_j)}{\lv\phi(\x_i)\rv\cd\lv\phi(\x_j)\rv}=\) \(\dp\frac{K(\x_i,\x_j)}{\sqrt{K(\x_i,\x_i)\cd K(\x_j,\x_j)}}\)
Let \(\W\) denote the diagonal matrix comprising the diagonal elements of \(\K\):
The normalized kernel matrix can then be expressed compactly as
5.4 Kernels for Complex Objects
5.4.1 Spectrum Kernel for Strings
Consider text or sequence data defined over an alphabet \(\Sg\). The \(l\)-spectrum feature map is the mapping \(\phi:\Sg^*\ra\R^{|\Sg|^l}\) from the set of substrings over \(\Sg\) to the \(|\Sg|^l\)-dimensional space representing the number of occurrences of all possible substrings of length \(l\), defined as
where \(\#(\alpha)\) is the number of occurrences of the \(l\)-length string \(\alpha\) in \(\x\).
The (full) spectrum map is an extension of the \(l\)-spectrum map, obtained by considering all lengths from \(l=0\) to \(l=\infty\), leading to an infinite dimensional feature map \(\phi:\Sg^*\ra\R^\infty\):
where \(\#(\alpha)\) is the number of occurrences of the string \(\alpha\) in \(\x\).
The (\(l\)-)spectrum kernel between two string \(\x_i,\x_j\) is simply the dot product between their (\(l\)-)spectrum maps:
5.4.2 Diffusion Kernels on Graph Nodes
Let :math`S` be some symmetric similarity matrix between nodes of a graph \(G=(V,E)\). Consider the similarity between any two nodes obtained by summing the product of the similarities over walks of length 2:
where
denotes the (column) vector representing the \(i\). Over all pairs of nodes the similarity matrix over walks of length 2, denoted \(\S^{(2)}\), is thus given as the square of the base similarity matrix \(\S\):
In general, if we sum up the product of the base similarities over all \(l\) -length walks between two nodes, we obtain the \(l\)-length similarity matrix \(\S^{(l)}\), which is simply the \(l\)S`, that is,
Power Kernels
The kernel value between any two points is then a dot product in feature space:
For a general walk length \(l\), let \(\K=\S^l\). Consider the eigen-decomposition of \(\S\):
The eigen-decomposition of \(\K\) can be obtained as follows:
Exponential Diffusion Kernel
Instead of fixing the walk length a priori, we can obtain a new kernel between nodes of a graph by considering walks of all possible lengths, but by damping the contribution of longer walks, which leads to the exponential diffusion kernel, defined as
Note
\(\dp\K=\sum_{l=0}^\infty\frac{1}{l!}\beta^l\S^l=\) \(\dp\I+\beta\S+\frac{1}{2!}\beta^2\S^2+\frac{1}{3!}\beta^3\S^3+\cds=\exp\{\beta\S\}\)
where \(\beta\) is a damping factor, and \(\exp\{\beta\S\}\) is the matrix exponential.
Von Neumann Diffusion Kernel
A related kernel based on powers of \(\S\) is the von Neumann diffusion kernel, defined as
Note
\(\dp\K=\sum_{l=0}^\infty\beta^l\S^l\)
Rearranging the terms to obtain a closed form expression for the von Neumann kernel:
For \(\K\) to be a positive semidefinite kernel, all its eigenvalues should be non-negative, which in turn implies that
Further, the inverse matrix \((\I-\beta\Ld)\im\) exists only if