\[ \begin{align}\begin{aligned}\newcommand{\bs}{\boldsymbol} \newcommand{\dp}{\displaystyle} \newcommand{\rm}{\mathrm} \newcommand{\cl}{\mathcal} \newcommand{\pd}{\partial}\\\newcommand{\cd}{\cdot} \newcommand{\cds}{\cdots} \newcommand{\dds}{\ddots} \newcommand{\lag}{\langle} \newcommand{\lv}{\lVert} \newcommand{\ol}{\overline} \newcommand{\od}{\odot} \newcommand{\ra}{\rightarrow} \newcommand{\rag}{\rangle} \newcommand{\rv}{\rVert} \newcommand{\seq}{\subseteq} \newcommand{\td}{\tilde} \newcommand{\vds}{\vdots} \newcommand{\wh}{\widehat}\\\newcommand{\0}{\boldsymbol{0}} \newcommand{\1}{\boldsymbol{1}} \newcommand{\a}{\boldsymbol{\mathrm{a}}} \newcommand{\b}{\boldsymbol{\mathrm{b}}} \newcommand{\c}{\boldsymbol{\mathrm{c}}} \newcommand{\d}{\boldsymbol{\mathrm{d}}} \newcommand{\e}{\boldsymbol{\mathrm{e}}} \newcommand{\f}{\boldsymbol{\mathrm{f}}} \newcommand{\g}{\boldsymbol{\mathrm{g}}} \newcommand{\h}{\boldsymbol{\mathrm{h}}} \newcommand{\i}{\boldsymbol{\mathrm{i}}} \newcommand{\j}{\boldsymbol{j}} \newcommand{\m}{\boldsymbol{\mathrm{m}}} \newcommand{\n}{\boldsymbol{\mathrm{n}}} \newcommand{\o}{\boldsymbol{\mathrm{o}}} \newcommand{\p}{\boldsymbol{\mathrm{p}}} \newcommand{\q}{\boldsymbol{\mathrm{q}}} \newcommand{\r}{\boldsymbol{\mathrm{r}}} \newcommand{\u}{\boldsymbol{\mathrm{u}}} \newcommand{\v}{\boldsymbol{\mathrm{v}}} \newcommand{\w}{\boldsymbol{\mathrm{w}}} \newcommand{\x}{\boldsymbol{\mathrm{x}}} \newcommand{\y}{\boldsymbol{\mathrm{y}}} \newcommand{\z}{\boldsymbol{\mathrm{z}}}\\\newcommand{\A}{\boldsymbol{\mathrm{A}}} \newcommand{\B}{\boldsymbol{\mathrm{B}}} \newcommand{\C}{\boldsymbol{\mathrm{C}}} \newcommand{\D}{\boldsymbol{\mathrm{D}}} \newcommand{\H}{\boldsymbol{\mathrm{H}}} \newcommand{\I}{\boldsymbol{\mathrm{I}}} \newcommand{\K}{\boldsymbol{\mathrm{K}}} \newcommand{\M}{\boldsymbol{\mathrm{M}}} \newcommand{\N}{\boldsymbol{\mathrm{N}}} \newcommand{\P}{\boldsymbol{\mathrm{P}}} \newcommand{\Q}{\boldsymbol{\mathrm{Q}}} \newcommand{\S}{\boldsymbol{\mathrm{S}}} \newcommand{\U}{\boldsymbol{\mathrm{U}}} \newcommand{\W}{\boldsymbol{\mathrm{W}}} \newcommand{\X}{\boldsymbol{\mathrm{X}}} \newcommand{\Y}{\boldsymbol{\mathrm{Y}}} \newcommand{\Z}{\boldsymbol{\mathrm{Z}}}\\\newcommand{\R}{\mathbb{R}}\\\newcommand{\cE}{\mathcal{E}} \newcommand{\cX}{\mathcal{X}} \newcommand{\cY}{\mathcal{Y}}\\\newcommand{\ld}{\lambda} \newcommand{\Ld}{\boldsymbol{\mathrm{\Lambda}}} \newcommand{\sg}{\sigma} \newcommand{\Sg}{\boldsymbol{\mathrm{\Sigma}}} \newcommand{\th}{\theta} \newcommand{\ve}{\varepsilon}\\\newcommand{\mmu}{\boldsymbol{\mu}} \newcommand{\ppi}{\boldsymbol{\pi}} \newcommand{\CC}{\mathcal{C}} \newcommand{\TT}{\mathcal{T}}\\ \newcommand{\bb}{\begin{bmatrix}} \newcommand{\eb}{\end{bmatrix}} \newcommand{\bp}{\begin{pmatrix}} \newcommand{\ep}{\end{pmatrix}} \newcommand{\bv}{\begin{vmatrix}} \newcommand{\ev}{\end{vmatrix}}\\\newcommand{\im}{^{-1}} \newcommand{\pr}{^{\prime}} \newcommand{\ppr}{^{\prime\prime}}\end{aligned}\end{align} \]

Chapter 13 Representative-based Clustering

Given a dataset \(\D\) with \(n\) points \(\x_i\) in a \(d\)-dimensional space, and given the number of desired clusters \(k\), the goal of representative-based clustering is to partition the dataset into \(k\) groups or clusters, which is called a clustering and is denoted as \(\cl{C}=\{C_1,C_2,\cds,C_k\}\). Further, for each cluster \(C_i\) there exists a representative point that summarizes the cluster, a common choice being the mean (also called the centroid) \(\mmu_i\) of all points in the cluster, that is,

\[\mmu_i=\frac{1}{n_i}\sum_{x_j\in C_i}\x_j\]

where \(n_i=|C_i|\) is the number of points in cluster \(C_i\).

The exact number of ways of partitioning \(n\) points into \(k\) nonempty and disjoint parts is given by the Stirling numbers of second kind, given as

\[\begin{split}S(n,k)=\frac{1}{k!}\sum_{t=0}^k(-1)^t\bp k\\t \ep(k-t)^n\end{split}\]

13.1 K-Means Algorithm

Given a clustering \(\cl{C}=\{C_1,C_2,\cds,C_k\}\) we need some scoring function that evaluates its quality or goodness. This sum of squared errors scoring function is defined as

Note

\(SSE(\cl{c})=\sum_{i=1}^k\sum_{\x_j\in C_i}\lv\x_j-\mmu_i\rv^2\)

The goal is to find the clustering that minimizes the SSE scores:

\[\cl{C}^*=\arg\min_{\cl{C}}\{SSE(\cl{C})\}\]

K-means employs a greedy iterative approach to find a clustering that minimizes the SSE objective.

The cluster assignment step take \(O(nkd)\) time because for each of the \(n\) points we have to compute its distance to each of the \(k\) clusters, which takes \(d\) operations in \(d\) dimensions. The centroid re-computation step takes \(O(nd)\) time because we have to add at total of \(n\) \(d\)-dimensional points. Assuming that there are \(t\) iterations, the total time for K-means is given as \(O(tnkd)\). In terms of the I/O cost it requires \(O(t)\) full database scans, because we have to read the entire database in each iteration.

13.2 Kernel K-Means

Assume for the moment that all points \(\x_i\in\D\) have been mapped to their corresponding images \(\phi(\x_i)\) in feature space. Let \(\K=\{K(\x_i,\x_j)\}_{i,j=1,\cds,n}\) denote the \(n\times n\) matrix, where \(K(\x_i,\x_j)=\phi(\x_i)^T\phi(\x_j)\). Let \(\{C_1,\cds,C_k\}\) specify the partitioning of the \(n\) points into \(k\) clusters, and let the corresponding cluster means in feature space be given as \(\{\mmu_1^\phi,\cds,\mmu_k^\phi\}\), where

\[\mmu_i^\phi=\frac{1}{n_i}\sum_{\x_j\in C_i}\phi(\x_j)\]

is the mean of cluster \(C_i\) in feature space, with \(n_i=|C_i|\).

In feature space, the kernel K-means sum of squared errors objective can be written as

\[ \begin{align}\begin{aligned}\min_{\cl{C}}SSE(\cl{C})&=\sum_{i=1}^k\sum_{\x_j\in C_i}\lv\phi(\x_j)-\mmu_i^\phi\rv^2\\&=\sum_{i=1}^k\sum_{\x_j\in C_i}\lv\phi(\x_j)\rv^2-2\phi(\x_j)^T\mmu_i^\phi+\lv\mmu_i\rv^2\\&=\sum_{i=1}^k\bigg(\bigg(\sum_{\x_j\in C_i}\lv\phi(\x_j)\rv^2\bigg)-2n_i \bigg(\frac{1}{n_i}\sum_{\x_j\in C_i}\phi(\x_j)\bigg)^T\mmu_i^\phi+ n_i\lv\mmu_i^\phi\rv^2\bigg)\\&=\bigg(\sum_{i=1}^k\sum_{\x_j\in C_i}\phi(\x_j)^T\phi(\x_j)\bigg)-\bigg(\sum_{i=1}^k n_i\lv\mmu_i^\phi\rv^2\bigg)\\&=\sum_{i=1}^k\sum_{\x_j\in C_i}K(\x_j,\x_j)-\sum_{i=1}^k\frac{1}{n_i} \sum_{\x_a\in C_i}\sum_{\x_b\in C_i}K(\x_a,\x_b)\\&=\sum_{j=1}^nK(\x_j,\x_j)-\sum_{i=1}^k\frac{1}{n_i}\sum_{\x_a\in C_i}\sum_{\x_b\in C_i}K(\x_a,\x_b)\end{aligned}\end{align} \]

Consider the distance of a point \(\phi(\x_j)\) to the mean \(\mmu_i^\phi\) in feature space, which can be computed as

\[ \begin{align}\begin{aligned}\lv\phi(\x_j)-\mmu_i^\phi\rv^2&=\lv\phi(\x_j)\rv^2-2\phi(\x_j)^T\mmu_i^\phi+\lv\mmu_i^\phi\rv^2\\&=\phi(\x_j)^T\phi(\x_j)-\frac{2}{n_i}\sum_{\x_a\in C_i}\phi(\x_j)^T \phi(\x_a)+\frac{1}{n^2}\sum_{\x_a\in C_i}\sum_{\x_b\in C_i} \phi(\x_a)^T\phi(\x_b)\\&=K(\x_j,\x_j)-\frac{2}{n_i}\sum_{\x_a\in C_i}K(\x_a,\x_j)+\frac{1}{n_i^2} \sum_{\x_a\in C_i}\sum_{\x_b\in C_i}K(\x_a,\x_b)\end{aligned}\end{align} \]

In the cluster assignment step of kernel K-means, we assign a point to the closest cluster mean as follows:

\[ \begin{align}\begin{aligned}C^*(\x_j)&=\arg\min_i\{\lv\phi(\x_j)-\mmu_i^\phi\rv^2\}\\&=\arg\min_i\bigg\{K(\x_j,\x_j)-\frac{2}{n_i}\sum_{\x_a\in C_i}K(\x_a,\x_j)+ \frac{1}{n^2}\sum_{\x_a\in C_i}\sum_{\x_b\in C_i}K(\x_a,\x_b)\bigg\}\\&=\arg\min_i\bigg\{\frac{1}{n_i^2}\sum_{\x_a\in C_i}\sum_{\x_b\in C_i} K(\x_a,\x_b)-\frac{2}{n_i}\sum_{\x_a\in C_i}K(\x_a,\x_j)\bigg\}\end{aligned}\end{align} \]

The fraction of points reassigned to a different cluster in the current iteration is given as

\[\frac{n-\sum_{i=1}^k|C_i^T\cap C_i^{t-1}|}{n}=1-\frac{1}{n}\sum_{i=1}^k|C_i^T\cap C_i^{t-1}|\]

Computational Complexity

The total computational complexity of kernel K-means is \(O(tn^2)\), where \(t\) is the number of iterations until convergence. The I/O complexity is \(O(t)\) scans of the kernel matrix \(\K\).

13.3 Expectation-Maximization Clustering

Let \(\D\) consist of \(n\) points \(\x_j\) in \(d\)-dimensional space \(\R^d\). Let \(X_a\) denote the random variable corresponding to the \(a\)th attribute. Let \(\X=(X_1,X_2,\cds,X_d)\) denote the vector random variable across the \(d\)-attributes, with \(\x_j\) being a data sample from \(\X\).

Gaussian Mixture Model

Assume that each cluster \(C_i\) is characterized by a multivariate normal distribution, that is,

Note

\(\dp f_i(\x)=f(\x|\mmu_i,\Sg_i)=\frac{1}{(2\pi)^{\frac{d}{2}}|\Sg_i|^{\frac{1}{2}}}\) \(\dp\exp\bigg\{-\frac{(\x-\mmu_i)^T\Sg_i\im(\x-\mmu_i)}{2}\bigg\}\)

where the cluster mean \(\mmu_i\in\R^d\) and covariance matrix \(\Sg_i\in\R^{d\times d}\) are both unknown parameters. \(f_i(\x)\) is the probability density at \(\x\) attributable to cluster \(C_i\). We assume that the probability density function of \(\X\) is given as a Gaussian mixture model over all the \(k\) cluster normals, defined as

Note

\(\dp f(\x)=\sum_{i=1}^kf_i(\x)P(C_i)=\sum_{i=1}^kf(\x|\mmu_i,\Sg_i)P(C_i)\)

where the prior probabilities \(P(C_i)\) are called the mixture parameters, which must satisfy the condition

\[\sum_{i=1}^kP(C_i)=1\]

We write the set of all the model parameters compactly as

\[\bs\theta=\{\mmu_1,\Sg_1,P(C_1),\cds,\mmu_k,\Sg_k,P(C_k)\}\]

Maximum Likelihood Estimation

Given the dataset \(\D\), we define the likelihood of \(\bs\th\) as the conditional probability of the data \(\D\) given the model parameters \(\bs\th\), denoted as \(P(\D|\bs\th)\).

\[p(\D|\bs\th)=\prod_{j=1}^nf(\x_j)\]

The goal of maximum likelihood estimation (MLE) is to choose the parameters \(\bs\th\) that maximize the likelihood

\[\bs\th^*=\arg\max_{\bs\th}\{P(\D|\bs\th)\}\]

It is typical the maximize the log of the likelihood function

\[\bs\th^*=\arg\max_{\bs\th}\{\ln P(\D|\bs\th)\}\]

where the log-likelihood function is given as

\[\ln P(\D|\bs\th)=\sum_{j=1}^n\ln f(\x_j)=\sum_{j=1}^n\ln\bigg(\sum_{i=1}^kf(\x_j|\mmu_i,\Sg_i)P(C_i)\bigg)\]

We can use the expectation-maximization (EM) approach for finding the maximum likelihood estimates for the parameters \(\bs\th\). EM is a two-step iterative approach that starts from an initial guess for the parameters \(\bs\th\). Given the current estimates for \(\bs\th\), in the expectation step EM computes the cluster posterior probabilities \(P(C_i|\x_j)\) via the Bayes theorem:

\[P(C_i|\x_j)=\frac{P(C_i\rm{\ and\ }\x_j)}{P(\x_j)}=\frac{P(\x_j|C_i)P(C_i)}{\sum_{a=1}^kP(\x_j|C_a)P(C_a)}\]

Because each cluster is modeled as a multivariate normal distribution, the probability of \(\x_j\) given cluster \(C_i\) can be obtained by considering a small interval \(\epsilon>0\) centered at \(\x_j\), as follows:

\[P(\x_j|C_i)\simeq 2\epsilon\cd f(\x_j|\mmu_i,\Sg_i)=2\epsilon\cd f_i(\x_j)\]

The posterior probability of \(C_i\) given \(\x_j\) is thus given as

Note

\(\dp P(C_i|\x_j)=\frac{f_i(\x_j)\cd P(C_i)}{\sum_{a=1}^kf_a(\x_j)\cd P(C_a)}\)

and \(P(C_i|\x_j)\) can be considered as the weight or contribution of the point \(\x_j\) to cluster \(C_i\). Next, in the maximization step, using the weights \(P(C_i|\x_j)\) EM re-estimates \(\bs\th\), for each cluster \(C_i\). The re-estimated mean is given as the weighted average of all the points, the re-estimated covariance matrix is given as the weighted covariance over all pairs of dimensions, and the re0estimated prior probability for each cluster is given as the fraction of weights that contribute to that cluster.

13.3.1 EM in One Dimension

Consider a dataset \(\D\) consisting of a single attribute \(X\), where each point \(x_j\in\R\) (\(j=1,\cds,n\)) is a random sample from \(X\). For the mixture model, we use univariate normals for each cluster:

\[f_i(x)=f(x|\mu_i,\sg_i^2)=\frac{1}{\sqrt{2\pi}\sg_i}\exp\bigg\{-\frac{(x-\mu_i)^2}{2\sg_i^2}\bigg\}\]

with the cluster parameters \(\mu_i,\sg_i^2\), and \(P(C_i)\).

Initialization

For each cluster \(C_i\), with \(i=1,2,\cds,k\), we can randomly initialize the cluster parameters \(\mu,\sg_i^2\), and \(P(C_i)\).

Expectation Step

The posterior probabilities are computed as

\[P(C_i|x_j)=\frac{f(x_j|\mu_i,\sg_i^2)\cd P(C_i)}{\sum_{a=1}^kf(x_j|\mu_a,\sg_a^2)\cd P(C_a)}\]

For convenience, we use the notation \(w_{ij}=P(C_i|x_j)\), and let

\[\w_i=(w_{i1},\cds,w_{in})^T\]

denote the weight vector for cluster \(C_i\) across all the \(n\) points.

Maximization Step

The re-estimated value for the cluster mean, \(\mu_i\), is computed as the weighted mean of all the points:

\[\mu_i=\frac{\sum_{j=1}^nw_{ij}\cd x_j}{\sum_{j=1}^nw_{ij}}\]

In terms of the weight vector \(\w_i\) and the attribute vector \(X=(x_1,x_2,\cds,x_n)^T\), we can write as

\[\mu_i=\frac{\w_i^TX}{\w_i^T\1}\]

The re-estimated value of the cluster variance is computed as the weighted variance across all the points:

\[\sg_i^2=\frac{\sum_{j=1}^nw_{ij}(x_j-\mu_i)^2}{\sum_{j=1}^nw_{ij}}\]

Let \(\bar{X}_i=X-\mu_i\1=(x_1-\mu_i,x_2-\mu_i,\cds,x_n-\mu_i)^T=\) \((\bar{x}_{i1},\bar{x}_{i2},\cds,\bar{x}_{in})^T\) be the centered attribute vector for cluster \(C_i\), and let \(\bar{X}_i^s\) be the squared vector given as \(\bar{X}_i^s=(\bar{x}_{i1}^2,\cds,\bar{x}_{in}^2)^T\). The variance can be expressed compactly as

\[\sg_i^2=\frac{\w_i^T\bar{X}_i^s}{\w_i^T\1}\]

The prior probability of cluster \(C_i\) is re-estimated as the fraction of the total weight belonging to \(C_i\), computed as

\[P(C_i)=\frac{\sum_{j=1}^nw_{ij}}{\sum_{a=1}^k\sum_{j=1}^nw_{aj}}= \frac{\sum_{j=1}^nw_{ij}}{\sum_{j=1}^n1}=\frac{\sum_{j=1}^nw_{ij}}{n}\]

where we made use of the fact that

\[\sum_{i=1}^kw_{ij}=\sum_{i=1}^kP(C_i|x_j)=1\]

In vector notation the prior probability can be written as

\[P(C_i)=\frac{\w_i^T\1}{n}\]

Iteration

Starting from an initial set of values for the cluster parameters \(\mu_i,\sg_i^2\), and \(P(C_i)\) for all \(i=1,\cds,k\), the EM algorithm applies the expectation step to compute the weights \(w_{ij}=P(C_i|x_j)\).

13.3.2 EM in \(d\) Dimensions

For each cluster \(C_i\), we now need to estimate the \(d\)-dimensional mean vector:

\[\mmu_i=(\mu_{i1},\mu_{i2},\cds,\mu_{id})^T\]

and the \(d\times d\) covariance matrix:

\[\begin{split}\Sg_i=\bp (\sg_1^i)^2&\sg_{12}^i&\cds&\sg_{id}^i\\ \sg_{21}^i&(\sg_2^i)^2&\cds&\sg_{2d}^i\\\vds&\vds&\dds&\vds\\ \sg_{d1}^i&\sg_{d2}^i&\cds&(\sg_d^i)^2 \ep\end{split}\]

One simplification is to assume that all dimensions are independent, which leads to a diagonal covariance matrix:

\[\begin{split}\Sg_i=\bp (\sg_1^i)^2&0&\cds&0\\0&(\sg_2^i)^2&\cds&0\\\vds&\vds&\dds&\vds\\0&0&\cds&(\sg_d^i)^2 \ep\end{split}\]

Initialization

For each cluster \(C_i\), with \(i=1,2,\cds,k\), we can randomly initialize the cluster parameters \(\mmu,\Sg_i\), and \(P(C_i)\).

Expectation Step

\[w_{ij}=P(C_i|\x_j)=\frac{f_i(\x_j)\cd P(C_i)}{\sum_{a=1}^kf_a(\x_j)\cd P(C_a)}\]

Maximization Step

The mena \(\mmu_i\) for cluster \(C_i\) can be estimated as

\[\mmu_i=\frac{\sum_{j-1}^nw_{ij}\cd\x_j}{\sum_{j=1}^nw_{ij}}=\frac{\D^T\w_i}{\w_i^T\1}\]

Let \(\bar\D_i=\D-\1\cd\mmu_i^T\) be the centered data matrix for cluster \(C_i\). Let \(\bar\x_{ji}=\x_j-\mmu_i\in\R^d\) denote the \(j\)th centered point in \(\bar\D_i\). We can express \(\Sg_i\) as

\[\Sg_i=\frac{\sum_{j=1}^nw_{ij}\bar\x_{ji}\bar\x_{ji}^T}{\w_i^T\1}\]

The covariance between dimensions \(X_a\) and \(X_b\) is estimated as

\[\sg_{ab}^i=\frac{\sum_{j=1}^nw_{ji}(x_{ja}-\mu_{ia})(x_{jb}-\mu_{ib})}{\sum_{j=1}^nw_{ij}}\]

The prior probability \(P(C_i)\) for each cluster is the same as in the one-dimensional case, given as

\[P(C_i)=\frac{\sum_{j=1}^nw_{ij}}{n}=\frac{\w_i^T\1}{n}\]

EM Clustering Algorithm

Computational Complexity

The computational complexity of the EM method is \(O(t(kd^3+nkd^2))\), where \(t\) is the number of iterations. If we use a diagonal covariance matrix, then the complexity is therefore \(O(tnkd)\). The I/O complexity for the EM algorithm is \(O(t)\) complete databases scans because we read the entire set of points in each iteration.

K-means as Specialization of EM

K-menas can be considered as a special case of the EM algorithm, obtained as follows:

\[\begin{split}P(C_i|\x_j)=\left\{\begin{array}{lr}1\quad\rm{if\ }C_i=\arg\min_{C_a} \{\lv\x_j-\mmu_a\rv^2\}\\0\quad\rm{otherwise}\end{array}\right.\end{split}\]

The posterior probability \(P(C_i|\x_j)\) is given as

\[P(C_i|\x_j)=\frac{P(\x_j|C_i)P(C_i)}{\sum_{a=1}^kP(\x_j|C_a)P(C_a)}\]

Note

\(P(C_i|\x_j)=\left\{\begin{array}{lr}1\quad\rm{if\ }\x_j\in C_i,\rm{\ i.e.,\ if\ }C_i=\arg\min_{C_a}\{\lv\x_j-\mmu_a\rv^2\}\\0\quad\rm{otherwise}\end{array}\right.\)

13.3.3 Maximum Likelihood Estimation

Estimation of Mean

Estimation of Covariance Matrix

Estimating the Prior Probability: Mixture Parameters

13.3.4 EM Approach

Expectation Step

Maximization Step