\[ \begin{align}\begin{aligned}\newcommand{\bs}{\boldsymbol} \newcommand{\dp}{\displaystyle} \newcommand{\rm}{\mathrm} \newcommand{\cl}{\mathcal} \newcommand{\pd}{\partial}\\\newcommand{\cd}{\cdot} \newcommand{\cds}{\cdots} \newcommand{\dds}{\ddots} \newcommand{\lag}{\langle} \newcommand{\lv}{\lVert} \newcommand{\ol}{\overline} \newcommand{\od}{\odot} \newcommand{\ra}{\rightarrow} \newcommand{\rag}{\rangle} \newcommand{\rv}{\rVert} \newcommand{\seq}{\subseteq} \newcommand{\td}{\tilde} \newcommand{\vds}{\vdots} \newcommand{\wh}{\widehat}\\\newcommand{\0}{\boldsymbol{0}} \newcommand{\1}{\boldsymbol{1}} \newcommand{\a}{\boldsymbol{\mathrm{a}}} \newcommand{\b}{\boldsymbol{\mathrm{b}}} \newcommand{\c}{\boldsymbol{\mathrm{c}}} \newcommand{\d}{\boldsymbol{\mathrm{d}}} \newcommand{\e}{\boldsymbol{\mathrm{e}}} \newcommand{\f}{\boldsymbol{\mathrm{f}}} \newcommand{\g}{\boldsymbol{\mathrm{g}}} \newcommand{\h}{\boldsymbol{\mathrm{h}}} \newcommand{\i}{\boldsymbol{\mathrm{i}}} \newcommand{\j}{\boldsymbol{j}} \newcommand{\m}{\boldsymbol{\mathrm{m}}} \newcommand{\n}{\boldsymbol{\mathrm{n}}} \newcommand{\o}{\boldsymbol{\mathrm{o}}} \newcommand{\p}{\boldsymbol{\mathrm{p}}} \newcommand{\q}{\boldsymbol{\mathrm{q}}} \newcommand{\r}{\boldsymbol{\mathrm{r}}} \newcommand{\u}{\boldsymbol{\mathrm{u}}} \newcommand{\v}{\boldsymbol{\mathrm{v}}} \newcommand{\w}{\boldsymbol{\mathrm{w}}} \newcommand{\x}{\boldsymbol{\mathrm{x}}} \newcommand{\y}{\boldsymbol{\mathrm{y}}} \newcommand{\z}{\boldsymbol{\mathrm{z}}}\\\newcommand{\A}{\boldsymbol{\mathrm{A}}} \newcommand{\B}{\boldsymbol{\mathrm{B}}} \newcommand{\C}{\boldsymbol{\mathrm{C}}} \newcommand{\D}{\boldsymbol{\mathrm{D}}} \newcommand{\H}{\boldsymbol{\mathrm{H}}} \newcommand{\I}{\boldsymbol{\mathrm{I}}} \newcommand{\K}{\boldsymbol{\mathrm{K}}} \newcommand{\M}{\boldsymbol{\mathrm{M}}} \newcommand{\N}{\boldsymbol{\mathrm{N}}} \newcommand{\P}{\boldsymbol{\mathrm{P}}} \newcommand{\Q}{\boldsymbol{\mathrm{Q}}} \newcommand{\S}{\boldsymbol{\mathrm{S}}} \newcommand{\U}{\boldsymbol{\mathrm{U}}} \newcommand{\W}{\boldsymbol{\mathrm{W}}} \newcommand{\X}{\boldsymbol{\mathrm{X}}} \newcommand{\Y}{\boldsymbol{\mathrm{Y}}} \newcommand{\Z}{\boldsymbol{\mathrm{Z}}}\\\newcommand{\R}{\mathbb{R}}\\\newcommand{\cE}{\mathcal{E}} \newcommand{\cX}{\mathcal{X}} \newcommand{\cY}{\mathcal{Y}}\\\newcommand{\ld}{\lambda} \newcommand{\Ld}{\boldsymbol{\mathrm{\Lambda}}} \newcommand{\sg}{\sigma} \newcommand{\Sg}{\boldsymbol{\mathrm{\Sigma}}} \newcommand{\th}{\theta} \newcommand{\ve}{\varepsilon}\\\newcommand{\mmu}{\boldsymbol{\mu}} \newcommand{\ppi}{\boldsymbol{\pi}} \newcommand{\CC}{\mathcal{C}} \newcommand{\TT}{\mathcal{T}}\\ \newcommand{\bb}{\begin{bmatrix}} \newcommand{\eb}{\end{bmatrix}} \newcommand{\bp}{\begin{pmatrix}} \newcommand{\ep}{\end{pmatrix}} \newcommand{\bv}{\begin{vmatrix}} \newcommand{\ev}{\end{vmatrix}}\\\newcommand{\im}{^{-1}} \newcommand{\pr}{^{\prime}} \newcommand{\ppr}{^{\prime\prime}}\end{aligned}\end{align} \]

Chapter 16 Spectral and Graph Clustering

16.1 Graphs and Matrices

Given a dataset \(\D\) comprising \(n\) points \(\x_i\in\R^d\ (i=1,2,\cds,n)\), let \(\A\) denote the \(n\times n\) symmetric similarity matrix between the points, given as

\[\begin{split}\A = \left(\begin{array}{cccc}a_{11}&a_{12}&\cds&a_{1n}\\ \a_{21}&a_{22}&\cds&a_{1n}\\\vds&\vds&\dds&\vds\\ \a_{n1}&a_{n2}&\cds&a_{nn}\end{array}\right)\end{split}\]

where \(\A(i,j)=a_{ij}\) denotes the similarity or affinity between points \(\x_i\) and \(\x_j\). We require the similarity to be symmetric and non-negative, that is, \(a_{ij}=a_{ji}\) and \(a_{ji}\geq 0\). The matrix \(\A\) may be considered to be a weighted adjacency matrix of the weighted (undirected) graph \(G=(V,E)\), where each vertex is a point and each edge joins a pair of points, that is,

\[ \begin{align}\begin{aligned}V&=\{\x_i|i=1,\cds,n\}\\E&=\{(\x_i,\x_j)|1\leq i,j\leq n\}\end{aligned}\end{align} \]

Further, the similarity matrix \(\A\) gives the weight on each edge, that is, \(a_{ij}\) denotes the weight of the edge \((\x_i,\x_j)\). If all affinities are 0 or 1, then \(\A\) represents the regular adjacency relationship between the vertices.

For a vertex \(\x_i\), let \(d_j\) denote the degree of the vertex, defined as

\[d_i=\sum_{j=1}^n a_{ij}\]

We define the degree matrix \(\Delta\) of graph \(G\) as the \(n\times n\) diagonal matrix:

Note

\(\Delta=\left(\begin{array}{cccc}d_1&0&\cds&0\\0&d_2&\cds&0\\\vds&\vds&\dds&\vds\\0&0&\cds&d_n\end{array}\right)\) \(=\left(\begin{array}{cccc}\sum_{j=1}^na_{1j}&0&\cds&0\\0&\sum_{j=1}^na_{2j}&\cds&0\\\vds&\vds&\dds&\vds\\0&0&\cds&\sum_{j=1}^na_{nj}\end{array}\right)\)

\(\Delta\) can be compactly written as \(\Delta(i,i)=d_i\) for all \(1\leq i\leq n\).

Normalized Adjacency Matrix

The normalized adjacency matrix is obtained by dividing each row of the adjacency matrix by the degree of the corresponding node.

Note

\(\bs{\rm{M}}=\Delta\im\A=\) \(\left(\begin{array}{cccc}\frac{a_{11}}{d_1}&\frac{a_{12}}{d_1}&\cds&\frac{a_{1n}}{d_1}\\\frac{a_{21}}{d_2}&\frac{a_{22}}{d_2}&\cds&\frac{a_{2n}}{d_2}\\\vds&\vds&\dds&\vds\\\frac{a_{n1}}{d_n}&\frac{a_{n2}}{d_n}&\cds&\frac{a_{nn}}{d_n}\end{array}\right)\)

Each element of \(\bs{\rm{M}}\), namely \(m_{ij}\) is also non-negative, as \(m_{ij}=\frac{a_{ij}}{d_i}\geq 0\). Consider the sum of the \(i\)th row in \(\bs{\rm{M}}\), we have

\[\sum_{j=1}^nm_{ij}=\sum_{j=1}^n\frac{a_{ij}}{d_i}=\frac{d_i}{d_i}=1\]

Thus, each row in \(\bs{\rm{M}}\) sums to 1. This implies that 1 is an eigenvalue of \(\bs{\rm{M}}\). In fact, \(\ld=1\) is the largest eigenvalue of \(\bs{\rm{M}}\), and the other eigenvalues satisfy the property that \(|\ld_i|\leq 1\). Also, if \(G\) is connected then the eigenvector corresponding to \(\ld_1\) is \(\u_1=\frac{1}{\sqrt{n}}=(1,1,\cds,1)^T=\frac{1}{\sqrt{n}}\1\). Because \(\bs{\rm{M}}\) is not symmetric, its eigenvectors are not necessarily orthogonal.

Graph Laplacian Matrices

The Laplacian matrix of a graph is defined as

\[\begin{split}\bs{\rm{L}}=\Delta-\A=\left(\begin{array}{cccc}\sum_{j=1}^na_{1j}&0&\cds&0\\ 0&\sum_{j=1}^na_{2j}&\cds&0\\\vds&\vds&\dds&\vds\\ 0&0&\cds&\sum_{j=1}^na_{nj}\end{array}\right) -\left(\begin{array}{cccc}a_{11}&a_{12}&\cds&a_{1n}\\ \a_{21}&a_{22}&\cds&a_{1n}\\\vds&\vds&\dds&\vds\\ \a_{n1}&a_{n2}&\cds&a_{nn}\end{array}\right)\end{split}\]

Note

\(=\left(\begin{array}{cccc}\sum_{j\ne 1}^na_{1j}&-a_{12}&\cds&-a_{1n}\\-a{21}&\sum_{j\ne 2}^na_{2j}&\cds&-a_{2n}\\\vds&\vds&\dds&\vds\\-a_{n1}&-a_{n2}&\cds&\sum_{j\ne n}^na_{nj}\end{array}\right)\)

\(\bs{\rm{L}}\) is a symmetric, positive semidefinite matrix, as for any \(\c\in\R^n\), we have

\[ \begin{align}\begin{aligned}\c^T\bs{\rm{L}}\c&=\c^T(\Delta-\A)\c=\c^T\Delta\c-\c^T\A\c\\&=\sum_{i=1}^nd_ic_i^2-\sum_{i=1}^n\sum_{j=1}^nc_ic_ja_{ij}\\&=\frac{1}{2}\bigg(\sum_{i=1}^nd_ic_i^2-2\sum_{i=1}^n\sum_{j=1}^nc_ic_ja_{ij}+\sum_{j=1}^nd_jc_j^2\bigg)\\&=\frac{1}{2}\bigg(\sum_{i=1}^n\sum_{j=1}^na_{ij}c_i^2-2\sum_{i=1}^n \sum_{j=1}^nc_ic_ja_{ij}+\sum_{i=j}^n\sum_{i=1}^na_{ij}c_j^2\bigg)\\&=\frac{1}{2}\sum_{i=1}^n\sum_{j=1}^na_{ij}(c_i-c_j)^2\\&\geq 0\end{aligned}\end{align} \]

This means that \(\bs{\rm{L}}\) has \(n\) real, non-negative eigenvalues, which can be arranged in decreasing order as follows: \(\ld_1\geq\ld_2\geq\cds\geq\ld_n\geq 0\). Because \(\bs{\rm{L}}\) is symmetric, its eigenvectors are orthonormal. We can observe that the first column (and the first row) is a linear combination of the remaining columns (rows). This implies that the rank of \(\bs{\rm{L}}\) is at most \(n-1\), and the smallest eigenvalue is \(\ld_n=0\), with the corresponding eigenvector given as \(\u_n=\frac{1}{\sqrt{n}}=(1,1,\cds,1)^T=\frac{1}{\sqrt{n}}\1\), provided the graph is connected. If the graph is disconnected, then the number of eigenvalues equal to zero specifies the number of connected components in the graph.

The normalized symmetric Laplacian matrix of a graph is defined as

\[ \begin{align}\begin{aligned}\bs{\rm{L}}^S&=\Delta^{-1/2}\bs{\rm{L}}\Delta^{-1/2}\\&=\bs{\rm{L}}^{-1/2}(\Delta-\A)\Delta^{-1/2}=\Delta^{-1/2}\Delta\Delta^{-1/2}-\Delta^{-1/2}\A\Delta^{-1/2}\\&=\I-\Delta^{-1/2}\A\Delta^{-1/2}\end{aligned}\end{align} \]

Note

\(\bs{\rm{L}}^S=\Delta^{-1/2}\bs{\rm{L}}\Delta^{-1/2}\) \(=\left(\begin{array}{cccc} \frac{\sum_{j\ne 1}a_{1j}}{\sqrt{d_1d_1}}&-\frac{a_{12}}{\sqrt{d_1d_2}}&\cds&-\frac{a_{1n}}{\sqrt{d_1d_n}}\\-\frac{a_{21}}{\sqrt{d_2d_1}}&\frac{\sum_{j\ne 2}a_{2j}}{\sqrt{d_2d_2}}&\cds&-\frac{a_{2n}}{\sqrt{d_2d_n}}\\\vds&\vds&\dds&\vds\\-\frac{a_{n1}}{\sqrt{d_nd_1}}&-\frac{a_{n2}}{\sqrt{d_nd_2}}&\cds&\frac{\sum_{j\ne n}a_{nj}}{\sqrt{d_nd_n}}\end{array}\right)\)

We can hsow that \(\bs{\rm{L}}^S\) is also positive semidefinite because for any \(\c\in\R^d\), we get

\[\c^T\bs{\rm{L}}^s\c=\frac{1}{2}\sum_{i=1}^n\sum_{j=1}^na_{ij} \bigg(\frac{c_i}{\sqrt{d_i}}-\frac{c_j}{\sqrt{d_j}}\bigg)^2\geq 0\]

The first column is also a linear combination of the other columns, which means that \(\bs{\rm{L}}^S\) has rank at most \(n-1\), with the smallest eigenvalue \(\ld_n=0\), and the corresponding eigenvector \(\frac{1}{\sqrt{\sum_id_i}}(\sqrt{d_1},\sqrt{d_2},\cds,\sqrt{d_n})^T=\frac{1}{\sqrt{\sum_id_i}}\Delta^{1/2}\1\). Combined with the fact that \(\bs{\rm{L}}^S\) is positive semidefinite, we conclude that \(\bs{\rm{L}}^S\) has \(n\) (not necessarily distinct) real, positive eigenvalues \(\ld_1\geq\ld_2\geq\cds\geq\ld_n=0\).

The normalized asymmetric Laplacian matrix is defined as

Note

\(\bs{\rm{L}}^a=\Delta\im\bs{\rm{L}}=\Delta\im(\Delta-\A)=\I-\Delta\im\A\) \(=\left(\begin{array}{cccc}\frac{\sum_{j\ne 1}a_{1j}}{d_1}&-\frac{a_{12}}{d_1}&\cds&-\frac{a_{1n}}{d_1}\\-\frac{a_{21}}{d_2}&\frac{\sum_{j\ne 2}a_{2j}}{d_2}&\cds&-\frac{a_{2n}}{d_2}\\\vds&\vds&\dds&\vds\\-\frac{a_{n1}}{d_n}&-\frac{a_{n2}}{d_n}&\cds&\frac{\sum_{j\ne n}a_{nj}}{d_n}\end{array}\right)\)

Consider the eigenvalue equation for the symmetric Laplacian \(\bs{\rm{L}}^S\):

\[ \begin{align}\begin{aligned}\bs{\rm{L}}^S\u&=\ld\u\\\Delta^{-1/2}\bs{\rm{L}}^S\u&=\ld\Delta^{-1/2}\u\\\Delta^{-1/2}(\Delta^{-1/2}\bs{\rm{L}}\Delta^{-1/2})\u&=\ld\Delta^{-1/2}\u\\\Delta\im\bs{\rm{L}}(\Delta^{-1/2}\u)&=\ld(\Delta^{-1/2}\u)\\\bs{\rm{L}}^a\v=\ld\v\end{aligned}\end{align} \]

where \(\v=\Delta^{-1/2}\u\) is an eigenvector of \(\bs{\rm{L}}^a\), and \(\u\) is an eigenvector of \(\bs{\rm{L}}^S\).

16.2 Clustering as Graph Cuts

A k-way cut in a graph is a partitioning or clustering of the vertex set, given as \(\cl{C}=\{C_1,\cds,C_k\}\), such that \(C_i\ne\emptyset\) for all \(i\), \(C_i\cap C_j=\emptyset\) for all \(i, j\), and \(V=\bigcup_ic_i\). We require \(\cl{C}\) to optimize some objective function that cptures the intuition that nodes within a cluster should have high similarity, and nodes from different clusters should have low similarity.

Given a weighted graph \(G\) defined by its similarity matrix, let \(S, T\subseteq V\) be any two subsets of the vertices. We denote by \(W(S,T)\) the sum of the weights on all edges with one vertex in \(S\) and the other in \(T\), given as

Note

\(\dp W(S,T)=\sum_{v_i\in S}\sum_{v_j\in T}a_{ij}\)

Given \(S\subseteq V\), we denote by \(\bar{S}\) the complementary set of vertices, that is, \(\bar{S}=V-S\). A （vertex) cut in a graph is defined as a partitioning of \(V\) into \(S\subset V\) and \(\bar{S}\). The weight of the cut or cut weight is defined as the sum of all the weights on edges between vertices in \(S\) and \(\bar{S}\), given as \(W(S,\bar{S})\).

Given a clustering \(\cl{C}=\{C_1,\cds,C_k\}\) comprising \(k\) clusters, the size of a cluster \(C_i\) is the number of nodes in the cluster, given as \(|C_i|\). The volume of a cluster \(C_i\) is defined as the sum of all the weights on edges with one end in cluster \(C_i\):

Note

\(\dp vol(C_i)=\sum_{v_j\in C_i}d_j=\sum_{v_j\in C_i}\sum_{v_r\in V}a_{jr}=W(C_i,V)\)

Let \(\c_i=\{0,1\}^n\) be the cluster indicator vector that records the cluster membership for cluster \(C_i\), defined as

\[\begin{split}c_{ij}=\left\{\begin{array}{lr}1\quad\rm{if\ }v_j\in C_i\\0\quad\rm{if\ }v_j\notin C_i\end{array}\right.\end{split}\]

Because a clustering creates pairwise disjoint clusters, we immediately have

\[\c_i^T\c_j=0\]

The cluster size can be written as

Note

\(|C_i|=\c_i^T\c_i=\lv\c_i\rv^2\)

\[vol(C_i)=W(C_i,V)=\sum_{v_r\in C_i}d_r=\sum_{v_r\in C_i}c_{ir}d_rc_{ir}= \sum_{r=1}^n\sum_{s=1}^nc_{ir}\Delta_{rs}c_{is}\]

The volume of the cluster can be written as

Note

\(vol(C_i)=\c_i^T\Delta\c_i\)

\[W(C_i,C_i)=\sum_{v_r\in C_i}\sum_{v_s\in C_i}a_{rs}=\sum_{r=1}^n\sum_{s=1}^nc_{ir}a_{rs}c_{is}\]

The sum of internal weights can be written as

Note

\(W(C_i,C_i)=\c_i^T\A\c_i\)

Note

\(\dp W(C_i,\bar{C_i})=\sum_{v_r\in C_i}\sum_{v_s\in V-C_i}a_{rs}=W(C_i,V)-W(C_i,C_i)\) \(=\c_i(\Delta-\A)\c_i=\c_i^T\bs{\rm{L}}\c_i\)

16.2.1 Clustering Objective Functions: Ratio and Normalized Cut

Ratio Cut

Note

\(\dp\min_{\cl{C}}J_{rc}(\cl{C})=\sum_{i=1}^k\frac{W(C_i,\bar{C_i})}{|C_i|}\) \(=\dp\sum_{i=1}^k\frac{\c_i^T\bs{\rm{L}}\c_i}{\c_i^T\c_i}\) \(=\dp\sum_{i=1}^k\frac{\c_i^T\bs{\rm{L}}\c_i}{\lv\c_i\rv^2}\)

ratio cut tries to minimize the sum of the similarities from a cluster \(C_i\) to other points not in the cluster \(\bar{C_i}\), taking into account the size of each cluster.

For binary cluster indicator vectors \(\c_i\), the ratio cut objective is NP-hard. An obvious relaxation is to allow \(\c_i\) to take on any real value.

\[\min_{\cl{C}}J_{rc}(\cl{C})= \sum_{i=1}^k\frac{\c_i^T\bs{\rm{L}}\c_i}{\lv\c_i\rv^2}= \sum_{i=1}^k\bigg(\frac{\c_i}{\lv\c_i\rv}\bigg)^T\bs{\rm{L}} \bigg(\frac{\c_i}{\lv\c_i\rv}\bigg)=\sum_{i=1}^k\u_i^T\bs{\rm{L}}\u_i\]

where \(\u_i=\frac{\c_i}{\lv\c_i\rv}\) is the unit vector in the direction of \(\c_i\in\R^n\).

To incorporate the constraint the \(\u_i^T\u_i=1\), we introduce the Lagrange multiplier \(\ld_i\) for each cluster \(C_i\). We have

\[ \begin{align}\begin{aligned}\frac{\pd}{\pd\u_i}\bigg(\sum_{i=1}^k\u_i^T\bs{\rm{L}}\u_i+\sum_{i=1}^n\ld_i(1-\u_i^T\u_i)\bigg)&=\0\\2\bs{\rm{L}}\u_i-2\ld_i\u_i&=\0\\\bs{\rm{L}}\u_i&=\ld_i\u_i\\\u_i^T\bs{\rm{L}}\u_i&=\u_i^T\ld_i\u_i=\ld_i\end{aligned}\end{align} \]

which in turn implies that to minimize the ratio cut objective, we should choose the \(k\) smallest eigenvalues, and the corresponding eigenvectors, so that

\[ \begin{align}\begin{aligned}\min_{\cl{C}}J_{rc}(\cl{C})&=\u_n^T\bs{\rm{L}}\u_n+\cds+\u_{n-k+1}^T\bs{\rm{L}}\u_{n-k+1}\\&=\ld_n+\cds+\ld_{n-k+1}\end{aligned}\end{align} \]

Normalized Cut

Normalized cut is similar to ratio cut, except that it divides the cut weight of each cluster by the volume of a cluster instead of its size.

Note

\(\dp\min_{\cl{C}}J_{nc}(\cl{C})=\sum_{i=1}^k\frac{W(C_i,\bar{C_i})}{vol(C_i)}\) \(=\dp\sum_{i=1}^k\frac{\c_i^T\bs{\rm{L}}\c_i}{\c_i^T\Delta\c_i}\)

We assume \(\c_i\) to be an arbitrary real vector, and rewrite the normalized cut objective in terms of the normalized symmetrc Laplacian, as follows:

\[ \begin{align}\begin{aligned}\min_{\cl{C}}J_{nc}(\cl{C}) &=\sum_{i=1}^k\frac{\c_i^T\bs{\rm{L}}\c_i}{\c_i^T\Delta\c_i} =\sum_{i=1}^k\frac{\c_i^T(\Delta^{1/2}\Delta^{-1/2})\bs{\rm{L}} (\Delta^{-1/2}\Delta^{1/2})\c_i}{\c_i^T(\Delta^{1/2}\Delta^{1/2})\c_i}\\&=\sum_{i=1}^k\frac{(\Delta^{1/2}\c_i)^T(\Delta^{-1/2}\bs{\rm{L}} \Delta^{-1/2})(\Delta^{1/2}\c_i)}{(\Delta^{1/2}\c_i)^T(\Delta^{1/2}\c_i)}\\&=\sum_{i=1}^k\bigg(\frac{\Delta^{1/2}\c_i}{\lv\Delta^{1/2}\c_i\rv}\bigg)^T \bs{\rm{L}}^S\bigg(\frac{\Delta^{1/2}\c_i}{\lv\Delta^{1/2}\c_i\rv}\bigg) =\sum_{i=1}^k\u_i^T\bs{\rm{L}}^S\u_i\end{aligned}\end{align} \]

The normalized cut objective can also be expressed in terms of the normalized asymmetric Laplacian:

\[ \begin{align}\begin{aligned}\frac{\pd}{\pd\c_i}\bigg(\sum_{j=1}^k\frac{\c_j^T\bs{\rm{L}}\c_j} {\c_j^T\Delta\c_j}\bigg)=\frac{\pd}{\pd\c_i} \bigg(\frac{\c_i^T\bs{\rm{L}}\c_i}{\c_i^T\Delta\c_i}\bigg)&=\0\\\frac{\bs{\rm{L}}\c_i(\c_i^T\Delta\c_i)-\Delta\c_i(\c_i^T\bs{\rm{L}}\c_i)} {(\c_i^T\Delta\c_i)^2}&=0\\\bs{\rm{L}}\c_i&=\bigg(\frac{\c_i^T\bs{\rm{L}}\c_i}{\c_i^T\Delta\c_i}\bigg)\Delta\c_i\\\Delta\im\bs{\rm{L}}\c_i&=\ld_i\c_i\\\bs{\rm{L}}^a\c_i&=\ld_i\c_i\end{aligned}\end{align} \]

16.2.2 Spectral Clustering Algorithm

Eq.(16.23): \(\dp\y_i=\frac{1}{\sqrt{\sum_{j=1}^ku_{n-j+1,i}^2}}(u_{n,i},u_{n-1,i},\cds,u_{n-k+1,i})^T\)

Computational Complexity

The computational complexity of the spectral clustering algorithm is \(O(n^3)\).

16.2.3 Maximization Objectives: Average Cut and Modularity

Average Weight

The average weight objective is defined as

Note

\(\dp\max_{\cl{C}}J_{aw}(\cl{C})=\sum_{i=1}^k\frac{W(C_i,C_i)}{|C_i|}=\sum_{i=1}^k\frac{\c_i^T\A\c_i}{\c_i^T\c_i}\)

Instead of trying to minimize the weights on edges between clusters as in ratio cut, average weight tries to maximize the within cluster weights. The problem of maximizing \(J_{aw}\) for binary cluster indicator vectors is also NP-hard; we can obtain a solution by relaxing the constraint on \(\c_i\), by assuming that it can take on any real values for its elements.

\[\max_{\cl{C}}J_{aw}(\cl{C})=\sum_{i=1}^k\u_i^T\A\u_i\]

We can maximize the objective by selecting the \(k\) largest eigenvalues of \(\A\), and the corresponding eigenvectors

\[\max_{\cl{C}}J_{aw}(\cl{C})=\u_1^T\A\u_1+\cds+\u_k^T\A\u_k=\ld_1+\cds+\ld_k\]

where \(\ld_1\geq\ld_2\geq\cds\ld_n\).

Average Weight and Kernel K-means

If the weighted adjacency matrix \(\A\) represents the kernel value between a pair of points, so that \(a_{ij}=K(\x_i,\x_j)\), then we may use the sum of squared errors objective of kernel K-means for graph clustering.

\[ \begin{align}\begin{aligned}\min_{\cl{C}}J_{sse}(\cl{C})&=\sum_{j=1}^nK(\x_j,\x_j)-\sum_{i=1}^k \frac{1}{|C_i|}\sum_{\x_r\in C_i}\sum_{\x_s\in C_i}K(\x_r,\x_s)\\&=\sum_{j=1}^na_{jj}-\sum_{i=1}^k\frac{1}{|C_i|}\sum_{v_r\in C_i}\sum_{v_s\in C_i}a_{rs}\\&=\sum_{j=1}^na_{jj}-\sum_{i=1}^k\frac{\c_i^T\A\c_i}{\c_i^T\c_i}\\&=\sum_{j=1}^na_{jj}-J_{aw}(\cl{C})\end{aligned}\end{align} \]

We can observe that because \(\sum_{j=1}^na_{jj}\) is independent of the clustering, minimizing the SSE objective is the same as maximizing the average weight objective. In particular, if \(a_{jj}\) represents the linear kernel \(\x_i^T\x_j\) between the nodes, then maximizing the average weight objective is equivalent to minimizing the regular K-means SSE objective.

Modularity

Informally, modularity is defined as the difference between the observed and expected fraction of edges within a cluster. It measures the extent to which nodes of the same type are linked to each other.

Unweighted Graphs

Let us assume for the moment that the graph \(G\) is unweighted, and that \(\A\) is its binary adjacency matrix. The number of edges within a cluster \(C_i\) is given as

\[\frac{1}{2}\sum_{v_r\in C_i}\sum_{v_s\in C_i}a_{rs}\]

where we divide by \(\frac{1}{2}\) because each edge is counted twice in the summation. Over all the clusters, the observed number of edges within the same cluster is given as

\[\frac{1}{2}\sum_{i=1}^k\sum_{v_r\in C_i}\sum_{v_s\in C_i}a_{rs}\]

The probability that one end of an edge is \(v_r\) and the other \(v_s\) is given as

\[p_{rs}=\frac{d_r}{2m}\cd\frac{d_s}{2m}=\frac{d_rd_s}{4m^2}\]

The number of edges between \(v_r\) and \(v_s\) follows a binomial distribution with success probability \(p_{rs}\) over \(2m\) trails (because we are selecting the two ends of \(m\) edges). The expected number of edges between \(v_r\) and \(v_s\) is given as

\[2m\cd p_{rs}=\frac{d_rd_s}{2m}\]

The expected number of edges within a cluster \(C_i\) is then

\[\frac{1}{2}\sum_{v_r\in C_i}\sum_{v_s\in C_i}\frac{d_rd_s}{2m}\]

and the expected number of edges within the same cluster, summed over all \(k\) clusters, is given as

\[\frac{1}{2}\sum_{i=1}^k\sum_{v_r\in C_i}\sum_{v_s\in C_i}\frac{d_rd_s}{2m}\]

where we divide by 2 because each edge is counted twice. The modularity of the clustering \(\cl{C}\) is defined as the difference between the observed and expected fraction of edges within the same cluster.

\[ \begin{align}\begin{aligned}Q&=\frac{1}{2m}\sum_{i=1}^k\sum_{v_r\in C_i}\sum_{v_s\in C_i}\bigg(a_{rs}-\frac{d_rd_s}{2m}\bigg)\\&=\sum_{i=1}^k\sum_{v_r\in C_i}\sum_{v_s\in C_i}\bigg( \frac{a_{rs}}{\sum_{j=1}^nd_j}-\frac{d_rd_s}{(\sum_{j=1}^nd_j)^2}\bigg)\end{aligned}\end{align} \]

Weighted Graphs

Assume that \(\A\) is the weighted adjacency matrix; we interpret the modularity of a clustering as the difference between the observed and expected fraction of weights on edges within the clusters.

\[ \begin{align}\begin{aligned}\sum_{v_r\in C_i}\sum_{v_s\in C_i}a_{rs}=W(C_i,C_i)\\\sum_{v_r\in C_i}\sum_{v_s\in C_i}d_rd_s=\bigg(\sum_{v_r\in C_i}d_r\bigg)\bigg(\sum_{v_s\in C_i}d_s\bigg)=W(C_i,V)^2\\\sum_{j=1}^nd_j=W(V,V)\end{aligned}\end{align} \]

Note

\(\dp \max_{\cl{C}}J_Q(\cl{C})=\sum_{i=1}^k\bigg(\frac{W(C_i,C_i)}{W(V,V)}\) \(\dp-\bigg(\frac{W(C_i,V)}{W(V,V)}\bigg)^2\bigg)\)

We now express the modularity obejctive in matrix terms. We have

\[ \begin{align}\begin{aligned}W(C_i,C_i)=\c_i^T\A\c_i\\W(C_i,V)=\sum_{v_r\in C_i}d_r=\sum_{v_r\in C_i}d_rc_{ir}=\sum_{j=1}^nd_jc_{ij}=\bs{\rm{d}}^T\c_i\\W(V,V)=\sum_{j=1}^nd_j=tr(\Delta)\end{aligned}\end{align} \]

The clustering objective based on modularity can then be written as

\[ \begin{align}\begin{aligned}\max_{\cl{C}}J_Q(\cl{C})&=\sum_{i=1}^k\bigg(\frac{\c_i^T\A\c_i}{tr(\Delta)}- \frac{(\bs{\rm{d}}^T\c_i)^2}{tr(\Delta)^2}\bigg)\\&=\sum_{i=1}^k\bigg(\c_i^T\bigg(\frac{\A}{tr(\Delta)}\bigg)\c_i- \c_i^T\bigg(\frac{\bs{\rm{d}}\cd\bs{\rm{d}}^T}{tr(\Delta)^2}\bigg)\c_i\bigg)\\&=\sum_{i=1}^k\c_i^T\bs{\rm{Q}}\c_i\end{aligned}\end{align} \]

where \(\bs{\rm{Q}}\) is the modularity matrix:

\[\bs{\rm{Q}}=\frac{1}{tr(\Delta)}\bigg(\A-\frac{\bs{\rm{d}}\cd\bs{\rm{d}}^T}{tr(\Delta)}\bigg)\]

We select the \(k\) largest eigenvalues and the corresponding eigenvectors to obtain

\[\max_{\cl{C}}=J_Q(\cl{C})=\u_1^T\bs{\rm{Q}}\u_1+\cds+\u_k^T\bs{\rm{Q}}\u_k=\ld_1+\cds+\ld_k\]

Modularity as Average Weight

We know that each row of \(\bs{\rm{M}}=\Delta\im\A\) sums to 1, that is

\[\sum_{j=1}^nm_{ij}=d_i=1,\forall i=1,\cds,n\]

The modularity matrix can then be written as

\[\bs{\rm{Q}}=\frac{1}{n}\bs{\rm{M}}-\frac{1}{n^2}\1_{n\times n}\]

For large graphs with many nodes, \(n\) is large and the second term practically vanishes, as \(\frac{1}{n^2}\) will be very small. Thus, the modularity matrix can be reasonably approximated as

\[\bs{\rm{Q}}\simeq\frac{1}{n}\bs{\rm{M}}\]

\[\max_{\cl{C}}J_Q(\cl{C})=\sum_{i=1}^k\c_i^T\bs{\rm{Q}}\c_i=\sum_{i=1}^k\c_i^T\bs{\rm{M}}\c_i\]

where we dropped the \(\frac{1}{n}\) factor because it is a constant for a given graph; it only scales the eigenvalues without effecting the eigenvectors.

In conclusion, if we use the normalized adjacency matrix, maximizing the modularity is equivalent to selecting the \(k\) largest eigenvalues and the corresponding eigenvectors of the normalized adjacency matrix \(\bs{\rm{M}}\).

Normalized Modularity as Normalized Cut

Define the normalized modularity objective as follows:

Note

\(\dp\max_{\cl{C}}J_{nQ}(\cl{C})=\sum_{i=1}^k\frac{1}{W(C_i,V)}\) \(\dp\bigg(\frac{W(C_i,C_i)}{W(V,V)}-\bigg(\frac{W(C_i,V)}{W(V,V)}\bigg)^2\bigg)\)

\[ \begin{align}\begin{aligned}J_{nQ}(\cl{C})&=\frac{1}{W(V,V)}\sum_{i=1}^k\bigg(\frac{W(C_i,C_i)}{W(C_i,V)}-\frac{W(C_i,V)}{W(V,V)}\bigg)\\&=\frac{1}{W(V,V)}\bigg(\sum_{i=1}^k\bigg(\frac{W(C_i,C_i)}{W(C_i,V)}\bigg)- -\sum_{i=1}^k\bigg(\frac{W(C_i,V)}{W(V,V)}\bigg)\bigg)\\&=\frac{1}{W(V,V)}\bigg(\sum_{i=1}^k\bigg(\frac{W(C_i,C_i)}{W(C_i,V)}\bigg)-1\bigg)\end{aligned}\end{align} \]

We have

\[ \begin{align}\begin{aligned}(k-1)-W(V,V)J_{nQ}(\cl{C})&=(k-1)-\bigg(\sum_{i=1}^k\bigg(\frac{W(C_i,C_i)}{W(C_i,V)}\bigg)-1\bigg)\\&=k-\sum_{i=1}^k\frac{W(C_i,C_i)}{W(C_i,V)}\\&=\sum_{i=1}^k1-\frac{W(C_i,C_i)}{W(C_i,V)}\\&=\sum_{i=1}^k\frac{W(C_i,V)-W(C_i,C_i)}{W(C_i,V)}\\&=\sum_{i=1}^k\frac{W(C_i,\bar{C_i})}{W(C_i,V)}\\&=\sum_{i=1}^k\frac{W(C_i,\bar{C_i})}{vol(C_i)}\\&=J_{nc}(\cl{C})\end{aligned}\end{align} \]

In other words the normalized cut objective is related to the normalized modularity objective by the following equation:

\[J_{nc}(\cl{C})=(k-1)-W(V,V)\cd J_{nQ}(\cl{C})\]

Since \(W(V,V)\) is a constant for a given graph, we observe that minimizing normalized cut is equivalent to maximizing normalized modularity.

Spectral Clustering Algorithm

The matrix \(\bs{\rm{B}}\) is chosen to be \(\A\) if we are maximizing average weight or \(\bs{\rm{Q}}\) for the modularity objective. Instead of computing the \(k\) smallest eigenvalues we have to select the \(k\) largest eigenvalues and their corresponding eigenvectors. Because both \(\A\) and \(\bs{\rm{Q}}\) can have negative eigenvalues, we must select only the positive eigenvalues.

16.3 Markov Clustering

If node transitions reflect the weights on the edges, then transitions from one node to another within a cluster are much more likely than transitions between nodes from different clusters.

Given the weighted adhacency matrix \(\A\) for a graph \(G\), the normalized adjacency matrix is given as \(\M=\Delta\im\A\). The matrix \(\M\) can be interpreted as the \(n\times n\) transition matrix where the entry \(m_{ij}=\frac{a_{ij}}{d_i}\) can be interpreted as the probability of transitioning or jumpping from node \(i\) to node \(j\) in the graph \(G\). This is because \(\M\) is a row stochastic or Markov matrix. which satisfies the following conditions: (1) elements of the matrix are non-negative, that is, \(m_{ij}\geq 0\), which follows from the fact that \(\A\) is non-negative, and (2) rows of \(\M\) are probability vectors, that is, row elements add to 1, because

\[\sum_{j=1}^nm_{ij}=\sum_{j=1}^n\frac{\a_{ij}}{d_i}=1\]

The matrix \(\M\) is thus the transition matrix for a Markov chain or a Markov random walk on graph \(G\). The Markov chain makese a transition from one node to another at discrete timesteps \(t=1,2,\cds\), with the probability of making a transition from node \(i\) to node \(j\) given as \(m_{ij}\). Let the random variable \(X_t\) denote the state at time \(t\). The Markov property means that the probability distribution of \(X_t\) over the states at time \(t\) depends only on the probability distribution of \(X_{t=1}\), that is,

\[P(X_t=i|X_0,X_1,\cds,X_{t-1})=P(X_t=i|X_{t-1})\]

Further, we assume that the Markov chain is homogeneous, that is, the transition probability

\[P(X_t=j|X_{t-1}=i)=m_{ij}\]

is independent of the time step \(t\).

The transition probability matrix for \(t\) time steps is given as

\[\M^{t-1}\cd\M=\M^t\]

A random walk on \(G\) thus corresponds to taking successive powers of the transition matrix \(\M\). Let \(\ppi_0\) specify the initial state probability vector at time \(t=0\), that is, \(\ppi_{0i}=P(X_0=i)\) is the probability of starting at node \(i\), for all \(i=1,\cds,n\).

\[ \begin{align}\begin{aligned}\ppi_t^T&=\ppi_{t-1}^T\M\\&=(\ppi_{t-2}^T\M)\cd\M=\ppi_{t-2}^T\M^2\\&=(\ppi_{t-3}^T\M^2)\cd\M=\ppi_{t-3}^T\M^3\\&=\vds\\&=\ppi_0^T\M^t\end{aligned}\end{align} \]

Equivalently, taking transpose on both sides, we get

\[\ppi_t=(\M^t)^T\ppi_0=(\M^T)^t\ppi_0\]

The state probability vector thus converges to the dominant eigenvector of \(\M^T\), reflecting the steady-state probability of reaching any node in the graph, regardless of the staring node.

Transition Probability Inflation

We now consider a variation of the random walk, where the probability of transitioning from node \(i\) to \(j\) is inflated by taking each element \(m_{ij}\) to the power \(r\geq 1\). Given a transition matrix \(\M\), define the inflation operator \(\Upsilon\) as follows:

Note

\(\dp\Upsilon(\M,r)=\bigg\{\frac{(m_{ij})^r}{\sum_{a=1}^n(m_{ia})^r}\bigg\}_{i,j=1}^n\)

The net effect of the inflation operator is to increase the higher probability transitions and decrease the lower probability transitions.

16.3.1 Markov Clustering Algorithm

The Markov clustering algorithm (MCL) is an iterative method that interleaves matrix expansion and inflation steps.

The matrix difference is given in terms of the Frobenius norm:

\[\lv\M_t-\M_{t-1}\rv_F=\sqrt{\sum_{i=1}^n\sum_{j=1}^n(\M_t(i,j)-\M_{t-1}(i,j))^2}\]

MCL Graph

The final clusters are found by enumerating the weakly connected components in the directed graph induced by the converged transition matrix \(\M_t\). The directed graph induced by \(\M_t\) is denoted as \(G_t=(V_t,E_t)\). The vertex set is the same as the set of nodes in the original graph, that is \(V_t=V\), and the edge set is given as

\[E_t=\{(i,j)|\M_t(i,j)>0\}\]

In other words, a directed edge \((i,j)\) exists only if node \(i\) can transition to node \(j\) within \(t\) steps of the expansion and inflation process. A node \(j\) is called an attractor if \(\M_t(j,j)>0\), and we say that node \(i\) is attracted to attractor \(j\) if \(\M_t(i,j)>0\). The MCL process yields a set of attractor nodes, \(V_a\subseteq V\), such that other nodes are attracted to at least one attractor in \(V_a\).

Computational Complexity

The computational complexity of the MCL algorithm is \(O(tn^3)\), where \(t\) is the number of iterations until convergence.