\[ \begin{align}\begin{aligned}\newcommand{\bs}{\boldsymbol} \newcommand{\dp}{\displaystyle} \newcommand{\rm}{\mathrm} \newcommand{\cl}{\mathcal} \newcommand{\pd}{\partial}\\\newcommand{\cd}{\cdot} \newcommand{\cds}{\cdots} \newcommand{\dds}{\ddots} \newcommand{\lag}{\langle} \newcommand{\lv}{\lVert} \newcommand{\ol}{\overline} \newcommand{\od}{\odot} \newcommand{\ra}{\rightarrow} \newcommand{\rag}{\rangle} \newcommand{\rv}{\rVert} \newcommand{\seq}{\subseteq} \newcommand{\td}{\tilde} \newcommand{\vds}{\vdots} \newcommand{\wh}{\widehat}\\\newcommand{\0}{\boldsymbol{0}} \newcommand{\1}{\boldsymbol{1}} \newcommand{\a}{\boldsymbol{\mathrm{a}}} \newcommand{\b}{\boldsymbol{\mathrm{b}}} \newcommand{\c}{\boldsymbol{\mathrm{c}}} \newcommand{\d}{\boldsymbol{\mathrm{d}}} \newcommand{\e}{\boldsymbol{\mathrm{e}}} \newcommand{\f}{\boldsymbol{\mathrm{f}}} \newcommand{\g}{\boldsymbol{\mathrm{g}}} \newcommand{\h}{\boldsymbol{\mathrm{h}}} \newcommand{\i}{\boldsymbol{\mathrm{i}}} \newcommand{\j}{\boldsymbol{j}} \newcommand{\m}{\boldsymbol{\mathrm{m}}} \newcommand{\n}{\boldsymbol{\mathrm{n}}} \newcommand{\o}{\boldsymbol{\mathrm{o}}} \newcommand{\p}{\boldsymbol{\mathrm{p}}} \newcommand{\q}{\boldsymbol{\mathrm{q}}} \newcommand{\r}{\boldsymbol{\mathrm{r}}} \newcommand{\u}{\boldsymbol{\mathrm{u}}} \newcommand{\v}{\boldsymbol{\mathrm{v}}} \newcommand{\w}{\boldsymbol{\mathrm{w}}} \newcommand{\x}{\boldsymbol{\mathrm{x}}} \newcommand{\y}{\boldsymbol{\mathrm{y}}} \newcommand{\z}{\boldsymbol{\mathrm{z}}}\\\newcommand{\A}{\boldsymbol{\mathrm{A}}} \newcommand{\B}{\boldsymbol{\mathrm{B}}} \newcommand{\C}{\boldsymbol{\mathrm{C}}} \newcommand{\D}{\boldsymbol{\mathrm{D}}} \newcommand{\H}{\boldsymbol{\mathrm{H}}} \newcommand{\I}{\boldsymbol{\mathrm{I}}} \newcommand{\K}{\boldsymbol{\mathrm{K}}} \newcommand{\M}{\boldsymbol{\mathrm{M}}} \newcommand{\N}{\boldsymbol{\mathrm{N}}} \newcommand{\P}{\boldsymbol{\mathrm{P}}} \newcommand{\Q}{\boldsymbol{\mathrm{Q}}} \newcommand{\S}{\boldsymbol{\mathrm{S}}} \newcommand{\U}{\boldsymbol{\mathrm{U}}} \newcommand{\W}{\boldsymbol{\mathrm{W}}} \newcommand{\X}{\boldsymbol{\mathrm{X}}} \newcommand{\Y}{\boldsymbol{\mathrm{Y}}} \newcommand{\Z}{\boldsymbol{\mathrm{Z}}}\\\newcommand{\R}{\mathbb{R}}\\\newcommand{\cE}{\mathcal{E}} \newcommand{\cX}{\mathcal{X}} \newcommand{\cY}{\mathcal{Y}}\\\newcommand{\ld}{\lambda} \newcommand{\Ld}{\boldsymbol{\mathrm{\Lambda}}} \newcommand{\sg}{\sigma} \newcommand{\Sg}{\boldsymbol{\mathrm{\Sigma}}} \newcommand{\th}{\theta} \newcommand{\ve}{\varepsilon}\\\newcommand{\mmu}{\boldsymbol{\mu}} \newcommand{\ppi}{\boldsymbol{\pi}} \newcommand{\CC}{\mathcal{C}} \newcommand{\TT}{\mathcal{T}}\\ \newcommand{\bb}{\begin{bmatrix}} \newcommand{\eb}{\end{bmatrix}} \newcommand{\bp}{\begin{pmatrix}} \newcommand{\ep}{\end{pmatrix}} \newcommand{\bv}{\begin{vmatrix}} \newcommand{\ev}{\end{vmatrix}}\\\newcommand{\im}{^{-1}} \newcommand{\pr}{^{\prime}} \newcommand{\ppr}{^{\prime\prime}}\end{aligned}\end{align} \]

Chapter 3 Categorical Attributes

3.1 Univariate Analysis

Consider a single categorical attribute, \(X\), with domain \(dom(X)=\{a_1,a_2,\cds,a_m\}\) comprising \(m\) symbolic values. The data \(\D\) is an \(n\times 1\) symbolic data matrix given as

\[\begin{split}\D=\bp X\\\hline x_1\\x_2\\\vds\\x_n \ep\end{split}\]

where each point \(x_i\in dom(X)\).

3.1.1 Bernoulli Variable

When the categorical attribute \(X\) has domain \(\{a_1,a_2\}\), with \(m=2\). We can model \(X\) as a Bernoulli random variables, which takes on two distinct values, 1 and 0, according to the mapping

\[\begin{split}X(v)=\left\{\begin{array}{lr}1\quad\rm{if\ }v=a_1\\0\quad\rm{if\ }v=a_2\end{array}\right.\end{split}\]

The probability mass funciton (PMF) of \(X\) is given as

\[\begin{split}P(X=x)=f(x)=\left\{\begin{array}{lr}p_1\quad\rm{if\ }x=1\\p_0\quad\rm{if\ }x=0\end{array}\right.\end{split}\]

where \(p_1\) and \(p_0\) are the parameters of the distribution, which must satisfy the condition

\[p_1+p_0=1\]

Denote \(p_1=p\), from which it follows that \(p_0=1-p\). The PMF of Bernoulli random variable \(X\) can then be written compactly as

Note

\(P(X=x)=f(x)=p^x(1-p)^{1-x}\)

Mean and Variance

The expected value of \(X\) is given as

Note

\(\mu=E[X]=1\cd p+0\cd(1-p)=0\)

and the variance of \(X\) is given as

\[\sg^2=\rm{var}(X)=E[X^2]-(E[X])^2=(1^2\cd p+0^2\cd (1-p))-p^2=p-p^2\]

which implies

Note

\(\sg^2=p(1-p)\)

Sample Mean and Variance

The sample mean is given as

\[\hat\mu=\frac{1}{n}\sum_{i=1}^nx_i=\frac{n_1}{n}=\hat{p}\]

Let \(n_0=n-n_1\) denote the number of points with \(x_i=0\) in the random sample. The sample variance is given as

\[ \begin{align}\begin{aligned}\hat\sg^2&=\frac{1}{n}\sum_{i=1}^n(x_i-\hat\mu)^2\\&=\frac{n_1}{n}(1-\hat p)^2+\frac{n-n_1}{n}(0-\hat p)^2=\hat p(1-\hat p)^2+(1-\hat p)\hat p^2\\&= \hat p(1-\hat p)(1-\hat p+\hat p)=\hat p(1-\hat p)\end{aligned}\end{align} \]

Bionomial Distribution: Number of Occurrences

Given the Bernoulli variable \(X\), let \(\{x_1,x_2,\cds,x_n\}\) denote a random sample of size \(n\) drawn from \(X\). Let \(N\) be the random variable denoting the number of occurrences of the symbol \(a_1\) (value \(X=1\)) in the sample. \(N\) has a binomial distribution, given as

Note

\(f(N=n_1|n,p)=\bp n\\n_1 \ep p^{n_1}(1-p)^{n-n_1}\)

\(N\) is the sum of the \(n\) independent Bernoulli random variables \(x_i\) IID with \(X\), that is, \(N=\sum_{i=1}^nx_i\). The mean or expected number of occurrences of symbol \(a_1\) is given as

Note

\(\dp\mu_N=E[N]=E\bigg[\sum_{i=1}^nx_i\bigg]=\sum_{i=1}^nE[x_i]=\sum_{i=1}^np=np\)

Because \(x_i\) are all independent, the variance of \(N\) is given as

Note

\(\dp\sg_N^2=\rm{var}(N)=\sum_{i=1}^n\rm{var}(x_i)=\sum_{i=1}^np(1-p)=np(1-p)\)

3.1.2 Multivariate Bernoulli Variable

For the general case when \(dom(X)=\{a_1,a_2,\cds,a_m\}\), we model \(X\) as an \(m\)-dimensional Bernoulli random variable \(\X=(A_1,A_2,\cds,A_m)^T\), where each \(A_i\) is a Bernoulli variable with parameter \(p_i\) denoting the probability of observing symbol \(a_i\). However, \(X\) can assume only one of the symbolic values at any one time. Thus,

\[\X(b)=\e_i\rm{\ if\ }v=a_i\]

where \(\e_i\) is the \(i\)-th standard basis vector in \(m\) dimensions. The range of \(\X\) consists of \(m\) distinct vector values \(\{\e_1,\e_2,\cds,\e_m\}\).

The PMF of \(\X\) is given as

\[p(\X=\e_i)=f(\e_i)=p_i\]

where \(p_i\) is the probability of observing value \(a_i\). These parameters must satisfy the condition

\[\sum_{i=1}^m p_i=1\]

The PMF can be written compactly as follows:

Note

\(\dp P(\X=\e_i)=f(\e_i)=\prod_{j=1}^m p_i^{e_{ij}}\)

Because \(e_{ii}=1\) and \(e_{ij}=0\) for \(j\neq i\), we can see that, as expected, we have

\[f(\e_i)=\prod_{j=1}^mp_j^{e_{ij}}=p_1^{e_{i0}}\times\cds p_i^{e_{ii}}\cds \times p_m^{e_{im}}=p_1^0\times\cds p_i^1\cds\times p_m^0=p_i\]

Mean

The mean or expected value of \(\X\) can be obtained as

\[\begin{split}\mmu=E[\X]=\sum_{i=1}^m\e_if(\e_i)=\sum_{i=1}^m\e_ip_i=\bp 1\\0\\\vds\\0\ep p_1+\cds+\bp 0\\0\\\vds\\1\ep p_m=\bp p_1\\p_2\\\vds\\p_m\ep=\p\end{split}\]

Sample Mean

\[\begin{split}\hat\mmu=\frac{1}{n}\sum_{i=1}^n\x_i=\sum_{i=1}^m\frac{n_i}{n}\e_i= \bp n_1/n\\n_2/n\\\vds\\n_m/n\ep=\bp\hat{p}_1\\\hat{p}_2\\\vds\\\hat{p}_m\ep =\hat\p\end{split}\]

where \(n_i\) is the number of occurrences of the vector value \(\e_i\) in the sample, which is equivalent to the number of occurrences of the symnbol \(a_i\). Furthermore, we have \(\sum_{i=1}^mn_i=n\).

Covariance Matrix

Note

\(\sg_i^2=\rm{var}(A_i)=p_i(1-p_i)\)

Note

\(\sg_{ij}=E[A_iA_j]-E[A_i]\cd E[A_j]=0-p_ip_j=-p_ip_j\)

which follows from the fact taht \(E[A_iA_j]=0\), as \(A_i\) and \(A_j\) cannot both be 1 at the same time.

The \(m\times m\) covariance matrix for \(\X\) is given as

\[\begin{split}\Sg=\bp\sg_1^2&\sg_{12}&\cds&\sg_{1m}\\\sg_{21}&\sg_{2}^2&\cds&\sg_{2m}\\ \vds&\vds&\dds&\vds\\\sg_{1m}&\sg_{2m}&\cds&\sg_m^2\ep= \bp p_1(1-p_1)&-p_1p_2&\cds&-p_1p_m\\-p_1p_2&p_2(1-p_2)&\cds&-p_2p_m\\ \vds&\vds&\dds&\vds\\-p_1p_m&-p_2p_m&\cds&p_m(1-p_m) \ep\end{split}\]

Define \(\\) diagonal matrix:

\[\begin{split}\P=\rm{diag}(\p)=\rm{diag}(p_1,p_2,\cds,p_m)= \bp p_1&0&\cds&0\\0&p_2&\cds&0\\\vds&\vds&\dds&\vds\\0&0&\cds&p_m \ep\end{split}\]

We can compactly write the covariance matrix of \(\X\) as

\[\Sg=\P-\p\cd\p^T\]

Sample Covariance Matrix

\[\hat\Sg=\hat\P-\hat\p\cd\hat\p^T\]

where \(\hat\P=\rm{diag}(\hat\p)\), and \(\hat\p=\hat\mmu=(\hat{p}_1,\hat{p}_2,\cds,\hat{p}_m)^T\) denotes the empirical probability mass function for \(\X\).

Multinomial Distribution: Number of Occurrences

Let \(\{\x_1,\x_2,\cds,\x_n\}\) drawn from \(\X\). Let \(N_i\) be the random variable corresponding to the number of occurrences of symbol \(a_i\) in the sample, and let \(\N=(N_1,N_2,\cds,N_m)^T\) denote the vector random variable corresponding to the joint distribution of the number of occurrences over all the symbols. Then \(\N\) has a multinomial distribution, given as

Note

\(\dp f(\N=(n_1,n_2,\cds,n_m)|\p)=\bp n\\n_1n_2\cds n_m \ep\prod_{i=1}^mp_i^{n_i}\)

The term

\[\begin{split}\bp n\\n_1n_2\cds n_m \ep=\frac{n!}{n_1!n_2!\cds n_m!}\end{split}\]

denotes the number of ways of choosing \(n_i\) occurrences of each symbol \(a_i\) from a sample of size \(n\), with \(\sum_{i=1}^mn_i=n\).

The mean of \(\N\) is given as

\[\begin{split}\mmu_\N=E[\N]=nE[\X]=n\cd\mmu=n\cd\p=\bp np_1\\\vds\\np_m \ep\end{split}\]

and its covariance matrix is given as

\[\begin{split}\Sg_\N=n\cd(\P=\p\p^T)= \bp np_1(1-p_1)&-np_1p_2&\cds&-np_1p_m\\-np_1p_2&np_2(1-p_2)&\cds&-np_2p_m\\ \vds&\vds&\dds&\vds\\-np_1p_m&-np_2p_m&\cds&np_m(1-p_m) \ep\end{split}\]

The sample mean and covariance matrix for \(\N\) are given as

\[\hat\mmu_\N=n\hat\p\quad\hat\Sg_\N=n(\hat\P-\hat\p\hat\p^T)\]

3.2 Bivariate Analysis

Assume that the data comprises two categorical attributes, \(X_1\) and \(X_2\), with

\[ \begin{align}\begin{aligned}dom(X_1)=\{a_{11},a_{12},\cds,a_{1m_1}\}\\dom(X_2)=\{a_{21},a_{22},\cds,a_{2m_2}\}\end{aligned}\end{align} \]

The dataset is now an \(n\times 2\) symbolic data matrix:

\[\begin{split}\D=\bp X_1&X_2\\\hline x_{11}&x_{12}\\x_{21}&x_{22}\\\vds&\vds\\x_{n1}&x_{n2} \ep\end{split}\]

We model \(X_1\) and \(X_2\) as multivariate Bernoulli variables \(\X_1\) and \(\X_2\) with dimensions \(m_1\) and \(m_2\). The probability mass funcitons for \(\X_1\) and \(\X_2\) are given as

\[ \begin{align}\begin{aligned}P(\X_1=\e_{1i})=f_1(\e_{1i})=p_i^1=\prod_{k=1}^{m1}(p_i^1)^{e_{ik}^1}\\P(\X_2=\e_{2j})=f_2(\e_{2j})=p_j^2=\prod_{k=1}^{m2}(p_j^2)^{e_{jk}^2}\end{aligned}\end{align} \]

where

\[\sum_{i=1}^{m1}p_i^1=1\quad\rm{and}\quad\sum_{j=1}^{m2}p_j^2=1\]

The joint distribution of \(\X_1\) and \(\X_2\) is modeled as the \(d\pr=m_1+m_2\) dimensional vector variable \(\X=\bp \X_1,\X_2 \ep\)

\[\begin{split}\X((v_1,v_2)^T)=\bp \X_1(v_1)\\\X_2(v_2) \ep=\bp \e_{1i}\\\e_{2j} \ep\end{split}\]

provided that \(v_1=a_{1i}\) and \(v_2=a_{2j}\). The joint PMF of \(\X\) is given as

\[P(\X=(\e_{1i},\e_{2j})^T)=f(\e_{1i},\e_{2j})=p_{ij}=\prod_{r=1}^{m1}\prod_{s=1}^{m2}p_{ij}^{e_{ir}^1\cd e_{is}^2}\]

where \(p_{ij}\) the probability of observing the symbol pair \((a_{1i},a_{2j})\). The probability paramemters must satisfy \(\sum_{i=1}^{m1}\sum_{j=1}^{m2}p_{ij}=1\). The joint PMF for \(\X\) can be expressed as the \(m_1\times m_2\) matrix

\[\begin{split}\P_{12}=\bp p_{11}&p_{12}&\cds&p_{1m_2}\\p_{21}&p_{22}&\cds& p_{2m_2}\\\vds&\vds&\dds&\vds\\p_{m_11}&p_{m_12}&\cds&p_{m_1m_2} \ep\end{split}\]

Mean

\[\begin{split}\mmu=E[\X]=E\bigg[\bp\X_1\\\X_2\ep\bigg]=\bp E[\X_1]\\E[\X_2] \ep=\bp\mmu_1\\\mmu_2\ep=\bp\p_1\\\p_2\ep\end{split}\]

Sample Mean

\[\begin{split}\hat\mmu=\frac{1}{n}\sum_{i=1}^n\x_i= \frac{1}{n}\bp\sum_{i=1}^{m_1}n_i^1\e_{1i}\\\sum_{j=1}^{m_2}n_j^2\e_{2j}\ep =\frac{1}{n}\bp n_1^1\\\vds\\n_{m_1}^1\\n_1^2\\\vds\\n_{m_2}^2 \ep= \bp\hat{p}_1^1\\\vds\\\hat{p}_{m_1}^1\\\hat{p}_1^2\\\vds\\\hat{p}_{m_2}^2\ep =\bp\hat\p_1\\\hat\p_2\ep=\bp\hat\mmu_1\\\hat\mmu_2\ep\end{split}\]

Covariance Matrix

The covariance matrix for \(\X\) is the \(d\pr\times d\pr=(m_1+m_2)\times(m_1+m_2)\) matrix given as

\[ \begin{align}\begin{aligned}\begin{split}\Sg=\bp \Sg_{11}&\Sg_{12}\\\Sg_{12}^T&\Sg_{22} \ep\end{split}\\\Sg_{11}=\P_1-\p_1\p_1^T\\\Sg_{22}=\P_2-\p_2\p_2^T\end{aligned}\end{align} \]
\[ \begin{align}\begin{aligned}\Sg_{12}&=E[(\X_1-\mmu_1)(\X_2-\mmu_2)^T]\\&=E[\X_1\X_2^T]-E[\X_1]E[\X_2]^T\\&=\P_{12}-\mmu_1\mmu_2^T\\&=\P_{12}-\p_1\p_2^T\end{aligned}\end{align} \]
\[\begin{split}=\bp p_{11}-p_1^1p_1^2&p_{12}-p_1^1p_2^2&\cds&p_{1m_2}-p_1^1p_{m_2}^2\\ p_{21}-p_1^2p_1^2&p_{22}-p_2^1p_2^2&\cds&p_{2m_2}-p_2^1p_{m_2}^2\\ \vds&\vds&\dds&\vds\\p_{m_11}-p_{m_1}^1p_1^2&p_{m_12}-p_{m_1}^1p_2^2&\cds &p_{m_1m_2}-p_{m_1}^1p_{m_2}^2 \ep\end{split}\]

Each row and each column of \(\Sg_{12}\) sums to zero. Consider row \(i\) and column \(j\):

\[ \begin{align}\begin{aligned}\sum_{k=1}^{m_2}(p_{ik}-p_i^1p_k^2)=\bigg(\sum_{k=1}^{m_2}p_{ik}\bigg)-p_i^1=p_i^1-p_i^1=0\\\sum_{k=1}^{m_1}(p_{kj}-p_k^1p_j^2)=\bigg(\sum_{k=1}^{m_1}p_{kj}\bigg)-p_k^2=p_j^2-p_j^2=0\end{aligned}\end{align} \]

Sample Covariance Matrix

The sample covariance matrix is given as

\[\begin{split}\hat\Sg=\bp \hat\Sg_{11}&\hat\Sg_{12}\\\hat\Sg_{12}^T&\hat\Sg_{22} \ep\end{split}\]

where

\[ \begin{align}\begin{aligned}\hat\Sg_{11}&=\hat\P_1-\hat\p_1\hat\p_1^T\\\hat\Sg_{22}&=\hat\P_2-\hat\p_2\hat\p_2^T\\\hat\Sg_{12}&=\hat\P_{12}-\hat\p_1\hat\p_2^T\end{aligned}\end{align} \]

\(\hat\P_{12}\) specifies the empirical joint PMF for \(\X_1\) and \(\X_2\), given as

\[\hat\P_{12}(i,j)=\hat{f}(\e_{1i},\e_{2j})= \frac{1}{n}\sum_{k=1}^nI_{ij}(\x_k)=\frac{n_{ij}}{n}=\hat{p}_{ij}\]

where \(I_{ij}\) is the indicator variable

\[\begin{split}I_{ij}(\x_k)=\left\{\begin{array}{lr}1\quad\rm{if\ }x_{k1}=\e_{1i} \rm{\ and\ }\x_{k2}=\e_{2j}\\0\quad\rm{otherwise}\end{array}\right.\end{split}\]

3.2.1 Attribute Dependence: Contingency Analysis

Testing for the independence of the two categorical random variables \(X_1\) and \(X_2\) can be down via contingency table analysis.

Contingency Table

A contingency table for \(\X_1\) and \(\X_2\) is the \(m_1\times m_2\) matrix of observed counts \(n_{ij}\) for all pairs of values \((\e_{1i},\e_{2j})\) in the given sample of size \(n\), defined as

\[\begin{split}\N_{12}=n\cd\hat\P_{12}=\bp n_{11}&n_{12}&\cds&n_{1m_2}\\ n_{21}&n_{22}&\cds&n_{2m_2}\\\vds&\vds&\dds&\vds\\ n_{m_11}&n_{m_12}&\cds&n_{m_1m_2} \ep\end{split}\]

where \(\hat\P_{12}\) is the empirical joint PMF for \(\X_1\) and \(\X_2\). The contingency table is then augmented with row and column marginal counts, as follows:

\[\begin{split}\N_1=n\cd\hat\p_1=\bp n_1^1\\\vds\\n_{m_1}^1\ep\quad\N_2=n\cd\hat\p_2=\bp n_1^2\\\vds\\n_{m_2}^2 \ep\end{split}\]

Note that the marginal row and column entries and the sample size satisfy the following constraints:

\[n_i^1=\sum_{j=1}^{m_2}n_{ij}\quad n_j^2=\sum_{i=1}^{m_1}n_{ij}\quad n= \sum_{j=1}^{m_1}n_i^1=\sum_{j=1}^{m_2}n_j^2= \sum_{i=1}^{m_1}\sum_{j=1}^{m_2}n_{ij}\]

It is worth noting that both \(\N_1\) and \(\N_2\) have a multinomial distribution with parameters \(\p_1=(p_1^1,\cds,p_{m_1}^1)\) and \(\p_2=(p_1^2,\cds,p_{m_2}^2)\), respectively. Further, \(\N_{12}\) also has a multinomial distribution with parameters \(\P_{12}=\{p_{ij}\}\), for \(1\leq i\leq m_i\) and \(1\leq j\leq m_2\).

\(\bs{\chi^2}\) Statistic and Hypothesis Testing

Under the null hypothesis \(\X_1\) and \(\X_2\) are assumed to be independent, which means that their joint probability mass function is given as

\[\hat{p}_{ij}=\hat{p}_i^1\cd\hat{p}_j^2\]

Under this independence assumption, the expected frequency for each pair of values is given as

\[e_{ij}=n\cd\hat{p}_{ij}=n\cd\hat{p}_i^1\cd\hat{p}_j^2=n\cd\frac{n_i^1}{n}\cd\frac{n_j^2}{n}=\frac{n_i^1n_j^2}{n}\]

The \(\chi^2\) statistic quantifies the difference between observed and expected counts for each pair of values; it is defined as follows:

Note

\(\dp\chi^2=\sum_{i=1}^{m_1}\sum_{j=1}^{m_2}\frac{(n_{ij}-e{ij})^2}{e_{ij}}\)

For the \(\chi^2\) statistic it is known that its sampling distribution follows the chi-squared density function with \(q\) degrees of freedom:

\[f(x|q)=\frac{1}{2^{q/2}\Gamma(q/2)}x^{\frac{q}{2}-1}e^{-\frac{x}{2}}\]

where the gamma function \(\Gamma\) is defined as

\[\Gamma(k>0)=\int_0^\infty x^{k-1}e^{-x}dx\]

The total degrees of freedom is

\[ \begin{align}\begin{aligned}q&=|dom(X_1)|\times|dom(X_2)|-(|dom(X_1)|+|dom(X_2)|)+1\\&=m_1m_2-m_1-m_2+1\\&=(m_1-1)(m_2-1)\end{aligned}\end{align} \]

p-value

The p-value of a statistic is defined as the probability of obtaining a value at least as extreme as the observed value under the null hypothesis. For the \(\chi^2\) statistic computed above, its p-value is defined as follows

Note

p-value\((\chi^2)=P(x\geq\chi^2)=1-F_1(\chi^2)\)

where \(F_q\) is the cumulative \(\chi^2\) probability distribution with \(q\) degrees of freedom.

The null hypothesis is rejected if the p-value is below some significance level, \(\alpha\). The value \(1-\alpha\) is also called the confidence level.

For a given significance level \(\alpha\) (or equivalently, confidence level \(1-\alpha\)), define the corresponding critical value, \(v_\alpha\), of the test statistic as follows:

\[P(x\geq v_\alpha)=1-F_q(v_\alpha)=\alpha,\rm{\ or\ equivalently\ }F_q(v_\alpha)=1-\alpha\]

For the given significance value \(\alpha\), we can find the critical value from the quantile funtion \(F_q\im\):

\[v_\alpha=F_q\im(1-\alpha)\]

An alternative test for rejection of the null hypothesis is to check if \(\chi^2\geq v_\alpha\), as in that case \(P(x\geq\chi^2)\leq P(x\geq v_\alpha)\), and therefore, the p-value of the observed \(\chi^2\) value is bounded above by \(\alpha\), that is, p-value\((\chi^2)\leq\) p-value\((v_\alpha)=\alpha\).

3.3 Multivariate Analysis

For an \(n\times d\) symbolic matrix

\[\begin{split}\D=\bp X_1&X_2&\cds&X_d\\\hline x_{11}&x_{12}&\cds&x_{1d}\\ x_{21}&x_{22}&\cds&x_{2d}\\\vds&\vds&\dds&\vds\\x_{n1}&x_{n2}&\cds&x_{nd}\ep\end{split}\]

The joint distribution is modeled as a \(d\pr=\sum_{j=1}^dm_j\) dimensional vector random variable

\[\begin{split}\X=\bp \X_1\\\vds\\\X_d \ep\end{split}\]

Each categorical data point \(\v=(v_1,v_2,\cds,v_d)^T\) is represented as a \(d\pr\)-dimensional binary vector

\[\begin{split}\X(\v)=\bp\X_1(v_1)\\\vds\\\X_d(v_d)\ep=\bp\e_{1k_1}\\\vds\\\e_{dk_d}\ep\end{split}\]

provided \(v_i=a_{ik_i}\), the \(k_i\)th symbol of \(X_i\).

Mean

The mean and sample mean for \(\X\) are given as

\[\begin{split}\mmu=E[\X]=\bp\mmu_1\\\vds\\\mmu_d\ep=\bp\p_1\\\vds\\\p_d\ep\quad \hat\mmu=\bp\hat\mmu_1\\\vds\\\hat\mmu_d\ep=\bp\hat\p_1\\\vds\\\hat\p_d\ep\end{split}\]

The covariance matrix for \(\X\), and its estimate from the sample, are given as the \(d\pr\times d\pr\) matrices:

\[\begin{split}\Sg=\bp\Sg_{11}&\Sg_{12}&\cds&\Sg_{1d}\\\Sg_{12}^T&\Sg_{12}&\cds&\Sg_{2d}\\ \vds&\vds&\dds&\vds\\\Sg_{1d}^T&\Sg_{2d}^T&\cds&\Sg_{dd}\ep\quad \hat\Sg=\bp\hat\Sg_{11}&\hat\Sg_{12}&\cds&\hat\Sg_{1d}\\ \hat\Sg_{12}^T&\hat\Sg_{12}&\cds&\hat\Sg_{2d}\\\vds&\vds&\dds&\vds\\ \hat\Sg_{1d}^T&\hat\Sg_{2d}^T&\cds&\hat\Sg_{dd}\ep\end{split}\]

where

\[\Sg_{ij}=\P_{ij}-\p_i\p_j^T\quad\hat\Sg_{ij}=\hat\P_{ij}-\hat\p_i\hat\p_j^T\]

3.3.1 Multiway Contingency Analysis

The empirical joint probability mass function for \(\X\) is

\[\hat{f}(\e_{1i_1},\e_{2i_2},\cds,\e_{di_d})= \frac{1}{n}\sum_{k=1}^nI_{i_1i_2\cds i_d}(\x_k)= \frac{n_{i_1i_2\cds i_d}}{n}=\hat{p}_{i_1i_2\cds i_d}\]

where \(I_{i_1i_2\cds i_d}\) is the indicator variable

\[\begin{split}I_{i_1i_2\cds i_d}(\x_k)=\left\{\begin{array}{lr}1\quad\rm{if\ } x_{k1}=\e_{1i_1},x_{k2}=\e_{2i_2},\cds,x_{kd}=\e_{di_d}\\ 0\quad\rm{otherwise}\end{array}\right.\end{split}\]

Using the notation \(\i=(i_1,i_2,\cds,i_d)\) to denote the index tuple, we can write the joint empirical PMF as the \(d\)-dimensional matrix \(\hat\P\) of size \(m_1\times m_2\times\cds\times m_d=\prod_{i=1}^dm_i\), given as

\[\hat\P(\i)=\{\hat{p}_\i\}\rm{\ for\ all\ index\ tuples\ }\i,\rm{\ with\ }1\leq i_1\leq m_1,\cds,1\leq i_d\leq m_d\]

where \(\hat{p}_\i=\hat{p}_{i_1i_2\cds i_d}\). The \(d\)-dimensional contingency table is then given as

\[\N=n\times\hat\P=\{n_\i\}\rm{\ for\ all\ index\ tuples\ }\i,\rm{\ with\ }1\leq i_1\leq m_1,\cds,1\leq i_d\leq m_d\]

where \(n_\i=n_{i_1i_2\cds i_d}\). The contingency table is augmented with the marginal count vectors \(\N_i\) for all \(d\) attributes \(\X_i\):

\[\begin{split}\N_i=n\hat\p_i=\bp n_1^i\\\vds\\n_{m_i}^i \ep\end{split}\]

\(\bs{\chi^2}\)-Test

The null hypothesis \(H_0\) is that the attributes are \(d\)-way independent. The alternative hypothesis \(H_1\) is that they are dependent in some way.

The expected number of occurrences of the symbol tuple \((a_{1i_1},a_{2i_2},\cds,a_{di_d})\) is given as

\[e_\i=n\cd\hat\p_\i=n\cd\prod_{j=1}^d\hat{p}_{i_j}^j=\frac{n_{i_1}^1n_{i_2}^2\cds n_{i_d}^d}{n^{d-1}}\]

The chi-squared statistic measures the difference between the observed counts \(n_\i\) and the expected count \(e_\i\):

Note

\(\dp\chi^2=\sum_\i\frac{(n_\i-e_\i)^2}{e_\i}=\sum_{i_1=1}^{m_1}\sum_{i_2=1}^{m_2}\cds\sum_{i_d=1}^{m_d}\) \(\dp\frac{(n_{i_1,i_2,\cds,i_d}-e_{i_1,i_2,\cds,i_d})^2}{e_{i_1,i_2,\cds,i_d}}\)

The total number of degrees of freedom is given as

\[ \begin{align}\begin{aligned}q&=\prod_{i=1}^d|dom(X_i)|-\sum_{i=1}^d|dom(X_i)|+(d-1)\\&=\bigg(\prod_{i=1}^dm_i\bigg)-\bigg(\sum_{i=1}^dm_i\bigg)+d-1\end{aligned}\end{align} \]

3.4 Distance and Angle

With the modeling of categorical attributes as multivariate Bernoulli variables, it is possible to compute the distance or the angle between any two points \(\x_i\) and \(\x_j\):

\[\begin{split}\x_i=\bp\e_{1i_1}\\\vds\\\e_{di_d}\ep\quad\x_j=\bp\e_{1j_1}\\\vds\\\e_{dj_d}\ep\end{split}\]

The number of matching values \(s\)

\[s=\x_i^T\x_j=\sum_{k=1}^d(\e_{ki_k)^T\e_{kj_k)\]

The norm of each point

\[\lv\x_i\rv^2=\x_i^T\x_i=d\]

Euclidean Distance

\[\lv\x_i-\x_j\rv=\sqrt{\x_i^T\x_i-2\x_i\x_j+\x_j^T\x_j}=\sqrt{2(d-s)}\]

Hamming Distance

\[\delta_H(\x_i,\x_j)=d-s=\frac{1}{2}\lv\x_i-\x_j\rv^2\]

Cosine Similarity

\[\cos\th=\frac{\x_i^T\x_j}{\lv\x_i\rv\cd\lv\x_j\rv}=\frac{s}{d}\]

Jaccard Coefficient

The Jaccard Coefficient is defined as the ratio of the number of matching values to the number of distinct values that appear in both \(\x_i\) and \(\x_j\), across the \(d\) attributes:

\[J(\x_i,\x_j)=\frac{s}{2(d-s)+s}=\frac{s}{2d-s}\]

3.5 Discretization

Discretization, also called binning, converts numeric attributes into categorical ones. Formally, given a numeric attribute \(X\), and a random sample \(\{x_i\}_{i=1}^n\) of size \(n\) drawn from \(X\), the discretization task is to divide the value range of \(X\) into \(k\) consecutive intervals, also called bins, by finding \(k-1\) boundary values \(v_1,v_2,\cds,v_{k-1}\) that yield the \(k\) intervals:

\[[x_\min,v_1],(v_1,v_2],\cds,(v_{k-1},x_\max]\]

where the extremes of the range of \(X\) are given as

\[x_\min=\min_i\{x_i\}\quad x_\max=\max_i\{x_i\}\]

Equal-Width Intervals

\[w=\frac{x_\max-x_\min}{k}\]

The \(i\)th interval boundary is given as

\[v_i=x_\min+iw,\rm{\ for\ }i=1,\cds,k-1\]

Equal-Frequency Intervals

\[v_i=\hat{F}\im(i/k)\rm{\ for\ }i=1,\cds,k-1\]