Chapter 3 Categorical Attributes
3.1 Univariate Analysis
Consider a single categorical attribute, \(X\), with domain \(dom(X)=\{a_1,a_2,\cds,a_m\}\) comprising \(m\) symbolic values. The data \(\D\) is an \(n\times 1\) symbolic data matrix given as
where each point \(x_i\in dom(X)\).
3.1.1 Bernoulli Variable
When the categorical attribute \(X\) has domain \(\{a_1,a_2\}\), with \(m=2\). We can model \(X\) as a Bernoulli random variables, which takes on two distinct values, 1 and 0, according to the mapping
The probability mass funciton (PMF) of \(X\) is given as
where \(p_1\) and \(p_0\) are the parameters of the distribution, which must satisfy the condition
Denote \(p_1=p\), from which it follows that \(p_0=1-p\). The PMF of Bernoulli random variable \(X\) can then be written compactly as
Note
\(P(X=x)=f(x)=p^x(1-p)^{1-x}\)
Mean and Variance
The expected value of \(X\) is given as
Note
\(\mu=E[X]=1\cd p+0\cd(1-p)=0\)
and the variance of \(X\) is given as
which implies
Note
\(\sg^2=p(1-p)\)
Sample Mean and Variance
The sample mean is given as
Let \(n_0=n-n_1\) denote the number of points with \(x_i=0\) in the random sample. The sample variance is given as
Bionomial Distribution: Number of Occurrences
Given the Bernoulli variable \(X\), let \(\{x_1,x_2,\cds,x_n\}\) denote a random sample of size \(n\) drawn from \(X\). Let \(N\) be the random variable denoting the number of occurrences of the symbol \(a_1\) (value \(X=1\)) in the sample. \(N\) has a binomial distribution, given as
Note
\(f(N=n_1|n,p)=\bp n\\n_1 \ep p^{n_1}(1-p)^{n-n_1}\)
\(N\) is the sum of the \(n\) independent Bernoulli random variables \(x_i\) IID with \(X\), that is, \(N=\sum_{i=1}^nx_i\). The mean or expected number of occurrences of symbol \(a_1\) is given as
Note
\(\dp\mu_N=E[N]=E\bigg[\sum_{i=1}^nx_i\bigg]=\sum_{i=1}^nE[x_i]=\sum_{i=1}^np=np\)
Because \(x_i\) are all independent, the variance of \(N\) is given as
Note
\(\dp\sg_N^2=\rm{var}(N)=\sum_{i=1}^n\rm{var}(x_i)=\sum_{i=1}^np(1-p)=np(1-p)\)
3.1.2 Multivariate Bernoulli Variable
For the general case when \(dom(X)=\{a_1,a_2,\cds,a_m\}\), we model \(X\) as an \(m\)-dimensional Bernoulli random variable \(\X=(A_1,A_2,\cds,A_m)^T\), where each \(A_i\) is a Bernoulli variable with parameter \(p_i\) denoting the probability of observing symbol \(a_i\). However, \(X\) can assume only one of the symbolic values at any one time. Thus,
where \(\e_i\) is the \(i\)-th standard basis vector in \(m\) dimensions. The range of \(\X\) consists of \(m\) distinct vector values \(\{\e_1,\e_2,\cds,\e_m\}\).
The PMF of \(\X\) is given as
where \(p_i\) is the probability of observing value \(a_i\). These parameters must satisfy the condition
The PMF can be written compactly as follows:
Note
\(\dp P(\X=\e_i)=f(\e_i)=\prod_{j=1}^m p_i^{e_{ij}}\)
Because \(e_{ii}=1\) and \(e_{ij}=0\) for \(j\neq i\), we can see that, as expected, we have
Mean
The mean or expected value of \(\X\) can be obtained as
Sample Mean
where \(n_i\) is the number of occurrences of the vector value \(\e_i\) in the sample, which is equivalent to the number of occurrences of the symnbol \(a_i\). Furthermore, we have \(\sum_{i=1}^mn_i=n\).
Covariance Matrix
Note
\(\sg_i^2=\rm{var}(A_i)=p_i(1-p_i)\)
Note
\(\sg_{ij}=E[A_iA_j]-E[A_i]\cd E[A_j]=0-p_ip_j=-p_ip_j\)
which follows from the fact taht \(E[A_iA_j]=0\), as \(A_i\) and \(A_j\) cannot both be 1 at the same time.
The \(m\times m\) covariance matrix for \(\X\) is given as
Define \(\\) diagonal matrix:
We can compactly write the covariance matrix of \(\X\) as
Sample Covariance Matrix
where \(\hat\P=\rm{diag}(\hat\p)\), and \(\hat\p=\hat\mmu=(\hat{p}_1,\hat{p}_2,\cds,\hat{p}_m)^T\) denotes the empirical probability mass function for \(\X\).
Multinomial Distribution: Number of Occurrences
Let \(\{\x_1,\x_2,\cds,\x_n\}\) drawn from \(\X\). Let \(N_i\) be the random variable corresponding to the number of occurrences of symbol \(a_i\) in the sample, and let \(\N=(N_1,N_2,\cds,N_m)^T\) denote the vector random variable corresponding to the joint distribution of the number of occurrences over all the symbols. Then \(\N\) has a multinomial distribution, given as
Note
\(\dp f(\N=(n_1,n_2,\cds,n_m)|\p)=\bp n\\n_1n_2\cds n_m \ep\prod_{i=1}^mp_i^{n_i}\)
The term
denotes the number of ways of choosing \(n_i\) occurrences of each symbol \(a_i\) from a sample of size \(n\), with \(\sum_{i=1}^mn_i=n\).
The mean of \(\N\) is given as
and its covariance matrix is given as
The sample mean and covariance matrix for \(\N\) are given as
3.2 Bivariate Analysis
Assume that the data comprises two categorical attributes, \(X_1\) and \(X_2\), with
The dataset is now an \(n\times 2\) symbolic data matrix:
We model \(X_1\) and \(X_2\) as multivariate Bernoulli variables \(\X_1\) and \(\X_2\) with dimensions \(m_1\) and \(m_2\). The probability mass funcitons for \(\X_1\) and \(\X_2\) are given as
where
The joint distribution of \(\X_1\) and \(\X_2\) is modeled as the \(d\pr=m_1+m_2\) dimensional vector variable \(\X=\bp \X_1,\X_2 \ep\)
provided that \(v_1=a_{1i}\) and \(v_2=a_{2j}\). The joint PMF of \(\X\) is given as
where \(p_{ij}\) the probability of observing the symbol pair \((a_{1i},a_{2j})\). The probability paramemters must satisfy \(\sum_{i=1}^{m1}\sum_{j=1}^{m2}p_{ij}=1\). The joint PMF for \(\X\) can be expressed as the \(m_1\times m_2\) matrix
Mean
Sample Mean
Covariance Matrix
The covariance matrix for \(\X\) is the \(d\pr\times d\pr=(m_1+m_2)\times(m_1+m_2)\) matrix given as
Each row and each column of \(\Sg_{12}\) sums to zero. Consider row \(i\) and column \(j\):
Sample Covariance Matrix
The sample covariance matrix is given as
where
\(\hat\P_{12}\) specifies the empirical joint PMF for \(\X_1\) and \(\X_2\), given as
where \(I_{ij}\) is the indicator variable
3.2.1 Attribute Dependence: Contingency Analysis
Testing for the independence of the two categorical random variables \(X_1\) and \(X_2\) can be down via contingency table analysis.
Contingency Table
A contingency table for \(\X_1\) and \(\X_2\) is the \(m_1\times m_2\) matrix of observed counts \(n_{ij}\) for all pairs of values \((\e_{1i},\e_{2j})\) in the given sample of size \(n\), defined as
where \(\hat\P_{12}\) is the empirical joint PMF for \(\X_1\) and \(\X_2\). The contingency table is then augmented with row and column marginal counts, as follows:
Note that the marginal row and column entries and the sample size satisfy the following constraints:
It is worth noting that both \(\N_1\) and \(\N_2\) have a multinomial distribution with parameters \(\p_1=(p_1^1,\cds,p_{m_1}^1)\) and \(\p_2=(p_1^2,\cds,p_{m_2}^2)\), respectively. Further, \(\N_{12}\) also has a multinomial distribution with parameters \(\P_{12}=\{p_{ij}\}\), for \(1\leq i\leq m_i\) and \(1\leq j\leq m_2\).
\(\bs{\chi^2}\) Statistic and Hypothesis Testing
Under the null hypothesis \(\X_1\) and \(\X_2\) are assumed to be independent, which means that their joint probability mass function is given as
Under this independence assumption, the expected frequency for each pair of values is given as
The \(\chi^2\) statistic quantifies the difference between observed and expected counts for each pair of values; it is defined as follows:
Note
\(\dp\chi^2=\sum_{i=1}^{m_1}\sum_{j=1}^{m_2}\frac{(n_{ij}-e{ij})^2}{e_{ij}}\)
For the \(\chi^2\) statistic it is known that its sampling distribution follows the chi-squared density function with \(q\) degrees of freedom:
where the gamma function \(\Gamma\) is defined as
The total degrees of freedom is
p-value
The p-value of a statistic is defined as the probability of obtaining a value at least as extreme as the observed value under the null hypothesis. For the \(\chi^2\) statistic computed above, its p-value is defined as follows
Note
p-value\((\chi^2)=P(x\geq\chi^2)=1-F_1(\chi^2)\)
where \(F_q\) is the cumulative \(\chi^2\) probability distribution with \(q\) degrees of freedom.
The null hypothesis is rejected if the p-value is below some significance level, \(\alpha\). The value \(1-\alpha\) is also called the confidence level.
For a given significance level \(\alpha\) (or equivalently, confidence level \(1-\alpha\)), define the corresponding critical value, \(v_\alpha\), of the test statistic as follows:
For the given significance value \(\alpha\), we can find the critical value from the quantile funtion \(F_q\im\):
An alternative test for rejection of the null hypothesis is to check if \(\chi^2\geq v_\alpha\), as in that case \(P(x\geq\chi^2)\leq P(x\geq v_\alpha)\), and therefore, the p-value of the observed \(\chi^2\) value is bounded above by \(\alpha\), that is, p-value\((\chi^2)\leq\) p-value\((v_\alpha)=\alpha\).
3.3 Multivariate Analysis
For an \(n\times d\) symbolic matrix
The joint distribution is modeled as a \(d\pr=\sum_{j=1}^dm_j\) dimensional vector random variable
Each categorical data point \(\v=(v_1,v_2,\cds,v_d)^T\) is represented as a \(d\pr\)-dimensional binary vector
provided \(v_i=a_{ik_i}\), the \(k_i\)th symbol of \(X_i\).
Mean
The mean and sample mean for \(\X\) are given as
The covariance matrix for \(\X\), and its estimate from the sample, are given as the \(d\pr\times d\pr\) matrices:
where
3.3.1 Multiway Contingency Analysis
The empirical joint probability mass function for \(\X\) is
where \(I_{i_1i_2\cds i_d}\) is the indicator variable
Using the notation \(\i=(i_1,i_2,\cds,i_d)\) to denote the index tuple, we can write the joint empirical PMF as the \(d\)-dimensional matrix \(\hat\P\) of size \(m_1\times m_2\times\cds\times m_d=\prod_{i=1}^dm_i\), given as
where \(\hat{p}_\i=\hat{p}_{i_1i_2\cds i_d}\). The \(d\)-dimensional contingency table is then given as
where \(n_\i=n_{i_1i_2\cds i_d}\). The contingency table is augmented with the marginal count vectors \(\N_i\) for all \(d\) attributes \(\X_i\):
\(\bs{\chi^2}\)-Test
The null hypothesis \(H_0\) is that the attributes are \(d\)-way independent. The alternative hypothesis \(H_1\) is that they are dependent in some way.
The expected number of occurrences of the symbol tuple \((a_{1i_1},a_{2i_2},\cds,a_{di_d})\) is given as
The chi-squared statistic measures the difference between the observed counts \(n_\i\) and the expected count \(e_\i\):
Note
\(\dp\chi^2=\sum_\i\frac{(n_\i-e_\i)^2}{e_\i}=\sum_{i_1=1}^{m_1}\sum_{i_2=1}^{m_2}\cds\sum_{i_d=1}^{m_d}\) \(\dp\frac{(n_{i_1,i_2,\cds,i_d}-e_{i_1,i_2,\cds,i_d})^2}{e_{i_1,i_2,\cds,i_d}}\)
The total number of degrees of freedom is given as
3.4 Distance and Angle
With the modeling of categorical attributes as multivariate Bernoulli variables, it is possible to compute the distance or the angle between any two points \(\x_i\) and \(\x_j\):
The number of matching values \(s\)
The norm of each point
Euclidean Distance
Hamming Distance
Cosine Similarity
Jaccard Coefficient
The Jaccard Coefficient is defined as the ratio of the number of matching values to the number of distinct values that appear in both \(\x_i\) and \(\x_j\), across the \(d\) attributes:
3.5 Discretization
Discretization, also called binning, converts numeric attributes into categorical ones. Formally, given a numeric attribute \(X\), and a random sample \(\{x_i\}_{i=1}^n\) of size \(n\) drawn from \(X\), the discretization task is to divide the value range of \(X\) into \(k\) consecutive intervals, also called bins, by finding \(k-1\) boundary values \(v_1,v_2,\cds,v_{k-1}\) that yield the \(k\) intervals:
where the extremes of the range of \(X\) are given as
Equal-Width Intervals
The \(i\)th interval boundary is given as
Equal-Frequency Intervals