Chapter 2 Numeric Attributes
2.1 Univariate Analysis
Univariate analysis focuses on a single attribute at a time; thus the data matrix \(\D\) can be thought of as an \(n\times 1\) matrix, or simply a column vector, given as
\(\dp\D=X=(x_1,x_2,\cds,x_n)^T\)
where \(X\) is the numeric attribute of interest, with \(x_i\in\R\). \(X\) is assumed to be a random variable, with each point \(x_i\ (1\leq i\leq n)\) itself treated as an identity random variable.
Empirical Cumulative Distribution Function
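The empirical cumulative distribution function (CDF) of \(X\) is given as
\(\dp\hat{F}(x)=\frac{1}{n}\sum_{i=1}^nI(x_i\leq x)\)
where
\(\dp I(x_i\leq x)=\left\{\begin{array}{ll}1&\rm{if\ }x_i\leq x\\0&\rm{if\ }x_i>x\end{array}\right.\)
is a binary indicator variable that indicates whether the given condition is satisfied or not. Note that we use the notation \(\hat{F}\) to denote the fact that the empirical CDF is an estimate for the unknown population CDF \(F\).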
Inverse Cumulative Distribution Function
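Define the inverse cumulative distribution function or quantile function for a random variable \(X\) as follows:
\(\dp F^{-1}(q)=\min\{x\mid F(x)\geq q\}\quad\rm{for\ }q\in[0,1]\)
The empirical inverse CDF \(\hat{F}^{-1}\) is obtained by replacing \(F\) with \(\hat{F}\).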
Empirical Probability Mass Function
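The empirical probability mass function (PMF) of \(X\) is given as
\(\dp\hat{f}(x)=P(X=x)=\frac{1}{n}\sum_{i=1}^nI(x_i=x)\)
where
\(\dp I(x_i=x)=\left\{\begin{array}{ll}1&\rm{if\ }x_i=x\\0&\rm{if\ }x_i\neq x\end{array}\right.\)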
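The three empirical functions above translate almost directly into code. The following is a minimal sketch, assuming NumPy is available; the function names `ecdf`, `inverse_ecdf`, and `epmf` and the sample values are illustrative choices, not part of the original text.

```python
import numpy as np

def ecdf(sample, x):
    """Empirical CDF: fraction of sample values that are <= x."""
    return float(np.mean(np.asarray(sample) <= x))

def inverse_ecdf(sample, q):
    """Empirical inverse CDF: smallest sample value x with ecdf(sample, x) >= q.

    Assumes 0 <= q <= 1.
    """
    xs = np.sort(np.asarray(sample))
    # levels[i] = (i + 1) / n: at least this fraction of the sample is <= xs[i]
    levels = np.arange(1, len(xs) + 1) / len(xs)
    return xs[np.searchsorted(levels, q, side="left")]

def epmf(sample, x):
    """Empirical PMF: fraction of sample values exactly equal to x."""
    return float(np.mean(np.asarray(sample) == x))

sample = [1, 2, 2, 3, 5]  # illustrative values
print(ecdf(sample, 2))          # fraction of points <= 2
print(inverse_ecdf(sample, 0.5))  # smallest x whose ECDF reaches 0.5
print(epmf(sample, 2))          # fraction of points equal to 2
```

Note that `inverse_ecdf` always returns one of the observed values; it does not interpolate between neighbouring points.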
2.1.1 Measures of Central Tendency
Mean
The mean, also called the expected value, of a random variable \(X\) is the arithmetic average of the values of \(X\). It provides a one-number summary of the location or central tendency for the distribution of \(X\).
The mean or expected value of a discrete random variable \(X\) is defined as
Note
\(\dp\mu=E[X]=\sum_xx\cd f(x)\)
where \(f(x)\) is the probability mass function of \(X\).
The expected value of a continuous random variable \(X\) is defined as
Note
\(\dp\mu=E[X]=\int_{-\infty}^\infty x\cd f(x)dx\)
where \(f(x)\) is the probability density function of \(X\).
Sample Mean
The sample mean is a statistic, that is, a function \(\hat\mu:\{x_1,x_2,\cds,x_n\}\ra\R\), defined as the average value of the \(x_i\)’s:
Note
\(\dp\hat\mu=\frac{1}{n}\sum_{i=1}^nx_i\)
It serves as an estimator for the unknown mean value \(\mu\) of \(X\).
Sample Mean Is Unbiased
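An estimator \(\hat\th\) is called an unbiased estimator for parameter \(\th\) if \(E[\hat\th]=\th\) for every possible value of \(\th\). The sample mean \(\hat\mu\) is an unbiased estimator for the population mean \(\mu\), as each \(x_i\) is distributed identically to \(X\), so that \(E[x_i]=\mu\) and
\(\dp E[\hat\mu]=E\bigg[\frac{1}{n}\sum_{i=1}^nx_i\bigg]=\frac{1}{n}\sum_{i=1}^nE[x_i]=\frac{1}{n}\sum_{i=1}^n\mu=\mu\)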
Robustness
We say that a statistic is robust if it is not affected by extreme values (such as outliers) in the data. The sample mean is not robust because a single large value (an outlier) can skew the average. A more robust measure is the trimmed mean obtained after discarding a small fraction of extreme values on one or both ends.
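To illustrate the robustness claim, here is a small sketch, assuming NumPy, that compares the sample mean with a trimmed mean when one value is replaced by an outlier. The helper `trimmed_mean`, its parameter `k` (the number of values discarded at each end), and the data are hypothetical choices made for this example.

```python
import numpy as np

def trimmed_mean(sample, k):
    """Mean after discarding the k smallest and the k largest values."""
    xs = np.sort(np.asarray(sample, dtype=float))
    return float(xs[k:len(xs) - k].mean())

clean = [2.0, 3.0, 3.0, 4.0, 5.0, 5.0, 6.0, 7.0, 7.0, 8.0]
dirty = clean[:-1] + [800.0]  # replace the largest value by an outlier

print(np.mean(clean), np.mean(dirty))              # the plain mean shifts a lot
print(trimmed_mean(clean, 1), trimmed_mean(dirty, 1))  # the trimmed mean does not
```

Here one value is trimmed from each end of ten, that is, 10% per side; the right amount of trimming depends on how many extreme values are expected.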
Geometric Interpretation of Sample Mean
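Consider the projection of \(X\) onto the vector of ones \(\bs{1}\in\R^n\); we have
\(\dp\bs{p}=\bigg(\frac{X^T\bs{1}}{\bs{1}^T\bs{1}}\bigg)\cd\bs{1}=\bigg(\frac{\sum_{i=1}^nx_i}{n}\bigg)\cd\bs{1}=\hat\mu\cd\bs{1}\)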
Thus, the sample mean is simply the offset or the scalar projection of \(X\) on the vector \(\bs{1}\):
Note
\(\dp\hat\mu=\rm{proj}_{\bs{1}}(X)=\bigg(\frac{X^T\bs{1}}{\bs{1}^T\bs{1}}\bigg)\)
The sample mean can be used to center the attribute \(X\). Define the centered attribute vector, \(\bar{X}\), as follows:
\(\dp\bar{X}=X-\hat\mu\cd\bs{1}\)
We can see that \(\bs{1}\) and \(\bar{X}\) are orthogonal to each other, since
\(\dp\bs{1}^T\bar{X}=\bs{1}^T(X-\hat\mu\cd\bs{1})=\sum_{i=1}^nx_i-n\cd\hat\mu=0\)
In fact, the subspace containing \(\bar{X}\) is an orthogonal complement of the space spanned by \(\bs{1}\).
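The geometric view can be checked numerically. The sketch below, assuming NumPy and an illustrative attribute vector, computes the sample mean as the scalar projection of \(X\) onto the ones vector and verifies that the centered vector is orthogonal to it.

```python
import numpy as np

X = np.array([2.0, 4.0, 4.0, 6.0, 9.0])  # illustrative attribute values
ones = np.ones_like(X)

# Scalar projection of X onto the ones vector: (X^T 1) / (1^T 1)
mu_hat = (X @ ones) / (ones @ ones)

# Centered attribute vector: X minus its projection onto the ones vector
X_bar = X - mu_hat * ones

print(mu_hat)         # equals the arithmetic average of X
print(ones @ X_bar)   # zero up to floating-point error: the vectors are orthogonal
```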
Median
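The median of a random variable is defined as the value \(m\) such that
\(\dp P(X\leq m)\geq\frac{1}{2}\quad\rm{and}\quad P(X\geq m)\geq\frac{1}{2}\)
In terms of the (inverse) cumulative distribution function, the median is therefore the value \(m\) for which
\(\dp F(m)=0.5\quad\rm{or}\quad m=F^{-1}(0.5)\)
The sample median can be obtained from the empirical CDF or the empirical inverse CDF by computing
\(\dp\hat{F}(m)=0.5\quad\rm{or}\quad m=\hat{F}^{-1}(0.5)\)
The median is robust, as it is not affected very much by extreme values.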
Mode
The mode of a random variable \(X\) is the value at which the probability mass function or the probability density function attains its maximum value, depending on whether \(X\) is discrete or continuous, respectively.
2.1.2 Measures of Dispersion
Range
The value range or simply range of a random variable \(X\) is the difference between the maximum and minimum values of \(X\), given as
\(\dp r=\max\{X\}-\min\{X\}\)
The sample range is a statistic, given as
\(\dp\hat{r}=\max_{i=1}^n\{x_i\}-\min_{i=1}^n\{x_i\}\)
By definition, range is sensitive to extreme values, and thus is not robust.
Interquartile Range
Quartiles are special values of the quantile function that divide the data into four equal parts. A more robust measure of the dispersion of \(X\) is the interquartile range (IQR), defined as
\(\dp\rm{IQR}=F^{-1}(0.75)-F^{-1}(0.25)\)
The sample IQR can be obtained by plugging in the empirical inverse CDF:
\(\dp\widehat{\rm{IQR}}=\hat{F}^{-1}(0.75)-\hat{F}^{-1}(0.25)\)
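The difference in robustness between the range and the IQR shows up on a small example. This is a sketch under stated assumptions: NumPy is available, the quantile is the non-interpolating "smallest value whose empirical CDF reaches \(q\)" rule (which can differ from the default of library routines such as `np.quantile`), and the helper names and data are illustrative.

```python
import numpy as np

def inverse_ecdf(sample, q):
    """Smallest sample value whose empirical CDF is at least q (0 <= q <= 1)."""
    xs = np.sort(np.asarray(sample, dtype=float))
    levels = np.arange(1, len(xs) + 1) / len(xs)
    return xs[np.searchsorted(levels, q, side="left")]

def sample_range(sample):
    """Sample range: maximum minus minimum."""
    return float(np.max(sample) - np.min(sample))

def sample_iqr(sample):
    """Sample IQR from the empirical inverse CDF."""
    return float(inverse_ecdf(sample, 0.75) - inverse_ecdf(sample, 0.25))

clean = [1, 2, 3, 4, 5, 6, 7, 8]
dirty = [1, 2, 3, 4, 5, 6, 7, 100]  # largest value replaced by an outlier

print(sample_range(clean), sample_range(dirty))  # range changes with the outlier
print(sample_iqr(clean), sample_iqr(dirty))      # IQR is unchanged
```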
Variance and Standard Deviation
The variance of a random variable \(X\) provides a measure of how much the values of \(X\) deviate from the mean or expected value of \(X\):
Note
\(\dp\sg^2=\rm{var}(X)=E[(X-\mu)^2]=\) \(\dp\left\{\begin{array}{lr}\dp\sum_x(x-\mu)^2f(x)\quad\rm{if\ }X\rm{\ is\ discrete}\\\dp\int_{-\infty}^\infty(x-\mu)^2f(x)dx\quad\rm{if\ }X\rm{\ is\ continuous}\end{array}\right.\)
The standard deviation, \(\sg\), is defined as the positive square root of the variance, \(\sg^2\).
It is worth noting that variance is in fact the second moment about the mean, corresponding to \(r=2\), which is a special case of the \(r\)th moment about the mean for a random variable \(X\), defined as \(E[(X-\mu)^r]\).
Sample Variance
The sample variance is defined as
Note
\(\dp\hat\sg^2=\frac{1}{n}\sum_{i=1}^n(x_i-\hat\mu)^2\)
The sample standard deviation is given as the positive square root of the sample variance:
\(\dp\hat\sg=\sqrt{\frac{1}{n}\sum_{i=1}^n(x_i-\hat\mu)^2}\)
The standard score, also called the \(z\)-score, of a sample value \(x_i\) is the number of standard deviations the value is away from the mean:
Note
\(\dp z_i=\frac{x_i-\hat\mu}{\hat\sg}\)
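As a quick numerical check of these definitions, the following sketch (assuming NumPy and illustrative data) computes the sample variance with the \(1/n\) normalization, the sample standard deviation, and the \(z\)-scores.

```python
import numpy as np

X = np.array([2.0, 4.0, 4.0, 6.0, 9.0])  # illustrative attribute values
n = len(X)

mu_hat = X.sum() / n
var_hat = ((X - mu_hat) ** 2).sum() / n   # sample variance, dividing by n
sigma_hat = np.sqrt(var_hat)              # sample standard deviation
z = (X - mu_hat) / sigma_hat              # z-score of every value

print(var_hat, sigma_hat)
print(z)
```

NumPy's `np.var` and `np.std` use the same \(1/n\) normalization by default (`ddof=0`), so they agree with `var_hat` and `sigma_hat` here.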
Variance of the Sample Mean
The expected value of the sample mean is simply \(\mu\).
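Further, note that because the random variables \(x_i\) are independent, each with variance \(\sg^2\), we have
\(\dp\rm{var}\bigg(\sum_{i=1}^nx_i\bigg)=\sum_{i=1}^n\rm{var}(x_i)=n\sg^2\)
The variance of the sample mean \(\hat\mu\) can be computed as
\(\dp\rm{var}(\hat\mu)=\rm{var}\bigg(\frac{1}{n}\sum_{i=1}^nx_i\bigg)=\frac{1}{n^2}\rm{var}\bigg(\sum_{i=1}^nx_i\bigg)=\frac{\sg^2}{n}\)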
Bias of Sample Variance
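The sample variance is a biased estimator for the true population variance, \(\sg^2\), that is, \(E[\hat\sg^2]\neq\sg^2\). To see this, write \(\sum_{i=1}^n(x_i-\hat\mu)^2=\sum_{i=1}^n(x_i-\mu)^2-n(\hat\mu-\mu)^2\) and take expectations, using \(E[(x_i-\mu)^2]=\sg^2\) and \(E[(\hat\mu-\mu)^2]=\rm{var}(\hat\mu)=\sg^2/n\):
\(\dp E[\hat\sg^2]=\frac{1}{n}\bigg(n\sg^2-n\cd\frac{\sg^2}{n}\bigg)=\bigg(\frac{n-1}{n}\bigg)\sg^2\)
The expected value of \(\hat\sg^2\) thus differs from the population variance by a factor of \(\frac{n-1}{n}\). However, \(\hat\sg^2\) is asymptotically unbiased, that is, the bias vanishes as \(n\ra\infty\) because
\(\dp\lim_{n\ra\infty}\frac{n-1}{n}=\lim_{n\ra\infty}\bigg(1-\frac{1}{n}\bigg)=1\)
Put differently, as the sample size increases, we have
\(\dp E[\hat\sg^2]=\bigg(\frac{n-1}{n}\bigg)\sg^2\ra\sg^2\quad\rm{as\ }n\ra\infty\)
If we want an unbiased estimate of the population variance, denoted \(\hat\sg_u^2\), we must divide by \(n-1\) instead of \(n\):
\(\dp\hat\sg_u^2=\frac{1}{n-1}\sum_{i=1}^n(x_i-\hat\mu)^2\)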
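The bias factor \(\frac{n-1}{n}\) can be verified exactly on a toy distribution. The sketch below is an illustration, not a proof: it assumes \(X\) is uniform on the hypothetical value set \(\{0,1,2\}\), enumerates every possible sample of size \(n=3\) (all equally likely under independent draws), and averages the two variance estimators using exact rational arithmetic from the standard library.

```python
from fractions import Fraction
from itertools import product

values = [0, 1, 2]   # assumed support of X, each value equally likely
n = 3                # sample size

# Population mean and variance of X
mu = Fraction(sum(values), len(values))
sigma2 = sum((v - mu) ** 2 for v in values) / len(values)

def biased_var(sample):
    """Sample variance with the 1/n normalization."""
    m = Fraction(sum(sample), len(sample))
    return sum((x - m) ** 2 for x in sample) / len(sample)

def unbiased_var(sample):
    """Sample variance with the 1/(n-1) normalization."""
    return biased_var(sample) * Fraction(len(sample), len(sample) - 1)

# All 3^n equally likely samples of size n
samples = list(product(values, repeat=n))
exp_biased = sum(biased_var(s) for s in samples) / len(samples)
exp_unbiased = sum(unbiased_var(s) for s in samples) / len(samples)

print(sigma2, exp_biased, exp_unbiased)
```

The expectation of the \(1/n\) estimator comes out as \(\frac{n-1}{n}\) times the population variance, while the \(1/(n-1)\) estimator recovers the population variance exactly.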
Geometric Interpretation of Sample Variance
Let \(\bar{X}\) denote the centered attribute vector
\(\dp\bar{X}=X-\hat\mu\cd\bs{1}\)
The sample variance can then be written as
Note
\(\dp\hat\sg^2=\frac{1}{n}\lv\bar{X}\rv^2=\frac{1}{n}\bar{X}^T\bar{X}=\frac{1}{n}\sum_{i=1}^n(x_i-\hat\mu)^2\)
Define the degrees of freedom (dof) of a statistical vector as the dimensionality of the subspace that contains the vector. Notice that the centered attribute vector \(\bar{X}=X-\hat\mu\cd\bs{1}\) lies in an \((n-1)\)-dimensional subspace that is an orthogonal complement of the 1-dimensional subspace spanned by the ones vector \(\bs{1}\). Thus, the vector \(\bar{X}\) has only \(n-1\) degrees of freedom, and the unbiased sample variance is simply the mean or expected squared length of \(\bar{X}\) per dimension:
\(\dp\hat\sg_u^2=\frac{\lv\bar{X}\rv^2}{n-1}=\frac{1}{n-1}\sum_{i=1}^n(x_i-\hat\mu)^2\)
2.2 Bivariate Analysis
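In bivariate analysis, we consider two attributes at the same time. The data matrix \(\D\) is then an \(n\times 2\) matrix with columns \(X_1\) and \(X_2\):
\(\dp\D=\bp x_{11}&x_{12}\\x_{21}&x_{22}\\\vds&\vds\\x_{n1}&x_{n2}\ep\)
It can be viewed as \(n\) points or vectors in 2-dimensional space over the attributes \(X_1\) and \(X_2\), that is, \(\x_i=(x_{i1},x_{i2})^T\in\R^2\). Alternatively, it can be viewed as two points or vectors in an \(n\)-dimensional space comprising the points, that is, each column is a vector in \(\R^n\), as follows:
\(\dp X_1=(x_{11},x_{21},\cds,x_{n1})^T\qquad X_2=(x_{12},x_{22},\cds,x_{n2})^T\)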
In the probabilistic view, the column vector \(\X=(X_1,X_2)^T\) is considered a bivariate vector random variable, and the points \(\x_i (1\leq i\leq n)\) are treated as a random sample drawn from \(\X\), that is, \(\x_i\)’s are considered independent and identically distributed as \(\X\).
Empirical Joint Probability Mass Function
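The empirical joint probability mass function for \(\X\) is given as
\(\dp\hat{f}(\x)=P(\X=\x)=\frac{1}{n}\sum_{i=1}^nI(\x_i=\x)\)
where \(\x=(x_1,x_2)^T\) and
\(\dp I(\x_i=\x)=\left\{\begin{array}{ll}1&\rm{if\ }x_{i1}=x_1\rm{\ and\ }x_{i2}=x_2\\0&\rm{otherwise}\end{array}\right.\)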
2.2.1 Measures of Location and Dispersion
Mean
The bivariate mean is the expected value of the vector random variable \(\X\), defined as follows:
\(\dp\mmu=E[\X]=E\bigg[\bp X_1\\X_2\ep\bigg]=\bp E[X_1]\\E[X_2]\ep=\bp\mu_1\\\mu_2\ep\)
The sample mean vector can be computed from the joint empirical PMF
Note
\(\dp\hat\mmu=\sum_\x\x\hat{f}(\x)=\sum_\x\x\bigg(\frac{1}{n}\sum_{i=1}^nI(\x_i=\x)\bigg)=\frac{1}{n}\sum_{i=1}^n\x_i\)
Variance
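The total variance is the sum of the variances of the two attributes, \(\sg_1^2+\sg_2^2\), where \(\sg_j^2\) is the variance of attribute \(X_j\).
The sample total variance is simply
\(\dp\rm{var}(\D)=\hat\sg_1^2+\hat\sg_2^2\)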
2.2.2 Measures of Association
Covariance
The covariance between two attributes \(X_1\) and \(X_2\) provides a measure of the association or linear dependence between them, and is defined as
Note
\(\sg_{12}=E[(X_1-\mu_1)(X_2-\mu_2)]\)
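By linearity of expectation, we have
\(\dp\sg_{12}=E[X_1X_2-X_1\mu_2-X_2\mu_1+\mu_1\mu_2]=E[X_1X_2]-\mu_2E[X_1]-\mu_1E[X_2]+\mu_1\mu_2=E[X_1X_2]-\mu_1\mu_2\)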
which implies
Note
\(\sg_{12}=E[X_1X_2]-E[X_1]E[X_2]\)
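If \(X_1\) and \(X_2\) are independent random variables, then their covariance is zero. This is because if \(X_1\) and \(X_2\) are independent, then we have
\(\dp E[X_1X_2]=E[X_1]\cd E[X_2]\)
which in turn implies that
\(\dp\sg_{12}=0\)
The converse is not true: zero covariance does not imply independence. For example, if \(X_1\) is uniform on \(\{-1,0,1\}\) and \(X_2=X_1^2\), then \(\sg_{12}=0\) even though \(X_2\) is completely determined by \(X_1\).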
The sample covariance between \(X_1\) and \(X_2\) is given as
Note
\(\dp\hat\sg_{12}=\frac{1}{n}\sum_{i=1}^n(x_{i1}-\hat\mu_1)(x_{i2}-\hat\mu_2)\)
Correlation
The correlation between variables \(X_1\) and \(X_2\) is the standardized covariance, obtained by normalizing the covariance with the standard deviation of each variable, given as
\(\dp\rho_{12}=\frac{\sg_{12}}{\sg_1\sg_2}=\frac{\sg_{12}}{\sqrt{\sg_1^2\sg_2^2}}\)
The sample correlation for attributes \(X_1\) and \(X_2\) is given as
Note
\(\dp\hat\rho_{12}=\frac{\hat\sg_{12}}{\hat\sg_1\hat\sg_2}=\) \(\dp\frac{\sum_{i=1}^n(x_{i1}-\hat\mu_1)(x_{i2}-\hat\mu_2)}{\sqrt{\sum_{i=1}^n(x_{i1}-\hat\mu_1)^2}\sqrt{\sum_{i=1}^n(x_{i2}-\hat\mu_2)^2}}\)
Geometric Interpretation of Sample Covariance and Correlation
Let \(\bar{X}_1\) and \(\bar{X}_2\) denote the centered attribute vectors in \(\R^n\), given as follows:
\(\dp\bar{X}_1=X_1-\hat\mu_1\cd\bs{1}\qquad\bar{X}_2=X_2-\hat\mu_2\cd\bs{1}\)
The sample covariance can then be written as
Note
\(\dp\hat\sg_{12}=\frac{\bar{X}_1^T\bar{X}_2}{n}\)
The sample correlation can be written as the cosine of the angle \(\th\) between the two centered attribute vectors:
Note
\(\dp\hat\rho_{12}=\frac{\bar{X}_1^T\bar{X}_2}{\sqrt{\bar{X}_1^T\bar{X}_1}\sqrt{\bar{X}_2^T\bar{X}_2}}=\) \(\dp\frac{\bar{X}_1^T\bar{X}_2}{\lv\bar{X}_1\rv\lv\bar{X}_2\rv}=\) \(\dp\left(\frac{\bar{X}_1}{\lv\bar{X}_1\rv}\right)^T\left(\frac{\bar{X}_2}{\lv\bar{X}_2\rv}\right)=\cos\th\)
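The dot-product form of the sample covariance and the cosine form of the sample correlation can be checked with a few lines of NumPy. This is a minimal sketch with illustrative data; the variable names are not from the original text.

```python
import numpy as np

X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # illustrative attribute values
X2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
n = len(X1)

# Centered attribute vectors
Xb1 = X1 - X1.mean()
Xb2 = X2 - X2.mean()

# Sample covariance: dot product of the centered vectors, divided by n
cov12 = (Xb1 @ Xb2) / n

# Sample correlation: cosine of the angle between the centered vectors
rho12 = (Xb1 @ Xb2) / (np.linalg.norm(Xb1) * np.linalg.norm(Xb2))
theta_deg = np.degrees(np.arccos(rho12))

print(cov12, rho12, theta_deg)
```

Because the normalization constant cancels in the ratio, `rho12` matches `np.corrcoef(X1, X2)[0, 1]` regardless of whether the variances use \(1/n\) or \(1/(n-1)\).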
Covariance Matrix
The variance-covariance information for the two attributes \(X_1\) and \(X_2\) can be summarized in the square \(2\times 2\) covariance matrix, given as
\(\dp\Sg=E[(\X-\mmu)(\X-\mmu)^T]=\bp\sg_1^2&\sg_{12}\\\sg_{21}&\sg_2^2\ep\)
Because \(\sg_{12}=\sg_{21}\), \(\Sg\) is a symmetric matrix.
The total variance of the two attributes is given as the sum of the diagonal elements of \(\Sg\), which is also called the trace of \(\Sg\), given as
\(\dp tr(\Sg)=\sg_1^2+\sg_2^2\)
We immediately have \(tr(\Sg)\geq 0\), because each variance is non-negative.
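The generalized variance of the two attributes also takes the covariance into account, and is given as the determinant of the covariance matrix, denoted \(|\Sg|\) or \(\det(\Sg)\). The generalized variance is non-negative, because
\(\dp|\Sg|=\det(\Sg)=\sg_1^2\sg_2^2-\sg_{12}^2=\sg_1^2\sg_2^2-\rho_{12}^2\sg_1^2\sg_2^2=(1-\rho_{12}^2)\sg_1^2\sg_2^2\)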
Note that \(|\rho_{12}|\leq 1\) implies that \(\rho_{12}^2\leq 1\), which in turn implies that \(\det(\Sg)\geq 0\).
The sample covariance matrix is given as
Note
\(\dp\hat\Sg=\bp\hat\sg_1^2&\hat\sg_{12}\\\hat\sg_{12}&\hat\sg_2^2\ep\)
Note
\(\dp\rm{var}(\D)=tr(\hat\Sg)=\hat\sg_1^2+\hat\sg_2^2\)
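The bivariate quantities above fit together as follows. The sketch assumes NumPy and two illustrative attribute vectors; it builds the \(2\times 2\) sample covariance matrix, then compares its trace and determinant with the total variance and with \((1-\hat\rho_{12}^2)\hat\sg_1^2\hat\sg_2^2\).

```python
import numpy as np

X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # illustrative attribute values
X2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
n = len(X1)

Xb1, Xb2 = X1 - X1.mean(), X2 - X2.mean()
var1 = (Xb1 @ Xb1) / n      # sample variance of X1
var2 = (Xb2 @ Xb2) / n      # sample variance of X2
cov12 = (Xb1 @ Xb2) / n     # sample covariance

Sigma_hat = np.array([[var1, cov12],
                      [cov12, var2]])
rho12 = cov12 / np.sqrt(var1 * var2)

total_var = np.trace(Sigma_hat)     # sum of the diagonal entries
gen_var = np.linalg.det(Sigma_hat)  # generalized variance

print(Sigma_hat)
print(total_var, gen_var, (1 - rho12 ** 2) * var1 * var2)
```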
2.3 Multivariate Analysis
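In multivariate analysis, we consider all the \(d\) numeric attributes \(X_1,X_2,\cds,X_d\). The full data is an \(n\times d\) matrix, given as
\(\dp\D=\bp x_{11}&x_{12}&\cds&x_{1d}\\x_{21}&x_{22}&\cds&x_{2d}\\\vds&\vds&\dds&\vds\\x_{n1}&x_{n2}&\cds&x_{nd}\ep\)
In the row view, the data can be considered as a set of \(n\) points or vectors in the \(d\)-dimensional attribute space
\(\dp\x_i=(x_{i1},x_{i2},\cds,x_{id})^T\in\R^d\)
In the column view, the data can be considered as a set of \(d\) points or vectors in the \(n\)-dimensional space spanned by the data points
\(\dp X_j=(x_{1j},x_{2j},\cds,x_{nj})^T\in\R^n\)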
In the probabilistic view, the \(d\) attributes are modeled as a vector random variable, \(\X=(X_1,X_2,\cds,X_d)^T\), and the points \(\x_i\) are considered to be a random sample drawn from \(\X\), that is, they are independent and identically distributed as \(\X\).
Mean
The multivariate mean vector is obtained by taking the mean of each attribute, given as
\(\dp\mmu=E[\X]=\bp E[X_1]\\E[X_2]\\\vds\\E[X_d]\ep=\bp\mu_1\\\mu_2\\\vds\\\mu_d\ep\)
The sample mean is given as
Note
\(\dp\hat\mmu=\frac{1}{n}\sum_{i=1}^n\x_i\)
Covariance Matrix
The multivariate covariance information is captured by the \(d\times d\) symmetric covariance matrix
\(\dp\Sg=E[(\X-\mmu)(\X-\mmu)^T]=\) \(\dp\bp\sg_1^2&\sg_{12}&\cds&\sg_{1d}\\\sg_{21}&\sg_{2}^2&\cds&\sg_{2d}\\\cds&\cds&\cds&\cds\\\sg_{d1}&\sg_{d2}&\cds&\sg_d^2\ep\)
Covariance Matrix Is Positive Semidefinite
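\(\Sg\) is a positive semidefinite matrix, that is,
\(\dp\a^T\Sg\a\geq 0\quad\rm{for\ any\ }d\rm{-dimensional\ vector\ }\a\)
To see this, observe that
\(\dp\a^T\Sg\a=\a^TE[(\X-\mmu)(\X-\mmu)^T]\a=E[\a^T(\X-\mmu)(\X-\mmu)^T\a]=E[Y^2]\geq 0\)
where \(Y\) is the random variable \(Y=\a^T(\X-\mmu)=\sum_{i=1}^da_i(X_i-\mu_i)\).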
The \(d\) eigenvalues of \(\Sg\) can be arranged from the largest to the smallest as follows: \(\ld_1\geq\ld_2\geq\cds\geq\ld_d\geq 0\).
Total and Generalized Variance
The total variance is given as the trace of the covariance matrix:
Note
\(tr(\Sg)=\sg_1^2+\sg_2^2+\cds+\sg_d^2\)
The generalized variance is defined as the determinant of the covariance matrix, \(\det(\Sg)\), also denoted as \(|\Sg|\); it gives a single value for the overall multivariate scatter:
Note
\(\dp\det(\Sg)=|\Sg|=\prod_{i=1}^d\ld_i\)
Since all the eigenvalues of \(\Sg\) are non-negative (\(\ld_i\geq 0\)), it follows that \(\det(\Sg)\geq 0\).
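These properties can be checked numerically. The sketch below assumes NumPy and uses the sample covariance matrix of a small made-up data matrix as a stand-in for \(\Sg\) (a sample covariance matrix is also symmetric and positive semidefinite). It evaluates the quadratic form \(\a^T\Sg\a\) for randomly drawn vectors \(\a\) and compares the eigenvalues with the trace and determinant.

```python
import numpy as np

# Illustrative data matrix: n = 5 points, d = 3 attributes
D = np.array([[1.0, 2.0, 0.0],
              [2.0, 1.0, 1.0],
              [3.0, 4.0, 1.0],
              [4.0, 3.0, 2.0],
              [5.0, 5.0, 1.0]])
D_bar = D - D.mean(axis=0)
Sigma = (D_bar.T @ D_bar) / len(D)

# Eigenvalues of a symmetric matrix, reordered from largest to smallest
lam = np.linalg.eigvalsh(Sigma)[::-1]

# Quadratic form a^T Sigma a for 1000 random vectors a (one per row of A)
rng = np.random.default_rng(0)
A = rng.normal(size=(1000, Sigma.shape[0]))
quad = np.einsum("ij,jk,ik->i", A, Sigma, A)

print(lam)                                 # all non-negative
print(quad.min())                          # never below zero (up to rounding)
print(lam.sum(), np.trace(Sigma))          # total variance
print(lam.prod(), np.linalg.det(Sigma))    # generalized variance
```

Random vectors cannot prove positive semidefiniteness, of course; the eigenvalue check is the conclusive one for a symmetric matrix.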
Sample Covariance Matrix
The sample covariance matrix is given as
Note
\(\dp\hat\Sg=E[(\X-\hat\mmu)(\X-\hat\mmu)^T]=\) \(\dp\bp\hat\sg_1^2&\hat\sg_{12}&\cds&\hat\sg_{1d}\\\hat\sg_{21}&\hat\sg_{2}^2&\cds&\hat\sg_{2d}\\\cds&\cds&\cds&\cds\\\hat\sg_{d1}&\hat\sg_{d2}&\cds&\hat\sg_d^2\ep\)
Let \(\bar\D\) represent the centered data matrix, given as the matrix of centered attribute vectors \(\bar{X}_i=X_i-\hat\mu_i\cd\bs{1}\), where \(\bs{1}\in\R^n\):
\(\dp\bar\D=\D-\bs{1}\cd\hat\mmu^T=\bp\bar{X}_1&\bar{X}_2&\cds&\bar{X}_d\ep\)
In matrix notation, the sample covariance matrix can be written as
Note
\(\dp\hat\Sg=\frac{1}{n}(\bar\D^T\bar\D)=\frac{1}{n}\) \(\dp\bp\bar{X}_1^T\bar{X}_1&\bar{X}_1^T\bar{X}_2&\cds&\bar{X}_1^T\bar{X}_d\\\bar{X}_2^T\bar{X}_1&\bar{X}_2^T\bar{X}_2&\cds&\bar{X}_2^T\bar{X}_d\\\vds&\vds&\dds&\vds\\\bar{X}_d^T\bar{X}_1&\bar{X}_d^T\bar{X}_2&\cds&\bar{X}_d^T\bar{X}_d\ep\)
The sample covariance matrix can also be written as a sum of rank-one matrices obtained as the outer product of each centered point:
Note
\(\dp\hat\Sg=\frac{1}{n}\sum_{i=1}^n\bar\x_i\cd\bar\x_i^T\)
Also, the sample total variance is given as
\(\dp\rm{var}(\D)=tr(\hat\Sg)=\hat\sg_1^2+\hat\sg_2^2+\cds+\hat\sg_d^2\)
Sample Scatter Matrix
The sample scatter matrix is the \(d\times d\) positive semidefinite matrix defined as
\(\dp\bs{\rm{S}}=\bar\D^T\bar\D=\sum_{i=1}^n\bar\x_i\cd\bar\x_i^T\)
It is simply the un-normalized sample covariance matrix, since \(\bs{\rm{S}}=n\cd\hat\Sg\).
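The equivalent forms of the sample covariance matrix can be compared directly. This sketch assumes NumPy and a small made-up data matrix; it computes \(\hat\Sg\) from the centered data matrix, from the sum of outer products of the centered points, and via `np.cov` with `bias=True` (which selects the \(1/n\) normalization), and then forms the scatter matrix.

```python
import numpy as np

# Illustrative data matrix: n = 5 points (rows), d = 3 attributes (columns)
D = np.array([[1.0, 2.0, 0.0],
              [2.0, 1.0, 1.0],
              [3.0, 4.0, 1.0],
              [4.0, 3.0, 2.0],
              [5.0, 5.0, 1.0]])
n, d = D.shape

mu_hat = D.mean(axis=0)   # sample mean vector
D_bar = D - mu_hat        # centered data matrix (broadcast over rows)

# Matrix form: (1/n) * D_bar^T D_bar
Sigma_hat = (D_bar.T @ D_bar) / n

# Sum of rank-one outer products of the centered points
Sigma_outer = sum(np.outer(x, x) for x in D_bar) / n

# Library routine with the same 1/n normalization
Sigma_np = np.cov(D, rowvar=False, bias=True)

S = D_bar.T @ D_bar               # sample scatter matrix
total_var = np.trace(Sigma_hat)   # sample total variance

print(Sigma_hat)
print(total_var)
```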
2.4 Data Normalization
Range Normalization
Let \(X\) be an attribute and let \(x_1,x_2,\cds,x_n\) be a random sample drawn from \(X\). In range normalization each value is scaled by the sample range \(\hat{r}\) of \(X\):
\(\dp x_i'=\frac{x_i-\min_i\{x_i\}}{\hat{r}}=\frac{x_i-\min_i\{x_i\}}{\max_i\{x_i\}-\min_i\{x_i\}}\)
After transformation the new attribute takes on values in the range \([0,1]\).
Standard Score Normalization
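In standard score normalization, also called \(z\)-normalization, each value is replaced by its \(z\)-score:
\(\dp x_i'=\frac{x_i-\hat\mu}{\hat\sg}\)
where \(\hat\mu\) is the sample mean and \(\hat\sg^2\) is the sample variance of \(X\). After transformation, the new attribute has mean 0 and standard deviation 1.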
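Both normalizations are one-liners in NumPy. The sketch below is illustrative: the helper names and the two attributes (on very different scales) are made up for this example, and each attribute is assumed to be non-constant so that the range and the standard deviation are nonzero.

```python
import numpy as np

def range_normalize(x):
    """Scale values to [0, 1] using the sample range."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

def z_normalize(x):
    """Replace each value by its z-score (np.std divides by n by default)."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

# Two hypothetical attributes measured on very different scales
age = [12, 14, 18, 23, 27, 28, 34, 37, 39, 40]
income = [300, 500, 1000, 2000, 3500, 4000, 4300, 6000, 2500, 2700]

for name, x in [("age", age), ("income", income)]:
    r, z = range_normalize(x), z_normalize(x)
    print(name, r.min(), r.max(), round(z.mean(), 12), round(z.std(), 12))
```

After either transformation the two attributes are on comparable scales, which matters for any method that relies on distances between points.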
2.5 Normal Distribution
2.5.1 Univariate Normal Distribution
A random variable \(X\) has a normal distribution, with parameters mean \(\mu\) and variance \(\sg^2\), if its probability density function is given as
Note
\(\dp f(x|\mu,\sg^2)=\frac{1}{\sqrt{2\pi \sg^2}}\exp\bigg\{-\frac{(x-\mu)^2}{2\sg^2}\bigg\}\)
Probability Mass
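Given an interval \([a,b]\), the probability mass of the normal distribution within that interval is given as
\(\dp P(a\leq x\leq b)=\int_a^bf(x|\mu,\sg^2)dx\)
The probability mass concentrated within \(k\) standard deviations from the mean, that is, for the interval \([\mu-k\sg,\mu+k\sg]\), can be computed as
\(\dp P(\mu-k\sg\leq x\leq\mu+k\sg)=\frac{1}{\sqrt{2\pi}\sg}\int_{\mu-k\sg}^{\mu+k\sg}\exp\bigg\{-\frac{(x-\mu)^2}{2\sg^2}\bigg\}dx\)
Via a change of variable \(z=\frac{x-\mu}{\sg}\), we get
\(\dp P(-k\leq z\leq k)=\frac{1}{\sqrt{2\pi}}\int_{-k}^ke^{-\frac{1}{2}z^2}dz=\frac{2}{\sqrt{2\pi}}\int_0^ke^{-\frac{1}{2}z^2}dz\)
Via another change of variable \(t=\frac{z}{\sqrt{2}}\), we get
\(\dp P(-k\leq z\leq k)=\frac{2}{\sqrt{\pi}}\int_0^{k/\sqrt{2}}e^{-t^2}dt=\rm{erf}\big(k/\sqrt{2}\big)\)
where erf is the Gauss error function, defined as
\(\dp\rm{erf}(x)=\frac{2}{\sqrt{\pi}}\int_0^xe^{-t^2}dt\)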
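Since the probability mass within \(k\) standard deviations reduces to \(\rm{erf}(k/\sqrt{2})\), it can be evaluated with the error function from Python's standard library. The function name `mass_within` is an illustrative choice.

```python
import math

def mass_within(k):
    """Probability mass of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(mass_within(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```

The result does not depend on \(\mu\) or \(\sg\), which is exactly what the change of variable to \(z\) expresses.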
2.5.2 Multivariate Normal Distribution
A \(d\)-dimensional vector random variable \(\X=(X_1,X_2,\cds,X_d)^T\) has a multivariate normal distribution, with parameters mean \(\mmu\) and covariance matrix \(\Sg\), if its joint probability density function is given as
Note
\(\dp f(\x|\mmu,\Sg)=\frac{1}{(\sqrt{2\pi})^d\sqrt{|\Sg|}}\) \(\dp\exp\bigg\{-\frac{(\x-\mmu)^T\Sg\im(\x-\mmu)}{2}\bigg\}\)
As in the univariate case, the term
\(\dp(\x-\mmu)^T\Sg\im(\x-\mmu)\)
measures the distance, called the Mahalanobis distance, of the point \(\x\) from the mean \(\mmu\) of the distribution, taking into account all of the variance-covariance information between the attributes.
The standard multivariate normal distribution has parameters \(\mmu=\0\) and \(\Sg=\bs{\rm{I}}\).
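The density and the quadratic term in its exponent can be sketched as follows, assuming NumPy and a nonsingular covariance matrix. The function names are illustrative; note that `mahalanobis_sq` returns the quadratic form exactly as it appears in the exponent, which many references call the squared Mahalanobis distance.

```python
import numpy as np

def mahalanobis_sq(x, mu, Sigma):
    """Quadratic form (x - mu)^T Sigma^{-1} (x - mu)."""
    diff = np.asarray(x, dtype=float) - np.asarray(mu, dtype=float)
    # Solve Sigma * y = diff instead of forming the inverse explicitly
    return float(diff @ np.linalg.solve(Sigma, diff))

def mvn_pdf(x, mu, Sigma):
    """Multivariate normal density at x (Sigma assumed nonsingular)."""
    d = len(mu)
    norm_const = (np.sqrt(2 * np.pi) ** d) * np.sqrt(np.linalg.det(Sigma))
    return float(np.exp(-0.5 * mahalanobis_sq(x, mu, Sigma)) / norm_const)

mu = np.zeros(2)
I = np.eye(2)
Sigma = np.array([[4.0, 0.0],
                  [0.0, 1.0]])  # illustrative covariance matrix

print(mvn_pdf([0.0, 0.0], mu, I))            # standard bivariate normal at the mean
print(mahalanobis_sq([2.0, 0.0], mu, I))     # plain squared Euclidean distance
print(mahalanobis_sq([2.0, 0.0], mu, Sigma)) # smaller: the first axis has more variance
```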
Geometry of the Multivariate Normal
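Compared to the standard normal distribution, we can expect the density contours to be shifted, scaled, and rotated. The shape or geometry of the normal distribution becomes clear by considering the eigen-decomposition of the covariance matrix. The eigenvector equation for \(\Sg\) is given as
\(\dp\Sg\bs{\rm{u}}_i=\ld_i\bs{\rm{u}}_i\)
where \(\ld_i\) is an eigenvalue of \(\Sg\) and \(\bs{\rm{u}}_i\in\R^d\) is the corresponding eigenvector.
The diagonal matrix \(\Ld\) is used to record the eigenvalues:
\(\dp\Ld=\bp\ld_1&0&\cds&0\\0&\ld_2&\cds&0\\\vds&\vds&\dds&\vds\\0&0&\cds&\ld_d\ep\)
The eigenvectors are orthonormal, and can be put together into an orthogonal matrix \(\bs{\rm{U}}\):
\(\dp\bs{\rm{U}}=\bp\bs{\rm{u}}_1&\bs{\rm{u}}_2&\cds&\bs{\rm{u}}_d\ep\)
The eigen-decomposition of \(\Sg\) can then be expressed compactly as follows:
\(\dp\Sg=\bs{\rm{U}}\Ld\bs{\rm{U}}^T\)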
This equation can be interpreted geometrically as a change in basis vectors.
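The eigen-decomposition can be reproduced with `np.linalg.eigh`, which is intended for symmetric matrices and returns eigenvalues in ascending order. The sketch below assumes NumPy and an illustrative \(2\times 2\) covariance matrix; it reorders the eigenpairs from largest to smallest eigenvalue and verifies the reconstruction.

```python
import numpy as np

Sigma = np.array([[2.0, 1.6],
                  [1.6, 2.0]])  # illustrative covariance matrix

# eigh returns ascending eigenvalues and eigenvectors as columns
lam, U = np.linalg.eigh(Sigma)

# Reorder so that the largest eigenvalue comes first
order = np.argsort(lam)[::-1]
lam, U = lam[order], U[:, order]
Lam = np.diag(lam)

print(lam)                                   # eigenvalues, largest first
print(np.allclose(U.T @ U, np.eye(2)))       # columns of U are orthonormal
print(np.allclose(U @ Lam @ U.T, Sigma))     # Sigma = U Lam U^T
```

The columns of `U` give the directions of the principal axes of the density contours, and the eigenvalues give the variance along each of those directions.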
Total and Generalized Variance
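The generalized variance is the product of the eigenvalues of \(\Sg\), \(\det(\Sg)=\prod_{i=1}^d\ld_i\), and the total variance is their sum, \(tr(\Sg)=\sum_{i=1}^d\ld_i\). In other words \(\sg_1^2+\cds+\sg_d^2=\ld_1+\cds+\ld_d\).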