Chapter 1 Data Matrix
1.1 Data Matrix
The \(n\times d\) data matrix is given as
where \(\x_i\) donotes the \(i\)th row, which is a \(d\)-tuple given as
and \(X_j\) denotes the \(j\)th column, which is an \(n\)-tuple given as
The number of instances \(n\) is referred to as the size of the data, whereas the number of attributes \(d\) is called the dimensionality of the data.
1.2 Attributes
Numeric Attributes
A numeric attribute is one that has a real-valued or integer-valued domain. Numeric attributes that take on a finite or countably infinite set of values are called discrete, whereas those that can take on any real value are called continuous. If an attribute has as its domain the set \(\{0, 1\}\), it is called a binary attribute.
Interval-scaled: For these kinds of attributes only differences (addtion or subtraction) make sense.
Ratio-scaled: Here one can compute both differences as well as ratios between values.
Categorical Attributes
A categorical attribute is one that has a set-valued domain composed of a set of symbols.
Nominal: The attribute values in the domain are unordered, and thus only equality comparisons are meaningful.
Ordinal: The attribute values are ordered, and thus both equality comparisons and inequality comparisons are allowed, though it may not be possible to quantify the difference between values.
1.3 Data: Algebraic and Geometric View
If the \(d\) attributes or dimensions in the data matrix \(\D\) are all numeric, then each row can be considered as a \(d\)-dimensional point:
or equivalently, each row may be considered as a \(d\)-dimensional column vector:
The \(j\)th standard basis vector \(\e_j\) of Cartesian coordinate space is the \(d\)-dimensional unit vector whose \(j\)th component is 1 and the rest of thecomponents are 0:
Any other vector in \(\R^d\) can be written as a linear combination of the standard basis vectors:
where the scalar value \(x_{ij}\) is the coordinate value along the \(j\)th axis or attribute.
Each numeric column or attribute can also be treated as a vector in an \(n\)-dimensional space \(\R^n\):
If all attributes are numeric, then the data matrix \(\D\) is in the fact an \(n\times d\) matrix, also written as \(\D\in\R^{n\times d}\), given as
1.3.1 Distance and Angle
Let \(\a,\b\in\R^m\) be two \(m\)-dimensional vectors given as
Dot Product
Note
\(\dp\a^T\b=\bp a_1&a_2&\cds&a_m\ep\times\bp b_1\\b_2\\\vds\\b_m\ep=a_1b_1+a_2b_2+\cds+a_mb_m=\sum_{i=1}^ma_ib_i\)
Length
The Euclidean norm or length of a vector \(\a\in\R^m\) is defined as
Note
\(\dp\lv\a\rv=\sqrt{\a^T\a}=\sqrt{a_1^2+a_2^2+\cds+a_m^2}=\sqrt{\sum_{i=1}^ma_i^2}\)
The unit vector in the direction of \(\a\) is given as
By definition \(\u\) has length \(\lv\u\rv=1\), and it is also called a normalized vector.
The Euclidean norm is a special case of a general class of norms, known as \(L_p\)-norm, defined as
Note
\(\dp\lv\a\rv_p=(|a_1|^p+|a_2|^p+\cds+|a_m|^p)^{\frac{1}{\p}}=\bigg(\sum_{i=1}^m|a_i|^p\bigg)^{\frac{1}{p}}\)
for any \(p\neq 0\).
Distance
The Eclidean distance between \(\a\) and \(\b\), as follows
Note
\(\dp\lv\a-\b\rv=\sqrt{(\a-\b)^T(\a-\b)}=\sqrt{\sum_{i=1}^m(a_i-b_i)^2}\)
The general \(L_p\)-distance function is geven as follows
Angle
The cosine of the smallest angle between vectors \(\a\) and \(\b\), also called the cosine similarity is given as
Note
\(\dp\cos\th=\frac{\a^T\b}{\lv\a\rv\lv\b\rv}=\bigg(\frac{\a}{\lv\a\rv}\bigg)^T\bigg(\frac{\b}{\lv\b\rv}\bigg)\)
The Cauchy-Schwartz inequality states that for any vectors \(\a\) and \(\b\) in \(\R^m\)
It follows immediately from the Cauchy-Schwartz inequality that
Orthogonality
Two vectors \(\a\) and \(\b\) are said to be orthogonal if and only if \(\a^T\b=0\), which in turn implies that \(\cos\th=0\). In this case, we say that they have no similarity.
1.3.2 Mean and Total Variance
Mean
Total Variance
Simplifying the equation we obtain
Centered Data Matrix
The mean of the centered data matrix \(\ol\D\) is \(\0\in\R^d\), because we have subtracted the mean \(\mmu\) from all the points \(\x_i\).
1.3.3 Orthogonal Projection
Let \(\a,\b\in\R^m\) be two \(m\)-dimensional vectors. An orthogonal decomposition of the vector \(\b\) in the direction of another vector \(\a\) is given as
where \(\p=\b_\parallel\) is parallel to \(\a\), and \(\r=\b_\perp\) is perpendicular or orthogonal to \(\a\). The vector \(\p\) is called the orthogonal projection or simply projection of \(\b\) on the vector \(\a\). The magnitude of the vector \(\r=\b-\p\) gives the perpendicular distance between \(\b\) and \(\a\), which is often interpreted as the residual or error between the points \(\b\) and \(\p\). The vector \(\r\) is also called the error vector.
We can derive an expression for \(\p\) by noting that \(\p=c\a\) for some scalar \(c\), as \(p\) is parallel to \(\a\). Thus \(\r=\b-\p=\b-c\a\). Because \(\p\) and \(\r\) are orthogonal, we have
which implies that
Therefore, the projection of \(\b\) on \(\a\) is given as
Note
\(\dp\p=c\a=\bigg(\frac{\a^T\b}{\a^T\a}\bigg)\a\)
The scalar offset \(c\) along \(\a\) is also called the scalar projection of \(\b\) on \(\a\), denoted as
Note
\(\dp\rm{proj}_\a(\b)=\bigg(\frac{\b^T\a}{\a^T\a}\bigg)\)
Therefore, the projection of \(\b\) on \(\a\) can also be written as
1.3.4 Linear Independence and Dimensionality
Given the data matrix
we are often interested in the linear combinations of the rows (points) or the columns (attributes).
Given any set of vectors \(\v_1,\v_2,\cds,\v_k\) in an \(m\)-dimensional vector space \(\R^m\), their linear combination is given as
where \(c_i\in\R\) are scalar values. The set of all possible linear combinations of the \(k\) vectors is called the span, denoted as \(span(\v_1,\cds,\v_k)\), which is itself a vector space being a subspace of \(\R^m\). If \(span(\v_1,\cds,\v_k)=\R^m\), then we say that \(\v_1,\cds,\v_k\) is a spanning set for \(\R^m\).
Row and Column space
The column space of \(\D\), denoted \(col(\D)\), is the set of all linear combinations of the \(d\) attributes \(X_j\in\R^n\)
By definition \(col(\D)\) is a subsapce of \(\R^n\). The row space of \(\D\), denoted \(row(\D)\), is the setof all linear combinations of the \(n\) points \(\x_i\in\R^d\)
By definition \(row(\D)\) is a subspace of \(\R^d\).
Linear Independence
The \(k\) vectors are linearly dependent if there are scalars \(c_1,c_2,\cds,c_k\), at least one of which is not zero, such that
On the other hand, \(\v_1,\cds,\v_k\) are linearly independent if and only if
Dimension and Rank
Let \(S\) be a subspace of \(\R^m\). A basis for \(S\) is a set of vectors in \(S\), say \(\v_1,\cds,\v_k\), that are linearly independent and they span \(S\), that is, \(span(\v_1,\cds,\v_k)=S\). A basis is a minimal spanning set. If the vectors in the basis are pairwise orthogonal, they are said to form an orthogonal basis for \(S\). If they are also normalized to be unit vectors, then they make up an orthonormal basis for \(S\). The standard basis for \(\R^m\) is an orthonormal basis consisting of the vectors
The number of vectors in a basis for \(S\) is called the dimension of \(S\), denoted as \(dim(S)\). Because \(S\) is a subspace of \(\R^m\), we must have \(dim(S)\leq m\).
For any matrix, the dimension of its row and column space is the same, and this dimension is also called the rank of the matrix. For the data matrix \(\D\in\R^{n\times d}\), we have \(rank(\D)\leq\min(n,d)\), which follows from the fact that the column space can have dimension at most \(d\), the row space can have dimension at most \(n\). With dimensionality reduction methods it is often possible to approximate \(\D\in\R^{n\times d}\) with a derived data matrix \(\D\pr\in\R^{n\times k}\), which has much lower dimensionality, that is, \(k\ll d\).
1.4 Data: Probabilistic View
A random variable \(X\) is called a discrete random variable if it takes on only a finite or countably infinite number of values in its range, wehreas \(X\) is called a continuous random variable if it can take on any value in its range.
Probability mass Function
If \(X\) is discrete, the probability mass function of \(X\) is defined as
Note
\(f(x)=P(X=x)\) for all \(x\in\R\).
\(f\) must obey the basi rules of probability, that is, \(f\) must be non-negative:
and the sum of all probabilities should add to 1:
Probability Density Function
We define the probability density function, which specifies the probability that the variable \(X\) takes on values in any interval \([a,b]\subset\R\):
Note
\(\dp P(X\in[a,b])=\int_a^b f(x)dx\)
The density function \(f\) must satisfy the basic laws of probability:
and
We can get an intuitive understanding of the density function \(f\) by considering the probability density over a small interval of width \(2\epsilon >0\), centered at \(x\), namely \([x-\epsilon,x+\epsilon]\):
It is important to note that \(P(X=x)\neq f(x)\).
We can use the PDF to obtain the relative probability of one value \(x_1\) over another \(x_2\) because for a given \(\epsilon>0\), we have
If \(f(x_1)\) is larger than \(f(x_2)\), then values of \(X\) close to \(x_1\) are more probable than values cloes to \(x_2\) and vice versa.
Cumulative Distribution Function
The cumulative distribution function (CDF) \(F:\R\rightarrow[0,1]\) which gives the probability of observing a value at most some given value \(x\):
Note
\(F(x)=P(X\leq x)\) for all \(-\infty<x<\infty\)
When \(X\) is discrete, \(F\) is given as
and when \(X\) is continuous, \(F\) is given as
1.4.1 Bivariate Random Variables
We can also perform pair-wise analysis by considering a pair of attributes, \(X_1\) and \(X_2\), as a bivariate random variable:
Joint Probability Mass Function
If \(X_1\) and \(X_2\) are both discrete random variables then \(\X\) has a joint probability mass function given as follows:
\(f\) must satisfy the following two conditions:
Joint Probability Density Function
If \(X_1\) and \(X_2\) are both continuous random variables then \(\X\) has a joint probability density function \(f\) given as follows:
where \(W\subset\R^2\) is some subset of the 2-dimensional space of reals. \(f\) must also satisfy the following two conditions:
The probability density at \(\x\) can be approximated using a 2-dimensional window of width \(2\epsilon\) centered at \(\x=(x_1,x_2)^T\) as
which implies that
The relative probability of one value \((a_1,a_2)\) versus another \((b_1,b_2)\) can therefore be computed via the probability density function:
Joint Cumulative Distribution Function
The joint cumulative distribution function for two random variables \(X_1\) and \(X_2\) is defined as the function \(F\), such that for all values \(x_1,x_2\in(-\infty,\infty)\),
Statistical Independence
Two random variables \(X_1\) and \(X_2\) are said to be (statistically) independent if, for every \(W_1\subset\R\) and \(W_2\subset\R\),
Furthermore, if \(X_1\) and \(X_2\) are independent, then the following two conditions are also satisfied:
1.4.2 Multivariate Random Variable
A \(d\)-dimensional multivariate random variable \(\X=(X_1,X_2,\cds,X_d)^T\), also called a vector random variable, is defined as a function that assigns a vector of real numbers to each outcome in the sample space, that is \(\X:\cl{O}\ra\R^d\).
If all \(X_j\) are discrete, then \(\X\) is jointly discrete and its joint probability mass function \(f\) is given as
If all \(X_j\) are continuous, then \(\X\) is jointly continuous and its joint probability density function is given as
for any \(d\)-dimensional region \(W\seq\R^d\).
We say that \(X_1,X_2,\cds,X_d\) are independent random variables if any only if, for every region \(W_i\in\R\):
If \(X_1,X_2,\cds,X_d\) are independent then the following conditions are also satisfied
1.4.3 Random Sample and Statistics
In statistics, the word population is used to refer to the set or universe of all entieis under study. We try to make inferences about the population parameters by drawing a random sample from the population, and by computing appropriate statistics from the sample that give estimates of the corresponding population parameters of interest.
Univariate Sample
Given a random variable \(X\), a random sample of size \(n\) from \(X\) is defined as a set of \(n\) independent and identically distributed (IID) random variables \(S_1,S_2,\cds,S_n\), that is, all of the \(S_i\)’s are statistically independent of each other, and follow the same probability mass or density function as \(X\).
Their joint probability function is given as
Note
\(\dp f(x_1,\cds,x_n)=\prod_{i=1}^nf_X(x_i)\)
where \(f_X\) is the probability mass or density function for \(X\).
Multivariate Sample
\(\x_i\) are assumed to be independent and identically distributed, and thus their joint distirbution is given as
Note
\(\dp f(\x_1,\x_2,\cds,\x_n)=\prod_{i=1}^n f_{\X}(\x_i)\)
where \(f_{\X}\) is the probability mass or density function for \(\X\).
Under the attribute independence assumption the above equation can be rewritten as
Statistics
Let \(\{ \bs{\rm{S}}_i\}_{i=1}^m\) denote a random sample of size \(m\) drawn from a (multivariate) random variable \(\X\). A statistic \(\hat{\th}\) is some function over the random sample, given as
If we use the value of a statistic to estimate a population parameter, this value is called a point estimate of the parameter, and the statistic is called an estimator of the parameter.