\[ \begin{align}\begin{aligned}\newcommand{\bs}{\boldsymbol} \newcommand{\dp}{\displaystyle} \newcommand{\rm}{\mathrm} \newcommand{\cl}{\mathcal} \newcommand{\pd}{\partial}\\\newcommand{\cd}{\cdot} \newcommand{\cds}{\cdots} \newcommand{\dds}{\ddots} \newcommand{\lag}{\langle} \newcommand{\lv}{\lVert} \newcommand{\ol}{\overline} \newcommand{\od}{\odot} \newcommand{\ra}{\rightarrow} \newcommand{\rag}{\rangle} \newcommand{\rv}{\rVert} \newcommand{\seq}{\subseteq} \newcommand{\td}{\tilde} \newcommand{\vds}{\vdots} \newcommand{\wh}{\widehat}\\\newcommand{\0}{\boldsymbol{0}} \newcommand{\1}{\boldsymbol{1}} \newcommand{\a}{\boldsymbol{\mathrm{a}}} \newcommand{\b}{\boldsymbol{\mathrm{b}}} \newcommand{\c}{\boldsymbol{\mathrm{c}}} \newcommand{\d}{\boldsymbol{\mathrm{d}}} \newcommand{\e}{\boldsymbol{\mathrm{e}}} \newcommand{\f}{\boldsymbol{\mathrm{f}}} \newcommand{\g}{\boldsymbol{\mathrm{g}}} \newcommand{\h}{\boldsymbol{\mathrm{h}}} \newcommand{\i}{\boldsymbol{\mathrm{i}}} \newcommand{\j}{\boldsymbol{j}} \newcommand{\m}{\boldsymbol{\mathrm{m}}} \newcommand{\n}{\boldsymbol{\mathrm{n}}} \newcommand{\o}{\boldsymbol{\mathrm{o}}} \newcommand{\p}{\boldsymbol{\mathrm{p}}} \newcommand{\q}{\boldsymbol{\mathrm{q}}} \newcommand{\r}{\boldsymbol{\mathrm{r}}} \newcommand{\u}{\boldsymbol{\mathrm{u}}} \newcommand{\v}{\boldsymbol{\mathrm{v}}} \newcommand{\w}{\boldsymbol{\mathrm{w}}} \newcommand{\x}{\boldsymbol{\mathrm{x}}} \newcommand{\y}{\boldsymbol{\mathrm{y}}} \newcommand{\z}{\boldsymbol{\mathrm{z}}}\\\newcommand{\A}{\boldsymbol{\mathrm{A}}} \newcommand{\B}{\boldsymbol{\mathrm{B}}} \newcommand{\C}{\boldsymbol{\mathrm{C}}} \newcommand{\D}{\boldsymbol{\mathrm{D}}} \newcommand{\H}{\boldsymbol{\mathrm{H}}} \newcommand{\I}{\boldsymbol{\mathrm{I}}} \newcommand{\K}{\boldsymbol{\mathrm{K}}} \newcommand{\M}{\boldsymbol{\mathrm{M}}} \newcommand{\N}{\boldsymbol{\mathrm{N}}} \newcommand{\P}{\boldsymbol{\mathrm{P}}} \newcommand{\Q}{\boldsymbol{\mathrm{Q}}} \newcommand{\S}{\boldsymbol{\mathrm{S}}} \newcommand{\U}{\boldsymbol{\mathrm{U}}} \newcommand{\W}{\boldsymbol{\mathrm{W}}} \newcommand{\X}{\boldsymbol{\mathrm{X}}} \newcommand{\Y}{\boldsymbol{\mathrm{Y}}} \newcommand{\Z}{\boldsymbol{\mathrm{Z}}}\\\newcommand{\R}{\mathbb{R}}\\\newcommand{\cE}{\mathcal{E}} \newcommand{\cX}{\mathcal{X}} \newcommand{\cY}{\mathcal{Y}}\\\newcommand{\ld}{\lambda} \newcommand{\Ld}{\boldsymbol{\mathrm{\Lambda}}} \newcommand{\sg}{\sigma} \newcommand{\Sg}{\boldsymbol{\mathrm{\Sigma}}} \newcommand{\th}{\theta} \newcommand{\ve}{\varepsilon}\\\newcommand{\mmu}{\boldsymbol{\mu}} \newcommand{\ppi}{\boldsymbol{\pi}} \newcommand{\CC}{\mathcal{C}} \newcommand{\TT}{\mathcal{T}}\\ \newcommand{\bb}{\begin{bmatrix}} \newcommand{\eb}{\end{bmatrix}} \newcommand{\bp}{\begin{pmatrix}} \newcommand{\ep}{\end{pmatrix}} \newcommand{\bv}{\begin{vmatrix}} \newcommand{\ev}{\end{vmatrix}}\\\newcommand{\im}{^{-1}} \newcommand{\pr}{^{\prime}} \newcommand{\ppr}{^{\prime\prime}}\end{aligned}\end{align} \]

Chapter 1 Data Matrix

1.1 Data Matrix

The \(n\times d\) data matrix is given as

\[\begin{split}\left(\begin{array}{c|cccc}&X_1&X_2&\cds&X_d\\ \hline \x_1&x_{11}&x_{12}&\cds&x_{1d}\\\x_2&x_{21}&x_{22}&\cds&x_{2d}\\ \vds&\vds&\vds&\dds&\vds\\\x_n&x_{n1}&x_{n2}&\cds&x_{nd}\end{array}\right)\end{split}\]

where \(\x_i\) donotes the \(i\)th row, which is a \(d\)-tuple given as

\[\x_i=(x_{i1},x_{i2},\cds,x_{id})\]

and \(X_j\) denotes the \(j\)th column, which is an \(n\)-tuple given as

\[X_j=(x_{1j},x_{2j},\cds,x_{nj})\]

The number of instances \(n\) is referred to as the size of the data, whereas the number of attributes \(d\) is called the dimensionality of the data.

1.2 Attributes

Numeric Attributes

A numeric attribute is one that has a real-valued or integer-valued domain. Numeric attributes that take on a finite or countably infinite set of values are called discrete, whereas those that can take on any real value are called continuous. If an attribute has as its domain the set \(\{0, 1\}\), it is called a binary attribute.

  • Interval-scaled: For these kinds of attributes only differences (addtion or subtraction) make sense.

  • Ratio-scaled: Here one can compute both differences as well as ratios between values.

Categorical Attributes

A categorical attribute is one that has a set-valued domain composed of a set of symbols.

  • Nominal: The attribute values in the domain are unordered, and thus only equality comparisons are meaningful.

  • Ordinal: The attribute values are ordered, and thus both equality comparisons and inequality comparisons are allowed, though it may not be possible to quantify the difference between values.

1.3 Data: Algebraic and Geometric View

If the \(d\) attributes or dimensions in the data matrix \(\D\) are all numeric, then each row can be considered as a \(d\)-dimensional point:

\[\x_i=(x_{i1},x_{i2},\cds,x_{id})\in\R^d\]

or equivalently, each row may be considered as a \(d\)-dimensional column vector:

\[\begin{split}\x_i=\bp x_{i1}\\x_{i2}\\\vds\\x_{id} \ep=\bp x_{i1}&x_{i2}&\cds&x_{id} \ep^T\in\R^d\end{split}\]

The \(j\)th standard basis vector \(\e_j\) of Cartesian coordinate space is the \(d\)-dimensional unit vector whose \(j\)th component is 1 and the rest of thecomponents are 0:

\[\e_j=(0,\cds,1_j,\cds,0)^T\]

Any other vector in \(\R^d\) can be written as a linear combination of the standard basis vectors:

\[\x_i=x_{i1}\e_1+\x_{i2}\e_2+\cds+x_{id}\e_d=\sum_{j=1}^dx_{ij}\e_j\]

where the scalar value \(x_{ij}\) is the coordinate value along the \(j\)th axis or attribute.

Each numeric column or attribute can also be treated as a vector in an \(n\)-dimensional space \(\R^n\):

\[\begin{split}X_j=\bp x_{1j}\\x_{2j}\\\vds\\x_{nj} \ep\end{split}\]

If all attributes are numeric, then the data matrix \(\D\) is in the fact an \(n\times d\) matrix, also written as \(\D\in\R^{n\times d}\), given as

\[\begin{split}\D=\bp x_{11}&x_{12}&\cds&x_{1d}\\x_{21}&x_{22}&\cds&x_{2d}\\ \vds&\vds&\dds&\vds\\x_{n1}&x_{n2}&\cds&x_{nd} \ep= \bp -&\x_1^T&-\\-&\x_2^T&-\\&\vds\\-&\x_n^T&- \ep= \bp |&|&&|\\X_1&X_2&\cds&X_d\\|&|&&| \ep\end{split}\]

1.3.1 Distance and Angle

Let \(\a,\b\in\R^m\) be two \(m\)-dimensional vectors given as

\[\begin{split}\a=\bp a_1\\a_2\\\vds\\a_m \ep\quad\b=\bp b_1\\b_2\\\vds\\b_m \ep\end{split}\]

Dot Product

Note

\(\dp\a^T\b=\bp a_1&a_2&\cds&a_m\ep\times\bp b_1\\b_2\\\vds\\b_m\ep=a_1b_1+a_2b_2+\cds+a_mb_m=\sum_{i=1}^ma_ib_i\)

Length

The Euclidean norm or length of a vector \(\a\in\R^m\) is defined as

Note

\(\dp\lv\a\rv=\sqrt{\a^T\a}=\sqrt{a_1^2+a_2^2+\cds+a_m^2}=\sqrt{\sum_{i=1}^ma_i^2}\)

The unit vector in the direction of \(\a\) is given as

\[\u=\frac{\a}{\lv\a\rv}=\bigg(\frac{1}{\lv\a\rv}\bigg)\a\]

By definition \(\u\) has length \(\lv\u\rv=1\), and it is also called a normalized vector.

The Euclidean norm is a special case of a general class of norms, known as \(L_p\)-norm, defined as

Note

\(\dp\lv\a\rv_p=(|a_1|^p+|a_2|^p+\cds+|a_m|^p)^{\frac{1}{\p}}=\bigg(\sum_{i=1}^m|a_i|^p\bigg)^{\frac{1}{p}}\)

for any \(p\neq 0\).

Distance

The Eclidean distance between \(\a\) and \(\b\), as follows

Note

\(\dp\lv\a-\b\rv=\sqrt{(\a-\b)^T(\a-\b)}=\sqrt{\sum_{i=1}^m(a_i-b_i)^2}\)

The general \(L_p\)-distance function is geven as follows

\[\lv\a-\b\rv_p=\bigg(\sum_{i=1}^m|a_i-b_i|^p\bigg)^{\frac{1}{p}}\]

Angle

The cosine of the smallest angle between vectors \(\a\) and \(\b\), also called the cosine similarity is given as

Note

\(\dp\cos\th=\frac{\a^T\b}{\lv\a\rv\lv\b\rv}=\bigg(\frac{\a}{\lv\a\rv}\bigg)^T\bigg(\frac{\b}{\lv\b\rv}\bigg)\)

The Cauchy-Schwartz inequality states that for any vectors \(\a\) and \(\b\) in \(\R^m\)

\[|\a^T\b|\leq\lv\a\rv\cd\lv\b\rv\]

It follows immediately from the Cauchy-Schwartz inequality that

\[-1\leq\cos\th\leq 1\]

Orthogonality

Two vectors \(\a\) and \(\b\) are said to be orthogonal if and only if \(\a^T\b=0\), which in turn implies that \(\cos\th=0\). In this case, we say that they have no similarity.

1.3.2 Mean and Total Variance

Mean

\[mean(\D)=\mmu=\frac{1}{n}\sum_{i=1}^n\x_i\]

Total Variance

\[\rm{var}(\D)=\frac{1}{n}\sum_{i=1}^n\lv\x_i-\mmu\rv^2\]

Simplifying the equation we obtain

\[ \begin{align}\begin{aligned}\rm{var}(\D)&=\frac{1}{n}\sum_{i=1}^n(\lv\x_i\rv^2-2\x_i^T\mmu+\lv\mmu\rv^2)\\&=\frac{1}{n}\bigg(\sum_{i=1}^n\lv\x_i\rv^2-2n\mmu^T\bigg(\frac{1}{n}\sum_{i=1}^n\x_i\bigg)+n\lv\mmu\rv^2\bigg)\\&=\frac{1}{n}\bigg(\sum_{i=1}^n\lv\x_i\rv^2-2n\mmu^T\mmu+n\lv\mmu\rv^2\bigg)\\&=\frac{1}{n}\bigg(\sum_{i=1}^n\lv\x_i\rv^2\bigg)-\lv\mmu\rv^2\end{aligned}\end{align} \]

Centered Data Matrix

\[\begin{split}\ol\D=\D-\bs{1}\cd\mmu^T=\bp\x_1^T\\\x_2^T\\\vds\\\x_n^T\ep- \bp\mmu^T\\\mmu^T\\\vds\\\mmu^T\ep=\bp\x_1^T- \mmu^T\\\x_2^T-\mmu^T\\\vds\\\x_n^T-\mmu^T\ep= \bp\ol\x_1^T\\\ol\x_2^T\\\vds\\\ol\x_n^T\ep\end{split}\]

The mean of the centered data matrix \(\ol\D\) is \(\0\in\R^d\), because we have subtracted the mean \(\mmu\) from all the points \(\x_i\).

1.3.3 Orthogonal Projection

Let \(\a,\b\in\R^m\) be two \(m\)-dimensional vectors. An orthogonal decomposition of the vector \(\b\) in the direction of another vector \(\a\) is given as

\[\b=\b_\parallel+\b_\perp=\p+\r\]

where \(\p=\b_\parallel\) is parallel to \(\a\), and \(\r=\b_\perp\) is perpendicular or orthogonal to \(\a\). The vector \(\p\) is called the orthogonal projection or simply projection of \(\b\) on the vector \(\a\). The magnitude of the vector \(\r=\b-\p\) gives the perpendicular distance between \(\b\) and \(\a\), which is often interpreted as the residual or error between the points \(\b\) and \(\p\). The vector \(\r\) is also called the error vector.

We can derive an expression for \(\p\) by noting that \(\p=c\a\) for some scalar \(c\), as \(p\) is parallel to \(\a\). Thus \(\r=\b-\p=\b-c\a\). Because \(\p\) and \(\r\) are orthogonal, we have

\[\p^T\r=(c\a)^T(\b-c\a)=c\a^T\b-c^2\a^T\a=0\]

which implies that

\[c=\frac{\a^T\b}{\a^T\a}\]

Therefore, the projection of \(\b\) on \(\a\) is given as

Note

\(\dp\p=c\a=\bigg(\frac{\a^T\b}{\a^T\a}\bigg)\a\)

The scalar offset \(c\) along \(\a\) is also called the scalar projection of \(\b\) on \(\a\), denoted as

Note

\(\dp\rm{proj}_\a(\b)=\bigg(\frac{\b^T\a}{\a^T\a}\bigg)\)

Therefore, the projection of \(\b\) on \(\a\) can also be written as

\[\p=\rm{proj}_\a(\b)\cd\a\]

1.3.4 Linear Independence and Dimensionality

Given the data matrix

\[\D=\bp \x_1&\x_2&\cds&\x_n \ep^T=\bp X_1&X_2&\cds&X_d \ep\]

we are often interested in the linear combinations of the rows (points) or the columns (attributes).

Given any set of vectors \(\v_1,\v_2,\cds,\v_k\) in an \(m\)-dimensional vector space \(\R^m\), their linear combination is given as

\[c_1\v_1+c_2\v_2+\cds+c_k\v_k\]

where \(c_i\in\R\) are scalar values. The set of all possible linear combinations of the \(k\) vectors is called the span, denoted as \(span(\v_1,\cds,\v_k)\), which is itself a vector space being a subspace of \(\R^m\). If \(span(\v_1,\cds,\v_k)=\R^m\), then we say that \(\v_1,\cds,\v_k\) is a spanning set for \(\R^m\).

Row and Column space

The column space of \(\D\), denoted \(col(\D)\), is the set of all linear combinations of the \(d\) attributes \(X_j\in\R^n\)

\[col(\D)=span(X_1,X_2,\cds,X_d)\]

By definition \(col(\D)\) is a subsapce of \(\R^n\). The row space of \(\D\), denoted \(row(\D)\), is the setof all linear combinations of the \(n\) points \(\x_i\in\R^d\)

\[row(\D)=span(\x_1,\x_2,\cds,\x_n)\]

By definition \(row(\D)\) is a subspace of \(\R^d\).

\[row(\D)=col(\D^T)\]

Linear Independence

The \(k\) vectors are linearly dependent if there are scalars \(c_1,c_2,\cds,c_k\), at least one of which is not zero, such that

\[c_1\v_1+c_2\v_2+\cds+c_k\v_k=\0\]

On the other hand, \(\v_1,\cds,\v_k\) are linearly independent if and only if

\[c_1\v_1+c_2\v_2+\cds+c_k\v_k=\0\rm{\ implies\ }c_1=c_2=\cds=c_k=0\]

Dimension and Rank

Let \(S\) be a subspace of \(\R^m\). A basis for \(S\) is a set of vectors in \(S\), say \(\v_1,\cds,\v_k\), that are linearly independent and they span \(S\), that is, \(span(\v_1,\cds,\v_k)=S\). A basis is a minimal spanning set. If the vectors in the basis are pairwise orthogonal, they are said to form an orthogonal basis for \(S\). If they are also normalized to be unit vectors, then they make up an orthonormal basis for \(S\). The standard basis for \(\R^m\) is an orthonormal basis consisting of the vectors

\[\begin{split}\e_1=\bp 1\\0\\\vds\\0 \ep\quad\e_2=\bp 0\\1\\\vds\\0 \ep\quad\cds\quad\e_m=\bp 0\\0\\\vds\\1 \ep\end{split}\]

The number of vectors in a basis for \(S\) is called the dimension of \(S\), denoted as \(dim(S)\). Because \(S\) is a subspace of \(\R^m\), we must have \(dim(S)\leq m\).

For any matrix, the dimension of its row and column space is the same, and this dimension is also called the rank of the matrix. For the data matrix \(\D\in\R^{n\times d}\), we have \(rank(\D)\leq\min(n,d)\), which follows from the fact that the column space can have dimension at most \(d\), the row space can have dimension at most \(n\). With dimensionality reduction methods it is often possible to approximate \(\D\in\R^{n\times d}\) with a derived data matrix \(\D\pr\in\R^{n\times k}\), which has much lower dimensionality, that is, \(k\ll d\).

1.4 Data: Probabilistic View

A random variable \(X\) is called a discrete random variable if it takes on only a finite or countably infinite number of values in its range, wehreas \(X\) is called a continuous random variable if it can take on any value in its range.

Probability mass Function

If \(X\) is discrete, the probability mass function of \(X\) is defined as

Note

\(f(x)=P(X=x)\) for all \(x\in\R\).

\(f\) must obey the basi rules of probability, that is, \(f\) must be non-negative:

\[f(x)\geq 0\]

and the sum of all probabilities should add to 1:

\[\sum_xf(x)=1\]

Probability Density Function

We define the probability density function, which specifies the probability that the variable \(X\) takes on values in any interval \([a,b]\subset\R\):

Note

\(\dp P(X\in[a,b])=\int_a^b f(x)dx\)

The density function \(f\) must satisfy the basic laws of probability:

\[f(x) \geq 0,\quad\rm{for\ all\ }x\in\R\]

and

\[\int_{-\infty}^\infty f(x)dx=1\]

We can get an intuitive understanding of the density function \(f\) by considering the probability density over a small interval of width \(2\epsilon >0\), centered at \(x\), namely \([x-\epsilon,x+\epsilon]\):

\[ \begin{align}\begin{aligned}P(X\in[x-\epsilon,x+\epsilon])=\int_{x-\epsilon}^{x+\epsilon}f(x)dx\simeq 2\epsilon\cd f(x)\\f(x)\simeq\frac{P(X\in[x-\epsilon,x+\epsilon])}{2\epsilon}\end{aligned}\end{align} \]

It is important to note that \(P(X=x)\neq f(x)\).

We can use the PDF to obtain the relative probability of one value \(x_1\) over another \(x_2\) because for a given \(\epsilon>0\), we have

\[\frac{P(X\in[x_1-\epsilon,x_1+\epsilon])}{P(X\in[x_2-\epsilon,x_2+\epsilon])} \simeq\frac{2\epsilon\cd f(x_1)}{2\epsilon\cd f(x_2)}=\frac{f(x_1)}{f(x_2)}\]

If \(f(x_1)\) is larger than \(f(x_2)\), then values of \(X\) close to \(x_1\) are more probable than values cloes to \(x_2\) and vice versa.

Cumulative Distribution Function

The cumulative distribution function (CDF) \(F:\R\rightarrow[0,1]\) which gives the probability of observing a value at most some given value \(x\):

Note

\(F(x)=P(X\leq x)\) for all \(-\infty<x<\infty\)

When \(X\) is discrete, \(F\) is given as

\[F(x)=P(X\leq x)=\sum_{u\leq x}f(u)\]

and when \(X\) is continuous, \(F\) is given as

\[F(x)=P(X\leq x)=\int_{-\infty}^xf(u)du\]

1.4.1 Bivariate Random Variables

We can also perform pair-wise analysis by considering a pair of attributes, \(X_1\) and \(X_2\), as a bivariate random variable:

\[\begin{split}\X=\bp X_1\\X_2 \ep\end{split}\]

Joint Probability Mass Function

If \(X_1\) and \(X_2\) are both discrete random variables then \(\X\) has a joint probability mass function given as follows:

\[f(\x)=f(x_1,x_2)=P(X_1=x_1,X_2=x_2)=P(\X=\x)\]

\(f\) must satisfy the following two conditions:

\[ \begin{align}\begin{aligned}f(\x)=f(x_1,x_2)\geq 0\quad\rm{for\ all\ }-\infty<x_1,x_2<\infty\\\sum_\x f(\X)=\sum_{x_1}\sum_{x_2}f(x_1,x_2)=1\end{aligned}\end{align} \]

Joint Probability Density Function

If \(X_1\) and \(X_2\) are both continuous random variables then \(\X\) has a joint probability density function \(f\) given as follows:

\[P(\X\in W)=\iint_{\x\in W}f(\X)d\X=\iint_{(x_1,x_2)^T\in W}f(x_1,x_2)dx_1dx_2\]

where \(W\subset\R^2\) is some subset of the 2-dimensional space of reals. \(f\) must also satisfy the following two conditions:

\[ \begin{align}\begin{aligned}f(\x)=f(x_1,x_2)\geq 0\quad\rm{for\ all\ }-\infty<x_1,x_2<\infty\\\int_{\R^2}f(\x)d\x=\int_{-\infty}^\infty\int_{-\infty}^\infty f(x_1,x_2)dx_1dx_2=1\end{aligned}\end{align} \]

The probability density at \(\x\) can be approximated using a 2-dimensional window of width \(2\epsilon\) centered at \(\x=(x_1,x_2)^T\) as

\[ \begin{align}\begin{aligned}P(\X\in W)&=P(\X\in([x_1-\epsilon,x_1+\epsilon],[x_2-\epsilon,x_2+\epsilon]))\\&=\int_{x_1-\epsilon}^{x_1+\epsilon}\int_{x_2-\epsilon}^{x_2+\epsilon} f(x_1,x_2)dx_1dx_2\simeq 2\epsilon\cd 2\epsilon\cd f(x_1,x_2)\end{aligned}\end{align} \]

which implies that

\[f(x_1,x_2)=\frac{P(\X\in W)}{(2\epsilon)^2}\]

The relative probability of one value \((a_1,a_2)\) versus another \((b_1,b_2)\) can therefore be computed via the probability density function:

\[\frac{P(\X\in([a_1-\epsilon,a_1+\epsilon],[a_2-\epsilon,a_2+\epsilon]))} {P(\X\in([b_1-\epsilon,b_1+\epsilon],[b_2-\epsilon,b_2+\epsilon]))}\simeq \frac{(2\epsilon)^2\cd f(a_1,a_2)}{(2\epsilon)^2\cd f(b_1,b_2)}= \frac{f(a_1,a_2)}{f(b_1,b_2)}\]

Joint Cumulative Distribution Function

The joint cumulative distribution function for two random variables \(X_1\) and \(X_2\) is defined as the function \(F\), such that for all values \(x_1,x_2\in(-\infty,\infty)\),

\[F(\x)=F(x_1,x_2)=P(X_1\leq x_1\rm{\ and\ }X_2\leq x_2)=P(\X\leq\x)\]

Statistical Independence

Two random variables \(X_1\) and \(X_2\) are said to be (statistically) independent if, for every \(W_1\subset\R\) and \(W_2\subset\R\),

\[P(X_1\in W_1\rm{\ and\ }X_2\in W_2)=P(X_1\in W_1)\cd P(X_2\in W_2)\]

Furthermore, if \(X_1\) and \(X_2\) are independent, then the following two conditions are also satisfied:

\[ \begin{align}\begin{aligned}F(\x)=F(x_1,x_2)=F_1(x_1)\cd F_2(x_2)\\f(\x)=f(x_1,x_2)=f_1(x_1)\cd f_2(x_2)\end{aligned}\end{align} \]

1.4.2 Multivariate Random Variable

A \(d\)-dimensional multivariate random variable \(\X=(X_1,X_2,\cds,X_d)^T\), also called a vector random variable, is defined as a function that assigns a vector of real numbers to each outcome in the sample space, that is \(\X:\cl{O}\ra\R^d\).

If all \(X_j\) are discrete, then \(\X\) is jointly discrete and its joint probability mass function \(f\) is given as

\[ \begin{align}\begin{aligned}f(\x)&=P(\X=\x)\\f(x_1,x_2,\cds,x_d)&=P(X_1=x_1,X_2=x_2,\cds,X_d=x_d)\end{aligned}\end{align} \]

If all \(X_j\) are continuous, then \(\X\) is jointly continuous and its joint probability density function is given as

\[ \begin{align}\begin{aligned}P(\X\in W)&=\underset{\x\in W}{\int\cds\int}f(\x)d\x\\P((X_1,X_2,\cds,X_d)^T\in W)&=\underset{(x_1,x_2,\cds,x_d)^T\in W}{\int\cds\int}f(x_1,x_2,\cds,x_d)dx_1dx_2\cds dx_d\end{aligned}\end{align} \]

for any \(d\)-dimensional region \(W\seq\R^d\).

We say that \(X_1,X_2,\cds,X_d\) are independent random variables if any only if, for every region \(W_i\in\R\):

\[ \begin{align}\begin{aligned}P(X_1\in W_1\rm{\ and\ }X_2\in W_2\cds\rm{\ and\ }X_d\in W_d)\\=P(X_1\in W_1)\cd P(X_2\in W_2)\cds P(X_d\in W_d)\end{aligned}\end{align} \]

If \(X_1,X_2,\cds,X_d\) are independent then the following conditions are also satisfied

\[ \begin{align}\begin{aligned}F(\x)=F(x_1,\cds,x_d)=F_1(x_1)\cd F_2(x_2)\cd \cds \cd F_d(x_d)\\f(\x)=f(x_1,\cds,x_d)=f_1(x_1)\cd f_2(x_2)\cd \cds \cd f_d(x_d)\end{aligned}\end{align} \]

1.4.3 Random Sample and Statistics

In statistics, the word population is used to refer to the set or universe of all entieis under study. We try to make inferences about the population parameters by drawing a random sample from the population, and by computing appropriate statistics from the sample that give estimates of the corresponding population parameters of interest.

Univariate Sample

Given a random variable \(X\), a random sample of size \(n\) from \(X\) is defined as a set of \(n\) independent and identically distributed (IID) random variables \(S_1,S_2,\cds,S_n\), that is, all of the \(S_i\)’s are statistically independent of each other, and follow the same probability mass or density function as \(X\).

Their joint probability function is given as

Note

\(\dp f(x_1,\cds,x_n)=\prod_{i=1}^nf_X(x_i)\)

where \(f_X\) is the probability mass or density function for \(X\).

Multivariate Sample

\(\x_i\) are assumed to be independent and identically distributed, and thus their joint distirbution is given as

Note

\(\dp f(\x_1,\x_2,\cds,\x_n)=\prod_{i=1}^n f_{\X}(\x_i)\)

where \(f_{\X}\) is the probability mass or density function for \(\X\).

Under the attribute independence assumption the above equation can be rewritten as

\[f(\x_1,\x_2,\cds,\x_n)=\prod_{i=1}^n f(\x_i)=\prod_{i=1}^n\prod_{j=1}^df_{X_j}(x_{ij})\]

Statistics

Let \(\{ \bs{\rm{S}}_i\}_{i=1}^m\) denote a random sample of size \(m\) drawn from a (multivariate) random variable \(\X\). A statistic \(\hat{\th}\) is some function over the random sample, given as

\[\hat{\th}:(\bs{\rm{S}}_1,\bs{\rm{S}}_2,\cds,\bs{\rm{S}}_m)\ra\R\]

If we use the value of a statistic to estimate a population parameter, this value is called a point estimate of the parameter, and the statistic is called an estimator of the parameter.