\[ \begin{align}\begin{aligned}\newcommand{\bs}{\boldsymbol} \newcommand{\dp}{\displaystyle} \newcommand{\rm}{\mathrm} \newcommand{\cl}{\mathcal} \newcommand{\pd}{\partial}\\\newcommand{\cd}{\cdot} \newcommand{\cds}{\cdots} \newcommand{\dds}{\ddots} \newcommand{\lag}{\langle} \newcommand{\lv}{\lVert} \newcommand{\ol}{\overline} \newcommand{\od}{\odot} \newcommand{\ra}{\rightarrow} \newcommand{\rag}{\rangle} \newcommand{\rv}{\rVert} \newcommand{\seq}{\subseteq} \newcommand{\td}{\tilde} \newcommand{\vds}{\vdots} \newcommand{\wh}{\widehat}\\\newcommand{\0}{\boldsymbol{0}} \newcommand{\1}{\boldsymbol{1}} \newcommand{\a}{\boldsymbol{\mathrm{a}}} \newcommand{\b}{\boldsymbol{\mathrm{b}}} \newcommand{\c}{\boldsymbol{\mathrm{c}}} \newcommand{\d}{\boldsymbol{\mathrm{d}}} \newcommand{\e}{\boldsymbol{\mathrm{e}}} \newcommand{\f}{\boldsymbol{\mathrm{f}}} \newcommand{\g}{\boldsymbol{\mathrm{g}}} \newcommand{\h}{\boldsymbol{\mathrm{h}}} \newcommand{\i}{\boldsymbol{\mathrm{i}}} \newcommand{\j}{\boldsymbol{j}} \newcommand{\m}{\boldsymbol{\mathrm{m}}} \newcommand{\n}{\boldsymbol{\mathrm{n}}} \newcommand{\o}{\boldsymbol{\mathrm{o}}} \newcommand{\p}{\boldsymbol{\mathrm{p}}} \newcommand{\q}{\boldsymbol{\mathrm{q}}} \newcommand{\r}{\boldsymbol{\mathrm{r}}} \newcommand{\u}{\boldsymbol{\mathrm{u}}} \newcommand{\v}{\boldsymbol{\mathrm{v}}} \newcommand{\w}{\boldsymbol{\mathrm{w}}} \newcommand{\x}{\boldsymbol{\mathrm{x}}} \newcommand{\y}{\boldsymbol{\mathrm{y}}} \newcommand{\z}{\boldsymbol{\mathrm{z}}}\\\newcommand{\A}{\boldsymbol{\mathrm{A}}} \newcommand{\B}{\boldsymbol{\mathrm{B}}} \newcommand{\C}{\boldsymbol{\mathrm{C}}} \newcommand{\D}{\boldsymbol{\mathrm{D}}} \newcommand{\H}{\boldsymbol{\mathrm{H}}} \newcommand{\I}{\boldsymbol{\mathrm{I}}} \newcommand{\K}{\boldsymbol{\mathrm{K}}} \newcommand{\M}{\boldsymbol{\mathrm{M}}} \newcommand{\N}{\boldsymbol{\mathrm{N}}} \newcommand{\P}{\boldsymbol{\mathrm{P}}} \newcommand{\Q}{\boldsymbol{\mathrm{Q}}} \newcommand{\S}{\boldsymbol{\mathrm{S}}} 
\newcommand{\U}{\boldsymbol{\mathrm{U}}} \newcommand{\W}{\boldsymbol{\mathrm{W}}} \newcommand{\X}{\boldsymbol{\mathrm{X}}} \newcommand{\Y}{\boldsymbol{\mathrm{Y}}} \newcommand{\Z}{\boldsymbol{\mathrm{Z}}}\\\newcommand{\R}{\mathbb{R}}\\\newcommand{\cE}{\mathcal{E}} \newcommand{\cX}{\mathcal{X}} \newcommand{\cY}{\mathcal{Y}}\\\newcommand{\ld}{\lambda} \newcommand{\Ld}{\boldsymbol{\mathrm{\Lambda}}} \newcommand{\sg}{\sigma} \newcommand{\Sg}{\boldsymbol{\mathrm{\Sigma}}} \newcommand{\th}{\theta} \newcommand{\ve}{\varepsilon}\\\newcommand{\mmu}{\boldsymbol{\mu}} \newcommand{\ppi}{\boldsymbol{\pi}} \newcommand{\CC}{\mathcal{C}} \newcommand{\TT}{\mathcal{T}}\\ \newcommand{\bb}{\begin{bmatrix}} \newcommand{\eb}{\end{bmatrix}} \newcommand{\bp}{\begin{pmatrix}} \newcommand{\ep}{\end{pmatrix}} \newcommand{\bv}{\begin{vmatrix}} \newcommand{\ev}{\end{vmatrix}}\\\newcommand{\im}{^{-1}} \newcommand{\pr}{^{\prime}} \newcommand{\ppr}{^{\prime\prime}}\end{aligned}\end{align} \]

Chapter 27 Regression Evaluation

Given a set of predictor attributes or independent variables \(X_1,X_2,\cds,X_d\), and given the response attribute \(Y\), the goal of regression is to learn a function \(f\) such that

\[Y=f(X_1,X_2,\cds,X_d)+\ve=f(\X)+\ve\]

where \(\X=(X_1,X_2,\cds,X_d)^T\) is the \(d\)-dimensional multivariate random variable comprised of the predictor variables. Here, the random variable \(\ve\) denotes the inherent error in the response that is not explained by the linear model.

When estimating the regression function \(f\), we make assumptions about the form of \(f\). Once we have estimated the bias and coefficients, we need to formulate a probabilistic model of regression to evaluate the learned model in terms of goodness of fit, confidence intervals for the parameters, and to test for the regression effects, namely whether \(\X\) really helps in predicting \(Y\). In particular, we assume that even if the value of \(\X\) has been fixed, there can still be uncertainty in the response \(Y\). Further, we will assume that the error \(\ve\) is independent of \(\X\) and follows a normal (or Gaussian) distribution with mean \(\mu=0\) and variance \(\sg^2\), that is, we assume that the errors are independent and identically distributed with zero mean and fixed variance.

The probabilistic regression model comprises two components: the deterministic component comprising the observed predictor attributes, and the random error component comprising the error term, which is assumed to be independent of the predictor attributes.

27.1 Univariate Regression

We assume that the true relationship can be modeled as a linear function

\[Y=f(X)+\ve=\beta+\omega\cd X+\ve\]

where \(\omega\) is the slope of the best fitting line and \(\beta\) is its intercept, and \(\ve\) is the random error variable that follows a normal distribution with mean \(\mu=0\) and variance \(\sg^2\).

Mean and Variance of Response Variable

Consider a fixed value \(x\) for the independent variable \(X\). The expected value of the response variable \(Y\) given \(x\) is

\[E[Y|X=x]=E[\beta+\omega\cd x+\ve]=\beta+\omega\cd x+E[\ve]=\beta+\omega\cd x\]

The last step follows from our assumption that \(E[\ve]=\mu=0\). Also, since \(x\) is assumed to be fixed, and \(\beta\) and \(\omega\) are constants, the expected value \(E[\beta+\omega\cd x]=\beta+\omega\cd x\). Next, consider the variance of \(Y\) given \(X=x\); we have

\[\rm{var}(Y|X=x)=\rm{var}(\beta+\omega\cd x+\ve)=\rm{var}(\beta+\omega\cd x)+\rm{var}(\ve)=0+\sg^2=\sg^2\]

Here \(\rm{var}(\beta+\omega\cd x)=0\), since \(\beta,\omega,x\) are all constants. Thus, given \(X=x\), the response variable \(Y\) follows a normal distribution with mean \(E[Y|X=x]=\beta+\omega\cd x\), and variance \(\rm{var}(Y|X=x)=\sg^2\).

Estimated Parameters

The true parameters \(\beta,\omega,\sg^2\) are all unknown, and have to be estimated from the training data \(\D\) comprising \(n\) points \(x_i\) and corresponding response values \(y_i\), for \(i=1,2,\cds,n\). Let \(b\) and \(w\) denote the estimated bias and weight terms; we can then make predictions for any given value \(x_i\) as follows:

\[\hat{y_i}=b+w\cd x_i\]

The estimated bias \(b\) and weight \(w\) are obtained by minimizing the sum of squared errors, given as

\[SSE=\sum_{i=1}^n(y_i-\hat{y_i})^2=\sum_{i=1}^n(y_i-b-w\cd x_i)^2\]

with the least squares estimates given as

\[w=\frac{\sg_{XY}}{\sg_X^2}\quad\quad b=\mu_Y-w\cd\mu_X\]
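The closed-form estimates above can be sketched numerically as follows. This is a minimal illustration, assuming NumPy is available; the toy data and the helper name `fit_line` are hypothetical and not part of the text.

```python
import numpy as np

# Hypothetical toy data (not from the text): n = 6 points.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

def fit_line(x, y):
    """Return (b, w) minimizing the sum of squared errors."""
    mu_x, mu_y = x.mean(), y.mean()
    # w = sigma_XY / sigma_X^2; the 1/n factors cancel in the ratio.
    w = np.sum((x - mu_x) * (y - mu_y)) / np.sum((x - mu_x) ** 2)
    b = mu_y - w * mu_x
    return b, w

b, w = fit_line(x, y)
y_hat = b + w * x  # predictions for the training points
```

The same `b` and `w` should be recovered by any generic least squares routine, which is a convenient sanity check.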

27.1.1 Estimating Variance (\(\sg^2\))

According to our model, the variance in prediction is entirely due to the random error term \(\ve\). We can estimate this variance by considering the predicted value \(\hat{y_i}\) and its deviation from the true response \(y_i\), that is, by looking at the residual error

\[\epsilon_i=y_i-\hat{y_i}\]

One of the properties of the estimated values \(b\) and \(w\) is that the sum of residual errors is zero, since

\[ \begin{align}\begin{aligned}\sum_{i=1}^n\epsilon_i&=\sum_{i=1}^n(y_i-b-w\cd x_i)\\&=\sum_{i=1}^n(y_i-\mu_Y+w\cd\mu_X-w\cd x_i)\\&=\bigg(\sum_{i=1}^ny_i\bigg)-n\cd\mu_Y+w\cd\bigg(n\mu_X-\sum_{i=1}^n x_i\bigg)\\&=n\cd\mu_Y-n\cd\mu_Y+w\cd(n\cd\mu_X-n\cd\mu_X)=0\end{aligned}\end{align} \]

Thus, the expected value of \(\epsilon_i\) is zero, since \(E[\epsilon_i]=\frac{1}{n}\sum_{i=1}^n\epsilon_i=0\).

The estimated variance \(\hat\sg^2\) is given as

\[\hat\sg^2=\rm{var}(\epsilon_i)=\frac{1}{n-2}\cd\sum_{i=1}^n(\epsilon_i- E[\epsilon_i])^2=\frac{1}{n-2}\cd\sum_{i=1}^n\epsilon_i^2=\frac{1}{n-2}\cd \sum_{i=1}^n(y_i-\hat{y_i})^2\]

Thus, the estimated variance is

Note

\(\dp\hat\sg^2=\frac{SSE}{n-2}\)

We divide by \(n-2\) to get an unbiased estimate, since \(n-2\) is the number of degrees of freedom for estimating SSE.

The square root of the variance is called the standard error of regression:

Note

\(\dp\hat\sg=\sqrt{\frac{SSE}{n-2}}\)
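As a rough numerical check of the two facts used in this subsection (the residuals sum to zero, and \(\hat\sg^2=SSE/(n-2)\)), one might write the sketch below. The data and the helper name `residual_variance` are hypothetical, chosen only for illustration.

```python
import numpy as np

def residual_variance(x, y):
    """Return (residuals, SSE, sigma2_hat) for a univariate least squares fit."""
    n = len(x)
    mu_x, mu_y = x.mean(), y.mean()
    w = np.sum((x - mu_x) * (y - mu_y)) / np.sum((x - mu_x) ** 2)
    b = mu_y - w * mu_x
    eps = y - (b + w * x)            # residual errors epsilon_i
    sse = np.sum(eps ** 2)
    return eps, sse, sse / (n - 2)   # divide by the n - 2 degrees of freedom

# Hypothetical toy data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

eps, sse, sigma2_hat = residual_variance(x, y)
sigma_hat = np.sqrt(sigma2_hat)      # standard error of regression
```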

27.1.2 Goodness of Fit

The total scatter, also called total sum of squares, for the dependent variable \(Y\), is defined as

\[TSS=\sum_{i=1}^n(y_i-\mu_Y)^2\]

The total scatter can be decomposed into two components by adding and subtracting \(\hat{y_i}\) as follows

\[ \begin{align}\begin{aligned}TSS&=\sum_{i=1}^n(y_i-\mu_Y)^2=\sum_{i=1}^n(y_i-\hat{y_i}+\hat{y_i}-\mu_Y)^2\\&=\sum_{i=1}^n(y_i-\hat{y_i})^2+\sum_{i=1}^n(\hat{y_i}-\mu_Y)^2+2\sum_{i=1}^n(y_i-\hat{y_i})\cd(\hat{y_i}-\mu_Y)\\&=\sum_{i=1}^n(y_i-\hat{y_i})^2+\sum_{i=1}^n(\hat{y_i}-\mu_Y)^2=SSE+RSS\end{aligned}\end{align} \]

where we use the fact that \(\sum_{i=1}^n(y_i-\hat{y_i})\cd(\hat{y_i}-\mu_Y)=0\), and

\[RSS=\sum_{i=1}^n(\hat{y_i}-\mu_Y)^2\]

is a new term called regression sum of squares that measures the squared deviation of the predictions from the true mean. TSS can thus be decomposed into two parts: SSE, which is the amount of variation not explained by the model, and RSS, which is the amount of variation explained by the model. Therefore, the fraction of the variation left unexplained by the model is given by the ratio \(\frac{SSE}{TSS}\). Conversely, the fraction of the variation that is explained by the model, called the coefficient of determination or simply the \(R^2\) statistic, is given as

Note

\(\dp R^2=\frac{TSS-SSE}{TSS}=1-\frac{SSE}{TSS}=\frac{RSS}{TSS}\)

The higher the \(R^2\) statistic the better the estimated model, with \(R^2\in[0,1]\).
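The decomposition \(TSS=SSE+RSS\) and the resulting \(R^2\) statistic can be checked on a small example. The sketch below assumes NumPy; the data and the function name `goodness_of_fit` are illustrative assumptions.

```python
import numpy as np

def goodness_of_fit(x, y):
    """Return (TSS, SSE, RSS, R^2) for a univariate least squares fit."""
    mu_x, mu_y = x.mean(), y.mean()
    w = np.sum((x - mu_x) * (y - mu_y)) / np.sum((x - mu_x) ** 2)
    b = mu_y - w * mu_x
    y_hat = b + w * x
    tss = np.sum((y - mu_y) ** 2)      # total scatter
    sse = np.sum((y - y_hat) ** 2)     # unexplained variation
    rss = np.sum((y_hat - mu_y) ** 2)  # explained variation
    return tss, sse, rss, 1.0 - sse / tss

# Hypothetical toy data with a noisy linear trend.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([1.2, 2.9, 2.7, 4.8, 4.1, 6.3, 6.9, 7.4])

tss, sse, rss, r2 = goodness_of_fit(x, y)
```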

Geometry of Goodness of Fit

Recall that \(Y\) can be decomposed into two orthogonal parts

\[Y=\hat{Y}+\bs\epsilon\]

where \(\hat{Y}\) is the projection of \(Y\) onto the subspace spanned by \(\{\1,X\}\). Using the fact that this subspace is the same as that spanned by the orthogonal vectors \(\{\1,\bar{X}\}\), with \(\bar{X}=X-\mu_X\cd\1\), we can further decompose \(\hat{Y}\) as follows

\[\hat{Y}=\rm{proj}_\1(Y)\cd\1+\rm{proj}_{\bar{X}}(Y)\cd\bar{X}=\mu_Y\cd\1+ \frac{Y^T\bar{X}}{\bar{X}^T\bar{X}}\cd\bar{X}=\mu_Y\cd\1+w\cd\bar{X}\]

Likewise, the vectors \(Y\) and \(\hat{Y}\) can be centered by subtracting their projections along the vector \(\1\)

\[\bar{Y}=Y-\mu_Y\cd\1\quad\quad\hat{\bar{Y}}=\hat{Y}-\mu_Y\cd\1=w\cd\bar{X}\]

The centered vectors \(\bar{Y},\hat{\bar{Y}},\bar{X}\) all lie in the \(n-1\) dimensional subspace orthogonal to the vector \(\1\).

In this subspace, the centered vectors \(\bar{Y}\) and \(\hat{\bar{Y}}\), and the error vector \(\bs\epsilon\) form a right triangle, since \(\hat{\bar{Y}}\) is the orthogonal projection of \(\bar{Y}\) onto the vector \(\bar{X}\). Noting that \(\bs\epsilon=Y-\hat{Y}=\bar{Y}-\hat{\bar{Y}}\), by the Pythagoras theorem, we have

\[\lv\bar{Y}\rv^2=\lv\hat{\bar{Y}}\rv^2+\lv\bs\epsilon\rv^2=\lv\hat{\bar{Y}}\rv^2+\lv Y-\hat{Y}\rv^2\]

This equation is equivalent to the decomposition of the total scatter, TSS, into the sum of squared errors, SSE, and the regression sum of squares, RSS.

\[ \begin{align}\begin{aligned}TSS&=\sum_{i=1}^n(y_i-\mu_Y)^2=\lv Y-\mu_Y\cd\1\rv^2=\lv\bar{Y}\rv^2\\RSS&=\sum_{i=1}^n(\hat{y_i}-\mu_Y)^2=\lv\hat{Y}-\mu_Y\cd\1\rv^2=\lv\hat{\bar{Y}}\rv^2\\SSE&=\lv\bs\epsilon\rv^2=\lv Y-\hat{Y}\rv^2\end{aligned}\end{align} \]
\[ \begin{align}\begin{aligned}\lv\bar{Y}\rv^2&=\lv\hat{\bar{Y}}\rv^2+\lv Y-\hat{Y}\rv^2\\\lv Y-\mu_Y\cd\1\rv^2&=\lv\hat{Y}-\mu_Y\cd\1\rv^2+\lv Y-\hat{Y}\rv^2\\TSS&=RSS+SSE\end{aligned}\end{align} \]

Notice further that since \(\bar{Y},\hat{\bar{Y}},\bs\epsilon\) form a right triangle, the cosine of the angle between \(\bar{Y}\) and \(\hat{\bar{Y}}\) is given as the ratio of the base to the hypotenuse. On the other hand, the cosine of the angle is also the correlation between \(Y\) and \(\hat{Y}\) denoted \(\rho_{Y\hat{Y}}\). Thus, we have:

\[\rho_{Y\hat{Y}}=\cos\th=\frac{\lv\hat{\bar{Y}}\rv}{\lv\bar{Y}\rv}\]

We can observe that

\[\lv\hat{\bar{Y}}\rv=\rho_{Y\hat{Y}}\cd\lv\bar{Y}\rv\]

Note that, whereas \(|\rho_{Y\hat{Y}}|\leq 1\), due to the projection operation, the angle between \(Y\) and \(\hat{Y}\) is always less than or equal to \(90^\circ\), which means that \(\rho_{Y\hat{Y}}\in[0,1]\) for univariate regression. Thus, the predicted response vector \(\hat{\bar{Y}}\) is shorter than the true response vector \(\bar{Y}\) by a factor equal to the correlation between them. Furthermore, the coefficient of determination is the same as the squared correlation between \(Y\) and \(\hat{Y}\):

\[R^2=\frac{RSS}{TSS}=\frac{\lv\hat{\bar{Y}}\rv^2}{\lv\bar{Y}\rv^2}=\rho^2_{Y\hat{Y}}\]
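The geometric identities in this subsection can be illustrated with a short sketch. It uses hypothetical data with a negative slope, so that the correlation between \(X\) and \(Y\) is negative while \(\rho_{Y\hat{Y}}\) remains nonnegative; all names below are assumptions made for the example.

```python
import numpy as np

# Hypothetical toy data with a decreasing trend.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([9.8, 8.1, 6.2, 3.9, 2.2, 0.1])

mu_x, mu_y = x.mean(), y.mean()
w = np.sum((x - mu_x) * (y - mu_y)) / np.sum((x - mu_x) ** 2)
b = mu_y - w * mu_x
y_hat = b + w * x

y_bar = y - mu_y          # centered response vector
y_hat_bar = y_hat - mu_y  # centered prediction vector, equal to w * (x - mu_x)

cos_theta = np.linalg.norm(y_hat_bar) / np.linalg.norm(y_bar)
rho = np.corrcoef(y, y_hat)[0, 1]  # correlation between Y and Y_hat
r2 = np.sum(y_hat_bar ** 2) / np.sum(y_bar ** 2)
```

Here `rho` agrees with `cos_theta`, and `r2` agrees with `rho ** 2`, even though the slope `w` is negative.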

27.1.3 Inference about Regression Coefficient and Bias Term

The estimated values of the bias and regression coefficient, \(b\) and \(w\), are only point estimates for the true parameters \(\beta\) and \(\omega\). To obtain confidence intervals for these parameters, we treat each \(y_i\) as a random variable for the response given the corresponding fixed value \(x_i\). These random variables are all independent and identically distributed as \(Y\), with expected value \(\beta+\omega\cd x_i\) and variance \(\sg^2\). On the other hand, the \(x_i\) values are fixed a priori and therefore \(\mu_X\) and \(\sg_X^2\) are also fixed values.

We can now treat \(b\) and \(w\) as random variables, with

\[ \begin{align}\begin{aligned}b&=\mu_Y-w\cd\mu_X\\w&=\frac{\sum_{i=1}^n(x_i-\mu_X)(y_i-\mu_Y)}{\sum_{i=1}^n(x_i-\mu_X)^2}= \frac{1}{s_X}\sum_{i=1}^n(x_i-\mu_X)\cd y_i=\sum_{i=1}^nc_i\cd y_i\end{aligned}\end{align} \]

where \(c_i\) is a constant, given as

\[c_i=\frac{x_i-\mu_X}{s_X}\]

and \(s_X=\sum_{i=1}^n(x_i-\mu_X)^2\) is the total scatter for \(X\), defined as the sum of squared deviations of \(x_i\) from its mean \(\mu_X\). We also use the fact that

\[\sum_{i=1}^n(x_i-\mu_X)\cd\mu_Y=\mu_Y\cd\sum_{i=1}^n(x_i-\mu_X)=0\]

Note that

\[\sum_{i=1}^nc_i=\frac{1}{s_X}\sum_{i=1}^n(x_i-\mu_X)=0\]

Mean and Variance of Regression Coefficient

The expected value of \(w\) is given as

\[ \begin{align}\begin{aligned}E[w]&=E\bigg[\sum_{i=1}^nc_iy_i\bigg]=\sum_{i=1}^nc_i\cd E[y_i]=\sum_{i=1}^nc_i(\beta+\omega\cd x_i)\\&=\beta\sum_{i=1}^nc_i+\omega\cd\sum_{i=1}^nc_i\cd x_i=\frac{\omega}{s_X}\cd \sum_{i=1}^n(x_i-\mu_X)\cd x_i=\frac{\omega}{s_X}\cd s_X=\omega\end{aligned}\end{align} \]

which follows from the observation that \(\sum_{i=1}^nc_i=0\), and further

\[s_X=\sum_{i=1}^n(x_i-\mu_X)^2=\bigg(\sum_{i=1}^nx_i^2\bigg)-n\cd\mu_X^2=\sum_{i=1}^n(x_i-\mu_X)\cd x_i\]

Thus, \(w\) is an unbiased estimator for the true parameter \(\omega\). Using the fact that the variables \(y_i\) are independent and identically distributed as \(Y\), we can compute the variance of \(w\) as follows

\[\rm{var}(w)=\rm{var}\bigg(\sum_{i=1}^nc_i\cd y_i\bigg)=\sum_{i=1}^nc_i^2\cd \rm{var}(y_i)=\sg^2\cd\sum_{i=1}^nc_i^2=\frac{\sg^2}{s_X}\]

since \(c_i\) is a constant, \(\rm{var}(y_i)=\sg^2\), and further

\[\sum_{i=1}^nc_i^2=\frac{1}{s^2_X}\cd\sum_{i=1}^n(x_i-\mu_X)^2=\frac{s_X}{s^2_X}=\frac{1}{s_X}\]

The standard deviation of \(w\), also called the standard error of \(w\), is given as

Note

\(\dp\rm{se}(w)=\sqrt{\rm{var}(w)}=\frac{\sg}{\sqrt{s_X}}\)

Mean and Variance of Bias Term

The expected value of \(b\) is given as

\[ \begin{align}\begin{aligned}E[b]&=E[\mu_Y-w\cd\mu_X]=E\bigg[\frac{1}{n}\sum_{i=1}^ny_i-w\cd\mu_X\bigg]\\&=\bigg(\frac{1}{n}\cd\sum_{i=1}^nE[y_i]\bigg)-\mu_X\cd E[w]=\bigg( \frac{1}{n}\sum_{i=1}^n(\beta+\omega\cd x_i)\bigg)-\omega\cd\mu_X\\&=\beta+\omega\cd\mu_X-\omega\cd\mu_X=\beta\end{aligned}\end{align} \]

Thus, \(b\) is an unbiased estimator for the true parameter \(\beta\).

Using the observation that all \(y_i\) are independent, the variance of the bias term can be computed as follows

\[ \begin{align}\begin{aligned}\rm{var}(b)&=\rm{var}(\mu_Y-w\cd\mu_X)\\&=\rm{var}\bigg(\frac{1}{n}\sum_{i=1}^ny_i\bigg)+\rm{var}(\mu_X\cd w)\\&=\frac{1}{n^2}\cd n\sg^2+\mu_X^2\cd\rm{var}(w)=\frac{1}{n}\cd\sg^2+\mu_X^2\cd\frac{\sg^2}{s_X}\\&=\bigg(\frac{1}{n}+\frac{\mu_X^2}{s_X}\bigg)\cd\sg^2\end{aligned}\end{align} \]

The standard deviation of \(b\), also called the standard error of \(b\), is given as

Note

\(\dp\rm{se}(b)=\sqrt{\rm{var}(b)}=\sg\cd\sqrt{\frac{1}{n}+\frac{\mu_X^2}{s_X}}\)

Covariance of Regression Coefficient and Bias

\[ \begin{align}\begin{aligned}\rm{cov}(w,b)&=E[w\cd b]-E[w]\cd E[b]=E[(\mu_Y-w\cd\mu_X)\cd w]-\omega\cd\beta\\&=\mu_Y\cd E[w]-\mu_X\cd E[w^2]-\omega\cd\beta=\mu_Y\cd\omega-\mu_X\cd(\rm{var}(w)+E[w]^2)-\omega\cd\beta\\&=\mu_Y\cd\omega-\mu_X\cd\bigg(\frac{\sg^2}{s_X}+\omega^2\bigg)-\omega\cd \beta=\omega\cd(\mu_Y-\omega\cd\mu_X)-\frac{\mu_X\cd\sg^2}{s_X}-\omega\cd \beta\\&=-\frac{\mu_X\cd\sg^2}{s_X}\end{aligned}\end{align} \]
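The expressions for the mean, variance, and covariance of \(w\) and \(b\) can be checked empirically with a small Monte Carlo simulation. The sketch below is only illustrative: the true parameters, the design points, the number of trials, and the random seed are all assumed values.

```python
import numpy as np

rng = np.random.default_rng(0)
beta, omega, sigma = 1.0, 2.0, 0.5   # assumed true parameters
x = np.linspace(0.0, 10.0, 20)       # design points, fixed a priori
n = x.size
mu_x = x.mean()
s_x = np.sum((x - mu_x) ** 2)        # total scatter of X

trials = 20000
# Each row is one simulated response vector y = beta + omega * x + eps.
Y = beta + omega * x + rng.normal(0.0, sigma, size=(trials, n))
w = Y @ (x - mu_x) / s_x             # w = sum_i c_i * y_i with c_i = (x_i - mu_X) / s_X
b = Y.mean(axis=1) - w * mu_x

var_w_theory = sigma ** 2 / s_x
var_b_theory = (1.0 / n + mu_x ** 2 / s_x) * sigma ** 2
cov_theory = -mu_x * sigma ** 2 / s_x
cov_wb = np.cov(w, b)[0, 1]          # empirical covariance of w and b
```

With these settings the empirical values `np.var(w)`, `np.var(b)`, and `cov_wb` should land within a few percent of the theoretical ones.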

Confidence Intervals

Since the \(y_i\) variables are all normally distributed, their linear combination also follows a normal distribution. Thus \(w\) follows a normal distribution with mean \(\omega\) and variance \(\sg^2/s_X\). Likewise, \(b\) follows a normal distribution with mean \(\beta\) and variance \((1/n+\mu_X^2/s_X)\cd\sg^2\).

Since the true variance \(\sg^2\) is unknown, we use the estimated variance \(\hat\sg^2\) to define the standardized variables \(Z_w\) and \(Z_b\) as follows

Note

\(\dp Z_w=\frac{w-E[w]}{\rm{se}(w)}=\frac{w-\omega}{\frac{\hat\sg}{\sqrt{s_X}}}\quad\quad\) \(\dp Z_b=\frac{b-E[b]}{\rm{se}(b)}=\frac{b-\beta}{\hat\sg\sqrt{(1/n+\mu_X^2/s_X)}}\)

These variables follow the Student’s \(t\) distribution with \(n-2\) degrees of freedom. Let \(T_{n-2}\) denote the cumulative \(t\) distribution with \(n-2\) degrees of freedom, and let \(t_{\alpha/2}\) denote the critical value of \(T_{n-2}\) that encompasses \(\alpha/2\) of the probability mass in the right tail.

\[ \begin{align}\begin{aligned}P(Z\geq t_{\alpha/2})=\frac{\alpha}{2}\rm{\ or\ equivalently\ }T_{n-2}(t_{\alpha/2})=1-\frac{\alpha}{2}\\P(Z\geq-t_{\alpha/2})=1-\frac{\alpha}{2}\rm{\ or\ equivalently\ }T_{n-2}(-t_{\alpha/2})=\frac{\alpha}{2}\end{aligned}\end{align} \]

Given confidence level \(1-\alpha\), i.e., significance level \(\alpha\in(0,1)\), the \(100(1-\alpha)\%\) confidence intervals for the true values, \(\omega\) and \(\beta\), are therefore as follows

\[ \begin{align}\begin{aligned}P(w-t_{\alpha/2}\cd\rm{se}(w)\leq\omega\leq w+t_{\alpha/2}\cd\rm{se}(w))&=1-\alpha\\P(b-t_{\alpha/2}\cd\rm{se}(b)\leq\beta\leq b+t_{\alpha/2}\cd\rm{se}(b))&=1-\alpha\end{aligned}\end{align} \]
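A possible way to compute these intervals in practice is sketched below. It assumes SciPy is installed and uses `scipy.stats.t.ppf` for the critical value \(t_{\alpha/2}\); the function name `confidence_intervals` and the data are hypothetical.

```python
import numpy as np
from scipy import stats

def confidence_intervals(x, y, alpha=0.05):
    """Return the 100(1 - alpha)% intervals ((w_lo, w_hi), (b_lo, b_hi))."""
    n = len(x)
    mu_x, mu_y = x.mean(), y.mean()
    s_x = np.sum((x - mu_x) ** 2)
    w = np.sum((x - mu_x) * (y - mu_y)) / s_x
    b = mu_y - w * mu_x
    sigma_hat = np.sqrt(np.sum((y - b - w * x) ** 2) / (n - 2))
    se_w = sigma_hat / np.sqrt(s_x)
    se_b = sigma_hat * np.sqrt(1.0 / n + mu_x ** 2 / s_x)
    # Critical value with alpha/2 of the probability mass in the right tail.
    t_crit = stats.t.ppf(1.0 - alpha / 2.0, df=n - 2)
    return ((w - t_crit * se_w, w + t_crit * se_w),
            (b - t_crit * se_b, b + t_crit * se_b))

# Hypothetical toy data.
x = np.arange(1.0, 11.0)
y = np.array([1.8, 4.1, 5.9, 8.2, 9.7, 12.3, 13.8, 16.1, 18.0, 19.9])

ci_w, ci_b = confidence_intervals(x, y, alpha=0.05)
```

Lowering `alpha` (a higher confidence level) widens both intervals, since \(t_{\alpha/2}\) grows.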

27.1.4 Hypothesis Testing for Regression Effects

In the regression model, \(Y\) depends on \(X\) through the parameter \(\omega\). Therefore, we can check for the regression effect by assuming the null hypothesis \(H_0\) that \(\omega=0\), with the alternative hypothesis \(H_a\) being \(\omega\ne 0\):

\[H_0:\omega=0\quad\quad H_a:\omega\ne 0\]

When \(\omega=0\), the response \(Y\) depends only on the bias \(\beta\) and the random error \(\ve\).

Under the null hypothesis we have \(E[w]=\omega=0\). Thus,

Note

\(\dp Z_w=\frac{w-E[w]}{\rm{se}(w)}=\frac{w}{\hat\sg/\sqrt{s_X}}\)

Given significance level \(\alpha\), we reject the null hypothesis if the p-value is below \(\alpha\). In this case, we accept the alternative hypothesis that the estimated value of the slope parameter is significantly different from zero.

We can also define the \(f\)-statistic, which is the ratio of the regression sum of squares, RSS, to the estimated variance, given as

Note

\(\dp f=\frac{RSS}{\hat\sg^2}=\frac{\sum_{i=1}^n(\hat{y_i}-\mu_Y)^2}{\sum_{i=1}^n(y_i-\hat{y_i})^2/(n-2)}\)

Under the null hypothesis, one can show that

\[E[RSS]=\sg^2\]

Further, it is also true that

\[E[\hat\sg^2]=\sg^2\]

Thus, under the null hypothesis the \(f\)-statistic has a value close to 1, which indicates that there is no relationship between the predictor and response variables. On the other hand, if the alternative hypothesis is true, then \(E[RSS]\geq\sg^2\), resulting in a larger \(f\) value. In fact, the \(f\)-statistic follows an \(F\)-distribution with 1, \((n-2)\) degrees of freedom; therefore, we can reject the null hypothesis that \(\omega=0\) if the p-value of \(f\) is less than the significance level \(\alpha\).

Interestingly the \(f\)-test is equivalent to the \(t\)-test since \(Z_w^2=f\).

\[ \begin{align}\begin{aligned}f&=\frac{1}{\hat\sg^2}\cd\sum_{i=1}^n(\hat{y_i}-\mu_Y)^2=\frac{1}{\hat\sg^2}\cd\sum_{i=1}^n(b+w\cd x_i-\mu_Y)^2\\&=\frac{1}{\hat\sg^2}\cd\sum_{i=1}^n(\mu_Y-w\cd\mu_X+w\cd x_i-\mu_Y)^2= \frac{1}{\hat\sg^2}\cd\sum_{i=1}^n(w\cd(x_i-\mu_X))^2\\&=\frac{1}{\hat\sg^2}\cd w^2\cd\sum_{i=1}^n(x_i-\mu_X)^2=\frac{w^2\cd s_X}{\hat\sg^2}\\&=\frac{w^2}{\hat\sg^2/s_X}=Z_w^2\end{aligned}\end{align} \]
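The equivalence of the two tests can be seen numerically as well. The following sketch assumes SciPy and uses `scipy.stats.t.sf` and `scipy.stats.f.sf` for the tail probabilities; the data are hypothetical and deliberately noisy so that the p-value is not vanishingly small.

```python
import numpy as np
from scipy import stats

# Hypothetical toy data with a weak linear trend.
x = np.arange(1.0, 9.0)
y = np.array([2.0, 1.0, 3.5, 2.5, 4.0, 2.0, 4.5, 3.0])
n = len(x)

mu_x, mu_y = x.mean(), y.mean()
s_x = np.sum((x - mu_x) ** 2)
w = np.sum((x - mu_x) * (y - mu_y)) / s_x
b = mu_y - w * mu_x
y_hat = b + w * x

sigma2_hat = np.sum((y - y_hat) ** 2) / (n - 2)
z_w = w / np.sqrt(sigma2_hat / s_x)        # t statistic under H0: omega = 0
rss = np.sum((y_hat - mu_y) ** 2)
f = rss / sigma2_hat                       # f statistic

p_t = 2.0 * stats.t.sf(abs(z_w), df=n - 2) # two-tailed t-test p-value
p_f = stats.f.sf(f, 1, n - 2)              # F-test p-value with (1, n - 2) d.o.f.
```

Both p-values coincide, so rejecting \(H_0\) at level \(\alpha\) gives the same decision under either test.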

Test for Bias Term

Note that we can also test if the bias value is statistically significant or not by setting up the null hypothesis, \(H_0:\beta=0\), versus the alternative hypothesis \(H_a:\beta\neq 0\). We then evaluate the \(Z_b\) statistic under the null hypothesis:

Note

\(\dp Z_b=\frac{b-E[b]}{\rm{se}(b)}=\frac{b}{\hat\sg\cd\sqrt{(1/n)+(\mu_X^2/s_X)}}\)

since, under the null hypothesis \(E[b]=\beta=0\). Using a two-tailed \(t\)-test with \(n-2\) degrees of freedom, we can compute the p-value of \(Z_b\). We reject the null hypothesis if this value is smaller than the significance level \(\alpha\).

27.1.5 Standardized Residuals

Our assumption about the true errors \(\ve_i\) is that they are normally distributed with mean \(\mu=0\) and fixed variance \(\sg^2\).

The mean of \(\epsilon_i\) is given as

\[ \begin{align}\begin{aligned}E[\epsilon_i]&=E[y_i-\hat{y_i}]=E[y_i]-E[\hat{y_i}]\\&=\beta+\omega\cd x_i-E[b+w\cd x_i]=\beta+\omega\cd x_i-(\beta+\omega\cd x_i)=0\end{aligned}\end{align} \]

To compute the variance of \(\epsilon_i\), we will express it as a linear combination of the \(y_i\) variables, by noting that

\[ \begin{align}\begin{aligned}w&=\frac{1}{s_X}\bigg(\sum_{j=1}^nx_jy_j-n\cd\mu_X\cd\mu_Y\bigg)= \frac{1}{s_X}\bigg(\sum_{j=1}^nx_jy_j-\sum_{j=1}^n\mu_X\cd y_j\bigg)= \sum_{j=1}^n\frac{(x_j-\mu_X)}{s_X}\cd y_j\\b&=\mu_Y-w\cd\mu_X=\bigg(\sum_{j=1}^n\frac{1}{n}\cd y_j\bigg)-w\cd\mu_X\end{aligned}\end{align} \]
\[ \begin{align}\begin{aligned}\epsilon_i=y_i-\hat{y_i}&=y_i-b-w\cd x_i=y_i-\sum_{j=1}^n\frac{1}{n}y_j+w\cd\mu_X-w\cd x_i\\&=y_i-\sum_{j=1}^n\frac{1}{n}y_j-(x_i-\mu_X)\cd w\\&=y_i-\sum_{j=1}^n\frac{1}{n}y_j-\sum_{j=1}^n\frac{(x_i-\mu_X)\cd(x_j-\mu_X)}{s_X}\cd y_j\\&=\bigg(1-\frac{1}{n}-\frac{(x_i-\mu_X)^2}{s_X}\bigg)\cd y_i-\sum_{j\neq i} \bigg(\frac{1}{n}+\frac{(x_i-\mu_X)\cd(x_j-\mu_X)}{s_X}\bigg)\cd y_j\end{aligned}\end{align} \]

Define \(a_j\) as follows:

\[a_j=\bigg(\frac{1}{n}+\frac{(x_i-\mu_X)\cd(x_j-\mu_X)}{s_X}\bigg)\]
\[ \begin{align}\begin{aligned}\rm{var}(\epsilon_i)&=\rm{var}\bigg((1-a_i)\cd y_i-\sum_{j\neq i}a_j\cd y_j\bigg)\\&=(1-a_i)^2\cd\rm{var}(y_i)+\sum_{j\neq i}a_j^2\cd\rm{var}(y_j)\\&=\sg^2\cd(1-2a_i+a_i^2+\sum_{j\neq i}a_j^2)\\&=\sg^2\cd(1-2a_i+\sum_{j=1}^na_j^2)\end{aligned}\end{align} \]

Consider the term \(\sum_{j=1}^na_j^2\), we have

\[ \begin{align}\begin{aligned}\sum_{j=1}^na_j^2&=\sum_{j=1}^n\bigg(\frac{1}{n}+\frac{(x_i-\mu_X)\cd(x_j-\mu_X)}{s_X}\bigg)^2\\&=\sum_{j=1}^n\bigg(\frac{1}{n^2}+\frac{2\cd(x_i-\mu_X)\cd(x_j-\mu_X)}{n\cd s_X}+\frac{(x_i-\mu_X)^2\cd(x_j-\mu_X)^2}{s_X^2}\bigg)\\&=\frac{1}{n}+\frac{2\cd(x_i-\mu_X)}{n\cd s_X}\sum_{j=1}^n(x_j-\mu_X)+ \frac{(x_i-\mu_X)^2}{s_X^2}\sum_{j=1}^n(x_j-\mu_X)^2\\&=\frac{1}{n}+\frac{(x_i-\mu_X)^2}{s_X}\end{aligned}\end{align} \]
\[ \begin{align}\begin{aligned}\rm{var}(\epsilon_i)&=\sg^2\cd\bigg(1-\frac{2}{n}-\frac{2\cd(x_i-\mu_X)^2} {s_X}+\frac{1}{n}+\frac{(x_i-\mu_X)^2}{s_X}\bigg)\\&=\sg^2\cd\bigg(1-\frac{1}{n}-\frac{(x_i-\mu_X)^2}{s_X}\bigg)\end{aligned}\end{align} \]

We can now define the standardized residual \(\epsilon_i^*\) by dividing \(\epsilon_i\) by its standard deviation after replacing \(\sg^2\) by its estimated value \(\hat\sg^2\).

Note

\(\dp\epsilon_i^*=\frac{\epsilon_i}{\sqrt{\rm{var}(\epsilon_i)}}\) \(\dp=\frac{\epsilon_i}{\hat\sg\cd\sqrt{1-\frac{1}{n}-\frac{(x_i-\mu_X)^2}{s_X}}}\)

These standardized residuals should follow a standard normal distribution. We can thus plot the standardized residuals against the quantiles of a standard normal distribution, and check if the normality assumption holds. Significant deviations would indicate that our model assumptions may not be correct.
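The standardized residuals can be computed directly from the formula above, as in the following sketch. The helper name `standardized_residuals` and the data are hypothetical; the result could then be passed to any quantile-quantile plotting routine.

```python
import numpy as np

def standardized_residuals(x, y):
    """Return epsilon_i^* = epsilon_i / (sigma_hat * sqrt(1 - 1/n - (x_i - mu_X)^2 / s_X))."""
    n = len(x)
    mu_x, mu_y = x.mean(), y.mean()
    s_x = np.sum((x - mu_x) ** 2)
    w = np.sum((x - mu_x) * (y - mu_y)) / s_x
    b = mu_y - w * mu_x
    eps = y - (b + w * x)
    sigma_hat = np.sqrt(np.sum(eps ** 2) / (n - 2))
    # Per-point scaling from var(epsilon_i); points far from mu_X get a smaller factor.
    scale = np.sqrt(1.0 - 1.0 / n - (x - mu_x) ** 2 / s_x)
    return eps / (sigma_hat * scale)

# Hypothetical toy data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([1.2, 2.9, 2.7, 4.8, 4.1, 6.3, 6.9, 7.4])

eps_star = standardized_residuals(x, y)
```

The scaling term \(\frac{1}{n}+\frac{(x_i-\mu_X)^2}{s_X}\) equals the \(i\)th diagonal entry of the hat matrix for the augmented data \([\1, X]\), which offers an independent check of the derivation.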

27.2 Multiple Regression

In multiple regression there are multiple independent attributes \(X_1,X_2,\cds,X_d\) and a single dependent or response attribute \(Y\), and we assume that the true relationship can be modeled as a linear function

\[Y=\beta+\omega_1\cd X_1+\omega_2\cd X_2+\cds+\omega_d X_d+\ve\]

where \(\beta\) is the intercept or bias term and \(\omega_i\) is the regression coefficient for attribute \(X_i\). We assume that \(\ve\) is a random variable that is normally distributed with mean \(\mu=0\) and variance \(\sg^2\).

Mean and Variance of Response Variable

Let \(\X=(X_1,X_2,\cds,X_d)^T\in\R^d\) denote the multivariate random variable comprising the independent attributes. Let \(\x=(x_1,x_2,\cds,x_d)^T\) be some fixed value of \(\X\), and let \(\bs\omega=(\omega_1,\omega_2,\cds,\omega_d)^T\). The expected response value is then given as

\[ \begin{align}\begin{aligned}E[Y|\X=\x]&=E[\beta+\omega_1\cd x_1+\cds+\omega_d\cd x_d+\ve]=E\bigg[\beta+\sum_{i=1}^d\omega_i\cd x_i\bigg]+E[\ve]\\&=\beta+\omega_1\cd x_1+\cds+\omega_d\cd x_d=\beta+\bs\omega^T\x\end{aligned}\end{align} \]

which follows from the assumption that \(E[\ve]=0\). The variance of the response variable is given as

\[\rm{var}(Y|\X=\x)=\rm{var}\bigg(\beta+\sum_{i=1}^d\omega_i\cd x_i+\ve\bigg)= \rm{var}\bigg(\beta+\sum_{i=1}^d\omega_i\cd x_i\bigg)+\rm{var}(\ve)=0+\sg^2= \sg^2\]

which follows from the assumption that all \(x_i\) are fixed a priori. Thus, we conclude that \(Y\) also follows a normal distribution with mean \(E[Y|\x]=\beta+\sum_{i=1}^d\omega_i\cd x_i=\beta+\bs\omega^T\x\) and variance \(\rm{var}(Y|\x)=\sg^2\).

Estimated Parameters

We augment the data matrix by adding a new column \(X_0\) with all values fixed at 1, that is, \(X_0=\1\). Thus, the augmented data \(\td\D\in\R^{n\times(d+1)}\) comprises the \((d+1)\) attributes \(X_0,X_1,X_2,\cds,X_d\), and each augmented point is given as \(\td\x_i=(1,x_{i1},x_{i2},\cds,x_{id})^T\).

Let \(b=w_0\) denote the estimated bias term, and let \(w_i\) denote the estimated regression weights. The augmented vector of estimated weights, including the bias term, is

\[\td\w=(w_0,w_1,\cds,w_d)^T\]

We then make predictions for any given point \(\x_i\) as follows:

\[\hat{y_i}=b\cd 1+w_1\cd x_{i1}+\cds+w_d\cd x_{id}=\td\w^T\td{\x_i}\]

Recall that these estimates are obtained by minimizing the sum of squared errors (SSE), given as

\[SSE=\sum_{i=1}^n(y_i-\hat{y_i})^2=\sum_{i=1}^n\bigg(y_i-b-\sum_{j=1}^dw_j\cd x_{ij}\bigg)^2\]

with the least squares estimate given as

\[\td\w=(\td\D^T\td\D)\im\td\D^TY\]

The estimated variance \(\hat\sg^2\) is then given as

Note

\(\dp\hat\sg^2=\frac{SSE}{n-(d+1)}=\frac{1}{n-d-1}\cd\sum_{i=1}^n(y_i-\hat{y_i})^2\)

We divide by \(n-(d+1)\) to get an unbiased estimate, since \(n-(d+1)\) is the number of degrees of freedom for estimating SSE.
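A minimal numerical sketch of the multiple regression estimates follows. The synthetic data, the assumed true parameters, and the variable names (`D_aug`, `w_aug`) are illustrative. Solving the normal equations with `np.linalg.solve` is used here in place of forming the inverse explicitly, which is a common numerical choice rather than something prescribed by the text.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 3
X = rng.normal(size=(n, d))                        # synthetic predictors
beta, omega, sigma = 0.5, np.array([1.0, -2.0, 0.0]), 0.3  # assumed true model
y = beta + X @ omega + rng.normal(0.0, sigma, size=n)

# Augment with a column of ones so that w_aug[0] plays the role of the bias b.
D_aug = np.column_stack([np.ones(n), X])

# Least squares estimate: solve (D^T D) w = D^T y.
w_aug = np.linalg.solve(D_aug.T @ D_aug, D_aug.T @ y)

y_hat = D_aug @ w_aug
sse = np.sum((y - y_hat) ** 2)
sigma2_hat = sse / (n - d - 1)                     # n - (d + 1) degrees of freedom
```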

Estimated Variance is Unbiased

Recall that

\[\hat{Y}=\td\D\td\w=\td\D(\td\D^T\td\D)\im\td\D^TY=\H Y\]

where \(\H\) is the \(n\times n\) hat matrix (assuming that \((\td\D^T\td\D)\im\) exists). Note that \(\H\) is an orthogonal projection matrix, since it is symmetric (\(\H^T=\H\)) and idempotent (\(\H^2=\H\)).

\[ \begin{align}\begin{aligned}\H^T&=(\td\D(\td\D^T\td\D)\im\td\D^T)^T=(\td\D^T)^T((\td\D^T\td\D)^T)\im\td\D^T=\H\\\H^2&=\td\D(\td\D^T\td\D)\im\td\D^T\td\D(\td\D^T\td\D)\im\td\D^T=\td\D(\td\D^T\td\D)\im\td\D^T=\H\end{aligned}\end{align} \]

Furthermore, the trace of the hat matrix is given as

\[\rm{tr}(\H)=\rm{tr}(\td\D(\td\D^T\td\D)\im\td\D^T)=\rm{tr}(\td\D^T\td\D(\td\D^T\td\D)\im)=\rm{tr}(\I_{(d+1)})=d+1\]

Finally, note that the matrix \(\I-\H\) is also symmetric and idempotent, since

\[ \begin{align}\begin{aligned}(\I-\H)^T&=\I^T-\H^T=\I-\H\\(\I-\H)^2&=(\I-\H)(\I-\H)=\I-\H-\H+\H^2=\I-\H\end{aligned}\end{align} \]
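The stated properties of \(\H\) and \(\I-\H\) are easy to confirm numerically. The sketch below builds a hat matrix from a small random augmented data matrix; the sizes and the seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 12, 3
D_aug = np.column_stack([np.ones(n), rng.normal(size=(n, d))])

# Hat matrix H = D (D^T D)^{-1} D^T and its complement I - H.
H = D_aug @ np.linalg.inv(D_aug.T @ D_aug) @ D_aug.T
M = np.eye(n) - H

is_symmetric = np.allclose(H, H.T)
is_idempotent = np.allclose(H @ H, H)
trace_H = np.trace(H)   # should equal d + 1
trace_M = np.trace(M)   # should equal n - d - 1
```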

Now consider the squared error; we have

\[ \begin{align}\begin{aligned}SSE&=\lv Y-\hat{Y}\rv^2=\lv Y-\H Y\rv^2=\lv(\I-\H)Y\rv^2\\&=Y^T(\I-\H)(\I-\H)Y=Y^T(\I-\H)Y\end{aligned}\end{align} \]

However, note that the response vector \(Y\) is given as

\[Y=\td\D\td{\bs\omega}+\bs\ve\]

where \(\td{\bs\omega}=(\omega_0,\omega_1,\cds,\omega_d)^T\) is the true (augmented) vector of parameters of the model, with \(\omega_0=\beta\), and \(\bs\ve=(\ve_1,\ve_2,\cds,\ve_n)^T\) is the true error vector, which is assumed to be normally distributed with mean \(E[\bs\ve]=\0\) and with fixed variance \(\rm{var}(\ve_i)=\sg^2\) for each point, so that \(\rm{cov}(\bs\ve)=\sg^2\I\).

\[ \begin{align}\begin{aligned}SSE&=Y^T(\I-\H)Y=(\td\D\td{\bs\omega}+\bs\ve)^T(\I-\H)(\td\D\td{\bs\omega}+\bs\ve)\\&=(\td\D\td{\bs\omega}+\bs\ve)^T((\I-\H)\td\D\td{\bs\omega}+(\I-\H)\bs\ve)\\&=((\I-\H)\bs\ve)^T(\td\D\td{\bs\omega}+\bs\ve)=\bs\ve^T(\I-\H)(\td\D\td{\bs\omega}+\bs\ve)\\&=\bs\ve^T(\I-\H)\td\D\td{\bs\omega}+\bs\ve^T(\I-\H)\bs\ve=\bs\ve^T(\I-\H)\bs\ve\end{aligned}\end{align} \]

where we use the observation that

\[(\I-\H)\td\D\td{\bs\omega}=\td\D\td{\bs\omega}-\H\td\D\td{\bs\omega}= \td\D\td{\bs\omega}-(\td\D(\td\D^T\td\D)\im\td\D^T)\td\D\td{\bs\omega}= \td\D\td{\bs\omega}-\td\D\td{\bs\omega}=\0\]
\[ \begin{align}\begin{aligned}E[SSE]&=E[\bs\ve^T(\I-\H)\bs\ve]\\&=E\bigg[\sum_{i=1}^n\ve_i^2-\sum_{i=1}^n\sum_{j=1}^nh_{ij}\ve_i\ve_j\bigg]= \sum_{i=1}^nE[\ve_i^2]-\sum_{i=1}^n\sum_{j=1}^nh_{ij}E[\ve_i\ve_j]\\&=\sum_{i=1}^n(1-h_{ii})E[\ve_i^2]\\&=\bigg(n-\sum_{i=1}^nh_{ii}\bigg)\sg^2=(n-\rm{tr}(\H))\sg^2=(n-d-1)\cd\sg^2\end{aligned}\end{align} \]

It follows that

\[E[\hat\sg^2]=E\bigg[\frac{SSE}{(n-d-1)}\bigg]=\frac{1}{(n-d-1)}E[SSE]=\frac{1}{(n-d-1)}\cd(n-d-1)\cd\sg^2=\sg^2\]
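The unbiasedness result can be illustrated with a simulation that repeatedly draws error vectors for a fixed augmented data matrix. Everything in the sketch (sizes, true parameters, noise level, seed, number of trials) is an assumed setting chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, sigma = 15, 2, 1.5
D_aug = np.column_stack([np.ones(n), rng.normal(size=(n, d))])  # fixed design
omega_aug = np.array([1.0, -0.5, 2.0])   # assumed true augmented parameters

H = D_aug @ np.linalg.inv(D_aug.T @ D_aug) @ D_aug.T
M = np.eye(n) - H                        # residual-forming matrix I - H

trials = 20000
eps = rng.normal(0.0, sigma, size=(trials, n))
Y = D_aug @ omega_aug + eps              # one simulated response vector per row
resid = Y @ M                            # row-wise (I - H) y, using symmetry of M
sse = np.sum(resid ** 2, axis=1)

mean_unbiased = np.mean(sse / (n - d - 1))  # averages to roughly sigma^2
mean_biased = np.mean(sse / n)              # systematically underestimates sigma^2
```

Dividing by \(n\) instead of \(n-d-1\) shrinks the average estimate by the factor \((n-d-1)/n\), which is why the degrees-of-freedom correction matters for small \(n\).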

27.2.1 Goodness of Fit

The decomposition of the total sum of squares, TSS, into the sum of squared errors, SSE, and the regression sum of squares, RSS, holds true for multiple regression as well:

\[ \begin{align}\begin{aligned}TSS&=SSE+RSS\\\sum_{i=1}^n(y_i-\mu_Y)^2&=\sum_{i=1}^n(y_i-\hat{y_i})^2+\sum_{i=1}^n(\hat{y_i}-\mu_Y)^2\end{aligned}\end{align} \]

The coefficient of multiple determination, \(R^2\), gives the goodness of fit, measured as the fraction of the variation explained by the linear model:

Note

\(\dp R^2=1-\frac{SSE}{TSS}=\frac{TSS-SSE}{TSS}=\frac{RSS}{TSS}\)

One of the potential problems with the \(R^2\) measure is that it is susceptible to increase as the number of attributes increases, even though the additional attributes may be uninformative. To counter this, we can consider the adjusted coefficient of determination, which takes into account the degrees of freedom in both TSS and SSE:

Note

\(\dp R_a^2=1-\frac{SSE/(n-d-1)}{TSS/(n-1)}=1-\frac{(n-1)\cd SSE}{(n-d-1)\cd TSS}\)

We can observe that the adjusted \(R_a^2\) measure is always less than \(R^2\), since the ratio \(\frac{n-1}{n-d-1}>1\). If there is too much of a difference between \(R^2\) and \(R_a^2\), it might indicate that there are potentially many, possibly irrelevant, attributes being used to fit the model.
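To see how \(R^2\) reacts to uninformative attributes, one can fit the same response with and without extra pure-noise columns. The sketch below is a hypothetical experiment: the data-generating model, the number of noise columns, and the function name `r2_scores` are all assumptions.

```python
import numpy as np

def r2_scores(X, y):
    """Return (R^2, adjusted R^2) for a least squares fit with a bias term."""
    n, d = X.shape
    D_aug = np.column_stack([np.ones(n), X])
    w_aug, *_ = np.linalg.lstsq(D_aug, y, rcond=None)
    sse = np.sum((y - D_aug @ w_aug) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - sse / tss
    r2_adj = 1.0 - (sse / (n - d - 1)) / (tss / (n - 1))
    return r2, r2_adj

rng = np.random.default_rng(4)
n = 40
X = rng.normal(size=(n, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(0.0, 1.0, size=n)

# Append five attributes that are unrelated to y by construction.
X_noise = np.column_stack([X, rng.normal(size=(n, 5))])

r2, r2_adj = r2_scores(X, y)
r2_big, r2_adj_big = r2_scores(X_noise, y)
```

Because the smaller model is nested in the larger one, `r2_big` can never fall below `r2`; the adjusted measure carries no such guarantee, which is what makes it useful for comparing the two fits.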

Geometry of Goodness of Fit

\[\bar{X_i}=X_i-\mu_{X_i}\cd\1\quad\quad\bar{Y}=Y-\mu_Y\cd\1\quad\quad\hat{\bar{Y}}=\hat{Y}-\mu_Y\cd\1\]

The centered vectors \(\bar{Y}\) and \(\hat{\bar{Y}}\), and the error vector \(\bs\epsilon\) form a right triangle, and thus, by the Pythagoras theorem, we have

\[ \begin{align}\begin{aligned}\lv\bar{Y}\rv^2&=\lv\hat{\bar{Y}}\rv^2+\lv\bs\epsilon\rv^2=\lv\hat{\bar{Y}}\rv^2+\lv Y-\hat{Y}\rv^2\\TSS&=RSS+SSE\end{aligned}\end{align} \]

The correlation between \(Y\) and \(\hat{Y}\) is the cosine of the angle between \(\bar{Y}\) and \(\hat{\bar{Y}}\), which is also given as the ratio of the base to the hypotenuse

\[\rho_{Y\hat{Y}}=\cos\th=\frac{\lv\hat{\bar{Y}}\rv}{\lv\bar{Y}\rv}\]

The coefficient of multiple determination is given as

\[R^2=\frac{RSS}{TSS}=\frac{\lv\hat{\bar{Y}}\rv^2}{\lv\bar{Y}\rv^2}=\rho_{Y\hat{Y}}^2\]

27.2.2 Inference about Regression Coefficients

Let \(Y\) be the response vector over all observations. Let \(\td\w=(w_0,w_1,w_2,\cds,w_d)^T\) be the estimated vector of regression coefficients, computed as

\[\td\w=(\td\D^T\td\D)\im\td\D^TY\]

The expected value of \(\td\w\) is given as follows:

\[ \begin{align}\begin{aligned}E[\td\w]&=E[(\td\D^T\td\D)\im\td\D^TY]=(\td\D^T\td\D)\im\td\D^T\cd E[Y]\\&=(\td\D^T\td\D)\im\td\D^T\cd E[\td\D\td{\bs\omega}+\bs\ve]=(\td\D^T\td\D) \im(\td\D^T\td\D)\td{\bs\omega}=\td{\bs\omega}\end{aligned}\end{align} \]

Thus, \(\td\w\) is an unbiased estimator for the true regression coefficient vector \(\td{\bs\omega}\).

\[ \begin{align}\begin{aligned}\rm{cov}(\td\w)&=\rm{cov}((\td\D^T\td\D)\im\td\D^TY)\\&=\rm{cov}(\A Y)=\A\rm{cov}(Y)\A^T\\&=\A\cd(\sg^2\cd\I)\cd\A^T\\&=(\td\D^T\td\D)\im\td\D^T(\sg^2\cd\I)\td\D(\td\D^T\td\D)\im\\&=\sg^2\cd(\td\D^T\td\D)\im(\td\D^T\td\D)(\td\D^T\td\D)\im\\&=\sg^2(\td\D^T\td\D)\im\end{aligned}\end{align} \]

Here, we made use of the fact that \(\A=(\td\D^T\td\D)\im\td\D^T\) is a matrix of fixed values, and therefore \(\rm{cov}(\A Y)=\A\rm{cov}(Y)\A^T\). Also, we have \(\rm{cov}(Y)=\sg^2\cd\I\), which follows from the fact that the observed response \(y_i\)’s are all independent and have the same variance \(\sg^2\).

Note that \(\td\D^T\td\D\in\R^{(d+1)\times(d+1)}\) is the uncentered scatter matrix for the augmented data. Let \(\C\) denote the inverse of \(\td\D^T\td\D\).

\[(\td\D^T\td\D)\im=\C\]

Therefore, the covariance matrix for \(\td\w\) can be written as

\[\rm{cov}(\td\w)=\sg^2\C\]

In particular, the diagonal entries \(\sg^2\cd c_{ii}\) give the variance for each of the regression coefficient estimates, and their square roots specify the standard errors.

\[\rm{var}(w_i)=\sg^2\cd c_{ii}\quad\quad\rm{se}(w_i)=\sqrt{\rm{var}(w_i)}=\sg\cd\sqrt{c_{ii}}\]
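As a worked special case (an illustration, not part of the derivation above), consider a single attribute, \(d=1\), with values \(x_1,\cds,x_n\) and mean \(\mu_X\). The augmented scatter matrix and its inverse are then

\[\td\D^T\td\D=\begin{pmatrix}n&\sum_i x_i\\\sum_i x_i&\sum_i x_i^2\end{pmatrix}\quad\quad\C=\frac{1}{n\sum_i(x_i-\mu_X)^2}\begin{pmatrix}\sum_i x_i^2&-\sum_i x_i\\-\sum_i x_i&n\end{pmatrix}\]

where we used \(n\sum_i x_i^2-(\sum_i x_i)^2=n\sum_i(x_i-\mu_X)^2\) for the determinant. Reading off the diagonal entries gives

\[\rm{var}(w_0)=\frac{\sg^2\sum_i x_i^2}{n\sum_i(x_i-\mu_X)^2}\quad\quad\rm{var}(w_1)=\frac{\sg^2}{\sum_i(x_i-\mu_X)^2}\]

so the slope estimate becomes more precise as the attribute values spread out around their mean.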

We can now define the standardized variable \(Z_{w_i}\) that can be used to derive the confidence intervals for \(w_i\) as follows

Note

\(\dp Z_{w_i}=\frac{w_i-E[w_i]}{\rm{se}(w_i)}=\frac{w_i-\omega_i}{\hat\sg\sqrt{c_{ii}}}\)

Each of the variables \(Z_{w_i}\) follows a \(t\)-distribution with \(n-d-1\) degrees of freedom, from which we can obtain the \(100(1-\alpha)\%\) confidence interval of the true value \(\omega_i\) as follows:

\[P(w_i-t_{\alpha/2}\cd\rm{se}(w_i)\leq\omega_i\leq w_i+t_{\alpha/2}\cd\rm{se}(w_i))=1-\alpha\]

Here, \(t_{\alpha/2}\) is the critical value of the \(t\) distribution, with \(n-d-1\) degrees of freedom, that encompasses \(\alpha/2\) fraction of the probability mass in the right tail, given as

\[P(Z\geq t_{\alpha/2})=\frac{\alpha}{2}\rm{\ or\ equivalently\ }T_{n-d-1}(t_{\alpha/2})=1-\frac{\alpha}{2}\]
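The steps above can be sketched for the single-attribute case (\(d=1\)), where the \(2\times 2\) inverse \(\C\) has a closed form. The helper names `coef_std_errors` and `conf_interval` and the data are hypothetical. The critical value is passed in by the caller rather than computed: here we assume \(t_{0.025}\approx 3.182\) for 3 degrees of freedom, as read from a standard \(t\) table.

```python
import math

def coef_std_errors(x, y):
    """Fit y ~ w0 + w1*x; return (w, se) with se(w_i) = sigma_hat * sqrt(c_ii)."""
    n, d = len(x), 1
    sx, sxx = sum(x), sum(a * a for a in x)
    sy, sxy = sum(y), sum(a * c for a, c in zip(x, y))
    det = n * sxx - sx * sx
    # C = inverse of the augmented scatter matrix [[n, sx], [sx, sxx]]
    c00, c01, c11 = sxx / det, -sx / det, n / det
    w0 = c00 * sy + c01 * sxy
    w1 = c01 * sy + c11 * sxy
    sse = sum((c - (w0 + w1 * a)) ** 2 for a, c in zip(x, y))
    sigma_hat = math.sqrt(sse / (n - d - 1))   # estimated standard deviation
    return (w0, w1), (sigma_hat * math.sqrt(c00), sigma_hat * math.sqrt(c11))

def conf_interval(w_i, se_i, t_crit):
    """Interval w_i -/+ t_crit * se(w_i); t_crit comes from a t table or a library."""
    return w_i - t_crit * se_i, w_i + t_crit * se_i

x = [1.0, 2.0, 3.0, 4.0, 5.0]   # made-up attribute values
y = [2.1, 3.9, 6.2, 7.8, 10.1]  # made-up responses
w, se = coef_std_errors(x, y)
ci_slope = conf_interval(w[1], se[1], t_crit=3.182)   # assumed table value, 3 d.o.f.
```

Passing the critical value as an argument keeps the sketch free of external dependencies; in practice it would be obtained from a statistics library.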

27.2.3 Hypothesis Testing

We set up the null hypothesis that all the true weights are zero, except for the bias term (\(\beta=\omega_0\)). We contrast the null hypothesis with the alternative hypothesis that at least one of the weights is not zero

\[ \begin{align}\begin{aligned}H_0&:\omega_1=0,\omega_2=0,\cds,\omega_d=0\\H_a&:\exists i,\rm{\ such\ that\ }\omega_i\neq 0\end{aligned}\end{align} \]

The null hypothesis can also be written as \(H_0:\bs\omega=\0\).

We use the \(F\)-test that compares the ratio of the adjusted RSS value to the estimated variance \(\hat\sg^2\), defined via the \(f\)-statistic

Note

\(\dp f=\frac{RSS/d}{\hat\sg^2}=\frac{RSS/d}{SSE/(n-d-1)}\)

Under the null hypothesis, we have

\[E[RSS/d]=\sg^2\]

To see this, consider

\[ \begin{align}\begin{aligned}\hat{Y}&=b\cd\1+w_1\cd X_1+\cds+w_d\cd X_d\\\hat{Y}&=(\mu_Y-w_1\mu_{X_1}-\cds-w_d\mu_{X_d})\cd\1+w_1\cd X_1+\cds+w_d\cd X_d\\\hat{Y}-\mu_Y\cd\1&=w_1(X_1-\mu_{X_1}\cd\1)+\cds+w_d(X_d-\mu_{X_d}\cd\1)\\\hat{\bar{Y}}&=w_1\bar{X_1}+w_2\bar{X_2}+\cds+w_d\bar{X_d}=\sum_{i=1}^dw_i\bar{X_i}\end{aligned}\end{align} \]

Consider the RSS value; we have

\[ \begin{align}\begin{aligned}RSS&=\lv\hat{Y}-\mu_Y\cd\1\rv^2=\lv\hat{\bar{Y}}\rv^2=\hat{\bar{Y}}^T\hat{\bar{Y}}\\&=\bigg(\sum_{i=1}^dw_i\bar{X_i}\bigg)^T\bigg(\sum_{j=1}^dw_j\bar{X_j}\bigg) =\sum_{i=1}^d\sum_{j=1}^dw_iw_j\bar{X_i}^T\bar{X_j}=\w^T(\bar\D^T\bar\D)\w\end{aligned}\end{align} \]

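The expected value of RSS is thus given as follows, where we use the facts that \(E[\w]=\bs\omega=\0\) under the null hypothesis and that \(\rm{cov}(\w)=\sg^2(\bar\D^T\bar\D)\im\) for the centered data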

\[ \begin{align}\begin{aligned}E[RSS]&=E[\w^T(\bar\D^T\bar\D)\w]\\&=\rm{tr}(E[\w^T(\bar\D^T\bar\D)\w])\\&=E[\rm{tr}(\w^T(\bar\D^T\bar\D)\w)]\\&=E[\rm{tr}((\bar\D^T\bar\D)\w\w^T)]\\&=\rm{tr}((\bar\D^T\bar\D)\cd E[\w\w^T])\\&=\rm{tr}((\bar\D^T\bar\D)\cd(\rm{cov}(\w)+E[\w]\cd E[\w]^T))\\&=\rm{tr}((\bar\D^T\bar\D)\cd\rm{cov}(\w))\\&=\rm{tr}((\bar\D^T\bar\D)\cd\sg^2(\bar\D^T\bar\D)\im)\\&=\sg^2\rm{tr}(\I_d)=d\cd\sg^2\end{aligned}\end{align} \]
\[ \begin{align}\begin{aligned}E\bigg[\frac{RSS}{d}\bigg]&=\frac{1}{d}E[RSS]=\frac{1}{d}\cd d\cd\sg^2=\sg^2\\E[\hat\sg^2]&=\sg^2\end{aligned}\end{align} \]

Thus, under the null hypothesis the \(f\)-statistic has a value close to 1, which indicates that there is no relationship between the predictor and response variables. On the other hand, if the alternative hypothesis is true, then \(E[RSS/d]\geq\sg^2\), resulting in a larger \(f\) value.
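The claim \(E[RSS/d]=\sg^2\) under the null hypothesis can be checked by simulation. The sketch below is a Monte Carlo illustration for \(d=1\), with arbitrarily chosen values for \(\sg\), the bias, the seed, and the number of trials; the helper `rss_simple` is hypothetical.

```python
import random

def rss_simple(x, y):
    """RSS for the least-squares line (d = 1): w^2 times the centered scatter of x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    w = sum((a - mx) * (c - my) for a, c in zip(x, y)) / sxx
    return w * w * sxx

rng = random.Random(0)                 # fixed seed, chosen arbitrarily
x = [float(i) for i in range(10)]      # fixed attribute values
sigma, beta, trials = 2.0, 5.0, 20000  # assumed noise level, bias, sample count

# Under H0 the response is bias plus noise, with no dependence on x.
mean_rss = sum(
    rss_simple(x, [beta + rng.gauss(0.0, sigma) for _ in x]) for _ in range(trials)
) / trials
```

Since \(d=1\) here, `mean_rss` should come out close to \(\sg^2=4\), up to sampling error.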

The ratio \(f\) follows an \(F\)-distribution with \(d\) and \(n-d-1\) degrees of freedom for the numerator and denominator, respectively. Therefore, we can reject the null hypothesis if the p-value of the observed \(f\) is less than the chosen significance level.

Notice that, since \(R^2=1-\frac{SSE}{TSS}=\frac{RSS}{TSS}\), we have

\[SSE=(1-R^2)\cd TSS\quad\quad RSS=R^2\cd TSS\]

Therefore, we can rewrite the \(f\) ratio as follows

Note

\(\dp f=\frac{RSS/d}{SSE/(n-d-1)}=\frac{n-d-1}{d}\cd\frac{R^2}{1-R^2}\)

In other words, the \(F\)-test compares the adjusted fraction of explained variation to the unexplained variation. If \(R^2\) is high, it means the model can fit the data well, and that is more evidence to reject the null hypothesis.
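The two forms of the \(f\) ratio can be compared directly. This is a minimal sketch with hypothetical function names and made-up sums of squares.

```python
def f_from_sums(rss, sse, n, d):
    """f-statistic from the sums of squares: (RSS / d) / (SSE / (n - d - 1))."""
    return (rss / d) / (sse / (n - d - 1))

def f_from_r2(r2, n, d):
    """The same statistic expressed through R^2."""
    return (n - d - 1) / d * r2 / (1.0 - r2)

# Made-up example: TSS = 100 split into RSS = 80 and SSE = 20, n = 23, d = 2.
rss, sse, n, d = 80.0, 20.0, 23, 2
f_a = f_from_sums(rss, sse, n, d)
f_b = f_from_r2(rss / (rss + sse), n, d)
```

Both expressions give the same value (up to rounding), since \(RSS=R^2\cd TSS\) and \(SSE=(1-R^2)\cd TSS\).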

Hypothesis Testing for Individual Parameters

For attribute \(X_i\), we set up the null hypothesis \(H_0:\omega_i=0\) and contrast it with the alternative hypothesis \(H_a:\omega_i\neq 0\). The standardized variable \(Z_{w_i}\) under the null hypothesis is given as

Note

\(\dp Z_{w_i}=\frac{w_i-E[w_i]}{\rm{se}(w_i)}=\frac{w_i}{\rm{se}(w_i)}=\frac{w_i}{\hat\sg\sqrt{c_{ii}}}\)

Next, using a two-tailed \(t\)-test with \(n-d-1\) degrees of freedom, we compute the p-value of \(Z_{w_i}\). If this probability is smaller than the significance level \(\alpha\), we can reject the null hypothesis. Otherwise, we fail to reject (that is, we accept) the null hypothesis, which would imply that \(X_i\) does not add significant value in predicting the response in light of other attributes already used to fit the model. The \(t\)-test can also be used to test whether the bias term is significantly different from 0 or not.
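The decision rule can be sketched as follows. Comparing \(|Z_{w_i}|\) with the critical value \(t_{\alpha/2}\) is equivalent to comparing the two-tailed p-value with \(\alpha\), so the sketch avoids computing the p-value itself. The function names and all numeric inputs are hypothetical, and the critical value (about 3.182 for \(\alpha=0.05\) with 3 degrees of freedom) is assumed to come from a \(t\) table.

```python
import math

def t_statistic(w_i, sigma_hat, c_ii):
    """Standardized coefficient Z under H0: omega_i = 0."""
    return w_i / (sigma_hat * math.sqrt(c_ii))

def reject_null(z, t_crit):
    """Two-tailed decision: |Z| > t_crit is equivalent to p-value < alpha."""
    return abs(z) > t_crit

t_crit = 3.182                          # assumed table value, alpha = 0.05, 3 d.o.f.
z_slope = t_statistic(1.99, 0.2, 0.1)   # made-up estimate, sigma_hat and c_ii
z_bias = t_statistic(0.05, 0.2, 1.1)    # made-up values for the bias term
decisions = (reject_null(z_slope, t_crit), reject_null(z_bias, t_crit))
```

With these made-up inputs the slope is judged significant, whereas the bias term is not.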

27.2.4 Geometric Approach to Statistical Testing

Let \(\bar{X_i}=X_i-\mu_{X_i}\cd\1\) denote the centered attribute vector, and let \(\bar\X=(\bar{X_1},\bar{X_2},\cds,\bar{X_d})^T\) denote the multivariate centered vector of predictor variables. The \(n\)-dimensional space over the points is divided into three mutually orthogonal subspaces, namely the 1-dimensional mean space \(\cl{S}_\mu=span(\1)\), the \(d\)-dimensional centered variable space \(\cl{S}_{\bar{X}}=span(\bar\X)\), and the \((n-d-1)\)-dimensional error space \(\cl{S}_\epsilon\), which contains the error vector \(\bs\epsilon\). The response vector \(Y\) can thus be decomposed into three components

\[Y=\mu_Y\cd\1+\hat{\bar{Y}}+\bs\epsilon\]

Recall that the degrees of freedom of a random vector is defined as the dimensionality of its enclosing subspace. Since the original dimensionality of the point space is \(n\), we have a total of \(n\) degrees of freedom. The mean space has dimensionality \(dim(\cl{S}_\mu)=1\), the centered variable space has \(dim(\cl{S}_{\bar{X}})=d\), and the error space has \(dim(\cl{S}_\epsilon)=n-d-1\), so that we have

\[dim(\cl{S}_\mu)+dim(\cl{S}_{\bar{X}})+dim(\cl{S}_\epsilon)=1+d+(n-d-1)=n\]
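The orthogonal decomposition can be verified on a small example. The sketch below uses made-up data with a single attribute (\(d=1\)) and checks that the three components of \(Y\) are mutually orthogonal; the variable names are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

x = [1.0, 2.0, 3.0, 4.0, 5.0]   # made-up attribute values
y = [2.1, 3.9, 6.2, 7.8, 10.1]  # made-up responses
n = len(x)
mu_x, mu_y = sum(x) / n, sum(y) / n

x_bar = [a - mu_x for a in x]
y_bar = [c - mu_y for c in y]
w = dot(x_bar, y_bar) / dot(x_bar, x_bar)   # least-squares slope for d = 1

mean_part = [mu_y] * n                                # component in the mean space
fit_part = [w * a for a in x_bar]                     # component in the centered variable space
err_part = [c - f for c, f in zip(y_bar, fit_part)]   # component in the error space

# All three pairwise inner products should vanish up to rounding.
inner = (dot(mean_part, fit_part), dot(mean_part, err_part), dot(fit_part, err_part))
```

Adding the three parts elementwise recovers \(Y\), and here the error space has dimension \(n-d-1=3\).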

Population Regression Model

For a fixed value \(\x_i=(x_{i1},x_{i2},\cds,x_{id})^T\), the true response \(y_i\) is given as

\[y_i=\beta+\omega_1\cd x_{i1}+\cds+\omega_d\cd x_{id}+\ve_i\]

where the systematic part of the model \(\beta+\sum_{j=1}^d\omega_j\cd x_{ij}\) is fixed, and the error term \(\ve_i\) varies randomly, with the assumption that \(\ve_i\) follows a normal distribution with mean \(\mu=0\) and variance \(\sg^2\). We also assume that the \(\ve_i\) values are all independent of each other.

Substituting \(\beta=\mu_Y-\sum_{j=1}^d\omega_j\cd\mu_{X_j}\), we can rewrite the model in terms of the centered attribute values

\[ \begin{align}\begin{aligned}y_i&=\mu_Y+\omega_1\cd(x_{i1}-\mu_{X_1})+\cds+\omega_d\cd(x_{id}-\mu_{X_d})+\ve_i\\&=\mu_Y+\omega_1\cd\bar{x_{i1}}+\cds+\omega_d\cd\bar{x_{id}}+\ve_i\end{aligned}\end{align} \]

Across all the points, we can rewrite the above equation in vector form

\[Y=\mu_Y\cd\1+\omega_1\cd\bar{X_1}+\cds+\omega_d\cd\bar{X_d}+\bs\ve\]

We can also center the vector \(Y\), so that we obtain a regression model over the centered response and predictor variables

\[\bar{Y}=Y-\mu_Y\cd\1=\omega_1\cd\bar{X_1}+\omega_2\cd\bar{X_2}+\cds+\omega_d \cd\bar{X_d}+\bs\ve=E[\bar{Y}|\bar{\X}]+\bs\ve\]

In this equation, \(\sum_{i=1}^d\omega_i\cd\bar{X_i}\) is a fixed vector that denotes the expected value \(E[\bar{Y}|\bar\X]\), and \(\bs\ve\) is an \(n\)-dimensional random vector that follows an \(n\)-dimensional multivariate normal distribution with mean \(\mmu=\0\) and a fixed variance \(\sg^2\) along all dimensions, so that its covariance matrix is \(\bs\Sg=\sg^2\cd\I\). The distribution of \(\bs\ve\) is therefore given as

\[f(\bs\ve)=\frac{1}{(\sqrt{2\pi})^n\cd\sqrt{|\bs\Sg|}}\cd\exp\bigg\{-\frac{ \bs\ve^T\bs\Sg\im\bs\ve}{2}\bigg\}=\frac{1}{(\sqrt{2\pi})^n\cd\sg^n}\cd\exp \bigg\{-\frac{\lv\bs\ve\rv^2}{2\cd\sg^2}\bigg\}\]

which follows from the fact that \(|\bs\Sg|=\det(\bs\Sg)=\det(\sg^2\I)=(\sg^2)^n\) and \(\bs\Sg\im=\frac{1}{\sg^2}\I\).

The density of \(\bs\ve\) is thus a function of its squared length \(\lv\bs\ve\rv^2\), independent of its angle. In other words, the vector \(\bs\ve\) is distributed uniformly over all angles and is equally likely to point in any direction.
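To illustrate that the density depends only on the length of \(\bs\ve\), the sketch below evaluates the log of the density above for several vectors of the same length pointing in different directions. The function name `log_density` and the sample vectors are hypothetical.

```python
import math

def log_density(eps, sigma):
    """Log of f(eps) for an n-dimensional normal with covariance sigma^2 * I."""
    n = len(eps)
    sq_len = sum(e * e for e in eps)   # squared length of eps
    return -n * math.log(math.sqrt(2 * math.pi) * sigma) - sq_len / (2 * sigma ** 2)

sigma = 1.5   # assumed noise level
same_length = [[3.0, 4.0], [5.0, 0.0], [0.0, -5.0]]   # all have length 5
values = [log_density(v, sigma) for v in same_length]
shorter = log_density([1.0, 0.0], sigma)              # length 1
```

All entries of `values` agree, while the shorter vector has a higher density, as expected for a distribution centered at the origin.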

Hypothesis Testing

Consider the population regression model

\[Y=\mu_Y\cd\1+\omega_1\cd\bar{X_1}+\cds+\omega_d\cd\bar{X_d}+\bs\ve=\mu_Y\cd\1+E[\bar{Y}|\bar\X]+\bs\ve\]

The null hypothesis is

\[H_0:\omega_1=0,\omega_2=0,\cds,\omega_d=0\]

In this case, we have

\[Y=\mu_Y\cd\1+\bs\ve\Rightarrow Y-\mu_Y\cd\1=\bs\ve\Rightarrow\bar{Y}=\bs\ve\]

Since \(\bs\ve\) is normally distributed with mean \(\0\) and covariance matrix \(\sg^2\cd\I\), under the null hypothesis, the variation in \(\bar{Y}\) for a given value of \(\x\) will therefore be centered around the origin \(\0\).

On the other hand, under the alternative hypothesis \(H_a\) that at least one of the \(\omega_i\) is non-zero, we have

\[\bar{Y}=E[\bar{Y}|\bar\X]+\bs\ve\]

Thus, the variation in \(\bar{Y}\) is shifted away from the origin \(\0\) in the direction \(E[\bar{Y}|\bar\X]\).

We estimate the true value of \(E[\bar{Y}|\bar\X]\) by projecting the centered observation vector \(\bar{Y}\) onto the subspaces \(\cl{S}_{\bar{X}}\) and \(\cl{S}_\epsilon\), as follows

\[\bar{Y}=w_1\cd\bar{X_1}+w_2\cd\bar{X_2}+\cds+w_d\cd\bar{X_d}+\bs\epsilon=\hat{\bar{Y}}+\bs\epsilon\]

Under the null hypothesis, the true centered response vector is \(\bar{Y}=\bs\ve\), and therefore, \(\hat{\bar{Y}}\) and \(\bs\epsilon\) are simply the projections of the random error vector \(\bs\ve\) onto the subspaces \(\cl{S}_{\bar{X}}\) and \(\cl{S}_\epsilon\). In this case, we also expect the length of \(\bs\epsilon\) and \(\hat{\bar{Y}}\) to be roughly comparable. On the other hand, under the alternative hypothesis, we have \(\bar{Y}=E[\bar{Y}|\bar\X]+\bs\ve\), and so \(\hat{\bar{Y}}\) will be relatively much longer compared to \(\bs\epsilon\).

Define the mean squared length per dimension for the two vectors \(\hat{\bar{Y}}\) and \(\bs\epsilon\), as follows

\[ \begin{align}\begin{aligned}M(\hat{\bar{Y}})&=\frac{\lv\hat{\bar{Y}}\rv^2}{dim(\cl{S}_{\bar{X}})}=\frac{\lv\hat{\bar{Y}}\rv^2}{d}\\M(\bs\epsilon)&=\frac{\lv\bs\epsilon\rv^2}{dim(\cl{S}_\epsilon)}=\frac{\lv\bs\epsilon\rv^2}{n-d-1}\end{aligned}\end{align} \]

The geometric ratio test is identical to the F-test since

\[\frac{M(\hat{\bar{Y}})}{M(\bs\epsilon)}=\frac{\lv\hat{\bar{Y}}\rv^2/d} {\lv\bs\epsilon\rv^2/(n-d-1)}=\frac{RSS/d}{SSE/(n-d-1)}=f\]

The geometric approach makes it clear that if \(f\simeq 1\) then the null hypothesis holds, and we conclude that \(Y\) does not depend on the predictor variables \(X_1,X_2,\cds,X_d\). On the other hand, if \(f\) is large, with a p-value less than the significance level, then we can reject the null hypothesis and accept the alternative hypothesis that \(Y\) depends on at least one predictor variable \(X_i\).