Chapter 27 Regression Evaluation
Given a set of predictor attributes or independent variables \(X_1,X_2,\cds,X_d\), and given the response attribute \(Y\), the goal of regression is to learn a function \(f\), such that
\(\dp Y=f(X_1,X_2,\cds,X_d)+\ve=f(\X)+\ve\)
where \(\X=(X_1,X_2,\cds,X_d)^T\) is the \(d\)-dimensional multivariate random variable comprising the predictor variables. Here, the random variable \(\ve\) denotes the inherent error in the response that is not explained by the linear model.
When estimating the regression function \(f\), we make assumptions about the form of \(f\). Once we have estimated the bias and coefficients, we need to formulate a probabilistic model of regression to evaluate the learned model in terms of goodness of fit and confidence intervals for the parameters, and to test for regression effects, namely whether \(\X\) really helps in predicting \(Y\). In particular, we assume that even if the value of \(\X\) has been fixed, there can still be uncertainty in the response \(Y\). Further, we will assume that the error \(\ve\) is independent of \(\X\) and follows a normal (or Gaussian) distribution with mean \(\mu=0\) and variance \(\sg^2\), that is, we assume that the errors are independent and identically distributed with zero mean and fixed variance.
The probabilistic regression model comprises two components: the deterministic component comprising the observed predictor attributes, and the random error component comprising the error term, which is assumed to be independent of the predictor attributes.
27.1 Univariate Regression
We assume that the true relationship can be modeled as a linear function
\(\dp Y=f(X)+\ve=\beta+\omega\cd X+\ve\)
where \(\omega\) is the slope of the best fitting line and \(\beta\) is its intercept, and \(\ve\) is the random error variable that follows a normal distribution with mean \(\mu=0\) and variance \(\sg^2\).
Mean and Variance of Response Variable
Consider a fixed value \(x\) for the independent variable \(X\). The expected value of the response variable \(Y\) given \(x\) is
\(\dp E[Y|X=x]=E[\beta+\omega\cd x+\ve]=\beta+\omega\cd x+E[\ve]=\beta+\omega\cd x\)
The last step follows from our assumption that \(E[\ve]=\mu=0\). Also, since \(x\) is assumed to be fixed, and \(\beta\) and \(\omega\) are constants, the expected value \(E[\beta+\omega\cd x]=\beta+\omega\cd x\). Next, consider the variance of \(Y\) given \(X=x\); we have
\(\dp\rm{var}(Y|X=x)=\rm{var}(\beta+\omega\cd x+\ve)=\rm{var}(\beta+\omega\cd x)+\rm{var}(\ve)=0+\sg^2=\sg^2\)
Here \(\rm{var}(\beta+\omega\cd x)=0\), since \(\beta,\omega,x\) are all constants. Thus, given \(X=x\), the response variable \(Y\) follows a normal distribution with mean \(E[Y|X=x]=\beta+\omega\cd x\), and variance \(\rm{var}(Y|X=x)=\sg^2\).
Estimated Parameters
The true parameters \(\beta,\omega,\sg^2\) are all unknown, and have to be estimated from the training data \(\D\) comprising \(n\) points \(x_i\) and corresponding response values \(y_i\), for \(i=1,2,\cds,n\). Let \(b\) and \(w\) denote the estimated bias and weight terms; we can then make predictions for any given value \(x_i\) as follows:
\(\dp\hat{y_i}=b+w\cd x_i\)
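The estimated bias \(b\) and weight \(w\) are obtained by minimizing the sum of squared errors (SSE), given as
\(\dp SSE=\sum_{i=1}^n(y_i-\hat{y_i})^2=\sum_{i=1}^n(y_i-b-w\cd x_i)^2\)
with the least squares estimates given as
\(\dp w=\frac{\sum_{i=1}^n(x_i-\mu_X)(y_i-\mu_Y)}{\sum_{i=1}^n(x_i-\mu_X)^2}\quad\quad b=\mu_Y-w\cd\mu_X\)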
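As a minimal sketch of the least squares estimates, the following pure-Python function computes \(b\) and \(w\) from deviations about the sample means. The function name `fit_line` and the five-point toy dataset are illustrative assumptions, not part of the text.

```python
def fit_line(xs, ys):
    """Return the least squares bias b and slope w for paired samples."""
    n = len(xs)
    mu_x = sum(xs) / n
    mu_y = sum(ys) / n
    # s_x is the total scatter of X; s_xy is the co-scatter of X and Y.
    s_x = sum((x - mu_x) ** 2 for x in xs)
    s_xy = sum((x - mu_x) * (y - mu_y) for x, y in zip(xs, ys))
    w = s_xy / s_x
    b = mu_y - w * mu_x
    return b, w


# Hypothetical toy data, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
b, w = fit_line(xs, ys)
print(f"b = {b:.4f}, w = {w:.4f}")
```

On this toy data the slope comes out close to 1.99 and the bias close to 0.05.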
27.1.1 Estimating Variance (\(\sg^2\))
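According to our model, the variance in prediction is entirely due to the random error term \(\ve\). We can estimate this variance by considering the predicted value \(\hat{y_i}\) and its deviation from the true response \(y_i\), that is, by looking at the residual error
\(\dp\epsilon_i=y_i-\hat{y_i}\)
One of the properties of the estimated values \(b\) and \(w\) is that the sum of residual errors is zero, since, using \(b=\mu_Y-w\cd\mu_X\),
\(\dp\sum_{i=1}^n\epsilon_i=\sum_{i=1}^n(y_i-b-w\cd x_i)=\sum_{i=1}^n(y_i-\mu_Y+w\cd\mu_X-w\cd x_i)=\sum_{i=1}^n(y_i-\mu_Y)-w\cd\sum_{i=1}^n(x_i-\mu_X)=0\)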
Thus, the expected value of \(\epsilon_i\) is zero, since \(E[\epsilon_i]=\frac{1}{n}\sum_{i=1}^n\epsilon_i=0\).
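The estimated variance \(\hat\sg^2\) is given as
\(\dp\hat\sg^2=\rm{var}(\epsilon_i)=\frac{1}{n-2}\cd\sum_{i=1}^n(\epsilon_i-E[\epsilon_i])^2=\frac{1}{n-2}\cd\sum_{i=1}^n\epsilon_i^2=\frac{1}{n-2}\cd\sum_{i=1}^n(y_i-\hat{y_i})^2\)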
Thus, the estimated variance is
Note
\(\dp\hat\sg^2=\frac{SSE}{n-2}\)
We divide by \(n-2\) to get an unbiased estimate, since \(n-2\) is the number of degrees of freedom for estimating SSE.
The square root of the variance is called the standard error of regression
Note
\(\dp\hat\sg=\sqrt{\frac{SSE}{n-2}}\)
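The following sketch, under the same assumptions as any small worked example (a hypothetical five-point dataset and an illustrative helper name `residuals_and_variance`), computes the residuals, checks that they sum to zero, and forms \(\hat\sg^2=SSE/(n-2)\).

```python
import math


def residuals_and_variance(xs, ys):
    """Fit the line, then return (residuals, SSE, estimated variance)."""
    n = len(xs)
    mu_x, mu_y = sum(xs) / n, sum(ys) / n
    s_x = sum((x - mu_x) ** 2 for x in xs)
    w = sum((x - mu_x) * (y - mu_y) for x, y in zip(xs, ys)) / s_x
    b = mu_y - w * mu_x
    resid = [y - (b + w * x) for x, y in zip(xs, ys)]
    sse = sum(e ** 2 for e in resid)
    # Divide by n - 2 because two parameters (b and w) were estimated.
    sigma2_hat = sse / (n - 2)
    return resid, sse, sigma2_hat


# Hypothetical toy data, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
resid, sse, sigma2_hat = residuals_and_variance(xs, ys)
sigma_hat = math.sqrt(sigma2_hat)  # standard error of regression
print(f"sum of residuals = {sum(resid):.2e}")
print(f"SSE = {sse:.4f}, sigma_hat = {sigma_hat:.4f}")
```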
27.1.2 Goodness of Fit
The total scatter, also called the total sum of squares (TSS), for the dependent variable \(Y\), is defined as
\(\dp TSS=\sum_{i=1}^n(y_i-\mu_Y)^2\)
The total scatter can be decomposed into two components by adding and subtracting \(\hat{y_i}\) as follows
\(\dp TSS=\sum_{i=1}^n(y_i-\hat{y_i}+\hat{y_i}-\mu_Y)^2=\sum_{i=1}^n(y_i-\hat{y_i})^2+\sum_{i=1}^n(\hat{y_i}-\mu_Y)^2+2\sum_{i=1}^n(y_i-\hat{y_i})\cd(\hat{y_i}-\mu_Y)=SSE+RSS\)
where we use the fact that \(\sum_{i=1}^n(y_i-\hat{y_i})\cd(\hat{y_i}-\mu_Y)=0\), and
\(\dp RSS=\sum_{i=1}^n(\hat{y_i}-\mu_Y)^2\)
is a new term called the regression sum of squares, which measures the squared deviation of the predictions from the true mean. TSS can thus be decomposed into two parts: SSE, which is the amount of variation not explained by the model, and RSS, which is the amount of variation explained by the model. Therefore, the fraction of the variation left unexplained by the model is given by the ratio \(\frac{SSE}{TSS}\). Conversely, the fraction of the variation that is explained by the model, called the coefficient of determination or simply the \(R^2\) statistic, is given as
Note
\(\dp R^2=\frac{TSS-SSE}{TSS}=1-\frac{SSE}{TSS}=\frac{RSS}{TSS}\)
The higher the \(R^2\) statistic the better the estimated model, with \(R^2\in[0,1]\).
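A short numerical check of the decomposition \(TSS=SSE+RSS\) and of the \(R^2\) statistic can be sketched as follows. The dataset and the helper name `goodness_of_fit` are assumptions made for illustration.

```python
def goodness_of_fit(xs, ys):
    """Return (TSS, SSE, RSS, R^2) for a univariate least squares fit."""
    n = len(xs)
    mu_x, mu_y = sum(xs) / n, sum(ys) / n
    s_x = sum((x - mu_x) ** 2 for x in xs)
    w = sum((x - mu_x) * (y - mu_y) for x, y in zip(xs, ys)) / s_x
    b = mu_y - w * mu_x
    y_hat = [b + w * x for x in xs]
    tss = sum((y - mu_y) ** 2 for y in ys)                 # total scatter
    sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))   # unexplained
    rss = sum((yh - mu_y) ** 2 for yh in y_hat)            # explained
    return tss, sse, rss, 1.0 - sse / tss


# Hypothetical toy data, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
tss, sse, rss, r2 = goodness_of_fit(xs, ys)
print(f"TSS = {tss:.4f}, SSE + RSS = {sse + rss:.4f}, R^2 = {r2:.4f}")
```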
Geometry of Goodness of Fit
Recall that \(Y\) can be decomposed into two orthogonal parts
where \(\hat{Y}\) is the projection of \(Y\) onto the subspace spanned by \(\{\1,X\}\). Using the fact that this subspace is the same as that spanned by the orthogonal vectors \(\{\1,\bar{X}\}\), with \(\bar{X}=X-\mu_X\cd\1\), we can further decompose \(\hat{Y}\) as follows
Likewise, the vector \(Y\) and \(\hat{Y}\) can be centered by subtracting their projections along the vector \(\1\)
The centered vectors \(\bar{Y},\hat{\bar{Y}},\bar{X}\) all lie in the \(n-1\) dimensional subspace orthogonal to the vector \(\1\).
In this subspace, the centered vectors \(\bar{Y}\) and \(\hat{\bar{Y}}\), and the error vector \(\bs\epsilon\) form a right triangle, since \(\hat{\bar{Y}}\) is the orthogonal projection of \(\bar{Y}\) onto the vector \(\bar{X}\). Noting that \(\bs\epsilon=Y-\hat{Y}=\bar{Y}-\hat{\bar{Y}}\), by the Pythagoras theorem, we have
\(\dp\lv\bar{Y}\rv^2=\lv\hat{\bar{Y}}\rv^2+\lv\bs\epsilon\rv^2\)
This equation is equivalent to the decomposition of the total scatter, TSS, into the sum of squared errors, SSE, and the regression sum of squares, RSS.
Notice further that since \(\bar{Y},\hat{\bar{Y}},\bs\epsilon\) form a right triangle, the cosine of the angle between \(\bar{Y}\) and \(\hat{\bar{Y}}\) is given as the ratio of the base to the hypotenuse. On the other hand, the cosine of the angle is also the correlation between \(Y\) and \(\hat{Y}\), denoted \(\rho_{Y\hat{Y}}\). Thus, we have:
\(\dp\rho_{Y\hat{Y}}=\frac{\lv\hat{\bar{Y}}\rv}{\lv\bar{Y}\rv}\)
We can observe that
\(\dp\lv\hat{\bar{Y}}\rv=\rho_{Y\hat{Y}}\cd\lv\bar{Y}\rv\)
Note that, whereas \(|\rho_{Y\hat{Y}}|\leq 1\) in general, due to the projection operation the angle between \(Y\) and \(\hat{Y}\) is always less than or equal to \(90^\circ\), which means that \(\rho_{Y\hat{Y}}\in[0,1]\) for univariate regression. Thus, the centered predicted response vector \(\hat{\bar{Y}}\) is shorter than the centered true response vector \(\bar{Y}\) by a factor equal to the correlation between them. Furthermore, the coefficient of determination is the same as the squared correlation between \(Y\) and \(\hat{Y}\):
\(\dp R^2=\frac{RSS}{TSS}=\frac{\lv\hat{\bar{Y}}\rv^2}{\lv\bar{Y}\rv^2}=\rho_{Y\hat{Y}}^2\)
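The relationship between the correlation, the lengths of the centered vectors, and \(R^2\) can be verified numerically. This is only a sketch on hypothetical data; the helper `pearson` is an illustrative name.

```python
import math


def pearson(a, b):
    """Sample correlation between two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    var_a = sum((u - ma) ** 2 for u in a)
    var_b = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


# Hypothetical toy data, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(xs)
mu_x, mu_y = sum(xs) / n, sum(ys) / n
s_x = sum((x - mu_x) ** 2 for x in xs)
w = sum((x - mu_x) * (y - mu_y) for x, y in zip(xs, ys)) / s_x
b = mu_y - w * mu_x
y_hat = [b + w * x for x in xs]

rho = pearson(ys, y_hat)
# Lengths of the centered response and centered prediction vectors.
len_y = math.sqrt(sum((y - mu_y) ** 2 for y in ys))
len_hat = math.sqrt(sum((yh - mu_y) ** 2 for yh in y_hat))
r2 = (len_hat / len_y) ** 2
print(f"rho = {rho:.6f}, rho^2 = {rho ** 2:.6f}, R^2 = {r2:.6f}")
```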
27.1.3 Inference about Regression Coefficient and Bias Term
The estimated values of the bias and regression coefficient, \(b\) and \(w\), are only point estimates for the true parameters \(\beta\) and \(\omega\). To obtain confidence intervals for these parameters, we treat each \(y_i\) as a random variable for the response given the corresponding fixed value \(x_i\). These random variables are all independent and identically distributed as \(Y\), with expected value \(\beta+\omega\cd x_i\) and variance \(\sg^2\). On the other hand, the \(x_i\) values are fixed a priori and therefore \(\mu_X\) and \(\sg_X^2\) are also fixed values.
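We can now treat \(b\) and \(w\) as random variables, with
\(\dp b=\mu_Y-w\cd\mu_X\quad\quad w=\frac{\sum_{i=1}^n(x_i-\mu_X)(y_i-\mu_Y)}{\sum_{i=1}^n(x_i-\mu_X)^2}=\frac{1}{s_X}\sum_{i=1}^n(x_i-\mu_X)\cd y_i=\sum_{i=1}^nc_i\cd y_i\)
where \(c_i\) is a constant, given as
\(\dp c_i=\frac{x_i-\mu_X}{s_X}\)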
and \(s_X=\sum_{i=1}^n(x_i-\mu_X)^2\) is the total scatter for \(X\), defined as the sum of squared deviations of \(x_i\) from its mean \(\mu_X\). We also use the fact that
Note that
Mean and Variance of Regression Coefficient
The expected value of \(w\) is given as
which follows from the observation that \(\sum_{i=1}^nc_i=0\), and further
Thus, \(w\) is an unbiased estimator for the true parameter \(\omega\). Using the fact that the variables \(y_i\) are independent and identically distributed as \(Y\), we can compute the variance of \(w\) as follows
since \(c_i\) is a constant, \(\rm{var}(y_i)=\sg^2\), and further
The standard deviation of \(w\), also called the standard error of \(w\), is given as
Note
\(\dp\rm{se}(w)=\sqrt{\rm{var}(w)}=\frac{\sg}{\sqrt{s_X}}\)
Mean and Variance of Bias Term
The expected value of \(b\) is given as
Thus, \(b\) is an unbiased estimator for the true parameter \(\beta\).
Using the observation that all \(y_i\) are independent, the variance of the bias term can be computed as follows
The standard deviation of \(b\), also called the standard error of \(b\), is given as
Note
\(\dp\rm{se}(b)=\sqrt{\rm{var}(b)}=\sg\cd\sqrt{\frac{1}{n}+\frac{\mu_X^2}{s_X}}\)
Covariance of Regression Coefficient and Bias
Confidence Intervals
Since the \(y_i\) variables are all normally distributed, their linear combination also follows a normal distribution. Thus \(w\) follows a normal distribution with mean \(\omega\) and variance \(\sg^2/s_X\). Likewise, \(b\) follows a normal distribution with mean \(\beta\) and variance \((1/n+\mu_X^2/s_X)\cd\sg^2\).
Since the true variance \(\sg^2\) is unknown, we use the estimated variance \(\hat\sg^2\) to define the standardized variables \(Z_w\) and \(Z_b\) as follows
Note
\(\dp Z_w=\frac{w-E[w]}{\rm{se}(w)}=\frac{w-\omega}{\frac{\hat\sg}{\sqrt{s_X}}}\quad\quad\) \(\dp Z_b=\frac{b-E[b]}{\rm{se}(b)}=\frac{b-\beta}{\hat\sg\sqrt{(1/n+\mu_X^2/s_X)}}\)
These variables follow the Student’s \(t\) distribution with \(n-2\) degrees of freedom. Let \(T_{n-2}\) denote the cumulative \(t\) distribution with \(n-2\) degrees of freedom, and let \(t_{\alpha/2}\) denote the critical value of \(T_{n-2}\) that encompasses \(\alpha/2\) of the probability mass in the right tail.
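Given confidence level \(1-\alpha\), i.e., significance level \(\alpha\in(0,1)\), the \(100(1-\alpha)\%\) confidence intervals for the true values, \(\omega\) and \(\beta\), are therefore as follows
\(\dp P\big(w-t_{\alpha/2}\cd\rm{se}(w)\leq\omega\leq w+t_{\alpha/2}\cd\rm{se}(w)\big)=1-\alpha\)
\(\dp P\big(b-t_{\alpha/2}\cd\rm{se}(b)\leq\beta\leq b+t_{\alpha/2}\cd\rm{se}(b)\big)=1-\alpha\)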
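The interval endpoints \(w\pm t_{\alpha/2}\cd\rm{se}(w)\) and \(b\pm t_{\alpha/2}\cd\rm{se}(b)\) can be sketched as below. The critical value is passed in as a parameter rather than computed; the constant 3.182 is the usual table value of \(t_{0.025}\) for 3 degrees of freedom. The dataset and the function name `confidence_intervals` are assumptions for illustration.

```python
import math


def confidence_intervals(xs, ys, t_crit):
    """Return ((lo_w, hi_w), (lo_b, hi_b)) for the slope and the bias."""
    n = len(xs)
    mu_x, mu_y = sum(xs) / n, sum(ys) / n
    s_x = sum((x - mu_x) ** 2 for x in xs)
    w = sum((x - mu_x) * (y - mu_y) for x, y in zip(xs, ys)) / s_x
    b = mu_y - w * mu_x
    sse = sum((y - (b + w * x)) ** 2 for x, y in zip(xs, ys))
    sigma_hat = math.sqrt(sse / (n - 2))
    # Standard errors with sigma replaced by its estimate sigma_hat.
    se_w = sigma_hat / math.sqrt(s_x)
    se_b = sigma_hat * math.sqrt(1.0 / n + mu_x ** 2 / s_x)
    ci_w = (w - t_crit * se_w, w + t_crit * se_w)
    ci_b = (b - t_crit * se_b, b + t_crit * se_b)
    return ci_w, ci_b


# Hypothetical toy data; n - 2 = 3 degrees of freedom.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
T_CRIT = 3.182  # assumed table value of t_{0.025} with 3 degrees of freedom
ci_w, ci_b = confidence_intervals(xs, ys, T_CRIT)
print("95% CI for omega:", ci_w)
print("95% CI for beta: ", ci_b)
```

On this toy data the slope interval excludes zero while the bias interval contains it.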
27.1.4 Hypothesis Testing for Regression Effects
In the regression model, \(Y\) depends on \(X\) through the parameter \(\omega\). Therefore, we can check for the regression effect by assuming the null hypothesis \(H_0\) that \(\omega=0\), with the alternative hypothesis \(H_a\) being \(\omega\ne 0\):
\(\dp H_0:\omega=0\quad\quad H_a:\omega\neq 0\)
When \(\omega=0\), the response \(Y\) depends only on the bias \(\beta\) and the random error \(\ve\).
Under the null hypothesis we have \(E[w]=\omega=0\). Thus,
Note
\(\dp Z_w=\frac{w-E[w]}{\rm{se}(w)}=\frac{w}{\hat\sg/\sqrt{s_X}}\)
Using a two-tailed \(t\)-test with \(n-2\) degrees of freedom, we can compute the p-value of \(Z_w\). Given significance level \(\alpha\), we reject the null hypothesis if the p-value is below \(\alpha\). In this case, we accept the alternative hypothesis that the estimated value of the slope parameter is significantly different from zero.
We can also define the \(f\)-statistic, which is the ratio of the regression sum of squares, RSS, to the estimated variance, given as
Note
\(\dp f=\frac{RSS}{\hat\sg^2}=\frac{\sum_{i=1}^n(\hat{y_i}-\mu_Y)^2}{\sum_{i=1}^n(y_i-\hat{y_i})^2/(n-2)}\)
Under the null hypothesis, one can show that
\(\dp E[RSS]=\sg^2\)
Further, it is also true that
\(\dp E[\hat\sg^2]=\sg^2\)
Thus, under the null hypothesis the \(f\)-statistic has a value close to 1, which indicates that there is no relationship between the predictor and response variables. On the other hand, if the alternative hypothesis is true, then \(E[RSS]\geq\sg^2\), resulting in a larger \(f\) value. In fact, the \(f\)-statistic follows an \(F\)-distribution with 1, \((n-2)\) degrees of freedom; therefore, we can reject the null hypothesis that \(\omega=0\) if the p-value of \(f\) is less than the significance level \(\alpha\).
Interestingly the \(f\)-test is equivalent to the \(t\)-test since \(Z_w^2=f\).
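The identity \(Z_w^2=f\) is easy to confirm numerically. The sketch below uses a hypothetical toy dataset and computes both statistics under \(H_0:\omega=0\); turning them into p-values would additionally require the \(t\) or \(F\) distribution function, which is left out here.

```python
import math

# Hypothetical toy data, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(xs)
mu_x, mu_y = sum(xs) / n, sum(ys) / n
s_x = sum((x - mu_x) ** 2 for x in xs)
w = sum((x - mu_x) * (y - mu_y) for x, y in zip(xs, ys)) / s_x
b = mu_y - w * mu_x
y_hat = [b + w * x for x in xs]

sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))
rss = sum((yh - mu_y) ** 2 for yh in y_hat)
sigma2_hat = sse / (n - 2)

# t-statistic for the slope under H0: omega = 0.
z_w = w / (math.sqrt(sigma2_hat) / math.sqrt(s_x))
# f-statistic: regression sum of squares over the estimated variance.
f_stat = rss / sigma2_hat
print(f"Z_w = {z_w:.4f}, Z_w^2 = {z_w ** 2:.4f}, f = {f_stat:.4f}")
```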
Test for Bias Term
Note that we can also test if the bias value is statistically significant or not by setting up the null hypothesis, \(H_0:\beta=0\), versus the alternative hypothesis \(H_a:\beta\neq 0\). We then evaluate the \(Z_b\) statistic under the null hypothesis:
Note
\(\dp Z_b=\frac{b-E[b]}{\rm{se}(b)}=\frac{b}{\hat\sg\cd\sqrt{1/n+\mu_X^2/s_X}}\)
since, under the null hypothesis \(E[b]=\beta=0\). Using a two-tailed \(t\)-test with \(n-2\) degrees of freedom, we can compute the p-value of \(Z_b\). We reject the null hypothesis if this value is smaller than the significance level \(\alpha\).
27.1.5 Standardized Residuals
Our assumption about the true errors \(\ve_i\) is that they are normally distributed with mean \(\mu=0\) and fixed variance \(\sg^2\).
The mean of \(\epsilon_i\) is given as
To compute the variance of \(\epsilon_i\), we will express it as a linear combination of the \(y_i\) variables, by noting that
Define \(a_j\) as follows:
Consider the term \(\sum_{j=1}^na_j^2\), we have
We can now define the standardized residual \(\epsilon_i^*\) by dividing \(\epsilon_i\) by its standard deviation after replacing \(\sg^2\) by its estimated value \(\hat\sg^2\).
Note
\(\dp\epsilon_i^*=\frac{\epsilon_i}{\sqrt{\rm{var}(\epsilon_i)}}\) \(\dp=\frac{\epsilon_i}{\hat\sg\cd\sqrt{1-\frac{1}{n}-\frac{(x_i-\mu_X)^2}{s_X}}}\)
These standardized residuals should follow a standard normal distribution. We can thus plot the standardized residuals against the quantiles of a standard normal distribution, and check if the normality assumption holds. Significant deviations would indicate that our model assumptions may not be correct.
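A minimal sketch of the standardized residual computation follows, using the formula for \(\epsilon_i^*\) given above. The dataset and the function name `standardized_residuals` are illustrative assumptions; plotting against normal quantiles is omitted.

```python
import math


def standardized_residuals(xs, ys):
    """Return the standardized residuals of a univariate least squares fit."""
    n = len(xs)
    mu_x, mu_y = sum(xs) / n, sum(ys) / n
    s_x = sum((x - mu_x) ** 2 for x in xs)
    w = sum((x - mu_x) * (y - mu_y) for x, y in zip(xs, ys)) / s_x
    b = mu_y - w * mu_x
    resid = [y - (b + w * x) for x, y in zip(xs, ys)]
    sigma_hat = math.sqrt(sum(e ** 2 for e in resid) / (n - 2))
    out = []
    for x, e in zip(xs, resid):
        # Residual variance shrinks for points far from the mean of X.
        scale = math.sqrt(1.0 - 1.0 / n - (x - mu_x) ** 2 / s_x)
        out.append(e / (sigma_hat * scale))
    return out


# Hypothetical toy data, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
std_res = standardized_residuals(xs, ys)
print([round(r, 3) for r in std_res])
```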
27.2 Multiple Regression
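In multiple regression there are multiple independent attributes \(X_1,X_2,\cds,X_d\) and a single dependent or response attribute \(Y\), and we assume that the true relationship can be modeled as a linear function
\(\dp Y=\beta+\omega_1\cd X_1+\omega_2\cd X_2+\cds+\omega_d\cd X_d+\ve=\beta+\sum_{i=1}^d\omega_i\cd X_i+\ve\)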
where \(\beta\) is the intercept or bias term and \(\omega_i\) is the regression coefficient for attribute \(X_i\). We assume that \(\ve\) is a random variable that is normally distributed with mean \(\mu=0\) and variance \(\sg^2\).
Mean and Variance of Response Variable
Let \(\X=(X_1,X_2,\cds,X_d)^T\in\R^d\) denote the multivariate random variable comprising the independent attributes. Let \(\x=(x_1,x_2,\cds,x_d)^T\) be some fixed value of \(\X\), and let \(\bs\omega=(\omega_1,\omega_2,\cds,\omega_d)^T\). The expected response value is then given as
which follows from the assumption that \(E[\ve]=0\). The variance of the response variable is given as
which follows from the assumption that all \(x_i\) are fixed a priori. Thus, we conclude that \(Y\) also follows a normal distribution with mean \(E[Y|\x]=\beta+\sum_{i=1}^d\omega_i\cd x_i=\beta+\bs\omega^T\x\) and variance \(\rm{var}(Y|\x)=\sg^2\).
Estimated Parameters
We augment the data matrix by adding a new column \(X_0\) with all values fixed at 1, that is, \(X_0=\1\). Thus, the augmented data \(\td\D\in\R^{n\times(d+1)}\) comprises the \((d+1)\) attributes \(X_0,X_1,X_2,\cds,X_d\), and each augmented point is given as \(\td\x_i=(1,x_{i1},x_{i2},\cds,x_{id})^T\).
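Let \(b=w_0\) denote the estimated bias term, and let \(w_i\) denote the estimated regression weights. The augmented vector of estimated weights, including the bias term, is
\(\dp\td\w=(w_0,w_1,\cds,w_d)^T\)
We then make predictions for any given point \(\x_i\) as follows:
\(\dp\hat{y_i}=b+w_1\cd x_{i1}+\cds+w_d\cd x_{id}=\td\w^T\td\x_i\)
Recall that these estimates are obtained by minimizing the sum of squared errors (SSE), given as
\(\dp SSE=\sum_{i=1}^n(y_i-\hat{y_i})^2=\sum_{i=1}^n(y_i-\td\w^T\td\x_i)^2\)
with the least squares estimate given as
\(\dp\td\w=(\td\D^T\td\D)\im\td\D^TY\)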
The estimated variance \(\hat\sg^2\) is then given as
Note
\(\dp\hat\sg^2=\frac{SSE}{n-(d+1)}=\frac{1}{n-d-1}\cd\sum_{i=1}^n(y_i-\hat{y_i})^2\)
We divide by \(n-(d+1)\) to get an unbiased estimate, since \(n-(d+1)\) is the number of degrees of freedom for estimating SSE.
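The multiple regression estimates and the variance estimate can be sketched with NumPy as follows. The six-point dataset with \(d=2\) and the function name `fit_multiple` are assumptions for illustration; the normal equations are solved directly, which presumes that \(\td\D^T\td\D\) is invertible.

```python
import numpy as np


def fit_multiple(X, y):
    """Return (augmented weights, residuals, SSE, estimated variance)."""
    n, d = X.shape
    # Augment the data with a leading column of ones for the bias term.
    D_aug = np.column_stack([np.ones(n), X])
    # Solve the normal equations (D^T D) w = D^T y.
    w_aug = np.linalg.solve(D_aug.T @ D_aug, D_aug.T @ y)
    resid = y - D_aug @ w_aug
    sse = float(resid @ resid)
    sigma2_hat = sse / (n - d - 1)
    return w_aug, resid, sse, sigma2_hat


# Hypothetical toy data with d = 2 attributes, for illustration only.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0],
              [4.0, 3.0], [5.0, 6.0], [6.0, 5.0]])
y = np.array([3.1, 2.9, 7.2, 6.8, 11.1, 10.9])
w_aug, resid, sse, sigma2_hat = fit_multiple(X, y)
print("weights (bias first):", np.round(w_aug, 4))
print("SSE =", round(sse, 6), " sigma2_hat =", round(sigma2_hat, 6))
```

For ill-conditioned data a least squares solver such as `np.linalg.lstsq` is usually a safer choice than forming the normal equations.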
Estimated Variance is Unbiased
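Recall that
\(\dp\hat{Y}=\td\D\td\w=\td\D(\td\D^T\td\D)\im\td\D^TY=\H Y\)
where \(\H\) is the \(n\times n\) hat matrix (assuming that \((\td\D^T\td\D)\im\) exists). Note that \(\H\) is an orthogonal projection matrix, since it is symmetric (\(\H^T=\H\)) and idempotent (\(\H^2=\H\)).
Furthermore, the trace of the hat matrix is given as
\(\dp\rm{tr}(\H)=\rm{tr}(\td\D(\td\D^T\td\D)\im\td\D^T)=\rm{tr}(\td\D^T\td\D(\td\D^T\td\D)\im)=\rm{tr}(\I_{d+1})=d+1\)
where \(\I_{d+1}\) is the \((d+1)\times(d+1)\) identity matrix.
Finally, note that the matrix \(\I-\H\) is also symmetric and idempotent, since
\(\dp(\I-\H)^T=\I-\H^T=\I-\H\quad\quad(\I-\H)^2=\I-2\H+\H^2=\I-\H\)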
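These properties of the hat matrix can be checked numerically. The following NumPy sketch uses a hypothetical dataset with \(n=6\) and \(d=2\); the explicit inverse is used only because the example is tiny.

```python
import numpy as np

# Hypothetical toy data with d = 2 attributes, for illustration only.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0],
              [4.0, 3.0], [5.0, 6.0], [6.0, 5.0]])
n, d = X.shape
D_aug = np.column_stack([np.ones(n), X])

# Hat matrix H = D (D^T D)^{-1} D^T and its complement I - H.
H = D_aug @ np.linalg.inv(D_aug.T @ D_aug) @ D_aug.T
M = np.eye(n) - H

checks = {
    "H symmetric": np.allclose(H, H.T),
    "H idempotent": np.allclose(H @ H, H),
    "tr(H) = d + 1": np.isclose(np.trace(H), d + 1),
    "I - H symmetric": np.allclose(M, M.T),
    "I - H idempotent": np.allclose(M @ M, M),
}
for name, ok in checks.items():
    print(f"{name}: {ok}")
```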
Now consider the squared error; we have
However, note that the response vector \(Y\) is given as
where \(\td{\bs\omega}=(\omega_0,\omega_1,\cds,\omega_d)^T\) is the true (augmented) vector of parameters of the model, and \(\bs\ve=(\ve_1,\ve_2,\cds,\ve_n)^T\) is the true error vector, which is assumed to be normally distributed with mean \(E[\bs\ve]=\0\) and with fixed variance \(\sg^2\) for each \(\ve_i\), so that \(\rm{cov}(\bs\ve)=\sg^2\I\).
where we use the observation that
It follows that
27.2.1 Goodness of Fit
The decomposition of the total sum of squares, TSS, into the sum of squared errors, SSE, and the regression sum of squares, RSS, holds true for multiple regression as well:
\(\dp TSS=SSE+RSS\)
The coefficient of multiple determination, \(R^2\), gives the goodness of fit, measured as the fraction of the variation explained by the linear model:
Note
\(\dp R^2=1-\frac{SSE}{TSS}=\frac{TSS-SSE}{TSS}=\frac{RSS}{TSS}\)
One of the potential problems with the \(R^2\) measure is that it is susceptible to increase as the number of attributes increases, even though the additional attributes may be uninformative. To counter this, we can consider the adjusted coefficient of determination, which takes into account the degrees of freedom in both TSS and SSE
Note
\(\dp R_a^2=1-\frac{SSE/(n-d-1)}{TSS/(n-1)}=1-\frac{(n-1)\cd SSE}{(n-d-1)\cd TSS}\)
We can observe that the adjusted \(R_a^2\) measure is always less than \(R^2\) (provided \(d\geq 1\) and \(SSE>0\)), since the ratio \(\frac{n-1}{n-d-1}>1\). If there is too much of a difference between \(R^2\) and \(R_a^2\), it might indicate that there are potentially many, possibly irrelevant, attributes being used to fit the model.
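The two measures can be compared with a few lines of Python. The values of \(n\), \(d\), SSE, and TSS below are hypothetical and serve only to illustrate the formulas; the function names are likewise illustrative.

```python
def r_squared(sse, tss):
    """Coefficient of multiple determination."""
    return 1.0 - sse / tss


def adjusted_r_squared(sse, tss, n, d):
    """Adjusted R^2, using the degrees of freedom of SSE and TSS."""
    return 1.0 - (sse / (n - d - 1)) / (tss / (n - 1))


# Hypothetical summary values, chosen only for illustration.
n, d = 30, 5
sse, tss = 12.0, 100.0
r2 = r_squared(sse, tss)
r2_adj = adjusted_r_squared(sse, tss, n, d)
print(f"R^2 = {r2:.3f}, adjusted R^2 = {r2_adj:.3f}")
```

With these numbers \(R^2=0.88\) and \(R_a^2=0.855\), so the penalty for the five attributes is modest.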
Geometry of Goodness of Fit
The centered vectors \(\bar{Y}\) and \(\hat{\bar{Y}}\), and the error vector \(\bs\epsilon\) form a right triangle, and thus, by the Pythagoras theorem, we have
The correlation between \(Y\) and \(\hat{Y}\) is the cosine of the angle between \(\bar{Y}\) and \(\hat{\bar{Y}}\), which is also given as the ratio of the base to the hypotenuse
The coefficient of multiple determination is given as
27.2.2 Inference about Regression Coefficients
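Let \(Y\) be the response vector over all observations. Let \(\td\w=(w_0,w_1,w_2,\cds,w_d)^T\) be the estimated vector of regression coefficients, computed as
\(\dp\td\w=(\td\D^T\td\D)\im\td\D^TY\)
The expected value of \(\td\w\) is given as follows:
\(\dp E[\td\w]=E[(\td\D^T\td\D)\im\td\D^TY]=(\td\D^T\td\D)\im\td\D^T\cd E[Y]=(\td\D^T\td\D)\im\td\D^T\cd\td\D\td{\bs\omega}=\td{\bs\omega}\)
since \(E[Y]=\td\D\td{\bs\omega}\). Thus, \(\td\w\) is an unbiased estimator for the true regression coefficient vector \(\td{\bs\omega}\).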
Here, we made use of the fact that \(\A=(\td\D^T\td\D)\im\td\D^T\) is a matrix of fixed values, and therefore \(\rm{cov}(\A Y)=\A\rm{cov}(Y)\A^T\). Also, we have \(\rm{cov}(Y)=\sg^2\cd\I\), which follows from the fact that the observed response \(y_i\)’s are all independent and have the same variance \(\sg^2\).
Note that \(\td\D^T\td\D\in\R^{(d+1)\times(d+1)}\) is the uncentered scatter matrix for the augmented data. Let \(\C\) denote the inverse of \(\td\D^T\td\D\).
Therefore, the covariance matrix for \(\td\w\) can be written as
\(\dp\rm{cov}(\td\w)=\sg^2\cd\C\)
In particular, the diagonal entries \(\sg^2\cd c_{ii}\), where \(c_{ii}\) denotes the \(i\)th diagonal entry of \(\C\), give the variance of each of the regression coefficient estimates, and their square roots specify the standard errors.
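As a sanity check, the matrix \(\C\) and the resulting standard errors can be computed for a univariate dataset, where they should reduce to the earlier formulas \(\sg^2/s_X\) and \((1/n+\mu_X^2/s_X)\cd\sg^2\). The NumPy sketch below uses hypothetical data and replaces \(\sg^2\) by its estimate \(\hat\sg^2\).

```python
import numpy as np

# Hypothetical univariate toy data (d = 1), for illustration only.
xs = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
ys = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
n, d = len(xs), 1

D_aug = np.column_stack([np.ones(n), xs])
C = np.linalg.inv(D_aug.T @ D_aug)      # inverse uncentered scatter matrix
w_aug = C @ D_aug.T @ ys                # (bias, slope)
resid = ys - D_aug @ w_aug
sigma2_hat = float(resid @ resid) / (n - d - 1)

cov_w = sigma2_hat * C                  # estimated covariance of the weights
se = np.sqrt(np.diag(cov_w))            # standard errors (bias, slope)
print("C =\n", np.round(C, 4))
print("se(b) =", round(float(se[0]), 5), " se(w) =", round(float(se[1]), 5))
```

Here \(c_{00}=1/n+\mu_X^2/s_X=1.1\) and \(c_{11}=1/s_X=0.1\), matching the univariate expressions.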
We can now define the standardized variable \(Z_{w_i}\) that can be used to derive the confidence intervals for \(\omega_i\) as follows
Note
\(\dp Z_{w_i}=\frac{w_i-E[w_i]}{\rm{se}(w_i)}=\frac{w_i-\omega_i}{\hat\sg\sqrt{c_{ii}}}\)
Each of the variables \(Z_{w_i}\) follows a \(t\)-distribution with \(n-d-1\) degrees of freedom, from which we can obtain the \(100(1-\alpha)\%\) confidence interval of the true value \(\omega_i\) as follows:
Here, \(t_{\alpha/2}\) is the critical value of the \(t\) distribution, with \(n-d-1\) degrees of freedom, that encompasses \(\alpha/2\) fraction of the probability mass in the right tail, given as
27.2.3 Hypothesis Testing
We set up the null hypothesis that all the true weights are zero, except for the bias term (\(\beta=\omega_0\)). We contrast the null hypothesis with the alternative hypothesis that at least one of the weights is not zero
\(\dp H_0:\omega_1=0,\omega_2=0,\cds,\omega_d=0\quad\quad H_a:\exists i,\mbox{ such that }\omega_i\neq 0\)
The null hypothesis can also be written as \(H_0:\bs\omega=\0\).
We use the \(F\)-test that compares the ratio of the adjusted RSS value to the estimated variance \(\hat\sg^2\), defined via the \(f\)-statistic
Note
\(\dp f=\frac{RSS/d}{\hat\sg^2}=\frac{RSS/d}{SSE/(n-d-1)}\)
Under the null hypothesis, we have
To see this, consider
Consider the RSS value; we have
The expected value of RSS is thus given as
Thus, under the null hypothesis the \(f\)-statistic has a value close to 1, which indicates that there is no relationship between the predictor and response variables. On the other hand, if the alternative hypothesis is true, then \(E[RSS/d]\geq\sg^2\), resulting in a larger \(f\) value.
The ratio \(f\) follows an \(F\)-distribution with \(d\), \((n-d-1)\) degrees of freedom for the numerator and denominator, respectively. Therefore, we can reject the null hypothesis if the p-value is less than the chosen significance level.
Notice that, since \(R^2=1-\frac{SSE}{TSS}=\frac{RSS}{TSS}\), we have
Therefore, we can rewrite the \(f\) ratio as follows
Note
\(\dp f=\frac{RSS/d}{SSE/(n-d-1)}=\frac{n-d-1}{d}\cd\frac{R^2}{1-R^2}\)
In other words, the \(F\)-test compares the adjusted fraction of explained variation to the unexplained variation. If \(R^2\) is high, it means the model can fit the data well, and that is more evidence to reject the null hypothesis.
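The equivalence of the two forms of the \(f\) ratio can be checked directly. The summary values below are hypothetical and the function names are illustrative.

```python
def f_from_sums(rss, sse, n, d):
    """f-statistic from the regression and error sums of squares."""
    return (rss / d) / (sse / (n - d - 1))


def f_from_r2(r2, n, d):
    """The same f-statistic expressed through R^2."""
    return (n - d - 1) / d * r2 / (1.0 - r2)


# Hypothetical summary values, chosen only for illustration.
n, d = 30, 5
sse, tss = 12.0, 100.0
rss = tss - sse
r2 = rss / tss

f_direct = f_from_sums(rss, sse, n, d)
f_via_r2 = f_from_r2(r2, n, d)
print(f"f (sums) = {f_direct:.4f}, f (R^2) = {f_via_r2:.4f}")
```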
Hypothesis Testing for Individual Parameters
For attribute \(X_i\), we set up the null hypothesis \(H_0:\omega_i=0\) and contrast it with the alternative hypothesis \(H_a:\omega_i\neq 0\). The standardized variable \(Z_{w_i}\) under the null hypothesis is given as
Note
\(\dp Z_{w_i}=\frac{w_i-E[w_i]}{\rm{se}(w_i)}=\frac{w_i}{\rm{se}(w_i)}=\frac{w_i}{\hat\sg\sqrt{c_{ii}}}\)
Next, using a two-tailed \(t\)-test with \(n-d-1\) degrees of freedom, we compute the p-value of \(Z_{w_i}\). If this probability is smaller than the significance level \(\alpha\), we can reject the null hypothesis. Otherwise, we fail to reject the null hypothesis, which would imply that \(X_i\) does not add significant value in predicting the response in light of other attributes already used to fit the model. The \(t\)-test can also be used to test whether the bias term is significantly different from 0 or not.
27.2.4 Geometric Approach to Statistical Testing
Let \(\bar{X_i}=X_i-\mu_{X_i}\cd\1\) denote the centered attribute vector, and let \(\bar\X=(\bar{X_1},\bar{X_2},\cds,\bar{X_d})^T\) denote the multivariate centered vector of predictor variables. The \(n\)-dimensional space over the points is divided into three mutually orthogonal subspaces, namely the 1-dimensional mean space \(\cl{S}_\mu=span(\1)\), the \(d\)-dimensional centered variable space \(\cl{S}_{\bar{X}}=span(\bar\X)\), and the \((n-d-1)\)-dimensional error space \(\cl{S}_\epsilon\), which contains the error vector \(\bs\epsilon\). The response vector \(Y\) can thus be decomposed into three components
Recall that the degrees of freedom of a random vector is defined as the dimensionality of its enclosing subspace. Since the original dimensionality of the point space is \(n\), we have a total of \(n\) degrees of freedom. The mean space has dimensionality \(dim(\cl{S}_\mu)=1\), the centered variable space has \(dim(\cl{S}_{\bar{X}})=d\), and the error space has \(dim(\cl{S}_\epsilon)=n-d-1\), so that we have
Population Regression Model
For a fixed value \(\x_i=(x_{i1},x_{i2},\cds,x_{id})^T\), the true response \(y_i\) is given as
\(\dp y_i=\beta+\omega_1\cd x_{i1}+\cds+\omega_d\cd x_{id}+\ve_i\)
where the systematic part of the model \(\beta+\sum_{j=1}^d\omega_j\cd x_{ij}\) is fixed, and the error term \(\ve_i\) varies randomly, with the assumption that \(\ve_i\) follows a normal distribution with mean \(\mu=0\) and variance \(\sg^2\). We also assume that the \(\ve_i\) values are all independent of each other.
Across all the points, we can rewrite the above equation in vector form
We can also center the vector \(Y\), so that we obtain a regression model over the centered response and predictor variables
In this equation, \(\sum_{i=1}^d\omega_i\cd\bar{X_i}\) is a fixed vector that denotes the expected value \(E[\bar{Y}|\bar\X]\) and \(\bs\ve\) is an \(n\)-dimensional random vector that follows an \(n\)-dimensional multivariate normal distribution with mean \(\mmu=\0\), and a fixed variance \(\sg^2\) along all dimensions, so that its covariance matrix is \(\bs\Sg=\sg^2\cd\I\). The distribution of \(\bs\ve\) is therefore given as
\(\dp f(\bs\ve)=\frac{1}{(\sqrt{2\pi})^n\cd\sqrt{|\bs\Sg|}}\cd\exp\left\{-\frac{\bs\ve^T\bs\Sg\im\bs\ve}{2}\right\}=\frac{1}{(\sqrt{2\pi})^n\cd\sg^n}\cd\exp\left\{-\frac{\lv\bs\ve\rv^2}{2\cd\sg^2}\right\}\)
which follows from the fact that \(|\bs\Sg|=\det(\bs\Sg)=\det(\sg^2\I)=(\sg^2)^n\) and \(\bs\Sg\im=\frac{1}{\sg^2}\I\).
The density of \(\bs\ve\) is thus a function of its squared length \(\lv\bs\ve\rv^2\), independent of its angle. In other words, the vector \(\bs\ve\) is distributed uniformly over all angles and is equally likely to point in any direction.
Hypothesis Testing
Consider the population regression model
The null hypothesis is
In this case, we have
Since \(\bs\ve\) is normally distributed with mean \(\0\) and covariance matrix \(\sg^2\cd\I\), under the null hypothesis, the variation in \(\bar{Y}\) for a given value of \(\x\) will therefore be centered around the origin \(\0\).
On the other hand, under the alternative hypothesis \(H_a\) that at least one of the \(\omega_i\) is non-zero, we have
Thus, the variation in \(\bar{Y}\) is shifted away from the origin \(\0\) in the direction \(E[\bar{Y}|\bar\X]\).
We estimate the true value of \(E[\bar{Y}|\bar\X]\) by projecting the centered observation vector \(\bar{Y}\) onto the subspaces \(\cl{S}_{\bar{X}}\) and \(\cl{S}_\epsilon\), as follows
Under the null hypothesis, the true centered response vector is \(\bar{Y}=\bs\ve\), and therefore, \(\hat{\bar{Y}}\) and \(\bs\epsilon\) are simply the projections of the random error vector \(\bs\ve\) onto the subspaces \(\cl{S}_{\bar{X}}\) and \(\cl{S}_\epsilon\). In this case, we also expect the length of \(\bs\epsilon\) and \(\hat{\bar{Y}}\) to be roughly comparable. On the other hand, under the alternative hypothesis, we have \(\bar{Y}=E[\bar{Y}|\bar\X]+\bs\ve\), and so \(\hat{\bar{Y}}\) will be relatively much longer compared to \(\bs\epsilon\).
Define the mean squared length per dimension for the two vectors \(\hat{\bar{Y}}\) and \(\bs\epsilon\), as follows
The geometric ratio test is identical to the F-test since
The geometric approach makes it clear that if \(f\simeq 1\) then the data are consistent with the null hypothesis, and we cannot conclude that \(Y\) depends on the predictor variables \(X_1,X_2,\cds,X_d\). On the other hand, if \(f\) is large, with a p-value less than the significance level, then we can reject the null hypothesis and accept the alternative hypothesis that \(Y\) depends on at least one predictor variable \(X_i\).
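The geometric view can be sketched numerically by projecting the centered response onto the span of the centered attributes and comparing mean squared lengths per dimension. The dataset is hypothetical, and the variable names (`msl_fit`, `msl_err`, `f_ratio`) are illustrative rather than notation from the text.

```python
import numpy as np

# Hypothetical toy data with d = 2 attributes, for illustration only.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0],
              [4.0, 3.0], [5.0, 6.0], [6.0, 5.0]])
y = np.array([3.1, 2.9, 7.2, 6.8, 11.1, 10.9])
n, d = X.shape

# Center the attributes and the response (remove the mean-space component).
Xc = X - X.mean(axis=0)
yc = y - y.mean()

# Project the centered response onto the centered variable space.
coef, *_ = np.linalg.lstsq(Xc, yc, rcond=None)
y_hat_c = Xc @ coef
eps = yc - y_hat_c                      # component in the error space

# Mean squared length per dimension of each component.
msl_fit = float(y_hat_c @ y_hat_c) / d
msl_err = float(eps @ eps) / (n - d - 1)
f_ratio = msl_fit / msl_err
print(f"orthogonality check: {float(y_hat_c @ eps):.2e}")
print(f"f = {f_ratio:.2f}")
```

Because the toy response is close to a linear function of the two attributes, the ratio is far above 1 here.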