Chapter 25 Neural Networks
Artificial neural networks or simply neural networks are inspired by biological neuronal networks. Artifical neural networks are comprised of abstract neurons that try to mimic real neurons at a very high level. They can be described via a weighte directed graph \(G=(V,E)\), with each node \(v_i\in V\) representing a neuron, and each directed edge \((v_i,v_j)\in E\) representing a synaptic to dendritic connection from \(v_i\) to \(v_j\). The weight of the edge \(w_{ij}\) denotes the synaptic strength. Nerual networks are characterized by the type of activation function used to generate an output, and the architecture of the network in terms of how the nodes are interconnected.
25.1 Artificial Neuron: Activation Functions
A neuro \(z_k\) has incoming edges from neurons \(x_1,\cds,x_d\). Let \(x_i\) denotes neuron \(i\), and also the value of that neuron. The net input at \(z_k\), denoted \(net_k\), is given as the weighted sum
Note
\(net_k=b_k+\sum_{i=1}^dw_{ik}\cd x_i=b_k+\w^T\x\)
where \(\w_k=(w_{1k},w_{2k},\cds,w_{dk})^T\in\R^d\) and \(\x=(x_1,x_2,\cds,x_d)^T\in\R^d\) is an input point. Notice that \(x_0\) is a special bias neuron whose value is always fixed at 1, and the weight from \(x_0\) to \(z_k\) is \(b_k\), which specifies the bias term for the neuron. Finally, the output value of \(z_k\) is given as some activation function, \(f(\cd)\), applied to the net input at \(z_k\)
The value \(z_k\) is then passed along the outgoing edges from \(z_k\) to other neurons.
Linear/Identity Function:
\[f(net_k)=net_k\]
Step Function:
\[\begin{split}f(net_k)=\left\{\begin{array}{lr}1\quad\rm{if\ }net_k\leq 0\\0\quad\rm{if\ }net_k>0\end{array}\right.\end{split}\]
Rectified Linear Unit (ReLU):
\[\begin{split}f(net_k)=\left\{\begin{array}{lr}1\quad\rm{if\ }net_k\leq 0\\net_k\quad\rm{if\ }net_k>0\end{array}\right.\end{split}\]An alternative expression for the ReLU activation is given as \(f(net_k)=\max\{0,net_k\}\).
Sigmoid:
\[f(net_k)=\frac{1}{1+\exp\{-net_k\}}\]
Hyperbolic Tangent (tanh):
\[f(net_k)=\frac{\exp\{net_k\}-\exp\{-net_k\}}{\exp\{net_k\}+\exp\{-net_k\}}= \frac{\exp\{2\cd net_k\}-1}{\exp\{2\cd net_k\}+1}\]
Softmax:
Softmax is mainly used at the output layer in a neural network, and unlike the other functions it depends not only on the net input at neuron \(k\), but it depends on the net signal at all other neurons in the output layer. Thus, given the net input vector, \(\bs{\rm{net}}=(net_1,net_2,\cds,net_p)^T\), for all the \(p\) output neurons, the output of the softmax function for the \(k\)th neuron is given as
\[f(net_k|\bs{\rm{net}})=\frac{\exp\{net_k\}}{\sum_{i=1}^p\exp\{net_i\}}\]
25.1.1 Derivatives for Activation Functions
Linear/Identity Function:
\[\frac{\pd f(net_j)}{\pd net_j}=1\]
Step Function:
The step function has a derivative of 0 everywhere except for the discontinuity at 0, where the derivative is \(\infty\).
Rectified Linear Unit (ReLU):
\[\begin{split}\frac{\pd f(net_k)}{\pd net_j}=\left\{\begin{array}{lr}0\quad\rm{if\ }net_j\leq 0\\1\quad\rm{if\ }net_k>0\end{array}\right.\end{split}\]At 0, we can set the derivative to be any value in the range \([0,1]\).
Sigmoid:
\[\frac{\pd f(net_j)}{\pd net_j}=f(net_j)\cd (1-f(net_j))\]
Hyperbolic Tangent (tanh):
\[\frac{\pd f(net_j)}{\pd net_j}=1-f(net_j)^2\]
Softmax:
Since softmax is used at the output layer, if we denote the \(i\)th output neuron as \(o_i\), then \(f(net_i)=o_i\).
\[\begin{split}\frac{\pd f(net_j|\bs{\rm{net}})}{\pd net_k}=\frac{\pd o_j}{\pd net_k}= \left\{\begin{array}{lr}o+j\cd(1-o_j)\quad\rm{if\ }k=j\\ -o_k\cd o_j\quad\quad\quad\quad\rm{if\ }k\ne j\end{array}\right.\end{split}\]
25.2 Neural Networks: Regression and Classification
25.2.1 Regression
Consider the multiple (linear) regression problem, where given an input \(\x_i\in\R^d\), the goal is to predict the response as follows:
Given a training data \(\D\) comprising \(n\) points \(\x_i\) in a \(d\)-dimensional space, along with their corresponding true response value \(y_i\), the bias and weights for linear regression are chosen so as to minimize the sum of squared errors between the true and predicted response over all data points
A neural network with \(d+1\) input neurons \(x_0,x_1,\cds,x_d\), including the bias neuron \(x_0=1\), and a single output neuron \(o\), all with identity activation functions and with \(\hat{y}=o\), represents the exact same model as multiple linear regression. Wereas the multiple regression problem has a closed form solution, neural networks learn the bias and weights via a gradient descent approach that minimizes the squared error.
Neural networks can just as easily model the multivariate (linear) regression task, where we have a \(p\)-dimensional response vector \(\y_i\in\R^p\) instead of a single value \(y_i\). That is, the training data \(\D\) comprises \(n\) points \(\x_i\in\R^d\) and their true response vectors \(\y_i\in\R^p\). Multivariate regression can be modeled by a neural network with \(d+1\) input neurons, and \(p\) output neurons \(o_1,o_2,\cds,o_p\), with all input and output neurons using the identity activation function. A neural network learns the weights by comparing its predicted output \(\hat\y=\o=(o_1,o_2,\cds,o_p)^T\) with the true response vector \(\y=(y_1,y_2,\cds,y_p)^T\). That is, training happens by first computing the error function or loss function between \(\o\) and \(\y\). When the prediction matches the true output the loss should be zero. The most common loss function for regression is the squared error function
where \(\cl{E}_\x\) denotes the error on input \(\x\). Across all the points in a dataset, the total sum of squared error is
25.2.2 Classification
Consider the binary classification problem, where \(y=1\) dentoes that the point belongs to the positive class, and \(y=0\) means that it belongs to the negative class. In logistic regression, we model the probability of the positive class viar the logistic (or sigmoid) function:
Given input \(\x\), true response \(y\), and predicted response \(o\), recall that the cross-entropy error is defined as
With sigmoid activation, the output of the neural network is given as
which is the same as the logstic regression model.
Multiclass Logistic Regression
For the general classification problem with \(K\) classes \(\{c_1,c_2,\cds,c_K\}\), the true response \(y\) is encoded as a one-hot vector. Thus, class \(c_i\) is encoded as \(\e_1=(1,0,\cds,0)^T\), and so on, with \(\e_1\in\{0,1\}^K\) for \(i=1,2,\cds,K\). Thus, we encode \(y\) as a multivariate vector \(\y\in\{\e_1,\e_2,\cds,\e_K\}\). Recall that in multiclass logistic regression the task is to estimate the per class bias \(b_i\) and weight vector \(\w_i\in\R^d\), with the last class \(c_K\) used as the base class with fixed bias \(b_K=0\) and fixed weight vector \(\w_K=(0,0,\cds,0)^T\in\R^d\). The probability vector across all \(K\) classes is modeled via the softmax function:
Therefore, the neural network can solve the multiclass logistic regression task, provided we use a softmax activation at the outputs, and use the \(K\)-way cross-entropy error, defined as
where \(\x\) is an input vector, \(\o=(o_1,o_2,\cds,o_K)^T\) is the predicted response vector, and \(\y=(y_1,y_2,\cds,y_k)^T\) is the true response vector. Note that only one element of \(\y\) is 1, and the rest are 0, due to the one-hot encoding.
With softmax activation, the output of the neural network is given as
The only restriction we have to impose on the neural network is that weights on edges into the last output neuron should be zero to model the base class weights \(\w_K\). However, in practice, we relax this restriction, and just learn a regular weight vector for class \(c_K\).
25.2.3 Error Functions
Squared Error:
Given an input vector \(\x\in\R^d\), the squared error loss function measures the squared deviation between the predicted output vector \(\o\in\R^p\) and the true response \(\y\in\R^p\), defined as follows:
Note
\(\dp\cl{E}_\x=\frac{1}{2}\lv\y-\o\rv^2=\frac{1}{2}\sum_{j=1}^p(y_j-o_j)^2\)
where \(\cl{E}_\x\) denotes the error on input \(\x\).
\[ \begin{align}\begin{aligned}\frac{\pd\cl{E}_\x}{\pd o_j}=\frac{1}{2}\cd 2\cd(y_j-o_j)\cd -1=o_j-y_j\\\frac{\pd\cl{E}_\x}{\pd\o}=\o-\y\end{aligned}\end{align} \]
Cross-Entropy Error:
For classification tasks, with \(K\) classes \(\{c_1,c_2,\cds,c_K\}\), we usually set the number of output neurons \(p=K\), with one output neuron per class. Furthermore, each of the classes is coded as a one-hot vector, with class \(c_i\) encoded as the \(i\)th standard basis vector \(\e_i=(e_{i1},e_{i2},\cds,e_{iK})^T\in\{0,1\}^K\), with \(e_{ii}=1\) and \(e_{ij}=0\) for all \(j\ne i\). Thus, given input \(\x\in\R^d\), with the true response \(\y=(y_1,y_2,\cds,y_K)^T\), where \(\y\in\{\e_1,\e_2,\cds,\e_K\}\), the cross-entropy loss is defined as
Note
\(\dp\cl{E}_\x=-\sum_{i=1}^Ky_i\cd\ln(o_i)=-(y_1\cd\ln(o_1)+\cds+y_K\cd\ln(o_K))\)
\[ \begin{align}\begin{aligned}\frac{\pd\cl{E}_\x}{\dp o_j}=-\frac{y_j}{o_j}\\\frac{\pd\cl{E}_\x}{\pd\o}=\bigg(\frac{\pd\cl{E}_\x}{\dp o_1}, \frac{\pd\cl{E}_\x}{\dp o_2},\cds,\frac{\pd\cl{E}_\x}{\dp o_K}\bigg)^T= \bigg(-\frac{y_1}{o_1},-\frac{y_2}{o_2},\cds,-\frac{y_K}{o_K}\bigg)^T\end{aligned}\end{align} \]
Binary Cross-Entropy Error:
For classification tasks with binary classes, it is typical to encode the positive class as 1 and the negative class as 0, as opposed to using a one-hot encoding as in the general \(K\)-class case. Given an input \(\x\in\R^d\), with true response \(y\in\{0,1\}\), there is only one output neuron \(o\). Therefore, the binary cross-entropy error is defined as
Note
\(\cl{E}_\x=-(y\cd\ln(o)+(1-y)\cd\ln(1-o))\)
\[ \begin{align}\begin{aligned}\frac{\pd}{\cl{E}_\x}&=\frac{\pd}{\pd o}\{y\cd\ln(o)-(1-y)\cd\ln(1-o)\}\\&=-\bigg(\frac{y}{o}+\frac{1-y}{1-o}\cd-1\bigg)=\frac{-y\cd(1-o)+(1-y)\cd o}{o\cd(1-o)}\\&=\frac{o-y}{o\cd(1-o)}\end{aligned}\end{align} \]
25.4 Deep Multilayer Perceptrons
Consider an MLP with \(h\) hidden layers. We assume that the input to the MLP comprises \(n\) points \(\x_i\in\R^d\) with the corresponding true response vector \(\y_i\in\R^p\). We denote the input neurons as layer \(l=0\), the first hidden layer as \(l=1\), the last hidden layer as \(l=h\), and the output layer as layer \(l=h+1\). We use \(n_l\) to denote the number of neurons in layer \(l\). We have \(n_0=d\), and \(n_{h+1}=p\). The vector of neuron values for layer \(l\) (for \(l=0,\cds,h+1\)) is denoted as
Each layer except the output layer has one extra bias neuron, which is the neuron at index 0. Thus, the bias neuron for layer \(l\) is denoted \(z_0^l\) and its value is fixed at \(z_0^l=1\).
The vector of input neuron values is also written as
and the vector of output neuron values is also denoted as
The weight matrix between layer \(l\) and layer \(l+1\) neurons is denoted \(\W_l\in\R^{n_l\times n_{l+1}}\), and the vector of bias terms from the bias neuron \(z_0^l\) to neurons in layer \(l+1\) is denoted \(\b_l\in\R^{n_{l+1}}\), for \(l=0,1,\cds,h\).
Define \(\delta_i^l\) as the net gradient, i.e., the partial derivative of the error function with respect to the net value at \(z_i^l\)
and let \(\bs\delta^l\) denote the net gradient vector at layer \(l\), for \(l=1,2,\cds,h+1\)
Let \(f^l\) denote the activation function for layer \(l\), for \(l=0,1,\cds,h+1\), and futher let \(\pd\f^l\) denote the vector of the derivatives of the activation function with respect to \(net_i\) for all neurons \(z_i^l\) in layer \(l\):
Finally, let \(\pd\cl{E}_\x\) denote the vector of partial derivatives of the error function with respect to the values \(o_i\) for all output neurons:
25.4.1 Feed-forward Phase
We assume a fixed activation function \(f^l\) for all neurons in a given layer. For a given input pair \((\x,\y)\in\D\), the deep MLP computes the output vector via the feed-forward process:
Note that each \(f^l\) distributes over its argument. That is
25.4.2 Backpropagation Phase
Let \(z_i^l\) be a neuron in layer \(l\), and \(z_j^{l+1}\) a neuron in the next layer \(l+1\). Let \(w_{ij}^l\) be the weight between \(z_i^l\) and \(z_j^{l+1}\), and let \(b_j^l\) denote the bias term between \(z_0^l\) and \(z_j^{l+1}\). The weight and bias are updated using the gradient descent approach
Given the vector of neuron values at layer \(l\), namely \(\z^l=(z_1^l,\cds,z_{n_l}^l)^T\), we can compute the entire weight gradient matrix via an outer product operation
Note
\(\nabla_{w_l}=\z^l\cd(\bs\delta^{l+1})^T\)
and the bias gradient vector as:
Note
\(\nabla_{\b_l}=\bs\delta^{l+1}\)
with \(l=0,1,\cds,h\).
Note
\(\W_l=\W_l-\eta\cd\nabla_{\w_l}\)
\(\b_l=\b_l-\eta\cd\nabla_{\b_l}\)
We observe that to compute the weight and bias gradients for layer \(l\) we need to compute the net gradients \(\bs\delta^{l+1}\) at layer \(l+1\).
25.4.3 Net Gradients at Output Layer
If all of the output neurons are independent, the net gradient is obtained by differentiating the error function with respect to the net signal at the output neuron.
Note
\(\bs\delta^{h+1}=\pd\f^{h+1}\od\pd\cl{\bs{E}}_\x\)
If the output neurons are not independent, then we have to modify the computation of thg net gradient at each output neuron as follows:
Note
\(\bs\delta^{h+1}=\pd\bs{\rm{F}}^{h+1}\cd\pd\cl{\bs{E}}_\x\)
where \(\pd\bs{\rm{F}}^{h+1}\) is the matrix of derivatives of \(o_i=f^{h+1}(net_i)\) with respect to \(net_j\) for all \(i,j=1,2,\cds,p\), given as
Squared Error:
The error gradient is given as
\[\pd\cl{\bs{E}}_\x=\frac{\pd\cl{\bs{E}}_\x}{\pd\o}=\o-\y\]The net gradient at the output layer is given as
\[\bs\delta^{h+1}=\pd\f^{h+1}\od\pd\cl{\bs{E}}_\x\]Typically, for regression tasks, we use a linear activation at the output neurons. In that case, we have \(\pd\f^{h+1}=\1\).
Cross-Entropy Error (binary output, sigmoid activation):
The binary cross-entropy error is given as
\[\cl{E}_\x=-(y\cd\ln(o)+(1-y)\cd\ln(1-o))\]\[\pd\cl{\bs{E}}_\x=\frac{\pd\cl{E}_\x}{\pd o}=\frac{o-y}{o\cd(1-o)}\]For sigmoid activaton, we have
\[\pd\f^{h+1}=\frac{\pd f(net_o)}{\pd net_o}=o\cd(1-o)\]Therefore, the net gradient at the output neuron is
\[\delta^{h+1}=\pd\cl{\bs{E}}_\x\cd\pd\f^{h+1}=\frac{o-y}{o\cd(1-o)}\cd o(1-o)=o-y\]
Cross-Entropy Error (\(K\) outputs, softmax activation):
The cross-entropy error function is given as
\[\cl{E}_\x=-\sum_{i=1}^K y_i\cd\ln(o_i)=-(y_1\cd\ln(o_1)+\cds+y_K\cd\ln(o_K))\]\[\pd\cl{\bs{E}}_\x=\bigg(\frac{\pd\cl{E}_\x}{\pd o_1},\frac{\pd\cl{E}_\x} {\pd o_2},\cds,\frac{\pd\cl{E}_x}{\pd o_K}\bigg)^T=\bigg( -\frac{y_1}{o_1},-\frac{y_2}{o_2},\cds,-\frac{y_K}{o_K}\bigg)^t\]Cross-entropy error is typically used with the softmax activation so that we get a (normalized) probability value for each class. That is,
\[o_j=\rm{softmax}(net_j)=\frac{\exp\{net_j\}}{\sum_{i=1}^K\exp\{net_i\}}\]so that the output neuron values sum to one, \(\sum_{j=1}^Ko_j=1\).
\[ \begin{align}\begin{aligned}\begin{split}\pd\bs{\rm{F}}^{h+1}=\bp\frac{\pd o_1}{\pd net_1}&\frac{\pd o_1} {\pd net_2}&\cds&\frac{\pd o_1}{\pd net_K}\\\frac{\pd o_2}{\pd net_1}& \frac{\pd o_2}{\pd net_2}&\cds&\frac{\pd o_2}{\pd net_K}\\\vds&\vds&\dds &\vds\\\frac{\pd o_K}{\pd net_1}&\frac{\pd o_K}{\pd net_2}&\cds& \frac{\pd o_K}{\pd net_K}\ep\end{split}\\\begin{split}=\bp o_1\cd(1-o_1)&-o_1\cd o_2&\cds&-o_1\cd o_K\\-o_1\cd o_2&o_2 \cd(1-o_2)&\cds&-o_2\cd o_K\\\vds&\vds&\dds&\vds\\-o_1\cd o_K&-o_2 \cd o_K&\cds&o_K\cd(1-o_K)\ep\end{split}\end{aligned}\end{align} \]Therefore, the net gradient vecgtor at the output layer is
\[\bs\delta^{h+1}=\pd\bs{\rm{F}}^{h+1}\cd\pd\cl{\bs{E}}_\x=\o-\y\]
25.4.5 Training Deep MLPs
In practice, it is commonto update the gradients by considering a fixed sized subset of the training points called a minibatch instead of using single points. That is, the training data is divided into minibatches using an additional parameter called batch size, and a gradient descent step is performed after computing the bias and weight gradient from each minibatch. This helps better estimate the gradients, and also allows vectorized matrix operations over the minibatch of points, which can lead to faster convergence and substantial speedups in the learning.
One caveat while training very deep MLPs is the problem of vanishing and exploding gradients. In the vanishing gradient problem, the norm of the net gradient can decay exponentially with the distance from the output layer, that is, as we backpropagate the gradients from the output layer to the input layer. In this case the network will learn extremely slowly, if at all, since the gradient descent method will make minuscule changes to the weights and biases. On the other hand, in the exploding gradient problem, the norm of the net gradient can grow exponentially with the distance from the output layer. In this case, the weights and biases will become exponentially large, resulting in a failure to learn. The gradient explosion problem can be mitigated to some extent by gradient thresholding, that is, by resetting the value if itexceeds an upper bound. The vanishing gradients problem is more difficult to address. Typically sigmoid activations are more susceptible to this problem, and one solution is to use a lternative activation functions such as ReLU. In general, recurrent neural networks, which are deep neural networks with feedback connections, are more prone to vanishing and exploding gradients.