Deep Learning - Part 2
- Input layer > hidden layer(s) > output layer
Activation function (1,0): linear vs non-linear. A composition of linear functions is still just a linear function, so most neural networks use non-linear activation functions such as the logistic (sigmoid), tanh, binary step, or rectifier (ReLU).
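To see why stacked linear layers collapse into one, here is a minimal sketch in plain Python. The weights W1 and W2 and the input x are made-up values for illustration, and biases are left out for brevity.

```python
import math

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrices A and B (lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

# Hypothetical weights for two layers (illustration only, no biases).
W1 = [[1.0, 2.0], [0.5, -1.0]]
W2 = [[0.0, 1.0], [3.0, 1.0]]
x = [2.0, -1.0]

# Two layers with no activation function in between...
two_linear_layers = matvec(W2, matvec(W1, x))
# ...give the same result as one layer whose weight matrix is W2 * W1.
one_merged_layer = matvec(matmul(W2, W1), x)

# With tanh applied after the first layer, the network no longer
# reduces to the single matrix W2 * W1.
with_tanh = matvec(W2, [math.tanh(h) for h in matvec(W1, x)])

print(two_linear_layers, one_merged_layer, with_tanh)
```

The first two results agree, so the extra linear layer adds no expressive power. Only the non-linear version computes something a single layer could not.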
Training Perceptrons
The most common deep learning algorithm for supervised training of the multilayer perceptrons is known as backpropagation. The basic procedure:
- A training sample is presented and propagated forward through the network.
- The output error is calculated, typically the mean squared error, which for a single output is:
  E = (1/2) * (t - y)^2
  where t is the target value and y is the actual network output. Other error measures are also acceptable, but the MSE is a good choice.
- The network error is minimized using a method called stochastic gradient descent.
Thankfully, backpropagation provides a method for updating each weight between two neurons with respect to the output error. The goal is to adjust each weight in the direction that reduces the output error. Because gradient descent only follows the local slope of the error, we may end up at a local minimum rather than the global one.
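The procedure above can be sketched for a tiny network: 2 inputs, 2 sigmoid hidden neurons, 1 sigmoid output. This is a sketch under stated assumptions, not a reference implementation. The initial weights, the training sample, the learning rate, the absence of biases, and the squared error E = 0.5 * (t - y)^2 are all choices made here for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, W2):
    """Propagate input x forward; return hidden activations and the output."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    y = sigmoid(sum(w * hi for w, hi in zip(W2, h)))
    return h, y

def error(t, y):
    """Squared error for one sample (assumed form: 0.5 * (t - y)^2)."""
    return 0.5 * (t - y) ** 2

def backprop_step(x, t, W1, W2, lr):
    """One stochastic gradient descent update on a single sample (in place).

    Returns the error measured before the update.
    """
    h, y = forward(x, W1, W2)
    # Output delta: dE/dz_out = (y - t) * y * (1 - y)
    delta_out = (y - t) * y * (1.0 - y)
    # Hidden deltas: pass delta_out back through W2 (using the old weights).
    delta_hidden = [delta_out * W2[j] * h[j] * (1.0 - h[j]) for j in range(len(h))]
    # Move every weight a small step against its error gradient.
    for j in range(len(W2)):
        W2[j] -= lr * delta_out * h[j]
    for j, row in enumerate(W1):
        for i in range(len(row)):
            row[i] -= lr * delta_hidden[j] * x[i]
    return error(t, y)

# Hypothetical starting weights and training sample (illustration only).
W1 = [[0.15, 0.20], [0.25, 0.30]]   # input -> hidden
W2 = [0.40, 0.45]                   # hidden -> output
x, t = [1.0, 0.0], 1.0

errors = [backprop_step(x, t, W1, W2, lr=0.5) for _ in range(200)]
print(errors[0], errors[-1])
```

Repeating the update on the same sample drives its error down. A real training loop would cycle through many samples, usually in random order.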
The knowledge is in the hidden layer
The hidden layer is where the network stores its internal abstract representation of the training data, similar to the way that a human brain (greatly simplified analogy) has an internal representation of the real world.
A neural network can have more than one hidden layer: in that case, the higher layers are “building” new abstractions on top of previous layers. And as we mentioned before, you can often learn better in practice with larger networks.
However, increasing the number of hidden layers leads to two known issues:
- Vanishing gradients: as we add more and more hidden layers, backpropagation becomes less and less useful in passing information to the lower layers. In effect, as information is passed back, the gradients begin to vanish and become small relative to the weights of the network.
- Overfitting: perhaps the central problem in machine learning. Briefly, overfitting describes the phenomenon of fitting the training data too closely, often with hypotheses that are too complex. In such a case, your learner ends up fitting the training data really well, but performs much worse on real examples.
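The vanishing gradient effect can be illustrated with a deliberately simplified model: a chain of layers that each hold one sigmoid neuron. The width of 1, the weight of 1.0, and the input of 0.5 are assumptions made for this sketch. Since the sigmoid's derivative never exceeds 0.25, each extra layer multiplies the backpropagated gradient by at most 0.25 times the weight.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_magnitudes(depth, weight=1.0, x=0.5):
    """Return |d(output) / d(layer input)| for each layer of a chain of
    single sigmoid neurons, ordered from the last layer back to the first."""
    # Forward pass: record the activation after every layer.
    activations = [x]
    for _ in range(depth):
        activations.append(sigmoid(weight * activations[-1]))
    # Backward pass: each layer contributes a factor weight * a * (1 - a),
    # where a is that layer's output (the derivative of the sigmoid).
    grad = 1.0
    magnitudes = []
    for a in reversed(activations[1:]):
        grad *= weight * a * (1.0 - a)
        magnitudes.append(abs(grad))
    return magnitudes

mags = gradient_magnitudes(depth=10)
for layers_back, m in enumerate(mags, start=1):
    print(f"{layers_back:2d} layer(s) back: {m:.3e}")
```

In this setup the gradient shrinks by a factor of roughly four or more per layer, so the earliest layers receive almost no learning signal.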