Summary of the Multi-Layer Perceptron

  1. Training an MLP system requires two steps.
  2. Forward: calculate the output for the given inputs and the current weights.
  3. Backward: propagate the error between prediction and target back to the weights, and update the weights using the gradient of the error function.
  4. Traditional MLP systems used a threshold function as the activation function.
  5. Modern MLP systems have replaced it with activations such as sigmoid, softmax, tanh, ReLU, etc.
  6. These activations have two main advantages: non-linearity and continuity (differentiability).
  7. Non-linearity enables the system to learn non-linear features.
  8. Continuity (differentiability) enables the system to learn by back-propagation with the gradient descent algorithm.
  9. With the gradient descent algorithm, we can find the lowest error by iteratively adjusting the weights. Several components enter the gradient descent computation: the inputs, the activation function, and the weights.
  10. In one layer, each weight is one direction on the error surface; we need to move towards lower error along every direction.
  11. This means we need to update every weight in each iteration.
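The per-weight update described above can be written, for a learning rate η (standard notation, not taken from the notes):

```latex
w_{ij} \leftarrow w_{ij} - \eta \frac{\partial E}{\partial w_{ij}}
```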
  12. The forward calculation is very straightforward: sum the products of all activations (or inputs) from the previous layer with their weights, pass this sum into the activation function of the current layer, and the output is the activation of this neuron.
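A minimal sketch of this forward step for one layer (the helper names are mine; assumes a sigmoid activation and NumPy):

```python
import numpy as np

def sigmoid(h):
    # continuous, differentiable activation
    return 1.0 / (1.0 + np.exp(-h))

def forward(prev_activations, weights, bias):
    # sum of products of previous-layer activations and weights,
    # then pass that sum through the current layer's activation
    h = prev_activations @ weights + bias
    return sigmoid(h)
```

For example, `forward(np.zeros(3), np.zeros((3, 2)), np.zeros(2))` gives activations of 0.5, since sigmoid(0) = 0.5.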
  13. The back-propagation algorithm is based on the relation between the error and the weights of every layer. This relation is built from the gradients of the error function, i.e. the partial derivatives of the error function with respect to the weights of each layer.
  14. To update the weights of the output layer, we just need to calculate the partial derivative of the error function with respect to each of those weights.
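As a concrete instance (my notation, not from the notes): with a sum-of-squares error and sigmoid outputs, the chain rule gives, for a weight connecting hidden neuron j to output neuron k with activation a_j feeding output y_k,

```latex
E = \tfrac{1}{2}\sum_k (y_k - t_k)^2, \qquad
\frac{\partial E}{\partial w_{jk}} = (y_k - t_k)\, y_k (1 - y_k)\, a_j
```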
  1. Weights should be initialised to small random numbers, both positive and negative. If the weights are initialised too large, the activation function saturates easily, meaning it reaches its maximum or minimum value. If the weights are too small (close to zero), the activation behaves like a linear function.
  2. Two popular ways of initialising weights: glorot_uniform (Xavier uniform) and he_normal.
  3. glorot_uniform (Xavier uniform) suits activation functions like sigmoid, tanh, and softmax.
  4. he_normal suits ReLU.
  5. Both initialisers aim to keep the typical input to each neuron around 1: (1) not so small that it stays in the linear region of the activation function, and (2) not so large that it saturates the activation function.
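A sketch of the two initialisers (the scaling rules as I understand them; the function names are mine):

```python
import numpy as np

def glorot_uniform(n_in, n_out, rng):
    # Xavier uniform: bound scaled by fan-in + fan-out (sigmoid/tanh/softmax)
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))

def he_normal(n_in, n_out, rng):
    # He normal: standard deviation scaled by fan-in only (ReLU)
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))
```

Both return a weight matrix whose spread shrinks as the layer gets wider, which is what keeps each neuron's summed input in the useful range of the activation.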
  6. When we consider how to set up the input layer and feed inputs into the network, there are mainly two directions: sequential and batch training.
  7. The MLP is designed as a batch algorithm: all of the training examples are presented to the network before the weights are updated. This gives a more accurate estimate of the error gradient and thus converges to a minimum more quickly, but it is also more likely to settle in a local minimum that is not what we want.
  8. We can also feed inputs one by one, i.e. sequentially, updating the weights after each input. This takes longer to converge, but can sometimes escape local minima, potentially reaching better solutions. If we choose sequential input, we need to randomise the input order.
  9. The gradient descent algorithm potentially drives us to a local minimum, instead of the global minimum we expect.
  10. Several techniques can help us overcome local minima: trying different starting points (weight initialisations), adding momentum to gradient descent, and weight decay, which shrinks the weights a little at each iteration.
  11. Minibatches can be employed to help avoid local minima while at the same time speeding up learning.
  12. Stochastic gradient descent is the extreme version of minibatch training, with a batch size of one.
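Points 10-12 can be combined in one sketch: minibatch gradient descent with momentum on a toy one-weight regression problem (all names and constants here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=64)
y = 3.0 * X                      # toy data generated by y = 3x

w, v = 0.0, 0.0                  # weight and momentum "velocity"
lr, momentum, batch = 0.1, 0.9, 8

for epoch in range(50):
    order = rng.permutation(len(X))          # randomise presentation order
    for i in range(0, len(X), batch):
        idx = order[i:i + batch]
        grad = 2.0 * np.mean((w * X[idx] - y[idx]) * X[idx])  # d(sum-of-squares)/dw
        v = momentum * v - lr * grad         # momentum remembers past gradients
        w += v                               # batch = 1 here would be pure SGD
```

After training, `w` ends up close to the true slope 3; the momentum term smooths the noisy per-minibatch gradients.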
  13. Including information about the second derivatives of the error with respect to the weights can sometimes yield much larger performance gains.
  14. The MLP can be applied to find solutions to four different types of problem: regression, classification, time-series prediction, and data compression.
  15. The amount of training data should be about 10 times the number of neurons. More neurons is not always better: a larger network is more likely to overfit, and needs more data and time to train.
  16. For an MLP, one hidden layer plus one output layer can approximate almost any smooth function, according to the Universal Approximation Theorem.
  17. Early stopping is how we decide when to finish learning so as to avoid overfitting: use the validation set to track the validation error, and the moment it starts to rise is the time to stop early.
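One way to sketch that stopping rule (a hypothetical training loop; the patience parameter lets a noisy validation error rise briefly before we give up):

```python
def train_with_early_stopping(train_epoch, validation_error,
                              max_epochs=1000, patience=3):
    # train_epoch() runs one epoch; validation_error() measures held-out error
    best, bad_epochs = float("inf"), 0
    for epoch in range(max_epochs):
        train_epoch()
        err = validation_error()
        if err < best:
            best, bad_epochs = err, 0        # still improving
        else:
            bad_epochs += 1                  # validation error went up
            if bad_epochs >= patience:
                break                        # overfitting has started: stop early
    return best
```

With a validation-error sequence that dips and then climbs, the loop returns the minimum reached just before the climb.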
  18. Besides the error function calculated as the sum of squares, we can use the cross-entropy error function. It employs the natural logarithm and handles the softmax activation nicely.
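A sketch of that pairing (assumes NumPy; shifting by the maximum before exponentiating is a standard numerical-stability trick, not something from the notes):

```python
import numpy as np

def softmax(h):
    e = np.exp(h - h.max())      # subtract the max so exp never overflows
    return e / e.sum()

def cross_entropy(y, t):
    # t is a one-hot target; the natural log cancels softmax's exponential,
    # which yields the simple gradient y - t at the output layer
    return -np.sum(t * np.log(y))
```

For instance, `cross_entropy(softmax(np.array([0.0, 0.0])), np.array([1.0, 0.0]))` is ln 2, the loss of a two-class model that is maximally unsure.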