From: https://towardsdatascience.com/optimizers-for-training-neural-network-59450d71caf6
Many people use optimizers while training a neural network without knowing that the technique is called optimization. Optimizers are algorithms or methods that change the attributes of your neural network, such as its weights and learning rate, in order to reduce the losses.
Optimizers help to get results faster
How you should change the weights or learning rate of your neural network to reduce the losses is defined by the optimizer you use. Optimization algorithms or strategies are responsible for reducing the losses and for providing the most accurate results possible.
We’ll learn about the different types of optimizers, how they work, and their advantages and disadvantages:
Gradient Descent is the most basic but most used optimization algorithm. It’s used heavily in linear regression and classification algorithms. Backpropagation in neural networks also uses a gradient descent algorithm.
Gradient descent is a first-order optimization algorithm, meaning it depends only on the first derivative of the loss function. It calculates which way the weights should be altered so that the function can reach a minimum. Through backpropagation, the loss is propagated from one layer to another, and the model’s parameters, also known as weights, are modified depending on the losses so that the loss can be minimized.
Algorithm: θ = θ − α·∇J(θ)
Advantages: easy to compute, implement, and understand.
Disadvantages: may get trapped at local minima; the weights are updated only after the gradient over the entire dataset has been computed, so convergence is slow and a lot of memory is required for large datasets.
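To make the update rule concrete, here is a minimal NumPy sketch (not from the original article) of vanilla gradient descent on a made-up least-squares problem; the data, learning rate, and step count are illustrative assumptions.

import numpy as np

# Toy least-squares problem: J(θ) = 0.5 · mean((Xθ − y)²)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                 # 1000 examples, 3 features
true_theta = np.array([2.0, -1.0, 0.5])
y = X @ true_theta + 0.1 * rng.normal(size=1000)

def grad(theta):
    # ∇J(θ): gradient of the loss over the whole dataset
    return X.T @ (X @ theta - y) / len(y)

theta = np.zeros(3)                            # initial weights
alpha = 0.1                                    # learning rate
for step in range(200):                        # one update per full pass over the data
    theta = theta - alpha * grad(theta)        # θ = θ − α·∇J(θ)

print(theta)                                   # ends up close to true_theta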
Stochastic Gradient Descent (SGD) is a variant of Gradient Descent that updates the model’s parameters more frequently: the parameters are altered after computing the loss on each training example. So, if the dataset contains 1000 rows, SGD will update the model parameters 1000 times in one pass over the dataset, instead of once as in Gradient Descent.
θ = θ − α·∇J(θ; x(i); y(i)), where {x(i), y(i)} are the training examples.
Because the model parameters are updated so frequently, they have high variance and the loss function fluctuates with varying intensity.
Advantages: frequent updates mean the model converges in less time; it requires less memory, since the gradient is computed on a single example at a time; the noisy updates can help it reach new minima.
Disadvantages: high variance in the model parameters; it may keep overshooting even after reaching the global minimum; to obtain the same convergence as gradient descent, the learning rate has to be reduced slowly.
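A minimal NumPy sketch of the per-example SGD update described above, using the same kind of made-up least-squares data (all names and constants here are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                     # 1000 rows -> 1000 updates per epoch
true_theta = np.array([2.0, -1.0, 0.5])
y = X @ true_theta + 0.1 * rng.normal(size=1000)

theta = np.zeros(3)
alpha = 0.01                                       # learning rate
for epoch in range(5):
    for i in rng.permutation(len(y)):              # visit training examples in random order
        g = X[i] * (X[i] @ theta - y[i])           # ∇J(θ; x(i); y(i)) on a single example
        theta = theta - alpha * g                  # θ = θ − α·∇J(θ; x(i); y(i))
print(theta)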
Mini-Batch Gradient Descent is the best among all the variations of gradient descent algorithms. It is an improvement on both SGD and standard Gradient Descent: the dataset is divided into batches, and the model parameters are updated after every batch.
θ = θ − α·∇J(θ; B(i)), where {B(i)} are the batches of training examples.
Advantages: frequent updates with lower variance than SGD; requires only a moderate amount of memory.
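The sketch below (again on made-up data, with an assumed batch size of 32) shows the same loop with mini-batches: the dataset is split into batches and the parameters are updated once per batch.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_theta = np.array([2.0, -1.0, 0.5])
y = X @ true_theta + 0.1 * rng.normal(size=1000)

theta = np.zeros(3)
alpha, batch_size = 0.05, 32
for epoch in range(20):
    order = rng.permutation(len(y))                    # shuffle once per epoch
    for start in range(0, len(y), batch_size):
        idx = order[start:start + batch_size]          # one batch B(i)
        Xb, yb = X[idx], y[idx]
        g = Xb.T @ (Xb @ theta - yb) / len(idx)        # gradient averaged over the batch
        theta = theta - alpha * g                      # θ = θ − α·∇J(θ; B(i))
print(theta)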
All types of Gradient Descent share some challenges: choosing a good learning rate is difficult (too small and convergence is slow, too large and the loss may diverge); the same learning rate is applied to every parameter; and the algorithm may get trapped at local minima or saddle points.
Momentum was invented to reduce the high variance of SGD and to smooth the convergence. It accelerates convergence in the relevant direction and reduces the fluctuation in irrelevant directions. One more hyperparameter is used in this method, known as momentum and symbolized by ‘γ’.
V(t) = γ·V(t−1) + α·∇J(θ)
Now, the weights are updated by θ = θ − V(t).
The momentum term γ is usually set to 0.9 or a similar value.
Advantages: reduces the oscillations and high variance of the parameter updates; converges faster than plain gradient descent.
Disadvantages: adds one more hyperparameter that has to be chosen manually and carefully.
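A minimal NumPy sketch of the momentum update above, V(t) = γ·V(t−1) + α·∇J(θ) followed by θ = θ − V(t), on the same kind of made-up least-squares problem (the constants are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_theta = np.array([2.0, -1.0, 0.5])
y = X @ true_theta + 0.1 * rng.normal(size=1000)

def grad(theta):
    return X.T @ (X @ theta - y) / len(y)              # ∇J(θ) over the whole dataset

theta, v = np.zeros(3), np.zeros(3)
gamma, alpha = 0.9, 0.05                               # momentum term and learning rate
for step in range(300):
    v = gamma * v + alpha * grad(theta)                # V(t) = γ·V(t−1) + α·∇J(θ)
    theta = theta - v                                  # θ = θ − V(t)
print(theta)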
Momentum may be a good method, but if the momentum is too high the algorithm may miss the local minima and continue climbing. The Nesterov Accelerated Gradient (NAG) algorithm was developed to resolve this issue. It is a look-ahead method: we know we will be using γ·V(t−1) to modify the weights, so θ − γ·V(t−1) approximately tells us the future position. We then calculate the cost based on this future parameter value rather than the current one.
V(t) = γ·V(t−1) + α·∇J(θ − γ·V(t−1)), and then update the parameters using θ = θ − V(t).
(Figure: NAG vs. momentum near a local minimum.)
Advantages: less likely to miss the minima, since the look-ahead slows the updates as a minimum approaches.
Disadvantages: the momentum hyperparameter still has to be selected manually.
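A minimal NumPy sketch of the look-ahead (Nesterov) update described above, again on made-up least-squares data with assumed constants:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_theta = np.array([2.0, -1.0, 0.5])
y = X @ true_theta + 0.1 * rng.normal(size=1000)

def grad(theta):
    return X.T @ (X @ theta - y) / len(y)

theta, v = np.zeros(3), np.zeros(3)
gamma, alpha = 0.9, 0.05
for step in range(300):
    lookahead = theta - gamma * v                      # approximate future position θ − γ·V(t−1)
    v = gamma * v + alpha * grad(lookahead)            # V(t) = γ·V(t−1) + α·∇J(θ − γ·V(t−1))
    theta = theta - v                                  # θ = θ − V(t)
print(theta)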
One disadvantage of all the optimizers explained so far is that the learning rate is constant for all parameters and for every cycle. AdaGrad changes this: it adapts the learning rate ‘η’ for each parameter and at every time step ‘t’, based on the gradients of the error function computed so far.
The gradient of the loss function for a given parameter θ(i) at time step t: g(t,i) = ∇J(θ(t,i)).
The parameter update for a given parameter i at time step t: θ(t+1,i) = θ(t,i) − ( η / √(G(t,ii) + ε) ) · g(t,i), where G(t) is a diagonal matrix whose (i,i) entry accumulates the squared gradients of θ(i).
Here η is the learning rate, which is modified for a given parameter θ(i) at a given time step based on the previous gradients calculated for that parameter.
We store the sum of the squares of the gradients w.r.t. θ(i) up to time step t, while ε is a smoothing term that avoids division by zero (usually on the order of 1e−8). Interestingly, without the square root operation, the algorithm performs much worse.
It makes big updates for infrequently updated parameters and small steps for frequently updated ones.
Advantages: the learning rate adapts to each parameter; no manual tuning of the learning rate is needed; it is able to train on sparse data.
Disadvantages: the accumulated squared gradients keep growing, so the effective learning rate keeps shrinking and training eventually becomes very slow.
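A minimal NumPy sketch of the AdaGrad update, keeping a running sum of squared gradients per parameter (data and constants are made up for illustration):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_theta = np.array([2.0, -1.0, 0.5])
y = X @ true_theta + 0.1 * rng.normal(size=1000)

def grad(theta):
    return X.T @ (X @ theta - y) / len(y)

theta = np.zeros(3)
G = np.zeros(3)                                        # accumulated squared gradients, one per parameter
eta, eps = 0.5, 1e-8                                   # base learning rate and smoothing term
for step in range(500):
    g = grad(theta)
    G = G + g ** 2                                     # grow the per-parameter history
    theta = theta - eta / np.sqrt(G + eps) * g         # per-parameter step: η / √(G + ε) · g
print(theta)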
AdaDelta is an extension of AdaGrad that removes its decaying-learning-rate problem. Instead of accumulating all previously squared gradients, AdaDelta restricts the window of accumulated past gradients to some fixed size w: an exponentially decaying moving average is used rather than the sum of all past gradients.
E[g²](t) = γ·E[g²](t−1) + (1−γ)·g²(t)
We set γ to a similar value as the momentum term, around 0.9.
Update the parameters by dividing the learning rate by the root of this running average: θ(t+1) = θ(t) − ( η / √(E[g²](t) + ε) ) · g(t).
Advantages: the learning rate no longer decays toward zero, so training does not stall.
Disadvantages: computationally more expensive than plain gradient descent.
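The sketch below shows the running-average idea on made-up data. Note that it only implements the E[g²] denominator (which on its own is usually called RMSprop); full AdaDelta additionally keeps a running average of past squared parameter updates in place of the fixed learning rate η. The constants are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_theta = np.array([2.0, -1.0, 0.5])
y = X @ true_theta + 0.1 * rng.normal(size=1000)

def grad(theta):
    return X.T @ (X @ theta - y) / len(y)

theta = np.zeros(3)
Eg2 = np.zeros(3)                                      # running average E[g²]
gamma, eta, eps = 0.9, 0.1, 1e-8
for step in range(500):
    g = grad(theta)
    Eg2 = gamma * Eg2 + (1 - gamma) * g ** 2           # E[g²](t) = γ·E[g²](t−1) + (1−γ)·g²(t)
    theta = theta - eta / np.sqrt(Eg2 + eps) * g       # step scaled by the root of the running average
print(theta)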
Adam (Adaptive Moment Estimation) works with first- and second-order momentum. The intuition behind Adam is that we don’t want to roll so fast just because we can jump over the minimum; we want to decrease the velocity a little for a careful search. In addition to storing an exponentially decaying average of past squared gradients like AdaDelta, Adam also keeps an exponentially decaying average of past gradients, M(t).
M(t) and V(t) are estimates of the first moment (the mean) and the second moment (the uncentered variance) of the gradients, respectively.
First and second moment estimates: M(t) = β1·M(t−1) + (1−β1)·g(t) and V(t) = β2·V(t−1) + (1−β2)·g²(t).
Because M(t) and V(t) are initialized at zero, they are bias-corrected so that E[M̂(t)] matches E[g(t)], where E[f(x)] denotes the expected value of f(x): M̂(t) = M(t)/(1−β1^t) and V̂(t) = V(t)/(1−β2^t).
The parameters are then updated as θ(t+1) = θ(t) − η·M̂(t) / ( √V̂(t) + ε ).
Typical values are 0.9 for β1, 0.999 for β2, and 10^−8 for ‘ε’.
Advantages: converges very quickly; rectifies the vanishing learning rate and high variance of the earlier methods.
Disadvantages: computationally costly.
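A minimal NumPy sketch of the Adam update using the formulas above, on made-up least-squares data; β1, β2, and ε follow the typical defaults quoted in the text, while the learning rate and step count are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_theta = np.array([2.0, -1.0, 0.5])
y = X @ true_theta + 0.1 * rng.normal(size=1000)

def grad(theta):
    return X.T @ (X @ theta - y) / len(y)

theta = np.zeros(3)
m, v = np.zeros(3), np.zeros(3)                        # first and second moment estimates M(t), V(t)
beta1, beta2, eta, eps = 0.9, 0.999, 0.05, 1e-8
for t in range(1, 501):
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g                    # M(t)
    v = beta2 * v + (1 - beta2) * g ** 2               # V(t)
    m_hat = m / (1 - beta1 ** t)                       # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                       # bias-corrected second moment
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
print(theta)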
(Figures “Comparison 1” and “Comparison 2”: visual comparisons of how the different optimizers converge.)
Adam is the best of these optimizers. If you want to train a neural network in less time and more efficiently, Adam is the optimizer to use.
For sparse data, use one of the optimizers with a dynamic (per-parameter) learning rate.
If you want to use a gradient descent algorithm, mini-batch gradient descent is the best option.
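As a usage sketch only (assuming TensorFlow/Keras is installed; the model and data shapes are placeholders, and the optimizer classes and arguments follow the tf.keras.optimizers API), this is how the optimizers discussed here are typically selected in practice:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),                       # placeholder input size
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1),
])

# Pick one optimizer. Adam usually converges fastest; Adagrad/Adadelta adapt the
# learning rate per parameter (useful for sparse data); SGD covers the classic
# variants, with momentum and Nesterov as options.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, beta_1=0.9, beta_2=0.999, epsilon=1e-8)
# optimizer = tf.keras.optimizers.SGD(learning_rate=1e-2, momentum=0.9, nesterov=True)
# optimizer = tf.keras.optimizers.Adagrad(learning_rate=1e-2)
# optimizer = tf.keras.optimizers.Adadelta(learning_rate=1.0)

model.compile(optimizer=optimizer, loss="mse")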
I hope you liked the article and that it gave you good intuition for the different behaviors of the different optimization algorithms.
Various Optimization Algorithms For Training Neural Network [Repost]
Original post: https://www.cnblogs.com/lightsong/p/14643083.html