Hyperparameter tuning: the learning rate

2023-05-16

"The learning rate is perhaps the most important hyperparameter. If you have time to tune only one hyperparameter, tune the learning rate." – Page 429, Deep Learning, 2016

  • a large learning rate allows the model to learn faster, at the cost of arriving at a sub-optimal final set of weights
  • a smaller learning rate may allow the model to learn a more optimal or even globally optimal set of weights but may take significantly longer to train
  • a learning rate that is too large will result in weight updates that are too large, and the performance of the model will oscillate over training epochs. Oscillating performance is said to be caused by weights that diverge (are divergent)
  • a learning rate that is too small may never converge or may get stuck on a suboptimal solution
  • in the worst case, large weights induce large gradients, which in turn induce large updates that consistently increase the size of the weights. In other words, weight updates that are too large may cause the weights to explode, resulting in numerical overflow (a toy illustration follows this list)
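As a toy illustration of these regimes (not from the original article), the sketch below runs plain gradient descent on a hypothetical one-dimensional quadratic loss f(w) = w², whose gradient is 2w, with several step sizes:

```python
def gradient_descent(lr, steps=50, w0=5.0):
    """Minimise f(w) = w^2 with plain gradient descent; the gradient is 2w."""
    w = w0
    for _ in range(steps):
        w = w - lr * 2 * w  # update: w <- w - lr * grad
    return w

for lr in (1.5, 0.9, 0.1, 1e-4):
    print(f"lr={lr:<6} final |w| = {abs(gradient_descent(lr)):.3e}")

# lr=1.5  : updates overshoot and grow without bound (divergence, eventual overflow)
# lr=0.9  : the sign of w oscillates from step to step, but |w| still shrinks
# lr=0.1  : smooth convergence towards 0
# lr=1e-4 : |w| barely moves in 50 steps (learning far too slowly)
```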

Learning rate tuning in three steps

  1. Diagnose whether the current training setup suffers from a poorly chosen learning rate
  2. Run a sensitivity analysis to build intuition on how and how much the learning rate interacts with the current model and data
  3. Adjust the learning rate for better performance: fixed learning rate → learning rate decay → momentum → adaptive learning rate

Step 1: Diagnostic plots

An example of a diagnostic plot is a line plot of loss over training epochs. It can be used to investigate how the learning rate impacts the rate of learning and the learning dynamics of the model. Properties that can be identified from the line plot include:

  • the rate of learning over training epochs, such as fast or slow
  • whether the model has learned too quickly (sharp rise and plateau) or is learning too slowly (little or no change)
  • whether the learning rate might be too large, which can be identified via oscillations in loss

Detailed plots and discussion are given later in the article.
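As a minimal, self-contained sketch of how such a diagnostic plot can be produced with Keras (the data and model below are synthetic placeholders, not from the article):

```python
import numpy as np
import matplotlib.pyplot as plt
from tensorflow import keras

# Synthetic data and a small model, purely for illustration.
X = np.random.rand(500, 10)
y = (X.sum(axis=1) > 5).astype(int)

model = keras.Sequential([
    keras.Input(shape=(10,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.01),
              loss="binary_crossentropy")

history = model.fit(X, y, epochs=100, validation_split=0.2, verbose=0)

# Line plot of loss over training epochs: the basic diagnostic described above.
plt.plot(history.history["loss"], label="train loss")
plt.plot(history.history["val_loss"], label="validation loss")
plt.xlabel("training epoch")
plt.ylabel("loss")
plt.legend()
plt.show()
```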

Step 2: Sensitivity analysis

Sensitivity analysis of the learning rate, also called a grid search, helps both to highlight an order of magnitude where good learning rates may reside and to describe the relationship between learning rate and performance.

  • a typical grid search involves picking values approximately on a logarithmic scale, e.g. within the set $\{0.1, 0.01, \ldots, 10^{-5}\}$ (see the sketch after this list)
  • when plotted, the results of a sensitivity analysis often show a “U” shape
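A minimal sketch of such a grid search (the data, model, and epoch count are illustrative placeholders, not from the article); plotting the printed accuracies against the logarithm of the learning rate typically reveals the "U" shape mentioned above:

```python
import numpy as np
from tensorflow import keras

# Synthetic data, purely for illustration.
X = np.random.rand(1000, 10)
y = (X.sum(axis=1) > 5).astype(int)

def build_model():
    return keras.Sequential([
        keras.Input(shape=(10,)),
        keras.layers.Dense(16, activation="relu"),
        keras.layers.Dense(1, activation="sigmoid"),
    ])

# Grid of learning rates on a logarithmic scale: 0.1, 0.01, ..., 1e-5
for lr in [10.0 ** -k for k in range(1, 6)]:
    model = build_model()
    model.compile(optimizer=keras.optimizers.SGD(learning_rate=lr),
                  loss="binary_crossentropy", metrics=["accuracy"])
    history = model.fit(X, y, epochs=50, validation_split=0.2, verbose=0)
    print(f"lr={lr:.0e}  final val accuracy={history.history['val_accuracy'][-1]:.3f}")
```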

Step 3: Learning rate configuration

fixed learning rate:

  • grid search the optimal learning rate
  • typical values for a neural network with standardized inputs (or inputs mapped to the (0, 1) interval) are less than 1 and greater than $10^{-6}$
  • a traditional default value for the learning rate is 0.1 or 0.01
  • smaller batch sizes give a noisier estimate of the gradient and are thus better suited to smaller learning rates
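In Keras, a fixed learning rate is a single optimizer argument; a minimal sketch using the traditional default of 0.01 (here `model` is assumed to be an already-built Keras model, a hypothetical placeholder):

```python
from tensorflow.keras.optimizers import SGD

# A traditional default: a fixed learning rate of 0.01, no schedule, no momentum.
# `model` is assumed to be an already-built Keras model (placeholder).
model.compile(optimizer=SGD(learning_rate=0.01), loss="binary_crossentropy")
```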

learning rate decay / learning rate schedule

Effect: helps the optimization process converge

Reason for its use: the SGD gradient estimator embeds a source of noise from batch sampling that does not vanish even when we arrive at a minimum

  • decay linearly until reaching a fixed minimum learning rate
  • decay exponentially
  • decrease the learning rate by a constant factor, typically between 2 and 10, each time the validation error plateaus
  • `initial_lrate * (1 / (1 + decay * iteration))`, often known as a time-based learning schedule (a small sketch follows this list)
  • the learning rate can be increased again if performance does not improve for a fixed number of training epochs
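A minimal sketch of the time-based schedule from the list above, with arbitrary illustrative values for the initial rate and decay; for the plateau-based variant, Keras also provides the ReduceLROnPlateau callback.

```python
def time_based_lr(initial_lrate, decay, iteration):
    """Time-based schedule: the rate shrinks as 1 / (1 + decay * iteration)."""
    return initial_lrate * (1.0 / (1.0 + decay * iteration))

# Illustrative values: initial rate 0.1, decay 0.01.
for it in (0, 10, 100, 1000):
    print(it, round(time_based_lr(0.1, 0.01, it), 5))
# 0 -> 0.1, 10 -> 0.09091, 100 -> 0.05, 1000 -> 0.00909
```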

momentum

Effect: accelerates training

[For an introduction to the method and its mathematical form, see 《梯度下降、随机梯度下降法、及其改进》]

  • momentum does not make it easier to configure the learning rate, as the step size is independent of the momentum. Instead, momentum can speed up the optimization process in concert with the step size, improving the likelihood that a better set of weights is discovered in fewer training epochs
  • the method of momentum is designed to accelerate learning, especially in the face of steep curvature, small but consistent gradients (flat regions), or noisy gradients
  • common values of the momentum factor used in practice include 0.5, 0.9, and 0.99
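In Keras, classical momentum is a single argument of the SGD optimizer. A minimal sketch using one of the common values above (the learning rate is illustrative, and `model` is assumed to be an already-built Keras model):

```python
from tensorflow.keras.optimizers import SGD

# SGD with classical momentum. The accumulated velocity reinforces consistent
# gradients (flat regions) and partially averages out noisy ones.
# `model` is assumed to be an already-built Keras model (placeholder).
model.compile(optimizer=SGD(learning_rate=0.01, momentum=0.9), loss="mse")
```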

adaptive learning rate

Effect: accelerates training and alleviates some of the pressure of choosing a learning rate and a learning rate schedule

[For an introduction to the methods and their mathematical forms, see 《梯度下降、随机梯度下降法、及其改进》]

  • AdaGrad, RMSProp, and Adam are three adaptive learning rate methods that have proven to be robust over many types of neural network architectures and problem types. Perhaps the most popular is Adam, as it builds upon RMSProp and adds momentum.
  • a robust strategy may be to first evaluate the performance of a model with a modern version of stochastic gradient descent with adaptive learning rates, such as Adam, and use the result as a baseline. Then, if time permits, explore whether improvements can be achieved with a carefully selected learning rate or simpler learning rate schedule.
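A minimal sketch of that baseline strategy (again assuming an already-built Keras `model`; the loss and metrics are illustrative):

```python
from tensorflow.keras.optimizers import Adam

# Adam with its Keras defaults (learning_rate=0.001) as the baseline optimizer.
# `model` is assumed to be an already-built Keras model (placeholder).
model.compile(optimizer=Adam(), loss="binary_crossentropy", metrics=["accuracy"])
```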

Dynamics of learning rate on model behaviour [5]

In the following plots, the x-axis denotes the epoch number and the y-axis denotes accuracy. Training accuracy is plotted in blue and test accuracy in orange.

fixed learning rate

Look out for $\alpha=0.001$. Although its behaviour looks very similar to $\alpha=0.01$ and $\alpha=0.1$, it may be subject to a local optimum.

momentum

Using momentum not only speeds up but also stabilises training.
Caution: momentum may create its own oscillations [6]

time-based decay
decrease learning rate on plateau

The training accuracy converges at the end, which does not happen when using time-based decay or an adaptive learning rate.

adaptive learning rate

Terminology

  • The discovery of a sub-optimal solution or of local optima is referred to as premature convergence.

Practical issues

  • Internally, when the (legacy) `decay` argument of its optimizers is set, Keras applies the time-based schedule above after every batch update; it is a misconception that this standard decay is applied once per epoch. Keep this in mind when using the default learning rate decay supplied with Keras [4].
  • As a result of the above, when using a time-based learning schedule, the change to the learning rate depends on the batch size. For example, with 500 training examples, the default batch size of 32 results in 16 updates per epoch, while a batch size of 64 results in 8 updates per epoch; a small sketch of this interaction follows.
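A small sketch of this interaction, applying the time-based formula for one epoch with arbitrary illustrative values for the initial rate and decay:

```python
import math

def updates_per_epoch(n_examples, batch_size):
    """Number of weight updates (i.e. batches) in one epoch."""
    return math.ceil(n_examples / batch_size)

initial_lr, decay, n_examples = 0.01, 1e-3, 500  # illustrative values
for batch_size in (32, 64):
    iterations = updates_per_epoch(n_examples, batch_size)   # 16 vs 8 updates
    lr_after_one_epoch = initial_lr * (1.0 / (1.0 + decay * iterations))
    print(batch_size, iterations, round(lr_after_one_epoch, 6))
# With the smaller batch size the learning rate has decayed further after one epoch.
```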

Acknowledgements: This article heavily builds on references [1,5].

References:

  1. How to Configure the Learning Rate When Training Deep Learning Neural Networks
    https://machinelearningmastery.com/learning-rate-for-deep-learning-neural-networks/
  2. Deep Learning, 2016
  3. Practical recommendations for gradient-based training of deep architectures, 2012
  4. Keras learning rate schedules and decay
    https://www.pyimagesearch.com/2019/07/22/keras-learning-rate-schedules-and-decay/
  5. Understand the Impact of Learning Rate on Neural Network Performance
    https://machinelearningmastery.com/understand-the-dynamics-of-learning-rate-on-deep-learning-neural-networks/
  6. Why Momentum Really Works
    https://distill.pub/2017/momentum/