Background: the vanishing gradient problem
When training a DNN, the vanishing gradient problem means that the error computed at the output layer shrinks dramatically as it is propagated backwards, so the gradients in layers close to the input approach zero and the parameters of those layers receive essentially no updates; only the few layers near the output learn well. As the number of hidden layers is increased, the amount of error information propagated back to earlier layers is dramatically reduced: weights in hidden layers close to the output layer are updated normally, whereas weights in hidden layers close to the input layer are updated minimally or not at all. This traditionally made deep multi-layer neural networks hard to train.
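The effect is easy to reproduce numerically. The sketch below, assuming a toy 10-layer sigmoid network with small random weights (the depth, width, and weight scale are arbitrary choices, not from the source), pushes a unit error signal backwards and records the gradient norm seen by each layer:

```python
import numpy as np

rng = np.random.default_rng(0)
depth, width = 10, 32
Ws = [rng.normal(0.0, 0.1, (width, width)) for _ in range(depth)]

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Forward pass, keeping each layer's activations for backprop.
h = rng.normal(size=width)
acts = []
for W in Ws:
    h = sigmoid(W @ h)
    acts.append(h)

# Backward pass: propagate an error signal from the output towards the
# input and record the gradient norm arriving at each layer.
grad = np.ones(width)
norms = []
for W, a in zip(reversed(Ws), reversed(acts)):
    grad = W.T @ (grad * a * (1.0 - a))  # sigmoid'(z) = a * (1 - a)
    norms.append(np.linalg.norm(grad))

print(f"gradient norm near the output layer: {norms[0]:.3e}")
print(f"gradient norm near the input layer:  {norms[-1]:.3e}")
```

Because the sigmoid derivative is at most 0.25 and the weights are small, each layer multiplies the error norm by a factor well below 1, so the signal reaching the earliest layers is orders of magnitude smaller than the one at the output.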
To overcome this obstacle, an innovative method was introduced: greedy layer-wise pretraining, often simply referred to as "pretraining." Proposed by Hinton in 2006 (in the paper "A Fast Learning Algorithm for Deep Belief Nets"), it allowed deep networks to be trained well for the first time and was an important milestone in the resurgence of neural networks, initially enabling the development of deeper neural network models.
Layer-wise
The network is trained one layer at a time. Because each new layer only needs the output of the layer immediately before it, and the parameters of all earlier layers are kept fixed, you can always append another layer to an already-pretrained model and train just that layer.
Greedy
The technique is referred to as "greedy" because of the piecewise, layer-wise approach to solving the harder problem of training a deep network: as in greedy optimization generally, each sub-task is solved as well as possible on its own, even though this does not guarantee a globally optimal solution. One layer at a time is trained to completion. As an optimization process, dividing training into a succession of layer-wise training steps is a greedy shortcut that likely leads to an aggregate of locally optimal solutions, but to a good-enough global solution.
What assumption does pretraining rest on?
The assumption that a shallow model is always easier to train than a deep one. This is why the training of the whole network is decomposed into sub-tasks, training one layer at a time.
Benefits of pretraining
It serves as a weight-initialization scheme.
Main approaches to pretraining
Supervised greedy layer-wise pretraining
Broadly, supervised pretraining involves successively adding hidden layers to a model trained on a supervised learning task; pretraining is used to iteratively deepen a supervised model.
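The procedure can be sketched with a hand-rolled NumPy MLP on a toy regression task (the network sizes, learning rate, and step counts below are arbitrary choices for illustration, not from the source). Each round trains only the newest hidden layer plus a fresh output head, while all earlier layers stay frozen:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 8))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

def relu(x):
    return np.maximum(x, 0.0)

def encode(X, frozen):
    """Apply the already-pretrained (frozen) hidden layers."""
    h = X
    for W, b in frozen:
        h = relu(h @ W + b)
    return h

def train_new_layer(H, y, width=16, steps=1000, lr=0.01):
    """Train one fresh hidden layer plus a linear output head on the
    frozen features H, using plain gradient descent on the MSE."""
    n, d = H.shape
    W = rng.normal(0.0, 0.5, (d, width)); b = np.zeros(width)
    v = rng.normal(0.0, 0.5, width); c = 0.0
    for _ in range(steps):
        Z = H @ W + b
        A = relu(Z)
        err = A @ v + c - y              # d(MSE)/d(prediction), up to 2/n
        dA = np.outer(err, v) * (Z > 0)  # backprop through the ReLU
        W -= lr * (H.T @ dA) / n; b -= lr * dA.mean(axis=0)
        v -= lr * (A.T @ err) / n; c -= lr * err.mean()
    return (W, b), v, c

# Greedy loop: add one hidden layer at a time; earlier layers stay fixed,
# so each round only ever trains the newest layer and the output head.
frozen, losses = [], []
for _ in range(3):
    H = encode(X, frozen)
    layer, v, c = train_new_layer(H, y)
    frozen.append(layer)                 # freeze the newly trained layer
    pred = relu(H @ layer[0] + layer[1]) @ v + c
    losses.append(float(np.mean((pred - y) ** 2)))

print("MSE after adding each layer:", [round(l, 3) for l in losses])
```

Note that each new layer's training problem is shallow (one hidden layer deep), which is exactly the assumption pretraining rests on: shallow models are easier to train than deep ones.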
Unsupervised greedy layer-wise pretraining
Unsupervised pretraining involves using the greedy layer-wise process to build up an unsupervised autoencoder model, to which a supervised output layer is later added, most commonly a softmax classification layer. Since the autoencoder is in effect a feature extractor, and its last hidden layer holds the final unsupervised features, any other machine-learning model (an SVM, KNN, logistic regression, etc.) can also be attached on top. Here, pretraining is used to iteratively deepen an unsupervised model that can then be repurposed as a supervised model.
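A compact sketch of the stacking idea, using closed-form linear autoencoders (the optimal linear encoder is given by the top principal directions via SVD) as a stand-in for gradient-training nonlinear autoencoder layers; the data and layer widths are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 20))           # stand-in for unlabeled data

def train_autoencoder_layer(H, k):
    """Closed-form linear autoencoder: the optimal k-unit encoder is
    spanned by the top-k principal directions of the (centered) input.
    This stands in for gradient-training a nonlinear autoencoder layer."""
    Hc = H - H.mean(axis=0)
    _, _, Vt = np.linalg.svd(Hc, full_matrices=False)
    return Vt[:k].T                      # encoder weights, d -> k

# Greedy stacking: each new layer is trained to reconstruct the output
# of the frozen layers below it.
encoders, H = [], X
for k in (12, 6, 3):
    W = train_autoencoder_layer(H, k)
    encoders.append(W)
    H = (H - H.mean(axis=0)) @ W         # frozen encoding feeds the next layer

print("encoder shapes:", [W.shape for W in encoders])
print("final feature shape:", H.shape)   # would feed a softmax output layer
```

The final 3-dimensional features are what a supervised output layer (or an SVM, KNN, etc.) would be trained on.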
It is also very common to fine-tune the whole network in a supervised way: after unsupervised greedy layer-wise pretraining finishes, a softmax layer is added and then all parameters of the network are fine-tuned with supervision. Although the weights in prior layers are held constant during pretraining, it is common to fine-tune all weights in the network at the end, after the addition of the final layer. Unsupervised layer-wise pretraining thus merely provides the DNN with an initialization (and a form of regularization), and the resulting network is still an ordinary DNN; this is why pretraining can be considered a type of weight-initialization method.
Unsupervised pretraining may be appropriate when you have a significantly larger number of unlabeled examples than labeled ones: the unlabeled data is used to initialize the model weights before a much smaller number of labeled examples is used to fine-tune the model for a supervised task.
The approach can be useful on some problems; for example, it is best practice to use unsupervised pretraining for text data in order to provide a richer distributed representation of words and their interrelationships, e.g. via word2vec.
Outside natural language processing, however, most fields have abandoned unsupervised pretraining, because it only succeeds when large amounts of unlabeled data are available to learn good representations, which are then fine-tuned.
No longer needed
Although unsupervised pretraining was the first successful answer to the training problems caused by vanishing gradients, better methods now exist: modern weight-initialization schemes, alternative activation functions, variants of gradient descent, regularization, and so on are enough to train DNNs successfully and achieve better results. Unsupervised pretraining is therefore no longer needed.
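To illustrate one of these modern alternatives, the sketch below pairs ReLU activations with He initialization (weight std of sqrt(2 / fan_in)) in the same kind of toy 10-layer network used above; all sizes are arbitrary illustrative choices. With this combination, the backpropagated gradient norm stays on a comparable scale from the output to the input instead of collapsing towards zero:

```python
import numpy as np

rng = np.random.default_rng(3)
depth, width = 10, 32
# He initialization: std = sqrt(2 / fan_in), designed to pair with ReLU.
Ws = [rng.normal(0.0, np.sqrt(2.0 / width), (width, width))
      for _ in range(depth)]

# Forward pass, remembering which units fired at each layer.
h = rng.normal(size=width)
masks = []
for W in Ws:
    z = W @ h
    masks.append(z > 0)
    h = np.maximum(z, 0.0)

# Backward pass: the ReLU derivative is just the 0/1 firing mask.
grad = np.ones(width)
norms = []
for W, m in zip(reversed(Ws), reversed(masks)):
    grad = W.T @ (grad * m)
    norms.append(np.linalg.norm(grad))

print(f"near the output: {norms[0]:.2e}, near the input: {norms[-1]:.2e}")
```

Compared with the sigmoid network earlier, the gradient signal reaching the first layer is no longer orders of magnitude smaller, which is one reason these simpler tools displaced pretraining.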