Unsupervised Pre-training & Supervised Pre-training



Unsupervised pre-training

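The unsupervised pre-training strategy is mainly used in the "complex task + small labeled dataset" setting, that is, when there is not enough labeled training data to support training the model directly. The technique was introduced by Hinton's group in 2006: A Fast Learning Algorithm for Deep Belief Nets.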

In SGD optimization, one typically initializes the model weights at random and moves toward a minimum of the cost by following the negative gradient of the objective function. For deep nets, this did not show much success, which is believed to be a result of the extremely non-convex (and high-dimensional) nature of their objective function.
What Y. Bengio and others (original link) found was that, instead of starting from random weights and hoping that SGD will take you to a minimum of such a rugged landscape, you can pre-train each layer like an autoencoder. Here is how it works: you build an autoencoder with the first layer as the encoder and its transpose as the decoder, and you train it unsupervised, that is, you train it to reconstruct its input (see Autoencoder; autoencoders are well suited to unsupervised feature extraction). Once it is trained, you fix the weights of that layer to the values you just found. Then you move to the next layer and repeat the same procedure, feeding it the output of the layers already trained, until all layers of the deep net are pre-trained (a greedy approach). At this point, you go back to the original problem you wanted to solve with the deep net (classification or regression) and optimize it with SGD, but starting from the weights learned during pre-training.
They found that this gives much better results. I think no one knows exactly why this works, but the idea is that pre-training lets you start from a more favorable region of parameter space.
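The greedy layer-wise procedure described above can be sketched in NumPy. This is only a minimal illustration under several assumptions that are not in the quoted text: sigmoid units, a squared-error reconstruction loss, tied encoder/decoder weights, full-batch gradient descent in place of SGD, and a small synthetic dataset. The function names (pretrain_layer, greedy_pretrain, finetune) are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pretrain_layer(X, n_hidden, epochs=300, lr=1.0, seed=0):
    """Train one tied-weight autoencoder on X; return (W, b, losses)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, 0.1, size=(X.shape[1], n_hidden))
    b = np.zeros(n_hidden)       # encoder bias
    c = np.zeros(X.shape[1])     # decoder bias (discarded after pre-training)
    losses = []
    for _ in range(epochs):
        H = sigmoid(X @ W + b)       # encode
        R = sigmoid(H @ W.T + c)     # decode with the transpose of W
        losses.append(float(np.mean((R - X) ** 2)))
        dZ2 = 2.0 * (R - X) / X.size * R * (1.0 - R)   # grad at decoder pre-activation
        dZ1 = (dZ2 @ W) * H * (1.0 - H)                # grad at encoder pre-activation
        W -= lr * (X.T @ dZ1 + dZ2.T @ H)              # W is shared by both halves
        b -= lr * dZ1.sum(axis=0)
        c -= lr * dZ2.sum(axis=0)
    return W, b, losses

def greedy_pretrain(X, layer_sizes, **kw):
    """Pre-train layers one at a time on the frozen output of the previous ones."""
    params, curves, inp = [], [], X
    for i, n_hidden in enumerate(layer_sizes):
        W, b, losses = pretrain_layer(inp, n_hidden, seed=i, **kw)
        params.append((W, b))
        curves.append(losses)
        inp = sigmoid(inp @ W + b)   # fixed features become the next layer's input
    return params, curves

def finetune(X, y, params, epochs=300, lr=0.5, seed=0):
    """Supervised fine-tuning of the whole net, starting from pre-trained weights."""
    rng = np.random.default_rng(seed)
    Ws = [W.copy() for W, _ in params]
    bs = [b.copy() for _, b in params]
    v = rng.normal(0.0, 0.1, size=Ws[-1].shape[1])   # new logistic output layer
    v0 = 0.0
    losses = []
    for _ in range(epochs):
        acts = [X]
        for W, b in zip(Ws, bs):
            acts.append(sigmoid(acts[-1] @ W + b))
        p = sigmoid(acts[-1] @ v + v0)
        losses.append(float(-np.mean(y * np.log(p + 1e-12)
                                     + (1 - y) * np.log(1 - p + 1e-12))))
        d_out = (p - y) / len(y)                     # grad at the output logit
        delta = np.outer(d_out, v) * acts[-1] * (1 - acts[-1])
        v -= lr * (acts[-1].T @ d_out)
        v0 -= lr * d_out.sum()
        for i in reversed(range(len(Ws))):
            gW, gb = acts[i].T @ delta, delta.sum(axis=0)
            if i > 0:   # propagate to the layer below before updating Ws[i]
                delta = (delta @ Ws[i].T) * acts[i] * (1 - acts[i])
            Ws[i] -= lr * gW
            bs[i] -= lr * gb
    return Ws, bs, v, v0, losses

# Synthetic data (an assumption for illustration): 8 features driven by 2 latent factors.
rng = np.random.default_rng(42)
latent = rng.random((64, 2))
X = sigmoid(latent @ rng.normal(size=(2, 8)) * 3.0)
y = (latent[:, 0] > 0.5).astype(float)

params, curves = greedy_pretrain(X, [4, 3])          # unsupervised: y is not used
Ws, bs, v, v0, ft_losses = finetune(X, y, params)    # supervised: starts from params
```

Note that the labels y are only touched in finetune; the pre-training stage sees X alone, which is why the approach suits settings with few labels.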



Supervised pre-training


Using these tools, we start with the architecture of (Krizhevsky et al., 2012) and explore different architectures, discovering ones that outperform their results on ImageNet. We then explore the generalization ability of the model to other datasets, just retraining the softmax classifier on top. As such, this is a form of supervised pre-training, which contrasts with the unsupervised pre-training methods popularized by (Hinton et al., 2006) and others (Bengio et al., 2007; Vincent et al., 2008). The generalization ability of convnet features is also explored in concurrent work by (Donahue et al., 2013).
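The idea of "just retraining the softmax classifier on top" can be illustrated with a toy NumPy sketch. It is not the convnet setup from the quoted paper: the one-hidden-layer tanh network, the synthetic source and target tasks, full-batch gradient descent, and the function names (pretrain_supervised, features, train_softmax) are all assumptions made for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def pretrain_supervised(X, y, k, n_hidden=8, epochs=300, lr=0.2, seed=0):
    """Train a one-hidden-layer net end to end on a labeled source task.

    Only the hidden layer (W, b) is returned; the source classifier is discarded.
    """
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, 0.5, size=(X.shape[1], n_hidden))
    b = np.zeros(n_hidden)
    V = rng.normal(0.0, 0.1, size=(n_hidden, k))
    c = np.zeros(k)
    Y = np.eye(k)[y]                        # one-hot labels
    for _ in range(epochs):
        H = np.tanh(X @ W + b)
        P = softmax(H @ V + c)
        G = (P - Y) / len(y)                # grad of mean cross-entropy at the logits
        dH = (G @ V.T) * (1.0 - H ** 2)     # backprop through tanh
        V -= lr * (H.T @ G)
        c -= lr * G.sum(axis=0)
        W -= lr * (X.T @ dH)
        b -= lr * dH.sum(axis=0)
    return W, b

def features(X, W, b):
    """Frozen feature extractor: the pre-trained hidden layer, never updated again."""
    return np.tanh(X @ W + b)

def train_softmax(F, y, k, epochs=300, lr=0.2):
    """Fit only a new softmax classifier on top of fixed features F."""
    V = np.zeros((F.shape[1], k))
    c = np.zeros(k)
    Y = np.eye(k)[y]
    losses = []
    for _ in range(epochs):
        P = softmax(F @ V + c)
        losses.append(float(-np.mean(np.log(P[np.arange(len(y)), y] + 1e-12))))
        G = (P - Y) / len(y)
        V -= lr * (F.T @ G)
        c -= lr * G.sum(axis=0)
    return V, c, losses

rng = np.random.default_rng(0)

# Source task (large, labeled): 3 classes defined by the signs of two inputs.
X_src = rng.normal(size=(200, 4))
y_src = (X_src[:, 0] > 0).astype(int) + (X_src[:, 1] > 0).astype(int)
W, b = pretrain_supervised(X_src, y_src, k=3)

# Target task (small, different labels): only the softmax layer is retrained.
X_tgt = rng.normal(size=(60, 4))
y_tgt = (X_tgt[:, 0] + X_tgt[:, 1] > 0).astype(int)
V, c, losses = train_softmax(features(X_tgt, W, b), y_tgt, k=2)
```

In contrast to the unsupervised case, the pre-training stage here uses labels (y_src); what transfers to the target dataset is the hidden-layer representation, while the classifier on top is learned from scratch.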




