A Simple Framework for Contrastive Learning of Visual Representations [Paper Notes] SimCLR


We simplify recently proposed contrastive self-supervised learning algorithms without requiring
specialized architectures or a memory bank.

  1. Composition of data augmentations plays a critical role in defining effective predictive tasks.
  2. Introducing a learnable nonlinear transformation between the representation and the contrastive loss substantially improves the quality of the learned representations.
  3. Contrastive learning benefits from larger batch sizes and more training steps compared to supervised learning.

Most mainstream approaches fall into one of two classes: generative or discriminative.

Generative approaches learn to generate or otherwise model pixels in the input space.

Discriminative approaches learn representations using objective functions similar to those used for supervised learning, but train networks to perform pretext tasks where both the inputs and labels are derived from an unlabeled dataset.

Many such approaches have relied on heuristics to design pretext tasks, which could limit the generality of the learned representations.

Discriminative approaches based on contrastive learning in the latent space have recently shown great promise, achieving state-of-the-art results.

Composition of multiple data augmentation operations is crucial in defining the contrastive prediction tasks that yield effective representations. In addition, unsupervised contrastive learning benefits from stronger data augmentation than supervised learning.

Introducing a learnable nonlinear transformation between the representation and the contrastive loss substantially improves the quality of the learned representations.

Representation learning with contrastive cross entropy loss benefits from normalized embeddings and an appropriately adjusted temperature parameter.
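
To make the role of normalization and temperature concrete, here is a minimal sketch in plain Python. It is not code from the paper: the function names, the example similarity values, and the two temperatures are illustrative assumptions. It shows that L2-normalizing embeddings bounds similarities to [-1, 1], and that dividing by a small temperature sharpens the softmax over those similarities.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def softmax_with_temperature(scores, tau):
    """Softmax over scores / tau, computed stably by subtracting the max."""
    scaled = [s / tau for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical cosine similarities between one anchor and three candidates.
sims = [0.9, 0.5, 0.1]

# A low temperature concentrates probability mass on the most similar candidate;
# a high temperature spreads it out. Both values of tau are illustrative only.
sharp = softmax_with_temperature(sims, tau=0.1)
soft = softmax_with_temperature(sims, tau=1.0)
```

With these made-up numbers, the first entry of `sharp` is much closer to 1 than the first entry of `soft`, which is one way to see why the temperature has to be tuned rather than left at 1.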

Contrastive learning benefits from larger batch sizes and longer training compared to its supervised counterpart. Like supervised learning, contrastive learning benefits from deeper and wider networks.

Inspired by recent contrastive learning algorithms (see Section 7 for an overview), SimCLR learns representations by maximizing agreement between differently augmented views of the same data example via a contrastive loss in the latent space.

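  • A stochastic data augmentation module that randomly transforms any given data example, producing two correlated views of the same example, denoted x̃i and x̃j, which we consider a positive pair. We sequentially apply three simple augmentations: random cropping followed by resize back to the original size, random color distortions, and random Gaussian blur.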

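  • A neural network base encoder f(·) that extracts representation vectors from augmented data examples. The framework allows various choices of network architecture without any constraints. We opt for simplicity and adopt the commonly used ResNet to obtain hi = f(x̃i) = ResNet(x̃i), where hi ∈ R^d is the output after the average pooling layer.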

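  • A small neural network projection head g(·) that maps representations to the space where the contrastive loss is applied, giving zi = g(hi).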

    
  
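  • A contrastive loss function, defined for a contrastive prediction task on pairs of augmented examples derived from a minibatch of N examples, resulting in 2N data points.
    We do not sample negative examples explicitly. Instead, given a positive pair, we treat the other 2(N-1) augmented examples in the minibatch as negative examples. The loss function for a positive pair of examples (i, j) is defined as:

    $$\ell_{i,j} = -\log \frac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(\mathrm{sim}(z_i, z_k)/\tau)}$$

    Here sim(u, v) is the cosine similarity between u and v, z is the output of the projection head, the indicator 1[k ≠ i] equals 1 when k ≠ i and 0 otherwise, and τ denotes a temperature parameter.
    The final loss is computed across all positive pairs, both (i, j) and (j, i), in the minibatch. It is called NT-Xent (the normalized temperature-scaled cross entropy loss).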
    Training with large batch size may be unstable when using standard SGD/Momentum with linear learning rate scaling. To stabilize the training, we use the LARS optimizer (You et al., 2017) for all batch sizes.
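
The contrastive loss described in the list above can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: the function names, the default temperature of 0.5, and the convention that adjacent entries `z[2k]` and `z[2k+1]` form a positive pair are all assumptions made for the example. A real implementation would use a tensor library and vectorized operations.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity: dot product of u and v divided by their norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nt_xent_loss(z, tau=0.5):
    """Average NT-Xent loss over all 2N anchors.

    z is a list of 2N projection vectors. Assumed layout (illustrative):
    z[2k] and z[2k+1] are the two augmented views of example k.
    """
    n = len(z)
    if n < 2 or n % 2 != 0:
        raise ValueError("expected an even number of vectors (2N)")
    total = 0.0
    for i in range(n):
        j = i + 1 if i % 2 == 0 else i - 1  # index of the positive partner
        # Similarities to every other point; the indicator 1[k != i] drops k == i.
        logits = [cosine_similarity(z[i], z[k]) / tau for k in range(n) if k != i]
        # Stable log-sum-exp for the denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        positive = cosine_similarity(z[i], z[j]) / tau
        total += log_denom - positive  # equals -log(exp(pos) / sum(exp(all)))
    return total / n

# Toy 2-D projections with N = 2 (hypothetical values).
aligned = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]     # views of each example agree
misaligned = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]  # views of each example disagree
loss_aligned = nt_xent_loss(aligned)
loss_misaligned = nt_xent_loss(misaligned)
```

In this toy setup the loss is lower when the two views of each example map to the same direction, which is the "agreement" the objective maximizes.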



We use ResNet-50 as the base encoder network, and a 2-layer MLP projection head to project the representation to a 128-dimensional latent space. As the loss, we use NT-Xent, optimized using LARS with learning rate of 4.8 (= 0.3 × BatchSize/256) and weight decay of 10^-6. We train at batch size 4096 for 100 epochs. We use linear warmup for the first 10 epochs, and decay the learning rate with the cosine decay schedule without restarts.
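
The learning rate schedule described above can be sketched as a single function. This is an illustrative reading of the text, not the authors' code: the function name, the per-step granularity, warming up from exactly 0, decaying to exactly 0, and the value of `steps_per_epoch` are assumptions.

```python
import math

def learning_rate(step, steps_per_epoch, total_epochs=100, warmup_epochs=10,
                  batch_size=4096, base_lr=0.3):
    """Linear warmup followed by cosine decay without restarts."""
    # Linear scaling rule from the text: 0.3 * 4096 / 256 = 4.8.
    peak_lr = base_lr * batch_size / 256
    warmup_steps = warmup_epochs * steps_per_epoch
    total_steps = total_epochs * steps_per_epoch
    if step < warmup_steps:
        # Assumption: warmup ramps linearly from 0 up to the peak rate.
        return peak_lr * step / warmup_steps
    # Assumption: a single cosine half-period from the peak rate down to 0.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

# Hypothetical number of optimizer steps per epoch, for illustration only.
STEPS_PER_EPOCH = 313
lr_start = learning_rate(0, STEPS_PER_EPOCH)
lr_peak = learning_rate(10 * STEPS_PER_EPOCH, STEPS_PER_EPOCH)
lr_end = learning_rate(100 * STEPS_PER_EPOCH, STEPS_PER_EPOCH)
```

Under these assumptions the rate rises from 0 to the peak of 4.8 at the end of epoch 10, then falls smoothly toward 0 by the end of epoch 100.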





What are pretext tasks?

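A pretext task is also described as a "preliminary task" or "proxy task", and the term "surrogate task" is sometimes used instead. Its purpose is to learn a good representation.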

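  • The pretext task is not the task we actually care about; solving it is not the real goal of the training.
  • Even so, training on it helps the real task and leads to better results.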

The strength of this simple framework suggests that, despite a recent surge in interest, self-supervised learning remains undervalued.


