基于caffe的量化模型训练与部署——训练篇

2023-05-16

为什么需要量化？

我们知道，cnn网络的前向计算瓶颈主要集中在卷积层，而卷积层的实质是大量的浮点数相乘、相加等运算操作，大量的浮点数计算限制了模型在低处理器或移动端等设备中的部署。如果能将浮点运算转换为整形运算，则cnn模型的前向处理速度将达到质的提升。

为什么量化有用？

关于深度神经网络的一个非常有意思的发现是：即使输入中包含大量的噪声，神经网络依然能处理的很好。深度网络的一个神奇特质是它们倾向于很好地应对输入中的高噪声。如果您考虑识别您刚拍摄的照片中的物体，网络必须忽略所有CCD噪音，光照变化以及它与之前看到的训练样例之间的其他非本质差异，并关注重要事项相似之处。这种能力意味着它们似乎将低精度计算视为另一种噪声源，并且即使使用包含较少信息的数字格式仍能产生准确的结果。

如何量化？

由于网络中主要的计算集中在卷积层，这里主要以卷积层的量化为例讲解。
量化需要完成的是将浮点型的输入值和网络参数值分别映射到一定范围内的整型数。假设我们要将网络参数和输入值量化为kbit，即对应的整型数值表述范围为 2 k 2^k 2k个。一种最常用的量化方法就是线性量化，即将一定范围内的浮点型数均匀量化成 2 k 2^k 2k个区间，每个区间对应一个整形数值，如下表为将[-1, 1]范围内的浮点数线性量化为2比特4个区间：

------------------------------
input         |        output
------------------------------
[-1.0,-0.5)   |          -2
[-0.5,0)      |          -1
[0,0.5)       |          0
[0.5,1.0]      |          1
------------------------------

均匀量化需要明确被量化的浮点型数值的范围，通过对卷积层的输入和参数特性分析发现，这一数值范围其实是可以大致确定的：

卷积层输入范围
对于卷积层，其输入为上一层的输出，通常包含两种情况，一种是输入图像，另一种是经过激活函数的前一层卷积结果，所以卷积层的输入值的取值范围通常为非负数，故对应的量化到[0, 2 k 2^k 2k-1]中。
I ∈ [ 0 , + ∞ ] I\in[0,+∞] I∈[0,+∞] ——> I q ∈ { 0 , 1 , 2 , . . . 2 k − 1 } Iq\in\{0,1,2,...2^k-1\} Iq∈{0,1,2,...2k−1}
卷积层参数范围
而通过对常见网络的卷积层参数的可视化发现，卷积层参数基本呈现以0为中心轴，对称分布的状态，且最大值和最小值的绝对值基本接近，如下图所示，因此可以将参数量化到[-( 2 k / 2 2^k/2 2k/2), 2 k / 2 − 1 2^k/2-1 2k/2−1]中。
W ∈ [ − m i n , + m a x ] W\in[-min,+max] W∈[−min,+max] ——> W q ∈ { − ( 2 k / 2 ) , − ( 2 k / 2 ) + 1 , . . . 2 k / 2 − 1 } Wq\in\{-(2^k/2),-(2^k/2)+1,...2^k/2-1\} Wq∈{−(2k/2),−(2k/2)+1,...2k/2−1}

通过统计大量的网络参数和输入的数值分布，在量化时我们可以确定输入和参数的大致正负边界（如网络参数基本落在[-1,1]区间，输入值基本落在[0,5]之间，对于大多数网络基本满足这一情况），对于落在边界外的少数数值，直接进行截断即可，这带来的噪声值是相对微弱的。

量化公式

我们知道，卷积层的前向计算公式可以表示为输入和卷积层参数的卷积运算，简化为：
O = I ∗ W O=I*W O=I∗W对于量化计算，我们希望找到一组对应的量化值 I q I^q Iq和 W q W^q Wq，使其满足：
O = I ∗ W ≈ α I q ∗ β W q = α β ( I q ∗ W q ) O=I*W≈\alpha I^q*\beta W^q=\alpha \beta (I^q*W^q) O=I∗W≈αIq∗βWq=αβ(Iq∗Wq)上式中， α \alpha α和 β \beta β是浮点型标量系数。
因此浮点型的卷积操作可以退化成对应量化值 I q I^q Iq和 W q W^q Wq的整型卷积操作，大大提升计算的效率。

卷积层输入量化
前面已经说过，我们可以采用最简单的线性均匀量化来确定上面所说的量化值，输入 I I I被量化到[0, 2 k 2^k 2k-1]区间，由均匀量化公式可得重建公式为：
I ≈ I ~ = [ I ‾ − I m i n Q s I ] ∗ Q s I + I m i n (1) I≈ \widetilde{I}=[\cfrac{\overline{I}-I_{min}}{Qs_I}]*Qs_I+I_{min} \tag 1 I≈I =[QsII−Imin]∗QsI+Imin(1) 式中， Q s I Qs_I QsI为量化区间步长,[.]为取整公式 Q s I = I m a x − I m i n 2 k − 1 Qs_I = \cfrac{I_{max}-I_{min}}{2^k-1} QsI=2k−1Imax−Imin I ‾ \overline{I} I为截断的输入值
I m i n i f I < I m i n I ‾ = { I o t h e r s I m a x i f I > I m a x \qquad \qquad \qquad I_{min} \qquad if \, I<I_{min} \\ \overline{I}=\{ I \qquad others \\ \qquad \qquad \qquad I_{max} \qquad if \, I>I_{max} IminifI<IminI={IothersImaxifI>Imax
特别的，上面提到，通常卷积层的输入为非负数，因此有 I m i n = 0 I_{min}=0 Imin=0，上面的重建公式简化为
I ≈ I ~ = [ I ‾ Q s I ] ∗ Q s I = I q ∗ Q s I I≈ \widetilde{I}=[\cfrac{\overline{I}}{Qs_I}]*Qs_I=I^q*Qs_I I≈I =[QsII]∗QsI=Iq∗QsI
则有，输入 I I I对应的量化值为
I q = [ I ‾ Q s I ] ， α = Q s I (2) I^q=[\cfrac{\overline{I}}{Qs_I}]，\alpha=Qs_I \tag 2 Iq=[QsII]，α=QsI(2)
卷积层参数量化
同样的，对于每个 C ∗ K ∗ K C*K*K C∗K∗K维的卷积核参数，被均匀量化到[-( 2 k / 2 2^k/2 2k/2), 2 k / 2 − 1 2^k/2-1 2k/2−1]区间中，其量化后到重建公式为：
W n ≈ W n ~ = [ W n ‾ Q s W n ] ∗ Q s W n (3) W_n≈\widetilde{W_n}=[\cfrac{\overline{W_n}}{Qs_Wn}]*Qs_Wn \tag 3 Wn≈Wn =[QsWnWn]∗QsWn(3) 式中， n ∈ { 1 , 2 , . . . N } n \in\{1,2,...N \} n∈{1,2,...N}，N为输出通道数。 Q s W n Qs_Wn QsWn为量化区间步长,[.]为取整公式 Q s W n = 2 ∗ W n m a x 2 k − 2 Qs_Wn = \cfrac{2*Wn_{max}}{2^k-2} QsWn=2k−22∗Wnmax W n m a x = ∣ W n ∣ m a x Wn_{max}=|W_n|_{max} Wnmax=∣Wn∣max , W n ‾ \overline{W_n} Wn为截断的参数值
− W n m a x i f W n < − W n m a x W n ‾ = { W n o t h e r s W n m a x i f W n > W n m a x \qquad \qquad \qquad \qquad \qquad-Wn_{max} \qquad if \, W_n<-Wn_{max} \\ \overline{W_n}=\{ W_n \qquad others \\ \qquad \qquad \qquad \qquad \qquad Wn_{max} \qquad if \, W_n>Wn_{max} −WnmaxifWn<−WnmaxWn={WnothersWnmaxifWn>Wnmax
参数 W n Wn Wn对应的量化值为
W n q = [ W n ‾ Q s W n ] ， β = Q s W n (4) W_n^q=[\cfrac{\overline{W_n}}{Qs_Wn}]，\beta=Qs_Wn \tag 4 Wnq=[QsWnWn]，β=QsWn(4)

训练算法

量化训练的卷积层前向和后向计算基本改动不大，唯一的差别是用量化重建后的输入值和参数值代替原始值做计算，从而在训练的过程中最小化量化引起的误差，

前向计算

：1.根据量化重建公式(1)将输入 I I I表示为量化后的重建值 I ~ \widetilde{I} I

：2. for each n in N:
根据量化重建公式(3)将卷积层参数 W n W_n Wn表示为量化后的重建值 W n ~ \widetilde{W_n} Wn

：3.用重建后的输入值和参数进行值卷积计算 O = I ~ ∗ W n ~ O=\widetilde{I}*\widetilde{W_n} O=I ∗Wn

反向传播

： 1.计算 ∂ L ∂ b i a s \cfrac{\partial L}{\partial bias} ∂bias∂L
： 2.通过 I ~ \widetilde{I} I 计算 ∂ L ∂ W ~ \cfrac{\partial L}{\partial\widetilde{W}} ∂W ∂L，同时将被截断的 W ~ \widetilde{W} W 对应的梯度置零，避免其梯度传播
： 3.通过 W ~ \widetilde{W} W 计算 ∂ L ∂ I ~ \cfrac{\partial L}{\partial\widetilde{I}} ∂I ∂L，同时将被截断的 I ~ \widetilde{I} I 对应的梯度置零，避免其梯度传播

在caffe中实现量化

新增加conv_quantized_layer.cu/cpp/hpp
主要函数：
量化输入值

template <typename Dtype>
__global__ void binarize_kernel(const Dtype *x, int n, float maxcutoff, float mincutoff, float quantStep, Dtype *binary)
{
    int i = (blockIdx.x + blockIdx.y*gridDim.x) * blockDim.x + threadIdx.x;
    if (i >= n) return;

      binary[i] = (x[i] >= maxcutoff) ? maxcutoff: ((x[i] < mincutoff) ? mincutoff : x[i]);
      Dtype temp = binary[i] - mincutoff;
      binary[i] = (round(temp / quantStep)) * quantStep + mincutoff;
}

template <typename Dtype>
void binarize_input_gpu(const Dtype *input, int n, int size, float maxcutoff, float mincutoff, int bit, Dtype *binary)
{
    //float maxcutoff = MAXCUTOFF_A;
    //float mincutoff = MINCUTOFF_A;
    float bitNum = pow(2,bit);
    float quantStep =  (maxcutoff - mincutoff) / (bitNum - 1);

      binarize_kernel<<<cuda_gridsize(n * size), BLOCK>>>(input, n * size, maxcutoff, mincutoff, quantStep, binary);
    
    check_error(cudaPeekAtLastError());
}

量化参数值

template <typename Dtype>
__global__ void binarize_weights_kernel(const Dtype *weights, int n, int size, float maxcutoff, float bitNum, Dtype *binary)
{
    int f = (blockIdx.x + blockIdx.y*gridDim.x) * blockDim.x + threadIdx.x;
    if (f >= n) return;
    int i = 0;

    Dtype weight_temp;
    Dtype weight_max = -1.0f;
    Dtype quantStep = 0.0f;
    //float cutoffNew = 0.0f;
    for(i = 0; i < size; ++i)
    {
        weight_temp = fabs(weights[f*size + i]);
        weight_max = (weight_temp > weight_max) ? weight_temp : weight_max;
    }
    weight_max = (weight_max > maxcutoff) ? maxcutoff : weight_max;
    quantStep = 2 * weight_max / (bitNum - 2);
    //cutoffNew = weight_max - quantStep;

    for(i = 0; i < size; ++i)
    {
       weight_temp = weights[f*size + i];
       weight_temp = (weight_temp >= weight_max) ? (weight_max): ((weight_temp < -weight_max) ? -weight_max : weight_temp);
       binary[f*size + i] = round(weight_temp / quantStep) * quantStep;
    }
}

template <typename Dtype>
void binarize_weights_gpu(const Dtype *weights, int n, int size, float maxcutoff, float mincutoff, int bit, Dtype *binary)//n:kernel number size:c*kh*kw
{
    //printf("BLOCK = %d\n",BLOCK);
    //float maxcutoff = MAXCUTOFF_W;
    //float mincutoff = MINCUTOFF_W;
    float bitNum = pow(2,bit);

     binarize_weights_kernel<<<cuda_gridsize(n), BLOCK>>>(weights, n, size, maxcutoff, bitNum, binary);
    
    check_error(cudaPeekAtLastError());
}

前向计算

template <typename Dtype>
void ConvQuantizedLayer<Dtype>::Forward_gpu(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top)
{


    //quantized weight
    const Dtype* weight = this->blobs_[0]->gpu_data();
    Dtype* xnor_weight = this->xnor_blobs_[0]->mutable_gpu_data();
    binarize_weights_gpu(weight, conv_out_channels_, kernel_dim_, MAXCUTOFF_W, MINCUTOFF_W, WBIT, xnor_weight);

    for (int i = 0; i < bottom.size(); ++i) {
        const Dtype* bottom_data = bottom[i]->gpu_data();

        //quantizing input
        Dtype* xnor_bottom_data = this->xnor_bottom_[i]->mutable_gpu_data();
        binarize_input_gpu(bottom_data, num_, bottom_dim_, MAXCUTOFF_A, MINCUTOFF_A, ABIT, xnor_bottom_data);

        Dtype* top_data = top[i]->mutable_gpu_data();
        for (int n = 0; n < this->num_; ++n) {
            this->forward_gpu_gemm(xnor_bottom_data + n * this->bottom_dim_,
                    xnor_weight, top_data + n * this->top_dim_);
            if (this->bias_term_) {
                const Dtype* bias = this->blobs_[1]->gpu_data();
                this->forward_gpu_bias(top_data + n * this->top_dim_, bias);
            }
        }
    }
}

后向传播

template <typename Dtype>
void ConvQuantizedLayer<Dtype>::Backward_gpu(const vector<Blob<Dtype>*>& top,
      const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom)
{
  const Dtype* weight = this->blobs_[0]->gpu_data();
  Dtype* weight_diff = this->blobs_[0]->mutable_gpu_diff();
  const Dtype* xnor_weight = this->xnor_blobs_[0]->gpu_data();

  for (int i = 0; i < top.size(); ++i)
  {
    const Dtype* top_diff = top[i]->gpu_diff();
    // d(loss)/d(bias)
    if (this->bias_term_ && this->param_propagate_down_[1])
    {
      Dtype* bias_diff = this->blobs_[1]->mutable_gpu_diff();
      for (int n = 0; n < this->num_; ++n)
      {
        this->backward_gpu_bias(bias_diff, top_diff + n * this->top_dim_);
      }
    }
    //d(loss)/d(weight) and d(loss)/d(input)
    if (this->param_propagate_down_[0] || propagate_down[i])
    {
      const Dtype* bottom_data = bottom[i]->gpu_data();
      Dtype* bottom_diff = bottom[i]->mutable_gpu_diff();

      const Dtype* xnor_bottom_data = this->xnor_bottom_[i]->gpu_data();

      for (int n = 0; n < this->num_; ++n)  //for each batchSize
      {
        // gradient w.r.t. weight. Note that we will accumulate diffs.
        if (this->param_propagate_down_[0])
        {
            //d(loss)/d(weight_xnor)
            this->weight_gpu_gemm(xnor_bottom_data + n * this->bottom_dim_,
                top_diff + n * this->top_dim_, weight_diff);

            //d(loss)/d(weight) cut diff
            gradient_array_ongpu_xnor(weight,conv_out_channels_*kernel_dim_,weight_diff,MAXCUTOFF_W,MINCUTOFF_W);

        }
        // gradient w.r.t. bottom data.
        if (propagate_down[i])
        {
            //d(loss)/d(input_xnor)
            this->backward_gpu_gemm(top_diff + n * this->top_dim_, xnor_weight,
                bottom_diff + n * this->bottom_dim_);

            //d(loss)/d(input)
            gradient_array_ongpu_xnor(bottom_data,num_*bottom_dim_,bottom_diff,MAXCUTOFF_A,MINCUTOFF_A);
        }
      }
    }
  }// top.size
}

本文内容由网友自发贡献，版权归原作者所有，本站不承担相应法律责任。如您发现有涉嫌抄袭侵权的内容，请联系:hwhale#tublm.com(使用前将#替换为@)

caffe

量化模型训练与部署