Caffe Source Code - BatchNormLayer and ScaleLayer

Author: AIHGF
Published: May 21, 2018

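In BVLC Caffe, batch normalization (BN) is implemented by two layers: a BatchNormLayer followed by a ScaleLayer.

Batch normalization consists of two main computations:

[1] Normalize the input: ${ x_{norm} = \frac{x- \mu}{\sqrt{\sigma + eps}} }$, where $\mu$ and $\sigma$ are the computed mean and variance. This corresponds to the Caffe BatchNorm layer.
[2] Scale and shift the normalized value to get the output ${ y = \gamma \cdot x_{norm} + \beta }$. This corresponds to the Caffe Scale layer; setting bias_term: true on the Scale layer enables ${ \beta }$.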

<h2>1. Batch Normalization Overview</h2>

The Batch Normalization paper defines both a forward and a backward computation.

[Figures: forward computation; backward computation]

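When a Caffe BatchNorm layer is trained, each forward pass processes one mini-batch drawn from the full dataset, and every pass folds its statistics into the mean and variance accumulated so far.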

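In the forward pass, Caffe does not simply add up the mean and variance of every mini-batch. Instead, a factor (normally less than 1) gradually reduces the contribution of the previously accumulated mean and variance before the mean and variance of the current mini-batch are added. This is a moving average. The notation is:

${ S_{t-1} }$: the mean accumulated up to the previous mini-batch;
${ Y_{t} }$: the mean of the current mini-batch;
${ \lambda }$: the moving-average factor, moving_average_fraction.

The forward pass performs three updates:

[F1] scale factor: ${ s_{new} = \lambda s_{old} + 1 }$
[F2] mean: ${ \mu _{new} = \lambda \mu _{old} + \mu }$
[F3] variance: ${ \sigma _{new} = \lambda \sigma _{old} + \frac{m}{m-1} \sigma }$ when ${ m > 1 }$ (otherwise the factor is 1), where $m$ is the number of values per channel in the mini-batch

The Caffe BatchNorm source does not include the parameters ${ \gamma }$ and ${ \beta }$.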
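Steps [F1] to [F3] can be sketched in a few lines of C++. This is only an illustration that assumes plain std::vector storage instead of Caffe blobs; RunningStats, Accumulate, GlobalMean and GlobalVar are hypothetical names, not Caffe API. The m/(m-1) factor follows the bias_correction_factor line of Forward_cpu in Section 5, and the read-out mirrors the use_global_stats branch there.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper type: mirrors the three BatchNorm blobs of one layer.
struct RunningStats {
  std::vector<double> mean_sum;  // like blobs_[0]: accumulated, unnormalized mean
  std::vector<double> var_sum;   // like blobs_[1]: accumulated, unnormalized variance
  double scale = 0.0;            // like blobs_[2]: accumulated scale factor s

  explicit RunningStats(std::size_t channels)
      : mean_sum(channels, 0.0), var_sum(channels, 0.0) {}
};

// One training-time update following [F1]-[F3].
// m is the number of values per channel in the mini-batch.
void Accumulate(RunningStats& s, const std::vector<double>& batch_mean,
                const std::vector<double>& batch_var, double lambda, int m) {
  assert(batch_mean.size() == s.mean_sum.size());
  assert(batch_var.size() == s.var_sum.size());
  const double bias_correction =
      m > 1 ? static_cast<double>(m) / (m - 1) : 1.0;
  s.scale = lambda * s.scale + 1.0;  // [F1]
  for (std::size_t c = 0; c < s.mean_sum.size(); ++c) {
    s.mean_sum[c] = lambda * s.mean_sum[c] + batch_mean[c];  // [F2]
    s.var_sum[c] =
        lambda * s.var_sum[c] + bias_correction * batch_var[c];  // [F3]
  }
}

// Test-time read-out: divide the accumulated sums by the scale factor,
// or return 0 if nothing has been accumulated yet.
double GlobalMean(const RunningStats& s, std::size_t c) {
  return s.scale == 0.0 ? 0.0 : s.mean_sum[c] / s.scale;
}

double GlobalVar(const RunningStats& s, std::size_t c) {
  return s.scale == 0.0 ? 0.0 : s.var_sum[c] / s.scale;
}
```

With lambda = 0.5, accumulating batch means 2 and then 4 gives a scale factor of 1.5 and a global mean of 5 / 1.5, so the more recent batch carries twice the weight of the older one.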
The backward pass computes the gradient with respect to the input; the parameters $\gamma$ and $\beta$ are not involved.

Gradient with respect to the variance:
$\frac{\partial L}{\partial \sigma} = \sum _{j=1}^{n} \frac{\partial L}{\partial y_j} \cdot \frac{\partial y_j}{\partial \sigma} = \sum _{j=1}^{n} \frac{\partial L}{\partial y_j} \cdot (x_j- \mu)(-\frac{1}{2})(\sigma + eps)^{-\frac{3}{2}}$

Gradient with respect to the mean:
$\frac{\partial L}{\partial \mu} = \sum _{j=1}^{n} \frac{\partial L}{\partial y_j} \cdot \frac{\partial y_j}{\partial \mu} = \sum _{j=1}^{n} \frac{\partial L}{\partial y_j} \cdot \frac{-1}{\sqrt {\sigma + eps}} $

Gradient with respect to the input $x_i$:
$\frac{\partial L}{\partial x_i} = \frac{\partial L}{\partial y_i} \frac{1}{\sqrt{\sigma+eps}} + \frac{\partial L}{\partial \sigma} \frac{\partial \sigma}{\partial x_i} + \frac{\partial L}{\partial \mu} \frac{\partial \mu}{\partial x_i} $
$\ \ \ \ \ = \frac{\partial L}{\partial y_i} \frac{1}{\sqrt{\sigma + eps}} + \frac{\partial L}{\partial \sigma} \frac{2}{n} (x_i- \mu) + \frac{\partial L}{\partial \mu} \frac{1}{n} $
$\ \ \ \ \ = \frac{1}{\sqrt{\sigma + eps}} (\frac{\partial L}{\partial y_i}- \frac{1}{n}\sum_{j=1}^n \frac{\partial L}{\partial y_j}- (\frac{1}{n} \sum_{j=1}^n \frac{\partial L}{\partial y_j} y_j) y_i ) $
$\ \ \ \ \ = \frac{1}{\sqrt{\sigma + eps}} (\frac{\partial L}{\partial y_i}- mean(\frac{\partial L}{\partial y})- mean(\frac{\partial L}{\partial y} \cdot y) \cdot y_i)$
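The closed form for the input gradient can be tried out on a single channel. The C++ sketch below is an illustration under simplifying assumptions (one channel, double precision, no Caffe types); Moments, ComputeMoments, NormalizeChannel and BackwardChannel are hypothetical names, and dy stands for $\frac{\partial L}{\partial y}$.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Mean and (biased) variance of one channel.
struct Moments {
  double mean;
  double var;
};

Moments ComputeMoments(const std::vector<double>& x) {
  assert(!x.empty());
  const double n = static_cast<double>(x.size());
  double mean = 0.0;
  for (double v : x) mean += v;
  mean /= n;
  double var = 0.0;
  for (double v : x) var += (v - mean) * (v - mean);
  var /= n;
  return {mean, var};
}

// Forward: y_i = (x_i - mean) / sqrt(var + eps).
std::vector<double> NormalizeChannel(const std::vector<double>& x, double eps) {
  const Moments m = ComputeMoments(x);
  const double inv_std = 1.0 / std::sqrt(m.var + eps);
  std::vector<double> y(x.size());
  for (std::size_t i = 0; i < x.size(); ++i) y[i] = (x[i] - m.mean) * inv_std;
  return y;
}

// Backward: dx_i = (dy_i - mean(dy) - mean(dy * y) * y_i) / sqrt(var + eps).
std::vector<double> BackwardChannel(const std::vector<double>& x,
                                    const std::vector<double>& dy,
                                    double eps) {
  assert(x.size() == dy.size());
  const Moments m = ComputeMoments(x);
  const double inv_std = 1.0 / std::sqrt(m.var + eps);
  const std::vector<double> y = NormalizeChannel(x, eps);
  const double n = static_cast<double>(x.size());
  double mean_dy = 0.0;
  double mean_dy_y = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i) {
    mean_dy += dy[i] / n;
    mean_dy_y += dy[i] * y[i] / n;
  }
  std::vector<double> dx(x.size());
  for (std::size_t i = 0; i < x.size(); ++i) {
    dx[i] = (dy[i] - mean_dy - mean_dy_y * y[i]) * inv_std;
  }
  return dx;
}
```

A quick way to gain confidence in the formula is to compare BackwardChannel against a central finite difference of a loss such as $L = \sum_i w_i y_i$, using $w$ as dy.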

The Caffe Scale layer handles the parameters $\gamma$ and $\beta$ (both are vectors).

Forward: ${ top = \gamma \cdot bottom + \beta }$, that is, ${ y = \gamma \cdot x + \beta }$
Backward: ${ \frac{\partial y}{\partial x} = \gamma }$, ${ \frac{\partial y}{\partial \gamma} = x }$, ${ \frac{\partial y}{\partial \beta} = 1 }$
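For the per-channel case used after BatchNorm, the forward and backward rules can be sketched as follows. This assumes data laid out as num x channels x spatial in a flat std::vector; ScaleGrads, ScaleForward and ScaleBackward are hypothetical names for illustration and do not reproduce the actual ScaleLayer code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Gradients produced by the backward pass of this sketch.
struct ScaleGrads {
  std::vector<double> dx;      // dL/dx, same size as x
  std::vector<double> dgamma;  // dL/dgamma, one value per channel
  std::vector<double> dbeta;   // dL/dbeta, one value per channel
};

// Forward: y = gamma[c] * x + beta[c] for every element of channel c.
std::vector<double> ScaleForward(const std::vector<double>& x,
                                 const std::vector<double>& gamma,
                                 const std::vector<double>& beta,
                                 int num, int channels, int spatial) {
  assert(x.size() == static_cast<std::size_t>(num) * channels * spatial);
  assert(gamma.size() == static_cast<std::size_t>(channels));
  assert(beta.size() == gamma.size());
  std::vector<double> y(x.size());
  for (int n = 0; n < num; ++n) {
    for (int c = 0; c < channels; ++c) {
      for (int s = 0; s < spatial; ++s) {
        const std::size_t i =
            (static_cast<std::size_t>(n) * channels + c) * spatial + s;
        y[i] = gamma[c] * x[i] + beta[c];
      }
    }
  }
  return y;
}

// Backward: dx = gamma * dy, dgamma = sum(dy * x), dbeta = sum(dy),
// where the sums run over all elements of the same channel.
ScaleGrads ScaleBackward(const std::vector<double>& x,
                         const std::vector<double>& gamma,
                         const std::vector<double>& dy,
                         int num, int channels, int spatial) {
  assert(x.size() == dy.size());
  assert(gamma.size() == static_cast<std::size_t>(channels));
  ScaleGrads g;
  g.dx.assign(x.size(), 0.0);
  g.dgamma.assign(static_cast<std::size_t>(channels), 0.0);
  g.dbeta.assign(static_cast<std::size_t>(channels), 0.0);
  for (int n = 0; n < num; ++n) {
    for (int c = 0; c < channels; ++c) {
      for (int s = 0; s < spatial; ++s) {
        const std::size_t i =
            (static_cast<std::size_t>(n) * channels + c) * spatial + s;
        g.dx[i] = gamma[c] * dy[i];
        g.dgamma[c] += x[i] * dy[i];
        g.dbeta[c] += dy[i];
      }
    }
  }
  return g;
}
```

Because $\gamma$ and $\beta$ are shared by every element of a channel, their gradients are sums over the batch and spatial positions, while the input gradient stays element-wise.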

<h2>2. Definition in prototxt</h2>

In Caffe, a BatchNorm layer is usually followed directly by a Scale layer, for example:

layer {
    bottom: "conv1"
    top: "conv1"
    name: "bn_conv1"
    type: "BatchNorm"
    batch_norm_param {
        use_global_stats: true
    }
    param {
        name: "bn_conv1_0"
        lr_mult: 0
    }
    param {
        name: "bn_conv1_1"
        lr_mult: 0
    }
    param {
        name: "bn_conv1_2"
        lr_mult: 0
    }
}

layer {
    bottom: "conv1"
    top: "conv1"
    name: "scale_conv1"
    type: "Scale"
    scale_param {
        bias_term: true
    }
    param {
        name: "scale_conv1_0"
        lr_mult: 0
    }
    param {
        name: "scale_conv1_1"
        lr_mult: 0
    }
}

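From train_voc_trainval_aug.prototxt. In this example use_global_stats: true makes BatchNorm use its stored statistics, and lr_mult: 0 on the Scale parameters keeps the scale and bias fixed.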

<h2>3. BatchNorm definition in caffe.proto</h2>

message LayerParameter {
    optional BatchNormParameter batch_norm_param = 139;
}

message BatchNormParameter {
  // If false, normalization is performed over the current mini-batch
  // and global statistics are accumulated (but not yet used) by a moving
  // average.
  // If true, those accumulated mean and variance values are used for the
  // normalization.
  // By default, it is set to false when the network is in the training
  // phase and true when the network is in the testing phase.
  optional bool use_global_stats = 1;

  // What fraction of the moving average remains each iteration?
  // Smaller values make the moving average decay faster, giving more
  // weight to the recent values.
  // Each iteration updates the moving average @f$S_{t- 1}@f$ with the
  // current mean @f$ Y_t @f$ by 
  // @f$ S_t = (1-\beta)Y_t + \beta \cdot S_{t-1} @f$, where @f$ \beta @f$
  // is the moving_average_fraction parameter.
  optional float moving_average_fraction = 2 [default = .999];
  // Small value to add to the variance estimate so that we don't divide by
  // zero (numerical stability).
  optional float eps = 3 [default = 1e-5];
}

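optional float moving_average_fraction = 2 [default = .999]: at each iteration the moving average ${ S_{t-1} }$ is updated with the current mean ${ Y_t }$ as ${ S_t = (1-\beta)Y_t + \beta \cdot S_{t-1} }$, where ${ \beta }$ is the moving_average_fraction parameter. Note that Forward_cpu in Section 5 implements this with unnormalized running sums (steps [F1] to [F3] in Section 1), which are divided by the accumulated scale factor when they are read back.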

<h2>4. batch_norm_layer.hpp</h2>

#ifndef CAFFE_BATCHNORM_LAYER_HPP_
#define CAFFE_BATCHNORM_LAYER_HPP_

#include <vector>

#include "caffe/blob.hpp"
#include "caffe/layer.hpp"
#include "caffe/proto/caffe.pb.h"

namespace caffe {

/**
 * @brief Normalizes the input to have 0-mean and/or unit (1) variance across
 *        the batch.
 *
 * This layer computes Batch Normalization as described in [1]. For each channel
 * in the data (i.e. axis 1), it subtracts the mean and divides by the variance,
 * where both statistics are computed across both spatial dimensions and across
 * the different examples in the batch.
 * Note: in Forward_cpu the division is actually by sqrt(variance + eps).
 *
 * By default, during training time, the network is computing global
 * mean/variance statistics via a running average, which is then used at test
 * time to allow deterministic outputs for each input. You can manually toggle
 * whether the network is accumulating or using the statistics via the
 * use_global_stats option. For reference, these statistics are kept in the
 * layer's three blobs: (0) mean, (1) variance, and (2) moving average factor.
 *
 * Note that the original paper also included a per-channel learned bias and
 * scaling factor. To implement this in Caffe, define a ScaleLayer configured
 * with bias_term: true after each BatchNormLayer to handle both the bias
 * and scaling factor.
 *
 * [1] S. Ioffe and C. Szegedy, "Batch Normalization: Accelerating Deep Network
 *     Training by Reducing Internal Covariate Shift." arXiv preprint
 *     arXiv:1502.03167 (2015).
 *
 * TODO(dox): thorough documentation for Forward, Backward, and proto params.
 */
template <typename Dtype>
class BatchNormLayer : public Layer<Dtype> {
 public:
  explicit BatchNormLayer(const LayerParameter& param)
      : Layer<Dtype>(param) {}
  virtual void LayerSetUp(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top);
  virtual void Reshape(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top);

  virtual inline const char* type() const { return "BatchNorm"; }
  virtual inline int ExactNumBottomBlobs() const { return 1; }
  virtual inline int ExactNumTopBlobs() const { return 1; }

 protected:
  virtual void Forward_cpu(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top);
  virtual void Forward_gpu(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top);
  virtual void Backward_cpu(const vector<Blob<Dtype>*>& top,
      const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom);
  virtual void Backward_gpu(const vector<Blob<Dtype>*>& top,
     const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom);

  Blob<Dtype> mean_, variance_, temp_, x_norm_; // mean, variance, scratch buffer, cached normalized output
  bool use_global_stats_;
  Dtype moving_average_fraction_;
  int channels_;
  Dtype eps_;

  // extra temporary variables used to carry out sums/broadcasting
  // using BLAS
  Blob<Dtype> batch_sum_multiplier_;
  Blob<Dtype> num_by_chans_;
  Blob<Dtype> spatial_sum_multiplier_;
};

}  // namespace caffe

#endif  // CAFFE_BATCHNORM_LAYER_HPP_

<h2>5. batch_norm_layer.cpp</h2>

#include <algorithm>
#include <vector>

#include "caffe/layers/batch_norm_layer.hpp"
#include "caffe/util/math_functions.hpp"

namespace caffe {

template <typename Dtype>
void BatchNormLayer<Dtype>::LayerSetUp(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top) {
  BatchNormParameter param = this->layer_param_.batch_norm_param(); // BatchNorm parameters
  moving_average_fraction_ = param.moving_average_fraction(); // moving-average fraction
  // During training, mean and variance are computed from the current mini-batch.
  // During testing, the accumulated (global) mean and variance are used.
  use_global_stats_ = this->phase_ == TEST;
  if (param.has_use_global_stats())
    use_global_stats_ = param.use_global_stats();
  if (bottom[0]->num_axes() == 1)
    channels_ = 1;
  else
    channels_ = bottom[0]->shape(1);
  eps_ = param.eps();
  if (this->blobs_.size() > 0) {
    LOG(INFO) << "Skipping parameter initialization";
  } else {
    this->blobs_.resize(3); // three blobs holding the accumulated statistics
    vector<int> sz;
    sz.push_back(channels_);
    this->blobs_[0].reset(new Blob<Dtype>(sz)); // running sum of means, channels_ values
    this->blobs_[1].reset(new Blob<Dtype>(sz)); // running sum of variances, channels_ values
    sz[0] = 1;
    this->blobs_[2].reset(new Blob<Dtype>(sz)); // moving-average scale factor, a single value
    for (int i = 0; i < 3; ++i) {
      caffe_set(this->blobs_[i]->count(), Dtype(0),
                this->blobs_[i]->mutable_cpu_data()); // initialize to 0
    }
  }
  // Mask statistics from optimization by setting local learning rates
  // for mean, variance, and the bias correction to zero.
  for (int i = 0; i < this->blobs_.size(); ++i) {
    if (this->layer_param_.param_size() == i) {
      ParamSpec* fixed_param_spec = this->layer_param_.add_param();
      fixed_param_spec->set_lr_mult(0.f);
    } else {
      CHECK_EQ(this->layer_param_.param(i).lr_mult(), 0.f)
          << "Cannot configure batch normalization statistics as layer "
          << "parameters.";
    }
  }
}

template <typename Dtype>
void BatchNormLayer<Dtype>::Reshape(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top) {
  // If bottom is one-dimensional there is a single mean and variance;
  // otherwise there is one of each per channel.
  if (bottom[0]->num_axes() >= 1)
    CHECK_EQ(bottom[0]->shape(1), channels_);
  top[0]->ReshapeLike(*bottom[0]); // top[0] has the same shape as bottom[0]

  vector<int> sz;
  sz.push_back(channels_);
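  mean_.Reshape(sz); // per-channel mean
  variance_.Reshape(sz); // per-channel variance
  temp_.ReshapeLike(*bottom[0]); // scratch: squared deviations, later the broadcast sqrt(var + eps)
  x_norm_.ReshapeLike(*bottom[0]); // cached normalized output
  sz[0] = bottom[0]->shape(0);
  batch_sum_multiplier_.Reshape(sz); // one entry per sample in the batch

  // spatial dimension: height * width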
  int spatial_dim = bottom[0]->count()/(channels_*bottom[0]->shape(0));

  if (spatial_sum_multiplier_.num_axes() == 0 ||
      spatial_sum_multiplier_.shape(0) != spatial_dim) {
    sz[0] = spatial_dim;
    spatial_sum_multiplier_.Reshape(sz);
    Dtype* multiplier_data = spatial_sum_multiplier_.mutable_cpu_data();
    // spatial_sum_multiplier_ is filled with ones; its size is height * width
    caffe_set(spatial_sum_multiplier_.count(), Dtype(1), multiplier_data);
  }

  int numbychans = channels_*bottom[0]->shape(0); // channels * batch size
  if (num_by_chans_.num_axes() == 0 ||
      num_by_chans_.shape(0) != numbychans) {
    sz[0] = numbychans;
    num_by_chans_.Reshape(sz);
    caffe_set(batch_sum_multiplier_.count(), Dtype(1),
        batch_sum_multiplier_.mutable_cpu_data()); // initialize to 1
  }
}

// Forward pass: the mean and variance are computed as matrix-vector products.
template <typename Dtype>
void BatchNormLayer<Dtype>::Forward_cpu(const vector<Blob<Dtype>*>& bottom,
    const vector<Blob<Dtype>*>& top) {
  const Dtype* bottom_data = bottom[0]->cpu_data();
  Dtype* top_data = top[0]->mutable_cpu_data();
  int num = bottom[0]->shape(0); // batch size
  int spatial_dim = bottom[0]->count()/(bottom[0]->shape(0)*channels_); // height * width

  // copy the input unless the layer runs in-place (bottom and top are the same blob)
  if (bottom[0] != top[0]) {
    caffe_copy(bottom[0]->count(), bottom_data, top_data);
  }

  if (use_global_stats_) {
    // use_global_stats_ is true: use the stored mean/variance estimates,
    // divided by the accumulated scale factor in blobs_[2].
    const Dtype scale_factor = this->blobs_[2]->cpu_data()[0] == 0 ?
        0 : 1 / this->blobs_[2]->cpu_data()[0];
    caffe_cpu_scale(variance_.count(), scale_factor,
        this->blobs_[0]->cpu_data(), mean_.mutable_cpu_data()); // multiply by the scale factor
    caffe_cpu_scale(variance_.count(), scale_factor,
        this->blobs_[1]->cpu_data(), variance_.mutable_cpu_data());
  } else {
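    // use_global_stats_ is false: compute the mean of the current mini-batch
    // num_by_chans_ = (1. / (num * spatial_dim)) * bottom_data * spatial_sum_multiplier_
    // bottom_data is viewed as a matrix with channels * num rows and spatial_dim columns,
    // so the product has channels * num values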
    caffe_cpu_gemv<Dtype>(CblasNoTrans, channels_ * num, spatial_dim,
        1. / (num * spatial_dim), bottom_data,
        spatial_sum_multiplier_.cpu_data(), 0.,
        num_by_chans_.mutable_cpu_data());

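    // mean_ = 1 * num_by_chans_^T * batch_sum_multiplier_
    // num_by_chans_ is viewed as a matrix with num rows and channels columns;
    // summing over the batch for each channel gives channels values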
    caffe_cpu_gemv<Dtype>(CblasTrans, num, channels_, 1.,
        num_by_chans_.cpu_data(), batch_sum_multiplier_.cpu_data(), 0.,
        mean_.mutable_cpu_data());
  }

  // subtract mean
  // num_by_chans_ = 1 * batch_sum_multiplier_ * mean_
  caffe_cpu_gemm<Dtype>(CblasNoTrans, CblasNoTrans, num, channels_, 1, 1,
      batch_sum_multiplier_.cpu_data(), mean_.cpu_data(), 0.,
      num_by_chans_.mutable_cpu_data());
  // top_data = -1 * num_by_chans_ * spatial_sum_multiplier_ + 1.0 * top_data
  // i.e. mean_ is subtracted from every element of top_data
  caffe_cpu_gemm<Dtype>(CblasNoTrans, CblasNoTrans, channels_ * num,
      spatial_dim, 1, -1, num_by_chans_.cpu_data(),
      spatial_sum_multiplier_.cpu_data(), 1., top_data);

  if (!use_global_stats_) {
    // use_global_stats_ is false: compute the variance of the current mini-batch
    // compute variance using var(X) = E((X-EX)^2)
    // square every mean-subtracted element; the result is stored in temp_
    caffe_sqr<Dtype>(top[0]->count(), top_data, temp_.mutable_cpu_data());  // (X-EX)^2
    // num_by_chans_ = (1. / (num * spatial_dim)) * temp_ * spatial_sum_multiplier_
    caffe_cpu_gemv<Dtype>(CblasNoTrans, channels_ * num, spatial_dim,
        1. / (num * spatial_dim), temp_.cpu_data(),
        spatial_sum_multiplier_.cpu_data(), 0.,
        num_by_chans_.mutable_cpu_data()); // matrix-vector product
    // variance_ = 1.0 * num_by_chans_^T * batch_sum_multiplier_
    caffe_cpu_gemv<Dtype>(CblasTrans, num, channels_, 1.,
        num_by_chans_.cpu_data(), batch_sum_multiplier_.cpu_data(), 0.,
        variance_.mutable_cpu_data());  // E((X-EX)^2)

    // compute and save moving average
    // step [F1] in Section 1
    this->blobs_[2]->mutable_cpu_data()[0] *= moving_average_fraction_;
    this->blobs_[2]->mutable_cpu_data()[0] += 1;

    // this->blobs_[0] = 1 * mean_ + moving_average_fraction_ * this->blobs_[0]
    // step [F2] in Section 1
    caffe_cpu_axpby(mean_.count(), Dtype(1), mean_.cpu_data(),
        moving_average_fraction_, this->blobs_[0]->mutable_cpu_data());
    // m = num * height * width
    int m = bottom[0]->count()/channels_;

    Dtype bias_correction_factor = m > 1 ? Dtype(m)/(m-1) : 1;
    // this->blobs_[1] = bias_correction_factor * variance_ + moving_average_fraction_ * this->blobs_[1]
    // m/(m-1) turns the batch variance into an unbiased estimate
    // step [F3] in Section 1
    caffe_cpu_axpby(variance_.count(), bias_correction_factor,
        variance_.cpu_data(), moving_average_fraction_,
        this->blobs_[1]->mutable_cpu_data());
  }

  // normalize variance
  // variance_ = variance_ + eps_ (add a small constant)
  caffe_add_scalar(variance_.count(), eps_, variance_.mutable_cpu_data());
  // take the element-wise square root of variance_
  caffe_sqrt(variance_.count(), variance_.cpu_data(),
             variance_.mutable_cpu_data());

  // replicate variance to input size
  // the two gemm calls below broadcast the channels_ values in variance_
  // to channels_ * num * height * width values
  // num_by_chans_ = 1 * batch_sum_multiplier_ * variance_
  caffe_cpu_gemm<Dtype>(CblasNoTrans, CblasNoTrans, num, channels_, 1, 1,
      batch_sum_multiplier_.cpu_data(), variance_.cpu_data(), 0.,
      num_by_chans_.mutable_cpu_data());
  // temp_ = 1.0 * num_by_chans_ * spatial_sum_multiplier_
  caffe_cpu_gemm<Dtype>(CblasNoTrans, CblasNoTrans, channels_ * num,
      spatial_dim, 1, 1., num_by_chans_.cpu_data(),
      spatial_sum_multiplier_.cpu_data(), 0., temp_.mutable_cpu_data());

  // element-wise: top_data[i] = top_data[i] / temp_[i]
  caffe_div(temp_.count(), top_data, temp_.cpu_data(), top_data);
  // TODO(cdoersch): The caching is only needed because later in-place layers
  //                 might clobber the data.  Can we skip this if they won't?
  // copy the result in top_data to x_norm_
  caffe_copy(x_norm_.count(), top_data, x_norm_.mutable_cpu_data());
}


// See the backward formulas in Section 1.
template <typename Dtype>
void BatchNormLayer<Dtype>::Backward_cpu(const vector<Blob<Dtype>*>& top,
    const vector<bool>& propagate_down,
    const vector<Blob<Dtype>*>& bottom) {
  const Dtype* top_diff; // gradient with respect to the top blob
  if (bottom[0] != top[0]) {
    top_diff = top[0]->cpu_diff();
  } else {
    caffe_copy(x_norm_.count(), top[0]->cpu_diff(), x_norm_.mutable_cpu_diff());
    top_diff = x_norm_.cpu_diff();
  }
  Dtype* bottom_diff = bottom[0]->mutable_cpu_diff();
  if (use_global_stats_) {
    caffe_div(temp_.count(), top_diff, temp_.cpu_data(), bottom_diff);
    return;
  }
  const Dtype* top_data = x_norm_.cpu_data();
  int num = bottom[0]->shape()[0];
  int spatial_dim = bottom[0]->count()/(bottom[0]->shape(0)*channels_);
  // if Y = (X-mean(X))/(sqrt(var(X)+eps)), then
  //
  // dE(Y)/dX =
  //   (dE/dY - mean(dE/dY) - mean(dE/dY \cdot Y) \cdot Y)
  //     ./ sqrt(var(X) + eps)
  //
  // where \cdot and ./ are Hadamard product and elementwise division,
  // respectively, dE/dY is the top diff, and mean/var/sum are all computed
  // along all dimensions except the channels dimension.  In the above
  // equation, the operations allow for expansion (i.e. broadcast) along all
  // dimensions except the channels dimension where required.

  // sum(dE/dY \cdot Y)
  caffe_mul(temp_.count(), top_data, top_diff, bottom_diff);
  caffe_cpu_gemv<Dtype>(CblasNoTrans, channels_ * num, spatial_dim, 1.,
      bottom_diff, spatial_sum_multiplier_.cpu_data(), 0.,
      num_by_chans_.mutable_cpu_data());
  caffe_cpu_gemv<Dtype>(CblasTrans, num, channels_, 1.,
      num_by_chans_.cpu_data(), batch_sum_multiplier_.cpu_data(), 0.,
      mean_.mutable_cpu_data());

  // reshape (broadcast) the above
  caffe_cpu_gemm<Dtype>(CblasNoTrans, CblasNoTrans, num, channels_, 1, 1,
      batch_sum_multiplier_.cpu_data(), mean_.cpu_data(), 0.,
      num_by_chans_.mutable_cpu_data());
  caffe_cpu_gemm<Dtype>(CblasNoTrans, CblasNoTrans, channels_ * num,
      spatial_dim, 1, 1., num_by_chans_.cpu_data(),
      spatial_sum_multiplier_.cpu_data(), 0., bottom_diff);

  // sum(dE/dY \cdot Y) \cdot Y
  caffe_mul(temp_.count(), top_data, bottom_diff, bottom_diff);

  // sum(dE/dY)-sum(dE/dY \cdot Y) \cdot Y
  caffe_cpu_gemv<Dtype>(CblasNoTrans, channels_ * num, spatial_dim, 1.,
      top_diff, spatial_sum_multiplier_.cpu_data(), 0.,
      num_by_chans_.mutable_cpu_data());
  caffe_cpu_gemv<Dtype>(CblasTrans, num, channels_, 1.,
      num_by_chans_.cpu_data(), batch_sum_multiplier_.cpu_data(), 0.,
      mean_.mutable_cpu_data());
  // reshape (broadcast) the above to make
  // sum(dE/dY)-sum(dE/dY \cdot Y) \cdot Y
  caffe_cpu_gemm<Dtype>(CblasNoTrans, CblasNoTrans, num, channels_, 1, 1,
      batch_sum_multiplier_.cpu_data(), mean_.cpu_data(), 0.,
      num_by_chans_.mutable_cpu_data());
  caffe_cpu_gemm<Dtype>(CblasNoTrans, CblasNoTrans, num * channels_,
      spatial_dim, 1, 1., num_by_chans_.cpu_data(),
      spatial_sum_multiplier_.cpu_data(), 1., bottom_diff);

  // dE/dY - mean(dE/dY)-mean(dE/dY \cdot Y) \cdot Y
  caffe_cpu_axpby(temp_.count(), Dtype(1), top_diff,
      Dtype(-1. / (num * spatial_dim)), bottom_diff);

  // note: temp_ still contains sqrt(var(X)+eps), computed during the forward
  // pass.
  caffe_div(temp_.count(), bottom_diff, temp_.cpu_data(), bottom_diff);
}


#ifdef CPU_ONLY
STUB_GPU(BatchNormLayer);
#endif

INSTANTIATE_CLASS(BatchNormLayer);
REGISTER_LAYER_CLASS(BatchNorm);
}  // namespace caffe
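The caffe_cpu_gemv and caffe_cpu_gemm calls in Forward_cpu are easier to follow when written as explicit loops. The C++ sketch below is an illustration, not the Caffe implementation: it assumes a flat num x channels x spatial array, replaces BLAS with plain loops, and uses the hypothetical names ChannelMean and SubtractChannelMean.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Per-channel mean of data stored as num x channels x spatial.
std::vector<double> ChannelMean(const std::vector<double>& data,
                                int num, int channels, int spatial) {
  assert(data.size() == static_cast<std::size_t>(num) * channels * spatial);
  // First gemv: treat data as a (num * channels) x spatial matrix and multiply
  // it by a vector of ones, scaled by 1 / (num * spatial).
  std::vector<double> num_by_chans(static_cast<std::size_t>(num) * channels, 0.0);
  const double alpha = 1.0 / (static_cast<double>(num) * spatial);
  for (int r = 0; r < num * channels; ++r) {
    for (int s = 0; s < spatial; ++s) {
      num_by_chans[r] += alpha * data[static_cast<std::size_t>(r) * spatial + s];
    }
  }
  // Second gemv (transposed): treat num_by_chans as a num x channels matrix
  // and sum over its rows, which leaves one value per channel.
  std::vector<double> mean(static_cast<std::size_t>(channels), 0.0);
  for (int n = 0; n < num; ++n) {
    for (int c = 0; c < channels; ++c) {
      mean[c] += num_by_chans[static_cast<std::size_t>(n) * channels + c];
    }
  }
  return mean;
}

// Broadcasts the per-channel mean back to the full shape and subtracts it,
// which is the effect of the two rank-1 gemm calls.
void SubtractChannelMean(std::vector<double>& data,
                         const std::vector<double>& mean,
                         int num, int channels, int spatial) {
  assert(data.size() == static_cast<std::size_t>(num) * channels * spatial);
  assert(mean.size() == static_cast<std::size_t>(channels));
  for (int n = 0; n < num; ++n) {
    for (int c = 0; c < channels; ++c) {
      for (int s = 0; s < spatial; ++s) {
        data[(static_cast<std::size_t>(n) * channels + c) * spatial + s] -= mean[c];
      }
    }
  }
}
```

The vectors of ones (spatial_sum_multiplier_ and batch_sum_multiplier_) exist only so that these reductions and broadcasts can be expressed as BLAS matrix products; the variance is computed the same way from the squared, mean-subtracted data.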

<h2>6. Scale definition in caffe.proto</h2>

message LayerParameter {
    optional ScaleParameter scale_param = 142;
}

message ScaleParameter {
  // The first axis of bottom[0] (the first input Blob) along which to apply
  // bottom[1] (the second input Blob).  May be negative to index from the end
  // (e.g., -1 for the last axis).
  // (axis fixes which shapes bottom[1] may take relative to bottom[0].)
  // For example, if bottom[0] is 4D with shape 100x3x40x60, the output
  // top[0] will have the same shape, and bottom[1] may have any of the
  // following shapes (for the given value of axis):
  //    (axis == 0 == -4) 100; 100x3; 100x3x40; 100x3x40x60
  //    (axis == 1 == -3)          3;     3x40;     3x40x60
  //    (axis == 2 == -2)                   40;       40x60
  //    (axis == 3 == -1)                                60
  // Furthermore, bottom[1] may have the empty shape (regardless of the value of
  // "axis") -- a scalar multiplier.
  optional int32 axis = 1 [default = 1];

  // (num_axes is ignored unless just one bottom is given and the scale is
  // a learned parameter of the layer.  Otherwise, num_axes is determined by the
  // number of axes by the second bottom.)
  // The number of axes of the input (bottom[0]) covered by the scale
  // parameter, or -1 to cover all axes of bottom[0] starting from axis.
  // Set num_axes := 0, to multiply with a zero-axis Blob: a scalar.
  optional int32 num_axes = 2 [default = 1];

  // (filler is ignored unless just one bottom is given and the scale is
  // a learned parameter of the layer.)
  // The initialization for the learned scale parameter.
  // Default is the unit (1) initialization, resulting in the ScaleLayer
  // initially performing the identity operation.
  optional FillerParameter filler = 3;

  // Whether to also learn a bias (equivalent to a ScaleLayer+BiasLayer, but
  // may be more efficient).  Initialized with bias_filler (defaults to 0).
  optional bool bias_term = 4 [default = false];
  optional FillerParameter bias_filler = 5;
}

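In other words, the layer computes the element-wise product of its two inputs, broadcasting the second input to match the shape of the first.
This is equivalent to tiling the second input and then taking the element-wise product.

In short:

optional int32 axis [default = 1]; // the first axis of bottom[0] that the scale applies to
optional int32 num_axes [default = 1]; // can stay at its default for BN; it sets how many axes of bottom[0] the learned scale covers
optional FillerParameter filler; // how the initial scale (gamma) is filled
optional FillerParameter bias_filler; // how the initial bias (beta) is filled
optional bool bias_term = 4 [default = false]; // whether to learn a bias; without it the layer reduces to y = gamma * x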

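How axis and num_axes split bottom[0] into the outer_dim_, scale_dim_ and inner_dim_ used by ScaleLayer::Reshape (Section 8) can be sketched as below. This covers only the single-bottom case with a learned scale; ScaleDims and ComputeScaleDims are hypothetical names, and the handling of num_axes == 0 follows the comment in Reshape that axis_ is set to 0 for a scalar scale.

```cpp
#include <cassert>
#include <vector>

// The three factors the Scale layer loops over.
struct ScaleDims {
  int outer;  // product of the axes before `axis`
  int scale;  // number of scale values (axes covered by the scale blob)
  int inner;  // product of the axes after the covered ones
};

ScaleDims ComputeScaleDims(const std::vector<int>& bottom_shape,
                           int axis, int num_axes) {
  const int rank = static_cast<int>(bottom_shape.size());
  if (axis < 0) axis += rank;  // negative axes index from the end
  assert(axis >= 0 && axis < rank);
  if (num_axes == -1) num_axes = rank - axis;  // cover every axis from `axis` on
  assert(num_axes >= 0 && axis + num_axes <= rank);
  if (num_axes == 0) axis = 0;  // scalar scale: Reshape forces axis_ to 0
  ScaleDims d = {1, 1, 1};
  for (int i = 0; i < axis; ++i) d.outer *= bottom_shape[i];
  for (int i = axis; i < axis + num_axes; ++i) d.scale *= bottom_shape[i];
  for (int i = axis + num_axes; i < rank; ++i) d.inner *= bottom_shape[i];
  return d;
}
```

For the 100x3x40x60 example from the proto comment, the defaults axis = 1 and num_axes = 1 give outer = 100, scale = 3 and inner = 2400, which is one $\gamma$ per channel, as BN requires.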
<h2>7. scale_layer.hpp</h2>

#ifndef CAFFE_SCALE_LAYER_HPP_
#define CAFFE_SCALE_LAYER_HPP_

#include <vector>

#include "caffe/blob.hpp"
#include "caffe/layer.hpp"
#include "caffe/proto/caffe.pb.h"

#include "caffe/layers/bias_layer.hpp"

namespace caffe {

/**
 * @brief Computes the elementwise product of two input Blobs, with the shape of
 *        the latter Blob "broadcast" to match the shape of the former.
 *        Equivalent to tiling the latter Blob, then computing the elementwise
 *        product. Note: for efficiency and convenience, this layer can
 *        additionally perform a "broadcast" sum too when bias_term: true
 *        is set.
 *
 * The latter, scale input may be omitted, in which case it's learned as
 * parameter of the layer (as is the bias, if it is included).
 */
template <typename Dtype>
class ScaleLayer: public Layer<Dtype> {
 public:
  explicit ScaleLayer(const LayerParameter& param)
      : Layer<Dtype>(param) {}
  virtual void LayerSetUp(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top);
  virtual void Reshape(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top);

  virtual inline const char* type() const { return "Scale"; }
  // Scale
  virtual inline int MinBottomBlobs() const { return 1; }
  virtual inline int MaxBottomBlobs() const { return 2; }
  virtual inline int ExactNumTopBlobs() const { return 1; }

 protected:
  /**
   * In the below shape specifications, @f$ i @f$ denotes the value of the
   * axis field given by this->layer_param_.scale_param().axis(), after
   * canonicalization (i.e., conversion from negative to positive index,
   * if applicable).
   *
   * @param bottom input Blob vector (length 2)
   *   -# @f$ (d_0 \times ... \times
   *           d_i \times ... \times d_j \times ... \times d_n) @f$
   *      the first factor @f$ x @f$
   *   -# @f$ (d_i \times ... \times d_j) @f$
   *      the second factor @f$ y @f$
   * @param top output Blob vector (length 1)
   *   -# @f$ (d_0 \times ... \times
   *           d_i \times ... \times d_j \times ... \times d_n) @f$
   *      the product @f$ z = x y @f$ computed after "broadcasting" y.
   *      Equivalent to tiling @f$ y @f$ to have the same shape as @f$ x @f$,
   *      then computing the elementwise product.
   */
  virtual void Forward_cpu(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top);
  virtual void Forward_gpu(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top);
  virtual void Backward_cpu(const vector<Blob<Dtype>*>& top,
      const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom);
  virtual void Backward_gpu(const vector<Blob<Dtype>*>& top,
      const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom);

  shared_ptr<Layer<Dtype> > bias_layer_;
  vector<Blob<Dtype>*> bias_bottom_vec_;
  vector<bool> bias_propagate_down_;
  int bias_param_id_;

  Blob<Dtype> sum_multiplier_;
  Blob<Dtype> sum_result_;
  Blob<Dtype> temp_;
  int axis_;
  int outer_dim_, scale_dim_, inner_dim_;
};


}  // namespace caffe

#endif  // CAFFE_SCALE_LAYER_HPP_

<h2>8. scale_layer.cpp</h2>

#include <algorithm>
#include <vector>

#include "caffe/filler.hpp"
#include "caffe/layer_factory.hpp"
#include "caffe/layers/scale_layer.hpp"
#include "caffe/util/math_functions.hpp"

namespace caffe {

template <typename Dtype>
void ScaleLayer<Dtype>::LayerSetUp(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top) {
  const ScaleParameter& param = this->layer_param_.scale_param(); // Scale parameters
  // if the layer's own parameter blobs already exist, skip initialization
  if (bottom.size() == 1 && this->blobs_.size() > 0) { 
    LOG(INFO) << "Skipping parameter initialization";
  } else if (bottom.size() == 1) {
    // scale is a learned parameter; initialize it
    axis_ = bottom[0]->CanonicalAxisIndex(param.axis());
    const int num_axes = param.num_axes();
    CHECK_GE(num_axes, -1) << "num_axes must be non-negative, "
                           << "or -1 to extend to the end of bottom[0]";
    if (num_axes >= 0) {
      CHECK_GE(bottom[0]->num_axes(), axis_ + num_axes)
          << "scale blob's shape extends past bottom[0]'s shape when applied "
          << "starting with bottom[0] axis = " << axis_;
    }
    this->blobs_.resize(1); // gamma

    // the scale blob takes its shape from axes [axis_, axis_ + num_axes) of bottom[0]
    const vector<int>::const_iterator& shape_start =
        bottom[0]->shape().begin() + axis_;
    const vector<int>::const_iterator& shape_end =
        (num_axes == -1) ? bottom[0]->shape().end() : (shape_start + num_axes);
    vector<int> scale_shape(shape_start, shape_end);
    this->blobs_[0].reset(new Blob<Dtype>(scale_shape));
    FillerParameter filler_param(param.filler());
    if (!param.has_filler()) {
      // Default to unit (1) filler for identity operation.
      filler_param.set_type("constant");
      filler_param.set_value(1);
    }
    shared_ptr<Filler<Dtype> > filler(GetFiller<Dtype>(filler_param));
    filler->Fill(this->blobs_[0].get());
  }
  if (param.bias_term()) { // set up an internal Bias layer when a bias term is requested
    LayerParameter layer_param(this->layer_param_);
    layer_param.set_type("Bias");
    BiasParameter* bias_param = layer_param.mutable_bias_param();
    bias_param->set_axis(param.axis());
    if (bottom.size() > 1) {
      bias_param->set_num_axes(bottom[1]->num_axes());
    } else {
      bias_param->set_num_axes(param.num_axes());
    }
    bias_param->mutable_filler()->CopyFrom(param.bias_filler());
    bias_layer_ = LayerRegistry<Dtype>::CreateLayer(layer_param);
    bias_bottom_vec_.resize(1);
    bias_bottom_vec_[0] = bottom[0];
    bias_layer_->SetUp(bias_bottom_vec_, top);
    if (this->blobs_.size() + bottom.size() < 3) {
      // case: blobs.size == 1 && bottom.size == 1
      // or blobs.size == 0 && bottom.size == 2
      bias_param_id_ = this->blobs_.size();
      this->blobs_.resize(bias_param_id_ + 1);
      this->blobs_[bias_param_id_] = bias_layer_->blobs()[0];
    } else {
      // bias param already initialized
      bias_param_id_ = this->blobs_.size() - 1;
      bias_layer_->blobs()[0] = this->blobs_[bias_param_id_];
    }
    bias_propagate_down_.resize(1, false);
  }
  this->param_propagate_down_.resize(this->blobs_.size(), true);
}

template <typename Dtype>
void ScaleLayer<Dtype>::Reshape(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top) {
  const ScaleParameter& param = this->layer_param_.scale_param();
  Blob<Dtype>* scale = (bottom.size() > 1) ? bottom[1] : this->blobs_[0].get();
  // Always set axis_ == 0 in special case where scale is a scalar
  // (num_axes == 0). Mathematically equivalent for any choice of axis_, so the
  // actual setting can be safely ignored; and computation is most efficient
  // with axis_ == 0 and (therefore) outer_dim_ == 1. (Setting axis_ to
  // bottom[0]->num_axes() - 1, giving inner_dim_ == 1, would be equally
  // performant.)
  axis_ = (scale->num_axes() == 0) ?
      0 : bottom[0]->CanonicalAxisIndex(param.axis());
  CHECK_GE(bottom[0]->num_axes(), axis_ + scale->num_axes())
      << "scale blob's shape extends past bottom[0]'s shape when applied "
      << "starting with bottom[0] axis = " << axis_;
  for (int i = 0; i < scale->num_axes(); ++i) {
    CHECK_EQ(bottom[0]->shape(axis_ + i), scale->shape(i))
        << "dimension mismatch between bottom[0]->shape(" << axis_ + i
        << ") and scale->shape(" << i << ")";
  }
  outer_dim_ = bottom[0]->count(0, axis_);
  scale_dim_ = scale->count();
  inner_dim_ = bottom[0]->count(axis_ + scale->num_axes());

  // In-place computation (top and bottom are the same blob) needs a scratch
  // blob, because Forward saves the original bottom data there.
  if (bottom[0] == top[0]) {
    temp_.ReshapeLike(*bottom[0]);
  } else {
    top[0]->ReshapeLike(*bottom[0]);
  }

  sum_result_.Reshape(vector<int>(1, outer_dim_ * scale_dim_));
  const int sum_mult_size = std::max(outer_dim_, inner_dim_);
  sum_multiplier_.Reshape(vector<int>(1, sum_mult_size));
  if (sum_multiplier_.cpu_data()[sum_mult_size - 1] != Dtype(1)) {
    caffe_set(sum_mult_size, Dtype(1), sum_multiplier_.mutable_cpu_data());
  }
  if (bias_layer_) {
    bias_bottom_vec_[0] = top[0];
    bias_layer_->Reshape(bias_bottom_vec_, top);
  }
}


template <typename Dtype>
void ScaleLayer<Dtype>::Forward_cpu(
    const vector<Blob<Dtype>*>& bottom, const vector<Blob<Dtype>*>& top) {
  const Dtype* bottom_data = bottom[0]->cpu_data();
  if (bottom[0] == top[0]) {
    // In-place computation; need to store bottom data before overwriting it.
    // Note that this is only necessary for Backward; we could skip this if not
    // doing Backward, but Caffe currently provides no way of knowing whether
    // we'll need to do Backward at the time of the Forward call.
    caffe_copy(bottom[0]->count(), bottom[0]->cpu_data(),
               temp_.mutable_cpu_data());
  }
  const Dtype* scale_data =
      ((bottom.size() > 1) ? bottom[1] : this->blobs_[0].get())->cpu_data();
  Dtype* top_data = top[0]->mutable_cpu_data();
  for (int n = 0; n < outer_dim_; ++n) {
    for (int d = 0; d < scale_dim_; ++d) {
      const Dtype factor = scale_data[d];
      caffe_cpu_scale(inner_dim_, factor, bottom_data, top_data);
      bottom_data += inner_dim_;
      top_data += inner_dim_;
    }
  }
  if (bias_layer_) {
    bias_layer_->Forward(bias_bottom_vec_, top);
  }
}

template <typename Dtype>
void ScaleLayer<Dtype>::Backward_cpu(const vector<Blob<Dtype>*>& top,
    const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom) {
  if (bias_layer_ &&
      this->param_propagate_down_[this->param_propagate_down_.size() - 1]) {
    bias_layer_->Backward(top, bias_propagate_down_, bias_bottom_vec_);
  }
  const bool scale_param = (bottom.size() == 1);
  Blob<Dtype>* scale = scale_param ? this->blobs_[0].get() : bottom[1];
  if ((!scale_param && propagate_down[1]) ||
      (scale_param && this->param_propagate_down_[0])) {
    const Dtype* top_diff = top[0]->cpu_diff();
    const bool in_place = (bottom[0] == top[0]);
    const Dtype* bottom_data = (in_place ? &temp_ : bottom[0])->cpu_data();
    // Hack: store big eltwise product in bottom[0] diff, except in the special
    // case where this layer itself does the eltwise product, in which case we
    // can store it directly in the scale diff, and we're done.
    // If we're computing in-place (and not doing eltwise computation), this
    // hack doesn't work and we store the product in temp_.
    const bool is_eltwise = (bottom[0]->count() == scale->count());
    Dtype* product = (is_eltwise ? scale->mutable_cpu_diff() :
        (in_place ? temp_.mutable_cpu_data() : bottom[0]->mutable_cpu_diff()));
    caffe_mul(top[0]->count(), top_diff, bottom_data, product);
    if (!is_eltwise) {
      Dtype* sum_result = NULL;
      if (inner_dim_ == 1) {
        sum_result = product;
      } else if (sum_result_.count() == 1) {
        const Dtype* sum_mult = sum_multiplier_.cpu_data();
        Dtype* scale_diff = scale->mutable_cpu_diff();
        if (scale_param) {
          Dtype result = caffe_cpu_dot(inner_dim_, product, sum_mult);
          *scale_diff += result;
        } else {
          *scale_diff = caffe_cpu_dot(inner_dim_, product, sum_mult);
        }
      } else {
        const Dtype* sum_mult = sum_multiplier_.cpu_data();
        sum_result = (outer_dim_ == 1) ?
            scale->mutable_cpu_diff() : sum_result_.mutable_cpu_data();
        caffe_cpu_gemv(CblasNoTrans, sum_result_.count(), inner_dim_,
                       Dtype(1), product, sum_mult, Dtype(0), sum_result);
      }
      if (outer_dim_ != 1) {
        const Dtype* sum_mult = sum_multiplier_.cpu_data();
        Dtype* scale_diff = scale->mutable_cpu_diff();
        if (scale_dim_ == 1) {
          if (scale_param) {
            Dtype result = caffe_cpu_dot(outer_dim_, sum_mult, sum_result);
            *scale_diff += result;
          } else {
            *scale_diff = caffe_cpu_dot(outer_dim_, sum_mult, sum_result);
          }
        } else {
          caffe_cpu_gemv(CblasTrans, outer_dim_, scale_dim_,
                         Dtype(1), sum_result, sum_mult, Dtype(scale_param),
                         scale_diff);
        }
      }
    }
  }
  if (propagate_down[0]) {
    const Dtype* top_diff = top[0]->cpu_diff();
    const Dtype* scale_data = scale->cpu_data();
    Dtype* bottom_diff = bottom[0]->mutable_cpu_diff();
    for (int n = 0; n < outer_dim_; ++n) {
      for (int d = 0; d < scale_dim_; ++d) {
        const Dtype factor = scale_data[d];
        caffe_cpu_scale(inner_dim_, factor, top_diff, bottom_diff);
        bottom_diff += inner_dim_;
        top_diff += inner_dim_;
      }
    }
  }
}

#ifdef CPU_ONLY
STUB_GPU(ScaleLayer);
#endif

INSTANTIATE_CLASS(ScaleLayer);
REGISTER_LAYER_CLASS(Scale);

}  // namespace caffe
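
The loops in Forward_cpu treat bottom[0] as a dense array of shape (outer_dim_, scale_dim_, inner_dim_), where the three sizes come from Reshape. The following is a minimal standalone sketch of that indexing, using std::vector instead of Caffe blobs. The function name scale_forward and its signature are illustrative assumptions, not part of Caffe.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not a Caffe API: multiplies x, viewed as a dense
// (outer_dim, scale_dim, inner_dim) array, by scale[d] along the middle axis.
std::vector<float> scale_forward(const std::vector<float>& x,
                                 const std::vector<float>& scale,
                                 std::size_t outer_dim,
                                 std::size_t inner_dim) {
  const std::size_t scale_dim = scale.size();
  assert(x.size() == outer_dim * scale_dim * inner_dim);
  std::vector<float> y(x.size());
  std::size_t offset = 0;
  for (std::size_t n = 0; n < outer_dim; ++n) {
    for (std::size_t d = 0; d < scale_dim; ++d) {
      // One contiguous run of inner_dim values shares the factor scale[d];
      // this is the run that caffe_cpu_scale(inner_dim_, factor, ...) handles.
      for (std::size_t i = 0; i < inner_dim; ++i) {
        y[offset + i] = scale[d] * x[offset + i];
      }
      offset += inner_dim;
    }
  }
  return y;
}
```

For a 2x3x2x2 input with axis = 1 and a 3-element scale, this corresponds to outer_dim = 2, scale_dim = 3 and inner_dim = 4.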

<h2>Reference</h2>

[1] - caffe中batch_norm层代码详细注解

[2] - CAFFE源码学习笔记之batch_norm_layer

[3] - Caffe BatchNormalization 推导

[4] - Caffe Scale层解析

Last modification：October 10th, 2018 at 04:37 pm


Caffe's Scale layer handles the parameters $\gamma$ and $\beta$. Both are vectors, with one entry per channel under the default axis = 1, num_axes = 1.

Forward: ${ top = \gamma \cdot bottom + \beta }$, i.e. ${ y = \gamma \cdot x + \beta }$
Backward: ${ \frac{\partial y}{\partial x} = \gamma }$, ${ \frac{\partial y}{\partial \gamma} = x }$, ${ \frac{\partial y}{\partial \beta} = 1 }$
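
As a rough illustration of how these three partial derivatives turn into gradients, the sketch below applies the chain rule to an (outer, channels, inner) array. Because $\gamma$ and $\beta$ are shared across the batch and spatial positions, their gradients are sums over all elements of a channel. The struct ScaleGrads and the function scale_backward are hypothetical names chosen for this example; Caffe's own implementation uses BLAS calls instead.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative container for the three gradients; not a Caffe type.
struct ScaleGrads {
  std::vector<float> dx;      // dL/dx, same size as x
  std::vector<float> dgamma;  // dL/dgamma, one value per channel
  std::vector<float> dbeta;   // dL/dbeta, one value per channel
};

// x and dy are dense (outer, channels, inner) arrays; gamma has one entry
// per channel. dy is the gradient of the loss with respect to y.
ScaleGrads scale_backward(const std::vector<float>& x,
                          const std::vector<float>& dy,
                          const std::vector<float>& gamma,
                          std::size_t outer, std::size_t inner) {
  const std::size_t channels = gamma.size();
  assert(x.size() == outer * channels * inner);
  assert(dy.size() == x.size());
  ScaleGrads g{std::vector<float>(x.size()),
               std::vector<float>(channels, 0.0f),
               std::vector<float>(channels, 0.0f)};
  for (std::size_t n = 0; n < outer; ++n) {
    for (std::size_t c = 0; c < channels; ++c) {
      for (std::size_t i = 0; i < inner; ++i) {
        const std::size_t idx = (n * channels + c) * inner + i;
        g.dx[idx] = dy[idx] * gamma[c];   // dy/dx = gamma
        g.dgamma[c] += dy[idx] * x[idx];  // dy/dgamma = x, summed per channel
        g.dbeta[c] += dy[idx];            // dy/dbeta = 1, summed per channel
      }
    }
  }
  return g;
}
```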

<h2>2. Definition in prototxt</h2>

In Caffe, a BatchNorm layer is usually followed by a Scale layer, for example:

layer {
    bottom: "conv1"
    top: "conv1"
    name: "bn_conv1"
    type: "BatchNorm"
    batch_norm_param {
        use_global_stats: true
    }
    param {
        name: "bn_conv1_0"
        lr_mult: 0
    }
    param {
        name: "bn_conv1_1"
        lr_mult: 0
    }
    param {
        name: "bn_conv1_2"
        lr_mult: 0
    }
}

layer {
    bottom: "conv1"
    top: "conv1"
    name: "scale_conv1"
    type: "Scale"
    scale_param {
        bias_term: true
    }
    param {
        name: "scale_conv1_0"
        lr_mult: 0
    }
    param {
        name: "scale_conv1_1"
        lr_mult: 0
    }
}

From train_voc_trainval_aug.prototxt
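
With use_global_stats: true, the two layers above together apply a fixed per-channel transform, $y = \gamma \cdot \frac{x - \mu}{\sqrt{\sigma + eps}} + \beta$. The sketch below shows that combined computation on an (num, channels, spatial) array. It assumes mean and var are the final per-channel estimates (already divided by the normalizer the BatchNorm layer stores). The function name bn_scale_inference is a hypothetical helper, not a Caffe API.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper, not a Caffe API. x is a dense (num, channels, spatial)
// array; mean, var, gamma and beta each hold one value per channel.
std::vector<float> bn_scale_inference(const std::vector<float>& x,
                                      const std::vector<float>& mean,
                                      const std::vector<float>& var,
                                      const std::vector<float>& gamma,
                                      const std::vector<float>& beta,
                                      std::size_t num, std::size_t spatial,
                                      float eps) {
  const std::size_t channels = mean.size();
  assert(x.size() == num * channels * spatial);
  assert(var.size() == channels && gamma.size() == channels &&
         beta.size() == channels);
  std::vector<float> y(x.size());
  for (std::size_t n = 0; n < num; ++n) {
    for (std::size_t c = 0; c < channels; ++c) {
      // 1 / sqrt(var + eps) for this channel.
      const float inv_std = 1.0f / std::sqrt(var[c] + eps);
      for (std::size_t i = 0; i < spatial; ++i) {
        const std::size_t idx = (n * channels + c) * spatial + i;
        const float x_norm = (x[idx] - mean[c]) * inv_std;  // BatchNorm layer
        y[idx] = gamma[c] * x_norm + beta[c];               // Scale layer
      }
    }
  }
  return y;
}
```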

<h2>3. BatchNorm definition in caffe.proto</h2>

message LayerParameter {
    optional BatchNormParameter batch_norm_param = 139;
}

message BatchNormParameter {
  // If false, normalization is performed over the current mini-batch
  // and global statistics are accumulated (but not yet used) by a moving
  // average.
  // If true, those accumulated mean and variance values are used for the
  // normalization.
  // By default, it is set to false when the network is in the training
  // phase and true when the network is in the testing phase.
  optional bool use_global_stats = 1;

  // What fraction of the moving average remains each iteration?
  // Smaller values make the moving average decay faster, giving more
  // weight to the recent values.
  // Each iteration updates the moving average @f$S_{t-1}@f$ with the
  // current mean @f$ Y_t @f$ by
  // @f$ S_t = (1-\beta)Y_t + \beta \cdot S_{t-1} @f$, where @f$ \beta @f$
  // is the moving_average_fraction parameter.
  optional float moving_average_fraction = 2 [default = .999];
  // Small value to add to the variance estimate so that we don't divide by
  // zero.
  // This keeps the normalization numerically stable.
  optional float eps = 3 [default = 1e-5];
}

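optional float moving_average_fraction = 2 [default = .999]: on each iteration, the moving average ${ S_{t-1} }$ is updated with the current mean ${ Y_t }$ as ${ S_t = (1-\beta)Y_t + \beta \cdot S_{t-1} }$, where ${ \beta }$ is the moving_average_fraction parameter. This ${ \beta }$ is unrelated to the shift parameter ${ \beta }$ of the Scale layer.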
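
The code in batch_norm_layer.cpp does not store ${ S_t }$ in exactly this form. As the Forward_cpu listing shows, it keeps a discounted sum of batch means in blobs_[0] and a discounted update count in blobs_[2], and divides the first by the second when use_global_stats is on. The sketch below reproduces that bookkeeping for a single channel. The struct RunningMean and its member names are assumptions made for this illustration.

```cpp
#include <cassert>
#include <cmath>

// Illustrative accumulator for one channel; `sum` plays the role of
// blobs_[0] and `normalizer` the role of blobs_[2]. Not a Caffe type.
struct RunningMean {
  float sum = 0.0f;         // discounted sum of batch means
  float normalizer = 0.0f;  // discounted count of updates

  // lambda is the moving_average_fraction.
  void update(float batch_mean, float lambda) {
    normalizer = lambda * normalizer + 1.0f;
    sum = lambda * sum + batch_mean;
  }

  // What the use_global_stats branch reads: sum scaled by 1 / normalizer,
  // or 0 if no update has happened yet.
  float estimate() const {
    return normalizer == 0.0f ? 0.0f : sum / normalizer;
  }
};
```

One consequence of dividing by the normalizer is that a constant batch mean is recovered exactly after the first update, whereas a plain exponential average started from 0 would only approach it gradually.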

<h2>4. batch_norm_layer.hpp</h2>

#ifndef CAFFE_BATCHNORM_LAYER_HPP_
#define CAFFE_BATCHNORM_LAYER_HPP_

#include <vector>

#include "caffe/blob.hpp"
#include "caffe/layer.hpp"
#include "caffe/proto/caffe.pb.h"

namespace caffe {

/**
 * @brief Normalizes the input to have 0-mean and/or unit (1) variance across
 *        the batch.
 *
 * This layer computes Batch Normalization as described in [1]. For each channel
 * in the data (i.e. axis 1), it subtracts the mean and divides by the variance,
 * where both statistics are computed across both spatial dimensions and across
 * the different examples in the batch.
 *
 * By default, during training time, the network is computing global
 * mean/variance statistics via a running average, which is then used at test
 * time to allow deterministic outputs for each input. You can manually toggle
 * whether the network is accumulating or using the statistics via the
 * use_global_stats option. For reference, these statistics are kept in the
 * layer's three blobs: (0) mean, (1) variance, and (2) moving average factor.
 *
 * Note that the original paper also included a per-channel learned bias and
 * scaling factor. To implement this in Caffe, define a ScaleLayer configured
 * with bias_term: true after each BatchNormLayer to handle both the bias
 * and scaling factor.
 *
 * [1] S. Ioffe and C. Szegedy, "Batch Normalization: Accelerating Deep Network
 *     Training by Reducing Internal Covariate Shift." arXiv preprint
 *     arXiv:1502.03167 (2015).
 *
 * TODO(dox): thorough documentation for Forward, Backward, and proto params.
 */
template <typename Dtype>
class BatchNormLayer : public Layer<Dtype> {
 public:
  explicit BatchNormLayer(const LayerParameter& param)
      : Layer<Dtype>(param) {}
  virtual void LayerSetUp(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top);
  virtual void Reshape(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top);

  virtual inline const char* type() const { return "BatchNorm"; }
  virtual inline int ExactNumBottomBlobs() const { return 1; }
  virtual inline int ExactNumTopBlobs() const { return 1; }

 protected:
  virtual void Forward_cpu(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top);
  virtual void Forward_gpu(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top);
  virtual void Backward_cpu(const vector<Blob<Dtype>*>& top,
      const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom);
  virtual void Backward_gpu(const vector<Blob<Dtype>*>& top,
     const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom);

  // mean, variance, scratch buffer, cached normalized output
  Blob<Dtype> mean_, variance_, temp_, x_norm_;
  bool use_global_stats_;
  Dtype moving_average_fraction_;
  int channels_;
  Dtype eps_;

  // extra temporary variables used to carry out sums/broadcasting
  // using BLAS
  Blob<Dtype> batch_sum_multiplier_;
  Blob<Dtype> num_by_chans_;
  Blob<Dtype> spatial_sum_multiplier_;
};

}  // namespace caffe

#endif  // CAFFE_BATCHNORM_LAYER_HPP_

<h2>5. batch_norm_layer.cpp</h2>

#include <algorithm>
#include <vector>

#include "caffe/layers/batch_norm_layer.hpp"
#include "caffe/util/math_functions.hpp"

namespace caffe {

template <typename Dtype>
void BatchNormLayer<Dtype>::LayerSetUp(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top) {
  BatchNormParameter param = this->layer_param_.batch_norm_param();  // BatchNorm parameters
  moving_average_fraction_ = param.moving_average_fraction();  // moving-average factor
  // During training, mean and variance are computed from the current mini-batch.
  // During testing, the accumulated moving-average statistics are used instead.
  use_global_stats_ = this->phase_ == TEST;
  if (param.has_use_global_stats())
    use_global_stats_ = param.use_global_stats();
  if (bottom[0]->num_axes() == 1)
    channels_ = 1;
  else
    channels_ = bottom[0]->shape(1);
  eps_ = param.eps();
  if (this->blobs_.size() > 0) {
    LOG(INFO) << "Skipping parameter initialization";
  } else {
    this->blobs_.resize(3);  // stored statistics (kept as blobs, lr_mult forced to 0 below)
    vector<int> sz;
    sz.push_back(channels_);
    this->blobs_[0].reset(new Blob<Dtype>(sz));  // moving average of the mean, channels_ values
    this->blobs_[1].reset(new Blob<Dtype>(sz));  // moving average of the variance, channels_ values
    sz[0] = 1;
    this->blobs_[2].reset(new Blob<Dtype>(sz));  // moving-average factor, a single value
    for (int i = 0; i < 3; ++i) {
      caffe_set(this->blobs_[i]->count(), Dtype(0),
                this->blobs_[i]->mutable_cpu_data());  // initialize to 0
    }
    }
  }
  // Mask statistics from optimization by setting local learning rates
  // for mean, variance, and the bias correction to zero.
  for (int i = 0; i < this->blobs_.size(); ++i) {
    if (this->layer_param_.param_size() == i) {
      ParamSpec* fixed_param_spec = this->layer_param_.add_param();
      fixed_param_spec->set_lr_mult(0.f);
    } else {
      CHECK_EQ(this->layer_param_.param(i).lr_mult(), 0.f)
          << "Cannot configure batch normalization statistics as layer "
          << "parameters.";
    }
  }
}

template <typename Dtype>
void BatchNormLayer<Dtype>::Reshape(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top) {
  // If bottom is 1-D there is a single mean/variance; otherwise one per channel.
  if (bottom[0]->num_axes() >= 1)
    CHECK_EQ(bottom[0]->shape(1), channels_);
  top[0]->ReshapeLike(*bottom[0]);  // top[0] has the same shape as bottom[0]

  vector<int> sz;
  sz.push_back(channels_);
  mean_.Reshape(sz);  // per-channel mean
  variance_.Reshape(sz);  // per-channel variance
  // scratch buffer: holds (X - EX)^2, later the broadcast sqrt(var + eps)
  temp_.ReshapeLike(*bottom[0]);
  x_norm_.ReshapeLike(*bottom[0]);
  sz[0] = bottom[0]->shape(0);
  batch_sum_multiplier_.Reshape(sz);  // batch size

  // spatial dimension, height * width
  int spatial_dim = bottom[0]->count()/(channels_*bottom[0]->shape(0));

  if (spatial_sum_multiplier_.num_axes() == 0 ||
      spatial_sum_multiplier_.shape(0) != spatial_dim) {
    sz[0] = spatial_dim;
    spatial_sum_multiplier_.Reshape(sz);
    Dtype* multiplier_data = spatial_sum_multiplier_.mutable_cpu_data();
    // spatial_sum_multiplier_ is a vector of ones of length height * width
    caffe_set(spatial_sum_multiplier_.count(), Dtype(1), multiplier_data);
  }

  int numbychans = channels_*bottom[0]->shape(0);  // channels * batch size
  if (num_by_chans_.num_axes() == 0 ||
      num_by_chans_.shape(0) != numbychans) {
    sz[0] = numbychans;
    num_by_chans_.Reshape(sz);
    caffe_set(batch_sum_multiplier_.count(), Dtype(1),
        batch_sum_multiplier_.mutable_cpu_data());  // initialize to 1
  }
  }
}

// Forward pass: mean and variance are computed as matrix-vector products.
template <typename Dtype>
void BatchNormLayer<Dtype>::Forward_cpu(const vector<Blob<Dtype>*>& bottom,
    const vector<Blob<Dtype>*>& top) {
  const Dtype* bottom_data = bottom[0]->cpu_data();
  Dtype* top_data = top[0]->mutable_cpu_data();
  int num = bottom[0]->shape(0);  // batch size
  int spatial_dim = bottom[0]->count()/(bottom[0]->shape(0)*channels_);  // height * width

  // If input and output are different blobs, start from a copy of the input.
  if (bottom[0] != top[0]) {
    caffe_copy(bottom[0]->count(), bottom_data, top_data);
  }

  if (use_global_stats_) {
    // use the stored mean/variance estimates.
    const Dtype scale_factor = this->blobs_[2]->cpu_data()[0] == 0 ?
        0 : 1 / this->blobs_[2]->cpu_data()[0];
    caffe_cpu_scale(variance_.count(), scale_factor,
        this->blobs_[0]->cpu_data(), mean_.mutable_cpu_data());  // multiply by the scale factor
    caffe_cpu_scale(variance_.count(), scale_factor,
        this->blobs_[1]->cpu_data(), variance_.mutable_cpu_data());
  } else {
    // compute mean
    // num_by_chans_ = (1. / (num * spatial_dim)) * bottom_data * spatial_sum_multiplier_
    // bottom_data is viewed as a matrix with channels_ * num rows and
    // spatial_dim columns, so the product has channels_ * num values.
    caffe_cpu_gemv<Dtype>(CblasNoTrans, channels_ * num, spatial_dim,
        1. / (num * spatial_dim), bottom_data,
        spatial_sum_multiplier_.cpu_data(), 0.,
        num_by_chans_.mutable_cpu_data());

    // mean_ = 1 * num_by_chans_^T * batch_sum_multiplier_
    // num_by_chans_ is viewed as a matrix with num rows and channels_ columns;
    // summing over the batch leaves channels_ values.
    caffe_cpu_gemv<Dtype>(CblasTrans, num, channels_, 1.,
        num_by_chans_.cpu_data(), batch_sum_multiplier_.cpu_data(), 0.,
        mean_.mutable_cpu_data());
  }

  // subtract mean
  // num_by_chans_ = 1 * batch_sum_multiplier_ * mean_
  caffe_cpu_gemm<Dtype>(CblasNoTrans, CblasNoTrans, num, channels_, 1, 1,
      batch_sum_multiplier_.cpu_data(), mean_.cpu_data(), 0.,
      num_by_chans_.mutable_cpu_data());
  // top_data = -1 * num_by_chans_ * spatial_sum_multiplier_ + 1.0 * top_data,
  // i.e. mean_ is subtracted from every element of top_data.
  caffe_cpu_gemm<Dtype>(CblasNoTrans, CblasNoTrans, channels_ * num,
      spatial_dim, 1, -1, num_by_chans_.cpu_data(),
      spatial_sum_multiplier_.cpu_data(), 1., top_data);

  if (!use_global_stats_) {
    // compute variance using var(X) = E((X-EX)^2)
    // temp_ receives the squared deviation of every element
    caffe_sqr<Dtype>(top[0]->count(), top_data, temp_.mutable_cpu_data());  // (X-EX)^2
    // num_by_chans_ = (1. / (num * spatial_dim)) * temp_ * spatial_sum_multiplier_
    caffe_cpu_gemv<Dtype>(CblasNoTrans, channels_ * num, spatial_dim,
        1. / (num * spatial_dim), temp_.cpu_data(),
        spatial_sum_multiplier_.cpu_data(), 0.,
        num_by_chans_.mutable_cpu_data());  // matrix-vector product
    // variance_ = 1.0 * num_by_chans_^T * batch_sum_multiplier_
    caffe_cpu_gemv<Dtype>(CblasTrans, num, channels_, 1.,
        num_by_chans_.cpu_data(), batch_sum_multiplier_.cpu_data(), 0.,
        variance_.mutable_cpu_data());  // E((X-EX)^2)

    // compute and save moving average
    // step [F1] of the overview in section 1
    this->blobs_[2]->mutable_cpu_data()[0] *= moving_average_fraction_;
    this->blobs_[2]->mutable_cpu_data()[0] += 1;

    // step [F2] of the overview in section 1:
    // this->blobs_[0] = 1 * mean_ + moving_average_fraction_ * this->blobs_[0]
    caffe_cpu_axpby(mean_.count(), Dtype(1), mean_.cpu_data(),
        moving_average_fraction_, this->blobs_[0]->mutable_cpu_data());
    // m = num * height * width
    int m = bottom[0]->count()/channels_;

    Dtype bias_correction_factor = m > 1 ? Dtype(m)/(m-1) : 1;
    // step [F3] of the overview in section 1, using the unbiased-variance
    // factor m/(m-1):
    // this->blobs_[1] = bias_correction_factor * variance_
    //                   + moving_average_fraction_ * this->blobs_[1]
    caffe_cpu_axpby(variance_.count(), bias_correction_factor,
        variance_.cpu_data(), moving_average_fraction_,
        this->blobs_[1]->mutable_cpu_data());
  }

  // normalize variance
  // variance_ = variance_ + eps_ (add a small constant)
  caffe_add_scalar(variance_.count(), eps_, variance_.mutable_cpu_data());
  // element-wise square root of variance_
  caffe_sqrt(variance_.count(), variance_.cpu_data(),
             variance_.mutable_cpu_data());

  // replicate variance to input size
  // The two gemm calls below broadcast the channels_ values of variance_
  // to channels_ * num * height * width values.
  // num_by_chans_ = 1 * batch_sum_multiplier_ * variance_
  caffe_cpu_gemm<Dtype>(CblasNoTrans, CblasNoTrans, num, channels_, 1, 1,
      batch_sum_multiplier_.cpu_data(), variance_.cpu_data(), 0.,
      num_by_chans_.mutable_cpu_data());
  // temp_ = 1.0 * num_by_chans_ * spatial_sum_multiplier_
  caffe_cpu_gemm<Dtype>(CblasNoTrans, CblasNoTrans, channels_ * num,
      spatial_dim, 1, 1., num_by_chans_.cpu_data(),
      spatial_sum_multiplier_.cpu_data(), 0., temp_.mutable_cpu_data());

  // element-wise: top_data[i] = top_data[i] / temp_[i]
  caffe_div(temp_.count(), top_data, temp_.cpu_data(), top_data);
  // TODO(cdoersch): The caching is only needed because later in-place layers
  //                 might clobber the data.  Can we skip this if they won't?
  // cache the result in x_norm_ for the backward pass
  caffe_copy(x_norm_.count(), top_data, x_norm_.mutable_cpu_data());
}


// See the backward formulas in the overview (section 1).
template <typename Dtype>
void BatchNormLayer<Dtype>::Backward_cpu(const vector<Blob<Dtype>*>& top,
    const vector<bool>& propagate_down,
    const vector<Blob<Dtype>*>& bottom) {
  const Dtype* top_diff;  // gradient with respect to the layer output
  if (bottom[0] != top[0]) {
    top_diff = top[0]->cpu_diff();
  } else {
    caffe_copy(x_norm_.count(), top[0]->cpu_diff(), x_norm_.mutable_cpu_diff());
    top_diff = x_norm_.cpu_diff();
  }
  Dtype* bottom_diff = bottom[0]->mutable_cpu_diff();
  if (use_global_stats_) {
    caffe_div(temp_.count(), top_diff, temp_.cpu_data(), bottom_diff);
    return;
  }
  const Dtype* top_data = x_norm_.cpu_data();
  int num = bottom[0]->shape()[0];
  int spatial_dim = bottom[0]->count()/(bottom[0]->shape(0)*channels_);
  // if Y = (X-mean(X))/(sqrt(var(X)+eps)), then
  //
  // dE(Y)/dX =
  //   (dE/dY - mean(dE/dY) - mean(dE/dY cdot Y) cdot Y)
  //     ./ sqrt(var(X) + eps)
  //
  // where cdot and ./ are hadamard product and elementwise division,
  // respectively, dE/dY is the top diff, and mean/var/sum are all computed
  // along all dimensions except the channels dimension.  In the above
  // equation, the operations allow for expansion (i.e. broadcast) along all
  // dimensions except the channels dimension where required.

  // sum(dE/dY cdot Y)
  caffe_mul(temp_.count(), top_data, top_diff, bottom_diff);
  caffe_cpu_gemv<Dtype>(CblasNoTrans, channels_ * num, spatial_dim, 1.,
      bottom_diff, spatial_sum_multiplier_.cpu_data(), 0.,
      num_by_chans_.mutable_cpu_data());
  caffe_cpu_gemv<Dtype>(CblasTrans, num, channels_, 1.,
      num_by_chans_.cpu_data(), batch_sum_multiplier_.cpu_data(), 0.,
      mean_.mutable_cpu_data());

  // reshape (broadcast) the above
  caffe_cpu_gemm<Dtype>(CblasNoTrans, CblasNoTrans, num, channels_, 1, 1,
      batch_sum_multiplier_.cpu_data(), mean_.cpu_data(), 0.,
      num_by_chans_.mutable_cpu_data());
  caffe_cpu_gemm<Dtype>(CblasNoTrans, CblasNoTrans, channels_ * num,
      spatial_dim, 1, 1., num_by_chans_.cpu_data(),
      spatial_sum_multiplier_.cpu_data(), 0., bottom_diff);

  // sum(dE/dY cdot Y) cdot Y
  caffe_mul(temp_.count(), top_data, bottom_diff, bottom_diff);

  // sum(dE/dY)-sum(dE/dY cdot Y) cdot Y
  caffe_cpu_gemv<Dtype>(CblasNoTrans, channels_ * num, spatial_dim, 1.,
      top_diff, spatial_sum_multiplier_.cpu_data(), 0.,
      num_by_chans_.mutable_cpu_data());
  caffe_cpu_gemv<Dtype>(CblasTrans, num, channels_, 1.,
      num_by_chans_.cpu_data(), batch_sum_multiplier_.cpu_data(), 0.,
      mean_.mutable_cpu_data());
  // reshape (broadcast) the above to make
  // sum(dE/dY)-sum(dE/dY cdot Y) cdot Y
  caffe_cpu_gemm<Dtype>(CblasNoTrans, CblasNoTrans, num, channels_, 1, 1,
      batch_sum_multiplier_.cpu_data(), mean_.cpu_data(), 0.,
      num_by_chans_.mutable_cpu_data());
  caffe_cpu_gemm<Dtype>(CblasNoTrans, CblasNoTrans, num * channels_,
      spatial_dim, 1, 1., num_by_chans_.cpu_data(),
      spatial_sum_multiplier_.cpu_data(), 1., bottom_diff);

  // dE/dY - mean(dE/dY)-mean(dE/dY cdot Y) cdot Y
  caffe_cpu_axpby(temp_.count(), Dtype(1), top_diff,
      Dtype(-1. / (num * spatial_dim)), bottom_diff);

  // note: temp_ still contains sqrt(var(X)+eps), computed during the forward
  // pass.
  caffe_div(temp_.count(), bottom_diff, temp_.cpu_data(), bottom_diff);
}


#ifdef CPU_ONLY
STUB_GPU(BatchNormLayer);
#endif

INSTANTIATE_CLASS(BatchNormLayer);
REGISTER_LAYER_CLASS(BatchNorm);
}  // namespace caffe
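
The comment in Backward_cpu gives the gradient as (dE/dY - mean(dE/dY) - mean(dE/dY cdot Y) cdot Y) ./ sqrt(var(X) + eps). The following single-channel sketch writes that formula out with plain loops, which can be easier to check than the gemv/gemm sequence. It assumes the biased (1/n) variance used in the forward pass. BnOut, bn_forward_1ch and bn_backward_1ch are hypothetical names for this example only.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative result type: normalized values plus sqrt(var + eps).
struct BnOut {
  std::vector<double> y;
  double std_dev;
};

// Forward pass for a single channel, using the biased (1/n) variance.
BnOut bn_forward_1ch(const std::vector<double>& x, double eps) {
  const double n = static_cast<double>(x.size());
  double mean = 0.0;
  for (double v : x) mean += v / n;
  double var = 0.0;
  for (double v : x) var += (v - mean) * (v - mean) / n;
  BnOut out;
  out.std_dev = std::sqrt(var + eps);
  out.y.resize(x.size());
  for (std::size_t i = 0; i < x.size(); ++i) {
    out.y[i] = (x[i] - mean) / out.std_dev;
  }
  return out;
}

// Backward pass: dx = (dy - mean(dy) - mean(dy * y) * y) / sqrt(var + eps).
std::vector<double> bn_backward_1ch(const BnOut& fwd,
                                    const std::vector<double>& dy) {
  assert(dy.size() == fwd.y.size());
  const double n = static_cast<double>(dy.size());
  double mean_dy = 0.0, mean_dy_y = 0.0;
  for (std::size_t i = 0; i < dy.size(); ++i) {
    mean_dy += dy[i] / n;
    mean_dy_y += dy[i] * fwd.y[i] / n;
  }
  std::vector<double> dx(dy.size());
  for (std::size_t i = 0; i < dy.size(); ++i) {
    dx[i] = (dy[i] - mean_dy - mean_dy_y * fwd.y[i]) / fwd.std_dev;
  }
  return dx;
}
```

A simple way to gain confidence in such a sketch is to compare dx against central finite differences of the loss sum(dy[j] * y[j]) with respect to each x[i].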

<h2>6. Scale definition in caffe.proto</h2>

message LayerParameter {
    optional ScaleParameter scale_param = 142;
}

message ScaleParameter {
  // The first axis of bottom[0] (the first input Blob) along which to apply
  // bottom[1] (the second input Blob).  May be negative to index from the end
  // (e.g., -1 for the last axis).
  // For example, if bottom[0] is 4D with shape 100x3x40x60, the output
  // top[0] will have the same shape, and bottom[1] may have any of the
  // following shapes (for the given value of axis):
  //    (axis == 0 == -4) 100; 100x3; 100x3x40; 100x3x40x60
  //    (axis == 1 == -3)          3;     3x40;     3x40x60
  //    (axis == 2 == -2)                   40;       40x60
  //    (axis == 3 == -1)                                60
  // Furthermore, bottom[1] may have the empty shape (regardless of the value of
  // "axis") -- a scalar multiplier.
  optional int32 axis = 1 [default = 1];

  // (num_axes is ignored unless just one bottom is given and the scale is
  // a learned parameter of the layer.  Otherwise, num_axes is determined by the
  // number of axes by the second bottom.)
  // The number of axes of the input (bottom[0]) covered by the scale
  // parameter, or -1 to cover all axes of bottom[0] starting from axis.
  // Set num_axes := 0, to multiply with a zero-axis Blob: a scalar.
  optional int32 num_axes = 2 [default = 1];

  // (filler is ignored unless just one bottom is given and the scale is
  // a learned parameter of the layer.)
  // The initialization for the learned scale parameter.
  // Default is the unit (1) initialization, resulting in the ScaleLayer
  // initially performing the identity operation.
  optional FillerParameter filler = 3;

  // Whether to also learn a bias (equivalent to a ScaleLayer+BiasLayer, but
  // may be more efficient).  Initialized with bias_filler (defaults to 0).
  optional bool bias_term = 4 [default = false];
  optional FillerParameter bias_filler = 5;
}

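In short:

optional int32 axis [default = 1]; // first axis of bottom[0] that the scale applies to
optional int32 num_axes [default = 1]; // number of axes covered by the learned scale; ignored when the scale comes from a second bottom, and the default is fine for BN
optional FillerParameter filler; // initializer for the scale $\gamma$ (defaults to 1)
optional FillerParameter bias_filler; // initializer for the bias $\beta$ (defaults to 0)
optional bool bias_term = 4 [default = false]; // whether to learn a bias; without it the layer reduces to y = $\gamma$ * x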
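
To make the interplay of axis and num_axes concrete, the sketch below computes the shape a learned scale blob would get for a given bottom shape, following the rules in the proto comments (negative axis counts from the end, num_axes == -1 extends to the last axis, num_axes == 0 gives a scalar). The function scale_shape is a hypothetical helper written for this illustration, not a Caffe function.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: returns the shape of the learned scale blob for a
// bottom blob of shape `bottom_shape`, given the axis and num_axes fields.
std::vector<int> scale_shape(const std::vector<int>& bottom_shape,
                             int axis, int num_axes) {
  const int rank = static_cast<int>(bottom_shape.size());
  // Negative axes count from the end, e.g. -1 is the last axis.
  const int first = axis < 0 ? axis + rank : axis;
  assert(first >= 0 && first < rank);
  assert(num_axes >= -1);
  // num_axes == -1 covers every axis from `first` to the end.
  const int last = (num_axes == -1) ? rank : first + num_axes;
  assert(last <= rank);
  return std::vector<int>(bottom_shape.begin() + first,
                          bottom_shape.begin() + last);
}
```

For the 100x3x40x60 example from the proto comment, the defaults axis = 1 and num_axes = 1 select the shape {3}, that is, one scale value per channel, which is the BN use case.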

<h2>7. scale_layer.hpp</h2>

#ifndef CAFFE_SCALE_LAYER_HPP_
#define CAFFE_SCALE_LAYER_HPP_

#include <vector>

#include "caffe/blob.hpp"
#include "caffe/layer.hpp"
#include "caffe/proto/caffe.pb.h"

#include "caffe/layers/bias_layer.hpp"

namespace caffe {

/**
 * @brief Computes the elementwise product of two input Blobs, with the shape of
 *        the latter Blob "broadcast" to match the shape of the former.
 *        Equivalent to tiling the latter Blob, then computing the elementwise
 *        product. Note: for efficiency and convenience, this layer can
 *        additionally perform a "broadcast" sum too when bias_term: true
 *        is set.
 *
 * The latter, scale input may be omitted, in which case it's learned as
 * parameter of the layer (as is the bias, if it is included).
 */
template <typename Dtype>
class ScaleLayer: public Layer<Dtype> {
 public:
  explicit ScaleLayer(const LayerParameter& param)
      : Layer<Dtype>(param) {}
  virtual void LayerSetUp(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top);
  virtual void Reshape(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top);

  virtual inline const char* type() const { return "Scale"; }
  // Scale
  virtual inline int MinBottomBlobs() const { return 1; }
  virtual inline int MaxBottomBlobs() const { return 2; }
  virtual inline int ExactNumTopBlobs() const { return 1; }

 protected:
  /**
   * In the below shape specifications, @f$ i @f$ denotes the value of the
   * axis field given by this->layer_param_.scale_param().axis(), after
   * canonicalization (i.e., conversion from negative to positive index,
   * if applicable).
   *
   * @param bottom input Blob vector (length 2)
   *   -# @f$ (d_0 \times ... \times
   *           d_i \times ... \times d_j \times ... \times d_n) @f$
   *      the first factor @f$ x @f$
   *   -# @f$ (d_i \times ... \times d_j) @f$
   *      the second factor @f$ y @f$
   * @param top output Blob vector (length 1)
   *   -# @f$ (d_0 \times ... \times
   *           d_i \times ... \times d_j \times ... \times d_n) @f$
   *      the product @f$ z = x y @f$ computed after "broadcasting" y.
   *      Equivalent to tiling @f$ y @f$ to have the same shape as @f$ x @f$,
   *      then computing the elementwise product.
   */
  virtual void Forward_cpu(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top);
  virtual void Forward_gpu(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top);
  virtual void Backward_cpu(const vector<Blob<Dtype>*>& top,
      const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom);
  virtual void Backward_gpu(const vector<Blob<Dtype>*>& top,
      const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom);

  shared_ptr<Layer<Dtype> > bias_layer_;
  vector<Blob<Dtype>*> bias_bottom_vec_;
  vector<bool> bias_propagate_down_;
  int bias_param_id_;

  Blob<Dtype> sum_multiplier_;
  Blob<Dtype> sum_result_;
  Blob<Dtype> temp_;
  int axis_;
  int outer_dim_, scale_dim_, inner_dim_;
};


}  // namespace caffe

#endif  // CAFFE_SCALE_LAYER_HPP_

<h2>8. scale_layer.cpp</h2>

#include <algorithm>
#include <vector>

#include "caffe/filler.hpp"
#include "caffe/layer_factory.hpp"
#include "caffe/layers/scale_layer.hpp"
#include "caffe/util/math_functions.hpp"

namespace caffe {

template <typename Dtype>
void ScaleLayer<Dtype>::LayerSetUp(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top) {
  const ScaleParameter& param = this->layer_param_.scale_param();  // Scale parameters
  // Check whether the layer's parameter blobs have already been initialized.
  if (bottom.size() == 1 && this->blobs_.size() > 0) {
    LOG(INFO) << "Skipping parameter initialization";
  } else if (bottom.size() == 1) {
    // scale is a learned parameter; initialize it
    axis_ = bottom[0]->CanonicalAxisIndex(param.axis());
    const int num_axes = param.num_axes();
    CHECK_GE(num_axes, -1) << "num_axes must be non-negative, "
                           << "or -1 to extend to the end of bottom[0]";
    if (num_axes >= 0) {
      CHECK_GE(bottom[0]->num_axes(), axis_ + num_axes)
          << "scale blob's shape extends past bottom[0]'s shape when applied "
          << "starting with bottom[0] axis = " << axis_;
    }
    this->blobs_.resize(1); // gamma

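    // The scale blob takes the shape of bottom[0] over axes
    // [axis_, axis_ + num_axes), or up to the last axis when num_axes == -1.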
    const vector<int>::const_iterator& shape_start =
        bottom[0]->shape().begin() + axis_;
    const vector<int>::const_iterator& shape_end =
        (num_axes == -1) ? bottom[0]->shape().end() : (shape_start + num_axes);
    vector<int> scale_shape(shape_start, shape_end);
    this->blobs_[0].reset(new Blob<Dtype>(scale_shape));
    FillerParameter filler_param(param.filler());
    if (!param.has_filler()) {
      // No filler specified: default to a constant filler of 1,
      // so the layer starts as an identity operation.
      filler_param.set_type("constant");
      filler_param.set_value(1);
    }
    shared_ptr<Filler<Dtype> > filler(GetFiller<Dtype>(filler_param));
    filler->Fill(this->blobs_[0].get());
  }
  if (param.bias_term()) { // optional bias term, handled by an internal Bias layer
    LayerParameter layer_param(this->layer_param_);
    layer_param.set_type("Bias");
    BiasParameter* bias_param = layer_param.mutable_bias_param();
    bias_param->set_axis(param.axis());
    if (bottom.size() > 1) {
      bias_param->set_num_axes(bottom[1]->num_axes());
    } else {
      bias_param->set_num_axes(param.num_axes());
    }
    bias_param->mutable_filler()->CopyFrom(param.bias_filler());
    bias_layer_ = LayerRegistry<Dtype>::CreateLayer(layer_param);
    bias_bottom_vec_.resize(1);
    bias_bottom_vec_[0] = bottom[0];
    bias_layer_->SetUp(bias_bottom_vec_, top);
    if (this->blobs_.size() + bottom.size() < 3) {
      // case: blobs.size == 1 && bottom.size == 1
      // or blobs.size == 0 && bottom.size == 2
      bias_param_id_ = this->blobs_.size();
      this->blobs_.resize(bias_param_id_ + 1);
      this->blobs_[bias_param_id_] = bias_layer_->blobs()[0];
    } else {
      // bias param already initialized
      bias_param_id_ = this->blobs_.size() - 1;
      bias_layer_->blobs()[0] = this->blobs_[bias_param_id_];
    }
    bias_propagate_down_.resize(1, false);
  }
  this->param_propagate_down_.resize(this->blobs_.size(), true);
}

template <typename Dtype>
void ScaleLayer<Dtype>::Reshape(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top) {
  const ScaleParameter& param = this->layer_param_.scale_param();
  Blob<Dtype>* scale = (bottom.size() > 1) ? bottom[1] : this->blobs_[0].get();
  // Always set axis_ == 0 in special case where scale is a scalar
  // (num_axes == 0). Mathematically equivalent for any choice of axis_, so the
  // actual setting can be safely ignored; and computation is most efficient
  // with axis_ == 0 and (therefore) outer_dim_ == 1. (Setting axis_ to
  // bottom[0]->num_axes() - 1, giving inner_dim_ == 1, would be equally
  // performant.)
  axis_ = (scale->num_axes() == 0) ?
      0 : bottom[0]->CanonicalAxisIndex(param.axis());
  CHECK_GE(bottom[0]->num_axes(), axis_ + scale->num_axes())
      << "scale blob's shape extends past bottom[0]'s shape when applied "
      << "starting with bottom[0] axis = " << axis_;
  for (int i = 0; i < scale->num_axes(); ++i) {
    CHECK_EQ(bottom[0]->shape(axis_ + i), scale->shape(i))
        << "dimension mismatch between bottom[0]->shape(" << axis_ + i
        << ") and scale->shape(" << i << ")";
  }
  outer_dim_ = bottom[0]->count(0, axis_);
  scale_dim_ = scale->count();
  inner_dim_ = bottom[0]->count(axis_ + scale->num_axes());

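  // If top[0] and bottom[0] are the same blob (same name in the prototxt), the
  // layer runs in place; temp_ then holds a copy of the input for Backward.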
  if (bottom[0] == top[0]) {  // in-place computation
    temp_.ReshapeLike(*bottom[0]);
  } else {
    top[0]->ReshapeLike(*bottom[0]);
  }

  sum_result_.Reshape(vector<int>(1, outer_dim_ * scale_dim_));
  const int sum_mult_size = std::max(outer_dim_, inner_dim_);
  sum_multiplier_.Reshape(vector<int>(1, sum_mult_size));
  if (sum_multiplier_.cpu_data()[sum_mult_size - 1] != Dtype(1)) {
    caffe_set(sum_mult_size, Dtype(1), sum_multiplier_.mutable_cpu_data());
  }
  if (bias_layer_) {
    bias_bottom_vec_[0] = top[0];
    bias_layer_->Reshape(bias_bottom_vec_, top);
  }
}


template <typename Dtype>
void ScaleLayer<Dtype>::Forward_cpu(
    const vector<Blob<Dtype>*>& bottom, const vector<Blob<Dtype>*>& top) {
  const Dtype* bottom_data = bottom[0]->cpu_data();
  if (bottom[0] == top[0]) {
    // In-place computation; need to store bottom data before overwriting it.
    // Note that this is only necessary for Backward; we could skip this if not
    // doing Backward, but Caffe currently provides no way of knowing whether
    // we'll need to do Backward at the time of the Forward call.
    // So copy the input into temp_ first, then compute.
    caffe_copy(bottom[0]->count(), bottom[0]->cpu_data(),
               temp_.mutable_cpu_data());
  }
  const Dtype* scale_data =
      ((bottom.size() > 1) ? bottom[1] : this->blobs_[0].get())->cpu_data();
  Dtype* top_data = top[0]->mutable_cpu_data();
  for (int n = 0; n < outer_dim_; ++n) {
    for (int d = 0; d < scale_dim_; ++d) {
      const Dtype factor = scale_data[d];
      caffe_cpu_scale(inner_dim_, factor, bottom_data, top_data);
      bottom_data += inner_dim_;
      top_data += inner_dim_;
    }
  }
  if (bias_layer_) {
    bias_layer_->Forward(bias_bottom_vec_, top);
  }
}

template <typename Dtype>
void ScaleLayer<Dtype>::Backward_cpu(const vector<Blob<Dtype>*>& top,
    const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom) {
  if (bias_layer_ &&
      this->param_propagate_down_[this->param_propagate_down_.size() - 1]) {
    bias_layer_->Backward(top, bias_propagate_down_, bias_bottom_vec_);
  }
  const bool scale_param = (bottom.size() == 1);
  Blob<Dtype>* scale = scale_param ? this->blobs_[0].get() : bottom[1];
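  // Gradient w.r.t. the scale: the elementwise product top_diff * bottom_data,
  // summed over the outer and inner dimensions.
  if ((!scale_param && propagate_down[1]) ||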
      (scale_param && this->param_propagate_down_[0])) {
    const Dtype* top_diff = top[0]->cpu_diff();
    const bool in_place = (bottom[0] == top[0]);
    const Dtype* bottom_data = (in_place ? &temp_ : bottom[0])->cpu_data();
    // Hack: store big eltwise product in bottom[0] diff, except in the special
    // case where this layer itself does the eltwise product, in which case we
    // can store it directly in the scale diff, and we're done.
    // If we're computing in-place (and not doing eltwise computation), this
    // hack doesn't work and we store the product in temp_.
    const bool is_eltwise = (bottom[0]->count() == scale->count());
    Dtype* product = (is_eltwise ? scale->mutable_cpu_diff() :
        (in_place ? temp_.mutable_cpu_data() : bottom[0]->mutable_cpu_diff()));
    caffe_mul(top[0]->count(), top_diff, bottom_data, product);
    if (!is_eltwise) {
      Dtype* sum_result = NULL;
      if (inner_dim_ == 1) {
        sum_result = product;
      } else if (sum_result_.count() == 1) {
        const Dtype* sum_mult = sum_multiplier_.cpu_data();
        Dtype* scale_diff = scale->mutable_cpu_diff();
        if (scale_param) {
          Dtype result = caffe_cpu_dot(inner_dim_, product, sum_mult);
          *scale_diff += result;
        } else {
          *scale_diff = caffe_cpu_dot(inner_dim_, product, sum_mult);
        }
      } else {
        const Dtype* sum_mult = sum_multiplier_.cpu_data();
        sum_result = (outer_dim_ == 1) ?
            scale->mutable_cpu_diff() : sum_result_.mutable_cpu_data();
        caffe_cpu_gemv(CblasNoTrans, sum_result_.count(), inner_dim_,
                       Dtype(1), product, sum_mult, Dtype(0), sum_result);
      }
      if (outer_dim_ != 1) {
        const Dtype* sum_mult = sum_multiplier_.cpu_data();
        Dtype* scale_diff = scale->mutable_cpu_diff();
        if (scale_dim_ == 1) {
          if (scale_param) {
            Dtype result = caffe_cpu_dot(outer_dim_, sum_mult, sum_result);
            *scale_diff += result;
          } else {
            *scale_diff = caffe_cpu_dot(outer_dim_, sum_mult, sum_result);
          }
        } else {
          caffe_cpu_gemv(CblasTrans, outer_dim_, scale_dim_,
                         Dtype(1), sum_result, sum_mult, Dtype(scale_param),
                         scale_diff);
        }
      }
    }
  }
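  // Gradient w.r.t. the input: bottom_diff = scale * top_diff, with the scale
  // broadcast exactly as in Forward_cpu.
  if (propagate_down[0]) {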
    const Dtype* top_diff = top[0]->cpu_diff();
    const Dtype* scale_data = scale->cpu_data();
    Dtype* bottom_diff = bottom[0]->mutable_cpu_diff();
    for (int n = 0; n < outer_dim_; ++n) {
      for (int d = 0; d < scale_dim_; ++d) {
        const Dtype factor = scale_data[d];
        caffe_cpu_scale(inner_dim_, factor, top_diff, bottom_diff);
        bottom_diff += inner_dim_;
        top_diff += inner_dim_;
      }
    }
  }
}

#ifdef CPU_ONLY
STUB_GPU(ScaleLayer);
#endif

INSTANTIATE_CLASS(ScaleLayer);
REGISTER_LAYER_CLASS(Scale);

}  // namespace caffe
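
The loops in `Forward_cpu` and `Backward_cpu` can be hard to follow through the BLAS-style helpers. Below is a minimal stand-alone sketch of the same arithmetic, assuming flat `std::vector<float>` buffers laid out as outer x scale x inner. `ScaleForward` and `ScaleGradGamma` are hypothetical names used only for this illustration; they are not Caffe functions, and the sketch leaves out the bias layer and the in-place handling.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch (not Caffe API): y = gamma[d] * x, where d is the index
// along the scale dimension and x is laid out as [outer][scale_dim][inner].
inline std::vector<float> ScaleForward(const std::vector<float>& x,
                                       const std::vector<float>& gamma,
                                       int outer, int inner) {
  const int scale_dim = static_cast<int>(gamma.size());
  assert(x.size() == static_cast<std::size_t>(outer) * scale_dim * inner);
  std::vector<float> y(x.size());
  std::size_t idx = 0;
  for (int n = 0; n < outer; ++n)
    for (int d = 0; d < scale_dim; ++d)
      for (int i = 0; i < inner; ++i, ++idx)
        y[idx] = gamma[d] * x[idx];
  return y;
}

// Hypothetical sketch (not Caffe API): gradient w.r.t. gamma, i.e. the sum of
// dy * x over the outer and inner dimensions for each scale index d.
inline std::vector<float> ScaleGradGamma(const std::vector<float>& x,
                                         const std::vector<float>& dy,
                                         int scale_dim, int outer, int inner) {
  assert(x.size() == dy.size());
  assert(x.size() == static_cast<std::size_t>(outer) * scale_dim * inner);
  std::vector<float> dgamma(scale_dim, 0.0f);
  std::size_t idx = 0;
  for (int n = 0; n < outer; ++n)
    for (int d = 0; d < scale_dim; ++d)
      for (int i = 0; i < inner; ++i, ++idx)
        dgamma[d] += dy[idx] * x[idx];
  return dgamma;
}
```

The gradient with respect to the input has the same form as the forward pass: calling `ScaleForward(dy, gamma, outer, inner)` yields `gamma[d] * dy`, which is what the final loop of `Backward_cpu` computes.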

