Softmax is a common operator in deep learning models. PyTorch's Softmax operator calls into cuDNN directly, while OneFlow internally selects among three kernels according to the number of classes in the input, and in most cases outperforms cuDNN. Benchmark results can be found in 如何实现一个高效的Softmax CUDA kernel?——OneFlow 性能优化分享 (How to implement an efficient Softmax CUDA kernel? Notes on OneFlow performance optimization). The implementation is described below. OneFlow's static layering is shown in the figure below:
[Figure: static layering, from the Python side (Module, Functional) down to the C++ side (Functor, Op, Kernel, Primitive).]
oneflow.nn.functional.softmax calls the C++ implementation directly.
```python
# ref https://github.com/pytorch/pytorch/blob/master/torch/nn/functional.py
def softmax(input: Tensor, dim: Optional[int] = None, dtype=None) -> Tensor:
    r"""Applies a softmax function.

    Softmax is defined as:

    :math:`\text{Softmax}(x_{i}) = \frac{\exp(x_i)}{\sum_j \exp(x_j)}`

    It is applied to all slices along dim, and will re-scale them so that the elements
    lie in the range `[0, 1]` and sum to 1.

    See :class:`~oneflow.nn.Softmax` for more details.

    Args:
        input (Tensor): input
        dim (int): A dimension along which softmax will be computed.
        dtype (:class:`oneflow.dtype`, optional): the desired data type of returned tensor.
          If specified, the input tensor is casted to :attr:`dtype` before the operation
          is performed. This is useful for preventing data type overflows. Default: None.

    .. note::
        This function doesn't work directly with NLLLoss,
        which expects the Log to be computed between the Softmax and itself.
        Use log_softmax instead (it's faster and has better numerical properties).
    """
    if dtype is None:
        ret = flow._C.softmax(input, dim)
    else:
        ret = flow._C.softmax(input.to(dtype), dim)
    return ret
```
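As a quick numeric check of the definition in the docstring: for an arbitrary row $x = [1, 2, 3]$,

$$\text{Softmax}(x) = \left[\frac{e^{1}}{e^{1}+e^{2}+e^{3}},\ \frac{e^{2}}{e^{1}+e^{2}+e^{3}},\ \frac{e^{3}}{e^{1}+e^{2}+e^{3}}\right] \approx [0.090,\ 0.245,\ 0.665],$$

with every entry in $[0, 1]$ and the entries summing to 1.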
OneFlow has two kinds of operators (ops): system ops and user ops. The definitions of OneFlow user ops and their kernel implementations live under the oneflow/user/ops and oneflow/user/kernels directories respectively.
```
def OneFlow_SoftmaxOp : OneFlow_BaseOp<"softmax", [NoMemoryEffect, DeclareOpInterfaceMethods<UserOpCompatibleInterface>]> {
  let input = (ins
    OneFlow_Tensor:$in
  );
  let output = (outs
    OneFlow_Tensor:$out
  );
  let has_logical_tensor_desc_infer_fn = 1;
  let has_physical_tensor_desc_infer_fn = 1;
  let has_get_sbp_fn = 1;
  let has_data_type_infer_fn = 1;
  let has_compute_complexity_fn = 1;
}
```
The Functor layer, part of OneFlow's infrastructure, provides a unified entry point for op operations from both the Python and C++ sides. For each op, the Functor layer checks the input tensors' shape, dtype, dimensionality, element count, and so on, and parses and handles the op's specific logic.
In the file oneflow/core/functional/impl/activation_functor.cpp, ONEFLOW_FUNCTION_LIBRARY registers SoftmaxFunctor as the implementation of Softmax.
```mermaid
graph TD
    SoftmaxFunctor --> SoftmaxFunctorBase
    LogSoftmaxFunctor --> SoftmaxFunctorBase
```
OpBuilder is used to build the UserOp.
```cpp
class SoftmaxFunctor : public SoftmaxFunctorBase {
 public:
  SoftmaxFunctor() {
    op_ = CHECK_JUST(one::OpBuilder("softmax").Input("in").Output("out").Build());
  }
};
```
Tensor::shape returns the input's Shape. Shape::NumAxes calls oneflow::Shape::NumAxes to return the number of dimensions.
```cpp
class SoftmaxFunctorBase {
 public:
  Maybe<Tensor> operator()(const std::shared_ptr<one::Tensor>& input,
                           const Optional<int64_t>& dim) const {
    const auto input_shape = input->shape();
    const int64_t num_axes = input_shape->NumAxes();
```
When dim is not given, the `get_dim` lambda picks the default softmax dimension from the number of axes: 0 for 0-D, 1-D, and 3-D inputs, and 1 otherwise.
```cpp
    const auto get_dim = [num_axes]() -> int64_t {
      const int64_t ndim = num_axes;
      if (ndim == 0 || ndim == 1 || ndim == 3) {
        return 0;
      } else {
        return 1;
      }
    };
```
The JUST macro unwraps a Maybe expression. If dim is provided, its value is assigned to dim_; otherwise get_dim is called. maybe_wrap_dim accepts both positive and negative indices. If dim_ is not the last axis, Transpose is applied before and after the op. The sequence_function macro creates a SequenceFunction object; TransposeFunctor implements the Transpose operator; OpInterpUtil::Dispatch dispatches the op to the device for computation.
```cpp
    int64_t dim_ = dim ? JUST(dim) : get_dim();
    dim_ = JUST(maybe_wrap_dim(dim_, num_axes));
    if (dim_ != num_axes - 1) {
      std::vector<int> input_perm(input_shape->dim_vec().size(), 0);
      for (size_t i = 1; i < input_perm.size(); ++i) { input_perm[i] = i; }
      input_perm[dim_] = input_perm[input_perm.size() - 1];
      input_perm[input_perm.size() - 1] = dim_;

      return sequence_function(functional::Transpose)
          .then([&](const std::shared_ptr<one::Tensor>& x) {
            return OpInterpUtil::Dispatch<Tensor>(*op_, {x});
          })
          .then(std::bind(functional::Transpose, std::placeholders::_1, input_perm))
          .call(input, input_perm);
    }

    return OpInterpUtil::Dispatch<Tensor>(*op_, {input});
  }
```
The base class holds an OpExpr pointer, which maintains the op_name and the input/output arg information.
```cpp
 protected:
  SoftmaxFunctorBase() = default;
  virtual ~SoftmaxFunctorBase() = default;

  std::shared_ptr<OpExpr> op_;
};
```
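To make the swap above concrete, here is a minimal host-side sketch (illustrative only, not OneFlow code) that builds input_perm for a hypothetical 3-D input with dim_ = 1. Because the permutation is a single swap, it is its own inverse, which is why the very same perm is reused for the second Transpose.

```cpp
#include <cstdio>
#include <vector>

// Minimal sketch (not OneFlow code): build the permutation constructed above
// for num_axes = 3, dim_ = 1. The softmax axis is swapped with the last axis.
int main() {
  const int num_axes = 3, dim = 1;
  std::vector<int> input_perm(num_axes, 0);
  for (size_t i = 1; i < input_perm.size(); ++i) { input_perm[i] = i; }
  input_perm[dim] = input_perm[input_perm.size() - 1];
  input_perm[input_perm.size() - 1] = dim;
  // Prints 0 2 1: transpose dim 1 to the end, run softmax on the last axis,
  // then apply the same self-inverse swap to restore the original layout.
  for (int p : input_perm) { std::printf("%d ", p); }
  std::printf("\n");
  return 0;
}
```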
REGISTER_USER_KERNEL registers the softmax kernel.
```cpp
class SoftmaxKernel final : public user_op::OpKernel, public user_op::CudaGraphSupport {
 public:
  SoftmaxKernel() = default;
  ~SoftmaxKernel() override = default;

 private:
  using user_op::OpKernel::Compute;
```
```mermaid
graph TD
    A["SoftmaxKernel::Compute"] --> B["SoftmaxImpl::Launch"]
```
UserKernelComputeContext::Tensor4ArgNameAndIndex returns the Tensor for a given argument name and index. BlobTensorView::shape_view yields a ShapeView, and ConstShapeMixIn::NumAxes gives the number of dimensions. The NewSoftmaxPrimitive function creates a Softmax primitive, and its SoftmaxImpl::Launch is then invoked.
```cpp
  void Compute(user_op::KernelComputeContext* ctx) const override {
    const user_op::Tensor* in = ctx->Tensor4ArgNameAndIndex("in", 0);
    user_op::Tensor* out = ctx->Tensor4ArgNameAndIndex("out", 0);
    const ShapeView& in_shape = in->shape_view();
    const int64_t cols = in_shape.At(in_shape.NumAxes() - 1);
    const int64_t rows = in_shape.Count(0, in_shape.NumAxes() - 1);
    std::unique_ptr<ep::primitive::Softmax> primitive = NewSoftmaxPrimitive(ctx);
    CHECK(primitive);
    primitive->Launch(ctx->stream(), rows, cols, in->dptr(), out->mut_dptr());
  }
  bool AlwaysComputeWhenAllOutputsEmpty() const override { return false; }
};
```
NewPrimitive creates the primitive through the SoftmaxFactory abstract factory.
```cpp
template<typename Context>
std::unique_ptr<ep::primitive::Softmax> NewSoftmaxPrimitive(Context* ctx) {
  const DataType data_type = ctx->TensorDesc4ArgNameAndIndex("in", 0)->data_type();
  return ep::primitive::NewPrimitive<ep::primitive::SoftmaxFactory>(ctx->device_type(), data_type);
}
```
The REGISTER_PRIMITIVE_FACTORY macro invokes REGISTER_CLASS to instantiate an AutoRegistrationFactory::RawRegisterType object. The factory implementation is SoftmaxFactoryImpl, i.e. the GenericSoftmaxFactoryImpl template class, which calls the NewSoftmax function to create a SoftmaxImpl object. SoftmaxImpl is the implementation of Softmax.
```cpp
class SoftmaxFactory : public Factory<Softmax> {
 public:
  OF_DISALLOW_COPY_AND_MOVE(SoftmaxFactory);
  SoftmaxFactory() = default;
  ~SoftmaxFactory() override = default;

  virtual std::unique_ptr<Softmax> New(DataType data_type) = 0;
};
```
```mermaid
graph TD
    A["SoftmaxImpl::Launch"] --> SoftmaxGpu
```

The SoftmaxImpl::Launch function calls SoftmaxGpu.
```cpp
template<typename SoftmaxBase, Algorithm algorithm, typename T>
class SoftmaxImpl : public SoftmaxBase {
 public:
  OF_DISALLOW_COPY_AND_MOVE(SoftmaxImpl);
  SoftmaxImpl() = default;
  ~SoftmaxImpl() override = default;

  void Launch(Stream* stream, size_t rows, size_t cols, const void* x, void* y) override {
    cudaStream_t cuda_stream = stream->As<CudaStream>()->cuda_stream();
    SoftmaxGpu<algorithm, T>(cuda_stream, rows, cols, reinterpret_cast<const T*>(x),
                             reinterpret_cast<T*>(y));
  }
};
```
```mermaid
graph TD
    SoftmaxGpu --> DispatchSoftmax
    SoftmaxGpu --> DispatchLogSoftmax
```
DefaultComputeType defaults to the original type, but uses float for half and bfloat16. The DirectLoad and DirectStore structs wrap the source and destination off-chip addresses. DirectLoad first loads the source data into a temporary variable and then converts it to the on-chip compute type. (If the compute type matches the source type, is the temporary really necessary?) Depending on the algorithm, either DispatchSoftmax or DispatchLogSoftmax is called.
```cpp
template<Algorithm algorithm, typename T>
void SoftmaxGpu(cudaStream_t cuda_stream, size_t rows, size_t cols, const T* x, T* y) {
  using ComputeType = typename cuda::softmax::DefaultComputeType<T>::type;
  oneflow::cuda::softmax::DirectLoad<T, ComputeType> load(x, cols);
  oneflow::cuda::softmax::DirectStore<ComputeType, T> store(y, cols);
  if (algorithm == Algorithm::kSoftmax) {
    OF_CUDA_CHECK((cuda::softmax::DispatchSoftmax<decltype(load), decltype(store), ComputeType>(
        cuda_stream, load, store, rows, cols)));
  } else if (algorithm == Algorithm::kLogSoftmax) {
    OF_CUDA_CHECK((cuda::softmax::DispatchLogSoftmax<decltype(load), decltype(store), ComputeType>(
        cuda_stream, load, store, rows, cols)));
  } else {
    UNIMPLEMENTED();
  }
}
```
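For reference, a minimal sketch of the DefaultComputeType trait described above (simplified; the actual definition lives in oneflow/core/cuda/softmax.cuh and also covers bfloat16 on newer CUDA versions):

```cpp
#include <cuda_fp16.h>

// Simplified sketch of the DefaultComputeType trait: compute in the source
// type by default, but promote half (and, in the real header, bfloat16) to
// float for numerical accuracy.
template<typename T>
struct DefaultComputeType {
  using type = T;
};

template<>
struct DefaultComputeType<half> {
  using type = float;
};
```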
Pack is a union. DirectLoad::load reads N elements in one shot and then assigns them one by one into dst as the DST type.
```cpp
template<typename SRC, typename DST>
struct DirectLoad {
  DirectLoad(const SRC* src, int64_t row_size) : src(src), row_size(row_size) {}
  template<int N>
  __device__ void load(DST* dst, int64_t row, int64_t col) const {
    Pack<SRC, N> pack;
    const int64_t offset = (row * row_size + col) / N;
    pack.storage = *(reinterpret_cast<const PackType<SRC, N>*>(src) + offset);
#pragma unroll
    for (int i = 0; i < N; ++i) { dst[i] = static_cast<DST>(pack.elem[i]); }
  }
  const SRC* src;
  int64_t row_size;
};
```
PackType is the type produced by GetPackType, i.e. an alias for GetPackType&lt;T, N&gt;::type.
```cpp
template<typename T, int N>
union Pack {
  static_assert(sizeof(PackType<T, N>) == sizeof(T) * N, "");
  __device__ Pack() {
    // do nothing
  }
  PackType<T, N> storage;
  T elem[N];
};
```
Once constructed, a std::aligned_storage object provides Len bytes of storage that satisfies an alignment requirement of Align.
```cpp
template<typename T, int N>
struct GetPackType {
  using type = typename std::aligned_storage<N * sizeof(T), N * sizeof(T)>::type;
};
```
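A couple of illustrative checks against these definitions (assumed usage, not from the source): the packed type's size and alignment are both N * sizeof(T), so a whole pack moves as one wide, aligned memory transaction.

```cpp
#include <cuda_fp16.h>

// PackType<float, 4> occupies 16 aligned bytes (one 128-bit transaction);
// PackType<half, 2> occupies 4 aligned bytes (one 32-bit transaction).
static_assert(sizeof(GetPackType<float, 4>::type) == 16, "");
static_assert(alignof(GetPackType<float, 4>::type) == 16, "");
static_assert(sizeof(GetPackType<half, 2>::type) == 4, "");
```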
```cpp
template<typename SRC, typename DST>
struct DirectStore {
  DirectStore(DST* dst, int64_t row_size) : dst(dst), row_size(row_size) {}
  template<int N>
  __device__ void store(const SRC* src, int64_t row, int64_t col) {
    Pack<DST, N> pack;
    const int64_t offset = (row * row_size + col) / N;
#pragma unroll
    for (int i = 0; i < N; ++i) { pack.elem[i] = static_cast<DST>(src[i]); }
    *(reinterpret_cast<PackType<DST, N>*>(dst) + offset) = pack.storage;
  }
  DST* dst;
  int64_t row_size;
};
```
```mermaid
graph TD
    DispatchSoftmax -->|cols <= 1024| DispatchSoftmaxWarpImpl
    DispatchSoftmax -->|1024 < cols| TryDispatchSoftmaxBlockSMemImpl
    DispatchSoftmax -->|fallback| DispatchSoftmaxBlockUncachedImpl
    DispatchSoftmaxWarpImpl --> DispatchSoftmaxWarpImplPackSize
    DispatchSoftmaxWarpImplPackSize --> DispatchSoftmaxWarpImplCols
    DispatchSoftmaxWarpImplCols --> DispatchSoftmaxWarpImplPadding
    DispatchSoftmaxWarpImplPadding --> LaunchSoftmaxWarpImpl
    LaunchSoftmaxWarpImpl --> SoftmaxWarpImpl
    TryDispatchSoftmaxBlockSMemImpl --> TryDispatchSoftmaxBlockSMemImplPackSize
    TryDispatchSoftmaxBlockSMemImplPackSize --> TryDispatchSoftmaxBlockSMemImplBlockSize
    TryDispatchSoftmaxBlockSMemImplBlockSize --> LaunchSoftmaxBlockSMemImpl
    LaunchSoftmaxBlockSMemImpl --> SoftmaxBlockSMemImpl
    DispatchSoftmaxBlockUncachedImpl --> DispatchSoftmaxBlockUncachedImplPackSize
    DispatchSoftmaxBlockUncachedImplPackSize --> LaunchSoftmaxBlockUncachedImpl
    LaunchSoftmaxBlockUncachedImpl --> SoftmaxBlockUncachedImpl
```
The implementation for compute types other than double:

All three implementations first set pack_size to 1 or 2 according to whether cols is odd or even. DispatchSoftmaxWarpImplCols then picks the three parameters cols_per_thread, thread_group_width, and rows_per_access based on the size of cols:

- when the number of classes is small, the threads within a Warp are split into smaller groups, each group handles 1 or 2 rows, and each thread handles pack_size classes;
- when the number of classes is large, all threads in a Warp handle a single row, and each thread handles cols_per_thread classes.

DispatchSoftmaxWarpImplPadding decides, based on whether cols aligns with the Warp's processing width, whether cols needs padding. LaunchSoftmaxWarpImpl fixes block_size at 128 and then sets grid_size according to the hardware.

TryDispatchSoftmaxBlockSMemImplPackSize calls TryDispatchSoftmaxBlockSMemImplBlockSize, handling two columns at a time when cols is even and one otherwise. TryDispatchSoftmaxBlockSMemImplBlockSize first maximizes the number of thread blocks an SM can schedule under the given hardware constraints, then prefers the larger block_size.

DispatchSoftmaxBlockUncachedImplPackSize calls LaunchSoftmaxBlockUncachedImpl according to whether cols is odd or even.
```cpp
template<typename LOAD, typename STORE, typename ComputeType>
inline typename std::enable_if<!std::is_same<ComputeType, double>::value, cudaError_t>::type
DispatchSoftmax(cudaStream_t stream, LOAD load, STORE store, const int64_t rows,
                const int64_t cols) {
  if (cols <= 1024) {
    return DispatchSoftmaxWarpImpl<LOAD, STORE, ComputeType, Algorithm::kSoftmax>(
        stream, load, store, rows, cols);
  } else {
    bool dispatch_smem_impl_success;
    {
      cudaError_t err =
          TryDispatchSoftmaxBlockSMemImpl<LOAD, STORE, ComputeType, Algorithm::kSoftmax>(
              stream, load, store, rows, cols, &dispatch_smem_impl_success);
      if (err != cudaSuccess) { return err; }
    }
    if (!dispatch_smem_impl_success) {
      return DispatchSoftmaxBlockUncachedImpl<LOAD, STORE, ComputeType, Algorithm::kSoftmax>(
          stream, load, store, rows, cols);
    }
    return cudaSuccess;
  }
}
```
```mermaid
graph TD
    DispatchSoftmax --> DispatchSoftmaxBlockUncachedImpl
```

When the compute type is double, DispatchSoftmaxBlockUncachedImpl is called directly.
```cpp
template<typename LOAD, typename STORE, typename ComputeType>
inline typename std::enable_if<std::is_same<ComputeType, double>::value, cudaError_t>::type
DispatchSoftmax(cudaStream_t stream, LOAD load, STORE store, const int64_t rows,
                const int64_t cols) {
  return DispatchSoftmaxBlockUncachedImpl<LOAD, STORE, ComputeType, Algorithm::kSoftmax>(
      stream, load, store, rows, cols);
}
```
```mermaid
graph TD
    DispatchSoftmaxWarpImplCols --> DispatchSoftmaxWarpImplPadding
```

The pack_size == 1 version:
```cpp
template<typename LOAD, typename STORE, typename ComputeType, int pack_size, Algorithm algorithm>
typename std::enable_if<pack_size == 1, cudaError_t>::type DispatchSoftmaxWarpImplCols(
    cudaStream_t stream, LOAD load, STORE store, const int64_t rows, const int64_t cols) {
  if (cols <= 0) { return cudaErrorInvalidValue; }
```
The DEFINE_ONE_ELIF macro tries to find a thread_group_width just wide enough to cover cols, then calls DispatchSoftmaxWarpImplPadding with it.
```cpp
#define DEFINE_ONE_ELIF(thread_group_width)                                                        \
  else if (cols <= (thread_group_width)*pack_size) {                                               \
    if (rows % 2 == 0) {                                                                           \
      return DispatchSoftmaxWarpImplPadding<LOAD, STORE, ComputeType, pack_size, pack_size,        \
                                            thread_group_width, 2, algorithm>(stream, load, store, \
                                                                              rows, cols);         \
    } else {                                                                                       \
      return DispatchSoftmaxWarpImplPadding<LOAD, STORE, ComputeType, pack_size, pack_size,        \
                                            thread_group_width, 1, algorithm>(stream, load, store, \
                                                                              rows, cols);         \
    }                                                                                              \
  }
  DEFINE_ONE_ELIF(1)
  DEFINE_ONE_ELIF(2)
  DEFINE_ONE_ELIF(4)
  DEFINE_ONE_ELIF(8)
  DEFINE_ONE_ELIF(16)
  DEFINE_ONE_ELIF(32)
#undef DEFINE_ONE_ELIF
```
kWarpSize is the warp size. If cols exceeds what a single Warp can cover at one pack per thread, find the per-thread column count (col) with which a Warp just covers cols, and again call DispatchSoftmaxWarpImplPadding.
```cpp
#define DEFINE_ONE_ELIF(col)                                                                      \
  else if (cols <= (col)*kWarpSize) {                                                             \
    return DispatchSoftmaxWarpImplPadding<LOAD, STORE, ComputeType, pack_size, col, kWarpSize, 1, \
                                          algorithm>(stream, load, store, rows, cols);            \
  }
  DEFINE_ONE_ELIF(2)
  DEFINE_ONE_ELIF(3)
  DEFINE_ONE_ELIF(4)
  DEFINE_ONE_ELIF(5)
  DEFINE_ONE_ELIF(6)
  DEFINE_ONE_ELIF(7)
  DEFINE_ONE_ELIF(8)
  DEFINE_ONE_ELIF(9)
  DEFINE_ONE_ELIF(10)
  DEFINE_ONE_ELIF(11)
  DEFINE_ONE_ELIF(12)
  DEFINE_ONE_ELIF(13)
  DEFINE_ONE_ELIF(14)
  DEFINE_ONE_ELIF(15)
  DEFINE_ONE_ELIF(16)
  DEFINE_ONE_ELIF(17)
  DEFINE_ONE_ELIF(18)
  DEFINE_ONE_ELIF(19)
  DEFINE_ONE_ELIF(20)
  DEFINE_ONE_ELIF(21)
  DEFINE_ONE_ELIF(22)
  DEFINE_ONE_ELIF(23)
  DEFINE_ONE_ELIF(24)
  DEFINE_ONE_ELIF(25)
  DEFINE_ONE_ELIF(26)
  DEFINE_ONE_ELIF(27)
  DEFINE_ONE_ELIF(28)
  DEFINE_ONE_ELIF(29)
  DEFINE_ONE_ELIF(30)
  DEFINE_ONE_ELIF(31)
  DEFINE_ONE_ELIF(32)
#undef DEFINE_ONE_ELIF
  else {
    return cudaErrorInvalidValue;
  }
}
```
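To see through the macro chain, here is a hand-written host-side analogue (a hypothetical helper, for illustration only) of the selection performed for pack_size == 1: it returns the chosen (thread_group_width, cols_per_thread) pair instead of launching a kernel, and it omits the rows_per_access choice.

```cpp
#include <cstdint>
#include <cstdio>

// Illustrative analogue of the pack_size == 1 DEFINE_ONE_ELIF chain above.
void ChooseWarpParams(int64_t cols, int* thread_group_width, int* cols_per_thread) {
  constexpr int kWarpSize = 32;
  *cols_per_thread = 1;
  if (cols <= 1) { *thread_group_width = 1; }
  else if (cols <= 2) { *thread_group_width = 2; }
  else if (cols <= 4) { *thread_group_width = 4; }
  else if (cols <= 8) { *thread_group_width = 8; }
  else if (cols <= 16) { *thread_group_width = 16; }
  else if (cols <= 32) { *thread_group_width = 32; }
  else {
    // Beyond one warp's width, cols_per_thread grows instead; the real chain
    // errors out past 32 * kWarpSize = 1024 columns.
    *thread_group_width = kWarpSize;
    *cols_per_thread = static_cast<int>((cols + kWarpSize - 1) / kWarpSize);
  }
}

int main() {
  int tgw, cpt;
  ChooseWarpParams(100, &tgw, &cpt);
  std::printf("cols=100 -> thread_group_width=%d, cols_per_thread=%d\n", tgw, cpt);  // 32, 4
  return 0;
}
```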
```mermaid
graph TD
    LaunchSoftmaxWarpImpl --> GetNumBlocks
    LaunchSoftmaxWarpImpl --> SoftmaxWarpImpl
```
```cpp
template<typename LOAD, typename STORE, typename ComputeType, int pack_size, int cols_per_thread,
         int thread_group_width, int rows_per_access, bool padding, Algorithm algorithm>
inline cudaError_t LaunchSoftmaxWarpImpl(cudaStream_t stream, LOAD load, STORE store,
                                         const int64_t rows, const int64_t cols) {
```
block_size is fixed at 128; see the article 如何设置CUDA Kernel中的grid_size和block_size? for background. waves is the desired number of waves of work. thread_group_width is the width of the thread group that processes elements and is a divisor of kWarpSize. thread_groups_per_block is the number of thread groups a block is divided into. num_blocks is the maximum number of blocks the task can use. rows_per_access is the number of rows processed per pass.
```cpp
  constexpr int block_size = 128;
  constexpr int waves = 32;
  static_assert(block_size % thread_group_width == 0, "");
  constexpr int thread_groups_per_block = block_size / thread_group_width;
  dim3 block_dim(thread_group_width, thread_groups_per_block);
  const int64_t num_blocks =
      (rows / rows_per_access + thread_groups_per_block - 1) / thread_groups_per_block;
```
The GetNumBlocks function queries the device's SM count and the maximum number of threads each SM supports to compute the block count.
```cpp
  int grid_dim_x;
  {
    cudaError_t err = GetNumBlocks(block_size, num_blocks, waves, &grid_dim_x);
    if (err != cudaSuccess) { return err; }
  }
```
The grid is one-dimensional and the block is two-dimensional. SoftmaxWarpImpl is launched to compute each row within a Warp.
```cpp
  SoftmaxWarpImpl<LOAD, STORE, ComputeType, pack_size, cols_per_thread, thread_group_width,
                  rows_per_access, padding, algorithm>
      <<<grid_dim_x, block_dim, 0, stream>>>(load, store, rows, cols);
  return cudaPeekAtLastError();
}
```
cudaGetDevice returns the device currently in use.
```cpp
inline cudaError_t GetNumBlocks(int64_t block_size, int64_t max_blocks, int64_t waves,
                                int* num_blocks) {
  int dev;
  {
    cudaError_t err = cudaGetDevice(&dev);
    if (err != cudaSuccess) { return err; }
  }
```
cudaDeviceGetAttribute returns information about the device: here, the number of multiprocessors and the maximum number of resident threads per multiprocessor.
```cpp
  int sm_count;
  {
    cudaError_t err = cudaDeviceGetAttribute(&sm_count, cudaDevAttrMultiProcessorCount, dev);
    if (err != cudaSuccess) { return err; }
  }
  int tpm;
  {
    cudaError_t err = cudaDeviceGetAttribute(&tpm, cudaDevAttrMaxThreadsPerMultiProcessor, dev);
    if (err != cudaSuccess) { return err; }
  }
```
The block count is derived from the maximum number of resident threads across the whole GPU.
```cpp
  *num_blocks =
      std::max<int>(1, std::min<int64_t>(max_blocks, sm_count * tpm / block_size * waves));
  return cudaSuccess;
}
```
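Plugging hypothetical numbers into the formula makes it concrete (the 108 SMs and 2048 threads per SM below are assumptions, roughly A100-class, not values from the source):

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>

int main() {
  const int64_t sm_count = 108, tpm = 2048;    // assumed device attributes
  const int64_t block_size = 128, waves = 32;  // as in LaunchSoftmaxWarpImpl
  const int64_t max_blocks = 4096;             // assumed task-side upper bound
  // sm_count * tpm / block_size * waves = 108 * 2048 / 128 * 32 = 55296,
  // so the result here is capped by max_blocks instead.
  const int num_blocks =
      std::max<int>(1, std::min<int64_t>(max_blocks, sm_count * tpm / block_size * waves));
  std::printf("num_blocks = %d\n", num_blocks);  // prints 4096
  return 0;
}
```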
```mermaid
graph TD
    SoftmaxWarpImpl --> WarpAllReduce
```
```cpp
template<typename LOAD, typename STORE, typename ComputeType, int pack_size, int cols_per_thread,
         int thread_group_width, int rows_per_access, bool padding, Algorithm algorithm>
__global__ void SoftmaxWarpImpl(LOAD load, STORE store, const int64_t rows, const int64_t cols) {
  static_assert(cols_per_thread % pack_size == 0, "");
  static_assert(thread_group_width <= kWarpSize, "");
  static_assert(kWarpSize % thread_group_width == 0, "");
```
num_packs is the number of packs each thread handles. rows_per_access is the number of rows processed per pass. buf stores the input $x$ and, later, the numerator terms $e^{x_i - \alpha}$. blockIdx.x is the block's index along the row direction, and blockDim.y is the number of thread groups a block is divided into. global_thread_group_id is the global thread-group index (numbered contiguously within a block), and num_global_thread_group is the total number of thread groups. lane_id is the thread's id within the warp. step is the number of rows all threads on the GPU can process in one pass; keeping it as large as possible favors coalesced memory access.
```cpp
  constexpr int num_packs = cols_per_thread / pack_size;
  assert(cols <= cols_per_thread * thread_group_width);
  ComputeType buf[rows_per_access][cols_per_thread];
  const int global_thread_group_id = blockIdx.x * blockDim.y + threadIdx.y;
  const int num_global_thread_group = gridDim.x * blockDim.y;
  const int lane_id = threadIdx.x;
  const int64_t step = num_global_thread_group * rows_per_access;
```
thread_max holds each thread's running maximum, initialized to -inf; Inf returns the infinity value for the given type. col is the column currently being processed. If padding is disabled, or col is within cols, DirectLoad::load loads the input and the maximum over the thread's columns is updated; otherwise row_buf is filled with -inf.
```cpp
  for (int64_t row = global_thread_group_id * rows_per_access; row < rows; row += step) {
    ComputeType thread_max[rows_per_access];
#pragma unroll
    for (int row_id = 0; row_id < rows_per_access; ++row_id) {
      thread_max[row_id] = -Inf<ComputeType>();
      ComputeType* row_buf = buf[row_id];
#pragma unroll
      for (int pack_id = 0; pack_id < num_packs; ++pack_id) {
        const int pack_offset = pack_id * pack_size;
        const int col = (pack_id * thread_group_width + lane_id) * pack_size;
        if (!padding || col < cols) {
          load.template load<pack_size>(row_buf + pack_offset, row + row_id, col);
#pragma unroll
          for (int i = 0; i < pack_size; ++i) {
            thread_max[row_id] = max(thread_max[row_id], row_buf[pack_offset + i]);
          }
        } else {
#pragma unroll
          for (int i = 0; i < pack_size; ++i) { row_buf[pack_offset + i] = -Inf<ComputeType>(); }
        }
      }
    }
```
WarpAllReduce instantiated with MaxOp reduces to the maximum within the thread group; the thread_group_width parameter is what enables sub-warp grouping.
```cpp
    ComputeType warp_max[rows_per_access];
#pragma unroll
    for (int row_id = 0; row_id < rows_per_access; ++row_id) {
      warp_max[row_id] = WarpAllReduce<MaxOp, ComputeType, thread_group_width>(thread_max[row_id]);
    }
```
row_buf now holds $e^{x_i - \alpha}$, and the per-thread partial sum thread_sum accumulates $\sum_j e^{x_j - \alpha}$. The __trap() function can be called from any device thread to abort kernel execution and raise an interrupt in the host program.
```cpp
    ComputeType thread_sum[rows_per_access];
#pragma unroll
    for (int row_id = 0; row_id < rows_per_access; ++row_id) {
      thread_sum[row_id] = 0;
      ComputeType* row_buf = buf[row_id];
#pragma unroll
      for (int i = 0; i < cols_per_thread; ++i) {
        if (algorithm == Algorithm::kSoftmax) {
          row_buf[i] = Exp(row_buf[i] - warp_max[row_id]);
          thread_sum[row_id] += row_buf[i];
        } else if (algorithm == Algorithm::kLogSoftmax) {
          row_buf[i] -= warp_max[row_id];
          thread_sum[row_id] += Exp(row_buf[i]);
        } else {
          __trap();
        }
      }
    }
```
WarpAllReduce is called again, with SumOp, to obtain each row's warp_sum.
```cpp
    ComputeType warp_sum[rows_per_access];
#pragma unroll
    for (int row_id = 0; row_id < rows_per_access; ++row_id) {
      warp_sum[row_id] = WarpAllReduce<SumOp, ComputeType, thread_group_width>(thread_sum[row_id]);
    }
```
Div and Log have fast-math implementations. The kernel computes

$$\text{Softmax}(x_i) = \frac{e^{x_i - \alpha}}{\sum_j e^{x_j - \alpha}}$$

or

$$\text{LogSoftmax}(x_i) = \log\left(\frac{e^{x_i - \alpha}}{\sum_j e^{x_j - \alpha}}\right) = x_i - \alpha - \log\left(\sum_j e^{x_j - \alpha}\right),$$

and DirectStore::store writes the result back.
```cpp
#pragma unroll
    for (int row_id = 0; row_id < rows_per_access; ++row_id) {
      ComputeType* row_buf = buf[row_id];
#pragma unroll
      for (int i = 0; i < cols_per_thread; ++i) {
        if (algorithm == Algorithm::kSoftmax) {
          row_buf[i] = Div(row_buf[i], warp_sum[row_id]);
        } else if (algorithm == Algorithm::kLogSoftmax) {
          row_buf[i] -= Log(warp_sum[row_id]);
        } else {
          __trap();
        }
      }
#pragma unroll
      for (int i = 0; i < num_packs; ++i) {
        const int col = (i * thread_group_width + lane_id) * pack_size;
        if (!padding || col < cols) {
          store.template store<pack_size>(row_buf + i * pack_size, row + row_id, col);
        }
      }
    }
  }
}
```
__shfl_xor_sync copies from the lane whose ID is the bitwise XOR of the caller's own lane ID with mask. Halving mask on each round yields a butterfly all-reduce.
```cpp
template<template<typename> class ReductionOp, typename T, int thread_group_width = kWarpSize>
__inline__ __device__ T WarpAllReduce(T val) {
  for (int mask = thread_group_width / 2; mask > 0; mask /= 2) {
    val = ReductionOp<T>()(val, __shfl_xor_sync(0xffffffff, val, mask));
  }
  return val;
}
```
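As a concrete trace of the butterfly pattern, here is a host-side simulation (assumed inputs) of WarpAllReduce with MaxOp over a thread group of width 4: in each round every lane combines with the lane at lane ^ mask, and after log2(width) rounds all lanes hold the group-wide maximum.

```cpp
#include <algorithm>
#include <array>
#include <cstdio>

int main() {
  constexpr int width = 4;
  std::array<int, width> val = {3, 7, 1, 5};  // one value per lane
  for (int mask = width / 2; mask > 0; mask /= 2) {
    std::array<int, width> next;
    for (int lane = 0; lane < width; ++lane) {
      // host-side analogue of __shfl_xor_sync: read lane ^ mask's value
      next[lane] = std::max(val[lane], val[lane ^ mask]);
    }
    val = next;  // after mask=2: {3,7,3,7}; after mask=1: {7,7,7,7}
  }
  std::printf("%d %d %d %d\n", val[0], val[1], val[2], val[3]);  // 7 7 7 7
  return 0;
}
```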
```mermaid
graph TD
    DispatchSoftmaxWarpImplCols --> DispatchSoftmaxWarpImplPadding
```

The pack_size == 2 version:
```cpp
template<typename LOAD, typename STORE, typename ComputeType, int pack_size, Algorithm algorithm>
typename std::enable_if<pack_size == 2, cudaError_t>::type DispatchSoftmaxWarpImplCols(
    cudaStream_t stream, LOAD load, STORE store, const int64_t rows, const int64_t cols) {
  if (cols <= 0) { return cudaErrorInvalidValue; }
```
As before, find the thread_group_width that just covers cols and call DispatchSoftmaxWarpImplPadding with it: two rows per pass when rows is even, one when it is odd.
```cpp
#define DEFINE_ONE_ELIF(thread_group_width)                                                        \
  else if (cols <= (thread_group_width)*pack_size) {                                               \
    if (rows % 2 == 0) {                                                                           \
      return DispatchSoftmaxWarpImplPadding<LOAD, STORE, ComputeType, pack_size, pack_size,        \
                                            thread_group_width, 2, algorithm>(stream, load, store, \
                                                                              rows, cols);         \
    } else {                                                                                       \
      return DispatchSoftmaxWarpImplPadding<LOAD, STORE, ComputeType, pack_size, pack_size,        \
                                            thread_group_width, 1, algorithm>(stream, load, store, \
                                                                              rows, cols);         \
    }                                                                                              \
  }
  DEFINE_ONE_ELIF(1)
  DEFINE_ONE_ELIF(2)
  DEFINE_ONE_ELIF(4)
  DEFINE_ONE_ELIF(8)
  DEFINE_ONE_ELIF(16)
  DEFINE_ONE_ELIF(32)
#undef DEFINE_ONE_ELIF
```
If cols is larger, all threads in a Warp handle one row, and each thread handles cols_per_thread classes.
```cpp
#define DEFINE_ONE_ELIF(col)                                                                      \
  else if (cols <= (col)*kWarpSize) {                                                             \
    return DispatchSoftmaxWarpImplPadding<LOAD, STORE, ComputeType, pack_size, col, kWarpSize, 1, \
                                          algorithm>(stream, load, store, rows, cols);            \
  }
  DEFINE_ONE_ELIF(4)
  DEFINE_ONE_ELIF(6)
  DEFINE_ONE_ELIF(8)
  DEFINE_ONE_ELIF(10)
  DEFINE_ONE_ELIF(12)
  DEFINE_ONE_ELIF(14)
  DEFINE_ONE_ELIF(16)
  DEFINE_ONE_ELIF(18)
  DEFINE_ONE_ELIF(20)
  DEFINE_ONE_ELIF(22)
  DEFINE_ONE_ELIF(24)
  DEFINE_ONE_ELIF(26)
  DEFINE_ONE_ELIF(28)
  DEFINE_ONE_ELIF(30)
  DEFINE_ONE_ELIF(32)
#undef DEFINE_ONE_ELIF
  else {
    return cudaErrorInvalidValue;
  }
}
```
```mermaid
graph TD
    TryDispatchSoftmaxBlockSMemImplBlockSize --> SoftmaxBlockSMemImpl
    TryDispatchSoftmaxBlockSMemImplBlockSize --> LaunchSoftmaxBlockSMemImpl
```

The required Shared Memory size is computed from cols.
```cpp
template<typename LOAD, typename STORE, typename ComputeType, int pack_size, Algorithm algorithm>
inline cudaError_t TryDispatchSoftmaxBlockSMemImplBlockSize(cudaStream_t stream, LOAD load,
                                                            STORE store, const int64_t rows,
                                                            const int64_t cols, bool* success) {
  constexpr int block_size_conf_1 = 128;
  constexpr int block_size_conf_2 = 256;
  constexpr int block_size_conf_3 = 512;
  constexpr int block_size_conf_4 = 1024;
  const size_t smem = cols * sizeof(ComputeType);
```
cudaOccupancyMaxActiveBlocksPerMultiprocessor returns the occupancy of a device function; SoftmaxBlockSMemImpl is the kernel. If smem exceeds the SM's Shared Memory capacity, the kernel cannot launch. The priority is first to maximize the number of blocks an SM can schedule simultaneously, and then to maximize block_size, improving hardware utilization.
```cpp
  int max_active_blocks_conf_1;
  {
    cudaError_t err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &max_active_blocks_conf_1,
        SoftmaxBlockSMemImpl<LOAD, STORE, ComputeType, pack_size, block_size_conf_1, algorithm>,
        block_size_conf_1, smem);
    if (err != cudaSuccess) { return err; }
  }
  if (max_active_blocks_conf_1 <= 0) {
    *success = false;
    return cudaSuccess;
  }
  int max_active_blocks_conf_4;
  {
    cudaError_t err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &max_active_blocks_conf_4,
        SoftmaxBlockSMemImpl<LOAD, STORE, ComputeType, pack_size, block_size_conf_4, algorithm>,
        block_size_conf_4, smem);
    if (err != cudaSuccess) { return err; }
  }
```
If block_size_conf_1 and block_size_conf_4 achieve the same occupancy, the larger block_size_conf_4 is chosen. LaunchSoftmaxBlockSMemImpl launches the kernel that processes one row per Block.
```cpp
  if (max_active_blocks_conf_4 == max_active_blocks_conf_1) {
    *success = true;
    return LaunchSoftmaxBlockSMemImpl<LOAD, STORE, ComputeType, pack_size, block_size_conf_4,
                                      algorithm>(stream, load, store, smem, rows, cols);
  }
```
Otherwise max_active_blocks_conf_3 and max_active_blocks_conf_2 are tried in turn.
```cpp
  int max_active_blocks_conf_3;
  {
    cudaError_t err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &max_active_blocks_conf_3,
        SoftmaxBlockSMemImpl<LOAD, STORE, ComputeType, pack_size, block_size_conf_3, algorithm>,
        block_size_conf_3, smem);
    if (err != cudaSuccess) { return err; }
  }
  if (max_active_blocks_conf_3 == max_active_blocks_conf_1) {
    *success = true;
    return LaunchSoftmaxBlockSMemImpl<LOAD, STORE, ComputeType, pack_size, block_size_conf_3,
                                      algorithm>(stream, load, store, smem, rows, cols);
  }
  int max_active_blocks_conf_2;
  {
    cudaError_t err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &max_active_blocks_conf_2,
        SoftmaxBlockSMemImpl<LOAD, STORE, ComputeType, pack_size, block_size_conf_2, algorithm>,
        block_size_conf_2, smem);
    if (err != cudaSuccess) { return err; }
  }
  if (max_active_blocks_conf_2 == max_active_blocks_conf_1) {
    *success = true;
    return LaunchSoftmaxBlockSMemImpl<LOAD, STORE, ComputeType, pack_size, block_size_conf_2,
                                      algorithm>(stream, load, store, smem, rows, cols);
  }
  *success = true;
  return LaunchSoftmaxBlockSMemImpl<LOAD, STORE, ComputeType, pack_size, block_size_conf_1,
                                    algorithm>(stream, load, store, smem, rows, cols);
}
```
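For intuition about when the shared-memory path wins or loses, a hypothetical budget check (the 8192-column figure and the per-SM capacity are assumptions, not from the source):

```cpp
#include <cstdio>

// With cols = 8192 and ComputeType = float, each block requests
// 8192 * 4 = 32 KB of dynamic shared memory. If an SM can only fit one such
// block, every block_size config reports the same occupancy and the chain
// above settles on block_size_conf_4 = 1024; if the request exceeds the SM's
// shared-memory capacity outright, occupancy is 0, *success is set to false,
// and DispatchSoftmax falls back to the uncached implementation.
int main() {
  const long long cols = 8192;
  std::printf("smem per block = %lld bytes\n", cols * (long long)sizeof(float));  // 32768
  return 0;
}
```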
## [SoftmaxBlockSMemImpl](https://github.com/Oneflow-Inc/oneflow/blob/master/oneflow/core/cuda/softmax.cuh#L482)

```mermaid
graph TD
    SoftmaxBlockSMemImpl --> BlockAllReduce
```
One Block processes one row of elements, using dynamically allocated shared memory aligned to double (64-bit). Here ComputeType can only be float.
```cpp
template<typename LOAD, typename STORE, typename ComputeType, int pack_size, int block_size,
         Algorithm algorithm>
__global__ void SoftmaxBlockSMemImpl(LOAD load, STORE store, const int64_t rows,
                                     const int64_t cols) {
  extern __shared__ __align__(sizeof(double)) unsigned char shared_buf[];
  auto* buf = reinterpret_cast<ComputeType*>(shared_buf);
  const int tid = threadIdx.x;
  assert(cols % pack_size == 0);
  const int num_packs = cols / pack_size;
```
Each pass processes gridDim.x rows. thread_max holds the per-thread maximum, initialized to -inf. Inputs are first loaded into pack and then copied into the buf shared memory. buf has shape [pack_size, num_packs], so the elements of one pack are not contiguous and must be stored one by one.
```cpp
  for (int64_t row = blockIdx.x; row < rows; row += gridDim.x) {
    ComputeType thread_max = -Inf<ComputeType>();
    for (int pack_id = tid; pack_id < num_packs; pack_id += block_size) {
      ComputeType pack[pack_size];
      load.template load<pack_size>(pack, row, pack_id * pack_size);
#pragma unroll
      for (int i = 0; i < pack_size; ++i) {
        buf[i * num_packs + pack_id] = pack[i];
        thread_max = max(thread_max, pack[i]);
      }
    }
```
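A small illustration (hypothetical sizes) of that transposed [pack_size, num_packs] layout: logical column c = pack_id * pack_size + i lands at buf[i * num_packs + pack_id], so for a fixed i, consecutive threads (consecutive pack_id) touch consecutive shared-memory words.

```cpp
#include <cstdio>

// Hypothetical sizes for the [pack_size, num_packs] layout used above.
int main() {
  const int pack_size = 2, num_packs = 4;  // cols = 8
  for (int pack_id = 0; pack_id < num_packs; ++pack_id) {
    for (int i = 0; i < pack_size; ++i) {
      const int c = pack_id * pack_size + i;
      std::printf("col %d -> buf[%d]\n", c, i * num_packs + pack_id);
    }
  }
  return 0;
}
```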
The BlockAllReduce function calls cub::BlockReduce, which requires two pieces of memory. row_max is the row maximum $\alpha$. buf then holds $e^{x_i - \alpha}$, and the per-thread sum thread_sum accumulates $\sum_j e^{x_j - \alpha}$.
```cpp
    const ComputeType row_max = BlockAllReduce<MaxOp, ComputeType, block_size>(thread_max);
    ComputeType thread_sum = 0;
    for (int col = tid; col < cols; col += block_size) {
      if (algorithm == Algorithm::kSoftmax) {
        const ComputeType exp_x = Exp(buf[col] - row_max);
        buf[col] = exp_x;
        thread_sum += exp_x;
      } else {
        const ComputeType x = buf[col] - row_max;
        buf[col] = x;
        thread_sum += Exp(x);
      }
    }
```
BlockAllReduce is called again to obtain the sum over all classes, and then the kernel computes

$$\text{Softmax}(x_i) = \frac{e^{x_i - \alpha}}{\sum_j e^{x_j - \alpha}}$$

or

$$\text{LogSoftmax}(x_i) = \log\left(\frac{e^{x_i - \alpha}}{\sum_j e^{x_j - \alpha}}\right) = x_i - \alpha - \log\left(\sum_j e^{x_j - \alpha}\right).$$
```cpp
    const ComputeType row_sum = BlockAllReduce<SumOp, ComputeType, block_size>(thread_sum);
    for (int pack_id = tid; pack_id < num_packs; pack_id += block_size) {
      ComputeType pack[pack_size];
#pragma unroll
      for (int i = 0; i < pack_size; ++i) {
        if (algorithm == Algorithm::kSoftmax) {
          pack[i] = Div(buf[i * num_packs + pack_id], row_sum);
        } else if (algorithm == Algorithm::kLogSoftmax) {
          pack[i] = buf[i * num_packs + pack_id] - Log(row_sum);
        } else {
          __trap();
        }
      }
      store.template store<pack_size>(pack, row, pack_id * pack_size);
    }
  }
}
```
```mermaid
graph TD
    LaunchSoftmaxBlockUncachedImpl --> GetNumBlocks
    LaunchSoftmaxBlockUncachedImpl --> SoftmaxBlockUncachedImpl
```
Without Shared Memory, Global Memory must be accessed multiple times, so this function uses a large block_size: the larger block_size is, the fewer Blocks can execute concurrently on an SM, the fewer concurrent cache requests there are, and the better the chance of hitting the cache. The GetNumBlocks function queries the device's SM count and the maximum number of threads each SM supports to compute the block count.
```cpp
template<typename LOAD, typename STORE, typename ComputeType, int pack_size, Algorithm algorithm>
inline cudaError_t LaunchSoftmaxBlockUncachedImpl(cudaStream_t stream, LOAD load, STORE store,
                                                  const int64_t rows, const int64_t cols) {
  constexpr int block_size = 1024;
  constexpr int waves = 32;
  int grid_dim_x;
  {
    cudaError_t err = GetNumBlocks(block_size, rows, waves, &grid_dim_x);
    if (err != cudaSuccess) { return err; }
  }
```
The SoftmaxBlockUncachedImpl kernel function is then launched.
```cpp
  SoftmaxBlockUncachedImpl<LOAD, STORE, ComputeType, pack_size, block_size, algorithm>
      <<<grid_dim_x, block_size, 0, stream>>>(load, store, rows, cols);
  return cudaPeekAtLastError();
}
```
```mermaid
graph TD
    SoftmaxBlockUncachedImpl --> BlockAllReduce
```

This version places no restriction on cols.
```cpp
template<typename LOAD, typename STORE, typename ComputeType, int pack_size, int block_size,
         Algorithm algorithm>
__global__ void SoftmaxBlockUncachedImpl(LOAD load, STORE store, const int64_t rows,
                                         const int64_t cols) {
  const int tid = threadIdx.x;
  assert(cols % pack_size == 0);
  const int num_packs = cols / pack_size;
```
Each Block processes one row, and each thread handles pack_size classes at a time. First the per-thread maximum is computed, then BlockAllReduce reduces it to the row-wide maximum.
```cpp
  for (int64_t row = blockIdx.x; row < rows; row += gridDim.x) {
    ComputeType thread_max = -Inf<ComputeType>();
    for (int pack_id = tid; pack_id < num_packs; pack_id += block_size) {
      ComputeType pack[pack_size];
      load.template load<pack_size>(pack, row, pack_id * pack_size);
#pragma unroll
      for (int i = 0; i < pack_size; ++i) { thread_max = max(thread_max, pack[i]); }
    }
    const ComputeType row_max = BlockAllReduce<MaxOp, ComputeType, block_size>(thread_max);
```
The sum $\sum_j e^{x_j - \alpha}$ is obtained in two steps.
```cpp
    ComputeType thread_sum = 0;
    for (int pack_id = tid; pack_id < num_packs; pack_id += block_size) {
      ComputeType pack[pack_size];
      load.template load<pack_size>(pack, row, pack_id * pack_size);
#pragma unroll
      for (int i = 0; i < pack_size; ++i) { thread_sum += Exp(pack[i] - row_max); }
    }
    const ComputeType row_sum = BlockAllReduce<SumOp, ComputeType, block_size>(thread_sum);
```
Finally the kernel computes

$$\text{Softmax}(x_i) = \frac{e^{x_i - \alpha}}{\sum_j e^{x_j - \alpha}}$$

or

$$\text{LogSoftmax}(x_i) = \log\left(\frac{e^{x_i - \alpha}}{\sum_j e^{x_j - \alpha}}\right) = x_i - \alpha - \log\left(\sum_j e^{x_j - \alpha}\right).$$
```cpp
    for (int pack_id = tid; pack_id < num_packs; pack_id += block_size) {
      ComputeType pack[pack_size];
      load.template load<pack_size>(pack, row, pack_id * pack_size);
#pragma unroll
      for (int i = 0; i < pack_size; ++i) {
        if (algorithm == Algorithm::kSoftmax) {
          pack[i] = Div(Exp(pack[i] - row_max), row_sum);
        } else if (algorithm == Algorithm::kLogSoftmax) {
          pack[i] = (pack[i] - row_max) - Log(row_sum);
        } else {
          __trap();
        }
      }
      store.template store<pack_size>(pack, row, pack_id * pack_size);
    }
  }
}
```
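Counting accesses in the loops above: each element is read from Global Memory three times (once each for the max pass, the sum pass, and the normalization pass) and written once, so ignoring cache hits the per-call traffic is roughly

$$(3 + 1)\cdot \text{rows}\cdot \text{cols}\cdot \operatorname{sizeof}(T)\ \text{bytes},$$

versus a single read and a single write per element for the warp and shared-memory variants; this extra traffic is the cost of supporting arbitrarily large cols.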