Getting Started with TVM (Part 2)

2023-10-26

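The code below is an introductory TVM example, using vector addition.

The TVM workflow is:

1. Describe how the vector addition is computed serially.

2. Describe how to partition the work across compute units when running in parallel.

3. Compile the target function. The code in this post also prints the CUDA code that TVM generates.

4. Optionally, save the compiled result to a file and load it later for high-performance computation.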
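
Steps 1 and 2 can be previewed without TVM installed. The following is a minimal pure-Python sketch, not TVM code: `serial_add` and `split_add` are hypothetical helper names, and the block size of 64 simply mirrors the `factor=64` used in the script. It shows that splitting one loop into an outer block index and an inner thread index, with a bounds guard, gives the same result as the serial loop.

```python
def serial_add(a, b):
    """Step 1: the serial rule, C[i] = A[i] + B[i]."""
    return [a[i] + b[i] for i in range(len(a))]


def split_add(a, b, factor=64):
    """Step 2: the same rule with the loop split into blocks of `factor`."""
    n = len(a)
    c = [0.0] * n
    num_blocks = (n + factor - 1) // factor  # ceiling division
    for bx in range(num_blocks):      # would map to blockIdx.x on a GPU
        for tx in range(factor):      # would map to threadIdx.x on a GPU
            i = bx * factor + tx
            if i < n:                 # guard: the last block may be partial
                c[i] = a[i] + b[i]
    return c


a = [float(i) for i in range(1000)]
b = [2.0 * i for i in range(1000)]
assert split_add(a, b) == serial_add(a, b)
print((len(a) + 63) // 64)  # prints 16: blocks needed for 1000 elements
```

On a GPU the two loops do not run sequentially: each (bx, tx) pair becomes one thread, which is why the guard `i < n` is needed when n is not a multiple of 64.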

"""
Get Started with TVM
====================
**Author**: `Tianqi Chen <https://tqchen.github.io>`_

This is an introductory tutorial to TVM.
TVM is a domain-specific language for efficient kernel construction.

In this tutorial, we will demonstrate the basic workflow in TVM.
"""
from __future__ import absolute_import, print_function

import tvm
import numpy as np

# Global declarations of environment.

tgt_host = "llvm"
# Change this to the GPU backend in use, e.g. cuda or opencl.
tgt = "cuda"

######################################################################
# Vector Add Example
# ------------------
# In this tutorial, we will use a vector addition example to demonstrate
# the workflow.
#

######################################################################
# Describe the Computation
# ------------------------
# As a first step, we need to describe our computation.
# TVM adopts tensor semantics, with each intermediate result
# represented as a multi-dimensional array. The user needs to describe
# the computation rule that generates the tensors.
#
# We first define a symbolic variable n to represent the shape.
# We then define two placeholder tensors, A and B, with the given shape (n,).
#
# We then describe the result tensor C, with a compute operation.
# The compute function takes the shape of the tensor, as well as a lambda function
# that describes the computation rule for each position of the tensor.
#
# No computation happens during this phase, as we are only declaring how
# the computation should be done.
#
n = tvm.var("n")
A = tvm.placeholder((n,), name='A')
B = tvm.placeholder((n,), name='B')
C = tvm.compute(A.shape, lambda i: A[i] + B[i], name="C")
print(type(C))

######################################################################
# Schedule the Computation
# ------------------------
# While the above lines describe the computation rule, we can compute
# C in many ways, since the elements along the axis of C can be computed
# in a data-parallel manner. TVM asks the user to provide a description
# of how to carry out the computation, called a schedule.
#
# A schedule is a set of transformations that rewrite the loops of
# computation in the program.
#
# After we construct the schedule, by default it computes
# C serially in row-major order.
#
# .. code-block:: c
#
#   for (int i = 0; i < n; ++i) {
#     C[i] = A[i] + B[i];
#   }
#
s = tvm.create_schedule(C.op)

######################################################################
# We use the split construct to split the first axis of C. This splits
# the original iteration axis into the product of two iterations,
# which is equivalent to the following code.
#
# .. code-block:: c
#
#   for (int bx = 0; bx < (n + 63) / 64; ++bx) {
#     for (int tx = 0; tx < 64; ++tx) {
#       int i = bx * 64 + tx;
#       if (i < n) {
#         C[i] = A[i] + B[i];
#       }
#     }
#   }
#
bx, tx = s[C].split(C.op.axis[0], factor=64)

######################################################################
# Finally we bind the iteration axes bx and tx to threads in the GPU
# compute grid. These are GPU-specific constructs that allow us
# to generate code that runs on the GPU.
#
# OpenCL needs the same binding, since the OpenCL section reuses s.
if tgt == "cuda" or tgt.startswith("opencl"):
    s[C].bind(bx, tvm.thread_axis("blockIdx.x"))
    s[C].bind(tx, tvm.thread_axis("threadIdx.x"))

######################################################################
# Compilation
# -----------
# After we have finished specifying the schedule, we can compile it
# into a TVM function. By default TVM compiles into a type-erased
# function that can be called directly from the Python side.
#
# In the following line, we use tvm.build to create a function.
# The build function takes the schedule, the desired signature of the
# function (including the inputs and outputs), and the target language
# we want to compile to.
#
# The result of compilation, fadd, is a host wrapper function together
# with a GPU device function (if a GPU is involved). fadd itself is the
# generated host wrapper; it holds a reference to the generated
# device function internally.
#
fadd = tvm.build(s, [A, B, C], tgt, target_host=tgt_host, name="myadd")

######################################################################
# Run the Function
# ----------------
# The compiled TVM function exposes a concise C API
# that can be invoked from any language.
#
# We provide a minimal array API in Python to aid quick testing and prototyping.
# The array API is based on the `DLPack <https://github.com/dmlc/dlpack>`_ standard.
#
# - We first create a GPU context.
# - Then tvm.nd.array copies the data to the GPU.
# - fadd runs the actual computation.
# - asnumpy() copies the GPU array back to the CPU so that we can verify correctness.
#
ctx = tvm.context(tgt, 0)

n = 1024
a = tvm.nd.array(np.random.uniform(size=n).astype(A.dtype), ctx)
b = tvm.nd.array(np.random.uniform(size=n).astype(B.dtype), ctx)
c = tvm.nd.array(np.zeros(n, dtype=C.dtype), ctx)
fadd(a, b, c)
tvm.testing.assert_allclose(c.asnumpy(), a.asnumpy() + b.asnumpy())

######################################################################
# Inspect the Generated Code
# --------------------------
# You can inspect the generated code in TVM. The result of tvm.build
# is a TVM Module. fadd is the host module that contains the host wrapper;
# it also contains a device module for the CUDA (GPU) function.
#
# The following code fetches the device module and prints its source code.
#
if tgt == "cuda":
    dev_module = fadd.imported_modules[0]
    print("-----GPU code-----")
    print(dev_module.get_source())
else:
    print(fadd.get_source())

######################################################################
# .. note:: Code Specialization
#
#   As you may have noticed, A, B and C all take the same shape
#   argument n in their declarations. TVM takes advantage of this
#   and passes only a single shape argument to the kernel, as you will
#   find in the printed device code. This is one form of specialization.
#
#   On the host side, TVM automatically generates code that checks
#   the constraints on the parameters. So if you pass arrays with
#   different shapes into fadd, an error will be raised.
#
#   We can do more specialization. For example, we can write
#   :code:`n = tvm.convert(1024)` instead of :code:`n = tvm.var("n")`
#   in the computation declaration. The generated function will then
#   only accept vectors of length 1024.
#

######################################################################
# Save Compiled Module
# --------------------
# Besides runtime compilation, we can save the compiled modules to
# files and load them back later. This is called ahead-of-time compilation.
#
# The following code performs these steps:
#
# - It saves the compiled host module into an object file.
# - Then it saves the device module into a ptx file.
# - cc.create_shared calls a compiler from the environment (gcc) to create a shared library.
#
from tvm.contrib import cc
from tvm.contrib import util

temp = util.tempdir()
fadd.save(temp.relpath("myadd.o"))
if tgt == "cuda":
    fadd.imported_modules[0].save(temp.relpath("myadd.ptx"))
cc.create_shared(temp.relpath("myadd.so"), [temp.relpath("myadd.o")])
print(temp.listdir())

######################################################################
# .. note:: Module Storage Format
#
#   The CPU (host) module is saved directly as a shared library (.so).
#   Device code can be stored in several custom formats.
#   In our example, the device code is stored in ptx, along with a
#   metadata json file. They can be loaded and linked separately via import.
#

######################################################################
# Load Compiled Module
# --------------------
# We can load the compiled module from the file system and run the code.
# The following code loads the host and device modules separately and
# re-links them together. We can verify that the newly loaded function works.
#
fadd1 = tvm.module.load(temp.relpath("myadd.so"))
if tgt == "cuda":
    fadd1_dev = tvm.module.load(temp.relpath("myadd.ptx"))
    fadd1.import_module(fadd1_dev)
fadd1(a, b, c)
tvm.testing.assert_allclose(c.asnumpy(), a.asnumpy() + b.asnumpy())

######################################################################
# Pack Everything into One Library
# --------------------------------
# In the above example, we store the device and host code separately.
# TVM also supports exporting everything as one shared library.
# Under the hood, we pack the device modules into binary blobs and link
# them together with the host code.
# Currently we support packing of Metal, OpenCL and CUDA modules.
#
fadd.export_library(temp.relpath("myadd_pack.so"))
fadd2 = tvm.module.load(temp.relpath("myadd_pack.so"))
fadd2(a, b, c)
tvm.testing.assert_allclose(c.asnumpy(), a.asnumpy() + b.asnumpy())

######################################################################
# .. note:: Runtime API and Thread-Safety
#
#   The compiled modules of TVM do not depend on the TVM compiler.
#   Instead, they depend only on a minimal runtime library.
#   The TVM runtime library wraps the device drivers and provides
#   thread-safe and device-agnostic calls into the compiled functions.
#
#   This means you can call the compiled TVM function from any thread,
#   on any GPU.
#

######################################################################
# Generate OpenCL Code
# --------------------
# TVM provides code generation for multiple backends;
# we can also generate OpenCL code, or LLVM code that runs on CPU backends.
#
# The following code block generates OpenCL code, creates arrays on an
# OpenCL device, and verifies the correctness of the code.
#
if tgt == "opencl":
    fadd_cl = tvm.build(s, [A, B, C], "opencl", name="myadd")
    print("------opencl code------")
    print(fadd_cl.imported_modules[0].get_source())
    ctx = tvm.cl(0)
    n = 1024
    a = tvm.nd.array(np.random.uniform(size=n).astype(A.dtype), ctx)
    b = tvm.nd.array(np.random.uniform(size=n).astype(B.dtype), ctx)
    c = tvm.nd.array(np.zeros(n, dtype=C.dtype), ctx)
    fadd_cl(a, b, c)
    tvm.testing.assert_allclose(c.asnumpy(), a.asnumpy() + b.asnumpy())

######################################################################
# Summary
# -------
# This tutorial provides a walkthrough of the TVM workflow using
# a vector add example. The general workflow is:
#
# - Describe your computation via a series of operations.
# - Describe how we want to compute it using schedule primitives.
# - Compile to the target function we want.
# - Optionally, save the function to be loaded later.
#
# You are more than welcome to check out other examples and
# tutorials to learn more about the supported operations, schedule primitives
# and other features in TVM.
#
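
The host-side check described in the Code Specialization note can be pictured with a small pure-Python sketch. This is an illustration only, not the code TVM generates: `checked_add` is a hypothetical name, and plain lists stand in for device arrays. The single shared length `n` plays the role of the one shape argument passed to the kernel.

```python
def checked_add(a, b, c):
    """Add a and b into c after checking that all three share one length."""
    n = len(a)  # the single shape argument shared by A, B and C
    if len(b) != n or len(c) != n:
        raise ValueError(
            "shape mismatch: got lengths %d, %d, %d" % (len(a), len(b), len(c))
        )
    for i in range(n):
        c[i] = a[i] + b[i]


out = [0.0] * 3
checked_add([1.0, 2.0, 3.0], [10.0, 20.0, 30.0], out)
print(out)  # [11.0, 22.0, 33.0]

try:
    checked_add([1.0, 2.0], [1.0, 2.0, 3.0], [0.0, 0.0])
except ValueError as err:
    print("rejected:", err)
```

Checking on the host before launching keeps the kernel itself free of per-argument shape logic, which is the benefit the note attributes to specialization.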
