Vector Operations Supported by NEON Intrinsics in C

2023-11-13

When reposting, please credit the source: https://blog.csdn.net/u013752202/article/details/92008843

Purpose: a quick index for finding the vector operation you need.

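vadd -> ri = ai + bi; //--1. Vector add (normal instruction): r, a, b have equal lane sizes

vaddl -> ri = ai + bi; //--2. Vector long add (long instruction): a, b have equal lane sizes; r lanes are twice as wide

vhadd -> ri = (ai + bi) >> 1; //--4. Vector halving add

vrhadd -> ri = (ai + bi + 1) >> 1; //--5. Vector rounding halving add

vqadd -> ri = sat(ai + bi); //--6. Vector saturating add (saturating instruction)

vaddhn -> ri = hi(ai + bi); //--7. Vector add high half (narrowing instruction): hi() keeps the upper half of each lane; the result is not saturated

vraddhn -> ri = hi(ai + bi + rnd); //--8. Vector rounding add high half (narrowing instruction): rnd is half of the lowest bit that is kept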
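The add variants differ only in what happens to the bits that do not fit in the result lane. The following is a scalar sketch of one lane of each variant, not the intrinsics themselves; the function names (`hadd_u8`, `qadd_s8`, and so on) are made up for illustration, and the lane types are one arbitrary choice among those the real intrinsics support.

```c
#include <assert.h>
#include <stdint.h>

/* vhadd (e.g. vhadd_u8): the sum is formed at full precision, then halved. */
uint8_t hadd_u8(uint8_t a, uint8_t b) {
    return (uint8_t)(((unsigned)a + b) >> 1);
}

/* vrhadd: same as vhadd, but adds 1 before the shift so halves round up. */
uint8_t rhadd_u8(uint8_t a, uint8_t b) {
    return (uint8_t)(((unsigned)a + b + 1u) >> 1);
}

/* vqadd (e.g. vqadd_s8): the sum is clamped to the lane's range. */
int8_t qadd_s8(int8_t a, int8_t b) {
    int sum = (int)a + (int)b;
    if (sum > INT8_MAX) return INT8_MAX;
    if (sum < INT8_MIN) return INT8_MIN;
    return (int8_t)sum;
}

/* vaddhn (e.g. vaddhn_u16): the sum wraps at 16 bits, then only the
 * upper 8 bits are kept. There is no saturation. */
uint8_t addhn_u16(uint16_t a, uint16_t b) {
    uint16_t sum = (uint16_t)((unsigned)a + b);
    return (uint8_t)(sum >> 8);
}

/* vraddhn: adds half of the lowest kept bit (0x80) before narrowing. */
uint8_t raddhn_u16(uint16_t a, uint16_t b) {
    uint16_t sum = (uint16_t)((unsigned)a + b + 0x80u);
    return (uint8_t)(sum >> 8);
}

/* Spot checks for the models above; call from main() when experimenting. */
void check_add_models(void) {
    assert(hadd_u8(255, 255) == 255);   /* no overflow in the intermediate sum */
    assert(qadd_s8(100, 100) == INT8_MAX);
    assert(addhn_u16(0x12FF, 0) == 0x12);
    assert(raddhn_u16(0x12FF, 0) == 0x13);
}
```

The last two checks show the difference between truncating and rounding when narrowing: 0x12FF becomes 0x12 with vaddhn and 0x13 with vraddhn.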

vmul -> ri = ai * bi; //--1. Vector multiply (normal instruction)

vmla -> ri = ai + bi * ci; //--2. Vector multiply accumulate

vqdmulh -> ri = sat(hi(2 * ai * bi)); //--6. Vector saturating doubling multiply high: hi() keeps the upper half of the doubled product

vqdmlal -> ri = sat(ai + sat(2 * bi * ci)); //--8. Vector saturating doubling multiply accumulate long

vqdmlsl -> ri = sat(ai - sat(2 * bi * ci)); //--9. Vector saturating doubling multiply subtract long

vmull -> ri = ai * bi; //--10. Vector long multiply (long instruction): r lanes are twice as wide as a, b

vqdmull -> ri = sat(2 * ai * bi); //--11. Vector saturating doubling long multiply

vfma -> ri = ai + bi * ci; //--12. Fused multiply accumulate

vfms -> ri = ai - bi * ci; //--13. Fused multiply subtract
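The "saturating doubling" multiplies are easy to misread, so here is a scalar sketch of one 16-bit lane of vqdmull, vqdmulh and vqdmlal. The function names are hypothetical, and the code assumes that `>>` on a negative `int32_t` is an arithmetic shift, which C leaves implementation-defined but which holds on common compilers.

```c
#include <assert.h>
#include <stdint.h>

/* vqdmull (e.g. vqdmull_s16): 2*a*b widened to 32 bits and saturated.
 * Only (-32768) * (-32768) actually overflows. */
int32_t qdmull_s16(int16_t a, int16_t b) {
    int64_t doubled = 2 * (int64_t)a * (int64_t)b;
    if (doubled > INT32_MAX) return INT32_MAX;
    return (int32_t)doubled;
}

/* vqdmulh (e.g. vqdmulh_s16): upper 16 bits of the saturated doubled
 * product. With Q15 inputs this is a truncating fixed-point multiply.
 * Assumes arithmetic right shift for negative values. */
int16_t qdmulh_s16(int16_t a, int16_t b) {
    return (int16_t)(qdmull_s16(a, b) >> 16);
}

/* vqdmlal (e.g. vqdmlal_s16): acc + sat(2*b*c), saturated to 32 bits. */
int32_t qdmlal_s16(int32_t acc, int16_t b, int16_t c) {
    int64_t sum = (int64_t)acc + qdmull_s16(b, c);
    if (sum > INT32_MAX) return INT32_MAX;
    if (sum < INT32_MIN) return INT32_MIN;
    return (int32_t)sum;
}

/* Spot checks; call from main() when experimenting. */
void check_mul_models(void) {
    assert(qdmulh_s16(16384, 16384) == 8192);      /* 0.5 * 0.5 = 0.25 in Q15 */
    assert(qdmulh_s16(INT16_MIN, INT16_MIN) == INT16_MAX);
    assert(qdmlal_s16(10, 3, 4) == 34);
}
```

The doubling exists because the product of two Q15 values has two sign bits; doubling removes the redundant one so that the high half is again a Q15 value.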

vsub -> ri = ai - bi; //--1. Vector subtract (normal instruction)

vsubl -> ri = ai - bi; //--2. Vector long subtract (long instruction)

vsubw -> ri = ai - bi; //--3. Vector wide subtract (wide instruction)

vqsub -> ri = sat(ai - bi); //--4. Vector saturating subtract (saturating instruction)

vhsub -> ri = (ai - bi) >> 1; //--5. Vector halving subtract

vsubhn -> ri = hi(ai - bi); //--6. Vector subtract high half (narrowing instruction): hi() keeps the upper half of each lane

vrsubhn -> ri = hi(ai - bi + rnd); //--7. Vector rounding subtract high half (narrowing instruction): rnd is half of the lowest bit that is kept
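Several of the subtract entries share the same formula and differ only in lane widths and overflow handling. A scalar sketch of one unsigned lane, with hypothetical function names, makes the difference visible for the same inputs:

```c
#include <assert.h>
#include <stdint.h>

/* vsub (e.g. vsub_u8): result has the input width, so it wraps. */
uint8_t sub_u8(uint8_t a, uint8_t b) {
    return (uint8_t)((unsigned)a - b);
}

/* vsubl (e.g. vsubl_u8): both inputs are narrow, the result lane is twice
 * as wide. For unsigned lanes a negative difference wraps at 16 bits. */
uint16_t subl_u8(uint8_t a, uint8_t b) {
    return (uint16_t)((unsigned)a - b);
}

/* vsubw (e.g. vsubw_u8): the first input is already wide, the second is
 * narrow and gets widened before the subtraction. */
uint16_t subw_u8(uint16_t a, uint8_t b) {
    return (uint16_t)((unsigned)a - b);
}

/* vqsub (e.g. vqsub_u8): clamps at zero instead of wrapping. */
uint8_t qsub_u8(uint8_t a, uint8_t b) {
    return a > b ? (uint8_t)(a - b) : 0;
}

/* Spot checks; call from main() when experimenting. */
void check_sub_models(void) {
    assert(sub_u8(3, 5) == 254);      /* wraps at 8 bits */
    assert(subl_u8(3, 5) == 0xFFFE);  /* wraps at 16 bits */
    assert(qsub_u8(3, 5) == 0);       /* saturates */
}
```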

vceq -> ri = ai == bi ? 1...1 : 0...0; //--1. Vector compare equal (normal instruction)

vcge -> ri = ai >= bi ? 1...1 : 0...0; //--2. Vector compare greater-than or equal (normal instruction)

vcle -> ri = ai <= bi ? 1...1 : 0...0; //--3. Vector compare less-than or equal (normal instruction)

vcgt -> ri = ai > bi ? 1...1 : 0...0; //--4. Vector compare greater-than (normal instruction)

vclt -> ri = ai < bi ? 1...1 : 0...0; //--5. Vector compare less-than (normal instruction)

vtst -> ri = ((ai & bi) != 0) ? 1...1 : 0...0; //--Vector test bits (normal instruction)
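The compare operations return a mask (all ones or all zeros per lane) rather than a boolean, which is what makes branch-free selection possible. Below is a scalar sketch of one 8-bit lane; the function names are hypothetical, and `select_u8` is an illustration of how such a mask is typically consumed (NEON's own bitwise select, vbsl, is not part of the list above).

```c
#include <assert.h>
#include <stdint.h>

/* vceq (e.g. vceq_u8): all ones when equal, all zeros otherwise. */
uint8_t ceq_u8(uint8_t a, uint8_t b) { return a == b ? 0xFF : 0x00; }

/* vcgt (e.g. vcgt_u8): all ones when a > b. */
uint8_t cgt_u8(uint8_t a, uint8_t b) { return a > b ? 0xFF : 0x00; }

/* vtst (e.g. vtst_u8): all ones when a and b share at least one set bit.
 * Note the parentheses: in C, != binds tighter than &. */
uint8_t tst_u8(uint8_t a, uint8_t b) { return (a & b) != 0 ? 0xFF : 0x00; }

/* Branch-free select: takes x where the mask bits are 1, y where they are 0. */
uint8_t select_u8(uint8_t mask, uint8_t x, uint8_t y) {
    return (uint8_t)((mask & x) | (~mask & y));
}

/* Spot checks; call from main() when experimenting. */
void check_compare_models(void) {
    assert(tst_u8(0x0F, 0xF0) == 0x00);
    assert(tst_u8(0x0F, 0x18) == 0xFF);
    assert(select_u8(cgt_u8(9, 4), 9, 4) == 9);  /* max(9, 4) without a branch */
}
```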

vabd -> ri = |ai - bi|; //--1. Absolute difference between the arguments (normal instruction)

vabdl -> ri = |ai - bi|; //--2. Absolute difference, long (long instruction)

vaba -> ri = ai + |bi - ci|; //--3. Absolute difference and accumulate

vabal -> ri = ai + |bi - ci|; //--4. Absolute difference and accumulate, long

vmax -> ri = ai >= bi ? ai : bi; //--Normal instruction, returns the larger of each pair

vmin -> ri = ai >= bi ? bi : ai; //--Normal instruction, returns the smaller of each pair
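A common use of the absolute-difference family is a sum of absolute differences. The scalar sketch below models one lane of vabd, vaba and vabal and then accumulates over two byte arrays; all function names, including `sad_u8`, are hypothetical, and a 16-bit accumulator is assumed to be wide enough for the inputs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* vabd (e.g. vabd_u8): |a - b| computed without going negative. */
uint8_t abd_u8(uint8_t a, uint8_t b) {
    return a > b ? (uint8_t)(a - b) : (uint8_t)(b - a);
}

/* vaba (e.g. vaba_u8): accumulator has the same width, so it can wrap. */
uint8_t aba_u8(uint8_t acc, uint8_t b, uint8_t c) {
    return (uint8_t)((unsigned)acc + abd_u8(b, c));
}

/* vabal (e.g. vabal_u8): accumulator lane is twice as wide as b and c. */
uint16_t abal_u8(uint16_t acc, uint8_t b, uint8_t c) {
    return (uint16_t)((unsigned)acc + abd_u8(b, c));
}

/* Sum of absolute differences built on the widening accumulate. */
uint16_t sad_u8(const uint8_t *p, const uint8_t *q, size_t n) {
    uint16_t acc = 0;
    for (size_t i = 0; i < n; ++i) {
        acc = abal_u8(acc, p[i], q[i]);
    }
    return acc;
}

/* Spot checks; call from main() when experimenting. */
void check_abd_models(void) {
    assert(abd_u8(3, 250) == 247);
    assert(aba_u8(250, 0, 10) == 4);     /* 260 wraps at 8 bits */
    assert(abal_u8(250, 0, 10) == 260);  /* fits in the wide accumulator */
}
```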

vshl -> ri = ai << bi; //--1. Vector shift left (normal instruction): negative values of bi shift right

vshr -> ri = ai >> b; //--1. Vector shift right by constant: the results are truncated

vshl -> ri = ai << b; //--2. Vector shift left by constant

vrshr -> ri = (ai + (1 << (b - 1))) >> b; //--3. Vector rounding shift right by constant

vsra -> ri = ai + (bi >> c); //--4. Vector shift right by constant and accumulate

vqshl -> ri = sat(ai << b); //--6. Vector saturating shift left by constant

vqshlu -> ri = usat(ai << b); //--7. Vector signed->unsigned saturating shift left by constant: the result is clamped to the unsigned range

vshrn -> ri = narrow(ai >> b); //--8. Vector narrowing shift right by constant: r lanes are half as wide

vqshrn -> ri = sat(ai >> b); //--11. Vector narrowing saturating shift right by constant

vrshrn -> ri = narrow((ai + (1 << (b - 1))) >> b); //--12. Vector rounding narrowing shift right by constant

vshll -> ri = ai << b; //--14. Vector widening shift left by constant
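The shift variants combine three independent choices: round or truncate, narrow or keep the width, saturate or wrap. A scalar sketch of one lane of a few of them follows; the function names are hypothetical and the shift-count ranges in the assertions mirror the immediates I believe the corresponding 16-bit and 8-bit intrinsics accept.

```c
#include <assert.h>
#include <stdint.h>

/* vrshr (e.g. vrshr_n_u16): adds half of the last dropped bit, then shifts. */
uint16_t rshr_u16(uint16_t a, int n) {
    assert(n >= 1 && n <= 16);
    return (uint16_t)(((uint32_t)a + (1u << (n - 1))) >> n);
}

/* vsra (e.g. vsra_n_u16): only the second operand is shifted. */
uint16_t sra_u16(uint16_t acc, uint16_t b, int n) {
    assert(n >= 1 && n <= 16);
    return (uint16_t)((uint32_t)acc + ((uint32_t)b >> n));
}

/* vshrn (e.g. vshrn_n_u16): shift, then keep the low 8 bits. */
uint8_t shrn_u16(uint16_t a, int n) {
    assert(n >= 1 && n <= 8);
    return (uint8_t)((uint32_t)a >> n);
}

/* vqshrn (e.g. vqshrn_n_u16): shift, then clamp to the 8-bit range. */
uint8_t qshrn_u16(uint16_t a, int n) {
    assert(n >= 1 && n <= 8);
    uint32_t shifted = (uint32_t)a >> n;
    return shifted > UINT8_MAX ? UINT8_MAX : (uint8_t)shifted;
}

/* vqshlu (e.g. vqshlu_n_s8): signed input, unsigned saturated output. */
uint8_t qshlu_s8(int8_t a, int n) {
    assert(n >= 0 && n <= 7);
    if (a < 0) return 0;
    unsigned shifted = (unsigned)a << n;
    return shifted > UINT8_MAX ? UINT8_MAX : (uint8_t)shifted;
}
```

For example, shifting 0x1234 right by 4 gives 0x123: vshrn keeps 0x23, while vqshrn clamps to 0xFF.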

vabs -> ri = |ai|; //--1. Absolute (normal instruction)

vqabs -> ri = sat(|ai|); //--2. Saturating absolute (saturating instruction)

vneg -> ri = -ai; //--1. Negate (normal instruction): negates each element in a vector

vqneg -> ri = sat(-ai); //--2. Saturating negate
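The saturating forms of absolute value and negate differ from the plain forms for exactly one input: the most negative value, whose magnitude does not fit in the lane. A scalar sketch of one `int8_t` lane, with hypothetical function names:

```c
#include <assert.h>
#include <stdint.h>

/* vabs (e.g. vabs_s8): |a|, except that -128 has no positive counterpart
 * in 8 bits and comes back unchanged. */
int8_t abs_s8(int8_t a) {
    if (a == INT8_MIN) return INT8_MIN;
    return (int8_t)(a < 0 ? -a : a);
}

/* vqabs (e.g. vqabs_s8): same, but -128 saturates to 127. */
int8_t qabs_s8(int8_t a) {
    if (a == INT8_MIN) return INT8_MAX;
    return (int8_t)(a < 0 ? -a : a);
}

/* vneg (e.g. vneg_s8): -a, with -128 wrapping back to -128. */
int8_t neg_s8(int8_t a) {
    if (a == INT8_MIN) return INT8_MIN;
    return (int8_t)(-a);
}

/* vqneg (e.g. vqneg_s8): -a, with -128 saturating to 127. */
int8_t qneg_s8(int8_t a) {
    if (a == INT8_MIN) return INT8_MAX;
    return (int8_t)(-a);
}

/* Spot checks; call from main() when experimenting. */
void check_abs_neg_models(void) {
    assert(abs_s8(INT8_MIN) == INT8_MIN);
    assert(qabs_s8(INT8_MIN) == INT8_MAX);
    assert(qneg_s8(INT8_MIN) == INT8_MAX);
}
```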

vmvn -> ri = ~ai; //--1. Bitwise not (normal instruction)

vand -> ri = ai & bi; //--2. Bitwise and (normal instruction)

vorr -> ri = ai | bi; //--3. Bitwise or (normal instruction)

veor -> ri = ai ^ bi; //--4. Bitwise exclusive or (EOR or XOR) (normal instruction)

vbic -> ri = ai & ~bi; //--5. Bit clear (normal instruction): clears in a the bits that are set in b

vorn -> ri = ai | (~bi); //--6. Bitwise OR complement (normal instruction)

vmovn -> ri = lo(ai); //--1. Vector narrow integer (narrowing instruction): vmovn copies the least significant half of each lane
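Operand order is the usual source of mistakes with the complementing bitwise operations: in both cases it is the second operand that is inverted. A scalar sketch of one 8-bit lane, with hypothetical function names:

```c
#include <assert.h>
#include <stdint.h>

/* Bitwise not of one lane (e.g. vmvn_u8). */
uint8_t mvn_u8(uint8_t a) { return (uint8_t)~a; }

/* Bit clear (e.g. vbic_u8): keeps the bits of a that are NOT set in b. */
uint8_t bic_u8(uint8_t a, uint8_t b) { return (uint8_t)(a & ~b); }

/* OR complement (e.g. vorn_u8): ORs a with the inverse of b. */
uint8_t orn_u8(uint8_t a, uint8_t b) { return (uint8_t)(a | ~b); }

/* Spot checks; call from main() when experimenting. */
void check_bitwise_models(void) {
    assert(mvn_u8(0x0F) == 0xF0);
    assert(bic_u8(0xFF, 0x0F) == 0xF0);  /* low nibble cleared */
    assert(orn_u8(0x00, 0x0F) == 0xF0);
}
```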

vmul -> ri = ai * b; //--1. Vector multiply by scalar

vmull -> ri = ai * b; //--3. Vector long multiply with scalar

vmull -> ri = ai * b[c]; //--4. Vector long multiply by scalar

vqdmull -> ri = sat(2 * ai * b); //--5. Vector saturating doubling long multiply with scalar

vqdmull -> ri = sat(2 * ai * b[c]); //--6. Vector saturating doubling long multiply by scalar

vmla -> ri = ai + bi * c; //--11. Vector multiply accumulate with scalar

vmla -> ri = ai + bi * c[d]; //--12. Vector multiply accumulate by scalar

vmlal -> ri = ai + bi * c; //--13. Vector widening multiply accumulate with scalar

vmlal -> ri = ai + bi * c[d]; //--14. Vector widening multiply accumulate by scalar

vmls -> ri = ai - bi * c; //--17. Vector multiply subtract with scalar

vmls -> ri = ai - bi * c[d]; //--18. Vector multiply subtract by scalar

vmlsl -> ri = ai - bi * c; //--19. Vector widening multiply subtract with scalar

vmlsl -> ri = ai - bi * c[d]; //--20. Vector widening multiply subtract by scalar
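In the entries above, "with scalar" means the multiplier is a plain value broadcast to every lane, while "by scalar" means it is one lane `c[d]` picked out of another vector. The sketch below models both forms on a 4-lane vector represented as a C array; the function names, the `LANES` constant and the array representation are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 4  /* e.g. a uint16x4_t holds four 16-bit lanes */

/* vmla with scalar (e.g. vmla_n_u16): r[i] = a[i] + b[i] * c, wrapping. */
void mla_n_u16(uint16_t r[LANES], const uint16_t a[LANES],
               const uint16_t b[LANES], uint16_t c) {
    for (int i = 0; i < LANES; ++i) {
        r[i] = (uint16_t)((uint32_t)a[i] + (uint32_t)b[i] * c);
    }
}

/* vmla by scalar (e.g. vmla_lane_u16): the multiplier is lane d of c. */
void mla_lane_u16(uint16_t r[LANES], const uint16_t a[LANES],
                  const uint16_t b[LANES], const uint16_t c[LANES], int d) {
    assert(d >= 0 && d < LANES);
    mla_n_u16(r, a, b, c[d]);
}

/* vmlal by scalar (e.g. vmlal_lane_u16): accumulator and result lanes are
 * twice as wide, so a 16x16-bit product cannot be lost to truncation. */
void mlal_lane_u16(uint32_t r[LANES], const uint32_t a[LANES],
                   const uint16_t b[LANES], const uint16_t c[LANES], int d) {
    assert(d >= 0 && d < LANES);
    for (int i = 0; i < LANES; ++i) {
        r[i] = a[i] + (uint32_t)b[i] * c[d];
    }
}
```

Calling `mla_lane_u16(r, a, b, c, 2)` with `a = {1, 2, 3, 4}`, `b = {10, 20, 30, 40}` and `c = {2, 3, 5, 7}` multiplies every lane of `b` by `c[2] = 5` before adding `a`.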

For detailed descriptions of each function, see 《Neon Intrinsics各函数介绍》 (an introduction to each NEON intrinsic function).
