Understanding in detail the algorithm for inverting a large number of 3x3 matrices

2024-01-28

I am following up on this original post about PyCUDA code for inverting a large number of 3x3 matrices: https://stackoverflow.com/questions/55357826/pycuda-adapt-existing-code-and-kernel-code-to-perform-a-high-number-of-3x3-mat
The code suggested as an answer is:

$ cat t14.py
import numpy as np
import pycuda.driver as cuda
from pycuda.compiler import SourceModule
import pycuda.autoinit
# kernel
kernel = SourceModule("""

__device__ unsigned getoff(unsigned &off){
  unsigned ret = off & 0x0F;
  off >>= 4;
  return ret;
}   

// in-place is acceptable (i.e. out == in)
// T = float or double only
const int block_size = 288;
typedef double T; // *** can set to float or double
__global__ void inv3x3(const T * __restrict__ in, T * __restrict__ out, const size_t n, const unsigned * __restrict__ pat){

  __shared__ T si[block_size];
  size_t idx = threadIdx.x+blockDim.x*blockIdx.x;
  T det = 1;
  if (idx < n*9)
    det = in[idx];
  unsigned sibase = (threadIdx.x / 9)*9;
  unsigned lane = threadIdx.x - sibase; // cheaper modulo
  si[threadIdx.x] = det;
  __syncthreads();
  unsigned off = pat[lane];
  T a  = si[sibase + getoff(off)];
  a   *= si[sibase + getoff(off)];
  T b  = si[sibase + getoff(off)];
  b   *= si[sibase + getoff(off)];
  a -= b;
  __syncthreads();
  if (lane == 0) si[sibase+3] = a;
  if (lane == 3) si[sibase+4] = a;
  if (lane == 6) si[sibase+5] = a;
  __syncthreads();
  det =  si[sibase]*si[sibase+3]+si[sibase+1]*si[sibase+4]+si[sibase+2]*si[sibase+5];
  if (idx < n*9)
    out[idx] = a / det;
}   

""")
# host code
def gpuinv3x3(inp, n):
    # internal constants not to be modified
    hpat = (0x07584, 0x08172, 0x04251, 0x08365, 0x06280, 0x05032, 0x06473, 0x07061, 0x03140)
    # Convert parameters into numpy array
    # *** change next line between float32 and float64 to match float or double
    inpd = np.array(inp, dtype=np.float64)
    hpatd = np.array(hpat, dtype=np.uint32)
    # *** change next line between float32 and float64 to match float or double
    output = np.empty((n*9), dtype= np.float64)
    # Get kernel function
    matinv3x3 = kernel.get_function("inv3x3")
    # Define block, grid and compute
    blockDim = (288,1,1) # do not change
    gridDim = ((n//32)+1,1,1) # integer division: grid dimensions must be ints in Python 3
    # Kernel function
    matinv3x3 (
        cuda.In(inpd), cuda.Out(output), np.uint64(n), cuda.In(hpatd),
        block=blockDim, grid=gridDim)
    return output
inp = (1.0, 1.0, 1.0, 0.0, 0.0, 3.0, 1.0, 2.0, 2.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0)
n = 2
result = gpuinv3x3(inp, n)
print(result.reshape(2,3,3))

On an initial 1D array of 18 values (i.e. two 3x3 matrices), the result gives the correct inverted matrices, i.e.:

[[[ 2.         -0.         -1.        ]
  [-1.         -0.33333333  1.        ]
  [-0.          0.33333333 -0.        ]]

 [[ 1.          0.          0.        ]
  [ 0.          1.          0.        ]
  [ 0.          0.          1.        ]]]

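Main issue: I would like to understand in detail how this algorithm works, especially how the kernel uses shared memory for the initial 1D vector and how that brings an optimization when I run this code on a large number of 3x3 matrices.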

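I understand the line size_t idx = threadIdx.x+blockDim.x*blockIdx.x; which gives the global index of the current work-item, identified by the local threadIdx and the blockIdx of the current work-group block.
I understand that __shared__ T si[block_size]; represents a shared array, i.e. one associated with a work-group block: this is what we call local memory (in OpenCL terms).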

On the other hand, I don't understand the following part of the kernel code:

 __shared__ T si[block_size];

 size_t idx = threadIdx.x+blockDim.x*blockIdx.x;
 T det = 1;
 if (idx < n*9)
   det = in[idx];
 unsigned sibase = (threadIdx.x / 9)*9;
 unsigned lane = threadIdx.x - sibase; // cheaper modulo
 si[threadIdx.x] = det;
 __syncthreads();
 unsigned off = pat[lane];
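 T a  = si[sibase + getoff(off)];
 a   *= si[sibase + getoff(off)];
 T b  = si[sibase + getoff(off)];
 b   *= si[sibase + getoff(off)];
 a -= b;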
 __syncthreads();
 if (lane == 0) si[sibase+3] = a;
 if (lane == 3) si[sibase+4] = a;
 if (lane == 6) si[sibase+5] = a;
 __syncthreads();

In particular, what is the role of the sibase index, defined as unsigned sibase = (threadIdx.x / 9)*9; ?

Also, what is the purpose of the lane variable, defined as unsigned lane = threadIdx.x - sibase; // cheaper modulo

Finally, shifting is applied in these lines:

      T a  = si[sibase + getoff(off)];
      a   *= si[sibase + getoff(off)];
      T b  = si[sibase + getoff(off)];
      b   *= si[sibase + getoff(off)];
      a -= b;

But I don't clearly see what they do.

I have the same problem with this part:

 if (lane == 0) si[sibase+3] = a;
 if (lane == 3) si[sibase+4] = a;
 if (lane == 6) si[sibase+5] = a;

The determinant is computed in a strange way that I cannot follow, i.e.:

 det =  si[sibase]*si[sibase+3]+si[sibase+1]*si[sibase+4]+si[sibase+2]*si[sibase+5];

I am not a beginner in OpenCL, but I am not yet expert enough to fully understand this kernel code.

Preliminaries

First, it's important to understand the arithmetic of a 3x3 matrix inverse; see here https://ardoris.wordpress.com/2008/07/18/general-formula-for-the-inverse-of-a-3x3-matrix/ (and below).

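The general approach used for the kernel design is to assign one matrix result element per thread. Therefore we need 9 threads per matrix, and each thread is ultimately responsible for computing one of the 9 numerical results for its matrix. To compute two matrices we need 18 threads, and three matrices need 27 threads.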

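An ancillary task is to decide the threadblock and grid sizes. This follows the typical methodology (the overall problem size determines the total number of threads needed), but we specifically choose a threadblock size of 288 because it is a convenient multiple of both 9 (the number of threads per matrix) and 32 (the number of threads per warp in CUDA). This gives a certain measure of efficiency: no wasted threads and no gaps in data storage.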
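The block and grid arithmetic can be checked on the host. The following is a minimal Python sketch, not part of the original answer; the helper name `launch_config` is hypothetical. It confirms that 288 is a multiple of both 9 and 32, and computes how many blocks are needed to cover n matrices.

```python
BLOCK_SIZE = 288          # threads per block, as in the kernel
THREADS_PER_MATRIX = 9    # one thread per matrix element
WARP_SIZE = 32            # threads per warp on CUDA hardware

def launch_config(n):
    """Return (matrices_per_block, blocks_needed) for n matrices (hypothetical helper)."""
    matrices_per_block = BLOCK_SIZE // THREADS_PER_MATRIX
    total_threads = n * THREADS_PER_MATRIX
    # Ceiling division: enough blocks to cover every required thread.
    blocks_needed = (total_threads + BLOCK_SIZE - 1) // BLOCK_SIZE
    return matrices_per_block, blocks_needed

# 288 divides evenly into matrices (9 threads) and into warps (32 threads).
assert BLOCK_SIZE % THREADS_PER_MATRIX == 0
assert BLOCK_SIZE % WARP_SIZE == 0

print(launch_config(2))     # (32, 1)
print(launch_config(1000))  # (32, 32)
```

Note that the host code in the question sizes the grid as n/32 + 1 blocks, which launches one extra block whenever n is a multiple of 32. That is harmless here, because the kernel guards its global loads and stores with idx < n*9.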

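Since our thread strategy is one thread per matrix element, the 9 threads of a matrix must solve the inversion collectively. The main tasks are to compute the transposed matrix of cofactors, then the determinant, and then the final arithmetic (division by the determinant) that yields each result element.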

Calculation of the cofactors

The first task is to compute the transposed matrix of cofactors of A, called M:

        |a b c|
let A = |d e f|
        |g h i|

    
        |ei-fh ch-bi bf-ce|
    M = |fg-di ai-cg cd-af|
        |dh-eg bg-ah ae-bd|
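As a quick sanity check of the matrix M above, the following Python sketch (an illustration only; the function names `adjugate` and `matmul3` are made up here) builds M for the first example matrix from the question and verifies that A times M equals det(A) times the identity, which is the property that makes M/det(A) the inverse.

```python
def adjugate(A):
    """Transposed cofactor matrix M of a row-major 3x3 matrix A (flat list of 9)."""
    a, b, c, d, e, f, g, h, i = A
    return [e*i - f*h, c*h - b*i, b*f - c*e,
            f*g - d*i, a*i - c*g, c*d - a*f,
            d*h - e*g, b*g - a*h, a*e - b*d]

def matmul3(X, Y):
    """Product of two row-major 3x3 matrices stored as flat lists of 9."""
    return [sum(X[3*r + k] * Y[3*k + c] for k in range(3))
            for r in range(3) for c in range(3)]

# First matrix of the example input in the question.
A = [1.0, 1.0, 1.0, 0.0, 0.0, 3.0, 1.0, 2.0, 2.0]
M = adjugate(A)
# Determinant from the first row of A and the first column of M.
det = A[0]*M[0] + A[1]*M[3] + A[2]*M[6]

print(M)
print(det)            # -3.0
print(matmul3(A, M))  # det on the diagonal, zero elsewhere
```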

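We have 9 threads for this task and 9 elements of M to compute, so we assign one thread to each element of M. Each element of M depends on multiple input values (a, b, c, etc.), so we first load each input value (there are 9, one per thread) into shared memory: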

  // allocate enough shared memory for one element per thread in the block:
  __shared__ T si[block_size];
  // compute a globally unique thread index, so each thread has a unique number 0,1,2,etc.
  size_t idx = threadIdx.x+blockDim.x*blockIdx.x;
  // establish a temporary variable that we will use and reuse during thread processing
  T det = 1;
  // do a thread check to make sure that our next load will be in-bounds for the input array in
  if (idx < n*9)
  // load one element per thread, 9 threads per matrix will load an entire matrix
    det = in[idx];
  // for a given matrix (9 threads) compute the base offset into shared memory, where this matrix data (9 elements) will be stored.  All 9 threads have the same base offset
  unsigned sibase = (threadIdx.x / 9)*9;
  // for each group of 9 threads handling a matrix, compute for each thread in that group, a group offset or "lane" from 0..8, so each thread in the group has a unique identifier/assignment in the group
  unsigned lane = threadIdx.x - sibase; // cheaper modulo
  // let each thread place its matrix element a,b,c, etc. into shared memory
  si[threadIdx.x] = det;
  // shared memory is now loaded, make sure all threads have loaded before any calculations begin
  __syncthreads();
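To make sibase and lane concrete, here is a small Python sketch (illustrative only; `sibase_and_lane` is a hypothetical helper that mirrors the two kernel lines) that evaluates both values for a few thread indices within a block.

```python
def sibase_and_lane(thread_idx_x):
    """Mirror of the kernel's sibase/lane computation for one thread."""
    sibase = (thread_idx_x // 9) * 9   # integer division, like C on unsigned values
    lane = thread_idx_x - sibase       # same value as thread_idx_x % 9
    return sibase, lane

for t in (0, 8, 9, 17, 287):
    print(t, sibase_and_lane(t))
# 0 (0, 0)
# 8 (0, 8)
# 9 (9, 0)
# 17 (9, 8)
# 287 (279, 8)
```

Threads 0..8 share sibase 0 and handle the first matrix of the block, threads 9..17 share sibase 9 and handle the second, and so on. Because the block size (288) is a multiple of 9, no matrix straddles two blocks, so the lane derived from threadIdx.x agrees with the element position derived from the global index.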

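Now that each element of the A matrix (a, b, c, ...) is loaded into shared memory, we can begin to compute the cofactors in M. Let's focus on one particular thread (0) and its cofactor (ei-fh). All of the matrix elements needed to compute this cofactor (e, i, f, and h) are now in shared memory. We need a way to load them in sequence and perform the required multiplications and subtraction.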

At this point we observe two things:

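1. Each M element (cofactor) needs a different set of 4 elements of A.
2. Each M element (cofactor) follows the same general arithmetic. Given four arbitrary elements of A, call them X, Y, Z and W; the arithmetic is XY-ZW. I take the first element and multiply it by the second, then take the third and fourth elements and multiply them together, then subtract the second product from the first.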

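Since the general sequence of operations (item 2 above) is the same for all 9 cofactors, we only need a method to arrange the loading of the 4 needed matrix elements. This method is encoded in the load patterns that are hard-coded into the example: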

 hpat = (0x07584, 0x08172, 0x04251, 0x08365, 0x06280, 0x05032, 0x06473, 0x07061, 0x03140)

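There are 9 load patterns, each occupying one hexadecimal number: one load pattern per thread, i.e. one load pattern per M matrix element (cofactor). Within a particular A matrix, the matrix elements a, b, c, etc. are (already) loaded into shared memory at group offsets 0, 1, 2, etc. The load pattern for a given thread lets us generate the sequence of group offsets needed to retrieve the elements of A from their locations in shared memory, in the order required to compute the cofactor assigned to that thread. Considering thread 0 and its cofactor ei-fh, how does the load pattern 0x7584 encode the sequence needed to select e, then i, then f, then h?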

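For this we have a helper function getoff, which takes a load pattern and successively (each time it is called) strips off an index. The first time I call getoff with an argument of 0x7584, it "strips off" the index 4, returns it, and replaces the load pattern 0x7584 with 0x758 for the next use. 4 corresponds to e. The next time I call getoff, with 0x758, it "strips off" the index 8, returns it, and replaces 0x758 with 0x75. 8 corresponds to i. The next call produces the index 5, corresponding to f, and the last call produces the index 7, corresponding to h.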
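The decoding can be reproduced on the host. The sketch below is a Python stand-in for the device function (the Python `getoff` returns the remaining pattern instead of modifying a reference argument, and `decode` and `NAMES` are names invented for this illustration). It prints the XY-ZW expression that each of the 9 load patterns encodes.

```python
HPAT = (0x07584, 0x08172, 0x04251, 0x08365, 0x06280,
        0x05032, 0x06473, 0x07061, 0x03140)
NAMES = "abcdefghi"   # group offsets 0..8 map to matrix elements a..i

def getoff(off):
    """Return (lowest hex digit, pattern shifted right by one hex digit)."""
    return off & 0x0F, off >> 4

def decode(pattern):
    """Peel four indices off a load pattern and format them as XY-ZW."""
    letters = []
    for _ in range(4):
        index, pattern = getoff(pattern)
        letters.append(NAMES[index])
    x, y, z, w = letters
    return f"{x}{y}-{z}{w}"

for lane, pattern in enumerate(HPAT):
    print(lane, hex(pattern), decode(pattern))
# 0 0x7584 ei-fh
# 1 0x8172 ch-bi
# 2 0x4251 bf-ce
# 3 0x8365 fg-di
# 4 0x6280 ai-cg
# 5 0x5032 cd-af
# 6 0x6473 dh-eg
# 7 0x7061 bg-ah
# 8 0x3140 ae-bd
```

Read in row-major order, the nine decoded expressions are exactly the entries of M.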

With this description, we walk through the code, pretending we are thread 0, and describe the process of computing ei-fh:

  // get the load pattern for my matrix "lane"
  unsigned off = pat[lane];
  //load my temporary variable `a` with the first item indexed in the load pattern:
  T a  = si[sibase + getoff(off)];
  // multiply my temporary variable `a` with the second item indexed in the load pattern
  a   *= si[sibase + getoff(off)];
  //load my temporary variable `b` with the third item indexed in the load pattern
  T b  = si[sibase + getoff(off)];
  // multiply my temporary variable `b` with the fourth item indexed in the load pattern
  b   *= si[sibase + getoff(off)];
  // compute the cofactor by subtracting the 2 products
  a -= b;

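sibase, as already indicated in the first commented code section, is the base offset in shared memory where the elements of the A matrix are stored. The value returned by getoff is then added to this base offset to select the relevant input element.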
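The same steps can be emulated for any thread, not only thread 0. In the following Python sketch (an illustration under the assumption that a plain list `si` stands in for the block's shared memory; `cofactor` is a hypothetical helper), each of the 18 threads of the two-matrix example computes its own cofactor.

```python
HPAT = (0x07584, 0x08172, 0x04251, 0x08365, 0x06280,
        0x05032, 0x06473, 0x07061, 0x03140)

def cofactor(si, thread_idx_x):
    """Emulate one thread: pick 4 shared values via its load pattern, return XY-ZW."""
    sibase = (thread_idx_x // 9) * 9
    lane = thread_idx_x - sibase
    off = HPAT[lane]
    picks = []
    for _ in range(4):
        picks.append(si[sibase + (off & 0x0F)])  # what getoff returns
        off >>= 4                                # what getoff does to the pattern
    x, y, z, w = picks
    return x * y - z * w

# Shared memory after the load phase: the two example matrices back to back.
si = [1.0, 1.0, 1.0, 0.0, 0.0, 3.0, 1.0, 2.0, 2.0,
      1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0]

cofactors = [cofactor(si, t) for t in range(18)]
print(cofactors[:9])   # cofactors of the first matrix
print(cofactors[9:])   # cofactors of the second (identity) matrix
```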

Calculation of the determinant

The numerical value of the determinant is given by:

det(A) = det = a(ei-fh) - b(di-fg) + c(dh-eg)

If we decompose this, we see that all of the terms have actually already been computed:

a,b,c:  these are input matrix elements, in shared locations (group offsets) 0, 1, 2
ei-fh:  cofactor computed by thread 0
di-fg:  cofactor computed by thread 3 (with sign reversed)
dh-eg:  cofactor computed by thread 6

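Now, every thread needs the value of the determinant, because each thread uses it when computing its final (result) element. Therefore we have every thread in the matrix redundantly compute the same value (which is more efficient than computing it in one thread and then broadcasting it to the others). To facilitate this, the 3 already-computed cofactors must be made available to all 9 threads. So we select 3 locations in shared memory that are no longer needed in which to "publish" these values. We still need the values in locations 0, 1, 2, because the input matrix elements a, b, and c are needed for the determinant. But the input elements in locations 3, 4, and 5 are not needed for the rest of the work, so we reuse those locations: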

  // we are about to change shared values, so wait until all previous usage is complete
  __syncthreads();
  // load cofactor computed by thread 0 into group offset 3 in shared
  if (lane == 0) si[sibase+3] = a;
  // load cofactor computed by thread 3 into group offset 4 in shared
  if (lane == 3) si[sibase+4] = a;
  // load cofactor computed by thread 6 into group offset 5 in shared
  if (lane == 6) si[sibase+5] = a;
  // make sure shared memory loads are complete
  __syncthreads();
  // let every thread compute the determinant (same for all threads)
  //       a       * (ei-fh)    +  b         * -(di-fg)   +  c         * (dh-eg)
  det =  si[sibase]*si[sibase+3]+si[sibase+1]*si[sibase+4]+si[sibase+2]*si[sibase+5];
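The publishing step can be checked against the textbook formula. This Python sketch is illustrative only: `determinant_like_kernel` is a hypothetical name, and the 9-element list stands in for one matrix's slots in shared memory. It overwrites slots 3, 4 and 5 with the cofactors from lanes 0, 3 and 6 and then evaluates the kernel's expression.

```python
def determinant_like_kernel(elements, cofactors):
    """Emulate the publish-and-sum step for one matrix (9 shared-memory slots)."""
    si = list(elements)    # copy: slots 0..8 hold a..i
    si[3] = cofactors[0]   # lane 0 publishes ei-fh
    si[4] = cofactors[3]   # lane 3 publishes fg-di, i.e. -(di-fg)
    si[5] = cofactors[6]   # lane 6 publishes dh-eg
    return si[0]*si[3] + si[1]*si[4] + si[2]*si[5]

A = [1.0, 1.0, 1.0, 0.0, 0.0, 3.0, 1.0, 2.0, 2.0]
M = [-6.0, 0.0, 3.0, 3.0, 1.0, -3.0, 0.0, -1.0, 0.0]   # cofactors of A, one per lane

a, b, c, d, e, f, g, h, i = A
textbook = a*(e*i - f*h) - b*(d*i - f*g) + c*(d*h - e*g)

print(determinant_like_kernel(A, M))  # -3.0
print(textbook)                       # -3.0
```

The minus sign of the middle term in the textbook formula does not appear in the kernel's sum because lane 3 already holds fg-di, the negated form of di-fg. Overwriting slots 3, 4 and 5 is safe because every thread keeps its own cofactor in the register variable a.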

Calculation of the final result

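This just involves, for each thread, dividing that thread's previously computed cofactor by the just-computed determinant and storing the result: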

  // another thread check: make sure this thread is actually doing useful work
  if (idx < n*9)
  // take previously computed cofactor, divide by determinant, store result
    out[idx] = a / det;
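Putting the phases together, the whole kernel can be emulated on the CPU. The following Python sketch is an assumption-laden illustration, not the GPU code: `emulate_block` is a hypothetical function, a list plays the role of shared memory, and each loop over all threads stands for the work done between two __syncthreads() barriers.

```python
HPAT = (0x07584, 0x08172, 0x04251, 0x08365, 0x06280,
        0x05032, 0x06473, 0x07061, 0x03140)
BLOCK_SIZE = 288

def emulate_block(inp, n, block_idx=0):
    """Emulate one thread block of inv3x3, phase by phase, for n matrices."""
    threads = range(BLOCK_SIZE)
    idx = [t + BLOCK_SIZE * block_idx for t in threads]
    # Phase 1: each thread loads one element (or 1 when out of range).
    si = [inp[i] if i < n * 9 else 1.0 for i in idx]
    # --- __syncthreads() ---
    # Phase 2: each thread computes its cofactor into its private variable a.
    a = []
    for t in threads:
        sibase = (t // 9) * 9
        off = HPAT[t - sibase]
        p = [si[sibase + ((off >> (4 * k)) & 0x0F)] for k in range(4)]
        a.append(p[0] * p[1] - p[2] * p[3])
    # --- __syncthreads() ---
    # Phase 3: lanes 0, 3, 6 publish their cofactors in slots 3, 4, 5.
    for t in threads:
        sibase = (t // 9) * 9
        lane = t - sibase
        if lane in (0, 3, 6):
            si[sibase + 3 + lane // 3] = a[t]
    # --- __syncthreads() ---
    # Phase 4: each in-range thread computes the determinant and its result.
    out = {}
    for t in threads:
        sibase = (t // 9) * 9
        if idx[t] < n * 9:
            det = (si[sibase] * si[sibase + 3] + si[sibase + 1] * si[sibase + 4]
                   + si[sibase + 2] * si[sibase + 5])
            out[idx[t]] = a[t] / det
    return [out[i] for i in sorted(out)]

inp = (1.0, 1.0, 1.0, 0.0, 0.0, 3.0, 1.0, 2.0, 2.0,
       1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0)
result = emulate_block(inp, 2)
print(result[:9])
print(result[9:])
```

For the two example matrices this reproduces the inverses shown in the question. The barriers matter because phase 2 reads slots 3, 4 and 5 as input elements, while phase 3 overwrites them with cofactors.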
