从稀疏定义列表中挑选无模式下值的算法

2024-03-10

我有以下问题。

我正在开发一个随机模拟器，它随机采样系统的配置，并存储每个配置在特定时间实例被访问次数的统计数据。代码大致是这样的

f[_Integer][{_Integer..}] :=0
...
someplace later in the code, e.g.,
index = get index;
c = get random configuration (i.e. a tuple of integers, say a pair {n1, n2}); 
f[index][c] = f[index][c] + 1;
which tags that configuration c has occurred once more in the simulation at time instance index.

代码完成后，会出现一个 f 的定义列表，看起来像这样（我手动输入它只是为了强调最重要的部分）

?f
f[1][{1, 2}] = 112
f[1][{3, 4}] = 114
f[2][{1, 6}] = 216
f[2][{2, 7}] = 227
...
f[index][someconfiguration] = some value
...
f[_Integer][{_Integer..}] :=0

请注意，首先出现的无模式定义可能相当稀疏。人们也不知道将选择哪些值和配置。

问题是有效地提取所需索引的值，例如发出类似的问题

result = ExtractConfigurationsAndOccurences[f, 2]

它应该给出一个具有结构的列表

result = {list1, list2}

where

list1 = {{1, 6}, {2, 7}} (* the list of configurations that occurred during the simulation*)
list2 = {216, 227} (* how many times each of them occurred *)

问题是 ExtractConfigurationsAndOccurences 应该非常快。我能想到的唯一解决方案是使用 SubValues[f] （给出完整列表）并使用Cases陈述。我意识到应该不惜一切代价避免这个过程，因为需要测试的配置（定义）呈指数级增长，这会大大减慢代码速度。

Mathematica 中有没有一种自然的方法可以快速地做到这一点？

我希望 Mathematica 将 f[2] 视为具有许多向下值的单个头，但使用 DownValues[f[2]] 没有给出任何结果。使用 SubValues[f[2]] 也会导致错误。

这是我之前答案的完全重写。事实证明，在我之前的尝试中，我忽略了一种基于压缩数组和稀疏数组组合的简单得多的方法，该方法比以前的所有方法更快、内存效率更高（至少在我所使用的样本大小范围内）测试过），同时只对原始版本进行最小程度的更改SubValues- 基于的方法。由于问题是关于最有效的方法，我将从答案中删除其他方法（鉴于它们相当复杂并且占用大量空间。那些想看到它们的人可以检查这个的过去的修订）回答）。

原本的`SubValues`基于方法

我们首先引入一个函数来为我们生成配置的测试样本。这里是：

Clear[generateConfigurations];
generateConfigurations[maxIndex_Integer, maxConfX_Integer, maxConfY_Integer, 
  nconfs_Integer] :=
Transpose[{
  RandomInteger[{1, maxIndex}, nconfs],
  Transpose[{
     RandomInteger[{1, maxConfX}, nconfs],
     RandomInteger[{1, maxConfY}, nconfs]
  }]}];

我们可以生成一个小样本来说明：

In[3]:= sample  = generateConfigurations[2,2,2,10]
Out[3]= {{2,{2,1}},{2,{1,1}},{1,{2,1}},{1,{1,2}},{1,{1,2}},
          {1,{2,1}},{2,{1,2}},{2,{2,2}},{1,{2,2}},{1,{2,1}}}

这里我们只有 2 个索引，以及“x”和“y”数字都从 1 到 2 变化的配置 - 10 个这样的配置。

随着我们增加，以下函数将帮助我们模拟配置频率的累积SubValues基于 - 的重复出现的计数器：

Clear[testAccumulate];
testAccumulate[ff_Symbol, data_] :=
  Module[{},
   ClearAll[ff];
   ff[_][_] = 0;
   Do[
     doSomeStuff;
     ff[#1][#2]++ & @@ elem;
     doSomeMoreStaff;
   , {elem, data}]];

The doSomeStuff and doSomeMoreStaff符号在这里代表一些可能排除或跟随计数代码的代码。这data参数应该是由以下方式生成的表单的列表generateConfigurations。例如：

In[6]:= 
testAccumulate[ff,sample];
SubValues[ff]

Out[7]= {HoldPattern[ff[1][{1,2}]]:>2,HoldPattern[ff[1][{2,1}]]:>3,
   HoldPattern[ff[1][{2,2}]]:>1,HoldPattern[ff[2][{1,1}]]:>1,
   HoldPattern[ff[2][{1,2}]]:>1,HoldPattern[ff[2][{2,1}]]:>1,
   HoldPattern[ff[2][{2,2}]]:>1,HoldPattern[ff[_][_]]:>0}

以下函数将从列表中提取结果数据（索引、配置及其频率）SubValues:

Clear[getResultingData];
getResultingData[f_Symbol] :=
   Transpose[{#[[All, 1, 1, 0, 1]], #[[All, 1, 1, 1]], #[[All, 2]]}] &@
        Most@SubValues[f, Sort -> False];

例如：

In[10]:= result = getResultingData[ff]
Out[10]= {{2,{2,1},1},{2,{1,1},1},{1,{2,1},3},{1,{1,2},2},{2,{1,2},1},
{2,{2,2},1},{1,{2,2},1}}

为了完成数据处理周期，这里有一个简单的函数来提取固定索引的数据，基于Select:

Clear[getResultsForFixedIndex];
getResultsForFixedIndex[data_, index_] := 
  If[# === {}, {}, Transpose[#]] &[
    Select[data, First@# == index &][[All, {2, 3}]]];

对于我们的测试示例，

In[13]:= getResultsForFixedIndex[result,1]
Out[13]= {{{2,1},{1,2},{2,2}},{3,2,1}}

这可能与 @zorank 在代码中尝试的接近。

基于压缩数组和稀疏数组的更快解决方案

正如 @zorank 指出的，对于具有更多索引和配置的较大样本，这会变得很慢。我们现在将生成一个大样本来说明（笔记！这需要大约 4-5 Gb RAM，因此如果超出可用 RAM，您可能需要减少配置数量):

In[14]:= 
largeSample = generateConfigurations[20,500,500,5000000];
testAccumulate[ff,largeSample];//Timing

Out[15]= {31.89,Null}

我们现在将从中提取完整数据SubValues of ff:

In[16]:= (largeres = getResultingData[ff]); // Timing
Out[16]= {10.844, Null}

这需要一些时间，但只需执行一次。但是当我们开始为固定索引提取数据时，我们发现它非常慢：

In[24]:= getResultsForFixedIndex[largeres,10]//Short//Timing
Out[24]= {2.687,{{{196,26},{53,36},{360,43},{104,144},<<157674>>,{31,305},{240,291},
 {256,38},{352,469}},{<<1>>}}}

我们在这里用来加快速度的主要思想是将各个列表打包在largeres，那些指数、组合和频率。虽然无法打包完整列表，但这些部分单独可以：

In[18]:= Timing[
   subIndicesPacked = Developer`ToPackedArray[largeres[[All,1]]];
   subCombsPacked =  Developer`ToPackedArray[largeres[[All,2]]];
   subFreqsPacked =  Developer`ToPackedArray[largeres[[All,3]]];
]
Out[18]= {1.672,Null}

这也需要一些时间，但又是一次性操作。

然后，将使用以下函数更有效地提取固定索引的结果：

Clear[extractPositionFromSparseArray];
extractPositionFromSparseArray[HoldPattern[SparseArray[u___]]] := {u}[[4, 2, 2]]

Clear[getCombinationsAndFrequenciesForIndex];
getCombinationsAndFrequenciesForIndex[packedIndices_, packedCombs_, 
    packedFreqs_, index_Integer] :=
With[{positions = 
         extractPositionFromSparseArray[
               SparseArray[1 - Unitize[packedIndices - index]]]},
  {Extract[packedCombs, positions],Extract[packedFreqs, positions]}];

现在，我们有：

In[25]:=  
getCombinationsAndFrequenciesForIndex[subIndicesPacked,subCombsPacked,subFreqsPacked,10]
  //Short//Timing

Out[25]= {0.094,{{{196,26},{53,36},{360,43},{104,144},<<157674>>,{31,305},{240,291},
{256,38},{352,469}},{<<1>>}}}

我们的速度提高了 30 倍。天真的Select方法。

关于复杂性的一些注释

请注意，第二种解决方案速度更快，因为它使用优化的数据结构，但其复杂度与Select- 基于的，即与所有索引的唯一组合的总列表的长度成线性。因此，理论上，前面讨论的基于嵌套哈希表等的解决方案may渐进地变得更好。问题是，实际上我们可能会在那之前很久就达到内存限制。对于 1000 万配置样本，上面的代码仍然比我之前发布的最快解决方案快 2-3 倍。

EDIT

修改如下：

Clear[getCombinationsAndFrequenciesForIndex];
getCombinationsAndFrequenciesForIndex[packedIndices_, packedCombs_, 
    packedFreqs_, index_Integer] :=
 With[{positions =  
          extractPositionFromSparseArray[
             SparseArray[Unitize[packedIndices - index], Automatic, 1]]},
    {Extract[packedCombs, positions], Extract[packedFreqs, positions]}];

使代码速度仍然提高了一倍。此外，对于更稀疏的索引（例如，使用参数调用样本生成函数，例如generateConfigurations[2000, 500, 500, 5000000]），相对于Select- 基于函数是关于100 times.

本文内容由网友自发贡献，版权归原作者所有，本站不承担相应法律责任。如您发现有涉嫌抄袭侵权的内容，请联系:hwhale#tublm.com(使用前将#替换为@)

wolframmathematica