Creating a numpy structured array from a list of strings

2023-12-12

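I am working on a python utility to get data from the Tycho 2 star catalogue. One of the functions I am working on queries the catalogue and returns all of the information for a given star id (or set of star ids).

I'm currently doing this by looping through the lines of the catalogue file and then attempting to parse a line into a numpy structured array if it was queried. (Note: if there is a better way to do this, you can let me know, even though that is not what this question is about. I am doing it this way because the catalogue is too large to load into memory all at once.)

Anyway, once I have identified a record that I want to keep, I run into a problem: I can't figure out how to parse it into a structured array.

For example, say the record I want to keep is: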

record = '0002 00038 1| |  3.64121230|  1.08701186|   14.1|  -23.0| 69| 82| 1.8| 1.9|1968.56|1957.30| 3|1.0|3.0|0.9|3.0|12.444|0.213|11.907|0.189|999| |         |  3.64117944|  1.08706861|1.83|1.73| 81.0|104.7| | 0.0'

Now, I'm trying to parse this into a numpy structured array with the dtype:

dform = [('starid', [('TYC1', int), ('TYC2', int), ('TYC3', int)]),
         ('pflag', str),
         ('starBearing', [('rightAscension', float), ('declination', float)]),
         ('properMotion', [('rightAscension', float), ('declination', float)]),
         ('uncertainty', [('rightAscension', int), ('declination', int), ('pmRA', float), ('pmDc', float)]),
         ('meanEpoch', [('rightAscension', float), ('declination', float)]),
         ('numPos', int),
         ('fitGoodness', [('rightAscension', float), ('declination', float), ('pmRA', float), ('pmDc', float)]),
         ('magnitude', [('BT', [('mag', float), ('err', float)]), ('VT', [('mag', float), ('err', float)])]),
         ('starProximity', int),
         ('tycho1flag', str),
         ('hipparcosNumber', str),
         ('observedPos', [('rightAscension', float), ('declination', float)]),
         ('observedEpoch', [('rightAscension', float), ('declination', float)]),
         ('observedError', [('rightAscension', float), ('declination', float)]),
         ('solutionType', str),
         ('correlation', float)]

This seems like it should be a fairly simple thing to do, but everything I have tried has failed...

I have tried:

np.genfromtxt(BytesIO(record.encode()),dtype=dform,delimiter=(' ','|'))
np.genfromtxt(BytesIO(record.encode()),dtype=dform,delimiter=(' ','|'),missing_values=' ',filling_values=None)

both of which give me

{TypeError}cannot perform accumulate with flexible type

which doesn't make sense, since it shouldn't be doing any accumulation.

I have also tried

np.array(re.split('\|| ',record),dtype=dform)

which complains

{TypeError}a bytes-like object is required, not 'str'

and another variant

np.array([x.encode() for x in re.split('\|| ',record)],dtype=dform)

which doesn't throw an error, but certainly doesn't return the right result either:

[ ((842018864, 0, 0), '', (0.0, 0.0), (0.0, 0.0), (0, 0, 0.0, 0.0), (0.0, 0.0), 0, (0.0, 0.0, 0.0, 0.0), ((0.0, 0.0), (0.0, 0.0)), 0, '', '', (0.0, 0.0), (0.0, 0.0), (0.0, 0.0), '', 0.0)...

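So how can I do this? I think the genfromtxt option is the way to go (especially since data may occasionally be missing), but I don't understand why it isn't working. Is this something I'm going to have to write my own parser for?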

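Sorry, this answer is long and rambling, but that's what it took to figure out what was going on. The complexity of the dtype in particular was hidden by its length.

I get the TypeError: cannot perform accumulate with flexible type error when I try your list delimiter. The traceback shows the error occurs in LineSplitter. Without getting into the details, the delimiter should be a single character (or the default 'whitespace').

From the genfromtxt docs:

delimiter : str, int, or sequence, optional. The string used to separate values. By default, any consecutive whitespaces act as delimiter. An integer or sequence of integers can also be provided as width(s) of each field.

The genfromtxt splitter is a little more powerful than the string .split that loadtxt uses, but it is not as general as the re splitter.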
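To illustrate the two delimiter forms the docs describe, here is a minimal sketch using made-up sample lines (not Tycho 2 data). Passing `encoding="utf-8"` is my own choice, so that plain str input is treated the same way across NumPy versions.

```python
import numpy as np

# Hypothetical sample lines, used only for illustration.
piped = ["1|2.5|abc", "4|5.5|def"]

# A single-character delimiter: every '|' starts a new field.
by_char = np.genfromtxt(piped, delimiter="|", dtype=None, encoding="utf-8")

# A sequence of integers is read as fixed field widths
# (3 characters, then 4 characters).
fixed = ["  1 2.5", " 10 3.0"]
by_width = np.genfromtxt(fixed, delimiter=(3, 4), dtype=float, encoding="utf-8")

print(by_char.dtype.names)
print(by_width)
```

A sequence of strings such as (' ', '|') fits neither form, which is consistent with the error you saw.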

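As for the {TypeError}a bytes-like object is required, not 'str': you specify dtype 'str' for several of the fields, and the error comes down to mixing byte strings with your record, which is a unicode string (in Py3). But you've already realized that with BytesIO(record.encode()).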
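The bytes/unicode distinction can be seen without numpy at all. The sketch below uses a shortened copy of your record (the variable names are mine) and shows that the regular expression pattern and the string being split must be of the same kind.

```python
import re

# A shortened piece of the record from the question.
text_record = "0002 00038 1| |  3.64121230"
bytes_record = text_record.encode()  # str -> bytes

# str pattern with str input, bytes pattern with bytes input: both work.
str_parts = re.split(r"\|| ", text_record)
bytes_parts = re.split(rb"\|| ", bytes_record)

# Mixing the two kinds raises a TypeError.
try:
    re.split(rb"\|| ", text_record)
except TypeError as exc:
    mixed_error = exc

print(str_parts[:3])
print(bytes_parts[:3])
print(type(mixed_error).__name__)
```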

I like to test genfromtxt cases with:

record = b'....'
np.genfromtxt([record], ....)

or even better

records = b"""one line
two line
three line
"""
np.genfromtxt(records.splitlines(), ....)

If I let genfromtxt deduce the field types, and use just the one delimiter, I get 32 fields:

In [19]: A=np.genfromtxt([record],dtype=None,delimiter='|')
In [20]: len(A.dtype)
Out[20]: 32
In [21]: A
Out[21]: 
array((b'0002 00038 1', False, 3.6412123, 1.08701186, 14.1, -23.0, 69, 82, 1.8, 1.9, 1968.56, 1957.3, 3, 1.0, 3.0, 0.9, 3.0, 12.444, 0.213, 11.907, 0.189, 999, False, False, 3.64117944, 1.08706861, 1.83, 1.73, 81.0, 104.7, False, 0.0), 
      dtype=[('f0', 'S12'), ('f1', '?'), ('f2', '<f8'), ('f3', '<f8'), ('f4', '<f8'), ... ('f26', '<f8'), ('f27', '<f8'), ('f28', '<f8'), ('f29', '<f8'), ('f30', '?'), ('f31', '<f8')])

Once we get the whole bytes and delimiter issue worked out,

np.array([x for x in re.split(b'\|| ',record)],dtype=dform)

does run. I now see that your dform is complex, with nested compound fields.

But to define a structured array, you need to give it a list of records, e.g.

np.array([(record1...), (record2...), ....], dtype([(field1),(field2 ),...]))

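Here you are trying to create a single record. I could wrap your list in a tuple, but then I get a mismatch between that length and the dform length, 66 v 17. If you count all the subfields, dform might require 66 values, but we can't do that with just one flat tuple.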
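As a small illustration of the nesting such a tuple needs, here is a sketch with a cut-down dtype: three of your fields, with the subfield names shortened to the hypothetical `ra` and `dec`. Each compound field takes its own inner tuple.

```python
import numpy as np

# Cut-down version of dform: two compound fields and one scalar field.
small_dtype = np.dtype([
    ("starid", [("TYC1", int), ("TYC2", int), ("TYC3", int)]),
    ("starBearing", [("ra", float), ("dec", float)]),
    ("numPos", int),
])

# One record is one tuple; each compound field is a nested tuple.
one_record = ((2, 38, 1), (3.6412123, 1.08701186), 3)

# The array is built from a list of such records.
arr = np.array([one_record], dtype=small_dtype)

print(arr["starid"]["TYC2"])
print(arr["starBearing"]["dec"])
```

A flat list of strings, like the one re.split returns, does not have this shape, so np.array cannot map it onto the nested fields.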

I've never tried to create an array from such a complex dtype, so I'm fishing around for ways to make it work.

In [41]: np.zeros((1,),dform)
Out[41]: 
array([ ((0, 0, 0), '', (0.0, 0.0), (0.0, 0.0), (0, 0, 0.0, 0.0), (0.0, 0.0), 0, (0.0, 0.0, 0.0, 0.0), ((0.0, 0.0), (0.0, 0.0)), 0, '', '', (0.0, 0.0), (0.0, 0.0), (0.0, 0.0), '', 0.0)], 
      dtype=[('starid', [('TYC1', '<i4'), ('TYC2', '<i4'), ('TYC3', '<i4')]), ('pflag', '<U'), ('starBearing', [('rightAscension', '<f8'), ('declination', '<f8')]), ('properMotion', [('rightAscension', '<f8'), ('declination', '<f8')]), ('uncertainty', [('rightAscension', '<i4'), ('declination', '<i4'), ('pmRA', '<f8'), ('pmDc', '<f8')]), ('meanEpoch', ....('solutionType', '<U'), ('correlation', '<f8')])

In [64]: for name in A.dtype.names:
    print(A[name].dtype)
   ....:     
[('TYC1', '<i4'), ('TYC2', '<i4'), ('TYC3', '<i4')]
<U1
[('rightAscension', '<f8'), ('declination', '<f8')]
[('rightAscension', '<f8'), ('declination', '<f8')]
[('rightAscension', '<i4'), ('declination', '<i4'), ('pmRA', '<f8'), ('pmDc', '<f8')]
[('rightAscension', '<f8'), ('declination', '<f8')]
int32
[('rightAscension', '<f8'), ('declination', '<f8'), ('pmRA', '<f8'), ('pmDc', '<f8')]
[('BT', [('mag', '<f8'), ('err', '<f8')]), ('VT', [('mag', '<f8'), ('err', '<f8')])]
int32
<U1
<U1
[('rightAscension', '<f8'), ('declination', '<f8')]
[('rightAscension', '<f8'), ('declination', '<f8')]
[('rightAscension', '<f8'), ('declination', '<f8')]
<U1
float64

I count 34 primitive dtype fields. Most are 'scalar', some have 2-4 terms, and one has a further level of nesting.
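The count can also be checked programmatically. The helper below, `count_leaves`, is a hypothetical name of my own and not a numpy function; it walks the dtype recursively and counts the fields that have no subfields.

```python
import numpy as np

dform = [('starid', [('TYC1', int), ('TYC2', int), ('TYC3', int)]),
         ('pflag', str),
         ('starBearing', [('rightAscension', float), ('declination', float)]),
         ('properMotion', [('rightAscension', float), ('declination', float)]),
         ('uncertainty', [('rightAscension', int), ('declination', int), ('pmRA', float), ('pmDc', float)]),
         ('meanEpoch', [('rightAscension', float), ('declination', float)]),
         ('numPos', int),
         ('fitGoodness', [('rightAscension', float), ('declination', float), ('pmRA', float), ('pmDc', float)]),
         ('magnitude', [('BT', [('mag', float), ('err', float)]), ('VT', [('mag', float), ('err', float)])]),
         ('starProximity', int),
         ('tycho1flag', str),
         ('hipparcosNumber', str),
         ('observedPos', [('rightAscension', float), ('declination', float)]),
         ('observedEpoch', [('rightAscension', float), ('declination', float)]),
         ('observedError', [('rightAscension', float), ('declination', float)]),
         ('solutionType', str),
         ('correlation', float)]


def count_leaves(dt):
    """Count the primitive (non-compound) fields of a possibly nested dtype."""
    dt = np.dtype(dt)
    if dt.names is None:  # no subfields: this is a leaf
        return 1
    return sum(count_leaves(dt[name]) for name in dt.names)


n_top = len(np.dtype(dform).names)  # top-level fields
n_leaves = count_leaves(dform)      # primitive fields
print(n_top, n_leaves)
```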

If I replace the first 2 delimiting spaces with |, record.split(b'|') gives me 34 strings.
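A minimal sketch of that substitution, assuming the record is held as a str rather than bytes. It relies on the first two spaces in the line being the ones inside the star id, which is true for this record but is an assumption for the catalogue in general.

```python
record = '0002 00038 1| |  3.64121230|  1.08701186|   14.1|  -23.0| 69| 82| 1.8| 1.9|1968.56|1957.30| 3|1.0|3.0|0.9|3.0|12.444|0.213|11.907|0.189|999| |         |  3.64117944|  1.08706861|1.83|1.73| 81.0|104.7| | 0.0'

# Replace only the first two spaces, turning '0002 00038 1' into '0002|00038|1'.
piped_record = record.replace(' ', '|', 2)

before = record.split('|')
after = piped_record.split('|')

print(len(before), len(after))
print(after[:3])
```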

Let's try that in genfromtxt:

In [79]: np.genfromtxt([record],delimiter='|',dtype=dform)
Out[79]: 
array(((2, 38, 1), '', (3.6412123, 1.08701186), (14.1, -23.0), 
   (69, 82, 1.8, 1.9), (1968.56, 1957.3), 3, (1.0, 3.0, 0.9, 3.0),
   ((12.444, 0.213), (11.907, 0.189)), 999, '', '', 
   (3.64117944, 1.08706861), (1.83, 1.73), (81.0, 104.7), '', 0.0), 
      dtype=[('starid', [('TYC1', '<i4'), ('TYC2', '<i4'), ('TYC3', '<i4')]), 
 ('pflag', '<U'), 
 ('starBearing', [('rightAscension', '<f8'), ('declination', '<f8')]),  
 ('properMotion', [('rightAscension', '<f8'), ('declination', '<f8')]),
 ('uncertainty', [('rightAscension', '<i4'), ('declination', '<i4'), ('pmRA', '<f8'), ('pmDc', '<f8')]), 
 ('meanEpoch', [('rightAscension', '<f8'), ('declination', '<f8')]),   
 ('numPos', '<i4'), 
 ('fitGoodness', [('rightAscension', '<f8'), ('declination', '<f8'), ('pmRA', '<f8'), ('pmDc', '<f8')]), 
 ('magnitude', [('BT', [('mag', '<f8'), ('err', '<f8')]), ('VT', [('mag', '<f8'), ('err', '<f8')])]), 
 ('starProximity', '<i4'), ('tycho1flag', '<U'), ('hipparcosNumber', '<U'), 
 ('observedPos', [('rightAscension', '<f8'), ('declination', '<f8')]),
 ('observedEpoch', [('rightAscension', '<f8'), ('declination', '<f8')]), 
 ('observedError', [('rightAscension', '<f8'), ('declination', '<f8')]), ('solutionType', '<U'), ('correlation', '<f8')])

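This looks almost reasonable. genfromtxt can actually split the values up among the compound fields. That's more than I'd want to attempt with np.array().

So if you work out the delimiter and bytes/unicode issues, genfromtxt can handle this mess.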
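Putting the pieces together, a sketch of the whole round trip might look like the following. Two things here are my assumptions rather than part of your code: the string fields are given explicit widths ('U1', 'U6') so that their contents are kept, and `encoding="utf-8"` is passed so that the record can stay a str.

```python
import numpy as np

record = '0002 00038 1| |  3.64121230|  1.08701186|   14.1|  -23.0| 69| 82| 1.8| 1.9|1968.56|1957.30| 3|1.0|3.0|0.9|3.0|12.444|0.213|11.907|0.189|999| |         |  3.64117944|  1.08706861|1.83|1.73| 81.0|104.7| | 0.0'

# Same layout as dform in the question; only the str fields get assumed widths.
dform = [('starid', [('TYC1', int), ('TYC2', int), ('TYC3', int)]),
         ('pflag', 'U1'),
         ('starBearing', [('rightAscension', float), ('declination', float)]),
         ('properMotion', [('rightAscension', float), ('declination', float)]),
         ('uncertainty', [('rightAscension', int), ('declination', int), ('pmRA', float), ('pmDc', float)]),
         ('meanEpoch', [('rightAscension', float), ('declination', float)]),
         ('numPos', int),
         ('fitGoodness', [('rightAscension', float), ('declination', float), ('pmRA', float), ('pmDc', float)]),
         ('magnitude', [('BT', [('mag', float), ('err', float)]), ('VT', [('mag', float), ('err', float)])]),
         ('starProximity', int),
         ('tycho1flag', 'U1'),
         ('hipparcosNumber', 'U6'),
         ('observedPos', [('rightAscension', float), ('declination', float)]),
         ('observedEpoch', [('rightAscension', float), ('declination', float)]),
         ('observedError', [('rightAscension', float), ('declination', float)]),
         ('solutionType', 'U1'),
         ('correlation', float)]

# Make '|' the only delimiter by splitting the star id into three columns.
piped_record = record.replace(' ', '|', 2)

# genfromtxt accepts a list of lines; a single line gives a 0-d result,
# so wrap it to get an array of shape (1,).
stars = np.atleast_1d(
    np.genfromtxt([piped_record], delimiter='|', dtype=dform, encoding='utf-8')
)

print(stars['starid'])
print(stars['magnitude']['BT']['mag'])
```

Since genfromtxt takes any list of lines, the records you keep while looping through the catalogue file could be collected into a list and parsed in one call.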
