Pandas: Saving a Series of dictionaries to disk

2024-01-04

I have a Python pandas Series of dictionaries:

id           dicts
1            {'5': 1, '8': 20, '1800': 2}
2            {'2': 2, '8': 1, '1000': 25, '1651': 1}
...          ...
...          ...
...          ...
20000000     {'2': 1, '10': 20}

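Each (key, value) pair in a dictionary represents ('feature', count). There are roughly 2000 unique features.

The Series uses about 500MB of memory in pandas. What is the best way to write this object to disk (ideally with low disk usage, fast writes and fast reads afterwards)?

Options considered (the first two were tried):
- to_csv (but it treats the dictionaries as strings, so converting them back to dictionaries afterwards is very slow)
- cPickle (but it ran out of memory during execution)
- converting to a scipy sparse matrix structure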

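I'm curious how your Series takes up only 500MB. If you are using the .memory_usage method, it only returns the total memory used by the references to each Python object, which is all that your Series actually stores. It does not account for the memory of the dictionaries themselves. A rough calculation of 20,000,000 * 288 bytes = 5.76GB should be closer to your real memory usage, where 288 bytes is a conservative estimate of the memory each dictionary requires.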
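The gap between the reference-only figure and the real footprint can be checked on a small sample. The following is a minimal sketch on a two-row toy Series (the name `toy` is illustrative, not from the question). Exact byte counts depend on the Python version and platform, so none are shown. Note that even `deep=True` only adds the size of each dict container, not of its keys and values.

```python
import sys
import pandas as pd

# Toy stand-in for the real Series (hypothetical sample data)
toy = pd.Series([{'5': 1, '8': 20, '1800': 2}, {'2': 1, '10': 20}])

# Shallow: counts only the array of object pointers held by the Series
shallow = toy.memory_usage(index=False)

# Deep: additionally adds sys.getsizeof() of every stored object
deep = toy.memory_usage(index=False, deep=True)

# Size of the dict containers alone (keys and values are not included)
dict_only = sum(sys.getsizeof(d) for d in toy)

print(shallow, deep, dict_only)
```

Multiplying the per-dict size by the number of rows gives a rough lower bound for the full Series.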

Converting to a sparse matrix

In any case, try the following approach to convert your data into a sparse matrix representation:

import numpy as np, pandas as pd
from sklearn.feature_extraction import DictVectorizer
from scipy.sparse import csr_matrix
import pickle

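I would use ints rather than strings as keys, since this keeps the features in the correct order later on. So, assuming your Series is named dict_series:

# convert the string keys to ints (builds a new dict for every row)
dict_series = dict_series.apply(lambda d: {int(k): v for k, v in d.items()})

This can be memory-intensive, and you may be better off simply creating your Series of dicts with ints as keys from the start. Or you can skip this step altogether. Now, build the sparse matrix: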

dv = DictVectorizer(dtype=np.int32)
sparse = dv.fit_transform(dict_series)
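For intuition about what the resulting CSR matrix holds, here is a minimal hand-built sketch for the first two rows of the question's data, using only numpy and scipy. It is not how DictVectorizer is implemented; it just assumes, as described above, that features are mapped to columns in sorted order. The names `rows`, `col_of` and `toy` are illustrative.

```python
import numpy as np
from scipy.sparse import csr_matrix

# First two rows from the question, with int keys
rows = [{5: 1, 8: 20, 1800: 2}, {2: 2, 8: 1, 1000: 25, 1651: 1}]

# Map each feature to a column index, in sorted feature order
features = sorted({k for d in rows for k in d})
col_of = {f: j for j, f in enumerate(features)}

data, indices, indptr = [], [], [0]
for d in rows:
    for k in sorted(d):
        indices.append(col_of[k])  # column of this count
        data.append(d[k])          # the count itself
    indptr.append(len(data))       # where this row ends in data/indices

toy = csr_matrix(
    (np.array(data, dtype=np.int32),
     np.array(indices, dtype=np.int32),
     np.array(indptr, dtype=np.int32)),
    shape=(len(rows), len(features)),
)
print(toy.toarray())
```

Only the non-zero counts are stored, which is why this layout is compact when each row touches a handful of the roughly 2000 features.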

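Saving to disk

Essentially, your sparse matrix can now be reconstructed from three fields: sparse.data, sparse.indices, sparse.indptr and, optionally, sparse.shape. The fastest and most memory-efficient way to save the arrays sparse.data, sparse.indices and sparse.indptr is the np.ndarray tofile method, which saves each array as raw bytes. From the documentation (http://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.tofile.html):

This is a convenience function for quick storage of array data. Information on endianness and precision is lost, so this method is not a good choice for files intended to archive data or transport data between machines with different endianness.

So this method loses all dtype and byte-order information. The former issue is resolved by noting the dtype beforehand; you will be using np.int32 anyway. The latter is not a problem if you are working locally, but if portability matters you will need to look into alternative ways of storing the information.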
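The dtype caveat can be seen on a tiny array. This sketch writes to a temporary directory (the file name and the sample values are only illustrative) and reads the same bytes back with the right and the wrong dtype:

```python
import os
import tempfile
import numpy as np

arr = np.array([1, 20, 2, 2, 1, 25, 1], dtype=np.int32)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'data.dat')
    arr.tofile(path)

    # The file holds raw bytes only: no header, no dtype, no shape
    file_size = os.path.getsize(path)

    # Correct dtype: the original values come back
    restored = np.fromfile(path, dtype=np.int32)

    # Wrong dtype: no error is raised, the bytes are just reinterpreted
    misread = np.fromfile(path, dtype=np.int16)

print(file_size, restored, misread.size)
```

Because a wrong dtype fails silently, it is worth recording the dtype of each array next to the files.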

# to save
sparse.data.tofile('data.dat')
sparse.indices.tofile('indices.dat')
sparse.indptr.tofile('indptr.dat')
# don't forget your dict vectorizer!
with open('dv.pickle', 'wb') as f:
    pickle.dump(dv,f) # pickle your dv to be able to recover your original data!

To restore everything:

with open('dv.pickle', 'rb') as f:
    dv = pickle.load(f)

# rebuild the CSR matrix from its three arrays (dtypes must match what was saved)
sparse = csr_matrix((np.fromfile('data.dat', dtype=np.int32),
                     np.fromfile('indices.dat', dtype=np.int32),
                     np.fromfile('indptr.dat', dtype=np.int32)))

original = pd.Series(dv.inverse_transform(sparse))
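If portability matters more than raw speed, one alternative (not part of the answer above) is scipy's own `save_npz` / `load_npz`, which store the arrays together with their dtype and the matrix shape in a single file. A minimal sketch on a toy matrix, written to a temporary directory:

```python
import os
import tempfile
import numpy as np
from scipy.sparse import csr_matrix, save_npz, load_npz

# Small stand-in for the real sparse matrix (illustrative values)
toy = csr_matrix(np.array([[0, 1, 20], [2, 0, 0]], dtype=np.int32))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'sparse.npz')
    save_npz(path, toy)     # compressed by default
    loaded = load_npz(path)

print(loaded.dtype, loaded.shape)
```

The DictVectorizer would still need to be pickled separately to map columns back to feature names. Compression makes the file smaller, but it will likely write more slowly than tofile.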
