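I am reading a tab-separated text file into a pandas DataFrame. While reading the file, I get a RuntimeError. I have gone through the posts related to this error, and all of them point to the rule that a dictionary must not be modified while it is being iterated over. In my case, all I am doing is reading a file. How is this problem related to the error about iterating over and modifying dicts?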
>>> import pandas as pd
>>> df=pd.read_csv("dummy_data.txt",header=None,chunksize=10000,error_bad_lines=False,warn_bad_lines=True,engine='c',sep="\t",encoding="latin-1")
Traceback (most recent call last):
File "<input>", line 1, in <module>
df=pd.read_csv("dummy_data.txt",header=None,chunksize=10000,error_bad_lines=False,warn_bad_lines=True,engine='c',sep="\t",encoding="latin-1")
File "/home/avadhut/.virtualenvs/avadhut_virtual/lib/python3.5/site-packages/pandas/io/parsers.py", line 709, in parser_f
return _read(filepath_or_buffer, kwds)
File "/home/avadhut/.virtualenvs/avadhut_virtual/lib/python3.5/site-packages/pandas/io/parsers.py", line 431, in _read
compression = _infer_compression(filepath_or_buffer, compression)
File "/home/avadhut/.virtualenvs/avadhut_virtual/lib/python3.5/site-packages/pandas/io/common.py", line 270, in _infer_compression
filepath_or_buffer = _stringify_path(filepath_or_buffer)
File "/home/avadhut/.virtualenvs/avadhut_virtual/lib/python3.5/site-packages/pandas/io/common.py", line 157, in _stringify_path
from py.path import local as LocalPath
File "/home/avadhut/.virtualenvs/avadhut_virtual/lib/python3.5/site-packages/py/__init__.py", line 148, in <module>
'Syslog' : '._log.log:Syslog',
File "/home/avadhut/.virtualenvs/avadhut_virtual/lib/python3.5/site-packages/py/_vendored_packages/apipkg.py", line 63, in initpkg
for module in sys.modules.values():
RuntimeError: dictionary changed size during iteration
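For context, here is a minimal sketch of the general error as I understand it from those posts. This is not pandas or apipkg code: the plain dict and both function names are hypothetical stand-ins for sys.modules, which the traceback shows being iterated with sys.modules.values(). The assumption is that something adds an entry while the loop is still running.

```python
# Illustrative only: a plain dict stands in for sys.modules (an assumption,
# not the actual apipkg code from the traceback).

def iterate_while_growing(registry):
    """Loop over the live values() view while adding a key.

    Returns the RuntimeError message, or None if no error was raised.
    """
    try:
        for _ in registry.values():
            # Adding a key changes the dict's size in the middle of the loop.
            registry["added_during_loop"] = object()
    except RuntimeError as exc:
        return str(exc)
    return None


def iterate_over_snapshot(registry):
    """Loop over a list copy of the values, so insertions are harmless.

    Returns the number of items visited.
    """
    visited = 0
    for _ in list(registry.values()):
        registry["added_%d" % visited] = object()
        visited += 1
    return visited


if __name__ == "__main__":
    print(iterate_while_growing({"os": 1, "sys": 2}))
    print(iterate_over_snapshot({"os": 1, "sys": 2}))
```

If this is roughly what happens inside initpkg, it would mean the dict is being changed by an import that occurs while the loop runs, not by anything in my read_csv call, but that is a guess on my part.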
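Edit 1: When reading the file in interactive mode, I get the same error on the first two attempts. Running the same line a third time raises no error. What could be the reason for this erratic behaviour?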
>>> df=pd.read_csv("product_name.txt",header=None,chunksize=10000,error_bad_lines=False,warn_bad_lines=True,engine='c',sep="\t",encoding="latin-1")
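Edit 2:
To reproduce the error, here is an S3 link to a 1000-row dataset: https://s3.amazonaws.com/ai-labs-misc-files/dummy_data.txt
Edit 3: I found a link about a similar problem (a pandas CSV file with occasional extra columns): https://stackoverflow.com/a/20062750/8229596 but the flag mentioned there (error_bad_lines) does not seem to apply in my case.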
>>> df = pd.read_csv("unclean.csv", error_bad_lines=False, header=None)
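Edit 4: I wrote a script that loads the dummy data (mentioned in Edit 2) into a pandas DataFrame and then saves it to an HDF5 file. I ran this script 20 times and did not hit the RuntimeError even once. On the other hand, reading the file in interactive mode produces the RuntimeError and the erratic behaviour described above. What could be the reason a Python script behaves differently from interactive mode? I am using pandas==0.22.0, Python==3.5.2 and tables==3.4.4.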
import pandas as pd
import tables
df=pd.read_csv("dummy.txt",header=None,error_bad_lines=False,warn_bad_lines=False,engine='c',sep="\t",encoding="latin-1",names=["product_name_id","current_product_name_id","product_n","active_f","create_d","create_user_n","change_d","change_user_n","ft_timestamp"])
df.to_hdf(path_or_buf="/home/avadhut/data_files/dummy_data.h5",key="dummy",mode="a",format="table")
df=pd.read_hdf("/home/avadhut/data_files/dummy_data.h5",key="dummy")
print(df.head(100))