Reading a delimiter-separated csv in python dask

2024-01-09

I am trying to create a DataFrame by reading a csv file whose fields are separated by '#####' (five hash characters).

The code is:

import dask.dataframe as dd
df = dd.read_csv(r'D:\temp.csv', sep='#####', engine='python')  # raw string, so \t is not read as a tab escape
res = df.compute()

The error is:

dask.async.ValueError:
Dask dataframe inspected the first 1,000 rows of your csv file to guess the
data types of your columns.  These first 1,000 rows led us to an incorrect
guess.

For example a column may have had integers in the first 1000
rows followed by a float or missing value in the 1,001-st row.

You will need to specify some dtype information explicitly using the
``dtype=`` keyword argument for the right column names and dtypes.

    df = dd.read_csv(..., dtype={'my-column': float})

Pandas has given us the following error when trying to parse the file:

  "The 'dtype' option is not supported with the 'python' engine"

Traceback
 ---------
File "/home/ec2-user/anaconda3/lib/python3.4/site-packages/dask/async.py", line 263, in execute_task
result = _execute_task(task, data)
File "/home/ec2-user/anaconda3/lib/python3.4/site-packages/dask/async.py", line 245, in _execute_task
return func(*args2)
File "/home/ec2-user/anaconda3/lib/python3.4/site-packages/dask/dataframe/io.py", line 69, in _read_csv
raise ValueError(msg)

So how do I get around this?

If I follow the error's advice, I would have to supply a dtype for every column, which is impractical when I have 100+ columns.

If I read the file without specifying the separator, everything works fine, but '#####' appears in every row. So after computing it into a pandas DataFrame, is there a way to strip it out?
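For illustration, the kind of post-processing I have in mind would look something like this (the sample data and column handling here are made up, not my real file):

```python
import io
import pandas as pd

# Made-up sample data using the same five-hash delimiter.
raw = "a#####b#####c\n1#####2.5#####x\n3#####4.5#####y\n"

# Read each line as a single string column by using a separator
# that never occurs in the data, then split on '#####' afterwards.
lines = pd.read_csv(io.StringIO(raw), sep="\x01", header=None, engine="python")
parts = lines[0].str.split("#####", expand=True)
parts.columns = parts.iloc[0]          # first row holds the header
parts = parts.iloc[1:].reset_index(drop=True)
print(parts)
```

This works, but every value comes back as a string, and it feels like a workaround rather than a proper fix.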

So please help me.


Read the whole file with dtype=object, meaning all columns will be interpreted as type object. This should read the file correctly and get rid of the '#####' in each row. From there you can convert it to a pandas DataFrame with the compute() method. Once the data is in a pandas frame, you can use the pandas infer_objects method to update the types without having to hard-code them.

import dask.dataframe as dd
df = dd.read_csv(r'D:\temp.csv', sep='#####', engine='python', dtype='object').compute()
res = df.infer_objects()
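Since dask.dataframe.read_csv forwards these keywords to pandas.read_csv, the same pattern can be sketched with plain pandas (the sample data below is made up):

```python
import io
import pandas as pd

# Made-up sample data with the same five-hash delimiter.
raw = "col1#####col2\n1#####2.5\n3#####4.5\n"

# dtype='object' disables dtype guessing entirely: every column is
# read as strings, so the read cannot fail partway through the file.
df = pd.read_csv(io.StringIO(raw), sep="#####", engine="python", dtype="object")
res = df.infer_objects()
print(df.dtypes)
```

One caveat: infer_objects only upgrades object columns whose cells already hold non-string Python objects; columns of strings stay object. If you need real numeric dtypes after reading with dtype='object', you may still have to run something like pandas.to_numeric on the relevant columns.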
