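I am trying to load a large data file (about 20 million rows) using fread() from the data.table package. However, some rows are causing a lot of trouble.
Minimal example: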
text.csv contains:
id, text
1,"""Oops"",\""The"",""Georgia"""
fread("text.csv", sep=",")
Error in fread("text.csv", sep = ",") :
Not positioned correctly after testing format of header row. ch=','
In addition: Warning message:
In fread("text.csv", sep = ",") :
Starting data input on line 2 and discarding line 1 because it has too few or too many items to be column names or data: id, text
read.table() works slightly better, but it is far too slow and memory-inefficient.
> read.table("text.csv", header = TRUE, sep=",")
id text
1 1 "Oops",\\"The","Georgia"
I realize that my text file is not properly formatted, but it is too big to edit in any practical way.
Any help is much appreciated.
EDIT:
A small sample of the actual data records:
sample1.txt, a good record:
materiale_id,dk5,description,creator,subject-phrase,title,type
125030-katalog:000000003,[78.793],Privatoptagelse. - Liveoptagelse,Frederik Lundin,,Koncert i Copenhagen Jazz House den 26.1.1995,music
> fread("sample1.txt", sep=",")
materiale_id dk5 description creator subject-phrase
1: 125030-katalog:000000003 [78.793] Privatoptagelse. - Liveoptagelse Frederik Lundin NA
title type
1: Koncert i Copenhagen Jazz House den 26.1.1995 music
sample2.txt, a good and a bad record:
materiale_id,dk5,description,creator,subject-phrase,title,type
125030-katalog:000000003,[78.793],Privatoptagelse. - Liveoptagelse,Frederik Lundin,,Koncert i Copenhagen Jazz House den 26.1.1995,music
150012-leksikon:100019,,"Databehandling vedrører rutiner og procedurer for datarepræsentation, lagring af data, overførsel af data mellem forskellige instanser eller brugere af data, beregninger eller andre operationer udført med...",,"[""Informatik"",""it"",""It, teknik og naturvidenskab"",""leksikonartikel"",""Software, programmering, internet og webkommunikation""]",it - elementer i databehandling,article
> fread("sample2.txt", sep=",")
Empty data.table (0 rows) of 11 cols: 150012-leksikon:100019,V2,Databehandling vedrører rutiner og procedurer for datarepræsentation, lagring af data, overførsel af data mellem forskellige instanser eller brugere af data, beregninger eller andre operationer udført med...,V4,[""Informatik","it"...
EDIT 2:
Updated to R version 3.2.3 and data.table 1.9.6. That helped with the above, but creates problems with other records:
sample3.txt, a good and a bad record:
materiale_id,dk5,description,creator,subject-phrase,title,type
125030-katalog:000236595,,,Red Tampa Solist prf,"[""Tom"",""Georgia"",""1929-1930""]","Georgia Tom, 1929-1930",music
125030-katalog:000236596,,,Jane Lucas (Solist),"[""1928-1931"",""Tom,\""The"",""Georgia"",""Accompanist""]","Georgia Tom,""The Accompanist"" (1928-1931)",music
> s3 <- fread("sample3.txt", sep=",")
Error in fread("sample3.txt", sep = ",") :
Expecting 7 cols, but line 3 contains text after processing all cols. It is very likely that this is due to one or more fields having embedded sep=',' and/or (unescaped) '\n' characters within unbalanced unescaped quotes. fread cannot handle such ambiguous cases and those lines may not have been read in as expected. Please read the section on quotes in ?fread.
EDIT 3:
Updating to the development version 1.9.7 of data.table breaks fread() altogether:
> s3 <- fread("sample3.txt", sep=",")
Error in fread("sample3.txt", sep = ",") :
showProgress is not type integer but type 'logical'. Please report.
EDIT 4:
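It seems that the problems in my file occur when a record contains the string \\" (literally, not as a regex). Apparently there are too many backslashes, which causes fread() to misinterpret the double quote as the end of the string, when it should have been treated as a garbage character.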
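To check which lines are affected, something like the following sketch could be used. It assumes the offending sequence is a single backslash followed by a double quote, as it appears in the sample records above. The `lines` vector is a hypothetical stand-in for the real file contents, and the third element is a made-up clean record added for contrast.

```r
# Hypothetical stand-in for a few lines of the real file.
# Inside R source, '\\' denotes ONE literal backslash.
lines <- c(
  'id, text',
  '1,"""Oops"",\\""The"",""Georgia"""',  # the bad record from text.csv
  '2,"""Fine"",""Record"""'              # made-up record without a backslash
)

# fixed = TRUE searches for the literal two characters backslash + double quote,
# so no regular-expression escaping is involved.
bad <- grepl('\\"', lines, fixed = TRUE)

# Indices of the lines that contain the sequence.
which(bad)
```

On a real file, `lines` would come from readLines(), possibly read in chunks given the file size.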
My best solution so far is to do this:
m1 <- readLines("data.csv", encoding="UTF-8")
m2 <- gsub("\\\\\"", "\\\"", m1)
writeLines(m2, "data_new.csv", useBytes = TRUE)
m3 <- fread("data_new.csv", encoding="UTF-8", sep=",")
This seems to work.
However, I don't understand it 100%, so any clarification would be most welcome.
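One way to see what the gsub() call does is to run it on a single line. The sketch below uses only base R and the bad record from text.csv. The fixed = TRUE variant is an alternative suggested here for readability, not part of the original code, and it is only assumed to be equivalent for input like this.

```r
# The problematic record from text.csv; '\\' in R source is one literal backslash.
line <- '1,"""Oops"",\\""The"",""Georgia"""'

# The pattern "\\\\\"" is the R string of three characters: backslash, backslash,
# double quote. Read as a regular expression, that matches ONE literal backslash
# followed by a double quote.
# The replacement "\\\"" is the R string of two characters: backslash, double quote.
# In a regex replacement, a backslash before a non-digit is dropped, so what gets
# inserted is just the double quote.
fixed_regex <- gsub("\\\\\"", "\\\"", line)

# Same substitution without regex escaping: literal backslash + quote -> quote.
fixed_plain <- gsub('\\"', '"', line, fixed = TRUE)

# Print before and after, then compare the two variants.
cat(line, fixed_regex, sep = "\n")
identical(fixed_regex, fixed_plain)
```

In other words, the call removes the backslash in front of a double quote, so the remaining quotes are the ordinary doubled quotes of a quoted CSV field.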