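I have a character vector that is a scrape of some PDF files, produced with pdftotext
(the command-line tool).
Everything is (blissfully) lined up nicely. However, the vector is riddled with a type of whitespace that my regular expressions do not match: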
> test
[1] "Address:" "Clinic Information:" "Store " "351 South Washburn" "Aurora Quick Care"
[6] "Info" "St. Oshkosh, WI 54904" "Phone: 920‐232‐0718" "Pewaukee"
> grepl("[0-9]+ [A-Za-z ]+",test)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> dput(test)
c("Address:", "Clinic Information:", "Store ", "351 South Washburn",
"Aurora Quick Care", "Info", "St. Oshkosh, WI 54904", "Phone: 920‐232‐0718",
"Pewaukee")
> test.pasted <- c("Address:", "Clinic Information:", "Store ", "351 South Washburn",
+ "Aurora Quick Care", "Info", "St. Oshkosh, WI 54904", "Phone: 920‐232‐0718",
+ "Pewaukee")
> grepl("[0-9]+ [A-Za-z ]+",test.pasted)
[1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
> Encoding(test)
[1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
> Encoding(test.pasted)
[1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "UTF-8" "unknown"
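Presumably there is some character that dput does not display faithfully,
as in the question below:
How to properly dput internationalized text? https://stackoverflow.com/questions/11369390/how-to-properly-dput-internationalized-text
I can't copy/paste the entire vector... How do I search and destroy this non-whitespace whitespace?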
Edit
Clearly I wasn't clear enough, because the answers are all over the place. Here is a simpler test case:
> grepl("Clinic Information:", test[2])
[1] FALSE
> grepl("Clinic Information:", "Clinic Information:") # Where the second phrase is copy/pasted from the screen
[1] TRUE
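There is a single space between the words "Clinic" and "Information" as printed on the screen and in the dput
output, but whatever is actually in the string is not a standard space. My goal is to eliminate it so that I can properly grep out that element.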
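To make the goal concrete, here is a minimal sketch of the kind of inspection and cleanup I am after. It assumes the offending character is a no-break space (U+00A0), which I have not confirmed, and the string x is a made-up stand-in for test[2], not my real data:

```r
# Hypothetical stand-in for test[2]; the "\u00a0" (no-break space) is an assumption
x <- "Clinic\u00a0Information:"

# List the Unicode code point of every character.
# An ordinary space is 32; anything else in that position is the culprit.
codes <- utf8ToInt(enc2utf8(x))
print(codes)

# Replace the assumed no-break space with a regular space, then match as usual
cleaned <- gsub("\u00a0", " ", x, fixed = TRUE)
matched <- grepl("Clinic Information:", cleaned, fixed = TRUE)
print(matched)
```

If the code points show something other than 160 where the space should be, the same gsub call would need that character instead.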