Is there a way to use R to import data from a .pdf file into HTML format?
I tried the following code:
library(tm)
filename <- "file.pdf"
doc <- readPDF(control = list(text = "-layout"))(elem = list(uri = filename), language = "en", id = "id1")
head(doc)
The output in the HTML shows up as:
## $content
## [1] " sample data"
## [2] ""
## [3] " records"
## [4] ""
## [5] " 31 July 2017"
## [6] ""
## [7] ""
## [8] "R Markdown setup"
## [9] ""
## [10] ""
## [11] "R Markdown"
## [12] ""
## [13] "This is an R Markdown document. Markdown is a simple formatting syntax for"
## [14] "authoring HTML, PDF, and MS Word documents. For more details on using R"
## [15] "Markdown see http://rmarkdown.rstudio.com."
## [16] "When you click the Knit button a document will be generated that includes"
## [17] "both content as well as the output of any embedded R code chunks within the"
## [18] "document. You can embed an R code chunk like this:"
## [19] "{r cars} summary(cars)"
Please help!
I downloaded the PDF file here: https://fie.org/competition/2022/152/results/pools/pdf?lang=en
With the following code, I was able to convert the PDF file to an HTML file:
library(RDCOMClient)
path_PDF <- "C:\\pdf_with_table.pdf"
path_Html <- "C:\\temp.html"
wordApp <- COMCreate("Word.Application")
wordApp[["Visible"]] <- TRUE
wordApp[["DisplayAlerts"]] <- FALSE
doc <- wordApp[["Documents"]]$Open(normalizePath(path_PDF),
                                   ConfirmConversions = FALSE)
doc$SaveAs2(path_Html, FileFormat = 8) # 8 = wdFormatHTML; 9 is wdFormatWebArchive (.mht)
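In my opinion, it would be more direct to extract the tables straight from the PDF, or to convert the PDF to a Word file and extract the tables from the Word file.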
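If only the text is needed and not the tables, the character vector shown in the question could probably be written to an HTML page without going through Word. The following is a minimal sketch using base R only. The helper name lines_to_html, the sample lines, and the output path are assumptions made for illustration and are not part of the original posts. Wrapping the lines in a <pre> element is one way to keep the spacing produced by the "-layout" option.

```r
# Hypothetical helper (not from the original posts): wrap extracted text
# lines in a minimal HTML page, keeping the layout with <pre>.
lines_to_html <- function(lines, title = "Extracted PDF text") {
  # Escape characters that have special meaning in HTML ("&" must go first).
  escape_html <- function(x) {
    x <- gsub("&", "&amp;", x, fixed = TRUE)
    x <- gsub("<", "&lt;", x, fixed = TRUE)
    x <- gsub(">", "&gt;", x, fixed = TRUE)
    x
  }
  c(
    "<!DOCTYPE html>",
    "<html>",
    paste0("<head><meta charset=\"utf-8\"><title>",
           escape_html(title), "</title></head>"),
    "<body>",
    "<pre>",
    escape_html(lines),
    "</pre>",
    "</body>",
    "</html>"
  )
}

# Stand-in for doc$content from the question (made-up sample values).
content <- c(" sample data", "", " records", "a < b & c")

html <- lines_to_html(content)

# Assumed output location; any writable path would do.
out_file <- file.path(tempdir(), "extracted.html")
writeLines(html, out_file)
```

With the code from the question, passing doc$content instead of the sample vector should presumably work the same way, since the printed output shows a $content element holding one string per line.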