I am trying to get the data from the tables in this PDF: https://www.dropbox.com/s/y3nivxhjvvzva7d/test1.pdf?dl=0. I have tried pdfminer and pypdf with some luck, but I can't really get the data out of the tables.
This is what one of the tables looks like:
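As you can see, some of the columns are marked with an 'x'. I am trying to get this table into a list of objects.
This is the code so far; I am currently using pdfminer.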
    # pdfminer test
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.pdfdevice import PDFDevice, TagExtractor
    from pdfminer.pdfpage import PDFPage, PDFTextExtractionNotAllowed
    from pdfminer.converter import XMLConverter, HTMLConverter, TextConverter, PDFPageAggregator
    from pdfminer.cmapdb import CMapDB
    from pdfminer.layout import LAParams, LTTextBox, LTTextLine, LTFigure, LTImage
    from pdfminer.image import ImageWriter
    from cStringIO import StringIO
    import sys
    import os


    def pdfToText(path):
        rsrcmgr = PDFResourceManager()
        retstr = StringIO()
        codec = 'utf-8'
        laparams = LAParams()
        device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
        fp = file(path, 'rb')
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        password = ''
        maxpages = 0
        caching = True
        pagenos = set()

        records = []
        i = 1
        for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,
                                      caching=caching, check_extractable=True):
            # process page
            interpreter.process_page(page)

            # only select lines from the line containing 'Tool' to the line containing "1 The 'All'"
            lines = retstr.getvalue().splitlines()

            idx = containsSubString(lines, 'Tool')
            lines = lines[idx+1:]
            idx = containsSubString(lines, "1 The 'All'")
            lines = lines[:idx]

            for line in lines:
                records.append(line)
            i += 1

        fp.close()
        device.close()
        retstr.close()

        return records


    def containsSubString(list, substring):
        # find a substring in a list item
        for i, s in enumerate(list):
            if substring in s:
                return i
        return -1


    # process pdf
    fn = '../test1.pdf'
    ft = 'test.txt'

    text = pdfToText(fn)
    outFile = open(ft, 'w')
    for i in range(0, len(text)):
        outFile.write(text[i])
    outFile.close()
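That produces a text file containing all of the text, but the spacing of the x's is not preserved. The output looks like this:
The x's are just single-spaced in the text document.
Right now I am just producing text output, but my goal is to produce an HTML document with the data from the tables. I have been searching for OCR examples, and most of them seem confusing or incomplete. I am open to using C# or any other language that might produce the result I am looking for.
EDIT: There will be multiple PDFs like this that I need to get the table data from. The headers are the same for all of the PDFs (as far as I know).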
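To make the spacing problem concrete: what I think I need is positions rather than plain text. pdfminer's layout objects (the ones `PDFPageAggregator` produces) carry a `bbox` with x/y coordinates, so each 'x' could in principle be assigned to the column whose horizontal range contains it. Below is a minimal sketch of that idea in Python 3 with no pdfminer dependency. The column names, the left-edge values, and the `(x0, text)` input format are all assumptions made up for illustration, not values taken from test1.pdf.

```python
from bisect import bisect_right

# Hypothetical column layout: (left edge in PDF points, column name).
# These numbers are invented; real ones would come from the header row's bboxes.
COLUMNS = [(0.0, "Tool"), (200.0, "All"), (260.0, "A"), (320.0, "B")]


def column_for(x0, columns=COLUMNS):
    """Return the column whose left edge is the last one at or before x0."""
    edges = [left for left, _ in columns]
    idx = bisect_right(edges, x0) - 1
    # Anything left of the first edge is treated as the first column.
    return columns[max(idx, 0)][1]


def row_to_record(items, columns=COLUMNS):
    """Turn one table row, given as (x0, text) pairs, into a dict keyed by column.

    Columns with no text in this row are left as empty strings.
    """
    record = {name: "" for _, name in columns}
    for x0, text in items:
        name = column_for(x0, columns)
        # Join fragments that land in the same column with a space.
        record[name] = (record[name] + " " + text).strip()
    return record
```

For example, `row_to_record([(10.0, "Hammer"), (265.0, "x")])` would put "Hammer" under the first column and the "x" under the third, leaving the other columns empty.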
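For the HTML step, something along these lines is what I have in mind once the rows are available as a list of objects. This is only a sketch in Python 3 using the standard library; the function name `records_to_html` and the dict-per-row format are my own assumptions, not anything pdfminer provides.

```python
from html import escape


def records_to_html(records, columns):
    """Render a list of dicts as a minimal HTML table, one row per record."""
    header = "".join("<th>{}</th>".format(escape(c)) for c in columns)
    rows = []
    for rec in records:
        # Missing keys become empty cells; values are escaped for HTML.
        cells = "".join(
            "<td>{}</td>".format(escape(str(rec.get(c, "")))) for c in columns
        )
        rows.append("<tr>{}</tr>".format(cells))
    return "<table>\n<tr>{}</tr>\n{}\n</table>".format(header, "\n".join(rows))
```

Calling `records_to_html(rows, ["Tool", "A"])` with a list of row dicts would return the table markup as a string, which could then be written to an .html file.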