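I am using PyPDF2 to read text from a PDF file. However, PyPDF2 does not read the text line by line; it reads it in a seemingly arbitrary way, inserting line breaks at places where the PDF has none.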
import PyPDF2

# Note: PdfFileReader, numPages, getPage and extractText are the legacy
# PyPDF2 names (removed in PyPDF2 3.0).
pdf_path = r'C:\Users\PDFExample\Desktop\Temp\sample.pdf'
pdfFileObj = open(pdf_path, 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
page_nos = pdfReader.numPages
for i in range(page_nos):
    # Create a page object
    pageObj = pdfReader.getPage(i)
    # Print the page number
    print("Page No: ", i)
    # Extract the text from the page and split it into chunks
    text = pageObj.extractText().split(" ")
    # The chunks are stored in a list; loop over it
    # (j is used so the outer page index i is not shadowed)
    for j in range(len(text)):
        # Print each chunk, followed by a blank line
        print(text[j], end="\n\n")
    print()
This gives me the following output:
Our Ref :
21
1
8
88
1
11
5
Name:
S
ky Blue
Ref 1 :
1
2
-
34
-
56789
-
2021/2
Ref 2:
F2021004
444
Amount:
$
1
00
.
11
...
whereas the expected output is:
Our Ref :2118881115 Name: Sky Blue Ref 1 :12-34-56789-2021/2 Ref 2:F2021004444
Amount: $100.11 Total Paid:$0.00 Balance: $100.11 Date of A/C: 01/08/2021 Date Received: 10/12/2021
Last Paid: Amt Last Paid: A/C Status: CLOSED Collector : Sunny Jane
Here is a link to the PDF file: https://pdfhost.io/v/eCiktZR2d_sample2
I tried a different package called pdfplumber. It was able to read the PDF line by line, exactly the way I wanted.
1. Install the pdfplumber package
pip install pdfplumber
2. Get the text and store it in some container
import pdfplumber

# pdf_path is the path to the PDF file, as defined in the question
pdf_text = None
with pdfplumber.open(pdf_path) as pdf:
    first_page = pdf.pages[0]
    pdf_text = first_page.extract_text()
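As a rough sketch of how the extracted text might then be handled line by line, and across every page rather than only the first, something like the following could work. The helper names text_to_lines and read_pdf_lines are hypothetical, made up for this illustration and not part of pdfplumber. The sketch also assumes that extract_text() may return None or an empty string for a page without text, so both cases are treated as "no lines".

```python
def text_to_lines(page_text):
    """Split extracted page text into stripped, non-empty lines.

    Hypothetical helper for illustration; not part of pdfplumber.
    """
    # Assumption: a page without text yields None or an empty string
    if not page_text:
        return []
    return [line.strip() for line in page_text.splitlines() if line.strip()]


def read_pdf_lines(pdf_path):
    """Return a list of (page_number, line) tuples for every page of the PDF.

    Hypothetical helper for illustration; page numbers start at 1.
    """
    # Imported here so text_to_lines can be used without pdfplumber installed
    import pdfplumber

    results = []
    with pdfplumber.open(pdf_path) as pdf:
        for page_number, page in enumerate(pdf.pages, start=1):
            for line in text_to_lines(page.extract_text()):
                results.append((page_number, line))
    return results
```

With the pdf_path from the question, calling read_pdf_lines(pdf_path) should give one entry per visible line, which can then be printed or parsed further. Whether the lines match the visual layout still depends on how the PDF itself was produced.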