Python PDF Processing Tutorial

2023-10-23

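PDF, or Portable Document Format, was first introduced by Adobe, but it is now maintained by the International Organization for Standardization (ISO) as an open standard.

Some of the main components of a PDF file are plain text, buttons, forms, radio buttons, images, audio, video, signatures, and metadata.
In Python, we can perform different tasks to process the data in a PDF file and to create PDF files.

In this tutorial, we will use Python PDF processing libraries to create a PDF file, extract different components from it, and edit it, with examples.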

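Table of contents

1 Popular Python PDF libraries
2 Extract text
3 Extract images
4 Extract a table
5 Extract URLs
6 Extract pages as images
7 Create a PDF
8 Add text
9 Highlight text
10 Add an image
11 Add a table
12 Create a form
13 Fill a form
14 Resize pages
15 Convert PDF to CSV or Excel
16 Add a watermark
17 PDF metadata (read and edit)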

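Popular Python PDF libraries

The main libraries for handling PDF files are PyPDF2, PDFrw, and tabula-py. The pyPDF package was released in 2005.

Later development of the package focused on compatibility with different Python versions and on optimization.

The library now exists as pyPDF, PyPDF2, and PyPDF4. The main difference between pyPDF and PyPDF2+ is that the PyPDF2+ versions are compatible with Python 3.

In this tutorial, we will run our code with PyPDF2, because PyPDF4 is not fully compatible with Python 3.
The examples use class and method names such as PdfFileReader and getPage, which PyPDF2 removed in version 3.0, so we install an earlier release with the following pip command:


pip install "PyPDF2<3.0"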

If you are using Anaconda, you can install PyPDF2 with the following command:


conda install pyPDF2

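The PDFrw library is another alternative to PyPDF2. The main differences between the two libraries are PyPDF2's ability to encrypt files and PDFrw's ability to integrate with ReportLab.
To install PDFrw for Python, we use the following pip command: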


pip install PDFrw

If you are using Anaconda, you can install PDFrw with the following command:


conda install PDFrw

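tabula-py is a library widely used by data science professionals to parse data from irregularly formatted PDFs and turn it into tables.
To install tabula-py for Python, we use the following pip command: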


pip install tabula-py

If you are using Anaconda, you can install tabula-py with the following command:


conda install tabula-py

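PyMuPDF is a multi-platform, lightweight viewer, renderer, and toolkit for PDF, XPS, and e-book files. It is also very convenient for working with the images in a PDF file.
To install PyMuPDF for Python, we use the following pip command: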


pip install PyMuPDF

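pdf2image is a Python library for converting PDF files to images. To use it, we need poppler set up on our system.

On Windows, we need to download poppler and then either add its bin folder to our PATH or pass that folder as the poppler_path argument of convert_from_path: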


poppler_path = r"C:\path\to\poppler-xx\bin"

For Linux users (Debian-based), we can simply install it with:


sudo apt-get install poppler-utils

After that, we can install pdf2image by running the following pip command:


pip install pdf2image

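ReportLab is also a Python library for working with PDF files. Its Canvas class in particular is very convenient for creating PDF files. We install it with the following pip command: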


pip install reportlab

endesive is a Python library for digitally signing mail, PDF, and XML documents and for verifying their digital signatures. We install it with the following pip command:


pip install endesive
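
Before running the examples, it can help to check which of these libraries are importable in the current environment. The following is a minimal sketch that uses only the standard library. The helper name check_pdf_libraries is made up for illustration, and the import names are assumptions based on how each package is usually imported (for example fitz for PyMuPDF and tabula for tabula-py).

```python
from importlib.util import find_spec

# Module name to import, keyed by the package name given to pip.
PDF_LIBRARIES = {
    "PyPDF2": "PyPDF2",
    "pdfrw": "pdfrw",
    "tabula-py": "tabula",
    "PyMuPDF": "fitz",
    "pdf2image": "pdf2image",
    "reportlab": "reportlab",
    "endesive": "endesive",
}


def check_pdf_libraries(libraries=PDF_LIBRARIES):
    """Return {pip_name: True/False} depending on whether the module can be found."""
    return {pip_name: find_spec(module) is not None
            for pip_name, module in libraries.items()}


if __name__ == "__main__":
    for pip_name, available in check_pdf_libraries().items():
        print(f"{pip_name:10} {'installed' if available else 'missing'}")
```

The check only looks the modules up and does not import them, so it stays fast and has no side effects.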

Extract text

Sometimes, we need to extract text from PDF files and process it. For example, we have the following two pages of plain text in the Example.pdf file:

We save this file in the same directory where our Python file is saved.

To extract the text from the pages for processing, we will use the PyPDF2 library as follows:


from PyPDF2 import PdfFileReader as pfr
with open('pdf_file', 'mode_of_opening') as file:
    pdfReader = pfr(file)
    page = pdfReader.getPage(0)
    print(page.extractText())

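In our code, we first import PdfFileReader from PyPDF2 as pfr. Then we open the PDF file in 'rb' (read binary) mode. Next, we create a PdfFileReader object for the file.

We can use the different methods of the pdfReader object to work with the data.

For example, in the code above, we call the getPage method with the page number as its argument (counting from 0) to create a page object, and then call its extractText() method to get all of the page's text as a string. Now, as an example, let's extract the data from the first page of the Example.pdf file: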


from PyPDF2 import PdfFileReader as pfr
with open('Example.pdf', 'rb') as file:
    pdfReader = pfr(file)
    page = pdfReader.getPage(0)
    print(page.extractText())

Running this code, we get the following result which is the plain text of the page in string format:
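
The extracted string often contains line breaks and uneven spacing that follow the page layout rather than the sentences. As a hedged sketch of the processing step, the hypothetical helper normalize_text below rejoins words hyphenated across line breaks and collapses runs of whitespace. It works on any string, so the sample text here is only a stand-in for the output of extractText().

```python
import re


def normalize_text(raw):
    """Tidy text extracted from a PDF page (illustrative helper, not part of PyPDF2)."""
    # Rejoin words split by a hyphen at the end of a line: "docu-\nment" -> "document".
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", raw)
    # Collapse any remaining run of whitespace (including newlines) into one space.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


sample = "This is the first page of the docu-\nment.\n  It has   plain text."
print(normalize_text(sample))
```

Note that this simple rule also removes hyphens from genuinely hyphenated words that happen to break at a line end, so it is a starting point rather than a complete clean-up.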

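Extract images

In this section, we will parse a PDF file and save its images to our local machine. For this purpose, we use the PyMuPDF library to get the images from the PDF file and Pillow to save them to our local machine.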

To demonstrate this, we create a sample PDF file with images called ExtractImage.pdf and place it next to our Python file:

Now, let’s have a look at the code below which retrieves the images from our PDF file and saves them in the current directory.


import io
import fitz  # PyMuPDF
from PIL import Image
file_in_pdf_format = fitz.open("ExtractImage.pdf")
for page_number in range(len(file_in_pdf_format)):
    page = file_in_pdf_format[page_number]
    img_list = page.get_images()
    if not img_list:
        print("There is no image on page", page_number + 1)
        continue
    for img_index, img in enumerate(img_list, start=1):
        # The first item of each entry is the image's cross-reference number
        xref = img[0]
        base_img = file_in_pdf_format.extract_image(xref)
        img_bytes = base_img["image"]
        img_ext = base_img["ext"]
        # Wrap the raw bytes in a file-like object so Pillow can open them
        image = Image.open(io.BytesIO(img_bytes))
        # Passing a file name lets Pillow open and close the output file itself
        image.save(f"image{page_number + 1}_{img_index}.{img_ext}")

As we can see, in addition to fitz (PyMuPDF), we also import io and Image from PIL.

PIL creates an image object, and io.BytesIO wraps the extracted image bytes in an in-memory file-like object that PIL can open.
Running this piece of code, we get the following result:

The above image shows that after running the code, the images are saved in the current directory. The name of each image indicates the page on which it was found and its order on that page.

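Extract a table

Sometimes our PDF files contain tables. To process them, we need to extract them from the PDF file and convert them to pandas DataFrames. For this purpose, we use tabula-py to extract the data from a file named ExtractTable.pdf, and pandas to process it further.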


import tabula
# pages="all" reads the tables from every page of the file
tables = tabula.read_pdf("ExtractTable.pdf", pages="all")
print(tables)

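As the snippet above shows, handling tables in a PDF file is very simple. We read them by specifying the page numbers, here pages="all".

read_pdf returns the tables as a list of pandas DataFrames that we can further use and manipulate.
Running the above code on ExtractTable.pdf, we get this result: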

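Extract URLs

URLs or hyperlinks can also be detected in a PDF file. To detect them, we use the re and PyPDF2 libraries.

Just as we extracted plain text, we can extract the text and then use a regular expression to pick out character sequences that look like URLs, that is, http:// or https:// followed by other characters without spaces. In the following example, we use the ExtractURLs.pdf file to demonstrate.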


import re
import PyPDF2
def url_finder(page_content):
    # Match http:// or https:// followed by any run of non-whitespace characters
    regex = r"(https?://\S+)"
    return re.findall(regex, page_content)
with open("ExtractURLs.pdf", 'rb') as file:
    readPDF = PyPDF2.PdfFileReader(file)
    for page_no in range(readPDF.numPages):
        page = readPDF.getPage(page_no)
        text = page.extractText()
        print(f"URLS of page {page_no}: " + str(url_finder(text)))
    # No explicit close is needed: the with block closes the file

In the code above, our regular expression "https?://\S+" selects every string that starts with http or https (the question mark makes the s optional) and continues until it reaches whitespace, which marks the end of the URL.
Running the above code, we get the following result:

As we can see, our code returns the URLs of each page in a list.
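
Because \S+ matches every non-whitespace character, a URL at the end of a sentence can pick up trailing punctuation. The sketch below applies the same regular expression to a plain string, which stands in for the text returned by extractText(), and strips common trailing punctuation. The helper name clean_urls and the sample text are assumptions for illustration.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")


def clean_urls(page_content):
    """Find URLs with the tutorial's pattern and trim trailing punctuation."""
    urls = URL_PATTERN.findall(page_content)
    # Characters such as '.', ',' or ')' right after a URL usually belong
    # to the surrounding sentence, not to the link itself.
    return [url.rstrip(".,;:!?)\"'") for url in urls]


sample = "See https://example.com/docs, or visit http://example.org."
print(clean_urls(sample))
```

A URL that legitimately ends in one of these characters would be shortened too, so the set of stripped characters may need adjusting for a given document.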

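Extract pages as images

Sometimes we need to convert the pages of a PDF file to images. For this purpose, we use the pdf2image library.

Its convert_from_path function returns a list with one image per page. We then call save on each element of the list, with a file name and an image format, to save the images on our machine.

Here is an example that demonstrates it using a file named Example.pdf.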


from pdf2image import convert_from_path
# On Windows, pass poppler_path=... here if poppler is not on PATH
imgs = convert_from_path('Example.pdf')
for i, img in enumerate(imgs, start=1):
    img.save('Page' + str(i) + '.jpg', 'JPEG')

Running the above code, we get the images saved in our working directory as JPEG images.
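
When a document has ten or more pages, names such as Page10.jpg sort before Page2.jpg in a file listing. One possible refinement, sketched below, zero-pads the page number. The helper names are hypothetical. convert_from_path and its dpi argument come from pdf2image, which is imported inside the function so that the naming helper can be tried without it.

```python
def page_image_name(index, total, prefix="Page", ext="jpg"):
    """Build a zero-padded file name, e.g. Page03.jpg for page 3 of 12."""
    width = len(str(total))
    return f"{prefix}{index:0{width}d}.{ext}"


def save_pages_as_jpeg(pdf_path, dpi=200):
    """Render every page of pdf_path to a JPEG in the working directory."""
    # Imported here so page_image_name can be used without pdf2image installed.
    from pdf2image import convert_from_path

    images = convert_from_path(pdf_path, dpi=dpi)
    names = []
    for index, image in enumerate(images, start=1):
        name = page_image_name(index, len(images))
        image.save(name, "JPEG")
        names.append(name)
    return names


print(page_image_name(3, 12))
```

Calling save_pages_as_jpeg("Example.pdf") would then write one file per page and return the list of names.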

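Create a PDF

To create a PDF file, we can use the Canvas class of the reportlab library. We first create an object of the Canvas class, passing the name of our PDF file, pdffile.pdf, as the argument.

Next, we call its drawString method with the position and the content to place there as arguments. Finally, we save our file.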


from reportlab.pdfgen.canvas import Canvas
canv = Canvas("pdffile.pdf")
canv.drawString(72,72,"This is a PDF file.")
canv.save()

Here is the result of running our create_pdf.py file.

Add text

As seen in the above section, we pass our text as an argument to drawString and specify its position. The two numbers are the x and y coordinates, measured in points (1/72 inch) from the bottom-left corner of the page, and they mark where the string begins.

As seen above, this is how our text will be displayed on the page in our file pdffile.pdf.
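
Since positions are measured in points, with 72 points to the inch, drawString(72, 72, ...) starts the text one inch from the left edge and one inch above the bottom edge. The following sketch shows that arithmetic with the hypothetical helpers mm_to_pt and from_top. An A4 page of 210 mm by 297 mm is assumed as the page size.

```python
POINTS_PER_INCH = 72.0
MM_PER_INCH = 25.4


def mm_to_pt(mm):
    """Convert millimetres to PDF points (1 pt = 1/72 inch)."""
    return mm / MM_PER_INCH * POINTS_PER_INCH


def from_top(distance_pt, page_height_pt):
    """Turn a distance measured from the top edge into a y value measured from the bottom."""
    return page_height_pt - distance_pt


a4_width, a4_height = mm_to_pt(210), mm_to_pt(297)
print(round(a4_width, 2), round(a4_height, 2))

# y coordinate for text that should start one inch below the top edge of an A4 page
print(round(from_top(72, a4_height), 2))
```

The second helper is useful because most people think of a position as a distance from the top of the page, while the y coordinate counts upward from the bottom.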

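Highlight text

To highlight text in a PDF file, we use the PyMuPDF library. First, we open our PDF file, pdffile.pdf, with PyMuPDF. Then we iterate over the pages and highlight every occurrence of the specified character sequence.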


import fitz  # PyMuPDF
pdf_file = fitz.open("pdffile.pdf")
text_to_be_highlighted = "PDF"
for page in pdf_file:
    # search_for returns one rectangle per occurrence of the text on the page
    matches = page.search_for(text_to_be_highlighted)
    for inst in matches:
        highlight = page.add_highlight_annot(inst)
        highlight.update()
pdf_file.save("output.pdf", garbage=4, deflate=True, clean=True)

The PDF file before highlighting.

The PDF file after highlighting.

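Add an image

To add an image to a PDF file, we use the PyMuPDF library. For this, we choose our current file pdffile.pdf, the destination file pdffilewithimage.pdf, and the image to insert: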


import fitz  # PyMuPDF
pdf_file = "pdffile.pdf"
pdf_file_with_image = "pdffilewithimage.pdf"
image = "cat.png"
# Rectangle (x0, y0, x1, y1) that the image will be fitted into
location = fitz.Rect(450, 20, 550, 120)
file_handle = fitz.open(pdf_file)
first_page = file_handle[0]
first_page.insert_image(location, filename=image)
file_handle.save(pdf_file_with_image)

As seen above, using the Rect class, we create a rectangle into which we want to fit our image. Running the above code, we see the following in our PDF file.

PDF file without image

PDF file after an image is inserted.
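
PyMuPDF rectangles are given as (x0, y0, x1, y1), with the origin at the top-left corner of the page and y growing downward, so Rect(450, 20, 550, 120) is a 100 by 100 point box near the top of the page. As an illustration, the hypothetical helper below computes such a box from a page width, a box size, and a margin. The resulting tuple could then be passed on as fitz.Rect(*box).

```python
def top_right_box(page_width, size, margin):
    """Return (x0, y0, x1, y1) for a square box in the top-right corner.

    Coordinates follow PyMuPDF's convention: origin at the top-left,
    y increasing downward.
    """
    x1 = page_width - margin
    x0 = x1 - size
    y0 = margin
    y1 = y0 + size
    return (x0, y0, x1, y1)


# A page 595 points wide (roughly A4), a 100-point box, a 20-point margin
print(top_right_box(595, 100, 20))
```

Computing the box from the page width keeps the image inside the page when the page size changes.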

Add a table

To add a table to a PDF file, we use the reportlab library. The code below imports all the required modules and creates a PDF file named table_pdf.pdf.


from reportlab.lib.pagesizes import A4
from reportlab.platypus import SimpleDocTemplate, Table, TableStyle
# Output file; the CSV example later reads table_pdf.pdf
doc = SimpleDocTemplate("table_pdf.pdf", pagesize=A4)
members = []
# Each inner list is one row; the first row is the header
frame = [['#', 'id', 'name'], ['1', '2332', 'Jack'], ['2', '3573', 'Jerry']]
table = Table(frame)
members.append(table)
doc.build(members)

As shown above, after importing the necessary modules from the library, we create a SimpleDocTemplate object with the name of the PDF file and its page size as arguments.

Then we put the rows in a new list and pass it as an argument to the Table class.

After that, we append the resulting table to our members list. Finally, we call the build method on doc with members as its argument, which writes the table to our PDF file.

This is the final PDF, table_pdf.pdf, with a page that contains the lists in frame as the rows of the table.
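
Table data often starts out as a list of records rather than a ready-made list of rows. The sketch below, using a hypothetical helper records_to_rows, builds the same frame structure from dictionaries. The result can be passed to Table in the same way as the hand-written list.

```python
def records_to_rows(records, columns):
    """Build a header row plus one numbered row per record, all as strings."""
    rows = [["#"] + list(columns)]
    for number, record in enumerate(records, start=1):
        rows.append([str(number)] + [str(record[col]) for col in columns])
    return rows


members_data = [
    {"id": 2332, "name": "Jack"},
    {"id": 3573, "name": "Jerry"},
]
frame = records_to_rows(members_data, ["id", "name"])
print(frame)
```

Converting every cell to a string keeps the rows uniform, which matches the string-only rows used in the example above.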

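Create a form

To create a form in our PDF file, we mainly use the canvas module of the reportlab library. Like other kinds of forms, PDF forms can contain text fields, radio buttons, multiple-choice fields, and checkboxes.

The final result is stored in form_pdf.pdf.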


from reportlab.pdfgen import canvas
from reportlab.lib.colors import magenta, pink, blue, green
myCanvas = canvas.Canvas('form_pdf.pdf')
myCanvas.setFont("Helvetica", 18)
myCanvas.drawCentredString(500, 500, 'A Form')
interactiveForm = myCanvas.acroForm
myCanvas.drawString(20, 500, 'Name:')
interactiveForm.textfield(name='fname', tooltip='Your Name',
            x=100, y=600, borderStyle='solid',
            borderColor=green, fillColor=pink, 
            width=200,
            textColor=magenta, forceBorder=True)
myCanvas.drawString(30, 600, 'Male:')
interactiveForm.radio(name='radio2', tooltip='Radio field 2',
        value='value1', selected=True,
        x=100, y=600, buttonStyle='diamond',
        borderStyle='solid', shape='square',
        borderColor=magenta, fillColor=pink, 
        borderWidth=1,
        textColor=blue, forceBorder=True)
interactiveForm.radio(name='radio2', tooltip='Radio field 2',
        value='value2', selected=False,
        x=100, y=600, buttonStyle='diamond',
        borderStyle='solid', shape='square',
        borderColor=magenta, fillColor=pink, 
        borderWidth=1,
        textColor=blue, forceBorder=True)
myCanvas.drawString(150, 659, 'Female:')
interactiveForm.radio(name='radio3', tooltip='Radio Field 3',
        value='value1', selected=False,
        x=200, y=650, buttonStyle='diamond',
        borderStyle='solid', shape='circle',
        borderColor=blue, fillColor=green, 
        borderWidth=2,
        textColor=blue, forceBorder=False)
interactiveForm.radio(name='radio3', tooltip='Field radio3',
        value='value2', selected=True,
        x=200, y=650, buttonStyle='diamond',
        borderStyle='solid', shape='circle',
        borderColor=magenta, fillColor=pink, 
        borderWidth=1,
        textColor=blue, forceBorder=True)
myCanvas.drawString(5, 650, 'Pick a character:')
options = [('Tom', 'tom'), ('Jerry', 'jerry'), ('Spike', 'spike')]
interactiveForm.choice(name='choice2', tooltip='Choice 2',
            value='Tom',
            options=options, 
            x=190, y=550, width=70, height=30,
            borderStyle='bevelled', borderWidth=2,
            forceBorder=True)
myCanvas.save()

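In the code above, we first create an object of the Canvas class and set its font. Then we create a form variable, interactiveForm, from the canvas's acroForm attribute.

To put strings in our PDF file, we use the object of our Canvas class, and to define the form fields, we use the interactiveForm variable. After running the above code, we get the following PDF form.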

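Fill a form

To fill in a form with Python, we use the pdfrw library. In our PDF form form_pdf.pdf, we have a field named fname, and we want to put Bob Martin in it.

For this, we define the data to fill in as a dictionary, data, and pass it to the fill_form function together with the input and output file names. The function opens the input file, reads it, walks through the annotations of each page, and sets the value of every field whose name appears in the dictionary.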


import pdfrw
source = "form_pdf.pdf"
destination = "output.pdf"
# Names of the PDF dictionary entries we need
MYKEY = '/Annots'      # annotations of a page
FIELDKEY = '/T'        # field name
SUB_KEY = '/Subtype'
WIDGET = '/Widget'     # subtype of form-field annotations
data = {
    'fname': 'Bob Martin'
}
def fill_form(source, dest, data):
    myTemplate = pdfrw.PdfReader(source)
    for page in myTemplate.pages:
        # Pages without annotations return None
        annots = page[MYKEY] or []
        for annot in annots:
            if annot[SUB_KEY] == WIDGET and annot[FIELDKEY]:
                # Field names are stored as '(name)'; strip the parentheses
                key = annot[FIELDKEY][1:-1]
                if key in data:
                    if isinstance(data[key], bool):
                        if data[key]:
                            annot.update(pdfrw.PdfDict(AS=pdfrw.PdfName('Yes')))
                    else:
                        annot.update(pdfrw.PdfDict(V='{}'.format(data[key])))
                        # Clear the old appearance so the viewer redraws the field
                        annot.update(pdfrw.PdfDict(AP=''))
    pdfrw.PdfWriter().write(dest, myTemplate)
fill_form(source, destination, data)

After running the above code, we will get the name in the field as shown below:

Resize pages

Sometimes we need to resize a PDF file. For this, we can use PyPDF2. In the code below, we resize the file pdffile.pdf and save the result as resizedpdffile.pdf.


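import PyPDF2
pdf_path = "pdffile.pdf"
# PdfFileReader accepts a file path as well as an open file object
pdf_reader = PyPDF2.PdfFileReader(pdf_path)
p0 = pdf_reader.getPage(0)
# Scale the width and height of the first page to half their size
p0.scaleBy(0.5)
pdf_writer = PyPDF2.PdfFileWriter()
pdf_writer.addPage(p0)
with open("resizedpdffile.pdf", "wb") as f:
    pdf_writer.write(f)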

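The code above first reads our PDF file and gets its first page. It then scales the page by a factor of 0.5 and creates a PdfFileWriter. Finally, it adds the scaled page to the writer and writes it to a new PDF file, resizedpdffile.pdf.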
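
scaleBy(0.5) multiplies both the width and the height of the page by 0.5, so the page area shrinks to a quarter. If a target width is known instead of a factor, the factor can be computed first. Here is a small sketch of that arithmetic, with hypothetical helper names and an assumed page of roughly A4 size, 595 by 842 points:

```python
def scaled_size(width, height, factor):
    """Return the page size after scaling both dimensions by factor."""
    return (width * factor, height * factor)


def factor_for_width(current_width, target_width):
    """Return the scale factor that brings current_width to target_width."""
    return target_width / current_width


width, height = 595.0, 842.0
print(scaled_size(width, height, 0.5))
print(factor_for_width(width, 297.5))
```

The computed factor is the number that would be passed to scaleBy.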

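Convert PDF to CSV or Excel

When our data is stored as a table in a PDF file, we can retrieve it and save it as a CSV file using the tabula-py library. The code below converts the PDF file table_pdf.pdf to CSV.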


import tabula as tb
df = tb.read_pdf("table_pdf.pdf", pages='all')
tb.convert_into("table_pdf.pdf", "table_pdf_in_csv.csv", output_format="csv", pages='all')
print(df)

After running the above code, we will have our CSV file also saved in the working directory.
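
Once the CSV file exists, it can be read back with the standard library to check the conversion. The sketch below parses CSV text with csv.DictReader. To stay self-contained it reads from an in-memory string that mirrors the table created earlier, and it is only an assumption that the exported file has exactly this content. For the real file, open("table_pdf_in_csv.csv", newline="") would take the place of the io.StringIO object.

```python
import csv
import io

# Stand-in for the exported file; the real content may differ.
csv_text = "#,id,name\n1,2332,Jack\n2,3573,Jerry\n"


def read_rows(handle):
    """Return the CSV rows as a list of dictionaries keyed by the header row."""
    return list(csv.DictReader(handle))


rows = read_rows(io.StringIO(csv_text))
for row in rows:
    print(row["id"], row["name"])
```

For Excel output, one option is to call to_excel on a pandas DataFrame returned by read_pdf, which requires an Excel writer package such as openpyxl to be installed.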

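Add a watermark

Watermarks are a background display commonly used in Word and PDF files. To add a watermark to a PDF in Python, we use the PyPDF2 library. The code merges the watermark from watermark.pdf into pdffile.pdf and saves the result as a new file, merged.pdf.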


import PyPDF2
pdf_file = "pdffile.pdf"
watermark = "watermark.pdf"
final = "merged.pdf"
input_handle = open(pdf_file, 'rb')
input_pdf = PyPDF2.PdfFileReader(input_handle)
watermark_handle = open(watermark, 'rb')
watermark_file = PyPDF2.PdfFileReader(watermark_handle)
pdf_page = input_pdf.getPage(0)
watermark_page = watermark_file.getPage(0)
# Draw the watermark page on top of the first page of the PDF
pdf_page.mergePage(watermark_page)
generated_pdf = PyPDF2.PdfFileWriter()
generated_pdf.addPage(pdf_page)
output_handle = open(final, 'wb')
generated_pdf.write(output_handle)
output_handle.close()
watermark_handle.close()
input_handle.close()

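In the code above, we first import PyPDF2 and store the names of the PDF and watermark files. Next, we open both files to read their contents and access the first page of each.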

Then we merge the watermark file on the PDF file and write the result to our final file. In the end, we close all our files.

Our PDF file.

Our watermark file.

Our watermarked PDF file.

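PDF metadata (read and edit)

To maintain our PDF files better, we should add metadata to them. In the example below, we add metadata to the PDF file pdffilewithimage.pdf using the pdfrw library.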


from pdfrw import PdfReader, PdfWriter, PdfDict
pdf_file = PdfReader('pdffilewithimage.pdf')
metadata_info = PdfDict(Author='LikeGeeks', Title='PDF Title')
pdf_file.Info.update(metadata_info)
PdfWriter().write('new.pdf', pdf_file)

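As the code shows, we first open the PDF file with the PdfReader class. Next, we create the metadata object and add it to the file. Finally, we write everything to the file new.pdf.

To read the metadata of a PDF file, we use the PdfFileReader class of the PyPDF2 library.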


from PyPDF2 import PdfFileReader
with open("new.pdf", "rb") as f:
    pdffile = PdfFileReader(f)
    pdf_info = pdffile.getDocumentInfo()
    print(pdf_info)

Running the above code we get the following result.
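
The document information behaves like a dictionary whose keys are PDF names with a leading slash, such as /Author. A minimal sketch for turning it into plain Python keys follows. The helper name is hypothetical, and the sample dictionary simply mirrors the values written above rather than real program output.

```python
def plain_metadata(info):
    """Strip the leading '/' from PDF info keys and convert values to str."""
    return {str(key).lstrip("/"): str(value) for key, value in info.items()}


# Assumed shape of the info dictionary, based on the metadata set earlier
sample_info = {"/Author": "LikeGeeks", "/Title": "PDF Title"}
print(plain_metadata(sample_info))
```

With the real reader, plain_metadata(pdf_info) could be called on the object returned by getDocumentInfo().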
