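Working with PDFs in Python: read, generate, edit, and extract text, with examples

PDF is a widely used document format for digital publications. Python is a versatile programming language with a broad range of applications. Combined, they make an efficient toolkit for manipulating PDF documents and extracting information from them. This article looks at the different ways Python can be used to process PDFs and how that can improve productivity.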

Python PDF libraries

Several libraries are available for working with PDF files in Python. Popular choices include PyPDF2, reportlab, and fpdf.
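Before running the examples, it can help to check which of these libraries are importable in the current environment. The following is a minimal sketch using only the standard library; the helper name `available_pdf_modules` and the list of import names are illustrative assumptions, and a pip package name can differ from its import name.

```python
from importlib.util import find_spec

# Import names of the libraries mentioned above. This list is an assumption;
# adjust it to the libraries you actually plan to use.
PDF_MODULES = ["PyPDF2", "reportlab", "fpdf"]


def available_pdf_modules(names=PDF_MODULES):
    """Map each top-level module name to True if it can be imported here."""
    return {name: find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, present in available_pdf_modules().items():
        print(f"{name}: {'installed' if present else 'missing'}")
```

Any module reported as missing can then be installed with pip before continuing.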

Reading PDFs with Python

To read a PDF file, you can use the PyPDF2 library. Here is an example:

import PyPDF2

# Open the PDF file; the with block closes it automatically
with open('example.pdf', 'rb') as pdf_file:
    # Create a PDF reader object (PdfReader is the PyPDF2 3.x name;
    # older releases called it PdfFileReader)
    pdf_reader = PyPDF2.PdfReader(pdf_file)

    # Get the number of pages in the PDF file
    num_pages = len(pdf_reader.pages)

    # Loop through all the pages and extract the text
    for page_num in range(num_pages):
        page_obj = pdf_reader.pages[page_num]
        print(page_obj.extract_text())

Generating PDFs with Python

To generate a new PDF file from scratch, you can use the reportlab or fpdf library. Here is an example using reportlab:

from reportlab.pdfgen import canvas

# Create a new PDF file
pdf_file = canvas.Canvas('example.pdf')

# Add text to the PDF
pdf_file.drawString(100, 750, "Hello World")

# Save and close the PDF file
pdf_file.save()

The fpdf library can be used to create PDFs in a similar way.
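As a rough counterpart to the reportlab example, here is a minimal sketch with fpdf. The function name `write_hello_pdf` and its default arguments are hypothetical, and the library is imported inside the function so that the definition itself does not require fpdf to be installed.

```python
def write_hello_pdf(path="example_fpdf.pdf", text="Hello World"):
    """Write a one-page PDF containing a single line of text using fpdf."""
    # Imported here so defining this function does not require fpdf.
    from fpdf import FPDF

    pdf = FPDF()
    pdf.add_page()                      # a page must exist before drawing
    pdf.set_font("Helvetica", size=12)  # Helvetica is one of the built-in core fonts
    pdf.cell(0, 10, text)               # width 0 stretches the cell to the right margin
    pdf.output(path)                    # write the document to disk


if __name__ == "__main__":
    write_hello_pdf()
```

Unlike reportlab's canvas, which places text at explicit coordinates, fpdf lays out content in cells that flow from the current cursor position.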

Editing PDFs with Python

To edit an existing PDF file, you can use the PyPDF2 library. Here is an example that rotates the pages of a PDF file:

import PyPDF2

# Open the PDF file; the with block closes it automatically
with open('example.pdf', 'rb') as pdf_file:
    # Create a PDF reader object (PyPDF2 3.x API)
    pdf_reader = PyPDF2.PdfReader(pdf_file)

    # Create a PDF writer object
    pdf_writer = PyPDF2.PdfWriter()

    # Rotate each page 90 degrees clockwise and add it to the writer
    for page_obj in pdf_reader.pages:
        page_obj.rotate(90)
        pdf_writer.add_page(page_obj)

    # Save the rotated PDF while the source file is still open
    with open('example_rotated.pdf', 'wb') as pdf_output:
        pdf_writer.write(pdf_output)

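In short, Python offers several libraries for working with PDF files, letting you read, generate, and edit PDFs programmatically.

How to extract text from a PDF with Python

To extract text from a PDF with Python, you can use the PyPDF2 or pdfminer library. Both parse the PDF and extract its text content.

Example 1: Using `PyPDF2`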

import PyPDF2

# PdfReader accepts a file path directly (PyPDF2 3.x API)
pdf_reader = PyPDF2.PdfReader('file.pdf')

# Collect the text of every page into one string
text = ''
for page in pdf_reader.pages:
    text += page.extract_text()

print(text)

Example 2: Using `pdfminer`

from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage

def pdf_to_text(pdf_path):
    # The converter writes the extracted text into an in-memory buffer
    manager = PDFResourceManager()
    output = StringIO()
    converter = TextConverter(manager, output, laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, converter)

    with open(pdf_path, 'rb') as file:
        # check_extractable=True raises an error for PDFs that forbid text extraction
        for page in PDFPage.get_pages(file, check_extractable=True):
            interpreter.process_page(page)

    # Return everything the converter wrote to the buffer
    return output.getvalue()

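Both approaches let you extract text content from a PDF with Python.

How to merge PDF pages

Merging multiple PDF files into one document is a common document-processing task. The PyPDF2 library makes this straightforward.

Merging two PDF pages with `PyPDF2`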

import PyPDF2

# Open the first PDF file (PyPDF2 3.x API)
pdf1 = PyPDF2.PdfReader('file1.pdf')

# Open the second PDF file
pdf2 = PyPDF2.PdfReader('file2.pdf')

# Combine the first page of each file into a new document
output = PyPDF2.PdfWriter()
output.add_page(pdf1.pages[0])
output.add_page(pdf2.pages[0])

# Save the merged PDF file
with open('merged.pdf', 'wb') as f:
    output.write(f)

Merging entire PDF files with `PyPDF2`

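from PyPDF2 import PdfMerger  # named PdfFileMerger before PyPDF2 3.0

pdfs = ['file1.pdf', 'file2.pdf']
merger = PdfMerger()

# Append every page of each input file, in order
for pdf in pdfs:
    merger.append(pdf)

with open('merged_pdf.pdf', 'wb') as f:
    merger.write(f)

# Release the input files held by the merger
merger.close()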

With the examples above, you can use PyPDF2 to merge individual pages or entire PDF files in Python. Merging produces a single document that is easier to manage and distribute.
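When the input files live in one folder, the list of files to merge can be built automatically. The sketch below is one possible way to do that; the helper names `collect_pdfs` and `merge_folder` are hypothetical, and the merge step assumes PyPDF2 3.x, where the merger class is named `PdfMerger`.

```python
from pathlib import Path


def collect_pdfs(folder):
    """Return the PDF files directly inside folder, sorted by file name."""
    return sorted(
        (p for p in Path(folder).iterdir()
         if p.is_file() and p.suffix.lower() == ".pdf"),
        key=lambda p: p.name,
    )


def merge_folder(folder, out_path):
    """Merge every PDF in folder into out_path (assumes PyPDF2 3.x)."""
    # Imported here so collect_pdfs stays usable without PyPDF2 installed.
    from PyPDF2 import PdfMerger

    merger = PdfMerger()
    for pdf in collect_pdfs(folder):
        merger.append(str(pdf))
    with open(out_path, "wb") as f:
        merger.write(f)
    merger.close()
```

Sorting by file name makes the page order predictable. Writing the output outside the input folder avoids picking up an earlier merged file on the next run.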

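How to remove a watermark from a PDF

Several Python libraries can remove watermarks from a PDF file, although how well this works depends on how the watermark is stored in the file. Here are two approaches, using PyPDF2 and PyMuPDF.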

# Solution 1: drop page annotations with PyPDF2 (3.x API).
# This only helps when the watermark is stored as an annotation, and it
# also removes every other annotation (links, comments) on the page.
import PyPDF2

# Create a PdfReader object from the input file
pdf_reader = PyPDF2.PdfReader('filename.pdf')

# Create a PdfWriter object
pdf_writer = PyPDF2.PdfWriter()

# Iterate over the pages in the PDF file
for page in pdf_reader.pages:
    # Remove the page's annotation list, if it has one
    if '/Annots' in page:
        del page['/Annots']
    # Add the page to the PdfWriter object
    pdf_writer.add_page(page)

# Save the PDF with the annotations removed
with open('filename_nw.pdf', 'wb') as f:
    pdf_writer.write(f)

# Solution 2: delete watermark annotations with PyMuPDF.
# Watermarks drawn into the page content itself are not affected.
import fitz  # PyMuPDF

# Open the PDF file
pdf = fitz.open('filename.pdf')

# Iterate over the pages in the PDF file
for page in pdf:
    # Collect the xrefs first; deleting inside the annots() generator is unsafe
    xrefs = [a.xref for a in page.annots(types=[fitz.PDF_ANNOT_WATERMARK])]
    for xref in xrefs:
        # Reload each annotation by xref and remove it
        page.delete_annot(page.load_annot(xref))

# Save the PDF with the watermark annotations removed
pdf.save('filename_nw.pdf')

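With these approaches, you can use Python with the PyPDF2 and PyMuPDF libraries to remove watermarks from PDF files.

How to convert HTML to PDF

Converting HTML to PDF is a common task in web development. Python offers several libraries for it. Here are two examples using popular Python libraries:

Using the `pdfkit` library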

import pdfkit

pdfkit.from_file('path/to/file.html', 'path/to/output.pdf')

Using the `weasyprint` library

from weasyprint import HTML

HTML('path/to/file.html').write_pdf('path/to/output.pdf')

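Both libraries convert HTML to PDF in a few lines of code, which makes them easy to add to any Python project. Remember to install the required libraries with pip before running the examples.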
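Note that pip alone is not enough for pdfkit: it is a wrapper around the separate wkhtmltopdf command-line tool, which has to be installed as well. The sketch below checks for that tool before converting. The helper names are hypothetical, and looking the executable up on PATH is an assumption; pdfkit can also be pointed at an explicit binary location through its configuration object.

```python
import shutil


def executable_on_path(name):
    """Return True if an executable called name is found on PATH."""
    return shutil.which(name) is not None


def html_file_to_pdf(html_path, pdf_path):
    """Convert an HTML file with pdfkit, failing early if wkhtmltopdf is missing."""
    if not executable_on_path("wkhtmltopdf"):
        raise RuntimeError("wkhtmltopdf was not found on PATH")
    # Imported here so the check above works without pdfkit installed.
    import pdfkit

    pdfkit.from_file(html_path, pdf_path)
```

Failing early with a clear message tends to be easier to diagnose than the error pdfkit raises when it cannot locate the executable.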

Contribute with us!

Feel free to contribute to the Python tutorials on GitHub: create a fork, update the content, and open a pull request.

Aliaksandr Sumich, author

Python engineer and specialist in third-party web service integration.

Evgeniy Melnikov, contributor

Updated: 05/03/2024 - 21:52
