
The Python Developer's Toolkit for PDF Processing

July 17, 2024

PDF (Portable Document Format) files are a ubiquitous part of our digital lives, from eBooks and research papers to invoices and contracts. For developers, automating PDF processing can save time and boost productivity.


🔥Fun Fact: While PDFs may appear to contain well-structured text, they do not inherently include paragraphs, sentences, or even words. Instead, a PDF file is only aware of individual characters and their placement on the page.🔥


This characteristic makes extracting meaningful text from PDFs challenging. The characters forming a paragraph are indistinguishable from those in tables, footers, or figure descriptions. Unlike formats such as .txt files or Word documents, PDFs do not contain a continuous stream of text.
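
To make this concrete, here is a minimal sketch of the guesswork an extraction library has to do: it receives characters that carry only page coordinates and must decide where lines and words begin. The glyph tuples, the `group_glyphs` name, and the tolerance values are illustrative assumptions, not the data model of any particular library.

```python
# Hypothetical input: (character, x, y) tuples, roughly how a PDF positions text.
# Real libraries also track font size, glyph width, rotation, and more.
GLYPHS = [
    ("H", 10, 700), ("i", 16, 700),
    ("P", 30, 700), ("D", 36, 700), ("F", 42, 700),
    ("o", 10, 680), ("k", 16, 680),
]

def group_glyphs(glyphs, line_tol=2, gap=8):
    """Rebuild lines and words from positioned characters (toy heuristic)."""
    # PDF y coordinates grow upward, so sort by descending y, then by x.
    ordered = sorted(glyphs, key=lambda g: (-g[2], g[1]))
    lines = []
    prev_x = prev_y = None
    for char, x, y in ordered:
        if prev_y is None or abs(y - prev_y) > line_tol:
            lines.append(char)            # vertical jump: start a new line
        elif x - prev_x > gap:
            lines[-1] += " " + char       # wide horizontal gap: start a new word
        else:
            lines[-1] += char             # same word continues
        prev_x, prev_y = x, y
    return lines

print(group_glyphs(GLYPHS))
```

Nothing in the input says "this is a word"; the spaces and line breaks are inferred from gaps. That is why different libraries can return slightly different text for the same file.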


A PDF document is composed of a collection of objects that collectively describe the appearance of one or more pages. These may include interactive elements and higher-level application data. The file itself contains these objects along with associated structural information, all encapsulated in a single self-contained sequence of bytes.
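
The object structure described above can be seen by assembling a tiny PDF by hand. This is a rough sketch using only the standard library; the `build_minimal_pdf` name is made up for illustration, and the file it produces is a single blank page with no fonts or content, so it is a teaching aid rather than a template for real documents.

```python
def build_minimal_pdf():
    """Assemble a one-page blank PDF to show objects, xref table, and trailer."""
    # Three objects: the catalog (root), the page tree, and one page.
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] /Resources << >> >>",
    ]
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of this object, recorded for xref
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    # The cross-reference table lets a reader jump straight to any object.
    xref_start = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"           # entry 0 is always the free-list head
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset   # each entry is exactly 20 bytes

    # The trailer names the root object and where the xref table starts.
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_start
    return bytes(out), offsets

pdf_bytes, object_offsets = build_minimal_pdf()
with open("minimal.pdf", "wb") as f:
    f.write(pdf_bytes)
```

Opening `minimal.pdf` in a text editor shows the numbered objects referring to one another (`2 0 R`, `3 0 R`), which is the structure the libraries in this post parse for you.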


In this guide, we’ll explore how to process PDFs in Python using several libraries. We’ll cover reading PDFs, extracting text, metadata, and images, and creating, merging, splitting, and rotating PDFs.



Prerequisites

Before diving into the code, ensure you have the following:

  • Python installed on your system
  • Basic understanding of Python programming
  • Required libraries: PyPDF2, pdfminer.six, ReportLab, and PyMuPDF (fitz)

You can install these libraries using pip:

pip install PyPDF2 pdfminer.six reportlab PyMuPDF

Reading PDFs with PyPDF2

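PyPDF2 is a pure-Python library for splitting, merging, cropping, and transforming the pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files. Note that development of PyPDF2 has since moved to the pypdf package; the examples in this post are written against the PyPDF2 API (PdfReader, PdfWriter, PdfMerger).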

Code Example

Here we are reading a PDF and extracting text from it:

import PyPDF2

def extract_text_from_pdf(file_path):
    # PDFs are binary files, so open in 'rb' mode
    with open(file_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        # Fall back to '' for pages that yield no text (e.g. scanned images)
        return ''.join(page.extract_text() or '' for page in reader.pages)


# Usage
file_path = 'sample.pdf'
print(extract_text_from_pdf(file_path))


Extracting Text and Metadata with pdfminer.six

pdfminer.six is a tool for extracting information from PDF documents, focusing on getting and analyzing the text data.


Code Example

Here’s how to extract text and metadata from a PDF:

from pdfminer.high_level import extract_text
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument

def extract_text_with_pdfminer(file_path):
    return extract_text(file_path)

def extract_metadata(file_path):
    with open(file_path, 'rb') as file:
        parser = PDFParser(file)
        doc = PDFDocument(parser)
        # doc.info is a list of info dictionaries and may be empty
        metadata = doc.info[0] if doc.info else {}
    return metadata


# Usage
file_path = 'sample.pdf'
print(extract_text_with_pdfminer(file_path))
print(extract_metadata(file_path))


Creating and Modifying PDFs with ReportLab

ReportLab is a robust library for creating PDFs from scratch, allowing for the addition of various elements like text, images, and graphics.

Code Example

To create a simple PDF:

from reportlab.lib.pagesizes import letter

from reportlab.pdfgen import canvas

def create_pdf(file_path):
    c = canvas.Canvas(file_path, pagesize=letter)
    c.drawString(100, 750, "Hello from Encord!")
    c.save()


# Usage
create_pdf("test.pdf")

To modify an existing PDF, you can use PyPDF2 in conjunction with ReportLab.

Manipulating PDFs with PyPDF2

Code Example for Merging PDFs

from PyPDF2 import PdfMerger

def merge_pdfs(pdf_list, output_path):
    merger = PdfMerger()
    for pdf in pdf_list:
        merger.append(pdf)
    merger.write(output_path)
    merger.close()


# Usage
pdf_list = ['file1.pdf', 'file2.pdf']
merge_pdfs(pdf_list, 'merged.pdf')



Code Example for Splitting PDFs

from PyPDF2 import PdfReader, PdfWriter

def split_pdf(input_path, start_page, end_page, output_path):
    # Page indexes are zero-based; end_page is exclusive, as in range()
    reader = PdfReader(input_path)
    writer = PdfWriter()
    for page_num in range(start_page, end_page):
        writer.add_page(reader.pages[page_num])
    with open(output_path, 'wb') as output_pdf:
        writer.write(output_pdf)


# Usage: copies the first two pages (indexes 0 and 1)
split_pdf('merged.pdf', 0, 2, 'split_output.pdf')


Code Example for Rotating Pages

from PyPDF2 import PdfReader, PdfWriter

def rotate_pdf(input_path, output_path, rotation_degrees=90):
    reader = PdfReader(input_path)
    writer = PdfWriter()

    for page_num in range(len(reader.pages)):
        page = reader.pages[page_num]
        page.rotate(rotation_degrees)
        writer.add_page(page)

    with open(output_path, 'wb') as output_pdf:
        writer.write(output_pdf)


# Usage
input_path = 'input.pdf'
output_path = 'rotated_output.pdf'
rotate_pdf(input_path, output_path, 90)



Extracting Images from PDFs using PyMuPDF (fitz)

PyMuPDF (imported as fitz) supports more advanced operations, such as extracting the images embedded in a PDF.

Code Example

Here is how to extract images from PDFs:

import fitz

def extract_images(file_path):
    pdf_document = fitz.open(file_path)
    for page_num in range(len(pdf_document)):
        page = pdf_document.load_page(page_num)
        images = page.get_images(full=True)
        for image_index, img in enumerate(images):
            xref = img[0]
            base_image = pdf_document.extract_image(xref)
            image_bytes = base_image["image"]
            image_ext = base_image["ext"]
            # Name each file by page number and image index, e.g. image1_0.png
            with open(f"image{page_num+1}_{image_index}.{image_ext}", "wb") as image_file:
                image_file.write(image_bytes)


# Usage
extract_images('sample.pdf')


If you're extracting images from PDFs to build a dataset for your computer vision model, be sure to explore Encord, a comprehensive data development platform designed for computer vision and multimodal AI teams.

Conclusion

Python provides a powerful toolkit for PDF processing, enabling developers to perform a wide range of tasks from basic text extraction to complex document manipulation. Libraries like PyPDF2, pdfminer.six, and PyMuPDF offer complementary features that cover most PDF processing needs.


When choosing a library, consider the specific requirements of your project. PyPDF2 is great for basic operations, pdfminer.six excels at text extraction, and PyMuPDF offers a comprehensive set of features including image extraction and table detection.


As you get deeper into PDF processing with Python, explore the official documentation of these libraries for more advanced features and optimizations (I have linked them in this blog!). Remember to handle exceptions and edge cases, especially when dealing with large or complex PDF files.
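
As a rough sketch of that last point, the helper below runs any extraction function over a batch of files and records failures instead of stopping at the first bad file. The `process_many` name and the `extractor` callable are assumptions for illustration; you could pass in `extract_text_from_pdf` from earlier in this post.

```python
import logging
from pathlib import Path

logger = logging.getLogger(__name__)

def process_many(paths, extractor):
    """Run `extractor` on each path; return (results, failures) dictionaries."""
    results, failures = {}, {}
    for path in map(Path, paths):
        try:
            if not path.is_file():
                raise FileNotFoundError(path)
            results[str(path)] = extractor(path)
        except Exception as exc:
            # Broad on purpose: each PDF library raises its own error types
            # for encrypted, truncated, or malformed files.
            logger.warning("Skipping %s: %s", path, exc)
            failures[str(path)] = exc
    return results, failures
```

Keeping the failures alongside the results makes it easy to retry problem files with a different library, or to route scanned documents to OCR.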

Written by

Akruti Acharya

Frequently asked questions
  • For basic text extraction, PyPDF2 is sufficient. However, for complex layouts or more accurate extraction, pdfminer.six is recommended. PyMuPDF (fitz) is also a powerful option that balances speed and accuracy. Choose based on your specific needs and the complexity of your PDFs.

  • PyMuPDF (fitz) is the most efficient library for extracting images from PDFs. It can handle various image formats and maintains good quality.

  • Yes, you can fill PDF forms programmatically using PyPDF2. The library allows you to read form fields, update their values, and save the modified PDF.

  • Merging PDFs is straightforward with PyPDF2. The PdfMerger class allows you to append multiple PDF files and write them to a new file.

  • Absolutely. While Python itself doesn't have built-in OCR capabilities, you can use the OCRmyPDF library, which integrates with Tesseract OCR, or use pytesseract. OCRmyPDF can convert scanned PDFs into searchable PDFs, making text extraction possible from image-based documents.

  • Yes, Python can extract tables from PDFs using libraries like PyMuPDF (fitz). It provides methods to detect and extract tabular data. While not perfect for all table formats, it works well for many structured tables in PDFs.
