Which Python library is best for extracting text from PDFs?

For basic text extraction, PyPDF2 is sufficient. However, for complex layouts or more accurate extraction, pdfminer.six is recommended. PyMuPDF (fitz) is also a powerful option that balances speed and accuracy. Choose based on your specific needs and the complexity of your PDFs.

How can I extract images from a PDF using Python?

PyMuPDF (fitz) is the most efficient library for extracting images from PDFs. It can handle various image formats and maintains good quality.

Is it possible to fill out PDF forms programmatically with Python?

Yes, you can fill PDF forms programmatically using PyPDF2. The library allows you to read form fields, update their values, and save the modified PDF.

How do I merge multiple PDFs into a single document using Python?

Merging PDFs is straightforward with PyPDF2. The PdfMerger class allows you to append multiple PDF files and write them to a new file.

Can Python be used to perform OCR on scanned PDFs?

Absolutely. Python has no built-in OCR capabilities, but you can use pytesseract or the OCRmyPDF library, both of which build on Tesseract OCR. OCRmyPDF converts scanned PDFs into searchable PDFs, making text extraction possible from image-based documents.

Can Python extract tables from PDFs?

Yes, Python can extract tables from PDFs using libraries like PyMuPDF (fitz). It provides methods to detect and extract tabular data. While not perfect for all table formats, it works well for many structured tables in PDFs.

Contents

Prerequisites
Reading PDFs with PyPDF2
Extracting Text and Metadata with pdfminer.six
Creating and Modifying PDFs with ReportLab
Manipulating PDFs with PyPDF2
Extracting Images from PDFs using PyMuPDF (fitz)
Conclusion

Encord Blog

The Python Developer's Toolkit for PDF Processing

July 17, 2024

5 mins

Written by

Akruti Acharya

PDF (Portable Document Format) files are a ubiquitous part of our digital lives, from eBooks and research papers to invoices and contracts. For developers, automating PDF processing can save time and boost productivity.

🔥Fun Fact: While PDFs may appear to contain well-structured text, they do not inherently include paragraphs, sentences, or even words. Instead, a PDF file is only aware of individual characters and their placement on the page.🔥

This characteristic makes extracting meaningful text from PDFs challenging. The characters forming a paragraph are indistinguishable from those in tables, footers, or figure descriptions. Unlike formats such as .txt files or Word documents, PDFs do not contain a continuous stream of text.

A PDF document is composed of a collection of objects that collectively describe the appearance of one or more pages. These may include interactive elements and higher-level application data. The file itself contains these objects along with associated structural information, all encapsulated in a single self-contained sequence of bytes.
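
To make this object structure concrete, here is a minimal sketch that assembles a one-page PDF by hand using only the standard library. The function name build_minimal_pdf is hypothetical and the layout is the simplest one that should open in a viewer, not how any of the libraries below build files. Notice that the page text is stored as a drawing instruction with coordinates, not as a paragraph.

```python
def build_minimal_pdf(text="Hello"):
    """Assemble a one-page PDF by hand to show its object structure.

    Illustrative sketch only: `text` must be plain ASCII without
    parentheses or backslashes, because no escaping is applied.
    """
    # Content stream: select font F1 at 24pt, move to (72, 720), draw the text.
    content = f"BT /F1 24 Tf 72 720 Td ({text}) Tj ET".encode("ascii")

    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>",
        b"<< /Length %d >>\nstream\n" % len(content) + content + b"\nendstream",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    ]

    pdf = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(pdf))  # byte position of this object in the file
        pdf += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    # Cross-reference table: lets a reader jump straight to each object.
    xref_start = len(pdf)
    pdf += b"xref\n0 %d\n" % (len(objects) + 1)
    pdf += b"0000000000 65535 f \n"
    for offset in offsets:
        pdf += b"%010d 00000 n \n" % offset

    pdf += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    pdf += b"startxref\n%d\n%%%%EOF\n" % xref_start
    return bytes(pdf)


# Usage
pdf_bytes = build_minimal_pdf("Hello")
print(pdf_bytes[:8])                  # b'%PDF-1.4'
print(pdf_bytes.count(b" 0 obj\n"))   # 5
```

Writing pdf_bytes to a file in binary mode gives you a small document to experiment with in the sections that follow.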

In this guide, we’ll explore how to process PDFs in Python using several libraries. We’ll cover reading PDFs, extracting text, metadata, and images, creating new PDFs, and merging, splitting, and rotating pages.

⚙️ Want to streamline your document annotation process? Check out this list of the Top 8 Document Annotation Tools.

Prerequisites

Before diving into the code, ensure you have the following:

Python installed on your system
Basic understanding of Python programming
Required libraries: PyPDF2, pdfminer.six, ReportLab, and PyMuPDF (fitz)

You can install these libraries using pip:

pip install PyPDF2 pdfminer.six reportlab PyMuPDF

Reading PDFs with PyPDF2

PyPDF2 is a pure-Python library for splitting, merging, cropping, and transforming the pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files. Note that PyPDF2 is no longer actively developed: the project continues under the name pypdf, which keeps a very similar API. For organizations looking to tailor these functionalities further or integrate them into larger applications, custom Python development services can extend libraries like PyPDF2 to fit specific business needs.

Code Example

Here we are reading a PDF and extracting text from it:

import PyPDF2

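def extract_text_from_pdf(file_path):
    """Return the text of every page, joined with newlines."""
    with open(file_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        # extract_text() can come back empty for scanned, image-only pages
        page_texts = [page.extract_text() or '' for page in reader.pages]
    return '\n'.join(page_texts)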


# Usage
file_path = 'sample.pdf'
print(extract_text_from_pdf(file_path))

Extracting Text and Metadata with pdfminer.six

pdfminer.six is a library for extracting information from PDF documents, with a focus on extracting and analyzing text.

Code Example

Here’s how to extract text and metadata from a PDF:

from pdfminer.high_level import extract_text
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument

def extract_text_with_pdfminer(file_path):
    return extract_text(file_path)

def extract_metadata(file_path):
    with open(file_path, 'rb') as file:
        parser = PDFParser(file)
        doc = PDFDocument(parser)
        # doc.info is a list of info dictionaries and is empty
        # when the PDF carries no metadata
        metadata = doc.info[0] if doc.info else {}
    return metadata


# Usage
file_path = 'sample.pdf'
print(extract_text_with_pdfminer(file_path))
print(extract_metadata(file_path))
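
The metadata dictionary returned by pdfminer.six typically holds raw byte strings rather than Python str values. Below is a minimal sketch for making it readable. The helper names decode_pdf_string and readable_metadata are hypothetical, and using Latin-1 as a stand-in for PDFDocEncoding is an assumption that holds for plain ASCII values but not for every special character.

```python
def decode_pdf_string(value):
    """Best-effort conversion of a raw PDF metadata value to str."""
    if isinstance(value, bytes):
        # Strings that start with a byte-order mark are UTF-16 (big-endian).
        if value.startswith(b"\xfe\xff"):
            return value[2:].decode("utf-16-be", errors="replace")
        # Assumption: treat everything else as Latin-1, which approximates
        # PDFDocEncoding for common characters.
        return value.decode("latin-1")
    # Other value types (numbers, object references) fall back to str().
    return str(value)


def readable_metadata(raw_metadata):
    """Decode every value of a metadata dictionary."""
    return {key: decode_pdf_string(value) for key, value in raw_metadata.items()}


# Usage with values shaped like the ones pdfminer.six returns
sample = {"Title": b"\xfe\xff\x00H\x00i", "Producer": b"ReportLab"}
print(readable_metadata(sample))  # {'Title': 'Hi', 'Producer': 'ReportLab'}
```

You can pass the result of extract_metadata(file_path) straight into readable_metadata.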

Creating and Modifying PDFs with ReportLab

ReportLab is a robust library for creating PDFs from scratch, allowing for the addition of various elements like text, images, and graphics.

Code Example

To create a simple PDF:

from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

def create_pdf(file_path):
    c = canvas.Canvas(file_path, pagesize=letter)
    c.drawString(100, 750, "Hello from Encord!")
    c.save()


# Usage
create_pdf("test.pdf")

To modify an existing PDF, you can use PyPDF2 in conjunction with ReportLab.

Manipulating PDFs with PyPDF2

Code Example for Merging PDFs

from PyPDF2 import PdfMerger

def merge_pdfs(pdf_list, output_path):
    merger = PdfMerger()
    for pdf in pdf_list:
        merger.append(pdf)
    merger.write(output_path)
    merger.close()


# Usage
pdf_list = ['file1.pdf', 'file2.pdf']
merge_pdfs(pdf_list, 'merged.pdf')

Code Example for Splitting PDFs

from PyPDF2 import PdfReader, PdfWriter

def split_pdf(input_path, start_page, end_page, output_path):
    # Pages are zero-based and end_page is exclusive,
    # so (0, 2) copies the first two pages
    reader = PdfReader(input_path)
    writer = PdfWriter()
    for page_num in range(start_page, end_page):
        writer.add_page(reader.pages[page_num])
    with open(output_path, 'wb') as output_pdf:
        writer.write(output_pdf)


# Usage
split_pdf('merged.pdf', 0, 2, 'split_output.pdf')
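
If you want to break a long document into fixed-size parts, you first need the page ranges. Here is a small sketch of that step. The helper name page_chunks is hypothetical, and it returns zero-based (start, end) pairs with an exclusive end, matching the arguments split_pdf expects.

```python
def page_chunks(total_pages, chunk_size):
    """Return (start, end) page ranges that cover total_pages in order.

    Ranges are zero-based with an exclusive end; the last range may be
    shorter than chunk_size.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        (start, min(start + chunk_size, total_pages))
        for start in range(0, total_pages, chunk_size)
    ]


# Usage
print(page_chunks(5, 2))  # [(0, 2), (2, 4), (4, 5)]
```

Each pair can then be passed to split_pdf as start_page and end_page, with an output file name of your choice for every part.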

Code Example for Rotating Pages

from PyPDF2 import PdfReader, PdfWriter

def rotate_pdf(input_path, output_path, rotation_degrees=90):
    # rotation_degrees must be a multiple of 90; the rotation is clockwise
    reader = PdfReader(input_path)
    writer = PdfWriter()

    for page in reader.pages:
        page.rotate(rotation_degrees)
        writer.add_page(page)

    with open(output_path, 'wb') as output_pdf:
        writer.write(output_pdf)


# Usage
input_path = 'input.pdf'
output_path = 'rotated_output.pdf'
rotate_pdf(input_path, output_path, 90)

Extracting Images from PDFs using PyMuPDF (fitz)

PyMuPDF (imported as fitz) allows for advanced operations like extracting images from PDFs.

Code Example

Here is how to extract images from PDFs:

import fitz

def extract_images(file_path):
    pdf_document = fitz.open(file_path)
    for page_num in range(len(pdf_document)):
        page = pdf_document.load_page(page_num)
        images = page.get_images(full=True)
        for image_index, img in enumerate(images):
            xref = img[0]  # cross-reference number of the image object
            base_image = pdf_document.extract_image(xref)
            image_bytes = base_image["image"]
            image_ext = base_image["ext"]
            with open(f"image{page_num+1}_{image_index}.{image_ext}", "wb") as image_file:
                image_file.write(image_bytes)
    pdf_document.close()


# Usage
extract_images('sample.pdf')

If you're extracting images from PDFs to build a dataset for your computer vision model, be sure to explore Encord's Document Annotation Tool.

Conclusion

Python provides a powerful toolkit for PDF processing, enabling developers to perform a wide range of tasks from basic text extraction to complex document manipulation. Libraries like PyPDF2, pdfminer.six, and PyMuPDF offer complementary features that cover most PDF processing needs.

When choosing a library, consider the specific requirements of your project. PyPDF2 is great for basic operations, pdfminer.six excels at text extraction, and PyMuPDF offers a comprehensive set of features including image extraction and table detection.

As you get deeper into PDF processing with Python, explore the official documentation of these libraries for more advanced features and optimizations (I have linked them in this blog!). Remember to handle exceptions and edge cases, especially when dealing with large or complex PDF files.
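
As a starting point for that kind of error handling, here is a minimal sketch of a wrapper that keeps a batch job running when one file is missing or malformed. The name safe_extract is hypothetical, and catching every Exception is a deliberate simplification, since each library raises its own error types. In production code you would likely narrow it.

```python
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def safe_extract(file_path, extractor):
    """Run extractor(file_path) and return None instead of raising.

    `extractor` is any callable that takes a path string, for example
    one of the text-extraction functions shown earlier.
    """
    path = Path(file_path)
    if not path.is_file():
        logger.warning("Skipping %s: file not found", path)
        return None
    try:
        return extractor(str(path))
    except Exception as exc:  # assumption: broad catch, narrow it as needed
        logger.warning("Could not process %s: %s", path, exc)
        return None


# Usage:
# text = safe_extract('sample.pdf', extract_text_from_pdf)
# if text is None:
#     print('Extraction failed, see the log for details')
```

Returning None rather than an empty string lets the caller tell a failed file apart from a PDF that simply contains no text.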
