Encord Blog
The Python Developer's Toolkit for PDF Processing
PDFs (Portable Document Format) are a ubiquitous part of our digital lives, from eBooks and research papers to invoices and contracts. For developers, automating PDF processing can save time and boost productivity.
🔥Fun Fact: While PDFs may appear to contain well-structured text, they do not inherently include paragraphs, sentences, or even words. Instead, a PDF file is only aware of individual characters and their placement on the page.🔥
This characteristic makes extracting meaningful text from PDFs challenging. The characters forming a paragraph are indistinguishable from those in tables, footers, or figure descriptions. Unlike formats such as .txt files or Word documents, PDFs do not contain a continuous stream of text.
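To make this concrete, here is a minimal sketch of what a text extractor has to do. It assumes glyphs arrive as hypothetical (character, x, y) tuples with a fixed character width; real PDFs use varying fonts and spacing, so treat the `rebuild_lines` helper and its thresholds as illustrative only.

```python
from collections import defaultdict

# Hypothetical glyph records: (character, x, y) in page coordinates.
# Glyphs are deliberately out of order, since drawing order need not be reading order.
glyphs = [
    ("l", 30, 700), ("H", 10, 700), ("e", 20, 700), ("l", 40, 700), ("o", 50, 700),
    ("P", 10, 680), ("D", 20, 680), ("F", 30, 680),
    ("w", 80, 700), ("o", 90, 700), ("r", 100, 700), ("l", 110, 700), ("d", 120, 700),
]

def rebuild_lines(glyphs, char_width=10, gap_factor=1.5):
    """Group glyphs into lines by y, then insert a space at each large x gap."""
    rows = defaultdict(list)
    for char, x, y in glyphs:
        rows[y].append((x, char))

    lines = []
    # PDF y coordinates grow upward, so a higher y means an earlier line.
    for y in sorted(rows, reverse=True):
        text = ""
        previous_x = None
        for x, char in sorted(rows[y]):
            if previous_x is not None and x - previous_x > char_width * gap_factor:
                text += " "  # gap wider than a normal advance: assume a word break
            text += char
            previous_x = x
        lines.append(text)
    return lines

print(rebuild_lines(glyphs))  # ['Hello world', 'PDF']
```

Even this toy version needs two heuristics (same-y grouping and a gap threshold), which hints at why tables, columns, and footers are hard to separate reliably.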
A PDF document is composed of a collection of objects that collectively describe the appearance of one or more pages. These may include interactive elements and higher-level application data. The file itself contains these objects along with associated structural information, all encapsulated in a single self-contained sequence of bytes.
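The object structure is easier to picture with a small example. The sketch below scans a hand-written, simplified fragment in PDF object syntax and lists the objects it finds. The fragment is an assumption made for illustration and is not a complete, valid PDF, and a regular expression is not a real parser (actual files may compress objects into streams), so use one of the libraries below for real documents.

```python
import re

# A hand-written, simplified fragment in PDF object syntax.
# It is NOT a complete, valid PDF (no cross-reference table or trailer).
fragment = b"""%PDF-1.4
1 0 obj
<< /Type /Catalog /Pages 2 0 R >>
endobj
2 0 obj
<< /Type /Pages /Kids [3 0 R] /Count 1 >>
endobj
3 0 obj
<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] >>
endobj
"""

def list_objects(data):
    """Return (object_number, type_name) for each 'N G obj ... endobj' block."""
    pattern = re.compile(rb"(\d+) (\d+) obj\s*(.*?)\s*endobj", re.DOTALL)
    found = []
    for number, _generation, body in pattern.findall(data):
        # Each dictionary names its role with a /Type entry.
        type_match = re.search(rb"/Type\s*/(\w+)", body)
        type_name = type_match.group(1).decode() if type_match else None
        found.append((int(number), type_name))
    return found

print(list_objects(fragment))  # [(1, 'Catalog'), (2, 'Pages'), (3, 'Page')]
```

The references such as `2 0 R` are how objects point at each other: the catalog points to the page tree, which in turn points to each page.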
In this guide, we’ll explore how to process PDFs in Python using several libraries. We’ll cover reading PDFs, extracting text, metadata, and images, creating new PDFs, and merging, splitting, and rotating pages.
Prerequisites
Before diving into the code, ensure you have the following:
- Python installed on your system
- Basic understanding of Python programming
- Required libraries: PyPDF2, pdfminer.six, ReportLab, and PyMuPDF (fitz)
You can install these libraries using pip:
pip install PyPDF2 pdfminer.six reportlab PyMuPDF
Reading PDFs with PyPDF2
PyPDF2 is a pure-Python library for splitting, merging, cropping, and transforming the pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files. Teams with more specialized needs can build on PyPDF2 and integrate it into larger applications.
Code Example
Here we are reading a PDF and extracting text from it:
import PyPDF2

def extract_text_from_pdf(file_path):
    with open(file_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        text = ''
        for page_num in range(len(reader.pages)):
            text += reader.pages[page_num].extract_text()
    return text

# Usage
file_path = 'sample.pdf'
print(extract_text_from_pdf(file_path))
Extracting Text and Metadata with pdfminer.six
pdfminer.six is a tool for extracting information from PDF documents, focusing on getting and analyzing the text data.
Code Example
Here’s how to extract text and metadata from a PDF:
from pdfminer.high_level import extract_text
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument

def extract_text_with_pdfminer(file_path):
    return extract_text(file_path)

def extract_metadata(file_path):
    with open(file_path, 'rb') as file:
        parser = PDFParser(file)
        doc = PDFDocument(parser)
        # doc.info is a list of metadata dictionaries; it is empty when
        # the PDF has no document information dictionary.
        metadata = doc.info[0] if doc.info else {}
    return metadata

# Usage
file_path = 'sample.pdf'
print(extract_text_with_pdfminer(file_path))
print(extract_metadata(file_path))
Creating and Modifying PDFs with ReportLab
ReportLab is a robust library for creating PDFs from scratch, allowing for the addition of various elements like text, images, and graphics.
Code Example
To create a simple PDF:
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

def create_pdf(file_path):
    c = canvas.Canvas(file_path, pagesize=letter)
    c.drawString(100, 750, "Hello from Encord!")
    c.save()

# Usage
create_pdf("test.pdf")
To modify an existing PDF, you can use PyPDF2 in conjunction with ReportLab.
Manipulating PDFs with PyPDF2
Code Example for Merging PDFs
from PyPDF2 import PdfMerger

def merge_pdfs(pdf_list, output_path):
    merger = PdfMerger()
    for pdf in pdf_list:
        merger.append(pdf)
    merger.write(output_path)
    merger.close()

# Usage
pdf_list = ['file1.pdf', 'file2.pdf']
merge_pdfs(pdf_list, 'merged.pdf')
Code Example for Splitting PDFs
from PyPDF2 import PdfReader, PdfWriter

def split_pdf(input_path, start_page, end_page, output_path):
    reader = PdfReader(input_path)
    writer = PdfWriter()
    for page_num in range(start_page, end_page):
        writer.add_page(reader.pages[page_num])
    with open(output_path, 'wb') as output_pdf:
        writer.write(output_pdf)

# Usage
split_pdf('merged.pdf', 0, 2, 'split_output.pdf')
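Note that split_pdf takes a zero-based start page and an exclusive end page, so (0, 2) selects the first two pages. If you want to cut a document into equal-sized parts, a small helper can compute those pairs. The `chunk_ranges` name below is hypothetical, and the sketch uses only the standard library.

```python
def chunk_ranges(total_pages, chunk_size):
    """Yield (start_page, end_page) pairs, zero-based with an exclusive end."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, total_pages, chunk_size):
        # The last chunk is shorter when total_pages is not a multiple.
        yield start, min(start + chunk_size, total_pages)

print(list(chunk_ranges(7, 3)))  # [(0, 3), (3, 6), (6, 7)]
```

Each pair can be passed straight to split_pdf as start_page and end_page, with the total page count taken from len(reader.pages).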
Code Example for Rotating Pages
from PyPDF2 import PdfReader, PdfWriter

def rotate_pdf(input_path, output_path, rotation_degrees=90):
    reader = PdfReader(input_path)
    writer = PdfWriter()
    for page_num in range(len(reader.pages)):
        page = reader.pages[page_num]
        page.rotate(rotation_degrees)
        writer.add_page(page)
    with open(output_path, 'wb') as output_pdf:
        writer.write(output_pdf)

# Usage
input_path = 'input.pdf'
output_path = 'rotated_output.pdf'
rotate_pdf(input_path, output_path, 90)
Extracting Images from PDFs using PyMuPDF (fitz)
PyMuPDF (imported as fitz) supports advanced operations such as extracting images from PDFs.
Code Example
Here is how to extract images from PDFs:
import fitz

def extract_images(file_path):
    pdf_document = fitz.open(file_path)
    for page_num in range(len(pdf_document)):
        page = pdf_document.load_page(page_num)
        images = page.get_images(full=True)
        for image_index, img in enumerate(images):
            xref = img[0]
            base_image = pdf_document.extract_image(xref)
            image_bytes = base_image["image"]
            image_ext = base_image["ext"]
            with open(f"image{page_num+1}_{image_index}.{image_ext}", "wb") as image_file:
                image_file.write(image_bytes)

# Usage
extract_images('sample.pdf')
Conclusion
Python provides a powerful toolkit for PDF processing, enabling developers to perform a wide range of tasks from basic text extraction to complex document manipulation. Libraries like PyPDF2, pdfminer.six, and PyMuPDF offer complementary features that cover most PDF processing needs.
When choosing a library, consider the specific requirements of your project. PyPDF2 is great for basic operations, pdfminer.six excels at text extraction, and PyMuPDF offers a comprehensive set of features including image extraction and table detection.
As you get deeper into PDF processing with Python, explore the official documentation of these libraries for more advanced features and optimizations (I have linked them in this blog!). Remember to handle exceptions and edge cases, especially when dealing with large or complex PDF files.
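As a starting point for that kind of defensive handling, here is a minimal sketch of a wrapper that checks the file first and turns any parsing failure into a logged warning. The `safe_process` name and the choice to return None on failure are assumptions made for illustration; in a real pipeline you may prefer to catch library-specific exceptions or re-raise them.

```python
import logging
from pathlib import Path

logger = logging.getLogger(__name__)

def safe_process(file_path, processor):
    """Run processor(file_path); return None instead of raising on failure."""
    path = Path(file_path)
    if not path.is_file():
        logger.warning("Skipping %s: file not found", path)
        return None
    if path.stat().st_size == 0:
        logger.warning("Skipping %s: file is empty", path)
        return None
    try:
        return processor(str(path))
    except Exception as error:  # PDF libraries raise many different error types
        logger.warning("Could not process %s: %s", path, error)
        return None
```

Any of the functions from this guide can be passed in as the processor, for example safe_process('sample.pdf', extract_text_from_pdf), so a single corrupt file does not stop a batch job.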
Written by
Akruti Acharya
- For basic text extraction, PyPDF2 is sufficient. For complex layouts or more accurate extraction, pdfminer.six is recommended. PyMuPDF (fitz) is also a powerful option that balances speed and accuracy. Choose based on your specific needs and the complexity of your PDFs.
- For extracting images from PDFs, PyMuPDF (fitz) is an efficient choice. It can handle various image formats and maintains good quality.
- PDF forms can be filled programmatically using PyPDF2. The library allows you to read form fields, update their values, and save the modified PDF.
- Merging PDFs is straightforward with PyPDF2. The PdfMerger class allows you to append multiple PDF files and write them to a new file.
- Scanned PDFs can be processed with OCR. While Python itself doesn't have built-in OCR capabilities, you can use the OCRmyPDF library, which integrates with Tesseract OCR, or use pytesseract. OCRmyPDF can convert scanned PDFs into searchable PDFs, making text extraction possible from image-based documents.
- Tables can be extracted from PDFs using libraries like PyMuPDF (fitz), which provides methods to detect and extract tabular data. While not perfect for all table formats, it works well for many structured tables in PDFs.