Feb 3 · 2 min read

Day 33 - Working with PDFs using PyPDF2

Hello and welcome back to our #PythonForDevOps series. Today, on Day 33, we're going to work with PDFs using PyPDF2.

PDFs are everywhere, and as developers, we often find ourselves needing to extract information from or manipulate these files. That's where PyPDF2 comes in handy – a versatile Python library that makes working with PDFs a breeze.

Getting Started with PyPDF2

First things first, let's get PyPDF2 installed. If you haven't already, fire up your terminal and run:

pip install PyPDF2

Once that's done, we can start exploring the wonders of PyPDF2.

Reading PDFs

The first step in our PDF journey is reading the content of a PDF file. PyPDF2 makes this task surprisingly simple. Consider the following example:

import PyPDF2

# Open the PDF file in binary mode
with open('example.pdf', 'rb') as file:
    # Create a PDF reader object (PdfReader is the PyPDF2 3.x name for PdfFileReader)
    pdf_reader = PyPDF2.PdfReader(file)

    # Get the number of pages in the PDF
    num_pages = len(pdf_reader.pages)

    # Extract text from each page
    for page_num in range(num_pages):
        page = pdf_reader.pages[page_num]
        text = page.extract_text()
        print(f"Page {page_num + 1}:\n{text}\n")

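In this snippet, we open a PDF file in binary mode, create a PDF reader object, and then loop through each page, extracting and printing the text. Simple, right? One caveat: extraction only works on PDFs that have a real text layer. Scanned pages, which are just images, will come back empty or nearly so.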
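A natural next step after printing every page is finding which pages mention a given term. The sketch below is one way to do it. The name `pages_mentioning` is hypothetical, and the function assumes a reader object that exposes `.pages` whose items have an `extract_text()` method, as the PyPDF2 3.x `PdfReader` does.

```python
def pages_mentioning(reader, keyword):
    """Return 1-based numbers of pages whose extracted text contains keyword.

    `reader` is assumed to behave like PyPDF2's PdfReader: it exposes
    `.pages`, and each page has an `extract_text()` method.
    """
    needle = keyword.lower()
    hits = []
    for number, page in enumerate(reader.pages, start=1):
        # Treat a page with no extractable text as an empty string
        text = page.extract_text() or ""
        if needle in text.lower():
            hits.append(number)
    return hits


# Usage with PyPDF2 (assumes example.pdf exists in the working directory):
#   from PyPDF2 import PdfReader
#   print(pages_mentioning(PdfReader("example.pdf"), "invoice"))
```

The match is case-insensitive and works on whatever text PyPDF2 manages to extract, so pages without a text layer will simply never match.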

Creating a New PDF

Now, let's move on to creating our own PDF. Imagine you want to merge two existing PDFs into a new file. PyPDF2 has got you covered:

import PyPDF2

def merge_pdfs(file1, file2, output_file):
    with open(file1, 'rb') as pdf1, open(file2, 'rb') as pdf2:
        # Create PDF reader objects for both files
        pdf_reader1 = PyPDF2.PdfReader(pdf1)
        pdf_reader2 = PyPDF2.PdfReader(pdf2)

        # Create a PDF writer object
        pdf_writer = PyPDF2.PdfWriter()

        # Add all pages from the first PDF
        for page in pdf_reader1.pages:
            pdf_writer.add_page(page)

        # Add all pages from the second PDF
        for page in pdf_reader2.pages:
            pdf_writer.add_page(page)

        # Write the merged PDF to a new file
        with open(output_file, 'wb') as output:
            pdf_writer.write(output)

# Usage
merge_pdfs('file1.pdf', 'file2.pdf', 'merged.pdf')

This function takes two PDF files, reads them, combines their pages, and writes the result to a new file. It's like magic, but with code!
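The same idea extends to any number of input files. Here is a minimal sketch, assuming a writer that behaves like the PyPDF2 3.x `PdfWriter` (it has `add_page`) and readers that behave like `PdfReader` (they expose `.pages`). The helper name `append_all_pages` and the file names in the usage comment are made up for illustration.

```python
def append_all_pages(writer, readers):
    """Copy every page from each reader into writer, preserving order.

    Returns the number of pages added. `writer` is assumed to behave like
    PyPDF2's PdfWriter (add_page) and each reader like PdfReader (.pages).
    """
    added = 0
    for reader in readers:
        for page in reader.pages:
            writer.add_page(page)
            added += 1
    return added


# Usage with PyPDF2 (the file names are placeholders):
#   from PyPDF2 import PdfReader, PdfWriter
#   writer = PdfWriter()
#   append_all_pages(writer, [PdfReader(p) for p in ("a.pdf", "b.pdf", "c.pdf")])
#   with open("merged.pdf", "wb") as output:
#       writer.write(output)
```

Returning the page count gives you a cheap sanity check: it should equal the sum of the page counts of the inputs.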

Rotating Pages

Ever needed to rotate a specific page in a PDF? PyPDF2 makes it a piece of cake. Check out this example:

import PyPDF2

def rotate_page(input_file, output_file, page_num, degrees):
    with open(input_file, 'rb') as file:
        pdf_reader = PyPDF2.PdfReader(file)
        pdf_writer = PyPDF2.PdfWriter()

        # Copy every page in its original order, rotating only the target page
        for i, page in enumerate(pdf_reader.pages):
            if i == page_num - 1:
                # rotate() turns the page clockwise; degrees must be a multiple of 90
                page.rotate(degrees)
            pdf_writer.add_page(page)

        # Write the rotated PDF to a new file
        with open(output_file, 'wb') as output:
            pdf_writer.write(output)

# Usage
rotate_page('example.pdf', 'rotated.pdf', 2, 90)

This call rotates the second page of example.pdf by 90 degrees clockwise and saves the result as rotated.pdf. Feel free to adjust the page_num (counted from 1) and degrees (a multiple of 90) parameters to fit your needs.
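If you need to rotate several pages at once, it helps to validate the inputs before touching anything. The following is a sketch under a few assumptions: `rotate_pages` is a hypothetical helper, the reader and writer behave like the PyPDF2 3.x `PdfReader` and `PdfWriter`, and each page has a `rotate()` method that accepts a multiple of 90.

```python
def rotate_pages(reader, writer, page_numbers, degrees):
    """Copy all pages to writer in order, rotating the 1-based page_numbers.

    Raises ValueError for angles that are not a multiple of 90 or for page
    numbers outside the document, before any page is added to the writer.
    """
    if degrees % 90 != 0:
        raise ValueError("degrees must be a multiple of 90")
    total = len(reader.pages)
    targets = set(page_numbers)
    out_of_range = sorted(n for n in targets if not 1 <= n <= total)
    if out_of_range:
        raise ValueError(f"page numbers out of range: {out_of_range}")

    for number, page in enumerate(reader.pages, start=1):
        if number in targets:
            page.rotate(degrees)  # positive angles turn the page clockwise
        writer.add_page(page)


# Usage with PyPDF2 (assumes example.pdf has at least three pages):
#   from PyPDF2 import PdfReader, PdfWriter
#   writer = PdfWriter()
#   rotate_pages(PdfReader("example.pdf"), writer, [1, 3], 180)
#   with open("rotated.pdf", "wb") as output:
#       writer.write(output)
```

Checking the page numbers up front means a typo like page 30 in a 3-page file fails loudly instead of silently producing an unrotated copy.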

And there you have it, a practical guide to working with PDFs using PyPDF2. We've covered reading PDFs, merging files into a new PDF, and rotating pages. With PyPDF2, your PDF-related tasks just got a whole lot easier.

As you continue your Python journey, keep exploring the vast landscape of libraries and tools available.

Stay tuned for more exciting adventures in our #PythonForDevOps series, and until next time, happy coding!

*** Explore | Share | Grow ***