Best Python Libraries for Working with PDFs

Introduction

Overview of a tutorial on three popular, easy-to-use Python libraries for working with PDFs: PyPDF2, pdfplumber, and PyMuPDF.
Details available at poppythonology.eu.
Weekly newsletter with Python-related articles.

PyPDF2

Versions: PyPDF2, PyPDF4.
Installation: pip install PyPDF2
Basic Usage:
- Import and create reader object
- Example: Extracting text and images from PDF
- Commands:

      from PyPDF2 import PdfReader

      reader = PdfReader('file.pdf')
      length = len(reader.pages)   # number of pages

      page = reader.pages[0]       # first page
      text = page.extract_text()

      # Extract text from every page
      for page in reader.pages:
          text = page.extract_text()
- Extract Images: Iterate over page.images and save each image.

      for image in page.images:
          with open(image.name, 'wb') as f:
              f.write(image.data)
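
The text and image steps above can be combined into one helper. This is a minimal sketch, not the tutorial's own code: the function names, the `pdf_output` directory, and the `pageN_` file-name prefix are assumptions made for illustration. It also assumes a PyPDF2 release in which `page.images` is available.

```python
from pathlib import Path


def safe_image_name(page_index: int, image_name: str) -> str:
    """Build a per-page file name such as 'page1_Im0.png' (hypothetical scheme)."""
    # Keep only the final path component so an unusual embedded name
    # cannot point outside the output directory.
    return f"page{page_index + 1}_{Path(image_name).name}"


def extract_text_and_images(pdf_path: str, out_dir: str = "pdf_output") -> list[str]:
    """Return the text of every page and save embedded images into out_dir."""
    # Imported here so the helper above can be used without PyPDF2 installed.
    from PyPDF2 import PdfReader

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    reader = PdfReader(pdf_path)
    texts = []
    for index, page in enumerate(reader.pages):
        # Fall back to an empty string if a page yields no text.
        texts.append(page.extract_text() or "")
        for image in page.images:
            (out / safe_image_name(index, image.name)).write_bytes(image.data)
    return texts
```

Collecting the text in a list avoids overwriting `text` on every loop iteration, and prefixing the page number keeps images with the same embedded name on different pages from overwriting each other.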

pdfplumber

Focus: Extracting tables
Installation: pip install pdfplumber
Basic Usage:
- Open PDF and extract tables
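- Commands:

      import pdfplumber

      with pdfplumber.open('file.pdf') as pdf:
          for page in pdf.pages:
              # extract_tables() returns every table on the page;
              # extract_table() returns only the largest one
              tables = page.extract_tables()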
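
pdfplumber returns each table as a list of rows, where every row is a list of cell strings (or None for an empty cell). The sketch below shows one possible way to turn those rows into dictionaries. It assumes the first row of each table is a header row, which will not hold for every PDF, and the function names are hypothetical.

```python
from typing import Optional


def table_to_records(table: list[list[Optional[str]]]) -> list[dict[str, str]]:
    """Turn one extracted table into dicts keyed by its first row."""
    if not table:
        return []
    # Assumption: the first row holds the column names.
    header = [(cell or "").strip() for cell in table[0]]
    records = []
    for row in table[1:]:
        # Empty cells arrive as None; normalize them to empty strings.
        cells = [(cell or "").strip() for cell in row]
        records.append(dict(zip(header, cells)))
    return records


def extract_all_tables(pdf_path: str) -> list[list[dict[str, str]]]:
    """Collect every table from every page as a list of record lists."""
    # Imported here so table_to_records() works without pdfplumber installed.
    import pdfplumber

    results = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for table in page.extract_tables():
                results.append(table_to_records(table))
    return results
```

The resulting records can be written out with the standard csv module or loaded into a DataFrame if pandas is available.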

PyMuPDF

Focus: Extracting metadata, the table of contents, and links; converting pages to images
Installation: pip install pymupdf
Basic Usage:
- Create document object and explore methods available
- Commands:

      import fitz

      doc = fitz.open('file.pdf')
      page_count = doc.page_count
      metadata = doc.metadata

      page = doc.load_page(0)
      text = page.get_text()

      pix = page.get_pixmap()             # render the page as an image
      pix.save(f'page_{page.number}.png')

      links = page.get_links()
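
The commands above cover metadata, page images, and links, but not the table of contents. The following sketch adds `doc.get_toc()`, which returns entries of the form [level, title, page], and gathers everything into one summary. The function names, the `pages` output directory, and the outline formatting are assumptions made for illustration.

```python
from pathlib import Path


def format_toc(toc: list[list]) -> list[str]:
    """Render [level, title, page] entries as indented outline lines."""
    # *_ ignores any extra fields that a detailed TOC entry may carry.
    return [
        f"{'  ' * (level - 1)}{title} (p. {page})"
        for level, title, page, *_ in toc
    ]


def summarize_pdf(pdf_path: str, out_dir: str = "pages") -> dict:
    """Collect metadata, outline, and URI links; save each page as a PNG."""
    # Imported here so format_toc() works without PyMuPDF installed.
    import fitz  # PyMuPDF

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    summary = {}
    with fitz.open(pdf_path) as doc:
        summary["page_count"] = doc.page_count
        summary["metadata"] = doc.metadata
        summary["toc"] = format_toc(doc.get_toc())

        links = []
        for page in doc:
            # Render the page and save it as page_<number>.png.
            pix = page.get_pixmap()
            pix.save(str(out / f"page_{page.number}.png"))
            # Keep only links that carry an external URI.
            links.extend(link["uri"] for link in page.get_links() if "uri" in link)
        summary["links"] = links
    return summary
```

Internal links (jumps to another page) have no "uri" key, so this sketch skips them; they could be collected separately if needed.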