StellarGlyph

PDF extraction is a data engineering problem before it is an LLM problem. If you have 2 board reports to read, sending the whole PDF to a large language model (LLM) with vision enabled is usually fine. If you instead have hundreds of documents (scanned delivery notes, supplier certificates, invoices, statements, and technical manuals), that approach becomes slow, expensive, hard to observe, and inconsistent.

The better pattern is a PDF processing pipeline: a system that reads PDFs (and likely other common document types) and exports their content into data storage, where it can be used to generate context for AI agents. A processing pipeline usually has a number of distinct stages: ingestion, processing (parsing and cleaning), and loading into a destination system for downstream consumers (reporting, human users, and agents). This is a fairly typical data engineering task.

PDF parsing can be split into two broad categories, driven mostly by how the PDF was created and the quality of the document. For documents created by software, you can often use deterministic parsers, because the document gives you reliable signals. For documents created through scans or similar methods, each page is an image, so you need OCR. Reserving LLMs for highly variable tasks (such as structure mapping or highly complex documents) gives you lower cost, better repeatability, and output structures that fit your needs.

Why PDFs Persist

PDFs persist because they solve several business problems well. They preserve page layout across machines. They are easy to email, store, print, sign, and archive. They support embedded fonts, images, metadata, permissions, forms, annotations, and digital signatures. They also sit comfortably between human workflows and regulated record-keeping.

There is even an ISO standard for this: ISO 19005-1 (PDF/A-1) specifies how to use PDF 1.4 for long-term preservation of electronic documents, which is one reason PDF remains common in legal, finance, public-sector, and compliance-heavy contexts.

PDFs are also useful because they freeze a moment in a process. A generated invoice, signed contract, certificate of analysis, or board pack needs to look the same when opened 5 years later. Editable source formats such as DOCX and XLSX can change when fonts, templates, macros, or software versions change. PDF is a stable delivery artefact.

The problem is that stable delivery artefacts are awkward data sources. A PDF page knows where glyphs and images sit. It does not always know that those glyphs form a purchase order number, that 3 adjacent text boxes are a table row, or that a header belongs to the following section.

That is where simple extraction starts to break.

Why Throwing The PDF At An LLM Stops Working

For a single document, the fastest route is to load the document into ChatGPT or Claude. Upload the file to a model with vision support, ask for the content in JSON, Markdown, or even Word format, and move on. This is good for prototypes, demos, and internal one-off analysis, but it comes with some challenges on larger documents:

Cost grows with page count, image resolution, and retry rate. A 3-page digitally generated invoice is an easier workload than a 180-page scanned contract pack. Vision calls also push more tokens and pixels through the model than plain text extraction. And if one section of the output is wrong, you have to reprocess the whole document until you get the version you need.
Latency becomes unpredictable. Large PDFs need page rendering, image transfer, model inference, and output validation. If the model times out or truncates the response, you need a retry strategy.
Failures are opaque. A model can miss a page, merge 2 tables, infer a value that is not present, or return JSON that validates syntactically while still being wrong.
Large documents hit context and output limits. Even when a model accepts the input, you still need to decide how to split the document and reconcile answers across chunks.
Sensitive documents create governance problems. Sending HR files, contracts, medical records, or customer statements to a remote model changes the risk profile.

A pipeline requires more initial work, but in the end it gives you control over cost, privacy, retries, observability, and output shape.

Start With PDF Classification

The first design decision is to classify the PDF. Most extraction mistakes come from treating every file the same way.

A digitally generated PDF usually has a text layer. The parser can read characters, coordinates, font information, and metadata. This is the easiest path, and if it works you should default to it. A scanned PDF may have no text layer at all, in which case you need OCR (Optical Character Recognition), a technology that converts images of text (such as scanned documents or photos) into machine-readable, editable text. In this case, the pipeline needs to render each page to an image, clean it, and recognise the characters through OCR.

A hybrid PDF contains both. This is common in signed documents, scanned appendices, and documents where some pages were generated and some were attached as images. The pipeline should detect text coverage per page and choose the route page by page. This is one of the most important tips: processing a document page by page means you can parallelise the task and use the best processing method for each page. Also, if 1 page fails, you do not throw away the entire document.
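As a rough illustration of page-level isolation, the sketch below runs a per-page function in a thread pool and records failures per page instead of aborting the document. The names process_pages and fake_process_page are hypothetical, and the fake function only stands in for a real parser or OCR call.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional


def process_pages(
    page_numbers: list[int],
    process_page: Callable[[int], str],
    max_workers: int = 4,
) -> tuple[dict[int, str], dict[int, str]]:
    """Run process_page on every page; return (results, errors) keyed by page."""
    results: dict[int, str] = {}
    errors: dict[int, str] = {}

    def safe(page_number: int) -> tuple[int, Optional[str], Optional[str]]:
        # Catch per-page failures so one bad page cannot abort the batch.
        try:
            return page_number, process_page(page_number), None
        except Exception as exc:  # broad on purpose: every failure is recorded
            return page_number, None, f"{type(exc).__name__}: {exc}"

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for page_number, text, error in pool.map(safe, page_numbers):
            if error is None:
                results[page_number] = text
            else:
                errors[page_number] = error
    return results, errors


def fake_process_page(page_number: int) -> str:
    """Hypothetical stand-in for a real parser or OCR call."""
    if page_number == 3:
        raise ValueError("corrupt image stream")
    return f"text of page {page_number}"


results, errors = process_pages([1, 2, 3, 4], fake_process_page)
print(sorted(results))  # pages that succeeded
print(errors)           # pages that failed, with the reason
```

Only page 3 ends up in errors, so it can be retried or routed to a different method while the other pages move on.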

Use a classifier to evaluate which type of parser should be used. For each page, count extractable characters, count images, inspect page dimensions, and sample whether text spans cover a plausible area. If a page has 20 characters and a full-page image, send it to OCR. If it has 2,000 characters with coordinates, parse it directly.
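A minimal sketch of such a routing rule is shown below. It assumes the per-page statistics have already been collected by a parser. The PageStats type, the choose_route function, and both thresholds are illustrative assumptions that you would tune on your own documents.

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    char_count: int          # extractable characters in the text layer
    image_area_ratio: float  # share of the page covered by images, 0.0 to 1.0


def choose_route(
    stats: PageStats,
    min_chars: int = 100,            # assumed threshold, tune per document family
    full_page_image: float = 0.8,    # assumed threshold for "mostly one image"
) -> str:
    """Return "text_layer", "ocr", or "review" for one page."""
    if stats.char_count >= min_chars:
        return "text_layer"
    if stats.image_area_ratio >= full_page_image:
        return "ocr"
    # Few characters and no dominant image: probably blank or unusual.
    return "review"


print(choose_route(PageStats(char_count=20, image_area_ratio=0.95)))
print(choose_route(PageStats(char_count=2000, image_area_ratio=0.0)))
print(choose_route(PageStats(char_count=5, image_area_ratio=0.1)))
```

The third outcome is a deliberate design choice: a page that fits neither profile is flagged for review instead of being forced down one route.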

A damaged or adversarial PDF needs defensive handling. You need size limits, page limits, timeouts, password handling, and sandboxing. Good parsing libraries expose options to cap document size and limit resource usage, which are exactly the kinds of controls you want before batch processing customer files.
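Some of these checks can run before any parser touches the file. The sketch below uses only the standard library to enforce a size cap, look for the PDF header, and compute a checksum. The precheck_pdf name and the 50 MB limit are assumptions. Page limits, timeouts, and sandboxing need the parser or the runtime and are not shown.

```python
import hashlib
import tempfile
from pathlib import Path

MAX_BYTES = 50 * 1024 * 1024  # assumed 50 MB cap; tune for your workload


def precheck_pdf(path: str, max_bytes: int = MAX_BYTES) -> dict:
    """Cheap checks that run before any parser opens the file."""
    file = Path(path)
    size = file.stat().st_size
    if size > max_bytes:
        raise ValueError(f"{file.name}: {size} bytes exceeds limit of {max_bytes}")

    data = file.read_bytes()  # safe to read fully because the size is capped
    # A PDF starts with a "%PDF-" header. Searching the first 1 KB is an
    # assumption that tolerates a little leading junk.
    if b"%PDF-" not in data[:1024]:
        raise ValueError(f"{file.name}: no PDF header found")

    return {
        "source_id": file.name,
        "size_bytes": size,
        "sha256": hashlib.sha256(data).hexdigest(),
    }


# Demonstration with a tiny placeholder file, not a real document.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.pdf"
    sample.write_bytes(b"%PDF-1.4\n%%EOF\n")
    report = precheck_pdf(str(sample))
    print(report["source_id"], report["size_bytes"])
```

The checksum doubles as a deduplication key, so the same file uploaded twice is not processed twice.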

Simple Text Extraction

Simple extraction reads characters from the PDF content stream. Libraries such as PyMuPDF, pypdf, PDFMiner, and pdfplumber can do this in Python. PyMuPDF’s documentation shows basic page iteration with page.get_text(), and it also warns that plain text output is coded as it appears in the document, with no attempt to improve reading order or line breaks.

A PDF viewer renders a page visually but a PDF parser receives drawing instructions. The order in which text appears in the file can differ from the order a human reads on the page.

This affects common patterns:

Multi-column reports can mix text from left and right columns.
Headers and footers can appear mid-stream.
Tables can flatten into lines with ambiguous spacing.
Rotated text can be extracted in unexpected order or not at all.
Embedded fonts can produce strange characters.
Form fields can sit outside the normal text flow.

For simple extraction, preserve as much of the metadata around a PDF as possible, as it will be useful in later steps. Store each text span with its page number and source file. This gives downstream steps a chance to rebuild structure.

A minimal extraction record should look like this:

{
  "source_id": "supplier_contract_0421.pdf",
  "page": 12,
  "text": "Termination for convenience",
  "parser": "pymupdf",
  "extraction_mode": "text_layer"
}

This type of structure makes debugging so much easier in your pipeline. If a downstream LLM extracts the wrong termination clause, you can trace the answer back to a specific page.

A simple parser stage might look like this:

 
import os

import pymupdf


def extract_text_spans(path: str) -> list[dict]:
    """Extract one record per non-empty text span in a PDF.

    Args:
        path: Path to the PDF (or any PyMuPDF-supported document).

    Returns:
        A list of dicts, one per span, in the order PyMuPDF emits them,
        which is not guaranteed to match human reading order.
    """
    records: list[dict] = []
    source_id = os.path.basename(path)
 
    # Context manager ensures the document handle is always closed.
    with pymupdf.open(path) as doc:
        for page_index, page in enumerate(doc):
            page_dict = page.get_text("dict")
            for block in page_dict.get("blocks", []):
                # Image blocks have no "lines"; this skips them naturally.
                for line in block.get("lines", []):
                    for span in line.get("spans", []):
                        text = span.get("text", "").strip()
                        if not text:
                            continue
 
                        records.append(
                            {
                                "source_id": source_id,
                                "page": page_index + 1,  # 1-based page number
                                "text": text,
                                "parser": "pymupdf",
                                "extraction_mode": "text_layer",
                            }
                        )
 
    return records

When To Use OCR

OCR turns pixels into text. The usual flow is page rendering, image preprocessing, text detection, text recognition, and layout reconstruction.

Rendering converts each PDF page into an image at a chosen resolution. 200 dots per inch (DPI) is often enough for clean print. 300 DPI is safer for small fonts and poor scans. Higher DPI increases runtime and memory. It can also make noise more visible, so more pixels do not automatically mean better OCR.
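The cost of a higher DPI is easy to estimate, because PDF page sizes are expressed in points (1 point is 1/72 inch). The sketch below computes pixel dimensions and the uncompressed size of an 8-bit RGB render for an A4 page. The function names are illustrative, and real memory use depends on the renderer and colour mode.

```python
def render_size(width_pt: float, height_pt: float, dpi: int) -> tuple[int, int]:
    """Pixel dimensions of a page rendered at the given DPI (1 pt = 1/72 inch)."""
    return round(width_pt * dpi / 72), round(height_pt * dpi / 72)


def rgb_megabytes(width_px: int, height_px: int) -> float:
    """Uncompressed size of an 8-bit RGB image in megabytes (10**6 bytes)."""
    return width_px * height_px * 3 / 1_000_000


A4_POINTS = (595, 842)  # A4 rounded to whole PDF points

for dpi in (200, 300):
    w, h = render_size(*A4_POINTS, dpi)
    print(dpi, w, h, round(rgb_megabytes(w, h), 1))
# 200 1653 2339 11.6
# 300 2479 3508 26.1
```

Moving from 200 to 300 DPI multiplies the pixel count by about 2.25, which is why DPI is worth setting per document class instead of globally.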

Preprocessing can include rotation detection, de-skewing, de-noising, contrast adjustment, border removal, and page splitting. This stage should stay deterministic where possible. You want to know exactly what happened to the image before OCR.

Text detection finds regions that contain text. Text recognition converts each region into characters. Layout reconstruction groups words into lines, paragraphs, tables, and reading order.

The key trade-off is control versus integration. A direct Tesseract or PaddleOCR pipeline gives you detailed control over image preprocessing and OCR settings. Docling gives you a document-level abstraction, including layout and export formats.

For many business pipelines, the best route is mixed. Use a framework such as Docling for general document conversion. Add specialist OCR configuration for known document families, such as receipts, delivery notes, certificates, or handwritten forms.

Libraries And Models Worth Knowing

There is no single parser that wins on every PDF. We usually choose by document family, accuracy target, runtime, and deployment constraints.

For text-layer extraction:

PyMuPDF: fast PDF parsing, text extraction, rendering, annotations, and document manipulation.
pdfplumber: good access to characters, lines, rectangles, and table-oriented geometry.
pypdf: useful for metadata, splitting, merging, forms, and simpler extraction tasks.

For OCR:

Tesseract: mature local OCR, good for clean printed text and controlled environments.
EasyOCR: neural OCR library with broad language support.
PaddleOCR: OCR and document analysis toolkit with multilingual support.
RapidOCR: lightweight OCR stack used in several document processing projects.

For document intelligence:

Docling: document conversion and enrichment toolkit that parses formats into a unified representation called DoclingDocument, then exports to formats including Markdown, JSON, HTML, text, DocLang XML, and DocTags.
Unstructured: document partitioning toolkit for breaking documents into typed elements.
LayoutParser: layout detection toolkit built around computer vision models.

Docling is worth particular attention because it sits between low-level parsing and LLM application code. Its architecture uses format-specific backends and pipelines, then returns a common document representation that can be exported, serialised, or chunked.

Its pipeline options also expose practical controls. You can enable OCR, table structure recognition, image extraction, accelerator settings, and table matching behaviour. We've found that if you are dealing with anything other than simple PDFs, this one library can handle 90% of documents well.

What A Good PDF Pipeline Looks Like

A good pipeline separates document parsing from information extraction. These are different jobs.

Document parsing answers: "What is on the page, where is it, and how is it structured?"

Information extraction answers: "Which of those elements correspond to the fields we need?"

The production pipeline should broadly be composed of the following stages:

Ingest the file and assign a stable document ID.
Validate file type, size, page count, and checksum.
Classify each page as text-layer, scanned, or hybrid.
Extract text spans from digital pages.
Render and OCR scanned pages.
Reconstruct layout into reading order.
Export an intermediate representation with metadata for human-in-the-loop verification.
Chunk the document for retrieval or extraction.
Run deterministic rules for known fields.
Run LLM extraction for ambiguous fields and for AI-specific outputs, such as section summaries.
Validate outputs against a schema.
Store evidence, confidence, timings, parser version, and model version.

The intermediate representation is where it may seem easy to cut corners. Plain Markdown is useful for an LLM, but it is usually too lossy as your system of record. Store a structured JSON representation first and then generate Markdown from that when needed.

A practical record for a section might look like this:

{
  "document_id": "doc_8f1c",
  "element_id": "p3_e14",
  "page": 3,
  "text": "Service fee...",
  "metadata": {
    "source_file": "statement.pdf",
    "parser": "docling",
    "ocr": false,
    "confidence": 0.94
  }
}

This gives you the information you need for debugging, refinement, and retrieval once the pipeline is in use.

Which Parts Should Be Deterministic

For a more robust pipeline, make any stage deterministic if it should be repeatable, measurable, or cheap. Typical deterministic stages include:

File validation.
Metadata extraction.
Page classification.
Text-layer extraction.
OCR preprocessing.

Schema validation should also be deterministic. Use Pydantic, JSON Schema, or your database schema. If the extracted invoice date is missing, malformed, or after the ingestion date, fail the record or route it for review.

The same applies to normalisation where rules are clear. Currency codes, dates, VAT numbers, company registration numbers, and ISO country codes should go through rules and reference data before an LLM sees them.

For example:

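from datetime import date

from pydantic import BaseModel, Field, field_validator


class InvoiceExtraction(BaseModel):
    """Contract for one extracted invoice; invalid records raise ValidationError."""

    supplier_name: str
    invoice_number: str
    invoice_date: date  # ISO date strings such as "2024-01-31" are parsed into a date
    total_amount: float = Field(gt=0)  # must be strictly positive
    currency: str = Field(min_length=3, max_length=3)  # three-letter code, e.g. "GBP"

    @field_validator("currency")
    @classmethod
    def uppercase_currency(cls, value: str) -> str:
        # Normalise case so "gbp" and "GBP" are stored identically.
        return value.upper()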

This gives the pipeline a contract it must adhere to.
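The date rule mentioned earlier (an invoice date that is missing, malformed, or later than the ingestion date) depends on a second value, so it is often easiest to express as a small standalone check. The sketch below uses only the standard library. The check_invoice_date name and the status strings are assumptions, and it expects dates that have already been normalised to ISO format (YYYY-MM-DD).

```python
from datetime import date
from typing import Optional


def check_invoice_date(
    raw: Optional[str], ingested_on: date
) -> tuple[str, Optional[date]]:
    """Return (status, parsed_date).

    Status is "ok", "missing", "malformed", or "future". Anything other
    than "ok" should fail the record or route it for review.
    """
    if raw is None or not raw.strip():
        return "missing", None
    try:
        parsed = date.fromisoformat(raw.strip())
    except ValueError:
        return "malformed", None
    if parsed > ingested_on:
        return "future", parsed
    return "ok", parsed


ingested = date(2024, 2, 1)
for raw in ("2024-01-05", "31/01/2024", None, "2024-03-01"):
    print(raw, check_invoice_date(raw, ingested)[0])
```

Returning a status instead of raising lets the caller decide between rejecting the record and sending it to a review queue.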

Where LLMs Belong

Use LLMs where rules become brittle or fuzzy. Good candidates include:

Text classification
Semantic field mapping
Messy text
Unusual table layouts
Cross-page reasoning

An LLM is useful when the question is closer to meaning than parsing:

Prompt:
From the provided evidence blocks, extract the termination notice period.
Return only JSON that matches the schema.
Include the source element IDs used as evidence.
If the evidence is absent, return null.
 
<evidence>
...
</evidence>

The important part is "provided evidence blocks". The model should receive selected document elements with page numbers and IDs, not the whole PDF. That keeps context small and makes the answer auditable.
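Auditability can be enforced mechanically: once the response comes back, check that every cited element ID was actually in the evidence you sent. The sketch below assumes a response shape with hypothetical notice_period and evidence_ids fields. Your own schema will differ.

```python
import json


def check_evidence(response_text: str, provided_ids: set[str]) -> list[str]:
    """Return a list of problems; an empty list means the answer is traceable."""
    try:
        payload = json.loads(response_text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(payload, dict):
        return ["response is not a JSON object"]

    cited = payload.get("evidence_ids") or []
    if not isinstance(cited, list):
        return ["evidence_ids is not a list"]

    problems: list[str] = []
    # A non-null value must be backed by at least one evidence block.
    if payload.get("notice_period") is not None and not cited:
        problems.append("value returned without evidence")
    # Every cited ID must be one that was actually supplied in the prompt.
    unknown = sorted({str(c) for c in cited} - provided_ids)
    if unknown:
        problems.append(f"unknown evidence IDs: {unknown}")
    return problems


provided = {"p12_e03", "p12_e04"}
good = '{"notice_period": "90 days", "evidence_ids": ["p12_e03"]}'
bad = '{"notice_period": "30 days", "evidence_ids": ["p99_e01"]}'
print(check_evidence(good, provided))
print(check_evidence(bad, provided))
```

A response that cites an ID it was never given is a strong signal of a fabricated answer, and this check catches it without a second model call.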

For high-volume extraction, we prefer a 2-pass approach. The first pass retrieves candidate elements using deterministic filters, embeddings, or both. The second pass asks an LLM to extract from those candidates. This reduces cost and improves review because each answer has a short evidence trail.
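A minimal sketch of the first pass and the prompt assembly might look like the following. It uses a plain keyword filter as the deterministic retrieval step. The element records, keywords, and function names are made up for illustration, and the model call itself is left out.

```python
def find_candidates(
    elements: list[dict], keywords: list[str], limit: int = 8
) -> list[dict]:
    """First pass: keep elements whose text mentions any keyword."""
    lowered = [k.lower() for k in keywords]
    hits = [e for e in elements if any(k in e["text"].lower() for k in lowered)]
    return hits[:limit]


def build_prompt(question: str, candidates: list[dict]) -> str:
    """Second pass input: the question plus only the candidate evidence blocks."""
    blocks = "\n".join(
        f'[{c["element_id"]} | page {c["page"]}] {c["text"]}' for c in candidates
    )
    return f"{question}\n\n<evidence>\n{blocks}\n</evidence>"


# Illustrative elements, not taken from a real document.
elements = [
    {"element_id": "p1_e01", "page": 1, "text": "Master Services Agreement"},
    {"element_id": "p12_e03", "page": 12,
     "text": "Either party may terminate for convenience on 90 days written notice."},
    {"element_id": "p14_e02", "page": 14,
     "text": "This agreement is governed by the laws of England and Wales."},
]

candidates = find_candidates(elements, ["terminat", "notice period"])
prompt = build_prompt("Extract the termination notice period.", candidates)
print(prompt)
```

The limit parameter keeps the context bounded even when a keyword matches many elements. An embedding search could replace or supplement the keyword filter without changing build_prompt.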

Avoid asking the LLM to do everything at once. "Read this 90-page contract and extract all commercial terms" is hard to test. "Given these 8 candidate blocks, extract renewal term, notice period, and governing law" is much easier to validate. Evals are a great way of validating these smaller units of LLM work.

Accuracy Comes From Feedback Loops

PDF pipelines improve when you store failures. Keep the original file, parser output, OCR output, model prompt, model response, validation errors, and human corrections.

This lets you answer practical questions:

Which document families fail most often?
Which fields have low confidence?
Which OCR engine performs best on your scans?
Which table settings merge cells incorrectly?
Which suppliers changed their invoice template?
Which pages cost the most time?

Once you have that data, optimisation becomes specific. You can adjust DPI for one document class. You can switch table extraction mode for financial statements. You can add a deterministic rule for a supplier that accounts for 30% of failures.
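Answering these questions usually starts with a simple aggregation over the stored failure records. The sketch below counts failures by an arbitrary key. The record fields and the sample rows are invented for illustration and would come from your own failure log.

```python
from collections import Counter


def failure_share(records: list[dict], key: str) -> list[tuple[str, int, float]]:
    """Return (value, count, share of all failures), most frequent first."""
    counts = Counter(r[key] for r in records)
    total = sum(counts.values())
    return [
        (name, count, round(count / total, 2))
        for name, count in counts.most_common()
    ]


# Made-up sample rows standing in for a real failure log.
failures = [
    {"document_family": "invoice", "field": "invoice_date"},
    {"document_family": "invoice", "field": "total_amount"},
    {"document_family": "delivery_note", "field": "signature"},
    {"document_family": "invoice", "field": "invoice_date"},
    {"document_family": "delivery_note", "field": "quantity"},
]

print(failure_share(failures, "document_family"))
# [('invoice', 3, 0.6), ('delivery_note', 2, 0.4)]
```

The same function works for any stored dimension, such as field, supplier, or parser version, as long as each failure record carries it.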

This is the real benefit of a pipeline: PDFs are no longer mysterious blobs and are instead treated as semi-structured inputs with known failure modes. It also shows where most of the work lies. The effort is not in the initial pipeline setup but in driving up confidence and reliability, which usually takes 2-3x longer than the initial build.

Practical Tips from the Field

Start with a simple, conservative design first.

Use text-layer extraction wherever possible. Use OCR only on pages that need it. Store a structured intermediate representation with page and bounding-box metadata. Use deterministic validation before writing to operational systems. Use LLMs for semantic extraction from selected evidence, not full-document parsing. Build human review into the pipeline to help with refinement.

For tooling, start with PyMuPDF or pdfplumber for low-level experiments. Move to Docling when you need layout-aware conversion, table handling, OCR options, and exports for retrieval-augmented generation (RAG). Docling supports many input formats beyond PDF, including DOCX, XLSX, PPTX, HTML, images, audio, video, and several XML formats, which can matter once the pipeline expands beyond the first PDF use case.

The final test is operational. If a user disputes a value, can you show the source file, extracted page, extracted evidence, parser version, prompt, response, validation result, and final write? If you can, you have a robust and reliable PDF pipeline.

If your business has a backlog of PDFs that need to become searchable, structured, or available to AI systems, start with 50 representative documents. We can help turn that sample into a measured extraction pipeline before you commit to processing the full archive.

PDF Parsing For AI