Parser - pdf

Introduction

The PDF Document Parser is an implementation of the Document Parser interface that extracts the contents of PDF files as plain text. It follows the conventions described in the Eino: Document Loader guide and is mainly used in the following scenarios:

When you need to convert PDF documents into a processable plain text format
When you need to split the contents of a PDF document by page

Features

The PDF parser has the following features:

Supports basic PDF text extraction
Optionally splits documents by page
Automatically handles PDF fonts and encoding
Supports multi-page PDF documents

Notes:

Not all PDF formats are fully supported at present
Formatting such as spacing and line breaks is not retained
Complex PDF layouts may affect extraction results

Usage

Component Initialization

The PDF parser is initialized using the NewPDFParser function, with the main configuration parameters as follows:

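import (
    "context"

    "github.com/cloudwego/eino-ext/components/document/parser/pdf"
)

func main() {
    ctx := context.Background()

    parser, err := pdf.NewPDFParser(ctx, &pdf.Config{
        ToPages: true, // Whether to split the document by page
    })
    if err != nil {
        panic(err)
    }
    _ = parser // call parser.Parse to parse documents
}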

Configuration parameters description:

ToPages: Whether to split the PDF into multiple documents by page, default is false
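
To make the effect of ToPages concrete, here is a minimal sketch of how the output shape differs between the two settings. It does not use or reproduce the eino-ext implementation: the document type and the collect function are hypothetical stand-ins, and the page texts are assumed to have been extracted already.

```go
package main

import (
	"fmt"
	"strings"
)

// document is a hypothetical stand-in for the parser's output type;
// only the Content field matters for this illustration.
type document struct {
	Content string
}

// collect mimics the effect of a ToPages-style flag on already-extracted
// page texts: one document per page when toPages is true, otherwise a
// single document holding all pages concatenated in order.
func collect(pageTexts []string, toPages bool) []document {
	if toPages {
		docs := make([]document, 0, len(pageTexts))
		for _, text := range pageTexts {
			docs = append(docs, document{Content: text})
		}
		return docs
	}
	return []document{{Content: strings.Join(pageTexts, "")}}
}

func main() {
	pages := []string{"page one text", "page two text"}
	fmt.Println(len(collect(pages, true)))  // prints 2: one document per page
	fmt.Println(len(collect(pages, false))) // prints 1: a single merged document
}
```

In practice this means code that consumes the result should always iterate over the returned slice rather than assume a single document.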

Parsing Documents

Document parsing is done using the Parse method:

docs, err := parser.Parse(ctx, reader, opts...)

Parsing options:

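Set the document URI with parser.WithURI
Add extra metadata with parser.WithExtraMeta

In these option names, parser refers to the github.com/cloudwego/eino/components/document/parser package, not to the parser variable used in the Parse call.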
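
The variadic opts... argument follows Go's functional-options pattern. The sketch below illustrates that pattern with hypothetical stand-in names (parseOptions, option, withURI, withExtraMeta, resolve); these are assumptions made for illustration and are not the eino definitions.

```go
package main

import "fmt"

// parseOptions is a hypothetical stand-in for the settings that
// per-call options can populate.
type parseOptions struct {
	URI       string
	ExtraMeta map[string]any
}

// option is a function that mutates parseOptions.
type option func(*parseOptions)

// withURI records the URI of the document being parsed.
func withURI(uri string) option {
	return func(o *parseOptions) { o.URI = uri }
}

// withExtraMeta attaches caller-supplied metadata.
func withExtraMeta(meta map[string]any) option {
	return func(o *parseOptions) { o.ExtraMeta = meta }
}

// resolve applies the options in order, so a later option
// overrides an earlier one that sets the same field.
func resolve(opts ...option) parseOptions {
	var o parseOptions
	for _, apply := range opts {
		apply(&o)
	}
	return o
}

func main() {
	o := resolve(
		withURI("document.pdf"),
		withExtraMeta(map[string]any{"source": "./document.pdf"}),
	)
	fmt.Println(o.URI, o.ExtraMeta["source"]) // prints: document.pdf ./document.pdf
}
```

Because each option is optional and independent, a call can pass none, one, or several of them without changing the method signature.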

Complete Usage Example

Basic Usage

package main

import (
    "context"
    "os"
    
    "github.com/cloudwego/eino-ext/components/document/parser/pdf"
    "github.com/cloudwego/eino/components/document/parser"
)

func main() {
    ctx := context.Background()
    
    // Initialize the parser
    p, err := pdf.NewPDFParser(ctx, &pdf.Config{
        ToPages: false, // Do not split by page
    })
    if err != nil {
        panic(err)
    }
    
    // Open the PDF file
    file, err := os.Open("document.pdf")
    if err != nil {
        panic(err)
    }
    defer file.Close()
    
    // Parse the document
    docs, err := p.Parse(ctx, file, 
        parser.WithURI("document.pdf"),
        parser.WithExtraMeta(map[string]any{
            "source": "./document.pdf",
        }),
    )
    if err != nil {
        panic(err)
    }
    
    // Use the parsed results
    for _, doc := range docs {
        println(doc.Content)
    }
}

Using loader

Refer to the example in the Eino: Document Loader guide

Last modified May 7, 2025 : docs: update document transformer (#1325) (c50d9fed5b)