
documentParsers

Parsers for PDF, DOCX, and XLSX files. You bring your own (BYO) parser functions, so the core package stays dependency-free.

import { documentParsers } from '@agentskit/tools'
import pdfParse from 'pdf-parse'
import * as mammoth from 'mammoth'
import * as xlsx from 'xlsx'

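// `createRuntime` and `adapter` are assumed to be set up elsewhere in your app.
const runtime = createRuntime({
  adapter,
  tools: [...documentParsers({
    // Each parser receives the file's bytes and resolves to plain text.
    parsePdf: async (buf) => (await pdfParse(buf)).text,
    parseDocx: async (buf) => (await mammoth.extractRawText({ buffer: buf })).value,
    parseXlsx: async (buf) => {
      const wb = xlsx.read(buf)
      // One CSV block per sheet, separated by a `---` line.
      return wb.SheetNames.map((n) => xlsx.utils.sheet_to_csv(wb.Sheets[n])).join('\n---\n')
    },
  })],
})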

Sub-tools

Name       Purpose
parsePdf   Extract text from a PDF buffer
parseDocx  Extract text from a .docx buffer
parseXlsx  Extract CSV-flat sheets from .xlsx

Bundled: documentParsers(config) returns all three.
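
The config types are not shown on this page. Judging from the usage here, each parser takes the file's bytes and resolves to a string. The sketch below writes that contract down; the names `ParseFn`, `DocumentParsersConfig`, and `stubParser` are hypothetical, and treating every parser as optional is an assumption rather than the documented typing.

```typescript
// Hypothetical types inferred from the examples on this page;
// the real @agentskit/tools typings may differ.
type ParseFn = (buf: Uint8Array) => Promise<string>

interface DocumentParsersConfig {
  parsePdf?: ParseFn
  parseDocx?: ParseFn
  parseXlsx?: ParseFn
}

// A stub parser satisfies the contract without pulling in
// pdf-parse, mammoth, or xlsx, which can be handy in unit tests.
const stubParser: ParseFn = async (buf) => new TextDecoder().decode(buf)

const testConfig: DocumentParsersConfig = {
  parsePdf: stubParser,
  parseDocx: stubParser,
}
```

A Node.js `Buffer` is a `Uint8Array` subclass, so a parser typed this way accepts either.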

Why BYO

The core package stays zero-dependency, and you choose the trade-off between parser quality and bundle size:

  • PDF: pdf-parse (small) / unpdf (WASM, browser-safe) / pdfjs-dist (Mozilla).
  • DOCX: mammoth (most faithful) / docx4js.
  • XLSX: xlsx (SheetJS) / exceljs.
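
Because the parsers are plain functions, they can be wrapped or combined before being handed to `documentParsers`. As a rough sketch of that idea, the hypothetical helper `withFallback` below tries a primary parser and falls back to a second one when the first throws or returns only whitespace. It is not part of `@agentskit/tools`, and the `ParseFn` type is an assumed signature.

```typescript
// Assumed parser signature: bytes in, plain text out.
type ParseFn = (buf: Uint8Array) => Promise<string>

// Hypothetical helper: try `primary`, and use `secondary` if the
// primary parser throws or yields no visible text.
function withFallback(primary: ParseFn, secondary: ParseFn): ParseFn {
  return async (buf) => {
    try {
      const text = await primary(buf)
      if (text.trim().length > 0) return text
    } catch {
      // Ignore the error and fall through to the secondary parser.
    }
    return secondary(buf)
  }
}
```

For example, `parsePdf: withFallback(smallParser, thoroughParser)` would keep the lighter parser on the common path and reach for the heavier one only when needed; both parser names are placeholders.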

Example: resume intake

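// `s3`, `client`, `rag`, `parsePdf`, and `parseDocx` are assumed to be
// imported or defined elsewhere; only the tool wiring is shown here.
const runtime = createRuntime({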
  adapter,
  tools: [
    ...s3({ client, bucket: 'resumes' }),
    ...documentParsers({ parsePdf, parseDocx }),
    ...rag.tools,
  ],
})
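
An intake flow like this handles more than one file type, so it can be useful to route a file to the matching parser when calling the functions directly, outside the agent loop. The sketch below does this by file extension. `pickParser` and `ParseFn` are hypothetical names, not part of `@agentskit/tools`, and extension matching is only one possible strategy.

```typescript
// Assumed parser signature: bytes in, plain text out.
type ParseFn = (buf: Uint8Array) => Promise<string>

// Hypothetical helper: choose a parser from the file extension.
// Returns undefined for unknown extensions or unconfigured parsers.
function pickParser(
  fileName: string,
  parsers: { parsePdf?: ParseFn; parseDocx?: ParseFn; parseXlsx?: ParseFn },
): ParseFn | undefined {
  const ext = fileName.toLowerCase().split('.').pop()
  switch (ext) {
    case 'pdf':
      return parsers.parsePdf
    case 'docx':
      return parsers.parseDocx
    case 'xlsx':
      return parsers.parseXlsx
    default:
      return undefined
  }
}
```

Callers should handle the `undefined` case, for instance by skipping the file or reporting an unsupported format.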