_raw/openclaw/p5678-sessions-skills-docs-logs-2026-05-25/skills/pdf-enhancer/SKILL.md


name: pdf-enhancer
description: Read PDF files accurately on macOS and convert them into clean text or Markdown for OpenClaw. Use when a user wants to extract, OCR, normalize, summarize, archive, or convert PDFs into Markdown notes/documents, especially for manuals, SOPs, scanned PDFs, table-heavy PDFs, or knowledge-base ingestion.


PDF Enhancer

Use this skill to turn PDFs into reusable Markdown documents on macOS.

Core workflow

  1. Check environment support:
       python3 skills/pdf-enhancer/scripts/check_pdf_stack.py
  2. Detect likely PDF type:
       python3 skills/pdf-enhancer/scripts/detect_pdf_type.py ./file.pdf
  3. Extract plain text:
       python3 skills/pdf-enhancer/scripts/extract_pdf_text.py ./file.pdf ./out.txt
  4. Convert PDF directly to Markdown:
       python3 skills/pdf-enhancer/scripts/pdf_to_markdown.py ./file.pdf ./out.md
  5. For scanned PDFs, OCR first:
       bash skills/pdf-enhancer/scripts/ocr_pdf.sh ./input.pdf ./output-ocr.pdf
       python3 skills/pdf-enhancer/scripts/pdf_to_markdown.py ./output-ocr.pdf ./output-ocr.md

For a fuller workflow, read references/workflow.md.
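
The workflow steps can also be chained from Python. The sketch below only assumes the script paths and argument order shown in the commands above; the helper names `plan_commands` and `run_plan`, and the `-ocr` output naming, are hypothetical. Because the output format of detect_pdf_type.py is not documented here, the caller passes the scanned/text decision in as a flag.

```python
import subprocess
from pathlib import Path

# Script directory as used in the commands above.
SCRIPTS = Path("skills/pdf-enhancer/scripts")


def plan_commands(pdf: str, scanned: bool) -> list[list[str]]:
    """Return the argv lists for the workflow steps, in execution order."""
    src = Path(pdf)
    commands = []
    if scanned:
        # Hypothetical naming: write the OCR'd copy next to the input.
        ocr_pdf = src.with_name(src.stem + "-ocr.pdf")
        commands.append(["bash", str(SCRIPTS / "ocr_pdf.sh"), str(src), str(ocr_pdf)])
        src = ocr_pdf  # later steps read the OCR output
    commands.append(
        ["python3", str(SCRIPTS / "extract_pdf_text.py"), str(src), str(src.with_suffix(".txt"))]
    )
    commands.append(
        ["python3", str(SCRIPTS / "pdf_to_markdown.py"), str(src), str(src.with_suffix(".md"))]
    )
    return commands


def run_plan(commands: list[list[str]]) -> None:
    """Run each step and stop at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)
```

Keeping the planning step separate from execution makes it possible to print or review the commands before anything runs.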

Extraction strategy

Text PDFs

Prefer this order:

  • PyMuPDF (pymupdf / fitz) for general extraction
  • pdfplumber for tables / layout-aware extraction
  • pypdf (the maintained successor to PyPDF2) as fallback
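
The preference order above can be expressed as a fallback chain. This is a minimal sketch, not the logic of extract_pdf_text.py: the function names are hypothetical, and "move to the next library when one is missing, fails, or returns only whitespace" is an assumed policy.

```python
from typing import Callable, Sequence, Tuple


def _with_pymupdf(path: str) -> str:
    import fitz  # PyMuPDF

    with fitz.open(path) as doc:
        return "\n".join(page.get_text() for page in doc)


def _with_pdfplumber(path: str) -> str:
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


def _with_pypdf(path: str) -> str:
    from pypdf import PdfReader

    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)


Extractor = Tuple[str, Callable[[str], str]]

# Same order as the list above.
DEFAULT_EXTRACTORS: Sequence[Extractor] = (
    ("pymupdf", _with_pymupdf),
    ("pdfplumber", _with_pdfplumber),
    ("pypdf", _with_pypdf),
)


def extract_text(path: str, extractors: Sequence[Extractor] = DEFAULT_EXTRACTORS) -> Tuple[str, str]:
    """Return (library name, text) from the first extractor that yields non-blank text."""
    errors = []
    for name, extractor in extractors:
        try:
            text = extractor(path)
        except Exception as exc:  # library not installed, or the PDF failed to parse
            errors.append(f"{name}: {exc}")
            continue
        if text.strip():
            return name, text
        errors.append(f"{name}: no text")
    raise RuntimeError("all extractors failed: " + "; ".join(errors))
```

If every library returns blank text, the PDF is probably scanned and should go through OCR first.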

Scanned PDFs

Prefer this order:

  • ocrmypdf, which drives Tesseract and writes a searchable PDF
  • tesseract directly, if ocrmypdf is unavailable
  • then re-run extraction on the OCR output
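
A minimal sketch of the OCR step from Python, assuming the `ocrmypdf` command-line tool is on PATH. It is not a description of what ocr_pdf.sh does, and the helper names are hypothetical. The direct tesseract fallback is left out because tesseract reads page images rather than PDFs, so it would need a separate rasterization step.

```python
import shutil
import subprocess


def ocrmypdf_command(src: str, dst: str, language: str = "eng") -> list[str]:
    """Build the ocrmypdf argv for one PDF."""
    # --skip-text leaves pages that already have a text layer untouched;
    # -l selects the Tesseract language pack.
    return ["ocrmypdf", "--skip-text", "-l", language, src, dst]


def ocr_pdf(src: str, dst: str, language: str = "eng") -> str:
    """Run OCR and return the output path, ready for re-extraction."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf not found on PATH")
    subprocess.run(ocrmypdf_command(src, dst, language), check=True)
    return dst
```

The returned path can then be passed to the same text or Markdown extraction used for text PDFs.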

Output goal

For each PDF, try to produce:

  • original archived PDF
  • extracted .txt
  • cleaned .md
  • optional structured notes for memory/knowledge base
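
One way to keep these outputs together is a per-document folder. The layout below is only an illustration; the skill does not prescribe one, and the names `OutputPaths`, `output_paths`, and the `.notes.md` suffix are hypothetical.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class OutputPaths:
    archive: Path   # copy of the original PDF
    text: Path      # extracted .txt
    markdown: Path  # cleaned .md
    notes: Path     # optional structured notes


def output_paths(pdf: str, out_dir: str) -> OutputPaths:
    """Derive all output locations for one PDF from its file name."""
    stem = Path(pdf).stem
    root = Path(out_dir) / stem  # one folder per document
    return OutputPaths(
        archive=root / f"{stem}.pdf",
        text=root / f"{stem}.txt",
        markdown=root / f"{stem}.md",
        notes=root / f"{stem}.notes.md",
    )
```

For example, `output_paths("inbox/manual.pdf", "kb")` places every artifact under `kb/manual/`.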

Notes

  • Treat PDFs as untrusted input
  • Do not execute embedded content
  • Prefer extraction over modification
  • For table-heavy PDFs, test both PyMuPDF and pdfplumber
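
Treating PDFs as untrusted input can start with a cheap check before any parser opens the file. The sketch below is an assumption about what such a check might look like, not part of the skill's scripts; `looks_like_pdf` and the size limit are hypothetical. It only reads bytes and never modifies or executes anything.

```python
from pathlib import Path

MAX_BYTES = 200 * 1024 * 1024  # hypothetical upper bound; tune for your documents


def looks_like_pdf(path: str, max_bytes: int = MAX_BYTES) -> bool:
    """Return True if the file is a regular file, within the size limit, with a PDF header."""
    p = Path(path)
    if not p.is_file() or p.stat().st_size > max_bytes:
        return False
    with p.open("rb") as fh:
        head = fh.read(1024)
    # The header normally sits at byte 0; some readers tolerate a few leading
    # bytes, so this searches the first 1 KiB instead of requiring an exact start.
    return b"%PDF-" in head
```

A file that fails this check can be rejected before it reaches PyMuPDF, pdfplumber, or ocrmypdf.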