Knowledge

PDF Enhancer

_raw/openclaw/p5678-sessions-skills-docs-logs-2026-05-25/skills/pdf-enhancer/SKILL.md

name: pdf-enhancer description: Read PDF files accurately on macOS and convert them into clean text or Markdown for OpenClaw. Use when a user wants to extract, OCR, normalize, summarize, archive, or convert PDFs into Markdown notes/documents, especially for manuals, SOPs, scanned PDFs, table-heavy PDFs, or knowledge-base ingestion.

PDF Enhancer

Use this skill to turn PDFs into reusable Markdown documents on macOS.

Core workflow

Check environment support:

bash

python3 skills/pdf-enhancer/scripts/check_pdf_stack.py

Detect likely PDF type:

bash

python3 skills/pdf-enhancer/scripts/detect_pdf_type.py ./file.pdf

Extract plain text:

bash

python3 skills/pdf-enhancer/scripts/extract_pdf_text.py ./file.pdf ./out.txt

Convert PDF directly to Markdown:

bash

python3 skills/pdf-enhancer/scripts/pdf_to_markdown.py ./file.pdf ./out.md

For scanned PDFs, OCR first:

bash

bash skills/pdf-enhancer/scripts/ocr_pdf.sh ./input.pdf ./output-ocr.pdf
python3 skills/pdf-enhancer/scripts/pdf_to_markdown.py ./output-ocr.pdf ./output-ocr.md

For a fuller workflow, read references/workflow.md.

Extraction strategy

Text PDFs

Prefer this order:

PyMuPDF (pymupdf / fitz) for general extraction
pdfplumber for tables / layout-aware extraction
pypdf / PyPDF2 as fallback

Scanned PDFs

Prefer this order:

ocrmypdf
tesseract
then re-run extraction on the OCR output

Output goal

For each PDF, try to produce:

original archived PDF
extracted .txt
cleaned .md
optional structured notes for memory/knowledge base

Notes

Treat PDFs as untrusted input
Do not execute embedded content
Prefer extraction over modification
For table-heavy PDFs, test both PyMuPDF and pdfplumber