NAME
pdfmd — Smart PDF to Markdown converter with intelligent heading detection, automatic header/footer removal, orphan fragment…
SYNOPSIS
sudo apt-get install tesseract-ocr
DESCRIPTION
Smart PDF to Markdown converter with intelligent heading detection, automatic header/footer removal, orphan fragment merging, and image export. Features a user-friendly GUI with preview mode, persistent settings, and per-page error recovery. Optimized for Obsidian and other Markdown-based note-taking workflows.
README
PDF to Markdown Converter (pdfmd)
A refined, privacy-first desktop and CLI tool that converts PDFs—including scanned documents—into clean, structured Markdown. Built for researchers, professionals, and creators who demand accuracy, speed, and absolute data privacy.
Fast. Local. Intelligent. Fully offline.
🛡️ Privacy & Security First
Many PDF converters silently upload documents to remote servers. This tool does not.
- No uploads: Your files never leave your machine
- No telemetry: No usage tracking or analytics
- No cloud processing: All computation happens locally
- No background requests: Completely offline operation
Every step—extraction, OCR, reconstruction, and rendering—happens locally on your machine.
Trusted for Sensitive Workflows
Intentionally designed for environments where confidentiality is non-negotiable:
- 🏥 Medical: Clinical notes, diagnostic reports, patient records
- ⚖️ Legal: Case files, evidence bundles, attorney-client communications
- 🏛️ Government: Policy drafts, restricted documents, classified materials
- 🎓 Academic Research: Paywalled journals, unpublished materials, grant proposals
- 💼 Corporate: Financial reports, IP-sensitive designs, strategic plans
Password-Protected PDFs — Secure Support
Full support for encrypted PDFs with security-first design:
✅ Passwords never logged or saved — Memory-only processing
✅ No command-line exposure — Prevents process monitoring attacks
✅ Auto-cleanup — Temporary files deleted immediately
✅ Interactive prompts — Hidden input in GUI and CLI
GUI: Modal password dialog with masked input (*****)
CLI: getpass hidden terminal input
Supports all PDF encryption standards: 40-bit RC4, 128-bit RC4, 128/256-bit AES.
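The hidden CLI prompt described above can be sketched with the standard-library `getpass` module. This is a minimal illustration, not pdfmd's actual code: the function name `read_pdf_password` and the injectable `prompt_fn` parameter are hypothetical, added here so the prompt can be exercised without a terminal.

```python
import getpass
from typing import Callable


def read_pdf_password(prompt_fn: Callable[[str], str] = getpass.getpass) -> str:
    """Ask for a PDF password without echoing it to the terminal.

    The password is only returned to the caller; this function never
    prints it, logs it, or writes it to disk.
    """
    password = prompt_fn("Enter password (input will be hidden): ")
    if not password:
        raise ValueError("empty password")
    return password
```

Because the value comes from an interactive prompt rather than a command-line argument, it does not show up in shell history or process listings.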
✨ Key Features
🎯 Accurate Markdown From Any PDF
- Smart paragraph reconstruction — Joins wrapped lines intelligently
- Heading inference — Uses font metrics to detect document structure
- Bullet & numbered list detection — Recognizes various formats (•, ○, -, 1., a., etc.)
- Hyphenation repair — Automatically unwraps "hy-\nphen" patterns
- URL auto-linking — Converts plain URLs into clickable Markdown links
- Inline formatting — Preserves bold and italic styling
- Header/footer removal — Detects and strips repeating page elements
- Multi-column awareness — Reduces cross-column text mixing
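As an illustration of the hyphenation repair and line joining listed above, here is a small regex-based sketch. The function names are hypothetical and the rules are deliberately simple; pdfmd's own logic may be more careful.

```python
import re

# A lowercase letter, a hyphen, a line break, then another lowercase letter.
_HYPHEN_BREAK = re.compile(r"(?<=[a-z])-\n(?=[a-z])")


def repair_hyphenation(text: str) -> str:
    """Join words that were split across lines with a trailing hyphen."""
    return _HYPHEN_BREAK.sub("", text)


def join_wrapped_lines(text: str) -> str:
    """Repair hyphenation, then turn remaining single line breaks into spaces."""
    text = repair_hyphenation(text)
    # Blank lines (paragraph breaks) are kept; single newlines become spaces.
    return re.sub(r"(?<!\n)\n(?!\n)", " ", text)
```

One known limitation of this sketch: a genuine compound that happens to break at its hyphen (for example "well-" / "known") is also joined, so a real implementation would likely need a dictionary check or similar safeguard.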
📊 Automatic Table Detection & Reconstruction
Your PDFs often contain tables split across blocks, columns, and various layout quirks. The robust table engine handles:
- Column-aligned tables — Detects 2+ space separated columns
- Bordered tables — Recognizes explicit | and ¦ delimiters
- Tab-separated blocks — Handles tab-delimited data
- Multi-block vertical tables — Stitches tables split across PyMuPDF blocks
- Full Markdown rendering — Generates proper pipe tables with alignment
- Header row detection — Automatically identifies table headers
- Conservative heuristics — Avoids false positives on prose and lists
Perfect for academic papers, financial documents, and structured reports.
Detection Strategies (priority order):
- Bordered tables (highest confidence)
- Vertical multi-block tables
- ASCII whitespace-separated tables
🧮 Math-Aware Extraction & LaTeX Preservation
Scientific documents finally convert cleanly. The Math Engine automatically:
- Detects inline & display math regions — Distinguishes equations from prose
- Converts Unicode math to LaTeX — α → \alpha, √x → \sqrt{x}
- Handles superscripts/subscripts — x² → x^{2}, x₁₀ → x_{10}
- Preserves existing LaTeX — Keeps $...$ and $$...$$ intact
- Avoids Markdown escaping — Math content bypasses normal escaping
- Maintains equation integrity — Keeps equations intact across line breaks
Ideal for scientific PDFs in physics, mathematics, engineering, and chemistry.
Examples:
E = mc² → E = mc^{2}
α + β³ → \alpha + \beta^{3}
∫₀^∞ e^(-x²) dx → \int_{0}^{\infty} e^{-x^{2}} dx
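The simpler conversions above (superscripts, subscripts, and a few Greek letters) can be sketched with translation tables. This is an assumption-laden illustration: the symbol table is tiny, the function name is hypothetical, and structural cases such as √x or integrals with bounds need real parsing that is not attempted here.

```python
import re

# Small illustrative tables; a real converter would cover far more symbols.
SUPERSCRIPTS = str.maketrans("⁰¹²³⁴⁵⁶⁷⁸⁹⁺⁻", "0123456789+-")
SUBSCRIPTS = str.maketrans("₀₁₂₃₄₅₆₇₈₉₊₋", "0123456789+-")
SYMBOLS = {"α": r"\alpha", "β": r"\beta", "π": r"\pi", "∞": r"\infty"}

_SUP_RUN = re.compile("[⁰¹²³⁴⁵⁶⁷⁸⁹⁺⁻]+")
_SUB_RUN = re.compile("[₀₁₂₃₄₅₆₇₈₉₊₋]+")


def unicode_math_to_latex(text: str) -> str:
    """Rewrite Unicode super/subscript runs and a few symbols as LaTeX."""
    # Consecutive superscript characters become one ^{...} group.
    text = _SUP_RUN.sub(lambda m: "^{" + m.group().translate(SUPERSCRIPTS) + "}", text)
    # Consecutive subscript characters become one _{...} group.
    text = _SUB_RUN.sub(lambda m: "_{" + m.group().translate(SUBSCRIPTS) + "}", text)
    for char, command in SYMBOLS.items():
        text = text.replace(char, command)
    return text
```

Note that a symbol directly followed by a letter (such as "αx") would produce "\alphax" in this sketch; a fuller version would insert a separating space or braces.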
📸 Scanned PDF Support (OCR)
- Tesseract OCR — Lightweight, accurate, works on all major platforms
- OCRmyPDF — High-fidelity layout preservation
- Auto-detection — Automatically identifies scanned pages
- Configurable quality — Balance between speed and accuracy
- Mixed-mode support — Handles PDFs with both digital text and scanned pages
Auto-Detection Heuristics:
- Text density analysis (< 50 chars/page = likely scanned)
- Image coverage detection (>30% page area)
- Combined signals trigger OCR automatically
🎨 Modern GUI Experience
- Dark/Light themes — Obsidian-style dark mode (default) with instant toggle
- Live progress tracking — Determinate progress bar with full logging
- Real-time console — View extraction and conversion logs as they happen
- Quick access — "Open Output Folder" link to finished Markdown
- Non-blocking conversion — Cancel long-running jobs anytime with Esc
- Keyboard shortcuts — Power-user workflow (Ctrl+Enter to convert)
- Persistent settings — Theme, paths, options, and profiles saved between sessions
- Conversion profiles — Built-in and custom presets for different document types
🖼️ Interface Preview
Dark Mode (Default)

Obsidian-inspired dark theme with purple accents for optimal late-night work sessions.
Toggle between themes instantly — your preference is saved between sessions.
🧠 Architecture Overview
A modular pipeline ensures clarity, stability, and extensibility.
PDF Input
↓
┌─────────────────┐
│ 1. EXTRACT │ ← Native PyMuPDF or OCR (Tesseract/OCRmyPDF)
└─────────────────┘
↓
┌─────────────────┐
│ 2. TRANSFORM │ ← Clean text, remove headers/footers, detect structure
└─────────────────┘
↓
┌─────────────────┐
│ 3. RENDER │ ← Generate Markdown with headings, lists, links
└─────────────────┘
↓
┌─────────────────┐
│ 4. EXPORT │ ← Write .md file + optional image assets
└─────────────────┘
↓
Markdown Output
📦 Module Overview
Each module maintains a single responsibility, ensuring the system remains clean, testable, and easy to extend.
| Module | Purpose |
|---|---|
| extract.py | PDF text extraction, OCR orchestration, structural block formation, encrypted-PDF support |
| tables.py | Advanced table detection and Markdown table reconstruction (cell grouping, alignment rows, safety handling) |
| equations.py | Math detection heuristics and conversion to inline/display LaTeX-compatible Markdown |
| transform.py | Text cleanup, header/footer removal, block classification, integration of table/math structures into the document flow |
| render.py | Final Markdown generation with headings, lists, links, images, tables, and math rendering |
| pipeline.py | End-to-end orchestration: extract → structure → transform → tables → equations → render |
| models.py | Typed data structures: PageText, Block, Line, Span, Options |
| utils.py | Platform helpers, OCR detection utilities, file handling, temp-file safety, logging tools |
| app_gui.py | Tkinter GUI: profiles, theming, progress tracking, encrypted-PDF dialogs |
| cli.py | Command-line interface for batch automation, scripting, and secured password prompts |
🏗️ Design Philosophy
⭐ Single Responsibility per Module
Each component focuses on doing one thing well:
- extraction
- structure analysis
- tables
- equations
- transformation
- rendering
- user workflow (GUI/CLI)
This eliminates cross-contamination and makes features reliable and testable.
🔄 Data Flow Overview
PDF → extract.py
↓
Raw blocks (text, spans, geometry)
↓
transform.py
↓
Structured blocks (paragraphs, lists, headings)
↓
tables.py
↓
Table blocks (aligned cells, rows, Markdown pipe tables)
↓
equations.py
↓
Equation blocks ($...$ / $$...$$)
↓
render.py
↓
Final Markdown output
This modular pipeline allows tables and equations to slot into the flow cleanly, without affecting the behavior of unrelated modules.
🔍 Why This Matters
- Researchers get reliable table conversion
- Academics get inline and display math suitable for Obsidian, Jupyter, pandoc, and mkdocs
- Developers get an extensible pipeline where new block types can be added without breaking existing components
- Users get clearer, more accurate Markdown output without extra configuration
🚀 Ready for Future Expansion
With tables and equations now modularized, future upgrades can be added easily:
- Better table spanning (row/column spans)
- Math rendering modes (strict, permissive)
- Charts detection
- Diagram extraction
- Semantic tagging for AI/LLM workflows
This architecture forms a scalable base for long-term evolution of pdfmd.
⚙️ Installation
Quick Install (Development)
# Clone repository
git clone https://github.com/M1ck4/pdfmd.git
cd pdfmd
# Install dependencies manually
pip install pymupdf pillow pytesseract ocrmypdf
# Launch GUI
python -m pdfmd.app_gui
Install as Package (Recommended)
# Clone and install
git clone https://github.com/M1ck4/pdfmd.git
cd pdfmd
# Minimal install (native text extraction only)
pip install -e .
# OR: Full install with OCR support (recommended)
pip install -e ".[full]"
# Use the CLI
pdfmd input.pdf
Platform-Specific Setup
Windows
Install Tesseract OCR:
- Download: https://github.com/UB-Mannheim/tesseract/wiki
- Run installer and check "Add to PATH"
Install Python packages (if running without the package installer):
pip install pymupdf pillow pytesseract ocrmypdf
Verify installation:
tesseract --version
macOS
# Install Tesseract
brew install tesseract
# Install OCRmyPDF (recommended)
brew install ocrmypdf
# Install Python dependencies manually
pip install pymupdf pillow pytesseract ocrmypdf
Linux (Ubuntu/Debian)
# System dependencies
sudo apt-get update
sudo apt-get install tesseract-ocr ocrmypdf
# Python dependencies
pip install pymupdf pillow pytesseract ocrmypdf
Windows Standalone Executable
Download the latest .exe from Releases — no Python required.
Note: Tesseract must still be installed separately for OCR functionality.
🚀 Usage
🖥️ GUI Application
Launching the GUI
The graphical interface can be started in several ways:
# If installed as a package:
python -m pdfmd.app_gui
# Direct execution (from package directory):
python app_gui.py
Quick Workflow
Basic Conversion in 7 Steps:
📂 Select Input PDF
- Click Browse... next to "Input PDF"
- The path is remembered between sessions
💾 Choose Output Location
- Output path is auto-suggested as input.md
- Click Browse... to change location
- Or manually edit the path
⚙️ Select Profile
- Choose from built-in profiles:
- Default — Balanced settings for most documents
- Academic article — Optimized for papers with equations
- Slides / handouts — Image export + page breaks
- Scan-heavy / OCR-first — Force OCR on all pages
- Or use your custom saved profiles
🔧 Configure Options
OCR Mode:
- off — Native text extraction (fastest)
- auto — Detect scanned pages automatically ✨ recommended
- tesseract — Force OCR on all pages
- ocrmypdf — High-quality OCR preprocessing
Output Options:
- ☑️ Preview first 3 pages — Quick test before full conversion
- ☑️ Export images — Save images to _assets/ folder
- ☑️ Insert page breaks — Add --- between pages
Text Processing:
- ☑️ Remove repeating header/footer — Auto-detect and strip
- ☑️ Promote CAPS to headings — Treat ALL CAPS as section titles
- ☑️ Defragment short orphans — Merge isolated short lines
Fine-Tuning:
- Heading size ratio (1.0-2.5) — Font size threshold for headings
- Orphan max length (10-120) — Character limit for line merging
▶️ Convert
- Click Convert → Markdown button
- Or press Ctrl+Enter (keyboard shortcut)
- The conversion runs in the background
📊 Monitor Progress
- Watch the progress bar for completion status
- View live logs in the console panel
- See current status in the status line
- Press Stop or Esc to cancel if needed
✅ Open Output
- When complete, click Open folder link
- Opens the output directory in your file manager
- Your Markdown file is ready to use!
Profiles
Built-in Profiles:
- Default — Balanced settings for general documents, auto-detect headers/footers, smart heading detection
- Academic article — Optimized for research papers, higher orphan threshold (60 chars), tighter heading ratio (1.10), OCR mode: auto
- Slides / handouts — Export images automatically, insert page breaks between slides, disabled header/footer removal, OCR mode: auto
- Scan-heavy / OCR-first — Force Tesseract OCR on all pages, no CAPS-to-heading conversion, best for old scanned documents
Custom Profiles:
- Adjust settings to your preference
- Click Save profile...
- Enter a profile name
- Profile is saved and available for future use
To delete: select a custom profile, click Delete profile, and confirm. Built-in profiles cannot be deleted.
Keyboard Shortcuts
| Shortcut | Action |
|---|---|
| Ctrl+O | Browse for input PDF |
| Ctrl+Shift+O | Browse for output location |
| Ctrl+Enter | Start conversion |
| Esc | Stop/cancel conversion |
GUI Features
🎨 Themes
Toggle between Dark and Light themes. Theme preference is saved between sessions.
- Dark — Obsidian-inspired dark mode with deep blacks and purple accents
- Light — Clean light mode with high contrast
🔒 Password Protection
For encrypted PDFs:
- Start conversion as normal
- Password dialog appears automatically
- Enter password (input is hidden)
- Click OK or press Enter
- Conversion proceeds with decrypted content
The password is used in memory only; it is never logged, saved to disk, or passed to external processes.
⚠️ Cancellation
Stop a long-running conversion by clicking Stop or pressing Esc. The current step completes, then the conversion stops gracefully.
📝 Live Logging
The console panel shows real-time progress:
[pipeline] Extracting text...
[pipeline] Transforming pages...
[profile] Applied profile: Academic article
[pipeline] Removed repeating edges → header='Chapter 1', footer='- - 1'
[pipeline] Rendering Markdown...
[pipeline] Saved → /path/to/output.md
💾 Persistent Settings
Automatically saved between sessions:
- Last input/output paths
- Current options and settings
- Custom profiles
- Theme preference
Configuration stored at: ~/.pdfmd_gui.json
Common GUI Workflows
Quick Preview:
- Select your PDF
- Check Preview first 3 pages
- Click Convert
- Review output to verify settings
- Uncheck preview and run full conversion
Batch Processing:
- Convert first document with desired settings
- Click Save profile... with descriptive name
- For subsequent documents: Select new input PDF, choose your saved profile, click Convert
Scanned Documents:
- Select scanned PDF
- Set OCR mode to auto or tesseract
- Consider enabling Export images
- Click Convert
- Monitor OCR progress in logs (may take several minutes)
Academic Papers:
- Select Academic article profile
- Verify settings (OCR: auto, heading ratio: 1.10)
- Click Convert
- Tables and equations are automatically detected and formatted
📟 Command-Line Interface
Installation & Running
The CLI can be invoked in several ways:
# If installed as a package (recommended):
pdfmd input.pdf
# Using Python module syntax (from project root):
python -m pdfmd.cli input.pdf
Quick Start
# Basic conversion (writes input.md next to the PDF)
pdfmd report.pdf
# Specify output file
pdfmd report.pdf -o notes.md
# Auto-detect scanned pages and OCR as needed
pdfmd scan.pdf --ocr auto
# Batch convert multiple PDFs
pdfmd *.pdf --ocr auto -o converted_md/
Common CLI Workflows
📄 Standard Documents
# Clean, text-based PDFs (articles, reports, books)
pdfmd document.pdf
# With statistics summary
pdfmd document.pdf --stats
🔍 Scanned Documents
# Auto-detect and OCR scanned pages only
pdfmd scan.pdf --ocr auto
# Force Tesseract OCR on all pages
pdfmd scan.pdf --ocr tesseract
# Use OCRmyPDF for high-quality layout preservation
pdfmd scan.pdf --ocr ocrmypdf
🖼️ Documents with Images
# Extract images to _assets/ folder with references
pdfmd presentation.pdf --export-images
# OCR + images for scanned slides
pdfmd slides.pdf --ocr auto --export-images
📋 Quick Preview
# Process only first 3 pages (fast inspection)
pdfmd long_paper.pdf --preview-only
# Preview with stats
pdfmd long_paper.pdf --preview-only --stats
🔒 Password-Protected PDFs
# Interactive password prompt (secure, no command-line exposure)
pdfmd encrypted.pdf
# The CLI will detect encryption and prompt for password
# Password is never logged or shown in process listings
🔇 Scripting & Automation
# Quiet mode (errors only, good for scripts)
pdfmd batch/*.pdf --ocr auto --quiet --no-progress
# Non-interactive mode (fails if password needed)
pdfmd document.pdf --no-progress -q
🔬 Debug & Verbose Output
# Basic verbose output
pdfmd document.pdf -v
# Debug-level detail (includes pipeline stages)
pdfmd document.pdf -vv
# Without colored output (for logs)
pdfmd document.pdf -v --no-color
Full Options Reference
usage: pdfmd [-h] [-o OUTPUT] [--ocr {off,auto,tesseract,ocrmypdf}]
             [--export-images] [--page-breaks] [--preview-only]
             [--no-progress] [-q] [-v] [--stats] [--no-color] [--version]
             INPUT_PDF [INPUT_PDF ...]
Convert PDF files to clean, Obsidian-ready Markdown with table and math-aware conversion.
Runs fully offline: no uploads, no telemetry, no cloud dependencies.
positional arguments:
  INPUT_PDF             Path(s) to input PDF file(s). Multiple files supported.
options:
  -h, --help            Show this help message and exit
  -o OUTPUT, --output OUTPUT
                        Output path. For single input: .md file path.
                        For multiple inputs: directory (created if needed).
                        Default: writes input.md next to each PDF.
  --ocr {off,auto,tesseract,ocrmypdf}
                        OCR mode (default: off):
                          off       — use native text extraction only
                          auto      — detect scanned pages, OCR as needed
                          tesseract — force page-by-page Tesseract OCR
                          ocrmypdf  — pre-process with OCRmyPDF for high-fidelity layout
  --export-images       Export images to _assets/ folder next to output file,
                        with Markdown image references appended to document.
  --page-breaks         Insert '---' horizontal rule between pages in output.
  --preview-only        Only process first 3 pages (useful for quick inspection
                        of large documents or testing settings).
  --no-progress         Disable terminal progress bar (useful for logging).
  -q, --quiet           Suppress non-error messages. Only show errors.
  -v, --verbose         Increase verbosity:
                          -v  — show conversion stages and logs
                          -vv — debug-level detail with full pipeline info
  --stats               Print document statistics after conversion:
                        word count, headings, tables, lists.
  --no-color            Disable colored terminal output (for log files).
  --version             Print version and exit.
Advanced CLI Examples
Batch Processing:
# Convert all PDFs in current directory
pdfmd *.pdf --ocr auto -o markdown_output/
# Convert with consistent settings
for pdf in papers/*.pdf; do
  pdfmd "$pdf" --ocr auto --stats
done
Tables and Math:
# The CLI automatically detects and converts:
# • Text tables → GitHub-flavored Markdown tables
# • Unicode math (E = mc², x₁₀², α + β³) → LaTeX-style equations
# • Existing LaTeX math is preserved
pdfmd academic_paper.pdf --stats
Integration with Other Tools:
# Pipeline with other markdown tools
pdfmd input.pdf -o - | pandoc -f markdown -o output.docx
# Generate and preview
pdfmd paper.pdf && code paper.md
# Conversion + commit
pdfmd updated.pdf && git add updated.md && git commit -m "Update notes"
Output Behavior
Single PDF:
pdfmd input.pdf
# Creates: input.md (same directory as input.pdf)
pdfmd input.pdf -o notes.md
# Creates: notes.md (current directory)
pdfmd input.pdf -o ~/Documents/notes.md
# Creates: ~/Documents/notes.md
Multiple PDFs:
pdfmd file1.pdf file2.pdf file3.pdf
# Creates: file1.md, file2.md, file3.md (next to originals)
pdfmd *.pdf -o converted/
# Creates: converted/file1.md, converted/file2.md, ...
# Directory is created if it doesn't exist
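The output rules above (default next to the PDF, a file path for a single input, a directory for multiple inputs) can be expressed as a small pure function. This is a sketch of the documented behaviour only; the name `resolve_outputs` is hypothetical, and unlike the CLI it does not create the directory.

```python
from pathlib import Path
from typing import List, Optional


def resolve_outputs(inputs: List[str], output: Optional[str] = None) -> List[Path]:
    """Map each input PDF to the Markdown path it would be written to."""
    pdfs = [Path(p) for p in inputs]
    if output is None:
        # Default: input.md next to each PDF.
        return [pdf.with_suffix(".md") for pdf in pdfs]
    if len(pdfs) == 1:
        # Single input: -o names the .md file itself.
        return [Path(output)]
    # Multiple inputs: -o names a directory that holds one .md per PDF.
    out_dir = Path(output)
    return [out_dir / (pdf.stem + ".md") for pdf in pdfs]
```

A helper like this can be useful in scripts that need to know where the Markdown will land before invoking the converter.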
Image Export:
pdfmd slides.pdf --export-images
# Creates:
# slides.md
# slides_assets/
# ├── img_001_01.png
# ├── img_001_02.png
# └── ...
# Images referenced at end of slides.md
CLI Error Handling
Missing Dependencies:
$ pdfmd scan.pdf --ocr tesseract
Error: OCR mode 'tesseract' selected but Tesseract binary is not available.
Install Tesseract from: https://github.com/UB-Mannheim/tesseract/wiki
Then run: pip install pytesseract pillow
Password-Protected Files:
$ pdfmd encrypted.pdf
PDF is password protected. Enter password (input will be hidden):
[password entry is hidden]
Converting encrypted.pdf → encrypted.md
Invalid Files:
$ pdfmd missing.pdf
Error: input file not found: missing.pdf
$ pdfmd document.txt
Error: The input file must have a .pdf extension.
CLI Security Notes
Password Handling:
- Interactive prompts only — passwords never passed via command-line arguments
- No process exposure — passwords not visible in ps or process listings
- Memory-only — passwords never logged, cached, or persisted to disk
- No network — all processing is local, passwords never transmitted
Privacy:
- 100% offline — no uploads, no telemetry, no external API calls
- No cloud dependencies — all OCR and processing happens on your machine
- Output is unencrypted — protect .md files according to your environment's security requirements
CLI Performance Tips
Large Documents:
# Preview first to check settings (fast)
pdfmd large_book.pdf --preview-only --stats
# Then convert full document
pdfmd large_book.pdf --ocr auto
# Disable progress bar for slight speed improvement
pdfmd large_book.pdf --no-progress
OCR Performance:
# Fastest: only OCR scanned pages
pdfmd mixed.pdf --ocr auto
# Medium: page-by-page Tesseract (more accurate for scans)
pdfmd scan.pdf --ocr tesseract
# Slowest but best quality: OCRmyPDF preprocessing
pdfmd scan.pdf --ocr ocrmypdf
Batch Optimization:
# Process in parallel (Unix/Linux/macOS):
ls *.pdf | xargs -n 1 -P 4 pdfmd --ocr auto --quiet
# Windows PowerShell parallel (requires PowerShell 7+):
Get-ChildItem *.pdf | ForEach-Object -Parallel { pdfmd $_.FullName --ocr auto --quiet } -ThrottleLimit 4
Exit Codes
- 0 — Success (all files converted)
- 1 — Error (one or more files failed)
# Use in scripts:
if pdfmd document.pdf --quiet; then
echo "Conversion successful"
else
echo "Conversion failed"
exit 1
fi
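The same exit-code check can be done from Python with `subprocess`. The sketch below assumes only what is documented here (the `pdfmd` command, the `--quiet` flag, and exit code 0 for success); the function name `convert_all` and the injectable `run` parameter are hypothetical conveniences for testing.

```python
import subprocess
from typing import Callable, Iterable, List


def convert_all(
    pdf_paths: Iterable[str],
    run: Callable[..., "subprocess.CompletedProcess"] = subprocess.run,
) -> List[str]:
    """Run `pdfmd <path> --quiet` per file and return the paths that failed."""
    failed = []
    for path in pdf_paths:
        result = run(["pdfmd", path, "--quiet"])
        # Exit code 0 means success; anything else is treated as a failure.
        if result.returncode != 0:
            failed.append(path)
    return failed
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.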
API Documentation
For developers who want to integrate pdfmd into their own Python code, a full, detailed API reference is available.
This document covers:
- Programmatic use of pdf_to_markdown
- All Options fields and behaviours
- Progress & logging callbacks
- Advanced / lower-level pipeline access
- Integration examples (scripts, pandoc, Jupyter)
📊 Configuration Options
Key Settings
Heading Size Ratio (1.0 to 2.5, default 1.15)
- Font size multiplier for heading detection
- Lower = more headings, Higher = fewer headings
- Example: Body text 11pt → headings must be ≥12.65pt
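The ratio test above amounts to a single comparison against the body font size. In this sketch the body size is estimated as the most common span size, which is an assumption rather than a documented detail of pdfmd, and both function names are hypothetical.

```python
from collections import Counter
from typing import Iterable


def estimate_body_size(span_sizes: Iterable[float]) -> float:
    """Take the most common font size (rounded to 0.1pt) as the body size."""
    counts = Counter(round(size, 1) for size in span_sizes)
    if not counts:
        raise ValueError("no font sizes given")
    return counts.most_common(1)[0][0]


def is_heading(font_size: float, body_size: float, ratio: float = 1.15) -> bool:
    """Return True when a span is at least `ratio` times the body size."""
    if body_size <= 0:
        raise ValueError("body_size must be positive")
    return font_size >= body_size * ratio
```

With 11pt body text and the default ratio, a 13pt line counts as a heading and a 12pt line does not; lowering the ratio to 1.05 would admit the 12pt line as well.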
Orphan Max Length (10 to 120, default 45)
- Maximum characters for orphan line merging
- Short isolated lines get merged into previous paragraph
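A deliberately naive sketch of orphan merging follows. It treats any non-empty line of at most `max_len` characters as an orphan and appends it to the previous line. The function name is hypothetical, and a real implementation would presumably exempt headings and list items, which this version does not.

```python
from typing import List


def merge_orphans(lines: List[str], max_len: int = 45) -> List[str]:
    """Append short lines to the preceding line instead of leaving them isolated."""
    merged: List[str] = []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip blank lines entirely
        if merged and len(text) <= max_len:
            # Short fragment: fold it into the previous line.
            merged[-1] = merged[-1] + " " + text
        else:
            merged.append(text)
    return merged
```

Raising `max_len` merges more aggressively; lowering it leaves more short lines standing on their own.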
CAPS to Headings (default: True)
- Treats ALL-CAPS or MOSTLY-CAPS lines as headings
Remove Headers/Footers (default: True)
- Detects repeating text across 3+ pages
- Removes "Page N", "- - 1", footer patterns
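One way to detect repeating page edges is to compare the first and last line of every page after masking digits, so that "Page 1" and "Page 2" count as the same pattern. The sketch below follows that idea with the 3-page threshold mentioned above; the function names and the digit-masking rule are assumptions, not pdfmd's documented algorithm.

```python
import re
from collections import Counter
from typing import List, Set


def _normalize(line: str) -> str:
    """Replace digit runs with '#' so 'Page 1' and 'Page 2' compare equal."""
    return re.sub(r"\d+", "#", line.strip())


def find_repeating_edges(pages: List[List[str]], min_pages: int = 3) -> Set[str]:
    """Return normalized first/last lines that occur on at least `min_pages` pages."""
    counts: Counter = Counter()
    for lines in pages:
        if not lines:
            continue
        # Count each edge pattern once per page, even if top and bottom match.
        counts.update({_normalize(lines[0]), _normalize(lines[-1])})
    return {pattern for pattern, n in counts.items() if n >= min_pages}


def strip_repeating_edges(pages: List[List[str]], min_pages: int = 3) -> List[List[str]]:
    """Drop first/last lines whose normalized form repeats across pages."""
    repeating = find_repeating_edges(pages, min_pages)
    cleaned = []
    for lines in pages:
        body = list(lines)
        if body and _normalize(body[0]) in repeating:
            body = body[1:]
        if body and _normalize(body[-1]) in repeating:
            body = body[:-1]
        cleaned.append(body)
    return cleaned
```

Each page is given as a list of text lines; documents shorter than `min_pages` pages are returned unchanged because no pattern can reach the threshold.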
Defragment Short Lines (default: True)
- Merges short orphan lines into paragraphs
- Improves reading flow
Profile Storage
Settings saved to: ~/.pdfmd_gui.json
The GUI persists your last-used options to this config file. The CLI currently uses its own defaults and command-line flags.
Safe to edit manually for advanced customization.
To reset GUI settings:
rm ~/.pdfmd_gui.json
🗂️ Example Output
Before (PDF)
INTRODUCTION
This is a para-
graph with hyph-
enation.
• Bullet one
• Bullet two
Page 1
After (Markdown)
# Introduction
This is a paragraph with hyphenation.
- Bullet one
- Bullet two
Improvements:
- ✅ Hyphenation repaired (para-graph → paragraph)
- ✅ Extra spaces normalized
- ✅ Bullets converted to Markdown
- ✅ Page numbers removed
- ✅ Heading properly formatted
Table Example
Before (PDF):
Name     Age    City
Alice    30     New York
Bob      25     London
Carol    35     Tokyo
After (Markdown):
| Name | Age | City |
|:------|----:|:---------|
| Alice | 30 | New York |
| Bob | 25 | London |
| Carol | 35 | Tokyo |
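For column-aligned text, where cells are separated by two or more spaces, the conversion can be sketched as below. The function names are hypothetical, the separator row is left unaligned for simplicity, and the consistency check is a stand-in for the more conservative heuristics the tool describes.

```python
import re
from typing import List


def split_columns(line: str) -> List[str]:
    """Split a line into cells wherever two or more whitespace characters appear."""
    return [cell for cell in re.split(r"\s{2,}", line.strip()) if cell]


def to_pipe_table(lines: List[str]) -> str:
    """Render whitespace-aligned rows as a Markdown pipe table.

    Raises ValueError unless there are at least two rows that share one
    column count of two or more, so ordinary prose is rejected.
    """
    rows = [split_columns(line) for line in lines if line.strip()]
    widths = {len(row) for row in rows}
    if len(rows) < 2 or len(widths) != 1 or widths.pop() < 2:
        raise ValueError("not a column-aligned table")
    header, body = rows[0], rows[1:]
    out = ["| " + " | ".join(header) + " |",
           "|" + "|".join("---" for _ in header) + "|"]
    out += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(out)
```

Because single spaces never split a cell, a value such as "New York" stays in one column as long as the columns themselves are separated by at least two spaces.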
Math Example
Before (PDF):
The equation E = mc² shows mass-energy equivalence.
For integrals: ∫₀^∞ e^(-x²) dx = √π/2
After (Markdown):
The equation $E = mc^{2}$ shows mass-energy equivalence.
For integrals: $\int_{0}^{\infty} e^{-x^{2}} dx = \sqrt{\pi}/2$
⚡ Performance Tips
For Large Documents (100+ pages)
Test with preview mode first:
pdfmd large.pdf --preview-only --ocr auto
Disable OCR if not needed:
pdfmd text-only.pdf --ocr off
Only export images when necessary — Each image adds processing time
For Slow Systems
- Use Tesseract instead of OCRmyPDF — Faster but less accurate
- Close other applications — OCR is CPU-intensive
- Process in batches — Split large PDFs first
Batch Processing Performance
# Process 4 PDFs simultaneously (Unix, requires GNU parallel)
find . -name "*.pdf" | parallel -j 4 pdfmd {} --ocr auto
OCR Strategy
Auto-Detection & Engine Selection:
| Platform | Primary OCR | Fallback | Notes |
|---|---|---|---|
| Windows | Tesseract | Native PyMuPDF | Fast, lightweight |
| macOS | OCRmyPDF | Tesseract | Best layout preservation |
| Linux | OCRmyPDF | Tesseract | Ideal for servers |
Scanned PDF Detection:
The auto mode analyzes the first 3 pages for:
- Text density (< 50 chars/page = likely scanned)
- Large images covering >30% of page area
- Combined low text + high image coverage triggers OCR
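The detection rule above can be sketched as a predicate over per-page measurements. The `PageStats` type and both function names are hypothetical, and treating the document as scanned when any sampled page matches is an assumption; the thresholds (50 characters, 30% image coverage, first 3 pages) come from the list above.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class PageStats:
    """Hypothetical per-page measurements used only for this sketch."""
    char_count: int        # extractable text characters on the page
    image_coverage: float  # fraction of page area covered by images, 0.0 to 1.0


def page_looks_scanned(page: PageStats,
                       min_chars: int = 50,
                       max_image_coverage: float = 0.30) -> bool:
    """Low text density combined with high image coverage suggests a scan."""
    return page.char_count < min_chars and page.image_coverage > max_image_coverage


def needs_ocr(pages: Sequence[PageStats], sample_size: int = 3) -> bool:
    """Inspect the first few pages and report whether any of them looks scanned."""
    return any(page_looks_scanned(p) for p in pages[:sample_size])
```

Sampling only the first pages keeps the check cheap, at the cost of missing scanned pages that appear later in a mixed document.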
🛠️ Troubleshooting
Common Issues
"PyMuPDF (fitz) is not installed"
pip install pymupdf
"Tesseract binary is not available on PATH"
Windows: Reinstall Tesseract and check "Add to PATH" during installation
macOS: brew install tesseract
Linux: sudo apt-get install tesseract-ocr
Verify installation:
tesseract --version
"OCRmyPDF not found"
pip install ocrmypdf
Or on macOS:
brew install ocrmypdf
OCR Output is Poor Quality
- Check original scan quality — Blurry scans won't improve
- Try different OCR mode:
pdfmd scan.pdf --ocr ocrmypdf # Better than tesseract
- Ensure Tesseract language data is installed
- For very poor scans, consider rescanning at higher DPI
Password Dialog Not Appearing (GUI)
- Ensure PyMuPDF is up to date:
pip install --upgrade pymupdf
- Check that PDF actually requires a password (not just restricted)
- Try running from command line to see error messages
GUI Not Opening
# Check if tkinter is installed (comes with Python on most systems)
python -c "import tkinter"
# On Linux, you may need to install:
sudo apt-get install python3-tk
Command Not Found: pdfmd
If installed as a package but command not found:
# Ensure pip install directory is in PATH, or use:
python -m pdfmd.cli input.pdf
GUI-Specific Issues
Conversion Hangs
Problem: Progress bar stuck, no log updates
Solution:
- Press Esc or click Stop to cancel
- Try with Preview first 3 pages to diagnose
- Check if PDF is corrupted or extremely large
- Try different OCR mode
Password Dialog Loops
Problem: Password dialog keeps appearing
Solution:
- Verify password is correct
- Check if PDF has user vs. owner password restrictions
- Try opening PDF in another viewer to test password
Output Folder Link Doesn't Work
Problem: "Open folder" link doesn't open file manager
Solution:
- Manually navigate to output file location
- Check file was actually created (look in logs)
- On Linux, ensure xdg-open is available
Performance Issues
Slow OCR
Problem: OCR taking too long (>5 minutes for 50 pages)
Expected Behavior:
- Tesseract: ~1 page/second at 300 DPI
- OCRmyPDF: ~2-3 seconds/page (includes pre-processing)
Solutions:
- Use preview mode to test settings first
- Consider --ocr auto instead of forcing OCR on all pages
- Disable image export if not needed
- Close resource-heavy applications
High Memory Usage
Problem: Application using excessive RAM
Causes:
- Large PDFs (>100 pages)
- High-resolution images
- OCR processing
Solutions:
- Process in preview mode first
- Split large PDFs into smaller chunks
- Disable image export
- Increase system swap space
🤗 Contributing
Contributions welcome! You can help by:
- Testing with difficult PDFs (scanned, multi-column, handwritten)
- Improving OCR heuristics and accuracy
- Enhancing Markdown formatting logic
- Expanding profile presets
- Adding unit tests
- Improving documentation
📜 License
MIT License. Free for personal and commercial use.
See LICENSE file for details.
🙏 Acknowledgments
Built with:
- PyMuPDF — Fast PDF rendering and text extraction
- Tesseract OCR — Google's open-source OCR engine
- OCRmyPDF — High-quality OCR layer addition
- Pillow — Image processing
- pytesseract — Python Tesseract wrapper
Special Thanks
- The PyMuPDF team for excellent PDF handling capabilities
- The Tesseract OCR community for continuous improvements
- All contributors and testers who help improve pdfmd
🔗 Links
- Repository: https://github.com/M1ck4/pdfmd
- Issues: https://github.com/M1ck4/pdfmd/issues
- Releases: https://github.com/M1ck4/pdfmd/releases
- Documentation: This README and inline code comments
📞 Support
Getting Help
- Check Documentation: Read this README thoroughly
- Search Issues: Check if your problem is already reported
- Ask Questions: Open a GitHub issue with the question label
- Report Bugs: Provide detailed information (see Contributing section)
Feature Requests
We welcome feature requests! Please open an issue with:
- Clear description of the proposed feature
- Use cases and benefits
- Any implementation ideas (optional)
💡 Tips & Best Practices
For Researchers
- Use Academic article profile for papers
- Enable --stats to verify table/equation extraction
- Preview mode helps dial in heading detection
- Save custom profiles for different journal formats
For Legal Professionals
- Always verify password security (in-memory only)
- Use --quiet mode for scripting document workflows
- Batch processing for discovery documents
- Consider splitting very large files first
For Developers
- Study the modular architecture for extending features
- Each module has clear input/output contracts
- Add custom profiles via JSON config
- Hook into pipeline stages for custom processing
For General Users
- Start with default settings and iterate
- Use preview mode to find optimal settings
- Save profiles once you find settings you like
- Keyboard shortcuts speed up workflow significantly
Free. Open. Useful. Private. Always.