PDF OCR Processor

Advanced PDF processing with AI-powered OCR, text extraction, and selectable text overlays using Ollama models

🚀 Features

- AI-Powered OCR using Ollama models (llava, moondream, etc.)
- Modular Architecture with clear separation of concerns
- Multiple Output Formats:
  - SVG with selectable text overlays
  - Raw text extraction
  - JSON metadata
- Image Enhancement with multiple strategies
- Robust Error Handling with configurable retries
- Parallel Processing for batch operations
- CLI Interface with progress tracking

🛠️ System Architecture

┌─────────────────────────────────────────────────┐
│               PDF OCR Processor                 │
├─────────────────┬───────────────────────────────┤
│  ┌────────────┐ │  ┌─────────────────────────┐  │
│  │ PDF        │ │  │      OCRProcessor       │  │
│  │ Processor  ├─┼─▶│  - Text extraction      │  │
│  └────────────┘ │  │  - Ollama integration   │  │
│                 │  └─────────────┬───────────┘  │
│  ┌────────────┐ │  ┌─────────────▼───────────┐  │
│  │ Image      │ │  │      SVG Generator      │  │
│  │ Enhancer   ├─┼─▶│  - Text overlay         │  │
│  └────────────┘ │  │  - Searchable output    │  │
└─────────────────┴───────────────────────────────┘
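
The SVG Generator box in the diagram describes a text overlay placed on top of each page image. A minimal sketch of that idea follows, assuming OCR results arrive as words with pixel bounding boxes. `Word` and `build_svg` are hypothetical names used for illustration only; the project's `svg_generator.py` may work differently.

```python
from dataclasses import dataclass
from typing import List
from xml.sax.saxutils import escape, quoteattr


@dataclass
class Word:
    """One recognized word and its bounding box in page pixels (hypothetical type)."""
    text: str
    x: float
    y: float
    width: float
    height: float


def build_svg(image_href: str, page_width: int, page_height: int,
              words: List[Word]) -> str:
    """Return an SVG page: the scanned image plus invisible, selectable text."""
    parts = [
        '<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{page_width}" height="{page_height}" '
        f'viewBox="0 0 {page_width} {page_height}">',
        # The rendered page image is the visible layer.
        f'<image href={quoteattr(image_href)} x="0" y="0" '
        f'width="{page_width}" height="{page_height}"/>',
    ]
    for word in words:
        # SVG positions text by its baseline, so use the bottom of the box.
        baseline = word.y + word.height
        parts.append(
            f'<text x="{word.x}" y="{baseline}" font-size="{word.height}" '
            f'textLength="{word.width}" lengthAdjust="spacingAndGlyphs" '
            f'fill-opacity="0">{escape(word.text)}</text>'
        )
    parts.append("</svg>")
    return "\n".join(parts)
```

Setting `fill-opacity="0"` keeps the text out of sight while leaving it in the document, and `textLength` stretches each word to the width of its box so a selection roughly follows the scanned glyphs.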

📦 Installation

Prerequisites

Python 3.8+
Ollama (for OCR processing)

System dependencies:

# Ubuntu/Debian
sudo apt-get install -y tesseract-ocr poppler-utils

# macOS
brew install tesseract poppler

Install from source

# Clone the repository
git clone https://github.com/wronai/ocr.git
cd ocr

# Create and activate a virtual environment
python -m venv venv
source venv/bin/activate  # Linux/macOS

# Install dependencies
pip install -r requirements.txt
pip install -r requirements-dev.txt  # For development

🏁 Quick Start

Basic Usage

# Process a single PDF
python -m pdf_processor --input document.pdf --output output/

# Process all PDFs in a directory
python -m pdf_processor --input ./documents --output ./output --model llava:7b

# Show help
python -m pdf_processor --help

Python API

from pdf_processor import PDFProcessor
from pdf_processor.processing.pdf_processor import PDFProcessorConfig

# Configure the processor
config = PDFProcessorConfig(
    input_path="document.pdf",
    output_dir="./output",
    ocr_model="llava:7b",
    dpi=300,
    max_workers=4
)

# Process a document
processor = PDFProcessor(config)
result = processor.process_pdf("document.pdf")
print(f"Processed {result['pages_processed']} pages")
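
The single-document call above can be extended to a whole folder. The sketch below uses the standard library's thread pool; `process_all` and `find_pdfs` are hypothetical helpers, and the callable passed in stands in for `processor.process_pdf`, whose result shape beyond `pages_processed` is not documented here.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import Callable, Dict, Iterable, List, Tuple


def find_pdfs(directory: str) -> List[Path]:
    """Return the PDF files directly inside a directory, in sorted order."""
    return sorted(Path(directory).glob("*.pdf"))


def process_all(paths: Iterable, process_one: Callable[[str], dict],
                max_workers: int = 4) -> Tuple[Dict[str, dict], Dict[str, str]]:
    """Run process_one on every path in parallel.

    Returns (results, errors), both keyed by path, so that one failing
    document does not abort the rest of the batch.
    """
    results: Dict[str, dict] = {}
    errors: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_one, str(p)): str(p) for p in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # record the failure and keep going
                errors[path] = str(exc)
    return results, errors


# Possible usage with the processor created above (not executed here):
#   results, errors = process_all(find_pdfs("./documents"), processor.process_pdf)
```

Threads are a reasonable fit when most of the time is spent waiting on the Ollama server; CPU-heavy image enhancement might benefit from processes instead.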

⚙️ Configuration

Configuration File

Create a config.yaml file:

# config.yaml
input_path: ./documents    # Input file or directory
output_dir: ./output       # Output directory
ocr_model: llava:7b        # Ollama model to use
dpi: 300                   # Image resolution
max_workers: 4             # Number of worker threads
timeout: 300               # Timeout in seconds
max_retries: 3             # Max retry attempts
log_level: INFO            # Logging level
log_file: pdf_processor.log # Log file path

# Image enhancement strategies
enhancement_strategies:
  - original            # Keep original image
  - grayscale           # Convert to grayscale
  - adaptive_threshold  # Apply adaptive thresholding
  - contrast_stretch    # Stretch contrast
  - sharpen             # Sharpen image
  - denoise             # Remove noise
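
One way to consume a file like this is to load it into a typed settings object and reject unknown keys early. The sketch below is an assumption, not the project's loader: `Settings` and `load_settings` are hypothetical names, and the defaults simply mirror the example values above. The input dictionary would typically come from PyYAML's `yaml.safe_load`.

```python
from dataclasses import dataclass, field, fields
from typing import Any, Dict, List

# Strategy names as listed in the example config.
KNOWN_STRATEGIES = {
    "original", "grayscale", "adaptive_threshold",
    "contrast_stretch", "sharpen", "denoise",
}


@dataclass
class Settings:
    """Hypothetical settings object; defaults copy the example config.yaml."""
    input_path: str = "./documents"
    output_dir: str = "./output"
    ocr_model: str = "llava:7b"
    dpi: int = 300
    max_workers: int = 4
    timeout: int = 300
    max_retries: int = 3
    log_level: str = "INFO"
    log_file: str = "pdf_processor.log"
    enhancement_strategies: List[str] = field(default_factory=lambda: ["original"])


def load_settings(raw: Dict[str, Any]) -> Settings:
    """Validate a parsed config mapping and fill in defaults."""
    allowed = {f.name for f in fields(Settings)}
    unknown = set(raw) - allowed
    if unknown:
        raise ValueError(f"Unknown config keys: {sorted(unknown)}")
    settings = Settings(**raw)
    bad = [s for s in settings.enhancement_strategies if s not in KNOWN_STRATEGIES]
    if bad:
        raise ValueError(f"Unknown enhancement strategies: {bad}")
    if settings.dpi <= 0 or settings.max_workers <= 0:
        raise ValueError("dpi and max_workers must be positive")
    return settings


# With PyYAML installed, the mapping could be read like this (not executed here):
#   import yaml
#   with open("config.yaml") as fh:
#       settings = load_settings(yaml.safe_load(fh) or {})
```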

Environment Variables

export OLLAMA_HOST="http://localhost:11434"
export OLLAMA_MODEL="llava:7b"
export LOG_LEVEL="DEBUG"

🚀 Advanced Usage

Processing Options

# Process with specific DPI
python -m pdf_processor --input document.pdf --output output/ --dpi 400

# Limit number of pages to process
python -m pdf_processor --input document.pdf --output output/ --max-pages 10

# Use a specific enhancement strategy
python -m pdf_processor --input document.pdf --output output/ --enhance grayscale

# Process in verbose mode
python -m pdf_processor --input document.pdf --output output/ --verbose
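
For readers extending the CLI, the flags shown above could be declared with `argparse` roughly as follows. This is a sketch, not the contents of `cli.py`: which flags are required, the default values, and the help strings are all assumptions.

```python
import argparse

# Strategy names taken from the enhancement list in this README.
STRATEGIES = [
    "original", "grayscale", "adaptive_threshold",
    "contrast_stretch", "sharpen", "denoise",
]


def build_parser() -> argparse.ArgumentParser:
    """Declare the options used in the examples (defaults are assumed)."""
    parser = argparse.ArgumentParser(
        prog="pdf_processor",
        description="OCR PDFs with an Ollama vision model.",
    )
    parser.add_argument("--input", required=True,
                        help="PDF file or directory of PDFs")
    parser.add_argument("--output", required=True,
                        help="directory for generated files")
    parser.add_argument("--model", default="llava:7b",
                        help="Ollama model name")
    parser.add_argument("--dpi", type=int, default=300,
                        help="rendering resolution")
    parser.add_argument("--max-pages", type=int, default=None,
                        help="stop after this many pages")
    parser.add_argument("--enhance", choices=STRATEGIES, default="original",
                        help="image enhancement strategy")
    parser.add_argument("--verbose", action="store_true",
                        help="print detailed progress")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```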

Available Enhancement Strategies

- original: Keep original image (fastest)
- grayscale: Convert to grayscale (good for text-heavy documents)
- adaptive_threshold: Apply adaptive thresholding (good for low-quality scans)
- contrast_stretch: Stretch contrast to improve readability
- sharpen: Apply sharpening filter
- denoise: Remove image noise
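
To make two of these strategies concrete, here is a dependency-free sketch of grayscale conversion and contrast stretching on plain lists of pixel values. It only illustrates the arithmetic; the function names are hypothetical, and the project's `image_enhancement.py` presumably operates on real images through an imaging library.

```python
from typing import List, Sequence, Tuple


def to_grayscale(pixels: Sequence[Tuple[int, int, int]]) -> List[int]:
    """Convert RGB pixels to luma using the ITU-R BT.601 weights."""
    return [round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in pixels]


def contrast_stretch(gray: Sequence[int]) -> List[int]:
    """Linearly rescale values so the darkest maps to 0 and the brightest to 255."""
    lo, hi = min(gray), max(gray)
    if hi == lo:
        # A flat image has no contrast to stretch; return it unchanged.
        return list(gray)
    scale = 255 / (hi - lo)
    return [round((value - lo) * scale) for value in gray]


if __name__ == "__main__":
    faint_scan = [(120, 120, 120), (130, 130, 130), (150, 150, 150)]
    print(contrast_stretch(to_grayscale(faint_scan)))
```

Stretching helps faint scans because it spreads a narrow band of gray values across the full 0 to 255 range before the image is sent to the OCR model.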

🛠️ Development

Project Structure

pdf_processor/
├── __init__.py          # Package initialization
├── cli.py               # Command-line interface
├── config/              # Configuration files
├── models/              # Data models
│   ├── __init__.py
│   ├── ocr_result.py    # OCR result data structures
│   └── retry_config.py  # Retry configuration
├── processing/          # Core processing modules
│   ├── __init__.py
│   ├── image_enhancement.py  # Image processing
│   ├── ocr_processor.py      # OCR processing
│   ├── pdf_processor.py      # Main PDF processing
│   └── svg_generator.py      # SVG output generation
└── utils/               # Utility functions
    ├── file_utils.py    # File operations
    ├── logging_utils.py # Logging configuration
    └── validation_utils.py # Input validation
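
The tree lists `models/retry_config.py`, and the configuration exposes `max_retries`, but the retry logic itself is not shown in this README. A minimal sketch of configurable retries with exponential backoff might look like this; `RetryConfig`'s fields and `call_with_retries` are assumptions, not the project's actual definitions.

```python
import time
from dataclasses import dataclass
from typing import Callable, TypeVar

T = TypeVar("T")


@dataclass
class RetryConfig:
    """Hypothetical retry settings."""
    max_retries: int = 3     # retries after the first attempt
    base_delay: float = 1.0  # seconds to wait before the first retry
    backoff: float = 2.0     # multiplier applied to the delay each retry


def call_with_retries(func: Callable[[], T], config: RetryConfig,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call func, retrying on any exception up to config.max_retries times."""
    attempt = 0
    while True:
        try:
            return func()
        except Exception:
            if attempt >= config.max_retries:
                raise  # out of retries: surface the last error
            sleep(config.base_delay * config.backoff ** attempt)
            attempt += 1
```

Passing `sleep` as a parameter keeps the helper easy to test without real delays. A production version would likely retry only on specific errors, such as timeouts from the OCR backend.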

Running Tests

# Install test dependencies
pip install -r requirements-dev.txt

# Run all tests
pytest

# Run tests with coverage report
pytest --cov=pdf_processor --cov-report=html

📄 License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

🙏 Acknowledgments

The Ollama team for their amazing AI models
The PyMuPDF team for excellent PDF processing
All contributors who have helped improve this project

🛠️ Development Workflow

This project uses a script-based workflow for development tasks. All scripts are located in the scripts/ directory and can be run directly or via the Makefile.

Setup

Clone the repository and navigate to the project directory:
```
git clone https://github.com/wronai/ocr.git
cd ocr
```
Set up the development environment:
```
make install-dev
```
This will:
- Create and activate a virtual environment
- Install all development dependencies
- Set up pre-commit hooks

Common Development Tasks

# Run tests
make test

# Run tests with coverage
make test-cov

# Format code
make format

# Run linters
make lint

# Start development server
make dev-server

# Build documentation
make docs
make docs-serve  # Serve docs locally

Scripts Directory

All development and build scripts are located in the scripts/ directory. See scripts/README.md for detailed documentation of each script.

Docker Development

# Build Docker image
make docker-build

# Start services with Docker Compose
make docker-run

# Stop services
make docker-stop

🤝 Contributing

Contributions are welcome! Please follow these steps:

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add some amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

Please ensure your code follows our coding standards and includes appropriate tests.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

📜 Changelog

See CHANGELOG.md for a list of changes in each version.

python proc.py --model llava:7b --workers 4

View Results
- Open output/*_complete.svg in your browser
- Check details in output/processing_report.json

📚 Documentation

Full documentation is available in the docs/ directory.
