Marker PDF: The Fast, Accurate PDF-to-Markdown Engine You Should Be Using

by Murali Anand

Convert PDFs and documents to clean Markdown, JSON, or HTML with Marker: fast, accurate, with support for OCR, tables, images, and LLM-enhanced formatting.

If you work regularly with documents (PDFs, PPTs, DOCX files, scanned images), you’ve probably wasted hours converting them into clean, editable text. Most tools either butcher the formatting or break when the layout gets even slightly complex. That’s the problem Marker solves.


Marker is a high-accuracy document conversion library that converts PDF, image, PPTX, DOCX, XLSX, HTML, and EPUB files into Markdown, JSON, HTML, or chunks with impressive consistency. It’s built for developers who want a reliable, programmatic way to extract structured text without wrestling with messy output.

Why Marker Stands Out

Marker is not a naive text scraper. It uses a full pipeline of:

  • Providers → read information from PDFs or other formats
  • Builders → create structured document blocks
  • Processors → format tables, equations, lists, code, forms, and inline math
  • Renderers → output Markdown, HTML, JSON, or chunk representations

Because of this architecture, Marker handles:

  • Tables
  • Forms
  • Footnotes
  • Figures
  • Multi-column layouts
  • Code blocks
  • Equations (inline + block with LaTeX)
  • Images (extracted + referenced properly)
  • Page headers/footers cleanup

Most converters either flatten this structure or destroy formatting. Marker preserves it.
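To make the four-stage flow concrete, here is a toy sketch in plain Python. None of these names come from Marker's codebase: `Block`, `provide`, `build`, `process`, and `render` are hypothetical stand-ins that only mirror the provider → builder → processor → renderer order described above.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A hypothetical structured unit of a document (not Marker's real class)."""
    kind: str  # "heading" or "text" in this toy example
    text: str


def provide(raw: str) -> list[str]:
    """Provider: read the non-empty lines out of the source."""
    return [line for line in raw.splitlines() if line.strip()]


def build(lines: list[str]) -> list[Block]:
    """Builder: turn raw lines into typed blocks."""
    blocks = []
    for line in lines:
        # Toy rule: an all-uppercase line is treated as a heading.
        kind = "heading" if line.isupper() else "text"
        blocks.append(Block(kind, line.strip()))
    return blocks


def process(blocks: list[Block]) -> list[Block]:
    """Processor: clean up each block (here, title-case the headings)."""
    return [
        Block(b.kind, b.text.title()) if b.kind == "heading" else b
        for b in blocks
    ]


def render(blocks: list[Block]) -> str:
    """Renderer: emit Markdown from the processed blocks."""
    parts = [f"# {b.text}" if b.kind == "heading" else b.text for b in blocks]
    return "\n\n".join(parts)


markdown = render(process(build(provide("INTRODUCTION\nMarker converts documents.\n"))))
print(markdown)
```

Because each stage only consumes the previous stage's output, a renderer can be swapped (Markdown for HTML, say) without touching how blocks are built or cleaned.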

Installation

Marker requires Python 3.10+ and PyTorch.

Basic installation (PDF only)

bash
pip install marker-pdf

Full installation (all file types: PDF, DOCX, PPTX, HTML, EPUB, etc.)

bash
pip install "marker-pdf[full]"

That’s all you need to start converting documents.

A Simple PDF Conversion Takes Only a Few Lines

Here’s the smallest working Python example:

python
from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict
from marker.output import text_from_rendered

# Load the models once and reuse the converter for every file
converter = PdfConverter(
    artifact_dict=create_model_dict(),
)

rendered = converter("FILEPATH.pdf")
# text is the Markdown string; images holds the extracted images
text, _, images = text_from_rendered(rendered)

The output Markdown already includes:

  • Extracted images
  • Properly formatted tables
  • LaTeX equations
  • Cleaned text blocks

You can save it directly or post-process it further.
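Saving the result might look like the sketch below. It assumes `images` is a dictionary that maps file names to PIL-style image objects with a `.save(path)` method; `save_output` and the `output.md` file name are hypothetical choices for illustration, not part of Marker.

```python
from pathlib import Path


def save_output(text: str, images: dict, out_dir: str) -> Path:
    """Write the Markdown and every extracted image into out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    md_path = out / "output.md"
    md_path.write_text(text, encoding="utf-8")

    # Assumption: each value exposes a PIL-style .save(path) method.
    for name, image in images.items():
        image.save(out / name)

    return md_path
```

Called as `save_output(text, images, "converted")`, this keeps the images next to the Markdown file, so relative image references in the text have a chance of resolving.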

Using the CLI

Convert a single file

bash
marker_single /path/to/file.pdf

Useful options:

  • --paginate_output
  • --page_range "0,5-10"
  • --output_format markdown|json|html|chunks
  • --use_llm (for maximum accuracy)
  • --force_ocr (fixes bad text or inline math)
  • --strip_existing_ocr
  • --disable_image_extraction
  • --redo_inline_math
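The `--page_range` syntax is easy to misread, so here is a small sketch of how a value like "0,5-10" expands. It is not Marker's own parser; `expand_page_range` is a hypothetical helper that assumes page numbers are zero-based and ranges are inclusive.

```python
def expand_page_range(spec: str) -> list[int]:
    """Expand a spec such as "0,5-10" into a sorted list of page indices."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = (int(p) for p in part.split("-", 1))
            pages.update(range(start, end + 1))  # ranges are treated as inclusive
        else:
            pages.add(int(part))
    return sorted(pages)


print(expand_page_range("0,5-10"))  # [0, 5, 6, 7, 8, 9, 10]
```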

Convert a folder

bash
marker /path/to/input/folder

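Batch mode supports multiprocessing and lets you pick the converter:

  • --workers N
  • --converter_cls (full import path of PdfConverter, TableConverter, or OCRConverter, e.g. marker.converters.ocr.OCRConverter)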

Multi-GPU mode

bash
NUM_DEVICES=4 NUM_WORKERS=15 marker_chunk_convert ../pdf_in ../md_out

This mode is designed for large batch jobs that need more than one GPU.
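If the batch job is started from Python rather than a shell, the same invocation could be sketched as follows. `run_chunk_convert` and its `dry_run` parameter are hypothetical; the command name, the two directories, and the two environment variables are taken from the shell line above.

```python
import os
import subprocess


def run_chunk_convert(in_dir, out_dir, num_devices, num_workers, dry_run=True):
    """Build (and optionally launch) a marker_chunk_convert job."""
    cmd = ["marker_chunk_convert", in_dir, out_dir]
    # The script is configured through environment variables, as in the shell line.
    env = {**os.environ, "NUM_DEVICES": str(num_devices), "NUM_WORKERS": str(num_workers)}
    if not dry_run:
        # Requires Marker to be installed and on the PATH.
        subprocess.run(cmd, env=env, check=True)
    return cmd, env


cmd, env = run_chunk_convert("../pdf_in", "../md_out", num_devices=4, num_workers=15)
print(" ".join(cmd))
```

Leaving `dry_run=True` only builds the command, which is handy for logging what would be executed before committing GPUs to it.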

Advanced Configuration

You can customize everything via ConfigParser:

python
from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict
from marker.config.parser import ConfigParser

# Option names mirror the CLI flags
config = {
    "output_format": "json",
    "force_ocr": False,
    "paginate_output": True,
}

config_parser = ConfigParser(config)

# Let the parser choose the processors, renderer, and LLM service
converter = PdfConverter(
    config=config_parser.generate_config_dict(),
    artifact_dict=create_model_dict(),
    processor_list=config_parser.get_processors(),
    renderer=config_parser.get_renderer(),
    llm_service=config_parser.get_llm_service(),
)

rendered = converter("FILEPATH.pdf")

This gives you control over:

  • OCR
  • Layout detection
  • Output format
  • Page selection
  • Math formatting
  • LLM enhancement
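One way to keep these options tidy is a small helper that validates them before they reach `ConfigParser`. This is a sketch under assumptions: `make_config` and `build_converter` are hypothetical helpers, and the key names `page_range` and `use_llm` assume that `ConfigParser` accepts the same names as the CLI flags listed earlier.

```python
ALLOWED_FORMATS = {"markdown", "json", "html", "chunks"}


def make_config(output_format="markdown", page_range=None, force_ocr=False,
                paginate_output=False, use_llm=False) -> dict:
    """Build an options dict keyed by the CLI flag names."""
    if output_format not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported output_format: {output_format!r}")
    config = {
        "output_format": output_format,
        "force_ocr": force_ocr,
        "paginate_output": paginate_output,
        "use_llm": use_llm,
    }
    if page_range is not None:
        config["page_range"] = page_range  # e.g. "0,5-10"
    return config


def build_converter(config: dict):
    """Wire the options into a PdfConverter, following the example above."""
    # Imported lazily so make_config() can be used without Marker installed.
    from marker.config.parser import ConfigParser
    from marker.converters.pdf import PdfConverter
    from marker.models import create_model_dict

    parser = ConfigParser(config)
    return PdfConverter(
        config=parser.generate_config_dict(),
        artifact_dict=create_model_dict(),
        processor_list=parser.get_processors(),
        renderer=parser.get_renderer(),
        llm_service=parser.get_llm_service(),
    )
```

A call such as `build_converter(make_config("json", page_range="0,5-10"))` would then fail fast on a mistyped output format instead of partway through a long conversion.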

Hybrid LLM Mode (Optional but Powerful)

Marker’s --use_llm mode merges its structural extraction with an LLM (Gemini, Ollama, Claude, OpenAI, etc.) to:

  • Merge tables across multiple pages
  • Fix inline math
  • Improve table accuracy massively
  • Clean complex layouts
  • Post-process with your own custom prompt


If you need maximum quality, hybrid mode is the best choice.

Extracting Tables Only

Marker includes a dedicated table extractor:

python
from marker.converters.table import TableConverter
from marker.models import create_model_dict
from marker.output import text_from_rendered

converter = TableConverter(
    artifact_dict=create_model_dict(),
)

rendered = converter("FILEPATH.pdf")
text, _, images = text_from_rendered(rendered)

With:

bash
--force_layout_block Table
--output_format json

you get precise cell positions.
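Once the JSON is parsed into Python dictionaries, pulling out the tables and their cells is a tree walk. The sketch below assumes each block carries `block_type`, `children`, and `bbox` fields; those field names and the hand-written `sample` tree are assumptions for illustration, so check them against the JSON your Marker version actually emits.

```python
def find_blocks(node: dict, block_type: str) -> list[dict]:
    """Depth-first search for every block of the given type."""
    found = []
    if node.get("block_type") == block_type:
        found.append(node)
    for child in node.get("children") or []:
        found.extend(find_blocks(child, block_type))
    return found


# Hypothetical, hand-written sample; real output has more fields.
sample = {
    "block_type": "Document",
    "children": [
        {
            "block_type": "Page",
            "children": [
                {"block_type": "Text", "bbox": [50, 40, 500, 60], "children": None},
                {"block_type": "Table", "bbox": [50, 80, 500, 300], "children": [
                    {"block_type": "TableCell", "bbox": [50, 80, 275, 120], "children": None},
                    {"block_type": "TableCell", "bbox": [275, 80, 500, 120], "children": None},
                ]},
            ],
        }
    ],
}

for table in find_blocks(sample, "Table"):
    cells = find_blocks(table, "TableCell")
    print(table["bbox"], [cell["bbox"] for cell in cells])
```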

OCR-Only Mode

When you just want OCR:

bash
marker_single FILENAME.pdf --converter_cls marker.converters.ocr.OCRConverter

Or in Python:

python
from marker.converters.ocr import OCRConverter
from marker.models import create_model_dict

converter = OCRConverter(
    artifact_dict=create_model_dict(),
)
rendered = converter("FILEPATH.pdf")

Interactive Streamlit App

Try Marker visually:

bash
pip install streamlit streamlit-ace
marker_gui

This is useful for testing configuration options quickly.

Performance

Marker is built for speed:

  • ~0.18s per page (H100 GPU)
  • 3.5GB VRAM per worker
  • 25–122 pages/sec in batch mode

According to the project's published benchmarks, it matches or beats tools like Llamaparse, Mathpix, and Docling in accuracy while running faster.
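The per-page latency and per-worker VRAM figures in the list above are enough for a back-of-the-envelope throughput estimate. The sketch assumes an 80 GB GPU and perfectly linear scaling across workers, so treat the result as a rough ceiling rather than a benchmark; `estimate_throughput` is a hypothetical helper.

```python
import math


def estimate_throughput(gpu_vram_gb, vram_per_worker_gb, seconds_per_page):
    """Return (workers that fit on one GPU, projected pages per second)."""
    workers = math.floor(gpu_vram_gb / vram_per_worker_gb)
    return workers, workers / seconds_per_page


# 80 GB of GPU memory is an assumption; 3.5 and 0.18 come from the list above.
workers, pages_per_second = estimate_throughput(80, 3.5, 0.18)
print(workers, round(pages_per_second))  # 22 122
```

Under those assumptions, 22 workers at 0.18 s per page land close to the upper end of the batch-mode range quoted above.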

Who Should Use Marker

Marker is ideal for:

  • PDF → Markdown automation
  • RAG pipelines needing clean chunkable text
  • OCR-heavy workflows
  • Processing research papers, books, financial documents, engineering drawings
  • Integrating document conversion into APIs or backend services

If accuracy and structure matter, Marker is one of the strongest open-source options.

Final Thoughts

Marker delivers the balance of:

  • speed
  • accuracy
  • deep configuration
  • LLM-enhanced cleanup
  • support for multiple document types

Whether you're building a full-scale document pipeline or just converting PDFs for analysis, Marker is absolutely worth integrating into your workflow.

About Murali Anand

AI Engineer specializing in machine learning, LLM integration, and intelligent systems. Passionate about building cutting-edge AI solutions.