Marker PDF: The Fast, Accurate PDF-to-Markdown Engine You Should Be Using

If you regularly work with documents (PDFs, PowerPoint decks, DOCX files, scanned images), you’ve probably wasted hours converting them into clean, editable text. Most tools either butcher the formatting or break when the layout gets even slightly complex. That’s the problem Marker solves.

Marker is a high-accuracy document conversion library that converts PDF, image, PPTX, DOCX, XLSX, HTML, and EPUB files into Markdown, JSON, HTML, or chunks with impressive consistency. It’s built for developers who want a reliable, programmatic way to extract structured text without wrestling with messy output.

Why Marker Stands Out

Marker is not a naive text scraper. It uses a full pipeline of:

Providers → read information from PDFs or other formats
Builders → create structured document blocks
Processors → format tables, equations, lists, code, forms, and inline math
Renderers → output Markdown, HTML, JSON, or chunk representations
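
To make the four stages concrete, here is a toy sketch in plain Python. It is not Marker's code: the `Block` class and the `provide`, `build`, `process`, and `render` functions are hypothetical stand-ins that only mirror the provider → builder → processor → renderer flow described above.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical structured block; Marker's real block classes differ."""
    kind: str  # "heading" or "text" in this toy example
    text: str


def provide(raw: str) -> list[str]:
    # Provider: read raw lines from the source (here, a plain string).
    return [line.strip() for line in raw.splitlines() if line.strip()]


def build(lines: list[str]) -> list[Block]:
    # Builder: turn raw lines into typed blocks.
    # Toy rule: an all-uppercase line is treated as a heading.
    return [Block("heading" if line.isupper() else "text", line) for line in lines]


def process(blocks: list[Block]) -> list[Block]:
    # Processor: clean up each block (here, title-case the headings).
    return [
        Block(b.kind, b.text.title()) if b.kind == "heading" else b
        for b in blocks
    ]


def render(blocks: list[Block]) -> str:
    # Renderer: serialize the blocks as Markdown.
    return "\n\n".join(
        f"# {b.text}" if b.kind == "heading" else b.text for b in blocks
    )


def convert(raw: str) -> str:
    # Chain the four stages in order.
    return render(process(build(provide(raw))))


markdown = convert("INTRODUCTION\nMarker converts documents.\n")
print(markdown)
```

Because each stage only consumes the previous stage's output, any one of them can be swapped out (for example, a different renderer) without touching the others.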

Because of this architecture, Marker handles:

Tables
Forms
Footnotes
Figures
Multi-column layouts
Code blocks
Equations (inline + block with LaTeX)
Images (extracted + referenced properly)
Page headers/footers cleanup

Most converters either flatten this structure or destroy formatting. Marker preserves it.

Installation

Marker requires Python 3.10+ and PyTorch.

Basic installation (PDF only)

bash

pip install marker-pdf

Full installation (all file types: PDF, DOCX, PPTX, HTML, EPUB, etc.)

bash

pip install "marker-pdf[full]"

That’s all you need to start converting documents.

A Simple PDF Conversion Takes Only a Few Lines

Here’s the smallest working Python example:

python

from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict
from marker.output import text_from_rendered
 
converter = PdfConverter(
    artifact_dict=create_model_dict(),
)
 
rendered = converter("FILEPATH.pdf")
text, _, images = text_from_rendered(rendered)

The output Markdown already includes:

Extracted images
Properly formatted tables
LaTeX equations
Cleaned text blocks

You can save it directly or post-process it further.
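
Saving the result could look like the sketch below. It assumes `images` is a dictionary that maps a file name to an object with a PIL-style `.save(path)` method; check that against the Marker version you have installed. The `save_markdown` helper and its `stem` parameter are hypothetical names introduced for illustration.

```python
from pathlib import Path


def save_markdown(text: str, images: dict, out_dir: str, stem: str = "output") -> Path:
    """Write the Markdown text and any extracted images into out_dir.

    Assumption: each value in `images` exposes a PIL-style save(path) method.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    md_path = out / f"{stem}.md"
    md_path.write_text(text, encoding="utf-8")

    # Keep the dictionary keys as file names so image links in the Markdown resolve.
    for name, image in images.items():
        image.save(out / name)

    return md_path
```

With the variables from the example above, `save_markdown(text, images, "converted")` would write `converted/output.md` plus one file per extracted image.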

Using the CLI

Convert a single file

bash

marker_single /path/to/file.pdf

Useful options:

--paginate_output
--page_range "0,5-10"
--output_format markdown|json|html|chunks
--use_llm (for maximum accuracy)
--force_ocr (fixes bad text or inline math)
--strip_existing_ocr
--disable_image_extraction
--redo_inline_math
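
If you drive the CLI from a script, a small wrapper keeps the flags in one place. The sketch below uses only options listed above; the helper names are hypothetical. `expand_page_range` reflects one reading of the `"0,5-10"` syntax (zero-based page indices, inclusive ranges), which you should confirm against Marker's own documentation.

```python
import subprocess


def expand_page_range(spec: str) -> list[int]:
    """Expand a spec such as "0,5-10" into the page indices it covers."""
    pages = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start, end = part.split("-", 1)
            pages.extend(range(int(start), int(end) + 1))  # end treated as inclusive
        else:
            pages.append(int(part))
    return pages


def build_marker_single_cmd(path, output_format="markdown", page_range=None,
                            force_ocr=False, use_llm=False):
    """Assemble the argument list for marker_single."""
    cmd = ["marker_single", path, "--output_format", output_format]
    if page_range:
        cmd += ["--page_range", page_range]
    if force_ocr:
        cmd.append("--force_ocr")
    if use_llm:
        cmd.append("--use_llm")
    return cmd


def run_marker_single(path, **options):
    """Run the conversion; requires Marker to be installed and on PATH."""
    return subprocess.run(build_marker_single_cmd(path, **options), check=True)


cmd = build_marker_single_cmd("/path/to/file.pdf", output_format="json",
                              page_range="0,5-10", force_ocr=True)
print(cmd)
print(expand_page_range("0,5-10"))
```

Passing the arguments as a list, rather than a single shell string, avoids quoting problems with paths that contain spaces.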

Convert a folder

bash

marker /path/to/input/folder

Supports multiprocessing:

--workers N
--converter_cls (for example marker.converters.pdf.PdfConverter, marker.converters.table.TableConverter, or marker.converters.ocr.OCRConverter)

Multi-GPU mode

bash

NUM_DEVICES=4 NUM_WORKERS=15 marker_chunk_convert ../pdf_in ../md_out

This mode spreads large batch jobs across several GPUs.
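
The same multi-GPU invocation can be launched from Python by passing the two environment variables explicitly. This is a minimal sketch: `chunk_convert_invocation` and `run_chunk_convert` are hypothetical helper names, and only the command and variable names come from the shell line above.

```python
import os
import subprocess


def chunk_convert_invocation(in_dir, out_dir, num_devices, num_workers):
    """Return (env, argv) for a marker_chunk_convert run."""
    env = dict(os.environ)  # copy, so the parent environment is left untouched
    env["NUM_DEVICES"] = str(num_devices)
    env["NUM_WORKERS"] = str(num_workers)
    argv = ["marker_chunk_convert", in_dir, out_dir]
    return env, argv


def run_chunk_convert(in_dir, out_dir, num_devices, num_workers):
    """Launch the batch job; requires Marker to be installed and on PATH."""
    env, argv = chunk_convert_invocation(in_dir, out_dir, num_devices, num_workers)
    return subprocess.run(argv, env=env, check=True)


env, argv = chunk_convert_invocation("../pdf_in", "../md_out",
                                     num_devices=4, num_workers=15)
print(argv, env["NUM_DEVICES"], env["NUM_WORKERS"])
```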

Advanced Configuration

You can customize everything via ConfigParser:

python

from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict
from marker.config.parser import ConfigParser
 
config = {
    "output_format": "json",
    "force_ocr": False,
    "paginate_output": True
}
 
config_parser = ConfigParser(config)
 
converter = PdfConverter(
    config=config_parser.generate_config_dict(),
    artifact_dict=create_model_dict(),
    processor_list=config_parser.get_processors(),
    renderer=config_parser.get_renderer(),
    llm_service=config_parser.get_llm_service()
)

This gives you control over:

OCR
Layout detection
Output format
Page selection
Math formatting
LLM enhancement
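
One way to keep these settings manageable is to build the config dictionary in a small helper and validate it before any models load. The sketch below is an illustration, not part of Marker: `make_config` and `build_converter` are hypothetical names, and it assumes the `page_range` and `use_llm` config keys mirror the CLI flags of the same name. Marker is imported lazily, so the validation step runs without it.

```python
ALLOWED_FORMATS = {"markdown", "json", "html", "chunks"}


def make_config(output_format="markdown", force_ocr=False,
                paginate_output=False, page_range=None, use_llm=False):
    """Build a config dict from the options discussed in this article."""
    if output_format not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported output_format: {output_format!r}")
    config = {
        "output_format": output_format,
        "force_ocr": force_ocr,
        "paginate_output": paginate_output,
        "use_llm": use_llm,  # assumed to mirror the --use_llm flag
    }
    if page_range is not None:
        config["page_range"] = page_range  # assumed to mirror --page_range
    return config


def build_converter(config):
    """Create a PdfConverter from a config dict (imports Marker lazily)."""
    from marker.config.parser import ConfigParser
    from marker.converters.pdf import PdfConverter
    from marker.models import create_model_dict

    parser = ConfigParser(config)
    return PdfConverter(
        config=parser.generate_config_dict(),
        artifact_dict=create_model_dict(),
        processor_list=parser.get_processors(),
        renderer=parser.get_renderer(),
        llm_service=parser.get_llm_service(),
    )


config = make_config(output_format="json", paginate_output=True)
print(config)
```

Catching a mistyped output format here fails fast, before `create_model_dict()` spends time loading models.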

Hybrid LLM Mode (Optional but Powerful)

Marker’s --use_llm mode merges its structural extraction with an LLM (Gemini, Ollama, Claude, OpenAI, etc.) to:

Merge tables across multiple pages
Fix inline math
Improve table accuracy massively
Clean complex layouts
Post-process with your own custom prompt

If you need maximum quality, hybrid mode is the best choice.

Extracting Tables Only

Marker includes a dedicated table extractor:

python

from marker.converters.table import TableConverter
from marker.models import create_model_dict
from marker.output import text_from_rendered
 
converter = TableConverter(
    artifact_dict=create_model_dict(),
)
 
rendered = converter("FILEPATH.pdf")
text, _, images = text_from_rendered(rendered)

With:

bash

--force_layout_block Table
--output_format json

you get precise cell positions.
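
Once you have the JSON, pulling out the tables is a tree walk. The sketch below assumes each block is a dictionary with `block_type`, `bbox`, and `children` keys, which is an assumption about the JSON layout that you should check against real output. The sample data is hand-written for illustration and is not actual Marker output.

```python
def find_blocks(node, block_type):
    """Recursively collect every block of the given type from a JSON tree."""
    found = []
    if node.get("block_type") == block_type:
        found.append(node)
    # "children" may be missing or None on leaf blocks.
    for child in node.get("children") or []:
        found.extend(find_blocks(child, block_type))
    return found


# Hand-written sample, not real Marker output.
sample = {
    "block_type": "Document",
    "children": [
        {
            "block_type": "Page",
            "children": [
                {"block_type": "Text", "bbox": [10, 10, 200, 30], "children": None},
                {
                    "block_type": "Table",
                    "bbox": [10, 40, 200, 120],
                    "children": [
                        {"block_type": "TableCell", "bbox": [10, 40, 100, 80], "children": None},
                        {"block_type": "TableCell", "bbox": [100, 40, 200, 80], "children": None},
                    ],
                },
            ],
        }
    ],
}

tables = find_blocks(sample, "Table")
cells = find_blocks(tables[0], "TableCell")
print(len(tables), [cell["bbox"] for cell in cells])
```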

OCR-Only Mode

When you just want OCR:

bash

marker_single FILENAME.pdf --converter_cls marker.converters.ocr.OCRConverter

Or in Python:

python

from marker.converters.ocr import OCRConverter
from marker.models import create_model_dict  # needed for artifact_dict below
 
converter = OCRConverter(
    artifact_dict=create_model_dict()
)
rendered = converter("FILEPATH.pdf")

Interactive Streamlit App

Try Marker visually:

bash

pip install streamlit streamlit-ace
marker_gui

This is useful for testing configuration options quickly.

Performance

Marker is built for speed:

~0.18s per page (H100 GPU)
3.5GB VRAM per worker
25–122 pages/sec in batch mode

In the project's own published benchmarks, it matches or beats tools like Llamaparse, Mathpix, and Docling on accuracy while running faster.
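
As a back-of-the-envelope check rather than a benchmark, the per-page latency and per-worker memory figures can be combined into a throughput estimate. The sketch assumes an 80 GB GPU (the memory size is not stated above) and perfectly linear scaling across workers, which real workloads will not fully achieve.

```python
import math

SECONDS_PER_PAGE = 0.18   # per-page latency quoted above (H100)
VRAM_PER_WORKER_GB = 3.5  # per-worker memory quoted above


def max_workers(gpu_vram_gb: float) -> int:
    """How many workers fit in the given amount of GPU memory."""
    return math.floor(gpu_vram_gb / VRAM_PER_WORKER_GB)


def pages_per_second(workers: int) -> float:
    """Idealized throughput, assuming linear scaling with worker count."""
    return workers / SECONDS_PER_PAGE


workers = max_workers(80)  # assumed 80 GB card
throughput = pages_per_second(workers)
print(workers, round(throughput, 1))  # prints: 22 122.2
```

Under those assumptions the estimate lands at the upper end of the quoted batch range; expect lower numbers when workers contend for the GPU or when documents need heavy OCR.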

Who Should Use Marker

Marker is ideal for:

PDF → Markdown automation
RAG pipelines needing clean chunkable text
OCR-heavy workflows
Processing research papers, books, financial documents, engineering drawings
Integrating document conversion into APIs or backend services

If accuracy and structure matter, Marker is one of the strongest open-source options.

Final Thoughts

Marker delivers the balance of:

speed
accuracy
deep configuration
LLM-enhanced cleanup
support for multiple document types

Whether you're building a full-scale document pipeline or just converting PDFs for analysis, Marker is absolutely worth integrating into your workflow.

About Murali Anand