
Marker PDF: The Fast, Accurate PDF-to-Markdown Engine You Should Be Using
Convert PDFs and other documents to clean Markdown, JSON, or HTML with Marker: fast, accurate, with support for OCR, tables, images, and LLM-enhanced formatting.
If you regularly work with documents (PDFs, PPTs, DOCX files, scanned images), you’ve probably wasted hours converting them into clean, editable text. Most tools either butcher the formatting or break when the layout gets even slightly complex. That’s the problem Marker solves.
Marker is a high-accuracy document conversion library that converts PDF, image, PPTX, DOCX, XLSX, HTML, and EPUB files into Markdown, JSON, HTML, or chunks with impressive consistency. It’s built for developers who want a reliable, programmatic way to extract structured text without wrestling with messy output.
Why Marker Stands Out
Marker is not a naive text scraper. It uses a full pipeline of:
- Providers → read information from PDFs or other formats
- Builders → create structured document blocks
- Processors → format tables, equations, lists, code, forms, and inline math
- Renderers → output Markdown, HTML, JSON, or chunk representations
Because of this architecture, Marker handles:
- Tables
- Forms
- Footnotes
- Figures
- Multi-column layouts
- Code blocks
- Equations (inline + block with LaTeX)
- Images (extracted + referenced properly)
- Page headers/footers cleanup
Most converters either flatten this structure or destroy formatting. Marker preserves it.
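To make the four stages concrete, here is a toy sketch of a provider → builder → processor → renderer pipeline in plain Python. It is only an illustration of the idea: the Block class and every function name below are hypothetical and are not Marker's real classes or API.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Block:
    """A hypothetical structured block (not Marker's real class)."""
    kind: str  # e.g. "heading" or "text"
    text: str


def provider(raw: str) -> List[str]:
    """Provider stage: read the non-empty lines from the source."""
    return [line.rstrip() for line in raw.splitlines() if line.strip()]


def builder(lines: List[str]) -> List[Block]:
    """Builder stage: turn raw lines into typed blocks.

    Toy rule: an all-uppercase line is treated as a heading.
    """
    return [Block("heading" if line.isupper() else "text", line) for line in lines]


def title_case_headings(blocks: List[Block]) -> List[Block]:
    """Processor stage: clean up one block type and leave the rest alone."""
    return [
        Block(b.kind, b.text.title()) if b.kind == "heading" else b
        for b in blocks
    ]


def markdown_renderer(blocks: List[Block]) -> str:
    """Renderer stage: serialize the blocks to Markdown."""
    parts = [f"# {b.text}" if b.kind == "heading" else b.text for b in blocks]
    return "\n\n".join(parts)


def run_pipeline(raw: str, processors: List[Callable[[List[Block]], List[Block]]]) -> str:
    """Chain the stages: provider -> builder -> processors -> renderer."""
    blocks = builder(provider(raw))
    for process in processors:
        blocks = process(blocks)
    return markdown_renderer(blocks)
```

Because each processor only touches the block types it understands, swapping the renderer or adding a processor does not disturb the other stages, which is the benefit of this kind of layered design.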
Installation
Marker requires Python 3.10+ and PyTorch.
Basic installation (PDF only)
pip install marker-pdf
Full installation (all file types: PDF, DOCX, PPTX, HTML, EPUB, etc.)
pip install "marker-pdf[full]"
That’s all you need to start converting documents.
A Simple PDF Conversion Takes Only a Few Lines
Here’s the smallest working Python example:
from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict
from marker.output import text_from_rendered

converter = PdfConverter(
    artifact_dict=create_model_dict(),
)
rendered = converter("FILEPATH.pdf")
text, _, images = text_from_rendered(rendered)
The output Markdown already includes:
- Extracted images
- Properly formatted tables
- LaTeX equations
- Cleaned text blocks
You can save it directly or post-process it further.
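Saving the result could look something like the sketch below. It assumes that images is a dict mapping file names to PIL-style image objects with a save() method; save_conversion and the default output name are hypothetical helpers, not part of Marker.

```python
from pathlib import Path
from typing import Any, Dict


def save_conversion(text: str, images: Dict[str, Any], out_dir: str,
                    name: str = "output.md") -> Path:
    """Write the Markdown text and its extracted images into out_dir.

    Assumption: each value in `images` exposes a PIL-like .save(path) method.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    md_path = out / name
    md_path.write_text(text, encoding="utf-8")

    # Keep the original file names so image references in the Markdown resolve.
    for filename, image in images.items():
        image.save(out / filename)
    return md_path
```

Called as save_conversion(text, images, "md_out"), this keeps the Markdown file and its images side by side in one folder.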
Using the CLI
Convert a single file
marker_single /path/to/file.pdf
Useful options:
- --paginate_output
- --page_range "0,5-10"
- --output_format markdown|json|html|chunks
- --use_llm (for maximum accuracy)
- --force_ocr (fixes bad text or inline math)
- --strip_existing_ocr
- --disable_image_extraction
- --redo_inline_math
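If you call the CLI from a script, a small wrapper can assemble the argument list for you. The sketch below uses only flags listed above; build_marker_single_cmd is a hypothetical helper name, and the command is built as a list so it can be passed to subprocess.run without shell quoting issues.

```python
from typing import List, Optional

# Output formats named in the option list above.
OUTPUT_FORMATS = {"markdown", "json", "html", "chunks"}


def build_marker_single_cmd(pdf_path: str,
                            output_format: str = "markdown",
                            page_range: Optional[str] = None,
                            use_llm: bool = False,
                            force_ocr: bool = False) -> List[str]:
    """Assemble a marker_single command line as an argument list."""
    if output_format not in OUTPUT_FORMATS:
        raise ValueError(f"unsupported output format: {output_format!r}")

    cmd = ["marker_single", pdf_path, "--output_format", output_format]
    if page_range:
        cmd += ["--page_range", page_range]
    if use_llm:
        cmd.append("--use_llm")
    if force_ocr:
        cmd.append("--force_ocr")
    return cmd
```

To run it, pass the list to subprocess.run(cmd, check=True) on a machine where Marker is installed.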
Convert a folder
marker /path/to/input/folder
Supports multiprocessing:
- --workers N
- --converter_cls (PdfConverter, TableConverter, OCRConverter)
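The same folder conversion can be sketched in Python. The version below is a single-process illustration, not a replacement for the CLI: plan_outputs and convert_folder are hypothetical helpers, and the Marker imports sit inside the function so the planning step works without Marker installed. The key design point is that the models are loaded once and the converter is reused for every file.

```python
from pathlib import Path
from typing import Dict


def plan_outputs(in_dir: str, out_dir: str) -> Dict[Path, Path]:
    """Map each PDF in in_dir to the Markdown file it should produce."""
    src = Path(in_dir)
    return {pdf: Path(out_dir) / f"{pdf.stem}.md" for pdf in sorted(src.glob("*.pdf"))}


def convert_folder(in_dir: str, out_dir: str) -> None:
    """Convert every PDF in a folder, reusing one converter for all files."""
    # Imported lazily: these are the same imports used in the example above.
    from marker.converters.pdf import PdfConverter
    from marker.models import create_model_dict
    from marker.output import text_from_rendered

    converter = PdfConverter(artifact_dict=create_model_dict())  # models load once
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for pdf, target in plan_outputs(in_dir, out_dir).items():
        text, _, _images = text_from_rendered(converter(str(pdf)))
        target.write_text(text, encoding="utf-8")
```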
Multi-GPU mode
NUM_DEVICES=4 NUM_WORKERS=15 marker_chunk_convert ../pdf_in ../md_out
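This mode spreads large batch jobs across several GPUs: NUM_DEVICES sets the number of GPUs, and NUM_WORKERS sets the number of parallel processes on each.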
Advanced Configuration
You can customize everything via ConfigParser:
from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict
from marker.config.parser import ConfigParser

config = {
    "output_format": "json",
    "force_ocr": False,
    "paginate_output": True
}
config_parser = ConfigParser(config)

converter = PdfConverter(
    config=config_parser.generate_config_dict(),
    artifact_dict=create_model_dict(),
    processor_list=config_parser.get_processors(),
    renderer=config_parser.get_renderer(),
    llm_service=config_parser.get_llm_service()
)
This gives you control over:
- OCR
- Layout detection
- Output format
- Page selection
- Math formatting
- LLM enhancement
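When several scripts share the same settings, it can help to wrap the configuration above in a small factory. The sketch below assumes that config keys mirror the CLI flag names from earlier in the article; make_config, build_converter, and the KNOWN_KEYS set are hypothetical, and the set only covers options mentioned in this post.

```python
from typing import Any, Dict

# The defaults shown in the configuration example above.
DEFAULTS: Dict[str, Any] = {
    "output_format": "json",
    "force_ocr": False,
    "paginate_output": True,
}

# Assumption: config keys mirror the CLI flags listed earlier.
KNOWN_KEYS = set(DEFAULTS) | {
    "page_range", "use_llm", "strip_existing_ocr",
    "disable_image_extraction", "redo_inline_math",
}


def make_config(**overrides: Any) -> Dict[str, Any]:
    """Merge overrides into the defaults and reject misspelled option names."""
    unknown = set(overrides) - KNOWN_KEYS
    if unknown:
        raise KeyError(f"unknown option(s): {sorted(unknown)}")
    return {**DEFAULTS, **overrides}


def build_converter(config: Dict[str, Any]):
    """Build a PdfConverter from a config dict, as in the example above."""
    # Imported lazily so make_config can be used without Marker installed.
    from marker.config.parser import ConfigParser
    from marker.converters.pdf import PdfConverter
    from marker.models import create_model_dict

    parser = ConfigParser(config)
    return PdfConverter(
        config=parser.generate_config_dict(),
        artifact_dict=create_model_dict(),
        processor_list=parser.get_processors(),
        renderer=parser.get_renderer(),
        llm_service=parser.get_llm_service(),
    )
```

For example, build_converter(make_config(output_format="markdown", page_range="0,5-10")) would request Markdown output for a subset of pages, and a typo in an option name fails early instead of being silently ignored.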
Hybrid LLM Mode (Optional but Powerful)
Marker’s --use_llm mode merges its structural extraction with an LLM (Gemini, Ollama, Claude, OpenAI, etc.) to:
- Merge tables across multiple pages
- Fix inline math
- Substantially improve table accuracy
- Clean complex layouts
- Post-process with your own custom prompt
If you need maximum quality, hybrid mode is the best choice.
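In Python, hybrid mode is switched on through the config dict. The sketch below turns it on only when an API key is available and otherwise falls back to the plain pipeline. Treat the details as assumptions: the use_llm and gemini_api_key config keys are taken to mirror the CLI flags of the same names, the GOOGLE_API_KEY variable name is just a convention chosen for this example, and llm_config is a hypothetical helper.

```python
import os
from typing import Dict, Mapping, Optional


def llm_config(env: Optional[Mapping[str, str]] = None,
               key_var: str = "GOOGLE_API_KEY") -> Dict[str, object]:
    """Return config entries that enable LLM mode only if a key is set.

    Assumptions: "use_llm" and "gemini_api_key" are valid config keys,
    and the key is stored in the environment variable named by key_var.
    """
    env = os.environ if env is None else env
    api_key = env.get(key_var)
    if not api_key:
        # No key available: stay on the plain, non-LLM pipeline.
        return {"use_llm": False}
    return {"use_llm": True, "gemini_api_key": api_key}
```

The returned dict can be merged into whatever config you pass to ConfigParser, for example {"output_format": "markdown", **llm_config()}.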
Extracting Tables Only
Marker includes a dedicated table extractor:
from marker.converters.table import TableConverter
from marker.models import create_model_dict
from marker.output import text_from_rendered

converter = TableConverter(
    artifact_dict=create_model_dict(),
)
rendered = converter("FILEPATH.pdf")
text, _, images = text_from_rendered(rendered)
With:
--force_layout_block Table
--output_format json
you get precise cell positions.
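Once you have the JSON, pulling out the cell positions is a simple tree walk. The sketch below assumes the output is a nested tree in which each node has a block_type, a bbox, and an optional children list, and that cells use the type name "TableCell". Check these field names against your own output before relying on them; the sample dict is made up for illustration.

```python
from typing import Any, Dict, Iterator, List, Tuple


def iter_blocks(node: Dict[str, Any], block_type: str) -> Iterator[Dict[str, Any]]:
    """Depth-first walk yielding every node of the requested block type."""
    if node.get("block_type") == block_type:
        yield node
    for child in node.get("children") or []:
        yield from iter_blocks(child, block_type)


def cell_positions(document: Dict[str, Any]) -> List[Tuple[Any, str]]:
    """Collect (bbox, html) pairs for every table cell in the tree."""
    return [(cell.get("bbox"), cell.get("html", ""))
            for cell in iter_blocks(document, "TableCell")]


# Made-up sample with the assumed shape, for illustration only.
sample = {
    "block_type": "Document",
    "children": [{
        "block_type": "Page",
        "children": [{
            "block_type": "Table",
            "bbox": [0, 0, 100, 50],
            "children": [
                {"block_type": "TableCell", "bbox": [0, 0, 50, 25],
                 "html": "<td>A</td>", "children": None},
                {"block_type": "TableCell", "bbox": [50, 0, 100, 25],
                 "html": "<td>B</td>", "children": None},
            ],
        }],
    }],
}
```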
OCR-Only Mode
When you just want OCR:
marker_single FILENAME.pdf --converter_cls marker.converters.ocr.OCRConverter
Or in Python:
from marker.converters.ocr import OCRConverter
from marker.models import create_model_dict

converter = OCRConverter(
    artifact_dict=create_model_dict(),
)
rendered = converter("FILEPATH.pdf")
Interactive Streamlit App
Try Marker visually:
pip install streamlit streamlit-ace
marker_gui
This is useful for testing configuration options quickly.
Performance
Marker is built for speed:
- ~0.18s per page (H100 GPU)
- 3.5GB VRAM per worker
- 25–122 pages/sec in batch mode
According to the project's published benchmarks, it matches or beats tools like LlamaParse, Mathpix, and Docling on accuracy while running faster.
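These figures make for easy back-of-the-envelope capacity planning. The sketch below uses only the per-page time and per-worker VRAM quoted above and assumes throughput scales linearly with the number of workers, which is an idealization; the function names are hypothetical.

```python
import math

SECONDS_PER_PAGE = 0.18   # per-page time quoted above (H100)
VRAM_PER_WORKER_GB = 3.5  # per-worker VRAM quoted above


def max_workers(gpu_vram_gb: float, reserve_gb: float = 0.0) -> int:
    """How many workers fit in the given VRAM, after an optional reserve."""
    return max(0, math.floor((gpu_vram_gb - reserve_gb) / VRAM_PER_WORKER_GB))


def pages_per_second(workers: int) -> float:
    """Idealized throughput, assuming perfectly linear scaling."""
    return workers / SECONDS_PER_PAGE


def estimated_seconds(pages: int, workers: int = 1) -> float:
    """Idealized wall-clock time for a batch of pages."""
    return pages * SECONDS_PER_PAGE / max(1, workers)
```

On an 80GB card this gives 22 workers, and 22 / 0.18 is roughly 122 pages per second, which lines up with the upper batch figure above. Real jobs will come in under that because of I/O and uneven page complexity.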
Who Should Use Marker
Marker is ideal for:
- PDF → Markdown automation
- RAG pipelines needing clean chunkable text
- OCR-heavy workflows
- Processing research papers, books, financial documents, engineering drawings
- Integrating document conversion into APIs or backend services
If accuracy and structure matter, Marker is one of the strongest open-source options.
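For the RAG use case, the Markdown output is straightforward to split into retrieval chunks. Here is a minimal sketch that starts a new chunk at every Markdown heading. It is a generic post-processing step written for this article, not Marker's own chunks output format, and chunk_by_heading is a hypothetical helper.

```python
import re
from typing import List


def chunk_by_heading(markdown: str) -> List[str]:
    """Split Markdown so that each chunk starts at an ATX heading (#, ##, ...)."""
    # Zero-width split: break before any line that starts with 1-6 '#' and a space.
    parts = re.split(r"(?m)^(?=#{1,6} )", markdown)
    return [part.strip() for part in parts if part.strip()]
```

One caveat: a line starting with "# " inside a code block (a Python comment, for instance) would also trigger a split, so a production chunker should track code fences.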
Final Thoughts
Marker delivers the balance of:
- speed
- accuracy
- deep configuration
- LLM-enhanced cleanup
- support for multiple document types
Whether you're building a full-scale document pipeline or just converting PDFs for analysis, Marker is absolutely worth integrating into your workflow.

About Murali Anand
AI Engineer specializing in machine learning, LLM integration, and intelligent systems. Passionate about building cutting-edge AI solutions.