Onboarding: microsoft/markitdown

Item: microsoft/markitdown
Rating: 3
Author: RepoPilot

Generated by RepoPilot · 2026-05-05 · Source

Verdict

WAIT — Single-maintainer risk — review before adopting

Last commit 2w ago
5 active contributors
MIT licensed
CI configured
Tests present
⚠ Small team — 5 top contributors
⚠ Single-maintainer risk — top contributor 84% of commits

<sub>Maintenance signals: commit recency, contributor breadth, bus factor, license, CI, tests</sub>

TL;DR

MarkItDown is a lightweight Python library and CLI tool that converts a wide variety of file formats (PDF, DOCX, PPTX, XLSX, images, audio, HTML, CSV, JSON, XML, EPUB, ZIP, YouTube URLs) into Markdown text, preserving document structure (headings, tables, lists, links) for downstream consumption by LLMs and text analysis pipelines. It is optimized for token-efficient LLM ingestion rather than high-fidelity human-readable rendering. The repo is a monorepo containing the core markitdown package plus two optional add-on packages: markitdown-mcp (Model Context Protocol server) and markitdown-ocr (OCR-enhanced conversion via image recognition). Everything lives under packages/: the core engine is in packages/markitdown/src/markitdown/, packages/markitdown-mcp/src/markitdown_mcp/ wraps it as an MCP server, and packages/markitdown-ocr/src/markitdown_ocr/ provides per-format OCR converters (_pdf_converter_with_ocr.py, _docx_converter_with_ocr.py, _pptx_converter_with_ocr.py, _xlsx_converter_with_ocr.py) registered via a plugin entry point in _plugin.py.

Who it's for

AI/ML engineers and LLM application developers who need to ingest arbitrary office documents or web content into LLM context windows without writing custom parsers for each format. Also useful for data pipeline engineers building RAG (Retrieval-Augmented Generation) systems that need structured text extraction from binary file formats like DOCX, PPTX, and PDF.

Maturity & risk

The repo has a full CI setup via GitHub Actions (.github/workflows/tests.yml and pre-commit.yml), pre-commit hooks (.pre-commit-config.yaml), a Dockerfile for containerized deployment, a devcontainer config, and formal security/support/CoC documentation — all indicators of a well-maintained OSS project. Given Microsoft/AutoGen team ownership, PyPI publication, and the breadth of supported formats, this is production-ready for document ingestion use cases, though the OCR and MCP sub-packages appear newer and more experimental.

The core package has broad format support, so it carries a large transitive dependency surface (python-docx, python-pptx, openpyxl, pdfminer.six, beautifulsoup4, etc.). The [all] extras install everything at once, which can create version conflicts in complex environments. The OCR plugin (packages/markitdown-ocr) adds heavy vision-model dependencies and appears to be at an earlier stage of development than the core; its test data lives in tests/ocr_test_data/. The security notice in the README explicitly warns that MarkItDown executes with the current process's privileges and will follow arbitrary URLs or file paths, making it unsafe to use with untrusted input without sandboxing.

Active areas of work

Active development is visible on the OCR plugin (packages/markitdown-ocr/), which has an extensive set of test fixture files covering edge cases (complex layouts, multi-page, scanned invoices: e.g. pdf_scanned_invoice.pdf, docx_complex_layout.docx). The MCP sub-package (markitdown-mcp) integrates MarkItDown with the Model Context Protocol, suggesting recent work on LLM tool-use / agent integration. Dependabot is configured (.github/dependabot.yml) indicating ongoing dependency maintenance.

Get running

git clone https://github.com/microsoft/markitdown.git
cd markitdown
python -m venv .venv
source .venv/bin/activate
pip install -e 'packages/markitdown[all]'

Verify installation:

python -c "from markitdown import MarkItDown; print(MarkItDown().convert('README.md').text_content[:200])"

Or use the CLI:

markitdown README.md

Daily commands:

CLI usage:

markitdown path/to/file.pdf
markitdown path/to/file.docx -o output.md

Python API:

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert('path/to/file.pptx')
print(result.text_content)
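For batch jobs, the Python API call above can be wrapped in a small loop. The following is a minimal sketch and not part of MarkItDown: `convert_tree` and its `convert_fn` parameter are hypothetical names, and the converter is passed in as a callable so the loop itself does not depend on the library.

```python
from pathlib import Path
from typing import Callable


def convert_tree(
    src_dir: Path,
    out_dir: Path,
    convert_fn: Callable[[Path], str],
    suffixes: tuple[str, ...] = (".pdf", ".docx", ".pptx", ".xlsx"),
) -> list[Path]:
    """Convert every matching file under src_dir and write one .md per input.

    convert_fn is a hypothetical hook: any callable that takes a file path
    and returns Markdown text.
    """
    written: list[Path] = []
    for src in sorted(src_dir.rglob("*")):
        if not src.is_file() or src.suffix.lower() not in suffixes:
            continue
        # Mirror the source layout under out_dir, swapping the extension.
        dest = (out_dir / src.relative_to(src_dir)).with_suffix(".md")
        dest.parent.mkdir(parents=True, exist_ok=True)
        dest.write_text(convert_fn(src), encoding="utf-8")
        written.append(dest)
    return written
```

With the library installed, `convert_fn` could be `lambda p: md.convert(str(p)).text_content`, reusing the `md` object from the snippet above. Inputs that differ only by extension (for example `a.pdf` and `a.docx`) would write to the same `.md` file in this sketch.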

MCP server:

cd packages/markitdown-mcp
pip install -e .
python -m markitdown_mcp

Map of the codebase

packages/markitdown-ocr/src/markitdown_ocr/_plugin.py: Entry point registration for the OCR plugin — shows how converters are discovered and injected into the core MarkItDown registry.
packages/markitdown-ocr/src/markitdown_ocr/_ocr_service.py: Centralized OCR inference logic shared across all format-specific OCR converters.
packages/markitdown-ocr/src/markitdown_ocr/_pdf_converter_with_ocr.py: Canonical example of a format-specific converter class with OCR augmentation — use this as a template for new converters.
packages/markitdown-mcp/src/markitdown_mcp/__main__.py: Entry point for the MCP server; exposes MarkItDown as an LLM tool via the Model Context Protocol.
packages/markitdown-ocr/pyproject.toml: Defines plugin dependencies and entry points — critical for understanding how the OCR package hooks into markitdown core.
.github/workflows/tests.yml: CI test matrix — shows which Python versions and platforms are officially tested.
.pre-commit-config.yaml: Code quality gates enforced on every commit — check here before pushing to avoid CI failures.

How to make changes

To add a new file format converter: use packages/markitdown-ocr/src/markitdown_ocr/_pdf_converter_with_ocr.py as a template for the converter class structure, then register it in _plugin.py via the entry points pattern. To modify OCR behavior: edit _ocr_service.py, which centralizes the OCR inference logic. To add MCP tools: edit packages/markitdown-mcp/src/markitdown_mcp/__main__.py. Tests for the OCR plugin live in packages/markitdown-ocr/tests/, with fixture files in ocr_test_data/.
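The converter-plus-registration pattern described here can be illustrated in isolation. The sketch below uses hypothetical stand-ins only (`SketchResult`, `TsvConverter`, `SketchRegistry`, and the `accepts`/`convert` method shapes are assumptions made for illustration); the real base classes and signatures in markitdown and in _pdf_converter_with_ocr.py may differ, so read those files before copying anything.

```python
import io
from dataclasses import dataclass
from typing import BinaryIO


@dataclass
class SketchResult:
    """Hypothetical stand-in for a conversion result object."""
    markdown: str


class TsvConverter:
    """Toy converter: turns tab-separated text into a Markdown table."""

    def accepts(self, stream: BinaryIO, extension: str) -> bool:
        # Cheap check only; the stream is not consumed here.
        return extension.lower() == ".tsv"

    def convert(self, stream: BinaryIO, extension: str) -> SketchResult:
        text = stream.read().decode("utf-8")
        rows = [line.split("\t") for line in text.splitlines() if line.strip()]
        if not rows:
            return SketchResult(markdown="")
        header, body = rows[0], rows[1:]
        lines = [
            "| " + " | ".join(header) + " |",
            "| " + " | ".join("---" for _ in header) + " |",
        ]
        lines += ["| " + " | ".join(row) + " |" for row in body]
        return SketchResult(markdown="\n".join(lines))


class SketchRegistry:
    """Hypothetical registry: asks each converter in turn whether it accepts the input."""

    def __init__(self) -> None:
        self._converters: list = []

    def register(self, converter) -> None:
        # Newest first, so later registrations can override earlier ones.
        self._converters.insert(0, converter)

    def convert(self, stream: BinaryIO, extension: str) -> SketchResult:
        for converter in self._converters:
            if converter.accepts(stream, extension):
                return converter.convert(stream, extension)
        raise ValueError(f"no converter accepts {extension!r}")


registry = SketchRegistry()
registry.register(TsvConverter())
result = registry.convert(io.BytesIO(b"name\tqty\nbolt\t4\n"), ".tsv")
print(result.markdown)
```

The point of the split is that a new format only needs a class with an acceptance check and a conversion method, plus one registration call; nothing in the dispatch loop changes.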

Traps & gotchas

1) pip install markitdown installs only the minimal core; most format support requires pip install 'markitdown[all]'. The bracket syntax is mandatory and easy to forget.
2) The OCR plugin (markitdown-ocr) requires a separately configured vision/OCR model and likely environment variables or model paths not visible in the repo structure; check _ocr_service.py before assuming it works out of the box.
3) convert() (not shown but referenced in README) will follow arbitrary URLs, so never pass untrusted user input to it directly; use convert_local() or convert_stream() instead.
4) Python 3.10+ is required; 3.9 fails on some of the type annotation syntax used.
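One way to act on the untrusted-input warning is to vet the string before it reaches any conversion call. The following is a minimal sketch using only the standard library; `safe_local_path`, `BLOCKED_SCHEMES`, and the choice of schemes are assumptions for illustration, not something MarkItDown provides.

```python
from pathlib import Path
from urllib.parse import urlparse

# Assumption: the schemes this sketch treats as remote or special input.
BLOCKED_SCHEMES = {"http", "https", "ftp", "file", "data"}


def safe_local_path(user_input: str, base_dir: Path) -> Path:
    """Return a resolved path inside base_dir, or raise ValueError."""
    scheme = urlparse(user_input).scheme.lower()
    if scheme in BLOCKED_SCHEMES:
        raise ValueError(f"URI input is not allowed: {scheme}:")
    base = base_dir.resolve()
    # resolve() collapses ".." segments and follows symlinks before the check.
    candidate = (base / user_input).resolve()
    if not candidate.is_relative_to(base):
        raise ValueError("path escapes the allowed directory")
    return candidate
```

The returned path could then be handed to convert_local(), the method recommended above for input that should never be treated as a URL. This narrows what the process will read; it does not replace sandboxing.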

Concepts to learn

Model Context Protocol (MCP) — The markitdown-mcp sub-package exposes MarkItDown as an MCP tool server, meaning understanding MCP is required to contribute to or use that package for LLM agent tool-use.
Python Entry Points / Plugin System — markitdown-ocr registers itself into the core MarkItDown converter registry through entry points declared in pyproject.toml that point at _plugin.py; this is the mechanism for adding new format support without modifying core.
Retrieval-Augmented Generation (RAG) — MarkItDown's primary use case is preprocessing documents for RAG pipelines — understanding RAG explains why Markdown output (rather than raw text or HTML) is the target format.
Optical Character Recognition (OCR) in document pipelines — The markitdown-ocr package adds OCR to handle scanned PDFs and image-heavy DOCX/PPTX files where native text extraction fails — understanding OCR pipeline integration is needed to work on those converters.
EXIF Metadata — MarkItDown extracts EXIF metadata from images and audio files as part of conversion — relevant when contributing to image/audio converter code.
PEP 517 / pyproject.toml packaging — All three sub-packages use pyproject.toml-based builds; understanding this modern Python packaging standard is required to correctly add dependencies or entry points.

Related repos

deanmalmgren/textract — The closest alternative Python library for extracting text from arbitrary file formats — MarkItDown's README directly names it as a comparable tool.
microsoft/autogen — Built by the same AutoGen Team; MarkItDown is designed to feed documents into AutoGen-based LLM agent pipelines.
modelcontextprotocol/python-sdk — The MCP SDK used by packages/markitdown-mcp to expose MarkItDown as an LLM tool server.
pymupdf/pymupdf — Common alternative/complement for PDF parsing in Python pipelines where MarkItDown uses pdfminer.
jsvine/pdfplumber — Another PDF-to-structured-text library frequently used in the same LLM document ingestion use case.

PR ideas

To work on one of these in Claude Code or Cursor, paste: Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.

Add integration tests for markitdown-mcp package (packages/markitdown-mcp/tests/__init__.py is empty)

The markitdown-mcp package has a tests directory with only an empty __init__.py file, meaning there are zero tests for the MCP server functionality. This is a critical gap since the MCP server exposes MarkItDown to external LLM tool-calling clients. Tests should cover the tool registration, convert tool invocation, error handling, and the __main__.py entry point.

[ ] Inspect packages/markitdown-mcp/src/markitdown_mcp/__init__.py and __main__.py to understand the exposed MCP tool surface
[ ] Add packages/markitdown-mcp/tests/test_mcp_server.py with unit tests that mock the MCP server and verify tool registration and response format
[ ] Add a test for convert_local invocation via the MCP tool call, asserting Markdown output is returned correctly
[ ] Add a test for error cases (e.g., file not found, unsupported format) to verify the server returns proper MCP error responses
[ ] Wire the new tests into .github/workflows/tests.yml so they run in CI for the markitdown-mcp package

Add a dedicated CI workflow for the markitdown-ocr package that installs OCR system dependencies

The markitdown-ocr package has extensive test data (20+ test files across PDF, DOCX, PPTX, XLSX) and four test modules, but OCR packages like pytesseract or similar typically require system-level binaries (e.g., tesseract-ocr) that must be explicitly installed in CI. The existing .github/workflows/tests.yml likely does not install these, causing OCR tests to be silently skipped or fail. A dedicated workflow ensures OCR conversion quality is continuously verified.

[ ] Audit .github/workflows/tests.yml to confirm whether OCR system dependencies (e.g., tesseract-ocr, poppler-utils) are installed before running markitdown-ocr tests
[ ] Create .github/workflows/tests-ocr.yml that runs on ubuntu-latest, installs tesseract-ocr and poppler-utils via apt-get, installs the markitdown-ocr package with its extras, and runs pytest on packages/markitdown-ocr/tests/
[ ] Add a matrix across the scanned PDF test files in ocr_test_data (pdf_scanned_invoice.pdf, pdf_scanned_meeting_minutes.pdf, etc.) to ensure each is converted successfully
[ ] Add a badge or note in packages/markitdown-ocr/README.md referencing the new CI workflow status

Split packages/markitdown-ocr/src/markitdown_ocr/_ocr_service.py into format-specific OCR adapters to reduce coupling

There is a single _ocr_service.py file that is imported by four separate converter modules (_pdf_converter_with_ocr.py, _docx_converter_with_ocr.py, _pptx_converter_with_ocr.py, _xlsx_converter_with_ocr.py). Each format likely has different OCR needs (e.g., PDFs may use page-level rendering, DOCX/PPTX need image extraction from zip internals, XLSX needs cell-image handling). Centralizing all logic in one service file creates a large, hard-to-maintain module and makes it difficult for contributors to add new format support. Splitting it improves maintainability and testability.

[ ] Read _ocr_service.py in full to identify which methods are format-agnostic (e.g., image-to-text) vs. format-specific (e.g., PDF page rendering)
[ ] Extract a base _ocr_base.py with the shared image-to-text interface/class
[ ] Create _pdf_ocr_service.py, _docx_ocr_service.

Good first issues

1) Add unit tests under packages/markitdown-ocr/tests/ targeting the _ocr_service.py module. There are fixture files (e.g., pdf_scanned_invoice.pdf, docx_complex_layout.docx) but no visible test file covering OCR service internals.
2) Add a test file alongside packages/markitdown-mcp/tests/__init__.py, which exists but has no actual test cases visible in the file listing.
3) Document the environment variables and model configuration required by _ocr_service.py in packages/markitdown-ocr/README.md, which currently appears minimal based on the file listing.

Top contributors

@afourney — 54 commits
@lesyk — 5 commits
@richardye101 — 2 commits
@BetterAndBetterII — 2 commits
@jigangz — 1 commit
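The 84% top-contributor figure in the verdict is consistent with these counts, assuming the share is computed over the five listed contributors only (the report does not state the counting window). A short check:

```python
# Commit counts copied from the "Top contributors" list above.
commits = {
    "afourney": 54,
    "lesyk": 5,
    "richardye101": 2,
    "BetterAndBetterII": 2,
    "jigangz": 1,
}

total = sum(commits.values())  # 54 + 5 + 2 + 2 + 1 = 64
top_author, top_count = max(commits.items(), key=lambda kv: kv[1])
top_share = top_count / total  # 54 / 64 = 0.84375

print(f"{top_author}: {top_share:.0%} of {total} commits")
```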

Recent commits

a51f725 — Clarify security posture in READMEs (#1807) (afourney)
604bba1 — fix: handle deeply nested HTML that triggers RecursionError (#1644) (jigangz)
63cbbd9 — Updated warning about binding to non-local interfaces. (#1653) (afourney)
a6c8ac4 — Fix O(n) memory growth in PDF conversion by calling page.close() afte… (#1612) (lesyk)
c6308dc — [MS] Add OCR layer service for embedded images and PDF scans (#1541) (lesyk)
4a5340f — Bump version for release. (#1564) (afourney)
6b0fd15 — Remove onnxruntime<=1.20.1 Windows pin (#1551) (basnijholt)
2b6ec9f — Add text/markdown to Accept header (#1554) (afourney)
c83de14 — [MS] Extend table support for wide tables (#1552) (lesyk)
7fdaefb — Fix: PDF parsing doesn't support partially numbered lists (#1525) (lesyk)

Security observations

High · Arbitrary File Read via Path Traversal in File Conversion — packages/markitdown/src/markitdown/__init__.py. MarkItDown performs I/O with the privileges of the current process and accepts file paths for conversion. If user-supplied input is passed directly to convert functions (e.g., convert_local()), a malicious actor could supply path traversal sequences (e.g., '../../etc/passwd') to read arbitrary files accessible to the process. The README itself warns about this: 'MarkItDown performs I/O with the privileges of the current process.' Fix: Validate and sanitize all input file paths before processing. Use os.path.realpath() and check that the resolved path is within an expected base directory. Prefer convert_stream() over convert_local() when handling untrusted input.
High · Server-Side Request Forgery (SSRF) via URL Conversion — packages/markitdown/src/markitdown/__init__.py. The tool likely supports converting URLs (given its broad file/document conversion scope and convert_* function variants). If user-controlled URLs are passed to a conversion function without validation, attackers could cause the server to make requests to internal network resources, cloud metadata endpoints (e.g., http://169.254.169.254/), or other internal services. Fix: Implement a URL allowlist or blocklist. Block requests to private IP ranges (RFC1918), loopback addresses, and cloud metadata endpoints. Use a dedicated HTTP client with timeouts and redirect limits. Consider using a library like 'ssrf-guard' or equivalent validation logic.
High · Malicious Document Processing / Code Execution via Untrusted File Formats — packages/markitdown/src/markitdown/, packages/markitdown-ocr/src/markitdown_ocr/. Processing untrusted Office documents (DOCX, XLSX, PPTX, PDF) using libraries like python-docx, openpyxl, pptx, or pdfminer can expose the application to vulnerabilities in those parsing libraries, including XML External Entity (XXE) injection via maliciously crafted XML-based Office formats (.docx, .xlsx, .pptx are ZIP+XML). Embedded macros, external references, or exploitable parser bugs in dependencies could lead to information disclosure or denial of service. Fix: Keep all document-parsing dependencies up to date. Disable external entity resolution in XML parsers (use defusedxml where applicable). Run file conversion in a sandboxed environment (e.g., a container with restricted network and filesystem access). Consider scanning uploaded files with antivirus/antimalware before processing.
High · Command Injection Risk via ExifTool and FFmpeg Integration — Dockerfile, packages/markitdown/src/markitdown/__init__.py. The Dockerfile sets EXIFTOOL_PATH and FFMPEG_PATH environment variables and installs these binaries. If the application constructs shell commands using user-supplied filenames or parameters and passes them to subprocess calls involving exiftool or ffmpeg without proper escaping, it could be vulnerable to command injection. Fix: Always use subprocess with a list of arguments (never shell=True with user input). Validate and sanitize filenames before passing them to external processes. Ensure file paths are passed as positional arguments to subprocess, not interpolated into shell strings.
Medium · Dockerfile: Build-time ARG for USER Identity Without Enforcement Validation — Dockerfile. The Dockerfile uses ARG USERID=nobody and ARG GROUPID=nogroup as defaults for the USER directive. While using a non-root user is good practice, build-time ARG values can be overridden at build time (e.g., --build-arg USERID=0), potentially running the container as root. There is no validation that USERID/GROUPID are non-root values. Fix: Add a RUN check to prevent root from being set: 'RUN [ "$USERID" != "0" ] || (echo "Root user not allowed" && exit 1)'. Consider using a fixed non-root user created during the build rather than relying on build-time arguments.
Medium · Dockerfile: No Image Digest Pinning for Base Image — Dockerfile. The Dockerfile uses 'FROM python:3.13-slim-bullseye' without pinning to a specific image digest (e.g., python:3.13-slim-bullseye@sha256:...). This means the base image could change between builds. Fix: Pin the base image to a specific digest.
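For the SSRF item above, the address-range check can be sketched with the standard library. This is an illustration under stated assumptions, not MarkItDown code: `is_disallowed_address` is a hypothetical helper that only classifies an IP literal. A real guard would also have to resolve hostnames first and repeat the check on every redirect hop.

```python
import ipaddress


def is_disallowed_address(address: str) -> bool:
    """True if the IP literal points at a non-public destination."""
    ip = ipaddress.ip_address(address)
    return (
        ip.is_private        # RFC 1918 ranges and similar
        or ip.is_loopback    # 127.0.0.0/8 and ::1
        or ip.is_link_local  # 169.254.0.0/16, which covers cloud metadata endpoints
        or ip.is_multicast
        or ip.is_reserved
        or ip.is_unspecified
    )
```

A caller would reject the request whenever the function returns True for the resolved address, and would treat a `ValueError` from `ipaddress.ip_address` (input that is not an IP literal) as a reason to reject as well.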

LLM-derived; treat as a starting point, not a security audit.

Where to read next

Open issues — current backlog
Recent PRs — what's actively shipping
Source on GitHub

Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.
