Complete Guide to Converting PDF to Markdown: Preserving Format and Images

Why is PDF to Markdown So Difficult?

Honestly, converting PDF to Markdown is much more troublesome than Word to Markdown. I've tried several tools before, and either the images were lost, the tables got deformed, or the format was completely messed up.

There are three main difficulties:

1. Format Recognition Issues

A PDF is essentially a collection of text and graphics positioned by coordinates, usually with no structural tags. Conversion tools have to "guess" which parts are titles, body text, or lists.
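To make the "guessing" concrete, here's a rough sketch of one common heuristic: assume the most frequent font size is body text and treat anything noticeably larger as a heading. The `Span` shape, the 1.2 ratio, and the sample data are my own assumptions for illustration, not how doc2markdown.com or any specific tool works.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Span:
    """A run of text as a PDF extractor might report it (hypothetical shape)."""
    text: str
    font_size: float


def guess_headings(spans, ratio=1.2):
    """Label each span 'heading' or 'body' relative to the dominant font size."""
    if not spans:
        return []
    # Assumption: the most common font size belongs to the body text.
    body_size = Counter(s.font_size for s in spans).most_common(1)[0][0]
    return [
        (s.text, "heading" if s.font_size >= body_size * ratio else "body")
        for s in spans
    ]


spans = [
    Span("Introduction", 18.0),
    Span("PDF stores glyphs at coordinates.", 11.0),
    Span("There is no heading tag.", 11.0),
]
labels = guess_headings(spans)
```

A heuristic like this breaks down as soon as footnotes, captions, or bold body text share sizes with headings, which is roughly why academic papers convert so badly.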

A friend of mine tried to convert an academic paper, and the footnotes and references got all mixed up - completely unusable. Later, using doc2markdown.com, the recognition rate reached over 90%.

2. Image Extraction Difficulties

Images in PDFs come in two types:

- Embedded images: Relatively easy to extract
- Vector graphics: Need to be rasterized to bitmaps, which easily loses detail

Especially those diagrams with annotations and arrows - details are often lost during conversion.

3. Complex Table Structures

Tables in PDFs usually aren't real tables, just text laid out between drawn lines. Conversion tools have to infer the cell boundaries, and even slightly complex merged cells are prone to errors.
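Here's a minimal sketch of what "recognizing cell boundaries" can look like when all you have is word positions: group the left edges of words into columns, then drop each word into the nearest column. The input shape (rows of `(x, text)` pairs) and the 5-point tolerance are assumptions for illustration only.

```python
def cluster_columns(x_positions, tolerance=5.0):
    """Group left-edge x coordinates that lie within `tolerance` of a neighbour."""
    columns = []
    for x in sorted(x_positions):
        if columns and x - columns[-1][-1] <= tolerance:
            columns[-1].append(x)
        else:
            columns.append([x])
    # Represent each column by its smallest x coordinate.
    return [col[0] for col in columns]


def rows_to_markdown(rows, tolerance=5.0):
    """rows: list of rows, each a list of (x, text) words. Returns a Markdown table."""
    col_starts = cluster_columns([x for row in rows for x, _ in row], tolerance)

    def column_index(x):
        # Pick the column whose start is closest to this word's x coordinate.
        return min(range(len(col_starts)), key=lambda i: abs(col_starts[i] - x))

    lines = []
    for n, row in enumerate(rows):
        cells = [""] * len(col_starts)
        for x, text in row:
            i = column_index(x)
            cells[i] = (cells[i] + " " + text).strip()
        lines.append("| " + " | ".join(cells) + " |")
        if n == 0:
            # The first row is treated as the header row.
            lines.append("|" + " --- |" * len(col_starts))
    return "\n".join(lines)


rows = [
    [(50.0, "Tool"), (200.0, "Time")],
    [(51.5, "A"), (201.0, "23s")],
]
table = rows_to_markdown(rows)
```

A merged cell that spans two columns has no sensible "nearest column", which is exactly where this kind of approach starts producing wrong tables.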

Using doc2markdown.com for Online Conversion

After trying several tools, I found doc2markdown.com to be the most reliable. The operation is simple:

Basic Conversion Process

- Upload PDF: Open doc2markdown.com and drag the PDF file in
- Wait for Processing: Usually 10-30 seconds, depending on file size
- Preview Results: Preview the converted Markdown online
- Download File: If satisfied, directly download the .md file

Real Testing Results

I tested a 20-page technical document:

- Format Retention: Titles, lists, code blocks mostly complete
- Image Processing: All 12 images extracted successfully, automatically converted to Base64 embedded
- Table Conversion: 3 tables, 2 perfectly converted, 1 with minor issues (merged cells)
- Conversion Time: 23 seconds

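For this document, that was better than what I got via Pandoc (which has no PDF reader of its own, so the PDF has to be turned into another format first) and some paid tools I've used before.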
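If you're wondering what "Base64 embedded" means in practice, here's a small sketch: the image bytes are encoded into a data URI and placed directly in the Markdown image syntax. The function name and the default MIME type are my own choices for illustration.

```python
import base64


def image_to_markdown(data: bytes, alt: str, mime: str = "image/png") -> str:
    """Return a Markdown image tag with the image bytes inlined as a data URI."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"![{alt}](data:{mime};base64,{encoded})"


# Three placeholder bytes stand in for real image data.
snippet = image_to_markdown(b"abc", "figure 1")
```

The upside is a single self-contained .md file. The downside is that Base64 makes the image data about a third larger, and some Markdown renderers may not display data URIs, so separate image files can still be the safer choice.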

Tips for Handling Complex PDFs

Scanned PDFs

Scanned PDFs are essentially images with no selectable text. Two solutions:

Method 1: Do OCR First

Use Adobe Acrobat or online OCR tools (like ocr.space) to first convert the scanned version to a searchable PDF, then convert to Markdown.

I tried this on a scanned old document: OCR accuracy was about 85%, and after converting with doc2markdown.com the result was basically usable.

Method 2: Accept Image Format

If you only need to archive the content, you can convert the PDF pages to images and embed them in Markdown. The result isn't editable, but at least it preserves the original.
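As a sketch of Method 2, assuming the pages have already been rendered to image files by some other tool (the file names below are made up), the Markdown side is just one image reference per page:

```python
def pages_to_markdown(image_paths, title="Scanned document"):
    """Build a Markdown document that shows one page image after another."""
    lines = [f"# {title}", ""]
    for number, path in enumerate(image_paths, start=1):
        lines.append(f"![Page {number}]({path})")
        lines.append("")
    return "\n".join(lines)


# Hypothetical file names; use whatever your page renderer produced.
document = pages_to_markdown(["page-1.png", "page-2.png"], title="Scan")
```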

Multi-Column Layout PDFs

Two- and three-column layouts, common in academic papers and magazines, are the most error-prone to convert because the text order often gets scrambled.

Solutions:

- Adjust Reading Order: Some PDF editors let you set the text flow order, so fix it before converting
- Convert in Segments: Split the PDF into single columns, convert each one separately, then merge
- Manual Correction: After conversion, read through once and fix the paragraph order

Once I converted a double-column research report where the first 5 pages were completely out of order. Later I adjusted the reading order in Adobe Acrobat and reconverted - then it was normal.
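To show why the order goes wrong and what "fixing the reading order" means, here's a rough sketch that sorts text blocks column by column instead of purely top to bottom. The `Block` shape, the top-left origin, and the equal-width columns are all assumptions; real pages also have full-width titles and figures that this ignores.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A text block with its top-left corner (hypothetical extractor output)."""
    text: str
    x0: float  # left edge
    y0: float  # top edge, assuming y grows downward


def reading_order(blocks, page_width, columns=2):
    """Sort blocks left column first, then top to bottom within each column."""
    col_width = page_width / columns

    def key(block):
        # Clamp so a block at the far right still lands in the last column.
        col = min(int(block.x0 // col_width), columns - 1)
        return (col, block.y0)

    return sorted(blocks, key=key)


# A naive top-to-bottom extraction interleaves the two columns like this:
blocks = [
    Block("L1", 50, 100), Block("R1", 320, 100),
    Block("L2", 50, 300), Block("R2", 320, 300),
]
ordered = [b.text for b in reading_order(blocks, page_width=600)]
```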

PDFs with Watermarks and Headers/Footers

Watermarks, headers, and footers in PDFs often get picked up as body text during conversion, which is very annoying.

Handling Methods:

- Clean Before Converting: Use a PDF editor to remove watermarks and headers/footers first
- Delete After Converting: Use a regex on the Markdown file to batch delete the repeated content

For example, page numbers are usually in the format Page 1 of 10, which can be batch deleted with regex Page \d+ of \d+.
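Here's that regex as a small Python sketch. I've anchored it to whole lines (an addition of mine, not part of the pattern above) so that a sentence in the body that happens to mention a page number is left alone.

```python
import re

# Matches lines that contain nothing but "Page N of M", plus the line break.
PAGE_FOOTER = re.compile(r"^[ \t]*Page \d+ of \d+[ \t]*$\n?", re.MULTILINE)


def strip_page_footers(markdown: str) -> str:
    """Remove standalone 'Page N of M' lines left over from PDF footers."""
    return PAGE_FOOTER.sub("", markdown)


text = "Intro text\nPage 1 of 10\nMore text, see Page 3 of 10 for details\nPage 2 of 10\n"
cleaned = strip_page_footers(text)
```

The same approach works for a repeated header or watermark line: swap in the text that keeps showing up.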

Real Case: Academic Paper Conversion

Last year I helped a friend convert his doctoral dissertation (150-page PDF) to Markdown for publishing on his personal blog.

Problems Encountered

- Math Formulas: The paper had numerous LaTeX formulas that became gibberish after conversion
- References: 200+ citations with messy formatting
- Figures and Tables: 60+ images, some were vector graphics

Solutions

Formula Processing:
- Converted using doc2markdown.com, preserved 70% of formulas
- Manually rewrote the remaining 30% with MathJax syntax
- Final effect was good, formulas display normally on web pages

References:
- Format was messy after conversion, decided to reformat
- Used regex to extract author, year, title
- Uniformly changed to Markdown list format

Figure Processing:
- Vector graphics automatically converted to PNG during conversion, resolution sufficient
- Individually exported high-resolution versions to replace a few complex figures
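For the references step, the regex approach looked roughly like the sketch below. The pattern assumes entries shaped like "Authors (Year). Title. Venue.", which is an assumption for this example (the actual citation style will differ), and the sample entry is made up.

```python
import re

# Assumed entry shape: "Authors (Year). Title. Venue."
REFERENCE = re.compile(
    r"^(?P<authors>.+?)\s*\((?P<year>\d{4})\)\.\s*(?P<title>[^.]+)\."
)


def to_list_item(line: str) -> str:
    """Turn one reference line into a Markdown list item."""
    entry = line.strip()
    match = REFERENCE.match(entry)
    if match is None:
        # Leave anything that doesn't fit the pattern untouched for manual review.
        return f"- {entry}"
    title = match["title"].strip()
    return f"- {match['authors']} ({match['year']}). *{title}*"


# Made-up sample entry, not a real citation.
item = to_list_item("Doe, J. (2020). An example title. Example Journal.")
```

With 200+ citations, expect some entries to miss the pattern; keeping those as-is makes them easy to spot and fix by hand.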

Final Result

It took 3 days in total (mainly manual adjustment of formulas and references). The converted Markdown file:

- Size: From 15MB PDF to 2.5MB text + 8MB images
- Format: Complete preservation of chapter structure, code blocks, tables
- Readability: Much better than PDF, smooth reading even on mobile

Now his thesis has 300+ stars on GitHub, and several people said it's much more convenient than viewing the PDF.

Common Problems with Format Loss

Problem 1: Code Block Recognition Errors

Symptom: Code blocks in PDF are recognized as plain text, all indentation lost.

Solution:

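- Manually add Markdown code block markers (three backticks) after conversion
- Use Prettier or a similar formatter to re-indent the code where the language is supported (indentation-sensitive languages like Python have to be repaired by hand)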
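If you have many snippets to fix, wrapping them can be scripted. Here's a minimal sketch (the function name is my own) that adds the fence and a language tag, and uses a longer fence when the code itself contains backticks:

```python
import re


def fence(code: str, language: str = "") -> str:
    """Wrap code in a fenced block, with a fence longer than any backtick run inside."""
    longest = max((len(run) for run in re.findall(r"`+", code)), default=0)
    ticks = "`" * max(3, longest + 1)
    return f"{ticks}{language}\n{code.rstrip()}\n{ticks}"


block = fence("print('hi')\n", "python")
```

This only restores the markers. Indentation that was lost during conversion still has to be put back separately.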

Problem 2: Links Lost

Symptom: Hyperlinks in PDF become plain text after conversion.

Solution:

- doc2markdown.com tries to preserve links, but it doesn't catch 100% of them
- For important links, check them once after conversion and add any missing ones manually
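When the URL itself is still visible as plain text, part of the manual work can be automated. This sketch wraps bare URLs in angle brackets (Markdown autolinks) and skips URLs that are already inside a link; the regex is a simplification I'm assuming is good enough for typical documents.

```python
import re

# A bare URL: not already preceded by "<", "(", "[" or a word character.
BARE_URL = re.compile(r"(?<![<(\[\w])https?://[^\s<>()\[\]]+")


def autolink(markdown: str) -> str:
    """Wrap bare URLs in <...> so Markdown renders them as clickable links."""
    def wrap(match):
        url = match.group(0)
        # Keep sentence punctuation outside the link.
        trimmed = url.rstrip(".,;:!?")
        return f"<{trimmed}>{url[len(trimmed):]}"

    return BARE_URL.sub(wrap, markdown)


text = "See https://example.com/docs. Also [site](https://example.com)."
linked = autolink(text)
```

Links whose visible text was something like "click here" can't be recovered this way, because the target URL never made it into the text.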

Problem 3: Special Character Garbling

Symptom: Special characters like Chinese quotation marks and dashes become question marks or boxes.

Solution:

- This is usually an encoding issue, so save the Markdown file with UTF-8 encoding
- If it's still a problem, use a text editor to batch replace the broken characters
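A quick sketch for both steps: find the lines that contain the Unicode replacement character (U+FFFD, the usual sign that decoding went wrong somewhere), and write the file out explicitly as UTF-8. The function names and sample text are mine.

```python
from pathlib import Path


def garbled_lines(text: str):
    """Return 1-based line numbers that contain the U+FFFD replacement character."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if "\ufffd" in line
    ]


def save_utf8(path, text: str) -> None:
    """Write text with an explicit UTF-8 encoding instead of the platform default."""
    Path(path).write_text(text, encoding="utf-8")


sample = "All good here\nBroken \ufffd quote\n\u201cCurly quotes survive\u201d"
suspect = garbled_lines(sample)
```

Once a character has already turned into a question mark or U+FFFD, the original is gone, so those lines need to be fixed against the PDF.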

When Not to Convert to Markdown

PDF to Markdown isn't always the right choice. I don't recommend converting in these cases:

1. Complex Layout E-books

Those fancy, beautifully laid out e-books will lose a lot of design appeal when converted to Markdown. If just for reading, viewing the PDF directly is better.

2. Very Poor Quality Scanned Documents

Blurry, tilted, or stained scans get very low OCR accuracy. The converted text ends up full of errors, so you might as well retype it.

3. Image-Heavy PDFs

If the PDF is 90% images (like comics, picture albums), converting to Markdown is pointless - just save the images directly.

Summary

PDF to Markdown is indeed difficult, but using the right tool can save a lot of effort. doc2markdown.com does well in format preservation, image extraction, and table conversion - sufficient for most cases.

Suitable Conversion Scenarios:

- Technical documentation, tutorials
- Academic papers (need manual formula adjustment)
- Work reports, manuals
- PDF content needing online display

Remember to Check After Converting:

- Whether title hierarchy is correct
- Whether images are complete
- Whether table format is aligned
- Whether code blocks are fenced and tagged with a language (needed for syntax highlighting)
- Whether links are valid
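Parts of this checklist can be automated. The sketch below covers three of the checks (heading levels that skip, images with an empty target, code blocks without a language tag) and is only a rough first pass under my own assumptions about what counts as a problem. Tables and link validity still need a human look.

```python
import re

FENCE = "`" * 3
IMAGE = re.compile(r"!\[[^\]]*\]\(([^)]*)\)")
HEADING = re.compile(r"(#{1,6}) ")


def check_markdown(text: str):
    """Return a list of human-readable problems found in a Markdown document."""
    problems = []
    previous_level = 0
    in_fence = False
    for number, line in enumerate(text.splitlines(), start=1):
        if line.startswith(FENCE):
            if not in_fence and line.strip() == FENCE:
                problems.append(f"line {number}: code block has no language tag")
            in_fence = not in_fence
            continue
        if in_fence:
            continue  # Do not inspect the contents of code blocks.
        heading = HEADING.match(line)
        if heading:
            level = len(heading.group(1))
            if level > previous_level + 1:
                problems.append(
                    f"line {number}: heading jumps from level {previous_level} to {level}"
                )
            previous_level = level
        for target in IMAGE.findall(line):
            if not target.strip():
                problems.append(f"line {number}: image has an empty target")
    if in_fence:
        problems.append("unclosed code block")
    return problems


sample = "\n".join(["# Title", "### Deep", "![fig]()", FENCE, "code", FENCE])
report = check_markdown(sample)
```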

For simple documents, basically no changes are needed after conversion. Complex documents may need 10-30% manual adjustment, but that still saves a lot more effort than writing from scratch.
