
Complete Guide to Converting PDF to Markdown: Preserving Format and Images

Published January 14, 2025
Tags: #PDF to Markdown, #Document Conversion, #Online Conversion Tool, #Format Preservation

Why is PDF to Markdown So Difficult?

Honestly, converting PDF to Markdown is much more troublesome than converting Word to Markdown. I've tried several tools, and each time either the images were lost, the tables were mangled, or the formatting was completely messed up.

There are three main difficulties:

1. Format Recognition Issues

PDF is essentially a collection of coordinate-positioned text and graphics, usually with no structural tags. Conversion tools have to "guess" which parts are titles, body text, or lists.

A friend of mine tried to convert an academic paper, and the footnotes and references got all mixed up - completely unusable. Later, with doc2markdown.com, the recognition rate was over 90%.
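
To make the "guessing" concrete, here is a minimal sketch of one common heuristic: treat the most frequent font size as body text and map larger sizes to heading levels. The Span type and the guess_markdown function are hypothetical names made up for this illustration, and the input is assumed to be text already extracted with font sizes and positions. This is not how doc2markdown.com or any particular tool is known to work.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Span:
    """One extracted line of text (hypothetical structure for this sketch)."""
    text: str
    font_size: float  # in points
    y: float          # distance from the top of the page


def guess_markdown(spans):
    """Guess headings from font size: the most common size is body text."""
    sizes = [round(s.font_size) for s in spans]
    body_size = Counter(sizes).most_common(1)[0][0]
    # Distinct sizes larger than body text, largest first: level 1, 2, 3...
    heading_sizes = sorted({sz for sz in sizes if sz > body_size}, reverse=True)

    lines = []
    for span in sorted(spans, key=lambda s: s.y):  # top-to-bottom order
        size = round(span.font_size)
        if size in heading_sizes:
            level = min(heading_sizes.index(size) + 1, 6)  # Markdown stops at h6
            lines.append("#" * level + " " + span.text)
        else:
            lines.append(span.text)
    return "\n\n".join(lines)
```

A heuristic like this breaks down when a document uses bold text at body size for headings, which is one reason recognition is never perfect.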

2. Image Extraction Difficulties

Images in PDFs come in two types:

  • Embedded images: Relatively easy to extract
  • Vector graphics: Need to be rasterized to bitmaps, which easily distorts them

Diagrams with annotations and arrows are especially affected - details are often lost during conversion.

3. Complex Table Structures

Tables in PDFs aren't real tables; they are just drawn with lines. Conversion tools have to recognize cell boundaries, and even moderately complex merged cells are prone to errors.
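
As a rough illustration of what "recognizing cell boundaries" involves, the sketch below assumes the ruling lines and words have already been extracted with coordinates (y growing downward) and drops each word into the grid cell that contains it. The function name and input shapes are assumptions for this example. Merged cells are deliberately not handled, which is exactly where real tables go wrong.

```python
from bisect import bisect_right


def table_to_markdown(v_lines, h_lines, words):
    """Build a Markdown table from ruling lines and positioned words.

    v_lines: x positions of vertical rules
    h_lines: y positions of horizontal rules
    words:   iterable of (x, y, text) tuples
    """
    xs, ys = sorted(v_lines), sorted(h_lines)
    n_cols, n_rows = len(xs) - 1, len(ys) - 1
    grid = [[[] for _ in range(n_cols)] for _ in range(n_rows)]

    # Visit words top-to-bottom, left-to-right so cell text stays in order.
    for x, y, text in sorted(words, key=lambda w: (w[1], w[0])):
        col = bisect_right(xs, x) - 1
        row = bisect_right(ys, y) - 1
        if 0 <= col < n_cols and 0 <= row < n_rows:  # ignore words outside the grid
            grid[row][col].append(text)

    rows = ["| " + " | ".join(" ".join(cell) for cell in row) + " |" for row in grid]
    rows.insert(1, "|" + " --- |" * n_cols)  # first row is treated as the header
    return "\n".join(rows)
```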

Using doc2markdown.com for Online Conversion

After trying several tools, I found doc2markdown.com to be the most reliable. The operation is simple:

Basic Conversion Process

  1. Upload PDF: Open doc2markdown.com and drag the PDF file in
  2. Wait for Processing: Usually 10-30 seconds, depending on file size
  3. Preview Results: Preview the converted Markdown online
  4. Download File: If satisfied, directly download the .md file

Real Testing Results

I tested a 20-page technical document:

  • Format Retention: Titles, lists, code blocks mostly complete
  • Image Processing: All 12 images extracted successfully, automatically converted to Base64 embedded
  • Table Conversion: 3 tables, 2 perfectly converted, 1 with minor issues (merged cells)
  • Conversion Time: 23 seconds

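That was better than what I got from Pandoc-based workflows (Pandoc itself does not read PDF, so it needs an intermediate conversion step) and from some paid tools I've used before.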
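
For readers unfamiliar with Base64 embedding, mentioned in the image result above: the image bytes are encoded as text and placed in a data URI inside the Markdown image link, so the .md file needs no separate image files. The helper below is a generic sketch of that idea with a made-up function name. It is not taken from doc2markdown.com, and some Markdown renderers may not display data URIs.

```python
import base64


def embed_image(image_bytes: bytes, alt: str, mime: str = "image/png") -> str:
    """Return a Markdown image whose target is a Base64 data URI."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"![{alt}](data:{mime};base64,{encoded})"
```

The trade-off is size: Base64 text is about a third larger than the original bytes, so image-heavy documents produce large Markdown files.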

Tips for Handling Complex PDFs

Scanned PDFs

Scanned PDFs are essentially images with no selectable text. Two solutions:

Method 1: Do OCR First

Use Adobe Acrobat or online OCR tools (like ocr.space) to first convert the scanned version to a searchable PDF, then convert to Markdown.

I tried this on a scan of an old document: OCR accuracy was about 85%, and after converting with doc2markdown.com the result was basically usable.

Method 2: Accept Image Format

If you only need to archive the content, you can convert the PDF pages directly to images and embed them in Markdown. The result is not editable, but it at least preserves the original.
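
A minimal sketch of Method 2, assuming the pages have already been rendered to image files by some other tool (the file names and the function name here are placeholders):

```python
def pages_as_images(image_paths, title="Scanned document"):
    """Build a Markdown document that shows one page image after another."""
    parts = [f"# {title}"]
    for number, path in enumerate(image_paths, start=1):
        # The alt text records the page number so the order stays visible.
        parts.append(f"![Page {number}]({path})")
    return "\n\n".join(parts) + "\n"
```

Paths containing spaces or parentheses would need escaping, which this sketch does not attempt.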

Multi-Column Layout PDFs

Two- or three-column layouts, common in academic papers and magazines, are the most error-prone during conversion. Text order often gets scrambled.

Solutions:

  1. Adjust Reading Order: Some PDF editors allow setting text flow order - adjust before converting
  2. Convert in Segments: Split the PDF by column into single columns, convert separately then merge
  3. Manual Correction: After conversion, check once and readjust paragraph order

I once converted a double-column research report where the first 5 pages came out completely out of order. After I adjusted the reading order in Adobe Acrobat and reconverted, the order was correct.
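
The reason multi-column text gets scrambled is that a naive top-to-bottom sort interleaves the columns. A rough sketch of the fix, assuming text blocks are available as (x, y, text) tuples with y growing downward and columns of equal width (all names and assumptions here are for illustration only):

```python
def reading_order(blocks, page_width, n_columns=2):
    """Order text blocks column by column instead of strictly top to bottom.

    blocks: iterable of (x, y, text), where x and y are the block's
            top-left corner.
    """
    col_width = page_width / n_columns

    def sort_key(block):
        x, y, _ = block
        column = min(int(x // col_width), n_columns - 1)
        return (column, y, x)  # finish one column before starting the next

    return [text for _, _, text in sorted(blocks, key=sort_key)]
```

Full-width elements such as titles or figures that span both columns would be misplaced by this simple rule, which is why a manual check afterwards is still worthwhile.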

PDFs with Watermarks and Headers/Footers

Watermarks, headers, and footers in PDFs are often recognized as body text during conversion - very annoying.

Handling Methods:

  • Clean Before Converting: Use a PDF editor to remove watermarks and headers/footers first
  • Delete After Converting: Use regex in the Markdown file to batch delete repeated content

For example, page numbers often appear in the form Page 1 of 10; lines like that can be batch deleted with the regex Page \d+ of \d+.
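
Here is one way that regex could be applied in a script. The sketch anchors the pattern to whole lines, an extra precaution beyond the pattern given above, so that a sentence that merely mentions "Page 3 of 10" is left alone.

```python
import re

# Match a line that contains only a page marker, plus its line break.
PAGE_NUMBER = re.compile(r"^[ \t]*Page \d+ of \d+[ \t]*$\n?", re.MULTILINE)


def strip_page_numbers(markdown: str) -> str:
    """Remove stand-alone 'Page N of M' lines from converted Markdown."""
    return PAGE_NUMBER.sub("", markdown)
```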

Real Case: Academic Paper Conversion

Last year I helped a friend convert his doctoral dissertation (150-page PDF) to Markdown for publishing on his personal blog.

Problems Encountered

  1. Math Formulas: The dissertation had numerous LaTeX formulas that became gibberish after conversion
  2. References: 200+ citations with messy formatting
  3. Figures and Tables: 60+ images, some were vector graphics

Solutions

  1. Formula Processing:

    • Converted using doc2markdown.com, preserved 70% of formulas
    • Manually rewrote the remaining 30% in LaTeX syntax rendered with MathJax
    • Final effect was good, formulas display normally on web pages
  2. References:

    • Format was messy after conversion, decided to reformat
    • Used regex to extract author, year, title
    • Uniformly changed to Markdown list format
  3. Figure Processing:

    • Vector graphics automatically converted to PNG during conversion, resolution sufficient
    • Individually exported high-resolution versions to replace a few complex figures
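
The reference cleanup step (extract author, year, and title with regex, then emit a Markdown list) might look roughly like the sketch below. The citation layout "Authors (Year). Title. Rest." is an assumption made for this example, since the actual style used in the dissertation is not stated, and the sample reference in the comment is invented.

```python
import re

# Assumed layout: "Authors (Year). Title. Anything else."
REFERENCE = re.compile(
    r"^(?P<authors>.+?)\s*\((?P<year>\d{4})\)\.\s*(?P<title>[^.]+)\."
)


def reference_to_list_item(line: str) -> str:
    """Turn one citation line into a uniform Markdown list item."""
    line = line.strip()
    match = REFERENCE.match(line)
    if match is None:
        return f"- {line}"  # leave unparsed entries for manual cleanup
    title = match["title"].strip()
    return f"- {match['authors']} ({match['year']}). *{title}*"


# Invented sample, for illustration only:
# reference_to_list_item("Smith, J. (2019). Parsing documents. Some Journal.")
```

Entries that do not fit the pattern pass through unchanged, so they are easy to spot and fix by hand.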

Final Result

Took 3 days total (mainly manual adjustment of formulas and references). The converted Markdown file:

  • Size: From 15MB PDF to 2.5MB text + 8MB images
  • Format: Complete preservation of chapter structure, code blocks, tables
  • Readability: Much better than PDF, smooth reading even on mobile

Now his dissertation has 300+ stars on GitHub, and several people have said it's much more convenient than viewing the PDF.

Common Problems with Format Loss

Problem 1: Code Block Recognition Errors

Symptom: Code blocks in PDF are recognized as plain text, all indentation lost.

Solution:

  • Manually add Markdown code block markers (three backticks) after conversion
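  • Use Prettier or a similar formatter to restore indentation where the language allows it (for indentation-sensitive languages such as Python, lost indentation has to be restored by hand)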
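
Adding the markers by hand gets tedious with many blocks. A small helper like the following (hypothetical name, with line positions chosen by you after inspecting the file) wraps a range of lines in a fence:

```python
FENCE = "`" * 3  # three backticks, built this way to keep this example readable


def fence_lines(lines, start, end, language=""):
    """Wrap lines[start:end] in a Markdown code fence and return a new list."""
    return (
        lines[:start]
        + [FENCE + language]
        + lines[start:end]
        + [FENCE]
        + lines[end:]
    )
```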

Problem 2: Links Lost

Symptom: Hyperlinks in PDF become plain text after conversion.

Solution:

  • doc2markdown.com will try to preserve links, but not 100%
  • For important links, check them after conversion and add any missing ones manually
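
When a link survives only as a bare URL in the text, it can be made clickable again by wrapping it in angle brackets, the Markdown autolink form. The sketch below does that and skips URLs already inside a Markdown link. It cannot recover links whose target was dropped entirely during conversion.

```python
import re

# A bare URL not directly preceded by "(", "<" or "[".
BARE_URL = re.compile(r"(?<![(<\[])\bhttps?://[^\s<>()\[\]]+")


def autolink(markdown: str) -> str:
    """Wrap bare http(s) URLs in <...> so Markdown renders them as links."""
    def wrap(match):
        url = match.group(0)
        trailing = ""
        # Sentence punctuation at the end is almost never part of the URL.
        while url and url[-1] in ".,;:!?":
            trailing = url[-1] + trailing
            url = url[:-1]
        return f"<{url}>{trailing}"

    return BARE_URL.sub(wrap, markdown)
```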

Problem 3: Special Character Garbling

Symptom: Special characters like Chinese quotation marks and dashes become question marks or boxes.

Solution:

  • This is usually an encoding issue; save the Markdown file with UTF-8 encoding
  • If still problematic, use a text editor to batch replace
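
Before batch replacing, it helps to know where the damage is. Characters that could not be decoded usually end up as the Unicode replacement character U+FFFD, which many editors display as a question mark or a box. A minimal sketch for listing the affected lines (the function name is made up for this example):

```python
REPLACEMENT_CHAR = "\ufffd"  # what undecodable bytes turn into


def find_damaged_lines(markdown: str):
    """Return (line_number, line) pairs that contain a replacement character."""
    return [
        (number, line)
        for number, line in enumerate(markdown.splitlines(), start=1)
        if REPLACEMENT_CHAR in line
    ]
```

Once the original character is lost it cannot be restored automatically, so each hit still needs a look at the source PDF.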

When Not to Convert to Markdown

PDF to Markdown isn't a cure-all - I don't recommend converting in these cases:

1. Complex Layout E-books

Elaborately designed e-books lose much of their visual appeal when converted to Markdown. If you only want to read them, viewing the PDF directly is better.

2. Very Poor Quality Scanned Documents

Blurry, tilted, or stained scans have OCR recognition rates that are too low - the converted text is full of errors, and you might as well retype it.

3. Image-Heavy PDFs

If the PDF is 90% images (like comics, picture albums), converting to Markdown is pointless - just save the images directly.

Summary

PDF to Markdown is indeed difficult, but using the right tool can save a lot of effort. doc2markdown.com does well in format preservation, image extraction, and table conversion - sufficient for most cases.

Suitable Conversion Scenarios:

  • Technical documentation, tutorials
  • Academic papers (need manual formula adjustment)
  • Work reports, manuals
  • PDF content needing online display

Remember to Check After Converting:

  • Whether title hierarchy is correct
  • Whether images are complete
  • Whether table format is aligned
  • Whether code blocks are fenced and tagged with a language for syntax highlighting
  • Whether links are valid
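
Part of this checklist can be automated. The sketch below covers three of the checks in a simplified form: heading levels that skip a step, links or images with an empty target, and a code fence that is never closed. It is a starting point under those assumptions, not a full Markdown linter, and the function name is made up.

```python
import re

FENCE = "`" * 3
HEADING = re.compile(r"(#{1,6}) ")
LINK_OR_IMAGE = re.compile(r"!?\[[^\]]*\]\(([^)]*)\)")


def check_markdown(markdown: str):
    """Return a list of human-readable problems found in the Markdown text."""
    problems = []
    in_code = False
    previous_level = 0
    for number, line in enumerate(markdown.splitlines(), start=1):
        if line.lstrip().startswith(FENCE):
            in_code = not in_code  # entering or leaving a code block
            continue
        if in_code:
            continue  # do not inspect code for headings or links
        heading = HEADING.match(line)
        if heading:
            level = len(heading.group(1))
            if previous_level and level > previous_level + 1:
                problems.append(
                    f"line {number}: heading jumps from level {previous_level} to {level}"
                )
            previous_level = level
        for target in LINK_OR_IMAGE.findall(line):
            if not target.strip():
                problems.append(f"line {number}: empty link or image target")
    if in_code:
        problems.append("unclosed code fence")
    return problems
```

Table alignment and image completeness still need a visual check in a Markdown preview.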

Simple documents need basically no changes after conversion. Complex documents may need 10-30% manual adjustment, but that still saves far more effort than writing from scratch.
