How to Quickly Convert Any PDF to Markdown
6/16/25
PDF to Markdown Converter
Overview
MyPDFConverter is a tool for batch-processing PDF files: it extracts their content (including tables and images) and converts it to Markdown files. It can also upload the extracted images to MinIO and automatically replace the image links in the Markdown.
Dependencies
- Python 3.x
- docling_core
- docling
- commonutils
- MinIO (optional, for image upload)
Main Features
- Batch process all PDF files in a specified directory
- Automatically extract tables and save as images
- Supports OCR (Optical Character Recognition)
- Generates Markdown files; images can optionally be uploaded to MinIO, with their links replaced in the Markdown
Code
// ... (code unchanged, see original for details)
Usage
1. Initialization
from pdfconverterv3 import MyPDFConverter
# If you need to upload images to MinIO, pass in a minio_util instance
converter = MyPDFConverter(minio_util=None)
2. Batch Process All PDFs in a Directory
input_dir = "path/to/pdf/files"
output_dir = "path/to/output/markdown"
converter.process_directory(input_dir, output_dir)
- input_dir: Directory containing the PDF files to process
- output_dir: Directory for the generated Markdown files and images
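The internals of process_directory are not shown here. As a rough sketch of what a batch step like this typically does, the following uses only pathlib. The names find_pdfs, process_all, and the process_one callback are hypothetical and are not part of MyPDFConverter.

```python
from pathlib import Path


def find_pdfs(input_dir):
    """Return the PDF files directly inside input_dir (not recursive)."""
    return sorted(
        p for p in Path(input_dir).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def process_all(input_dir, output_dir, process_one):
    """Call process_one(pdf_path, output_dir) for every PDF in input_dir.

    A failure on one file is recorded and does not stop the batch.
    Returns a list of (pdf_path, exception) pairs for the files that failed.
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for pdf in find_pdfs(input_dir):
        try:
            process_one(pdf, output_dir)
        except Exception as exc:  # keep going; report at the end
            failed.append((pdf, exc))
    return failed
```

With a converter instance, something like process_all(input_dir, output_dir, converter.process_pdf) would give a similar per-file loop, assuming process_pdf accepts Path arguments as in the single-file example.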
3. Process a Single PDF File
from pathlib import Path
pdf_path = Path("example.pdf")
output_dir = Path("output")
converter.process_pdf(pdf_path, output_dir)
4. Parse and Process the Generated Markdown File
md_path = Path("output/example.md")
converter.parse_markdown(md_path)
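How parse_markdown rewrites image links is not shown in this article. The sketch below illustrates one common way to do it with a regular expression; replace_image_links and the url_for callback are hypothetical names, and the real implementation may differ.

```python
import re

# Matches simple Markdown images: ![alt text](target)
# Images with a title, e.g. ![alt](path "title"), are left untouched.
IMAGE_LINK = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)\)")


def replace_image_links(markdown, url_for):
    """Rewrite image targets in a Markdown string.

    url_for(target) returns the new URL for a local path,
    or None to keep the original target unchanged.
    """
    def _swap(match):
        alt, target = match.group(1), match.group(2)
        new_target = url_for(target)
        return f"![{alt}]({new_target or target})"

    return IMAGE_LINK.sub(_swap, markdown)
```

Returning None from url_for for unknown paths mirrors the behavior described in the notes: images that were not uploaded keep their local paths.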
Advanced Usage
- Supports a custom MinIO utility class. Implement upload_to_minio(image_path, doc_filename) to automatically upload images and replace Markdown links with MinIO URLs.
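A minimal sketch of such a utility class might look like the following. Only the method name and its two parameters come from the text above; the class name SimpleMinioUtil, the constructor arguments, the object naming scheme, and the assumption that the method returns the image URL are all hypothetical. The client is expected to provide fput_object(bucket_name, object_name, file_path), as the Minio client from the MinIO Python SDK does.

```python
from pathlib import Path


class SimpleMinioUtil:
    """Hypothetical helper exposing the upload_to_minio hook."""

    def __init__(self, client, bucket, base_url):
        # client: any object with fput_object(bucket_name, object_name, file_path),
        # for example minio.Minio from the MinIO Python SDK.
        self.client = client
        self.bucket = bucket
        self.base_url = base_url.rstrip("/")

    def upload_to_minio(self, image_path, doc_filename):
        image_path = Path(image_path)
        # Assumed layout: group images under the source document's name
        object_name = f"{Path(doc_filename).stem}/{image_path.name}"
        self.client.fput_object(self.bucket, object_name, str(image_path))
        # Assumption: the converter expects the public URL back
        return f"{self.base_url}/{self.bucket}/{object_name}"
```

An instance could then be passed as MyPDFConverter(minio_util=...), provided the return value matches what the converter actually expects.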
Notes
- Ensure all dependencies are correctly installed.
- If an image is missing or an upload fails during processing, a message is written to the log.
- Image links in the generated Markdown are handled automatically; if MinIO is not configured, image paths stay local.
Example
from pdfconverterv3 import MyPDFConverter

if __name__ == "__main__":
    # No minio_util is passed, so image paths stay local
    converter = MyPDFConverter()
    converter.process_directory(
        "your_pdf_dir",
        "your_output_dir",
    )
Logging
- The log level is INFO; progress and errors are written to the console.
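For reference, an equivalent console setup with Python's standard logging module could look like this. It is a sketch that assumes the converter logs through the standard logging module, which the article does not confirm; the logger name pdf_batch is hypothetical.

```python
import logging

# Send INFO and above to the console with a timestamped format
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    force=True,  # replace any handlers configured earlier (Python 3.8+)
)

logger = logging.getLogger("pdf_batch")  # hypothetical logger name
logger.info("Processing %s", "example.pdf")
logger.debug("Per-page details")  # suppressed at INFO level
```

Changing level to logging.DEBUG or logging.WARNING in the same call is the usual way to make the output more or less verbose.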