How to Quickly Convert Word Documents to Markdown
6/17/25
Docx to Markdown Converter
Function Overview
This is a tool for converting Word documents (docx) to Markdown format, with the following features:
- Supports batch conversion of all docx files in a directory
- Automatically extracts images from documents
- Uploads images to MinIO storage
- Automatically replaces image links in Markdown files
Code
// ... (code unchanged, see original for details)
Usage
1. Initialization
from docxconverter import DocxConverter
from minio_util import MinioUtil # You need to implement the MinIO utility class first
minio_util = MinioUtil() # Initialize MinIO utility
converter = DocxConverter(minio_util)
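The post does not show what `DocxConverter` expects from the MinIO utility class, so the following is only a minimal sketch of one possible shape. The method name `upload_file`, the constructor arguments, and the environment variable names (`MINIO_ENDPOINT`, `MINIO_ACCESS_KEY`, `MINIO_SECRET_KEY`, `MINIO_BUCKET`, `MINIO_PUBLIC_URL`) are all assumptions made for illustration. Only `Minio(...)` and `fput_object(...)` come from the `minio` Python package.

```python
import mimetypes
import os
from pathlib import Path


def _client_from_env():
    """Build a MinIO client from (hypothetical) environment variables."""
    from minio import Minio  # imported lazily so the class can be tested without it

    return Minio(
        endpoint=os.environ["MINIO_ENDPOINT"],
        access_key=os.environ["MINIO_ACCESS_KEY"],
        secret_key=os.environ["MINIO_SECRET_KEY"],
        secure=os.environ.get("MINIO_SECURE", "true").lower() == "true",
    )


class MinioUtil:
    """Hypothetical upload helper; the interface is an assumption, not the author's."""

    def __init__(self, client=None, bucket=None, base_url=None):
        self.client = client if client is not None else _client_from_env()
        self.bucket = bucket or os.environ.get("MINIO_BUCKET", "docs")
        url = base_url or os.environ.get("MINIO_PUBLIC_URL", "http://localhost:9000")
        self.base_url = url.rstrip("/")

    def upload_file(self, file_path, object_prefix=""):
        """Upload one local file and return the URL it should be reachable at."""
        path = Path(file_path)
        object_name = f"{object_prefix.strip('/')}/{path.name}".lstrip("/")
        content_type = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
        self.client.fput_object(
            bucket_name=self.bucket,
            object_name=object_name,
            file_path=str(path),
            content_type=content_type,
        )
        return f"{self.base_url}/{self.bucket}/{object_name}"
```

Accepting an optional `client` keeps the no-argument call `MinioUtil()` working while letting a fake client stand in during tests.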
2. Convert a Single File
# Convert a single docx file
output_file = converter.convert_doc_to_markdown(
    doc_file="path/to/your/file.docx",
    output_dir="path/to/output"
)
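The converter's source is not reproduced in this post, so what follows is only a sketch of what a single-file conversion could look like with pypandoc, not the actual implementation of `convert_doc_to_markdown`. The helper names `plan_conversion` and `convert_docx`, the `<name>_media` directory layout, and the `gfm` output format are assumptions. `--extract-media` is pandoc's own option for writing embedded images to a directory.

```python
from pathlib import Path


def plan_conversion(doc_file, output_dir):
    """Work out the output paths and pandoc arguments for one docx file."""
    doc = Path(doc_file)
    out = Path(output_dir)
    markdown_path = out / f"{doc.stem}.md"
    media_dir = out / f"{doc.stem}_media"  # assumed naming scheme
    extra_args = [f"--extract-media={media_dir}"]
    return markdown_path, media_dir, extra_args


def convert_docx(doc_file, output_dir):
    """Convert one docx file to Markdown and return the Markdown path."""
    import pypandoc  # requires the pandoc executable to be installed

    markdown_path, _media_dir, extra_args = plan_conversion(doc_file, output_dir)
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    pypandoc.convert_file(
        str(doc_file),
        "gfm",
        outputfile=str(markdown_path),
        extra_args=extra_args,
    )
    return markdown_path
```

Keeping the path planning separate from the pandoc call makes the naming scheme easy to check without running a conversion.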
3. Batch Convert a Directory
# Convert all docx files in a directory
converter.process_docx_in_dir(
    docx_dir="path/to/docx/files",
    output_dir="path/to/output",
    progress_callback=your_callback_function  # Optional callback function
)
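The post does not state which arguments the converter passes to `progress_callback`. As an illustration only, the sketch below assumes it is called as `callback(current, total, filename)`; if the real signature differs, the function parameters need to be adjusted to match.

```python
def format_progress(current, total, filename):
    """Render one progress line, e.g. for a log or console."""
    # Guard against an empty batch so we never divide by zero.
    percent = 100 * current // total if total else 100
    return f"[{current}/{total}] {percent:3d}% {filename}"


def print_progress(current, total, filename):
    """Assumed callback shape: (current, total, filename)."""
    print(format_progress(current, total, filename))
```

Under that assumption, `print_progress` could be passed as `progress_callback=print_progress`.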
Output Description
- The converted Markdown files will be saved in the specified output directory
- Image files will be extracted to a subdirectory that is named after the original file and carries the _media suffix
- Images will be automatically uploaded to MinIO, and the links in the Markdown file will be updated to MinIO URLs
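The link replacement step described above can be sketched as a small pure function. This is an assumption about how such a step might work, not the converter's actual code: `replace_image_links` and the `url_map` argument (local image path to uploaded URL) are hypothetical names, and the regular expression only handles the common inline form `![alt](target "title")`.

```python
import re

# Matches inline Markdown images: ![alt](target) or ![alt](target "title")
IMAGE_LINK = re.compile(
    r'!\[(?P<alt>[^\]]*)\]\((?P<target>[^)\s]+)(?P<title>\s+"[^"]*")?\)'
)


def replace_image_links(markdown, url_map):
    """Swap local image targets for uploaded URLs; unknown targets are kept."""

    def swap(match):
        target = match.group("target")
        new_target = url_map.get(target, target)
        title = match.group("title") or ""
        return f"![{match.group('alt')}]({new_target}{title})"

    return IMAGE_LINK.sub(swap, markdown)
```

Targets missing from `url_map` are left untouched, so a failed upload leaves the original local link in place instead of producing a broken URL.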
Notes
- Make sure pandoc is installed on your system
- MinIO connection information must be correctly configured
- Ensure there is enough disk space for temporary file storage
- When converting a large number of files, pass a progress callback function to monitor conversion progress
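The first and third notes can be checked before a run starts. The sketch below uses only the standard library; the function names and the idea of a fixed byte threshold are assumptions for illustration.

```python
import shutil


def pandoc_available(executable="pandoc"):
    """Return True if the pandoc executable can be found on PATH."""
    return shutil.which(executable) is not None


def has_free_space(path, required_bytes):
    """Return True if the filesystem holding `path` has enough free space."""
    return shutil.disk_usage(path).free >= required_bytes
```

A caller might, for example, refuse to start a batch when `pandoc_available()` is false or when `has_free_space(output_dir, ...)` fails for a threshold of its choosing.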
Error Handling
- If the input directory does not exist, the program will log an error
- If an error occurs during conversion, an exception will be raised and error information will be logged
- Temporary files will be automatically cleaned up in case of errors
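The behaviour listed above can be illustrated with a short sketch. It is not the converter's own code: `process_dir` and the `convert_step(doc, tmp_dir)` callable are hypothetical. It shows one way to log a missing input directory, log and re-raise conversion errors, and rely on `tempfile.TemporaryDirectory` so temporary files are removed even when an exception is raised.

```python
import logging
import tempfile
from pathlib import Path

logger = logging.getLogger("docx_converter_sketch")


def process_dir(docx_dir, convert_step):
    """Run convert_step(doc, tmp_dir) for each docx file in docx_dir."""
    src = Path(docx_dir)
    if not src.is_dir():
        # Missing input directory: log an error and do nothing else.
        logger.error("Input directory does not exist: %s", src)
        return []

    results = []
    for doc in sorted(src.glob("*.docx")):
        # The context manager deletes the temp directory on success and on error.
        with tempfile.TemporaryDirectory(prefix="docx2md_") as tmp:
            try:
                results.append(convert_step(doc, Path(tmp)))
            except Exception:
                logger.exception("Conversion failed for %s", doc)
                raise
    return results
```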
Dependencies
- Python 3.x
- pypandoc
- Custom MinIO utility class
- Sufficient disk space
This converter is especially suitable for scenarios where you need to convert a large number of Word documents to Markdown format and handle images in the documents.