Preserving nested lists during document conversion can feel like walking a tightrope
Most document formats and conversion tools struggle to keep list hierarchies intact, which leads to flattened lists, broken numbering, and lost structure. In practice, this disrupts technical papers, project documentation, and any other content where list organization matters.
Why Do Nested Lists Break During Document Conversion?
Nested lists pose a unique challenge because they embed list items within other list items, creating a hierarchy both visually and semantically. Most document formats (Word, PDF, Markdown, HTML) represent these hierarchies differently at the data level.
During conversion, tools often:
- Ignore list level properties, treating all items as top-level.
- Fail to preserve different list types—ordered vs unordered.
- Drop numbering styles, reverting to generic bullets or numbers.
- Lose indentation info, resulting in flattened content.
For example, GitHub Issue #2375 documents how converting PDFs often flattens nested lists, turning complex outlines into one-level bullet dumps. That’s not just ugly—it’s misleading, especially in legal or academic contexts.
This problem is widespread because document conversion is not just about copying content—it's about translating hierarchical structure between diverse formatting engines, each with its quirks.
How Are Nested Lists Structured in Popular Formats?
Understanding the source data formats helps with the conversion strategy. Here’s a comparison table for common document formats:
| Format | Nested List Representation | Key Challenge |
|---|---|---|
| Word (.docx) | <w:numPr> and <w:ilvl> tags control list levels and numbering | Word numbering schemes vary; requires parsing XML carefully |
| PDF | Visual indentation and text styling | No native list tags; hierarchy inferred from layout |
| Markdown | Indentation with spaces/tabs and - or * or numbered | Indentation varies; ambiguous levels if inconsistent |
| HTML | <ol> and <ul> with <li> elements nest naturally | Browser rendering is native but mapping to other formats can fail |
Python’s python-docx package exposes the underlying XML, so Word’s list levels can be read from these tags and the hierarchy rebuilt programmatically. PDFs are much harder: recovering lists requires layout analysis rather than tag inspection, which makes nested list preservation error-prone.
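To make the Word row of the table concrete, here is a minimal sketch that reads `<w:numPr>` and `<w:ilvl>` straight from WordprocessingML using only the standard library. The function name `list_levels_from_xml` and the sample XML are illustrative assumptions, and treating a missing `<w:ilvl>` as level 0 is a simplification of this sketch.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}

def list_levels_from_xml(document_xml):
    """Return (level, text) pairs for paragraphs that carry <w:numPr>."""
    root = ET.fromstring(document_xml.strip())
    items = []
    for para in root.iter(f"{{{W_NS}}}p"):
        num_pr = para.find("w:pPr/w:numPr", NS)
        if num_pr is None:
            continue  # not a directly numbered paragraph
        ilvl = num_pr.find("w:ilvl", NS)
        # Assumption of this sketch: a missing <w:ilvl> means level 0
        level = int(ilvl.get(f"{{{W_NS}}}val")) if ilvl is not None else 0
        text = "".join(t.text or "" for t in para.iter(f"{{{W_NS}}}t"))
        items.append((level, text))
    return items

# Hand-written fragment standing in for word/document.xml
SAMPLE = f"""
<w:document xmlns:w="{W_NS}"><w:body>
  <w:p><w:pPr><w:numPr><w:ilvl w:val="0"/><w:numId w:val="1"/></w:numPr></w:pPr>
       <w:r><w:t>Main topic</w:t></w:r></w:p>
  <w:p><w:pPr><w:numPr><w:ilvl w:val="1"/><w:numId w:val="1"/></w:numPr></w:pPr>
       <w:r><w:t>Subtopic 1</w:t></w:r></w:p>
  <w:p><w:r><w:t>Plain paragraph</w:t></w:r></w:p>
</w:body></w:document>
"""

print(list_levels_from_xml(SAMPLE))
```

For a real file, the same XML lives inside the .docx archive at `word/document.xml` and can be read with the standard `zipfile` module.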
How To Preserve Nested Lists Using Programming Tools
Some libraries specifically help to maintain nested lists during conversions between formats like Word to JSON or vice versa. Here are three common scenarios:
1. Using python-docx to Preserve Nested Lists in Word Documents
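python-docx can inspect and generate list structures by reading each paragraph's style and numbering properties (<w:numPr> and its <w:ilvl> child). The library has no public API for list levels, so the value is read through the paragraph's underlying XML element. You can traverse the document paragraph by paragraph, reconstruct the nesting levels, and write them back.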

    from docx import Document

    def get_nested_lists(doc_path):
        doc = Document(doc_path)
        nested_list = []
        for para in doc.paragraphs:
            if not para.style.name.startswith('List'):
                continue
            # numPr can be absent when numbering comes from the style; use level 0 then
            pPr = para._p.pPr
            numPr = pPr.numPr if pPr is not None else None
            ilvl = numPr.ilvl if numPr is not None else None
            nested_list.append((ilvl.val if ilvl is not None else 0, para.text))
        return nested_list

This snippet extracts each paragraph’s indentation level alongside its text, helping preserve the hierarchy when converting to other formats like JSON.
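As one possible next step, the (level, text) pairs can be turned into a nested Markdown list. This is a minimal sketch: the helper name `pairs_to_markdown` and the four-space indent per level are assumptions, not something python-docx provides.

```python
def pairs_to_markdown(pairs, indent="    "):
    """Render (level, text) pairs as a nested Markdown bullet list."""
    lines = []
    for level, text in pairs:
        # One indent unit per nesting level, then a bullet marker
        lines.append(f"{indent * level}- {text}")
    return "\n".join(lines)

pairs = [(0, "Main topic"), (1, "Subtopic 1"), (1, "Subtopic 2"), (2, "Detail 1")]
print(pairs_to_markdown(pairs))
```

The sketch emits unordered bullets only; keeping ordered lists ordered would need the list type to be carried alongside each level.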
2. Using Google Docs API for Nested Lists (With Limitations)
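The Google Docs API can create and modify nested lists, but its documentation offers little guidance on nesting specifically. When reading a document, each list paragraph carries a bullet object with a nestingLevel field, and the list's listProperties describe how each level is rendered. When writing, the createParagraphBullets request infers nesting from leading tab characters. Developers have to manage this indirection themselves, which complicates automation.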
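One route to nesting is the `createParagraphBullets` request, which derives each paragraph's level from its leading tab characters. The sketch below only builds the request payloads and makes no network call; the helper name `build_nested_list_requests` and the choice of bullet preset are assumptions for illustration.

```python
def build_nested_list_requests(items, start_index=1):
    """Build Docs API batchUpdate requests for a nested bulleted list.

    items: (level, text) pairs; each level becomes that many leading tabs.
    """
    text = "".join("\t" * level + body + "\n" for level, body in items)
    return [
        # Insert the tab-indented lines first
        {"insertText": {"location": {"index": start_index}, "text": text}},
        # Then turn them into bullets; leading tabs set the nesting level
        {
            "createParagraphBullets": {
                "range": {
                    "startIndex": start_index,
                    # len() matches the API's UTF-16 indexing only for BMP text
                    "endIndex": start_index + len(text),
                },
                "bulletPreset": "BULLET_DISC_CIRCLE_SQUARE",
            }
        },
    ]

requests = build_nested_list_requests([(0, "Main topic"), (1, "Subtopic 1")])
print(requests)
```

With the Google API Python client, a list like this would typically be sent as `service.documents().batchUpdate(documentId=..., body={"requests": requests})`, where the document ID and the authenticated `service` object are yours to supply.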
3. Converting Documents to JSON While Retaining Nested Lists
Transforming a document with nested lists into a JSON representation helps apps consume content and display it with correct hierarchy. Using python-docx you can parse nested lists and output JSON like this:

    [
      {
        "level": 0,
        "text": "Main topic",
        "children": [
          {
            "level": 1,
            "text": "Subtopic 1"
          },
          {
            "level": 1,
            "text": "Subtopic 2",
            "children": [
              {
                "level": 2,
                "text": "Detail 1"
              }
            ]
          }
        ]
      }
    ]

Such structure ensures nested relationships are explicit, which raw text or flattened lists can’t represent.
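To illustrate how an app might consume this JSON, here is a minimal sketch that renders the tree as nested HTML lists. The function name `render_html` is hypothetical, and so is the optional `ordered` key: the JSON above does not carry a list type, so the sketch falls back to `<ul>`.

```python
from html import escape

def render_html(nodes, depth=0):
    """Render nodes shaped like the JSON above as nested HTML lists."""
    if not nodes:
        return ""
    pad = "  " * depth
    # "ordered" is a hypothetical key read from the first sibling; default is <ul>
    tag = "ol" if nodes[0].get("ordered") else "ul"
    lines = [f"{pad}<{tag}>"]
    for node in nodes:
        children = render_html(node.get("children", []), depth + 2)
        if children:
            # A nested list belongs inside its parent <li>
            lines.append(f"{pad}  <li>{escape(node['text'])}")
            lines.append(children)
            lines.append(f"{pad}  </li>")
        else:
            lines.append(f"{pad}  <li>{escape(node['text'])}</li>")
    lines.append(f"{pad}</{tag}>")
    return "\n".join(lines)

tree = [{"level": 0, "text": "Main topic",
         "children": [{"level": 1, "text": "Subtopic 1"}]}]
print(render_html(tree))
```

Placing each child list inside its parent `<li>` is what keeps the hierarchy semantic rather than purely visual.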
Common Pitfalls and How to Avoid Them
Many developers and users report troubles on forums and GitHub with nested lists during conversion:
- Flattened lists: When all indent levels are lost, nested lists collapse into flat ones. This often happens when converting PDFs.
- Numbering resets: Automatic numbering restarts unexpectedly if numbering IDs aren’t tracked.
- List type mismatches: Ordered lists become unordered, breaking semantic meaning.
- Different indentation standards: Mixing tabs and spaces causes parsing errors in Markdown conversions.
To avoid these, focus on:
- Keeping track of indentation or numbering levels explicitly.
- Mapping list types correctly (ordered list elements map to <ol>, unordered to <ul>).
- Using tools that expose the underlying structure (e.g., XML for Word).
- Testing on representative documents, including deeply nested examples.
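For the tabs-versus-spaces pitfall, one way to track levels explicitly is to normalize indentation before inferring nesting. The sketch below is an illustration under stated assumptions: `markdown_list_levels` is a hypothetical helper, a tab is taken to equal four spaces, and non-list lines are skipped rather than treated as continuation text.

```python
import re

# Leading whitespace, a bullet or ordered marker, then the item text
BULLET = re.compile(r"^(\s*)(?:[-*+]|\d+[.)])\s+(.*)$")

def markdown_list_levels(text, tab_width=4):
    """Infer (level, text) pairs from Markdown list indentation."""
    items = []
    indents = []  # indentation widths of the currently open levels
    for line in text.splitlines():
        match = BULLET.match(line)
        if not match:
            continue  # ignore non-list lines in this sketch
        # Expand tabs so tabs and spaces are compared on the same scale
        width = len(match.group(1).expandtabs(tab_width))
        # Close levels indented at least as far as this line
        while indents and indents[-1] >= width:
            indents.pop()
        indents.append(width)
        items.append((len(indents) - 1, match.group(2)))
    return items

sample = "- A\n\t- B\n    - C\n        1. D\n- E"
print(markdown_list_levels(sample))
```

Because levels come from relative indentation rather than a fixed step, the same code copes with two-space and four-space documents.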
User-Centric Guide: Step-by-Step Walkthrough for Nested Lists Preservation with Python
Here is a practical user guide to preserving nested lists during Word to JSON conversion with python-docx.
Step 1: Load the Word Document

    from docx import Document

    doc = Document('sample.docx')

Step 2: Identify List Paragraphs and Their Indentation Levels

    def get_list_level(para):
        """Return the 0-based nesting level, or None for non-list paragraphs."""
        pPr = para._p.pPr
        if pPr is None or pPr.numPr is None:
            return None
        ilvl = pPr.numPr.ilvl
        return ilvl.val if ilvl is not None else 0  # missing <w:ilvl>: assume level 0

Step 3: Build a Nested Structure
Use a stack approach to maintain parent-child relationships:

    def parse_nested_lists(paragraphs):
        stack = []  # open ancestors, each stored with its level
        root = []
        for para in paragraphs:
            level = get_list_level(para)
            if level is None:
                continue  # Skip non-list paragraphs
            item = {"text": para.text, "children": []}
            # Close every open item at the same or a deeper level
            while stack and stack[-1]["level"] >= level:
                stack.pop()
            if stack:
                stack[-1]["node"]["children"].append(item)
            else:
                root.append(item)
            stack.append({"level": level, "node": item})
        return root

Step 4: Export as JSON

    import json

    nested_list_structure = parse_nested_lists(doc.paragraphs)
    print(json.dumps(nested_list_structure, indent=2))

This method explicitly respects the nesting level from the source Word document, preserving the list hierarchy during conversion, which is a common pain point.
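A simple way to check the result is a round trip: flatten the tree back into (depth, text) pairs and compare them with the levels read from the source. The sketch below assumes nodes shaped like the output of Step 3; the helper name `flatten` and the sample tree are illustrative.

```python
def flatten(nodes, depth=0):
    """Yield (depth, text) pairs from a tree of {"text", "children"} nodes."""
    for node in nodes:
        yield depth, node["text"]
        # Children sit one level deeper than their parent
        yield from flatten(node.get("children", []), depth + 1)

tree = [
    {"text": "Main topic", "children": [
        {"text": "Subtopic 1", "children": []},
        {"text": "Subtopic 2", "children": [
            {"text": "Detail 1", "children": []},
        ]},
    ]},
]

expected = [(0, "Main topic"), (1, "Subtopic 1"), (1, "Subtopic 2"), (2, "Detail 1")]
assert list(flatten(tree)) == expected
```

The comparison holds only when the source never skips a level. A jump from level 0 straight to level 2 is stored as a direct child, so it comes back as depth 1.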
Comparing Tools for Nested List Preservation
| Tool | Strengths | Limitations | Use Cases |
|---|---|---|---|
| python-docx | Full access to Word XML; good for .docx to JSON or Markdown | No PDF support; complex XML parsing | Word doc automation, conversion scripts |
| Google Docs API | Cloud-based, allows editing of Google Docs lists | Poor documentation on nested lists creation | Cloud workflows, Google Workspace integration |
| PDF Parsers (e.g., PyMuPDF) | Extracts text and layout from PDFs | Cannot reliably detect nested lists | Extracting flat text from PDFs |
| Pandoc | Converts between many formats, including Markdown, HTML, Word | Sometimes flattens complex nested lists on conversion | Multi-format conversion needs |
Choosing the right tool depends on your source and target format and how critical it is to maintain the exact nested list structure.
Handling Conversion Errors: What to Watch Out For
Errors are common when converting documents with nested lists:
- Missing indentation tags: Some formats lack explicit indent properties, causing parsing failures.
- Inconsistent styling: Mixed bullet styles or manual spacing tricks confuse converters.
- Unsupported list types: Some formats support only ordered or only unordered lists, or cannot nest one type inside the other.
- API limitations: E.g., the Google Docs API documents nested list creation only sparsely, so programmatic nesting takes extra work.
Build error handling into your scripts:
- Validate list levels before processing.
- Fall back to flattening only when hierarchy can’t be reliably detected.
- Log warnings about unsupported styles.
- Provide users with manual correction steps if automation fails.
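The checklist above could be sketched as a validation pass that runs before the hierarchy is built. Everything named here is an assumption for illustration: the `validate_levels` helper, the logger name, and the `max_depth` default of 8, which mirrors Word's nine levels (0 to 8).

```python
import logging

logger = logging.getLogger("list_conversion")  # hypothetical logger name

def validate_levels(pairs, max_depth=8):
    """Return (cleaned_pairs, warnings) for a sequence of (level, text) pairs."""
    cleaned, warnings = [], []
    previous = -1  # so the first item can only be level 0
    for position, (level, text) in enumerate(pairs):
        if not isinstance(level, int) or level < 0 or level > max_depth:
            # Hierarchy cannot be trusted here: flatten this item only
            warnings.append(f"item {position}: invalid level {level!r}, flattened to 0")
            level = 0
        elif level > previous + 1:
            # A list cannot skip levels; clamp to one deeper than the previous item
            warnings.append(f"item {position}: level jumps from {previous} to {level}, clamped")
            level = previous + 1
        cleaned.append((level, text))
        previous = level
    for message in warnings:
        logger.warning(message)
    return cleaned, warnings

cleaned, warnings = validate_levels([(0, "A"), (2, "B"), (-1, "C"), (1, "D")])
print(cleaned)
```

Returning the warnings alongside the cleaned data lets a caller show them to the user as manual correction hints instead of failing the whole conversion.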
Frequently Asked Questions
Q: Why do nested lists often break during document conversion?
A: Nested lists break during document conversion because different document formats represent list hierarchies differently, leading to issues like flattened lists, lost numbering, and incorrect indentation.
Q: What are common pitfalls when converting documents with nested lists?
A: Common pitfalls include losing all indent levels, unexpected resets in automatic numbering, mismatches between ordered and unordered lists, and parsing errors due to different indentation standards.
Q: How can I preserve nested lists when converting Word documents to JSON?
A: You can preserve nested lists by using the python-docx library to read the document's paragraph styles and indentation levels, then reconstructing the hierarchy before exporting it as JSON.
Q: What tools are recommended for preserving nested lists during conversion?
A: Recommended tools include python-docx for Word documents, the Google Docs API for cloud-based editing, and PDF parsers like PyMuPDF for extracting text, although each has its limitations.
Q: What should I do if my nested lists are flattened after conversion?
A: To avoid flattened lists, ensure you track indentation and numbering levels explicitly, use tools that expose the underlying structure, and test your conversion on representative documents.
Q: Are there programming examples for handling nested lists in Python?
A: Yes, the article provides Python code snippets using python-docx to extract nested lists and convert them into a structured JSON format, maintaining the hierarchy.