Preserving nested lists during document conversion can feel like walking a tightrope

Most document formats and conversion tools struggle to keep list hierarchies intact, which leads to flattened lists, broken numbering, and lost structure. In practice, this disrupts technical papers, project documentation, and any other content where list organization matters.

Why Do Nested Lists Break During Document Conversion?

Nested lists pose a unique challenge because they embed list items within other list items, creating a hierarchy both visually and semantically. Most document formats (Word, PDF, Markdown, HTML) represent these hierarchies differently at the data level.

During conversion, tools often:

Ignore list level properties, treating all items as top-level.
Fail to preserve different list types—ordered vs unordered.
Drop numbering styles, reverting to generic bullets or numbers.
Lose indentation info, resulting in flattened content.

For example, GitHub Issue #2375 documents how converting PDFs often flattens nested lists, turning complex outlines into one-level bullet dumps. That’s not just ugly—it’s misleading, especially in legal or academic contexts.

This problem is widespread because document conversion is not just about copying content—it's about translating hierarchical structure between diverse formatting engines, each with its quirks.

How Are Nested Lists Structured in Popular Formats?

Understanding the source data formats helps with the conversion strategy. Here’s a comparison table for common document formats:

Format	Nested List Representation	Key Challenge
Word (.docx)	`<w:numPr>` and `<w:ilvl>` tags control list levels and numbering	Word numbering schemes vary; requires parsing XML carefully
PDF	Visual indentation and text styling	No native list tags; hierarchy inferred from layout
Markdown	Indentation with spaces/tabs and `-` or `*` or numbered	Indentation varies; ambiguous levels if inconsistent
HTML	`<ol>` and `<ul>` with `<li>` elements nest naturally	Browser rendering is native but mapping to other formats can fail

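Python's python-docx package has no high-level list API, but it exposes the underlying XML, so the numbering properties (`<w:numPr>` and `<w:ilvl>`) can be read and the hierarchy rebuilt programmatically. PDFs are much harder: the hierarchy has to be inferred through layout analysis rather than read from tags, which makes nested list preservation error-prone.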
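To make "layout analysis" concrete, here is a minimal sketch of inferring nesting levels from the left edge of each line. It assumes a PDF parser has already produced (x0, text) pairs, where x0 is the left offset in points; the function name `infer_levels`, the tolerance value, and the sample coordinates are all hypothetical and hand-written for illustration.

```python
def infer_levels(lines, tolerance=2.0):
    """Map each (x0, text) line to a nesting level based on its left offset."""
    # Collect representative left offsets, merging values within `tolerance`.
    offsets = []
    for x0, _ in sorted(lines, key=lambda item: item[0]):
        if not offsets or x0 - offsets[-1] > tolerance:
            offsets.append(x0)

    def level_of(x0):
        # The rank of the closest representative offset is the level.
        return min(range(len(offsets)), key=lambda i: abs(offsets[i] - x0))

    return [(level_of(x0), text) for x0, text in lines]


# Hand-written sample: left offsets as a layout-aware parser might report them.
sample = [
    (72.0, "Main topic"),
    (90.0, "Subtopic 1"),
    (90.4, "Subtopic 2"),
    (108.0, "Detail 1"),
    (72.2, "Second topic"),
]
levels = infer_levels(sample)
```

A heuristic like this breaks down on multi-column pages, wrapped lines, and centered text, which is part of why PDF list recovery stays unreliable.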

How To Preserve Nested Lists Using Programming Tools

Some libraries specifically help to maintain nested lists during conversions between formats like Word to JSON or vice versa. Here are three common scenarios:

1. Using `python-docx` to Preserve Nested Lists in Word Documents

With python-docx you can inspect list structures by reading each paragraph's style and its numbering level (the `<w:ilvl>` value inside `<w:numPr>`). You can traverse the document paragraph by paragraph, reconstruct the nested list levels, and write the result out in the target format.

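from docx import Document

def get_nested_lists(doc_path):
    """Return (level, text) pairs for every list paragraph in the document."""
    doc = Document(doc_path)
    nested_list = []
    for para in doc.paragraphs:
        if not para.style.name.startswith('List'):
            continue
        # numPr is missing when the numbering comes from the style alone;
        # treat that case, and a missing <w:ilvl>, as level 0.
        pPr = para._p.pPr
        numPr = pPr.numPr if pPr is not None else None
        ilvl = numPr.ilvl if numPr is not None else None
        indent_level = ilvl.val if ilvl is not None else 0
        nested_list.append((indent_level, para.text))
    return nested_list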

This snippet extracts each paragraph’s indentation level alongside its text, helping preserve the hierarchy when converting to other formats like JSON.
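As one possible next step, the (level, text) pairs can be written out as an indented Markdown list. This is a minimal sketch: the helper name `to_markdown` and the four-space indent are assumptions, and every item is rendered as an unordered bullet because the pairs carry no list-type information.

```python
def to_markdown(items, indent="    "):
    """Render (level, text) pairs as an indented Markdown bullet list."""
    lines = []
    for level, text in items:
        # One indent unit per nesting level keeps child items under their parent.
        lines.append(f"{indent * level}- {text}")
    return "\n".join(lines)


# Pairs in the shape returned by get_nested_lists().
pairs = [(0, "Main topic"), (1, "Subtopic 1"), (1, "Subtopic 2"), (2, "Detail 1")]
markdown = to_markdown(pairs)
print(markdown)
```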

2. Using Google Docs API for Nested Lists (With Limitations)

The Google Docs API can read and create nested lists, but its documentation gives little step-by-step guidance on building them. When reading a document, each list paragraph carries a bullet with a listId and a nestingLevel. When creating bullets with a createParagraphBullets request, the nesting level is not set directly; it is inferred from the leading tabs in front of each paragraph, which complicates automation.
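Reading nesting levels back out of a Google Doc is simpler than creating them. The sketch below walks the JSON returned by a documents.get call and collects the bullet information for each list paragraph. The function name `list_items_from_doc` is hypothetical, the sample dictionary is hand-written rather than real API output, and the code assumes nestingLevel may be absent for top-level items, so it defaults to 0.

```python
def list_items_from_doc(document):
    """Return (list_id, nesting_level, text) for each bulleted paragraph."""
    items = []
    for element in document.get("body", {}).get("content", []):
        paragraph = element.get("paragraph")
        if not paragraph or "bullet" not in paragraph:
            continue  # not a paragraph, or not part of a list
        bullet = paragraph["bullet"]
        text = "".join(
            run.get("textRun", {}).get("content", "")
            for run in paragraph.get("elements", [])
        ).rstrip("\n")
        # Assumption: a missing nestingLevel means a top-level item.
        items.append((bullet.get("listId"), bullet.get("nestingLevel", 0), text))
    return items


# Hand-written sample in the shape of a documents.get response.
sample_doc = {
    "body": {
        "content": [
            {"sectionBreak": {}},
            {"paragraph": {
                "elements": [{"textRun": {"content": "Main topic\n"}}],
                "bullet": {"listId": "kix.list1"},
            }},
            {"paragraph": {
                "elements": [{"textRun": {"content": "Subtopic 1\n"}}],
                "bullet": {"listId": "kix.list1", "nestingLevel": 1},
            }},
            {"paragraph": {
                "elements": [{"textRun": {"content": "Plain paragraph\n"}}],
            }},
        ]
    }
}
doc_items = list_items_from_doc(sample_doc)
```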

3. Converting Documents to JSON While Retaining Nested Lists

Transforming a document with nested lists into a JSON representation helps apps consume content and display it with correct hierarchy. Using python-docx you can parse nested lists and output JSON like this:

[
  {
    "level": 0,
    "text": "Main topic",
    "children": [
      {
        "level": 1,
        "text": "Subtopic 1"
      },
      {
        "level": 1,
        "text": "Subtopic 2",
        "children": [
          {
            "level": 2,
            "text": "Detail 1"
          }
        ]
      }
    ]
  }
]

Such a structure makes nested relationships explicit, which raw text or flattened lists cannot do.

Common Pitfalls and How to Avoid Them

Many developers and users report trouble with nested lists during conversion on forums and GitHub:

Flattened lists: Losing all indent levels turns nested lists into flat ones. This happens often when converting PDFs.
Numbering resets: Automatic numbering restarts unexpectedly if numbering IDs aren’t tracked.
List type mismatches: Ordered lists become unordered, breaking semantic meaning.
Different indentation standards: Mixing tabs and spaces causes parsing errors in Markdown conversions.
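The tabs-versus-spaces pitfall can be softened by normalizing indentation before assigning levels. Here is a minimal sketch, not a full Markdown parser: it expands tabs to a fixed width and then derives each item's level from a stack of open indent widths, so two-space, four-space, and tab-indented lists all resolve the same way. The names `ITEM` and `markdown_levels` and the default tab width of 4 are assumptions.

```python
import re

# Leading whitespace, a bullet or numbered marker, then the item text.
ITEM = re.compile(r"^(?P<indent>[ \t]*)(?P<marker>[-*+]|\d+[.)])\s+(?P<text>.*)$")


def markdown_levels(source, tab_width=4):
    """Return (level, ordered, text) for each list line in a Markdown string."""
    result = []
    stack = []  # indent widths of the currently open ancestor items
    for line in source.splitlines():
        match = ITEM.match(line)
        if not match:
            continue  # not a list item
        width = len(match["indent"].expandtabs(tab_width))
        # Close every open item indented as far as, or further than, this one.
        while stack and stack[-1] >= width:
            stack.pop()
        ordered = match["marker"][0].isdigit()
        result.append((len(stack), ordered, match["text"]))
        stack.append(width)
    return result


source = "\n".join([
    "- Main topic",
    "\t- Subtopic 1",          # indented with a tab
    "    - Subtopic 2",        # indented with four spaces
    "        1. Detail 1",
    "- Second topic",
])
rows = markdown_levels(source)
```

Deriving levels from relative indentation rather than dividing by a fixed step avoids guessing the author's indent size.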

To avoid these, focus on:

Keeping track of indentation or numbering levels explicitly.
Mapping list types correctly (ordered list elements map to <ol>, unordered to <ul>).
Using tools that expose the underlying structure (e.g., XML for Word).
Testing on representative documents, including deeply nested examples.
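The list-type mapping mentioned above can be sketched as a small recursive renderer. This is an illustration under stated assumptions: the `ordered` key on each node is a hypothetical addition (the JSON shown earlier does not include it), and all siblings are assumed to share the list type of the first one.

```python
from html import escape


def render_html(nodes):
    """Render a list of sibling nodes as nested <ol>/<ul> markup."""
    if not nodes:
        return ""
    # Assumption: siblings share one list type, taken from the first node.
    tag = "ol" if nodes[0].get("ordered") else "ul"
    parts = [f"<{tag}>"]
    for node in nodes:
        # Child lists are nested inside the parent's <li>, as HTML expects.
        children = render_html(node.get("children", []))
        parts.append(f"<li>{escape(node['text'])}{children}</li>")
    parts.append(f"</{tag}>")
    return "".join(parts)


tree = [
    {"text": "Install", "ordered": True, "children": [
        {"text": "Linux & macOS", "ordered": False, "children": []},
        {"text": "Windows", "ordered": False, "children": []},
    ]},
    {"text": "Configure", "ordered": True, "children": []},
]
html_output = render_html(tree)
```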

User-Centric Guide: Step-by-Step Walkthrough for Nested Lists Preservation with Python

Here is a practical user guide to preserving nested lists during Word to JSON conversion with python-docx.

Step 1: Load the Word Document

from docx import Document
doc = Document('sample.docx')

Step 2: Identify List Paragraphs and Their Indentation Levels

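def get_list_level(para):
    """Return the 0-based nesting level, or None for non-list paragraphs."""
    pPr = para._p.pPr
    if pPr is None or pPr.numPr is None:
        return None
    ilvl = pPr.numPr.ilvl
    # A numPr without <w:ilvl> is treated here as the top level.
    return ilvl.val if ilvl is not None else 0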

Step 3: Build a Nested Structure

Use a stack approach to maintain parent-child relationships:

def parse_nested_lists(paragraphs):
    stack = []  # entries for the currently open ancestors, deepest last
    root = []
    for para in paragraphs:
        level = get_list_level(para)
        if level is None:
            continue  # Skip non-list paragraphs

        item = {"level": level, "text": para.text, "children": []}
        # Close every open item at the same or a deeper level.
        while stack and stack[-1]["level"] >= level:
            stack.pop()

        # Attach to the nearest shallower item, or to the root if none is open.
        siblings = stack[-1]["node"]["children"] if stack else root
        siblings.append(item)
        stack.append({"level": level, "node": item})
    return root

Step 4: Export as JSON

import json
nested_list_structure = parse_nested_lists(doc.paragraphs)
print(json.dumps(nested_list_structure, indent=2))

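This method reads the nesting level directly from the source Word document, so the list hierarchy survives the conversion. Note that it only sees paragraphs that carry their own `<w:numPr>`; lists numbered purely through a paragraph style are skipped.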
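One way to check the result is to walk the tree back into (level, text) pairs and compare them with the levels read from the source. The sketch below is a hypothetical helper, `flatten`, that derives each level from tree depth, so it works whether or not the nodes store a level of their own.

```python
def flatten(nodes, level=0):
    """Yield (level, text) pairs from a tree of {"text", "children"} nodes."""
    for node in nodes:
        yield level, node["text"]
        # Children sit one level deeper than their parent.
        yield from flatten(node.get("children", []), level + 1)


tree = [{"text": "Main topic", "children": [
    {"text": "Subtopic 1", "children": []},
    {"text": "Subtopic 2", "children": [
        {"text": "Detail 1", "children": []},
    ]},
]}]
pairs = list(flatten(tree))
```

If the source skips a level (for example, jumping from level 0 straight to level 2), the depth-based levels will differ from the original ones, which makes this comparison a cheap way to spot irregular input.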

Comparing Tools for Nested List Preservation

Tool	Strengths	Limitations	Use Cases
`python-docx`	Full access to Word XML; good for .docx to JSON or Markdown	No PDF support; complex XML parsing	Word doc automation, conversion scripts
Google Docs API	Cloud-based, allows editing of Google Docs lists	Poor documentation on nested lists creation	Cloud workflows, Google Workspace integration
PDF Parsers (e.g., PyMuPDF)	Extracts text and layout from PDFs	Cannot reliably detect nested lists	Extracting flat text from PDFs
Pandoc	Converts between many formats, including Markdown, HTML, Word	Sometimes flattens complex nested lists on conversion	Multi-format conversion needs

Choosing the right tool depends on your source and target format and how critical it is to maintain the exact nested list structure.

Handling Conversion Errors: What to Watch Out For

Errors are common when converting documents with nested lists:

Missing indentation tags: Some formats lack explicit indent properties, causing parsing failures.
Inconsistent styling: Mixed bullet styles or manual spacing tricks confuse converters.
Unsupported list types: Some formats cannot nest ordered and unordered lists inside each other.
API limitations: For example, the Google Docs API offers little documented support for creating nested lists programmatically.

Build error handling into your scripts:

Validate list levels before processing.
Fall back to flattening only when hierarchy can’t be reliably detected.
Log warnings about unsupported styles.
Provide users with manual correction steps if automation fails.
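The first three checks can be sketched as follows, assuming the list has already been extracted as (level, text) pairs. The function names, the logger name, and the default `max_depth` of 8 (Word's deepest 0-based level) are assumptions chosen for illustration.

```python
import logging

logger = logging.getLogger("list_conversion")


def validate_levels(items, max_depth=8):
    """Return a list of problem descriptions; empty means the levels look sane."""
    problems = []
    previous = -1
    for index, (level, _text) in enumerate(items):
        if not isinstance(level, int) or not 0 <= level <= max_depth:
            problems.append(f"item {index}: unusable level {level!r}")
            continue
        if level > previous + 1:
            # A child may only sit one level below the preceding item.
            problems.append(f"item {index}: jumps from level {previous} to {level}")
        previous = level
    return problems


def levels_or_flat(items):
    """Keep the hierarchy when it validates; otherwise warn and flatten."""
    problems = validate_levels(items)
    if not problems:
        return list(items)
    for problem in problems:
        logger.warning(problem)
    return [(0, text) for _, text in items]


good = [(0, "a"), (1, "b"), (2, "c"), (0, "d")]
bad = [(0, "a"), (2, "b"), (None, "c")]
kept = levels_or_flat(good)
flattened = levels_or_flat(bad)
```

The logged warnings give users a starting point for manual correction when the automatic path gives up.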

“Converting nested lists accurately is one of the toughest challenges in

Frequently Asked Questions

Q: Why do nested lists often break during document conversion?

A: Nested lists break during document conversion because different document formats represent list hierarchies differently, leading to issues like flattened lists, lost numbering, and incorrect indentation.

Q: What are common pitfalls when converting documents with nested lists?

A: Common pitfalls include losing all indent levels, unexpected resets in automatic numbering, mismatches between ordered and unordered lists, and parsing errors due to different indentation standards.

Q: How can I preserve nested lists when converting Word documents to JSON?

A: You can preserve nested lists by using the python-docx library to read the document's paragraph styles and indentation levels, then reconstructing the hierarchy before exporting it as JSON.

Q: What tools are recommended for preserving nested lists during conversion?

A: Recommended tools include python-docx for Word documents, the Google Docs API for cloud-based editing, and PDF parsers like PyMuPDF for extracting text, although each has its limitations.

Q: What should I do if my nested lists are flattened after conversion?

A: If your lists come out flattened, check whether the tool reads indentation and numbering levels at all. Track those levels explicitly, switch to tools that expose the underlying structure, and retest the conversion on representative documents.

Q: Are there programming examples for handling nested lists in Python?

A: Yes, the article provides Python code snippets using python-docx to extract nested lists and convert them into a structured JSON format, maintaining the hierarchy.