Fast Table Rendering

Status: Production Ready - Now the Default (as of 2025-10-08)

Performance: ~7-10x faster than Rich rendering, with correct colspan/rowspan handling


Overview

Fast table rendering is a high-performance alternative to Rich library rendering for table text extraction. When parsing SEC filings with hundreds of tables, cumulative rendering time can become a bottleneck. Fast rendering addresses this by building strings directly and using TableMatrix for colspan/rowspan handling, achieving a 7-10x speedup while maintaining correctness.

As of 2025-10-08, fast rendering is the default for all table text extraction. You no longer need to explicitly enable it.

Why It's Now the Default

  • Production-ready: Fixed all major issues (colspan, multi-row headers, multi-line cells)
  • 7-10x faster: Significant performance improvement with correct output
  • Maintains quality: Matches Rich's appearance with simple() style
  • Proven: Extensively tested with Apple, NVIDIA, Microsoft 10-K filings

When to Disable (Use Rich Instead)

You may want to disable fast rendering and use Rich for:

  • Terminal display for humans: Rich has more sophisticated text wrapping and layout
  • Visual reports: When presentation quality is more important than speed
  • Debugging: Rich output can be easier to visually inspect

Usage

Default Behavior (Fast Rendering Enabled)

from edgar.documents import parse_html

# Fast rendering is now the default - no configuration needed!
doc = parse_html(html)

# Tables automatically use fast renderer (7-10x faster)
table_text = doc.tables[0].text()

Disabling Fast Rendering (Use Rich Instead)

If you need Rich's sophisticated layout for visual display:

from edgar.documents import parse_html
from edgar.documents.config import ParserConfig

# Explicitly disable fast rendering to use Rich
config = ParserConfig(fast_table_rendering=False)
doc = parse_html(html, config=config)

# Tables use Rich renderer (slower but with advanced formatting)
table_text = doc.tables[0].text()

Custom Table Styles

New in this version: Fast rendering now uses the simple() style by default, which matches Rich's box.SIMPLE appearance (borderless, clean).

from edgar.documents import parse_html
from edgar.documents.config import ParserConfig
from edgar.documents.renderers.fast_table import FastTableRenderer, TableStyle

# Fast rendering is already the default; the flag is set here only for clarity
config = ParserConfig(fast_table_rendering=True)
doc = parse_html(html, config=config)

# Default: simple() style - borderless, clean
table_text = doc.tables[0].text()

# To use pipe_table() style explicitly (markdown-compatible borders):
renderer = FastTableRenderer(TableStyle.pipe_table())
pipe_text = renderer.render_table_node(doc.tables[0])

# To use minimal() style (no separator):
renderer = FastTableRenderer(TableStyle.minimal())
minimal_text = renderer.render_table_node(doc.tables[0])

Performance Comparison

Benchmark Results

Test: Apple 10-K (63 tables) - Updated 2025-10-08

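| Renderer      | Average Per Table | Improvement  | Notes                                |
|---------------|-------------------|--------------|--------------------------------------|
| Rich          | 1.5-2.5ms         | Baseline     | Varies by table complexity           |
| Fast (simple) | 0.15-0.35ms       | 7-10x faster | With proper colspan/rowspan handling |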

Real-world Examples (Apple 10-K):

  • Table 15 (complex colspan): Rich 2.51ms → Fast 0.35ms (7.1x faster)
  • Table 6 (multi-line cells): Rich 1.61ms → Fast 0.17ms (9.5x faster)
  • Table 5 (wide table): Rich 3.70ms → Fast 0.48ms (7.7x faster)

Impact on Full Parse:

  • Rich rendering: 30-40% of total parse time spent in table rendering
  • Fast rendering: 5-10% of total parse time
  • Overall speedup: Reduces total parsing time by ~25-30%
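
As a rough sanity check on these figures, the overall saving can be estimated from the share of parse time spent rendering and the renderer speedup. The helper below is an illustrative sketch only (the function name is hypothetical, not part of EdgarTools); it applies the simple estimate saving = share × (1 − 1/speedup) and recomputes the per-table ratios from the timings quoted above.

```python
def estimated_parse_time_saved(render_share: float, speedup: float) -> float:
    """Fraction of total parse time saved when rendering becomes `speedup`x faster.

    render_share: fraction of total parse time spent rendering tables (0..1).
    speedup: how many times faster the new renderer is (> 0).
    """
    if not 0.0 <= render_share <= 1.0:
        raise ValueError("render_share must be between 0 and 1")
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return render_share * (1.0 - 1.0 / speedup)


# Per-table timings quoted above, in milliseconds: (Rich, Fast).
timings = {"Table 15": (2.51, 0.35), "Table 6": (1.61, 0.17), "Table 5": (3.70, 0.48)}
for name, (rich_ms, fast_ms) in timings.items():
    print(f"{name}: {rich_ms / fast_ms:.1f}x faster")

# Low and high ends of the ranges quoted above.
low = estimated_parse_time_saved(0.30, 7)
high = estimated_parse_time_saved(0.40, 10)
print(f"Estimated saving: {low:.0%} to {high:.0%} of total parse time")
```

Under this simple model the estimate lands at roughly 26% to 36%, which is in the same range as the ~25-30% reduction reported above.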

Memory Impact

Fast rendering also reduces memory overhead:

  • No Rich Console objects retained
  • Direct string building (no intermediate objects)
  • Helps prevent memory leaks identified in profiling

Output Examples

Rich Renderer Output (Previous Default)

  (In millions)
  Year Ended June 30,                       2025    2024    2023
 ──────────────────────────────────────────────────────────

  Operating lease cost                    $5,524   3,555   2,875

  Finance lease cost:
  Amortization of right-of-use assets     $3,408   1,800   1,352
  Interest on lease liabilities            1,417     734     501

  Total finance lease cost                $4,825   2,534   1,853

Style: box.SIMPLE - No outer border, just a horizontal separator under the header

Pros: Clean, uncluttered, perfect alignment, generous spacing

Cons: Slow (roughly 1.5-2.5ms per table in the Apple 10-K benchmark), creates Rich objects, memory overhead

Fast Renderer Output (NEW: simple() style - Default)

                            December 31, 2023    December 31, 2022    December 31, 2021
 ───────────────────────────────────────────────────────────────────────────────────────
  Revenue                               365,817              394,328              365,817
  Cost of revenue                       223,546              212,981              192,266
  Gross profit                          142,271              181,347              173,551

Style: simple() - Matches Rich's box.SIMPLE appearance

Pros: Fast (0.2ms per table), clean appearance, no visual noise, professional look

Cons: Minor spacing differences from Rich in some complex tables; this is now the recommended default

Fast Renderer Output (pipe_table() style - Optional)

|                          |  December 31, 2023  |  December 31, 2022  |  December 31, 2021  |
|--------------------------|---------------------|---------------------|---------------------|
| Revenue                  |             365,817 |             394,328 |             365,817 |
| Cost of revenue          |             223,546 |             212,981 |             192,266 |
| Gross profit             |             142,271 |             181,347 |             173,551 |

Style: pipe_table() - Markdown-compatible with borders

Pros: Fast (0.2ms per table), markdown-compatible, explicit column boundaries

Cons: Visual noise from pipe characters, busier appearance

Use when: You need markdown-compatible output with explicit borders

Visual Comparison

Rich (box.SIMPLE):

  • No outer border - clean, uncluttered look
  • Horizontal line separator under header only
  • Generous internal spacing and padding
  • Perfect column alignment
  • Professional, minimalist presentation

Fast simple() (NEW DEFAULT):

  • No outer border - matches Rich's clean look
  • Horizontal line separator under header (using ─)
  • Space-separated columns with generous padding
  • Clean, professional appearance
  • Same performance as pipe_table (~0.2ms per table)

Fast pipe_table() (optional):

  • Full pipe table borders (| characters everywhere)
  • Horizontal dashes for header separator
  • Markdown-compatible format
  • Explicit column boundaries
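
The differences between the three styles can be sketched in plain Python. This is an illustration under stated assumptions, not the real implementation: `SketchStyle`, its fields, and `render` are hypothetical names invented here and do not reflect the actual `TableStyle` class.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SketchStyle:
    """Hypothetical style record; not the real TableStyle class."""
    column_sep: str             # text placed between adjacent columns
    header_rule: Optional[str]  # character for the rule under the header, or None
    bordered: bool              # wrap each line in pipe characters


SIMPLE = SketchStyle(column_sep="  ", header_rule="─", bordered=False)
PIPE = SketchStyle(column_sep=" | ", header_rule="-", bordered=True)
MINIMAL = SketchStyle(column_sep="  ", header_rule=None, bordered=False)


def render(header: List[str], rows: List[List[str]], style: SketchStyle) -> str:
    """Render a rectangular table; first column left-aligned, the rest right-aligned."""
    widths = [max(len(r[i]) for r in [header] + rows) for i in range(len(header))]

    def fmt(cells: List[str]) -> str:
        parts = [cells[0].ljust(widths[0])]
        parts += [c.rjust(w) for c, w in zip(cells[1:], widths[1:])]
        line = style.column_sep.join(parts)
        return f"| {line} |" if style.bordered else line

    lines = [fmt(header)]
    if style.header_rule is not None:
        if style.bordered:
            # One run of dashes per column, padded to cover the cell margins.
            lines.append("|" + "|".join(style.header_rule * (w + 2) for w in widths) + "|")
        else:
            total = sum(widths) + len(style.column_sep) * (len(widths) - 1)
            lines.append(style.header_rule * total)
    lines.extend(fmt(r) for r in rows)
    return "\n".join(lines)


header = ["", "2023", "2022"]
rows = [["Revenue", "365,817", "394,328"], ["Gross profit", "142,271", "181,347"]]
for style in (SIMPLE, PIPE, MINIMAL):
    print(render(header, rows, style))
    print()
```

Keeping the style as data rather than as separate code paths is one way all three styles could share the same width calculation, which would be consistent with simple() and pipe_table() having the same per-table cost.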

Recent Improvements (2025-10-08)

1. Colspan/Rowspan Support

Fixed: Tables with colspan and rowspan attributes now render correctly.

Previous issue: Fast renderer was extracting cell text without accounting for colspan/rowspan, causing:

  • Missing columns (e.g., "2023" column disappeared in Apple 10-K table 15)
  • Misaligned data (currency symbols separated from values)
  • Data loss (em dashes and other values missing)

Solution: Integrated TableMatrix for proper cell expansion, same as Rich rendering uses.

Status: FIXED
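
To illustrate what cell expansion means, here is a minimal sketch of placing cells on a rectangular grid while honoring colspan and rowspan. It is not the actual TableMatrix code: the `Cell` record and `expand_spans` function are hypothetical names used only for this example, and the choice to leave covered slots empty is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Cell:
    """Hypothetical cell record; the real TableMatrix works on parsed table nodes."""
    text: str
    colspan: int = 1
    rowspan: int = 1


def expand_spans(rows: List[List[Cell]]) -> List[List[str]]:
    """Place each cell on a grid, reserving every slot its spans cover.

    The text goes into the top-left slot only; the other covered slots hold
    empty strings so that later cells land in the correct column.
    """
    grid: Dict[Tuple[int, int], str] = {}
    for r, row in enumerate(rows):
        c = 0
        for cell in row:
            # Skip slots already reserved by a rowspan from an earlier row.
            while (r, c) in grid:
                c += 1
            for dr in range(cell.rowspan):
                for dc in range(cell.colspan):
                    value = cell.text if (dr, dc) == (0, 0) else ""
                    grid.setdefault((r + dr, c + dc), value)
            c += cell.colspan
    n_rows = max(r for r, _ in grid) + 1 if grid else 0
    n_cols = max(c for _, c in grid) + 1 if grid else 0
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


rows = [
    [Cell("", rowspan=2), Cell("Year Ended", colspan=2)],
    [Cell("2023"), Cell("2022")],
    [Cell("Revenue"), Cell("365,817"), Cell("394,328")],
]
for line in expand_spans(rows):
    print(line)
```

Without the reservation step, "2023" in the second row would be read as column 0 and every value after it would shift left, which is the kind of missing or misaligned column described above.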

2. Multi-Row Header Preservation

Fixed: Tables with multiple header rows now preserve each row separately.

Previous issue: Multi-row headers were collapsed into a single line, causing "Investment portfolio" row to disappear in Apple 10-K table 20.

Solution: Modified render_table_data() and _build_table() to preserve each header row as a separate line.

Status: FIXED
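
The idea of keeping each header row on its own line can be sketched as follows. This is a simplified illustration, not the code in render_table_data() or _build_table(); the function name and the two-space separator are assumptions.

```python
from typing import List


def render_with_header_rows(
    header_rows: List[List[str]], body: List[List[str]], sep: str = "  "
) -> str:
    """Emit every header row on its own line, then a rule, then the body rows."""
    all_rows = header_rows + body
    n_cols = len(all_rows[0])
    # Widths are shared by header and body rows so the columns line up.
    widths = [max(len(row[i]) for row in all_rows) for i in range(n_cols)]

    def fmt(row: List[str]) -> str:
        first = row[0].ljust(widths[0])
        rest = [cell.rjust(w) for cell, w in zip(row[1:], widths[1:])]
        return sep.join([first] + rest)

    rule = "─" * (sum(widths) + len(sep) * (n_cols - 1))
    lines = [fmt(row) for row in header_rows] + [rule] + [fmt(row) for row in body]
    return "\n".join(lines)


header_rows = [
    ["(In millions)", "", ""],
    ["Year Ended June 30,", "2025", "2024"],
]
body = [["Operating lease cost", "$5,524", "3,555"]]
print(render_with_header_rows(header_rows, body))
```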

3. Multi-Line Cell Rendering

Fixed: Cells containing newline characters (\n) now render as multiple lines.

Previous issue: Multi-line cells like "Interest Rate\nSensitive Instrument" were truncated to first line only.

Solution: Added _format_multiline_row() to split cells by \n and render each line separately.

Status: FIXED
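
A minimal sketch of the splitting approach, assuming column widths are already known. The function below is a stand-in written for this example and is not the library's _format_multiline_row(); its signature is an assumption.

```python
from itertools import zip_longest
from typing import List


def format_multiline_row(cells: List[str], widths: List[int], sep: str = "  ") -> List[str]:
    """Render one logical row as several physical lines, one per line of cell text."""
    split_cells = [cell.split("\n") for cell in cells]
    lines = []
    # zip_longest pads shorter cells with "" so every column appears on every line.
    for parts in zip_longest(*split_cells, fillvalue=""):
        padded = [part.ljust(width) for part, width in zip(parts, widths)]
        lines.append(sep.join(padded).rstrip())
    return lines


for line in format_multiline_row(["Interest Rate\nSensitive Instrument", "2023"], [20, 6]):
    print(line)
```

Note that the column width here has to be the length of the longest single line in a cell, not the length of the whole cell text.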

Performance Impact

All three fixes maintain excellent performance:

  • Speedup: 7-10x faster than Rich (down from initial 14x, but with correct output)
  • Correctness: Now matches Rich's data output for colspan, multi-row headers, and multi-line cells (minor spacing differences remain; see Known Limitations)
  • Production ready: Can confidently use as default renderer

Known Limitations

1. Column Alignment in Some Tables

Issue: Currency symbols and values may have extra spacing in some complex tables (e.g., Apple 10-K table 22)

Example:

  • Rich: $294,866
  • Fast: $ 294,866 (extra spacing)

Root cause: Column width calculation creates wider columns for some currency/value pairs after colspan expansion and column filtering.

Impact: Visual appearance differs slightly, but data is correct and readable.

Status: ⚠️ Minor visual difference - acceptable trade-off for the 7-10x performance gain

2. Visual Polish

Issue: Some visual aspects don't exactly match Rich's sophisticated layout

Examples:

  • Multi-line cell wrapping may differ
  • Column alignment in edge cases

Status: ⚠️ Acceptable trade-off for the 7-10x performance gain


Configuration Options

Table Styles

Fast renderer supports different visual styles:

from edgar.documents.renderers.fast_table import FastTableRenderer, TableStyle

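# Simple style (default) - borderless, matches Rich's box.SIMPLE
renderer = FastTableRenderer(TableStyle.simple())

# Pipe table style - markdown compatible
renderer = FastTableRenderer(TableStyle.pipe_table())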

# Minimal style - no borders, just spacing
renderer = FastTableRenderer(TableStyle.minimal())

Minimal Style Output

                           December 31, 2023   December 31, 2022   December 31, 2021
Revenue                              365,817             394,328             365,817
Cost of revenue                      223,546             212,981             192,266
Gross profit                         142,271             181,347             173,551

Note: Minimal style has a cleaner appearance but loses explicit column boundaries and the header separator.


Technical Details

How It Works

  1. Direct String Building: Bypasses Rich's layout engine
  2. Column Analysis: Detects numeric columns for right-alignment
  3. Smart Filtering: Removes empty spacing columns
  4. Currency Merging: Combines $ symbols with amounts
  5. Width Calculation: Measures content, applies min/max limits
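
The steps above can be sketched end to end in a few dozen lines. This is an illustrative approximation, not the FastTableRenderer source: every function name, the numeric regex, and the default width limits are assumptions made for this example, and the input is assumed to be a rectangular grid of strings (after colspan/rowspan expansion).

```python
import re
from typing import List

Row = List[str]
# Matches values such as 1,417  $5,524  (2,534)  12.5%
NUMERIC = re.compile(r"^\(?\$?-?[\d,]+(?:\.\d+)?\)?%?$")


def drop_empty_columns(rows: List[Row]) -> List[Row]:
    """Smart filtering: remove columns that are blank in every row."""
    keep = [i for i in range(len(rows[0])) if any(r[i].strip() for r in rows)]
    return [[r[i] for i in keep] for r in rows]


def merge_currency_columns(rows: List[Row]) -> List[Row]:
    """Currency merging: fold a "$"-only column into the column to its right."""
    n = len(rows[0])
    is_currency = [
        all(r[i].strip() in ("", "$") for r in rows)
        and any(r[i].strip() == "$" for r in rows)
        for i in range(n)
    ]
    merged = []
    for row in rows:
        out, prefix = [], ""
        for i, cell in enumerate(row):
            if is_currency[i] and i + 1 < n:
                prefix = cell.strip()  # remember "$" (or "") for the next cell
                continue
            out.append(prefix + cell.strip())
            prefix = ""
        merged.append(out)
    return merged


def is_numeric_column(cells: Row) -> bool:
    """Column analysis: numeric if every non-blank cell looks like a number."""
    values = [c.strip() for c in cells if c.strip()]
    return bool(values) and all(NUMERIC.match(v) for v in values)


def render(header: Row, body: List[Row], min_width: int = 3, max_width: int = 40) -> str:
    rows = merge_currency_columns(drop_empty_columns([header] + body))
    n = len(rows[0])
    numeric = [is_numeric_column([r[i] for r in rows[1:]]) for i in range(n)]
    # Width calculation: widest cell per column, clamped to [min_width, max_width].
    widths = [min(max_width, max(min_width, max(len(r[i]) for r in rows))) for i in range(n)]
    lines = []
    for r in rows:
        parts = [
            c[:w].rjust(w) if num else c[:w].ljust(w)
            for c, w, num in zip(r, widths, numeric)
        ]
        lines.append("  ".join(parts).rstrip())  # direct string building
    return "\n".join(lines)


header = ["", "", "", "2025", "", "2024"]
body = [
    ["Operating lease cost", "", "$", "5,524", "", "3,555"],
    ["Interest on lease liabilities", "", "", "1,417", "", "734"],
]
print(render(header, body))
```

In this sketch the currency merge is decided per column rather than per row, so rows with and without a "$" keep the same number of columns after merging.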

Code Path

# When fast_table_rendering=True (the default):
table.text()
  → TableNode._fast_text_rendering()
  → FastTableRenderer.render_table_node()
  → Direct string building

Memory Benefits

Fast rendering avoids:

  • Rich Console object creation (~0.4MB per document)
  • Intermediate rich.Table objects
  • Style/theme processing overhead
  • ANSI escape code generation

Future Improvements

Planned Enhancements

  1. Match Rich's box.SIMPLE Style (completed 2025-10-08 as TableStyle.simple())

    • No pipe characters: no outer border, no column separators
    • Only a horizontal separator under the header (using the ─ character)
    • Space-separated columns with padding closer to Rich's spacing
    • Clean, minimalist appearance like Rich's SIMPLE box style
    • Result: close to Rich's visual quality while remaining 7-10x faster
  2. Improved Layout Engine

    • Better column width calculation (avoid too-wide/too-narrow columns)
    • Respect natural content breaks
    • Dynamic spacing based on content type
    • Handle wrapping for long content
  3. Dynamic Padding

    • Match Rich's generous spacing (currently too tight)
    • Adjust padding based on content type
    • Configurable padding rules
    • Maintain alignment with variable padding
  4. Header Handling

    • Better multi-row header collapse
    • Preserve important hierarchies
    • Smart column spanning
    • Honor header groupings
  5. Style Presets

    • TableStyle.simple() - matches Rich's box.SIMPLE (no borders, header separator only); implemented and now the default
    • TableStyle.minimal() - no borders, just spacing (already implemented)
    • TableStyle.pipe_table() - markdown style (already implemented, optional)
    • TableStyle.ascii_clean() - no Unicode, pure ASCII
    • TableStyle.compact() - minimal spacing for dense data

Timeline

These improvements are planned for Phase 2 of the HTML parser optimization work (after memory leak fixes).


Migration Guide

From Rich to Fast

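Before (using Rich, which must now be requested explicitly):

from edgar.documents import parse_html
from edgar.documents.config import ParserConfig

config = ParserConfig(fast_table_rendering=False)
doc = parse_html(html, config=config)
table_text = doc.tables[0].text()  # Slow but pretty

After (using Fast, the default since 2025-10-08):

doc = parse_html(html)
table_text = doc.tables[0].text()  # Fast; minor spacing differences from Rich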

Hybrid Approach

Use fast rendering during processing, Rich for final display:

# Fast processing
config = ParserConfig(fast_table_rendering=True)
doc = parse_html(html, config=config)

# Extract data quickly
for table in doc.tables:
    data = table.text()  # Fast
    # Process data...

# Display one table nicely
special_table = doc.tables[5]
rich_output = special_table.render()  # Switch to Rich for display

Performance Recommendations

Batch Processing (optimize for speed):

config = ParserConfig.for_performance()
# Includes: fast_table_rendering=True, eager_section_extraction=False

Data Extraction (balance speed and accuracy):

config = ParserConfig(
    fast_table_rendering=True,
    extract_xbrl=True,
    detect_sections=True
)

Display/Reports (optimize for quality):

config = ParserConfig(fast_table_rendering=False)  # Rich must now be requested explicitly
# Or use the accuracy-oriented preset:
config = ParserConfig.for_accuracy()

FAQ

Q: Can I mix Fast and Rich rendering?
A: Not per-table. The setting is document-wide via ParserConfig. However, you can manually call table.render() to get Rich output.

Q: Does this affect section extraction?
A: Indirectly, yes. Section detection calls text() on the entire document, which includes tables. Fast rendering speeds this up significantly.

Q: Will the output format change?
A: Yes, as we improve the renderer. We'll maintain backward compatibility via style options.

Q: Can I customize the appearance?
A: Currently limited to the built-in presets: TableStyle.simple() (default), TableStyle.pipe_table(), and TableStyle.minimal(). More options are coming.

Q: What about DataFrame export?
A: Fast rendering only affects text output. table.to_dataframe() is unaffected.


Feedback

The fast renderer is actively being improved based on user feedback. Known issues:

  1. Pipe characters - visual noise (pipe_table() style only)
  2. Layout engine - inconsistent spacing
  3. Padding - needs tuning

If you have specific rendering issues or suggestions, please provide:

  • Sample table HTML
  • Expected vs actual output
  • Use case description

This helps prioritize improvements while maintaining the performance advantage.


Summary

Current State (As of 2025-10-08)

Performance: Excellent (7-10x faster than Rich)

Correctness: Production ready (proper colspan/rowspan handling)

Visual Quality: ⚠️ Good (simple() style matches Rich's box.SIMPLE appearance)

Use Case: Production-ready for all use cases

Recent Milestones

Completed:

  • Core fast rendering implementation
  • TableStyle.simple() preset (borderless, clean)
  • Column filtering and merging
  • Numeric alignment detection
  • Colspan/rowspan support via TableMatrix
  • Performance benchmarking with real tables

🔧 Current Limitations:

  • Multi-row header collapsing differs from Rich
  • Some visual polish differences (acceptable for speed gain)
  • Layout engine not as sophisticated as Rich

Development Roadmap

Phase 1 (COMPLETED):

  • Core fast rendering implementation
  • simple() style matching Rich's box.SIMPLE
  • Proper colspan/rowspan handling via TableMatrix
  • Production-ready performance (7-10x faster)

Phase 2 (Future Enhancements):

  • 📋 Improve multi-row header handling
  • 📋 Better layout engine for perfect column widths
  • 📋 Additional style presets
  • 📋 Advanced header detection (data vs labels)

Bottom Line

Fast table rendering is production-ready and now the default for all table text extraction in EdgarTools.

Benefits:

  • 7-10x faster than Rich rendering
  • Correct data extraction with proper colspan/rowspan handling
  • Multi-row header preservation
  • Multi-line cell rendering
  • Clean, borderless appearance (simple() style)

Minor differences from Rich:

  • ⚠️ Some tables have extra spacing between currency symbols and values (e.g., table 22)
  • ⚠️ Column width calculation may differ slightly in complex tables
  • All data is preserved and correct - only visual presentation differs

The implementation achieves correct data extraction with significant performance gains and clean visual output, making it the ideal default for EdgarTools.