kdusek 8e654ed209 Initial commit

2025-12-09 12:13:01 +01:00

Fast Table Rendering

Status: Production Ready - Now the Default (as of 2025-10-08)
Performance: ~7-10x faster than Rich rendering, with correct colspan/rowspan handling

Overview

Fast table rendering is a high-performance alternative to Rich library rendering for table text extraction. When parsing SEC filings with hundreds of tables, cumulative rendering time can become a bottleneck. Fast rendering addresses this by building strings directly and using TableMatrix for proper colspan/rowspan handling, achieving a 7-10x speedup while maintaining correctness.

As of 2025-10-08, fast rendering is the default for all table text extraction. You no longer need to explicitly enable it.

Why It's Now the Default

Production-ready: Fixed all major issues (colspan, multi-row headers, multi-line cells)
7-10x faster: Significant performance improvement with correct output
Maintains quality: Matches Rich's appearance with simple() style
Proven: Extensively tested with Apple, NVIDIA, Microsoft 10-K filings

When to Disable (Use Rich Instead)

You may want to disable fast rendering and use Rich for:

Terminal display for humans: Rich has more sophisticated text wrapping and layout
Visual reports: When presentation quality is more important than speed
Debugging: Rich output can be easier to visually inspect

Usage

Default Behavior (Fast Rendering Enabled)

from edgar.documents import parse_html

# Fast rendering is now the default - no configuration needed!
doc = parse_html(html)

# Tables automatically use fast renderer (7-10x faster)
table_text = doc.tables[0].text()

Disabling Fast Rendering (Use Rich Instead)

If you need Rich's sophisticated layout for visual display:

from edgar.documents import parse_html
from edgar.documents.config import ParserConfig

# Explicitly disable fast rendering to use Rich
config = ParserConfig(fast_table_rendering=False)
doc = parse_html(html, config=config)

# Tables use Rich renderer (slower but with advanced formatting)
table_text = doc.tables[0].text()

Custom Table Styles

New in this version: Fast rendering now uses the simple() style by default, which matches Rich's box.SIMPLE appearance (borderless, clean).

from edgar.documents import parse_html
from edgar.documents.config import ParserConfig
from edgar.documents.renderers.fast_table import FastTableRenderer, TableStyle

# Fast rendering is already the default; it is set explicitly here only for clarity
config = ParserConfig(fast_table_rendering=True)
doc = parse_html(html, config=config)

# Default: simple() style - borderless, clean
table_text = doc.tables[0].text()

# To use pipe_table() style explicitly (markdown-compatible borders):
renderer = FastTableRenderer(TableStyle.pipe_table())
pipe_text = renderer.render_table_node(doc.tables[0])

# To use minimal() style (no separator):
renderer = FastTableRenderer(TableStyle.minimal())
minimal_text = renderer.render_table_node(doc.tables[0])

Performance Comparison

Benchmark Results

Test: Apple 10-K (63 tables) - Updated 2025-10-08

Renderer	Average Per Table	Improvement	Notes
Rich	1.5-2.5ms	Baseline	Varies by table complexity
Fast (simple)	0.15-0.35ms	7-10x faster	With proper colspan/rowspan handling

Real-world Examples (Apple 10-K):

Table 15 (complex colspan): Rich 2.51ms → Fast 0.35ms (7.1x faster)
Table 6 (multi-line cells): Rich 1.61ms → Fast 0.17ms (9.5x faster)
Table 5 (wide table): Rich 3.70ms → Fast 0.48ms (7.7x faster)
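
Per-table figures like these can be reproduced with a small timing harness. The following is a minimal sketch, not the benchmark script used for the numbers above: `average_ms_per_table`, `speedup`, and the stand-in `join_renderer` are hypothetical names introduced for illustration, and the sample rows are made up.

```python
import time
from statistics import mean


def average_ms_per_table(render, tables, repeats=20):
    """Average wall-clock time per table, in milliseconds, for one renderer."""
    per_table = []
    for table in tables:
        start = time.perf_counter()
        for _ in range(repeats):
            render(table)
        per_table.append((time.perf_counter() - start) / repeats * 1000.0)
    return mean(per_table)


def speedup(baseline_ms, candidate_ms):
    """How many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


# Hypothetical stand-in renderer so the sketch runs without any filing data.
def join_renderer(rows):
    return "\n".join("  ".join(row) for row in rows)


sample_tables = [[["Revenue", "365,817"], ["Gross profit", "142,271"]]] * 3
print(f"{average_ms_per_table(join_renderer, sample_tables):.4f} ms per table")

# Ratio for Table 6 from the timings quoted above (1.61ms vs 0.17ms).
print(f"Table 6 speedup: {speedup(1.61, 0.17):.1f}x")
```

To compare the real renderers, the same harness could be called twice, once with a callable that renders through Rich and once with one that uses the fast renderer, passing the same list of tables to both.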

Impact on Full Parse:

Rich rendering: 30-40% of total parse time spent in table rendering
Fast rendering: 5-10% of total parse time
Overall speedup: Reduces total parsing time by ~25-30%
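
The relationship between these three figures follows from simple arithmetic: only the rendering share of the parse gets faster, so the overall gain is bounded by that share. The sketch below works it through for the ranges quoted above; the function names are hypothetical and the inputs are the document's own estimates, not new measurements.

```python
def parse_time_reduction(render_fraction, render_speedup):
    """Fraction of total parse time saved when only table rendering gets faster."""
    if not 0.0 <= render_fraction <= 1.0:
        raise ValueError("render_fraction must be between 0 and 1")
    if render_speedup <= 0:
        raise ValueError("render_speedup must be positive")
    remaining = (1.0 - render_fraction) + render_fraction / render_speedup
    return 1.0 - remaining


def new_render_share(render_fraction, render_speedup):
    """Share of the (shorter) parse that rendering takes after the speedup."""
    fast = render_fraction / render_speedup
    return fast / ((1.0 - render_fraction) + fast)


for fraction in (0.30, 0.40):
    for factor in (7, 10):
        saved = parse_time_reduction(fraction, factor)
        share = new_render_share(fraction, factor)
        print(f"rendering={fraction:.0%} speedup={factor}x "
              f"-> parse time saved={saved:.1%}, new rendering share={share:.1%}")
```

With rendering at 30-40% of parse time and a 7-10x speedup, this gives a saving of roughly 26% to 36% and a new rendering share of roughly 4% to 9%, which is in line with the figures listed above.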

Memory Impact

Fast rendering also reduces memory overhead:

No Rich Console objects retained
Direct string building (no intermediate objects)
Helps prevent memory leaks identified in profiling

Output Examples

Rich Renderer Output (Previous Default)

  (In millions)
  Year Ended June 30,                       2025    2024    2023
 ──────────────────────────────────────────────────────────

  Operating lease cost                    $5,524   3,555   2,875

  Finance lease cost:
  Amortization of right-of-use assets     $3,408   1,800   1,352
  Interest on lease liabilities            1,417     734     501

  Total finance lease cost                $4,825   2,534   1,853

Style: box.SIMPLE - No outer border, just a horizontal separator under the header
Pros: Clean, uncluttered, perfect alignment, generous spacing
Cons: Slow (6.5ms per table), creates Rich objects, memory overhead

Fast Renderer Output (NEW: simple() style - Default)

                            December 31, 2023    December 31, 2022    December 31, 2021
 ───────────────────────────────────────────────────────────────────────────────────────
  Revenue                               365,817              394,328              365,817
  Cost of revenue                       223,546              212,981              192,266
  Gross profit                          142,271              181,347              173,551

Style: simple() - Matches Rich's box.SIMPLE appearance
Pros: Fast (0.2ms per table), clean appearance, no visual noise, professional look
Cons: Minor spacing differences from Rich in some complex tables (see Known Limitations); otherwise this is the recommended default

Fast Renderer Output (pipe_table() style - Optional)

|                          |  December 31, 2023  |  December 31, 2022  |  December 31, 2021  |
|--------------------------|---------------------|---------------------|---------------------|
| Revenue                  |             365,817 |             394,328 |             365,817 |
| Cost of revenue          |             223,546 |             212,981 |             192,266 |
| Gross profit             |             142,271 |             181,347 |             173,551 |

Style: pipe_table() - Markdown-compatible with borders
Pros: Fast (0.2ms per table), markdown-compatible, explicit column boundaries
Cons: Visual noise from pipe characters, busier appearance
Use when: You need markdown-compatible output with explicit borders

Visual Comparison

Rich (box.SIMPLE):

No outer border - clean, uncluttered look
Horizontal line separator under header only
Generous internal spacing and padding
Perfect column alignment
Professional, minimalist presentation

Fast simple() (NEW DEFAULT):

No outer border - matches Rich's clean look
Horizontal line separator under header (using ─)
Space-separated columns with generous padding
Clean, professional appearance
Same performance as pipe_table (~0.2ms per table)

Fast pipe_table() (optional):

Full pipe table borders (| characters everywhere)
Horizontal dashes for header separator
Markdown-compatible format
Explicit column boundaries
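
The difference between the styles comes down to three choices: what goes between columns, what goes at the line edges, and what (if anything) is drawn under the header. The sketch below illustrates that idea with direct string building. It is not the library's `TableStyle` or `FastTableRenderer`; `SketchStyle` and `render` are hypothetical stand-ins, and the sketch assumes the first column holds labels (left-aligned) and all other columns hold numbers (right-aligned).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchStyle:
    """Hypothetical stand-in for a table style (not the library's TableStyle)."""
    column_sep: str  # text placed between columns
    edge: str        # text placed at both ends of each line ("" = no border)
    rule_char: str   # character for the rule under the header ("" = no rule)


SIMPLE = SketchStyle(column_sep="  ", edge="", rule_char="─")
PIPE = SketchStyle(column_sep=" | ", edge="|", rule_char="-")
MINIMAL = SketchStyle(column_sep="  ", edge="", rule_char="")


def render(header, rows, style):
    """Build the table text directly from strings, one line per row."""
    all_rows = [header] + rows
    n_cols = len(header)
    # Column width = widest cell in that column.
    widths = [max(len(r[i]) for r in all_rows) for i in range(n_cols)]

    def fmt(row):
        cells = [row[0].ljust(widths[0])]
        cells += [row[i].rjust(widths[i]) for i in range(1, n_cols)]
        line = style.column_sep.join(cells)
        return f"{style.edge} {line} {style.edge}" if style.edge else line

    lines = [fmt(header)]
    if style.rule_char and style.edge:
        # Bordered style: one dash segment per column, joined by the edge char.
        segments = [style.rule_char * (w + 2) for w in widths]
        lines.append(style.edge + style.edge.join(segments) + style.edge)
    elif style.rule_char:
        # Borderless style: a single rule spanning the header line.
        lines.append(style.rule_char * len(lines[0]))
    lines.extend(fmt(r) for r in rows)
    return "\n".join(lines)


header = ["", "2023", "2022"]
rows = [["Revenue", "365,817", "394,328"], ["Gross profit", "142,271", "181,347"]]
for style in (SIMPLE, PIPE, MINIMAL):
    print(render(header, rows, style), end="\n\n")
```

Because every style shares the same width calculation and differs only in separators, the rendering cost is essentially the same across styles, which is consistent with the per-table timings reported above.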

Recent Improvements (2025-10-08)

1. Colspan/Rowspan Support

Fixed: Tables with colspan and rowspan attributes now render correctly.

Previous issue: Fast renderer was extracting cell text without accounting for colspan/rowspan, causing:

Missing columns (e.g., "2023" column disappeared in Apple 10-K table 15)
Misaligned data (currency symbols separated from values)
Data loss (em dashes and other values missing)

Solution: Integrated TableMatrix for proper cell expansion, same as Rich rendering uses.

Status: ✅ FIXED
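
To illustrate what cell expansion means here, the sketch below turns rows of (text, colspan, rowspan) cells into a rectangular grid so that every value lands in its true column. This is only an illustration of the general technique, not the TableMatrix implementation; `expand_spans` is a hypothetical name, and leaving the covered slots blank (rather than repeating the text) is an assumption.

```python
def expand_spans(rows):
    """Expand (text, colspan, rowspan) cells into a rectangular grid of strings."""
    grid = {}  # (row, col) -> text
    for r, row in enumerate(rows):
        c = 0
        for text, colspan, rowspan in row:
            # Skip slots already claimed by a rowspan from an earlier row.
            while (r, c) in grid:
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    # Only the top-left slot keeps the text; the rest stay blank.
                    grid[(r + dr, c + dc)] = text if (dr, dc) == (0, 0) else ""
            c += colspan
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


table = [
    [("", 1, 2), ("Year ended", 2, 1)],           # label cell spans two rows
    [("2024", 1, 1), ("2023", 1, 1)],             # second header row
    [("Revenue", 1, 1), ("100", 1, 1), ("90", 1, 1)],
]
for line in expand_spans(table):
    print(line)
```

Without the expansion step, the second header row would start at column 0 and "2023" would sit under the wrong heading or fall off the end of the row, which is the kind of missing-column symptom described above.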

2. Multi-Row Header Preservation

Fixed: Tables with multiple header rows now preserve each row separately.

Previous issue: Multi-row headers were collapsed into a single line, causing the "Investment portfolio" row to disappear in Apple 10-K table 20.

Solution: Modified render_table_data() and _build_table() to preserve each header row as a separate line.

Status: ✅ FIXED

3. Multi-Line Cell Rendering

Fixed: Cells containing newline characters (\n) now render as multiple lines.

Previous issue: Multi-line cells like "Interest Rate\nSensitive Instrument" were truncated to the first line only.

Solution: Added _format_multiline_row() to split cells by \n and render each line separately.

Status: ✅ FIXED
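
The idea behind splitting cells by newline can be sketched as follows. This is a simplified illustration, not the actual `_format_multiline_row()` code: the function name, the two-space column gap, and the left-aligned padding are assumptions made for the example.

```python
def format_multiline_row_sketch(cells, widths, gap="  "):
    """Render one logical row as several physical lines, one per embedded newline."""
    split_cells = [cell.split("\n") for cell in cells]
    height = max((len(parts) for parts in split_cells), default=0)
    lines = []
    for i in range(height):
        padded = [
            # Cells with fewer lines than the tallest cell are padded with blanks.
            (parts[i] if i < len(parts) else "").ljust(width)
            for parts, width in zip(split_cells, widths)
        ]
        lines.append(gap.join(padded).rstrip())
    return lines


row = ["Interest Rate\nSensitive Instrument", "2024"]
for line in format_multiline_row_sketch(row, widths=[20, 6]):
    print(line)
```

Each physical line reuses the same column widths, so the second line of a tall cell stays aligned under its first line instead of being dropped.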

Performance Impact

All three fixes maintain excellent performance:

Speedup: 7-10x faster than Rich (down from initial 14x, but with correct output)
Correctness: Now matches Rich's handling of colspan, multi-row headers, and multi-line cells (minor spacing differences remain; see Known Limitations)
Production ready: Can confidently use as default renderer

Known Limitations

1. Column Alignment in Some Tables

Issue: Currency symbols and values may have extra spacing in some complex tables (e.g., Apple 10-K table 22)

Example:

Rich: $294,866
Fast: $ 294,866 (extra spacing)

Root cause: Column width calculation creates wider columns for some currency/value pairs after colspan expansion and column filtering.

Impact: Visual appearance differs slightly, but data is correct and readable.

Status: ⚠️ Minor visual difference - acceptable trade-off for the 7-10x performance gain

2. Visual Polish

Issue: Some visual aspects don't exactly match Rich's sophisticated layout

Examples:

Multi-line cell wrapping may differ
Column alignment in edge cases

Status: ⚠️ Acceptable trade-off for the 7-10x performance gain

Configuration Options

Table Styles

Fast renderer supports different visual styles:

from edgar.documents.renderers.fast_table import FastTableRenderer, TableStyle

# Simple style (default) - borderless, header separator only
renderer = FastTableRenderer(TableStyle.simple())

# Pipe table style - markdown compatible
renderer = FastTableRenderer(TableStyle.pipe_table())

# Minimal style - no borders or separator, just spacing
renderer = FastTableRenderer(TableStyle.minimal())

Minimal Style Output

                           December 31, 2023   December 31, 2022   December 31, 2021
Revenue                              365,817             394,328             365,817
Cost of revenue                      223,546             212,981             192,266
Gross profit                         142,271             181,347             173,551

Note: Minimal style has a cleaner appearance but loses the header separator and explicit column boundaries

Technical Details

How It Works

Direct String Building: Bypasses Rich's layout engine
Column Analysis: Detects numeric columns for right-alignment
Smart Filtering: Removes empty spacing columns
Currency Merging: Combines $ symbols with amounts
Width Calculation: Measures content, applies min/max limits
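
The filtering, currency-merging, and numeric-detection steps can be sketched on a plain grid of strings. The code below is an illustration under stated assumptions, not the renderer's actual logic: all three function names are hypothetical, the grid is assumed to be already expanded for colspan/rowspan, and the numeric pattern is a deliberately simple guess at what counts as a number.

```python
import re

_NUMERIC = re.compile(r"^\(?\$?\s*[\d,]+(\.\d+)?\)?%?$")


def drop_empty_columns(rows):
    """Remove columns in which every cell is blank (typical spacer columns)."""
    if not rows:
        return []
    n_cols = max(len(r) for r in rows)
    padded = [list(r) + [""] * (n_cols - len(r)) for r in rows]
    keep = [c for c in range(n_cols) if any(row[c].strip() for row in padded)]
    return [[row[c] for c in keep] for row in padded]


def merge_currency_columns(rows):
    """Fold a column holding only "$" (or blanks) into the column to its right."""
    if not rows:
        return []
    n_cols = len(rows[0])

    def is_currency_col(c):
        values = [row[c].strip() for row in rows]
        return "$" in values and all(v in ("", "$") for v in values)

    out = [[] for _ in rows]
    c = 0
    while c < n_cols:
        if c + 1 < n_cols and is_currency_col(c):
            for row, new_row in zip(rows, out):
                new_row.append(row[c].strip() + row[c + 1].strip())
            c += 2
        else:
            for row, new_row in zip(rows, out):
                new_row.append(row[c])
            c += 1
    return out


def is_numeric_column(rows, c, header_rows=1):
    """True if every non-blank body cell in column c looks like a number."""
    values = [row[c].strip() for row in rows[header_rows:] if row[c].strip()]
    return bool(values) and all(_NUMERIC.match(v) for v in values)


grid = [
    ["", "", "2024", "", "", "2023"],
    ["Operating lease cost", "$", "5,524", "", "$", "3,555"],
    ["Interest on lease liabilities", "", "1,417", "", "", "734"],
]
cleaned = merge_currency_columns(drop_empty_columns(grid))
for row in cleaned:
    print(row)
print([is_numeric_column(cleaned, c) for c in range(len(cleaned[0]))])
```

Merging by column rather than by row keeps every row the same length, so the later width calculation and right-alignment of numeric columns can still work column by column.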

Code Path

# When fast_table_rendering=True (the default):
table.text()
  → TableNode._fast_text_rendering()
  → FastTableRenderer.render_table_node()
  → Direct string building

Memory Benefits

Fast rendering avoids:

Rich Console object creation (~0.4MB per document)
Intermediate rich.Table objects
Style/theme processing overhead
ANSI escape code generation

Future Improvements

Planned Enhancements

Match Rich's box.SIMPLE Style (Priority: HIGH) - ✅ completed 2025-10-08 as TableStyle.simple()
- Remove all pipe characters - no outer border, no column separators
- Keep only horizontal separator under header (using ─ character)
- Increase internal padding to match Rich's generous spacing
- Clean, minimalist appearance like Rich's SIMPLE box style
- Goal: Match Rich visual quality while staying much faster (originally targeted 30x; measured 7-10x)

Improved Layout Engine
- Better column width calculation (avoid too-wide/too-narrow columns)
- Respect natural content breaks
- Dynamic spacing based on content type
- Handle wrapping for long content

Dynamic Padding
- Match Rich's generous spacing (currently too tight)
- Adjust padding based on content type
- Configurable padding rules
- Maintain alignment with variable padding

Header Handling
- Better multi-row header collapse
- Preserve important hierarchies
- Smart column spanning
- Honor header groupings

Style Presets
- TableStyle.simple() - matches Rich's box.SIMPLE (no borders, header separator only); implemented, now the default
- TableStyle.minimal() - no borders, just spacing (already implemented)
- TableStyle.pipe_table() - markdown style (already implemented, optional)
- TableStyle.ascii_clean() - no Unicode, pure ASCII (planned)
- TableStyle.compact() - minimal spacing for dense data (planned)

Timeline

These improvements are planned for Phase 2 of the HTML parser optimization work (after memory leak fixes).

Migration Guide

From Rich to Fast

Before (Rich was the default prior to 2025-10-08):

doc = parse_html(html)
table_text = doc.tables[0].text()  # Slow but pretty

After (fast rendering is the default; no configuration needed):

doc = parse_html(html)
table_text = doc.tables[0].text()  # Fast, with minor spacing differences from Rich

To keep the previous Rich behavior, pass ParserConfig(fast_table_rendering=False) to parse_html().

Hybrid Approach

Use fast rendering during processing, Rich for final display:

# Fast processing
config = ParserConfig(fast_table_rendering=True)
doc = parse_html(html, config=config)

# Extract data quickly
for table in doc.tables:
    data = table.text()  # Fast
    # Process data...

# Display one table nicely
special_table = doc.tables[5]
rich_output = special_table.render()  # Switch to Rich for display

Performance Recommendations

Recommended Settings by Use Case

Batch Processing (optimize for speed):

config = ParserConfig.for_performance()
# Includes: fast_table_rendering=True, eager_section_extraction=False

Data Extraction (balance speed and accuracy):

config = ParserConfig(
    fast_table_rendering=True,
    extract_xbrl=True,
    detect_sections=True
)

Display/Reports (optimize for quality):

config = ParserConfig(fast_table_rendering=False)  # Use Rich for display quality
# Or use the accuracy preset:
config = ParserConfig.for_accuracy()

FAQ

Q: Can I mix Fast and Rich rendering?
A: Not per-table. The setting is document-wide via ParserConfig. However, you can manually call table.render() to get Rich output.

Q: Does this affect section extraction?
A: Indirectly, yes. Section detection calls text() on the entire document, which includes tables. Fast rendering speeds this up significantly.

Q: Will the output format change?
A: Yes, as we improve the renderer. We'll maintain backward compatibility via style options.

Q: Can I customize the appearance?
A: Currently limited to TableStyle.simple() (the default), TableStyle.pipe_table(), and TableStyle.minimal(). More options coming.

Q: What about DataFrame export?
A: Fast rendering only affects text output. table.to_dataframe() is unaffected.

Feedback

The fast renderer is actively being improved based on user feedback. Known issues:

❌ Pipe characters - visual noise (pipe_table() style only; the default simple() style has none)
❌ Layout engine - inconsistent spacing
❌ Padding - needs tuning

If you have specific rendering issues or suggestions, please provide:

Sample table HTML
Expected vs actual output
Use case description

This helps prioritize improvements while maintaining the performance advantage.

Summary

Current State (As of 2025-10-08)

Performance: ✅ Excellent (7-10x faster than Rich)
Correctness: ✅ Production ready (proper colspan/rowspan handling)
Visual Quality: ⚠️ Good (simple() style matches Rich's box.SIMPLE appearance)
Use Case: Production-ready for all use cases

Recent Milestones

✅ Completed:

Core fast rendering implementation
TableStyle.simple() preset (borderless, clean)
Column filtering and merging
Numeric alignment detection
Colspan/rowspan support via TableMatrix
Performance benchmarking with real tables

🔧 Current Limitations:

Multi-row header collapsing differs from Rich
Some visual polish differences (acceptable for speed gain)
Layout engine not as sophisticated as Rich

Development Roadmap

Phase 1 (✅ COMPLETED):

✅ Core fast rendering implementation
✅ Simple() style matching Rich's box.SIMPLE
✅ Proper colspan/rowspan handling via TableMatrix
✅ Production-ready performance (7-10x faster)

Phase 2 (Future Enhancements):

📋 Improve multi-row header handling
📋 Better layout engine for perfect column widths
📋 Additional style presets
📋 Advanced header detection (data vs labels)

Bottom Line

Fast table rendering is production-ready and now the default for all table text extraction in EdgarTools.

Benefits:

✅ 7-10x faster than Rich rendering
✅ Correct data extraction with proper colspan/rowspan handling
✅ Multi-row header preservation
✅ Multi-line cell rendering
✅ Clean, borderless appearance (simple() style)

Minor differences from Rich:

⚠️ Some tables have extra spacing between currency symbols and values (e.g., table 22)
⚠️ Column width calculation may differ slightly in complex tables
✅ All data is preserved and correct - only visual presentation differs

The implementation achieves correct data extraction with significant performance gains and clean visual output, making it the ideal default for EdgarTools.

Related Documentation

HTML Parser Status - Overall parser progress
Performance Analysis - Profiling results showing Rich rendering bottleneck
Memory Analysis - Memory leak issues with Rich objects
