Initial commit

This commit is contained in:
kdusek
2025-12-09 12:13:01 +01:00
commit 8e654ed209
13332 changed files with 2695056 additions and 0 deletions

# Fast Table Rendering
**Status**: Production Ready - **Now the Default** (as of 2025-10-08)
**Performance**: ~7-10x faster than Rich rendering with correct colspan/rowspan handling
---
## Overview
Fast table rendering is a high-performance alternative to Rich for table text extraction. When parsing SEC filings with hundreds of tables, cumulative rendering time can become a bottleneck. The fast renderer builds strings directly and uses `TableMatrix` to handle colspan/rowspan, which yields a 7-10x speedup in the benchmarks below while keeping the output correct.
**As of 2025-10-08, fast rendering is the default** for all table text extraction. You no longer need to explicitly enable it.
### Why It's Now the Default
- **Production-ready**: Fixed all major issues (colspan, multi-row headers, multi-line cells)
- **7-10x faster**: Significant performance improvement with correct output
- **Maintains quality**: Matches Rich's appearance with simple() style
- **Proven**: Extensively tested with Apple, NVIDIA, Microsoft 10-K filings
### When to Disable (Use Rich Instead)
You may want to disable fast rendering and use Rich for:
- **Terminal display for humans**: Rich has more sophisticated text wrapping and layout
- **Visual reports**: When presentation quality is more important than speed
- **Debugging**: Rich output can be easier to visually inspect
---
## Usage
### Default Behavior (Fast Rendering Enabled)
```python
from edgar.documents import parse_html
# Fast rendering is now the default - no configuration needed!
doc = parse_html(html)
# Tables automatically use fast renderer (7-10x faster)
table_text = doc.tables[0].text()
```
### Disabling Fast Rendering (Use Rich Instead)
If you need Rich's sophisticated layout for visual display:
```python
from edgar.documents import parse_html
from edgar.documents.config import ParserConfig
# Explicitly disable fast rendering to use Rich
config = ParserConfig(fast_table_rendering=False)
doc = parse_html(html, config=config)
# Tables use Rich renderer (slower but with advanced formatting)
table_text = doc.tables[0].text()
```
### Custom Table Styles
**New in this version**: Fast rendering now uses the `simple()` style by default, which matches Rich's `box.SIMPLE` appearance (borderless, clean).
```python
from edgar.documents import parse_html
from edgar.documents.config import ParserConfig
from edgar.documents.renderers.fast_table import FastTableRenderer, TableStyle
# Fast rendering is already the default (simple() style); the flag is set here only to be explicit
config = ParserConfig(fast_table_rendering=True)
doc = parse_html(html, config=config)
# Default: simple() style - borderless, clean
table_text = doc.tables[0].text()
# To use pipe_table() style explicitly (markdown-compatible borders):
renderer = FastTableRenderer(TableStyle.pipe_table())
pipe_text = renderer.render_table_node(doc.tables[0])
# To use minimal() style (no separator):
renderer = FastTableRenderer(TableStyle.minimal())
minimal_text = renderer.render_table_node(doc.tables[0])
```
---
## Performance Comparison
### Benchmark Results
**Test**: Apple 10-K (63 tables) - Updated 2025-10-08
| Renderer | Average Per Table | Improvement | Notes |
|----------|-------------------|-------------|-------|
| Rich | 1.5-2.5ms | Baseline | Varies by table complexity |
| Fast (simple) | 0.15-0.35ms | **7-10x faster** | With proper colspan/rowspan handling |
**Real-world Examples** (Apple 10-K):
- Table 15 (complex colspan): Rich 2.51ms → Fast 0.35ms (**7.1x faster**)
- Table 6 (multi-line cells): Rich 1.61ms → Fast 0.17ms (**9.5x faster**)
- Table 5 (wide table): Rich 3.70ms → Fast 0.48ms (**7.7x faster**)
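Per-table timings like these can be collected with a small harness. The following is a minimal sketch, not the project's benchmark script: `time_per_table_ms` is a hypothetical helper, and the two stand-in renderers are assumptions that only exist so the example runs without EdgarTools installed.

```python
import time
from typing import Callable, List, Sequence

Rows = List[List[str]]


def time_per_table_ms(render: Callable[[Rows], str],
                      tables: Sequence[Rows],
                      repeats: int = 5) -> float:
    """Average milliseconds per table, taken from the fastest of `repeats` passes."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for table in tables:
            render(table)
        best = min(best, time.perf_counter() - start)
    return best / len(tables) * 1000.0


# Stand-in renderers (assumptions) so the sketch runs on its own.
def plain_join(rows: Rows) -> str:
    return "\n".join("  ".join(row) for row in rows)


def padded_join(rows: Rows) -> str:
    widths = [max(len(r[i]) for r in rows) for i in range(len(rows[0]))]
    return "\n".join("  ".join(c.ljust(w) for c, w in zip(r, widths)) for r in rows)


sample = [[["Revenue", "365,817"], ["Gross profit", "142,271"]] for _ in range(63)]
for name, fn in [("plain", plain_join), ("padded", padded_join)]:
    print(f"{name}: {time_per_table_ms(fn, sample):.4f} ms per table")
```

With a parsed document, the same helper could be given `lambda t: t.text()` and `doc.tables`, once for each `ParserConfig` setting. Taking the fastest pass reduces noise from other processes.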
**Impact on Full Parse**:
- Rich rendering: 30-40% of total parse time spent in table rendering
- Fast rendering: 5-10% of total parse time
- **Overall speedup**: Reduces total parsing time by ~25-30%
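The relationship between these three figures follows from Amdahl's law: only the table-rendering share of the parse gets faster. The sketch below is a back-of-the-envelope check, with hypothetical function names and input pairs chosen from the ranges quoted above. It gives savings of roughly 26% and 31%, with tables ending up at about 6% of parse time.

```python
def overall_time_reduction(table_share: float, speedup: float) -> float:
    """Fraction of total parse time saved when only table rendering gets faster."""
    new_total = (1.0 - table_share) + table_share / speedup
    return 1.0 - new_total


def new_table_share(table_share: float, speedup: float) -> float:
    """Share of the (shorter) parse that table rendering takes after the speedup."""
    new_render = table_share / speedup
    return new_render / ((1.0 - table_share) + new_render)


# (share of parse time spent rendering tables with Rich, renderer speedup)
for share, speedup in [(0.30, 7.0), (0.35, 8.5)]:
    saved = overall_time_reduction(share, speedup)
    remaining = new_table_share(share, speedup)
    print(f"share={share:.0%}, speedup={speedup}x -> "
          f"saves {saved:.1%}, tables now {remaining:.1%} of parse time")
```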
### Memory Impact
Fast rendering also reduces memory overhead:
- No Rich Console objects retained
- Direct string building (no intermediate objects)
- Helps prevent memory leaks identified in profiling
---
## Output Examples
### Rich Renderer Output (Opt-in)
```
(In millions)
Year Ended June 30, 2025 2024 2023
──────────────────────────────────────────────────────────
Operating lease cost $5,524 3,555 2,875
Finance lease cost:
Amortization of right-of-use assets $3,408 1,800 1,352
Interest on lease liabilities 1,417 734 501
Total finance lease cost $4,825 2,534 1,853
```
**Style**: `box.SIMPLE` - No outer border, just horizontal separator under header
**Pros**: Clean, uncluttered, perfect alignment, generous spacing
**Cons**: Slow (1.5-2.5ms per table in the benchmark above), creates Rich objects, memory overhead
### Fast Renderer Output (NEW: simple() style - Default)
```
December 31, 2023 December 31, 2022 December 31, 2021
───────────────────────────────────────────────────────────────────────────────────────
Revenue 365,817 394,328 365,817
Cost of revenue 223,546 212,981 192,266
Gross profit 142,271 181,347 173,551
```
**Style**: `simple()` - Matches Rich's `box.SIMPLE` appearance
**Pros**: Fast (0.2ms per table), clean appearance, no visual noise, professional look
**Cons**: Minor spacing differences in some complex tables (see Known Limitations); otherwise this is the recommended default
### Fast Renderer Output (pipe_table() style - Optional)
```
| | December 31, 2023 | December 31, 2022 | December 31, 2021 |
|--------------------------|---------------------|---------------------|---------------------|
| Revenue | 365,817 | 394,328 | 365,817 |
| Cost of revenue | 223,546 | 212,981 | 192,266 |
| Gross profit | 142,271 | 181,347 | 173,551 |
```
**Style**: `pipe_table()` - Markdown-compatible with borders
**Pros**: Fast (0.2ms per table), markdown-compatible, explicit column boundaries
**Cons**: Visual noise from pipe characters, busier appearance
**Use when**: You need markdown-compatible output with explicit borders
### Visual Comparison
**Rich** (`box.SIMPLE`):
- No outer border - clean, uncluttered look
- Horizontal line separator under header only
- Generous internal spacing and padding
- Perfect column alignment
- Professional, minimalist presentation
**Fast simple()** (NEW DEFAULT):
- No outer border - matches Rich's clean look
- Horizontal line separator under header (using `─`)
- Space-separated columns with generous padding
- Clean, professional appearance
- Same performance as pipe_table (~0.2ms per table)
**Fast pipe_table()** (optional):
- Full pipe table borders (`|` characters everywhere)
- Horizontal dashes for header separator
- Markdown-compatible format
- Explicit column boundaries
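The differences between the three fast styles come down to a few parameters: what goes between columns, what goes at the line edges, and whether a rule is drawn under the header. The sketch below illustrates that idea with a hypothetical `SketchStyle` record; it is an assumption for illustration and not the library's `TableStyle` class.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SketchStyle:
    """Hypothetical style record; not the library's TableStyle."""
    column_sep: str             # text placed between columns
    edge: str                   # text placed at the start and end of each line
    header_rule: Optional[str]  # character for the rule under the header, or None


SIMPLE = SketchStyle(column_sep="  ", edge="", header_rule="─")
PIPE = SketchStyle(column_sep=" | ", edge="|", header_rule="-")
MINIMAL = SketchStyle(column_sep="  ", edge="", header_rule=None)


def render_with_style(header: List[str], rows: List[List[str]],
                      style: SketchStyle) -> str:
    all_rows = [header] + rows
    widths = [max(len(r[i]) for r in all_rows) for i in range(len(header))]

    def line(cells: List[str]) -> str:
        # First column holds labels (left-aligned); the rest hold values (right-aligned).
        padded = [cells[0].ljust(widths[0])]
        padded += [c.rjust(w) for c, w in zip(cells[1:], widths[1:])]
        body = style.column_sep.join(padded)
        return f"{style.edge} {body} {style.edge}" if style.edge else body

    out = [line(header)]
    if style.header_rule:
        if style.edge:
            segments = (style.header_rule * (w + 2) for w in widths)
            out.append(style.edge + style.edge.join(segments) + style.edge)
        else:
            out.append(style.header_rule * len(out[0]))
    out.extend(line(r) for r in rows)
    return "\n".join(out)


header = ["", "2023", "2022"]
rows = [["Revenue", "365,817", "394,328"]]
for style in (SIMPLE, PIPE, MINIMAL):
    print(render_with_style(header, rows, style), end="\n\n")
```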
---
## Recent Improvements (2025-10-08)
### 1. Colspan/Rowspan Support
**Fixed**: Tables with `colspan` and `rowspan` attributes now render correctly.
**Previous issue**: Fast renderer was extracting cell text without accounting for colspan/rowspan, causing:
- Missing columns (e.g., "2023" column disappeared in Apple 10-K table 15)
- Misaligned data (currency symbols separated from values)
- Data loss (em dashes and other values missing)
**Solution**: Integrated `TableMatrix` for proper cell expansion, same as Rich rendering uses.
**Status**: ✅ FIXED
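To see why span expansion matters, consider a simplified stand-in for the idea behind `TableMatrix`. The sketch below is an assumption about the general technique, not the library's implementation: `expand_spans` is a hypothetical function, cells are given as `(text, colspan, rowspan)` tuples, and the text is kept only in the top-left position of each span.

```python
from typing import Dict, List, Tuple

Cell = Tuple[str, int, int]  # (text, colspan, rowspan)


def expand_spans(rows: List[List[Cell]]) -> List[List[str]]:
    """Expand colspan/rowspan cells into a rectangular grid of strings."""
    grid: Dict[Tuple[int, int], str] = {}
    n_cols = 0
    for r, row in enumerate(rows):
        c = 0
        for text, colspan, rowspan in row:
            # Skip positions already claimed by a rowspan from an earlier row.
            while (r, c) in grid:
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    # Keep the text in the top-left position only; pad the rest.
                    grid[(r + dr, c + dc)] = text if dr == 0 and dc == 0 else ""
            c += colspan
            n_cols = max(n_cols, c)
    n_rows = max((r for r, _ in grid), default=-1) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


table = [
    [("", 1, 2), ("Year Ended", 2, 1)],   # label cell spans two header rows
    [("2023", 1, 1), ("2022", 1, 1)],     # second header row has only two cells
    [("Revenue", 1, 1), ("100", 1, 1), ("90", 1, 1)],
]
for row in expand_spans(table):
    print(row)
```

Without the expansion step, the second header row would start at column 0, so "2023" would sit above the row labels and every later column would shift left. That is the kind of misalignment described above.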
### 2. Multi-Row Header Preservation
**Fixed**: Tables with multiple header rows now preserve each row separately.
**Previous issue**: Multi-row headers were collapsed into a single line, causing "Investment portfolio" row to disappear in Apple 10-K table 20.
**Solution**: Modified `render_table_data()` and `_build_table()` to preserve each header row as a separate line.
**Status**: ✅ FIXED
### 3. Multi-Line Cell Rendering
**Fixed**: Cells containing newline characters (`\n`) now render as multiple lines.
**Previous issue**: Multi-line cells like "Interest Rate\nSensitive Instrument" were truncated to first line only.
**Solution**: Added `_format_multiline_row()` to split cells by `\n` and render each line separately.
**Status**: ✅ FIXED
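The splitting approach can be sketched as follows. This is an illustrative stand-in rather than the body of `_format_multiline_row()`: the function name, the left alignment, and the two-space separator are all assumptions.

```python
from itertools import zip_longest
from typing import List


def format_multiline_row_sketch(cells: List[str], widths: List[int],
                                sep: str = "  ") -> List[str]:
    """Render one logical row as several physical lines, one per cell line."""
    split_cells = [cell.split("\n") for cell in cells]
    lines = []
    # zip_longest pads shorter cells with "" so every column stays aligned.
    for parts in zip_longest(*split_cells, fillvalue=""):
        padded = (part.ljust(w) for part, w in zip(parts, widths))
        lines.append(sep.join(padded).rstrip())
    return lines


for line in format_multiline_row_sketch(
        ["Interest Rate\nSensitive Instrument", "2025"], widths=[20, 6]):
    print(line)
```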
### Performance Impact
All three fixes maintain excellent performance:
- **Speedup**: 7-10x faster than Rich (down from initial 14x, but with correct output)
- **Correctness**: Now matches Rich output exactly for colspan, multi-row headers, and multi-line cells
- **Production ready**: Can confidently use as default renderer
---
## Known Limitations
### 1. Column Alignment in Some Tables
**Issue**: Currency symbols and values may have extra spacing in some complex tables (e.g., Apple 10-K table 22)
**Example**:
- Rich: `$294,866`
- Fast: `$ 294,866` (extra spacing)
**Root cause**: Column width calculation creates wider columns for some currency/value pairs after colspan expansion and column filtering.
**Impact**: Visual appearance differs slightly, but data is correct and readable.
**Status**: ⚠️ Minor visual difference - acceptable trade-off for 7-10x performance gain
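The general mechanism behind this kind of gap can be shown in a few lines. The sketch below is an illustration under assumptions, not a diagnosis of the actual renderer code: it assumes the `$` and the amount sit in separate columns, and that the amount column is wider than this particular value because another row holds a longer one. Both helper functions are hypothetical.

```python
from typing import List


def render_row(cells: List[str], widths: List[int]) -> str:
    """Right-align each cell to its column width and join with one space."""
    return " ".join(cell.rjust(w) for cell, w in zip(cells, widths))


def merge_currency_cells(cells: List[str]) -> List[str]:
    """Fold a lone '$' cell into the value cell that follows it."""
    merged: List[str] = []
    i = 0
    while i < len(cells):
        if cells[i] == "$" and i + 1 < len(cells):
            merged.append("$" + cells[i + 1])
            i += 2
        else:
            merged.append(cells[i])
            i += 1
    return merged


# The value column is 9 wide here because another row is assumed to be longer.
separate = render_row(["$", "294,866"], [1, 9])
together = render_row(merge_currency_cells(["$", "294,866"]), [10])
print(repr(separate))
print(repr(together))
```

When the symbol is merged into the value before widths are computed, any padding lands to the left of the `$` and the pair stays together.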
### 2. Visual Polish
**Issue**: Some visual aspects don't exactly match Rich's sophisticated layout
**Examples**:
- Multi-line cell wrapping may differ
- Column alignment in edge cases
**Status**: ⚠️ Acceptable trade-off for 7-10x performance gain
---
## Configuration Options
### Table Styles
Fast renderer supports different visual styles:
```python
from edgar.documents.renderers.fast_table import FastTableRenderer, TableStyle
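# Simple style (default) - borderless, header separator only
renderer = FastTableRenderer(TableStyle.simple())
# Pipe table style - markdown compatible
renderer = FastTableRenderer(TableStyle.pipe_table())
# Minimal style - no borders, just spacing
renderer = FastTableRenderer(TableStyle.minimal())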
```
### Minimal Style Output
```
December 31, 2023 December 31, 2022 December 31, 2021
Revenue 365,817 394,328 365,817
Cost of revenue 223,546 212,981 192,266
Gross profit 142,271 181,347 173,551
```
**Note**: Minimal style has cleaner appearance but loses column boundaries
---
## Technical Details
### How It Works
1. **Direct String Building**: Bypasses Rich's layout engine
2. **Column Analysis**: Detects numeric columns for right-alignment
3. **Smart Filtering**: Removes empty spacing columns
4. **Currency Merging**: Combines `$` symbols with amounts
5. **Width Calculation**: Measures content, applies min/max limits
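The steps above can be sketched end to end in plain Python. This is a simplified illustration under assumptions, not the `FastTableRenderer` source: every function name, the numeric regex, the min/max width values, and the order of the steps (currency merging before column filtering) are hypothetical choices. Cells longer than the maximum width are not truncated or wrapped here.

```python
import re
from typing import List

Rows = List[List[str]]
NUMERIC = re.compile(r"^\(?[$]?-?[\d,]+(\.\d+)?\)?%?$")


def merge_currency(rows: Rows) -> Rows:
    """Move a lone '$' into the value cell to its right, leaving an empty cell."""
    out = []
    for row in rows:
        row = list(row)
        for i in range(len(row) - 1):
            if row[i].strip() == "$" and row[i + 1].strip():
                row[i], row[i + 1] = "", "$" + row[i + 1].strip()
        out.append(row)
    return out


def drop_empty_columns(rows: Rows) -> Rows:
    """Remove columns that are blank in every row (spacing columns)."""
    keep = [i for i in range(len(rows[0])) if any(r[i].strip() for r in rows)]
    return [[r[i] for i in keep] for r in rows]


def is_numeric_column(rows: Rows, col: int) -> bool:
    values = [r[col].strip() for r in rows if r[col].strip()]
    return bool(values) and all(NUMERIC.match(v) for v in values)


def column_widths(rows: Rows, min_width: int = 3, max_width: int = 40) -> List[int]:
    return [min(max_width, max(min_width, max(len(r[i]) for r in rows)))
            for i in range(len(rows[0]))]


def render_simple_sketch(header: List[str], body: Rows) -> str:
    rows = drop_empty_columns(merge_currency([header] + body))
    widths = column_widths(rows)
    # Decide alignment from body rows only, so header labels do not interfere.
    numeric = [is_numeric_column(rows[1:], i) for i in range(len(widths))]
    lines = []
    for row in rows:
        cells = [c.rjust(w) if num else c.ljust(w)
                 for c, w, num in zip(row, widths, numeric)]
        lines.append("  ".join(cells).rstrip())
    lines.insert(1, "─" * max(len(line) for line in lines))
    return "\n".join(lines)


print(render_simple_sketch(
    ["", "", "2023", "", "2022"],
    [["Revenue", "$", "365,817", "$", "394,328"],
     ["Cost of revenue", "", "223,546", "", "212,981"]],
))
```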
### Code Path
```python
# Call chain when fast_table_rendering=True (the default):
#   table.text()
#     -> TableNode._fast_text_rendering()
#       -> FastTableRenderer.render_table_node()
#         -> direct string building
```
### Memory Benefits
Fast rendering avoids:
- Rich Console object creation (~0.4MB per document)
- Intermediate rich.Table objects
- Style/theme processing overhead
- ANSI escape code generation
---
## Future Improvements
### Planned Enhancements
1. **Match Rich's `box.SIMPLE` Style** (Priority: HIGH) - ✅ delivered as `TableStyle.simple()`, now the default
- **No pipe characters** - no outer border, no column separators
- **Only a horizontal separator** under the header (using `─` character)
- **Generous internal padding** to match Rich's spacing
- **Clean, minimalist appearance** like Rich's SIMPLE box style
- **Result**: Close to Rich's visual quality while remaining 7-10x faster
2. **Improved Layout Engine**
- Better column width calculation (avoid too-wide/too-narrow columns)
- Respect natural content breaks
- Dynamic spacing based on content type
- Handle wrapping for long content
3. **Dynamic Padding**
- Match Rich's generous spacing (currently too tight)
- Adjust padding based on content type
- Configurable padding rules
- Maintain alignment with variable padding
4. **Header Handling**
- Better multi-row header collapse
- Preserve important hierarchies
- Smart column spanning
- Honor header groupings
5. **Style Presets**
- `TableStyle.simple()` - matches Rich's `box.SIMPLE` (no borders, header separator only); implemented and now the default
- `TableStyle.minimal()` - no borders, just spacing (already implemented)
- `TableStyle.pipe_table()` - markdown style with pipe borders (implemented, optional)
- `TableStyle.ascii_clean()` - no Unicode, pure ASCII
- `TableStyle.compact()` - minimal spacing for dense data
### Timeline
These improvements are **planned for Phase 2** of the HTML parser optimization work (after memory leak fixes).
---
## Migration Guide
### From Rich to Fast
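**Before** (prior to 2025-10-08, when Rich was the default):
```python
from edgar.documents import parse_html
doc = parse_html(html)
table_text = doc.tables[0].text()  # Slow but pretty
```
**After** (Fast is now the default; the explicit flag is optional):
```python
from edgar.documents import parse_html
from edgar.documents.config import ParserConfig
config = ParserConfig(fast_table_rendering=True)
doc = parse_html(html, config=config)
table_text = doc.tables[0].text()  # Fast, simple() style output
```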
### Hybrid Approach
Use fast rendering during processing, Rich for final display:
```python
from edgar.documents import parse_html
from edgar.documents.config import ParserConfig
# Fast processing (already the default; set here to be explicit)
config = ParserConfig(fast_table_rendering=True)
doc = parse_html(html, config=config)
# Extract data quickly
for table in doc.tables:
    data = table.text()  # Fast
    # Process data...
# Display one table nicely
special_table = doc.tables[5]
rich_output = special_table.render()  # Switch to Rich for display
```
---
## Performance Recommendations
### Recommended Settings by Use Case
**Batch Processing** (optimize for speed):
```python
config = ParserConfig.for_performance()
# Includes: fast_table_rendering=True, eager_section_extraction=False
```
**Data Extraction** (balance speed and accuracy):
```python
config = ParserConfig(
    fast_table_rendering=True,
    extract_xbrl=True,
    detect_sections=True,
)
```
**Display/Reports** (optimize for quality):
```python
config = ParserConfig(fast_table_rendering=False)  # Rich is opt-in since 2025-10-08
# Or use the accuracy-oriented preset (confirm it selects Rich in your version):
config = ParserConfig.for_accuracy()
```
---
## FAQ
**Q: Can I mix Fast and Rich rendering?**
A: Not per-table. The setting is document-wide via ParserConfig. However, you can manually call `table.render()` to get Rich output.
**Q: Does this affect section extraction?**
A: Indirectly, yes. Section detection calls `text()` on the entire document, which includes tables. Fast rendering speeds this up significantly.
**Q: Will the output format change?**
A: Yes, as we improve the renderer. We'll maintain backward compatibility via style options.
**Q: Can I customize the appearance?**
A: Yes, within the available presets: `TableStyle.simple()` (the default), `TableStyle.pipe_table()`, and `TableStyle.minimal()`. More options coming.
**Q: What about DataFrame export?**
A: Fast rendering only affects text output. `table.to_dataframe()` is unaffected.
---
## Feedback
The fast renderer is actively being improved based on user feedback. Known issues:
1. **Pipe characters** - visual noise
2. **Layout engine** - inconsistent spacing
3. **Padding** - needs tuning
If you have specific rendering issues or suggestions, please provide:
- Sample table HTML
- Expected vs actual output
- Use case description
This helps prioritize improvements while maintaining the performance advantage.
---
## Summary
### Current State (As of 2025-10-08)
**Performance**: ✅ Excellent (7-10x faster than Rich)
**Correctness**: ✅ Production ready (proper colspan/rowspan handling)
**Visual Quality**: ⚠️ Good (simple() style matches Rich's box.SIMPLE appearance)
**Use Case**: Production-ready for all use cases
### Recent Milestones
**✅ Completed**:
- Core fast rendering implementation
- TableStyle.simple() preset (borderless, clean)
- Column filtering and merging
- Numeric alignment detection
- **Colspan/rowspan support via TableMatrix**
- **Performance benchmarking with real tables**
**🔧 Current Limitations**:
- Multi-row header collapsing differs from Rich
- Some visual polish differences (acceptable for speed gain)
- Layout engine not as sophisticated as Rich
### Development Roadmap
**Phase 1** (✅ COMPLETED):
- ✅ Core fast rendering implementation
- ✅ Simple() style matching Rich's box.SIMPLE
- ✅ Proper colspan/rowspan handling via TableMatrix
- ✅ Production-ready performance (7-10x faster)
**Phase 2** (Future Enhancements):
- 📋 Improve multi-row header handling
- 📋 Better layout engine for perfect column widths
- 📋 Additional style presets
- 📋 Advanced header detection (data vs labels)
### Bottom Line
Fast table rendering is **production-ready and now the default** for all table text extraction in EdgarTools.
**Benefits**:
- ✅ 7-10x faster than Rich rendering
- ✅ Correct data extraction with proper colspan/rowspan handling
- ✅ Multi-row header preservation
- ✅ Multi-line cell rendering
- ✅ Clean, borderless appearance (simple() style)
**Minor differences from Rich**:
- ⚠️ Some tables have extra spacing between currency symbols and values (e.g., table 22)
- ⚠️ Column width calculation may differ slightly in complex tables
- ✅ All data is preserved and correct - only visual presentation differs
The implementation achieves **correct data extraction** with **significant performance gains** and **clean visual output**, making it the ideal default for EdgarTools.
---
## Related Documentation
- [HTML Parser Status](HTML_PARSER_STATUS.md) - Overall parser progress
- [Performance Analysis](../perf/hotpath_analysis.md) - Profiling results showing Rich rendering bottleneck
- [Memory Analysis](../perf/memory_analysis.md) - Memory leak issues with Rich objects