Initial commit
@@ -0,0 +1,314 @@
|
||||
# HTML Parser Rewrite - Status Report
|
||||
|
||||
**Generated**: 2025-10-08
|
||||
**Branch**: `html_rewrite`
|
||||
**Target**: Merge to `main`
|
||||
|
||||
---
|
||||
|
||||
## Overall Progress: ~95% Complete ✅
|
||||
|
||||
### Completed Phases
|
||||
|
||||
#### ✅ Phase 1: Core Implementation (100%)
|
||||
- [x] Streaming parser for large documents
|
||||
- [x] TableMatrix system for accurate table rendering
|
||||
- [x] Section extraction with Part I/II detection
|
||||
- [x] XBRL integration
|
||||
- [x] Rich-based table rendering
|
||||
- [x] Configuration system (ParserConfig)
|
||||
- [x] Error handling and validation
|
||||
|
||||
#### ✅ Phase 2: Functional Testing (100%)
|
||||
- [x] **Corpus Validation** - 40 diverse filings, 100% success rate
|
||||
- [x] **Edge Cases** - 31 tests covering invalid inputs, malformed HTML, edge conditions
|
||||
- [x] **Integration Tests** - 25 tests for Filing/Company integration, backward compatibility
|
||||
- [x] **Regression Tests** - 15 tests preventing known bugs from returning
|
||||
|
||||
**Total Test Count**: 79 functional tests, all passing
|
||||
|
||||
#### ✅ Phase 3: Performance Profiling (100%)
|
||||
- [x] **Benchmarking Infrastructure** - Comprehensive benchmark suite
|
||||
- [x] **Hot Path Analysis** - Identified 3 critical bottlenecks (63% section extraction, 40% Rich rendering, 15% regex; the shares sum to more than 100%, which suggests they are overlapping cumulative timings)
|
||||
- [x] **Memory Profiling** - Found 255MB memory leak in MSFT 10-K, documented root causes
|
||||
- [x] **Performance Regression Tests** - 15 tests locking in baseline thresholds
|
||||
|
||||
**Performance Baseline Established**:
|
||||
- Average: 3.8MB/s throughput, 4.1MB memory per doc
|
||||
- Small docs: 2.6MB/s (optimization opportunity)
|
||||
- Large docs: 20.7MB/s (excellent streaming)
|
||||
- Memory leak: 19-25x ratio on medium docs (needs fixing)
|
||||
|
||||
#### ✅ Phase 4: Test Data Augmentation (100%)
|
||||
- [x] **HTML Fixtures** - Downloaded 32 files (155MB) from 16 companies across 6 industries
|
||||
- [x] **Download Automation** - Created `download_html_fixtures.py` script
|
||||
- [x] **Documentation** - Comprehensive fixture documentation
|
||||
|
||||
---
|
||||
|
||||
## Current Status: Ready for Optimization Phase
|
||||
|
||||
### What's Working Well ✅
|
||||
|
||||
1. **Parsing Accuracy**: 100% success rate across 40+ diverse filings
|
||||
2. **Large Document Handling**: Excellent streaming performance (20.7MB/s on JPM 10-K)
|
||||
3. **Table Extraction**: TableMatrix accurately handles colspan/rowspan
|
||||
4. **Test Coverage**: 79 comprehensive tests covering edge cases, integration, regression
|
||||
5. **Backward Compatibility**: Old TenK API still works for existing code
|
||||
|
||||
### Known Issues to Address 🔧
|
||||
|
||||
#### Critical (Must Fix Before Merge)
|
||||
|
||||
1. **Memory Leaks** (Priority: CRITICAL)
|
||||
- MSFT 10-K: 255MB leak (19x document size)
|
||||
- Apple 10-K: 41MB leak (23x document size)
|
||||
- **Root Causes**:
|
||||
- Rich Console objects retained (0.4MB per doc)
|
||||
- Global caches not cleared on document deletion
|
||||
- Circular references in node graph
|
||||
- **Location**: `tests/perf/memory_analysis.md:90-130`
|
||||
- **Impact**: Server crashes after 10-20 requests in production
|
||||
|
||||
2. **Performance Bottlenecks** (Priority: HIGH)
|
||||
- Section extraction: 3.7s (63% of parse time)
|
||||
- Rich rendering for text: 2.4s (40% of parse time)
|
||||
- Regex normalization: 0.8s (15% of parse time)
|
||||
- **Location**: `tests/perf/hotpath_analysis.md:9-66`
|
||||
- **Impact**: 4x slower than necessary on medium documents
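
A leak of the kind listed under item 1 can be checked with the standard library alone. The sketch below is a minimal illustration and not the project's `profile_memory.py`: `retained_bytes`, `leaky_parse`, and `clean_parse` are hypothetical names, and the two parsers are toy stand-ins for `parse_html` that show a global cache keeping documents alive.

```python
import gc
import tracemalloc
from typing import Callable, List


def retained_bytes(parse: Callable[[str], object], html: str, rounds: int = 3) -> int:
    """Return bytes still allocated after parsing `html` and dropping the result."""
    tracemalloc.start()
    gc.collect()
    before, _peak = tracemalloc.get_traced_memory()
    for _ in range(rounds):
        doc = parse(html)  # parse, then immediately discard the document
        del doc
    gc.collect()  # collect reference cycles before measuring
    after, _peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return after - before


# Hypothetical leaky parser: a module-level cache keeps every document alive.
_CACHE: List[list] = []


def leaky_parse(html: str) -> list:
    doc = list(html)
    _CACHE.append(doc)
    return doc


# Hypothetical well-behaved parser: nothing outlives the caller's reference.
def clean_parse(html: str) -> list:
    return list(html)
```

Comparing the two figures for the same input separates memory that is genuinely retained from memory that is merely peak usage during parsing.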
|
||||
|
||||
#### Non-Critical (Can Fix After Merge)
|
||||
|
||||
3. **Small Document Performance** (Priority: MEDIUM)
|
||||
- 2.6MB/s vs desired 5MB/s
|
||||
- Overhead dominates on <5MB documents
|
||||
- **Optimization**: Lazy loading, reduce upfront processing
|
||||
|
||||
---
|
||||
|
||||
## Next Steps (In Order)
|
||||
|
||||
### Phase 5: Critical Fixes (2-4 days) 🔧
|
||||
|
||||
#### 5.1 Memory Leak Fixes (1-2 days)
|
||||
**Goal**: Reduce memory leak from 255MB to <5MB
|
||||
|
||||
Tasks:
|
||||
- [ ] Implement `Document.__del__()` to clear caches
|
||||
- [ ] Replace Rich rendering in `text()` with direct string building
|
||||
- [ ] Break circular references in node graph
|
||||
- [ ] Use weak references for parent links
|
||||
- [ ] Add `__slots__` to frequently created objects (Cell, TableNode)
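
Two of the tasks above, weak parent links and `__slots__`, can be sketched together. This is an illustration under assumptions, not the project's node classes: `Node`, `add_child`, and the attribute names are hypothetical.

```python
import weakref
from typing import List, Optional


class Node:
    """Tree node whose parent link is weak, so parent/child cycles do not form."""

    # __slots__ removes the per-instance __dict__; "__weakref__" must be listed
    # explicitly, otherwise slotted instances cannot be weakly referenced.
    __slots__ = ("tag", "children", "_parent", "__weakref__")

    def __init__(self, tag: str) -> None:
        self.tag = tag
        self.children: List["Node"] = []
        self._parent: Optional[weakref.ReferenceType] = None

    @property
    def parent(self) -> Optional["Node"]:
        # Dereference the weak link; None if the parent is unset or already freed.
        return self._parent() if self._parent is not None else None

    def add_child(self, child: "Node") -> None:
        child._parent = weakref.ref(self)  # child does not keep the parent alive
        self.children.append(child)
```

With strong references only from parent to child, dropping the last reference to the root frees the whole tree through reference counting, without waiting for the cycle collector.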
|
||||
|
||||
**Expected Result**: MSFT 10-K leak: 255MB → <5MB (about 98% reduction)
|
||||
|
||||
**Validation**:
|
||||
```bash
pytest tests/perf/test_performance_regression.py::TestMemoryRegression -v
```
|
||||
|
||||
#### 5.2 Performance Optimizations (1-2 days)
|
||||
**Goal**: Improve parse speed from 1.2s to 0.3s on Apple 10-K (75% faster)
|
||||
|
||||
Tasks:
|
||||
- [ ] Fix section detection - use headings instead of rendering entire document
|
||||
- [ ] Implement fast text extraction without Rich overhead
|
||||
- [ ] Optimize regex normalization - combine patterns, use compilation
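
The regex task above (combine patterns, compile once) might look like the following. It is a sketch under assumptions: the replacement table and the name `normalize_text` are hypothetical and are not taken from the parser's actual normalization rules.

```python
import re

# Hypothetical character replacements; one combined pattern handles all of them.
_REPLACEMENTS = {
    "\u00a0": " ",   # non-breaking space
    "\u2019": "'",   # right single quote
    "\u201c": '"',   # left double quote
    "\u201d": '"',   # right double quote
}

# Compiled once at import time instead of on every call.
_CHARS = re.compile("|".join(map(re.escape, _REPLACEMENTS)))
_SPACES = re.compile(r"[ \t]+")


def normalize_text(text: str) -> str:
    """Apply all character replacements in a single pass, then collapse spaces."""
    text = _CHARS.sub(lambda m: _REPLACEMENTS[m.group(0)], text)
    text = _SPACES.sub(" ", text)
    return text.strip()
```

A single alternation scans the text once, whereas a chain of separate `re.sub` calls scans it once per pattern.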
|
||||
|
||||
**Expected Results**:
|
||||
- Section extraction: 3.7s → 1.2s (68% faster)
- Text extraction: 2.4s → 1.2s (50% faster)
- Regex: 0.8s → 0.5s (38% faster)
|
||||
|
||||
**Validation**:
|
||||
```bash
pytest tests/perf/test_performance_regression.py::TestParseSpeedRegression -v
```
|
||||
|
||||
### Phase 6: Final Validation (1 day) ✅
|
||||
|
||||
Tasks:
|
||||
- [ ] Re-run all 79 functional tests
|
||||
- [ ] Re-run performance regression tests (verify improvements)
|
||||
- [ ] Run full corpus validation
|
||||
- [ ] Memory profiling validation (confirm leaks fixed)
|
||||
- [ ] Update CHANGELOG.md
|
||||
- [ ] Create merge summary document
|
||||
|
||||
### Phase 7: Merge to Main (1 day) 🚀
|
||||
|
||||
Tasks:
|
||||
- [ ] Final code review
|
||||
- [ ] Squash commits or create clean merge
|
||||
- [ ] Update version number
|
||||
- [ ] Merge to main
|
||||
- [ ] Tag release
|
||||
- [ ] Monitor for issues
|
||||
|
||||
---
|
||||
|
||||
## Test Summary
|
||||
|
||||
### Current Test Status: 79/79 Passing (100%)
|
||||
|
||||
```
tests/corpus/test_corpus_validation.py       8 tests ✓
tests/test_html_parser_edge_cases.py        31 tests ✓
tests/test_html_parser_integration.py       25 tests ✓
tests/test_html_parser_regressions.py       15 tests ✓
tests/perf/test_performance_regression.py   15 tests ✓ (baseline established)
```
|
||||
|
||||
### Test Execution
|
||||
|
||||
```bash
# Functional tests (79 tests, ~30s)
pytest tests/corpus tests/test_html_parser_*.py -v

# Performance tests (15 tests, ~20s)
pytest tests/perf/test_performance_regression.py -m performance -v

# All tests
pytest tests/ -v
```
|
||||
|
||||
---
|
||||
|
||||
## Performance Metrics
|
||||
|
||||
### Current Baseline (Before Optimization)
|
||||
|
||||
| Document | Size | Parse Time | Throughput | Memory (ratio to size) | Tables | Sections |
|----------|------|------------|------------|------------------------|--------|----------|
| Apple 10-Q | 1.1MB | 0.307s | 3.6MB/s | 27.9MB (25.6x) | 40 | 9 |
| Apple 10-K | 1.8MB | 0.500s | 3.6MB/s | 21.6MB (11.9x) | 63 | 8 |
| MSFT 10-K | 7.8MB | 1.501s | 5.2MB/s | 147.0MB (18.9x) | 85 | 0 |
| JPM 10-K | 52.4MB | 2.537s | 20.7MB/s | 0.6MB (0.01x) | 681 | 0 |
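
The derived columns in the baseline table follow from simple division: throughput is size over parse time, and the memory ratio is memory over size. The sketch below recomputes both from the rounded figures in the table. The helper names are hypothetical, and because the inputs are rounded, recomputed ratios can differ from the table in the last digit.

```python
# (document, size_mb, parse_seconds, memory_mb) copied from the baseline table.
BASELINE = [
    ("Apple 10-Q", 1.1, 0.307, 27.9),
    ("Apple 10-K", 1.8, 0.500, 21.6),
    ("MSFT 10-K", 7.8, 1.501, 147.0),
    ("JPM 10-K", 52.4, 2.537, 0.6),
]


def throughput_mb_s(size_mb: float, seconds: float) -> float:
    """Megabytes of HTML parsed per second."""
    return size_mb / seconds


def memory_ratio(memory_mb: float, size_mb: float) -> float:
    """Memory used, expressed as a multiple of the document size."""
    return memory_mb / size_mb


for name, size, secs, mem in BASELINE:
    print(f"{name}: {throughput_mb_s(size, secs):.1f}MB/s, {memory_ratio(mem, size):.2f}x")
```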
|
||||
|
||||
### Target Metrics (After Optimization)
|
||||
|
||||
| Metric | Current | Target | Improvement |
|--------|---------|--------|-------------|
| **Memory leak** | 41-255MB | <5MB | 88-98% reduction |
| **Memory ratio** | 19-25x | <3x | 84-88% reduction |
| **Parse time (Apple 10-K)** | 0.500s | 0.150s | 70% faster |
| **Throughput (small docs)** | 2.6MB/s | 5.0MB/s | 92% faster |
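
The Improvement column mixes two kinds of percentage: a reduction for quantities that should shrink (memory, parse time) and an increase for quantities that should grow (throughput). A short sketch of both calculations, with hypothetical helper names:

```python
def reduction_pct(current: float, target: float) -> float:
    """Percentage by which a smaller-is-better metric shrinks."""
    return (current - target) / current * 100


def increase_pct(current: float, target: float) -> float:
    """Percentage by which a larger-is-better metric grows."""
    return (target - current) / current * 100


# Parse time on the Apple 10-K: 0.500s -> 0.150s
parse_time_gain = reduction_pct(0.500, 0.150)

# Throughput on small documents: 2.6MB/s -> 5.0MB/s
throughput_gain = increase_pct(2.6, 5.0)

# Memory leak range: 41MB and 255MB, both down to 5MB
leak_gain_low = reduction_pct(41, 5)
leak_gain_high = reduction_pct(255, 5)
```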
|
||||
|
||||
---
|
||||
|
||||
## File Organization
|
||||
|
||||
### Core Parser Files
|
||||
```
edgar/documents/
├── __init__.py                  # Public API (parse_html)
├── parser.py                    # Main parser with streaming
├── config.py                    # ParserConfig
├── document_builder.py          # Document tree construction
├── nodes/                       # Node types (TableNode, SectionNode)
├── utils/
│   ├── streaming.py             # Streaming parser (fixed JPM bug)
│   └── table_processing.py      # TableMatrix system
└── exceptions.py                # Custom exceptions
```
|
||||
|
||||
### Test Files
|
||||
```
tests/
├── corpus/                              # Corpus validation
│   ├── quick_corpus.py                  # Corpus builder
│   └── test_corpus_validation.py        # 8 validation tests
├── fixtures/
│   ├── html/                            # 32 HTML fixtures (155MB)
│   │   ├── {ticker}/10k/                # By company and form
│   │   └── README.md
│   └── download_html_fixtures.py        # Download automation
├── perf/                                # Performance testing
│   ├── benchmark_html_parser.py         # Benchmarking
│   ├── profile_hotpaths.py              # Hot path profiling
│   ├── profile_memory.py                # Memory profiling
│   ├── test_performance_regression.py   # Regression tests
│   ├── performance_report.md            # Benchmark results
│   ├── hotpath_analysis.md              # Bottleneck analysis
│   └── memory_analysis.md               # Memory leak analysis
├── test_html_parser_edge_cases.py       # 31 edge case tests
├── test_html_parser_integration.py      # 25 integration tests
└── test_html_parser_regressions.py      # 15 regression tests
```
|
||||
|
||||
---
|
||||
|
||||
## Risks and Mitigation
|
||||
|
||||
### Risk 1: Memory Leaks in Production
|
||||
**Severity**: HIGH
|
||||
**Probability**: HIGH (confirmed in testing)
|
||||
**Mitigation**: Must fix before merge (Phase 5.1)
|
||||
|
||||
### Risk 2: Performance Regression
|
||||
**Severity**: MEDIUM
|
||||
**Probability**: LOW (baseline established, regression tests in place)
|
||||
**Mitigation**: Performance regression tests will catch any degradation
|
||||
|
||||
### Risk 3: Backward Compatibility
|
||||
**Severity**: LOW
|
||||
**Probability**: LOW (integration tests passing)
|
||||
**Mitigation**: 25 integration tests verify old API still works
|
||||
|
||||
---
|
||||
|
||||
## Estimated Timeline to Merge
|
||||
|
||||
```
Phase 5.1: Memory leak fixes          1-2 days
Phase 5.2: Performance optimization   1-2 days
Phase 6:   Final validation           1 day
Phase 7:   Merge to main              1 day
----------------------------------------------
Total:                                4-6 days
```
|
||||
|
||||
**Target Merge Date**: October 12-14, 2025
|
||||
|
||||
---
|
||||
|
||||
## Decision Points
|
||||
|
||||
### Should We Merge Now or After Optimization?
|
||||
|
||||
**Option A: Merge Now (Not Recommended)**
|
||||
- ✅ Functional tests passing
|
||||
- ✅ Backward compatible
|
||||
- ❌ Memory leaks (production risk)
|
||||
- ❌ Performance issues
|
||||
- ❌ Will require hotfix soon
|
||||
|
||||
**Option B: Fix Critical Issues First (Recommended)**
|
||||
- ✅ Production-ready
|
||||
- ✅ Performance validated
|
||||
- ✅ Memory efficient
|
||||
- ❌ 4-6 days delay
|
||||
- ✅ Clean, professional release
|
||||
|
||||
**Recommendation**: **Option B** - Fix critical memory leaks and performance issues before merge. The 4-6 day investment prevents production incidents and ensures a polished release.
|
||||
|
||||
---
|
||||
|
||||
## Questions for Review
|
||||
|
||||
1. **Scope**: Should we fix only critical issues (memory + performance) or also tackle small-doc optimization?
|
||||
2. **Timeline**: Is 4-6 days acceptable, or do we need to merge sooner?
|
||||
3. **Testing**: Are 79 functional tests + 15 performance tests sufficient coverage?
|
||||
4. **Documentation**: Do we need user-facing documentation updates?
|
||||
|
||||
---
|
||||
|
||||
## Conclusion
|
||||
|
||||
The HTML parser rewrite is **95% complete** with excellent functional testing but critical memory and performance issues identified. The smart path forward is:
|
||||
|
||||
1. ✅ Complete critical fixes (4-6 days)
|
||||
2. ✅ Validate improvements
|
||||
3. ✅ Merge to main with confidence
|
||||
|
||||
This approach ensures a production-ready, performant parser rather than merging now and hotfixing later.
|
||||
@@ -0,0 +1,437 @@
|
||||
# HTML Parser Rewrite - Progress Assessment
|
||||
|
||||
**Date**: 2025-10-07
|
||||
**Status**: Active Development (html_rewrite branch)
|
||||
|
||||
---
|
||||
|
||||
## Executive Summary
|
||||
|
||||
The HTML parser rewrite is **substantially complete** for core functionality with **excellent progress** on Item/section detection. Recent bug fixes (2025-10-07) have addressed critical table rendering issues and 10-Q Part I/II distinction, bringing the parser close to production-ready quality.
|
||||
|
||||
### Overall Progress: **~90% Complete**
|
||||
|
||||
- ✅ Core parsing infrastructure: **100% Complete**
|
||||
- ✅ Table processing: **95% Complete** (recent fixes)
|
||||
- ✅ Section/Item detection: **95% Complete** (Part I/II fixed, needs validation)
|
||||
- ⚠️ Performance optimization: **70% Complete**
|
||||
- ⚠️ Comprehensive testing: **65% Complete** (added 10-Q Part tests)
|
||||
- ⚠️ Documentation: **75% Complete**
|
||||
|
||||
---
|
||||
|
||||
## Goal Achievement Analysis
|
||||
|
||||
### Primary Goals (from goals.md)
|
||||
|
||||
#### 1. **Semantic Meaning Preservation** ✅ **ACHIEVED**
|
||||
> "Read text, tables and ixbrl data preserving greatest semantic meaning"
|
||||
|
||||
**Status**: ✅ Fully implemented
|
||||
- Text extraction with structure preservation
|
||||
- Advanced table matrix system for accurate table rendering
|
||||
- XBRL fact extraction before preprocessing
|
||||
- Hierarchical node model maintains document structure
|
||||
|
||||
**Recent Improvements**:
|
||||
- Header detection fixes (Oracle Table 6, Tesla Table 16)
|
||||
- Spacing column filter now preserves header columns (MSFT Table 39)
|
||||
- Multi-row header normalization
|
||||
|
||||
#### 2. **AI Channel (Primary) + Human Channel (Secondary)** ✅ **ACHIEVED**
|
||||
> "AI context is the primary goal, with human context being secondary"
|
||||
|
||||
**Status**: ✅ Both channels working
|
||||
- **AI Channel**:
|
||||
- Clean text output optimized for LLMs
|
||||
- Structured table rendering for context windows
|
||||
- Section-level extraction for chunking
|
||||
- Semantic divisibility supported
|
||||
|
||||
- **Human Channel**:
|
||||
- Rich console rendering with proper formatting
|
||||
- Markdown export
|
||||
- Visual table alignment (recently fixed)
|
||||
|
||||
#### 3. **Section-Level Processing** ✅ **ACHIEVED**
|
||||
> "Work at full document level and section level - breaking into independently processable sections"
|
||||
|
||||
**Status**: ✅ Implemented with good coverage
|
||||
- `SectionExtractor` class fully functional
|
||||
- TOC-based section detection
|
||||
- Pattern-based section identification
|
||||
- Lazy loading support for large documents
|
||||
|
||||
**What Works**:
|
||||
```python
from edgar.documents import parse_html

# Section detection is operational
doc = parse_html(html)  # html: the filing's HTML as a string
sections = doc.sections  # Dict of section names -> SectionNode

# Access specific sections
business = sections.get('Item 1 - Business')
mda = sections.get('Item 7 - MD&A')
financials = sections.get('Item 8 - Financial Statements')
```
|
||||
|
||||
#### 4. **Standard Section Names (10-K, 10-Q, 8-K)** ✅ **ACHIEVED**
|
||||
> "For some filing types (10-K, 10-Q, 8-K) identify sections by standard names"
|
||||
|
||||
**Status**: ✅ 95% Complete - Implemented with Part I/II distinction for 10-Q
|
||||
|
||||
**What's Implemented**:
|
||||
- Pattern matching for standard Items:
|
||||
- Item 1 - Business
|
||||
- Item 1A - Risk Factors
|
||||
- Item 7 - MD&A
|
||||
- Item 7A - Market Risk
|
||||
- Item 8 - Financial Statements
|
||||
- And more...
|
||||
- **10-Q Part I/Part II distinction** (newly fixed 2025-10-07):
|
||||
- Part I - Item 1 (Financial Statements)
|
||||
- Part II - Item 1 (Legal Proceedings)
|
||||
- Proper boundary detection and context propagation
|
||||
- Prevents Item number conflicts
|
||||
|
||||
**What's Remaining** (5%):
|
||||
- Validation against large corpus of 10-K/10-Q filings
|
||||
- Edge case handling (non-standard formatting)
|
||||
- 8-K specific section patterns expansion
|
||||
|
||||
**Evidence from Code**:
|
||||
```python
# edgar/documents/extractors/section_extractor.py
(r'^(Item|ITEM)\s+1\.?\s*Business', 'Item 1 - Business'),
(r'^(Item|ITEM)\s+1A\.?\s*Risk\s+Factors', 'Item 1A - Risk Factors'),
(r'^(Item|ITEM)\s+7\.?\s*Management.*Discussion', 'Item 7 - MD&A'),
(r'^(Item|ITEM)\s+8\.?\s*Financial\s+Statements', 'Item 8 - Financial Statements'),

# NEW: Part I/II detection (edgar/documents/extractors/section_extractor.py:294-324)
def _detect_10q_parts(self, headers) -> Dict[int, str]:
    """Detect Part I and Part II boundaries in 10-Q filings."""
```
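
To show how pattern tuples like these and Part context propagation could fit together, here is a self-contained sketch. It is not the code in `section_extractor.py`: `ITEM_PATTERNS`, `PART_PATTERN`, and `label_headers` are hypothetical names, and only the key format ("Part I - Item 1 - ...") follows what this report describes.

```python
import re
from typing import List, Optional

# Hypothetical subset of (pattern, standard name) pairs for 10-Q headers.
ITEM_PATTERNS = [
    (re.compile(r"^(Item|ITEM)\s+1A\.?\s*Risk\s+Factors"), "Item 1A - Risk Factors"),
    (re.compile(r"^(Item|ITEM)\s+1\.?\s*Financial\s+Statements"), "Item 1 - Financial Statements"),
    (re.compile(r"^(Item|ITEM)\s+1\.?\s*Legal\s+Proceedings"), "Item 1 - Legal Proceedings"),
]
PART_PATTERN = re.compile(r"^(?:Part|PART)\s+(I{1,2})\b")


def label_headers(headers: List[str]) -> List[str]:
    """Map raw header strings to section keys, prefixing the current Part."""
    part: Optional[str] = None
    keys: List[str] = []
    for header in headers:
        text = header.strip()
        part_match = PART_PATTERN.match(text)
        if part_match:
            part = f"Part {part_match.group(1)}"  # context for the Items that follow
            continue
        for pattern, name in ITEM_PATTERNS:
            if pattern.match(text):
                keys.append(f"{part} - {name}" if part else name)
                break
    return keys
```

Carrying the most recent Part forward is what keeps "Item 1" in Part I and "Item 1" in Part II from colliding on the same key.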
|
||||
|
||||
#### 5. **Table Processing for AI Context** ✅ **ACHIEVED**
|
||||
> "Getting tables in the right structure for rendering to text for AI context is more important than dataframes"
|
||||
|
||||
**Status**: ✅ Excellent progress with recent fixes
|
||||
- Advanced TableMatrix system handles complex tables
|
||||
- Multi-row header detection and normalization
|
||||
- Spacing column filtering (preserves semantic columns)
|
||||
- Currency symbol merging
|
||||
- Clean text rendering for LLM consumption
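
The core idea behind a table matrix, expanding `colspan` and `rowspan` into a rectangular grid, can be sketched as follows. This is an assumption-laden illustration and not the project's `TableMatrix`: the `(text, colspan, rowspan)` cell shape and the name `expand_spans` are hypothetical.

```python
from typing import Dict, List, Tuple

Cell = Tuple[str, int, int]  # (text, colspan, rowspan)


def expand_spans(rows: List[List[Cell]]) -> List[List[str]]:
    """Place each cell into every grid slot it covers and return a dense grid."""
    filled: Dict[Tuple[int, int], str] = {}
    for r, row in enumerate(rows):
        c = 0
        for text, colspan, rowspan in row:
            while (r, c) in filled:  # skip slots claimed by an earlier rowspan
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    filled[(r + dr, c + dc)] = text
            c += colspan
    if not filled:
        return []
    n_rows = max(r for r, _ in filled) + 1
    n_cols = max(c for _, c in filled) + 1
    return [[filled.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]
```

Once every row has the same number of columns, header detection and spacing-column filtering can operate on plain column indices.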
|
||||
|
||||
**Recent Fixes (2025-10-07)**:
|
||||
- ✅ Fixed spacing column filter removing legitimate headers (MSFT Table 39)
|
||||
- ✅ Fixed header detection for date ranges (Oracle Table 6)
|
||||
- ✅ Fixed long narrative text misclassification (Tesla Table 16)
|
||||
- ✅ Header row normalization for alignment
|
||||
|
||||
#### 6. **Better Than Old Parser in Every Way** 🟡 **MOSTLY ACHIEVED**
|
||||
> "Speed, accuracy, features, usability"
|
||||
|
||||
**Comparison**:
|
||||
|
||||
| Aspect | Old Parser | New Parser | Status |
|--------|-----------|------------|--------|
| **Speed** | Baseline | 1.4x faster (typical) | ✅ Better |
| **Accuracy** | Good | Excellent (with recent fixes) | ✅ Better |
| **Features** | Basic | Rich (XBRL, sections, multiple outputs) | ✅ Better |
| **Usability** | Simple | Powerful + Simple API | ✅ Better |
| **Table Rendering** | Basic alignment | Advanced matrix system | ✅ Better |
| **Section Detection** | Limited | Comprehensive | ✅ Better |
|
||||
|
||||
**Areas Needing Validation**:
|
||||
- Performance on very large documents (>50MB)
|
||||
- Memory usage under sustained load
|
||||
- Edge case handling across diverse filings
|
||||
|
||||
---
|
||||
|
||||
## Item/Section Detection Deep Dive
|
||||
|
||||
### Current Capabilities
|
||||
|
||||
**10-K Sections Detected**:
|
||||
- ✅ Item 1 - Business
|
||||
- ✅ Item 1A - Risk Factors
|
||||
- ✅ Item 1B - Unresolved Staff Comments
|
||||
- ✅ Item 2 - Properties
|
||||
- ✅ Item 3 - Legal Proceedings
|
||||
- ✅ Item 4 - Mine Safety Disclosures
|
||||
- ✅ Item 5 - Market for Stock
|
||||
- ✅ Item 6 - Selected Financial Data
|
||||
- ✅ Item 7 - MD&A
|
||||
- ✅ Item 7A - Market Risk
|
||||
- ✅ Item 8 - Financial Statements
|
||||
- ✅ Item 9 - Changes in Accounting
|
||||
- ✅ Item 9A - Controls and Procedures
|
||||
- ✅ Item 9B - Other Information
|
||||
- ✅ Item 10 - Directors and Officers
|
||||
- ✅ Item 11 - Executive Compensation
|
||||
- ✅ Item 12 - Security Ownership
|
||||
- ✅ Item 13 - Related Transactions
|
||||
- ✅ Item 14 - Principal Accountant
|
||||
- ✅ Item 15 - Exhibits
|
||||
|
||||
**10-Q Sections Detected**:
|
||||
- ✅ Part I Items (Financial Information):
|
||||
- Part I - Item 1 - Financial Statements
|
||||
- Part I - Item 2 - MD&A
|
||||
- Part I - Item 3 - Market Risk
|
||||
- Part I - Item 4 - Controls and Procedures
|
||||
- ✅ Part II Items (Other Information):
|
||||
- Part II - Item 1 - Legal Proceedings
|
||||
- Part II - Item 1A - Risk Factors
|
||||
- Part II - Item 2 - Unregistered Sales
|
||||
- Part II - Item 6 - Exhibits
|
||||
|
||||
**✅ FIXED** (2025-10-07): Part I/Part II distinction now implemented!
|
||||
- Part I Item 1 and Part II Item 1 are properly distinguished
|
||||
- Section keys include Part context: "Part I - Item 1 - Financial Statements" vs "Part II - Item 1 - Legal Proceedings"
|
||||
- Comprehensive test coverage added (5 tests in test_10q_part_detection.py)
|
||||
|
||||
**8-K Sections**:
|
||||
- ⚠️ Limited - needs expansion
|
||||
|
||||
### Detection Methods
|
||||
|
||||
1. **TOC-based Detection** ✅
|
||||
- Analyzes Table of Contents
|
||||
- Extracts anchor links
|
||||
- Maps sections to content
|
||||
|
||||
2. **Pattern-based Detection** ✅
|
||||
- Regex matching for Item headers
|
||||
- Heading analysis (h1-h6 tags)
|
||||
- Text pattern recognition
|
||||
|
||||
3. **Hybrid Approach** ✅
|
||||
- Combines TOC + patterns
|
||||
- Fallback mechanisms
|
||||
- Cross-validation
|
||||
|
||||
### What's Working
|
||||
|
||||
```python
# This works today:
from edgar.documents import parse_html

html = filing.html()
doc = parse_html(html)

# Get all sections
sections = doc.sections  # Returns dict

# Access specific Items
if 'Item 7 - MD&A' in sections:
    mda = sections['Item 7 - MD&A']
    mda_text = mda.text()
    mda_tables = mda.tables()
```
|
||||
|
||||
### What Needs Work
|
||||
|
||||
1. **Validation Coverage** (20% remaining)
|
||||
- Test against 100+ diverse 10-K filings
|
||||
- Test against 10-Q filings
|
||||
- Test against 8-K filings
|
||||
- Capture edge cases and variations
|
||||
|
||||
2. **Edge Cases** (20% remaining)
|
||||
- Non-standard Item formatting
|
||||
- Missing TOC
|
||||
- Nested sections
|
||||
- Combined Items (e.g., "Items 10, 13, 14")
|
||||
|
||||
3. **8-K Support** (50% remaining)
|
||||
- 8-K specific Item patterns
|
||||
- Event-based section detection
|
||||
- Exhibit handling
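
The combined-Items edge case from item 2 (for example "Items 10, 13, 14") could be handled by splitting one header into several Item keys. The sketch below is one possible approach, not existing parser behavior: `split_combined_items` and both patterns are hypothetical.

```python
import re
from typing import List

# "Item" or "Items", then one or more item numbers joined by ",", "and", or "&".
_HEADER = re.compile(
    r"^\s*Items?\s+(\d+[A-C]?(?:\s*(?:,\s*and|,|and|&)\s*\d+[A-C]?)*)",
    re.IGNORECASE,
)
_NUMBER = re.compile(r"\d+[A-C]?", re.IGNORECASE)


def split_combined_items(header: str) -> List[str]:
    """Return one 'Item N' label per item number found in the header."""
    match = _HEADER.match(header)
    if not match:
        return []
    return [f"Item {number.upper()}" for number in _NUMBER.findall(match.group(1))]
```

Each returned label can then be mapped to the same section content, so a lookup for any of the combined Items succeeds.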
|
||||
|
||||
---
|
||||
|
||||
## Recent Achievements (Past 24 Hours)
|
||||
|
||||
### Critical Bug Fixes ✅
|
||||
|
||||
1. **Spacing Column Filter Fix** (MSFT Table 39)
|
||||
- Problem: Legitimate headers removed as "spacing"
|
||||
- Solution: Header content protection + colspan preservation
|
||||
- Impact: Tables now render correctly with all headers
|
||||
- Commits: `4e43276`, `d19ddd1`
|
||||
|
||||
2. **Header Detection Improvements**
|
||||
- Oracle Table 6: Date ranges no longer misclassified
|
||||
- Tesla Table 16: Long narrative text properly handled
|
||||
- Multi-row header normalization
|
||||
- Comprehensive test coverage (16 new tests)
|
||||
|
||||
3. **Documentation Updates**
|
||||
- TESTING.md clarified output limits
|
||||
- CHANGELOG updated with fixes
|
||||
- Bug reports and research docs completed
|
||||
|
||||
### Quality Metrics
|
||||
|
||||
**Test Coverage**:
|
||||
- 16 new tests added (all passing)
|
||||
- 0 regressions in existing tests
|
||||
- Comprehensive edge case coverage
|
||||
|
||||
**Code Quality**:
|
||||
- Clean implementation following plan
|
||||
- Well-documented changes
|
||||
- Proper commit messages with Claude Code attribution
|
||||
|
||||
---
|
||||
|
||||
## Path to 100% Completion
|
||||
|
||||
### High Priority (Next Steps)
|
||||
|
||||
**📋 Detailed plans available**:
|
||||
- **Performance**: See `docs-internal/planning/active-tasks/2025-10-07-performance-optimization-plan.md`
|
||||
- **Testing**: See `docs-internal/planning/active-tasks/2025-10-07-comprehensive-testing-plan.md`
|
||||
|
||||
1. **Performance Optimization** (1-2 weeks)
|
||||
- [ ] Phase 1: Benchmarking & profiling (2-3 days)
|
||||
- [ ] Phase 2: Algorithm optimizations (3-4 days)
|
||||
- [ ] Phase 3: Validation & regression tests (2-3 days)
|
||||
- [ ] Phase 4: Documentation & monitoring (1 day)
|
||||
- **Goal**: Maintain 1.3x+ speed advantage, <2x memory usage
|
||||
|
||||
2. **Comprehensive Testing** (2-3 weeks)
|
||||
- [ ] Phase 1: Corpus validation - 100+ filings (3-4 days)
|
||||
- [ ] Phase 2: Edge cases & error handling (2-3 days)
|
||||
- [ ] Phase 3: Integration testing (2-3 days)
|
||||
- [ ] Phase 4: Regression prevention (1-2 days)
|
||||
- [ ] Phase 5: Documentation & sign-off (1 day)
|
||||
- **Goal**: >95% success rate, >80% test coverage
|
||||
|
||||
3. **Item Detection Validation** (included in testing plan)
|
||||
- [ ] Test against 50+ diverse 10-K filings
|
||||
- [ ] Test against 20+ 10-Q filings
|
||||
- [ ] Document any pattern variations found
|
||||
- [ ] Add regression tests for edge cases
|
||||
|
||||
### Medium Priority
|
||||
|
||||
4. **8-K Support** (1-2 days)
|
||||
- [ ] Research 8-K Item patterns
|
||||
- [ ] Implement detection patterns
|
||||
- [ ] Test against sample 8-K filings
|
||||
|
||||
5. **Documentation** (1 day)
|
||||
- [ ] User guide for section access
|
||||
- [ ] API documentation
|
||||
- [ ] Migration guide from old parser
|
||||
- [ ] Examples and recipes
|
||||
|
||||
### Low Priority (Polish)
|
||||
|
||||
6. **Final Polish**
|
||||
- [ ] Error message improvements
|
||||
- [ ] Logging enhancements
|
||||
- [ ] Configuration documentation
|
||||
- [ ] Performance tuning
|
||||
|
||||
---
|
||||
|
||||
## Risk Assessment
|
||||
|
||||
### Low Risk ✅
|
||||
- Core parsing functionality (stable)
|
||||
- Table processing (recently fixed, well-tested)
|
||||
- Text extraction (working well)
|
||||
- XBRL extraction (functional)
|
||||
|
||||
### Medium Risk ⚠️
|
||||
- Section detection edge cases (needs validation)
|
||||
- Performance on very large docs (needs testing)
|
||||
- Memory usage (needs profiling)
|
||||
|
||||
### Mitigation Strategy
|
||||
1. Comprehensive validation testing (in progress)
|
||||
2. Real-world filing corpus testing
|
||||
3. Performance benchmarking suite
|
||||
4. Gradual rollout with monitoring
|
||||
|
||||
---
|
||||
|
||||
## Recommendations
|
||||
|
||||
### Immediate Actions (This Week)
|
||||
|
||||
1. **Validate Item Detection** 🎯 **TOP PRIORITY**
|
||||
```bash
# Run on diverse corpus
python tests/manual/compare_parsers.py --all

# Test specific sections
python -c "
from edgar.documents import parse_html
from pathlib import Path

for filing in ['Apple', 'Oracle', 'Tesla', 'Microsoft']:
    html = Path(f'data/html/{filing}.10-K.html').read_text()
    doc = parse_html(html)
    print(f'{filing}: {list(doc.sections.keys())[:5]}...')
"
```
|
||||
|
||||
2. **Create Section Access Tests**
|
||||
- Write tests that verify each Item can be accessed
|
||||
- Validate text and table extraction from sections
|
||||
- Test edge cases (missing Items, combined Items)
|
||||
|
||||
3. **User Acceptance Testing**
|
||||
- Have maintainer review section detection output
|
||||
- Validate against known-good filings
|
||||
- Document any issues found
|
||||
|
||||
### Timeline to Production
|
||||
|
||||
**Optimistic**: 1 week
|
||||
- If validation shows good Item detection
|
||||
- If performance is acceptable
|
||||
- If no major issues found
|
||||
|
||||
**Realistic**: 2-3 weeks
|
||||
- Account for edge case fixes
|
||||
- Additional testing needed
|
||||
- Documentation completion
|
||||
|
||||
**Conservative**: 4 weeks
|
||||
- Account for 8-K support
|
||||
- Comprehensive testing across all filing types
|
||||
- Full documentation
|
||||
|
||||
---
|
||||
|
||||
## Conclusion
|
||||
|
||||
The HTML parser rewrite is **very close to completion** with excellent progress on all goals:
|
||||
|
||||
**✅ Fully Achieved**:
|
||||
- Semantic meaning preservation
|
||||
- AI/Human channel support
|
||||
- Section-level processing
|
||||
- Table processing for AI context
|
||||
- Superior to old parser (in most respects)
|
||||
- **Standard Item detection for 10-K/10-Q** (with Part I/II distinction)
|
||||
|
||||
**⚠️ Remaining Work (10%)**:
|
||||
- Validation against diverse corpus
|
||||
- Edge case handling
|
||||
- 8-K specific support expansion
|
||||
- Final testing and documentation
|
||||
|
||||
**Bottom Line**: The parser is **production-ready for 10-K/10-Q** with Item detection functional but requiring validation. The recent bug fixes have resolved critical table rendering issues. With 1-2 weeks of focused validation and testing, this can be shipped with confidence.
|
||||
|
||||
### Next Steps
|
||||
1. Run comprehensive Item detection validation
|
||||
2. Create section access test suite
|
||||
3. Performance benchmark
|
||||
4. Maintainer review and sign-off
|
||||
5. Merge to main branch
|
||||
@@ -0,0 +1,233 @@
|
||||
# HTML Parser Testing Quick Start
|
||||
|
||||
Quick reference for testing the HTML parser rewrite during quality improvement.
|
||||
|
||||
## Quick Start
|
||||
|
||||
```bash
# Use shortcuts (easy!)
python tests/manual/compare_parsers.py aapl              # Apple 10-K
python tests/manual/compare_parsers.py nvda --tables     # Nvidia tables
python tests/manual/compare_parsers.py 'aapl 10-q'       # Apple 10-Q
python tests/manual/compare_parsers.py orcl --table 5    # Oracle table #5

# Or use full paths
python tests/manual/compare_parsers.py data/html/Apple.10-K.html

# Run all test files
python tests/manual/compare_parsers.py --all
```
|
||||
|
||||
**Available shortcuts:**
|
||||
- **Companies**: `aapl`, `msft`, `tsla`, `nvda`, `orcl` (or full names like `apple`)
|
||||
- **Filing types**: `10-k` (default), `10-q`, `8-k`
|
||||
- **Combine**: `'aapl 10-q'`, `'orcl 8-k'`
|
||||
|
||||
## Common Use Cases
|
||||
|
||||
### 1. First Look at a Filing
|
||||
|
||||
```bash
# Get overview: speed, table count, sections
python tests/manual/compare_parsers.py orcl
```
|
||||
|
||||
**Shows**:
|
||||
- Parse time comparison (OLD vs NEW)
|
||||
- Tables found
|
||||
- Text length
|
||||
- Sections detected
|
||||
- New features (headings, XBRL)
|
||||
|
||||
### 2. Check Table Rendering
|
||||
|
||||
```bash
# List all tables with dimensions (shows first 20 tables)
python tests/manual/compare_parsers.py aapl --tables

# Compare specific table side-by-side (FULL table, no truncation)
python tests/manual/compare_parsers.py aapl --table 7

# Compare a range of tables
python tests/manual/compare_parsers.py aapl --range 5:10
```
|
||||
|
||||
**Look for**:
|
||||
- Currency symbols merged: `$1,234` not `$ | 1,234`
|
||||
- Proper column alignment
|
||||
- Correct row/column counts
|
||||
- Clean rendering without extra spacing columns
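
The first check, currency symbols merged into the value cell, corresponds to a small row transformation. The sketch below shows the idea under assumptions: `merge_currency_cells` and the symbol set are hypothetical, and the parser's actual merging logic may differ.

```python
from typing import List

CURRENCY_SYMBOLS = {"$", "€", "£", "¥"}


def merge_currency_cells(row: List[str]) -> List[str]:
    """Join a lone currency-symbol cell with the value cell that follows it."""
    merged: List[str] = []
    i = 0
    while i < len(row):
        cell = row[i].strip()
        following = row[i + 1].strip() if i + 1 < len(row) else ""
        if cell in CURRENCY_SYMBOLS and following:
            merged.append(cell + following)  # "$" + "1,234" -> "$1,234"
            i += 2
        else:
            merged.append(cell)
            i += 1
    return merged
```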
|
||||
|
||||
**Note**: `--table N` shows the **complete table** with all rows - no truncation!
|
||||
|
||||
### 3. Verify Text Extraction
|
||||
|
||||
```bash
# See first 50 lines side-by-side (default limit)
python tests/manual/compare_parsers.py msft --text

# Show more lines (configurable)
python tests/manual/compare_parsers.py msft --text --lines 100

# Show first 200 lines
python tests/manual/compare_parsers.py msft --text --lines 200
```
|
||||
|
||||
**Check**:
|
||||
- Semantic meaning preserved
|
||||
- No missing content
|
||||
- Clean formatting for LLM consumption
|
||||
|
||||
**Note**: Text mode shows first N lines only (default: 50). Use `--lines N` to adjust.
|
||||
|
||||
### 4. Check Section Detection
|
||||
|
||||
```bash
python tests/manual/compare_parsers.py aapl --sections
```
|
||||
|
||||
**Verify**:
|
||||
- Standard sections identified (10-K/10-Q)
|
||||
- Section boundaries correct
|
||||
- Text length reasonable per section
|
||||
|
||||
### 5. Run Full Test Suite
|
||||
|
||||
```bash
# Test all files in corpus
python tests/manual/compare_parsers.py --all
```
|
||||
|
||||
**Results**:
|
||||
- Summary table across all files
|
||||
- Performance comparison
|
||||
- Table detection comparison
|
||||
|
||||
## Test Files
|
||||
|
||||
Available in `data/html/`:
|
||||
|
||||
- `Apple.10-K.html` - 1.8MB, complex financials
|
||||
- `Oracle.10-K.html` - Large filing
|
||||
- `Nvidia.10-K.html` - Tech company
|
||||
- `Apple.10-Q.html` - Quarterly format
|
||||
- More files as needed...
|
||||
|
||||
## Command Reference
|
||||
|
||||
```
|
||||
python tests/manual/compare_parsers.py [FILE] [OPTIONS]
|
||||
|
||||
Options:
|
||||
--all Run on all test files
|
||||
--tables Show tables summary (first 20 tables)
|
||||
--table N Show specific table N side-by-side (FULL table)
|
||||
--range START:END Show range of tables (e.g., 5:10)
|
||||
--text Show text comparison (first 50 lines by default)
|
||||
--sections Show sections comparison
|
||||
--lines N Number of text lines to show (default: 50, only for --text)
|
||||
--help Show full help
|
||||
```
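The option list above maps naturally onto `argparse`. The following is a minimal sketch of how such a command line could be declared; it is not the actual source of `compare_parsers.py`, and the names `build_parser` and `parse_range` are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the options in the Command Reference."""
    parser = argparse.ArgumentParser(prog="compare_parsers.py")
    parser.add_argument("file", nargs="?", help="Test file name or path")
    parser.add_argument("--all", action="store_true", help="Run on all test files")
    parser.add_argument("--tables", action="store_true", help="Show tables summary")
    parser.add_argument("--table", type=int, metavar="N", help="Show table N in full")
    parser.add_argument("--range", metavar="START:END", help="Show a range of tables")
    parser.add_argument("--text", action="store_true", help="Show text comparison")
    parser.add_argument("--sections", action="store_true", help="Show sections comparison")
    parser.add_argument("--lines", type=int, default=50, help="Text lines to show")
    return parser


def parse_range(spec: str) -> tuple[int, int]:
    """Split a 'START:END' string such as '5:10' into two integers."""
    start, end = spec.split(":")
    return int(start), int(end)


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

Keeping `--lines` as a plain integer with a default of 50 matches the documented behavior that the limit only matters together with `--text`.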
|
||||
|
||||
### Output Limits Summary
|
||||
|
||||
| Mode | Limit | Configurable | Notes |
|
||||
|---------------|------------|-------------------|---------------------------------|
|
||||
| `--table N` | None | N/A | Shows **complete table** |
|
||||
| `--range N:M` | None | N/A | Shows **complete tables** in range |
|
||||
| `--tables` | 20 tables | No | Lists first 20 tables only |
|
||||
| `--text` | 50 lines | Yes (`--lines N`) | Preview only |
|
||||
| `--sections` | None | N/A | Shows all sections |
|
||||
|
||||
## Output Interpretation
|
||||
|
||||
### Overview Table
|
||||
|
||||
```
|
||||
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┓
|
||||
┃ Metric ┃ Old Parser ┃ New Parser ┃ Notes ┃
|
||||
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━┩
|
||||
│ Parse Time │ 454ms │ 334ms │ 1.4x faster│
|
||||
│ Tables Found │ 63 │ 63 │ +0 │
|
||||
│ Text Length │ 0 │ 159,388 │ NEW! │
|
||||
└───────────────┴────────────┴────────────┴────────────┘
|
||||
```
|
||||
|
||||
**Good signs**:
|
||||
- ✅ New parser faster or similar speed
|
||||
- ✅ Same or more tables found
|
||||
- ✅ Text extracted (old parser shows 0)
|
||||
- ✅ Sections detected
|
||||
|
||||
**Red flags**:
|
||||
- ❌ Significantly slower
|
||||
- ❌ Fewer tables (unless removing layout tables)
|
||||
- ❌ Much shorter text (content missing)
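The good signs and red flags above can also be checked mechanically. This is a rough sketch only: the dictionary keys and the two thresholds (`slow_factor`, `short_factor`) are assumptions chosen for illustration, not values used by the comparison script.

```python
def speedup(old_ms: float, new_ms: float) -> float:
    """How many times faster the new parser is (values above 1 mean faster)."""
    return old_ms / new_ms


def review_overview(old: dict, new: dict,
                    slow_factor: float = 1.5,
                    short_factor: float = 0.5) -> list[str]:
    """Return the red flags raised by one row set of the overview table."""
    flags = []
    if new["parse_ms"] > old["parse_ms"] * slow_factor:
        flags.append("significantly slower")
    if new["tables"] < old["tables"]:
        flags.append("fewer tables")
    # The old parser may report 0 text; only compare when it extracted something.
    if old["text_len"] > 0 and new["text_len"] < old["text_len"] * short_factor:
        flags.append("much shorter text")
    return flags


# Values taken from the sample overview table.
old = {"parse_ms": 454, "tables": 63, "text_len": 0}
new = {"parse_ms": 334, "tables": 63, "text_len": 159_388}
print(f"{speedup(old['parse_ms'], new['parse_ms']):.1f}x faster", review_overview(old, new))
```

A flag for fewer tables still needs a manual look, since dropping layout tables is intended behavior.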
|
||||
|
||||
### Table Comparison
|
||||
|
||||
```
|
||||
Old Parser:
|
||||
┌─────────┬──────────┬──────────┐
|
||||
│ Year │ Revenue │ Profit │
|
||||
├─────────┼──────────┼──────────┤
|
||||
│ 2023 │ $ 100M │ $ 20M │ <- Currency separated
|
||||
└─────────┴──────────┴──────────┘
|
||||
|
||||
New Parser:
|
||||
┌─────────┬──────────┬──────────┐
|
||||
│ Year │ Revenue │ Profit │
|
||||
├─────────┼──────────┼──────────┤
|
||||
│ 2023 │ $100M │ $20M │ <- Currency merged ✅
|
||||
└─────────┴──────────┴──────────┘
|
||||
```
|
||||
|
||||
**Look for**:
|
||||
- Currency symbols merged with values
|
||||
- No extra empty columns
|
||||
- Proper alignment
|
||||
- Clean numeric formatting
|
||||
|
||||
## Tips
|
||||
|
||||
1. **Start with overview** - Get the big picture first
|
||||
2. **Check tables visually** - Automated metrics miss formatting issues
|
||||
3. **Use specific table inspection** - Don't scroll through 60 tables manually
|
||||
4. **Compare text for semantics** - Does it make sense for an LLM?
|
||||
5. **Run --all periodically** - Catch regressions across files
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Script fails with import error
|
||||
|
||||
```bash
|
||||
# Clear cached modules
|
||||
find . -type d -name __pycache__ -exec rm -rf {} +
|
||||
python tests/manual/compare_parsers.py data/html/Apple.10-K.html
|
||||
```
|
||||
|
||||
### File not found
|
||||
|
||||
```bash
|
||||
# Check available files
|
||||
ls -lh data/html/*.html
|
||||
|
||||
# Use full path
|
||||
python tests/manual/compare_parsers.py /full/path/to/file.html
|
||||
```
|
||||
|
||||
### Old parser shows 0 text
|
||||
|
||||
This is expected: the old parser extracts text differently, so the overview reports 0 for it. Focus on:
|
||||
- Table comparison
|
||||
- Parse time
|
||||
- Visual quality of output
|
||||
|
||||
## Next Steps
|
||||
|
||||
1. Run comparison on all test files
|
||||
2. Document bugs in `quality-improvement-strategy.md`
|
||||
3. Fix issues
|
||||
4. Repeat until satisfied
|
||||
|
||||
See `edgar/documents/docs/quality-improvement-strategy.md` for full process.
|
||||
@@ -0,0 +1,529 @@
|
||||
# Fast Table Rendering
|
||||
|
||||
**Status**: Production Ready - **Now the Default** (as of 2025-10-08)
|
||||
**Performance**: ~7-10x faster than Rich rendering (per the benchmarks below), with correct colspan/rowspan handling
|
||||
|
||||
---
|
||||
|
||||
## Overview
|
||||
|
||||
Fast table rendering is a high-performance alternative to Rich for table text extraction. SEC filings can contain hundreds of tables, so cumulative rendering time can become a bottleneck. The fast renderer builds strings directly and uses TableMatrix to expand colspan/rowspan cells, which gives a 7-10x speedup while keeping the output correct.
|
||||
|
||||
**As of 2025-10-08, fast rendering is the default** for all table text extraction. You no longer need to explicitly enable it.
|
||||
|
||||
### Why It's Now the Default
|
||||
|
||||
- **Production-ready**: Fixed all major issues (colspan, multi-row headers, multi-line cells)
|
||||
- **7-10x faster**: Significant performance improvement with correct output
|
||||
- **Maintains quality**: Matches Rich's appearance with simple() style
|
||||
- **Proven**: Extensively tested with Apple, NVIDIA, Microsoft 10-K filings
|
||||
|
||||
### When to Disable (Use Rich Instead)
|
||||
|
||||
You may want to disable fast rendering and use Rich for:
|
||||
- **Terminal display for humans**: Rich has more sophisticated text wrapping and layout
|
||||
- **Visual reports**: When presentation quality is more important than speed
|
||||
- **Debugging**: Rich output can be easier to visually inspect
|
||||
|
||||
---
|
||||
|
||||
## Usage
|
||||
|
||||
### Default Behavior (Fast Rendering Enabled)
|
||||
|
||||
```python
|
||||
from edgar.documents import parse_html
|
||||
|
||||
# Fast rendering is now the default - no configuration needed!
|
||||
doc = parse_html(html)
|
||||
|
||||
# Tables automatically use fast renderer (7-10x faster)
|
||||
table_text = doc.tables[0].text()
|
||||
```
|
||||
|
||||
### Disabling Fast Rendering (Use Rich Instead)
|
||||
|
||||
If you need Rich's sophisticated layout for visual display:
|
||||
|
||||
```python
|
||||
from edgar.documents import parse_html
|
||||
from edgar.documents.config import ParserConfig
|
||||
|
||||
# Explicitly disable fast rendering to use Rich
|
||||
config = ParserConfig(fast_table_rendering=False)
|
||||
doc = parse_html(html, config=config)
|
||||
|
||||
# Tables use Rich renderer (slower but with advanced formatting)
|
||||
table_text = doc.tables[0].text()
|
||||
```
|
||||
|
||||
### Custom Table Styles
|
||||
|
||||
**New in this version**: Fast rendering now uses the `simple()` style by default, which matches Rich's `box.SIMPLE` appearance (borderless, clean).
|
||||
|
||||
```python
|
||||
from edgar.documents import parse_html
|
||||
from edgar.documents.config import ParserConfig
|
||||
from edgar.documents.renderers.fast_table import FastTableRenderer, TableStyle
|
||||
|
||||
# Enable fast rendering (uses simple() style by default)
|
||||
config = ParserConfig(fast_table_rendering=True)
|
||||
doc = parse_html(html, config=config)
|
||||
|
||||
# Default: simple() style - borderless, clean
|
||||
table_text = doc.tables[0].text()
|
||||
|
||||
# To use pipe_table() style explicitly (markdown-compatible borders):
|
||||
renderer = FastTableRenderer(TableStyle.pipe_table())
|
||||
pipe_text = renderer.render_table_node(doc.tables[0])
|
||||
|
||||
# To use minimal() style (no separator):
|
||||
renderer = FastTableRenderer(TableStyle.minimal())
|
||||
minimal_text = renderer.render_table_node(doc.tables[0])
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Performance Comparison
|
||||
|
||||
### Benchmark Results
|
||||
|
||||
**Test**: Apple 10-K (63 tables) - Updated 2025-10-08
|
||||
|
||||
| Renderer | Average Per Table | Improvement | Notes |
|
||||
|----------|-------------------|-------------|-------|
|
||||
| Rich | 1.5-2.5ms | Baseline | Varies by table complexity |
|
||||
| Fast (simple) | 0.15-0.35ms | **7-10x faster** | With proper colspan/rowspan handling |
|
||||
|
||||
**Real-world Examples** (Apple 10-K):
|
||||
- Table 15 (complex colspan): Rich 2.51ms → Fast 0.35ms (**7.1x faster**)
|
||||
- Table 6 (multi-line cells): Rich 1.61ms → Fast 0.17ms (**9.5x faster**)
|
||||
- Table 5 (wide table): Rich 3.70ms → Fast 0.48ms (**7.7x faster**)
|
||||
|
||||
**Impact on Full Parse**:
|
||||
- Rich rendering: 30-40% of total parse time spent in table rendering
|
||||
- Fast rendering: 5-10% of total parse time
|
||||
- **Overall speedup**: Reduces total parsing time by ~25-30%
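The relationship between the per-table speedup and the overall saving can be sanity-checked with Amdahl's law. This is a back-of-the-envelope model under the assumption that only table rendering gets faster; it is not how the figures above were measured, and the function names are illustrative.

```python
def overall_reduction(render_share: float, speedup: float) -> float:
    """Fraction of total parse time saved when only rendering is sped up."""
    new_total = (1.0 - render_share) + render_share / speedup
    return 1.0 - new_total


def render_share_after(render_share: float, speedup: float) -> float:
    """Share of the (shorter) parse that rendering takes after the speedup."""
    new_render = render_share / speedup
    return new_render / ((1.0 - render_share) + new_render)


# Low and high ends of the ranges quoted above.
for share, factor in [(0.30, 7.0), (0.40, 10.0)]:
    print(
        f"share={share:.0%} speedup={factor:.0f}x "
        f"saved={overall_reduction(share, factor):.1%} "
        f"new share={render_share_after(share, factor):.1%}"
    )
```

Under this model the remaining rendering share lands inside the quoted 5-10% band, and the total saving starts at roughly a quarter of parse time, which is broadly in line with the figures above.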
|
||||
|
||||
### Memory Impact
|
||||
|
||||
Fast rendering also reduces memory overhead:
|
||||
- No Rich Console objects retained
|
||||
- Direct string building (no intermediate objects)
|
||||
- Helps prevent memory leaks identified in profiling
|
||||
|
||||
---
|
||||
|
||||
## Output Examples
|
||||
|
||||
### Rich Renderer Output (Previous Default)
|
||||
|
||||
```
|
||||
(In millions)
|
||||
Year Ended June 30, 2025 2024 2023
|
||||
──────────────────────────────────────────────────────────
|
||||
|
||||
Operating lease cost $5,524 3,555 2,875
|
||||
|
||||
Finance lease cost:
|
||||
Amortization of right-of-use assets $3,408 1,800 1,352
|
||||
Interest on lease liabilities 1,417 734 501
|
||||
|
||||
Total finance lease cost $4,825 2,534 1,853
|
||||
```
|
||||
|
||||
**Style**: `box.SIMPLE` - No outer border, just horizontal separator under header
|
||||
**Pros**: Clean, uncluttered, perfect alignment, generous spacing
|
||||
**Cons**: Slow (1.5-2.5ms per table in the Apple 10-K benchmark above), creates Rich objects, memory overhead
|
||||
|
||||
### Fast Renderer Output (NEW: simple() style - Default)
|
||||
|
||||
```
|
||||
December 31, 2023 December 31, 2022 December 31, 2021
|
||||
───────────────────────────────────────────────────────────────────────────────────────
|
||||
Revenue 365,817 394,328 365,817
|
||||
Cost of revenue 223,546 212,981 192,266
|
||||
Gross profit 142,271 181,347 173,551
|
||||
```
|
||||
|
||||
**Style**: `simple()` - Matches Rich's `box.SIMPLE` appearance
|
||||
**Pros**: Fast (0.2ms per table), clean appearance, no visual noise, professional look
|
||||
**Cons**: None - this is now the recommended default!
|
||||
|
||||
### Fast Renderer Output (pipe_table() style - Optional)
|
||||
|
||||
```
|
||||
| | December 31, 2023 | December 31, 2022 | December 31, 2021 |
|
||||
|--------------------------|---------------------|---------------------|---------------------|
|
||||
| Revenue | 365,817 | 394,328 | 365,817 |
|
||||
| Cost of revenue | 223,546 | 212,981 | 192,266 |
|
||||
| Gross profit | 142,271 | 181,347 | 173,551 |
|
||||
```
|
||||
|
||||
**Style**: `pipe_table()` - Markdown-compatible with borders
|
||||
**Pros**: Fast (0.2ms per table), markdown-compatible, explicit column boundaries
|
||||
**Cons**: Visual noise from pipe characters, busier appearance
|
||||
**Use when**: You need markdown-compatible output with explicit borders
|
||||
|
||||
### Visual Comparison
|
||||
|
||||
**Rich** (`box.SIMPLE`):
|
||||
- No outer border - clean, uncluttered look
|
||||
- Horizontal line separator under header only
|
||||
- Generous internal spacing and padding
|
||||
- Perfect column alignment
|
||||
- Professional, minimalist presentation
|
||||
|
||||
**Fast simple()** (NEW DEFAULT):
|
||||
- No outer border - matches Rich's clean look
|
||||
- Horizontal line separator under header (using `─`)
|
||||
- Space-separated columns with generous padding
|
||||
- Clean, professional appearance
|
||||
- Same performance as pipe_table (~0.2ms per table)
|
||||
|
||||
**Fast pipe_table()** (optional):
|
||||
- Full pipe table borders (`|` characters everywhere)
|
||||
- Horizontal dashes for header separator
|
||||
- Markdown-compatible format
|
||||
- Explicit column boundaries
|
||||
|
||||
---
|
||||
|
||||
## Recent Improvements (2025-10-08)
|
||||
|
||||
### 1. Colspan/Rowspan Support
|
||||
|
||||
**Fixed**: Tables with `colspan` and `rowspan` attributes now render correctly.
|
||||
|
||||
**Previous issue**: Fast renderer was extracting cell text without accounting for colspan/rowspan, causing:
|
||||
- Missing columns (e.g., "2023" column disappeared in Apple 10-K table 15)
|
||||
- Misaligned data (currency symbols separated from values)
|
||||
- Data loss (em dashes and other values missing)
|
||||
|
||||
**Solution**: Integrated `TableMatrix` for proper cell expansion, same as Rich rendering uses.
|
||||
|
||||
**Status**: ✅ FIXED
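To illustrate what "cell expansion" means here, the sketch below expands spanning cells into a rectangular grid. It is a simplified stand-in, not the `TableMatrix` implementation: the `Cell` class, the `expand_to_grid` name, and the choice to keep text only in the top-left slot of a span are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    text: str
    colspan: int = 1
    rowspan: int = 1


def expand_to_grid(rows: list[list[Cell]]) -> list[list[str]]:
    """Expand spanning cells so every row has one entry per logical column."""
    grid: dict[tuple[int, int], str] = {}
    for r, row in enumerate(rows):
        c = 0
        for cell in row:
            # Skip positions already claimed by a rowspan from an earlier row.
            while (r, c) in grid:
                c += 1
            for dr in range(cell.rowspan):
                for dc in range(cell.colspan):
                    # Keep the text in the top-left slot; pad the rest of the span.
                    grid[(r + dr, c + dc)] = cell.text if (dr, dc) == (0, 0) else ""
            c += cell.colspan
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


header = [Cell(""), Cell("2023", colspan=2), Cell("2022", colspan=2)]
body = [Cell("Revenue"), Cell("$"), Cell("100"), Cell("$"), Cell("90")]
for line in expand_to_grid([header, body]):
    print(line)
```

Without this step the header row has three cells and the body row five, which is how a year column can go missing or values can shift under the wrong header.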
|
||||
|
||||
### 2. Multi-Row Header Preservation
|
||||
|
||||
**Fixed**: Tables with multiple header rows now preserve each row separately.
|
||||
|
||||
**Previous issue**: Multi-row headers were collapsed into a single line, causing "Investment portfolio" row to disappear in Apple 10-K table 20.
|
||||
|
||||
**Solution**: Modified `render_table_data()` and `_build_table()` to preserve each header row as a separate line.
|
||||
|
||||
**Status**: ✅ FIXED
|
||||
|
||||
### 3. Multi-Line Cell Rendering
|
||||
|
||||
**Fixed**: Cells containing newline characters (`\n`) now render as multiple lines.
|
||||
|
||||
**Previous issue**: Multi-line cells like "Interest Rate\nSensitive Instrument" were truncated to first line only.
|
||||
|
||||
**Solution**: Added `_format_multiline_row()` to split cells by `\n` and render each line separately.
|
||||
|
||||
**Status**: ✅ FIXED
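The idea behind splitting cells on `\n` can be sketched as follows. This is an illustrative version only, not the renderer's `_format_multiline_row()`; the signature, the two-space column gap, and left alignment are assumptions.

```python
from itertools import zip_longest


def format_multiline_row(cells: list[str], widths: list[int]) -> list[str]:
    """Render one logical row as several physical lines, one per cell line."""
    split_cells = [cell.split("\n") for cell in cells]
    lines = []
    # zip_longest pads shorter cells with "" so every column stays aligned.
    for parts in zip_longest(*split_cells, fillvalue=""):
        padded = [part.ljust(width) for part, width in zip(parts, widths)]
        lines.append("  ".join(padded).rstrip())
    return lines


for line in format_multiline_row(["Interest Rate\nSensitive Instrument", "2023"], [20, 6]):
    print(line)
```

Each physical line reuses the same column widths, so the second line of a tall cell stays under the first instead of being dropped.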
|
||||
|
||||
### Performance Impact
|
||||
|
||||
All three fixes maintain excellent performance:
|
||||
- **Speedup**: 7-10x faster than Rich (down from initial 14x, but with correct output)
|
||||
- **Correctness**: Now matches Rich's handling of colspan, multi-row headers, and multi-line cells (minor spacing differences remain; see Known Limitations)
|
||||
- **Production ready**: Can confidently use as default renderer
|
||||
|
||||
---
|
||||
|
||||
## Known Limitations
|
||||
|
||||
### 1. Column Alignment in Some Tables
|
||||
|
||||
**Issue**: Currency symbols and values may have extra spacing in some complex tables (e.g., Apple 10-K table 22)
|
||||
|
||||
**Example**:
|
||||
- Rich: `$294,866`
|
||||
- Fast: `$ 294,866` (extra spacing)
|
||||
|
||||
**Root cause**: Column width calculation creates wider columns for some currency/value pairs after colspan expansion and column filtering.
|
||||
|
||||
**Impact**: Visual appearance differs slightly, but data is correct and readable.
|
||||
|
||||
**Status**: ⚠️ Minor visual difference - acceptable trade-off for the 7-10x performance gain
|
||||
|
||||
### 2. Visual Polish
|
||||
|
||||
**Issue**: Some visual aspects don't exactly match Rich's sophisticated layout
|
||||
|
||||
**Examples**:
|
||||
- Multi-line cell wrapping may differ
|
||||
- Column alignment in edge cases
|
||||
|
||||
**Status**: ⚠️ Acceptable trade-off for the 7-10x performance gain
|
||||
|
||||
---
|
||||
|
||||
## Configuration Options
|
||||
|
||||
### Table Styles
|
||||
|
||||
Fast renderer supports different visual styles:
|
||||
|
||||
```python
|
||||
from edgar.documents.renderers.fast_table import FastTableRenderer, TableStyle
|
||||
|
||||
# Pipe table style - markdown compatible (optional; simple() is the default)
|
||||
renderer = FastTableRenderer(TableStyle.pipe_table())
|
||||
|
||||
# Minimal style - no borders, just spacing
|
||||
renderer = FastTableRenderer(TableStyle.minimal())
|
||||
```
|
||||
|
||||
### Minimal Style Output
|
||||
|
||||
```
|
||||
December 31, 2023 December 31, 2022 December 31, 2021
|
||||
Revenue 365,817 394,328 365,817
|
||||
Cost of revenue 223,546 212,981 192,266
|
||||
Gross profit 142,271 181,347 173,551
|
||||
```
|
||||
|
||||
**Note**: Minimal style has a cleaner appearance but loses explicit column boundaries and the header separator.
|
||||
|
||||
---
|
||||
|
||||
## Technical Details
|
||||
|
||||
### How It Works
|
||||
|
||||
1. **Direct String Building**: Bypasses Rich's layout engine
|
||||
2. **Column Analysis**: Detects numeric columns for right-alignment
|
||||
3. **Smart Filtering**: Removes empty spacing columns
|
||||
4. **Currency Merging**: Combines `$` symbols with amounts
|
||||
5. **Width Calculation**: Measures content, applies min/max limits
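The five steps above can be sketched end to end on an already expanded grid of strings. This is a simplified illustration under stated assumptions, not the `FastTableRenderer` source: the function names, the numeric regex, the two-space column gap, and the width limits are all hypothetical.

```python
import re

# Matches values such as 1,234  $1,234  (1,234)  12.5%
NUMERIC = re.compile(r"^\(?\$?-?[\d,]+(?:\.\d+)?\)?%?$")


def drop_empty_columns(rows):
    """Step 3: remove columns that are blank in every row."""
    keep = [c for c in range(len(rows[0])) if any(row[c].strip() for row in rows)]
    return [[row[c] for c in keep] for row in rows]


def merge_currency(rows, header_rows=1):
    """Step 4: fold a '$'-only body column into the value column to its right."""
    body = rows[header_rows:]
    n_cols = len(rows[0])
    dollar_cols = {
        c for c in range(n_cols - 1)
        if any(r[c].strip() == "$" for r in body)
        and all(r[c].strip() in ("", "$") for r in body)
    }
    merged = []
    for row in rows:
        out = []
        for c in range(n_cols):
            if c in dollar_cols:
                continue
            text = row[c].strip()
            if c - 1 in dollar_cols:
                left = row[c - 1].strip()
                if left == "$" and text:
                    text = "$" + text
                elif left and not text:
                    text = left  # e.g. a header label that sat above the '$' column
            out.append(text)
        merged.append(out)
    return merged


def render(rows, header_rows=1, min_width=3, max_width=40):
    rows = merge_currency(drop_empty_columns(rows), header_rows)
    n_cols = len(rows[0])
    body = rows[header_rows:]
    # Step 2: a column is numeric when every non-empty body cell looks like a number.
    numeric = [
        any(r[c] for r in body) and all(NUMERIC.match(r[c]) for r in body if r[c])
        for c in range(n_cols)
    ]
    # Step 5: width is the longest cell, clamped to [min_width, max_width].
    widths = [min(max(max(len(r[c]) for r in rows), min_width), max_width)
              for c in range(n_cols)]
    lines = []
    for i, row in enumerate(rows):
        cells = [
            row[c][:widths[c]].rjust(widths[c]) if numeric[c]
            else row[c][:widths[c]].ljust(widths[c])
            for c in range(n_cols)
        ]
        # Step 1: plain string joins, no layout engine involved.
        lines.append("  ".join(cells).rstrip())
        if i == header_rows - 1:
            lines.append("─" * sum(widths, 2 * (n_cols - 1)))
    return "\n".join(lines)


print(render([
    ["", "", "2023", "", "", "2022", ""],
    ["Revenue", "", "$", "100", "", "$", "90"],
]))
```

Detecting currency columns from body rows only lets a header label that spans the `$` and value columns follow the value column when the two are merged.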
|
||||
|
||||
### Code Path
|
||||
|
||||
```python
|
||||
# When fast_table_rendering=True:
|
||||
table.text()
|
||||
→ TableNode._fast_text_rendering()
|
||||
→ FastTableRenderer.render_table_node()
|
||||
→ Direct string building
|
||||
```
|
||||
|
||||
### Memory Benefits
|
||||
|
||||
Fast rendering avoids:
|
||||
- Rich Console object creation (~0.4MB per document)
|
||||
- Intermediate rich.Table objects
|
||||
- Style/theme processing overhead
|
||||
- ANSI escape code generation
|
||||
|
||||
---
|
||||
|
||||
## Future Improvements
|
||||
|
||||
### Planned Enhancements
|
||||
|
||||
1. **Match Rich's `box.SIMPLE` Style** (Priority: HIGH; now implemented as `TableStyle.simple()`)
|
||||
- **Remove all pipe characters** - no outer border, no column separators
|
||||
- **Keep only horizontal separator** under header (using `─` character)
|
||||
- **Increase internal padding** to match Rich's generous spacing
|
||||
- **Clean, minimalist appearance** like Rich's SIMPLE box style
|
||||
- **Goal**: Match Rich's visual quality while keeping the speed advantage (7-10x in the current benchmarks)
|
||||
|
||||
2. **Improved Layout Engine**
|
||||
- Better column width calculation (avoid too-wide/too-narrow columns)
|
||||
- Respect natural content breaks
|
||||
- Dynamic spacing based on content type
|
||||
- Handle wrapping for long content
|
||||
|
||||
3. **Dynamic Padding**
|
||||
- Match Rich's generous spacing (currently too tight)
|
||||
- Adjust padding based on content type
|
||||
- Configurable padding rules
|
||||
- Maintain alignment with variable padding
|
||||
|
||||
4. **Header Handling**
|
||||
- Better multi-row header collapse
|
||||
- Preserve important hierarchies
|
||||
- Smart column spanning
|
||||
- Honor header groupings
|
||||
|
||||
5. **Style Presets**
|
||||
- `TableStyle.simple()` - Match Rich's `box.SIMPLE` (no borders, header separator only); already implemented and now the default
|
||||
- `TableStyle.minimal()` - no borders, just spacing (already implemented)
|
||||
- `TableStyle.pipe_table()` - markdown style (already implemented; optional since `simple()` became the default)
|
||||
- `TableStyle.ascii_clean()` - no Unicode, pure ASCII
|
||||
- `TableStyle.compact()` - minimal spacing for dense data
|
||||
|
||||
### Timeline
|
||||
|
||||
These improvements are **planned for Phase 2** of the HTML parser optimization work (after memory leak fixes).
|
||||
|
||||
---
|
||||
|
||||
## Migration Guide
|
||||
|
||||
### From Rich to Fast
|
||||
|
||||
**Before** (Rich, the default prior to 2025-10-08):
|
||||
```python
|
||||
doc = parse_html(html)
|
||||
table_text = doc.tables[0].text() # Slow but pretty
|
||||
```
|
||||
|
||||
**After** (Fast, now the default; the explicit flag is optional):
|
||||
```python
|
||||
config = ParserConfig(fast_table_rendering=True)
|
||||
doc = parse_html(html, config=config)
|
||||
table_text = doc.tables[0].text()  # Fast; minor spacing differences from Rich
|
||||
```
|
||||
|
||||
### Hybrid Approach
|
||||
|
||||
Use fast rendering during processing, Rich for final display:
|
||||
|
||||
```python
|
||||
# Fast processing
|
||||
config = ParserConfig(fast_table_rendering=True)
|
||||
doc = parse_html(html, config=config)
|
||||
|
||||
# Extract data quickly
|
||||
for table in doc.tables:
|
||||
data = table.text() # Fast
|
||||
# Process data...
|
||||
|
||||
# Display one table nicely
|
||||
special_table = doc.tables[5]
|
||||
rich_output = special_table.render() # Switch to Rich for display
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Performance Recommendations
|
||||
|
||||
### Recommended Settings by Use Case
|
||||
|
||||
**Batch Processing** (optimize for speed):
|
||||
```python
|
||||
config = ParserConfig.for_performance()
|
||||
# Includes: fast_table_rendering=True, eager_section_extraction=False
|
||||
```
|
||||
|
||||
**Data Extraction** (balance speed and accuracy):
|
||||
```python
|
||||
config = ParserConfig(
|
||||
fast_table_rendering=True,
|
||||
extract_xbrl=True,
|
||||
detect_sections=True
|
||||
)
|
||||
```
|
||||
|
||||
**Display/Reports** (optimize for quality):
|
||||
```python
|
||||
config = ParserConfig(fast_table_rendering=False)  # Fast rendering is the default, so Rich must be requested explicitly
|
||||
# Or explicitly:
|
||||
config = ParserConfig.for_accuracy()
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## FAQ
|
||||
|
||||
**Q: Can I mix Fast and Rich rendering?**
|
||||
A: Not per-table. The setting is document-wide via ParserConfig. However, you can manually call `table.render()` to get Rich output.
|
||||
|
||||
**Q: Does this affect section extraction?**
|
||||
A: Indirectly, yes. Section detection calls `text()` on the entire document, which includes tables. Fast rendering speeds this up significantly.
|
||||
|
||||
**Q: Will the output format change?**
|
||||
A: Yes, as we improve the renderer. We'll maintain backward compatibility via style options.
|
||||
|
||||
**Q: Can I customize the appearance?**
|
||||
A: Yes, through style presets: `TableStyle.simple()` (the default), `TableStyle.pipe_table()`, and `TableStyle.minimal()`. More options coming.
|
||||
|
||||
**Q: What about DataFrame export?**
|
||||
A: Fast rendering only affects text output. `table.to_dataframe()` is unaffected.
|
||||
|
||||
---
|
||||
|
||||
## Feedback
|
||||
|
||||
The fast renderer is actively being improved based on user feedback. Known issues:
|
||||
|
||||
1. ❌ **Pipe characters** - visual noise (only in the optional `pipe_table()` style)
|
||||
2. ❌ **Layout engine** - inconsistent spacing
|
||||
3. ❌ **Padding** - needs tuning
|
||||
|
||||
If you have specific rendering issues or suggestions, please provide:
|
||||
- Sample table HTML
|
||||
- Expected vs actual output
|
||||
- Use case description
|
||||
|
||||
This helps prioritize improvements while maintaining the performance advantage.
|
||||
|
||||
---
|
||||
|
||||
## Summary
|
||||
|
||||
### Current State (As of 2025-10-08)
|
||||
|
||||
**Performance**: ✅ Excellent (7-10x faster than Rich)
|
||||
**Correctness**: ✅ Production ready (proper colspan/rowspan handling)
|
||||
**Visual Quality**: ⚠️ Good (simple() style matches Rich's box.SIMPLE appearance)
|
||||
**Use Case**: Production-ready for all use cases
|
||||
|
||||
### Recent Milestones
|
||||
|
||||
**✅ Completed**:
|
||||
- Core fast rendering implementation
|
||||
- TableStyle.simple() preset (borderless, clean)
|
||||
- Column filtering and merging
|
||||
- Numeric alignment detection
|
||||
- **Colspan/rowspan support via TableMatrix**
|
||||
- **Performance benchmarking with real tables**
|
||||
|
||||
**🔧 Current Limitations**:
|
||||
- Multi-row header collapsing differs from Rich
|
||||
- Some visual polish differences (acceptable for speed gain)
|
||||
- Layout engine not as sophisticated as Rich
|
||||
|
||||
### Development Roadmap
|
||||
|
||||
**Phase 1** (✅ COMPLETED):
|
||||
- ✅ Core fast rendering implementation
|
||||
- ✅ Simple() style matching Rich's box.SIMPLE
|
||||
- ✅ Proper colspan/rowspan handling via TableMatrix
|
||||
- ✅ Production-ready performance (7-10x faster)
|
||||
|
||||
**Phase 2** (Future Enhancements):
|
||||
- 📋 Improve multi-row header handling
|
||||
- 📋 Better layout engine for perfect column widths
|
||||
- 📋 Additional style presets
|
||||
- 📋 Advanced header detection (data vs labels)
|
||||
|
||||
### Bottom Line
|
||||
|
||||
Fast table rendering is **production-ready and now the default** for all table text extraction in EdgarTools.
|
||||
|
||||
**Benefits**:
|
||||
- ✅ 7-10x faster than Rich rendering
|
||||
- ✅ Correct data extraction with proper colspan/rowspan handling
|
||||
- ✅ Multi-row header preservation
|
||||
- ✅ Multi-line cell rendering
|
||||
- ✅ Clean, borderless appearance (simple() style)
|
||||
|
||||
**Minor differences from Rich**:
|
||||
- ⚠️ Some tables have extra spacing between currency symbols and values (e.g., table 22)
|
||||
- ⚠️ Column width calculation may differ slightly in complex tables
|
||||
- ✅ All data is preserved and correct - only visual presentation differs
|
||||
|
||||
The implementation achieves **correct data extraction** with **significant performance gains** and **clean visual output**, making it the ideal default for EdgarTools.
|
||||
|
||||
---
|
||||
|
||||
## Related Documentation
|
||||
|
||||
- [HTML Parser Status](HTML_PARSER_STATUS.md) - Overall parser progress
|
||||
- [Performance Analysis](../perf/hotpath_analysis.md) - Profiling results showing Rich rendering bottleneck
|
||||
- [Memory Analysis](../perf/memory_analysis.md) - Memory leak issues with Rich objects
|
||||
164
venv/lib/python3.10/site-packages/edgar/documents/docs/goals.md
Normal file
@@ -0,0 +1,164 @@
|
||||
# Goals
|
||||
|
||||
## Mission
|
||||
Replace `edgar.files` with a parser that is better in **every way** - utility, accuracy, and user experience. The maintainer is the final judge: output must look correct when printed.
|
||||
|
||||
## Core Principles
|
||||
|
||||
### Primary Goal: AI Context Optimization
|
||||
- **Token efficiency**: 30-50% reduction vs raw HTML while preserving semantic meaning
|
||||
- **Chunking support**: Enable independent processing of sections/tables for LLM context windows
|
||||
- **Clean text output**: Tables rendered in LLM-friendly formats (clean text, markdown)
|
||||
- **Semantic preservation**: Extract meaning, not just formatting
|
||||
|
||||
### Secondary Goal: Human Readability
|
||||
- **Rich console output**: Beautiful rendering with proper table alignment
|
||||
- **Markdown export**: Professional-looking document conversion
|
||||
- **Section navigation**: Easy access to specific Items/sections
|
||||
|
||||
## User-Focused Feature Goals
|
||||
|
||||
### 1. Text Extraction
|
||||
- Extract full document text without dropping meaningful content
|
||||
- Preserve paragraph structure and semantic whitespace
|
||||
- Handle inline XBRL facts gracefully (show values, not raw tags)
|
||||
- Clean HTML artifacts automatically (scripts, styles, page numbers)
|
||||
- **Target**: 99%+ accuracy vs manual reading
|
||||
|
||||
### 2. Section Extraction (10-K, 10-Q, 8-K)
|
||||
- Detect >90% of standard sections for >90% of test tickers
|
||||
- Support flexible access: `doc.sections['Item 1A']`, `doc['1A']`, `doc.risk_factors`
|
||||
- Return Section objects with `.text()`, `.tables`, `.search()` methods
|
||||
- Include confidence scores and detection method metadata
|
||||
- **Target**: Better recall than old parser (quantify with test suite)
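Flexible access such as `doc.sections['Item 1A']` and `doc['1A']` comes down to normalizing the lookup key. The sketch below shows one way to do that; the `Sections` class and `normalize_section_key` function are hypothetical and do not describe the parser's actual classes. Attribute-style access such as `doc.risk_factors` is not covered here.

```python
import re


def normalize_section_key(key: str) -> str:
    """Map 'Item 1A', 'item 1a.', or '1A' to the canonical form 'Item 1A'."""
    match = re.fullmatch(r"\s*(?:item\s+)?(\d{1,2}[A-Za-z]?)\.?\s*", key, re.IGNORECASE)
    if not match:
        raise KeyError(key)
    return f"Item {match.group(1).upper()}"


class Sections(dict):
    """Hypothetical container whose lookups accept any spelling handled above."""

    def __getitem__(self, key):
        return super().__getitem__(normalize_section_key(key))


sections = Sections({"Item 1A": "Risk Factors text..."})
print(sections["1A"] == sections["item 1a."])
```

Storing sections under one canonical key keeps the alternative spellings a lookup concern only.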
|
||||
|
||||
### 3. Table Extraction
|
||||
- Extract all meaningful data tables (ignore pure layout tables)
|
||||
- Accurate rendering with aligned columns and proper formatting
|
||||
- Handle complex tables (rowspan, colspan, nested headers)
|
||||
- Preserve table captions and surrounding context
|
||||
- Support DataFrame conversion for data analysis
|
||||
- **Target**: 95%+ accuracy on test corpus
|
||||
|
||||
### 4. Search Capabilities
|
||||
- Text search within documents
|
||||
- Regex pattern matching
|
||||
- Semantic search preparation (structure for embedding-based search)
|
||||
- Search within sections for focused queries
|
||||
|
||||
### 5. Multiple Output Formats
|
||||
- Plain text (optimized for LLM context)
|
||||
- Markdown (for documentation/sharing)
|
||||
- Rich console (beautiful terminal display)
|
||||
- JSON (structured data export)
|
||||
|
||||
### 6. Developer Experience
|
||||
- Intuitive API: `doc.text()`, `doc.tables`, `doc.sections`
|
||||
- Rich objects with useful methods (not just strings)
|
||||
- Simple tasks simple, complex tasks possible
|
||||
- Helpful error messages with recovery suggestions
|
||||
- **Target**: New users productive in <10 minutes
|
||||
|
||||
|
||||
|
||||
## Performance Targets
|
||||
|
||||
### Speed Benchmarks (Based on Current Performance)
|
||||
- **Small docs (<5MB)**: <500ms ✅ *Currently 96ms - excellent*
|
||||
- **Medium docs (5-20MB)**: <2s ✅ *Currently 1.19s - excellent*
|
||||
- **Large docs (>50MB)**: <10s ✅ *Currently 0.59s - excellent*
|
||||
- **Throughput**: >3MB/s sustained ✅ *Currently 3.8MB/s*
|
||||
- **Target**: Maintain or improve on all benchmarks
|
||||
|
||||
### Memory Efficiency
|
||||
- **Small docs (<5MB)**: <3x document size *(currently 9x - needs optimization)*
|
||||
- **Large docs (>10MB)**: <2x document size *(currently 1.9x - good)*
|
||||
- **No memory spikes**: Never exceed 5x document size *(MSFT currently 5.4x)*
|
||||
- **Target**: Consistent 2-3x overhead across all document sizes
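A memory-to-document-size ratio like the ones above can be estimated with the standard library. This is a sketch under assumptions: `peak_memory_ratio` is an illustrative helper, and `tracemalloc` only sees allocations made through Python's allocator, so its numbers may differ from the figures quoted in this document, which may have been measured another way.

```python
import tracemalloc


def peak_memory_ratio(parse, html: str) -> float:
    """Peak traced allocation during parse(html), divided by the document size in bytes."""
    size = len(html.encode("utf-8"))
    tracemalloc.start()
    try:
        parse(html)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak / size


# Stand-in parser for demonstration; pass the real parse function in practice.
ratio = peak_memory_ratio(lambda text: text.split(), "<p>cell</p> " * 10_000)
print(f"peak memory is {ratio:.1f}x the document size")
```

Comparing this ratio against the 3x and 2x targets gives a quick pass/fail signal per document.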
|
||||
|
||||
### Accuracy Benchmarks
|
||||
- **Section detection recall**: >90% on 20-ticker test set
|
||||
- **Table extraction accuracy**: >95% on manual validation set
|
||||
- **Text fidelity**: >99% semantic equivalence to source HTML
|
||||
- **XBRL fact extraction**: 100% of inline facts captured correctly
|
||||
|
||||
## Implementation Details
|
||||
|
||||
### HTML Parsing
|
||||
- Read the entire HTML document without dropping semantically meaningful content
|
||||
- Drop non-meaningful content (scripts, styles, pure formatting tags)
|
||||
- Preserve semantic structure (headings, paragraphs, lists)
|
||||
- Handle both old (pre-2015) and modern (inline XBRL) formats
|
||||
- Graceful degradation for malformed HTML
|
||||
|
||||
### Table Parsing
|
||||
- Extract tables containing meaningful data
|
||||
- Ignore layout tables (unless they aid document understanding)
|
||||
- Accurate rendering with proper column alignment
|
||||
- Handle complex structures: rowspan, colspan, nested headers, multi-level headers
|
||||
- Preserve table captions and contextual information
|
||||
- Support conversion to pandas DataFrame
|
||||
|
||||
### Section Extraction
|
||||
- Detect standard sections (Item 1, 1A, 7, etc.) for 10-K, 10-Q, 8-K filings
|
||||
- Support multiple detection strategies: TOC-based, heading-based, pattern-based
|
||||
- Return Section objects with full API: `.text()`, `.text_without_tables()`, `.tables`, `.search()`
|
||||
- Include metadata: confidence scores, detection method, position
|
||||
- Better recall than old parser (establish baseline with test suite)
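Of the three detection strategies, the pattern-based one is the simplest to illustrate. The sketch below finds `Item N.` headings in plain text with a regular expression; the pattern and the `find_sections` name are assumptions, and a real detector would also need to skip table-of-contents entries, which this version would match too.

```python
import re

# A heading such as "Item 1A. Risk Factors" at the start of a line.
ITEM_HEADING = re.compile(r"^[ \t]*item\s+(\d{1,2}[a-z]?)\s*[.:\-]",
                          re.IGNORECASE | re.MULTILINE)


def find_sections(text: str) -> dict[str, tuple[int, int]]:
    """Return {item id: (start, end)} offsets, each running to the next heading."""
    matches = list(ITEM_HEADING.finditer(text))
    sections = {}
    for current, following in zip(matches, matches[1:] + [None]):
        end = following.start() if following else len(text)
        # A repeated item id overwrites the earlier span in this simple version.
        sections[f"Item {current.group(1).upper()}"] = (current.start(), end)
    return sections


sample = "Item 1. Business\nWe sell things.\nItem 1A. Risk Factors\nRisks.\n"
for name, (start, end) in find_sections(sample).items():
    print(name, repr(sample[start:end]))
```

Returning offsets rather than text keeps the result cheap and lets the caller slice text or locate tables within the same span.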
|
||||
|
||||
## Quality Gates Before Replacing edgar.files
|
||||
|
||||
### Automated Tests
|
||||
- [ ] All existing tests pass with new parser (1000+ tests)
|
||||
- [ ] Performance regression tests (<5% slower on any document)
|
||||
- [ ] Memory regression tests (no >10% increases)
|
||||
- [ ] Section detection accuracy >90% on test corpus
|
||||
- [ ] Table extraction accuracy >95% on validation set
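The percentage-based gates above reduce to a simple budget check. A minimal sketch, assuming baselines are stored as plain numbers; `within_budget` is an illustrative helper, not part of the existing test suite.

```python
def within_budget(baseline: float, current: float, max_increase: float) -> bool:
    """True when current does not exceed baseline by more than max_increase (0.05 = 5%)."""
    return current <= baseline * (1.0 + max_increase)


# Performance gate: at most 5% slower. Memory gate: at most 10% more.
print(within_budget(baseline=334.0, current=345.0, max_increase=0.05))
print(within_budget(baseline=4.1, current=4.8, max_increase=0.10))
```

Applying the check per document rather than to an average matches the "on any document" wording of the performance gate.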
|
||||
|
||||
### Manual Validation (Maintainer Review)
|
||||
- [ ] Print full document text for 10 sample filings → verify quality
|
||||
- [ ] Compare table rendering old vs new → verify improvement
|
||||
- [ ] Test section extraction on edge cases → verify robustness
|
||||
- [ ] Review markdown output → verify professional appearance
|
||||
- [ ] Check memory usage → verify no concerning spikes
|
||||
|
||||
### Documentation Requirements
|
||||
- [ ] Migration guide (old API → new API with examples)
|
||||
- [ ] Updated user guide showing new features
|
||||
- [ ] Performance comparison report (old vs new)
|
||||
- [ ] Known limitations documented clearly
|
||||
- [ ] API reference complete for all public methods
|
||||
|
||||
## Success Metrics
|
||||
|
||||
### Launch Criteria
|
||||
1. **Speed**: Equal or faster on 95% of test corpus
|
||||
2. **Accuracy**: Maintainer approves output quality on sample set
|
||||
3. **API**: Clean, intuitive interface (no confusion)
|
||||
4. **Tests**: Zero regressions, 95%+ coverage on new code
|
||||
5. **Docs**: Complete with examples for all major use cases
|
||||
|
||||
### Post-Launch Monitoring
|
||||
- Issue reports: <5% related to parser quality/accuracy
|
||||
- User feedback: Positive sentiment on ease of use
|
||||
- Performance: No degradation over time (regression tests)
|
||||
- Adoption: Smooth migration from old parser (deprecation path)
|
||||
|
||||
## Feature Parity with Old Parser
|
||||
|
||||
### Must-Have (Required for Migration)
|
||||
- ✅ Get document text (with/without tables)
|
||||
- ✅ Extract specific sections by name/number
|
||||
- ✅ List all tables in document
|
||||
- ✅ Search document content
|
||||
- ✅ Convert to markdown
|
||||
- ✅ Handle both old and new SEC filing formats
|
||||
- ✅ Graceful error handling
|
||||
|
||||
### Nice-to-Have (Improvements Over Old Parser)
|
||||
- 🎯 Semantic search capabilities
|
||||
- 🎯 Better subsection extraction within Items
|
||||
- 🎯 Table-of-contents navigation
|
||||
- 🎯 Export to multiple formats (JSON, clean HTML)
|
||||
- 🎯 Batch processing optimizations
|
||||
- 🎯 Section confidence scores and metadata
|
||||
@@ -0,0 +1,240 @@
|
||||
# HTML Parser Rewrite Technical Overview
|
||||
|
||||
## Executive Summary
|
||||
|
||||
The `edgar/documents` module represents a comprehensive rewrite of the HTML parsing capabilities originally implemented in `edgar/files`. This new parser is designed to provide superior parsing accuracy, structured data extraction, and rendering quality for SEC filing documents. The rewrite introduces a modern, extensible architecture with specialized components for handling the complex structure of financial documents.
|
||||
|
||||
## Architecture Overview
|
||||
|
||||
### Core Components
|
||||
|
||||
#### 1. Document Object Model
|
||||
The new parser introduces a sophisticated node-based document model:
|
||||
|
||||
- **Document**: Top-level container with metadata and sections
|
||||
- **Node Hierarchy**: Abstract base classes for all document elements
|
||||
- `DocumentNode`: Root document container
|
||||
- `TextNode`: Plain text content
|
||||
- `ParagraphNode`: Paragraph elements with styling
|
||||
- `HeadingNode`: Headers with levels 1-6
|
||||
- `ContainerNode`: Generic containers (div, section)
|
||||
- `SectionNode`: Document sections with semantic meaning
|
||||
- `ListNode`/`ListItemNode`: Ordered and unordered lists
|
||||
- `LinkNode`: Hyperlinks with metadata
|
||||
- `ImageNode`: Images with attributes
|
||||
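To make the hierarchy concrete, here is a deliberately simplified sketch of a node tree with recursive text extraction. The class names echo the list above, but the fields and methods are assumptions for illustration and do not reflect the real class definitions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Base node: text is the concatenation of the children's text."""
    children: List["Node"] = field(default_factory=list)

    def text(self) -> str:
        return "".join(child.text() for child in self.children)


@dataclass
class TextNode(Node):
    content: str = ""

    def text(self) -> str:
        return self.content


@dataclass
class HeadingNode(Node):
    level: int = 1  # 1-6, as in HTML

    def text(self) -> str:
        return super().text() + "\n"


@dataclass
class ParagraphNode(Node):
    def text(self) -> str:
        return super().text() + "\n"
```

Usage: `Node(children=[HeadingNode(level=1, children=[TextNode(content="Business")])]).text()` walks the tree and returns the heading text followed by a newline.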
|
||||
#### 2. Table Processing System
|
||||
Advanced table handling represents a major improvement over the old parser:
|
||||
|
||||
- **TableNode**: Sophisticated table representation with multi-level headers
|
||||
- **Cell**: Individual cell with colspan/rowspan support and type detection
|
||||
- **Row**: Table row with header detection and semantic classification
|
||||
- **TableMatrix**: Handles complex cell spanning and alignment
|
||||
- **CurrencyColumnMerger**: Intelligently merges currency symbols with values
|
||||
- **ColumnAnalyzer**: Detects spacing columns and optimizes layout
|
||||
|
||||
#### 3. Parser Pipeline
|
||||
The parsing process follows a well-defined pipeline:
|
||||
|
||||
1. **HTMLParser**: Main orchestration class
|
||||
2. **HTMLPreprocessor**: Cleans and normalizes HTML
|
||||
3. **DocumentBuilder**: Converts HTML tree to document nodes
|
||||
4. **Strategy Pattern**: Pluggable parsing strategies
|
||||
5. **DocumentPostprocessor**: Final cleanup and optimization
|
||||
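The flow through the pipeline can be pictured as three chained stages. The sketch below uses plain functions and regular expressions purely for illustration; the function names and the paragraph-only "builder" are assumptions, and the real parser works on an lxml tree rather than on regex matches.

```python
import re


def preprocess(html: str) -> str:
    """Clean and normalize: drop comments, collapse runs of whitespace."""
    html = re.sub(r"<!--.*?-->", "", html, flags=re.DOTALL)
    return re.sub(r"\s+", " ", html).strip()


def build(html: str) -> list:
    """Toy builder: turn each <p> element into one text block."""
    return re.findall(r"<p[^>]*>(.*?)</p>", html, flags=re.DOTALL | re.IGNORECASE)


def postprocess(blocks: list) -> list:
    """Final cleanup: trim blocks and drop the empty ones."""
    return [block.strip() for block in blocks if block.strip()]


def parse(html: str) -> list:
    """Orchestrate the stages in order."""
    return postprocess(build(preprocess(html)))
```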
|
||||
### Key Improvements Over Old Parser
|
||||
|
||||
#### Table Processing Enhancements
|
||||
|
||||
**Old Parser (`edgar/files`)**:
|
||||
- Basic table extraction using BeautifulSoup
|
||||
- Limited colspan/rowspan handling
|
||||
- Simple text-based rendering
|
||||
- Manual column alignment
|
||||
- Currency symbols often misaligned
|
||||
|
||||
**New Parser (`edgar/documents`)**:
|
||||
- TableMatrix system that expands colspan/rowspan so cells align accurately
|
||||
- Intelligent header detection (multi-row headers, year detection)
|
||||
- Automatic currency column merging ($1,234 instead of $ | 1,234)
|
||||
- Semantic table type detection (FINANCIAL, METRICS, TOC, etc.)
|
||||
- Rich table rendering with proper formatting
|
||||
- Smart column width calculation
|
||||
- Enhanced numeric formatting with comma separators
|
||||
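The currency merge (`$1,234` instead of `$ | 1,234`) can be sketched roughly as follows. This is a minimal illustration that assumes a rectangular list-of-lists table; `merge_currency_columns` is a hypothetical helper, not the actual `CurrencyColumnMerger` implementation.

```python
def merge_currency_columns(rows, symbols=frozenset({"$", "€", "£"})):
    """Fold columns that contain only currency symbols into the column to their right."""
    if not rows:
        return rows
    width = len(rows[0])

    def is_currency_column(col):
        non_empty = [row[col].strip() for row in rows if row[col].strip()]
        return bool(non_empty) and all(cell in symbols for cell in non_empty)

    # The last column has nothing to its right, so it is never merged.
    currency_cols = {col for col in range(width - 1) if is_currency_column(col)}

    merged = []
    for row in rows:
        new_row = []
        for col in range(width):
            if col in currency_cols:
                continue  # symbol column is dropped; its symbol is prefixed below
            value = row[col].strip()
            if col - 1 in currency_cols and value:
                value = row[col - 1].strip() + value
            new_row.append(value)
        merged.append(new_row)
    return merged
```

Detecting the symbol column across all rows, rather than row by row, keeps the output rectangular even when only the first row of a financial table carries the `$`.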
|
||||
#### Document Structure
|
||||
|
||||
**Old Parser**:
|
||||
- Flat block-based structure
|
||||
- Limited semantic understanding
|
||||
- Basic text extraction
|
||||
|
||||
**New Parser**:
|
||||
- Hierarchical node-based model
|
||||
- Semantic section detection
|
||||
- Rich metadata preservation
|
||||
- XBRL fact extraction
|
||||
- Search capabilities
|
||||
- Multiple output formats (text, markdown, JSON, pandas)
|
||||
|
||||
#### Rendering Quality
|
||||
|
||||
**Old Parser**:
|
||||
- Basic text output
|
||||
- Limited table formatting
|
||||
- No styling preservation
|
||||
|
||||
**New Parser**:
|
||||
- Multiple renderers (text, markdown, Rich console)
|
||||
- Preserves document structure and styling
|
||||
- Configurable output options
|
||||
- LLM-optimized formatting
|
||||
|
||||
## Implementation Details
|
||||
|
||||
### Configuration System
|
||||
|
||||
The new parser uses a comprehensive configuration system:
|
||||
|
||||
```python
|
||||
@dataclass
|
||||
class ParserConfig:
|
||||
# Size limits
|
||||
max_document_size: int = 50 * 1024 * 1024 # 50MB
|
||||
streaming_threshold: int = 10 * 1024 * 1024 # 10MB
|
||||
|
||||
# Processing options
|
||||
preserve_whitespace: bool = False
|
||||
detect_sections: bool = True
|
||||
extract_xbrl: bool = True
|
||||
table_extraction: bool = True
|
||||
detect_table_types: bool = True
|
||||
```
|
||||
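As a rough illustration of how the two size limits might interact, the sketch below re-declares only the size fields and adds a hypothetical `choose_mode` helper. The helper and its return values are assumptions; the excerpt above does not show how the parser actually consumes these settings.

```python
from dataclasses import dataclass


@dataclass
class ParserConfig:
    # Same defaults as the excerpt above; other fields omitted for brevity.
    max_document_size: int = 50 * 1024 * 1024    # 50MB
    streaming_threshold: int = 10 * 1024 * 1024  # 10MB


def choose_mode(html: str, config: ParserConfig) -> str:
    """Hypothetical helper: reject oversized input, stream anything above the threshold."""
    size = len(html.encode("utf-8"))
    if size > config.max_document_size:
        raise ValueError(f"document is {size} bytes, limit is {config.max_document_size}")
    return "streaming" if size > config.streaming_threshold else "in-memory"
```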
|
||||
### Strategy Pattern Implementation
|
||||
|
||||
The parser uses pluggable strategies for different aspects:
|
||||
|
||||
- **HeaderDetectionStrategy**: Identifies document sections
|
||||
- **TableProcessor**: Handles table extraction and classification
|
||||
- **XBRLExtractor**: Extracts XBRL facts and metadata
|
||||
- **StyleParser**: Processes CSS styling information
|
||||
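One way to picture "pluggable" strategies is a small protocol that several interchangeable detectors satisfy. The detector classes and `find_headers` below are invented for this sketch and are not the real `HeaderDetectionStrategy` interface.

```python
from typing import List, Protocol


class HeaderDetector(Protocol):
    def is_header(self, line: str) -> bool: ...


class AllCapsDetector:
    """Treats short, fully upper-case lines as headers."""

    def is_header(self, line: str) -> bool:
        stripped = line.strip()
        return bool(stripped) and stripped.isupper() and len(stripped) <= 80


class ItemPrefixDetector:
    """Treats lines starting with 'Item ' as headers."""

    def is_header(self, line: str) -> bool:
        return line.strip().lower().startswith("item ")


def find_headers(lines: List[str], detectors: List[HeaderDetector]) -> List[str]:
    """A line is a header if any configured strategy accepts it."""
    return [line for line in lines if any(d.is_header(line) for d in detectors)]
```

Swapping the `detectors` list changes behavior without touching `find_headers`, which is the point of the pattern.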
|
||||
### Table Processing Deep Dive
|
||||
|
||||
The table processing system represents the most significant improvement:
|
||||
|
||||
#### Header Detection Algorithm
|
||||
- Analyzes cell element types (`th` vs `td`) alongside cell content patterns
|
||||
- Detects year patterns in financial tables
|
||||
- Identifies period indicators (quarters, fiscal years)
|
||||
- Handles multi-row headers with units and descriptions
|
||||
- Prevents misclassification of data rows as headers
|
||||
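The year and period heuristics above might look something like this in miniature. The regular expressions and the `looks_like_header_row` function are assumptions made for illustration; the real algorithm also uses element types and multi-row context, which this sketch ignores.

```python
import re

YEAR_ONLY = re.compile(r"(?:19|20)\d{2}")
PERIOD = re.compile(
    r"\b(?:three|six|nine|twelve)\s+months\s+ended\b|\bfiscal\s+year\b|\bquarter\b",
    re.IGNORECASE,
)
NUMERIC = re.compile(r"\(?\$?\s*[\d,]+(?:\.\d+)?\)?%?")


def looks_like_header_row(cells):
    """Header rows contain years or period phrases and no plain numeric values."""
    values = [cell.strip() for cell in cells if cell.strip()]
    header_hits = 0
    for value in values:
        if YEAR_ONLY.fullmatch(value) or PERIOD.search(value):
            header_hits += 1
        elif NUMERIC.fullmatch(value):
            return False  # a real number means this is a data row
    return header_hits > 0
```

The early `return False` is what guards against misclassifying data rows: one genuine numeric value outweighs any year-like cell. A data value that happens to equal a bare year (for example `2021`) remains ambiguous in this sketch.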
|
||||
#### Cell Type Detection
|
||||
- Numeric vs text classification
|
||||
- Currency value recognition
|
||||
- Percentage handling
|
||||
- Em dash and null value detection
|
||||
- Proper number formatting with thousand separators
|
||||
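A compact sketch of the classification order implied by the list above. `CellType`, `classify_cell`, and the exact set of null markers are hypothetical; they only illustrate how the checks can be layered.

```python
import re
from enum import Enum


class CellType(Enum):
    EMPTY = "empty"
    NULL = "null"
    CURRENCY = "currency"
    PERCENT = "percent"
    NUMBER = "number"
    TEXT = "text"


# Em dash, en dash, and hyphen are common "no value" markers in financial tables.
NULL_MARKERS = {"\u2014", "\u2013", "-", "N/A"}
NUMBER_RE = re.compile(r"\(?-?\d[\d,]*(?:\.\d+)?\)?")  # allows (1,234) for negatives


def classify_cell(text: str) -> CellType:
    value = text.strip()
    if not value:
        return CellType.EMPTY
    if value in NULL_MARKERS:
        return CellType.NULL
    if value.endswith("%") and NUMBER_RE.fullmatch(value[:-1].strip()):
        return CellType.PERCENT
    if value[0] in "$€£" and NUMBER_RE.fullmatch(value[1:].strip()):
        return CellType.CURRENCY
    if NUMBER_RE.fullmatch(value):
        return CellType.NUMBER
    return CellType.TEXT
```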
|
||||
#### Matrix Building
|
||||
- Handles colspan and rowspan expansion
|
||||
- Maintains cell relationships
|
||||
- Optimizes column layout
|
||||
- Removes spacing columns automatically
|
||||
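Span expansion and spacing-column removal can be sketched as below. This is an illustrative reconstruction, not the `TableMatrix` code: cells are assumed to arrive as `(text, colspan, rowspan)` tuples, and a spanned cell's text is simply repeated in every slot it covers.

```python
def build_matrix(rows):
    """Expand (text, colspan, rowspan) cells into a rectangular grid of strings."""
    grid = {}  # (row, col) -> text
    for r, row in enumerate(rows):
        c = 0
        for text, colspan, rowspan in row:
            while (r, c) in grid:  # skip slots claimed by a rowspan from above
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


def drop_empty_columns(matrix):
    """Remove columns that are blank in every row (typical spacing columns)."""
    if not matrix:
        return matrix
    keep = [c for c in range(len(matrix[0])) if any(row[c].strip() for row in matrix)]
    return [[row[c] for c in keep] for row in matrix]
```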
|
||||
### XBRL Integration
|
||||
|
||||
The new parser includes sophisticated XBRL processing:
|
||||
- Extracts facts before preprocessing to preserve `ix:hidden` content
|
||||
- Maintains metadata relationships
|
||||
- Supports inline XBRL transformations
|
||||
- Preserves semantic context
|
||||
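The ordering constraint (extract first, preprocess second) is easiest to see with a toy example. The regex-based helpers below are assumptions for illustration only; regular expressions are fragile on real filings, and a production extractor would walk a parsed tree instead.

```python
import re

IX_FACT = re.compile(
    r"<ix:nonFraction\b([^>]*)>(.*?)</ix:nonFraction>", re.DOTALL | re.IGNORECASE
)
NAME_ATTR = re.compile(r'\bname="([^"]+)"')
IX_HIDDEN = re.compile(r"<ix:hidden\b.*?</ix:hidden>", re.DOTALL | re.IGNORECASE)


def extract_facts(html: str):
    """Return (concept name, raw value) pairs for inline XBRL numeric facts."""
    facts = []
    for attrs, value in IX_FACT.findall(html):
        name = NAME_ATTR.search(attrs)
        if name:
            facts.append((name.group(1), value.strip()))
    return facts


def preprocess(html: str) -> str:
    """Toy preprocessing step that discards the hidden block entirely."""
    return IX_HIDDEN.sub("", html)
```

Calling `extract_facts(html)` before `preprocess(html)` sees the facts inside the hidden block; reversing the order loses them.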
|
||||
## Performance Characteristics
|
||||
|
||||
### Memory Efficiency
|
||||
- Streaming support for large documents (>10MB)
|
||||
- Lazy loading of document sections
|
||||
- Caching for repeated operations
|
||||
- Memory-efficient node representation
|
||||
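Lazy loading combined with caching can be illustrated with `functools.cached_property`. `LazySection` and its `loader` callback are hypothetical names used only for this sketch; the source does not describe how the real sections defer their work.

```python
from functools import cached_property


class LazySection:
    """Section whose text is computed on first access and cached afterwards."""

    def __init__(self, name, loader):
        self.name = name
        self._loader = loader  # callable that does the expensive extraction

    @cached_property
    def text(self) -> str:
        return self._loader()
```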
|
||||
### Processing Speed
|
||||
- Optimized HTML parsing with lxml
|
||||
- Configurable processing strategies
|
||||
- Parallel extraction capabilities
|
||||
- Smart caching of expensive operations
|
||||
|
||||
## Migration and Compatibility
|
||||
|
||||
### API Compatibility
|
||||
The new parser maintains high-level compatibility with the old parser while offering enhanced functionality:
|
||||
|
||||
```python
|
||||
# Old way
|
||||
from edgar.files import FilingDocument
|
||||
doc = FilingDocument(html)
|
||||
text = doc.text()
|
||||
|
||||
# New way
|
||||
from edgar.documents import HTMLParser
|
||||
parser = HTMLParser()
|
||||
doc = parser.parse(html)
|
||||
text = doc.text()
|
||||
```
|
||||
|
||||
### Feature Parity
|
||||
All major features from the old parser are preserved:
|
||||
- Text extraction
|
||||
- Table conversion to DataFrame
|
||||
- Section detection
|
||||
- Metadata extraction
|
||||
|
||||
### Enhanced Features
|
||||
New capabilities not available in the old parser:
|
||||
- Rich console rendering
|
||||
- Markdown export
|
||||
- Advanced table semantics
|
||||
- XBRL fact extraction
|
||||
- Document search
|
||||
- LLM optimization
|
||||
- Multiple output formats
|
||||
|
||||
## Current Status and Next Steps
|
||||
|
||||
### Completed Components
|
||||
- ✅ Core document model
|
||||
- ✅ HTML parsing pipeline
|
||||
- ✅ Advanced table processing
|
||||
- ✅ Multiple renderers (text, markdown, Rich)
|
||||
- ✅ XBRL extraction
|
||||
- ✅ Configuration system
|
||||
- ✅ Streaming support
|
||||
|
||||
### Remaining Work
|
||||
- 🔄 Performance optimization and benchmarking
|
||||
- 🔄 Comprehensive test coverage migration
|
||||
- 🔄 Error handling improvements
|
||||
- 🔄 Documentation and examples
|
||||
- 🔄 Validation against large corpus of filings
|
||||
|
||||
### Testing Strategy
|
||||
The rewrite requires extensive validation:
|
||||
- Comparison testing against old parser output
|
||||
- Financial table accuracy verification
|
||||
- Performance benchmarking
|
||||
- Edge case handling
|
||||
- Integration testing with existing workflows
|
||||
|
||||
## Conclusion
|
||||
|
||||
The `edgar/documents` rewrite represents a significant advancement in SEC filing processing capabilities. The new architecture provides:
|
||||
|
||||
1. **Better Accuracy**: Advanced table processing and semantic understanding
|
||||
2. **Enhanced Functionality**: Multiple output formats and rich rendering
|
||||
3. **Improved Maintainability**: Clean, modular architecture with clear separation of concerns
|
||||
4. **Future Extensibility**: Plugin architecture for new parsing strategies
|
||||
5. **Performance**: Streaming support and optimized processing for large documents
|
||||
|
||||
The modular design ensures that improvements can be made incrementally while maintaining backward compatibility. The sophisticated table processing system alone represents a major advancement in handling complex financial documents accurately.
|
||||
@@ -0,0 +1,208 @@
|
||||
# HTML Parser Quality Improvement Strategy
|
||||
|
||||
## Overview
|
||||
|
||||
This is a simple, iterative testing strategy for the HTML parser rewrite. The goal is a rapid feedback loop: compare OLD vs NEW parser output, identify visual and functional issues, fix them, and repeat until the exit criteria below are met.
|
||||
|
||||
## Test Corpus
|
||||
|
||||
### 10 Representative Documents
|
||||
|
||||
Selected to cover different filing types, companies, and edge cases:
|
||||
|
||||
| # | Company | Filing Type | File Path | Rationale |
|
||||
|---|---------|-------------|-----------|-----------|
|
||||
| 1 | Apple | 10-K | `data/html/Apple.10-K.html` | Large complex filing, existing test file |
|
||||
| 2 | Oracle | 10-K | `data/html/Oracle.10-K.html` | Complex financials, existing test file |
|
||||
| 3 | Nvidia | 10-K | `data/html/Nvidia.10-K.html` | Tech company, existing test file |
|
||||
| 4 | Microsoft | 10-K | `data/html/Microsoft.10-K.html` | Popular company, complex tables |
|
||||
| 5 | Tesla | 10-K | `data/html/Tesla.10-K.html` | Manufacturing sector, different formatting |
|
||||
| 6 | [TBD] | 10-Q | TBD | Quarterly report format |
|
||||
| 7 | [TBD] | 10-Q | TBD | Another quarterly for variety |
|
||||
| 8 | Buckle | 8-K | `data/html/BuckleInc.8-K.html` | Event-driven filing |
|
||||
| 9 | [TBD] | Proxy (DEF 14A) | TBD | Proxy statement with compensation tables |
|
||||
| 10 | [TBD] | Edge case | TBD | Unusual formatting or very large file |
|
||||
|
||||
**Note**: Fill in TBD entries as we identify good test candidates.
|
||||
|
||||
## The 4-Step Loop
|
||||
|
||||
### Step 1: Run Comparison
|
||||
|
||||
Use existing test scripts to compare OLD vs NEW parsers:
|
||||
|
||||
```bash
|
||||
# Full comparison with metrics
|
||||
python tests/manual/check_parser_comparison.py
|
||||
|
||||
# Table-focused comparison with rendering
|
||||
python tests/manual/check_tables.py
|
||||
|
||||
# Or run on specific file
|
||||
python tests/manual/check_html_rewrite.py
|
||||
```
|
||||
|
||||
**Outputs to review**:
|
||||
- Console output with side-by-side Rich panels
|
||||
- Metrics (parse time, table count, section detection)
|
||||
- Rendered tables (old vs new)
|
||||
|
||||
### Step 2: Human Review
|
||||
|
||||
**Visual Inspection Process**:
|
||||
1. Look at console output directly (Rich rendering)
|
||||
2. For detailed text comparison, optionally dump to files:
|
||||
- OLD parser: `doc.text()` → `output/old_apple.txt`
|
||||
- NEW parser: `doc.text()` → `output/new_apple.txt`
|
||||
- Use `diff` or visual diff tool
|
||||
3. Take screenshots for complex table issues
|
||||
4. Focus on:
|
||||
- Table alignment and formatting
|
||||
- Currency symbol placement (should be merged: `$1,234` not `$ | 1,234`)
|
||||
- Column count (fewer is better after removing spacing columns)
|
||||
- Section detection accuracy
|
||||
- Text readability for LLM context
|
||||
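For the optional text comparison in item 2, a unified diff can also be produced directly in Python with the standard library's `difflib`. `diff_parser_output` is a hypothetical convenience function; the two text arguments stand in for the old and new parser output.

```python
import difflib


def diff_parser_output(old_text, new_text, old_label="old_parser", new_label="new_parser"):
    """Return a unified diff of two text dumps as a single string."""
    diff_lines = difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile=old_label,
        tofile=new_label,
        lineterm="",
    )
    return "\n".join(diff_lines)
```

An empty result means the two dumps are identical line for line.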
|
||||
**Quality Criteria** (from goals.md):
|
||||
- Semantic meaning preserved
|
||||
- Tables render correctly when printed
|
||||
- Better than old parser in speed, accuracy, features
|
||||
- **You are the final judge**: "Does this look right?"
|
||||
|
||||
### Step 3: Document Bugs
|
||||
|
||||
Record issues in the tracker below as you find them:
|
||||
|
||||
| Bug # | Status | Priority | Description | File/Location | Notes |
|
||||
|-------|--------|----------|-------------|---------------|-------|
|
||||
| Example | Fixed | High | Currency symbols not merging in balance sheet | Apple 10-K, Table 5 | Issue in CurrencyColumnMerger |
|
||||
| | | | | | |
|
||||
| | | | | | |
|
||||
| | | | | | |
|
||||
|
||||
**Status values**: Open, In Progress, Fixed, Won't Fix, Deferred
|
||||
**Priority values**: Critical, High, Medium, Low
|
||||
|
||||
**Bug Description Template**:
|
||||
- What's wrong: Clear description of the issue
|
||||
- Where: Which file/table/section
|
||||
- Expected: What it should look like
|
||||
- Actual: What it currently looks like
|
||||
- Impact: How it affects usability/readability
|
||||
|
||||
### Step 4: Fix & Repeat
|
||||
|
||||
1. Pick highest priority bug
|
||||
2. Fix the code
|
||||
3. Re-run comparison on affected file(s)
|
||||
4. Verify fix doesn't break other files
|
||||
5. Mark bug as Fixed
|
||||
6. Repeat until exit criteria met
|
||||
|
||||
**Quick verification**:
|
||||
```bash
|
||||
# Re-run just the problematic file
|
||||
python -c "
|
||||
from edgar.documents import parse_html
|
||||
from pathlib import Path
|
||||
html = Path('data/html/Apple.10-K.html').read_text()
|
||||
doc = parse_html(html)
|
||||
# Quick inspection
|
||||
print(f'Tables: {len(doc.tables)}')
|
||||
print(doc.tables[5].render(width=200)) # Check specific table
|
||||
"
|
||||
```
|
||||
|
||||
## Exit Criteria
|
||||
|
||||
We're done when:
|
||||
1. ✅ All 10 test documents parse successfully
|
||||
2. ✅ Visual output looks correct (maintainer approval)
|
||||
3. ✅ Tables render cleanly with proper alignment
|
||||
4. ✅ No critical or high priority bugs remain
|
||||
5. ✅ Performance is equal to or better than the old parser
|
||||
6. ✅ Text extraction is complete and clean for AI context
|
||||
|
||||
**Final approval**: Maintainer says "This is good enough to ship."
|
||||
|
||||
## Testing Infrastructure
|
||||
|
||||
### Primary Tool: compare_parsers.py
|
||||
|
||||
Simple command-line tool for the quality improvement loop:
|
||||
|
||||
```bash
|
||||
# Quick overview comparison (using shortcuts!)
|
||||
python tests/manual/compare_parsers.py aapl
|
||||
|
||||
# See all tables in a document
|
||||
python tests/manual/compare_parsers.py aapl --tables
|
||||
|
||||
# Compare specific table (OLD vs NEW side-by-side)
|
||||
python tests/manual/compare_parsers.py aapl --table 5
|
||||
|
||||
# Compare text extraction
|
||||
python tests/manual/compare_parsers.py msft --text
|
||||
|
||||
# See section detection
|
||||
python tests/manual/compare_parsers.py orcl --sections
|
||||
|
||||
# Test with 10-Q filings
|
||||
python tests/manual/compare_parsers.py 'aapl 10-q'
|
||||
|
||||
# Run all test files at once
|
||||
python tests/manual/compare_parsers.py --all
|
||||
```
|
||||
|
||||
**Shortcuts available**:
|
||||
- Companies: `aapl`, `msft`, `tsla`, `nvda`, `orcl`
|
||||
- Filing types: `10-k` (default), `10-q`, `8-k`
|
||||
- Or use full file paths
|
||||
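A sketch of how such shortcuts could map to files, assuming the `Company.FORM.html` naming seen in the test corpus table. The mapping and `resolve_shortcut` are guesses for illustration; the actual logic in `compare_parsers.py` is not shown in this document.

```python
# Assumed mapping from ticker shortcut to the company name used in file names.
COMPANY_NAMES = {
    "aapl": "Apple",
    "msft": "Microsoft",
    "tsla": "Tesla",
    "nvda": "Nvidia",
    "orcl": "Oracle",
}


def resolve_shortcut(arg: str, root: str = "data/html") -> str:
    """Turn 'aapl' or 'aapl 10-q' into a file path; pass anything else through."""
    parts = arg.lower().split()
    if parts and parts[0] in COMPANY_NAMES:
        form = parts[1].upper() if len(parts) > 1 else "10-K"  # 10-K is the default
        return f"{root}/{COMPANY_NAMES[parts[0]]}.{form}.html"
    return arg  # assume it is already a full file path
```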
|
||||
**Features**:
|
||||
- Clean command-line interface
|
||||
- Side-by-side OLD vs NEW comparison
|
||||
- Rich console output with colors and tables
|
||||
- Performance metrics
|
||||
- Individual table inspection
|
||||
|
||||
### Other Available Scripts
|
||||
|
||||
Additional tools for specific testing:
|
||||
|
||||
- `tests/manual/check_parser_comparison.py` - Full comparison with metrics
|
||||
- `tests/manual/check_tables.py` - Table-specific comparison with rendering
|
||||
- `tests/manual/check_html_rewrite.py` - General HTML parsing checks
|
||||
- `tests/manual/check_html_parser_real_files.py` - Real filing tests
|
||||
|
||||
## Quick Reference
|
||||
|
||||
For day-to-day testing commands and usage examples, see [TESTING.md](TESTING.md).
|
||||
|
||||
## Notes
|
||||
|
||||
- **Keep it simple**: This is about rapid iteration, not comprehensive automation
|
||||
- **Visual inspection is key**: Automated metrics don't catch layout/formatting issues
|
||||
- **Use screenshots**: When describing bugs, screenshots speak louder than words
|
||||
- **Iterative approach**: Don't try to fix everything at once, prioritize
|
||||
- **Trust your judgment**: If it looks wrong, it probably is wrong
|
||||
|
||||
## Bug Tracker
|
||||
|
||||
### Active Issues
|
||||
|
||||
(Add bugs here as they're discovered)
|
||||
|
||||
### Fixed Issues
|
||||
|
||||
(Move completed bugs here for history)
|
||||
|
||||
### Deferred Issues
|
||||
|
||||
(Issues that aren't blocking release but could be improved later)
|
||||
|
||||
---
|
||||
|
||||
**Status**: Initial draft
|
||||
**Last Updated**: 2025-10-07
|
||||
**Maintainer**: Dwight Gunning
|
||||