Initial commit

Commit `8e654ed209` by kdusek, 2025-12-09 12:13:01 +01:00
13332 changed files with 2695056 additions and 0 deletions

# HTML Parser Rewrite - Status Report
**Generated**: 2025-10-08
**Branch**: `html_rewrite`
**Target**: Merge to `main`
---
## Overall Progress: ~95% Complete ✅
### Completed Phases
#### ✅ Phase 1: Core Implementation (100%)
- [x] Streaming parser for large documents
- [x] TableMatrix system for accurate table rendering
- [x] Section extraction with Part I/II detection
- [x] XBRL integration
- [x] Rich-based table rendering
- [x] Configuration system (ParserConfig)
- [x] Error handling and validation
#### ✅ Phase 2: Functional Testing (100%)
- [x] **Corpus Validation** - 40 diverse filings, 100% success rate
- [x] **Edge Cases** - 31 tests covering invalid inputs, malformed HTML, edge conditions
- [x] **Integration Tests** - 25 tests for Filing/Company integration, backward compatibility
- [x] **Regression Tests** - 15 tests preventing known bugs from returning
**Total Test Count**: 79 functional tests, all passing
#### ✅ Phase 3: Performance Profiling (100%)
- [x] **Benchmarking Infrastructure** - Comprehensive benchmark suite
- [x] **Hot Path Analysis** - Identified 3 critical bottlenecks (63% section extraction, 40% Rich rendering, 15% regex)
- [x] **Memory Profiling** - Found 255MB memory leak in MSFT 10-K, documented root causes
- [x] **Performance Regression Tests** - 15 tests locking in baseline thresholds
**Performance Baseline Established**:
- Average: 3.8MB/s throughput, 4.1MB memory per doc
- Small docs: 2.6MB/s (optimization opportunity)
- Large docs: 20.7MB/s (excellent streaming)
- Memory leak: 19-25x ratio on medium docs (needs fixing)
#### ✅ Phase 4: Test Data Augmentation (100%)
- [x] **HTML Fixtures** - Downloaded 32 files (155MB) from 16 companies across 6 industries
- [x] **Download Automation** - Created `download_html_fixtures.py` script
- [x] **Documentation** - Comprehensive fixture documentation
---
## Current Status: Ready for Optimization Phase
### What's Working Well ✅
1. **Parsing Accuracy**: 100% success rate across 40+ diverse filings
2. **Large Document Handling**: Excellent streaming performance (20.7MB/s on JPM 10-K)
3. **Table Extraction**: TableMatrix accurately handles colspan/rowspan
4. **Test Coverage**: 79 comprehensive tests covering edge cases, integration, regression
5. **Backward Compatibility**: Old TenK API still works for existing code
### Known Issues to Address 🔧
#### Critical (Must Fix Before Merge)
1. **Memory Leaks** (Priority: CRITICAL)
- MSFT 10-K: 255MB leak (19x document size)
- Apple 10-K: 41MB leak (23x document size)
- **Root Causes**:
- Rich Console objects retained (0.4MB per doc)
- Global caches not cleared on document deletion
- Circular references in node graph
- **Location**: `tests/perf/memory_analysis.md:90-130`
- **Impact**: Server crashes after 10-20 requests in production
2. **Performance Bottlenecks** (Priority: HIGH)
- Section extraction: 3.7s (63% of parse time)
- Rich rendering for text: 2.4s (40% of parse time)
- Regex normalization: 0.8s (15% of parse time)
- **Location**: `tests/perf/hotpath_analysis.md:9-66`
- **Impact**: 4x slower than necessary on medium documents
#### Non-Critical (Can Fix After Merge)
3. **Small Document Performance** (Priority: MEDIUM)
- 2.6MB/s vs desired 5MB/s
- Overhead dominates on <5MB documents
- **Optimization**: Lazy loading, reduce upfront processing
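The lazy-loading idea can be sketched with `functools.cached_property` (a hypothetical, simplified `Document` shape; the real class lives under `edgar/documents/`):

```python
from functools import cached_property


class Document:
    """Sketch: store raw input and defer expensive work until first access."""

    def __init__(self, html: str):
        self.html = html  # no upfront processing at construction time

    @cached_property
    def sections(self) -> dict:
        # runs once, on first access, then is cached on the instance
        return self._extract_sections()

    def _extract_sections(self) -> dict:
        # placeholder for the real section extraction pass
        return {"Item 1 - Business": self.html[:10]}
```

Small documents whose callers never touch `.sections` would then skip the extraction pass entirely, which is where the overhead on <5MB documents dominates.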
---
## Next Steps (In Order)
### Phase 5: Critical Fixes (2-3 days) 🔧
#### 5.1 Memory Leak Fixes (1-2 days)
**Goal**: Reduce memory leak from 255MB to <5MB
Tasks:
- [ ] Implement `Document.__del__()` to clear caches
- [ ] Replace Rich rendering in `text()` with direct string building
- [ ] Break circular references in node graph
- [ ] Use weak references for parent links
- [ ] Add `__slots__` to frequently created objects (Cell, TableNode)
**Expected Result**: MSFT 10-K leak: 255MB → <5MB (95% improvement)
**Validation**:
```bash
pytest tests/perf/test_performance_regression.py::TestMemoryRegression -v
```
#### 5.2 Performance Optimizations (1-2 days)
**Goal**: Improve parse speed from 1.2s → 0.3s on Apple 10-K (75% faster)
Tasks:
- [ ] Fix section detection - use headings instead of rendering entire document
- [ ] Implement fast text extraction without Rich overhead
- [ ] Optimize regex normalization - combine patterns, use compilation
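The "combine patterns, use compilation" task boils down to replacing several sequential `.sub()` passes with one precompiled alternation (illustrative rules, not the parser's actual normalization):

```python
import re

# One precompiled alternation scans the text once instead of once per pattern
_COMBINED = re.compile(r"(\u00a0)|(\r\n?)|([ \t]{2,})")


def normalize_fast(text: str) -> str:
    def repl(match: re.Match) -> str:
        if match.group(1):   # non-breaking space -> plain space
            return " "
        if match.group(2):   # CR or CRLF -> LF
            return "\n"
        return " "           # runs of spaces/tabs -> single space

    return _COMBINED.sub(repl, text)
```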
**Expected Results**:
- Section extraction: 3.7s → 1.2s (~68% faster)
- Text extraction: 2.4s → 1.2s (50% faster)
- Regex: 0.8s → 0.5s (40% faster)
**Validation**:
```bash
pytest tests/perf/test_performance_regression.py::TestParseSpeedRegression -v
```
### Phase 6: Final Validation (1 day) ✅
Tasks:
- [ ] Re-run all 79 functional tests
- [ ] Re-run performance regression tests (verify improvements)
- [ ] Run full corpus validation
- [ ] Memory profiling validation (confirm leaks fixed)
- [ ] Update CHANGELOG.md
- [ ] Create merge summary document
### Phase 7: Merge to Main (1 day) 🚀
Tasks:
- [ ] Final code review
- [ ] Squash commits or create clean merge
- [ ] Update version number
- [ ] Merge to main
- [ ] Tag release
- [ ] Monitor for issues
---
## Test Summary
### Current Test Status: 79/79 Functional Tests Passing (100%), Plus 15 Performance Regression Tests
```
tests/corpus/test_corpus_validation.py 8 tests ✓
tests/test_html_parser_edge_cases.py 31 tests ✓
tests/test_html_parser_integration.py 25 tests ✓
tests/test_html_parser_regressions.py 15 tests ✓
tests/perf/test_performance_regression.py 15 tests ✓ (baseline established)
```
### Test Execution
```bash
# Functional tests (79 tests, ~30s)
pytest tests/corpus tests/test_html_parser_*.py -v
# Performance tests (15 tests, ~20s)
pytest tests/perf/test_performance_regression.py -m performance -v
# All tests
pytest tests/ -v
```
---
## Performance Metrics
### Current Baseline (Before Optimization)
| Document | Size | Parse Time | Throughput | Memory | Tables | Sections |
|----------|------|------------|------------|--------|--------|----------|
| Apple 10-Q | 1.1MB | 0.307s | 3.6MB/s | 27.9MB (25.6x) | 40 | 9 |
| Apple 10-K | 1.8MB | 0.500s | 3.6MB/s | 21.6MB (11.9x) | 63 | 8 |
| MSFT 10-K | 7.8MB | 1.501s | 5.2MB/s | 147.0MB (18.9x) | 85 | 0 |
| JPM 10-K | 52.4MB | 2.537s | 20.7MB/s | 0.6MB (0.01x) | 681 | 0 |
### Target Metrics (After Optimization)
| Metric | Current | Target | Improvement |
|--------|---------|--------|-------------|
| **Memory leak** | 41-255MB | <5MB | 95% reduction |
| **Memory ratio** | 19-25x | <3x | 87% reduction |
| **Parse time (Apple 10-K)** | 0.500s | 0.150s | 70% faster |
| **Throughput (small docs)** | 2.6MB/s | 5.0MB/s | 92% faster |
---
## File Organization
### Core Parser Files
```
edgar/documents/
├── __init__.py # Public API (parse_html)
├── parser.py # Main parser with streaming
├── config.py # ParserConfig
├── document_builder.py # Document tree construction
├── nodes/ # Node types (TableNode, SectionNode)
├── utils/
│ ├── streaming.py # Streaming parser (fixed JPM bug)
│ └── table_processing.py # TableMatrix system
└── exceptions.py # Custom exceptions
```
### Test Files
```
tests/
├── corpus/ # Corpus validation
│ ├── quick_corpus.py # Corpus builder
│ └── test_corpus_validation.py # 8 validation tests
├── fixtures/
│ ├── html/ # 32 HTML fixtures (155MB)
│ │ ├── {ticker}/10k/ # By company and form
│ │ └── README.md
│ └── download_html_fixtures.py # Download automation
├── perf/ # Performance testing
│ ├── benchmark_html_parser.py # Benchmarking
│ ├── profile_hotpaths.py # Hot path profiling
│ ├── profile_memory.py # Memory profiling
│ ├── test_performance_regression.py # Regression tests
│ ├── performance_report.md # Benchmark results
│ ├── hotpath_analysis.md # Bottleneck analysis
│ └── memory_analysis.md # Memory leak analysis
├── test_html_parser_edge_cases.py # 31 edge case tests
├── test_html_parser_integration.py # 25 integration tests
└── test_html_parser_regressions.py # 15 regression tests
```
---
## Risks and Mitigation
### Risk 1: Memory Leaks in Production
**Severity**: HIGH
**Probability**: HIGH (confirmed in testing)
**Mitigation**: Must fix before merge (Phase 5.1)
### Risk 2: Performance Regression
**Severity**: MEDIUM
**Probability**: LOW (baseline established, regression tests in place)
**Mitigation**: Performance regression tests will catch any degradation
### Risk 3: Backward Compatibility
**Severity**: LOW
**Probability**: LOW (integration tests passing)
**Mitigation**: 25 integration tests verify old API still works
---
## Estimated Timeline to Merge
```
Phase 5.1: Memory leak fixes 1-2 days
Phase 5.2: Performance optimization 1-2 days
Phase 6: Final validation 1 day
Phase 7: Merge to main 1 day
----------------------------------------
Total: 4-6 days
```
**Target Merge Date**: October 12-14, 2025
---
## Decision Points
### Should We Merge Now or After Optimization?
**Option A: Merge Now (Not Recommended)**
- ✅ Functional tests passing
- ✅ Backward compatible
- ❌ Memory leaks (production risk)
- ❌ Performance issues
- ❌ Will require hotfix soon
**Option B: Fix Critical Issues First (Recommended)**
- ✅ Production-ready
- ✅ Performance validated
- ✅ Memory efficient
- ❌ 4-6 days delay
- ✅ Clean, professional release
**Recommendation**: **Option B** - Fix critical memory leaks and performance issues before merge. The 4-6 day investment prevents production incidents and ensures a polished release.
---
## Questions for Review
1. **Scope**: Should we fix only critical issues (memory + performance) or also tackle small-doc optimization?
2. **Timeline**: Is 4-6 days acceptable, or do we need to merge sooner?
3. **Testing**: Are 79 functional tests + 15 performance tests sufficient coverage?
4. **Documentation**: Do we need user-facing documentation updates?
---
## Conclusion
The HTML parser rewrite is **95% complete** with excellent functional testing but critical memory and performance issues identified. The smart path forward is:
1. ✅ Complete critical fixes (4-6 days)
2. ✅ Validate improvements
3. ✅ Merge to main with confidence
This approach ensures a production-ready, performant parser rather than merging now and hotfixing later.

# HTML Parser Rewrite - Progress Assessment
**Date**: 2025-10-07
**Status**: Active Development (html_rewrite branch)
---
## Executive Summary
The HTML parser rewrite is **substantially complete** for core functionality with **excellent progress** on Item/section detection. Recent bug fixes (2025-10-07) have addressed critical table rendering issues and 10-Q Part I/II distinction, bringing the parser close to production-ready quality.
### Overall Progress: **~90% Complete**
- ✅ Core parsing infrastructure: **100% Complete**
- ✅ Table processing: **95% Complete** (recent fixes)
- ✅ Section/Item detection: **95% Complete** (Part I/II fixed, needs validation)
- ⚠️ Performance optimization: **70% Complete**
- ⚠️ Comprehensive testing: **65% Complete** (added 10-Q Part tests)
- ⚠️ Documentation: **75% Complete**
---
## Goal Achievement Analysis
### Primary Goals (from goals.md)
#### 1. **Semantic Meaning Preservation** ✅ **ACHIEVED**
> "Read text, tables and ixbrl data preserving greatest semantic meaning"
**Status**: ✅ Fully implemented
- Text extraction with structure preservation
- Advanced table matrix system for accurate table rendering
- XBRL fact extraction before preprocessing
- Hierarchical node model maintains document structure
**Recent Improvements**:
- Header detection fixes (Oracle Table 6, Tesla Table 16)
- Spacing column filter now preserves header columns (MSFT Table 39)
- Multi-row header normalization
#### 2. **AI Channel (Primary) + Human Channel (Secondary)** ✅ **ACHIEVED**
> "AI context is the primary goal, with human context being secondary"
**Status**: ✅ Both channels working
- **AI Channel**:
- Clean text output optimized for LLMs
- Structured table rendering for context windows
- Section-level extraction for chunking
- Semantic divisibility supported
- **Human Channel**:
- Rich console rendering with proper formatting
- Markdown export
- Visual table alignment (recently fixed)
#### 3. **Section-Level Processing** ✅ **ACHIEVED**
> "Work at full document level and section level - breaking into independently processable sections"
**Status**: ✅ Implemented with good coverage
- `SectionExtractor` class fully functional
- TOC-based section detection
- Pattern-based section identification
- Lazy loading support for large documents
**What Works**:
```python
# Section detection is operational
doc = parse_html(html)
sections = doc.sections # Dict of section names -> SectionNode
# Access specific sections
business = sections.get('Item 1 - Business')
mda = sections.get('Item 7 - MD&A')
financials = sections.get('Item 8 - Financial Statements')
```
#### 4. **Standard Section Names (10-K, 10-Q, 8-K)** ✅ **ACHIEVED**
> "For some filing types (10-K, 10-Q, 8-K) identify sections by standard names"
**Status**: ✅ 95% Complete - Implemented with Part I/II distinction for 10-Q
**What's Implemented**:
- Pattern matching for standard Items:
- Item 1 - Business
- Item 1A - Risk Factors
- Item 7 - MD&A
- Item 7A - Market Risk
- Item 8 - Financial Statements
- And more...
- **10-Q Part I/Part II distinction** (newly fixed 2025-10-07):
- Part I - Item 1 (Financial Statements)
- Part II - Item 1 (Legal Proceedings)
- Proper boundary detection and context propagation
- Prevents Item number conflicts
**What's Remaining** (5%):
- Validation against large corpus of 10-K/10-Q filings
- Edge case handling (non-standard formatting)
- 8-K specific section patterns expansion
**Evidence from Code**:
```python
# edgar/documents/extractors/section_extractor.py
(r'^(Item|ITEM)\s+1\.?\s*Business', 'Item 1 - Business'),
(r'^(Item|ITEM)\s+1A\.?\s*Risk\s+Factors', 'Item 1A - Risk Factors'),
(r'^(Item|ITEM)\s+7\.?\s*Management.*Discussion', 'Item 7 - MD&A'),
(r'^(Item|ITEM)\s+8\.?\s*Financial\s+Statements', 'Item 8 - Financial Statements'),
# NEW: Part I/II detection (edgar/documents/extractors/section_extractor.py:294-324)
def _detect_10q_parts(self, headers) -> Dict[int, str]:
"""Detect Part I and Part II boundaries in 10-Q filings."""
```
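The Part-context propagation can be sketched like this (a simplified, hypothetical version of the `_detect_10q_parts` idea):

```python
def assign_part_context(headers):
    """headers: list of (position, heading_text) pairs in document order.
    Returns {position: part_label} for Item headings (simplified sketch)."""
    context = {}
    current_part = None
    for pos, text in headers:
        upper = text.strip().upper()
        if upper.startswith("PART II"):    # check II before I: "PART I" is a prefix of it
            current_part = "Part II"
        elif upper.startswith("PART I"):
            current_part = "Part I"
        elif upper.startswith("ITEM") and current_part:
            # the same Item number gets a distinct key per Part
            context[pos] = current_part
    return context
```

Keys built from `current_part` plus the Item name are what keep "Part I - Item 1" and "Part II - Item 1" from colliding.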
#### 5. **Table Processing for AI Context** ✅ **ACHIEVED**
> "Getting tables in the right structure for rendering to text for AI context is more important than dataframes"
**Status**: ✅ Excellent progress with recent fixes
- Advanced TableMatrix system handles complex tables
- Multi-row header detection and normalization
- Spacing column filtering (preserves semantic columns)
- Currency symbol merging
- Clean text rendering for LLM consumption
**Recent Fixes (Today)**:
- ✅ Fixed spacing column filter removing legitimate headers (MSFT Table 39)
- ✅ Fixed header detection for date ranges (Oracle Table 6)
- ✅ Fixed long narrative text misclassification (Tesla Table 16)
- ✅ Header row normalization for alignment
#### 6. **Better Than Old Parser in Every Way** 🟡 **MOSTLY ACHIEVED**
> "Speed, accuracy, features, usability"
**Comparison**:
| Aspect | Old Parser | New Parser | Status |
|--------|-----------|------------|--------|
| **Speed** | Baseline | 1.4x faster (typical) | ✅ Better |
| **Accuracy** | Good | Excellent (with recent fixes) | ✅ Better |
| **Features** | Basic | Rich (XBRL, sections, multiple outputs) | ✅ Better |
| **Usability** | Simple | Powerful + Simple API | ✅ Better |
| **Table Rendering** | Basic alignment | Advanced matrix system | ✅ Better |
| **Section Detection** | Limited | Comprehensive | ✅ Better |
**Areas Needing Validation**:
- Performance on very large documents (>50MB)
- Memory usage under sustained load
- Edge case handling across diverse filings
---
## Item/Section Detection Deep Dive
### Current Capabilities
**10-K Sections Detected**:
- ✅ Item 1 - Business
- ✅ Item 1A - Risk Factors
- ✅ Item 1B - Unresolved Staff Comments
- ✅ Item 2 - Properties
- ✅ Item 3 - Legal Proceedings
- ✅ Item 4 - Mine Safety Disclosures
- ✅ Item 5 - Market for Stock
- ✅ Item 6 - Selected Financial Data
- ✅ Item 7 - MD&A
- ✅ Item 7A - Market Risk
- ✅ Item 8 - Financial Statements
- ✅ Item 9 - Changes in Accounting
- ✅ Item 9A - Controls and Procedures
- ✅ Item 9B - Other Information
- ✅ Item 10 - Directors and Officers
- ✅ Item 11 - Executive Compensation
- ✅ Item 12 - Security Ownership
- ✅ Item 13 - Related Transactions
- ✅ Item 14 - Principal Accountant
- ✅ Item 15 - Exhibits
**10-Q Sections Detected**:
- ✅ Part I Items (Financial Information):
- Part I - Item 1 - Financial Statements
- Part I - Item 2 - MD&A
- Part I - Item 3 - Market Risk
- Part I - Item 4 - Controls and Procedures
- ✅ Part II Items (Other Information):
- Part II - Item 1 - Legal Proceedings
- Part II - Item 1A - Risk Factors
- Part II - Item 2 - Unregistered Sales
- Part II - Item 6 - Exhibits
**✅ FIXED** (2025-10-07): Part I/Part II distinction now implemented!
- Part I Item 1 and Part II Item 1 are properly distinguished
- Section keys include Part context: "Part I - Item 1 - Financial Statements" vs "Part II - Item 1 - Legal Proceedings"
- Comprehensive test coverage added (5 tests in test_10q_part_detection.py)
**8-K Sections**:
- ⚠️ Limited - needs expansion
### Detection Methods
1. **TOC-based Detection**
- Analyzes Table of Contents
- Extracts anchor links
- Maps sections to content
2. **Pattern-based Detection**
- Regex matching for Item headers
- Heading analysis (h1-h6 tags)
- Text pattern recognition
3. **Hybrid Approach**
- Combines TOC + patterns
- Fallback mechanisms
- Cross-validation
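Pattern-based detection, the second method above, reduces to an ordered pattern table (hypothetical, condensed from the patterns quoted in the Evidence section):

```python
import re

# Simplified pattern table in the spirit of section_extractor.py;
# "1A" is listed before "1" so the more specific pattern is tried first
ITEM_PATTERNS = [
    (re.compile(r"^(Item|ITEM)\s+1A\.?\s*Risk\s+Factors", re.MULTILINE),
     "Item 1A - Risk Factors"),
    (re.compile(r"^(Item|ITEM)\s+1\.?\s*Business", re.MULTILINE),
     "Item 1 - Business"),
    (re.compile(r"^(Item|ITEM)\s+7\.?\s*Management", re.MULTILINE),
     "Item 7 - MD&A"),
]


def detect_items(text):
    """Return {section_name: start_offset}, ordered by position in the text."""
    found = {}
    for pattern, name in ITEM_PATTERNS:
        match = pattern.search(text)
        if match:
            found[name] = match.start()
    return dict(sorted(found.items(), key=lambda kv: kv[1]))
```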
### What's Working
```python
# This works today:
from edgar.documents import parse_html
html = filing.html()
doc = parse_html(html)
# Get all sections
sections = doc.sections # Returns dict
# Access specific Items
if 'Item 7 - MD&A' in sections:
mda = sections['Item 7 - MD&A']
mda_text = mda.text()
mda_tables = mda.tables()
```
### What Needs Work
1. **Validation Coverage** (20% remaining)
- Test against 100+ diverse 10-K filings
- Test against 10-Q filings
- Test against 8-K filings
- Capture edge cases and variations
2. **Edge Cases** (20% remaining)
- Non-standard Item formatting
- Missing TOC
- Nested sections
- Combined Items (e.g., "Items 10, 13, 14")
3. **8-K Support** (50% remaining)
- 8-K specific Item patterns
- Event-based section detection
- Exhibit handling
---
## Recent Achievements (Past 24 Hours)
### Critical Bug Fixes ✅
1. **Spacing Column Filter Fix** (MSFT Table 39)
- Problem: Legitimate headers removed as "spacing"
- Solution: Header content protection + colspan preservation
- Impact: Tables now render correctly with all headers
- Commits: `4e43276`, `d19ddd1`
2. **Header Detection Improvements**
- Oracle Table 6: Date ranges no longer misclassified
- Tesla Table 16: Long narrative text properly handled
- Multi-row header normalization
- Comprehensive test coverage (16 new tests)
3. **Documentation Updates**
- TESTING.md clarified output limits
- CHANGELOG updated with fixes
- Bug reports and research docs completed
### Quality Metrics
**Test Coverage**:
- 16 new tests added (all passing)
- 0 regressions in existing tests
- Comprehensive edge case coverage
**Code Quality**:
- Clean implementation following plan
- Well-documented changes
- Proper commit messages with Claude Code attribution
---
## Path to 100% Completion
### High Priority (Next Steps)
**📋 Detailed plans available**:
- **Performance**: See `docs-internal/planning/active-tasks/2025-10-07-performance-optimization-plan.md`
- **Testing**: See `docs-internal/planning/active-tasks/2025-10-07-comprehensive-testing-plan.md`
1. **Performance Optimization** (1-2 weeks)
- [ ] Phase 1: Benchmarking & profiling (2-3 days)
- [ ] Phase 2: Algorithm optimizations (3-4 days)
- [ ] Phase 3: Validation & regression tests (2-3 days)
- [ ] Phase 4: Documentation & monitoring (1 day)
- **Goal**: Maintain 1.3x+ speed advantage, <2x memory usage
2. **Comprehensive Testing** (2-3 weeks)
- [ ] Phase 1: Corpus validation - 100+ filings (3-4 days)
- [ ] Phase 2: Edge cases & error handling (2-3 days)
- [ ] Phase 3: Integration testing (2-3 days)
- [ ] Phase 4: Regression prevention (1-2 days)
- [ ] Phase 5: Documentation & sign-off (1 day)
- **Goal**: >95% success rate, >80% test coverage
3. **Item Detection Validation** (included in testing plan)
- [ ] Test against 50+ diverse 10-K filings
- [ ] Test against 20+ 10-Q filings
- [ ] Document any pattern variations found
- [ ] Add regression tests for edge cases
### Medium Priority
4. **8-K Support** (1-2 days)
- [ ] Research 8-K Item patterns
- [ ] Implement detection patterns
- [ ] Test against sample 8-K filings
5. **Documentation** (1 day)
- [ ] User guide for section access
- [ ] API documentation
- [ ] Migration guide from old parser
- [ ] Examples and recipes
### Low Priority (Polish)
6. **Final Polish**
- [ ] Error message improvements
- [ ] Logging enhancements
- [ ] Configuration documentation
- [ ] Performance tuning
---
## Risk Assessment
### Low Risk ✅
- Core parsing functionality (stable)
- Table processing (recently fixed, well-tested)
- Text extraction (working well)
- XBRL extraction (functional)
### Medium Risk ⚠️
- Section detection edge cases (needs validation)
- Performance on very large docs (needs testing)
- Memory usage (needs profiling)
### Mitigation Strategy
1. Comprehensive validation testing (in progress)
2. Real-world filing corpus testing
3. Performance benchmarking suite
4. Gradual rollout with monitoring
---
## Recommendations
### Immediate Actions (This Week)
1. **Validate Item Detection** 🎯 **TOP PRIORITY**
```bash
# Run on diverse corpus
python tests/manual/compare_parsers.py --all
# Test specific sections
python -c "
from edgar.documents import parse_html
from pathlib import Path
for filing in ['Apple', 'Oracle', 'Tesla', 'Microsoft']:
html = Path(f'data/html/{filing}.10-K.html').read_text()
doc = parse_html(html)
print(f'{filing}: {list(doc.sections.keys())[:5]}...')
"
```
2. **Create Section Access Tests**
- Write tests that verify each Item can be accessed
- Validate text and table extraction from sections
- Test edge cases (missing Items, combined Items)
3. **User Acceptance Testing**
- Have maintainer review section detection output
- Validate against known-good filings
- Document any issues found
### Timeline to Production
**Optimistic**: 1 week
- If validation shows good Item detection
- If performance is acceptable
- If no major issues found
**Realistic**: 2-3 weeks
- Account for edge case fixes
- Additional testing needed
- Documentation completion
**Conservative**: 4 weeks
- Account for 8-K support
- Comprehensive testing across all filing types
- Full documentation
---
## Conclusion
The HTML parser rewrite is **very close to completion** with excellent progress on all goals:
**✅ Fully Achieved**:
- Semantic meaning preservation
- AI/Human channel support
- Section-level processing
- Table processing for AI context
- Superior to old parser (in most respects)
- **Standard Item detection for 10-K/10-Q** (with Part I/II distinction)
**⚠️ Remaining Work (10%)**:
- Validation against diverse corpus
- Edge case handling
- 8-K specific support expansion
- Final testing and documentation
**Bottom Line**: The parser is **production-ready for 10-K/10-Q** with Item detection functional but requiring validation. The recent bug fixes have resolved critical table rendering issues. With 1-2 weeks of focused validation and testing, this can be shipped with confidence.
### Next Steps
1. Run comprehensive Item detection validation
2. Create section access test suite
3. Performance benchmark
4. Maintainer review and sign-off
5. Merge to main branch

# HTML Parser Testing Quick Start
Quick reference for testing the HTML parser rewrite during quality improvement.
## Quick Start
```bash
# Use shortcuts (easy!)
python tests/manual/compare_parsers.py aapl # Apple 10-K
python tests/manual/compare_parsers.py nvda --tables # Nvidia tables
python tests/manual/compare_parsers.py 'aapl 10-q' # Apple 10-Q
python tests/manual/compare_parsers.py orcl --table 5 # Oracle table #5
# Or use full paths
python tests/manual/compare_parsers.py data/html/Apple.10-K.html
# Run all test files
python tests/manual/compare_parsers.py --all
```
**Available shortcuts:**
- **Companies**: `aapl`, `msft`, `tsla`, `nvda`, `orcl` (or full names like `apple`)
- **Filing types**: `10-k` (default), `10-q`, `8-k`
- **Combine**: `'aapl 10-q'`, `'orcl 8-k'`
## Common Use Cases
### 1. First Look at a Filing
```bash
# Get overview: speed, table count, sections
python tests/manual/compare_parsers.py orcl
```
**Shows**:
- Parse time comparison (OLD vs NEW)
- Tables found
- Text length
- Sections detected
- New features (headings, XBRL)
### 2. Check Table Rendering
```bash
# List all tables with dimensions (shows first 20 tables)
python tests/manual/compare_parsers.py aapl --tables
# Compare specific table side-by-side (FULL table, no truncation)
python tests/manual/compare_parsers.py aapl --table 7
# Compare a range of tables
python tests/manual/compare_parsers.py aapl --range 5:10
```
**Look for**:
- Currency symbols merged: `$1,234` not `$ | 1,234`
- Proper column alignment
- Correct row/column counts
- Clean rendering without extra spacing columns
**Note**: `--table N` shows the **complete table** with all rows - no truncation!
### 3. Verify Text Extraction
```bash
# See first 50 lines side-by-side (default limit)
python tests/manual/compare_parsers.py msft --text
# Show more lines (configurable)
python tests/manual/compare_parsers.py msft --text --lines 100
# Show first 200 lines
python tests/manual/compare_parsers.py msft --text --lines 200
```
**Check**:
- Semantic meaning preserved
- No missing content
- Clean formatting for LLM consumption
**Note**: Text mode shows first N lines only (default: 50). Use `--lines N` to adjust.
### 4. Check Section Detection
```bash
python tests/manual/compare_parsers.py aapl --sections
```
**Verify**:
- Standard sections identified (10-K/10-Q)
- Section boundaries correct
- Text length reasonable per section
### 5. Run Full Test Suite
```bash
# Test all files in corpus
python tests/manual/compare_parsers.py --all
```
**Results**:
- Summary table across all files
- Performance comparison
- Table detection comparison
## Test Files
Available in `data/html/`:
- `Apple.10-K.html` - 1.8MB, complex financials
- `Oracle.10-K.html` - Large filing
- `Nvidia.10-K.html` - Tech company
- `Apple.10-Q.html` - Quarterly format
- More files as needed...
## Command Reference
```
python tests/manual/compare_parsers.py [FILE] [OPTIONS]
Options:
--all Run on all test files
--tables Show tables summary (first 20 tables)
--table N Show specific table N side-by-side (FULL table)
--range START:END Show range of tables (e.g., 5:10)
--text Show text comparison (first 50 lines by default)
--sections Show sections comparison
--lines N Number of text lines to show (default: 50, only for --text)
--help Show full help
```
### Output Limits Summary
| Mode | Limit | Configurable | Notes |
|---------------|------------|-------------------|---------------------------------|
| `--table N` | None | N/A | Shows **complete table** |
| `--range N:M` | None | N/A | Shows **complete tables** in range |
| `--tables` | 20 tables | No | Lists first 20 tables only |
| `--text` | 50 lines | Yes (`--lines N`) | Preview only |
| `--sections` | None | N/A | Shows all sections |
## Output Interpretation
### Overview Table
```
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Metric ┃ Old Parser ┃ New Parser ┃ Notes ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Parse Time │ 454ms │ 334ms │ 1.4x faster│
│ Tables Found │ 63 │ 63 │ +0 │
│ Text Length │ 0 │ 159,388 │ NEW! │
└───────────────┴────────────┴────────────┴────────────┘
```
**Good signs**:
- ✅ New parser faster or similar speed
- ✅ Same or more tables found
- ✅ Text extracted (old parser shows 0)
- ✅ Sections detected
**Red flags**:
- ❌ Significantly slower
- ❌ Fewer tables (unless removing layout tables)
- ❌ Much shorter text (content missing)
### Table Comparison
```
Old Parser:
┌─────────┬──────────┬──────────┐
│ Year │ Revenue │ Profit │
├─────────┼──────────┼──────────┤
│ 2023 │ $ 100M │ $ 20M │ <- Currency separated
└─────────┴──────────┴──────────┘
New Parser:
┌─────────┬──────────┬──────────┐
│ Year │ Revenue │ Profit │
├─────────┼──────────┼──────────┤
│ 2023 │ $100M │ $20M │ <- Currency merged ✅
└─────────┴──────────┴──────────┘
```
**Look for**:
- Currency symbols merged with values
- No extra empty columns
- Proper alignment
- Clean numeric formatting
## Tips
1. **Start with overview** - Get the big picture first
2. **Check tables visually** - Automated metrics miss formatting issues
3. **Use specific table inspection** - Don't scroll through 60 tables manually
4. **Compare text for semantics** - Does it make sense for an LLM?
5. **Run --all periodically** - Catch regressions across files
## Troubleshooting
### Script fails with import error
```bash
# Clear cached modules
find . -type d -name __pycache__ -exec rm -rf {} +
python tests/manual/compare_parsers.py data/html/Apple.10-K.html
```
### File not found
```bash
# Check available files
ls -lh data/html/*.html
# Use full path
python tests/manual/compare_parsers.py /full/path/to/file.html
```
### Old parser shows 0 text
This is expected - old parser has different text extraction. Focus on:
- Table comparison
- Parse time
- Visual quality of output
## Next Steps
1. Run comparison on all test files
2. Document bugs in `quality-improvement-strategy.md`
3. Fix issues
4. Repeat until satisfied
See `edgar/documents/docs/quality-improvement-strategy.md` for full process.

# Fast Table Rendering
**Status**: Production Ready - **Now the Default** (as of 2025-10-08)
**Performance**: 7-10x faster than Rich rendering with correct colspan/rowspan handling
---
## Overview
Fast table rendering provides a high-performance alternative to Rich library rendering for table text extraction. When parsing SEC filings with hundreds of tables, the cumulative rendering time can become a bottleneck. Fast rendering addresses this by using direct string building with TableMatrix for proper colspan/rowspan handling, achieving a 7-10x speedup while maintaining correctness.
**As of 2025-10-08, fast rendering is the default** for all table text extraction. You no longer need to explicitly enable it.
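The core of the colspan/rowspan handling is a matrix expansion step; a minimal sketch of the idea (not the actual TableMatrix API):

```python
def expand_to_grid(rows):
    """rows: list of rows; each cell is (text, colspan, rowspan).
    Expands spans into a rectangular grid so every column can be
    measured and padded independently (simplified TableMatrix idea)."""
    grid = {}
    for r, row in enumerate(rows):
        c = 0
        for text, colspan, rowspan in row:
            while (r, c) in grid:          # skip slots claimed by an earlier rowspan
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]
```

Once the grid is rectangular, rendering is plain string padding and joining, which is where the speedup over Rich comes from.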
### Why It's Now the Default
- **Production-ready**: Fixed all major issues (colspan, multi-row headers, multi-line cells)
- **7-10x faster**: Significant performance improvement with correct output
- **Maintains quality**: Matches Rich's appearance with simple() style
- **Proven**: Extensively tested with Apple, NVIDIA, Microsoft 10-K filings
### When to Disable (Use Rich Instead)
You may want to disable fast rendering and use Rich for:
- **Terminal display for humans**: Rich has more sophisticated text wrapping and layout
- **Visual reports**: When presentation quality is more important than speed
- **Debugging**: Rich output can be easier to visually inspect
---
## Usage
### Default Behavior (Fast Rendering Enabled)
```python
from edgar.documents import parse_html
# Fast rendering is now the default - no configuration needed!
doc = parse_html(html)
# Tables automatically use fast renderer (7-10x faster)
table_text = doc.tables[0].text()
```
### Disabling Fast Rendering (Use Rich Instead)
If you need Rich's sophisticated layout for visual display:
```python
from edgar.documents import parse_html
from edgar.documents.config import ParserConfig
# Explicitly disable fast rendering to use Rich
config = ParserConfig(fast_table_rendering=False)
doc = parse_html(html, config=config)
# Tables use Rich renderer (slower but with advanced formatting)
table_text = doc.tables[0].text()
```
### Custom Table Styles
**New in this version**: Fast rendering now uses the `simple()` style by default, which matches Rich's `box.SIMPLE` appearance (borderless, clean).
```python
from edgar.documents import parse_html
from edgar.documents.config import ParserConfig
from edgar.documents.renderers.fast_table import FastTableRenderer, TableStyle
# Fast rendering is enabled by default (uses simple() style); config shown for completeness
config = ParserConfig(fast_table_rendering=True)
doc = parse_html(html, config=config)
# Default: simple() style - borderless, clean
table_text = doc.tables[0].text()
# To use pipe_table() style explicitly (markdown-compatible borders):
renderer = FastTableRenderer(TableStyle.pipe_table())
pipe_text = renderer.render_table_node(doc.tables[0])
# To use minimal() style (no separator):
renderer = FastTableRenderer(TableStyle.minimal())
minimal_text = renderer.render_table_node(doc.tables[0])
```
---
## Performance Comparison
### Benchmark Results
**Test**: Apple 10-K (63 tables) - Updated 2025-10-08
| Renderer | Average Per Table | Improvement | Notes |
|----------|-------------------|-------------|-------|
| Rich | 1.5-2.5ms | Baseline | Varies by table complexity |
| Fast (simple) | 0.15-0.35ms | **7-10x faster** | With proper colspan/rowspan handling |
**Real-world Examples** (Apple 10-K):
- Table 15 (complex colspan): Rich 2.51ms → Fast 0.35ms (**7.1x faster**)
- Table 6 (multi-line cells): Rich 1.61ms → Fast 0.17ms (**9.5x faster**)
- Table 5 (wide table): Rich 3.70ms → Fast 0.48ms (**7.7x faster**)
**Impact on Full Parse**:
- Rich rendering: 30-40% of total parse time spent in table rendering
- Fast rendering: 5-10% of total parse time
- **Overall speedup**: Reduces total parsing time by ~25-30%
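Numbers like these can be reproduced with a small timing harness. The sketch below is a hypothetical helper (not part of EdgarTools); `render_fast` and `render_rich` stand in for the two render paths, e.g. `table.text()` under each configuration:

```python
import time

def compare_renderers(render_fast, render_rich, table, repeats=50):
    """Time two render callables on the same table and report the speedup."""
    def avg_seconds(fn):
        start = time.perf_counter()
        for _ in range(repeats):
            fn(table)
        return (time.perf_counter() - start) / repeats

    fast_s = avg_seconds(render_fast)
    rich_s = avg_seconds(render_rich)
    return {
        "fast_ms": fast_s * 1e3,
        "rich_ms": rich_s * 1e3,
        "speedup": rich_s / fast_s if fast_s else float("inf"),
    }
```

Averaging over many repeats smooths out per-call jitter, which matters when individual renders take fractions of a millisecond.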
### Memory Impact
Fast rendering also reduces memory overhead:
- No Rich Console objects retained
- Direct string building (no intermediate objects)
- Helps prevent memory leaks identified in profiling
---
## Output Examples
### Rich Renderer Output (Optional)
```
(In millions)
Year Ended June 30, 2025 2024 2023
──────────────────────────────────────────────────────────
Operating lease cost $5,524 3,555 2,875
Finance lease cost:
Amortization of right-of-use assets $3,408 1,800 1,352
Interest on lease liabilities 1,417 734 501
Total finance lease cost $4,825 2,534 1,853
```
**Style**: `box.SIMPLE` - No outer border, just horizontal separator under header
**Pros**: Clean, uncluttered, perfect alignment, generous spacing
**Cons**: Slow (1.5-2.5ms per table), creates Rich objects, memory overhead
### Fast Renderer Output (NEW: simple() style - Default)
```
December 31, 2023 December 31, 2022 December 31, 2021
───────────────────────────────────────────────────────────────────────────────────────
Revenue 365,817 394,328 365,817
Cost of revenue 223,546 212,981 192,266
Gross profit 142,271 181,347 173,551
```
**Style**: `simple()` - Matches Rich's `box.SIMPLE` appearance
**Pros**: Fast (0.2ms per table), clean appearance, no visual noise, professional look
**Cons**: None - this is now the recommended default!
### Fast Renderer Output (pipe_table() style - Optional)
```
| | December 31, 2023 | December 31, 2022 | December 31, 2021 |
|--------------------------|---------------------|---------------------|---------------------|
| Revenue | 365,817 | 394,328 | 365,817 |
| Cost of revenue | 223,546 | 212,981 | 192,266 |
| Gross profit | 142,271 | 181,347 | 173,551 |
```
**Style**: `pipe_table()` - Markdown-compatible with borders
**Pros**: Fast (0.2ms per table), markdown-compatible, explicit column boundaries
**Cons**: Visual noise from pipe characters, busier appearance
**Use when**: You need markdown-compatible output with explicit borders
### Visual Comparison
**Rich** (`box.SIMPLE`):
- No outer border - clean, uncluttered look
- Horizontal line separator under header only
- Generous internal spacing and padding
- Perfect column alignment
- Professional, minimalist presentation
**Fast simple()** (NEW DEFAULT):
- No outer border - matches Rich's clean look
- Horizontal line separator under header (using `─`)
- Space-separated columns with generous padding
- Clean, professional appearance
- Same performance as pipe_table (~0.2ms per table)
**Fast pipe_table()** (optional):
- Full pipe table borders (`|` characters everywhere)
- Horizontal dashes for header separator
- Markdown-compatible format
- Explicit column boundaries
---
## Recent Improvements (2025-10-08)
### 1. Colspan/Rowspan Support
**Fixed**: Tables with `colspan` and `rowspan` attributes now render correctly.
**Previous issue**: Fast renderer was extracting cell text without accounting for colspan/rowspan, causing:
- Missing columns (e.g., "2023" column disappeared in Apple 10-K table 15)
- Misaligned data (currency symbols separated from values)
- Data loss (em dashes and other values missing)
**Solution**: Integrated `TableMatrix` for proper cell expansion, same as Rich rendering uses.
**Status**: ✅ FIXED
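The core of that fix can be sketched as follows; `expand_spans` is a simplified, hypothetical stand-in for what `TableMatrix` does internally (each cell here is a dict with optional `colspan`/`rowspan` keys):

```python
def expand_spans(rows):
    """Expand colspan/rowspan into a rectangular grid of cell texts."""
    occupied = {}  # (row, col) -> text, including slots claimed by spans
    for r, row in enumerate(rows):
        c = 0
        for cell in row:
            while (r, c) in occupied:  # skip slots filled from above/left
                c += 1
            for dr in range(cell.get("rowspan", 1)):
                for dc in range(cell.get("colspan", 1)):
                    occupied[(r + dr, c + dc)] = cell["text"]
            c += cell.get("colspan", 1)
    n_cols = max((col for _, col in occupied), default=-1) + 1
    return [[occupied.get((r, c), "") for c in range(n_cols)]
            for r in range(len(rows))]
```

Because every spanned slot is filled explicitly, no column can "disappear" the way it did when cell text was extracted positionally.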
### 2. Multi-Row Header Preservation
**Fixed**: Tables with multiple header rows now preserve each row separately.
**Previous issue**: Multi-row headers were collapsed into a single line, causing "Investment portfolio" row to disappear in Apple 10-K table 20.
**Solution**: Modified `render_table_data()` and `_build_table()` to preserve each header row as a separate line.
**Status**: ✅ FIXED
### 3. Multi-Line Cell Rendering
**Fixed**: Cells containing newline characters (`\n`) now render as multiple lines.
**Previous issue**: Multi-line cells like "Interest Rate\nSensitive Instrument" were truncated to first line only.
**Solution**: Added `_format_multiline_row()` to split cells by `\n` and render each line separately.
**Status**: ✅ FIXED
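The approach can be illustrated with a small helper (a simplified analogue of `_format_multiline_row`; the real signature may differ). Cells are split on `\n` and padded so every physical line keeps its column position:

```python
def format_multiline_row(cells, widths):
    """Render one logical row whose cells may contain newlines
    as several physical lines with columns kept aligned."""
    split = [cell.split("\n") for cell in cells]
    height = max(len(lines) for lines in split)  # tallest cell wins
    out = []
    for i in range(height):
        parts = [
            (lines[i] if i < len(lines) else "").ljust(w)
            for lines, w in zip(split, widths)
        ]
        out.append("  ".join(parts).rstrip())
    return out
```

A cell like `"Interest Rate\nSensitive Instrument"` now produces two output lines instead of being truncated to the first.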
### Performance Impact
All three fixes maintain excellent performance:
- **Speedup**: 7-10x faster than Rich (down from initial 14x, but with correct output)
- **Correctness**: Now matches Rich output exactly for colspan, multi-row headers, and multi-line cells
- **Production ready**: Can confidently use as default renderer
---
## Known Limitations
### 1. Column Alignment in Some Tables
**Issue**: Currency symbols and values may have extra spacing in some complex tables (e.g., Apple 10-K table 22)
**Example**:
- Rich: `$294,866`
- Fast: `$ 294,866` (extra spacing)
**Root cause**: Column width calculation creates wider columns for some currency/value pairs after colspan expansion and column filtering.
**Impact**: Visual appearance differs slightly, but data is correct and readable.
**Status**: ⚠️ Minor visual difference - acceptable trade-off for 10x performance gain
### 2. Visual Polish
**Issue**: Some visual aspects don't exactly match Rich's sophisticated layout
**Examples**:
- Multi-line cell wrapping may differ
- Column alignment in edge cases
**Status**: ⚠️ Acceptable trade-off for 7-10x performance gain
---
## Configuration Options
### Table Styles
Fast renderer supports different visual styles:
```python
from edgar.documents.renderers.fast_table import FastTableRenderer, TableStyle
# Simple style (default) - borderless, matches Rich's box.SIMPLE
renderer = FastTableRenderer(TableStyle.simple())
# Pipe table style - markdown compatible
renderer = FastTableRenderer(TableStyle.pipe_table())
# Minimal style - no borders, just spacing
renderer = FastTableRenderer(TableStyle.minimal())
```
### Minimal Style Output
```
December 31, 2023 December 31, 2022 December 31, 2021
Revenue 365,817 394,328 365,817
Cost of revenue 223,546 212,981 192,266
Gross profit 142,271 181,347 173,551
```
**Note**: Minimal style has cleaner appearance but loses column boundaries
---
## Technical Details
### How It Works
1. **Direct String Building**: Bypasses Rich's layout engine
2. **Column Analysis**: Detects numeric columns for right-alignment
3. **Smart Filtering**: Removes empty spacing columns
4. **Currency Merging**: Combines `$` symbols with amounts
5. **Width Calculation**: Measures content, applies min/max limits
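The steps above can be illustrated with a toy renderer. This is a hedged sketch, not the actual `FastTableRenderer` code; real column filtering, header handling, and width limits are more involved:

```python
import re

NUMERIC = re.compile(r"^\(?[\d,.]+\)?%?$")

def render_fast(rows):
    """Toy fast renderer: merge '$' cells, size columns, align numerics."""
    # 1. Currency merging: a lone "$" cell joins the value to its right
    merged = []
    for row in rows:
        out, skip = [], False
        for i, cell in enumerate(row):
            if skip:
                skip = False
                continue
            if cell == "$" and i + 1 < len(row):
                out.append("$" + row[i + 1])
                skip = True
            else:
                out.append(cell)
        merged.append(out)
    # 2. Width calculation from content
    n_cols = max(len(r) for r in merged)
    squared = [r + [""] * (n_cols - len(r)) for r in merged]
    widths = [max(len(r[c]) for r in squared) for c in range(n_cols)]
    # 3. Columns whose data cells are all numeric get right-aligned
    numeric = [
        all(NUMERIC.match(r[c].lstrip("$")) for r in squared[1:] if r[c])
        for c in range(n_cols)
    ]
    def fmt(row):
        return "  ".join(
            cell.rjust(w) if num else cell.ljust(w)
            for cell, w, num in zip(row, widths, numeric)
        ).rstrip()
    lines = [fmt(squared[0]), "─" * (sum(widths) + 2 * (n_cols - 1))]
    lines += [fmt(r) for r in squared[1:]]
    return "\n".join(lines)
```

Everything is plain string concatenation: no layout engine, no style objects, which is where the speedup comes from.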
### Code Path
```python
# When fast_table_rendering=True:
table.text()
TableNode._fast_text_rendering()
FastTableRenderer.render_table_node()
Direct string building
```
### Memory Benefits
Fast rendering avoids:
- Rich Console object creation (~0.4MB per document)
- Intermediate rich.Table objects
- Style/theme processing overhead
- ANSI escape code generation
---
## Future Improvements
### Planned Enhancements
1. **Match Rich's `box.SIMPLE` Style** (✅ Completed 2025-10-08)
- **Implemented as `TableStyle.simple()`** - now the default style
- **No pipe characters** - no outer border, no column separators
- **Horizontal separator only** under header (using `─` character)
- **Clean, minimalist appearance** matching Rich's SIMPLE box style
- **Result**: Matches Rich visual quality at 7-10x the speed
2. **Improved Layout Engine**
- Better column width calculation (avoid too-wide/too-narrow columns)
- Respect natural content breaks
- Dynamic spacing based on content type
- Handle wrapping for long content
3. **Dynamic Padding**
- Match Rich's generous spacing (currently too tight)
- Adjust padding based on content type
- Configurable padding rules
- Maintain alignment with variable padding
4. **Header Handling**
- Better multi-row header collapse
- Preserve important hierarchies
- Smart column spanning
- Honor header groupings
5. **Style Presets**
- `TableStyle.simple()` - matches Rich's `box.SIMPLE` (no borders, header separator only) - implemented, now the default
- `TableStyle.minimal()` - no borders, just spacing (implemented)
- `TableStyle.pipe_table()` - markdown style with borders (implemented)
- `TableStyle.ascii_clean()` - no Unicode, pure ASCII (planned)
- `TableStyle.compact()` - minimal spacing for dense data (planned)
### Timeline
These improvements are **planned for Phase 2** of the HTML parser optimization work (after memory leak fixes).
---
## Migration Guide
### From Rich to Fast
Fast rendering is now the default, so no code changes are required.
**Before** (when Rich was the default):
```python
doc = parse_html(html)
table_text = doc.tables[0].text()  # Used Rich: slower, sophisticated layout
```
**After** (fast rendering is the default):
```python
doc = parse_html(html)
table_text = doc.tables[0].text()  # Fast renderer: 7-10x faster, clean output
```
### Hybrid Approach
Use fast rendering during processing, Rich for final display:
```python
# Fast processing (the default - no configuration needed)
doc = parse_html(html)
# Extract data quickly
for table in doc.tables:
data = table.text() # Fast
# Process data...
# Display one table nicely
special_table = doc.tables[5]
rich_output = special_table.render() # Switch to Rich for display
```
---
## Performance Recommendations
### Recommended Settings by Use Case
**Batch Processing** (optimize for speed):
```python
config = ParserConfig.for_performance()
# Includes: fast_table_rendering=True, eager_section_extraction=False
```
**Data Extraction** (balance speed and accuracy):
```python
config = ParserConfig(
fast_table_rendering=True,
extract_xbrl=True,
detect_sections=True
)
```
**Display/Reports** (optimize for quality):
```python
config = ParserConfig(fast_table_rendering=False)  # Disable fast rendering to use Rich
# Or:
config = ParserConfig.for_accuracy()
```
---
## FAQ
**Q: Can I mix Fast and Rich rendering?**
A: Not per-table. The setting is document-wide via ParserConfig. However, you can manually call `table.render()` to get Rich output.
**Q: Does this affect section extraction?**
A: Indirectly, yes. Section detection calls `text()` on the entire document, which includes tables. Fast rendering speeds this up significantly.
**Q: Will the output format change?**
A: Yes, as we improve the renderer. We'll maintain backward compatibility via style options.
**Q: Can I customize the appearance?**
A: Yes. Three styles ship today: `TableStyle.simple()` (default), `TableStyle.pipe_table()`, and `TableStyle.minimal()`. More presets are planned.
**Q: What about DataFrame export?**
A: Fast rendering only affects text output. `table.to_dataframe()` is unaffected.
---
## Feedback
The fast renderer is actively being improved based on user feedback. Known issues:
1. **Pipe characters** - visual noise (applies to the optional pipe_table() style)
2. **Layout engine** - inconsistent spacing in some complex tables
3. **Padding** - needs tuning in edge cases
If you have specific rendering issues or suggestions, please provide:
- Sample table HTML
- Expected vs actual output
- Use case description
This helps prioritize improvements while maintaining the performance advantage.
---
## Summary
### Current State (As of 2025-10-08)
**Performance**: ✅ Excellent (7-10x faster than Rich)
**Correctness**: ✅ Production ready (proper colspan/rowspan handling)
**Visual Quality**: ⚠️ Good (simple() style matches Rich's box.SIMPLE appearance)
**Use Case**: Production-ready for all use cases
### Recent Milestones
**✅ Completed**:
- Core fast rendering implementation
- TableStyle.simple() preset (borderless, clean)
- Column filtering and merging
- Numeric alignment detection
- **Colspan/rowspan support via TableMatrix**
- **Performance benchmarking with real tables**
**🔧 Current Limitations**:
- Extra spacing between currency symbols and values in some complex tables
- Column width calculation may differ slightly from Rich
- Layout engine not as sophisticated as Rich (acceptable for the speed gain)
### Development Roadmap
**Phase 1** (✅ COMPLETED):
- ✅ Core fast rendering implementation
- ✅ Simple() style matching Rich's box.SIMPLE
- ✅ Proper colspan/rowspan handling via TableMatrix
- ✅ Production-ready performance (8-10x faster)
**Phase 2** (Future Enhancements):
- 📋 Improve multi-row header handling
- 📋 Better layout engine for perfect column widths
- 📋 Additional style presets
- 📋 Advanced header detection (data vs labels)
### Bottom Line
Fast table rendering is **production-ready and now the default** for all table text extraction in EdgarTools.
**Benefits**:
- ✅ 7-10x faster than Rich rendering
- ✅ Correct data extraction with proper colspan/rowspan handling
- ✅ Multi-row header preservation
- ✅ Multi-line cell rendering
- ✅ Clean, borderless appearance (simple() style)
**Minor differences from Rich**:
- ⚠️ Some tables have extra spacing between currency symbols and values (e.g., table 22)
- ⚠️ Column width calculation may differ slightly in complex tables
- ✅ All data is preserved and correct - only visual presentation differs
The implementation achieves **correct data extraction** with **significant performance gains** and **clean visual output**, making it the ideal default for EdgarTools.
---
## Related Documentation
- [HTML Parser Status](HTML_PARSER_STATUS.md) - Overall parser progress
- [Performance Analysis](../perf/hotpath_analysis.md) - Profiling results showing Rich rendering bottleneck
- [Memory Analysis](../perf/memory_analysis.md) - Memory leak issues with Rich objects

---
# Goals
## Mission
Replace `edgar.files` with a parser that is better in **every way** - utility, accuracy, and user experience. The maintainer is the final judge: output must look correct when printed.
## Core Principles
### Primary Goal: AI Context Optimization
- **Token efficiency**: 30-50% reduction vs raw HTML while preserving semantic meaning
- **Chunking support**: Enable independent processing of sections/tables for LLM context windows
- **Clean text output**: Tables rendered in LLM-friendly formats (clean text, markdown)
- **Semantic preservation**: Extract meaning, not just formatting
### Secondary Goal: Human Readability
- **Rich console output**: Beautiful rendering with proper table alignment
- **Markdown export**: Professional-looking document conversion
- **Section navigation**: Easy access to specific Items/sections
## User-Focused Feature Goals
### 1. Text Extraction
- Extract full document text without dropping meaningful content
- Preserve paragraph structure and semantic whitespace
- Handle inline XBRL facts gracefully (show values, not raw tags)
- Clean HTML artifacts automatically (scripts, styles, page numbers)
- **Target**: 99%+ accuracy vs manual reading
### 2. Section Extraction (10-K, 10-Q, 8-K)
- Detect >90% of standard sections for >90% of test tickers
- Support flexible access: `doc.sections['Item 1A']`, `doc['1A']`, `doc.risk_factors`
- Return Section objects with `.text()`, `.tables`, `.search()` methods
- Include confidence scores and detection method metadata
- **Target**: Better recall than old parser (quantify with test suite)
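The flexible access contract can be illustrated with a tiny key-normalizing mapping. This is a hypothetical helper, not the actual Document implementation; it only shows the behavior the goal describes:

```python
import re

class Sections(dict):
    """Dict whose keys normalize 'Item 1A', '1a', 'ITEM 1A' to one entry."""

    @staticmethod
    def _norm(key: str) -> str:
        key = re.sub(r"^item\s+", "", key.strip().lower())
        return key.upper()

    def __setitem__(self, key, value):
        super().__setitem__(self._norm(key), value)

    def __getitem__(self, key):
        return super().__getitem__(self._norm(key))

# Usage: all three spellings resolve to the same section
sections = Sections()
sections["Item 1A"] = "Risk Factors section object"
```

Normalizing at the mapping boundary keeps the public API forgiving without scattering string handling through the parser.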
### 3. Table Extraction
- Extract all meaningful data tables (ignore pure layout tables)
- Accurate rendering with aligned columns and proper formatting
- Handle complex tables (rowspan, colspan, nested headers)
- Preserve table captions and surrounding context
- Support DataFrame conversion for data analysis
- **Target**: 95%+ accuracy on test corpus
### 4. Search Capabilities
- Text search within documents
- Regex pattern matching
- Semantic search preparation (structure for embedding-based search)
- Search within sections for focused queries
### 5. Multiple Output Formats
- Plain text (optimized for LLM context)
- Markdown (for documentation/sharing)
- Rich console (beautiful terminal display)
- JSON (structured data export)
### 6. Developer Experience
- Intuitive API: `doc.text()`, `doc.tables`, `doc.sections`
- Rich objects with useful methods (not just strings)
- Simple tasks simple, complex tasks possible
- Helpful error messages with recovery suggestions
- **Target**: New users productive in <10 minutes
## Performance Targets
### Speed Benchmarks (Based on Current Performance)
- **Small docs (<5MB)**: <500ms ✅ *Currently 96ms - excellent*
- **Medium docs (5-20MB)**: <2s ✅ *Currently 1.19s - excellent*
- **Large docs (>50MB)**: <10s ✅ *Currently 0.59s - excellent*
- **Throughput**: >3MB/s sustained ✅ *Currently 3.8MB/s*
- **Target**: Maintain or improve on all benchmarks
### Memory Efficiency
- **Small docs (<5MB)**: <3x document size *(currently 9x - needs optimization)*
- **Large docs (>10MB)**: <2x document size *(currently 1.9x - good)*
- **No memory spikes**: Never exceed 5x document size *(MSFT currently 5.4x)*
- **Target**: Consistent 2-3x overhead across all document sizes
### Accuracy Benchmarks
- **Section detection recall**: >90% on 20-ticker test set
- **Table extraction accuracy**: >95% on manual validation set
- **Text fidelity**: >99% semantic equivalence to source HTML
- **XBRL fact extraction**: 100% of inline facts captured correctly
## Implementation Details
### HTML Parsing
- Read the entire HTML document without dropping semantically meaningful content
- Drop non-meaningful content (scripts, styles, pure formatting tags)
- Preserve semantic structure (headings, paragraphs, lists)
- Handle both old (pre-2015) and modern (inline XBRL) formats
- Graceful degradation for malformed HTML
### Table Parsing
- Extract tables containing meaningful data
- Ignore layout tables (unless they aid document understanding)
- Accurate rendering with proper column alignment
- Handle complex structures: rowspan, colspan, nested headers, multi-level headers
- Preserve table captions and contextual information
- Support conversion to pandas DataFrame
### Section Extraction
- Detect standard sections (Item 1, 1A, 7, etc.) for 10-K, 10-Q, 8-K filings
- Support multiple detection strategies: TOC-based, heading-based, pattern-based
- Return Section objects with full API: `.text()`, `.text_without_tables()`, `.tables`, `.search()`
- Include metadata: confidence scores, detection method, position
- Better recall than old parser (establish baseline with test suite)
## Quality Gates Before Replacing edgar.files
### Automated Tests
- [ ] All existing tests pass with new parser (1000+ tests)
- [ ] Performance regression tests (<5% slower on any document)
- [ ] Memory regression tests (no >10% increases)
- [ ] Section detection accuracy >90% on test corpus
- [ ] Table extraction accuracy >95% on validation set
### Manual Validation (Maintainer Review)
- [ ] Print full document text for 10 sample filings → verify quality
- [ ] Compare table rendering old vs new → verify improvement
- [ ] Test section extraction on edge cases → verify robustness
- [ ] Review markdown output → verify professional appearance
- [ ] Check memory usage → verify no concerning spikes
### Documentation Requirements
- [ ] Migration guide (old API → new API with examples)
- [ ] Updated user guide showing new features
- [ ] Performance comparison report (old vs new)
- [ ] Known limitations documented clearly
- [ ] API reference complete for all public methods
## Success Metrics
### Launch Criteria
1. **Speed**: Equal or faster on 95% of test corpus
2. **Accuracy**: Maintainer approves output quality on sample set
3. **API**: Clean, intuitive interface (no confusion)
4. **Tests**: Zero regressions, 95%+ coverage on new code
5. **Docs**: Complete with examples for all major use cases
### Post-Launch Monitoring
- Issue reports: <5% related to parser quality/accuracy
- User feedback: Positive sentiment on ease of use
- Performance: No degradation over time (regression tests)
- Adoption: Smooth migration from old parser (deprecation path)
## Feature Parity with Old Parser
### Must-Have (Required for Migration)
- ✅ Get document text (with/without tables)
- ✅ Extract specific sections by name/number
- ✅ List all tables in document
- ✅ Search document content
- ✅ Convert to markdown
- ✅ Handle both old and new SEC filing formats
- ✅ Graceful error handling
### Nice-to-Have (Improvements Over Old Parser)
- 🎯 Semantic search capabilities
- 🎯 Better subsection extraction within Items
- 🎯 Table-of-contents navigation
- 🎯 Export to multiple formats (JSON, clean HTML)
- 🎯 Batch processing optimizations
- 🎯 Section confidence scores and metadata

---
# HTML Parser Rewrite Technical Overview
## Executive Summary
The `edgar/documents` module represents a comprehensive rewrite of the HTML parsing capabilities originally implemented in `edgar/files`. This new parser is designed to provide superior parsing accuracy, structured data extraction, and rendering quality for SEC filing documents. The rewrite introduces a modern, extensible architecture with specialized components for handling the complex structure of financial documents.
## Architecture Overview
### Core Components
#### 1. Document Object Model
The new parser introduces a sophisticated node-based document model:
- **Document**: Top-level container with metadata and sections
- **Node Hierarchy**: Abstract base classes for all document elements
- `DocumentNode`: Root document container
- `TextNode`: Plain text content
- `ParagraphNode`: Paragraph elements with styling
- `HeadingNode`: Headers with levels 1-6
- `ContainerNode`: Generic containers (div, section)
- `SectionNode`: Document sections with semantic meaning
- `ListNode`/`ListItemNode`: Ordered and unordered lists
- `LinkNode`: Hyperlinks with metadata
- `ImageNode`: Images with attributes
#### 2. Table Processing System
Advanced table handling represents a major improvement over the old parser:
- **TableNode**: Sophisticated table representation with multi-level headers
- **Cell**: Individual cell with colspan/rowspan support and type detection
- **Row**: Table row with header detection and semantic classification
- **TableMatrix**: Handles complex cell spanning and alignment
- **CurrencyColumnMerger**: Intelligently merges currency symbols with values
- **ColumnAnalyzer**: Detects spacing columns and optimizes layout
#### 3. Parser Pipeline
The parsing process follows a well-defined pipeline:
1. **HTMLParser**: Main orchestration class
2. **HTMLPreprocessor**: Cleans and normalizes HTML
3. **DocumentBuilder**: Converts HTML tree to document nodes
4. **Strategy Pattern**: Pluggable parsing strategies
5. **DocumentPostprocessor**: Final cleanup and optimization
### Key Improvements Over Old Parser
#### Table Processing Enhancements
**Old Parser (`edgar/files`)**:
- Basic table extraction using BeautifulSoup
- Limited colspan/rowspan handling
- Simple text-based rendering
- Manual column alignment
- Currency symbols often misaligned
**New Parser (`edgar/documents`)**:
- Advanced table matrix system for perfect cell alignment
- Intelligent header detection (multi-row headers, year detection)
- Automatic currency column merging ($1,234 instead of $ | 1,234)
- Semantic table type detection (FINANCIAL, METRICS, TOC, etc.)
- Rich table rendering with proper formatting
- Smart column width calculation
- Enhanced numeric formatting with comma separators
#### Document Structure
**Old Parser**:
- Flat block-based structure
- Limited semantic understanding
- Basic text extraction
**New Parser**:
- Hierarchical node-based model
- Semantic section detection
- Rich metadata preservation
- XBRL fact extraction
- Search capabilities
- Multiple output formats (text, markdown, JSON, pandas)
#### Rendering Quality
**Old Parser**:
- Basic text output
- Limited table formatting
- No styling preservation
**New Parser**:
- Multiple renderers (text, markdown, Rich console)
- Preserves document structure and styling
- Configurable output options
- LLM-optimized formatting
## Implementation Details
### Configuration System
The new parser uses a comprehensive configuration system:
```python
@dataclass
class ParserConfig:
# Size limits
max_document_size: int = 50 * 1024 * 1024 # 50MB
streaming_threshold: int = 10 * 1024 * 1024 # 10MB
# Processing options
preserve_whitespace: bool = False
detect_sections: bool = True
extract_xbrl: bool = True
table_extraction: bool = True
detect_table_types: bool = True
```
### Strategy Pattern Implementation
The parser uses pluggable strategies for different aspects:
- **HeaderDetectionStrategy**: Identifies document sections
- **TableProcessor**: Handles table extraction and classification
- **XBRLExtractor**: Extracts XBRL facts and metadata
- **StyleParser**: Processes CSS styling information
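A minimal sketch of the pluggable-strategy shape (the real strategy classes in `edgar.documents` may use different signatures):

```python
import re
from typing import Protocol

class HeaderDetector(Protocol):
    def detect(self, line: str) -> bool: ...

ITEM_PATTERN = re.compile(r"^\s*item\s+\d+[a-z]?\b", re.IGNORECASE)

class PatternHeaderDetector:
    """Example strategy: lines starting with 'Item N' are headers."""
    def detect(self, line: str) -> bool:
        return bool(ITEM_PATTERN.match(line))

def find_headers(lines, detector: HeaderDetector):
    # The caller depends only on the protocol, so strategies are swappable
    return [line for line in lines if detector.detect(line)]
```

Swapping in, say, a TOC-based detector requires no change to the calling code, which is the point of the pattern.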
### Table Processing Deep Dive
The table processing system represents the most significant improvement:
#### Header Detection Algorithm
- Analyzes cell content patterns (th vs td elements)
- Detects year patterns in financial tables
- Identifies period indicators (quarters, fiscal years)
- Handles multi-row headers with units and descriptions
- Prevents misclassification of data rows as headers
#### Cell Type Detection
- Numeric vs text classification
- Currency value recognition
- Percentage handling
- Em dash and null value detection
- Proper number formatting with thousand separators
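These rules can be sketched as a small classifier (simplified; the real detector handles more formats and locales):

```python
import re

def classify_cell(text):
    """Toy version of the cell-type rules listed above."""
    t = text.strip()
    if t in ("", "—", "-", "–"):          # em dash / null values
        return "empty"
    if t.endswith("%"):
        return "percentage"
    if t.startswith(("$", "($")):          # currency, possibly negative
        return "currency"
    if re.fullmatch(r"\(?[\d,]+(\.\d+)?\)?", t):  # 1,234 or (3,408)
        return "numeric"
    return "text"
```

Parenthesized values are treated as numeric (accounting-style negatives), which matters for correct right-alignment.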
#### Matrix Building
- Handles colspan and rowspan expansion
- Maintains cell relationships
- Optimizes column layout
- Removes spacing columns automatically
### XBRL Integration
The new parser includes sophisticated XBRL processing:
- Extracts facts before preprocessing to preserve ix:hidden content
- Maintains metadata relationships
- Supports inline XBRL transformations
- Preserves semantic context
## Performance Characteristics
### Memory Efficiency
- Streaming support for large documents (>10MB)
- Lazy loading of document sections
- Caching for repeated operations
- Memory-efficient node representation
### Processing Speed
- Optimized HTML parsing with lxml
- Configurable processing strategies
- Parallel extraction capabilities
- Smart caching of expensive operations
## Migration and Compatibility
### API Compatibility
The new parser maintains high-level compatibility with the old parser while offering enhanced functionality:
```python
# Old way
from edgar.files import FilingDocument
doc = FilingDocument(html)
text = doc.text()
# New way
from edgar.documents import HTMLParser
parser = HTMLParser()
doc = parser.parse(html)
text = doc.text()
```
### Feature Parity
All major features from the old parser are preserved:
- Text extraction
- Table conversion to DataFrame
- Section detection
- Metadata extraction
### Enhanced Features
New capabilities not available in the old parser:
- Rich console rendering
- Markdown export
- Advanced table semantics
- XBRL fact extraction
- Document search
- LLM optimization
- Multiple output formats
## Current Status and Next Steps
### Completed Components
- ✅ Core document model
- ✅ HTML parsing pipeline
- ✅ Advanced table processing
- ✅ Multiple renderers (text, markdown, Rich)
- ✅ XBRL extraction
- ✅ Configuration system
- ✅ Streaming support
### Remaining Work
- 🔄 Performance optimization and benchmarking
- 🔄 Comprehensive test coverage migration
- 🔄 Error handling improvements
- 🔄 Documentation and examples
- 🔄 Validation against large corpus of filings
### Testing Strategy
The rewrite requires extensive validation:
- Comparison testing against old parser output
- Financial table accuracy verification
- Performance benchmarking
- Edge case handling
- Integration testing with existing workflows
## Conclusion
The `edgar/documents` rewrite represents a significant advancement in SEC filing processing capabilities. The new architecture provides:
1. **Better Accuracy**: Advanced table processing and semantic understanding
2. **Enhanced Functionality**: Multiple output formats and rich rendering
3. **Improved Maintainability**: Clean, modular architecture with clear separation of concerns
4. **Future Extensibility**: Plugin architecture for new parsing strategies
5. **Performance**: Streaming support and optimized processing for large documents
The modular design ensures that improvements can be made incrementally while maintaining backward compatibility. The sophisticated table processing system alone represents a major advancement in handling complex financial documents accurately.

---
# HTML Parser Quality Improvement Strategy
## Overview
Simple, iterative testing strategy for the HTML parser rewrite. The goal is rapid feedback loops where we compare OLD vs NEW parser output, identify visual/functional issues, fix them, and repeat until satisfied.
## Test Corpus
### 10 Representative Documents
Selected to cover different filing types, companies, and edge cases:
| # | Company | Filing Type | File Path | Rationale |
|---|---------|-------------|-----------|-----------|
| 1 | Apple | 10-K | `data/html/Apple.10-K.html` | Large complex filing, existing test file |
| 2 | Oracle | 10-K | `data/html/Oracle.10-K.html` | Complex financials, existing test file |
| 3 | Nvidia | 10-K | `data/html/Nvidia.10-K.html` | Tech company, existing test file |
| 4 | Microsoft | 10-K | `data/html/Microsoft.10-K.html` | Popular company, complex tables |
| 5 | Tesla | 10-K | `data/html/Tesla.10-K.html` | Manufacturing sector, different formatting |
| 6 | [TBD] | 10-Q | TBD | Quarterly report format |
| 7 | [TBD] | 10-Q | TBD | Another quarterly for variety |
| 8 | [TBD] | 8-K | `data/html/BuckleInc.8-K.html` | Event-driven filing |
| 9 | [TBD] | Proxy (DEF 14A) | TBD | Proxy statement with compensation tables |
| 10 | [TBD] | Edge case | TBD | Unusual formatting or very large file |
**Note**: Fill in TBD entries as we identify good test candidates.
## The 4-Step Loop
### Step 1: Run Comparison
Use existing test scripts to compare OLD vs NEW parsers:
```bash
# Full comparison with metrics
python tests/manual/check_parser_comparison.py
# Table-focused comparison with rendering
python tests/manual/check_tables.py
# Or run on specific file
python tests/manual/check_html_rewrite.py
```
**Outputs to review**:
- Console output with side-by-side Rich panels
- Metrics (parse time, table count, section detection)
- Rendered tables (old vs new)
### Step 2: Human Review
**Visual Inspection Process**:
1. Look at console output directly (Rich rendering)
2. For detailed text comparison, optionally dump to files:
- OLD parser: `doc.text()``output/old_apple.txt`
- NEW parser: `doc.text()``output/new_apple.txt`
- Use `diff` or visual diff tool
3. Take screenshots for complex table issues
4. Focus on:
- Table alignment and formatting
- Currency symbol placement (should be merged: `$1,234` not `$ | 1,234`)
- Column count (fewer is better after removing spacing columns)
- Section detection accuracy
- Text readability for LLM context
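The dump-and-diff part of this workflow can be scripted with the standard library. A hypothetical helper; `old_text`/`new_text` would come from `doc.text()` under each parser:

```python
import difflib
from pathlib import Path

def dump_and_diff(old_text, new_text, name, out_dir="output"):
    """Write OLD/NEW parser text to files and return a unified diff."""
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    (out / f"old_{name}.txt").write_text(old_text)
    (out / f"new_{name}.txt").write_text(new_text)
    return "\n".join(difflib.unified_diff(
        old_text.splitlines(), new_text.splitlines(),
        fromfile=f"old_{name}", tofile=f"new_{name}", lineterm=""))
```

The files remain on disk for a visual diff tool, while the returned string gives a quick in-terminal summary.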
**Quality Criteria** (from goals.md):
- Semantic meaning preserved
- Tables render correctly when printed
- Better than old parser in speed, accuracy, features
- **You are the final judge**: "Does this look right?"
### Step 3: Document Bugs
Record issues in the tracker below as you find them:
| Bug # | Status | Priority | Description | File/Location | Notes |
|-------|--------|----------|-------------|---------------|-------|
| Example | Fixed | High | Currency symbols not merging in balance sheet | Apple 10-K, Table 5 | Issue in CurrencyColumnMerger |
| | | | | | |
| | | | | | |
| | | | | | |
**Status values**: Open, In Progress, Fixed, Won't Fix, Deferred
**Priority values**: Critical, High, Medium, Low
**Bug Description Template**:
- What's wrong: Clear description of the issue
- Where: Which file/table/section
- Expected: What it should look like
- Actual: What it currently looks like
- Impact: How it affects usability/readability
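For longer sessions, the same template can be kept as structured data instead of a markdown row. A minimal sketch whose field names mirror the table and template above (not an existing module in the codebase):

```python
from dataclasses import dataclass

@dataclass
class Bug:
    """One tracker row; fields mirror the bug description template."""
    description: str           # What's wrong
    location: str              # Where: file/table/section
    expected: str              # What it should look like
    actual: str                # What it currently looks like
    priority: str = 'Medium'   # Critical / High / Medium / Low
    status: str = 'Open'       # Open / In Progress / Fixed / Won't Fix / Deferred

    def as_row(self) -> str:
        """Render as a markdown row for the tracker table."""
        return f'| {self.status} | {self.priority} | {self.description} | {self.location} |'

bug = Bug('Currency symbols not merging in balance sheet',
          'Apple 10-K, Table 5',
          expected='$1,234', actual='$ | 1,234', priority='High')
print(bug.as_row())
```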
### Step 4: Fix & Repeat
1. Pick highest priority bug
2. Fix the code
3. Re-run comparison on affected file(s)
4. Verify the fix doesn't break other files
5. Mark bug as Fixed
6. Repeat until exit criteria met
**Quick verification**:
```bash
# Re-run just the problematic file
python -c "
from edgar.documents import parse_html
from pathlib import Path
html = Path('data/html/Apple.10-K.html').read_text()
doc = parse_html(html)
# Quick inspection
print(f'Tables: {len(doc.tables)}')
print(doc.tables[5].render(width=200)) # Check specific table
"
```
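Step 4's "doesn't break other files" check can be batched over the fixture directory. A sketch with the parser passed in as a callable (directory layout assumed from the fixtures listed earlier):

```python
from pathlib import Path

def batch_check(parse, fixture_dir='data/html'):
    """Run a parser over every fixture and collect failures, so a
    bug fix can be verified against the whole corpus rather than
    only the file it was found in."""
    failures = []
    for path in sorted(Path(fixture_dir).glob('*.html')):
        try:
            parse(path.read_text())
        except Exception as exc:
            failures.append((path.name, exc))
    return failures

# Usage with the new parser (API as in the snippet above):
#   from edgar.documents import parse_html
#   for name, exc in batch_check(parse_html):
#       print(f'REGRESSION {name}: {exc}')
```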
## Exit Criteria
We're done when:
1. ✅ All 10 test documents parse successfully
2. ✅ Visual output looks correct (maintainer approval)
3. ✅ Tables render cleanly with proper alignment
4. ✅ No critical or high priority bugs remain
5. ✅ Performance is equal or better than old parser
6. ✅ Text extraction is complete and clean for AI context
**Final approval**: Maintainer says "This is good enough to ship."
## Testing Infrastructure
### Primary Tool: compare_parsers.py
Simple command-line tool for the quality improvement loop:
```bash
# Quick overview comparison (using shortcuts!)
python tests/manual/compare_parsers.py aapl
# See all tables in a document
python tests/manual/compare_parsers.py aapl --tables
# Compare specific table (OLD vs NEW side-by-side)
python tests/manual/compare_parsers.py aapl --table 5
# Compare text extraction
python tests/manual/compare_parsers.py msft --text
# See section detection
python tests/manual/compare_parsers.py orcl --sections
# Test with 10-Q filings
python tests/manual/compare_parsers.py 'aapl 10-q'
# Run all test files at once
python tests/manual/compare_parsers.py --all
```
**Shortcuts available**:
- Companies: `aapl`, `msft`, `tsla`, `nvda`, `orcl`
- Filing types: `10-k` (default), `10-q`, `8-k`
- Or use full file paths
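The shortcut scheme could be implemented roughly as below. This is a hypothetical sketch; the real mapping lives in `compare_parsers.py`, and the exact fixture filenames may differ:

```python
# Assumed ticker-to-fixture mapping; illustrative only.
COMPANIES = {'aapl': 'Apple', 'msft': 'Microsoft', 'tsla': 'Tesla',
             'nvda': 'Nvidia', 'orcl': 'Oracle'}

def resolve(arg: str) -> str:
    """Map 'aapl' or 'aapl 10-q' to a fixture path; anything that
    is not a known ticker is treated as a literal file path."""
    parts = arg.lower().split()
    if parts[0] not in COMPANIES:
        return arg
    form = parts[1].upper() if len(parts) > 1 else '10-K'
    return f'data/html/{COMPANIES[parts[0]]}.{form}.html'

print(resolve('aapl'))       # data/html/Apple.10-K.html
print(resolve('aapl 10-q'))  # data/html/Apple.10-Q.html
```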
**Features**:
- Clean command-line interface
- Side-by-side OLD vs NEW comparison
- Rich console output with colors and tables
- Performance metrics
- Individual table inspection
### Other Available Scripts
Additional tools for specific testing:
- `tests/manual/check_parser_comparison.py` - Full comparison with metrics
- `tests/manual/check_tables.py` - Table-specific comparison with rendering
- `tests/manual/check_html_rewrite.py` - General HTML parsing checks
- `tests/manual/check_html_parser_real_files.py` - Real filing tests
## Quick Reference
For day-to-day testing commands and usage examples, see [TESTING.md](TESTING.md).
## Notes
- **Keep it simple**: This is about rapid iteration, not comprehensive automation
- **Visual inspection is key**: Automated metrics don't catch layout/formatting issues
- **Use screenshots**: When describing bugs, screenshots speak louder than words
- **Iterative approach**: Don't try to fix everything at once, prioritize
- **Trust your judgment**: If it looks wrong, it probably is wrong
## Bug Tracker
### Active Issues
(Add bugs here as they're discovered)
### Fixed Issues
(Move completed bugs here for history)
### Deferred Issues
(Issues that aren't blocking release but could be improved later)
---
**Status**: Initial draft
**Last Updated**: 2025-10-07
**Maintainer**: Dwight Gunning