Initial commit

Commit `8e654ed209` by kdusek, 2025-12-09 12:13:01 +01:00
13332 changed files with 2695056 additions and 0 deletions

# HTML Parser Rewrite - Status Report
**Generated**: 2025-10-08
**Branch**: `html_rewrite`
**Target**: Merge to `main`
---
## Overall Progress: ~95% Complete ✅
### Completed Phases
#### ✅ Phase 1: Core Implementation (100%)
- [x] Streaming parser for large documents
- [x] TableMatrix system for accurate table rendering
- [x] Section extraction with Part I/II detection
- [x] XBRL integration
- [x] Rich-based table rendering
- [x] Configuration system (ParserConfig)
- [x] Error handling and validation
#### ✅ Phase 2: Functional Testing (100%)
- [x] **Corpus Validation** - 40 diverse filings, 100% success rate
- [x] **Edge Cases** - 31 tests covering invalid inputs, malformed HTML, edge conditions
- [x] **Integration Tests** - 25 tests for Filing/Company integration, backward compatibility
- [x] **Regression Tests** - 15 tests preventing known bugs from returning
**Total Test Count**: 79 functional tests, all passing
#### ✅ Phase 3: Performance Profiling (100%)
- [x] **Benchmarking Infrastructure** - Comprehensive benchmark suite
- [x] **Hot Path Analysis** - Identified 3 critical bottlenecks (63% section extraction, 40% Rich rendering, 15% regex)
- [x] **Memory Profiling** - Found 255MB memory leak in MSFT 10-K, documented root causes
- [x] **Performance Regression Tests** - 15 tests locking in baseline thresholds
**Performance Baseline Established**:
- Average: 3.8MB/s throughput, 4.1MB memory per doc
- Small docs: 2.6MB/s (optimization opportunity)
- Large docs: 20.7MB/s (excellent streaming)
- Memory leak: 19-25x ratio on medium docs (needs fixing)
#### ✅ Phase 4: Test Data Augmentation (100%)
- [x] **HTML Fixtures** - Downloaded 32 files (155MB) from 16 companies across 6 industries
- [x] **Download Automation** - Created `download_html_fixtures.py` script
- [x] **Documentation** - Comprehensive fixture documentation
---
## Current Status: Ready for Optimization Phase
### What's Working Well ✅
1. **Parsing Accuracy**: 100% success rate across 40+ diverse filings
2. **Large Document Handling**: Excellent streaming performance (20.7MB/s on JPM 10-K)
3. **Table Extraction**: TableMatrix accurately handles colspan/rowspan
4. **Test Coverage**: 79 comprehensive tests covering edge cases, integration, regression
5. **Backward Compatibility**: Old TenK API still works for existing code
### Known Issues to Address 🔧
#### Critical (Must Fix Before Merge)
1. **Memory Leaks** (Priority: CRITICAL)
- MSFT 10-K: 255MB leak (19x document size)
- Apple 10-K: 41MB leak (23x document size)
- **Root Causes**:
- Rich Console objects retained (0.4MB per doc)
- Global caches not cleared on document deletion
- Circular references in node graph
- **Location**: `tests/perf/memory_analysis.md:90-130`
- **Impact**: Server crashes after 10-20 requests in production
2. **Performance Bottlenecks** (Priority: HIGH)
- Section extraction: 3.7s (63% of parse time)
- Rich rendering for text: 2.4s (40% of parse time)
- Regex normalization: 0.8s (15% of parse time)
- **Location**: `tests/perf/hotpath_analysis.md:9-66`
- **Impact**: 4x slower than necessary on medium documents
#### Non-Critical (Can Fix After Merge)
3. **Small Document Performance** (Priority: MEDIUM)
- 2.6MB/s vs desired 5MB/s
- Overhead dominates on <5MB documents
- **Optimization**: Lazy loading, reduce upfront processing
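The lazy-loading idea can be sketched with `functools.cached_property` (a hypothetical, simplified `Document` shape; the real class lives under `edgar/documents/`):

```python
from functools import cached_property


class Document:
    """Sketch: store raw input and defer expensive work until first access."""

    def __init__(self, html: str):
        self.html = html  # no upfront processing at construction time

    @cached_property
    def sections(self) -> dict:
        # runs once, on first access, then is cached on the instance
        return self._extract_sections()

    def _extract_sections(self) -> dict:
        # placeholder for the real section extraction pass
        return {"Item 1 - Business": self.html[:10]}
```

Small documents whose callers never touch `.sections` would then skip the extraction pass entirely, which is where the overhead on <5MB documents dominates.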
---
## Next Steps (In Order)
### Phase 5: Critical Fixes (2-3 days) 🔧
#### 5.1 Memory Leak Fixes (1-2 days)
**Goal**: Reduce memory leak from 255MB to <5MB
Tasks:
- [ ] Implement `Document.__del__()` to clear caches
- [ ] Replace Rich rendering in `text()` with direct string building
- [ ] Break circular references in node graph
- [ ] Use weak references for parent links
- [ ] Add `__slots__` to frequently created objects (Cell, TableNode)
**Expected Result**: MSFT 10-K leak: 255MB → <5MB (95% improvement)
**Validation**:
```bash
pytest tests/perf/test_performance_regression.py::TestMemoryRegression -v
```
#### 5.2 Performance Optimizations (1-2 days)
**Goal**: Improve parse speed from 1.2s → 0.3s on Apple 10-K (75% faster)
Tasks:
- [ ] Fix section detection - use headings instead of rendering entire document
- [ ] Implement fast text extraction without Rich overhead
- [ ] Optimize regex normalization - combine patterns, use compilation
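The "combine patterns, use compilation" task boils down to replacing several sequential `.sub()` passes with one precompiled alternation (illustrative rules, not the parser's actual normalization):

```python
import re

# One precompiled alternation scans the text once instead of once per pattern
_COMBINED = re.compile(r"(\u00a0)|(\r\n?)|([ \t]{2,})")


def normalize_fast(text: str) -> str:
    def repl(match: re.Match) -> str:
        if match.group(1):   # non-breaking space -> plain space
            return " "
        if match.group(2):   # CR or CRLF -> LF
            return "\n"
        return " "           # runs of spaces/tabs -> single space

    return _COMBINED.sub(repl, text)
```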
**Expected Results**:
- Section extraction: 3.7s → 1.2s (~68% faster)
- Text extraction: 2.4s → 1.2s (50% faster)
- Regex: 0.8s → 0.5s (40% faster)
**Validation**:
```bash
pytest tests/perf/test_performance_regression.py::TestParseSpeedRegression -v
```
### Phase 6: Final Validation (1 day) ✅
Tasks:
- [ ] Re-run all 79 functional tests
- [ ] Re-run performance regression tests (verify improvements)
- [ ] Run full corpus validation
- [ ] Memory profiling validation (confirm leaks fixed)
- [ ] Update CHANGELOG.md
- [ ] Create merge summary document
### Phase 7: Merge to Main (1 day) 🚀
Tasks:
- [ ] Final code review
- [ ] Squash commits or create clean merge
- [ ] Update version number
- [ ] Merge to main
- [ ] Tag release
- [ ] Monitor for issues
---
## Test Summary
### Current Test Status: 79/79 Functional Tests Passing (100%), Plus 15 Performance Regression Tests
```
tests/corpus/test_corpus_validation.py 8 tests ✓
tests/test_html_parser_edge_cases.py 31 tests ✓
tests/test_html_parser_integration.py 25 tests ✓
tests/test_html_parser_regressions.py 15 tests ✓
tests/perf/test_performance_regression.py 15 tests ✓ (baseline established)
```
### Test Execution
```bash
# Functional tests (79 tests, ~30s)
pytest tests/corpus tests/test_html_parser_*.py -v
# Performance tests (15 tests, ~20s)
pytest tests/perf/test_performance_regression.py -m performance -v
# All tests
pytest tests/ -v
```
---
## Performance Metrics
### Current Baseline (Before Optimization)
| Document | Size | Parse Time | Throughput | Memory | Tables | Sections |
|----------|------|------------|------------|--------|--------|----------|
| Apple 10-Q | 1.1MB | 0.307s | 3.6MB/s | 27.9MB (25.6x) | 40 | 9 |
| Apple 10-K | 1.8MB | 0.500s | 3.6MB/s | 21.6MB (11.9x) | 63 | 8 |
| MSFT 10-K | 7.8MB | 1.501s | 5.2MB/s | 147.0MB (18.9x) | 85 | 0 |
| JPM 10-K | 52.4MB | 2.537s | 20.7MB/s | 0.6MB (0.01x) | 681 | 0 |
### Target Metrics (After Optimization)
| Metric | Current | Target | Improvement |
|--------|---------|--------|-------------|
| **Memory leak** | 41-255MB | <5MB | 95% reduction |
| **Memory ratio** | 19-25x | <3x | 87% reduction |
| **Parse time (Apple 10-K)** | 0.500s | 0.150s | 70% faster |
| **Throughput (small docs)** | 2.6MB/s | 5.0MB/s | 92% faster |
---
## File Organization
### Core Parser Files
```
edgar/documents/
├── __init__.py # Public API (parse_html)
├── parser.py # Main parser with streaming
├── config.py # ParserConfig
├── document_builder.py # Document tree construction
├── nodes/ # Node types (TableNode, SectionNode)
├── utils/
│ ├── streaming.py # Streaming parser (fixed JPM bug)
│ └── table_processing.py # TableMatrix system
└── exceptions.py # Custom exceptions
```
### Test Files
```
tests/
├── corpus/ # Corpus validation
│ ├── quick_corpus.py # Corpus builder
│ └── test_corpus_validation.py # 8 validation tests
├── fixtures/
│ ├── html/ # 32 HTML fixtures (155MB)
│ │ ├── {ticker}/10k/ # By company and form
│ │ └── README.md
│ └── download_html_fixtures.py # Download automation
├── perf/ # Performance testing
│ ├── benchmark_html_parser.py # Benchmarking
│ ├── profile_hotpaths.py # Hot path profiling
│ ├── profile_memory.py # Memory profiling
│ ├── test_performance_regression.py # Regression tests
│ ├── performance_report.md # Benchmark results
│ ├── hotpath_analysis.md # Bottleneck analysis
│ └── memory_analysis.md # Memory leak analysis
├── test_html_parser_edge_cases.py # 31 edge case tests
├── test_html_parser_integration.py # 25 integration tests
└── test_html_parser_regressions.py # 15 regression tests
```
---
## Risks and Mitigation
### Risk 1: Memory Leaks in Production
**Severity**: HIGH
**Probability**: HIGH (confirmed in testing)
**Mitigation**: Must fix before merge (Phase 5.1)
### Risk 2: Performance Regression
**Severity**: MEDIUM
**Probability**: LOW (baseline established, regression tests in place)
**Mitigation**: Performance regression tests will catch any degradation
### Risk 3: Backward Compatibility
**Severity**: LOW
**Probability**: LOW (integration tests passing)
**Mitigation**: 25 integration tests verify old API still works
---
## Estimated Timeline to Merge
```
Phase 5.1: Memory leak fixes 1-2 days
Phase 5.2: Performance optimization 1-2 days
Phase 6: Final validation 1 day
Phase 7: Merge to main 1 day
----------------------------------------
Total: 4-6 days
```
**Target Merge Date**: October 12-14, 2025
---
## Decision Points
### Should We Merge Now or After Optimization?
**Option A: Merge Now (Not Recommended)**
- ✅ Functional tests passing
- ✅ Backward compatible
- ❌ Memory leaks (production risk)
- ❌ Performance issues
- ❌ Will require hotfix soon
**Option B: Fix Critical Issues First (Recommended)**
- ✅ Production-ready
- ✅ Performance validated
- ✅ Memory efficient
- ❌ 4-6 days delay
- ✅ Clean, professional release
**Recommendation**: **Option B** - Fix critical memory leaks and performance issues before merge. The 4-6 day investment prevents production incidents and ensures a polished release.
---
## Questions for Review
1. **Scope**: Should we fix only critical issues (memory + performance) or also tackle small-doc optimization?
2. **Timeline**: Is 4-6 days acceptable, or do we need to merge sooner?
3. **Testing**: Are 79 functional tests + 15 performance tests sufficient coverage?
4. **Documentation**: Do we need user-facing documentation updates?
---
## Conclusion
The HTML parser rewrite is **95% complete** with excellent functional testing but critical memory and performance issues identified. The smart path forward is:
1. ✅ Complete critical fixes (4-6 days)
2. ✅ Validate improvements
3. ✅ Merge to main with confidence
This approach ensures a production-ready, performant parser rather than merging now and hotfixing later.

# HTML Parser Rewrite - Progress Assessment
**Date**: 2025-10-07
**Status**: Active Development (html_rewrite branch)
---
## Executive Summary
The HTML parser rewrite is **substantially complete** for core functionality with **excellent progress** on Item/section detection. Recent bug fixes (2025-10-07) have addressed critical table rendering issues and 10-Q Part I/II distinction, bringing the parser close to production-ready quality.
### Overall Progress: **~90% Complete**
- ✅ Core parsing infrastructure: **100% Complete**
- ✅ Table processing: **95% Complete** (recent fixes)
- ✅ Section/Item detection: **95% Complete** (Part I/II fixed, needs validation)
- ⚠️ Performance optimization: **70% Complete**
- ⚠️ Comprehensive testing: **65% Complete** (added 10-Q Part tests)
- ⚠️ Documentation: **75% Complete**
---
## Goal Achievement Analysis
### Primary Goals (from goals.md)
#### 1. **Semantic Meaning Preservation** ✅ **ACHIEVED**
> "Read text, tables and ixbrl data preserving greatest semantic meaning"
**Status**: ✅ Fully implemented
- Text extraction with structure preservation
- Advanced table matrix system for accurate table rendering
- XBRL fact extraction before preprocessing
- Hierarchical node model maintains document structure
**Recent Improvements**:
- Header detection fixes (Oracle Table 6, Tesla Table 16)
- Spacing column filter now preserves header columns (MSFT Table 39)
- Multi-row header normalization
#### 2. **AI Channel (Primary) + Human Channel (Secondary)** ✅ **ACHIEVED**
> "AI context is the primary goal, with human context being secondary"
**Status**: ✅ Both channels working
- **AI Channel**:
- Clean text output optimized for LLMs
- Structured table rendering for context windows
- Section-level extraction for chunking
- Semantic divisibility supported
- **Human Channel**:
- Rich console rendering with proper formatting
- Markdown export
- Visual table alignment (recently fixed)
#### 3. **Section-Level Processing** ✅ **ACHIEVED**
> "Work at full document level and section level - breaking into independently processable sections"
**Status**: ✅ Implemented with good coverage
- `SectionExtractor` class fully functional
- TOC-based section detection
- Pattern-based section identification
- Lazy loading support for large documents
**What Works**:
```python
# Section detection is operational
doc = parse_html(html)
sections = doc.sections # Dict of section names -> SectionNode
# Access specific sections
business = sections.get('Item 1 - Business')
mda = sections.get('Item 7 - MD&A')
financials = sections.get('Item 8 - Financial Statements')
```
#### 4. **Standard Section Names (10-K, 10-Q, 8-K)** ✅ **ACHIEVED**
> "For some filing types (10-K, 10-Q, 8-K) identify sections by standard names"
**Status**: ✅ 95% Complete - Implemented with Part I/II distinction for 10-Q
**What's Implemented**:
- Pattern matching for standard Items:
- Item 1 - Business
- Item 1A - Risk Factors
- Item 7 - MD&A
- Item 7A - Market Risk
- Item 8 - Financial Statements
- And more...
- **10-Q Part I/Part II distinction** (newly fixed 2025-10-07):
- Part I - Item 1 (Financial Statements)
- Part II - Item 1 (Legal Proceedings)
- Proper boundary detection and context propagation
- Prevents Item number conflicts
**What's Remaining** (5%):
- Validation against large corpus of 10-K/10-Q filings
- Edge case handling (non-standard formatting)
- 8-K specific section patterns expansion
**Evidence from Code**:
```python
# edgar/documents/extractors/section_extractor.py
(r'^(Item|ITEM)\s+1\.?\s*Business', 'Item 1 - Business'),
(r'^(Item|ITEM)\s+1A\.?\s*Risk\s+Factors', 'Item 1A - Risk Factors'),
(r'^(Item|ITEM)\s+7\.?\s*Management.*Discussion', 'Item 7 - MD&A'),
(r'^(Item|ITEM)\s+8\.?\s*Financial\s+Statements', 'Item 8 - Financial Statements'),
# NEW: Part I/II detection (edgar/documents/extractors/section_extractor.py:294-324)
def _detect_10q_parts(self, headers) -> Dict[int, str]:
"""Detect Part I and Part II boundaries in 10-Q filings."""
```
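The Part-context propagation can be sketched like this (a simplified, hypothetical version of the `_detect_10q_parts` idea):

```python
def assign_part_context(headers):
    """headers: list of (position, heading_text) pairs in document order.
    Returns {position: part_label} for Item headings (simplified sketch)."""
    context = {}
    current_part = None
    for pos, text in headers:
        upper = text.strip().upper()
        if upper.startswith("PART II"):    # check II before I: "PART I" is a prefix of it
            current_part = "Part II"
        elif upper.startswith("PART I"):
            current_part = "Part I"
        elif upper.startswith("ITEM") and current_part:
            # the same Item number gets a distinct key per Part
            context[pos] = current_part
    return context
```

Keys built from `current_part` plus the Item name are what keep "Part I - Item 1" and "Part II - Item 1" from colliding.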
#### 5. **Table Processing for AI Context** ✅ **ACHIEVED**
> "Getting tables in the right structure for rendering to text for AI context is more important than dataframes"
**Status**: ✅ Excellent progress with recent fixes
- Advanced TableMatrix system handles complex tables
- Multi-row header detection and normalization
- Spacing column filtering (preserves semantic columns)
- Currency symbol merging
- Clean text rendering for LLM consumption
**Recent Fixes (Today)**:
- ✅ Fixed spacing column filter removing legitimate headers (MSFT Table 39)
- ✅ Fixed header detection for date ranges (Oracle Table 6)
- ✅ Fixed long narrative text misclassification (Tesla Table 16)
- ✅ Header row normalization for alignment
#### 6. **Better Than Old Parser in Every Way** 🟡 **MOSTLY ACHIEVED**
> "Speed, accuracy, features, usability"
**Comparison**:
| Aspect | Old Parser | New Parser | Status |
|--------|-----------|------------|--------|
| **Speed** | Baseline | 1.4x faster (typical) | ✅ Better |
| **Accuracy** | Good | Excellent (with recent fixes) | ✅ Better |
| **Features** | Basic | Rich (XBRL, sections, multiple outputs) | ✅ Better |
| **Usability** | Simple | Powerful + Simple API | ✅ Better |
| **Table Rendering** | Basic alignment | Advanced matrix system | ✅ Better |
| **Section Detection** | Limited | Comprehensive | ✅ Better |
**Areas Needing Validation**:
- Performance on very large documents (>50MB)
- Memory usage under sustained load
- Edge case handling across diverse filings
---
## Item/Section Detection Deep Dive
### Current Capabilities
**10-K Sections Detected**:
- ✅ Item 1 - Business
- ✅ Item 1A - Risk Factors
- ✅ Item 1B - Unresolved Staff Comments
- ✅ Item 2 - Properties
- ✅ Item 3 - Legal Proceedings
- ✅ Item 4 - Mine Safety Disclosures
- ✅ Item 5 - Market for Stock
- ✅ Item 6 - Selected Financial Data
- ✅ Item 7 - MD&A
- ✅ Item 7A - Market Risk
- ✅ Item 8 - Financial Statements
- ✅ Item 9 - Changes in Accounting
- ✅ Item 9A - Controls and Procedures
- ✅ Item 9B - Other Information
- ✅ Item 10 - Directors and Officers
- ✅ Item 11 - Executive Compensation
- ✅ Item 12 - Security Ownership
- ✅ Item 13 - Related Transactions
- ✅ Item 14 - Principal Accountant
- ✅ Item 15 - Exhibits
**10-Q Sections Detected**:
- ✅ Part I Items (Financial Information):
- Part I - Item 1 - Financial Statements
- Part I - Item 2 - MD&A
- Part I - Item 3 - Market Risk
- Part I - Item 4 - Controls and Procedures
- ✅ Part II Items (Other Information):
- Part II - Item 1 - Legal Proceedings
- Part II - Item 1A - Risk Factors
- Part II - Item 2 - Unregistered Sales
- Part II - Item 6 - Exhibits
**✅ FIXED** (2025-10-07): Part I/Part II distinction now implemented!
- Part I Item 1 and Part II Item 1 are properly distinguished
- Section keys include Part context: "Part I - Item 1 - Financial Statements" vs "Part II - Item 1 - Legal Proceedings"
- Comprehensive test coverage added (5 tests in test_10q_part_detection.py)
**8-K Sections**:
- ⚠️ Limited - needs expansion
### Detection Methods
1. **TOC-based Detection**
- Analyzes Table of Contents
- Extracts anchor links
- Maps sections to content
2. **Pattern-based Detection**
- Regex matching for Item headers
- Heading analysis (h1-h6 tags)
- Text pattern recognition
3. **Hybrid Approach**
- Combines TOC + patterns
- Fallback mechanisms
- Cross-validation
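Pattern-based detection, the second method above, reduces to an ordered pattern table (hypothetical, condensed from the patterns quoted in the Evidence section):

```python
import re

# Simplified pattern table in the spirit of section_extractor.py;
# "1A" is listed before "1" so the more specific pattern is tried first
ITEM_PATTERNS = [
    (re.compile(r"^(Item|ITEM)\s+1A\.?\s*Risk\s+Factors", re.MULTILINE),
     "Item 1A - Risk Factors"),
    (re.compile(r"^(Item|ITEM)\s+1\.?\s*Business", re.MULTILINE),
     "Item 1 - Business"),
    (re.compile(r"^(Item|ITEM)\s+7\.?\s*Management", re.MULTILINE),
     "Item 7 - MD&A"),
]


def detect_items(text):
    """Return {section_name: start_offset}, ordered by position in the text."""
    found = {}
    for pattern, name in ITEM_PATTERNS:
        match = pattern.search(text)
        if match:
            found[name] = match.start()
    return dict(sorted(found.items(), key=lambda kv: kv[1]))
```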
### What's Working
```python
# This works today:
from edgar.documents import parse_html
html = filing.html()
doc = parse_html(html)
# Get all sections
sections = doc.sections # Returns dict
# Access specific Items
if 'Item 7 - MD&A' in sections:
mda = sections['Item 7 - MD&A']
mda_text = mda.text()
mda_tables = mda.tables()
```
### What Needs Work
1. **Validation Coverage** (20% remaining)
- Test against 100+ diverse 10-K filings
- Test against 10-Q filings
- Test against 8-K filings
- Capture edge cases and variations
2. **Edge Cases** (20% remaining)
- Non-standard Item formatting
- Missing TOC
- Nested sections
- Combined Items (e.g., "Items 10, 13, 14")
3. **8-K Support** (50% remaining)
- 8-K specific Item patterns
- Event-based section detection
- Exhibit handling
---
## Recent Achievements (Past 24 Hours)
### Critical Bug Fixes ✅
1. **Spacing Column Filter Fix** (MSFT Table 39)
- Problem: Legitimate headers removed as "spacing"
- Solution: Header content protection + colspan preservation
- Impact: Tables now render correctly with all headers
- Commits: `4e43276`, `d19ddd1`
2. **Header Detection Improvements**
- Oracle Table 6: Date ranges no longer misclassified
- Tesla Table 16: Long narrative text properly handled
- Multi-row header normalization
- Comprehensive test coverage (16 new tests)
3. **Documentation Updates**
- TESTING.md clarified output limits
- CHANGELOG updated with fixes
- Bug reports and research docs completed
### Quality Metrics
**Test Coverage**:
- 16 new tests added (all passing)
- 0 regressions in existing tests
- Comprehensive edge case coverage
**Code Quality**:
- Clean implementation following plan
- Well-documented changes
- Proper commit messages with Claude Code attribution
---
## Path to 100% Completion
### High Priority (Next Steps)
**📋 Detailed plans available**:
- **Performance**: See `docs-internal/planning/active-tasks/2025-10-07-performance-optimization-plan.md`
- **Testing**: See `docs-internal/planning/active-tasks/2025-10-07-comprehensive-testing-plan.md`
1. **Performance Optimization** (1-2 weeks)
- [ ] Phase 1: Benchmarking & profiling (2-3 days)
- [ ] Phase 2: Algorithm optimizations (3-4 days)
- [ ] Phase 3: Validation & regression tests (2-3 days)
- [ ] Phase 4: Documentation & monitoring (1 day)
- **Goal**: Maintain 1.3x+ speed advantage, <2x memory usage
2. **Comprehensive Testing** (2-3 weeks)
- [ ] Phase 1: Corpus validation - 100+ filings (3-4 days)
- [ ] Phase 2: Edge cases & error handling (2-3 days)
- [ ] Phase 3: Integration testing (2-3 days)
- [ ] Phase 4: Regression prevention (1-2 days)
- [ ] Phase 5: Documentation & sign-off (1 day)
- **Goal**: >95% success rate, >80% test coverage
3. **Item Detection Validation** (included in testing plan)
- [ ] Test against 50+ diverse 10-K filings
- [ ] Test against 20+ 10-Q filings
- [ ] Document any pattern variations found
- [ ] Add regression tests for edge cases
### Medium Priority
4. **8-K Support** (1-2 days)
- [ ] Research 8-K Item patterns
- [ ] Implement detection patterns
- [ ] Test against sample 8-K filings
5. **Documentation** (1 day)
- [ ] User guide for section access
- [ ] API documentation
- [ ] Migration guide from old parser
- [ ] Examples and recipes
### Low Priority (Polish)
6. **Final Polish**
- [ ] Error message improvements
- [ ] Logging enhancements
- [ ] Configuration documentation
- [ ] Performance tuning
---
## Risk Assessment
### Low Risk ✅
- Core parsing functionality (stable)
- Table processing (recently fixed, well-tested)
- Text extraction (working well)
- XBRL extraction (functional)
### Medium Risk ⚠️
- Section detection edge cases (needs validation)
- Performance on very large docs (needs testing)
- Memory usage (needs profiling)
### Mitigation Strategy
1. Comprehensive validation testing (in progress)
2. Real-world filing corpus testing
3. Performance benchmarking suite
4. Gradual rollout with monitoring
---
## Recommendations
### Immediate Actions (This Week)
1. **Validate Item Detection** 🎯 **TOP PRIORITY**
```bash
# Run on diverse corpus
python tests/manual/compare_parsers.py --all
# Test specific sections
python -c "
from edgar.documents import parse_html
from pathlib import Path
for filing in ['Apple', 'Oracle', 'Tesla', 'Microsoft']:
html = Path(f'data/html/{filing}.10-K.html').read_text()
doc = parse_html(html)
print(f'{filing}: {list(doc.sections.keys())[:5]}...')
"
```
2. **Create Section Access Tests**
- Write tests that verify each Item can be accessed
- Validate text and table extraction from sections
- Test edge cases (missing Items, combined Items)
3. **User Acceptance Testing**
- Have maintainer review section detection output
- Validate against known-good filings
- Document any issues found
### Timeline to Production
**Optimistic**: 1 week
- If validation shows good Item detection
- If performance is acceptable
- If no major issues found
**Realistic**: 2-3 weeks
- Account for edge case fixes
- Additional testing needed
- Documentation completion
**Conservative**: 4 weeks
- Account for 8-K support
- Comprehensive testing across all filing types
- Full documentation
---
## Conclusion
The HTML parser rewrite is **very close to completion** with excellent progress on all goals:
**✅ Fully Achieved**:
- Semantic meaning preservation
- AI/Human channel support
- Section-level processing
- Table processing for AI context
- Superior to old parser (in most respects)
- **Standard Item detection for 10-K/10-Q** (with Part I/II distinction)
**⚠️ Remaining Work (10%)**:
- Validation against diverse corpus
- Edge case handling
- 8-K specific support expansion
- Final testing and documentation
**Bottom Line**: The parser is **production-ready for 10-K/10-Q** with Item detection functional but requiring validation. The recent bug fixes have resolved critical table rendering issues. With 1-2 weeks of focused validation and testing, this can be shipped with confidence.
### Next Steps
1. Run comprehensive Item detection validation
2. Create section access test suite
3. Performance benchmark
4. Maintainer review and sign-off
5. Merge to main branch

# HTML Parser Testing Quick Start
Quick reference for testing the HTML parser rewrite during quality improvement.
## Quick Start
```bash
# Use shortcuts (easy!)
python tests/manual/compare_parsers.py aapl # Apple 10-K
python tests/manual/compare_parsers.py nvda --tables # Nvidia tables
python tests/manual/compare_parsers.py 'aapl 10-q' # Apple 10-Q
python tests/manual/compare_parsers.py orcl --table 5 # Oracle table #5
# Or use full paths
python tests/manual/compare_parsers.py data/html/Apple.10-K.html
# Run all test files
python tests/manual/compare_parsers.py --all
```
**Available shortcuts:**
- **Companies**: `aapl`, `msft`, `tsla`, `nvda`, `orcl` (or full names like `apple`)
- **Filing types**: `10-k` (default), `10-q`, `8-k`
- **Combine**: `'aapl 10-q'`, `'orcl 8-k'`
## Common Use Cases
### 1. First Look at a Filing
```bash
# Get overview: speed, table count, sections
python tests/manual/compare_parsers.py orcl
```
**Shows**:
- Parse time comparison (OLD vs NEW)
- Tables found
- Text length
- Sections detected
- New features (headings, XBRL)
### 2. Check Table Rendering
```bash
# List all tables with dimensions (shows first 20 tables)
python tests/manual/compare_parsers.py aapl --tables
# Compare specific table side-by-side (FULL table, no truncation)
python tests/manual/compare_parsers.py aapl --table 7
# Compare a range of tables
python tests/manual/compare_parsers.py aapl --range 5:10
```
**Look for**:
- Currency symbols merged: `$1,234` not `$ | 1,234`
- Proper column alignment
- Correct row/column counts
- Clean rendering without extra spacing columns
**Note**: `--table N` shows the **complete table** with all rows - no truncation!
### 3. Verify Text Extraction
```bash
# See first 50 lines side-by-side (default limit)
python tests/manual/compare_parsers.py msft --text
# Show more lines (configurable)
python tests/manual/compare_parsers.py msft --text --lines 100
# Show first 200 lines
python tests/manual/compare_parsers.py msft --text --lines 200
```
**Check**:
- Semantic meaning preserved
- No missing content
- Clean formatting for LLM consumption
**Note**: Text mode shows first N lines only (default: 50). Use `--lines N` to adjust.
### 4. Check Section Detection
```bash
python tests/manual/compare_parsers.py aapl --sections
```
**Verify**:
- Standard sections identified (10-K/10-Q)
- Section boundaries correct
- Text length reasonable per section
### 5. Run Full Test Suite
```bash
# Test all files in corpus
python tests/manual/compare_parsers.py --all
```
**Results**:
- Summary table across all files
- Performance comparison
- Table detection comparison
## Test Files
Available in `data/html/`:
- `Apple.10-K.html` - 1.8MB, complex financials
- `Oracle.10-K.html` - Large filing
- `Nvidia.10-K.html` - Tech company
- `Apple.10-Q.html` - Quarterly format
- More files as needed...
## Command Reference
```
python tests/manual/compare_parsers.py [FILE] [OPTIONS]
Options:
--all Run on all test files
--tables Show tables summary (first 20 tables)
--table N Show specific table N side-by-side (FULL table)
--range START:END Show range of tables (e.g., 5:10)
--text Show text comparison (first 50 lines by default)
--sections Show sections comparison
--lines N Number of text lines to show (default: 50, only for --text)
--help Show full help
```
### Output Limits Summary
| Mode | Limit | Configurable | Notes |
|---------------|------------|-------------------|---------------------------------|
| `--table N` | None | N/A | Shows **complete table** |
| `--range N:M` | None | N/A | Shows **complete tables** in range |
| `--tables` | 20 tables | No | Lists first 20 tables only |
| `--text` | 50 lines | Yes (`--lines N`) | Preview only |
| `--sections` | None | N/A | Shows all sections |
## Output Interpretation
### Overview Table
```
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Metric ┃ Old Parser ┃ New Parser ┃ Notes ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Parse Time │ 454ms │ 334ms │ 1.4x faster│
│ Tables Found │ 63 │ 63 │ +0 │
│ Text Length │ 0 │ 159,388 │ NEW! │
└───────────────┴────────────┴────────────┴────────────┘
```
**Good signs**:
- ✅ New parser faster or similar speed
- ✅ Same or more tables found
- ✅ Text extracted (old parser shows 0)
- ✅ Sections detected
**Red flags**:
- ❌ Significantly slower
- ❌ Fewer tables (unless removing layout tables)
- ❌ Much shorter text (content missing)
### Table Comparison
```
Old Parser:
┌─────────┬──────────┬──────────┐
│ Year │ Revenue │ Profit │
├─────────┼──────────┼──────────┤
│ 2023 │ $ 100M │ $ 20M │ <- Currency separated
└─────────┴──────────┴──────────┘
New Parser:
┌─────────┬──────────┬──────────┐
│ Year │ Revenue │ Profit │
├─────────┼──────────┼──────────┤
│ 2023 │ $100M │ $20M │ <- Currency merged ✅
└─────────┴──────────┴──────────┘
```
**Look for**:
- Currency symbols merged with values
- No extra empty columns
- Proper alignment
- Clean numeric formatting
## Tips
1. **Start with overview** - Get the big picture first
2. **Check tables visually** - Automated metrics miss formatting issues
3. **Use specific table inspection** - Don't scroll through 60 tables manually
4. **Compare text for semantics** - Does it make sense for an LLM?
5. **Run --all periodically** - Catch regressions across files
## Troubleshooting
### Script fails with import error
```bash
# Clear cached modules
find . -type d -name __pycache__ -exec rm -rf {} +
python tests/manual/compare_parsers.py data/html/Apple.10-K.html
```
### File not found
```bash
# Check available files
ls -lh data/html/*.html
# Use full path
python tests/manual/compare_parsers.py /full/path/to/file.html
```
### Old parser shows 0 text
This is expected - old parser has different text extraction. Focus on:
- Table comparison
- Parse time
- Visual quality of output
## Next Steps
1. Run comparison on all test files
2. Document bugs in `quality-improvement-strategy.md`
3. Fix issues
4. Repeat until satisfied
See `edgar/documents/docs/quality-improvement-strategy.md` for full process.

# Fast Table Rendering
**Status**: Production Ready - **Now the Default** (as of 2025-10-08)
**Performance**: 7-10x faster than Rich rendering with correct colspan/rowspan handling
---
## Overview
Fast table rendering provides a high-performance alternative to Rich library rendering for table text extraction. When parsing SEC filings with hundreds of tables, the cumulative rendering time can become a bottleneck. Fast rendering addresses this by using direct string building with TableMatrix for proper colspan/rowspan handling, achieving a 7-10x speedup while maintaining correctness.
**As of 2025-10-08, fast rendering is the default** for all table text extraction. You no longer need to explicitly enable it.
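The core of the colspan/rowspan handling is a matrix expansion step; a minimal sketch of the idea (not the actual TableMatrix API):

```python
def expand_to_grid(rows):
    """rows: list of rows; each cell is (text, colspan, rowspan).
    Expands spans into a rectangular grid so every column can be
    measured and padded independently (simplified TableMatrix idea)."""
    grid = {}
    for r, row in enumerate(rows):
        c = 0
        for text, colspan, rowspan in row:
            while (r, c) in grid:          # skip slots claimed by an earlier rowspan
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]
```

Once the grid is rectangular, rendering is plain string padding and joining, which is where the speedup over Rich comes from.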
### Why It's Now the Default
- **Production-ready**: Fixed all major issues (colspan, multi-row headers, multi-line cells)
- **7-10x faster**: Significant performance improvement with correct output
- **Maintains quality**: Matches Rich's appearance with simple() style
- **Proven**: Extensively tested with Apple, NVIDIA, Microsoft 10-K filings
### When to Disable (Use Rich Instead)
You may want to disable fast rendering and use Rich for:
- **Terminal display for humans**: Rich has more sophisticated text wrapping and layout
- **Visual reports**: When presentation quality is more important than speed
- **Debugging**: Rich output can be easier to visually inspect
---
## Usage
### Default Behavior (Fast Rendering Enabled)
```python
from edgar.documents import parse_html
# Fast rendering is now the default - no configuration needed!
doc = parse_html(html)
# Tables automatically use fast renderer (7-10x faster)
table_text = doc.tables[0].text()
```
### Disabling Fast Rendering (Use Rich Instead)
If you need Rich's sophisticated layout for visual display:
```python
from edgar.documents import parse_html
from edgar.documents.config import ParserConfig
# Explicitly disable fast rendering to use Rich
config = ParserConfig(fast_table_rendering=False)
doc = parse_html(html, config=config)
# Tables use Rich renderer (slower but with advanced formatting)
table_text = doc.tables[0].text()
```
### Custom Table Styles
**New in this version**: Fast rendering now uses the `simple()` style by default, which matches Rich's `box.SIMPLE` appearance (borderless, clean).
```python
from edgar.documents import parse_html
from edgar.documents.config import ParserConfig
from edgar.documents.renderers.fast_table import FastTableRenderer, TableStyle
# Fast rendering is enabled by default (uses simple() style); config shown for completeness
config = ParserConfig(fast_table_rendering=True)
doc = parse_html(html, config=config)
# Default: simple() style - borderless, clean
table_text = doc.tables[0].text()
# To use pipe_table() style explicitly (markdown-compatible borders):
renderer = FastTableRenderer(TableStyle.pipe_table())
pipe_text = renderer.render_table_node(doc.tables[0])
# To use minimal() style (no separator):
renderer = FastTableRenderer(TableStyle.minimal())
minimal_text = renderer.render_table_node(doc.tables[0])
```
---
## Performance Comparison
### Benchmark Results
**Test**: Apple 10-K (63 tables) - Updated 2025-10-08
| Renderer | Average Per Table | Improvement | Notes |
|----------|-------------------|-------------|-------|
| Rich | 1.5-2.5ms | Baseline | Varies by table complexity |
| Fast (simple) | 0.15-0.35ms | **7-10x faster** | With proper colspan/rowspan handling |
**Real-world Examples** (Apple 10-K):
- Table 15 (complex colspan): Rich 2.51ms → Fast 0.35ms (**7.1x faster**)
- Table 6 (multi-line cells): Rich 1.61ms → Fast 0.17ms (**9.5x faster**)
- Table 5 (wide table): Rich 3.70ms → Fast 0.48ms (**7.7x faster**)
**Impact on Full Parse**:
- Rich rendering: 30-40% of total parse time spent in table rendering
- Fast rendering: 5-10% of total parse time
- **Overall speedup**: Reduces total parsing time by ~25-30%
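Numbers like these can be reproduced with a small timing harness. The sketch below is a hypothetical helper (not part of EdgarTools); `render_fast` and `render_rich` stand in for the two render paths, e.g. `table.text()` under each configuration:

```python
import time

def compare_renderers(render_fast, render_rich, table, repeats=50):
    """Time two render callables on the same table and report the speedup."""
    def avg_seconds(fn):
        start = time.perf_counter()
        for _ in range(repeats):
            fn(table)
        return (time.perf_counter() - start) / repeats

    fast_s = avg_seconds(render_fast)
    rich_s = avg_seconds(render_rich)
    return {
        "fast_ms": fast_s * 1e3,
        "rich_ms": rich_s * 1e3,
        "speedup": rich_s / fast_s if fast_s else float("inf"),
    }
```

Averaging over many repeats smooths out per-call jitter, which matters when individual renders take fractions of a millisecond.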
### Memory Impact
Fast rendering also reduces memory overhead:
- No Rich Console objects retained
- Direct string building (no intermediate objects)
- Helps prevent memory leaks identified in profiling
---
## Output Examples
### Rich Renderer Output (Optional)
```
(In millions)
Year Ended June 30, 2025 2024 2023
──────────────────────────────────────────────────────────
Operating lease cost $5,524 3,555 2,875
Finance lease cost:
Amortization of right-of-use assets $3,408 1,800 1,352
Interest on lease liabilities 1,417 734 501
Total finance lease cost $4,825 2,534 1,853
```
**Style**: `box.SIMPLE` - No outer border, just horizontal separator under header
**Pros**: Clean, uncluttered, perfect alignment, generous spacing
**Cons**: Slow (1.5-2.5ms per table), creates Rich objects, memory overhead
### Fast Renderer Output (NEW: simple() style - Default)
```
December 31, 2023 December 31, 2022 December 31, 2021
───────────────────────────────────────────────────────────────────────────────────────
Revenue 365,817 394,328 365,817
Cost of revenue 223,546 212,981 192,266
Gross profit 142,271 181,347 173,551
```
**Style**: `simple()` - Matches Rich's `box.SIMPLE` appearance
**Pros**: Fast (0.2ms per table), clean appearance, no visual noise, professional look
**Cons**: None - this is now the recommended default!
### Fast Renderer Output (pipe_table() style - Optional)
```
| | December 31, 2023 | December 31, 2022 | December 31, 2021 |
|--------------------------|---------------------|---------------------|---------------------|
| Revenue | 365,817 | 394,328 | 365,817 |
| Cost of revenue | 223,546 | 212,981 | 192,266 |
| Gross profit | 142,271 | 181,347 | 173,551 |
```
**Style**: `pipe_table()` - Markdown-compatible with borders
**Pros**: Fast (0.2ms per table), markdown-compatible, explicit column boundaries
**Cons**: Visual noise from pipe characters, busier appearance
**Use when**: You need markdown-compatible output with explicit borders
### Visual Comparison
**Rich** (`box.SIMPLE`):
- No outer border - clean, uncluttered look
- Horizontal line separator under header only
- Generous internal spacing and padding
- Perfect column alignment
- Professional, minimalist presentation
**Fast simple()** (NEW DEFAULT):
- No outer border - matches Rich's clean look
- Horizontal line separator under header (using `─`)
- Space-separated columns with generous padding
- Clean, professional appearance
- Same performance as pipe_table (~0.2ms per table)
**Fast pipe_table()** (optional):
- Full pipe table borders (`|` characters everywhere)
- Horizontal dashes for header separator
- Markdown-compatible format
- Explicit column boundaries
---
## Recent Improvements (2025-10-08)
### 1. Colspan/Rowspan Support
**Fixed**: Tables with `colspan` and `rowspan` attributes now render correctly.
**Previous issue**: Fast renderer was extracting cell text without accounting for colspan/rowspan, causing:
- Missing columns (e.g., "2023" column disappeared in Apple 10-K table 15)
- Misaligned data (currency symbols separated from values)
- Data loss (em dashes and other values missing)
**Solution**: Integrated `TableMatrix` for proper cell expansion, same as Rich rendering uses.
**Status**: ✅ FIXED
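The core of that fix can be sketched as follows; `expand_spans` is a simplified, hypothetical stand-in for what `TableMatrix` does internally (each cell here is a dict with optional `colspan`/`rowspan` keys):

```python
def expand_spans(rows):
    """Expand colspan/rowspan into a rectangular grid of cell texts."""
    occupied = {}  # (row, col) -> text, including slots claimed by spans
    for r, row in enumerate(rows):
        c = 0
        for cell in row:
            while (r, c) in occupied:  # skip slots filled from above/left
                c += 1
            for dr in range(cell.get("rowspan", 1)):
                for dc in range(cell.get("colspan", 1)):
                    occupied[(r + dr, c + dc)] = cell["text"]
            c += cell.get("colspan", 1)
    n_cols = max((col for _, col in occupied), default=-1) + 1
    return [[occupied.get((r, c), "") for c in range(n_cols)]
            for r in range(len(rows))]
```

Because every spanned slot is filled explicitly, no column can "disappear" the way it did when cell text was extracted positionally.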
### 2. Multi-Row Header Preservation
**Fixed**: Tables with multiple header rows now preserve each row separately.
**Previous issue**: Multi-row headers were collapsed into a single line, causing "Investment portfolio" row to disappear in Apple 10-K table 20.
**Solution**: Modified `render_table_data()` and `_build_table()` to preserve each header row as a separate line.
**Status**: ✅ FIXED
### 3. Multi-Line Cell Rendering
**Fixed**: Cells containing newline characters (`\n`) now render as multiple lines.
**Previous issue**: Multi-line cells like "Interest Rate\nSensitive Instrument" were truncated to first line only.
**Solution**: Added `_format_multiline_row()` to split cells by `\n` and render each line separately.
**Status**: ✅ FIXED
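The approach can be illustrated with a small helper (a simplified analogue of `_format_multiline_row`; the real signature may differ). Cells are split on `\n` and padded so every physical line keeps its column position:

```python
def format_multiline_row(cells, widths):
    """Render one logical row whose cells may contain newlines
    as several physical lines with columns kept aligned."""
    split = [cell.split("\n") for cell in cells]
    height = max(len(lines) for lines in split)  # tallest cell wins
    out = []
    for i in range(height):
        parts = [
            (lines[i] if i < len(lines) else "").ljust(w)
            for lines, w in zip(split, widths)
        ]
        out.append("  ".join(parts).rstrip())
    return out
```

A cell like `"Interest Rate\nSensitive Instrument"` now produces two output lines instead of being truncated to the first.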
### Performance Impact
All three fixes maintain excellent performance:
- **Speedup**: 7-10x faster than Rich (down from initial 14x, but with correct output)
- **Correctness**: Now matches Rich output exactly for colspan, multi-row headers, and multi-line cells
- **Production ready**: Can confidently use as default renderer
---
## Known Limitations
### 1. Column Alignment in Some Tables
**Issue**: Currency symbols and values may have extra spacing in some complex tables (e.g., Apple 10-K table 22)
**Example**:
- Rich: `$294,866`
- Fast: `$ 294,866` (extra spacing)
**Root cause**: Column width calculation creates wider columns for some currency/value pairs after colspan expansion and column filtering.
**Impact**: Visual appearance differs slightly, but data is correct and readable.
**Status**: ⚠️ Minor visual difference - acceptable trade-off for 10x performance gain
### 2. Visual Polish
**Issue**: Some visual aspects don't exactly match Rich's sophisticated layout
**Examples**:
- Multi-line cell wrapping may differ
- Column alignment in edge cases
**Status**: ⚠️ Acceptable trade-off for 7-10x performance gain
---
## Configuration Options
### Table Styles
Fast renderer supports different visual styles:
```python
from edgar.documents.renderers.fast_table import FastTableRenderer, TableStyle
# Simple style (default) - borderless, matches Rich's box.SIMPLE
renderer = FastTableRenderer(TableStyle.simple())
# Pipe table style - markdown compatible
renderer = FastTableRenderer(TableStyle.pipe_table())
# Minimal style - no borders, just spacing
renderer = FastTableRenderer(TableStyle.minimal())
```
### Minimal Style Output
```
December 31, 2023 December 31, 2022 December 31, 2021
Revenue 365,817 394,328 365,817
Cost of revenue 223,546 212,981 192,266
Gross profit 142,271 181,347 173,551
```
**Note**: Minimal style has cleaner appearance but loses column boundaries
---
## Technical Details
### How It Works
1. **Direct String Building**: Bypasses Rich's layout engine
2. **Column Analysis**: Detects numeric columns for right-alignment
3. **Smart Filtering**: Removes empty spacing columns
4. **Currency Merging**: Combines `$` symbols with amounts
5. **Width Calculation**: Measures content, applies min/max limits
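The steps above can be illustrated with a toy renderer. This is a hedged sketch, not the actual `FastTableRenderer` code; real column filtering, header handling, and width limits are more involved:

```python
import re

NUMERIC = re.compile(r"^\(?[\d,.]+\)?%?$")

def render_fast(rows):
    """Toy fast renderer: merge '$' cells, size columns, align numerics."""
    # 1. Currency merging: a lone "$" cell joins the value to its right
    merged = []
    for row in rows:
        out, skip = [], False
        for i, cell in enumerate(row):
            if skip:
                skip = False
                continue
            if cell == "$" and i + 1 < len(row):
                out.append("$" + row[i + 1])
                skip = True
            else:
                out.append(cell)
        merged.append(out)
    # 2. Width calculation from content
    n_cols = max(len(r) for r in merged)
    squared = [r + [""] * (n_cols - len(r)) for r in merged]
    widths = [max(len(r[c]) for r in squared) for c in range(n_cols)]
    # 3. Columns whose data cells are all numeric get right-aligned
    numeric = [
        all(NUMERIC.match(r[c].lstrip("$")) for r in squared[1:] if r[c])
        for c in range(n_cols)
    ]
    def fmt(row):
        return "  ".join(
            cell.rjust(w) if num else cell.ljust(w)
            for cell, w, num in zip(row, widths, numeric)
        ).rstrip()
    lines = [fmt(squared[0]), "─" * (sum(widths) + 2 * (n_cols - 1))]
    lines += [fmt(r) for r in squared[1:]]
    return "\n".join(lines)
```

Everything is plain string concatenation: no layout engine, no style objects, which is where the speedup comes from.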
### Code Path
```python
# When fast_table_rendering=True:
table.text()
TableNode._fast_text_rendering()
FastTableRenderer.render_table_node()
Direct string building
```
### Memory Benefits
Fast rendering avoids:
- Rich Console object creation (~0.4MB per document)
- Intermediate rich.Table objects
- Style/theme processing overhead
- ANSI escape code generation
---
## Future Improvements
### Planned Enhancements
1. **Match Rich's `box.SIMPLE` Style** (✅ Completed 2025-10-08)
- **Implemented as `TableStyle.simple()`** - now the default style
- **No pipe characters** - no outer border, no column separators
- **Horizontal separator only** under header (using `─` character)
- **Clean, minimalist appearance** matching Rich's SIMPLE box style
- **Result**: Matches Rich visual quality at 7-10x the speed
2. **Improved Layout Engine**
- Better column width calculation (avoid too-wide/too-narrow columns)
- Respect natural content breaks
- Dynamic spacing based on content type
- Handle wrapping for long content
3. **Dynamic Padding**
- Match Rich's generous spacing (currently too tight)
- Adjust padding based on content type
- Configurable padding rules
- Maintain alignment with variable padding
4. **Header Handling**
- Better multi-row header collapse
- Preserve important hierarchies
- Smart column spanning
- Honor header groupings
5. **Style Presets**
- `TableStyle.simple()` - matches Rich's `box.SIMPLE` (no borders, header separator only) - implemented, now the default
- `TableStyle.minimal()` - no borders, just spacing (implemented)
- `TableStyle.pipe_table()` - markdown style with borders (implemented)
- `TableStyle.ascii_clean()` - no Unicode, pure ASCII (planned)
- `TableStyle.compact()` - minimal spacing for dense data (planned)
### Timeline
These improvements are **planned for Phase 2** of the HTML parser optimization work (after memory leak fixes).
---
## Migration Guide
### From Rich to Fast
Fast rendering is now the default, so no code changes are required.
**Before** (when Rich was the default):
```python
doc = parse_html(html)
table_text = doc.tables[0].text()  # Used Rich: slower, sophisticated layout
```
**After** (fast rendering is the default):
```python
doc = parse_html(html)
table_text = doc.tables[0].text()  # Fast renderer: 7-10x faster, clean output
```
### Hybrid Approach
Use fast rendering during processing, Rich for final display:
```python
# Fast processing (the default - no configuration needed)
doc = parse_html(html)
# Extract data quickly
for table in doc.tables:
data = table.text() # Fast
# Process data...
# Display one table nicely
special_table = doc.tables[5]
rich_output = special_table.render() # Switch to Rich for display
```
---
## Performance Recommendations
### Recommended Settings by Use Case
**Batch Processing** (optimize for speed):
```python
config = ParserConfig.for_performance()
# Includes: fast_table_rendering=True, eager_section_extraction=False
```
**Data Extraction** (balance speed and accuracy):
```python
config = ParserConfig(
fast_table_rendering=True,
extract_xbrl=True,
detect_sections=True
)
```
**Display/Reports** (optimize for quality):
```python
config = ParserConfig(fast_table_rendering=False)  # Disable fast rendering to use Rich
# Or:
config = ParserConfig.for_accuracy()
```
---
## FAQ
**Q: Can I mix Fast and Rich rendering?**
A: Not per-table. The setting is document-wide via ParserConfig. However, you can manually call `table.render()` to get Rich output.
**Q: Does this affect section extraction?**
A: Indirectly, yes. Section detection calls `text()` on the entire document, which includes tables. Fast rendering speeds this up significantly.
**Q: Will the output format change?**
A: Yes, as we improve the renderer. We'll maintain backward compatibility via style options.
**Q: Can I customize the appearance?**
A: Yes. Three styles ship today: `TableStyle.simple()` (default), `TableStyle.pipe_table()`, and `TableStyle.minimal()`. More presets are planned.
**Q: What about DataFrame export?**
A: Fast rendering only affects text output. `table.to_dataframe()` is unaffected.
---
## Feedback
The fast renderer is actively being improved based on user feedback. Known issues:
1. **Pipe characters** - visual noise (applies to the optional pipe_table() style)
2. **Layout engine** - inconsistent spacing in some complex tables
3. **Padding** - needs tuning in edge cases
If you have specific rendering issues or suggestions, please provide:
- Sample table HTML
- Expected vs actual output
- Use case description
This helps prioritize improvements while maintaining the performance advantage.
---
## Summary
### Current State (As of 2025-10-08)
**Performance**: ✅ Excellent (7-10x faster than Rich)
**Correctness**: ✅ Production ready (proper colspan/rowspan handling)
**Visual Quality**: ⚠️ Good (simple() style matches Rich's box.SIMPLE appearance)
**Use Case**: Production-ready for all use cases
### Recent Milestones
**✅ Completed**:
- Core fast rendering implementation
- TableStyle.simple() preset (borderless, clean)
- Column filtering and merging
- Numeric alignment detection
- **Colspan/rowspan support via TableMatrix**
- **Performance benchmarking with real tables**
**🔧 Current Limitations**:
- Extra spacing between currency symbols and values in some complex tables
- Column width calculation may differ slightly from Rich
- Layout engine not as sophisticated as Rich (acceptable for the speed gain)
### Development Roadmap
**Phase 1** (✅ COMPLETED):
- ✅ Core fast rendering implementation
- ✅ Simple() style matching Rich's box.SIMPLE
- ✅ Proper colspan/rowspan handling via TableMatrix
- ✅ Production-ready performance (8-10x faster)
**Phase 2** (Future Enhancements):
- 📋 Improve multi-row header handling
- 📋 Better layout engine for perfect column widths
- 📋 Additional style presets
- 📋 Advanced header detection (data vs labels)
### Bottom Line
Fast table rendering is **production-ready and now the default** for all table text extraction in EdgarTools.
**Benefits**:
- ✅ 7-10x faster than Rich rendering
- ✅ Correct data extraction with proper colspan/rowspan handling
- ✅ Multi-row header preservation
- ✅ Multi-line cell rendering
- ✅ Clean, borderless appearance (simple() style)
**Minor differences from Rich**:
- ⚠️ Some tables have extra spacing between currency symbols and values (e.g., table 22)
- ⚠️ Column width calculation may differ slightly in complex tables
- ✅ All data is preserved and correct - only visual presentation differs
The implementation achieves **correct data extraction** with **significant performance gains** and **clean visual output**, making it the ideal default for EdgarTools.
---
## Related Documentation
- [HTML Parser Status](HTML_PARSER_STATUS.md) - Overall parser progress
- [Performance Analysis](../perf/hotpath_analysis.md) - Profiling results showing Rich rendering bottleneck
- [Memory Analysis](../perf/memory_analysis.md) - Memory leak issues with Rich objects

---
# Goals
## Mission
Replace `edgar.files` with a parser that is better in **every way** - utility, accuracy, and user experience. The maintainer is the final judge: output must look correct when printed.
## Core Principles
### Primary Goal: AI Context Optimization
- **Token efficiency**: 30-50% reduction vs raw HTML while preserving semantic meaning
- **Chunking support**: Enable independent processing of sections/tables for LLM context windows
- **Clean text output**: Tables rendered in LLM-friendly formats (clean text, markdown)
- **Semantic preservation**: Extract meaning, not just formatting
### Secondary Goal: Human Readability
- **Rich console output**: Beautiful rendering with proper table alignment
- **Markdown export**: Professional-looking document conversion
- **Section navigation**: Easy access to specific Items/sections
## User-Focused Feature Goals
### 1. Text Extraction
- Extract full document text without dropping meaningful content
- Preserve paragraph structure and semantic whitespace
- Handle inline XBRL facts gracefully (show values, not raw tags)
- Clean HTML artifacts automatically (scripts, styles, page numbers)
- **Target**: 99%+ accuracy vs manual reading
### 2. Section Extraction (10-K, 10-Q, 8-K)
- Detect >90% of standard sections for >90% of test tickers
- Support flexible access: `doc.sections['Item 1A']`, `doc['1A']`, `doc.risk_factors`
- Return Section objects with `.text()`, `.tables`, `.search()` methods
- Include confidence scores and detection method metadata
- **Target**: Better recall than old parser (quantify with test suite)
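The flexible access contract can be illustrated with a tiny key-normalizing mapping. This is a hypothetical helper, not the actual Document implementation; it only shows the behavior the goal describes:

```python
import re

class Sections(dict):
    """Dict whose keys normalize 'Item 1A', '1a', 'ITEM 1A' to one entry."""

    @staticmethod
    def _norm(key: str) -> str:
        key = re.sub(r"^item\s+", "", key.strip().lower())
        return key.upper()

    def __setitem__(self, key, value):
        super().__setitem__(self._norm(key), value)

    def __getitem__(self, key):
        return super().__getitem__(self._norm(key))

# Usage: all three spellings resolve to the same section
sections = Sections()
sections["Item 1A"] = "Risk Factors section object"
```

Normalizing at the mapping boundary keeps the public API forgiving without scattering string handling through the parser.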
### 3. Table Extraction
- Extract all meaningful data tables (ignore pure layout tables)
- Accurate rendering with aligned columns and proper formatting
- Handle complex tables (rowspan, colspan, nested headers)
- Preserve table captions and surrounding context
- Support DataFrame conversion for data analysis
- **Target**: 95%+ accuracy on test corpus
### 4. Search Capabilities
- Text search within documents
- Regex pattern matching
- Semantic search preparation (structure for embedding-based search)
- Search within sections for focused queries
### 5. Multiple Output Formats
- Plain text (optimized for LLM context)
- Markdown (for documentation/sharing)
- Rich console (beautiful terminal display)
- JSON (structured data export)
### 6. Developer Experience
- Intuitive API: `doc.text()`, `doc.tables`, `doc.sections`
- Rich objects with useful methods (not just strings)
- Simple tasks simple, complex tasks possible
- Helpful error messages with recovery suggestions
- **Target**: New users productive in <10 minutes
## Performance Targets
### Speed Benchmarks (Based on Current Performance)
- **Small docs (<5MB)**: <500ms ✅ *Currently 96ms - excellent*
- **Medium docs (5-20MB)**: <2s ✅ *Currently 1.19s - excellent*
- **Large docs (>50MB)**: <10s ✅ *Currently 0.59s - excellent*
- **Throughput**: >3MB/s sustained ✅ *Currently 3.8MB/s*
- **Target**: Maintain or improve on all benchmarks
### Memory Efficiency
- **Small docs (<5MB)**: <3x document size *(currently 9x - needs optimization)*
- **Large docs (>10MB)**: <2x document size *(currently 1.9x - good)*
- **No memory spikes**: Never exceed 5x document size *(MSFT currently 5.4x)*
- **Target**: Consistent 2-3x overhead across all document sizes
### Accuracy Benchmarks
- **Section detection recall**: >90% on 20-ticker test set
- **Table extraction accuracy**: >95% on manual validation set
- **Text fidelity**: >99% semantic equivalence to source HTML
- **XBRL fact extraction**: 100% of inline facts captured correctly
## Implementation Details
### HTML Parsing
- Read the entire HTML document without dropping semantically meaningful content
- Drop non-meaningful content (scripts, styles, pure formatting tags)
- Preserve semantic structure (headings, paragraphs, lists)
- Handle both old (pre-2015) and modern (inline XBRL) formats
- Graceful degradation for malformed HTML
### Table Parsing
- Extract tables containing meaningful data
- Ignore layout tables (unless they aid document understanding)
- Accurate rendering with proper column alignment
- Handle complex structures: rowspan, colspan, nested headers, multi-level headers
- Preserve table captions and contextual information
- Support conversion to pandas DataFrame
### Section Extraction
- Detect standard sections (Item 1, 1A, 7, etc.) for 10-K, 10-Q, 8-K filings
- Support multiple detection strategies: TOC-based, heading-based, pattern-based
- Return Section objects with full API: `.text()`, `.text_without_tables()`, `.tables`, `.search()`
- Include metadata: confidence scores, detection method, position
- Better recall than old parser (establish baseline with test suite)
## Quality Gates Before Replacing edgar.files
### Automated Tests
- [ ] All existing tests pass with new parser (1000+ tests)
- [ ] Performance regression tests (<5% slower on any document)
- [ ] Memory regression tests (no >10% increases)
- [ ] Section detection accuracy >90% on test corpus
- [ ] Table extraction accuracy >95% on validation set
### Manual Validation (Maintainer Review)
- [ ] Print full document text for 10 sample filings → verify quality
- [ ] Compare table rendering old vs new → verify improvement
- [ ] Test section extraction on edge cases → verify robustness
- [ ] Review markdown output → verify professional appearance
- [ ] Check memory usage → verify no concerning spikes
### Documentation Requirements
- [ ] Migration guide (old API → new API with examples)
- [ ] Updated user guide showing new features
- [ ] Performance comparison report (old vs new)
- [ ] Known limitations documented clearly
- [ ] API reference complete for all public methods
## Success Metrics
### Launch Criteria
1. **Speed**: Equal or faster on 95% of test corpus
2. **Accuracy**: Maintainer approves output quality on sample set
3. **API**: Clean, intuitive interface (no confusion)
4. **Tests**: Zero regressions, 95%+ coverage on new code
5. **Docs**: Complete with examples for all major use cases
### Post-Launch Monitoring
- Issue reports: <5% related to parser quality/accuracy
- User feedback: Positive sentiment on ease of use
- Performance: No degradation over time (regression tests)
- Adoption: Smooth migration from old parser (deprecation path)
## Feature Parity with Old Parser
### Must-Have (Required for Migration)
- ✅ Get document text (with/without tables)
- ✅ Extract specific sections by name/number
- ✅ List all tables in document
- ✅ Search document content
- ✅ Convert to markdown
- ✅ Handle both old and new SEC filing formats
- ✅ Graceful error handling
### Nice-to-Have (Improvements Over Old Parser)
- 🎯 Semantic search capabilities
- 🎯 Better subsection extraction within Items
- 🎯 Table-of-contents navigation
- 🎯 Export to multiple formats (JSON, clean HTML)
- 🎯 Batch processing optimizations
- 🎯 Section confidence scores and metadata

---
# HTML Parser Rewrite Technical Overview
## Executive Summary
The `edgar/documents` module represents a comprehensive rewrite of the HTML parsing capabilities originally implemented in `edgar/files`. This new parser is designed to provide superior parsing accuracy, structured data extraction, and rendering quality for SEC filing documents. The rewrite introduces a modern, extensible architecture with specialized components for handling the complex structure of financial documents.
## Architecture Overview
### Core Components
#### 1. Document Object Model
The new parser introduces a sophisticated node-based document model:
- **Document**: Top-level container with metadata and sections
- **Node Hierarchy**: Abstract base classes for all document elements
- `DocumentNode`: Root document container
- `TextNode`: Plain text content
- `ParagraphNode`: Paragraph elements with styling
- `HeadingNode`: Headers with levels 1-6
- `ContainerNode`: Generic containers (div, section)
- `SectionNode`: Document sections with semantic meaning
- `ListNode`/`ListItemNode`: Ordered and unordered lists
- `LinkNode`: Hyperlinks with metadata
- `ImageNode`: Images with attributes
#### 2. Table Processing System
Advanced table handling represents a major improvement over the old parser:
- **TableNode**: Sophisticated table representation with multi-level headers
- **Cell**: Individual cell with colspan/rowspan support and type detection
- **Row**: Table row with header detection and semantic classification
- **TableMatrix**: Handles complex cell spanning and alignment
- **CurrencyColumnMerger**: Intelligently merges currency symbols with values
- **ColumnAnalyzer**: Detects spacing columns and optimizes layout
#### 3. Parser Pipeline
The parsing process follows a well-defined pipeline:
1. **HTMLParser**: Main orchestration class
2. **HTMLPreprocessor**: Cleans and normalizes HTML
3. **DocumentBuilder**: Converts HTML tree to document nodes
4. **Strategy Pattern**: Pluggable parsing strategies
5. **DocumentPostprocessor**: Final cleanup and optimization
### Key Improvements Over Old Parser
#### Table Processing Enhancements
**Old Parser (`edgar/files`)**:
- Basic table extraction using BeautifulSoup
- Limited colspan/rowspan handling
- Simple text-based rendering
- Manual column alignment
- Currency symbols often misaligned
**New Parser (`edgar/documents`)**:
- Advanced table matrix system for perfect cell alignment
- Intelligent header detection (multi-row headers, year detection)
- Automatic currency column merging ($1,234 instead of $ | 1,234)
- Semantic table type detection (FINANCIAL, METRICS, TOC, etc.)
- Rich table rendering with proper formatting
- Smart column width calculation
- Enhanced numeric formatting with comma separators
#### Document Structure
**Old Parser**:
- Flat block-based structure
- Limited semantic understanding
- Basic text extraction
**New Parser**:
- Hierarchical node-based model
- Semantic section detection
- Rich metadata preservation
- XBRL fact extraction
- Search capabilities
- Multiple output formats (text, markdown, JSON, pandas)
#### Rendering Quality
**Old Parser**:
- Basic text output
- Limited table formatting
- No styling preservation
**New Parser**:
- Multiple renderers (text, markdown, Rich console)
- Preserves document structure and styling
- Configurable output options
- LLM-optimized formatting
## Implementation Details
### Configuration System
The new parser uses a comprehensive configuration system:
```python
@dataclass
class ParserConfig:
# Size limits
max_document_size: int = 50 * 1024 * 1024 # 50MB
streaming_threshold: int = 10 * 1024 * 1024 # 10MB
# Processing options
preserve_whitespace: bool = False
detect_sections: bool = True
extract_xbrl: bool = True
table_extraction: bool = True
detect_table_types: bool = True
```
### Strategy Pattern Implementation
The parser uses pluggable strategies for different aspects:
- **HeaderDetectionStrategy**: Identifies document sections
- **TableProcessor**: Handles table extraction and classification
- **XBRLExtractor**: Extracts XBRL facts and metadata
- **StyleParser**: Processes CSS styling information
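A minimal sketch of the pluggable-strategy shape (the real strategy classes in `edgar.documents` may use different signatures):

```python
import re
from typing import Protocol

class HeaderDetector(Protocol):
    def detect(self, line: str) -> bool: ...

ITEM_PATTERN = re.compile(r"^\s*item\s+\d+[a-z]?\b", re.IGNORECASE)

class PatternHeaderDetector:
    """Example strategy: lines starting with 'Item N' are headers."""
    def detect(self, line: str) -> bool:
        return bool(ITEM_PATTERN.match(line))

def find_headers(lines, detector: HeaderDetector):
    # The caller depends only on the protocol, so strategies are swappable
    return [line for line in lines if detector.detect(line)]
```

Swapping in, say, a TOC-based detector requires no change to the calling code, which is the point of the pattern.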
### Table Processing Deep Dive
The table processing system represents the most significant improvement:
#### Header Detection Algorithm
- Analyzes cell content patterns (th vs td elements)
- Detects year patterns in financial tables
- Identifies period indicators (quarters, fiscal years)
- Handles multi-row headers with units and descriptions
- Prevents misclassification of data rows as headers
#### Cell Type Detection
- Numeric vs text classification
- Currency value recognition
- Percentage handling
- Em dash and null value detection
- Proper number formatting with thousand separators
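These rules can be sketched as a small classifier (simplified; the real detector handles more formats and locales):

```python
import re

def classify_cell(text):
    """Toy version of the cell-type rules listed above."""
    t = text.strip()
    if t in ("", "—", "-", "–"):          # em dash / null values
        return "empty"
    if t.endswith("%"):
        return "percentage"
    if t.startswith(("$", "($")):          # currency, possibly negative
        return "currency"
    if re.fullmatch(r"\(?[\d,]+(\.\d+)?\)?", t):  # 1,234 or (3,408)
        return "numeric"
    return "text"
```

Parenthesized values are treated as numeric (accounting-style negatives), which matters for correct right-alignment.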
#### Matrix Building
- Handles colspan and rowspan expansion
- Maintains cell relationships
- Optimizes column layout
- Removes spacing columns automatically
### XBRL Integration
The new parser includes sophisticated XBRL processing:
- Extracts facts before preprocessing to preserve ix:hidden content
- Maintains metadata relationships
- Supports inline XBRL transformations
- Preserves semantic context
## Performance Characteristics
### Memory Efficiency
- Streaming support for large documents (>10MB)
- Lazy loading of document sections
- Caching for repeated operations
- Memory-efficient node representation
### Processing Speed
- Optimized HTML parsing with lxml
- Configurable processing strategies
- Parallel extraction capabilities
- Smart caching of expensive operations
## Migration and Compatibility
### API Compatibility
The new parser maintains high-level compatibility with the old parser while offering enhanced functionality:
```python
# Old way
from edgar.files import FilingDocument
doc = FilingDocument(html)
text = doc.text()
# New way
from edgar.documents import HTMLParser
parser = HTMLParser()
doc = parser.parse(html)
text = doc.text()
```
### Feature Parity
All major features from the old parser are preserved:
- Text extraction
- Table conversion to DataFrame
- Section detection
- Metadata extraction
### Enhanced Features
New capabilities not available in the old parser:
- Rich console rendering
- Markdown export
- Advanced table semantics
- XBRL fact extraction
- Document search
- LLM optimization
- Multiple output formats
## Current Status and Next Steps
### Completed Components
- ✅ Core document model
- ✅ HTML parsing pipeline
- ✅ Advanced table processing
- ✅ Multiple renderers (text, markdown, Rich)
- ✅ XBRL extraction
- ✅ Configuration system
- ✅ Streaming support
### Remaining Work
- 🔄 Performance optimization and benchmarking
- 🔄 Comprehensive test coverage migration
- 🔄 Error handling improvements
- 🔄 Documentation and examples
- 🔄 Validation against large corpus of filings
### Testing Strategy
The rewrite requires extensive validation:
- Comparison testing against old parser output
- Financial table accuracy verification
- Performance benchmarking
- Edge case handling
- Integration testing with existing workflows
## Conclusion
The `edgar/documents` rewrite represents a significant advancement in SEC filing processing capabilities. The new architecture provides:
1. **Better Accuracy**: Advanced table processing and semantic understanding
2. **Enhanced Functionality**: Multiple output formats and rich rendering
3. **Improved Maintainability**: Clean, modular architecture with clear separation of concerns
4. **Future Extensibility**: Plugin architecture for new parsing strategies
5. **Performance**: Streaming support and optimized processing for large documents
The modular design ensures that improvements can be made incrementally while maintaining backward compatibility. The sophisticated table processing system alone represents a major advancement in handling complex financial documents accurately.

---
# HTML Parser Quality Improvement Strategy
## Overview
Simple, iterative testing strategy for the HTML parser rewrite. The goal is rapid feedback loops where we compare OLD vs NEW parser output, identify visual/functional issues, fix them, and repeat until satisfied.
## Test Corpus
### 10 Representative Documents
Selected to cover different filing types, companies, and edge cases:
| # | Company | Filing Type | File Path | Rationale |
|---|---------|-------------|-----------|-----------|
| 1 | Apple | 10-K | `data/html/Apple.10-K.html` | Large complex filing, existing test file |
| 2 | Oracle | 10-K | `data/html/Oracle.10-K.html` | Complex financials, existing test file |
| 3 | Nvidia | 10-K | `data/html/Nvidia.10-K.html` | Tech company, existing test file |
| 4 | Microsoft | 10-K | `data/html/Microsoft.10-K.html` | Popular company, complex tables |
| 5 | Tesla | 10-K | `data/html/Tesla.10-K.html` | Manufacturing sector, different formatting |
| 6 | [TBD] | 10-Q | TBD | Quarterly report format |
| 7 | [TBD] | 10-Q | TBD | Another quarterly for variety |
| 8 | [TBD] | 8-K | `data/html/BuckleInc.8-K.html` | Event-driven filing |
| 9 | [TBD] | Proxy (DEF 14A) | TBD | Proxy statement with compensation tables |
| 10 | [TBD] | Edge case | TBD | Unusual formatting or very large file |
**Note**: Fill in TBD entries as we identify good test candidates.
## The 4-Step Loop
### Step 1: Run Comparison
Use existing test scripts to compare OLD vs NEW parsers:
```bash
# Full comparison with metrics
python tests/manual/check_parser_comparison.py
# Table-focused comparison with rendering
python tests/manual/check_tables.py
# Or run on specific file
python tests/manual/check_html_rewrite.py
```
**Outputs to review**:
- Console output with side-by-side Rich panels
- Metrics (parse time, table count, section detection)
- Rendered tables (old vs new)
### Step 2: Human Review
**Visual Inspection Process**:
1. Look at console output directly (Rich rendering)
2. For detailed text comparison, optionally dump to files:
- OLD parser: `doc.text()``output/old_apple.txt`
- NEW parser: `doc.text()``output/new_apple.txt`
- Use `diff` or visual diff tool
3. Take screenshots for complex table issues
4. Focus on:
- Table alignment and formatting
- Currency symbol placement (should be merged: `$1,234` not `$ | 1,234`)
- Column count (fewer is better after removing spacing columns)
- Section detection accuracy
- Text readability for LLM context
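The dump-and-diff part of this workflow can be scripted with the standard library. A hypothetical helper; `old_text`/`new_text` would come from `doc.text()` under each parser:

```python
import difflib
from pathlib import Path

def dump_and_diff(old_text, new_text, name, out_dir="output"):
    """Write OLD/NEW parser text to files and return a unified diff."""
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    (out / f"old_{name}.txt").write_text(old_text)
    (out / f"new_{name}.txt").write_text(new_text)
    return "\n".join(difflib.unified_diff(
        old_text.splitlines(), new_text.splitlines(),
        fromfile=f"old_{name}", tofile=f"new_{name}", lineterm=""))
```

The files remain on disk for a visual diff tool, while the returned string gives a quick in-terminal summary.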
**Quality Criteria** (from goals.md):
- Semantic meaning preserved
- Tables render correctly when printed
- Better than old parser in speed, accuracy, features
- **You are the final judge**: "Does this look right?"
### Step 3: Document Bugs
Record issues in the tracker below as you find them:
| Bug # | Status | Priority | Description | File/Location | Notes |
|-------|--------|----------|-------------|---------------|-------|
| Example | Fixed | High | Currency symbols not merging in balance sheet | Apple 10-K, Table 5 | Issue in CurrencyColumnMerger |
| | | | | | |
| | | | | | |
| | | | | | |
**Status values**: Open, In Progress, Fixed, Won't Fix, Deferred
**Priority values**: Critical, High, Medium, Low
**Bug Description Template**:
- What's wrong: Clear description of the issue
- Where: Which file/table/section
- Expected: What it should look like
- Actual: What it currently looks like
- Impact: How it affects usability/readability
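For longer sessions, the same template can be kept as structured data instead of a markdown row. A minimal sketch whose field names mirror the table and template above (not an existing module in the codebase):

```python
from dataclasses import dataclass

@dataclass
class Bug:
    """One tracker row; fields mirror the bug description template."""
    description: str           # What's wrong
    location: str              # Where: file/table/section
    expected: str              # What it should look like
    actual: str                # What it currently looks like
    priority: str = 'Medium'   # Critical / High / Medium / Low
    status: str = 'Open'       # Open / In Progress / Fixed / Won't Fix / Deferred

    def as_row(self) -> str:
        """Render as a markdown row for the tracker table."""
        return f'| {self.status} | {self.priority} | {self.description} | {self.location} |'

bug = Bug('Currency symbols not merging in balance sheet',
          'Apple 10-K, Table 5',
          expected='$1,234', actual='$ | 1,234', priority='High')
print(bug.as_row())
```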
### Step 4: Fix & Repeat
1. Pick highest priority bug
2. Fix the code
3. Re-run comparison on affected file(s)
4. Verify the fix doesn't break other files
5. Mark bug as Fixed
6. Repeat until exit criteria met
**Quick verification**:
```bash
# Re-run just the problematic file
python -c "
from edgar.documents import parse_html
from pathlib import Path
html = Path('data/html/Apple.10-K.html').read_text()
doc = parse_html(html)
# Quick inspection
print(f'Tables: {len(doc.tables)}')
print(doc.tables[5].render(width=200)) # Check specific table
"
```
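Step 4's "doesn't break other files" check can be batched over the fixture directory. A sketch with the parser passed in as a callable (directory layout assumed from the fixtures listed earlier):

```python
from pathlib import Path

def batch_check(parse, fixture_dir='data/html'):
    """Run a parser over every fixture and collect failures, so a
    bug fix can be verified against the whole corpus rather than
    only the file it was found in."""
    failures = []
    for path in sorted(Path(fixture_dir).glob('*.html')):
        try:
            parse(path.read_text())
        except Exception as exc:
            failures.append((path.name, exc))
    return failures

# Usage with the new parser (API as in the snippet above):
#   from edgar.documents import parse_html
#   for name, exc in batch_check(parse_html):
#       print(f'REGRESSION {name}: {exc}')
```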
## Exit Criteria
We're done when:
1. ✅ All 10 test documents parse successfully
2. ✅ Visual output looks correct (maintainer approval)
3. ✅ Tables render cleanly with proper alignment
4. ✅ No critical or high priority bugs remain
5. ✅ Performance is equal or better than old parser
6. ✅ Text extraction is complete and clean for AI context
**Final approval**: Maintainer says "This is good enough to ship."
## Testing Infrastructure
### Primary Tool: compare_parsers.py
Simple command-line tool for the quality improvement loop:
```bash
# Quick overview comparison (using shortcuts!)
python tests/manual/compare_parsers.py aapl
# See all tables in a document
python tests/manual/compare_parsers.py aapl --tables
# Compare specific table (OLD vs NEW side-by-side)
python tests/manual/compare_parsers.py aapl --table 5
# Compare text extraction
python tests/manual/compare_parsers.py msft --text
# See section detection
python tests/manual/compare_parsers.py orcl --sections
# Test with 10-Q filings
python tests/manual/compare_parsers.py 'aapl 10-q'
# Run all test files at once
python tests/manual/compare_parsers.py --all
```
**Shortcuts available**:
- Companies: `aapl`, `msft`, `tsla`, `nvda`, `orcl`
- Filing types: `10-k` (default), `10-q`, `8-k`
- Or use full file paths
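The shortcut scheme could be implemented roughly as below. This is a hypothetical sketch; the real mapping lives in `compare_parsers.py`, and the exact fixture filenames may differ:

```python
# Assumed ticker-to-fixture mapping; illustrative only.
COMPANIES = {'aapl': 'Apple', 'msft': 'Microsoft', 'tsla': 'Tesla',
             'nvda': 'Nvidia', 'orcl': 'Oracle'}

def resolve(arg: str) -> str:
    """Map 'aapl' or 'aapl 10-q' to a fixture path; anything that
    is not a known ticker is treated as a literal file path."""
    parts = arg.lower().split()
    if parts[0] not in COMPANIES:
        return arg
    form = parts[1].upper() if len(parts) > 1 else '10-K'
    return f'data/html/{COMPANIES[parts[0]]}.{form}.html'

print(resolve('aapl'))       # data/html/Apple.10-K.html
print(resolve('aapl 10-q'))  # data/html/Apple.10-Q.html
```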
**Features**:
- Clean command-line interface
- Side-by-side OLD vs NEW comparison
- Rich console output with colors and tables
- Performance metrics
- Individual table inspection
### Other Available Scripts
Additional tools for specific testing:
- `tests/manual/check_parser_comparison.py` - Full comparison with metrics
- `tests/manual/check_tables.py` - Table-specific comparison with rendering
- `tests/manual/check_html_rewrite.py` - General HTML parsing checks
- `tests/manual/check_html_parser_real_files.py` - Real filing tests
## Quick Reference
For day-to-day testing commands and usage examples, see [TESTING.md](TESTING.md).
## Notes
- **Keep it simple**: This is about rapid iteration, not comprehensive automation
- **Visual inspection is key**: Automated metrics don't catch layout/formatting issues
- **Use screenshots**: When describing bugs, screenshots speak louder than words
- **Iterative approach**: Don't try to fix everything at once, prioritize
- **Trust your judgment**: If it looks wrong, it probably is wrong
## Bug Tracker
### Active Issues
(Add bugs here as they're discovered)
### Fixed Issues
(Move completed bugs here for history)
### Deferred Issues
(Issues that aren't blocking release but could be improved later)
---
**Status**: Initial draft
**Last Updated**: 2025-10-07
**Maintainer**: Dwight Gunning