# HTML Parser Quality Improvement Strategy

## Overview

Simple, iterative testing strategy for the HTML parser rewrite. The goal is rapid feedback loops where we compare OLD vs NEW parser output, identify visual/functional issues, fix them, and repeat until satisfied.

## Test Corpus

### 10 Representative Documents

Selected to cover different filing types, companies, and edge cases:

| # | Company | Filing Type | File Path | Rationale |
|---|---------|-------------|-----------|-----------|
| 1 | Apple | 10-K | `data/html/Apple.10-K.html` | Large complex filing, existing test file |
| 2 | Oracle | 10-K | `data/html/Oracle.10-K.html` | Complex financials, existing test file |
| 3 | Nvidia | 10-K | `data/html/Nvidia.10-K.html` | Tech company, existing test file |
| 4 | Microsoft | 10-K | `data/html/Microsoft.10-K.html` | Popular company, complex tables |
| 5 | Tesla | 10-K | `data/html/Tesla.10-K.html` | Manufacturing sector, different formatting |
| 6 | [TBD] | 10-Q | TBD | Quarterly report format |
| 7 | [TBD] | 10-Q | TBD | Another quarterly for variety |
| 8 | [TBD] | 8-K | `data/html/BuckleInc.8-K.html` | Event-driven filing |
| 9 | [TBD] | Proxy (DEF 14A) | TBD | Proxy statement with compensation tables |
| 10 | [TBD] | Edge case | TBD | Unusual formatting or very large file |

**Note**: Fill in TBD entries as we identify good test candidates.

## The 4-Step Loop

### Step 1: Run Comparison

Use existing test scripts to compare OLD vs NEW parsers:

```bash
# Full comparison with metrics
python tests/manual/check_parser_comparison.py

# Table-focused comparison with rendering
python tests/manual/check_tables.py

# Or run on specific file
python tests/manual/check_html_rewrite.py
```

**Outputs to review**:
- Console output with side-by-side Rich panels
- Metrics (parse time, table count, section detection)
- Rendered tables (old vs new)

### Step 2: Human Review

**Visual Inspection Process**:
1. Look at console output directly (Rich rendering)
2. For detailed text comparison, optionally dump to files:
   - OLD parser: `doc.text()` → `output/old_apple.txt`
   - NEW parser: `doc.text()` → `output/new_apple.txt`
   - Use `diff` or visual diff tool
3. Take screenshots for complex table issues
4. Focus on:
   - Table alignment and formatting
   - Currency symbol placement (should be merged: `$1,234` not `$ | 1,234`)
   - Column count (fewer is better after removing spacing columns)
   - Section detection accuracy
   - Text readability for LLM context

**Quality Criteria** (from goals.md):
- Semantic meaning preserved
- Tables render correctly when printed
- Better than old parser in speed, accuracy, features
- **You are the final judge**: "Does this look right?"

### Step 3: Document Bugs

Record issues in the tracker below as you find them:

| Bug # | Status | Priority | Description | File/Location | Notes |
|-------|--------|----------|-------------|---------------|-------|
| Example | Fixed | High | Currency symbols not merging in balance sheet | Apple 10-K, Table 5 | Issue in CurrencyColumnMerger |
| | | | | | |
| | | | | | |
| | | | | | |

**Status values**: Open, In Progress, Fixed, Won't Fix, Deferred

**Priority values**: Critical, High, Medium, Low

**Bug Description Template**:
- What's wrong: Clear description of the issue
- Where: Which file/table/section
- Expected: What it should look like
- Actual: What it currently looks like
- Impact: How it affects usability/readability

### Step 4: Fix & Repeat

1. Pick highest priority bug
2. Fix the code
3. Re-run comparison on affected file(s)
4. Verify fix doesn't break other files
5. Mark bug as Fixed
6. Repeat until exit criteria met

**Quick verification**:

```bash
# Re-run just the problematic file
python -c "
from edgar.documents import parse_html
from pathlib import Path

html = Path('data/html/Apple.10-K.html').read_text()
doc = parse_html(html)

# Quick inspection
print(f'Tables: {len(doc.tables)}')
print(doc.tables[5].render(width=200))  # Check specific table
"
```

## Exit Criteria

We're done when:

1. ✅ All 10 test documents parse successfully
2. ✅ Visual output looks correct (maintainer approval)
3. ✅ Tables render cleanly with proper alignment
4. ✅ No critical or high priority bugs remain
5. ✅ Performance is equal or better than old parser
6. ✅ Text extraction is complete and clean for AI context

**Final approval**: Maintainer says "This is good enough to ship."

## Testing Infrastructure

### Primary Tool: compare_parsers.py

Simple command-line tool for the quality improvement loop:

```bash
# Quick overview comparison (using shortcuts!)
python tests/manual/compare_parsers.py aapl

# See all tables in a document
python tests/manual/compare_parsers.py aapl --tables

# Compare specific table (OLD vs NEW side-by-side)
python tests/manual/compare_parsers.py aapl --table 5

# Compare text extraction
python tests/manual/compare_parsers.py msft --text

# See section detection
python tests/manual/compare_parsers.py orcl --sections

# Test with 10-Q filings
python tests/manual/compare_parsers.py 'aapl 10-q'

# Run all test files at once
python tests/manual/compare_parsers.py --all
```

**Shortcuts available**:
- Companies: `aapl`, `msft`, `tsla`, `nvda`, `orcl`
- Filing types: `10-k` (default), `10-q`, `8-k`
- Or use full file paths

**Features**:
- Clean command-line interface
- Side-by-side OLD vs NEW comparison
- Rich console output with colors and tables
- Performance metrics
- Individual table inspection

### Other Available Scripts

Additional tools for specific testing:
- `tests/manual/check_parser_comparison.py` - Full comparison with metrics
- `tests/manual/check_tables.py` - Table-specific comparison with rendering
- `tests/manual/check_html_rewrite.py` - General HTML parsing checks
- `tests/manual/check_html_parser_real_files.py` - Real filing tests

## Quick Reference

For day-to-day testing commands and usage examples, see [TESTING.md](TESTING.md).

## Notes

- **Keep it simple**: This is about rapid iteration, not comprehensive automation
- **Visual inspection is key**: Automated metrics don't catch layout/formatting issues
- **Use screenshots**: When describing bugs, screenshots speak louder than words
- **Iterative approach**: Don't try to fix everything at once, prioritize
- **Trust your judgment**: If it looks wrong, it probably is wrong

## Bug Tracker

### Active Issues

(Add bugs here as they're discovered)

### Fixed Issues

(Move completed bugs here for history)

### Deferred Issues

(Issues that aren't blocking release but could be improved later)

---

**Status**: Initial draft
**Last Updated**: 2025-10-07
**Maintainer**: Dwight Gunning
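
Step 2 mentions optionally dumping each parser's `doc.text()` output to files and comparing them with `diff`. The snippet below is a minimal sketch of that step using only the Python standard library. It assumes the two text strings have already been produced by the OLD and NEW parsers; the helper name `dump_and_diff` and its parameters are hypothetical and are not part of the existing test scripts.

```python
import difflib
from pathlib import Path


def dump_and_diff(old_text: str, new_text: str, name: str, out_dir: str = "output") -> str:
    """Write OLD and NEW parser text to files and return a unified diff.

    Hypothetical helper: file names follow the old_<name>.txt / new_<name>.txt
    pattern described in Step 2.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    old_path = out / f"old_{name}.txt"
    new_path = out / f"new_{name}.txt"
    old_path.write_text(old_text, encoding="utf-8")
    new_path.write_text(new_text, encoding="utf-8")

    # Compare line by line; lineterm="" avoids doubled newlines when joining.
    diff_lines = difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile=str(old_path),
        tofile=str(new_path),
        lineterm="",
    )
    return "\n".join(diff_lines)
```

In practice, the two arguments would be the `doc.text()` results from each parser, for example `print(dump_and_diff(old_doc.text(), new_doc.text(), "apple"))`, where `old_doc` and `new_doc` stand in for whatever objects the comparison scripts produce. An empty return value means the two outputs are identical line for line.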