# HTML Parser Quality Improvement Strategy

## Overview

A simple, iterative testing strategy for the HTML parser rewrite. The goal is a rapid feedback loop: compare OLD vs NEW parser output, identify visual and functional issues, fix them, and repeat until satisfied.
## Test Corpus

### 10 Representative Documents

Selected to cover different filing types, companies, and edge cases:
| # | Company | Filing Type | File Path | Rationale |
|---|---------|-------------|-----------|-----------|
| 1 | Apple | 10-K | `data/html/Apple.10-K.html` | Large, complex filing; existing test file |
| 2 | Oracle | 10-K | `data/html/Oracle.10-K.html` | Complex financials; existing test file |
| 3 | Nvidia | 10-K | `data/html/Nvidia.10-K.html` | Tech company; existing test file |
| 4 | Microsoft | 10-K | `data/html/Microsoft.10-K.html` | Popular company, complex tables |
| 5 | Tesla | 10-K | `data/html/Tesla.10-K.html` | Manufacturing sector, different formatting |
| 6 | [TBD] | 10-Q | TBD | Quarterly report format |
| 7 | [TBD] | 10-Q | TBD | Another quarterly for variety |
| 8 | Buckle | 8-K | `data/html/BuckleInc.8-K.html` | Event-driven filing |
| 9 | [TBD] | Proxy (DEF 14A) | TBD | Proxy statement with compensation tables |
| 10 | [TBD] | Edge case | TBD | Unusual formatting or very large file |
Note: Fill in TBD entries as we identify good test candidates.
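For scripts that need to walk the corpus (for example, the regression sweep later in this document), a small constant could mirror the table above. This is a hypothetical helper, not part of the parser code; TBD rows are omitted until those files are chosen.

```python
# Hypothetical corpus constant mirroring the table above (TBD rows omitted).
TEST_CORPUS = [
    "data/html/Apple.10-K.html",
    "data/html/Oracle.10-K.html",
    "data/html/Nvidia.10-K.html",
    "data/html/Microsoft.10-K.html",
    "data/html/Tesla.10-K.html",
    "data/html/BuckleInc.8-K.html",
]
```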
## The 4-Step Loop

### Step 1: Run Comparison

Use the existing test scripts to compare OLD vs NEW parsers:
```bash
# Full comparison with metrics
python tests/manual/check_parser_comparison.py

# Table-focused comparison with rendering
python tests/manual/check_tables.py

# Or run general checks on a specific file
python tests/manual/check_html_rewrite.py
```
Outputs to review:

- Console output with side-by-side Rich panels (a minimal sketch of this pattern follows the list)
- Metrics (parse time, table count, section detection)
- Rendered tables (old vs new)
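The snippet below is a minimal sketch of the side-by-side pattern: it renders one table from the NEW parser in a Rich panel, leaving the OLD parser half as a placeholder because its entry point isn't shown in this document. The `parse_html` and `render()` calls match the usage elsewhere here; everything else is illustrative.

```python
from pathlib import Path

from rich.columns import Columns
from rich.console import Console
from rich.panel import Panel

from edgar.documents import parse_html  # NEW parser

html = Path("data/html/Apple.10-K.html").read_text()

new_doc = parse_html(html)
new_render = new_doc.tables[5].render(width=100)  # pick any table index

# old_render = ...  # produce the same table with the OLD parser here

console = Console()
console.print(Columns([
    Panel(str(new_render), title="NEW parser"),
    # Panel(str(old_render), title="OLD parser"),
]))
```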
### Step 2: Human Review

Visual inspection process:
- Look at the console output directly (Rich rendering)
- For detailed text comparison, optionally dump the text to files and diff them (see the sketch after this list):
  - OLD parser: `doc.text()` → `output/old_apple.txt`
  - NEW parser: `doc.text()` → `output/new_apple.txt`
  - Use `diff` or a visual diff tool
- Take screenshots for complex table issues
- Focus on:
  - Table alignment and formatting
  - Currency symbol placement (should be merged: `$1,234`, not `$ | 1,234`)
  - Column count (fewer is better after removing spacing columns)
  - Section detection accuracy
  - Text readability for LLM context
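A minimal sketch of the dump-and-diff step, assuming the NEW parser API used elsewhere in this document (`parse_html`, `doc.text()` returning a string). The OLD parser call is a placeholder to be replaced with the real entry point.

```python
from pathlib import Path

from edgar.documents import parse_html  # NEW parser

html = Path("data/html/Apple.10-K.html").read_text()

out = Path("output")
out.mkdir(exist_ok=True)

# NEW parser text dump
new_doc = parse_html(html)
(out / "new_apple.txt").write_text(new_doc.text())

# OLD parser text dump (substitute the real OLD parser call here)
# old_doc = ...
# (out / "old_apple.txt").write_text(old_doc.text())

# Then compare:  diff output/old_apple.txt output/new_apple.txt
```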
Quality criteria (from `goals.md`):

- Semantic meaning preserved
- Tables render correctly when printed
- Better than the old parser in speed, accuracy, and features
- You are the final judge: "Does this look right?"
### Step 3: Document Bugs

Record issues in the tracker below as you find them:
| Bug # | Status | Priority | Description | File/Location | Notes |
|---|---|---|---|---|---|
| Example | Fixed | High | Currency symbols not merging in balance sheet | Apple 10-K, Table 5 | Issue in `CurrencyColumnMerger` |

Status values: Open, In Progress, Fixed, Won't Fix, Deferred

Priority values: Critical, High, Medium, Low
Bug Description Template:
- What's wrong: Clear description of the issue
- Where: Which file/table/section
- Expected: What it should look like
- Actual: What it currently looks like
- Impact: How it affects usability/readability
### Step 4: Fix & Repeat

- Pick the highest-priority bug
- Fix the code
- Re-run the comparison on the affected file(s)
- Verify the fix doesn't break other files (a regression-sweep sketch follows the quick-verification snippet below)
- Mark the bug as Fixed
- Repeat until the exit criteria are met
Quick verification:

```bash
# Re-run just the problematic file
python -c "
from edgar.documents import parse_html
from pathlib import Path

html = Path('data/html/Apple.10-K.html').read_text()
doc = parse_html(html)

# Quick inspection
print(f'Tables: {len(doc.tables)}')
print(doc.tables[5].render(width=200))  # Check a specific table
"
```
## Exit Criteria

We're done when:

- ✅ All 10 test documents parse successfully
- ✅ Visual output looks correct (maintainer approval)
- ✅ Tables render cleanly with proper alignment
- ✅ No critical or high-priority bugs remain
- ✅ Performance is equal to or better than the old parser (see the timing sketch below)
- ✅ Text extraction is complete and clean for AI context

Final approval: the maintainer says "This is good enough to ship."
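One rough way to check the performance criterion, assuming only the NEW parser entry point shown in this document; the OLD parser timing is left as a placeholder since its import path isn't given here.

```python
import time
from pathlib import Path

from edgar.documents import parse_html  # NEW parser

html = Path("data/html/Apple.10-K.html").read_text()

start = time.perf_counter()
doc = parse_html(html)
new_seconds = time.perf_counter() - start

# old_seconds = ...  # time the OLD parser on the same HTML here

print(f"NEW parser: {new_seconds:.2f}s, {len(doc.tables)} tables")
```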
## Testing Infrastructure

### Primary Tool: `compare_parsers.py`

A simple command-line tool for the quality improvement loop:
```bash
# Quick overview comparison (using shortcuts!)
python tests/manual/compare_parsers.py aapl

# See all tables in a document
python tests/manual/compare_parsers.py aapl --tables

# Compare a specific table (OLD vs NEW side-by-side)
python tests/manual/compare_parsers.py aapl --table 5

# Compare text extraction
python tests/manual/compare_parsers.py msft --text

# See section detection
python tests/manual/compare_parsers.py orcl --sections

# Test with 10-Q filings
python tests/manual/compare_parsers.py 'aapl 10-q'

# Run all test files at once
python tests/manual/compare_parsers.py --all
```
Shortcuts available:

- Companies: `aapl`, `msft`, `tsla`, `nvda`, `orcl`
- Filing types: `10-k` (default), `10-q`, `8-k`
- Or use full file paths (shortcut resolution is sketched below)
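The actual resolution logic lives in `compare_parsers.py`; the function below is only an illustration of how a company shortcut plus filing type might map to a corpus path, using the names from the table at the top of this document.

```python
# Illustrative only - the real compare_parsers.py may resolve shortcuts differently.
COMPANIES = {
    "aapl": "Apple",
    "msft": "Microsoft",
    "tsla": "Tesla",
    "nvda": "Nvidia",
    "orcl": "Oracle",
}

def resolve(shortcut: str, filing_type: str = "10-K") -> str:
    """Turn 'aapl' + '10-K' into 'data/html/Apple.10-K.html' (hypothetical)."""
    company = COMPANIES.get(shortcut.lower(), shortcut)
    return f"data/html/{company}.{filing_type.upper()}.html"

print(resolve("aapl"))  # data/html/Apple.10-K.html
```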
Features:
- Clean command-line interface
- Side-by-side OLD vs NEW comparison
- Rich console output with colors and tables
- Performance metrics
- Individual table inspection
### Other Available Scripts

Additional tools for specific testing:

- `tests/manual/check_parser_comparison.py` - Full comparison with metrics
- `tests/manual/check_tables.py` - Table-specific comparison with rendering
- `tests/manual/check_html_rewrite.py` - General HTML parsing checks
- `tests/manual/check_html_parser_real_files.py` - Real filing tests
## Quick Reference

For day-to-day testing commands and usage examples, see `TESTING.md`.
## Notes

- Keep it simple: this is about rapid iteration, not comprehensive automation
- Visual inspection is key: automated metrics don't catch layout and formatting issues
- Use screenshots: when describing bugs, screenshots speak louder than words
- Iterative approach: don't try to fix everything at once; prioritize
- Trust your judgment: if it looks wrong, it probably is wrong
## Bug Tracker

### Active Issues

(Add bugs here as they're discovered)

### Fixed Issues

(Move completed bugs here for history)

### Deferred Issues

(Issues that aren't blocking release but could be improved later)
Status: Initial draft

Last Updated: 2025-10-07

Maintainer: Dwight Gunning