# HTML Parser Quality Improvement Strategy

## Overview

A simple, iterative testing strategy for the HTML parser rewrite. The goal is a rapid feedback loop: compare OLD vs NEW parser output, identify visual and functional issues, fix them, and repeat until satisfied.
## Test Corpus

### 10 Representative Documents

Selected to cover different filing types, companies, and edge cases:
| # | Company | Filing Type | File Path | Rationale |
|---|---------|-------------|-----------|-----------|
| 1 | Apple | 10-K | `data/html/Apple.10-K.html` | Large, complex filing; existing test file |
| 2 | Oracle | 10-K | `data/html/Oracle.10-K.html` | Complex financials; existing test file |
| 3 | Nvidia | 10-K | `data/html/Nvidia.10-K.html` | Tech company; existing test file |
| 4 | Microsoft | 10-K | `data/html/Microsoft.10-K.html` | Popular company, complex tables |
| 5 | Tesla | 10-K | `data/html/Tesla.10-K.html` | Manufacturing sector, different formatting |
| 6 | [TBD] | 10-Q | TBD | Quarterly report format |
| 7 | [TBD] | 10-Q | TBD | Another quarterly for variety |
| 8 | Buckle | 8-K | `data/html/BuckleInc.8-K.html` | Event-driven filing |
| 9 | [TBD] | Proxy (DEF 14A) | TBD | Proxy statement with compensation tables |
| 10 | [TBD] | Edge case | TBD | Unusual formatting or very large file |
Note: Fill in TBD entries as we identify good test candidates.
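For scripts that need to walk the corpus (for example, the regression sweep later in this document), a small constant could mirror the table above. This is a hypothetical helper, not part of the parser code; TBD rows are omitted until those files are chosen.

```python
# Hypothetical corpus constant mirroring the table above (TBD rows omitted).
TEST_CORPUS = [
    "data/html/Apple.10-K.html",
    "data/html/Oracle.10-K.html",
    "data/html/Nvidia.10-K.html",
    "data/html/Microsoft.10-K.html",
    "data/html/Tesla.10-K.html",
    "data/html/BuckleInc.8-K.html",
]
```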
## The 4-Step Loop

### Step 1: Run Comparison

Use the existing test scripts to compare OLD vs NEW parsers:
```bash
# Full comparison with metrics
python tests/manual/check_parser_comparison.py

# Table-focused comparison with rendering
python tests/manual/check_tables.py

# Or run general checks on a specific file
python tests/manual/check_html_rewrite.py
```
Outputs to review:

- Console output with side-by-side Rich panels (a minimal sketch of this pattern follows the list)
- Metrics (parse time, table count, section detection)
- Rendered tables (old vs new)
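The snippet below is a minimal sketch of the side-by-side pattern: it renders one table from the NEW parser in a Rich panel, leaving the OLD parser half as a placeholder because its entry point isn't shown in this document. The `parse_html` and `render()` calls match the usage elsewhere here; everything else is illustrative.

```python
from pathlib import Path

from rich.columns import Columns
from rich.console import Console
from rich.panel import Panel

from edgar.documents import parse_html  # NEW parser

html = Path("data/html/Apple.10-K.html").read_text()

new_doc = parse_html(html)
new_render = new_doc.tables[5].render(width=100)  # pick any table index

# old_render = ...  # produce the same table with the OLD parser here

console = Console()
console.print(Columns([
    Panel(str(new_render), title="NEW parser"),
    # Panel(str(old_render), title="OLD parser"),
]))
```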
### Step 2: Human Review

Visual inspection process:
- Look at the console output directly (Rich rendering)
- For detailed text comparison, optionally dump the text to files and diff them (see the sketch after this list):
  - OLD parser: `doc.text()` → `output/old_apple.txt`
  - NEW parser: `doc.text()` → `output/new_apple.txt`
  - Use `diff` or a visual diff tool
- Take screenshots for complex table issues
- Focus on:
  - Table alignment and formatting
  - Currency symbol placement (should be merged: `$1,234`, not `$ | 1,234`)
  - Column count (fewer is better after removing spacing columns)
  - Section detection accuracy
  - Text readability for LLM context
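A minimal sketch of the dump-and-diff step, assuming the NEW parser API used elsewhere in this document (`parse_html`, `doc.text()` returning a string). The OLD parser call is a placeholder to be replaced with the real entry point.

```python
from pathlib import Path

from edgar.documents import parse_html  # NEW parser

html = Path("data/html/Apple.10-K.html").read_text()

out = Path("output")
out.mkdir(exist_ok=True)

# NEW parser text dump
new_doc = parse_html(html)
(out / "new_apple.txt").write_text(new_doc.text())

# OLD parser text dump (substitute the real OLD parser call here)
# old_doc = ...
# (out / "old_apple.txt").write_text(old_doc.text())

# Then compare:  diff output/old_apple.txt output/new_apple.txt
```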
Quality criteria (from `goals.md`):

- Semantic meaning preserved
- Tables render correctly when printed
- Better than the old parser in speed, accuracy, and features
- You are the final judge: "Does this look right?"
### Step 3: Document Bugs

Record issues in the tracker below as you find them:
| Bug # | Status | Priority | Description | File/Location | Notes |
|---|---|---|---|---|---|
| Example | Fixed | High | Currency symbols not merging in balance sheet | Apple 10-K, Table 5 | Issue in `CurrencyColumnMerger` |

Status values: Open, In Progress, Fixed, Won't Fix, Deferred

Priority values: Critical, High, Medium, Low
Bug Description Template:
- What's wrong: Clear description of the issue
- Where: Which file/table/section
- Expected: What it should look like
- Actual: What it currently looks like
- Impact: How it affects usability/readability
### Step 4: Fix & Repeat

- Pick the highest-priority bug
- Fix the code
- Re-run the comparison on the affected file(s)
- Verify the fix doesn't break other files (a regression-sweep sketch follows the quick-verification snippet below)
- Mark the bug as Fixed
- Repeat until the exit criteria are met
Quick verification:

```bash
# Re-run just the problematic file
python -c "
from edgar.documents import parse_html
from pathlib import Path

html = Path('data/html/Apple.10-K.html').read_text()
doc = parse_html(html)

# Quick inspection
print(f'Tables: {len(doc.tables)}')
print(doc.tables[5].render(width=200))  # Check a specific table
"
```
## Exit Criteria

We're done when:

- ✅ All 10 test documents parse successfully
- ✅ Visual output looks correct (maintainer approval)
- ✅ Tables render cleanly with proper alignment
- ✅ No critical or high-priority bugs remain
- ✅ Performance is equal to or better than the old parser (see the timing sketch below)
- ✅ Text extraction is complete and clean for AI context

Final approval: the maintainer says "This is good enough to ship."
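One rough way to check the performance criterion, assuming only the NEW parser entry point shown in this document; the OLD parser timing is left as a placeholder since its import path isn't given here.

```python
import time
from pathlib import Path

from edgar.documents import parse_html  # NEW parser

html = Path("data/html/Apple.10-K.html").read_text()

start = time.perf_counter()
doc = parse_html(html)
new_seconds = time.perf_counter() - start

# old_seconds = ...  # time the OLD parser on the same HTML here

print(f"NEW parser: {new_seconds:.2f}s, {len(doc.tables)} tables")
```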
## Testing Infrastructure

### Primary Tool: `compare_parsers.py`

A simple command-line tool for the quality improvement loop:
```bash
# Quick overview comparison (using shortcuts!)
python tests/manual/compare_parsers.py aapl

# See all tables in a document
python tests/manual/compare_parsers.py aapl --tables

# Compare a specific table (OLD vs NEW side-by-side)
python tests/manual/compare_parsers.py aapl --table 5

# Compare text extraction
python tests/manual/compare_parsers.py msft --text

# See section detection
python tests/manual/compare_parsers.py orcl --sections

# Test with 10-Q filings
python tests/manual/compare_parsers.py 'aapl 10-q'

# Run all test files at once
python tests/manual/compare_parsers.py --all
```
Shortcuts available:

- Companies: `aapl`, `msft`, `tsla`, `nvda`, `orcl`
- Filing types: `10-k` (default), `10-q`, `8-k`
- Or use full file paths (shortcut resolution is sketched below)
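The actual resolution logic lives in `compare_parsers.py`; the function below is only an illustration of how a company shortcut plus filing type might map to a corpus path, using the names from the table at the top of this document.

```python
# Illustrative only - the real compare_parsers.py may resolve shortcuts differently.
COMPANIES = {
    "aapl": "Apple",
    "msft": "Microsoft",
    "tsla": "Tesla",
    "nvda": "Nvidia",
    "orcl": "Oracle",
}

def resolve(shortcut: str, filing_type: str = "10-K") -> str:
    """Turn 'aapl' + '10-K' into 'data/html/Apple.10-K.html' (hypothetical)."""
    company = COMPANIES.get(shortcut.lower(), shortcut)
    return f"data/html/{company}.{filing_type.upper()}.html"

print(resolve("aapl"))  # data/html/Apple.10-K.html
```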
Features:
- Clean command-line interface
- Side-by-side OLD vs NEW comparison
- Rich console output with colors and tables
- Performance metrics
- Individual table inspection
### Other Available Scripts

Additional tools for specific testing:

- `tests/manual/check_parser_comparison.py` - Full comparison with metrics
- `tests/manual/check_tables.py` - Table-specific comparison with rendering
- `tests/manual/check_html_rewrite.py` - General HTML parsing checks
- `tests/manual/check_html_parser_real_files.py` - Real filing tests
## Quick Reference

For day-to-day testing commands and usage examples, see `TESTING.md`.
## Notes

- Keep it simple: this is about rapid iteration, not comprehensive automation
- Visual inspection is key: automated metrics don't catch layout and formatting issues
- Use screenshots: when describing bugs, screenshots speak louder than words
- Iterative approach: don't try to fix everything at once; prioritize
- Trust your judgment: if it looks wrong, it probably is wrong
## Bug Tracker

### Active Issues

(Add bugs here as they're discovered)

### Fixed Issues

(Move completed bugs here for history)

### Deferred Issues

(Issues that aren't blocking release but could be improved later)
Status: Initial draft

Last Updated: 2025-10-07

Maintainer: Dwight Gunning