# HTML Parser Quality Improvement Strategy

## Overview

Simple, iterative testing strategy for the HTML parser rewrite. The goal is rapid feedback loops where we compare OLD vs NEW parser output, identify visual/functional issues, fix them, and repeat until satisfied.

## Test Corpus

### 10 Representative Documents

Selected to cover different filing types, companies, and edge cases:

| # | Company | Filing Type | File Path | Rationale |
|---|---------|-------------|-----------|-----------|
| 1 | Apple | 10-K | `data/html/Apple.10-K.html` | Large complex filing, existing test file |
| 2 | Oracle | 10-K | `data/html/Oracle.10-K.html` | Complex financials, existing test file |
| 3 | Nvidia | 10-K | `data/html/Nvidia.10-K.html` | Tech company, existing test file |
| 4 | Microsoft | 10-K | `data/html/Microsoft.10-K.html` | Popular company, complex tables |
| 5 | Tesla | 10-K | `data/html/Tesla.10-K.html` | Manufacturing sector, different formatting |
| 6 | [TBD] | 10-Q | TBD | Quarterly report format |
| 7 | [TBD] | 10-Q | TBD | Another quarterly for variety |
| 8 | [TBD] | 8-K | `data/html/BuckleInc.8-K.html` | Event-driven filing |
| 9 | [TBD] | Proxy (DEF 14A) | TBD | Proxy statement with compensation tables |
| 10 | [TBD] | Edge case | TBD | Unusual formatting or very large file |

**Note**: Fill in TBD entries as we identify good test candidates.

## The 4-Step Loop

### Step 1: Run Comparison

Use existing test scripts to compare OLD vs NEW parsers:

```bash
# Full comparison with metrics
python tests/manual/check_parser_comparison.py

# Table-focused comparison with rendering
python tests/manual/check_tables.py

# Or run on specific file
python tests/manual/check_html_rewrite.py
```

**Outputs to review**:
- Console output with side-by-side Rich panels
- Metrics (parse time, table count, section detection)
- Rendered tables (old vs new)

### Step 2: Human Review

**Visual Inspection Process**:
1. Look at console output directly (Rich rendering)
2. For detailed text comparison, optionally dump to files:
   - OLD parser: `doc.text()` → `output/old_apple.txt`
   - NEW parser: `doc.text()` → `output/new_apple.txt`
   - Use `diff` or visual diff tool
3. Take screenshots for complex table issues
4. Focus on:
   - Table alignment and formatting
   - Currency symbol placement (should be merged: `$1,234` not `$ | 1,234`)
   - Column count (fewer is better after removing spacing columns)
   - Section detection accuracy
   - Text readability for LLM context

**Quality Criteria** (from goals.md):
- Semantic meaning preserved
- Tables render correctly when printed
- Better than old parser in speed, accuracy, features
- **You are the final judge**: "Does this look right?"

### Step 3: Document Bugs

Record issues in the tracker below as you find them:

| Bug # | Status | Priority | Description | File/Location | Notes |
|-------|--------|----------|-------------|---------------|-------|
| Example | Fixed | High | Currency symbols not merging in balance sheet | Apple 10-K, Table 5 | Issue in CurrencyColumnMerger |
| | | | | | |
| | | | | | |
| | | | | | |

**Status values**: Open, In Progress, Fixed, Won't Fix, Deferred
**Priority values**: Critical, High, Medium, Low

**Bug Description Template**:
- What's wrong: Clear description of the issue
- Where: Which file/table/section
- Expected: What it should look like
- Actual: What it currently looks like
- Impact: How it affects usability/readability

### Step 4: Fix & Repeat

1. Pick highest priority bug
2. Fix the code
3. Re-run comparison on affected file(s)
4. Verify fix doesn't break other files
5. Mark bug as Fixed
6. Repeat until exit criteria met

**Quick verification**:
```bash
# Re-run just the problematic file
python -c "
from edgar.documents import parse_html
from pathlib import Path
html = Path('data/html/Apple.10-K.html').read_text()
doc = parse_html(html)
# Quick inspection
print(f'Tables: {len(doc.tables)}')
print(doc.tables[5].render(width=200))  # Check specific table
"
```

## Exit Criteria

We're done when:
1. ✅ All 10 test documents parse successfully
2. ✅ Visual output looks correct (maintainer approval)
3. ✅ Tables render cleanly with proper alignment
4. ✅ No critical or high priority bugs remain
5. ✅ Performance is equal or better than old parser
6. ✅ Text extraction is complete and clean for AI context

**Final approval**: Maintainer says "This is good enough to ship."

## Testing Infrastructure

### Primary Tool: compare_parsers.py

Simple command-line tool for the quality improvement loop:

```bash
# Quick overview comparison (using shortcuts!)
python tests/manual/compare_parsers.py aapl

# See all tables in a document
python tests/manual/compare_parsers.py aapl --tables

# Compare specific table (OLD vs NEW side-by-side)
python tests/manual/compare_parsers.py aapl --table 5

# Compare text extraction
python tests/manual/compare_parsers.py msft --text

# See section detection
python tests/manual/compare_parsers.py orcl --sections

# Test with 10-Q filings
python tests/manual/compare_parsers.py 'aapl 10-q'

# Run all test files at once
python tests/manual/compare_parsers.py --all
```

**Shortcuts available**:
- Companies: `aapl`, `msft`, `tsla`, `nvda`, `orcl`
- Filing types: `10-k` (default), `10-q`, `8-k`
- Or use full file paths

**Features**:
- Clean command-line interface
- Side-by-side OLD vs NEW comparison
- Rich console output with colors and tables
- Performance metrics
- Individual table inspection

### Other Available Scripts

Additional tools for specific testing:

- `tests/manual/check_parser_comparison.py` - Full comparison with metrics
- `tests/manual/check_tables.py` - Table-specific comparison with rendering
- `tests/manual/check_html_rewrite.py` - General HTML parsing checks
- `tests/manual/check_html_parser_real_files.py` - Real filing tests

## Quick Reference

For day-to-day testing commands and usage examples, see [TESTING.md](TESTING.md).

## Notes

- **Keep it simple**: This is about rapid iteration, not comprehensive automation
- **Visual inspection is key**: Automated metrics don't catch layout/formatting issues
- **Use screenshots**: When describing bugs, screenshots speak louder than words
- **Iterative approach**: Don't try to fix everything at once, prioritize
- **Trust your judgment**: If it looks wrong, it probably is wrong

## Bug Tracker

### Active Issues

(Add bugs here as they're discovered)

### Fixed Issues

(Move completed bugs here for history)

### Deferred Issues

(Issues that aren't blocking release but could be improved later)

---

**Status**: Initial draft
**Last Updated**: 2025-10-07
**Maintainer**: Dwight Gunning