# HTML Parser Testing Quick Start

Quick reference for testing the HTML parser rewrite during quality improvement.

## Quick Start

```bash
# Use shortcuts (easy!)
python tests/manual/compare_parsers.py aapl              # Apple 10-K
python tests/manual/compare_parsers.py nvda --tables     # Nvidia tables
python tests/manual/compare_parsers.py 'aapl 10-q'       # Apple 10-Q
python tests/manual/compare_parsers.py orcl --table 5    # Oracle table #5

# Or use full paths
python tests/manual/compare_parsers.py data/html/Apple.10-K.html

# Run all test files
python tests/manual/compare_parsers.py --all
```

**Available shortcuts:**
- **Companies**: `aapl`, `msft`, `tsla`, `nvda`, `orcl` (or full names like `apple`)
- **Filing types**: `10-k` (default), `10-q`, `8-k`
- **Combine**: `'aapl 10-q'`, `'orcl 8-k'`

## Common Use Cases

### 1. First Look at a Filing

```bash
# Get overview: speed, table count, sections
python tests/manual/compare_parsers.py orcl
```

**Shows**:
- Parse time comparison (OLD vs NEW)
- Tables found
- Text length
- Sections detected
- New features (headings, XBRL)

### 2. Check Table Rendering

```bash
# List all tables with dimensions (shows first 20 tables)
python tests/manual/compare_parsers.py aapl --tables

# Compare specific table side-by-side (FULL table, no truncation)
python tests/manual/compare_parsers.py aapl --table 7

# Compare a range of tables
python tests/manual/compare_parsers.py aapl --range 5:10
```

**Look for**:
- Currency symbols merged: `$1,234` not `$ | 1,234`
- Proper column alignment
- Correct row/column counts
- Clean rendering without extra spacing columns

**Note**: `--table N` shows the **complete table** with all rows - no truncation!

### 3. Verify Text Extraction

```bash
# See first 50 lines side-by-side (default limit)
python tests/manual/compare_parsers.py msft --text

# Show more lines (configurable)
python tests/manual/compare_parsers.py msft --text --lines 100

# Show first 200 lines
python tests/manual/compare_parsers.py msft --text --lines 200
```

**Check**:
- Semantic meaning preserved
- No missing content
- Clean formatting for LLM consumption

**Note**: Text mode shows first N lines only (default: 50). Use `--lines N` to adjust.

### 4. Check Section Detection

```bash
python tests/manual/compare_parsers.py aapl --sections
```

**Verify**:
- Standard sections identified (10-K/10-Q)
- Section boundaries correct
- Text length reasonable per section

### 5. Run Full Test Suite

```bash
# Test all files in corpus
python tests/manual/compare_parsers.py --all
```

**Results**:
- Summary table across all files
- Performance comparison
- Table detection comparison

## Test Files

Available in `data/html/`:

- `Apple.10-K.html` - 1.8MB, complex financials
- `Oracle.10-K.html` - Large filing
- `Nvidia.10-K.html` - Tech company
- `Apple.10-Q.html` - Quarterly format
- More files as needed...

## Command Reference

```
python tests/manual/compare_parsers.py [FILE] [OPTIONS]

Options:
  --all           Run on all test files
  --tables        Show tables summary (first 20 tables)
  --table N       Show specific table N side-by-side (FULL table)
  --range START:END  Show range of tables (e.g., 5:10)
  --text          Show text comparison (first 50 lines by default)
  --sections      Show sections comparison
  --lines N       Number of text lines to show (default: 50, only for --text)
  --help          Show full help
```

### Output Limits Summary

| Mode          | Limit      | Configurable      | Notes                           |
|---------------|------------|-------------------|---------------------------------|
| `--table N`   | None       | N/A               | Shows **complete table**        |
| `--range N:M` | None       | N/A               | Shows **complete tables** in range |
| `--tables`    | 20 tables  | No                | Lists first 20 tables only      |
| `--text`      | 50 lines   | Yes (`--lines N`) | Preview only                    |
| `--sections`  | None       | N/A               | Shows all sections              |

## Output Interpretation

### Overview Table

```
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Metric        ┃ Old Parser ┃ New Parser ┃ Notes      ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Parse Time    │ 454ms      │ 334ms      │ 1.4x faster│
│ Tables Found  │ 63         │ 63         │ +0         │
│ Text Length   │ 0          │ 159,388    │ NEW!       │
└───────────────┴────────────┴────────────┴────────────┘
```

**Good signs**:
- ✅ New parser faster or similar speed
- ✅ Same or more tables found
- ✅ Text extracted (old parser shows 0)
- ✅ Sections detected

**Red flags**:
- ❌ Significantly slower
- ❌ Fewer tables (unless removing layout tables)
- ❌ Much shorter text (content missing)

### Table Comparison

```
Old Parser:
┌─────────┬──────────┬──────────┐
│ Year    │ Revenue  │ Profit   │
├─────────┼──────────┼──────────┤
│ 2023    │ $ 100M   │ $ 20M    │  <- Currency separated
└─────────┴──────────┴──────────┘

New Parser:
┌─────────┬──────────┬──────────┐
│ Year    │ Revenue  │ Profit   │
├─────────┼──────────┼──────────┤
│ 2023    │ $100M    │ $20M     │  <- Currency merged ✅
└─────────┴──────────┴──────────┘
```

**Look for**:
- Currency symbols merged with values
- No extra empty columns
- Proper alignment
- Clean numeric formatting

## Tips

1. **Start with overview** - Get the big picture first
2. **Check tables visually** - Automated metrics miss formatting issues
3. **Use specific table inspection** - Don't scroll through 60 tables manually
4. **Compare text for semantics** - Does it make sense for an LLM?
5. **Run --all periodically** - Catch regressions across files

## Troubleshooting

### Script fails with import error

```bash
# Clear cached modules
find . -type d -name __pycache__ -exec rm -rf {} +
python tests/manual/compare_parsers.py data/html/Apple.10-K.html
```

### File not found

```bash
# Check available files
ls -lh data/html/*.html

# Use full path
python tests/manual/compare_parsers.py /full/path/to/file.html
```

### Old parser shows 0 text

This is expected - old parser has different text extraction. Focus on:
- Table comparison
- Parse time
- Visual quality of output

## Next Steps

1. Run comparison on all test files
2. Document bugs in `quality-improvement-strategy.md`
3. Fix issues
4. Repeat until satisfied

See `edgar/documents/docs/quality-improvement-strategy.md` for full process.