# HTML Parser Testing Quick Start
Quick reference for testing the HTML parser rewrite during quality improvement.
## Quick Start
```bash
# Use shortcuts (easy!)
python tests/manual/compare_parsers.py aapl # Apple 10-K
python tests/manual/compare_parsers.py nvda --tables # Nvidia tables
python tests/manual/compare_parsers.py 'aapl 10-q' # Apple 10-Q
python tests/manual/compare_parsers.py orcl --table 5 # Oracle table #5
# Or use full paths
python tests/manual/compare_parsers.py data/html/Apple.10-K.html
# Run all test files
python tests/manual/compare_parsers.py --all
```
**Available shortcuts:**
- **Companies**: `aapl`, `msft`, `tsla`, `nvda`, `orcl` (or full names like `apple`)
- **Filing types**: `10-k` (default), `10-q`, `8-k`
- **Combine**: `'aapl 10-q'`, `'orcl 8-k'`
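
For orientation, the sketch below shows one way a shortcut such as `'aapl 10-q'` could map to a file in `data/html/`. It is not the script's actual code: `resolve_shortcut`, `COMPANY_ALIASES`, and `FORM_TYPES` are hypothetical names, and the alias table covers only the companies whose file names appear in this guide.

```python
from pathlib import Path

# Hypothetical alias table (assumption); the real script's mapping may differ.
COMPANY_ALIASES = {
    "aapl": "Apple", "apple": "Apple",
    "nvda": "Nvidia", "nvidia": "Nvidia",
    "orcl": "Oracle", "oracle": "Oracle",
}
FORM_TYPES = {"10-k": "10-K", "10-q": "10-Q", "8-k": "8-K"}


def resolve_shortcut(shortcut: str, data_dir: str = "data/html") -> Path:
    """Turn a shortcut such as 'aapl 10-q' into a test-file path."""
    parts = shortcut.lower().split()
    if not parts or len(parts) > 2:
        raise ValueError(f"Expected 'company [form]', got {shortcut!r}")
    company = COMPANY_ALIASES.get(parts[0])
    if company is None:
        raise ValueError(f"Unknown company shortcut: {parts[0]!r}")
    # 10-K is the documented default when no filing type is given.
    form_key = parts[1] if len(parts) == 2 else "10-k"
    form = FORM_TYPES.get(form_key)
    if form is None:
        raise ValueError(f"Unknown filing type: {form_key!r}")
    return Path(data_dir) / f"{company}.{form}.html"


print(resolve_shortcut("aapl 10-q").as_posix())  # data/html/Apple.10-Q.html
```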
## Common Use Cases
### 1. First Look at a Filing
```bash
# Get overview: speed, table count, sections
python tests/manual/compare_parsers.py orcl
```
**Shows**:
- Parse time comparison (OLD vs NEW)
- Tables found
- Text length
- Sections detected
- New features (headings, XBRL)
### 2. Check Table Rendering
```bash
# List all tables with dimensions (shows first 20 tables)
python tests/manual/compare_parsers.py aapl --tables
# Compare specific table side-by-side (FULL table, no truncation)
python tests/manual/compare_parsers.py aapl --table 7
# Compare a range of tables
python tests/manual/compare_parsers.py aapl --range 5:10
```
**Look for**:
- Currency symbols merged: `$1,234` not `$ | 1,234`
- Proper column alignment
- Correct row/column counts
- Clean rendering without extra spacing columns
**Note**: `--table N` shows the **complete table** with all rows - no truncation!
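
To make the "currency symbols merged" check concrete, here is a minimal sketch of the transformation you expect to see between the two renderings. `merge_currency_cells` is a hypothetical helper, not the parser's real implementation, and the numeric pattern is an assumption about what counts as a value.

```python
import re

CURRENCY_SYMBOLS = {"$", "€", "£", "¥"}
# Assumed shape of a numeric cell: 1,234 / 12.5 / (45) / 100M / 15%
NUMERIC = re.compile(r"^\(?[\d,]+(\.\d+)?\)?[A-Za-z%]?$")


def merge_currency_cells(row):
    """Join a lone currency-symbol cell with the numeric cell that follows it."""
    merged = []
    i = 0
    while i < len(row):
        cell = row[i].strip()
        nxt = row[i + 1].strip() if i + 1 < len(row) else ""
        if cell in CURRENCY_SYMBOLS and NUMERIC.match(nxt):
            merged.append(cell + nxt)
            i += 2  # consume both the symbol and the value
        else:
            merged.append(cell)
            i += 1
    return merged


print(merge_currency_cells(["Revenue", "$", "1,234"]))  # ['Revenue', '$1,234']
```

If a row in the new parser's output still looks like the input above, that table is worth a closer look with `--table N`.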
### 3. Verify Text Extraction
```bash
# See first 50 lines side-by-side (default limit)
python tests/manual/compare_parsers.py msft --text
# Show more lines (configurable)
python tests/manual/compare_parsers.py msft --text --lines 100
# Show first 200 lines
python tests/manual/compare_parsers.py msft --text --lines 200
```
**Check**:
- Semantic meaning preserved
- No missing content
- Clean formatting for LLM consumption
**Note**: Text mode shows first N lines only (default: 50). Use `--lines N` to adjust.
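
The preview behaviour described above can be sketched as follows. This is only an illustration of a side-by-side view limited to the first N lines; `side_by_side` and its `width` parameter are hypothetical and do not come from the script.

```python
from itertools import zip_longest


def side_by_side(old_text, new_text, lines=50, width=40):
    """Pair the first `lines` lines of each text, truncating cells to `width`."""
    old_lines = old_text.splitlines()[:lines]
    new_lines = new_text.splitlines()[:lines]
    rows = []
    # zip_longest pads the shorter side so missing lines show up as blanks.
    for old, new in zip_longest(old_lines, new_lines, fillvalue=""):
        rows.append(f"{old[:width]:<{width}} | {new[:width]}")
    return rows


for row in side_by_side("a\nb\nc", "x\ny", lines=2, width=3):
    print(row)
```

Blank cells on one side are a quick visual cue for missing content.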
### 4. Check Section Detection
```bash
python tests/manual/compare_parsers.py aapl --sections
```
**Verify**:
- Standard sections identified (10-K/10-Q)
- Section boundaries correct
- Text length reasonable per section
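
As a rough aid for the "text length reasonable per section" check, the sketch below flags sections that look suspiciously short or that swallow most of the document. It assumes sections are available as a name-to-text mapping; `flag_suspicious_sections` and both thresholds are hypothetical choices, not values used by the parser.

```python
def flag_suspicious_sections(sections, min_chars=200, max_share=0.8):
    """Return {section name: reason} for sections with implausible lengths."""
    total = sum(len(text) for text in sections.values())
    flags = {}
    for name, text in sections.items():
        if len(text) < min_chars:
            flags[name] = "too short"
        elif total and len(text) / total > max_share:
            # One section holding most of the text often means a missed boundary.
            flags[name] = "dominates document"
    return flags


sections = {"Item 1": "x" * 1000, "Item 1A": "x" * 50, "Item 7": "x" * 900}
print(flag_suspicious_sections(sections))  # {'Item 1A': 'too short'}
```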
### 5. Run Full Test Suite
```bash
# Test all files in corpus
python tests/manual/compare_parsers.py --all
```
**Results**:
- Summary table across all files
- Performance comparison
- Table detection comparison
## Test Files
Available in `data/html/`:
- `Apple.10-K.html` - 1.8MB, complex financials
- `Oracle.10-K.html` - Large filing
- `Nvidia.10-K.html` - Tech company
- `Apple.10-Q.html` - Quarterly format
- More files as needed...
## Command Reference
```
python tests/manual/compare_parsers.py [FILE] [OPTIONS]

Options:
  --all                Run on all test files
  --tables             Show tables summary (first 20 tables)
  --table N            Show specific table N side-by-side (FULL table)
  --range START:END    Show range of tables (e.g., 5:10)
  --text               Show text comparison (first 50 lines by default)
  --sections           Show sections comparison
  --lines N            Number of text lines to show (default: 50, only for --text)
  --help               Show full help
```
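
If you need to script around these options, the sketch below mirrors them with `argparse`. It is an approximation built from the reference above, not the script's actual parser: `build_parser`, `parse_range`, and the `table_range` attribute name are assumptions.

```python
import argparse


def parse_range(value):
    """Parse 'START:END' into a (start, end) tuple of ints."""
    start, sep, end = value.partition(":")
    if not sep or not start.isdigit() or not end.isdigit():
        raise argparse.ArgumentTypeError(f"expected START:END, got {value!r}")
    start_i, end_i = int(start), int(end)
    if start_i > end_i:
        raise argparse.ArgumentTypeError("START must not exceed END")
    return start_i, end_i


def build_parser():
    parser = argparse.ArgumentParser(prog="compare_parsers.py")
    parser.add_argument("file", nargs="?", help="shortcut or path to an HTML filing")
    parser.add_argument("--all", action="store_true", help="run on all test files")
    parser.add_argument("--tables", action="store_true", help="show tables summary")
    parser.add_argument("--table", type=int, metavar="N", help="show table N in full")
    parser.add_argument("--range", type=parse_range, dest="table_range",
                        metavar="START:END", help="show a range of tables")
    parser.add_argument("--text", action="store_true", help="show text comparison")
    parser.add_argument("--sections", action="store_true", help="show sections")
    parser.add_argument("--lines", type=int, default=50, metavar="N",
                        help="number of text lines to show with --text")
    return parser


args = build_parser().parse_args(["aapl", "--range", "5:10"])
print(args.file, args.table_range, args.lines)  # aapl (5, 10) 50
```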
### Output Limits Summary
| Mode          | Limit      | Configurable      | Notes                              |
|---------------|------------|-------------------|------------------------------------|
| `--table N`   | None       | N/A               | Shows **complete table**           |
| `--range N:M` | None       | N/A               | Shows **complete tables** in range |
| `--tables`    | 20 tables  | No                | Lists first 20 tables only         |
| `--text`      | 50 lines   | Yes (`--lines N`) | Preview only                       |
| `--sections`  | None       | N/A               | Shows all sections                 |
## Output Interpretation
### Overview Table
```
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Metric        ┃ Old Parser ┃ New Parser ┃ Notes      ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Parse Time    │ 454ms      │ 334ms      │ 1.4x faster│
│ Tables Found  │ 63         │ 63         │ +0         │
│ Text Length   │ 0          │ 159,388    │ NEW!       │
└───────────────┴────────────┴────────────┴────────────┘
```
**Good signs**:
- ✅ New parser faster or similar speed
- ✅ Same or more tables found
- ✅ Text extracted (old parser shows 0)
- ✅ Sections detected
**Red flags**:
- ❌ Significantly slower
- ❌ Fewer tables (unless removing layout tables)
- ❌ Much shorter text (content missing)
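
The checks above can be expressed as a small helper when you want to scan many results quickly. This is a sketch only: `speedup`, `red_flags`, the dictionary keys, and both thresholds are hypothetical, since the guide does not define what "significantly" or "much shorter" mean numerically.

```python
def speedup(old_ms, new_ms):
    """Return how many times faster the new parser is (values > 1 mean faster)."""
    return round(old_ms / new_ms, 1)


def red_flags(old, new, max_slowdown=1.5, min_text_ratio=0.8):
    """List red flags for one file, given old/new metric dictionaries."""
    flags = []
    if new["parse_ms"] > old["parse_ms"] * max_slowdown:
        flags.append("significantly slower")
    if new["tables"] < old["tables"]:
        flags.append("fewer tables")
    # Skip the text check when the old parser reports 0, which is expected.
    if old["text_len"] > 0 and new["text_len"] < old["text_len"] * min_text_ratio:
        flags.append("much shorter text")
    return flags


# Values taken from the overview table above.
old = {"parse_ms": 454, "tables": 63, "text_len": 0}
new = {"parse_ms": 334, "tables": 63, "text_len": 159_388}
print(speedup(old["parse_ms"], new["parse_ms"]), red_flags(old, new))  # 1.4 []
```

A "fewer tables" flag still needs a manual look, because dropping layout tables is a legitimate reason for a lower count.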
### Table Comparison
```
Old Parser:
┌─────────┬──────────┬──────────┐
│ Year    │ Revenue  │ Profit   │
├─────────┼──────────┼──────────┤
│ 2023    │ $ 100M   │ $ 20M    │ <- Currency separated
└─────────┴──────────┴──────────┘

New Parser:
┌─────────┬──────────┬──────────┐
│ Year    │ Revenue  │ Profit   │
├─────────┼──────────┼──────────┤
│ 2023    │ $100M    │ $20M     │ <- Currency merged ✅
└─────────┴──────────┴──────────┘
```
**Look for**:
- Currency symbols merged with values
- No extra empty columns
- Proper alignment
- Clean numeric formatting
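
For the "no extra empty columns" check, the following sketch shows the kind of cleanup to expect from a well-rendered table. `drop_empty_columns` is a hypothetical helper written for illustration and makes no claim about how the new parser removes spacing columns.

```python
def drop_empty_columns(rows):
    """Remove columns that are blank (or whitespace-only) in every row."""
    if not rows:
        return []
    width = max(len(row) for row in rows)
    # Pad ragged rows so every row can be indexed by column.
    padded = [list(row) + [""] * (width - len(row)) for row in rows]
    keep = [c for c in range(width) if any(row[c].strip() for row in padded)]
    return [[row[c] for c in keep] for row in padded]


rows = [["Year", "", "Revenue"], ["2023", " ", "$100M"]]
print(drop_empty_columns(rows))  # [['Year', 'Revenue'], ['2023', '$100M']]
```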
## Tips
1. **Start with overview** - Get the big picture first
2. **Check tables visually** - Automated metrics miss formatting issues
3. **Use specific table inspection** - Don't scroll through 60 tables manually
4. **Compare text for semantics** - Does it make sense for an LLM?
5. **Run --all periodically** - Catch regressions across files
## Troubleshooting
### Script fails with import error
```bash
# Clear cached modules
find . -type d -name __pycache__ -exec rm -rf {} +
python tests/manual/compare_parsers.py data/html/Apple.10-K.html
```
### File not found
```bash
# Check available files
ls -lh data/html/*.html
# Use full path
python tests/manual/compare_parsers.py /full/path/to/file.html
```
### Old parser shows 0 text
This is expected: the old parser extracts text differently, so its text length is reported as 0. Focus on:
- Table comparison
- Parse time
- Visual quality of output
## Next Steps
1. Run comparison on all test files
2. Document bugs in `quality-improvement-strategy.md`
3. Fix issues
4. Repeat until satisfied
See `edgar/documents/docs/quality-improvement-strategy.md` for the full process.