HTML Parser Testing Quick Start

Quick reference for testing the HTML parser rewrite during quality improvement.

Quick Start

# Use shortcuts (easy!)
python tests/manual/compare_parsers.py aapl              # Apple 10-K
python tests/manual/compare_parsers.py nvda --tables     # Nvidia tables
python tests/manual/compare_parsers.py 'aapl 10-q'       # Apple 10-Q
python tests/manual/compare_parsers.py orcl --table 5    # Oracle table #5

# Or use full paths
python tests/manual/compare_parsers.py data/html/Apple.10-K.html

# Run all test files
python tests/manual/compare_parsers.py --all

Available shortcuts:

  • Companies: aapl, msft, tsla, nvda, orcl (or full names like apple)
  • Filing types: 10-k (default), 10-q, 8-k
  • Combine: 'aapl 10-q', 'orcl 8-k'
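
For example (assuming shortcuts simply resolve to the files under data/html/ listed in Test Files below), the shortcut and full-path forms are interchangeable, and shortcuts accept the same options as full paths:

# Shortcut vs. full path - these two should be equivalent
python tests/manual/compare_parsers.py aapl
python tests/manual/compare_parsers.py data/html/Apple.10-K.html

# Shortcuts combine with any option from the Command Reference below
python tests/manual/compare_parsers.py 'orcl 8-k' --sections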

Common Use Cases

1. First Look at a Filing

# Get overview: speed, table count, sections
python tests/manual/compare_parsers.py orcl

Shows:

  • Parse time comparison (OLD vs NEW)
  • Tables found
  • Text length
  • Sections detected
  • New features (headings, XBRL)

2. Check Table Rendering

# List all tables with dimensions (shows first 20 tables)
python tests/manual/compare_parsers.py aapl --tables

# Compare specific table side-by-side (FULL table, no truncation)
python tests/manual/compare_parsers.py aapl --table 7

# Compare a range of tables
python tests/manual/compare_parsers.py aapl --range 5:10

Look for:

  • Currency symbols merged: $1,234 not $ | 1,234
  • Proper column alignment
  • Correct row/column counts
  • Clean rendering without extra spacing columns

Note: --table N shows the complete table with all rows - no truncation!
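
As a rough spot-check before eyeballing a table (not part of the script, and it assumes rendered cells are separated by a pipe or box-drawing bar), grep the output for a lone $ sitting next to a cell border:

# Hypothetical check: a lone "$" before a cell border usually means
# the currency symbol was not merged with its value
python tests/manual/compare_parsers.py aapl --table 7 | grep -nE '\$ +[|│]' || echo "No obvious unmerged currency symbols"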

3. Verify Text Extraction

# See first 50 lines side-by-side (default limit)
python tests/manual/compare_parsers.py msft --text

# Show more lines (configurable)
python tests/manual/compare_parsers.py msft --text --lines 100

# Show first 200 lines
python tests/manual/compare_parsers.py msft --text --lines 200

Check:

  • Semantic meaning preserved
  • No missing content
  • Clean formatting for LLM consumption

Note: Text mode shows first N lines only (default: 50). Use --lines N to adjust.
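
For a closer read than the terminal preview allows, one option (assuming the comparison is written to stdout) is to redirect a longer run to a file and search it; the phrase below is only an example:

# Save a longer preview, then confirm an expected phrase survived extraction
python tests/manual/compare_parsers.py msft --text --lines 500 > /tmp/msft_text.txt
grep -c "Management" /tmp/msft_text.txt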

4. Check Section Detection

python tests/manual/compare_parsers.py aapl --sections

Verify:

  • Standard sections identified (10-K/10-Q)
  • Section boundaries correct
  • Text length reasonable per section
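
One quick way to scan for the standard items (assuming section names appear verbatim in the comparison output) is to filter for them:

# Filter the sections comparison for specific 10-K items
python tests/manual/compare_parsers.py aapl --sections | grep -iE 'item 1a|item 7|item 8'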

5. Run Full Test Suite

# Test all files in corpus
python tests/manual/compare_parsers.py --all

Results:

  • Summary table across all files
  • Performance comparison
  • Table detection comparison
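
To keep a record for later regression comparison, one approach (the /tmp path is arbitrary) is to capture the summary to a dated file:

# Capture the full-corpus summary; diff against an earlier run to spot regressions
python tests/manual/compare_parsers.py --all | tee "/tmp/compare_all_$(date +%Y%m%d).txt"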

Test Files

Available in data/html/:

  • Apple.10-K.html - 1.8MB, complex financials
  • Oracle.10-K.html - Large filing
  • Nvidia.10-K.html - Tech company
  • Apple.10-Q.html - Quarterly format
  • More files as needed...

Command Reference

python tests/manual/compare_parsers.py [FILE] [OPTIONS]

Options:
  --all           Run on all test files
  --tables        Show tables summary (first 20 tables)
  --table N       Show specific table N side-by-side (FULL table)
  --range START:END  Show range of tables (e.g., 5:10)
  --text          Show text comparison (first 50 lines by default)
  --sections      Show sections comparison
  --lines N       Number of text lines to show (default: 50, only for --text)
  --help          Show full help

Output Limits Summary

Mode           Limit        Configurable       Notes
--table N      None         N/A                Shows complete table
--range N:M    None         N/A                Shows complete tables in range
--tables       20 tables    No                 Lists first 20 tables only
--text         50 lines     Yes (--lines N)    Preview only
--sections     None         N/A                Shows all sections

Output Interpretation

Overview Table

┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Metric        ┃ Old Parser ┃ New Parser ┃ Notes      ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Parse Time    │ 454ms      │ 334ms      │ 1.4x faster│
│ Tables Found  │ 63         │ 63         │ +0         │
│ Text Length   │ 0          │ 159,388    │ NEW!       │
└───────────────┴────────────┴────────────┴────────────┘

Good signs:

  • New parser faster or similar speed
  • Same or more tables found
  • Text extracted (old parser shows 0)
  • Sections detected

Red flags:

  • Significantly slower
  • Fewer tables (unless the missing ones are layout tables that were intentionally dropped)
  • Much shorter text (content missing)

Table Comparison

Old Parser:
┌─────────┬──────────┬──────────┐
│ Year    │ Revenue  │ Profit   │
├─────────┼──────────┼──────────┤
│ 2023    │ $ 100M   │ $ 20M    │  <- Currency separated
└─────────┴──────────┴──────────┘

New Parser:
┌─────────┬──────────┬──────────┐
│ Year    │ Revenue  │ Profit   │
├─────────┼──────────┼──────────┤
│ 2023    │ $100M    │ $20M     │  <- Currency merged ✅
└─────────┴──────────┴──────────┘

Look for:

  • Currency symbols merged with values
  • No extra empty columns
  • Proper alignment
  • Clean numeric formatting

Tips

  1. Start with overview - Get the big picture first
  2. Check tables visually - Automated metrics miss formatting issues
  3. Use specific table inspection - Don't scroll through 60 tables manually
  4. Compare text for semantics - Does it make sense for an LLM?
  5. Run --all periodically - Catch regressions across files
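
For tip 5, a simple way to sweep every company shortcut in one go (shortcut names taken from the Quick Start above) is a shell loop:

# Run the table summary for each known company shortcut
for ticker in aapl msft tsla nvda orcl; do
    python tests/manual/compare_parsers.py "$ticker" --tables
done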

Troubleshooting

Script fails with import error

# Clear cached modules
find . -type d -name __pycache__ -exec rm -rf {} +
python tests/manual/compare_parsers.py data/html/Apple.10-K.html

File not found

# Check available files
ls -lh data/html/*.html

# Use full path
python tests/manual/compare_parsers.py /full/path/to/file.html

Old parser shows 0 text

This is expected: the old parser uses a different text extraction path. Focus on:

  • Table comparison
  • Parse time
  • Visual quality of output

Next Steps

  1. Run comparison on all test files
  2. Document bugs in quality-improvement-strategy.md
  3. Fix issues
  4. Repeat until satisfied

See edgar/documents/docs/quality-improvement-strategy.md for the full process.