kadu/edgartools

kdusek 8e654ed209 Initial commit

2025-12-09 12:13:01 +01:00

HTML Parser Testing Quick Start

Quick reference for comparing the old and new HTML parsers side by side while the rewrite is undergoing quality improvement.

Quick Start

# Use shortcuts (easy!)
python tests/manual/compare_parsers.py aapl              # Apple 10-K
python tests/manual/compare_parsers.py nvda --tables     # Nvidia tables
python tests/manual/compare_parsers.py 'aapl 10-q'       # Apple 10-Q
python tests/manual/compare_parsers.py orcl --table 5    # Oracle table #5

# Or use full paths
python tests/manual/compare_parsers.py data/html/Apple.10-K.html

# Run all test files
python tests/manual/compare_parsers.py --all

Available shortcuts:

Companies: aapl, msft, tsla, nvda, orcl (or full names like apple)
Filing types: 10-k (default), 10-q, 8-k
Combine: 'aapl 10-q', 'orcl 8-k'
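To make the shortcut rules concrete, here is a minimal sketch of how a shortcut string could be resolved to a file under data/html/. It is not the script's actual code: `resolve_shortcut`, `COMPANIES`, and `FORMS` are hypothetical names, and the mapping only covers companies whose file names appear in this document.

```python
from pathlib import Path

# Hypothetical lookup tables; the real script may store these differently.
# Only companies with file names listed in this document are included.
COMPANIES = {
    "aapl": "Apple", "apple": "Apple",
    "nvda": "Nvidia", "nvidia": "Nvidia",
    "orcl": "Oracle", "oracle": "Oracle",
}
FORMS = {"10-k": "10-K", "10-q": "10-Q", "8-k": "8-K"}


def resolve_shortcut(arg: str, root: Path = Path("data/html")) -> Path:
    """Turn 'aapl' or 'aapl 10-q' into a path such as data/html/Apple.10-Q.html."""
    parts = arg.lower().split()
    if not parts or parts[0] not in COMPANIES:
        # Not a known shortcut: treat the argument as a literal file path.
        return Path(arg)
    form_key = parts[1] if len(parts) > 1 else "10-k"  # 10-K is the default
    if form_key not in FORMS:
        raise ValueError(f"unknown filing type: {form_key!r}")
    return root / f"{COMPANIES[parts[0]]}.{FORMS[form_key]}.html"
```

Under these assumptions, `resolve_shortcut("aapl 10-q")` yields `data/html/Apple.10-Q.html`, and anything that is not a known shortcut is passed through as a path.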

Common Use Cases

1. First Look at a Filing

# Get overview: speed, table count, sections
python tests/manual/compare_parsers.py orcl

Shows:

Parse time comparison (OLD vs NEW)
Tables found
Text length
Sections detected
New features (headings, XBRL)

2. Check Table Rendering

# List all tables with dimensions (shows first 20 tables)
python tests/manual/compare_parsers.py aapl --tables

# Compare specific table side-by-side (FULL table, no truncation)
python tests/manual/compare_parsers.py aapl --table 7

# Compare a range of tables
python tests/manual/compare_parsers.py aapl --range 5:10

Look for:

Currency symbols merged: $1,234 not $ | 1,234
Proper column alignment
Correct row/column counts
Clean rendering without extra spacing columns
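The currency check above can be illustrated with a small sketch of what "merged" means for a single table row. This is an illustration only, not the new parser's implementation; `merge_currency_cells` and `CURRENCY_SYMBOLS` are hypothetical names.

```python
# Hypothetical set of symbols; the real parser may recognise more or fewer.
CURRENCY_SYMBOLS = {"$", "€", "£", "¥"}


def merge_currency_cells(row: list[str]) -> list[str]:
    """Join a lone currency-symbol cell with the value cell that follows it."""
    merged = []
    i = 0
    while i < len(row):
        cell = row[i].strip()
        nxt = row[i + 1].strip() if i + 1 < len(row) else ""
        if cell in CURRENCY_SYMBOLS and nxt:
            # "$" followed by "1,234" becomes a single "$1,234" cell.
            merged.append(cell + nxt)
            i += 2
        else:
            merged.append(cell)
            i += 1
    return merged
```

A row rendered as `Revenue | $ | 1,234` by the old parser would come out as `Revenue | $1,234`, which is the result to look for in the side-by-side view.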

Note: --table N shows the complete table with all rows - no truncation!

3. Verify Text Extraction

# See first 50 lines side-by-side (default limit)
python tests/manual/compare_parsers.py msft --text

# Show more lines (configurable)
python tests/manual/compare_parsers.py msft --text --lines 100

# Show first 200 lines
python tests/manual/compare_parsers.py msft --text --lines 200

Check:

Semantic meaning preserved
No missing content
Clean formatting for LLM consumption

Note: Text mode shows first N lines only (default: 50). Use --lines N to adjust.
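As a rough picture of what the line limit does, the following sketch pairs the first N lines of each parser's text output. It assumes both outputs are plain strings; `side_by_side` and its `width` parameter are hypothetical and do not describe how the script actually renders its output.

```python
from itertools import zip_longest


def side_by_side(old_text: str, new_text: str, lines: int = 50, width: int = 40) -> str:
    """Pair up the first `lines` lines of each text, left column padded to `width`."""
    old_lines = old_text.splitlines()[:lines]
    new_lines = new_text.splitlines()[:lines]
    rows = []
    # zip_longest keeps going when one side has fewer lines (e.g. an empty old output).
    for left, right in zip_longest(old_lines, new_lines, fillvalue=""):
        rows.append(f"{left[:width]:<{width}} | {right[:width]}")
    return "\n".join(rows)
```

Anything past the limit is simply not shown, so a difference further down the filing will only appear after raising --lines.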

4. Check Section Detection

python tests/manual/compare_parsers.py aapl --sections

Verify:

Standard sections identified (10-K/10-Q)
Section boundaries correct
Text length reasonable per section

5. Run Full Test Suite

# Test all files in corpus
python tests/manual/compare_parsers.py --all

Results:

Summary table across all files
Performance comparison
Table detection comparison

Test Files

Available in data/html/:

Apple.10-K.html - 1.8MB, complex financials
Oracle.10-K.html - Large filing
Nvidia.10-K.html - Tech company
Apple.10-Q.html - Quarterly format
More files as needed...

Command Reference

python tests/manual/compare_parsers.py [FILE] [OPTIONS]

Options:
  --all              Run on all test files
  --tables           Show tables summary (first 20 tables)
  --table N          Show specific table N side-by-side (FULL table)
  --range START:END  Show range of tables (e.g., 5:10)
  --text             Show text comparison (first 50 lines by default)
  --sections         Show sections comparison
  --lines N          Number of text lines to show (default: 50, only for --text)
  --help             Show full help
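For readers who want to see how these options fit together, here is a sketch of an equivalent argparse definition. The flag names and defaults come from the reference above; everything else (`build_parser`, `parse_range`, the `table_range` destination, the validation rules) is an assumption and may differ from the real script.

```python
import argparse


def parse_range(value: str) -> tuple[int, int]:
    """Parse 'START:END' (e.g. '5:10') into a pair of ints."""
    start, sep, end = value.partition(":")
    if not sep or not start.isdigit() or not end.isdigit():
        raise argparse.ArgumentTypeError(f"expected START:END, got {value!r}")
    if int(start) > int(end):
        raise argparse.ArgumentTypeError("START must not exceed END")
    return int(start), int(end)


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="compare_parsers.py")
    parser.add_argument("file", nargs="?", help="file path or shortcut such as 'aapl 10-q'")
    parser.add_argument("--all", action="store_true", help="run on all test files")
    parser.add_argument("--tables", action="store_true", help="show tables summary")
    parser.add_argument("--table", type=int, metavar="N", help="show table N in full")
    parser.add_argument("--range", type=parse_range, metavar="START:END",
                        dest="table_range", help="show a range of tables")
    parser.add_argument("--text", action="store_true", help="show text comparison")
    parser.add_argument("--sections", action="store_true", help="show sections comparison")
    parser.add_argument("--lines", type=int, default=50, metavar="N",
                        help="number of text lines to show (only for --text)")
    return parser
```

With this sketch, `build_parser().parse_args(["aapl", "--range", "5:10"])` gives `file="aapl"` and `table_range=(5, 10)`, with `lines` left at its default of 50.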

Output Limits Summary

| Mode | Limit | Configurable | Notes |
|---|---|---|---|
| `--table N` | None | N/A | Shows complete table |
| `--range N:M` | None | N/A | Shows complete tables in range |
| `--tables` | 20 tables | No | Lists first 20 tables only |
| `--text` | 50 lines | Yes (`--lines N`) | Preview only |
| `--sections` | None | N/A | Shows all sections |

Output Interpretation

Overview Table

┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Metric        ┃ Old Parser ┃ New Parser ┃ Notes      ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Parse Time    │ 454ms      │ 334ms      │ 1.4x faster│
│ Tables Found  │ 63         │ 63         │ +0         │
│ Text Length   │ 0          │ 159,388    │ NEW!       │
└───────────────┴────────────┴────────────┴────────────┘

Good signs:

✅ New parser faster or similar speed
✅ Same or more tables found
✅ Text extracted (old parser shows 0)
✅ Sections detected

Red flags:

❌ Significantly slower
❌ Fewer tables (unless removing layout tables)
❌ Much shorter text (content missing)
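The good signs and red flags above can be turned into a rough automated check. The sketch below is an assumption-laden illustration: `Metrics`, `speedup_note`, and `red_flags` are hypothetical names, and the thresholds (`slow_factor=1.5`, `min_text_ratio=0.5`) are placeholders for "significantly slower" and "much shorter", which this document does not quantify.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    parse_ms: float
    tables: int
    text_length: int


def speedup_note(old_ms: float, new_ms: float) -> str:
    """Format the parse-time ratio the way the Notes column does."""
    if old_ms <= 0 or new_ms <= 0:
        raise ValueError("parse times must be positive")
    ratio = old_ms / new_ms
    return f"{ratio:.1f}x faster" if ratio >= 1 else f"{1 / ratio:.1f}x slower"


def red_flags(old: Metrics, new: Metrics,
              slow_factor: float = 1.5, min_text_ratio: float = 0.5) -> list[str]:
    """Return the red flags that apply; thresholds are assumed, not from the script."""
    flags = []
    if new.parse_ms > old.parse_ms * slow_factor:
        flags.append("significantly slower")
    if new.tables < old.tables:
        flags.append("fewer tables")
    # Skip the text check when the old parser extracted nothing to compare against.
    if old.text_length > 0 and new.text_length < old.text_length * min_text_ratio:
        flags.append("much shorter text")
    return flags
```

For the sample overview table, `speedup_note(454, 334)` gives `"1.4x faster"` and `red_flags(Metrics(454, 63, 0), Metrics(334, 63, 159388))` returns an empty list. A "fewer tables" flag still needs a manual look, since dropped layout tables are expected.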

Table Comparison

Old Parser:
┌─────────┬──────────┬──────────┐
│ Year    │ Revenue  │ Profit   │
├─────────┼──────────┼──────────┤
│ 2023    │ $ 100M   │ $ 20M    │  <- Currency separated
└─────────┴──────────┴──────────┘

New Parser:
┌─────────┬──────────┬──────────┐
│ Year    │ Revenue  │ Profit   │
├─────────┼──────────┼──────────┤
│ 2023    │ $100M    │ $20M     │  <- Currency merged ✅
└─────────┴──────────┴──────────┘