
HTML Parser Quality Improvement Strategy

Overview

A simple, iterative testing strategy for the HTML parser rewrite. The goal is a rapid feedback loop: compare OLD vs NEW parser output, identify visual or functional issues, fix them, and repeat until satisfied.

Test Corpus

10 Representative Documents

Selected to cover different filing types, companies, and edge cases:

| #  | Company   | Filing Type     | File Path                    | Rationale                                  |
|----|-----------|-----------------|------------------------------|--------------------------------------------|
| 1  | Apple     | 10-K            | data/html/Apple.10-K.html    | Large complex filing, existing test file   |
| 2  | Oracle    | 10-K            | data/html/Oracle.10-K.html   | Complex financials, existing test file     |
| 3  | Nvidia    | 10-K            | data/html/Nvidia.10-K.html   | Tech company, existing test file           |
| 4  | Microsoft | 10-K            | data/html/Microsoft.10-K.html | Popular company, complex tables           |
| 5  | Tesla     | 10-K            | data/html/Tesla.10-K.html    | Manufacturing sector, different formatting |
| 6  | [TBD]     | 10-Q            | TBD                          | Quarterly report format                    |
| 7  | [TBD]     | 10-Q            | TBD                          | Another quarterly for variety              |
| 8  | [TBD]     | 8-K             | data/html/BuckleInc.8-K.html | Event-driven filing                        |
| 9  | [TBD]     | Proxy (DEF 14A) | TBD                          | Proxy statement with compensation tables   |
| 10 | [TBD]     | Edge case       | TBD                          | Unusual formatting or very large file      |

Note: Fill in TBD entries as we identify good test candidates.

The 4-Step Loop

Step 1: Run Comparison

Use existing test scripts to compare OLD vs NEW parsers:

# Full comparison with metrics
python tests/manual/check_parser_comparison.py

# Table-focused comparison with rendering
python tests/manual/check_tables.py

# Or run on specific file
python tests/manual/check_html_rewrite.py

Outputs to review:

  • Console output with side-by-side Rich panels
  • Metrics (parse time, table count, section detection)
  • Rendered tables (old vs new)
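
The metrics listed above can also be collected by hand when a script's output is not enough. The following is a minimal sketch, assuming each parser can be wrapped as a callable that takes an HTML string and returns an object with a `tables` sequence. `collect_metrics` and `fake_parser` are hypothetical names used for illustration, not part of the existing test scripts.

```python
import time
from types import SimpleNamespace


def collect_metrics(parse, html, repeats=3):
    """Time a parser callable and count the tables it finds.

    `parse` is any callable taking an HTML string and returning an object
    with a `tables` sequence (an assumption about the parser interface).
    """
    best = float("inf")
    doc = None
    for _ in range(repeats):
        start = time.perf_counter()
        doc = parse(html)
        # Keep the fastest run to reduce noise from other processes.
        best = min(best, time.perf_counter() - start)
    return {"parse_seconds": best, "table_count": len(doc.tables)}


def fake_parser(html):
    # Stand-in parser for illustration only: counts opening <table> tags.
    return SimpleNamespace(tables=[None] * html.count("<table"))


metrics = collect_metrics(fake_parser, "<table></table><p>x</p><table></table>")
print(metrics["table_count"], metrics["parse_seconds"])
```

Running the same function once with the OLD parser and once with the NEW parser gives two dictionaries that can be compared field by field.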

Step 2: Human Review

Visual Inspection Process:

  1. Look at console output directly (Rich rendering)
  2. For detailed text comparison, optionally dump to files:
    • OLD parser: doc.text() → output/old_apple.txt
    • NEW parser: doc.text() → output/new_apple.txt
    • Use diff or visual diff tool
  3. Take screenshots for complex table issues
  4. Focus on:
    • Table alignment and formatting
    • Currency symbol placement (should be merged: $1,234 not $ | 1,234)
    • Column count (fewer is better after removing spacing columns)
    • Section detection accuracy
    • Text readability for LLM context
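
For the text comparison in item 2, the standard library's `difflib` can stand in for an external diff tool. This is a minimal sketch: the two sample strings are made up for illustration, and in practice they would be the `doc.text()` output of the OLD and NEW parsers. `text_diff` is a hypothetical helper name.

```python
import difflib


def text_diff(old_text, new_text, old_name="old.txt", new_name="new.txt"):
    """Return a unified diff of two parser text outputs as a single string."""
    diff = difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile=old_name,
        tofile=new_name,
        lineterm="",
    )
    return "\n".join(diff)


# Illustrative input only: a currency symbol split from its value vs merged.
old_text = "Revenue\n$ | 1,234\nNet income"
new_text = "Revenue\n$1,234\nNet income"
print(text_diff(old_text, new_text, "output/old_apple.txt", "output/new_apple.txt"))
```

Identical inputs produce an empty string, which makes the helper easy to use as a quick "did anything change?" check after a fix.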

Quality Criteria (from goals.md):

  • Semantic meaning preserved
  • Tables render correctly when printed
  • Better than old parser in speed, accuracy, features
  • You are the final judge: "Does this look right?"

Step 3: Document Bugs

Record issues in the tracker below as you find them:

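| Bug #   | Status | Priority | Description                                   | File/Location       | Notes                          |
|---------|--------|----------|-----------------------------------------------|---------------------|--------------------------------|
| Example | Fixed  | High     | Currency symbols not merging in balance sheet | Apple 10-K, Table 5 | Issue in CurrencyColumnMerger  |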

Status values: Open, In Progress, Fixed, Won't Fix, Deferred

Priority values: Critical, High, Medium, Low

Bug Description Template:

  • What's wrong: Clear description of the issue
  • Where: Which file/table/section
  • Expected: What it should look like
  • Actual: What it currently looks like
  • Impact: How it affects usability/readability
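
The template fields and the status and priority values above map onto a small record type, which is also enough to decide which bug to work on next. This is a sketch only: the `Bug` class and `next_bug` helper are hypothetical, treating "Open" and "In Progress" as the statuses that still need work is an assumption, and bugs 2 and 3 are invented purely for illustration.

```python
from dataclasses import dataclass

# Priority values from most to least urgent.
PRIORITY_ORDER = ["Critical", "High", "Medium", "Low"]
# Assumption: only these statuses mean a bug still needs work.
ACTIVE_STATUSES = {"Open", "In Progress"}


@dataclass
class Bug:
    number: int
    status: str
    priority: str
    whats_wrong: str
    where: str
    expected: str = ""
    actual: str = ""
    impact: str = ""


def next_bug(bugs):
    """Return the highest-priority bug that still needs work, or None."""
    candidates = [b for b in bugs if b.status in ACTIVE_STATUSES]
    if not candidates:
        return None
    return min(candidates, key=lambda b: PRIORITY_ORDER.index(b.priority))


bugs = [
    Bug(1, "Fixed", "High", "Currency symbols not merging in balance sheet",
        "Apple 10-K, Table 5"),
    # The two entries below are made up for illustration.
    Bug(2, "Open", "Medium", "Illustrative: extra spacing column", "n/a"),
    Bug(3, "Open", "Critical", "Illustrative: document fails to parse", "n/a"),
]
print(next_bug(bugs).number)
```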

Step 4: Fix & Repeat

  1. Pick highest priority bug
  2. Fix the code
  3. Re-run comparison on affected file(s)
  4. Verify fix doesn't break other files
  5. Mark bug as Fixed
  6. Repeat until exit criteria met

Quick verification:

# Re-run just the problematic file
python -c "
from edgar.documents import parse_html
from pathlib import Path
html = Path('data/html/Apple.10-K.html').read_text()
doc = parse_html(html)
# Quick inspection
print(f'Tables: {len(doc.tables)}')
print(doc.tables[5].render(width=200))  # Check specific table
"

Exit Criteria

We're done when:

  1. All 10 test documents parse successfully
  2. Visual output looks correct (maintainer approval)
  3. Tables render cleanly with proper alignment
  4. No critical or high priority bugs remain
  5. Performance is equal to or better than the old parser's
  6. Text extraction is complete and clean for AI context

Final approval: Maintainer says "This is good enough to ship."
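
Criteria 1, 4, and 5 can be checked mechanically, while the others depend on human review. The sketch below assumes a hypothetical per-file result record with `parsed`, `old_seconds`, and `new_seconds` keys, plus a list of the priorities of bugs that are still open. The sample values are invented for illustration and are not measurements.

```python
def exit_criteria_met(results, open_bug_priorities):
    """Check the automatable exit criteria (1, 4, and 5).

    results: list of dicts with keys 'file', 'parsed' (bool),
             'old_seconds', and 'new_seconds' (hypothetical record layout).
    open_bug_priorities: priority strings of bugs that are still open.
    """
    all_parsed = all(r["parsed"] for r in results)
    no_blockers = not any(p in ("Critical", "High") for p in open_bug_priorities)
    fast_enough = all(r["new_seconds"] <= r["old_seconds"] for r in results)
    return all_parsed and no_blockers and fast_enough


# Invented sample values, not real measurements.
sample_results = [
    {"file": "data/html/Apple.10-K.html", "parsed": True,
     "old_seconds": 2.0, "new_seconds": 1.5},
    {"file": "data/html/Oracle.10-K.html", "parsed": True,
     "old_seconds": 1.0, "new_seconds": 1.0},
]
print(exit_criteria_met(sample_results, ["Low"]))
print(exit_criteria_met(sample_results, ["High"]))
```

A passing result here does not replace the maintainer's sign-off; it only confirms that nothing measurable is blocking it.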

Testing Infrastructure

Primary Tool: compare_parsers.py

Simple command-line tool for the quality improvement loop:

# Quick overview comparison (using shortcuts!)
python tests/manual/compare_parsers.py aapl

# See all tables in a document
python tests/manual/compare_parsers.py aapl --tables

# Compare specific table (OLD vs NEW side-by-side)
python tests/manual/compare_parsers.py aapl --table 5

# Compare text extraction
python tests/manual/compare_parsers.py msft --text

# See section detection
python tests/manual/compare_parsers.py orcl --sections

# Test with 10-Q filings
python tests/manual/compare_parsers.py 'aapl 10-q'

# Run all test files at once
python tests/manual/compare_parsers.py --all

Shortcuts available:

  • Companies: aapl, msft, tsla, nvda, orcl
  • Filing types: 10-k (default), 10-q, 8-k
  • Or use full file paths
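
To make the shortcut behaviour concrete, here is one way such an argument could be resolved to a file path. This is a sketch under stated assumptions and not the actual logic in compare_parsers.py: the ticker-to-company mapping and the `data/html/<Company>.<Form>.html` pattern are inferred from the test corpus table, and `resolve_shortcut` is a hypothetical name.

```python
from pathlib import Path

# Assumed mapping, inferred from the file names in the test corpus.
COMPANIES = {
    "aapl": "Apple",
    "msft": "Microsoft",
    "tsla": "Tesla",
    "nvda": "Nvidia",
    "orcl": "Oracle",
}
FORMS = {"10-k": "10-K", "10-q": "10-Q", "8-k": "8-K"}


def resolve_shortcut(arg, root="data/html"):
    """Turn 'aapl' or 'aapl 10-q' into a path; pass other input through."""
    parts = arg.lower().split()
    if parts and parts[0] in COMPANIES:
        form_key = parts[1] if len(parts) > 1 else "10-k"  # 10-K is the default
        if form_key not in FORMS:
            raise ValueError(f"Unknown filing type: {form_key}")
        return Path(root) / f"{COMPANIES[parts[0]]}.{FORMS[form_key]}.html"
    # Anything that is not a known shortcut is treated as a file path.
    return Path(arg)


print(resolve_shortcut("aapl").as_posix())
print(resolve_shortcut("aapl 10-q").as_posix())
```

Whether a resolved file actually exists is a separate question; the 10-Q and 8-K entries in the corpus are still marked TBD.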

Features:

  • Clean command-line interface
  • Side-by-side OLD vs NEW comparison
  • Rich console output with colors and tables
  • Performance metrics
  • Individual table inspection

Other Available Scripts

Additional tools for specific testing:

  • tests/manual/check_parser_comparison.py - Full comparison with metrics
  • tests/manual/check_tables.py - Table-specific comparison with rendering
  • tests/manual/check_html_rewrite.py - General HTML parsing checks
  • tests/manual/check_html_parser_real_files.py - Real filing tests

Quick Reference

For day-to-day testing commands and usage examples, see TESTING.md.

Notes

  • Keep it simple: This is about rapid iteration, not comprehensive automation
  • Visual inspection is key: Automated metrics don't catch layout/formatting issues
  • Use screenshots: When describing bugs, screenshots speak louder than words
  • Iterative approach: Don't try to fix everything at once; prioritize
  • Trust your judgment: If it looks wrong, it probably is wrong

Bug Tracker

Active Issues

(Add bugs here as they're discovered)

Fixed Issues

(Move completed bugs here for history)

Deferred Issues

(Issues that aren't blocking release but could be improved later)


Status: Initial draft
Last Updated: 2025-10-07
Maintainer: Dwight Gunning