Files

kdusek 8e654ed209 Initial commit

2025-12-09 12:13:01 +01:00

7.8 KiB

Raw Blame History

HTML Parser Rewrite Technical Overview

Executive Summary

The edgar/documents module represents a comprehensive rewrite of the HTML parsing capabilities originally implemented in edgar/files. This new parser is designed to provide superior parsing accuracy, structured data extraction, and rendering quality for SEC filing documents. The rewrite introduces a modern, extensible architecture with specialized components for handling the complex structure of financial documents.

Architecture Overview

Core Components

1. Document Object Model

The new parser introduces a sophisticated node-based document model:

Document: Top-level container with metadata and sections
Node Hierarchy: Abstract base classes for all document elements
- DocumentNode: Root document container
- TextNode: Plain text content
- ParagraphNode: Paragraph elements with styling
- HeadingNode: Headers with levels 1-6
- ContainerNode: Generic containers (div, section)
- SectionNode: Document sections with semantic meaning
- ListNode/ListItemNode: Ordered and unordered lists
- LinkNode: Hyperlinks with metadata
- ImageNode: Images with attributes

2. Table Processing System

Advanced table handling represents a major improvement over the old parser:

TableNode: Sophisticated table representation with multi-level headers
Cell: Individual cell with colspan/rowspan support and type detection
Row: Table row with header detection and semantic classification
TableMatrix: Handles complex cell spanning and alignment
CurrencyColumnMerger: Intelligently merges currency symbols with values
ColumnAnalyzer: Detects spacing columns and optimizes layout

3. Parser Pipeline

The parsing process follows a well-defined pipeline:

HTMLParser: Main orchestration class
HTMLPreprocessor: Cleans and normalizes HTML
DocumentBuilder: Converts HTML tree to document nodes
Strategy Pattern: Pluggable parsing strategies
DocumentPostprocessor: Final cleanup and optimization

Key Improvements Over Old Parser

Table Processing Enhancements

Old Parser (edgar/files):

Basic table extraction using BeautifulSoup
Limited colspan/rowspan handling
Simple text-based rendering
Manual column alignment
Currency symbols often misaligned

New Parser (edgar/documents):

Advanced table matrix system for perfect cell alignment
Intelligent header detection (multi-row headers, year detection)
Automatic currency column merging (1,234 instead of | 1,234)
Semantic table type detection (FINANCIAL, METRICS, TOC, etc.)
Rich table rendering with proper formatting
Smart column width calculation
Enhanced numeric formatting with comma separators

Document Structure

Old Parser:

Flat block-based structure
Limited semantic understanding
Basic text extraction

New Parser:

Hierarchical node-based model
Semantic section detection
Rich metadata preservation
XBRL fact extraction
Search capabilities
Multiple output formats (text, markdown, JSON, pandas)

Rendering Quality

Old Parser:

Basic text output
Limited table formatting
No styling preservation

New Parser:

Multiple renderers (text, markdown, Rich console)
Preserves document structure and styling
Configurable output options
LLM-optimized formatting

Implementation Details

Configuration System

The new parser uses a comprehensive configuration system:

@dataclass
class ParserConfig:
    # Size limits
    max_document_size: int = 50 * 1024 * 1024  # 50MB
    streaming_threshold: int = 10 * 1024 * 1024  # 10MB
    
    # Processing options
    preserve_whitespace: bool = False
    detect_sections: bool = True
    extract_xbrl: bool = True
    table_extraction: bool = True
    detect_table_types: bool = True

Strategy Pattern Implementation

The parser uses pluggable strategies for different aspects:

HeaderDetectionStrategy: Identifies document sections
TableProcessor: Handles table extraction and classification
XBRLExtractor: Extracts XBRL facts and metadata
StyleParser: Processes CSS styling information

Table Processing Deep Dive

The table processing system represents the most significant improvement:

Header Detection Algorithm

Analyzes cell content patterns (th vs td elements)
Detects year patterns in financial tables
Identifies period indicators (quarters, fiscal years)
Handles multi-row headers with units and descriptions
Prevents misclassification of data rows as headers

Cell Type Detection

Numeric vs text classification
Currency value recognition
Percentage handling
Em dash and null value detection
Proper number formatting with thousand separators

Matrix Building

Handles colspan and rowspan expansion
Maintains cell relationships
Optimizes column layout
Removes spacing columns automatically

XBRL Integration

The new parser includes sophisticated XBRL processing:

Extracts facts before preprocessing to preserve ix:hidden content
Maintains metadata relationships
Supports inline XBRL transformations
Preserves semantic context

Performance Characteristics

Memory Efficiency

Streaming support for large documents (>10MB)
Lazy loading of document sections
Caching for repeated operations
Memory-efficient node representation

Processing Speed

Optimized HTML parsing with lxml
Configurable processing strategies
Parallel extraction capabilities
Smart caching of expensive operations

Migration and Compatibility

API Compatibility

The new parser maintains high-level compatibility with the old parser while offering enhanced functionality:

# Old way
from edgar.files import FilingDocument
doc = FilingDocument(html)
text = doc.text()

# New way  
from edgar.documents import HTMLParser
parser = HTMLParser()
doc = parser.parse(html)
text = doc.text()

Feature Parity

All major features from the old parser are preserved:

Text extraction
Table conversion to DataFrame
Section detection
Metadata extraction

Enhanced Features

New capabilities not available in the old parser:

Rich console rendering
Markdown export
Advanced table semantics
XBRL fact extraction
Document search
LLM optimization
Multiple output formats

Current Status and Next Steps

Completed Components

✅ Core document model
✅ HTML parsing pipeline
✅ Advanced table processing
✅ Multiple renderers (text, markdown, Rich)
✅ XBRL extraction
✅ Configuration system
✅ Streaming support

Remaining Work

🔄 Performance optimization and benchmarking
🔄 Comprehensive test coverage migration
🔄 Error handling improvements
🔄 Documentation and examples
🔄 Validation against large corpus of filings

Testing Strategy

The rewrite requires extensive validation:

Comparison testing against old parser output
Financial table accuracy verification
Performance benchmarking
Edge case handling
Integration testing with existing workflows

Conclusion

The edgar/documents rewrite represents a significant advancement in SEC filing processing capabilities. The new architecture provides:

Better Accuracy: Advanced table processing and semantic understanding
Enhanced Functionality: Multiple output formats and rich rendering
Improved Maintainability: Clean, modular architecture with clear separation of concerns
Future Extensibility: Plugin architecture for new parsing strategies
Performance: Streaming support and optimized processing for large documents

The modular design ensures that improvements can be made incrementally while maintaining backward compatibility. The sophisticated table processing system alone represents a major advancement in handling complex financial documents accurately.

7.8 KiB Raw Blame History