Initial commit
# HTML Parser Rewrite Technical Overview

## Executive Summary

The `edgar/documents` module is a rewrite of the HTML parsing originally implemented in `edgar/files`. The new parser is designed to improve parsing accuracy, structured data extraction, and rendering quality for SEC filing documents. It introduces an extensible architecture with specialized components for the complex structure of financial documents.

## Architecture Overview

### Core Components

#### 1. Document Object Model
The new parser introduces a node-based document model:

- **Document**: Top-level container with metadata and sections
- **Node Hierarchy**: Abstract base classes for all document elements
  - `DocumentNode`: Root document container
  - `TextNode`: Plain text content
  - `ParagraphNode`: Paragraph elements with styling
  - `HeadingNode`: Headers with levels 1-6
  - `ContainerNode`: Generic containers (div, section)
  - `SectionNode`: Document sections with semantic meaning
  - `ListNode`/`ListItemNode`: Ordered and unordered lists
  - `LinkNode`: Hyperlinks with metadata
  - `ImageNode`: Images with attributes
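
To make the node model concrete, here is a minimal sketch of how such a tree could be represented and traversed. The base class `Node`, its `add`, `walk`, and `text` methods, and all field names are assumptions made for illustration; the class names are borrowed from the list above, but the real classes in `edgar/documents` may be shaped differently.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class Node:
    """Hypothetical base class; the real base class may differ."""
    children: list[Node] = field(default_factory=list)

    def add(self, child: Node) -> Node:
        self.children.append(child)
        return child

    def walk(self) -> Iterator[Node]:
        """Yield this node and all descendants in document order."""
        yield self
        for child in self.children:
            yield from child.walk()

    def text(self) -> str:
        """Join the non-empty text of all children with newlines."""
        return "\n".join(t for t in (c.text() for c in self.children) if t)


@dataclass
class TextNode(Node):
    content: str = ""

    def text(self) -> str:
        return self.content


@dataclass
class HeadingNode(Node):
    level: int = 1
    content: str = ""

    def text(self) -> str:
        return self.content


@dataclass
class SectionNode(Node):
    name: str = ""


# Build a tiny tree: document -> section -> heading + text.
doc = Node()
section = doc.add(SectionNode(name="Item 1"))
section.add(HeadingNode(level=2, content="Item 1. Business"))
section.add(TextNode(content="We design and sell widgets."))
print(doc.text())
```

Because every node exposes the same `walk` and `text` interface, renderers and search can treat the tree uniformly without knowing each node type in advance.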

#### 2. Table Processing System
Advanced table handling represents a major improvement over the old parser:

- **TableNode**: Sophisticated table representation with multi-level headers
- **Cell**: Individual cell with colspan/rowspan support and type detection
- **Row**: Table row with header detection and semantic classification
- **TableMatrix**: Handles complex cell spanning and alignment
- **CurrencyColumnMerger**: Intelligently merges currency symbols with values
- **ColumnAnalyzer**: Detects spacing columns and optimizes layout

#### 3. Parser Pipeline
The parsing process follows a well-defined pipeline:

1. **HTMLParser**: Main orchestration class
2. **HTMLPreprocessor**: Cleans and normalizes HTML
3. **DocumentBuilder**: Converts HTML tree to document nodes
4. **Strategy Pattern**: Pluggable parsing strategies
5. **DocumentPostprocessor**: Final cleanup and optimization
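
The staged flow above can be sketched as a chain of small functions. This is an illustration only: `ParsedDocument`, `preprocess`, `build`, `postprocess`, and `parse` are hypothetical names, and the regex-based stages are deliberately naive stand-ins for the real preprocessor, builder, and postprocessor.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field


@dataclass
class ParsedDocument:
    """Hypothetical stand-in for the parser's document object."""
    blocks: list[str] = field(default_factory=list)


def preprocess(html: str) -> str:
    """Preprocessing stage: drop comments and collapse whitespace."""
    html = re.sub(r"<!--.*?-->", "", html, flags=re.DOTALL)
    return re.sub(r"\s+", " ", html).strip()


def build(html: str) -> ParsedDocument:
    """Building stage: turn markup into blocks (here, the text of each <p>)."""
    return ParsedDocument(re.findall(r"<p>(.*?)</p>", html))


def postprocess(doc: ParsedDocument) -> ParsedDocument:
    """Postprocessing stage: drop blocks that ended up empty."""
    return ParsedDocument([b.strip() for b in doc.blocks if b.strip()])


def parse(html: str) -> ParsedDocument:
    """Orchestration: run the stages in order."""
    return postprocess(build(preprocess(html)))


result = parse("<p>Revenue  grew.</p><!-- draft --><p> </p>\n<p>Costs fell.</p>")
print(result.blocks)
```

Keeping each stage a pure function of its input makes it straightforward to test stages in isolation or swap one out.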

### Key Improvements Over Old Parser

#### Table Processing Enhancements

**Old Parser (`edgar/files`)**:
- Basic table extraction using BeautifulSoup
- Limited colspan/rowspan handling
- Simple text-based rendering
- Manual column alignment
- Currency symbols often misaligned

**New Parser (`edgar/documents`)**:
- Table matrix system for consistent cell alignment
- Header detection for multi-row headers and year rows
- Automatic currency column merging (`$1,234` instead of `$ | 1,234`)
- Semantic table type detection (FINANCIAL, METRICS, TOC, etc.)
- Rich table rendering with proper formatting
- Column width calculation based on content
- Numeric formatting with comma separators
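
As an illustration of currency column merging, the sketch below folds a cell that holds only a currency symbol into the value cell that follows it. The function name `merge_currency_columns` and the row-at-a-time approach are assumptions; the actual `CurrencyColumnMerger` presumably works on whole columns and handles more cases.

```python
from __future__ import annotations


def merge_currency_columns(row: list[str], symbols: str = "$€£") -> list[str]:
    """Merge a cell holding only a currency symbol into the value that follows."""
    merged: list[str] = []
    i = 0
    while i < len(row):
        cell = row[i].strip()
        nxt = row[i + 1].strip() if i + 1 < len(row) else ""
        if len(cell) == 1 and cell in symbols and nxt:
            merged.append(cell + nxt)
            i += 2  # consume both the symbol cell and the value cell
        else:
            merged.append(cell)
            i += 1
    return merged


row = ["Revenue", "$", "1,234", "$", "987"]
print(merge_currency_columns(row))
```

A symbol with no value after it is left in place, so an empty trailing cell does not silently disappear.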

#### Document Structure

**Old Parser**:
- Flat block-based structure
- Limited semantic understanding
- Basic text extraction

**New Parser**:
- Hierarchical node-based model
- Semantic section detection
- Rich metadata preservation
- XBRL fact extraction
- Search capabilities
- Multiple output formats (text, markdown, JSON, pandas)

#### Rendering Quality

**Old Parser**:
- Basic text output
- Limited table formatting
- No styling preservation

**New Parser**:
- Multiple renderers (text, markdown, Rich console)
- Preserves document structure and styling
- Configurable output options
- LLM-optimized formatting
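
To show what a markdown renderer for tables might do, here is a small sketch that pads columns and emits a pipe-delimited table. The function `to_markdown` is hypothetical and is not the module's renderer API; it assumes every row has the same number of cells and that the first row is the header.

```python
from __future__ import annotations


def to_markdown(rows: list[list[str]]) -> str:
    """Render rows (first row is the header) as a pipe-delimited Markdown table."""
    header, *body = rows
    # Pad each column to its widest cell; 3 is the conventional minimum for the dash row.
    widths = [max(3, *(len(r[i]) for r in rows)) for i in range(len(header))]

    def fmt(row: list[str]) -> str:
        return "| " + " | ".join(c.ljust(w) for c, w in zip(row, widths)) + " |"

    separator = "| " + " | ".join("-" * w for w in widths) + " |"
    return "\n".join([fmt(header), separator, *(fmt(r) for r in body)])


table = to_markdown([["Item", "2023"], ["Revenue", "1,234"]])
print(table)
```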

## Implementation Details

### Configuration System

The new parser is configured through a `ParserConfig` dataclass:

```python
from dataclasses import dataclass


@dataclass
class ParserConfig:
    # Size limits
    max_document_size: int = 50 * 1024 * 1024  # 50MB
    streaming_threshold: int = 10 * 1024 * 1024  # 10MB

    # Processing options
    preserve_whitespace: bool = False
    detect_sections: bool = True
    extract_xbrl: bool = True
    table_extraction: bool = True
    detect_table_types: bool = True
```
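
The sketch below shows one way the size limits could be applied and how a config can be copied with a single override. The helper `choose_mode` is hypothetical, and the exact way the parser consults `max_document_size` and `streaming_threshold` is an assumption; the `ParserConfig` here is a trimmed copy so the example runs on its own.

```python
from dataclasses import dataclass, replace


@dataclass
class ParserConfig:
    # Trimmed copy of the fields shown above, so this example is self-contained.
    max_document_size: int = 50 * 1024 * 1024
    streaming_threshold: int = 10 * 1024 * 1024
    extract_xbrl: bool = True


def choose_mode(html: str, config: ParserConfig) -> str:
    """Hypothetical helper: decide how a document of this size would be handled."""
    size = len(html.encode("utf-8"))
    if size > config.max_document_size:
        raise ValueError(f"document is {size} bytes, limit is {config.max_document_size}")
    return "streaming" if size > config.streaming_threshold else "in-memory"


default = ParserConfig()
fast = replace(default, extract_xbrl=False)  # copy with one field overridden
small_limits = ParserConfig(max_document_size=100, streaming_threshold=10)
print(choose_mode("x" * 50, small_limits))
```

Using `dataclasses.replace` leaves the default config untouched, which is convenient when several parsers share a baseline.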

### Strategy Pattern Implementation

The parser uses pluggable strategies for different aspects:

- **HeaderDetectionStrategy**: Identifies document sections
- **TableProcessor**: Handles table extraction and classification
- **XBRLExtractor**: Extracts XBRL facts and metadata
- **StyleParser**: Processes CSS styling information
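
A minimal sketch of the strategy pattern, using header detection as the example. The `HeaderStrategy` protocol, the two concrete strategies, and `find_headers` are all hypothetical names chosen for illustration; the interface of the real `HeaderDetectionStrategy` is not shown in this document.

```python
from __future__ import annotations

from typing import Protocol


class HeaderStrategy(Protocol):
    """Hypothetical interface: decide whether a line of text is a section header."""

    def is_header(self, line: str) -> bool: ...


class ItemNumberStrategy:
    """Treats lines starting with 'Item ' as headers."""

    def is_header(self, line: str) -> bool:
        return line.strip().lower().startswith("item ")


class AllCapsStrategy:
    """Treats short all-uppercase lines as headers."""

    def is_header(self, line: str) -> bool:
        stripped = line.strip()
        return len(stripped) > 3 and stripped.isupper()


def find_headers(lines: list[str], strategies: list[HeaderStrategy]) -> list[str]:
    """A line counts as a header if any configured strategy accepts it."""
    return [ln for ln in lines if any(s.is_header(ln) for s in strategies)]


lines = ["Item 1. Business", "We sell widgets.", "RISK FACTORS", "See Item 7 for details."]
print(find_headers(lines, [ItemNumberStrategy(), AllCapsStrategy()]))
```

New detection rules can be added as additional classes without touching `find_headers`, which is the main appeal of the pattern here.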

### Table Processing Deep Dive

The table processing system represents the most significant improvement:

#### Header Detection Algorithm
- Analyzes cell content patterns (th vs td elements)
- Detects year patterns in financial tables
- Identifies period indicators (quarters, fiscal years)
- Handles multi-row headers with units and descriptions
- Prevents misclassification of data rows as headers
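
The heuristics above can be sketched roughly as follows. This is a simplified illustration that looks only at cell text (not `th` vs `td`); the regular expressions and the function `looks_like_header` are assumptions, not the module's actual rules.

```python
from __future__ import annotations

import re

YEAR = re.compile(r"^(?:19|20)\d{2}$")
PERIOD = re.compile(r"\b(?:Q[1-4]|quarter|fiscal|year ended|months ended)\b", re.IGNORECASE)
NUMERIC = re.compile(r"^\(?\$?[\d,]+(?:\.\d+)?\)?%?$")


def looks_like_header(cells: list[str]) -> bool:
    """Heuristic: header rows mention years or periods and carry no data values."""
    values = [c.strip() for c in cells if c.strip()]
    if not values:
        return False
    has_period = any(YEAR.match(v) or PERIOD.search(v) for v in values)
    # A bare year also matches NUMERIC, so exclude years before counting data cells.
    data_cells = [v for v in values if NUMERIC.match(v) and not YEAR.match(v)]
    return has_period and not data_cells


print(looks_like_header(["", "2023", "2022"]))
print(looks_like_header(["Fiscal 2023 revenue", "2,500", "2,100"]))
```

The second check is what keeps a data row that merely mentions a fiscal period from being treated as a header. A row whose values happen to be bare four-digit years would still fool this sketch, which is why a real implementation needs more signals.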

#### Cell Type Detection
- Numeric vs text classification
- Currency value recognition
- Percentage handling
- Em dash and null value detection
- Proper number formatting with thousand separators
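
A rough sketch of cell classification along these lines is shown below. The function names, the set of null markers, and the treatment of parentheses as accounting-style negatives are assumptions for illustration rather than the parser's documented behavior.

```python
from __future__ import annotations

import re

# Hyphen, en dash, em dash and common "not available" markers.
NULL_MARKERS = {"", "-", "\u2013", "\u2014", "n/a", "N/A"}
NUMBER = re.compile(r"-?\d{1,3}(?:,\d{3})*(?:\.\d+)?|-?\d+(?:\.\d+)?")


def classify_cell(text: str) -> tuple[str, float | None]:
    """Return (kind, value); kind is 'empty', 'currency', 'percent', 'number' or 'text'."""
    raw = text.strip()
    if raw in NULL_MARKERS:
        return "empty", None
    negative = raw.startswith("(") and raw.endswith(")")  # accounting-style negative
    body = raw.strip("()").strip()
    kind = "number"
    if body.startswith("$"):
        kind, body = "currency", body[1:].strip()
    elif body.endswith("%"):
        kind, body = "percent", body[:-1].strip()
    if not NUMBER.fullmatch(body):
        return "text", None
    value = float(body.replace(",", ""))
    return kind, -value if negative else value


def format_number(value: float) -> str:
    """Render with thousand separators; whole numbers are shown without decimals."""
    return f"{value:,.0f}" if value == int(value) else f"{value:,.2f}"


print(classify_cell("$1,234"), classify_cell("(1,234)"), classify_cell("12.5%"))
```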

#### Matrix Building
- Handles colspan and rowspan expansion
- Maintains cell relationships
- Optimizes column layout
- Removes spacing columns automatically
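
Span expansion is easier to follow with a small example. The sketch below places each cell into a grid keyed by `(row, column)`, repeating its text across the slots it spans, and then drops columns that are blank in every row. The `Cell` dataclass here is a simplified stand-in, and `build_matrix` and `drop_empty_columns` are hypothetical names; the real `TableMatrix` may store references to cells rather than copies of their text.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Cell:
    """Simplified stand-in for the parser's cell type."""
    text: str
    colspan: int = 1
    rowspan: int = 1


def build_matrix(rows: list[list[Cell]]) -> list[list[str]]:
    """Expand colspan/rowspan so every row has one entry per grid column."""
    grid: dict[tuple[int, int], str] = {}
    for r, row in enumerate(rows):
        c = 0
        for cell in row:
            while (r, c) in grid:  # skip slots claimed by a rowspan from above
                c += 1
            for dr in range(cell.rowspan):
                for dc in range(cell.colspan):
                    grid[(r + dr, c + dc)] = cell.text
            c += cell.colspan
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


def drop_empty_columns(matrix: list[list[str]]) -> list[list[str]]:
    """Remove columns whose cells are all blank (spacing columns)."""
    keep = [c for c in range(len(matrix[0])) if any(row[c].strip() for row in matrix)]
    return [[row[c] for c in keep] for row in matrix]


rows = [
    [Cell("Item", rowspan=2), Cell(""), Cell("Year ended", colspan=2)],
    [Cell(""), Cell("2023"), Cell("2022")],
    [Cell("Revenue"), Cell(""), Cell("1,234"), Cell("1,100")],
]
matrix = drop_empty_columns(build_matrix(rows))
for line in matrix:
    print(line)
```

After expansion every row has the same width, so later steps such as header detection and column merging can index cells by position.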

### XBRL Integration

The new parser includes inline XBRL processing:
- Extracts facts before preprocessing to preserve `ix:hidden` content
- Maintains metadata relationships
- Supports inline XBRL transformations
- Preserves semantic context
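
The sketch below illustrates why facts are collected from the raw markup: the `ix:hidden` block holds facts that never appear in the rendered text, so stripping it first would lose them. It uses the standard library's `xml.etree.ElementTree` as a stand-in and assumes well-formed XHTML; the function `extract_facts` is hypothetical, and it handles only the `scale` attribute, not `sign` or `format` transformations.

```python
import xml.etree.ElementTree as ET

IX = "http://www.xbrl.org/2013/inlineXBRL"  # inline XBRL namespace

SAMPLE = f"""<html xmlns:ix="{IX}"><body>
<ix:header><ix:hidden>
  <ix:nonNumeric name="dei:DocumentType" contextRef="c1">10-K</ix:nonNumeric>
</ix:hidden></ix:header>
<p>Revenue was <ix:nonFraction name="us-gaap:Revenues" contextRef="c1"
   unitRef="usd" scale="6">1,234</ix:nonFraction> million.</p>
</body></html>"""


def extract_facts(xhtml: str) -> list:
    """Collect ix:nonFraction and ix:nonNumeric facts, applying `scale` to numbers."""
    root = ET.fromstring(xhtml)
    facts = []
    for tag in ("nonFraction", "nonNumeric"):
        for el in root.iter(f"{{{IX}}}{tag}"):
            text = "".join(el.itertext()).strip()
            value = text
            if tag == "nonFraction":
                # scale="6" means the displayed number is in millions.
                value = float(text.replace(",", "")) * 10 ** int(el.get("scale", "0"))
            facts.append({"name": el.get("name"), "context": el.get("contextRef"), "value": value})
    return facts


facts = extract_facts(SAMPLE)
for fact in facts:
    print(fact)
```

Real filings are often not well-formed XML, which is one reason a tolerant parser such as lxml is the more practical choice outside of a sketch.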

## Performance Characteristics

### Memory Efficiency
- Streaming support for large documents (>10MB)
- Lazy loading of document sections
- Caching for repeated operations
- Memory-efficient node representation

### Processing Speed
- Optimized HTML parsing with lxml
- Configurable processing strategies
- Parallel extraction capabilities
- Smart caching of expensive operations
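
Lazy loading and caching can be combined with `functools.cached_property`, as in the sketch below. The `LazySection` class is hypothetical and the tag-stripping regex stands in for an expensive parse; it only illustrates the idea that work is deferred until first access and then reused.

```python
import re
from functools import cached_property


class LazySection:
    """Hypothetical section wrapper that defers text extraction until first use."""

    parse_count = 0  # class-level counter, only here to make the caching visible

    def __init__(self, raw_html: str):
        self._raw_html = raw_html

    @cached_property
    def text(self) -> str:
        LazySection.parse_count += 1
        # Stand-in for an expensive parse: strip tags naively.
        return re.sub(r"<[^>]+>", "", self._raw_html).strip()


section = LazySection("<p>Risk factors apply.</p>")
first = section.text   # triggers the parse
second = section.text  # served from the cache
print(first, LazySection.parse_count)
```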

## Migration and Compatibility

### API Compatibility
The new parser maintains high-level compatibility with the old parser while offering enhanced functionality:

```python
# Old way
from edgar.files import FilingDocument
doc = FilingDocument(html)
text = doc.text()

# New way
from edgar.documents import HTMLParser
parser = HTMLParser()
doc = parser.parse(html)
text = doc.text()
```

### Feature Parity
All major features from the old parser are preserved:
- Text extraction
- Table conversion to DataFrame
- Section detection
- Metadata extraction

### Enhanced Features
New capabilities not available in the old parser:
- Rich console rendering
- Markdown export
- Advanced table semantics
- XBRL fact extraction
- Document search
- LLM optimization
- Multiple output formats

## Current Status and Next Steps

### Completed Components
- ✅ Core document model
- ✅ HTML parsing pipeline
- ✅ Advanced table processing
- ✅ Multiple renderers (text, markdown, Rich)
- ✅ XBRL extraction
- ✅ Configuration system
- ✅ Streaming support

### Remaining Work
- 🔄 Performance optimization and benchmarking
- 🔄 Comprehensive test coverage migration
- 🔄 Error handling improvements
- 🔄 Documentation and examples
- 🔄 Validation against large corpus of filings

### Testing Strategy
The rewrite requires extensive validation:
- Comparison testing against old parser output
- Financial table accuracy verification
- Performance benchmarking
- Edge case handling
- Integration testing with existing workflows
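
Comparison testing against the old parser could start from a text-level check like the sketch below, which uses the standard library's `difflib`. The helper names are hypothetical, and the inputs are plain strings so the example does not depend on either parser's API.

```python
from __future__ import annotations

import difflib
import re


def normalize(text: str) -> str:
    """Collapse whitespace so layout-only differences do not count."""
    return re.sub(r"\s+", " ", text).strip()


def similarity(old_text: str, new_text: str) -> float:
    """Ratio in [0, 1]; 1.0 means the normalized outputs are identical."""
    return difflib.SequenceMatcher(None, normalize(old_text), normalize(new_text)).ratio()


def report_diff(old_text: str, new_text: str) -> list[str]:
    """Word-level diff entries for anything that was added or removed."""
    old_words, new_words = normalize(old_text).split(), normalize(new_text).split()
    return [d for d in difflib.ndiff(old_words, new_words) if d[0] in "+-"]


old_output = "Revenue   $ 1,234\n"
new_output = "Revenue $1,234"
print(similarity(old_output, new_output))
print(report_diff(old_output, new_output))
```

A threshold on `similarity` can flag filings worth a manual look, while `report_diff` shows whether a difference is an intended improvement (such as merged currency cells) or a regression.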

## Conclusion

The `edgar/documents` rewrite is a significant step forward in SEC filing processing. The new architecture provides:

1. **Better Accuracy**: Advanced table processing and semantic understanding
2. **Enhanced Functionality**: Multiple output formats and rich rendering
3. **Improved Maintainability**: Clean, modular architecture with clear separation of concerns
4. **Future Extensibility**: Plugin architecture for new parsing strategies
5. **Performance**: Streaming support and optimized processing for large documents

The modular design allows improvements to be made incrementally while maintaining backward compatibility. The table processing system alone is a major advance in handling complex financial documents accurately.