Initial commit

2025-12-09 12:13:01 +01:00
commit 8e654ed209
13332 changed files with 2695056 additions and 0 deletions
--- a/venv/lib/python3.10/site-packages/edgar/documents/docs/html-parser-rewrite-overview.md
+++ b/venv/lib/python3.10/site-packages/edgar/documents/docs/html-parser-rewrite-overview.md
@@ -0,0 +1,240 @@
+# HTML Parser Rewrite Technical Overview
+
+## Executive Summary
+
+The `edgar/documents` module is a complete rewrite of the HTML parsing originally implemented in `edgar/files`. The new parser is designed to improve parsing accuracy, structured data extraction, and rendering quality for SEC filing documents. It uses an extensible architecture with specialized components for the complex structure of financial documents.
+
+## Architecture Overview
+
+### Core Components
+
+#### 1. Document Object Model
+The new parser introduces a sophisticated node-based document model:
+
+- **Document**: Top-level container with metadata and sections
+- **Node Hierarchy**: Abstract base classes for all document elements
+  - `DocumentNode`: Root document container
+  - `TextNode`: Plain text content
+  - `ParagraphNode`: Paragraph elements with styling
+  - `HeadingNode`: Headers with levels 1-6
+  - `ContainerNode`: Generic containers (div, section)
+  - `SectionNode`: Document sections with semantic meaning
+  - `ListNode`/`ListItemNode`: Ordered and unordered lists
+  - `LinkNode`: Hyperlinks with metadata
+  - `ImageNode`: Images with attributes
+
+#### 2. Table Processing System
+Advanced table handling represents a major improvement over the old parser:
+
+- **TableNode**: Sophisticated table representation with multi-level headers
+- **Cell**: Individual cell with colspan/rowspan support and type detection
+- **Row**: Table row with header detection and semantic classification
+- **TableMatrix**: Handles complex cell spanning and alignment
+- **CurrencyColumnMerger**: Intelligently merges currency symbols with values
+- **ColumnAnalyzer**: Detects spacing columns and optimizes layout
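+
+To illustrate the currency-merging behaviour described for `CurrencyColumnMerger`, here is a minimal standalone sketch. It is not the library's implementation: `merge_currency_columns` and `CURRENCY_SYMBOLS` are hypothetical names, and the sketch assumes a table is a plain list of string rows in which a column holding only currency symbols (or blanks) is merged into the column to its right.
+
+```python
+from typing import List
+
+# Hypothetical symbol set, chosen for illustration only.
+CURRENCY_SYMBOLS = {"$", "\u20ac", "\u00a3", "\u00a5"}
+
+
+def merge_currency_columns(rows: List[List[str]]) -> List[List[str]]:
+    """Merge symbol-only columns into the value column on their right."""
+    if not rows:
+        return []
+    width = max(len(row) for row in rows)
+    # Pad ragged rows so every row has the same number of columns.
+    padded = [row + [""] * (width - len(row)) for row in rows]
+
+    def is_symbol_column(col: int) -> bool:
+        cells = [row[col].strip() for row in padded]
+        non_empty = [cell for cell in cells if cell]
+        return bool(non_empty) and all(c in CURRENCY_SYMBOLS for c in non_empty)
+
+    symbol_cols = [is_symbol_column(col) for col in range(width)]
+
+    merged = []
+    for row in padded:
+        out = []
+        col = 0
+        while col < width:
+            if symbol_cols[col] and col + 1 < width:
+                # Join "$" and "1,234" into "$1,234"; blank symbols pass through.
+                out.append(row[col].strip() + row[col + 1].strip())
+                col += 2
+            else:
+                out.append(row[col])
+                col += 1
+        merged.append(out)
+    return merged
+```
+
+Deciding per column rather than per cell keeps header rows aligned with the data rows beneath them, because a blank cell above a `$` is merged in the same way as the symbol itself.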
+
+#### 3. Parser Pipeline
+The parsing process follows a well-defined pipeline:
+
+1. **HTMLParser**: Main orchestration class
+2. **HTMLPreprocessor**: Cleans and normalizes HTML
+3. **DocumentBuilder**: Converts HTML tree to document nodes
+4. **Strategy Pattern**: Pluggable parsing strategies
+5. **DocumentPostprocessor**: Final cleanup and optimization
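+
+The preprocess, build, postprocess flow can be sketched with the standard library alone. This is an illustrative toy, not the `edgar.documents` code: the function names `preprocess`, `build`, `postprocess`, and `parse` and the `(tag, text)` node shape are assumptions made for the example.
+
+```python
+import re
+from html.parser import HTMLParser as StdlibHTMLParser
+from typing import List, Tuple
+
+Node = Tuple[str, str]  # (enclosing tag, text), a stand-in for real node classes
+
+
+def preprocess(html: str) -> str:
+    """Strip HTML comments and collapse runs of whitespace."""
+    html = re.sub(r"<!--.*?-->", "", html, flags=re.DOTALL)
+    return re.sub(r"\s+", " ", html).strip()
+
+
+class _Builder(StdlibHTMLParser):
+    """Collect text together with the tag that directly encloses it."""
+
+    def __init__(self) -> None:
+        super().__init__()
+        self.nodes: List[Node] = []
+        self._stack: List[str] = []
+
+    def handle_starttag(self, tag, attrs):
+        self._stack.append(tag)
+
+    def handle_endtag(self, tag):
+        if self._stack and self._stack[-1] == tag:
+            self._stack.pop()
+
+    def handle_data(self, data):
+        tag = self._stack[-1] if self._stack else "text"
+        self.nodes.append((tag, data))
+
+
+def build(html: str) -> List[Node]:
+    builder = _Builder()
+    builder.feed(html)
+    builder.close()
+    return builder.nodes
+
+
+def postprocess(nodes: List[Node]) -> List[Node]:
+    """Drop whitespace-only nodes and trim the rest."""
+    return [(tag, text.strip()) for tag, text in nodes if text.strip()]
+
+
+def parse(html: str) -> List[Node]:
+    # Each stage consumes the previous stage's output.
+    return postprocess(build(preprocess(html)))
+```
+
+Keeping each stage a plain function of its input makes the stages easy to test in isolation and to swap out, which is the property the strategy-based design relies on.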
+
+### Key Improvements Over Old Parser
+
+#### Table Processing Enhancements
+
+**Old Parser (`edgar/files`)**:
+- Basic table extraction using BeautifulSoup
+- Limited colspan/rowspan handling
+- Simple text-based rendering
+- Manual column alignment
+- Currency symbols often misaligned
+
+**New Parser (`edgar/documents`)**:
+- Table matrix system that keeps cells aligned across colspans and rowspans
+- Intelligent header detection (multi-row headers, year detection)
+- Automatic currency column merging ($1,234 instead of $ | 1,234)
+- Semantic table type detection (FINANCIAL, METRICS, TOC, etc.)
+- Rich table rendering with proper formatting
+- Smart column width calculation
+- Enhanced numeric formatting with comma separators
+
+#### Document Structure
+
+**Old Parser**:
+- Flat block-based structure
+- Limited semantic understanding
+- Basic text extraction
+
+**New Parser**:
+- Hierarchical node-based model
+- Semantic section detection
+- Rich metadata preservation
+- XBRL fact extraction
+- Search capabilities
+- Multiple output formats (text, Markdown, JSON, pandas)
+
+#### Rendering Quality
+
+**Old Parser**:
+- Basic text output
+- Limited table formatting
+- No styling preservation
+
+**New Parser**:
+- Multiple renderers (text, Markdown, Rich console)
+- Preserves document structure and styling
+- Configurable output options
+- LLM-optimized formatting
+
+## Implementation Details
+
+### Configuration System
+
+The new parser uses a comprehensive configuration system:
+
+```python
+from dataclasses import dataclass
+
+
+@dataclass
+class ParserConfig:
+    # Size limits (in bytes)
+    max_document_size: int = 50 * 1024 * 1024  # 50 MB
+    streaming_threshold: int = 10 * 1024 * 1024  # 10 MB
+
+    # Processing options
+    preserve_whitespace: bool = False
+    detect_sections: bool = True
+    extract_xbrl: bool = True
+    table_extraction: bool = True
+    detect_table_types: bool = True
+```
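+
+As a rough illustration of how the two size limits might interact, the sketch below picks a processing mode from the document size. The helper `choose_mode` is hypothetical, the `ParserConfig` here is a cut-down copy of the fields shown above, and measuring size in UTF-8 bytes is an assumption rather than confirmed library behaviour.
+
+```python
+from dataclasses import dataclass
+
+
+@dataclass
+class ParserConfig:
+    # Cut-down copy of the size limits shown above, for a self-contained example.
+    max_document_size: int = 50 * 1024 * 1024
+    streaming_threshold: int = 10 * 1024 * 1024
+
+
+def choose_mode(html: str, config: ParserConfig) -> str:
+    """Return 'streaming' or 'in_memory', rejecting oversized documents."""
+    size = len(html.encode("utf-8"))  # assumption: limits are counted in bytes
+    if size > config.max_document_size:
+        raise ValueError(
+            f"document is {size} bytes; limit is {config.max_document_size}"
+        )
+    return "streaming" if size > config.streaming_threshold else "in_memory"
+```
+
+Small limits make the behaviour easy to check: with `ParserConfig(max_document_size=1000, streaming_threshold=50)`, a 100-character ASCII document falls in the streaming range.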
+
+### Strategy Pattern Implementation
+
+The parser uses pluggable strategies for different aspects:
+
+- **HeaderDetectionStrategy**: Identifies document sections
+- **TableProcessor**: Handles table extraction and classification
+- **XBRLExtractor**: Extracts XBRL facts and metadata
+- **StyleParser**: Processes CSS styling information
+
+### Table Processing Deep Dive
+
+The table processing system represents the most significant improvement:
+
+#### Header Detection Algorithm
+- Analyzes cell content patterns (th vs td elements)
+- Detects year patterns in financial tables
+- Identifies period indicators (quarters, fiscal years)
+- Handles multi-row headers with units and descriptions
+- Prevents misclassification of data rows as headers
+
+#### Cell Type Detection
+- Numeric vs text classification
+- Currency value recognition
+- Percentage handling
+- Em dash and null value detection
+- Proper number formatting with thousand separators
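+
+The classification rules above can be sketched as a small function. This is an assumption-laden illustration, not the parser's actual logic: `classify_cell`, `NULL_MARKERS`, the category names, and the convention that parentheses mark negative values are all choices made for the example.
+
+```python
+import re
+from typing import Optional, Tuple
+
+# Hypothetical null markers: hyphen, en dash, em dash, and "n/a".
+NULL_MARKERS = {"-", "\u2013", "\u2014", "n/a"}
+NUMBER = re.compile(r"^-?\d+(\.\d+)?$")
+
+
+def classify_cell(text: str) -> Tuple[str, Optional[float]]:
+    """Return (kind, value), where kind is empty, text, number, currency, or percent."""
+    s = text.strip()
+    if not s or s.lower() in NULL_MARKERS:
+        return ("empty", None)
+
+    # Accounting style: "(567)" means -567.
+    negative = s.startswith("(") and s.endswith(")")
+    core = s[1:-1].strip() if negative else s
+
+    kind = "number"
+    if core.endswith("%"):
+        kind, core = "percent", core[:-1]
+    elif core and core[0] in "$\u20ac\u00a3":
+        kind, core = "currency", core[1:]
+
+    core = core.strip().replace(",", "")  # drop thousand separators
+    if not NUMBER.match(core):
+        return ("text", None)
+
+    value = float(core)
+    return (kind, -value if negative else value)
+```
+
+Returning the parsed value alongside the kind lets a renderer re-apply thousand separators consistently instead of trusting the source formatting.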
+
+#### Matrix Building
+- Handles colspan and rowspan expansion
+- Maintains cell relationships
+- Optimizes column layout
+- Removes spacing columns automatically
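+
+Colspan and rowspan expansion can be sketched as filling a grid of occupied slots. The code below is a generic version of that technique under simplifying assumptions: cells are `(text, colspan, rowspan)` tuples, the spanned text is repeated in every slot it covers, and `expand_spans` is a hypothetical name rather than a `TableMatrix` method.
+
+```python
+from typing import Dict, List, Tuple
+
+Cell = Tuple[str, int, int]  # (text, colspan, rowspan)
+
+
+def expand_spans(rows: List[List[Cell]]) -> List[List[str]]:
+    """Expand spanning cells into a rectangular grid of strings."""
+    occupied: Dict[Tuple[int, int], str] = {}
+    for r, row in enumerate(rows):
+        c = 0
+        for text, colspan, rowspan in row:
+            # Skip slots already claimed by a rowspan from an earlier row.
+            while (r, c) in occupied:
+                c += 1
+            for dr in range(rowspan):
+                for dc in range(colspan):
+                    occupied[(r + dr, c + dc)] = text
+            c += colspan
+
+    if not occupied:
+        return []
+    n_rows = max(r for r, _ in occupied) + 1
+    n_cols = max(c for _, c in occupied) + 1
+    return [
+        [occupied.get((r, c), "") for c in range(n_cols)]
+        for r in range(n_rows)
+    ]
+```
+
+Once the grid is rectangular, later steps such as dropping spacing columns reduce to checking whether every slot in a column is empty.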
+
+### XBRL Integration
+
+The new parser includes sophisticated XBRL processing:
+- Extracts facts before preprocessing so that `ix:hidden` content is preserved
+- Maintains metadata relationships
+- Supports inline XBRL transformations
+- Preserves semantic context
+
+## Performance Characteristics
+
+### Memory Efficiency
+- Streaming support for large documents (>10MB)
+- Lazy loading of document sections
+- Caching for repeated operations
+- Memory-efficient node representation
+
+### Processing Speed
+- Optimized HTML parsing with lxml
+- Configurable processing strategies
+- Parallel extraction capabilities
+- Smart caching of expensive operations
+
+## Migration and Compatibility
+
+### API Compatibility
+The new parser maintains high-level compatibility with the old parser while offering enhanced functionality:
+
+```python
+# `html` is the filing's HTML source as a string.
+
+# Old way
+from edgar.files import FilingDocument
+doc = FilingDocument(html)
+text = doc.text()
+
+# New way
+from edgar.documents import HTMLParser
+parser = HTMLParser()
+doc = parser.parse(html)
+text = doc.text()
+```
+
+### Feature Parity
+All major features from the old parser are preserved:
+- Text extraction
+- Table conversion to DataFrame
+- Section detection
+- Metadata extraction
+
+### Enhanced Features
+New capabilities not available in the old parser:
+- Rich console rendering
+- Markdown export
+- Advanced table semantics
+- XBRL fact extraction
+- Document search
+- LLM optimization
+- Multiple output formats
+
+## Current Status and Next Steps
+
+### Completed Components
+- ✅ Core document model
+- ✅ HTML parsing pipeline
+- ✅ Advanced table processing
+- ✅ Multiple renderers (text, Markdown, Rich)
+- ✅ XBRL extraction
+- ✅ Configuration system
+- ✅ Streaming support
+
+### Remaining Work
+- 🔄 Performance optimization and benchmarking
+- 🔄 Comprehensive test coverage migration
+- 🔄 Error handling improvements
+- 🔄 Documentation and examples
+- 🔄 Validation against large corpus of filings
+
+### Testing Strategy
+The rewrite requires extensive validation:
+- Comparison testing against old parser output
+- Financial table accuracy verification
+- Performance benchmarking
+- Edge case handling
+- Integration testing with existing workflows
+
+## Conclusion
+
+The `edgar/documents` rewrite substantially extends what can be extracted from SEC filings. The new architecture provides:
+
+1. **Better Accuracy**: Advanced table processing and semantic understanding
+2. **Enhanced Functionality**: Multiple output formats and rich rendering
+3. **Improved Maintainability**: Clean, modular architecture with clear separation of concerns
+4. **Future Extensibility**: Plugin architecture for new parsing strategies
+5. **Performance**: Streaming support and optimized processing for large documents
+
+The modular design allows improvements to be made incrementally while maintaining backward compatibility. Of all the changes, the table processing system contributes most to handling complex financial documents accurately.