Document Class Architecture Review
Overview
The Document class in edgar.files.html provides a structured representation of HTML content extracted from SEC filings. It implements a node-based architecture that preserves document structure while supporting rich text formatting and tabular data extraction.
Key Components
- `Document`: Top-level container for parsed document nodes
- `BaseNode`: Abstract base class for all document node types
- `HeadingNode`: Represents section and subsection headings
- `TextBlockNode`: Represents paragraphs and text content
- `TableNode`: Represents tabular data with advanced processing
- `SECHTMLParser`: HTML parser that creates the node structure
- `IXTagTracker`: Tracks inline XBRL tags during parsing
Primary Functionality
The `Document.parse()` method serves as the entry point, converting HTML text into a structured node tree that preserves document semantics, formatting, and inline XBRL metadata.
Implementation Analysis
Architectural Patterns
- Composite Pattern: Implemented through `BaseNode` with specialized node types, allowing for a heterogeneous tree of document elements (sketched below).
- Factory Method: The `create_node()` function acts as a factory for creating appropriate node instances based on content and type.
- Decorator Pattern: The `StyleInfo` class applies layers of styling information, merging styles from parent elements with child elements.
- Strategy Pattern: `TableProcessor` implements a strategy for processing tables, with specialized algorithms for different table structures.
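A minimal sketch of how the composite and factory patterns fit together, using illustrative names rather than the actual edgar.files.html definitions:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Dict

@dataclass
class Node(ABC):
    """Composite base: every node carries metadata and renders itself."""
    metadata: Dict[str, Any] = field(default_factory=dict)

    @abstractmethod
    def render(self) -> str: ...

@dataclass
class Heading(Node):
    content: str = ""
    level: int = 1

    def render(self) -> str:
        return f"{'#' * self.level} {self.content}"

@dataclass
class TextBlock(Node):
    content: str = ""

    def render(self) -> str:
        return self.content

def create_node(tag: str, text: str) -> Node:
    """Factory method: choose the node type from the source element's tag."""
    if tag in ("h1", "h2", "h3", "h4"):
        return Heading(content=text, level=int(tag[1]))
    return TextBlock(content=text)
```

Rendering a whole document then reduces to walking the node list and calling `render()` on each element, which is exactly what the composite arrangement buys.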
Code Quality
Strengths
- Strong typing with appropriate use of Union and Optional types
- Consistent use of dataclasses for node representations
- Clear separation of parsing logic from rendering logic
- Detailed handling of text formatting and whitespace normalization
- Comprehensive table processing with column alignment detection
Areas for Improvement
- High cyclomatic complexity in the `_process_element` method
- Duplicate style parsing logic between html.py and styles.py
- Limited documentation for some private methods
- Heavy use of instance checking rather than polymorphism
- Some recursive methods lack depth limits for safety
Parsing Workflow
The parsing process follows these key stages:
- HTML Parsing: Uses BeautifulSoup to parse HTML into a DOM tree, handling malformed HTML and extracting the document root. (Implemented in `HtmlDocument.get_root()`)
- Node Creation: Traverses the DOM tree, creating appropriate node objects based on element type, text content, and styling. (Implemented in `SECHTMLParser._process_element()` and helper methods)
- Inline XBRL Processing: Tracks and processes inline XBRL tags, preserving metadata for fact extraction and financial data processing. (Implemented in `IXTagTracker` class methods)
- Style Analysis: Analyzes CSS styles and element semantics to determine document structure, headings, and text formatting. (Implemented in `parse_style()` and `get_heading_level()`)
- Table Processing: Processes HTML tables into structured `TableNode` objects with proper cell span handling and column alignment. (Implemented in `SECHTMLParser._process_table()`)
- Node Merging: Merges adjacent text nodes with compatible styling to create a more concise document structure. (Implemented in `SECHTMLParser._merge_adjacent_nodes()`)
Document.parse() Method Analysis
```python
@classmethod
def parse(cls, html: str) -> Optional['Document']:
    root = HtmlDocument.get_root(html)
    if root:
        parser = SECHTMLParser(root)
        return parser.parse()
```
Method Characteristics
- Cyclomatic Complexity: Low (2)
- Lines of Code: 5 lines
- Dependencies: `HtmlDocument`, `SECHTMLParser`
Method Flow
- Get the document root using `HtmlDocument.get_root()`
- Create a `SECHTMLParser` instance with the root
- Call `parser.parse()` to create the node structure
- Return a `Document` instance with the parsed nodes
Edge Cases Handled
- Returns None if document root cannot be found
- Properly handles malformed HTML through BeautifulSoup
Suggestions
- Add error handling for `parser.parse()` failures (a defensive wrapper is sketched below)
- Consider adding optional caching for parsed documents
- Add metadata extraction to the parse method signature
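As a stopgap for the first suggestion, callers can wrap the entry point defensively today; `safe_parse` below is a hypothetical helper, not part of the library:

```python
import logging
from typing import Optional

from edgar.files.html import Document

log = logging.getLogger(__name__)

def safe_parse(html: str) -> Optional[Document]:
    """Return None instead of raising when parsing fails on malformed input."""
    try:
        return Document.parse(html)
    except Exception:
        log.exception("Document.parse() failed")
        return None
```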
Node Hierarchy Analysis
BaseNode
- Abstract base class for all document nodes
- Key methods: `render()`, the `type` property, metadata management
- Good extensibility through the ABC pattern
HeadingNode
- Represents section headings with level-based styling
- Strengths:
- Level-aware rendering with appropriate visual hierarchy
- Comprehensive styling based on heading importance
- Good metadata support for semantic information
TextBlockNode
- Represents paragraphs and formatted text content
- Strengths:
- Sophisticated text wrapping algorithm (a simplified version is sketched after this list)
- Alignment and style preservation
- Efficient handling of long text blocks
- Improvements:
- Could benefit from more advanced text styling capabilities
- Limited support for lists and nested formatting
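For intuition, the core of width-aware wrapping with alignment can be expressed with the standard library; the actual TextBlockNode algorithm is more sophisticated than this sketch:

```python
import textwrap

def wrap_block(text: str, width: int = 80, align: str = "left") -> str:
    """Collapse whitespace, wrap to a fixed width, then apply simple alignment."""
    lines = textwrap.wrap(" ".join(text.split()), width=width)
    if align == "center":
        lines = [line.center(width) for line in lines]
    elif align == "right":
        lines = [line.rjust(width) for line in lines]
    return "\n".join(lines)
```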
TableNode
- Represents tabular data with advanced processing
- Strengths:
- Sophisticated table processing with TableProcessor
- Support for complex cell structures with colspan/rowspan (see the span-expansion sketch below)
- Intelligent column alignment detection
- Efficient caching of processed tables
- Improvements:
- Limited support for nested tables
- No handling for table captions or footer rows
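The span handling can be illustrated with a simplified grid-expansion routine. This is a stand-in for one part of what TableProcessor does, not its actual code; the real processor also tracks headers and alignment:

```python
from typing import List, Optional, Tuple

Cell = Tuple[str, int, int]  # (text, colspan, rowspan)

def expand_spans(rows: List[List[Cell]]) -> List[List[Optional[str]]]:
    """Expand spanned cells so every row has one entry per grid column."""
    grid: List[List[Optional[str]]] = []
    for r, row in enumerate(rows):
        while len(grid) <= r:
            grid.append([])
        c = 0
        for text, colspan, rowspan in row:
            # Skip columns already claimed by a rowspan from an earlier row
            while c < len(grid[r]) and grid[r][c] is not None:
                c += 1
            for dr in range(rowspan):
                while len(grid) <= r + dr:
                    grid.append([])
                target = grid[r + dr]
                while len(target) < c + colspan:
                    target.append(None)
                for dc in range(colspan):
                    target[c + dc] = text
            c += colspan
    return grid

# A cell spanning two rows repeats down the first column
assert expand_spans([[("Name", 1, 2), ("Q1", 1, 1)], [("Q2", 1, 1)]]) == [
    ["Name", "Q1"],
    ["Name", "Q2"],
]
```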
Style Processing Analysis
Style processing is a crucial component that determines document structure and formatting. It handles inheritance, merging, and semantic interpretation.
Key Components
- `StyleInfo`
  - Dataclass representing CSS properties with proper unit handling
  - Style inheritance through the merge method (a minimal sketch follows this list)
- `parse_style`
  - Parses inline CSS styles into `StyleInfo` objects
  - Handles units, validation, and fallback to standard values
- `get_heading_level`
  - Uses sophisticated heuristics to determine heading levels
  - Based on style, content, and document context
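The inheritance model can be shown with a cut-down dataclass; the field names below are assumptions for illustration, not the actual StyleInfo schema:

```python
from dataclasses import dataclass, fields
from typing import Optional

@dataclass
class Style:
    """None means 'not set at this level', so the parent value shows through."""
    font_size: Optional[float] = None
    font_weight: Optional[str] = None
    text_align: Optional[str] = None

    def merge(self, parent: "Style") -> "Style":
        """Child values win; unset properties inherit from the parent."""
        merged = {
            f.name: getattr(self, f.name)
            if getattr(self, f.name) is not None
            else getattr(parent, f.name)
            for f in fields(self)
        }
        return Style(**merged)

# A bold child inside a centered 10pt parent keeps all three properties
parent = Style(font_size=10.0, text_align="center")
child = Style(font_weight="bold")
assert child.merge(parent) == Style(font_size=10.0, font_weight="bold", text_align="center")
```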
Strengths
- Unit-aware style processing with proper conversions
- Sophisticated heading detection with multi-factor analysis
- Context-sensitive style inheritance model
Improvements
- Duplicate style logic between files could be consolidated
- Limited support for advanced CSS features like flexbox
- No caching for repeated style parsing of identical styles
Inline XBRL Handling
The IXTagTracker provides tracking and processing of inline XBRL tags, preserving metadata for financial data extraction.
Key Features
- Tracks nested ix: tags and their attributes
- Handles continuation tags for fragmented XBRL facts
- Preserves context references for financial data analysis
Integration Points
- Called during element processing in SECHTMLParser
- Metadata stored in node.metadata for downstream processing
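Conceptually the tracker behaves like a stack of open ix: elements whose attributes become node metadata. The sketch below illustrates that behavior and is not the actual IXTagTracker API:

```python
from typing import Any, Dict, List, Optional

class TagStack:
    """Track open ix: tags so nodes created inside them inherit fact metadata."""

    def __init__(self) -> None:
        self._stack: List[Dict[str, Any]] = []

    def enter(self, tag: str, attrs: Dict[str, str]) -> None:
        if tag.startswith("ix:"):
            self._stack.append(
                {"ix_tag": attrs.get("name"), "ix_context": attrs.get("contextRef")}
            )

    def exit(self, tag: str) -> None:
        if tag.startswith("ix:") and self._stack:
            self._stack.pop()

    def current(self) -> Optional[Dict[str, Any]]:
        """Metadata of the innermost open fact, or None outside any ix: tag."""
        return dict(self._stack[-1]) if self._stack else None
```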
Improvements
- Limited documentation of XBRL namespaces and tag semantics
- No validation of XBRL context references
- Could benefit from performance optimization for large documents
Technical Debt
Code Complexity
- `SECHTMLParser._process_element`
  - High cyclomatic complexity with nested conditions
  - Suggestion: Refactor into smaller, focused methods with clear single responsibilities
- `SECHTMLParser._process_table`
  - Complex table cell processing with tight coupling
  - Suggestion: Extract cell processing to a dedicated class with a clear interface
Duplication
- Style parsing logic
  - Similar parsing logic in multiple files
  - Suggestion: Consolidate style parsing into a unified module
- Text normalization
  - Multiple text normalization methods with similar functionality
  - Suggestion: Create a TextNormalizer utility class
Performance
- Deep recursion
  - Recursive element processing without depth limits
  - Suggestion: Add depth tracking and limits to prevent stack overflows (sketched below)
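The depth-limit fix amounts to threading a counter through the recursion; this is a hypothetical shape, not the current `_process_element` signature:

```python
MAX_DEPTH = 200  # assumed bound; real filings rarely nest more than a few dozen levels

def process_element(element, depth: int = 0) -> None:
    """Traverse children but refuse to descend past MAX_DEPTH."""
    if depth > MAX_DEPTH:
        raise RecursionError(f"element tree exceeds {MAX_DEPTH} levels")
    for child in getattr(element, "children", []):
        process_element(child, depth + 1)
```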
- Repeated style parsing
  - No caching for repeated style parsing
  - Suggestion: Implement an LRU cache for parsed styles by element ID (one string-keyed variant is sketched below)
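A string-keyed variant of the caching suggestion, exploiting how heavily identical style attributes repeat within a filing; the import path for `parse_style` is assumed:

```python
from functools import lru_cache

from edgar.files.styles import parse_style  # assumed location of the existing parser

@lru_cache(maxsize=4096)
def parse_style_cached(style_attr: str):
    """Identical style strings recur thousands of times, so cache hits are frequent."""
    return parse_style(style_attr)
```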
Recommendations
Architecture
- Formalize node visitor pattern for operations on document structure
- Create dedicated NodeFactory class to encapsulate node creation logic
- Consider splitting large parser class into specialized parsers by content type
Code Quality
- Refactor complex methods into smaller, focused functions
- Add comprehensive docstrings to all public methods
- Add type guards for complex type unions
Performance
- Implement strategic caching for style parsing and heading detection
- Add depth limits to recursive methods
- Consider lazy parsing for large sections like tables
Testing
- Add property-based testing for style inheritance (an example test is sketched after this list)
- Create test fixtures for complex document structures
- Add performance benchmarks for parsing large documents
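A property-based test for style inheritance could look like the following, using hypothesis and reusing the illustrative Style dataclass from the style-processing section above (the real StyleInfo would be swapped in):

```python
from hypothesis import given, strategies as st

styles = st.builds(
    Style,  # illustrative dataclass; replace with the real StyleInfo
    font_size=st.none() | st.floats(min_value=1, max_value=72),
    font_weight=st.none() | st.sampled_from(["normal", "bold"]),
)

@given(child=styles, parent=styles)
def test_merge_prefers_child_values(child, parent):
    merged = child.merge(parent)
    for name in ("font_size", "font_weight"):
        expected = (
            getattr(child, name)
            if getattr(child, name) is not None
            else getattr(parent, name)
        )
        assert getattr(merged, name) == expected
```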
Usage Examples
Basic Document Parsing
```python
import requests

from edgar.files.html import Document

# Get HTML content from a filing
html_content = requests.get("https://www.sec.gov/filing/example").text

# Parse into document structure
document = Document.parse(html_content)

# Access document nodes
for node in document.nodes:
    print(f"Node type: {node.type}")
    if node.type == 'heading':
        print(f"Heading: {node.content}")
```
Extracting Tables from a Document
```python
import pandas as pd

from edgar.files.html import Document

document = Document.parse(html_content)

# Extract all tables
tables = document.tables

# Convert to pandas DataFrames for analysis
dataframes = []
for table_node in tables:
    # Access the processed table
    processed = table_node._processed
    if processed:
        # Create DataFrame with headers and data
        df = pd.DataFrame(processed.data_rows, columns=processed.headers)
        dataframes.append(df)
```
Converting Document to Markdown
```python
from edgar.files.html import Document

document = Document.parse(html_content)

# Convert to markdown
markdown_text = document.to_markdown()

# Save to file
with open("filing.md", "w") as f:
    f.write(markdown_text)
```
Accessing XBRL Data in Document Nodes
```python
from edgar.files.html import Document

document = Document.parse(html_content)

# Find nodes with XBRL facts
xbrl_facts = []
for node in document.nodes:
    if 'ix_tag' in node.metadata and 'ix_context' in node.metadata:
        xbrl_facts.append({
            'concept': node.metadata['ix_tag'],
            'context': node.metadata['ix_context'],
            'value': node.content,
        })

# Process extracted facts
for fact in xbrl_facts:
    print(f"{fact['concept']}: {fact['value']}")
```
Known Issues and Limitations
Heading Detection Issues
During testing, we discovered that headings in some filings (such as Oracle 10-K) are not properly detected by the underlying Document class, which prevents proper item identification. This is a critical issue that needs addressing in the implementation.
Potential causes:
- Heading detection in the Document class may be too strict
- Some filings use non-standard formatting for headings
- Style inheritance might not be working correctly
- Heading level determination may not account for all possible cases
Possible solutions:
- Add a fallback mechanism that uses regex-based item detection when structural detection fails (a sketch follows below)
- Implement a hybrid approach that combines structural and textual analysis
- Create specialized detectors for specific filing types that account for their unique structures
- Add more signals to the heading detection (e.g., positional info, surrounding context)
Priority: High - This issue directly impacts the core functionality of extracting items from filings.
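The regex fallback mentioned above could take roughly this shape; the pattern is illustrative and would need tuning against real filings:

```python
import re

# Matches headings like "Item 1.", "ITEM 1A.", or "Item 7A" at the start of a line
ITEM_PATTERN = re.compile(r"^\s*item\s+(\d{1,2}[a-z]?)\b", re.IGNORECASE | re.MULTILINE)

def find_items(text: str) -> list:
    """Locate candidate item headings in plain text when structural detection finds none."""
    return [(match.group(1).upper(), match.start()) for match in ITEM_PATTERN.finditer(text)]
```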
Performance Considerations
Parsing Performance
Bottlenecks
- BeautifulSoup HTML parsing for large documents
- Recursive DOM traversal with style inheritance computation
- Complex table processing with layout analysis
- Text normalization and whitespace handling
Optimization Opportunities
- Add caching for parsed styles and computed node properties
- Implement lazy parsing for complex structures like tables (see the sketch after this list)
- Add document sectioning for parallel processing
- Optimize text handling for large text blocks
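One sketch of the lazy-parsing idea for tables: yield raw `<table>` elements and defer the expensive TableProcessor pass until a table is actually requested. `iter_tables` is a hypothetical helper, not library API:

```python
from typing import Iterator

from bs4 import BeautifulSoup, Tag

def iter_tables(html: str) -> Iterator[Tag]:
    """Yield <table> elements one at a time; heavy processing happens per access."""
    soup = BeautifulSoup(html, "html.parser")
    yield from soup.find_all("table")
```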
Memory Considerations
- Document representation can be memory-intensive for large filings
- Caching parsed tables can increase memory usage
- Consider streaming processing for very large documents
Rendering Performance
Considerations
- Rich rendering is computation-intensive for large documents
- Table rendering with column optimization is particularly expensive
- Consider incremental or paginated rendering for large documents
Optimizations
- Implement view windowing for large documents
- Add caching for rendered nodes
- Consider asynchronous rendering for complex structures