Initial commit

This commit is contained in:
kdusek
2025-12-09 12:13:01 +01:00
commit 8e654ed209
13332 changed files with 2695056 additions and 0 deletions

@@ -0,0 +1,261 @@
# ChunkedDocument Item Extraction Process
This document explains how the `ChunkedDocument` class in `edgar.files.htmltools` is used to extract items from SEC filings, particularly 10-K documents, and how the new `Document` implementation could replace this functionality.
## Overview
The `ChunkedDocument` class provides functionality to parse HTML from SEC filings and extract specific sections (items) based on their item numbers (e.g., "Item 1", "Item 1A", etc.). It works by:
1. Breaking the HTML into chunks
2. Identifying item headings
3. Creating a mapping of chunks to item numbers
4. Providing access to specific items through indexing
This functionality is essential for extracting specific sections from 10-K, 10-Q, and other structured SEC filings.
## Key Components of the Extraction Process
### 1. ChunkedDocument Class
The `ChunkedDocument` class is initialized with HTML content and a chunking function:
```python
def __init__(self, html: str, chunk_fn: Callable[[List], pd.DataFrame] = chunks2df):
self.chunks = chunk(html)
self._chunked_data = chunk_fn(self.chunks)
self.chunk_fn = chunk_fn
```
- `html`: The HTML content of the SEC filing
- `chunk_fn`: A function that converts chunks to a DataFrame (defaults to `chunks2df`)
### 2. Chunking Process
The HTML is first broken into chunks using the `chunk` function:
```python
@lru_cache(maxsize=8)
def chunk(html: str):
document = HtmlDocument.from_html(html)
return list(document.generate_chunks())
```
This leverages `HtmlDocument.from_html()` and its `generate_chunks()` method to divide the HTML into semantic chunks. The `HtmlDocument` class is part of the older implementation that the new `Document` class aims to replace.
### 3. Chunks to DataFrame Conversion
The chunks are then processed into a DataFrame using the `chunks2df` function:
```python
def chunks2df(chunks: List[List[Block]],
item_detector: Callable[[pd.Series], pd.Series] = detect_int_items,
item_adjuster: Callable[[pd.DataFrame, Dict[str, Any]], pd.DataFrame] = adjust_detected_items,
item_structure=None) -> pd.DataFrame:
```
This function:
- Takes the chunks and creates a DataFrame with columns for text, table flags, etc.
- Detects item headings using the specified `item_detector` (default: `detect_int_items`)
- Applies adjustments via `item_adjuster` (default: `adjust_detected_items`)
- Adds metadata like character count, signature detection, etc.
- Forward-fills item numbers so each chunk is associated with an item
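The forward-fill step can be sketched with a toy frame (the column names follow the layout described later in this document; the data is invented):

```python
import pandas as pd

# Toy frame mimicking the chunks2df layout: only chunks that start an item
# carry an item label, and ffill propagates it to the chunks that follow
df = pd.DataFrame({
    "Text": ["Item 1. Business", "We sell widgets.",
             "Item 1A. Risk Factors", "Competition is fierce."],
    "Item": ["Item 1", None, "Item 1A", None],
})
df["Item"] = df["Item"].ffill()
print(df["Item"].tolist())  # ['Item 1', 'Item 1', 'Item 1A', 'Item 1A']
```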
### 4. Item Detection
The item detection process uses regular expressions to identify item headings:
```python
int_item_pattern = r"^(Item\s{1,3}[0-9]{1,2}[A-Z]?)\.?"
decimal_item_pattern = r"^(Item\s{1,3}[0-9]{1,2}\.[0-9]{2})\.?"
def detect_int_items(text: pd.Series):
return text.str.extract(int_item_pattern, expand=False, flags=re.IGNORECASE | re.MULTILINE)
def detect_decimal_items(text: pd.Series):
return text.str.extract(decimal_item_pattern, expand=False, flags=re.IGNORECASE | re.MULTILINE)
```
These patterns match standard item headings like "Item 1" or "Item 1.01" and extract them from the text.
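The extractors can be exercised directly on a small Series (the sample text is invented; `detect_int_items` and `detect_decimal_items` are just `str.extract` calls with these patterns):

```python
import re

import pandas as pd

int_item_pattern = r"^(Item\s{1,3}[0-9]{1,2}[A-Z]?)\.?"
decimal_item_pattern = r"^(Item\s{1,3}[0-9]{1,2}\.[0-9]{2})\.?"

text = pd.Series(["Item 1. Business", "Some body text", "ITEM 1A. Risk Factors"])
items = text.str.extract(int_item_pattern, expand=False,
                         flags=re.IGNORECASE | re.MULTILINE)
print(items[0], items[2])  # Item 1 ITEM 1A  (items[1] is NaN: no heading matched)

# The decimal pattern covers 8-K style headings such as "Item 1.01"
events = pd.Series(["Item 1.01 Entry into a Material Definitive Agreement"])
decimal_items = events.str.extract(decimal_item_pattern, expand=False,
                                   flags=re.IGNORECASE | re.MULTILINE)
print(decimal_items[0])  # Item 1.01
```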
### 5. Item Adjustment
After initial detection, the `adjust_detected_items` function ensures the items are in the correct sequence and filters out invalid or out-of-sequence items:
```python
def adjust_detected_items(chunk_df: pd.DataFrame, **kwargs) -> pd.DataFrame:
# Normalize items
# Find table of contents
# Process each item in sequence
# Validate items against expected sequence
```
This function:
- Normalizes item strings to a comparable format
- Locates the table of contents section
- Validates each detected item against the previous and next valid items
- Creates a sequence of valid items
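The real validation is more involved, but the core idea, normalize each candidate and keep only items whose numbers increase monotonically, can be sketched as follows (the function names here are invented, not the library's):

```python
import re

def normalize_item(item: str) -> str:
    # Collapse whitespace, drop a trailing period, and normalize case so
    # "ITEM  1a." compares equal to "Item 1A"
    return re.sub(r"\s+", " ", item.strip().rstrip(".")).title()

def keep_in_sequence(items):
    # Keep only items whose (number, letter) key increases monotonically,
    # discarding out-of-sequence repeats such as table-of-contents references
    def key(item):
        m = re.match(r"Item (\d+)([A-Z]?)", item)
        return (int(m.group(1)), m.group(2))
    kept, last = [], None
    for item in map(normalize_item, items):
        k = key(item)
        if last is None or k > last:
            kept.append(item)
            last = k
    return kept

print(keep_in_sequence(["Item 1", "Item 1A", "Item 1", "Item 2"]))
# ['Item 1', 'Item 1A', 'Item 2']
```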
### 6. Item Access
The `ChunkedDocument` class provides access to items through indexing:
```python
def __getitem__(self, item):
if isinstance(item, int):
chunks = [self.chunks[item]]
elif isinstance(item, str):
chunks = list(self.chunks_for_item(item))
else:
return None
# Convert chunks to text
# ...
```
This allows direct access to items by their number (e.g., `document["Item 1"]`) and returns the consolidated text for that item.
### 7. Integration with Company Reports
The `ChunkedDocument` is used by the `CompanyReport` and its subclasses (like `TenK`, `TenQ`, etc.) to provide structured access to filing sections:
```python
class CompanyReport:
@property
@lru_cache(maxsize=1)
def chunked_document(self):
return ChunkedDocument(self._filing.html())
def __getitem__(self, item_or_part: str):
item_text = self.chunked_document[item_or_part]
return item_text
```
This enables usage patterns like:
```python
tenk = TenK(filing)
business_description = tenk["Item 1"] # Gets the business description section
risk_factors = tenk["Item 1A"] # Gets the risk factors section
```
## Technical Details
### Chunk Creation and Rendering
Chunks are created from the HTML using `HtmlDocument.from_html(html).generate_chunks()`, which:
1. Parses the HTML using BeautifulSoup
2. Extracts blocks of content (text, tables, etc.)
3. Compresses blocks to avoid unnecessary whitespace
4. Groups related blocks into logical chunks
When rendering a chunk, the original HTML structure of tables is preserved through the `_render_blocks_using_old_markdown_tables` function:
```python
def _render_blocks_using_old_markdown_tables(blocks: List[Block]):
return "".join([
table_to_markdown(block.table_element) if isinstance(block, TableBlock) else block.get_text()
for block in blocks
]).strip()
```
### Special Cases
The system handles several special cases:
1. **Table of Contents**: Items in the table of contents are identified and excluded from being treated as section headers.
2. **Signatures**: Signature blocks at the end of filings are identified to prevent them from being treated as regular content.
3. **Empty Items**: Logic in `adjust_for_empty_items` handles cases where an item has no content but is followed by another item.
4. **Decimal Items**: The `decimal_chunk_fn` provides specialized handling for filings like 8-K that use decimal item numbers (e.g., "Item 1.01").
### Data Structure
The key data structure is the DataFrame created by `chunks2df`, which contains columns:
- `Text`: The text content of the chunk
- `Table`: Boolean indicating if the chunk is a table
- `Chars`: Character count of the chunk
- `Signature`: Boolean indicating if the chunk is part of a signature block
- `TocLink`: Boolean indicating if the chunk is a table of contents link
- `Toc`: Boolean indicating if the chunk is part of the table of contents
- `Empty`: Boolean indicating if the chunk is empty
- `Item`: The item number associated with the chunk (forward-filled)
## Replacing with the New Document Implementation
The new `Document` class implementation could replace the `ChunkedDocument` functionality by:
1. **Preserving Document Structure**: The new `Document` class already has a node-based structure that preserves document semantics, including headings, text blocks, and tables.
2. **Item Identification**: Implementing an item detection system that leverages the existing heading detection, perhaps with a specialized function that identifies item headings from `HeadingNode` instances.
3. **Item Association**: Creating a system to associate all nodes following an item heading with that item, similar to the forward-filling approach used in `chunks2df`.
4. **Item Access API**: Implementing an indexing system that allows access to items by their number, similar to `ChunkedDocument.__getitem__`.
### Specific Implementation Steps
1. **Create Item Detector**: Create a function that identifies item headings from `HeadingNode` instances based on their content and level:
```python
def identify_item_headings(document: Document) -> Dict[str, int]:
"""Identify item headings in the document and return a mapping of item names to node indices."""
item_headings = {}
for i, node in enumerate(document.nodes):
if node.type == 'heading':
match = re.match(r'^(Item\s+[0-9]+[A-Z]?)', node.content, re.IGNORECASE)
if match:
item_headings[match.group(1).strip()] = i
return item_headings
```
2. **Create Item Association**: Create a function that associates nodes with their respective items:
```python
def associate_nodes_with_items(document: Document, item_headings: Dict[str, int]) -> Dict[str, List[BaseNode]]:
"""Associate document nodes with their respective items."""
item_nodes = {}
item_indices = sorted(item_headings.values())
for i, idx in enumerate(item_indices):
item_name = next(k for k, v in item_headings.items() if v == idx)
next_idx = item_indices[i+1] if i+1 < len(item_indices) else len(document.nodes)
item_nodes[item_name] = document.nodes[idx:next_idx]
return item_nodes
```
3. **Implement Item Access**: Add an indexing method to `Document` that allows access to items:
```python
def get_item(self, item_name: str) -> Optional[str]:
"""Get a specific item from the document by name."""
item_headings = identify_item_headings(self)
if item_name not in item_headings:
return None
item_nodes = associate_nodes_with_items(self, item_headings)
# Convert nodes to text
return "\n".join(node.content for node in item_nodes[item_name])
```
4. **Integration with Company Reports**: Update the `CompanyReport` class to use the new `Document` implementation:
```python
@property
@lru_cache(maxsize=1)
def document(self):
html = self._filing.html()
return Document.parse(html)
def __getitem__(self, item_or_part: str):
return self.document.get_item(item_or_part)
```
## Conclusion
The `ChunkedDocument` class provides a robust system for extracting items from SEC filings. While the implementation is complex, it handles many edge cases and provides a clean API for accessing specific sections of filings.
Replacing this functionality with the new `Document` implementation would require preserving the ability to identify item headings, associate content with items, and provide an item access API. However, the new implementation could benefit from the more structured node-based approach, potentially leading to more accurate item extraction and better handling of complex document structures.
The key challenge will be correctly identifying item boundaries, especially in cases where item headings might be nested or where the document structure is complex. Careful testing against a variety of filings will be essential to ensure the new implementation matches or exceeds the capabilities of the current system.

@@ -0,0 +1,350 @@
# Document Class Architecture Review
## Overview
The `Document` class in `edgar.files.html` provides a structured representation of HTML content extracted from SEC filings. It implements a node-based architecture that preserves document structure while supporting rich text formatting and tabular data extraction.
### Key Components
- **Document**: Top-level container for parsed document nodes
- **BaseNode**: Abstract base class for all document node types
- **HeadingNode**: Represents section and subsection headings
- **TextBlockNode**: Represents paragraphs and text content
- **TableNode**: Represents tabular data with advanced processing
- **SECHTMLParser**: HTML parser that creates the node structure
- **IXTagTracker**: Tracks inline XBRL tags during parsing
### Primary Functionality
The `Document.parse()` method serves as the entry point, converting HTML text into a structured node tree that preserves document semantics, formatting, and inline XBRL metadata.
## Implementation Analysis
### Architectural Patterns
1. **Composite Pattern**: Implemented through `BaseNode` with specialized node types, allowing for a heterogeneous tree of document elements.
2. **Factory Method**: The `create_node()` function acts as a factory method for creating appropriate node instances based on content and type.
3. **Decorator Pattern**: The `StyleInfo` class applies layers of styling information, merging styles from parent elements with child elements.
4. **Strategy Pattern**: `TableProcessor` implements a strategy for processing tables, with specialized algorithms for different table structures.
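The composite structure can be illustrated with a stripped-down analogue (these classes are illustrative stand-ins, not the library's actual `BaseNode`/`HeadingNode`/`TextBlockNode` definitions):

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Dict

class Node(ABC):
    """Stand-in for BaseNode: every node knows how to render itself."""
    @abstractmethod
    def render(self) -> str: ...

@dataclass
class Heading(Node):
    content: str
    level: int = 1
    metadata: Dict[str, Any] = field(default_factory=dict)
    def render(self) -> str:
        return f"{'#' * self.level} {self.content}"

@dataclass
class TextBlock(Node):
    content: str
    def render(self) -> str:
        return self.content

# A heterogeneous node list renders uniformly through the shared interface
nodes = [Heading("Item 1. Business"), TextBlock("We sell widgets.")]
print("\n".join(node.render() for node in nodes))
```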
### Code Quality
#### Strengths
- Strong typing with appropriate use of Union and Optional types
- Consistent use of dataclasses for node representations
- Clear separation of parsing logic from rendering logic
- Detailed handling of text formatting and whitespace normalization
- Comprehensive table processing with column alignment detection
#### Areas for Improvement
- High cyclomatic complexity in `_process_element` method
- Duplicate style parsing logic between html.py and styles.py
- Limited documentation for some private methods
- Heavy use of instance checking rather than polymorphism
- Some recursive methods lack depth limits for safety
## Parsing Workflow
The parsing process follows these key stages:
1. **HTML Parsing**: Uses BeautifulSoup to parse HTML into a DOM tree, handling malformed HTML and extracting the document root. (Implemented in `HtmlDocument.get_root()`)
2. **Node Creation**: Traverses the DOM tree, creating appropriate node objects based on element type, text content, and styling. (Implemented in `SECHTMLParser._process_element()` and helper methods)
3. **Inline XBRL Processing**: Tracks and processes inline XBRL tags, preserving metadata for fact extraction and financial data processing. (Implemented in `IXTagTracker` class methods)
4. **Style Analysis**: Analyzes CSS styles and element semantics to determine document structure, headings, and text formatting. (Implemented in `parse_style()` and `get_heading_level()`)
5. **Table Processing**: Processes HTML tables into structured TableNode objects with proper cell span handling and column alignment. (Implemented in `SECHTMLParser._process_table()`)
6. **Node Merging**: Merges adjacent text nodes with compatible styling to create a more concise document structure. (Implemented in `SECHTMLParser._merge_adjacent_nodes()`)
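Stage 6 can be sketched with stand-in `(style, text)` pairs, mirroring the intent of `_merge_adjacent_nodes` without the library's node types:

```python
# Merge adjacent text nodes that carry the same style; nodes here are
# stand-in tuples rather than the parser's actual node objects
def merge_adjacent(nodes):
    merged = []
    for style, text in nodes:
        if merged and merged[-1][0] == style:
            prev_style, prev_text = merged[-1]
            merged[-1] = (prev_style, prev_text + " " + text)
        else:
            merged.append((style, text))
    return merged

nodes = [("plain", "We sell"), ("plain", "widgets."), ("bold", "Risk Factors")]
print(merge_adjacent(nodes))
# [('plain', 'We sell widgets.'), ('bold', 'Risk Factors')]
```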
## Document.parse() Method Analysis
```python
@classmethod
def parse(cls, html: str) -> Optional['Document']:
root = HtmlDocument.get_root(html)
if root:
parser = SECHTMLParser(root)
return parser.parse()
```
### Method Characteristics
- **Cyclomatic Complexity**: Low (2)
- **Lines of Code**: 5 lines
- **Dependencies**: `HtmlDocument`, `SECHTMLParser`
### Method Flow
1. Get document root using `HtmlDocument.get_root()`
2. Create `SECHTMLParser` instance with root
3. Call `parser.parse()` to create node structure
4. Return `Document` instance with parsed nodes
### Edge Cases Handled
- Returns None if document root cannot be found
- Properly handles malformed HTML through BeautifulSoup
### Suggestions
- Add error handling for parser.parse() failures
- Consider adding optional caching for parsed documents
- Add metadata extraction to the parse method signature
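The first two suggestions can be sketched together as a generic wrapper; the names and the trivial stand-in parser below are invented for illustration:

```python
from functools import lru_cache
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

def cached_safe_parser(parse: Callable[[str], T],
                       maxsize: int = 16) -> Callable[[str], Optional[T]]:
    # Wrap any str -> document parser so failures become None and repeated
    # parses of identical HTML hit an LRU cache
    @lru_cache(maxsize=maxsize)
    def wrapped(html: str) -> Optional[T]:
        try:
            return parse(html)
        except Exception:
            return None
    return wrapped

# Demo with a trivial stand-in parser
safe_parse = cached_safe_parser(lambda html: html.upper())
print(safe_parse("<html/>"))  # <HTML/>
```

In the real codebase the wrapped callable would be `Document.parse`, so callers get `None` on both a missing root and a parser crash.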
## Node Hierarchy Analysis
### BaseNode
- Abstract base class for all document nodes
- Key methods: `render()`, `type` property, metadata management
- Good extensibility through ABC pattern
### HeadingNode
- Represents section headings with level-based styling
- Strengths:
- Level-aware rendering with appropriate visual hierarchy
- Comprehensive styling based on heading importance
- Good metadata support for semantic information
### TextBlockNode
- Represents paragraphs and formatted text content
- Strengths:
- Sophisticated text wrapping algorithm
- Alignment and style preservation
- Efficient handling of long text blocks
- Improvements:
- Could benefit from more advanced text styling capabilities
- Limited support for lists and nested formatting
### TableNode
- Represents tabular data with advanced processing
- Strengths:
- Sophisticated table processing with TableProcessor
- Support for complex cell structures with colspan/rowspan
- Intelligent column alignment detection
- Efficient caching of processed tables
- Improvements:
- Limited support for nested tables
- No handling for table captions or footer rows
## Style Processing Analysis
Style processing is a crucial component that determines document structure and formatting. It handles inheritance, merging, and semantic interpretation.
### Key Components
1. **StyleInfo**
- Dataclass representing CSS properties with proper unit handling
- Style inheritance through the merge method
2. **parse_style**
- Parses inline CSS styles into StyleInfo objects
- Handles units, validation, and fallback to standard values
3. **get_heading_level**
- Uses sophisticated heuristics to determine heading levels
- Based on style, content, and document context
### Strengths
- Unit-aware style processing with proper conversions
- Sophisticated heading detection with multi-factor analysis
- Context-sensitive style inheritance model
### Improvements
- Duplicate style logic between files could be consolidated
- Limited support for advanced CSS features like flexbox
- No caching for repeated style parsing of identical styles
## Inline XBRL Handling
The `IXTagTracker` provides tracking and processing of inline XBRL tags, preserving metadata for financial data extraction.
### Key Features
- Tracks nested ix: tags and their attributes
- Handles continuation tags for fragmented XBRL facts
- Preserves context references for financial data analysis
### Integration Points
- Called during element processing in SECHTMLParser
- Metadata stored in node.metadata for downstream processing
### Improvements
- Limited documentation of XBRL namespaces and tag semantics
- No validation of XBRL context references
- Could benefit from performance optimization for large documents
## Technical Debt
### Code Complexity
1. **SECHTMLParser._process_element**
- High cyclomatic complexity with nested conditions
- Suggestion: Refactor into smaller, focused methods with clear single responsibilities
2. **SECHTMLParser._process_table**
- Complex table cell processing with tight coupling
- Suggestion: Extract cell processing to a dedicated class with clear interface
### Duplication
1. **Style parsing logic**
- Similar parsing logic in multiple files
- Suggestion: Consolidate style parsing into a unified module
2. **Text normalization**
- Multiple text normalization methods with similar functionality
- Suggestion: Create a TextNormalizer utility class
### Performance
1. **Deep recursion**
- Recursive element processing without depth limits
- Suggestion: Add depth tracking and limits to prevent stack overflows
2. **Repeated style parsing**
- No caching for repeated style parsing
- Suggestion: Implement LRU cache for parsed styles by element ID
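The style-caching suggestion can be sketched by keying on the raw style string (identical strings already collide usefully, so element IDs are not strictly needed); `parse_style_cached` is an invented stand-in, not the library's `parse_style`:

```python
from functools import lru_cache

@lru_cache(maxsize=1024)
def parse_style_cached(style: str) -> tuple:
    # Parse "name: value; ..." declarations into a hashable tuple of pairs
    # so cached results can be shared safely between nodes
    pairs = []
    for declaration in style.split(";"):
        name, sep, value = declaration.partition(":")
        if sep:
            pairs.append((name.strip().lower(), value.strip()))
    return tuple(pairs)

print(parse_style_cached("Font-Weight: bold; margin-top: 12pt"))
# (('font-weight', 'bold'), ('margin-top', '12pt'))
```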
## Recommendations
### Architecture
1. Formalize node visitor pattern for operations on document structure
2. Create dedicated NodeFactory class to encapsulate node creation logic
3. Consider splitting large parser class into specialized parsers by content type
### Code Quality
1. Refactor complex methods into smaller, focused functions
2. Add comprehensive docstrings to all public methods
3. Add type guards for complex type unions
### Performance
1. Implement strategic caching for style parsing and heading detection
2. Add depth limits to recursive methods
3. Consider lazy parsing for large sections like tables
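Recommendation 2 amounts to threading a depth counter through recursive element processing and bailing out past a limit; a minimal sketch with an invented dict-based tree shape:

```python
def process_element(element, depth=0, max_depth=50):
    # Give up on pathologically deep trees instead of overflowing the stack
    if depth > max_depth:
        return []
    nodes = [element["tag"]]
    for child in element.get("children", []):
        nodes.extend(process_element(child, depth + 1, max_depth))
    return nodes

tree = {"tag": "div", "children": [{"tag": "p"},
                                   {"tag": "table", "children": [{"tag": "tr"}]}]}
print(process_element(tree))  # ['div', 'p', 'table', 'tr']
```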
### Testing
1. Add property-based testing for style inheritance
2. Create test fixtures for complex document structures
3. Add performance benchmarks for parsing large documents
## Usage Examples
### Basic Document Parsing
```python
import requests
from edgar.files.html import Document
# Get HTML content from a filing
html_content = requests.get("https://www.sec.gov/filing/example").text
# Parse into document structure
document = Document.parse(html_content)
# Access document nodes
for node in document.nodes:
print(f"Node type: {node.type}")
if node.type == 'heading':
print(f"Heading: {node.content}")
```
### Extracting Tables from a Document
```python
from edgar.files.html import Document
import pandas as pd
document = Document.parse(html_content)
# Extract all tables
tables = document.tables
# Convert to pandas DataFrames for analysis
dataframes = []
for table_node in tables:
# Access the processed table
processed = table_node._processed
if processed:
# Create DataFrame with headers and data
df = pd.DataFrame(processed.data_rows, columns=processed.headers)
dataframes.append(df)
```
### Converting Document to Markdown
```python
from edgar.files.html import Document
document = Document.parse(html_content)
# Convert to markdown
markdown_text = document.to_markdown()
# Save to file
with open("filing.md", "w") as f:
f.write(markdown_text)
```
### Accessing XBRL Data in Document Nodes
```python
from edgar.files.html import Document
document = Document.parse(html_content)
# Find nodes with XBRL facts
xbrl_facts = []
for node in document.nodes:
if 'ix_tag' in node.metadata and 'ix_context' in node.metadata:
xbrl_facts.append({
'concept': node.metadata['ix_tag'],
'context': node.metadata['ix_context'],
'value': node.content,
})
# Process extracted facts
for fact in xbrl_facts:
print(f"{fact['concept']}: {fact['value']}")
```
## Known Issues and Limitations
### Heading Detection Issues
During testing, we discovered that headings in some filings (such as Oracle's 10-K) are not properly detected by the underlying `Document` class, which prevents proper item identification. This is a critical issue that must be addressed in the implementation.
Potential causes:
- Heading detection in the Document class may be too strict
- Some filings use non-standard formatting for headings
- Style inheritance might not be working correctly
- Heading level determination may not account for all possible cases
Possible solutions:
1. Add a fallback mechanism that uses regex-based item detection when structural detection fails
2. Implement a hybrid approach that combines structural and textual analysis
3. Create specialized detectors for specific filing types that account for their unique structures
4. Add more signals to the heading detection (e.g., positional info, surrounding context)
**Priority:** High - This issue directly impacts the core functionality of extracting items from filings.
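Solution 1 can be sketched as a regex scan over the raw text that fires only when structural detection finds nothing (the function name and sample text are invented):

```python
import re

ITEM_RE = re.compile(r"^\s*(Item\s+\d+[A-Z]?)\b", re.IGNORECASE | re.MULTILINE)

def fallback_item_offsets(text: str) -> dict:
    """Map each detected item heading to its character offset in the text."""
    return {m.group(1).title(): m.start() for m in ITEM_RE.finditer(text)}

sample = "Item 1. Business\nWe sell widgets.\nItem 1A. Risk Factors"
print(fallback_item_offsets(sample))  # {'Item 1': 0, 'Item 1A': 34}
```

The offsets give item boundaries in the flat text, which can then be mapped back onto node ranges when structural heading detection fails.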
## Performance Considerations
### Parsing Performance
#### Bottlenecks
- BeautifulSoup HTML parsing for large documents
- Recursive DOM traversal with style inheritance computation
- Complex table processing with layout analysis
- Text normalization and whitespace handling
#### Optimization Opportunities
- Add caching for parsed styles and computed node properties
- Implement lazy parsing for complex structures like tables
- Add document sectioning for parallel processing
- Optimize text handling for large text blocks
#### Memory Considerations
- Document representation can be memory-intensive for large filings
- Caching parsed tables can increase memory usage
- Consider streaming processing for very large documents
### Rendering Performance
#### Considerations
- Rich rendering is computation-intensive for large documents
- Table rendering with column optimization is particularly expensive
- Consider incremental or paginated rendering for large documents
#### Optimizations
- Implement view windowing for large documents
- Add caching for rendered nodes
- Consider asynchronous rendering for complex structures

@@ -0,0 +1,923 @@
"""
Enhanced SEC filing document representation with structured item extraction.
This module provides a high-level document class specialized for SEC filings, with
rich support for extracting items, tables, and table of contents.
"""
import re
from dataclasses import dataclass
from typing import Any, Dict, Iterator, List, Optional, Pattern
import pandas as pd
from edgar.files.html import BaseNode, Document, HeadingNode, TableNode
class Table:
"""Rich representation of a table in a document."""
def __init__(self, table_node: TableNode):
self._node = table_node
self._processed = None # Lazy-loaded processed table
@property
def rows(self) -> int:
"""Get the number of rows in the table."""
processed = self._get_processed()
if processed is None:
return 0
# Count header row if present plus data rows
has_header = processed.headers is not None and len(processed.headers) > 0
return len(processed.data_rows) + (1 if has_header else 0)
@property
def columns(self) -> int:
"""Get the number of columns in the table."""
processed = self._get_processed()
if processed is None:
return 0
# Use headers if available, otherwise first data row
if processed.headers and len(processed.headers) > 0:
return len(processed.headers)
elif processed.data_rows and len(processed.data_rows) > 0:
return len(processed.data_rows[0])
return 0
def _get_processed(self):
"""Get or create the processed table."""
if self._processed is None:
if hasattr(self._node, '_processed'):
self._processed = self._node._processed
# Handle case where node doesn't have processed table yet
if self._processed is None and hasattr(self._node, '_get_processed'):
# Call node's processing method if available
self._processed = self._node._get_processed()
return self._processed
def to_dataframe(self) -> pd.DataFrame:
"""Convert this table to a pandas DataFrame."""
processed = self._get_processed()
if processed and processed.headers and processed.data_rows:
# Create DataFrame with proper headers and data
return pd.DataFrame(processed.data_rows, columns=processed.headers)
elif processed and processed.data_rows:
# No headers, use numeric column names
return pd.DataFrame(processed.data_rows)
return pd.DataFrame()
def to_markdown(self) -> str:
"""Convert this table to markdown format."""
df = self.to_dataframe()
if not df.empty:
return df.to_markdown()
return ""
def get_cell(self, row: int, col: int) -> str:
"""Get the content of a specific cell."""
processed = self._get_processed()
if processed is None:
return ""
# Handle header row (row 0)
if row == 0 and processed.headers and col < len(processed.headers):
return processed.headers[col]
        # Adjust row index if we have headers (data rows start at index 1);
        # treat an empty header list the same as no headers
        data_row_idx = row if not processed.headers else row - 1
# Get data from data rows
if processed.data_rows and 0 <= data_row_idx < len(processed.data_rows):
data_row = processed.data_rows[data_row_idx]
if 0 <= col < len(data_row):
return data_row[col]
return ""
def contains(self, text: str) -> bool:
"""Check if the table contains the specified text."""
processed = self._get_processed()
if not processed:
return False
# Check headers
if processed.headers and any(text.lower() in str(header).lower() for header in processed.headers):
return True
# Check data rows
for row in processed.data_rows:
if any(text.lower() in str(cell).lower() for cell in row):
return True
return False
def __str__(self) -> str:
return self.to_markdown()
def __repr__(self) -> str:
return f"Table({self.rows}×{self.columns})"
@dataclass
class TocEntry:
"""Entry in a table of contents."""
text: str
level: int
page: Optional[int] = None
reference: Optional[str] = None # Item reference, if applicable
def __repr__(self) -> str:
return f"TocEntry('{self.text}', level={self.level}, page={self.page})"
class TableOfContents:
"""Table of contents extracted from a document."""
def __init__(self, entries: List[TocEntry]):
self.entries = entries
@classmethod
def extract(cls, document: Document) -> 'TableOfContents':
"""Extract table of contents from document."""
entries = []
# Find TOC section (usually at the beginning)
toc_node_index = cls._find_toc_section(document)
if toc_node_index is None:
return cls([])
# Get nodes after TOC heading until the next major heading
toc_nodes = cls._get_toc_nodes(document, toc_node_index)
# Process nodes to extract entries
entries = cls._process_toc_nodes(toc_nodes)
# Match entries to actual items
cls._match_entries_to_items(entries, document)
return cls(entries)
@staticmethod
def _find_toc_section(document: Document) -> Optional[int]:
"""Find the TOC section in the document."""
# Look for "Table of Contents" heading
toc_patterns = [
re.compile(r'table\s+of\s+contents', re.IGNORECASE),
re.compile(r'contents', re.IGNORECASE)
]
for i, node in enumerate(document.nodes):
if node.type == 'heading':
for pattern in toc_patterns:
if pattern.search(node.content):
return i
return None
@staticmethod
def _get_toc_nodes(document: Document, start_index: int) -> List[BaseNode]:
"""Get nodes belonging to the TOC section."""
# Get nodes between TOC heading and next heading of same or higher level
nodes = []
toc_heading = document.nodes[start_index]
heading_level = toc_heading.level if hasattr(toc_heading, 'level') else 1
for i in range(start_index + 1, len(document.nodes)):
node = document.nodes[i]
if node.type == 'heading' and hasattr(node, 'level') and node.level <= heading_level:
break
nodes.append(node)
return nodes
@staticmethod
def _process_toc_nodes(nodes: List[BaseNode]) -> List[TocEntry]:
"""Process TOC nodes to extract entries."""
entries = []
# Patterns for detecting TOC entries
item_pattern = re.compile(r'(item\s+\d+[A-Za-z]?)', re.IGNORECASE)
page_pattern = re.compile(r'(\d+)$')
for node in nodes:
if node.type == 'text_block':
# Process each line in the text block
lines = node.content.splitlines()
                for raw_line in lines:
                    line = raw_line.strip()
                    if not line:
                        continue
                    # Indentation, measured before stripping, is a proxy for level
                    leading_spaces = len(raw_line) - len(raw_line.lstrip())
                    level = leading_spaces // 2 + 1  # Rough estimate of level
# Extract page number if present
page_match = page_pattern.search(line)
page = int(page_match.group(1)) if page_match else None
# Clean the text
text = line
if page_match:
text = line[:page_match.start()].strip()
# Check for Item reference
item_match = item_pattern.search(text)
reference = item_match.group(1) if item_match else None
entries.append(TocEntry(text, level, page, reference))
elif node.type == 'table':
# Process table rows as TOC entries
table = Table(node)
df = table.to_dataframe()
if not df.empty:
for _, row in df.iterrows():
if len(row) >= 2: # Assume col 0 is text, col 1 might be page
                            raw_text = str(row[0])
                            text = raw_text.strip()
                            if not text:
                                continue
                            # Try to extract page number
                            page = None
                            if len(row) > 1:
                                try:
                                    page = int(row[1])
                                except (ValueError, TypeError):
                                    pass
                            # Indentation in the first column, measured before
                            # stripping, is a proxy for level
                            level = 1  # Default level
                            leading_spaces = len(raw_text) - len(raw_text.lstrip())
                            if leading_spaces > 0:
                                level = leading_spaces // 2 + 1
# Check for Item reference
item_match = item_pattern.search(text)
reference = item_match.group(1) if item_match else None
entries.append(TocEntry(text, level, page, reference))
return entries
@staticmethod
def _match_entries_to_items(entries: List[TocEntry], document: Document) -> None:
"""Match TOC entries to actual items in the document."""
# Create dictionary of potential item headings in the document
item_headings = {}
item_pattern = re.compile(r'(item\s+\d+[A-Za-z]?)', re.IGNORECASE)
for i, node in enumerate(document.nodes):
if node.type == 'heading':
match = item_pattern.search(node.content)
if match:
item_key = match.group(1).upper()
item_headings[item_key] = i
# Match entries to items
for entry in entries:
if entry.reference:
# Try to match reference to actual item
item_key = entry.reference.upper()
if item_key in item_headings:
entry.reference = item_key
def find(self, text: str) -> Optional[TocEntry]:
"""Find a TOC entry by text."""
text = text.lower()
for entry in self.entries:
if text in entry.text.lower():
return entry
return None
def __iter__(self) -> Iterator[TocEntry]:
return iter(self.entries)
def __len__(self) -> int:
return len(self.entries)
class Item:
"""Represents a logical item in an SEC filing."""
def __init__(self,
name: str,
heading_node: Optional[HeadingNode],
content_nodes: List[BaseNode],
                 metadata: Optional[Dict[str, Any]] = None):
self.name = name
self.heading_node = heading_node
self.content_nodes = content_nodes
self.metadata = metadata or {}
@property
def title(self) -> str:
"""Get the title of this item."""
if self.heading_node:
# Extract title by removing the item number
item_pattern = re.compile(r'^item\s+\d+[A-Za-z]?\.?\s*', re.IGNORECASE)
return item_pattern.sub('', self.heading_node.content).strip()
return ""
@property
def text(self) -> str:
"""Get the text content of this item."""
parts = []
for node in self.content_nodes:
if hasattr(node, 'content'):
if isinstance(node.content, str):
parts.append(node.content)
elif isinstance(node.content, list):
# Handle list content (likely a table)
parts.append(str(node))
else:
parts.append(str(node.content))
else:
parts.append(str(node))
return "\n".join(parts)
@property
def tables(self) -> List[Table]:
"""Get all tables within this item."""
return [
Table(node) for node in self.content_nodes
if node.type == 'table'
]
def get_table(self, index: int) -> Optional[Table]:
"""Get a specific table by index."""
tables = self.tables
return tables[index] if 0 <= index < len(tables) else None
def find_tables(self, pattern: str) -> List[Table]:
"""Find tables containing the specified text pattern."""
tables = []
for table in self.tables:
if table.contains(pattern):
tables.append(table)
return tables
def get_subsections(self) -> List['Item']:
"""Extract nested subsections within this item."""
subsections = []
# Find heading nodes with higher level than the main item heading
item_level = self.heading_node.level if self.heading_node else 0
# Find all subsection headings
subsection_indices = []
for i, node in enumerate(self.content_nodes):
if node.type == 'heading' and node.level > item_level:
subsection_indices.append((i, node))
# Create subsections
for i, (idx, heading) in enumerate(subsection_indices):
next_idx = subsection_indices[i+1][0] if i+1 < len(subsection_indices) else len(self.content_nodes)
subsection_content = self.content_nodes[idx+1:next_idx]
# Create an item for this subsection
subsection = Item(
name=heading.content,
heading_node=heading,
content_nodes=subsection_content
)
subsections.append(subsection)
return subsections
def to_markdown(self) -> str:
"""Convert this item to markdown format."""
parts = []
# Add heading
if self.heading_node:
parts.append(f"# {self.heading_node.content}\n")
# Process content nodes
for node in self.content_nodes:
if node.type == 'heading':
# Add appropriate heading level
level = min(node.level + 1, 6) # Ensure we don't exceed markdown's 6 levels
parts.append(f"{'#' * level} {node.content}\n")
elif node.type == 'text_block':
parts.append(f"{node.content}\n\n")
elif node.type == 'table':
table = Table(node)
parts.append(f"{table.to_markdown()}\n\n")
return "\n".join(parts)
def to_html(self) -> str:
"""Convert this item to HTML format."""
parts = []
# Add heading
if self.heading_node:
parts.append(f"<h1>{self.heading_node.content}</h1>")
# Process content nodes
for node in self.content_nodes:
if node.type == 'heading':
# Add appropriate heading level
level = min(node.level + 1, 6) # Ensure we don't exceed HTML's 6 levels
parts.append(f"<h{level}>{node.content}</h{level}>")
elif node.type == 'text_block':
lines = node.content.split('\n')
paragraphs = [f"<p>{line}</p>" for line in lines if line.strip()]
parts.append("\n".join(paragraphs))
elif node.type == 'table':
# Convert the table to HTML
table = Table(node)
df = table.to_dataframe()
parts.append(df.to_html(index=False))
return "\n".join(parts)
def to_dict(self) -> Dict[str, Any]:
"""Convert this item to a dictionary."""
return {
'name': self.name,
'title': self.title,
'text': self.text,
'metadata': self.metadata
}
def __str__(self) -> str:
return self.text
def __repr__(self) -> str:
return f"Item('{self.name}', title='{self.title}')"
class ItemCollection:
"""Collection of items in a document with convenient access methods."""
def __init__(self, items: Dict[str, Item]):
self._items = items
def __getitem__(self, key: str) -> Item:
"""Get an item by name, with flexible matching."""
# Case-insensitive lookup
key = key.strip().upper()
# Direct lookup
if key in self._items:
return self._items[key]
# Remove any trailing periods for matching
clean_key = key.rstrip('.')
if clean_key in self._items:
return self._items[clean_key]
# Normalize for comparison (remove spaces and periods)
normalized_key = re.sub(r'[.\s]', '', key)
# Try to match normalized keys
for item_key in self._items:
normalized_item_key = re.sub(r'[.\s]', '', item_key)
if normalized_key == normalized_item_key:
return self._items[item_key]
# Partial match (e.g., "1" matches "ITEM 1")
if normalized_key.isdigit() or (len(normalized_key) > 1 and normalized_key[0].isdigit()):
for item_key in self._items:
normalized_item_key = re.sub(r'[.\s]', '', item_key)
if normalized_key in normalized_item_key:
return self._items[item_key]
raise KeyError(f"Item '{key}' not found")
def __contains__(self, key: str) -> bool:
"""Check if an item exists."""
try:
self[key]
return True
except KeyError:
return False
    def __iter__(self) -> Iterator[Item]:
        """Iterate through items in numeric order (so ITEM 2 precedes ITEM 10)."""
        def sort_key(name: str):
            # Plain lexicographic sorting would place "ITEM 10" before "ITEM 2",
            # so sort on the parsed item number and sub-letter instead
            match = re.search(r'(\d+)(?:\.(\d+))?([A-Za-z]?)', name)
            if match:
                return (int(match.group(1)), int(match.group(2) or 0), match.group(3))
            return (float('inf'), 0, name)
        for key in sorted(self._items, key=sort_key):
            yield self._items[key]
def __len__(self) -> int:
"""Get the number of items."""
return len(self._items)
def list(self) -> List[str]:
"""Get a list of item names."""
return sorted(self._items.keys())
class DocumentIndex:
"""Index of document structure for efficient lookups."""
def __init__(self):
self._headings = {} # Map of heading text to node index
self._items = {} # Map of item name to Item object
@classmethod
def build(cls, document: Document, filing_type: str = None) -> 'DocumentIndex':
"""Build an index from a document."""
index = cls()
index._build_heading_index(document)
index._build_item_index(document, filing_type)
return index
def _build_heading_index(self, document: Document) -> None:
"""Build an index of all headings in the document."""
for i, node in enumerate(document.nodes):
if node.type == 'heading':
self._headings[node.content] = i
def _build_item_index(self, document: Document, filing_type: str = None) -> None:
"""Build an index of items in the document."""
# Get appropriate item pattern based on filing type
item_pattern = self._get_item_pattern(filing_type)
# Find all item headings
item_headings = []
for i, node in enumerate(document.nodes):
if node.type == 'heading':
match = item_pattern.search(node.content)
if match:
item_name = match.group(1).strip().upper()
item_headings.append((item_name, i, node))
# If no heading-based items found, use fallback text-based detection
if not item_headings:
item_headings = self._fallback_item_detection(document, item_pattern)
# Sort by position in document
item_headings.sort(key=lambda x: x[1])
# Create items
for i, (item_name, node_idx, heading_node) in enumerate(item_headings):
# Find content nodes
start_idx = node_idx + 1
end_idx = (item_headings[i+1][1]
if i+1 < len(item_headings) else len(document.nodes))
content_nodes = document.nodes[start_idx:end_idx]
# Create item
self._items[item_name] = Item(item_name, heading_node, content_nodes)
def _fallback_item_detection(self, document: Document, item_pattern: re.Pattern) -> list:
"""
Fallback item detection when heading-based detection fails.
Uses text content and positional analysis to identify items.
"""
from edgar.files.html import HeadingNode
from edgar.files.styles import StyleInfo
# Create reusable heading nodes
def create_heading_node(content, level=2):
return HeadingNode(
content=content,
style=StyleInfo(font_weight='bold'), # minimal required style
level=level,
metadata={}
)
item_headings = []
# Step 1: Oracle-specific table-based TOC detection (handles Oracle 10-K format)
# First check for a table that contains item patterns and looks like a TOC
table_nodes = [node for node in document.nodes if node.type == 'table']
# Create a map of item references to detect in content
item_references = {}
toc_table_idx = None
# First pass: find the table of contents and extract item references
for table_idx, node in enumerate(table_nodes):
toc_candidate = False
item_to_content_map = {}
# Check if this looks like a TOC table
if hasattr(node, 'content') and isinstance(node.content, list):
rows = node.content
# Process each row to find item patterns
for row_idx, row in enumerate(rows):
if not hasattr(row, 'cells'):
continue
# Check if this row contains an item pattern
for cell_idx, cell in enumerate(row.cells):
cell_content = cell.content if hasattr(cell, 'content') else ""
if not isinstance(cell_content, str):
continue
# Look for item pattern in this cell
match = item_pattern.search(cell_content)
if match:
toc_candidate = True
item_name = match.group(1).strip().upper()
# Extract title - could be in same cell after item name or in next cell
title = ""
# First look in the same cell after the item name
remaining_content = cell_content[match.end():].strip()
if remaining_content:
title = remaining_content
# If no title found in same cell, check next cell
elif cell_idx + 1 < len(row.cells):
next_cell = row.cells[cell_idx + 1]
next_content = next_cell.content if hasattr(next_cell, 'content') else ""
if isinstance(next_content, str):
title = next_content.strip()
# Look for page number or anchor reference in later cells
ref = None
if cell_idx + 2 < len(row.cells):
ref_cell = row.cells[cell_idx + 2]
ref_content = ref_cell.content if hasattr(ref_cell, 'content') else ""
if isinstance(ref_content, str) and ref_content.strip():
ref = ref_content.strip()
# Store item details with full context
item_to_content_map[item_name] = {
'title': title,
'reference': ref,
'row_idx': row_idx
}
# Add to global item references
item_references[item_name] = {
'title': title,
'reference': ref,
'found': False # Will be set to True when we find the content
}
            # If this table is a TOC candidate with multiple items, remember its
            # position in document.nodes (not its index within table_nodes) so it
            # can be compared against node indices in the second pass below
            if toc_candidate and len(item_to_content_map) >= 2:
                toc_table_idx = document.nodes.index(node)
# Second pass: if we found a TOC table, look for items in the document
if item_references:
# Look for anchor IDs that match item references
anchor_nodes = {}
for i, node in enumerate(document.nodes):
# Check for id attribute that might be a target for TOC links
if hasattr(node, 'attrs') and node.attrs.get('id'):
anchor_id = node.attrs.get('id')
anchor_nodes[anchor_id] = i
# Look for nodes that might contain items
for i, node in enumerate(document.nodes):
# Skip nodes before the TOC table if we found one
if toc_table_idx is not None and i <= toc_table_idx:
continue
# Get node content
if not hasattr(node, 'content'):
continue
node_content = node.content
if not isinstance(node_content, str):
continue
# Check for each item in our reference map
for item_name, item_info in item_references.items():
if item_info['found']:
continue # Already found this item
# Method 1: Look for exact item pattern at start of text
if node_content.strip().upper().startswith(item_name):
# Found item directly
content = f"{item_name} {item_info['title']}".strip()
heading_node = create_heading_node(content)
item_headings.append((item_name, i, heading_node))
item_references[item_name]['found'] = True
break
# Method 2: Look for the title text if we have it
if item_info['title'] and item_info['title'].strip():
# This can have false positives, so make sure it's a good match
title = item_info['title'].strip()
# Check if the title appears together with the item name
                        if (f"{item_name} {title}".upper() in node_content.upper() or
                                (title.upper() in node_content.upper() and
                                 "ITEM" in node_content.upper())):
content = f"{item_name} {title}".strip()
heading_node = create_heading_node(content)
item_headings.append((item_name, i, heading_node))
item_references[item_name]['found'] = True
break
# If we found items from the TOC references, return them
if any(info['found'] for info in item_references.values()):
# Sort by position in document
item_headings.sort(key=lambda x: x[1])
return item_headings
# Step 2: Oracle table cell detection
# This specifically targets Oracle 10-K's format where items are in table cells
# but not marked as headings and not part of a formal TOC
item_section_map = {}
for table_idx, node in enumerate(table_nodes):
if hasattr(node, 'content') and isinstance(node.content, list):
rows = node.content
for row_idx, row in enumerate(rows):
if not hasattr(row, 'cells'):
continue
for cell_idx, cell in enumerate(row.cells):
cell_content = cell.content if hasattr(cell, 'content') else ""
if not isinstance(cell_content, str):
continue
# Check if this cell contains an item pattern as an isolated entry
# This is common in Oracle 10-K where items are in cells by themselves
match = item_pattern.search(cell_content)
if match and len(cell_content.strip()) < 50: # Short isolated item cell
item_name = match.group(1).strip().upper()
# Look for the title in adjacent cells
title = ""
if cell_idx + 1 < len(row.cells):
next_cell = row.cells[cell_idx + 1]
next_content = next_cell.content if hasattr(next_cell, 'content') else ""
if isinstance(next_content, str):
title = next_content.strip()
# Check for bold text or other emphasis indicators
is_emphasized = False
if hasattr(cell, 'style'):
if hasattr(cell.style, 'font_weight') and cell.style.font_weight in ['bold', '700', '800', '900']:
is_emphasized = True
elif hasattr(cell.style, 'font_style') and cell.style.font_style == 'italic':
is_emphasized = True
# Store the item with its table position for later extraction
item_section_map[item_name] = {
'table_idx': table_idx,
'row_idx': row_idx,
'title': title,
'emphasized': is_emphasized
}
# If we found items in tables, try to map them to content sections
if item_section_map:
# Create a mapping of items to their positions in the document
table_positions = {}
for i, node in enumerate(document.nodes):
if node.type == 'table':
table_positions[node] = i
for item_name, info in item_section_map.items():
# Create a heading node with the item and title
content = f"{item_name} {info['title']}".strip()
heading_node = create_heading_node(content)
# Find this table's position in the document
target_table = table_nodes[info['table_idx']]
if target_table in table_positions:
table_pos = table_positions[target_table]
# Add this item, prioritizing emphasized ones
if info['emphasized']:
item_headings.insert(0, (item_name, table_pos, heading_node))
else:
item_headings.append((item_name, table_pos, heading_node))
# Sort item headings by position and check if we found enough
if item_headings:
item_headings.sort(key=lambda x: x[1])
# If we found multiple items from tables, return them
if len(item_headings) >= 2:
return item_headings
# Step 3: Iterate through all nodes looking for text blocks that might be item headings
for i, node in enumerate(document.nodes):
# Check text blocks that might be mis-classified headings
if node.type == 'text_block':
# Use only the first line to avoid matching within paragraphs
first_line = node.content.split('\n')[0] if hasattr(node, 'content') else ''
match = item_pattern.search(first_line)
if match:
item_name = match.group(1).strip().upper()
# Additional validation to reduce false positives
# Check if this looks like a real item heading:
# 1. Should be relatively short
# 2. Should start with the matched pattern
# 3. Should not be part of a longer paragraph
if (len(first_line) < 100 and
first_line.lower().startswith(match.group(1).lower()) and
len(first_line.split()) < 15):
# Check for bold font-weight in the node's style if available
is_bold = False
if hasattr(node, 'style') and hasattr(node.style, 'font_weight'):
fw = node.style.font_weight
is_bold = fw in ['bold', '700', '800', '900']
# Prioritize bold text that matches item patterns
if is_bold:
item_headings.insert(0, (item_name, i, node))
else:
item_headings.append((item_name, i, node))
# If we found items, return them
if item_headings:
return item_headings
# Step 4: Last resort - check all nodes for ANY mention of items
# This is a last resort to find something when other methods fail
for i, node in enumerate(document.nodes):
if hasattr(node, 'content') and isinstance(node.content, str):
lines = node.content.split('\n')
for _line_idx, line in enumerate(lines):
match = item_pattern.search(line)
if match and len(line.strip()) < 100: # Avoid matching in long paragraphs
item_name = match.group(1).strip().upper()
# Create a heading node with just the matching line
heading_node = create_heading_node(line)
# We'll use the position of the node containing the pattern
item_headings.append((item_name, i, heading_node))
return item_headings
@staticmethod
def _get_item_pattern(filing_type: str) -> Pattern:
"""Get the regex pattern for identifying items in this filing type."""
# Default to standard 10-K/10-Q item pattern
if filing_type in ('10-K', '10-K/A', '10-Q', '10-Q/A', '20-F', '20-F/A'):
            # Matches both the standard and Oracle-style variants: "Item 1",
            # "ITEM 1.", "Item 1A", and tight spacing such as "Item1",
            # with or without a trailing period
            return re.compile(r'(item\s*\d+[A-Za-z]?)\.?\s*', re.IGNORECASE)
elif filing_type in ('8-K', '8-K/A', '6-K', '6-K/A'):
# 8-K uses decimal format like "Item 1.01"
return re.compile(r'(item\s*\d+\.\d+)\.?\s*', re.IGNORECASE)
else:
# Default pattern for other filings - most flexible
return re.compile(r'(item\s*\d+(?:\.\d+)?[A-Za-z]?)\.?\s*', re.IGNORECASE)
@property
def items(self) -> ItemCollection:
"""Get the collection of items in this document."""
return ItemCollection(self._items)
class FilingDocument:
"""High-level document class specialized for SEC filings."""
def __init__(self, html: str, filing_type: str = None):
self._document = Document.parse(html)
self._filing_type = filing_type
self._index = None # Lazy-loaded
self._toc = None # Lazy-loaded
@property
def document(self) -> Document:
"""Access the underlying Document instance."""
return self._document
@property
def index(self) -> DocumentIndex:
"""Get or create the document index."""
if self._index is None:
self._index = DocumentIndex.build(self._document, self._filing_type)
return self._index
@property
def items(self) -> ItemCollection:
"""Access items in the document."""
return self.index.items
@property
def table_of_contents(self) -> TableOfContents:
"""Get the table of contents for this document."""
if self._toc is None:
self._toc = TableOfContents.extract(self._document)
return self._toc
@property
def tables(self) -> List[Table]:
"""Get all tables in the document."""
return [
Table(node) for node in self._document.nodes
if node.type == 'table'
]
def __getitem__(self, key: str) -> Item:
"""Dictionary-style access to items."""
return self.items[key]

# SEC Filing Item Extraction - New Design
## Analysis of Current Implementation
### Strengths
1. Simple item access via dictionary-style indexing (`doc["Item 1"]`)
2. Caching mechanisms for performance optimization
3. Robust detection of item headings with regex patterns
4. Sequence validation to ensure correct item ordering
5. Special handling for edge cases (table of contents, signatures)
6. Strong integration with company report classes
### Weaknesses
1. Overreliance on DataFrame as intermediate representation
2. Complex chunking process that operates on strings rather than document structure
3. Text-based pattern matching instead of leveraging semantic document structure
4. Forward-filling item associations rather than using hierarchical structure
5. Limited metadata about items (just text)
6. Mixing of responsibilities (parsing, chunking, indexing, item detection)
7. Tight coupling between chunking and item detection
8. Limited extensibility for new filing types
## Design Principles
For our new implementation, we'll follow these principles from successful software projects:
1. **Single Responsibility Principle**: Each component should have one clearly defined responsibility
2. **Separation of Concerns**: Parsing, structure analysis, and item extraction should be separate
3. **Fluent, Intuitive API**: Provide a clean, discoverable interface
4. **Progressive Disclosure**: Simple operations should be simple, complex operations possible
5. **Rich Models**: Return structured objects with useful methods, not just strings
6. **Immutability**: Operations produce new objects rather than modifying existing ones
7. **Extensibility**: Design for future enhancements and filing types
8. **Performance**: Optimize for common operations with appropriate caching
## New Design
### Core Components
#### 1. `FilingDocument` Class
A high-level wrapper around `Document` that specializes in SEC filing structure:
```python
class FilingDocument:
"""High-level document class specialized for SEC filings."""
def __init__(self, html: str, filing_type: str = None):
self._document = Document.parse(html)
self._filing_type = filing_type
self._index = None # Lazy-loaded
self._toc = None # Lazy-loaded
@property
def document(self) -> Document:
"""Access the underlying Document instance."""
return self._document
@property
def index(self) -> 'DocumentIndex':
"""Get or create the document index."""
if self._index is None:
self._index = DocumentIndex.build(self._document, self._filing_type)
return self._index
@property
def items(self) -> 'ItemCollection':
"""Access items in the document."""
return self.index.items
@property
def table_of_contents(self) -> 'TableOfContents':
"""Get the table of contents for this document."""
if self._toc is None:
self._toc = TableOfContents.extract(self._document)
return self._toc
@property
def tables(self) -> List['Table']:
"""Get all tables in the document."""
return [
Table(node) for node in self._document.nodes
if node.type == 'table'
]
def __getitem__(self, key: str) -> 'Item':
"""Dictionary-style access to items."""
return self.items[key]
```
#### 2. `DocumentIndex` Class
Analyzes document structure and builds indices for fast lookup:
```python
class DocumentIndex:
"""Index of document structure for efficient lookups."""
@classmethod
def build(cls, document: Document, filing_type: str = None) -> 'DocumentIndex':
"""Build an index from a document."""
index = cls()
index._build_heading_index(document)
index._build_item_index(document, filing_type)
return index
def _build_heading_index(self, document: Document) -> None:
"""Build an index of all headings in the document."""
# Implementation details...
def _build_item_index(self, document: Document, filing_type: str = None) -> None:
"""Build an index of items in the document."""
# Implementation details...
@property
def items(self) -> 'ItemCollection':
"""Get the collection of items in this document."""
return ItemCollection(self._items)
```
#### 3. `Item` Class
Represents a logical item in a filing with rich functionality:
```python
class Item:
"""Represents a logical item in an SEC filing."""
def __init__(self,
name: str,
heading_node: Optional[HeadingNode],
content_nodes: List[BaseNode],
metadata: Dict[str, Any] = None):
self.name = name
self.heading_node = heading_node
self.content_nodes = content_nodes
self.metadata = metadata or {}
@property
def title(self) -> str:
"""Get the title of this item."""
if self.heading_node:
# Extract title from heading
return self._extract_title(self.heading_node.content)
return ""
@property
def text(self) -> str:
"""Get the text content of this item."""
return "\n".join(
node.content if hasattr(node, 'content') else str(node)
for node in self.content_nodes
)
@property
def tables(self) -> List['Table']:
"""Get all tables within this item."""
return [
Table(node) for node in self.content_nodes
if node.type == 'table'
]
def get_table(self, index: int) -> Optional['Table']:
"""Get a specific table by index."""
tables = self.tables
return tables[index] if 0 <= index < len(tables) else None
def find_tables(self, pattern: str) -> List['Table']:
"""Find tables containing the specified text pattern."""
tables = []
for table in self.tables:
if table.contains(pattern):
tables.append(table)
return tables
def to_markdown(self) -> str:
"""Convert this item to markdown format."""
# Implementation details...
def to_html(self) -> str:
"""Convert this item to HTML format."""
# Implementation details...
def to_dict(self) -> Dict[str, Any]:
"""Convert this item to a dictionary."""
return {
'name': self.name,
'title': self.title,
'text': self.text,
'metadata': self.metadata
}
def __str__(self) -> str:
return self.text
def __repr__(self) -> str:
return f"Item('{self.name}', title='{self.title}')"
```
#### 4. `ItemCollection` Class
Provides a collection interface for working with items:
```python
class ItemCollection:
"""Collection of items in a document with convenient access methods."""
def __init__(self, items: Dict[str, Item]):
self._items = items
def __getitem__(self, key: str) -> Item:
"""Get an item by name, with flexible matching."""
# Case-insensitive lookup
key = key.strip().upper()
# Direct lookup
if key in self._items:
return self._items[key]
# Partial match (e.g., "1" matches "ITEM 1")
if key.isdigit() or (len(key) > 1 and key[0].isdigit()):
for item_key in self._items:
if key in item_key:
return self._items[item_key]
raise KeyError(f"Item '{key}' not found")
def __contains__(self, key: str) -> bool:
"""Check if an item exists."""
try:
self[key]
return True
except KeyError:
return False
def __iter__(self) -> Iterator[Item]:
"""Iterate through items in order."""
return iter(self._items.values())
def __len__(self) -> int:
"""Get the number of items."""
return len(self._items)
def list(self) -> List[str]:
"""Get a list of item names."""
return list(self._items.keys())
```
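Item keys arrive in many spellings ("Item 1A.", "ITEM 1A", " item 1a "). The fuller implementation normalizes keys by uppercasing and stripping spaces and periods before comparing; a minimal sketch of that rule:

```python
import re

def normalize_item_key(key: str) -> str:
    """Uppercase and strip spaces/periods so variant spellings compare equal."""
    return re.sub(r'[.\s]', '', key.strip().upper())

# Variant spellings collapse to one canonical form
assert normalize_item_key("Item 1A.") == "ITEM1A"
assert normalize_item_key(" ITEM 1A ") == "ITEM1A"
assert normalize_item_key("item1a") == "ITEM1A"
```

Matching the normalized forms lets `__getitem__` accept whatever spelling the caller uses without maintaining an alias table.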
#### 5. `FilingRegistry` Class
Registry of known filing types and their structures:
```python
class FilingRegistry:
"""Registry of known filing types and their structures."""
_registry = {}
@classmethod
def register(cls, filing_type: str, structure: Dict[str, Any]) -> None:
"""Register a filing type structure."""
cls._registry[filing_type.upper()] = structure
@classmethod
def get_structure(cls, filing_type: str) -> Optional[Dict[str, Any]]:
"""Get structure for a filing type."""
return cls._registry.get(filing_type.upper())
@classmethod
def get_item_pattern(cls, filing_type: str) -> Optional[str]:
"""Get the regex pattern for identifying items in this filing type."""
structure = cls.get_structure(filing_type)
return structure.get('item_pattern') if structure else None
```
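Populating the registry might look like the following. The class is restated here only so the example is self-contained, and the structure keys (`item_pattern`, `expected_items`) are illustrative assumptions rather than a fixed schema:

```python
import re
from typing import Any, Dict, Optional

class FilingRegistry:
    """Registry of known filing types and their structures (sketch)."""
    _registry: Dict[str, Dict[str, Any]] = {}

    @classmethod
    def register(cls, filing_type: str, structure: Dict[str, Any]) -> None:
        cls._registry[filing_type.upper()] = structure

    @classmethod
    def get_item_pattern(cls, filing_type: str) -> Optional[str]:
        structure = cls._registry.get(filing_type.upper())
        return structure.get('item_pattern') if structure else None

# Register the 10-K structure once at import time
FilingRegistry.register('10-K', {
    'item_pattern': r'(item\s*\d+[A-Za-z]?)\.?\s*',
    'expected_items': ['ITEM 1', 'ITEM 1A', 'ITEM 7', 'ITEM 8'],
})

# Lookup is case-insensitive on the filing type
pattern = re.compile(FilingRegistry.get_item_pattern('10-k'), re.IGNORECASE)
assert pattern.search('Item 1A. Risk Factors').group(1) == 'Item 1A'
```

Keeping the patterns in a registry means adding support for a new filing type is a single `register` call rather than another branch in the detection code.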
### Algorithm for Item Extraction
The core algorithm for extracting items will:
1. Identify all heading nodes in the document
2. Filter for headings that match item patterns
3. For each item heading:
- Determine the item name and normalize it
- Find all nodes between this item heading and the next one
- Create an Item object with the heading and content nodes
4. Build a mapping of item names to Item objects
```python
def extract_items(document: Document, filing_type: str = None) -> Dict[str, Item]:
"""Extract items from a document."""
# Get all heading nodes
heading_nodes = [node for node in document.nodes if node.type == 'heading']
# Get item pattern for this filing type
item_pattern = get_item_pattern(filing_type)
# Filter for item headings
item_headings = []
for node in heading_nodes:
match = re.search(item_pattern, node.content, re.IGNORECASE)
if match:
item_name = match.group(1).strip().upper()
item_headings.append((item_name, node))
# Sort by position in document
item_headings.sort(key=lambda x: document.nodes.index(x[1]))
# Create items
items = {}
for i, (item_name, heading_node) in enumerate(item_headings):
# Find content nodes
start_idx = document.nodes.index(heading_node) + 1
end_idx = (document.nodes.index(item_headings[i+1][1])
if i+1 < len(item_headings) else len(document.nodes))
content_nodes = document.nodes[start_idx:end_idx]
# Create item
items[item_name] = Item(item_name, heading_node, content_nodes)
return items
```
### Integration with Company Reports
Update the CompanyReport class to use the new FilingDocument:
```python
class CompanyReport:
def __init__(self, filing):
self._filing = filing
self._document = None
@property
def document(self) -> FilingDocument:
"""Get the filing document."""
if self._document is None:
html = self._filing.html()
self._document = FilingDocument(html, self._filing.form)
return self._document
@property
def items(self) -> ItemCollection:
"""Get all items in this filing."""
return self.document.items
def __getitem__(self, key: str) -> Item:
"""Get an item by name."""
return self.items[key]
```
Specialized classes like TenK would add property accessors for common items:
```python
class TenK(CompanyReport):
@property
def business(self) -> Item:
"""Get Item 1: Business."""
return self.items["Item 1"]
@property
def risk_factors(self) -> Item:
"""Get Item 1A: Risk Factors."""
return self.items["Item 1A"]
@property
def management_discussion(self) -> Item:
"""Get Item 7: Management's Discussion and Analysis."""
return self.items["Item 7"]
```
## Implementation Strategy
To implement this design, we'll follow these steps:
1. Implement the `Item` and `ItemCollection` classes first
2. Create the `DocumentIndex` class
3. Implement the `FilingDocument` class
4. Set up the `FilingRegistry` with known filing structures
5. Update the `CompanyReport` hierarchy to use the new classes
6. Write comprehensive tests
7. Deprecate the old implementation with appropriate warnings
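Step 7 can lean on Python's standard deprecation machinery so existing callers keep working while being steered toward `FilingDocument`; a sketch (the shim name and message wording are assumptions):

```python
import warnings

def chunked_document_deprecated(html: str):
    """Shim that warns before delegating to the legacy ChunkedDocument path."""
    warnings.warn(
        "ChunkedDocument is deprecated; use FilingDocument instead",
        DeprecationWarning,
        stacklevel=2,  # point the warning at the caller, not this shim
    )
    # ... delegate to the old implementation here

# Verify the warning is emitted
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    chunked_document_deprecated("<html></html>")

assert any(issubclass(w.category, DeprecationWarning) for w in caught)
```

`DeprecationWarning` is hidden by default outside test runners, which makes it a low-noise way to phase out the old API over a release or two.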
## Optimizations
Performance is critical for this component. Key optimizations include:
1. **Lazy Loading**: Only build indices when needed
2. **Caching**: Cache document and index objects
3. **Efficient Node Traversal**: Use direct node references instead of searching by content
4. **Smart Item Matching**: Support flexible item lookup patterns
5. **Document Structure Awareness**: Leverage heading levels and hierarchy
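Lazy loading and caching (points 1 and 2) map naturally onto `functools.cached_property`, which computes the index once on first access and reuses it thereafter. A minimal illustration (class and attribute names here are illustrative, not the final API):

```python
from functools import cached_property

class LazyIndexDocument:
    """Illustration of lazy index construction with cached_property."""
    def __init__(self, html: str):
        self.html = html
        self.build_count = 0  # counts how many times the index is built

    @cached_property
    def index(self):
        # Expensive structural analysis happens only on first access
        self.build_count += 1
        return {"headings": [], "items": {}}

doc = LazyIndexDocument("<html></html>")
_ = doc.index
_ = doc.index
assert doc.build_count == 1  # second access hit the cache
```

This gives the same behavior as the explicit `if self._index is None` guards in `FilingDocument`, with less boilerplate per property.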
## Comparison with Old Implementation
| Feature | Old Implementation | New Implementation |
|---------|-------------------|-------------------|
| Primary structure | DataFrame of chunks | Tree of nodes |
| Item detection | Regex on plaintext | Pattern matching on heading nodes |
| Item boundaries | Forward-fill in DataFrame | Node ranges in document |
| Return value | Text string | Rich Item object |
| Extensibility | Limited | Registry-based design |
| Performance | Good with caching | Better with structural analysis |
| API clarity | Medium (mixed responsibilities) | High (clear separation) |
| Edge case handling | Good, but complex | Simpler with structure awareness |
## Usage Examples
### Basic Usage
```python
# Get a filing
filing = edgartools.get_filing("AAPL", "10-K", latest=True)
# Create a 10-K report
tenk = TenK(filing)
# Access an item
business = tenk.business
print(f"Business description: {business.title}")
print(business.text[:100] + "...")
# Access using dictionary style
risk_factors = tenk["Item 1A"]
print(f"Risk factors ({len(risk_factors.text)} chars)")
```
### Working with Tables
```python
# Get the financial statements item
financial_statements = tenk["Item 8"]
# Get all tables in the item
tables = financial_statements.tables
print(f"Found {len(tables)} tables in financial statements")
# Get a specific table (e.g., income statement)
income_statement = financial_statements.get_table(0)
if income_statement:
# Convert to pandas DataFrame
df = income_statement.to_dataframe()
print(df.head())
# Get table metadata
print(f"Table dimensions: {income_statement.rows} rows × {income_statement.columns} columns")
# Access specific cell
revenue = income_statement.get_cell(1, 1)
print(f"Revenue: {revenue}")
# Find tables containing specific text
revenue_tables = financial_statements.find_tables("revenue")
for table in revenue_tables:
print(f"Found table with {table.rows} rows about revenue")
```
### Table of Contents
```python
# Get the table of contents
toc = tenk.document.table_of_contents
# Print TOC structure
for entry in toc.entries:
print(f"{entry.level * ' '}{entry.text} (page {entry.page})")
# Navigate directly to a TOC entry
item7 = toc.find("Management's Discussion")
if item7:
print(f"Found MD&A at level {item7.level}")
# Jump to that section
mda = tenk[item7.reference]
print(mda.title)
```
### Advanced Usage
```python
# Get all items
for item in tenk.items:
print(f"{item.name}: {item.title}")
# Convert to markdown
md_text = tenk.business.to_markdown()
# Get as JSON
import json
items_json = json.dumps({
    item.name: item.to_dict()
    for item in tenk.items
})
# Search within items
for item in tenk.items:
if "revenue" in item.text.lower():
print(f"Found revenue discussion in {item.name}")
# Extract nested sections within an item
mda = tenk.management_discussion
subsections = mda.get_subsections()
for section in subsections:
print(f"Subsection: {section.title}")
```
## Table Components
To complete our design, we'll implement these additional classes for handling tables and the table of contents:
### Table Class
```python
class Table:
"""Rich representation of a table in a document."""
def __init__(self, table_node: 'TableNode'):
self._node = table_node
self._processed = None # Lazy-loaded processed table
@property
def rows(self) -> int:
"""Get the number of rows in the table."""
return self._get_processed().processed_row_count
@property
def columns(self) -> int:
"""Get the number of columns in the table."""
return self._get_processed().processed_column_count
def _get_processed(self) -> 'ProcessedTable':
"""Get or create the processed table."""
if self._processed is None:
self._processed = self._node._processed
return self._processed
def to_dataframe(self) -> 'pd.DataFrame':
"""Convert this table to a pandas DataFrame."""
processed = self._get_processed()
if processed and processed.headers and processed.data_rows:
return pd.DataFrame(processed.data_rows, columns=processed.headers)
return pd.DataFrame()
def to_markdown(self) -> str:
"""Convert this table to markdown format."""
# Implementation details...
def get_cell(self, row: int, col: int) -> str:
"""Get the content of a specific cell."""
processed = self._get_processed()
if processed and 0 <= row < len(processed.data_rows):
data_row = processed.data_rows[row]
if 0 <= col < len(data_row):
return data_row[col]
return ""
    def contains(self, text: str) -> bool:
        """Check if the table contains the specified text (case-insensitive)."""
        processed = self._get_processed()
        if not processed:
            return False
        needle = text.lower()
        # Check headers
        if processed.headers and any(needle in header.lower() for header in processed.headers):
            return True
        # Check data rows
        return any(
            needle in str(cell).lower()
            for row in processed.data_rows
            for cell in row
        )
def __str__(self) -> str:
return self.to_markdown()
def __repr__(self) -> str:
return f"Table({self.rows}×{self.columns})"
```
### TableOfContents Class
```python
class TocEntry:
"""Entry in a table of contents."""
def __init__(self, text: str, level: int, page: Optional[int] = None, reference: Optional[str] = None):
self.text = text
self.level = level
self.page = page
self.reference = reference # Item reference, if applicable
def __repr__(self) -> str:
return f"TocEntry('{self.text}', level={self.level}, page={self.page})"
class TableOfContents:
"""Table of contents extracted from a document."""
def __init__(self, entries: List[TocEntry]):
self.entries = entries
@classmethod
def extract(cls, document: Document) -> 'TableOfContents':
"""Extract table of contents from document."""
entries = []
# Find TOC section (usually at the beginning)
toc_node_index = cls._find_toc_section(document)
if toc_node_index is None:
return cls([])
# Get nodes after TOC heading until the next major heading
toc_nodes = cls._get_toc_nodes(document, toc_node_index)
# Process nodes to extract entries
entries = cls._process_toc_nodes(toc_nodes)
# Match entries to actual items
cls._match_entries_to_items(entries, document)
return cls(entries)
@staticmethod
def _find_toc_section(document: Document) -> Optional[int]:
"""Find the TOC section in the document."""
# Look for "Table of Contents" heading
toc_patterns = [
re.compile(r'table\s+of\s+contents', re.IGNORECASE),
re.compile(r'contents', re.IGNORECASE)
]
for i, node in enumerate(document.nodes):
if node.type == 'heading':
for pattern in toc_patterns:
if pattern.search(node.content):
return i
return None
@staticmethod
def _get_toc_nodes(document: Document, start_index: int) -> List['BaseNode']:
"""Get nodes belonging to the TOC section."""
# Implementation details...
@staticmethod
def _process_toc_nodes(nodes: List['BaseNode']) -> List[TocEntry]:
"""Process TOC nodes to extract entries."""
# Implementation details...
@staticmethod
def _match_entries_to_items(entries: List[TocEntry], document: Document) -> None:
"""Match TOC entries to actual items in the document."""
# Implementation details...
def find(self, text: str) -> Optional[TocEntry]:
"""Find a TOC entry by text."""
text = text.lower()
for entry in self.entries:
if text in entry.text.lower():
return entry
return None
def __iter__(self) -> Iterator[TocEntry]:
return iter(self.entries)
def __len__(self) -> int:
return len(self.entries)
```
## Challenges and Mitigations
1. **Accurate Item Detection**: Use a combination of patterns and structural analysis
2. **Handling Malformed Documents**: Fall back to text-based detection when structure is unclear
3. **Performance with Large Documents**: Use lazy evaluation and partial parsing
4. **Backward Compatibility**: Provide adapters for old API patterns
5. **Content Transformation**: Preserve tables and formatting during item extraction
6. **TOC Detection**: Use multiple heuristics to find and parse table of contents
7. **Table Extraction**: Handle complex tables with rowspan/colspan and formatting
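The text-based fallback in mitigation 2 could look something like the sketch below. Note that `ITEM_PATTERN` and `detect_items_from_text` are illustrative names, not part of the existing codebase, and the heuristic (a short line beginning with "Item N") is an assumption about typical filing layout:

```python
import re
from typing import Dict

# Hypothetical fallback: find "Item X" headings directly in plain text
# when node-based structural detection fails on malformed documents.
# Heuristic: a heading is a short line starting with "Item <number><letter?>".
ITEM_PATTERN = re.compile(
    r'^\s*(Item\s+\d+[A-C]?)\s*[.:-]?\s*.{0,80}$',
    re.IGNORECASE | re.MULTILINE,
)

def detect_items_from_text(text: str) -> Dict[str, int]:
    """Map each detected item heading to its character offset in the text."""
    positions: Dict[str, int] = {}
    for match in ITEM_PATTERN.finditer(text):
        # Normalize e.g. "item 1a" -> "Item 1A"
        label = "Item " + match.group(1).split()[-1].upper()
        # Keep the first occurrence; later matches are usually cross-references
        positions.setdefault(label, match.start())
    return positions
```

The offsets returned by a fallback like this could then be used to slice the raw text into item sections when the structural path yields nothing.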
By following this design, we'll create a cleaner, more robust API for extracting items from SEC filings that leverages the structural advantages of the new Document class while improving on the functionality of the current implementation.