# Goals

## Mission

Replace `edgar.files` with a parser that is better in **every way** - utility, accuracy, and user experience. The maintainer is the final judge: output must look correct when printed.

## Core Principles

### Primary Goal: AI Context Optimization

- **Token efficiency**: 30-50% reduction vs raw HTML while preserving semantic meaning
- **Chunking support**: Enable independent processing of sections/tables for LLM context windows
- **Clean text output**: Tables rendered in LLM-friendly formats (clean text, markdown)
- **Semantic preservation**: Extract meaning, not just formatting

### Secondary Goal: Human Readability

- **Rich console output**: Beautiful rendering with proper table alignment
- **Markdown export**: Professional-looking document conversion
- **Section navigation**: Easy access to specific Items/sections

## User-Focused Feature Goals

### 1. Text Extraction

- Extract full document text without dropping meaningful content
- Preserve paragraph structure and semantic whitespace
- Handle inline XBRL facts gracefully (show values, not raw tags)
- Clean HTML artifacts automatically (scripts, styles, page numbers)
- **Target**: 99%+ accuracy vs manual reading

### 2. Section Extraction (10-K, 10-Q, 8-K)

- Detect >90% of standard sections for >90% of test tickers
- Support flexible access: `doc.sections['Item 1A']`, `doc['1A']`, `doc.risk_factors`
- Return Section objects with `.text()`, `.tables`, `.search()` methods
- Include confidence scores and detection method metadata
- **Target**: Better recall than old parser (quantify with test suite)

### 3. Table Extraction

- Extract all meaningful data tables (ignore pure layout tables)
- Accurate rendering with aligned columns and proper formatting
- Handle complex tables (rowspan, colspan, nested headers)
- Preserve table captions and surrounding context
- Support DataFrame conversion for data analysis
- **Target**: 95%+ accuracy on test corpus

### 4. Search Capabilities

- Text search within documents
- Regex pattern matching
- Semantic search preparation (structure for embedding-based search)
- Search within sections for focused queries

### 5. Multiple Output Formats

- Plain text (optimized for LLM context)
- Markdown (for documentation/sharing)
- Rich console (beautiful terminal display)
- JSON (structured data export)

### 6. Developer Experience

- Intuitive API: `doc.text()`, `doc.tables`, `doc.sections`
- Rich objects with useful methods (not just strings)
- Simple tasks simple, complex tasks possible
- Helpful error messages with recovery suggestions
- **Target**: New users productive in <10 minutes

## Performance Targets

### Speed Benchmarks (Based on Current Performance)

- **Small docs (<5MB)**: <500ms ✅ *Currently 96ms - excellent*
- **Medium docs (5-20MB)**: <2s ✅ *Currently 1.19s - excellent*
- **Large docs (>50MB)**: <10s ✅ *Currently 0.59s - excellent*
- **Throughput**: >3MB/s sustained ✅ *Currently 3.8MB/s*
- **Target**: Maintain or improve on all benchmarks

### Memory Efficiency

- **Small docs (<5MB)**: <3x document size *(currently 9x - needs optimization)*
- **Large docs (>10MB)**: <2x document size *(currently 1.9x - good)*
- **No memory spikes**: Never exceed 5x document size *(MSFT currently 5.4x)*
- **Target**: Consistent 2-3x overhead across all document sizes

### Accuracy Benchmarks

- **Section detection recall**: >90% on 20-ticker test set
- **Table extraction accuracy**: >95% on manual validation set
- **Text fidelity**: >99% semantic equivalence to source HTML
- **XBRL fact extraction**: 100% of inline facts captured correctly

## Implementation Details

### HTML Parsing

- Read the entire HTML document without dropping semantically meaningful content
- Drop non-meaningful content (scripts, styles, pure formatting tags)
- Preserve semantic structure (headings, paragraphs, lists)
- Handle both old (pre-2015) and modern (inline XBRL) formats
- Graceful degradation for malformed HTML

### Table Parsing

- Extract tables containing meaningful data
- Ignore layout tables (unless they aid document understanding)
- Accurate rendering with proper column alignment
- Handle complex structures: rowspan, colspan, nested headers, multi-level headers
- Preserve table captions and contextual information
- Support conversion to pandas DataFrame

### Section Extraction

- Detect standard sections (Item 1, 1A, 7, etc.) for 10-K, 10-Q, 8-K filings
- Support multiple detection strategies: TOC-based, heading-based, pattern-based
- Return Section objects with full API: `.text()`, `.text_without_tables()`, `.tables`, `.search()`
- Include metadata: confidence scores, detection method, position
- Better recall than old parser (establish baseline with test suite)

## Quality Gates Before Replacing edgar.files

### Automated Tests

- [ ] All existing tests pass with new parser (1000+ tests)
- [ ] Performance regression tests (<5% slower on any document)
- [ ] Memory regression tests (no >10% increases)
- [ ] Section detection accuracy >90% on test corpus
- [ ] Table extraction accuracy >95% on validation set

### Manual Validation (Maintainer Review)

- [ ] Print full document text for 10 sample filings → verify quality
- [ ] Compare table rendering old vs new → verify improvement
- [ ] Test section extraction on edge cases → verify robustness
- [ ] Review markdown output → verify professional appearance
- [ ] Check memory usage → verify no concerning spikes

### Documentation Requirements

- [ ] Migration guide (old API → new API with examples)
- [ ] Updated user guide showing new features
- [ ] Performance comparison report (old vs new)
- [ ] Known limitations documented clearly
- [ ] API reference complete for all public methods

## Success Metrics

### Launch Criteria

1. **Speed**: Equal or faster on 95% of test corpus
2. **Accuracy**: Maintainer approves output quality on sample set
3. **API**: Clean, intuitive interface (no confusion)
4. **Tests**: Zero regressions, 95%+ coverage on new code
5. **Docs**: Complete with examples for all major use cases

### Post-Launch Monitoring

- Issue reports: <5% related to parser quality/accuracy
- User feedback: Positive sentiment on ease of use
- Performance: No degradation over time (regression tests)
- Adoption: Smooth migration from old parser (deprecation path)

## Feature Parity with Old Parser

### Must-Have (Required for Migration)

- ✅ Get document text (with/without tables)
- ✅ Extract specific sections by name/number
- ✅ List all tables in document
- ✅ Search document content
- ✅ Convert to markdown
- ✅ Handle both old and new SEC filing formats
- ✅ Graceful error handling

### Nice-to-Have (Improvements Over Old Parser)

- 🎯 Semantic search capabilities
- 🎯 Better subsection extraction within Items
- 🎯 Table-of-contents navigation
- 🎯 Export to multiple formats (JSON, clean HTML)
- 🎯 Batch processing optimizations
- 🎯 Section confidence scores and metadata
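
To make the section-access and metadata goals concrete, here is a minimal sketch of how flexible keys such as `'Item 1A'` and `'1A'` might resolve to one Section object that carries a confidence score and detection method. It is not the parser's actual implementation: the `Section` fields, the `Sections` container, `normalize_key`, and `confident` are all hypothetical names chosen for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Section:
    """Hypothetical section record; field names are illustrative only."""
    name: str              # canonical name, e.g. "Item 1A"
    body: str              # extracted section text
    confidence: float      # detection confidence in the range 0.0-1.0
    detection_method: str  # e.g. "toc", "heading", or "pattern"

    def text(self) -> str:
        return self.body

    def search(self, pattern: str) -> list[str]:
        """Return every regex match found in the section text."""
        return re.findall(pattern, self.body)


def normalize_key(key: str) -> str:
    """Map 'Item 1A', 'item 1a', '1A', or '2.02' to a canonical 'Item X' name."""
    match = re.fullmatch(
        r"\s*(?:item\s+)?(\d+(?:\.\d+)?[A-Za-z]?)\s*", key, flags=re.IGNORECASE
    )
    if match is None:
        raise KeyError(f"Unrecognized section key: {key!r}")
    return f"Item {match.group(1).upper()}"


class Sections:
    """Hypothetical container that accepts several spellings of a section key."""

    def __init__(self, sections: list[Section]) -> None:
        self._by_name = {normalize_key(s.name): s for s in sections}

    def __getitem__(self, key: str) -> Section:
        canonical = normalize_key(key)
        if canonical not in self._by_name:
            # Error message lists what was detected, as a recovery hint.
            available = ", ".join(sorted(self._by_name))
            raise KeyError(f"{canonical} not found; detected sections: {available}")
        return self._by_name[canonical]

    def confident(self, threshold: float = 0.8) -> list[Section]:
        """Return sections whose detection confidence meets the threshold."""
        return [s for s in self._by_name.values() if s.confidence >= threshold]
```

With a container like this, `sections["1A"]` and `sections["Item 1A"]` would return the same object, and a caller could drop low-confidence detections with `sections.confident(0.8)` before passing text to an LLM. The 0.8 default is an arbitrary placeholder, not a threshold taken from this document.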