📖 Documentation
Enterprise-grade Intelligent Document Processing system for healthcare claims, enrollment forms, and policy documents with AI-powered extraction and adjudication.
📋 Project Overview
🎯 Purpose
This Healthcare IDP System is a Data Science portfolio project demonstrating Document Intelligence capabilities for processing healthcare documents. The system achieves 97-99% precision targets through an ensemble approach combining rule-based extraction, NLP models, and LLM integration.
📁 Document Types Supported
- Disability Claims - STD/LTD claim forms with claimant information, diagnosis, and benefit details
- Enrollment Forms - Employee enrollment with plan selection and dependent information
- Policy Certificates - Insurance policy terms, conditions, and coverage details
- RFP Documents - Request for Proposal documents for insurance services
🔧 Key Features
- Multi-format Support - PDF, PNG, JPG, TIFF, TXT document ingestion
- OCR Processing - Tesseract-based text extraction from images
- Smart Classification - ML-based document type identification
- Entity Extraction - Names, SSN, Policy numbers, Dates, Amounts
- Claim Adjudication - Automated approval/denial with reasoning
- Batch Processing - Process multiple documents with CSV export
🚀 Quick Start
📊 Performance Metrics
- Classification Accuracy: 95-98% on healthcare document types
- Entity Extraction F1: 92-97% for key entities (SSN, Policy#, Names)
- Processing Speed: ~2-3 seconds per document (including OCR)
- Adjudication Accuracy: 98%+ on rule-based decisions
🔄 Processing Workflow
📋 Detailed Pipeline Steps:
- Document Ingestion: Accept multi-format documents via REST API or Web UI
- Text Extraction: Use Tesseract OCR for images, PyPDF2 for PDFs, direct read for text
- Quality Assessment: Calculate OCR confidence and text quality scores
- Document Classification: Identify document type using keyword matching + ML model
- Entity Recognition: Extract entities using spaCy NER + custom regex patterns
- LLM Enhancement: Refine extraction using Claude via AWS Bedrock (optional)
- Policy Interpretation: Parse policy clauses and coverage terms
- Claim Adjudication: Apply business rules to approve/deny claims with reasons
- Result Generation: Output structured JSON with all extracted data and decisions
⚙️ System Capabilities
Document Classification
Intelligent document type detection
- Disability Claim Forms (STD/LTD)
- Employee Enrollment Forms
- Policy Certificates & Documents
- RFP Documents
- 95%+ classification accuracy
Entity Extraction
spaCy + Regex + LLM Ensemble
- Personal Info (Name, DOB, SSN)
- Policy & Claim Numbers
- Dates and Monetary Amounts
- Medical Diagnosis Codes
- Employer & Physician Details
Claim Adjudication
Rule-based Business Logic
- Automated approval/denial decisions
- Configurable business rules
- Pre-existing condition checks
- Coverage verification
- Detailed reasoning output
Eligibility Matching
Plan & Coverage Verification
- Policy validity verification
- Coverage period checking
- Benefit limit validation
- Waiting period enforcement
- Exclusion matching
Policy Interpretation
NLP Clause Analysis
- Elimination period extraction
- Benefit calculation rules
- Exclusion clause parsing
- Coverage term identification
- Maximum benefit limits
LLM Integration
AAWS Bedrock / OpenAI
- Context-aware entity refinement
- Complex document understanding
- Ambiguity resolution
- Mock mode for offline testing
- Graceful fallback handling
🛠️ Technology Stack
Core Language
Python ecosystem
- Python 3.11
- Type Hints PEP 484
- Async/Await asyncio
Web Framework
REST API & UI
- FastAPI 0.109.0
- Uvicorn ASGI Server
- Pydantic v2
NLP & ML
Entity extraction
- spaCy 3.7.2
- en_core_web_lg NER Model
- Regex Patterns Custom
Cloud & AI
LLM integration
- AWS Bedrock Claude
- boto3 AWS SDK
- Mock Mode Offline
OCR & Documents
Text extraction
- Tesseract 5.4.0
- pytesseract Python wrapper
- Pillow Image processing
Data Processing
Analysis & export
- pandas DataFrames
- PyYAML Config
- CSV/JSON Export