📖 Documentation

Enterprise-grade Intelligent Document Processing system for healthcare claims, enrollment forms, and policy documents with AI-powered extraction and adjudication.

🐍 Python 3.11 ⚡ FastAPI 🧠 spaCy NLP ☁️ AWS Bedrock

📋 Project Overview

🎯 Purpose

This Healthcare IDP System is a Data Science portfolio project demonstrating Document Intelligence capabilities for processing healthcare documents. The system achieves 97-99% precision targets through an ensemble approach combining rule-based extraction, NLP models, and LLM integration.

📁 Document Types Supported

  • Disability Claims - STD/LTD claim forms with claimant information, diagnosis, and benefit details
  • Enrollment Forms - Employee enrollment with plan selection and dependent information
  • Policy Certificates - Insurance policy terms, conditions, and coverage details
  • RFP Documents - Request for Proposal documents for insurance services

🔧 Key Features

  • Multi-format Support - PDF, PNG, JPG, TIFF, TXT document ingestion
  • OCR Processing - Tesseract-based text extraction from images
  • Smart Classification - ML-based document type identification
  • Entity Extraction - Names, SSN, Policy numbers, Dates, Amounts
  • Claim Adjudication - Automated approval/denial with reasoning
  • Batch Processing - Process multiple documents with CSV export

🚀 Quick Start

# Clone and setup git clone https://github.com/your-repo/healthcare-idp-system.git cd healthcare-idp-system # Create virtual environment python -m venv venv .\venv\Scripts\Activate.ps1 # Windows source venv/bin/activate # Linux/Mac # Install dependencies pip install -r requirements.txt python -m spacy download en_core_web_lg # Run the server uvicorn api.main:app --host 0.0.0.0 --port 8000 # Open browser at http://localhost:8000/ui

📊 Performance Metrics

  • Classification Accuracy: 95-98% on healthcare document types
  • Entity Extraction F1: 92-97% for key entities (SSN, Policy#, Names)
  • Processing Speed: ~2-3 seconds per document (including OCR)
  • Adjudication Accuracy: 98%+ on rule-based decisions

🔄 Processing Workflow

📄
Document Input
PDF, Image, Text
🔍
OCR/Text Extract
Tesseract OCR
📂
Classification
ML + Keywords
🧠
NER Extraction
spaCy + Regex
🤖
LLM Enhancement
AWS Bedrock
⚖️
Adjudication
Business Rules
Results
JSON / CSV

📋 Detailed Pipeline Steps:

  1. Document Ingestion: Accept multi-format documents via REST API or Web UI
  2. Text Extraction: Use Tesseract OCR for images, PyPDF2 for PDFs, direct read for text
  3. Quality Assessment: Calculate OCR confidence and text quality scores
  4. Document Classification: Identify document type using keyword matching + ML model
  5. Entity Recognition: Extract entities using spaCy NER + custom regex patterns
  6. LLM Enhancement: Refine extraction using Claude via AWS Bedrock (optional)
  7. Policy Interpretation: Parse policy clauses and coverage terms
  8. Claim Adjudication: Apply business rules to approve/deny claims with reasons
  9. Result Generation: Output structured JSON with all extracted data and decisions

⚙️ System Capabilities

📂

Document Classification

Intelligent document type detection

  • Disability Claim Forms (STD/LTD)
  • Employee Enrollment Forms
  • Policy Certificates & Documents
  • RFP Documents
  • 95%+ classification accuracy
📝

Entity Extraction

spaCy + Regex + LLM Ensemble

  • Personal Info (Name, DOB, SSN)
  • Policy & Claim Numbers
  • Dates and Monetary Amounts
  • Medical Diagnosis Codes
  • Employer & Physician Details
⚖️

Claim Adjudication

Rule-based Business Logic

  • Automated approval/denial decisions
  • Configurable business rules
  • Pre-existing condition checks
  • Coverage verification
  • Detailed reasoning output

Eligibility Matching

Plan & Coverage Verification

  • Policy validity verification
  • Coverage period checking
  • Benefit limit validation
  • Waiting period enforcement
  • Exclusion matching
📜

Policy Interpretation

NLP Clause Analysis

  • Elimination period extraction
  • Benefit calculation rules
  • Exclusion clause parsing
  • Coverage term identification
  • Maximum benefit limits
🤖

LLM Integration

AAWS Bedrock / OpenAI

  • Context-aware entity refinement
  • Complex document understanding
  • Ambiguity resolution
  • Mock mode for offline testing
  • Graceful fallback handling

🛠️ Technology Stack

🐍

Core Language

Python ecosystem

  • Python 3.11
  • Type Hints PEP 484
  • Async/Await asyncio

Web Framework

REST API & UI

  • FastAPI 0.109.0
  • Uvicorn ASGI Server
  • Pydantic v2
🧠

NLP & ML

Entity extraction

  • spaCy 3.7.2
  • en_core_web_lg NER Model
  • Regex Patterns Custom
☁️

Cloud & AI

LLM integration

  • AWS Bedrock Claude
  • boto3 AWS SDK
  • Mock Mode Offline
📷

OCR & Documents

Text extraction

  • Tesseract 5.4.0
  • pytesseract Python wrapper
  • Pillow Image processing
📊

Data Processing

Analysis & export

  • pandas DataFrames
  • PyYAML Config
  • CSV/JSON Export

🏗️ Project Structure

healthcare-idp-system/ ├── api/ │ ├── main.py # FastAPI application & endpoints │ └── schemas.py # Pydantic request/response models ├── src/ │ ├── document_classifier.py # ML-based document classification │ ├── entity_extractor.py # spaCy NER + regex extraction │ ├── claim_adjudicator.py # Business rule engine │ ├── policy_interpreter.py # Policy clause analysis │ ├── pipeline.py # Main processing orchestrator │ └── utils.py # OCR, file handling utilities ├── config/ │ └── config.yaml # Application configuration ├── static/ │ ├── batch_ui.html # Batch processing web UI │ ├── enhanced_ui.html # Single document UI │ └── docs.html # Documentation page ├── data/ │ └── samples/ # Sample test documents ├── tests/ │ ├── test_classifier.py # Classification tests │ ├── test_extractor.py # Extraction tests │ └── test_pipeline.py # Integration tests ├── deployment/ │ ├── Dockerfile # Container deployment │ └── docker-compose.yml # Multi-container setup ├── requirements.txt # Python dependencies └── README.md # Project documentation