Designed and built a secure, AWS-based data warehouse and automated ingestion pipeline to unify Mount Sinai's fragmented ophthalmology datasets — including fundus photos, OCT, OCTA, visual fields, and clinical demographics — into a single, standardized, queryable platform for researchers and clinicians.
Mount Sinai Ophthalmology Data Warehouse & Pipeline
Mount Sinai Health System
Healthcare / Ophthalmology Research
September 2025 – Present (ongoing; v1.0 spec authored September 12, 2025; iterative updates through February 2026)
Ophthalmology datasets (fundus, OCT, OCTA, visual fields, photos, HRF, oximeter, demographics/clinical) were fragmented across multiple projects, each with inconsistent naming conventions, storage locations, and metadata standards. There was no centralized, standardized home for the data.
Solving this problem was critical because researchers and clinicians needed the ability to securely ingest, version, search, and analyze imaging and clinical data at scale. Without a unified system, cross-project research, cohort building, and longitudinal disease analysis were impractical or impossible.
The previous state lacked predictable file organization, enforced metadata tagging, automated validation, versioning, and role-based access control. Files were scattered across siloed storage with no consistent naming, no catalog, and no way to query across projects or modalities.
Researchers, clinicians, data managers, principal investigators, and administrative staff across Mount Sinai's ophthalmology department were affected — anyone needing to discover, access, or analyze imaging and clinical data.
Yes, significant technical challenges included handling large proprietary file formats (e.g., .E2E, .BAK), enforcing PHI/PII compliance and de-identification, managing inconsistent legacy naming across historical datasets, preventing tag sprawl, controlling cloud storage costs, and supporting overlapping patient/project membership without duplicating files.
The main objective was to build a single AWS S3-based data warehouse with automated ingestion, validation, metadata extraction, cataloging, versioning, role-based access control, and audit trails providing researchers and clinicians a secure, standardized, queryable platform for ophthalmology data.
Success looked like: researchers can upload data via web UI or CLI and have it automatically validated, tagged, and cataloged; data managers can query metadata via Athena; admins can manage granular access without touching bucket-wide policies; all actions are auditable; and cross-project cohort building is possible without exposing PHI.
Secondary goals included:
Streamlining data ingestion with drag-and-drop web UI and CLI upload workflows
Enabling collaboration through shared queries, cohorts, and reproducible exports
Supporting de-identified patient discovery and cross-project overlap detection
Providing a dashboard with KPI tiles, data quality alerts, and quick actions
Automating batch processing (thumbnails, format conversions)
Controlling cloud costs through intelligent tiering and lifecycle policies
Secondary goals included:
Support for 10+ TB and 10M+ objects with graceful degradation
Tag-based policy changes effective within 15 minutes
RPO of 24 hours; RTO of 4 hours for the control plane
Validation and metadata extraction on every upload (zero untagged objects in production)
Paginated search results with defined SLAs on query response times
The team took a spec-driven, iterative approach starting with a comprehensive technical specification (v1.0) that defined the full scope, then refining through multiple revisions (v1.0 through v1.3) as requirements evolved around information sharing, data discovery, identity resolution, and ERD-based canonical modeling.
Collaborated with Mount Sinai stakeholders to identify pain points, data modalities, user roles, compliance requirements, and existing workflows.
Designed a canonical relational model (patients → scan sessions → scan files) with bridge tables for many-to-many project membership, JSONB for flexible modality-specific metadata, and a pooled storage strategy to eliminate file duplication.
Defined strict filename regex, required tags, server-side validation checks, duplicate prevention, disease normalization, and quarantine flows for non-conforming uploads.
Built the AWS infrastructure (S3, Lambda, Glue, Athena, IAM/Identity Center, CloudTrail, VPC endpoints) and the automated ingestion/validation/cataloging pipeline.
Developed the web interface including dashboard, project explorer, asset gallery, metadata grid, search, cohort builder, and sharing/export workflows.
Validated against user stories and acceptance criteria; iterated on the spec based on feedback (e.g., adding PI role, cohort sharing, de-identified discovery).
Clinical guidance from ophthalmology SMEs (modality-specific metadata requirements)
PHI/HIPAA compliance requirements and Mount Sinai governance policies
The need to support heterogeneous and evolving modality fields (driving the JSONB design)
Real query patterns (e.g., "NAION right-eye scans within +/-90 days of onset") shaping the data model and indexing strategy
Strict PHI/PII handling requirements limiting how patient identifiers could be stored and displayed
Large proprietary file formats (e.g., Heidelberg .E2E, Optovue .BAK) requiring raw storage with lightweight previews rather than full conversion
Inconsistent legacy naming conventions across historical datasets requiring migration scripts and mapping tables
Budget sensitivity requiring intelligent tiering, Glacier lifecycle policies, and monthly cost alerts
Backend Development Automated ingestion pipeline (S3 event triggers → Lambda → validation, metadata extraction, optional AWS Batch processing), canonical relational database (Postgres with ERD-based schema), Glue crawlers, Athena query layer, DVC+Git versioning integration.
Frontend Development Web UI with six primary views: Dashboard/Home, Project Explorer (Search + Results), Asset Gallery, Metadata Grid, Object Detail Drawer, and Global Search with advanced filter builder.
Strategy Full technical specification document (v1.0–v1.3) defining architecture, data model, user roles, validation rules, metadata standards, query patterns, NFRs, and risk mitigations.
Frontend Development Web UI with six primary views: Dashboard/Home, Project Explorer (Search + Results), Asset Gallery, Metadata Grid, Object Detail Drawer, and Global Search with advanced filter builder.
Canonical data model: Patients → Scan Sessions → Scan Files with bridge tables for many-to-many project membership, eliminating file duplication across projects
Automated validation pipeline: Filename regex, required tag enforcement, type/modality consistency checks, duplicate prevention (unique S3 keys + checksums), disease normalization against a controlled reference list
JSONB metadata model: Flexible, queryable storage for modality-specific attributes (OCT, OCTA, Visual Fields, Fundus) without schema churn.
Role-based access control: Four roles (Admin, Data Manager, Researcher/Collaborator, Principal Investigator) with tag-conditional and prefix-scoped S3 policies
De-identified patient discovery: Pseudonymous PatientKey for search/discovery with policy-gated re-identification workflow and full audit logging
Collaboration suite: Saved queries, snapshot and dynamic cohorts, project-scoped sharing, reproducible exports (manifests, metadata bundles, presigned links).
Dashboard: KPI tiles, recent activity, data quality alerts, quick actions, saved queries, cohort management
Quarantine workflow: Non-conforming uploads routed to /ingest/rejected/ with actionable error messages surfaced in the UI
AWS S3, AWS Lambda, AWS Glue, Amazon Athena, AWS Batch, AWS IAM / Identity Center (SSO), AWS CloudTrail, AWS CloudWatch, Amazon VPC Endpoints, PostgreSQL, DVC (Data Version Control), Git, JSONB (Postgres)
Pooled storage with logical project membership Files stored once in S3; project membership tracked via relational bridge tables (patient_project_map, session_project_map) rather than duplicating files across project prefixes dramatically reducing storage costs and eliminating sync issues.
Longitudinal disease modeling Diseases tracked via a normalized reference table with per-patient onset/resolution dates and per-eye laterality, enabling powerful temporal queries (e.g., "all right-eye scans within 90 days of NAION onset").
JSONB for modality-specific metadata A single flexible JSON column captures heterogeneous, evolving vendor- and modality-specific attributes while remaining queryable via Postgres and Athena JSON functions avoiding constant schema migrations.
De-identified discovery with controlled re-identification Deterministic pseudonymous keys (PatientKey) enable cross-project cohort building without exposing PHI, with a policy-gated, audit-logged workflow for authorized re-identification.
Inconsistent legacy naming across historical datasets Implemented strict filename regex validation with server-side enforcement, plus migration scripts and a mapping table for known exceptions to accommodate historical data during onboarding.
Large proprietary file formats (e.g., .E2E, .BAK) Stored raw files as-is to preserve fidelity; built lightweight thumbnail/preview generation via Lambda/Batch; documented conversion tools for researchers needing extracted data.
Tag sprawl and metadata misuse Enforced enumerated tag values per field with UI drop-downs and server-side validation; disease values must match a normalized reference list (no free-text); controlled admin process for adding new values.
PHI compliance and de-identification Separated pseudonymous identifiers (patient_uuid) from original identifiers (hashed MRN); made re-identification a policy-driven, role-gated workflow with immutable audit logs; UI defaults to de-identified PatientKey display.
Cost management at scale (10+ TB, 10M+ objects) Implemented S3 Intelligent-Tiering for unknown access patterns, lifecycle rules to Glacier Flexible Retrieval after 90 days for raw data, partitioned Athena tables, and monthly budget alerts.
Evolving modality-specific metadata requirements Adopted a JSONB-based metadata model that captures heterogeneous attributes without schema migrations, with the ability to promote frequently queried keys into typed columns as patterns emerge.
There were iterative pivots in the specification v1.1 expanded scope to include information sharing, data discovery, and results search; v1.2 aligned the model to an ERD-based canonical approach with pooled storage and stricter validation; v1.3 added the Principal Investigator role based on stakeholder feedback.
After implementation, Mount Sinai's ophthalmology research teams gained a single, unified platform replacing fragmented, siloed storage — enabling cross-project data discovery, standardized metadata, and collaborative cohort building for the first time.
Measurable results and operational improvements:
Improved efficiency : Automated validation and metadata extraction on every upload eliminated manual tagging and quality checks
Reduced processing time: Researchers can query across all modalities, projects, and disease states via Athena in seconds rather than manually searching through disparate file systems
Operational improvements: Quarantine workflow catches non-conforming uploads immediately with actionable error messages, reducing data quality remediation cycles
Increased engagement: Dashboard with KPI tiles, saved queries, shared cohorts, and one-click exports lowered the barrier to data exploration and collaboration
Revenue impact: Pooled storage architecture (files stored once, logical project membership) reduced S3 storage costs by eliminating cross-project file duplication; lifecycle policies further optimized long-term costs
Measurable results and operational improvements:
Cross-project overlap detection by de-identified patient key
Longitudinal disease-window queries (e.g., NAION right-eye scans within +/-90 days of onset)
Tag-based access policy changes effective within 15 minutes
Full audit trail of all data access, sharing, and export actions via CloudTrail
Client feedback is pending formal collection