THE PROBLEM

Ophthalmology datasets (fundus, OCT, OCTA, visual fields, photos, HRF, oximeter, demographics/clinical) were fragmented across multiple projects, each with inconsistent naming conventions, storage locations, and metadata standards. There was no centralized, standardized home for the data.

Solving this problem was critical because researchers and clinicians needed the ability to securely ingest, version, search, and analyze imaging and clinical data at scale. Without a unified system, cross-project research, cohort building, and longitudinal disease analysis were impractical or impossible.

The previous state lacked predictable file organization, enforced metadata tagging, automated validation, versioning, and role-based access control. Files were scattered across siloed storage with no consistent naming, no catalog, and no way to query across projects or modalities.

Researchers, clinicians, data managers, principal investigators, and administrative staff across Mount Sinai's ophthalmology department were affected — anyone needing to discover, access, or analyze imaging and clinical data.

Yes, significant technical challenges included handling large proprietary file formats (e.g., .E2E, .BAK), enforcing PHI/PII compliance and de-identification, managing inconsistent legacy naming across historical datasets, preventing tag sprawl, controlling cloud storage costs, and supporting overlapping patient/project membership without duplicating files.

Project Goals & Objectives

The main objective was to build a single AWS S3-based data warehouse with automated ingestion, validation, metadata extraction, cataloging, versioning, role-based access control, and audit trails providing researchers and clinicians a secure, standardized, queryable platform for ophthalmology data.

Success looked like: researchers can upload data via web UI or CLI and have it automatically validated, tagged, and cataloged; data managers can query metadata via Athena; admins can manage granular access without touching bucket-wide policies; all actions are auditable; and cross-project cohort building is possible without exposing PHI.

Secondary goals included:

Streamlining data ingestion with drag-and-drop web UI and CLI upload workflows

Enabling collaboration through shared queries, cohorts, and reproducible exports

Supporting de-identified patient discovery and cross-project overlap detection

Providing a dashboard with KPI tiles, data quality alerts, and quick actions

Automating batch processing (thumbnails, format conversions)

Controlling cloud costs through intelligent tiering and lifecycle policies

Secondary goals included:

Support for 10+ TB and 10M+ objects with graceful degradation

Tag-based policy changes effective within 15 minutes

RPO of 24 hours; RTO of 4 hours for the control plane

Validation and metadata extraction on every upload (zero untagged objects in production)

Paginated search results with defined SLAs on query response times

Our Approach & Strategy

The team took a spec-driven, iterative approach starting with a comprehensive technical specification (v1.0) that defined the full scope, then refining through multiple revisions (v1.0 through v1.3) as requirements evolved around information sharing, data discovery, identity resolution, and ERD-based canonical modeling.

Phase 1

Discovery & Requirements

Collaborated with Mount Sinai stakeholders to identify pain points, data modalities, user roles, compliance requirements, and existing workflows.

Phase 2

Architecture & Data Modeling

Designed a canonical relational model (patients → scan sessions → scan files) with bridge tables for many-to-many project membership, JSONB for flexible modality-specific metadata, and a pooled storage strategy to eliminate file duplication.

Phase 3

Specification & Validation Rules

Defined strict filename regex, required tags, server-side validation checks, duplicate prevention, disease normalization, and quarantine flows for non-conforming uploads.

Phase 4

Infrastructure & Pipeline Development

Built the AWS infrastructure (S3, Lambda, Glue, Athena, IAM/Identity Center, CloudTrail, VPC endpoints) and the automated ingestion/validation/cataloging pipeline.

Phase 5

Web UI & Collaboration Features

Developed the web interface including dashboard, project explorer, asset gallery, metadata grid, search, cohort builder, and sharing/export workflows.

Phase 6

Testing & Iteration

Validated against user stories and acceptance criteria; iterated on the spec based on feedback (e.g., adding PI role, cohort sharing, de-identified discovery).

Decisions were informed by:

Clinical guidance from ophthalmology SMEs (modality-specific metadata requirements)

PHI/HIPAA compliance requirements and Mount Sinai governance policies

The need to support heterogeneous and evolving modality fields (driving the JSONB design)

Real query patterns (e.g., "NAION right-eye scans within +/-90 days of onset") shaping the data model and indexing strategy

Constraints and limitations included:

Strict PHI/PII handling requirements limiting how patient identifiers could be stored and displayed

Large proprietary file formats (e.g., Heidelberg .E2E, Optovue .BAK) requiring raw storage with lightweight previews rather than full conversion

Inconsistent legacy naming conventions across historical datasets requiring migration scripts and mapping tables

Budget sensitivity requiring intelligent tiering, Glacier lifecycle policies, and monthly cost alerts

EXECUTIONS AND DELIVERABLES

What exactly did we deliver?

Backend Development Automated ingestion pipeline (S3 event triggers → Lambda → validation, metadata extraction, optional AWS Batch processing), canonical relational database (Postgres with ERD-based schema), Glue crawlers, Athena query layer, DVC+Git versioning integration.

Frontend Development Web UI with six primary views: Dashboard/Home, Project Explorer (Search + Results), Asset Gallery, Metadata Grid, Object Detail Drawer, and Global Search with advanced filter builder.

Strategy Full technical specification document (v1.0–v1.3) defining architecture, data model, user roles, validation rules, metadata standards, query patterns, NFRs, and risk mitigations.

Key features, systems, or deliverables

Canonical data model: Patients → Scan Sessions → Scan Files with bridge tables for many-to-many project membership, eliminating file duplication across projects

Automated validation pipeline: Filename regex, required tag enforcement, type/modality consistency checks, duplicate prevention (unique S3 keys + checksums), disease normalization against a controlled reference list

JSONB metadata model: Flexible, queryable storage for modality-specific attributes (OCT, OCTA, Visual Fields, Fundus) without schema churn.

Role-based access control: Four roles (Admin, Data Manager, Researcher/Collaborator, Principal Investigator) with tag-conditional and prefix-scoped S3 policies

De-identified patient discovery: Pseudonymous PatientKey for search/discovery with policy-gated re-identification workflow and full audit logging

Collaboration suite: Saved queries, snapshot and dynamic cohorts, project-scoped sharing, reproducible exports (manifests, metadata bundles, presigned links).

Dashboard: KPI tiles, recent activity, data quality alerts, quick actions, saved queries, cohort management

Quarantine workflow: Non-conforming uploads routed to /ingest/rejected/ with actionable error messages surfaced in the UI

Technologies and tools

AWS S3, AWS Lambda, AWS Glue, Amazon Athena, AWS Batch, AWS IAM / Identity Center (SSO), AWS CloudTrail, AWS CloudWatch, Amazon VPC Endpoints, PostgreSQL, DVC (Data Version Control), Git, JSONB (Postgres)

Standout creative or technical solutions

Pooled storage with logical project membership Files stored once in S3; project membership tracked via relational bridge tables (patient_project_map, session_project_map) rather than duplicating files across project prefixes dramatically reducing storage costs and eliminating sync issues.

Longitudinal disease modeling Diseases tracked via a normalized reference table with per-patient onset/resolution dates and per-eye laterality, enabling powerful temporal queries (e.g., "all right-eye scans within 90 days of NAION onset").

JSONB for modality-specific metadata A single flexible JSON column captures heterogeneous, evolving vendor- and modality-specific attributes while remaining queryable via Postgres and Athena JSON functions avoiding constant schema migrations.

De-identified discovery with controlled re-identification Deterministic pseudonymous keys (PatientKey) enable cross-project cohort building without exposing PHI, with a policy-gated, audit-logged workflow for authorized re-identification.

Challenges & Solutions

Inconsistent legacy naming across historical datasets Implemented strict filename regex validation with server-side enforcement, plus migration scripts and a mapping table for known exceptions to accommodate historical data during onboarding.

Large proprietary file formats (e.g., .E2E, .BAK) Stored raw files as-is to preserve fidelity; built lightweight thumbnail/preview generation via Lambda/Batch; documented conversion tools for researchers needing extracted data.

Tag sprawl and metadata misuse Enforced enumerated tag values per field with UI drop-downs and server-side validation; disease values must match a normalized reference list (no free-text); controlled admin process for adding new values.

PHI compliance and de-identification Separated pseudonymous identifiers (patient_uuid) from original identifiers (hashed MRN); made re-identification a policy-driven, role-gated workflow with immutable audit logs; UI defaults to de-identified PatientKey display.

Cost management at scale (10+ TB, 10M+ objects) Implemented S3 Intelligent-Tiering for unknown access patterns, lifecycle rules to Glacier Flexible Retrieval after 90 days for raw data, partitioned Athena tables, and monthly budget alerts.

Evolving modality-specific metadata requirements Adopted a JSONB-based metadata model that captures heterogeneous attributes without schema migrations, with the ability to promote frequently queried keys into typed columns as patterns emerge.

There were iterative pivots in the specification v1.1 expanded scope to include information sharing, data discovery, and results search; v1.2 aligned the model to an ERD-based canonical approach with pooled storage and stricter validation; v1.3 added the Principal Investigator role based on stakeholder feedback.

RESULTS & IMPACT

After implementation, Mount Sinai's ophthalmology research teams gained a single, unified platform replacing fragmented, siloed storage — enabling cross-project data discovery, standardized metadata, and collaborative cohort building for the first time.

Measurable results and operational improvements:

Improved efficiency : Automated validation and metadata extraction on every upload eliminated manual tagging and quality checks

Reduced processing time: Researchers can query across all modalities, projects, and disease states via Athena in seconds rather than manually searching through disparate file systems

Operational improvements: Quarantine workflow catches non-conforming uploads immediately with actionable error messages, reducing data quality remediation cycles

Increased engagement: Dashboard with KPI tiles, saved queries, shared cohorts, and one-click exports lowered the barrier to data exploration and collaboration

Revenue impact: Pooled storage architecture (files stored once, logical project membership) reduced S3 storage costs by eliminating cross-project file duplication; lifecycle policies further optimized long-term costs

Measurable results and operational improvements:

Cross-project overlap detection by de-identified patient key

Longitudinal disease-window queries (e.g., NAION right-eye scans within +/-90 days of onset)

Tag-based access policy changes effective within 15 minutes

Full audit trail of all data access, sharing, and export actions via CloudTrail

Client feedback is pending formal collection