Stem Cell Donor Registries: Two-Stage Matching Architecture

Overview

This documentation describes a two-stage matching engine for stem cell donor registries. It is designed to identify compatible donors for patients quickly by pairing high-recall semantic retrieval with the strict precision that medical HLA matching requires.

Technical Architecture

1. Data Ingestion (Text Space Pattern)

Because the server-side PostgreSQL disk volume is exhausted (psycopg.errors.DiskFull), this vertical uses a manual Text Space ingestion pattern for initial validation.

Corpus: Natural-language donor profiles and patient urgency rows.
Segmentation: Granular point-level analysis.
Topic Categorization: Enabled via GLM 5 for unsupervised grouping.

2. Stage 1: Semantic Proximity (Recall)

The system uses unsupervised embeddings of the clinical descriptions to surface donors whose profiles are contextually close to a patient's.

Signal Validation: Patient rows that use urgency language (e.g., "CRITICAL", "URGENT") cluster next to high-priority donor neighborhoods.
Findings: In this validation, urgency wording in the text translated into geometric proximity in the embedding space, which allows rapid triage of time-sensitive requests.
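
The recall step can be pictured as a nearest-neighbor search by cosine similarity. The sketch below is illustrative only: it substitutes a toy bag-of-words vector for the real embedding model, which this document does not name, and the profile texts, IDs, and the embed, cosine, and rank_donors helpers are all hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    cleaned = text.lower().replace(",", " ").replace(".", " ")
    return Counter(cleaned.split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_donors(patient_text, donors, top_k=2):
    """Return the IDs of the top_k donors closest to the patient text."""
    query = embed(patient_text)
    scored = [(cosine(query, embed(text)), donor_id)
              for donor_id, text in donors.items()]
    # Highest similarity first; ties broken by donor ID for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [donor_id for _, donor_id in scored[:top_k]]


# Hypothetical donor profiles, invented for this example.
DONORS = {
    "D1": "High-priority donor, available for urgent critical requests",
    "D2": "Donor registered for routine scheduling, long lead time",
    "D3": "High-priority donor cleared for urgent collection",
}

PATIENT = "CRITICAL patient, URGENT transplant request"
candidates = rank_donors(PATIENT, DONORS, top_k=2)
print(candidates)
```

With these invented texts, the two donors whose profiles share urgency vocabulary with the patient row rank ahead of the routine donor. A real embedding model would capture similarity beyond exact word overlap, but the ranking step would look much the same.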

3. Stage 2: Relational Constraints (Precision)

Because stem cell matching requires exact biological compatibility, Composer AI applies a relational constraint layer over the semantic results.

Compatibility Logic: Validates HLA markers and Rh-factor compatibility.
Validation Case: The layer separates biologically compatible donors from those that are only semantically similar (e.g., it filters out donors that closely match a patient's clinical urgency language but have an incompatible blood type).
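
A minimal sketch of such a constraint layer follows. It is not Composer AI's actual logic, which this document does not show. The Profile type, the list of loci, the exact-equality HLA rule, and the Rh rule (an Rh-negative patient accepts only Rh-negative donors) are all assumptions made for illustration, and the typings are invented.

```python
from dataclasses import dataclass

# Assumed set of loci to compare; the source does not list them.
HLA_LOCI = ("A", "B", "C", "DRB1")


@dataclass
class Profile:
    profile_id: str
    hla: dict          # locus -> frozenset of allele names
    rh_positive: bool


def hla_matches(patient, donor):
    """Assumed rule: the allele set must be identical at every locus."""
    return all(
        patient.hla.get(locus) is not None
        and patient.hla.get(locus) == donor.hla.get(locus)
        for locus in HLA_LOCI
    )


def rh_compatible(patient, donor):
    """Assumed rule: Rh-negative patients accept only Rh-negative donors."""
    return patient.rh_positive or not donor.rh_positive


def apply_constraints(patient, candidates):
    """Keep only Stage 1 candidates that pass both hard constraints."""
    return [d for d in candidates
            if hla_matches(patient, d) and rh_compatible(patient, d)]


# Invented typings, used only to exercise the filter.
BASE_HLA = {
    "A": frozenset({"A*01:01", "A*02:01"}),
    "B": frozenset({"B*07:02", "B*08:01"}),
    "C": frozenset({"C*07:01", "C*07:02"}),
    "DRB1": frozenset({"DRB1*03:01", "DRB1*15:01"}),
}
MISMATCHED_HLA = {**BASE_HLA, "B": frozenset({"B*07:02", "B*44:02"})}

patient = Profile("P1", BASE_HLA, rh_positive=False)
stage1_candidates = [
    Profile("D1", BASE_HLA, rh_positive=False),        # passes both checks
    Profile("D2", BASE_HLA, rh_positive=True),         # fails the Rh rule
    Profile("D3", MISMATCHED_HLA, rh_positive=False),  # fails at locus B
]

matched = [d.profile_id for d in apply_constraints(patient, stage1_candidates)]
print(matched)
```

Running the filter after retrieval rather than before keeps Stage 1 free to optimize for recall, while every candidate that reaches the final list has passed the hard checks.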

Infrastructure Notes

Primary Blocker: Server-side PostgreSQL disk volume exhaustion.
Resolution: Manual Data Bridge via a hardcoded dictionary in Mantis Coding notebooks.
Phase 2 Migration: Scaling to 30,000+ row registries via a local WSL2 stack.
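
The Manual Data Bridge might look roughly like the sketch below: registry rows live in a hardcoded dictionary inside the notebook and are rendered as natural-language lines for manual ingestion, so no database write is needed. The row fields, the IDs, and the row_to_text and build_corpus helpers are hypothetical, as is the assumption that the ingestion target takes one plain-text line per row.

```python
# Hypothetical rows standing in for the registry tables that could not be
# written to PostgreSQL.
REGISTRY_ROWS = {
    "D1": {"kind": "donor", "priority": "high", "rh": "negative",
           "availability": "immediate"},
    "P1": {"kind": "patient", "urgency": "CRITICAL", "rh": "negative"},
}


def row_to_text(row_id, row):
    """Render one row as a single natural-language line."""
    parts = [f"{row['kind'].capitalize()} {row_id}"]
    parts.extend(f"{key}: {value}"
                 for key, value in row.items() if key != "kind")
    return "; ".join(parts)


def build_corpus(rows):
    """Turn the hardcoded dictionary into a list of text lines."""
    return [row_to_text(row_id, row) for row_id, row in rows.items()]


corpus = build_corpus(REGISTRY_ROWS)
for line in corpus:
    print(line)
```

A dictionary of this kind is workable for a small validation set, but it would not scale to the 30,000+ rows planned for Phase 2.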
