
Johnny Rice

AI orchestration, data engineering, and scientific informatics

Multi-agent systems, evaluation harnesses, guardrails, pipelines, and product-facing data tools

Massachusetts

Background

The most effective AI is almost invisible. I connect complex data sources, make them queryable, and deliver fast, accurate answers inside the everyday tools people already know.

ClariTrial is a complete example of this principle. It pulls from ClinicalTrials.gov, PubMed, ChEMBL, UniProt, WHO ICTRP, and AACT (a Postgres mirror of the US trial registry), then layers on agentic AI with a graph-based orchestration engine that can compose live queries across those sources in a single conversation. A source health monitor tracks each external API with a circuit-breaker pattern (three consecutive failures in an hour mark a source as down; auto-recovers after one hour). The same underlying data powers the search UI, company pipeline views, competitive landscape pages, an Excel add-in that puts trial intelligence into spreadsheet formulas, and a real-time meeting tool where data-aware agents map what people say to the trial and molecule data the system already knows about.

Before this, the same pattern showed up in informatics roles: Python ETL against vendor APIs, Postgres schemas for compound/assay/batch data, R and Shiny dashboards for scientists (same idea: meet researchers in a tool they trust), Airflow for orchestration, AWS for hosting. The difference now is that the front end is a full product, the data sources are public, and the AI layer replaces some of what used to be manual ad-hoc SQL sessions. The engineering habits are the same: normalize carefully, treat APIs as contracts, cache aggressively, and keep the path from raw records to what the user sees explicit enough that you can debug it six months later.

Skills (representative)

What actually shows up in the commits, not an aspirational list.

Python · SQL · Postgres · AWS · Data pipelines / ETL · APIs · TypeScript · React / Next.js · Scientific informatics · Data modeling · Automation · CI/CD · Real-time audio (Deepgram) · AI orchestration (graph-based) · Multi-agent systems · Evaluation harness (LLM-as-judge) · Guardrails / prompt guard · Confidence scoring · Reflection loops · Circuit-breaker resilience · Entity memory / knowledge graph · AES-256 encryption / HMAC · Vercel AI SDK · Prompt engineering · Observability / structured tracing · Cache hierarchy design

What ClariTrial does

The core product is a clinical trial intelligence tool. You can search 470K+ trials, browse company pipelines, compare trials side by side, and ask an AI chat questions that hit live databases instead of a static knowledge cutoff. The AI orchestration layer uses a lead model that delegates to specialist agents (trial discovery, evidence synthesis, deep dive, comparison, and CDD Vault agents), each scoped to a specific data source with its own tool call budget, step limit, and reflection loops. A dynamic orchestration engine classifies queries as simple, moderate, complex, or debate-worthy, builds a dependency graph, and dispatches agents in topological order with parallel execution where dependencies allow. Every tool result carries provenance metadata (source, timestamp, row counts), and the full trace is visible in the UI: which tools ran, per-tool latency, specialist model IDs, tool budget usage, and whether a specialist was retried after a reflection failure.
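As a sketch of that provenance envelope (the type names, field list, and sources below are illustrative, not the production schema):

```ts
// Illustrative shape for a tool result carrying provenance metadata.
// Field and type names are assumptions for the sketch.
interface Provenance {
  source: "clinicaltrials.gov" | "pubmed" | "chembl" | "uniprot" | "who-ictrp" | "aact";
  fetchedAt: string;   // ISO timestamp of the live API call
  rowCount: number;    // how many records the tool returned
  latencyMs: number;   // per-tool latency surfaced in the trace panel
}

interface ToolResult<T> {
  tool: string;        // e.g. "searchTrials", "getTargetBioactivity"
  data: T[];
  provenance: Provenance;
}

// Wrapping every specialist tool call in one helper keeps the bookkeeping
// out of the individual agents.
async function withProvenance<T>(
  tool: string,
  source: Provenance["source"],
  run: () => Promise<T[]>
): Promise<ToolResult<T>> {
  const started = Date.now();
  const data = await run();
  return {
    tool,
    data,
    provenance: {
      source,
      fetchedAt: new Date().toISOString(),
      rowCount: data.length,
      latencyMs: Date.now() - started,
    },
  };
}
```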

Data access is governed by layered guardrails. A prompt guard runs moderation and relevance checks before the model sees user input. AACT queries use either allowlisted SQL presets (four fixed modality slices) or parameterized flexible queries where all user-supplied values are bound as $N parameters. An injection detection layer scans filter values for suspicious SQL patterns before queries execute. Results go through a validation pass for empty sets, implausible counts, and missing fields. Every chat turn is logged to a JSONL audit file with prompt version, model ID, and the serialized tool trace.
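A sketch of the flexible-query path, assuming the node-postgres (pg) client and column names from the public AACT schema; the injection heuristics and the query itself are illustrative, not the production code:

```ts
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.AACT_DATABASE_URL });

// Illustrative injection heuristics run on filter values before any query executes.
const SUSPICIOUS_SQL = [/;\s*drop\s/i, /union\s+select/i, /--/, /\/\*/, /\bpg_sleep\b/i];

function assertSafeFilterValue(value: string): void {
  if (SUSPICIOUS_SQL.some((p) => p.test(value))) {
    throw new Error(`Rejected filter value: ${value}`);
  }
}

// Flexible query: every user-supplied value is bound as a $N parameter,
// never interpolated into the SQL text.
async function findCompetingTrials(condition: string, phase: string, limit = 50) {
  [condition, phase].forEach(assertSafeFilterValue);
  const sql = `
    SELECT s.nct_id, s.brief_title, s.overall_status, s.phase
    FROM studies s
    JOIN conditions c ON c.nct_id = s.nct_id
    WHERE c.downcase_name = LOWER($1)
      AND s.phase = $2
    ORDER BY s.study_first_submitted_date DESC
    LIMIT $3`;
  const { rows } = await pool.query(sql, [condition, phase, limit]);
  return rows;
}
```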

The part most relevant to drug discovery use cases is how many data types connect in one place. A single conversation can pull a trial from ClinicalTrials.gov, cross-reference the sponsor's pipeline from curated company data, check bioactivity on the drug target from ChEMBL, look up the protein structure from UniProt, and query AACT for competing trials in the same indication and phase. The same entities (companies, drugs, targets, conditions) persist in an entity memory layer (a lightweight knowledge graph in Postgres) so the system remembers what you've looked at across sessions. Cross-session context is injected into the system prompt when the user mentions previously seen entities.
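A sketch of the entity-memory write path; the table names, columns, and unique constraint are assumptions that show the shape of the idea:

```ts
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Illustrative entity-memory write: entities extracted from an agent response
// are upserted, and a relationship row links two entities.
interface ExtractedEntity {
  name: string;                                         // e.g. "pembrolizumab"
  kind: "company" | "drug" | "target" | "condition";
}

async function rememberEntity(userId: string, e: ExtractedEntity): Promise<number> {
  const { rows } = await pool.query(
    `INSERT INTO entity_memory (user_id, name, kind, last_seen_at)
     VALUES ($1, $2, $3, NOW())
     ON CONFLICT (user_id, name, kind)
     DO UPDATE SET last_seen_at = NOW()
     RETURNING id`,
    [userId, e.name, e.kind]
  );
  return rows[0].id;
}

async function relate(fromId: number, toId: number, relation: string): Promise<void> {
  await pool.query(
    `INSERT INTO entity_relationships (from_id, to_id, relation)
     VALUES ($1, $2, $3)
     ON CONFLICT DO NOTHING`,
    [fromId, toId, relation]
  );
}
```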

Meetings are the clearest example of zero-friction AI. Deepgram Nova-3 handles real-time speech-to-text over a browser WebSocket, with domain-specific keyword boosting (molecule names, gene targets, clinical terms) so the transcript is actually useful for a research discussion, not garbled. Three specialist agents (infrastructure architect, data scientist, cheminformatician) are data-aware and tool-aware: they listen to the conversation, recognize the molecules and targets being mentioned, and recommend relevant tools from a catalog of 70+ spanning AWS, cheminformatics, proteomics, and genomics. Nobody stops the discussion to search a database. A batch analysis loop runs every 20 seconds, extracting entities and mapping them to the trial and molecule data ClariTrial already indexes. The post-meeting summary ties everything together: what was discussed, which data points are relevant, and what to follow up on.
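A sketch of that loop; extractEntities, mapToKnownRecords, and pushInsight are hypothetical stand-ins for the real extraction and lookup code:

```ts
// Periodic transcript analysis: every 20 seconds, analyze only the segments
// that arrived since the last batch.
interface TranscriptSegment { speaker: string; text: string; at: number }

const BATCH_INTERVAL_MS = 20_000;

function startBatchAnalysis(
  getSegments: () => TranscriptSegment[],
  extractEntities: (text: string) => Promise<string[]>,
  mapToKnownRecords: (entities: string[]) => Promise<Record<string, unknown[]>>,
  pushInsight: (insight: Record<string, unknown[]>) => void
): () => void {
  let cursor = 0; // index of the first segment not yet analyzed

  const timer = setInterval(async () => {
    const segments = getSegments();
    const fresh = segments.slice(cursor);
    if (fresh.length === 0) return;
    cursor = segments.length;

    const text = fresh.map((s) => `${s.speaker}: ${s.text}`).join("\n");
    const entities = await extractEntities(text);   // molecules, targets, conditions
    if (entities.length === 0) return;

    // Map mentions to the trial and molecule data ClariTrial already indexes.
    const matches = await mapToKnownRecords(entities);
    pushInsight(matches);
  }, BATCH_INTERVAL_MS);

  return () => clearInterval(timer); // stop when the meeting ends
}
```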

There is also a mission flow (multi-agent task runner) where you can describe a research question, the system generates an execution plan, and specialist agents run in parallel via SSE streaming. Results are synthesized into structured reports with confidence scoring aggregated by source tier and recency. For contested questions, a multi-agent debate protocol runs three perspectives (optimistic, skeptical, balanced) through challenge rounds before a judge synthesizes the final assessment. An agent capability registry declares each specialist's capabilities, cost tier, and step budget so the orchestrator can plan execution graphs without hard-coding agent selection.
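A sketch of what such a registry looks like; the agent IDs, capability strings, cost tiers, and budgets are illustrative values, not the production configuration:

```ts
// Illustrative agent capability registry.
type CostTier = "low" | "medium" | "high";

interface AgentCapability {
  id: string;
  capabilities: string[];   // what the orchestrator can route to this agent
  costTier: CostTier;
  stepLimit: number;        // generation rounds per run
  toolBudget: number;       // total tool invocations across all steps
}

const registry: AgentCapability[] = [
  { id: "trial-discovery",    capabilities: ["find-trials", "filter-by-phase"],  costTier: "low",    stepLimit: 3, toolBudget: 4 },
  { id: "evidence-synthesis", capabilities: ["summarize-literature"],            costTier: "medium", stepLimit: 4, toolBudget: 6 },
  { id: "deep-dive",          capabilities: ["trial-detail", "protocol-review"], costTier: "high",   stepLimit: 5, toolBudget: 8 },
];

// The orchestrator selects agents by capability instead of hard-coding names.
function agentsFor(capability: string): AgentCapability[] {
  return registry.filter((a) => a.capabilities.includes(capability));
}
```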

The CDD Vault integration bridges internal discovery data (molecules, batches, assay protocols) with external clinical intelligence. Credentials are encrypted at rest with AES-256-GCM, permissions are role-based (five tiers from clinical analyst to vault admin), and the system scaffolds custom agent teams from a vault's schema. Scientists stay in their existing vault workflow; the AI reads the schema and adapts to it rather than forcing a new interface. Write operations require two-step approval with HMAC-signed tokens, and every write is audited to both an in-memory buffer and a Postgres table.
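A sketch of the two-step approval token using Node's built-in crypto module; the payload fields and the five-minute expiry are assumptions:

```ts
import { createHmac, timingSafeEqual } from "node:crypto";

// Step one issues an HMAC-signed token describing the pending write;
// step two verifies it before the write executes.
const SECRET = process.env.WRITE_APPROVAL_SECRET ?? "";

function signApproval(payload: { vaultId: string; action: string; issuedAt: number }): string {
  const body = Buffer.from(JSON.stringify(payload)).toString("base64url");
  const sig = createHmac("sha256", SECRET).update(body).digest("base64url");
  return `${body}.${sig}`;
}

function verifyApproval(token: string, maxAgeMs = 5 * 60_000): { vaultId: string; action: string } {
  const [body, sig] = token.split(".");
  if (!body || !sig) throw new Error("Malformed approval token");

  const expected = createHmac("sha256", SECRET).update(body).digest("base64url");
  const a = Buffer.from(sig);
  const b = Buffer.from(expected);
  if (a.length !== b.length || !timingSafeEqual(a, b)) {
    throw new Error("Invalid approval signature");
  }

  const payload = JSON.parse(Buffer.from(body, "base64url").toString());
  if (Date.now() - payload.issuedAt > maxAgeMs) {
    throw new Error("Approval token expired");
  }
  return { vaultId: payload.vaultId, action: payload.action };
}
```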

The Excel add-in is the same idea applied to the most familiar tool in any organization. Custom formulas like =CLARI.TRIAL() and =CLARI.ASK() put trial lookups, pipeline data, and natural-language AI queries directly into spreadsheet cells. A task pane lets users search trials and insert results without leaving Excel. The integration is nearly invisible: the same data layer that powers the chat, the meeting tool, and the mission runner also backs a set of spreadsheet formulas. End users get powerful access to multi-source clinical intelligence through a tool they already open every morning, with no new workflow to learn.
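A sketch of one such formula in the Office.js custom-functions model; the backend endpoint and response shape are hypothetical, and the CLARI namespace comes from the add-in manifest rather than the code:

```ts
/**
 * Look up a clinical trial field by NCT ID, e.g. =CLARI.TRIAL("NCT04267848", "phase").
 * Relies on the Office custom-functions runtime types.
 * @customfunction TRIAL
 * @param nctId Registry identifier of the trial.
 * @param field Which field to return (e.g. "title", "phase", "status").
 * @returns The requested field value.
 */
async function trial(nctId: string, field: string): Promise<string> {
  // Hypothetical backend endpoint backed by the same data layer as the chat.
  const res = await fetch(`https://example.invalid/api/trials/${encodeURIComponent(nctId)}`);
  if (!res.ok) {
    throw new Error(`Trial lookup failed: ${res.status}`); // surfaces as an error in the cell
  }
  const data: Record<string, string> = await res.json();
  return data[field] ?? "";
}

CustomFunctions.associate("TRIAL", trial);
```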

How the AI orchestration works

A lead model receives the user's question and delegates to scoped specialist agents. Each specialist has a defined tool set, a step limit (3-5 generation rounds), and a separate tool call budget (4-8 invocations) that caps total API calls across all steps. After each run, a reflection layer checks relevance, specificity, and data presence; low-confidence results trigger a single retry with diagnostic context. The trace panel in the UI shows per-tool latency, specialist model IDs, tool budget usage, and retry indicators. All of this machinery exists so the surfaces that face end users, whether a chat, a meeting transcript, or a spreadsheet cell, can stay simple. The diagram below is the live pattern.

How ClariTrial routes a question

One lead model, scoped specialist agents with tool budgets, live APIs, provenance on every result.

Your question → Lead assistant → Specialist + tools → Cited answer

Specialists call ClinicalTrials.gov, PubMed, OpenFDA, and WHO ICTRP with per-agent tool budgets and step limits. Every result carries provenance; the trace panel shows latency, model IDs, and budget usage.

Engineering patterns

The architecture behind ClariTrial, not just what it does but how the reliability, safety, and quality systems are built.

Graph-based orchestration

Query classifier (simple / moderate / complex / debate) builds a dependency graph. Agents dispatch in topological order with parallel execution where the graph allows.
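A sketch of layered dispatch over that graph; the node shape and runAgent signature are illustrative:

```ts
// Agents with no unmet dependencies run in parallel; each layer waits for
// the previous one before dispatching.
interface PlanNode { id: string; dependsOn: string[] }

async function dispatch(
  nodes: PlanNode[],
  runAgent: (id: string, upstream: Map<string, unknown>) => Promise<unknown>
): Promise<Map<string, unknown>> {
  const results = new Map<string, unknown>();
  const pending = new Set(nodes.map((n) => n.id));

  while (pending.size > 0) {
    // A node is ready once all of its dependencies have results.
    const ready = nodes.filter(
      (n) => pending.has(n.id) && n.dependsOn.every((d) => results.has(d))
    );
    if (ready.length === 0) throw new Error("Cycle detected in execution graph");

    // Run the whole layer in parallel.
    const layer = await Promise.all(ready.map((n) => runAgent(n.id, results)));
    ready.forEach((n, i) => {
      results.set(n.id, layer[i]);
      pending.delete(n.id);
    });
  }
  return results;
}
```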

Evaluation harness

23-query golden dataset, four evaluator types: router accuracy with confusion matrix, code-based specialist validators, LLM-as-judge with weighted rubric, and convergence scoring. Experiment runner saves timestamped results with delta comparison and regression detection.
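A sketch of the simplest evaluator, router accuracy with a confusion matrix; the golden-set shape here is an assumption:

```ts
// Compare predicted query classes against a golden dataset and tally a
// confusion matrix keyed by expected class, then predicted class.
type QueryClass = "simple" | "moderate" | "complex" | "debate";

interface GoldenCase { query: string; expected: QueryClass }

function routerAccuracy(
  golden: GoldenCase[],
  classify: (query: string) => QueryClass
): { accuracy: number; confusion: Record<string, Record<string, number>> } {
  const confusion: Record<string, Record<string, number>> = {};
  let correct = 0;

  for (const c of golden) {
    const predicted = classify(c.query);
    confusion[c.expected] ??= {};
    confusion[c.expected][predicted] = (confusion[c.expected][predicted] ?? 0) + 1;
    if (predicted === c.expected) correct++;
  }
  return { accuracy: correct / golden.length, confusion };
}
```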

Guardrails and prompt guard

Pre-model moderation and relevance filtering. SQL injection detection on all user-supplied filter values. Tool call budgets and step limits per specialist. Draft-analysis heuristic flags uncertain outputs.

Reflection loops

Post-run validation checks relevance, specificity, consistency, and data presence. Scores below 0.5 trigger a single retry with diagnostic context injected into the specialist prompt.
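A sketch of that single-retry loop; runSpecialist and reflect are hypothetical stand-ins, and only the 0.5 threshold comes from the behavior above:

```ts
// One specialist run, one reflection pass, and at most one retry with the
// reflection diagnosis injected as context.
interface SpecialistRun { answer: string; toolCallsUsed: number }
interface Reflection { score: number; diagnosis: string }

async function runWithReflection(
  runSpecialist: (diagnosticContext?: string) => Promise<SpecialistRun>,
  reflect: (run: SpecialistRun) => Promise<Reflection>
): Promise<{ run: SpecialistRun; retried: boolean }> {
  const first = await runSpecialist();
  const check = await reflect(first); // relevance, specificity, data presence

  if (check.score >= 0.5) {
    return { run: first, retried: false };
  }

  const second = await runSpecialist(
    `Previous attempt scored ${check.score.toFixed(2)}: ${check.diagnosis}`
  );
  return { run: second, retried: true };
}
```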

Confidence scoring

Aggregated by source tier (registry > journal > curated > model) and recency. Surfaced in mission reports and debate synthesis so readers can weight claims.
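One way this aggregation can look; the tier weights and the roughly one-year recency decay are illustrative assumptions, not the production constants:

```ts
// Weight evidence by source tier, discount by age, and average across
// supporting items to get a single confidence value in [0, 1].
const TIER_WEIGHT = {
  registry: 1.0,   // e.g. ClinicalTrials.gov / AACT
  journal: 0.85,   // e.g. PubMed
  curated: 0.7,    // e.g. curated company pipeline data
  model: 0.5,      // model inference with no external citation
} as const;

interface Evidence { tier: keyof typeof TIER_WEIGHT; ageDays: number }

function confidence(evidence: Evidence[]): number {
  if (evidence.length === 0) return 0;
  const scores = evidence.map((e) => {
    const recency = Math.exp(-e.ageDays / 365); // decays over roughly a year
    return TIER_WEIGHT[e.tier] * recency;
  });
  return Math.min(1, scores.reduce((a, b) => a + b, 0) / scores.length);
}
```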

Circuit-breaker resilience

Source health monitor tracks success/failure/rate-limit per external API. Three consecutive failures in one hour marks a source as down. Auto-recovers after one hour. State injected into the system prompt so the lead agent acknowledges gaps.
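A sketch of the breaker state machine; the state shape is illustrative, while the three-failure threshold and one-hour window mirror the behavior described above:

```ts
// Per-source circuit breaker: three consecutive failures within an hour mark
// a source as down; it auto-recovers after an hour.
const FAILURE_THRESHOLD = 3;
const WINDOW_MS = 60 * 60_000;

interface SourceHealth { consecutiveFailures: number; firstFailureAt: number; downSince: number | null }

const health = new Map<string, SourceHealth>();

function recordResult(source: string, ok: boolean): void {
  const s = health.get(source) ?? { consecutiveFailures: 0, firstFailureAt: 0, downSince: null };
  if (ok) {
    health.set(source, { consecutiveFailures: 0, firstFailureAt: 0, downSince: null });
    return;
  }
  if (s.consecutiveFailures === 0 || Date.now() - s.firstFailureAt > WINDOW_MS) {
    // Start a new failure window.
    s.consecutiveFailures = 1;
    s.firstFailureAt = Date.now();
  } else {
    s.consecutiveFailures++;
  }
  if (s.consecutiveFailures >= FAILURE_THRESHOLD) s.downSince = Date.now();
  health.set(source, s);
}

function isDown(source: string): boolean {
  const s = health.get(source);
  if (!s?.downSince) return false;
  // Auto-recover after one hour.
  if (Date.now() - s.downSince > WINDOW_MS) {
    health.set(source, { consecutiveFailures: 0, firstFailureAt: 0, downSince: null });
    return false;
  }
  return true;
}
```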

Entity memory

Lightweight knowledge graph in Postgres. Entities, facts, and relationships extracted from agent responses via heuristic NER. Cross-session context injected into the system prompt when the user mentions previously seen entities.

Cache hierarchy

Three layers: per-user first-turn replay from thread DB, shared single-turn response cache in Postgres, then model. Cached responses tagged in message metadata so the trace panel can indicate cache hits.
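A sketch of the three-layer lookup; the helper signatures are hypothetical:

```ts
// Check the per-user first-turn replay, then the shared response cache,
// and only then call the model. The layer that answered is returned so the
// trace panel can mark cache hits.
async function answer(
  userId: string,
  question: string,
  replayFirstTurn: (userId: string, question: string) => Promise<string | null>,
  sharedCache: (question: string) => Promise<string | null>,
  callModel: (question: string) => Promise<string>
): Promise<{ text: string; cacheLayer: "replay" | "shared" | "none" }> {
  const replayed = await replayFirstTurn(userId, question);
  if (replayed) return { text: replayed, cacheLayer: "replay" };

  const cached = await sharedCache(question);
  if (cached) return { text: cached, cacheLayer: "shared" };

  const fresh = await callModel(question);
  return { text: fresh, cacheLayer: "none" };
}
```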

Structured tracing and observability

Every chat request and specialist run creates a trace with parent-child spans recording latency, model, I/O, and status. In-memory ring buffer (1,000 traces), JSONL file export, and optional Arize Phoenix collector.

Quality monitoring

Parses audit JSONL for feedback rates, step count distributions (p50/p95), prompt version and model breakdowns. Configurable threshold alerts (warning/critical) for degradation detection.