About this Research
FHIR Medication Reconciliation — Serialisation Strategy Benchmark
The Problem
Medication reconciliation is the clinical process of producing an accurate current medication list from a patient's health records. It is one of the most error-prone steps in care — when patients are transferred or discharged, an incorrect list can cause drug interactions, missed doses, or wrong prescriptions.
Large language models can read months or years of medication records and produce a summarised current list far faster than a human can. But two failure modes matter critically:
Hallucination
The model lists a medication that is not in the source data. This failure can be caught on manual review.
Omission — the more dangerous failure
The model misses an active medication. The omission may be carried forward silently, causing harm.
The Research Hypothesis
We hypothesise that both failure modes are driven less by which model you use than by how the source data is formatted before it is given to the model.
FHIR R4 — the international standard for electronic health records — stores data as deeply nested JSON with numeric codes from medical ontologies (RxNorm, SNOMED, LOINC). This is not a format language models were trained to reason over naturally. We believe the serialisation step is a dominant variable that has not been systematically studied.
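For readers unfamiliar with the format, here is a trimmed sketch of a single FHIR R4 MedicationRequest resource. The values are illustrative, and in a real record this object sits nested inside a bundle alongside hundreds of other resources:

```json
{
  "resourceType": "MedicationRequest",
  "status": "active",
  "intent": "order",
  "medicationCodeableConcept": {
    "coding": [{
      "system": "http://www.nlm.nih.gov/research/umls/rxnorm",
      "code": "197361",
      "display": "Amlodipine 5 MG Oral Tablet"
    }]
  },
  "subject": { "reference": "Patient/example" },
  "authoredOn": "2019-03-14"
}
```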
Core Research Questions
- Q1: Does serialisation strategy matter more than model size? A 3.8B model with a well-formatted input might outperform a 70B model given raw FHIR JSON.
- Q2: Which format is safest for omission? Recall — not missing active medications — is the priority metric in a clinical setting.
- Q3: Does biomedical domain pretraining help? BioMistral was trained on medical literature. Does that give it an advantage over a general-purpose model of the same size?
- Q4: At what history length do models start failing? Patients with many years of records accumulate hundreds of medication entries. Does recall degrade with history length?
The Dataset
All experiments use synthetic patient data generated by Synthea, an open-source patient simulator developed by MITRE Corporation. No real patient data is used at any point.
Because the data is generated, the ground truth is known exactly: model accuracy can be measured directly, with no human annotation required.
| Patients | Min. history | Population | Ground truth |
| --- | --- | --- | --- |
| 200+ | 3 years | Elderly adults | `status == active` |
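As a minimal sketch of how that ground truth can be extracted, assuming Synthea's standard FHIR R4 bundle layout (the function name and the choice to match on display names are ours, not necessarily the study's):

```python
import json

def active_medications(bundle_path: str) -> set[str]:
    """Return the display names of all MedicationRequest resources
    whose status is 'active' in a FHIR R4 bundle."""
    with open(bundle_path) as f:
        bundle = json.load(f)
    active = set()
    for entry in bundle.get("entry", []):
        resource = entry.get("resource", {})
        if resource.get("resourceType") != "MedicationRequest":
            continue
        if resource.get("status") != "active":
            continue
        concept = resource.get("medicationCodeableConcept", {})
        for coding in concept.get("coding", []):
            if "display" in coding:
                active.add(coding["display"])
    return active
```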
The Four Serialisation Strategies
The central experimental variable is how each patient's FHIR R4 JSON bundle is converted into text before being sent to the model. All four strategies are given the same instruction prompt — only the input format changes.
The prompt (same for all strategies)
"You are a clinical assistant performing medication reconciliation. You will be given a patient's medication history. Your task is to identify all medications that are currently ACTIVE for this patient. Return your answer as a JSON array of medication names exactly as they appear in the data. Return nothing else — no explanation, no prose, just the JSON array."
The Models
Five open-source models are evaluated, running on an AWS GPU instance with 48 GB VRAM. The selection spans a practical deployment spectrum — from a 3.8B model any clinic can run on a consumer GPU, to a 70B model requiring a large workstation. One domain-specialised model (BioMistral) is included to test whether biomedical pretraining gives an advantage over general-purpose models of the same size.
How We Measure Success
For each of the 20 model × strategy combinations, we run all 200 patients and compute three core metrics: precision, recall, and F1.
A run is marked as a parse failure when the model does not return a valid JSON array. These runs are counted as F1 = precision = recall = 0 in all aggregations, consistent with the paper. Raw responses are visible in the response inspector.
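A minimal sketch of the per-run scoring consistent with that rule, assuming exact string matching on medication names (the study's matching rule may be more forgiving; `score_run` is a hypothetical helper):

```python
import json

def score_run(response: str, truth: set[str]) -> dict:
    """Score one model response against the ground-truth set of active
    medication names. Responses that are not a valid JSON array count
    as parse failures and score zero on all three metrics."""
    try:
        parsed = json.loads(response)
    except json.JSONDecodeError:
        parsed = None
    if not isinstance(parsed, list):
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0, "parse_failure": True}

    predicted = {str(name) for name in parsed}
    matched = predicted & truth
    precision = len(matched) / len(predicted) if predicted else 0.0
    recall = len(matched) / len(truth) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "parse_failure": False}
```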
Navigating This Dashboard
Every result from every run — 200 patients × 5 models × 4 strategies = 4,000 runs — is loaded into the dashboard. Here is what each section gives you.
- Overview: Start here. Four headline numbers — total runs, overall mean F1, best model, and best strategy. Below that, a colour-coded heatmap shows F1 for every model × strategy combination at a glance, followed by a ranked summary table.
- Model tabs: One tab per model, smallest to largest. Each tab shows KPI cards, a grouped precision/recall/F1 bar chart across strategies, an F1 distribution histogram, a per-strategy breakdown table, and a full patient-level results table with per-strategy scores and links to individual drill-downs.
- Strategy tabs: One tab per strategy. Each tab shows distribution stats (mean, median, IQR), an F1 histogram, a per-model breakdown table, a mean inference time chart, and a patient-level table with per-model scores.
- Patient browser: Search and filter all 200 patients by ID or name. The table shows each patient's active medication count and mean F1. Click any row to open the patient drill-down.
- Patient drill-down: The core inspection view. A model × strategy grid shows F1, precision, recall, and inference time for every run on this patient. Click any cell to load a side-by-side diff of the ground truth medications versus the model output — green for matched, red for missed or hallucinated (the set logic behind this diff is sketched below). Three tabs below the grid let you inspect the raw model response, the full prompt that was sent, and the serialised input the model actually saw.
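The diff colouring maps directly onto set operations over medication names; a minimal sketch (function name hypothetical):

```python
def diff_medications(truth: set[str], predicted: set[str]) -> dict[str, set[str]]:
    """Categorise medications for the side-by-side diff view:
    green = matched, red = missed (omissions) or hallucinated."""
    return {
        "matched": truth & predicted,       # in both ground truth and output
        "missed": truth - predicted,        # omissions: in truth, absent from output
        "hallucinated": predicted - truth,  # in output, absent from truth
    }
```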