About this Research
FHIR Medication Reconciliation — Serialisation Strategy Benchmark
The Problem
Medication reconciliation is the clinical process of producing an accurate current medication list from a patient's health records. It is one of the most error-prone steps in care — when patients are transferred or discharged, an incorrect list can cause drug interactions, missed doses, or wrong prescriptions.
Large language models can read months or years of medication records and produce a summarised current list far faster than a human can. But two failure modes matter critically:
Hallucination
The model lists a medication that is not in the source data. This failure can be caught on manual review.
Omission — the more dangerous failure
The model misses an active medication. The omission may be carried forward silently, causing harm.
The Research Hypothesis
We hypothesise that both failure modes are driven less by which model you use than by how the source data is formatted before it is given to the model.
FHIR R4 — the international standard for electronic health records — stores data as deeply nested JSON with numeric codes from medical ontologies (RxNorm, SNOMED, LOINC). This is not a format language models were trained to reason over naturally. We believe the serialisation step is a dominant variable that has not been systematically studied.
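For readers unfamiliar with the format, here is a trimmed sketch of a single FHIR R4 MedicationRequest resource. The values are illustrative, and in a real record this object sits nested inside a bundle alongside hundreds of other resources:

```json
{
  "resourceType": "MedicationRequest",
  "status": "active",
  "intent": "order",
  "medicationCodeableConcept": {
    "coding": [{
      "system": "http://www.nlm.nih.gov/research/umls/rxnorm",
      "code": "197361",
      "display": "Amlodipine 5 MG Oral Tablet"
    }]
  },
  "subject": { "reference": "Patient/example" },
  "authoredOn": "2019-03-14"
}
```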
Core Research Questions
- Q1: Does serialisation strategy matter more than model size? A 3.8B model with a well-formatted input might outperform a 70B model given raw FHIR JSON.
- Q2: Which format is safest for omission? Recall — not missing active medications — is the priority metric in a clinical setting.
- Q3: Does biomedical domain pretraining help? BioMistral was trained on medical literature. Does that give it an advantage over a general-purpose model of the same size?
- Q4: At what history length do models start failing? Patients with many years of records accumulate hundreds of medication entries. Does recall degrade with history length?
The Dataset
All experiments use synthetic patient data generated by Synthea, an open-source patient simulator developed by MITRE Corporation. No real patient data is used at any point.
Because the data is generated, the ground truth is known exactly: model accuracy can be measured directly, with no human annotation required.
| Patients | Min. history | Population | Ground truth |
| --- | --- | --- | --- |
| 200+ | 3 years | Elderly adults | `status == active` |
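As a minimal sketch of how that ground truth can be extracted, assuming Synthea's standard FHIR R4 bundle layout (the function name and the choice to match on display names are ours, not necessarily the study's):

```python
import json

def active_medications(bundle_path: str) -> set[str]:
    """Return the display names of all MedicationRequest resources
    whose status is 'active' in a FHIR R4 bundle."""
    with open(bundle_path) as f:
        bundle = json.load(f)
    active = set()
    for entry in bundle.get("entry", []):
        resource = entry.get("resource", {})
        if resource.get("resourceType") != "MedicationRequest":
            continue
        if resource.get("status") != "active":
            continue
        concept = resource.get("medicationCodeableConcept", {})
        for coding in concept.get("coding", []):
            if "display" in coding:
                active.add(coding["display"])
    return active
```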
The Four Serialisation Strategies
The central experimental variable is how each patient's FHIR R4 JSON bundle is converted into text before being sent to the model. All four strategies are given the same instruction prompt — only the input format changes.
The prompt (same for all strategies)
"You are a clinical assistant performing medication reconciliation. You will be given a patient's medication history. Your task is to identify all medications that are currently ACTIVE for this patient. Return your answer as a JSON array of medication names exactly as they appear in the data. Return nothing else — no explanation, no prose, just the JSON array."
The Models
Five open-source models are evaluated, running on an AWS GPU instance with 48 GB VRAM. The selection spans a practical deployment spectrum — from a 3.8B model any clinic can run on a consumer GPU, to a 70B model requiring a large workstation. One domain-specialised model (BioMistral) is included to test whether biomedical pretraining gives an advantage over general-purpose models of the same size.
How We Measure Success
For each of the 20 model × strategy combinations, we run all 200 patients and compute three core metrics: precision, recall, and F1.
A run is marked as a parse failure when the model does not return a valid JSON array. These runs are counted as F1 = precision = recall = 0 in all aggregations, consistent with the paper. Raw responses are visible in the response inspector.
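A minimal sketch of the per-run scoring consistent with that rule, assuming exact string matching on medication names (the study's matching rule may be more forgiving; `score_run` is a hypothetical helper):

```python
import json

def score_run(response: str, truth: set[str]) -> dict:
    """Score one model response against the ground-truth set of active
    medication names. Responses that are not a valid JSON array count
    as parse failures and score zero on all three metrics."""
    try:
        parsed = json.loads(response)
    except json.JSONDecodeError:
        parsed = None
    if not isinstance(parsed, list):
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0, "parse_failure": True}

    predicted = {str(name) for name in parsed}
    matched = predicted & truth
    precision = len(matched) / len(predicted) if predicted else 0.0
    recall = len(matched) / len(truth) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "parse_failure": False}
```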
Navigating This Dashboard
Every result from every run — 200 patients × 5 models × 4 strategies = 4,000 runs — is loaded into the dashboard. Here is what each section gives you.
- Overview: Start here. Four headline numbers — total runs, overall mean F1, best model, and best strategy. Below that, a colour-coded heatmap shows F1 for every model × strategy combination at a glance, followed by a ranked summary table.
- Model tabs: One tab per model, smallest to largest. Each tab shows KPI cards, a grouped precision/recall/F1 bar chart across strategies, an F1 distribution histogram, a per-strategy breakdown table, and a full patient-level results table with per-strategy scores and links to individual drill-downs.
- Strategy tabs: One tab per strategy. Each tab shows distribution stats (mean, median, IQR), an F1 histogram, a per-model breakdown table, a mean inference time chart, and a patient-level table with per-model scores.
- Patient browser: Search and filter all 200 patients by ID or name. The table shows each patient's active medication count and mean F1. Click any row to open the patient drill-down.
- Patient drill-down: The core inspection view. A model × strategy grid shows F1, precision, recall, and inference time for every run on this patient. Click any cell to load a side-by-side diff of the ground truth medications versus the model output — green for matched, red for missed or hallucinated (the set logic behind this diff is sketched below). Three tabs below the grid let you inspect the raw model response, the full prompt that was sent, and the serialised input the model actually saw.
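The diff colouring maps directly onto set operations over medication names; a minimal sketch (function name hypothetical):

```python
def diff_medications(truth: set[str], predicted: set[str]) -> dict[str, set[str]]:
    """Categorise medications for the side-by-side diff view:
    green = matched, red = missed (omissions) or hallucinated."""
    return {
        "matched": truth & predicted,       # in both ground truth and output
        "missed": truth - predicted,        # omissions: in truth, absent from output
        "hallucinated": predicted - truth,  # in output, absent from truth
    }
```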