FHIR MedRecon

Overview

Medication reconciliation accuracy across 200 patients · 5 models · 4 strategies

Total Patients: 200 (4000 total runs = 200 patients × 5 models × 4 strategies)
Mean F1 (all): 0.680 across all models & strategies
Best Model: Llama 3.3 70B, by mean F1 across strategies
Best Strategy: Clinical Narrative, by mean F1 across models
Note: 890 parse failures, counted as F1 = 0 in all metrics.
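Because failed parses are scored as zero rather than dropped, they pull every mean down. A minimal sketch of that aggregation rule, assuming a per-run score list where `None` marks a run whose model output could not be parsed (a hypothetical representation; the benchmark's internal format is not shown here):

```python
from statistics import mean

def mean_f1(scores):
    """Mean F1 over all runs; a run with an unparseable model
    output (represented as None) contributes a score of 0.0
    instead of being excluded from the denominator."""
    return mean(0.0 if s is None else s for s in scores)

# Three successful runs and one parse failure:
print(mean_f1([0.8, 1.0, 0.6, None]))  # ≈ 0.6, not 0.8
```

Excluding the failed run instead would report 0.8, so the zero-fill convention materially penalizes failure-prone models such as BioMistral 7B here.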
F1 Score — Model × Strategy Heatmap

Mean F1 across all patients · cells with fewer runs are marked "—"

Model                  Raw JSON   Table   Narrative   Timeline
Phi-3.5-mini (3.8B)       0.636   0.683       0.701      0.644
Mistral 7B v0.3           0.725   0.875       0.915      0.859
BioMistral 7B             0.000   0.000       0.000      0.000
Llama 3.1 8B              0.918   0.925       0.947      0.923
Llama 3.3 70B             0.996   0.987       0.985      0.874
F1 scale: ≥ 0.75 · 0.60–0.75 · 0.45–0.60 · < 0.45
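The benchmark's exact matching rules are not shown here. A minimal sketch, assuming each cell's F1 is computed per patient as the set overlap between the extracted and reference medication lists (name-only matching is an assumption; the real scorer may also compare dose, route, or status fields):

```python
def reconciliation_f1(predicted, gold):
    """Per-patient F1 between a predicted and a reference medication
    list, matching on case-insensitive medication names only.
    Hypothetical scoring helper, not the benchmark's actual code."""
    pred = {m.strip().lower() for m in predicted}
    ref = {m.strip().lower() for m in gold}
    if not pred and not ref:
        return 1.0  # both lists empty: perfect agreement
    tp = len(pred & ref)  # true positives: medications in both lists
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(ref)
    return 2 * precision * recall / (precision + recall)

print(reconciliation_f1(
    ["Lisinopril", "Metformin", "Aspirin"],
    ["lisinopril", "metformin", "warfarin"],
))  # ≈ 0.667 (2 of 3 predicted, 2 of 3 reference)
```

Under this convention, a model that emits no parseable list at all gets precision and recall of zero, which is consistent with the F1 = 0 rule for parse failures noted above.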
Model Rankings

Rank   Model                  Mean F1   Failures
1      Llama 3.3 70B            0.960          7
2      Llama 3.1 8B             0.928          3
3      Mistral 7B v0.3          0.843         22
4      Phi-3.5-mini (3.8B)      0.666         67
5      BioMistral 7B            0.000        791
Strategy Rankings

Rank   Strategy                           Mean F1   Avg time
1      Strategy C — Clinical Narrative      0.710      12.1s
2      Strategy B — Markdown Table          0.694      12.3s
3      Strategy D — Timeline                0.660      10.7s
4      Strategy A — Raw FHIR JSON           0.655      25.0s