Overview

Medication reconciliation accuracy across 200 patients · 5 models · 4 strategies

Total Patients

200

4000 total runs

Mean F1 (all)

0.680

across all models & strategies

Best Model

Llama 3.3 70B

by mean F1 across strategies

Best Strategy

Clinical Narrative

by mean F1 across models

890 parse failures · counted as F1 = 0 in all metrics
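A minimal sketch of how counting parse failures as F1 = 0 affects a mean, assuming F1 is computed per run as a set overlap between predicted and reference medication names. The dashboard does not state its exact matching rule, so `f1_score`, `mean_f1`, and the both-lists-empty convention are hypothetical and for illustration only.

```python
from typing import Optional, Sequence, Set, Tuple


def f1_score(predicted: Set[str], reference: Set[str]) -> float:
    """Set-based F1 between predicted and reference medication names."""
    if not predicted and not reference:
        return 1.0  # assumption: two empty lists count as a perfect match
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


def mean_f1(runs: Sequence[Tuple[Optional[Set[str]], Set[str]]]) -> float:
    """Mean F1 over runs; a prediction of None marks a parse failure and scores 0."""
    scores = [0.0 if pred is None else f1_score(pred, ref) for pred, ref in runs]
    return sum(scores) / len(scores)


# Three illustrative runs: exact match, parse failure, partial match.
example_runs = [
    ({"metformin", "lisinopril"}, {"metformin", "lisinopril"}),
    (None, {"metformin"}),
    ({"metformin"}, {"metformin", "lisinopril"}),
]
example_mean = mean_f1(example_runs)
```

Under this convention a failed run lowers the mean instead of being dropped from it, which is why a model whose output rarely parses can score close to 0.000 overall.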

F1 Score — Model × Strategy Heatmap

Mean F1 across all patients · cells with fewer runs marked —

Model	Raw JSON	Table	Narrative	Timeline
Phi-3.5-mini (3.8B)	0.636	0.683	0.701	0.644
Mistral 7B v0.3	0.725	0.875	0.915	0.859
BioMistral 7B	0.000	0.000	0.000	0.000
Llama 3.1 8B	0.918	0.925	0.947	0.923
Llama 3.3 70B	0.996	0.987	0.985	0.874

F1 scale: ≥0.75 · 0.60–0.75 · 0.45–0.60 · <0.45

Model Rankings

Rank	Model	F1	Precision	Recall	Failures
1	Llama 3.3 70B	0.960	0.963	0.959	7
2	Llama 3.1 8B	0.928	0.961	0.917	3
3	Mistral 7B v0.3	0.843	0.937	0.804	22
4	Phi-3.5-mini (3.8B)	0.666	0.768	0.619	67
5	BioMistral 7B	0.000	0.000	0.000	791

Strategy Rankings

Rank	Strategy	F1	Precision	Recall	Avg time
1	Strategy C — Clinical Narrative	0.710	0.726	0.700	12.1s
2	Strategy B — Markdown Table	0.694	0.744	0.673	12.3s
3	Strategy D — Timeline	0.660	0.716	0.635	10.7s
4	Strategy A — Raw FHIR JSON	0.655	0.716	0.632	25.0s
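
The F1 columns in the two rankings appear to be the row and column means of the heatmap. The sketch below recomputes them from the rounded heatmap cells, assuming every cell covers the same number of runs. The variable names are illustrative, and results can differ from the dashboard in the third decimal because the inputs are already rounded.

```python
# Heatmap cells copied from the dashboard; columns follow the order in `strategies`.
strategies = ["Raw JSON", "Table", "Narrative", "Timeline"]
heatmap = {
    "Phi-3.5-mini (3.8B)": [0.636, 0.683, 0.701, 0.644],
    "Mistral 7B v0.3": [0.725, 0.875, 0.915, 0.859],
    "BioMistral 7B": [0.000, 0.000, 0.000, 0.000],
    "Llama 3.1 8B": [0.918, 0.925, 0.947, 0.923],
    "Llama 3.3 70B": [0.996, 0.987, 0.985, 0.874],
}

# Row means: one F1 per model, averaged across strategies.
model_mean = {model: sum(row) / len(row) for model, row in heatmap.items()}

# Column means: one F1 per strategy, averaged across models.
strategy_mean = {
    name: sum(row[i] for row in heatmap.values()) / len(heatmap)
    for i, name in enumerate(strategies)
}

# Grand mean over all 20 cells.
overall_mean = sum(sum(row) for row in heatmap.values()) / (
    len(heatmap) * len(strategies)
)

best_model = max(model_mean, key=model_mean.get)
best_strategy = max(strategy_mean, key=strategy_mean.get)

print(best_model, round(model_mean[best_model], 3))
print(best_strategy, round(strategy_mean[best_strategy], 3))
print("overall", round(overall_mean, 3))
```

Because BioMistral 7B contributes 0.000 to every column, each strategy mean sits well below the means of the four models that parse reliably.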