Overview
Medication reconciliation accuracy across 200 patients · 5 models · 4 strategies
Total Patients
200
4000 total runs
Mean F1 (all)
0.680
across all models & strategies
Best Model
Llama 3.3 70B
by mean F1 across strategies
Best Strategy
Clinical Narrative
by mean F1 across models
890 parse failures · counted as F1 = 0 in all metrics
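The zero-scoring rule for parse failures can be illustrated with a minimal sketch. It assumes a set-based F1 over medication names, which this page does not confirm. The function names `f1_score` and `mean_f1` and the sample medication lists are hypothetical, and a parse failure is represented as `None`.

```python
from typing import Iterable, Optional, Set, Tuple


def f1_score(predicted: Optional[Set[str]], expected: Set[str]) -> float:
    """Set-based F1 for one run (assumed metric, not confirmed by the dashboard).

    A parse failure is passed as None and scores 0, mirroring the
    "counted as F1 = 0" rule stated above.
    """
    if predicted is None:
        return 0.0
    true_pos = len(predicted & expected)
    if true_pos == 0:
        # No overlap (including empty lists) scores 0 in this simplified sketch.
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(expected)
    return 2 * precision * recall / (precision + recall)


def mean_f1(runs: Iterable[Tuple[Optional[Set[str]], Set[str]]]) -> float:
    """Mean F1 over runs; failed runs stay in the denominator."""
    scores = [f1_score(pred, gold) for pred, gold in runs]
    return sum(scores) / len(scores)


# Hypothetical runs: an exact match, a partial match, and a parse failure.
runs = [
    ({"metformin", "lisinopril"}, {"metformin", "lisinopril"}),
    ({"metformin"}, {"metformin", "lisinopril"}),
    (None, {"atorvastatin"}),
]
print(round(mean_f1(runs), 3))
```

Because failed runs remain in the denominator, a model that rarely produces parseable output is pulled toward a mean F1 of 0 regardless of how good its parseable answers are.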
F1 Score — Model × Strategy Heatmap
Mean F1 across all patients · cells with fewer runs marked —
| Model | Raw JSON | Table | Narrative | Timeline |
|---|---|---|---|---|
| Phi-3.5-mini (3.8B) | 0.636 | 0.683 | 0.701 | 0.644 |
| Mistral 7B v0.3 | 0.725 | 0.875 | 0.915 | 0.859 |
| BioMistral 7B | 0.000 | 0.000 | 0.000 | 0.000 |
| Llama 3.1 8B | 0.918 | 0.925 | 0.947 | 0.923 |
| Llama 3.3 70B | 0.996 | 0.987 | 0.985 | 0.874 |
F1 scale: ≥0.75 · 0.60–0.75 · 0.45–0.60 · <0.45
Model Rankings
| Rank | Model | F1 | Precision | Recall | Failures |
|---|---|---|---|---|---|
| 1 | Llama 3.3 70B | 0.960 | 0.963 | 0.959 | 7 |
| 2 | Llama 3.1 8B | 0.928 | 0.961 | 0.917 | 3 |
| 3 | Mistral 7B v0.3 | 0.843 | 0.937 | 0.804 | 22 |
| 4 | Phi-3.5-mini (3.8B) | 0.666 | 0.768 | 0.619 | 67 |
| 5 | BioMistral 7B | 0.000 | 0.000 | 0.000 | 791 |
Strategy Rankings
| Rank | Strategy | F1 | Precision | Recall | Avg time |
|---|---|---|---|---|---|
| 1 | Strategy C: Clinical Narrative | 0.710 | 0.726 | 0.700 | 12.1s |
| 2 | Strategy B: Markdown Table | 0.694 | 0.744 | 0.673 | 12.3s |
| 3 | Strategy D: Timeline | 0.660 | 0.716 | 0.635 | 10.7s |
| 4 | Strategy A: Raw FHIR JSON | 0.655 | 0.716 | 0.632 | 25.0s |
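As a consistency check, the ranking F1 values and the headline mean can be recomputed from the heatmap cells. The sketch below assumes every cell covers the same number of runs (200 patients each, consistent with 4000 total runs), so an unweighted mean of cell means equals the mean over runs. The cell values are the rounded figures shown on this page, so results agree only to within rounding. Variable names are illustrative.

```python
# Heatmap cells as displayed, in column order: Raw JSON, Table, Narrative, Timeline.
strategies = ["Raw JSON", "Table", "Narrative", "Timeline"]
heatmap = {
    "Phi-3.5-mini (3.8B)": [0.636, 0.683, 0.701, 0.644],
    "Mistral 7B v0.3": [0.725, 0.875, 0.915, 0.859],
    "BioMistral 7B": [0.000, 0.000, 0.000, 0.000],
    "Llama 3.1 8B": [0.918, 0.925, 0.947, 0.923],
    "Llama 3.3 70B": [0.996, 0.987, 0.985, 0.874],
}

# Row means: mean F1 per model across strategies.
model_means = {model: sum(cells) / len(cells) for model, cells in heatmap.items()}

# Column means: mean F1 per strategy across models.
strategy_means = {
    name: sum(cells[i] for cells in heatmap.values()) / len(heatmap)
    for i, name in enumerate(strategies)
}

# Headline mean across all models and strategies (valid under equal cell sizes).
overall = sum(model_means.values()) / len(model_means)

best_model = max(model_means, key=model_means.get)
best_strategy = max(strategy_means, key=strategy_means.get)

for model, value in sorted(model_means.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {value:.3f}")
print(f"overall: {overall:.3f}, best model: {best_model}, best strategy: {best_strategy}")
```

One thing this makes visible: the per-strategy means include BioMistral 7B's zero row, so each strategy's F1 in the rankings is roughly one fifth lower than it would be over the four models that produced parseable output.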