
Run Comparison (A/B)
Compare two experiment runs side by side to identify trade-offs.
Metric                    Run A      Run B      Delta (A – B)
Field Accuracy            94.2%      92.1%      +2.1 pp
Classification Accuracy   97.8%      96.5%      +1.3 pp
Avg. Cost/Message         €0.0182    €0.0156    +€0.0026
Avg. Latency              1245 ms    1380 ms    -135 ms
p95 Latency               2890 ms    3120 ms    -230 ms
Perfect Orders            67.8%      64.2%      +3.6 pp
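Every summary delta is computed as A – B, so the sign reads differently per metric: a positive delta favors Run A for accuracy, but a positive cost or latency delta favors Run B. The Python sketch below makes that reading explicit. The `HIGHER_IS_BETTER` mapping and the `verdict` helper are illustrative assumptions, not part of the dashboard.

```python
# Summary values copied from the comparison above: metric -> (Run A, Run B).
SUMMARY = {
    "Field Accuracy (%)": (94.2, 92.1),
    "Classification Accuracy (%)": (97.8, 96.5),
    "Avg. Cost/Message (EUR)": (0.0182, 0.0156),
    "Avg. Latency (ms)": (1245.0, 1380.0),
    "p95 Latency (ms)": (2890.0, 3120.0),
    "Perfect Orders (%)": (67.8, 64.2),
}

# Assumption: accuracy-style metrics should go up; cost and latency should go down.
HIGHER_IS_BETTER = {
    "Field Accuracy (%)": True,
    "Classification Accuracy (%)": True,
    "Avg. Cost/Message (EUR)": False,
    "Avg. Latency (ms)": False,
    "p95 Latency (ms)": False,
    "Perfect Orders (%)": True,
}


def verdict(metric: str) -> str:
    """Return "A", "B", or "tie" depending on which run is better for the metric."""
    a, b = SUMMARY[metric]
    delta = a - b
    if delta == 0:
        return "tie"
    # A wins when the sign of (A - B) matches the metric's preferred direction.
    a_wins = (delta > 0) == HIGHER_IS_BETTER[metric]
    return "A" if a_wins else "B"


for metric, (a, b) in SUMMARY.items():
    print(f"{metric}: delta {a - b:+.4g}, better run: {verdict(metric)}")
```

Under these assumptions Run A is better on every summary metric except average cost per message, which is the trade-off this comparison surfaces.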
Metric Radar – Run A vs Run B
Multi-dimensional comparison across key metrics.
[Radar chart. Legend: Run A, Run B (Baseline)]
Per-Field Accuracy Delta
For each field, is Run A better or worse than Run B? (A – B)
Field              Run A     Run B     Delta (A – B)
pickup_date        96.8%     94.2%     +2.6 pp
delivery_address   88.7%     86.9%     +1.8 pp
pickup_address     89.3%     87.6%     +1.7 pp
delivery_date      95.2%     93.8%     +1.4 pp
weight             98.4%     97.1%     +1.3 pp
reference          94.5%     93.2%     +1.3 pp
volume             92.1%     91.5%     +0.6 pp
goods_description  91.2%     90.8%     +0.4 pp
[Bar chart axis: negative deltas (left) mean Run B is better; positive deltas (right) mean Run A is better]
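The per-field ranking can be reproduced from the two accuracy columns. The sketch below computes A – B in percentage points and sorts with the largest gain for Run A first. The function name `field_deltas` is a hypothetical helper, and rounding to one decimal is an assumption made to match the precision shown.

```python
# Per-field accuracy in percent, copied from the list above: field -> (Run A, Run B).
FIELD_ACCURACY = {
    "pickup_date": (96.8, 94.2),
    "delivery_address": (88.7, 86.9),
    "pickup_address": (89.3, 87.6),
    "delivery_date": (95.2, 93.8),
    "weight": (98.4, 97.1),
    "reference": (94.5, 93.2),
    "volume": (92.1, 91.5),
    "goods_description": (91.2, 90.8),
}


def field_deltas(accuracy):
    """Return (field, delta_pp) pairs, largest gain for Run A first."""
    # Round to one decimal to remove floating-point noise such as 2.5999999.
    deltas = [(field, round(a - b, 1)) for field, (a, b) in accuracy.items()]
    # sorted() is stable, so fields with equal deltas keep their input order.
    return sorted(deltas, key=lambda item: item[1], reverse=True)


for field, delta in field_deltas(FIELD_ACCURACY):
    print(f"{field}: {delta:+.1f} pp")
```

All eight deltas are positive, so Run A is more accurate than Run B on every field; the margin is narrowest for `goods_description` and `volume`.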
Cost Comparison Over Messages
Does one run consistently have lower cost?
[Line chart. Legend: Run A, Run B (Baseline)]
Latency Comparison Over Messages
Does one run consistently have lower latency?
[Line chart. Legend: Run A, Run B (Baseline)]
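The question of whether one run is consistently lower, for cost or for latency, can also be answered numerically rather than by eye. One simple option, sketched below, is the fraction of messages on which Run A's value is strictly below Run B's. This assumes both runs processed the same messages in the same order; the helper `lower_rate` and the sample values are hypothetical and are not taken from the dashboard.

```python
from typing import Sequence


def lower_rate(a: Sequence[float], b: Sequence[float]) -> float:
    """Fraction of paired messages on which Run A's value is strictly lower than Run B's."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty series of equal length")
    wins = sum(1 for x, y in zip(a, b) if x < y)
    return wins / len(a)


# Hypothetical per-message latencies in ms, for illustration only.
latency_a = [1200.0, 1310.0, 1180.0, 1405.0]
latency_b = [1350.0, 1290.0, 1400.0, 1500.0]

# Run A is lower on 3 of the 4 paired messages.
print(lower_rate(latency_a, latency_b))  # 0.75
```

A rate close to 1.0 or 0.0 points to a consistent difference between the runs, while a rate near 0.5 suggests that the gap in the averages comes from a small number of messages.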