Run Comparison (A/B)

Compare two experiment runs side by side to identify trade-offs.

Metric                     A          B          Delta (A – B)
Field Accuracy             94.2%      92.1%      +2.1 pp
Classification Accuracy    97.8%      96.5%      +1.3 pp
Avg. Cost/Message          €0.0182    €0.0156    +€0.0026
Avg. Latency               1245 ms    1380 ms    -135 ms
p95 Latency                2890 ms    3120 ms    -230 ms
Perfect Orders             67.8%      64.2%      +3.6 pp
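
The summary deltas are A minus B, so a positive value is good for accuracy metrics but bad for cost and latency. A minimal Python sketch of that reading follows; the `higher_is_better` flags and the `compare` helper are illustrative assumptions and not part of the dashboard.

```python
# Values copied from the summary cards above.
# The higher_is_better flags and rounding precision are assumptions for illustration.
METRICS = {
    # name: (run_a, run_b, higher_is_better, decimals)
    "Field Accuracy (%)": (94.2, 92.1, True, 1),
    "Classification Accuracy (%)": (97.8, 96.5, True, 1),
    "Avg. Cost/Message (EUR)": (0.0182, 0.0156, False, 4),
    "Avg. Latency (ms)": (1245, 1380, False, 0),
    "p95 Latency (ms)": (2890, 3120, False, 0),
    "Perfect Orders (%)": (67.8, 64.2, True, 1),
}


def compare(run_a, run_b, higher_is_better, decimals):
    """Return (delta, winner), where delta = A - B rounded to `decimals`."""
    delta = round(run_a - run_b, decimals)
    if delta == 0:
        return delta, "tie"
    # A wins when the sign of the delta matches the metric's preferred direction.
    a_wins = (delta > 0) == higher_is_better
    return delta, "A" if a_wins else "B"


RESULTS = {name: compare(*values) for name, values in METRICS.items()}

for name, (delta, winner) in RESULTS.items():
    print(f"{name}: {delta:+g} -> better run: {winner}")
```

Under these assumptions Run A comes out ahead on accuracy, perfect orders, and latency, while Run B is cheaper per message, which is the trade-off this page is meant to surface.
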

Metric Radar – Run A vs Run B

Multi-dimensional comparison across key metrics.

[Radar chart. Axes: Field Accuracy, Class. Accuracy, Cost Efficiency, Order Accuracy, Latency Score, Error Rate (inv). Series: Run A, Run B (Baseline).]


Per-Field Accuracy Delta

For each field, is Run A better or worse than Run B? Delta = A – B; positive values mean Run A is better, negative values mean Run B is better.

Field                Run A    Run B    Delta (A – B)
pickup_date          96.8%    94.2%    +2.6 pp
delivery_address     88.7%    86.9%    +1.8 pp
pickup_address       89.3%    87.6%    +1.7 pp
delivery_date        95.2%    93.8%    +1.4 pp
weight               98.4%    97.1%    +1.3 pp
reference            94.5%    93.2%    +1.3 pp
volume               92.1%    91.5%    +0.6 pp
goods_description    91.2%    90.8%    +0.4 pp
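
The per-field deltas above can be reproduced and ranked with a few lines of Python. This is a sketch using the accuracy values from this page; the `field_deltas` helper is a hypothetical name, and rounding to one decimal is an assumption chosen to match the displayed precision.

```python
# Per-field accuracy in percent, copied from this page: field -> (run_a, run_b).
FIELD_ACCURACY = {
    "pickup_date": (96.8, 94.2),
    "delivery_address": (88.7, 86.9),
    "pickup_address": (89.3, 87.6),
    "delivery_date": (95.2, 93.8),
    "weight": (98.4, 97.1),
    "reference": (94.5, 93.2),
    "volume": (92.1, 91.5),
    "goods_description": (91.2, 90.8),
}


def field_deltas(accuracy):
    """Return (field, A - B in percentage points) pairs, largest gain first."""
    deltas = [(field, round(a - b, 1)) for field, (a, b) in accuracy.items()]
    # sorted() is stable, so fields with equal deltas keep their input order.
    return sorted(deltas, key=lambda item: item[1], reverse=True)


DELTAS = field_deltas(FIELD_ACCURACY)

for field, delta in DELTAS:
    side = "Run A better" if delta > 0 else "Run B better" if delta < 0 else "tie"
    print(f"{field:<20}{delta:+.1f} pp  ({side})")
```

Rounding before sorting avoids floating-point noise (for example 96.8 - 94.2 is not exactly 2.6 in binary floating point) changing the order of fields that display the same delta.
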

Cost Comparison Over Messages

Does one run consistently have lower cost?

[Chart. Series: Run A, Run B (Baseline).]


Latency Comparison Over Messages

Does one run consistently have lower latency?

[Chart. Series: Run A, Run B (Baseline).]
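
One way to put a number on "consistently lower" for either the cost or the latency chart is the share of messages on which Run A's value is below Run B's. The sketch below assumes both runs processed the same messages in the same order; the `share_a_lower` helper and the sample latencies are hypothetical and are not taken from the dashboard.

```python
def share_a_lower(series_a, series_b):
    """Fraction of paired messages where Run A's value is strictly lower than Run B's."""
    if not series_a or len(series_a) != len(series_b):
        raise ValueError("both runs need the same, non-zero number of messages")
    wins = sum(a < b for a, b in zip(series_a, series_b))
    return wins / len(series_a)


# Hypothetical per-message latencies in ms (illustrative only, not dashboard data).
latency_a = [1200, 1310, 1190, 1405, 1120]
latency_b = [1350, 1290, 1400, 1500, 1360]

SHARE = share_a_lower(latency_a, latency_b)
print(f"Run A is faster on {SHARE:.0%} of messages")
```

A share close to 1.0 suggests that Run A is lower on nearly every message, whereas a share near 0.5 with a lower average points to a few outliers driving the difference.
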