
Test Runs
Compare model configurations, prompts, and pipelines before deploying to production.
| Metric | Value | Detail |
|---|---|---|
| Total test runs | 7 | experiments; +3 this week |
| Completed runs | 4 | finished experiments; 71% success rate |
| Avg. field accuracy | 91.5% | across completed runs; +2.1 pp vs. baseline |
| Best performer | 94.2% | Claude 3.5 Sonnet; top run |
All Test Runs
Overview of all experiment runs with key metrics.
| Run Name | Run ID | Created | Pipeline | Messages | Field Acc. | Class. Acc. | Avg. Cost | Status |
|---|---|---|---|---|---|---|---|---|
| Claude 3.5 Sonnet - Prompt v2.1 | run-001 | Nov 28, 14:30 | Both | 1,250 | 94.2% | 97.8% | €0.0182 | Completed |
| GPT-4o Mini - Cost Optimization | run-002 | Nov 27, 10:15 | Extraction | 1,250 | 91.5% | 95.2% | €0.0098 | Completed |
| Claude 3 Haiku - Speed Test | run-003 | Nov 26, 16:45 | Classification | 1,250 | 88.3% | 93.1% | €0.0045 | Completed |
| Production Baseline | run-004 | Nov 25, 09:00 | Both | 1,250 | 92.1% | 96.5% | €0.0156 | Completed |
| GPT-4o - Accuracy Focus | run-005 | Nov 29, 08:30 | Both | 847 | — | — | — | Running |
| Gemini 1.5 Pro - Experiment | run-006 | Nov 29, 11:00 | Extraction | 0 | — | — | — | Pending |
| Claude 3 Opus - Failed Config | run-007 | Nov 24, 15:20 | Both | 156 | — | — | €0.0892 | Failed |
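
The overview figures above can be derived directly from the per-run records in this table. The sketch below is only an illustration, not the dashboard's actual code: the `name`, `status`, and `field_accuracy` field names are hypothetical, and it assumes the "+2.1 pp vs. baseline" delta compares the best run against the Production Baseline run.

```python
# Minimal sketch: derive the overview metrics from per-run records.
# Field names are hypothetical; values are taken from the runs table above.

runs = [
    {"name": "Claude 3.5 Sonnet - Prompt v2.1", "status": "Completed", "field_accuracy": 94.2},
    {"name": "GPT-4o Mini - Cost Optimization", "status": "Completed", "field_accuracy": 91.5},
    {"name": "Claude 3 Haiku - Speed Test",     "status": "Completed", "field_accuracy": 88.3},
    {"name": "Production Baseline",             "status": "Completed", "field_accuracy": 92.1},
    {"name": "GPT-4o - Accuracy Focus",         "status": "Running",   "field_accuracy": None},
    {"name": "Gemini 1.5 Pro - Experiment",     "status": "Pending",   "field_accuracy": None},
    {"name": "Claude 3 Opus - Failed Config",   "status": "Failed",    "field_accuracy": None},
]

completed = [r for r in runs if r["status"] == "Completed"]
avg_field_acc = sum(r["field_accuracy"] for r in completed) / len(completed)  # 91.5%
best = max(completed, key=lambda r: r["field_accuracy"])                      # 94.2%, Claude 3.5 Sonnet
baseline = next(r for r in runs if r["name"] == "Production Baseline")
delta_pp = best["field_accuracy"] - baseline["field_accuracy"]                # +2.1 pp (assumed comparison)

print(f"Total test runs: {len(runs)}")
print(f"Completed runs:  {len(completed)}")
print(f"Avg. field accuracy: {avg_field_acc:.1f}%")
print(f"Best performer: {best['field_accuracy']}% ({best['name']}), {delta_pp:+.1f} pp vs. baseline")
```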
Accuracy vs Cost per Run
Which runs give high accuracy at low cost? Chart of Field Accuracy (%) against Avg. Cost (normalized to a 0–100 scale).
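
The dashboard does not state how costs are rescaled to 0–100. A common choice is min-max normalization over the completed runs' average costs; the sketch below assumes that approach, using the euro values from the runs table.

```python
# Min-max normalization of avg. cost per run onto a 0-100 scale
# (assumed scaling; the dashboard's exact method is not specified).

costs = {              # avg. cost per message for completed runs, EUR
    "run-001": 0.0182,
    "run-002": 0.0098,
    "run-003": 0.0045,
    "run-004": 0.0156,
}

lo, hi = min(costs.values()), max(costs.values())
normalized = {run_id: 100 * (cost - lo) / (hi - lo) for run_id, cost in costs.items()}

for run_id, value in sorted(normalized.items()):
    print(f"{run_id}: {value:5.1f}")   # cheapest run maps to 0.0, most expensive to 100.0
```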
Run Status Over Time
How frequently are experiments running and succeeding? Chart of run counts over time, broken down by status: Completed, Failed, Running.
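
One straightforward way to build this view is to bucket runs by creation day and count statuses per day. The sketch below does that for the three series shown (Completed, Failed, Running), using the creation dates from the runs table.

```python
from collections import Counter

# (creation day, status) pairs taken from the runs table above
runs = [
    ("Nov 24", "Failed"),
    ("Nov 25", "Completed"),
    ("Nov 26", "Completed"),
    ("Nov 27", "Completed"),
    ("Nov 28", "Completed"),
    ("Nov 29", "Running"),
    ("Nov 29", "Pending"),
]

# Count runs per day for the statuses plotted as chart series
series = {"Completed", "Failed", "Running"}
per_day = Counter((day, status) for day, status in runs if status in series)

for (day, status), count in sorted(per_day.items()):
    print(f"{day}: {count} {status.lower()}")
```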