
Test Runs

Compare model configurations, prompts, and pipelines before deploying to production.

- Total test runs: 7 experiments (+3 this week)
- Completed runs: 4 finished experiments (71% success rate)
- Avg. field accuracy: 91.5% across completed runs (+2.1 pp vs. baseline)
- Best performer: 94.2% (Claude 3.5 Sonnet, top run)
All Test Runs

Overview of all experiment runs with key metrics.

| Run Name | Created | Pipeline | Messages | Field Acc. | Class. Acc. | Avg. Cost (€) | Status |
|---|---|---|---|---|---|---|---|
| Claude 3.5 Sonnet - Prompt v2.1 (run-001) | Nov 28, 14:30 | Both | 1,250 | 94.2% | 97.8% | 0.0182 | Completed |
| GPT-4o Mini - Cost Optimization (run-002) | Nov 27, 10:15 | Extraction | 1,250 | 91.5% | 95.2% | 0.0098 | Completed |
| Claude 3 Haiku - Speed Test (run-003) | Nov 26, 16:45 | Classification | 1,250 | 88.3% | 93.1% | 0.0045 | Completed |
| Production Baseline (run-004) | Nov 25, 09:00 | Both | 1,250 | 92.1% | 96.5% | 0.0156 | Completed |
| GPT-4o - Accuracy Focus (run-005) | Nov 29, 08:30 | Both | 847 | – | – | – | Running |
| Gemini 1.5 Pro - Experiment (run-006) | Nov 29, 11:00 | Extraction | 0 | – | – | – | Pending |
| Claude 3 Opus - Failed Config (run-007) | Nov 24, 15:20 | Both | 156 | – | – | 0.0892 | Failed |
Accuracy vs Cost per Run

Which runs give high accuracy at low cost? (Cost normalized to 0–100 scale)

[Chart: Field Accuracy (%) vs. Avg Cost (normalized) per run]
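The exact normalization method is not documented on this page; a minimal sketch assuming simple min-max scaling of the per-message cost onto a 0–100 range:

```python
# Assumed min-max normalization of average cost per message onto a 0-100 scale.
# Costs are taken from the completed runs in the table above (EUR per message).
costs = {"run-001": 0.0182, "run-002": 0.0098, "run-003": 0.0045, "run-004": 0.0156}

lo, hi = min(costs.values()), max(costs.values())
normalized = {run: (cost - lo) / (hi - lo) * 100 for run, cost in costs.items()}

for run, value in normalized.items():
    print(f"{run}: {value:.0f}")   # cheapest run -> 0, most expensive -> 100
```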
Run Status Over Time

How frequently are experiments running and succeeding?

[Chart: run counts over time by status (Completed, Failed, Running)]
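A minimal sketch of the aggregation such a chart implies: counting runs per day and status, using the dates and statuses from the table above. Daily granularity is an assumption.

```python
# Count runs per (day, status) pair to produce the time series behind the chart.
from collections import Counter

runs = [
    ("Nov 24", "Failed"), ("Nov 25", "Completed"), ("Nov 26", "Completed"),
    ("Nov 27", "Completed"), ("Nov 28", "Completed"), ("Nov 29", "Running"),
    ("Nov 29", "Pending"),
]

per_day = Counter(runs)  # e.g. ("Nov 29", "Running") -> 1
for (day, status), count in sorted(per_day.items()):
    print(f"{day}: {count} {status.lower()}")
```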