Evaluation System

Layerr evaluates at every granularity: individual traces, provider quality, strategy effectiveness, and system-wide calibration. The evaluation system is what makes the routing engine trustworthy: it does not just route requests, it learns from outcomes and improves over time.

Key Files

File	Purpose
`evaluation/quality/engine.ts`	Core quality scoring engine
`evaluation/calibration/engine.ts`	Provider calibration and confidence adjustment
`evaluation/coding/validators.ts`	Code-specific quality validators
`evaluation/outcomes/engine.ts`	Outcome evaluation and aggregate scoring
`evaluation/outcomes/queries.ts`	Query builders for outcome analysis
`src/features/evaluation/quality/QualityScoringInspector.tsx`	UI for inspecting quality scores
`src/features/evaluation/strategies/StrategyBenchmarkView.tsx`	Strategy benchmark dashboard

Quality Scoring

QualityScoringEngine.scoreTrace evaluates a single request/response pair on:

Response quality: Coherence, relevance, accuracy
Code quality: Syntax correctness, style adherence, test coverage (for coding tasks)
Efficiency: Token usage relative to output quality
Latency: Response time relative to workload complexity

Scores are normalised to a 0-1 scale and tagged with confidence grades (A-F).
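
As a rough illustration of how per-dimension scores could be combined into one normalised value and tagged with a grade, here is a minimal sketch. The type names, weights, grade thresholds, and the exact letter set are all assumptions made for this example; this page does not specify them, and the actual logic lives in `evaluation/quality/engine.ts`.

```typescript
// Hypothetical shapes: the real QualityScoringEngine types are not shown on this page.
type Dimension = "responseQuality" | "codeQuality" | "efficiency" | "latency";
type DimensionScores = Partial<Record<Dimension, number>>;
type Grade = "A" | "B" | "C" | "D" | "F"; // assumed letter set for "A-F"

// Assumed weights, for illustration only.
const WEIGHTS: Record<Dimension, number> = {
  responseQuality: 0.4,
  codeQuality: 0.3,
  efficiency: 0.15,
  latency: 0.15,
};

// Clamp any value into the 0-1 range.
const clamp01 = (x: number): number => Math.min(1, Math.max(0, x));

// Weighted mean over the dimensions that are present, so a non-coding
// trace (no codeQuality value) is not penalised for the missing dimension.
function compositeScore(scores: DimensionScores): number {
  let weighted = 0;
  let totalWeight = 0;
  for (const key of Object.keys(WEIGHTS) as Dimension[]) {
    const value = scores[key];
    if (value === undefined) continue;
    weighted += WEIGHTS[key] * clamp01(value);
    totalWeight += WEIGHTS[key];
  }
  return totalWeight === 0 ? 0 : weighted / totalWeight;
}

// Map a 0-1 confidence value to a letter grade using assumed thresholds.
function confidenceGrade(confidence: number): Grade {
  const c = clamp01(confidence);
  if (c >= 0.9) return "A";
  if (c >= 0.8) return "B";
  if (c >= 0.7) return "C";
  if (c >= 0.6) return "D";
  return "F";
}
```

Renormalising by the weights that are actually present is one possible design choice; it keeps scores for coding and non-coding tasks on the same 0-1 scale.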

Calibration

calibrateProviders adjusts provider scores based on historical outcomes:

Overestimation penalty: If a provider consistently under-performs its score, calibration lowers that score
Underestimation boost: If a cheap provider consistently over-performs its score, calibration raises that score
Confidence recalculation: calibrateConfidence updates the confidence interval for each provider-model pair

Calibration runs automatically on a schedule and can be triggered manually.
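
The adjustment described above can be sketched as moving a provider's score part of the way towards its observed outcomes, with a confidence interval derived from the spread of those outcomes. Everything in this sketch is an assumption for illustration: the function and field names, the `learningRate` parameter, and the normal-approximation interval are not taken from `evaluation/calibration/engine.ts`, and the real `calibrateProviders` and `calibrateConfidence` may work differently.

```typescript
// Hypothetical input: a provider-model pair's current score plus observed outcome scores (all 0-1).
interface ProviderHistory {
  predictedScore: number;
  observedScores: number[];
}

interface CalibrationResult {
  calibratedScore: number;
  interval: [number, number]; // approximate 95% interval for the observed mean
}

const clampUnit = (x: number): number => Math.min(1, Math.max(0, x));

function calibrateSketch(history: ProviderHistory, learningRate = 0.5): CalibrationResult {
  const n = history.observedScores.length;
  // With fewer than two observations there is nothing to estimate spread from,
  // so keep the score and report the widest possible interval.
  if (n < 2) {
    return { calibratedScore: clampUnit(history.predictedScore), interval: [0, 1] };
  }

  const observedMean = history.observedScores.reduce((a, b) => a + b, 0) / n;

  // Over-estimated providers (observed < predicted) move down,
  // under-estimated providers (observed > predicted) move up.
  const calibratedScore = clampUnit(
    history.predictedScore + learningRate * (observedMean - history.predictedScore),
  );

  // Sample variance (n - 1 denominator) and a normal-approximation interval.
  const variance =
    history.observedScores.reduce((acc, x) => acc + (x - observedMean) ** 2, 0) / (n - 1);
  const halfWidth = 1.96 * Math.sqrt(variance / n);

  return {
    calibratedScore,
    interval: [clampUnit(observedMean - halfWidth), clampUnit(observedMean + halfWidth)],
  };
}
```

For example, a provider scored at 0.9 whose recent outcomes average 0.6 would be lowered, while a provider scored at 0.5 whose outcomes average 0.8 would be raised.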

Benchmarks

Layerr maintains benchmarks for:

Provider benchmarks: Head-to-head comparisons of providers on standardised tasks
Strategy benchmarks: Comparison of routing strategies (for example, cost-optimised vs quality-first)
Coding benchmarks: Language-specific coding challenges with known-good solutions

The StrategyBenchmarkView dashboard shows benchmark results with trend analysis.

Outcome Tracking

Every routed request is tracked as an outcome. The outcomes system aggregates:

Provider success rates
Strategy effectiveness
Cost savings vs baseline
Quality degradation (if any)

See the Economics page for how savings are computed.
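
To make the aggregation concrete, the sketch below groups outcomes by provider and derives a success rate and a savings figure for each. The `Outcome` shape and the function name are hypothetical, and savings are assumed here to be simply baseline cost minus actual cost; the Economics page remains the reference for how savings are actually computed.

```typescript
// Hypothetical outcome record; the real schema in evaluation/outcomes is not shown on this page.
interface Outcome {
  provider: string;
  succeeded: boolean;
  actualCost: number;   // cost of the routed request
  baselineCost: number; // assumed cost had the request gone to the baseline provider
}

interface ProviderSummary {
  requests: number;
  successRate: number; // successes / requests, in the range 0-1
  savings: number;     // sum of (baselineCost - actualCost); negative means a loss
}

function summariseOutcomes(outcomes: Outcome[]): Record<string, ProviderSummary> {
  // First pass: accumulate raw counts and savings per provider.
  const totals: Record<string, { requests: number; successes: number; savings: number }> = {};
  for (const o of outcomes) {
    const t = totals[o.provider] ?? { requests: 0, successes: 0, savings: 0 };
    t.requests += 1;
    if (o.succeeded) t.successes += 1;
    t.savings += o.baselineCost - o.actualCost;
    totals[o.provider] = t;
  }

  // Second pass: turn counts into rates.
  const summary: Record<string, ProviderSummary> = {};
  for (const provider of Object.keys(totals)) {
    const t = totals[provider];
    if (!t) continue;
    summary[provider] = {
      requests: t.requests,
      successRate: t.successes / t.requests,
      savings: t.savings,
    };
  }
  return summary;
}
```

The same grouping could be keyed by routing strategy instead of provider to obtain strategy effectiveness.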