Evaluation System
Layerr evaluates at every granularity: individual traces, provider quality, strategy effectiveness, and system-wide calibration. The evaluation system is what makes the routing engine trustworthy: rather than only routing requests, it learns from outcomes and improves over time.
Key Files
| File | Purpose |
|---|---|
| evaluation/quality/engine.ts | Core quality scoring engine |
| evaluation/calibration/engine.ts | Provider calibration and confidence adjustment |
| evaluation/coding/validators.ts | Code-specific quality validators |
| evaluation/outcomes/engine.ts | Outcome evaluation and aggregate scoring |
| evaluation/outcomes/queries.ts | Query builders for outcome analysis |
| src/features/evaluation/quality/QualityScoringInspector.tsx | UI for inspecting quality scores |
| src/features/evaluation/strategies/StrategyBenchmarkView.tsx | Strategy benchmark dashboard |
Quality Scoring
QualityScoringEngine.scoreTrace evaluates a single request/response pair on:
- Response quality: Coherence, relevance, accuracy
- Code quality: Syntax correctness, style adherence, test coverage (for coding tasks)
- Efficiency: Token usage relative to output quality
- Latency: Response time relative to workload complexity
Scores are normalised to a 0-1 scale and tagged with confidence grades (A-F).
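As a rough illustration of how per-dimension scores could be folded into one normalised score with a grade, here is a minimal sketch. It is not the actual `QualityScoringEngine.scoreTrace` implementation: the type names, the weights, the grade thresholds, and the A, B, C, D, F ladder are all assumptions made for this example.

```typescript
// Hypothetical shapes; the real engine's types are not shown on this page.
type Grade = "A" | "B" | "C" | "D" | "F";

interface DimensionScores {
  responseQuality: number; // 0-1
  efficiency: number;      // 0-1
  latency: number;         // 0-1
  codeQuality?: number;    // 0-1, present only for coding tasks
}

interface TraceScore {
  score: number; // weighted average, 0-1
  grade: Grade;  // confidence grade
}

// Illustrative weights only, not Layerr's actual values.
const WEIGHTS: Record<keyof DimensionScores, number> = {
  responseQuality: 0.4,
  codeQuality: 0.3,
  efficiency: 0.15,
  latency: 0.15,
};

const clamp01 = (x: number): number => Math.min(1, Math.max(0, x));

// Assumed thresholds for mapping a 0-1 confidence value to a letter grade.
function gradeFor(confidence: number): Grade {
  if (confidence >= 0.9) return "A";
  if (confidence >= 0.8) return "B";
  if (confidence >= 0.7) return "C";
  if (confidence >= 0.6) return "D";
  return "F";
}

function scoreTraceSketch(dims: DimensionScores, confidence: number): TraceScore {
  let weighted = 0;
  let totalWeight = 0;
  for (const key of Object.keys(WEIGHTS) as (keyof DimensionScores)[]) {
    const value = dims[key];
    if (value === undefined) continue; // skip dimensions that do not apply
    weighted += WEIGHTS[key] * clamp01(value);
    totalWeight += WEIGHTS[key];
  }
  // Renormalise by the weights actually used so non-coding traces stay on 0-1.
  return { score: weighted / totalWeight, grade: gradeFor(clamp01(confidence)) };
}
```

Renormalising by the weights in use means a non-coding trace is not penalised for lacking a code-quality dimension.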
Calibration
calibrateProviders adjusts provider scores based on historical outcomes:
- Overestimation penalty: If a provider consistently under-performs its score, calibration lowers it
- Underestimation boost: If a cheap provider over-performs, calibration raises it
- Confidence recalculation: calibrateConfidence updates the confidence interval for each provider-model pair
Calibration runs automatically on a schedule and can be triggered manually.
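The penalty and boost described above can be pictured as moving a provider's assumed score part of the way toward what was actually observed. The sketch below is one simple way to do that and is not the real `calibrateProviders` or `calibrateConfidence` code: the `ProviderStats` shape, the `learningRate` parameter, and the normal-approximation interval are assumptions chosen for illustration.

```typescript
// Hypothetical input shape for one provider-model pair.
interface ProviderStats {
  predictedScore: number;   // score the router currently assumes, 0-1
  observedScores: number[]; // outcome scores from recent traces, 0-1
}

const clampUnit = (x: number): number => Math.min(1, Math.max(0, x));

const mean = (xs: number[]): number =>
  xs.reduce((acc, x) => acc + x, 0) / xs.length;

// Move the assumed score toward the observed mean.
// A negative error (overestimation) lowers the score; a positive error raises it.
function calibrateScoreSketch(stats: ProviderStats, learningRate = 0.5): number {
  if (stats.observedScores.length === 0) return stats.predictedScore;
  const error = mean(stats.observedScores) - stats.predictedScore;
  return clampUnit(stats.predictedScore + learningRate * error);
}

// Assumed method: a 95% normal-approximation interval around the observed mean.
// Returns null when there are too few samples to estimate a variance.
function confidenceIntervalSketch(observed: number[]): [number, number] | null {
  const n = observed.length;
  if (n < 2) return null;
  const m = mean(observed);
  const variance = observed.reduce((acc, x) => acc + (x - m) ** 2, 0) / (n - 1);
  const halfWidth = 1.96 * Math.sqrt(variance / n);
  return [clampUnit(m - halfWidth), clampUnit(m + halfWidth)];
}
```

A learning rate below 1 keeps a single bad batch from swinging a provider's score all the way to the latest observations.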
Benchmarks
Layerr maintains benchmarks for:
- Provider benchmarks: Head-to-head comparisons of providers on standardised tasks
- Strategy benchmarks: Comparison of routing strategies (cost-optimised vs quality-first)
- Coding benchmarks: Language-specific coding challenges with known-good solutions
The StrategyBenchmarkView dashboard shows benchmark results with trend analysis.
Outcome Tracking
Every routed request is tracked as an outcome. The outcomes system aggregates:
- Provider success rates
- Strategy effectiveness
- Cost savings vs baseline
- Quality degradation (if any)
See the Economics page for how savings are computed.
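To make the aggregation concrete, the following sketch groups outcomes by provider and derives a success rate and a savings figure. The `Outcome` and `ProviderSummary` shapes are hypothetical, and savings are assumed here to be baseline cost minus actual cost; the Economics page remains the authority on how Layerr actually computes them.

```typescript
// Hypothetical record for one routed request.
interface Outcome {
  provider: string;
  success: boolean;
  costUsd: number;         // what the routed request actually cost
  baselineCostUsd: number; // what the baseline route would have cost
}

interface ProviderSummary {
  requests: number;
  successRate: number; // 0-1
  savingsUsd: number;  // assumed: baseline cost minus actual cost
}

function summariseOutcomes(outcomes: Outcome[]): Map<string, ProviderSummary> {
  // Running totals per provider.
  const totals = new Map<
    string,
    { requests: number; successes: number; cost: number; baseline: number }
  >();

  for (const o of outcomes) {
    const t = totals.get(o.provider) ?? { requests: 0, successes: 0, cost: 0, baseline: 0 };
    t.requests += 1;
    if (o.success) t.successes += 1;
    t.cost += o.costUsd;
    t.baseline += o.baselineCostUsd;
    totals.set(o.provider, t);
  }

  // Convert totals into rates and savings.
  const summary = new Map<string, ProviderSummary>();
  totals.forEach((t, provider) => {
    summary.set(provider, {
      requests: t.requests,
      successRate: t.successes / t.requests,
      savingsUsd: t.baseline - t.cost,
    });
  });
  return summary;
}
```

Calling `summariseOutcomes` with a batch of outcome records returns one summary per provider, which a dashboard or query layer could then rank or chart.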