
Evaluation System

Layerr evaluates at every granularity: individual traces, provider quality, strategy effectiveness, and system-wide calibration. The evaluation system is what makes the routing engine trustworthy: rather than only routing requests, it learns from outcomes and improves over time.

File | Purpose
evaluation/quality/engine.ts | Core quality scoring engine
evaluation/calibration/engine.ts | Provider calibration and confidence adjustment
evaluation/coding/validators.ts | Code-specific quality validators
evaluation/outcomes/engine.ts | Outcome evaluation and aggregate scoring
evaluation/outcomes/queries.ts | Query builders for outcome analysis
src/features/evaluation/quality/QualityScoringInspector.tsx | UI for inspecting quality scores
src/features/evaluation/strategies/StrategyBenchmarkView.tsx | Strategy benchmark dashboard

QualityScoringEngine.scoreTrace evaluates a single request/response pair on:

  • Response quality: Coherence, relevance, accuracy
  • Code quality: Syntax correctness, style adherence, test coverage (for coding tasks)
  • Efficiency: Token usage relative to output quality
  • Latency: Response time relative to workload complexity

Scores are normalised to a 0-1 scale and tagged with confidence grades (A-F).
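
To make the scoring dimensions concrete, here is a minimal sketch of how per-dimension scores might be combined into one normalised score with a confidence grade. It is not the actual QualityScoringEngine.scoreTrace implementation: the type names, field names, weights, grade cut-offs, and the assumption that the grades are A, B, C, D, F are all hypothetical.

```typescript
// Hypothetical shapes: field names, weights, and grade cut-offs are
// illustrative assumptions, not Layerr's actual API.
interface DimensionScores {
  responseQuality: number; // coherence, relevance, accuracy (0-1)
  codeQuality?: number;    // only present for coding tasks (0-1)
  efficiency: number;      // token usage relative to output quality (0-1)
  latency: number;         // response time relative to workload complexity (0-1)
}

type Grade = "A" | "B" | "C" | "D" | "F";

interface TraceScore {
  score: number; // normalised to the 0-1 range
  grade: Grade;  // confidence grade for the score
}

// Assumed weights; a real engine would likely make these configurable.
const WEIGHTS: Record<keyof DimensionScores, number> = {
  responseQuality: 0.4,
  codeQuality: 0.3,
  efficiency: 0.15,
  latency: 0.15,
};

const clamp01 = (x: number): number => Math.min(1, Math.max(0, x));

// Maps a confidence value in [0, 1] to a letter grade (assumed cut-offs).
function toGrade(confidence: number): Grade {
  if (confidence >= 0.9) return "A";
  if (confidence >= 0.75) return "B";
  if (confidence >= 0.6) return "C";
  if (confidence >= 0.4) return "D";
  return "F";
}

// Weighted mean over the dimensions that are present, so non-coding
// traces are not penalised for lacking a code-quality score.
function scoreTraceSketch(dims: DimensionScores, confidence: number): TraceScore {
  let weighted = 0;
  let totalWeight = 0;
  for (const key of Object.keys(WEIGHTS) as (keyof DimensionScores)[]) {
    const value = dims[key];
    if (value === undefined) continue;
    weighted += WEIGHTS[key] * clamp01(value);
    totalWeight += WEIGHTS[key];
  }
  return { score: weighted / totalWeight, grade: toGrade(clamp01(confidence)) };
}
```

Dividing by the sum of the weights actually used keeps the result in the 0-1 range whether or not the code-quality dimension applies.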

calibrateProviders adjusts provider scores based on historical outcomes:

  • Overestimation penalty: If a provider consistently under-performs its score, calibration lowers it
  • Underestimation boost: If a cheap provider over-performs, calibration raises it
  • Confidence recalculation: calibrateConfidence updates the confidence interval for each provider-model pair

Calibration runs automatically on a schedule and can be triggered manually.
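
The calibration behaviour described above can be sketched as follows. This is an illustration only, not the real calibrateProviders or calibrateConfidence code: it assumes calibration moves a provider's score part of the way toward its observed mean quality, and that the confidence interval is a normal-approximation 95% interval. The type names, the `rate` parameter, and the function name are hypothetical.

```typescript
// Hypothetical input: one entry per provider-model pair.
interface ProviderHistory {
  predictedScore: number;   // score currently assigned to the provider (0-1)
  observedScores: number[]; // quality scores from past outcomes (0-1)
}

interface CalibrationResult {
  calibratedScore: number;
  interval: [number, number]; // assumed 95% confidence interval
}

const clamp01 = (x: number): number => Math.min(1, Math.max(0, x));

function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

// Moves the predicted score toward the observed mean by `rate` (assumed
// update rule). A negative error lowers the score (overestimation
// penalty); a positive error raises it (underestimation boost).
function calibrateSketch(h: ProviderHistory, rate = 0.5): CalibrationResult {
  const n = h.observedScores.length;
  if (n === 0) {
    // No evidence yet: keep the score and report the widest interval.
    return { calibratedScore: h.predictedScore, interval: [0, 1] };
  }
  const observed = mean(h.observedScores);
  const error = observed - h.predictedScore;
  const calibratedScore = clamp01(h.predictedScore + rate * error);
  if (n < 2) {
    // A single observation gives no variance estimate.
    return { calibratedScore, interval: [0, 1] };
  }
  // Sample variance and a normal-approximation interval around the mean.
  const variance =
    h.observedScores.reduce((acc, x) => acc + (x - observed) ** 2, 0) / (n - 1);
  const halfWidth = 1.96 * Math.sqrt(variance / n);
  return {
    calibratedScore,
    interval: [clamp01(observed - halfWidth), clamp01(observed + halfWidth)],
  };
}
```

For example, under these assumptions a provider scored at 0.9 whose observed quality averages 0.7 would be lowered to about 0.8 with the default rate, and the interval narrows as more outcomes accumulate.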

Layerr maintains benchmarks for:

  • Provider benchmarks: Head-to-head comparisons of providers on standardised tasks
  • Strategy benchmarks: Comparison of routing strategies (cost-optimised vs quality-first)
  • Coding benchmarks: Language-specific coding challenges with known-good solutions

The StrategyBenchmarkView dashboard shows benchmark results with trend analysis.

Every routed request is tracked as an outcome. The outcomes system aggregates:

  • Provider success rates
  • Strategy effectiveness
  • Cost savings vs baseline
  • Quality degradation (if any)
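
As a rough illustration of this aggregation, the sketch below computes per-provider success rates and an overall savings fraction from a list of outcomes. The `Outcome` shape, the field names, and the savings formula (baseline cost minus actual cost, divided by baseline cost) are assumptions made for the example and may differ from how Layerr computes them.

```typescript
// Hypothetical outcome record; field names are assumptions.
interface Outcome {
  provider: string;
  success: boolean;
  costUsd: number;         // what the routed request actually cost
  baselineCostUsd: number; // assumed cost of the same request on a baseline
}

interface ProviderSummary {
  requests: number;
  successRate: number;
}

interface OutcomeSummary {
  providers: Record<string, ProviderSummary>;
  savingsFraction: number; // 0.25 means 25% cheaper than baseline
}

function aggregateOutcomesSketch(outcomes: Outcome[]): OutcomeSummary {
  const counts: Record<string, { requests: number; successes: number }> = {};
  let cost = 0;
  let baseline = 0;

  for (const o of outcomes) {
    const entry = counts[o.provider] ?? { requests: 0, successes: 0 };
    entry.requests += 1;
    if (o.success) entry.successes += 1;
    counts[o.provider] = entry;
    cost += o.costUsd;
    baseline += o.baselineCostUsd;
  }

  const providers: Record<string, ProviderSummary> = {};
  for (const name of Object.keys(counts)) {
    const { requests, successes } = counts[name];
    providers[name] = { requests, successRate: successes / requests };
  }

  // Guard against division by zero when there is no baseline cost.
  const savingsFraction = baseline > 0 ? (baseline - cost) / baseline : 0;
  return { providers, savingsFraction };
}
```

Strategy effectiveness and quality degradation could be aggregated the same way by grouping on a strategy field and comparing mean quality scores against a baseline.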

See the Economics page for how savings are computed.