The Quality Evaluation Layer measures whether Layerr is doing a good job. It benchmarks providers, calibrates strategies, validates code outputs, and produces quality scores that feed back into the adaptive learning loop.
| Subsystem | Purpose | Key File |
| --- | --- | --- |
| Calibration Engine | Recalibrates providers and strategies based on outcomes | `evaluation/calibration/engine.ts` |
| Quality Scoring Engine | Scores provider reliability and strategy effectiveness | `evaluation/quality/engine.ts` |
| Coding Evaluation | Validates code outputs for correctness and completeness | `evaluation/coding/engine.ts` |
| Outcome Analysis | Aggregates execution outcomes into metrics | `evaluation/outcomes/engine.ts` |
| Benchmarks | Systematic A/B testing of providers and strategies | `evaluation/benchmarks/engine.ts` |
The calibration engine (evaluation/calibration/engine.ts) periodically runs:
```typescript
function calibrateProviders(): CalibrationReport {
  // 1. Compare recent provider performance vs. historical baseline
  // 2. Detect drift (providers whose scores have changed significantly)
  // 3. Build recommendations for provider rotation
}

function calibrateStrategies(): StrategyCalibration {
  // 1. Measure each strategy's cost/quality/speed outcomes
  // 2. Detect strategies that are no longer optimal
  // 3. Recommend weight adjustments
}
```
| Output | Description |
| --- | --- |
| `driftedProviders` | Providers whose quality has deviated from baseline |
| `sparseWorkloads` | Workload types with insufficient data for reliable scoring |
| `riskScores` | Risk assessment per provider group |
| `buildRecommendations()` | Actionable recommendations for workspace admins |
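The drift step can be illustrated with a minimal sketch. It is not the engine's actual implementation: the `ProviderScores` and `DriftSketch` shapes, the `detectDrift` name, and both thresholds are assumptions made for this example, and drift is simplified to an absolute change in mean score between a baseline window and a recent window.

```typescript
// Hypothetical input shape; the real engine's data model is not shown in this document.
interface ProviderScores {
  providerId: string;
  baseline: number[]; // historical quality scores (0-100)
  recent: number[];   // quality scores from the most recent window
}

// Hypothetical output shape, loosely modelled on the driftedProviders output.
interface DriftSketch {
  driftedProviders: string[];
  insufficientData: string[]; // providers skipped because too few samples exist
}

const mean = (xs: number[]): number =>
  xs.reduce((sum, x) => sum + x, 0) / xs.length;

function detectDrift(
  providers: ProviderScores[],
  threshold = 5,  // assumed: minimum absolute change in mean score that counts as drift
  minSamples = 3, // assumed: fewer samples than this in either window is too sparse
): DriftSketch {
  const driftedProviders: string[] = [];
  const insufficientData: string[] = [];

  for (const p of providers) {
    if (p.baseline.length < minSamples || p.recent.length < minSamples) {
      insufficientData.push(p.providerId);
      continue;
    }
    // Flag the provider when its recent mean moved away from the baseline mean,
    // in either direction, by at least the threshold.
    if (Math.abs(mean(p.recent) - mean(p.baseline)) >= threshold) {
      driftedProviders.push(p.providerId);
    }
  }
  return { driftedProviders, insufficientData };
}
```

A production detector would more likely use a statistical test or a variance-aware threshold. The fixed cut-off here only keeps the example short.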
The quality engine (evaluation/quality/engine.ts) computes:
| Function | Purpose |
| --- | --- |
| `scoreProviderReliability()` | Composite reliability score from historical traces |
| `scoreStrategyEffectiveness()` | Measures how well a strategy achieves its stated goal |
| `computeTraceConfidence()` | Confidence that a trace's quality score is accurate |
| `buildAggregateExplanations()` | Human-readable quality explanations |
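| Grade | Score | Meaning |
| --- | --- | --- |
| A+ | 97-100 | Exceptional |
| A | 93-96 | Excellent |
| A- | 90-92 | Very good |
| B+ | 87-89 | Good |
| B | 83-86 | Above average |
| B- | 80-82 | Average |
| C+ | 77-79 | Below average |
| C | 73-76 | Needs improvement |
| D | 0-72 | Poor |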
Grading functions: qualityGrade(), confidenceGrade()
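The grade bands can be expressed as a small lookup. The sketch below is an assumption about how a score might map to a letter grade, not the signature or body of `qualityGrade()` itself. The `gradeForScore` name is hypothetical, and so is the treatment of fractional scores (each band's lower bound is used as a floor, so 96.5 grades as A).

```typescript
type Grade = "A+" | "A" | "A-" | "B+" | "B" | "B-" | "C+" | "C" | "D";

// Lower bound of each band, highest first, taken from the grade table.
const GRADE_FLOORS: ReadonlyArray<readonly [number, Grade]> = [
  [97, "A+"],
  [93, "A"],
  [90, "A-"],
  [87, "B+"],
  [83, "B"],
  [80, "B-"],
  [77, "C+"],
  [73, "C"],
  [0, "D"],
];

// Hypothetical helper: maps a 0-100 score to its letter grade.
function gradeForScore(score: number): Grade {
  if (!isFinite(score)) {
    throw new RangeError("score must be a finite number");
  }
  // Assumption: out-of-range scores are clamped rather than rejected.
  const clamped = Math.min(100, Math.max(0, score));
  for (const [floor, grade] of GRADE_FLOORS) {
    if (clamped >= floor) return grade;
  }
  return "D"; // not reached: the final floor is 0
}
```

Keeping the bands in one ordered table means a change to the grading scale touches data rather than control flow.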
The coding validator (evaluation/coding/engine.ts and evaluation/coding/validators.ts) checks code outputs for:
| Check | Validator | What It Tests |
| --- | --- | --- |
| Code presence | `checkCodePresence()` | Response actually contains code |
| Code completeness | `checkCodeCompleteness()` | Code is complete, not truncated |
| Language consistency | `checkLanguageConsistency()` | Code matches the requested language |
| JSON validity | `checkJsonValidity()` | JSON outputs parse correctly |
| Schema compliance | `checkStructuredOutputSchema()` | Output matches expected schema |
| Brace balance | `checkBraceBalance()` | Brackets/parentheses are balanced |
| Response length | `checkResponseLength()` | Response is reasonably sized |
| Patch format | `checkPatchFormat()` | Diff/patch format is valid |
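To make two of these checks concrete, here is a rough sketch of a brace-balance check and a JSON-validity check. The `CheckResult` shape and the function names `braceBalanceSketch` and `jsonValiditySketch` are hypothetical. The real validators in evaluation/coding/validators.ts may take different arguments and return different structures.

```typescript
// Hypothetical result shape for a single validator run.
interface CheckResult {
  check: string;
  passed: boolean;
  detail?: string;
}

const OPENERS = "([{";
const CLOSERS = ")]}"; // same order as OPENERS, so indexes pair up

// Stack-based balance check. Simplification: brackets inside string
// literals and comments are counted too, which a real validator may avoid.
function braceBalanceSketch(code: string): CheckResult {
  const stack: string[] = [];
  for (let i = 0; i < code.length; i++) {
    const ch = code.charAt(i);
    if (OPENERS.indexOf(ch) !== -1) {
      stack.push(ch);
      continue;
    }
    const closerIndex = CLOSERS.indexOf(ch);
    if (closerIndex === -1) continue; // not a bracket character
    if (stack.pop() !== OPENERS.charAt(closerIndex)) {
      return { check: "braceBalance", passed: false, detail: `unmatched '${ch}' at offset ${i}` };
    }
  }
  return stack.length === 0
    ? { check: "braceBalance", passed: true }
    : { check: "braceBalance", passed: false, detail: `${stack.length} unclosed bracket(s)` };
}

// JSON validity: the output passes if JSON.parse accepts it.
function jsonValiditySketch(output: string): CheckResult {
  try {
    JSON.parse(output);
    return { check: "jsonValidity", passed: true };
  } catch (err) {
    const detail = err instanceof Error ? err.message : String(err);
    return { check: "jsonValidity", passed: false, detail };
  }
}
```

Returning a structured result instead of a bare boolean lets a caller collect every failed check with its reason, which would suit the grading step described above.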
The outcomes engine (evaluation/outcomes/engine.ts) produces metrics like:
| Metric | Description |
| --- | --- |
| Success rate | % of requests that succeeded |
| Average latency | Mean time to first token and completion |
| Average cost | Mean cost per request |
| Quality distribution | Histogram of quality grades |
| Fallback rate | % of requests that needed fallback |
| Provider distribution | Which providers are being used most |
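As an illustration of how such metrics could be derived from raw outcomes, the sketch below aggregates a list of execution records. The `ExecutionOutcome` and `OutcomeMetrics` shapes and the `aggregateOutcomes` name are assumptions for this example. It also collapses latency into a single number per request, whereas the table distinguishes time to first token from completion time.

```typescript
// Hypothetical record of one executed request.
interface ExecutionOutcome {
  provider: string;
  succeeded: boolean;
  usedFallback: boolean;
  latencyMs: number; // simplification: one latency figure per request
  costUsd: number;
}

// Hypothetical aggregate; rates are fractions in the range 0..1.
interface OutcomeMetrics {
  successRate: number;
  fallbackRate: number;
  averageLatencyMs: number;
  averageCostUsd: number;
  providerDistribution: Record<string, number>; // request count per provider
}

function aggregateOutcomes(outcomes: ExecutionOutcome[]): OutcomeMetrics {
  const total = outcomes.length;
  const providerDistribution: Record<string, number> = {};
  if (total === 0) {
    // Assumption: an empty window reports zeros rather than NaN.
    return { successRate: 0, fallbackRate: 0, averageLatencyMs: 0, averageCostUsd: 0, providerDistribution };
  }

  let successes = 0;
  let fallbacks = 0;
  let latencySum = 0;
  let costSum = 0;
  for (const o of outcomes) {
    if (o.succeeded) successes++;
    if (o.usedFallback) fallbacks++;
    latencySum += o.latencyMs;
    costSum += o.costUsd;
    providerDistribution[o.provider] = (providerDistribution[o.provider] ?? 0) + 1;
  }

  return {
    successRate: successes / total,
    fallbackRate: fallbacks / total,
    averageLatencyMs: latencySum / total,
    averageCostUsd: costSum / total,
    providerDistribution,
  };
}
```

Multiply the rates by 100 to present them as the percentages listed in the table.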
| Component | File | Purpose |
| --- | --- | --- |
| `QualityScoringInspector` | `src/features/evaluation/quality/` | Deep quality score analysis |
| `CalibrationOutcomeView` | `src/features/evaluation/` | View calibration results |
| `ExecutionQualitySummary` | `src/features/evaluation/` | Summarised quality dashboard |
| File | What It Does |
| --- | --- |
| `evaluation/calibration/engine.ts` | Recalibration engine for providers and strategies |
| `evaluation/quality/engine.ts` | Provider reliability and strategy effectiveness scoring |
| `evaluation/coding/engine.ts` | Code output validation and grading |
| `evaluation/coding/validators.ts` | Individual validator functions for code quality |
| `evaluation/outcomes/engine.ts` | Execution outcome aggregation and metrics |
| `evaluation/benchmarks/engine.ts` | Systematic provider/strategy benchmarking |
Execution Engine → provides traces for evaluation
Replay → stores evaluation results alongside traces
Adaptive Learning → receives evaluation insights for weight updates
Explainability → includes quality grades in explanations
Strategy Engine → receives calibration recommendations