Quality Evaluation

The Quality Evaluation Layer measures whether Layerr is doing a good job. It benchmarks providers, calibrates strategies, validates code outputs, and produces quality scores that feed back into the adaptive learning loop.

| Subsystem | Purpose | Key File |
| --- | --- | --- |
| Calibration Engine | Recalibrates providers and strategies based on outcomes | evaluation/calibration/engine.ts |
| Quality Scoring Engine | Scores provider reliability and strategy effectiveness | evaluation/quality/engine.ts |
| Coding Evaluation | Validates code outputs for correctness and completeness | evaluation/coding/engine.ts |
| Outcome Analysis | Aggregates execution outcomes into metrics | evaluation/outcomes/engine.ts |
| Benchmarks | Systematic A/B testing of providers and strategies | evaluation/benchmarks/engine.ts |

The calibration engine (evaluation/calibration/engine.ts) periodically runs:

    function calibrateProviders(): CalibrationReport {
      // 1. Compare recent provider performance vs. historical baseline
      // 2. Detect drift (providers whose scores have changed significantly)
      // 3. Build recommendations for provider rotation
    }

    function calibrateStrategies(): StrategyCalibration {
      // 1. Measure each strategy's cost/quality/speed outcomes
      // 2. Detect strategies that are no longer optimal
      // 3. Recommend weight adjustments
    }

| Output | Description |
| --- | --- |
| driftedProviders | Providers whose quality has deviated from baseline |
| sparseWorkloads | Workload types with insufficient data for reliable scoring |
| riskScores | Risk assessment per provider group |
| buildRecommendations() | Actionable recommendations for workspace admins |
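
The drift-detection step can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `ProviderScores` and `DriftReport` shapes, the `detectDrift` name, the mean-difference rule, and both threshold defaults are hypothetical and are not taken from evaluation/calibration/engine.ts. Sparse data is also tracked per provider here for brevity, whereas the engine reports sparse workload types.

```typescript
// Hypothetical shapes: the real CalibrationReport may look different.
interface ProviderScores {
  provider: string;
  baseline: number[]; // historical quality scores (0-100)
  recent: number[];   // quality scores from the most recent window
}

interface DriftReport {
  driftedProviders: string[];
  sparseProviders: string[];
}

function mean(xs: number[]): number {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length;
}

// Flags a provider as drifted when its recent mean score differs from its
// baseline mean by at least `driftThreshold` points (an assumed rule).
function detectDrift(
  groups: ProviderScores[],
  driftThreshold = 5, // assumed value
  minSamples = 3      // assumed minimum samples per window
): DriftReport {
  const report: DriftReport = { driftedProviders: [], sparseProviders: [] };
  for (const g of groups) {
    // Too little data in either window: report as sparse instead of scoring.
    if (g.baseline.length < minSamples || g.recent.length < minSamples) {
      report.sparseProviders.push(g.provider);
      continue;
    }
    if (Math.abs(mean(g.recent) - mean(g.baseline)) >= driftThreshold) {
      report.driftedProviders.push(g.provider);
    }
  }
  return report;
}

const sampleReport = detectDrift([
  { provider: "provider-a", baseline: [90, 92, 91], recent: [80, 82, 81] },
  { provider: "provider-b", baseline: [85, 86, 84], recent: [85, 87, 84] },
  { provider: "provider-c", baseline: [88], recent: [70] },
]);
// sampleReport.driftedProviders -> ["provider-a"]
// sampleReport.sparseProviders  -> ["provider-c"]
```

A production engine would more likely use a statistical test or a rolling window than a fixed mean difference; the fixed threshold only keeps the sketch short.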

The quality engine (evaluation/quality/engine.ts) computes:

| Function | Purpose |
| --- | --- |
| scoreProviderReliability() | Composite reliability score from historical traces |
| scoreStrategyEffectiveness() | Measures how well a strategy achieves its stated goal |
| computeTraceConfidence() | Confidence that a trace’s quality score is accurate |
| buildAggregateExplanations() | Human-readable quality explanations |

| Grade | Score | Meaning |
| --- | --- | --- |
| A+ | 97-100 | Exceptional |
| A | 93-96 | Excellent |
| A- | 90-92 | Very good |
| B+ | 87-89 | Good |
| B | 83-86 | Above average |
| B- | 80-82 | Average |
| C+ | 77-79 | Below average |
| C | 73-76 | Needs improvement |
| D | 0-72 | Poor |

Functions: qualityGrade(), confidenceGrade()
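
As an illustration of the grade table, the mapping from a numeric score to a letter grade could be sketched as below. The `gradeForScore` name, the `Grade` type, and the clamping of out-of-range scores are assumptions made for this example; the actual signature and behaviour of qualityGrade() are not documented here.

```typescript
type Grade = "A+" | "A" | "A-" | "B+" | "B" | "B-" | "C+" | "C" | "D";

// Lower bound of each band, highest first, taken from the grade table.
const GRADE_BANDS: { min: number; grade: Grade }[] = [
  { min: 97, grade: "A+" },
  { min: 93, grade: "A" },
  { min: 90, grade: "A-" },
  { min: 87, grade: "B+" },
  { min: 83, grade: "B" },
  { min: 80, grade: "B-" },
  { min: 77, grade: "C+" },
  { min: 73, grade: "C" },
  { min: 0, grade: "D" },
];

// Hypothetical helper: clamps the score to 0-100 (an assumption), then
// returns the first band whose lower bound the score reaches.
function gradeForScore(score: number): Grade {
  const clamped = Math.min(100, Math.max(0, score));
  for (const band of GRADE_BANDS) {
    if (clamped >= band.min) {
      return band.grade;
    }
  }
  return "D"; // only reached for NaN input
}
```

Comparing against lower bounds means fractional scores such as 96.5 fall into the band below the next threshold (here, A), which the integer ranges in the table leave unspecified.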

The coding validator (evaluation/coding/engine.ts and evaluation/coding/validators.ts) checks code outputs for:

| Check | Validator | What It Tests |
| --- | --- | --- |
| Code presence | checkCodePresence() | Response actually contains code |
| Code completeness | checkCodeCompleteness() | Code is complete, not truncated |
| Language consistency | checkLanguageConsistency() | Code matches the requested language |
| JSON validity | checkJsonValidity() | JSON outputs parse correctly |
| Schema compliance | checkStructuredOutputSchema() | Output matches expected schema |
| Brace balance | checkBraceBalance() | Brackets/parentheses are balanced |
| Response length | checkResponseLength() | Response is reasonably sized |
| Patch format | checkPatchFormat() | Diff/patch format is valid |
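
Two of these checks are simple enough to sketch. The code below is an assumption-laden illustration, not the contents of evaluation/coding/validators.ts: the `CheckResult` shape and the `balanceCheck` and `jsonCheck` names are hypothetical, and the real validators may return richer results.

```typescript
// Hypothetical result shape for a single check.
interface CheckResult {
  passed: boolean;
  detail: string;
}

const OPENERS = "([{";
const CLOSERS = ")]}"; // same order as OPENERS, so indexes line up

// Stack-based delimiter check. It is deliberately naive: delimiters inside
// string literals or comments are counted like any other.
function balanceCheck(code: string): CheckResult {
  const stack: string[] = [];
  for (let i = 0; i < code.length; i++) {
    const ch = code.charAt(i);
    if (OPENERS.indexOf(ch) !== -1) {
      stack.push(ch);
      continue;
    }
    const closerIndex = CLOSERS.indexOf(ch);
    if (closerIndex !== -1 && stack.pop() !== OPENERS.charAt(closerIndex)) {
      return { passed: false, detail: "unexpected '" + ch + "' at index " + i };
    }
  }
  if (stack.length > 0) {
    return { passed: false, detail: stack.length + " unclosed delimiter(s)" };
  }
  return { passed: true, detail: "balanced" };
}

// JSON validity: the output passes if JSON.parse accepts it.
function jsonCheck(output: string): CheckResult {
  try {
    JSON.parse(output);
    return { passed: true, detail: "valid JSON" };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return { passed: false, detail: message };
  }
}
```

A truncated response such as `{"a": ` fails `jsonCheck`, and it also fails `balanceCheck` because the opening brace is never closed, so unbalanced delimiters can double as a cheap truncation signal.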

The outcomes engine (evaluation/outcomes/engine.ts) produces metrics like:

| Metric | Description |
| --- | --- |
| Success rate | % of requests that succeeded |
| Average latency | Mean time to first token and completion |
| Average cost | Mean cost per request |
| Quality distribution | Histogram of quality grades |
| Fallback rate | % of requests that needed fallback |
| Provider distribution | Which providers are being used most |
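
The aggregation behind several of these metrics can be sketched as follows. The `ExecutionOutcome` and `OutcomeMetrics` shapes and the `aggregateOutcomes` name are hypothetical; the sketch uses a single latency figure per request and returns rates as fractions between 0 and 1, both of which are simplifications relative to the table above.

```typescript
// Hypothetical per-request record; the real trace format is not shown here.
interface ExecutionOutcome {
  provider: string;
  succeeded: boolean;
  usedFallback: boolean;
  latencyMs: number;
  cost: number;
}

interface OutcomeMetrics {
  successRate: number;  // fraction of requests that succeeded
  fallbackRate: number; // fraction of requests that needed fallback
  averageLatencyMs: number;
  averageCost: number;
  providerDistribution: Record<string, number>; // request count per provider
}

function aggregateOutcomes(outcomes: ExecutionOutcome[]): OutcomeMetrics {
  const metrics: OutcomeMetrics = {
    successRate: 0,
    fallbackRate: 0,
    averageLatencyMs: 0,
    averageCost: 0,
    providerDistribution: {},
  };
  const total = outcomes.length;
  if (total === 0) {
    return metrics; // avoid dividing by zero on an empty window
  }
  let successes = 0;
  let fallbacks = 0;
  let latencySum = 0;
  let costSum = 0;
  for (const o of outcomes) {
    if (o.succeeded) successes++;
    if (o.usedFallback) fallbacks++;
    latencySum += o.latencyMs;
    costSum += o.cost;
    metrics.providerDistribution[o.provider] =
      (metrics.providerDistribution[o.provider] || 0) + 1;
  }
  metrics.successRate = successes / total;
  metrics.fallbackRate = fallbacks / total;
  metrics.averageLatencyMs = latencySum / total;
  metrics.averageCost = costSum / total;
  return metrics;
}

const sampleMetrics = aggregateOutcomes([
  { provider: "provider-a", succeeded: true, usedFallback: false, latencyMs: 100, cost: 0.25 },
  { provider: "provider-a", succeeded: true, usedFallback: true, latencyMs: 200, cost: 0.25 },
  { provider: "provider-b", succeeded: false, usedFallback: true, latencyMs: 300, cost: 0.5 },
  { provider: "provider-b", succeeded: true, usedFallback: false, latencyMs: 400, cost: 1 },
]);
// sampleMetrics.successRate -> 0.75, sampleMetrics.fallbackRate -> 0.5,
// sampleMetrics.averageLatencyMs -> 250, sampleMetrics.averageCost -> 0.5
```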

| Component | File | Purpose |
| --- | --- | --- |
| QualityScoringInspector | src/features/evaluation/quality/ | Deep quality score analysis |
| CalibrationOutcomeView | src/features/evaluation/ | View calibration results |
| ExecutionQualitySummary | src/features/evaluation/ | Summarised quality dashboard |

| File | What It Does |
| --- | --- |
| evaluation/calibration/engine.ts | Recalibration engine for providers and strategies |
| evaluation/quality/engine.ts | Provider reliability and strategy effectiveness scoring |
| evaluation/coding/engine.ts | Code output validation and grading |
| evaluation/coding/validators.ts | Individual validator functions for code quality |
| evaluation/outcomes/engine.ts | Execution outcome aggregation and metrics |
| evaluation/benchmarks/engine.ts | Systematic provider/strategy benchmarking |

The layer exchanges data with other parts of Layerr:

  1. Execution Engine → provides traces for evaluation
  2. Replay → stores evaluation results alongside traces
  3. Adaptive Learning → receives evaluation insights for weight updates
  4. Explainability → includes quality grades in explanations
  5. Strategy Engine → receives calibration recommendations