Evals

Score every answer. Catch the regressions.

Run code checks and LLM judges against live production traffic, scoring every response on a 0–1 scale, so quality is a number you watch, not a vibe you hope for.

0.94 avg score
90% pass rate
tone · groundedness · valid-json · 6.1k scored
Two kinds of checks

Code checks and LLM judges, side by side.

Deterministic checks for the things that must be exactly right — valid JSON, no PII, schema conformance — and model-graded judges for the fuzzy stuff like tone and groundedness.

  • Code evals run inline, no model cost
  • LLM judges grade tone, helpfulness, and groundedness
  • Every eval scored 0–1 with a pass threshold you set
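
To make the split concrete, here is a minimal Python sketch of a deterministic code check scored on the 0–1 scale and compared against a pass threshold. The names `EvalResult`, `valid_json_score`, and `run_eval` are hypothetical, chosen for illustration; this is not Foglamp's API.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalResult:
    name: str
    score: float   # always within [0.0, 1.0]
    passed: bool   # True when score >= threshold


def valid_json_score(output: str) -> float:
    """Deterministic check: 1.0 if the output parses as JSON, else 0.0."""
    try:
        json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    return 1.0


def run_eval(name: str, scorer: Callable[[str], float],
             output: str, threshold: float) -> EvalResult:
    # Clamp so a misbehaving scorer cannot leave the 0-1 scale.
    score = min(1.0, max(0.0, scorer(output)))
    return EvalResult(name=name, score=score, passed=score >= threshold)


if __name__ == "__main__":
    print(run_eval("valid-json", valid_json_score, '{"ok": true}', 1.0))
    print(run_eval("valid-json", valid_json_score, '{ok: true', 1.0))
```

Under the same assumptions, an LLM judge would plug in as another `scorer`: a function that sends the output to a model and maps the grade to a float between 0 and 1.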
[Chart: score distribution, 0.0 to 1.0]
On real traffic

Evals where your users actually are.

Don't grade a static test set and hope. Foglamp scores sampled production traces, so your pass rate reflects what real users are getting today.

  • Sample at a fixed rate or score everything
  • Drill from a failing score to the exact trace
  • Trend pass rate per agent over time
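
As a rough illustration of the three points above, the Python sketch below samples traces at a fixed rate, computes a pass rate from the resulting scores, and lists the failing trace IDs to drill into. The trace shape (`trace_id`, `score`) and all function names are assumptions made for this example, not Foglamp's actual data model.

```python
import random
from typing import Iterable, List, Optional


def sample_traces(traces: Iterable[dict], rate: float,
                  rng: Optional[random.Random] = None) -> List[dict]:
    """Keep each trace with probability `rate`; rate=1.0 scores everything."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    rng = rng or random.Random()
    # random() returns a value in [0.0, 1.0), so rate=1.0 keeps every trace.
    return [t for t in traces if rng.random() < rate]


def pass_rate(scores: List[float], threshold: float) -> float:
    """Fraction of scores at or above the threshold; 0.0 if nothing was scored."""
    if not scores:
        return 0.0
    return sum(s >= threshold for s in scores) / len(scores)


def failing_trace_ids(scored: List[dict], threshold: float) -> List[str]:
    """IDs of traces whose score fell below the threshold."""
    return [t["trace_id"] for t in scored if t["score"] < threshold]


if __name__ == "__main__":
    scored = [
        {"trace_id": "t1", "score": 0.9},
        {"trace_id": "t2", "score": 0.4},
        {"trace_id": "t3", "score": 0.8},
    ]
    print(pass_rate([t["score"] for t in scored], threshold=0.7))
    print(failing_trace_ids(scored, threshold=0.7))
```

Grouping the scored traces by agent and by day before calling `pass_rate` would give the per-agent trend.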
[Chart: pass rate for answer-groundedness · 80% pass rate]

Your agents are running in the fog.

Cost, latency, errors, eval scores — all there, all invisible. Wrap your model and turn the light on.

Start free