Evals

Score every answer. Catch the regressions.

Run code checks and LLM judges against live production traffic, each scored on a 0–1 scale, so quality is a number you watch, not a vibe you hope for.

0.94 avg score

90% pass rate

tone · groundedness · valid-json · 6.1k scored

Two kinds of checks

Code checks and LLM judges, side by side.

Deterministic checks for the things that must be exactly right — valid JSON, no PII, schema conformance — and model-graded judges for the fuzzy stuff like tone and groundedness.

Code evals run inline, no model cost
LLM judges grade tone, helpfulness, and groundedness
Every eval scored 0–1 with a pass threshold you set
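
As a rough illustration of the two kinds of checks, here is a minimal Python sketch. It is not Foglamp's API: every name in it (`valid_json`, `no_email_pii`, `run_eval`, `EvalResult`) is hypothetical, and the email regex is only a crude stand-in for a real PII check. It shows deterministic checks returning a score between 0 and 1, with a pass threshold applied on top.

```python
import json
import re
from dataclasses import dataclass
from typing import Callable

# Hypothetical names throughout; this is an illustration, not Foglamp's API.

# Crude email pattern used as a stand-in for a real PII detector.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def valid_json(output: str) -> float:
    """Score 1.0 if the output parses as JSON, otherwise 0.0."""
    try:
        json.loads(output)
        return 1.0
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return 0.0


def no_email_pii(output: str) -> float:
    """Score 1.0 if no email-like string appears, otherwise 0.0."""
    return 0.0 if EMAIL_RE.search(output) else 1.0


@dataclass
class EvalResult:
    name: str
    score: float
    passed: bool


def run_eval(
    name: str,
    check: Callable[[str], float],
    output: str,
    threshold: float = 1.0,
) -> EvalResult:
    """Run one check, clamp its score to the 0-1 range, apply the threshold."""
    score = min(1.0, max(0.0, check(output)))
    return EvalResult(name=name, score=score, passed=score >= threshold)


if __name__ == "__main__":
    answer = '{"reply": "Your order ships tomorrow."}'
    print(run_eval("valid-json", valid_json, answer))
    print(run_eval("no-pii", no_email_pii, answer))
```

An LLM judge could plug into the same `run_eval` shape: any callable that takes the output and returns a number between 0 and 1 would work, typically with a threshold below 1.0 for fuzzy qualities such as tone.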

[Chart: score distribution, 0.0 to 1.0]

On real traffic

Evals where your users actually are.

Don't grade a static test set and hope. Foglamp scores sampled production traces, so your pass rate reflects what real users are getting today.

Sample a fixed rate or score everything
Drill from a failing score to the exact trace
Trend pass rate per agent over time
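
The sampling and pass-rate ideas above can be sketched in a few lines of Python. This is an assumption-laden illustration, not how Foglamp implements it: the function names, the `(agent, trace_id, score)` tuple shape, and the choice to sample by hashing the trace ID are all hypothetical. Hashing is used here only because it makes the sampling decision repeatable for a given trace.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical record shape: (agent name, trace ID, eval score in 0-1).
Scored = Tuple[str, str, float]


def should_sample(trace_id: str, rate: float) -> bool:
    """Keep roughly `rate` of traces, deterministically per trace ID."""
    if rate >= 1.0:
        return True  # score everything
    if rate <= 0.0:
        return False
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a float in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


def pass_rate_by_agent(scored: Iterable[Scored], threshold: float) -> Dict[str, float]:
    """Fraction of scored traces at or above the threshold, per agent."""
    totals: Dict[str, int] = defaultdict(int)
    passes: Dict[str, int] = defaultdict(int)
    for agent, _trace_id, score in scored:
        totals[agent] += 1
        if score >= threshold:
            passes[agent] += 1
    return {agent: passes[agent] / totals[agent] for agent in totals}


def failing_traces(scored: Iterable[Scored], threshold: float) -> List[str]:
    """Trace IDs whose score fell below the threshold, for drill-down."""
    return [trace_id for _agent, trace_id, score in scored if score < threshold]


if __name__ == "__main__":
    scored = [
        ("support", "t1", 0.9),
        ("support", "t2", 0.4),
        ("billing", "t3", 0.8),
    ]
    print(pass_rate_by_agent(scored, threshold=0.7))
    print(failing_traces(scored, threshold=0.7))
```

Computing the same pass rate over successive time windows would give the per-agent trend, and the list of failing trace IDs is the starting point for drilling from a score to the trace behind it.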

Pass rate · answer-groundedness

80% pass rate

tone · groundedness · valid-json · 6.1k scored

Your agents are running in the fog.

Cost, latency, errors, eval scores — all there, all invisible. Wrap your model and turn the light on.
