Hallucination Detection
Flag generations that contain arithmetic errors, broken references, or unstable outputs.
Muster scans your traces nightly for three patterns that strongly correlate with hallucinated LLM output. Detection is heuristic — no external LLM judge is called — so it's fast, cheap, and predictable.
What it detects
Arithmetic drift
Parses numbers out of generation outputs and checks whether stated totals match the sum of their components. Flags drift in the 5-50% range (below 5% is likely rounding, above 50% is usually a different phenomenon worth flagging differently).
"Total: $150 (50 + 60 + 70)" → computed sum is $180 → 20% drift → flagged.
Reference failure
Pattern-matches output against well-known fabrication signatures:
- example.com and other placeholder domains presented as real sources
- arXiv IDs with appended trailing letters (e.g. 2401.12345v9z)
- DOI patterns that don't resolve
- LLM-typical citation strings ([1] Smith et al. (2024)) without a real bibliography
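Signatures like these are plain regex matches. The patterns below are illustrative approximations of the four signatures listed above, not Muster's actual rules (the hit labels and function name are invented for this sketch, and the DOI resolution check is omitted since it requires a network call):

```typescript
// Illustrative fabrication-signature patterns (assumptions, not Muster's rules):
const placeholderDomain = /\bexample\.(com|org|net)\b/i;
// arXiv version suffixes are digits only, so a trailing letter is suspect.
const malformedArxivId = /\b\d{4}\.\d{4,5}v\d+[a-z]\b/i;
// A bracketed citation like "[1] Smith et al. (2024)".
const llmCitation = /\[\d+\]\s+[A-Z][a-z]+ et al\. \(\d{4}\)/;

function referenceFailures(output: string): string[] {
  const hits: string[] = [];
  if (placeholderDomain.test(output)) hits.push("placeholder-domain");
  if (malformedArxivId.test(output)) hits.push("malformed-arxiv-id");
  // Citation strings only count when no bibliography section is present.
  if (llmCitation.test(output) && !/references|bibliography/i.test(output)) {
    hits.push("citation-without-bibliography");
  }
  return hits;
}
```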
Output instability
Groups generations by identical input over the last 24 hours and compares outputs via Jaccard word similarity. If the same prompt produced outputs with similarity below the threshold (default 0.5), the generation is unstable.
This catches stochastic regressions (model upgrade, prompt drift, temperature change) without you having to write a regression test.
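Jaccard word similarity is the size of the intersection of the two outputs' word sets divided by the size of their union. A minimal sketch of the comparison, assuming case-insensitive whitespace tokenization (the real tokenization rules may differ):

```typescript
// Word-level Jaccard similarity: |intersection| / |union| of word sets.
function jaccard(a: string, b: string): number {
  const wa = new Set(a.toLowerCase().split(/\s+/).filter(Boolean));
  const wb = new Set(b.toLowerCase().split(/\s+/).filter(Boolean));
  if (wa.size === 0 && wb.size === 0) return 1;
  let inter = 0;
  for (const w of wa) if (wb.has(w)) inter++;
  return inter / (wa.size + wb.size - inter);
}

const THRESHOLD = 0.5; // default outputInstabilitySimilarityThreshold

// Two answers to the same prompt sharing only "refunds" and "days":
// similarity 2/9 ≈ 0.22, below the threshold → unstable.
const sim = jaccard(
  "Refunds take 5 business days",
  "Refunds are processed within 30 days"
);
const unstable = sim < THRESHOLD;
```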
How it runs
A daily cron worker runs at 01:00 UTC, scans the previous day's generations for each project, and writes any matches to MusterHallucinationEvent. Each event captures the trace ID, hallucination type, severity, a description, and an evidence JSON column with the underlying numbers (numeric drift, URL snippets, or similarity score).
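The fields described above can be sketched as a record type. Exact column names and types are assumptions; only the listed fields come from the text:

```typescript
// Sketch of a MusterHallucinationEvent record (field names assumed).
type HallucinationType =
  | "arithmetic_drift"
  | "reference_failure"
  | "output_instability";

interface MusterHallucinationEvent {
  traceId: string;
  type: HallucinationType;
  severity: "low" | "medium" | "high";
  description: string;
  // Evidence JSON: numeric drift, URL snippets, or similarity score.
  evidence: { drift?: number; urls?: string[]; similarity?: number };
  // Stamped when a reviewer acknowledges the flag.
  acknowledgedBy?: string;
  acknowledgedAt?: Date;
}

const example: MusterHallucinationEvent = {
  traceId: "trace-123",
  type: "arithmetic_drift",
  severity: "medium",
  description: "Stated total drifts 20% from component sum",
  evidence: { drift: 0.2 },
};
```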
Tuning thresholds
The two main knobs:
| Knob | Default | Effect |
|---|---|---|
| arithmeticDriftMinThreshold | 0.05 (5%) | Lower → catch smaller errors but more false positives from rounding |
| outputInstabilitySimilarityThreshold | 0.5 | Lower → only flag the most divergent outputs |
Both are tunable via Auto-Instrumentation or directly in MusterProjectTuning. Workers read them at runtime via TuningCache.get() with a fallback to hardcoded defaults.
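The read-with-fallback pattern might look like the following. This is a sketch under assumptions: the real TuningCache.get() signature, and the helper around it, are invented for illustration.

```typescript
// Hardcoded defaults the workers fall back to.
const DEFAULTS = {
  arithmeticDriftMinThreshold: 0.05,
  outputInstabilitySimilarityThreshold: 0.5,
};

// Minimal cache shape assumed for this sketch; the real
// TuningCache.get() signature may differ.
interface TuningCacheLike {
  get(projectId: string, knob: string): number | undefined;
}

function readThreshold(
  cache: TuningCacheLike,
  projectId: string,
  knob: keyof typeof DEFAULTS
): number {
  // Per-project override wins; otherwise use the hardcoded default.
  return cache.get(projectId, knob) ?? DEFAULTS[knob];
}
```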
Calibration tip: start at defaults, look at the first week of flags, and dial each threshold in the direction of fewer false positives if your reviewers are spending too much time dismissing noise.
Reviewing flagged generations
In the Hallucinations tab:
- List filters by type, severity, and active/acknowledged.
- Each row links back to the original trace, so you can read the full prompt and output.
- Acknowledge dismisses the flag and stamps acknowledgedBy/acknowledgedAt for audit.
Limits
- Detection is heuristic — it will miss hallucinations that don't fit the three patterns (e.g. plausible-sounding but factually wrong claims with no math, no citations, and consistent phrasing).
- For higher recall, layer on an LLM-as-judge eval — see the upcoming Evaluation guide.