Muster Docs

Hallucination Detection

Flag generations that contain arithmetic errors, broken references, or unstable outputs.

Muster scans your traces nightly for three patterns that strongly correlate with hallucinated LLM output. Detection is heuristic — no external LLM judge is called — so it's fast, cheap, and predictable.

What it detects

Arithmetic drift

Parses numbers out of generation outputs and checks whether stated totals match the sum of their components. Flags drift in the 5-50% range: below 5% is likely rounding noise, and above 50% usually signals a different failure mode worth flagging separately.

"Total: $150 (50 + 60 + 70)" → computed sum is $180 → 20% drift → flagged.

Reference failure

Pattern-matches output against well-known fabrication signatures:

  • example.com and other placeholder domains presented as real sources
  • arXiv IDs with appended trailing letters (e.g. 2401.12345v9z)
  • DOI patterns that don't resolve
  • LLM-typical citation strings ([1] Smith et al. (2024)) without a real bibliography
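The signatures above can be approximated with a small pattern table. This is an illustrative sketch, not Muster's actual rule set: the pattern names and regexes are assumptions, and the DOI check is omitted here because resolving DOIs requires a network call.

```python
import re

# Hypothetical signatures mirroring the patterns listed above.
FABRICATION_PATTERNS = {
    # Placeholder domains presented as real sources.
    "placeholder_domain": re.compile(r"https?://(?:www\.)?example\.(?:com|org|net)"),
    # arXiv IDs are NNNN.NNNNN with an optional vN version; trailing letters
    # after the version number (e.g. 2401.12345v9z) are a fabrication tell.
    "malformed_arxiv_id": re.compile(r"\b\d{4}\.\d{4,5}v\d+[a-z]+\b"),
    # LLM-typical bare citation strings with no accompanying bibliography.
    "bare_citation": re.compile(r"\[\d+\]\s+[A-Z][a-z]+ et al\. \(\d{4}\)"),
}

def reference_failures(text: str) -> list[str]:
    """Return the names of every fabrication signature found in the text."""
    return [name for name, pat in FABRICATION_PATTERNS.items() if pat.search(text)]
```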

Output instability

Groups generations by identical input over the last 24 hours and compares outputs via Jaccard word similarity. If the same prompt produced outputs whose similarity falls below the threshold (default 0.5), the group is flagged as unstable.

This catches stochastic regressions (model upgrade, prompt drift, temperature change) without you having to write a regression test.
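A sketch of the comparison, assuming word-level Jaccard over pairwise output comparisons (the doc does not specify the exact pairing strategy, so the pairwise loop is an assumption):

```python
def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two generation outputs."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa and not wb:
        return 1.0
    return len(wa & wb) / len(wa | wb)

def is_unstable(outputs: list[str], threshold: float = 0.5) -> bool:
    """Flag a prompt group if any pair of outputs falls below the threshold."""
    return any(
        jaccard(outputs[i], outputs[j]) < threshold
        for i in range(len(outputs))
        for j in range(i + 1, len(outputs))
    )
```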

How it runs

A daily cron worker runs at 01:00 UTC, scans the previous day's generations for each project, and writes any matches to MusterHallucinationEvent. Each event captures the trace ID, hallucination type, severity, a description, and an evidence JSON column with the underlying numbers (numeric drift, URL snippets, or similarity score).
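An illustrative shape for one of those event rows. Only the fields named in the text (trace ID, type, severity, description, evidence) come from the docs; the Python field names and the timestamp default are assumptions for the sketch.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class HallucinationEvent:
    """Hypothetical mirror of a MusterHallucinationEvent row."""
    trace_id: str
    hallucination_type: str  # e.g. "arithmetic_drift" | "reference_failure" | "output_instability"
    severity: str
    description: str
    evidence: dict           # underlying numbers, e.g. {"statedTotal": 150, "computedSum": 180, "drift": 0.2}
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

event = HallucinationEvent(
    trace_id="tr_123",
    hallucination_type="arithmetic_drift",
    severity="medium",
    description="Stated total $150 disagrees with computed sum $180",
    evidence={"statedTotal": 150, "computedSum": 180, "drift": 0.2},
)
```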

Tuning thresholds

The two main knobs:

Knob                                 | Default    | Effect
arithmeticDriftMinThreshold          | 0.05 (5%)  | Lower → catch smaller errors, but more false positives from rounding
outputInstabilitySimilarityThreshold | 0.5        | Lower → only flag the most divergent outputs

Both are tunable via Auto-Instrumentation or directly in MusterProjectTuning. Workers read them at runtime via TuningCache.get() with a fallback to hardcoded defaults.

Calibration tip: start at defaults, look at the first week of flags, and dial each threshold in the direction of fewer false positives if your reviewers are spending too much time dismissing noise.

Reviewing flagged generations

In the Hallucinations tab:

  • List filters by type, severity, and active/acknowledged.
  • Each row links back to the original trace, so you can read the full prompt and output.
  • Acknowledge dismisses the flag and stamps acknowledgedBy / acknowledgedAt for audit.

Limits

  • Detection is heuristic — it will miss hallucinations that don't fit the three patterns (e.g. plausible-sounding but factually wrong claims with no math, no citations, and consistent phrasing).
  • For higher recall, layer on an LLM-as-judge eval — see the upcoming Evaluation guide.