Muster Docs

Concepts

The core data model — trace, observation, score, session, dataset.

Muster's data model comes from Langfuse upstream and stays compatible with every Langfuse SDK. Five entities cover almost everything you'll work with.

Trace

The end-to-end record of one logical request through your application. A trace groups every model call, tool call, retrieval, and event that happened while serving that request.

A trace has:

  • A unique ID
  • A name (typically the entry point, e.g. chat_completion or support_ticket_triage)
  • An input and output
  • Optional userId, sessionId, tags, metadata
  • A start and end time

Example: a user asks your support agent a question. The single trace covers input parsing, the retrieval call, the LLM call, the tool call, and the final response.
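The fields above can be sketched as a plain record. This is an illustrative shape only, not the exact wire format of the ingestion API; check your SDK for the canonical schema.

```python
from datetime import datetime, timezone
import uuid

# Illustrative trace record -- field names mirror the list above,
# not necessarily the exact ingestion-API schema.
trace = {
    "id": str(uuid.uuid4()),
    "name": "support_ticket_triage",
    "input": {"question": "How do I reset my password?"},
    "output": {"answer": "Use the reset link on the login page."},
    "userId": "user-1832",            # optional
    "sessionId": "sess-20240510-07",  # optional
    "tags": ["support", "auth"],      # optional
    "metadata": {"plan": "pro"},      # optional
    "startTime": datetime(2024, 5, 10, 9, 0, tzinfo=timezone.utc),
    "endTime": datetime(2024, 5, 10, 9, 0, 4, tzinfo=timezone.utc),
}

# Latency falls out of the start/end pair.
latency_s = (trace["endTime"] - trace["startTime"]).total_seconds()
```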

Observation

A single step inside a trace. Observations are nested — a trace contains observations, and observations can contain other observations to model parent / child structure.

Three observation types matter most:

  • Span — a generic unit of work with a duration. Use for retrieval calls, tool invocations, validation steps, anything you want to time.
  • Generation — an LLM call. Carries the model name, prompt, completion, token counts, and computed cost. Muster aggregates spend off this type.
  • Event — a point-in-time data marker without a duration. Use for decisions, branch points, or notable signals.

A typical chat trace contains one Span for retrieval, one Generation for the LLM call, and zero or more Event observations recording the decisions in between.
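A minimal sketch of that chat trace as data. Field names here (`type`, `parentObservationId`, `cost`) are hypothetical, chosen to illustrate two properties from the text: observations nest via a parent pointer, and spend rolls up from Generation observations only.

```python
# Hypothetical field names, for illustration only.
observations = [
    {"id": "obs-1", "type": "SPAN", "name": "retrieval",
     "parentObservationId": None, "durationMs": 120},
    {"id": "obs-2", "type": "GENERATION", "name": "answer",
     "parentObservationId": None, "model": "gpt-4",
     "usage": {"totalTokens": 850}, "cost": 0.013},
    {"id": "obs-3", "type": "EVENT", "name": "route_decision",
     "parentObservationId": "obs-2"},  # nested under the generation
]

# Spend aggregates from Generation observations only.
trace_cost = sum(o.get("cost", 0) for o in observations
                 if o["type"] == "GENERATION")

# Parent/child structure comes from the parent pointer.
children = [o for o in observations if o["parentObservationId"] == "obs-2"]
```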

Score

A label attached to a trace or observation. Scores power evaluation, quality monitoring, and human review.

Scores have a value (numeric, categorical, boolean, or correction) and a source. Three sources:

  • API — your code computed the score and pushed it via the SDK
  • EVAL — a Muster evaluator computed it automatically (e.g. an LLM-as-judge job)
  • ANNOTATION — a human reviewed the trace and entered the score by hand

Typical scores: relevance (0-1), hallucination_detected (boolean), user_feedback (thumbs up / thumbs down).
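As a sketch, here are those three typical scores as records (illustrative field names), plus the kind of filter a human-review queue would run on the `source` field:

```python
# Illustrative score records using the names and sources listed above.
scores = [
    {"name": "relevance", "value": 0.92,
     "dataType": "NUMERIC", "source": "EVAL"},
    {"name": "hallucination_detected", "value": False,
     "dataType": "BOOLEAN", "source": "API"},
    {"name": "user_feedback", "value": "thumbs_up",
     "dataType": "CATEGORICAL", "source": "ANNOTATION"},
]

# e.g. pull only the human-entered scores for a review queue
human_scores = [s for s in scores if s["source"] == "ANNOTATION"]
```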

Session

A sessionId you set on multiple traces to group them together. The most common case is a multi-turn conversation: every turn is its own trace, but they all share a session ID, so you can replay the whole interaction.

Sessions are also useful for batch jobs (one session per job run) and experiments (one session per experiment iteration).
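Grouping is all a session is; a sketch with minimal trace stubs (only the fields needed here):

```python
from collections import defaultdict

# Minimal trace stubs -- a session is just a shared sessionId.
traces = [
    {"id": "t1", "sessionId": "conv-42", "name": "chat_turn"},
    {"id": "t2", "sessionId": "conv-42", "name": "chat_turn"},
    {"id": "t3", "sessionId": "batch-2024-05-10", "name": "batch_item"},
]

# Replaying a session = collecting its traces in order.
sessions = defaultdict(list)
for t in traces:
    sessions[t["sessionId"]].append(t["id"])
```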

Dataset

A curated set of input / expected-output pairs used for testing. Datasets are how you evaluate a prompt change or model upgrade against a stable benchmark instead of eyeballing live traffic.

A dataset contains items. Each item has an input, an expected output, and optional metadata. You run the dataset against a target (a prompt, a model, your full agent), Muster captures the actual output as a trace, and you score the result.
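The run-and-score loop can be sketched like this. The dataset items, the `target` stand-in, and the exact-match scorer are all hypothetical; in Muster the `target(...)` call would be captured as a trace.

```python
# Hypothetical dataset items: input + expected output pairs.
dataset = [
    {"input": "2 + 2", "expectedOutput": "4"},
    {"input": "capital of France", "expectedOutput": "Paris"},
]

def target(item_input):
    # Stand-in for your prompt / model / full agent under test.
    answers = {"2 + 2": "4", "capital of France": "Lyon"}
    return answers[item_input]

results = []
for item in dataset:
    actual = target(item["input"])  # captured as a trace in Muster
    results.append({
        "input": item["input"],
        "actual": actual,
        # exact-match scorer; real runs would attach richer scores
        "score": 1.0 if actual == item["expectedOutput"] else 0.0,
    })

accuracy = sum(r["score"] for r in results) / len(results)
```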

How they fit together

Project
  └── Trace (one per request)
        ├── Observation: Span (retrieval, 120ms)
        ├── Observation: Generation (gpt-4, 1.2s, $0.013, 850 tokens)
        ├── Observation: Event (chose route: "specialist")
        └── Score: relevance = 0.92 (source: EVAL)

              optional: also linked to a session ID
              optional: trace input/output also became a dataset item
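The same tree, rendered as one nested record (illustrative field names again), which makes the roll-ups concrete: cost sums over the observations, and scores hang off the trace.

```python
# The diagram above as a single nested record; field names are illustrative.
trace = {
    "name": "support_request",
    "sessionId": "conv-42",  # the optional session link
    "observations": [
        {"type": "SPAN", "name": "retrieval", "durationMs": 120},
        {"type": "GENERATION", "name": "llm_call", "model": "gpt-4",
         "latencyS": 1.2, "cost": 0.013, "usage": {"totalTokens": 850}},
        {"type": "EVENT", "name": "route_decision",
         "metadata": {"route": "specialist"}},
    ],
    "scores": [
        {"name": "relevance", "value": 0.92, "source": "EVAL"},
    ],
}

trace_cost = sum(o.get("cost", 0) for o in trace["observations"])
relevance = next(s["value"] for s in trace["scores"]
                 if s["name"] == "relevance")
```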

Once you have this mental model, the rest of Muster — anomaly detection, cost aggregation, agent inventory — is "what we do with the traces flowing in."