Muster Docs

Concepts

The core data model — trace, observation, score, session, dataset.

Muster's data model comes from Langfuse upstream and stays compatible with every Langfuse SDK. Five entities cover almost everything you'll work with.

Trace

The end-to-end record of one logical request through your application. A trace groups every model call, tool call, retrieval, and event that happened while serving that request.

A trace has:

  • A unique ID
  • A name (typically the entry point, e.g. chat_completion or support_ticket_triage)
  • An input and output
  • Optional userId, sessionId, tags, metadata
  • A start and end time

Example: a user asks your support agent a question. The single trace covers input parsing, the retrieval call, the LLM call, the tool call, and the final response.
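The fields above can be sketched as a plain record. This is an illustrative shape only, not the exact wire format of the ingestion API; check your SDK for the canonical schema.

```python
from datetime import datetime, timezone
import uuid

# Illustrative trace record -- field names mirror the list above,
# not necessarily the exact ingestion-API schema.
trace = {
    "id": str(uuid.uuid4()),
    "name": "support_ticket_triage",
    "input": {"question": "How do I reset my password?"},
    "output": {"answer": "Use the reset link on the login page."},
    "userId": "user-1832",            # optional
    "sessionId": "sess-20240510-07",  # optional
    "tags": ["support", "auth"],      # optional
    "metadata": {"plan": "pro"},      # optional
    "startTime": datetime(2024, 5, 10, 9, 0, tzinfo=timezone.utc),
    "endTime": datetime(2024, 5, 10, 9, 0, 4, tzinfo=timezone.utc),
}

# Latency falls out of the start/end pair.
latency_s = (trace["endTime"] - trace["startTime"]).total_seconds()
```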

Observation

A single step inside a trace. Observations are nested — a trace contains observations, and observations can contain other observations to model parent / child structure.

Three observation types matter most:

  • Span — a generic unit of work with a duration. Use for retrieval calls, tool invocations, validation steps, anything you want to time.
  • Generation — an LLM call. Carries the model name, prompt, completion, token counts, and computed cost. Muster aggregates spend off this type.
  • Event — a point-in-time data marker without a duration. Use for decisions, branch points, or notable signals.

A typical chat trace contains one Span for retrieval, one Generation for the LLM call, and zero or more Event observations recording the decisions in between.
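A minimal sketch of that chat trace as data. Field names here (`type`, `parentObservationId`, `cost`) are hypothetical, chosen to illustrate two properties from the text: observations nest via a parent pointer, and spend rolls up from Generation observations only.

```python
# Hypothetical field names, for illustration only.
observations = [
    {"id": "obs-1", "type": "SPAN", "name": "retrieval",
     "parentObservationId": None, "durationMs": 120},
    {"id": "obs-2", "type": "GENERATION", "name": "answer",
     "parentObservationId": None, "model": "gpt-4",
     "usage": {"totalTokens": 850}, "cost": 0.013},
    {"id": "obs-3", "type": "EVENT", "name": "route_decision",
     "parentObservationId": "obs-2"},  # nested under the generation
]

# Spend aggregates from Generation observations only.
trace_cost = sum(o.get("cost", 0) for o in observations
                 if o["type"] == "GENERATION")

# Parent/child structure comes from the parent pointer.
children = [o for o in observations if o["parentObservationId"] == "obs-2"]
```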

Score

A label attached to a trace or observation. Scores power evaluation, quality monitoring, and human review.

Scores have a value (numeric, categorical, boolean, or correction) and a source. Three sources:

  • API — your code computed the score and pushed it via the SDK
  • EVAL — a Muster evaluator computed it automatically (e.g. an LLM-as-judge job)
  • ANNOTATION — a human reviewed the trace and entered the score by hand

Typical scores: relevance (0-1), hallucination_detected (boolean), user_feedback (thumbs up / thumbs down).
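As a sketch, here are those three typical scores as records (illustrative field names), plus the kind of filter a human-review queue would run on the `source` field:

```python
# Illustrative score records using the names and sources listed above.
scores = [
    {"name": "relevance", "value": 0.92,
     "dataType": "NUMERIC", "source": "EVAL"},
    {"name": "hallucination_detected", "value": False,
     "dataType": "BOOLEAN", "source": "API"},
    {"name": "user_feedback", "value": "thumbs_up",
     "dataType": "CATEGORICAL", "source": "ANNOTATION"},
]

# e.g. pull only the human-entered scores for a review queue
human_scores = [s for s in scores if s["source"] == "ANNOTATION"]
```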

Session

A sessionId you set on multiple traces to group them together. The most common case is a multi-turn conversation: every turn is its own trace, but they all share a session ID, so you can replay the whole interaction.

Sessions are also useful for batch jobs (one session per job run) and experiments (one session per experiment iteration).
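Grouping is all a session is; a sketch with minimal trace stubs (only the fields needed here):

```python
from collections import defaultdict

# Minimal trace stubs -- a session is just a shared sessionId.
traces = [
    {"id": "t1", "sessionId": "conv-42", "name": "chat_turn"},
    {"id": "t2", "sessionId": "conv-42", "name": "chat_turn"},
    {"id": "t3", "sessionId": "batch-2024-05-10", "name": "batch_item"},
]

# Replaying a session = collecting its traces in order.
sessions = defaultdict(list)
for t in traces:
    sessions[t["sessionId"]].append(t["id"])
```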

Dataset

A curated set of input / expected-output pairs used for testing. Datasets are how you evaluate a prompt change or model upgrade against a stable benchmark instead of eyeballing live traffic.

A dataset contains items. Each item has an input, an expected output, and optional metadata. You run the dataset against a target (a prompt, a model, your full agent), Muster captures the actual output as a trace, and you score the result.
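The run-and-score loop can be sketched like this. The dataset items, the `target` stand-in, and the exact-match scorer are all hypothetical; in Muster the `target(...)` call would be captured as a trace.

```python
# Hypothetical dataset items: input + expected output pairs.
dataset = [
    {"input": "2 + 2", "expectedOutput": "4"},
    {"input": "capital of France", "expectedOutput": "Paris"},
]

def target(item_input):
    # Stand-in for your prompt / model / full agent under test.
    answers = {"2 + 2": "4", "capital of France": "Lyon"}
    return answers[item_input]

results = []
for item in dataset:
    actual = target(item["input"])  # captured as a trace in Muster
    results.append({
        "input": item["input"],
        "actual": actual,
        # exact-match scorer; real runs would attach richer scores
        "score": 1.0 if actual == item["expectedOutput"] else 0.0,
    })

accuracy = sum(r["score"] for r in results) / len(results)
```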

How they fit together

Project
  └── Trace (one per request)
        ├── Observation: Span (retrieval, 120ms)
        ├── Observation: Generation (gpt-4, 1.2s, $0.013, 850 tokens)
        ├── Observation: Event (chose route: "specialist")
        └── Score: relevance = 0.92 (source: EVAL)

              optional: also linked to a session ID
              optional: trace input/output also became a dataset item
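The same tree, rendered as one nested record (illustrative field names again), which makes the roll-ups concrete: cost sums over the observations, and scores hang off the trace.

```python
# The diagram above as a single nested record; field names are illustrative.
trace = {
    "name": "support_request",
    "sessionId": "conv-42",  # the optional session link
    "observations": [
        {"type": "SPAN", "name": "retrieval", "durationMs": 120},
        {"type": "GENERATION", "name": "llm_call", "model": "gpt-4",
         "latencyS": 1.2, "cost": 0.013, "usage": {"totalTokens": 850}},
        {"type": "EVENT", "name": "route_decision",
         "metadata": {"route": "specialist"}},
    ],
    "scores": [
        {"name": "relevance", "value": 0.92, "source": "EVAL"},
    ],
}

trace_cost = sum(o.get("cost", 0) for o in trace["observations"])
relevance = next(s["value"] for s in trace["scores"]
                 if s["name"] == "relevance")
```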

Once you have this mental model, the rest of Muster — anomaly detection, cost aggregation, agent inventory — is "what we do with the traces flowing in."