Concepts
The core data model — trace, observation, score, session, dataset.
Muster's data model comes from Langfuse upstream and stays compatible with every Langfuse SDK. Five entities cover almost everything you'll work with.
Trace
The end-to-end record of one logical request through your application. A trace groups every model call, tool call, retrieval, and event that happened while serving that request.
A trace has:
- A unique ID
- A name (typically the entry point, e.g. `chat_completion` or `support_ticket_triage`)
- An input and output
- Optional `userId`, `sessionId`, `tags`, `metadata`
- A start and end time
Example: a user asks your support agent a question. The single trace covers input parsing, the retrieval call, the LLM call, the tool call, and the final response.
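As a rough sketch, the trace fields above can be modeled like this. This is an illustrative data structure, not the actual SDK types; the class and field names are assumptions chosen to mirror the bullet list.

```python
from __future__ import annotations
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Trace:
    """Illustrative model of a trace; fields mirror the list above."""
    id: str
    name: str                          # typically the entry point
    input: dict
    output: dict | None = None
    user_id: str | None = None
    session_id: str | None = None
    tags: list[str] = field(default_factory=list)
    metadata: dict = field(default_factory=dict)
    start_time: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    end_time: datetime | None = None

# One trace per logical request, e.g. the support-agent question above:
trace = Trace(
    id="trace-001",
    name="support_ticket_triage",
    input={"question": "How do I reset my password?"},
    user_id="user-42",
    tags=["support"],
)
trace.output = {"answer": "Use the reset link on the login page."}
trace.end_time = datetime.now(timezone.utc)
```

Everything that happened while serving this request (retrieval, LLM call, tool call) would hang off this one trace as observations.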
Observation
A single step inside a trace. Observations are nested — a trace contains observations, and observations can contain other observations to model parent / child structure.
Three observation types matter most:
- Span — a generic unit of work with a duration. Use for retrieval calls, tool invocations, validation steps, anything you want to time.
- Generation — an LLM call. Carries the model name, prompt, completion, token counts, and computed cost. Muster aggregates spend off this type.
- Event — a point-in-time data marker without a duration. Use for decisions, branch points, or notable signals.
A typical chat trace contains one Span for retrieval, one Generation for the LLM call, and zero or more Event observations recording the decisions in between.
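The nesting and the cost roll-up described above can be sketched with a toy tree. The types and field names here are illustrative assumptions, not the SDK's; the point is that Spans and Generations have durations, Events do not, and spend aggregates from Generations up the tree.

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class Observation:
    """Illustrative observation node: Span, Generation, or Event."""
    type: str                          # "SPAN" | "GENERATION" | "EVENT"
    name: str
    duration_ms: float | None = None   # Events are point-in-time: no duration
    cost_usd: float = 0.0              # only Generations carry cost
    children: list[Observation] = field(default_factory=list)

def total_cost(obs: Observation) -> float:
    """Aggregate spend across a subtree, the way cost rolls up from Generations."""
    return obs.cost_usd + sum(total_cost(c) for c in obs.children)

# A typical chat trace: one Span for retrieval, one Generation, one Event.
root = Observation("SPAN", "handle_request", duration_ms=1500, children=[
    Observation("SPAN", "retrieval", duration_ms=120),
    Observation("GENERATION", "llm_call", duration_ms=1200, cost_usd=0.013),
    Observation("EVENT", "chose_route"),  # point-in-time decision marker
])

print(total_cost(root))  # → 0.013
```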
Score
A label attached to a trace or observation. Scores power evaluation, quality monitoring, and human review.
Scores have a value (numeric, categorical, boolean, or correction) and a source. Three sources:
- API — your code computed the score and pushed it via the SDK
- EVAL — a Muster evaluator computed it automatically (e.g. an LLM-as-judge job)
- ANNOTATION — a human reviewed the trace and entered the score by hand
Typical scores: `relevance` (0-1), `hallucination_detected` (boolean), `user_feedback` (thumbs up / thumbs down).
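A score is essentially a named value plus a source. As an illustrative sketch (these class and field names are assumptions, not the SDK's):

```python
from __future__ import annotations
from dataclasses import dataclass
from enum import Enum

class ScoreSource(Enum):
    API = "API"                # your code pushed it via the SDK
    EVAL = "EVAL"              # an evaluator computed it automatically
    ANNOTATION = "ANNOTATION"  # a human entered it by hand

@dataclass
class Score:
    trace_id: str
    name: str
    value: float | bool | str
    source: ScoreSource

# The three typical scores above, each arriving from a different source:
scores = [
    Score("trace-001", "relevance", 0.92, ScoreSource.EVAL),
    Score("trace-001", "hallucination_detected", False, ScoreSource.API),
    Score("trace-001", "user_feedback", "thumbs_up", ScoreSource.ANNOTATION),
]
```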
Session
A `sessionId` you set on multiple traces to group them together. The most
common case is a multi-turn conversation: every turn is its own trace, but
they all share a session ID, so you can replay the whole interaction.
Sessions are also useful for batch jobs (one session per job run) and experiments (one session per experiment iteration).
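Grouping by session is nothing more than bucketing traces on a shared ID. A minimal sketch, with plain dicts standing in for traces:

```python
from collections import defaultdict

# Each dict stands in for a trace; only the fields relevant here.
traces = [
    {"id": "t1", "session_id": "conv-9", "name": "chat_turn"},
    {"id": "t2", "session_id": "conv-9", "name": "chat_turn"},
    {"id": "t3", "session_id": "batch-2024-06-01", "name": "batch_item"},
]

sessions: dict[str, list[dict]] = defaultdict(list)
for trace in traces:
    sessions[trace["session_id"]].append(trace)

# Replay a whole multi-turn conversation by iterating its session bucket:
conversation = sessions["conv-9"]
print([t["id"] for t in conversation])  # → ['t1', 't2']
```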
Dataset
A curated set of input / expected-output pairs used for testing. Datasets are how you evaluate a prompt change or model upgrade against a stable benchmark instead of eyeballing live traffic.
A dataset contains items. Each item has an input, an expected output, and optional metadata. You run the dataset against a target (a prompt, a model, your full agent), Muster captures the actual output as a trace, and you score the result.
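The run loop described above (feed each item's input to a target, capture the actual output, score the result) can be sketched like this. All names here are hypothetical, and the stand-in target is trivially deterministic; in a real run the SDK would capture each output as a trace.

```python
from dataclasses import dataclass

@dataclass
class DatasetItem:
    input: str
    expected_output: str

def target(prompt: str) -> str:
    """Stand-in for the thing under test (a prompt, a model, your full agent)."""
    return prompt.upper()  # deterministic placeholder behavior

dataset = [
    DatasetItem("hello", "HELLO"),
    DatasetItem("reset password", "RESET PASSWORD"),
]

# Run every item against the target and score with exact match.
results = []
for item in dataset:
    actual = target(item.input)
    results.append({"input": item.input, "actual": actual,
                    "exact_match": actual == item.expected_output})

accuracy = sum(r["exact_match"] for r in results) / len(results)
print(accuracy)  # → 1.0
```

Because the dataset is a stable benchmark, re-running this loop after a prompt change or model upgrade gives a directly comparable accuracy number.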
How they fit together
Project
└── Trace (one per request)
├── Observation: Span (retrieval, 120ms)
├── Observation: Generation (gpt-4, 1.2s, $0.013, 850 tokens)
├── Observation: Event (chose route: "specialist")
└── Score: relevance = 0.92 (source: EVAL)
↑
optional: also linked to a session ID
    optional: trace input/output also became a dataset item

Once you have this mental model, the rest of Muster — anomaly detection, cost aggregation, agent inventory — is "what we do with the traces flowing in."