Cleanlab TLM
Score Muster traces with Cleanlab's Trustworthy Language Model to flag low-quality and hallucinated responses.
Cleanlab TLM (Trustworthy Language Model) analyses LLM outputs and produces a 0-1 trustworthiness score plus an explanation. Pair it with Muster traces to flag low-quality or hallucinated responses in production without manual review.
Setup
%pip install langfuse openai cleanlab-tlm

Configure the four required keys:

import os
os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-lf-..."
os.environ["LANGFUSE_SECRET_KEY"] = "sk-lf-..."
os.environ["LANGFUSE_BASE_URL"] = "https://app.getmuster.io"
os.environ["OPENAI_API_KEY"] = "sk-..."
os.environ["CLEANLAB_TLM_API_KEY"] = "..."

Workflow
- Generate traces in Muster while calling LLMs: wrap your code with the @observe() decorator.
- Fetch traces from Muster using the SDK's fetch_traces() API with filtering options.
- Evaluate with TLM: call get_trustworthiness_score() on each prompt-response pair.
- Upload the scores back to Muster via langfuse.create_score(...).

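The evaluation step needs each trace as a pair of plain strings, but a trace's input and output may be stored as structured JSON, for example a list of chat messages. Below is a minimal sketch of a flattening helper, assuming OpenAI-style role/content message dicts. The name to_text and the fallback behaviour are illustrative assumptions, not part of the Muster or Cleanlab SDKs.

```python
import json
from typing import Any


def to_text(payload: Any) -> str:
    """Flatten a trace input/output payload into a plain string (hypothetical helper)."""
    if payload is None:
        return ""
    if isinstance(payload, str):
        return payload
    # Assumed shape: a list of OpenAI-style {"role": ..., "content": ...} dicts.
    if isinstance(payload, list) and all(
        isinstance(m, dict) and "content" in m for m in payload
    ):
        return "\n".join(f"{m.get('role', 'user')}: {m['content']}" for m in payload)
    # Anything else: fall back to a stable JSON dump.
    return json.dumps(payload, sort_keys=True, default=str)
```

With a helper like this, the scoring call could take prompt=to_text(trace.input) and response=to_text(trace.output) instead of the raw fields.
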
from langfuse import get_client
from cleanlab_tlm import TLM

langfuse = get_client()
# Ask TLM to return an explanation with each score; it arrives under the "log" key.
tlm = TLM(options={"log": ["explanation"]})

traces = langfuse.fetch_traces(limit=50).data
for trace in traces:
    # The result is a dict, so fields are read by key rather than by attribute.
    score = tlm.get_trustworthiness_score(
        prompt=trace.input,
        response=trace.output,
    )
    langfuse.create_score(
        trace_id=trace.id,
        name="cleanlab-trustworthiness",
        value=score["trustworthiness_score"],
        comment=score["log"]["explanation"],
    )

The resulting score is filterable in the Muster UI like any other score: slice traces by score range, build evaluators around it, or alert on drops.
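
The same kind of slicing can also be done in code, for instance to collect trace IDs for follow-up. Below is a minimal sketch, assuming scores are held as (trace_id, score) pairs. The helper name flag_low_trust and the 0.7 cutoff are illustrative assumptions, not a Cleanlab or Muster recommendation.

```python
from typing import Iterable, List, Optional, Tuple


def flag_low_trust(
    scored: Iterable[Tuple[str, Optional[float]]],
    threshold: float = 0.7,  # illustrative cutoff; tune on your own data
) -> List[str]:
    """Return trace IDs whose trustworthiness score falls below the threshold."""
    flagged = []
    for trace_id, score in scored:
        # Skip traces that could not be scored.
        if score is None:
            continue
        if score < threshold:
            flagged.append(trace_id)
    return flagged


results = [("trace-a", 0.93), ("trace-b", 0.41), ("trace-c", None)]
print(flag_low_trust(results))  # ['trace-b']
```

A sensible cutoff depends on the application, so it is worth comparing a sample of flagged traces against human judgment before alerting on it.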