Ragas

Evaluate RAG pipelines with Ragas and feed scores back into Muster traces.

Ragas is an open-source tool for model-based evaluation of Retrieval-Augmented Generation (RAG) pipelines. Together with Muster's tracing, scoring, and analytics, you get a complete loop from instrumentation to evaluation to reporting.

What you can do

Ragas:

Generate synthetic test sets for pipeline assessment.
Evaluate without ground-truth data (reference-free evaluation).
Measure performance with metrics such as faithfulness, answer relevancy, and context precision.
Optimize custom prompts with automatic adaptation.
Integrate evaluations into CI/CD pipelines via Pytest.
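As one way to picture the Pytest integration, the sketch below gates a build on minimum metric values. It is a minimal illustration, not Ragas's own API: the helper `find_threshold_failures`, the threshold values, and the hard-coded scores are all hypothetical, and in a real suite the scores would come from an actual Ragas run.

```python
from typing import Dict, List


def find_threshold_failures(
    scores: Dict[str, float], thresholds: Dict[str, float]
) -> List[str]:
    """Return one message per metric that is missing or below its minimum."""
    failures = []
    for metric, minimum in thresholds.items():
        value = scores.get(metric)
        if value is None:
            failures.append(f"{metric}: missing")
        elif value < minimum:
            failures.append(f"{metric}: {value:.2f} < {minimum:.2f}")
    return failures


def test_rag_quality_gate():
    # Hypothetical aggregate scores; replace with the output of a Ragas run.
    scores = {"faithfulness": 0.91, "answer_relevancy": 0.84}
    # Hypothetical minimums chosen for illustration only.
    thresholds = {"faithfulness": 0.80, "answer_relevancy": 0.75}
    assert find_threshold_failures(scores, thresholds) == []
```

Returning a list of messages rather than a single boolean means a failing CI run reports every metric that regressed, not just the first one.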

Muster:

Span- and trace-level scoring.
Segmentation and analytics to identify performance gaps.
Detailed reporting per use case and user segment.
Integrations with OpenAI, LangChain, LlamaIndex, and more.
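To illustrate the kind of question segment-level analytics answers, the sketch below averages one score per segment locally and picks the weakest segment. This is not Muster's implementation: the record fields (`segment`, `faithfulness`), the helper name, and the sample values are hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Mapping


def mean_score_by_segment(
    records: Iterable[Mapping], score_key: str = "faithfulness"
) -> Dict[str, float]:
    """Group records by their 'segment' field and average the chosen score."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["segment"]].append(record[score_key])
    return {segment: mean(values) for segment, values in grouped.items()}


# Hypothetical scored traces, for illustration only.
records = [
    {"segment": "free", "faithfulness": 0.5},
    {"segment": "free", "faithfulness": 1.0},
    {"segment": "enterprise", "faithfulness": 1.0},
]

averages = mean_score_by_segment(records)
# The segment with the lowest average is the first place to look for gaps.
weakest = min(averages, key=averages.get)
```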

How it fits together

Typical workflow:

Instrument your RAG pipeline with Muster (via the OpenAI integration, LangChain callback, or LlamaIndex instrumentor).
Run a Ragas evaluation against a sample of your traces or a dedicated test set.
Push Ragas scores back to the corresponding Muster traces using langfuse.create_score(...).
Break down the resulting score distribution by use case and user segment in Muster's analytics.

from langfuse import get_client

langfuse = get_client()

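# ragas_results is assumed to map each trace ID to a dict of Ragas metric
# values, e.g. {"<trace-id>": {"faithfulness": 0.9}}; build it from your own run.
for trace_id, ragas_result in ragas_results.items():
    # Attach the faithfulness value to the trace it was computed for.
    langfuse.create_score(
        trace_id=trace_id,
        name="ragas-faithfulness",
        value=ragas_result["faithfulness"],
    )

# Send any buffered scores before the script exits.
langfuse.flush()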
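If a run produces several metrics per trace, the same pattern extends to one score per metric. The sketch below builds the keyword arguments for each call and skips values that are missing or NaN, on the assumption that a metric can fail for an individual sample. The helper `build_score_payloads` and the `ragas-` prefix convention are illustrative, not part of either library.

```python
import math
from typing import Any, Dict, List, Mapping, Optional


def build_score_payloads(
    trace_id: str,
    metrics: Mapping[str, Optional[float]],
    prefix: str = "ragas-",
) -> List[Dict[str, Any]]:
    """Turn one trace's metric dict into keyword arguments for score creation."""
    payloads = []
    for metric_name, value in metrics.items():
        # Skip metrics that are absent or could not be computed (NaN).
        if value is None or (isinstance(value, float) and math.isnan(value)):
            continue
        payloads.append(
            {
                "trace_id": trace_id,
                "name": f"{prefix}{metric_name}",
                "value": float(value),
            }
        )
    return payloads
```

Inside the loop above, each payload could then be sent with `langfuse.create_score(**payload)`, which keeps the filtering logic testable without a network connection.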