Natural Language Processing & Audio (incl. Generative AI NLP)

Beyond Benchmarks: Practical Evaluation Strategies for Compound AI Systems

Talk

Level: Intermediate

Company/Institute: DataForce Solutions GmbH

Abstract

Evaluating large language models (LLMs) in real-world applications goes far beyond standard benchmarks. When LLMs are embedded in complex pipelines, choosing the right models, prompts, and parameters becomes an ongoing challenge. In this talk, we will present a practical, human-in-the-loop evaluation framework that enables systematic improvement of LLM-powered systems based on expert feedback. By combining domain expert insights and automated evaluation methods, it is possible to iteratively refine these systems while building transparency and trust. This talk will be valuable for anyone who wants to ensure their LLM applications can handle real-world complexity - not just perform well on generic benchmarks.

Prerequisites

Attendees are assumed to be familiar with LLMs and machine learning workflows; deep NLP expertise is not required.

Description

As large language models become integral to real-world applications, evaluating and improving their performance is a growing challenge. Generic benchmarks and simple metrics fail to adequately assess the domain-specific, multi-step reasoning required by compound AI pipelines such as retrieval-augmented generation (RAG), multi-tool agents, or knowledge assistants. Moreover, manual evaluation of every step is infeasible at scale, while fully automated LLM-as-a-judge approaches lack critical domain insights.

In this talk, we will present a practical evaluation approach to enable continuous improvement of LLM-powered systems. It incorporates the following stages:
- Automatic tracing (e.g. with MLflow or Langfuse): capturing input/output pairs across the pipeline to build an evaluation dataset (see the first sketch after this list).
- Expert feedback collection: working with subject-matter experts and drawing on user interactions to assess correctness and identify failure points.
- Iterative improvement cycle: tuning components and/or optimizing prompts (using frameworks such as DSPy or TextGrad).
- Degradation tests: turning feedback into automated evaluation tests - ranging from exact-match checks to LLM-as-a-judge assertions (e.g. with DeepEval) - to guard against regressions (see the second sketch after this list).
- Continuous monitoring: using the growing evaluation dataset to validate the system as models, tools, or data sources evolve.
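To make the tracing stage concrete, below is a minimal, library-free sketch of what such instrumentation does: a decorator that records each step's input/output pair to a JSONL file. The trace decorator and the retrieve/answer functions are hypothetical stand-ins; in practice, tools like MLflow or Langfuse provide this capture (plus a UI on top) out of the box.

import functools
import json
from pathlib import Path

# Hypothetical trace sink; tools like MLflow or Langfuse manage this storage for you.
TRACE_FILE = Path("traces.jsonl")

def trace(step_name: str):
    """Record each call's input/output pair so it can later seed the evaluation dataset."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            record = {"step": step_name, "inputs": {"args": args, "kwargs": kwargs}, "output": result}
            with TRACE_FILE.open("a") as f:
                f.write(json.dumps(record, default=str) + "\n")
            return result
        return wrapper
    return decorator

@trace("retrieval")
def retrieve(query: str) -> list[str]:
    """Hypothetical retrieval step of a RAG pipeline."""
    return ["Berlin is the capital of Germany."]

@trace("generation")
def answer(query: str) -> str:
    """Hypothetical generation step that consumes the retrieved context."""
    context = retrieve(query)
    return f"Based on: {context[0]}"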
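The degradation-test stage can be sketched in a similarly minimal way: expert-reviewed pairs become an automated regression test. Here run_pipeline, FeedbackRecord, and the exact-match check are hypothetical placeholders; a real setup would collect the pairs from the traces above and, for open-ended outputs, replace exact matching with judge-based metrics such as those provided by DeepEval.

from dataclasses import dataclass

@dataclass
class FeedbackRecord:
    """One expert-reviewed input/output pair collected from pipeline traces."""
    question: str
    expected_answer: str

def run_pipeline(question: str) -> str:
    """Hypothetical stand-in for the compound AI pipeline (RAG, agent, etc.)."""
    return "Berlin"

def exact_match(expected: str, actual: str) -> bool:
    """Simplest possible check; open-ended outputs would need an LLM-as-a-judge metric instead."""
    return expected.strip().lower() == actual.strip().lower()

# Expert feedback accumulated over time becomes the evaluation dataset.
EVAL_SET = [
    FeedbackRecord(question="What is the capital of Germany?", expected_answer="Berlin"),
]

def test_no_regressions():
    """Re-run whenever models, prompts, or tools change to catch regressions."""
    for record in EVAL_SET:
        actual = run_pipeline(record.question)
        assert exact_match(record.expected_answer, actual), (
            f"Regression on {record.question!r}: expected {record.expected_answer!r}, got {actual!r}"
        )

Each new piece of expert feedback extends the evaluation set, so the test suite grows alongside the system and doubles as the dataset for continuous monitoring.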

This framework ensures that LLM applications remain reliable and aligned with specific business needs over time.

Target audience: AI practitioners developing and maintaining LLM-based applications.

Attendees will learn strategies to:
- Build a human-in-the-loop evaluation process combining expert feedback and automated methods.
- Turn expert knowledge into automated tests to guard against regressions.
- Use iterative improvement cycles to refine LLM pipelines over time.


Speakers

Iryna Kondrashchenko


Iryna is a data scientist and co-founder of DataForce Solutions GmbH, a company specializing in end-to-end data science and AI services. She contributes to several open-source libraries and strongly believes that open-source products foster a more inclusive tech industry, equipping individuals and organizations with the tools needed to innovate and compete.


Oleh Kostromin

I am a data scientist primarily focused on deep learning and MLOps. In my spare time, I contribute to several open-source Python libraries.
