Evaluation harness for domain-specific RAG and QA systems with benchmark datasets, scoring, and regression workflows.
benchmarking domain-qa retrieval-augmented-generation llm-evaluation rag-evaluation evaluation-harness ai-evals
Updated Mar 8, 2026 - Python