It really depends on the objective, and by nature it’s largely subjective.
I write my own test cases for each specific RAG system and run them all at once as a batch. There are also public QA datasets available online, so those are another option. For example, MedQA for medical Q&A.
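As a rough illustration of running custom test cases in one batch, here is a minimal sketch. Everything in it is an assumption for the example: `TestCase`, `run_suite`, `pass_rate`, and the `fake_rag` stand-in are hypothetical names, the two questions are toy data, and keyword matching is just one simple scoring choice (it is not how MedQA or any particular tool scores answers). You would swap `fake_rag` for a call into your own RAG pipeline.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class TestCase:
    """One hand-written test: a question plus terms a good answer should mention."""
    question: str
    expected_keywords: List[str]


def run_suite(answer_fn: Callable[[str], str], cases: List[TestCase]) -> List[Dict]:
    """Run every test case through answer_fn and record which keywords are missing."""
    results = []
    for case in cases:
        answer = answer_fn(case.question)
        lowered = answer.lower()
        missing = [kw for kw in case.expected_keywords if kw.lower() not in lowered]
        results.append({
            "question": case.question,
            "answer": answer,
            "passed": not missing,  # passes only if every keyword appears
            "missing": missing,
        })
    return results


def pass_rate(results: List[Dict]) -> float:
    """Fraction of test cases that passed (0.0 for an empty suite)."""
    if not results:
        return 0.0
    return sum(1 for r in results if r["passed"]) / len(results)


def fake_rag(question: str) -> str:
    """Hypothetical stand-in for a real RAG system; replace with your own call."""
    canned = {
        "What is the usual first-line drug for type 2 diabetes?":
            "Metformin is usually the first-line treatment.",
    }
    return canned.get(question, "I don't know.")


# Toy test cases, made up for this example.
cases = [
    TestCase("What is the usual first-line drug for type 2 diabetes?", ["metformin"]),
    TestCase("Which vitamin deficiency causes scurvy?", ["vitamin C"]),
]

results = run_suite(fake_rag, cases)
for r in results:
    status = "PASS" if r["passed"] else "FAIL (missing: " + ", ".join(r["missing"]) + ")"
    print(status, "-", r["question"])
print("pass rate:", pass_rate(results))
```

Keyword matching is crude, and since the right metric is subjective, it may make sense to replace the check inside `run_suite` with whatever fits the objective (exact match, a similarity score, or a human review step).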
I think TruEra offers this kind of agent-evaluation SaaS, but honestly there are a lot of similar SaaS tools out there.