I am in the process of building a GenAI tool for Support Agents to interact and find answers quickly. To ensure the accuracy and quality is maintained over time, does anyone have any recommendations on tools or frameworks to evaluate quality and accuracy for LLMs/GenAI tools?

VP Support Readiness in IT Services · 10 months ago

I was the original poster (OP). We have built our own solution, which currently integrates with Slack and will soon be built into SFDC. We currently measure quality via human feedback, but I also want to validate this with some level of automation. I have started on a framework, but do not yet have specific tools in mind for combining human review with automation and presenting the results in a way that is easy to consume and surfaces where deviations occur. Like some of you, we are refining as we go.
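One minimal way to combine human and automated signals and surface deviations is to normalize both onto the same scale and flag answers where they disagree. This is only a sketch of that idea; the `Answer` fields and the 0.3 threshold are illustrative assumptions, not part of any particular framework:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Answer:
    question: str
    response: str
    human_rating: Optional[float]  # 1-5 star rating from agent feedback; None if unrated
    auto_score: float              # 0-1 score from an automated check (e.g., similarity to a KB article)

def find_deviations(answers: List[Answer], threshold: float = 0.3) -> List[Answer]:
    """Surface answers where human and automated quality scores disagree."""
    flagged = []
    for a in answers:
        if a.human_rating is None:
            continue  # no human signal yet; nothing to compare against
        human_norm = (a.human_rating - 1) / 4  # map the 1-5 scale onto 0-1
        if abs(human_norm - a.auto_score) > threshold:
            flagged.append(a)
    return flagged
```

Answers flagged this way (humans loved it but the automated check scored it low, or vice versa) are exactly where a reviewer's time is best spent, since one of the two signals is likely miscalibrated.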

VP of Customer Success in Software · 10 months ago

This is such an important question—thanks for bringing it up! We’ve been working on something similar and faced the same challenge of maintaining accuracy and quality with a GenAI tool for support agents. Here’s how we’ve approached it:

Start with Human Validation: We kicked things off with a phased approach, focusing on auditing certain types of content first—especially answers that could trigger legal concerns or have a higher impact. It helped us prioritize where accuracy mattered the most.

Use Feedback Loops: We made sure to set up mechanisms for agents to flag AI responses directly, creating a continuous feedback loop. This real-world input has been invaluable for improving the tool over time.
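The flagging mechanism described above can be as simple as appending each agent's flag to a review queue. A minimal sketch, assuming a JSONL file as the queue (the record fields are illustrative, not a prescribed schema):

```python
import json
import time

def flag_response(log_path: str, response_id: str, agent: str, reason: str) -> None:
    """Append an agent's flag on an AI answer to a JSONL review queue."""
    record = {
        "response_id": response_id,
        "agent": agent,
        "reason": reason,
        "flagged_at": time.time(),
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")
```

A reviewer (or an automated job) can then periodically read the queue, cluster recurring reasons, and feed corrections back into the knowledge base.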

Leverage the Right Tools: We’ve been using Intercom and FIN AI, which not only make interactions seamless but also offer features that help us refine responses and evaluate quality. They’ve been a solid foundation for scaling this effort.

Track the Right Metrics: To stay on top of accuracy and quality, we look at things like how often agents override the AI’s suggestions, user feedback, and other key trends. It’s a great way to identify where the tool needs tweaking.
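The override metric mentioned here is straightforward to compute if you log each AI suggestion along with whether the agent edited it before sending. A hedged sketch; the event `type` and `agent_edited` field names are assumptions about your logging schema:

```python
from typing import List, Dict

def override_rate(events: List[Dict]) -> float:
    """Fraction of AI suggestions that agents rewrote before sending.

    A rising rate is a useful early-warning signal that answer
    quality is drifting and the tool needs tweaking.
    """
    suggestions = [e for e in events if e.get("type") == "ai_suggestion"]
    if not suggestions:
        return 0.0
    overridden = [e for e in suggestions if e.get("agent_edited")]
    return len(overridden) / len(suggestions)
```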

It’s definitely been a journey, and we’re still refining as we go. Happy to dive deeper if you’d like to compare notes or share ideas. What stage are you at with your implementation? Would love to learn from your experience too!

Director of Customer Success · 10 months ago

It really depends on the tech stack being used for case management work. Here at NYC DOF we use Dynamics CRM and are looking to enable Microsoft Co-Pilot to do exactly what is described in the use case. Aside from case summaries and assisting with outbound communications, Co-Pilot will suggest responses based on a library of approved knowledge base article content to help agents get quick answers to questions. Hope this helps.
