Thought Leadership & Blogs

Testing Hailey Assist: Enhancing AI accuracy and reliability

Written by Andrew Lawrence | Jun 26, 2024

Hello, I’m Andrew Lawrence, the Chief Technology Officer at 6clicks. Today, I want to share an exciting chapter from our journey as we tested and refined Hailey Assist, our AI framework designed specifically for large language model (LLM) tuning and retrieval-augmented generation (RAG). I've personally found the process to be both challenging and immensely interesting, and it's been rewarding to work with the team as Hailey comes to life. I believe our experiences can provide valuable insights for anyone working in the AI field, so let's get technical.

Defining the scope: Hundreds of sample prompts

Our first step was to establish a comprehensive testing framework. We knew that the effectiveness of Hailey Assist hinged on its ability to handle a wide array of prompts with precision and reliability. To achieve this, we crowd-sourced and defined hundreds of sample prompts covering various topics and complexities. These prompts were crafted to simulate real-world scenarios that our users might encounter. We needed a mix of both 'happy path' prompts that we intended Hailey to handle elegantly, such as "Show me high severity risks assigned to me," and 'unhappy path' prompts, such as "Why is the sky blue?" We also varied the phrasing, worked in misspellings and poor grammar, and added unusual requests deliberately designed to trip up an LLM.
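To give a concrete sense of how a suite like this can be organized, here is a minimal sketch. The structure, field names, and the misspelled third example are illustrative assumptions for this post, not our actual test data:

```python
from dataclasses import dataclass

@dataclass
class PromptCase:
    prompt: str       # the text a user would send to Hailey
    category: str     # "happy_path" or "unhappy_path"
    expectation: str  # what a good response should do

# Illustrative cases only; the real suite contained hundreds of prompts.
SAMPLE_PROMPTS = [
    PromptCase(
        prompt="Show me high severity risks assigned to me",
        category="happy_path",
        expectation="Lists the user's high severity risks",
    ),
    PromptCase(
        prompt="Why is the sky blue?",
        category="unhappy_path",
        expectation="Politely redirects the off-topic question",
    ),
    PromptCase(
        prompt="show me hi severty risks asigned to me",  # deliberate misspellings
        category="happy_path",
        expectation="Handles poor spelling the same as the clean prompt",
    ),
]
```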

Testing for accuracy: A rigorous process

With our sample prompts ready, we moved on to testing Hailey's responses. We developed a simple script to quickly test each prompt multiple times, recording the response, the path taken to generate the response, and the latency (a sketch of such a harness follows the scoring rubric below). We compared the responses against a set of expectations, assessing both the content and the context of the replies and assigning scores from low to high. This assessment was performed by actual humans (and Greg Rudakov, the product manager - sorry Greg!).

  • A low score represents an unsatisfactory response, for example one that is factually incorrect or that guides users to the wrong place for answers;
  • A medium score represents a technically correct answer, but one that could be further improved with greater accuracy, additional information, or language improvements;
  • And a high score represents an ideal answer: a correct and well-formed response, supported with additional information or links.
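As promised above, here is a minimal sketch of what such a harness can look like, continuing the PromptCase sketch from earlier. The call_hailey stub and the run count are stand-ins for illustration, not our production code:

```python
import time

RUNS_PER_PROMPT = 5  # each prompt is tested multiple times

def call_hailey(prompt: str) -> dict:
    # Stand-in for the real Hailey Assist call; assumed to return
    # the response text and the path taken to generate it.
    return {"response": "stub response", "path": "stub-path"}

results = []
for case in SAMPLE_PROMPTS:
    for _ in range(RUNS_PER_PROMPT):
        start = time.perf_counter()
        reply = call_hailey(case.prompt)
        results.append({
            "prompt": case.prompt,
            "response": reply["response"],
            "path": reply["path"],  # e.g. which branch of the pipeline ran
            "latency_s": time.perf_counter() - start,
            "score": None,  # "low" / "medium" / "high", assigned by a human
        })
```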

This testing process was crucial for identifying any discrepancies or areas where the AI might falter, and for quickly assessing whether we were progressing or regressing with each update. Particularly when prompt engineering, we found it important to retest after every increment; otherwise, it could be hard to determine what caused a regression in performance.

We knew it would never be possible to achieve perfect accuracy for all responses. Maybe one day, but not with the currently available capabilities of LLMs. We set ourselves targets for 'production readiness,' with an acceptable percentage of low and medium scores and a minimum percentage of high scores.
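To illustrate the idea (the percentages below are invented for the example, not our actual targets), a readiness gate over the human-assigned scores can be as simple as:

```python
from collections import Counter

# Illustrative thresholds only.
MAX_LOW_PCT = 0.05     # at most 5% unsatisfactory responses
MAX_MEDIUM_PCT = 0.25  # at most 25% merely acceptable responses
MIN_HIGH_PCT = 0.70    # at least 70% ideal responses

def production_ready(scores: list[str]) -> bool:
    counts = Counter(scores)
    total = len(scores)
    return (
        counts["low"] / total <= MAX_LOW_PCT
        and counts["medium"] / total <= MAX_MEDIUM_PCT
        and counts["high"] / total >= MIN_HIGH_PCT
    )

print(production_ready(["high"] * 75 + ["medium"] * 21 + ["low"] * 4))  # True
```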

Identifying errors and hallucinations

One of the most critical aspects of our testing was identifying prompts that caused the AI to generate errors or to hallucinate information. We meticulously reviewed the responses to pinpoint these issues, implementing corrective measures to enhance the AI's reliability.

To tackle hallucinations, we employed a multi-faceted approach. For example, we would find nuances in the prompts that produced widely different results; a prompt with perfect grammar might earn a high score, whereas even slight imperfections in the grammar could trigger hallucinations. We adjusted the training data, fine-tuned the model parameters, and continuously iterated on the prompts and responses. This iterative process was essential for refining Hailey Assist, ensuring it could provide accurate and dependable information.
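One way to surface that kind of fragility systematically is to score each prompt alongside deliberately degraded variants and compare the results. Here is a rough sketch of generating such variants; the perturbation is deliberately crude and purely illustrative:

```python
import random

def degrade(prompt: str, seed: int) -> str:
    # Crudely perturb a prompt: drop one character and lowercase the
    # whole string, simulating the typos a real user might make.
    rng = random.Random(seed)
    i = rng.randrange(len(prompt))
    return (prompt[:i] + prompt[i + 1:]).lower()

clean = "Show me high severity risks assigned to me"
# Score the clean prompt and each variant with the same harness;
# a large quality gap flags a prompt that is fragile to imperfect grammar.
for seed in range(3):
    print(degrade(clean, seed))
```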

Of course, we also had to think outside the AI box and consider our user interface. How do we ensure the user receives useful information even under a hallucination scenario and is guided toward a beneficial outcome? 

Latency and performance

In our case, we wanted Hailey to respond conversationally with a maximum latency of a few seconds. However, the current generation of LLMs is still inherently resource-intensive compared to more traditional software. What's more, RAG is by nature a multi-step process, combining multiple requests to GPT with a more traditional search step. Our close relationship with Microsoft meant we were well supported when it came to the tooling and computing resources that we needed, but that did not obviate the need to deeply analyze and innovate on our solution to ensure satisfactory latency.
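To make the multi-step nature concrete, here is a simplified sketch of a RAG request with per-stage timing. The three stage functions are stand-ins for illustration, not our actual pipeline:

```python
import time

def interpret_question(question):    # stand-in: LLM call that extracts
    return question                   # a search query from the question

def search_index(query):              # stand-in: traditional search step
    return ["relevant document"]      # over the knowledge base

def generate_answer(question, docs):  # stand-in: LLM call that composes
    return "grounded answer"          # the final reply from the documents

def timed(name, fn, *args):
    # Run one stage and report how long it took.
    start = time.perf_counter()
    result = fn(*args)
    print(f"{name}: {time.perf_counter() - start:.2f}s")
    return result

def answer(question: str) -> str:
    query = timed("interpret", interpret_question, question)
    docs = timed("retrieve", search_index, query)
    return timed("generate", generate_answer, question, docs)

answer("Show me high severity risks assigned to me")
```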

We discovered that differences in infrastructure between data centers, varying levels of load, and different prompts could result in marked changes in latency. Continually measuring and analyzing these patterns guided us in tuning the infrastructure, prompts, and RAG pipeline for maximum efficiency.
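For a flavor of the kind of aggregation that makes those patterns visible, here is a small sketch; the sample fields and region names are assumptions, not our telemetry schema:

```python
import math
import statistics
from collections import defaultdict

def latency_report(samples):
    # samples: dicts like {"region": "eastus", "latency_s": 1.8}
    by_region = defaultdict(list)
    for s in samples:
        by_region[s["region"]].append(s["latency_s"])
    for region, lats in sorted(by_region.items()):
        lats.sort()
        # Crude p95: the value at (or just below) the 95th percentile rank.
        idx = min(len(lats) - 1, math.ceil(0.95 * len(lats)) - 1)
        print(f"{region}: median={statistics.median(lats):.2f}s, p95={lats[idx]:.2f}s")

latency_report([
    {"region": "australiaeast", "latency_s": 1.2},
    {"region": "eastus", "latency_s": 1.8},
    {"region": "eastus", "latency_s": 2.4},
])
```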

A continuous journey of improvement

Testing Hailey Assist has been a journey of continuous learning and improvement. By defining a comprehensive test suite of prompts, rigorously assessing accuracy, expanding our testing harness, and addressing errors and hallucinations, we have significantly enhanced the performance of our AI framework.

The road ahead involves ongoing refinement and adaptation as we strive to keep Hailey Assist at the forefront of AI technology. We are committed to delivering an AI that not only meets but exceeds the expectations of our users, providing them with a powerful tool for their needs.

We are excited about the future of Hailey Assist and look forward to sharing more updates as we continue to innovate and improve.