Two significant shifts in how people interact with companies have taken place in recent decades: the web in the 1990s and mobile applications over a decade later. The web made information and services available at the click of a button, transforming traditional brick-and-mortar operations. Mobile applications took this transformation further, offering a more personalized and immediate way for users to engage with services through real-time updates, push notifications, location services, and personalized content.
However, despite these advancements, both channels remain relatively static. Often, the information you seek isn’t easily accessible. Imagine a different scenario: instead of navigating through pages or screens, you converse seamlessly with the website or app, asking questions and receiving tailored responses. This is the third shift, where GenAI will fetch the exact information you need and present it in an engaging mix of text, images, voice, and video.
Imagine opening your bank’s app and asking, “What were my top five expenses last month?” Instantly, the AI retrieves the data, presents a visual chart, and offers a brief audio summary. Or, while shopping online for furniture, you provide a photo of your living room, measurements, and preferred materials. The AI responds with detailed product suggestions, complete with images, customer reviews, and video demonstrations.
Integrating Large Language Models (LLMs) into client-facing systems has transformative potential, but there are challenges due to AI’s tendency to hallucinate—providing plausible but incorrect information. Achieving seamless interaction requires reframing the approach to software development and testing.
Redefining Predictability in Software Outputs
Traditionally, software testing has revolved around predictability and reliability. Given an input, the output should be consistent and accurate every time. This paradigm is upended with the introduction of LLMs, where the same input can yield different but contextually appropriate outputs. This variability is not unlike human interaction—ask the same person the same question twice and receive two different and equally valid responses. To test and trust LLMs, we must shift our perspective from expecting rigid precision to embracing nuanced accuracy.
Neural networks are inspired by the human brain, and LLMs’ tendency to hallucinate mirrors human behavior: we sometimes answer confidently even when mistaken. In that sense, hallucinations are not a bug but a (not so desired) feature. While traditional software aims to eliminate errors, completely eradicating hallucinations in LLMs may be unrealistic. Instead, our focus should shift to minimizing these errors until the system’s demonstrated competence gives us sufficient confidence in it.
When you hire a new employee for a role like customer support, you don’t expect perfection from day one. Instead, you provide training, assess performance, and allow for a period of adjustment. Similarly, integrating LLMs into client-facing roles requires a methodology that mirrors human onboarding. Initial training provides the AI with the necessary tools and knowledge, followed by continuous performance evaluation and occasional adjustments.
Testing LLMs should transition from expecting rigid, deterministic outputs to a confidence-based framework. This approach evaluates the AI’s performance on a spectrum, akin to measuring human reliability. For instance, an employee’s readiness to handle customer queries is gauged through tests and ongoing assessments. Similarly, LLMs should be subjected to iterative testing that measures their ability to handle varied inputs and scenarios, adjusting the confidence level accordingly.
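To make this concrete, here is a minimal sketch in Python of what a confidence-based check could look like. The `call_model` function is a hypothetical placeholder for whatever LLM client you use, and the acceptance rule is deliberately loose: any phrasing that contains the required facts counts, and the score is the pass rate across repeated runs rather than a single exact-match verdict.

```python
import random

def call_model(prompt: str) -> str:
    """Placeholder for your actual LLM client (hypothetical)."""
    # A canned, slightly varying answer keeps the sketch runnable without an API key.
    return random.choice([
        "Your largest expense last month was rent at $1,200.",
        "Rent ($1,200) was your biggest expense over the last month.",
    ])

def passes(response: str, required_facts: list[str]) -> bool:
    """Accept any phrasing that contains every required fact."""
    return all(fact.lower() in response.lower() for fact in required_facts)

def confidence_score(prompt: str, required_facts: list[str], runs: int = 20) -> float:
    """Fraction of runs that satisfy the acceptance rule, instead of one exact-match check."""
    hits = sum(passes(call_model(prompt), required_facts) for _ in range(runs))
    return hits / runs

score = confidence_score("What was my largest expense last month?", ["rent", "1,200"])
print(f"Confidence: {score:.0%}")  # deploy only when this clears an agreed threshold
```

The important design choice is that the test asserts a pass rate over many runs against a tolerant rule, not byte-for-byte equality, which is exactly the shift from deterministic to confidence-based testing described above.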
Applying Bloom’s Taxonomy for AI Testing
Bloom’s Taxonomy, a model introduced in 1956, classifies educational learning objectives into hierarchical levels of complexity. By categorizing tasks and objectives into these levels, we can systematically develop and test the AI’s competencies. This serves three purposes: identifying the maximum complexity level we want the LLM to achieve; understanding the type and format of information that must be provided to the LLM to handle each level; and designing the tests that measure its capacity to respond to interactions at each individual level of complexity (a sketch of such a level-tagged test suite follows the level descriptions below).
Level 1 – Remembering: The LLM should retrieve, recognize, and recall relevant knowledge from its memory, especially information imparted through techniques like fine-tuning or RAG (Retrieval Augmented Generation). For example, the LLM should answer questions like “What are the dimensions of the Kingston sofa?” To check the LLM’s ability to recall specific business information, use multiple-choice tests, fill-in-the-blanks, recall questions, and fact listing.
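As an illustration, a level 1 check can be as simple as verifying that specific facts surface in the answer. In the sketch below, `call_model` again stands in for a real LLM client, and the expected dimensions are placeholder values, not actual product data.

```python
# Canned answer keeps the sketch runnable; wire call_model to your real LLM client instead.
CANNED = {
    "What are the dimensions of the Kingston sofa?":
        "The Kingston sofa measures 220 cm wide, 95 cm deep and 85 cm high.",
}

def call_model(prompt: str) -> str:
    return CANNED.get(prompt, "I'm not sure.")

# Level 1 (Remembering): every expected fact must appear somewhere in the answer.
prompt = "What are the dimensions of the Kingston sofa?"
expected_facts = ["220", "95", "85"]     # placeholder values, not real product data
answer = call_model(prompt)
missing = [fact for fact in expected_facts if fact not in answer]
print("PASS" if not missing else f"FAIL, missing: {missing}")
```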
Level 2 – Understanding: The LLM constructs meaning from written, oral, and graphic messages through interpreting, exemplifying, classifying, summarizing, inferring, comparing, and explaining. For example, you can expect it to answer questions like “Can you explain the difference between these two mattresses?” Testers can gauge the LLM’s grasp of concepts by asking it to summarize ideas, explain concepts in alternative words, and categorize texts.
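One way to score this level automatically is to ask for a classification whose label can be compared against a known expectation. The sketch below assumes the same kind of `call_model` placeholder as before; the message and the category names are invented for illustration.

```python
def call_model(prompt: str) -> str:
    """Stand-in for your LLM client; replace with a real API call."""
    return "complaint"

# Level 2 (Understanding): classify a text into known categories and check the label.
message = "The mattress I bought last month already sags in the middle."
prompt = (
    "Classify the following customer message as exactly one of "
    "'complaint', 'question' or 'praise'. Reply with the single word only.\n\n"
    f"Message: {message}"
)
expected = "complaint"
answer = call_model(prompt).strip().lower().strip("'\".")
print("PASS" if answer == expected else f"FAIL: got {answer!r}")
```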
Level 3 – Applying: The LLM uses learned information in new and concrete situations, applying knowledge to execute tasks, implement solutions, and demonstrate procedures. A question like “I have three children; which dining table set would be best for a large family?” drives the LLM to apply its product knowledge to recommend a suitable option. Testers can present practical scenarios requiring the application of knowledge to evaluate the LLM’s ability.
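Level 3 scenarios can be scored by defining, per scenario, which recommendations count as acceptable. The product names and the `call_model` stub below are illustrative assumptions, not real catalog data.

```python
def call_model(prompt: str) -> str:
    """Stand-in for your LLM client; replace with a real API call."""
    return "For a family of five, the Oakdale extendable set works well; it seats up to eight."

# Level 3 (Applying): any of the acceptable recommendations counts as a pass.
scenario = "I have three children; which dining table set would be best for a large family?"
acceptable = ["oakdale", "grandview"]    # hypothetical catalog items suited to the scenario
answer = call_model(scenario).lower()
print("PASS" if any(item in answer for item in acceptable) else "FAIL")
```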
Level 4 – Analyzing: The LLM breaks down complex information into constituent parts, understands relationships, and recognizes patterns. For instance, asking the LLM to “tell me which materials are more suitable for a pet owner and which items in your store meet these criteria” requires it to analyze the properties of different materials and match them with suitable products. Prompt engineering techniques like chain-of-thought help the model break down a problem into steps and present a better response.
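A chain-of-thought style prompt for this level might look like the sketch below; the wording and the step breakdown are one possible formulation, not a prescribed template.

```python
# Level 4 (Analyzing): nudge the model to decompose the problem before answering.
analysis_prompt = (
    "A customer owns two dogs and wants upholstery that holds up to pets.\n"
    "Reason step by step before giving your answer:\n"
    "1. List the upholstery materials in our catalog with their scratch and stain resistance.\n"
    "2. Rule out materials that are easily damaged by claws or hard to clean.\n"
    "3. Recommend the matching catalog items and explain the trade-offs.\n"
    "Finish with a short, customer-friendly recommendation."
)
# response = call_model(analysis_prompt)  # same placeholder client as in the earlier sketches
print(analysis_prompt)
```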
Level 5 – Evaluating: The LLM makes judgments based on criteria or standards, defending opinions using evidence. For example, an instruction like “Which mattress has the best reviews for both comfort and durability?” requires evaluating content with nuanced understanding and subjective judgment. Test the LLM by asking it to defend its opinions with evidence and verifying that its judgments are appropriate to the context.
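A lightweight automated check at this level is to require both a named choice and supporting evidence in the response. The rule below is deliberately crude and the product names are invented; a production test would more likely use a rubric or a second model acting as judge.

```python
def call_model(prompt: str) -> str:
    """Stand-in for your LLM client; replace with a real API call."""
    return ("The Northcrest mattress is the strongest pick: reviewers rate its comfort 4.8/5 "
            "and several mention that it kept its shape after two years of use.")

# Level 5 (Evaluating): the answer must name a choice AND back it up with evidence.
prompt = "Which mattress has the best reviews for both comfort and durability?"
answer = call_model(prompt).lower()
catalog_mattresses = ["northcrest", "dreamhaven"]          # hypothetical product names
names_a_choice = any(name in answer for name in catalog_mattresses)
cites_evidence = any(word in answer for word in ["review", "rating", "rate", "mention"])
print("PASS" if names_a_choice and cites_evidence else "FAIL")
```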
Level 6 – Creating: The LLM generates new patterns, structures, or models, such as designing a custom bookshelf for a home office. Achieving this level requires vast information and the ability to transcend existing knowledge, which current LLMs struggle with due to their reliance on pre-existing data.
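Tying the checks above together, test cases can be tagged with their Bloom level so that results roll up into the highest level the model handles reliably. The dataclass below is one possible shape for such a level-tagged suite; the field names and example cases are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class BloomTestCase:
    level: int                       # 1 = Remembering ... 6 = Creating
    prompt: str                      # user-style question posed to the LLM
    accept: Callable[[str], bool]    # rule deciding whether a response passes

# A few illustrative cases; prompts and acceptance rules are placeholders.
suite = [
    BloomTestCase(1, "What are the dimensions of the Kingston sofa?",
                  lambda r: all(x in r for x in ["220", "95", "85"])),
    BloomTestCase(3, "Which dining table set suits a family of five?",
                  lambda r: "oakdale" in r.lower()),
    BloomTestCase(5, "Which mattress has the best reviews for comfort and durability?",
                  lambda r: "review" in r.lower()),
]
```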
Continuous monitoring and evaluation are critical to maintaining and enhancing the AI’s performance. Just as an employee receives periodic reviews and ongoing training, an LLM requires regular updates and adjustments to stay effective and reliable.
A final advantage of using Bloom’s Taxonomy is that it helps establish the usefulness of newer models as the GenAI field progresses. Having a battery of training material and test cases, categorized by level, ready in advance can speed up system deployment once novel LLMs appear. Consider that not all new models will be more capable: some will be cheaper, some will be faster, some will need less memory to operate, and so on. Quick testing, organized by levels, can help you determine which type of application you can deploy with each kind of model.
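As a sketch of that quick, level-organized testing, the snippet below runs a stripped-down version of the level-tagged suite against two hypothetical candidate models and reports the highest level each one passes consistently. The model names, canned answers, and the 80 percent threshold are all placeholders.

```python
# Stripped-down, tuple-based version of the level-tagged suite for brevity:
# each case is (bloom_level, prompt, acceptance rule), all values illustrative.
suite = [
    (1, "What are the dimensions of the Kingston sofa?", lambda r: "220" in r),
    (3, "Which dining table set suits a family of five?", lambda r: "oakdale" in r.lower()),
]

def max_level(call_model, suite, runs: int = 5, threshold: float = 0.8) -> int:
    """Highest Bloom level whose cases all pass in at least `threshold` of the runs."""
    highest = 0
    for level in sorted({lvl for lvl, _, _ in suite}):
        cases = [(p, ok) for lvl, p, ok in suite if lvl == level]
        if all(sum(ok(call_model(p)) for _ in range(runs)) / runs >= threshold
               for p, ok in cases):
            highest = level
        else:
            break
    return highest

# Hypothetical candidate models wrapped behind the same calling convention.
candidates = {
    "fast-and-cheap": lambda p: "The Kingston sofa is 220 cm wide.",
    "larger-model": lambda p: "It is 220 cm wide; for five people the Oakdale set works well.",
}
for name, client in candidates.items():
    print(f"{name}: handles up to level {max_level(client, suite)}")
```

Because the suite stops at the first level a model cannot pass reliably, the result maps each candidate directly to the kinds of applications it can support, which is the decision the paragraph above describes.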
Incorporating LLMs into client-facing systems demands a paradigm shift in software testing and quality assurance. By viewing AI as something other than a flawless machine, we pave the way for more resilient, adaptable, and human-like interactions. This evolution in testing methodologies will ensure that these systems can effectively meet users’ dynamic needs, much like their human counterparts. As we embrace this new frontier, our approach to testing and training AI must be as innovative and adaptive as the technologies we seek to perfect.