# How to TEST a Q&A Correctiveness System with LangSmith

## Metadata

- **Published:** 8/18/2023
- **Duration:** 16 minutes
- **YouTube URL:** https://youtube.com/watch?v=357zfdc0rUQ
- **Channel:** nerding.io

## Description

In this video we continue exploring the LangSmith cookbook, specifically focusing on building a Q&A correctiveness (correctness-testing) system. I also talk about Ever Efficient AI, our productized AI development subscription model; reach out to us for your AI development needs. Throughout the video, I explain the steps involved in building the system, such as installing dependencies like ChromaDB and document loading/splitting packages, defining the dataset, and creating a prompt template. I also demonstrate how to evaluate the system's performance and iterate on it for better results.

📰 FREE eBooks & News: https://sendfox.com/nerdingio
👉🏻 Ranked #1 Product of the Day: https://www.producthunt.com/posts/ever-efficient-ai
📞 Book a Call: https://calendar.app.google/M1iU6X2x18metzDeA

🎥 Chapters
00:00 Introduction and Ever Efficient AI
00:32 Q&A Correctiveness System Overview
01:13 Dependencies and Dataset Creation
02:41 Building the Q&A System
04:19 Running the System and Vector Storage
06:09 Defining the Prompt Template
06:57 Creating a Runnable Map
09:50 Evaluating and Predicting Q&A Correctiveness

🔗 Links
https://github.com/langchain-ai/langsmith-cookbook/tree/main
https://github.com/langchain-ai/langsmith-cookbook/blob/main/testing-examples/qa-correctness/qa-correctness.ipynb
https://www.trychroma.com/
https://smith.langchain.com/

⤵️ Let's Connect
https://everefficient.ai
https://nerding.io
https://twitter.com/nerding_io
https://www.linkedin.com/in/jdfiscus/
https://www.linkedin.com/company/ever-efficient-ai/

## Key Highlights

### 1. Testing Q&A Systems with LangSmith
The video demonstrates how to use LangSmith to test a Q&A system, focusing on creating a dataset, defining the system, running evaluations, and iterating to improve results.

### 2. Creating Datasets for Evaluation
LangSmith lets you create datasets of question-answer pairs to evaluate the performance of your Q&A system. This video uses static, hard-coded examples; dynamic data will be covered in a later video.

### 3. Leveraging ChromaDB for Vector Storage
ChromaDB is used as an open-source, local vector store for embeddings, handling data retrieval for the Q&A system. Its Python and JavaScript clients both integrate well with LangChain.

### 4. Iterative Improvement Using Feedback Loops
The video shows how to iterate on a Q&A system by modifying the prompt template and re-running evaluations to assess the impact of changes, using LangSmith's feedback mechanisms.

### 5. LangSmith's Playground for Testing
LangSmith provides a playground environment where you can test chains, input prompts, and analyze the generated outputs, allowing for rapid experimentation and debugging.

## Summary

**1. Executive Summary:**
This video provides a hands-on demonstration of how to build and test a Q&A system using LangSmith, a tool for debugging, testing, and improving LLM-powered apps. It covers creating a dataset, defining a Q&A system with ChromaDB and LangChain, running evaluations, and iterating on the system to improve its correctness.

**2. Main Topics Covered:**

* **Introduction to Q&A Correctness Testing:** Explanation of the concept and its importance in AI development.
* **Setting up the Environment:** Installing necessary dependencies like ChromaDB, LangChain, and document loaders (a setup sketch follows this summary).
* **Dataset Creation:** Building a dataset of question-answer pairs using LangSmith (focused on static examples).
* **Q&A System Architecture:** Defining the system using a vector store retriever (ChromaDB), the OpenAI embeddings model, and GPT-3.5 Turbo.
* **Prompt Engineering:** Creating a prompt template to guide the LLM's responses.
* **Chain Assembly:** Creating a runnable map to process questions and retrieve relevant information.
* **Evaluation and Prediction:** Running evaluations within LangSmith and analyzing the results to identify areas for improvement.
* **Iterative Improvement:** Modifying the prompt template and re-running evaluations to observe the impact of changes on system performance.
* **LangSmith Playground:** Using the LangSmith playground to test chains, enter prompts, and analyze generated outputs.

**3. Key Takeaways:**

* LangSmith provides a powerful platform for testing and evaluating Q&A systems.
* ChromaDB offers an open-source solution for local vector storage, enabling efficient data retrieval.
* Iterative improvement, through prompt engineering and re-evaluation, is crucial for optimizing Q&A system performance.
* LangSmith's playground allows for rapid experimentation and debugging of AI chains.
* The video walks through a practical example from the LangSmith cookbook, providing a solid foundation for building and testing Q&A systems.

**4. Notable Quotes or Examples:**

* **Dataset Example:** The video highlights creating a dataset as an array of dictionaries with "question" and "answer" pairs for evaluation.
* **Prompt Template:** "We're defining the fact that we want it to be a helpful Q&A assistant; we're specifically looking at questions from the LangSmith documentation, and we're going to answer based on that information."
* **Iterative Improvement:** Shows adding the instruction "If you don't have an answer, just respond that you don't have an answer" to the prompt.

**5. Target Audience:**

* AI developers working with LangChain and large language models (LLMs).
* Individuals interested in learning how to test and evaluate Q&A systems.
* Professionals seeking to improve the accuracy and reliability of their AI applications.
* Anyone curious about using LangSmith for debugging, tracing, and iterating on LLM workflows.
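For readers following along, here is a minimal environment-setup sketch. The package list is an assumption based on what the video describes (LangChain, ChromaDB, and HTML-parsing libraries for the docs loader), and the environment variables are the standard LangSmith/OpenAI ones; verify exact names and versions against the cookbook notebook.

```python
# Environment setup sketch (package names and versions are assumptions based
# on what the video describes; verify against the cookbook notebook).
# In a Jupyter cell you would first run:
#   %pip install -U langchain openai chromadb tiktoken beautifulsoup4 lxml

import os

# OpenAI key for the embeddings model and GPT-3.5 Turbo.
os.environ["OPENAI_API_KEY"] = "sk-..."
# LangSmith (LangChain) API key plus tracing config so runs are logged.
os.environ["LANGCHAIN_API_KEY"] = "ls__..."
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
```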
## Full Transcript

Hey everyone, welcome to nerding.io. I'm JD, and today we're going to continue looking at the LangSmith cookbook, specifically building a testing example around question answering. Before we get into that, I just wanted to mention Ever Efficient AI, our productized AI development subscription model. If you have any AI development needs, please reach out to us; you can find us in the comments.

Today we're going to look at the Q&A correctness system. We'll work from the cookbook's testing-examples folder, and we'll cover creating a dataset (which is new for us), defining our question-and-answer system, actually running an evaluation through LangSmith, and then iterating to improve the system and seeing how that looks.

First, let's look at our dependencies before we even create our dataset. We'll be running a Jupyter notebook, which you can launch from your terminal; you can see I have it running in the background here. You'll also want to make sure LangChain is up to date. We're going to install ChromaDB for our vector storage, plus a couple of other packages specifically for document loading and splitting.

Quickly, let's talk about Chroma. We'll do a follow-up video on it specifically, but the reason we love Chroma is that it's open source, which means we can run it for free locally. They're also talking about offering a hosted product. It makes things super easy, and since we do things in both Python and JavaScript, it's also great that they have a JavaScript client. Both clients work with LangChain, which makes it incredibly useful.

With that said, go ahead and spin up your Jupyter notebook and make sure you clone everything down. Once you have it up and running, open the QA correctness notebook and we'll go through what it looks like.

We already talked about the prerequisites, so the first thing we'll do, now that Chroma and our document loader are installed, is build out a dataset. The really cool thing is that the cookbook actually gives us an example. It's essentially an array of dictionaries, each with a question and an answer. These are static, hard-coded examples; in a future video we'll start talking about dynamic data, and they have an example for that too.

To create the dataset, we start running the code. First we define what our dataset is going to be, then we import LangSmith. Remember, you need your API keys set up, both your OpenAI key and your LangChain API key. We did this in the previous project, so if you haven't had a chance to set that up, all you have to do is set the environment variables; we have previous videos on that. So now that we have our examples saved, we import the client, define the dataset, and loop through all of the examples. Let's run it.
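As a reference for this step, here's a minimal sketch of the dataset creation, assuming the `langsmith` Python client's `create_dataset` and `create_example` methods. The question-answer pairs and dataset name below are illustrative stand-ins, not the cookbook's exact examples.

```python
# Dataset-creation sketch (illustrative examples and dataset name, not the
# cookbook's exact values).
from langsmith import Client

# Static, hard-coded question/answer pairs to evaluate against.
examples = [
    {
        "question": "What is LangChain?",
        "answer": "LangChain is a framework for developing applications "
                  "powered by language models.",
    },
    {
        "question": "How do I log user feedback to a run?",
        "answer": "Use the LangSmith client's create_feedback method with "
                  "the run ID and a feedback key.",
    },
]

client = Client()  # reads LANGCHAIN_API_KEY from the environment
dataset = client.create_dataset(dataset_name="Retrieval QA Questions")

# Upload each pair as a dataset example: inputs hold the question, outputs
# hold the reference answer the evaluator will grade against.
for example in examples:
    client.create_example(
        inputs={"question": example["question"]},
        outputs={"answer": example["answer"]},
        dataset_id=dataset.id,
    )
```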
That ran without any errors, which is good. Now we're going to define our Q&A system. For that we're using a vector store retriever, which uses the OpenAI embeddings model, and our vector store, Chroma, is where we'll store the split chunks and their embeddings. Then we have our chat prompt template, and we'll be using GPT-3.5 Turbo as our LLM.

Let's keep going, run this, and define our splitter. The other piece is the recursive URL loader, which pulls in the LangSmith docs. Think of that as the source information we're retrieving from, while the examples are the questions we're testing against.

Now that our retriever is set up, we'll go down and run our embedding. Oh, we hit an issue; let's figure out what it is. Reading it, it's just a warning saying the loader is using an HTML parser, and if the content is XML we should use an XML parser instead. I think that's fine, so we'll continue with our embeddings.

Next we'll define our prompt template. The prompt template is essentially how we structure our message: we're defining the fact that we want it to be a helpful Q&A assistant, that we're specifically looking at questions about the LangSmith documentation, and that it should answer based on the information we just created these vectors with. Let's go ahead and run the prompt template.

Now we can assemble the full chain. What we're doing here is creating a runnable map: we take our split documents and put them into it, so it's applying the getter, piping through the retriever, piping through a lambda over the docs to gather all our context, and then using an item getter to pull out the question (a code sketch follows below).

Let's run this with our first question. We're putting together a token stream and asking, "How do I log user feedback to a run?" We can see it streaming, and we're getting our expected result. This is really cool: it actually gives a TypeScript example too, it says we need to provide our run ID and our feedback key, and it even shows the API call. Very cool.
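Here's a minimal sketch of the chain just described: a recursive URL loader over the LangSmith docs, a splitter, a Chroma vector store with OpenAI embeddings, a prompt template, and a runnable map piped into GPT-3.5 Turbo. It follows the langchain APIs of the period (`RunnableMap`, `ChatPromptTemplate.from_messages`); the URL, chunk sizes, and prompt wording are assumptions, not the cookbook's exact values.

```python
# Q&A chain sketch (URL, chunk sizes, and prompt wording are illustrative).
from operator import itemgetter

from bs4 import BeautifulSoup
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import RecursiveUrlLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.prompts import ChatPromptTemplate
from langchain.schema.output_parser import StrOutputParser
from langchain.schema.runnable import RunnableMap
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

# Recursive URL loader: pulls the LangSmith docs as source material,
# extracting plain text from each page's HTML.
loader = RecursiveUrlLoader(
    "https://docs.smith.langchain.com/",
    extractor=lambda html: BeautifulSoup(html, "html.parser").text,
)
splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)
documents = splitter.split_documents(loader.load())

# Embed the split chunks into a local Chroma vector store; expose a retriever.
vectorstore = Chroma.from_documents(documents, OpenAIEmbeddings())
retriever = vectorstore.as_retriever()

# Prompt template: a helpful Q&A assistant answering from the docs context.
prompt = ChatPromptTemplate.from_messages([
    ("system",
     "You are a helpful Q&A assistant for questions about the LangSmith "
     "documentation. Answer based on the following context:\n{context}"),
    ("human", "{question}"),
])
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

def format_docs(question: str) -> str:
    """Retrieve relevant chunks and join them into a single context string."""
    docs = retriever.get_relevant_documents(question)
    return "\n\n".join(doc.page_content for doc in docs)

# Runnable map: build the context via the retriever lambda, pull the question
# through with an item getter, then pipe into prompt -> model -> parser.
chain = RunnableMap({
    "context": lambda x: format_docs(x["question"]),
    "question": itemgetter("question"),
}) | prompt | llm | StrOutputParser()

# Stream the first test question as a token stream.
for chunk in chain.stream({"question": "How do I log user feedback to a run?"}):
    print(chunk, end="", flush=True)
```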
What we can do now is start to evaluate. Before we actually run the evaluation on our chain, let's go look at LangSmith and see what we have as our sequence. These arguments were from our last video; this runnable sequence is what we just did, and you can see the input is our question. Let's take a look at our chain. This is really awesome: it gives us all the information from our log, including latency. We can see our runnable map, then our prompt template, so this is where it's pulling the information from, as well as the response we're getting back and our final parse. We can check that there's no feedback yet, and this is the metadata we have at this point; remember, in the last video we customized some of this.

Let's go back to our notebook and run the evaluation. We have our output, so now we're going to run our evaluator on our Q&A chain and have it do prediction over the Q&A correctness dataset (see the evaluation sketch below). Let's make sure we actually run this cell. If you don't have LangSmith you can just look at the output here, but what we want to look at is the feedback, to see what our evaluation found.

We did look at the chain, so now let's go back to LangSmith and start looking at our correctness. If we go back, we see another runnable sequence. Oops, sorry; actually, I made a mistake: we want to go back and look at the evaluation results for the sequence that's being returned. Okay, so now this has our dataset and our correctness. It's looking at all of the examples we put in, and we can now see our correctness feedback, how it ran through, and the question, for example "What is LangChain?", along with the output. Basically, for the traces of all the examples we just ran, it's giving us our chain, our status, and our feedback.

The other thing it shows, which is kind of hard to see, is a filter function saying `and(eq(feedback_key, "correctness"), eq(feedback_score, 0))`. I'll blow this up so it's a little easier to see: right here in our traces we can apply a filter function. On this one we have a correctness of one, so if we change the score in the filter to one, it will filter on our feedback here.

Let's see what else they have. You can open a run in the playground if something is marked incorrect, as well as edit in the playground itself. The playground essentially lets you do test runs, and all of those are tracked inside LangSmith. When I open the playground for this particular run it says it isn't set up for that; however, if we go back to our projects you can see some cool things. You'll notice that on our tracing cookbook project we now have a feedback score, you also have the ability to look at evaluators, and by default there is a playground inside LangSmith. You can go in there, type in whatever kind of input, submit it, and it will actually run, which is pretty sweet. It's essentially just an editor for your chain.

All right, let's go back and iterate now. What we're doing here is taking essentially the same prompt but adding one specific instruction: if you don't have an answer, just respond that you don't have an answer; do the best you can, but otherwise make the user aware that you don't know the answer. That's the only addition we're making, but it's a good example of how to go back and iterate. Now we'll make a run and rerun our evaluation against the same dataset. We have our link here; let's go take a look.
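For reference, here's a sketch of the evaluation step, assuming the `langchain.smith` helpers available around the time of this cookbook (`RunEvalConfig` with the built-in `"qa"` evaluator and `run_on_dataset`). The dataset name is the illustrative one from the earlier sketch.

```python
# Evaluation-run sketch (assumes the langchain.smith helpers of the period;
# dataset name is the illustrative one from the earlier sketch).
from langchain.smith import RunEvalConfig, run_on_dataset

# The built-in "qa" evaluator has an LLM grade each prediction against the
# dataset's reference answer and logs a "correctness" feedback score.
eval_config = RunEvalConfig(evaluators=["qa"])

results = run_on_dataset(
    client=client,                       # LangSmith client from earlier
    dataset_name="Retrieval QA Questions",
    llm_or_chain_factory=lambda: chain,  # build a fresh chain per example
    evaluation=eval_config,
)
```

In the LangSmith UI, the resulting traces can then be narrowed with a filter such as `and(eq(feedback_key, "correctness"), eq(feedback_score, 0))` to surface only the failing examples.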
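And a sketch of the prompt iteration just described: the same chain, with one added instruction telling the model to admit when it doesn't know. It reuses `format_docs`, `llm`, `client`, and `eval_config` from the sketches above, and the wording is illustrative.

```python
# Prompt-iteration sketch: same chain, one added "admit you don't know"
# instruction. Reuses format_docs, llm, client, and eval_config from the
# earlier sketches; wording is illustrative.
from operator import itemgetter

from langchain.prompts import ChatPromptTemplate
from langchain.schema.output_parser import StrOutputParser
from langchain.schema.runnable import RunnableMap
from langchain.smith import run_on_dataset

prompt_v2 = ChatPromptTemplate.from_messages([
    ("system",
     "You are a helpful Q&A assistant for questions about the LangSmith "
     "documentation. Answer based on the following context:\n{context}\n"
     "If you don't have an answer, just respond that you don't have an "
     "answer."),
    ("human", "{question}"),
])

# Rebuild the chain with the revised prompt.
chain_v2 = RunnableMap({
    "context": lambda x: format_docs(x["question"]),
    "question": itemgetter("question"),
}) | prompt_v2 | llm | StrOutputParser()

# Re-run the same evaluation so the two versions' correctness scores can be
# compared side by side in LangSmith.
results_v2 = run_on_dataset(
    client=client,
    dataset_name="Retrieval QA Questions",
    llm_or_chain_factory=lambda: chain_v2,
    evaluation=eval_config,
)
```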
Now we have a different run, so let's go back to what we're looking for. It looks like it's passing all the examples, and we still have our correctness feedback; we can still run our evaluator. Okay, here we go: now we have a correctness of zero, which we can click on and look through.

---

*Generated for LLM consumption from nerding.io video library*