# Steps to Build a Real-Time Voice Bot with Deepgram, Langchain, and ChatGPT

## Metadata

- **Published:** 2/8/2024
- **Duration:** 16 minutes
- **YouTube URL:** https://youtube.com/watch?v=EgNerWaeZz0
- **Channel:** nerding.io

## Description

In this video, I show an experimental example of a real-time voice bot using Websockets, Deepgram for speech recognition, Langchain for natural language processing, and ChatGPT by OpenAI for dialogue generation.

The flow works like this:

- User speaks into a microphone
- Deepgram transcribes the audio into text in real time
- Langchain processes the text to understand the user's intent
- ChatGPT generates a relevant response
- Text-to-speech converts the response into an audio clip
- The audio clip is played back to the user

So essentially, it allows you to have a natural conversation with an AI assistant by speech. The bot understands what you're saying and responds back verbally in real time. Some use cases could be creating smart speakers, voice interfaces for apps, or even phone call bots. The real-time aspect makes it more convenient than typing back and forth.

📰 FREE Code & News: https://sendfox.com/nerdingio
👉🏻 Ranked #1 Product of the Day: https://www.producthunt.com/posts/ever-efficient-ai
📞 Book a Call: https://calendar.app.google/M1iU6X2x18metzDeA

🎥 Chapters
00:00 Introduction
00:28 Deepgram
02:33 Real Time Bot Example
04:10 Setup
05:58 Client
10:02 Server
13:07 Testing
14:32 Langsmith

🔗 Links
https://deepgram.com/
https://github.com/deepgram-starters/live-nextjs-starter
https://github.com/deepgram-previews/real-time-voice-bot
https://js.langchain.com/docs/integrations/chat/openai
https://smith.langchain.com/

⤵️ Let's Connect
https://everefficient.ai
https://nerding.io
https://twitter.com/nerding_io
https://www.linkedin.com/in/jdfiscus/
https://www.linkedin.com/company/ever-efficient-ai/

## Key Highlights

### 1. Real-Time Transcription Speed with Deepgram

Deepgram's real-time transcription is exceptionally fast, making it a strong contender for voice-based conversational bots. Its live streaming transcription feature is the centerpiece of this project.

### 2. Websockets for Persistent Connection

Websockets maintain a persistent, two-way connection between the front end and back end, which is crucial for the bot's real-time functionality and for reducing latency.

### 3. Langchain Integration & Langsmith Tracing

Langchain handles prompt management, and Langsmith traces the entire process, including latency and token usage.

### 4. Socket ID for Chat Management

Socket IDs are used to manage multiple concurrent chat sessions and to consolidate chat history for each individual user within the Langchain sequence.

### 5. Server-Side Processing for Speed

The bulk of the heavy processing, especially TTS and prompt handling, is executed server-side. Streaming the synthesized audio as buffer arrays allows near real-time audio feedback from the bot.

## Summary

**Document: Steps to Build a Real-Time Voice Bot with Deepgram, Langchain, and ChatGPT**

**1. Executive Summary:**

This video demonstrates how to build a real-time voice bot using Deepgram for speech-to-text, Langchain for natural language processing and prompt management, and ChatGPT for generating conversational responses.
The tutorial covers the setup and code structure, and demonstrates the bot's functionality, showcasing its real-time transcription speed and conversational capabilities.

**2. Main Topics Covered:**

* **Introduction to Real-Time Voice Bots:** Explanation of the concept and potential use cases (smart speakers, voice interfaces, phone call bots).
* **Deepgram for Real-Time Transcription:** Overview of Deepgram's capabilities, focusing on its speed, accuracy, and live streaming transcription feature.
* **Websocket Implementation:** Explanation of how websockets maintain a persistent, two-way connection for real-time communication between the client and server.
* **Langchain Integration:** Description of how Langchain is used for prompt management, including model selection and defining agent behavior.
* **ChatGPT for Dialogue Generation:** Overview of how ChatGPT is used to generate relevant and engaging responses based on user input.
* **Code Walkthrough:** Detailed explanation of the client-side (Next.js, JavaScript, Socket.io) and server-side (Node.js, Deepgram TTS API, Langchain) implementation.
* **Langsmith Tracing:** Demonstration of using Langsmith to monitor latency, token usage, and the overall flow of the conversation within the Langchain sequence.
* **Socket ID Management:** Importance of using socket IDs for managing concurrent chat sessions and consolidating chat history for each individual user.

**3. Key Takeaways:**

* **Deepgram's speed makes it ideal for real-time voice bots.** Its real-time transcription allows for a more natural conversational experience.
* **Websockets are crucial for maintaining persistent connections and reducing latency** in real-time applications.
* **Langchain simplifies prompt management** and allows for easy integration with LLMs like ChatGPT.
* **Server-side processing optimizes performance** by handling computationally intensive tasks like TTS and prompt handling.
* **Langsmith provides valuable insights into latency and token usage**, enabling optimization and debugging.
* **Socket IDs are essential for managing multiple concurrent chat sessions**, ensuring that each user has a personalized experience.

**4. Notable Quotes or Examples:**

* "So you could see that the transcript was happening in real time right here..." (demonstrating Deepgram's real-time transcription)
* "...it's almost like a conversational bot, but in the sense that it's using audio, and you could see that very quickly it was responding; there wasn't as much latency as you would typically see, which I found really interesting" (highlighting the speed and low latency of the system)
* Example chat interaction: demonstrates the bot's ability to understand the user's intent and provide relevant responses (therapist scenario).

**5. Target Audience:**

* Developers interested in building real-time voice-based applications.
* Individuals exploring the integration of speech recognition, NLP, and LLMs.
* Anyone seeking to understand the use of Deepgram, Langchain, ChatGPT, and Websockets in a practical project.
* Those interested in conversational AI and voice interfaces.
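To make the websocket and socket-ID takeaways above concrete, here is a minimal sketch of the round trip the video describes, using Node.js and Socket.io. The event names (`speech-final`, `bot-reply`) and the stubbed helpers are illustrative assumptions, not the repo's actual API:

```javascript
const { Server } = require("socket.io");

// Hypothetical stand-ins for the Langchain and Deepgram TTS calls covered later.
async function generateReply(socketId, text) { return `You said: ${text}`; }
async function synthesizeSpeech(text) { return Buffer.from(text); }

const io = new Server(3000, { cors: { origin: "*" } });

io.on("connection", (socket) => {
  // Each socket.id identifies one browser session, i.e. one conversation.
  console.log("client connected:", socket.id);

  // The client forwards Deepgram's finalized transcript text.
  socket.on("speech-final", async (text) => {
    const reply = await generateReply(socket.id, text);
    const audio = await synthesizeSpeech(reply); // Buffer, streamed as raw bytes

    // Reply only to this socket, not broadcast to every connected client.
    socket.emit("bot-reply", { text: reply, audio });
  });
});
```

The persistent connection means neither side has to poll: the server pushes text and audio back the moment they are ready, which is where the low latency comes from.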
## Full Transcript

Hey everyone, welcome to nerding IO. I'm JD, and today we're going to be going through Deepgram, which is a real-time transcription service, some other audio features, websockets, and Langchain. We're going to put all this together, look at some experiments that Deepgram put together, and then actually trace it through Langsmith. So let's go ahead and get started.

All right, so when I was experimenting with different kinds of conversational bots, and specifically looking at voice transcription and different types of audio, I came across this tool called Deepgram, and one of the things that I really liked about it was how fast it was. It has this real-time accuracy for transcriptions. The other cool thing is that it has Node.js as well as Python SDKs, which is pretty standard, but this idea of live streaming the transcription was what really interested me. They do have things like speech-to-text and text-to-speech, as well as audio intelligence, which I haven't played with but seems really interesting.

When I was researching, I found they actually have a Next.js example. If we look at this example, we have a microphone where we can do a test, and we even have this "Deepgram is connecting" status showing us the connection state. So let's go ahead and give it a shot: "Hello from nerding IO." You could see that the transcript was happening in real time right here, and then again we have this connection open.

If we look at the code itself in the Next.js starter kit, we can actually see how this is happening. The way it's doing this is it's toggling the microphone on and off. First we need to have our media device, then we're actually defining our recorder, so our microphone; this is all just straight JavaScript right here. But later on we're establishing a connection, not only to the API, but also creating a client to listen for Deepgram to actually do the transcription live. In their pricing they actually have this built in, where you're billed for so many minutes. Then you also have the connection on, and you're looking for the event types of open as well as close, and that's what allows it to process the transcription live, which I thought was really interesting.

The other cool thing is they have a bunch of different projects that are kind of experimental, and that's what we're going to look at in detail: this experimental project called real-time voice. With real-time voice, you can actually see right here it's using this Athena V4, the voice model they just released back in December, I believe. So what we're going to do is take this project and actually use the API keys that they have, but then we're going to run this through Langchain and then take a look at it in Langsmith. All you need to do is pull down this project, and then I'll show you a few modifications that we're going to make.

Real quick: if you haven't already, please remember to like and subscribe, it helps more than you know, and with that, let's just go ahead and get back to the content.

So once we've pulled this down, the first thing that we want to do is actually look at some environment variables. I went ahead and just made a sample, because we're going to add some things beyond what's in the description. You need to have the OpenAI API key, the Deepgram key, and then these variables for Langchain: you have your endpoint, your project, and your API key.
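As a minimal sketch, a sample `.env` along those lines might look like the following. The Langchain variable names follow Langchain's standard tracing conventions; the exact keys and values in the project's own sample file may differ:

```bash
# Hypothetical sample .env; check the project's sample file for the exact key names.
OPENAI_API_KEY=sk-...
DEEPGRAM_API_KEY=...

# Langchain / Langsmith tracing variables
LANGCHAIN_TRACING_V2=true
LANGCHAIN_ENDPOINT=https://api.smith.langchain.com
LANGCHAIN_PROJECT=real-time-voice-bot
LANGCHAIN_API_KEY=...
```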
As you can see, I already started running this with npm, so you can see that it's transcribing and picking up the information down here in my terminal; it's console logging everything out. So what we're going to do is we'll go through the HTML, but I just want to show you really quickly how this looks. Right here we have this information. It's not actually transcribing in the browser, but it's picking up everything I'm saying in the microphone as I'm running through it in my command line, and that's because it's the server that's console logging, not the actual browser.

So let's take a look at how this is built. Right here you can see the theme, and you can see your text-to-speech voice, and this right here is going to be a visual. So if we do something like "Can you help me with, sorry, building a YouTube video?" ("That's great! What kind of video are you planning to make?") you can see that there's the audio wave here, the text is coming in, and it's jumping in pretty quickly. Again, remember this is experimental, but it's able to actually understand what I'm saying and then respond, and hopefully you could hear her speaking. We're on the default theme, but there are other themes and these different types of models.

So let's go back to the code and look at what is actually going on here. The first thing is we're using websockets: we're pulling Socket.io into our code so that we can communicate and keep an open connection between the front end and the back end. We're using this WaveSurfer as an audio visualizer, and then we have this model change, which is just allowing us to look at different models. What's cool about these models is that, since we're using Langchain, we essentially just have prompts: each one tells the model to act a certain way and provide feedback, and then it goes out to OpenAI to pull information back. So it's almost like a conversational bot, but in the sense that it's using audio, and you could see that very quickly it was responding; there wasn't as much latency as you would typically see, which I found really interesting.

The next piece is the style of the voices: these are just kind of like default voices that you have. Then you have the ability to record, which is just toggling, which we saw. You have the container for an audio file, and then our script.

If we continue into establishing the connection between the websockets, we just have our websocket origin as well as our API origin. We're setting up an audio file, which is where we're actually going to return some of the text, and then we have our audio element, which is how we'll actually play the information. We have an audio-for-text function right here: if you see right here, this is where it's taking the data, making it a blob, and then setting that information to an MP3 right here in the browser, so we're actually creating an audio element. This is the function we'll call for the audio for text. This is also where we're setting up our recorder, similar to what we did in the Next.js version: we're setting up our events to make sure that not only are we getting the recorder, but we're also setting up the socket at the same time, and then we're waiting on the socket so we can send our add-text messages.
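Before moving on to those messages, here is a rough sketch of the blob-to-MP3 playback step described a moment ago. The function name `audioForText` comes from the walkthrough, but the payload shape is an assumption:

```javascript
// Browser-side playback: wrap the raw audio bytes from the socket in a Blob
// so a standard <audio> element can play them.
function audioForText(arrayBuffer) {
  const blob = new Blob([arrayBuffer], { type: "audio/mp3" });
  const url = URL.createObjectURL(blob);

  const audio = new Audio(url); // creates the audio element in memory
  audio.addEventListener("ended", () => URL.revokeObjectURL(url)); // free the blob URL
  audio.play();
}

// Wired up to the socket event sketched earlier (event name is hypothetical):
// socket.on("bot-reply", ({ audio }) => audioForText(audio));
```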
Coming back to those add-text messages: we send them both for the interim result, which is what's happening here (you can see it down in the console of the server), and for the speech final, which is when it's actually done listening and processing. And then, lastly, we're establishing a socket ID. This socket ID is really important because it allows us to know which socket we're communicating back and forth with, so we have an established connection to this single page and we're not broadcasting out to every single socket that's available.

The next piece is the chat, and actually showing the text. This is separating the lines and basically defining who is speaking, the AI or us, and that's where we're seeing the colors differ in hue. And then, of course, we need to send our information to the prompt. So we have our prompt AI here, and this is our socket (or our API) connection to our chat: it has our socket ID, it has our model, it has our voice, and then it has the message, which it actually encodes right here. All this information is going to get sent to the backend, and that's where we're going to process our Langchain events. These are just to show what's starting and stopping, and then any model changes.

All right, so next let's actually see this information going to the chat backend. Remember, we're taking our socket ID, our model, our voice, and then our encoded message. If we look at the server, what we're doing here is setting up all of Deepgram: we have our TTS API, we're defining our port, and it's diving right into the prompt AI, passing the socket, the model, and the message. What's interesting about this is that it's again grabbing from the models here, so we know which model is going to act as our prompt, as the chat message, and then it's actually putting in our message here. But this is defined by the open socket, and this call is actually going into Langchain.

So let's go ahead and find this function. We're calling it inside of our chat handler: we saw that this was coming across on this URI, we're taking this information from the request, we're sending it to our prompt, and then we're expecting a response back that we send as it's speaking. We're doing something similar with text-to-speech: as the reply is being generated, it's actually doing text-to-speech. So once we get this call back from the prompt AI, we return it to this URL as a buffer array, so that it's speaking in almost real time. Sending it as a buffer is important for the streaming aspect of this.

What's really crazy is that we have all of our events here and we're seeing a two-way connection between them, and this is all done by the socket. That's why the socket ID is so important: it's maintaining this connection between the two so that there's an open back and forth, and right here is where we're defining that instance so that we can maintain this chat. So when we have our OpenAI chat (this part is the Langchain), we're able to call this instance of that Langchain chat and then actually do the call function, which happens again in Langchain. What was really interesting to me is that you're using the socket ID to essentially maintain the different chats that you're having, and that's how you can consolidate your chat history into one Langchain chain, or sequence, I should say.
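Here is a minimal sketch of that socket-ID-keyed chat pattern, assuming Langchain JS's `ChatOpenAI` (the walkthrough uses the older `call` API; current Langchain versions use `invoke`). The `sessions` map and function shape are illustrative, not the repo's exact code:

```javascript
const { ChatOpenAI } = require("@langchain/openai");
const {
  SystemMessage,
  HumanMessage,
  AIMessage,
} = require("@langchain/core/messages");

// One chat history per socket ID, so concurrent users never share a conversation.
const sessions = new Map();

async function promptAI(socketId, personaPrompt, message) {
  if (!sessions.has(socketId)) {
    sessions.set(socketId, {
      llm: new ChatOpenAI({ model: "gpt-3.5-turbo" }),
      // personaPrompt is the selected "model" prompt, e.g. the therapist persona.
      history: [new SystemMessage(personaPrompt)],
    });
  }
  const session = sessions.get(socketId);

  // Append the user's turn, run the whole history through the LLM,
  // and record the assistant's turn so the next call has full context.
  session.history.push(new HumanMessage(message));
  const response = await session.llm.invoke(session.history);
  session.history.push(new AIMessage(response.content));

  return response.content; // sent back over the socket and on to TTS
}
```

Because every Langchain call goes through one function keyed by socket ID, each conversation shows up in Langsmith as its own traced sequence, which is what makes the per-chat latency and token inspection later in the video possible.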
So what we're going to do now is a little bit more testing, and then we'll actually look at how this goes into Langchain. If we go back to our browser and open a new instance of this Deepgram demo, let's just try the therapist, and we'll say that we want to know more, or that we have a problem with public speaking.

"I have a problem with public speaking." ("Thank you for sharing that with... thank you for sharing that with... of course. If you would like to skip any questions or steps, please let me know. Is there anything specific you would like to skip or any particular topic you would like to discuss?") "I'd like to discuss public speaking." ("Hello there, it seems like you're feeling a bit down. Of course, I'm here to help. Let's talk about public speaking. Can you tell me more about what specifically troubles you about it?") "I have stage fright." ("Thank you for sharing that with me. Can you tell me more about the situations or conditions in which you experience stage fright?")

So as we're going through, again, this is experimental, but it's picking up the fact that as we start talking, it starts pulling this in immediately. Then it kind of cut off for a second: it was already trying to say stuff, and then I spoke over it, so you'd need to figure out a better way to chain the events. But when we talk here about "I'd like to discuss public speaking," it actually comes back with a question, we can interact with it, it was fairly quick, and then it's asking again for more information.

In order to view this in Langsmith: again, we're using Langchain for the call function, right, and we know that our model is getting set up here in this particular way with the ChatOpenAI, because this model variable is associated with the socket ID that we're communicating with. And so we can actually trace this through the runnable sequence in Langsmith. If we go in here (for me it's just the default project; you want to check and make sure you have your project set up and whatnot), you can actually see all the system inputs that are coming in and how quick this is, not to mention how many tokens are associated with it, what your run count is, and everything else. If we look at our latency, this latency is pretty good; I mean, it's staying very green. Again, you can see that this chat, or this prompt, is really just the model that we looked at being associated in the server. If we look at the sequence, we can actually check and see that it's being chained together. It's not a ton of information, but it's still a way for us to look at the conversation that's being had, as well as the latency itself: seeing what our run count is, how many tokens we've used, and whether there are any errors, streaming issues, or latency spikes. I just found it really interesting that because you're using the Langchain functions, you can actually use Langsmith and start tracking the latency of how quick this real-time bot is.

All right everyone, thanks for tuning in today. What we went over was Deepgram and the ability to do real-time transcription, looking at their example of a real voice-interaction bot, almost like a conversational bot. We checked out how we can use websockets in this process, and then we actually traced all that information in Langsmith. If you haven't already, please remember to like and subscribe, and with that, happy nerding!

---

*Generated for LLM consumption from nerding.io video library*