# How to use Real Time In-Browser Transcription with WebGPU Whisper and Transformer.js

## Metadata

- **Published:** 6/24/2024
- **Duration:** 11 minutes
- **YouTube URL:** https://youtube.com/watch?v=LAFOhwwccgo
- **Channel:** nerding.io

## Description

Say goodbye to slow and insecure speech transcription with WebGPU Whisper and Transformers.js! This cutting-edge technology offers seamless real-time transcription and translation for over 100 languages, all while running directly in your browser. Thanks to Transformers.js, your data stays on your device, ensuring unparalleled privacy.

In this video, we explore the technological marvels behind WebGPU Whisper, showcasing its blazingly fast performance and exceptional accuracy. Whether you're a developer or a tech enthusiast, you'll discover the immense potential of this in-browser speech transcription tool. Watch now to see how WebGPU Whisper is changing the game!

🔔 Make sure to like, comment, and subscribe for more tutorials and updates!

📰 News & Resources: https://sendfox.com/nerdingio
📞 Book a Call: https://calendar.app.google/M1iU6X2x18metzDeA

🎥 Chapters
00:00 Introduction
00:26 Demo
02:27 Setup
04:13 Code
06:10 Worker
10:36 Conclusion

🔗 Links
https://huggingface.co/spaces/Xenova/realtime-whisper-webgpu
https://github.com/xenova/transformers.js/tree/v3/examples/webgpu-whisper

⤵️ Let's Connect
https://everefficient.ai
https://nerding.io
https://twitter.com/nerding_io
https://www.linkedin.com/in/jdfiscus/
https://www.linkedin.com/company/ever-efficient-ai/

## Key Highlights

### 1. Real-time In-Browser Transcription
The video demonstrates real-time speech recognition using WebGPU and Transformers.js directly in the browser, eliminating backend calls.

### 2. Whisper Base Model with WebGPU
It leverages the Whisper base model (73M parameters) and WebGPU for fast, efficient processing within the browser environment.

### 3. Transformers.js V3 Branch
The implementation uses the experimental V3 branch of Transformers.js, showcasing the latest advancements and capabilities.

### 4. Worker.js for Processing
The core logic resides in `worker.js`, which handles model loading, audio processing, and on-the-fly transcription generation with Transformers.js.

### 5. Data Streaming & Tokenization
The video highlights the process of streaming audio chunks, converting them, performing transcription, tokenizing the output, and posting the results back to the main thread for display.

## Summary

### Video Summary: How to use Real Time In-Browser Transcription with WebGPU Whisper and Transformer.js

**1. Executive Summary:**

This video explores the implementation of real-time, in-browser speech transcription using WebGPU and Transformers.js, specifically the Whisper base model. It demonstrates how to set up the project and understand the core code logic, highlighting the benefits of local processing for privacy and speed.

**2. Main Topics Covered:**

* **Introduction to WebGPU Whisper and Transformers.js:** Overview of the technology and its capabilities for real-time, in-browser speech transcription.
* **Demo of Real-Time Transcription:** Showcases live transcription of speech using the Whisper base model running entirely in the browser. Network activity is monitored to prove no external API calls are made.
* **Hugging Face Integration:** Uses a Hugging Face Space for the demonstration and accesses the Whisper base model (73M parameters).
* **Setup and Local Implementation:** Guides viewers on how to clone the project, install dependencies (`npm install`), and run the application locally (`npm run dev`).
* **Code Walkthrough (worker.js):** Detailed examination of the `worker.js` file, which handles the core transcription logic using the Transformers.js V3 experimental branch. This includes model loading, audio processing, and transcription generation.
* **Audio Processing Pipeline:** Explanation of how audio chunks are streamed, converted, processed by the Whisper model, tokenized, and sent back to the main thread for display.
* **Explanation of Data Flow:** From microphone input to displayed text, detailing how `app.js` and `worker.js` communicate via message passing.

**3. Key Takeaways:**

* WebGPU and Transformers.js enable real-time, in-browser speech transcription, eliminating the need for backend servers and ensuring data privacy.
* The Whisper base model (73M parameters) can be run efficiently in the browser using WebGPU for accelerated processing.
* The Transformers.js V3 branch provides the necessary tools and APIs for implementing the speech recognition pipeline.
* The `worker.js` script is crucial for handling computationally intensive tasks like model loading and audio processing in a separate thread.
* The video demonstrates a practical application of modern browser technologies for advanced AI tasks.

**4. Notable Quotes or Examples:**

* "Again, we didn't fetch anything from our back end; this was just loaded through the browser."
* "This is crazy fast for doing real-time transcription in the browser. Super helpful if you were trying to do chat-based systems where you actually wanted to have voice in with your chat, without having to go out to OpenAI or some other closed-source or even open-source model that requires a backend call."
* (Explanation of `worker.js` functionality) "...the generate function. This function is what really is going to be taking the audio and taking the language and then actually processing it."
* (Data streaming/chunking) "...we're going to take a blob and chunk it, and so then we're going to make a file reader and read those, and then send that on when the file is being read."

**5. Target Audience:**

* Web developers interested in implementing real-time speech recognition in their applications.
* Developers using the Transformers.js library or interested in WebGPU and browser-based AI.
* AI enthusiasts and researchers exploring the possibilities of running machine learning models directly in the browser.
* Individuals concerned about data privacy and seeking solutions that avoid sending audio data to external servers.
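To make the summary concrete, here is a minimal sketch of what in-browser, WebGPU-accelerated Whisper transcription can look like with the high-level Transformers.js `pipeline()` API. Note that the demo in the video uses a lower-level worker-based setup rather than this one-liner; the package import path, model id, and option names below are assumptions based on typical Transformers.js V3 usage, not code from the repository.

```js
// Minimal sketch, assuming a Transformers.js V3 build; the v3 example in the
// repo may import from a different package path.
import { pipeline } from '@huggingface/transformers';

// Load the Whisper base model and run inference in-browser on the GPU via WebGPU.
const transcriber = await pipeline(
  'automatic-speech-recognition',
  'onnx-community/whisper-base', // assumed model id (~73M parameters)
  { device: 'webgpu' }
);

// `audio` is a Float32Array of mono PCM samples at 16 kHz,
// e.g. decoded from the microphone with the Web Audio API.
const result = await transcriber(audio, { language: 'en' });
console.log(result.text);
```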
## Full Transcript

Hey everyone, welcome to nerding.io. I'm JD, and today we're going to be looking at real-time, in-browser speech recognition using WebGPU and Transformers.js to actually use speech in the browser. So with that, let's go ahead and get started.

All right, the first thing we're going to do is go to this Hugging Face Space and look at this demo for Whisper WebGPU. It's real-time, in-browser recognition. It's using Whisper base, which is a 73-million-parameter speech recognition model, and it uses ONNX and Transformers.js. You can see that the audio visualizer is already picking up my voice from the microphone and able to do some kind of visual. What we're going to do is actually load this model, and then we should be able to see real-time transcription.

But what we also want to show is that if you go and look at your network tab (right now I'm on the Wasm tab), we want to pay attention to both Fetch and Wasm. So we're going to go ahead and load the model, and you can see it pulled it in really quickly. It could be because it's cached, but this was incredibly fast, and it's already doing the transcription in real time. You can see that it's trying to add punctuation, and there's the token count down here. If we go over to the Wasm tab, this is all that was loaded; it's going through a worker to establish this, and the size was 4 megabytes.

So again, this is crazy fast for doing real-time transcription in the browser. Super helpful if you were trying to do chat-based systems where you actually wanted to have voice in with your chat, without having to go out to OpenAI or some other closed-source, or even open-source, model that requires a backend call. Again, we didn't fetch anything from our back end; this was just loaded through the browser.

So now what we're going to do is take this, look at the repo, and dig into the code. If you come over here to Transformers.js, the thing you want to note is that you want to be on the V3 branch. This is the new branch; it's still experimental, but it's going to be the next version of Transformers.js. There are a lot of great examples; we've been trying to go through some of them, and it feels like every day they're coming out with a new demo, a new model that they're able to implement. Definitely really cool stuff. So if we look at this example, going into the Transformers.js examples and Whisper, we're able to actually dive into this code.

What we're going to do is pull this down, get it running on our local machine, and then start digging into the code. I'm going to use Cursor. I already have it pulled down, but you can do a git clone and then come over here and just do this. I'm in the root, so what we need to do is go to examples, then webgpu-whisper, and we'll do an npm install and then an npm run dev. We have our localhost here, so we can get this up and running. We'll watch really quickly if we can get this loaded; again, super quick. You saw how quickly the load time was for that; sometimes it's cached and the first load is going to be slower, but still really impressive.

So if we look at some of the core files in here, we have our app, which is the main file, and what we're going to be looking at is how we're actually pulling in the worker.
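As a rough illustration of how the main `app.js` file might create and listen to that worker, here is a hedged sketch. The message shape (a `status` field with values such as `loading`, `progress`, `ready`, `start`, `update`, and `complete`) follows what the video describes; the exact field names in the repository may differ, and the UI helpers (`showLoadingIndicator`, `updateProgress`, `enableMicrophone`, `renderText`, `renderStats`) are hypothetical placeholders.

```js
// app.js (sketch): spin up the worker and react to its status messages.
// Field names and helper functions below are assumptions, not the repo verbatim.
const worker = new Worker(new URL('./worker.js', import.meta.url), { type: 'module' });

worker.onmessage = (event) => {
  const message = event.data;
  switch (message.status) {
    case 'loading':   // worker has started fetching the model files
      showLoadingIndicator();
      break;
    case 'progress':  // per-file download progress forwarded from the worker
      updateProgress(message.file, message.progress);
      break;
    case 'ready':     // model is loaded; we can start streaming audio
      enableMicrophone();
      break;
    case 'start':     // the worker has begun transcribing the latest chunk
      break;
    case 'update':    // partial text plus tokens-per-second stats
      renderText(message.output);
      renderStats(message.tps, message.numTokens);
      break;
    case 'complete':  // final text for this chunk
      renderText(message.output);
      break;
  }
};

// Ask the worker to load the model once at startup.
worker.postMessage({ type: 'load' });
```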
This `worker.js` file is where we're actually going to do our logic for Transformers.js. We're pulling in the model and sending messages back and forth. You can see right here that as messages are received, we're looking at the type of event being passed and then sending that data along as well. Most of this is for loading, and there's an update event, which is where you're updating the text. We're going to look and see where this is working.

If we continue down and look at the recorder, this is all the information about starting and stopping the recording, and then what's happening here is how this information is being sent with this code. Right here it's saying that as the recording comes in, we're going to take a blob and chunk it, then make a file reader and read those, and send that on when the file is read: get that bit of information that's decoded, slice it, and then send those slices over to our worker. So we're posting the message to generate our transcription on the fly. Then this is just the HTML that we're going to be pulling in. So what we'll do is jump over to the worker and take a look at how this actually functions.

Real quick, everyone: if you haven't already, please remember to like and subscribe; it helps more than you know. If you have any questions, please leave them in the comments. And with that, let's get back to it.

All right, now that we're in our worker file, we're going to start looking at how this actually functions. The first thing is we're loading in Transformers: we pull in our AutoTokenizer, our AutoProcessor, and then Whisper and TextStreamer, as well as full. If we look at what we're doing first, we have this class, which is our pipeline, and then we're using our instance where we're getting our model. We're saying this is our model ID; we have our tokenizer and our processor; we're getting the model based on our ID; at this point we're telling it what our dtype is; and specifically we're using the WebGPU device. Once all of these have processed, our getInstance has been established.

If we scroll down to the bottom, we can look at some of the loader functions. You can see here there's the load; this is where we're posting back messages so you can see the progress bar, things like that. We're establishing some events based on our message, and then once the model is actually loaded, we're going to be looking for this generate function. This function is what's really going to be taking the audio and the language and then actually processing it.

The first thing we'll be doing is saying: if processing, return; if it is not processing, then go ahead and set processing to true, maintaining state. We're going to send a message back to say that the thread is starting, so we're saying "start" as our status, and then we're pulling in our instance. So again, where we set our tokenizer, our processor, and our model, we're pulling that from our instance and starting to actually process. We have our callback function, which is going to give us the output of our postMessage of "update" and information about our TPS, our tokens, and our actual output. Then we have our streamer, and this is where we're actually processing our information.
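The recorder flow described above (take the recorded blob, read it with a FileReader, decode it, slice off the most recent samples, and post them to the worker) could look roughly like the following sketch. The constants, the sliding-window length, and the message shape are assumptions for illustration; `worker` refers to the Worker instance from the earlier sketch.

```js
// app.js (sketch): turn MediaRecorder output into a Float32Array for the worker.
// WHISPER_SAMPLING_RATE and MAX_AUDIO_LENGTH are assumed values; Whisper expects
// 16 kHz mono audio, and the demo transcribes a sliding window of recent audio.
const WHISPER_SAMPLING_RATE = 16000;
const MAX_AUDIO_LENGTH = 30; // seconds
const MAX_SAMPLES = WHISPER_SAMPLING_RATE * MAX_AUDIO_LENGTH;

const audioContext = new AudioContext({ sampleRate: WHISPER_SAMPLING_RATE });
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
const recorder = new MediaRecorder(stream);
const chunks = [];

recorder.ondataavailable = (e) => {
  // Each timeslice produces a chunk; collect them all into a single blob.
  chunks.push(e.data);
  const blob = new Blob(chunks, { type: recorder.mimeType });

  const fileReader = new FileReader();
  fileReader.onloadend = async () => {
    // Decode the compressed recording into raw PCM samples.
    const decoded = await audioContext.decodeAudioData(fileReader.result);
    let audio = decoded.getChannelData(0); // mono Float32Array

    // Keep only the most recent MAX_SAMPLES so the model sees a sliding window.
    if (audio.length > MAX_SAMPLES) {
      audio = audio.slice(-MAX_SAMPLES);
    }

    // Hand the samples to the worker; it replies with 'update'/'complete' messages.
    worker.postMessage({ type: 'generate', data: { audio, language: 'en' } });
  };
  fileReader.readAsArrayBuffer(blob);
};

recorder.start(250); // emit a chunk roughly every 250 ms
```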
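On the worker side, the singleton pipeline class and the generate function that the walkthrough describes might be sketched roughly as follows. Class names, dtype values, and message fields are inferred from the video and typical Transformers.js V3 usage; treat them as assumptions and check the example in the linked repository for the real implementation. The real example also imports `full` (used to warm up the WebGPU shaders with a dummy input), which this sketch omits.

```js
// worker.js (sketch): model loading plus streaming generation.
// Import path and option values are assumptions; the v3 example may differ.
import {
  AutoTokenizer,
  AutoProcessor,
  WhisperForConditionalGeneration,
  TextStreamer,
} from '@huggingface/transformers';

class AutomaticSpeechRecognitionPipeline {
  static model_id = 'onnx-community/whisper-base'; // assumed model id (~73M params)
  static tokenizer = null;
  static processor = null;
  static model = null;

  // Lazily create the tokenizer, processor, and model exactly once.
  static async getInstance(progress_callback = null) {
    this.tokenizer ??= AutoTokenizer.from_pretrained(this.model_id, { progress_callback });
    this.processor ??= AutoProcessor.from_pretrained(this.model_id, { progress_callback });
    this.model ??= WhisperForConditionalGeneration.from_pretrained(this.model_id, {
      dtype: { encoder_model: 'fp32', decoder_model_merged: 'q4' }, // assumed dtypes
      device: 'webgpu',                                             // run on WebGPU
      progress_callback,
    });
    return Promise.all([this.tokenizer, this.processor, this.model]);
  }
}

let processing = false;

async function generate({ audio, language }) {
  if (processing) return; // drop chunks while a previous one is still in flight
  processing = true;
  self.postMessage({ status: 'start' });

  const [tokenizer, processor, model] = await AutomaticSpeechRecognitionPipeline.getInstance();

  // Stream partial text back to the main thread as tokens are produced.
  let numTokens = 0;
  let startTime;
  const streamer = new TextStreamer(tokenizer, {
    skip_prompt: true,
    callback_function: (output) => {
      startTime ??= performance.now();
      const tps = (++numTokens / (performance.now() - startTime)) * 1000;
      self.postMessage({ status: 'update', output, tps, numTokens });
    },
  });

  // Convert raw samples into model inputs, then generate with streaming output.
  const inputs = await processor(audio);
  const outputs = await model.generate({ ...inputs, language, streamer, max_new_tokens: 64 });

  // Decode the full token sequence and post the final text for this chunk.
  const text = tokenizer.batch_decode(outputs, { skip_special_tokens: true });
  self.postMessage({ status: 'complete', output: text });
  processing = false;
}

self.onmessage = async (e) => {
  const { type, data } = e.data;
  if (type === 'load') {
    self.postMessage({ status: 'loading' });
    // Forward per-file download progress events to the main thread.
    await AutomaticSpeechRecognitionPipeline.getInstance((p) => self.postMessage(p));
    self.postMessage({ status: 'ready' });
  } else if (type === 'generate') {
    generate(data);
  }
};
```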
For the input, we're taking our audio and putting it through our processor. Then we have our transcription, based on our model and our generate, where we're passing those same inputs and our language, and we're actually streaming the output. Now we have our tokenizer; this gives us the information displayed in the bottom right of how many tokens we're actually using, and then we're posting back that chunk of text as it is complete.

And remember, if we look at what these post messages are, we come back here, and that is the information being handled on message received. If we look down to update, we see that we're setting our TPS here as well as our message. Our onMessage right here is where we're sending the information back, and that's how we're actually passing information down here; this is where our TPS is, and then the rest of the progress bar and the audio. I'm looking for the actual information around the text... oh, it's right here. On ready, we're passing the information of the text that's being updated.

And that's how this works: we're basically taking this information after it passes through the model, as it's being generated by the model, and then posting that information back. It's processing the audio chunk, converting it, doing a transcription on this generation, then streaming it, doing the tokenizer step, and posting all that information back.

All right everyone, that's it for us today. What we covered was using real-time speech recognition in the browser with WebGPU and Transformers.js. We were able to dig through some of the repo and actually get their demo up and running. And with that, happy nerding!

---

*Generated for LLM consumption from nerding.io video library*