# What is WebGPU Whisper ML Powered In Browser Transcription with Transformer.js

## Metadata

- **Published:** 6/17/2024
- **Duration:** 9 minutes
- **YouTube URL:** https://youtube.com/watch?v=YuYf-MWQbTo
- **Channel:** nerding.io

## Description

Explore the revolutionary WebGPU Whisper and Transformers.js in this video! Released exactly one year after Whisper Web, this technology supports multilingual transcription and translation across 100 languages. Leveraging WebGPU technology, this integration provides blazingly fast performance and exceptional accuracy directly within your browser. Thanks to Transformers.js, the model runs entirely locally in your browser, ensuring that no data leaves your device – a huge win for privacy!

🔔 Make sure to like, comment, and subscribe for more tutorials and updates!

📰 News & Resources: https://sendfox.com/nerdingio
📞 Book a Call: https://calendar.app.google/M1iU6X2x18metzDeA

🎥 Chapters
00:00 Introduction
00:31 Demo
01:55 Setup
02:54 Code

🔗 Links
https://huggingface.co/spaces/Xenova/whisper-webgpu
https://github.com/xenova/whisper-web/tree/experimental-webgpu

⤵️ Let's Connect
https://everefficient.ai
https://nerding.io
https://twitter.com/nerding_io
https://www.linkedin.com/in/jdfiscus/
https://www.linkedin.com/company/ever-efficient-ai/

## Key Highlights

### 1. WebGPU Whisper: Offline Transcription

The video showcases Whisper WebGPU's ability to perform offline transcription tasks directly in the browser, using files, URLs, or recordings and eliminating backend processing.

### 2. Transformers.js and Web Workers

The implementation leverages Transformers.js for ML processing and Web Workers for parallel execution, keeping all data processing localized within the browser.

### 3. Pipeline Factory for Task Management

The system uses a pipeline factory pattern with automatic speech recognition to define and manage the transcription task, streamlining the process (see the sketch after these highlights).

### 4. Audio Chunking for Processing

The video explains how audio input is chunked and processed within the web worker, highlighting the data flow and steps for transcription.
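To make highlight 3 concrete, here is a minimal sketch of the pipeline-factory pattern in Transformers.js. The class names follow the factory/subclass structure described in the video; the model name `Xenova/whisper-tiny` and the progress-callback wiring are illustrative assumptions, not the repo's exact code.

```js
import { pipeline } from "@xenova/transformers";

// Generic factory: lazily creates and caches a single pipeline so the
// model is downloaded and initialized only once per page load.
class PipelineFactory {
  static task = null;
  static model = null;
  static instance = null;

  static async getInstance(progress_callback = null) {
    if (this.instance === null) {
      // Illustrative: forward download/initialization progress to the caller.
      this.instance = pipeline(this.task, this.model, { progress_callback });
    }
    return this.instance;
  }
}

// Concrete factory: binds the generic factory to the
// automatic-speech-recognition task covered in the video.
class AutomaticSpeechRecognitionPipelineFactory extends PipelineFactory {
  static task = "automatic-speech-recognition";
  static model = "Xenova/whisper-tiny"; // assumed model for illustration
}
```

Because `getInstance` caches the (promised) pipeline, the first call triggers the model download and setup, and every later transcription reuses the same instance.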
## Summary

## Video Summary: What is WebGPU Whisper ML Powered In Browser Transcription with Transformer.js

**1. Executive Summary:**

This video explores the implementation of WebGPU Whisper and Transformers.js for fast and private in-browser audio transcription across 100 languages. Unlike real-time transcription, this approach uses a pipeline and tasks for offline transcription from files, URLs, or recordings, ensuring all processing remains local and data doesn't leave the user's device.

**2. Main Topics Covered:**

* **Introduction to WebGPU Whisper:** Showcasing the offline transcription capabilities of WebGPU Whisper, built upon the earlier Whisper Web project.
* **Demo of Functionality:** Demonstrating transcription from a URL, recording directly in the browser, and exporting the results.
* **Codebase Overview:** A walkthrough of the codebase, focusing on the experimental WebGPU branch and highlighting key files (app.jsx, useTranscribe.jsx, worker.js).
* **Implementation Details:** Explanation of the use of Transformers.js, Web Workers, and the pipeline factory pattern for managing transcription tasks.
* **Audio Chunking and Processing:** Detailing how audio input is chunked, processed within the web worker, and transcribed.

**3. Key Takeaways:**

* **Offline Transcription:** WebGPU Whisper enables offline audio transcription directly in the browser without backend processing.
* **Privacy:** Leveraging Transformers.js ensures that all data processing, including audio and transcription, remains local to the user's device, enhancing privacy.
* **Performance:** WebGPU offers significantly faster processing compared to previous iterations, facilitating quick transcription times.
* **Task-Based Pipeline:** The system employs a pipeline factory for automatic speech recognition, streamlining the transcription process and allowing for pre-defined tasks.
* **Web Workers for Parallelism:** Web Workers facilitate parallel execution of tasks, enhancing performance and responsiveness.

**4. Notable Quotes or Examples:**

* "…nothing is getting loaded to the back end at all. It's saving everything in the browser, so from loading the audio file to actually getting the transcription, it's all done within the browser using Transformers.js." (Illustrates the local processing aspect.)
* "…we're actually going to use tasks, which is different than the real-time example." (Highlights the difference from earlier implementations.)
* Explanation of how the automatic speech recognition pipeline factory extends the base pipeline factory, defining the task like a "pre-trained or predefined task."
* Mention of the ability to export transcriptions in TXT or JSON format.

**5. Target Audience:**

* Web developers interested in implementing in-browser audio transcription.
* Machine learning engineers exploring the use of Transformers.js and WebGPU.
* Privacy-conscious developers seeking to minimize data transmission.
* Individuals interested in the advancements in in-browser ML processing.

## Full Transcript

Hey everyone, welcome to nerding.io. I'm JD, and today we're going to be looking at Whisper WebGPU again, but this time we're looking at a task and putting it through a pipeline, as opposed to real-time transcription. That means you can use a file or a URL, or record something in the browser, and then have the transcription processed from that. Let's go ahead and get started.

All right, the first thing we're going to do is come to the Hugging Face Space for Whisper WebGPU. This is a little different from the real-time example that we did previously: there are a few options, where you can transcribe from a file, a recording, or just a URL. We're going to take the URL and go through it. Basically, it pulls in the audio file (it hasn't done any transcription yet), and then when you click, you can see it loads in very quickly and starts doing the transcription. It again gives you the tokens per second, and then it gives you the ability to export either TXT or JSON. What's also interesting is that you can do a recording, so we'll just do some testing here ("hey, testing, one, two, three"), then load it in and see. Cool. What's really interesting about this is that nothing is getting loaded to the back end at all. It's saving everything in the browser, so from loading the audio file to actually getting the transcription, it's all done within the browser using Transformers.js.

What we're going to do now is look at the code base and dig into the repo. In this repo, you have to be on the experimental WebGPU branch; they also have a previous version of this example that doesn't use WebGPU. The reason we're using WebGPU is that it's just faster to load a lot of this information in the browser, and then we're going to be using web workers. The other interesting part about this is that we're actually going to use tasks, which is different from the real-time example.

So we're going to go ahead and clone this, cd into whisper-web, and jump into that repo. I'm just going to pull it up in my IDE real quick. Everyone, if you haven't already, please remember to like and subscribe, it helps more than you know, and if you have any questions, please leave them in the comments and we'll get back to you as soon as we can. With that, let's get back to it and look at a few files.

The first thing to note is that I actually have this running already; you need to do an npm install and npm run dev. The first thing we're going to look at is the app, and what's different about this is that it's just loading in the HTML. They're using a transcriber hook and a transcribe component, as well as the audio manager. We're going to look at the transcriber hook first, and we'll notice that it also has the worker.

If we go over to useTranscribe, the hook, and start going through it, these are all of our parameters that are being set: ways to set the language, our subtask, and things like that. Let's look at how we're actually using this web worker. What we're doing is sending an event, and it's very similar to before, where we have a case statement. We're looking at progress, and as the message comes in, we're setting the progress from this array of previous items; there's nothing on update at this point, and there is an on complete. What it's doing is checking whether it's busy or not, then sending this information to setTranscript and letting us know when it's ready.

If we continue further down, this is where we're actually posting that information. If you think about the recording itself, when we're processing that request, we take the audio input as an AudioBuffer. If that exists, we chunk it again, but in a different manner: we take the left and right channels and create a float array to pass along. Once we have all of this information, we do what we did before, which is send it to our web worker as a post message: the actual audio chunk, the model itself, the language, and the task that we're trying to run.
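As a rough sketch of the chunking step just described, the hypothetical helper below down-mixes a stereo `AudioBuffer` to a mono `Float32Array` and posts it, along with the model, language, and task, to the web worker. The function name and the plain channel averaging are assumptions for illustration; the repo's actual code may differ.

```js
// Hypothetical helper: down-mix a stereo AudioBuffer to mono and hand the
// samples to the transcription worker, as described in the walkthrough.
function sendAudioToWorker(worker, audioBuffer, { model, language, subtask }) {
  let audio;
  if (audioBuffer.numberOfChannels === 2) {
    // Whisper expects mono input, so combine the left and right channels
    // into a single Float32Array (simple averaging, assumed here).
    const left = audioBuffer.getChannelData(0);
    const right = audioBuffer.getChannelData(1);
    audio = new Float32Array(left.length);
    for (let i = 0; i < left.length; ++i) {
      audio[i] = (left[i] + right[i]) / 2;
    }
  } else {
    // Already mono: use the single channel as-is.
    audio = audioBuffer.getChannelData(0);
  }
  // Post the audio chunk plus model, language, and task to the worker.
  worker.postMessage({ audio, model, language, subtask });
}
```

`postMessage` structured-clones the `Float32Array`; an implementation could instead transfer the underlying `ArrayBuffer` to avoid the copy.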
The last piece is the transcribe portion, and where we're really going to look is the web worker, because this is how we define what's actually happening with Transformers.js. What's happening here is that we're creating a useWorker function to say what our createWorker is, and it's importing the worker.js file as a module.

If we look over at the worker, this is a bit different: in the previous examples we've seen, we weren't using the pipeline this way. What's happening here is that we're creating a factory of pipelines. The first thing it does, very similarly, is get an instance, but instead of necessarily defining the model, it's defining the pipeline; we're defining our model and tokenizer in the constructor, so we're passing that as part of the pipeline. If you look at the pipeline itself, this is where you're chaining different pieces together. You'll see here where we have the task; these are the types of tasks we can actually use inside of Transformers.js, and we're going to see which class we're actually using here.

So we go back to our worker, and looking further, this is where we set our event listener. It's listening for a message, and we have a class: the automatic speech recognition pipeline factory, which extends our pipeline factory and defines our task. If we look over here, we see that automatic-speech-recognition is the pipeline we're looking for; you could think of that as almost like a pre-trained or predefined task.
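Here is a minimal sketch of how the worker's message listener can tie this together, assuming the `AutomaticSpeechRecognitionPipelineFactory` from the earlier snippet. The message shape, status strings, and chunking parameters are illustrative assumptions rather than the repo's exact protocol.

```js
// Minimal worker sketch (worker.js), reusing the factory from the earlier
// snippet. Status names and options below are assumptions for illustration.
self.addEventListener("message", async (event) => {
  const { audio, model, language, subtask } = event.data;
  // A fuller implementation would rebuild the pipeline if `model` changes.

  // Lazily load (or reuse) the cached pipeline, forwarding download
  // progress to the main thread so the UI can show a loading state.
  const transcriber =
    await AutomaticSpeechRecognitionPipelineFactory.getInstance((progress) =>
      self.postMessage({ status: "progress", ...progress })
    );

  try {
    // Run the automatic-speech-recognition task on the mono audio chunk.
    const output = await transcriber(audio, {
      chunk_length_s: 30, // split long audio into 30-second chunks
      stride_length_s: 5, // overlap chunks so words aren't cut off
      language, // e.g. "english"
      task: subtask, // "transcribe" or "translate"
      return_timestamps: true,
    });
    self.postMessage({ status: "complete", output });
  } catch (error) {
    self.postMessage({ status: "error", error: error.message });
  }
});
```

On the main thread, the hook would create this worker as a module, typically with `new Worker(new URL("./worker.js", import.meta.url), { type: "module" })`, which matches the "importing the worker.js file as a module" step mentioned above.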
If we continue down to our transcribe feature, we have our audio, model, and subtask. We're loading in the model here, then defining our pipeline, then doing our getInstance, and then we're posting our message from the transcriber. As before, we're posting all that information back to be heard by the application, whether that's loading or actually processing and sending the text, which is what's updating the transcript. So rather than the back end, it's the web worker that's updating it. This streamer is where we're taking the text and defining those chunks; we're streaming the audio as chunks, and the text as well. Finally, we get to our output after we've awaited our transcription; we catch if there's an error, and otherwise we return. This is what gives us our TPS (tokens per second) and the output that's being generated.

All right, that's it for us today. Thanks, everyone. What we went over was Whisper WebGPU, specifically looking at how to do a pipeline and the ability to use tasks, as well as understanding what's happening when transcribing from a file or a URL, or a recording in the browser, with Transformers.js and ONNX. With that, happy nerding!

---

*Generated for LLM consumption from nerding.io video library*