# Chrome’s New AI Features Let You Build Multimodal Apps (No Backend!)

## Metadata

- **Published:** 6/18/2025
- **Duration:** 21 minutes
- **YouTube URL:** https://youtube.com/watch?v=_THRY0Gyksg
- **Channel:** nerding.io

## Description

Join me for an in-person Vibe Coding Retreat: https://www.vibecodingretreat.com/

Google Chrome just got a serious AI upgrade—and we’re putting it to the test by building a multimodal AI-powered game using only browser-native tools. You’ll see how to combine:

🎤 Microphone input for voice interactions
🖼️ Image recognition from webcam, canvas, or uploads
🎮 Simple JavaScript game logic
🧠 Local AI to drive real-time decisions

Perfect for building privacy-first games, offline LLM apps, and creative in-browser experiences using tools like MediaPipe, WebGPU, and experimental multimodal APIs.

🔗 Useful Links:
📘 Chrome AI Dev Docs: https://developer.chrome.com/docs/extensions/ai/prompt-api
🎮 Source Code: https://github.com/nerding-io/emoji-webai
📩 Newsletter: https://sendfox.com/nerdingio

💬 Want to build games with image or voice control? Let me know in the comments!
👍 Like & Subscribe for more cutting-edge AI + WebDev projects.

## Key Highlights

### 1. Chrome's Native AI: Gemini Nano in the Browser
Chrome exposes JavaScript APIs for Gemini Nano directly in the browser, enabling AI features such as summarization, language detection, and translation without backend dependencies.

### 2. Multimodal Input: Images and Audio to LLMs
The Prompt API supports multimodal inputs such as images (blobs, canvas, video frames) and audio (blobs, audio buffers, media streams), allowing developers to analyze diverse data types directly in Chrome.

### 3. Prompt API: System Prompts & Multi-Turn Conversations
The Prompt API offers system prompts for model initialization, multi-turn (n-shot) prompts for complex conversations, and per-user customization within sessions.

### 4. Tool Function Emulation for AI-Powered Actions
The API supports emulating tool (function) calling, allowing developers to rewrite prompts based on function execution results and extend the model's capabilities and interaction possibilities.

### 5. Shareable AI-Powered Experiences via Base64 Encoding
The video demonstrates building a shareable emoji game that encodes its data as base64 in a URL, allowing easy distribution of AI-driven experiences.

## Summary

### Chrome's New AI Features: Build Multimodal Apps (No Backend!) - Summary Document

**1. Executive Summary:**

This video explores Chrome's new AI capabilities, focusing on the multimodal Prompt API, which lets developers analyze images and audio directly in the browser using Gemini Nano. It demonstrates building an emoji-based game entirely on the client side, showing how to combine microphone input, image recognition, and JavaScript logic into an AI-powered experience with no backend dependencies.

**2. Main Topics Covered:**

* **Introduction to Chrome's Native AI:** Overview of Chrome's exposed JavaScript APIs for accessing Gemini Nano's AI features.
* **Multimodal Input Capabilities:** Using the Prompt API to handle images (from sources such as the webcam and canvas) and audio inputs for LLM processing.
* **Prompt API Features:** System prompts for model initialization, n-shot prompts for multi-turn conversations, per-user customization, tool function emulation, and multimodal input handling.
* **Building a Multimodal Emoji Game:** Walkthrough of a practical example that analyzes images or audio, translates the result into emojis, and encodes it in a shareable URL.
* **Implementation Details:** Code snippets and explanations for capturing images/audio, pre-processing the data, initializing the AI model, constructing prompts, and generating output.
* **Setting up the Chrome Environment:** Required Chrome Canary version and flags to enable (Experimental Web Platform features, the Prompt API for Gemini Nano, and multimodal support).

**3. Key Takeaways:**

* Chrome now offers powerful, browser-native AI via Gemini Nano, accessible through JavaScript APIs, so AI apps can be built without a backend.
* The multimodal Prompt API lets developers analyze images and audio directly in the browser, opening up creative and privacy-focused applications.
* The API offers advanced features such as system prompts, multi-turn conversations, and emulated tool (function) calling.
* Shareable AI-powered experiences can be created by encoding data within URLs.
* This technology enables offline LLM apps and creative in-browser experiences.

**4. Notable Quotes or Examples:**

* **"We're going to look at how you can actually leverage AI in Chrome, the browser, and actually use something called multimodal. What that means is that we can actually analyze and run AI on images and audio text."** - Introduces the core concept.
* **Image input example:** The user takes a photo with the webcam, and the AI identifies it as a "person in a beanie".
* **Audio input example:** The user records themselves saying "the quick brown fox jumps over the lazy dog"; the AI transcribes it as "a fox jumps over a dog" and converts it into relevant emojis.
* **Shareable links:** "The way that this is saving the information is we're all using like base 64 encode."

**5. Target Audience:**

* Web developers interested in exploring and integrating AI features into their web applications.
* Developers looking to build privacy-first applications that leverage AI without relying on external servers.
* Developers interested in creating multimodal experiences that use image and audio inputs.
* JavaScript developers looking for hands-on examples of how to use Chrome's new AI APIs.

## Full Transcript

Hey everyone, welcome to nerding.io. I'm JD, and today we're going to look at how you can leverage AI in Chrome, the browser, using something called multimodal input. That means we can analyze and run AI on images, audio, and text. We're going to build a little game and see how we can interface with this, so let's get started.

All right, the first thing I want to point out is that this is running in Canary. To use the multimodal Chrome AI, you have to be in the experimental browser. I'll go through exactly how to get this set up and which versions you need, but first I want to take you through a demo. The concept of the game is that we take either an image or some audio and have AI translate it into an emoji sequence to decode, and that's the game. You can then share the link, have people guess whether they got it right, and use AI to give them a hint. We'll go through the two examples and then dive into the code.
So, first you can just click and take a photo. I'll say "allow this time," and you can see the camera feed coming in. The other thing I want to point out: I love these dumb little console debug messages, but they're genuinely useful here, because they show things like which Chrome version you need and what is actually happening. You can see we're creating a canvas, we have our handler, and we have different FileReaders; the FileReader is what pulls that image data in. We could also just type something here, like "nerd wearing hat," or we could use audio, but first let's click generate and see what it comes up with. It looks like it's going out to get the model, but it doesn't actually need to: the model is already in the browser. That's the most important part. This is Chrome's native AI, Gemini Nano, doing the work. And now here is what it decoded: it's showing a camera, which maybe means a selfie, and someone looking at it. It doesn't give you the answer. You can copy this, share it on social media, and then someone can come in and guess. We'll guess "selfie," and as you can see right there, wrong answer. It's generating the clue, so it gives you a hint here and then posts it back. We'll say "person in a beanie," and then just "in a beanie." Cool, now it reveals the answer, and this is what the AI actually saw as the answer. The way the game saves this information is by base64-encoding it.

If we go back, we can start a new one and do the exact same thing; we'll keep the decode view open. This time we'll record something: "the quick brown fox jumps over the lazy dog." You can play back the recording, so we know the microphone captured it. Now we generate again and watch it transcribe. Right here you can see "a fox jumps over a dog." It missed some of the words, but this is the emoji sequence it gave us. Again, we can copy the URL, open it, and guess "fox dog," then "over a dog." I could probably add something to make the evaluation a little looser, so the guess doesn't have to match exactly. And now we got it right: correct, well done. We can share that, or create a new challenge if we want. And that's the game. It's using multimodal input: you can analyze the image, you can analyze the audio, and all of it is happening in Chrome.
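As a rough illustration of that base64 sharing trick, here is a minimal sketch. The helper names and the `challenge` query parameter are assumptions for illustration; the actual code in github.com/nerding-io/emoji-webai may be structured differently.

```js
// Hypothetical helpers sketching the share-link idea from the demo: the answer
// and emoji sequence are base64-encoded into a URL parameter, which lightly
// masks the answer while carrying everything the guess page needs.

function toBase64(str) {
  const bytes = new TextEncoder().encode(str);           // handle emoji / UTF-8 safely
  return btoa(String.fromCharCode(...bytes));
}

function fromBase64(b64) {
  const bytes = Uint8Array.from(atob(b64), c => c.charCodeAt(0));
  return new TextDecoder().decode(bytes);
}

function buildShareUrl(answer, emojis) {
  const payload = toBase64(JSON.stringify({ answer, emojis }));
  const url = new URL(location.origin + location.pathname);
  url.searchParams.set('challenge', payload);
  return url.toString();
}

function readChallenge() {
  const payload = new URLSearchParams(location.search).get('challenge');
  return payload ? JSON.parse(fromBase64(payload)) : null;
}
```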
Real quick, if you haven't already, please like and subscribe; it helps more than you know. Also, check out the link below for the Vibe Coding Retreat. It's a boot camp I'm putting together for people who want to elevate their skills and finish the last 10% of vibe coding: making your apps more secure and actually launching them in the wild. With that, let's get back to it.

All right. At Google I/O, they announced several APIs coming to the Chrome browser, which means you can leverage Gemini Nano directly in the browser through exposed JavaScript APIs to access different AI features. I've previously done videos on the summarizer, language detection, and translator APIs, and even one on the Prompt API; today we're looking at the guts of the Prompt API. It starts with Chrome version 138. For the multimodal pieces we just saw in action, you still use the Prompt API, just with different attributes or parameters you can send, and you still need to be part of the early preview program. So we're going to look at the code you can pull for the Prompt API, but we're also going to go through the steps to get this set up.

The first thing you need to do, as in the other videos, is join the early preview program. You fill out the form and put in your email, and depending on the API it will ask for a bit more, like your EPP ID and some additional details. That's all covered in the previous videos. Specifically for the Prompt API, you then need to download Chrome Canary, the experimental browser, and turn on a few flags. In chrome://flags you can enable different features and use them. First, "Experimental Web Platform features" needs to be enabled. You also need the on-device model optimization flag. You don't need WebNN enabled; I just like having it on and have done experiments with it. But this one is the important one: the Prompt API for Gemini Nano needs to be enabled, and you also need the Prompt API for Gemini Nano with multimodal input, which is specifically what lets us pass in images and audio. Every time you enable a flag, you need to restart the browser for it to take effect. And per the other APIs, I also have summarization, writer, and rewriter turned on. Once those are on, you can start to experiment.

In the GitHub repo they have the API explainer, which covers a whole bunch of things you can do with all of this. If you scroll down, they cover zero-shot prompting and system prompts, which is interesting because they use initial prompts: when you create your language model session and initialize everything, you pass initial prompts, and the first one is the system role. That makes a ton of sense, because you always want your system prompt at the beginning. Then you just call prompt(), and that accesses Gemini Nano. You can also sequence these prompts, what they call n-shot prompts: you provide roles such as system, user, assistant, user, assistant, very similar to how most AI SDKs work, except here we're hitting the small Nano model.
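As a rough sketch of what that looks like in code, assuming the current global `LanguageModel` object described later in the video (the API is experimental and its exact shape may change, and the prompt text here is illustrative):

```js
// Sketch of a Prompt API session with a system prompt plus an n-shot history.
// Experimental API; run inside an async function or a module with top-level await.

const session = await LanguageModel.create({
  initialPrompts: [
    // The first initial prompt acts as the system prompt.
    { role: 'system', content: 'You are a terse assistant that answers in one short sentence.' },
    // Optional n-shot examples: alternating user/assistant turns.
    { role: 'user', content: 'Describe a cat in two words.' },
    { role: 'assistant', content: 'Fluffy hunter.' },
  ],
});

// Each prompt() call runs locally against Gemini Nano in the browser.
const reply = await session.prompt('Describe a dog in two words.');
console.log(reply);

// Nano sessions hold limited context, so free them when you're done.
session.destroy();
```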
If you keep scrolling, you also have the ability to customize by user, so a multi-user session. I was scrolling a little fast, but they also cover prompt streaming, and you can emulate tool (function) calling. That is super useful, because you can look for a function call, execute it, rewrite part of the prompt with the result, and keep going on top of that. And then there are the multimodal inputs, which is really cool stuff. Image input can be a blob, an SVG image element, an HTML canvas element (meaning you could actually draw something), or an HTML video element, and the fact that it grabs the current frame of a playing video is super cool. You can also pass audio inputs: a blob, an audio buffer, or a media stream, and you can pipe in an audio element as well. The fact that you can hand HTML elements straight to your model is mind-blowing to me.

The only thing you need to do is, when you call create on the language model, declare that your expected inputs are going to be audio and image. Then you can pass that data in. In the explainer example, it's either an image blob or a querySelector of a canvas, and once that goes into your session you can send it over to the LLM. Another example captures the microphone for 10 seconds and sends that. One thing to note is that a session only allows so much context and so much data, because it is the Nano model, but the fact that you can send this at all is just super cool. Also, you can't use cross-origin content.

Now let's look at the code itself. In my example I started with a default model setup, so let's go to the top, right where I initialize this function. I created a helper function that looks for window.ai and window.languageModel, which were the previous API surfaces, and then for the current LanguageModel global, basically checking whether the language model exists in the browser; if not, it returns null. Based on that null, I can tell what the Chrome availability is: if the API is missing, I log and return that it is not available in Chrome, otherwise we keep going. The next check is whether availability is a function and, if so, what it returns. Then we check the expected inputs: do I have text, do I have image, do I have audio? If we have all of those, we continue and execute against the API; a minimal sketch of this detection step follows. From there, the code builds out the click handlers for when we get data from the camera or the microphone.
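Here is a minimal sketch of that detection and availability check, assuming the current `LanguageModel` global. The helper names are hypothetical, and the exact options and return values of this experimental API may vary between Chrome versions.

```js
// Sketch of the feature-detection helper described above (hypothetical names).

function getLanguageModel() {
  // Current global, with the older window.ai entry point as a fallback.
  return self.LanguageModel ?? self.ai?.languageModel ?? null;
}

async function checkMultimodalSupport() {
  const lm = getLanguageModel();
  if (!lm) {
    console.log('Prompt API is not available in this Chrome.');
    return false;
  }
  if (typeof lm.availability !== 'function') {
    console.log('availability() is not exposed; older API surface.');
    return false;
  }
  // Ask whether the model can take text, image, and audio inputs.
  const status = await lm.availability({
    expectedInputs: [{ type: 'text' }, { type: 'image' }, { type: 'audio' }],
  });
  console.log('Availability:', status); // e.g. "available", "downloadable", "unavailable"
  return status !== 'unavailable';
}
```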
When we capture a photo, we get the video stream, take that element, and create a video element so we can snapshot a frame. Once we click the button, we resolve through a promise, and right here we use a canvas to put that screenshot on the page. We'll use that canvas to make a blob later on, so we can take that data and hand it to the LLM. For audio, right here we start recording the media stream, chunk that data, turn it into a WAV-typed blob, and put an HTML audio element on the page; again, that blob is what we'll grab and feed into the LLM.

Now, here is a good example of plain text: this is how we generate the clue. We initialize the model using initial prompts. You could probably break this up into multiple prompts, but basically I pass in the attempt number and the text, with instructions on how to give a clue, and then whatever answer comes back gets sent to the front end. Right here is the clue session, where I send that over to the LLM.

Next, the image path. When we click generate, the first thing we check is the current media type: is it an image or audio? We take that and create a session just like we did for the clue, but this time we set the expected inputs to image and text, and we use the image initial prompts: you're an expert at describing images, write a one-to-two-word description that captures the main subject or action. We could change this around or add different types of prompting, but we just send it as is. So now we've initialized our session, with a monitor event so we understand the download progress, although the model is already loaded in the browser, so we don't have to worry about that too much. We pull the data from the blob, process it, and put it into the prompt object with a type of image and the blob as its value, and send that to the LLM. Right here is the prompt session where we get our short description back. Once we have that description, we run another sequence that converts it into an emoji sentence. You could potentially do it all in one step, but the problem is that we need to store the descriptive text for later, and that text is what we turn into a base64 encode. So now we can take that text and turn it into emoji, again by sending it to the LLM, and then we destroy our session. So we have our emojis.
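A rough sketch of that image-to-emoji flow, assuming the content-array prompt shape from the explainer (the session options and prompt wording here are illustrative, not the repo's exact code); the audio path described next uses the same pattern with type 'audio'.

```js
// Sketch of the image -> description -> emoji flow described above.
// Experimental Prompt API; shapes and wording are illustrative.

async function imageToEmoji(imageBlob) {
  const session = await LanguageModel.create({
    expectedInputs: [{ type: 'image' }],
    initialPrompts: [{
      role: 'system',
      content: 'You are an expert at describing images. Answer with a one-to-two-word description of the main subject or action.',
    }],
    monitor(m) {
      // Reports model download progress; usually a no-op once Nano is already local.
      m.addEventListener('downloadprogress', e => console.log('download', e.loaded));
    },
  });

  // Multimodal prompt: a text instruction plus the image blob in one user turn.
  const description = await session.prompt([{
    role: 'user',
    content: [
      { type: 'text', value: 'Describe this image.' },
      { type: 'image', value: imageBlob },
    ],
  }]);

  // Second turn in the same session: convert the description into emoji only.
  const emojis = await session.prompt(
    `Convert this description into a short emoji sequence, emoji only: ${description}`
  );

  session.destroy();
  return { description, emojis };
}

// An audio blob works the same way, with expectedInputs [{ type: 'audio' }]
// and { type: 'audio', value: audioBlob } in the content array.
```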
If we want audio instead, like we saw in the demo, we take that recorded blob and do the same thing. We tell the session that the expected input is audio, with a prompt along the lines of: you're an expert at transcribing and describing audio; listen to the clip and write out what you hear. There might be a way to do this all in a single function, but this was the easiest way to figure it out. Then we run the transcription prompt: we give it the audio blob right here, tell it what kind of description we expect, and then the emoji session takes that transcription and converts it into an emoji sequence. Lastly, if there's text only, we simply take the original answer and turn it into emoji as well.

Down here you can see that we then take that information and create a shareable link. That link base64-encodes the text answer, just as a simple way to mask what the answer is, and carries the original text and the emojis so we can populate the next screen, built from location.origin. That way we can share the emoji quiz as a game and make it a little social experiment, if you will. And that's pretty much it. You now have the ability to use multimodal input, run multiple prompt sessions, chain them in a sequence, and include all of it directly in your JavaScript without ever touching a backend.

All right, that's it for today, everyone. What we went through is how to use Chrome's built-in AI, leverage a multimodal LLM to analyze images and audio, and build a shareable emoji game. And with that, happy nerding.

---
*Generated for LLM consumption from nerding.io video library*