# NLP Made Easy: ChatGPT Tokenizer The Building Blocks Of Natural Language Processing

## Metadata

- **Published:** 11/6/2023
- **Duration:** 12 minutes
- **YouTube URL:** https://youtube.com/watch?v=6p2T9iZUkn0
- **Channel:** nerding.io

## Description

Dive into the fascinating world of NLP with our latest video on the ChatGPT tokenizer and how tokenizers work in JavaScript! 🤖💬 Understand how machines break down and interpret human language with ease. We'll unravel the mystery behind word tokens, subword tokens, and byte pair encoding (BPE) tokens used in advanced models like GPT-3 and GPT-4. 🧠🔍

🔑 Key Highlights:
✅ What is Tokenization in AI?
✅ Different Types of Tokens in Natural Language Processing
✅ Byte Pair Encoding (BPE) Demystified
✅ Real-World Applications of Tokenization

Whether you're a tech enthusiast, a student stepping into machine learning, or simply curious about how AI understands language, this video is your gateway to the core concepts of tokenization. 🚀

Don't forget to like, comment, and subscribe for more insightful content on artificial intelligence and machine learning. 👍🔔

📰 FREE snippets & news: https://sendfox.com/nerdingio
👉🏻 Ranked #1 Product of the Day: https://www.producthunt.com/posts/ever-efficient-ai
📞 Book a Call: https://calendar.app.google/M1iU6X2x18metzDeA

🎥 Chapters
00:00 Introduction
00:17 OpenAI Limits
01:18 Playground
03:26 Sub Words
04:31 GPT Tokenizer
06:06 Advanced Tools
06:51 NextJS
10:41 Typescript
11:54 Conclusion

🔗 Links
https://gpt-tokenizer.dev/

⤵️ Let's Connect
https://everefficient.ai
https://nerding.io
https://twitter.com/nerding_io
https://www.linkedin.com/in/jdfiscus/
https://www.linkedin.com/company/ever-efficient-ai/

#Tokenization #NLP #MachineLearning #GPT3 #AIExplained #TechEducation #LearnAI #ArtificialIntelligence #DataScience #Coding #Programming #token

## Key Highlights

### 1.
Tokenizers and Limitations in AI Models

AI models have token limits that cap how much text they can process at once. Tokenization is not simply character- or space-based; an algorithm breaks text into "tokens," which determines input capacity and can influence which model you choose.

### 2. Understanding Tokenization with the OpenAI Visualizer

OpenAI provides a visualizer tool that shows how text is broken down into tokens, revealing that punctuation and even spaces can be significant tokens. HTML generates many tokens because of its structured nature.

### 3. GPT Tokenizer JavaScript Package Overview

The `gpt-tokenizer` package for JavaScript offers encoding and decoding, along with features such as token-limit checking. It is similar to tiktoken but adds extras, including async generators and a playground for visualization.

### 4. Using GPT Tokenizer in JavaScript/Next.js

The package lets you encode and decode text, check token limits, and process chat logs. Chunking and generator functions help manage large texts. It can run server-side in API routes or client-side for real-time analysis.

### 5. Frontend Tokenization Example in React

The video demonstrates using the GPT tokenizer directly within a React application in a code sandbox. This enables front-end encoding and decoding of text, bypassing the need for backend processing in certain situations.

## Summary

Here's a summary document designed to help someone quickly understand the video content.

**NLP Made Easy: ChatGPT Tokenizer The Building Blocks Of Natural Language Processing - Summary Document**

**1. Executive Summary:**

This video provides a practical introduction to tokenization in NLP, explaining how machines break human language into tokens for processing. It covers different types of tokens, explores the `gpt-tokenizer` package for JavaScript, and demonstrates its usage with code examples in Next.js and React environments.

**2.
Main Topics Covered:**

* **Introduction to Tokenization:** What tokenization is in the context of AI and NLP.
* **Token Limitations in AI Models:** Discussion of token limits in various AI models (e.g., GPT-3, GPT-4) and their impact on text processing.
* **Types of Tokens:** Exploration of word tokens, subword tokens, and byte pair encoding (BPE) tokens, illustrating how they differ.
* **OpenAI Visualizer:** Using the OpenAI playground to visually understand how text is broken into tokens.
* **GPT Tokenizer JavaScript Package:** Overview of the `gpt-tokenizer` package for JavaScript, its functionality (encoding, decoding, token-limit checking), and its advantages over other tokenizers such as tiktoken.
* **Practical Implementation:** Demonstrations of using the `gpt-tokenizer` package in Next.js API routes and directly in React frontend components, focusing on encoding, decoding, and managing token limits.

**3. Key Takeaways:**

* Tokenization is crucial for AI models to understand and process human language.
* Token limits cap how much text an AI model can handle at once; overruns can be managed with chunking functions in JavaScript.
* Different models use different tokenization methods, impacting performance and cost.
* The `gpt-tokenizer` JavaScript package offers a convenient way to work with tokens in JavaScript applications.
* Tokenization can run on the backend (Next.js API) or directly on the frontend (React), allowing for flexible application design.
* Byte pair encoding (BPE) builds tokens from subwords to better handle rare words and improve language understanding.

**4. Notable Quotes or Examples:**

* "Tokens are just pieces of text that are broken down; you can think of them as small pieces."
* "About four characters of text is roughly one token, so 100 tokens could be about 75 words" (rule of thumb for token estimation).
* HTML/code generates a ton of tokens due to its structured nature.
* Explanation of subword tokenization and its advantages (using examples from Hugging Face).
* Code snippets showcasing the use of `gpt-tokenizer` for encoding, decoding, and checking token limits.

**5. Target Audience:**

* Tech enthusiasts interested in NLP.
* Students learning about machine learning and AI.
* Developers building applications using large language models (LLMs).
* Anyone curious about how AI processes and understands human language.

## Full Transcript

Hey everyone, welcome to nerding.io. I'm JD, and today we're going to be talking about tokens: specifically how they're used in AI and how we can work with them inside of JavaScript. So let's go ahead and get started. One of the first things we learn when thinking about tokens is limitations, and you can recognize that by the fact that all of the different models have token limits. When you first start looking at it, you might think this is just the number of characters, or you could think of it as a regex splitting on spaces, or on punctuation, sometimes even on line breaks. While all of these are partly true, the text is actually split into tokens using a very particular kind of algorithm, so we're going to go through some of that to understand it a little more. The OpenAI website tells you the max tokens for each model, and it will also tell you which tokenizer it's using. The OpenAI platform has a visualization that lets us look at how tokens are created, and tokens are just pieces of text that are broken down; you can think of them as small pieces. We were talking about how this could be by characters, so if we just type "I" we start to see something, but then with "love code" you notice that not only are the tokens being included with the spaces, but it
actually has parts for the punctuation too. You can see that double spaces will create a token, and that different punctuation creates tokens; with line breaks, however, even though you have multiple characters you don't get the same number of tokens. They do give you a rule of thumb: about four characters of text is roughly one token, so 100 tokens could be about 75 words. We know we can break down these tokens in JavaScript because OpenAI's tiktoken package has versions for Python and for JavaScript; however, there's another community package we're going to take a look at. One other thing I wanted to show is that more tokens doesn't mean more characters. If we look at HTML, and I know a lot of people are using OpenAI for code generation, you'll notice these tokens are split up in a very unique way; you can even see tabs and spaces (we can have that debate later on), and all of these are different tokens. So HTML, or any kind of code, is going to generate a ton of tokens, and there's actually a good reason for this. If we go to Hugging Face, they have a really good example of why this is happening, still using words: they break things down for words like "annoying" or "annoyingly", where "annoying" and "ly" are separate pieces. If we look at the BERT tokenizer, it states that "I", "have", "a", and "new" are all words it understands; "GPU", however, is handled as subwords: "gp" is going to be a subword and "u" is going to be a subword, and the double pound sign (##) is how you would reattach these subwords when reversing the tokenization, i.e., when decoding the tokens. These tokens are essentially associations, and that's how it's actually going out, doing some of the searching, and putting
together its response. All right, so now let's take a look at our JavaScript package. I hope you're enjoying this video on tokens and how they're used in AI, specifically how we can work with them in JavaScript; if you haven't already, please like and subscribe, it helps more than you know and is greatly appreciated. All right, let's get back to nerding. The package we're going to look at is gpt-tokenizer, and what's really cool about it is that it has the token byte pair encoder and decoder for all the OpenAI models. It's very similar to tiktoken, but as they like to say right here, it has additional features sprinkled on top. You can see the different types of encodings, such as r50k_base, p50k_base, and so on; the most common one, used for GPT-3.5 and GPT-4, is cl100k_base. We're going to look at a couple of the different features it has. Just like on OpenAI, it has decode and encode functions; it also has an async generator, which is really nice; and then there's the within-token-limit check, which is great because it gives you two different things: it can either give you the number of tokens, or it can tell you that you're exceeding the limit you set. It's just an npm install to get started, but the other really cool thing is that it actually has a playground at a really easy URL, and it gives you a whole bunch of other information, so we're going to go ahead and take a look at that. If we do a "Hello World", you can see the tokenizer is cl100k_base, which covers GPT-3.5 and GPT-4. It visualizes the tokens, shows the token IDs, and tells you the characters and tokens, very similar to the OpenAI tokenizer we saw earlier, but it also gives you the dollar amounts, which is super helpful when doing estimates. So now that we have a pretty good
understanding and a visual of how we can use this, we're going to dive into our code. To use this in JavaScript we have to do our import; you can see here that I'm importing it into Next.js in a route. We're just going to use static text, there are really no visuals for this, and I'm just going to show you what it's doing behind the scenes. You can get this example from the link in the description, or note that it's the same code as the npm example for gpt-tokenizer. Right here you have your text, and it's given a token limit of 10; if we look at our previous example, that's 13 characters but only 4 tokens, so we're not going to exceed our limit. It will go ahead and encode our text here, and then we can decode the text and get the information back; both are super important. The next thing is one of the biggest functions, which I think is really cool: the within-token-limit check. As we were talking about before, you set your token limit (we used 10), put in the text, and it gives you the token count back; it will either tell you what the tokens are, or if the limit is exceeded it returns false, which allows you to do some conditional logic. Let's say you were ingesting a ton of data, say over 8,000 tokens of text that we're going to be pulling in; that would be too big for GPT-3.5, and we'd have to go to the 16k limit or even higher. This is a way to predefine what that model could be. Now, this isn't fully safe, because the token limit for GPT-3.5 and GPT-4 is based on all the tokens, not just the tokens you're taking in but the tokens you're outputting as well, so just keep that in mind; it's still a great function, though. Again, you can put in conditional logic, but this is just flat text, as we can see up here. So the next
call or function we're going to go through: you can take the entire chat log that you have and put that in to see what your tokenizer comes back with. Right here we have an example with a system role and the assistant; you could also see a role for the user. You take all the content the chat expects as an array and encode the chat tokens. Again, super helpful: you can put that array into the token-limit check without changing the function or anything else; it can accept an array or plain text, and it either gives you the token count or returns false. The next thing it does is take the encoded text and chunk that information using a generator. From those chunks you can then decode using a generator, and there's also an async version: assuming the tokens are an async iterable, you can use the decode async generator. So this is how we would implement the tokenizer in JavaScript. The last thing I wanted to show is that you can use this on the front end too. The exact same tokenizer we were looking at visually earlier can be loaded in a code sandbox directly into React. As you can see, the tokenizer part is a bit down here: you've got the types for encoding, you can select which tokenizer you're going to use, and then put the text through encode on the front end. None of this is hitting the backend; the example we showed earlier is just if you wanted to use it behind the scenes in an API request. This is how you would decode your tokens with the same functions; we're still using the decode generator and encode,
but it's on the front end. It's just really cool that you can see it on both the front end and the back end. All right, so what we learned today was tokens: specifically how they're used in AI, different ways to break them down, how to encode and decode them in JavaScript, some of the limitations you might run into and how to get around those with chunking, and then examples for real-world things like chatbots. Don't forget to like and subscribe, and leave us any comments about things you want to learn in the future. Happy nerding!

---

*Generated for LLM consumption from nerding.io video library*
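The four-characters-per-token rule of thumb quoted in the video can be sketched as a quick estimator. This is only the heuristic from the video, not a real tokenizer; the function names here are made up for illustration, and actual BPE counts will differ.

```javascript
// Rough token estimate using the video's rule of thumb:
// ~4 characters of English text per token, and 100 tokens ≈ 75 words.
// This is a heuristic only; use a real BPE tokenizer (e.g. gpt-tokenizer)
// when the exact count matters.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

function estimateWordsForTokens(tokenCount) {
  return Math.round(tokenCount * 0.75);
}

console.log(estimateTokens("I love code")); // 11 chars -> 3 (real BPE count may differ)
console.log(estimateWordsForTokens(100)); // 75
```

A check like this is cheap enough to run on every keystroke in a UI, with the real tokenizer reserved for the final count.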
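The subword discussion (the Hugging Face BERT example with "gp", "##u", and the "##" reattachment marker) can be illustrated with a tiny WordPiece-style detokenizer. The helper name `detokenizeWordPiece` is hypothetical, and the naive space-joining is an assumption for clarity; real detokenizers also handle punctuation spacing.

```javascript
// WordPiece-style decoding as described in the video: a "##" prefix marks
// a subword that glues onto the previous piece; other pieces get a space.
function detokenizeWordPiece(pieces) {
  let out = "";
  for (const piece of pieces) {
    if (piece.startsWith("##")) {
      out += piece.slice(2); // reattach subword to the previous piece
    } else {
      out += (out ? " " : "") + piece;
    }
  }
  return out;
}

// The BERT example from the video: "GPU" splits into "gp" + "##u".
console.log(detokenizeWordPiece(["i", "have", "a", "new", "gp", "##u", "!"]));
// "i have a new gpu !" (naive spacing leaves a gap before "!")
console.log(detokenizeWordPiece(["annoying", "##ly"])); // "annoyingly"
```

Note that GPT-style BPE encodes spacing inside the tokens themselves instead of using a continuation marker, which is why the OpenAI visualizer shows leading spaces as part of tokens.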
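The within-token-limit check described in the video returns either the token count or `false`, enabling conditional logic such as routing long inputs to a larger-context model. The sketch below mirrors that contract but fakes the counting with a naive split on words and punctuation; `naiveTokenize` and `withinTokenLimit` are stand-in names, not the package's real BPE implementation.

```javascript
// Simplified stand-in for the within-token-limit check: return the token
// count if the text fits the limit, or false if it exceeds it. The split
// here is a naive word/punctuation regex, NOT real byte pair encoding;
// it only demonstrates the control flow.
function naiveTokenize(text) {
  return text.match(/\w+|[^\w\s]/g) ?? [];
}

function withinTokenLimit(text, tokenLimit) {
  const count = naiveTokenize(text).length;
  return count <= tokenLimit ? count : false;
}

const text = "Hello, world! This is a token limit check.";
const result = withinTokenLimit(text, 20);
if (result === false) {
  // e.g. fall back to a larger-context model, or chunk the input
  console.log("Too long: route to a larger-context model");
} else {
  console.log(`Fits: ${result} tokens`);
}
```

As the video warns, a real budget must also reserve room for the model's output tokens, not just the input.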
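The chunking-with-generators idea from the transcript can be sketched with plain JavaScript generators: split a long token array into fixed-size chunks, then decode chunk by chunk. The `toyVocab` id-to-text mapping is invented for the example; the real package maps BPE token ids back to text.

```javascript
// Split a token array into fixed-size chunks lazily with a generator.
function* chunkTokens(tokens, chunkSize) {
  for (let i = 0; i < tokens.length; i += chunkSize) {
    yield tokens.slice(i, i + chunkSize);
  }
}

// Toy vocabulary standing in for a real BPE id->text mapping.
const toyVocab = { 1: "I", 2: " love", 3: " code", 4: "!" };

// Decode chunk by chunk, yielding one text piece per chunk, so a large
// token stream never has to be decoded in one shot.
function* decodeGenerator(chunks, vocab) {
  for (const chunk of chunks) {
    yield chunk.map((id) => vocab[id] ?? "<unk>").join("");
  }
}

const tokens = [1, 2, 3, 4];
const pieces = [...decodeGenerator(chunkTokens(tokens, 2), toyVocab)];
console.log(pieces); // [ 'I love', ' code!' ]
console.log(pieces.join("")); // "I love code!"
```

An async variant (as the video mentions for the decode async generator) would follow the same shape with `async function*` and `for await`.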