# How to Secure Data in AI: Essential Guide to PII Masking in LangChain

## Metadata

- **Published:** 1/31/2024
- **Duration:** 19 minutes
- **YouTube URL:** https://youtube.com/watch?v=54vUy5kvHRI
- **Channel:** nerding.io

## Description

Discover key strategies for securing sensitive data in AI systems with LangChain's experimental feature of PII masking. This video offers a deep dive into an advanced masking parser and transformer tool, crucial for protecting personal information in customer support interactions. Ideal for AI professionals and enthusiasts, learn how to navigate data privacy in the AI realm effectively, ensuring compliance and safeguarding customer trust. Dive in for practical insights into AI data security!

📰 FREE Code & News: https://sendfox.com/nerdingio

👉🏻 Ranked #1 Product of the Day: https://www.producthunt.com/posts/ever-efficient-ai

📞 Book a Call: https://calendar.app.google/M1iU6X2x18metzDeA

🎥 Chapters

- 00:00 Intro
- 00:51 Real World Scenario
- 02:07 Getting Started
- 03:22 Basic Example
- 05:52 Kitchen Sink Example
- 10:49 Customer Support Example
- 17:36 LangSmith
- 19:00 Conclusion

🔗 Links

- https://github.com/nerding-io/langchain-nextjs-example
- https://js.langchain.com/docs/modules/experimental/mask/
- https://smith.langchain.com/

⤵️ Let's Connect

- https://everefficient.ai
- https://nerding.io
- https://twitter.com/nerding_io
- https://www.linkedin.com/in/jdfiscus/
- https://www.linkedin.com/company/ever-efficient-ai/

## Key Highlights

### 1. LangChain JS Masking Feature

The video focuses on LangChain JS's experimental masking feature for PII removal, showcasing its application in customer support scenarios to protect sensitive user data before it reaches LLMs.

### 2. Regex-Based Masking Customization

The masking process leverages regex patterns, allowing for flexible customization of what data gets masked. However, the presenter notes that some regex patterns may require adjustments for proper functionality.

### 3. Rehydration Capability

The masking is reversible (rehydration), which is key for accessing the original data when needed. This isn't a one-way process and has some potential security implications if not handled correctly.

### 4. LangSmith Integration for Masking

The video demonstrates how masked data is passed through LangChain's runnable sequences and visualized in LangSmith, ensuring that sensitive information is not stored in third-party LLMs.

### 5. Stream Route with Utilities

Addresses the use case of chat history masking, where an entire conversation with sensitive PII data can be summarized while protecting this information from the LLM.

## Summary

## Video Summary: How to Secure Data in AI: Essential Guide to PII Masking in LangChain

**1. Executive Summary:**

This video provides a practical guide to using LangChain JS's experimental PII masking feature to protect sensitive customer data in AI applications, specifically customer support systems. It demonstrates how to implement customizable regex-based masking, utilize rehydration capabilities, and integrate with LangSmith to ensure data privacy throughout the LLM process.

**2. Main Topics Covered:**

* **Introduction to LangChain JS Masking:** Overview of the experimental masking feature for PII removal.
* **Real-World Scenario:** Application in customer support systems for masking sensitive user data before it's processed by LLMs.
* **Basic Masking Example:** Demonstrates simple text masking using regex and the masking parser, focusing on masking and rehydrating email addresses and phone numbers.
* **"Kitchen Sink" Example:** Shows a more complex scenario with various regex patterns for different PII types (names, bank accounts, etc.) and event hooks for masking and hydrating stages, highlighting potential challenges with regex accuracy.
* **Customer Support Stream Example:** Uses a full chat history with PII to demonstrate masking in a streaming context. This includes using a prompt template for summarizing conversations, masking the chat history, and ensuring sensitive data is masked when passed to the LLM.
* **LangSmith Integration:** Illustrates how masked data flows through LangChain runnable sequences and is visualized in LangSmith, preventing sensitive information from being stored in third-party LLMs.
* **Rehydration Capabilities:** Demonstrates how to reverse the masking process to access the original data when needed, while acknowledging potential security considerations.

**3. Key Takeaways:**

* LangChain JS offers an experimental masking feature to protect PII.
* Regex is used to customize masking patterns, but requires careful attention to accuracy.
* Masking is reversible (rehydration), enabling access to original data when necessary.
* Integration with LangSmith ensures data privacy throughout the LLM pipeline.
* Masking can be applied to chat histories in streaming scenarios.
* Careful planning of regex patterns is crucial to avoid unintended masking and maintain accuracy.
* While powerful, the rehydration capability poses a potential security risk if not properly managed.

**4. Notable Quotes or Examples:**

* "…you're actually getting information from users and you need to scrub that data so that when you're putting it into your LLM it's actually masked from the actual information that it needs…" (Explanation of the problem PII masking solves)
* The video showcases specific examples for email, phone numbers, names, bank account numbers, driver's license numbers, and passport numbers being masked using regex.
* The Customer Support Example provides a specific code example which utilizes a full UI, stream, and runnable sequence to summarize data after it has been properly masked.

**5. 
Target Audience:**

* AI professionals
* Machine learning engineers
* Data scientists
* Developers working with LangChain and LLMs
* Individuals interested in data privacy and security in AI

## Full Transcript

Hey everyone, welcome to nerding.io. I'm JD, and today we're going to be talking about LangChain and its experimental feature called masking. A real-world example of this would be: maybe you're a customer support system and you're actually getting information from users, and you need to scrub that data so that when you're putting it into your LLM, it's actually masked from the actual information that it needs. So we're going to look at not only the masking feature but also some of the streaming capabilities, and then actually look at it in LangSmith just to see what the output is. So with that, let's go ahead and get started.

All right, so the first thing that we're going to do is look at the LangChain docs. What's interesting about this is that if you go into the experimental section underneath "More" and go to masking, this is where we're going to find it. However, this is only available for LangChain JS. The way that I came about this is I was reading their blog post, where they're talking about how different systems are using masking for PII, or personally identifiable information, removing that from the LLM. So a real-world scenario is that this could be in a customer support system that's receiving messages that are sensitive, based on the customer information, and you want to mask that information.

We're going to look at multiple different examples, specifically in Next.js, but we're going to go through the three that they have here: we have a basic example, we're going to look at the kitchen sink, and then we're going to look at the Next.js stream. I built this in a couple of different ways, where we also have some utilities, and then we'll actually do a GUI where it looks like
it's just kind of a chat.

So the first thing that you're going to need to pay attention to is making sure that you're getting the LangChain OpenAI install, because it has some of the experimental pieces in there, and this is specific to OpenAI. So let's go ahead and get our project up and open this code. Remember, if you sign up for the newsletter, you'll have access to this codebase as well as the Vercel example. But if you want to start this from scratch, you want to make sure that you have your environment variables. So you have the LangChain tracing, which is all for LangSmith specifically, and then you are going to need an OpenAI key.

What I did for this is I built an API just called masking, and I put in different routes: the basic route, the kitchen route, and the stream route. And then down here you can see the masking parser, which I just put in a utils file. This is using the App Router, so for this example just know that.

If we dive right into the basic example, what we can see is that it's creating some text, and this is what we want our mask to look like. Specifically, it's giving essentially a random hash and then an identifier, so that we know this mask is particular to this type of data. Again, you're going to be pulling in the LangChain experimental masking, and then we're using regex for defining what these masks are. So we're building a regex masking transformer, and it's taking the hash function, if we want, as well as the pattern itself. These are our different patterns. You can customize them however you want; these are just based on the example from the docs themselves. Then you're taking this information, putting it into your masking parser, and adding the transformer. The last part is you're going to actually mask this information. So you can take whatever input you have; you can
see here it's got things like an email as well as a phone number and another email, and we'll mask the information based on the parser. Then what I did, just for this example, is show you that you can also rehydrate. This is really important, because it shows that it's not a one-way mask; you can actually disable it as well. The interesting thing is you could technically use this on the front end, because it is JavaScript. I'm just doing everything on the back end to show how it works, and then we're going to output this JSON right here.

So if we bounce back to our browser and look at our first example here, we're just going to go ahead and refresh. Let me blow this up for you. We can see that the mask is going over the email as well as the phone number, and you can even see over here it's catching the phone number as well. Then when we rehydrate, we're looking at the information that's coming back from it. If we look at our text, we can see it console logged here as well. Again, I'll blow this up a little bit, where you can see the information coming back on the back end, being masked as well as being hydrated. We're not doing anything with the chaining right now; what we're doing is just transforming the data that we have before we actually put it into anything else.

All right, so next we're going to jump to the kitchen sink example. If we go back to our documentation, you can look at the kitchen sink here, and you can see that we're using a lot of the same features, but there's a little bit of this being extended with different patterns for the regex, as well as some eventing. So what we're going to do is take all this and put it in a utils file. If we go back to our code, if you notice, I have the
route already made right here. It just has the import for the utility that we're going to build, it has the message with all the PII information, and then we're just going to mask, then rehydrate, and return a response. So far, again, nothing we're doing is going through the stream, so we could do this on the front end; I'm just choosing to do it on the back end as good security practice.

If we look at our utility, you can see it's in our utils folder, our masking parser, and then we have our import. Again, we're pulling in the experimental masking. What we're doing here is creating a simple hash. What this will do is allow, just like before, basically a variable number system as a randomized hash. We're also going to be assigning a readable variable beforehand, so we'll know which mask is doing the regex.

Then as we go down to these patterns, all of these are really just regex patterns. When I ran this the first time, I noticed that this `(?i)` (question mark, i) isn't permitted in regex for JavaScript, and it was throwing an error. So what I did is I commented this out and removed that. Probably not the best, and I'm also not the best at regex, so if you have a better way of doing this, I would love to see it in the comments. But I asked chat and this is what it told me. So that is one thing to be aware of: if you're copying and pasting from the guides, watch out for this piece.

With all of these regexes, we are then going to put them into our masking transformer right here. We'll have our patterns, and then we have our different hooks for each stage of the masking and hydrating. Basically, these are different events that are going to be fired, and we can have different functions. We're just going to be console logging, just like the documentation would say, but we can actually see
this information. Lastly, we're going to initialize the masking parser, and then we're just exporting it so we can use it back here in our kitchen sink.

So let's go ahead and do a quick test. If we go back to our URL here, we can see that this is the masking kitchen example. We'll do a refresh, and we'll find some interesting things going on here. Not only is it doing the hashing as expected, and hydrating, we're also seeing some different kinds of errors. Right here we're noticing that name is masking the first part of the name, but it's not doing it for "Doe", which is kind of interesting. It did recognize the email, but then it saw the hash that the email mask was creating and thought it was a bank account, so it's hashing the random hash here, which is pretty interesting. So this is where maybe the regex that we took out needs to be a little more specific.

The other thing we'll notice is that the driver's license and passport are incorrect, as well as when we have our name here, which is just called "bank account": it actually recognized that as a name. So you need to be careful with your regex to make sure that it is masking the appropriate information.

The other thing that I found interesting is that when it rehydrated, where we had this email and the bank account, which was actually hashed again, it doesn't go back through and do the hash for email. What it's doing is taking the bank account that it tried to mask and putting that number back in, but not the entire number. So it's not running it twice, which makes sense, but these are interesting things to look out for when you're doing your regex, to make sure that you're getting the information that you actually want to mask and then being able to rehydrate it.

Real quick, everyone: if you haven't already, please remember to like and subscribe. It helps more than you
know. Also, all the code for this will be in the link below. You just sign up for the newsletter and you'll get the code as well as the link to Vercel. And with that, let's get back to coding.

All right, so next we're going to go back and look at our stream example. This is taking chat history and sending it as a payload, so this is where we're going to see information going into our sequence. What I did for this is I made an entire chat sequence right here that we can use. So we have a UI that has different PII information going back and forth between two parties, and we're going to take this information, pass the entire history, and then try to get a summary of that, so that we can see what customer support might see, or how we can mask all this information and still pass it through the LLM.

So let's go ahead and take a look at the code for this. The first thing we're going to do is take a look at the front end. This is the code that I put together; again, this is available if you just click the link below and sign up for the newsletter. It has the chat history: we have a role and we have our content, and it's basically just an array of JSON objects. We have the ability to add a new message to our array, as well as send a single message to continue that chain, and then we're also taking the entire chat history and posting it to our stream. So when we receive the information, we'll be getting the entire chat history sent over. Again, you can see this is the key press and things like that, and the send button to send the entire chat history right here. So once we click this button, we'll send the entire chat history to the stream, and as you can see here, we're still going to be using our utility that we built, and you
can see some of the other things that we're importing: ChatOpenAI, as well as our prompt template, and then our bytes output parser. We're also in Next.js, so we're going to be using the edge runtime.

The first thing we want to do is format the message. We're going to be taking our role, so we know who's speaking, and the content that was said for that role. We're going to take all this and create a customer support prompt template. What this is saying is: you are a customer support summarizer agent; always include masked PII in your response; here's the current conversation. This is what we'll pass into our template, along with the input from the user, which will be the latest part of the message.

As you come down to the actual POST part of the stream, you can see we're getting our body information. We're checking to see if there are messages; otherwise we give a blank array. We're taking the previous messages, which is everything except the last one, so it's assuming that as you click the button, it's going to grab everything before the last message, and then the current message, which is going to be the newest input. So what this is doing is taking the single message and putting it through our masking parser to take that information and successfully mask it. Then we're going to see the guarded history: we're taking all the messages, joining them as an array, and doing the same thing. We're going to mask that entire history.

Next we're building out our prompt template, and this is where it's going to go through the LLM, because we have our model here and we have our output parser, and then we're chaining our events from our prompt, our model, and our output. Right here you can see that we're console logging the message itself, the entire history,
and then we're mapping the state of each one of the hashes to the PII information, so we'll see that in our response. Next we're going to stream this data and put it into our chain. These are the variables that we're looking for inside of our prompt template that we're going to replace. Then we just have our stream response.

So let's go over to our browser and see this in action. The first thing we're going to do is scroll to the bottom. We can see we have all our PII information; there's a mix of information, and we're just going to click the button to send the chat history. You can see right here it's starting the stream, and now we're getting our information back. You can even see in the response that it's grabbing the particular information. Again, we're having that interesting bank account mask going over the email, and we're looking at different information. It's giving us a summary, and it's telling us all the hashes that we have inside the history of information.

Now we'll see what was happening on the back end, so we can look at the state as well as what's going on here. The first thing we can notice is we have our input hash, and we have our guarded history, which is hashed. You can see this is the hashed information, and this is the entire object of the history itself. Then we have our map, which is the current state of everything, so it's holding on to all this information. You can see it even recognized "Social Security" and thinks that's a name; "Main Street" is a name instead of an address; "Jane Doe" it got this time. So you can see some of the mistakes it was making as far as the hash. I think this is probably...

So now we're going to look at what happened in LangSmith. If we go back to our browser and look at LangSmith, we'll see that
we have a new runnable sequence, which is right here. We can look at this runnable sequence, and we see that even the input itself has the masked information as it's going into the runnable sequence. The next piece is it goes into OpenAI, and you can see that as it's going into our LLM, again, it's sending in the masked information, so that the LLM itself is not getting this information stored in a third party, which is incredibly important. Then the output, again, is returning, and it's able to recognize that these states are the masked information, which is why we can rehydrate. If we look at the state that's being maintained in our map right here, this is how we can rehydrate: through this information.

And so with that, that's where the experimental part of masking is. If you're interested, I put all of this up on Vercel; you can take a look and play with it yourself.

All right, that's it for us today. Don't forget to like and subscribe if you haven't already, and remember that you can get all the code if you sign up through the newsletter; it'll give you access to the codebase as well as the Vercel link to test this out on your own. I hope you enjoyed today's session, where we covered masking specifically in LangChain JS and then looked at how that information is passing through the runnable sequences in LangSmith. If you have any questions, please leave them in the comments; we'd love to hear from you. And with that, have a great one. Happy nerding!

---

*Generated for LLM consumption from nerding.io video library*
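
The mask-then-rehydrate flow described in the transcript can be illustrated with a small standalone sketch. This is not the LangChain JS masking API and not the code from the video: `SimpleMasker`, `MaskRule`, the two regex patterns, and the sample input are all hypothetical names and values chosen for illustration. It only shows the general idea of replacing regex matches with labeled placeholders while keeping a state map so the original values can be restored later.

```typescript
// Standalone illustration only: NOT the LangChain masking API.
// `MaskRule` and `SimpleMasker` are hypothetical names for this sketch.
type MaskRule = { label: string; regex: RegExp };

class SimpleMasker {
  private rules: MaskRule[];
  // Maps each generated placeholder back to the original value.
  private state: Record<string, string> = {};
  private counter = 0;

  constructor(rules: MaskRule[]) {
    this.rules = rules;
  }

  // Replace every match of every rule with a labeled placeholder.
  mask(text: string): string {
    let result = text;
    for (const { label, regex } of this.rules) {
      result = result.replace(regex, (match: string) => {
        this.counter += 1;
        const placeholder = `[${label}-${this.counter}]`;
        this.state[placeholder] = match;
        return placeholder;
      });
    }
    return result;
  }

  // Put the original values back using the stored state.
  rehydrate(masked: string): string {
    let result = masked;
    for (const placeholder of Object.keys(this.state)) {
      result = result.split(placeholder).join(this.state[placeholder]);
    }
    return result;
  }
}

// Assumed example patterns; real PII patterns need far more care.
const masker = new SimpleMasker([
  { label: "email", regex: /\S+@\S+\.\S+/g },
  { label: "phone", regex: /\d{3}-\d{3}-\d{4}/g },
]);

const input = "Contact Jane at jane.doe@example.com or 555-123-4567.";
const masked = masker.mask(input);
const rehydrated = masker.rehydrate(masked);

console.log(masked); // Contact Jane at [email-1] or [phone-2].
console.log(rehydrated === input); // true
```

The sketch uses a simple counter for placeholders only to keep the output deterministic; the video uses a randomized hash plus a readable identifier instead. As the kitchen sink example in the transcript shows, a placeholder that contains digits can itself be matched by a later pattern, so the order and specificity of the rules matter whichever placeholder scheme is used.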