Hi, I'm Marcin Zabłocki, MLOps architect at GetInData. Today I will show you how to deploy large language models in your own private Kubernetes cluster in just five simple steps. Let's go!

This video is a follow-up to a blog post we released in July, where I showed you how to deploy large language models using TGI. Back in July I deployed the Falcon large language model using Hugging Face Text Generation Inference, but a lot has changed since then. For example, the TGI license has changed to be more restrictive and less open source, and obviously the leaderboard of open-source language models changes every day; right now one of the best open-source models is Mistral, so that is the one I will deploy today. Besides the TGI used before, we have new players in the area of LLM deployment frameworks, such as OpenLLM from BentoML or vLLM, which I will be using today. vLLM is one of the fastest and also one of the most open-source ones: it has an Apache 2.0 license, so you can use it both in your private and commercial projects.

In this tutorial I will guide you through the deployment shown on the diagram you can see right now. Keep in mind that we are using a Kubernetes deployment, so no matter whether you are running Azure, AWS, GCP, or even your own on-premise environment, this tutorial will most likely apply to you, because I'm using mostly standard Kubernetes resources without anything strictly tied to Google Cloud Platform. For example, in the diagram you can see I'm using GCS, but it could just as well be S3 or Azure Blob Storage, and the other pieces, such as persistent volumes, deployments, services and batch jobs, are obviously available in all Kubernetes installations.

So how will it work? Why do we have GCS, a PVC and some deployments? The idea here is to give you a scalable solution. If you want to deploy your large language models in your private cluster, at some point you will most likely need to scale up the deployment: initially it will be one replica, but as your usage and customer base grow, you will need to serve more and more traffic, so you will need to scale up, and every container that runs your large language model needs access to the actual model files, the model weights. This is where GCS and the PVC come in. First we'll download the model from the Hugging Face Hub into GCS, in order to remove any dependency on an external third-party server from your installation. Then we'll initialize a volume in Kubernetes, and this volume will contain all of the model files. The idea here is different from the one I showed in the blog post, where every time a new container was created it downloaded the model files from GCS. That is not as performant as you might expect, because in that configuration you always rely on the networking between your cluster and the blob storage, while in the new configuration that I will show you today, you initialize the persistent volume once and then just attach this volume to all of the deployments that serve your large language model.

When you are using volumes in Kubernetes, make sure that you initialize separate volumes in each zone that you plan to run your deployments in, because simple volumes such as the ones I am using today are usually bound to a specific zone in the region where your Kubernetes cluster runs, and if you scale up into a different zone, the volumes will not be accessible. So make sure that you either create volumes for each of the zones your Kubernetes cluster is running in, or that you are using something like EFS, which is regional storage.

Step one: download the model from Hugging Face. The first thing you need to do is download the model from Hugging Face, so just open the Hugging Face Hub models tab and find the model that you want to deploy. I will be using Mistral today, the 7B Instruct v0.2 version. You can choose whichever model you want, just make sure that the GPU attached to your Kubernetes cluster is sufficient to handle it. The fastest way to download the model is to click on the "Files" tab and download all of the files listed there. Keep in mind that the Mistral model, for example, ships its weights both in the safetensors format and in the pickle/PyTorch format; you only need one of them, and I will be using safetensors today. So download those files and copy them over to GCS, or S3, or Azure Blob Storage, whatever cloud you are using, with your favorite command-line tool or whatever tool you select for that. Just a side note: make sure that the bucket in S3 or GCS, or the storage container in Azure, is in the same region as your Kubernetes cluster, to get faster file transfer speeds. As you can see on the screen right now, I've put my Mistral model files on GCS and I have this path.
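Here is a minimal sketch of that download-and-upload step, assuming the `huggingface_hub` CLI and `gsutil` are installed locally; the bucket name and target folder are hypothetical:

```bash
# Download only the safetensors weights plus tokenizer/config files
# (skip the duplicate PyTorch .bin weights).
pip install "huggingface_hub[cli]"
huggingface-cli download mistralai/Mistral-7B-Instruct-v0.2 \
  --exclude "*.bin" \
  --local-dir ./mistral-7b-instruct-v0.2

# Copy the files to a bucket in the same region as the cluster
# (use `aws s3 cp` or `azcopy` on other clouds).
gsutil -m cp -r ./mistral-7b-instruct-v0.2 gs://my-models-bucket/llms/
```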
I am now ready to go to the next step, which is to initialize the volume that will store those files. The first thing that I did was create a storage class for my persistent volume. There are two important things to note here. The first one is the reclaim policy: you want to retain the volumes provisioned with this class, because whenever you encounter a crash in your Kubernetes cluster you want the model files to be persistent, so you don't have to care about re-provisioning the volume every time. The next thing is the allowedTopologies key. It's quite unusual, but due to the recent shortages of GPUs across regions in different clouds, you need to make sure that you will actually get one of the GPUs that are available there. Quotas matter, but so does GPU availability, so before deployment make sure that you are running both in a region and in a zone that gives you the GPUs you want; a sketch of such a storage class is shown below. In order to create the storage class you just need to run kubectl apply with this file. I've already done that, and we can check that it was created.

Step two: the persistent volume. The next thing is to create the persistent volume, and I will be using a persistent volume claim for that. Make sure that you create it with the ReadWriteOnce access mode, make sure that you use the same storage class that you created before, and also make sure that the amount of storage you assign to the volume is sufficient to fit your selected model; a sketch of the claim follows below. Now it's just a matter of running kubectl apply, and we'll see the message that the volume was created.

Step three: copy the model files from blob storage into the persistent volume. Make sure that you provide the correct node selector; I am running my workloads in europe-west4, in zone C. The one and only container that I am running in this job is the gcloud SDK; if you are running AWS, for example, you will most likely use an AWS CLI container, and something similar for Azure too. The only command that I'm invoking here copies over the model files that you've seen in GCS, from the mistral-instruct folder, to the local path /model, and /model is actually my volume: when we scroll down to the volume mounts, you can see that I'm attaching the existing volume to my job and mounting it under /model. Also make sure to assign resources in the amount that your tool requires; a sketch of the whole job is shown below. Once you have this manifest, just run kubectl create, and it will create the init-volume batch job in your Kubernetes cluster that will invoke this command and copy over the model files. You can use kubectl get jobs to see whether your job is running. As you can see, I've already run one of the jobs and the second one is still pending, so let's wait a little bit for that.
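Here is a minimal sketch of such a storage class, assuming GKE and the Compute Engine persistent disk CSI driver; the name is illustrative, and the provisioner and topology key will differ on other clouds:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: model-storage                  # hypothetical name
provisioner: pd.csi.storage.gke.io     # GKE PD CSI driver; use your cloud's provisioner
parameters:
  type: pd-balanced
reclaimPolicy: Retain                  # keep the volume (and model files) even if the claim is deleted
volumeBindingMode: WaitForFirstConsumer
allowedTopologies:
  - matchLabelExpressions:
      - key: topology.gke.io/zone
        values:
          - europe-west4-c             # pin volumes to the zone where the GPU nodes run
```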
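A sketch of the persistent volume claim; the name and size are illustrative, the size just has to be larger than the model files:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mistral-7b                 # hypothetical name, referenced by the job and the deployment
spec:
  accessModes:
    - ReadWriteOnce                # the job writes once; the deployment mounts it read-only
  storageClassName: model-storage  # the storage class created above
  resources:
    requests:
      storage: 60Gi                # enough to fit the Mistral 7B safetensors weights
```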
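And a sketch of the init-volume batch job, assuming GKE and the public google/cloud-sdk image with permission to read the bucket; paths and names are illustrative:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: init-volume
spec:
  backoffLimit: 2
  template:
    spec:
      restartPolicy: Never
      nodeSelector:
        topology.kubernetes.io/zone: europe-west4-c    # same zone as the volume
      containers:
        - name: copy-model
          image: google/cloud-sdk:slim                 # use an AWS CLI / Azure CLI image on other clouds
          command: ["/bin/sh", "-c"]
          args:
            - gsutil -m cp -r "gs://my-models-bucket/llms/mistral-7b-instruct-v0.2/*" /model/
          volumeMounts:
            - name: model-volume
              mountPath: /model                        # the persistent volume
          resources:
            requests:
              cpu: "2"
              memory: 4Gi
      volumes:
        - name: model-volume
          persistentVolumeClaim:
            claimName: mistral-7b
```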
Step four: deploy the LLM using vLLM. The most important step in this tutorial is to actually do the deployment, and for that I've created a Deployment in Kubernetes. Here you can see that I'm using one replica for tutorial purposes, but if you are running this in a production environment you will either start with multiple replicas from the very beginning, or attach a Horizontal Pod Autoscaler or some external autoscaling tool, such as KEDA, the Kubernetes Event-driven Autoscaler, to scale the deployment based on the load in your system.

The next important section in the deployment is the node selector. I'm using GCP with GKE Autopilot, which is why I'm using the topology selector so that my deployment goes into europe-west4, zone C. The same principle applies if you are using AWS or Azure: you can use node selectors or some other scheduling mechanism to point your deployments to, for example, the node groups or node pools that provide your cluster with access to GPUs. Here, as an example, I am selecting node pools that have NVIDIA L4 GPUs attached; this GPU is good enough to handle the Mistral 7B model in 16-bit precision.

The next important bit is the volumes. Here I'm attaching the mistral-7b volume that was initialized before by the batch job, but note that readOnly is set to true, and that's how you can attach the same volume to multiple pods, in read-only mode. Whenever a new replica of the same deployment pops up, it will have access to the model files and it will start really quickly.

The next important section is obviously the containers one. Here I'm running only one container, named model, and I'm using the vLLM image from Docker Hub; if you really care about privacy and the isolation of your Kubernetes containers, you will probably want to pull this image into your private registry and deploy it from there. I'm invoking one command in the vLLM container to expose my model as an API in the OpenAI format, so the format used to query your model over the HTTP API will be the same one used by OpenAI, and I'm specifying the model to be the Mistral 7B Instruct from /model. The same principle applies here: I'm using volume mounts to mount my shared volume into this container.

The next few parameters are related to scalability and to the model itself. Since I'm running the model on L4 GPUs, which support the bfloat16 format, and this model is natively saved in that format, that's what I'm using here. You can also specify the seed and the max context length for your model; again, this is something bound to the GPUs you are using, so with more powerful GPUs you will probably be able to use a larger context length. As for resources, I'm requesting 7 CPUs, 32 GB of memory, which is more than enough, and one GPU, which in my case will be the NVIDIA L4. The default port that vLLM runs on is port 8000, and that's basically it for the deployment definition; a sketch of the whole manifest is shown below. We are good to deploy it, so just type kubectl apply and point it to this file. As you can see, my deployment is already running. The final thing is to create a service for it, and below you can also see a simple manifest that creates a ClusterIP service for my deployment, based on the mistral-7b-instruct selector, exposing port 8000 inside my Kubernetes cluster.
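Here is a minimal sketch of the whole deployment, assuming the public vllm/vllm-openai image (whose entrypoint starts the OpenAI-compatible API server) and GKE Autopilot's accelerator selector; names and the max context length are illustrative:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mistral-7b-instruct
spec:
  replicas: 1                                         # scale up or add an HPA/KEDA in production
  selector:
    matchLabels:
      app: mistral-7b-instruct
  template:
    metadata:
      labels:
        app: mistral-7b-instruct
    spec:
      nodeSelector:
        topology.kubernetes.io/zone: europe-west4-c
        cloud.google.com/gke-accelerator: nvidia-l4   # GKE-specific; use node pools/groups elsewhere
      volumes:
        - name: model-volume
          persistentVolumeClaim:
            claimName: mistral-7b
            readOnly: true                            # lets many replicas share the same volume
      containers:
        - name: model
          image: vllm/vllm-openai:latest              # consider mirroring this to a private registry
          args:
            - --model=/model                          # the weights copied in by the batch job
            - --dtype=bfloat16                        # L4 GPUs support bfloat16 natively
            - --max-model-len=8192                    # tune to your GPU memory
          ports:
            - containerPort: 8000                     # vLLM's default port
          volumeMounts:
            - name: model-volume
              mountPath: /model
              readOnly: true
          resources:
            requests:
              cpu: "7"
              memory: 32Gi
            limits:
              nvidia.com/gpu: "1"
```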
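And a sketch of the matching ClusterIP service:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: mistral-7b-instruct
spec:
  type: ClusterIP
  selector:
    app: mistral-7b-instruct      # matches the pods created by the deployment
  ports:
    - port: 8000
      targetPort: 8000
```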
Step five: querying the model. The model is already deployed and good to go, so let's query it. As you can see, my Mistral 7B model is already running in one of the pods. Let's look at the services we have: there's also a Mistral 7B service, so we can now port-forward to it. I'm binding my local port 8000 to port 8000 on the service, which will be routed directly to the pod, so let's open another tab and send some requests.

The requests have the following format: you specify the model, the prompt that you are sending to the model, the maximum number of tokens that you want the model to return, and the temperature, which is a parameter related to how stable the responses are; the lower the temperature, the more stable the responses will be. The prompt format for the Mistral 7B Instruct model is the following: you start your prompt with the start token, and since this is an instruction model, you wrap the instruction in the special [INST] tags and end it with the closing [/INST] tag, and then the model will answer, so you can chat with it. Let's send this request to our model.

There's one more thing in the request: the model key points to some path, and you might wonder where this path comes from. When you open the browser, go to port 8000 on your local machine (which is forwarded to the deployment) and open /v1/models, you will see a JSON document listing all of the models deployed within this instance of vLLM. Right now, for us it's only one model, and its id points to the actual path of our model; this is where you obtain the model string that you attach to all of the requests. The default endpoint that you query in order to get responses from your model is /v1/completions, so let's send a request there. Remember to also send a Content-Type header, because the server expects JSON inputs. When we query the model with the prompt "Who is James Hetfield? Provide a short answer", it answers with "lead singer and guitarist of Metallica". A sketch of such a request is shown below.

Since Mistral is an instruct model, you can use it in chat and in workflows such as retrieval-augmented generation: you can provide it with some prompt and then ask it to, for example, summarize something, extract some relevant information, or answer your question. Let's see if it works. Let's open, for example, our GetInData blog, scroll down to one of the recent blog posts, and see whether we can actually extract some information from it. For example, here we have some challenges related to a real-time trading platform, so let's copy a paragraph, paste it into the prompt, and make the instruction "list all of the challenges from the text I just pasted, and be concise". Let's run this script, and as you can see, it responded really quickly with the extracted information; the second sketch below shows how such a request could look. So now you can try it in your own private environment, in your Kubernetes cluster, and build your new large language model application on top of the open-source models.
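A sketch of the port-forward and the two requests described above; the service name and the /model id are the ones assumed in the manifests earlier:

```bash
# Forward the internal service to localhost.
kubectl port-forward svc/mistral-7b-instruct 8000:8000

# In another terminal: list the models served by this vLLM instance;
# the "id" field is the model string to put in completion requests.
curl http://localhost:8000/v1/models

# Send a completion request in the OpenAI-compatible format,
# wrapping the instruction in Mistral's [INST] ... [/INST] tags.
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "/model",
        "prompt": "<s>[INST] Who is James Hetfield? Provide a short answer. [/INST]",
        "max_tokens": 128,
        "temperature": 0.2
      }'
```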
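And a second sketch for the extraction use case; the pasted paragraph is a placeholder:

```bash
# Ask the instruct model to extract information from a pasted paragraph.
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "/model",
        "prompt": "<s>[INST] List all of the challenges from the text below. Be concise.\n\n<paste the paragraph here> [/INST]",
        "max_tokens": 256,
        "temperature": 0.1
      }'
```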
Now that the model is working, it's the best time to give us a thumbs up and subscribe to our channel for more tutorials like this one. Make sure to check out our blog, and if you are interested in a free MLOps consultation, drop us a line in the form linked below. That's all for today; you now know how to deploy private large language models in your Kubernetes cluster in five simple steps. See you in the next one, bye!