Transcript for:
NVIDIA Inference Microservice (NIM)

Let's say you have an amazing app idea, and you want to use a large language model for it. But since your app is focused on privacy and data security, you decided to use an open-source, local large language model, and you were able to put together a pretty amazing prototype; everything is working nicely. Now it's time to take that prototype into production so that you can serve thousands or millions of enterprise users. You need to decide what tools to use, what the infrastructure is going to look like, and what kind of software optimizations you will need in order to run your app in the most optimal configuration possible. So you have to make a lot of decisions, and these include cost efficiency, expected latency, flexibility, security, infrastructure needs, scalability, and what type of inference endpoint to use: whether you want to go with vLLM, llama.cpp, or a Hugging Face inference endpoint. All of these tools require different expertise, so it becomes a huge hassle to come up with a well-optimized solution that you can use to put your model in production. That's why I think this new tool from NVIDIA can be a game changer for developers. It's called NVIDIA Inference Microservice, or NVIDIA NIM for short: a set of pre-trained AI models packaged and optimized to run across a number of different NVIDIA hardware platforms.

So what exactly is a NIM? Think of it as a pre-configured container for simplified deployment. Essentially, it's an inference microservice, and it follows industry-standard APIs; for example, when it comes to LLMs, it uses the OpenAI API standard. NIMs also ship with optimized inference engines: for LLMs, that's Triton Inference Server with TensorRT and TensorRT-LLM. And a NIM gives you a number of tools for monitoring, health checks, and other metrics. NIMs come with optimized AI models: these are your normal models, for example a Llama 3 model, but optimized specifically to run inside a NIM, which gives you a substantial performance boost. Now the best part: all of this is put together into a single package that you can deploy with a single command or a single click on your own infrastructure. NIM not only supports LLMs; it supports a wide variety of different AI models, including vision, video, text-to-image, and even some protein-folding models.

Now, all this sounds good, but what kind of performance boost does a NIM actually give you? Here is an official slide from NVIDIA, showing a Llama 3 8B Instruct model running on a single H100. If you use a NIM, you can expect roughly a three-times improvement in throughput compared to not using one, and higher throughput also reduces your cost of operation. So this can be substantial when you're trying to productionize open-source, local LLMs.

If all this sounds good, let me walk you through how to get started with NVIDIA NIM. There is a whole catalog of NIMs available on NVIDIA's website; I'll put a link in the video description. You can experiment with NIMs serving a number of different models. One of the popular ones is the Llama 3 70B model, but NIMs are not just limited to language models: for example, there's a NIM for a vision-language model, and there's even one for text-to-image and Stable Video Diffusion.
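Just to make the "industry standard API" point concrete: every one of these catalog models sits behind the same OpenAI-style chat completions API, so you can probe any of them straight from a shell. Here's a rough sketch; the base URL and model identifier follow NVIDIA's API catalog at the time of writing, so treat them as assumptions and check the catalog page for the exact values:

```bash
# Assumed values: base URL and model ID come from NVIDIA's API catalog and
# may change; NVIDIA_API_KEY is the key you get after signing up.
curl https://integrate.api.nvidia.com/v1/chat/completions \
  -H "Authorization: Bearer $NVIDIA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta/llama3-70b-instruct",
    "messages": [{"role": "user", "content": "What is the meaning of life?"}],
    "max_tokens": 256
  }'
```

The same request shape should work against a self-hosted NIM as well; only the host changes.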
Not only are these models optimized for NIM; I think there are also a whole bunch of community-built AI models that are optimized for deployment using NIM. Now, the great thing is that you can deploy NIMs any way you want, because they provide industry-standard APIs through which you can interact with the LLMs.

Before showing you how to get started, let's talk about how to get access to NIMs and how to experiment with them. You can sign up for free, and you will get about 1,000 inference credits to get you started, so you can use the NVIDIA-managed serverless APIs in your own applications. Now, NIMs are geared toward enterprises, and that means you can also deploy them on your own infrastructure. Essentially, you just download pre-configured containers and deploy them yourself; that gives you self-hosted APIs without any code changes, but in that case you will need to buy an NVIDIA AI Enterprise license for production deployment.

Okay, so to test out NIMs, there are a few options. You can get started right here on the platform: just select one of the NIMs. Let's say I select the Llama 3 70B model; this gives you the ability to interact with the model right here on the website. Or, if you want to start integrating this into your own projects, there is a Python client, a Node client, or you can just make requests to the same API endpoint using shell. And if you want local deployment, you can use Docker. We're going to talk about all of these options.

First, let's look at the playground. There are a couple of example prompts you can select, for example, "What can I see at NVIDIA's GPU Technology Conference?" If we send this, here is the response we got; this is probably a cached response. Now we can experiment by asking, "What is the meaning of life?" This sends the user query, and the model starts generating, I think streaming, a response. Keep in mind it's serving the full 70-billion-parameter model right now, so this is pretty amazing, because the speed is, I think, really good.

Okay, let's look at another NIM. In this case, we're going to look at PaliGemma, which is a vision-language model. You can upload an image if you want by clicking on this button. So I uploaded one of my thumbnails and asked, "What do you see in this image?" Google's PaliGemma model says that in this image there is a man, we can see text, and the background is red in color. I think it probably didn't detect the text properly, but this is a quick example of how you can start interacting with the NIMs and try them out here.

Now let me show you how you can integrate these NIMs into your own projects. For that, we're going to select the Llama 3 8B Instruct model, and first we'll look at the Python example. The great thing about NVIDIA NIM is that it uses the OpenAI API standard, so you will need to provide this base URL, and then you will need to provide your API key. For that, you will need to sign up for an account and then just click on "Get API Key"; you will get access to the API key. Then you can use the OpenAI client to start generating responses. To get started, I put together a Google Colab notebook; I'll put a link to it in the video description. First we need to install the OpenAI client, then import it, and then we need to provide our API key.
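Putting those pieces together, the finished notebook cell looks roughly like this. This is a sketch: the base URL and model identifier follow NVIDIA's API catalog at the time of writing, and the environment variable is just a placeholder for wherever you keep your key (in my case, a Colab secret):

```python
# pip install openai
import os
from openai import OpenAI

# Placeholder: load the key you got from NVIDIA (here, from an env var).
client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",  # NVIDIA's serverless endpoint
    api_key=os.environ["NVIDIA_API_KEY"],
)

completion = client.chat.completions.create(
    model="meta/llama3-8b-instruct",  # model ID as listed in the catalog
    # To keep chat history, keep appending assistant replies and the
    # following user turns to this list.
    messages=[{"role": "user", "content": "What is the meaning of life?"}],
    temperature=0.5,
    top_p=1.0,
    max_tokens=1024,
    stream=True,  # stream tokens back as they are generated
)

# Print the streamed chunks as they arrive.
for chunk in completion:
    if chunk.choices and chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
```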
The API key is the one you're going to get from NVIDIA; here I'm actually just storing it as a secret in my Google Colab and enabling access for this specific notebook. Now, you have probably seen that a number of different companies use the OpenAI API standard, and in that case you can use the OpenAI client; you just need to provide the base URL. Here it points to the NIM that you created, so that's basically the API endpoint, and you need to provide your API key. After that, we call the chat completions endpoint and provide the name of the model we want to use. Here I'm selecting the Llama 3 8B model, but you could use the 70B model or even the vision-language models. Then, just as with the OpenAI client, we provide our messages; here is the actual prompt, and the role is going to be "user". If you want to keep chat history, you just keep appending the assistant responses and the subsequent user inputs to this list. And since we're using the OpenAI client, we can set all the usual parameters: the temperature, top-p, the maximum number of tokens to generate, and streaming set to true. So let me show you the speed of generation when you're using NVIDIA NIM through NVIDIA's serverless API. This is pretty fast, and keep in mind we are accessing the full 8-billion-parameter instruct model.

Now, this was the case when we were accessing NVIDIA's serverless API. But what if you want to download the model and run it locally? For that, you need to download the Docker container. First, install Docker on your local machine. Then you're going to need to provide your user credentials: the username is going to be this one, and the password is going to be the API key. Then you can set up the Docker container: you provide your API key, you set your local cache directory, and then you run the Docker container (there's a rough sketch of these commands at the end of this transcript). Now, I'm going to be creating a more detailed video on actually running this, so if you are interested, make sure you subscribe to the channel so that you don't miss it. Once your container is up and running, using it is very similar to how you access the OpenAI API; the only difference is that you're going to use localhost as the API endpoint rather than NVIDIA's serverless API. And NVIDIA lets you deploy this on a whole bunch of different cloud providers, so you can deploy it on GCP, Azure, AWS, or even Hugging Face inference endpoints.

A couple of other things I wanted to mention about NIM. It's not limited to the models you see here: you can fine-tune your own models and deploy them using NIM as well, and you can also deploy quantized models as part of a NIM. You can run LoRA adapters on top of NIMs and hot-swap them if needed. And the good thing is that, since it's a container, you can deploy it on a Kubernetes cluster and scale based on your needs.

So this is a really amazing project. I think it will make deployment of LLMs, and other AI models in general, very easy for developers. Make sure to subscribe to the channel if you are interested in more technical content; I am going to be creating a lot more videos on deploying LLMs, and I think NVIDIA NIM is going to be a critical part of that. I hope you found this video useful. Thanks for watching, and as always, see you in the next one.
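For reference, here is roughly what the local-deployment steps described above look like. This is a sketch based on NVIDIA's NIM documentation: the registry username, image name and tag, cache path, and port are assumptions, so check the docs for your specific NIM before running anything.

```bash
# Log in to NVIDIA's container registry. Per NVIDIA's docs, the username is
# the literal string "$oauthtoken" and the password is your API key (assumption).
docker login nvcr.io

# Your NVIDIA API key, plus a local cache directory for downloaded model weights.
export NGC_API_KEY="nvapi-..."        # placeholder for your actual key
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"

# Run the NIM container; the image name/tag are assumptions based on NVIDIA's docs.
docker run -it --rm --gpus all \
  -e NGC_API_KEY \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -p 8000:8000 \
  nvcr.io/nim/meta/llama3-8b-instruct:latest
```

Once the container is up, the Python example from earlier should work unchanged if you point `base_url` at `http://localhost:8000/v1` (port and path again per NVIDIA's docs; the API key can be any placeholder for a local container).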