Deploying Large Language Models in Kubernetes

Jul 2, 2024

How to Deploy Large Language Models in Your Private Kubernetes Cluster

Introduction

  • Speaker: Marcin Zabłocki, MLOps Architect at GetInData
  • Topic: Deploying large language models in a private Kubernetes cluster in 5 steps
  • Models Mentioned: Falcon, Mistral
  • Frameworks Mentioned: Hugging Face Text Generation Inference (TGI), OpenLLM, and vLLM

Key Points

  1. Background & Changes

    • TGI license has become more restrictive
    • Open source language model landscape is evolving rapidly
    • Example: Mistral is currently a top-performing open-source model
  2. Why Kubernetes?

    • Compatible with various platforms: GCP, AWS, Azure, On-premise
    • Uses standard Kubernetes resources
    • Scalable solution for increased traffic
    • Keeping model files in cloud storage (e.g., GCS) and a PVC avoids depending on third-party servers (e.g., the Hugging Face Hub) at pod startup
  3. Steps Overview

    • Step 1: Download the model from Hugging Face and upload to cloud storage
    • Step 2: Initialize persistent volume in Kubernetes
    • Step 3: Copy model files from cloud storage to persistent volume
    • Step 4: Deploy the model using vLLM
    • Step 5: Query the deployed model

Detailed Steps

Step 1: Download the Model

  • Navigate to the Hugging Face Hub
  • Select the model (e.g., Mistral 7B Instruct v0.2)
  • Download the model files (e.g., in the safetensors format)
  • Upload files to cloud storage (GCS, S3, or Azure Blob)
  • Ensure the storage is in the same region as the Kubernetes cluster
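
A minimal sketch of this step using the Hugging Face CLI and gsutil; the model ID matches the example above, while the bucket name gs://my-models-bucket is a placeholder:

```bash
# Download the model snapshot from the Hugging Face Hub
# (requires: pip install huggingface_hub)
huggingface-cli download mistralai/Mistral-7B-Instruct-v0.2 \
  --local-dir ./mistral-7b-instruct-v0.2

# Upload the files to a GCS bucket in the same region as the cluster
# (bucket name is a placeholder; use AWS CLI / Azure CLI for S3 / Blob)
gsutil -m cp -r ./mistral-7b-instruct-v0.2 gs://my-models-bucket/
```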

Step 2: Initialize Persistent Volume

  • Create a storage class with Retain policy
  • Ensure storage class is in a region where GPUs are available
  • Use kubectl apply to create the storage class
  • Create a persistent volume claim (PVC)
  • Ensure ReadWriteOnce access and sufficient storage capacity
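
A sketch of the two manifests, assuming a GKE cluster; the class name, claim name, zone, and size are placeholders to adapt to your environment:

```yaml
# storage.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: model-storage
provisioner: pd.csi.storage.gke.io    # GKE Persistent Disk CSI driver
parameters:
  type: pd-balanced
reclaimPolicy: Retain                 # keep the disk even if the claim is deleted
volumeBindingMode: WaitForFirstConsumer
allowedTopologies:
  - matchLabelExpressions:
      - key: topology.gke.io/zone
        values:
          - europe-west4-a            # pick a zone where GPUs are available
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-pvc
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: model-storage
  resources:
    requests:
      storage: 30Gi                   # Mistral 7B weights are ~15 GB; leave headroom
```

Apply with `kubectl apply -f storage.yaml`. The Retain policy keeps the underlying disk, and the model files on it, even if the claim is later deleted.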

Step 3: Copy Model Files

  • Create a Kubernetes job to copy files from cloud storage to PVC
  • Use the appropriate CLI (e.g., Google Cloud SDK, AWS CLI, Azure CLI)
  • Job mounts the persistent volume and copies files locally
  • Monitor job status using kubectl get jobs
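
A sketch of such a job for GCS, reusing the placeholder bucket and claim names from the previous steps:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: copy-model
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: copy
          image: google/cloud-sdk:slim   # ships gsutil; swap for AWS/Azure CLI images as needed
          command: ["bash", "-c"]
          args:
            - gsutil -m cp -r gs://my-models-bucket/mistral-7b-instruct-v0.2 /data/
          volumeMounts:
            - name: model-volume
              mountPath: /data           # the PVC is mounted here
      volumes:
        - name: model-volume
          persistentVolumeClaim:
            claimName: model-pvc
```

Monitor with `kubectl get jobs` and inspect output with `kubectl logs job/copy-model`.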

Step 4: Deploy Using vLLM

  • Create a Kubernetes deployment manifest
  • Configure node selectors for GPU availability
  • Attach initialized volume in read-only mode
  • Use the vLLM image from Docker Hub or a private registry
  • Set model parameters and resources (e.g., GPUs, memory, CPUs)
  • Use kubectl apply to deploy
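
A sketch of the deployment and a service in front of it, using the vllm/vllm-openai image from Docker Hub; the node selector, resource sizes, and object names are assumptions to adjust for your cluster:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mistral-7b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mistral-7b
  template:
    metadata:
      labels:
        app: mistral-7b
    spec:
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4   # schedule onto a GPU node pool (GKE example)
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest              # OpenAI-compatible vLLM server
          args: ["--model", "/data/mistral-7b-instruct-v0.2"]
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: 24Gi
              cpu: "6"
          volumeMounts:
            - name: model-volume
              mountPath: /data
              readOnly: true                          # weights are read-only at inference time
      volumes:
        - name: model-volume
          persistentVolumeClaim:
            claimName: model-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: mistral-7b
spec:
  selector:
    app: mistral-7b
  ports:
    - port: 8000
      targetPort: 8000
```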

Step 5: Query the Model

  • Use kubectl port-forward to access the service
  • Query the deployed model via OpenAI API format
  • Specify model, prompt, max tokens, temperature, etc.
  • Example prompt structure for the Mistral model
  • View deployed models via /v1/models endpoint
  • Query endpoint: /v1/completions
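
A sketch of querying the deployment from a local machine; vLLM exposes an OpenAI-compatible API, and the service name matches the sketch in Step 4:

```bash
# Forward the service to localhost
kubectl port-forward svc/mistral-7b 8000:8000

# List deployed models -- vLLM reports the name it was started with
curl http://localhost:8000/v1/models

# Query the completions endpoint; Mistral Instruct expects the [INST] ... [/INST] format
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "/data/mistral-7b-instruct-v0.2",
        "prompt": "<s>[INST] Explain Kubernetes in one paragraph. [/INST]",
        "max_tokens": 200,
        "temperature": 0.7
      }'
```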

Additional Notes

  • Private and secure deployment options
  • Importance of the reclaim policy and topology for a resilient Kubernetes setup
  • Utilization of horizontal pod autoscalers or similar tools for scaling
  • Use cases: Chat, retrieval-augmented generation, summarization
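
For the autoscaling note above, a minimal HorizontalPodAutoscaler sketch targeting the deployment from Step 4; CPU utilization is only a rough proxy for LLM load, and real setups often scale on custom metrics:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: mistral-7b
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mistral-7b
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

Note that the ReadWriteOnce claim from Step 2 can only be mounted by one node at a time, so scaling across nodes would require cloning the volume or a ReadOnlyMany setup.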

Conclusion

  • The tutorial demonstrates deploying a private language model in a Kubernetes cluster.
  • Encouragement to subscribe and follow for more tutorials.