Deploying Large Language Models in Kubernetes

Jul 2, 2024

How to Deploy Large Language Models in Your Private Kubernetes Cluster

Introduction

  • Speaker: Marcin Zabłocki, MLOps Architect at GetInData
  • Topic: Deploying large language models in a private Kubernetes cluster in 5 steps
  • Models Mentioned: Falcon, Mistral
  • Frameworks Mentioned: Hugging Face Text Generation Inference (TGI), OpenLLM, and vLLM

Key Points

  1. Background & Changes

    • TGI license has become more restrictive
    • Open source language model landscape is evolving rapidly
    • Example: Mistral is currently a top-performing open-source model
  2. Why Kubernetes?

    • Compatible with various platforms: GCP, AWS, Azure, On-premise
    • Uses standard Kubernetes resources
    • Scalable solution for increased traffic
    • Keeping model files in cloud storage (e.g., GCS) and a PVC avoids depending on third-party servers (e.g., the Hugging Face Hub) at pod startup
  3. Steps Overview

    • Step 1: Download the model from Hugging Face and upload to cloud storage
    • Step 2: Initialize persistent volume in Kubernetes
    • Step 3: Copy model files from cloud storage to persistent volume
    • Step 4: Deploy the model using vLLM
    • Step 5: Query the deployed model

Detailed Steps

Step 1: Download the Model

  • Navigate to the Hugging Face Hub
  • Select the model (e.g., Mistral 7B Instruct v0.2)
  • Download the model files (e.g., in the safetensors format)
  • Upload files to cloud storage (GCS, S3, or Azure Blob)
  • Ensure the storage is in the same region as the Kubernetes cluster
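
A minimal sketch of this step using the Hugging Face CLI and gsutil; the model ID matches the example above, while the bucket name gs://my-models-bucket is a placeholder:

```bash
# Download the model snapshot from the Hugging Face Hub
# (requires: pip install huggingface_hub)
huggingface-cli download mistralai/Mistral-7B-Instruct-v0.2 \
  --local-dir ./mistral-7b-instruct-v0.2

# Upload the files to a GCS bucket in the same region as the cluster
# (bucket name is a placeholder; use AWS CLI / Azure CLI for S3 / Blob)
gsutil -m cp -r ./mistral-7b-instruct-v0.2 gs://my-models-bucket/
```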

Step 2: Initialize Persistent Volume

  • Create a storage class with Retain policy
  • Ensure storage class is in a region where GPUs are available
  • Use kubectl apply to create the storage class
  • Create a persistent volume claim (PVC)
  • Ensure ReadWriteOnce access and sufficient storage capacity
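
A sketch of the two manifests, assuming a GKE cluster; the class name, claim name, zone, and size are placeholders to adapt to your environment:

```yaml
# storage.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: model-storage
provisioner: pd.csi.storage.gke.io    # GKE Persistent Disk CSI driver
parameters:
  type: pd-balanced
reclaimPolicy: Retain                 # keep the disk even if the claim is deleted
volumeBindingMode: WaitForFirstConsumer
allowedTopologies:
  - matchLabelExpressions:
      - key: topology.gke.io/zone
        values:
          - europe-west4-a            # pick a zone where GPUs are available
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-pvc
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: model-storage
  resources:
    requests:
      storage: 30Gi                   # Mistral 7B weights are ~15 GB; leave headroom
```

Apply with `kubectl apply -f storage.yaml`. The Retain policy keeps the underlying disk, and the model files on it, even if the claim is later deleted.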

Step 3: Copy Model Files

  • Create a Kubernetes job to copy files from cloud storage to PVC
  • Use the appropriate CLI (e.g., Google Cloud SDK, AWS CLI, Azure CLI)
  • Job mounts the persistent volume and copies files locally
  • Monitor job status using kubectl get jobs
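
A sketch of such a job for GCS, reusing the placeholder bucket and claim names from the previous steps:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: copy-model
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: copy
          image: google/cloud-sdk:slim   # ships gsutil; swap for AWS/Azure CLI images as needed
          command: ["bash", "-c"]
          args:
            - gsutil -m cp -r gs://my-models-bucket/mistral-7b-instruct-v0.2 /data/
          volumeMounts:
            - name: model-volume
              mountPath: /data           # the PVC is mounted here
      volumes:
        - name: model-volume
          persistentVolumeClaim:
            claimName: model-pvc
```

Monitor with `kubectl get jobs` and inspect output with `kubectl logs job/copy-model`.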

Step 4: Deploy Using vLLM

  • Create a Kubernetes deployment manifest
  • Configure node selectors for GPU availability
  • Attach initialized volume in read-only mode
  • Use the vLLM image from Docker Hub or a private registry
  • Set model parameters and resources (e.g., GPUs, memory, CPUs)
  • Use kubectl apply to deploy
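
A sketch of the deployment and a service in front of it, using the vllm/vllm-openai image from Docker Hub; the node selector, resource sizes, and object names are assumptions to adjust for your cluster:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mistral-7b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mistral-7b
  template:
    metadata:
      labels:
        app: mistral-7b
    spec:
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4   # schedule onto a GPU node pool (GKE example)
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest              # OpenAI-compatible vLLM server
          args: ["--model", "/data/mistral-7b-instruct-v0.2"]
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: 24Gi
              cpu: "6"
          volumeMounts:
            - name: model-volume
              mountPath: /data
              readOnly: true                          # weights are read-only at inference time
      volumes:
        - name: model-volume
          persistentVolumeClaim:
            claimName: model-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: mistral-7b
spec:
  selector:
    app: mistral-7b
  ports:
    - port: 8000
      targetPort: 8000
```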

Step 5: Query the Model

  • Use kubectl port-forward to access the service
  • Query the deployed model via OpenAI API format
  • Specify model, prompt, max tokens, temperature, etc.
  • Example prompt structure for the Mistral model
  • View deployed models via /v1/models endpoint
  • Query endpoint: /v1/completions
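
A sketch of querying the deployment from a local machine; vLLM exposes an OpenAI-compatible API, and the service name matches the sketch in Step 4:

```bash
# Forward the service to localhost
kubectl port-forward svc/mistral-7b 8000:8000

# List deployed models -- vLLM reports the name it was started with
curl http://localhost:8000/v1/models

# Query the completions endpoint; Mistral Instruct expects the [INST] ... [/INST] format
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "/data/mistral-7b-instruct-v0.2",
        "prompt": "<s>[INST] Explain Kubernetes in one paragraph. [/INST]",
        "max_tokens": 200,
        "temperature": 0.7
      }'
```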

Additional Notes

  • Private and secure deployment options
  • Importance of the reclaim policy and topology for a resilient Kubernetes setup
  • Utilization of horizontal pod autoscalers or similar tools for scaling
  • Use cases: Chat, retrieval-augmented generation, summarization
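
For the autoscaling note above, a minimal HorizontalPodAutoscaler sketch targeting the deployment from Step 4; CPU utilization is only a rough proxy for LLM load, and real setups often scale on custom metrics:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: mistral-7b
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mistral-7b
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

Note that the ReadWriteOnce claim from Step 2 can only be mounted by one node at a time, so scaling across nodes would require cloning the volume or a ReadOnlyMany setup.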

Conclusion

  • The tutorial demonstrates deploying a private language model in a Kubernetes cluster.
  • Encouragement to subscribe and follow for more tutorials.