πŸš€ Deploying DeepSeek with vLLM on AWS EKS

πŸ“Œ Overview

The DeepSeek with vLLM on EKS project deploys DeepSeek models on Amazon EKS (Elastic Kubernetes Service), optimized for NVIDIA GPUs or AWS NeuronCores. Below is a step-by-step guide to setting up the infrastructure, deploying the model, and integrating the chatbot UI.


πŸ—οΈ Project Structure

.
β”œβ”€β”€ chatbot-ui/
β”‚   β”œβ”€β”€ application/
β”‚   β”‚   β”œβ”€β”€ Dockerfile          # Dockerfile for building the UI container
β”‚   β”‚   β”œβ”€β”€ app.py              # Chatbot UI application (Gradio)
β”‚   β”‚   └── requirements.txt    # Python dependencies
β”‚   └── manifests/
β”‚       β”œβ”€β”€ deployment.yaml     # Kubernetes Deployment for the UI
β”‚       └── ingress-class.yaml  # Ingress configuration
β”œβ”€β”€ static/images/              # UI image assets
β”œβ”€β”€ vllm-chart/                 # Helm chart for vLLM
β”‚   β”œβ”€β”€ values.yaml             # Default configuration values
β”‚   └── templates/
β”‚       β”œβ”€β”€ deployment.yaml     # Pod configuration
β”‚       β”œβ”€β”€ service.yaml        # Service configuration
β”‚       └── _helpers.tpl        # Template helpers
β”œβ”€β”€ .gitignore
β”œβ”€β”€ helm.tf                     # Terraform script for Helm configuration
β”œβ”€β”€ main.tf                     # Terraform infrastructure setup
β”œβ”€β”€ nodepool_automode.tf        # EKS Auto Mode node pool configuration
└── README.md                   # Documentation

πŸ”„ Workflow

1️⃣ Infrastructure Setup with Terraform

  1. Create VPC and EKS Cluster

    • Defined in main.tf using the terraform-aws-modules/eks/aws module.
    • Configure Auto Mode Node Pool (nodepool_automode.tf) to automatically select GPU/Neuron instances.
    • Create an Amazon ECR repository to store chatbot and vLLM images.
  2. Configure Node Pool (nodepool_automode.tf)

    • Set up GPU (g5/g6/p5) or Neuron (inf2).
    • Define node selectors, taints, and labels for optimized resource allocation.

2️⃣ Deploy Model with Helm

  1. Deploy Helm Chart (helm.tf)

    • Set up two deployments:
      • deepseek_gpu: Runs on GPU.
      • deepseek_neuron: Runs on Neuron.
    • Define resource allocation, selectors, and runtime settings for vLLM.
  2. Configure vLLM Chart (vllm-chart/)

    • deployment.yaml: Defines the vLLM running pod.
    • service.yaml: Configures the service to handle requests to vLLM.

3️⃣ Deploy Chatbot UI

  1. Build UI Container (Dockerfile)

    • Base Image: python:3.12-slim
    • Install dependencies from requirements.txt.
    • Run the app with uvicorn app:app --host 0.0.0.0 --port 7860.
  2. Deploy UI to Kubernetes (manifests/)

    • deployment.yaml: Creates a pod running the UI container.
    • ingress-class.yaml: Configures Ingress with an Application Load Balancer (ALB).

βš™οΈ Configuration Details

🌍 AWS Infrastructure

  • EKS Cluster: Deployed with Terraform.
  • Node Pool: Auto-selects GPU or Neuron.
  • ECR: Stores container images.

πŸ›  Chatbot Application

  • Backend: Uses FastAPI to communicate with vLLM.
  • Frontend: Built with Gradio for an interactive UI.
  • Authentication: Supports login via username/password.

πŸ“¦ vLLM Deployment

  • GPU Model: DeepSeek runs on GPUs (g5/g6/p5).
  • Neuron Model: DeepSeek runs on NeuronCore (inf2).
  • Resource Requests & Limits: Optimized resource allocation.

πŸ”’ Security

  • IAM Role-Based Access Control for secure permissions.
  • SSL Certificates with Ingress Controller.
  • Kubernetes Secrets for protecting credentials.

πŸ“Š Monitoring & Performance

  • Liveness & Readiness Probes for container health checks.
  • Resource Limits to prevent overload.
  • Node Affinity & Taints for efficient resource scheduling.