DeepSeek Deployment Guide
In this guide, we will deploy DeepSeek-R1-Distill-Llama-8B, a distilled model that requires far fewer resources than the full-scale 671-billion-parameter DeepSeek-R1. If you prefer to deploy the full model, update the model name in the vLLM configuration; the sketch below shows how vLLM identifies a model.
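For reference, vLLM selects a model by its Hugging Face ID. The lines below are illustrative only (flags and parallelism values are assumptions); in this project the model name is actually set through the Terraform/Helm configuration:
# Illustrative: how vLLM is pointed at a model by Hugging Face ID
vllm serve deepseek-ai/DeepSeek-R1-Distill-Llama-8B --max-model-len 8192
# The full 671B DeepSeek-R1 requires multi-GPU parallelism, for example:
# vllm serve deepseek-ai/DeepSeek-R1 --tensor-parallel-size 8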
We will use AWS CloudShell to simplify the setup process.
Ensure your AWS account has sufficient quotas to launch the required EC2 instances, especially GPU instances (G or P series) and, if you plan to use the Neuron option, Inferentia instances. You can check your EC2 instance quotas in the Service Quotas console, or from the CLI as sketched below.
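A quick sketch that filters EC2 quotas by name via the Service Quotas API (adjust the substrings to the instance families you need):
# List EC2 On-Demand quota values for G and P instance families
aws service-quotas list-service-quotas --service-code ec2 \
  --query "Quotas[?contains(QuotaName, 'On-Demand G') || contains(QuotaName, 'On-Demand P')].[QuotaName,Value]" \
  --output table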
Clone the GitHub repository containing the necessary configuration files:
git clone https://github.com/aws-samples/deepseek-using-vllm-on-eks
cd deepseek-using-vllm-on-eks
Use Terraform to create an EKS Cluster, VPC, and ECR Repository:
terraform init
terraform apply -auto-approve
Configure kubectl to connect with the EKS cluster:
$(terraform output configure_kubectl | jq -r)
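A quick sanity check that kubectl is pointed at the new cluster:
kubectl cluster-info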
💡 The nodepool_automode.tf file enables Auto Mode for node pools, allowing AWS to dynamically manage instance scaling and selection, optimizing performance and cost.
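Assuming Auto Mode exposes its node pools as Kubernetes objects (as it normally does), you can inspect the pools defined by this project once the cluster is up:
# Node pool names depend on nodepool_automode.tf
kubectl get nodepools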
You can deploy using GPU or Neuron (Inferentia). Follow the relevant steps below.
For the GPU deployment, enable the GPU and Auto Mode node pool flags:
terraform apply -auto-approve -var="enable_deep_seek_gpu=true" -var="enable_auto_mode_node_pool=true"
For the Neuron (Inferentia) deployment, build and push a Neuron-compatible vLLM image first:
1️⃣ Export the ECR repository URI for Neuron:
export ECR_repo_neuron=$(terraform output ecr_repository_uri_neuron | jq -r)
2️⃣ Clone the vLLM repository:
git clone https://github.com/vllm-project/vllm
cd vllm
3️⃣ Build and push the Neuron-compatible vLLM image:
finch build --platform linux/amd64 -f Dockerfile.neuron -t $ECR_repo_neuron:0.1 .
aws ecr get-login-password | finch login --username AWS --password-stdin $ECR_repo_neuron
finch push $ECR_repo_neuron:0.1
cd ..  # return to the deepseek-using-vllm-on-eks repository root
4️⃣ Apply Terraform with Neuron support:
terraform apply -auto-approve -var="enable_deep_seek_gpu=true" -var="enable_deep_seek_neuron=true" -var="enable_auto_mode_node_pool=true"
💡 The AWS Neuron SDK is optimized for Inferentia and Trainium chips, making it ideal for efficient inference of large models.
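If you want to confirm that Inferentia capacity is visible to Kubernetes, you can check which nodes advertise the Neuron device resource (the resource name below assumes the standard Neuron device plugin):
# Show allocatable Neuron devices per node; null means no Inferentia capacity on that node
kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, neuron: .status.allocatable["aws.amazon.com/neuron"]}'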
Verify that the vLLM pods are running and that nodes have been provisioned:
kubectl get po -n deepseek
kubectl get nodes -l owner=data-engineer
Check the vLLM logs for each deployment:
kubectl logs deployment.apps/deepseek-gpu-vllm-chart -n deepseek
kubectl logs deployment.apps/deepseek-neuron-vllm-chart -n deepseek
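The pods can take a while to become ready while the model is downloaded; if you prefer to block until each deployment is available, something like this works (the timeout is arbitrary):
kubectl wait --for=condition=Available deployment/deepseek-gpu-vllm-chart -n deepseek --timeout=15m
kubectl wait --for=condition=Available deployment/deepseek-neuron-vllm-chart -n deepseek --timeout=15m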
Forward the services locally so you can test them (Neuron on port 8080, GPU on port 8081):
kubectl port-forward svc/deepseek-neuron-vllm-chart -n deepseek 8080:80 > port-forward-neuron.log 2>&1 &
kubectl port-forward svc/deepseek-gpu-vllm-chart -n deepseek 8081:80 > port-forward-gpu.log 2>&1 &
Send a test request to the Neuron endpoint:
curl -X POST "http://localhost:8080/v1/chat/completions" -H "Content-Type: application/json" --data '{
"model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
"messages": [{"role": "user", "content": "What is Kubernetes?"}]
}'
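The same request works against the GPU deployment, which was forwarded to port 8081 above:
curl -X POST "http://localhost:8081/v1/chat/completions" -H "Content-Type: application/json" --data '{
  "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
  "messages": [{"role": "user", "content": "What is Kubernetes?"}]
}'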
💡 The helm.tf file customizes the Helm chart for vLLM, configuring resource allocation, node selectors, and tolerations for DeepSeek.
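Because Terraform installs these charts for you, a quick way to see the deployed releases and the values they were rendered with (the release name below is an assumption; use whatever helm list shows):
helm list -n deepseek
helm get values deepseek-gpu-vllm-chart -n deepseek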
Next, deploy the chatbot UI that talks to these endpoints:
1️⃣ Export the ECR repository URI:
export ECR_repo=$(terraform output ecr_repository_uri | jq -r)
2️⃣ Build the chatbot UI image:
finch build --platform linux/amd64 -t $ECR_repo:0.1 chatbot-ui/application/.
3️⃣ Authenticate and push the image:
aws ecr get-login-password | finch login --username AWS --password-stdin $ECR_repo
finch push $ECR_repo:0.1
4️⃣ Update the deployment manifest:
sed -i "s#__IMAGE_DEEPSEEK_CHATBOT__#$ECR_repo:0.1#g" chatbot-ui/manifests/deployment.yaml
sed -i "s|__PASSWORD__|$(openssl rand -base64 12 | tr -dc A-Za-z0-9 | head -c 16)|" chatbot-ui/manifests/deployment.yaml
5️⃣ Apply the manifest files:
kubectl apply -f chatbot-ui/manifests/ingress-class.yaml
kubectl apply -f chatbot-ui/manifests/deployment.yaml
6️⃣ Retrieve the chatbot UI URL:
echo http://$(kubectl get ingress/deepseek-chatbot-ingress -n deepseek -o json | jq -r '.status.loadBalancer.ingress[0].hostname')
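The load balancer behind the ingress can take a few minutes to provision; if the command above prints null, watch the ingress until an address appears:
kubectl get ingress deepseek-chatbot-ingress -n deepseek -w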
7️⃣ Fetch the admin credentials:
echo -e "Username=$(kubectl get secret deepseek-chatbot-secrets -n deepseek -o jsonpath='{.data.admin-username}' | base64 --decode)\nPassword=$(kubectl get secret deepseek-chatbot-secrets -n deepseek -o jsonpath='{.data.admin-password}' | base64 --decode)"
✅ Deployment complete! Your DeepSeek chatbot is now live and ready for interaction. 🎉