Deploying DeepSeek Model on AWS Inferentia with vLLM
DeepSeek's distilled models, such as DeepSeek-R1-Distill-Llama-8B and DeepSeek-R1-Distill-Llama-70B, are optimized for inference efficiency, which makes them well suited to AWS Inferentia. The vLLM library provides flexible, optimized model serving. This guide walks through deploying a DeepSeek model on an AWS EC2 Inferentia instance using vLLM.
Copy the public DNS of the instance and connect using the following command:
ssh -i /path/to/key-pair.pem ubuntu@instance-public-dns-name
Ensure the key pair file has the correct permissions set to 400.
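If necessary, restrict the permissions first:
chmod 400 /path/to/key-pair.pem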
Once connected, activate the Neuron PyTorch virtual environment provided on the instance:
source /opt/aws_neuronx_venv_pytorch_2_5_nxd_inference/bin/activate
Next, install Git LFS and build vLLM from source with Neuron support:
sudo apt-get install git-lfs
git lfs install
git clone https://github.com/vllm-project/vllm
cd vllm
pip install -U -r requirements-neuron.txt
pip install .
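As a quick sanity check, confirm that vLLM imports cleanly inside the Neuron environment:
python3 -c "import vllm; print(vllm.__version__)"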
Ensure that huggingface-cli is installed in your environment before proceeding.
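If it is not, installing the Hugging Face Hub client will typically provide it:
pip install -U "huggingface_hub[cli]"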
Create an access token on the Hugging Face Tokens page, then log in:
huggingface-cli login
Download the model weights for the variant you plan to serve:
huggingface-cli download deepseek-ai/DeepSeek-R1-Distill-Llama-8B
huggingface-cli download deepseek-ai/DeepSeek-R1-Distill-Llama-70B
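Optionally, confirm the snapshots landed in the local Hugging Face cache:
huggingface-cli scan-cache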
The model files are typically stored in:
/home/ubuntu/.cache/huggingface/hub/models--deepseek-ai--DeepSeek-R1-Distill-Llama-8B/snapshots/d66bcfc2f3fd52799f95943264f32ba15ca0003d/
Set the model path environment variable:
export MODEL_PATH=/path/to/your/model
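For example, pointing at the 8B snapshot path shown above:
export MODEL_PATH=/home/ubuntu/.cache/huggingface/hub/models--deepseek-ai--DeepSeek-R1-Distill-Llama-8B/snapshots/d66bcfc2f3fd52799f95943264f32ba15ca0003d/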
Start the server:
vllm serve $MODEL_PATH
The server will be available at localhost:8000.
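You can confirm it is responding by listing the available models:
curl http://localhost:8000/v1/models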
You can further configure the server by launching the OpenAI-compatible API server directly:
cd ~/vllm/
python3 -m vllm.entrypoints.openai.api_server \
--model $MODEL_PATH \
--served-model-name DeepSeek-R1-Distill-Llama-8B \
--tensor-parallel-size 8 \
--max-model-len 2048 \
--max-num-seqs 4 \
--block-size 8 \
--use-v2-block-manager \
--device neuron \
--port 8080
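Here, --tensor-parallel-size 8 shards the model across eight NeuronCores, --max-model-len 2048 caps the context length, and --max-num-seqs 4 limits the number of concurrent sequences; adjust these values to match your instance size and workload.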
Finally, send a test chat completion request. The model name must match the --served-model-name value, and the port must match the one the server is listening on (8080 in the example above):
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{ "model": "DeepSeek-R1-Distill-Llama-8B", "messages": [ { "role": "user", "content": "Hello, world!" } ] }'