Deploying DeepSeek-R1 on AWS SageMaker: A Step-by-Step Guide
DeepSeek-R1 is now available on Amazon SageMaker JumpStart, making it easy to deploy and integrate into your applications. This guide provides detailed steps for deploying the model on AWS SageMaker, along with cost considerations.
You can deploy the DeepSeek-R1 model using the SageMaker JumpStart UI or via the SageMaker Python SDK.
To deploy via the SageMaker JumpStart UI:
1. Access SageMaker Studio.
2. Open JumpStart.
3. Search for DeepSeek-R1.
4. Review the model details page (including the model description and license) before deploying.
5. Choose Deploy.
6. Configure the deployment settings: select the instance type (default: ml.p5e.48xlarge) and the instance count. ⚠️ Choosing the right instance type and count is crucial for balancing cost and performance.
7. Review the settings and deploy.
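Once the UI deployment finishes, you can confirm the endpoint is in service from the SDK. A minimal sketch using boto3, where the endpoint name is a hypothetical placeholder for the name you chose in the deployment settings:

import boto3

sm = boto3.client("sagemaker")

# Hypothetical placeholder: use the endpoint name you chose in the JumpStart UI
status = sm.describe_endpoint(EndpointName="my-deepseek-r1-endpoint")["EndpointStatus"]
print(status)  # "InService" once the deployment has completed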
Install the SageMaker Python SDK
!pip install --force-reinstall --no-cache-dir sagemaker==2.235.2
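To confirm the pinned version is the one in use, a quick check:

import sagemaker

print(sagemaker.__version__)  # expect 2.235.2 after the reinstall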
Deploy DeepSeek-R1 using Python
from sagemaker.serve.builder.model_builder import ModelBuilder
from sagemaker.serve.builder.schema_builder import SchemaBuilder
from sagemaker.jumpstart.model import ModelAccessConfig
from sagemaker.session import Session
import logging

sagemaker_session = Session()
execution_role_arn = sagemaker_session.get_caller_identity_arn()

# JumpStart model ID and target GPU instance type
js_model_id = "deepseek-llm-r1"
gpu_instance_type = "ml.p5e.48xlarge"

# Sample input/output pairs let SchemaBuilder infer the request/response schema
sample_input = {
    "inputs": "Hello, I'm a language model,",
    "parameters": {"max_new_tokens": 128, "top_p": 0.9, "temperature": 0.6},
}
sample_output = [{"generated_text": "Hello, I'm a language model, and I'm here to help you with your English."}]
schema_builder = SchemaBuilder(sample_input, sample_output)

model_builder = ModelBuilder(
    model=js_model_id,
    schema_builder=schema_builder,
    sagemaker_session=sagemaker_session,
    role_arn=execution_role_arn,
    instance_type=gpu_instance_type,  # without this, the instance type defined above is never used
    log_level=logging.ERROR,
)

model = model_builder.build()
# Deploying requires accepting the model's EULA
predictor = model.deploy(
    model_access_configs={js_model_id: ModelAccessConfig(accept_eula=True)},
    accept_eula=True,
)
predictor.predict(sample_input)
Run Additional Inference Requests
new_input = {
    "inputs": "What is Amazon doing in Generative AI?",
    "parameters": {"max_new_tokens": 64, "top_p": 0.8, "temperature": 0.7},
}
prediction = predictor.predict(new_input)
print(prediction)
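The response shape depends on the serving container; JumpStart text-generation endpoints typically return a list of dictionaries with a generated_text field, as in sample_output above. A minimal sketch for pulling out the text under that assumption:

# Assumes a response of the form [{"generated_text": "..."}]
def extract_text(prediction):
    if isinstance(prediction, list) and prediction and "generated_text" in prediction[0]:
        return prediction[0]["generated_text"]
    raise ValueError(f"Unexpected response shape: {prediction!r}")

print(extract_text(prediction))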
Deploying DeepSeek-R1-Distill-Llama-70B on SageMaker with AWS Neuron
This guide walks through deploying DeepSeek-R1-Distill-Llama-70B on an AWS Neuron instance, such as those powered by AWS Trainium 2 or AWS Inferentia 2.
Before deploying the model, ensure you meet the following requirements:
- A service quota of at least 1 for ml.inf2.48xlarge for endpoint usage (request an increase in the Service Quotas console if needed).

Instantiate a SageMaker session to determine the current AWS region and execution role:
import json
import sagemaker
import boto3
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri
try:
    role = sagemaker.get_execution_role()
except ValueError:
    # Fall back to a named IAM role when running outside a SageMaker notebook
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]
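The session can also tell you which AWS region your endpoint will land in; a quick check, assuming default credentials are configured:

import sagemaker

sess = sagemaker.Session()
print(f"Region: {sess.boto_region_name}")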
Define the SageMaker model using the Python SDK:
image_uri = get_huggingface_llm_image_uri("huggingface-neuronx", version="0.0.25")
model_id = "deepseek-ai/DeepSeek-R1-Distill-Llama-70B"
model_name = model_id.split("/")[-1].lower()
# Hub Model Configuration
hub = {
    "HF_MODEL_ID": model_id,
    "HF_NUM_CORES": "24",          # inf2.48xlarge exposes 24 Neuron cores (12 chips x 2 cores)
    "HF_AUTO_CAST_TYPE": "bf16",   # cast weights to bfloat16 for Neuron
    "MAX_BATCH_SIZE": "4",
    "MAX_INPUT_TOKENS": "3686",    # prompt budget; 4096 - 3686 leaves 410 tokens for generation
    "MAX_TOTAL_TOKENS": "4096",    # bounds prompt + generated tokens
}
# Create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    image_uri=image_uri,
    env=hub,
    role=role,
)
endpoint_name = f"{model_name}-ep"
# Deploy the model to SageMaker Inference
predictor = huggingface_model.deploy(
    endpoint_name=endpoint_name,
    initial_instance_count=1,
    instance_type="ml.inf2.48xlarge",
    container_startup_health_check_timeout=3600,
    volume_size=512,
)
response = predictor.predict(
    {
        "inputs": "What is the capital of France?",
        "parameters": {
            "do_sample": True,
            "max_new_tokens": 128,
            "temperature": 0.7,
            "top_k": 50,
            "top_p": 0.95,
        },
    }
)
print(response)
Once testing is complete, delete the endpoint to avoid unnecessary costs:
predictor.delete_model()
predictor.delete_endpoint()
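If you script the whole test, a try/finally block guarantees the cleanup runs even when a request fails. A minimal sketch of that pattern, assuming the predictor from the deployment step above:

try:
    response = predictor.predict({"inputs": "What is the capital of France?"})
    print(response)
finally:
    # Always release the model and endpoint, even if the request above raises
    predictor.delete_model()
    predictor.delete_endpoint()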
Deploying DeepSeek-R1-Distill-Llama-70B on an EC2 Neuron Instance
This guide details how to export, deploy, and run DeepSeek-R1-Distill-Llama-70B on an inf2.48xlarge AWS EC2 instance.
Before proceeding, ensure:
- You have subscribed to the Hugging Face Neuron Deep Learning AMI in the AWS Marketplace.
- You have launched an inf2.48xlarge instance in EC2 with this AMI.

If you need guidance on launching an EC2 instance, refer to the AWS documentation.
Execute the following command on your EC2 instance to deploy the model:
docker run -p 8080:80 \
-v $(pwd)/data:/data \
--device=/dev/neuron0 \
--device=/dev/neuron1 \
--device=/dev/neuron2 \
--device=/dev/neuron3 \
--device=/dev/neuron4 \
--device=/dev/neuron5 \
--device=/dev/neuron6 \
--device=/dev/neuron7 \
--device=/dev/neuron8 \
--device=/dev/neuron9 \
--device=/dev/neuron10 \
--device=/dev/neuron11 \
-e HF_BATCH_SIZE=4 \
-e HF_SEQUENCE_LENGTH=4096 \
-e HF_AUTO_CAST_TYPE="bf16" \
-e HF_NUM_CORES=24 \
ghcr.io/huggingface/neuronx-tgi:latest \
--model-id deepseek-ai/DeepSeek-R1-Distill-Llama-70B \
--max-batch-size 4 \
--max-total-tokens 4096
This process takes a few minutes: the container downloads the precompiled model from the Hugging Face Hub cache and then launches a Text Generation Inference (TGI) endpoint.
To verify that the model is running correctly, send a test request:
curl localhost:8080/generate \
-X POST \
-d '{"inputs":"Why is the sky dark at night?"}' \
-H 'Content-Type: application/json'
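The same request from Python, as a minimal sketch using the requests library against the local TGI endpoint:

import requests

resp = requests.post(
    "http://localhost:8080/generate",
    json={"inputs": "Why is the sky dark at night?"},
    timeout=300,
)
print(resp.json())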
To avoid unnecessary costs, pause or terminate the EC2 instance after testing.
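One way to stop the instance programmatically is via boto3; a sketch in which the instance ID is a hypothetical placeholder for your own:

import boto3

ec2 = boto3.client("ec2")
# Hypothetical placeholder: substitute your own instance ID
ec2.stop_instances(InstanceIds=["i-0123456789abcdef0"])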
Cost Considerations
Deploying DeepSeek-R1 on SageMaker incurs several costs:
- Instance Hours: you are billed for every hour the endpoint is running, based on the instance type (e.g., ml.p5e.48xlarge) and the number of instances you use.
- Custom Model Import: importing the model into Amazon Bedrock via Custom Model Import is billed separately.
- Data Transfer Fees: data transferred out of AWS may incur charges.
- Storage Costs: the EBS volume attached to each endpoint instance (e.g., volume_size above) and any model artifacts stored in Amazon S3 are billed separately.
- Amazon Bedrock Guardrails: if you apply guardrails to model interactions, their usage is billed on top of inference.
Avoiding Unexpected Charges
To prevent unnecessary costs, delete your SageMaker endpoint when no longer in use:
predictor.delete_model()
predictor.delete_endpoint()
Optimizing Cost vs. Performance
The biggest lever is the instance type and count you choose at deployment (see the ⚠️ note above): pick the smallest instance that meets your latency and throughput targets, scale out only when traffic requires it, and delete idle endpoints.