vLLM is an open-source, high-throughput inference and serving engine for large language models. This guide walks you through setting it up and deploying it quickly and efficiently.
Before starting, ensure Python is installed on your system:
python3 -V
Recommendation: Use the latest Python version for optimal compatibility and security.
On Debian/Ubuntu, install pip (the Python package manager) and the venv module (prefix with sudo if you are not running as root):
apt install -y python3-pip python3-venv

Using a virtual environment helps manage Python packages independently:
python3 -m venv vllm
source vllm/bin/activate
Note: Activate the virtual environment each time you work with vLLM.
Install vLLM using the command:
pip install vllm

Handling common errors: If you encounter issues related to transformers, update it with:
pip install transformers -U
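To confirm the installation works, you can run a quick offline generation with vLLM's Python API before starting the server. The snippet below is a minimal sketch: the model name is only an example (it is downloaded from Hugging Face on first use), and a small model keeps the test fast.
from vllm import LLM, SamplingParams

# Load a small example model for a quick smoke test (downloaded on first use).
llm = LLM(model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B", max_model_len=4096)

# Keep generation short so the test finishes quickly.
params = SamplingParams(temperature=0.7, max_tokens=64)

outputs = llm.generate(["Tell me the recipe for tea"], params)
for output in outputs:
    print(output.outputs[0].text)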

Start the vLLM server by passing it a model name:
vllm serve <model_name>
You can load various models from Hugging Face, for example:
vllm serve "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
vllm serve "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
Important: Limit the model's maximum context length (prompt plus generated tokens) to prevent memory exhaustion:
vllm serve "deepseek-ai/DeepSeek-R1-Distill-Llama-8B" --max-model-len 4096

Open a new terminal window and send a query to the model using curl:
curl -X POST "http://localhost:8000/v1/chat/completions" \
-H "Content-Type: application/json" \
--data '{ "model": "deepseek-ai/DeepSeek-R1", "messages": [ { "role": "user", "content": "Tell me the recipe for tea" } ] }'
Replace the "content" field with your desired prompt to test the model response.
