1. Prerequisites
GPU: NVIDIA + CUDA 12.1
Python: 3.10+
Throughput: 5-20x Ollama
API: OpenAI-compatible
vLLM is a high-performance inference engine, built around PagedAttention, for serving open-weight LLMs at production scale.
pip install vllm
vllm serve meta-llama/Llama-3.1-8B-Instruct
The server listens on port 8000 and exposes an OpenAI-compatible API. Test it:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
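The same request can also be sent from Python. The following is a minimal sketch using only the standard library, assuming the server started above is reachable at http://localhost:8000 and that the response follows the OpenAI chat completion shape (`choices[0].message.content`). The helper names `build_chat_request`, `extract_reply`, and `chat` are hypothetical, chosen for illustration only.

```python
import json
import urllib.request

# Assumed defaults, copied from the commands above; adjust for your deployment.
BASE_URL = "http://localhost:8000/v1"
MODEL = "meta-llama/Llama-3.1-8B-Instruct"


def build_chat_request(user_message, model=MODEL, base_url=BASE_URL):
    """Build the same POST request as the curl command (hypothetical helper)."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body):
    """Return the assistant text from an OpenAI-style chat completion body."""
    return response_body["choices"][0]["message"]["content"]


def chat(user_message, timeout=60):
    """Send one user message to the running server and return the reply text."""
    request = build_chat_request(user_message)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_reply(json.load(response))
```

With the server running, `print(chat("Hello!"))` should print the model's reply. Building the request separately from sending it keeps the payload easy to inspect without a live server.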