vLLM: High-Throughput LLM Serving

A high-performance inference engine, built on PagedAttention, for serving open-weight LLMs at production scale.

Inference GPU Production OpenAI API
1 Prerequisites
GPU: NVIDIA with CUDA 12.1
Python: 3.10+
Throughput: 5-20x Ollama (claimed; actual gains depend on model, hardware, and request concurrency)
API: OpenAI-compatible
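
Before installing, it can help to confirm the machine roughly matches the prerequisites. The sketch below is only a coarse check: it compares the running interpreter against Python 3.10 and looks for `nvidia-smi` on the PATH, on the assumption that an NVIDIA driver install provides that tool. It does not verify the CUDA version, and the function name `check_prerequisites` is made up for this example.

```python
import shutil
import sys

MIN_PYTHON = (3, 10)  # minimum version listed in the prerequisites


def check_prerequisites() -> dict:
    """Return a rough pass/fail map for the prerequisites listed above."""
    return {
        # The interpreter running this script must be 3.10 or newer.
        "python_ok": sys.version_info >= MIN_PYTHON,
        # Assumption: an NVIDIA driver install puts nvidia-smi on PATH.
        # This says nothing about which CUDA version is available.
        "nvidia_smi_found": shutil.which("nvidia-smi") is not None,
    }


for name, ok in check_prerequisites().items():
    print(f"{name}: {'yes' if ok else 'no'}")
```

A "no" for `nvidia_smi_found` does not prove the GPU is missing, only that the tool was not found on the PATH.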
2 Install
bash
pip install vllm
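
To confirm the install landed in the environment you expect, one option is to ask Python's packaging metadata for the installed version. This is a minimal sketch using only the standard library; `installed_version` is a hypothetical helper name, not part of vLLM.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("vllm")
print(f"vllm {version}" if version else "vllm is not installed in this environment")
```

If this reports that vllm is missing right after a successful `pip install`, the likely cause is that `pip` and `python` point at different environments.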
3 Serve a Model (OpenAI-compatible)
bash
vllm serve meta-llama/Llama-3.1-8B-Instruct

The server listens on port 8000 and speaks the OpenAI API format. Test it:

bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role":"user","content":"Halo!"}]
  }'
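
The same request can be sent from Python. The sketch below mirrors the curl command using only the standard library, so it needs no extra packages. It assumes the server from step 3 is running on localhost:8000 and that the response follows the OpenAI chat completion shape, with the reply under `choices[0].message.content`. The helper names `build_chat_request` and `chat` are illustrative.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # port 8000, as used in the curl test
MODEL = "meta-llama/Llama-3.1-8B-Instruct"


def build_chat_request(prompt: str, model: str = MODEL) -> urllib.request.Request:
    """Build the same POST request as the curl command above."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, timeout: float = 60.0) -> str:
    """Send one user message and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    # Assumed OpenAI-style response: the reply sits in choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

With the server up, `print(chat("Halo!"))` should print the model's reply. Nothing is sent at import time, so the module can be loaded safely even when the server is down.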