Name: vLLM
Author: vLLM Project

Key features

Optimized performance

vLLM delivers state-of-the-art high-throughput serving.

OpenAI-compatible server

Drop-in replacement for OpenAI-style clients and SDKs.

Runtime parameterization

Expose key engine and server args to tune latency, throughput, and memory.

Flexible model support

Deploy popular Hugging Face models for chat, completions, and more.

Pricing

Additional Nebius infrastructure costs may apply. Use the Nebius Pricing Page to estimate your infrastructure costs.

Self-managed

vLLM on VM

Root access & custom setup. Maximum performance tuning. Direct hardware control.

Free

Charged for resources

Setup time2-5 minutes

ScalingManual

MaintenanceSelf-managed

Deploy

Self-managed

vLLM on Kubernetes

Run on your own Kubernetes for horizontal scaling and upgrades as you grow.

Free

Charged for resources

Setup time20+ minutes

ScalingAuto

MaintenanceSelf-managed (cluster)

Deploy

Security & compliance

Run vLLM on infrastructure built for AI workloads

Reliable AI infrastructure backed by top-tier NVIDIA GPUs, purpose-built for demanding inference workloads. Multiple deployment methods — virtual machines for full hardware control, Kubernetes for scalable cluster deployments, and managed serverless applications for teams that want inference running without infrastructure overhead.

Learn about Nebius AI Cloud

Security & compliance, out of the box

Nebius meets a broad set of security and compliance standards. Fine-grained IAM controls, audit logs, and encrypted storage are available out of the box — so teams can meet security requirements without additional tooling.

Explore the Trust center