Nebius Token Factory

Inference at enterprise scale, from open models to governed production.

Lightning-fast performance. Effortless optimization. Enterprise-grade security.

Run open-source AI at production speed

Deploy models like Llama, Qwen, DeepSeek and GPT OSS on dedicated endpoints with sub-second latency targets and 99.9% uptime.

Autoscaling, speculative decoding and multi-region routing keep latency predictable at any scale.

Scalability without constraints

Run large open-source models on dedicated Nebius endpoints for consistent, sub-second performance. Seamlessly scale from prototype to full production and handle hundreds of millions of tokens per minute with autoscaling and 99.9% uptime.

Optimized pricing for inference

Experience transparent, predictable $/token pricing across both shared and dedicated tiers. Cut cost and latency further with optimized serving pipelines and upcoming distillation-based reductions, independently benchmarked for accuracy.

State-of-the-art multimodal models

Choose from 60+ open-source models, including DeepSeek, GPT OSS, Llama, Qwen, Mistral and more. Serve text, code and image models through one API, and combine modalities effortlessly in production.
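
For example, a chat model and an image model can be served through the same client. Here is a minimal sketch, assuming the image models accept the OpenAI-compatible images API (the client is constructed as in the code sample further down this page; the prompt and filename are illustrative):

import base64

# Text-to-image with one of the models listed below; b64_json returns
# the image inline rather than as a URL.
image = client.images.generate(
    model="black-forest-labs/flux-schnell",
    prompt="A data center at sunrise, isometric illustration",
    response_format="b64_json",
)
with open("sunrise.png", "wb") as f:
    f.write(base64.b64decode(image.data[0].b64_json))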

AI agent essentials

Build and deploy intelligent agents faster with native function calling, structured JSON outputs and built-in safety guardrails for reliable real-world interaction.
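
As a sketch of native function calling, assuming the endpoint honors the standard OpenAI tools parameter (get_weather is a hypothetical tool defined here for illustration; the client is set up as in the code sample below):

# Declare a callable tool; the model responds with structured,
# JSON-encoded arguments for it instead of free text.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "What's the weather in Amsterdam?"}],
    tools=tools,
)
print(response.choices[0].message.tool_calls[0].function.arguments)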

Custom and fine-tuned models

Adapt models to your data using LoRA or full fine-tuning workflows. Deploy your own checkpoints directly on Token Factory endpoints with guaranteed performance and transparent per-token pricing.
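
If the fine-tuning workflow is exposed through the OpenAI-compatible files and fine-tuning endpoints, a job might be launched roughly as follows (a sketch under that assumption; train.jsonl is a placeholder dataset in chat-JSONL format):

# Upload training data, then start a fine-tuning job against a base model.
train_file = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    training_file=train_file.id,
)
print(job.id, job.status)  # poll until the job completes, then deploy the checkpoint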

RAG development tools

Create retrieval-augmented systems using high-performance embedding models and PGVector-powered storage. Keep everything—indexing, context retrieval and inference—within one governed, production-ready platform.
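
A minimal sketch of the retrieval step, assuming the embedding models are served through the OpenAI-compatible embeddings API and document chunks live in a Postgres table with a pgvector column (the docs table and its schema are illustrative):

import psycopg  # PostgreSQL driver; the database needs the pgvector extension

# Embed the user query with one of the listed embedding models.
query_vec = client.embeddings.create(
    model="Qwen/Qwen3-Embedding-8B",
    input="How do I rotate my API keys?",
).data[0].embedding

# Rank stored chunks by cosine distance (<=>) and keep the top five.
literal = "[" + ",".join(map(str, query_vec)) + "]"
with psycopg.connect("dbname=rag") as conn:
    rows = conn.execute(
        "SELECT chunk FROM docs ORDER BY embedding <=> %s::vector LIMIT 5",
        (literal,),
    ).fetchall()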

Top open-source models available

Text and multimodal

DeepSeek R1 and V3

DeepSeek-R1-Distill-Llama-70B

Llama-3.3-70B-Instruct

Mistral-Nemo-Instruct-2407

Qwen2.5-72B

QwQ-32B

Google gemma-2-27b-it

GPT OSS 120B and 20B

Embeddings and guardrails

BAAI/bge-en-icl

BAAI/bge-multilingual-gemma2

intfloat/e5-mistral-7b-instruct

meta-llama/Llama-Guard-3-8B

Qwen/Qwen3-Embedding-8B

Text to image

black-forest-labs/flux-schnell

black-forest-labs/flux-dev

stability-ai/sdxl

Join our community

Follow the Nebius Token Factory X account for instant updates, LinkedIn for more detailed news, and Discord for technical inquiries and meaningful community discussions.

Benchmark-backed performance and cost efficiency

Proven performance, verified benchmarks

Sub-second responses and stable latency, even at peak load. Top-tier performance on models like DeepSeek V3 0324, independently verified by Artificial Analysis.

Scale without limits

Handle 100M+ tokens per minute with consistent throughput and 99.9% uptime SLAs. Autoscaling and speculative decoding ensure reliability from prototype to global deployment.

Comprehensive model coverage

Access 60+ premium models spanning LLMs, vision, image generation and embeddings, expanding monthly.

Familiar API at your fingertips

import os

from openai import OpenAI

# The API key is read from the NEBIUS_API_KEY environment variable.
client = OpenAI(
    base_url="https://api.tokenfactory.nebius.com/v1/",
    api_key=os.environ["NEBIUS_API_KEY"],
)

completion = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "What is the answer to all questions?"}],
)
print(completion.choices[0].message.content)
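
Set NEBIUS_API_KEY in your environment before running. The same client serves every model in the catalog; switching models is just a matter of changing the model argument.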

Nebius Token Factory prices

Scale from shared access to dedicated endpoints with 99.9% SLA, transparent $/token and volume discounts for production.

Questions and answers about Nebius Token Factory

Is Nebius Token Factory suitable for enterprise-scale workloads?

Yes. Nebius Token Factory is built for large-scale, production-grade AI workloads.

Dedicated endpoints deliver sub-second inference, 99.9% uptime, and autoscaling throughput, ensuring consistent performance for workloads exceeding hundreds of millions of tokens per minute.

Scale seamlessly from experimentation to global deployment, with no rate throttles and no GPU management.