Nebius and Eigen AI partner to accelerate frontier open-source AI inference

Nebius and Eigen AI are partnering to bring faster, optimized open-source AI models to Token Factory, Nebius’s production-grade managed inference platform.

As part of the collaboration, Nebius and Eigen AI are co-developing optimized versions of leading open-source models, including DeepSeek, GLM, GPT-OSS, Kimi, Llama, MiniMax and Qwen, and integrating them into Token Factory. Eigen brings deep expertise in model optimization and serving systems, while Token Factory provides autoscaling inference and built-in fine-tuning tools.

Developers can access these models through an API on a per-token basis, or run them as managed solutions for production workloads.
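As a minimal sketch of what per-token access typically looks like, assuming Token Factory exposes an OpenAI-compatible endpoint (the base URL and model ID below are illustrative placeholders, not confirmed values; check the Token Factory documentation for the real ones):

```python
# Hypothetical per-token API call against an OpenAI-compatible endpoint.
# The base URL and model ID are illustrative placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.tokenfactory.nebius.example/v1/",  # placeholder URL
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="openai/gpt-oss-120b",  # illustrative model ID
    messages=[{"role": "user", "content": "Explain speculative decoding in two sentences."}],
    max_tokens=200,
)
print(response.choices[0].message.content)
```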

Running open models in production

More organizations are moving to open-source AI models, which cost less to run than proprietary APIs and allow teams to customize models for their own data, workflows and infrastructure.

At the same time, many of the newest open models — including Mixture-of-Experts (MoE) architectures, Linear Attention variants and reasoning models — are harder to run efficiently at scale. Getting strong performance requires optimized inference runtimes, smart GPU scheduling and infrastructure designed for large models.

Without a production platform, teams typically have to run these models themselves, building custom infrastructure around tools such as vLLM, Ray or Kubernetes, managing GPU clusters, tuning inference performance and maintaining scaling and reliability on their own. This adds significant engineering overhead and makes it difficult to move quickly from experimentation to production.

Token Factory is designed to close that gap. It provides a production platform for running, improving and operating open-source models.

Key capabilities include:

  • Autoscaling inference endpoints that adjust capacity automatically as traffic changes;

  • Dedicated model endpoints with guaranteed performance isolation and service levels;

  • Integrated post-training pipelines for LoRA fine-tuning and distillation (see the sketch below);

  • Draft model training for speculative decoding to improve inference efficiency;

  • Instant promotion of tuned models into production endpoints for fast integration;

  • Enterprise governance tools, including team workspaces, SSO and access controls.

Together, these capabilities allow AI developers to adapt open models to their own data and run them in production without managing infrastructure themselves.
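To make the LoRA capability concrete, here is a minimal PyTorch sketch of the idea behind LoRA fine-tuning: the pretrained weight stays frozen and only a small low-rank update is trained. This is a generic illustration of the technique, not Token Factory's actual pipeline:

```python
# Minimal LoRA layer: y = W x + (alpha/r) * B A x, with W frozen.
# Generic illustration of the technique, not Token Factory's implementation.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, rank=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)  # frozen pretrained weight
        self.base.bias.requires_grad_(False)
        self.lora_a = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(out_features, rank))  # starts at zero
        self.scale = alpha / rank

    def forward(self, x):
        # Only lora_a and lora_b receive gradients during fine-tuning.
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

layer = LoRALinear(512, 512)
print(layer(torch.randn(4, 512)).shape)  # torch.Size([4, 512])
```

Because the trainable update is only rank × (in + out) parameters per layer, tuned adapters stay small, which is what makes promoting them quickly into production endpoints practical.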

Optimized open models and serving from Eigen AI

Eigen AI specializes in making frontier open-source models fast and efficient in production through deep full-stack optimization. At the model layer, Eigen improves efficiency with advanced post-training quantization, quantization-aware training, KV-cache optimization and multi-granular sparsity techniques that reduce compute and memory costs while maintaining strong model quality.
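As a rough illustration of what post-training quantization does (a generic toy example, not Eigen's method, which also involves calibration data and finer-grained scales):

```python
# Toy symmetric int8 post-training quantization of a weight tensor.
# Real pipelines calibrate on data and use per-channel or per-group scales;
# this only shows the core round-trip and the 4x memory saving over fp32.
import numpy as np

def quantize_int8(w: np.ndarray):
    scale = np.abs(w).max() / 127.0                      # one scale per tensor
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32)
q, s = quantize_int8(w)
err = np.abs(dequantize(q, s) - w).mean()
print(f"storage: {q.nbytes / w.nbytes:.0%} of fp32, mean abs error: {err:.5f}")
```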

At the systems layer, Eigen improves how these models run in production. Their work includes speculative decoding, custom CUDA and Triton kernels, parallel execution, continuous batching and graph-level runtime optimizations.
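Speculative decoding is worth unpacking, since it appears in both the Token Factory feature list and Eigen's systems work. The sketch below shows a simplified greedy-verification variant: a cheap draft model proposes several tokens, the large target model checks them, and the longest agreeing prefix is accepted, so one expensive pass can yield multiple tokens. The model functions here are toy stand-ins; production systems use a probabilistic accept/reject rule and a single batched verification pass:

```python
# Simplified speculative decoding with greedy verification (toy models).
from typing import Callable, List

def speculative_step(prefix: List[int],
                     draft_next: Callable[[List[int]], int],
                     target_next: Callable[[List[int]], int],
                     k: int = 4) -> List[int]:
    # 1. The small draft model cheaply proposes k tokens.
    ctx = list(prefix)
    proposed = []
    for _ in range(k):
        t = draft_next(ctx)
        proposed.append(t)
        ctx.append(t)

    # 2. The target model verifies the proposals; in a real system this is
    #    one batched forward pass rather than a Python loop.
    ctx = list(prefix)
    accepted = []
    for t in proposed:
        expected = target_next(ctx)
        if expected != t:
            accepted.append(expected)  # keep the target's token, stop here
            break
        accepted.append(t)
        ctx.append(t)
    return accepted

# Toy models that mostly agree, so most steps accept several tokens at once.
target = lambda ctx: (len(ctx) * 7) % 100
draft = lambda ctx: (len(ctx) * 7) % 100 if len(ctx) % 5 else 0
print(speculative_step([1, 2, 3], draft, target))  # [21, 28, 35]
```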

In practice, these optimizations help models generate tokens faster, use GPUs more efficiently and reduce the cost of serving large models at scale. This is especially important for modern Mixture-of-Experts and reasoning models, where routing, scheduling and memory efficiency often determine whether a model can run reliably and economically in production.
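For a sense of why routing matters so much in MoE serving, here is a minimal NumPy sketch of top-k gating, the mechanism MoE models use to activate only a few experts per token (the shapes and gate matrix are toy values):

```python
# Toy top-k expert routing: score experts per token, keep only the best k.
# Because each token touches just k of n experts, GPU scheduling and memory
# placement of the experts dominate serving efficiency.
import numpy as np

def top_k_route(x: np.ndarray, gate_w: np.ndarray, k: int = 2):
    logits = x @ gate_w                        # [tokens, n_experts]
    top = np.argsort(logits, axis=-1)[:, -k:]  # k chosen expert IDs per token
    chosen = np.take_along_axis(logits, top, axis=-1)
    weights = np.exp(chosen) / np.exp(chosen).sum(axis=-1, keepdims=True)
    return top, weights                        # experts to run, mixing weights

tokens = np.random.randn(8, 64)   # 8 tokens, hidden size 64
gate_w = np.random.randn(64, 16)  # router for 16 experts
experts, weights = top_k_route(tokens, gate_w)
print(experts[0], weights[0])     # e.g. two expert IDs and their weights
```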

By bringing these optimized implementations into Nebius Token Factory, Nebius and Eigen AI are making it easier for developers to use frontier open models with high speed, reliability and production readiness, without having to build and maintain the optimization stack themselves.

Eigen’s optimized models have demonstrated leading performance in benchmarks tracked by Artificial Analysis.

For example, Eigen currently holds the #1 output speed for multiple widely used models, reaching up to 911 output tokens per second.

Eigen Achieves #1 Output Speed Across Leading Open Models

(Artificial Analysis benchmarks, as of March 13, 2026)

| Model | Eigen Output Speed (tokens/sec) | Workload |
| --- | --- | --- |
| GLM-5 (Non-reasoning) | 204 | General |
| GPT-OSS-120B (high) | 911 | General |
| GPT-OSS-120B (low) | 911 | General |
| Qwen3 Next 80B A3B Reasoning | 322 | Reasoning |
| Qwen3 235B A22B 2507 (Reasoning) | 179 | Reasoning |
| Qwen3-VL 235B A22B | 81 | Vision-Language |
| Qwen3-VL 30B A3B (Non-reasoning) | 252 | Vision-Language |
| Qwen3-VL 30B A3B (Reasoning) | 255 | Vision-Language Reasoning |
| Qwen3 Coder 480B | 244 (10k general) / 374 (1k coding) | General / Coding |
| Qwen3.5 397B A17B (Non-reasoning) | 145 | General |
| Qwen3.5 397B A17B (Reasoning) | 144 | Reasoning |
| Qwen3 8B (Non-reasoning) | 358 | General |
| Qwen3 8B (Reasoning) | 349 | Reasoning |
| Qwen3 30B A3B (Non-reasoning) | 280 | General |
| Qwen3 30B A3B (Reasoning) | 248 | Reasoning |
| DeepSeek V3.1 Terminus | 141 | General |
| DeepSeek V3.1 Terminus (Reasoning) | 141 | Reasoning |
| DeepSeek V3.1 (Reasoning) | 274 | Reasoning |
| DeepSeek V3.2 | 82 | Reasoning |
| Llama-3.3-70B | 275 | General |
| Llama-4 Scout | 506 (1k coding) | Coding |
| Llama-4 Maverick | 387 | General |
| Llama-3.1-8B | 764 (1k coding) | General |

Visualization of the 23 models for which Eigen AI holds the #1 output speed on Artificial Analysis (Source):

By combining Nebius’s infrastructure with Eigen AI’s model and serving optimizations, popular models such as GPT-OSS-120B and Qwen3 Coder 480B have consistently ranked among the top three fastest implementations in Artificial Analysis benchmark tracking, as shown below.


[Chart: GPT-OSS-120B output speed ranking, Artificial Analysis]

[Chart: Qwen3 Coder 480B output speed ranking, Artificial Analysis]

These optimized models are available through Token Factory, giving developers access to high-performance implementations directly through the platform.

What the partnership enables

For teams building applications on frontier open models, this collaboration shortens the path from model release to production use.

Developers can access optimized models directly through Token Factory without needing to build or maintain their own inference optimization infrastructure.

Roman Chernin, co-founder and CBO of Nebius, said:

“Open-source models are improving incredibly quickly, but running them efficiently at scale remains challenging. By co-developing optimized versions of frontier models with Eigen AI on Token Factory, we’re making it easier for developers to access high-performance open models in production.”

Ryan Hanrui Wang, co-founder and CEO of Eigen AI, added:

“Many frontier open models rely on Mixture-of-Experts architectures, where efficient expert routing, GPU scheduling, speculative decoding, quantization and sparsity have a significant impact on performance. Working closely with Nebius allows us to bring these optimized models to Token Factory so teams can benefit from that performance without building their own inference infrastructure.”

Get started

Developers can access optimized open-source models co-developed by Nebius and Eigen AI directly on Token Factory.

Models are available via API for self-service access and can also be delivered as managed solutions for production workloads.

Explore Nebius Token Factory

Explore Nebius AI Cloud
