Nebius and Eigen AI partner to accelerate frontier open-source AI inference

Nebius and Eigen AI are partnering to bring faster, optimized open-source AI models to Token Factory, Nebius’s production-grade managed inference platform.

As part of the collaboration, Nebius and Eigen AI are co-developing optimized versions of leading open-source models, including DeepSeek, GLM, GPT-OSS, Kimi, Llama, MiniMax and Qwen, and integrating them into Token Factory. Eigen brings deep expertise in model optimization and serving systems, while Token Factory provides autoscaling inference and built-in fine-tuning tools.

Developers can access these models through an API on a per-token basis, or run them as managed solutions for production workloads.
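As a minimal sketch of what per-token access typically looks like, assuming Token Factory exposes an OpenAI-compatible endpoint (the base URL and model ID below are illustrative placeholders, not confirmed values; check the Token Factory documentation for the real ones):

```python
# Hypothetical per-token API call against an OpenAI-compatible endpoint.
# The base URL and model ID are illustrative placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.tokenfactory.nebius.example/v1/",  # placeholder URL
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="openai/gpt-oss-120b",  # illustrative model ID
    messages=[{"role": "user", "content": "Explain speculative decoding in two sentences."}],
    max_tokens=200,
)
print(response.choices[0].message.content)
```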

Running open models in production

More organizations are moving to open-source AI models, which cost less to run than proprietary APIs and allow teams to customize models for their own data, workflows and infrastructure.

At the same time, many of the newest open models — including Mixture-of-Experts (MoE) architectures, Linear Attention variants and reasoning models — are harder to run efficiently at scale. Getting strong performance requires optimized inference runtimes, smart GPU scheduling and infrastructure designed for large models.

Without a production platform, teams typically have to run these models themselves, building custom infrastructure around tools such as vLLM, Ray or Kubernetes, managing GPU clusters, tuning inference performance and maintaining scaling and reliability on their own. This adds significant engineering overhead and makes it difficult to move quickly from experimentation to production.

Token Factory is designed to close that gap. It provides a production platform for running, improving and operating open-source models.

Key capabilities include:

  • Autoscaling inference endpoints that adjust capacity automatically as traffic changes;

  • Dedicated model endpoints with guaranteed performance isolation and service levels;

  • Integrated post-training pipelines for LoRA fine-tuning and distillation (see the sketch below);

  • Draft model training for speculative decoding to improve inference efficiency;

  • Instant promotion of tuned models into production endpoints for fast integration;

  • Enterprise governance tools, including team workspaces, SSO and access controls.

Together, these capabilities allow AI developers to adapt open models to their own data and run them in production without managing infrastructure themselves.
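To make the LoRA capability concrete, here is a minimal PyTorch sketch of the idea behind LoRA fine-tuning: the pretrained weight stays frozen and only a small low-rank update is trained. This is a generic illustration of the technique, not Token Factory's actual pipeline:

```python
# Minimal LoRA layer: y = W x + (alpha/r) * B A x, with W frozen.
# Generic illustration of the technique, not Token Factory's implementation.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, rank=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)  # frozen pretrained weight
        self.base.bias.requires_grad_(False)
        self.lora_a = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(out_features, rank))  # starts at zero
        self.scale = alpha / rank

    def forward(self, x):
        # Only lora_a and lora_b receive gradients during fine-tuning.
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

layer = LoRALinear(512, 512)
print(layer(torch.randn(4, 512)).shape)  # torch.Size([4, 512])
```

Because the trainable update is only rank × (in + out) parameters per layer, tuned adapters stay small, which is what makes promoting them quickly into production endpoints practical.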

Optimized open models and serving from Eigen AI

Eigen AI specializes in making frontier open-source models fast and efficient in production through deep full-stack optimization. At the model layer, Eigen improves efficiency with advanced post-training quantization, quantization-aware training, KV-cache optimization and multi-granular sparsity techniques that reduce compute and memory costs while maintaining strong model quality.
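As a rough illustration of what post-training quantization does (a generic toy example, not Eigen's method, which also involves calibration data and finer-grained scales):

```python
# Toy symmetric int8 post-training quantization of a weight tensor.
# Real pipelines calibrate on data and use per-channel or per-group scales;
# this only shows the core round-trip and the 4x memory saving over fp32.
import numpy as np

def quantize_int8(w: np.ndarray):
    scale = np.abs(w).max() / 127.0                      # one scale per tensor
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32)
q, s = quantize_int8(w)
err = np.abs(dequantize(q, s) - w).mean()
print(f"storage: {q.nbytes / w.nbytes:.0%} of fp32, mean abs error: {err:.5f}")
```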

At the systems layer, Eigen improves how these models run in production. Their work includes speculative decoding, custom CUDA and Triton kernels, parallel execution, continuous batching and graph-level runtime optimizations.
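Speculative decoding is worth unpacking, since it appears in both the Token Factory feature list and Eigen's systems work. The sketch below shows a simplified greedy-verification variant: a cheap draft model proposes several tokens, the large target model checks them, and the longest agreeing prefix is accepted, so one expensive pass can yield multiple tokens. The model functions here are toy stand-ins; production systems use a probabilistic accept/reject rule and a single batched verification pass:

```python
# Simplified speculative decoding with greedy verification (toy models).
from typing import Callable, List

def speculative_step(prefix: List[int],
                     draft_next: Callable[[List[int]], int],
                     target_next: Callable[[List[int]], int],
                     k: int = 4) -> List[int]:
    # 1. The small draft model cheaply proposes k tokens.
    ctx = list(prefix)
    proposed = []
    for _ in range(k):
        t = draft_next(ctx)
        proposed.append(t)
        ctx.append(t)

    # 2. The target model verifies the proposals; in a real system this is
    #    one batched forward pass rather than a Python loop.
    ctx = list(prefix)
    accepted = []
    for t in proposed:
        expected = target_next(ctx)
        if expected != t:
            accepted.append(expected)  # keep the target's token, stop here
            break
        accepted.append(t)
        ctx.append(t)
    return accepted

# Toy models that mostly agree, so most steps accept several tokens at once.
target = lambda ctx: (len(ctx) * 7) % 100
draft = lambda ctx: (len(ctx) * 7) % 100 if len(ctx) % 5 else 0
print(speculative_step([1, 2, 3], draft, target))  # [21, 28, 35]
```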

In practice, these optimizations help models generate tokens faster, use GPUs more efficiently and reduce the cost of serving large models at scale. This is especially important for modern Mixture-of-Experts and reasoning models, where routing, scheduling and memory efficiency often determine whether a model can run reliably and economically in production.
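For a sense of why routing matters so much in MoE serving, here is a minimal NumPy sketch of top-k gating, the mechanism MoE models use to activate only a few experts per token (the shapes and gate matrix are toy values):

```python
# Toy top-k expert routing: score experts per token, keep only the best k.
# Because each token touches just k of n experts, GPU scheduling and memory
# placement of the experts dominate serving efficiency.
import numpy as np

def top_k_route(x: np.ndarray, gate_w: np.ndarray, k: int = 2):
    logits = x @ gate_w                        # [tokens, n_experts]
    top = np.argsort(logits, axis=-1)[:, -k:]  # k chosen expert IDs per token
    chosen = np.take_along_axis(logits, top, axis=-1)
    weights = np.exp(chosen) / np.exp(chosen).sum(axis=-1, keepdims=True)
    return top, weights                        # experts to run, mixing weights

tokens = np.random.randn(8, 64)   # 8 tokens, hidden size 64
gate_w = np.random.randn(64, 16)  # router for 16 experts
experts, weights = top_k_route(tokens, gate_w)
print(experts[0], weights[0])     # e.g. two expert IDs and their weights
```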

By bringing these optimized implementations into Nebius Token Factory, Nebius and Eigen AI are making it easier for developers to use frontier open models with high speed, reliability and production readiness, without having to build and maintain the optimization stack themselves.

Eigen’s optimized models have demonstrated leading performance in benchmarks tracked by Artificial Analysis.

For example, Eigen currently holds the #1 output speed for multiple widely used models, reaching up to 911 output tokens per second.

Eigen Achieves #1 Output Speed Across Leading Open Models

(Artificial Analysis benchmarks, as of March 13, 2026)

| Model | Eigen Output Speed (tokens/sec) | Workload |
| --- | --- | --- |
| GLM-5 (Non-reasoning) | 204 | General |
| GPT-OSS-120B (high) | 911 | General |
| GPT-OSS-120B (low) | 911 | General |
| Qwen3 Next 80B A3B Reasoning | 322 | Reasoning |
| Qwen3 235B A22B 2507 (Reasoning) | 179 | Reasoning |
| Qwen3-VL 235B A22B | 81 | Vision-Language |
| Qwen3-VL 30B A3B (Non-reasoning) | 252 | Vision-Language |
| Qwen3-VL 30B A3B (Reasoning) | 255 | Vision-Language Reasoning |
| Qwen3 Coder 480B | 244 (10k general) / 374 (1k coding) | General / Coding |
| Qwen3.5 397B A17B (Non-reasoning) | 145 | General |
| Qwen3.5 397B A17B (Reasoning) | 144 | Reasoning |
| Qwen3 8B (Non-reasoning) | 358 | General |
| Qwen3 8B (Reasoning) | 349 | Reasoning |
| Qwen3 30B A3B (Non-reasoning) | 280 | General |
| Qwen3 30B A3B (Reasoning) | 248 | Reasoning |
| DeepSeek V3.1 Terminus | 141 | General |
| DeepSeek V3.1 Terminus (Reasoning) | 141 | Reasoning |
| DeepSeek V3.1 (Reasoning) | 274 | Reasoning |
| DeepSeek V3.2 | 82 | Reasoning |
| Llama-3.3-70B | 275 | General |
| Llama-4 Scout | 506 (1k coding) | Coding |
| Llama-4 Maverick | 387 | General |
| Llama-3.1-8B | 764 (1k coding) | General |

Visualization of the 23 models for which Eigen AI holds the #1 output speed on Artificial Analysis (Source):

By combining Nebius’s infrastructure with Eigen AI’s model and serving optimizations, popular models such as GPT-OSS-120B and Qwen3 Coder 480B have consistently ranked among the top three fastest implementations in Artificial Analysis benchmark tracking, as shown below.


[Chart: GPT-OSS-120B output speed ranking, Artificial Analysis]

[Chart: Qwen3 Coder 480B output speed ranking, Artificial Analysis]

These optimized models are available through Token Factory, giving developers access to high-performance implementations directly through the platform.

What the partnership enables

For teams building applications on frontier open models, this collaboration shortens the path from model release to production use.

Developers can access optimized models directly through Token Factory without needing to build or maintain their own inference optimization infrastructure.

Roman Chernin, co-founder and CBO of Nebius, said:

“Open-source models are improving incredibly quickly, but running them efficiently at scale remains challenging. By co-developing optimized versions of frontier models with Eigen AI on Token Factory, we’re making it easier for developers to access high-performance open models in production.”

Ryan Hanrui Wang, co-founder and CEO of Eigen AI, added:

“Many frontier open models rely on Mixture-of-Experts architectures, where efficient expert routing, GPU scheduling, speculative decoding, quantization and sparsity have a significant impact on performance. Working closely with Nebius allows us to bring these optimized models to Token Factory so teams can benefit from that performance without building their own inference infrastructure.”

Get started

Developers can access optimized open-source models co-developed by Nebius and Eigen AI directly on Token Factory.

Models are available via API for self-service access and can also be delivered as managed solutions for production workloads.

Explore Nebius Token Factory

Explore Nebius AI Cloud
