Nebius partners with Positronic on Physical AI Leaderboard (PhAIL)

Physical AI is moving fast, but the field has lacked a rigorous, real-world benchmark to measure progress. Demo videos and lab success rates tell only part of the story. The operators who decide whether to deploy robotics at scale need harder numbers: throughput, reliability and reproducibility on genuine commercial tasks.

Today, Nebius is excited to announce our role as a founding consortium partner of the Physical AI Leaderboard (PhAIL) by Positronic, a platform for training and deploying any robot AI model on any robot. PhAIL uses real hardware to evaluate vision-language-action (VLA) models on bin-to-bin order picking—a high-volume, commercially representative task.

Unlike existing benchmarks that report abstract success rates, PhAIL reports the numbers that matter on an actual shop floor: Units Per Hour (UPH) and Mean Time Between Failures or Assists (MTBF/A). Every run is recorded and published with synchronized video, robot telemetry and scoring logs, so any result can be independently audited. Positronic developed the evaluation methodology and operates the benchmark rigs. The inaugural results, including comparisons to human and teleoperated baselines, are live now at phail.ai.
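To make the two metrics concrete, here is a minimal Python sketch of how UPH and MTBF/A could be computed from a single run summary. The `RunLog` fields and both formulas are illustrative assumptions, not PhAIL's actual log schema or scoring rules; the authoritative definitions are in the PhAIL white paper.

```python
from dataclasses import dataclass


@dataclass
class RunLog:
    """Hypothetical summary of one evaluation run (not PhAIL's real log format)."""
    duration_s: float          # operating time of the run, in seconds
    units_picked: int          # items successfully moved from bin to bin
    failures_or_assists: int   # times the robot failed or needed human help


def units_per_hour(run: RunLog) -> float:
    """Throughput: successful picks scaled to one hour of operation."""
    if run.duration_s <= 0:
        raise ValueError("duration_s must be positive")
    return run.units_picked * 3600.0 / run.duration_s


def mtbf_a_seconds(run: RunLog) -> float:
    """Mean operating time between failures or assists, in seconds.

    Returns infinity when the run had no failures or assists.
    """
    if run.failures_or_assists == 0:
        return float("inf")
    return run.duration_s / run.failures_or_assists


# Made-up numbers for illustration: a 30-minute run, 60 picks, 4 interventions.
run = RunLog(duration_s=1800.0, units_picked=60, failures_or_assists=4)
print(units_per_hour(run))   # 120.0
print(mtbf_a_seconds(run))   # 450.0
```

Under these assumed definitions, the example run works out to 120 units per hour with one failure or assist every 7.5 minutes on average.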

As part of the consortium, Nebius will provide its vertically integrated AI infrastructure for fine-tuning and evaluation of robot models. Nebius AI Cloud is well suited for physical AI workloads and includes a managed service for data and compute workflows in robotics. Nebius has integrated NVIDIA OSMO, an open workflow orchestration framework, to deliver an easy-to-consume managed service that provides unified, agentic orchestration across the entire physical AI development pipeline. Nebius also offers scalable, high-performance storage; NVIDIA Blackwell and NVIDIA Hopper clusters for AI training and inference; and simulation instances with NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs. Teams that submit their models for evaluation can apply for Nebius compute credits to support their fine-tuning work.

PhAIL is designed to be open and reproducible from end to end. Positronic publishes a free fine-tuning dataset collected through teleoperated demonstrations, along with open-source training scripts that any team can use to prepare their model for evaluation. The benchmark hardware is a Franka Research 3 arm with a Robotiq 2F-85 gripper in the DROID configuration, which is widely available and reproducible. Evaluation is ‘blind’: model checkpoints are rotated randomly so the operator does not know which model is running. Full methodology is documented in the PhAIL white paper.
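The blind rotation described above can be pictured with a short Python sketch. This is one assumed way to build such a schedule, not Positronic's actual procedure: the function name `blind_schedule`, the `model-NN` labels and the checkpoint names are all hypothetical.

```python
import random


def blind_schedule(checkpoints, trials_per_model, seed=None):
    """Build a randomized trial order that hides model identities.

    Returns (schedule, key): `schedule` is the shuffled list of opaque
    labels the operator sees; `key` maps each label to its checkpoint
    and would be kept away from the operator until scoring is done.
    """
    rng = random.Random(seed)
    names = list(checkpoints)
    rng.shuffle(names)  # label numbers carry no hint of submission order
    key = {f"model-{i:02d}": name for i, name in enumerate(names)}
    schedule = [label for label in key for _ in range(trials_per_model)]
    rng.shuffle(schedule)  # interleave trials across models
    return schedule, key


# Hypothetical checkpoint names, three trials each.
schedule, key = blind_schedule(["ckpt_a", "ckpt_b", "ckpt_c"], trials_per_model=3, seed=0)
print(len(schedule))  # 9
```

The operator would work only from `schedule`, while `key` is used afterwards to attribute each scored trial to its model.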

If you’re building physical AI models, the path to participation is open: download the dataset, fine-tune and submit your checkpoint for evaluation on Positronic’s rigs. The consortium launching PhAIL already includes Toloka, the human data infrastructure for frontier AI. If you represent a hardware vendor, simulation platform, academic lab or industry operator and want to help shape what PhAIL measures next, the consortium is actively welcoming new members.

Read Positronic’s full blog post for more details, explore the live leaderboard and get in touch or email us at hi@phail.ai if you’d like to get involved.


Authors:
Evan Helda, Head of GTM Physical AI
Akshai Parthasarathy, Solutions Marketing Director