SWE-rebench: A continuously updated benchmark for SWE LLMs
Our AI R&D team presents SWE-rebench, a new benchmark for evaluating agentic LLMs on a continuously updated and decontaminated set of real-world software engineering tasks mined from GitHub repositories.
In today’s landscape of rapid LLM progress, static benchmarks quickly lose their relevance. As models are likely to encounter benchmark data during training, it’s becoming increasingly difficult to distinguish true generalization from memorization. Additionally, agent performance is heavily influenced not just by the model but also by the surrounding scaffolding, such as prompts, additional test-time computation, and available tools, making fair comparisons across systems challenging.
Today, we present a new LLM benchmark, SWE-rebench, which addresses these issues in the SWE domain through:
- a standardized evaluation pipeline with fixed scaffolding;
- frequent dataset updates sourced from live open-source repositories;
- explicit contamination tracking tied to model release dates (see the sketch below).
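To make the last point concrete, here is a minimal Python sketch of the idea behind contamination tracking tied to model release dates. The field names (`merged_at`, `released_at`) and the example records are purely illustrative assumptions, not the actual SWE-rebench schema or data.

```python
from datetime import date

# Hypothetical task and model records; field names are illustrative,
# not the actual SWE-rebench schema.
tasks = [
    {"id": "repo-a#101", "merged_at": date(2025, 3, 14)},
    {"id": "repo-b#57",  "merged_at": date(2024, 11, 2)},
]
model = {"name": "some-llm", "released_at": date(2025, 1, 15)}

# A task is treated as decontaminated for a given model only if it was
# merged after that model's public release, so it could not have
# appeared in the model's training data.
clean_tasks = [t for t in tasks if t["merged_at"] > model["released_at"]]

for t in clean_tasks:
    print(f'{t["id"]} is safe for evaluating {model["name"]}')
```

In this toy example only `repo-a#101` would count toward the decontaminated score for the model released on 2025-01-15.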
Figure 1. SWE-rebench leaderboard as of today
Our goal with this benchmark is to make evaluation of software engineering LLMs more transparent, reproducible and focused on core model capabilities.
You can explore the leaderboard and the methodology behind the benchmark at swe-rebench.com.