SWE-rebench: A continuously updated benchmark for SWE LLMs

Our AI R&D team presents SWE-rebench, a new benchmark for evaluating agentic LLMs on a continuously updated and decontaminated set of real-world software engineering tasks mined from live GitHub repositories.

In today’s landscape of rapid LLM progress, static benchmarks quickly lose their relevance. Because models are likely to encounter benchmark data during training, it is becoming increasingly difficult to distinguish true generalization from memorization. Agent performance is also heavily influenced not just by the model but by the surrounding scaffolding, such as prompts, additional test-time compute and available tools, which makes fair comparisons across systems challenging.

Today, we present a new LLM benchmark, SWE-rebench, which addresses these issues in the SWE domain through:

  • a standardized evaluation pipeline with fixed scaffolding;

  • frequent dataset updates sourced from live open-source repositories;

  • explicit contamination tracking tied to model release dates (sketched below).
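
To make the contamination-tracking idea concrete, here is a minimal Python sketch, not SWE-rebench’s actual code: the task fields, model names and dates are hypothetical. It shows the general principle of treating a task as contamination-free for a model only if the task appeared after that model’s release date.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Task:
    repo: str
    pr_merged_at: date  # hypothetical field: when the underlying PR landed upstream

# Hypothetical release dates; the real leaderboard tracks these per model.
MODEL_RELEASE_DATES = {
    "example-model-v1": date(2024, 6, 1),
}

def is_contamination_free(task: Task, model_name: str) -> bool:
    """A task counts as decontaminated for a model if it was created after
    that model's release date, so it cannot have leaked into training data."""
    return task.pr_merged_at > MODEL_RELEASE_DATES[model_name]

# Usage: keep only tasks the model could not have seen during training.
tasks = [Task("octocat/Hello-World", date(2024, 7, 15))]
fresh = [t for t in tasks if is_contamination_free(t, "example-model-v1")]
print(len(fresh))  # -> 1
```

In practice the release date is a conservative proxy for the training cutoff, which is why the leaderboard ties contamination labels to dates reported for each model.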


Figure 1. SWE-rebench leaderboard as of today

Our goal with this benchmark is to make evaluation of software engineering LLMs more transparent, reproducible and focused on core model capabilities.

You can explore the benchmark leaderboard and methodology behind it at swe-rebench.com.


Author: Alexander Golubev, Lead ML Engineer

