SWE-rebench: A continuously updated benchmark for SWE LLMs
Our AI R&D team presents SWE-rebench, a new benchmark for evaluating agentic LLMs on a continuously updated and decontaminated set of real-world software engineering tasks mined from GitHub repositories.
In today’s landscape of rapid LLM progress, static benchmarks quickly lose their relevance. As models are likely to encounter benchmark data during training, it’s becoming increasingly difficult to distinguish true generalization from memorization. Additionally, agent performance is heavily influenced not just by the model but also by the surrounding scaffolding, such as prompts, additional test-time computation, and available tools, making fair comparisons across systems challenging.
Today, we present a new LLM benchmark, SWE-rebench, which addresses these issues in the SWE domain through:
- a standardized evaluation pipeline with fixed scaffolding;
- frequent dataset updates sourced from live open-source repositories;
- explicit contamination tracking tied to model release dates (see the sketch below).
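To make the last point concrete, here is a minimal Python sketch of the idea behind contamination tracking tied to model release dates. The field names (`merged_at`, `released_at`) and the example records are purely illustrative assumptions, not the actual SWE-rebench schema or data.

```python
from datetime import date

# Hypothetical task and model records; field names are illustrative,
# not the actual SWE-rebench schema.
tasks = [
    {"id": "repo-a#101", "merged_at": date(2025, 3, 14)},
    {"id": "repo-b#57",  "merged_at": date(2024, 11, 2)},
]
model = {"name": "some-llm", "released_at": date(2025, 1, 15)}

# A task is treated as decontaminated for a given model only if it was
# merged after that model's public release, so it could not have
# appeared in the model's training data.
clean_tasks = [t for t in tasks if t["merged_at"] > model["released_at"]]

for t in clean_tasks:
    print(f'{t["id"]} is safe for evaluating {model["name"]}')
```

In this toy example only `repo-a#101` would count toward the decontaminated score for the model released on 2025-01-15.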
Figure 1. SWE-rebench leaderboard as of today
Our goal with this benchmark is to make evaluation of software engineering LLMs more transparent, reproducible and focused on core model capabilities.
You can explore the leaderboard and the methodology behind the benchmark at swe-rebench.com.