Leveraging training and search for better software engineering agents

In this blog post, we’ll share some of our recent findings on how search can enhance agentic systems for software engineering. Specifically, we’ll demonstrate how we’ve built an SWE-agent-based system that exclusively uses open-weight LLMs and achieves a 40.6% resolved rate on SWE-bench Verified.

At Nebius, we believe that LLM-based agentic systems have matured to a level where they can manage routine software engineering tasks. As these systems continue to evolve, they will take on increasingly complex challenges. This drives our LLM R&D team to explore effective methods for building agentic systems, with a focus on scalable approaches that leverage computational power, such as search. Our aim is to push the boundaries of software engineering automation by utilizing our in-house access to advanced compute resources, all while investigating what the future of computing infrastructure may look like in a world dominated by automated engineering systems.

Introduction

Agentic systems for automated software engineering

Software engineering agents are advanced AI systems designed to autonomously perform software development tasks. Unlike traditional coding assistants that merely offer suggestions or code completions, these agents can:

  • Execute commands: run code, compile programs, and manage development environments.

  • Write and test code: develop new code segments, create corresponding tests, and validate functionality.

  • Iterate and refine: assess the outcomes of their actions, make necessary adjustments, and, if needed, revert changes to maintain code integrity.

This level of autonomy enables software engineering agents to handle complex tasks with minimal human intervention, significantly enhancing efficiency and productivity in software development. Recent advancements in large language model technology have made this autonomy possible, allowing agents to understand and generate code at a much more sophisticated level.

SWE-bench¹ is the leading benchmark for evaluating the capabilities of autonomous software engineering agents in addressing real-world software issues. It consists of 2,294 issue-pull request pairs sourced from 12 popular Python repositories on GitHub. The goal of the system being tested is to generate a patch based on the issue description that passes the relevant tests, while having access to a containerized environment with the repository where it can execute commands and interact with the code. The SWE-bench data preparation process involved:

  • Issue selection: identifying resolved issues from selected repositories that include both the problem description and the corresponding solution.

  • Data pairing: associating each issue with its respective pull request, which contains the code changes implemented to address the issue.

  • Test integration: incorporating unit tests from the pull requests to verify the correctness of the solutions.

There also exists a curated subset of SWE-bench, SWE-bench Verified, where experts have confirmed that the included issues can indeed be solved based on their descriptions and that the solutions are correctly validated using the provided tests.
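For readers who want to explore the benchmark directly, here is a minimal sketch of loading it with the Hugging Face datasets library. The dataset identifier and field names below follow the public release and are an assumption on our part rather than part of the setup described in this post.

from datasets import load_dataset

# Load the expert-verified subset; the full benchmark is published as "princeton-nlp/SWE-bench".
verified = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")

instance = verified[0]
print(instance["instance_id"])        # identifiers look like "astropy__astropy-12907"
print(instance["problem_statement"])  # the issue text the agent sees
# instance["patch"] holds the reference fix; instance["test_patch"] adds the validating tests.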

SWE-bench has driven significant progress in the development of software engineering agents, with many researchers actively working to enhance their systems' performance. However, the benchmark is still far from being fully mastered, indicating substantial room for further improvement. While no benchmark is perfect, SWE-bench provides a meaningful and reliable assessment of progress in building better software engineering agents. It captures the key challenges and capabilities of these systems, making it an invaluable tool for measuring advancements.

The bitter lesson and agentic systems

Recent research on LLM-based agentic systems largely focuses on building better scaffolding — more sophisticated frameworks and logical structures to coordinate an agent’s actions. However, this emphasis contradicts what has come to be known as the bitter lesson: historically, approaches that scale well with compute, such as search and learning, have consistently outperformed those relying on carefully engineered structures. While complex scaffolding may yield short-term improvements, simple and general methods that capitalize on computational scale ultimately prevail in the long run.

Moreover, when scaffolding is the main focus, valuable trajectory data generated by the agent is often underutilized. We believe in fully leveraging the data through learning from it, rather than merely using it to guide the process of imposing constraints.

Top-performing agents leverage frontier models

Top-performing software engineering agents mostly utilize frontier models, e.g. GPT-4o or Claude 3.5 Sonnet, to generate action proposals. To improve on such systems, one could attempt to train an action generator narrowly focused on the problem of agentic software development, hoping that this specialization would enable it to outperform generalist models, which must excel across many domains. However, the reality is that frontier models are exceptionally good in their target domains due to the extensive resources dedicated to their development. These models are also specifically trained to write code and reason about its correctness, making them well suited for powering software engineering agents. Even if a specialized model briefly outperforms them, frontier model performance constantly advances, likely rendering such efforts redundant. This is why we are more interested in approaches that can potentially work alongside frontier models, rather than replace them, while still benefiting from compute scaling.

An alternative approach: guided search with critic models

In reasoning-heavy domains such as mathematical problem solving or software engineering, there’s often a significant gap between the set of problems that current top systems can solve reliably and the set they can only solve occasionally. For instance, the plot below illustrates the performance of a reasonably capable gpt-4o-based agent on SWE-bench Verified under two scenarios:

  • best-of-N: the agent runs on each problem instance N times, and a problem is considered solved if it is successfully solved in at least one of these runs.

  • random: a problem is considered solved if a solution, randomly selected from N runs, is correct. This serves as a Monte Carlo estimate of the success rate using N samples.

Best-of-N performance of a capable model significantly exceeds its average single run performance.

Best-of-N performance of a capable model far surpasses its average performance in a single run. At N=5 runs, there is nearly a two-fold difference between the two approaches! This raises the question: can we narrow the gap and increase the likelihood of generating a correct solution in cases where an agent already succeeds occasionally?
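To make the comparison concrete, here is a minimal sketch of how both curves can be computed from a matrix of per-run evaluation results; the array shapes and numbers are purely illustrative.

import numpy as np

def best_of_n_and_random(success, n):
    # success: boolean array of shape (num_problems, num_runs); True means the run solved the problem.
    subset = success[:, :n]
    best_of_n = subset.any(axis=1).mean()  # solved in at least one of the n runs
    random_pick = subset.mean()            # Monte Carlo estimate of single-run performance
    return best_of_n, random_pick

# Hypothetical results: 50 problems, 5 runs each, roughly 20% single-run resolve rate.
success = np.random.default_rng(0).random((50, 5)) < 0.2
print(best_of_n_and_random(success, n=5))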

One promising way to achieve this is to train an additional critic model that can steer the action generator in the direction of success. In this approach, the action generator’s goal is to produce good actions at least some of the time, while the critic model — trained with the knowledge of a specific environment — evaluates generated actions to determine which ones most likely lead to a successful outcome. These evaluations then guide some form of search in the solution trajectory space. Guided search can harness the power of frontier models by considering only actions with reasonably high probability under such models, while leveraging the critic’s insights to prioritize the most promising paths in the search space. Importantly, this hybrid system has the potential to outperform generator-only approaches even as new, more powerful frontier models emerge.

In this post, we’ll share some of our findings on training critic models and using them to perform various forms of guided search within the solution trajectory space.

Preliminaries

Before discussing our findings, it’s essential to outline our experimental setup. In this section, we’ll describe the agent scaffolding we use, provide details on the action generator we’ve trained in-house and used in the majority of experiments, and explain how we run the agent.

The agent

Our aim is to use critic models to conduct search in the solution trajectory space. Imposing unnecessary restrictions on the structure of this space might interfere with our goal, limiting our ability to explore promising directions. For this reason, we’ve selected SWE-agent² as the scaffolding for the experiments presented in this blog post — its structure is minimal, allowing for flexible exploration. The SWE-agent solution process can be summarized in the following pseudocode:

def run_agent(llm, issue_id):
    env = init_environment(issue_id)
    trajectory = init_trajectory_from_issue_description(issue_id)
    while not finished(trajectory):
        # Propose the next action, execute it in the environment, and append
        # both the action and the resulting observation to the trajectory.
        action = generate_next_action(llm, trajectory)
        observation = execute_action(env, action)
        trajectory = update_trajectory(trajectory, action, observation)
    return trajectory

We use a custom fork of SWE-agent code, with several improvements focused on agent-environment communication and error handling. We also addressed issues with environment setup and patch application in our fork of SWE-bench, which we use to evaluate the agent.

Reducing performance variance

One of our early observations was that inter-run performance variance can be quite high, especially for less capable policies that struggle to recover from mistakes, which makes the performance of a single agent run misleading. To address this, we repeat each agent run five times with different seeds and report both mean performance and its estimation error.

Running the agent five times per problem requires 5x more compute. To speed things up, we conduct the evaluations on a test set we call verified-50: an unbiased random subset of SWE-bench Verified containing 50 problems. We validated that, for SWE-agent with gpt-4o, our results on verified-50 match the results independently reported on the full SWE-bench Verified test set.
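Concretely, the reported numbers are obtained along these lines (a sketch with made-up run results):

import numpy as np

# Resolved rates on verified-50 from 5 seeded runs; the values here are made up.
resolved_rates = np.array([0.30, 0.26, 0.34, 0.28, 0.32])

mean = resolved_rates.mean()
# Standard deviation of the mean over the 5 runs (what the error bars below represent).
std_of_mean = resolved_rates.std(ddof=1) / np.sqrt(len(resolved_rates))
print(f"resolved rate: {mean:.3f} +/- {std_of_mean:.3f}")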

The in-house action generator

While our goal is to train critic models that provide value on top of frontier models like gpt-4o, conducting large-scale experimentation with such models can be prohibitively expensive. To address this, we trained a reasonably capable action generator based on an open-weight model in-house. We used this generator in most experiments, reserving frontier model evaluations for the most promising setups.

After experimenting with various training configurations, including different models, hyperparameters, and data mixes, we found the following setup to be most effective:

  • Fine-tuning starting from Qwen-2.5 72B Instruct.

  • Training for 6 epochs with a batch size of 128 and a sequence length of 32k.

  • Using a cosine schedule with a 7-step warmup, a max learning rate of 4e-6, and a final learning rate of 0 (a small sketch of this schedule follows the list).

  • Employing a special data mix.
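The learning-rate schedule referenced above can be written compactly as follows; this is a sketch matching the description (linear warmup over 7 steps, cosine decay to 0), not an excerpt from our training code.

import math

def learning_rate(step, total_steps, warmup_steps=7, max_lr=4e-6, final_lr=0.0):
    # Linear warmup over the first few optimizer steps, then cosine decay to the final value.
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return final_lr + 0.5 * (max_lr - final_lr) * (1.0 + math.cos(math.pi * progress))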

The dataset we trained the model on consists of SWE-agent trajectories that successfully solved various SWE-bench-like problems. These problems come from either the SWE-bench dev set or an additional set of 3.3k issues from 1.3k repositories with permissive licenses, which we collected using the SWE-bench recipe to expand our training data. We took only those issues where we could set up the environment and successfully run tests with golden patches applied without manual intervention. We ensured that no mirrors or forks of repositories from the SWE-bench test set were included to avoid data leakage. We gathered these trajectories by running SWE-agent with instruct models or earlier versions of action generators we trained, a process similar in spirit to expert iteration¹⁰.

We collected a large number of successful trajectories for easier problems but significantly fewer for harder ones, which required filtering and rebalancing the dataset. We first filtered the trajectories to exclude those with clearly incorrect actions, such as malformed LLM responses, tool misuses, or linter-triggering errors. We then removed duplicate trajectories for each problem, retaining only unique action sequences, and kept only the 30 shortest trajectories for each problem. The resulting training set contained 2.7k trajectories.
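In pseudocode similar to the rest of this post, the filtering and rebalancing step looks roughly like this; looks_clearly_incorrect stands in for the checks mentioned above and is not an actual function from our codebase.

def build_training_set(trajectories_by_problem, max_per_problem=30):
    training_set = []
    for problem_id, trajectories in trajectories_by_problem.items():
        # Drop trajectories containing clearly incorrect actions
        # (malformed LLM responses, tool misuses, linter-triggering errors).
        kept = [t for t in trajectories if not looks_clearly_incorrect(t)]
        # Deduplicate by action sequence, then keep only the shortest trajectories.
        unique = {tuple(t.actions): t for t in kept}.values()
        shortest = sorted(unique, key=lambda t: len(t.actions))[:max_per_problem]
        training_set.extend(shortest)
    return training_set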

The action generator trained using the described setup performs significantly better than its parent model, Qwen-2.5-72B Instruct, but still falls slightly short of gpt-4o (2024-08-06) performance. Nevertheless, the performance gap is small enough for the in-house action generator to serve as a reasonable proxy for frontier models in our experiments.

Fine-tuning on trajectory data improves upon Qwen-2.5-72B Instruct, closing the gap to gpt-4o. Here and below bar heights indicate mean resolved rate computed by averaging over 5 runs. Error bars represent one standard deviation of the mean.

Fine-tuning on trajectory data improves upon Qwen-2.5-72B Instruct, closing the gap to gpt-4o. Here and below bar heights indicate mean resolved rate computed by averaging over 5 runs. Error bars represent one standard deviation of the mean.

Rerun until submitted

SWE-agent terminates the problem-solving process when one of two conditions is met: either the agent issues a “submit” command, indicating the problem is considered solved, or it encounters an unrecoverable error (e.g. the LLM runs out of context). If the latter occurs, the changes accumulated in the working copy are treated as the generated patch. However, if the LLM exhausts its context, it’s often because the agent made an unrecoverable mistake, leading to an incorrect trajectory.

One way to mitigate this, which can be considered a simple search strategy on its own, is to rerun the agent until it either successfully submits a solution or exhausts the maximum number of attempts:

def run_agent_until_submitted(llm, issue_id, run_agent_fn, max_retries):
    trajectory = None
    for attempt_idx in range(1 + max_retries):
        trajectory = run_agent_fn(llm, issue_id)
        if exit_reason(trajectory) == "submitted":
            break
    return trajectory

As shown below, the default submission rate of our action generator is around 50%, but allowing for up to three retries increases it to 80%, and up to nine retries increases it to 90%. Most trajectories require significantly fewer attempts, so this strategy doesn’t drastically increase compute costs. For instance, achieving an 80% submission rate typically requires only one extra attempt per problem on average, while reaching 90% requires about two attempts. As we will show later, these numbers further decrease as the agent policy improves.

The average number of retries it takes to generate a trajectory that ends with “submit” for a given fraction of the test set using the in-house action generator. We allow for a maximum of 20 retries.

What are the benefits of improving submission rate? Submitted trajectories are more likely to be successful than random ones, so conditioning on the fact of a submission improves agent performance, fully closing the gap to gpt-4o in our case. One major downside of our action generator compared to gpt-4o is its weaker ability to recover from mistakes — a limitation that re-running until submission essentially circumvents. We’ve also found that this strategy generally reduces inter-run performance variance, making evaluations less noisy (although it was not the case for the evaluation run shown below).

Re-running until submitted improves mean resolved rate, allowing Qwen-based action generator to catch up with gpt-4o.

It should be noted that simply increasing the maximum context length is not a good alternative to re-running until submitted. The latter searches for a trajectory that does not hang, while using a longer context just increases the time before it does.

Given the benefits and relatively low computational costs, we apply this strategy to the agent in all subsequent experiments unless explicitly stated otherwise.

Critic-guided search

In this section, we explain our approach to training critic models and outline some relatively straightforward ways to use these models for guided search.

Training critic models

Two common strategies for training critic models are process supervision and outcome supervision. The difference between them lies in the prediction targets being used:

  • Process supervision: predicting the quality of each individual action.

  • Outcome supervision: predicting whether the overall trajectory solves the task.

Process supervision is often considered more powerful and sample-efficient, but obtaining per-step target data is challenging. It requires either a costly and labour-intensive manual annotation process³ or significant compute resources to calculate Monte Carlo estimates of action value. Outcome supervision is simpler, requiring only positive and negative trajectory examples. However, to train a critic capable of per-step decision-making using outcome supervision, an additional model is needed to break down the outcome into per-step scores.

After experimenting with both strategies, we settled on a hybrid approach. Our critic model is trained as an approximator for both the value and action-value functions, using discounted rewards-to-go as targets. Rewards come from two sources: a large terminal reward computed by evaluating the trajectory as in the regular outcome supervision regime, and smaller per-step rewards produced by gpt-4o prompted to recognize clearly poor actions. While these per-step rewards are not strictly necessary, they significantly improve sample efficiency of training. We also tried using bootstrapped targets but found no advantages over rewards-to-go.

More formally, our training dataset $\mathcal{D}$ consists of trajectories

$$\tau = (o_1, a_1, r_1, o_2, a_2, r_2, \ldots, o_T, a_T, r_T),$$

with each trajectory composed of alternating observations, actions and rewards. We train the critic model to predict the value and action-value functions

$$V(\tau_{<t}) = \mathbb{E}\Big[\textstyle\sum_{k \ge t} \gamma^{\,k-t} r_k \,\Big|\, \tau_{<t}\Big], \qquad Q(\tau_{<t}, a_t) = \mathbb{E}\Big[\textstyle\sum_{k \ge t} \gamma^{\,k-t} r_k \,\Big|\, \tau_{<t}, a_t\Big],$$

where $\tau_{<t}$ denotes the trajectory prefix preceding action $a_t$, using single-sample Monte Carlo estimates of the corresponding expectations as training targets. Each step’s reward is either 0 or 1, except for the terminal step, where the reward is either 0 or 20.

The training dataset consists of 8.3k positive and 12.7k negative trajectories collected using the same problem instances we used for training the action generator. To maximize the utility of the trajectory data collected over the course of this project, the dataset comprises trajectories collected by running SWE-agent with multiple LLMs with drastically varying performance levels. To help the critic model account for these performance variations when estimating value functions, we additionally prompt it with a string identifying the LLM used to produce the trajectory being evaluated, such as swe-agent-llama-3.1-70b-instruct. This string allows the critic model to adjust its predictions to the current policy.
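A purely illustrative sketch of how such a policy identifier can be attached to the critic’s input (render_trajectory is a placeholder; the actual prompt format we use is not shown in this post):

def format_critic_input(trajectory, policy_name):
    # Prefix the trajectory with the name of the policy that produced it,
    # e.g. "swe-agent-llama-3.1-70b-instruct", so the critic can calibrate
    # its value estimates to that policy.
    return f"policy: {policy_name}\n\n" + render_trajectory(trajectory)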

We train the critic for 3 epochs with a batch size of 128, using LLaMa 3.1 70B Base as the starting point. When calculating discounted rewards-to-go, we use a relatively small discount factor of 0.85, ensuring there’s no incentive to delay issuing the “submit” command in favor of collecting additional intermediate rewards. Both value and action-value predictions use the L2 loss with equal weighting. We experimented with cross-entropy losses for value prediction (e.g. using 2-hot targets) and with predicting action advantages instead of values but found no consistent benefits.
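Putting the pieces together, the training targets and loss can be sketched as follows; this follows the setup described above (a discount factor of 0.85, per-step rewards of 0 or 1, a terminal reward of 0 or 20, equally weighted L2 losses) but is not our actual training code.

def rewards_to_go(rewards, gamma=0.85):
    # Discounted sum of future rewards for every step of a trajectory.
    targets, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        targets.append(running)
    return targets[::-1]

def critic_loss(pred_values, pred_action_values, rewards, gamma=0.85):
    # Single-sample Monte Carlo targets shared by the value and action-value heads.
    targets = rewards_to_go(rewards, gamma)
    value_loss = sum((v - t) ** 2 for v, t in zip(pred_values, targets)) / len(targets)
    action_value_loss = sum((q - t) ** 2 for q, t in zip(pred_action_values, targets)) / len(targets)
    return value_loss + action_value_loss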

A critic model trained using the described setup can estimate the values of individual states, the values and advantages of actions, and the correctness of the overall trajectory (by looking at the action-value of the final action, such as “submit”). In the next section, we’ll discuss how to use these predictions for various types of guided search.

1-step lookahead

A straightforward search strategy that leverages our critic model is to generate multiple action candidates at each step, then select the action with the highest predicted value or advantage:

def run_agent_1_step_lookahead(llm, critic_llm, issue_id, num_action_candidates):
    env = init_environment(issue_id)
    trajectory = init_trajectory_from_issue_description(issue_id)
    while not finished(trajectory):
        action_candidates = [
            generate_next_action(llm, trajectory)
            for _ in range(num_action_candidates)
        ]
        action = select_most_valuable_action(critic_llm, trajectory, action_candidates)
        observation = execute_action(env, action)
        trajectory = update_trajectory(trajectory, action, observation)
    return trajectory

This strategy has some clear benefits:

  • It’s easy to integrate into an existing agentic system.

  • It has minimal impact on action generator inference costs: it only increases the number of output tokens produced, while agentic inference costs are mostly determined by input tokens, especially when dealing with long trajectories (a rough back-of-the-envelope example follows this list).
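As a back-of-the-envelope illustration with hypothetical numbers: if a step’s prompt contains 10,000 input tokens and a single action takes about 300 output tokens, then sampling 4 candidates from the same (cached) prefix costs roughly 10,000 + 4 × 300 = 11,200 tokens instead of 10,300, an increase of under 10% for that step even though four times as many actions are proposed (ignoring the critic’s own inference cost).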

However, it’s important to note that 1-step lookahead is a local, greedy search strategy that cannot fully explore the search space, potentially limiting its impact. Despite this, applying 1-step lookahead to our agent yields clear improvements in performance and reduces the number of retries needed to achieve submission.

Comparison of agent runs with and without lookahead using action sampling temperature T=0.7

Running the agent with 1-step lookahead significantly reduces the number of attempts needed to generate a trajectory that ends with “submit”.

Given the local nature of this search strategy, we hypothesize that increasing the diversity of action candidates can improve performance by allowing a larger fraction of the search space to be explored. To achieve this, we can use higher temperatures when sampling action candidates. However, we find that action quality quickly deteriorates as the temperature rises, limiting the potential to generate diverse actions and reducing performance. There appears to be an optimal point around T=0.9.

Effect of varying sampling temperature when generating 4 action candidates for 1-step lookahead.

Another way to improve action diversity is by increasing the number of action candidates. However, we don’t observe further benefits beyond 4 candidates at T=0.9.

Effect of varying the number of candidates for 1-step lookahead with T=0.9. We observe no benefits from going beyond 4 candidates.

We’ve also experimented with prompting gpt-4o to judge action candidates to serve as a baseline for the learned critic, but didn’t observe any benefits compared to no lookahead.

In summary, this simple search strategy is relatively inexpensive in terms of additional compute and delivers substantial performance gains, boosting our agent’s resolved rate by approximately 1.5x.

Trajectory selection

Another simple way to search within trajectory space using a trained critic is to generate multiple complete solution trajectories for each problem and then use the one with the highest estimated success probability:

def run_agent_trajectory_selection(
    llm, critic_llm, issue_id, run_agent_fn, num_runs
):
    trajectory_candidates = [
        run_agent_fn(llm, issue_id)
        for _ in range(num_runs)
    ]
    return select_trajectory_with_max_prob_of_success(
        critic_llm, trajectory_candidates
    )

This strategy directly attempts to close the gap between the average and best-of-N performance discussed earlier. Essentially, it approximates the best-of-N selection process without access to evaluation results. It offers several practical benefits:

  • The strategy is agnostic to the policy used to generate the trajectory candidates, allowing for easy integration with other methods.

  • Multiple trajectories can be generated in parallel, meaning that although trajectory selection requires additional compute, it does not increase latency (see the sketch after this list).
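To illustrate the latency point, a parallel variant of the pseudocode above could look like this; threads are just one way to express it, and each run is assumed to get its own environment instance:

from concurrent.futures import ThreadPoolExecutor

def run_agent_trajectory_selection_parallel(llm, critic_llm, issue_id, run_agent_fn, num_runs):
    # Launch the independent agent runs concurrently, so wall-clock time stays close to a single run.
    with ThreadPoolExecutor(max_workers=num_runs) as pool:
        futures = [pool.submit(run_agent_fn, llm, issue_id) for _ in range(num_runs)]
        trajectory_candidates = [f.result() for f in futures]
    return select_trajectory_with_max_prob_of_success(critic_llm, trajectory_candidates)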

Applying trajectory selection on top of our action generator yields clear performance benefits and substantially reduces the resolved rate gap to best-of-N. Interestingly, performance doesn’t monotonically improve as the number of runs increases. This effect is caused by occasional out-of-distribution (OOD) trajectories that our critic model mistakenly assigns high scores to; as the number of runs grows, the likelihood of encountering such trajectories also rises.

Trajectory selection improves the baseline, but not necessarily monotonically.

Trajectory selection performs even better when applied on top of solutions generated with 1-step lookahead, merging the benefits of both search methods. Interestingly, the performance now monotonically depends on the number of runs, suggesting that fewer OOD trajectories are produced with lookahead. When using 10 runs, the combined approach reduces the performance gap with best-of-N more than 2x and achieves 48% resolved rate on verified-50.

Trajectory selection provides an even larger performance boost on top of trajectories found with 1-step lookahead.

We also ran this setup on the full SWE-bench Verified dataset using the original SWE-bench code for evaluation, where it scores 40.6%. The performance reduction relative to the 48% achieved on verified-50 is caused by several factors:

  • A number of environment fixes that we’ve made in our fork of the evaluation codebase.

  • Slightly overfitting to verified-50 when selecting the temperature and the number of action candidates.

Nevertheless, at the time of writing, this result is to the best of our knowledge state-of-the-art among approaches relying solely on open-weight models, and the best result achieved using SWE-agent scaffolding.

In summary, trajectory selection provides a substantial performance improvement, especially when layered with 1-step lookahead, enhancing our agent’s success rate and narrowing the gap to best-of-N performance.

Combining search with frontier models

To demonstrate that the benefits of critic-guided search extend to more powerful generators, we first compare the performance of SWE-agent using gpt-4o (2024-08-06) with and without 1-step lookahead. We use the best setup identified for our action generator: sampling 4 candidates with a temperature of T=0.9. The plot below demonstrates that our critic, though based on a weaker model, can meaningfully enhance gpt-4o’s performance through guided search. Additionally, re-running the model until submission further boosts performance. Interestingly, this setup does not outperform the same configuration applied to our in-house action generator, suggesting that search acts as an equalizing factor, lifting the reliability of weaker models to a quality level comparable to frontier models.

Applying 1-step lookahead and the “retry until submitted” strategy to gpt-4o significantly improves its quality. However, it does not outperform the same setup running on top of the in-house action generator.

We can further enhance performance by applying trajectory selection. To keep costs down, we tested up to N=5 runs. As the number of runs grows, the resolved rates of the in-house generator and gpt-4o remain closely aligned, suggesting that their performances are comparable in both mean and variance (e.g. gpt-4o does not generate a more diverse set of candidate solutions).

Applying trajectory selection on top of agent runs with gpt-4o boosts performance very similarly to the in-house generator.

Future steps

Methods like 1-step lookahead and trajectory selection are straightforward to implement and provide significant performance benefits, but they have limited ability to explore the search space comprehensively. The success of these methods suggests that more advanced strategies, such as beam search, best-first search, or Monte Carlo Tree Search (MCTS), could offer even greater performance improvements. These strategies have already shown promise in other LLM-heavy domains.

However, implementing more sophisticated search methods introduces challenges. These strategies require the ability to reset the environment to specific states, enabling efficient exploration of multiple branches—a complex task in a general-purpose virtual machine. One workaround is to replay actions from the beginning of a trajectory to reach a desired state, but this approach doesn’t fully address environment stochasticity, and its efficiency depends on the time required to execute each action. For example, replaying pytest ./tests repeatedly in a large repository can become highly time-consuming!
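In the pseudocode style used throughout this post, the replay workaround could be sketched as follows; it only illustrates the idea and inherits the caveats above (non-deterministic observations, slow actions):

def restore_state_by_replay(issue_id, trajectory, num_steps):
    # Rebuild a fresh environment and replay the first num_steps actions to approximate
    # the state the agent was in at that point in the trajectory.
    env = init_environment(issue_id)
    for action in trajectory.actions[:num_steps]:
        execute_action(env, action)  # replaying slow actions (e.g. "pytest ./tests") is the bottleneck
    return env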

We believe that search-based methods have strong potential for the future of automated software engineering and other agentic domains. Developing environments that support rapid action rollback and state restoration will be a crucial infrastructure component for an agent-driven world, and we are eager to explore these research directions further.

Another promising avenue is creating an iterative process where the policy induced by guided search is distilled back into the action generator. This process could involve retraining the critic to produce an even better search policy, then refining the action generator again, and so on. This approach—combining search and learning—has been highly effective in previous works, notably AlphaZero⁹, and could be directly applicable to agent-based domains, though scaling it efficiently presents new challenges.

Conclusion

Modern LLM-based agentic systems, particularly those designed for automating software engineering tasks, can reliably solve simple problems but remain prone to failure as complexity increases. Our work demonstrates that applying search guided by a critic model can significantly enhance these systems' reliability, bridging the gap between best-case and average-case performance. Specifically, we illustrated how to train a critic model on agent trajectory data, how to leverage it within various trajectory search strategies, and how to combine these strategies for even greater performance gains. Our critic-based approach adds value not only on top of open-weight models like LLaMa or Qwen, but also when paired with frontier LLMs. Notably, search also helps to close the performance gap between these model families.

By focusing on scalable approaches, such as learning and guided search, rather than overly intricate scaffolding, we’ve taken steps toward creating more robust and adaptive agentic systems. We are excited to continue exploring whether more sophisticated search methods can further enhance these systems’ performance.

Contributors

Alexander Golubev, Sergey Polezhaev, Karina Zainullina, Maria Trofimova, Ibragim Badertdinov, Yuri Anapolskiy, Daria Litvintseva, Simon Karasik, Filipp Fisin, Sergey Skvortsov, Maxim Nekrashevich, Anton Shevtsov, Sergey Abramov, Boris Yangel

AG and SP did the majority of work on the critic. KZ, MT, IB trained the action generator. YA, DL, SK and SP built the inference and evaluation infrastructure. FF and SS improved training and inference performance, especially for long sequences. FF supported training infrastructure. MN, AS, SA and BY contributed to model training and infrastructure. BY led the project.

Correspondence to byangel@nebius.com

Citation information

Please cite as:

Golubev et al., "Leveraging training and search for better software engineering agents", Nebius blog, 2024.

BibTeX citation:

@article{golubev2024search,
  title={Leveraging training and search for better software engineering agents},
  author={Golubev, Alexander and Polezhaev, Sergey and Zainullina, Karina and Trofimova, Maria and Badertdinov, Ibragim and Anapolskiy, Yuri and Litvintseva, Daria and Karasik, Simon and Fisin, Filipp and Skvortsov, Sergey and Nekrashevich, Maxim and Shevtsov, Anton and Abramov, Sergey and Yangel, Boris},
  year={2024},
  journal={Nebius blog},
  note={https://nebius.com/blog/posts/training-and-search-for-software-engineering-agents}
}

References

  1. Jimenez, C. E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., & Narasimhan, K. (2024). SWE-bench: Can Language Models Resolve Real-World GitHub Issues? arXiv:2310.06770

  2. Yang, J., Jimenez, C. E., Wettig, A., Lieret, K., Yao, S., Narasimhan, K., & Press, O. (2024). SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering. arXiv:2405.15793

  3. Lightman, H., Kosaraju, V., Burda, Y., Edwards, H., Baker, B., Lee, T., Leike, J., Schulman, J., Sutskever, I., & Cobbe, K. (2023). Let’s Verify Step by Step. arXiv:2305.20050

  4. Xie, Y., Goyal, A., Zheng, W., Kan, M.-Y., Lillicrap, T. P., Kawaguchi, K., & Shieh, M. (2024). Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning. arXiv:2405.00451

  5. Havrilla, A., Raparthy, S. C., Nalmpantis, C., Dwivedi-Yu, J., Zhuravinskyi, M., Hambro, E., & Raileanu, R. (2024). GLoRe: When, Where, and How to Improve LLM Reasoning via Global and Local Refinements. arXiv:2402.10963

  6. Koh, J. Y., McAleer, S., Fried, D., & Salakhutdinov, R. (2024). Tree Search for Language Model Agents. arXiv:2407.01476

  7. Setlur, A., Nagpal, C., Fisch, A., Geng, X., Eisenstein, J., Agarwal, R., Agarwal, A., Berant, J., & Kumar, A. (2024). Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning. arXiv:2410.08146

  8. Putta, P., Mills, E., Garg, N., Motwani, S., Finn, C., Garg, D., & Rafailov, R. (2024). Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents. arXiv:2408.07199

  9. Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., Lillicrap, T., Simonyan, K., & Hassabis, D. (2017). Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm. arXiv:1712.01815

  10. Anthony, T., Tian, Z., & Barber, D. (2017). Thinking Fast and Slow with Deep Learning and Tree Search. arXiv:1705.08439