Reinforcement learning demonstrates state-of-the-art results on software engineering tasks¹², yet it presents significant infrastructure challenges that go beyond raw compute. It requires complex MLOps workflows: inference (collecting traces from the latest policy) and training (updating the policy) run as simultaneous stages that typically need an asynchronous setup for efficiency, and fine-grained experimentation is needed to resolve instabilities and achieve high performance.
In contrast, behavioral cloning and model distillation, where models learn directly from curated demonstrations, require only standard supervised fine-tuning (SFT) pipelines and serve as a powerful baseline. Among these approaches, rejection fine-tuning (RFT) has proven particularly effective: by training on successful trajectories filtered from multiple solution attempts, RFT captures high-quality behavior without the infrastructure overhead of RL.
While our research primarily focuses on RL, given the strong performance of RFT, we are contributing our accumulated trajectories to support this approach. To expand the set of available datasets of multi-turn trajectories built with open-source scaffolding and models, we release nebius/SWE-rebench-openhands-trajectories: 67,074 agent trajectories solving GitHub issues across 1,823 Python repositories from SWE-rebench. The trajectories were generated using Qwen3-Coder-480B-A35B-Instruct³ running on OpenHands (v0.54.0)⁴, one of the most widely adopted open-source agent scaffolding frameworks.
To demonstrate the dataset’s utility, we release RFT checkpoints at two scales:
30B (fine-tuned from Qwen3-30B-A3B-Instruct-2507³): Achieves 50.3% Pass@1 on SWE-bench Verified, matching the specialized Qwen3-Coder-30B-A3B-Instruct.
235B (fine-tuned from Qwen3-235B-A22B-Instruct-2507³): Achieves 61.7% Pass@1 — outperforming the 30B coding specialist model (50.0%) while using half the parameters of Qwen3-Coder-480B-A35B-Instruct (66.5%).
We collected trajectories from GitHub issues across Python repositories sourced from SWE-rebench, a collection of real-world software engineering tasks.
Model configuration: All trajectories were generated using Qwen3-Coder-480B-A35B-Instruct, one of the most powerful open-source code generation models available to date.
Agent scaffolding: We used OpenHands (v0.54.0), which provides a comprehensive framework for repository exploration, file editing, command execution and test validation.
Trajectory format: Each trajectory contains a sequence of messages with roles of system, user, assistant or tool:
Assistant messages contain function calls in a structured tool format, which on modern models often yields better performance than string-formatted commands.
The agent maintains a linear message history, allowing efficient training on the complete sequence of steps.
The arguments of each tool call are serialized to a string for storage efficiency. When training on this data, you may need to deserialize them first so that chat templates apply the same formatting (including any additional tags or text) during both training and inference; a minimal sketch follows this list.
The exit_status field contains submit when the agent completes the trajectory with a terminating action, or an error message from the OpenHands agent otherwise.
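To make this concrete, here is a minimal sketch of restoring tool call arguments to structured form before rendering a trajectory with a chat template. The schema below (a messages column with OpenAI-style tool calls whose arguments are JSON strings, and a train split) is an assumption based on the description above; check the dataset card for the exact field names.

```python
import json

from datasets import load_dataset
from transformers import AutoTokenizer

# Assumed schema: a "train" split and a "messages" column with OpenAI-style
# tool calls whose "arguments" are stored as JSON strings (verify against the
# dataset card). The tokenizer here is only an example choice.
ds = load_dataset("nebius/SWE-rebench-openhands-trajectories", split="train")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-30B-A3B-Instruct-2507")

def deserialize_tool_args(messages):
    """Parse stringified tool-call arguments back into dicts so the chat
    template formats them identically at training and inference time."""
    for msg in messages:
        for call in msg.get("tool_calls") or []:
            args = call.get("function", {}).get("arguments")
            if isinstance(args, str):
                call["function"]["arguments"] = json.loads(args)
    return messages

messages = deserialize_tool_args(ds[0]["messages"])
text = tokenizer.apply_chat_template(messages, tokenize=False)
```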
Filtering process: We removed all trajectories where the generated code patches failed to apply to the target repository state. This ensures every trajectory represents a valid solution attempt.
After filtering, our dataset contains 67,074 solution attempts in total, 32,161 of which are successful trajectories.
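As an illustration of the kind of check behind this filter, the sketch below verifies that a generated patch applies cleanly to a checkout of the repository at the task's base commit; the actual filtering pipeline is not part of this release, so treat this as an approximation.

```python
import subprocess

def patch_applies(repo_dir: str, patch: str) -> bool:
    """Return True if `patch` applies cleanly to the checkout in `repo_dir`.
    `git apply --check` performs a dry run without touching the working tree."""
    result = subprocess.run(
        ["git", "apply", "--check", "--whitespace=nowarn", "-"],
        input=patch.encode(),
        cwd=repo_dir,
        capture_output=True,
    )
    return result.returncode == 0
```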
Table 1. Comparison of statistics across different datasets containing multi-turn trajectories of agent interactions with executable SWE environments.
*Statistic value could not be derived from available data.
Our dataset provides 3× more successful trajectories and covers 1.5× more Python repositories than existing alternatives that target real-world issues, all while relying on fully open-source infrastructure.
We evaluate our fine-tuned models on two complementary benchmarks.
SWE-bench Verified: A challenging, widely adopted benchmark containing 500 curated issues. Due to infrastructure difficulties in applying golden patches for 16 instances, we report results on the remaining 484.
While we cannot definitively determine whether base models encountered this data during pretraining, we explicitly excluded all SWE-bench Verified issues and repositories from our training data to ensure that observed improvements stem from fine-tuning rather than data leakage.
SWE-rebench September⁵: A smaller but temporally fresh evaluation set collected after all model training cutoffs, ensuring zero contamination at the issue level. Our training data similarly excludes all SWE-rebench September instances.
We applied rejection fine-tuning (RFT) with a maximum sequence length of 131k tokens.
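As a rough illustration of this setup, the sketch below selects trajectories that both resolved their issue and fit within the 131k-token training context before handing them to a standard SFT pipeline. The resolved field name and the train split are assumptions; consult the dataset card for the actual schema.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

MAX_SEQ_LEN = 131_072  # maximum sequence length used during fine-tuning

ds = load_dataset("nebius/SWE-rebench-openhands-trajectories", split="train")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-30B-A3B-Instruct-2507")

def keep_for_rft(example):
    # Rejection step: keep only trajectories marked as resolving their issue
    # ("resolved" is an assumed field name).
    if not example.get("resolved"):
        return False
    # Keep only trajectories that fit in the training context after chat
    # templating (tool-call arguments may need deserializing first, as in the
    # earlier sketch).
    token_ids = tokenizer.apply_chat_template(example["messages"], tokenize=True)
    return len(token_ids) <= MAX_SEQ_LEN

rft_dataset = ds.filter(keep_for_rft)  # then train with any standard SFT pipeline
```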
Table 2. Pass@1 with standard error of the mean and Pass@5 for the OpenHands agent with the maximum number of turns set to 100 and 500. Deltas vs. base models are shown in parentheses for fine-tuned models. Metrics are reported in percentages.
Evaluation results for agent-based systems are highly sensitive to both model and scaffolding configurations. As OpenHands continues to evolve, results can vary significantly between releases. To ensure reproducibility, we pin our evaluation to a specific library version (v0.54.0) and document all key configuration parameters:
Tool call format: Tool calls in messages from OpenHands agents used as LLM inputs can be formatted either as strings or as lists of dictionaries; different inference engines expect different formats, which affects behavior. This is controlled by the native_tool_calling parameter (None, true, or false). To avoid relying on automatic heuristics for tool call formatting (the default None behavior), we recommend setting this parameter explicitly (we used true, since all our models supported structured tool calls on the inference engine's side). The key settings we used are summarized in the sketch after this list.
Tool call post-processing for Qwen3-Coder-30B-A3B-Instruct: While the larger Qwen3-Coder-480B-A35B-Instruct correctly formatted tool calls for the OpenHands scaffolding, the smaller Qwen3-Coder-30B-A3B-Instruct struggled, causing tool formatting errors. To match the quality reported by the Qwen3 team for the smaller coder model¹², we applied additional post-processing to assistant steps on the OpenHands side¹³.
Security settings: The security_risk parameter introduced in v0.55.0 controls which operations the agent can perform and requires manually setting a new environment variable (see the bug report). To avoid this, we stayed on v0.54.0.
History management: To maintain the linear message history required for efficient training, we disabled history truncation when the LLM hits its maximum sequence length (enable_history_truncation=false, enable_default_condenser=true, condenser.type=noop) and prevented LLM access to the truncation tool (enable_condensation_request=false).
Infrastructure stability: The OpenHands README cautions that max_workers>1 is not well-tested. We observed that higher parallelism introduces infrastructure instability, particularly Docker build failures due to timeouts. To facilitate data collection, we increased timeouts during bootstrapping by 2–10×, which enabled operation at max_workers=75, but this did not guarantee successful trajectory collection for all parallel instances. We recommend max_workers=1 for reliable evaluation.
Patch application: OpenHands-generated patches require specific post-processing to be applied correctly. Using evaluation pipelines not designed to handle OpenHands predictions can result in valid patches being rejected. We implement the required post-processing, resulting in a near-zero patch application failure rate.
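For convenience, the settings above are collected in one place below. This is a plain summary rather than an OpenHands API; where each option lives in the v0.54.0 configuration, and its exact spelling, should be verified against the library's documentation.

```python
# Summary of the evaluation configuration described above (not an OpenHands API).
OPENHANDS_EVAL_SETTINGS = {
    "openhands_version": "0.54.0",
    "native_tool_calling": True,           # explicit structured tool calls
    "enable_history_truncation": False,    # keep a linear message history
    "enable_default_condenser": True,
    "condenser.type": "noop",
    "enable_condensation_request": False,  # no LLM-triggered condensation
    "max_workers": 1,                      # higher values are not well-tested
    "max_iterations": 100,                 # or 500, as reported in Table 2
}
```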
While max_iterations (the maximum number of turns) is commonly reported, it alone is insufficient for full reproduction. We therefore provide a completed reproducibility checklist for our experiments and encourage researchers to use the same checklist when reporting results:
Table 3. Reproducibility checklist.
*We used a maximum sequence length of 262k tokens for Qwen3-235B-A22B-Instruct-2507, as opposed to 131k for all other models; however, sequence length never exceeded 131k tokens during evaluation, so the results should be comparable.
We contribute three resources to the community: (1) nebius/SWE-rebench-openhands-trajectories, 67,074 open-source trajectories with 3× more successful attempts than alternatives, (2) two RFT checkpoints achieving 50.3% (30B) and 61.7% (235B) Pass@1 on SWE-bench Verified, and (3) complete reproducibility documentation for fine-tuning and OpenHands evaluation. All are released under permissive licenses.
Contributors
Maria Trofimova, Anton Shevtsov, Ibragim Badertdinov, Konstantin Pyaev, Simon Karasik, Alexander Golubev*
MT performed fine-tuning; MT and AS ran data collection and evaluation; AS and IB helped with debugging; KP and SK helped with infrastructure. AG led the project.
*Correspondence to alex_golubev@nebius.com
Citation information
Please cite as:
Trofimova et al., "OpenHands Trajectories with Qwen3-Coder-480B-A35B-Instruct", Nebius blog, 2025.
BibTeX citation:
@article{trofimova2025openhandstrajs,
  title={OpenHands Trajectories with Qwen3-Coder-480B-A35B-Instruct},
  author={Trofimova, Maria and Shevtsov, Anton and Badertdinov, Ibragim and Pyaev, Konstantin and Karasik, Simon and Golubev, Alexander},
  journal={Nebius blog},
  year={2025}
}
References
Kimi Team, Bai, Y., Bao, Y., Chen, G., Chen, J., Chen, N., ... & Zhang, H. (2025). Kimi K2: Open agentic intelligence. ArXiv: arxiv.org/abs/2507.20534
Zeng, A., Lv, X., Zheng, Q., Hou, Z., Chen, B., Xie, C., ... & Zhou, Z. (2025). GLM-4.5: Agentic, reasoning, and coding (ARC) foundation models. ArXiv: arxiv.org/abs/2508.06471
Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., ... & Qiu, Z. (2025). Qwen3 technical report. ArXiv: arxiv.org/abs/2505.09388
Wang, X., Li, B., Song, Y., Xu, F. F., Tang, X., Zhuge, M., ... & Neubig, G. (2024). OpenHands: An open platform for AI software developers as generalist agents. ArXiv: arxiv.org/abs/2407.16741
Badertdinov, I., Golubev, A., Nekrashevich, M., Shevtsov, A., Karasik, S., Andriushchenko, A., ... & Yangel, B. (2025). SWE-rebench: An automated pipeline for task collection and decontaminated evaluation of software engineering agents. ArXiv: arxiv.org/abs/2505.20411
Yang, J., Lieret, K., Jimenez, C. E., Wettig, A., Khandpur, K., Zhang, Y., ... & Yang, D. (2025). SWE-smith: Scaling data for software engineering agents. ArXiv: arxiv.org/abs/2504.21798
Yang, J., Jimenez, C. E., Wettig, A., Lieret, K., Yao, S., Narasimhan, K. R., & Press, O. (2024). SWE-agent: Agent-computer interfaces enable automated software engineering. In The Thirty-eighth Annual Conference on Neural Information Processing Systems. ArXiv: arxiv.org/abs/2405.15793
Wang, Q., Zhang, H., Fu, J., Fu, K., Liu, Y., Zhang, T., ... & Zhou, G. (2025). Klear-AgentForge: Forging agentic intelligence through posttraining scaling. ArXiv: arxiv.org/abs/2511.05951
Badertdinov, I., Trofimova, M., Anapolskiy, Y., Abramov, S., Zainullina, K., Golubev, A., Polezhaev, S., Litvintseva, D., Karasik, S., Fisin, F., Skvortsov, S., Nekrashevich, M., Shevtsov, A., & Yangel, B. (2024). Scaling data collection for training software engineering agents. Nebius blog: nebius.com/blog/posts/scaling-data-collection-for-training-swe-agents
Pan, J., Wang, X., Neubig, G., Jaitly, N., Ji, H., Suhr, A., & Zhang, Y. (2024). Training software engineering agents and verifiers with SWE-Gym. ArXiv: arxiv.org/abs/2412.21139
Jain, N., Singh, J., Shetty, M., Zheng, L., Sen, K., & Stoica, I. (2025). R2E-Gym: Procedural environments and hybrid verifiers for scaling open-weights SWE agents. ArXiv: arxiv.org/abs/2504.07164v1