MURPHY: Feedback-Aware GRPO with Retrospective Credit Assignment for Multi-Turn Code Generation
cs.LG
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has become a standard recipe for post-training LLMs on reasoning tasks, with Group Relative Policy Optimization (GRPO) emerging as a leading approach. However, GRPO and its variants are inherently single-turn: they optimize from terminal rewards on isolated prompt-response pairs, leaving them poorly suited to agentic settings where models must iteratively refine solutions in response to environmental feedback. We introduce MURPHY, a multi-turn extension of GRPO for self-correcting code generation. MURPHY constructs feedback-conditioned rollout trees in which failed candidate solutions are paired with executor feedback and expanded into subsequent turns, and it propagates rewards backward through the tree so that credit from later successful refinements flows to earlier attempts that surfaced informative feedback. We study two propagation strategies, Max Reward (MARS) and Mean Reward (MERS), and introduce post-rollout pruning mechanisms that cut the cost of multi-turn optimization. Across three code generation benchmarks (HumanEval, MBPP, LiveCodeBench-v6) and two model families (Qwen3-1.7B/4B, OLMo-2-7B), MURPHY delivers up to 6% absolute pass@1 gains over the strongest prior multi-turn execution-feedback methods. Gains are largest on the Medium/Hard subsets (+4.38/+4.20 absolute at Iter-5), where iterative self-correction matters most.
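To make the backward propagation concrete, here is a minimal Python sketch of how MARS- and MERS-style credit assignment could work over a feedback-conditioned rollout tree. The Node structure, the max-combination of a node's own reward with backed-up credit, and the decay parameter are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One candidate solution in a feedback-conditioned rollout tree.

    reward: terminal executor reward for this attempt (e.g. 1.0 if all
    tests pass, 0.0 otherwise).
    children: refinements generated from this attempt's executor feedback.
    """
    reward: float
    children: list["Node"] = field(default_factory=list)

def propagate(node: Node, strategy: str = "mars", decay: float = 1.0) -> float:
    """Back-propagate rewards so earlier attempts that surfaced
    informative feedback share credit with later successful refinements.

    MARS credits a node with the best outcome in its subtree; MERS with
    the average. The decay factor (an assumption, not from the paper)
    down-weights credit arriving from deeper turns.
    """
    if not node.children:
        return node.reward
    child_values = [propagate(c, strategy, decay) for c in node.children]
    if strategy == "mars":   # Max Reward: best descendant outcome
        backed_up = max(child_values)
    else:                    # "mers", Mean Reward: average over descendants
        backed_up = sum(child_values) / len(child_values)
    # Combination rule (assumed): keep the node's own terminal reward if it
    # is already higher than the decayed credit flowing back from its subtree.
    node.reward = max(node.reward, decay * backed_up)
    return node.reward

# A failed first attempt (reward 0) whose feedback led to one passing and
# one failing refinement earns full credit under MARS, partial under MERS.
print(propagate(Node(0.0, [Node(1.0), Node(0.0)])))                   # 1.0
print(propagate(Node(0.0, [Node(1.0), Node(0.0)]), strategy="mers"))  # 0.5
```

The two strategies encode different inductive biases: MARS rewards any attempt from which some descendant eventually succeeds, while MERS favors attempts whose feedback reliably leads to fixes. The propagated values would then stand in for terminal rewards when computing GRPO's group-relative advantages.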