Stepwise Guided Policy Optimization: Coloring your Incorrect Reasoning in GRPO — arXiv2