The Model Did What I Rewarded, Not What I Wanted
This experiment started after reading Prime Intellect's Systematic Reward Hacking and Prime Sprints post. Their setup made reward hacking feel small enough to test directly: give a model a visible task, add a hidden reward component the model is never told about, and watch whether RL learns the proxy instead of the intended behavior.
I wanted to try the same style of experiment with a continuous (length-based) hack instead of a binary keyword hack. But the question was not simply whether reward hacking would happen. I was deliberately creating a conflict between the prompt and the reward.
I made the full experiment public here: DidierRLopes/reward-hacking. It includes the environment, hosted training configs, cached run data, generated figures, and the notebook used for this post.
The more interesting question was whether better prompting could protect against it:
What if the user asks for a direct answer, but the training reward quietly pays the model for being longer?

