The code shown looks correct for this problem. My guess is that one or more of the terms is incorrect.
For any encryption core, you really need detailed test vectors: not just the input and final output, but every intermediate term for every iteration. When everything looks random by design, it is much harder to trace a problem back to its source by intuition alone.
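As a rough sketch of what that comparison can look like, the Python below records the state after every round and reports the first round where two traces diverge. Everything in it is made up for illustration: `toy_round` is a hypothetical round function (not a real cipher), and the constants are arbitrary. In practice the reference trace would come from the published test vectors for your algorithm and the other trace from your core's simulation output.

```python
MASK32 = 0xFFFFFFFF


def toy_round(state, k):
    # Hypothetical 32-bit round: add a constant, then rotate left by 3.
    # Illustration only; this is not a real cipher.
    state = (state + k) & MASK32
    return ((state << 3) | (state >> 29)) & MASK32


def trace(state, constants):
    # Return the state after every round, not just the final output.
    states = []
    for k in constants:
        state = toy_round(state, k)
        states.append(state)
    return states


def first_mismatch(expected, actual):
    # Return (round, expected, actual) for the first differing round, or None.
    for i, (e, a) in enumerate(zip(expected, actual)):
        if e != a:
            return i, e, a
    if len(expected) != len(actual):
        # One trace is shorter; report where the shorter one ends.
        return min(len(expected), len(actual)), None, None
    return None


# Arbitrary example constants; the "bad" list has one wrong term in round 2.
good_constants = [0x01234567, 0x89ABCDEF, 0xDEADBEEF, 0x0BADF00D]
bad_constants = [0x01234567, 0x89ABCDEF, 0xDEADBEEE, 0x0BADF00D]

reference = trace(0x12345678, good_constants)
under_test = trace(0x12345678, bad_constants)

result = first_mismatch(reference, under_test)
if result is None:
    print("all rounds match")
else:
    print("first mismatch at round", result[0])
```

With only the final output to compare, all you would learn is that the result is wrong. With the per-round trace, the first mismatching round points directly at the term that needs checking.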