The real reason: when I evaluated the candidate moves to pick the best one, I was calculating probabilities as if the player still held the dice. That is wrong: once the player has moved, the opponent holds the dice. So the equity of each position reached by a candidate move has to be evaluated assuming the opponent is on roll. I fixed that (eventually), and training is now much cleaner.
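To make the fix concrete, here is a minimal sketch of move selection with the dice holder made explicit. It is not my actual code: `best_move`, `apply_move`, the `equity(position, mover_holds_dice)` signature, and the toy numbers are all hypothetical names and values chosen for illustration.

```python
from typing import Callable, Iterable, TypeVar

P = TypeVar("P")  # position type (hypothetical)
M = TypeVar("M")  # move type (hypothetical)


def best_move(
    position: P,
    moves: Iterable[M],
    apply_move: Callable[[P, M], P],
    equity: Callable[[P, bool], float],
) -> M:
    """Pick the move whose resulting position has the highest equity for the mover.

    `equity(pos, mover_holds_dice)` is assumed to return the mover's equity
    in `pos`, given whether the mover or the opponent is about to roll.
    """
    # After the mover plays, the opponent is on roll, so each resulting
    # position is scored with mover_holds_dice=False. Passing True here
    # is the bug described above.
    return max(moves, key=lambda m: equity(apply_move(position, m), False))


# Toy evaluator with made-up numbers: position "A" looks best if the mover
# still held the dice, but "B" is better once the opponent is on roll.
TOY_EQUITY = {
    ("A", True): 0.9, ("A", False): 0.4,
    ("B", True): 0.7, ("B", False): 0.6,
}


def toy_equity(pos: str, mover_holds_dice: bool) -> float:
    return TOY_EQUITY[(pos, mover_holds_dice)]


def toy_apply(position: str, move: str) -> str:
    # In this toy, a move is simply named after the position it leads to.
    return move


choice = best_move("start", ["A", "B"], toy_apply, toy_equity)
print(choice)
```

In the toy table, the buggy evaluation (mover on roll) would prefer "A", while the corrected evaluation prefers "B", which shows how the wrong assumption can change the chosen move and not just shift the numbers.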