I finally cracked the nut on the problem with the "whose turn is it" input.
The problem was in how I trained the network on the t+1 state - ie updating the probabilities after the player moves.
I was evaluating the network without changing who owns the dice. So, not surprisingly: if a player always holds the dice, they will definitely win!
Now that I made it flip the dice for the t+1 evaluation, it seems to be properly converging, at least in initial tests.
Phew - that one was bugging me for several days.
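To make the fix concrete, here is a minimal sketch of the corrected training target - the names (`value_fn`, `on_roll`) are my own placeholders, not the actual interface. The key point is that the network's evaluation of the post-move board at t+1 must be asked from the opponent's perspective, since the opponent holds the dice after the move:

```python
def td_target(board_after_move, value_fn, player):
    """Training target for the t+1 state (hypothetical interface).

    value_fn(board, on_roll) is assumed to return the probability that
    the side on roll wins the game. After `player` moves, the opponent
    holds the dice, so we flip who is on roll before evaluating.
    """
    opp_win_prob = value_fn(board_after_move, on_roll=1 - player)
    return 1.0 - opp_win_prob  # player's win probability at t+1
```

Without the flip (evaluating with `on_roll=player`), the target systematically overstates the mover's chances, which is exactly the "whoever holds the dice always wins" pathology described above.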
Not quite: I still wasn't doing the right thing when evaluating the equity of possible moves to choose the best one. In that case I was assuming the player still holds the dice, which is wrong; after the move, the opponent holds the dice.
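The same flip applies to move selection. A minimal sketch, again with assumed names rather than the real interface: each candidate post-move board is evaluated with the opponent on roll, and the player's equity is the complement of the opponent's win probability.

```python
def best_move(candidate_boards, value_fn, player):
    """Pick the move whose resulting position is worst for the opponent.

    value_fn(board, on_roll) is assumed to return the win probability
    of the side on roll. After we move, the opponent holds the dice,
    so each candidate must be evaluated from their perspective.
    """
    best, best_equity = None, float("-inf")
    for board in candidate_boards:
        opp_win_prob = value_fn(board, on_roll=1 - player)  # flip dice owner
        equity = 1.0 - opp_win_prob
        if equity > best_equity:
            best, best_equity = board, equity
    return best
```

Evaluating with `on_roll=player` instead would rank moves by how good the position looks if we kept the dice, which is the bug described in the comment above.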