Monday, September 12, 2011

More frustration

I haven't been able to spend much time on this recently, but what I'm trying now is a variation on the symmetric setup - this time adding a new input that represents whose turn it is (=1 for the player's turn and 0 for the other player's turn).

I'm running up against another case where the network wants to converge an output probability to 100% almost all the time. This time the probability of win from the starting board seems to be okay (it settles around 53-55%, so a bit high compared to the gnubg standard, but nothing crazy) but the conditional probability of gammon win converges to 100%.

The last time I saw this, the problem was that I was not properly handing the dice off to the other player at the end of the turn (when evaluating the post-move probabilities). I do that correctly now, but the gammon probability is still messing up.

Frustrating! I'm not even sure how to debug this sensibly. It's a huge complex nonlinear network, so looking at the weights doesn't give me any intuition about what's happening. I've gone over the code many times now and can't see any obvious bugs. The slide toward 100% probability happens gradually over thousands of iterations.

Friday, September 2, 2011

Reference estimate of probabilities from the starting position

One thing I'd like the network to estimate properly is the probability of a win (any kind of win) and the probability of a gammon win or loss from the starting position. To do that, of course, I need some independent estimate to compare against.

I checked in gnubg to see its estimates, and reproduce them here for reference (assumes the player has the dice):

Probability of any kind of win from the starting position: 51.9%
Probability of a gammon win (incl backgammon): 15.1%
Probability of a gammon loss (incl backgammon): 13.6%

Thursday, September 1, 2011

Ah ha!

I finally cracked the nut on the problem with the "whose turn is it" input.

The problem was how I trained the board for the t+1 state - ie updating the probabilities after the player moves.

I was evaluating the network without changing who owns the dice. So not surprisingly: if a player always holds the dice, they will definitely win!

Now that I made it flip the dice for the t+1 evaluation, it seems to be properly converging, at least in initial tests.

Phew - that one was bugging me for several days.

Frustrating results

I'm running up against a brick wall in trying to add "whose turn is it" to the inputs.

Basically whichever way I implement this, the probability of win output trains gradually to 1, as does the conditional probability of gammon win, and the conditional probability of gammon loss goes to zero.

I've tried adding this as an input to the symmetric network, to the non-symmetric networks, and to a completely rebuilt network that follows Tesauro's design exactly. In all cases, if I leave out the "whose turn it is" input (by setting the hidden->input weights for that input to zero and not training them), the network trains just fine, giving results like I posted before. But if I train those weights, the network trains to the ridiculous probabilities I mentioned above. It does not matter which position I put that input - ie it's not an error in the code that has a problem with a particular input index.

I really don't understand what's going on here. I see no mention in any publications about a problem with this input.

Thinking about it a bit, this input does seem to have a bit of a preferred status: whenever it's your turn to roll, you expect your probability of winning to get more positive just because most of the time you're removing pips through your roll. I don't think any of the other inputs have that status.

In the training, the change in weight value is ( new output - old output ) * input value * output->hidden weight * ( stuff that is positive ). For the probability of win and probability of gammon outputs, and the input value is +1 if it's your turn, the expected value of the first two terms is positive.

The output->hidden weight can be positive or negative. If it's positive, the trained middle->input weight will tend to +infinity; if it's negative, the weight will tend to -infinity. I think that's what's happening here.

But this seems a bit facile - no one else mentions this as a problem. So I suspect I've got some fundamental problem in my setup that I'm missing.

The hunt continues!