I'm running up against a brick wall in trying to add "whose turn is it" to the inputs.
Basically, whichever way I implement it, the probability-of-win output gradually trains to 1, as does the conditional probability of a gammon win, while the conditional probability of a gammon loss goes to zero.
I've tried adding this as an input to the symmetric network, to the non-symmetric networks, and to a completely rebuilt network that follows Tesauro's design exactly. In all cases, if I leave out the "whose turn it is" input (by setting its input->hidden weights to zero and not training them), the network trains just fine, giving results like I posted before. But if I train those weights, the network trains to the ridiculous probabilities I mentioned above. It doesn't matter at which position I put that input - i.e. it's not a bug in the code tied to a particular input index.
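For concreteness, here's a minimal sketch of the freeze-the-input check I describe above (Python; the sizes and the index of the turn input are illustrative, not my actual setup): zero that input's column of input->hidden weights and mask it out of the gradient step so it never trains.

```python
import numpy as np

n_inputs, n_hidden = 198, 40   # TD-Gammon-style sizes, for illustration only
turn_idx = 197                 # index of the "whose turn" input (arbitrary)

rng = np.random.default_rng(0)
w_ih = rng.uniform(-0.1, 0.1, (n_hidden, n_inputs))
w_ih[:, turn_idx] = 0.0        # zero the weights feeding that input forward...

train_mask = np.ones_like(w_ih)
train_mask[:, turn_idx] = 0.0  # ...and exclude that column from training

def apply_update(w_ih, grad, alpha=0.1):
    # Masked gradient step: the frozen column never moves, so the network
    # behaves exactly as if the "whose turn" input weren't there.
    return w_ih + alpha * train_mask * grad
```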
I really don't understand what's going on here. I see no mention in any publications about a problem with this input.
Thinking about it, this input does seem to have a preferred status: whenever it's your turn to roll, you expect your probability of winning to increase, just because most of the time your roll removes pips. I don't think any of the other inputs have that status.
In the training, the change in weight value is ( new output - old output ) * input value * output->hidden weight * ( stuff that is positive ). For the probability of win and probability of gammon outputs, since the input value is +1 when it's your turn, the expected value of the product of the first two terms is positive.
The output->hidden weight can be positive or negative. If it's positive, the trained input->hidden weight will tend to +infinity; if it's negative, the weight will tend to -infinity. I think that's what's happening here.
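To make the argument concrete, a minimal sketch of that update for a single input->hidden weight (Python; the names are mine, and the "positive stuff" is the hidden node's sigmoid derivative times the learning rate):

```python
def td_weight_change(new_output, old_output, input_value, w_output_hidden,
                     hidden_deriv, alpha=0.1):
    """One TD update for an input->hidden weight, per the formula above.

    hidden_deriv is the (always positive) sigmoid derivative at the hidden node.
    When input_value is +1 (your turn) and the TD error (new - old output) is
    positive on average, the sign of the update is locked to the sign of
    w_output_hidden: the weight drifts toward +inf if that weight is positive,
    toward -inf if it's negative.
    """
    td_error = new_output - old_output
    return alpha * td_error * input_value * w_output_hidden * hidden_deriv
```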
But this seems a bit facile - no one else mentions this as a problem. So I suspect I've got some fundamental problem in my setup that I'm missing.
The hunt continues!
The real reason here: when I was evaluating possible moves to see which was optimal, I was calculating probabilities assuming the player held the dice. That's wrong; after the player's move, the opponent holds the dice. So the equity evaluation of the candidate moves needs to assume the opponent is on roll. I fixed that (eventually) and now training is much cleaner.
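For anyone who hits the same thing, a hedged sketch of the corrected move evaluation (Python; evaluate and flip_perspective are hypothetical stand-ins for the real engine functions, and negating assumes a zero-sum cubeless equity):

```python
def choose_move(board, candidate_moves, evaluate, flip_perspective):
    """Pick the move with the best equity for the player who just moved.

    evaluate(board) is assumed to return equity for the side on roll;
    flip_perspective(board) swaps sides so the opponent is on roll.
    """
    best, best_equity = None, None
    for move in candidate_moves:
        after = move.apply(board)
        # The bug was evaluating `after` as if we still held the dice.
        # Correct: evaluate from the opponent's point of view (they roll
        # next) and negate to get our own equity.
        equity = -evaluate(flip_perspective(after))
        if best_equity is None or equity > best_equity:
            best, best_equity = move, equity
    return best
```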