Monday, December 12, 2011

The symmetry constraint is not good

I ran the test described in the previous post out to 400k training runs, and the results were relatively unchanged. In head-to-head competitions the Normal network started to edge out the Symmetric one, and they performed comparably against the pub eval benchmark, with Symmetric still doing a little better on average.

However, I did some more granular benchmarking and Symmetric did much worse. I looked at the network estimate of probability of win from a pair of end games: one where white is almost sure to win, and one where white is almost sure to lose. Both networks started out estimating both probabilities near 50%, as expected with random weights. But the Normal network identified the near-certain win and loss fairly accurately after some tens of thousands of training runs, and the Symmetric network did not. In the certain-to-win case it converged to around 97.5% chance of win, so not far off (but it should be very close to 100%); in the certain-to-lose case it converged to around 20% chance of win, which is way off.

Also, I spent some more time thinking about the basic implications of the Symmetric constraint on the hidden->input weights. In particular: the network probability estimates can only ever depend on the difference between the white and black layouts, not on either one independently. That does not seem to afford the network sufficient high-level flexibility in making probability estimates.

There might be other ways to enforce the symmetry constraint, but at this point I'm just going to drop the whole thing and follow the same approach as GNU backgammon: always evaluate the board from the perspective of the person holding the dice, and drop the input that represents whose turn it is.
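To make the convention concrete, here is a minimal sketch of the gnubg-style approach, assuming a hypothetical `net(board)` that returns the on-roll player's win probability and a simple two-layout board format (neither is my actual implementation):

```python
# Sketch: the network only ever scores the board from the on-roll
# player's perspective, so no "whose turn is it" input is needed.
# `net` and the board format are hypothetical stand-ins.

def flip_perspective(board):
    mine, theirs = board  # hypothetical format: one layout per player
    return (theirs, mine)

def prob_player1_wins(net, board, player1_on_roll):
    """`net(board)` returns the on-roll player's win probability."""
    if player1_on_roll:
        return net(board)
    # Player 2 holds the dice: evaluate from their perspective and
    # convert their win probability into player 1's.
    return 1.0 - net(flip_perspective(board))
```

The symmetry transform still happens, but as a fixed rule outside the network rather than as a constraint on its weights.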

Sunday, December 11, 2011

Testing the value of the symmetry constraint

Most of the players I've built so far make the "symmetry" assumption: that it's good to ensure explicitly that when you flip the perspective of the board (including an input that signifies whose turn it is), the network probabilities transform as they should. That is, probability of any win transforms to probability of loss, probability of gammon win transforms to probability of gammon loss, and so on.

Satisfying the symmetry adds a constraint to the hidden->input weights. The inputs are set up in pairs, where the inputs for the second player mirror the inputs for the first one, just in different positions in the inputs list. The constraint is that the weight for a mirror input must be the negative of the weight for the corresponding input.
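A minimal sketch of the constraint, assuming the base inputs occupy the first half of the input vector and their mirrors the second half (the real input layout differs, but the idea is the same):

```python
# Sketch of the symmetry constraint: only the base weights are free;
# each mirror weight is forced to be the negative of its partner.

def full_weights(free_weights):
    """Expand the free (base) weights into the full constrained list."""
    return free_weights + [-w for w in free_weights]

def hidden_preactivation(free_weights, inputs):
    n = len(free_weights)
    base, mirror = inputs[:n], inputs[n:]
    # Equivalent to dotting full_weights(free_weights) with the inputs:
    # the result can only depend on the difference between the two
    # players' inputs, never on either one alone.
    return sum(w * (b - m) for w, b, m in zip(free_weights, base, mirror))
```

Written this way, it is easy to see why the hidden nodes lose the ability to react to either player's layout on its own.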

There are two benefits to this approach: first, there are roughly half the number of network weights, so training should be faster; and second, a known symmetry is enforced so the network probability estimates should be more accurate in some general way.

The downside, it seems to me, is that there are half the number of weights, so the network is less flexible than it is without the constraint. So perhaps if we do not explicitly enforce the relationship between weights for regular and mirror inputs, the network can find more powerful strategies.

I designed a simple experiment to test out the net benefit of adding the symmetry constraints. This is not conclusive for more powerful networks, but it should be suggestive.

I designed two networks:
  • "Normal": a network setup that does not include the symmetry constraint. The inputs also do not include a flag for whose turn it is; the evaluator always assumes "player 1" is on turn.
  • "Symmetric": a network setup that does include the symmetry constraint, and also includes two inputs that reflect whose turn it is (1 for player on turn, 0 otherwise).
Both networks contain only one output for probability of win, and are trained in the usual TD approach. The inputs (aside from the variation in the "whose turn is it" inputs) are the original Tesauro inputs. So not particularly sophisticated networks, but pared down to highlight the difference I want to test out.

I trained both networks with constant alpha=0.1 and lambda=0 for 100k iterations, and every 1k iterations I ran benchmarking. Both strategies were benchmarked on 1k games against the pub eval standard, and against each other. Both networks had 10 hidden nodes.
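For reference, the TD(0) update used in these runs (constant alpha, lambda=0) looks roughly like this. The linear value function and the `step` hook are stand-ins, not the real network or game engine:

```python
# Minimal TD(0) sketch: `step(state)` returns (next_state, reward),
# where reward is None until the game ends.

ALPHA = 0.1

def td0_episode(weights, features, step, state):
    while True:
        x = features(state)
        v_old = sum(w * xi for w, xi in zip(weights, x))
        state, reward = step(state)
        if reward is not None:
            target = reward  # end of game: train on the actual outcome
        else:
            target = sum(w * xi for w, xi in zip(weights, features(state)))
        # lambda = 0: only the gradient at the current state is updated.
        for i, xi in enumerate(x):
            weights[i] += ALPHA * (target - v_old) * xi
        if reward is not None:
            return weights
```

The real networks use a sigmoid hidden layer rather than a linear function, but the target structure (next estimate, or the game result at the end) is the same.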

Conclusions and Results

For this simple network setup, adding the symmetry constraint is marginally superior. The Symmetric network trains more quickly than the Normal network and has a comparable final performance against both pub eval and the Normal network in benchmarking games. That is, the Normal network's extra flexibility from a larger set of weights did not seem to have much of a net benefit.

100k iterations is not that much for backgammon training, and perhaps the Normal network would do better on a longer run. These are pretty simple networks that might not be representative of more complex ones. And the benchmarks I tried are simple and do not say much about network performance in more granular aspects of the game.

Nonetheless, this suggests minimally that adding the symmetry constraint does not make the trained player net worse, and a smaller number of network weights should make it more stable and easier to train.

This chart shows the results of the experiment:

The x-axis is the training run. The y-axis is the fraction of games won in the 1k benchmark games, run every 1k training runs. The green points and line show the Normal setup against the pubeval benchmark; the orange points and line show the same for the Symmetric setup. The blue points and line show the results of a matchup between Normal and Symmetric. The points are the individual 1k-game competitions and the lines are a rolling 10-set average.

Note: results against pubeval are incorrect. Also the training was done incorrectly. Updated results are later in the blog; but the conclusion doesn't change. Symmetric is worse than Normal, and by a more significant margin than is clear here.

Both the Symmetric and Normal networks train quite quickly (within 20k training runs or so), though the Symmetric network trains substantially quicker than the Normal one - by 5k training runs it is winning around 70-75% of games against pub eval.

Symmetric does generally better against pub eval than Normal, even after 100k training runs.

Normal tends to lose by a small margin to Symmetric. By the later training runs they are roughly on par when playing against each other.

Neither player is particularly good on an absolute scale, which is as expected with this relatively simple pair of setups. 

Friday, December 9, 2011

I'm baaack

It's been a while, but I'm back onto the computational backgammon hobby. I took the free Stanford machine learning course, which was a really interesting high level overview of many machine learning techniques, including neural networks. In the end I don't think it much changes my approach here even if it did give me some interesting context.

The next step: generalize to a more complex evaluator of the board value. I'll use a bearoff database in the appropriate end game, and separate neural networks for different game phases. To start with I'll use just two: a race network and a contact network. But the framework will be generic so I can define however many networks I like, along with a way to decide which network (or other technique, like the bearoff database) is appropriate for a given stage of the game.

I never did figure out why the previous version didn't converge properly, but I'm going to try a fresh implementation to see if there was some hidden bug. And hopefully the bearoff bit will help it get more sensible results near the end of a (race) game.

Monday, October 24, 2011

Bearoff database created

I built a little program to construct a bearoff database for me and put together an initial cut. It covers bearoffs on all six points in both players' home boards, with each player having up to nine checkers remaining.

I wrote out the data to a text file, with a key representing the board layout and value representing the equity (ie probability of win in this case, since the probability of a gammon is always zero). There are 25M elements in the table and the file takes 557MB, which needs to be loaded into process memory to be used. I looked at a table for six points and ten checkers, but that ended up being too big (closing in on 4GB). I'm sure there's a clever way to partially load elements of the table when needed to avoid this kind of process memory bloat, but it's not worth figuring that out now.

The key representing the board layout is very simple. First you start with the player who holds the dice: for each point, starting with #1, add a character representing the number of checkers on the point. "A" represents 0, "B" 1, and so on. I used letters because potentially the number of checkers could be > 9 in the generic case. So for a bearoff db for six points, the key has six elements for each of the player's six points in the home board, and each element is a character. Then the same thing is appended for the opponent's position. For six points that means a twelve-character key.
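The encoding above can be sketched in a few lines (the argument format is my assumption here; the real board class differs):

```python
# Sketch of the key scheme: one character per point, 'A' for zero
# checkers, 'B' for one, and so on; the on-roll player's home-board
# points come first, then the opponent's.

def bearoff_key(on_roll_points, opponent_points):
    """Each argument lists checker counts per point, starting at #1."""
    return "".join(chr(ord("A") + n)
                   for n in list(on_roll_points) + list(opponent_points))
```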

Now I can build a strategy that toggles between a normal network, a race network, and exact bearoff probabilities in the appropriate cases. I'm going to design it generally so it'll work with potentially many different networks (more than the two listed above).

Next steps

I've continued to be frustrated in getting my network to sensibly converge when I include the "whose turn is it" input. I tried a bunch of different approaches but none of them seem to make that much difference:

  • Include a new output node for backgammons. I noticed that in the training for networks that contained probability of win and gammon nodes, the realized behavior included a lot of backgammons, and wondered whether this was skewing the results. So I added a new output node which represents the conditional probability of a backgammon win given a gammon win. As with the gammon node, you can use symmetry to imply the probability of a backgammon loss given a gammon loss. This ended up making very little difference.
  • Weight the learning rate for the gammon node by the probability of a win. Since the gammon node represents the conditional probability of a gammon given a win, the idea is that very little information is contained in games where the prob of win is low. And a similar thing when we include backgammon nodes: weight its learning rate by the unconditional probability of a gammon win. In practice, neither of these made much difference.
  • Instead of always estimating the probability of a gammon from the network, use knowledge of the game to properly set that probability to zero when the opponent has taken off at least one checker. At the same time, stop training the gammon node when the probability of gammon is zero. This basically means that the "end of the game" training that happens for the probability of win node happens (sometimes) earlier in a game. This converged to something non-trivial, in the sense that the gammon (and backgammon) probabilities didn't converge to zero or one; but the converged probabilities were not sensible and the resulting player performed poorly against the benchmark.
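The last idea in the list can be sketched like this; the names are hypothetical, not my actual classes:

```python
# Sketch: use game knowledge to pin the gammon probability to zero once
# the opponent has taken off a checker, and flag that the gammon node
# should not be trained in that case.

def gammon_win_prob(net_gammon_prob, board, opponent_taken_off):
    if opponent_taken_off > 0:
        return 0.0, False  # exact value; skip training the gammon node
    return net_gammon_prob(board), True
```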

I've gone over the code many times with fresh eyes and don't see anything wrong here, so I'm assuming my network is just confused somehow.

I'm going to try it using a few different values for alpha to see whether I've just got too large a learning rate, but I've played with that before and don't hold out much hope.

The next thing I'm going to try is a more sophisticated network structure: include a race network and a bearoff database. I was holding off on doing this before since I wanted to try to solve the convergence problems on a simpler setup, but that's not bearing much fruit. Hopefully there isn't some bug I've missed that moots the more complex setup.

Wednesday, October 19, 2011

Bearoff databases

I was thinking a bit recently about "bearoff databases". These can be used near the end of many games, where both players are bearing off. The number of possible game states is actually fairly manageable in these cases, unlike the case of a generic board where the number of states is truly huge.

So in the case of bearing off, you can just enumerate the board layouts and the expected number of points in each.

Here's how it works: for a simple board, eg when both players have just one piece left, it's pretty easy to work out the expected number of points. Just run through all 21 distinct rolls (weighted by probability - each double is half as likely as each mixed roll) and check what happens in each one. In most of those cases the player with the dice will bear off and win a point. If not, you check what the expected number of points is for the opposing player post your move, and your expected number of points is just -1 times his.

You can then recursively figure out all the more complex board layouts: for a given layout, run through all possible rolls; for each roll check all possible moves; for each possible move check the opponent's expected number of points post-move, and choose your move to minimize your opponent's benefit. "Check the opponent's expected number of points" is the recursive bit - that will do a similar calculation, which will also recurse, and so on.

To make things easier you can store each board layout->expected number of points in a hash table so you don't have to keep repeating calculations. That's the database part of "bearoff database". Then once you have the database, you can quickly look up the expected number of points given a board layout.
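The recursion plus the hash-table cache can be sketched as follows. The move generator is a hypothetical hook (real bearoff move generation is fiddly, and a bearoff position always has at least one legal move), and layouts are tuples of checker counts:

```python
from functools import lru_cache

# The 21 distinct rolls with their probabilities out of 36.
ROLLS = [((d1, d2), (1 if d1 == d2 else 2) / 36)
         for d1 in range(1, 7) for d2 in range(d1, 7)]

def make_equity(moves):
    """`moves(mine, roll)` returns the layouts reachable by the
    on-roll player (a hypothetical hook)."""
    @lru_cache(maxsize=None)  # the "database" part
    def equity(mine, theirs):
        """Expected points for the player on roll."""
        total = 0.0
        for roll, p in ROLLS:
            best = None
            for new_mine in moves(mine, roll):
                if sum(new_mine) == 0:
                    val = 1.0  # bore off the last checker: win one point
                else:
                    # Hand the dice over; my equity is minus the opponent's.
                    val = -equity(theirs, new_mine)
                if best is None or val > best:
                    best = val
            total += p * best
        return total
    return equity
```

The `lru_cache` stands in for the persistent table; in the real thing the cache is written out to the text file described above.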

That's a very brute force approach, and it only works if there aren't too many possible board layouts. Remember, the board layout is a combination of the player's setup and the opponent's setup: this is called a "two sided" bearoff database.

With a bit of math you can work out the total number of board layouts if you restrict your database to N points nearest the end and up to M checkers on those points:

Number of board layouts = ( (N+M)!/N!/M! - 1 )^2

That is, the number of ways to lay out just one player's side of the board is (N+M) choose N, minus 1 to drop the empty layout (no checkers left means the game is already over); and for each of those there are the same number of ways for the opponent's side to be laid out.

There are also "one sided" bearoff databases. Here you enumerate just one player's board layouts, and for each you record the distribution of the number of moves before all pieces are borne off. If you have that data for both players you can approximate the expected number of points for either. I'm not that clear on how that approximation works, and haven't tried to investigate it much: the two sided database seems manageable and easier to understand.

Some examples for (number of points,number of checkers): (3,3): 361; (3,10): 81,225; (3,15): 664,225; (4,3): 1,156; (4,10): 1,000,000; (6,3): 6,889; (6,10): 64.1M; (6,15): 2.9B. On my Mac Pro in a single process I can calculate around 100k/minute. Storing them as an uncompressed text file uses around 1.8MB/100k entries.
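The formula is easy to spot-check against these numbers:

```python
from math import comb

# One side with up to M checkers on N points has C(N+M, N) - 1 layouts
# (stars and bars, dropping the empty layout); the two-sided count is
# that squared.

def num_layouts(n_points, m_checkers):
    one_side = comb(n_points + m_checkers, n_points) - 1
    return one_side ** 2
```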

Just for interest I note that the number of layouts for 24 points and 15 checkers is 6.3e20. This is an upper limit to the total number of valid backgammon boards, since it includes layouts where there are white and black checkers on the same point. It ignores layouts where either player has a checker on the bar, but I suspect those number much less than the overlapping cases. So the total number of backgammon boards is order 1e20 anyways.

Monday, September 12, 2011

More frustration

I haven't been able to spend much time on this recently, but what I'm trying now is a variation on the symmetric setup - this time adding a new input that represents whose turn it is (=1 for the player's turn and 0 for the other player's turn).

I'm running up against another case where the network wants to converge an output probability to 100% almost all the time. This time the probability of win from the starting board seems to be okay (it settles around 53-55%, so a bit high compared to the gnubg standard, but nothing crazy) but the conditional probability of gammon win converges to 100%.

The last time I saw this, the problem was that I was not properly handing the dice off to the other player at the end of the turn (when evaluating the post-move probabilities). I do that correctly now, but the gammon probability is still messing up.

Frustrating! I'm not even sure how to debug this sensibly. It's a huge complex nonlinear network, so looking at the weights doesn't give me any intuition about what's happening. I've gone over the code many times now and can't see any obvious bugs. The slide toward 100% probability happens gradually over thousands of iterations.

Friday, September 2, 2011

Reference estimate of probabilities from the starting position

One thing I'd like the network to estimate properly is the probability of a win (any kind of win) and the probability of a gammon win or loss from the starting position. To do that, of course, I need some independent estimate to compare against.

I checked in gnubg to see its estimates, and reproduce them here for reference (assumes the player has the dice):

Probability of any kind of win from the starting position: 51.9%
Probability of a gammon win (incl backgammon): 15.1%
Probability of a gammon loss (incl backgammon): 13.6%

Thursday, September 1, 2011

Ah ha!

I finally cracked the nut on the problem with the "whose turn is it" input.

The problem was how I trained the board for the t+1 state - ie updating the probabilities after the player moves.

I was evaluating the network without changing who owns the dice. So not surprisingly: if a player always holds the dice, they will definitely win!

Now that I made it flip the dice for the t+1 evaluation, it seems to be properly converging, at least in initial tests.
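The fix can be sketched like this; `evaluate` and the input layout are stand-ins for the real network:

```python
# Sketch: the post-move (t+1) evaluation must flip the "whose turn"
# input before calling the network, since after the move the other
# player holds the dice.

def value_after_move(evaluate, inputs, turn_index):
    flipped = list(inputs)
    flipped[turn_index] = 1.0 - flipped[turn_index]  # hand the dice over
    return evaluate(flipped)
```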

Phew - that one was bugging me for several days.

Frustrating results

I'm running up against a brick wall in trying to add "whose turn is it" to the inputs.

Basically whichever way I implement this, the probability of win output trains gradually to 1, as does the conditional probability of gammon win, and the conditional probability of gammon loss goes to zero.

I've tried adding this as an input to the symmetric network, to the non-symmetric networks, and to a completely rebuilt network that follows Tesauro's design exactly. In all cases, if I leave out the "whose turn it is" input (by setting the hidden->input weights for that input to zero and not training them), the network trains just fine, giving results like I posted before. But if I train those weights, the network trains to the ridiculous probabilities I mentioned above. It does not matter which position I put that input - ie it's not an error in the code that has a problem with a particular input index.

I really don't understand what's going on here. I see no mention in any publications about a problem with this input.

Thinking about it a bit, this input does seem to have a bit of a preferred status: whenever it's your turn to roll, you expect your probability of winning to tick up, just because most of the time your roll removes pips. I don't think any of the other inputs have that status.

In the training, the change in weight value is ( new output - old output ) * input value * output->hidden weight * ( stuff that is positive ). For the probability of win and probability of gammon outputs, when the input value is +1 (it's your turn), the expected value of the first two terms is positive.

The output->hidden weight can be positive or negative. If it's positive, the trained hidden->input weight will tend to +infinity; if it's negative, the weight will tend to -infinity. I think that's what's happening here.

But this seems a bit facile - no one else mentions this as a problem. So I suspect I've got some fundamental problem in my setup that I'm missing.

The hunt continues!

Saturday, August 27, 2011

Some more results

The earlier work was not in vain: I generalized the networks so that they do not make the symmetry assumption, but used the symmetric network weights as starting points for the training of the generic networks. That allowed relatively fast training.

Here are some results showing the training performance for 20-, 40-, and 80-hidden node networks:

The training worked as before, except that the benchmark was no longer the pubEval player, as that was too weak for sensible comparisons. Instead, each generic network used the maximal-performance version of the symmetric network with the same number of hidden nodes as a benchmark. So for example, the 80-hidden node training used the 80-hidden node version of the symmetric network as a benchmark - and used the instance of that symmetric network that performed best against the pubEval benchmark in its training.

In all cases the trained networks performed around +0.2-0.3ppg vs their benchmarks, and in the maximal cases got to about +0.4-0.5ppg. (In the 200-game comparison runs done every 1,000 training iterations.)

These results show that the training does continue to improve the performance of the bots, even without adding a separate race network. However, there are some mysteries that I am still trying to decipher:

  • At each training step, I have to train it against the board and its flipped-perspective board. That makes sense for the final step where you use the actual win or lose performance at the end of the game, since otherwise you would never train the probability-of-gammon-loss node (since the update function is called only from the perspective of the winner, since that is the last play in any game). But I do not understand why the training fails to work if I do not train it on the flipped board for every step.
  • The network should know which player holds the dice. As mentioned before, this is key to knowing the probability of winning or losing at every point, and especially at some end game scenarios. I tried adding a new input which is 1 if the player holds the dice, and 0 otherwise; and initialized the weights for those inputs to zero. However, the training fails if I actually train those weights. The results above do not train these weights.
"Training fails" above means that the weights become large and positive for the probability of win and probability of gammon win nodes, and large and negative for the probability of gammon loss node. And of course the measured performance of the network vs its benchmark is terrible.

I am not sure why this is happening, and I want to sort that out before I move on to adding a separate race network.

Tuesday, August 23, 2011

Not so smart again

One trick I used was to assume that when the perspective of the board flips, the probability of win flips to the probability of loss - ie, that the probability player 1 wins is the same as the probability player 2 loses.

However, that's not really the case. The network returns the probability of win given that it's your turn with the dice - so flipping the perspective does not just flip the perspective, but also hands the dice to the other player!

The starting board is one way to see this, but it is true for any symmetric setup - the probability of a win is forced to be 50%. The most extreme example: both players have only two checkers left to bear off, both on their 1 spot. Then the player with the dice has a 100% chance of winning and the other 0%. In my setup, the network would think both have a 50% chance of winning.

So my whole initial assumption about symmetry on board flips is now in question.

But at least it explains one of these issues: the probability of winning the game from the starting position was always 50%. Even adding the bias node didn't change that; the symmetry I enforced always made sure that flipping the perspective was equivalent, when it isn't really.

Have to think about how to tweak the setup. This could require some relatively deep changes.

Not so smart

Turns out that the first innovation I mentioned in the last post - training on the conditional probability of gammon loss in the case that the game ends in a gammon - is irrelevant.

That's because the training only ever ends a game on a win for the player, since the winner is always the last side to play in a game.

As part of playing with this, though, I noticed that the probability of a gammon win from the starting board changes quite a bit during the training: anywhere from 10% to 40%. And the network does not recognize that the probability of a gammon is zero once the other player has borne off a checker. That's one innovation that should help things: force the bot to notice that game fact.

Also, the probability of win from the starting board is exactly 50%. That is by construction, since I require that flipping the board perspective must give the same value. Of course, the probability of win given that a player has the first move is a bit more than 50%, and that should come out of this.

The right thing here, I think, is to add a bias node to the probability of win. That changes the symmetry relationship so that the probability of win is not always 50%.
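A tiny illustration of why the bias helps, collapsing the network to a single node (which is a caricature, not the real architecture): under the symmetry constraint the output depends only on the difference between the two sides' inputs, so any mirror-symmetric board gives sigmoid(0) = 50%, while a bias b shifts that to sigmoid(b).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def win_prob(weights, x_player, x_opponent, bias=0.0):
    # Symmetric constraint: only the input difference matters.
    z = bias + sum(w * (a - b)
                   for w, a, b in zip(weights, x_player, x_opponent))
    return sigmoid(z)
```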

Monday, August 22, 2011

The next steps

I've finally got a bit of free time to spend on the backgammon bot again, so I'm trying two quite different next steps.

The first is a bit of a tweak on what was already there. It came up because I noticed something unappealing about my current setup. The second output node is the probability of a gammon win conditioned on a win. When a game is over, I set that to 1 if the player wins a gammon (or backgammon). However, I would set it exactly the same way if the second node represented the unconditional probability of a gammon, which seems a bit weird.

One solution to this is to ignore the second node altogether (for training its weights or the middle weights) in the case where the player loses the game. But that seems a bit wasteful. In any case, I wasn't ignoring the second node before in losses, and I think that was incorrect. Perhaps that was why many of my trained networks ended up with artificially large estimates for probability of gammon.

I can do better than just ignoring those games though: if the player loses, instead of setting the "true" value of the second node to zero and training on that, I train on the conditional probability of a gammon loss conditioned on a loss. That's a kind of virtual third node in my setup; its value is entirely determined by the weights of the second node, to satisfy symmetry on flipping the board perspective.

I'm not sure how big a difference that will make, but I'm running some tests now to see.

The second change is more significant: I am adding a second neural network that is used in races, in addition to the one that's already there which I use in contact games. Those two game regimes are significantly different, and all the literature I've read suggests that generalizing to different networks for different qualitative game phases improves things significantly. (This version also includes the train-on-gammon-loss change described above.)

Hopefully I'll have some results in a couple of days for both extensions.

Wednesday, August 17, 2011

Stanford Machine Learning course

Interesting - Stanford is offering a free online course in Machine Learning this fall. I expect it'll cover some of the neural network stuff that the backgammon bots are based on.

They're also doing one on Artificial Intelligence and one on Introduction to Databases.

Monday, August 8, 2011

An example game

Here is the output of an example game I played against the bot. I was white and the bot was black:

New results post-bug

The results after fixing the bug actually are not terribly dissimilar to the ones before, except shifted downward a little in points per game. It is a little surprising how much it looks like the bug just changed the gammon payout from two points to three without affecting the training at all - I would have expected the training path to change a little as well.

Here are the results:

I also added a strategy that is a "human" strategy - ie one that lets me interactively play against the trained neural net strategies. I had it print out the network's estimate of probability of win and gammon at each step in the games I played to get an idea how reasonable they are.

I played a 10-game match against the strongest player - the 80-node network that during training performed best against pubEval (+0.885ppg in the 200-game benchmark match). It actually beat me! I played -0.2ppg against it.

Note: I discovered later on that my pubEval benchmark implementation was wrong; the results above are invalid. PubEval is a much stronger player than it would seem from the numbers above. In any case my network setup also had a significant bug, which I later discovered and fixed.

I'm not a great player. I play around -0.4ppg against the best backgammon bots, with an ER (error rating) around 9. So solidly intermediate according to the tables I've seen.

The 80-node bot isn't perfect - it left a few puzzling shots. For example, it played a 63 off the starting board as 13/4, leaving a shot in its home board right in front of my two back checkers. Its estimate of the probability of a gammon win doesn't swing to zero when the other player has borne off a checker.

But these are really promising results! I feel quite proud that I've built a bot that can beat me (even though I suspect it was just a little lucky in that match...).

Friday, August 5, 2011

Another bug!

Arg - another bug. The game framework was accidentally calling almost all gammons backgammons.

That's most likely why the pubEval comparison was so high compared to the published results, and why so many of the simulation runs ended in backgammons.

Again this invalidates all the previous results, because the networks were basically trained to treat a gammon as three points instead of two. Perhaps that's why the trained networks left shots more often than they should have - they were too incentivized to try for a gammon.

So... bug fixed, and off we go again to re-train the networks. <Sigh>

An example game

I took the 80-hidden-node network with the maximum score against pubEval and played it against itself to see the quality of play.

Overall it's pretty solid. I'd say a weak intermediate level of play now, though still making some clear mistakes with leaving shots.

Two output networks results

I finished 2M training runs (again with alpha=beta=0.1, no alpha/beta damping or backtracking) on the new network which has two output nodes, after correcting the bug. Here are the results:

The blue dots are the benchmark runs for the 10-hidden-node network, and the blue line is the rolling 10-point average. Red is the same for the 20-hidden-node network; green is 40 nodes; and purple is 80 nodes.

The bug ended up not making that much difference; the trained networks still perform extremely well against the pubEval benchmark.

So much so that pubEval probably isn't a great benchmark anymore. The trained networks backgammon pubEval almost a quarter of the time, which is not realistic.

Note: I discovered (much) later that I had a significant bug in my pubEval implementation. After fixing that it's a much stronger player. I also had some significant bugs in my network design that held them back. Check out posts from Jan 2012 to see proper results.

Now to look at some example games that the trained networks play against themselves to get a feel for how good they are...

Thursday, August 4, 2011

An interesting perspective from the academy

An interesting article I found on training neural nets with different approaches - directly relevant to the approach I'm trying here.

A few observations from this:
  • Using a single neural net isn't optimal - you should use different nets for different parts of the game. For example, one net for bearing in, one for a race, and one for the rest. I've seen this discussed elsewhere as well. The article goes a bit nuts and has nine separate networks for different varieties of game position. I'll get to multiple nets eventually.
  • The optimal value for lambda they found was around 0.6. Not sure how this squares with comments from Tesauro in his later papers that lambda=0 is the best to train with. I've been using lambda=0.
  • They didn't evaluate against pubEval - they created their "expert" network (actually group of nine networks) and trained that for a while, and then used that as the benchmark. While training the expert they chose the rolling best performing version as the benchmark for subsequent training, and decided when to stop training based on when the performance had converged on that basis.
Pretty interesting. The meat of the paper, though, is on how different training methods compare (self-learning vs using the expert at each step to specify the best move vs training against expert v expert matches). The conclusion was that training against the expert's evaluation at each step was the most efficient, followed closely by self-learning, and that training by watching expert v expert games was not very efficient.

Wednesday, August 3, 2011

A bug!

Uh oh - I discovered a bug deep in the guts of the backgammon engine, in the piece that determines the available moves given a roll. I found it when running my trained network against itself for a 1,000 game benchmark - and found that player 1 would consistently lose 0.3ppg against player 2, even though they were the same strategy. It didn't matter if the strategy was a neural net or pubEval - either way it was broken.

Fixed now, but it does invalidate all my earlier results.

I suspect the really strong performance (stronger than the published TD gammon results) I recorded against pubEval was an accidental consequence of this. I'm re-running the training now for the two-output network (10, 20, 40, and 80 hidden node versions) and we'll see how those come out.

Note: I discovered much later that I had a few significant bugs. The most relevant here was in my implementation of pubEval - after fixing that it was a much stronger player.

Tuesday, August 2, 2011

Extending the outputs

The first network I set up had just one output node, which represents the probability of a win - any kind of win, ignoring the distinction between single wins and gammons/backgammons.

This was a simplification of the original Tesauro work which had four outputs: one for a win, one for a gammon win, one for a loss, and one for a gammon loss. There was no knowledge in the network of backgammons, under the assumption that in practice they happen so rarely that they can be ignored (this is true in regular games between experienced human players).

So the next step in the evolution of my net is to add an understanding of gammons. As with the single-output network, I want the network to do the right thing when you flip the perspective of a board and pass it in. For a multi-output network this means that prob of win->prob of loss, prob of gammon win->prob of gammon loss, and so on.

Example of learned gameplay

I tried a game where I played the 40-hidden-node player against itself, and printed out the results (using a fairly simple board representation - no pretty graphics yet!).

In general the player was pretty good. Not great; perhaps better than a beginner human player, but certainly not close to intermediate (which I judge because I think I'm intermediate and can see lots of places where it made the wrong move).

In particular, looking at a couple of examples of the gameplay, it looks like it's not very good in races or at bearing in efficiently. It also leaves shots more than it should.

Monday, August 1, 2011

The next set of results

I turned off backtracking and alpha/beta decay, since that seemed to have minimal effect in my previous experiments. I also dropped alpha and beta to 0.1 to avoid jumping around too much.

Then I let the training run for 2M iterations to see if I could get something that is plausibly converged. Honestly I don't even know if a realistic neural net does converge, but looking at other articles on TD gammon training it looks like they do.

Here are the results:

The blue line and blue points are for a network with 10 hidden nodes; green is 20; and orange is 40. The points are the average points per game playing the network against the pubEval benchmark for 200 games. The lines are the rolling 10-point averages.

Sunday, July 31, 2011

Second training approach

The next step in improving the learning was to include backtracking and alpha/beta decay.

The idea: if the network drifts into a suboptimal part of parameter space because the learning pace (via alpha and beta) is too great, step back to where it was good and reduce alpha and beta.

The algorithm in practice is:
  • Every 1,000 learning iterations, run the same 200 test games against the pubEval benchmark.
  • Keep track of the best performance (ie highest win fraction, since that's what the net is optimizing).
  • If a subsequent set of 200 test games gives a win fraction less than the rolling maximum less a threshold, and alpha and beta are above specified minimum levels, step back to the weights & eligibility traces at that maximum point, and reduce alpha and beta by a fixed fraction.
  • If alpha and beta are less than the minimum after being reduced, set them to the minimum.
In principle this should give better learning behavior than without the backtracking.

The results for 40 hidden nodes:

and for 20 hidden nodes:

Not a whole lot of difference between 20 and 40 nodes here, and it seems pretty converged. Compared to the previous results without backtracking, there's less scatter, but it doesn't seem to actually cause it to improve in outright performance appreciably.

Note: the results above are invalid because I had a bug in my pubEval implementation. Fixing that made the pubEval player much stronger. Check out Jan 2012 posts for believable results.

First training approach

The first training session started with lambda=0, alpha=beta=0.5, and ran for 200,000 iterations.

I ran it for two nets: one with 20 hidden nodes, and the other with 40.

This is the simplest approach - it does not decay alpha or beta, and it does not backtrack when it finds the performance suffering versus the rolling optimum. But it is an interesting benchmark.

200,000 iterations is also not that much - it's a good start and comparable to some of the original TD training that Tesauro did, but really we want to train the nets for millions of iterations. But at least we can get some idea of how it's doing.

Every 1,000 iterations of the training the simulation runs a benchmark 200 games against the pubEval benchmark. The chart below shows the results for 40 hidden nodes. The x-axis is iteration number and the y-axis is average points per game in the 200-game benchmark matches.

Note that before training the program does terribly - losing 1.075 points per game against pubEval. Very quickly (in just a few thousand iterations) it performs roughly at pubEval level and seems to converge to around +0.25ppg, with some scatter. The best benchmark performance was +0.615ppg.

Here are the results for 20 hidden nodes:

Similar kind of results to 40 hidden nodes, really, though with more scatter toward the end of the simulation.

The frustrating thing here: while I was putting this together and getting the machinery ready to save the results, I'm pretty sure I had the 40-node network playing around 75% win rate and +0.6ppg. But I can't get it to reproduce that now! I thought I got there starting with alpha=beta=0.5 but I don't see it now. Maybe I did more steps... that said, the results above are pretty consistent with other results against pubEval that I've seen published in various articles on TD gammon. I was fiddling with the pubEval strategy, so maybe I fixed a bug there that was making it artificially weak.

Note: the results above against pubEval are invalid - I had a bug in my pubEval implementation. Fixing that bug makes it a much stronger player. Check out the Jan 2012 posts for believable results.

TD Gammon neural network strategy

Once I got the basic backgammon framework working, I needed to build a player that uses the TD gammon neural net to make board value decisions, and that can evolve itself through training.

In terms of the TD gammon setup itself, the best source online I found is actually an assignment from a computer science class at Rice University. It goes through the basic learning algo in some detail and gives some tips on how to make it work.

For inputs I followed the standard Tesauro setup that doesn't encode anything about backgammon strategy - it is purely a sensible way of encoding the board layout. That's probably the coolest thing about the neural net players: there was very little game wisdom put in by human experts, which meant that the trained nets do not suffer from any biases that human players may have. In the case of backgammon, that turned out to be a huge deal: the neural net players ended up superior to all human players, and ended up changing the way humans played.

The backgammon framework

A neural net player needs a backgammon framework to play in, so the first thing to do is to build that framework. I chose to do this in C++ because a) I know it, and b) it's faster in execution than Python, which is my usual choice for coding. And execution speed is important here because training a neural net can take many millions of iterations, where each iteration is a full game.

The basic setup:

A board class

This tracks where each player's 15 checkers are on the board. It does not know the rules or how to play a step in the game. It just has a vector of checker counts by position for each side; how many pieces on each side have been hit and sent to the bar; and how many have been borne off.

Importantly, it has a flag called perspective, which represents whether the board is viewed from the perspective of player 0 or player 1. All the methods to get checkers etc are then from the perspective of that player.

So for example there is a method hit(), which returns the number of checkers on the player's side that have been hit. If perspective is 0, that corresponds to player 0; if perspective is 1 it corresponds to player 1. There is another method otherHit() which returns the number of opponent checkers hit, which returns player 1's number if perspective is 0 and player 0's number if perspective is 1.

Or, there is a method checker( int i ), which returns the number of the player's checkers at position i (between 0 and 23), and otherChecker( int i ) which returns the number of the opponent's checkers at position i. Which player is the opponent is determined by perspective, and the indexing is reversed for perspective = 1, so checker(0) is always the number of the player's checkers at their own home position 1.

This is important because in earlier iterations of this effort I ended up with a favored perspective, and I think that may have caused subtle bugs in the code. It also makes it easier to lay out the network inputs.

Strategy classes

A strategy is something that determines which of the possible moves is the best move. The base strategy class defines an abstract virtual method boardValue, which takes in a board and returns a value - the higher the value the better for the board. The board value can represent whatever is appropriate for the strategy: the expected number of points, the probability of a win, or something more arbitrary.

Ultimately all the neural network intelligence is in a derived strategy class. But there are also simpler strategy classes - for example, a random strategy that returns a random value from boardValue, and a strategy that wraps up the standard pubEval player (which we end up using for comparisons).

When a strategy calculates its boardValue, it always uses the methods on the board that show the board from the appropriate perspective - so eg always hit() and otherHit(), never explicitly the number of checkers hit for player 0 or 1. So the board must always be passed in from the perspective of the player that is moving.

A game class

An instance of a game contains a board plus a strategy for each player. It knows how many steps a game has progressed, whether a game is over, which side won, how many points they won, and so on.

Its step() method steps one ply (one side's roll) through the game. A step rolls the dice (the random number generator is the Mersenne twister), generates all possible boards for the roll, and uses the appropriate side's strategy to determine the board value of each possible position. It then chooses the position with the maximum board value.

Before sending the board to the strategy, it sets its perspective to be that of the player which is making the move.

Another method is stepToEnd(), which keeps stepping until the game is over (one side or the other has borne off all 15 pieces).

The algorithm to determine the various possible positions given a two-dice roll is pretty brute-force right now. I'm not sure if there's a nicer way to do it - if there is, it'd have a nice performance impact, since this is called for every step in every game.

The game has a boolean verbose flag. If true, it prints out the board at each step and notes the dice rolls. If false, it prints nothing. This is useful for testing whether games are behaving sensibly; you can watch a full game in all its detail with verbose=true, but then turn verbose=false (the default) when running many games.

Running a game

Some sample code for running a game is like:

int main( int argc, char * argv [] )
{
    strategyPubEval s1;     // pubEval strategy, shared by both sides
    game g( &s1, &s1, 1 );  // third argument seeds the dice RNG
    g.verbose = true;       // print the board and rolls at each step
    g.stepToEnd();          // play the game through to the end
}

So this constructs a pubEval strategy used for both sides in the game, printing out the board and game information at each step. The strategy is passed (as a pointer) to the game for each side, along with an integer which seeds the random number generator used for dice rolls.

What this blog is about

Welcome to anyone who happens upon this blog!

This blog is meant to document my amateur attempts to build a backgammon player. Initially my focus is on building a neural net player akin to TD gammon.

I've actually tried three times to get a network up and learning properly. The first two times were failures - I couldn't find the bug(s), but they never actually learned to play. The third time shows some promise, and this blog is meant to document how I approached the problem and how my approach worked. Once I've got the basic machinery working I'll continue to extend it, and document that.

I'll also post interesting links or research I stumble across. As I said, I'm just an amateur, so I'm probably ignorant not just of the state of the art in computational backgammon, but of backgammon itself. I'm a decent intermediate player but certainly no expert. I just find the numerical machinery behind artificial learning really intriguing, and this is a way for me to play with it.