Tuesday, January 31, 2012

Player 2.4: 2-ply

The multiple ply calculations are now a little faster than before, and I'm able to get some preliminary stats on 2-ply performance.

I ran the 2-ply version of Player 2.4 for 150 cubeless money games against PubEval. It scored +0.45ppg +/- 0.11ppg. (+/- as usual is the standard error estimate.)

150 games gives a large standard error, so to tie down its performance vs 1-ply a bit better, I ran the 1-ply version against PubEval as well for the same 150 games - that is, starting them with the same random number generator seed.

This isn't perfect since the play is different and the number of steps in a game will deviate between the two players; but it should reduce the variance substantially. So the difference in performance between 2-ply and 1-ply against the same PubEval player should have much less uncertainty than the absolute equity numbers.

The 1-ply player scored 0.38ppg in the same games.

So the 2-ply player is approximately 0.07ppg stronger than 1-ply.

That seems plausible: the 1-ply player is 0.11ppg stronger than 0-ply, so having a second ply increase performance by something like 0.1ppg doesn't strike me as unreasonable.

I suspect (with no solid argument) that this gives a better low-variance estimate of the relative performance than playing 2-ply directly against 1-ply, since this way both the players get (largely) the same rolls.

Source code

In case anyone is feeling especially masochistic you can take a look at my source code, hosted at github, and licensed under the GNU GPL. I chose the GPL mostly because this is just a hobby project for me and I don't want anyone using my code in a commercial project without also sharing their code back.

Link to the repository: https://github.com/mghiggins/bgtd

That said, I'm fairly sure no one will ever look at it, so it almost certainly doesn't matter. Really I use github just to let me share development easily across my laptop and my desktop.

Some comments on the source:

  • It's C++. I chose C++ mostly because I know it and it's fast to execute. In an ideal world I'd wrap up the fast stuff in a C++ library and do the legwork in Python, but I haven't taken the time to figure out how to do that yet.
  • I develop on a Mac using XCode, so it's got some XCode configuration files also included.
  • It uses boost threads for parallel computing, so needs to be linked against boost libraries.
  • It's reasonably well commented, I think. Mostly so I remember why I did what I did when I look at the code after months away from it.

A simple example to do something interesting:

#include <iostream>
#include "game.h"
#include "strategytdmult.h"
#include "strategypubeval.h"

using namespace std;

int main( int argc, char * argv [] )
{
    // Player 2.4 strategy, loading saved weights 
    // and benchmark dbs
    strategytdmult s1( "benchmark", "player24" );
    // PubEval strategy
    strategyPubEval s2;
    // initialize game with s1 as the player and 
    // s2 as the opponent with RNG seed 1
    game g( &s1, &s2, 1 ); 
    // step to the end of the game
    g.stepToEnd(); 
    // print out the board after the game is over
    g.getBoard().print(); 
    // g.winner(): 0 == player, 1 == opponent
    cout << "Winner is " << g.winner() << endl; 
    // score is 1, 2, or 3
    cout << "Winner points = " << g.winnerScore() << endl; 

    return 0;
}

This example defines a pair of strategies - a Player 2.4 strategy and a PubEval strategy - and plays them against each other for a single (cubeless money) game. When the game is over it prints out the board layout and the result.

This won't actually work for you because the Player 2.4 strategy needs to load its saved weights, and those are in a set of files in my filesystem. It also loads one-sided bearoff databases and an escape probability database (used in the Berliner primes inputs), which are just on my machine. But you get the idea. If anyone does really want to try to compile my code I'd be happy to send you the files you need.

Multiple plies: adding an equity threshold

I extended the multiple plies filtering to include filtering based on equity difference as well as number of positions.

It still uses the coarse strategy to trim the number of moves that get the full-strength calculation, and now also filters out any move whose coarse equity is more than the equity cutoff worse than the coarse equity of the best move.
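In code the filter looks roughly like this - a sketch only, with illustrative names rather than the actual classes in the repository:

#include <algorithm>
#include <cstddef>
#include <vector>

struct CandidateMove
{
    int    moveIndex;     // index into the full list of legal moves
    double coarseEquity;  // equity from the quick & coarse strategy
};

bool betterCoarseEquity( const CandidateMove& a, const CandidateMove& b )
{
    return a.coarseEquity > b.coarseEquity;
}

std::vector<CandidateMove> filterMoves( std::vector<CandidateMove> moves,
                                        std::size_t maxMoves, double equityCutoff )
{
    // rank the moves best-first by their coarse equity
    std::sort( moves.begin(), moves.end(), betterCoarseEquity );

    std::vector<CandidateMove> kept;
    for( std::size_t i = 0; i < moves.size() && i < maxMoves; i++ )
    {
        // drop everything more than equityCutoff worse than the coarse best
        if( moves[0].coarseEquity - moves[i].coarseEquity > equityCutoff ) break;
        kept.push_back( moves[i] );
    }
    return kept;
}

Only the moves that survive the filter get the expensive multi-ply evaluation.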

The table below shows timings and performance in 100 single-game matches against PubEval. The top row shows the timing and results with no filtering at all. The next six rows show the results filtering by number of moves but not equity. The subsequent groups of rows show the results for different move number and equity thresholds.

It looks like a number threshold of 8 and an equity cutoff of 0.2 gives the best results: a significant calculation time reduction while still giving results close to the unfiltered performance. That's what I will use going forward. That gives a speedup of roughly 45% vs the unfiltered case.

Player 2.4: 1-ply

I played Player 2.4 against itself, 1-ply vs 0-ply, to check the performance improvement of looking ahead.

In 10k cubeless money games, 1-ply scored +0.11ppg +/- 0.01ppg and won 53.7% +/- 0.5% of the games against 0-ply.

I also played 1-ply against PubEval. In 10k cubeless money games it scored +0.56ppg +/- 0.01ppg and won 69.1% +/- 0.5% of the games.

Monday, January 30, 2012

Player 2.4q: for filtering

For the multiple ply calculations I looked at a quick-calculating but coarse player, Player 2.3q, that used the old formulation of the primes inputs.

I just switched that to Player 2.4q, which is the same as 2.3q but uses the new Berliner primes inputs formulation. I trained it for 300k games, using alpha=0.1 for the first 100k and then switching to alpha=0.02.

Except for using only five hidden nodes, it is identical in structure to Player 2.4 (which uses 80 hidden nodes).

In 100k cubeless money games it scores +0.122ppg +/- 0.004ppg against PubEval and wins 54.0% +/- 0.1% of the games.

It performed a little better than Player 2.3q, which might just be noise, or might suggest that the Berliner primes formulation makes more of a difference when there are a small number of hidden nodes. That is, for a larger number of hidden nodes, the network can approximate the value of primes itself and does not need to have it spelled out through a custom input. But the small incremental performance gain makes it difficult to have much confidence in such a statement.


Player 2.4: Berliner primes inputs

I implemented the Berliner primes inputs "AContain" and "Contain" and extended Player 2.1 to include them. I'm calling the resulting player Player 2.4.

The summary: it's unclear whether adding the primes inputs adds any performance improvement, but the extra training did have a little benefit, and the extra inputs may improve things a little. Player 2.4 is my strongest player so far, by a small margin, against my available benchmarks.

Training started with the Player 2.1 weights and random (uniform between -0.1 and +0.1) weights for the four new inputs (two for each player). I started with alpha=0.02, dropped to alpha=0.004 at 100k iterations, then to alpha=0.0008 at 700k iterations. That continued to 1.1M iterations.

Its full definition:

  • Two neural nets: contact and race.
  • 80 hidden nodes for both networks.
  • Contact inputs are the original Tesauro inputs, less inputs for whose turn it is, plus: input that notes whether a backgammon is impossible (one for each player), as per Player 2; input that notes number of hitting shots (one for each player), as per Player 2.1; and the new inputs that track primes as per Berliner (two for each player). Race inputs are just the original Tesauro inputs.
  • One-sided bearoff database for up to fifteen checkers spread out to up to six points, so the full home board.

I won't post the training chart since the 1,000-game match results are too noisy to be meaningful. The standard error on 1,000 games (cubeless money equity) is 0.045ppg, which is much larger than the equity edge coming from the extra input. I'm going to have to change my training benchmarking to use a larger number of games in the benchmark matches (and probably do them less frequently, since 100k games is fairly expensive).

After the training, in a 100k-game match (cubeless money games), it scores +0.009ppg against Player 2.1, with a standard error of 0.005ppg, so barely a statistically significant improvement.

Even though this is a pretty weak improvement I'll stick with the Berliner formulation since it seems to be the standard (and gives comparable performance to my other tries), and put the primes inputs to bed. However, it's unclear to me whether these new inputs are necessary at all. Maybe they'd be more important with a smaller number of hidden nodes.

I tried two other experiments related to these new inputs.

The first was to see whether I should start training from scratch, or from the Player 2.1 set with random weights just for the new inputs (as described above). Starting from scratch I ran 1M training runs; in a 100k-game match the trained player lost 0.01ppg against Player 2.1. So it needed more training just to get back to the Player 2.1 starting point: it's definitely better to start training from a previously-trained player.

The second was more interesting: to see whether the equity edge I saw above came really from adding the new input, or just from training the Player 2.1 network for more training runs and/or with smaller alpha. So I took Player 2.1 without the new inputs and trained it for an additional 420k training runs (100k with alpha=0.02, the rest with alpha=0.004).

It scored +0.007ppg +/- 0.005ppg against Player 2.1. So much of the small performance improvement noted above came from extra training on the other weights, not from adding the new inputs. That said, the magnitudes of the impacts are of the order of the standard error, so it is difficult to state much with confidence. Again, this makes it hard to see any real benefit from adding the new primes inputs.

Some other benchmarks for Player 2.4, again in 100k cubeless money game matches:

  • Player 2.1: scores +0.009ppg +/- 0.005ppg, 50.1% +/- 0.1% win probability
  • Player 2: scores +0.043ppg +/- 0.005ppg, 51.6% +/- 0.1% win probability
  • Benchmark 2: scores +0.071ppg +/- 0.005ppg, 52.7% +/- 0.1% win probability
  • PubEval: scores +0.475ppg +/- 0.005ppg, 66.4% +/- 0.1% win probability

Note that all performance numbers here are for the 0-ply versions of the players, and the +/- numbers are standard errors on the score and win-probability estimates. These aren't entirely apples to apples with earlier equity estimates, since those were often run over 30k games instead of 100k, but they should be close.




Thursday, January 26, 2012

Berliner's BKG program and prime counting

I just found a classic reference by Hans Berliner, who wrote one of the earliest backgammon bots, BKG.

The paper is called "BKG - A Program that Plays Backgammon", published in 1976 by Carnegie Mellon's Computer Science Department. It describes the approach he took to building his bot, which in a later version went on to beat a world champion human player in a 5-game match.

There's a version available online, but it's fairly truncated and doesn't give all the detail of the original article, which isn't as easy to find.

The original version is what some people refer to as "Berliner". For example, you see it referred to in the GnuBG codebase, and by other experts.

The bot described here is more primitive than neural network-based bots and to perform even at something like an intermediate level it needed a lot of hand-coded intelligence about backgammon strategy.

But that makes it pretty interesting from the perspective of adding new inputs.

For example: counting primes. In Berliner's article he notes that doing what I tried to do with Player 2.2 (just counting the maximum consecutive points) isn't a particularly effective tool in the game, because the position relative to the opponent's blots matters, and a fairly effective blockade can sometimes be made without consecutive points.

His approach: count ways that checkers can escape the blockade. He constructs a set of twelve points in front of a checker, and arranges up to seven covered points around them. For each layout he calculates the number of rolls (out of 36) that can escape.

For a given player checker on point n he defines a function Escape(n) that returns the number of escaping rolls. Then he defines an input (which he calls AContain) that is the minimum value of Escape for n running from the opponent's 1-point to 9-point.

That basically measures how hard it is for a player with checkers in or near the opponent's home board to escape a blockade.

He adds another input that measures the difficulty of running from the opponent's 9-point to the player's home - so he looks at the minimum value of Escape for n=opponent's 9-point to the player's 2-point. That lives in a separate input called Contain.

One complication: computing the number of escaping rolls for a given board layout is somewhat involved. So instead of redoing that calculation over and over, he builds something like a bearoff database that tracks the results. It's quite small: you can represent all relevant board layouts with 12 bits, one for each point from 1 to 12 pips in front of the checker; the bit is 0 if the point is covered by fewer than two checkers and 1 if it is covered by at least two. 2^12 is 4,096, but representing up to seven covered points requires only 3,301 elements.
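The encoding itself is simple. Something like this sketch (names and board indexing are illustrative, not taken from any particular implementation) builds the 12-bit key, which then indexes a precomputed table of escaping-roll counts:

#include <vector>

// checkersOnPoint[p] = number of blocking checkers on point p
int blockadeKey( const std::vector<int>& checkersOnPoint, int checkerPos, int direction )
{
    int key = 0;
    for( int k = 0; k < 12; k++ )
    {
        int point = checkerPos + direction * ( k + 1 ); // 1 to 12 pips ahead of the checker
        if( point >= 0 && point < (int) checkersOnPoint.size()
            && checkersOnPoint[point] >= 2 )            // covered: at least two checkers
            key |= ( 1 << k );
    }
    return key; // index into a precomputed table of escaping-roll counts
}

Escape(n) is then just a table lookup on the key built in front of point n, and AContain and Contain are minima of Escape over their respective ranges of n.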

I'm going to try to implement this.


Player 2.3: extended prime inputs

My first attempt at extending the inputs to use primes (Player 2.2) did not work very well - the improvement in performance over Player 2.1 (the same but without the primes input) was quite small. My next attempt - Player 2.3 - extends the prime input.

In particular it adds four new inputs for each player, instead of one. They are:

  1. n/3 for n<3, or 1 for n>=3, where n = max # of primes on the player's side
  2. 1 for n>=4, 0 otherwise
  3. 1 for n>=5, 0 otherwise
  4. n-5 for n>=6, 0 otherwise
The idea was similar to how checker counts are distributed among several inputs: it lets the network find a more complex nonlinear dependence on count.
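In code the mapping from the max prime count n onto those four inputs is tiny; a sketch (the actual layout of the input vector in my code may differ):

#include <array>

std::array<double,4> primeInputs( int n )
{
    std::array<double,4> in;
    in[0] = n < 3 ? n / 3. : 1.;   // ramps up to 1 at a 3-prime
    in[1] = n >= 4 ? 1. : 0.;      // flags a 4-prime or better
    in[2] = n >= 5 ? 1. : 0.;      // flags a 5-prime or better
    in[3] = n >= 6 ? n - 5. : 0.;  // n-5 for a 6-prime or longer
    return in;
}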

I started with Player 2.1 and trained it for 1.5M iterations. Alpha was 0.02 to start, dropped to 0.004 at 200k runs, and dropped to 0.0008 at 1M runs. Learning chart below:

It did not perform particularly well. I took the player at the end of the training and played 100k money games against Player 2.1. It scored +0.0107ppg, with a standard error of 0.0045ppg. So significant at better than two standard errors, but just barely. I'm unsure whether the improvement is really due to adding the new input, or if it's just because I trained Player 2.1 for longer and with smaller alpha.

This was using 80 hidden nodes for both Player 2.3 and Player 2.1.

Perhaps the inputs are still not set up optimally. Maybe I need to note the position of the max prime somewhere, rather than just the max count.

Note that Player 2.3q - the one I use for filtering moves in multiple-ply calculations - is an example of Player 2.3, but with only five hidden nodes.

Wednesday, January 25, 2012

Checkpoint: recent progress and next steps

I've been tearing around with ideas for improving my player, but I think it's worth taking a break and summarizing recent progress and thoughts on next steps.

The big change in the last month was figuring out why my networks weren't properly converging: when calculating the equity of different possible moves, I needed to assume the opponent holds the dice (since the player's move is over), but I was assuming the player holds the dice.

Fixing that immediately improved the performance of my players, even the simple ones.

The second change worth noting is discovering that my pubEval implementation was incorrect, so the numbers I was seeing when benchmarking against it were way too good. After fixing that, my network performance dropped back down to more believable levels: comparable to the performance of players I see in some of the literature, but worse than the best bots like GnuBG.

I defined a few named players: Benchmark 1, which is a single-net player that tracks wins and gammons; Benchmark 2 which adds backgammons; and Players 2 and 2.1, which are players with separate nets for race and contact game phases (and differ just in the list of inputs), and track wins, gammons, and backgammons.

I tried out a couple of new inputs. An input tracking hitting shots seemed to add some value. I'm still testing an input for the max # of primes.

Directions for the future:

  • Finish sorting out contact inputs. The primes input seems not to work very well so far, but I suspect there's a better way to define this.
  • Improve lookahead calculations so 2-ply calculations aren't so slow.
  • Add rollouts. I've got some basic rollout functionality now but it doesn't include basic stuff like variance reduction or cutting off the rollout before the end.
  • Add custom inputs for the race network, especially adding 14 separate inputs for each checker borne off (to give the network a chance to find more complex nonlinear dependencies).
  • Extend the list of networks to include something like GnuBG's crashed network - ie for the crashed phase of the game. I looked at GnuBG's definition of "crashed" and didn't really understand it, so a bit more to do there before I turn anything on.
  • Build a position benchmark database and roll out the probabilities for each entry. That'll let me do supervised learning to train the network. GnuBG did this and those folks note that with just standard TD learning they couldn't get too far; they needed the benchmark database and supervised learning, with a focus on positions where 0- and 2-ply equities were different.
  • Hook up my bot to FIBS so I can get another benchmark on its performance.
  • Add support for cube decisions. I've entirely ignored this so far as I get the cubeless play working properly, but eventually I'll need to figure this part out. Especially challenging for me since I don't know myself much about cube strategy.

1-ply benchmark for Player 2.1

I played 0- and 1-ply Player 2.1 in single-game match mode (where gammons & backgammons count as one point) against pubEval to see how it compares to GnuBG.

For 1k games, 0-ply Player 2.1 wins 68.8% of games, and 1-ply Player 2.1 wins 70.2%. So adding the 1-ply lookahead improves the win percentage by 1.4%, a bit less than I expected based on the equivalent test for GnuBG. (And I note my 1-ply player doesn't even reach the performance of 0-ply GnuBG, at 71.9%.)

I also think my lookahead algorithm is inefficient now. It's really really slow for 2-ply - so slow I can't plausibly calculate statistics. Needs some trimming!

Multiple ply benchmark

Joseph Heled, one of the key GnuBG developers and the one responsible for most of the neural network pieces, has run some experiments with multiple plies against pubEval to see a comparison.

For 1k 1-game matches, 2-ply beat pubEval 75% of the time. For 3k 1-game matches, 1-ply beat pubEval 74% of the time. For 10k 1-game matches, 0-ply beat pubEval 71.9% of the time.

Those were single-game matches where all wins counted for 1 point, so not money games where gammons and backgammons count extra.

The 1- and 2-ply matches were with smaller numbers of trials, so have more statistical uncertainty attached. But roughly, adding multiple plies should improve performance by 2-3% in probability of win.

I haven't seen comparisons of 1- or 2-ply GnuBG vs its 0-ply cousin, but it seems reasonable to assume that they will win by comparable margins (2-3%).

Multiple plies

Another standard technique for improving backgammon bot performance is using "multiple plies". This technique assumes that as you get closer to the end of the game, the accuracy of the net improves. Of course this is true in the limit of the very end of the game, but many games end up with one player or the other in a dominant position well before the game ends.

A regular calculation of board value is called "0-ply" - zero extra steps beyond the current one. Note that the notation is a bit confused here: GnuBG uses 0-ply for this initial calculation, but other backgammon bots describe this as "1-ply" instead. I will continue to follow GnuBG's notation.

1-ply looks forward one step of the game: it looks at each of the 21 possible rolls that the opponent might make, for each chooses the optimal board, and weights its (0-ply) value by the probability of getting that roll (1/36 for a double and 1/18 for a mixed roll).

2-ply does the same as the 1-ply calculation, but for each opponent roll it calculates each move's 1-ply value, so recursively looks at all the possible rolls that the player might make after each of the opponent roll scenarios.

"n-ply" does this recursion n times, of course.

This quickly becomes an expensive calculation. There are 21 possible rolls, and on average for each roll there are roughly 20 possible moves that need to be evaluated. So a 1-ply board evaluation costs about 400x as much as 0-ply evaluation. A 2-ply evaluation costs 160,000x as much as 0-ply.
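A sketch of the recursion, with illustrative helper functions rather than the actual classes in my code:

#include <cstddef>
#include <vector>

struct Board { /* checker layout etc. */ };

// assumed helpers, for illustration only
double boardValue0Ply( const Board& b );                            // net value, opponent on roll
std::vector<Board> opponentMoves( const Board& b, int d1, int d2 ); // boards after the opponent's legal moves
Board flipPerspective( const Board& b );                            // swap the two sides

// Player's equity at nPlies, always assuming the opponent holds the dice.
double boardValueNPly( const Board& b, int nPlies )
{
    if( nPlies == 0 ) return boardValue0Ply( b );

    double value = 0;
    for( int d1 = 1; d1 <= 6; d1++ )
        for( int d2 = 1; d2 <= d1; d2++ ) // the 21 distinct rolls
        {
            double prob = d1 == d2 ? 1 / 36. : 1 / 18.; // double vs mixed roll

            std::vector<Board> moves = opponentMoves( b, d1, d2 );
            if( moves.empty() ) moves.push_back( b ); // opponent dances; board unchanged

            // the opponent picks the move that minimizes the player's equity;
            // after that move the player holds the dice, so evaluate the flipped
            // board one ply shallower and negate to get back to the player's view
            double worst = 1e9;
            for( std::size_t i = 0; i < moves.size(); i++ )
            {
                double v = -boardValueNPly( flipPerspective( moves[i] ), nPlies - 1 );
                if( v < worst ) worst = v;
            }
            value += prob * worst;
        }
    return value;
}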

One technique that GnuBG implemented to reduce calculation time is to use a quick & coarse strategy to trim down the number of board evaluations to do at each step. That is, for the (average) 20 possible moves after a roll, it uses a coarse strategy to order their board values, and then does the expensive calculation only on the top few.

This assumes that many of those 20 moves are clearly poor, so even a coarse strategy can effectively discard them and leave the expensive calculation for the moves we expect really are worth being accurate when evaluating.

I built a multiple-ply strategy wrapper that takes a base strategy (used for 0-ply calculations) and a filter strategy (the coarse strategy used to trim out the clear losers among the possible moves). It takes a parameter that defines how many moves maximum it will calculate the expensive recursive board value for, and determines that subset of moves using the filter strategy.

When I try this with Player 2.1 as the base strategy, Player 2.3q as the coarse strategy, and eight as the maximum number of expensive moves, the 1-ply version scores +0.107ppg in a 5k-game cubeless money match against the 0-ply version (Player 2.1), winning 53.6% of the games. It runs 125x slower than the 0-ply version - much better than the 400x we'd see without move filtering, but still quite slow (which is why I ran only 5k games in the match; that took 8h on my multi-core desktop).

I played 20 games by hand against the 1-ply extension of Player 2.1 and it beat me handily: I lost 0.65ppg. Again, I'm not the greatest player in the world - solidly intermediate - but still that's a pretty good result. I lose 0.4ppg against the iPhone Backgammon NJ expert player (which they claim is on par with 0-ply GnuBG), but I use hints there occasionally so my real performance is a little worse than that. There's a fair bit of statistical noise on the score on a 20-game match - a standard error of 0.25ppg or so - but pretty clearly this player is way better than I am.

A more interesting benchmark would be how it plays against humans of various skill levels. There's an online backgammon server called FIBS (the First Internet Backgammon Server) which people have connected bots to before, so I think I'll try that once I'm comfortable that my player is good enough.

Also I can probably speed it up more by filtering not just beyond a fixed number of moves, but also beyond an equity cutoff. So if my move cutoff is eight moves, but four of those are calculated as terrible moves using the coarse filter strategy, I should be able to ignore them.

Tuesday, January 24, 2012

Player 2.3q: quick and dirty

I trained a player like Player 2.1 (race & contact networks, one-sided bearoff db), but with only five hidden nodes. I included the new inputs (shot hitting probability and expanded prime count).

The purpose here is to build a strategy that does something reasonably good - not optimal - but quick to calculate. We'll use it later in filtering possible moves in multiple-ply and rollout calculations to speed them up.

It ends up being better than pubEval, though not by much: it scores +0.115ppg and wins 54% of the games in a 10k-game money match.

I'll name it Player 2.3q, where the q denotes "quick".

Monday, January 23, 2012

Player 2.2: adding a primes count input

The next input I tried adding was a primes count: it counts the max number of consecutive points on the player's side of the board (positions 1->12). Of course, two inputs, one for each player.
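For concreteness, the raw count is just the longest run of made points among those positions; a sketch (the board representation is illustrative). The input fed to the net is this count divided by 6.

int maxPrimeCount( const int playerCheckers[26] ) // playerCheckers[p] = player's checkers on point p
{
    int best = 0, run = 0;
    for( int p = 1; p <= 12; p++ ) // points 1-12 from the player's perspective
    {
        if( playerCheckers[p] >= 2 ) // a made (covered) point
        {
            run++;
            if( run > best ) best = run;
        }
        else
            run = 0;
    }
    return best;
}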

I trained the player (which I'm calling Player 2.2) and benchmarked its performance against Player 2.1 with 1,000 games every 1,000 training iterations. Results below:

The blue dots are the scores of the new player in 1,000-game matches, and the blue line is the rolling 30-point moving average.

In a 30k-game comparison against Player 2.1 it scores +0.017ppg and wins 50.5% of the games.

The standard error on the ppg score over 30k trials is around 0.009ppg (determined by experiment, running Player 2.1 against itself), so +0.017ppg is just under two standard errors from zero - barely significant.

I've seen a post on a backgammon forum that states prime counting is an important extra input, but these results do not show it. It might be because of how the input was set up: max # of primes / 6. In reality it's a lot better to have six primes than to have five, because you box the opponent in (if he's got a checker behind the prime). So perhaps some kind of splitting of the count among multiple inputs would do better. I'll try that next.



Sunday, January 22, 2012

Winning vs Equity

In 1-game matches, or e.g. in a 7-point match tied at 6-6, you care only about winning, not about how many points you win.

I added the ability for Player 2.1 to optimize for that instead of for the full equity.

As expected, it increased the win %-age a little at the expense of equity. I experimented with this in 30k-game matches:

Optimizing for any win: score +0.3497ppg, 66.82% chance of a win.
Optimizing for equity: score +0.4686ppg, 66.01% chance of a win.

So optimizing for any win does improve the odds a little (around 0.8%) but dramatically reduces equity.

Saturday, January 21, 2012

PubEval: final fix

I had one last bug with PubEval: when evaluating the board value of proposed moves, it should choose the race or contact weights based on the starting board, not on whether each proposed move's board is a race or contact position. That's because the race and contact evaluations return quite different numerical values and can't be compared against each other (unlike networks whose outputs represent probability of win etc).

I fixed that bug, and now I'm pretty sure I've got a proper implementation of the standard PubEval player. After those fixes, in 30k cubeless money matches, here are how my players perform:

Benchmark 1: +0.351ppg, 62.9% chance of win
Benchmark 2: +0.426ppg, 64.3% chance of win
Player 2: +0.435ppg, 64.6% chance of win
Player 2.1: +0.469ppg, 66.0% chance of win

Compare that to gnubg 0-ply, which scores +0.630ppg and wins 70.9% of games. So I've still got some work to do. :)


Player 2.1: new input for hitting shots

I'm now starting to experiment with some inputs beyond the basic Tesauro ones. One I had already added: an input indicating whether it's no longer possible to lose a backgammon (for each player).

I tried a new one: the number of ways to hit a shot (one input from the perspective of each player). I added it to the Player 2 player, which has the race & contact networks plus bearoff.

I trained it for 460k runs, using alpha=0.02 for the first 200k then dropping to alpha=0.004. I started with a smaller alpha than usual because I used the same weights as the trained Player 2 weights, but started with a small random weight for the new input (uniform random between -0.1 and +0.1).

After training, the player with the new input scored +0.045ppg against the player without it, over 30k cubeless money games. So a noticeable improvement, but nothing dramatic.

I'll call the version of Player 2 with the new input Player 2.1.

Thursday, January 19, 2012

Benchmark 2: different number of hidden nodes

The next test on the Benchmark 2 player is to see how the performance depends on the number of hidden nodes. In general one expects performance to improve as the number of hidden nodes increases, but that is balanced by a higher computational overhead and potentially more difficulty converging during training.

The Benchmark 2 player described earlier used 80 hidden nodes.

10 Hidden Nodes

I trained another version with 10 hidden nodes (this time starting from small random initial weights) and benchmarked it against the trained 80-node version. It converged relatively quickly, in about 75k training runs (all at alpha=0.1):

Its converged performance against the 80-node benchmark, however, was quite poor. In a 10k-game match it scored -0.417ppg on average and won 36.9% of the games. It scored +0.056ppg against pubeval in a 10k-game match and won 51.7% of the games. (Note: pubeval results were updated after the pubeval player was fixed.)

40 Hidden Nodes

The next version had 40 hidden nodes. Benchmarked against the 80-hidden node version in training:

It used alpha=0.1 for the first 200k runs then switched to alpha=0.02, and seemed quite converged by 250k training runs.

In a 10k-game match against the 80-node benchmark it scored -0.064ppg and won 47.5% of the games. It scored +0.325ppg against pubeval in a 10k-game match and won 60.7% of the games. (Note: pubeval results were updated after the pubeval player was fixed.)

120 Hidden Nodes

The final test was for 120 hidden nodes. Benchmarked against the 80-hidden node version in training:

As with the other training, I started with alpha=0.1 and dropped to alpha=0.02 after 200k runs. The network looks converged after roughly 300k training runs.

The equity difference vs the 80-node benchmark was looking quite small, so for the comparison I ran 30k games to reduce the statistical error (vs 10k games for the previous two tests). In those games the 120-node player scored -0.018ppg and won 49.0% of the games.

Conclusions

The 80-hidden node benchmark seems to be the best of the 10, 40, 80, and 120-node choices. It is clear for the 10- and 40-node versions that adding more nodes improves performance. It is somewhat surprising that moving to 120 nodes showed a decline in performance, though the difference was small. It seems like convergence in the hidden node direction is around 80 nodes for this type of setup. I will continue to use the 80-node version of Benchmark 2 as a future benchmark.


PubEval Benchmark Fixed

Found the bug.

After the fix, the Benchmark 2 player (with 80 nodes) scored +0.459ppg and won 65.3% of the games in a 10k-game match.

This is a result that seems fairly plausible. It's significantly less than 0-ply gnubg, and roughly where I read other networks have reached after a similar amount of TD training.

One note: whenever I quote a score, it's always for cubeless money games. So no doubling, but gammons count as two points and backgammons as three. Win percentage is for those same games, but counting any win as 1.

PubEval Benchmark

In my training I've been seeing very high win percentages against the pubeval benchmark - north of 80% for most of my benchmarks, and north of +0.9ppg in average score.

This struck me as somewhat suspicious: from whatever information I could find online, it seems no strategy does that well against pubeval.

Joseph Heled, one of the brains behind gnubg, was kind enough to run a benchmark on his end of 0-ply gnubg against their implementation of pubeval. In 10k games it scored +0.6296ppg and won 70.87% of the games.

So there's clearly something wrong with what I'm doing; there is no way my players are anywhere near as good as gnubg. Nikos Papahristou and Joseph both calculated the pubeval values of the possible moves after a 3-1 roll and they agree with my pubeval implementation in that case, so there must be something else wrong. Or perhaps there's something awry with how I'm running games: I'm being a bit lax with how I re-seed the RNG between independent games. That may be affecting the stats, though I have a hard time seeing how it would add such a big positive bias.

Anyways: all my pubeval comparison results are probably wrong.



Wednesday, January 18, 2012

Benchmark 2: adding backgammon outputs

The second benchmark player, Benchmark 2, is similar to Benchmark 1 except that it adds two extra output nodes: the probability of backgammon win and the probability of backgammon loss.

In addition it does some (fairly obvious) validation on the probabilities that Benchmark 1 did not do: it checks that the estimate of gammon probability is less than or equal to the win probability, that the backgammon probability is less than or equal to the gammon probability, and it properly overrides the network calculations for gammon and backgammon values in the cases where it knows the results (for example, zero probability of gammon loss if the player has already borne off a checker).
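A sketch of those checks, and of the cubeless money equity they feed into (wins count 1 point, gammons 2, backgammons 3), with illustrative names - the real player works with separate win-side and loss-side outputs, but the shape is the same:

#include <algorithm>

struct Probs
{
    double win, gammonWin, bgWin;  // P(any win), P(gammon win), P(backgammon win)
    double gammonLoss, bgLoss;     // P(gammon loss), P(backgammon loss)
};

void enforceOrdering( Probs& p, bool playerHasBorneOff, bool opponentHasBorneOff )
{
    // a gammon win is a subset of wins, a backgammon win a subset of gammon wins
    p.gammonWin = std::min( p.gammonWin, p.win );
    p.bgWin     = std::min( p.bgWin, p.gammonWin );
    // same ordering on the loss side, against the probability of any loss
    p.gammonLoss = std::min( p.gammonLoss, 1 - p.win );
    p.bgLoss     = std::min( p.bgLoss, p.gammonLoss );
    // override the net when the answer is known: no gammon or backgammon loss
    // once the player has borne off a checker, and similarly for the opponent
    if( playerHasBorneOff )   p.gammonLoss = p.bgLoss = 0;
    if( opponentHasBorneOff ) p.gammonWin  = p.bgWin  = 0;
}

double cubelessMoneyEquity( const Probs& p )
{
    // single wins are worth 1, gammons an extra 1, backgammons another 1
    return ( p.win - ( 1 - p.win ) )
         + ( p.gammonWin - p.gammonLoss )
         + ( p.bgWin - p.bgLoss );
}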

Here is a chart of training performance vs Benchmark 1 over 450k training runs. I started with the Benchmark 1 network weights and added small random backgammon->middle weights (uniform random between -0.1 and +0.1). I used alpha=0.1 for the first 200k runs and then dropped to alpha=0.02.


The blue dots are the results of 1,000-game matches against Benchmark 1, and the blue line is the 10-match moving average. It looks reasonably converged after 450k iterations.

It is somewhat surprising that it took so long to converge: I haven't checked, but based on the Benchmark 1 and Player 2 training from scratch, I would have expected training from scratch to converge in around 300k iterations, and starting from a more sensible point (the Benchmark 1 weights) should have made convergence faster. That said, training the backgammon weights is the slowest part of training, since most games do not end in a backgammon.

Benchmark 2 also uses 80 hidden nodes and the standard Tesauro inputs (less the two inputs noting whose turn it is, as usual). I added two custom inputs: one that is zero if the player can still be backgammoned and one if the player cannot; and a similar one for the opponent. These two extra inputs let the backgammon output nodes identify more exactly when they should be zero.

In a 10k-game match against Benchmark 1 it averages +0.129ppg and wins 52.4% of the games.

In a 10k-game match against Player 2 it averages +0.007ppg and wins 50.4% of the games.

In a 10k-game match against pubeval it averages +0.426ppg and wins 64.3% of the games. (Note: updated with corrected performance after I fixed the pubeval player.)

Also somewhat surprising that it is essentially equivalent in performance to Player 2, which has the extra race network and uses a one-sided bearoff database. I suspect this means that the race network is not particularly efficient for Player 2.





Sunday, January 15, 2012

Player 2: corrected

I had a bug in the board evaluation function for Player 2, where it would often return an equity of 1 in cases where it was almost certain to win a gammon or backgammon. So that was doing two things: in playing against the Benchmark 1 opponent it was occasionally making the wrong choice of optimal board; and in training it was training the networks to do the wrong thing.

I re-trained the network for 235k training runs, starting with the network weights I had from the earlier (buggy) training rather than starting from scratch.

Before re-training, Player 2 scored +0.112ppg against Benchmark 1 and won 52.0% of the games (in a 10,000-game match).

After re-training, Player 2 scored +0.134ppg against Benchmark 1 and won 52.0% of the games (also 10k games).

So it turns out the bug didn't affect the training that much; it's more that it kept Player 2 from properly taking advantage of Benchmark 1's inability to distinguish between gammons and backgammons. Playing against itself, Player 2 records about 0.8% of its games as backgammons; playing against Benchmark 1 that jumps to 10.3%. So the majority of the equity gain against Benchmark 1 comes from those backgammons: if they'd been gammons instead, the equity advantage would have dropped by 0.095 (roughly 9.5% of games, each worth one point less).

Even aside from the gammon vs backgammon advantage, it does play better than Benchmark 1 by a small amount. It may be unfair to allocate all that 0.095 equity difference to Benchmark 1's inability to distinguish between gammons and backgammons. Really I should make a new Tesauro-style benchmark that has backgammon nodes to make sure.

But I'd still expect better performance due to the race/contact network split and benchmark database for end-game races.

For reference: against pubeval it scores +0.435ppg and wins 64.6% of its games, in a 10k-game match. (Note: pubeval results were updated after the pubeval player was fixed.)

Friday, January 13, 2012

Player 2: two nets + bearoff

I trained the new player which has two networks (race & contact) and uses the one-sided bearoff databases for up to nine points and fifteen checkers. I'll call it Player 2.

It trained for 780k runs, using alpha=0.1 for the first 200k and alpha=0.02 for subsequent runs. Plotting its performance against the old benchmark (Benchmark 1 wasn't ready) it converged after about 300k training runs.

I then played it against Benchmark 1 (single network with prob of win & gammon outputs). It did not perform well: in a 10,000 game match (cubeless, of course) it won 51.5% of the games and won an average of only 0.005ppg. So effectively the same as Benchmark 1.

This is pretty surprising. Player 2 has outputs for backgammon win/loss, so should perform better in edge case games where Benchmark 1 just doesn't bother taking its checkers out of the opponent's home board, so loses a backgammon instead of a gammon. It uses a bearoff database for end game races, so should perform better there. And it splits race vs contact phases into separate networks, so should perform better in each.

I'll look at some specific examples of moves in various game phases to understand the performance a bit better.

Benchmark 1: the first strategy for benchmarking

I said before that I wanted the more complex player (race & contact networks + one-sided bearoff) to be Benchmark 1.

But that was probably a bit premature. Really, a simpler first benchmark would be a properly trained Tesauro-style single network with no bearoff database. Now that I've solved the problem with who holds the dice, I trained a simple network with three outputs (probability of win, probability of gammon win, and probability of gammon loss). It took around 200k training runs to converge very nicely, with alpha=0.1. I ran it out to 450k training runs with alpha=0.02 but saw no further improvement.

After converging it wins 62.9% of its games against pub eval, with an average equity of +0.351ppg, in a 10,000 game match. (Note: pubeval results were updated after the pubeval player was fixed.)

I played against it myself and it seems like a reasonable intermediate player. A few clearly erroneous moves but generally pretty good.

To be more clear on the specification of the Benchmark 1 player:

  • One network
  • Three output nodes: probability of any win, probability of a gammon win, and probability of a gammon loss (no backgammon probability outputs)
  • 80 hidden nodes
  • Standard Tesauro inputs, but no input for whose turn it is. The network always returns probabilities assuming that the player holds the dice (ie hasn't rolled yet). No extra inputs that encode more game-specific information.

I trained it using standard TD learning through self-play. I used lambda=0 everywhere and did not use "alpha splitting" (where you use a different alpha for the output->hidden weights vs the hidden->input weights). I used alpha=0.1 for the first 200k training runs, then dropped to alpha=0.02 for the subsequent ones (out to 450k training runs). Plotting performance against my old benchmark, it was clearly converged by around 200k training runs.
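To make the TD(0) update concrete, here's a rough sketch for a one-hidden-layer net with a single probability-of-win output and sigmoid activations. My real networks have several outputs and the code is organized quite differently, so treat this as illustrative only. During self-play the target for each position is the value of the following position (evaluated with the dice handed over), except at the end of the game, where the target is the actual result.

#include <cmath>
#include <vector>

struct Net
{
    int nInputs, nHidden;
    std::vector< std::vector<double> > wHidden; // hidden <- input weights
    std::vector<double>                wOutput; // output <- hidden weights

    Net( int nIn, int nHid )
      : nInputs( nIn ), nHidden( nHid ),
        wHidden( nHid, std::vector<double>( nIn, 0.01 ) ), // small initial weights
        wOutput( nHid, 0.01 )                              // (random in practice)
    {}

    static double sigmoid( double x ) { return 1 / ( 1 + std::exp( -x ) ); }

    // forward pass; optionally hands back the hidden activations for the update
    double evaluate( const std::vector<double>& in, std::vector<double>* hiddenOut = 0 ) const
    {
        std::vector<double> hidden( nHidden );
        for( int h = 0; h < nHidden; h++ )
        {
            double sum = 0;
            for( int i = 0; i < nInputs; i++ ) sum += wHidden[h][i] * in[i];
            hidden[h] = sigmoid( sum );
        }
        double out = 0;
        for( int h = 0; h < nHidden; h++ ) out += wOutput[h] * hidden[h];
        if( hiddenOut ) *hiddenOut = hidden;
        return sigmoid( out );
    }

    // TD(0) step: nudge this position's value toward the target, using the
    // same alpha on both layers (no alpha splitting)
    void tdUpdate( const std::vector<double>& in, double target, double alpha )
    {
        std::vector<double> hidden;
        double out  = evaluate( in, &hidden );
        double diff = target - out;
        double dOut = out * ( 1 - out ); // derivative of the output sigmoid

        for( int h = 0; h < nHidden; h++ )
        {
            double dHid = hidden[h] * ( 1 - hidden[h] );
            for( int i = 0; i < nInputs; i++ )
                wHidden[h][i] += alpha * diff * dOut * wOutput[h] * dHid * in[i];
            wOutput[h] += alpha * diff * dOut * hidden[h];
        }
    }
};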

I will use this benchmark to compare other strategies, and as a benchmark when training to look for convergence of performance.

Thursday, January 12, 2012

Symmetric design re-tested - it really doesn't work

I re-ran the comparison between the normal and symmetric networks, after fixing both their respective training so that they properly calculate board values assuming the opponent holds the dice.

This time the comparison is stark: the symmetric network does much, much worse than the regular one. Here is a plot showing the comparison results. Three series: the blue data show the performance of the normal network playing against the symmetric one; the green line shows the normal network against the benchmark; and the orange line shows the symmetric network against the benchmark.

The y-axis shows probability of win, since again, these networks have only one output. The benchmark is the same one I'm comparing the other new network against (80-node normal network with win and gammon outputs, but not trained efficiently - still, much better than pub eval).

So we can really put this idea to bed now.

Early training results: new benchmark player

After just 200k training runs at alpha=0.1, I had a player that was beating my previous best player by 0.75ppg. That previous best had a single network with win and gammon win/loss outputs, no bearoff database, 80 hidden nodes, and an input for whose turn it is (no symmetry assumption). It was trained incorrectly because of the bug I mentioned earlier, but it was still pretty good - +0.7ppg-ish against pub eval - and it got lucky and beat me in a set of 10 games once.

This new player is my new benchmark, and I think the first one that's got most everything correct. I'm naming it Benchmark 1. I know, over-brimming with creativity on the name.

I played 20 games against it myself and won 0.1ppg, but I was a bit lucky in the first half of the games (I won 0.7ppg in the first ten but lost 0.5ppg in the second ten). So even after a limited number of training runs it plays at an intermediate level (which is where I place my own meager backgammon skills). It didn't make any big errors in checker play that I noticed.

That's not many training runs, and I'm continuing to train now with more runs. I'm also trying alpha-damping, since after 200k runs it didn't seem to be getting noticeably better with more runs. At 200k runs I'm dropping alpha to 0.02, then to 0.004 at 1M runs, then to 0.0008 at 10M runs. I'm not sure how sensible that is, but I had a hard time finding any references to a good way to determine how to damp alpha for TD training.

This player uses two networks (race and contact) and one-sided bearoff databases for nine points and fifteen checkers. Both networks have outputs for probability of any win, probability of gammon win, probability of gammon loss, probability of backgammon win, and probability of backgammon loss. Both have 80 nodes in their hidden layers. The relatively large bearoff database parameter space means two things: the player is more accurate in most late races; and I can train the race network directly with supervised learning, which is more efficient than the usual TD learning that the network has to do when there's no direct benchmark for probability.

However, loading the two bearoff databases takes the process to a bit over 4GB of RAM, which even on a big machine limits the number of calculations I can run in parallel. Once the training is complete I'll try using a much smaller bearoff database to see how well the supervised learning makes the race network approximate the more exact calculation.

Also this player does not include a "crashed" network, which GnuBG does. Partly that's because I want to try something simpler to start, and partly it's because I don't really understand the definition of a crashed board in this context - that is, why the definition GnuBG uses is an interesting one in terms of a phase of play.

Wednesday, January 11, 2012

Ah ha! Another bug squashed

Once again I was burned by subtleties around who's holding the dice.

The rule the strategies use to determine the optimal move given a board layout and dice throw is to look at all the possible moves and choose the one that has the highest equity. BUT: that equity needs to be calculated assuming that the opponent holds the dice, not the player, since after the move the dice switch hands.
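In sketch form (with illustrative names, not the actual interfaces in my code):

#include <cstddef>
#include <vector>

struct Board { /* checker layout etc. */ };

// assumed helpers, for illustration only
std::vector<Board> possibleMoves( const Board& b, int die1, int die2 );
double boardValueOpponentOnRoll( const Board& b ); // equity with the opponent on roll

Board bestMove( const Board& b, int die1, int die2 )
{
    std::vector<Board> moves = possibleMoves( b, die1, die2 );
    Board  best      = b;    // no legal move: the board is unchanged
    double bestValue = -1e9;
    for( std::size_t i = 0; i < moves.size(); i++ )
    {
        // value the post-move board with the opponent holding the dice,
        // since after this move the dice change hands
        double v = boardValueOpponentOnRoll( moves[i] );
        if( v > bestValue ) { bestValue = v; best = moves[i]; }
    }
    return best;
}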

I'd been erroneously calculating the equity in those cases still assuming the player holds the dice, so no surprise that the learning wasn't converging properly.

I've fixed that and am now training a player that uses one-sided bearoff databases, a race net, and a contact net. After 31k training runs it's already beating my best-performing previous network by 0.38ppg, so that definitely looks to have been a significant error!

Results coming when I get a decent number of training results complete.

More on One-Sided Bearoff Databases

In my last post I mentioned that, when the bearoff database runs out to the full fifteen checkers, you need to build a second database to track the distribution of number of steps to take off the first checker, which is used in approximating the probability of a gammon win or loss.

A few more words on this.

First, my initial guess was that this database would be much smaller than the regular one-sided bearoff database (which tracks the distribution of the number of steps to take off the final checker), because it seemed like there should be far fewer relevant positions. In fact, that's not the case: the number of possible board layouts of fifteen checkers is large and contains many layouts where it takes more than one step to get the first checker off.

For example, for a database covering nine points and fifteen checkers, there are 1.3M board layouts in the regular one-sided database. After trimming it down to just the ones relevant for the gammon bearoff database, there are 490k layouts. Not as big, but more than 1/3 the size of the full set of layouts. The ratio of data in the two databases is actually smaller than the ratio of board layouts, however, because many of the gammon db entries have only a few entries in their probability distribution, whereas the bearoff database is close to the maximum 10 probability entries for most of the layout entries. (10 is the maximum number of points I chose to track in the probability distributions, which I estimated based on impact on the approximate equity calculations vs the exact equity calculations in the two-sided database. Anything over 8 entries did not affect the approximation for that smaller parameter space.) The file containing the gammon bearoff data is 18% of the size of the regular one.

The other point worth mentioning is that a lot of the layouts tracked in the gammon database are not ones that would reasonably come up in a game - for example, a few checkers in the home board and the rest piled up on the points just outside the home board.

That's true of many of the entries in the regular one-sided database as well. I suspect I could trim down the size of both databases significantly by running many test games with a reasonable player and including only the entries that are accessed more than once in every 1,000 games, or something like that. Occasionally in a real game we'd hit a layout that's not cached in the database, and in those rare cases we'd just calculate it on the fly.

Also, currently I store the data in a hash of string representation of board->probabilities. I suspect a fair amount of space is taken up by the string names and the hash overhead, and I'm wondering whether there's a nice way to map the board layout to an integer, where the integers run from 1->max number of board layouts without any gaps or overlaps. Then I could store the data as a standard vector and hopefully reduce memory usage.
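One standard way to do that is a combinatorial ranking of the layout. Here's a sketch (not what my code currently does) that maps a layout of checkers over nPoints points, plus however many are already borne off, to a unique index in the range [0, C(nCheckers+nPoints, nPoints)) - e.g. C(15,6) = 5,005 indices for six points and nine checkers:

#include <cstdint>
#include <vector>

// binomial coefficient C(n,k)
uint64_t choose( int n, int k )
{
    if( k < 0 || k > n ) return 0;
    uint64_t r = 1;
    for( int i = 1; i <= k; i++ ) r = r * ( n - k + i ) / i;
    return r;
}

// number of ways to distribute t checkers over k slots (nonnegative counts)
uint64_t nLayouts( int t, int k ) { return choose( t + k - 1, k - 1 ); }

// counts[0] = checkers already borne off; counts[i] = checkers on point i for
// i=1..nPoints; the counts must sum to nCheckers.
uint64_t layoutIndex( const std::vector<int>& counts, int nCheckers, int nPoints )
{
    uint64_t index = 0;
    int remaining = nCheckers;
    for( int i = 0; i <= nPoints; i++ )
    {
        int slotsAfter = nPoints - i; // slots still to be filled after slot i
        // count the layouts that share this prefix but put fewer checkers in slot i
        for( int v = 0; v < counts[i]; v++ )
            index += nLayouts( remaining - v, slotsAfter );
        remaining -= counts[i];
    }
    return index;
}

With an index like that, the probability distributions could live in a plain vector instead of a hash keyed by board strings.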

Tuesday, January 10, 2012

One-Sided Bearoff Databases

The two-sided bearoff database I put together before is beautiful because it is exact; but the size of the database quickly becomes cumbersome. In practice I've been using a database for up to six points and nine checkers, which already needs 580MB of disk space to store and around 2GB of RAM to pull in-process. That limit on the number of points and checkers means it does not get used for much of each game and does not help performance significantly.

An alternative, which is an approximation but uses a lot less memory, is to use a one-sided bearoff database. The number of positions in a two-sided database is the square of the number in a one-sided database, so a one-sided database is significantly smaller.

To give a sense of scale: a two-sided database for 6 points and 9 checkers has 25M positions. A one-sided database for 6 points and 9 checkers has 5k positions. A one-sided database for 10 points and 15 checkers has 3.3M positions; assuming roughly 10 probabilities to define the distribution for each position (I limit the distribution to a maximum of 10 points), that's around 30M numbers, so roughly the same amount of memory as the two-sided 6x9 database. That larger parameter set should cover a significant chunk of race positions, and also gives us a benchmark to directly train a race network against using supervised learning instead of TD learning (for some significant fraction of race positions).

The idea here: we look at just one player's position on the board and ignore the other player, and build up the full probability distribution of number of steps to take off all the checkers.

The approximation: when calculating the probability distribution we need to make some assumption about how to make the optimal move. In reality the optimal move will depend on the other player's layout as well, but we approximate the optimal move as the one that minimizes the expected number of steps to finish. That approximation makes the optimal move independent of the opponent's position, which lets us have a much smaller database.

Once we have the probability distribution of both players we can calculate the probability that one of the players wins the game: if that player is on move, it's the probability that their number of steps to finish is less than or equal to the opponent's number of steps. Those two probability distributions are independent because they're generated from independent dice throws, so we can write the probability of player winning the game, as seen before their roll, as:

Prob of Player Win = Sum[ PO(i) Sum[ PP(j), { j, 1, i } ], { i, 1, Infinity } ]

where PP(i) is the probability of the player taking off the last checker in i steps, and PO(i) is the probability of the opponent taking off their last checker in i steps.
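In code the sum is straightforward. A sketch, with element i of each vector holding the probability of needing exactly i steps:

#include <cstddef>
#include <vector>

// probability the player (on roll) bears off their last checker no later
// than the opponent does
double probPlayerWins( const std::vector<double>& playerSteps,
                       const std::vector<double>& opponentSteps )
{
    double prob = 0;
    for( std::size_t i = 0; i < opponentSteps.size(); i++ )
    {
        // player wins if they finish in no more steps than the opponent
        double playerCum = 0;
        for( std::size_t j = 0; j <= i && j < playerSteps.size(); j++ )
            playerCum += playerSteps[j];
        prob += opponentSteps[i] * playerCum;
    }
    return prob;
}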

Again, this is an approximation based on the assumption that, when generating the distribution of number of steps, we can choose the optimal move based on minimizing the expected number of steps to finish. We can test that approximation against the two-sided database for that smaller parameter space.

I compared the one-sided equity approximation against the exact two-sided equity for up to six points and nine checkers in all of the 9M two-sided positions. The average equity difference was +0.00019; the standard deviation of the difference was 0.0024; and the maximum and minimum differences were +0.023 and -0.026 respectively. So most of the time the approximation is accurate to within 0.003ppg, which is quite good. The average difference is statistically significant, so in principle there is a bias but in practice it is so small as to be irrelevant.

This all assumes that the probabilities of gammon and backgammon are zero, so that the equity is the usual 2 * (prob of win) - 1. Ignoring the probability of backgammon is fine, since we will not build even a one-sided database that includes points out to the opponent's home board; but the probability of gammon is an important one to track.

We can generalize by building another one-sided database that tracks the distribution of the number of steps before the first checker is taken off. That database will be significantly smaller than the regular one-sided database, since we care only about positions that have checkers outside the player's home board and exactly fifteen of the player's checkers still on the board.

I haven't checked the literature yet on this, so maybe there's a more elegant way to construct a one-sided database with a superior approximation to the one I described above.