Saturday, February 25, 2012

Testing a new contact/crashed input: hit & cover

I tried a new input for the contact & crashed networks: like the hitting shots input, but counting only shots where the opponent can hit and then cover the hitting point with two (or more) checkers, without leaving another blot elsewhere.

I actually broke it into two pieces: one for the odds of being hit & covered in the opponent's home board, and one for everywhere else. (And of course, both inputs were applied to each player, so four new inputs.)
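To pin down the definition, here is a minimal Python sketch of the per-roll check behind the new input. All the helpers (opponent_moves, hits_blot, covers_hit_point, new_opponent_blots) are hypothetical stand-ins for my board code, and the split between home board and elsewhere is omitted:

    def hit_and_cover_prob(board):
        """Probability over the 21 distinct rolls that the opponent can hit us
        and cover the hitting point with 2+ checkers, leaving no new blot."""
        total_weight = 0
        for d1 in range(1, 7):
            for d2 in range(d1, 7):
                weight = 1 if d1 == d2 else 2  # non-doubles come up two ways
                for after in opponent_moves(board, d1, d2):
                    if (hits_blot(board, after)
                            and covers_hit_point(board, after)
                            and new_opponent_blots(board, after) == 0):
                        total_weight += weight
                        break  # count each roll at most once
        return total_weight / 36.0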

I started with a random-weights version of Player 3.2 and added the new inputs, and trained it using supervised learning on the GNUbg training set.

The player converged after 150 epochs to Contact ER 14.0, Crashed ER 12.4, and Race ER 2.08. This compares to 14.0, 12.6, and 2.08 for Player 3.2. So marginally better on Crashed ER but otherwise identical.

Since we know that Crashed ER does not matter much in terms of game play, this new player really is almost identical to Player 3.2.

The conclusion must be that these new inputs do not add noticeable value. So I will not include them going forward. The hunt continues!


Saturday, February 18, 2012

Player 3.2 1-ply benchmark

I ran my 1-ply version of Player 3.2 through the GNUbg benchmarks: Contact ER 11.0, Crashed ER 10.8, Race ER 1.67. That compares favorably to the 0-ply scores of 14.0, 12.6, and 2.08, but is still behind GNUbg's 0-ply scores (10.5, 11.0, and 1.01) on Contact and Race.

Friday, February 17, 2012

Bug in benchmark calculation: fixed

I just discovered a bug in my benchmark calculation: if the player's chosen board was not found in the list of the five best rolled-out boards in the benchmarks, it was assuming an equity error of zero instead of the worst equity error in the list. So it was making those edge cases look much better than they should have, skewing the average ERs to look a bit better than they really were.
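In code terms the fix is one line. A minimal sketch, where rollout_errors is a hypothetical map from each of the five rolled-out boards to its equity error vs the best move:

    def decision_error(chosen_board, rollout_errors):
        if chosen_board in rollout_errors:
            return rollout_errors[chosen_board]
        # Bug: this case used to return 0.0, flattering the player.
        return max(rollout_errors.values())  # corrected: charge the worst listed error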

Below is a table of the corrected benchmark results for the nine players in the original results, plus four new ones, with GNUbg 0-ply included as a reference:

Player GNUbg Contact ER GNUbg Crashed ER GNUbg Race ER PubEval Avg Ppg Benchmark 2 Avg Ppg
GNUbg 0-ply 10.5 11.0 1.01 - -
Player 3.2 14.0 12.6 2.08 0.547 0.131
Player 3.2q 33.7 26.8 2.40 0.146 -0.282
Player 3.1 14.9 14.2 2.08 0.548 0.106
Player 3 14.9 14.2 2.67 0.550 0.108
Player 2.4 18.2 19.3 1.98 0.480 0.072
Player 2.4q 38.3 41.1 4.70 0.119 -0.283
Player 2.1 18.7 19.8 2.01 0.460 0.069
Player 2 20.5 30.0 2.09 0.442 0.021
Benchmark 2 21.5 23.7 5.44 0.432 0
Benchmark 2 (10) 42.7 37.5 13.20 0.064 -0.418
Benchmark 2 (40) 26.2 25.9 5.94 0.330 -0.067
Benchmark 1 23.0 24.5 9.66 0.351 -0.101
PubEval 44.1 49.7 3.54 0 -0.437


Redoing the one-variable regressions on the corrected & expanded data:


Metric PubEval Ppg vs Contact ER PubEval Ppg vs Crashed ER PubEval Ppg vs Race ER BM2 Ppg vs Contact ER BM2 Ppg vs Crashed ER BM2 Ppg vs Race ER
Slope -0.0182 -0.0161 -0.0268 -0.0189 -0.0165 -0.0275
Intercept 0.8065 0.7638 0.4631 0.3969 0.3468 0.0362
R-Squared 98.8% 83.5% 22.5% 98.0% 80.6% 22.3%
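To read the table with a worked example: the one-variable PubEval regression predicts an average score against PubEval of 0.8065 - 0.0182 × Contact ER. For Player 3.2, with Contact ER 14.0, that gives 0.8065 - 0.0182 × 14.0 ≈ 0.55ppg, close to the 0.547ppg in the table above.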


Redoing the multivariate linear regression:

Benchmark Intercept Contact ER Slope Crashed ER Slope Race ER Slope R-Squared
PubEval 0.8037 -0.01980 +0.00133 +0.00202 98.9%
Benchmark 2 0.3945 -0.02049 +0.00209 -0.00246 98.4%
Benchmark 2 (only good players) 0.4102 -0.01756 -0.00063 -0.00694 98.5%

The main conclusion: Contact ER mostly determines cubeless money play score. The single regression of score against Crashed ER has a relatively high R^2, but that is only because Crashed ER is highly correlated with Contact ER. When properly separated with the multivariate regression it becomes clear that Contact ER is the only measure that really matters.

Player 3.2q: a new filter strategy

For multiple plies I was previously using Player 2.4q as a coarse strategy to trim down the list of moves that I need to run the more expensive 0-ply evaluation on.

I've now moved that to a new coarse strategy: Player 3.2q. This is just like Player 3.2 but with five hidden nodes instead of 120 for each of the three networks.
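The filtering step itself is cheap. A minimal sketch, where legal_moves and the strategies' evaluate methods are hypothetical stand-ins for my actual code, and the number of candidates kept (here 8) is illustrative rather than my real setting:

    def filter_moves(board, die1, die2, coarse_strategy, n_keep=8):
        """Rank all legal moves with the cheap coarse net (e.g. Player 3.2q)
        and keep only the best few for the expensive deeper evaluation."""
        candidates = legal_moves(board, die1, die2)
        ranked = sorted(candidates, key=coarse_strategy.evaluate, reverse=True)
        return ranked[:n_keep]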

It scores Contact ER 33.7, Crashed ER 24.8, and Race ER 2.40. Significantly stronger than Player 2.4q. (Benchmark scores updated after fix to benchmark calculation.)


In 100k cubeless money games against Player 2.4q it scores +0.210ppg +/- 0.004ppg.

Player 3.2: adding a crashed network

In my earlier tests I could not find a significant benefit of using a crashed network, even following the GNUbg definition of crashed.

But that was using TD learning on self-play, and I wanted to see whether the improved training using supervised learning would find a significant benefit.

Indeed it did: a noticeable benefit, so I spawned my new best player, Player 3.2, which is like Player 3.1 but adds a crashed network.

Its benchmark scores: Contact ER 14.0, Crashed ER 12.6, and Race ER 2.08 (the same as Player 3.1). (Starting to get close to GNUbg 0-ply! Extrapolating from the regression on Benchmark 2, GNUbg should beat Player 3.2 by only around 0.05ppg.) (Benchmark scores updated after fix to benchmark calculation.)

Adding the crashed network made the biggest difference to the Crashed ER, with a noticeable but smaller improvement in Contact ER. This is a bit different to the experience of the GNUbg team, who saw a relatively small improvement in Crashed ER but a big difference in Contact ER after adding a crashed network, since the contact network could focus its optimization on a smaller set of game layouts.

That said, my player's performance is still reasonably far from GNUbg's. I think there's still some work to do on extra inputs to bridge the gap.

Player 3.2 summary:
  • 120 hidden nodes.
  • Trained using supervised learning on the GNUbg training databases, starting with the Player 3.1 weights (using the contact weights as the initial guess for the crashed weights).
  • Contact, crashed, and race networks.
  • One-sided bearoff database used when both players have all checkers in their home boards.
  • Contact inputs as per Player 2.4, with Berliner prime and hitting shot inputs in addition to the original Tesauro inputs.
  • Crashed inputs are the same as contact inputs.
  • Race inputs are the original Tesauro inputs plus the 14 extra inputs added with Player 3.1.
I trained it for 128 epochs, with an alpha schedule of 1 until the 8th epoch, then 0.32 until the 20th, then 0.1 until the 60th, then 0.032 until the 100th, and then 0.01 afterward.


In 100k cubeless money games, Player 3.2 scores +0.021ppg +/- 0.004ppg against Player 3.1. 

In 40k cubeless money games against Benchmark 2 it scores +0.131ppg; the prediction from the multivariate regression is +0.152ppg.

In 40k cubeless money games against PubEval it scores +0.574ppg; the prediction from the multivariate regression is +0.547ppg.

Wednesday, February 15, 2012

Player 3.1: expanded race inputs

In the past I tried a couple extra inputs for the contact network; now I'm trying a new set of inputs for the race network.

This follows what Joseph Heled did for GNUbg's race network: instead of using a single input that represents the number of checkers borne off, using 14 separate inputs. The i'th input is 1 if the number of borne-off checkers is greater than or equal to i, and 0 otherwise. Splitting out the number into separate inputs like this is what the regular Tesauro inputs do for number of checkers on a point, and it lets the neural network discover more complex nonlinear dependencies.
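A minimal sketch of the encoding in Python (the function name is mine):

    def borne_off_inputs(n_off):
        """14 unary ('thermometer') inputs: input i is 1 if at least i checkers
        have been borne off (i = 1..14), else 0."""
        return [1.0 if n_off >= i else 0.0 for i in range(1, 15)]

    # e.g. borne_off_inputs(3) -> three 1s followed by eleven 0s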

Player 3.1 implements this expanded set of race inputs. Its contact network is the same as Player 3, so its Contact and Crashed ER are the same as that player (14.9 and 14.2). Its Race ER is improved: 2.08, vs 2.67 for Player 3. (Benchmark scores updated after fix to benchmark calculation.)

So a relatively small improvement, but a noticeable one worth around 0.004ppg in score.

Player 3.1 summary:
  • 120 hidden nodes.
  • Race network trained from scratch (uniform random weights in [-0.1,0.1]) using supervised learning on the GNUbg Race training database, not TD learning; the contact network carries over from Player 3.
  • Contact and race networks. No crashed network.
  • One-sided bearoff database used when both players have all checkers in their home boards.
  • Contact inputs as per Player 2.4, with Berliner prime and hitting shot inputs in addition to the original Tesauro inputs.
  • Race inputs are the original Tesauro inputs plus the 14 extra inputs per player as described above.
I trained it for 110 epochs, with an alpha schedule of 1 until the 8th epoch, then 0.32 until the 20th, then 0.1 until the 60th, then 0.032 until the 100th, and then 0.01 afterward. Really it converged pretty quickly - after about 20 epochs. I let it run longer to see if it would improve incrementally with smaller alpha, but it did not.




Tuesday, February 14, 2012

Player 3: training with supervised learning

I tried training a player using supervised learning on the GNUbg training databases, instead of training using the TD learning that I have used up until now.

I took a setup like Player 2.4 and trained it from scratch (uniform random weights in [-0.1,0.1]), one version with 80 hidden nodes and another with 120 hidden nodes.

I'm naming the 120-node version Player 3, since it has a significantly better performance than my previous best player. The step from TD learning to supervised learning made quite a large difference in performance, consistent with the experience of the GNUbg team.

Its benchmark scores: Contact ER 14.9, Crashed ER 14.2, and Race ER 2.67. (Benchmark scores updated after fix to benchmark calculation.)

Moving from 80 to 120 hidden nodes makes a significant difference, unlike in earlier tests on TD-trained players.

Player 3 summary:
  • 120 hidden nodes. I also trained an 80-node version which performed noticeably worse. I may occasionally refer to the 80-node version as Player 3 (80).
  • Trained from scratch (uniform random weights in [-0.1,0.1]) using supervised learning on the GNUbg training databases, not TD learning.
  • Contact and race networks. No crashed network.
  • One-sided bearoff database used when both players have all checkers in their home boards.
  • Contact inputs as per Player 2.4, with Berliner prime and hitting shot inputs in addition to the original Tesauro inputs.
  • Race inputs are the original Tesauro inputs.
The training approach: I combined the crashed & contact training sets into a single set, and kept the race training set as is. One "epoch" of training (borrowing Joseph Heled's terminology): randomly order the crashed/contact training set and train the contact net with that; then randomly order the race training set and train the race net. Alpha is kept constant for an epoch, but can vary between epochs. If Contact ER does not improve over an epoch, the training sets are (randomly) re-ordered.

I trained the 80- and 120-node versions using supervised learning on the GNUbg training databases. The schedule for alpha was 1 until epoch 8, 0.32 until epoch 30, 0.1 until epoch 60, 0.032 to epoch 100, and 0.01 afterwards.
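A minimal Python sketch of one plausible reading of that loop. Network.train_pass and contact_benchmark_er are hypothetical stand-ins for my actual code; here the training-set orderings persist between epochs and are only re-randomized when Contact ER stalls:

    import random

    def alpha_for_epoch(epoch):
        # The alpha schedule above: constant within an epoch, stepped down between them.
        if epoch < 8:   return 1.0
        if epoch < 30:  return 0.32
        if epoch < 60:  return 0.1
        if epoch < 100: return 0.032
        return 0.01

    def train_supervised(contact_net, race_net, contact_set, race_set, n_epochs):
        # crashed & contact examples are merged into contact_set up front
        random.shuffle(contact_set)
        random.shuffle(race_set)
        best_er = float('inf')
        for epoch in range(n_epochs):
            alpha = alpha_for_epoch(epoch)
            contact_net.train_pass(contact_set, alpha)  # one pass, constant alpha
            race_net.train_pass(race_set, alpha)
            er = contact_benchmark_er(contact_net)      # Contact ER after this epoch
            if er >= best_er:
                # Contact ER did not improve: randomly re-order the training sets
                random.shuffle(contact_set)
                random.shuffle(race_set)
            else:
                best_er = er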

I trained the 80-node version for 112 epochs. It scored Contact ER 14.8, Crashed ER 12.3, and Race ER 1.59.

The 120-node version was trained for 100 epochs. It scored Contact ER 13.9, Crashed ER 11.8, and Race ER 1.56 - significantly better than the 80-node version. (The scores in this and the previous paragraph are the original numbers, from before the benchmark calculation fix.)

Some head to head match results for Player 3 (the 120-node version), 100k cubeless money games (standard error 0.004ppg):
  • Against the 80-node version: +0.018ppg
  • Against Player 2.4: +0.047ppg
  • Against Benchmark 2: +0.120ppg. The estimate from the multivariate regression is +0.129ppg.
  • Against PubEval: +0.546ppg. The estimate from the multivariate regression is +0.529ppg.



Sunday, February 12, 2012

GNUbg 0-ply benchmarks reference

As a reference, here are the average error rates for GNUbg 0-ply on the three benchmark databases:

Contact: 10.5
Crashed: 11.0
Race: 1.01

The values are milli-error rates to be comparable to the values in the table in my benchmark results post (so 10.5 means 10.5e-3 = 0.0105 ppg/move average error rate).

My best player so far (Player 2.4) scores 16.4, 13.8, and 1.70 respectively. A multivariate linear regression of Benchmark 2 score against the three benchmarks predicts GNUbg 0-ply would beat my best player by +0.11ppg. Most of the difference is in the Contact benchmark. Hopefully this will see some substantial relative improvement when I train using supervised learning instead of TD learning.

Crashed and Race came from Joseph Heled's page describing the GNUbg training program, and Contact came from an email to the bug-gnubg mailing list from Øystein Schønning-Johansen.



Supervised learning using GNUbg training databases

In addition to the benchmark databases to use for evaluating a bot, the GNUbg team also constructed training databases to use for supervised learning.

There are three databases, as with the benchmarks: Contact, Crashed, and Race. Each has a list of boards and rolled-out probabilities, generated from a large sample of self-play games and real-world games sourced from FIBS.

The databases are available through FTP, and I have mirrored them at dropbox: contact, crashed, and race.

An example line from the contact file is

NIKOIJADCANIGOAHADBA 0.61959 0.24388 0.00981 0.08774 0.00183

The first element is a string representation of the board - the same one used in the benchmark databases and described in my post about those. The numbers following are probabilities: probability of any win; probability of any gammon win; probability of backgammon win; probability of any gammon loss; and probability of backgammon loss.

The probabilities assume that the player holds the dice, and were calculated from rollouts using 0-ply GNUbg.
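Parsing these lines is straightforward. A minimal Python sketch (decoding the 20-character board string itself is left out; it's described in my benchmark database post):

    def read_training_db(path):
        """Yield (board_string, probabilities) from a GNUbg training database.
        The five probabilities are: any win, any gammon win, backgammon win,
        any gammon loss, backgammon loss - for the player holding the dice."""
        with open(path) as f:
            for line in f:
                parts = line.split()
                if len(parts) != 6:
                    continue  # skip anything that isn't a training line
                yield parts[0], [float(x) for x in parts[1:]]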

This format is different to the benchmark databases: there, the data included the starting board and a roll, plus the equities of the top five best moves as determined by rollout. Here the data are just the board and the rolled-out probabilities - no dice roll involved.

The purpose of the training databases is to train neural network players using supervised learning instead of the traditional TD learning through self-play. Supervised learning should need a smaller training set than TD learning, and generating the positions from real play instead of self-play should give a (somewhat) less biased training set.

There are 605,054 positions in the Contact database; 305,146 in Crashed; and 251,862 in Race. None of the boards in the training databases appear in the benchmark databases - the benchmarks are out of sample vs the training, as required for a proper benchmark.

The GNUbg team (in particular Joseph Heled, who was a key developer of GNUbg's neural networks) notes that TD learning was only able to generate an intermediate-strength player, and to advance beyond that required using these training databases and supervised learning. So hopefully my player will also respond well!



GNUbg benchmark results

I'm moving to use the GNUbg benchmark databases as a new benchmark, but want to compare that to the benchmarking I've done in the past: playing many games against a reference player.

I looked at nine different players of varying skill and calculated the GNUbg average error rate for each, for each of the three benchmark databases (Contact, Crashed, and Race). I also ran 40k cubeless money games for each against both PubEval and my Benchmark 2 player, so that I can compare the GNUbg average error rate benchmarks against the scores in those games.

Note: the results below are a bit incorrect due to a bug in the benchmark calculation. Corrected results here.

Results:


Player GNUbg Contact ER GNUbg Crashed ER GNUbg Race ER PubEval Avg Ppg Benchmark 2 Avg Ppg
Player 2.4 16.4 13.8 1.70 0.480 0.072
Player 2.4q 32.7 34.3 3.35 0.119 -0.283
Player 2.1 16.9 14.5 1.73 0.460 0.069
Player 2 18.3 15.6 1.79 0.442 0.021
Benchmark 2 19.1 18.8 4.42 0.432 0
Benchmark 2 (10) 35.6 30.2 10.5 0.064 -0.418
Benchmark 2 (40) 22.7 20.8 4.86 0.330 -0.067
Benchmark 1 20.2 18.5 6.12 0.351 -0.101
PubEval 37.0 43.4 2.22 0 -0.437

"Benchmark 2 (10)" and "Benchmark 2 (40)" are like Benchmark 2 but with 10 and 40 hidden nodes respectively (vs 80 for the usual Benchmark 2).

The ER numbers in the table above are average equity errors, multiplied by 1,000 to make the display a bit nicer. The Avg Ppg numbers are the average scores in the 40k cubeless money game matches.

To make some sense of these results I tried running a few regressions:


Metric PubEval Ppg vs Contact ER PubEval Ppg vs Crashed ER PubEval Ppg vs Race ER BM2 Ppg vs Contact ER BM2 Ppg vs Crashed ER BM2 Ppg vs Race ER
Slope -0.0222 -0.0174 -0.0268 -0.0238 -0.0183 -0.0347
Intercept 0.8369 0.7033 0.4068 0.4519 0.3001 0.0143
R-Squared 99.1% 92.6% 17.2% 97.4% 87.6% 24.6%

One notable result: the GNUbg benchmark that is most predictive of match score (in R^2 terms) is the Contact benchmark. Crashed is also fairly good, but Race doesn't tell much. I think that's because all the strategies are relatively good in Race (compared to the other two game phases), so overall performance differences do not depend much on that phase's performance.

I also ran a multivariate linear regression of the PubEval and Benchmark 2 scores against the three benchmarks. Adding the three factors vs just Contact did not improve R^2 that much - more so for the Benchmark 2 case, but still marginal:

Benchmark Intercept Contact ER Slope Crashed ER Slope Race ER Slope R-Squared
PubEval 0.8121 -0.01562 -0.00505 -0.00417 99.3%
Benchmark 2 0.4199 -0.01323 -0.00732 -0.01340 98.6%

From now on I will use the Contact GNUbg benchmark average error per move as the standard benchmark, while also referring to the Crashed and Race results as secondary benchmarks.


Saturday, February 11, 2012

A better benchmark: the GNUbg benchmark database

Until now I've been benchmarking my players by running matches against various opponents - either my own benchmark players or standards like PubEval.

A better benchmark is the "benchmark databases" created by the folks at GNUbg. This post describes the benchmark and how to use it, and subsequent posts will show results and compare it to the other benchmarks I have been using.

They gathered many hundreds of thousands of representative positions, chose both random and interesting ones (based on 0-ply vs 2-ply equity differences) to include in a benchmark list, and then, for each chosen position, rolled out the equities of the possible moves for a single random dice roll.

The benchmark calculation is the following: for each position+dice roll in the database, use your strategy to choose a move. Then find the strategy's choice in the list of rolled-out moves and add the equity difference vs the best rolled-out move to a sum. Then at the end divide that sum by the total number of elements to get the average equity error.

This has the advantage that it does not depend on the dice rolls in a set of simulated games, so it removes an element of randomness. Plus the list of positions is a "real world" set, generated from games played by both humans and bots on FIBS. The alternative would be to choose a list of positions from bot self-play, which runs the risk of bias due to the playing style of that bot.

It does depend on the accuracy of the rolled-out equities, which were calculated with GNUbg's 0-ply player. But those should be quite accurate in most cases, especially as a broad metric of goodness of play.

The GNUbg benchmarks are split into three databases: crashed, contact, and race. They are available via FTP. I have also mirrored them at dropbox: crashed, contact, and race. There are 100,000 position+move elements for crashed positions; 107,485 for contact; and 111,008 for race.

The files have quite a lot of information about the included positions, but the subset you need for the benchmark are lines that look like:

m OAHDPAABDAOAHDPAABDA 2 4 JIGHPAABDAOAHDPAABDA 0.133905 OAHDMJABDAOAHDPAABDA 0.0835662 OAHDOEABCBOAHDPAABDA 0.111429 OAHDOBABCEOAHDPAABDA 0.124954 OAHDPAABAJOAHDPAABDA 0.189375

Any line that starts with an 'm' is a move line, and those are the relevant ones for this calculation. The strings are board representations (more on that later). The first string is the representation of the starting board; the next two numbers are the dice rolls. So in the example above, the roll is a 2 and a 4. The next two elements represent the best move from the rollout: the string is the board representation and the number is the equity in that position. The rest of the line represents the other moves, ordered by equity. The string is the board representation and the number is the equity difference vs the best move. So the second element in the list of moves is the board represented by the string OAHDMJABDAOAHDPAABDA and its equity is 0.0835662 worse than the best move.

All equities in the benchmark files are cubeless.
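A minimal Python sketch of parsing a move line into the pieces described above (board strings are kept as opaque strings here):

    def parse_move_line(line):
        """Parse an 'm' line into (start_board, die1, die2, moves), where moves
        is a list of (board_string, equity_error_vs_best) with the best move
        first at error 0.0. Note the file stores the best move's absolute
        equity, then equity differences for the rest."""
        parts = line.split()
        assert parts[0] == 'm'
        start, die1, die2 = parts[1], int(parts[2]), int(parts[3])
        moves = [(parts[4], 0.0)]  # best move; parts[5] is its absolute equity
        for i in range(6, len(parts), 2):
            moves.append((parts[i], float(parts[i + 1])))
        return start, die1, die2, moves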

Only the top five moves are included in the list, and it is possible that the strategy's choice is outside that list. If this happens, I assumed that the equity error for the strategy's choice is equal to the worst of the rolled-out equities. This makes the average error look a bit better than it really should, but hopefully this doesn't happen very often.
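Putting it together, a minimal sketch of the full benchmark loop, including that fallback. The strategy.choose_board method is a hypothetical stand-in for the move-selection code:

    def benchmark_average_error(lines, strategy):
        """Average equity error per move decision over all 'm' lines."""
        total_error, n_decisions = 0.0, 0
        for line in lines:
            if not line.startswith('m '):
                continue
            start, die1, die2, moves = parse_move_line(line)
            chosen = strategy.choose_board(start, die1, die2)
            errors = dict(moves)
            # If the chosen board is outside the top-five list, charge the
            # worst listed error (it could really be even worse than that).
            total_error += errors.get(chosen, max(errors.values()))
            n_decisions += 1
        return total_error / n_decisions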

The string board representation is a way of serializing a board layout. It starts with the GNUbg position key and translates the resulting 10-byte number into a 20-character string so it can be represented as text. This is not the usual GNUbg position ID - it uses a different serialization. The translation from that string to the board position is slightly complex but the code is in the open source GNUbg repository, in gnubg.c, function ParsePosition. (Thanks to Philippe Michel and Michael Petch for pointing this out to me.)

Other lines in the benchmark files show GNUbg system settings; details of the rollouts for positions (lines starting with '#R' show the rolled-out probability of win, probability of gammon win, probability of backgammon win, probability of gammon loss, and probability of backgammon loss); and benchmark cube decisions (lines starting with 'o'). Those are not required for the benchmark calculation described above.

Wednesday, February 8, 2012

Evaluating a "crashed" network

Player 2.4, my best player so far, has two different neural networks for two phases of the game: race and contact.

The idea behind multiple networks is that it lets them specialize in strategies for different game phases. 

Other neural network players, for example GNUbg, or the reference network in this academic study, include more networks than just contact and race. A common one is for the "crashed" phase of the game, when the player is bearing in checkers against an opponent's anchor or blot and has to "crash" his points as the checkers come in.

UPDATE: as Ian Shaw pointed out to me, this is not actually what a crashed network is about. It's more about having most of your checkers on your 1 and 2 points, which means you have a lot less flexibility when trying to race any remaining checkers back home.

I tried training a player which is like Player 2.4 but with a separate network for crashed boards. I tried this for a few different definitions of "crashed" (a sketch of the first two definitions follows the list):

  • Contact, plus at least one player has all their pieces at their nine point or closer. So getting close to bearing in, but still with a risk of getting hit. This covers quite a lot of game states. Probably too many, since there is no noticeable improvement against Player 2.4 - that is, adding the crashed net does not improve performance, within about 0.007ppg standard error.
  • Contact, plus at least one player has borne off at least six checkers. So further along in the bearing off process where you are already being forced to significantly dismantle any barricade. Also: no evidence for any improvement in performance.
  • GNUbg's crashed definition. This is a bit more complex, designed around making sure that any layout that starts in crashed always remains in crashed. But basically it's one where at least nine checkers have been taken off. Also: no evidence for improvement in performance.
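A minimal sketch of the first two definitions as board predicates, assuming a hypothetical Board type with checkers(player, point) (points numbered 1-24 from that player's own perspective), borne_off(player), and an is_contact test; checkers on the bar are ignored for brevity:

    def crashed_nine_point(board):
        """Definition 1: contact, plus at least one player has every checker
        on their nine point or closer to home."""
        if not is_contact(board):
            return False
        def all_within_nine(player):
            return all(board.checkers(player, point) == 0 for point in range(10, 25))
        return all_within_nine(0) or all_within_nine(1)

    def crashed_six_off(board):
        """Definition 2: contact, plus at least one player has borne off
        at least six checkers."""
        return is_contact(board) and max(board.borne_off(0), board.borne_off(1)) >= 6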
For all three cases, training started with Player 2.4, using its contact network as the starting point for the crashed network. Training started with alpha=0.02, then dropped to 0.0063 at 100k training runs. The benchmark was 40k cubeless money games against Player 2.4. If that showed no improvement by 250k training steps then I counted it as a failure.

So despite trying a few different approaches, I see no benefit of adding a crashed network on top of the race and contact networks.