Monday, August 20, 2012

Improved GNUbg benchmarks

The GNUbg team (in particular, Philippe Michel) has created new benchmark databases for Contact, Crashed, and Race layouts, using the same set of board positions but rolling out the equities with more accuracy. This corrects the significant errors found in the old Crashed benchmark, and improves the Contact and Race benchmarks.

They are available for download here, in the benchmarks subdirectory.

Philippe also did some work on improving the board positions included in the Crashed training database, which is available for download in the training_data subdirectory at that link.

I re-ran the statistics for several of my players, as well as for PubEval, using Player 3.6 as the most comprehensive benchmark:

[Table of results: GNUbg Contact ER, GNUbg Crashed ER, GNUbg Race ER, PubEval Avg Ppg, Player 3.6 Avg Ppg]

For the games against PubEval I ran 40k cubeless money games; standard errors are +/- 0.006ppg. For the games against Player 3.6 I ran 400k cubeless money games (for players down to Player 3.2) to get more accuracy; standard errors are +/- 0.002ppg or better. For players weaker than Player 3.2 I played 100k games against Player 3.6, since the average scores were larger; standard errors are +/- 0.004ppg.
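
The quoted +/- figures are one standard error of the mean ppg: the sample standard deviation of the per-game scores divided by the square root of the number of games. A minimal sketch (function name is mine):

```python
import math

def mean_and_se(ppg_results):
    """Average points-per-game and its standard error from a list of
    per-game scores: SE = sample standard deviation / sqrt(n games)."""
    n = len(ppg_results)
    mean = sum(ppg_results) / n
    var = sum((x - mean) ** 2 for x in ppg_results) / (n - 1)
    return mean, math.sqrt(var / n)
```

Note the scaling: going from 40k to 400k games shrinks the standard error by a factor of sqrt(10), consistent with the 0.006 vs 0.002 figures above.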

Philippe Michel was gracious enough to provide the GNUbg 0-ply scores against the newly-created benchmarks. It also seems I had the scores against the old benchmarks wrong: they were Contact 10.4, Crashed 7.72, and Race 0.589. The Contact score was close, but I had the other two significantly worse.

Sunday, August 19, 2012

Player 3.6: longer training results

I haven't had much time lately to work on this project, but while I'm engaged elsewhere I thought I'd run training for a long period and see whether it continued to improve.

It did - fairly significantly. So my players before clearly were not fully converged.

The result is my new best player, Player 3.6. Its GNUbg benchmark scores are Contact 12.7, Crashed 11.1, and Race 0.766. In 400k cubeless money games against Player 3.5 it averages +0.0027 +/- 0.0018 ppg, a small improvement.

In 40k games against Benchmark 2 it averages 0.181 +/- 0.005 ppg, and against PubEval 0.601 +/- 0.006 ppg.

For training I used supervised learning with three parallel and independent streams: one with alpha=0.01, one with alpha=0.03, and finally one with alpha=0.1. This was meant to test the optimal alpha to use.

Surprisingly, alpha=0.01 was not the best level to use: alpha=0.03 improved almost 3x as quickly. alpha=0.1 did not improve the Contact benchmark score much, but improved the Crashed benchmark score the most.

I take from this that alpha=0.03 is the best level of alpha to use for long term convergence.

We know the Crashed benchmark score is not that useful: the Crashed benchmark itself is flawed, and a multi-linear regression showed that the Crashed benchmark score has very little impact on playing strength. That said, I tried a little experiment where I used the Contact network for crashed positions in Player 3.5, and it definitely worsened performance in self-play: 0.04ppg on average. That is a significant margin at this point in the player's development.
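
The regression in question is just an ordinary least-squares fit of playing strength against the three benchmark scores; a sketch of that kind of fit (function name and setup are mine, not the original analysis):

```python
import numpy as np

def benchmark_regression(contact, crashed, race, ppg):
    """Least-squares fit of player strength (avg ppg against a fixed
    opponent) on the three benchmark ER scores, plus an intercept.
    Returns [contact, crashed, race, intercept] coefficients; a
    near-zero Crashed coefficient means that benchmark adds little."""
    X = np.column_stack([contact, crashed, race, np.ones(len(ppg))])
    coefs, *_ = np.linalg.lstsq(X, np.asarray(ppg, dtype=float), rcond=None)
    return coefs
```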

I ran 4,000 supervised learning steps in the training, for each of the three alpha levels. In each step I trained on a randomly-arranged set of Contact and Crashed training positions from the training benchmark databases. This took a month and a half. The benchmark scores were still slowly improving for alpha=0.01 and alpha=0.03, so there is still scope for improvement. I stopped just because the GNUbg folks have put out new benchmark and training databases that I want to switch to.
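
The shape of one such training step, sketched with a single-output linear model standing in for the neural network (the shuffle-and-update loop is the point; the real player updates all network weights by backpropagation):

```python
import random

def supervised_epoch(weights, training_positions, alpha):
    """One supervised-learning pass: shuffle the benchmark positions,
    then nudge the model's output toward each position's rolled-out
    target. Each position is (features, target_equity); `weights` is
    mutated in place. Linear model here is illustrative only."""
    random.shuffle(training_positions)
    for features, target in training_positions:
        out = sum(w * x for w, x in zip(weights, features))
        err = target - out
        for j, x in enumerate(features):
            weights[j] += alpha * err * x  # gradient step, size alpha
    return weights
```

Run over many epochs with a small constant alpha, this slowly converges, which matches the slow but steady benchmark improvement described above.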

Tuesday, June 12, 2012

GNUbg Crashed benchmark issue

It looks like the Crashed benchmark set in the GNUbg benchmarks isn't very accurate in some cases.

There is a thread discussing it in the GNUbg mailing list.

Interesting to know, and hopefully the GNUbg team will improve on it; but the Crashed benchmark score is not very predictive for overall gameplay, as I've discovered while comparing players of different strengths.

Monday, May 14, 2012

Player 3.5: new input, escapes from the bar

I tried another new input for contact and crashed networks: this time, the expected escapes if you have a single checker on the bar. That is, looking at the available spaces in the opponent home board and weighting the probability of landing in the space with the standard escape count from the Berliner primes calculation. It is meant to give some indication of how good or bad it'd be to get hit. I'm focusing on inputs along these lines because when looking at which positions are calculated most poorly in the benchmarks, it tends to be boards where there is a significant chance of being hit and landing behind a prime.
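
Roughly, the input can be computed along these lines (a sketch only: the flat 11/36 per-point entry chance and the escape_count placeholder are simplifications, not the exact Berliner-based calculation):

```python
def expected_escape_from_bar(home_points, escape_count):
    """Expected escape quality for a checker entering from the bar.
    home_points: list of 6 booleans, True if that point in the
    opponent's home board is open for entry. escape_count: callable
    mapping an entry point (1-6) to an escape score for the resulting
    position. 11/36 is the chance at least one die shows a given value."""
    p_single = 11.0 / 36.0
    total = 0.0
    for point, is_open in enumerate(home_points, start=1):
        if is_open:
            total += p_single * escape_count(point)
    return total
```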

This one had some success: while the improvement is still incremental, it resulted in my best player to date, Player 3.5. It is identical to Player 3.4 except for two new inputs: the input described above, one for each player.

Its GNUbg benchmark scores are Contact 13.0, Crashed 11.5, and Race 0.766. Player 3.4's scores are 13.3, 11.7, and 0.766, so Player 3.5 is noticeably better, though nothing dramatic (notably, some of the improvement is in Contact, the most important benchmark). It seems that to get a significantly stronger player I'll have to add a bunch of inputs, each of which offers a reasonably incremental benefit.

In cubeless money play against Player 3.4, it scores an average +0.0033ppg +/- 0.0021ppg in 400k games. Against PubEval it scores an average +0.592ppg +/- 0.005ppg in 100k games and wins 69.5% of the games.

Still not nearly as good as GNUbg 0-ply! But creeping closer.

To be honest I'm not really sure whether the improved performance came because of the new input or because I slightly changed the training algorithm. In this case I started with random weights for the new inputs and ran supervised learning against the GNUbg training databases (contact & crashed). And instead of bouncing back and forth between a large alpha (1) and smaller alphas, I just used a small and constant alpha of 0.03. The resulting benchmark score slowly improved over 1,100 iterations, which took several days to run.

Friday, April 27, 2012

New inputs failure: bar hit/entry probability

I've been spending a little time looking at cases where my Player 3.4 does poorly in the GNUbg contact benchmarks database, to get some feel for what new inputs I might try.

It looks like it's leaving blots too often when the opponent has a good prime blocking the way out of the home board.

So I tried two new inputs: the odds of entering the opponent's home board if there were a checker on the bar; and the odds of hitting an opponent blot in his home board if there were a checker on the bar.
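
The entry part is exact and easy to compute: with k open points in the opponent's home board, you fail to enter only when both dice show blocked values, so P(enter) = 1 - ((6-k)/6)^2. In code (function name is mine):

```python
def bar_entry_prob(open_points):
    """Probability of entering from the bar in one roll, given the
    number of open (not two-or-more opponent checkers) points in the
    opponent's home board. Both dice must land on blocked values to
    fail: P(enter) = 1 - ((6 - k) / 6)**2."""
    k = open_points
    return 1.0 - ((6 - k) / 6.0) ** 2
```

The hitting-from-the-bar input is messier (it depends on which points hold blots and whether you can enter and hit with the same roll), so it isn't sketched here.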

I tried two training approaches: first, adding random initial weights for just the four new weights (the two inputs times two players) and doing supervised learning on the GNUbg training databases; and second, starting from scratch with random weights everywhere, doing TD training through self-play and then SL on the GNUbg training databases.

The conclusion: neither worked. In both cases the new player was about the same as or a little worse than Player 3.4. So these aren't the right new inputs to add.

Back to the drawing board.

Wednesday, April 25, 2012

Jump model final version posted

I've posted a new version of my jump model for cube decision points:

This version is quite similar to the last one, with just a few elaborations added after another set of very productive discussions with Rick Janowski. He's been a huge help in pulling this together.

I doubt this paper will change again, though I'll probably revisit the model in the future with another paper, probably focusing on how to estimate the local jump volatility accurately.

Tuesday, April 17, 2012

PubEval trained using ListNet

I spent some time looking at PubEval again - not my implementation, which is fine now, but rather how Tesauro came up with it in the first place. One discussion suggests that he trained it using "comparison training", a machine learning approach he seems to have come up with - some kind of supervised learning on a manually-prepared set of benchmarks. Each benchmark (I'm assuming) was a list of moves given a starting point and a dice roll, where the moves were ordered by goodness.

I tried to reproduce this. I couldn't find any proper references to "comparison training", but there's a lot of relatively recent literature on machine learning algorithms for generating rankings, which is the same sort of thing.

We can do a lot better than Tesauro's hand crafted training set: we have the GNUbg benchmark databases that are much larger and more accurate.

So what we want is an algorithm that we can feed a training set, where each element of the set has the five boards listed for each move and the rolled-out equities for each. The inputs that define the board are the PubEval inputs, and the evaluation function should be a linear function of the inputs (like PubEval is).

Wikipedia has a nice summary of different machine learning approaches to ranking estimators.

The ListNet algorithm seems like a good choice. I implemented it and trained it on the GNUbg contact and race benchmark databases.
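
At its core, ListNet with top-one probabilities turns each training position into two probability distributions over the candidate moves, one from the rolled-out equities and one from the model's scores, and descends the cross-entropy between them. A minimal sketch for a linear evaluator like PubEval (function names are mine, not from any library):

```python
import math

def listnet_grad(weights, candidates, equities):
    """ListNet (top-one) gradient for one position. candidates: one
    feature vector per legal move; equities: rolled-out equity targets.
    Returns the gradient of the cross-entropy between the target and
    model top-one (softmax) distributions w.r.t. the linear weights."""
    def softmax(v):
        m = max(v)
        e = [math.exp(x - m) for x in v]
        s = sum(e)
        return [x / s for x in e]
    scores = [sum(w * x for w, x in zip(weights, c)) for c in candidates]
    p_target = softmax(equities)
    p_model = softmax(scores)
    grad = [0.0] * len(weights)
    for c, pt, pm in zip(candidates, p_target, p_model):
        for j in range(len(weights)):
            grad[j] += (pm - pt) * c[j]
    return grad
```

A training loop would repeatedly call this over the benchmark positions and update weights[j] -= alpha * grad[j], which pushes the model to rank high-equity moves highest.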

It fairly quickly converged to a better solution than Tesauro's original PubEval. That is, the weights I found can be plugged into the usual PubEval setup, but give a slightly stronger player. Not much better, but noticeably stronger. Not surprising given the more comprehensive training set.

The weights themselves, and the output values, were quite different to PubEval. The ListNet algorithm effectively trained the regression to approximate equity, so in this approach the board evaluations correspond to equity (though optimized for equity differences on similar boards rather than outright equity).

The GNUbg benchmark scores were: Contact 41.5, Crashed 47.5, and Race 1.90. This compares to PubEval's scores of 44.1, 49.7, and 3.54.

The weights are available on my Dropbox account: race and contact.

In 100k cubeless games (with variance reduction) against PubEval it scores +0.043ppg +/- 0.004ppg. Again, noticeably better.

Of course this is a terrible player compared to neural network players, but it's neat to be able to reproduce something like what Tesauro did with PubEval. And it was a good excuse to play around with the newer machine learning algorithms focused on ranking.

This might also be an interesting training approach for a neural network. A network trained this way would be optimized for checker play, so it would be less accurate at the absolute equity estimation required for doubling decisions. But perhaps one could have two sets of networks: one optimized for checker play, the other for doubling decisions.