Monday, August 20, 2012

Improved GNUbg benchmarks

The GNUbg team (in particular, Philippe Michel) has created new benchmark databases for Contact, Crashed, and Race layouts, using the same set of board positions but rolling out the equities with more accuracy. This corrects the significant errors found in the old Crashed benchmark, and improves the Contact and Race benchmarks.

They are available for download here, in the benchmarks subdirectory.
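For context on how the ER numbers below are produced: a player is scored by replaying each benchmark position, comparing the equity of its chosen move against the rolled-out equity of the best move, and averaging the loss. A minimal sketch, assuming a simplified benchmark layout and a scaling by 1,000 to match the magnitudes quoted below (both assumptions on my part, not the actual GNUbg file format):

    def benchmark_er(player, benchmark):
        """Average equity error per move decision, scaled by 1,000.

        'benchmark' is assumed to be a list of (position, move_equities)
        pairs, where move_equities maps each legal move to its rolled-out
        equity -- a stand-in for the real GNUbg benchmark format.
        """
        total_loss = 0.0
        for position, move_equities in benchmark:
            chosen = player.choose_move(position)   # hypothetical player API
            best = max(move_equities.values())      # equity of the rollout's best move
            total_loss += best - move_equities[chosen]
        return 1000.0 * total_loss / len(benchmark)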

Philippe also did some work on improving the board positions included in the Crashed training database, which is available for download in the training_data subdirectory at that link.

I re-ran the statistics for several of my players, as well as for PubEval. I also include average scores in games against Player 3.6, as the most comprehensive benchmark.

Player       | GNUbg Contact ER | GNUbg Crashed ER | GNUbg Race ER | Avg ppg vs PubEval | Avg ppg vs Player 3.6
GNUbg 0-ply  | 10.5             | 5.89             | 0.643         | 0.63               | N/A
Player 3.6   | 12.7             | 9.17             | 0.817         | 0.601              | 0.0
Player 3.5   | 13.1             | 9.46             | 0.817         | 0.597              | -0.0027
Player 3.4   | 13.4             | 9.63             | 0.817         | 0.596              | -0.0119
Player 3.3   | 13.4             | 9.89             | 0.985         | 0.595              | -0.0127
Player 3.2   | 14.1             | 10.7             | 2.14          | 0.577              | -0.041
             | 33.7             | 26.2             | 2.45          | 0.140              | -0.466
             | 18.2             | 21.7             | 2.05          | 0.484              | -0.105
             | 21.6             | 23.2             | 5.54          | 0.438              | -0.173
             | 41.7             | 50.5             | 2.12          | 0.048              | -0.532
PubEval      | 44.2             | 51.3             | 3.61          | 0                  | -0.589

For the games against PubEval I ran 40k cubeless money games; standard errors are +/- 0.006ppg. For the games against Player 3.6 I ran 400k cubeless money games for the players down to Player 3.2, to get more accuracy; standard errors there are +/- 0.002ppg or better. For players worse than Player 3.2 I ran 100k games against Player 3.6, since their average scores were larger; standard errors are +/- 0.004ppg.
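Those error bars follow from the usual standard-error formula SE = sigma/sqrt(n), where sigma is the per-game standard deviation of the score. A quick check, assuming sigma is roughly 1.2ppg for cubeless money play (my assumption; the true figure varies with the matchup):

    from math import sqrt

    sigma = 1.2  # assumed per-game std dev of score (ppg) for cubeless money play

    # SE = sigma / sqrt(n) for an average over n independent games.
    for n in (40_000, 100_000, 400_000):
        print(f"{n:>7,} games -> SE ~ {sigma / sqrt(n):.4f} ppg")
    # ~0.006 ppg at 40k, ~0.004 at 100k, ~0.002 at 400k, matching the quoted errors.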

Philippe Michel was gracious enough to provide the GNUbg 0-ply scores against the newly-created benchmarks. It also turns out that I had GNUbg's scores against the old benchmarks wrong: they were actually Contact 10.4, Crashed 7.72, and Race 0.589. The Contact score was close to what I had, but my figures for the other two were significantly off.



Sunday, August 19, 2012

Player 3.6: longer training results

I haven't had much time lately to work on this project, but while I'm engaged elsewhere I thought I'd run training for a long period and see whether it continued to improve.

It did, fairly significantly, so clearly my earlier players were not fully converged.

The result is my new best player, Player 3.6. Its GNUbg benchmark scores are Contact 12.7, Crashed 11.1, and Race 0.766. In 400k cubeless money games against Player 3.5 it averages 0.0027ppg +/- 0.0018 ppg, a small improvement.

In 40k games against Benchmark 2 it averages 0.181 +/- 0.005 ppg, and against PubEval 0.601 +/- 0.006 ppg.

For training I used supervised learning with three parallel and independent streams: one with alpha=0.01, one with alpha=0.03, and one with alpha=0.1. This was meant to determine the optimal alpha to use.
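In outline, the three streams look like the sketch below; base_network, batches, and train_step are stand-ins for my actual network and update code, so this is a shape-of-the-loop sketch rather than the real implementation.

    import copy

    ALPHAS = (0.01, 0.03, 0.1)

    def run_streams(base_network, batches, train_step):
        # Three independent copies of the starting network, one per alpha;
        # every stream sees the same data, but weights are never shared.
        streams = {alpha: copy.deepcopy(base_network) for alpha in ALPHAS}
        for batch in batches:
            for alpha, net in streams.items():
                for position, target in batch:
                    train_step(net, position, target, alpha)  # one SL update
        return streams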

Surprisingly, alpha=0.01 was not the best level to use: alpha=0.03 improved almost 3x as quickly. alpha=0.1 did not improve the Contact benchmark score much, though it improved the Crashed benchmark score the most.

I take from this that alpha=0.03 is the best level of alpha to use for long term convergence.

We know the Crashed benchmark score is not that useful: the Crashed benchmark itself is flawed, and a multi-linear regression showed that the Crashed benchmark score has very little impact on overall playing performance. That said, I tried a little experiment where I used the Contact network for crashed positions in Player 3.5, and it definitely worsened performance in self-play: 0.04ppg on average. That is a significant margin at this point in the player's development.
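Concretely, the experiment amounts to changing which network evaluates crashed positions. A minimal sketch, where is_race and is_crashed are hypothetical position classifiers and nets maps class names to networks (none of these names are from my actual code):

    def evaluate(position, nets, is_race, is_crashed, use_crashed_net=True):
        """Route a position to the network trained for its class.

        Passing use_crashed_net=False reproduces the experiment: crashed
        positions fall through to the Contact network instead."""
        if is_race(position):
            return nets["race"].equity(position)
        if use_crashed_net and is_crashed(position):
            return nets["crashed"].equity(position)
        return nets["contact"].equity(position)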

I ran 4,000 supervised learning steps in the training, for each of the three alpha levels. In each step I trained on a randomly-arranged set of Contact and Crashed training positions from the training databases. This took a month and a half. The benchmark scores were still slowly improving for alpha=0.01 and alpha=0.03, so there is still scope for improvement; I stopped only because the GNUbg folks have put out new benchmark and training databases that I want to switch to.
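Each step itself amounts to shuffling the combined Contact and Crashed position sets and nudging the network toward each rollout target. A sketch, with forward/backprop as hypothetical stand-ins for the real network interface:

    import random

    def sl_step(net, contact_data, crashed_data, alpha):
        """One supervised-learning step over a shuffled mix of the Contact
        and Crashed training positions. Each item is a (position, target)
        pair, with the target taken from the rollout data."""
        data = list(contact_data) + list(crashed_data)
        random.shuffle(data)  # randomly arrange the combined set
        for position, target in data:
            output = net.forward(position)        # hypothetical net API
            net.backprop(target - output, alpha)  # step toward the rollout target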