Wednesday, March 14, 2012

Player 3.3: TD then SL

Following the suggestion of Øystein Schønning-Johansen (in comments here), I tried training my player from random weights using TD learning, then once that had converged, switching to supervised learning on the GNUbg training databases.

The GNUbg team's experience is that this converges to a better player. Indeed, looking at Player 3.2's self-play, it does have some odd features: for example, roughly 7% of its games against itself result in a backgammon, much higher than the network odds predict from the starting board.

The new player that resulted is Player 3.3. Its benchmark scores: Contact ER 13.3, Crashed ER 12.0, and Race ER 0.93. In 40k cubeless money games it scores +0.595ppg against PubEval and +0.174ppg against Benchmark 2; in 100k cubeless money games it scores +0.028ppg against Player 3.2. Its network setup is the same as Player 3.2's (contact, crashed, and race networks; 120 hidden nodes; same extended inputs).

The first step was TD learning from random weights. I ran training for over 5M iterations trying to get solid convergence. I switched to using the GNUbg benchmark scores to track training progress instead of gameplay against a reference player, since this is more accurate and quicker. Here are the learning charts for Contact, Crashed, and Race ER, starting at training run 100k to remove the initial rapid convergence, and using log scale on the x-axis:

The best benchmark scores were Contact ER 17.9, Crashed ER 21.5, and Race ER 1.79. That is worse than Player 2.4 on Crashed ER, but otherwise this is my strongest TD-trained player. In 100k cubeless money games against Player 3.2 it scored an average of -0.051ppg +/- 0.004ppg, and against Player 2.4 it scored +0.015ppg +/- 0.004ppg. In self-play the statistics of wins, gammons, and backgammons look plausible: 27% of the games end in gammon and 0.8% in backgammon.

The next step was to continue to train this player using supervised learning on the GNUbg training databases, starting with the TD-trained networks.

For a learning algorithm I used standard backpropagation on a randomly-sorted training database. I started with alpha=1 and kept it fixed from epoch to epoch unless the Contact ER benchmark score increased (that is, got worse) for three consecutive epochs, in which case I dropped alpha by a factor of sqrt(10). When alpha dropped below 0.1/sqrt(10) I popped it back up to 1 and continued. This follows what Joseph Heled described in his GNUbg training, though with alphas about 10x smaller (I tried using alpha=10 but it did not converge at all). The idea here is that if the training gets stuck in a local minimum and stops improving, a larger alpha can kick it out of that minimum so it can try again. Of course the best-performing networks are always saved so you can return to them. Once the player stopped improving I ran it for several hundred epochs at smaller alpha to help it converge.
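The alpha schedule described above can be sketched in Python roughly as follows. This is only an illustration of my reading of the rule, not the actual training code: the class name `AlphaSchedule`, its `update` method, and the choice to count "three consecutive epochs" as three epoch-over-epoch rises in Contact ER are all assumptions.

```python
import math

ALPHA_START = 1.0                     # starting (and reset) learning rate
DROP = math.sqrt(10.0)                # factor by which alpha is reduced
ALPHA_FLOOR = 0.1 / math.sqrt(10.0)   # below this, alpha pops back to ALPHA_START
PATIENCE = 3                          # consecutive worsening epochs before a drop


class AlphaSchedule:
    """Hypothetical helper that tracks Contact ER and adjusts alpha."""

    def __init__(self):
        self.alpha = ALPHA_START
        self.prev_score = None
        self.rises = 0                # consecutive epochs where Contact ER went up
        self.best_score = math.inf    # lowest Contact ER seen so far

    def update(self, contact_er):
        """Record one epoch's Contact ER and return alpha for the next epoch."""
        if contact_er < self.best_score:
            # A real trainer would save the network weights at this point.
            self.best_score = contact_er

        if self.prev_score is not None and contact_er > self.prev_score:
            self.rises += 1
        else:
            self.rises = 0
        self.prev_score = contact_er

        if self.rises >= PATIENCE:
            self.rises = 0
            self.alpha /= DROP
            # Small relative tolerance so that rounding at alpha == ALPHA_FLOOR
            # does not trigger the reset one step early.
            if self.alpha < ALPHA_FLOOR * (1.0 - 1e-9):
                self.alpha = ALPHA_START
        return self.alpha
```

With a steadily worsening score, alpha steps through 1, 1/sqrt(10), 0.1, and 0.1/sqrt(10), then returns to 1 on the next drop. A different reading of the rule (for example, "no new best score in three epochs") would only change how `rises` is counted.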

After this training the best benchmark scores were Contact ER 13.3, Crashed ER 12.0, and Race ER 0.93. That is a significant improvement over the TD-trained player, which is not surprising, but it is also better than my best SL-trained player, Player 3.2 (which scored 14.0, 12.8, and 2.08).

In addition, the self-play statistics look more realistic. In 10k self-play games, 26.9% were gammons of any kind and 1.4% were backgammons.
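As a rough guide to how much sampling noise sits in those percentages, here is a small sketch of the binomial standard error for a fraction measured over 10k games. It assumes the self-play games are independent of each other; the function name `proportion_std_error` is just for illustration.

```python
import math


def proportion_std_error(p, n):
    """Standard error of an observed fraction p over n independent games."""
    return math.sqrt(p * (1.0 - p) / n)


n_games = 10_000
for label, p in [("gammons of any kind", 0.269), ("backgammons", 0.014)]:
    se = proportion_std_error(p, n_games)
    print(f"{label}: {100 * p:.1f}% +/- {100 * se:.2f}%")
```

Under that assumption the uncertainty is about 0.4 percentage points on the gammon fraction and about 0.1 on the backgammon fraction.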

It beats Player 3.2 by 0.028ppg, which is significantly more than any of the multivariate regressions would predict.

The player's benchmark performance is still significantly worse than GNUbg's, except in Race ER, where it is marginally better. Contact ER, the most important benchmark, is much worse. It looks like there's still work to do on better inputs.

That said, I'm switching focus a bit to the cube handling side, so Player 3.3 will likely be my best cubeless player for a while.
