I tried training a player using supervised learning on the GNUbg training databases, instead of the TD learning I have used up until now.
I took a setup like Player 2.4 and trained it from scratch (uniform random weights in [-0.1,0.1]), one version with 80 hidden nodes and another with 120 hidden nodes.
I'm naming the 120-node version Player 3, since it performs significantly better than my previous best player. The step from TD learning to supervised learning made quite a large difference in performance, consistent with the experience of the GNUbg team.
Its benchmark scores: Contact ER 14.9, Crashed ER 14.2, and Race ER 2.67. (Benchmark scores updated after fix to benchmark calculation.)
Moving from 80 to 120 hidden nodes makes a significant difference, unlike in earlier tests on TD-trained players.
Player 3 summary:
- 120 hidden nodes. I also trained an 80-node version which performed noticeably worse. I may occasionally refer to the 80-node version as Player 3 (80).
- Trained from scratch (uniform random weights in [-0.1,0.1]) using supervised learning on the GNUbg training databases, not TD learning.
- Contact and race networks. No crashed network.
- One-sided bearoff database used when both players have all checkers in their home boards (see the evaluator sketch after this list).
- Contact inputs as per Player 2.4, with Berliner prime and hitting shot inputs in addition to the original Tesauro inputs.
- Race inputs are the original Tesauro inputs.
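As a rough sketch of how those pieces fit together at evaluation time (the board methods, net objects, and input functions here are placeholders I'm assuming, not the actual code):

```python
def evaluate(board):
    """Pick the evaluator for a position, per the Player 3 setup above."""
    if board.all_checkers_home(player=0) and board.all_checkers_home(player=1):
        # Both players have all checkers in their home boards: use the
        # one-sided bearoff database.
        return bearoff_db.equity(board)
    elif not board.contact():
        # Pure race: race network with the original Tesauro inputs.
        return race_net.evaluate(race_inputs(board))
    else:
        # Everything else (including crashed positions, since there is no
        # crashed net): contact network with the extended input set.
        return contact_net.evaluate(contact_inputs(board))
```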
The training approach: I combined the crashed & contact training sets into a single set, and kept the race training set as is. One "epoch" of training (borrowing Joseph Heled's terminology): randomly order the crashed/contact training set and train the contact net with that; then randomly order the race training set and train the race net. Alpha is kept constant for an epoch, but can vary between epochs. If Contact ER does not improve over an epoch, the training sets are (randomly) re-ordered.
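In rough Python the loop looks something like this (the net objects, training sets, and benchmark function are placeholders, and my reading of exactly when the re-ordering happens may differ from the real code; the alpha schedule is defined in the next sketch):

```python
import random

def train_pass(net, training_set, alpha):
    """One supervised pass over a training set: a backprop step per position."""
    for inputs, target in training_set:
        net.backprop(inputs, target, alpha)   # single-example supervised update

random.shuffle(contact_set)   # merged crashed + contact training positions
random.shuffle(race_set)
best_er = float("inf")
for epoch in range(n_epochs):
    alpha = alpha_schedule(epoch)             # constant within an epoch
    train_pass(contact_net, contact_set, alpha)
    train_pass(race_net, race_set, alpha)
    er = contact_benchmark_er(contact_net)    # Contact ER on the GNUbg benchmark
    if er < best_er:
        best_er = er
    else:
        # Contact ER did not improve this epoch: re-order the training sets.
        random.shuffle(contact_set)
        random.shuffle(race_set)
```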
I trained both the 80- and 120-node versions this way, with the same alpha schedule: 1 until epoch 8, 0.32 until epoch 30, 0.1 until epoch 60, 0.032 until epoch 100, and 0.01 afterwards.
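Written out as a function (a direct transcription of the numbers above; the exact handling of the boundary epochs is my guess):

```python
def alpha_schedule(epoch):
    """Learning rate by epoch: 1, then 0.32, 0.1, 0.032, and finally 0.01."""
    if epoch < 8:
        return 1.0
    elif epoch < 30:
        return 0.32
    elif epoch < 60:
        return 0.1
    elif epoch < 100:
        return 0.032
    else:
        return 0.01
```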
I trained the 80-node version for 112 epochs. It scored Contact ER 14.8, Crashed ER 12.3, and Race ER 1.59.
The 120-node version was trained for 100 epochs. It scored Contact ER 13.9, Crashed ER 11.8, and Race ER 1.56 - significantly better than the 80-node version.
Some head-to-head match results for Player 3 (the 120-node version), over 100k cubeless money games (standard error 0.004ppg; a quick sanity check of that error figure follows the list):
- Against the 80-node version: +0.018ppg
- Against Player 2.4: +0.047ppg
- Against Benchmark 2: +0.120ppg. The estimate from the multivariate regression is +0.129ppg.
- Against PubEval: +0.546ppg. The estimate from the multivariate regression is +0.529ppg.
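The sanity check: assuming the outcome of a single cubeless money game has a standard deviation of roughly 1.3 points (my assumption, not a measured number), 100k games gives a standard error consistent with the figure quoted above:

```python
import math

games = 100_000
sigma_per_game = 1.3                         # assumed std dev of one game's outcome, in points
std_err = sigma_per_game / math.sqrt(games)
print(round(std_err, 4))                     # ~0.0041 ppg, in line with the quoted 0.004ppg
```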
This goes forward!
It is Ian's and my experience that the best nets are those that are started with TD learning and then have their training extended with supervised training.
"Epoch" is not only Josephs term. It's the common term in the ML literature.
-Øystein
One interesting point: Joseph noted that when he ran supervised learning he used huge alphas, like 20, with minimum alphas around 0.5 or 1. That's not what I'm seeing: with big alphas the training doesn't converge. I started with alpha=1 and went down from there, and did not cycle back up to large alphas when performance stopped improving, as Joseph did.