After just 200k training runs using an alpha of 0.1, I had a player that was beating my previous best player by 0.75ppg. That previous-best player had a single network with win and gammon win/loss outputs, no bearoff database, 80 hidden nodes, and an input for whose turn it is (no symmetry assumption). It was trained incorrectly because of the bug I mentioned earlier, but it was still pretty good - roughly +0.7ppg against pub eval - and it got lucky and beat me in a set of 10 games once.
This player is my new benchmark, and I think the first one that gets most everything right. I'm naming it Benchmark 1. I know - brimming over with creativity on the name.
I played 20 games against it myself and won 0.1ppg, but I was a bit lucky in the first half of the games (I won 0.7ppg in the first ten but lost 0.5ppg in the second ten). So even after a limited number of training runs it plays at an intermediate level (which is where I place my own meager backgammon skills). It didn't make any big errors in checker play that I noticed.
That's not many training runs, and I'm continuing to train with more. I'm also trying alpha damping, since after 200k runs the player didn't seem to be getting noticeably better. At 200k runs I drop alpha to 0.02, then to 0.004 at 1M runs, then to 0.0008 at 10M runs. I'm not sure how sensible that is, but I had a hard time finding any references on a good way to damp alpha for TD training.
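In rough Python pseudocode, the damping schedule looks like the sketch below (the function name and the schematic update line at the end are just illustrative, not my actual training code):

    def damped_alpha(run_index):
        """Step-wise learning-rate schedule: start at 0.1 and divide
        alpha by 5 at fixed training-run counts."""
        if run_index < 200_000:
            return 0.1
        elif run_index < 1_000_000:
            return 0.02
        elif run_index < 10_000_000:
            return 0.004
        else:
            return 0.0008

    # Inside the training loop each self-play game uses the current alpha:
    # alpha = damped_alpha(game_number)
    # weights += alpha * td_error * gradient   # schematic TD weight update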
This player uses two networks (race and contact) and one-sided bearoff databases covering nine points and fifteen checkers. Both networks have outputs for the probability of any win, probability of gammon win, probability of gammon loss, probability of backgammon win, and probability of backgammon loss. Both have 80 nodes in their hidden layers. The relatively large bearoff database coverage means two things: the player is more accurate in most late races, and I can train the race network directly with supervised learning, which is more efficient than the usual TD learning the network has to fall back on when there's no exact benchmark probability to train against.
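Roughly, the supervised training of the race network looks like the sketch below. The names (bearoff_win_probability, net.evaluate, net.update) are hypothetical stand-ins for the real classes, and I'm ignoring the gammon/backgammon outputs to keep it short:

    import random

    def train_race_net_supervised(net, race_positions, bearoff_win_probability, alpha=0.1):
        """Supervised sketch: instead of bootstrapping a TD target from the
        network's own next-state estimate, use the exact win probability
        from the one-sided bearoff database as the training target."""
        random.shuffle(race_positions)
        for pos in race_positions:
            target = bearoff_win_probability(pos)  # exact value from the database
            estimate = net.evaluate(pos)           # network's current win estimate
            net.update(pos, alpha * (target - estimate))  # gradient step toward the target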
However, loading the two bearoff databases takes a bit over 4GB of RAM per process, which even on a big machine limits the number of calculations I can run in parallel. Once training is complete I'll try using a much smaller bearoff database to see how well the supervised learning gets the race network to approximate the more exact calculation.
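For a sense of the scale: the number of distinct one-sided positions with up to fifteen checkers spread over nine points is C(24,9), about 1.3 million per side (the 4GB figure then depends on what gets stored per position, which I won't break down here):

    from math import comb

    # Ways to distribute 15 indistinguishable checkers over 9 points plus
    # the "borne off" slot - i.e. one-sided positions in the database.
    positions_per_side = comb(15 + 9, 9)
    print(positions_per_side)  # 1307504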
Also, this player does not include a "crashed" network, which GnuBG does. Partly that's because I want to start with something simpler, and partly it's because I don't really understand the definition of a crashed board in this context - that is, why the definition GnuBG uses marks out an interesting phase of play.
As per a later post: I'm actually scrapping the idea of making this strategy the first benchmark in favor of a simpler Tesauro-style network strategy.