I said before that I wanted the more complex player (race & contact networks + one-sided bearoff) to be Benchmark 1.
But that was probably a bit premature. Really, a simpler first benchmark would be a properly trained Tesauro-style single network with no bearoff database. Now that I've solved the problem with who holds the dice, I trained a simple network with three outputs (probability of win, probability of gammon win, and probability of gammon loss). It took around 200k training runs to converge very nicely, with alpha=0.1. I ran it out to 450k training runs with alpha=0.02, but it did not improve any further.
After converging, it wins 62.9% of its games against pubeval, with an average equity of +0.351 ppg, in a 10,000-game match. (Note: pubeval results were updated after the pubeval player was fixed.)
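To put the 10,000-game sample size in context, here is a minimal sketch of the statistical noise on the win fraction. It assumes the games are independent and uses the plain binomial standard error; the function name is just illustrative. The same calculation can't be done for the equity figure without the per-game variance, which I haven't quoted here.

```python
import math


def win_rate_std_error(win_fraction, n_games):
    """Binomial standard error of an observed win fraction over n_games.

    Assumes each game is an independent trial with the same win probability.
    """
    return math.sqrt(win_fraction * (1.0 - win_fraction) / n_games)


# Figures from the pubeval match above.
se = win_rate_std_error(0.629, 10_000)
print(f"win rate: 62.9% +/- {100 * se:.2f}% (one standard error)")
```

So differences of around half a percentage point in win rate between two 10,000-game matches are within one standard error.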
I played against it myself and it seems like a reasonable intermediate player. It makes a few clearly erroneous moves but is generally pretty good.
To be clearer on the specification of the Benchmark 1 player:
- One network
- Three output nodes: probability of any win, probability of a gammon win, and probability of a gammon loss (no backgammon probability outputs)
- 80 hidden nodes
- Standard Tesauro inputs, but no input for whose turn it is. The network always returns probabilities assuming that the player holds the dice (i.e. hasn't rolled yet). No extra inputs that encode more game-specific information.
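The specification above can be sketched roughly as follows. This is an illustration rather than my actual code. The input count of 196 is an assumption (Tesauro's usual 198 inputs minus the two turn inputs), bias nodes are left out for brevity, and the function names and the weight initialization range are made up for the example. The equity formula is the usual cubeless one when there are no backgammon outputs: P(win) - P(loss) + P(gammon win) - P(gammon loss).

```python
import math
import random

N_INPUTS = 196   # assumption: Tesauro's 198 inputs minus the 2 turn inputs
N_HIDDEN = 80    # from the specification above
N_OUTPUTS = 3    # P(win), P(gammon win), P(gammon loss)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_weights(rng, n_rows, n_cols, scale=0.1):
    """Small random weights; the range is an illustrative choice."""
    return [[rng.uniform(-scale, scale) for _ in range(n_cols)]
            for _ in range(n_rows)]


def forward(inputs, w_hidden, w_output):
    """Return (hidden activations, outputs) for one position.

    The outputs are always from the point of view of the player
    holding the dice, so there is no turn input.
    """
    hidden = [sigmoid(sum(w * x for w, x in zip(row, inputs)))
              for row in w_hidden]
    outputs = [sigmoid(sum(w * h for w, h in zip(row, hidden)))
               for row in w_output]
    return hidden, outputs


def equity(p_win, p_gammon_win, p_gammon_loss):
    """Cubeless equity in points per game, ignoring backgammons."""
    return (2.0 * p_win - 1.0) + p_gammon_win - p_gammon_loss


# Usage: evaluate a dummy (all-zero) input vector with untrained weights.
rng = random.Random(0)
w_hidden = init_weights(rng, N_HIDDEN, N_INPUTS)
w_output = init_weights(rng, N_OUTPUTS, N_HIDDEN)
_, probs = forward([0.0] * N_INPUTS, w_hidden, w_output)
print(probs, equity(*probs))
```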
I trained it using standard TD learning through playing games against itself. I used lambda=0 everywhere and did not use "alpha splitting" (where you use a different alpha for the output->hidden node weights vs the hidden->input node weights). I used alpha=0.1 for the first 200k training runs, then dropped to alpha=0.02 for the subsequent ones (out to 450k training runs). Plotting performance against my old benchmark, it had clearly converged by around 200k training runs.
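For reference, a single TD(0) weight update along the lines described above might look like the sketch below. It is a simplified illustration, not my training code: the helper names are hypothetical, bias nodes are omitted, and the self-play loop and input encoding are not shown. With lambda=0 there are no eligibility traces, so each step just nudges the network's outputs for the current position toward a target, using the same alpha for both weight layers. The target is the network's own estimate for the next position, or the actual result at the end of the game. Because the network always evaluates from the side holding the dice, the next position's outputs are from the opponent's perspective and need flipping first.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(inputs, w_hidden, w_output):
    hidden = [sigmoid(sum(w * x for w, x in zip(row, inputs)))
              for row in w_hidden]
    outputs = [sigmoid(sum(w * h for w, h in zip(row, hidden)))
               for row in w_output]
    return hidden, outputs


def flip_perspective(outputs):
    """Turn the opponent's (win, gammon win, gammon loss) into ours."""
    p_win, p_gammon_win, p_gammon_loss = outputs
    return [1.0 - p_win, p_gammon_loss, p_gammon_win]


def alpha_for_run(run_index):
    """Learning-rate schedule from the text: 0.1, then 0.02 after 200k runs."""
    return 0.1 if run_index < 200_000 else 0.02


def td0_update(inputs, target, w_hidden, w_output, alpha):
    """One TD(0) step: w += alpha * (target - output) * d(output)/dw.

    The same alpha is used for both layers (no alpha splitting).
    Weights are modified in place; the pre-update outputs are returned.
    """
    hidden, outputs = forward(inputs, w_hidden, w_output)
    # TD error times the sigmoid derivative, per output node.
    deltas = [(t - o) * o * (1.0 - o) for t, o in zip(target, outputs)]
    # Back-propagate to the hidden layer using the pre-update output weights.
    hidden_deltas = [
        h * (1.0 - h) * sum(d * w_output[k][j] for k, d in enumerate(deltas))
        for j, h in enumerate(hidden)
    ]
    for k, d in enumerate(deltas):
        for j, h in enumerate(hidden):
            w_output[k][j] += alpha * d * h
    for j, hd in enumerate(hidden_deltas):
        for i, x in enumerate(inputs):
            w_hidden[j][i] += alpha * hd * x
    return outputs


# Usage on a toy 4-input, 5-hidden-node network (sizes are illustrative only).
rng = random.Random(0)
w_hidden = [[rng.uniform(-0.1, 0.1) for _ in range(4)] for _ in range(5)]
w_output = [[rng.uniform(-0.1, 0.1) for _ in range(5)] for _ in range(3)]
position = [1.0, 0.0, 1.0, 1.0]
next_position = [0.0, 1.0, 1.0, 0.0]   # opponent now holds the dice
_, next_outputs = forward(next_position, w_hidden, w_output)
target = flip_perspective(next_outputs)
td0_update(position, target, w_hidden, w_output, alpha_for_run(0))
```

At the end of a game the target would instead be the actual outcome, for example [1.0, 0.0, 0.0] for a single win by the player whose position is being updated.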
I will use this player as the benchmark for comparing other strategies, and as a reference opponent during training to check for convergence of performance.