Sunday, July 31, 2011

Second training approach

The next step in improving the learning was to include backtracking and alpha/beta decay.

The idea: if the network drifts into a suboptimal part of parameter space because the learning rates (alpha and beta) are too large, step back to where performance was good and reduce alpha and beta.

The algorithm in practice is:
  • Every 1,000 learning iterations, run the same 200 test games against the pubEval benchmark.
  • Keep track of the best performance (i.e. the highest win fraction, since that's what the net is optimizing).
  • If a subsequent set of 200 test games gives a win fraction more than a threshold below that rolling maximum, and alpha and beta are above specified minimum levels, step back to the weights & eligibility traces from the maximum point and reduce alpha and beta by a fixed fraction.
  • If alpha and beta fall below the minimum after being reduced, set them to the minimum.
In principle this should give better learning behavior than without the backtracking.
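Roughly, the check that runs every 1,000 iterations looks like the sketch below. This is just a sketch with made-up names (backtrackCheck, weights, traces, and so on), not the actual code:

#include <vector>
#include <algorithm>

// Hypothetical sketch of one backtracking check, run every 1,000 training iterations.
// All names here are placeholders, not the real class or function names.
void backtrackCheck( double winFrac, double & bestWinFrac,
                     std::vector<double> & weights, std::vector<double> & bestWeights,
                     std::vector<double> & traces,  std::vector<double> & bestTraces,
                     double & alpha, double & beta,
                     double threshold, double decay, double alphaMin, double betaMin )
{
    if( winFrac > bestWinFrac )
    {
        // new rolling maximum: remember the weights and eligibility traces
        bestWinFrac = winFrac;
        bestWeights = weights;
        bestTraces  = traces;
    }
    else if( winFrac < bestWinFrac - threshold && ( alpha > alphaMin || beta > betaMin ) )
    {
        // performance dropped too far below the maximum: step back and slow the learning down
        weights = bestWeights;
        traces  = bestTraces;
        alpha   = std::max( alpha * decay, alphaMin );
        beta    = std::max( beta  * decay, betaMin );
    }
}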

The results for 40 hidden nodes:


and for 20 hidden nodes:


Not a whole lot of difference between 20 and 40 nodes here, and it seems pretty converged. Compared to the previous results without backtracking there's less scatter, but the outright performance doesn't improve appreciably.

Note: the results above are invalid because I had a bug in my pubEval implementation. Fixing that made the pubEval player much stronger. Check out the Jan 2012 posts for believable results.



First training approach

The first training session started with lambda=0, alpha=beta=0.5, and ran for 200,000 iterations.

I ran it for two nets: one with 20 hidden nodes, and the other with 40.

This is the simplest approach - it does not decay alpha or beta, and it does not backtrack when it finds the performance suffering versus the rolling optimum. But it is an interesting benchmark.
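As a rough sketch (with made-up names - strategytdgammon, benchmarkVsPubEval and the setter methods are placeholders, not the actual code), the driver loop is just self-play plus a periodic benchmark:

// Hypothetical driver for the simplest approach: fixed alpha and beta, no backtracking.
strategytdgammon net( 40 );             // placeholder name; 40 hidden nodes
net.setAlpha( 0.5 );
net.setBeta( 0.5 );
net.setLambda( 0. );

for( int i=0; i<200000; i++ )
{
    game g( &net, &net, i );            // self-play game, seeded with the iteration number
    g.stepToEnd();                      // the TD weight updates happen inside the strategy

    if( ( i + 1 ) % 1000 == 0 )
        benchmarkVsPubEval( net, 200 ); // placeholder: 200 games vs pubEval, record avg ppg
}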

200,000 iterations is also not that much - it's a good start and comparable to some of the original TD training that Tesauro did, but really we want to train the nets for millions of iterations. But at least we can get some idea of how it's doing.

Every 1,000 iterations of the training the simulation runs a benchmark 200 games against the pubEval benchmark. The chart below shows the results for 40 hidden nodes. The x-axis is iteration number and the y-axis is average points per game in the 200-game benchmark matches.

Note that before training the program does terribly - losing 1.075 points per game against pubEval. Very quickly (in just a few thousand iterations) it performs roughly at pubEval level and seems to converge to around +0.25ppg, with some scatter. The best benchmark performance was +0.615ppg.

Here are the results for 20 hidden nodes:

Similar kind of results to 40 hidden nodes, really, though with more scatter toward the end of the simulation.

The frustrating thing here: while I was putting this together and getting the machinery ready to save the results, I'm pretty sure I had the 40-node network winning around 75% of its games at +0.6ppg. But I can't reproduce that now! I thought I got there starting with alpha=beta=0.5, but I don't see it now. Maybe I did more steps... That said, the results above are pretty consistent with other results against pubEval published in various articles on TD gammon. I was also fiddling with the pubEval strategy, so maybe I fixed a bug there that was making it artificially weak.

Note: the results above against pubEval are invalid - I had a bug in my pubEval implementation. Fixing that bug makes it a much stronger player. Check out the Jan 2012 posts for believable results.

TD Gammon neural network strategy

Once I got the basic backgammon framework working, I needed to build a player that uses the TD gammon neural net to make board value decisions, and that can evolve itself through training.

In terms of the TD gammon setup itself, the best source online I found is actually an assignment from a computer science class at Rice University. It goes through the basic learning algorithm in some detail and gives some tips on how to make it work.

For inputs I followed the standard Tesauro setup that doesn't encode anything about backgammon strategy - it is purely a sensible way of encoding the board layout. That's probably the coolest thing about the neural net players: very little game wisdom was put in by human experts, which means the trained nets don't suffer from any biases human players may have. In the case of backgammon that turned out to be a huge deal: the neural net players ended up superior to all human players, and ended up changing the way humans played.
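For reference, the sketch below shows roughly what that encoding looks like, using the board interface described in the framework post further down. The bornIn()/otherBornIn() method names are placeholders, and I've left out the turn inputs:

#include <vector>

// Sketch of a Tesauro-style input encoding: four units per point per player,
// plus scaled counts for checkers on the bar and checkers borne in.
std::vector<double> boardInputs( const board & b )
{
    std::vector<double> inputs;
    for( int i=0; i<24; i++ )
    {
        int n = b.checker( i );
        inputs.push_back( n >= 1 ? 1. : 0. );             // at least one checker
        inputs.push_back( n >= 2 ? 1. : 0. );             // a made point
        inputs.push_back( n >= 3 ? 1. : 0. );             // at least three checkers
        inputs.push_back( n > 3 ? ( n - 3 ) / 2. : 0. );  // excess above three, scaled
    }
    for( int i=0; i<24; i++ )
    {
        int n = b.otherChecker( i );                      // same again for the opponent
        inputs.push_back( n >= 1 ? 1. : 0. );
        inputs.push_back( n >= 2 ? 1. : 0. );
        inputs.push_back( n >= 3 ? 1. : 0. );
        inputs.push_back( n > 3 ? ( n - 3 ) / 2. : 0. );
    }
    inputs.push_back( b.hit() / 2. );          // checkers on the bar
    inputs.push_back( b.otherHit() / 2. );
    inputs.push_back( b.bornIn() / 15. );      // placeholder method: checkers borne in
    inputs.push_back( b.otherBornIn() / 15. ); // placeholder method
    return inputs;
}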

The backgammon framework

A neural net player needs a backgammon framework to play in, so the first thing to do is to build that framework. I chose to do this in C++ because a) I know it, and b) it's faster in execution than Python, which is my usual choice for coding. And execution speed is important here because training a neural net can take many millions of iterations, where each iteration is a full game.

The basic setup:

A board class

This tracks where each player's 15 checkers are on the board. It does not know how to play a step in the game or any rules. It just has a vector for checkers by position for each side; how many pieces on each side have been hit and sent to the bar; and how many have been borne in. 

Importantly, it has a flag called perspective, which records whether the board is viewed from the perspective of player 0 or player 1. All the methods that get checkers etc. then report from the perspective of that player.

So for example there is a method hit(), which returns the number of checkers on the player's side that have been hit. If perspective is 0, that corresponds to player 0; if perspective is 1, it corresponds to player 1. Another method, otherHit(), returns the number of opponent checkers hit: player 1's count if perspective is 0 and player 0's count if perspective is 1.

Or, there is a method checker( int i ), which returns the number of the player's checkers at position i (between 0 and 23), and otherChecker( int i ) which returns the number of the opponent's checkers at position i. Which player is the opponent is determined by perspective, and the indexing is reversed for perspective = 1, so checker(0) is always the number of the player's checkers at their own home position 1.

The reason this is important: in earlier iterations of this effort I ended up with a favored perspective, and I think that may have caused subtle bugs in the code. It also makes it easier to lay out the network inputs.
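A rough sketch of the interface (my reconstruction of it from the description above, not the actual class):

#include <vector>

class board
{
public:
    int  checker( int i ) const;       // mover's checkers at point i (0-23, in the mover's numbering)
    int  otherChecker( int i ) const;  // opponent's checkers at point i
    int  hit() const;                  // mover's checkers that have been hit (on the bar)
    int  otherHit() const;             // opponent's checkers that have been hit
    void setPerspective( int player ); // 0 or 1: whose point of view the accessors use

private:
    std::vector<int> checkers0, checkers1; // checkers by position for each player
    int hit0, hit1;                        // checkers on the bar for each player
    int bornIn0, bornIn1;                  // checkers borne in for each player
    int perspective;                       // which player the accessors refer to
};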

Strategy classes

A strategy is something that determines which of the possible moves is the best move. The base strategy class defines an abstract virtual method boardValue, which takes a board and returns a value - the higher the value, the better the board. The board value can represent whatever is appropriate for the strategy: the expected number of points, the probability of a win, or something more arbitrary.

Ultimately all the neural network intelligence is in a derived strategy class. But there are also simpler strategy classes - for example, a random strategy that returns a random value from boardValue, and a strategy that wraps up the standard pubEval player (which we end up using for comparisons).

When a strategy calculates its boardValue, it always uses the methods on the board that show the board from the appropriate perspective - so e.g. always hit() and otherHit(), never explicitly the number of checkers hit for player 0 or 1. So the board must always be passed in from the perspective of the player that is moving.
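A sketch of what that hierarchy might look like, building on the board sketch above (the names and signatures are my reconstruction, not the actual code):

#include <cstdlib>

class strategy
{
public:
    virtual ~strategy() {}
    // higher value = better board, always from the mover's perspective
    virtual double boardValue( const board & brd ) const = 0;
};

// trivial example: a strategy that returns a random value, so moves are effectively chosen at random
class strategyRandom : public strategy
{
public:
    virtual double boardValue( const board & brd ) const
    {
        return rand() / ( (double) RAND_MAX );
    }
};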

A game class

An instance of a game contains a board plus a strategy for each player. It knows how many steps a game has progressed, whether a game is over, which side won, how many points they won, and so on.

Its step() method steps one ply (one side's roll) through the game. A step rolls the dice (the random number generator is the Mersenne twister), generates all possible boards for the roll, and uses the appropriate side's strategy to determine the board value of each possible position. It then chooses the position with the maximum board value.

Before sending the board to the strategy, it sets its perspective to be that of the player which is making the move.
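The core of that selection might look roughly like the sketch below (pickBestBoard is a made-up helper for illustration, not the actual game method):

#include <vector>

// Given all boards reachable from the roll, pick the one the mover's strategy values most highly.
board pickBestBoard( std::vector<board> & possibleBoards, strategy * mover, int perspective )
{
    int bestIndex  = 0;
    double bestVal = -1e99;
    for( size_t i=0; i<possibleBoards.size(); i++ )
    {
        possibleBoards[i].setPerspective( perspective ); // evaluate from the mover's point of view
        double val = mover->boardValue( possibleBoards[i] );
        if( val > bestVal )
        {
            bestVal   = val;
            bestIndex = (int) i;
        }
    }
    return possibleBoards[bestIndex];
}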

Another method is stepToEnd(), which keeps stepping until the game is over (one side or the other has 15 pieces borne in).

The algorithm to determine the various possible positions given a two-die roll is pretty brute-force right now. I'm not sure if there's a nicer way to do it - if there is, it would have a significant performance impact, since this is called for every step of every game.

The game has a boolean verbose flag. If true, it prints out the board at each step and notes the dice rolls; if false, it prints nothing. This is useful for checking that games behave sensibly: you can watch a full game in all its detail with verbose=true, then set verbose=false (the default) when running many games.

Running a game

Some sample code for running a game looks like this:

int main( int argc, char * argv [] )
{
    strategyPubEval s1;      // pubEval strategy, used here for both players
    game g( &s1, &s1, 1 );   // one strategy per side, plus an integer seed for the dice RNG
    g.verbose = true;        // print the board and dice rolls at each step
    g.stepToEnd();           // play until one side has borne in all 15 checkers
    return 0;
}

So this constructs a pubEval strategy used for both sides in the game, printing out the board and game information at each step. The strategy is passed (as a pointer) to the game for each side, along with an integer which seeds the random number generator used for dice rolls.

What this blog is about

Welcome to anyone who happens upon this blog!

This blog is meant to document my amateur attempts to build a backgammon player. Initially my focus is on building a neural net player akin to TD gammon.

I've actually tried three times to get a network up and learning properly. The first two times were failures - the nets never actually learned to play, and I couldn't find the bug(s). The third time shows some promise, and this blog is meant to document how I approached the problem and how it worked out. Once I've got the basic machinery working I'll continue to extend it, and document that.

I'll also post interesting links or research I stumble across. As I said, I'm just an amateur, so I'm probably ignorant not just of the state of the art in computational backgammon, but of backgammon itself. I'm a decent intermediate player but certainly no expert. I just find the numerical machinery behind artificial learning really intriguing, and this is a way for me to play with it.