Friday, August 5, 2011

An example game

I took the 80-hidden-node network with the maximum score against pubEval and played it against itself to see the quality of play.

Overall it's pretty solid. I'd say a weak intermediate level of play now, though it still makes some clear mistakes, particularly with leaving shots.

Two-output network results

I finished 2M training runs (again with alpha=beta=0.1, no alpha/beta damping or backtracking) on the new two-output network, after correcting the bug. Here are the results:



The blue dots are the benchmark runs for the 10-hidden-node network, and the blue line is the rolling 10-point average. Red is the same for the 20-hidden-node network; green is 40 nodes; and purple is 80 nodes.

The bug ended up not making that much difference; the trained networks still perform extremely well against the pubEval benchmark.

So much so that pubEval probably isn't a great benchmark anymore. The trained networks backgammon pubEval almost a quarter of the time, which is not realistic.

Note: I discovered (much) later that I had a significant bug in my pubEval implementation. After fixing that it's a much stronger player. I also had some significant bugs in my network design that held the networks back. Check out posts from Jan 2012 to see proper results.

Now to look at some example games that the trained networks play against themselves to get a feel for how good they are...

Thursday, August 4, 2011

An interesting perspective from the academy

I found an interesting article on different approaches to training backgammon neural nets - directly relevant to what I'm trying here.

A few observations from this:
  • Using a single neural net isn't optimal - you should use different nets for different parts of the game. For example, one net for bearing in, one for a race, and one for the rest. I've seen this discussed elsewhere as well. The article goes a bit nuts and has nine separate networks for different varieties of game position. I'll get to multiple nets eventually.
  • The optimal value for lambda they found was around 0.6. I'm not sure how this squares with Tesauro's comments in his later papers that lambda=0 is the best to train with; I've been using lambda=0. (The sketch at the end of this post shows where lambda enters the TD update.)
  • They didn't evaluate against pubEval - they created their "expert" (actually a group of nine networks), trained it for a while, and then used that as the benchmark. While training the expert they used the rolling best-performing version as the benchmark for subsequent training, and stopped training once performance had converged on that basis.
Pretty interesting. The meat of the paper, though, is on how different training methods compare (self-learning vs using the expert at each step to specify the best move vs training against expert v expert matches). The conclusion was that training against the expert's evaluation at each step was the most efficient, followed closely by self-learning, and that training by watching expert v expert games was not very efficient.
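Since lambda keeps coming up, here's a minimal sketch of where it enters a TD(lambda) weight update. For clarity the value function here is just a sigmoid of a dot product rather than my real nets with a hidden layer, and the names (td_lambda_step, value_and_grad) are illustrative, but the role of the eligibility trace is the same:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def value_and_grad(w, s):
        """Value estimate V(s) = sigmoid(w . s) and its gradient dV/dw."""
        v = sigmoid(np.dot(w, s))
        return v, v * (1.0 - v) * s

    def td_lambda_step(w, e, s, s_next, alpha=0.1, lam=0.0, terminal_reward=None):
        """One TD(lambda) update along a self-play game. lam=0 is what I've been
        training with; the paper found lam around 0.6 worked best for them."""
        v, grad_v = value_and_grad(w, s)
        if terminal_reward is None:
            target, _ = value_and_grad(w, s_next)   # bootstrap from V(s'), no discounting
        else:
            target = terminal_reward                # at the end of a game, use the actual outcome
        td_error = target - v

        e = lam * e + grad_v            # eligibility trace: past gradients decayed by lambda
        w = w + alpha * td_error * e    # all decayed past gradients share the current error
        return w, e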


Wednesday, August 3, 2011

A bug!

Uh oh - I discovered a bug deep in the guts of the backgammon engine, in the piece that determines the available moves given a roll. I noticed it when running my trained network against itself for a 1,000-game benchmark: player 1 would consistently lose 0.3ppg to player 2, even though both sides were using the same strategy. It didn't matter whether the strategy was a neural net or pubEval - either way it was broken.
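For reference, the check itself is simple. Here's a rough sketch, where play_game is a hypothetical hook that plays one full game between two strategies and returns the signed points won by the first one:

    def self_play_symmetry_check(play_game, strategy, n_games=1000, tolerance=0.1):
        """Play a strategy against itself and confirm neither seat has an edge."""
        total = 0.0
        for i in range(n_games):
            # Alternate which seat moves first so any first-move edge cancels out.
            total += play_game(strategy, strategy, player1_starts=(i % 2 == 0))
        ppg = total / n_games
        # With the same strategy on both sides the average should be close to zero;
        # a consistent 0.3ppg loss is well outside noise for 1,000 games.
        assert abs(ppg) < tolerance, "asymmetric self-play: %+.3f ppg" % ppg
        return ppg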

Fixed now, but it does invalidate all my earlier results.

I suspect the really strong performance (stronger than the published TD gammon results) I recorded against pubEval was an accidental consequence of this. I'm re-running the training now for the two-output network (10, 20, 40, and 80 hidden node versions) and we'll see how those come out.

Note: I discovered much later that I had a few significant bugs. The most relevant here was in my implementation of pubEval - after fixing that it was a much stronger player.

Tuesday, August 2, 2011

Extending the outputs

The first network I set up had just one output node, which represents the probability of a win - any kind of win, ignoring the distinction between single wins and gammons/backgammons.

This was a simplification of the original Tesauro work which had four outputs: one for a win, one for a gammon win, one for a loss, and one for a gammon loss. There was no knowledge in the network of backgammons, under the assumption that in practice they happen so rarely that they can be ignored (this is true in regular games between experienced human players).

So the next step in the evolution of my net is to add an understanding of gammons. As with the single-output network, I want the network to do the right thing when you flip the perspective of a board and pass it in. For a multi-output network this means the probability of a win maps to the probability of a loss, the probability of a gammon win maps to the probability of a gammon loss, and so on.
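Here's a rough sketch of that symmetry, assuming (just for the sketch) that the two outputs are the probability of any win and the probability of a gammon win from the perspective of the player on move; evaluate and flip_perspective stand in for the engine's real functions:

    def expected_points(evaluate, flip_perspective, board):
        # Outputs from our perspective: P(any win), P(gammon win).
        p_win, p_gammon_win = evaluate(board)
        # Evaluating the flipped board gives the opponent's win probabilities,
        # which from our side are the loss probabilities.
        p_loss, p_gammon_loss = evaluate(flip_perspective(board))

        # The symmetry I want: P(win) + P(loss) = 1 (backgammons ignored).
        assert abs(p_win + p_loss - 1.0) < 1e-6

        # Single wins/losses are worth 1 point and gammons 2, which works out to:
        return p_win + p_gammon_win - p_loss - p_gammon_loss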

Example of learned gameplay

I tried a game where I played the 40-hidden-node player against itself, and printed out the results (using a fairly simple board representation - no pretty graphics yet!).

In general the player was pretty good. Not great; perhaps better than a beginner human player, but certainly not close to intermediate (a judgment I make because I think I'm an intermediate player myself and can see lots of places where it chose the wrong move).

In particular, looking at a couple of example games, it's not very good in races or at bearing in efficiently. It also leaves shots more often than it should.

Monday, August 1, 2011

The next set of results

I turned off backtracking and alpha/beta decay, since that seemed to have minimal effect in my previous experiments. I also dropped alpha and beta to 0.1 to avoid jumping around too much.

Then I let the training run for 2M iterations to see if I could get something that is plausibly converged. Honestly I don't even know whether a realistic neural net does converge, but other articles on TD gammon training suggest they do.
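The schedule amounts to something like the sketch below, with fixed learning rates and a periodic benchmark. run_td_training_game, benchmark_vs_pubeval, and the benchmark cadence are placeholders rather than my engine's real names:

    ALPHA = 0.1               # learning rate for the output-layer weights
    BETA = 0.1                # learning rate for the hidden-layer weights
    N_TRAINING_GAMES = 2000000
    BENCHMARK_EVERY = 10000   # cadence is illustrative

    def train(network, run_td_training_game, benchmark_vs_pubeval):
        history = []
        for game in range(1, N_TRAINING_GAMES + 1):
            # One self-play TD training game; alpha and beta stay fixed
            # (no decay, no backtracking to an earlier best network).
            run_td_training_game(network, alpha=ALPHA, beta=BETA)
            if game % BENCHMARK_EVERY == 0:
                history.append((game, benchmark_vs_pubeval(network, n_games=200)))
        return history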

Here are the results:


The blue line and blue points are for a network with 10 hidden nodes; green is 20; and orange is 40. The points are the average points per game playing the network against the pubEval benchmark for 200 games. The lines are the rolling 10-point averages.
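For completeness, the scoring and smoothing behind these plots amounts to something like this, using the same hypothetical play_game hook as in the earlier sketch (it returns the signed points for its first argument; single win = 1, gammon = 2, backgammon = 3):

    def average_ppg(play_game, network, pubeval, n_games=200):
        """Average points per game for the network against pubEval."""
        total = 0.0
        for i in range(n_games):
            total += play_game(network, pubeval, player1_starts=(i % 2 == 0))
        return total / n_games

    def rolling_average(scores, window=10):
        """Rolling window average, using a shorter window at the start of the series."""
        out = []
        for i in range(len(scores)):
            chunk = scores[max(0, i - window + 1):i + 1]
            out.append(sum(chunk) / len(chunk))
        return out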