Tuesday, April 17, 2012

PubEval trained using ListNet

I spent some time looking at PubEval again - not my implementation, which is fine now, but rather how Tesauro came up with it in the first place. One discussion suggests that he trained it using "comparison training", a machine learning approach he seems to have come up with - some kind of supervised learning on a manually-prepared set of benchmarks. Each benchmark (I'm assuming) was a list of moves given a starting point and a dice roll, where the moves were ordered by goodness.

I tried to reproduce this. I couldn't find any proper references to "comparison training", but there's a lot of relatively recent literature on machine learning algorithms for generating rankings, which is the same sort of thing.

We can do a lot better than Tesauro's hand-crafted training set: the GNUbg benchmark databases are much larger and more accurate.

So what we want is an algorithm we can feed a training set where each element has a position and dice roll, the five candidate boards listed for that roll, and the rolled-out equity for each. The inputs that define a board are the PubEval inputs, and the evaluation function should be a linear function of those inputs (as PubEval's is).
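To make that concrete, here's a minimal sketch (in Python/NumPy, not my actual code) of what the linear evaluator looks like. The helpers pubeval_inputs and is_race are assumptions for illustration: the first would build PubEval's 122-element input vector from a board, the second would decide whether the race or contact weight vector applies, as in Tesauro's pubeval.c.

```python
import numpy as np

def evaluate(board, w_contact, w_race):
    """Linear PubEval-style score for a single candidate board.

    pubeval_inputs() and is_race() are assumed helpers (not shown): the
    first maps a board to PubEval's 122-element input vector, the second
    picks which weight vector applies. Higher score = preferred move.
    """
    x = pubeval_inputs(board)                   # shape (122,)
    w = w_race if is_race(board) else w_contact
    return float(np.dot(w, x))

def best_move(candidate_boards, w_contact, w_race):
    """Choose the candidate board with the highest linear score."""
    return max(candidate_boards, key=lambda b: evaluate(b, w_contact, w_race))
```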

Wikipedia has a nice summary of the different machine learning approaches to building ranking estimators (under "learning to rank").

The ListNet algorithm seems like a good choice. I implemented it and trained it on the GNUbg contact and race benchmark databases.
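For reference, here's a minimal sketch of the top-one version of ListNet for a linear evaluator (again in Python/NumPy rather than my actual code). The rolled-out equities define a softmax "target" distribution over the candidate boards, the linear scores define the model's distribution, and the weights are updated with the gradient of the cross entropy between the two. The function names, the learning rate, and the choice to feed the raw equities straight into the softmax (rather than scaling them first) are illustrative assumptions.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())                     # subtract max for numerical stability
    return e / e.sum()

def listnet_epoch(w, training_lists, learning_rate=0.1):
    """One pass of top-one ListNet updates for a linear evaluator.

    Each element of training_lists is (X, equities): X holds one row of
    PubEval-style inputs per candidate board, and equities holds the
    rolled-out equity of each board from the benchmark database.
    """
    for X, equities in training_lists:
        target = softmax(equities)              # "true" top-one probabilities from rollouts
        model = softmax(X @ w)                  # model's top-one probabilities
        grad = X.T @ (model - target)           # gradient of the cross-entropy loss w.r.t. w
        w -= learning_rate * grad
    return w
```

One nice property of using the raw equities in the softmax: moves with nearly equal equities get nearly equal target probabilities, so the training signal concentrates on positions where the candidate moves genuinely differ in strength.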

It fairly quickly converged to a better solution than Tesauro's original PubEval: the weights I found can be plugged into the usual PubEval setup and give a noticeably, though not dramatically, stronger player. Not surprising given the more comprehensive training set.

The weights themselves, and the output values, were quite different from PubEval's. ListNet effectively trains the linear evaluator to approximate equity, so in this approach the board evaluations correspond roughly to equity - though since the softmax probabilities it matches are unchanged when a constant is added to every score in a list, the training really optimizes for equity differences between candidate boards for the same position and roll rather than for outright equity.

The GNUbg benchmark scores (lower is better) were: Contact 41.5, Crashed 47.5, and Race 1.90. This compares to PubEval's scores of 44.1, 49.7, and 3.54.

The weights are available on my Dropbox account: race and contact.

In 100k cubeless games (with variance reduction) against PubEval it scores +0.043ppg +/- 0.004ppg. Again, noticeably better.

Of course this is a terrible player compared to neural network players, but it's neat to be able to reproduce something like what Tesauro did with PubEval. And it was a good excuse to play around with the newer machine learning algorithms focused on ranking.

This might also be an interesting training approach for a neural network. Such a network would be optimized for checker play, so it would be less accurate at the absolute equity estimation required for doubling decisions. But perhaps one could have two sets of networks: one optimized for checker play, the other for doubling decisions.
