Sunday, July 11, 2021

The latest GnuBG benchmark files

I found the latest versions of the GnuBG benchmark databases here. They've moved around in the ten years since I last looked for them! The file format remains the same, though I had to rebuild the parser for the 20-character board descriptor string in here.

A reminder: the benchmark databases give a bunch of starting boards and rolls, and for each combination, the five boards with the best post-roll equity.

There's also a set of training databases that gives a list of boards with pre-roll probabilities for the different game outcomes; that's what we do supervised learning against. Those are here. A description of the contents is here.

These training databases have grown substantially in the intervening years: there are now 1,225,297 entries for contact, 600,895 for crashed, and 516,818 for race.

Rebuilding TD-lambda learning

I tried to get TD-lambda learning to work with scikit-learn's MLPClassifier, but couldn't get it to accept probabilities as training targets rather than a small set of discrete categories (0 or 1 values). Then I tried MLPRegressor, but that doesn't seem to have a nice way of bounding the outputs in (0,1).

So rather than bang my head against that, I just rolled my own neural network again - this time in Python, but using numpy vectorized calculations to speed things up.

It's still pretty slow in execution - I can train a network with 80 hidden nodes at the pace of 20,000 games per hour on my desktop machine. But, it let me get back into the weeds with how this all works.

This time I followed Tesauro's setup a bit more closely, in that the inputs are explicitly white and black checker positions rather than "whichever player's on move", and I kept the two "whose move is it" inputs too. The outputs are: probability of white single win, probability of white gammon, probability of black gammon, probability of white backgammon, and probability of black backgammon. The probability of black single win was equal to one minus the sum of the other probabilities.
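A minimal numpy sketch of that forward pass. The 80 hidden nodes match the post; the 198-input encoding is an assumption (Tesauro's original layout), and the random weights are placeholders:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Sizes: 80 hidden nodes as in the post; the 198-input encoding is an
# assumption, and the weight initialization is a placeholder.
n_inputs, n_hidden, n_outputs = 198, 80, 5
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(n_hidden, n_inputs))
W2 = rng.normal(scale=0.1, size=(n_outputs, n_hidden))

def evaluate(x):
    """One vectorized forward pass. The five sigmoid outputs are the
    probabilities of white single win, white/black gammon, and white/black
    backgammon; black's single-win probability is one minus their sum."""
    hidden = sigmoid(W1 @ x)
    p = sigmoid(W2 @ hidden)
    return p, 1.0 - p.sum()

p, p_black_single = evaluate(rng.random(n_inputs))
```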

I'm able to reproduce most of the initial cubeless play results from my earlier work, though I've yet to add the inputs for the number of checkers available to hit, or the Berliner primes. It takes around 200,000 game iterations to train up to something like the Benchmark 2 level. This was using alpha=0.1, lambda=0, and no "alpha splitting" (using a different learning rate for the input->hidden weights than for the hidden->output weights).
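With lambda=0 the eligibility traces drop away, and the TD update just nudges the weights so the current evaluation moves toward the next position's evaluation. Here's a sketch of that update for a simplified single-output network (all sizes and names are placeholders, not my actual code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

alpha = 0.1            # one learning rate for both layers: no alpha splitting
n_in, n_hid = 198, 80  # assumed sizes
rng = np.random.default_rng(1)
W1 = rng.normal(scale=0.1, size=(n_hid, n_in))
w2 = rng.normal(scale=0.1, size=n_hid)

def value_and_grads(x):
    """Network value for board vector x, plus gradients w.r.t. the weights."""
    h = sigmoid(W1 @ x)
    v = sigmoid(w2 @ h)
    dv = v * (1.0 - v)                        # sigmoid derivative at the output
    g_w2 = dv * h
    g_W1 = np.outer(dv * w2 * h * (1.0 - h), x)
    return v, g_w2, g_W1

def td0_step(x, x_next):
    """TD(0) update: move V(x) toward V(x_next)."""
    global W1, w2
    v, g_w2, g_W1 = value_and_grads(x)
    v_next, _, _ = value_and_grads(x_next)
    delta = v_next - v
    w2 = w2 + alpha * delta * g_w2
    W1 = W1 + alpha * delta * g_W1
    return delta

x, x_next = rng.random(n_in), rng.random(n_in)
v_before, _, _ = value_and_grads(x)
delta = td0_step(x, x_next)
```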

So now I've convinced myself that I remember how to build one of these players from scratch. For the next step I'm going to download the latest GnuBG benchmark databases and do supervised learning on those - it should be much easier to plug that into an external package like scikit-learn.

Thursday, July 1, 2021

Gerald Tesauro original paper

For reference: here is a link to the original Tesauro paper on TD-Gammon.

Rebuilding the backgammon framework and pubEval

I need a working bot to tell me the best move once I get the "parse the board state from a photo of a board" bit sorted.

I'm still following the same approach as originally, but this time I'm reimplementing it in Python, not C++. Two reasons for that: first, I enjoy Python coding a whole lot more because it's easier; and second, there are some truly excellent open source machine learning libraries in Python that I'll leverage. Execution speed of course is much slower, which might cause some angst when training, so I might need to do some optimization. Perhaps use Cython or numba or something to speed that up - we'll see.

And a third reason: rebuilding it from scratch will help me remember how all the details work!

In any case I've rebuilt the structure now, and have got a pubEval player to play simple games against. I had to google around a fair bit to find a nice description of pubEval, and finally found one here, where Gerry Tesauro published a post with a C implementation. Basically pubEval is just a linear regression (actually two regressions: one for a race, one for contact) that takes a 122-element representation of the board state and returns a score, with a higher score being better. All you really need then is the board state encoding, plus the linear regression weights for the two regressions, which you can find at that link.
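A sketch of that scoring scheme. The random weight vectors here are placeholders standing in for the published ones; the real weights and the 122-element encoding are at the link:

```python
import numpy as np

# The real pubEval weights come from Tesauro's post; these random vectors
# are placeholders so the sketch runs.
rng = np.random.default_rng(2)
WEIGHTS = {"race": rng.normal(size=122), "contact": rng.normal(size=122)}

def pubeval_score(encoded_board, is_race):
    """Dot the 122-element board encoding with the race or contact weights;
    higher is better."""
    w = WEIGHTS["race" if is_race else "contact"]
    return float(w @ encoded_board)

def best_move(candidate_boards, is_race):
    # Choose the candidate post-move board with the highest score.
    return max(candidate_boards, key=lambda b: pubeval_score(b, is_race))
```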

After playing some games manually against pubEval, I can honestly say that it's not very good. It's not terrible, but I now remember why it's pretty easy to beat.

And Happy Canada Day! :)

Thursday, June 17, 2021

Checker classifier results

I took that set of 25,000-ish checker images and trained a multi-layer perceptron (scikit-learn's MLPClassifier) on them to see how it'd do.

Each image was reduced to a 16x16 grayscale image, and those 256 numbers were the inputs to the MLP. They were normalized by dividing by 255 and then subtracting 0.5, so the range was [-0.5,0.5] for each of the 256 inputs.
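That preprocessing step can be sketched like this (the block-averaging downsample is an assumption; any resampling method would do the job):

```python
import numpy as np

def preprocess(window, size=16):
    """Block-average a square grayscale window down to size x size, then
    map pixel values from [0, 255] to [-0.5, 0.5]."""
    bh, bw = window.shape[0] // size, window.shape[1] // size
    small = window[:size * bh, :size * bw] \
        .reshape(size, bh, size, bw).mean(axis=(1, 3))
    return small / 255.0 - 0.5

# A random 64x64 "window" stands in for a crop of the board photo.
window = np.random.default_rng(3).integers(0, 256, (64, 64)).astype(float)
inputs = preprocess(window).ravel()   # the 256 MLP inputs
```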

As discussed before, I used a 99% threshold for classifying an image as a white or black checker, to improve precision at the expense of recall.

I split the input data into training and test sets: the training set used 75% of randomly-sampled inputs, and the test set was the remaining 25%.

I judged the performance of the classifier by looking at a confusion table, plus using it to identify checkers on a reference board.

This post gives some results for different topologies of the neural network. In all cases I used the "adam" solver (a stochastic gradient descent solver) and an L2 penalty parameter "alpha" equal to 1e-4.
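The per-class tables below are scikit-learn classification reports; for reference, here's a minimal example of producing one, using made-up labels rather than the real data:

```python
from sklearn.metrics import classification_report

# Made-up labels for the three classes (0 = no checker, 1 and 2 = the two
# checker colours), just to show where the tables come from.
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 0, 2, 2]
report = classification_report(y_true, y_pred)
print(report)
```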

One Hidden Layer, 10 Nodes

With one hidden layer and ten nodes, the test data set performance statistics were:

              precision    recall  f1-score   support

           0       0.85      1.00      0.92      5013
           1       0.93      0.10      0.18       636
           2       0.96      0.48      0.64       608

    accuracy                           0.86      6257
   macro avg       0.91      0.52      0.58      6257
weighted avg       0.87      0.86      0.81      6257

Here, category 1 is black checkers, and category 2 is white checkers. Category 0 is no checker.

The precision ended up quite good, but the recall was pretty terrible - it was missing a lot of valid checkers.

When I run this on my reference board (a board image that was not used to generate any of the input data), it shows that it indeed does a pretty good job finding white checkers, but a relatively poor job with the black checkers:

This is how I'll qualitatively evaluate the performance. Each white dot in the image above is where, on one scan, the classifier thought it found a white checker (placed in the center of the window). As it did multiple scans, the same checker is found over and over, and you get a grouping of white dots. There are a few places in the image, outside the main board area, where it placed groupings of white dots, but generally it's finding the real white checkers pretty effectively.

It does a noticeably worse job with the black checkers (here colored blue on my board) - way fewer blue dots on those checkers.

One Hidden Layer, 100 Nodes

If I jack the number of hidden nodes in a single hidden layer from 10 to 100, the performance noticeably improves:

              precision    recall  f1-score   support

           0       0.94      0.99      0.96      5013
           1       0.95      0.64      0.76       636
           2       0.95      0.83      0.89       608

    accuracy                           0.94      6257
   macro avg       0.94      0.82      0.87      6257
weighted avg       0.94      0.94      0.94      6257

The precision is a bit better, but the recall has improved dramatically. And when it identifies checkers on my reference board it does materially better with the black checkers:

Two Hidden Layers, 10 Nodes in Each

If we change the topology of the network to have two hidden layers, each with ten nodes, the performance is:

              precision    recall  f1-score   support

           0       0.87      0.99      0.93      5013
           1       0.92      0.25      0.39       636
           2       0.92      0.58      0.71       608

    accuracy                           0.88      6257
   macro avg       0.90      0.61      0.68      6257
weighted avg       0.88      0.88      0.85      6257

This does a bit better than a single hidden layer with 10 nodes, but worse than the single layer with 100 nodes.

Two Hidden Layers, 100 Nodes in Each

If we go to 100 nodes in each of two hidden layers, the performance is much stronger:

              precision    recall  f1-score   support

           0       0.94      0.99      0.96      5013
           1       0.93      0.61      0.74       636
           2       0.94      0.88      0.91       608

    accuracy                           0.94      6257
   macro avg       0.94      0.82      0.87      6257
weighted avg       0.94      0.94      0.93      6257

Adding the second hidden layer doesn't seem to improve the situation much compared to the one hidden layer with 100 nodes.

One Hidden Layer, 200 Nodes

With 200 nodes in one hidden layer, the performance doesn't change much from the 100 node case:

              precision    recall  f1-score   support

           0       0.94      0.99      0.96      5013
           1       0.95      0.65      0.77       636
           2       0.95      0.83      0.89       608

    accuracy                           0.94      6257
   macro avg       0.95      0.82      0.87      6257
weighted avg       0.94      0.94      0.94      6257

Variation by Number of Hidden Nodes (One Layer)

The recall statistic for black checker classification (class 1 in the table) is the most sensitive to the topology. Here is a chart of recall % for that category as a function of the number of hidden nodes, for one hidden layer:

In general it seems like performance is pretty stable across the board, and recall seems to stabilize around 50 hidden nodes and more.

So I'll stick with one hidden layer and 50 nodes.

A checker classifier

I started with the second step in my pipeline: how to identify the checkers on the board.

The approach I'm trying:

  • Start with the raw image of the board, laid out so that it's wider than it is tall, so that the points are vertical.
  • Define a small square window, something comparable to the checker size.
  • Scan that window across the board. At each step, look at the part of the raw image that's in the window and decide whether it contains a white checker, a black checker, or no checker.
  • Process the window image: downsample into a 16x16 grid of grayscale pixels.
  • My metric for deciding whether a checker is in the image: one checker has > 75% of its area in the window, no other checkers have > 25% of their areas in the window, and the center of the window lands inside the one main checker.
  • If a checker is counted as in the window, remember a point equal to the center of the window, tagged as white or black depending on which checker was found.
  • Run that scan across the whole board, for a range of window sizes, and for a range of starting pixel positions for the scan, so it gets a lot of views of the board in slightly different ways.
At the end of this I end up with a bunch of dots, hopefully grouped around the actual checker positions. Then I need some way of identifying the groupings as checkers, but that's the next step.
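The scan loop above might look roughly like this (a sketch only: classify_window, the window sizes, and the step are all placeholders):

```python
import numpy as np

def scan_board(board, window_sizes, step, classify_window):
    """Slide square windows over the board image and record the window
    centre whenever the classifier reports a white or black checker."""
    dots = []
    h, w = board.shape
    for size in window_sizes:
        for y in range(0, h - size + 1, step):
            for x in range(0, w - size + 1, step):
                label = classify_window(board[y:y + size, x:x + size])
                if label in ("white", "black"):
                    dots.append((x + size // 2, y + size // 2, label))
    return dots

# Toy demo: one bright square stands in for a checker, and a mean-brightness
# test stands in for the classifier.
board = np.zeros((40, 40))
board[10:20, 10:20] = 255.0
dots = scan_board(board, window_sizes=[10], step=5,
                  classify_window=lambda win: "white" if win.mean() > 200 else "none")
```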

I decided to train a multi-layer perceptron - a neural network - to identify the three categories for each window image: no checker, a black checker, or a white checker. For this I used scikit-learn's MLPClassifier, with the following assumptions:
  • Adam solver (a stochastic gradient descent solver)
  • Relu activation function for every node
  • Two hidden layers with 100 nodes in each
  • L2 penalty parameter "alpha" equal to 1e-4
In this use case, I care about the precision of the classifier - that is, the ability of the classifier not to label as positive a sample that is negative - more than the recall - the ability of the classifier to find all the true samples. That's because it does this grid scan, and should identify the same checker many times in its scans. If it misses a checker on a few scans (because it's focused on precision over recall), it should catch it on other scans.

To amp the precision, I count a window as having a white or black checker only if the classifier has > 99% probability. That means a bunch of real checkers will be missed and counted as false negatives, which reduces the recall, but the false positives should be quite low, which pumps up the precision.
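Here's a sketch of that thresholding on top of MLPClassifier's predict_proba, using tiny synthetic stand-in data rather than the real window images:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Tiny synthetic stand-in for the window images: class 0 = no checker,
# classes 1 and 2 = the two checker colours (the real inputs are 256 pixels).
rng = np.random.default_rng(4)
X = rng.random((300, 16)) + np.repeat([0, 1, 2], 100)[:, None]
y = np.repeat([0, 1, 2], 100)
clf = MLPClassifier(hidden_layer_sizes=(20,), max_iter=2000,
                    random_state=0).fit(X, y)

def classify(window_features, threshold=0.99):
    """Call it a checker only when the classifier is more than 99% sure;
    otherwise fall back to class 0, 'no checker'."""
    probs = clf.predict_proba(window_features.reshape(1, -1))[0]
    best = int(np.argmax(probs))
    return best if best != 0 and probs[best] > threshold else 0
```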

Before I could start training, though, I needed some labeled input data to test against. That is, images that are about the size of the window, each labeled as "has no clear checker", "has a white checker", or "has a black checker". Where to get these labeled inputs?

I took a picture of a few backgammon boards, and started by manually cropping out images and saving them, then manually labeling them. That was pretty slow - it took me a half hour just to get 100 or so.

Then I realized I could use my scanning approach. I took a picture of a board, then manually identified where the checkers were on the board: center point and radius for each, as well as whether it was white or black. I saved that identification data to a file, then I ran a scan over the board, and could use the manually-identified checker positions to figure out whether a window contained 75% of one checker, less than 25% of any other checker, and that the center of the window was inside the area of the one checker.
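That geometric test can be approximated numerically by rasterizing each hand-labelled checker circle; a sketch, with all names hypothetical:

```python
import numpy as np

def circle_fraction_in_window(cx, cy, r, wx0, wy0, wsize, samples=200):
    """Approximate fraction of a checker circle (centre cx, cy, radius r)
    whose area falls inside the square window at corner (wx0, wy0)."""
    xs = np.linspace(cx - r, cx + r, samples)
    ys = np.linspace(cy - r, cy + r, samples)
    X, Y = np.meshgrid(xs, ys)
    in_circle = (X - cx) ** 2 + (Y - cy) ** 2 <= r ** 2
    in_window = (X >= wx0) & (X < wx0 + wsize) & (Y >= wy0) & (Y < wy0 + wsize)
    return (in_circle & in_window).sum() / in_circle.sum()

def window_label(window, checkers):
    """Labelling rule from the post: one checker > 75% inside the window,
    every other checker <= 25% inside, and the window centre inside that
    one checker. checkers is a list of (cx, cy, r, colour) tuples."""
    wx0, wy0, wsize = window
    fracs = sorted(((circle_fraction_in_window(cx, cy, r, wx0, wy0, wsize),
                     cx, cy, r, colour) for cx, cy, r, colour in checkers),
                   reverse=True)
    f0, cx, cy, r, colour = fracs[0]
    mx, my = wx0 + wsize / 2.0, wy0 + wsize / 2.0
    centre_inside = (mx - cx) ** 2 + (my - cy) ** 2 <= r ** 2
    if f0 > 0.75 and centre_inside and all(f <= 0.25 for f, *_ in fracs[1:]):
        return colour
    return "none"
```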

I did that with a handful of boards, and that generated about 2,400 black checker images, 2,600 white checker images, and 200,000 images with no checker. I discarded 90% of the "no checker" images just so the input data set wasn't too skewed towards that side, so ended up with about 80% of the examples being "no checker", and about 10% each in white and black checkers.

That served as my training set for the checker classifier.

Draft of overall approach

I started by thinking through what the algorithm should be for parsing information off the board, as a pipeline that it walks through. I expect this will get refined as I start implementing, but the high level first draft of the steps is:

  1. Figure out the board outline. That is, look for lines/rectangles in the image and use them to figure out the overall board dimensions and what part of the image is in the board and what's out.
  2. Find all the checkers on the board inside the board outline. It needs to distinguish between white and black checkers on the board, as well as possibly checkers on the bar. I won't bother with checkers that have been borne off; I'll just assume any checkers that aren't on the board have been borne off.
  3. Identify the dice and the numbers that are showing. For this purpose I'm just going to look at the two regular dice, not the doubling cube, because it's a bit easier. Maybe later I'll add in the doubling cube. I'll assume the dice are inside the board outline and not on top of any checkers. It'll also need to identify whether the dice are white or black so it can figure out whose move it is.
At the end of that it'll know how the board is laid out, whose turn it is, and what the roll is - that's enough to figure out the next move, at least if we're ignoring the doubling cube.

Some assumptions I'm going to make:
  • The picture of the board is from above, without a lot of perspective skewing. That means all the checkers are roughly the same size, the board outline is pretty close to rectangular, and we can see only one side of each of the dice.
  • No checkers are stacked. For example, if there are six checkers on a point, they should all be laid out in a row, without stacking them on top of each other. I think it'll be very hard for a classifier to identify stacked checkers.
  • The doubling cube doesn't matter when choosing the best move. Of course this isn't really the case, but it's generally not bad.
Maybe later I'll amend this to include the doubling cube, but for the first stab I'm leaving it out, just to keep things simple.

Tech stack and ML toolkit for this experiment

I'm using Python as my programming language here, partly because it's my preferred coding language, but also because there is an embarrassment of options for machine learning packages in Python. And, it's a language that makes it easy to experiment and get things done quickly.

The two machine learning packages I considered for this were scikit-learn and TensorFlow. scikit-learn has a great breadth of machine learning algorithms; TensorFlow is focused on neural networks and has a lot of depth there.

I'm starting with scikit-learn mostly because it was easier to get started with, and because it makes it easy to swap different types of (for example) classifier algorithms in and out behind a consistent API for fitting and predicting. Plus, I'm expecting to use machine learning techniques other than neural networks for some of the stages in processing a board image, which makes TensorFlow less appropriate.

Of course, if it ends up making sense, I can use TensorFlow in some places and scikit-learn in others, but that makes the code harder to maintain later, so I'll try to stick with one ML package if I can.

New project! Computer vision to identify the layout of a backgammon board.

I still play quite a lot of backgammon, in real life. My first project to create a backgammon bot did not, as I hoped, make me a noticeably better gammon player - instead, I just created a bot that could play (much!) better than me.

So my next project is aiming to help me figure out what to do when I'm stuck in a real life game: I want to take a picture of the board with my phone and then have my bot figure out what the best move is. Of course this isn't something to use in a competitive match! But in a friendly game it might be an interesting part of the conversation.

It feels like it would make a pretty handy phone app, but mostly it's an interesting computer vision/machine learning problem that, like my original TD-Gammon experiment, will let me improve my skills.

And like the last project, I'll document my approach as I experiment along with this. I'm feeling somewhat less confident about this project than the last one, since with TD-Gammon there was already a published result that I was pretty confident that I could match. This one has a less certain conclusion, but hopefully I'll end up with something that works. Let's see!