Computational Backgammon

Using PyTorch for the neural networks

2023-12-10T12:45:00.000-08:00

I've been thinking a bit about whether GPUs are useful for neural network calculations with backgammon bots. It's not totally clear: the cost of shipping data to and from GPU memory has to be weighed against the benefits of parallelization.

That said, it is not very hard to check: there are nice open source packages available for neural networks that already support GPU calculations.

A simple one that seems decent for our purposes is PyTorch. This is a Python package for representing neural networks, and, as desired, it supports GPUs through CUDA (which works if you've got an NVIDIA GPU, as I do on my desktop machine).

There are two possible benefits of the GPU. First, evaluation of the network might be faster - so that it's quicker to do bot calculations during a game. Second, the training might be faster - that doesn't show up in the post-training game play, but could let you experiment more easily during training with different network sizes and configurations.

For the training calculations, the GPU really only helps with supervised learning - for example, training against the GNUBG training examples - but not with TD learning. That's because TD learning requires you to simulate a game to get the target to train against - you don't know the set of inputs and targets as training data before the TD training starts.

I constructed a simple feed forward neural network with the same shape as the ones I've discussed before: 196 inputs (the standard Tesauro inputs with no extensions), 120 hidden nodes (as a representative example), and 5 output nodes.

Then I just timed the calculation of network outputs given the inputs. The PyTorch version was noticeably slower even than my hand-rolled Python neural network class (which uses numpy vectorization to speed it up - not as fast as native C++ code, but within a factor of 2-3). This is true even though the PyTorch networks do their calculations in C++ themselves. These calculations didn't use the GPU, just the CPU, in serial.
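For reference, here is a minimal sketch of the kind of numpy-vectorized evaluation and timing described above. It assumes sigmoid activations, no bias terms, and random placeholder weights; the names `forward`, `w_hidden`, and `w_out` are illustrative and not the actual class.

```python
import time
import numpy as np

N_INPUTS, N_HIDDEN, N_OUTPUTS = 196, 120, 5  # network shape from the post

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(inputs, w_hidden, w_out):
    """Evaluate a single-hidden-layer network for one input vector."""
    hidden = sigmoid(w_hidden @ inputs)   # shape (N_HIDDEN,)
    return sigmoid(w_out @ hidden)        # shape (N_OUTPUTS,), each in (0, 1)

rng = np.random.default_rng(0)
# Placeholder weights (assumption); a trained network would load real ones.
w_hidden = rng.normal(scale=0.1, size=(N_HIDDEN, N_INPUTS))
w_out = rng.normal(scale=0.1, size=(N_OUTPUTS, N_HIDDEN))
inputs = rng.random(N_INPUTS)

n_evals = 1000
start = time.perf_counter()
for _ in range(n_evals):
    outputs = forward(inputs, w_hidden, w_out)
elapsed = time.perf_counter() - start
print(f"{n_evals / elapsed:,.0f} evaluations per second")
```

The absolute numbers depend heavily on the machine, so this is only useful for comparing implementations on the same hardware.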

Then I tried the PyTorch calculations parallelizing on the GPU. I've got an NVIDIA GeForce GTX 1060 3GB GPU in my desktop machine, with 1,152 CUDA cores. It was slower, with this number of hidden nodes, than the CPU version of the PyTorch calculation. So the memory transfer overhead outweighed the parallelization benefits in this case.

I tried it with a larger network to see what happened - as the number of hidden nodes goes up, the PyTorch evaluations start to outperform my numpy-based evaluations, especially when using the GPU.

So for post-training bot evaluations, it doesn't seem like using the GPU will give much improvement, if we're using nets of the size we're used to.

The GPU does, however, make a massive difference when training! In practice it's running about 10-20x faster than the serial version of the supervised training against the GNUBG training data. I'm excited about this one. And I can always take the trained weights off the PyTorch-based network and paste them onto the numpy-based one, which executes faster post training.

Custom GPTs can run only Python code

2023-12-04T08:37:00.000-08:00

I just discovered that OpenAI's custom GPTs cannot run compiled code like C++ - just Python.

That's an interesting limitation of these things, and one that I suspect will disappear in the not-so-distant future. It's probably to do with how simple the temporary environments that execute the code need to be.

In any case, it means my backgammon bots - for now! - will need to be pure Python. That said, the Python environment does have most of the standard numerical Python packages like numpy and pandas, and more advanced packages like scipy and sklearn that themselves contain some machine learning functionality. So maybe I'll try to build an sklearn-based neural net and see if that works better than my hand-rolled one that uses numpy vectorization.

The tutorbot begins to take shape

2023-12-03T18:18:00.000-08:00

I've named my custom GPT the "Backgammon Tutorbot".

It's now getting a bit better. You can tell it a position in text, like "show me the starting position", and it knows what that means, shows you a proper image of the position, and internally knows what checker layout it corresponds to.

You can then (slowly!) step through a game. If you ask it, for example, "show me the position after a 5-1", it'll call the backgammon bot Python code to figure out the best move for a 5-1, then show you an image of the resulting position, with a bit of commentary. And it remembers the new position as the current one.

Next, if you say "show me the position after the opponent rolls a 6-2", it'll figure out the opponent's best 6-2, then show you an image of that position, with some commentary.

And so on, until the game is over (or you hit the GPT-4 time cap, which I did a bunch).

So it's getting better at the mechanics of representing and advancing a real game. It's quite interesting to use the chat interface as the UI rather than a traditional application. In some ways it's much slower - but it's also chatting in regular English, with a lot of flexibility about how you actually ask your questions. The chatbot is very good at dealing with that kind of flexibility.

But it's still not very good at commentary yet. I've tried uploading one doc with human discussions of opening moves, but that's very early still, and no results to report.

A backgammon tutor chatbot

2023-11-30T14:59:00.000-08:00

I'm trying a new project to get myself familiar with LLMs and the growing infrastructure around them.

In particular, I want to try to solve one of the biggest practical problems that people have when they learn the game through bot analysis (like XG): the bot tells you which move has the highest equity, but it doesn't tell you why that move is the best in terms of more qualitative strategic and tactical decisions.

For example, why is making the 5-point with an opening 3-1 the right move? XG will tell you it's because it's the move that has the highest equity. If you asked an expert human player why that is the best move, though, they would talk about making home board points in order, stopping the opponent from making the Golden Point, and so on.

What I want to create is a chatbot that can give those more qualitative explanations about why a move is best.

The end state: a chatbot where you can enter a position (in a bunch of ways, including pasting a photo of a board) and ask it what the top moves are for a given dice roll. It'll tell you the standard probability and equity information that bots currently show, but it'll also give you the qualitative explanation of why the best move is the best. And similar functionality for cube decisions.

My implementation of this uses OpenAI's custom GPT framework. (To use this, I think you need to have a paid OpenAI account.) This lets you create a custom chatbot that is fine tune-trained on data you give it, and also has access to whatever code and data files you upload to it.

Then, with regular English instructions, you can tell it to, for example, call certain Python functions in response to certain types of request, or load information from a file, and things like that.

I managed to create a really simple first version that knows how to calculate the top moves, using a simple bot about the level of Benchmark 1. It also ignores the cube (for now). You can ask the chatbot questions like:

Show me an example of a backgammon board

I added a file with a list of example positions, and it'll load a random example when you ask this. It pulls it in and shows you the board.

What are the best moves for a 3-1?

It'll call the backgammon bot and get the regular bot information: moves, game probabilities, equity, and so on. It then summarizes the top three moves, describing the move and showing the probabilities. Then it tries to explain why the best move is the best, but it doesn't do a very good job, because I haven't trained that part yet. :)

Here's a link to the chatbot. Note that it might randomly not work as I play with it; and I suspect that you need a paid OpenAI account to use it.

Some thoughts after doing this really quite simple experiment:

Developing a chatbot is something you can do with English instructions. And when something goes wrong, you can ask the chatbot itself for help, and often it really does help.
It feels like the craft of building these things is just in its infancy, and I (at least!) don't know what kinds of development standards are best. For example, if you want a set of instructions for the chatbot to get set up, should they all go in one file, or is it better to use multiple files for different categories of instructions?
With OpenAI's custom GPTs, if you do a bunch of setup and successfully Update the chatbot, and then do another bit of setup and Update again, you generally lose all the information for the first setup. You need to give the custom GPT all the instructions each time you update it.
Everything runs really slowly! The chatbot often runs Python code, which takes ~30s to get set up and execute. Feels pretty sluggish. Okay for a proof of concept, but needs to be way faster for anyone to really use it.

Some stuff yet to do:

Make it easier to tell it what backgammon position you care about. Right now you need to tell it a gnubg-nn position string, which no one but three or four people in the world know about. It'd be nice to be able to paste in a photo of a board and have it parse the position out of that, but that's a pretty difficult computer vision problem. (Apparently someone has recently built an Android phone app that can do this, but I've never seen it in action.) Absent the photo parsing, I need to figure out some simple-ish way of describing a board to the chatbot.
Train it to describe why the move is best. This is the real meat of this project - will this work? I'm putting together some training data where each element is a position and dice roll, plus the best move, the checker layout after the best move, the game state probabilities after the move, and a description (created by me, to start) of why the best move is best.
Improve the bot. I was thinking of having it call gnubg, but I don't really know how to upload all the binaries and data that it needs, or how to have it run as a server. So maybe I'll just use one of my old bots from earlier in this project. If this gets any traction I can figure out how to integrate with a proper bot.

The latest GnuBG benchmark files

2021-07-11T12:02:00.002-07:00

I found the latest versions of the GnuBG benchmark databases here. They've moved around in the 10 years since I last looked for them! The file format remains the same, though I had to rebuild the parser for the 20-character string board descriptor in here.

Remember: the benchmark databases give a bunch of starting boards and rolls and, for each, the top five resulting boards with their post-roll equities.

There's also a set of training databases that gives a list of boards with pre-roll probabilities for the different game outcomes; that's what we do supervised learning against. Those are here. A description of the contents is here.

These training databases have grown substantially in the intervening years: there are now 1,225,297 entries for contact, 600,895 for crashed, and 516,818 for race.

Rebuilding TD-lambda learning

2021-07-11T11:17:00.003-07:00

I tried to get TD-lambda learning to work with scikit-learn's MLPClassifier tools, but couldn't get it to accept probabilities as targets rather than a small set of categories (1 or 0 values). Then I tried MLPRegressor, but that doesn't seem to have a nice way of making the outputs bounded in (0,1).

So rather than bang my head against that, I just rolled my own neural network again - this time in Python, but using numpy vectorized calculations to speed things up.

It's still pretty slow in execution - I can train a network with 80 hidden nodes at the pace of 20,000 games per hour on my desktop machine. But, it let me get back into the weeds with how this all works.

This time I followed Tesauro's setup a bit more closely, in that the inputs are explicitly white and black checker positions rather than "whichever player's on move", and I kept the two "whose move is it" inputs too. The outputs are: probability of white single win, probability of white gammon, probability of black gammon, probability of white backgammon, and probability of black backgammon. The probability of black single win was equal to one minus the sum of the other probabilities.
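As a small illustration of how those five outputs might be turned into a cubeless equity, here is a sketch. It assumes the outcome categories are mutually exclusive (a gammon excludes a backgammon), which is what the "one minus the sum" rule implies, and that equity is measured from white's perspective. The function name and the example probabilities are made up for illustration.

```python
def cubeless_equity(outputs):
    """outputs: [P(white single), P(white gammon), P(black gammon),
                 P(white backgammon), P(black backgammon)], in network order."""
    w_single, w_gammon, b_gammon, w_backgammon, b_backgammon = outputs
    # The sixth outcome is not a network output; it is implied by the others.
    b_single = 1.0 - sum(outputs)
    white_points = w_single + 2 * w_gammon + 3 * w_backgammon
    black_points = b_single + 2 * b_gammon + 3 * b_backgammon
    return white_points - black_points

# Made-up probabilities, purely for illustration.
example_outputs = [0.40, 0.10, 0.08, 0.01, 0.01]
example_equity = cubeless_equity(example_outputs)
print(f"equity for white: {example_equity:+.3f} points per game")
```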

I'm able to reproduce most of the initial cubeless play results from my earlier work, though I've yet to add the inputs for the number of checkers available to hit, or the Berliner primes. It takes around 200,000 game iterations to train up to something like the Benchmark 2 level. This was using alpha=0.1, lambda=0, and no "alpha splitting" (using a different learning rate for the input->hidden weights and the hidden->output weights).
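To make the alpha=0.1, lambda=0 setting concrete, here is a toy sketch of a single TD(0) weight update. It is not the real network: it uses one sigmoid output and no hidden layer, and the feature vectors are invented. With lambda=0 there are no eligibility traces, so each update just nudges the estimate for the current position towards the estimate for the next one (or towards the actual result at the end of the game).

```python
import math

ALPHA = 0.1  # learning rate quoted in the post

def value(weights, features):
    """Toy value estimate: a single sigmoid unit (assumption, not the real net)."""
    z = sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def td0_update(weights, features, target):
    """Move value(features) towards target by one gradient step."""
    v = value(weights, features)
    slope = v * (1.0 - v)  # derivative of the sigmoid with respect to z
    return [w + ALPHA * (target - v) * slope * x
            for w, x in zip(weights, features)]

weights = [0.0, 0.0, 0.0]
state = [1.0, 0.0, 1.0]        # invented feature vectors
next_state = [0.0, 1.0, 1.0]

# Mid-game step: the target is the estimate for the next position.
weights = td0_update(weights, state, value(weights, next_state))
# Final step: the game ended in a win, so the target is the real result.
weights = td0_update(weights, next_state, 1.0)
```

In the first step both estimates are 0.5, so nothing changes; learning only starts to flow backwards once a real game result appears.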

So now I've convinced myself that I remember how to build one of these players from scratch. For the next step I'm going to download the latest GnuBG benchmark databases and do supervised learning on those - it should be much easier to plug that into an external package like scikit-learn.

Gerald Tesauro's original paper

2021-07-01T06:39:00.002-07:00

For reference: here is a link to the original Tesauro paper on TD-Gammon.

Rebuilding the backgammon framework and pubEval

2021-07-01T05:52:00.003-07:00

I need a working bot to tell me the best move once I get the "parse the board state from a photo of a board" bit sorted.

I'm still following the same approach as originally, but this time I'm reimplementing it in Python, not C++. Two reasons for that: first, I enjoy Python coding a whole lot more because it's easier; and second, there are some truly excellent open source machine learning libraries in Python that I'll leverage. Execution speed of course is much slower, which might cause some angst when training, so I might need to do some optimization. Perhaps use Cython or numba or something to speed that up - we'll see.

And a third reason: rebuilding it from scratch will help me remember how all the details work!

In any case I've rebuilt the structure now, and have got a pubEval player to play simple games against. I had to google around a fair bit to find a nice description of pubEval, and finally found one here, where Gerry Tesauro published a post with a C implementation. Basically pubEval is just a linear regression (actually two regressions: one for a race, one for contact) that takes a 122-element representation of the board state and returns a score, with a higher score being better. All you really need then is the board state encoding, plus the linear regression weights for the two regressions, which you can find at that link.
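The move selection this implies can be sketched as follows. The weights here are placeholders, not the published pubEval weights, and the function names are my own; the 122-element board encoding is assumed to be computed elsewhere.

```python
N_FEATURES = 122  # size of the pubEval board encoding

def linear_score(weights, features):
    """Dot product of a weight vector and an encoded board."""
    return sum(w * x for w, x in zip(weights, features))

def choose_move(candidates, is_race, race_weights, contact_weights):
    """candidates: list of (move label, encoded resulting board) pairs.
    Returns the label of the move whose resulting board scores highest."""
    weights = race_weights if is_race else contact_weights
    best_label, _ = max(candidates, key=lambda c: linear_score(weights, c[1]))
    return best_label

# Placeholder weights and boards (assumptions, for illustration only).
contact_weights = [0.0] * N_FEATURES
contact_weights[5] = 1.0
race_weights = [0.0] * N_FEATURES
race_weights[7] = 1.0

board_a = [0.0] * N_FEATURES
board_a[5] = 2.0
board_b = [0.0] * N_FEATURES
board_b[7] = 3.0
candidates = [("move A", board_a), ("move B", board_b)]

best_in_contact = choose_move(candidates, False, race_weights, contact_weights)
best_in_race = choose_move(candidates, True, race_weights, contact_weights)
```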

After playing some games manually against pubEval, I can honestly say that it's not very good. It's not terrible, but I now remember why it's pretty easy to beat.

And Happy Canada Day! :)

Checker classifier results

2021-06-17T13:03:00.003-07:00

I took that set of 25,000-ish checker images and trained a multi-layer perceptron (scikit-learn's MLPClassifier) on them to see how it'd do.

Each image was reduced to a 16x16 grayscale image, and those 256 numbers were the inputs to the MLP. They were normalized by dividing by 255 and then subtracting 0.5, so the range was [-0.5,0.5] for each of the 256 inputs.
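The normalization step is simple enough to show directly. This sketch assumes the 16x16 image has already been flattened into a list of 256 grayscale values in the range 0 to 255; the function name is illustrative.

```python
def normalize_pixels(pixels):
    """Map grayscale values in [0, 255] to inputs in [-0.5, 0.5]."""
    return [p / 255.0 - 0.5 for p in pixels]

# A made-up window: half black pixels, half white pixels.
window = [0] * 128 + [255] * 128
inputs = normalize_pixels(window)
```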

As discussed before, I used a 99% threshold for classifying an image as a white or black checker, to improve precision at the expense of recall.

I split the input data into training and test sets: the training set used 75% of randomly-sampled inputs, and the test set was the remaining 25%.

I judged the performance of the classifier by looking at a confusion table, plus using it to identify checkers on a reference board.

This post gives some results for different topologies of the neural network. In all cases I used the "adam" solver (a stochastic gradient descent solver) and an L2 penalty parameter "alpha" equal to 1e-4.

One Hidden Layer, 10 Nodes

With one hidden layer and ten nodes, the test data set performance statistics were:

class	precision	recall	f1-score	support
0	0.85	1.00	0.92	5013
1	0.93	0.10	0.18	636
2	0.96	0.48	0.64	608
accuracy			0.86	6257
macro avg	0.91	0.52	0.58	6257
weighted avg	0.87	0.86	0.81	6257

Here, category 1 is white checkers, and category 2 is black checkers. Category 0 is no checker.

The precision ended up quite good, but the recall was pretty terrible - it was missing a lot of valid checkers.
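As a reminder of how the columns relate: precision is TP / (TP + FP), recall is TP / (TP + FN), and the f1-score is the harmonic mean of the two. A quick sketch that recomputes the f1 column from the rounded precision and recall values in the table above:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (class, precision, recall) taken from the 10-node table above.
rows = [(0, 0.85, 1.00), (1, 0.93, 0.10), (2, 0.96, 0.48)]
f1_by_class = {label: round(f1_score(p, r), 2) for label, p, r in rows}
print(f1_by_class)
```

Because the harmonic mean is dragged down by the smaller of the two numbers, a recall of 0.10 gives a poor f1-score even with precision above 0.9.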

When I run this on my reference board (a board image that was not used to generate any of the input data), it shows that it indeed does a pretty good job finding white checkers, but a relatively poor job with the black checkers:

This is how I'll qualitatively evaluate the performance. Each white dot in the image above is where, on one scan, the classifier thought it found a white checker (placed in the center of the window). As it did multiple scans, the same checker is found over and over, and you get a grouping of white dots. There are a few places on the board, outside the main board area, where it placed groupings of white dots, but generally it's finding the real white checkers pretty effectively.

It does a noticeably worse job with the black checkers (here colored blue on my board) - way fewer blue dots on those checkers.

One Hidden Layer, 100 Nodes

If I jack the number of hidden nodes in a single hidden layer from 10 to 100, the performance noticeably improves:

class	precision	recall	f1-score	support
0	0.94	0.99	0.96	5013
1	0.95	0.64	0.76	636
2	0.95	0.83	0.89	608
accuracy			0.94	6257
macro avg	0.94	0.82	0.87	6257
weighted avg	0.94	0.94	0.94	6257

The precision is a bit better, but the recall has improved dramatically. And when it identifies checkers on my reference board it does materially better with the black checkers:

Two Hidden Layers, 10 Nodes in Each

If we change the topology of the network to have two hidden layers, each with ten nodes, the performance is:

class	precision	recall	f1-score	support
0	0.87	0.99	0.93	5013
1	0.92	0.25	0.39	636
2	0.92	0.58	0.71	608
accuracy			0.88	6257
macro avg	0.90	0.61	0.68	6257
weighted avg	0.88	0.88	0.85	6257

This does a bit better than a single hidden layer with 10 nodes, but worse than the single layer with 100 nodes.

Two Hidden Layers, 100 Nodes in Each

If we change to 100 nodes in each of two hidden layers, the performance is much stronger:

class	precision	recall	f1-score	support
0	0.94	0.99	0.96	5013
1	0.93	0.61	0.74	636
2	0.94	0.88	0.91	608
accuracy			0.94	6257
macro avg	0.94	0.82	0.87	6257
weighted avg	0.94	0.94	0.93	6257

Adding the second hidden layer doesn't seem to improve the situation much compared to the one hidden layer with 100 nodes.

One Hidden Layer, 200 Nodes

With 200 nodes in one hidden layer, the performance doesn't change much from the 100 node case:

class	precision	recall	f1-score	support
0	0.94	0.99	0.96	5013
1	0.95	0.65	0.77	636
2	0.95	0.83	0.89	608
accuracy			0.94	6257
macro avg	0.95	0.82	0.87	6257
weighted avg	0.94	0.94	0.94	6257

Variation by Number of Hidden Nodes (One Layer)

The recall statistic for black checker (class 1 in the table) classification is the most sensitive to the topology. Here is a chart of performance and recall % for that category as a function of the number of hidden nodes for one hidden layer:

In general it seems like performance is pretty stable across the board, and recall seems to stabilize around 50 hidden nodes and more.

So I'll stick with one hidden layer and 50 nodes.

A checker classifier

2021-06-17T11:55:00.004-07:00

I started with the second step in my pipeline: how to identify the checkers on the board.

The approach I'm trying:

Start with the raw image of the board, laid out so that it's wider than it is tall, so that the points are vertical.
Define a small square window, something comparable to the checker size.
Scan that window across the board. At each step, look at the part of the raw image that's in the window and decide whether it contains a white checker, a black checker, or no checker.
Process the window image: downsample into a 16x16 grid of grayscale pixels.
My metric for deciding whether a checker is in the image: one checker has > 75% of its area in the window, no other checkers have > 25% of their areas in the window, and the center of the window lands inside the one main checker.
If a checker is counted as in the window, remember a point equal to the center of the window, tagged as white or black depending on which checker was found.
Run that scan across the whole board, for a range of window sizes, and for a range of starting pixel positions for the scan, so it gets a lot of views of the board in slightly different ways.
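The window metric in the list above (more than 75% of one checker in the window, no more than 25% of any other, and the window center inside the main checker) could be sketched like this. It assumes checkers are given as circles (center, radius, color) and approximates the overlap area by sampling points on a grid rather than computing it exactly. All names and the example coordinates are illustrative.

```python
def fraction_in_window(cx, cy, r, wx, wy, size, samples=50):
    """Approximate fraction of the circle (cx, cy, r) that lies inside the
    square window with top-left corner (wx, wy) and side length size."""
    in_circle = in_both = 0
    for i in range(samples):
        for j in range(samples):
            # Grid of sample points covering the circle's bounding box.
            x = cx - r + (i + 0.5) * 2 * r / samples
            y = cy - r + (j + 0.5) * 2 * r / samples
            if (x - cx) ** 2 + (y - cy) ** 2 <= r * r:
                in_circle += 1
                if wx <= x <= wx + size and wy <= y <= wy + size:
                    in_both += 1
    return in_both / in_circle

def label_window(checkers, wx, wy, size):
    """checkers: list of (cx, cy, r, color). Returns the color of the one
    checker that dominates the window, or 'none'."""
    fractions = [fraction_in_window(cx, cy, r, wx, wy, size)
                 for cx, cy, r, _ in checkers]
    main = [i for i, f in enumerate(fractions) if f > 0.75]
    if len(main) != 1:
        return "none"
    if any(f > 0.25 for i, f in enumerate(fractions) if i != main[0]):
        return "none"
    cx, cy, r, color = checkers[main[0]]
    mid_x, mid_y = wx + size / 2, wy + size / 2
    if (mid_x - cx) ** 2 + (mid_y - cy) ** 2 > r * r:
        return "none"  # window center is not inside the main checker
    return color

# Two touching checkers on an invented coordinate grid.
checkers = [(50, 50, 20, "white"), (90, 50, 20, "black")]
centered_on_white = label_window(checkers, 28, 28, 44)
straddling_both = label_window(checkers, 48, 28, 44)
```

A window that straddles two checkers gets labeled "none", which is the behavior you want for training examples with no clear checker.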

At the end of this I end up with a bunch of dots, hopefully grouped around the actual checker positions. Then I need some way of identifying the groupings as checkers, but that's the next step.

I decided to train a multi-layer perceptron - a neural network - to identify the three categories for each window image: no checker, a black checker, or a white checker. For this I used scikit-learn's MLPClassifier, with the following assumptions:

Adam solver (a stochastic gradient descent solver)
Relu activation function for every node
Two hidden layers with 100 nodes in each
L2 penalty parameter "alpha" equal to 1e-4

In this use case, I care about the precision of the classifier - that is, the ability of the classifier not to label as positive a sample that is negative - more than the recall - the ability of the classifier to find all the true samples. That's because it does this grid scan, and should identify the same checker many times in its scans. If it misses a checker on a few scans (because it's focused on precision over recall), it should catch it on other scans.

To amp the precision, I count a window as having a white or black checker only if the classifier has > 99% probability. That means a bunch of real checkers will be missed and counted as false negatives, which reduces the recall, but the false positives should be quite low, which pumps up the precision.
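A sketch of that thresholding rule, assuming the classifier's per-class probabilities (for example, one row of the output of MLPClassifier's `predict_proba`) have already been put into a dictionary keyed by class name. The function name and the example probabilities are made up.

```python
THRESHOLD = 0.99  # minimum probability needed to accept a checker

def classify_window(probabilities, threshold=THRESHOLD):
    """probabilities: dict with keys 'none', 'white', 'black'.
    Only report a checker when the classifier is very confident."""
    for color in ("white", "black"):
        if probabilities[color] > threshold:
            return color
    return "none"

confident = classify_window({"none": 0.004, "white": 0.995, "black": 0.001})
unsure = classify_window({"none": 0.10, "white": 0.90, "black": 0.00})
```

The second example would be called white under a plain "most likely class" rule; here it is treated as no checker, which is where the lost recall comes from.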

Before I could start training, though, I needed some labeled input data to train and test against. That is, images that are about the size of the window, where each is labeled as "has no clear checker", "has a white checker", or "has a black checker". Where to get these labeled inputs?

I took a picture of a few backgammon boards, and started by manually cropping out images and saving them, then manually labeling them. That was pretty slow - it took me a half hour just to get 100 or so.

Then I realized I could use my scanning approach. I took a picture of a board, then manually identified where the checkers were on the board: center point and radius for each, as well as whether it was white or black. I saved that identification data to a file, then I ran a scan over the board, and could use the manually-identified checker positions to figure out whether a window contained 75% of one checker, less than 25% of any other checker, and that the center of the window was inside the area of the one checker.

I did that with a handful of boards, and that generated about 2,400 black checker images, 2,600 white checker images, and 200,000 images with no checker. I discarded 90% of the "no checker" images just so the input data set wasn't too skewed towards that side, so ended up with about 80% of the examples being "no checker", and about 10% each in white and black checkers.
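The rebalancing arithmetic works out as 200,000 x 10% = 20,000 "no checker" images, against 2,400 + 2,600 = 5,000 checker images, so 20,000 / 25,000 = 80%. A sketch of the downsampling, run on counts scaled down by 100 to keep it quick (the function name and the placeholder images are assumptions):

```python
import random

def downsample_no_checker(samples, keep_fraction=0.1, seed=0):
    """samples: list of (image, label) pairs with label 'none', 'white'
    or 'black'. Keeps every checker image and a random subset of the rest."""
    rng = random.Random(seed)
    empties = [s for s in samples if s[1] == "none"]
    with_checker = [s for s in samples if s[1] != "none"]
    kept_empties = rng.sample(empties, round(len(empties) * keep_fraction))
    result = with_checker + kept_empties
    rng.shuffle(result)
    return result

# Counts from the post divided by 100; images are placeholders (None).
samples = ([(None, "none")] * 2000 + [(None, "black")] * 24
           + [(None, "white")] * 26)
balanced = downsample_no_checker(samples)
none_share = sum(1 for _, label in balanced if label == "none") / len(balanced)
```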

That served as my training set for the checker classifier.

Draft of overall approach

2021-06-17T10:18:00.000-07:00

I started by thinking through what the algorithm should be for parsing information off the board, as a pipeline that it walks through. I expect this will get refined as I start implementing, but the high level first draft of the steps is:

Figure out the board outline. That is, look for lines/rectangles in the image and use them to figure out the overall board dimensions and what part of the image is in the board and what's out.
Find all the checkers on the board inside the board outline. It needs to distinguish between white and black checkers on the board, as well as possibly checkers on the bar. I won't bother with checkers that have been borne off; I'll just assume any checkers that aren't on the board have been borne off.
Identify the dice and the numbers that are showing. For this purpose I'm just going to look at the two regular dice, not the doubling cube, because it's a bit easier. Maybe later I'll add in the doubling cube. I'll assume the dice are inside the board outline and not on top of any checkers. It'll also need to identify whether the dice are white or black so it can figure out whose move it is.

At the end of that it'll know how the board is laid out, whose turn it is, and what the roll is - that's enough to figure out the next move, at least if we're ignoring the doubling die.

Some assumptions I'm going to make:

The picture of the board is from above, without a lot of perspective skewing. That means all the checkers are roughly the same size, the board outline is pretty close to rectangular, and we can see only one side of each of the dice.
No checkers are stacked. For example, if there are six checkers on a point, they should all be laid out in a row, without stacking them on top of each other. I think it'll be very hard for a classifier to identify stacked checkers.
The doubling cube doesn't matter when choosing the best move. Of course this isn't really the case, but it's generally not bad.

Maybe later I'll amend this to include the doubling cube, but for the first stab I'm leaving it out, just to keep things simple.

Tech stack and ML toolkit for this experiment

2021-06-17T06:39:00.002-07:00

I'm using Python as my programming language here, partly because it's my preferred coding language, but also because there is an embarrassment of options for machine learning packages in Python. And, it's a language that makes it easy to experiment and get things done quickly.

The two machine learning packages I considered for this were scikit-learn and Tensorflow. scikit-learn has a great breadth of machine learning algorithms; Tensorflow is focused on neural networks and has a lot of depth there.

I'm starting with scikit-learn mostly because it was easier to get started with, and to swap in and out different types of (for example) classifier algorithms with a consistent API for fitting and predicting. Plus, I'm expecting to use machine learning techniques other than neural networks for some of the stages in the processing of a board image, which makes Tensorflow less appropriate.

Of course, if it ends up making sense, I can use Tensorflow in places and scikit-learn in others, but that makes the code harder to maintain later, so I'll try to stick with one ML package if I can.

New project! Computer vision to identify the layout of a backgammon board.

2021-06-17T06:30:00.001-07:00

I still play quite a lot of backgammon, in real life. My first project to create a backgammon bot did not, as I hoped, make me a noticeably better gammon player - instead, I just created a bot that could play (much!) better than me.

So my next project is aiming to help me figure out what to do when I'm stuck in a real life game: I want to take a picture of the board with my phone and then have my bot figure out what the best move is. Of course this isn't something to use in a competitive match! But in a friendly game it might be an interesting part of the conversation.

It feels like it would make a pretty handy phone app, but mostly it's an interesting computer vision/machine learning problem that, like my original TD-Gammon experiment, will let me improve my skills.

And like the last project, I'll document my approach as I experiment along with this. I'm feeling somewhat less confident about this project than the last one, since with TD-Gammon there was already a published result that I was pretty confident that I could match. This one has a less certain conclusion, but hopefully I'll end up with something that works. Let's see!

Improved GNUbg benchmarks

2012-08-20T15:08:00.001-07:00

The GNUbg team (in particular, Philippe Michel) has created new benchmark databases for Contact, Crashed, and Race layouts, using the same set of board positions but rolling out the equities with more accuracy. This corrects the significant errors found in the old Crashed benchmark, and improves the Contact and Race benchmarks.

They are available for download here, in the benchmarks subdirectory.

Philippe also did some work on improving the board positions included in the Crashed training database, which is available for download in the training_data subdirectory at that link.

I re-ran the statistics for several of my players, as well as for PubEval. I also added average scores against Player 3.6 as the most comprehensive benchmark.

Player	GNUbg Contact ER	GNUbg Crashed ER	GNUbg Race ER	PubEval Avg Ppg	Player 3.6 Avg Ppg
GNUbg	10.5	5.89	0.643	0.63	N/A
Player 3.6	12.7	9.17	0.817	0.601	0.0
Player 3.5	13.1	9.46	0.817	0.597	-0.0027
Player 3.4	13.4	9.63	0.817	0.596	-0.0119
Player 3.3	13.4	9.89	0.985	0.595	-0.0127
Player 3.2	14.1	10.7	2.14	0.577	-0.041
Player 3.2q	33.7	26.2	2.45	0.140	-0.466
Player 2.4	18.2	21.7	2.05	0.484	-0.105
Benchmark 2	21.6	23.2	5.54	0.438	-0.173
PubEval (ListNet)	41.7	50.5	2.12	0.048	-0.532
PubEval	44.2	51.3	3.61	0	-0.589

For the games against PubEval I ran 40k cubeless money games; standard errors are +/- 0.006ppg. Down to Player 3.2, for the games against Player 3.6 I ran 400k cubeless money games to get more accuracy; standard errors are +/- 0.002ppg or better. For players worse than Player 3.2 I played 100k games against Player 3.6 as the average scores were larger; standard errors are +/- 0.004ppg.
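The quoted standard errors are consistent with the usual 1/sqrt(N) scaling. As a rough check, this sketch backs out the per-game standard deviation implied by +/- 0.006ppg over 40k games and applies it to the other game counts. It assumes the per-game standard deviation is about the same for every matchup, which is only approximately true.

```python
import math

def standard_error(per_game_std, n_games):
    """Standard error of the average score over n_games independent games."""
    return per_game_std / math.sqrt(n_games)

# Per-game standard deviation implied by +/- 0.006ppg over 40k games.
implied_std = 0.006 * math.sqrt(40_000)

errors = {n: standard_error(implied_std, n)
          for n in (40_000, 100_000, 400_000)}
for n, err in errors.items():
    print(f"{n:>7,} games: +/- {err:.4f} ppg")
```

Going from 40k to 400k games is ten times the work for only about a threefold reduction in error, which is why the long runs were reserved for the closely matched players.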

Philippe Michel was gracious enough to provide the GNUbg 0-ply scores against the newly-created benchmarks. Also it seems like I had the scores against the old benchmarks incorrect: they were Contact 10.4, Crashed 7.72, and Race 0.589. The Contact score was close, but the other two I had significantly worse.

Player 3.6: longer training results

2012-08-19T00:57:00.001-07:00

I haven't had much time lately to work on this project, but while I'm engaged elsewhere I thought I'd run training for a long period and see whether it continued to improve.

It did - fairly significantly. So my players before clearly were not fully converged.

The result is my new best player, Player 3.6. Its GNUbg benchmark scores are Contact 12.7, Crashed 11.1, and Race 0.766. In 400k cubeless money games against Player 3.5 it averages 0.0027ppg +/- 0.0018 ppg, a small improvement.

In 40k games against Benchmark 2 it averages 0.181 +/- 0.005 ppg, and against PubEval 0.601 +/- 0.006 ppg.

For training I used supervised learning with three parallel and independent streams: one with alpha=0.01, one with alpha=0.03, and finally one with alpha=0.1. This was meant to test the optimal alpha to use.

Surprisingly, alpha=0.01 was not the best level to use. alpha=0.03 improved almost 3x as quickly. alpha=0.1 did not improve well on the Contact benchmark score but did improve the best for the Crashed benchmark score.

I take from this that alpha=0.03 is the best level of alpha to use for long term convergence.

The Crashed benchmark score we know is not that useful: the Crashed benchmark itself is flawed, and a multi-linear regression showed very little impact on score of the Crashed benchmark. That said, I tried a little experiment where I used the Contact network for crashed positions in Player 3.5 and it definitely worsened performance in self-play: 0.04ppg on average. That is a significant margin at this point in the player development.

I ran 4,000 supervised learning steps in the training, for each of the three alpha levels. In each step I trained on a randomly-arranged set of Contact and Crashed training positions from the training benchmark databases. This took a month and a half. The benchmark scores were still slowly improving for alpha=0.01 and alpha=0.03, so there is still scope for improvement. I stopped just because the GNUbg folks have put out new benchmark and training databases that I want to switch to.

GNUbg Crashed benchmark issue

2012-06-12T10:27:00.001-07:00

It looks like the Crashed benchmark set in the GNUbg benchmarks isn't very accurate in some cases.

There is a thread discussing it in the GNUbg mailing list.

Interesting to know, and hopefully the GNUbg team will improve on it; but the Crashed benchmark score is not very predictive for overall gameplay, as I've discovered while comparing players of different strengths.

Player 3.5: new input, escapes from the bar

2012-05-14T14:14:00.000-07:00

I tried another new input for contact and crashed networks: this time, the expected escapes if you have a single checker on the bar. That is, looking at the available spaces in the opponent home board and weighting the probability of landing in the space with the standard escape count from the Berliner primes calculation. It is meant to give some indication of how good or bad it'd be to get hit. I'm focusing on inputs along these lines because when looking at which positions are calculated most poorly in the benchmarks, it tends to be boards where there is a significant chance of being hit and landing behind a prime.

This one had some success, and while the improvement is still incremental, it resulted in my best player to date. The resulting player that uses the new input is Player 3.5. It is identical to Player 3.4, except for two new inputs: the input as described above, one for each player.

Its GNUbg benchmark scores are Contact 13.0, Crashed 11.5, and Race 0.766. Player 3.4's scores are 13.3, 11.7, and 0.766, so Player 3.5 is noticeably better, but the gain is still nothing dramatic (though there is notably some improvement in Contact, the most important benchmark). It seems that to get a significantly stronger player I'll have to add a bunch of inputs, each of which offers reasonably incremental benefits.

In cubeless money play against Player 3.4, it scores an average +0.0033ppg +/- 0.0021ppg in 400k games. Against PubEval it scores an average +0.592ppg +/- 0.005ppg in 100k games and wins 69.5% of the games.

Still not nearly as good as GNUbg 0-ply! But creeping closer.

To be honest I'm not really sure whether the improved performance came because of the new input or because I slightly changed the training algorithm. In this case I started with random weights for the new inputs and ran supervised learning against the GNUbg training databases (contact & crashed). And instead of bouncing back and forth between a large alpha (1) and smaller alphas, I just used a small and constant alpha of 0.03. The resulting benchmark score slowly improved over 1,100 iterations, which took several days to run.

New inputs failure: bar hit/entry probability

2012-04-27T18:56:00.001-07:00

I've been spending a little time looking at cases where my Player 3.4 does poorly in the GNUbg contact benchmarks database, to get some feel for what new inputs I might try.

It looks like it's leaving blots too often when the opponent has a good prime blocking the way out of the home board.

So I tried two new inputs: the odds of entering the opponent's home board if there were a checker on the bar; and the odds of hitting an opponent blot in his home board if there were a checker on the bar.
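For concreteness, here is a small sketch of how two inputs like these could be computed by enumerating the 36 dice rolls. The board representation and the function name are hypothetical, and the hit probability here only counts hits made by the entering die itself (not entering with one die and then hitting with the other), which is a simplifying assumption and may differ from what I actually used.

```python
from itertools import product

def bar_probabilities(home_board):
    """Entry and direct-hit probabilities for a single checker on the bar.

    home_board[i] is the number of opponent checkers on the home-board
    point that is entered with a die showing i + 1.
    Returns (probability of entering, probability of hitting a blot).
    """
    if len(home_board) != 6:
        raise ValueError("home_board must describe exactly six points")
    enter = hit = 0
    for d1, d2 in product(range(1, 7), repeat=2):
        counts = (home_board[d1 - 1], home_board[d2 - 1])
        # A point is open if the opponent has fewer than two checkers on it
        if any(c <= 1 for c in counts):
            enter += 1
        # A blot is a point with exactly one opponent checker
        if any(c == 1 for c in counts):
            hit += 1
    return enter / 36.0, hit / 36.0
```

With n points blocked the entry probability reduces to the familiar 1 - (n/6)^2, so three blocked points give 0.75.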

I tried two training approaches: first, adding random weights for just those four new inputs (the two inputs times two players) and doing supervised learning on the GNUbg training databases; and also starting from scratch, random weights everywhere, and doing TD training through self-play and then SL on the GNUbg training databases.

The conclusion: neither worked. In both cases the new player was about the same as or a little worse than Player 3.4. So these aren't the right new inputs to add.

Back to the drawing board.

Jump model final version posted

2012-04-25T12:26:00.004-07:00

I've posted a new version of my jump model for cube decision points:

http://arxiv.org/abs/1203.5692

This version is quite similar to the last one, with just a few elaborations added after another set of very productive discussions with Rick Janowski. He's been a huge help in pulling this together.

I doubt this paper will change again, though I'll probably revisit the model in the future with another paper. Probably to focus on how to estimate the local jump volatility accurately.

PubEval trained using ListNet

2012-04-17T21:06:00.000-07:00

I spent some time looking at PubEval again - not my implementation, which is fine now, but rather how Tesauro came up with it in the first place. One discussion suggests that he trained it using "comparison training", a machine learning approach he seems to have come up with - some kind of supervised learning on a manually-prepared set of benchmarks. Each benchmark (I'm assuming) was a list of moves given a starting point and a dice roll, where the moves were ordered by goodness.

I tried to reproduce this. I couldn't find any proper references to "comparison training", but there's a lot of relatively recent literature on machine learning algorithms for generating rankings, which is the same sort of thing.

We can do a lot better than Tesauro's hand crafted training set: we have the GNUbg benchmark databases that are much larger and more accurate.

So what we want is an algorithm that we can feed a training set, where each element of the set has the five boards listed for each move and the rolled-out equities for each. The inputs that define the board are the PubEval inputs, and the evaluation function should be a linear function of the inputs (like PubEval is).

Wikipedia has a nice summary of different machine learning approaches to ranking estimators.

The ListNet algorithm seems like a good choice. I implemented it and trained it on the GNUbg contact and race benchmark databases.
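A minimal sketch of the idea, assuming the "top-one" form of ListNet with a linear scoring function: the rolled-out equities and the model scores for the candidate moves are each turned into a probability distribution with a softmax, and the weights are nudged to reduce the cross entropy between the two. The function names, the learning rate, and the unscaled softmax over equities are illustrative assumptions rather than a record of exactly what I ran.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))  # subtract the max for numerical stability
    return e / e.sum()

def listnet_loss_and_grad(weights, boards, equities):
    """Top-one ListNet loss and gradient for one benchmark entry.

    boards:   (n_moves, n_inputs) array, one row of inputs per candidate move
    equities: (n_moves,) rolled-out equities for the same moves
    """
    scores = boards @ weights          # linear evaluation, like PubEval
    p_target = softmax(equities)
    p_model = softmax(scores)
    loss = -np.sum(p_target * np.log(p_model))
    grad = boards.T @ (p_model - p_target)
    return loss, grad

def train_listnet(entries, n_inputs, alpha=0.05, epochs=200):
    """Plain stochastic gradient descent over the training entries."""
    weights = np.zeros(n_inputs)
    for _ in range(epochs):
        for boards, equities in entries:
            _, grad = listnet_loss_and_grad(weights, boards, equities)
            weights -= alpha * grad
    return weights

def total_loss(weights, entries):
    return sum(listnet_loss_and_grad(weights, b, e)[0] for b, e in entries)
```

Because the loss only depends on the softmax over the moves in one entry, adding a constant to every score in that entry changes nothing, which is why this kind of training is optimized for equity differences between similar boards rather than outright equity.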

It fairly quickly converged to a better solution than Tesauro's original PubEval. That is, the weights I found can be plugged into the usual PubEval setup, but give a slightly stronger player. Not much better, but noticeably stronger. Not surprising given the more comprehensive training set.

The weights themselves, and the output values, were quite different to PubEval. The ListNet algorithm effectively trained the regression to approximate equity, so in this approach the board evaluations correspond to equity (though optimized for equity differences on similar boards rather than outright equity).

The GNUbg benchmark scores were: Contact 41.5, Crashed 47.5, and Race 1.90. This compares to PubEval's scores of 44.1, 49.7, and 3.54.

The weights are available on my Dropbox account: race and contact.

In 100k cubeless games (with variance reduction) against PubEval it scores +0.043ppg +/- 0.004ppg. Again, noticeably better.

Of course this is a terrible player compared to neural network players, but it's neat to be able to reproduce something like what Tesauro did with PubEval. And it was a good excuse to play around with the newer machine learning algorithms focused on ranking.

This might also be an interesting training approach for a neural network. The network would be optimized for checker play, so would be less efficient at the absolute equity estimation required for doubling decisions. But perhaps one could have two sets of networks, one optimized for checker play, the other for doubling decisions.

Average number of games in a match

2012-04-10T13:33:00.000-07:00

Some more interesting statistics on match play: the average number of games per match.

I ran self-play for two players using Player 3.4 for checker play (cubeful equity optimization) and the Janowski match model with cube life index 0.70, and looked at how many games it took on average to finish a match.

The most interesting result is that the average number of games divided by the match length seems to converge reasonably well to a value around 0.65.

Here is a chart of the results: blue line/left axis shows average number of games, red line/right axis shows the average number of games divided by the match length:

The only other source I could find for similar statistics gave similar results (out to match length 11).

I ran the experiment with x=0.5 and x=0.9 as well to see how that affected the converged average number of games per match length. x=0.5 gave a converged ratio of 0.51; x=0.9 gave 0.90. This compares to 0.65 for x=0.7.

So the average number of games per match divided by the match length converges (for long matches) to something surprisingly close to the cube life index itself! I'm sure this is just a coincidence, but an interesting one.

Optimal Janowski cube life index in match play

2012-04-10T12:39:00.003-07:00

In my last post I looked at extending Janowski's cubeful equity model to match play.

The conclusion: match play also favors a cube life index very close to 0.70.

I played a Janowski match strategy with different cube life indexes against a Janowski match strategy with cube life index 0.70 as a benchmark. I ran 40k matches with variance reduction and recorded the average points per match.

The results:

Match Length	x=0.5	x=0.6	x=0.8	x=0.9
3	-0.004	0.000	-0.004	-0.007
5	-0.007	+0.007	+0.002	-0.024
7	-0.021	-0.003	-0.008	-0.029
9	-0.028	-0.006	+0.003	-0.037

If I fit parabolas through the results and force them to pass through zero ppm at x=0.7, I find optimal cube life indexes of x=0.61 for match length 3, x=0.64 for match length 5, x=0.68 for match length 7, and x=0.69 for match length 9.
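One way to do a fit like this is sketched below: writing u = x - 0.7, fit y = a u^2 + b u by least squares (there is no constant term, so the curve passes through zero at x=0.7) and take the vertex at u = -b/(2a). The function name is hypothetical and I'm assuming an unweighted least-squares fit; with the rounded values in the table it will not necessarily reproduce every figure quoted above.

```python
import numpy as np

def optimal_cube_life_index(xs, ppm, x_ref=0.7):
    """Vertex of a least-squares parabola forced through zero at x_ref."""
    u = np.asarray(xs, dtype=float) - x_ref
    y = np.asarray(ppm, dtype=float)
    # Basis without a constant term, so the fit is zero at u = 0
    basis = np.column_stack([u ** 2, u])
    (a, b), *_ = np.linalg.lstsq(basis, y, rcond=None)
    return x_ref - b / (2.0 * a)

# Match length 9 row of the table above
x_values = [0.5, 0.6, 0.8, 0.9]
length_9 = [-0.028, -0.006, 0.003, -0.037]
best_x = optimal_cube_life_index(x_values, length_9)
```

For the match length 9 row this lands at roughly 0.69.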

All average points per match have a standard error of +/- 0.004ppm, so the statistics are marginal for the shorter match lengths.

There is some evidence for a smaller cube life index for shorter matches, but not much. In general the optimal match cube life index looks very close to the optimal money cube life index.

UPDATE: I ran longer simulations for more values of the cube life index for match lengths 3, 5, and 7 to try to get more accurate statistics. From those data I get optimal cube life indexes of 0.70, 0.67, and 0.69 for match lengths 3, 5, and 7 respectively. So no evidence of a smaller optimal cube life index for shorter matches: everything should use 0.70.

That said, the performance difference for short matches of using a suboptimal cube life index is pretty infinitesimal. It becomes a bigger deal for longer matches.

Janowski model extended to match play

2012-04-09T08:15:00.000-07:00

When I was looking at match equity tables I wondered whether you could extend Janowski's model (using a "cube life index" to interpolate between dead and live cube equities as a proxy for the jumpiness of game state) to match play. I'm pretty sure this is what GNUbg does based on their docs.

Turns out it's pretty straightforward if you assume the same match equity table as you calculate with Tom Keith's algorithm, which is a live cube model - it assumes game-winning probability changes diffusively, and that W and L (the conditional points won on win or lost on loss) are independent of the game-winning probability. That's the same as Janowski's live cube limit, except W and L are calculated from entries in the match equity table instead of the usual money scores for wins, gammons, and backgammons.

The Keith algorithm mentioned before gives you the cubeful match equity in the live cube limit. The dead cube limit has cubeful equity that's linear in P, running from -L at P=0 to +W at P=1. The model cubeful equity is just the weighted sum of the live and dead cube cubeful equities.
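In code the interpolation is tiny. The sketch below takes the live cube equity as an input (it would come from the Keith-style calculation) and assumes the usual Janowski weighting, with weight x on the live cube equity and 1-x on the dead cube equity; the function and argument names are just for illustration.

```python
def dead_cube_equity(p, win_value, loss_value):
    """Linear in p: -loss_value at p=0 and +win_value at p=1."""
    return p * (win_value + loss_value) - loss_value

def janowski_match_equity(p, win_value, loss_value, live_equity, x):
    """Interpolate between dead and live cube equities.

    p:           game-winning probability
    win_value:   W, conditional points won on a win (from match equities)
    loss_value:  L, conditional points lost on a loss (from match equities)
    live_equity: cubeful equity in the live cube limit at this p
    x:           cube life index, 0 = dead cube, 1 = live cube
    """
    dead = dead_cube_equity(p, win_value, loss_value)
    return x * live_equity + (1.0 - x) * dead
```

Setting x=0 recovers the dead cube line and x=1 the live cube equity, so the cube life index is the only extra parameter the match model needs beyond the match equity table.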

I implemented this to see how it compares to using a Janowski money model for doubling decisions in tournament play. That's of course inefficient - it doesn't account for match equity - but it's an interesting benchmark to show how much a match-aware doubling strategy matters.

Checker play was Player 3.4 optimizing on cubeful equity, and I ran 40k matches (with variance reduction) for a range of match lengths and cube life indexes. For a given cube life index, both the match and money doubling strategies share the same cube life index, which seems a fair comparison. Remember, for money play, a cube life index of x=0.70 was optimal.

The entries in the table are average points per match in the 40k matches. All values have a standard error of +/- 0.005ppm.

Match Length	x=0	x=0.25	x=0.50	x=0.75	x=1
3	+0.056	+0.047	+0.051	+0.066	+0.054
5	+0.073	+0.080	+0.091	+0.105	+0.067
7	+0.115	+0.138	+0.137	+0.142	+0.061
9	+0.115	+0.148	+0.150	+0.154	+0.081

The main conclusion here is that using a proper match-aware doubling strategy makes a huge difference in outcome. The impact is larger for longer matches because any per-game edge gets magnified over a match. In principle for very long matches the match strategy should become the money strategy, but for matches up to length 9, anyways, that is not apparent in the statistics.

This assumes the same match equity table as before, which is based on the live cube model. Really I should recalculate the match equity table assuming the same Janowski model for cube decisions, which I think will change it a bit. But I'll leave that for another day.

Or I could extend my jump model to match play - it should be a relatively simple extension, since like with Janowski it's just about changing W and L to be based off match equities.

Cubeful vs cubeless equities in checker play decisions

2012-04-07T13:36:00.002-07:00

So far all my players have made checker play decisions by choosing the position with the best cubeless equity, even after I introduced doubling strategies to decide when to offer & take doubles. This is clearly suboptimal: the doubling strategies can calculate cubeful equity, so the checker play should choose moves based on the position with the best cubeful equity, not best cubeless equity.

I'm referring to money games; for match play they should optimize match equity.

I extended my backgammon framework to allow the players to optimize on cubeful equity in checker play to see how much that would improve things.

The answer, it seems: at least for my intermediate-strength players, not much!

I played 100k cubeful money games (with variance reduction) between two players (both Player 3.4) that both use Janowski x=0.7 for their doubling strategy. The first used cubeful equities for checker play and the second used cubeless equity.

The first player scored -0.0004ppg +/- 0.0044ppg. Hardly a significant advantage to using cubeful equity when choosing the best move - the score is indistinguishable from zero.

I also thought I'd run the cubeful-equity player through the GNUbg benchmarks. The benchmarks are for cubeless play, so that should give something worse than for the cubeless-equity player. But the benchmarks are all about choosing the best move out of the possible moves, so all that matters there is relative equity difference, not absolute equity, and perhaps that's not affected much by cubeful vs cubeless.

The cubeless benchmark scores are Contact 13.32, Crashed 11.69, and Race 0.766.

Using a cubeful-equity player, always assuming a centered cube, the scores were 13.44, 12.16, and 0.766. So the Contact and Crashed scores got a little bit worse, but hardly changed. Using cubeful equity in checker play decisions seems to make little difference in almost every case.

Assuming player-owned cube the scores were 13.41, 12.03, and 0.768. Assuming opponent-owned cube the scores were 13.42, 11.82, and 0.764. So cube position does not matter much either.

The conclusion: it doesn't matter much if you use cubeful or cubeless equity in checker play decisions.

Nonetheless the cubeful equity performance is marginally better, and there are probably edge case conditions where it matters more, so I'll start using cubeful equity in money play.

Player 3.4: incrementally better

2012-04-06T14:07:00.002-07:00

I let the supervised learning algorithm train for a while longer, starting with the Player 3.3 networks (i.e. the same networks and inputs).

Its benchmark scores were Contact 13.3, Crashed 11.7, and Race 0.766, so my best player yet, but only incrementally better than Player 3.3's scores of 13.3, 12.0, and 0.93.

On the most important benchmark, Contact, it was unchanged. And in self-play against Player 3.3 (100k games with variance reduction) its score was zero within a standard error of 0.0009ppg. Not exactly a startling improvement!

Nonetheless it is measurably better in the other GNUbg benchmarks so I'll start using it as my best player.

Playing 100k cubeless money games against PubEval it scored +0.586ppg +/- 0.001ppg, winning 69.22% of the games; Player 3.3 in the same games scores +0.585ppg +/- 0.001ppg and wins 69.20% of the games. So again very little difference, but still incrementally my best player.