Thursday, June 17, 2021

Draft of overall approach

I started by thinking through the algorithm for parsing information off the board, structured as a pipeline of stages. I expect this will get refined as I start implementing, but the high-level first draft of the steps is below (with a sketch of the pipeline skeleton after the list):

  1. Figure out the board outline. That is, look for lines/rectangles in the image and use them to determine the overall board dimensions and which part of the image is inside the board and which is outside.
  2. Find all the checkers inside the board outline. It needs to distinguish between white and black checkers on the board, as well as any checkers on the bar. I won't bother with checkers that have been borne off; I'll just assume any checkers that aren't on the board have been borne off.
  3. Identify the dice and the numbers that are showing. For this purpose I'm just going to look at the two regular dice, not the doubling cube, because it's a bit easier. Maybe later I'll add in the doubling cube. I'll assume the dice are inside the board outline and not on top of any checkers. It'll also need to identify whether the dice are white or black so it can figure out whose move it is.
At the end of that it'll know how the board is laid out, whose turn it is, and what the roll is - that's enough to figure out the next move, at least if we're ignoring the doubling cube.
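To make this concrete, here's a minimal sketch of what the pipeline skeleton might look like. Everything in it is a placeholder - the function names, the return shapes, the crop logic - just to show the flow from image to parsed position:

    # Placeholder stages; each one will become a real detector as I implement.
    # The image is assumed to be a numpy-style array of pixels.

    def find_board_outline(image):
        """Step 1: locate the board rectangle; returns (x, y, width, height)."""
        raise NotImplementedError

    def find_checkers(board):
        """Step 2: return lists of (point, count) for white and for black,
        with the bar treated as one more point."""
        raise NotImplementedError

    def find_dice(board):
        """Step 3: return the two dice as (pips, color) pairs."""
        raise NotImplementedError

    def parse_board_image(image):
        x, y, w, h = find_board_outline(image)
        board = image[y:y + h, x:x + w]    # crop to just the board
        white, black = find_checkers(board)
        dice = find_dice(board)            # dice color tells us whose move it is
        return {"white": white, "black": black, "dice": dice}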

Some assumptions I'm going to make:
  • The picture of the board is from above, without a lot of perspective skewing. That means all the checkers are roughly the same size, the board outline is pretty close to rectangular, and we can see only one side of each of the dice.
  • No checkers are stacked. For example, if there are six checkers on a point, they should all be laid out in a row, without stacking them on top of each other. I think it'll be very hard for a classifier to identify stacked checkers.
  • The doubling cube doesn't matter when choosing the best move. Of course this isn't really true, but it's generally a reasonable approximation.
Maybe later I'll amend this to include the doubling cube, but for the first stab I'm leaving it out, just to keep things simple.

Tech stack and ML toolkit for this experiment

I'm using Python as my programming language here, partly because it's my preferred coding language, but also because Python offers an embarrassment of riches in machine learning packages. It's also a language that makes it easy to experiment and get things done quickly.

The two machine learning packages I considered for this were scikit-learn and TensorFlow. scikit-learn has a great breadth of machine learning algorithms; TensorFlow is focused on neural networks and has a lot of depth there.

I'm starting with scikit-learn, mostly because it's easier to get started with, and because it makes it easy to swap different (for example) classifier algorithms in and out behind a consistent API for fitting and predicting. Plus, I'm expecting to use machine learning techniques other than neural networks for some of the stages in processing a board image, which makes TensorFlow less appropriate.
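As a quick illustration of that consistent API, here's a toy sketch using scikit-learn's bundled digits dataset as a stand-in for my board-image features. Any of the classifiers can be swapped in without changing the surrounding code:

    from sklearn.datasets import load_digits
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # All three expose the same fit/predict/score interface, so swapping
    # one classifier for another is a one-line change.
    for clf in (LogisticRegression(max_iter=5000), RandomForestClassifier(), SVC()):
        clf.fit(X_train, y_train)
        print(type(clf).__name__, clf.score(X_test, y_test))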

Of course, if it ends up making sense, I can use TensorFlow in some places and scikit-learn in others, but that would make the code harder to maintain later, so I'll try to stick with one ML package if I can.


New project! Computer vision to identify the layout of a backgammon board.

I still play quite a lot of backgammon in real life. My first project to create a backgammon bot did not, as I'd hoped, make me a noticeably better player - instead, I just created a bot that could play (much!) better than me.

So my next project is aiming to help me figure out what to do when I'm stuck in a real life game: I want to take a picture of the board with my phone and then have my bot figure out what the best move is. Of course this isn't something to use in a competitive match! But in a friendly game it might be an interesting part of the conversation.

It feels like it would make a pretty handy phone app, but mostly it's an interesting computer vision/machine learning problem that, like my original TD-Gammon experiment, will let me improve my skills.

And like the last project, I'll document my approach as I experiment. I'm feeling somewhat less confident about this project than the last one: with TD-Gammon there was already a published result that I was pretty confident I could match. This one has a less certain conclusion, but hopefully I'll end up with something that works. Let's see!


Monday, August 20, 2012

Improved GNUbg benchmarks

The GNUbg team (in particular, Philippe Michel) has created new benchmark databases for Contact, Crashed, and Race layouts, using the same set of board positions but rolling out the equities with more accuracy. This corrects the significant errors found in the old Crashed benchmark, and improves the Contact and Race benchmarks.

They are available for download here, in the benchmarks subdirectory.

Philippe also did some work on improving the board positions included in the Crashed training database, which is available for download in the training_data subdirectory at that link.

I re-ran the statistics for several of my players, as well as for PubEval. I also ran games against Player 3.6, as the most comprehensive head-to-head comparison.

Player         GNUbg Contact ER   GNUbg Crashed ER   GNUbg Race ER   Avg ppg vs PubEval   Avg ppg vs Player 3.6
GNUbg 0-ply    10.5               5.89               0.643           0.63                 N/A
Player 3.6     12.7               9.17               0.817           0.601                0.0
Player 3.5     13.1               9.46               0.817           0.597                -0.0027
Player 3.4     13.4               9.63               0.817           0.596                -0.0119
Player 3.3     13.4               9.89               0.985           0.595                -0.0127
Player 3.2     14.1               10.7               2.14            0.577                -0.041
?              33.7               26.2               2.45            0.140                -0.466
?              18.2               21.7               2.05            0.484                -0.105
?              21.6               23.2               5.54            0.438                -0.173
?              41.7               50.5               2.12            0.048                -0.532
PubEval        44.2               51.3               3.61            0                    -0.589

For the games against PubEval I ran 40k cubeless money games; standard errors are +/- 0.006ppg. For the games against Player 3.6 I ran 400k cubeless money games (for players down to Player 3.2) to get more accuracy; standard errors are +/- 0.002ppg or better. For players worse than Player 3.2 I played 100k games against Player 3.6, since the average scores were larger; standard errors there are +/- 0.004ppg.
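As an aside, those standard errors are consistent with the usual sample-standard-deviation-over-sqrt(n) estimate: the quoted +/- 0.006ppg at 40k games implies a per-game standard deviation of around 1.2 points. A quick sketch of the calculation (the example figure is inferred from the quoted errors, not measured from the actual runs):

    import numpy as np

    def avg_ppg_and_stderr(game_scores):
        """Average points per game and its standard error from per-game results."""
        scores = np.asarray(game_scores, dtype=float)
        return scores.mean(), scores.std(ddof=1) / np.sqrt(len(scores))

    # e.g. a per-game standard deviation near 1.2 points gives
    # 1.2 / sqrt(40000) ~= 0.006ppg for the 40k-game PubEval matches.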

Philippe Michel was gracious enough to provide the GNUbg 0-ply scores against the newly-created benchmarks. It also seems I had GNUbg's scores against the old benchmarks wrong: they were actually Contact 10.4, Crashed 7.72, and Race 0.589. My Contact number was close, but the other two were significantly off.



Sunday, August 19, 2012

Player 3.6: longer training results

I haven't had much time lately to work on this project, but while I'm engaged elsewhere I thought I'd run training for a long period and see whether it continued to improve.

It did - fairly significantly. So my earlier players clearly were not fully converged.

The result is my new best player, Player 3.6. Its GNUbg benchmark scores are Contact 12.7, Crashed 11.1, and Race 0.766. In 400k cubeless money games against Player 3.5 it averages +0.0027ppg +/- 0.0018ppg, a small improvement.

In 40k games against Benchmark 2 it averages 0.181 +/- 0.005 ppg, and against PubEval 0.601 +/- 0.006 ppg.

For training I used supervised learning with three parallel and independent streams: one with alpha=0.01, one with alpha=0.03, and finally one with alpha=0.1. This was meant to test the optimal alpha to use.
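Structurally the setup looks something like the sketch below. The real network update is stood in for by a plain linear model, and the training batches are random placeholders - this is just to show the three independent streams, not the actual trainer:

    import numpy as np

    def train_step(weights, positions, targets, alpha):
        """One supervised step: nudge weights toward the benchmark equities."""
        preds = positions @ weights                     # stand-in for the network
        grad = positions.T @ (preds - targets) / len(targets)
        return weights - alpha * grad

    rng = np.random.default_rng(0)
    n_features = 196                                    # placeholder input count
    streams = {a: rng.normal(0.0, 0.1, n_features) for a in (0.01, 0.03, 0.1)}

    for step in range(4000):                            # 4,000 steps, as in the real run
        positions = rng.normal(size=(128, n_features))  # placeholder training batch
        targets = rng.normal(size=128)                  # placeholder rolled-out equities
        for a in streams:
            streams[a] = train_step(streams[a], positions, targets, a)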

Surprisingly, alpha=0.01 was not the best level to use: the alpha=0.03 stream improved almost 3x as quickly. The alpha=0.1 stream did not improve much on the Contact benchmark score, but improved the most on the Crashed benchmark score.

I take from this that alpha=0.03 is the best level of alpha to use for long-term convergence.

We know the Crashed benchmark score is not that useful: the Crashed benchmark itself is flawed, and a multi-linear regression showed the Crashed score has very little impact on overall performance. That said, I tried a little experiment where I used the Contact network for crashed positions in Player 3.5, and it definitely worsened performance in self-play, by 0.04ppg on average. That is a significant margin at this point in the player's development.

I ran 4,000 supervised learning steps in the training, for each of the three alpha levels. In each step I trained on a randomly-ordered set of Contact and Crashed training positions from the training databases. This took a month and a half. The benchmark scores were still slowly improving for alpha=0.01 and alpha=0.03, so there is still scope for improvement. I stopped only because the GNUbg folks have put out new benchmark and training databases that I want to switch to.

Tuesday, June 12, 2012

GNUbg Crashed benchmark issue

It looks like the Crashed benchmark set in the GNUbg benchmarks isn't very accurate in some cases.

There is a thread discussing it on the GNUbg mailing list.

Interesting to know, and hopefully the GNUbg team will improve on it; but the Crashed benchmark score is not very predictive of overall gameplay, as I've discovered while comparing players of different strengths.


Monday, May 14, 2012

Player 3.5: new input, escapes from the bar

I tried another new input for the contact and crashed networks: this time, the expected escapes if you have a single checker on the bar. That is, it looks at the open points in the opponent's home board and weights the probability of landing on each point by the standard escape count from the Berliner primes calculation. It is meant to give some indication of how good or bad it would be to get hit. I'm focusing on inputs along these lines because when I look at which positions are evaluated most poorly against the benchmarks, they tend to be boards where there is a significant chance of being hit and landing behind a prime.
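Here's roughly how I'd compute the input, as a sketch. The escape_count callable is a stand-in for the Berliner-style escape count table, and taking the best open entry point for each roll is my assumption about how to do the weighting:

    from itertools import product

    def bar_escape_input(opponent_home_points, escape_count):
        """Expected escape count after entering from the bar.

        opponent_home_points: opposing checker counts on the six entry
        points, index 0 being the point entered with a 1.
        escape_count: Berliner-style escape count for a checker sitting
        on a given entry point (a stand-in here).
        """
        total = 0.0
        for die1, die2 in product(range(1, 7), repeat=2):  # all 36 rolls
            # entry points this roll can use (two or more checkers = blocked)
            open_entries = [d for d in {die1, die2}
                            if opponent_home_points[d - 1] < 2]
            if open_entries:
                # assume we enter on the best available point for this roll
                total += max(escape_count(d) for d in open_entries)
        return total / 36.0

    # Toy usage: opponent holds their 1- and 3-points; escape counts faked.
    print(bar_escape_input([2, 0, 2, 0, 0, 0], lambda d: 6 - d))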

This one had some success: while the improvement is still incremental, it resulted in my best player to date, Player 3.5. It is identical to Player 3.4 except for two new inputs: the input described above, calculated once for each player.

Its GNUbg benchmark scores are Contact 13.0, Crashed 11.5, and Race 0.766. Player 3.4's scores are 13.3, 11.7, and 0.766 - so Player 3.5 is noticeably better, but nothing dramatic (though notably some improvement in Contact, the most important benchmark). It seems that to get a significantly stronger player I'll have to add a bunch of inputs, each of which offers a reasonably incremental benefit.

In cubeless money play against Player 3.4, it scores an average of +0.0033ppg +/- 0.0021ppg over 400k games. Against PubEval it scores an average of +0.592ppg +/- 0.005ppg over 100k games, winning 69.5% of the games.

Still not nearly as good as GNUbg 0-ply! But creeping closer.

To be honest I'm not really sure whether the improved performance came from the new input or from a slight change I made to the training algorithm. In this case I started with random weights for the new inputs and ran supervised learning against the GNUbg training databases (Contact & Crashed). And instead of bouncing back and forth between a large alpha (1) and smaller alphas, I just used a small, constant alpha of 0.03. The resulting benchmark score slowly improved over 1,100 iterations, which took several days to run.
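For what it's worth, splicing new inputs into an already-trained network looks something like this sketch: keep the trained weights and give only the new columns a random start. The sizes here are placeholders, not my actual network dimensions:

    import numpy as np

    rng = np.random.default_rng(0)
    old_weights = rng.normal(size=(80, 196))         # stand-in for the trained hidden layer
    new_columns = rng.normal(0.0, 0.1, (80, 2))      # random start for the two new inputs
    weights = np.hstack([old_weights, new_columns])  # the network now sees 198 inputs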