I took that set of 25,000-ish checker images and trained a multi-layer perceptron (scikit-learn's MLPClassifier) on them to see how it'd do.
Each image was reduced to a 16x16 grayscale image, and those 256 numbers were the inputs to the MLP. They were normalized by dividing by 255 and then subtracting 0.5, so the range was [-0.5,0.5] for each of the 256 inputs.
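As a minimal sketch of that normalization (assuming the 16x16 grayscale image is already available as an array of values from 0 to 255; the function name and the gradient test image are made up for illustration):

```python
import numpy as np

def normalize_pixels(gray_16x16):
    """Flatten a 16x16 grayscale image (0..255) into 256 inputs in [-0.5, 0.5]."""
    pixels = np.asarray(gray_16x16, dtype=np.float64)
    if pixels.shape != (16, 16):
        raise ValueError("expected a 16x16 grayscale image")
    # Divide by 255 to get [0, 1], then subtract 0.5 to center on zero.
    return pixels.reshape(-1) / 255.0 - 0.5

# Hypothetical example image: a horizontal gradient from black to white.
example = np.tile(np.linspace(0, 255, 16), (16, 1))
features = normalize_pixels(example)
print(features.shape, features.min(), features.max())
```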
As discussed before, I used a 99% threshold for classifying an image as a white or black checker, to improve precision at the expense of recall.
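One way to apply such a threshold is to start from per-class probabilities (for example, what MLPClassifier's predict_proba returns) and fall back to "no checker" unless the top class clears 99%. This is a sketch under assumptions: the function name is hypothetical, and the columns are assumed to be ordered as classes 0, 1, 2.

```python
import numpy as np

def classify_with_threshold(probabilities, threshold=0.99, no_checker=0):
    """Keep the most likely class only when its probability clears the threshold."""
    probabilities = np.asarray(probabilities, dtype=np.float64)
    best = probabilities.argmax(axis=1)                  # most likely class per row
    confident = probabilities.max(axis=1) >= threshold   # is it likely enough?
    return np.where(confident, best, no_checker)

# Made-up probabilities for four windows (columns: class 0, class 1, class 2).
probs = [
    [0.002, 0.995, 0.003],    # confident checker, class 1
    [0.100, 0.850, 0.050],    # probably class 1, but below the threshold
    [0.004, 0.001, 0.995],    # confident checker, class 2
    [0.999, 0.0005, 0.0005],  # confident "no checker"
]
labels = classify_with_threshold(probs)
print(labels)
```

The second row shows the trade-off: a likely checker is dropped, which costs recall but keeps false positives down.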
I split the input data into training and test sets: the training set used 75% of randomly-sampled inputs, and the test set was the remaining 25%.
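A split like this can be sketched with scikit-learn's train_test_split. The data below is random placeholder data, and the random_state is an assumption added only to make the example repeatable:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the real images: 1000 rows of 256 inputs.
rng = np.random.default_rng(0)
X = rng.uniform(-0.5, 0.5, size=(1000, 256))
y = rng.integers(0, 3, size=1000)  # labels 0, 1, 2

# Randomly hold out 25% of the rows for testing; the other 75% are for training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
print(X_train.shape, X_test.shape)
```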
I judged the performance of the classifier by looking at a confusion table, plus using it to identify checkers on a reference board.
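The per-class tables further down look like the output of scikit-learn's classification_report, though that is an assumption since the post doesn't name the function. A small sketch with made-up labels shows how a confusion matrix and that kind of report are produced:

```python
from sklearn.metrics import classification_report, confusion_matrix

# Made-up true and predicted labels (0 = no checker, 1 and 2 = checker classes).
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 1, 2, 0]

# Rows are the true class, columns are the predicted class.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
print(cm)

# Per-class precision, recall, f1-score and support as formatted text.
report = classification_report(y_true, y_pred, labels=[0, 1, 2])
print(report)
```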
This post gives some results for different topologies of the neural network. In all cases I used the "adam" solver (a stochastic gradient descent solver) and an L2 penalty parameter "alpha" equal to 1e-4.
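For reference, here is roughly how those topologies could be set up with MLPClassifier. The solver and alpha values come from the text above (both also happen to be scikit-learn's defaults); max_iter, random_state, and the helper name are assumptions made for this sketch.

```python
from sklearn.neural_network import MLPClassifier

def make_classifier(hidden_layer_sizes):
    """Build an MLP with the given hidden-layer topology."""
    return MLPClassifier(
        hidden_layer_sizes=hidden_layer_sizes,
        solver="adam",    # from the post
        alpha=1e-4,       # L2 penalty, from the post
        max_iter=500,     # assumption, not stated in the post
        random_state=0,   # assumption, for repeatable runs
    )

# One tuple entry per hidden layer; each entry is that layer's node count.
topologies = {
    "one layer, 10 nodes": (10,),
    "one layer, 100 nodes": (100,),
    "two layers, 10 nodes each": (10, 10),
    "two layers, 100 nodes each": (100, 100),
    "one layer, 200 nodes": (200,),
}
classifiers = {name: make_classifier(sizes) for name, sizes in topologies.items()}
```

Each classifier would then be trained with fit on the training set and scored on the test set.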
One Hidden Layer, 10 Nodes
With one hidden layer and ten nodes, the test data set performance statistics were:
              precision    recall  f1-score   support
           0       0.85      1.00      0.92      5013
           1       0.93      0.10      0.18       636
           2       0.96      0.48      0.64       608
    accuracy                           0.86      6257
   macro avg       0.91      0.52      0.58      6257
weighted avg       0.87      0.86      0.81      6257
Here, category 0 is no checker, category 1 is black checkers, and category 2 is white checkers.
The precision ended up quite good, but the recall was pretty terrible - it was missing a lot of valid checkers.
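The f1-score column shows how much the weak recall drags things down: it is the harmonic mean of precision and recall, so it stays low if either one is low. A quick check against the precision and recall values reported in the table above:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall) per class, copied from the 10-node table.
reported = {0: (0.85, 1.00), 1: (0.93, 0.10), 2: (0.96, 0.48)}

f1_by_class = {label: round(f1_score(p, r), 2) for label, (p, r) in reported.items()}
print(f1_by_class)
```

The rounded values match the f1-score column of the table (0.92, 0.18, 0.64).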
When I run this on my reference board (a board image that was not used to generate any of the input data), it shows that it indeed does a pretty good job finding white checkers, but a relatively poor job with the black checkers:
This is how I'll qualitatively evaluate the performance. Each white dot in the image above is where, on one scan, the classifier thought it found a white checker (placed in the center of the window). As it did multiple scans, the same checker is found over and over, and you get a grouping of white dots. There are a few places on the board, outside the main board area, where it placed groupings of white dots, but generally it's finding the real white checkers pretty effectively.
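The dot-placing idea can be sketched as a sliding window over the board image. The post doesn't say how the individual scans differ, so this example assumes overlapping window positions; the board, the window size, the stride, and the stand-in classifier are all hypothetical.

```python
import numpy as np

def scan_for_checkers(board, window, stride, classify):
    """Slide a square window over the board; record the center of every hit."""
    dots = []
    height, width = board.shape
    for top in range(0, height - window + 1, stride):
        for left in range(0, width - window + 1, stride):
            label = classify(board[top:top + window, left:left + window])
            if label != 0:  # 0 means "no checker"
                dots.append((top + window // 2, left + window // 2, label))
    return dots

# Hypothetical board: dark background with one bright 16x16 "checker".
board = np.zeros((64, 64))
board[16:32, 16:32] = 255

def toy_classifier(patch):
    """Stand-in for the trained MLP: call a patch a checker if it is mostly bright."""
    return 1 if patch.mean() > 100 else 0

dots = scan_for_checkers(board, window=16, stride=8, classify=toy_classifier)
print(dots)
```

Because neighboring windows overlap the same checker, it is reported several times, which is what produces a cluster of dots around each real checker.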
It does a noticeably worse job with the black checkers (here colored blue on my board) - way fewer blue dots on those checkers.
One Hidden Layer, 100 Nodes
If I jack up the number of hidden nodes in a single hidden layer from 10 to 100, the performance noticeably improves:
              precision    recall  f1-score   support
           0       0.94      0.99      0.96      5013
           1       0.95      0.64      0.76       636
           2       0.95      0.83      0.89       608
    accuracy                           0.94      6257
   macro avg       0.94      0.82      0.87      6257
weighted avg       0.94      0.94      0.94      6257
The precision is a bit better, but the recall has improved dramatically. And when it identifies checkers on my reference board it does materially better with the black checkers:
Two Hidden Layers, 10 Nodes in Each
If we change the topology of the network to have two hidden layers, each with ten nodes, the performance is:
              precision    recall  f1-score   support
           0       0.87      0.99      0.93      5013
           1       0.92      0.25      0.39       636
           2       0.92      0.58      0.71       608
    accuracy                           0.88      6257
   macro avg       0.90      0.61      0.68      6257
weighted avg       0.88      0.88      0.85      6257
This does a bit better than a single hidden layer with 10 nodes, but worse than the single layer with 100 nodes.
Two Hidden Layers, 100 Nodes in Each
If we change to 100 nodes in each of two hidden layers, the performance is much stronger than with ten nodes in each:
              precision    recall  f1-score   support
           0       0.94      0.99      0.96      5013
           1       0.93      0.61      0.74       636
           2       0.94      0.88      0.91       608
    accuracy                           0.94      6257
   macro avg       0.94      0.82      0.87      6257
weighted avg       0.94      0.94      0.93      6257
Adding the second hidden layer doesn't seem to improve the situation much compared to the one hidden layer with 100 nodes.
One Hidden Layer, 200 Nodes
With 200 nodes in one hidden layer, the performance doesn't change much from the 100 node case:
              precision    recall  f1-score   support
           0       0.94      0.99      0.96      5013
           1       0.95      0.65      0.77       636
           2       0.95      0.83      0.89       608
    accuracy                           0.94      6257
   macro avg       0.95      0.82      0.87      6257
weighted avg       0.94      0.94      0.94      6257
Variation by Number of Hidden Nodes (One Layer)
The recall for black checkers (class 1 in the tables) is the statistic most sensitive to the topology. Here is a chart of performance and recall % for that category as a function of the number of hidden nodes for one hidden layer:
In general it seems like performance is pretty stable across the board, and recall seems to stabilize around 50 hidden nodes and more.
So I'll stick with one hidden layer and 50 nodes.
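A sweep like the one behind that chart could look roughly like the sketch below. It runs on synthetic data from make_classification rather than the checker images, uses plain predict instead of the 99% threshold, and the node counts, max_iter, and random_state are assumptions, so the numbers it prints say nothing about the real data set.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic three-class stand-in for the checker data.
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=10, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

results = {}
for n_hidden in (5, 20, 50):
    clf = MLPClassifier(
        hidden_layer_sizes=(n_hidden,), solver="adam", alpha=1e-4,
        max_iter=300, random_state=0,
    )
    clf.fit(X_train, y_train)
    predicted = clf.predict(X_test)
    # Precision and recall for class 1 only, the class tracked in the chart.
    precision = precision_score(
        y_test, predicted, labels=[1], average=None, zero_division=0
    )[0]
    recall = recall_score(
        y_test, predicted, labels=[1], average=None, zero_division=0
    )[0]
    results[n_hidden] = (precision, recall)
    print(f"{n_hidden:3d} hidden nodes: precision={precision:.2f} recall={recall:.2f}")
```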