I've been thinking a bit about whether GPUs are useful for neural network calculations with backgammon bots. It's not totally clear that they are: shipping data to and from the GPU has an overhead cost, which has to be weighed against the benefits of parallelization.
That said, it is not very hard to check: there are nice open source packages available for neural networks that already support GPU calculations.
A simple one that seems decent for our purposes is PyTorch. This is a Python package for representing neural networks, and, as desired, it supports GPUs through CUDA (which works if you've got an NVIDIA GPU, as I do on my desktop machine).
There are two possible benefits of the GPU. First, evaluation of the network might be faster - so that it's quicker to do bot calculations during a game. Second, the training might be faster - that doesn't show up in post-training game play, but it could make it easier to experiment with different network sizes and configurations.
For the training calculations, the GPU really only helps with supervised learning - for example, training against the GNUBG training examples - but not with TD learning. That's because TD learning requires you to simulate a game to generate each training target - the full set of inputs and targets isn't known up front, the way it is with a fixed supervised training set.
I constructed a simple feed forward neural network with the same shape as the ones I've discussed before: 196 inputs (the standard Tesauro inputs with no extensions), 120 hidden nodes (as a representative example), and 5 output nodes.
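Here's a rough sketch of how a network like that can be set up in PyTorch (the class name BgNet is just for illustration, and I'm assuming sigmoid activations on the hidden and output layers, as in my earlier nets):

    import torch
    import torch.nn as nn

    class BgNet(nn.Module):
        def __init__(self, n_inputs=196, n_hidden=120, n_outputs=5):
            super().__init__()
            self.hidden = nn.Linear(n_inputs, n_hidden)
            self.output = nn.Linear(n_hidden, n_outputs)

        def forward(self, x):
            # sigmoid on the hidden layer and on the five outputs
            h = torch.sigmoid(self.hidden(x))
            return torch.sigmoid(self.output(h))

    net = BgNet()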
Then I just timed the calculation of network outputs given the inputs. The PyTorch version was noticeably slower even than my hand-rolled Python neural network class (which uses numpy vectorization to speed it up - not as fast as native C++ code, but within a factor of 2-3 of it). That's true even though the PyTorch networks do their calculations in C++ themselves. These calculations didn't use the GPU - just the CPU, running serially.
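For reference, the timing harness was along these lines, comparing the PyTorch forward pass (using the BgNet sketch above) against a hand-rolled numpy evaluation with random weights and inputs - the exact numbers will of course depend on the machine:

    import time
    import numpy as np

    net = BgNet()                                  # the network sketched above
    x_np = np.random.rand(196).astype(np.float32)  # one board's worth of inputs
    x_t = torch.from_numpy(x_np)

    n_evals = 10000
    start = time.time()
    with torch.no_grad():                          # plain evaluation, no gradients needed
        for _ in range(n_evals):
            net(x_t)
    print("PyTorch CPU: %.1f us/eval" % (1e6 * (time.time() - start) / n_evals))

    # hand-rolled numpy version of the same evaluation, roughly what my own class does
    W1 = np.random.rand(120, 196).astype(np.float32)
    b1 = np.random.rand(120).astype(np.float32)
    W2 = np.random.rand(5, 120).astype(np.float32)
    b2 = np.random.rand(5).astype(np.float32)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    start = time.time()
    for _ in range(n_evals):
        sigmoid(W2.dot(sigmoid(W1.dot(x_np) + b1)) + b2)
    print("numpy:       %.1f us/eval" % (1e6 * (time.time() - start) / n_evals))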
Then I tried the PyTorch calculations parallelizing on the GPU. I've got an NVIDIA GeForce GTX 1060 3GB GPU in my desktop machine, with 1,152 CUDA cores. It was slower, with this number of hidden nodes, than the CPU version of the PyTorch calculation. So the memory transfer overhead outweighed the parallelization benefits in this case.
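The GPU run is the same calculation with the model and the inputs moved onto the CUDA device - something like the snippet below, where the per-evaluation host-to-device copy is exactly the transfer overhead in question:

    device = torch.device("cuda")       # assumes an NVIDIA GPU with CUDA available
    net_gpu = BgNet().to(device)

    start = time.time()
    with torch.no_grad():
        for _ in range(n_evals):
            # copying the inputs to the GPU on every call is the overhead that hurts
            net_gpu(torch.from_numpy(x_np).to(device))
    torch.cuda.synchronize()            # wait for queued GPU work before reading the clock
    print("PyTorch GPU: %.1f us/eval" % (1e6 * (time.time() - start) / n_evals))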
I tried it with larger networks to see what happened: as the number of hidden nodes goes up, the PyTorch evaluations start to outperform my numpy-based evaluations, especially when using the GPU.
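A rough way to see where the crossover happens is to sweep the hidden-layer size and time each case, reusing the pieces above (the crossover point will depend on the hardware):

    for n_hidden in (120, 500, 1000, 4000):
        net_cpu = BgNet(n_hidden=n_hidden)
        net_gpu = BgNet(n_hidden=n_hidden).to(device)
        with torch.no_grad():
            t0 = time.time()
            for _ in range(1000):
                net_cpu(x_t)
            t1 = time.time()
            for _ in range(1000):
                net_gpu(x_t.to(device))
            torch.cuda.synchronize()
            t2 = time.time()
        print("%5d hidden: CPU %7.1f us/eval, GPU %7.1f us/eval"
              % (n_hidden, 1e6 * (t1 - t0) / 1000, 1e6 * (t2 - t1) / 1000))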
So for post-training bot evaluations, it doesn't seem like using the GPU will give much improvement, if we're using nets of the size we're used to.
The GPU does, however, make a massive difference when training! In practice it runs about 10-20x faster than the serial version of the supervised training against the GNUBG training data. I'm excited about this one. And I can always take the trained weights off the PyTorch-based network and copy them into the numpy-based one, which evaluates faster post-training.
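For completeness, here's roughly what the batched supervised training looks like on the GPU, and how the trained weights come back out as numpy arrays afterwards. The arrays inputs_np and targets_np stand in for the GNUBG training data already converted to numpy, and the learning rate, batch size, and epoch count are placeholders rather than the settings I actually used:

    import torch.optim as optim

    device = torch.device("cuda")
    net = BgNet().to(device)

    # inputs_np (N x 196) and targets_np (N x 5) stand in for the GNUBG training data
    inputs = torch.from_numpy(inputs_np.astype(np.float32)).to(device)
    targets = torch.from_numpy(targets_np.astype(np.float32)).to(device)

    opt = optim.SGD(net.parameters(), lr=0.1)    # placeholder learning rate
    loss_fn = nn.MSELoss()
    batch_size = 1000                            # placeholder batch size

    for epoch in range(10):                      # placeholder number of passes over the data
        perm = torch.randperm(inputs.shape[0], device=device)
        for i in range(0, inputs.shape[0], batch_size):
            idx = perm[i:i + batch_size]
            opt.zero_grad()
            loss = loss_fn(net(inputs[idx]), targets[idx])
            loss.backward()
            opt.step()

    # pull the trained weights back into numpy arrays so they can be dropped
    # into the numpy-based network for fast post-training evaluation
    W1 = net.hidden.weight.detach().cpu().numpy()
    b1 = net.hidden.bias.detach().cpu().numpy()
    W2 = net.output.weight.detach().cpu().numpy()
    b2 = net.output.bias.detach().cpu().numpy()

The whole training set gets pushed to the GPU once up front, so during training the only work is the batched matrix math - which is exactly where the GPU shines.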