I'm moving to the GNUbg benchmark databases as a new benchmark, but first want to compare them against the benchmarking I've done in the past: playing many games against a reference player.
I looked at nine different players of varying skill and calculated the GNUbg average error rate for each, for each of the three benchmark databases (Contact, Crashed, and Race). I also ran 40k cubeless money games for each against both PubEval and my Benchmark 2 player, so that I can compare the GNUbg average error rate benchmarks against the scores in those games.
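For concreteness, here is a minimal sketch of the error-rate calculation, assuming the benchmark database has already been parsed into records pairing each position with the rolled-out equities of its candidate moves. The record format and the `choose_move` call are hypothetical stand-ins, not GNUbg's actual file format or API:

```python
def average_error_rate(player, records):
    """Average equity error over the benchmark positions, scaled by 1,000.

    `records` is assumed to be an iterable of (position, move_equities)
    pairs, where move_equities maps each candidate move to its rolled-out
    equity; `player.choose_move` is a hypothetical stand-in for the
    strategy's move selection.
    """
    total_error, n = 0.0, 0
    for position, move_equities in records:
        best_equity = max(move_equities.values())
        chosen = player.choose_move(position, list(move_equities))
        # Equity given up relative to the rollout-best move; zero when
        # the player picks the best move.
        total_error += best_equity - move_equities[chosen]
        n += 1
    return 1000.0 * total_error / n
```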
Note: the results below are a bit incorrect due to a bug in the benchmark calculation. Corrected results here.
Results:
Player | GNUbg Contact ER | GNUbg Crashed ER | GNUbg Race ER | PubEval Avg Ppg | Benchmark 2 Avg Ppg |
---|---|---|---|---|---|
Player 2.4 | 16.4 | 13.8 | 1.70 | 0.480 | 0.072 |
Player 2.4q | 32.7 | 34.3 | 3.35 | 0.119 | -0.283 |
Player 2.1 | 16.9 | 14.5 | 1.73 | 0.460 | 0.069 |
Player 2 | 18.3 | 15.6 | 1.79 | 0.442 | 0.021 |
Benchmark 2 | 19.1 | 18.8 | 4.42 | 0.432 | 0 |
Benchmark 2 (10) | 35.6 | 30.2 | 10.5 | 0.064 | -0.418 |
Benchmark 2 (40) | 22.7 | 20.8 | 4.86 | 0.330 | -0.067 |
Benchmark 1 | 20.2 | 18.5 | 6.12 | 0.351 | -0.101 |
PubEval | 37.0 | 43.4 | 2.22 | 0 | -0.437 |
"Benchmark 2 (10)" and "Benchmark 2 (40)" are like Benchmark 2 but with 10 and 40 hidden nodes respectively (vs 80 for the usual Benchmark 2).
The ER numbers in the table above are average equity errors, multiplied by 1,000 to make the display a bit nicer; for example, Player 2.4's Contact ER of 16.4 corresponds to an average equity error of 0.0164. The Avg Ppg numbers are the average points per game in the 40k cubeless money game matches.
To make some sense of these results I tried running a few regressions:
Metric | PubEval Ppg vs Contact ER | PubEval Ppg vs Crashed ER | PubEval Ppg vs Race ER | BM2 Ppg vs Contact ER | BM2 Ppg vs Crashed ER | BM2 Ppg vs Race ER |
---|---|---|---|---|---|---|
Slope | -0.0222 | -0.0174 | -0.0268 | -0.0238 | -0.0183 | -0.0347 |
Intercept | 0.8369 | 0.7033 | 0.4068 | 0.4519 | 0.3001 | 0.0143 |
R-Squared | 99.1% | 92.6% | 17.2% | 97.4% | 87.6% | 24.6% |
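These univariate fits are easy to reproduce from the results table. A sketch using scipy's `linregress` for the PubEval-vs-Contact case, where the data arrays are just the Contact ER and PubEval Avg Ppg columns from above; it should print the slope, intercept, and R-squared in the first column of the regression table:

```python
import numpy as np
from scipy.stats import linregress

# Contact ER and PubEval Avg Ppg columns from the results table above.
contact_er  = np.array([16.4, 32.7, 16.9, 18.3, 19.1, 35.6, 22.7, 20.2, 37.0])
pubeval_ppg = np.array([0.480, 0.119, 0.460, 0.442, 0.432, 0.064, 0.330, 0.351, 0.0])

fit = linregress(contact_er, pubeval_ppg)
print(f"slope={fit.slope:.4f}  intercept={fit.intercept:.4f}  R^2={fit.rvalue**2:.1%}")
```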
One notable result: the GNUbg benchmark that is most predictive of match score (in R^2 terms) is the Contact benchmark. Crashed is also fairly predictive, but Race tells you little. I think that's because all the strategies are relatively strong in Race (compared to the other two game phases), so overall performance differences do not depend much on that phase's performance.
I also ran a multivariate linear regression of the PubEval and Benchmark 2 scores against the three benchmarks. Adding all three factors instead of just Contact did not improve R^2 much - a bit more in the Benchmark 2 case, but still marginal:
Benchmark | Intercept | Contact ER Slope | Crashed ER Slope | Race ER Slope | R-Squared |
---|---|---|---|---|---|
PubEval | 0.8121 | -0.01562 | -0.00505 | -0.00417 | 99.3% |
Benchmark 2 | 0.4199 | -0.01323 | -0.00732 | -0.01340 | 98.6% |
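The multivariate fit amounts to an ordinary least-squares solve with an intercept column. A sketch for the PubEval case using numpy, with the ER columns taken from the results table above:

```python
import numpy as np

# Contact, Crashed, and Race ER columns plus PubEval Avg Ppg, from the table.
ers = np.array([
    [16.4, 13.8, 1.70], [32.7, 34.3, 3.35], [16.9, 14.5, 1.73],
    [18.3, 15.6, 1.79], [19.1, 18.8, 4.42], [35.6, 30.2, 10.5],
    [22.7, 20.8, 4.86], [20.2, 18.5, 6.12], [37.0, 43.4, 2.22],
])
pubeval_ppg = np.array([0.480, 0.119, 0.460, 0.442, 0.432, 0.064, 0.330, 0.351, 0.0])

X = np.column_stack([np.ones(len(ers)), ers])   # intercept column + three ERs
coef, *_ = np.linalg.lstsq(X, pubeval_ppg, rcond=None)
resid = pubeval_ppg - X @ coef
r2 = 1.0 - (resid @ resid) / ((pubeval_ppg - pubeval_ppg.mean()) ** 2).sum()
print("intercept + slopes:", coef, " R^2:", r2)
```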
From now on I will use the Contact GNUbg benchmark average error per game as the standard benchmark, while also referring to the Crashed and Race results as secondary benchmarks.