Sunday, February 12, 2012

GNUbg benchmark results

I'm moving to the GNUbg benchmark databases as my new benchmark, but I want to compare them to the benchmarking I've done in the past: playing many games against a reference player.

For each of nine players of varying skill, I calculated the GNUbg average error rate against each of the three benchmark databases (Contact, Crashed, and Race). I also ran 40k cubeless money games for each player against both PubEval and my Benchmark 2 player, so that I can compare the GNUbg average error rate benchmarks against the scores in those games.

Note: the results below are a bit incorrect due to a bug in the benchmark calculation. Corrected results here.

Results:


Player            GNUbg Contact ER  GNUbg Crashed ER  GNUbg Race ER  PubEval Avg Ppg  Benchmark 2 Avg Ppg
Player 2.4        16.4              13.8              1.70           0.480            0.072
Player 2.4q       32.7              34.3              3.35           0.119            -0.283
Player 2.1        16.9              14.5              1.73           0.460            0.069
Player 2          18.3              15.6              1.79           0.442            0.021
Benchmark 2       19.1              18.8              4.42           0.432            0
Benchmark 2 (10)  35.6              30.2              10.5           0.064            -0.418
Benchmark 2 (40)  22.7              20.8              4.86           0.330            -0.067
Benchmark 1       20.2              18.5              6.12           0.351            -0.101
PubEval           37.0              43.4              2.22           0                -0.437

"Benchmark 2 (10)" and "Benchmark 2 (40)" are like Benchmark 2 but with 10 and 40 hidden nodes respectively (vs 80 for the usual Benchmark 2).

The ER numbers in the table above are average equity errors, multiplied by 1,000 to make the display a bit nicer. The Avg Ppg numbers are the average scores in the 40k cubeless money game matches.
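As a small illustration of that scaling, here is a sketch of how a scaled ER comes out of raw per-game equity errors. The error values below are hypothetical, chosen just to show the arithmetic, not taken from the benchmark runs above:

```python
# Hypothetical average equity errors per game (in raw equity units).
equity_errors = [0.012, 0.020, 0.017, 0.015]

# The tables here report the mean error multiplied by 1,000.
er = 1000 * sum(equity_errors) / len(equity_errors)
print(er)  # roughly 16.0, in the same range as the Contact ER column
```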

To make some sense of these results I tried running a few regressions:


Regression                 Slope    Intercept  R-Squared
PubEval Ppg vs Contact ER  -0.0222  0.8369     99.1%
PubEval Ppg vs Crashed ER  -0.0174  0.7033     92.6%
PubEval Ppg vs Race ER     -0.0268  0.4068     17.2%
BM2 Ppg vs Contact ER      -0.0238  0.4519     97.4%
BM2 Ppg vs Crashed ER      -0.0183  0.3001     87.6%
BM2 Ppg vs Race ER         -0.0347  0.0143     24.6%
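As a sanity check, the first of those regressions can be reproduced from the results table with a short script. This is just a sketch using numpy's polyfit; any ordinary least-squares routine would give the same fit:

```python
import numpy as np

# Data transcribed from the results table above: GNUbg Contact ER and
# average points per game in the 40k cubeless money games vs PubEval.
contact_er = np.array([16.4, 32.7, 16.9, 18.3, 19.1, 35.6, 22.7, 20.2, 37.0])
pubeval_ppg = np.array([0.480, 0.119, 0.460, 0.442, 0.432, 0.064, 0.330, 0.351, 0.0])

# Ordinary least-squares fit of ppg against Contact ER (degree-1 polynomial).
slope, intercept = np.polyfit(contact_er, pubeval_ppg, 1)

# For a simple linear fit, R^2 is the squared correlation coefficient.
r2 = np.corrcoef(contact_er, pubeval_ppg)[0, 1] ** 2

print(f"slope={slope:.4f} intercept={intercept:.4f} R^2={r2:.1%}")
```

Running this on the table data lands on the slope, intercept, and R-squared shown in the PubEval Ppg vs Contact ER row.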

One notable result: the GNUbg benchmark most predictive of match score (in R^2 terms) is the Contact benchmark. Crashed is also fairly predictive, but Race tells us little. I suspect that is because all the strategies play the race phase relatively well (compared to the other two game phases), so overall performance differences do not depend much on race play.

I also ran a multivariate linear regression of the PubEval and Benchmark 2 scores against the three benchmarks. Using all three factors instead of just Contact did not improve R^2 much; the gain was a bit larger in the Benchmark 2 case, but still marginal:

Benchmark    Intercept  Contact ER Slope  Crashed ER Slope  Race ER Slope  R-Squared
PubEval      0.8121     -0.01562          -0.00505          -0.00417       99.3%
Benchmark 2  0.4199     -0.01323          -0.00732          -0.01340       98.6%
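The multivariate fit can be sketched the same way. Here is the PubEval case, solving a least-squares system over the three ER columns from the results table; with more regressors the R^2 can only go up from the single-factor Contact fit:

```python
import numpy as np

# Benchmark ERs from the results table: Contact, Crashed, Race columns.
ers = np.array([
    [16.4, 13.8, 1.70],
    [32.7, 34.3, 3.35],
    [16.9, 14.5, 1.73],
    [18.3, 15.6, 1.79],
    [19.1, 18.8, 4.42],
    [35.6, 30.2, 10.5],
    [22.7, 20.8, 4.86],
    [20.2, 18.5, 6.12],
    [37.0, 43.4, 2.22],
])
pubeval_ppg = np.array([0.480, 0.119, 0.460, 0.442, 0.432, 0.064, 0.330, 0.351, 0.0])

# Design matrix with an intercept column, solved by least squares.
X = np.column_stack([np.ones(len(ers)), ers])
coeffs, _, _, _ = np.linalg.lstsq(X, pubeval_ppg, rcond=None)

# R^2 from residual and total sums of squares.
resid = pubeval_ppg - X @ coeffs
r2 = 1 - (resid @ resid) / ((pubeval_ppg - pubeval_ppg.mean()) ** 2).sum()

print("intercept and slopes:", coeffs)
print(f"R^2 = {r2:.1%}")
```

The coefficients should land near the PubEval row of the multivariate table, with R^2 a touch above the 99.1% single-factor Contact fit.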

From now on I will use the GNUbg Contact benchmark average error per game as my standard benchmark, with the Crashed and Race results as secondary benchmarks.

