I'm moving to the GNUbg benchmark databases as a new benchmark, but first want to compare them against the benchmarking I've done in the past: playing many games against a reference player.
I looked at nine different players of varying skill and calculated the GNUbg average error rate for each, for each of the three benchmark databases (Contact, Crashed, and Race). I also ran 40k cubeless money games for each against both PubEval and my Benchmark 2 player, so that I can compare the GNUbg average error rate benchmarks against the scores in those games.
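For concreteness, here is a minimal sketch of the error-rate calculation, assuming the benchmark database has already been parsed into records pairing each position with the rolled-out equities of its candidate moves. The record format and the `choose_move` call are hypothetical stand-ins, not GNUbg's actual file format or API:

```python
def average_error_rate(player, records):
    """Average equity error over the benchmark positions, scaled by 1,000.

    `records` is assumed to be an iterable of (position, move_equities)
    pairs, where move_equities maps each candidate move to its rolled-out
    equity; `player.choose_move` is a hypothetical stand-in for the
    strategy's move selection.
    """
    total_error, n = 0.0, 0
    for position, move_equities in records:
        best_equity = max(move_equities.values())
        chosen = player.choose_move(position, list(move_equities))
        # Equity given up relative to the rollout-best move; zero when
        # the player picks the best move.
        total_error += best_equity - move_equities[chosen]
        n += 1
    return 1000.0 * total_error / n
```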
Note: the results below are a bit incorrect due to a bug in the benchmark calculation. Corrected results here.
Results:
Player | GNUbg Contact ER | GNUbg Crashed ER | GNUbg Race ER | PubEval Avg Ppg | Benchmark 2 Avg Ppg |
---|---|---|---|---|---|
Player 2.4 | 16.4 | 13.8 | 1.70 | 0.480 | 0.072 |
Player 2.4q | 32.7 | 34.3 | 3.35 | 0.119 | -0.283 |
Player 2.1 | 16.9 | 14.5 | 1.73 | 0.460 | 0.069 |
Player 2 | 18.3 | 15.6 | 1.79 | 0.442 | 0.021 |
Benchmark 2 | 19.1 | 18.8 | 4.42 | 0.432 | 0 |
Benchmark 2 (10) | 35.6 | 30.2 | 10.5 | 0.064 | -0.418 |
Benchmark 2 (40) | 22.7 | 20.8 | 4.86 | 0.330 | -0.067 |
Benchmark 1 | 20.2 | 18.5 | 6.12 | 0.351 | -0.101 |
PubEval | 37.0 | 43.4 | 2.22 | 0 | -0.437 |
"Benchmark 2 (10)" and "Benchmark 2 (40)" are like Benchmark 2 but with 10 and 40 hidden nodes respectively (vs 80 for the usual Benchmark 2).
The ER numbers in the table above are average equity errors, multiplied by 1,000 to make the display a bit nicer; for example, Player 2.4's Contact ER of 16.4 corresponds to an average equity error of 0.0164. The Avg Ppg numbers are the average points per game in the 40k cubeless money game matches.
To make some sense of these results I tried running a few regressions:
Metric | PubEval Ppg vs Contact ER | PubEval Ppg vs Crashed ER | PubEval Ppg vs Race ER | BM2 Ppg vs Contact ER | BM2 Ppg vs Crashed ER | BM2 Ppg vs Race ER |
---|---|---|---|---|---|---|
Slope | -0.0222 | -0.0174 | -0.0268 | -0.0238 | -0.0183 | -0.0347 |
Intercept | 0.8369 | 0.7033 | 0.4068 | 0.4519 | 0.3001 | 0.0143 |
R-Squared | 99.1% | 92.6% | 17.2% | 97.4% | 87.6% | 24.6% |
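These univariate fits are easy to reproduce from the results table. A sketch using scipy's `linregress` for the PubEval-vs-Contact case, where the data arrays are just the Contact ER and PubEval Avg Ppg columns from above; it should print the slope, intercept, and R-squared in the first column of the regression table:

```python
import numpy as np
from scipy.stats import linregress

# Contact ER and PubEval Avg Ppg columns from the results table above.
contact_er  = np.array([16.4, 32.7, 16.9, 18.3, 19.1, 35.6, 22.7, 20.2, 37.0])
pubeval_ppg = np.array([0.480, 0.119, 0.460, 0.442, 0.432, 0.064, 0.330, 0.351, 0.0])

fit = linregress(contact_er, pubeval_ppg)
print(f"slope={fit.slope:.4f}  intercept={fit.intercept:.4f}  R^2={fit.rvalue**2:.1%}")
```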
One notable result: the GNUbg benchmark that is most predictive of match score (in R^2 terms) is the Contact benchmark. Crashed is also fairly predictive, but Race tells you little. I think that's because all the strategies are relatively strong in Race (compared to the other two game phases), so overall performance differences do not depend much on that phase's performance.
I also ran a multivariate linear regression of the PubEval and Benchmark 2 scores against the three benchmarks. Adding all three factors instead of just Contact did not improve R^2 much - a bit more in the Benchmark 2 case, but still marginal:
Benchmark | Intercept | Contact ER Slope | Crashed ER Slope | Race ER Slope | R-Squared |
---|---|---|---|---|---|
PubEval | 0.8121 | -0.01562 | -0.00505 | -0.00417 | 99.3% |
Benchmark 2 | 0.4199 | -0.01323 | -0.00732 | -0.01340 | 98.6% |
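The multivariate fit amounts to an ordinary least-squares solve with an intercept column. A sketch for the PubEval case using numpy, with the ER columns taken from the results table above:

```python
import numpy as np

# Contact, Crashed, and Race ER columns plus PubEval Avg Ppg, from the table.
ers = np.array([
    [16.4, 13.8, 1.70], [32.7, 34.3, 3.35], [16.9, 14.5, 1.73],
    [18.3, 15.6, 1.79], [19.1, 18.8, 4.42], [35.6, 30.2, 10.5],
    [22.7, 20.8, 4.86], [20.2, 18.5, 6.12], [37.0, 43.4, 2.22],
])
pubeval_ppg = np.array([0.480, 0.119, 0.460, 0.442, 0.432, 0.064, 0.330, 0.351, 0.0])

X = np.column_stack([np.ones(len(ers)), ers])   # intercept column + three ERs
coef, *_ = np.linalg.lstsq(X, pubeval_ppg, rcond=None)
resid = pubeval_ppg - X @ coef
r2 = 1.0 - (resid @ resid) / ((pubeval_ppg - pubeval_ppg.mean()) ** 2).sum()
print("intercept + slopes:", coef, " R^2:", r2)
```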
From now on I will use the Contact GNUbg benchmark average error per game as the standard benchmark, while also referring to the Crashed and Race results as secondary benchmarks.