## Intro

In this post we report success in using reinforcement learning to learn the game of Nim. We had previously cited two theses (Erik Järleberg, 2011; Paul Graham & William Lord, 2015) that used Q-learning to learn the game. In this setting, however, the scaling issues of Q-learning are much more severe than those of value learning, so in this post we use a value-based approach with a lookup table. Because the value-based approach is much more efficient than Q-learning, no function approximation is needed up to reasonable heap sizes.

## Scaling

maxHeapSize | numberOfHeaps | States | Actions | Q-table entries
---|---|---|---|---
7 | 4 | 3.30E+02 | 2.40E+03 | 1.96E+06
9 | 5 | 2.00E+03 | 5.90E+04 | 4.82E+07
11 | 6 | 1.24E+04 | 1.77E+06 | 1.45E+09
13 | 6 | 2.71E+04 | 4.83E+06 | 3.94E+09
14 | 8 | 3.20E+05 | 1.48E+09 | 1.20E+12

Given k heaps with a maximum heap size n, the number of states, which is the number of different starting positions, can be calculated exactly. Since each heap size can take any value in {0, 1, ..., n}, this is the number of multisets of cardinality k with elements taken from a finite set of cardinality n + 1. This count is known as the multiset coefficient or multiset number. (We use the hash symbol # to signify "number of", so #States means the number of states.)

(i) #Different starting positions = #States = \left(\!\!{n + 1 \choose k}\!\!\right)

It can be evaluated as

(ii) #States = \left(\!\!{n + 1 \choose k}\!\!\right) = {n + k \choose k}
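As a quick sanity check, the multiset count can be evaluated with Python's built-in `math.comb`; the values reproduce the States column of the scaling table above:

```python
from math import comb

def n_states(n, k):
    """Number of multisets of k heap sizes drawn from {0, 1, ..., n}."""
    return comb(n + k, k)

# These reproduce the States column of the scaling table
print(n_states(7, 4))   # 330    ~ 3.30E+02
print(n_states(9, 5))   # 2002   ~ 2.00E+03
print(n_states(14, 8))  # 319770 ~ 3.20E+05
```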

To estimate the number of actions we use the same approach as Graham & Lord, namely the product of the heap sizes:

(iii) #Actions = \prod_{i=1}^{k} h_{i}

Therefore the number of entries in the Q-table is

(iv) Size(Q-table) = #States * #Actions = {n + k \choose k} \cdot \prod_{i=1}^{k} h_{i}

whereas the number of entries in the value table is

(v) Size(Value-table) = #States = \left(\!\!{n + 1 \choose k}\!\!\right) = {n + k \choose k}

It is apparent that scalability is much better for the value-table-based approach, as the Q-table is larger by the actions factor of equation (iii).
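To make the gap concrete, here is a small sketch comparing the two table sizes, assuming a full (n, ..., n) starting position so that the product of heap sizes is n**k:

```python
from math import comb

def value_table_size(n, k):
    # one entry per state
    return comb(n + k, k)

def q_table_size(n, k):
    # one entry per state-action pair; the actions factor is the
    # product of heap sizes, i.e. n**k for a full (n, ..., n) position
    return comb(n + k, k) * n ** k

for n, k in [(7, 4), (9, 5), (11, 6)]:
    v, q = value_table_size(n, k), q_table_size(n, k)
    print(f"n={n}, k={k}: value table {v:,} entries, Q-table {q:,} entries")
```

The ratio q / v is exactly the actions factor, which is what makes the value table tractable at heap sizes where the Q-table is not.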

## A Python package, complete with tests

For the solution we created a Python package, complete with tests, which can be found on GitHub:

https://github.com/hfwittmann/MEAV

Description of the package

Tricks:

- Memory replay: the package uses standard memory replay to make optimal use of the history of played-out moves.
- Moves are not stored in the history when the choice was suboptimal, as sometimes happens during exploration (cf. the exploration-exploitation trade-off).
- The exploration-exploitation trade-off is implemented using the standard epsilon-greedy mechanism.
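For illustration, a minimal epsilon-greedy selector over successor-state values might look as follows; the function and parameter names here are a hypothetical sketch, not the package's actual API:

```python
import random

def epsilon_greedy(successor_values, epsilon=0.1):
    """Pick the index of a successor state.

    With probability epsilon explore (uniform random move);
    otherwise exploit (move to the successor with the best value).
    """
    if random.random() < epsilon:
        return random.randrange(len(successor_values))  # explore
    # exploit: argmax over the tabulated state values
    return max(range(len(successor_values)), key=lambda i: successor_values[i])
```

In line with the second trick above, a move chosen on the exploration branch would then be excluded from the stored history.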

## Results

## 3 heaps Size 10

Load the packages…

```python
import matplotlib.style as style
import matplotlib.pyplot as plt

style.use('ggplot')

from MEAV import Memory, Agent, Environment, ValueTable, PerfectTest, RunOnPolicy
```

```python
rOP = RunOnPolicy(Memory, Environment, Agent, ValueTable, PerfectTest, maxHeapSize=10)

accuracy_predictions, tableSize, memorySize, reward, \
    stats_rewards, stats_differences, stats_accuracy = \
    rOP.runEpisodes(desired_accuracy_predictions=0.99,
                    maxHeapSize=7, numberOfHeaps=3)
```

```python
plt.plot(stats_accuracy)
plt.xlabel('Number of episodes')
plt.ylabel('Out of sample accuracy')
plt.show()

plt.plot(stats_differences, color='Blue')
plt.xlabel('Number of episodes')
plt.ylabel('Sum of Differences to Optimal moves (Exploration)')
plt.show()
```

Next we’ll look at two different starting positions, namely:

## 3 heaps Size 20 = (20, 20, 20)

If you run this in the same directory as a previous run, make sure you delete the cache folder inside your working directory.

You can use the same code as above, except for one line that needs to be changed:

```python
accuracy_predictions, tableSize, memorySize, reward, \
    stats_rewards, stats_differences, stats_accuracy = \
    rOP.runEpisodes(desired_accuracy_predictions=0.99,
                    maxHeapSize=20, numberOfHeaps=3)
```

## 4 heaps Size 7 = (7, 7, 7, 7)

Next we restrict the heaps to those positions that can be reached from a (7, 7, 7, 7) starting position. Among these is the classical, so-called Marienbad position, namely (1, 3, 5, 7).
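Nim has a closed-form optimality test that such positions can be checked against: a position is lost for the player to move exactly when the XOR (the "nim-sum") of the heap sizes is zero. A minimal sketch (the package's PerfectTest presumably relies on this result, though the function below is our own illustration, not its API):

```python
from functools import reduce
from operator import xor

def nim_sum(heaps):
    """XOR of all heap sizes; zero means the player to move loses under optimal play."""
    return reduce(xor, heaps, 0)

print(nim_sum((1, 3, 5, 7)))   # 0  -> the Marienbad position is a loss for the mover
print(nim_sum((7, 7, 7, 7)))   # 0  -> so is the (7, 7, 7, 7) start itself
print(nim_sum((20, 20, 20)))   # 20 -> a win for the player to move
```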

If you run this in the same directory as a previous run, make sure you delete the cache folder inside your working directory.

You can use the same code as above, except for one line that needs to be changed:

```python
accuracy_predictions, tableSize, memorySize, reward, \
    stats_rewards, stats_differences, stats_accuracy = \
    rOP.runEpisodes(desired_accuracy_predictions=0.99,
                    maxHeapSize=7, numberOfHeaps=4)
```

Success!

## Summary

We have implemented a system that shows several improvements compared to earlier work.

The system is more scalable, by virtue of the value-based approach, coupled with a redundancy-reducing trick.

We have also implemented it as a package that makes extensive use of test cases.

The result is a system which can play the game of nim up to reasonable heap sizes without the need for function approximation.