Investigating Scale Independent UCT Exploration Factor Strategies
cs.AI
Abstract
The Upper Confidence Bounds for Trees (UCT) algorithm is not agnostic to the reward scale of the game it is applied to. For zero-sum games with sparse rewards of $\{-1,0,1\}$ at the end of the game, this is not a problem, but many games feature dense rewards with hand-picked reward scales, so a node's Q-values can differ by orders of magnitude from one game to the next. In this paper, we evaluate strategies for adaptively choosing the UCT exploration constant $λ$, called $λ$-strategies, that are agnostic to the game's reward scale. These $λ$-strategies include those proposed in the literature as well as five new strategies. Based on our experimental results, we recommend one of our newly proposed $λ$-strategies: choose $λ$ as $2 \cdot σ$, where $σ$ is the empirical standard deviation of the Q-values of all state-action pairs in the search tree. This method outperforms existing $λ$-strategies across a wide range of tasks, both when a single parameter value is used for all tasks and in the peak performance obtained by optimizing all available parameters.
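The recommended strategy can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the dictionary layout, and the choice of the population standard deviation are assumptions made for illustration, and the selection rule is the standard UCB1-style form $Q(s,a) + λ \sqrt{\ln N(s) / N(s,a)}$, which the abstract does not spell out.

```python
import math
from statistics import pstdev


def scale_independent_lambda(q_values, factor=2.0):
    """Return factor * sigma, where sigma is the standard deviation of the
    Q-values of all state-action pairs currently in the search tree.

    Assumption: the population standard deviation is used; the abstract
    only says "empirical standard deviation".
    """
    if len(q_values) < 2:
        # Not enough data to estimate a spread; fall back to no exploration bonus.
        return 0.0
    return factor * pstdev(q_values)


def uct_select(actions, parent_visits, lam):
    """Pick the action with the highest UCT score.

    `actions` maps a (hypothetical) action name to a (q_value, visit_count)
    pair. Unvisited actions are always tried first.
    """
    def score(item):
        _, (q, n) = item
        if n == 0:
            return math.inf
        return q + lam * math.sqrt(math.log(parent_visits) / n)

    return max(actions.items(), key=score)[0]
```

Because $σ$ scales linearly with the rewards, multiplying every reward by a positive constant multiplies both terms of the UCT score by that constant, so the selected action stays the same. This is the sense in which the strategy is scale independent.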