Parthasarathy D, Kontes G, Plinge A, Mutschler C (2026) C-MCTS: Safe Planning with Monte Carlo Tree Search
Publication Status: Submitted
Publication Type: Unpublished / Preprint
Future Publication Type: Journal article
Publication year: 2026
DOI: 10.48550/arXiv.2305.16209
The Constrained Markov Decision Process (CMDP) formulation allows us to solve
safety-critical decision-making tasks that are subject to constraints. While CMDPs
have been extensively studied in the Reinforcement Learning literature, little
attention has been given to sampling-based planning algorithms such as Monte
Carlo Tree Search (MCTS) for solving them. Previous approaches are conservative
with respect to costs as they avoid constraint violations by using Monte Carlo
cost estimates that suffer from high variance. We propose Constrained MCTS
(C-MCTS), which estimates cost using a safety critic that is trained with Temporal
Difference learning in an offline phase prior to agent deployment. During deployment,
the critic limits exploration of unsafe regions by pruning unsafe trajectories
within MCTS. This makes C-MCTS more efficient w.r.t. planning steps. Compared
to previous work, it achieves higher rewards by operating closer to the constraint
boundary (while satisfying cost constraints) and is less susceptible to cost violations
under model mismatch between the planner and the deployment environment.
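To make the pruning mechanism described in the abstract concrete, here is a minimal, hypothetical Python sketch (not the authors' implementation). It assumes a safety critic, represented here by a simple lookup table standing in for a model trained offline with TD learning, that vetoes candidate actions during MCTS tree expansion whenever their estimated cost exceeds the remaining budget. All names (SafetyCritic, expand_safe, cost_budget, the toy dynamics) are illustrative assumptions.

    # Illustrative sketch only (not the authors' code): a safety critic
    # pruning unsafe actions during MCTS tree expansion.

    class SafetyCritic:
        """Stand-in for a cost critic trained offline with TD learning.
        Here it is a fixed lookup table from (state, action) to an
        estimated expected cost."""

        def __init__(self, cost_table):
            self.cost_table = cost_table

        def expected_cost(self, state, action):
            return self.cost_table.get((state, action), 0.0)

    class Node:
        def __init__(self, state, parent=None):
            self.state = state
            self.parent = parent
            self.children = {}  # action -> Node
            self.visits = 0
            self.total_reward = 0.0

    def expand_safe(node, actions, transition, critic, cost_budget):
        """Expand only actions whose critic-estimated cost stays within
        the remaining budget; unsafe branches are pruned from the tree."""
        for action in actions:
            if critic.expected_cost(node.state, action) <= cost_budget:
                node.children[action] = Node(transition(node.state, action),
                                             parent=node)
        return node.children

    # Toy usage: on a 1-D line, stepping right from position 2 is unsafe.
    critic = SafetyCritic({(2, +1): 5.0})
    root = Node(state=2)
    children = expand_safe(root, actions=[-1, +1],
                           transition=lambda s, a: s + a,
                           critic=critic, cost_budget=1.0)
    print(sorted(children))  # -> [-1]  (the +1 action is pruned as unsafe)

Because unsafe branches never enter the tree, subsequent selection and rollout steps spend their planning budget only on trajectories the critic deems feasible, which is the source of the efficiency gain the abstract claims.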
APA:
Parthasarathy, D., Kontes, G., Plinge, A., & Mutschler, C. (2026). C-MCTS: Safe Planning with Monte Carlo Tree Search. Unpublished preprint, submitted. https://doi.org/10.48550/arXiv.2305.16209
MLA:
Parthasarathy, Dinesh, et al. C-MCTS: Safe Planning with Monte Carlo Tree Search. 2026. Unpublished preprint, submitted.