Risk-Sensitive Piecewise-Linear Policy Iteration for Stochastic Shortest Path Markov Decision Processes

Author(s):
Pastor, Henrique Dias ; Borges, Igor Oliveira ; Freire, Valdinei ; Delgado, Karina Valdivia ; de Barros, Leliane Nunes ; Martinez-Villasenor, L ; Herrera-Alcantara, O ; Ponce, H ; Castro-Espinoza, FA
Total Authors: 9
Document type: Journal article
Source: ADVANCES IN SOFT COMPUTING, MICAI 2020, PT I; v. 12468, 13 pp., 2020-01-01.
Abstract

A Markov Decision Process (MDP) is commonly used to model a sequential decision-making problem in which an agent interacts with an uncertain environment while seeking to minimize the expected cost accumulated along the process. If the process horizon is infinite, a discount factor γ ∈ [0, 1] is used to indicate the importance the agent gives to future states. If the agent's mission is to reach a goal state, the process becomes a Stochastic Shortest Path MDP (SSP-MDP), the de facto model used for probabilistic planning in AI. Although several efficient solutions have been proposed to solve SSP-MDPs, little research has been carried out on the "risk" in such processes. A Risk-Sensitive MDP (RS-MDP) allows modeling the agent's risk-averse and risk-prone attitudes by including a risk factor and a discount factor in the MDP definition. The proofs of convergence of known dynamic-programming solutions adapted for RS-MDPs, such as risk-sensitive value iteration (VI) and risk-sensitive policy iteration (PI), rely on the discount factor. However, when solving an SSP-MDP we look for a proper policy, i.e., a policy that guarantees reaching the goal while minimizing the accumulated expected cost, which is naturally modeled without a discount factor. Moreover, it has been shown that the discount factor can modify the chosen risk attitude when solving a risk-sensitive SSP-MDP. Thus, in this work we aim to formally prove the convergence of the PI algorithm for a risk-sensitive SSP-MDP based on operators that use a piecewise-linear transformation function, without a discount factor. We also run experiments in the benchmark River domain showing how the intended risk attitude, over an interval ranging from extreme risk-averse to extreme risk-prone, varies with the discount factor γ, i.e., how an optimal policy for a risk-sensitive SSP-MDP can go from being a risk-prone policy to a risk-averse one, depending on the discount factor. (AU)
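
For illustration only, the Python sketch below shows one common way a piecewise-linear risk-sensitive backup of the kind the abstract refers to can be implemented. It is a minimal sketch, not the paper's exact operators or proof setup: it assumes a Mihatsch-Neuneier-style transformation adapted to costs (worse-than-expected outcomes weighted by 1 + kappa, better ones by 1 - kappa, for a risk factor kappa in (-1, 1)), a damped value-iteration update, and a hypothetical three-state toy domain rather than the River benchmark. The names chi and rs_value_iteration are the author of this sketch's own.

import numpy as np

def chi(x, kappa):
    # Piecewise-linear transformation, kappa in (-1, 1).
    # Assumed convention for costs: worse-than-expected outcomes (x > 0) are
    # weighted by (1 + kappa), better ones by (1 - kappa); kappa > 0 is
    # risk-averse, kappa < 0 is risk-prone, kappa = 0 recovers standard VI.
    return np.where(x > 0.0, (1.0 + kappa) * x, (1.0 - kappa) * x)

def rs_value_iteration(P, C, goal, kappa, gamma=0.9, sigma=0.5,
                       tol=1e-8, max_iter=100_000):
    # P: (A, S, S) transition probabilities, C: (S, A) immediate costs,
    # goal: absorbing zero-cost goal state, sigma: damping step that keeps
    # the backup stable when |kappa| is close to 1.
    A, S, _ = P.shape
    V = np.zeros(S)
    for _ in range(max_iter):
        # expected transformed residual E[chi(c + gamma*V(s') - V(s))] per (a, s)
        resid = np.stack([
            (P[a] * chi(C[:, a][:, None] + gamma * V[None, :] - V[:, None],
                        kappa)).sum(axis=1)
            for a in range(A)
        ])
        resid[:, goal] = 0.0              # goal state keeps value 0
        V_new = V + sigma * resid.min(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    return V, resid.argmin(axis=0)

# Hypothetical 3-state chain (not the River domain): state 2 is the goal,
# action 0 is a slow deterministic path, action 1 is faster on average but
# can bounce the agent back toward the start.
P = np.array([
    [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],   # safe action
    [[0.3, 0.0, 0.7], [0.5, 0.0, 0.5], [0.0, 0.0, 1.0]],   # risky action
])
C = np.array([[1.0, 1.0], [1.0, 1.0], [0.0, 0.0]])
for kappa in (-0.8, 0.0, 0.8):            # risk-prone, risk-neutral, risk-averse
    V, pi = rs_value_iteration(P, C, goal=2, kappa=kappa)
    print(f"kappa={kappa:+.1f}  V={np.round(V, 2)}  policy={pi}")

Under these assumptions, running the sketch shows the greedy action at the start state switching from the risky shortcut (kappa <= 0) to the safe detour (kappa = 0.8), a small-scale analogue of the risk-attitude changes the abstract studies.
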

FAPESP's process: 18/11236-9 - Markov decision process and risk
Grantee: Karina Valdivia Delgado
Support Opportunities: Regular Research Grants