A0488
Title: Enhancing data efficiency in online deep reinforcement learning under partial observability
Authors: Valentina Zangirolami - University of Milano-Bicocca (Italy) [presenting]
Abstract: Model-free (MF) methods are widespread in the online deep reinforcement learning (DRL) literature owing to their good asymptotic performance and their reliance on policy estimation alone. DRL effectively addresses complex real-world scenarios with high-dimensional states by leveraging scalable neural networks. Such scenarios can also be affected by partial observability of the state and can be described by partially observable Markov decision processes (POMDPs). However, MF methods typically require many environment interactions, which raises data-efficiency issues. The aim is to propose a novel model-based (MB) DRL method, called deep recurrent Dyna-Q, which adapts the existing deep Dyna-Q framework to partial observability. MB-DRL introduces the concept of planning, which consists of interacting with the learned POMDP dynamics. Essentially, Dyna methods combine MF and MB-DRL: the value function is updated using both real experience and simulated experience drawn from the learned dynamics, thus retaining the asymptotic performance of MF methods while improving data efficiency through MB planning. A variational recurrent neural network is used to estimate the conditional density of observations, introducing additional stochasticity into the hidden states. Experiments apply this novel framework to self-driving cars with different sample-update methods, providing a comprehensive statistical analysis and benchmarking against state-of-the-art methods.
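The following is a minimal sketch of the Dyna-style combination described above, under stated assumptions: all names (RecurrentQNet, StochasticModel, td_update) are hypothetical illustrations rather than the authors' implementation, a toy recurrent Gaussian observation model stands in for the variational recurrent neural network, and random tensors stand in for replayed driving experience. The point it shows is the core mechanism: real and simulated transitions feed the same Q-learning update.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RecurrentQNet(nn.Module):
    """GRU Q-network: summarises the observation history into a hidden state."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.gru = nn.GRU(obs_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_actions)

    def forward(self, obs_seq):
        out, _ = self.gru(obs_seq)            # (B, T, hidden)
        return self.head(out[:, -1])          # Q-values at the final step

class StochasticModel(nn.Module):
    """Toy recurrent dynamics model: a Gaussian over the next observation.
    Stand-in for the variational recurrent neural network in the abstract."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.gru = nn.GRU(obs_dim + n_actions, hidden, batch_first=True)
        self.mu = nn.Linear(hidden, obs_dim)
        self.log_std = nn.Linear(hidden, obs_dim)
        self.rew = nn.Linear(hidden, 1)

    def forward(self, obs_seq, act_seq):
        out, _ = self.gru(torch.cat([obs_seq, act_seq], dim=-1))
        h = out[:, -1]
        mu, std = self.mu(h), self.log_std(h).exp()
        next_obs = mu + std * torch.randn_like(mu)   # sampled, hence stochastic
        return next_obs, self.rew(h).squeeze(-1)

def td_update(q_net, q_target, opt, obs_seq, act, rew, next_obs_seq, gamma=0.99):
    """One Q-learning step, applied identically to real and simulated batches."""
    with torch.no_grad():
        target = rew + gamma * q_target(next_obs_seq).max(dim=1).values
    q = q_net(obs_seq).gather(1, act.unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(q, target)
    opt.zero_grad(); loss.backward(); opt.step()

obs_dim, n_actions, B, T = 8, 4, 32, 10
q_net = RecurrentQNet(obs_dim, n_actions)
q_target = RecurrentQNet(obs_dim, n_actions)
q_target.load_state_dict(q_net.state_dict())
model = StochasticModel(obs_dim, n_actions)
opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)

# Direct RL: one update from (stand-in) real experience.
obs_seq, next_obs_seq = torch.randn(B, T, obs_dim), torch.randn(B, T, obs_dim)
act, rew = torch.randint(n_actions, (B,)), torch.randn(B)
td_update(q_net, q_target, opt, obs_seq, act, rew, next_obs_seq)

# Planning: extra updates from experience simulated by the learned model.
for _ in range(5):
    act_seq = F.one_hot(torch.randint(n_actions, (B, T)), n_actions).float()
    with torch.no_grad():
        sim_next, sim_rew = model(obs_seq, act_seq)
    sim_next_seq = torch.cat([obs_seq[:, 1:], sim_next.unsqueeze(1)], dim=1)
    td_update(q_net, q_target, opt, obs_seq, act_seq[:, -1].argmax(1), sim_rew, sim_next_seq)

In this sketch the planning loop reuses td_update unchanged, which is what lets the model-generated transitions improve data efficiency without altering the MF learning rule; the ratio of simulated to real updates (here 5:1) is an illustrative choice, not a value from the abstract.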