EcoSta 2024 - Submission A0335
Title: Bi-level offline reinforcement learning with limited exploration
Authors: Wenzhuo Zhou - University of California Irvine (United States) [presenting]
Abstract: Offline reinforcement learning (RL), which seeks to learn a good policy from a fixed, pre-collected dataset, is studied. A fundamental challenge in this task is the distributional shift that arises when the dataset lacks sufficient exploration, especially under function approximation. To tackle this issue, a bi-level structured policy optimization algorithm is proposed that models a hierarchical interaction between the policy (upper level) and the value function (lower level). The lower level focuses on constructing a confidence set of value estimates that maintain sufficiently small weighted average Bellman errors while controlling the uncertainty arising from distribution mismatch. Subsequently, at the upper level, the policy maximizes a conservative value estimate drawn from the confidence set formed at the lower level. This novel formulation preserves maximal flexibility in the implicitly induced exploratory data distribution, thereby enabling model extrapolation. In practice, it can be solved through a computationally efficient, penalized adversarial estimation procedure. The theoretical regret guarantees do not rely on any data-coverage or completeness-type assumptions, requiring only realizability. They also show that the learned policy represents the best effort among all policies, in the sense that no other policy can outperform it.
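For concreteness, the bi-level objective described above could be formalized roughly as follows; this is a hedged sketch under assumed notation (initial-state distribution d_0, discount factor gamma, dataset D of transitions (s, a, r, s'), value class Q, weight class W for the weighted average Bellman error, and tolerance epsilon and penalty lambda), none of which is specified in the abstract, and the precise estimator in the underlying paper may differ:

\max_{\pi \in \Pi} \; \min_{Q \in \mathcal{C}(\pi)} \; \mathbb{E}_{s \sim d_0}\!\left[ Q\big(s, \pi(s)\big) \right],
\quad
\mathcal{C}(\pi) = \Big\{ Q \in \mathcal{Q} : \max_{w \in \mathcal{W}} \Big| \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}}\!\big[ w(s,a)\,\big( Q(s,a) - r - \gamma\, Q(s', \pi(s')) \big) \big] \Big| \le \varepsilon \Big\}.

The penalized adversarial estimation mentioned in the abstract can then be read as a Lagrangian relaxation of the lower-level constraint, yielding a max-min problem over the policy, the value function, and the adversarial weight function:

\max_{\pi \in \Pi} \; \min_{Q \in \mathcal{Q}} \; \Big\{ \mathbb{E}_{s \sim d_0}\!\left[ Q\big(s, \pi(s)\big) \right] + \lambda \max_{w \in \mathcal{W}} \Big| \mathbb{E}_{\mathcal{D}}\!\big[ w(s,a)\,\big( Q(s,a) - r - \gamma\, Q(s', \pi(s')) \big) \big] \Big| \Big\}.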