Offline RL for Adaptive Policy Retrieval in Prior Authorization
Summary: arXiv:2604.05125v1 (cross-listed)
Abstract: Prior authorization (PA) requires interpreting complex and fragmented coverage policies, yet existing retrieval-augmented systems rely on static top-$K$ strategies that retrieve a fixed number of policy sections. A fixed retrieval budget can be inefficient: it may gather irrelevant sections or too little information to decide.
We model policy retrieval for PA as a sequential decision-making problem, formulating adaptive retrieval as a Markov Decision Process (MDP). In our system, an agent iteratively selects policy chunks from a top-$K$ candidate set or chooses to stop and issue a decision. The reward balances decision correctness against retrieval cost, capturing the trade-off between accuracy and efficiency.
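The MDP described above can be sketched as a small environment. This is a minimal illustration rather than the authors' implementation: the class name `RetrievalEnv`, the `decide` callback, the terminal rewards of +1 and -1, and the default step cost are all assumptions introduced here, since the summary does not give the exact reward values.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class RetrievalEnv:
    """Toy adaptive-retrieval MDP: select a candidate chunk, or stop and decide."""
    candidates: List[str]                 # top-K candidate chunks for one PA request
    decide: Callable[[List[str]], str]    # hypothetical decision function over retrieved chunks
    label: str                            # ground-truth decision for this request
    step_cost: float = 0.1                # lambda, the per-retrieval cost (assumed value)
    correct_reward: float = 1.0           # assumed terminal reward for a correct decision
    wrong_reward: float = -1.0            # assumed terminal reward for a wrong decision
    retrieved: List[int] = field(default_factory=list)

    @property
    def stop_action(self) -> int:
        # Actions 0..K-1 select a chunk; action K stops and issues a decision.
        return len(self.candidates)

    def step(self, action: int) -> Tuple[List[int], float, bool]:
        """Return (state, reward, done); the state is the list of retrieved indices."""
        if action == self.stop_action:
            chunks = [self.candidates[i] for i in self.retrieved]
            correct = self.decide(chunks) == self.label
            reward = self.correct_reward if correct else self.wrong_reward
            return list(self.retrieved), reward, True
        if action in self.retrieved or not 0 <= action < len(self.candidates):
            raise ValueError(f"invalid action: {action}")
        self.retrieved.append(action)
        return list(self.retrieved), -self.step_cost, False


# Usage: retrieve one chunk, then stop.
env = RetrievalEnv(
    candidates=["criteria A", "exclusion B", "unrelated C"],
    decide=lambda chunks: "approve" if "criteria A" in chunks else "deny",
    label="approve",
)
_, r1, done1 = env.step(0)                 # pay the step cost
_, r2, done2 = env.step(env.stop_action)   # issue the decision
episodic_return = r1 + r2
```

Keeping the stop action in the same action space as chunk selection lets a single policy learn both what to retrieve and when enough has been retrieved.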
Methodology
We train policies using three distinct approaches:
- Conservative Q-Learning (CQL)
- Implicit Q-Learning (IQL)
- Direct Preference Optimization (DPO)
All three are trained in an offline reinforcement learning (RL) setting, on logged trajectories generated by baseline retrieval strategies over synthetic PA requests derived from publicly available CMS coverage data.
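For orientation, the core objective term of each of the three methods can be written in a few lines. These are the textbook per-sample forms, shown in plain Python on scalars. How the paper wires them into its retrieval agent (network architecture, how preference pairs are built for DPO, hyperparameters) is not stated in this summary, so the function names and the default values of `tau` and `beta` below are assumptions.

```python
import math
from typing import Sequence


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def cql_penalty(q_values: Sequence[float], logged_action: int) -> float:
    """CQL regularizer for one state: logsumexp_a Q(s, a) - Q(s, a_logged).

    Minimizing it pushes down Q-values of actions absent from the logged data.
    """
    m = max(q_values)
    logsumexp = m + math.log(sum(math.exp(q - m) for q in q_values))
    return logsumexp - q_values[logged_action]


def expectile_loss(diff: float, tau: float = 0.7) -> float:
    """IQL value loss on diff = Q(s, a) - V(s): asymmetric squared error.

    With tau > 0.5, positive differences are weighted more heavily than negative ones.
    """
    weight = tau if diff > 0 else 1.0 - tau
    return weight * diff * diff


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one preference pair: -log sigmoid(beta * log-ratio margin)."""
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    return softplus(-margin)  # equals -log(sigmoid(margin))
```

In a real training loop these terms would be averaged over mini-batches and, for CQL and IQL, combined with a temporal-difference loss on the Q-function.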
Results
Experiments use a corpus of 186 policy chunks spanning 10 CMS procedures. Results by method:
- CQL: Achieves 92% decision accuracy (+30 percentage points over the best fixed-$K$ baseline) via exhaustive retrieval.
- IQL: Matches the best baseline accuracy using 44% fewer retrieval steps and achieves the only positive episodic return among all policies.
- DPO: Matches CQL’s 92% accuracy while using 47% fewer retrieval steps (10.6 vs. 20.0), occupying a “selective-accurate” region on the Pareto frontier that dominates both CQL and Behavioral Cloning (BC).
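A BC baseline matches CQL’s performance, so CQL offers no gain over plain imitation of the logged behavior here. This confirms that advantage-weighted policy extraction (as in IQL) or preference-based policy extraction (as in DPO) is critical for learning selective retrieval.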
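The dominance claim and the 47% figure can be checked directly from the two operating points reported above (92% accuracy at 20.0 steps for CQL, 92% at 10.6 steps for DPO). The sketch below uses only those numbers; the `Point` type and helper names are illustrative, and "dominates" is taken here in the usual Pareto sense (no worse on both axes, strictly better on at least one).

```python
from typing import NamedTuple


class Point(NamedTuple):
    name: str
    accuracy: float  # decision accuracy, higher is better
    steps: float     # mean retrieval steps per request, lower is better


def dominates(a: Point, b: Point) -> bool:
    """True if a is no worse than b on both axes and strictly better on one."""
    no_worse = a.accuracy >= b.accuracy and a.steps <= b.steps
    strictly_better = a.accuracy > b.accuracy or a.steps < b.steps
    return no_worse and strictly_better


def step_reduction(a: Point, b: Point) -> float:
    """Fraction of retrieval steps that a saves relative to b."""
    return (b.steps - a.steps) / b.steps


cql = Point("CQL", accuracy=0.92, steps=20.0)
dpo = Point("DPO", accuracy=0.92, steps=10.6)

print(dominates(dpo, cql))                  # DPO vs. CQL
print(dominates(cql, dpo))                  # CQL vs. DPO
print(f"{step_reduction(dpo, cql):.0%}")    # relative step savings
```

Because the two accuracies are equal, dominance here rests entirely on the step count.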
Discussion
A lambda ablation over step costs ($\lambda \in \{0.05, 0.1, 0.2\}$) reveals a clear accuracy-efficiency inflection: only at $\lambda = 0.2$ does CQL transition from exhaustive to selective retrieval. At the two lower step costs, CQL continues to retrieve exhaustively.
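One way to read this inflection is to write out the return under a simple cost model. The form below is an assumption, not the paper's stated reward: it supposes a terminal reward $R$ for the decision and a linear cost of $\lambda$ per retrieval step, with $T$ the number of steps taken. The subscripts "exh" and "sel" denote an exhaustive and a selective policy.

```latex
% Assumed episodic return: terminal decision reward minus linear step cost
G = R - \lambda T

% Exhaustive retrieval is preferred over selective retrieval only if its
% expected gain in decision reward outweighs the cost of the extra steps:
\mathbb{E}[G_{\text{exh}}] > \mathbb{E}[G_{\text{sel}}]
\iff
\mathbb{E}[R_{\text{exh}}] - \mathbb{E}[R_{\text{sel}}]
> \lambda \left( \mathbb{E}[T_{\text{exh}}] - \mathbb{E}[T_{\text{sel}}] \right)
```

Under this reading, raising $\lambda$ raises the bar that the accuracy gain from extra retrieval must clear, which is consistent with a policy switching to selective retrieval only once the step cost is large enough.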
These results indicate that offline RL can learn adaptive policy retrieval for prior authorization from logged trajectories alone, and that the policy-extraction method determines whether the learned policy retrieves selectively or exhaustively. They suggest that further work on these methods could yield more efficient and accurate decision-making frameworks for healthcare policy interpretation.
