Offline RL Enhances Adaptive Policy Retrieval in Prior Authorization

Offline RL for Adaptive Policy Retrieval in Prior Authorization

Source: arXiv:2604.05125v1 (announce type: cross)

Abstract: Prior authorization (PA) requires interpretation of complex and fragmented coverage policies, yet existing retrieval-augmented systems rely on static top-$K$ strategies that retrieve a fixed number of sections. A fixed retrieval budget is inefficient: it can gather irrelevant information or miss information the decision needs.

We model policy retrieval for PA as a sequential decision-making problem, formulating adaptive retrieval as a Markov Decision Process (MDP). In our system, an agent iteratively selects policy chunks from a top-$K$ candidate set or chooses to stop and issue a decision. The reward balances decision correctness against retrieval cost, capturing the trade-off between accuracy and efficiency.
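The decision loop described above can be sketched as a toy environment. This is a minimal illustration, not the authors' implementation: the names `RetrievalEnv`, `STOP`, and `rollout` are hypothetical, the terminal reward of +1/-1 and the per-step penalty of `step_cost` are assumed forms of "correctness against retrieval cost", and the "decision is correct when all relevant chunks were retrieved" rule is a stand-in for a real decision model.

```python
from dataclasses import dataclass, field

STOP = -1  # hypothetical action id: stop retrieving and issue a decision


@dataclass
class RetrievalEnv:
    candidates: list        # top-K candidate chunk ids for one PA request
    relevant: set           # chunk ids needed for a correct decision (toy oracle)
    step_cost: float = 0.1  # lambda, the penalty paid per retrieved chunk (assumed)
    retrieved: list = field(default_factory=list)

    def step(self, action):
        """Apply one action and return (reward, done)."""
        if action == STOP:
            # Assumed terminal reward: +1 for a correct decision, -1 otherwise.
            correct = self.relevant.issubset(self.retrieved)
            return (1.0 if correct else -1.0), True
        if action not in self.candidates or action in self.retrieved:
            raise ValueError("action must be an unretrieved candidate or STOP")
        self.retrieved.append(action)
        return -self.step_cost, False


def rollout(env, actions):
    """Sum rewards along a fixed action sequence, ending at the first STOP."""
    total = 0.0
    for action in actions:
        reward, done = env.step(action)
        total += reward
        if done:
            break
    return total


selective = rollout(RetrievalEnv([0, 1, 2, 3], {1, 3}), [1, 3, STOP])
exhaustive = rollout(RetrievalEnv([0, 1, 2, 3], {1, 3}), [0, 1, 2, 3, STOP])
premature = rollout(RetrievalEnv([0, 1, 2, 3], {1, 3}), [STOP])
```

In this toy setup the selective trajectory earns a higher return than the exhaustive one because both reach a correct decision but the exhaustive one pays for two extra retrievals, which is the accuracy versus efficiency trade-off the reward is meant to capture.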

Methodology

We train policies using three distinct approaches:

  • Conservative Q-Learning (CQL)
  • Implicit Q-Learning (IQL)
  • Direct Preference Optimization (DPO)

All three methods are trained in an offline reinforcement learning (RL) setting, using logged trajectories generated by baseline retrieval strategies over synthetic PA requests derived from publicly available CMS coverage data.
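For readers unfamiliar with the three methods, the sketch below shows the core per-sample term usually associated with each one, in its standard textbook form for a discrete action set. It is not taken from the paper: the function names are hypothetical, the hyperparameters `tau` and `beta` are illustrative defaults rather than reported values, and how the authors construct preference pairs for DPO is not stated in the abstract.

```python
import math


def cql_penalty(q_values, logged_action):
    """CQL regularizer: log-sum-exp of Q over all actions minus Q of the logged action.

    Minimizing it pushes down Q-values of actions absent from the logged data.
    """
    m = max(q_values)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(q - m) for q in q_values))
    return log_sum_exp - q_values[logged_action]


def expectile_loss(q_minus_v, tau=0.7):
    """IQL value loss: asymmetric squared error on u = Q(s, a) - V(s).

    With tau > 0.5, positive errors are weighted more, so V tracks an upper
    expectile of Q without querying out-of-dataset actions.
    """
    weight = tau if q_minus_v > 0 else 1.0 - tau
    return weight * q_minus_v ** 2


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (preferred, rejected) pair: -log sigmoid(margin).

    The margin compares policy-to-reference log-ratios of the two options.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return math.log1p(math.exp(-margin))  # sketch only: overflows for very negative margins
```

A policy that already prefers the chosen option more strongly than the reference does gets a DPO loss below log 2, and one that prefers the rejected option gets a loss above it.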

Results

Our experiments use a corpus of 186 policy chunks spanning 10 CMS procedures. The main results are:

  • CQL: Achieves 92% decision accuracy (+30 percentage points over the best fixed-$K$ baseline) via exhaustive retrieval.
  • IQL: Matches the best baseline accuracy using 44% fewer retrieval steps and achieves the only positive episodic return among all policies.
  • DPO: Matches CQL’s 92% accuracy while using 47% fewer retrieval steps (10.6 vs. 20.0), occupying a “selective-accurate” region on the Pareto frontier that dominates both CQL and Behavioral Cloning (BC).

A behavioral cloning baseline matches CQL’s performance, confirming that advantage-weighted (IQL) or preference-based (DPO) policy extraction is critical for learning selective retrieval.
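The dominance claim and the 47% figure can be checked with a few lines of arithmetic. The sketch below uses only the accuracy and step counts reported above for CQL and DPO. It assumes that BC, which is said to match CQL's performance, has the same accuracy and step count as CQL. IQL is left out because its step count is not given here. The `dominates` helper is a hypothetical name.

```python
def dominates(a, b):
    """Pareto dominance on (accuracy, mean_steps): higher accuracy and fewer steps are better."""
    acc_a, steps_a = a
    acc_b, steps_b = b
    no_worse = acc_a >= acc_b and steps_a <= steps_b
    strictly_better = acc_a > acc_b or steps_a < steps_b
    return no_worse and strictly_better


reported = {
    "CQL": (0.92, 20.0),
    "BC": (0.92, 20.0),   # assumption: "matches CQL's performance" on both axes
    "DPO": (0.92, 10.6),
}

# Fraction of retrieval steps DPO saves relative to CQL.
step_reduction = 1.0 - reported["DPO"][1] / reported["CQL"][1]

dpo_dominates_cql = dominates(reported["DPO"], reported["CQL"])
dpo_dominates_bc = dominates(reported["DPO"], reported["BC"])
```

DPO dominates here because it ties on accuracy and is strictly better on steps. Neither CQL nor BC dominates the other under the stated assumption, since they coincide.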

Discussion

An ablation over the step cost ($\lambda \in \{0.05, 0.1, 0.2\}$) reveals a clear accuracy-efficiency inflection. CQL transitions from exhaustive to selective retrieval only at $\lambda = 0.2$, the largest step cost tested; at the two smaller values it continues to retrieve exhaustively.
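To give a feel for what these step costs mean, the sketch below computes the total retrieval penalty an episode would pay at each $\lambda$, assuming the penalty is linear in the number of retrieved chunks ($\lambda$ times steps). That linear form, the `retrieval_cost` helper, and the reuse of the 20.0 and 10.6 step counts across all three $\lambda$ values are assumptions for illustration, not reported results.

```python
def retrieval_cost(step_cost, num_steps):
    """Total penalty paid for retrieval, assuming a linear per-step cost."""
    return step_cost * num_steps


step_costs = (0.05, 0.1, 0.2)
exhaustive_steps = 20.0  # step count reported for exhaustive retrieval
selective_steps = 10.6   # step count reported for DPO's selective retrieval

# Map each lambda to (exhaustive penalty, selective penalty).
costs = {
    lam: (retrieval_cost(lam, exhaustive_steps), retrieval_cost(lam, selective_steps))
    for lam in step_costs
}

for lam, (exhaustive, selective) in costs.items():
    print(f"lambda={lam}: exhaustive={exhaustive:.2f}, selective={selective:.2f}")
```

Under this assumption the penalty for exhaustive retrieval quadruples between the smallest and largest $\lambda$, so the gap between exhaustive and selective retrieval grows in proportion to the step cost.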

Taken together, these results indicate that offline RL can make policy retrieval for prior authorization adaptive: learned policies match or exceed fixed-$K$ accuracy, and the selective ones do so with substantially fewer retrieval steps. The experiments use synthetic PA requests and a small corpus, so further work is needed to establish how well these gains carry over to broader healthcare policy interpretation.


Lazarus Omolua