Department of Mathematics,
University of California San Diego

****************************

Undergraduate Honors Presentation

Jingwen Gu

University of California, San Diego

Ladder-BAI: Online RLHF with Linear Dependence on Reward Scale

Abstract:

This thesis studies a central theoretical challenge in online reinforcement learning from human feedback (RLHF): under Bradley–Terry preference feedback, comparing against a fixed weak reference can make learning exponentially inefficient in the reward scale because preference signals saturate. To isolate this issue, the thesis considers a simplified dueling-bandit setting and proposes Ladder-BAI, a self-updating-baseline algorithm that repeatedly promotes the current best arm and identifies better arms through simple fixed-baseline comparisons. The main result shows that Ladder-BAI finds an $\epsilon$-optimal arm using $\tilde{O}(K R_{\max} + K/\epsilon^2)$ preference queries, achieving linear dependence on the reward scale $R_{\max}$. This improves substantially over prior exponential or higher-degree polynomial guarantees. The analysis is based on a reward-ladder argument: each epoch yields a constant reward improvement, and promoting the baseline keeps comparisons out of the saturated regime so they remain informative; a final refinement step then achieves $\epsilon$-accuracy. Synthetic experiments support the theory, confirming linear scaling in both the reward scale and the number of arms, as well as the expected $1/\epsilon^2$ dependence on target accuracy.
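To make the climb-then-refine structure concrete, the sketch below is a simplified, hypothetical rendering of a self-updating-baseline scheme under a Bradley–Terry model; it is not the thesis's actual pseudocode, and the function names, thresholds, and constants are illustrative assumptions only.

    import numpy as np

    def duel(rewards, i, b, rng):
        # One Bradley-Terry preference query: arm i vs. current baseline b.
        # P(i beats b) = sigmoid(r_i - r_b), which saturates for large gaps,
        # making comparisons against a fixed weak baseline uninformative.
        p = 1.0 / (1.0 + np.exp(-(rewards[i] - rewards[b])))
        return rng.random() < p

    def ladder_bai(rewards, eps, delta=0.05, seed=0):
        # Illustrative sketch of a ladder-style scheme (constants are placeholders).
        rng = np.random.default_rng(seed)
        K = len(rewards)
        b = 0  # current baseline arm: one "rung" of the reward ladder

        # Climbing phase: promote any arm whose empirical win rate against
        # the baseline signals a constant reward improvement. At most ~R_max
        # promotions can occur, each costing O(K log(K/delta)) queries.
        n_epoch = int(np.ceil(16 * np.log(2 * K / delta)))
        improved = True
        while improved:
            improved = False
            for i in range(K):
                if i == b:
                    continue
                wins = sum(duel(rewards, i, b, rng) for _ in range(n_epoch))
                if wins / n_epoch > 0.73:  # ~sigmoid(1): reward gap of about 1
                    b = i  # promote: the baseline climbs one rung
                    improved = True
                    break

        # Refinement phase: ~1/eps^2 queries per arm against the final
        # baseline, where comparisons are no longer saturated, to pick an
        # eps-optimal arm.
        n_ref = int(np.ceil(np.log(2 * K / delta) / eps**2))
        best, best_rate = b, 0.5
        for i in range(K):
            if i == b:
                continue
            rate = np.mean([duel(rewards, i, b, rng) for _ in range(n_ref)])
            if rate > best_rate:
                best, best_rate = i, rate
        return best

    # Example (hypothetical rewards): best = ladder_bai(np.array([0.0, 3.0, 7.4, 7.5]), eps=0.1)

Under these assumptions the query count mirrors the stated bound: the climbing phase contributes roughly $K R_{\max}$ queries (log factors suppressed) and the refinement phase roughly $K/\epsilon^2$.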

Advisor: Professor Lijun Ding

May 8, 2026

4:00 PM

APM 5829

Research Areas

Probability Theory

****************************