Department of Mathematics,
University of California San Diego
****************************
Undergraduate Honors Presentation
Jingwen Gu
University of California, San Diego
Ladder-BAI: Online RLHF with Linear Dependence on Reward Scale
Abstract:
This thesis studies a central theoretical challenge in online reinforcement learning from human feedback (RLHF): under Bradley--Terry preference feedback, comparing against a fixed weak reference can make learning exponentially inefficient in the reward scale, because preference signals saturate once the reward gap grows large. To isolate this issue, the thesis considers a simplified dueling-bandit setting and proposes Ladder-BAI, a self-updating-baseline algorithm that repeatedly promotes the current best arm to serve as the comparison baseline and identifies better arms through simple fixed-baseline comparisons. The main result shows that Ladder-BAI finds an $\epsilon$-optimal arm using $\tilde{O}(K R_{\max} + K/\epsilon^2)$ preference queries, achieving linear dependence on the reward scale $R_{\max}$ and improving substantially over prior guarantees that are exponential, or polynomial of higher degree, in $R_{\max}$. The analysis rests on a reward-ladder argument: by keeping comparisons informative, each epoch yields a constant reward improvement, and a final refinement step achieves $\epsilon$-accuracy. Synthetic experiments support the theory, confirming linear scaling in both the reward scale and the number of arms, as well as the expected $1/\epsilon^2$ dependence on the target accuracy.
Advisor: Professor Lijun Ding
May 8, 2026
4:00 PM
APM 5829
Research Areas
Probability Theory
****************************

