Showing 1–20 of 33 results
Date | Name
Feb 5, 2021 | Finite Sample Analysis of Minimax Offline Reinforcement Learning: Completeness, Fast Rates and First-Order Efficiency
Jun 13, 2021 | Bellman-consistent Pessimism for Offline Reinforcement Learning
Nov 8, 2022 | ARMOR: A Model-based Framework for Improving Arbitrary Baseline Policies with Offline Data
Apr 4, 2024 | Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences
Apr 22, 2024 | Preference Fine-Tuning of LLMs Should Leverage Suboptimal, On-Policy Data
May 26, 2025 | Outcome-Based Online Reinforcement Learning: Algorithms and Fundamental Limits
May 21, 2025 | Trajectory Bellman Residual Minimization: A Simple Value-Based Method for LLM Reasoning
Mar 16, 2026 | POLCA: Stochastic Generative Optimization with LLM
Nov 2, 2020 | A Variant of the Wang-Foster-Kakade Lower Bound for the Discounted Setting
Jun 9, 2021 | Policy Finetuning: Bridging Sample-Efficient Offline and Online Reinforcement Learning
Feb 5, 2022 | Adversarially Trained Actor Critic for Offline Reinforcement Learning
May 31, 2024 | Exploratory Preference Optimization: Harnessing Implicit Q*-Approximation for Sample-Efficient RLHF
Jun 18, 2024 | Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts
Jul 18, 2024 | Correcting the Mythos of KL-Regularization: Direct Alignment without Overoptimization via Chi-Squared Preference Optimization
Jun 16, 2022 | Interaction-Grounded Learning with Action-inclusive Feedback
Jun 8, 2019 | Towards Optimal Off-Policy Evaluation for Reinforcement Learning with Marginalized Importance Sampling
Mar 9, 2020 | Q* Approximation Schemes for Batch Reinforcement Learning: A Theoretical Comparison
Jun 9, 2021 | Interaction-Grounded Learning
Feb 20, 2024 | CounterCurate: Enhancing Physical and Semantic Visio-Linguistic Compositional Reasoning via Counterfactual Examples
Feb 14, 2025 | Do We Need to Verify Step by Step? Rethinking Process Supervision from a Theoretical Perspective