
Off-policy Learning

How can we evaluate the quality of a new policy using data collected by another policy? An answer to this question has a wide range of applications in industry, since it reduces the need for frequent online experimentation, which can be costly, time-consuming, and risky. The problem is closely related to covariate shift and causal effect estimation.
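As a rough illustration of the question above, the sketch below applies inverse propensity scoring (IPS), a standard textbook estimator, to logged bandit data. It is not taken from any particular paper listed here. The function names, the log format `(context, action, reward, behavior_prob)`, and the synthetic two-action problem are all assumptions made for the example.

```python
import random


def ips_estimate(logs, target_prob):
    """Inverse propensity scoring estimate of a target policy's average reward.

    logs: list of (context, action, reward, behavior_prob) tuples, where
          behavior_prob is the probability with which the logging (behavior)
          policy chose `action` in `context`. It must be positive.
    target_prob: function (context, action) -> probability that the target
          policy would choose `action` in `context`.
    """
    total = 0.0
    count = 0
    for context, action, reward, behavior_prob in logs:
        # Reweight each logged reward by how much more (or less) likely the
        # target policy is to take the logged action than the behavior policy.
        weight = target_prob(context, action) / behavior_prob
        total += weight * reward
        count += 1
    return total / count


# Hypothetical synthetic problem: two actions with fixed mean rewards.
TRUE_MEAN = {0: 0.2, 1: 0.8}
rng = random.Random(0)


def simulate_logs(n):
    """Collect n interactions with a uniformly random behavior policy."""
    logs = []
    for _ in range(n):
        action = rng.randrange(2)
        reward = 1.0 if rng.random() < TRUE_MEAN[action] else 0.0
        logs.append((None, action, reward, 0.5))
    return logs


def always_one(context, action):
    """Target policy that always picks action 1 (true value 0.8)."""
    return 1.0 if action == 1 else 0.0


logs = simulate_logs(20000)
estimate = ips_estimate(logs, always_one)
print(round(estimate, 3))
```

The estimate should land near the target policy's true value of 0.8 even though the data came from a uniformly random policy. The importance weights can make the variance large when the two policies differ a lot, which is one motivation for lower-variance alternatives such as doubly robust estimators.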

Z. Tang, Y. Duan, S. Zhu, S. Zhang, and L. Li: Estimating long-term effects from experimental data. In the 16th ACM Conference on Recommender Systems (RecSys), Industry Track, 2022.
C. Xiao, Y. Wu, T. Lattimore, B. Dai, J. Mei, L. Li, Cs. Szepesvari, and D. Schuurmans: On the optimality of batch policy optimization algorithms. In the 38th International Conference on Machine Learning (ICML), 2021. [arXiv]
A. Bennett, N. Kallus, L. Li, and A. Mousavi: Off-policy evaluation in infinite-horizon reinforcement learning with latent confounders. In the 24th International Conference on Artificial Intelligence and Statistics (AISTATS), 2021. [arXiv]
O. Nachum, B. Dai, I. Kostrikov, Y. Chow, L. Li, and D. Schuurmans: AlgaeDICE: Policy gradient from arbitrary experience. [arXiv]
B. Dai, O. Nachum, Y. Chow, L. Li, Cs. Szepesvari, and D. Schuurmans: CoinDICE: Off-policy confidence interval estimation. In Advances in Neural Information Processing Systems 33 (NeurIPS), spotlight, 2020.
M. Yang, O. Nachum, B. Dai, L. Li, and D. Schuurmans: Off-policy evaluation via the regularized Lagrangian. In Advances in Neural Information Processing Systems 33 (NeurIPS), 2020. [arXiv]
J. Wen, B. Dai, L. Li, and D. Schuurmans: Batch stationary distribution estimation. In the 37th International Conference on Machine Learning (ICML), 2020. [arXiv]
R. Zhang, B. Dai, L. Li, and D. Schuurmans: GenDICE: Generalized offline estimation of stationary values. In the 8th International Conference on Learning Representations (ICLR), 2020. [link, arXiv]
Z. Tang, Y. Feng, L. Li, D. Zhou, and Q. Liu: Doubly robust bias reduction in infinite horizon off-policy estimation. In the 8th International Conference on Learning Representations (ICLR), 2020. [link]
A. Mousavi, L. Li, Q. Liu, and D. Zhou: Black-box off-policy estimation for infinite-horizon reinforcement learning. In the 8th International Conference on Learning Representations (ICLR), 2020. [link, arXiv]
O. Nachum, Y. Chow, B. Dai, and L. Li: DualDICE: Behavior-agnostic estimation of discounted stationary distribution corrections. In Advances in Neural Information Processing Systems 32 (NeurIPS), spotlight, 2019. [arXiv]
L. Li: A perspective on off-policy evaluation in reinforcement learning (Invited Paper). Frontiers of Computer Science, 13(5):911--912, 2019. [link, PDF]
Q. Liu, L. Li, Z. Tang, and D. Zhou: Breaking the curse of horizon: Infinite-horizon off-policy estimation. In Advances in Neural Information Processing Systems 31 (NeurIPS), spotlight, 2018. [link]
N. Jiang and L. Li: Doubly robust off-policy value evaluation for reinforcement learning. In the 33rd International Conference on Machine Learning (ICML), 2016. [link]
M. Zoghi, T. Tunys, L. Li, D. Jose, J. Chen, C.-M. Chin, and M. de Rijke: Click-based hot fixes for underperforming torso queries. In the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2016. [link]
K. Hofmann, L. Li, and F. Radlinski: Online Evaluation for Information Retrieval. Foundations and Trends in Information Retrieval, 10(1):1--107, 2016. ISBN 978-1-68083-163-4. [link, PDF]
L. Li, R. Munos, and Cs. Szepesvari: Toward minimax off-policy value estimation. In the 18th International Conference on Artificial Intelligence and Statistics (AISTATS), 2015. [link]
L. Li, S. Chen, J. Kleban, and A. Gupta: Counterfactual estimation and optimization of click metrics in search engines: A case study. In the 24th International Conference on World Wide Web (WWW), Companion, 2015. [link]
L. Li, J. Kim, and I. Zitouni: Toward predicting the outcome of an A/B experiment for search relevance. In the 8th International Conference on Web Search and Data Mining (WSDM), 2015. [link]
D. Yankov, P. Berkhin, and L. Li: Evaluation of explore-exploit policies in multi-result ranking systems. Microsoft Journal on Applied Research, volume 3, pages 54--60, 2015. Also available as Microsoft Research Technical Report MSR-TR-2015-34, May 2015.
M. Dudik, D. Erhan, J. Langford, and L. Li: Doubly robust policy evaluation and optimization. In Statistical Science, 29(4):485--511, 2014.
M. Dudik, D. Erhan, J. Langford, and L. Li: Sample-efficient nonstationary-policy evaluation for contextual bandits. In the 28th Conference on Uncertainty in Artificial Intelligence (UAI), 2012.
L. Li, W. Chu, J. Langford, T. Moon, and X. Wang: An unbiased offline evaluation of contextual bandit algorithms with generalized linear models. In Journal of Machine Learning Research - Workshop and Conference Proceedings 26: On-line Trading of Exploration and Exploitation 2, 2012.
M. Dudik, J. Langford, and L. Li: Doubly robust policy evaluation and learning. In the 28th International Conference on Machine Learning (ICML), 2011.
D. Agarwal, L. Li, and A.J. Smola: Linear-time algorithms for propensity scores. In the 14th International Conference on Artificial Intelligence and Statistics (AISTATS), 2011.
L. Li, W. Chu, J. Langford, and X. Wang: Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In the 4th ACM International Conference on Web Search and Data Mining (WSDM), 2011.
A.L. Strehl, J. Langford, L. Li, and S. Kakade: Learning from logged implicit exploration data. In Advances in Neural Information Processing Systems 23 (NIPS), spotlight, 2011.