

Off-policy Learning

How can we evaluate the quality of a new policy using data collected by another policy? The answer to this question has a wide range of applications in industry, alleviating the need for frequent online experimentation, which can be costly, time-consuming, and risky. The problem is closely related to covariate shift and causal effect estimation.
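To make the question concrete, here is a minimal sketch of the simplest estimator in this family, inverse propensity scoring (IPS), applied to logged bandit data. The function, its arguments, and the synthetic data are illustrative assumptions for exposition only, not code from any of the papers listed here.

  # Sketch: inverse propensity scoring (IPS) for off-policy evaluation of a
  # bandit policy. Contexts are left implicit; only the logged propensities
  # and the target policy's probabilities on the logged actions are needed.
  import numpy as np

  def ips_estimate(rewards, logging_propensities, target_policy_probs):
      """Estimate the value of a target policy from data logged by another policy.

      rewards:               observed rewards for the logged actions
      logging_propensities:  probability the behavior (logging) policy assigned
                             to each logged action
      target_policy_probs:   probability the target policy assigns to each
                             logged action
      """
      weights = target_policy_probs / logging_propensities
      # Unbiased when the logged propensities are correct and the behavior
      # policy has full support over the target policy's actions.
      return float(np.mean(weights * rewards))

  # Toy usage: behavior policy is uniform over 3 actions; the target policy
  # always plays action 0, whose expected reward is 1.
  rng = np.random.default_rng(0)
  n, k = 10_000, 3
  actions = rng.integers(k, size=n)
  rewards = (actions == 0).astype(float) + 0.1 * rng.normal(size=n)
  behavior_probs = np.full(n, 1.0 / k)
  target_probs = (actions == 0).astype(float)
  print(ips_estimate(rewards, behavior_probs, target_probs))  # close to 1.0

Much of the work below studies how to improve on this basic recipe: doubly robust corrections, minimax bounds, and estimators that avoid the variance blow-up of importance weights over long horizons.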
  • C. Xiao, Y. Wu, T. Lattimore, B. Dai, J. Mei, L. Li, Cs. Szepesvari, and D. Schuurmans: On the optimality of batch policy optimization algorithms. In the 38th International Conference on Machine Learning (ICML), 2021. [arXiv]
  • A. Bennett, N. Kallus, L. Li, and A. Mousavi: Off-policy evaluation in infinite-horizon reinforcement learning with latent confounders. In the 24th International Conference on Artificial Intelligence and Statistics (AISTATS), 2021. [arXiv]
  • O. Nachum, B. Dai, I. Kostrikov, Y. Chow, L. Li, and D. Schuurmans: AlgaeDICE: Policy gradient from arbitrary experience. [arXiv]
  • B. Dai, O. Nachum, Y. Chow, L. Li, Cs. Szepesvari, and D. Schuurmans: CoinDICE: Off-policy confidence interval estimation. In Advances in Neural Information Processing Systems 33 (NeurIPS), spotlight, 2020.
  • M. Yang, O. Nachum, B. Dai, L. Li, and D. Schuurmans: Off-policy evaluation via the regularized Lagrangian. In Advances in Neural Information Processing Systems 33 (NeurIPS), 2020. [arXiv]
  • J. Wen, B. Dai, L. Li, and D. Schuurmans: Batch stationary distribution estimation. In the 37th International Conference on Machine Learning (ICML), 2020. [arXiv]
  • R. Zhang, B. Dai, L. Li, and D. Schuurmans: GenDICE: Generalized offline estimation of stationary values. In the 8th International Conference on Learning Representations (ICLR), 2020. [link, arXiv]
  • Z. Tang, Y. Feng, L. Li, D. Zhou, and Q. Liu: Doubly robust bias reduction in infinite horizon off-policy estimation. In the 8th International Conference on Learning Representations (ICLR), 2020. [link]
  • A. Mousavi, L. Li, Q. Liu, and D. Zhou: Black-box off-policy estimation for infinite-horizon reinforcement learning. In the 8th International Conference on Learning Representations (ICLR), 2020. [link, arXiv]
  • O. Nachum, Y. Chow, B. Dai, and L. Li: DualDICE: Behavior-agnostic estimation of discounted stationary distribution corrections. In Advances in Neural Information Processing Systems 32 (NeurIPS), spotlight, 2019. [arXiv]
  • L. Li: A perspective on off-policy evaluation in reinforcement learning (Invited Paper). Frontiers of Computer Science, 13(5):911-912, 2019. [link, PDF]
  • Q. Liu, L. Li, Z. Tang, and D. Zhou: Breaking the curse of horizon: Infinite-horizon off-policy estimation. In Advances in Neural Information Processing Systems 31 (NeurIPS), spotlight, 2018. [link]
  • N. Jiang and L. Li: Doubly robust off-policy value evaluation for reinforcement learning. In the 33rd International Conference on Machine Learning (ICML), 2016. [link]
  • M. Zoghi, T. Tunys, L. Li, D. Jose, J. Chen, C.-M. Chin, and M. de Rijke: Click-based hot fixes for underperforming torso queries. In the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2016. [link]
  • K. Hofmann, L. Li, and F. Radlinski: Online Evaluation for Information Retrieval. Foundations and Trends in Information Retrieval, 10(1):1--107, 2016. ISBN 978-1-68083-163-4. [link, PDF]
  • L. Li, R. Munos, and Cs. Szepesvari: Toward minimax off-policy value estimation. In the 18th International Conference on Artificial Intelligence and Statistics (AISTATS), 2015. [link]
  • L. Li, S. Chen, J. Kleban, and A. Gupta: Counterfactual estimation and optimization of click metrics in search engines: A case study. In the 24th International Conference on World Wide Web (WWW), Companion, 2015. [link]
  • L. Li, J. Kim, and I. Zitouni: Toward predicting the outcome of an A/B experiment for search relevance. In the 8th ACM International Conference on Web Search and Data Mining (WSDM), 2015. [link]
  • D. Yankov, P. Berkhin, and L. Li: Evaluation of explore-exploit policies in multi-result ranking systems. Microsoft Journal of Applied Research, volume 3, pages 54--60, 2015. Also available as Microsoft Research Technical Report MSR-TR-2015-34, May 2015.
  • M. Dudik, D. Erhan, J. Langford, and L. Li: Doubly robust policy evaluation and optimization. Statistical Science, 29(4):485--511, 2014.
  • M. Dudik, D. Erhan, J. Langford, and L. Li: Sample-efficient nonstationary-policy evaluation for contextual bandits. In the 28th Conference on Uncertainty in Artificial Intelligence (UAI), 2012.
  • L. Li, W. Chu, J. Langford, T. Moon, and X. Wang: An unbiased offline evaluation of contextual bandit algorithms with generalized linear models. In Journal of Machine Learning Research - Workshop and Conference Proceedings 26: On-line Trading of Exploration and Exploitation 2, 2012.
  • M. Dudik, J. Langford, and L. Li: Doubly robust policy evaluation and learning. In the 28th International Conference on Machine Learning (ICML), 2011.
  • D. Agarwal, L. Li, and A.J. Smola: Linear-time algorithms for propensity scores. In the 14th International Conference on Artificial Intelligence and Statistics (AISTATS), 2011.
  • L. Li, W. Chu, J. Langford, and X. Wang: Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In the 4th ACM International Conference on Web Search and Data Mining (WSDM), 2011.
  • A.L. Strehl, J. Langford, L. Li, and S. Kakade: Learning from logged implicit exploration data. In Advances in Neural Information Processing Systems 23 (NIPS), spotlight, 2010.