Publications

Preprints

Z. Wei, W. Yao, Y. Liu, W. Zhang, Q. Lu, L. Qiu, C. Yu, P. Xu, C. Zhang, B. Yin, H. Yun, and L. Li: WebAgent-R1: Training web agents via end-to-end multi-turn reinforcement learning, submitted, 2025.
O. Nachum, B. Dai, I. Kostrikov, Y. Chow, L. Li, and D. Schuurmans: AlgaeDICE: Policy gradient from arbitrary experience. [arXiv]

Conference

Q. Zhang, L. Qiu, I. Hong, Z. Xu, T. Liu, S. Li, R. Zhang, Z. Li, L. Li, B. Yin, C. Zhang, J. Chen, H. Jiang, and T. Zhao: Self-rewarding PPO: Aligning large language models with demonstrations only. In the 2nd Conference on Language Modeling (COLM), 2025.
X. Chen, J. Hu, C. Jin, L. Li, and L. Wang: Understanding domain randomization for sim-to-real transfer. In the 10th International Conference on Learning Representations (ICLR), 2022. [arXiv]
Z. Tang, Y. Duan, S. Zhu, S. Zhang, and L. Li: Estimating long-term effects from experimental data. In the 16th ACM Conference on Recommender Systems (RecSys), Industry Track, 2022.
C. Xiao, Y. Wu, T. Lattimore, B. Dai, J. Mei, L. Li, Cs. Szepesvari, and D. Schuurmans: On the optimality of batch policy optimization algorithms. In the 38th International Conference on Machine Learning (ICML), 2021. [arXiv]
X. Chen, J. Hu, C. Jin, L. Li, and L. Wang: Near-optimal representation learning for linear bandits and linear RL. In the 38th International Conference on Machine Learning (ICML), 2021. [arXiv]
A. Bennett, N. Kallus, L. Li, and A. Mousavi: Off-policy evaluation in infinite-horizon reinforcement learning with latent confounders. In the 24th International Conference on Artificial Intelligence and Statistics (AISTATS), 2021. [arXiv]
X. Chen, J. Hu, L. Li, and L. Wang: Efficient reinforcement learning in factored MDPs with application to constrained RL. In the 9th International Conference on Learning Representations (ICLR), 2021.
W. Zhang, D. Zhou, L. Li, and Q. Gu: Neural Thompson sampling. In the 9th International Conference on Learning Representations (ICLR), 2021. [arXiv]
J. Mei, C. Xiao, B. Dai, L. Li, Cs. Szepesvari, and D. Schuurmans: Escaping the gravitational pull of softmax. In Advances in Neural Information Processing Systems 33 (NeurIPS), oral, 2020.
B. Dai, O. Nachum, Y. Chow, L. Li, Cs. Szepesvari, and D. Schuurmans: CoinDICE: Off-policy confidence interval estimation. In Advances in Neural Information Processing Systems 33 (NeurIPS), spotlight, 2020.
M. Yang, O. Nachum, B. Dai, L. Li, and D. Schuurmans: Off-policy evaluation via the regularized Lagrangian. In Advances in Neural Information Processing Systems 33 (NeurIPS), 2020. [arXiv]
J. Wen, B. Dai, L. Li, and D. Schuurmans: Batch stationary distribution estimation. In the 37th International Conference on Machine Learning (ICML), 2020. [arXiv]
D. Zhou, L. Li, and Q. Gu: Neural contextual bandits with UCB-based exploration. In the 37th International Conference on Machine Learning (ICML), 2020. [arXiv]
B. Kveton, M. Zaheer, Cs. Szepesvari, L. Li, M. Ghavamzadeh, and C. Boutilier: Randomized exploration in generalized linear bandits. In the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS), 2020. [arXiv]
R. Zhang, B. Dai, L. Li, and D. Schuurmans: GenDICE: Generalized offline estimation of stationary values. In the 8th International Conference on Learning Representations (ICLR), 2020. [link, arXiv]
Z. Tang, Y. Feng, L. Li, D. Zhou, and Q. Liu: Doubly robust bias reduction in infinite horizon off-policy estimation. In the 8th International Conference on Learning Representations (ICLR), 2020. [link]
A. Mousavi, L. Li, Q. Liu, and D. Zhou: Black-box off-policy estimation for infinite-horizon reinforcement learning. In the 8th International Conference on Learning Representations (ICLR), 2020. [link, arXiv]
O. Nachum, Y. Chow, B. Dai, and L. Li: DualDICE: Behavior-agnostic estimation of discounted stationary distribution corrections. In Advances in Neural Information Processing Systems 32 (NeurIPS), spotlight, 2019. [arXiv]
Y. Feng, L. Li, and Q. Liu: A kernel loss for solving the Bellman equation. In Advances in Neural Information Processing Systems 32 (NeurIPS), 2019. [arXiv]
C. Dann, L. Li, W. Wei, and E. Brunskill: Policy certificates: Towards accountable reinforcement learning. In the 36th International Conference on Machine Learning (ICML), 2019. [link]
H. Dong, J. Mao, T. Lin, C. Wang, L. Li, and D. Zhou: Neural logic machines. In the 7th International Conference on Learning Representations (ICLR), 2019. [arXiv]
Q. Liu, L. Li, Z. Tang, and D. Zhou: Breaking the curse of horizon: Infinite-horizon off-policy estimation. In Advances in Neural Information Processing Systems 31 (NeurIPS), spotlight, 2018. [link]
K.-S. Jun, L. Li, Y. Ma, and J. Zhu: Adversarial attacks on stochastic bandits. In Advances in Neural Information Processing Systems 31 (NeurIPS), 2018. [link]
Y. Ma, K.-S. Jun, L. Li, and J. Zhu: Data poisoning attacks in contextual bandits. In the 9th Conference on Decision and Game Theory for Security (GameSec), 2018. [arXiv]
D. Tang, X. Li, J. Gao, C. Wang, L. Li, and T. Jebara: Subgoal discovery for hierarchical dialogue policy learning. In the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2018. [link]
B. Dai, A. Shaw, L. Li, L. Xiao, N. He, Z. Liu, J. Chen, and L. Song: SBEED: Convergent reinforcement learning with nonlinear function approximation. In the 35th International Conference on Machine Learning (ICML), 2018. [link, arXiv, slides]
Y. Chen, L. Li, and M. Wang: Bilinear π learning using state and action features. In the 35th International Conference on Machine Learning (ICML), 2018. [link, arXiv]
B. Dai, A. Shaw, N. He, L. Li, and L. Song: Boosting the actor with dual critic. In the 6th International Conference on Learning Representations (ICLR), 2018. [arXiv]
Z. Lipton, X. Li, J. Gao, L. Li, F. Ahmed, and L. Deng: Efficient dialogue policy learning with BBQ-networks. In the 32nd AAAI Conference on Artificial Intelligence (AAAI), 2018. [link]
J. Chen, C. Wang, L. Xiao, J. He, L. Li, and L. Deng: Q-LDA: Uncovering latent patterns in text-based sequential decision processes. In Advances in Neural Information Processing Systems 30 (NIPS), 2017. [link]
B. Peng, X. Li, L. Li, J. Gao, A. Celikyilmaz, S. Lee, and K.-F. Wong: Composite task-completion dialogue system via hierarchical deep reinforcement learning. In the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2017. [link]
L. Li, Y. Lu, and D. Zhou: Provably optimal algorithms for generalized linear contextual bandits. In the 34th International Conference on Machine Learning (ICML), 2017. [link]
S. Du, J. Chen, L. Li, L. Xiao, and D. Zhou: Stochastic variance reduction methods for policy evaluation. In the 34th International Conference on Machine Learning (ICML), 2017. [link]
B. Dhingra, L. Li, X. Li, J. Gao, Y.-N. Chen, F. Ahmed, and L. Deng: Towards end-to-end reinforcement learning of dialogue agents for information access. In the 55th Annual Meeting of the Association for Computational Linguistics (ACL), 2017. [link]
E. Parisotto, A. Mohamed, R. Singh, L. Li, D. Zhou, and P. Kohli: Neuro-symbolic program synthesis. In the 5th International Conference on Learning Representations (ICLR), 2017. [link]
T.K. Huang, L. Li, A. Vartanian, S. Amershi, and J. Zhu: Active learning with oracle epiphany. In Advances in Neural Information Processing Systems 29 (NIPS), 2016. [link]
J. He, M. Ostendorf, X. He, J. Chen, J. Gao, L. Li, and L. Deng: Deep reinforcement learning with a combinatorial action space for predicting and tracking popular discussion threads. In the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2016. [link]
C.-Y. Liu and L. Li: On the prior sensitivity of Thompson sampling. In the 27th International Conference on Algorithmic Learning Theory (ALT), 2016. [link]
J. He, J. Chen, X. He, J. Gao, L. Li, L. Deng, and M. Ostendorf: Deep reinforcement learning with a natural language action space. In the 54th Annual Meeting of the Association for Computational Linguistics (ACL), 2016. [link]
N. Jiang and L. Li: Doubly robust off-policy value evaluation for reinforcement learning. In the 33rd International Conference on Machine Learning (ICML), 2016. [link]
S. Agrawal, N. R. Devanur, and L. Li: An efficient algorithm for contextual bandits with knapsacks, and an extension to concave objectives. In the 29th Annual Conference on Learning Theory (COLT), 2016. [link]
M. Zoghi, T. Tunys, L. Li, D. Jose, J. Chen, C.-M. Chin, and M. de Rijke: Click-based hot fixes for underperforming torso queries. In the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2016. [link]
J. He, J. Chen, X. He, J. Gao, L. Li, L. Deng, and M. Ostendorf: Deep reinforcement learning with an unbounded action space. In the International Conference on Learning Representations (ICLR), Workshop Track, 2016.
L. Li, R. Munos, and Cs. Szepesvari: Toward minimax off-policy value estimation. In the 18th International Conference on Artificial Intelligence and Statistics (AISTATS), 2015. [link]
L. Li, S. Chen, J. Kleban, and A. Gupta: Counterfactual estimation and optimization of click metrics in search engines: A case study. In the 24th International Conference on World Wide Web (WWW), Companion, 2015. [link]
L. Li, J. Kim, and I. Zitouni: Toward predicting the outcome of an A/B experiment for search relevance. In the 8th International Conference on Web Search and Data Mining (WSDM), 2015. [link]
L. Li, H. He, and J.D. Williams: Temporal supervised learning for inferring a dialog policy from example conversations. In the IEEE Spoken Language Technology Workshop (SLT), 2014.
A. Agarwal, D. Hsu, S. Kale, J. Langford, L. Li, and R.E. Schapire: Taming the monster: A fast and simple algorithm for contextual bandits. In the 31st International Conference on Machine Learning (ICML), 2014.
E. Brunskill and L. Li: PAC-inspired option discovery in lifelong reinforcement learning. In the 31st International Conference on Machine Learning (ICML), 2014.
E. Brunskill and L. Li: Sample complexity of multi-task reinforcement learning. In the 29th Conference on Uncertainty in Artificial Intelligence (UAI), 2013.
M. Dudik, D. Erhan, J. Langford, and L. Li: Sample-efficient nonstationary-policy evaluation for contextual bandits. In the 28th Conference on Uncertainty in Artificial Intelligence (UAI), 2012.
L. Li, W. Chu, J. Langford, T. Moon, and X. Wang: An unbiased offline evaluation of contextual bandit algorithms with generalized linear models. In Journal of Machine Learning Research - Workshop and Conference Proceedings 26: On-line Trading of Exploration and Exploitation 2, 2012.
V. Navalpakkam, R. Kumar, L. Li, and D. Sivakumar: Attention and selection in online choice tasks. In the 20th International Conference on User Modeling, Adaptation and Personalization (UMAP), 2012.
H. Wang, A. Dong, L. Li, Y. Chang, and E. Gabrilovich: Joint relevance and freshness learning from clickthroughs for news search. In the 21st International Conference on World Wide Web (WWW), 2012.
O. Chapelle and L. Li: An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems 24 (NIPS), 2011.
M. Dudik, J. Langford, and L. Li: Doubly robust policy evaluation and learning. In the 28th International Conference on Machine Learning (ICML), 2011.
W. Chu, M. Zinkevich, L. Li, A. Thomas, and B. Tseng: Unbiased online active learning in data streams. In the 17th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2011.
D. Agarwal, L. Li, and A.J. Smola: Linear-time algorithms for propensity scores. In the 14th International Conference on Artificial Intelligence and Statistics (AISTATS), 2011.
A. Beygelzimer, J. Langford, L. Li, L. Reyzin, and R.E. Schapire: Contextual bandit algorithms with supervised learning guarantees. In the 14th International Conference on Artificial Intelligence and Statistics (AISTATS), 2011. Co-winner of the Notable Paper Award.
W. Chu, L. Li, L. Reyzin, and R. Schapire: Linear contextual bandit problems. In the 14th International Conference on Artificial Intelligence and Statistics (AISTATS), 2011.
L. Li, W. Chu, J. Langford, and X. Wang: Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In the 4th ACM International Conference on Web Search and Data Mining (WSDM), 2011. Winner of the Best Paper Award.
A.L. Strehl, J. Langford, L. Li, and S. Kakade: Learning from logged implicit exploration data. In Advances in Neural Information Processing Systems 23 (NIPS), spotlight, 2011.
M. Zinkevich, M. Weimer, A.J. Smola, and L. Li: Parallelized stochastic gradient descent. In Advances in Neural Information Processing Systems 23 (NIPS), 2011.
T. Moon, L. Li, W. Chu, C. Liao, Z. Zheng, and Y. Chang: Online learning for recency search ranking using real-time user feedback (short paper). In the 19th ACM Conference on Information and Knowledge Management (CIKM), 2010.
L. Li, W. Chu, J. Langford, and R.E. Schapire: A contextual-bandit approach to personalized news article recommendation. In the 19th International Conference on World Wide Web (WWW), 2010.
L. Li, J.D. Williams, and S. Balakrishnan: Reinforcement learning for spoken dialog management using least-squares policy iteration and fast feature selection. In the 10th Annual Conference of the International Speech Communication Association (INTERSPEECH), 2009.
C. Diuk, L. Li, and B.R. Leffler: The adaptive k-meteorologists problem and its application to structure learning and feature selection in reinforcement learning. In the 26th International Conference on Machine Learning (ICML), 2009.
J. Asmuth, L. Li, M.L. Littman, A. Nouri, and D. Wingate: A Bayesian sampling approach to exploration in reinforcement learning. In the 25th International Conference on Uncertainty in Artificial Intelligence (UAI), 2009.
L. Li, M.L. Littman, and C.R. Mansley: Online exploration in least-squares policy iteration. In the 8th International Conference on Autonomous Agents and Multiagent Systems (AAMAS), 2009.
J. Langford, L. Li, and T. Zhang: Sparse online learning via truncated gradient. In Advances in Neural Information Processing Systems 21 (NIPS), spotlight, 2009.
L. Li: A worst-case comparison between temporal difference and residual gradient. In the 25th International Conference on Machine Learning (ICML), 2008.
L. Li, M.L. Littman, and T.J. Walsh: Knows what it knows: A framework for self-aware learning. In the 25th International Conference on Machine Learning (ICML), 2008. Co-winner of the Best Student Paper Award. A Google Student Award winner at the New York Academy of Sciences Symposium on Machine Learning, 2008.
R. Parr, L. Li, G. Taylor, C. Painter-Wakefield, and M.L. Littman: An analysis of linear models, linear value function approximation, and feature selection for reinforcement learning. In the 25th International Conference on Machine Learning (ICML), 2008.
E. Brunskill, B.R. Leffler, L. Li, M.L. Littman, and N. Roy: CORL: A continuous-state offset-dynamics reinforcement learner. In the 24th Conference on Uncertainty in Artificial Intelligence (UAI), 2008.
L. Li and M.L. Littman: Efficient value-function approximation via online linear regression. In the 10th International Symposium on Artificial Intelligence and Mathematics (AI&Math), 2008.
J. Wortman, Y. Vorobeychik, L. Li, and J. Langford: Maintaining equilibria during exploration in sponsored search auctions. In the 3rd International Workshop on Internet and Network Economics (WINE), LNCS 4858, 2007.
T.J. Walsh, A. Nouri, L. Li, and M.L. Littman: Planning and learning in environments with delayed feedback. In the 18th European Conference on Machine Learning (ECML), LNCS 4701, 2007.
R. Parr, C. Painter-Wakefield, L. Li, and M.L. Littman: Analyzing feature generation for value-function approximation. In the 24th International Conference on Machine Learning (ICML), 2007.
A.L. Strehl, L. Li, E. Wiewiora, J. Langford, and M.L. Littman: PAC model-free reinforcement learning. In the 23rd International Conference on Machine Learning (ICML), 2006. Best Student Poster Award winner at the New York Academy of Sciences Symposium on Machine Learning, 2006.
A.L. Strehl, L. Li, and M.L. Littman: Incremental model-based learners with formal learning-time guarantees. In the 22nd Conference on Uncertainty in Artificial Intelligence (UAI), 2006.
L. Li, T.J. Walsh, and M.L. Littman: Towards a unified theory of state abstraction for MDPs. In the 9th International Symposium on Artificial Intelligence and Mathematics (AI&Math), 2006.
L. Li and M.L. Littman: Lazy approximation for solving continuous finite-horizon MDPs. In the 20th National Conference on Artificial Intelligence (AAAI), 2005.
L. Li, V. Bulitko, and R. Greiner: Batch reinforcement learning with state importance (extended abstract). In the 15th European Conference on Machine Learning (ECML), LNCS 3201, 2004.
V. Bulitko, L. Li, R. Greiner, and I. Levner: Lookahead pathologies for single agent search (poster paper). In the 18th International Joint Conference on Artificial Intelligence (IJCAI), 2003.
I. Levner, V. Bulitko, L. Li, G. Lee, and R. Greiner: Towards automated creation of image interpretation systems. In the 16th Australian Joint Conference on Artificial Intelligence, LNCS 2903, 2003.
L. Li, V. Bulitko, R. Greiner, and I. Levner: Improving an adaptive image interpretation system by leveraging. In the 8th Australian and New Zealand Intelligent Information System Conference, 2003.

Journal

L. Li: A perspective on off-policy evaluation in reinforcement learning (Invited Paper). Frontiers of Computer Science, 13(5):911-912, 2019. [link, PDF]
M. Dudik, D. Erhan, J. Langford, and L. Li: Doubly robust policy evaluation and optimization. In Statistical Science, 29(4):485--511, 2014.
J. Bian, B. Long, L. Li, T. Moon, A. Dong, and Y. Chang: Exploiting user preference for online learning in Web content optimization systems. In ACM Transactions on Intelligent Systems and Technology, 5(2), 2014.
T. Moon, W. Chu, L. Li, Z. Zheng, and Y. Chang: Refining recency search results with user click feedback. In ACM Transactions on Information Systems, 30(4), 2012.
J. Langford, L. Li, P. McAfee, and K. Papineni: Cloud control: Voluntary admission control for Intranet traffic management. In Information Systems and e-Business Management, 10(3):295--308, 2012.
L. Li, M.L. Littman, T.J. Walsh, and A.L. Strehl: Knows what it knows: A framework for self-aware learning. In Machine Learning, 82(3):399--443, 2011.
L. Li and M.L. Littman: Reducing reinforcement learning to KWIK online regression. In the Annals of Mathematics and Artificial Intelligence, 58(3--4):217--237, 2010.
J. Langford, L. Li, J. Wortman, and Y. Vorobeychik: Maintaining equilibria during exploration in sponsored search auctions. In Algorithmica, 58(4):990--1021, 2010.
A.L. Strehl, L. Li, and M.L. Littman: Reinforcement learning in finite MDPs: PAC analysis. In the Journal of Machine Learning Research, 10:2413--2444, 2009.
E. Brunskill, B.R. Leffler, L. Li, M.L. Littman, and N. Roy: Provably efficient learning with typed parametric models. In the Journal of Machine Learning Research, 10:1955--1988, 2009.
J. Langford, L. Li, and T. Zhang: Sparse online learning via truncated gradient. In the Journal of Machine Learning Research, 10:777--801, 2009.
T.J. Walsh, A. Nouri, L. Li, and M.L. Littman: Planning and learning in environments with delayed feedback. In the Journal of Autonomous Agents and Multi-Agent Systems, 18(1):83--105, 2009.
L. Li, V. Bulitko, and R. Greiner: Focus of attention in reinforcement learning. In the Journal of Universal Computer Science, 13(9):1246--1269, 2007.

Theses, Surveys, Books, and Chapters

J. Gao, M. Galley, and L. Li: Neural approaches to Conversational AI: Question answering, task-oriented dialogues and social chatbots. Foundations and Trends in Information Retrieval, 13(2-3):127-298, 2019. ISBN 978-1-68083-552-6. [link, arXiv]
K. Hofmann, L. Li, and F. Radlinski: Online Evaluation for Information Retrieval. Foundations and Trends in Information Retrieval, 10(1):1--107, 2016. ISBN 978-1-68083-163-4. [link, PDF]
L. Li: Sample complexity bounds of exploration. In Marco Wiering and Martijn van Otterlo, editors, Reinforcement Learning: State of the Art, Springer Verlag, 2012. ISBN 978-3642276446.
L. Li: A unifying framework for computational reinforcement learning theory. Doctoral dissertation, Department of Computer Science, Rutgers University, New Brunswick, NJ, USA, May, 2009. [link]
L. Li: Focus of attention in reinforcement learning. MSc thesis, Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada, July, 2004.

Others

X. Li, Z.C. Lipton, B. Dhingra, L. Li, J. Gao, and Y.-N. Chen: A user simulator for task-completion dialogues. MSR technical report, December 2016.
E. Brunskill and L. Li: The online discovery problem and its application to lifelong reinforcement learning. CoRR abs/1506.03379, June 2015.
D. Yankov, P. Berkhin, and L. Li: Evaluation of explore-exploit policies in multi-result ranking systems. Microsoft Journal on Applied Research, volume 3, pages 54--60, 2015. Also available as Microsoft Research Technical Report MSR-TR-2015-34, May 2015.
Z. Qin, V. Petricek, N. Karampatziakis, L. Li, and J. Langford: Efficient online bootstrapping for large scale learning. NIPS Workshop on Big Data, December, 2013. Also available as Microsoft Research Technical Report MSR-TR-2013-132.
L. Li and O. Chapelle: Regret bounds for Thompson sampling (Open Problems). In the 25th Annual Conference on Learning Theory (COLT), 2012.
L. Li and M.L. Littman: Prioritized sweeping converges to the optimal value function. Technical report DCS-TR-631, Department of Computer Science, Rutgers University, May 2008.
A.L. Strehl, L. Li, and M.L. Littman: PAC reinforcement learning bounds for RTDP and Rand-RTDP. AAAI technical report WS-06-11, pages 50-56, July 2006.
L. Li and M.L. Littman: Lazy approximation: A new approach for solving continuous finite-horizon MDPs. Technical report DCS-TR-577, Department of Computer Science, Rutgers University, May 2005.