Natural Language Reasoning, A Survey

Online AM: 09 May 2024

Abstract

This survey paper proposes a clearer view of natural language reasoning in the field of Natural Language Processing (NLP), both conceptually and practically. Conceptually, we provide a distinct definition for natural language reasoning in NLP, based on both philosophy and NLP scenarios, discuss what types of tasks require reasoning, and introduce a taxonomy of reasoning. Practically, we conduct a comprehensive literature review on natural language reasoning in NLP, mainly covering classical logical reasoning, natural language inference, multi-hop question answering, and commonsense reasoning. The paper also identifies backward reasoning as a powerful paradigm for multi-step reasoning, and introduces defeasible reasoning as one of the most important future directions in natural language reasoning research. We focus on single-modality unstructured natural language text, excluding neuro-symbolic research and mathematical reasoning.

  98. Lucie Charlotte Magister, Jonathan Mallinson, Jakub Adámek, Eric Malmi, and Aliaksei Severyn. 2022. Teaching Small Language Models to Reason. CoRR abs/2212.08410(2022). https://doi.org/10.48550/arXiv.2212.08410 arXiv:2212.08410Google ScholarGoogle ScholarCross RefCross Ref
  99. Tom McCoy, Ellie Pavlick, and Tal Linzen. 2019. Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, Anna Korhonen, David R. Traum, and Lluís Màrquez (Eds.). Association for Computational Linguistics, 3428–3448. https://doi.org/10.18653/v1/p19-1334Google ScholarGoogle ScholarCross RefCross Ref
  100. Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun’ichi Tsujii (Eds.). Association for Computational Linguistics, 2381–2391. https://doi.org/10.18653/v1/d18-1260Google ScholarGoogle ScholarCross RefCross Ref
  101. Sewon Min, Eric Wallace, Sameer Singh, Matt Gardner, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2019. Compositional Questions Do Not Necessitate Multi-hop Reasoning. In ACL (1). Association for Computational Linguistics, 4249–4257.Google ScholarGoogle Scholar
  102. Sewon Min, Victor Zhong, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2019. Multi-hop Reading Comprehension through Question Decomposition and Rescoring. In ACL (1). Association for Computational Linguistics, 6097–6109.Google ScholarGoogle Scholar
  103. Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James F. Allen. 2016. A Corpus and Cloze Evaluation for Deeper Understanding of Commonsense Stories. In NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego California, USA, June 12-17, 2016, Kevin Knight, Ani Nenkova, and Owen Rambow (Eds.). The Association for Computational Linguistics, 839–849. https://doi.org/10.18653/v1/n16-1098Google ScholarGoogle ScholarCross RefCross Ref
  104. Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. 2020. Adversarial NLI: A New Benchmark for Natural Language Understanding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel R. Tetreault (Eds.). Association for Computational Linguistics, 4885–4901. https://doi.org/10.18653/V1/2020.ACL-MAIN.441Google ScholarGoogle ScholarCross RefCross Ref
  105. Yasumasa Onoe, Michael J. Q. Zhang, Eunsol Choi, and Greg Durrett. 2021. CREAK: A Dataset for Commonsense Reasoning over Entity Knowledge. In NeurIPS Datasets and Benchmarks.Google ScholarGoogle Scholar
  106. Santiago Ontañón, Joshua Ainslie, Vaclav Cvicek, and Zachary Fisher. 2022. LogicInference: A New Dataset for Teaching Logical Inference to seq2seq Models. CoRR abs/2203.15099(2022). https://doi.org/10.48550/arXiv.2203.15099 arXiv:2203.15099Google ScholarGoogle ScholarCross RefCross Ref
  107. Liangming Pan, Wenhu Chen, Wenhan Xiong, Min-Yen Kan, and William Yang Wang. 2021. Unsupervised Multi-hop Question Answering by Question Generation. In NAACL-HLT. Association for Computational Linguistics, 5866–5880.Google ScholarGoogle Scholar
  108. Pruthvi Patel, Swaroop Mishra, Mihir Parmar, and Chitta Baral. 2022. Is a Question Decomposition Unit All We Need?CoRR abs/2205.12538(2022). https://doi.org/10.48550/arXiv.2205.12538 arXiv:2205.12538Google ScholarGoogle ScholarCross RefCross Ref
  109. Charles Sanders Peirce. 1992. Reasoning and the logic of things: The Cambridge conferences lectures of 1898. Harvard University Press.Google ScholarGoogle Scholar
  110. Ethan Perez, Patrick S. H. Lewis, Wen-tau Yih, Kyunghyun Cho, and Douwe Kiela. 2020. Unsupervised Question Decomposition for Question Answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu (Eds.). Association for Computational Linguistics, 8864–8880. https://doi.org/10.18653/v1/2020.emnlp-main.713Google ScholarGoogle ScholarCross RefCross Ref
  111. Xinyu Pi, Qian Liu, Bei Chen, Morteza Ziyadi, Zeqi Lin, Yan Gao, Qiang Fu, Jian-Guang Lou, and Weizhu Chen. 2022. Reasoning Like Program Executors. CoRR abs/2201.11473(2022). arXiv:2201.11473 https://arxiv.org/abs/2201.11473Google ScholarGoogle Scholar
  112. Adam Poliak, Jason Naradowsky, Aparajita Haldar, Rachel Rudinger, and Benjamin Van Durme. 2018. Hypothesis Only Baselines in Natural Language Inference. In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, *SEM@NAACL-HLT 2018, New Orleans, Louisiana, USA, June 5-6, 2018, Malvina Nissim, Jonathan Berant, and Alessandro Lenci (Eds.). Association for Computational Linguistics, 180–191. https://doi.org/10.18653/v1/s18-2023Google ScholarGoogle ScholarCross RefCross Ref
  113. Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A. Smith, and Mike Lewis. 2022. Measuring and Narrowing the Compositionality Gap in Language Models. CoRR abs/2210.03350(2022).Google ScholarGoogle Scholar
  114. Ben Prystawski and Noah D. Goodman. 2023. Why think step-by-step? Reasoning emerges from the locality of experience. CoRR abs/2304.03843(2023). https://doi.org/10.48550/arXiv.2304.03843 arXiv:2304.03843Google ScholarGoogle ScholarCross RefCross Ref
  115. Peng Qi, Haejun Lee, Tg Sido, and Christopher D. Manning. 2021. Answering Open-Domain Questions of Varying Reasoning Steps from Text. In EMNLP (1). Association for Computational Linguistics, 3599–3614.Google ScholarGoogle Scholar
  116. Shuofei Qiao, Yixin Ou, Ningyu Zhang, Xiang Chen, Yunzhi Yao, Shumin Deng, Chuanqi Tan, Fei Huang, and Huajun Chen. 2022. Reasoning with Language Model Prompting: A Survey. CoRR abs/2212.09597(2022). https://doi.org/10.48550/arXiv.2212.09597 arXiv:2212.09597Google ScholarGoogle ScholarCross RefCross Ref
  117. Lianhui Qin, Antoine Bosselut, Ari Holtzman, Chandra Bhagavatula, Elizabeth Clark, and Yejin Choi. 2019. Counterfactual Story Reasoning and Generation. In EMNLP/IJCNLP (1). Association for Computational Linguistics, 5042–5052.Google ScholarGoogle Scholar
  118. Lianhui Qin, Vered Shwartz, Peter West, Chandra Bhagavatula, Jena D. Hwang, Ronan Le Bras, Antoine Bosselut, and Yejin Choi. 2020. Back to the Future: Unsupervised Backprop-based Decoding for Counterfactual and Abductive Commonsense Reasoning. In EMNLP (1). Association for Computational Linguistics, 794–805.Google ScholarGoogle Scholar
  119. Lin Qiu, Yunxuan Xiao, Yanru Qu, Hao Zhou, Lei Li, Weinan Zhang, and Yong Yu. 2019. Dynamically Fused Graph Network for Multi-hop Reasoning. In ACL (1). Association for Computational Linguistics, 6140–6150.Google ScholarGoogle Scholar
  120. Hanhao Qu, Yu Cao, Jun Gao, Liang Ding, and Ruifeng Xu. 2022. Interpretable Proof Generation via Iterative Backward Reasoning. In NAACL-HLT. Association for Computational Linguistics, 2968–2981.Google ScholarGoogle Scholar
  121. Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving Language Understanding by Generative Pre-Training. OpenAI.Google ScholarGoogle Scholar
  122. Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res. 21(2020), 140:1–140:67. http://jmlr.org/papers/v21/20-074.htmlGoogle ScholarGoogle Scholar
  123. Nazneen Fatema Rajani, Bryan McCann, Caiming Xiong, and Richard Socher. 2019. Explain Yourself! Leveraging Language Models for Commonsense Reasoning. In ACL (1). Association for Computational Linguistics, 4932–4942.Google ScholarGoogle Scholar
  124. Hannah Rashkin, Maarten Sap, Emily Allaway, Noah A. Smith, and Yejin Choi. 2018. Event2Mind: Commonsense Inference on Events, Intents, and Reactions. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, Iryna Gurevych and Yusuke Miyao (Eds.). Association for Computational Linguistics, 463–473. https://doi.org/10.18653/v1/P18-1043Google ScholarGoogle ScholarCross RefCross Ref
  125. Abhilasha Ravichander, Matt Gardner, and Ana Marasovic. 2022. CONDAQA: A Contrastive Reading Comprehension Dataset for Reasoning about Negation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (Eds.). Association for Computational Linguistics, 8729–8755. https://aclanthology.org/2022.emnlp-main.598Google ScholarGoogle ScholarCross RefCross Ref
  126. Danilo Neves Ribeiro, Shen Wang, Xiaofei Ma, Rui Dong, Xiaokai Wei, Henghui Zhu, Xinchi Chen, Peng Xu, Zhiheng Huang, Andrew O. Arnold, and Dan Roth. 2022. Entailment Tree Explanations via Iterative Retrieval-Generation Reasoner. In Findings of the Association for Computational Linguistics: NAACL 2022, Seattle, WA, United States, July 10-15, 2022, Marine Carpuat, Marie-Catherine de Marneffe, and Iván Vladimir Meza Ruíz (Eds.). Association for Computational Linguistics, 465–475. https://doi.org/10.18653/v1/2022.findings-naacl.35Google ScholarGoogle ScholarCross RefCross Ref
  127. Kyle Richardson and Ashish Sabharwal. 2022. Pushing the Limits of Rule Reasoning in Transformers through Natural Language Satisfiability. In Thirty-Sixth AAAI Conference on Artificial Intelligence, AAAI 2022, Thirty-Fourth Conference on Innovative Applications of Artificial Intelligence, IAAI 2022, The Twelveth Symposium on Educational Advances in Artificial Intelligence, EAAI 2022 Virtual Event, February 22 - March 1, 2022. AAAI Press, 11209–11219. https://doi.org/10.1609/AAAI.V36I10.21371Google ScholarGoogle ScholarCross RefCross Ref
  128. Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S. Gordon. 2011. Choice of Plausible Alternatives: An Evaluation of Commonsense Causal Reasoning. In Logical Formalizations of Commonsense Reasoning, Papers from the 2011 AAAI Spring Symposium, Technical Report SS-11-06, Stanford, California, USA, March 21-23, 2011. AAAI. http://www.aaai.org/ocs/index.php/SSS/SSS11/paper/view/2418Google ScholarGoogle Scholar
  129. Rachel Rudinger, Vered Shwartz, Jena D. Hwang, Chandra Bhagavatula, Maxwell Forbes, Ronan Le Bras, Noah A. Smith, and Yejin Choi. 2020. Thinking Like a Skeptic: Defeasible Inference in Natural Language. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online Event, 16-20 November 2020(Findings of ACL, Vol.  EMNLP 2020), Trevor Cohn, Yulan He, and Yang Liu (Eds.). Association for Computational Linguistics, 4661–4675. https://doi.org/10.18653/v1/2020.findings-emnlp.418Google ScholarGoogle ScholarCross RefCross Ref
  130. Dagobert D Runes. 2001. The dictionary of philosophy. Citadel Press.Google ScholarGoogle Scholar
  131. Mobashir Sadat and Cornelia Caragea. 2022. SciNLI: A Corpus for Natural Language Inference on Scientific Text. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, Smaranda Muresan, Preslav Nakov, and Aline Villavicencio (Eds.). Association for Computational Linguistics, 7399–7409. https://doi.org/10.18653/v1/2022.acl-long.511Google ScholarGoogle ScholarCross RefCross Ref
  132. Marzieh Saeidi, Max Bartolo, Patrick S. H. Lewis, Sameer Singh, Tim Rocktäschel, Mike Sheldon, Guillaume Bouchard, and Sebastian Riedel. 2018. Interpretation of Natural Language Rules in Conversational Machine Reading. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun’ichi Tsujii (Eds.). Association for Computational Linguistics, 2087–2097. https://doi.org/10.18653/v1/d18-1233Google ScholarGoogle ScholarCross RefCross Ref
  133. Swarnadeep Saha, Sayan Ghosh, Shashank Srivastava, and Mohit Bansal. 2020. PRover: Proof Generation for Interpretable Reasoning over Rules. In EMNLP (1). Association for Computational Linguistics, 122–136.Google ScholarGoogle Scholar
  134. Swarnadeep Saha, Yixin Nie, and Mohit Bansal. 2020. ConjNLI: Natural Language Inference Over Conjunctive Sentences. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu (Eds.). Association for Computational Linguistics, 8240–8252. https://doi.org/10.18653/v1/2020.emnlp-main.661Google ScholarGoogle ScholarCross RefCross Ref
  135. Swarnadeep Saha, Prateek Yadav, and Mohit Bansal. 2021. multiPRover: Generating Multiple Proofs for Improved Interpretability in Rule Reasoning. In NAACL-HLT. Association for Computational Linguistics, 3662–3677.Google ScholarGoogle Scholar
  136. Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2020. WinoGrande: An Adversarial Winograd Schema Challenge at Scale. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020. AAAI Press, 8732–8740. https://ojs.aaai.org/index.php/AAAI/article/view/6399Google ScholarGoogle ScholarCross RefCross Ref
  137. Soumya Sanyal, Zeyi Liao, and Xiang Ren. 2022. RobustLR: Evaluating Robustness to Logical Perturbation in Deductive Reasoning. CoRR abs/2205.12598(2022). https://doi.org/10.48550/arXiv.2205.12598 arXiv:2205.12598Google ScholarGoogle ScholarCross RefCross Ref
  138. Soumya Sanyal, Harman Singh, and Xiang Ren. 2022. FaiRR: Faithful and Robust Deductive Reasoning over Natural Language. In ACL (1). Association for Computational Linguistics, 1075–1093.Google ScholarGoogle Scholar
  139. Soumya Sanyal, Yichong Xu, Shuohang Wang, Ziyi Yang, Reid Pryzant, Wenhao Yu, Chenguang Zhu, and Xiang Ren. 2022. APOLLO: A Simple Approach for Adaptive Pretraining of Language Models for Logical Reasoning. CoRR abs/2212.09282(2022). https://doi.org/10.48550/arXiv.2212.09282 arXiv:2212.09282Google ScholarGoogle ScholarCross RefCross Ref
  140. Maarten Sap, Ronan Le Bras, Emily Allaway, Chandra Bhagavatula, Nicholas Lourie, Hannah Rashkin, Brendan Roof, Noah A. Smith, and Yejin Choi. 2019. ATOMIC: An Atlas of Machine Commonsense for If-Then Reasoning. In AAAI. AAAI Press, 3027–3035.Google ScholarGoogle Scholar
  141. Maarten Sap, Saadia Gabriel, Lianhui Qin, Dan Jurafsky, Noah A. Smith, and Yejin Choi. 2020. Social Bias Frames: Reasoning about Social and Power Implications of Language. In ACL. Association for Computational Linguistics, 5477–5490.Google ScholarGoogle Scholar
  142. Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, and Yejin Choi. 2019. Social IQa: Commonsense Reasoning about Social Interactions. In EMNLP/IJCNLP (1). Association for Computational Linguistics, 4462–4472.Google ScholarGoogle Scholar
  143. Abulhair Saparov and He He. 2022. Language Models Are Greedy Reasoners: A Systematic Formal Analysis of Chain-of-Thought. CoRR abs/2210.01240(2022).Google ScholarGoogle Scholar
  144. Kumar Shridhar, Alessandro Stolfo, and Mrinmaya Sachan. 2022. Distilling Multi-Step Reasoning Capabilities of Large Language Models into Smaller Models via Semantic Decompositions. CoRR abs/2212.00193(2022). https://doi.org/10.48550/arXiv.2212.00193 arXiv:2212.00193Google ScholarGoogle ScholarCross RefCross Ref
  145. Koustuv Sinha, Shagun Sodhani, Jin Dong, Joelle Pineau, and William L. Hamilton. 2019. CLUTRR: A Diagnostic Benchmark for Inductive Reasoning from Text. In EMNLP/IJCNLP (1). Association for Computational Linguistics, 4505–4514.Google ScholarGoogle Scholar
  146. Robyn Speer, Joshua Chin, and Catherine Havasi. 2017. ConceptNet 5.5: An Open Multilingual Graph of General Knowledge. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA, Satinder Singh and Shaul Markovitch (Eds.). AAAI Press, 4444–4451. http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14972Google ScholarGoogle ScholarCross RefCross Ref
  147. Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, Agnieszka Kluska, Aitor Lewkowycz, Akshat Agarwal, Alethea Power, Alex Ray, Alex Warstadt, Alexander W. Kocurek, Ali Safaya, Ali Tazarv, Alice Xiang, Alicia Parrish, Allen Nie, Aman Hussain, Amanda Askell, Amanda Dsouza, Ameet Rahane, Anantharaman S. Iyer, Anders Andreassen, Andrea Santilli, Andreas Stuhlmüller, Andrew M. Dai, Andrew La, Andrew K. Lampinen, Andy Zou, Angela Jiang, Angelica Chen, Anh Vuong, Animesh Gupta, Anna Gottardi, Antonio Norelli, Anu Venkatesh, Arash Gholamidavoodi, Arfa Tabassum, Arul Menezes, Arun Kirubarajan, Asher Mullokandov, Ashish Sabharwal, Austin Herrick, Avia Efrat, Aykut Erdem, Ayla Karakas, and et al. 2022. Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models. CoRR abs/2206.04615(2022).Google ScholarGoogle Scholar
  148. Alane Suhr, Stephanie Zhou, Ally Zhang, Iris Zhang, Huajun Bai, and Yoav Artzi. 2019. A Corpus for Reasoning about Natural Language Grounded in Photographs. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, Anna Korhonen, David R. Traum, and Lluís Màrquez (Eds.). Association for Computational Linguistics, 6418–6428. https://doi.org/10.18653/v1/p19-1644Google ScholarGoogle ScholarCross RefCross Ref
  149. Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V. Le, Ed H. Chi, Denny Zhou, and Jason Wei. 2022. Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them. CoRR abs/2210.09261(2022).Google ScholarGoogle Scholar
  150. Oyvind Tafjord, Bhavana Dalvi, and Peter Clark. 2021. ProofWriter: Generating Implications, Proofs, and Abductive Statements over Natural Language. In ACL/IJCNLP (Findings)(Findings of ACL, Vol.  ACL/IJCNLP 2021). Association for Computational Linguistics, 3621–3634.Google ScholarGoogle ScholarCross RefCross Ref
  151. Oyvind Tafjord, Bhavana Dalvi Mishra, and Peter Clark. 2022. Entailer: Answering Questions with Faithful and Truthful Chains of Reasoning. CoRR abs/2210.12217(2022). https://doi.org/10.48550/arXiv.2210.12217 arXiv:2210.12217Google ScholarGoogle ScholarCross RefCross Ref
  152. Alon Talmor and Jonathan Berant. 2018. The Web as a Knowledge-Base for Answering Complex Questions. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), Marilyn A. Walker, Heng Ji, and Amanda Stent (Eds.). Association for Computational Linguistics, 641–651. https://doi.org/10.18653/V1/N18-1059Google ScholarGoogle ScholarCross RefCross Ref
  153. Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. 2019. CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), Jill Burstein, Christy Doran, and Thamar Solorio (Eds.). Association for Computational Linguistics, 4149–4158. https://doi.org/10.18653/v1/n19-1421Google ScholarGoogle ScholarCross RefCross Ref
  154. Alon Talmor, Oyvind Tafjord, Peter Clark, Yoav Goldberg, and Jonathan Berant. 2020. Leap-Of-Thought: Teaching Pre-Trained Models to Systematically Reason Over Implicit Knowledge. In NeurIPS.Google ScholarGoogle Scholar
  155. Alon Talmor, Ori Yoran, Ronan Le Bras, Chandra Bhagavatula, Yoav Goldberg, Yejin Choi, and Jonathan Berant. 2021. CommonsenseQA 2.0: Exposing the Limits of AI through Gamification. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021, virtual, Joaquin Vanschoren and Sai-Kit Yeung (Eds.). https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/hash/3ef815416f775098fe977004015c6193-Abstract-round1.htmlGoogle ScholarGoogle Scholar
  156. Alexandre Tamborrino, Nicola Pellicanò, Baptiste Pannier, Pascal Voitot, and Louise Naudin. 2020. Pre-training Is (Almost) All You Need: An Application to Commonsense Reasoning. In ACL. Association for Computational Linguistics, 3878–3887.Google ScholarGoogle Scholar
  157. Niket Tandon, Bhavana Dalvi, Joel Grus, Wen-tau Yih, Antoine Bosselut, and Peter Clark. 2018. Reasoning about Actions and State Changes by Injecting Commonsense Knowledge. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun’ichi Tsujii (Eds.). Association for Computational Linguistics, 57–66. https://doi.org/10.18653/v1/d18-1006Google ScholarGoogle ScholarCross RefCross Ref
  158. Niket Tandon, Bhavana Dalvi, Keisuke Sakaguchi, Peter Clark, and Antoine Bosselut. 2019. WIQA: A dataset for ”What if...” reasoning over procedural text. In EMNLP/IJCNLP (1). Association for Computational Linguistics, 6075–6084.Google ScholarGoogle Scholar
  159. Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. 2020. Is Multihop QA in DiRe Condition? Measuring and Reducing Disconnected Reasoning. In EMNLP (1). Association for Computational Linguistics, 8846–8863.Google ScholarGoogle Scholar
  160. Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. 2022. MuSiQue: Multihop Questions via Single-hop Question Composition. Trans. Assoc. Comput. Linguistics 10 (2022), 539–554. https://doi.org/10.1162/tacl_a_00475Google ScholarGoogle ScholarCross RefCross Ref
  161. Masatoshi Tsuchiya. 2018. Performance Impact Caused by Hidden Bias of Training Data for Recognizing Textual Entailment. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation, LREC 2018, Miyazaki, Japan, May 7-12, 2018, Nicoletta Calzolari, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Kôiti Hasida, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asunción Moreno, Jan Odijk, Stelios Piperidis, and Takenobu Tokunaga (Eds.). European Language Resources Association (ELRA). http://www.lrec-conf.org/proceedings/lrec2018/summaries/786.htmlGoogle ScholarGoogle Scholar
  162. Gladys Tyen, Hassan Mansoor, Peter Chen, Tony Mak, and Victor Carbune. 2023. LLMs cannot find reasoning errors, but can correct them!CoRR abs/2311.08516(2023). https://doi.org/10.48550/ARXIV.2311.08516 arXiv:2311.08516Google ScholarGoogle ScholarCross RefCross Ref
  163. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett (Eds.). 5998–6008. https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.htmlGoogle ScholarGoogle ScholarDigital LibraryDigital Library
  164. David Vilares and Carlos Gómez-Rodríguez. 2019. HEAD-QA: A Healthcare Dataset for Complex Reasoning. In ACL (1). Association for Computational Linguistics, 960–966.Google ScholarGoogle Scholar
  165. Douglas N Walton. 1990. What is reasoning? What is an argument?The journal of Philosophy 87, 8 (1990), 399–419.Google ScholarGoogle Scholar
  166. Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d’Alché-Buc, Emily B. Fox, and Roman Garnett (Eds.). 3261–3275. https://proceedings.neurips.cc/paper/2019/hash/4496bf24afe7fab6f046bf4923da8de6-Abstract.htmlGoogle ScholarGoogle Scholar
  167. Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net. https://openreview.net/forum?id=rJ4km2R5t7Google ScholarGoogle Scholar
  168. Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. Le, Ed H. Chi, and Denny Zhou. 2022. Self-Consistency Improves Chain of Thought Reasoning in Language Models. CoRR abs/2203.11171(2022).Google ScholarGoogle Scholar
  169. Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. 2022. Emergent Abilities of Large Language Models. CoRR abs/2206.07682(2022). https://doi.org/10.48550/arXiv.2206.07682 arXiv:2206.07682Google ScholarGoogle ScholarCross RefCross Ref
  170. Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed H. Chi, Quoc Le, and Denny Zhou. 2022. Chain of Thought Prompting Elicits Reasoning in Large Language Models. CoRR abs/2201.11903(2022).Google ScholarGoogle Scholar
  171. Johannes Welbl, Pontus Stenetorp, and Sebastian Riedel. 2018. Constructing Datasets for Multi-hop Reading Comprehension Across Documents. Trans. Assoc. Comput. Linguistics 6 (2018), 287–302.Google ScholarGoogle ScholarCross RefCross Ref
  172. Jason Weston, Antoine Bordes, Sumit Chopra, and Tomás Mikolov. 2016. Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks. In ICLR (Poster).Google ScholarGoogle Scholar
  173. Adina Williams, Nikita Nangia, and Samuel R. Bowman. 2018. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), Marilyn A. Walker, Heng Ji, and Amanda Stent (Eds.). Association for Computational Linguistics, 1112–1122. https://doi.org/10.18653/v1/n18-1101Google ScholarGoogle ScholarCross RefCross Ref
  174. Tomer Wolfson, Mor Geva, Ankit Gupta, Yoav Goldberg, Matt Gardner, Daniel Deutch, and Jonathan Berant. 2020. Break It Down: A Question Understanding Benchmark. Trans. Assoc. Comput. Linguistics 8 (2020), 183–198. https://doi.org/10.1162/TACL_A_00309Google ScholarGoogle ScholarCross RefCross Ref
      • Published in

        ACM Computing Surveys, Just Accepted
        ISSN: 0360-0300
        EISSN: 1557-7341

        Copyright © 2024 held by the owner/author(s).

        Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Online AM: 9 May 2024
        • Accepted: 26 April 2024
        • Revised: 9 March 2024
        • Received: 6 May 2023

        Qualifiers

        • survey