← Back to papers

Paper deep dive

Abductive Reasoning with Syllogistic Forms in Large Language Models

Hirohiko Abe, Risako Ando, Takanobu Morishita, Kentaro Ozeki, Koji Mineshima, Mitsuhiro Okada

Year: 2026 · Venue: arXiv preprint · Area: cs.CL · Type: Preprint · Embeddings: 43

Abstract

Research in AI using Large Language Models (LLMs) is rapidly evolving, and the comparison of their performance with human reasoning has become a key concern. Prior studies have indicated that LLMs and humans share similar biases, such as dismissing logically valid inferences that contradict common beliefs. However, criticizing LLMs for these biases might be unfair, considering our reasoning not only involves formal deduction but also abduction, which draws tentative conclusions from limited information. Abduction can be regarded as the inverse form of syllogism in its basic structure, that is, a process of drawing a minor premise from a major premise and conclusion. This paper explores the accuracy of LLMs in abductive reasoning by converting a syllogistic dataset into one suitable for abduction. It aims to investigate whether state-of-the-art LLMs exhibit biases in abduction and to identify potential areas for improvement, emphasizing the importance of contextualized reasoning beyond formal deduction. This investigation is vital for advancing the understanding and application of LLMs in complex reasoning tasks, offering insights into bridging the gap between machine and human cognition.

Tags

ai-safety (imported, 100%) · cscl (suggested, 92%) · preprint (suggested, 88%)

Links

PDF not stored locally. Use the link above to view on the source site.

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 95%

Last extracted: 3/13/2026, 12:19:27 AM

Summary

This paper investigates the abductive reasoning capabilities of Large Language Models (LLMs) by converting syllogistic datasets into abductive inference tasks. The study compares LLM performance on abduction versus deduction, finding that LLMs generally perform worse on abductive tasks and exhibit human-like belief biases in both reasoning forms, suggesting a gap between machine and human cognition in complex reasoning.

Entities (7)

Abductive Reasoning · reasoning-method · 100%
Charles Sanders Peirce · person · 100%
Deductive Reasoning · reasoning-method · 100%
GPT-4 · model · 100%
Large Language Models · technology · 100%
Llama 3 · model · 100%
NeuBAROCO · dataset · 90%

Relation Signals (3)

Large Language Models exhibits Belief Bias

confidence 95% · we found that LLMs exhibit human-like belief biases in both abduction and deduction.

Abductive Reasoning inverse of Syllogism

confidence 90% · Abduction can be regarded as the inverse form of syllogism in its basic structure

Large Language Models performs poorly on Abductive Reasoning

confidence 90% · We revealed that LLMs generally performed more poorly on abductive tasks compared to deductive tasks.

Cypher Suggestions (3)

Map the relationship between models and observed biases · confidence 95% · unvalidated

MATCH (m:Model)-[:EXHIBITS]->(b:Bias) RETURN m.name, b.name

Find all models evaluated in the study · confidence 90% · unvalidated

MATCH (m:Model)-[:EVALUATED_IN]->(p:Paper {id: 'df6dcdf1-ff2b-4efb-abbd-8f0613821280'}) RETURN m.name

Identify reasoning methods and their properties · confidence 80% · unvalidated

MATCH (r:ReasoningMethod) RETURN r.name, r.description

Full Text

42,545 characters extracted from source content.


Abductive Reasoning with Syllogistic Forms in Large Language Models

Hirohiko Abe (1), Risako Ando (1), Takanobu Morishita (1), Kentaro Ozeki (1,2), Koji Mineshima (1), and Mitsuhiro Okada (1)

(1) Keio University, Tokyo, Japan
(2) The University of Tokyo, Tokyo, Japan

hirohiko-abe@keio.jp, risakochaan@keio.jp, kentaro.ozeki@gmail.com, morishita@keio.jp, {minesima,mitsu}@abelard.flet.keio.ac.jp

Abstract. Research in AI using Large Language Models (LLMs) is rapidly evolving, and the comparison of their performance with human reasoning has become a key concern. Prior studies have indicated that LLMs and humans share similar biases, such as dismissing logically valid inferences that contradict common beliefs. However, criticizing LLMs for these biases might be unfair, considering our reasoning not only involves formal deduction but also abduction, which draws tentative conclusions from limited information. Abduction can be regarded as the inverse form of syllogism in its basic structure, that is, a process of drawing a minor premise from a major premise and conclusion. This paper explores the accuracy of LLMs in abductive reasoning by converting a syllogistic dataset into one suitable for abduction. It aims to investigate whether state-of-the-art LLMs exhibit biases in abduction and to identify potential areas for improvement, emphasizing the importance of contextualized reasoning beyond formal deduction. This investigation is vital for advancing the understanding and application of LLMs in complex reasoning tasks, offering insights into bridging the gap between machine and human cognition.

Keywords: Abduction · Deduction · Syllogism · Reasoning bias · Large Language Models

1 Introduction

Research in Large Language Models (LLMs) is rapidly advancing, with a significant focus on comparing their performance to human reasoning.
Previous studies have shown that while LLMs generally excel at reasoning tasks [6, 32, 21], they exhibit similar biases to humans, such as dismissing logically valid inferences that contradict common beliefs [8, 3]. Although these studies often emphasize deductive reasoning, our everyday reasoning encompasses more than just deduction. Given that LLMs are developed by learning natural language used in daily contexts without specialized logical training, it would be unreasonable to criticize them for bias tendencies in deduction tasks. Since our reasoning involves not only formal deduction but also abduction, which draws hypotheses from limited information, it is crucial to investigate LLMs' capabilities in making abductive inferences.

Abduction is a natural form of reasoning that seeks reasons and explanations. For example, when discussing the reason for the delay of a train by asking "Why was the train late?", it is natural to trace back from the observed fact to the reason explaining it, such as "because the traffic lights failed." However, it is rare to ask in the form of a deduction, as in "The traffic lights failed, therefore the train was late." The explanatory aspect of abduction, as well as the logical consistency in deductive reasoning, is important in investigating natural explanations and realizing an explainable AI (XAI) that naturally answers why-questions [23]. Investigating how accurately current LLMs can perform abduction provides a theoretical basis for research into XAI.

Abduction is also important in knowledge acquisition. Charles Sanders Peirce [18] regarded abduction as a process of inquiry along with deduction and induction. Abduction plays a more important role than deduction, especially when it comes to the discovery of the unknown. Evaluating LLMs' abductive reasoning abilities is essential in determining whether LLMs can gain new knowledge, particularly from limited information.
Inquiry, or the activity of acquiring knowledge, is considered one of the chief topics in recent epistemology, and norms of inquiry have been studied [19, 20, 15]. Given that abduction is related to inquiry in Peirce's [18] philosophy, it can be considered a form of logic that guides inquiry, for example, by providing hypotheses and guidance on what to investigate. Assessing LLMs' abductive reasoning ability is important when addressing the question of whether LLMs can be used to guide our everyday inquiry.

In this paper, we introduce a dataset to test the abductive reasoning abilities of LLMs, compare LLMs' accuracy on abductive reasoning tasks with deductive reasoning tasks, and explore whether LLMs show human-like belief biases in reasoning. (The dataset is available at https://github.com/kmineshima/abduction-syllogism-llm.) By LLMs, we focus on in-context learning pretrained models such as GPT [25, 24] and Llama [2], rather than those requiring fine-tuning such as BERT [9]. These in-context learning models adapt to a specific task using a task description or a few examples of correct answers as input, called a prompt, without changing the models' parameters. We revealed that LLMs generally performed more poorly on abductive tasks compared to deductive tasks. In addition, we found that LLMs exhibit human-like belief biases in both abduction and deduction.

With regard to the comparison between deduction and abduction, it has been pointed out that in diagnostic inference, the inference from effect to cause, there is reason to believe that the deductive model is a more natural reasoning scheme for humans than the abductive model, under a probabilistic setting [30, 29]. Although this paper shares interests with these trends in the comparison between deduction and abduction, our dataset specifically focuses on abductions
that can be generated within the framework of syllogisms, especially by swapping premises and conclusions of deductive syllogisms. This approach enables a systematic comparison of abduction and deduction, thereby providing a foundation for exploring more complex forms of abductive reasoning, including causal and practical variations.

2 Background

2.1 Abductive Reasoning

Our everyday reasoning involves not only deduction but also abduction, that is, the type of reasoning that hypothetically derives new information from limited information. Abduction is believed to be ubiquitous in our ordinary life. For example, abductive reasoning is considered to be operative in cognitive processes and testimonial trust. In addition, abduction is regarded as a cornerstone of scientific methodology [11, 12].

In general, abduction is a form of reasoning that leads to a hypothesis explaining an observed fact. Abduction is considered to be of two types: hypothesis selection and hypothesis generation. The former refers to the selection of the best explanation from among several hypotheses; it is also known as Inference to the Best Explanation (IBE). The latter involves generating a hypothesis that explains the observed fact from given observations.

Charles Sanders Peirce first introduced abduction, distinguishing it from deduction and induction. Peirce [18] understood it as "the process of forming an explanatory hypothesis" and "the only logical operation which introduces any new idea" (CP 5.171). According to Peirce, abduction is ampliative in that it adds new information beyond the premises, while deduction is not.

Peirce initially organized abduction in a syllogistic framework [4]. According to Peirce, abduction is made by changing the minor premise and the conclusion in a deductively valid syllogism. The following is an example of a deductively valid syllogism, which is called a first figure syllogism.
Major premise: All A are B
Minor premise: C is A
Conclusion: C is B

By changing the minor premise and the conclusion, we obtain the following form of abduction.

Major premise: All A are B
Conclusion: C is B
Minor premise: C is A

Note that this form of inference is the so-called Affirming the Consequent, a typical instance of a deductively invalid inference. As a formal characterization of abduction, Peirce [18] says, "The surprising fact, C, is observed. But if A were true, C would be a matter of course. Hence, there is reason to suspect that A is true" (CP 5.189). In this paper, we call the first premise of an abduction the Rule, the second premise the Observation, and the conclusion the Hypothesis. The following is a concrete example of abduction based on this terminology.

Rule: All things that were in the bag are white.
Observation: These balls are white.
Hypothesis: These balls were in the bag.

In the context of AI and Natural Language Processing, there have been studies on evaluating machine learning (deep learning) models using abductive reasoning. Among others, Bhagavatula et al. [5] focus on the form of abduction that selects the most plausible hypothesis that explains given observations. This form of abduction can be subsumed under the IBE type of abduction as described above. In this paper, based on Peirce's account, we instead focus on the form of abduction that is converted from syllogism, namely, one that derives a minor premise from the major premise and conclusion of a deductively valid syllogism.

2.2 Deductive Reasoning Abilities of LLMs

With the rapid progress in research on LLMs, the importance of assessing their reasoning ability has increased, and these abilities have been researched using a variety of tasks [31]. Among these, much research comparing LLMs with humans has been conducted on deductive syllogistic reasoning, which has been studied in cognitive psychology [22, 7, 16].
Datasets of syllogistic reasoning tasks for LLMs have recently been introduced by Dong et al. [10], Gubelmann et al. [17], Aghahadi et al. [1], and Ando et al. [3]. However, they focus only on deduction and do not deal with other kinds of inferences, including abduction.

Recent research on LLMs' reasoning abilities with syllogisms showed that while LLMs generally perform well on syllogisms, they tend to exhibit some human-like biases [14, 28]. More specifically, Dasgupta et al. [8] found that LLMs reason more accurately about believable or realistic situations in reasoning tasks including syllogisms and Wason's selection task. In addition, they revealed that LLMs tend to judge inferences with believable content as valid and those with sentences that clash with our commonsense beliefs as invalid, regardless of the forms of the inferences, thus failing to separate forms from contents (the content effects). Ando et al. [3] introduced a syllogism dataset called NeuBAROCO, where syllogisms are presented in both English and Japanese. They showed that LLMs exhibit reasoning biases known from the psychological studies of syllogisms, including belief biases, conversion errors, and atmosphere effects. Ozeki et al. [26] extended the NeuBAROCO dataset and conducted a more detailed evaluation of a wide range of models by implementing various reasoning tasks, including those that require translating syllogisms into logical formulas and explaining the reasoning steps.

Based on these previous findings, this paper compares the reasoning abilities of LLMs in deduction and abduction and explores whether LLMs show human-like belief biases.

Table 1. Eight patterns of abduction. R: Rule, O: Observation, H: Hypothesis. Those in yellow are correct abductions, while those in grey are incorrect. Correct abductions are those in which H explains why O is the case, given the rule R.
R: All C are B | O: These A are B | H: These A are C (correct)
R: All B are C | O: These A are B | H: None (incorrect)
R: All C are B | O: These A are not B | H: None (incorrect)
R: All B are C | O: These A are not B | H: These A are not C (correct)
R: No C are B | O: These A are B | H: None (incorrect)
R: No B are C | O: These A are B | H: None (incorrect)
R: No C are B | O: These A are not B | H: These A are C (correct)
R: No B are C | O: These A are not B | H: These A are C (correct)

Table 2. Examples of abductive syllogisms labeled as Consistent, Inconsistent, and Neutral. The numbers in brackets show the number of each type.

Consistent (66)
  Rule: All people that had a fun time are smiling.
  Observation: These people are smiling.
  Hypothesis: These people had a fun time.

Inconsistent (66)
  Rule: All things that are made in the sweet restaurant are spicy.
  Observation: These cakes are made in the sweet restaurant.
  Hypothesis: These cakes are spicy.

Neutral (84)
  Rule: All things that were in the bag are white.
  Observation: These balls are white.
  Hypothesis: These balls were in the bag.

3 Datasets

We formulated abduction as an inference from Rule and Observation to Hypothesis, as shown in Section 2.1. Rule consists of a sentence of the form All A are B (Universal Affirmative) or No A are B (Universal Negative), while Observation and Hypothesis consist of a sentence of the form These A are B (Particular Affirmative) or These A are not B (Particular Negative). We classified eight patterns of abductive inference, which are shown in Table 1.

One of the essential features of abduction is that it is ampliative. Abduction contributes to the acquisition of new knowledge by drawing conclusions whose content goes beyond what is contained in the premises. In this respect, abduction differs from deduction. The patterns of abduction identified above fulfil this characteristic. On the other hand, the greyed-out patterns in Table 1 do not, and they are deductively valid.
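The instantiation step described in Section 3 (each inference pattern of Table 1 filled with (subject, observational predicate, non-observational predicate) triples, giving 8 × 27 = 216 problems in the paper) can be sketched roughly as follows. The template strings, helper names, and the single example triple are illustrative assumptions, not the authors' code, and only two of the eight patterns are shown:

```python
# Sketch of the dataset construction: instantiate abduction patterns with
# (subject, observational predicate, non-observational predicate) triples.
# Templates and names are illustrative; the paper uses 8 patterns and
# 27 triples, yielding 216 problems.

def pattern_correct(A, B, C):
    # R: All C are B / O: These A are B / H: These A are C (correct)
    return (f"All things that {C} {B}.", f"These {A} {B}.", f"These {A} {C}.")

def pattern_neither(A, B, C):
    # R: All B are C / O: These A are B / H: None (answer: "Neither")
    return (f"All things that {B} {C}.", f"These {A} {B}.", None)

PATTERNS = [pattern_correct, pattern_neither]   # 8 patterns in the paper
TRIPLES = [("balls", "are white", "were in the bag")]  # 27 in the paper

problems = []
for make in PATTERNS:
    for (A, B, C) in TRIPLES:
        rule, observation, hypothesis = make(A, B, C)
        problems.append({
            "rule": rule,
            "observation": observation,
            "hypothesis": hypothesis,
            "answer": "Positive" if hypothesis else "Neither",
        })

print(problems[0]["rule"])  # All things that were in the bag are white.
```

The first pattern reproduces the Neutral example of Table 2 exactly; the grid of all patterns over all triples yields the full problem set.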
Restricting the attention to syllogistic forms helps us to define what counts as correct patterns of deductive and abductive reasoning.

To construct a set of abductive inferences having these patterns, we first created a triple (A, B, C) of terms, where A is a subject term, B is an observational predicate, and C is a non-observational predicate. Observational predicates are those that can be verified through direct observation, while non-observational predicates are those that cannot. For example, are white is an observational predicate, while were in the bag is a non-observational predicate. In this study, we manually created 27 triples like (balls, are white, were in the bag).

By instantiating the inference patterns shown in Table 1 with these terms, we obtained 216 problems of abductive inference in total, with 108 correct patterns and 108 incorrect patterns. To annotate information about belief biases, we classified each problem by three labels: consistent, inconsistent, or neutral. The problem is consistent if Rule is considered to be true as inferred from our common-sense beliefs; it is inconsistent if Rule contradicts our common-sense beliefs; if neither holds, it is neutral. Table 2 shows some examples of each type.

In the same way, we constructed 216 patterns of corresponding deductive inferences, with 108 valid patterns and 108 invalid patterns. We obtained instances of deduction by changing Observation and Hypothesis in abduction. For example, from the first example of abduction in Table 2, we obtained an instance of valid deduction: All people that had a fun time are smiling. These people had a fun time. Therefore, These people are smiling. Table 3 shows all eight patterns of deduction in comparison with the corresponding abductions in Table 1.

Table 3. Eight patterns of deduction. P1: Major Premise, P2: Minor Premise, C: Conclusion. Those in yellow are valid deductions, while those in grey are invalid.
P1: All C are B | P2: These A are B | C: None (invalid)
P1: All B are C | P2: These A are B | C: These A are C (valid)
P1: All C are B | P2: These A are not B | C: These A are not C (valid)
P1: All B are C | P2: These A are not B | C: None (invalid)
P1: No C are B | P2: These A are B | C: These A are not C (valid)
P1: No B are C | P2: These A are B | C: These A are not C (valid)
P1: No C are B | P2: These A are not B | C: None (invalid)
P1: No B are C | P2: These A are not B | C: None (invalid)

4 Experiments

4.1 Experimental Settings and Evaluated Models

We conducted experiments on two tasks, the Abduction task and the Deduction task, using the dataset created by the method described in Section 3. All experiments were conducted in English. In each iteration, a single problem was provided as input along with a prompt, and the model's output was collected as an answer. The performance of the LLMs was evaluated using overall accuracy and accuracy for each problem type, providing a basic assessment of their capabilities.

In our experiments, we evaluated four state-of-the-art models with varying parameter sizes: GPT-3.5 [25], GPT-4 [24], Llama-3-8B (8 billion parameters), and Llama-3-70B (70 billion parameters) [2]. The GPT models are closed-source, and their specific details, including the exact number of parameters, are not publicly disclosed (footnote 4). For hyperparameters of the models, we set the maximum output token length to 10 to prevent redundant responses, while keeping other hyperparameters at their default values. We employed in-context learning with prompts and did not perform any fine-tuning on the models.

4.2 Tasks

We compare two tasks in our experiments, the Abduction task and the Deduction task. Table 4 shows example prompts for each task. The Abduction task provides sentences for Rule and Observation and asks the model to choose the most plausible hypothesis. Given a hypothesis H, the answer is selected from three options: H, the negation of H, and "Neither is a good explanation," as shown in Table 4.
In a similar way, the Deduction task provides two premise sentences and three options for the conclusion.

We conducted experiments on the Abduction and Deduction tasks in both zero-shot and few-shot settings [6]. For the few-shot prompts, we included eight examples using the same set of terms, corresponding to the eight patterns of abduction shown in Table 1, which were inserted between the task description and the problem. Details of the few-shot prompts can be found in Table 8 and Table 9 in the Appendix. We have also tested alternative prompts, which are shown in Table 7 in the Appendix. However, since there was no performance improvement compared to the prompts listed in Table 4, they were not adopted.

Table 4. Example prompts for the Abduction task and Deduction task.

Input (Abduction Task):
Based on Rule and Observation, from a logical perspective, select the most reasonable hypothesis that explains why Observation holds true. Choose one from the following options (1-3) and answer with the corresponding number. Note that there is a logical relationship between the Rule, Observation, and Hypothesis, where the Observation is logically derived from the Rule and Hypothesis.
Rule: All things that were in the bag are white.
Observation: These balls are white.
Hypothesis:
1. These balls were in the bag.
2. These balls were not in the bag.
3. Neither is a good explanation.
The answer is:

Input (Deduction Task):
Select a sentence that serves as a conclusion based on the following two premises. Choose one from the following options (1-3) and answer with the corresponding number.
P1: All things that were in the bag are white.
P2: These balls are white.
1. These balls were in the bag.
2. These balls were not in the bag.
3. Neither.
The answer is:

Footnote 4: The versions of the GPT models used are gpt-3.5-turbo-0125 and gpt-4-0613, accessed via OpenAI's API.

Table 5. Accuracy (%) on the Abduction task (n = 216).
Condition   Model        Overall  Positive  Negative  Neither  Consistent  Inconsistent  Neutral
Zero-Shot   GPT-3.5        31.02     48.15    100.00     0.93       31.82         27.27    33.33
Zero-Shot   GPT-4          41.67     80.25     92.59     0.00       46.97         34.85    42.86
Zero-Shot   Llama-3-8B     37.50     61.73     66.67    12.04       42.42         27.27    41.67
Zero-Shot   Llama-3-70B    37.04     64.20    100.00     0.93       45.45         31.82    34.52
Few-Shot    GPT-3.5        29.63     44.44     96.30     1.85       33.33         22.73    32.14
Few-Shot    GPT-4          28.70     65.43     22.22     2.78       31.82         19.70    33.33
Few-Shot    Llama-3-8B     28.70     41.98    100.00     0.93       33.33         27.27    26.19
Few-Shot    Llama-3-70B    75.46     90.12     81.48    62.96       74.24         72.73    78.57

Table 6. Accuracy (%) on the Deduction task (n = 216).

Condition   Model        Overall  Positive  Negative  Neither  Consistent  Inconsistent  Neutral
Zero-Shot   GPT-3.5        33.80    100.00     54.32     1.85       39.39         28.79    33.33
Zero-Shot   GPT-4          72.22    100.00    100.00    44.44       74.24         68.18    73.81
Zero-Shot   Llama-3-8B     43.52    100.00     40.74    31.48       37.88         37.88    52.38
Zero-Shot   Llama-3-70B    53.24    100.00     80.25    21.30       60.61         40.91    57.14
Few-Shot    GPT-3.5        46.30     85.19     91.36     2.78       48.48         42.42    47.62
Few-Shot    GPT-4          95.83    100.00     96.30    94.44      100.00         92.42    95.24
Few-Shot    Llama-3-8B     49.54    100.00     98.77     0.00       50.00         50.00    48.81
Few-Shot    Llama-3-70B    84.72     92.59     80.25    86.11       90.91         72.73    89.29

4.3 Results

Tables 5 and 6 show the Abduction and Deduction task results. The columns labeled Positive, Negative, and Neither correspond to instances where the correct answer is the hypothesis H (positive form), the negation of H, and "Neither is a good explanation," respectively.

For the Abduction task in the zero-shot setting, the overall accuracy of the highest-performing model (GPT-4) was around 42%, which was slightly above the chance level. While the model achieved over 80% accuracy on problems with correct answers labeled as Positive or Negative, it performed poorly on problems where the correct answer was Neither. With regard to the content types, the accuracy of Inconsistent problems was around 10% lower than the other two types (Consistent and Neutral). This suggests that belief biases are also reproduced in abduction tasks.
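The per-category breakdown reported in Tables 5 and 6 (overall accuracy plus accuracy by answer type and by content type) can be sketched as follows; the record fields and the toy records are illustrative assumptions, not the paper's evaluation code:

```python
from collections import defaultdict

# Sketch of the evaluation behind Tables 5 and 6: overall accuracy plus
# accuracy grouped by answer type (Positive/Negative/Neither) or content
# type (Consistent/Inconsistent/Neutral). Field names are assumptions.

def accuracy(records):
    hits = sum(r["pred"] == r["gold"] for r in records)
    return 100.0 * hits / len(records) if records else 0.0

def accuracy_by(records, key):
    groups = defaultdict(list)
    for r in records:
        groups[r[key]].append(r)
    return {k: accuracy(v) for k, v in groups.items()}

# Toy predictions, for illustration only.
records = [
    {"gold": "Positive", "pred": "Positive", "content": "Consistent"},
    {"gold": "Neither",  "pred": "Negative", "content": "Neutral"},
    {"gold": "Negative", "pred": "Negative", "content": "Inconsistent"},
    {"gold": "Neither",  "pred": "Neither",  "content": "Neutral"},
]
print(accuracy(records))               # 75.0
print(accuracy_by(records, "gold"))    # accuracy per answer type
print(accuracy_by(records, "content")) # accuracy per content type
```

Grouping by `"gold"` reproduces the Positive/Negative/Neither columns, and grouping by `"content"` the Consistent/Inconsistent/Neutral columns.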
In the few-shot setting, Llama-3-70B was the only model that showed a significant performance improvement, achieving approximately 63% accuracy on the problems whose answer type was Neither and around 75% overall accuracy. The overall accuracy for GPT-3.5 showed a slight improvement over the zero-shot setting, while for GPT-4, the overall accuracy was lower than in the zero-shot setting. The score for problems where the correct answer was Neither slightly increased for both GPT models.

For the Deduction task, the few-shot setting improved performance across all models compared to the zero-shot setting, with gains ranging from 6.02 to 31.48 points in the overall accuracy. GPT-4 was the best-performing model in both settings, achieving an overall accuracy of 72.22% in the zero-shot setting and 95.83% in the few-shot setting. However, except for Llama-3-8B, accuracy remained lower for the problems labeled Inconsistent compared to those labeled Consistent and Neutral.

4.4 Discussion

Are the models abductive reasoners? The results on deduction tasks generally show similar tendencies to the previous findings [8, 3, 13]. That is, LLMs' performance was quite low on the problems whose correct answer was Neither and on the problems whose sentences contradict common-sense beliefs. The exception is Llama-3-70B in the few-shot setting; still, its accuracy for abduction (75.46%) was lower than that for deduction (84.72%). It was anticipated that the results for the abduction task would be better than those for the deduction task, as abduction is more akin to everyday human reasoning, whereas deduction requires more reflective reasoning. However, the results surprisingly showed that the accuracy on abduction tasks was low overall. In particular, for the problems where the correct answer was Neither, LLMs often failed to solve them at all.
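The quoted range of few-shot gains on the Deduction task can be checked directly against the Overall column of Table 6:

```python
# Check of the few-shot gains on the Deduction task, using the Overall
# accuracies from Table 6.
zero_shot = {"GPT-3.5": 33.80, "GPT-4": 72.22,
             "Llama-3-8B": 43.52, "Llama-3-70B": 53.24}
few_shot  = {"GPT-3.5": 46.30, "GPT-4": 95.83,
             "Llama-3-8B": 49.54, "Llama-3-70B": 84.72}

gains = {m: round(few_shot[m] - zero_shot[m], 2) for m in zero_shot}
print(gains)
# {'GPT-3.5': 12.5, 'GPT-4': 23.61, 'Llama-3-8B': 6.02, 'Llama-3-70B': 31.48}
print(min(gains.values()), max(gains.values()))  # 6.02 31.48
```

The smallest gain (Llama-3-8B, 6.02 points) and the largest (Llama-3-70B, 31.48 points) match the range stated in the text.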
Given that abduction has been studied less than deduction, there is a possibility that while deduction cases are included more in the LLMs' training data, abduction problems are fewer. Also, considering that sentences that perform hypothesis selection and hypothesis generation are expected to appear in natural texts, it is possible that there is difficulty in applying abduction to the syllogistic form. The observation of human-like belief biases in abduction is consistent with Pereira et al. [27], which reports that belief bias is observed in abductions.

Why do the models tend to mistakenly choose Negative? In the Abduction task, for problems where Neither was the correct answer (108 problems), the distribution of GPT-4's predictions was as follows: 32 problems were answered as Positive, 76 as Negative, and 0 as Neither. In contrast, in the Deduction task, the distribution was as follows: 26 problems were answered as Positive, 34 as Negative, and 48 as Neither. Thus, in the Abduction task, there was a more pronounced tendency to incorrectly answer Neither problems and choose Negative as the correct answer.

To analyze this tendency in more detail, we calculated the rate at which a Negative was selected as the hypothesis when "No" or "not" appears in the Rule or Observation. In the case of GPT-4, this rate was 67.90% for the Abduction task (with the actual rate of the correct answer being Negative at 16.67%), while it was 70.99% for the Deduction task (with the actual rate of the correct answer being Negative at 50%). Thus, in both tasks, Negative sentences were selected at a higher rate than the actual rate of correct answers being Negative, with this tendency being more pronounced in the Abduction task. This tendency may be due to an effect similar to the atmosphere effects [7], where the presence of negation in the Rule or Observation leads to the selection of a hypothesis that also contains negation.
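The negation-rate analysis described above (how often the Negative option is chosen when "No" or "not" appears in the Rule or Observation, versus how often Negative is actually correct) can be sketched as follows; the field names and toy records are illustrative assumptions:

```python
# Sketch of the negation-preference analysis: among problems whose Rule or
# Observation contains "No" or "not", compare the rate at which the model
# chose the Negative option with the rate at which Negative was correct.

def contains_negation(sentence):
    words = sentence.replace(".", " ").replace(",", " ").split()
    return "No" in words or "not" in words

def negative_rates(records):
    subset = [r for r in records
              if contains_negation(r["rule"]) or contains_negation(r["observation"])]
    if not subset:
        return 0.0, 0.0
    chosen = 100.0 * sum(r["pred"] == "Negative" for r in subset) / len(subset)
    correct = 100.0 * sum(r["gold"] == "Negative" for r in subset) / len(subset)
    return chosen, correct

# Toy records, for illustration only.
records = [
    {"rule": "No C are B.", "observation": "These A are B.",
     "pred": "Negative", "gold": "Negative"},
    {"rule": "All B are C.", "observation": "These A are not B.",
     "pred": "Negative", "gold": "Neither"},
    {"rule": "All C are B.", "observation": "These A are B.",
     "pred": "Positive", "gold": "Positive"},
]
chosen, correct = negative_rates(records)
print(chosen, correct)  # 100.0 50.0
```

On the paper's data, this comparison yields 67.90% chosen versus 16.67% correct for GPT-4 on the Abduction task, the gap the text attributes to an atmosphere-like effect.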
Do the models answer the problems as deduction? We examined the scores of abduction problems when labeled as if they were deduction problems. For example, an affirmation of the antecedent (e.g., All B are C, A is B) has no correct hypothesis for abduction, and therefore the correct answer is Neither. However, if it is conceived as deduction, it logically entails A is C and is labeled as Positive. When comparing the correct labels for deduction to GPT-4's predictions, the agreement rate was 51.85% for Overall, 100.0% for Positive, 93.83% for Negative, and 8.33% for Neither. In Overall, this was about 10 points higher than the accuracy for the original abduction labels. For example, among inferences of the form "affirmation of the antecedent," all inferences of the form All B are C, A is B led to the selection of the correct deductive answer, and 89% of inferences of the form No B are C, A is B also led to the selection of the correct deductive answer. This suggests that LLMs are influenced by deduction when solving abduction problems. However, in general, the agreement rate does not reach the level of accuracy in the Deduction task (falling short by about 20 points for Overall and by about 35 points for Neither), suggesting that LLMs are not completely mistaking abduction problems for deduction problems.

Does the word "Hypothesis" mislead by suggesting an entailment relationship? In Natural Language Inference (NLI) tasks [33], the sentence entailed by the premise is usually called the "Hypothesis." Given this fact, we investigated the possibility that the word "Hypothesis" itself does not function as a hypothesis to explain Observation, but instead suggests an entailment or deductive relation. However, substituting the term "Hypothesis" with "Reason" had little effect on improving the score.

Do the models choose contradictory answers? To specify error tendencies, we investigated whether LLMs choose answers regardless of logical consistency.
We checked whether the LLMs choose answers that contradict the given Rule or Observation, but few such cases were observed.

5 Conclusion and Future Work

In this paper, we created a dataset to test the abductive reasoning abilities of LLMs and compared LLMs' accuracy on abductive reasoning tasks with deductive reasoning tasks. The results showed that LLMs performed worse in abduction than in deduction. In addition, human-like belief biases are observed in abduction as well as in deduction.

Abduction is considered to be a more ordinary inference than deduction is. Therefore, it is expected that humans would perform better on abduction than on deduction. This expected tendency is different from the tendency of LLMs, as shown in this paper. Comparisons between LLMs and humans on the abductive reasoning tasks, as well as further investigations of the error tendencies or biases in abductive reasoning through these comparisons, are topics for future work.

We adopted Peirce's initial characterization of abduction in the syllogistic framework and focused on hypothesis generation tasks. Although three options, Positive, Negative, and Neither, are included in each problem, the options other than the correct answer do not serve as hypotheses from the logical perspective, so the task can be seen as a simpler version of hypothesis generation, rather than a task of choosing the best one from multiple hypotheses. However, other characterizations are also possible. Abduction can be understood as Inference to the Best Explanation. Tasks such as selecting the best hypothesis that explains the premises from multiple candidates that are already logical explanations are expected in future work. Also, although we characterized abduction within the syllogistic framework, it can be understood as probable reasoning. Comparison of humans and LLMs by a probabilistic (Bayesian) approach to abduction is an area for future work.
Furthermore, evaluating more complex types of reasoning, such as extended syllogisms and conditionals, is also left for future work.

Acknowledgements

We thank the anonymous reviewers for their helpful comments and suggestions, which have improved the paper. This work is partially supported by JST CREST Grant Number JPMJCR2114, JST BOOST Japan Grant Number JPMJBS2409, the KGRI Challenge Grant from the Keio University Global Research Institute, and JSPS Kakenhi Grant Numbers JP24K00004, JP21K00016, JP21H00467, JP23K20416, and JP21K18339.

References

1. Aghahadi, Z., Talebpour, A.: Avicenna: a challenge dataset for natural language generation toward commonsense syllogistic reasoning. Journal of Applied Non-Classical Logics 32(1), 55–71 (2022). https://doi.org/10.1080/11663081.2022.2041352
2. AI@Meta: Llama 3 Model Card (2024)
3. Ando, R., Morishita, T., Abe, H., Mineshima, K., Okada, M.: Evaluating large language models with NeuBAROCO: Syllogistic reasoning ability and human-like biases. In: Proceedings of the 4th Natural Logic Meets Machine Learning Workshop, p. 1–11 (2023)
4. Bellucci, F., Pietarinen, A.V.: Peirce's Abduction, p. 1–14. Springer International Publishing, Cham (2022). https://doi.org/10.1007/978-3-031-10135-9_7
5. Bhagavatula, C., Bras, R.L., Malaviya, C., Sakaguchi, K., Holtzman, A., Rashkin, H., Downey, D., Yih, W.t., Choi, Y.: Abductive commonsense reasoning. In: International Conference on Learning Representations (2020)
6. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)
7. Chater, N., Oaksford, M.: The probability heuristics model of syllogistic reasoning. Cognitive Psychology 38(2), 191–258 (1999). https://doi.org/10.1006/cogp.1998.0696
8.
Dasgupta, I., Lampinen, A.K., Chan, S.C.Y., Sheahan, H.R., Creswell, A., Kumaran, D., McClelland, J.L., Hill, F.: Language models show human-like content effects on reasoning tasks. arXiv preprint arXiv:2207.07051 (2023). https://doi.org/10.48550/arXiv.2207.07051
9. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, p. 4171–4186 (2019)
10. Dong, T., Li, C., Bauckhage, C., Li, J., Wrobel, S., Cremers, A.B.: Learning syllogism with Euler neural-networks. arXiv preprint arXiv:2007.07320 (2020). https://doi.org/10.48550/arXiv.2007.07320
11. Douven, I.: Abduction. In: Zalta, E.N. (ed.) The Stanford Encyclopedia of Philosophy. Metaphysics Research Lab, Stanford University, Summer 2021 edn. (2021)
12. Douven, I.: The Art of Abduction. The MIT Press (2022). https://doi.org/10.7551/mitpress/14179.001.0001
13. Eisape, T., Tessler, M., Dasgupta, I., Sha, F., van Steenkiste, S., Linzen, T.: A systematic comparison of syllogistic reasoning in humans and language models. arXiv preprint arXiv:2311.00445 (2024). https://doi.org/10.48550/arXiv.2311.00445
14. Evans, J.S.T.: Bias in Human Reasoning: Causes and Consequences. Lawrence Erlbaum Associates, Inc (1989)
15. Friedman, J.: Zetetic epistemology. In: Reed, B., Flowerree, A.K. (eds.) Towards an Expansive Epistemology: Norms, Action, and the Social Sphere. Routledge (forthcoming)
16. Geurts, B.: Reasoning with quantifiers. Cognition 86(3), 223–251 (2003). https://doi.org/10.1016/S0010-0277(02)00180-4
17. Gubelmann, R., Niklaus, C., Handschuh, S.: A philosophically-informed contribution to the generalization problem of neural natural language inference: Shallow heuristics, bias, and the varieties of inference. In: Proceedings of the 3rd Natural Logic Meets Machine Learning Workshop (NALOMA I), p. 38–50 (2022)
18.
Hartshorne, C., Weiss, P., Burks, A.W. (eds.): Collected Papers of Charles Sanders Peirce. Harvard University Press, Cambridge, Massachusetts (1931–1958). Volumes 1–6 edited by Charles Hartshorne and Paul Weiss, 1931–1935; volumes 7–8 edited by Arthur W. Burks, 1958
19. Hookway, C.: Epistemology and inquiry: The primacy of practice. In: Hetherington, S. (ed.) Epistemology Futures, p. 95–110. Oxford University Press (2006). https://doi.org/10.1093/oso/9780199273317.003.0006
20. Hookway, C.: Questions, epistemology, and inquiries. Grazer Philosophische Studien 77(1), 1–21 (2008). https://doi.org/10.1163/18756735-90000841
21. Kojima, T., Gu, S.S., Reid, M., Matsuo, Y., Iwasawa, Y.: Large language models are zero-shot reasoners. Advances in Neural Information Processing Systems 35, 22199–22213 (2022)
22. Manktelow, K.: Reasoning and Thinking. Psychology Press (1999)
23. Medianovskyi, K., Pietarinen, A.: On explainable AI and abductive inference. Philosophies 7(2), 35 (2022). https://doi.org/10.3390/philosophies7020035
24. OpenAI: GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023). https://doi.org/10.48550/arXiv.2303.08774
25. Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al.: Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35, 27730–27744 (2022)
26. Ozeki, K., Ando, R., Morishita, T., Abe, H., Mineshima, K., Okada, M.: Exploring reasoning biases in large language models through syllogism: Insights from the NeuBAROCO dataset. In: Findings of the Association for Computational Linguistics: ACL 2024 (2024)
27. Pereira, L.M., Dietz, E.A., Hölldobler, S.: Contextual abductive reasoning with side-effects. Theory and Practice of Logic Programming 14(4-5), 633–648 (2014). https://doi.org/10.1017/S1471068414000258
28.
Pohl, R.F.: Cognitive Illusions: A Handbook on Fallacies and Biases in Thinking, Judgement and Memory. Routledge, 3 edn. (2022). https://doi.org/10.4324/9780203720615
29. Stilgenbauer, J.L., Baratgin, J.: Assessing the accuracy of diagnostic probability estimation: Evidence for defeasible modus ponens. International Journal of Approximate Reasoning 105, 229–240 (2019). https://doi.org/10.1016/j.ijar.2018.11.015
30. Stilgenbauer, J.L., Baratgin, J., Douven, I.: Reasoning strategies for diagnostic probability estimates in causal contexts: Preference for defeasible deduction over abduction. In: Proceedings of the 4th International Workshop on Defeasible and Ampliative Reasoning (DARe-17) (2017)
31. Wang, A., Pruksachatkun, Y., Nangia, N., Singh, A., Michael, J., Hill, F., Levy, O., Bowman, S.: SuperGLUE: A stickier benchmark for general-purpose language understanding systems. Advances in Neural Information Processing Systems 32 (2019)
32. Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q.V., Zhou, D., et al.: Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35, 24824–24837 (2022)
33. Williams, A., Nangia, N., Bowman, S.: A broad-coverage challenge corpus for sentence understanding through inference. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), p. 1112–1122 (2018). https://doi.org/10.18653/v1/N18-1101

A Details of Prompts

Table 7 shows examples of prompts we tested that scored lower than the prompts we finally adopted. Tables 8 and 9 show the prompts with eight exemplars (8-shot prompts) used in the few-shot setting.

Table 7. Examples of alternative prompts not adopted.

Input (Abduction Task)
Suppose the following Observation is logically derived from Rule and Hypothesis.
Choose the most appropriate sentence for the Hypothesis from the following options (1-3) and answer with the corresponding number.
Rule: All things that were in the bag are white.
Hypothesis: ???
Observation: These balls are white.
1. These balls were in the bag.
2. These balls were not in the bag.
3. Neither.
The answer is:

Input (Abduction Task)
You are an inquirer. You know that the following Rule holds true in the world. Additionally, you have recently confirmed that the following Observation is also true. Given this information, you want to discover the mechanism behind why these hold true. Based on the Rule and Observation below, select the most plausible hypothesis from a logical perspective that explains why the Observation is valid. Please respond with the corresponding number from the numbers 1-3.
Rule: All things that were in the bag are white.
Observation: These balls are white.
Hypothesis:
1. These balls were in the bag.
2. These balls were not in the bag.
3. Neither is a good explanation.
The answer is:

Table 8. An example few-shot prompt for the Abduction task.

Input (Abduction Task)
Based on Rule and Observation, from a logical perspective, select the most reasonable hypothesis that explains why Observation holds true. Choose one from the following options (1-3) and answer with the corresponding number. Note that there is a logical relationship between the Rule, Observation, and Hypothesis, where the Observation is logically derived from the Rule and Hypothesis.
Rule: All things that are sold at the shop are waterproof.
Observation: These shoes are waterproof.
Hypothesis:
1. These shoes are sold at the shop.
2. These shoes are not sold at the shop.
3. Neither is a good explanation.
The answer is: 1

Rule: All things that are waterproof are sold at the shop.
Observation: These shoes are waterproof.
Hypothesis:
1. These shoes are sold at the shop.
2. These shoes are not sold at the shop.
3.
Neither is a good explanation.
The answer is: 3

Rule: All things that are sold at the shop are waterproof.
Observation: These shoes are not waterproof.
Hypothesis:
1. These shoes are sold at the shop.
2. These shoes are not sold at the shop.
3. Neither is a good explanation.
The answer is: 3

Rule: All things that are waterproof are sold at the shop.
Observation: These shoes are not waterproof.
Hypothesis:
1. These shoes are sold at the shop.
2. These shoes are not sold at the shop.
3. Neither is a good explanation.
The answer is: 2

Rule: No things that are sold at the shop are waterproof.
Observation: These shoes are waterproof.
Hypothesis:
1. These shoes are sold at the shop.
2. These shoes are not sold at the shop.
3. Neither is a good explanation.
The answer is: 3

Rule: No things that are waterproof are sold at the shop.
Observation: These shoes are waterproof.
Hypothesis:
1. These shoes are sold at the shop.
2. These shoes are not sold at the shop.
3. Neither is a good explanation.
The answer is: 3

Rule: No things that are sold at the shop are waterproof.
Observation: These shoes are not waterproof.
Hypothesis:
1. These shoes are sold at the shop.
2. These shoes are not sold at the shop.
3. Neither is a good explanation.
The answer is: 1

Rule: No things that are waterproof are sold at the shop.
Observation: These shoes are not waterproof.
Hypothesis:
1. These shoes are sold at the shop.
2. These shoes are not sold at the shop.
3. Neither is a good explanation.
The answer is: 1

Rule: All things that were in the bag are white.
Observation: These balls are white.
Hypothesis:
1. These balls were in the bag.
2. These balls were not in the bag.
3. Neither is a good explanation.
The answer is:

Table 9. An example few-shot prompt for the Deduction task.

Input (Deduction Task)
Select a sentence that serves as a conclusion based on the following two premises. Choose one from the following options (1-3) and answer with the corresponding number.
P1: All things that are sold at the shop are waterproof.
P2: These shoes are waterproof.
1. These shoes are sold at the shop.
2. These shoes are not sold at the shop.
3. Neither.
The answer is: 3

P1: All things that are waterproof are sold at the shop.
P2: These shoes are waterproof.
1. These shoes are sold at the shop.
2. These shoes are not sold at the shop.
3. Neither.
The answer is: 1

P1: All things that are sold at the shop are waterproof.
P2: These shoes are not waterproof.
1. These shoes are sold at the shop.
2. These shoes are not sold at the shop.
3. Neither.
The answer is: 2

P1: All things that are waterproof are sold at the shop.
P2: These shoes are not waterproof.
1. These shoes are sold at the shop.
2. These shoes are not sold at the shop.
3. Neither.
The answer is: 3

P1: No things that are sold at the shop are waterproof.
P2: These shoes are waterproof.
1. These shoes are sold at the shop.
2. These shoes are not sold at the shop.
3. Neither.
The answer is: 2

P1: No things that are waterproof are sold at the shop.
P2: These shoes are waterproof.
1. These shoes are sold at the shop.
2. These shoes are not sold at the shop.
3. Neither.
The answer is: 2

P1: No things that are sold at the shop are waterproof.
P2: These shoes are not waterproof.
1. These shoes are sold at the shop.
2. These shoes are not sold at the shop.
3. Neither.
The answer is: 3

P1: No things that are waterproof are sold at the shop.
P2: These shoes are not waterproof.
1. These shoes are sold at the shop.
2. These shoes are not sold at the shop.
3. Neither.
The answer is: 3

P1: All things that were in the bag are white.
P2: These balls are white.
1. These balls were in the bag.
2. These balls were not in the bag.
3. Neither.
The answer is:
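The gold labels behind the exemplars in Tables 8 and 9 follow a mechanical pattern: abduction asks which hypothesis about the subject property, together with the Rule, entails the Observation, while deduction asks which conclusion the Rule and the second premise entail. The sketch below is our own reconstruction from the exemplars above; the (quant, direction) encoding and the function names are illustrative, not the paper's.

```python
# Our reconstruction of the gold-label pattern in Tables 8 and 9.
# Rules relate a subject property S ("sold at the shop") and W ("waterproof"):
#   quant: "all" or "no"; direction: "S->W" ("All/No S are W") or "W->S".
# Answers: 1 = "is S" (Positive), 2 = "is not S" (Negative), 3 = Neither.

def deduction_label(quant: str, direction: str, obs_is_w: bool) -> int:
    """Conclusion about S entailed by the rule plus the observation about W."""
    if quant == "all":
        if direction == "W->S":          # All W are S + x is W  |-  x is S
            return 1 if obs_is_w else 3
        return 2 if not obs_is_w else 3  # All S are W + x not W |- x not S
    # "No S are W" is logically equivalent to "No W are S"
    return 2 if obs_is_w else 3          # x is W |- x is not S; else nothing

def abduction_label(quant: str, direction: str, obs_is_w: bool) -> int:
    """Hypothesis about S that, together with the rule, entails the observation."""
    if quant == "all":
        if direction == "S->W":          # x is S + All S are W  |-  x is W
            return 1 if obs_is_w else 3
        return 2 if not obs_is_w else 3  # x not S + All W are S |- x not W
    return 1 if not obs_is_w else 3      # x is S + No S are W   |-  x not W
```

Note how the two tasks diverge: the same rule and observation pair can be Positive under one reading and Neither under the other, which is exactly the relabelling exploited in the "Do the models answer the problems as deduction?" analysis.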