Paper deep dive
Parallelograms Strike Back: LLMs Generate Better Analogies than People
Qiawen Ella Liu, Raja Marjieh, Jian-Qiao Zhu, Adele E. Goldberg, Thomas L. Griffiths
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 93%
Last extracted: 3/22/2026, 6:12:13 AM
Summary
This paper investigates whether the 'parallelogram model' of word analogies (A:B::C:D) is inherently flawed or if humans simply struggle to generate relation-preserving analogies. By comparing human and LLM performance on analogy tasks, the authors find that LLMs generate higher-quality analogies that align more closely with the parallelogram structure in embedding space. The study concludes that the parallelogram model remains a valid account of what constitutes a 'good' analogy, but humans often fail to satisfy this constraint, whereas LLMs do so more consistently.
Entities (5)
Relation Signals (3)
Parallelogram model → predicts → Analogy Quality
confidence 95% · greater parallelogram alignment and lower word frequency continue to predict which LLM completions are rated higher
LLM → outperforms → Human
confidence 90% · LLM-generated analogies are reliably judged as better than human-generated ones
LLM → uses → GloVe
confidence 90% · LLM analogies are better captured by both parallelogram and local similarity measures in GloVe
Cypher Suggestions (2)
Find all models compared in the study · confidence 90% · unvalidated
MATCH (m:Model)-[:EVALUATED_ON]->(d:Dataset) RETURN m.name, d.name
Identify relations between cognitive models and their performance metrics · confidence 85% · unvalidated
MATCH (c:CognitiveModel)-[:PREDICTS]->(m:Metric) RETURN c.name, m.name
Abstract
Abstract: Four-term word analogies (A:B::C:D) are classically modeled geometrically as “parallelograms,” yet recent work suggests this model poorly captures how humans produce analogies, with simple local-similarity heuristics often providing a better account (Peterson et al., 2020). But does the parallelogram model fail because it is a bad model of analogical relations, or because people are not very good at generating relation-preserving analogies? We compared human and large language model (LLM) analogy completions on the same set of analogy problems from Peterson et al. (2020). We find that LLM-generated analogies are reliably judged as better than human-generated ones, and are also more closely aligned with the parallelogram structure in a distributional embedding space (GloVe). Crucially, we show that the improvement over human analogies was driven by greater parallelogram alignment and reduced reliance on accessible words rather than enhanced sensitivity to local similarity. Moreover, the LLM advantage is driven not by uniformly superior responses by LLMs, but by humans producing a long tail of weak completions: when only modal (most frequent) responses by both systems are compared, the LLM advantage disappears. However, greater parallelogram alignment and lower word frequency continue to predict which LLM completions are rated higher than those of humans. Overall, these results suggest that the parallelogram model is not a poor account of word analogy. Rather, humans may often fail to produce completions that satisfy this relational constraint, whereas LLMs do so more consistently.
Tags
Links
- Source: https://arxiv.org/abs/2603.19066v1
- Canonical: https://arxiv.org/abs/2603.19066v1
Full Text
39,134 characters extracted from source content.
Parallelograms Strike Back: LLMs Generate Better Analogies than People
Qiawen Ella Liu1, Raja Marjieh1, Jian-Qiao Zhu1,2, Adele E. Goldberg1, & Thomas L. Griffiths1
1Princeton University, 2The University of Hong Kong
Abstract
Four-term word analogies (A:B::C:D) are classically modeled geometrically as “parallelograms,” yet recent work suggests this model poorly captures how humans produce analogies, with simple local-similarity heuristics often providing a better account [15]. But does the parallelogram model fail because it is a bad model of analogical relations, or because people are not very good at generating relation-preserving analogies? We compared human and large language model (LLM) analogy completions on the same set of analogy problems from [15]. We find that LLM-generated analogies are reliably judged as better than human-generated ones, and are also more closely aligned with the parallelogram structure in a distributional embedding space (GloVe). Crucially, we show that the improvement over human analogies was driven by greater parallelogram alignment and reduced reliance on accessible words rather than enhanced sensitivity to local similarity. Moreover, the LLM advantage is driven not by uniformly superior responses by LLMs, but by humans producing a long tail of weak completions: when only modal (most frequent) responses by both systems are compared, the LLM advantage disappears. However, greater parallelogram alignment and lower word frequency continue to predict which LLM completions are rated higher than those of humans. Overall, these results suggest that the parallelogram model is not a poor account of word analogy. Rather, humans may often fail to produce completions that satisfy this relational constraint, whereas LLMs do so more consistently.
Keywords: Analogy, LLM, Parallelogram model
Introduction
A simple but canonical paradigm for studying analogy is the four-term word analogy (e.g., king is to queen as man is to woman, abbreviated to king:queen::man:woman). A classic theory of this type of analogy is the parallelogram model, which posits that concepts exist as points in a geometric mental space where relations are represented as vectors [19]: to find the word that completes the analogy A:B::C:?, one simply applies the vector difference between A and B to the third term, C. However, recent behavioral evidence has called the parallelogram model into question. [15] found that for human-generated word analogies the parallelogram model was outperformed by simple local-similarity heuristics that did not take relational similarity into account at all (e.g., choosing a word highly similar to the C term). These findings have been taken to suggest that geometric models may fail to capture how people represent and reason about relations. An alternative possibility, however, is that people are simply not very good at producing analogies, as generating a precise analogy may be cognitively demanding. Time pressure, knowledge constraints, and retrieval failures may push people toward more accessible but less relationally aligned responses, even if they are capable of recognizing better analogies when presented with them. The emergence of large language models (LLMs) offers an opportunity to revisit the parallelogram model. Despite being highly opaque systems with complex attention mechanisms and inaccessible latent representations, LLMs have shown a surprising capacity to simulate human semantic judgments [16] and demonstrate emergent relational reasoning abilities, manifested in tasks like in-context learning [1].
Here, we ask whether LLM-generated analogies are perceived to be higher quality than human-generated analogies, and whether any such advantage is captured by simple cognitive models such as the parallelogram model or local similarity heuristics. To address these questions, we prompt six state-of-the-art LLMs with a large set of word analogy problems previously given to humans [15] and collect human judgments of how well LLM and human completions preserve the intended relation. We find that people reliably judge LLM analogies to be better than human-generated analogies, and that LLM analogies are better captured by both parallelogram and local similarity measures. We also show that the advantage of LLMs over humans is predicted by greater parallelogram alignment in an external distributional embedding space and by the use of less accessible words. However, this advantage is driven not by LLMs producing uniformly superior responses, but by humans generating a long tail of low-quality completions: when the comparison is restricted to only the most frequent (modal) responses, the LLM advantage disappears, even though completions where LLMs are more parallelogram-aligned and use less frequent words than humans continue to receive higher ratings. These results suggest that while parallelogram models may poorly describe how humans generate analogies, they nonetheless capture what people value in good analogies, and that LLMs are better at producing responses that satisfy this pattern, even though this pattern need not be reflected in the models’ own internal representations.
Background
Table 1: Semantic relation categories we included from SemEval-2012 Task 2 [9].
Category | Subtype 1 | Subtype 2
Class-Inclusion | Taxonomic (flower:tulip) | Class:Individual (queen:Elizabeth)
Part-Whole | Object:Component (car:engine) | Collection:Member (forest:tree)
Similar | Synonymy (car:auto) | Dimensional Similarity (simmer:boil)
Contrast | Contrary (old:young) | Reverse (attack:defend)
Attribute | Item:Attribute (beggar:poor) | Object:State (beggar:poverty)
Non-Attribute | Item:Nonattribute (harmony:discordant) | Object:Nonstate (laureate:dishonor)
Case Relations | Agent:Instrument (farmer:tractor) | Action:Object (plow:earth)
Cause-Purpose | Cause:Effect (joke:laughter) | Cause:Compensatory Action (hunger:eat)
Space-Time | Location:Item (arsenal:weapon) | Time:Associated Item (retirement:pension)
Reference | Sign:Significant (siren:danger) | Representation (portrait:person)
Geometric models of analogy. Analogy is a central aspect of human creativity and intelligence [3, 6, 7], making investigation of how analogies are formed and why some analogies are better than others a central issue in cognitive science. Early work by [19] proposed that concepts can be represented as points in a multidimensional space and that relations correspond to difference vectors between those points. Using low-dimensional representations derived from multidimensional scaling, they showed that solving an analogy A:B::C:? (e.g., rat:pig::goat:?) could be modeled as completing a parallelogram to find a solution at C + B − A. The advent of modern language models and word embeddings renewed interest in the parallelogram model of analogies. Methods such as word2vec [12] and GloVe [14] learn vector representations of words from large text corpora, and early demonstrations showed that linear vector arithmetic could recover canonical analogies (e.g., king − man + woman ≈ queen).
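The parallelogram completion rule (predict D near v(B) − v(A) + v(C), scoring by cosine similarity) can be sketched in a few lines. The toy embedding table below is a hypothetical stand-in for the 300-dimensional GloVe vectors the paper uses, chosen so the arithmetic is easy to verify by hand:

```python
import math

# Hypothetical toy embeddings standing in for pretrained GloVe vectors.
emb = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.5, 0.9, 0.2],
    "woman": [0.5, 0.2, 0.9],
    "apple": [0.1, 0.5, 0.5],
}

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def parallelogram_complete(a, b, c, vocab):
    """Rank candidate D terms by cosine similarity to v(B) - v(A) + v(C)."""
    target = [vb - va + vc for va, vb, vc in zip(emb[a], emb[b], emb[c])]
    candidates = [w for w in vocab if w not in (a, b, c)]
    return sorted(candidates, key=lambda w: cosine(emb[w], target), reverse=True)

print(parallelogram_complete("king", "queen", "man", emb))  # "woman" ranks first
```

In the paper the ranking runs over a vocabulary of roughly 2.2 million GloVe words; here the vocabulary is just the keys of the toy table.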
Similar geometric reasoning has also been explored in computer vision, where latent representations support vector-based transformations between visual concepts (e.g., a generative model can derive a “smiling man” by computing the vector smiling woman − neutral woman + neutral man) [18, 17]. In addition to the parallelogram rule, other work has proposed simpler heuristics relying on local similarities to capture how people complete four-term analogies, implemented as either ranking candidates by their similarity to C alone (the C:D similarity rule) or using the relative similarity of A to B versus C to decide whether to retrieve near B or near C (the nearest-neighbor (N) heuristic) [20]. [15] evaluated the parallelogram model against both the C:D similarity and N heuristics and found that while the parallelogram rule tended to capture the top human responses well in their dataset, similarity-based heuristics provided a better account of the full distribution of responses. Analogies by large language models. A growing body of work suggests that LLMs can solve a range of analogy tasks, from four-term word analogies to matrix reasoning and letter-string analogies, and can support plausible open-ended analogical reasoning when prompted to do so [23, 8, 24]. More broadly, LLMs are good relational reasoners: given a few examples, LLMs can infer the latent generalization linking inputs to outputs and apply it to novel cases, a capacity known as in-context learning (ICL) [1]. Recent work has grounded these capabilities in geometric theories, demonstrating that task-specific relational information can be extracted as “function vectors” from a model’s internal activations on a few-shot prompt [22, 5].
Taken together, these findings suggest that LLMs may implement geometric strategies during relational reasoning that are reminiscent of [19], but prior work has not systematically evaluated how well LLMs complete word analogies as compared to humans, nor how well LLM completions are predicted by the parallelogram model or local similarity heuristics. Here, we provide such a comparison and analysis.
Do LLM Analogies Differ from Human Analogies?
Figure 1: Human–LLM similarity varies across semantic relations. Center: Similarity between human and LLM analogy completions across 20 semantic relation types from SemEval-2012. Darker colors indicate greater similarity between human and LLM responses. Insets: example analogies provided by humans and LLMs. Bar lengths show the proportion of responses; colors distinguish models and humans.
We prompted six state-of-the-art LLMs to complete word analogies (e.g., answer:riddle::solution:?), including two closed-source models, gpt-5-mini and o4-mini, and four open-source models: DeepSeek-V3.1-671B, Qwen3-235B-A22B-Instruct-2507, Qwen3-32B, and Gemma-3-27B. To minimize artifacts from specific prompt wording and obtain more robust estimates of model behavior, we used four semantically equivalent prompt phrasings for each analogy: (1) “Complete this analogy: A is to B as C is to ___”; (2) “What word completes this analogy? A:B::C:?”; (3) “Analogy: A:B::C:? Answer:”; (4) “Fill in the blank: A is to B as C is to _____”. Each was repeated 10 times with temperature = 1.0, yielding 40 responses per model per analogy. We tested LLMs on all 846 word analogy items from [15], sampled from three sources: (1) 80 analogies from [4]; (2) 178 analogies from [10]; and (3) 588 analogies annotated with semantic relation categories (10 relations, each with two subtypes) from [9]. In total, this procedure yielded 203,040 LLM responses, comprising 5,943 distinct completions of the form A:B::C:D.
The original dataset in [15] includes 26,265 human responses, corresponding to 9,136 distinct completions. To analyze the degree to which LLM responses resemble human responses, we represented words using 300-dimensional GloVe embeddings trained on the 840B-token Common Crawl corpus [14] and compared where their respective responses fall in semantic space. For each analogy stem, we computed a frequency-weighted centroid of responses in embedding space. Let r index unique responses with embeddings v_r and relative frequencies f_r. The centroid is c = Σ_r f_r v_r. We computed separate centroids for humans and each model and measured their cosine similarity. Higher similarity indicates that humans and LLMs generated semantically similar responses, even if the exact words differed (e.g., cat vs. cats). Across all 846 analogy stems from the three datasets, the two closed-source OpenAI models showed the highest overall convergence with human completions: GPT-5-mini (mean human–model centroid similarity = .825) and o4-mini (mean = .824). Among open-source models, DeepSeek-V3.1-671B showed the strongest alignment with humans (mean = .817), followed by Qwen3-235B (mean = .798), Qwen3-32B (mean = .797), and Gemma-3-27B (mean = .781). We then zoomed in on the SemEval-2012 Task 2 analogies ([9]; see Table 1), the only dataset that includes semantic relation labels. Though models differed in how much they converged with humans, they largely agreed on where they tend to converge or diverge (Figure 1): the average pairwise correlation between models’ convergence with humans across relation types was r = .93, indicating that relations with high human–LLM convergence for one model tend to show high convergence for others as well. For example, Time:Associated Item showed strong human–LLM divergence across the board, while relations like Contrary tended to elicit more consistent completions (see Figure 1 for examples).
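The centroid comparison can be sketched as a short helper. The 2-D embeddings and response lists below are hypothetical; the paper uses 300-dimensional GloVe vectors and the full human and LLM response distributions:

```python
import math
from collections import Counter

def centroid(responses, emb):
    """Frequency-weighted centroid c = sum_r f_r * v_r over unique responses."""
    counts = Counter(responses)
    total = sum(counts.values())
    dim = len(next(iter(emb.values())))
    c = [0.0] * dim
    for word, n in counts.items():
        for i, x in enumerate(emb[word]):
            c[i] += (n / total) * x
    return c

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

# Hypothetical responses to one analogy stem.
emb = {"cat": [1.0, 0.0], "cats": [0.9, 0.1], "dog": [0.0, 1.0]}
human_responses = ["cat", "cat", "dog"]
llm_responses = ["cats", "cats", "cats"]
sim = cosine(centroid(human_responses, emb), centroid(llm_responses, emb))
```

A high similarity here means the two systems produced semantically similar completions even when the surface forms differ (cat vs. cats).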
Are LLM Analogies Better than Human Analogies?
Figure 2: Relational similarity ratings for LLM versus human analogy completions. (a) Mean rating differences (LLM − Human) by relation type for six LLMs. Dotted lines show average effects across all relations. *** p < .001, ** p < .01, * p < .05, n.s. = not significant.
Participants. We recruited 390 participants via Prolific (8–10 raters per analogy). Participants were adult native English speakers from the United States, and all provided informed consent prior to participation in accordance with an approved protocol. The study was preregistered on AsPredicted (link).
Stimuli. We constructed 4,048 four-term analogies (A:B::C:D) using A:B::C stems from the SemEval-2012 Task 2 dataset [9] and D terms drawn from human responses in [15] and the LLM-generated completions to the same set of questions obtained above. For both humans and LLMs, we excluded responses that were provided by only one participant or model, a standard practice in response-generation tasks to control the quality of responses [13]. Stimuli were randomly distributed across 42 lists of about 100 analogies each.
Procedure. Participants rated how similar the relationship between C and D was to the relationship between A and B on a 7-point scale (1 = extremely different, 7 = extremely similar). Instructions included examples of similar relationships (kitten:cat::chick:chicken) and dissimilar relationships (chick:chicken::hen:rooster). Each list included 5 attention-check trials consisting of obviously dissimilar comparisons expected to receive ratings ≤ 4 (e.g., dog:puppy::happy:sad). The 13% of participants who failed more than half of these attention-check items were excluded from analyses.
Results. To compare the responses of LLMs and humans, we calculated the mean of the relational-similarity ratings for each analogy stem (A:B::C).
Specifically, we weighted the rating of each completion D by how frequently it was produced, equivalent to averaging ratings as if a completion produced 10 times was rated 10 times. We computed these means for humans and each LLM, then compared humans to each LLM using paired t-tests across all stems. Figure 2 shows the mean rating differences (LLM − Human) for each model, both overall and broken down by the 20 semantic relations. Most LLMs’ analogies were rated as statistically significantly better (i.e., received a higher similarity rating between A:B and C:D): GPT-5-mini showed the largest difference from humans (b = .18, 95% CI = [0.14, 0.22], p < .001), followed by o4-mini (b = .17, 95% CI = [0.13, 0.21], p < .001), Qwen3-235B (b = .15, 95% CI = [0.11, 0.19], p < .001), DeepSeek-V3.1-671B (b = .13, 95% CI = [0.08, 0.20], p < .001), and Gemma-3-27B (b = .08, 95% CI = [0.03, 0.13], p < .001). The only model whose completions were not rated as statistically significantly better than humans’ was Qwen3-32B (b = .006, 95% CI = [−.03, 0.05], p = .7). The magnitude of the LLM advantage varied considerably across relation types. For example, for Object:State relations (e.g., novice:inexperience::child:?), LLMs tended to generate more relation-preserving state terms (e.g., immaturity), whereas humans produced broader, though still plausible, responses that are less tightly aligned with the A:B relation (e.g., youth). In contrast, some categories showed much narrower gaps between human and LLM analogies. In some cases, this likely reflects limited room for improvement because humans and LLMs already produced highly converging completions (e.g., Contrary analogies like happy:sad::black:white have 100% convergence between humans and LLMs). Some other relations, such as Representation, displayed substantially greater variability.
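The frequency weighting and the paired comparison can be sketched as two small helpers. The completion counts and ratings below are hypothetical, and `paired_t` is a plain-Python stand-in for a standard paired t-test (real analyses would use a statistics library that also reports the p-value):

```python
import math

def weighted_mean_rating(completions):
    """Mean rating for one analogy stem, weighting each completion D by how
    often it was produced. `completions` maps D -> (count, mean_rating)."""
    total = sum(n for n, _ in completions.values())
    return sum(n * r for n, r in completions.values()) / total

def paired_t(x, y):
    """Paired t statistic across stems (x, y = per-stem means for two systems)."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(var / n)

# Hypothetical ratings for one stem, novice:inexperience::child:?
human = {"youth": (8, 4.5), "kid": (2, 3.0)}
llm = {"immaturity": (9, 5.5), "youth": (1, 4.5)}
```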
For example, in diary:person::crown:?, LLMs overwhelmingly produced the more precise term monarch, whereas humans more often responded with king or queen. In diary:person::biography:?, however, LLMs exclusively responded with subject, which was rated as worse than humans’ modal response (author). This might be because LLMs and humans interpret the relation diary:person in qualitatively different ways: while humans seemed to interpret it in terms of authorship (who writes the diary), LLMs seemed to interpret it as representation (what a diary is about).
Why are LLM Analogies Better?
To understand why LLM-generated analogies receive higher ratings from humans than human-generated analogies, we revisited the word embedding analysis of [15]. Specifically, we ask: (1) How well do parallelogram vs. local similarity heuristics capture the responses produced by humans versus LLMs? (2) What predicts the higher ratings of LLM analogies? (3) Does increased parallelogram alignment reflect how LLMs internally generate analogies, or does it simply reflect what humans find to be good analogies?
How well do parallelogram vs. local similarities predict human vs. LLM completions?
Figure 3: Performance of humans and six LLMs on three GloVe-based analogy metrics. Bar charts show mean rank with 95% confidence intervals (lower is better). Insets display cumulative proportion of responses retrieved as a function of rank percentage (log scale).
To quantify how well different geometric rules capture the distributions of human vs. LLM responses, we adopted the cumulative proportion retrieved (CPR) analysis from [15], using the same pretrained GloVe embeddings [14]. We chose GloVe because it explicitly factorizes a word–context co-occurrence matrix, providing a statistically grounded embedding space where geometric properties have distributional interpretations [11, 2].
For each analogy stem A:B::C:?, each rule induced a ranking over the full vocabulary V (|V| = 2,196,015). We tested three rules:
1. Parallelogram model. Predict the completion (D term) by applying to C the same offset that maps A to B, i.e., d̂ = v(B) − v(A) + v(C). Then rank all candidates w ∈ V by cosine similarity cos(v(w), d̂).
2. C:D similarity. Ignore the A:B relation and rank candidates only by their similarity to C, i.e., by cos(v(w), v(C)).
3. Nearest-neighbor (N) heuristic. First determine whether A is closer to B or to C. Set the target T = C if cos(v(A), v(B)) > cos(v(A), v(C)), and T = B otherwise. Then rank by cosine similarity to T, i.e., cos(v(w), v(T)).
We then measured, at a rank percentile threshold τ, the proportion of observed responses whose predicted rank falls within the top τ% of candidates; e.g., CPR(1%) = 0.76 means 76% of responses ranked in the top 1% of the rule’s predictions. Higher CPR at lower τ indicates better prediction.
All three rules captured LLM responses better than human responses. On average, LLM responses were substantially better captured by all three rules (ps < .001; see Figure 3). The cumulative retrieval distributions (Figure 3, insets) further illustrate this systematic advantage, with all LLM curves positioned consistently above the human baseline across the entire rank spectrum. For example, at the 0.1% rank threshold, all three rules retrieved a larger proportion of LLM than human responses: the parallelogram model (86% vs. 78%), C:D similarity (89% vs. 84%), and the N heuristic (85% vs. 80%).
The parallelogram model underperforms the other rules overall, but the gap is smaller for LLMs. Consistent with the findings of [15] on human data, we found the parallelogram model was relatively good at predicting the top LLM responses, while C:D similarity and N did a better job of capturing the entire distribution of LLM responses.
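A minimal sketch of the CPR computation, assuming each observed response has already been assigned a predicted rank under one of the rules (the example ranks are hypothetical; the vocabulary size matches the paper's):

```python
def cpr(ranks, vocab_size, tau):
    """Cumulative proportion retrieved: the share of observed responses whose
    predicted rank falls within the top `tau` fraction of the vocabulary."""
    cutoff = tau * vocab_size
    return sum(1 for r in ranks if r <= cutoff) / len(ranks)

# Hypothetical predicted ranks for ten observed completions under one rule.
ranks = [3, 12, 57, 190, 802, 1, 44, 6, 23, 5000]
proportion = cpr(ranks, vocab_size=2_196_015, tau=0.001)  # CPR at the 0.1% threshold
```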
At the same time, the performance gap between the parallelogram and similarity-based rules was notably smaller for LLMs than for humans (Figure 3). Parallelogram mean ranks were on average 13,355 positions higher (worse) than C:D similarity for LLMs versus 27,264 for humans, and 7,799 positions higher than N for LLMs versus 17,465 for humans. In both cases, the gap between the parallelogram model and the local similarity heuristic was more than twice as large for humans.
What predicts the higher ratings of LLM analogies?
To understand what predicts LLMs’ advantage in similarity ratings, we considered four non-mutually exclusive possibilities: LLMs may receive higher ratings because their responses are (1) more parallelogram-like in embedding space, (2) more strongly associated with the C term (higher local similarity), (3) closer to the N heuristic, and/or (4) less constrained by lexical accessibility, generating lower-frequency words that are judged as better when presented but are less likely to be produced by humans during free generation. We therefore computed four predictors for every observed completion A:B::C:D. Again, we embedded all words with the same 300-dimensional GloVe vectors. For each completion D, we derived parallelogram alignment, the cosine similarity between the relation vectors (v(B) − v(A)) and (v(D) − v(C)); C:D local similarity, the cosine similarity between v(C) and v(D); and a nearest-neighbor (N) heuristic score that selects either cos(C, D) or cos(B, D) depending on whether A is closer to B than to C. To capture lexical accessibility, we obtained each completion’s word frequency using the wordfreq Python package [21] and computed log frequency due to the highly right-skewed (Zipfian) distribution of word frequencies.
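The four per-completion predictors can be sketched as follows. The toy embeddings and the `log_freq` mapping are hypothetical stand-ins for GloVe vectors and wordfreq values:

```python
import math

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def predictors(emb, a, b, c, d, log_freq):
    """Four per-completion predictors for A:B::C:D."""
    diff = lambda x, y: [p - q for p, q in zip(emb[x], emb[y])]
    parallelogram = cosine(diff(b, a), diff(d, c))  # relation-vector alignment
    cd_similarity = cosine(emb[c], emb[d])          # local similarity to C
    # N-heuristic score: similarity to C if A is closer to B, else similarity to B.
    if cosine(emb[a], emb[b]) > cosine(emb[a], emb[c]):
        n_score = cosine(emb[c], emb[d])
    else:
        n_score = cosine(emb[b], emb[d])
    return {"parallelogram": parallelogram, "cd_sim": cd_similarity,
            "n_score": n_score, "log_freq": log_freq[d]}

emb = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.5, 0.9, 0.2],
    "woman": [0.5, 0.2, 0.9],
}
log_freq = {"woman": -3.6}  # hypothetical log word frequency
p = predictors(emb, "king", "queen", "man", "woman", log_freq)
```

With these toy vectors the relation vectors queen − king and woman − man are identical, so parallelogram alignment is exactly 1.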
We then took the weighted average of these metrics over the human and LLM responses for each analogy stem and computed difference scores for each predictor, i.e., Δparallelogram, ΔC:D similarity, ΔN, and Δlog(freq), all defined as (LLM − Human). Since C:D similarity and N are substantially correlated (r = .66), these predictors were entered into two multiple regression models (Model A without N; Model B without C:D) to explain the rating advantage of LLM completions.
Parallelogram alignment and word frequency explain why LLM analogies are better. As shown in Figure 4, Δparallelogram strongly predicted LLM completions’ higher ratings over humans’ (Model A: β = 0.205, t = 4.84, p < .001; Model B: β = 0.233, t = 5.69, p < .001), suggesting that when LLM completions better satisfied the parallelogram constraint in GloVe, they were also rated as better analogies. Local similarity measures did not explain the LLMs’ advantage: ΔN was not a significant predictor (β = −0.009, p = .82), and ΔC:D similarity was a significant negative predictor (β = −0.097, p = .017), meaning that to the extent LLM completions were more similar to the C term than human completions, the rating advantage for LLMs was smaller. Finally, Δlog(freq) was a negative predictor (Model A: β = −0.192, p < .001; Model B: β = −0.191, p < .001), indicating that LLMs tended to outperform humans more when their completions were less frequent (and thus less accessible) than humans’ responses.
Figure 4: (b) Standardized regression coefficients predicting rating differences. *** p < .001, * p < .05, n.s. = not significant.
Do LLMs’ internal representations show the same parallelogram–rating relationship?
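The per-stem difference scores entering the regressions can be sketched as a frequency-weighted average followed by a subtraction; the completion counts and alignment values below are hypothetical:

```python
def weighted_metric(freqs, metric):
    """Frequency-weighted average of a per-completion metric for one stem.
    `freqs` maps completion D -> production count; `metric` maps D -> value."""
    total = sum(freqs.values())
    return sum(n * metric[d] for d, n in freqs.items()) / total

def delta(llm_freqs, human_freqs, metric):
    """Difference score (LLM - Human) for one predictor on one stem."""
    return weighted_metric(llm_freqs, metric) - weighted_metric(human_freqs, metric)

# Hypothetical parallelogram-alignment values for each observed completion.
alignment = {"immaturity": 0.9, "youth": 0.6, "kid": 0.3}
d_par = delta({"immaturity": 9, "youth": 1}, {"youth": 6, "kid": 4}, alignment)
```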
While LLM completions show stronger parallelogram alignment in GloVe space and this alignment predicts human ratings, two interpretations are possible: (1) LLMs excel because their internal representations are more geometrically structured, or (2) parallelogram structure in distributional spaces like GloVe captures what humans find compelling in analogies, and LLMs produce better analogies without relying on (directly observable) geometric representations. To distinguish these, we extracted word representations from the residual stream of each open-source model (DeepSeek-V3.1, Qwen3-235B, Qwen3-32B, Gemma-3-27B), sampling every fourth layer. For each completion, we computed parallelogram alignment (the cosine similarity between (v(B) − v(A)) and (v(D) − v(C))) using either isolated word embeddings (words input in isolation, e.g., “king”) or contextual embeddings (words embedded in the complete analogy sentence, e.g., “king” within “king is to queen as man is to woman”). We then tested how well alignment at each layer predicted human ratings.
Parallelogram alignment in GloVe predicts human judgments better than LLM internal representations. Parallelogram alignment in GloVe space strongly predicted human ratings (β = 0.193, p < .001), consistently outperforming all LLM internal representations. Aggregating across layers, LLM effects using isolated embeddings were substantially weaker: Gemma-3-27B (β = 0.094), Qwen3-32B (β = 0.041), DeepSeek-V3.1 (β = 0.040), and Qwen3-235B (β = 0.033), with contextual embeddings yielding even weaker predictions. These results support the second interpretation: LLMs’ advantage does not necessarily stem from more parallelogram-like internal representations but from generating analogies that better satisfy relational constraints captured in embedding spaces optimized for semantic representation.
The LLM advantage disappears for modal responses, but what makes a good analogy does not
While LLMs produce higher-rated analogies than humans on average, this advantage could reflect either that LLMs’ responses are superior to humans’ across the board or that LLMs produce fewer low-quality responses. We examined the response distributions of humans and LLMs and found that human distributions were considerably more dispersed: human modal responses accounted for 64% of total responses, compared to 85% for LLMs. In other words, humans produced a long tail of low-frequency completions, while LLMs tended to concentrate their outputs on a few dominant answers. If the LLM advantage stems primarily from humans’ long tail of weak responses rather than from LLMs producing better answers overall, then restricting the comparison to only modal (most frequent) responses should eliminate it. We tested this prediction by repeating the human–LLM comparison using only modal completions (frequency-weighted, with ties preserved). The LLM advantage over humans largely disappeared: no model showed a statistically significant overall advantage over humans (ps > .05), while Qwen3-32B and Gemma-3-27B were rated lower than human modal responses (b = −0.10, p = .003; b = −0.08, p = .02, respectively). All other models’ modal responses were statistically indistinguishable from humans’ (ps > .35). Paralleling this convergence in ratings, most LLMs’ modal responses were also not significantly more parallelogram-aligned than humans’ in GloVe space (DeepSeek-V3.1-671B: p < .001; Qwen3-235B: p = .03; all others ps > .05). However, the factors that predicted the LLM advantage in the full dataset remained significant when the analysis was restricted to modal responses.
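Restricting the comparison to modal completions amounts to keeping, for each stem, only the most frequent response(s) while preserving ties. A minimal sketch, with hypothetical response lists:

```python
from collections import Counter

def modal_completions(responses):
    """Most frequent completion(s) for one analogy stem, preserving ties."""
    counts = Counter(responses)
    top = max(counts.values())
    return sorted(d for d, n in counts.items() if n == top)

# Hypothetical response lists for one stem: humans are dispersed, the LLM is not.
human = ["youth", "youth", "kid", "age", "growth", "naivety"]
llm = ["immaturity"] * 5 + ["youth"]
```

On this toy data the human mode covers 2 of 6 responses while the LLM mode covers 5 of 6, mirroring the dispersion difference reported above (64% vs. 85%).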
In two regression models predicting the rating difference between LLM and human modal completions, Δparallelogram was a strong positive predictor (Model A: β = 0.189, p < .001; Model B: β = 0.231, p < .001), and Δlog(freq) remained a significant negative predictor (Model A: β = −0.173, p < .001; Model B: β = −0.161, p < .001). ΔC:D similarity was still a significant negative predictor (β = −0.141, p < .001), and ΔN was a weaker negative predictor (β = −0.086, p = .03). These results indicate that the mechanisms driving rating differences between human and LLM completions (parallelogram alignment and lexical accessibility) operate consistently regardless of whether one examines the full distribution or only the dominant responses.
Discussion
By directly comparing LLMs’ completions of classic four-term word analogies with human completions, we asked whether today’s LLMs solve analogies like people do and, if not, what the differences reveal about the representations and heuristics that capture each system. We found that (1) people judge LLM-generated analogies to be better than human-generated ones, on average; (2) LLM analogies are better predicted by the parallelogram model than human responses are, although, much like people’s, they are better captured by local similarity heuristics overall; (3) LLM-generated analogies are considered better when they more cleanly instantiate the parallelogram relation and involve less accessible (lower-frequency) words; and (4) this advantage is driven not by LLMs producing superior responses across the board, but by humans generating a long tail of weaker completions. When only modal (most frequent) responses are compared, the LLM advantage disappears, though rating differences between LLM and human modal completions continue to be predicted by greater parallelogram alignment and lower word frequency.
Parallelogram alignment and lower-frequency responses explain LLMs' advantage in perceived quality. Compared to humans, LLMs show closer alignment with the parallelogram model's predictions, and this alignment reliably predicts higher human quality ratings. In contrast, local similarity measures, such as semantic similarity between C and D, do not explain LLMs' advantage and, if anything, predict worse ratings. While [15] suggested that the parallelogram model may simply fail to capture human-like analogies, our findings suggest an alternative interpretation: people may often fail to produce strong relational analogies, yet they nonetheless prefer analogies that are well captured by parallelograms. At the same time, we observe a robust effect of lexical accessibility on relation ratings. LLM completions tend to be lower-frequency than human completions, and people tended to assign higher ratings to lower-frequency completions. Crucially, however, the effect of parallelogram alignment remains stable after controlling for accessibility, suggesting that LLM responses are rated better not only because they use less common or more "clever-sounding" words, but also because they better satisfy the relational constraint captured by the parallelogram model.

Local similarity outperformed the parallelogram model. We also replicate a key result from prior work: local similarity heuristics (especially C:D similarity) outperformed the parallelogram model as the best overall predictor for both humans and LLMs. Peterson et al. (2020) attributed this pattern in humans partly to cognitive ease: when rushed or confused, people may fall back on local similarity judgments rather than engage in full relational reasoning. However, the fact that LLMs exhibit the same preference for local similarity heuristics challenges this account, as LLMs face no time pressure or cognitive-load constraints.
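The two competing measures can be made concrete with toy vectors. These hand-built 2-D embeddings (one axis for royalty, one for gender) merely stand in for GloVe vectors; the function names are ours, not the paper's.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy stand-ins for GloVe vectors: axis 0 ~ royalty, axis 1 ~ gender.
emb = {
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "king":  np.array([2.0, 1.0]),
    "queen": np.array([2.0, -1.0]),
}

def parallelogram_score(a, b, c, d):
    """Alignment of the A->B offset with the C->D offset."""
    return cosine(emb[b] - emb[a], emb[d] - emb[c])

def local_similarity(c, d):
    """The C:D heuristic: just the cosine between C and D."""
    return cosine(emb[c], emb[d])

# man:woman::king:queen satisfies the relational (offset) constraint...
par = parallelogram_score("man", "woman", "king", "queen")  # 1.0
# ...while queen is also locally similar to king (shared royalty axis).
loc = local_similarity("king", "queen")
```

In this construction the gender offsets are parallel, so the parallelogram score is exactly 1, while king and queen still share the royalty axis, which is why a single completion can score well on both measures at once.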
Alternatively, we think local similarity may reliably predict completions because many strong analogies exhibit both relational and semantic similarity between source and target domains. For instance, in man:woman::king:queen, queen is both relationally appropriate (gender) and semantically similar to king (royalty). Even when a clean relational offset is hard to infer, due to ambiguity of word meaning or a weak relational signal in A:B, retrieving a word close to C remains a robust fallback for the task.

Conclusion. We revisit and lend renewed support to a classic cognitive model of analogy. While the parallelogram model does not fully explain how humans or LLMs generate analogies, it reliably predicts what people judge to be good analogies. By testing these theories on artificial systems, we gain new insight both into how different systems process relational information and into the psychological validity of the theories themselves. We hope to explore these directions in future work.

Acknowledgments

We thank Toyota Motor North America, Inc. for their support of this research.

References

[1] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. (2020) Language models are few-shot learners. Advances in Neural Information Processing Systems 33, pp. 1877–1901.
[2] K. Ethayarajh, D. Duvenaud, and G. Hirst (2019) Towards understanding linear word analogies. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 3253–3262.
[3] D. Gentner (1983) Structure-mapping: a theoretical framework for analogy. Cognitive Science 7 (2), pp. 155–170.
[4] A. E. Green, D. J. Kraemer, J. A. Fugelsang, J. R. Gray, and K. N. Dunbar (2010) Connecting long distance: semantic distance in analogical reasoning modulates frontopolar cortex activity. Cerebral Cortex 20 (1), pp. 70–76.
[5] R. Hendel, M. Geva, and A. Globerson (2023) In-context learning creates task vectors. arXiv preprint arXiv:2310.15916.
[6] D. R. Hofstadter (2001) Analogy as the core of cognition. In The Analogical Mind: Perspectives from Cognitive Science, D. Gentner, K. J. Holyoak, and B. N. Kokinov (Eds.), pp. 499–538.
[7] K. J. Holyoak and P. Thagard (1995) Mental leaps: analogy in creative thought. MIT Press.
[8] T. Johnson, M. ter Veen, R. Choenni, H. van der Maas, E. Shutova, and C. E. Stevenson (2025) Do large language models solve verbal analogies like children do? In Proceedings of the 29th Conference on Computational Natural Language Learning, pp. 627–639.
[9] D. A. Jurgens, P. D. Turney, S. M. Mohammad, and K. J. Holyoak (2012) SemEval-2012 task 2: measuring degrees of relational similarity. In Proceedings of the First Joint Conference on Lexical and Computational Semantics, pp. 356–364.
[10] M. J. Kmiecik and R. G. Morrison (2013) Semantic distance modulates the N400 event-related potential in verbal analogical reasoning. In Proceedings of the Annual Meeting of the Cognitive Science Society, Vol. 35.
[11] O. Levy and Y. Goldberg (2014) Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems, Vol. 27.
[12] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean (2013) Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, Vol. 26.
[13] D. L. Nelson, C. L. McEvoy, and T. A. Schreiber (2004) The University of South Florida free association, rhyme, and word fragment norms. Behavior Research Methods, Instruments, & Computers 36 (3), pp. 402–407.
[14] J. Pennington, R. Socher, and C. D. Manning (2014) GloVe: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543.
[15] J. C. Peterson, D. Chen, and T. L. Griffiths (2020) Parallelograms revisited: exploring the limitations of vector space models for simple analogies. Cognition 205, p. 104440.
[16] S. T. Piantadosi, D. C. Muller, J. S. Rule, K. Kaushik, M. Gorenstein, E. R. Leib, and E. Sanford (2024) Why concepts are (probably) vectors. Trends in Cognitive Sciences 28 (9), pp. 844–856.
[17] A. Radford, L. Metz, and S. Chintala (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.
[18] S. E. Reed, Y. Zhang, Y. Zhang, and H. Lee (2015) Deep visual analogy-making. Advances in Neural Information Processing Systems 28.
[19] D. E. Rumelhart and A. A. Abrahamson (1973) A model for analogical reasoning. Cognitive Psychology 5 (1), pp. 1–28.
[20] D. D. Sadler and E. J. Shoben (1993) Context effects on semantic domains as seen in analogy solution. Journal of Experimental Psychology: Learning, Memory, and Cognition 19 (1), p. 128.
[21] rspeer/wordfreq: v3.0.
[22] E. Todd, M. L. Li, A. S. Sharma, A. Mueller, B. C. Wallace, and D. Bau (2023) Function vectors in large language models. arXiv preprint arXiv:2310.15213.
[23] T. Webb, K. J. Holyoak, and H. Lu (2023) Emergent analogical reasoning in large language models. Nature Human Behaviour 7 (9), pp. 1526–1541.
[24] M. Yasunaga, X. Chen, Y. Li, P. Pasupat, J. Leskovec, P. Liang, E. H. Chi, and D. Zhou (2023) Large language models as analogical reasoners. arXiv preprint arXiv:2310.01714.