
Paper deep dive

Artificial Intelligence for Sentiment Analysis of Persian Poetry

Arash Zargar, Abolfazl Moshiri, Mitra Shafaei, Shabnam Rahimi-Golkhandan, Mohamad Tavakoli-Targhi, Farzad Khalvati

Year: 2026 · Venue: arXiv preprint · Area: cs.CL · Type: Preprint · Embeddings: 52

Abstract

Abstract: Recent advances in Artificial Intelligence (AI) have led to the development of large language models (LLMs) capable of understanding, analyzing, and creating textual data. These models open a significant opportunity for analyzing literature, and poetry in particular. In the present work, we employ multiple Bidirectional Encoder Representations from Transformers (BERT) and Generative Pre-trained Transformer (GPT) based language models to analyze the works of two prominent Persian poets: Jalal al-Din Muhammad Rumi (Rumi) and Parvin E'tesami. The main objective of this research is to investigate the capability of modern language models to grasp the complexities of Persian poetry and to explore potential correlations between the poems' sentiment and their meters. Our findings indicate that the GPT-4o language model can reliably be used in the analysis of Persian poetry. The results of our sentiment analysis further reveal that, in general, Rumi's poems express happier sentiments than Parvin E'tesami's. Moreover, a comparison of poetic meter usage highlights that Rumi's poems use meters to express a wider variety of sentiments. These findings are significant because they confirm that LLMs can be effectively applied to computer-based semantic studies in which human interpretation is not required, thereby significantly reducing potential biases in the analysis.

Tags

ai-safety (imported, 100%) · cs.CL (suggested, 92%) · preprint (suggested, 88%)

Links

PDF not stored locally. Use the link above to view on the source site.

Intelligence

Status: failed | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 0%

Last extracted: 3/13/2026, 1:13:19 AM


Entities (0)

No extracted entities yet.

Relation Signals (0)

No relation signals yet.

Cypher Suggestions (0)

No Cypher suggestions yet.

Full Text

52,063 characters extracted from source content.


Artificial Intelligence for Sentiment Analysis of Persian Poetry

A Preprint

Arash Zargar, Department of Mechanical & Industrial Engineering, Vector Institute for Artificial Intelligence, University of Toronto, Toronto, Canada (a.zargar@mail.utoronto.ca)
Abolfazl Moshiri, Department of Near & Middle Eastern Civilizations, University of Toronto, Toronto, Canada (a.moshiri@utoronto.ca)
Mitra Shafaei, Independent Researcher
Shabnam Rahimi-Golkhandan, Department of Near & Middle Eastern Civilizations, University of Toronto, Toronto, Canada
Mohamad Tavakoli-Targhi, Department of Near & Middle Eastern Civilizations, University of Toronto, Toronto, Canada
Farzad Khalvati*, Department of Medical Imaging and Institute of Medical Science, Department of Mechanical & Industrial Engineering, Department of Computer Science, Vector Institute for Artificial Intelligence, University of Toronto, Toronto, Canada (farzad.khalvati@utoronto.ca)

March 13, 2026 · arXiv:2603.11254v1 [cs.CL] 11 Mar 2026

Abstract

Recent advances in Artificial Intelligence (AI) have led to the development of large language models (LLMs) capable of understanding, analyzing, and creating textual data. These models open a significant opportunity for analyzing literature, and poetry in particular. In the present work, we employ multiple Bidirectional Encoder Representations from Transformers (BERT) and Generative Pre-trained Transformer (GPT) based language models to analyze the works of two prominent Persian poets: Jalāl al-Dīn Muḥammad Rūmī (Rumi) and Parvin E'tesami. The main objective of this research is to investigate the capability of modern language models to grasp the complexities of Persian poetry and to explore potential correlations between the poems' sentiment and their meters. Our findings indicate that the GPT-4o language model can reliably be used in the analysis of Persian poetry.
The results of our sentiment analysis further reveal that, in general, Rumi's poems express happier sentiments than Parvin E'tesami's. Moreover, a comparison of poetic meter usage highlights that Rumi's poems use meters to express a wider variety of sentiments. These findings are significant because they confirm that LLMs can be effectively applied to computer-based semantic studies in which human interpretation is not required, thereby significantly reducing potential biases in the analysis.

Keywords: Sentiment Analysis · Emotion Classification · Artificial Intelligence (AI) · Large Language Models (LLMs) · Digital Humanities · Computational Linguistics · Persian Poetry · Jalāl al-Dīn Muḥammad Rūmī (Rumi) · Parvin E'tesami

1 Introduction

Recent advances in Artificial Intelligence (AI) have been so extensive that Geoffrey Hinton, one of the most cited scholars in the field, compared their effect on human history to the invention of fire, the wheel, and the industrial revolution [Acres, 2023]. These advances have produced extremely powerful methods and tools for textual data analysis through the invention of Large Language Models (LLMs). An LLM is an advanced deep learning algorithm capable of handling numerous natural language processing (NLP) tasks. These models use transformer architectures and are developed by training on extensive datasets. Such training allows LLMs to understand, translate, anticipate, or produce text and other forms of content [Elastic]. Scalability, multilinguality, and efficiency are characteristics that make LLMs highly powerful tools, specifically for analyzing large textual datasets in literary and poetry analysis. In the context of literature, AI-driven tools have shown extensive utility. Text classification is one of these successful applications.
In poetry, however, text classification presents a greater challenge than in other forms of written text because of poetry's inherent linguistic complexities [Rahgozar, 2020]. Nevertheless, AI tools have previously been applied to distinguish poems from prose [de Arruda et al., 2021] and to identify poems' rhythm [Agirrezabal et al., 2016], style [Yang et al., 2024], and chronology [Rahgozar, 2020]. Agirrezabal et al. [2016] trained multiple machine learning models, including independent classifiers such as Naive Bayes and sequence labeling models such as the Hidden Markov Model, to extract the rhythm of English poems automatically. Their best model achieved a per-syllable accuracy of 91.4% and a 55.3% accuracy in correctly scanning poetry lines, highlighting the need to improve automated rhythm extraction through more advanced methods, such as deep learning-based tools. More recently, Yang et al. [2024] applied machine learning algorithms to automatically classify modern French poems into the stylistic and thematic subcategories of Romanticism, Parnassism, and Symbolism. They achieved a maximum accuracy of ∼75% with a Support Vector Machine (SVM) algorithm, and the authors emphasized the need for more advanced methods to reach higher accuracy. Similar studies have categorized Chinese [Zhu et al., 2023] and Spanish [Deng et al., 2023, Navarro-Colorado, 2018] poetry by style, and Punjabi [Kaur and Saini, 2020] and Arabic [Ahmed et al., 2019] poetry by subject. In another noteworthy study, Rahgozar [2020] employed machine learning algorithms to semantically categorize the poetry of Khājeh Shams-od-Dīn Moḥammad Ḥāfeẓ-e Shīrāzī (Hafez), one of the greatest Persian poets, to predict the chronological order of his works, and to tackle the challenge of identifying the original texts from among 16 historical versions of Hafez's Divan.
While artificial intelligence has made significant progress in classifying and analyzing existing poetic works, AI is also increasingly being used to generate original verse. Several studies have explored training AI systems on vast, multilingual literary datasets to compose new poems. Köbis and Mossink [2021] and Hitsuwari et al. [2023] demonstrated that AI-powered systems can generate poems that human participants were unable to distinguish from those composed by humans. Beyond text generation, AI has also demonstrated powerful applications in the conceptual understanding of poetry, including the identification of poetic devices such as metaphors. Tanasescu et al. [2018] showed that deep learning outperformed other types of machine learning models in detecting metaphor. Another type of analysis that has been the focus of previous studies is extracting the sentiment of poems using AI. Sentiment analysis of textual data can be considered a classical task that has been the focus of many previous studies in this field [Rahgozar, 2020, Medhat et al., 2014]. Extracting the sentiment of poems, however, is considered a more difficult task because of the presence of metaphors, amphibologies, and other language-related complexities that require subtle understanding beyond standard sentiment analysis techniques. To this effect, Ahmad et al. [2020] utilized attention-based C-BiLSTM (Convolutional Neural Network (CNN) combined with Bidirectional Long Short-Term Memory (BiLSTM)) models to classify the emotional state of poems into predefined human emotions such as love, fear, and loneliness. The authors trained their model on 9,142 English poems manually labelled with emotional states.
Their goal was to develop a computational tool that can predict the emotional state of new poems outside the training dataset; their best-performing model achieved an accuracy of 88%. Alsharif et al. [2013] utilized four supervised learning algorithms, including Naïve Bayes, SVM, Hyperpipes, and Voting Feature Intervals, to extract emotions from Arabic poems. Their dataset included 1,231 poems gathered from publicly available online sources and covered the main Arabic poem categories of Retha, Ghazal, Fakhr, and Heja. Their analysis concluded that the Hyperpipes algorithm achieved the highest precision of 79% using non-stemmed, non-rooted, mutually deducted feature vectors with 2,000 features. In another study, Rajan and Salgaonkar [2020] utilized a Naïve Bayes classifier to determine the sentiment (positive, negative, or neutral) of poems in the Konkani language. These poems were annotated by three native Konkani speakers (two of whom are well-versed in poetry), and inter-annotator agreement was assessed using the Kappa statistic. Their analysis showed that when the model was tested on poems containing only words with a senti-tag in the Konkani language corpus, the prediction accuracy reached 82%; testing on randomly selected poems without ensuring the presence of these words resulted in 70% accuracy. These results highlight the potential for further advances in sentiment analysis for regional languages using NLP algorithms. In a more recent study, Luo [2023] showed that advanced language models like GPT-4 can unveil deeper understandings of English literature by capturing surface-level imagery and uncovering underlying themes such as national pride and reflections on freedom. To the best of the authors' knowledge, no prior research has applied modern Large Language Models (LLMs), such as BERT and GPT, to conduct sentiment analysis on classical Persian poetry.
Therefore, this study leverages these advanced models to develop a framework capable of analyzing the sentiment of Persian poetic texts. The primary objective is to map potential correlations between a poem's sentiment and its metrical structure (vazn). To achieve this, we tested BERT- and GPT-based models' capability to determine the sentiment of poems from one of Jalāl al-Dīn Muḥammad Rūmī's (Rumi) books, Divan-i Shams, as well as the work of a more contemporary Persian poet, Parvin E'tesami. We chose Rumi's work because Divan-i Shams, with its more than 50 meters, is one of the richest sources in Persian literature for the use of meters, and includes 13 meters invented by Rumi. Parvin E'tesami's work was chosen to test the models' performance on more modern texts, allowing us to evaluate their capabilities across a broad temporal spectrum, ranging from the classical medieval era to the twentieth century. Additionally, we wanted to explore how different meters were used to convey various sentiments by a female poet. Considering all these points, Parvin E'tesami's work was an ideal choice, as she employed classical meters in her poetry.

2 Methodology

i. Dataset Preparation: This study is conducted using the poems of Divan-i Shams by Rumi and Divan-i Ashaar by Parvin E'tesami. These poems were sourced from the Ganjoor online repository (w.ganjoor.net), a publicly available database of classical Persian poetry. The dataset included each poem's title, verses, and its corresponding poetic meter (vazn).

ii. Sentiment Scoring System: The emotional tone of each poem was evaluated using numerical sentiment scores ranging from 1 to 5, where:

• 1 represents a sad or highly negative sentiment.
• 2 represents a somewhat sad or mildly negative sentiment.
• 3 represents a neutral sentiment.
• 4 represents a somewhat happy or mildly positive sentiment.
• 5 represents a happy or highly positive sentiment.

iii.
Preprocessing: In natural language processing, a token is the smallest unit of text (e.g., a word or sub-word) that language models use as input for their computations. To facilitate sentiment analysis, all verse lines within a single poem were concatenated into a unified text document and subsequently tokenized. Because the transformer-based models used in this study impose maximum input length constraints, poems exceeding these token limits were partitioned into smaller, sequential chunks. The models processed each chunk independently, and the resulting sentiment scores were averaged to compute a global score for the entire poem. Finally, this average was rounded to the nearest integer to align with our 1-to-5 evaluation scale, effectively representing the poem's overall emotional tone.

iv. Language Models: Four distinct transformer-based language models were employed to perform sentiment analysis on the dataset. Two of the models are based on the encoder-only BERT architecture, while the remaining two use the generative GPT architecture. In this study, all models were evaluated in a zero-shot inference setting. For the GPT models, this meant providing prompts without any in-context examples. For the BERT-based models, this involved using versions pre-trained on generic sentiment datasets and applying them directly to our poetry dataset without any domain-specific fine-tuning. These models are outlined below:

• BERT Multilingual Sentiment Analysis Model: The BERT multilingual model is a pre-trained model that supports 102 languages and contains 12 layers, 768 hidden units, 12 attention heads, and 110 million parameters [Devlin, 2019, Devlin et al., 2019]. It is trained on diverse languages and can handle both cased and uncased text. Using the masked language modeling (MLM) approach, BERT learns from both the preceding and succeeding words within a sentence during training.
This bidirectional context allowed it to outperform previous models on a wide range of natural language processing tasks. In this study, the BERT Multilingual Uncased model [Devlin, 2019] was used to analyze the sentiment of the poems. The model was applied using a tokenizer that split the poems into chunks fitting within the maximum sequence length of 512 tokens.

• Pars-BERT Uncased Model: While the underlying BERT multilingual model supports the Persian language and can process Persian text, it was not specifically fine-tuned on Persian sentiment data, and its sentiment predictions may therefore be inaccurate. Thus, we used the Pars-BERT model [Hooshvare Research Lab, Farahani et al., 2021], a monolingual fine-tuned version of BERT trained specifically on Persian-language datasets such as Persian Wikipedia (general encyclopedia) with 1,119,521 documents, Eligasht (itinerary) with 9,629 documents, Digikala (digital magazine) with 8,645 documents, and five other datasets. This domain-specific training data enables Pars-BERT to capture Persian sentiment nuances more effectively than generic multilingual models. The performance of Pars-BERT for sentiment analysis in Persian has been tested by Hooshvare Research Lab on multiple datasets, including Digikala (an Iranian e-commerce platform [Digikala]) and SnappFood (an Iranian online food ordering and delivery service [Snapfood]), showing superior results compared to Multilingual BERT. The Digikala user comments dataset, collected by the Open Data Mining Program (ODMP), contains 62,321 user comments distributed across three labels: 10,394 'no_idea', 15,885 'not_recommended', and 36,042 'recommended' [Digikala, Team, a]. Meanwhile, the SnappFood dataset includes 70,000 comments evenly split into 'Happy' (35,000) and 'Sad' (35,000) [Snapfood, Team, b].
Unlike the BERT Multilingual Sentiment Analysis model, which scores sentiment on a 1-to-5 scale, Pars-BERT classifies sentiments as "negative," "neutral," or "positive," which we mapped to the numerical values 1, 3, and 5 to align with the results from the other models. Due to Pars-BERT's maximum input sequence length of 512 tokens, we split the longer poems and analyzed their sentiments in separate chunks.

• GPT-4o-mini and GPT-4o Models: GPT-4o is one of OpenAI's most advanced multimodal large language models, capable of processing both text and image inputs and generating text outputs [openai, a]. GPT-4o contains a deep transformer architecture that uses cross-attention and multi-head attention to facilitate information exchange among various input modalities. GPT-4o has demonstrated excellent performance in non-English languages, which makes it highly suitable for analyzing Persian poetry. This strong multilingual performance suggests it has been trained or fine-tuned on a wide variety of languages, including Persian. GPT-4o-mini [openai, b] is a smaller, more affordable version of GPT-4o, designed for fast, lightweight tasks. In this study, both GPT models were accessed through OpenAI's API, and sentiments were assigned using the custom prompt below:

Prompt template = "Analyze the sentiment of the following poem and return a number between 1 and 5, where 1 means sad, 5 means happy, 3 is neutral, 2 and 4 are intermediate cases. RETURN ONLY ONE NUMBER THAT SHOWS THE SENTIMENT, (NO LONG ANSWERS JUST A NUMBER)"

Using this prompt, the models were tasked with returning a single sentiment score between 1 and 5, similar to the BERT multilingual model. GPT-4o and GPT-4o-mini both have a context window of 128,000 tokens, which allowed us to analyze all the poems in our dataset without splitting them.

v.
Comparing the Sentiment Scores with Human-Annotated Poems: To ensure the accuracy of our language models and validate the sentiment scores generated, we selected 100 poems from Rumi's Divan-i Shams and had them evaluated by two scholars specializing in the humanities and two other annotators with general knowledge of Farsi literature. Each annotator was asked to rate the sentiment of each poem on a scale of 1 to 5. Importantly, these poems were carefully chosen to represent a wide range of meters found in Divan-i Shams. The sentiment ratings provided by the annotators served as a benchmark against which to compare the sentiment scores generated by our language models. This comparison allowed us to identify the model that most accurately captures the emotional tone of the poems. Before generating a ground-truth label for the 100 chosen poems, we first assessed the consistency of the human annotators using Krippendorff's Alpha to understand the noise level in the dataset. This metric was chosen because it recognizes that a disagreement between 1 and 5 is more severe than a disagreement between 4 and 5; it can also compare all four annotators' results together and is robust for small datasets. Due to the subjective nature of poetic interpretation, perfect inter-annotator agreement was neither expected nor observed. Consequently, establishing a robust "Ground Truth" (gold label) was the primary statistical prerequisite before model evaluation could be performed. To determine the most reliable method for aggregating the human responses, we compared four aggregation strategies:

• Mean: The arithmetic average of the four human grades (rounded to the nearest integer). This is appropriate for minimizing total error distance.
• Median: The middle value of the grades. This is statistically robust for ordinal data as it ignores outliers.
• Mode: This considers the majority vote.
The mode is particularly suited to capturing consensus but struggles when variance is high.

• Dawid-Skene (DS) Model: An Expectation-Maximization (EM) algorithm that estimates the "true" label by iteratively modeling the confusion matrix (reliability) of each annotator.

We validated these methods by calculating the Average Quadratic Weighted Kappa (QWK) of each aggregation method against the individual human annotators. This metric was chosen because it penalizes disagreements according to the distance between the ratings. For each ground-truth candidate (e.g., DS, Mean, etc.), we calculated the QWK agreement between that candidate and each of the human annotators, then averaged these four scores to obtain a single alignment score. The method with the highest average QWK was selected as the final Ground Truth. Afterwards, we evaluated the four models, i.e., BERT Multilingual, Pars-BERT, GPT-4o-mini, and GPT-4o, against the established Ground Truth by calculating the absolute accuracy and the QWK correlation of their predicted sentiments with the ground-truth labels.

vi. Entropy Calculation: To evaluate the diversity of sentiments expressed within a single poetic meter by each poet, we used entropy as a statistical metric. Entropy, denoted H(X) and calculated with the following equation, quantifies the distribution of sentiment scores associated with a specific metrical structure:

H(X) = −∑_{i=1}^{n} p(x_i) log2 p(x_i)    (1)

where H(X) is the entropy, p(x_i) is the probability of outcome x_i, and n is the number of possible outcomes.

vii. Investigating the Correlation Between Poetic Meters and Sentiment in Persian Poetry: One of the primary goals of this research was to explore whether the sentiment of Persian-language poems correlates with the meters used.
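The entropy of Eq. (1) can be computed directly from the list of sentiment scores assigned to the poems of a given meter. A minimal Python sketch (the function name is ours, not from the paper):

```python
import math
from collections import Counter

def sentiment_entropy(scores):
    """Shannon entropy (Eq. 1) of a list of 1-5 sentiment scores.

    p(x_i) is the empirical frequency of each score within one meter;
    higher entropy means the meter carries a wider mix of sentiments.
    """
    counts = Counter(scores)
    total = len(scores)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

A meter whose poems all receive the same score has entropy 0, while a uniform spread over all five scores gives log2(5) ≈ 2.32, the maximum on this scale.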
This correlation analysis was achieved by computing several statistical parameters, such as the average sentiment score, entropy, standard deviation, and polarization, on the sentiments calculated for Rumi's Divan-i Shams and Parvin E'tesami's book.

viii. Naming Convention for Poetic Meters: To prevent visual clutter and simplify the graphical representation of our results, we established an alphanumeric naming convention for the poetic meters. As detailed in Table 1, each meter is designated by a prefix letter followed by a numerical index. Meters shared by both Rumi and Parvin E'tesami are denoted with the prefix 'C' (Common). Conversely, meters exclusive to Rumi's or Parvin E'tesami's works are prefixed with 'R' and 'P', respectively. This systematic approach ensures clarity and readability across all subsequent figures.

Table 1: Mapping of poetic meters.

Common meters in Rumi's and Parvin's poems:
C1 → فاعلاتن فاعلاتن فاعلاتن فاعلن (رمل مثمن محذوف)
C2 → فاعلاتن فاعلاتن فاعلن (رمل مسدس محذوف یا وزن مثنوی)
C3 → فعلاتن فعلاتن فعلاتن فعلن (رمل مثمن مخبون محذوف)
C4 → فعلاتن فعلاتن فعلن (رمل مسدس مخبون محذوف)
C5 → فعلاتن مفاعلن فعلن (خفیف مسدس مخبون)
C6 → فعولن فعولن فعولن فعل (متقارب مثمن محذوف یا وزن شاهنامه)
C7 → فعولن فعولن فعولن فعولن (متقارب مثمن سالم)
C8 → مفاعلن فعلاتن مفاعلن فعلن (مجتث مثمن مخبون محذوف)
C9 → مفاعیلن مفاعیلن فعولن (هزج مسدس محذوف یا وزن دوبیتی)
C10 → مفاعیلن مفاعیلن مفاعیلن مفاعیلن (هزج مثمن سالم)
C11 → مفتعلن فاعلات مفتعلن فع (منسرح مثمن مطوی منحور)
C12 → مفتعلن مفتعلن فاعلن (سریع مطوی مکشوف)
C13 → مفعول فاعلات مفاعیل فاعلن (مضارع مثمن اخرب مکفوف محذوف)
C14 → مفعول مفاعلن فعولن (هزج مسدس اخرب مقبوض محذوف)
C15 → مفعول مفاعلن مفاعیلن (هزج مسدس اخرب مقبوض)
C16 → مفعول مفاعیل مفاعیل فعولن (هزج مثمن اخرب مکفوف محذوف)

Distinct meters in Rumi's poems:
R17 → فاعلاتن فاعلاتن فاعلاتن فاعلاتن (رمل مثمن سالم)
R18 → فاعلاتن فاعلن فاعلاتن فاعلن
R19 → فاعلاتن مفاعلن فاعلاتن مفاعلن
R20 → فاعلن فاعلاتن فاعلن فاعلاتن
R21 → فاعلن مفعولن فاعلن مفعولن
R22 → فعلات فاعلاتن فعلات فاعلاتن (رمل مثمن مشکول)
R23 → فعلات فع لن فعلات فع لن
R24 → فعلاتن فعلاتن فعلاتن فعلاتن (رمل مثمن مخبون)
R25 → فعلاتن مفاعلن فعلاتن مفاعلن
R26 → متفاعلاتن متفاعلاتن
R27 → متفاعلتن متفاعلتن
R28 → متفاعلن متفاعلن
R29 → متفاعلن متفاعلن متفاعلن
R30 → مستفعلتن مستفعلتن
R31 → مستفعلن فع مستفعلن فع (متقارب مثمن اثلم)
R32 → مستفعلن فعلن مستفعلن فعلن (بسیط مخبون)
R33 → مستفعلن مستفعلن مستفعلن
R34 → مستفعلن مستفعلن مستفعلن مستفعلن (رجز مثمن سالم)
R35 → مفاعلتن مفاعلتن مفاعلتن مفاعلتن
R36 → مفاعلن فعلاتن مفاعلن فعلاتن (مجتث مثمن مخبون)
R37 → مفاعلن فعولن مفاعلن فعولن
R38 → مفاعلن مفعولن مفاعلن مفعولن
R39 → مفاعیل فعولن مفاعیل فعولن
R40 → مفاعیل مفاعیل مفاعیل فعولن (هزج مثمن مکفوف محذوف)
R41 → مفتعلن فاعلن مفتعلن فاعلن (منسرح مطوی مکشوف)
R42 → مفتعلن فع مفتعلن فع
R43 → مفتعلن فع مفتعلن فع مفتعلن فع مفتعلن فع
R44 → مفتعلن مفاعلن مفتعلن مفاعلن (رجز مثمن مطوی مخبون)
R45 → مفتعلن مفتعلن فع مفتعلن مفتعلن فع
R46 → مفتعلن مفتعلن مفتعلن فع
R47 → مفتعلن مفتعلن مفتعلن مفتعلن (رجز مثمن مطوی)
R48 → مفعول فاعلاتن مفعول فاعلاتن (مضارع مثمن اخرب)
R49 → مفعول مفاعیل مفاعیل فعل (وزن رباعی)
R50 → مفعول مفاعیلن مفعول مفاعیلن (هزج مثمن اخرب)
R51 → مفعولن فع مفعولن فع
R52 → مفعولن مفعولن مفعولن مفعولن

Distinct meters in Parvin's poems:
P17 → فعلاتن فعلاتن فعلاتن فع
P18 → مفعول فاعلات مفاعیلن (مضارع مسدس اخرب مکفوف)
P19 → مفعول مفاعیل فاعلاتن

3 Results

Table 2 presents the Nominal Fleiss' Kappa parameter, calculated from the sentiment scores assigned to each poem by the language models. This parameter reflects the repeatability of the data and serves as a measure of the reliability of the analysis. The Nominal Fleiss' Kappa is computed by performing the sentiment analysis six times for Rumi's poems and three times for Parvin E'tesami's poems.
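The Nominal Fleiss' Kappa over repeated scoring runs can be sketched as follows, treating each repeated run as a rater and each poem as an item. This is a standard textbook implementation, not the authors' code:

```python
from collections import Counter

def fleiss_kappa(ratings, categories=(1, 2, 3, 4, 5)):
    """Nominal Fleiss' kappa.

    `ratings` is a list of per-item rating lists, one rating per repeated
    run ("rater"); every item must have the same number of runs (>= 2).
    """
    n = len(ratings[0])   # runs (raters) per item
    N = len(ratings)      # items (poems)
    P_bar = 0.0           # mean per-item agreement P_i
    cat_totals = Counter()
    for item in ratings:
        counts = Counter(item)
        cat_totals.update(counts)
        P_bar += (sum(c * c for c in counts.values()) - n) / (n * (n - 1))
    P_bar /= N
    # chance agreement P_e from the pooled category marginals
    total = N * n
    P_e = sum((cat_totals[c] / total) ** 2 for c in categories)
    return (P_bar - P_e) / (1 - P_e)
```

Perfect agreement on every poem yields kappa = 1.0 whenever the scores vary across poems; values near 1, as reported in Table 2, indicate highly repeatable model outputs.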
Table 2: Nominal Fleiss' Kappa parameter based on the sentiment scores of various language models.

         BERT Multilingual | Pars-BERT | GPT-4o-mini | GPT-4o
Rumi     1.0               | 1.0       | 0.9481      | 0.9578
Parvin   1.0               | 1.0       | 0.9882      | 0.9316

To evaluate the reliability of our human annotations, we calculated Krippendorff's Alpha for the sampled Rumi dataset. The resulting value of 0.6 indicates a moderate level of agreement among the annotators. To determine the most accurate ground-truth labels for the 100 selected poems, we implemented and compared four aggregation strategies: one probabilistic method (Dawid-Skene (DS)) and three heuristic methods (Mean, Median, and Mode). Figure 1 illustrates the Average Quadratic Weighted Kappa (QWK) of each aggregation method compared against the individual human annotators.

Figure 1: Average Quadratic Weighted Kappa (QWK) of each aggregation method against the individual human annotators. The Mean method shows the best performance.

Figures 2a and 2b show the average sentiment for each book across the different language models. Figures 2c and 2d present the average sentiment scores for the meters used to compose at least 15 poems in the corresponding book. To quantify the alignment between the sentiment analysis models and the established ground truth (Mean aggregation), Table 3 summarizes the Quadratic Weighted Kappa metrics for all models and human annotators, measured against the ground truth. For the remainder of this paper, we use the GPT-4o model for our analyses, as this model's results were most aligned with the sentiment annotations provided by our scholar group.

Figure 2: Comparison of average sentiment scores assigned by various LLMs to Rumi's and Parvin E'tesami's poems. (a) Average sentiment for Rumi's poems across LLMs. (b) Average sentiment for Parvin E'tesami's poems across LLMs. (c) Average sentiment for each meter in Rumi's poems that contains more than 15 poems across LLMs. (d) Average sentiment for each meter in Parvin E'tesami's poems that contains more than 15 poems across LLMs. The results highlight significant differences in the average sentiment scores returned by different LLMs. Furthermore, a consistent trend of higher sentiment scores in Rumi's work is observed regardless of the model used.

Table 3: Quadratic Weighted Kappa (QWK) agreement scores and absolute prediction accuracy for human annotators and sentiment analysis models compared against the Ground Truth (Mean aggregation) labels. The GPT-4o model shows the highest degree of alignment with human evaluations among the models.

Sentiment evaluator: Human annotator 1 | Human annotator 2 | Human annotator 3 | Human annotator 4 | Pars-BERT | BERT Multilingual | GPT-4o-mini | GPT-4o
QWK score:           0.8017 | 0.8025 | 0.7969 | 0.7206 | 0.0006 | 0.0467 | 0.5024 | 0.6003
Absolute Prediction Accuracy (%): 48476151672533

Figures 3a and 3c identify the meters with the highest mean sentiment scores (happiest) in Rumi's and Parvin E'tesami's poems, respectively. Figures 3b and 3d depict the meters with the highest percentage of poems exhibiting happy sentiments (sentiment scores of 4 and 5) for Rumi's and Parvin E'tesami's poems, respectively.

Figure 3: Identification of poetic meters associated with positive sentiment in Rumi's and Parvin E'tesami's poems. (a) Meters (containing more than 15 poems) with the highest average sentiment scores in Rumi's poems. (b) Percentage of Rumi's poems with happy sentiments in the identified meters (containing more than 15 poems). (c) Meters (containing more than 15 poems) with the highest average sentiment scores in Parvin E'tesami's poems. (d) Percentage of Parvin E'tesami's poems with happy sentiments in the identified meters (containing more than 15 poems). (e) Comparison of common meters used by both poets, showing the distinction in how the two poets utilize these meters in their works.
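The Quadratic Weighted Kappa used in Table 3 penalizes a disagreement quadratically in the distance between the two ratings. A minimal sketch for two rating sequences on a 1-to-k scale (a standard formulation; the function name is ours, and it assumes the ratings are not all one constant value):

```python
def quadratic_weighted_kappa(a, b, k=5):
    """Quadratic Weighted Kappa between two sequences of 1..k ratings."""
    n = len(a)
    # observed confusion matrix of the two raters
    O = [[0.0] * k for _ in range(k)]
    for x, y in zip(a, b):
        O[x - 1][y - 1] += 1
    # marginal histograms give the chance-expected matrix
    hist_a = [a.count(v) for v in range(1, k + 1)]
    hist_b = [b.count(v) for v in range(1, k + 1)]
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            w = ((i - j) ** 2) / ((k - 1) ** 2)  # quadratic penalty
            E = hist_a[i] * hist_b[j] / n        # expected count
            num += w * O[i][j]
            den += w * E
    return 1.0 - num / den
```

Identical sequences give a QWK of 1.0, perfectly reversed ones give -1.0, and chance-level agreement gives about 0; by this yardstick, GPT-4o's 0.6003 in Table 3 sits closest to the human annotators' 0.72 to 0.80 range.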
Figure 4 illustrates the entropy of meter usage by Rumi (Figure 4a) and Parvin E'tesami (Figure 4b). Entropy is a parameter that shows the variation of emotions expressed using a specific meter.

Figure 4: Sentiment entropy across poetic meters in Rumi's and Parvin E'tesami's poems. (a) Entropy of sentiment distribution in Rumi's poems. (b) Entropy of sentiment distribution in Parvin E'tesami's poems. The results highlight Rumi's masterful use of meters to convey diverse sentiments.

Figure 5a shows the standard deviation of the sentiment scores from the average sentiment across all poems in Divan-i Shams and Parvin E'tesami's book. Figure 5b shows the sentiment polarization (scores of 1, 2, 4, or 5) and neutrality (score of 3) in Rumi's poems and Parvin E'tesami's book.

Figure 5: (a) Standard deviation of sentiment scores in Rumi's and Parvin E'tesami's poems, illustrating the variability of emotional expression. Rumi's poems show a wider range of deviations from the average sentiment. (b) Sentiment polarization in Rumi's and Parvin E'tesami's poems, distinguishing between neutral and polarized (very sad or very happy) poems.

4 Discussions

To evaluate the performance of the four sentiment analysis models on our dataset of Farsi poems, we curated a dataset of 100 samples selected from Rumi's poetry. Rumi's works were chosen specifically for their high density of metaphors and archaic linguistic structures, which present a more rigorous challenge for semantic interpretation than the more contemporary work of Parvin E'tesami. The Krippendorff's Alpha for our dataset across the four annotators is 0.6. This indicates "moderate agreement" and confirms that the task is highly subjective. Because the humans do not align perfectly, we cannot simply trust any one human as the "ground truth" label for the 100 selected poems.
We therefore applied a statistical aggregation method to filter out noise and bias. As illustrated in Table 1, the Mean method demonstrated the highest alignment with the human consensus, achieving an average Quadratic Weighted Kappa (QWK) of ∼0.78. Consequently, the labels generated by the Mean method were adopted as the Ground Truth for all subsequent model evaluations.

The first question that may arise about analyzing the sentiment of poems using AI is how accurate the models' predictions are, and whether LLMs can understand the complexities of poetry. To gain insight into the ability of these models to analyze poetic sentiments, we first compared the average sentiment scores for the poems calculated using different LLMs. As shown in Figure 2a and Figure 2b, the average sentiment for each book varies significantly across the different language models. Interestingly, however, all language models assign higher sentiment scores to Rumi's poems compared to Parvin E'tesami's. This suggests that regardless of which model we choose for analyzing the sentiments, the LLMs recognize Rumi's poems as conveying happier sentiments than Parvin E'tesami's. We repeated the same process for each of the meters that contain more than 15 poems and observed the same trends between Rumi's and Parvin E'tesami's poems. While this is an interesting finding, the question regarding the accuracy of the sentiments calculated for Persian poems with LLMs now becomes twofold: 1) how reliable are these predictions overall, and 2) which model determines the sentiment of poems more accurately? To evaluate the reliability of the calculated sentiments, we performed the analysis multiple times and compared the results by calculating the nominal Fleiss' Kappa. As shown in Table 1, the Kappa value for all models is higher than 0.93, which indicates the repeatability of the results.
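The QWK agreement metric used throughout this evaluation can be computed directly from two lists of ordinal ratings. Below is a minimal pure-Python sketch for 1–5 scores; the function name and structure are ours, not the authors':

```python
from itertools import product

def quadratic_weighted_kappa(a, b, min_r=1, max_r=5):
    """Quadratic Weighted Kappa between two equal-length lists of ordinal ratings."""
    assert len(a) == len(b) and a
    n_cat = max_r - min_r + 1
    # Observed agreement matrix O[i][j]: rater A gave category i, rater B gave j
    O = [[0.0] * n_cat for _ in range(n_cat)]
    for x, y in zip(a, b):
        O[x - min_r][y - min_r] += 1
    n = len(a)
    hist_a = [sum(row) for row in O]        # rater A's rating histogram
    hist_b = [sum(col) for col in zip(*O)]  # rater B's rating histogram
    num = den = 0.0
    for i, j in product(range(n_cat), repeat=2):
        w = (i - j) ** 2 / (n_cat - 1) ** 2          # quadratic disagreement penalty
        num += w * O[i][j]                           # observed weighted disagreement
        den += w * hist_a[i] * hist_b[j] / n         # expected under independence
    return 1.0 - num / den
```

Unlike exact accuracy, QWK penalizes a 4-vs-5 disagreement far less than a 1-vs-5 disagreement, which is why it is the more informative metric for a subjective 5-point ordinal task like this one.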
To evaluate the accuracy of the models in determining the poems' sentiments, we must compare their evaluations with human interpretations, noting that sentiment analysis is a subjective task by its nature and that even human interpretations of a single text vary significantly. To verify the accuracy of our sentiment analysis, we selected 100 poems and asked two scholars specializing in Persian poetry and two annotators with general Farsi literature knowledge to rate their emotional responses to each poem. Of these 100 poems, 80 were chosen from those with the highest standard deviations in sentiment analysis results across different language models, meaning that these are the poems whose sentiments the LLMs most strongly disagree on. The remaining 20 poems were selected from those for which the various LLMs predicted identical sentiments. Our initial hypothesis was that if the scholars' evaluations aligned with the models, these poems could be categorized as having "straightforward" sentiments that are easy to interpret for both humans and machines. However, our analysis revealed a significant discrepancy. Despite the consensus among the LLMs on these 20 poems, we observed no significant correlation among the human annotators, nor between the humans and the models. This lack of alignment indicates that a high degree of model agreement does not necessarily imply that a poem's sentiment is objectively clear. Furthermore, it suggests that the underlying mechanisms used by LLMs to determine sentiment differ fundamentally from the interpretive processes of human annotators, making it impossible to categorize these poems as having universally "clear" sentiments across both domains. Given this inherent complexity and the divergence between human and machine interpretation, it became essential to identify which model could best approximate the human consensus, despite the challenges.
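The 80/20 split described above ranks poems by the spread of per-model scores. A minimal sketch of that selection, assuming per-poem score lists keyed by poem id (function and variable names are ours, not the authors'):

```python
import statistics

def split_by_model_disagreement(scores_by_poem, n_disagree=80, n_agree=20):
    """Rank poems by the std dev of their sentiment scores across LLMs.

    scores_by_poem: dict mapping poem id -> list of per-model scores (1-5).
    Returns (high_disagreement_ids, full_agreement_ids).
    """
    spread = {pid: statistics.pstdev(s) for pid, s in scores_by_poem.items()}
    ranked = sorted(spread, key=spread.get, reverse=True)
    disagree = ranked[:n_disagree]                               # models differ most
    agree = [pid for pid in ranked if spread[pid] == 0.0][:n_agree]  # models identical
    return disagree, agree
```

A zero standard deviation across models corresponds exactly to the "identical predictions" criterion used for the 20 consensus poems.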
We benchmarked each model against the established Ground Truth (derived via Mean aggregation). The results in Table 3 indicate a significant performance gap between the GPT-based LLMs and the BERT-based models. GPT-4o emerged as the top performer with a QWK of 0.6. While its exact accuracy remained low (∼33%), this figure must be interpreted within the context of the task's subjectivity. Notably, even human annotators, including scholars in the field, did not achieve high exact agreement with the Ground Truth labels. As shown in Table 3, human exact accuracy ranged from 47% to 61%. This demonstrates that a 33% accuracy for a machine model is not a failure of computation, but rather a reflection of the inherent ambiguity of the task. When humans themselves only agree on the exact sentiment score roughly half the time, the model's performance is more impressive than the raw number suggests. Moreover, our analysis suggests that while the GPT-based LLMs often miss the exact intensity of the sentiment (e.g., predicting a 4 instead of a 5), they correctly identify the directional polarity of the poem. Conversely, the smaller encoder-based models (Bert Multilingual and Pars-Bert) struggled to capture the nuances of Rumi's poetry, showing minimal correlation with human judgment. Crucially, however, even GPT-4o, the most capable model evaluated, failed to perform as well as the general-knowledge human evaluators. This performance ceiling can be attributed to: 1) the inherent complexity of analyzing metaphorical literature, and 2) the likelihood that these specific poems were absent from the models' pre-training corpus. Consequently, the archaic structures and semantic depth typically present in classical Persian poetry were likely underrepresented in the models' training distribution, preventing them from learning the existing poetic patterns.
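The paper does not publish the exact prompt used to elicit the 1–5 scores from the GPT models. As an illustration only, a hypothetical scoring prompt and a local parser for the model's reply might look like this (the prompt wording and both names are our assumptions, not the authors' method):

```python
import re

# Hypothetical prompt template; the paper does not disclose its actual wording.
SENTIMENT_PROMPT = (
    "Rate the overall sentiment of the following Persian poem on a scale "
    "from 1 (very sad) to 5 (very happy). Reply with a single digit.\n\n{poem}"
)

def parse_sentiment_reply(reply: str) -> int:
    """Extract the first 1-5 sentiment score from a model reply.

    Raises ValueError when no score is present, so failed generations
    can be retried rather than silently recorded as wrong labels.
    """
    m = re.search(r"[1-5]", reply)
    if m is None:
        raise ValueError(f"no sentiment score found in: {reply!r}")
    return int(m.group())
```

Constraining the reply to a single digit, and parsing defensively, matters for the repeatability measurements reported earlier: free-form replies would need an extra interpretation step that itself introduces noise.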
These findings underscore the significant need for future research dedicated to developing language models capable of genuinely interpreting the rich, metaphorical landscape of literature, including Persian poetry. Another particularly significant finding emerged from the comparison of the encoder-based models. Pars-BERT was developed by fine-tuning BERT Multilingual on Farsi sentiment datasets; therefore, our initial hypothesis was that its specialized exposure to Farsi would yield superior performance compared to the base model. However, our analysis revealed that Pars-BERT actually performed worse than BERT Multilingual. We attribute this performance deterioration to the domain shift in the training data: since Pars-BERT was fine-tuned primarily on modern Farsi text (e.g., social media and news), its ability to generalize to the archaic linguistic structures of Rumi's poetry was compromised. This result is critically important, as it demonstrates that fine-tuning on Out-of-Distribution (OOD) data, even if language-related, can adversely affect model performance on specialized tasks such as classical literature analysis.

Based on the results presented in Figure 3, the meters used to create the poems with the highest sentiments in Rumi's work are R24 (فعلاتن فعلاتن فعلاتن فعلاتن), R34 (مستفعلن مستفعلن مستفعلن مستفعلن), C12 (مفتعلن مفتعلن فاعلن), and R41 (مفتعلن فاعلن مفتعلن فاعلن). In Parvin E'tesami's poems, the highest sentiment scores are exhibited by the C8 (مفاعلن فعلاتن مفاعلن فعلن), C9 (مفاعیلن مفاعیلن فعولن), and C4 (فعلاتن فعلاتن فعلن) meters. As highlighted in these figures, a significant portion of Rumi's poems, more than 60% under certain meters such as C12 (مفتعلن مفتعلن فاعلن), exhibit happy sentiment. In contrast, the share of positive-sentiment poems within any single meter in Parvin E'tesami's work reached a maximum of 12%. This observation aligns with the insights of Prof.
Mohammad-Reza Shafiei Kadkani in his book Poetry and Music (موسیقی شعر) [Kadkani, 1965], where he notes that many of the poems in Rumi's Divan-i Shams are composed in meters that contain repetitive subparts. These short, repetitive rhythmic patterns were created mostly for Sufi (Tasawwuf) gatherings, where the communal recitation of poems was designed to create a sense of enthusiasm and spiritual energy. It is therefore understandable that Rumi's poems exhibit a happier and more energetic tone, and hence higher sentiment scores, than Parvin E'tesami's work. However, the primary focus of our research goes beyond confirming this poetic intuition. We aimed to perform a feasibility analysis of LLMs' ability to recognize the complex emotional landscapes embedded within the structure of Persian poems and to grasp the sentiments of these poems accurately. As shown in Figure 3e, among the meters common to Rumi's and Parvin E'tesami's poems, Rumi's poems consistently showed higher average sentiment scores, indicating that Rumi utilized these meters to create a happier poetic tone than Parvin E'tesami. Notably, the average sentiment scores for Parvin E'tesami's poems under each meter are below 3, indicating that, on average, all of the meters in her work are used to convey predominantly sad sentiments. These findings not only reflect the distinct emotional tones of the two poets but also mark a significant shift in the way we analyze poetry. We believe these results mark the beginning of a new era in the analysis of Persian poetic works, in which LLMs make automatic conceptual statistical analysis possible; previously, computer-based systems could only perform surface structural analyses, while semantic and emotional investigations required human intervention. A comparison of meter usage in Rumi's versus Parvin E'tesami's poems can indicate which poet utilizes the meters in a more artistically diverse way.
However, our aim here is not to argue that Rumi utilized the meters more creatively to compose poems with varied sentiments, as it is well established among those familiar with Persian poetry that Rumi's poems are considered among the most significant examples of masterful use of meter in the Persian poetic corpus. Instead, our aim is to determine whether sentiment analysis results extracted by computers, without human intervention, can yield similar conclusions. To address this, we calculated the entropy for each meter. As expected from its formula (Equation 1), given that there are 5 possible sentiment scores in our study, entropy reaches its maximum value when the sentiment scores are evenly distributed, i.e., 20% of the poems have a sentiment score of 1, 20% a sentiment score of 2, and so on. This represents the most diverse distribution of sentiments within a meter. Entropy is at its minimum when all poems under a specific meter have the same sentiment, for example, when 100% of the poems have a sentiment score of 1, or 100% have a sentiment score of 5. Notably, given that our dataset has only 5 possible sentiment scores, entropy ranges over 0 ≤ H(X) ≤ log2(5) ≈ 2.322. As illustrated in Figure 4, the entropy of meter usage by Rumi is significantly higher than that of Parvin E'tesami, indicating that Rumi utilized meters to express a broader range of sentiments. Interestingly, the entropy for certain meters such as R47 (مفتعلن مفتعلن مفتعلن مفتعلن), C13 (مفعول فاعلات مفاعیل فاعلن), and C2 (فاعلاتن فاعلاتن فاعلن) is approximately 2.25, close to the maximum possible value. This may be one of the reasons that Rumi's Divan-i Shams is considered one of the richest sources and most masterful examples of meter usage in the Persian poetic corpus. In contrast, in Parvin E'tesami's poems, the maximum entropy is observed for the C9 (مفاعیلن مفاعیلن فعولن) meter and reaches approximately 1.75.
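The per-meter entropy described above is the standard Shannon entropy of the score distribution. A minimal sketch (function name is ours):

```python
from collections import Counter
from math import log2

def sentiment_entropy(scores):
    """Shannon entropy (in bits) of a list of 1-5 sentiment scores for one meter."""
    counts = Counter(scores)
    n = len(scores)
    # H(X) = -sum_k p_k * log2(p_k), summed over observed score categories
    return -sum(c / n * log2(c / n) for c in counts.values())
```

With 5 categories the value is bounded by log2(5) ≈ 2.322 bits, reached only by a perfectly uniform 20%/20%/20%/20%/20% split, which matches the bound quoted in the text; a meter whose poems all share one score yields 0.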
These results highlight that analyzing the poems of Rumi and Parvin E'tesami with LLMs successfully indicates the superiority of Rumi's poems in expressing a variety of emotional sentiments through meters. To further evaluate the emotional depth of Rumi's poetry, we analyzed the overall variability of sentiments across his poems using standard deviation and polarization measures. This approach provided another perspective on the value of his work. Figure 5a shows that when we calculate the average sentiment across all poems in Divan-i Shams, and then measure the deviation of each poem's sentiment from that average, the resulting deviation in Rumi's work is larger than in Parvin E'tesami's poems. This indicates that Rumi expressed a wide variety of emotional sentiments around his average sentiment, while most of Parvin E'tesami's poems lie closer to her average sentiment, with less deviation observed. These findings become more significant when we compare the sentiment polarization in Rumi's poems with that in Parvin E'tesami's. We define polarized poems as those with either sad (sentiment score of 1 or 2) or happy (sentiment score of 4 or 5) sentiment, and neutral poems as those with neutral sentiment (sentiment score of 3). Figure 5b shows that, in total, Rumi has more poems with neutral sentiments (sentiment score of 3); however, even though his portion of polarized poems is smaller than that in Parvin E'tesami's work, his masterful usage of the meters led him to compose a wider spectrum of sentiments. Overall, these analyses demonstrate that LLMs can correctly conduct conceptual comparisons between Rumi's and Parvin E'tesami's works and perform semantic analysis of Persian poems without requiring direct human interpretation.
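The two corpus-level measures used here, deviation from the average sentiment and the neutral-versus-polarized split, can be sketched together (function name and return shape are ours):

```python
import statistics

def sentiment_summary(scores):
    """Corpus-level variability measures for a list of 1-5 sentiment scores.

    Returns (std dev from the corpus mean,
             fraction of neutral poems (score 3),
             fraction of polarized poems (scores 1, 2, 4, or 5)).
    """
    sd = statistics.pstdev(scores)  # population std dev around the corpus mean
    neutral = sum(1 for s in scores if s == 3) / len(scores)
    polarized = 1.0 - neutral
    return sd, neutral, polarized
```

Computing this once per corpus (Divan-i Shams versus Parvin E'tesami's divan) reproduces the comparison in Figure 5: a larger standard deviation means poems stray further from the poet's average emotional tone.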
This is extremely important, as it shows the feasibility of utilizing LLMs in investigations of complicated non-English poetic works and creates new opportunities for large-scale automatic multilingual studies in the humanities.

5 Conclusions

In the present study we employed multiple language models, including BERT Multilingual Sentiment Analysis, Pars-BERT, GPT 4o, and GPT 4o-mini, to extract the sentiment of poems from Divan-i Shams by Rumi and Divan-i Ashaar by Parvin E'tesami. Our analysis focused on finding correlations between the sentiments and meters of the poems. Our findings led to the following conclusions:

1. LLMs exhibited significant performance disparities. While GPT-4o demonstrated the highest resemblance to human scholars (QWK ∼0.60), the BERT-based models failed to achieve meaningful correlation. This suggests that large-scale generative models currently outperform smaller encoder-based models in interpreting the nuances of classical Persian literature.

2. Contrary to expectations, ParsBERT (fine-tuned on modern Farsi) performed worse than the base BERT Multilingual model. This indicates that fine-tuning on modern, out-of-distribution text (e.g., news, social media) can deteriorate performance on archaic, metaphorical texts.

3. While the models struggled to replicate exact human grading on a 5-point ordinal scale (low exact accuracy), GPT-4o showed strong performance in identifying directional polarity (positive vs. negative). Therefore, while LLMs are not yet fully reliable for granular sentiment intensity, they are effective tools for macro-level sentiment classification in classical poetry.

4. Poems composed by Rumi expressed happier sentiments than Parvin E'tesami's poems, and this trend remained consistent when poems were grouped and analyzed by meter.

5.
Sentiment analysis results showed that Rumi utilized meters to express a wider variety of tones, which may be one of the reasons his work is considered among the best examples of masterful meter usage in the Persian poetic corpus.

6. Our results highlighted that the standard deviation of sentiment scores in Rumi's poems is significantly higher than in Parvin E'tesami's. This indicates that Parvin E'tesami's poems follow a more stable and consistent emotional tone, while the sentiments in Rumi's poems deviate more widely from the average sentiment score.

In summary, while LLMs offer a promising avenue for scalable, automated studies of Persian poetry, they currently face a performance ceiling due to the complexity of archaic metaphors and the lack of representative pre-training data. Future work must focus on bridging the gap between modern training corpora and classical literary heritage to fully unlock the potential of AI in digital humanities.

6 Generative AI Usage Disclosure

The authors utilized artificial intelligence tools during the preparation of this manuscript exclusively for grammatical editing and language refinement.

References

T. Acres. Sky News, 05 2023. [Online]. Available: https://news.sky.com/story/geoffrey-hinton-who-is-the-godfather-of-ai-12871205.

Elastic. What are large language models (LLMs)? [Online]. Available: https://w.elastic.co/what-is/large-language-models.

Arya Rahgozar. Automatic poetry classification and chronological semantic analysis. PhD thesis, Université d'Ottawa/University of Ottawa, 2020.

Henrique F de Arruda, Sandro M Reia, Filipi N Silva, Diego R Amancio, and Luciano da F Costa. A pattern recognition approach for distinguishing between prose and poetry. arXiv preprint arXiv:2107.08512, 2021.

Manex Agirrezabal, Iñaki Alegria, and Mans Hulden. Machine learning for metrical analysis of English poetry.
In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 772–781, 2016.

Liu Yang, Gang Wang, and Hongjun Wang. Reimagining literary analysis: utilizing artificial intelligence to classify modernist French poetry. Information, 15(2):70, 2024.

Mini Zhu, Gang Wang, Chaoping Li, Hongjun Wang, and Bin Zhang. Artificial intelligence classification model for modern Chinese poetry in education. Sustainability, 15(6):5265, 2023.

Shutian Deng, Gang Wang, Hongjun Wang, and Fuliang Chang. An artificial-intelligence-driven Spanish poetry classification framework. Big Data and Cognitive Computing, 7(4):183, 2023.

Borja Navarro-Colorado. On poetic topic modeling: extracting themes and motifs from a corpus of Spanish poetry. Frontiers in Digital Humanities, 5:15, 2018.

Jasleen Kaur and Jatinderkumar R Saini. Designing Punjabi poetry classifiers using machine learning and different textual features. Int. Arab J. Inf. Technol., 17(1):38–44, 2020.

Munef Abdullah Ahmed, Raed Abdulkareem Hasan, Ahmed Hussein Ali, and Mostafa Abdulghafoor Mohammed. The classification of the modern Arabic poetry using machine learning. TELKOMNIKA (Telecommunication Computing Electronics and Control), 17(5):2667–2674, 2019.

Nils Köbis and Luca D Mossink. Artificial intelligence versus Maya Angelou: Experimental evidence that people cannot differentiate AI-generated from human-written poetry. Computers in Human Behavior, 114:106553, 2021.

Jimpei Hitsuwari, Yoshiyuki Ueda, Woojin Yun, and Michio Nomura. Does human–AI collaboration lead to more creative art? Aesthetic evaluation of human-made and AI-generated haiku poetry. Computers in Human Behavior, 139:107502, 2023.

Chris Tanasescu, Vaibhav Kesarwani, and Diana Inkpen. Metaphor detection by deep learning and the place of poetic metaphor in digital humanities. In FLAIRS, pages 122–127, 2018.

Walaa Medhat, Ahmed Hassan, and Hoda Korashy.
Sentiment analysis algorithms and applications: A survey. Ain Shams Engineering Journal, 5(4):1093–1113, 2014.

Shakeel Ahmad, Muhammad Zubair Asghar, Fahad Mazaed Alotaibi, and Sherafzal Khan. Classification of poetry text into the emotional states using deep learning technique. IEEE Access, 8:73865–73878, 2020.

Ouais Alsharif, Deema Alshamaa, and Nada Ghneim. Emotion classification in Arabic poetry using machine learning. International Journal of Computer Applications, 65(16), 2013.

Annie Rajan and Ambuja Salgaonkar. Sentiment analysis for Konkani language: Konkani poetry, a case study. In ICT Systems and Sustainability: Proceedings of ICT4SD 2019, Volume 1, pages 321–329. Springer, 2020.

E Luo. Utilizing computational linguistics tools for enhanced poetic interpretation. Journal of Student Research, 12(4):1–14, 2023.

J. Devlin. GitHub, 2019. [Online]. Available: https://github.com/google-research/bert/blob/master/multilingual.md.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, 2019.

Hooshvare Research Lab. HuggingFace. [Online]. Available: https://huggingface.co/HooshvareLab/bert-base-parsbert-uncased.

M. Farahani, M. Gharachorloo, M. Farahani, and M. Manthouri. ParsBERT: Transformer-based model for Persian language understanding. Neural Processing Letters, 53:3831–3847, 2021.

Digikala. Digikala. [Online]. Available: https://w.digikala.com/.

Snapfood. Snappfood. [Online]. Available: https://snappfood.ir/.

Digikala Team. Digikala data, a. [Online]. Available: https://w.digikala.com/opendata/.

Snapfood Team. Snapfood data, b. [Online]. Available: https://drive.google.com/uc?id=15J4zPN1BD7Q_ZIQ39VeFquwSoW8qTxgu.

openai. GPT-4o, a.
[Online]. Available: https://platform.openai.com/docs/models/gpt-4o.

openai. GPT-4o mini, b. [Online]. Available: https://platform.openai.com/docs/models/gpt-4o-mini.

M.-R. S. Kadkani. Poetry and Music. 1965.