

Detecting the Machine: A Comprehensive Benchmark of AI-Generated Text Detectors Across Architectures, Domains, and Adversarial Conditions

Madhav S. Baidya, S. S. Baidya, Chirag Chawla

Year: 2026 | Venue: arXiv preprint | Area: cs.CL | Type: Preprint

Abstract

The rapid proliferation of large language models (LLMs) has created an urgent need for robust and generalizable detectors of machine-generated text. Existing benchmarks typically evaluate a single detector on a single dataset under ideal conditions, leaving open questions about cross-domain transfer, cross-LLM generalization, and adversarial robustness. We present a comprehensive benchmark evaluating diverse detection approaches across two corpora: HC3 (23,363 human-ChatGPT pairs) and ELI5 (15,000 human-Mistral-7B pairs). Methods include classical classifiers, fine-tuned transformer encoders (BERT, RoBERTa, ELECTRA, DistilBERT, DeBERTa-v3), a CNN, an XGBoost stylometric model, perplexity-based detectors, and LLM-as-detector prompting. Results show that transformer models achieve near-perfect in-distribution performance but degrade under domain shift. The XGBoost stylometric model matches this performance while remaining interpretable. LLM-based detectors underperform and are affected by generator-detector identity bias. Perplexity-based methods exhibit polarity inversion, with modern LLM outputs showing lower perplexity than human text, but remain effective when corrected. No method generalizes robustly across domains and LLM sources.




Full Text



Detecting the Machine: A Comprehensive Benchmark of AI-Generated Text Detectors Across Architectures, Domains, and Adversarial Conditions

Madhav S. Baidya (1), S. S. Baidya (2), Chirag Chawla (1)
(1) Indian Institute of Technology (BHU), Varanasi, India
(2) Indian Institute of Technology Guwahati, India
madhavsukla.baidya.chy22@itbhu.ac.in, saurav.baidya@iitg.ac.in, chirag.chawla.chy22@itbhu.ac.in

Abstract

The rapid proliferation of large language models (LLMs) has created an urgent need for robust, generalizable detectors of machine-generated text. Existing benchmarks typically evaluate a single detector type on a single dataset under ideal conditions, leaving critical questions about cross-domain transfer, cross-LLM generalization, and adversarial robustness unanswered. This work presents a comprehensive benchmark that systematically evaluates a broad spectrum of detection approaches across two carefully constructed corpora: HC3 (23,363 paired human–ChatGPT samples across five domains, 46,726 texts after binary expansion) and ELI5 (15,000 paired human–Mistral-7B samples, 30,000 texts). The approaches evaluated span classical statistical classifiers, five fine-tuned encoder transformers (BERT, RoBERTa, ELECTRA, DistilBERT, DeBERTa-v3), a shallow 1D-CNN, a stylometric-hybrid XGBoost pipeline, perplexity-based unsupervised detectors (GPT-2/GPT-Neo family), and LLM-as-detector prompting across four model scales including GPT-4o-mini. All detectors are further evaluated zero-shot against outputs from five unseen open-source LLMs with distributional shift analysis, and subjected to iterative adversarial humanization at three rewriting intensities (L0–L2). A principled length-matching preprocessing step is applied throughout to neutralize the well-known length confound.
Our central findings are: (i) fine-tuned transformer encoders achieve near-perfect in-distribution AUROC (≥0.994) but degrade universally under domain shift; (ii) an XGBoost stylometric hybrid matches transformer in-distribution performance while remaining fully interpretable, with sentence-level perplexity coefficient of variation and AI-phrase density as the most discriminative features; (iii) LLM-as-detector prompting lags far behind fine-tuned approaches — the best open-source result is Llama-2-13b-chat-hf CoT at AUROC 0.898, while GPT-4o-mini zero-shot reaches 0.909 on ELI5 — and is strongly confounded by the generator–detector identity problem; (iv) perplexity-based detectors reveal a critical polarity inversion — modern LLM outputs are systematically lower perplexity than human text — that, once corrected, yields effective AUROC of ≈0.91; and (v) no detector generalizes robustly across LLM sources and domains simultaneously.

Keywords: AI-generated text detection, large language models, benchmark evaluation, transformer fine-tuning, adversarial robustness, stylometry, domain generalization, perplexity, cross-LLM generalization

arXiv:2603.17522v1 [cs.CL] 18 Mar 2026 — Preprint

Contents

1 Introduction
2 Related Work
  2.1 Supervised Detection Approaches
  2.2 Unsupervised and Zero-Shot Approaches
  2.3 LLM-as-Detector
  2.4 Adversarial Humanization
3 Datasets and Preprocessing
  3.1 HC3 Dataset
  3.2 ELI5 Dataset and Mistral-7B Augmentation
  3.3 Binary Dataset Preparation and Length Matching
4 Detector Families: Architecture and Implementation
  4.1 Statistical / Classical Detectors
  4.2 Fine-Tuned Encoder Transformers
    4.2.1 BERT (bert-base-uncased)
    4.2.2 RoBERTa (roberta-base)
    4.2.3 ELECTRA (google/electra-base-discriminator)
    4.2.4 DistilBERT (distilbert-base-uncased)
    4.2.5 DeBERTa-v3 (microsoft/deberta-v3-base)
  4.3 Shallow 1D-CNN Detector
  4.4 Stylometric and Statistical Hybrid Detector
  4.5 Perplexity-Based Detectors
  4.6 LLM-as-Detector
5 Experimental Results: Detector Families
  5.1 Statistical / Classical Detectors
  5.2 Fine-Tuned Encoder Transformers
  5.3 Shallow 1D-CNN Detector
  5.4 Stylometric and Statistical Hybrid Detector
  5.5 Stage 1 Key Conclusions
6 LLM-as-Detector and Contrastive Likelihood Detection
  6.1 Prompting Paradigms
  6.2 Tiny-Scale Models: TinyLlama-1.1B-Chat-v1.0 and Qwen2.5-1.5B
  6.3 Mid-Scale Models: Llama-3.1-8B-Instruct and Qwen2.5-7B
  6.4 Large-Scale Models: LLaMA-2-13B-Chat
  6.5 Large-Scale Models: Qwen2.5-14B-Instruct
  6.6 GPT-4o-mini as Detector
  6.7 Contrastive Likelihood Detection
7 Perplexity-Based Detectors
  7.1 Method
  7.2 Results
8 Cross-LLM Generalization Study
  8.1 Experimental Design and Dataset Construction
  8.2 Neural Detector Cross-LLM Evaluation
  8.3 Embedding-Space Generalization via Classical Classifiers
  8.4 Distribution Shift Analysis in Representation Space
9 Adversarial Humanization
10 Discussion
  10.1 The Cross-Domain Challenge
  10.2 The Generator–Detector Identity Problem
  10.3 The Perplexity Inversion
  10.4 Interpretability vs. Performance
  10.5 Limitations
11 Future Work
12 Conclusion
13 Implementation Details
  13.1 Family 1 — Statistical Machine Learning Detectors
  13.2 Family 2 — Fine-Tuned Encoder Transformers
  13.3 Family 3 — Shallow 1D-CNN Detector
  13.4 Family 4 — Stylometric and Statistical Hybrid Detector
  13.5 Family 5 — LLM-as-Detector
A Hyperparameter Tables
  A.1 Encoder Transformer Common Training Protocol
  A.2 Encoder Transformer Model Specifications
  A.3 1D-CNN Hyperparameters
  A.4 Stylometric Hybrid Hyperparameters
  A.5 LLM-as-Detector Configuration Summary
  A.6 CoT Ensemble Parameters by Model
B Prompt Templates
  B.1 Zero-Shot Prompts
  B.2 Few-Shot Prompt Structure
  B.3 Chain-of-Thought Prompts

1 Introduction

The widespread deployment of instruction-tuned large language models — including ChatGPT, Mistral, LLaMA, and their successors [Brown et al., 2020, Ouyang et al., 2022, Bommasani et al., 2021] — has fundamentally altered the landscape of written communication. These systems produce text that is, by many surface measures, indistinguishable from human writing [Floridi and Chiriatti, 2020], giving rise to serious societal concerns around academic integrity, journalistic authenticity, disinformation, and the erosion of trust in digital communication. The development of robust, practical detectors for machine-generated text has consequently become one of the most active research frontiers in natural language processing [Mitchell et al., 2023, Gehrmann et al., 2019]. Despite substantial progress, the field suffers from critical methodological fragmentation.
Existing work evaluates detectors in isolation, on single datasets, under idealized conditions that do not reflect the deployment environment. Key questions remain empirically underexplored: How much does a detector's performance degrade when the test-time LLM differs from the training-time generator? Which architectural families generalize most robustly across domains? Can interpretable, lightweight detectors match the performance of massive fine-tuned transformers? Does prompting large models with structured reasoning constitute a viable detection strategy? What happens to all detector families under adversarial text humanization?

This paper addresses these questions through a large-scale, multi-stage benchmark that spans the full spectrum of detection paradigms. To support reproducibility and further research, we make our implementation and evaluation pipeline available at our GitHub repository. Our contributions are:

1. Benchmark design and datasets. We construct two carefully controlled corpora — HC3 (paired human–ChatGPT, 5 domains, 46,726 samples after length matching) and ELI5 (paired human–Mistral-7B, single domain, 30,000 samples) — with a principled length-matching step that prevents detectors from exploiting the length confound [Ippolito et al., 2020].

2. Three detector families (Stage 1). We implement and rigorously evaluate, under in-distribution and cross-domain conditions: (a) classical statistical classifiers on a 22-feature hand-crafted feature set; (b) five fine-tuned encoder transformers — BERT [Devlin et al., 2019], RoBERTa [Liu et al., 2019], ELECTRA [Clark et al., 2020], DistilBERT [Sanh et al., 2019], DeBERTa-v3 [He et al., 2021]; (c) a shallow 1D-CNN [Kim, 2014]; (d) a stylometric-hybrid XGBoost [Chen and Guestrin, 2016] pipeline with 60+ features including sentence-level perplexity and AI-phrase density; (e) perplexity-based unsupervised detectors (GPT-2/GPT-Neo family); and (f) LLM-as-detector prompting across four model scales (1.1B–14B parameters) including GPT-4o-mini via the OpenAI API.

3. Cross-LLM generalization (Stage 2). All Stage 1 detectors are evaluated zero-shot against outputs from five unseen open-source LLMs (TinyLlama-1.1B, Qwen2.5-1.5B, Qwen2.5-7B, Llama-3.1-8B-Instruct, LLaMA-2-13B), complemented by embedding-space generalization via classical classifiers and distributional shift analysis in DeBERTa representation space.

4. Adversarial humanization (Stage 3). All detectors are evaluated under three levels of iterative LLM-based rewriting (L0: original, L1: light humanization, L2: heavy humanization) using Qwen2.5-1.5B-Instruct as the rewriting model, probing robustness to the most practical evasion strategy available to adversarial users.

Figure 1. Overview of the benchmark pipeline. Stage 0 constructs two paired corpora (HC3: 23k human–ChatGPT pairs; ELI5: 15k human–Mistral-7B pairs) with length-matched preprocessing. Stage 1 evaluates three detector families: Family 1 (classical statistical classifiers), Family 2 (fine-tuned encoder transformers — BERT, RoBERTa, ELECTRA, DistilBERT, DeBERTa-v3; 1D-CNN; perplexity-based detectors; stylometric-hybrid XGBoost), and Family 3 (LLM-as-detector prompting at four scales including GPT-4o-mini).
Stage 2 evaluates cross-LLM generalization via neural detectors, embedding-space classifier matrices, and distributional shift analysis. Stage 3 applies adversarial humanization at three levels (L0–L2) using an instruction-tuned rewriter. All families are evaluated under a unified five-metric suite (AUROC, AUPRC, EER, Brier score, FPR@95%TPR).

2 Related Work

2.1 Supervised Detection Approaches

Early work on machine-generated text detection relied on statistical features such as perplexity under reference language models [Solaiman et al., 2019], n-gram statistics, and stylometric signals [Juola, 2006, Stamatatos, 2009, Ippolito et al., 2020]. The introduction of transformer-based detectors substantially advanced the field: models such as GROVER [Zellers et al., 2019] demonstrated that the best generators also serve as the best discriminators. Subsequent work fine-tuned general-purpose encoders (BERT, RoBERTa) on paired human/LLM corpora, achieving high in-distribution accuracy [Rodriguez et al., 2022]. The HC3 corpus [Guo et al., 2023] introduced a systematic multi-domain benchmark for ChatGPT detection that has become a de-facto standard. Several subsequent studies have investigated domain transfer [Uchendu et al., 2020], adversarial robustness [Wolff and Wolff, 2022], and the effect of prompt engineering on detectability. Commercial detection tools have also been deployed [OpenAI, 2023], though their generalization across LLM families remains poorly characterized.

2.2 Unsupervised and Zero-Shot Approaches

DetectGPT [Mitchell et al., 2023] exploits the observation that LLM-generated text tends to lie in local probability maxima of the generating model, using perturbation-based curvature estimation as a detection signal. Statistical visualization tools such as GLTR [Gehrmann et al., 2019] provide complementary token-level detection signals.
Perplexity thresholding under reference models has been widely studied [Lavergne et al., 2008], though as we show, the direction of the perplexity signal is counter-intuitive in the modern LLM era. Watermarking schemes [Kirchenbauer et al., 2023] provide a complementary but generator-controlled approach that requires cooperation from the model provider.

2.3 LLM-as-Detector

The use of large models as zero-shot or few-shot classifiers for their own outputs has been explored in several recent studies [Zeng et al., 2023, Bhattacharjee et al., 2023]. A consistent finding is that prompting-based detection underperforms fine-tuned approaches, particularly on out-of-distribution text. Chain-of-thought prompting has been shown to improve classification accuracy for models with sufficient instruction-following capacity [Kojima et al., 2022, Wei et al., 2022], a finding we confirm and extend across four model scales.

2.4 Adversarial Humanization

Paraphrase-based attacks [Krishna et al., 2023], style transfer, and direct human editing have all been demonstrated to substantially reduce detector accuracy. The challenge of adversarial robustness remains largely unsolved, particularly for unsupervised detection methods. Our Stage 3 evaluation systematically characterizes how iterative LLM-based rewriting at two intensity levels degrades all detector families simultaneously, filling a gap left by prior work that typically evaluates a single detector family under a single attack strategy.

3 Datasets and Preprocessing

3.1 HC3 Dataset

The HC3 (Human-ChatGPT Comparison) corpus [Guo et al., 2023] was loaded from the Hello-SimpleAI/HC3 repository via the Hugging Face datasets library. It provides question–answer pairs across multiple domains, with each entry containing one question, a list of human answers, and a list of ChatGPT answers.
We flattened the corpus into a structured paired format — one row per question with a single human answer and a single ChatGPT answer — yielding 47,734 paired examples across six domain splits (Table 1). Following exact-duplicate removal on the question field, the corpus was reduced to 23,363 unique records.

Table 1. Domain distribution of the HC3 corpus after flattening and deduplication.

Domain        Unique Pairs
reddit_eli5   16,153
finance        3,933
medicine       1,248
open_qa        1,187
wiki_csai        842
Total         23,363

3.2 ELI5 Dataset and Mistral-7B Augmentation

The ELI5 dataset [Fan et al., 2019] was loaded from sentence-transformers/eli5 via the Hugging Face hub. It is a human-only question-answering corpus sourced from the Reddit community r/explainlikeimfive, containing 325,475 training samples with plain-language explanations of complex topics. No LLM-generated answers exist in the raw ELI5 data. To create a balanced human–LLM paired corpus, we used Mistral-7B-Instruct-v0.2 to generate AI answers for a random sample of 15,000 ELI5 questions. The generation pipeline was optimized for throughput on an NVIDIA A100 GPU (Table 2).

Table 2. Mistral-7B generation configuration for ELI5 augmentation.

Parameter       Value
Model           mistralai/Mistral-7B-Instruct-v0.2
Precision       FP16 (no quantization)
Attention       Flash Attention 2
Compilation     torch.compile (reduce-overhead)
Batch size      48
Max new tokens  150
Temperature     0.7
Top-p           0.9

Each question was formatted using Mistral's [INST] instruction template and fed through the model in batches. Generated tokens were decoded with the prompt stripped, yielding clean answer strings.

3.3 Binary Dataset Preparation and Length Matching

AI-generated text detection is formulated as a binary classification problem. Rather than treating each question–answer pair as a unit, every individual answer is treated as an independent text sample labeled either human (0) or LLM (1).
This decoupling reflects the actual deployment setting, where detectors receive isolated text snippets with no access to the corresponding question. This conversion yielded perfectly balanced binary corpora:

• HC3 binary: 46,726 samples (23,363 human + 23,363 LLM)
• ELI5 binary: 30,000 samples (15,000 human + 15,000 LLM)

A critical length-matching step was applied before splitting. It is well documented that LLM-generated answers are systematically longer than human answers [Ippolito et al., 2020]; without correction, a classifier can achieve high accuracy by learning response length — a spurious, non-linguistic shortcut that collapses under any length-normalized adversarial condition. Each human answer was therefore matched with an LLM answer falling within ±20% of its word count, ensuring statistically comparable length distributions across classes. Stratified 80/20 train/test splits were then constructed, preserving exact class balance (Table 3).

Table 3. Train/test split sizes after length matching and stratification.

Dataset  Train                      Test
HC3      36,968 (18,484 per class)  9,242
ELI5     22,862 (11,431 per class)  5,716

The two datasets are kept separate throughout all evaluations: HC3 represents a formal, multi-domain corpus with ChatGPT as the LLM source, while ELI5 represents a conversational, single-domain corpus with Mistral-7B as the source. This separation enables cross-dataset generalization analysis.

4 Detector Families: Architecture and Implementation

All detectors output a continuous detectability score in [0,1] representing the probability that a given text is LLM-generated. Each supervised family is trained and evaluated under four conditions: in-distribution (same dataset for train and test) and cross-distribution (train on one dataset, test on the other), producing a 2×2 evaluation grid per detector. Unsupervised and zero-parameter families are evaluated on both test sets without training.
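The ±20% word-count matching described above can be sketched as a simple greedy pairing. The paper does not publish its exact matching algorithm, so the first-fit strategy and the function below are illustrative assumptions:

```python
import random

def length_match(human_texts, llm_texts, tol=0.20, seed=0):
    """Pair each human answer with an unused LLM answer whose word count
    is within +/- tol of the human answer's word count.

    Greedy first-fit sketch of the paper's +/-20% length matching; the
    authors' actual pairing procedure is not specified.
    """
    rng = random.Random(seed)
    llm_pool = sorted(llm_texts, key=lambda t: len(t.split()))
    pairs = []
    for h in human_texts:
        n = len(h.split())
        lo, hi = (1 - tol) * n, (1 + tol) * n
        candidates = [t for t in llm_pool if lo <= len(t.split()) <= hi]
        if not candidates:
            continue  # no length-compatible LLM answer: drop this sample
        match = rng.choice(candidates)
        llm_pool.remove(match)  # each LLM answer is used at most once
        pairs.append((h, match))
    return pairs
```

Unmatched human answers are simply dropped in this sketch; any strategy that keeps the two word-count distributions statistically comparable serves the same purpose.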
4.1 Statistical / Classical Detectors

This family operates entirely on hand-crafted linguistic features with no learned representations. The feature extractor computes 22 interpretable signals organized into seven categories:

(i) Surface statistics: word count, character count, sentence count, average word length, average sentence length.
(ii) Lexical diversity: type-token ratio, hapax legomena ratio.
(iii) Punctuation and formatting: comma density, period density, question mark ratio, exclamation ratio.
(iv) Repetition metrics: bigram repetition rate, trigram repetition rate.
(v) Entropy measures: Shannon entropy over the word-frequency distribution, sentence-length entropy.
(vi) Syntactic complexity: sentence-length variance and standard deviation.
(vii) Discourse markers: hedging-word density, certainty-word density, connector-word density, contraction ratio, burstiness.

Three classifiers are trained on this feature vector: Logistic Regression (L2 penalty, interpretable linear baseline); Random Forest (100 trees, max depth 10, bootstrap sampling); and SVM with RBF kernel (Platt-scaled probabilities).

4.2 Fine-Tuned Encoder Transformers

All transformer models share a common fine-tuning protocol: the pre-trained encoder is loaded with a two-class classification head appended to the [CLS] token representation, then fine-tuned end-to-end for one epoch on binary human/LLM labels. Training uses AdamW (lr = 2×10⁻⁵, weight decay = 0.01), linear warmup over 6% of training steps, dropout increased to 0.2, a 10% held-out validation split for early stopping (patience = 3), and AUROC as the model-selection criterion. Mixed precision (FP16) is used throughout. Batch size is 32 (train) and 64 (eval). The detectability score is the softmax probability assigned to the LLM class.
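Returning briefly to the classical family of Section 4.1, three of its 22 hand-crafted signals (type-token ratio, word-frequency entropy, and burstiness) can be sketched in plain Python. Whitespace tokenization, the regex sentence splitter, and the std/mean burstiness definition are simplifying assumptions; the paper does not spell out its exact implementations:

```python
import math
import re

def ttr(text):
    # Type-token ratio: unique words / total words
    words = text.lower().split()
    return len(set(words)) / len(words) if words else 0.0

def word_entropy(text):
    # Shannon entropy (bits) over the word-frequency distribution
    words = text.lower().split()
    if not words:
        return 0.0
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    n = len(words)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def burstiness(text):
    # Std/mean of sentence word counts (one common definition; the
    # paper's precise formula is not given)
    sents = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [len(s.split()) for s in sents]
    if len(lengths) < 2:
        return 0.0
    mean = sum(lengths) / len(lengths)
    var = sum((l - mean) ** 2 for l in lengths) / len(lengths)
    return math.sqrt(var) / mean if mean else 0.0
```

Stacking such scalars into a fixed-length vector yields the input that the logistic regression, random forest, and SVM baselines consume.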
4.2.1 BERT (bert-base-uncased)

BERT [Devlin et al., 2019] uses bidirectional masked language modeling pre-training, processing full token sequences with attention over both left and right context. The base variant has 12 transformer layers, 12 attention heads, hidden size 768, intermediate size 3,072, and ≈110M parameters. Tokenization uses WordPiece with a 30,522-token vocabulary; sequences are truncated to 512 tokens.

4.2.2 RoBERTa (roberta-base)

RoBERTa [Liu et al., 2019] improves upon BERT by removing the next-sentence prediction objective, training on 10× more data with larger batches, using dynamic masking, and employing a Byte-Pair Encoding tokenizer (50,265-token vocabulary). It shares the same 12-layer, 768-hidden architecture (125M parameters) but benefits from more robust pre-training.

4.2.3 ELECTRA (google/electra-base-discriminator)

ELECTRA [Clark et al., 2020] replaces masked language modeling with a replaced-token detection objective: a small generator corrupts tokens and the discriminator is trained to identify which tokens were replaced. This produces more sample-efficient pre-training, as every token position contributes a training signal (vs. ≈15% in BERT). ELECTRA's token-level discriminative pre-training makes it particularly sensitive to local stylistic anomalies common in LLM outputs.

4.2.4 DistilBERT (distilbert-base-uncased)

DistilBERT [Sanh et al., 2019] is a knowledge-distilled compression of BERT, retaining 97% of BERT's language-understanding performance at 60% of its parameter count (≈66M parameters, 6 layers). Distillation uses a soft-label cross-entropy loss against the teacher BERT's output distribution, combined with cosine embedding alignment. DistilBERT is particularly attractive for deployment-scale detection systems due to its significantly reduced inference latency.

4.2.5 DeBERTa-v3 (microsoft/deberta-v3-base)

DeBERTa-v3 [He et al., 2021] introduces two architectural advances over RoBERTa.
First, disentangled attention: each token is represented by two separate vectors — one for content and one for relative position — with attention weights computed across all four content-position cross-interactions. Second, DeBERTa-v3 adopts ELECTRA-style replaced-token detection for pre-training rather than masked language modelling. The base model has approximately 184M parameters.

Implementation. A critical consideration for DeBERTa-v3 is precision handling. The disentangled attention mechanism produces gradient magnitudes that underflow in BF16's 7-bit mantissa, rendering mixed-precision training numerically unsafe for this architecture. Training is therefore conducted in full FP32 throughout (fp16=False, bf16=False, with an explicit model.float() cast at initialization). Checkpoint reloading is disabled entirely (save_strategy="no", load_best_model_at_end=False), and final in-memory weights are used directly for prediction — this avoids the LayerNorm parameter-naming inconsistency between saved and reloaded checkpoints that is a known fragility of DeBERTa-v3 under the Hugging Face Trainer. Explicit gradient clipping (max_grad_norm=1.0) is applied for training stability. token_type_ids are intentionally omitted, as DeBERTa-v3 does not use segment IDs. DeBERTa-v3 uses AdamW (lr = 2×10⁻⁵, weight_decay = 0.01), 500 warmup steps (fixed, not ratio-based), 1 epoch, batch size 16.

4.3 Shallow 1D-CNN Detector

The 1D-CNN detector is a lightweight neural model targeting local n-gram patterns rather than global sequence context, following the architecture of Kim [2014]:

Embedding → Parallel 1D-Conv → Global Max Pool → Dense Head → σ

A shared embedding layer (vocab size 30,000, dim 128) projects token IDs into dense vectors. Four parallel convolutional branches with kernel sizes 2, 3, 4, 5 each produce 128 feature maps (BatchNorm + ReLU). Global max pooling extracts the most salient activation per filter, producing a 512-dimensional concatenated feature vector.
A two-layer dense head (512→256→1) with dropout (0.4) and sigmoid output produces the detectability score. Texts are truncated to 256 tokens (shorter than the transformer maximum of 512, as local n-gram patterns are captured in shorter windows). Total parameter count is under 5M — intentionally constrained to probe whether shallow learned representations can bridge the gap between handcrafted features and full transformer fine-tuning. Training uses Adam (lr = 10⁻³), ReduceLROnPlateau scheduling (factor 0.5, patience 1), gradient clipping (norm 1.0), and early stopping (patience = 3) over up to 10 epochs.

4.4 Stylometric and Statistical Hybrid Detector

This family substantially extends the classical feature set from 22 to 60+ features across eight categories, adding:

• AI phrase density: frequency of structurally AI-characteristic phrases (e.g., "it is worth noting", "in summary", "to summarize").
• Function word frequency profiles: overall function word ratio plus per-word frequency for the 10 most common function words.
• Punctuation entropy: Shannon entropy over the punctuation character distribution — LLM text tends toward lower entropy (more uniform punctuation).
• Readability indices: Flesch Reading Ease, Flesch-Kincaid Grade, Gunning Fog, SMOG Index, ARI, Coleman-Liau Index.
• POS tag distribution (spaCy): normalized frequency of 10 POS categories.
• Dependency tree depth: mean and maximum parse-tree depth across sentences.
• Sentence-level perplexity (GPT-2 Small): mean, variance, standard deviation, and coefficient of variation (CV) of per-sentence perplexity. The CV is particularly diagnostic: LLM text exhibits uniformly low perplexity (low CV), while human text varies considerably across sentences (high CV).

Three classifiers are trained: Logistic Regression (L2, lbfgs solver), Random Forest (300 trees, max depth 12), and XGBoost [Chen and Guestrin, 2016] (400 estimators, learning rate 0.05, depth 6, subsample 0.8).
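The coefficient-of-variation feature can be sketched as follows. In the paper the per-sentence perplexities come from GPT-2 Small; here they are assumed precomputed, and the illustrative values are hypothetical:

```python
import math

def perplexity_cv(ppls):
    """Coefficient of variation (std / mean) of per-sentence perplexities.

    Sketch only: the per-sentence perplexity values are assumed to be
    precomputed (GPT-2 Small in the paper). Uniformly low per-sentence
    perplexity (low CV) is characteristic of LLM text; high sentence-to-
    sentence variation (high CV) of human text.
    """
    mean = sum(ppls) / len(ppls)
    var = sum((p - mean) ** 2 for p in ppls) / len(ppls)
    return math.sqrt(var) / mean

# Hypothetical illustration: uniform (LLM-like) vs. varied (human-like)
llm_like = [12.0, 13.0, 12.5, 12.8]
human_like = [8.0, 45.0, 19.0, 80.0]
```

A downstream classifier consumes the CV alongside the other 60+ features; a low CV pushes the detectability score toward the LLM class.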
All features are standardized via StandardScaler.

4.5 Perplexity-Based Detectors

Perplexity-based detection is an unsupervised, training-free approach that exploits the distributional overlap between autoregressive reference models and LLM-generated text. Because GPT-2 and GPT-Neo family models share training-corpus overlap with modern LLM generators, they assign systematically lower perplexity to LLM-generated text than to human-written text. The detectability score is therefore an inversion of the raw perplexity signal. Five reference models are evaluated: GPT-2 Small (117M), GPT-2 Medium (345M), GPT-2 XL (1.5B), GPT-Neo-125M, and GPT-Neo-1.3B. Full implementation details and the sliding-window strategy for long texts are described in Section 7.

4.6 LLM-as-Detector

The LLM-as-detector paradigm treats generative language models as zero-parameter classifiers, deriving detectability scores from constrained decoding logits (for local models) or structured rubric scores (for API models). Five open-source models spanning 1.1B to 14B parameters are evaluated (TinyLlama-1.1B, Qwen2.5-1.5B, Qwen2.5-7B, LLaMA-3.1-8B, LLaMA-2-13B-Chat), along with GPT-4o-mini via the OpenAI API. Full implementation details including prompt polarity correction, task prior subtraction, and the hybrid confidence-logit scoring scheme are described in Section 6.

5 Experimental Results: Detector Families

5.1 Statistical / Classical Detectors

Tables 4–6 report results for Logistic Regression, Random Forest, and SVM with RBF kernel.

Table 4. Logistic Regression results across evaluation conditions.

Condition    AUROC   Brier   Log Loss  Mean Human  Mean LLM
hc3 → hc3    0.8882  0.1334  0.4411    0.2838      0.7319
hc3 → eli5   0.7406  0.2116  0.6474    0.4246      0.6508
eli5 → eli5  0.8446  0.1605  0.4909    0.3251      0.6760
eli5 → hc3   0.7429  0.2496  0.9063    0.2006      0.4580

Table 5. Random Forest results across evaluation conditions.
Condition   AUROC   Brier   Log Loss  Mean Human  Mean LLM
hc3→hc3     0.9767  0.0679  0.2438    0.1889      0.8173
hc3→eli5    0.7829  0.1922  0.5815    0.3830      0.6086
eli5→eli5   0.9618  0.0869  0.3014    0.2348      0.7811
eli5→hc3    0.6337  0.3193  1.1636    0.1643      0.2903

Table 6. SVM (RBF kernel) results across evaluation conditions.
Condition   AUROC   Brier   Log Loss  Mean Human  Mean LLM
hc3→hc3     0.7993  0.1835  0.5486    0.3700      0.6318
hc3→eli5    0.6933  0.2348  0.6639    0.5196      0.6686
eli5→eli5   0.7924  0.1857  0.5512    0.3740      0.6287
eli5→hc3    0.5992  0.3169  1.5852    0.2083      0.3191

Figure 2. Calibration curves for classical detectors across four evaluation settings. Points close to the diagonal indicate well-calibrated confidence scores, while systematic deviations reflect over- or under-confidence.

Key Observation. Random Forest achieves the strongest in-distribution performance (AUROC = 0.977 on HC3) among classical detectors but suffers the largest cross-domain degradation (eli5-to-hc3: 0.634), suggesting it overfits to dataset-specific surface statistics rather than generalizable linguistic signals.

5.2 Fine-Tuned Encoder Transformers

Tables 7–11 report full results for each fine-tuned encoder.

Table 7. BERT (bert-base-uncased) results.
Condition   AUROC   Acc.    Brier   Log Loss  Hum.    LLM     Sep.
hc3→hc3     0.9947  0.9041  0.0906  0.5747    0.1927  0.9999  0.8071
hc3→eli5    0.9489  0.8396  0.1472  0.8720    0.2319  0.9147  0.6828
eli5→eli5   0.9943  0.9388  0.0572  0.3315    0.1245  0.9996  0.8751
eli5→hc3    0.9083  0.8548  0.1393  0.8719    0.2100  0.9170  0.7070

Table 8. RoBERTa (roberta-base) results.
Condition   AUROC   Acc.    Brier   Log Loss  Hum.    LLM     Sep.
hc3→hc3     0.9994  0.9679  0.0303  0.2204    0.0642  1.0000  0.9357
hc3→eli5    0.9741  0.7967  0.1926  1.4401    0.4054  0.9991  0.5937
eli5→eli5   0.9998  0.9645  0.0331  0.2264    0.0711  0.9999  0.9289
eli5→hc3    0.9657  0.9045  0.0932  0.7082    0.1129  0.9214  0.8085

Table 9. ELECTRA (google/electra-base-discriminator) results.
Condition   AUROC   Acc.    Brier   Log Loss  Hum.    LLM     Sep.
hc3→hc3     0.9972  0.8639  0.1298  0.8663    0.2731  0.9996  0.7265
hc3→eli5    0.9597  0.8492  0.1450  0.9868    0.2770  0.9725  0.6955
eli5→eli5   0.9975  0.9605  0.0359  0.1804    0.0821  0.9986  0.9166
eli5→hc3    0.9318  0.8790  0.1161  0.6408    0.1630  0.9140  0.7511

Table 10. DistilBERT (distilbert-base-uncased) results.
Condition   AUROC   Acc.    Brier   Log Loss  Hum.    LLM
hc3→hc3     0.9968  0.9502  0.0460  0.2698    0.0999  0.9997
hc3→eli5    0.9578  0.8835  0.1088  0.6235    0.1250  0.8907
eli5→eli5   0.9983  0.9692  0.0288  0.1503    0.0647  0.9993
eli5→hc3    0.9309  0.8702  0.1229  0.7205    0.1397  0.8768

Table 11. DeBERTa-v3 (microsoft/deberta-v3-base) results.
Condition   AUROC   Acc.    Brier   Log Loss  Hum.    LLM
hc3→hc3     0.9913  0.8888  0.1100  0.9803    0.2225  0.9991
hc3→eli5    0.8762  0.5728  0.4245  4.0517    0.8532  0.9997
eli5→eli5   0.9530  0.7794  0.2089  1.2387    0.4377  0.9998
eli5→hc3    0.8890  0.7749  0.2148  1.3764    0.4147  0.9662

Figure 3. Performance analysis of DistilBERT across four evaluation conditions: (a) detectability score distributions indicating class separability; (b) calibration curves (reliability diagrams); (c) ROC curves illustrating discrimination performance. DistilBERT achieves near-transformer performance at approximately 60% of BERT’s parameter count.

5.3 Shallow 1D-CNN Detector

Table 12 reports 1D-CNN results. Figure 4 shows training curves and score distributions; Figure 5 shows the degradation curve under progressive humanization.

Table 12. 1D-CNN results across evaluation conditions.
Condition   AUROC   Acc.    Brier   Log Loss  Hum.    LLM
hc3→hc3     0.9995  0.9916  0.0067  0.0262    0.0093  0.9862
hc3→eli5    0.8303  0.7124  0.2446  1.0887    0.1192  0.5275
eli5→eli5   0.9982  0.9748  0.0191  0.0666    0.0477  0.9844
eli5→hc3    0.8432  0.6866  0.2752  1.4723    0.0730  0.4455

Figure 4. Training dynamics and detectability behavior of the 1D-CNN detector.
Top: rapid convergence to high validation AUC on both datasets. Bottom: score distributions indicating strong separability between human and LLM text.

Figure 5. 1D-CNN degradation curve under progressive text humanization. The x-axis represents the fraction of human tokens mixed into otherwise LLM-generated text. The steep, smooth decline confirms that the 1D-CNN is highly sensitive to even small amounts of human-style n-gram patterns.

Key Observation. The 1D-CNN achieves near-perfect in-distribution AUROC (0.9995 on HC3) — competitive with full transformers — while having 20× fewer parameters. Cross-domain performance drops to 0.83–0.84, indicating that learned n-gram patterns are domain-specific but still substantially more transferable than pure classical features.

5.4 Stylometric and Statistical Hybrid Detector

Tables 13–15 report results for all three classifiers trained on the extended stylometric feature set. Figure 6 shows the AUROC heatmap across classifiers and evaluation conditions.

Table 13. Stylometric Hybrid — Logistic Regression results.
Condition   AUROC   Acc.    Brier   Log Loss  Hum.    LLM     Sep.
hc3→hc3     0.9721  0.9243  0.0580  0.2093    0.1273  0.8753  0.7480
hc3→eli5    0.6731  0.6296  0.2539  0.8110    0.3668  0.5502  0.1834
eli5→eli5   0.9448  0.8807  0.0897  0.3003    0.1823  0.8166  0.6343
eli5→hc3    0.7348  0.6650  0.2669  1.2185    0.2006  0.4941  0.2935

Table 14. Stylometric Hybrid — Random Forest results.
Condition   AUROC   Acc.    Brier   Log Loss  Hum.    LLM     Sep.
hc3→hc3     0.9981  0.9785  0.0189  0.0854    0.0731  0.9362  0.8631
hc3→eli5    0.8557  0.7516  0.1768  0.5586    0.1699  0.5628  0.3929
eli5→eli5   0.9934  0.9605  0.0395  0.1626    0.1371  0.8759  0.7388
eli5→hc3    0.8848  0.6589  0.2100  0.6123    0.1164  0.4363  0.3199

Table 15. Stylometric Hybrid — XGBoost results.
Condition   AUROC   Acc.    Brier   Log Loss  Hum.    LLM     Sep.
hc3→hc3     0.9996  0.9928  0.0059  0.0226    0.0179  0.9912  0.9733
hc3→eli5    0.8633  0.7252  0.2270  0.9451    0.0673  0.5033  0.4361
eli5→eli5   0.9971  0.9732  0.0197  0.0714    0.0529  0.9620  0.9091
eli5→hc3    0.9037  0.7275  0.2281  0.9624    0.0439  0.4808  0.4369

Figure 6.
Stylometric hybrid AUROC heatmap. Rows correspond to classifiers (Logistic Regression, Random Forest, XGBoost), while columns represent the four evaluation conditions (eli5-to-eli5, eli5-to-hc3, hc3-to-eli5, hc3-to-hc3). Cell colors range from red (0.5) to dark green (1.0). XGBoost dominates across all conditions; the cross-domain eli5-to-hc3 AUROC of 0.904 represents a substantial improvement over the classical Stage 1 Random Forest (0.634).

Key Observation. XGBoost on the full stylometric feature set achieves AUROC = 0.9996 in-distribution — on par with fine-tuned transformers — while remaining fully interpretable. The extended feature set (particularly sentence-level perplexity CV, connector density, and AI-phrase density) substantially improves cross-domain performance over the classical Stage 1 feature set alone, with XGBoost eli5-to-hc3 reaching 0.904 versus Random Forest’s 0.634 in the classical setting.

5.5 Stage 1 Key Conclusions

1. Fine-tuned encoder transformers dominate all other families. RoBERTa achieves the highest in-distribution AUROC (0.9994 on HC3), confirming that task-specific fine-tuning on paired human/LLM data is the most effective detection strategy.
2. Cross-domain degradation is universal and substantial. Every detector family suffers AUROC drops of 5–30 points when trained on one dataset and tested on the other, indicating that no current detector generalizes robustly across LLM sources and domains.
3. The 1D-CNN achieves near-transformer in-distribution performance with 20× fewer parameters. Its cross-domain performance (0.83–0.84) reveals that learned n-gram patterns are dataset-specific rather than universally generalizable.
4. DeBERTa-v3 is competitive in-distribution but severely miscalibrated cross-domain. Following FP32 precision correction, it reaches AUROC 0.991 (HC3) and 0.953 (ELI5) in-distribution.
Cross-domain transfer exposes a critical failure: hc3-to-eli5 accuracy collapses to 0.573 (log loss 4.052) despite AUROC 0.876, indicating well-ordered but poorly calibrated scores — consistent with overfitting to HC3’s formal register.
5. The XGBoost stylometric hybrid matches transformer in-distribution performance while remaining fully interpretable. Sentence-level perplexity CV, connector density, and AI-phrase density are the most discriminative features.
6. Length matching was critical. Without the ±20% length normalization, classical detectors would have trivially exploited the well-known length disparity between human and LLM answers, inflating reported performance.

6 LLM-as-Detector and Contrastive Likelihood Detection

This section evaluates generative LLMs as zero-parameter AI-text detectors across six model scales — from sub-2B to a frontier API model — under three prompting regimes. The pipeline incorporates calibrated threshold analysis alongside fixed-threshold evaluation, and a hybrid confidence-logit scoring scheme for Chain-of-Thought outputs.

6.1 Prompting Paradigms

Zero-Shot prompting presents only a system instruction and target text. Detection scores are derived via constrained decoding: the next-token log-probability distribution is read at the final prompt position, and a soft [0, 1] detectability score is computed from the softmax of log P(LLM) versus log P(human), yielding a continuous score without generation.

Few-Shot prompting augments the zero-shot prompt with k labeled examples drawn from the training pool (k = 3 for sub-2B models; k = 5 otherwise), with TF-IDF based semantic retrieval used for larger models to select maximally informative demonstrations.

Chain-of-Thought (CoT) prompting instructs the model to reason across structured linguistic dimensions before delivering a final VERDICT.
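The zero-shot constrained-decoding score reduces to a two-way softmax over the log-probabilities of the competing verdict tokens. A minimal sketch, with hypothetical logit values standing in for real model outputs:

```python
import math

def soft_detectability(logp_ai: float, logp_human: float) -> float:
    """Two-way softmax over the 'AI' vs. 'human' next-token log-probabilities,
    read at the final prompt position; returns a continuous P(AI) in [0, 1]."""
    m = max(logp_ai, logp_human)          # subtract max for numerical stability
    e_ai = math.exp(logp_ai - m)
    e_hum = math.exp(logp_human - m)
    return e_ai / (e_ai + e_hum)

# Hypothetical log-probs: the model slightly favors the 'AI' verdict token,
# so the soft score lands above 0.5 without any text generation.
score = soft_detectability(logp_ai=-1.2, logp_human=-2.0)
assert 0.5 < score < 1.0
```

Because the score is continuous rather than a hard yes/no, it supports rank-based metrics such as AUROC directly.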
CoT scoring employs a hybrid confidence-logit scheme: when the model produces a parseable numerical confidence estimate alongside its verdict, this is combined with the logit-derived soft score at the verdict token position using equal weighting; otherwise, the logit-only score is used. CoT is restricted to models with sufficient instruction-following capacity; sub-2B models are excluded.

Three threshold strategies are reported: fixed at 0.5 (acc@0.5), calibrated at the score median (acc@median), and optimal Youden-J (acc@optimal).

6.2 Tiny-Scale Models: TinyLlama-1.1B-Chat-v1.0 and Qwen2.5-1.5B

Both models are evaluated under zero-shot and few-shot regimes on 500 balanced samples per dataset, loaded in FP16 with device_map="auto".

Table 16. Tiny-scale LLM-as-detector results.
Model                     Regime     Dataset  AUROC   acc@0.5  acc@median  acc@optimal
TinyLlama-1.1B-Chat-v1.0  Zero-Shot  HC3      0.5653  0.558    0.534       0.558
TinyLlama-1.1B-Chat-v1.0  Zero-Shot  ELI5     0.5072  0.524    0.510       0.524
TinyLlama-1.1B-Chat-v1.0  Few-Shot   HC3      0.6198  0.614    0.600       0.614
TinyLlama-1.1B-Chat-v1.0  Few-Shot   ELI5     0.5860  0.580    0.566       0.580
Qwen2.5-1.5B-Instruct     Zero-Shot  HC3      0.5221  0.436    0.530       0.574
Qwen2.5-1.5B-Instruct     Zero-Shot  ELI5     0.5205  0.470    0.512       0.562
Qwen2.5-1.5B-Instruct     Few-Shot   HC3      0.4794  0.450    0.518       0.536
Qwen2.5-1.5B-Instruct     Few-Shot   ELI5     0.6340  0.484    0.620       0.620

Both models perform near chance across all conditions (AUROC 0.48–0.63), confirming that detection as a meta-cognitive task does not emerge at sub-2B scale. The threshold analysis surfaces a qualitatively important finding: Qwen2.5-1.5B-Instruct zero-shot scores cluster systematically above 0.5 (median ≈ 0.75–0.80), yet AUROC remains near chance — a score-collapsing pattern in which the model emits uniformly high detectability scores regardless of label, yielding poor rank ordering rather than a polarity inversion.
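The acc@optimal strategy picks the cutoff maximizing Youden's J = TPR − FPR. A small self-contained sketch, using toy scores and labels rather than benchmark data:

```python
def youden_j_threshold(scores, labels):
    """Return (threshold, J) maximizing Youden's J = TPR - FPR.
    labels: 1 = LLM-generated (positive), 0 = human."""
    pos = sum(labels)
    neg = len(labels) - pos
    best_t, best_j = 0.5, float("-inf")
    for t in sorted(set(scores)):
        # Classify score >= t as LLM-generated and measure both error rates.
        tpr = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t) / pos
        fpr = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= t) / neg
        if tpr - fpr > best_j:
            best_j, best_t = tpr - fpr, t
    return best_t, best_j

# Toy data: positives cluster high, with one noisy negative at 0.75.
scores = [0.9, 0.8, 0.7, 0.75, 0.3, 0.2]
labels = [1,   1,   1,   0,    0,   0]
t, j = youden_j_threshold(scores, labels)  # t = 0.7, J = 2/3
```

Calibrating at the median (acc@median) instead simply splits the score distribution in half, which is why it rescues accuracy for score-collapsed models like Qwen2.5-1.5B even when AUROC stays near chance.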
TinyLlama’s few-shot median score shifts from ≈ 0.26 (zero-shot) to ≈ 0.69 (few-shot), reflecting a format-induced distributional shift rather than improved class discrimination.

6.3 Mid-Scale Models: Llama-3.1-8B-Instruct and Qwen2.5-7B

Both 8B models are evaluated under all three regimes. Zero-shot and few-shot use constrained decoding on 500 samples; CoT uses full autoregressive generation (max 400 tokens, greedy decoding) on 70 samples.

Table 17. Mid-scale LLM-as-detector results. FP/FN denote false positive/negative counts.
Model                  Regime     Dataset  AUROC   acc@0.5  acc@optimal  FP   FN
Llama-3.1-8B-Instruct  Zero-Shot  HC3      0.7295  0.680    0.680        48   120
Llama-3.1-8B-Instruct  Zero-Shot  ELI5     0.7508  0.670    0.702        106  57
Llama-3.1-8B-Instruct  Few-Shot   HC3      0.5027  0.546    0.550        53   172
Llama-3.1-8B-Instruct  Few-Shot   ELI5     0.5961  0.574    0.578        77   140
Llama-3.1-8B-Instruct  CoT        HC3      0.6771  0.629    0.657        11   18
Llama-3.1-8B-Instruct  CoT        ELI5     0.5988  0.586    0.600        15   13
Qwen2.5-7B-Instruct    Zero-Shot  HC3      0.6902  0.656    0.666        70   102
Qwen2.5-7B-Instruct    Zero-Shot  ELI5     0.6638  0.620    0.632        60   130
Qwen2.5-7B-Instruct    Few-Shot   HC3      0.4579  0.484    0.502        132  126
Qwen2.5-7B-Instruct    Few-Shot   ELI5     0.5042  0.524    0.542        125  113
Qwen2.5-7B-Instruct    CoT        HC3      0.6388  0.514    0.657        3    12
Qwen2.5-7B-Instruct    CoT        ELI5     0.7808  0.614    0.743        20   3

Llama-3.1-8B-Instruct achieves competitive zero-shot AUROC of 0.730–0.751, demonstrating that genuine detection signal emerges at 8B scale without in-context examples. However, few-shot prompting markedly degrades performance (AUROC 0.503–0.596). Qwen2.5-7B CoT on ELI5 achieves 0.781, the highest among mid-scale models.

6.4 Large-Scale Models: LLaMA-2-13B-Chat Pipeline Design

Llama-2-13b-chat-hf is evaluated under all three regimes, loaded with 4-bit NF4 quantization (double quantization, FP16 compute dtype). Zero-shot and few-shot use 200 samples per dataset; CoT uses 30 samples with full generation (max 400 tokens, greedy decoding).

Token-level debug analysis revealed that Llama-2-13b-chat-hf exhibits a strong unconditional “no” bias.
Rather than inverting this post-hoc, the pipeline resolves it structurally via prompt polarity swapping: the model is asked “Was this text written by a human?” with yes = human and no = AI-generated. Additionally, a task-specific prior is computed by averaging yes/no logits over 50 real task prompts drawn from the evaluation pool; subtracting this prior removes task-level marginal bias while preserving sample-discriminative signal.

The CoT prompt frames the task as a stylometric analysis — avoiding the term “AI detection” to circumvent LLaMA-2’s safety-oriented refusal behaviors. The model scores seven linguistic dimensions on a 0–10 scale. The hybrid confidence-logit ensemble weights confirmed confidence and logit scores at 0.6/0.4 when confidence falls outside the dead zone [0.40, 0.60]; otherwise the logit score is used alone.

Table 18. Llama-2-13b-chat-hf detection results (n = 200 zero/few-shot; n = 30 CoT).
Regime     Dataset  AUROC   acc@0.5  acc@median  acc@optimal  FP  FN
Zero-Shot  HC3      0.8124  0.715    0.710       0.760        21  36
Zero-Shot  ELI5     0.8098  0.750    0.755       0.760        28  22
Few-Shot   HC3      0.6678  0.635    0.630       0.660        28  45
Few-Shot   ELI5     0.6374  0.590    0.620       0.620        38  44
CoT        HC3      0.8778  0.833    0.867       0.867        2   4
CoT        ELI5     0.8978  0.733    0.800       0.867        2   3

The corrected pipeline yields substantially improved results relative to the original implementation (AUROC 0.363–0.705, attributable to polarity and prior misconfiguration). Zero-shot AUROC of 0.810–0.812 is consistent across both datasets, and CoT peaks at 0.878 and 0.898 on HC3 and ELI5 respectively — the strongest CoT results among all open-source models.

6.5 Large-Scale Models: Qwen2.5-14B-Instruct Pipeline Design

Qwen2.5-14B-Instruct is evaluated under all three regimes using the same swapped polarity convention, loaded with 4-bit NF4 quantization and BFloat16 compute dtype. The original implementation suffered from 86.7–90.0% unknown rates due to two failure modes: premature generation termination when eos_token_id was used as pad_token_id, and an insufficient max_new_tokens = 350 budget.
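The task prior subtraction can be sketched in a few lines. This is an illustrative sketch, not the paper's code: `prior_corrected_margin` is a hypothetical helper, the margin values are invented, and it assumes the per-sample score is the (yes − no) logit gap under the swapped polarity convention.

```python
def prior_corrected_margin(sample_margin: float, prompt_margins: list[float]) -> float:
    """Subtract the task-level mean (yes - no) logit margin, estimated over
    real task prompts, from a single sample's margin. This removes the model's
    unconditional yes/no bias while preserving sample-discriminative signal."""
    prior = sum(prompt_margins) / len(prompt_margins)
    return sample_margin - prior

# Hypothetical margins: the model answers 'no' almost unconditionally
# (prior ~ -2.0), so a raw margin of -1.5 is actually evidence FOR 'yes'.
prior_pool = [-2.2, -1.9, -2.0, -1.8, -2.1]
corrected = prior_corrected_margin(-1.5, prior_pool)
assert corrected > 0.0
```

Without this correction, every sample inherits the same large negative offset, which shifts scores wholesale toward the "no" verdict and produces the inverted or near-random rankings reported for the uncorrected pipeline.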
The corrected pipeline sets pad_token_id explicitly and increases max_new_tokens to 500. Task framing adopts a “forensic linguist performing authorship attribution analysis” persona to minimize safety-motivated refusals.

Table 19. Qwen2.5-14B-Instruct detection results (n = 200 zero/few-shot; n = 30 CoT).
Regime     Dataset  AUROC   acc@0.5  acc@median  acc@optimal  FP  FN
Zero-Shot  HC3      0.6686  0.680    0.660       0.680        31  33
Zero-Shot  ELI5     0.7294  0.690    0.655       0.695        41  21
Few-Shot   HC3      0.3153  0.385    0.390       0.500        52  71
Few-Shot   ELI5     0.4262  0.470    0.465       0.500        59  47
CoT        HC3      0.6622  0.700    0.667       0.733        4   4
CoT        ELI5     0.8000  0.733    0.667       0.767        4   3

Table 20. Qwen2.5-14B-Instruct CoT component analysis: hybrid vs. logit-only scoring.
Dataset  Score Type  n   AUROC   Accuracy  % of Total
HC3      conf+logit  11  0.6250  0.727     36.7%
HC3      logit_only  19  0.6429  0.684     63.3%
ELI5     conf+logit  15  0.8519  0.800     50.0%
ELI5     logit_only  15  0.7593  0.667     50.0%

6.6 GPT-4o-mini as Detector

GPT-4o-mini is evaluated via the OpenAI API using a structured 7-dimension rubric scoring protocol across three regimes. Unlike local models — where constrained logit decoding is used — GPT-4o-mini employs a rubric-based elicitation strategy that forces the model to commit to seven independent dimension scores (hedging/formulaic language, response completeness, personal voice, lexical uniformity, structural neatness, response fit, formulaic tells) before producing a final AI_SCORE ∈ [0, 100]. This design circumvents the known RLHF-induced suppression of numeric probability outputs. A full five-metric evaluation suite is applied with 1,000-iteration bootstrap confidence intervals (n = 200 for zero-shot/few-shot; n = 50 for CoT).

Table 21. GPT-4o-mini (LLM-as-detector) results. All score directions verified correct.
Model        Regime  Data  AUROC   Acc.@0.5  Acc.@Opt.  Sep.
GPT-4o-mini  ZS      HC3   0.8470  0.7600    0.7900     +0.311
GPT-4o-mini  ZS      ELI5  0.9093  0.8800    0.8800     +0.419
GPT-4o-mini  FS      HC3   0.7163  0.7000    0.7000     +0.187
GPT-4o-mini  FS      ELI5  0.7824  0.6800    0.7400     +0.246
GPT-4o-mini  CoT     HC3   0.8056  0.7800    0.8000     +0.279
GPT-4o-mini  CoT     ELI5  0.7744  0.7600    0.8000     +0.261

Finding 1: Structured rubric prompting outperforms constrained decoding. GPT-4o-mini achieves the highest zero-shot AUROC of all five models evaluated (0.8470 vs. 0.8124 on HC3; 0.9093 vs. 0.8098 on ELI5 relative to LLaMA-2-13B).

Table 22. Zero-shot to few-shot AUROC degradation on HC3.
Model                  ZS AUROC  FS AUROC  ∆
Qwen2.5-14B-Instruct   0.6686    0.3153    −0.353
Qwen2.5-7B-Instruct    0.6902    0.4579    −0.232
Llama-3.1-8B-Instruct  0.7295    0.5027    −0.227
Llama-2-13b-chat-hf    0.8124    0.6678    −0.145
GPT-4o-mini            0.8470    0.7163    −0.131

Finding 2: GPT-4o-mini degrades least under few-shot prompting.

Finding 3: CoT underperforms zero-shot for GPT-4o-mini. AUROC drops from 0.8470 to 0.8056 on HC3 (∆ = −0.041) and from 0.9093 to 0.7744 on ELI5 (∆ = −0.135). Adding per-dimension reasoning to an already-explicit rubric introduces noise rather than precision.

Finding 4: ELI5 is easier than HC3 under zero-shot. GPT-4o-mini achieves AUROC 0.9093 on ELI5 versus 0.8470 on HC3 (∆ = +0.062). ELI5’s Mistral-7B-generated text carries stronger stylistic markers than HC3’s ChatGPT-3.5 text.

Stage 1b Conclusions

Detection capability scales non-monotonically with parameter count. Sub-2B models perform near random (AUROC 0.48–0.63); meaningful discrimination first appears at 8B and consolidates at 13B (Llama-2-13b-chat-hf zero-shot: 0.810–0.812). Qwen2.5-14B zero-shot (0.669–0.729) underperforms Llama-2-13b-chat-hf at the same regime, indicating that RLHF alignment strategy and prompt polarity interact with scale in ways that confound simple parameter-count comparisons.

Prompt polarity correction and task prior subtraction are necessary conditions for valid constrained decoding.
Naive constrained decoding without prior correction produces systematically inverted or near-random scores due to RLHF-induced unconditional response biases.

CoT prompting provides the largest and most consistent gains, contingent on correct implementation. Llama-2-13b-chat-hf CoT peaks at AUROC 0.878–0.898, and Qwen2.5-7B-Instruct CoT reaches 0.781 on ELI5. CoT gains are contingent on sufficient generation budget, correct pad_token_id handling, safety-neutral prompt framing, and a robust multi-fallback verdict parser.

Few-shot prompting is consistently harmful across all model scales. Few-shot degrades AUROC relative to zero-shot: Llama-3.1-8B-Instruct (0.503–0.596 vs. 0.730–0.751), Llama-2-13b-chat-hf (0.637–0.668 vs. 0.810–0.812), Qwen2.5-14B-Instruct (0.315–0.426), and GPT-4o-mini on HC3 (0.7163 vs. 0.8470).

The generator–detector identity confound is critical. Mistral-7B-Instruct, used to generate the ELI5 LLM answers, performed near or below random as a detector (AUROC 0.363–0.540). A model cannot reliably detect its own outputs.

No LLM-as-detector configuration approaches supervised fine-tuned encoders. The best result — GPT-4o-mini zero-shot on ELI5 at AUROC 0.9093 — remains well below RoBERTa in-distribution (AUROC 0.9994).

6.7 Contrastive Likelihood Detection

The contrastive score is defined as:

S(x) = log P_large(x) − log P_small(x)    (1)

Table 23. Contrastive likelihood detection results.
Variant         Dataset  AUROC   Score Sep.
base_contrast   HC3      0.5007  0.0007
base_contrast   ELI5     0.6873  0.1873
multi_scale     HC3      0.5007  0.0007
multi_scale     ELI5     0.6873  0.1873
token_variance  HC3      0.6323  0.1323
token_variance  ELI5     0.5644  0.0644
hybrid          HC3      0.5999  0.0446
hybrid          ELI5     0.7615  0.1463

The hybrid score achieves AUROC of 0.762 on ELI5 but near-random performance on HC3 (0.600 and below). The performance gap is explained by a representational affinity constraint: GPT-2 and Mistral-7B share architectural and pretraining characteristics, while ChatGPT (GPT-3.5) underwent extensive RLHF alignment at a larger parameter scale.
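Equation (1) can be computed directly from per-token log-probabilities. The sketch below is illustrative: the values are hypothetical, standing in for the log-probs a large and a small reference model would assign to the same token sequence.

```python
def contrastive_score(logps_large, logps_small):
    """S(x) = log P_large(x) - log P_small(x) (Eq. 1), with each sequence
    log-likelihood taken as the sum of per-token log-probabilities."""
    assert len(logps_large) == len(logps_small)
    return sum(logps_large) - sum(logps_small)

# Hypothetical per-token log-probs for one text: the large model explains
# the text markedly better than the small one, so S(x) > 0, which this
# detector family reads as evidence of LLM origin.
large = [-1.1, -0.9, -1.0]
small = [-1.8, -1.6, -1.5]
assert contrastive_score(large, small) > 0.0
```

Because S(x) depends only on the gap between the two reference models, it cancels text-length and topic effects that both models share, which is the motivation for contrasting two scales rather than thresholding a single likelihood.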
7 Perplexity-Based Detectors

7.1 Method

Perplexity-based detection is unsupervised and training-free. Because GPT-2 and GPT-Neo family models assign systematically lower perplexity to LLM-generated text than to human-written text, the detectability score is an inversion of the raw perplexity signal:

PPL(x) = exp( −(1/T) · Σ_{t=1..T} log P(x_t | x_1, …, x_{t−1}) )    (2)

Five reference models are evaluated: GPT-2 Small (117M), GPT-2 Medium (345M), GPT-2 XL (1.5B), GPT-Neo-125M, and GPT-Neo-1.3B. All models are run in FP16 on the full HC3 and ELI5 test sets. A sliding window (512-token window, 256-token stride) handles long texts. Outlier perplexities are clipped at 10,000 for rank stability. Raw perplexities are converted to [0, 1] detectability scores via four normalization methods: rank-based, log-rank, minmax, and sigmoid. The best method per condition is selected by AUROC, with optimal decision thresholds identified via Youden’s J statistic.

7.2 Results

Table 24. Perplexity-based detector results. Method = best normalization by AUROC.
Model         Data  Method  AUROC   Brier   Acc@Opt  Hum.    LLM     Sep.
GPT-2 Small   HC3   rank    0.9099  0.1284  0.8805   0.2950  0.7050  0.4100
GPT-2 Small   ELI5  rank    0.9073  0.1297  0.8378   0.2963  0.7037  0.4074
GPT-2 Medium  HC3   rank    0.9047  0.1310  0.8804   0.2976  0.7024  0.4047
GPT-2 Medium  ELI5  rank    0.9275  0.1196  0.8546   0.2862  0.7138  0.4276
GPT-2 XL      HC3   rank    0.8917  0.1375  0.8609   0.3041  0.6959  0.3918
GPT-2 XL      ELI5  rank    0.9314  0.1176  0.8660   0.2843  0.7157  0.4315
GPT-Neo-125M  HC3   rank    0.9173  0.1247  0.8860   0.2913  0.7087  0.4173
GPT-Neo-125M  ELI5  minmax  0.8968  0.4597  0.8177   0.9578  0.9857  0.0279
GPT-Neo-1.3B  HC3   rank    0.8999  0.1334  0.8785   0.3000  0.7000  0.3999
GPT-Neo-1.3B  ELI5  rank    0.9261  0.1203  0.8534   0.2869  0.7131  0.4261

Table 25. Cross-model median perplexity statistics. LLM text exhibits perplexity consistently 0.24–0.45× that of human text.
Model         Data  Hum. Med.  LLM Med.  Ratio
GPT-2 Small   HC3   44.31      11.35     0.256
GPT-2 Small   ELI5  40.43      17.99     0.445
GPT-2 Medium  HC3   33.09      8.17      0.247
GPT-2 Medium  ELI5  30.92      12.95     0.419
GPT-2 XL      HC3   26.42      6.38      0.242
GPT-2 XL      ELI5  25.08      10.02     0.400
GPT-Neo-125M  HC3   46.52      11.20     0.241
GPT-Neo-125M  ELI5  42.02      18.99     0.452
GPT-Neo-1.3B  HC3   26.82      6.38      0.238
GPT-Neo-1.3B  ELI5  25.45      10.48     0.412

Perplexity-based detection achieves AUROC ranging from 0.891 to 0.931 across all well-behaved conditions. Reference model scale has negligible impact: GPT-2 Small and GPT-2 XL achieve nearly identical AUROC on HC3 (0.910 vs. 0.892). Rank-based normalization is selected as optimal in 9 of 10 conditions.

8 Cross-LLM Generalization Study

8.1 Experimental Design and Dataset Construction

Stage 3 evaluates whether detectors trained on ChatGPT-generated text (HC3) generalize to outputs from unseen LLMs. Five open-source models serve as unseen source LLMs: TinyLlama-1.1B, Qwen2.5-1.5B, Qwen2.5-7B, Llama-3.1-8B-Instruct, and LLaMA-2-13B. Each generates 200 responses per dataset, yielding 2,000 LLM-generated samples per dataset against human pools of 4,621 (HC3) and 2,858 (ELI5) texts. All detectors are evaluated zero-shot — with no retraining on any unseen LLM’s outputs.

8.2 Neural Detector Cross-LLM Evaluation

Setup. Five HC3-trained transformer detectors — BERT, RoBERTa, ELECTRA, DistilBERT, and DeBERTa-v3-base — are evaluated zero-shot against outputs from all five unseen source LLMs on both HC3 and ELI5 domains, using the five-metric suite (AUROC, AUPRC, EER, Brier, FPR@95%TPR) with 1,000-iteration bootstrap CIs.

Table 26. Cross-LLM generalization results for HC3-trained neural detectors (selected conditions).
Detector        Source LLM                Dataset  AUROC  AUROC CI        AUPRC  EER    Brier  FPR@95
BERT-HC3        TinyLlama-1.1B-Chat-v1.0  HC3      0.960  [0.942, 0.977]  0.952  0.075  0.130  0.165
BERT-HC3        TinyLlama-1.1B-Chat-v1.0  ELI5     0.952  [0.929, 0.973]  0.940  0.097  0.174  0.115
BERT-HC3        Qwen2.5-1.5B-Instruct     HC3      0.917  [0.887, 0.943]  0.898  0.140  0.150  0.335
BERT-HC3        Qwen2.5-1.5B-Instruct     ELI5     0.876  [0.842, 0.913]  0.845  0.190  0.197  0.370
BERT-HC3        Llama-2-13b-chat-hf       HC3      0.969  [0.952, 0.984]  0.965  0.067  0.126  0.080
BERT-HC3        Llama-2-13b-chat-hf       ELI5     0.973  [0.956, 0.988]  0.965  0.075  0.163  0.095
RoBERTa-HC3     Llama-3.1-8B-Instruct     HC3      0.993  [0.987, 0.997]  0.993  0.040  0.051  0.030
RoBERTa-HC3     Qwen2.5-1.5B-Instruct     ELI5     0.858  [0.820, 0.893]  0.823  0.217  0.273  0.410
RoBERTa-HC3     Llama-2-13b-chat-hf       HC3      0.990  [0.980, 0.999]  0.993  0.025  0.047  0.005
ELECTRA-HC3     Qwen2.5-7B-Instruct       ELI5     0.942  [0.917, 0.965]  0.922  0.105  0.207  0.175
ELECTRA-HC3     Llama-3.1-8B-Instruct     ELI5     0.951  [0.926, 0.973]  0.928  0.090  0.203  0.125
ELECTRA-HC3     Llama-2-13b-chat-hf       ELI5     0.968  [0.949, 0.986]  0.951  0.067  0.203  0.075
DistilBERT-HC3  Qwen2.5-1.5B-Instruct     ELI5     0.845  [0.806, 0.882]  0.800  0.232  0.260  0.445
DistilBERT-HC3  Llama-2-13b-chat-hf       ELI5     0.985  [0.976, 0.993]  0.985  0.055  0.079  0.055
DeBERTa-HC3     Qwen2.5-1.5B-Instruct     HC3      0.923  [0.891, 0.953]  0.867  0.092  0.103  0.105
DeBERTa-HC3     TinyLlama-1.1B-Chat-v1.0  ELI5     0.500  [0.441, 0.560]  0.451  0.512  0.434  0.595
DeBERTa-HC3     Llama-2-13b-chat-hf       ELI5     0.499  [0.436, 0.561]  0.451  0.522  0.431  0.590

Key observations. Cross-LLM generalization within a fixed domain is broadly achievable: RoBERTa achieves HC3 AUROC 0.976–0.993 across all unseen source LLMs. Domain shift is the primary generalization bottleneck — DeBERTa collapses to near-random on ELI5 (0.499–0.607) regardless of source LLM. ELECTRA is the most domain-robust detector, with ELI5 scores ranging 0.910–0.968. Llama-2-13b-chat-hf is the most consistently detectable source LLM; Qwen2.5-1.5B-Instruct is the hardest to detect.
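The bootstrap CIs reported in Table 26 can be sketched with a percentile bootstrap over resampled (score, label) pairs. This is an illustrative pure-Python version with a naive pairwise AUROC and a hypothetical helper name `bootstrap_ci`, not the benchmark's evaluation code:

```python
import random

def auroc(scores, labels):
    """Probability a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_ci(scores, labels, iters=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for AUROC: resample pairs with replacement,
    recompute the statistic, and take the alpha/2 and 1 - alpha/2 quantiles."""
    rng = random.Random(seed)
    n = len(scores)
    stats = []
    for _ in range(iters):
        sample = [rng.randrange(n) for _ in range(n)]
        s = [scores[i] for i in sample]
        y = [labels[i] for i in sample]
        if len(set(y)) < 2:          # resample lost one class entirely; skip it
            continue
        stats.append(auroc(s, y))
    stats.sort()
    k = len(stats)
    return stats[int(alpha / 2 * k)], stats[min(k - 1, int((1 - alpha / 2) * k))]
```

The naive O(P·N) AUROC is fine at toy scale; the actual suite would use a rank-based implementation for the 200-sample evaluation sets.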
8.3 Embedding-Space Generalization via Classical Classifiers

All texts are encoded using all-MiniLM-L6-v2 (384-dimensional embeddings), and three classical classifiers — LR, SVM (RBF), and RF (200 trees) — are trained and evaluated under a full 5×5 train-test matrix. Human texts are split into disjoint train/test partitions for leakage-free evaluation.

Table 27. Stage 3B embedding-space generalization on HC3 (selected). In-distribution conditions marked with *.
Classifier  Train LLM                 Test LLM                  AUROC
SVM         TinyLlama-1.1B-Chat-v1.0  TinyLlama-1.1B-Chat-v1.0  0.976 *
SVM         TinyLlama-1.1B-Chat-v1.0  Llama-3.1-8B-Instruct     0.844
SVM         Qwen2.5-7B-Instruct       Qwen2.5-1.5B-Instruct     0.941
SVM         Llama-2-13b-chat-hf       Llama-2-13b-chat-hf       0.992 *
SVM         Llama-2-13b-chat-hf       Llama-3.1-8B-Instruct     0.818
RF          TinyLlama-1.1B-Chat-v1.0  TinyLlama-1.1B-Chat-v1.0  1.000 *
RF          TinyLlama-1.1B-Chat-v1.0  Llama-3.1-8B-Instruct     0.812
RF          Llama-2-13b-chat-hf       Qwen2.5-1.5B-Instruct     0.755
LR          Llama-2-13b-chat-hf       Llama-3.1-8B-Instruct     0.760
LR          Qwen2.5-1.5B-Instruct     Qwen2.5-7B-Instruct       0.885

SVM is the most generalizable classifier (off-diagonal AUROC 0.818–0.941). Sentence embedding classifiers are substantially more domain-robust than fine-tuned neural detectors, with HC3/ELI5 divergence < 0.03 AUROC on average.

8.4 Distribution Shift Analysis in Representation Space

Embeddings are extracted from DeBERTa-v3-base’s penultimate CLS layer, PCA-projected to 64 dimensions, and three distance metrics are computed under a Gaussian approximation.

KL Divergence captures the information lost when approximating the source LLM’s embedding distribution with ChatGPT’s training distribution. Its asymmetry is deliberate: we are specifically interested in regions where the unseen LLM’s outputs have probability mass that the detector’s training distribution does not cover — precisely the scenario that causes detection failure.
D_KL(P‖Q) = ½ [ tr(Σ_Q⁻¹ Σ_P) + (μ_Q − μ_P)ᵀ Σ_Q⁻¹ (μ_Q − μ_P) − d + ln(|Σ_Q| / |Σ_P|) ]    (3)

Wasserstein-2 Distance measures the minimum transport cost between the two distributions under the squared Euclidean metric, providing a geometrically interpretable and symmetric characterization of distributional shift. Unlike KL divergence, it remains well-defined even when the two distributions have non-overlapping support — an important property given that different LLM families may occupy disjoint regions of embedding space.

W₂(P, Q) = √( ‖μ_P − μ_Q‖² + tr( Σ_P + Σ_Q − 2 (Σ_P^{1/2} Σ_Q Σ_P^{1/2})^{1/2} ) )    (4)

Fréchet Distance is included as a cross-validation of the Wasserstein estimate, drawing on its established use in generative model evaluation (FID) as a measure of representational divergence between two Gaussian-approximated distributions. Its close relationship to W₂² allows direct comparison, with any divergence between the two metrics indicating sensitivity to the symmetrizing square root in the covariance term.

FD(P, Q) = ‖μ_P − μ_Q‖² + tr( Σ_P + Σ_Q − 2 (Σ_P Σ_Q)^{1/2} )    (5)

Spearman rank correlation is used rather than Pearson’s r to test the distance-degradation relationship, as it makes no assumption about the linearity of the association between embedding-space distance and AUROC drop — a sensible precaution given that detection failure may saturate or threshold at extreme distances. Correlations are computed separately for HC3 and ELI5 domains, with 500-iteration bootstrap confidence bands on regression lines, to assess whether domain modulates the distance-difficulty relationship.

Table 28. Spearman rank correlations between distributional distance and AUROC drop. * = p < 0.05.
Metric         HC3 ρ   HC3 p  ELI5 ρ  ELI5 p
KL Divergence  −0.298  0.148  −0.443  0.027 *
Wasserstein-2  −0.369  0.070  −0.322  0.117
Fréchet        −0.369  0.070  −0.322  0.117

Table 29. Per-detector distributional distances and AUROC drop on HC3.
Note: Baseline AUROC values for drop computation are taken from the 200-sample evaluation subsets used in Stage 3, not from the full test sets in Tables 7–11. Negative drop indicates cross-LLM performance exceeds the Stage 3 subset baseline.
Detector        Source LLM                KL     W₂     FD     AUROC Drop
BERT-HC3        TinyLlama-1.1B-Chat-v1.0  1.019  0.934  0.872  +0.006
BERT-HC3        Qwen2.5-1.5B-Instruct     0.471  0.682  0.465  +0.050
BERT-HC3        Qwen2.5-7B-Instruct       0.741  0.633  0.400  +0.033
BERT-HC3        Llama-3.1-8B-Instruct     2.015  0.808  0.652  +0.023
BERT-HC3        Llama-2-13b-chat-hf       1.105  0.822  0.676  −0.003
RoBERTa-HC3     TinyLlama-1.1B-Chat-v1.0  1.019  0.934  0.872  +0.019
RoBERTa-HC3     Qwen2.5-1.5B-Instruct     0.471  0.682  0.465  +0.020
RoBERTa-HC3     Llama-3.1-8B-Instruct     2.015  0.808  0.652  +0.005
RoBERTa-HC3     Llama-2-13b-chat-hf       1.105  0.822  0.676  +0.007
ELECTRA-HC3     Qwen2.5-1.5B-Instruct     0.471  0.682  0.465  +0.020
ELECTRA-HC3     Llama-3.1-8B-Instruct     2.015  0.808  0.652  +0.009
ELECTRA-HC3     Llama-2-13b-chat-hf       1.105  0.822  0.676  −0.011
DistilBERT-HC3  Qwen2.5-1.5B-Instruct     0.471  0.682  0.465  +0.085
DistilBERT-HC3  Qwen2.5-7B-Instruct       0.741  0.633  0.400  +0.080
DistilBERT-HC3  Llama-3.1-8B-Instruct     2.015  0.808  0.652  +0.053
DistilBERT-HC3  Llama-2-13b-chat-hf       1.105  0.822  0.676  +0.009
DeBERTa-HC3     TinyLlama-1.1B-Chat-v1.0  1.019  0.934  0.872  −0.026
DeBERTa-HC3     Qwen2.5-1.5B-Instruct     0.471  0.682  0.465  −0.046
DeBERTa-HC3     Llama-3.1-8B-Instruct     2.015  0.808  0.652  −0.045
DeBERTa-HC3     Llama-2-13b-chat-hf       1.105  0.822  0.676  −0.034

All three distance metrics produce negative rather than positive Spearman correlations with AUROC drop, directly contradicting the expectation that geometrically more distant LLMs should be harder to detect. Qwen2.5-1.5B-Instruct and Qwen2.5-7B-Instruct exhibit the smallest embedding distances from ChatGPT yet cause the largest AUROC drops — supporting a proximity-confusion hypothesis.

9 Adversarial Humanization

Setup. Paraphrase-based attacks have been shown to substantially reduce detector accuracy [Krishna et al., 2023].
Following this motivation, two hundred ChatGPT-generated samples are drawn from each dataset (HC3 and ELI5) and subjected to two rounds of humanization using Qwen2.5-1.5B-Instruct (4-bit NF4 quantized), producing three evaluation conditions:

• L0 — original AI-generated text, unmodified.
• L1 — light humanization: varied sentence length, informal register, avoidance of formulaic structure; semantic content preserved.
• L2 — heavy humanization: applied iteratively on L1 output; aggressive removal of AI-like patterns (numbered lists, formal transitions), deliberate conversational imperfections, minor grammatical relaxation permitted.

At each level, detector scores are computed against a fixed pool of 200 human-authored texts from the same dataset. Metrics reported: auroc, detection rate (proportion of AI texts scoring > 0.5), mean P(llm) score, and Brier score.

Table 30. Stage 4 adversarial humanization results.

Detector         Dataset  Level  auroc    Det. Rate  Mean AI  Mean Human  Brier
BERT-HC3         HC3      L0     0.9637   1.000      0.9998   0.2736      0.1278
BERT-HC3         HC3      L1     0.9749   1.000      0.9997   0.2736      0.1278
BERT-HC3         HC3      L2     0.8792   0.870      0.8696   0.2736      0.1914
BERT-HC3         ELI5     L0     0.9530   0.930      0.9249   0.2454      0.1480
BERT-HC3         ELI5     L1     0.9945   0.995      0.9949   0.2454      0.1154
BERT-HC3         ELI5     L2     0.8989   0.850      0.8553   0.2454      0.1817
RoBERTa-HC3      HC3      L0     0.9896   1.000      1.0000   0.0775      0.0374
RoBERTa-HC3      HC3      L1     0.9911   1.000      1.0000   0.0775      0.0374
RoBERTa-HC3      HC3      L2     0.9621   0.910      0.9071   0.0775      0.0819
RoBERTa-HC3      ELI5     L0     0.9443   0.990      0.9899   0.4849      0.2370
RoBERTa-HC3      ELI5     L1     0.9699   1.000      1.0000   0.4849      0.2320
RoBERTa-HC3      ELI5     L2     0.8757   0.905      0.9049   0.4849      0.2794
ELECTRA-HC3      HC3      L0     0.9424   1.000      0.9997   0.4092      0.1958
ELECTRA-HC3      HC3      L1     0.9652   1.000      0.9997   0.4092      0.1958
ELECTRA-HC3      HC3      L2     0.8574   0.890      0.8883   0.4092      0.2497
ELECTRA-HC3      ELI5     L0     0.9540   0.980      0.9795   0.3501      0.1744
ELECTRA-HC3      ELI5     L1     0.9854   1.000      0.9997   0.3501      0.1645
ELECTRA-HC3      ELI5     L2     0.8972   0.885      0.8888   0.3501      0.2184
DistilBERT-HC3   HC3      L0     0.9900   0.995      0.9948   0.1204      0.0580
DistilBERT-HC3   HC3      L1     0.9506   0.895      0.8886   0.1204      0.1036
DistilBERT-HC3   HC3      L2     0.8567   0.675      0.6608   0.1204      0.2131
DistilBERT-HC3   ELI5     L0     0.9462   0.825      0.8203   0.0850      0.1248
DistilBERT-HC3   ELI5     L1     0.9521   0.835      0.8387   0.0850      0.1088
DistilBERT-HC3   ELI5     L2     0.8707   0.645      0.6349   0.0850      0.2089
DeBERTa-HC3      HC3      L0     0.8851   1.000      0.9999   0.2311      0.1140
DeBERTa-HC3      HC3      L1     0.9226   1.000      1.0000   0.2311      0.1140
DeBERTa-HC3      HC3      L2     0.8998   0.910      0.9090   0.2311      0.1584
DeBERTa-HC3      ELI5     L0     0.5252   1.000      0.9999   0.8521      0.4232
DeBERTa-HC3      ELI5     L1     0.5936   1.000      1.0000   0.8521      0.4232
DeBERTa-HC3      ELI5     L2     0.5887   0.915      0.9151   0.8521      0.4655

Light humanization does not reduce detectability. L1 auroc ≥ L0 auroc across all detectors and both domains without exception. Light paraphrasing by a small instruction-tuned model superimposes additional model-specific patterns, rendering the composite text more detectable. Heavy humanization produces consistent but incomplete evasion. RoBERTa is most resistant (L0→L2 drop: 0.028 on HC3). DistilBERT is most susceptible (drop: 0.133; detection rate: 99.5% → 67.5%). No detector falls below auroc 0.857 on HC3 at L2. auroc and detection rate diverge at L2, indicating that L2 humanization shifts AI texts toward the uncertain region around the 0.5 decision boundary rather than cleanly into the human score region. DeBERTa's ELI5 collapse is unaffected by humanization (L0: 0.525, L1: 0.594, L2: 0.589), confirming that its ELI5 weakness is a domain-level structural limitation. Mean human scores are invariant across levels, validating experimental design.

10 Discussion

10.1 The Cross-Domain Challenge

Cross-domain degradation is the central finding of this benchmark. Every detector family suffers auroc drops of 5–30 points when trained on one corpus and tested on the other. The most severe case is the classical Random Forest (ELI5→HC3: 0.634). Fine-tuned transformers maintain the highest cross-domain performance (RoBERTa ELI5→HC3: 0.966). The stylometric hybrid XGBoost achieves competitive cross-domain auroc (0.904 ELI5→HC3), substantially exceeding classical baselines.
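As a reference point, the four Stage 4 metrics reported above (auroc, detection rate at the 0.5 threshold, mean AI score, and Brier score) can be computed from detector scores with scikit-learn; the function name and layout below are ours, a minimal sketch rather than the paper's actual evaluation code.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, brier_score_loss

def stage4_metrics(ai_scores, human_scores, threshold=0.5):
    """Compute the Stage 4 metrics from detector scores P(llm) in [0, 1]."""
    y_true = np.concatenate([np.ones_like(ai_scores), np.zeros_like(human_scores)])
    y_score = np.concatenate([ai_scores, human_scores])
    return {
        "auroc": roc_auc_score(y_true, y_score),
        # proportion of AI texts scoring above the decision threshold
        "detection_rate": float((ai_scores > threshold).mean()),
        "mean_ai_score": float(ai_scores.mean()),
        "brier": brier_score_loss(y_true, y_score),
    }
```

The auroc/detection-rate divergence seen at L2 falls out naturally here: auroc depends only on score rankings, while detection rate depends on where scores sit relative to the fixed 0.5 threshold.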
We attribute XGBoost's cross-domain transfer to the perplexity CV feature: the consistency of fluency across sentences is a generator-agnostic signal that transfers across both ChatGPT and Mistral-7B outputs.

10.2 The Generator–Detector Identity Problem

The Mistral-7B llm-as-detector results reveal a fundamental confound: a model cannot reliably detect its own outputs. If a detector is trained or prompted using the same model family as the target generator, its performance will be systematically underestimated.

10.3 The Perplexity Inversion

Modern llms produce text that is significantly more predictable than human writing, because their optimization objectives push strongly toward high-probability, fluent outputs. In our experimental setting, naive perplexity thresholding assigns higher scores to human text, yielding below-random performance.

10.4 Interpretability vs. Performance

The XGBoost stylometric hybrid nearly matches the in-distribution auroc of the best transformer (0.9996 vs. 0.9998) while remaining fully interpretable.

10.5 Limitations

This study has several limitations. First, the evaluation covers only two llm sources (ChatGPT/GPT-3.5 and Mistral-7B-Instruct); generalization to frontier models (Claude, Gemini, GPT-4) remains to be tested. Second, the adversarial humanization study uses only Qwen2.5-1.5B-Instruct as the humanizer; different humanizer models may yield different evasion rates. Third, the llm-as-detector experiments use relatively small evaluation subsets (n = 30 for CoT) due to computational cost. Fourth, the evaluation is limited to English Q&A text; performance on other genres and languages is unknown. Fifth, the Stage 3C distribution shift analysis uses 200-sample evaluation subsets as baselines rather than the full test sets, which should be noted when interpreting auroc drop values.

11 Future Work

1. Expansion to frontier models.
Evaluation on Claude-3, Gemini, LLaMA-3, and GPT-4 outputs, probing whether the perplexity inversion and contrastive likelihood signals hold for heavily RLHF-aligned generators.
2. Non-Q&A domains. Evaluation on essays, news articles, and scientific abstracts.
3. Ensemble methods. Systematic exploration of ensembles combining fine-tuned transformers with interpretable stylometric features.
4. Multilingual evaluation. Extension to non-English corpora.
5. Adaptive adversarial humanization. Evaluation of humanizers that are aware of specific detector architectures and craft targeted evasion strategies.

12 Conclusion

We have presented one of the most comprehensive evaluations to date, spanning multiple detector families, two carefully controlled corpora, four evaluation conditions, and detectors ranging from logistic regression on 22 hand-crafted features to fine-tuned transformer encoders and llm-scale promptable classifiers. Our central findings are: fine-tuned encoder transformers achieve near-perfect in-distribution detection (auroc ≥ 0.994) but degrade universally under domain shift; an interpretable XGBoost stylometric hybrid matches this performance with negligible inference cost; the 1D-CNN achieves near-transformer performance with 20× fewer parameters; perplexity-based detection reveals a critical polarity inversion that inverts naive hypotheses about llm text distributions; and prompting-based detection, while requiring no training data, lags far behind fine-tuned approaches and is strongly confounded by the generator–detector identity problem. Collectively, these results paint a clear picture: robust, generalizable, and adversarially resistant AI-generated text detection remains an open problem. No single detector family dominates across all conditions. Closing the cross-domain gap — particularly in the presence of adversarial humanization — is the most critical open challenge in the field.
Acknowledgments

The authors thank the Indian Institute of Technology (BHU) and IIT Guwahati for computational resources and support. The authors also acknowledge the maintainers of the HC3 and ELI5 datasets, the HuggingFace open-source ecosystem, and the developers of the open-source models evaluated in this benchmark. The full evaluation pipeline and benchmark code are available at our GitHub repository. All fine-tuned transformer models are available as private repositories at https://huggingface.co/Moodlerz.

References

Bhattacharjee, A., Kumarage, T., Moraffah, R., and Liu, H. ConDA: Contrastive domain adaptation for AI-generated text detection. In Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the ACL (IJCNLP-AACL), pages 598–610, 2023.

Clark, K., Luong, M.-T., Le, Q. V., and Manning, C. D. ELECTRA: Pre-training text encoders as discriminators rather than generators. In International Conference on Learning Representations (ICLR), 2020.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 4171–4186, 2019.

Fan, A., Jernite, Y., Perez, E., Grangier, D., Weston, J., and Auli, M. ELI5: Long form question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), pages 3558–3567, Florence, Italy, 2019.

Guo, B., Zhang, X., Wang, Z., Jiang, M., Nie, J., Ding, Y., Yue, J., and Wu, Y. How close is ChatGPT to human experts? Comparison corpus, evaluation, and detection. arXiv preprint arXiv:2301.07597, 2023.

He, P., Liu, X., Gao, J., and Chen, W. DeBERTa: Decoding-enhanced BERT with disentangled attention. In International Conference on Learning Representations (ICLR), 2021.
Kirchenbauer, J., Geiping, J., Wen, Y., Katz, J., Miers, I., and Goldstein, T. A watermark for large language models. In Proceedings of the 40th International Conference on Machine Learning (ICML), volume 202, pages 17061–17084. PMLR, 2023.

Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., and Iwasawa, Y. Large language models are zero-shot reasoners. In Advances in Neural Information Processing Systems (NeurIPS), volume 35, pages 22199–22213, 2022.

Krishna, K., Song, Y., Karpinska, M., Wieting, J., and Iyyer, M. Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense. In Advances in Neural Information Processing Systems (NeurIPS), volume 36, 2023.

Lavergne, T., Urvoy, T., and Yvon, F. Detecting fake content with relative entropy scoring. In Proceedings of PAN at CLEF 2008, 2008.

Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.

Mitchell, E., Lee, Y., Khazatsky, A., Manning, C. D., and Finn, C. DetectGPT: Zero-shot machine-generated text detection using probability curvature. In Proceedings of the 40th International Conference on Machine Learning (ICML), volume 202, pages 24950–24962. PMLR, 2023.

Rodriguez, P. A., Sheppard, T., Jiang, B., and Hu, Z. Cross-domain detection of GPT-2-generated technical text. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL), 2022.

Sanh, V., Debut, L., Chaumond, J., and Wolf, T. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108, 2019.

Solaiman, I., Brundage, M., Clark, J., Askell, A., Herbert-Voss, A., Wu, J., Radford, A., Krueger, G., Kim, J. W., Kreps, S., McCain, J., Newhouse, A., Blazakis, J., McGuffie, K., and Wang, J. Release strategies and the social impacts of language models.
arXiv preprint arXiv:1908.09203, 2019.

Uchendu, A., Le, T., Shu, K., and Lee, D. Authorship attribution for neural text generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8384–8395, 2020.

Wolff, M. and Wolff, R. Attacking neural text detectors. arXiv preprint arXiv:2002.11768, 2022.

Zellers, R., Holtzman, A., Rashkin, H., Bisk, Y., Farhadi, A., Roesner, F., and Choi, Y. Defending against neural fake news. In Advances in Neural Information Processing Systems (NeurIPS), volume 32, 2019.

Zeng, Z., Shi, J., Gao, Y., and Gao, B. Evaluating large language models at zero-shot machine-generated text detection. arXiv preprint arXiv:2310.03395, 2023.

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. In Advances in Neural Information Processing Systems (NeurIPS), volume 33, pages 1877–1901, 2020.

Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.

Chen, T. and Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 785–794, 2016.

Floridi, L. and Chiriatti, M. GPT-3: Its nature, scope, limits, and consequences. Minds and Machines, 30(4):681–694, 2020.

Gehrmann, S., Strobelt, H., and Rush, A. M. GLTR: Statistical detection and visualization of generated text. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL): System Demonstrations, pages 111–116, 2019.

Ippolito, D., Duckworth, D., Callison-Burch, C., and Eck, D. Automatic detection of generated text is easiest when humans are fooled.
In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), pages 1808–1822, 2020.

Juola, P. Authorship attribution. Foundations and Trends in Information Retrieval, 1(3):233–334, 2006.

Kim, Y. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751, 2014.

OpenAI. AI text classifier: A fine-tuned language model that predicts how likely it is that a piece of text was generated by AI. Technical report, OpenAI, 2023. https://openai.com/blog/new-ai-classifier-for-indicating-ai-written-text.

Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems (NeurIPS), volume 35, pages 27730–27744, 2022.

Stamatatos, E. A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60(3):538–556, 2009.

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., and Zhou, D. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems (NeurIPS), volume 35, pages 24824–24837, 2022.
13 Implementation Details

13.1 Family 1 — Statistical Machine Learning Detectors

Twenty-two hand-crafted linguistic features were extracted from each text sample across seven categories: surface statistics (word count, character count, sentence count, average word/sentence length); lexical diversity (type-token ratio, hapax legomena ratio); punctuation (comma density, period density, question mark and exclamation ratios); repetition (bigram and trigram repetition rates); entropy (word-frequency and sentence-length entropy); syntactic complexity (sentence-length variance and standard deviation); and discourse markers (hedging density, certainty density, connector density, contraction ratio, and burstiness). All features were extracted without normalisation beyond per-feature standardisation applied at training time.

Three classifiers were trained on this feature vector: Logistic Regression (max_iter=1000), Random Forest (n_estimators=100, max_depth=10), and SVM with RBF kernel (probability=True). Labels were encoded as binary values (human = 0, llm = 1). Each classifier was evaluated under four conditions: HC3→HC3, HC3→ELI5, ELI5→ELI5, and ELI5→HC3.

13.2 Family 2 — Fine-Tuned Encoder Transformers

Five pre-trained encoder transformers were fine-tuned for binary AI-text classification under a shared protocol: a two-class classification head attached to the [CLS] token, AdamW optimisation (lr = 2×10⁻⁵, weight_decay = 0.01), warmup over 6% of training steps, dropout = 0.2, one training epoch, and a 90/10 stratified train/validation split. Inputs were tokenised to a maximum of 512 tokens. No intermediate checkpoints were saved; final in-memory weights were used directly for all downstream evaluation. Model-specific deviations from this shared protocol are noted in Table 31.

Table 31. Fine-tuned encoder transformer configurations. Entries marked "—" follow the shared protocol described above.
Model             Params   Precision   Batch (tr/ev)   Warmup      Notes
BERT              110M     FP16        32 / 64         6% ratio    —
RoBERTa           125M     FP16        32 / 64         6% ratio    Dynamic masking, no NSP
ELECTRA           110M     FP16        32 / 64         6% ratio    Discriminator fine-tuned
DistilBERT        66M      FP16        32 / 64         6% ratio    Model-specific dropout params
DeBERTa-v3-base   184M     FP32        16 / 32         500 steps   See note below

DeBERTa-v3-base: architecture-specific adjustments. Both FP16 and BF16 were disabled; the model was trained in full FP32 throughout, as BF16 silently zeroed gradients due to the small gradient magnitudes produced by disentangled attention, and FP16 caused gradient scaler instability. Checkpoint saving was disabled entirely (save_strategy="no", load_best_model_at_end=False) to avoid a LayerNorm key mismatch during checkpoint reloading that caused all 24 LayerNorm layers to reinitialise to random weights and collapse auroc to approximately 0.50. Explicit gradient clipping was applied (max_grad_norm=1.0). token_type_ids were omitted as DeBERTa-v3 does not use segment IDs. All five models were uploaded to the Hugging Face Hub under Moodlerz/ following training.

13.3 Family 3 — Shallow 1D-CNN Detector

A lightweight multi-filter 1D-CNN was implemented with under 5M parameters. The architecture follows Kim [2014]: a shared embedding layer (vocab_size=30,000, embed_dim=128) feeds four parallel convolutional branches with kernel sizes 2, 3, 4, 5 and 128 filters each (BatchNorm1d + ReLU + global max pooling), producing a 512-dimensional concatenated representation. A classification head (Dropout(0.4) → Linear(512→256) → ReLU → Dropout(0.2) → Linear(256→1)) with BCEWithLogitsLoss and Kaiming normal weight initialisation completes the architecture.

Sequences were truncated or padded to 256 tokens. Training used Adam (lr = 10⁻³, weight_decay = 10⁻⁴), ReduceLROnPlateau scheduling (patience=1, factor=0.5), gradient clipping (max_norm=1.0), and early stopping (patience=3) over a maximum of 10 epochs with batch size 64.
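The Family 3 architecture above can be sketched directly in PyTorch. This is a minimal reconstruction from the description (Kaiming initialisation and the training loop are omitted, and the exact ordering of layers inside each branch is our assumption), not the authors' released code.

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    """Multi-filter 1D-CNN detector in the style of Kim (2014), per the Family 3 spec."""

    def __init__(self, vocab_size=30_000, embed_dim=128,
                 filter_sizes=(2, 3, 4, 5), n_filters=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # Four parallel branches: Conv1d -> BatchNorm1d -> ReLU -> global max pool
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv1d(embed_dim, n_filters, kernel_size=k),
                nn.BatchNorm1d(n_filters),
                nn.ReLU(),
                nn.AdaptiveMaxPool1d(1),
            )
            for k in filter_sizes
        )
        # Head: Dropout(0.4) -> Linear(512->256) -> ReLU -> Dropout(0.2) -> Linear(256->1)
        self.head = nn.Sequential(
            nn.Dropout(0.4),
            nn.Linear(n_filters * len(filter_sizes), 256),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(256, 1),  # raw logit, paired with BCEWithLogitsLoss
        )

    def forward(self, token_ids):                       # (batch, seq_len)
        x = self.embedding(token_ids).transpose(1, 2)   # (batch, embed_dim, seq_len)
        feats = torch.cat([b(x).squeeze(-1) for b in self.branches], dim=1)
        return self.head(feats).squeeze(-1)             # (batch,) logits
```

With these dimensions the embedding table dominates the parameter count (30,000 × 128 ≈ 3.8M), keeping the total comfortably under the stated 5M budget.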
13.4 Family 4 — Stylometric and Statistical Hybrid Detector

An extended feature set of 60+ hand-crafted features substantially augments the Family 1 set with the following additions: POS tag distribution (10 universal POS tags via spaCy en_core_web_sm); dependency tree depth (mean and maximum per sentence); function word frequency profiles (10 high-frequency tokens plus aggregate ratio); punctuation entropy; AI hedge phrase density (16 characteristic AI-generated phrases, normalised by sentence count); six readability indices (Flesch Reading Ease, Flesch-Kincaid Grade, Gunning Fog, SMOG, ARI, Coleman-Liau via textstat); and sentence-level perplexity from GPT-2 Small (117M) over up to 15 sentences per text, yielding mean, variance, standard deviation, and coefficient of variation. Low perplexity variance is treated as a potential AI signal due to the characteristically uniform fluency of llm-generated text.

Three classifiers were trained on this feature vector: Logistic Regression, Random Forest, and XGBoost [Chen and Guestrin, 2016]. Key hyperparameters are listed in Table 32. All features were standardised via StandardScaler fitted on the training partition only. Missing values arising from short texts or parsing failures were imputed with per-column training medians.

Table 32. Stylometric hybrid classifier hyperparameters.

Classifier            Key Parameters
Logistic Regression   max_iter=2000, C=1.0, solver=lbfgs, class_weight=balanced
Random Forest         n_estimators=300, max_depth=12, class_weight=balanced
XGBoost               n_estimators=400, max_depth=6, lr=0.05, subsample=0.8

13.5 Family 5 — LLM-as-Detector

All llm-as-detector experiments shared a four-component pipeline applied in sequence: polarity correction, task prior calibration, constrained decoding, and (for CoT regimes) hybrid ensemble scoring.

Constrained Decoding. Detection scores were derived by extracting next-token logits following the prompt's final Answer: token.
The maximum logit within the set of single-token surface forms of yes and no was taken for each polarity class, and a softmax over the two values yielded a continuous P(llm) ∈ [0, 1].

Polarity Correction. A systematic label bias was observed across all models: Qwen-family and LLaMA-2 models produced stronger no logits for both human and llm text, making the raw P(yes=AI) signal non-discriminative. Prompts for these models were therefore reframed so that yes=human and no=AI, with P(llm) read directly from P(no) with flip=False. TinyLlama-1.1B, Qwen2.5-1.5B, and Llama-3.1-8B-Instruct used the standard orientation (flip=True); Qwen2.5-7B, Qwen2.5-14B, and Llama-2-13b-chat-hf used the swapped orientation (flip=False).

Task Prior Calibration. A task-specific prior was computed by averaging yes/no logits over 50 real task prompts drawn equally from HC3 and ELI5 evaluation sets using the exact inference-time prompt template. These averaged logits were subtracted from each sample's token logits before softmax, correcting model-level base rate biases without requiring a labelled calibration set.

TF-IDF Few-Shot Retrieval. For few-shot regimes, k examples were retrieved from a pool of 30 balanced training samples per dataset using TF-IDF cosine similarity (max_features=5,000, bigrams), with balanced class representation enforced per query.

CoT Ensemble Scoring. In CoT regimes, the model generated up to 350–500 tokens of free-form reasoning. A numeric AI_CONFIDENCE score on a 0–10 scale was extracted via regex and normalised to [0, 1]. A zero-shot constrained logit score was computed separately using the same task prior. The two signals were combined as:

score = 0.6 × conf + 0.4 × logit_score    (6)

When the confidence score fell within a model-specific dead zone (indicating uninformative reasoning), only the logit score was used.
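The constrained-decoding and ensemble steps above reduce to a few lines of arithmetic on the final-position logits. The sketch below illustrates that math; the function names, token-id sets, and prior values are placeholders of ours, not the paper's implementation.

```python
import numpy as np

def p_llm_from_logits(logits, yes_ids, no_ids, prior=(0.0, 0.0), flip=True):
    """Constrained decoding: max logit over each answer's single-token surface
    forms, task-prior corrected, then softmaxed into a continuous P(llm)."""
    yes_logit = max(logits[i] for i in yes_ids) - prior[0]
    no_logit = max(logits[i] for i in no_ids) - prior[1]
    z = np.array([yes_logit, no_logit], dtype=float)
    p = np.exp(z - z.max())
    p /= p.sum()
    p_yes, p_no = p
    # flip=True: standard polarity (yes = AI); flip=False: swapped (yes = human)
    return float(p_yes if flip else p_no)

def cot_ensemble(conf_0_10, logit_score, dead_zone=(0.40, 0.60)):
    """Hybrid CoT score per Eq. (6); fall back to the logit-only score when the
    regex-extracted confidence lands in the model-specific dead zone."""
    conf = conf_0_10 / 10.0
    if dead_zone[0] <= conf <= dead_zone[1]:
        return logit_score
    return 0.6 * conf + 0.4 * logit_score
```

In practice the logits argument would be the last-position row of a causal LM's output over the vocabulary, and the prior would be the yes/no logits averaged over the 50 calibration prompts.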
All open-source models were loaded in 4-bit NF4 quantisation with BitsAndBytes double quantisation (bnb_4bit_compute_dtype=float16, except Qwen2.5-14B-Instruct, which used bfloat16). CoT generation used greedy decoding (do_sample=False). Model-specific configurations are summarised in Table 33.

Table 33. Per-model configuration for llm-as-detector experiments. "Swap" indicates swapped polarity convention (yes = human, no = AI). n ZS/FS and n CoT denote evaluation sample sizes.

Model                      Quant.     Polarity   Prior        Regimes       n ZS/FS   n CoT   Max New Tokens
TinyLlama-1.1B-Chat-v1.0   FP16       Standard   None         ZS, FS        500       —       —
Qwen2.5-1.5B-Instruct      FP16       Standard   None         ZS, FS        500       —       —
Llama-3.1-8B-Instruct      NF4/FP16   Standard   50 prompts   ZS, FS, CoT   500       70      350
Qwen2.5-7B-Instruct        NF4/FP16   Swap       50 prompts   ZS, FS, CoT   500       70      350
Llama-2-13b-chat-hf        NF4/FP16   Swap       50 prompts   ZS, FS, CoT   200       30      400
Qwen2.5-14B-Instruct       NF4/BF16   Swap       50 prompts   ZS, FS, CoT   200       30      500
GPT-4o-mini                API        —          —            ZS, FS, CoT   200       50      180 / 600

Model-specific notes. Llama-2-13b-chat-hf required a manual [INST]...<<SYS>>...<</SYS>>...[/INST] template fallback for checkpoints without a registered chat_template field, and CoT prompts used "stylometric analysis" framing to circumvent safety-oriented refusal behaviors. Qwen2.5-14B-Instruct required pad_token_id=tokenizer.pad_token_id (not eos_token_id) in generate() — using eos as padding caused premature generation termination and an ≈90% unknown verdict rate in the original implementation. GPT-4o-mini was prompted via a structured 7-dimension scoring format requiring explicit per-dimension scores before a final AI_SCORE tag (0 = human, 100 = AI); temperature was set to 0 with seed=42.

A Hyperparameter Tables

A.1 Encoder Transformer Common Training Protocol

Table 34. Shared fine-tuning protocol for all encoder transformer detectors. DeBERTa-v3-specific deviations are noted in parentheses.
Parameter             Value
Optimiser             AdamW
Learning rate         2×10⁻⁵
Weight decay          0.01
Warmup                6% of total steps (500 fixed steps for DeBERTa-v3)
Dropout               0.2
Training epochs       1
Max sequence length   512
Train batch size      32 (16 for DeBERTa-v3)
Eval batch size       64 (32 for DeBERTa-v3)
Precision             FP16 (FP32 for DeBERTa-v3)
Checkpoint strategy   None — final in-memory weights used
Eval frequency        Every 200 steps
Validation split      10% stratified

A.2 Encoder Transformer Model Specifications

Table 35. Encoder transformer model checkpoints and architectural notes.

Model        Checkpoint                          Params   Notes
BERT         bert-base-uncased                   ∼110M    Standard MLM pre-training
RoBERTa      roberta-base                        ∼125M    Dynamic masking, no NSP
ELECTRA      google/electra-base-discriminator   ∼110M    Replaced token detection
DistilBERT   distilbert-base-uncased             ∼66M     Knowledge distillation from BERT
DeBERTa-v3   microsoft/deberta-v3-base           ∼184M    FP32 only; no checkpointing; explicit grad clip

A.3 1D-CNN Hyperparameters

Table 36. 1D-CNN architecture and training hyperparameters.

Parameter                 Value
Vocabulary size           30,000
Minimum word frequency    2
Max sequence length       256
Embedding dimension       128
Filter sizes              2, 3, 4, 5
Filters per size          128
Total filter dimension    512
Hidden layer dimension    256
Dropout                   0.4 (head), 0.2 (second layer)
Optimiser                 Adam
Learning rate             10⁻³
Weight decay              10⁻⁴
Batch size                64
Max epochs                10
Early stopping patience   3
LR scheduler              ReduceLROnPlateau (patience=1, factor=0.5)
Gradient clipping         max_norm=1.0

A.4 Stylometric Hybrid Hyperparameters

Table 37. Stylometric hybrid classifier and feature extraction hyperparameters.
Parameter                        Value
Logistic Regression C            1.0
Logistic Regression solver       lbfgs
Logistic Regression max_iter     2,000
Random Forest n_estimators       300
Random Forest max_depth          12
Random Forest min_samples_leaf   5
XGBoost n_estimators             400
XGBoost max_depth                6
XGBoost learning rate            0.05
XGBoost subsample                0.8
XGBoost colsample_bytree         0.8
Feature scaling                  StandardScaler (fit on train only)
Missing value imputation         Column-wise training medians
Class weighting                  Balanced (Logistic Regression, Random Forest)
Sentence perplexity model        GPT-2 Small (117M)
Max sentences for perplexity     15 per text

A.5 llm-as-Detector Configuration Summary

Table 38. llm-as-detector per-model configuration.

Model                      Size   Quant.     Polarity    Prior          n ZS/FS   n CoT
TinyLlama-1.1B-Chat-v1.0   1.1B   FP16       yes=AI      Neutral        500       —
Qwen2.5-1.5B-Instruct      1.5B   FP16       yes=AI      Neutral        500       —
Llama-3.1-8B-Instruct      8B     NF4/FP16   yes=AI      Task (n = 50)  500       70
Qwen2.5-7B-Instruct        7B     NF4/FP16   yes=human   Task (n = 50)  500       70
Llama-2-13b-chat-hf        13B    NF4/FP16   yes=human   Task (n = 50)  200       30
Qwen2.5-14B-Instruct       14B    NF4/BF16   yes=human   Task (n = 50)  200       30
GPT-4o-mini                API    —          AI_SCORE    —              200       50

A.6 CoT Ensemble Parameters by Model

Table 39. CoT hybrid ensemble parameters. The dead zone defines the confidence interval within which the logit-only score is used instead of the ensemble.

Model                   Conf. weight   Logit weight   Dead zone      Verdict override   Max tokens
Llama-3.1-8B-Instruct   0.6            0.4            [0.40, 0.60]   [0.35, 0.65]       350
Qwen2.5-7B-Instruct     0.6            0.4            [0.35, 0.65]   [0.35, 0.65]       350
Llama-2-13b-chat-hf     0.6            0.4            [0.40, 0.60]   [0.35, 0.65]       400
Qwen2.5-14B-Instruct    0.6            0.4            [0.35, 0.65]   [0.35, 0.65]       500

B Prompt Templates

All prompts are reproduced verbatim. [TEXT] denotes the target text placeholder.

B.1 Zero-Shot Prompts

--- TinyLlama-1.1B-Chat-v1.0 / Llama-3.1-8B-Instruct (standard polarity: yes=AI) ---
System: You detect AI-generated text. Answer with ONE word only: yes or no. yes = AI-generated. no = human-written. No explanation. No punctuation. One word.
User: Was this text generated by an AI language model?
Text: """[TEXT]"""
Answer yes or no.
Answer:

Figure 7. Zero-shot prompt for TinyLlama-1.1B-Chat-v1.0 and Llama-3.1-8B-Instruct (standard polarity).

--- Qwen2.5-7B-Instruct (swapped polarity: yes=human) ---
System: You detect AI-generated text. Answer with ONE word only: yes or no. yes = human-written. no = AI-generated. No explanation. No punctuation. One word.
User: Was this text written by a human?
Text: """[TEXT]"""
Answer yes or no.
Answer:

Figure 8. Zero-shot prompt for Qwen2.5-7B-Instruct (swapped polarity).

--- Llama-2-13b-chat-hf (swapped polarity, stylometric framing) ---
System: You are a linguistics researcher studying writing styles. Answer with ONE word only: yes or no. yes = written by a human. no = written by an AI system. No explanation. No punctuation. One word only.
User: Was this text written by a human?
Text sample: """[TEXT]"""
Answer yes or no.
Answer:

Figure 9. Zero-shot prompt for Llama-2-13b-chat-hf (swapped polarity, stylometric framing).

--- Qwen2.5-14B-Instruct (swapped polarity, authorship framing) ---
System: You are an expert in authorship attribution and AI-generated text analysis. Answer with ONE word only: yes or no. yes = human-written. no = AI-generated. No explanation. No punctuation. One word.
User: Was this text written by a human?
Text: """[TEXT]"""
Answer yes or no.
Answer:

Figure 10. Zero-shot prompt for Qwen2.5-14B-Instruct (swapped polarity, authorship framing).

--- GPT-4o-mini (7-dimension structured scoring) ---
System: You are an expert forensic linguist specialising in authorship attribution. AI-generated text is very common, including short conversational-looking text from older models like ChatGPT-3.5. Score honestly based on the dimensions provided. Use the full 0-10 range for each dimension. Complete every analysis.
User: Score this passage on each dimension from 0 (strongly human) to 10 (strongly AI).
Passage: [TEXT]
HEDGING/FORMULAIC: ’it is important’, ’certainly’, numbered sections, safe generalisations
COMPLETENESS: Covers every sub-angle even when not asked
PERSONAL VOICE: Opinions, errors, tangents, emotional register
LEXICAL UNIFORMITY: Vocabulary register stays perfectly consistent
STRUCTURAL NEATNESS: Clear intro/body/conclusion or logical flow
RESPONSE FIT: Directly and precisely addresses the apparent question
FORMULAIC TELLS: Restates question, tidy closing, ’I hope this helps’
Then write:
AI_SCORE: [arithmetic mean of 7 scores x 10, rounded to nearest integer]
Format: 1:[score] 2:[score] 3:[score] 4:[score] 5:[score] 6:[score] 7:[score]
AI_SCORE: [mean]

Figure 11. Zero-shot prompt for GPT-4o-mini (structured 7-dimension rubric scoring).

B.2 Few-Shot Prompt Structure

--- Few-Shot Structure (Llama-3.1-8B-Instruct / Qwen2.5-7B-Instruct / Llama-2-13b-chat-hf / Qwen2.5-14B) ---
System: [same as zero-shot for respective model]
User: Examples:
Text: "[EXAMPLE_1_TEXT]"
[Human-written? / AI-generated?] [yes/no]
Text: "[EXAMPLE_2_TEXT]"
[Human-written? / AI-generated?] [yes/no]
Text: "[EXAMPLE_3_TEXT]"
[Human-written? / AI-generated?] [yes/no]
Now answer:
Text: "[TARGET_TEXT]"
[Human-written? / AI-generated?] yes or no.
Answer:

Figure 12. Few-shot prompt structure. k = 3 TF-IDF-retrieved examples are prepended to the zero-shot prompt. Label phrasing follows each model's polarity convention.

B.3 Chain-of-Thought Prompts

--- Llama-3.1-8B-Instruct CoT (7-dimension scoring with AI_CONFIDENCE) ---
System: You are an expert forensic linguist. Determine whether a passage was written by a human or generated by an AI. Think carefully and be precise.
User: Analyse whether this passage was written by a HUMAN or an AI.
Passage: """[TEXT]"""
Score each dimension 0 (strongly human) to 10 (strongly AI):
STRUCTURE: Neatly organised with clear sections or numbered points?
COMPLETENESS: Covers the topic comprehensively without gaps?
HEDGING: Acknowledges uncertainty or says "I’m not sure"?
PERSONAL VOICE: Personal opinions, anecdotes, slang, contractions, typos?
LEXICAL RANGE: Broad, polished vocabulary even in casual answers?
RESPONSE FIT: Directly addresses the question or wanders?
SHORT-FORM TELLS: Starts "Certainly!", restates question, unnaturally tidy closing?
BREVITY PATTERN: Ends with an unnatural one-sentence summary?
QUESTION ECHO: Begins by restating or paraphrasing the question?
GENERIC EXAMPLES: Placeholder examples ("consider X") where X is suspiciously apt?
IMPORTANT: Short answers can still be AI-generated. Do not assume short = human.
After scoring, state on the LAST TWO LINES exactly:
AI_CONFIDENCE: [average of 7 scores, 0-10]
VERDICT: yes (if AI-generated)
VERDICT: no (if human-written)

Figure 13. CoT prompt for Llama-3.1-8B-Instruct.

--- Llama-2-13b-chat-hf CoT (stylometric framing) ---
System: You are an expert in stylometric analysis and authorship attribution. Analyse writing samples to determine if written by a human or AI. Always complete your analysis. Always end with AI_CONFIDENCE and VERDICT.
User: Perform a stylometric analysis of this writing sample.
Sample: """[TEXT]"""
Score each dimension 0 (strongly human) to 10 (strongly AI):
STRUCTURAL REGULARITY: Uniform sentence length, predictable paragraph transitions?
LEXICAL POLISH: Consistently formal/polished vocabulary?
TOPIC COVERAGE: Suspiciously complete, covering all sub-aspects?
HEDGING STYLE: Confident and authoritative vs uncertain and personal?
PERSONAL MARKERS: Opinions, anecdotes, typos, contractions, informal phrasing?
RESPONSE ALIGNMENT: Tightly matches the implied question?
FORMULAIC OPENING: Starts with "Certainly!", "Great question!", or restates question?
Note: Short answers can still be AI-generated.
Final output (EXACTLY these two lines):
AI_CONFIDENCE: [average of 7 scores, 0-10]
VERDICT: yes (if AI-generated)
VERDICT: no (if human-written)

Figure 14. CoT prompt for Llama-2-13b-chat-hf (stylometric framing to reduce safety refusals).

--- Qwen2.5-14B-Instruct CoT (explicit completion constraint) ---
System: You are an expert forensic linguist performing authorship attribution analysis. You ALWAYS complete your full analysis and ALWAYS end with AI_CONFIDENCE and VERDICT. Never leave your analysis incomplete or refuse to give a verdict.
User: Analyse this passage to determine if written by a HUMAN or generated by an AI.
Passage: """[TEXT]"""
Score each dimension 0 (strongly human) to 10 (strongly AI):
STRUCTURE (0-10): Organised with clear sections/numbered points?
COMPLETENESS (0-10): Covers topic without obvious gaps?
HEDGING (0-10): Confident authoritative tone, lacks uncertainty?
PERSONAL VOICE (0-10): Lacks personal opinions/anecdotes/typos?
LEXICAL POLISH (0-10): Uniformly formal/polished vocabulary?
RESPONSE FIT (0-10): Directly and completely addresses question?
FORMULAIC TELLS (0-10): Restates question, "Certainly!", unnaturally tidy closing?
IMPORTANT: Short texts CAN be AI-generated. Score all 7 dimensions regardless of length.
You MUST end with EXACTLY:
AI_CONFIDENCE: [average score 0-10]
VERDICT: yes OR VERDICT: no
Begin your analysis now:

Figure 15. CoT prompt for Qwen2.5-14B. Explicit completion directives were added to resolve the ≈90% unknown-verdict rate in the original implementation.

--- GPT-4o-mini CoT (evidence-plus-score format) ---
System: You are an expert forensic linguist specialising in authorship attribution. Score honestly based on evidence. Use the full 0-10 range. Complete every dimension.
User: Analyse whether this passage was written by a HUMAN or generated by an AI.
Passage: [TEXT]
For each dimension write ONE evidence sentence, then a score 0 (human) to 10 (AI):
HEDGING/FORMULAIC -- ’it is important’, ’certainly’, numbered sections:
Evidence: ... Score (0-10):
COMPLETENESS -- covers every sub-angle even when not asked:
Evidence: ... Score (0-10):
PERSONAL VOICE -- opinions, errors, tangents, emotional register:
Evidence: ... Score (0-10):
LEXICAL UNIFORMITY -- vocabulary register stays perfectly consistent:
Evidence: ... Score (0-10):
STRUCTURAL NEATNESS -- clear intro/body/conclusion or logical flow:
Evidence: ... Score (0-10):
RESPONSE FIT -- directly and precisely addresses the apparent question:
Evidence: ... Score (0-10):
FORMULAIC TELLS -- restates question, tidy closing, ’I hope this helps’:
Evidence: ... Score (0-10):
Then write:
AI_SCORE: [mean of 7 scores x 10, rounded to nearest integer]
VERDICT: ai OR VERDICT: human

Figure 16. CoT prompt for GPT-4o-mini (evidence-plus-score format).
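Because some of the zero-shot prompts above use standard polarity (yes = AI, Figure 7) while others are swapped (yes = human, Figures 8-10), the one-word answers must be mapped back to a common label before scoring. The following is a minimal sketch of that mapping, not the authors' code; the function name, label strings, and "unknown" fallback are illustrative assumptions.

```python
# Per-model polarity convention: does "yes" mean AI-generated or human-written?
POLARITY = {
    "TinyLlama-1.1B-Chat-v1.0": "yes_means_ai",     # standard polarity (Fig. 7)
    "Llama-3.1-8B-Instruct": "yes_means_ai",        # standard polarity (Fig. 7)
    "Qwen2.5-7B-Instruct": "yes_means_human",       # swapped (Fig. 8)
    "Llama-2-13b-chat-hf": "yes_means_human",       # swapped (Fig. 9)
    "Qwen2.5-14B-Instruct": "yes_means_human",      # swapped (Fig. 10)
}

def parse_verdict(model, raw_answer):
    """Map a one-word yes/no answer to 'ai' or 'human' under the model's polarity."""
    word = raw_answer.strip().lower().rstrip(".!")
    if word not in ("yes", "no"):
        return "unknown"  # refusal or malformed output
    if POLARITY[model] == "yes_means_ai":
        return "ai" if word == "yes" else "human"
    return "human" if word == "yes" else "ai"
```

Under this scheme the same raw answer yields opposite labels depending on the model, e.g. "yes" from Llama-3.1-8B-Instruct maps to "ai" but "yes" from Qwen2.5-7B-Instruct maps to "human".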
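The k = 3 TF-IDF-retrieved examples in the few-shot structure (Figure 12) can be selected by a nearest-neighbour lookup over a labelled pool. Below is a minimal sketch using scikit-learn; the function name, the cosine-similarity choice, and the (text, label) return format are assumptions, not the paper's implementation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def retrieve_examples(target, pool_texts, pool_labels, k=3):
    """Return the k pool examples most similar to `target` by TF-IDF cosine similarity."""
    vec = TfidfVectorizer()
    pool_matrix = vec.fit_transform(pool_texts)   # fit vocabulary on the example pool
    target_vec = vec.transform([target])          # project the target into that space
    sims = cosine_similarity(target_vec, pool_matrix)[0]
    top = sims.argsort()[::-1][:k]                # indices of the k most similar pool texts
    return [(pool_texts[i], pool_labels[i]) for i in top]
```

The retrieved (text, label) pairs would then be formatted as the Examples block of Figure 12, with label phrasing following each model's polarity convention.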
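All of the CoT prompts (Figures 13-16) force the response to end in machine-parseable lines (AI_CONFIDENCE plus VERDICT, or AI_SCORE). A regex-based extraction along the following lines would recover them; the function name and the None/"unknown" fallbacks are assumptions rather than the authors' parser.

```python
import re

def parse_cot_output(text):
    """Extract the AI_CONFIDENCE value (0-10) and final VERDICT from a CoT response."""
    conf = re.search(r"AI_CONFIDENCE:\s*([0-9]+(?:\.[0-9]+)?)", text)
    # Take the LAST verdict in case the model echoes the template's two options.
    verdicts = re.findall(r"VERDICT:\s*(yes|no|ai|human)", text, flags=re.IGNORECASE)
    confidence = float(conf.group(1)) if conf else None
    verdict = verdicts[-1].lower() if verdicts else "unknown"
    return confidence, verdict
```

Responses missing both terminal lines (the failure mode behind Qwen2.5-14B's unknown-verdict rate noted under Figure 15) fall through to (None, "unknown").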