Paper deep dive
Dial: A Knowledge-Grounded Dialect-Specific NL2SQL System
Xiang Zhang, Hongming Xu, Le Zhou, Wei Zhou, Xuanhe Zhou, Guoliang Li, Yuyu Luo, Changdong Liu, Guorun Chen, Jiang Liao, Fan Wu
Abstract
Enterprises commonly deploy heterogeneous database systems, each of which owns a distinct SQL dialect with different syntax rules, built-in functions, and execution constraints. However, most existing NL2SQL methods assume a single dialect (e.g., SQLite) and struggle to produce queries that are both semantically correct and executable on target engines. Prompt-based approaches tightly couple intent reasoning with dialect syntax, rule-based translators often degrade native operators into generic constructs, and multi-dialect fine-tuning suffers from cross-dialect interference. In this paper, we present Dial, a knowledge-grounded framework for dialect-specific NL2SQL. Dial introduces: (1) a Dialect-Aware Logical Query Planning module that converts natural language into a dialect-aware logical query plan via operator-level intent decomposition and divergence-aware specification; (2) HINT-KB, a hierarchical intent-aware knowledge base that organizes dialect knowledge into (i) a canonical syntax reference, (ii) a declarative function repository, and (iii) a procedural constraint repository; and (3) an execution-driven debugging and semantic verification loop that separates syntactic recovery from logic auditing to prevent semantic drift. We construct DS-NL2SQL, a benchmark covering six major database systems with 2,218 dialect-specific test cases. Experimental results show that Dial consistently improves translation accuracy by 10.25% and dialect feature coverage by 15.77% over state-of-the-art baselines. The code is at https://github.com/weAIDB/Dial.
Tags
Links
- Source: https://arxiv.org/abs/2603.07449v1
- Canonical: https://arxiv.org/abs/2603.07449v1
PDF not stored locally. Use the link above to view on the source site.
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 97%
Last extracted: 3/13/2026, 12:33:57 AM
Summary
Dial is a knowledge-grounded framework designed to address the challenges of dialect-specific NL2SQL translation. It utilizes a hierarchical intent-aware knowledge base (HINT-KB) and a dialect-aware logical query planning module to map natural language intents to syntactically correct and executable SQL across heterogeneous database systems. The system incorporates an execution-driven debugging loop to ensure semantic fidelity and prevent drift, achieving significant improvements in translation accuracy and feature coverage.
Entities (4)
Relation Signals (3)
Dial → evaluatedon → DS-NL2SQL
confidence 100% · We construct DS-NL2SQL, a benchmark covering six major database systems... Experimental results show that Dial consistently improves translation accuracy
Dial → utilizes → HINT-KB
confidence 100% · Dial introduces: (2) HINT-KB, a hierarchical intent-aware knowledge base
HINT-KB → contains → Declarative Function Repository
confidence 95% · HINT-KB... organizes dialect knowledge into (ii) a declarative function repository
Cypher Suggestions (2)
Identify systems that utilize specific benchmarks · confidence 95% · unvalidated
MATCH (s:System)-[:EVALUATED_ON]->(b:Benchmark) RETURN s.name, b.name
Find all components of the HINT-KB knowledge base · confidence 90% · unvalidated
MATCH (k:KnowledgeBase {name: 'HINT-KB'})-[:CONTAINS]->(c:Component) RETURN c.name
Full Text
82,966 characters extracted from source content.
Dial: A Knowledge-Grounded Dialect-Specific NL2SQL System

Xiang Zhang* (Shanghai Jiao Tong Univ., zhangxxx@sjtu.edu.cn), Hongming Xu* (Shanghai Jiao Tong Univ., muzhihai@sjtu.edu.cn), Le Zhou* (Shanghai Jiao Tong Univ., mytruing1912@gmail.com), Wei Zhou† (Shanghai Jiao Tong Univ., weizhoudb@sjtu.edu.cn), Xuanhe Zhou† (Shanghai Jiao Tong Univ., zhouxuanhe@sjtu.edu.cn), Guoliang Li (Tsinghua University, liguoliang@tsinghua.edu.cn), Yuyu Luo (HKUST (GZ), yuyuluo@hkust-gz.edu.cn), Changdong Liu (Shanghai Ideal Information Industry (Group), liuzd4@telecom.cn), Guorun Chen (Shanghai Ideal Information Industry (Group), chenguorun@telecom.cn), Jiang Liao (China Telecom Corporation Ltd. Shanghai Branch, liaojiang@telecom.cn), Fan Wu (Shanghai Jiao Tong University, fwu@cs.sjtu.edu.cn)

Abstract

Enterprises commonly deploy heterogeneous database systems, each of which owns a distinct SQL dialect with different syntax rules, built-in functions, and execution constraints. However, most existing NL2SQL methods assume a single dialect (e.g., SQLite) and struggle to produce queries that are both semantically correct and executable on target engines. Prompt-based approaches tightly couple intent reasoning with dialect syntax, rule-based translators often degrade native operators into generic constructs, and multi-dialect fine-tuning suffers from cross-dialect interference.
In this paper, we present Dial, a knowledge-grounded framework for dialect-specific NL2SQL. Dial introduces: (1) a Dialect-Aware Logical Query Planning module that converts natural language into a dialect-aware logical query plan via operator-level intent decomposition and divergence-aware specification; (2) HINT-KB, a hierarchical intent-aware knowledge base that organizes dialect knowledge into (i) a canonical syntax reference, (ii) a declarative function repository, and (iii) a procedural constraint repository; and (3) an execution-driven debugging and semantic verification loop that separates syntactic recovery from logic auditing to prevent semantic drift. We construct DS-NL2SQL, a benchmark covering six major database systems with 2,218 dialect-specific test cases. Experimental results show that Dial consistently improves translation accuracy by 10.25% and dialect feature coverage by 15.77% over state-of-the-art baselines. The code is at https://github.com/weAIDB/Dial.

1 Introduction

Existing NL2SQL methods predominantly target a single database dialect [16,23,49]. However, in real-world scenarios, most enterprises (e.g., 80% predicted by Gartner [33]) are supported by multiple database systems, each exposing its own SQL dialect with distinct syntax, function signatures, and compilation rules [12,21,25]. This creates the practical need for dialect-specific NL2SQL: given a target database, the system must generate SQL that is both semantically correct and natively executable under that database's dialect.

* Equal Contribution. † Xuanhe Zhou and Wei Zhou are the corresponding authors.

Case 1: Unsupported Syntax. Query the names of the top 10 employees with the highest salaries. SELECT name FROM employees ORDER BY salary DESC LIMIT 10; (Agentar-Scale-SQL: NL -> Oracle) Error Code: ORA-00933: SQL command not properly ended. Analysis: Hallucinated MySQL's LIMIT; should use FETCH FIRST 10 ROWS ONLY.
Case 2: Incorrect Usage. Display full name by combining first, middle, and last names. SELECT employee_id, CONCAT(first_name, ' ', middle_name, ' ', last_name) AS full_name FROM employees; (Agentar-Scale-SQL: NL -> Oracle) Error Code: ORA-00909: invalid number of arguments. Analysis: Oracle's CONCAT is strictly 2-arg; interfered by MySQL's multi-arg usage.

Case 3: Implicit Constraints. List distinct pub names ordered by paper count. SELECT DISTINCT pub.publication_name FROM publications pub JOIN paper_publications p ON pub.publication_id = p.publication_id ORDER BY COUNT(p.paper_id) DESC; (Agentar-Scale-SQL: NL -> SQLite) Error Code: ERROR: for SELECT DISTINCT, ORDER BY expressions must appear in select list. Analysis: Ignored PostgreSQL's strict rule; requires logical refactoring into: SELECT DISTINCT pub.publication_name FROM publications AS pub JOIN paper_publications AS p ON pub.publication_id = p.publication_id GROUP BY pub.publication_name ORDER BY COUNT(p.paper_id) DESC; (SQLGlot: SQLite -> PostgreSQL)

Figure 1: Dialect-Specific NL2SQL Failures. Case 1: Oracle rejects MySQL-style LIMIT. Case 2: Oracle's CONCAT accepts only two arguments, unlike MySQL's variadic version. Case 3: PostgreSQL enforces that ORDER BY under DISTINCT references selected expressions.

This problem is tricky because there are various dialect-relevant issues that can be easily overlooked and cause translation failure. Figure 1 demonstrates several examples: (1) In Case 1, Oracle does not support SQLite/MySQL-style LIMIT, causing a parsing error. (2) In Case 2, Oracle's CONCAT accepts only two arguments, whereas MySQL accepts a variable number. (3) In Case 3, PostgreSQL requires that the ORDER BY expression under SELECT DISTINCT appear in the projection list, a rule not imposed by SQLite. These cases highlight that dialect-specific NL2SQL is fundamentally more complex than fixed-dialect translation.
A robust solution must (1) bind user intents to dialect-specific function syntax, (2) generate constructs that are syntactically valid under the target dialect, (3) explicitly account for implicit, cross-clause compilation constraints, and (4) utilize database-native functions rather than verbose ones. Although recent enterprise-oriented benchmarks [13,20] have begun to incorporate different database dialects, their primary focus remains on schema complexity and cross-domain generalization. Consequently, current approaches often exhibit substantial performance degradation in dialect-specific NL2SQL: (1) Prompt-based methods (e.g., DIN-SQL [27], MAC-SQL [34]) rely on in-context demonstrations to guide generation. However, the underlying models are primarily pretrained and instruction-tuned on dominant dialects such as SQLite and MySQL. When deployed to a different database system, they often transfer familiar syntactic patterns, resulting in unsupported functions or incorrect operator signatures. (2) Tool-augmented or rule-based pipelines (e.g., SQLGlot [1], WrenAI [37]) typically rely on a generic intermediate representation to translate across dialects. While portable, this lowest-common-denominator strategy may overlook implicit dialect constraints and replace native operators with verbose rewrites, leading to invalid SQL in many cases (see Section 2.2). (3) Multi-dialect fine-tuning approaches (e.g., ExeSQL [42]) attempt to internalize dialect variations within model parameters. However, different dialects share overlapping yet conflicting syntactic and functional patterns, which can easily induce cross-dialect interference and negative transfer [36]. Moreover, this monolithic training paradigm lacks adaptability: supporting a new dialect or even a minor version update typically requires additional data collection and model finetuning.

Challenges.
To realize reliable dialect-specific NL2SQL, there are three main challenges.

C1: How to map ambiguous intents to dialect-specific functions? Users express analytical requests in a dialect-agnostic manner (e.g., "months since registration"), without specifying the concrete operators required to realize them. For example, computing month differences can correspond to TIMESTAMPDIFF in MySQL but require nested EXTRACT/AGE constructs in PostgreSQL. Thus, the first challenge is to correctly identify the appropriate function signature and operator semantics from a vast, dialect-specific search space while preserving the user's original intent.

C2: How to satisfy implicit dialect-level constraints for generated queries? Even after the correct function syntax is identified, a query may still be rejected due to dialect-level compilation and semantic constraints. These constraints are orthogonal to functional intent, such as DISTINCT-ORDER BY coupling rules, grouping legality, name scoping, identifier scoping, and null-handling semantics. Therefore, dialect-specific NL2SQL must not only recover the intended functionality (C1), but also ensure that the generated query complies with dialect-specific parsing and semantic rules.

C3: How to conduct dialect-aware correction and experience accumulation? Even advanced LLMs and agentic workflows cannot guarantee fully correct SQL generation in a single pass, making post-generation correction unavoidable. However, existing correction strategies focus primarily on restoring executability through iterative re-generation. Such simplistic repair struggles to handle dialect-specific nuances and may introduce semantic drift by altering the intended computation. Furthermore, current systems lack a structured way to consolidate successful repairs into reusable experience, resulting in repeated reasoning for recurring dialectal issues.
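The divergence behind C1 can be made concrete with a small sketch. The MySQL and PostgreSQL strings below are illustrative renderings of "months since registration" and are not executed; Python's built-in sqlite3 provides a third, executable realization via strftime (the table, column, and fixed "today" date are made up for the example):

```python
import sqlite3

# Dialect-specific renderings of one intent, "months since registration".
# These two strings are illustrative only; they are not executed here.
mysql_sql = "SELECT TIMESTAMPDIFF(MONTH, registered_at, NOW()) FROM users"
postgres_sql = (
    "SELECT EXTRACT(YEAR FROM AGE(NOW(), registered_at)) * 12 "
    "+ EXTRACT(MONTH FROM AGE(NOW(), registered_at)) FROM users"
)

# SQLite supports neither function; one workable realization uses strftime.
# A fixed reference date keeps the example deterministic.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (registered_at TEXT)")
conn.execute("INSERT INTO users VALUES ('2024-01-15')")
months = conn.execute(
    "SELECT (strftime('%Y', '2024-07-15') - strftime('%Y', registered_at)) * 12"
    " + (strftime('%m', '2024-07-15') - strftime('%m', registered_at))"
    " FROM users"
).fetchone()[0]
# months == 6: three syntactically unrelated realizations of one intent
```

The point is that nothing in the user's phrasing signals which of these constructs the target engine requires, which is exactly the search-space problem C1 describes.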
Therefore, reliable dialect-specific NL2SQL must ensure dialect-compliant repair and incrementally integrate validated corrections to improve long-term translation robustness.

To address these challenges, we propose Dial, a knowledge-grounded framework for dialect-specific NL2SQL. (1) The core abstraction is the Natural Language Logical Query Plan (NL-LQP), which converts a user query into a linearized, dialect-agnostic operator chain that captures its essential semantic intent (e.g., data sourcing, filtering, scalar computation). This logical plan is then selectively refined into a dialect-aware plan by detecting dialect-sensitive operators and aligning them with a standardized functional taxonomy. (2) To support faithful realization, we construct HINT-KB, a hierarchical and intent-aware knowledge base with three components: (i) a Canonical Syntax Reference grounded in ANSI SQL to normalize abstract primitives; (ii) a Declarative Function Repository that maps these primitives to dialect-specific implementations with explicit signatures and usage constraints; and (iii) a Procedural Constraint Repository that encodes implicit compilation rules indexed by diagnostic signals. (3) During generation, Dial first instantiates SQL through function-level retrieval from the Declarative Function Repository. It then performs iterative, execution-driven refinement. Syntactic errors trigger rule retrieval from the Procedural Constraint Repository, while semantic verification checks the executable SQL against the dialect-aware logical plan to prevent intent drift. Validated repair traces are distilled back into HINT-KB, enabling continuous knowledge consolidation and adaptive dialect support.

Contributions. We make the following contributions.
(1) We propose a knowledge-grounded framework (Dial) for dialect-specific NL2SQL that decouples logical intent modeling from dialect realization and couples generation with execution-driven verification, enabling native executability without model retraining.
(2) We design a hierarchical dialect knowledge architecture (HINT-KB) grounded in ANSI primitives, which separates functional syntax mappings from implicit compilation constraints and supports automated knowledge distillation from vendor documentation.
(3) We introduce the NL Logical Query Plan (NL-LQP), a strictly linearized and dialect-agnostic operator-chain abstraction that normalizes free-form user intent into structured relational operators and explicitly materializes implicit computation steps.
(4) We develop a divergence-aware dialect specification pipeline that isolates dialect-sensitive operators and maps them into a standardized functional taxonomy, providing high-precision retrieval anchors for dialect realization.
(5) We propose an execution-driven refinement mechanism that separates syntactic recovery from semantic logic verification, enforcing structural and computational invariants to prevent semantic drift while guaranteeing executability.
(6) We construct DS-NL2SQL, an NL2SQL benchmark across six major database systems. Experiments show that Dial improves translation accuracy by 10.25% and dialect feature coverage by 15.77% over baselines, with higher executability overall.

2 Preliminary

In this section, we define the dialect-specific NL2SQL problem (Section 2.1); next, we summarize the limitations of existing potential solutions to this problem (Section 2.2).

2.1 Problem Definition

Dialect-Specific NL2SQL. Given a natural language question q, a database schema S, and a target database dialect d, dialect-specific NL2SQL aims to generate a SQL query s such that: (1) s is executable under dialect d, and (2) s correctly realizes the semantic intent of q over S. Compared with the general NL2SQL task [11,22], the output s of dialect-specific NL2SQL is not a single canonical SQL form, but a dialect-constrained realization.

[Figure 2: Dialect-Specific Error Analysis. Top: total error rate; Bottom: non-executable rate. Errors are grouped into Unsupported Syntax (U), Incorrect Usage (M), and Implicit Constraints (I). The row gap reflects semantic drift (executable but incorrect).]

Typical Dialect Discrepancies. Based on our observations, dialect discrepancies mainly arise in three dimensions:

(1) Syntactic Rules: Database systems have distinct SQL syntax rules: (a) Identifier Quoting: Different characters are used to quote table or column names that might be reserved keywords (e.g., Oracle uses "ID", while MySQL uses `ID`). (b) Subquery Aliasing: Some databases like MySQL require all derived tables (subqueries in the FROM clause) to have an alias, whereas others do not. (c) Pagination Syntax: The syntax for limiting the number of returned rows varies, such as LIMIT (MySQL/PostgreSQL) vs. FETCH FIRST (Oracle).

(2) Function Differences: The names and behaviors of built-in functions often vary: (a) String Manipulation: Strings can be concatenated using the CONCAT() function, the || operator, or the + operator, depending on the database. (b) Date and Time Formatting: Functions that format dates and times have different names and argument styles (e.g., STRFTIME(), DATE_FORMAT(), TO_CHAR()).

(3) Semantic Variations: Implicit differences include: (a) NULL value ordering: some systems place NULL values first by default during sorting, while others place them last; moreover, explicit NULLS FIRST or NULLS LAST clauses are not supported in all databases. (b) Data type handling: Core data types (e.g., DATETIME, TIMESTAMP, DATE) may differ in precision, representation, and functions.
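The pagination discrepancy can be observed directly with Python's built-in sqlite3: SQLite sits in the LIMIT family and its parser rejects Oracle-style row limiting, mirroring (in reverse) how Oracle rejects LIMIT in Case 1 of Figure 1. The table and data below are made up for the demo:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary REAL)")
conn.executemany("INSERT INTO employees VALUES (?, ?)",
                 [("ann", 90), ("bob", 120), ("eve", 150)])

# SQLite belongs to the LIMIT family (like MySQL/PostgreSQL):
top2 = conn.execute(
    "SELECT name FROM employees ORDER BY salary DESC LIMIT 2").fetchall()
# top2 == [('eve',), ('bob',)]

# Oracle-style FETCH FIRST is not in SQLite's grammar:
try:
    conn.execute("SELECT name FROM employees ORDER BY salary DESC "
                 "FETCH FIRST 2 ROWS ONLY")
    fetch_first_supported = True
except sqlite3.OperationalError:
    fetch_first_supported = False
# fetch_first_supported == False
```

A single-dialect model that internalizes either family's pagination syntax will produce parse errors on the other, which is exactly the failure mode quantified in Figure 2.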
2.2 Limitations of Existing Methods

To quantify the limitations of existing approaches and motivate a new design, we evaluate ExeSQL (model fine-tuning), SQLGlot (tool-based translation), and DIN-SQL (prompt-based generation) across three typical database systems (MySQL, PostgreSQL, Oracle). As shown in Figure 2, we derive four main observations.

(Observation 1) Intent-to-Syntax Mapping under Dialect Divergence. Existing NL2SQL methods largely assume that once the user's logical intent is understood, the corresponding SQL syntax can be produced through LLM prompting or rule-based translation. However, dialect-specific realizations of the same intent often differ in subtle yet critical ways, including function signatures, argument conventions, and grammar constraints. These differences are rarely expressed explicitly in natural language, making it difficult for models to deterministically map abstract analytical intent to dialect-compliant syntax. Static model parameters and handcrafted translation rules cannot comprehensively encode such fine-grained and context-dependent variations. As a result, even when the high-level intent is correctly identified [28], the generated syntax frequently violates dialect-specific requirements.

(Observation 2) Blindness to Implicit Inter-Functional Constraints. The preliminary results demonstrate that existing approaches fail to handle implicit syntax constraints. For example, in queries involving specific NULL ordering behaviors, error rates remain consistently high across all evaluated methods. These failures arise from unmodeled cross-clause dependencies, such as PostgreSQL's requirement that the ORDER BY expression appear in the select list (mentioned in Section 1). Such constraints are not directly derived from user intent but are enforced by the engine at compilation time. The observed error patterns indicate that reliable dialect adaptation requires explicit modeling and enforcement of these implicit syntax rules.
(Observation 3) Severe Dialect Overfitting Induced by LLM Finetuning. While fine-tuning models on specific dialects improves in-domain accuracy, it might suffer from overfitting and lose cross-engine generalizability. For instance, evaluating ExeSQL with its released MySQL-tuned checkpoint (exesql_bird_mysql) reveals severe performance degradation on Oracle, with non-executable rates ranging from 95% to 100% across all evaluated dialect features. Because the model has statically internalized MySQL-specific functional signatures, it erroneously applies these incompatible constructs to Oracle environments. This brittle inductive bias demonstrates that monolithic fine-tuning cannot sustainably scale to heterogeneous databases.

(Observation 4) Syntactic Executability Does Not Guarantee Semantic Correctness. Our empirical analysis reveals a substantial gap between the non-executable rate and the total error rate (i.e., queries that execute successfully but produce incorrect results). For example, ExeSQL exhibits 38%-62% semantic drift on MySQL and PostgreSQL. Similarly, rule-based tools such as SQLGlot lack many dialect mappings (e.g., 71%-76% total error on unsupported syntax), generating queries that pass syntactic validation yet violate the original user intent. These results highlight that restoring executability alone is insufficient, and motivate the need for a rigorous debugging and semantic verification mechanism that ensures logical fidelity while correcting syntactic errors.

3 System Overview

Figure 3 demonstrates the architecture and workflow of Dial.

[Figure 3: System Overview of Dial. Given question q, schema S, and dialect d, the workflow runs Dialect-Aware Logical Query Planning (logical query plan construction, then divergence-aware logic specification via cascaded operator labeling and functional category mapping), consults the Hierarchical Dialect Knowledge Base built from official database documentation (canonical syntax reference, declarative function repository, procedural constraint repository), and performs Adaptive & Iterative Debugging and Evaluation (knowledge-grounded initialization, execution-driven rule retrieval with deep diagnostic reasoning, semantic logic verification with multi-dimensional logic auditing and contrastive feedback, and incremental knowledge consolidation) to produce the dialect SQL s.]

Dialect Knowledge Base Construction. In the offline stage, Dial builds a hierarchical dialect knowledge base (HINT-KB) from official documentation. Instead of using documentation as unstructured references, we reorganize it around a Canonical Syntax Reference, which normalizes common database operations into an ANSI-aligned canonical space.
HINT-KB is structured into (1) a Declarative Function Repository mapping abstract functional intents (e.g., temporal arithmetic, string manipulation) to their concrete, dialect-specific implementations (e.g., TIMESTAMPDIFF in MySQL), which is indexed by natural-language usage patterns to enable direct intent-to-function retrieval; and (2) a Procedural Constraint Repository capturing implicit structural rules required for execution correctness (e.g., quoting conventions), where these rules are indexed by diagnostic error signatures, enabling error-driven query correction during the debugging stage.

Dialect-Aware Logical Query Planning. In the online stage, given a user request, Dial first generates a Natural Language Logical Query Plan to obtain a dialect-agnostic representation of the user's intent. (1) Logical Plan Construction: The user's query is decomposed into a linearized chain of standardized macro-operators (e.g., data sourcing, filtering, scalar calculation), which represent the core analytical steps. (2) Dialect-Aware Logic Specification: We then identify operators that require dialect-sensitive implementations and annotate them with standardized functional categories from HINT-KB. The result is a dialect-aware logical plan that serves as a precise blueprint for SQL generation.

Adaptive & Iterative Debugging and Evaluation. Based on the dialect-aware logical plan, Dial enters a closed-loop generation and validation process. (1) Knowledge-Grounded Initialization: An initial candidate query is synthesized by retrieving dialect-specific function templates from the Declarative Function Repository based on the labeled functional categories. (2) Adaptive Syntactic Recovery: If the query fails execution, the database error message is used as a key to retrieve a corresponding transformation rule from the Procedural Constraint Repository, which is then applied to patch the query.
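The dual-index idea behind the two repositories can be sketched as two lookup tables. Everything below — entry contents, key names, and the helper functions — is a hypothetical illustration of the retrieval pattern, not the paper's actual schema:

```python
# Declarative Function Repository (sketch): (intent, dialect) -> template.
# Entries are illustrative, keyed the way intent-to-function retrieval
# would consume them.
function_repo = {
    ("temporal_diff", "mysql"): "TIMESTAMPDIFF(MONTH, {start}, {end})",
    ("temporal_diff", "postgresql"):
        "EXTRACT(YEAR FROM AGE({end}, {start})) * 12 "
        "+ EXTRACT(MONTH FROM AGE({end}, {start}))",
}

# Procedural Constraint Repository (sketch): error signature -> repair rule,
# mirroring the error-driven indexing described above.
constraint_repo = {
    "ORA-00933": "Replace LIMIT n with FETCH FIRST n ROWS ONLY.",
    "ORA-00909": "Oracle CONCAT takes exactly 2 args; nest calls or use ||.",
}

def realize(intent: str, dialect: str, **slots) -> str:
    """Intent-to-function retrieval, as in knowledge-grounded initialization."""
    return function_repo[(intent, dialect)].format(**slots)

def repair_hint(error_code: str) -> str:
    """Error-driven rule retrieval, as in adaptive syntactic recovery."""
    return constraint_repo.get(error_code, "no rule; fall back to reasoning")
```

The split matters operationally: the first table is queried before execution (by refined intent), the second only after a failure (by error signature), which is what lets syntactic recovery stay decoupled from intent reasoning.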
(3) Semantic Logic Verification: Once the query is executable, it is audited against the invariants defined in the original NL-LQP to prevent any semantic drift introduced during the repair process. (4) Incremental Knowledge Consolidation: Finally, the verified repair logic is generalized into a new rule and integrated back into HINT-KB, enabling Dial to learn from its experience and continuously improve.

4 Hierarchical Dialect Knowledge Base

To generate correct dialect-specific SQL, LLMs must be informed by two types of knowledge: (1) the explicit, intent-aligned functional syntax (e.g., using TIMESTAMPDIFF to calculate an age), and (2) the implicit, structural rules required for execution (e.g., quoting reserved keywords). However, naively retrieving from official documentation is inadequate, as these manuals are definition-oriented (e.g., technical details for TIMESTAMPDIFF) rather than intent-driven like user queries (e.g., calculating a user's age).

[Figure 4: Dialect Knowledge Base Construction. The pipeline runs Documentation Tagging (diverse rule-based tagging of database documents into <dialect>-demarcated sections), Reference-Guided Syntax Mapping (aligning standardized documents against F_Func and R_Rule categories such as "Date Difference Calculation Syntax" and "Date Unit Truncation & Alignment"), and Template-Based Knowledge Generation (synthesizing HINT-KB entries, e.g., DATEDIFF(unit, start, end), DATE_TRUNC(unit, datetime), and constraints such as SQL Server requiring CONVERT(datetime, string, 23) to avoid out-of-range errors).]
To address this, we propose the Hierarchical Intent-aware Dialect Knowledge Base (HINT-KB), a structured repository that reorganizes vendor documentation into a queryable, dual-component architecture. This section details its design principles (Section 4.1) and the automated pipeline for its construction (Section 4.2).

4.1 Knowledge Base Architecture

To resolve the semantic gap between explicit user intents and implicit database execution rules, HINT-KB is designed around two core principles: a canonical reference for standardization and a decoupled retrieval architecture.

Canonical Syntax Reference. Since n distinct user requirements (e.g., "sort by date" or "get top 10") often map to a single functional requirement (e.g., result ordering and limitation), which in turn requires m atomic syntax points to implement (e.g., the ORDER BY and LIMIT keywords), mapping syntax points directly to specific requirements would result in an inefficient O(n×m) storage complexity. To eliminate redundancy, we group requirements with identical functional goals into unified categories, which serve as the fundamental units in HINT-KB. This optimization reduces the storage complexity to O(1×m), effectively preventing knowledge base bloat. These categories are designed to be dialect-agnostic and universally applicable across diverse database systems. Specifically, we abstract the comprehensive SQL syntax space into 11 distinct canonical categories (e.g., String Manipulation, Date & Time Operations, and Window Functions). This systematic categorization encompasses over 40 atomic syntax points. For instance, the category Date & Time Operations comprises 6 atomic syntax points, such as Date Truncation (e.g., DATE_TRUNC), Interval Arithmetic, and Timestamp Extraction.

Decoupled Retrieval Architecture.
HINT-KB employs a decoupled, dual-component architecture: the Declarative Function Repository (F_Func), which is retrieved using the refined user intents (detailed in Section 5.1), and the Procedural Constraint Repository (R_Rule), which is triggered by execution errors.

(1) The Declarative Function Repository (F_Func) stores dialect-specific implementations of functional constructs aligned with user intents. Each entry contains: (i) common usage scenarios, which describe potential application contexts (e.g., "computing a person's age"); (ii) detailed function specifications, which define the semantic operation (e.g., "calculating the interval between two dates in years"); and (iii) concrete implementations, which provide the specific syntactic realization (e.g., TIMESTAMPDIFF(YEAR, ...) in MySQL). These metadata elements are designed to be directly triggered by natural language requirements; for instance, the primitive C_temporal_diff can be invoked by the aforementioned age-related request. By providing such context-rich information, F_Func facilitates precise semantic disambiguation even when surface-level similarity is low.

(2) The Procedural Constraint Repository (R_Rule) captures implicit structural rules that manifest as execution errors. Each entry contains: (i) detailed specifications of latent syntactic rules, which define the grammar constraints (e.g., "reserved keywords like YEAR cannot be used as unquoted aliases"); and (ii) a collection of correct and erroneous usage cases, which provide concrete contrastive examples (e.g., Erroneous: SELECT ... AS YEAR vs. Correct: SELECT ... AS "YEAR"). These rules and examples are dynamically evolved through the execution-driven mechanism (detailed in Section 6.1).

4.2 Dialect Knowledge Base Construction

Manually populating the knowledge base for every target database is not scalable.
To automate this process, we design a three-stage knowledge construction pipeline that systematically distills and organizes dialectal knowledge from raw, unstructured vendor manuals, as illustrated in Figure 4. This pipeline leverages the Canonical Syntax Reference (B_CSR) as a semantic bridge to overcome the mismatch between definition-oriented documentation and intent-driven translation tasks.

(1) Documentation Tagging. We first ingest the raw official documentation of a target dialect. Documents in various formats (e.g., HTML, JSON, MD, SGML) are sequentially tagged by format-specific rule-based methods (e.g., <dialect> txt <dialect>). Their labeled contents are merged into a single document, yielding a processed version with clearly demarcated content sections.

(2) Reference-Guided Syntax Mapping. To populate HINT-KB for a target dialect, this step maps the structured documentation to B_CSR via a dual-track semantic alignment. Rather than relying on ambiguous keyword matching, we perform retrieval by separately projecting the semantic definitions from F_Func and R_Rule into the documentation's vector space. Specifically, functional constructs from F_Func serve as query inputs to identify semantically equivalent built-in functions; for example, the abstract intent of "extracting a substring" in F_Func is used to retrieve SUBSTR in Oracle or CHARINDEX in SQL Server. Concurrently, constraint patterns from R_Rule are utilized to locate corresponding structural rules within the manual; for instance, a generic pattern regarding "identifier quoting" in R_Rule can successfully pinpoint the specific requirement in PostgreSQL documentation that "identifiers containing uppercase letters must be enclosed in double quotes".

(3) Template-Based Knowledge Generation. After mapping official documentation to the predefined categories in B_CSR, this module employs an LLM to synthesize structured knowledge entries for the repository.
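Stage (1) can be approximated in a few lines of Python. The blank-line paragraph-splitting heuristic and the exact tag shape below are illustrative assumptions; the paper's own tagger is format-specific (HTML, JSON, MD, SGML):

```python
import re

def tag_fragments(raw_doc: str, dialect: str) -> list:
    """Split a manual into blank-line-separated fragments and demarcate
    each with the target dialect, echoing the <dialect> txt <dialect>
    convention from the Documentation Tagging stage (heuristic sketch)."""
    fragments = [p.strip() for p in re.split(r"\n\s*\n", raw_doc) if p.strip()]
    return [f"<{dialect}> {frag} <{dialect}>" for frag in fragments]

tagged = tag_fragments(
    "TIMESTAMPDIFF(unit, start, end)\n\nCONCAT(str1, str2, ...)", "mysql")
```

The demarcated fragments are what the later mapping stage embeds and aligns against B_CSR categories.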
For each functional requirement identified, the module generates a complete declarative function entry for F_Func by populating: (i) the usage scenario (e.g., “calculating age”), (ii) the semantic specification (e.g., “date difference in years”), and (iii) the concrete implementation (e.g., AGE() in PostgreSQL). For structural requirements, the module distills procedural constraints for R_Rule by defining the underlying grammar rules (e.g., “table aliases must use the AS keyword”). Notably, at this stage, R_Rule entries consist only of these precise syntactic specifications; the contrastive erroneous/correct cases are not yet generated, as they are reserved for the dynamic execution-driven evolution phase (Section 6.1). To supplement the Procedural Constraint Repository (R_Rule), this step targets documentation fragments that deviate from the ANSI baseline. Specifically, we design targeted prompts to instruct the LLM to identify dialect-specific constraints signaled by contrastive phrases, such as “unlike standard SQL” or “must be quoted as”. By filtering out extraneous descriptive text, the module isolates precise structural rules, such as reserved keyword conflicts. These structural deviations are then integrated into R_Rule to support execution-driven error correction.

5 Dialect-Aware Logical Query Planning

Existing NL2SQL methods [14, 27, 34, 45] typically focus on mapping semantic user intent to SQL generation, often overlooking the usage of dialect-specific syntax. This leads to a tight coupling between semantic understanding and dialect-specific constraints, resulting in errors arising from the neglect of dialect-specific syntax. For instance, such methods might incorrectly invoke MySQL’s CHAR_LENGTH function in Oracle, especially for complex or implicit queries. To address this challenge, we introduce Dialect-Aware Logical Query Planning, which generates a Natural Language Logical Query Plan (NL-LQP), as illustrated in Figure 5.
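An NL-LQP is, in effect, an ordered list of natural-language operator steps. A minimal representation, using the operator tags from Table 1 and the paper's running cryptocurrency example (the exact step wording is illustrative):

```python
# A minimal NL-LQP: ordered (operator, natural-language step) pairs.
# Descriptions deliberately contain no SQL syntax or dialect functions.
nl_lqp = [
    ("O_src", "inner join transactions, transaction_logs and users on their keys"),
    ("O_flt", "keep records where transaction_logs.action (TEXT) = 'viewed'"),
    ("O_cal", "clean transactions.amount (TEXT) string and cast to REAL"),
    ("O_agg", "group by user and compute the summation of cleaned amounts"),
    ("O_org", "project username and total_transaction_amount"),
]

operators_used = [op for op, _ in nl_lqp]
```

Keeping every step in plain language is what lets the same plan be lowered to MySQL, Oracle, or any other dialect later.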
This module separates dialect-specific syntax from core semantic logic, enabling more accurate SQL translation by explicitly handling dialect-specific features. It operates in two phases: (1) constructing the semantic logical plan to decompose the query into operators (Section 5.1), and (2) specifying the dialect-aware logic usages to tag potential dialect-sensitive elements (Section 5.2).

Table 1: Standardized Logical Operators in NL-LQP.

Operator | Semantic Definition
Data Sourcing (O_src) | Identifies base relations and resolves logical dependencies (e.g., joins) to establish the target data scope.
Filtering (O_flt) | Evaluates predicates to prune tuples, encompassing both row-level selections and post-aggregation constraints.
Scalar Calculation (O_cal) | Derives new values via row-level transformations, such as type casting, string processing, and temporal arithmetic.
Aggregation (O_agg) | Alters data granularity by grouping records along specified dimensions and computing summary metrics.
Result Organization (O_org) | Structures the final output sequence by applying attribute projection, sorting criteria, and cardinality constraints.
Auxiliary Operation (O_aux) | Handles supplementary logic, acting as a fallback for operations unclassifiable under the primary relational operators.

5.1 Logical Query Plan Construction

To derive a strictly dialect-agnostic logical plan L, we design an LLM-based structured generation method governed by strict constraints. Given the user query q, the target dialect d, and the database schema S (the DDL including data types and data samples), we generate a plan that decomposes the input user query and outputs a sequential list of logical operators. We first present the standardized logical operators in the plan, and then describe the two-step construction.

Standardized Logical Plan Operators.
To isolate the core semantics of a query from dialect-specific syntactic features, we define a set of logical plan operators that map semantic expressions in user queries to dialect-specific SQL syntax. As shown in Table 1, these operators decompose the analytical intent of user queries into six standardized logical operators, facilitating the separation between semantic meaning and dialect-specific requirements. For example, consider a user query that involves extracting data from multiple sources and applying a string length function. The operator O_src identifies the base relations and resolves logical dependencies (e.g., joins) to establish the data scope. Then, for the string-length operation, we should use O_cal to perform a scalar calculation, but the specific function depends on the target database’s dialect. For instance, MySQL uses CHAR_LENGTH() while Oracle uses LENGTH().

Semantic-Driven Logical Plan Construction. Using these operators, we instruct LLMs to construct the plan semantically.

(1) Execution-Guided Query Decomposition. To ensure the validity of the generated SQL queries, we require the LLM to follow a strict relational SQL execution order to generate the plan. This ensures that the logic aligns with how relational databases process queries (e.g., by first filtering raw data, then computing scalar metrics, and finally aggregating).

[Figure 5: Dialect-Aware Logical Query Planning.]

To ground the logical operators in the physical database, we retrieve accurate schema metadata, including the exact data types and the data samples for relevant schema attributes. The metadata is incorporated into the prompt context, ensuring that every referenced column explicitly carries its physical data type (e.g., transactions.amount (TEXT)). Furthermore, to maintain strict semantic separation, the LLM is prohibited from outputting any SQL syntax or database-specific functions, and all operators should be expressed in natural language. For example, consider the query “What are the total cryptocurrency amounts associated with each user’s viewed transactions?”: the LLM translates this into a sequential base plan. First, it performs O_src to access the transactions table and join it with the transaction_logs and users tables.
Then, it applies O_flt to filter the results for records where transaction_logs.action (TEXT) is “viewed”. It finally projects the result with O_org, outputting the username and the aggregated total_transaction_amount.

(2) Context-Aware Implicit Logic Mining. Because users often omit necessary intermediate steps in their natural language queries, we introduce a heuristic compensation mechanism. Without explicitly extracting these intermediate steps, SQL generation is likely to overlook essential functional requirements and introduce errors. Specifically, we jointly analyze the user’s analytical intent and the actual data samples embedded within the schema, identifying implicit requirements that are not explicitly stated but are crucial for producing valid SQL. The same query as in the last step asks for the “total cryptocurrency amounts” based on the transactions.amount column, yet the schema reveals that this column is of type TEXT (e.g., “$ 1,234.56 USD”). The mechanism detects a conflict between the desired operation (summation) and the data type (string), which would otherwise lead to errors in SQL generation. Since type conversion mechanisms differ across database dialects, the system explicitly handles this by materializing an implicit O_cal operator. This operator cleans the string (e.g., stripping symbols and commas) and casts it into a numeric type. The converted value is then passed to the O_agg operator for aggregation.

5.2 Dialect-Aware Logic Specification

Although the logical operators extracted in Section 5.1 are structurally sound, they remain semantically under-specified (e.g., a scalar calculation operator “Convert transactions.amount (TEXT) into a numeric value”). Using such operators to retrieve syntactic rules from a dialect knowledge base would introduce substantial retrieval noise. To augment the generic logical plan L for accurate downstream dialect-specific implementation, we introduce a Dialect-Aware Logic Specification mechanism.
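The implicit-logic mining of Section 5.1(2) amounts to a conflict check between the requested operation and the column's physical type. A minimal sketch; the intent keywords and per-dialect cast templates below are illustrative assumptions, not the paper's actual rule set:

```python
# Illustrative assumptions: a tiny numeric-intent lexicon and two cast templates.
NUMERIC_INTENTS = {"total", "sum", "average", "amount"}

CAST_TEMPLATES = {  # dialect -> cleaning + cast expression for a TEXT column
    "mysql":      "CAST(REPLACE(REPLACE({col}, '$', ''), ',', '') AS DECIMAL(18,2))",
    "postgresql": "REPLACE(REPLACE({col}, '$', ''), ',', '')::numeric",
}

def mine_implicit_cast(question: str, column: str, col_type: str, dialect: str):
    """Materialize an implicit O_cal step when a numeric intent targets a TEXT column."""
    wants_number = any(w in question.lower() for w in NUMERIC_INTENTS)
    if wants_number and col_type.upper() == "TEXT":
        return CAST_TEMPLATES[dialect].format(col=column)
    return None  # no conflict: aggregation can consume the column directly

step = mine_implicit_cast(
    "What are the total cryptocurrency amounts per user?",
    "transactions.amount", "TEXT", "postgresql",
)
```

In the actual system this decision is made jointly by the LLM and the schema's data samples; the point of the sketch is only the shape of the check: intent says "number", type says TEXT, therefore inject a cleaning-and-cast O_cal step.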
This mechanism links the logical operators to their corresponding dialect-specific syntax. Through two formalized sequential modules, we systematically augment the logical operators with standardized dialect specifications.

(1) Cascaded Operator Labeling. Given a generated plan with a sequence of logical operators L = ⟨o_1, o_2, ..., o_n⟩, we propose a rule-based cascaded labeling function, F_label, to efficiently isolate the subset of dialect-sensitive operators (L_sen ⊆ L). This process minimizes computational overhead and reduces downstream retrieval noise. Rather than relying on costly LLM inference, F_label applies three hierarchical checks to evaluate each operator o_i: (i) Operator Category Sorting: F_label first sorts operators with inherent structural divergence, such as scalar calculations (O_cal) and result organization operations like cardinality constraints (O_org), before those adhering to core ANSI SQL standards, like basic entity selection (O_src) and simple equality filtering (O_flt); (ii) Lexical Trigger Matching: to detect more complex operations hidden within simple operators, F_label scans the description of o_i using a predefined lexicon of dialect-sensitive keywords (e.g., “extract”, “regex”, “cast”); (iii) Type-Aware Dependency Checking: lastly, F_label cross-references the referenced schema attributes with the schema definition to identify operators that manipulate complex or dialect-specific data types (e.g., TIMESTAMP, JSON, ARRAY), which are flagged as dialect-sensitive regardless of their operator category. Operators extracted from this cascaded pipeline, forming L_sen, are forwarded to explicit dialect augmentation.

(2) Functional Category Mapping. Since operators in L_sen have diverse textual descriptions, we define a functional category set C = {C_1, C_2, ..., C_m} to encode their functional characteristics into a unified semantic space.
Each category C_i represents a standardized class of dialect-specific operations (e.g., [Temporal_Manipulation], [String_Processing]). For each operator o_i ∈ L_sen, we use an LLM as a semantic classifier and standardizer to map it to the corresponding category. The LLM takes the verbose description of o_i along with the category set C as inputs, then assigns o_i to the most relevant category C ∈ C. At the same time, the LLM is instructed to discard unnecessary explanations (e.g., business logic justifications) and focus on reformulating the core intent into a standardized textual format. This results in a standardized reference o*_i = ⟨C*, standard_description⟩. For example, the description of the transactions.amount operation is converted into a precise representation: o*_cal = ⟨[String_Processing], “Slice string from index 3 to length-4, remove commas, cast to float”⟩. By appending these standardized representations to their categories, the generic plan L_sen is transformed into the dialect-aware plan L*_sen. These standardized representations then serve as high-precision semantic indices, guiding the downstream module to retrieve the appropriate syntactic patterns from the knowledge base.

6 Adaptive & Iterative Debugging and Evaluation

While HINT-KB provides foundational knowledge, it cannot guarantee one-shot executability due to LLM hallucinations and implicit dialect constraints discoverable only at runtime. For instance, an LLM might hallucinate a non-existent function like DATE_AGE() for PostgreSQL instead of the correct AGE(), or overlook Oracle’s prohibition of using column aliases in WHERE clauses. A naive refinement that directly feeds error messages back to the LLM often leads to semantic drift, i.e., the model may alter the original business logic just to make the query run. To address this, we propose Adaptive & Iterative Debugging and Evaluation.
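The recovery procedure detailed in Section 6.1 can be sketched as a simple control flow: try a rule-based fix first, and escalate to deep LLM diagnosis only when no stored rule applies. The generator, executor, rule retriever, and diagnostic reasoner are abstracted here as placeholder callbacks:

```python
def debug_loop(generate, execute, retrieve_rule, deep_diagnose, max_rounds=3):
    """Adaptive syntactic recovery: rule-based repair first, LLM diagnosis as escalation.

    execute(query) returns (ok, error_trace); retrieve_rule(error, query)
    returns a query-rewriting callable from R_Rule, or None if no rule matches.
    """
    query = generate()                      # knowledge-grounded initial candidate
    for _ in range(max_rounds):
        ok, error = execute(query)
        if ok:
            return query
        rule = retrieve_rule(error, query)  # execution-driven R_Rule lookup
        query = rule(query) if rule else deep_diagnose(query, error)
    return query
```

In the paper's framing, deep_diagnose is also handed the dialect-aware plan L*_sen as a structural anchor so the repair cannot drift from the original intent; that argument is omitted here for brevity.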
It decouples syntactic error resolution from semantic logic auditing by using the dialect-aware plan (L*_sen) as an immutable ground truth, ensuring all repairs remain strictly aligned with the user’s intent. The process unfolds in three stages:

6.1 Adaptive Syntactic Recovery

This stage iteratively refines the generated query until successful execution is achieved. We collect diagnostic feedback from the target database and resolve detected violations through a structured three-phase recovery pipeline. This process ensures that the final query is not only syntactically well-formed but also compliant with dialect-level compilation constraints and executable.

(1) Knowledge-Grounded Generation. The process begins by synthesizing an initial candidate query, Q_init, using the dialect-aware plan L*_sen, the database schema S, and relevant function templates from the Declarative Function Repository (F_Func). This query prioritizes functional mapping over structural compliance. If executing Q_init fails, the system captures the raw database error trace E_raw as a diagnostic signal.

(2) Execution-Driven Rule Retrieval. The system then uses the error trace E_raw and the associated failing SQL segment as a composite key to retrieve a specific transformation rule from the Procedural Constraint Repository (R_Rule). This retrieval is driven by engine feedback, not user intent, ensuring the fix is targeted at the specific dialect violation. The retrieved rule is applied to Q_init to produce a revised query, Q_rev.

(3) Deep Diagnostic Reasoning. If Q_rev also fails, indicating a complex error not covered by existing rules, the system escalates to a deep diagnostic phase. It performs a multi-dimensional root-cause analysis by cross-referencing the flawed query Q_rev, the new error trace, and the ground-truth intent preserved in L*_sen. Here, L*_sen
serves as a crucial structural anchor, ensuring that necessary syntactic transformations do not inadvertently alter the core logic. This reasoning process yields a syntactically viable query, Q_exec.

6.2 Semantic Logic Verification

To ensure the executable query Q_exec has not drifted from the user’s intent, it undergoes a formal logic audit. Unlike approaches that rely on subjective self-reflection [32], which can be prone to inconsistency without external grounding [10], we leverage the dialect-aware plan L*_sen as an objective gold standard.

(1) Multi-Dimensional Logic Auditing. We parse Q_exec into an Abstract Syntax Tree (AST) and map its clauses to a sequence of logical operators. This allows for a normalized comparison against four semantic invariants derived from the macro-operators in L*_sen:

• Structural Topology: verifies that the join relationships in the query match the logical associations (O_src) in the plan.
• Constraint Fidelity: ensures all filtering rules (O_flt) are correctly implemented in the WHERE and HAVING clauses.
• Computational Consistency: confirms that aggregation and calculation logic (O_agg, O_cal) are mathematically equivalent to the user’s intent.
• Projection Accuracy: matches the final output columns and aliases (O_org) against the target projection in the plan.

(2) Contrastive Feedback and Rectification. If any invariant is violated, the system generates a semantic deviation report pinpointing the mismatch. This report is fed back to the reasoning module as a high-priority constraint for a targeted repair. This cycle repeats until all invariants are satisfied, yielding the final, verified query s.

6.3 Incremental Knowledge Consolidation

To eliminate redundant reasoning for recurring errors, the final stage distills successful repair trajectories into reusable knowledge for HINT-KB, enabling the knowledge base to evolve autonomously.

(1) Knowledge Distillation.
A validated repair is abstracted into a generalized, schema-agnostic knowledge primitive, G = ⟨P_inc, E_cor, A_rtc⟩. This structure formalizes the Incorrect Pattern (P_inc), the Corrective Exemplar (E_cor), and a natural-language Root-Cause Analysis (A_rtc), transforming a one-off fix into a structured heuristic (e.g., mapping MySQL Error 1241 to its corresponding fix).

(2) Dual-Mechanism Knowledge Routing. The new primitive G is then integrated back into HINT-KB. A routing decision is made based on the cosine similarity between G and the original logical plan L*_sen. If similarity is high (≥ 0.75), the fix is deemed intent-driven and is added to the Declarative Function Repository (F_Func). Otherwise, it is categorized as a universal, environment-driven constraint and is routed to the Procedural Constraint Repository.

7 Experiments

In this section, we comprehensively evaluate Dial across a heterogeneous environment comprising six major database systems: SQLite, PostgreSQL, MySQL, SQL Server, DuckDB, and Oracle. We first detail the experimental setup, then present the construction of a novel, high-quality multi-dialect NL2SQL benchmark, and finally present a comprehensive analysis of the experimental results.

7.1 Experimental Setup

Baselines. We compare Dial against three categories of state-of-the-art approaches, evaluating their standard generation performance. To ensure a fair comparison, we use Qwen-3-Max as the default LLM backbone. (1) Input Prompting: We evaluate DIN-SQL [27] and Agentar-Scale-SQL [35]. These methods rely primarily on LLMs’ in-context learning capabilities, driven by elaborate prompting strategies. Specifically, DIN-SQL is deployed with Qwen-3-Max, and Agentar utilizes its officially open-sourced Agentar-Scale-SQL-Generation-32B model. (2) Model Finetuning: These methods attempt to bridge the dialect gap by fine-tuning models on multi-dialect corpora.
We evaluate EXESQL [42] using its released exesql_bird_mysql checkpoint, which is a fine-tuned version of DeepSeek-Coder-7B (epoch1_bird_mysql) on the bird_dpo_mysql dataset. We exclude SQL-GEN [30] from our evaluation because its fine-tuned model weights are not open-sourced. (3) Tool Augmentation: These methods augment the generation process by integrating external parsers or rule-based translators. We evaluate Dialect-SQL [31] (using Qwen-3-Max) and the widely used translation engine SQLGlot [1]. Since SQLGlot is purely a translation tool, we adopt a pipeline approach: we first use Agentar to generate SQL in the widely supported SQLite dialect, and then employ SQLGlot to translate these queries into the target database dialects. We omit WrenAI [37] from our batch evaluation because it is a highly integrated application framework with rigid connection modes; it restricts connections to a single database instance at a time, lacks SQLite support, and exhibits severe performance degradation when handling schemas with numerous tables.

Evaluation Metrics. We employ three metrics to measure performance: (1) Executability (Exec): the percentage of generated SQL queries that execute without syntax errors on the target database; (2) Execution Accuracy (Acc): the percentage of generated SQL queries that return result sets identical to the gold SQL; (3) Dialect Feature Coverage (DFC): the recall of dialect-specific features (e.g., unique functions) that appear in the gold SQL and are successfully used in the generated SQL. Computed with predefined regular expression rules, DFC is a fine-grained metric that evaluates whether a method uses the intended database syntax correctly.

Implementation. All experiments and database executions are conducted on a workstation running Ubuntu 22.04 LTS, equipped with 512 GB of main memory and high-capacity storage.
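A toy rendering of the regex-driven DFC metric; the feature patterns below are illustrative stand-ins, not the benchmark's actual rule set:

```python
import re

# Illustrative dialect-feature patterns (the benchmark's real rules are richer).
DIALECT_FEATURES = {
    "oracle": [r"\bLISTAGG\s*\(", r"\bTO_DATE\s*\(", r"\bFROM\s+DUAL\b"],
    "mysql":  [r"\bTIMESTAMPDIFF\s*\(", r"\bGROUP_CONCAT\s*\("],
}

def dfc(generated: str, gold: str, dialect: str) -> float:
    """Recall of gold-SQL dialect features reproduced in the generated SQL."""
    patterns = [p for p in DIALECT_FEATURES[dialect]
                if re.search(p, gold, re.IGNORECASE)]
    if not patterns:
        return 1.0  # no dialect-specific feature to cover in this sample
    hits = sum(bool(re.search(p, generated, re.IGNORECASE)) for p in patterns)
    return hits / len(patterns)
```

For example, a generated query that uses GROUP_CONCAT where the Oracle gold SQL uses LISTAGG may still execute after translation, but scores zero DFC on that feature, which is exactly the idiomaticity gap this metric is meant to expose.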
To ensure reproducibility and accurate execution feedback, we evaluate the generated queries against the following database versions: SQLite v3.45.3, MySQL v8.0.45, PostgreSQL v14.20, SQL Server v17.0, DuckDB v1.4.3, and Oracle Database 19c (Enterprise Edition).

7.2 DS-NL2SQL Benchmark

Existing Text-to-SQL benchmarks, such as Spider [40] and BIRD [20], predominantly focus on SQLite-compatible syntax and fail to capture the specificity and heterogeneity inherent in real-world enterprise database dialects. To bridge this gap, we constructed DS-NL2SQL, a benchmark comprising 2,218 test samples across 796 distinct databases that supports the evaluation of tasks targeting specific database engines. As summarized in Table 2, DS-NL2SQL provides parallel multi-dialect NL-SQL pairs with an average dialect discrepancy of 3.67 points per sample, which significantly exceeds the 1.60 points recorded for BIRD Mini-Dev. To ensure a precise assessment of dialect-specific syntax, we prioritize queries that exhibit dialect incompatibility, in which implementations are syntactically exclusive to specific database systems. Furthermore, by manually enforcing execution equivalence across all variations, the benchmark ensures that execution results remain consistent across engines and eliminates interference from logical errors, facilitating an objective assessment of engine-specific constraint satisfaction. Crucially, to strictly focus on evaluating dialect-specific capabilities, DS-NL2SQL provides the ground-truth schema elements (i.e., the specific tables and columns used in the gold SQL) as part of the input for SQL generation. This design choice eliminates the need for an additional schema linking step, neutralizing the confounding effects of schema retrieval errors and ensuring a fair, targeted comparison of the models’ pure dialect generation performance. The construction pipeline is as follows:

(1) Data Aggregation and Context Decoupling.
We aggregated data from multiple mainstream datasets, including Spider [40], BIRD [20], SparC [41], CoSQL [39], OmniSQL [18], and Archer [44]. For multi-turn conversational datasets like SparC and CoSQL, we employed LLMs to rewrite context-dependent queries into semantically complete, self-contained questions, eliminating contextual dependencies within dialogue turns. (2) Dialect Migration and Syntax Validation. Using SQLite [8] as the source dialect, we utilized SQLGlot [1] to translate queries into five target dialects: MySQL [5], PostgreSQL [7], SQL Server, DuckDB [4], and Oracle [6]. We then enforced strict syntactic validation using the specific parse trees of each target database, discarding samples with parsing errors to ensure syntactic viability. (3) Dialect Specificity Filtering. A critical step in our pipeline is ensuring the benchmark targets dialect nuances rather than generic SQL. We migrated the schemas to all target database systems using SQLAlchemy and executed the queries. If a query was compatible across all systems (e.g., a simple SELECT * FROM table), it was deemed a "generic query" and excluded. We retained only queries that exhibited dialect exclusivity (i.e., failed on at least one system due to dialect mismatch). (4) Consistency Verification and Manual Correction. We verified the execution results across dialects to ensure logical equivalence. Furthermore, to address the limitations of automated tools (e.g., SQLGlot’s failure to map SQLite’s GROUP_CONCAT to Oracle’s LISTAGG), we performed meticulous manual corrections using official documentation. This process resulted in a robust multi-dialect benchmark characterized by high heterogeneity.

Table 2: Comparison of our Dialect-Specific NL2SQL Benchmark with other representative benchmarks.
Benchmark | Dialect-Specific NL2SQL | Multi-Dialect NL-SQL | # Test Samples | # Dialect Types | # Test Databases | Avg. Dialectal Discrepancy
Spider [40] | ✗ | – | 2,147 | 1 | 206 | –
BIRD [20] | ✗ | – | 1,789 | 1 | 15 | –
BIRD Mini-Dev [20] | ✓ | ✗ | 500 | 3 | 11 | 1.60
PARROT [47] | ✗ | ✓ | 598 | 8 | – | –
Spider 2.0-Lite [14] | ✓ | ✗ | 547 | 3 | 158 | –
DS-NL2SQL | ✓ | ✓ | 2,218 | 6 | 796 | 3.67

Table 3: Main performance comparison on DS-NL2SQL across six database dialects. Exec: Executability (%), Acc: Execution Accuracy (%), DFC: Dialect Feature Coverage (%). Each cell reports Exec / Acc / DFC.

Method | SQLite | PostgreSQL | MySQL | SQL Server | DuckDB | Oracle
Input Prompting
DIN-SQL | 83.36 / 44.27 / 63.75 | 69.07 / 37.83 / 40.94 | 49.95 / 29.13 / 33.59 | 73.67 / 40.08 / 48.05 | 65.28 / 36.93 / 42.72 | 66.73 / 39.13 / 53.16
Agentar-Scale-SQL | 98.69 / 50.36 / 74.93 | 82.10 / 41.25 / 44.63 | 77.95 / 37.96 / 48.87 | 65.24 / 31.70 / 37.82 | 85.75 / 44.23 / 54.43 | 78.58 / 42.25 / 50.16
Model Finetuning
EXESQL | 86.88 / 26.96 / 44.03 | 80.12 / 26.65 / 28.29 | 84.36 / 26.69 / 37.07 | 54.37 / 18.26 / 20.05 | 81.24 / 27.23 / 35.12 | 5.50 / 3.74 / 4.25
Tool Augmentation
Dialect-SQL | 81.56 / 41.61 / 55.84 | 81.24 / 39.90 / 42.65 | 84.58 / 44.05 / 58.65 | 80.43 / 41.43 / 51.97 | 80.66 / 41.16 / 49.31 | 74.75 / 39.13 / 50.97
Agentar-Scale-SQL+SQLGlot | 98.60 / 50.36 / 74.93 | 81.51 / 42.61 / 70.72 | 90.17 / 47.57 / 67.89 | 80.79 / 42.52 / 68.33 | 82.78 / 43.15 / 65.54 | 81.97 / 43.55 / 58.43
Dial (Ours) | 99.67 / 59.00 / 90.07 | 98.33 / 53.33 / 78.70 | 99.87 / 55.87 / 88.42 | 99.00 / 51.71 / 85.70 | 99.93 / 57.94 / 80.58 | 99.21 / 53.42 / 76.97

7.3 Performance Comparison

Table 3 presents the overall performance of Dial and the baselines across six database dialects.

Translation Accuracy. To rigorously evaluate cross-system generalization, we strictly define the overall metrics reported in Figure 6: for a given natural language query, the overall Exec is counted as 1 if and only if the generated SQL executes successfully across all evaluated database systems (otherwise 0); similarly, the overall Acc is 1 only if the correct result is returned across all systems.
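The all-or-nothing aggregation can be sketched directly:

```python
def overall_score(per_dialect: dict) -> float:
    """All-or-nothing aggregation: a sample scores 1 only if it succeeds on every dialect.

    per_dialect maps dialect -> list of per-sample booleans, aligned by sample index.
    Returns the overall percentage.
    """
    systems = list(per_dialect.values())
    n = len(systems[0])
    wins = sum(all(flags[i] for flags in systems) for i in range(n))
    return 100.0 * wins / n

# Three hypothetical samples: only the first succeeds on every system.
exec_flags = {
    "sqlite":     [True, True, True],
    "postgresql": [True, False, True],
    "oracle":     [True, True, False],
}
score = overall_score(exec_flags)
```

This is why a single weak dialect (such as EXESQL's Oracle results) drags the overall number far below any per-dialect average: one failure anywhere zeroes out the whole sample.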
Under this strict all-or-nothing requirement, Dial significantly outperforms all baselines, achieving an overall Exec of 97.33% and an overall Acc of 48.39%. Compared to the best-performing baseline, SQLGlot, Dial improves overall Exec by 23.12% and overall Acc by 9.48%. This rigorous metric inherently highlights our method’s robust cross-dialect adaptation capabilities. While baselines might succeed on familiar dialects (e.g., SQLite), they suffer from single-point failures on complex dialects, causing their overall scores to plummet. For instance, model fine-tuning methods like EXESQL suffer from severe dialect overfitting: because EXESQL is statically fine-tuned on MySQL, it rigidly internalizes MySQL-specific syntax and fails catastrophically on Oracle (e.g., dropping to 5.50% Exec), drastically dragging down its overall executability.

[Figure 6: Overall Performance Comparison.]

Translation Robustness. The performance gap is particularly pronounced in dialects with complex or highly unique syntax paradigms, such as Oracle and DuckDB. For instance, on Oracle, Dial achieves 99.21% Exec and 53.42% Acc, whereas SQLGlot only reaches 81.97% Exec and 43.55% Acc. Tool-augmentation methods like SQLGlot lack comprehensive translation rules for complex nested queries and often degrade native operators. Meanwhile, input prompting methods (e.g., Agentar-Scale-SQL) suffer from severe “dialect hallucinations”, mistakenly applying MySQL or PostgreSQL functions in Oracle. In contrast, Dial utilizes a logic-decoupled architecture to accurately isolate user intents before anchoring them to dialect-specific implementations.

7.4 Fine-Grained Analysis

To better understand why Dial successfully translates queries where baselines fail, we conduct finer-grained analysis of both dialect coverage and LLM backbones.

Dialect Coverage.
Executability alone does not guarantee that queries are written idiomatically. Dial achieves a high DFC across all databases (e.g., 90.07% on SQLite and 88.42% on MySQL). This indicates that our system does not merely generate generic SQL to bypass syntax errors, but effectively retrieves and synthesizes native dialect features to systematically overcome complex dialectal conflicts. For instance, Dial successfully avoids Unsupported Syntax by accurately synthesizing Oracle’s native LISTAGG instead of hallucinating GROUP_CONCAT, prevents Incorrect Usage by strictly adhering to Oracle’s two-argument CONCAT signature, and resolves Implicit Constraints by restructuring illegal nested aggregations in MySQL into compliant CASE WHEN constructs.

Table 4: Performance Comparison of Dial Variants. Each cell reports Exec / Acc / DFC; rows 2–5 ablate the components named in Section 7.5.

Method | PostgreSQL | SQL Server | Oracle
Dial (full) | 98.33 / 53.33 / 78.70 | 99.00 / 51.71 / 85.70 | 99.21 / 53.42 / 76.97
w/o Logic Query Planning | 91.64 / 38.59 / 53.35 | 98.16 / 48.35 / 69.13 | 95.16 / 43.99 / 52.41
w/o Correction | 96.98 / 43.17 / 58.70 | 93.21 / 44.49 / 61.90 | 98.35 / 43.76 / 51.97
w/o HINT-KB | 96.53 / 45.94 / 61.92 | 95.63 / 46.71 / 61.26 | 91.34 / 43.15 / 48.52
w/o Logic Query Planning & Correction | 91.86 / 38.56 / 53.35 | 83.30 / 44.05 / 58.39 | 95.40 / 43.07 / 52.41

Table 5: Performance over Different LLM Backbones. Each cell reports Exec / Acc / DFC.

Method / Backbone | PostgreSQL | SQL Server | Oracle
Dial (Ours)
  Qwen-3-Max | 98.33 / 53.33 / 78.70 | 99.00 / 51.71 / 85.70 | 99.21 / 53.42 / 76.97
  DeepSeek-V3.2 | 96.42 / 43.10 / 63.10 | 99.42 / 45.39 / 65.27 | 98.03 / 45.77 / 60.08
  GPT-5.2 | 96.88 / 48.91 / 70.95 | 99.76 / 51.68 / 76.84 | 99.70 / 48.89 / 66.64
DIN-SQL
  Qwen-3-Max | 69.07 / 37.83 / 40.94 | 73.67 / 40.08 / 48.05 | 66.73 / 39.13 / 53.16
  DeepSeek-V3.2 | 41.24 / 23.33 / 23.61 | 49.65 / 29.11 / 33.53 | 36.05 / 20.38 / 24.14
  GPT-5.2 | 50.32 / 30.07 / 30.83 | 46.78 / 27.67 / 32.43 | 62.64 / 39.35 / 43.19

LLM Backbones. We evaluate the performance of Dial and DIN-SQL across three different LLMs: Qwen-3-Max, DeepSeek-V3.2, and GPT-5.2. As shown in Table 5, Dial maintains stable and high performance regardless of the underlying LLM. For instance, on PostgreSQL, Dial achieves an execution accuracy between 43.10% and 53.33% across the three models.
In contrast, DIN-SQL exhibits high variance, dropping from 37.83% with Qwen-3-Max to 23.33% with DeepSeek-V3.2. This demonstrates that by explicitly decoupling semantic logic from dialect syntax and relying on an external knowledge base, Dial significantly reduces its dependency on the LLM’s internal (often flawed) parametric knowledge.

7.5 Ablation Study

We conduct ablation studies to verify the effectiveness of the three core components of Dial. The results on PostgreSQL, SQL Server, and Oracle are summarized in Table 4. Note that a configuration omitting the knowledge base (HINT-KB ✗) while retaining the Adaptive & Iterative Debugging and Evaluation (AIDE ✓) is fundamentally invalid in our architecture. As detailed in Sections 4 and 6, AIDE is explicitly grounded in HINT-KB. Specifically, the “Execution-Driven Rule Retrieval” phase relies on the Procedural Constraint Repository (R_Rule) within HINT-KB to map raw diagnostic error signals to validated transformation rules. Without this structured knowledge anchor, execution-driven debugging degenerates into blind LLM self-reflection, which we empirically observe frequently induces severe semantic drift. Therefore, AIDE’s execution is inextricably linked to HINT-KB’s presence.

7.5.1 Effectiveness of the Dialect-Aware Logical Query Planning. We investigate the impact of the Logic Query Planning module by removing it (Row 2 in Table 4). Without it, the model must simultaneously perform semantic reasoning and syntax generation. This tangled process leads to a significant performance drop. For example, on PostgreSQL, the execution accuracy drops from 53.33% to 38.59%, and DFC drops from 78.70% to 53.35%. The planning module is crucial because it breaks the query down into standard logical operators, reducing the search space and preventing the LLM from being disoriented by complex database schemas.

7.5.2 Effectiveness of the Hierarchical Dialect Knowledge Base.
The HINT-KB component bridges the gap between abstract intents and concrete syntax. When the system operates without the complete hierarchical knowledge base (Row 4), it struggles to identify the correct dialect-specific functions. As a result, the model falls back on its pre-trained biases, causing the Dialect Feature Coverage (DFC) on Oracle to drop from 76.97% to 48.52%. This confirms that relying solely on the LLM's internal knowledge or raw documentation is insufficient; the structured, intent-aware mapping provided by HINT-KB is essential for native syntax synthesis.

7.5.3 Effectiveness of Adaptive & Iterative Debugging and Evaluation. We assess the contribution of the iterative correction mechanism by disabling the feedback loop (Row 3). Without execution-driven correction, the execution accuracy on SQL Server decreases from 51.71% to 44.49%. Zero-shot generation, even with an accurate knowledge base, cannot anticipate all implicit engine constraints (e.g., transient type-casting rules or specific reserved-keyword conflicts). The adaptive debugging mechanism provides critical on-the-fly repairs, ensuring that minor syntactic violations are resolved without causing semantic drift from the original user intent.

7.6 Case Study

We conduct a finer-grained analysis of the translation errors according to the categories in Table 6 and identify which factors contribute to successful dialect-specific SQL generation. The table showcases representative examples that are supported by Dial but not adequately handled by the baselines. Based on the execution feedback, we systematically categorize these dialectal generation failures into three distinct classes:

(1) Unsupported Syntax. LLMs are prone to hallucination or to blindly transferring functions from dominant dialects (e.g., MySQL or SQLite) to the target database.
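This function-transfer failure mode can be made concrete. The sketch below (schema, table, and data are hypothetical, not drawn from the benchmark) runs the MySQL/SQLite-flavored aggregation, which SQLite accepts natively; the closing comment shows the LISTAGG form that Oracle requires instead, the exact substitution case U1 exercises.

```python
import sqlite3

# Hypothetical schema illustrating case U1 (string aggregation per group).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE access_log (city TEXT, ip_address TEXT);
    INSERT INTO access_log VALUES
        ('Berlin', '10.0.0.1'),
        ('Berlin', '10.0.0.2'),
        ('Paris',  '10.0.0.3');
""")

# MySQL / SQLite dialect: GROUP_CONCAT is a native aggregate function.
rows = conn.execute("""
    SELECT city, GROUP_CONCAT(ip_address)
    FROM access_log
    GROUP BY city
    ORDER BY city
""").fetchall()
print(rows)

# Oracle has no GROUP_CONCAT; the same intent must be rendered as:
#   SELECT city, LISTAGG(ip_address, ',') WITHIN GROUP (ORDER BY ip_address)
#   FROM access_log GROUP BY city
# Emitting the MySQL form against Oracle fails immediately at parse time.
```

A dialect-unaware generator that memorized the first spelling produces an unexecutable query on the second engine, which is precisely the divergence HINT-KB is designed to resolve.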
As shown in Table 6, when tasked with string aggregation (U1), baselines like DIN-SQL and EXESQL incorrectly project MySQL's GROUP_CONCAT function onto Oracle, causing immediate execution failures.

Table 6: Dialect-Specific Generation Errors Effectively Resolved by Dial (user questions are abstracted as realistic questions; target syntax highlights the correct dialect-specific pattern alongside strictly prohibited anti-patterns).

| Type | User Question (Abbreviated Intent) | Target Syntax Constraint (Gold) | Target Dialect |
|---|---|---|---|
| U1 | List all IP addresses accessed by each city. | LISTAGG(...) (strictly no GROUP_CONCAT) | Oracle |
| U2 | Find the shipment details matching the ID. | CAST(id AS CHAR) (strictly no AS TEXT) | MySQL |
| U3 | When did the earliest complaint start on 2017-03-22? | TO_DATE(...) (strictly no date() function) | Oracle |
| U4 | What are the total sales metrics for last month? | Native numeric (strictly no ORM bindings) | DuckDB |
| M1 | Combine first, middle, and last names. | first \|\| middle \|\| last (strictly no 3-arg CONCAT) | Oracle |
| M2 | Retrieve user details across multiple related logs. | FROM "T1" "T2" (strictly no AS keyword) | Oracle |
| M3 | Sort the movie ratings from lowest to highest. | ORDER BY rating ASC (strictly no asc()) | Oracle |
| M4 | Find songs where the language is exactly English. | col = 'english' (strictly no double quotes " ") | Postgres/MySQL |
| M5 | What is the total number of districts? | SELECT ... FROM DUAL (strictly no absent FROM) | Oracle |
| M6 | Find the lowest stock product among top sellers. | FROM (SELECT...) AS alias (strictly no anonymous subquery) | Postgres/MySQL |
| I1 | What are the names and total distinct programs used? | GROUP BY ... (strictly no DISTINCT inside OVER()) | Postgres/MySQL |
| I2 | Count days with high trading volume. | COUNT(CASE WHEN...) (strictly no nested AVG aggregation) | MySQL |
| I3 | Find the highest revenue movie and its average rating. | Isolate in CTE (strictly no unaggregated ORDER BY) | SQL Server |
| I4 | What was the average price of the most recent crypto? | = (SELECT ... LIMIT 1) (strictly no unconstrained scalar) | Postgres/MySQL |

Meanwhile, as shown in U4, Dialect-SQL suffers from rigid type bindings: its ORM layer fails to map specific underlying data types in DuckDB (e.g., native numerics). This abstraction leak causes the entire translation pipeline to crash directly. In contrast, Dial resolves this by decoupling the abstract user intent from the SQL generation. It queries the hierarchical knowledge base (HINT-KB) to anchor the exact native implementations (e.g., LISTAGG for Oracle), ensuring the generated functions are strictly supported by the target engine.

(2) Incorrect Usage. Even when models select the correct target syntax or keywords, they frequently violate dialect-specific usage rules and function signatures. For example, as shown in M1, while Oracle supports the CONCAT function, it strictly limits the input to exactly two arguments. Standard LLMs, biased by the variadic CONCAT in MySQL, often generate invalid 3-argument calls. Furthermore, baselines struggle with strict syntactic grammar, such as incorrectly appending the AS keyword for table aliases in Oracle (M2) or omitting mandatory aliases for derived tables (subqueries) in PostgreSQL and MySQL (M6). Dial overcomes these issues by retrieving precise function specifications and constraint rules from HINT-KB, guaranteeing strict adherence to target signatures and syntax conventions.

(3) Implicit Constraints. Real-world queries often fail because their structural composition violates compiler restrictions, even if the individual syntactic elements are correct. These implicit constraints are orthogonal to the user intent and are typically absent from standard LLM prompts. For instance, MySQL prohibits nested aggregations (e.g., applying COUNT over AVG), causing baselines to fail during compilation (I2). Standard models blindly generate these invalid constructs.
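The nested-aggregation conflict and its restructuring (case I2) can be sketched end to end. SQLite enforces the same prohibition on nested aggregates as MySQL, so both the illegal form and the compliant CASE WHEN rewrite can be exercised below; the schema and values are hypothetical, not taken from the benchmark.

```python
import sqlite3

# Hypothetical schema illustrating case I2 (count days with high trading volume).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE trades (day TEXT, volume REAL);
    INSERT INTO trades VALUES
        ('2024-01-01', 100), ('2024-01-02', 300), ('2024-01-03', 500);
""")

# Illegal: one aggregate nested inside another, as MySQL (and SQLite) forbid.
try:
    conn.execute("SELECT COUNT(AVG(volume)) FROM trades").fetchall()
except sqlite3.OperationalError as e:
    error = str(e)  # SQLite reports a misuse-of-aggregate error here.

# Compliant restructuring: isolate the inner aggregate in a scalar subquery,
# then count qualifying rows via CASE WHEN, preserving the original intent.
(high_days,) = conn.execute("""
    SELECT COUNT(CASE WHEN volume > (SELECT AVG(volume) FROM trades)
                      THEN 1 END)
    FROM trades
""").fetchone()
print(high_days)
```

The rewrite keeps the semantics (count days whose volume exceeds the average) while satisfying the engine's structural constraint, which is the kind of transformation Dial's execution-driven loop applies without altering user intent.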
Static translation tools (e.g., SQLGlot) typically perform direct syntax mapping without understanding semantic execution constraints. For instance, strict databases like PostgreSQL and MySQL prohibit a scalar subquery on the right side of an equals sign (=) from returning multiple rows. While permissive engines like SQLite forgive this, static translators fail to append a single-row constraint during translation, leading to runtime errors (I4). Instead, Dial detects these structural conflicts via its execution-driven feedback loop and systematically restructures the query (e.g., transforming nested aggregations into compliant CASE WHEN constructs for I2) without altering the original semantics.

8 Related Work

General NL2SQL. The rapid advancement of LLMs has fundamentally reshaped NL2SQL research, as summarized in recent surveys [16, 23, 24]. Current LLM-based approaches can be broadly categorized into two lines. (1) Modular prompting pipelines: methods such as DIN-SQL [27], DAIL-SQL [9], and CHASE-SQL [26] decompose generation into structured reasoning steps, improving controllability and intermediate interpretability. (2) Specialized fine-tuning strategies: systems including DTS-SQL [29] and CodeS [19] internalize schema linking and structural reasoning via supervised alignment on curated corpora. Additionally, recent methods incorporate search, feedback, and optimization strategies to enhance reasoning and robustness, including MCTS-based exploration [17], software-engineering-inspired validation [15], process-supervised rewards [43], complexity-aware routing [48], and structured multi-step deduction [38]. However, these methods predominantly assume a single target dialect and do not explicitly disentangle semantic planning from dialect-specific realization.

Dialect-Specific NL2SQL. Dialect-SQL [31] introduces an adaptive framework using Object-Relational Mapping (ORM) as an intermediate layer.
However, this approach often degrades native, high-performance operators into verbose, generic constructs to maintain cross-platform compatibility. Other data-centric strategies like SQL-Gen [30] and ExeSQL [42] utilize synthetic tutorials and execution-driven feedback to mitigate data scarcity.

SQL Dialect Translation. Migrating queries across databases has traditionally relied on rule-based translation tools (e.g., SQLGlot [1], jOOQ [2], SQLines [3]). However, existing dialect translation tools integrate only a limited set of human-maintained translation rules and fail on many complex cases. To overcome this rigidity, recent studies like CrackSQL [46] explore hybrid architectures combining LLMs with functionality-based query processing to automate cross-dialect SQL-to-SQL translation.

9 Conclusion

In this paper, we proposed a dialect-specific NL2SQL framework. We introduced Dialect-Aware Logical Query Planning, which constructs a Natural Language Logical Query Plan (NL-LQP) to decouple semantic intent from dialect-specific syntax. We built HINT-KB, a hierarchical intent-aware knowledge base that organizes vendor documentation into declarative function mappings and procedural constraint rules to guide generation. We further designed an Adaptive & Iterative Debugging and Evaluation mechanism that leverages execution feedback for syntactic recovery while verifying consistency with the logical plan. Experiments on the DS-NL2SQL benchmark show that Dial significantly improves execution accuracy and dialect feature coverage over state-of-the-art baselines. In the future, we will focus on several potential directions. First, we will incorporate lightweight dialect parsers to reduce reliance on live database feedback. Second, we will improve knowledge acquisition to better support niche dialects or legacy databases lacking comprehensive documentation.

References

[1] SQLGlot. [n.d.]. https://sqlglot.com/sqlglot.html. Last accessed on 2024-10.
[2] jOOQ. [n.d.]. https://www.jooq.org/. Last accessed on 2024-10.
[3] SQLines. [n.d.]. https://www.sqlines.com/. Last accessed on 2024-10.
[4] DuckDB (DBMS). https://www.duckdb.org
[5] MySQL (DBMS). https://www.mysql.com/
[6] Oracle (DBMS). https://www.oracle.com/database/
[7] PostgreSQL (DBMS). https://www.postgresql.org
[8] SQLite (DBMS). https://www.sqlite.org
[9] Dawei Gao, Haibin Wang, Yaliang Li, Xiuyu Sun, Yichen Qian, Bolin Ding, and Jingren Zhou. 2024. Text-to-SQL Empowered by Large Language Models: A Benchmark Evaluation. Proceedings of the VLDB Endowment 17, 5 (2024), 1132–1145.
[10] Jie Huang, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu, Xinying Song, and Denny Zhou. 2023. Large language models cannot self-correct reasoning yet. arXiv preprint arXiv:2310.01798 (2023).
[11] Hyeonji Kim, Byeong-Hoon So, Wook-Shin Han, and Hongrae Lee. 2020. Natural language to SQL: Where are we today? Proceedings of the VLDB Endowment 13, 10 (2020), 1737–1750.
[12] Rodrigo Laigner, Yongluan Zhou, Marcos Antonio Vaz Salles, Yijian Liu, and Marcos Kalinowski. 2021. Data management in microservices: state of the practice, challenges, and research directions. Proc. VLDB Endow. 14, 13 (Sept. 2021), 3348–3361. https://doi.org/10.14778/3484224.3484232
[13] Fangyu Lei, Jixuan Chen, et al. 2025. Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows. In Proceedings of the 13th International Conference on Learning Representations (ICLR).
[14] Fangyu Lei, Jixuan Chen, Yuxiao Ye, Ruisheng Cao, Dongchan Shin, Hongjin Su, Zhaoqing Suo, Hongcheng Gao, Wenjing Hu, Pengcheng Yin, Victor Zhong, Caiming Xiong, Ruoxi Sun, Qian Liu, Sida Wang, and Tao Yu. 2025. Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows. In International Conference on Learning Representations, Y. Yue, A. Garg, N. Peng, F. Sha, and R. Yu (Eds.), Vol. 2025. 28691–28735.
https://proceedings.iclr.cc/paper_files/paper/2025/file/46c10f6c8ea5a6f267bcdabcb123f97-Paper-Conference.pdf
[15] Boyan Li, Chong Chen, Zhujun Xue, Yinan Mei, and Yuyu Luo. 2025. DeepEye-SQL: A Software-Engineering-Inspired Text-to-SQL Framework. CoRR abs/2510.17586 (2025).
[16] Boyan Li, Yuyu Luo, Chengliang Chai, Guoliang Li, and Nan Tang. 2024. The Dawn of Natural Language to SQL: Are We Fully Ready? [Experiment, Analysis & Benchmark]. Proc. VLDB Endow. 17, 11 (2024), 3318–3331.
[17] Boyan Li, Jiayi Zhang, Ju Fan, Yanwei Xu, Chong Chen, Nan Tang, and Yuyu Luo. 2025. Alpha-SQL: Zero-Shot Text-to-SQL using Monte Carlo Tree Search. In ICML. OpenReview.net.
[18] Haoyang Li, Shang Wu, Xiaokang Zhang, Xinmei Huang, Jing Zhang, Fuxin Jiang, Shuai Wang, Tieying Zhang, Jianjun Chen, Rui Shi, et al. 2025. OmniSQL: Synthesizing high-quality text-to-SQL data at scale. arXiv preprint arXiv:2503.02240 (2025).
[19] Haoyang Li, Jing Zhang, Hanbing Liu, Ju Fan, Xiaokang Zhang, Jun Zhu, Renjie Wei, Hongyan Pan, Cuiping Li, and Hong Chen. 2024. CodeS: Towards Building Open-source Language Models for Text-to-SQL. Proc. ACM Manag. Data 2, 3, Article 127 (May 2024), 28 pages. https://doi.org/10.1145/3654930
[20] Jinyang Li, Binyuan Hui, Ge Qu, Jiaxi Yang, Binhua Li, Bowen Li, Bailin Wang, Bowen Qin, Ruiying Geng, Nan Huo, et al. 2024. Can LLM already serve as a database interface? A big bench for large-scale database grounded text-to-SQLs. Advances in Neural Information Processing Systems 36 (2024).
[21] X. Li, Y. Wang, et al. 2025. An Empirical Study on Database Usage in Microservices. arXiv preprint arXiv:2510.20582 (2025).
[22] Xinyu Liu, Shuyu Shen, Boyan Li, Peixian Ma, Runzhi Jiang, Yuyu Luo, Yuxin Zhang, Ju Fan, Guoliang Li, and Nan Tang. 2024. A Survey of NL2SQL with Large Language Models: Where are we, and where are we going. arXiv preprint arXiv:2408.05109 (2024).
[23] Xinyu Liu, Shuyu Shen, Boyan Li, Peixian Ma, Runzhi Jiang, Yuxin Zhang, Ju Fan, Guoliang Li, Nan Tang, and Yuyu Luo. 2025. A Survey of Text-to-SQL in the Era of LLMs: Where Are We, and Where Are We Going? IEEE Trans. Knowl. Data Eng. 37, 10 (2025), 5735–5754.
[24] Yuyu Luo, Guoliang Li, Ju Fan, Chengliang Chai, and Nan Tang. 2025. Natural Language to SQL: State of the Art and Open Problems. Proc. VLDB Endow. 18, 12 (2025), 5466–5471.
[25] HR News. 2026. Why Jobs for Developers with Complex Stack Experience Are Growing. https://hrnews.co.uk/why-jobs-for-developers-with-complex-stack-experience-are-growing/
[26] Mohammadreza Pourreza, Hailong Li, Ruoxi Sun, Yeounoh Chung, Shayan Talaei, Gaurav Tarlok Kakkar, Yu Gan, Amin Saberi, Fatma Ozcan, and Sercan Ö. Arik. 2024. CHASE-SQL: Multi-Path Reasoning and Preference Optimized Candidate Selection in Text-to-SQL. CoRR abs/2410.01943 (2024). https://doi.org/10.48550/ARXIV.2410.01943 arXiv:2410.01943
[27] Mohammadreza Pourreza and Davood Rafiei. 2023. DIN-SQL: Decomposed in-context learning of text-to-SQL with self-correction. Advances in Neural Information Processing Systems 36 (2023), 36339–36348.
[28] Mohammadreza Pourreza and Davood Rafiei. 2023. Evaluating Cross-Domain Text-to-SQL Models and Benchmarks. In EMNLP. Association for Computational Linguistics, 1601–1611.
[29] Mohammadreza Pourreza and Davood Rafiei. 2024. DTS-SQL: Decomposed Text-to-SQL with Small Large Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2024, Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (Eds.). Association for Computational Linguistics, Miami, Florida, USA, 8212–8220. https://doi.org/10.18653/v1/2024.findings-emnlp.481
[30] Mohammadreza Pourreza, Ruoxi Sun, Hailong Li, Lesly Miculicich, Tomas Pfister, and Sercan O Arik. 2024. SQL-Gen: Bridging the dialect gap for text-to-SQL via synthetic data and model merging. arXiv preprint arXiv:2408.12733 (2024).
[31] Jie Shi, Xi Cao, Bo Xu, Jiaqing Liang, Yanghua Xiao, Jia Chen, Peng Wang, and Wei Wang. 2025. Dialect-SQL: An Adaptive Framework for Bridging the Dialect Gap in Text-to-SQL. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 3604–3619.
[32] Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems 36 (2023), 8634–8652.
[33] technotes.in. 2025. The Future is Polyglot: Why Businesses Use Multiple Databases. https://technotes.in/2025/08/25/the-future-is-polyglot-why-businesses-use-multiple-databases/. Accessed: 2026-03.
[34] Bing Wang, Changyu Ren, Jian Yang, Xinnian Liang, Jiaqi Bai, Linzheng Chai, Zhao Yan, Qian-Wen Zhang, Di Yin, Xing Sun, and Zhoujun Li. 2025. MAC-SQL: A Multi-Agent Collaborative Framework for Text-to-SQL. In Proceedings of the 31st International Conference on Computational Linguistics, COLING 2025, Abu Dhabi, UAE, January 19-24, 2025. Association for Computational Linguistics, 540–557.
[35] Pengfei Wang, Baolin Sun, Xuemei Dong, Yaxun Dai, Hongwei Yuan, Mengdie Chu, Yingqi Gao, Xiang Qi, Peng Zhang, and Ying Yan. 2025. Agentar-Scale-SQL: Advancing Text-to-SQL through Orchestrated Test-Time Scaling. arXiv preprint arXiv:2509.24403 (2025).
[36] Zirui Wang, Zihang Dai, Barnabás Póczos, and Jaime Carbonell. 2019. Characterizing and avoiding negative transfer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 11293–11302.
[37] WrenAI. 2024. https://getwren.ai/oss
[38] Chenyu Yang, Yuyu Luo, Chuanxuan Cui, Ju Fan, Chengliang Chai, and Nan Tang. 2025. Data Imputation with Limited Data Redundancy Using Data Lakes. Proc. VLDB Endow. 18, 10 (2025), 3354–3367.
[39] Tao Yu, Rui Zhang, Heyang Er, Suyi Li, Eric Xue, Bo Pang, Xi Victoria Lin, Yi Chern Tan, Tianze Shi, Zihan Li, et al. 2019.
CoSQL: A Conversational Text-to-SQL Challenge Towards Cross-Domain Natural Language Interfaces to Databases. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 1962–1979.
[40] Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, et al. 2018. Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 3911–3921.
[41] Tao Yu, Rui Zhang, Michihiro Yasunaga, Yi Chern Tan, Xi Victoria Lin, Suyi Li, Heyang Er, Irene Li, Bo Pang, Tao Chen, et al. 2019. SParC: Cross-Domain Semantic Parsing in Context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 4511–4523.
[42] Jipeng Zhang, Haolin Yang, Kehao Miao, Ruiyuan Zhang, Renjie Pi, Jiahui Gao, and Xiaofang Zhou. 2025. ExeSQL: Self-Taught Text-to-SQL Models with Execution-Driven Bootstrapping for SQL Dialects. arXiv preprint arXiv:2505.17231 (2025).
[43] Yuxin Zhang, Meihao Fan, Ju Fan, Mingyang Yi, Yuyu Luo, Jian Tan, and Guoliang Li. 2025. Reward-SQL: Boosting Text-to-SQL via Stepwise Reasoning and Process-Supervised Rewards. arXiv:2505.04671 [cs.CL] https://arxiv.org/abs/2505.04671
[44] Danna Zheng, Mirella Lapata, and Jeff Pan. 2024. Archer: A Human-Labeled Text-to-SQL Dataset with Arithmetic, Commonsense and Hypothetical Reasoning. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), Yvette Graham and Matthew Purver (Eds.). Association for Computational Linguistics, St. Julian's, Malta, 94–111. https://doi.org/10.18653/v1/2024.eacl-long.6
[45] Victor Zhong, Caiming Xiong, and Richard Socher. 2017.
Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning. CoRR abs/1709.00103 (2017).
[46] Wei Zhou, Yuyang Gao, Xuanhe Zhou, and Guoliang Li. 2025. Cracking SQL Barriers: An LLM-based Dialect Translation System. Proceedings of the ACM on Management of Data 3, 3 (2025), 1–26.
[47] Wei Zhou, Guoliang Li, Haoyu Wang, Yuxing Han, Wu Xufei, Fan Wu, and Xuanhe Zhou. [n.d.]. PARROT: A Benchmark for Evaluating LLMs in Cross-System SQL Translation. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
[48] Yizhang Zhu, Runzhi Jiang, Boyan Li, Nan Tang, and Yuyu Luo. 2025. EllieSQL: Cost-Efficient Text-to-SQL with Complexity-Aware Routing. In Second Conference on Language Modeling. https://openreview.net/forum?id=8OqGNXKwo8
[49] Yizhang Zhu, Liangwei Wang, Chenyu Yang, Xiaotian Lin, Boyan Li, Wei Zhou, Xinyu Liu, Zhangyang Peng, Tianqi Luo, Yu Li, Chengliang Chai, Chong Chen, Shimin Di, Ju Fan, Ji Sun, Nan Tang, Fugee Tsung, Jiannan Wang, Chenglin Wu, Yanwei Xu, Shaolei Zhang, Yong Zhang, Xuanhe Zhou, Guoliang Li, and Yuyu Luo. 2025. A Survey of Data Agents: Emerging Paradigm or Overstated Hype? CoRR abs/2510.23587 (2025).