Why SQLite's own grammar

Most SQL formatters and linters use hand-written or approximate grammars. syntaqlite doesn't. It uses SQLite's own tokenizer and Lemon-generated parser rules directly. This page explains what that means in practice.

The problem with approximate grammars

SQLite's grammar has grown from 326 rules in version 3.12.2 to over 410 rules in recent releases. It includes syntax for window functions, upsert, RETURNING clauses, generated columns, the -> and ->> JSON operators, and dozens of other features added across 24 grammar change points.

A hand-written parser trying to match this surface area will inevitably diverge. It might accept SQL that SQLite rejects, reject SQL that SQLite accepts, or misparse edge cases around keyword-as-identifier rules. SQLite allows most keywords as identifiers in specific contexts, and the full list is encoded in the Lemon grammar's fallback table.
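To make the keyword-as-identifier point concrete, here is a minimal sketch that uses the SQLite library bundled with Python's standard sqlite3 module as an oracle. It is not part of syntaqlite; the helper name sqlite_accepts is hypothetical, and it treats any SQLite error (not only syntax errors) as a rejection.

```python
import sqlite3


def sqlite_accepts(sql: str) -> bool:
    """Return True if the linked SQLite library runs `sql` without error.

    Hypothetical helper for illustration; any sqlite3.Error counts as a
    rejection, so this is broader than a pure syntax check.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(sql)
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()


# KEY, ABORT, and TEMP are keywords, but the grammar's fallback table lets
# them be used as column names here.
print(sqlite_accepts("CREATE TABLE t(key, abort, temp)"))

# SELECT has no identifier fallback, so SQLite reports a syntax error.
print(sqlite_accepts("CREATE TABLE t(select)"))
```

A hand-written grammar has to reproduce both outcomes for every keyword and every context, which is where approximate parsers tend to drift.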

What "uses SQLite's grammar" means concretely

syntaqlite's parser is built from two generated C source files:

  • sqlite_tokenize.c — SQLite's tokenizer, including the keyword hash table and character classification tables
  • sqlite_parse.c — the LALR(1) state machine generated from SQLite's parse.y grammar

These are not vendored from an upstream SQLite release. They're regenerated from SQLite's grammar rules as part of syntaqlite's build pipeline, which concatenates .y action files and runs them through an embedded copy of Lemon. The Rust layer wraps the C parser in safe APIs and builds everything else (formatter, analyzer, LSP) on top.

The parser action files in syntaqlite-syntax/parser-actions/ must match upstream grammar rule signatures exactly. A grammar verification step enforces this: any unintentional divergence from SQLite's grammar is a build error.
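The idea behind such a verification step can be sketched as follows. This is an illustration under stated assumptions, not syntaqlite's actual implementation: the function names are hypothetical, and it assumes each rule is available as a single string in Lemon syntax, where the first "." ends the rule and anything after it is an action block or precedence tag.

```python
import re

# Matches Lemon symbol labels such as "(A)" in "expr(A)".
_LABEL = re.compile(r"\(\w+\)")


def rule_signature(rule_text: str) -> str:
    """Reduce a Lemon rule to its bare signature (hypothetical helper).

    Drops everything after the terminating ".", strips symbol labels,
    and collapses whitespace, so only the grammar shape remains.
    """
    head = rule_text.split(".", 1)[0]
    head = _LABEL.sub("", head)
    return " ".join(head.split())


def diff_signatures(upstream, local):
    """Return (missing, extra) rule signatures relative to upstream."""
    up = {rule_signature(r) for r in upstream}
    lo = {rule_signature(r) for r in local}
    return sorted(up - lo), sorted(lo - up)


upstream = [
    "ecmd ::= SEMI.",
    "expr(A) ::= expr(A) PLUS|MINUS(OP) expr(Y). { /* upstream action */ }",
]
local = [
    "ecmd ::= SEMI.",
    "expr(L) ::= expr(L) PLUS|MINUS(O) expr(R). { /* AST-building action */ }",
]

missing, extra = diff_signatures(upstream, local)
print(missing, extra)
```

Labels and action bodies are free to differ because only the signature is compared; a changed right-hand side would show up in both lists and could be turned into a build failure.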

Version-aware parsing

SQLite's grammar has been semantically additive across all analyzed versions (3.12.2–3.51.2): it never narrows the set of accepted SQL. syntaqlite uses the latest grammar for parsing and validates against your target version after parsing:

syntaqlite validate --sqlite-version 3.41.0 query.sql

This means syntaqlite parses the SQL correctly even when it uses syntax newer than the target version, then reports diagnostics for anything the target version doesn't support. See project setup for details.
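A post-parse version check of this kind can be pictured as a lookup from feature to the release that introduced it. The sketch below is an assumption about the general shape, not syntaqlite's internals: the table, function names, and diagnostic wording are hypothetical, while the minimum versions listed are the SQLite releases in which each feature first appeared.

```python
# Feature -> first SQLite release that supports it.
MIN_VERSION = {
    "upsert": (3, 24, 0),
    "window functions": (3, 25, 0),
    "generated columns": (3, 31, 0),
    "RETURNING clause": (3, 35, 0),
    "-> and ->> operators": (3, 38, 0),
}


def parse_version(text: str) -> tuple:
    """Turn "3.41.0" into (3, 41, 0) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def check_features(features, target: str) -> list:
    """Return one diagnostic per feature the target version lacks."""
    target_v = parse_version(target)
    diagnostics = []
    for feature in features:
        needed = MIN_VERSION[feature]
        if needed > target_v:
            needed_text = ".".join(str(n) for n in needed)
            diagnostics.append(
                f"{feature} requires SQLite {needed_text} or later "
                f"(target is {target})"
            )
    return diagnostics


# Features a parsed statement was found to use (hypothetical input).
used = ["upsert", "RETURNING clause"]
print(check_features(used, "3.41.0"))
print(check_features(used, "3.30.0"))
```

Because the grammar only ever grows, parsing with the latest rules and filtering afterwards avoids maintaining one parser per SQLite version.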

Intentional divergences

syntaqlite adds a small number of grammar rules beyond upstream SQLite, all related to error recovery. These are explicitly tracked in ALLOWED_EXTRA_RULES:

  • ecmd ::= error SEMI — recover from errors at statement boundaries
  • expr ::= error — recover from malformed expressions (used for embedded SQL hole interpolation)

These rules only affect error recovery behavior. They don't change what valid SQL is accepted.
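One way to picture how an allowlist like this interacts with the grammar check is a set difference. The sketch below assumes ALLOWED_EXTRA_RULES can be represented as a set of rule signature strings; its real form in syntaqlite's build tooling may differ, and the function name is hypothetical.

```python
# The two recovery rules listed above, as bare signatures (assumed form).
ALLOWED_EXTRA_RULES = {
    "ecmd ::= error SEMI",
    "expr ::= error",
}


def unexpected_rules(upstream_rules, local_rules):
    """Return local rules that are neither upstream nor allowlisted."""
    extra = set(local_rules) - set(upstream_rules)
    return sorted(extra - ALLOWED_EXTRA_RULES)


# Tiny stand-in for the upstream rule set (illustrative only).
upstream = {"ecmd ::= SEMI", "expr ::= expr PLUS expr"}

# Upstream plus the tracked recovery rules passes the check.
print(unexpected_rules(upstream, upstream | ALLOWED_EXTRA_RULES))

# An untracked addition is reported and could fail the build.
print(unexpected_rules(upstream, upstream | {"expr ::= expr POW expr"}))
```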

When SQLite changes its grammar

New SQLite releases occasionally add grammar rules. syntaqlite's codegen pipeline is designed to minimize the human effort needed:

  1. Grammar rules and the state machine are regenerated automatically
  2. AST node structure is defined in .synq files. New nodes need a .synq entry for formatting and validation
  3. Build tooling detects any divergence and guides the update

The only manual decision is which AST nodes to fold together (e.g., whether a new clause type gets its own node or folds into an existing one). The codegen handles everything else.