Architecture

Pipeline

SQL source text flows through a pipeline of stages.

First, source text is tokenized and parsed into an AST:

```mermaid
graph LR
    Source["SQL source"] --> Tokenizer --> Parser --> AST
```

The formatter and analyzer then consume the AST independently — they don't depend on each other:

```mermaid
graph LR
    AST --> Formatter --> Formatted["Formatted SQL"]
    AST --> Analyzer["Semantic analyzer"] --> Diagnostics
```

The LSP server ties both together for editor integration. You can also use any component on its own — tokenize without parsing, parse without formatting, validate without formatting — or combine formatting and validation without the LSP.

The C/Rust sandwich

syntaqlite's parser is not a hand-written Rust parser. It uses SQLite's Lemon-generated grammar and tokenizer, compiled from C and linked into Rust via FFI. The Rust layer wraps the C parser in safe APIs and builds the formatter, semantic analyzer, and LSP on top.

There's also an outbound FFI layer: the Rust formatter and validator are exported back to C consumers through #[no_mangle] functions, so non-Rust projects can link syntaqlite as a C library. This makes the architecture a genuine sandwich — C at the bottom (parser), Rust in the middle (analysis, formatting, LSP), C at the top (consumer API).

The bottom layer is the C parser and tokenizer:

```mermaid
graph LR
    LemonTokens["sqlite_tokenize.c"] --> TokenizerC["tokenizer.c"]
    LemonGrammar["sqlite_parse.c"] --> ParserC["parser.c"]
```

The middle layer is Rust, wrapping the C parser in safe types and building analysis and formatting on top:

```mermaid
graph LR
    RustAPI["Parser, Tokenizer wrappers"] --> Fmt["Formatter"]
    RustAPI --> Sem["Semantic analyzer"]
    Sem --> Cat["Catalog"]
    Fmt --> Lsp["LSP server"]
    Sem --> Lsp
```

The top layer exports Rust back to C for external consumers:

```mermaid
graph LR
    Fmt["Formatter"] --> FmtFFI["syntaqlite_formatter_*"]
    Sem["Semantic analyzer"] --> ValFFI["syntaqlite_validator_*"]
```

.synq grammar files generate code consumed by multiple layers:

```mermaid
graph LR
    Synq[".synq files"] --> Codegen["syntaqlite-buildtools"]
    Codegen --> CH["C headers"]
    Codegen --> RS["Rust code"]
    CH --> Parser["C parser"]
    RS --> Analyzer["Analyzer"]
    RS --> Formatter["Formatter"]
```

What's from SQLite

sqlite_parse.c and sqlite_tokenize.c are generated by SQLite's Lemon parser generator. They contain the grammar state machine and token/keyword tables. These are not vendored from upstream SQLite — they're regenerated from SQLite's grammar rules as part of the syntaqlite build.

What's generated from .synq

The .synq grammar files in syntaqlite-syntax/parser-nodes/ are the source of truth for AST node structure. From them, syntaqlite-buildtools generates:

  • C headers — struct layouts for AST nodes, parser action code, node metadata (field names, field counts, list ranges), semantic role byte tables, and formatter dispatch tables
  • Rust code — typed AST node wrappers and the semantic_roles.rs table

The arena

The C parser allocates all nodes for a statement into a flat memory buffer (arena). Rust reads the nodes directly via pointer casts — no copying or deserialization. The arena is reset between statements, so memory usage stays proportional to the largest single statement, not the whole file.
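The reset-between-statements pattern can be sketched in Rust (illustrative only — syntaqlite's actual arena is allocated and reset on the C side):

```rust
// Illustrative bump arena: allocations append to a flat buffer, and
// `reset` reclaims everything at once between statements.
struct Arena {
    buf: Vec<u8>,
    high_water: usize, // largest size seen — "largest single statement"
}

impl Arena {
    fn new() -> Self {
        Arena { buf: Vec::new(), high_water: 0 }
    }

    // Append `bytes` and return the offset where they were placed.
    fn alloc(&mut self, bytes: &[u8]) -> usize {
        let offset = self.buf.len();
        self.buf.extend_from_slice(bytes);
        self.high_water = self.high_water.max(self.buf.len());
        offset
    }

    // O(1) reset: keeps capacity, drops contents, so memory stays
    // proportional to the largest statement rather than the whole file.
    fn reset(&mut self) {
        self.buf.clear();
    }
}
```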

The FFI boundaries

Inbound (C → Rust): syntaqlite-syntax/src/parser/ffi.rs defines unsafe wrappers around the C parser and tokenizer. These are the only place raw C pointers are handled. The rest of the Rust code sees safe Parser, Tokenizer, and ParseSession types.

Outbound (Rust → C): syntaqlite/src/fmt/ffi.rs and syntaqlite/src/semantic/ffi.rs export the formatter and validator as opaque C handles (SyntaqliteFormatter*, SyntaqliteValidator*) with lifecycle functions (create, use, destroy).
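The create/use/destroy lifecycle behind those opaque handles follows a standard Rust-to-C pattern; here is a minimal sketch (type and function names invented for illustration, not syntaqlite's actual exports):

```rust
// Illustrative opaque handle: C only ever sees `*mut Validator`.
pub struct Validator {
    errors: u32,
}

// create: box the Rust value and hand ownership to C as a raw pointer.
#[no_mangle]
pub extern "C" fn validator_create() -> *mut Validator {
    Box::into_raw(Box::new(Validator { errors: 0 }))
}

// use: stand-in for real work done through the handle.
#[no_mangle]
pub extern "C" fn validator_record_error(v: *mut Validator) {
    unsafe { (*v).errors += 1 };
}

#[no_mangle]
pub extern "C" fn validator_error_count(v: *const Validator) -> u32 {
    unsafe { (*v).errors }
}

// destroy: reclaim ownership so Rust runs the destructor exactly once.
#[no_mangle]
pub extern "C" fn validator_destroy(v: *mut Validator) {
    if !v.is_null() {
        unsafe { drop(Box::from_raw(v)) };
    }
}
```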

Crates

| Crate | Role |
| --- | --- |
| syntaqlite-syntax | Tokenizer, parser, AST arena, grammar system (Rust + C) |
| syntaqlite | Formatter, semantic analyzer, LSP, dialect interface (Rust) |
| syntaqlite-common | Shared types — semantic roles (Rust) |
| syntaqlite-buildtools | Code generation from .synq grammar definitions |
| syntaqlite-cli | Command-line interface |
| syntaqlite-wasm | WebAssembly bindings |

Grammar system

The source of truth for AST structure is a set of .synq files in syntaqlite-syntax/parser-nodes/. These define:

  • Nodes — AST node types with typed fields
  • Enums — fixed value sets (e.g., sort order, join type)
  • Flags — bit-packed booleans
  • Lists — sequences of child nodes
  • Semantic annotations — instructions for the validator
  • Formatting rules — bytecode for the pretty-printer

Adding a new AST node, its formatting, and its validation behavior is a single change to a .synq file followed by running code generation.

Tokenizer

Wraps the Lemon-generated token function in a C struct with lifecycle methods (create, reset, next, destroy). The Rust Tokenizer type wraps this further in a safe API. Zero-copy — tokens reference byte offsets into the source string. Exposed as a public API for consumers who only need lexical analysis.
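The zero-copy idea — tokens as byte ranges into the source rather than owned strings — can be sketched like this (a toy whitespace lexer, not SQL lexing):

```rust
// Illustrative zero-copy token: stores byte offsets into the source,
// so lexing allocates nothing per token.
#[derive(Debug, PartialEq)]
struct Token {
    start: usize,
    end: usize,
}

impl Token {
    // Borrow the token's text straight out of the source string.
    fn text<'a>(&self, source: &'a str) -> &'a str {
        &source[self.start..self.end]
    }
}

// Toy lexer: split on ASCII whitespace, emitting offset pairs.
fn tokenize(source: &str) -> Vec<Token> {
    let mut tokens = Vec::new();
    let mut start = None;
    for (i, b) in source.bytes().enumerate() {
        match (b.is_ascii_whitespace(), start) {
            (false, None) => start = Some(i),
            (true, Some(s)) => {
                tokens.push(Token { start: s, end: i });
                start = None;
            }
            _ => {}
        }
    }
    if let Some(s) = start {
        tokens.push(Token { start: s, end: source.len() });
    }
    tokens
}
```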

Parser

Driven by the Lemon-generated state machine. Key properties:

  • Streaming — yields one statement at a time, so memory usage doesn't grow with input size
  • Error recovery — on a syntax error, the parser skips to the next semicolon and continues
  • Token collection — optionally records tokens and comments alongside the AST (needed by the formatter to preserve whitespace and comment placement)
  • Incremental parsing — a separate mode for editors, where tokens are fed one at a time for completion support
  • Macro expansion — registered macros are expanded during parsing with recursion tracking
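The skip-to-the-next-semicolon recovery strategy can be sketched as follows (illustrative: the real parser recovers over tokens, this toy version works on text and accepts only SELECT statements):

```rust
// Illustrative statement-level error recovery: split at semicolons and
// keep going past a bad statement instead of aborting the whole file.
fn parse_statements(source: &str) -> (Vec<String>, Vec<String>) {
    let mut ok = Vec::new();
    let mut errors = Vec::new();
    for stmt in source.split(';') {
        let stmt = stmt.trim();
        if stmt.is_empty() {
            continue;
        }
        // Toy "parser": a statement must start with a known keyword.
        if stmt.to_ascii_uppercase().starts_with("SELECT") {
            ok.push(stmt.to_string());
        } else {
            // Record the error, then resume at the next semicolon.
            errors.push(format!("syntax error near '{stmt}'"));
        }
    }
    (ok, errors)
}
```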

Semantic analyzer

Single-pass walk over the AST that validates references against a layered catalog:

  1. Query (innermost) — CTEs, subquery aliases
  2. Document — CREATE TABLE statements in the current file
  3. Connection — DDL from prior statements (Execute mode only)
  4. Database — user-provided schema
  5. Dialect (outermost) — built-in functions
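Innermost-first resolution over these layers can be sketched like this (layer contents invented for illustration):

```rust
// Illustrative layered catalog: look a name up in the innermost layer
// first and fall outward, mirroring the query -> dialect layering.
struct Catalog<'a> {
    // Layers ordered innermost (index 0) to outermost.
    layers: Vec<(&'a str, Vec<&'a str>)>,
}

impl<'a> Catalog<'a> {
    // Returns the name of the first (innermost) layer defining `name`.
    fn resolve(&self, name: &str) -> Option<&'a str> {
        self.layers
            .iter()
            .find(|(_, names)| names.contains(&name))
            .map(|(layer, _)| *layer)
    }
}
```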

The walk is driven by a semantic role table — a byte-encoded instruction set generated from .synq annotations and stored as a flat C byte array. Each AST node type maps to a role (e.g., query, source_ref, cte_binding) that tells the analyzer what to validate and how to update scope. Rust reads the table via a direct pointer cast — zero decoding cost.

Diagnostics are emitted inline during the walk. There's no separate "resolve" pass — everything happens in one traversal.
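The role-table lookup described above — a flat byte array indexed by node kind — can be sketched as follows (node kinds and role values invented for illustration; the real table is generated into C):

```rust
// Illustrative semantic role table: one byte per AST node kind, indexed
// directly, so classifying a node during the walk is a single array load.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Role {
    None,
    Query,
    SourceRef,
    CteBinding,
}

// Hand-written stand-in for the generated table, covering three kinds.
const NODE_KIND_SELECT: usize = 0;
const NODE_KIND_TABLE_NAME: usize = 1;
const NODE_KIND_WITH_CLAUSE: usize = 2;

static ROLE_TABLE: [u8; 3] = [
    1, // SELECT introduces a query scope
    2, // table names are validated against the catalog
    3, // WITH binds names into the query scope
];

fn role_of(node_kind: usize) -> Role {
    match ROLE_TABLE[node_kind] {
        1 => Role::Query,
        2 => Role::SourceRef,
        3 => Role::CteBinding,
        _ => Role::None,
    }
}
```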

Formatter

Uses a Wadler-Lindig-style document algebra:

  • Group — try to fit contents on one line; if too long, break
  • Line — space in flat mode, newline + indent in break mode
  • SoftLine — nothing in flat mode, newline + indent in break mode
  • Nest — increase indentation level
  • Keyword — SQL keyword (case-transformed per config)
  • Text — literal text from the source (never transformed)
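The fit-or-break behavior of Group and Line can be sketched with a minimal renderer (illustrative; SoftLine, Nest, Keyword, and nested-group layout are omitted):

```rust
// Minimal Wadler-Lindig-style documents: a Group renders flat if it fits
// in the available width, otherwise each Line in it becomes a newline.
enum Doc {
    Text(String),
    Line, // space when flat, newline when broken
    Group(Vec<Doc>),
}

fn render(doc: &Doc, width: usize) -> String {
    // Render a document as if everything fits on one line.
    fn flat(doc: &Doc) -> String {
        match doc {
            Doc::Text(t) => t.clone(),
            Doc::Line => " ".to_string(),
            Doc::Group(parts) => parts.iter().map(flat).collect(),
        }
    }
    match doc {
        Doc::Group(parts) => {
            let one_line = parts.iter().map(flat).collect::<String>();
            if one_line.len() <= width {
                one_line // fits: render the whole group flat
            } else {
                // too long: every Line in this group breaks
                parts
                    .iter()
                    .map(|p| match p {
                        Doc::Line => "\n".to_string(),
                        other => flat(other),
                    })
                    .collect()
            }
        }
        other => flat(other),
    }
}
```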

The formatting rules for each AST node are compiled from .synq fmt blocks into bytecode. At format time, the bytecode interpreter walks the AST and builds a document, which is then rendered with line-width-aware layout.

Comment placement is handled separately — the formatter tracks comment positions from the parser and reattaches them to the appropriate locations in the formatted output.

Diagnostic rendering

Diagnostics carry byte offsets, a structured message, and optional help text (e.g., "did you mean 'name'?" via Levenshtein distance matching). The DiagnosticRenderer produces rustc-style output with source snippets and underline markers.
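The standard dynamic-programming Levenshtein distance is enough for that kind of suggestion; a sketch (the distance threshold of 2 is an arbitrary choice, not syntaqlite's):

```rust
// Classic Levenshtein distance: the minimum number of single-character
// insertions, deletions, and substitutions turning `a` into `b`.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.iter().enumerate() {
        let mut cur = vec![i + 1];
        for (j, cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            // min of: substitution, deletion, insertion
            cur.push((prev[j] + cost).min(prev[j + 1] + 1).min(cur[j] + 1));
        }
        prev = cur;
    }
    prev[b.len()]
}

// "Did you mean?" helper: pick the closest candidate, as long as it is
// reasonably close.
fn suggest<'a>(input: &str, candidates: &[&'a str]) -> Option<&'a str> {
    candidates
        .iter()
        .map(|c| (levenshtein(input, c), *c))
        .min()
        .filter(|(d, _)| *d <= 2)
        .map(|(_, c)| c)
}
```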