Architecture
Pipeline
SQL source text flows through a pipeline of stages.
First, source text is tokenized and parsed into an AST.
The formatter and analyzer then consume the AST independently — they don't depend on each other.
The LSP server ties both together for editor integration. You can also use any component on its own — tokenize without parsing, parse without formatting, validate without formatting, or format and validate together.
The C/Rust sandwich
syntaqlite's parser is not a hand-written Rust parser. It uses SQLite's Lemon-generated grammar and tokenizer, compiled from C and linked into Rust via FFI. The Rust layer wraps the C parser in safe APIs and builds the formatter, semantic analyzer, and LSP on top.
There's also an outbound FFI layer: the Rust formatter and validator are
exported back to C consumers through #[no_mangle] functions, so non-Rust
projects can link syntaqlite as a C library. This makes the architecture a
genuine sandwich — C at the bottom (parser), Rust in the middle (analysis,
formatting, LSP), C at the top (consumer API).
The bottom layer is the C parser and tokenizer.
The middle layer is Rust, wrapping the C parser in safe types and building analysis and formatting on top.
The top layer exports Rust back to C for external consumers.
.synq grammar files generate code consumed by multiple layers.
What's from SQLite
sqlite_parse.c and sqlite_tokenize.c are generated by SQLite's Lemon
parser generator. They contain the grammar state machine and token/keyword
tables. These are not vendored from upstream SQLite — they're regenerated from
SQLite's grammar rules as part of the syntaqlite build.
What's generated from .synq
The .synq grammar files in syntaqlite-syntax/parser-nodes/ are the source
of truth for AST node structure. syntaqlite-buildtools generates from them:
- C headers — struct layouts for AST nodes, parser action code, node metadata (field names, field counts, list ranges), semantic role byte tables, and formatter dispatch tables
- Rust code — typed AST node wrappers and the semantic_roles.rs table
The arena
The C parser allocates all nodes for a statement into a flat memory buffer (arena). Rust reads the nodes directly via pointer casts — no copying or deserialization. The arena is reset between statements, so memory usage stays proportional to the largest single statement, not the whole file.
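The arena idea can be shown with a minimal bump allocator. This is an illustrative sketch, not syntaqlite's actual node layout — the real arena is a C buffer that Rust reads via pointer casts:

```rust
// Minimal bump-arena sketch: nodes for one statement live in a flat buffer,
// and reset() reuses the same allocation for the next statement.

#[repr(C)]
#[derive(Clone, Copy)]
struct Node {
    kind: u32,
    child: u32,
}

struct Arena {
    buf: Vec<Node>,
}

impl Arena {
    fn new() -> Self {
        Arena { buf: Vec::new() }
    }

    fn alloc(&mut self, n: Node) -> usize {
        self.buf.push(n);
        self.buf.len() - 1
    }

    // Reset between statements: Vec::clear keeps the allocation, so memory
    // stays proportional to the largest statement, not the whole file.
    fn reset(&mut self) {
        self.buf.clear();
    }

    fn get(&self, idx: usize) -> &Node {
        &self.buf[idx]
    }

    fn len(&self) -> usize {
        self.buf.len()
    }
}
```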
The FFI boundaries
Inbound (C → Rust): syntaqlite-syntax/src/parser/ffi.rs defines unsafe
wrappers around the C parser and tokenizer. These are the only place raw C
pointers are handled. The rest of the Rust code sees safe Parser,
Tokenizer, and ParseSession types.
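The inbound-wrapper pattern looks roughly like the following. The names and the stubbed C functions are hypothetical; the point is that raw pointers stay inside one module and cleanup is guaranteed by Drop:

```rust
// Sketch of confining unsafe FFI to one module behind a safe type.

mod c_api {
    // Stand-ins for the real extern "C" declarations from the C parser.
    pub unsafe fn tokenizer_create() -> *mut u8 {
        Box::into_raw(Box::new(0u8))
    }
    pub unsafe fn tokenizer_destroy(p: *mut u8) {
        drop(Box::from_raw(p));
    }
}

pub struct Tokenizer {
    raw: *mut u8,
}

impl Tokenizer {
    pub fn new() -> Self {
        // The unsafe call is encapsulated here; callers never see the pointer.
        Tokenizer { raw: unsafe { c_api::tokenizer_create() } }
    }

    pub fn is_valid(&self) -> bool {
        !self.raw.is_null()
    }
}

impl Drop for Tokenizer {
    fn drop(&mut self) {
        // Destruction is tied to Rust's ownership, so the C resource
        // cannot leak or be freed twice through this wrapper.
        unsafe { c_api::tokenizer_destroy(self.raw) }
    }
}
```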
Outbound (Rust → C): syntaqlite/src/fmt/ffi.rs and
syntaqlite/src/semantic/ffi.rs export the formatter and validator as opaque
C handles (SyntaqliteFormatter*, SyntaqliteValidator*) with lifecycle
functions (create, use, destroy).
Crates
| Crate | Role |
|---|---|
| syntaqlite-syntax | Tokenizer, parser, AST arena, grammar system (Rust + C) |
| syntaqlite | Formatter, semantic analyzer, LSP, dialect interface (Rust) |
| syntaqlite-common | Shared types — semantic roles (Rust) |
| syntaqlite-buildtools | Code generation from .synq grammar definitions |
| syntaqlite-cli | Command-line interface |
| syntaqlite-wasm | WebAssembly bindings |
Grammar system
The source of truth for AST structure is a set of .synq files in
syntaqlite-syntax/parser-nodes/. These define:
- Nodes — AST node types with typed fields
- Enums — fixed value sets (e.g., sort order, join type)
- Flags — bit-packed booleans
- Lists — sequences of child nodes
- Semantic annotations — instructions for the validator
- Formatting rules — bytecode for the pretty-printer
Adding a new AST node, its formatting, and its validation behavior is a single
change to a .synq file followed by running code generation.
Tokenizer
Wraps the Lemon-generated token function in a C struct with lifecycle methods
(create, reset, next, destroy). The Rust Tokenizer type wraps this further
in a safe API. Zero-copy — tokens reference byte offsets into the source
string. Exposed as a public API for consumers who only need lexical analysis.
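The zero-copy scheme can be illustrated with tokens that carry only byte offsets. The toy lexer below splits on whitespace purely to show the offset mechanics; it is not syntaqlite's SQL tokenizer:

```rust
// Tokens as byte ranges into the source string: nothing is copied,
// and the token text is recovered by slicing the original input.

#[derive(Debug, PartialEq)]
struct Token {
    start: usize,
    end: usize,
}

impl Token {
    // Borrow the token's text from the source; lifetime ties it to the input.
    fn text<'a>(&self, src: &'a str) -> &'a str {
        &src[self.start..self.end]
    }
}

// Toy lexer: emit a token for each whitespace-separated run.
fn tokenize(src: &str) -> Vec<Token> {
    let mut out = Vec::new();
    let mut start = None;
    for (i, c) in src.char_indices() {
        match (c.is_whitespace(), start) {
            (false, None) => start = Some(i),
            (true, Some(s)) => {
                out.push(Token { start: s, end: i });
                start = None;
            }
            _ => {}
        }
    }
    if let Some(s) = start {
        out.push(Token { start: s, end: src.len() });
    }
    out
}
```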
Parser
Driven by the Lemon-generated state machine. Key properties:
- Streaming — yields one statement at a time, so memory usage doesn't grow with input size
- Error recovery — on a syntax error, the parser skips to the next semicolon and continues
- Token collection — optionally records tokens and comments alongside the AST (needed by the formatter to preserve whitespace and comment placement)
- Incremental parsing — a separate mode for editors, where tokens are fed one at a time for completion support
- Macro expansion — registered macros are expanded during parsing with recursion tracking
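The streaming interface can be sketched as an iterator that yields one statement at a time. This naive splitter stands in for the Lemon state machine purely to show the shape of the API — real SQL parsing obviously cannot split on semicolons blindly:

```rust
// Streaming sketch: statements are yielded one at a time, so memory use
// does not grow with input size.

struct Statements<'a> {
    rest: &'a str,
}

impl<'a> Iterator for Statements<'a> {
    type Item = &'a str;

    fn next(&mut self) -> Option<&'a str> {
        loop {
            if self.rest.is_empty() {
                return None;
            }
            // Advance to the next ';' (or end of input).
            let (stmt, rest) = match self.rest.find(';') {
                Some(i) => (&self.rest[..i], &self.rest[i + 1..]),
                None => (self.rest, ""),
            };
            self.rest = rest;
            let stmt = stmt.trim();
            // Skip empty fragments (e.g. trailing whitespace after the last ';').
            if !stmt.is_empty() {
                return Some(stmt);
            }
        }
    }
}
```

Error recovery fits the same shape: on a syntax error the real parser discards input up to the next semicolon and the iterator simply resumes with the following statement.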
Semantic analyzer
Single-pass walk over the AST that validates references against a layered catalog:
- Query (innermost) — CTEs, subquery aliases
- Document — CREATE TABLE statements in the current file
- Connection — DDL from prior statements (Execute mode only)
- Database — user-provided schema
- Dialect (outermost) — built-in functions
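Layered resolution means the innermost layer wins, so a CTE can shadow a real table. A minimal sketch of that lookup order (the Catalog type and layer representation here are invented for illustration):

```rust
// Layered name resolution: search inner scopes before outer ones,
// mirroring the Query -> Document -> Connection -> Database -> Dialect order.

use std::collections::HashMap;

struct Catalog<'a> {
    // layers[0] is innermost (query scope), the last is outermost (dialect).
    layers: Vec<HashMap<&'a str, &'a str>>,
}

impl<'a> Catalog<'a> {
    fn resolve(&self, name: &str) -> Option<&'a str> {
        // First layer containing the name wins, so inner bindings shadow outer.
        self.layers.iter().find_map(|l| l.get(name).copied())
    }
}
```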
The walk is driven by a semantic role table — a byte-encoded instruction
set generated from .synq annotations and stored as a flat C byte array. Each
AST node type maps to a role (e.g., query, source_ref, cte_binding) that
tells the analyzer what to validate and how to update scope. Rust reads the
table via a direct pointer cast — zero decoding cost.
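The role-table idea reduces to "index a flat byte array by node kind." The role names below come from the text above; the byte encoding and table contents are invented for illustration (the real table is generated and read through a pointer cast rather than a match):

```rust
// Sketch: semantic roles as a flat byte table indexed by node kind.

#[derive(Debug, PartialEq, Clone, Copy)]
enum Role {
    None,
    Query,
    SourceRef,
    CteBinding,
}

// Stand-in for the generated C byte array: index = node kind, value = role.
static ROLES: [u8; 4] = [0, 1, 2, 3];

fn role_of(kind: usize) -> Role {
    // The real system casts the byte directly; here we decode explicitly.
    match ROLES[kind] {
        1 => Role::Query,
        2 => Role::SourceRef,
        3 => Role::CteBinding,
        _ => Role::None,
    }
}
```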
Diagnostics are emitted inline during the walk. There's no separate "resolve" pass — everything happens in one traversal.
Formatter
Uses Wadler-Lindig style document algebra:
- Group — try to fit contents on one line; if too long, break
- Line — space in flat mode, newline + indent in break mode
- SoftLine — nothing in flat mode, newline + indent in break mode
- Nest — increase indentation level
- Keyword — SQL keyword (case-transformed per config)
- Text — literal text from the source (never transformed)
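The core of the algebra is the Group/Line decision: render flat if the group fits in the width, otherwise break every Line inside it. A minimal sketch covering just those two primitives plus Text (no SoftLine, nesting, or keyword casing):

```rust
// Minimal Wadler-Lindig sketch: Group tries flat layout first,
// falling back to broken layout when the flat width exceeds the limit.

enum Doc {
    Text(String),
    Line, // " " when flat, "\n" when broken
    Group(Vec<Doc>),
}

// Width of a document if rendered entirely on one line.
fn flat_width(d: &Doc) -> usize {
    match d {
        Doc::Text(s) => s.len(),
        Doc::Line => 1,
        Doc::Group(ds) => ds.iter().map(flat_width).sum(),
    }
}

fn render(d: &Doc, width: usize, out: &mut String) {
    match d {
        Doc::Text(s) => out.push_str(s),
        Doc::Line => out.push(' '),
        Doc::Group(ds) => {
            if ds.iter().map(flat_width).sum::<usize>() <= width {
                // Fits: render children in flat mode.
                for c in ds {
                    render(c, width, out);
                }
            } else {
                // Too long: every direct Line in this group breaks.
                for c in ds {
                    match c {
                        Doc::Line => out.push('\n'),
                        other => render(other, width, out),
                    }
                }
            }
        }
    }
}
```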
The formatting rules for each AST node are compiled from .synq fmt blocks
into bytecode. At format time, the bytecode interpreter walks the AST and
builds a document, which is then rendered with line-width-aware layout.
Comment placement is handled separately — the formatter tracks comment positions from the parser and reattaches them to the appropriate locations in the formatted output.
Diagnostic rendering
Diagnostics carry byte offsets, a structured message, and optional help text
(e.g., "did you mean 'name'?" via Levenshtein distance matching). The
DiagnosticRenderer produces rustc-style output with source snippets and
underline markers.
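The "did you mean" suggestion is classic Levenshtein matching: compute the edit distance to each candidate name and suggest the closest one within a small threshold. A self-contained sketch (the threshold of 2 is an assumption, not syntaqlite's documented cutoff):

```rust
// Levenshtein edit distance with a rolling single-row table.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.iter().enumerate() {
        let mut cur = vec![i + 1];
        for (j, cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            // Minimum of substitution, deletion, and insertion.
            cur.push((prev[j] + cost).min(prev[j + 1] + 1).min(cur[j] + 1));
        }
        prev = cur;
    }
    prev[b.len()]
}

// Suggest the closest candidate, but only if it is plausibly a typo.
fn suggest<'a>(name: &str, candidates: &[&'a str]) -> Option<&'a str> {
    candidates
        .iter()
        .map(|c| (levenshtein(name, c), *c))
        .filter(|(d, _)| *d <= 2) // assumed cutoff: at most two edits away
        .min_by_key(|(d, _)| *d)
        .map(|(_, c)| c)
}
```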