Typed AST & lossless roundtripping
The parser produces a fully-typed, immutable AST where every SQL construct has its own interface type. Together with lossless roundtripping, that makes the AST safe to use for refactoring, migration, formatting, and any transformation that needs to touch SQL without breaking it.
Typed interface hierarchy
The AST has 5,988 distinct typed interfaces across all 15 dialects. Every construct, from a column reference to a window function, gets its own type.
No stringly-typed generic nodes. A SELECT is a different type from an INSERT. A CASE expression is a different type from a function call. So:
- Dedicated types everywhere. Expr.Add is different from Expr.Multiply; TableRef.Simple is different from TableRef.Subquery.
- No casting. You always know what you're working with.
- IDE support. Auto-complete, go-to-definition, and find-usages all work.
- Pattern matching. instanceof checks against the full hierarchy.
All AST nodes are immutable Java records with wither methods for functional transformation. They're safe to share across threads, cache indefinitely, and use as hash map keys. Optional<T> everywhere — no nulls.
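As a rough illustration of what immutable records with wither methods look like, here is a minimal sketch. The node shapes, field names, and the withWhere and withRight methods are assumptions made for this example; they echo the names used in the text but are not taken from the library's actual API.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

final class AstRecordsSketch {
    // Hypothetical node shapes; the real AST's names and fields may differ.
    sealed interface Expr {
        record Identifier(String name) implements Expr {}
        record IntLiteral(long value) implements Expr {}
        record Add(Expr left, Expr right) implements Expr {
            // Wither: returns a modified copy and leaves this node untouched.
            Add withRight(Expr newRight) { return new Add(left, newRight); }
        }
        record GreaterThan(Expr left, Expr right) implements Expr {}
    }

    record SelectStatement(List<Expr> selectItems, String from, Optional<Expr> where) {
        SelectStatement {
            selectItems = List.copyOf(selectItems); // defensive copy keeps the record immutable
        }
        SelectStatement withWhere(Optional<Expr> newWhere) {
            return new SelectStatement(selectItems, from, newWhere);
        }
    }

    public static void main(String[] args) {
        SelectStatement original = new SelectStatement(
            List.of(new Expr.Add(new Expr.Identifier("a"), new Expr.IntLiteral(1))),
            "t",
            Optional.empty());
        Expr filter = new Expr.GreaterThan(new Expr.Identifier("x"), new Expr.IntLiteral(0));
        SelectStatement filtered = original.withWhere(Optional.of(filter));

        System.out.println(original.where().isPresent()); // false: the original is unchanged
        System.out.println(filtered.where().isPresent()); // true

        // Records compare and hash by value, so nodes can serve as map keys.
        Map<SelectStatement, String> cache = new HashMap<>();
        cache.put(original, "plan-1");
        System.out.println(cache.get(filtered.withWhere(Optional.empty()))); // plan-1
    }
}
```

Because a wither returns a new record instead of mutating the old one, a transformation can share unchanged subtrees between the old and new trees without copying them.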
Example
Parsing SELECT a + 1 FROM t WHERE x > 0 produces:
SelectStatement
├── selectItems: [Expr.Add(Expr.Identifier("a"), Expr.IntLiteral(1))]
├── from: [TableRef.Simple("t")]
└── where: Expr.GreaterThan(Expr.Identifier("x"), Expr.IntLiteral(0))
Every node has a dedicated type. That's a meaningful difference from parsers that produce generic "expression" nodes with string-typed operators: with generic nodes, a missing case shows up at runtime; with typed interfaces, the type system catches it.
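The tree above could be modeled and walked roughly as follows. This is a sketch under assumptions: the record declarations are invented for illustration and cover only the four expression types this example needs, and the identifiers helper is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

final class ExampleTreeSketch {
    // Hypothetical declarations mirroring the names in the tree diagram.
    sealed interface Expr {
        record Identifier(String name) implements Expr {}
        record IntLiteral(long value) implements Expr {}
        record Add(Expr left, Expr right) implements Expr {}
        record GreaterThan(Expr left, Expr right) implements Expr {}
    }
    sealed interface TableRef {
        record Simple(String name) implements TableRef {}
    }
    record SelectStatement(List<Expr> selectItems, List<TableRef> from, Optional<Expr> where) {}

    // Hand-built equivalent of parsing: SELECT a + 1 FROM t WHERE x > 0
    static SelectStatement example() {
        return new SelectStatement(
            List.of(new Expr.Add(new Expr.Identifier("a"), new Expr.IntLiteral(1))),
            List.of(new TableRef.Simple("t")),
            Optional.of(new Expr.GreaterThan(new Expr.Identifier("x"), new Expr.IntLiteral(0))));
    }

    // Dispatch on node type with instanceof patterns; no string-typed operator tags.
    static void collectIdentifiers(Expr e, List<String> out) {
        if (e instanceof Expr.Identifier id) {
            out.add(id.name());
        } else if (e instanceof Expr.Add add) {
            collectIdentifiers(add.left(), out);
            collectIdentifiers(add.right(), out);
        } else if (e instanceof Expr.GreaterThan gt) {
            collectIdentifiers(gt.left(), out);
            collectIdentifiers(gt.right(), out);
        }
        // IntLiteral has no children and contributes nothing.
    }

    static List<String> identifiers(SelectStatement s) {
        List<String> out = new ArrayList<>();
        s.selectItems().forEach(e -> collectIdentifiers(e, out));
        s.where().ifPresent(e -> collectIdentifiers(e, out));
        return out;
    }

    public static void main(String[] args) {
        System.out.println(identifiers(example())); // [a, x]
    }
}
```

The sketch uses instanceof chains so that it runs on Java 17. On Java 21 or later, the same dispatch written as a switch over a sealed interface is checked for exhaustiveness by the compiler, which is one way a missing case gets caught before runtime.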
Error recovery at every node
Every AST interface has a ParseErr variant. When the parser hits a syntax error, it inserts a ParseErr node and keeps going. The rest of the AST retains full structure. That matters for IDE use cases, where you need to analyze partially-written SQL and the valid portions still need complete type information.
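To make the idea concrete, here is a small sketch of an error variant living alongside ordinary nodes. The fields of ParseErr (rawText, message) and the hand-built tree are assumptions for illustration; the text only states that a ParseErr variant exists, not what it carries.

```java
import java.util.ArrayList;
import java.util.List;

final class ParseErrSketch {
    sealed interface Expr {
        record Identifier(String name) implements Expr {}
        record IntLiteral(long value) implements Expr {}
        record Add(Expr left, Expr right) implements Expr {}
        // Hypothetical error variant: keeps the offending text and a message.
        record ParseErr(String rawText, String message) implements Expr {}
    }

    // What a recovering parser might produce for the incomplete input "a + 1 + ".
    static Expr partiallyParsed() {
        return new Expr.Add(
            new Expr.Add(new Expr.Identifier("a"), new Expr.IntLiteral(1)),
            new Expr.ParseErr("", "expected expression after '+'"));
    }

    // Walk the tree and gather error nodes; valid siblings keep their full types.
    static void collectErrors(Expr e, List<Expr.ParseErr> out) {
        if (e instanceof Expr.ParseErr err) {
            out.add(err);
        } else if (e instanceof Expr.Add add) {
            collectErrors(add.left(), out);
            collectErrors(add.right(), out);
        }
    }

    static List<Expr.ParseErr> errors(Expr root) {
        List<Expr.ParseErr> out = new ArrayList<>();
        collectErrors(root, out);
        return out;
    }

    public static void main(String[] args) {
        Expr tree = partiallyParsed();
        System.out.println(errors(tree).size()); // 1
        // The valid left operand is still a fully typed Add node.
        System.out.println(((Expr.Add) tree).left() instanceof Expr.Add); // true
    }
}
```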
Lossless roundtripping
Every token in the original SQL is preserved in the AST, not just the semantically meaningful parts:
- Whitespace between tokens
- Comments (line and block), attached to the nearest token
- Keyword spelling. SELECT vs select vs Select.
- Keyword aliases. INT vs INTEGER, != vs <>.
- Unnecessary parentheses preserved exactly as written
- Quote style. Double quotes, backticks, square brackets.
Parse a file, render it back, get byte-identical output. This is verified by 177,197+ identity tests drawn from 34+ real-world SQL sources, with 99.7%+ pass rates across all 15 dialects.
It isn't a feature we bolted on. It falls out of the grammar architecture: every token is described in the grammar, so every token ends up in the AST. Hand-written parsers often don't get there because they cut corners: skipping whitespace here, normalizing casing there.
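One common way to get this property, sketched below, is to attach whitespace and comments to the token that follows them, so that rendering is plain concatenation. This is an illustrative toy lexer, not the parser's actual grammar-driven mechanism; the Token and TokenStream types are hypothetical, and quoted strings are deliberately not handled.

```java
import java.util.ArrayList;
import java.util.List;

final class LosslessTokensSketch {
    // Each token carries the trivia (whitespace and comments) that precedes it.
    record Token(String leadingTrivia, String text) {}

    record TokenStream(List<Token> tokens, String trailingTrivia) {
        // Rendering concatenates trivia and text in order, so nothing is lost.
        String render() {
            StringBuilder sb = new StringBuilder();
            for (Token t : tokens) sb.append(t.leadingTrivia()).append(t.text());
            return sb.append(trailingTrivia).toString();
        }
    }

    static TokenStream lex(String sql) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (true) {
            int triviaStart = i;
            i = skipTrivia(sql, i);
            String trivia = sql.substring(triviaStart, i);
            if (i >= sql.length()) return new TokenStream(List.copyOf(tokens), trivia);
            int start = i;
            if (isWordChar(sql.charAt(i))) {
                while (i < sql.length() && isWordChar(sql.charAt(i))) i++;
            } else {
                i++; // any other character becomes a one-character token
            }
            tokens.add(new Token(trivia, sql.substring(start, i)));
        }
    }

    // Advances past whitespace, "--" line comments, and "/* */" block comments.
    static int skipTrivia(String s, int i) {
        while (i < s.length()) {
            if (Character.isWhitespace(s.charAt(i))) {
                i++;
            } else if (s.startsWith("--", i)) {
                int nl = s.indexOf('\n', i);
                i = (nl < 0) ? s.length() : nl; // the newline is consumed as whitespace next
            } else if (s.startsWith("/*", i)) {
                int end = s.indexOf("*/", i + 2);
                i = (end < 0) ? s.length() : end + 2;
            } else {
                break;
            }
        }
        return i;
    }

    static boolean isWordChar(char c) {
        return Character.isLetterOrDigit(c) || c == '_';
    }

    public static void main(String[] args) {
        String sql = "select  a -- pick a\nFROM t /* all rows */ ;\n";
        TokenStream stream = lex(sql);
        System.out.println(stream.tokens().size());      // 5
        System.out.println(stream.render().equals(sql)); // true
    }
}
```

Since every input character lands in either a token's text or some trivia string, the round trip holds for any input, including keyword casing and odd spacing.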
Semantic comparison
Sometimes you want to compare SQL by structure rather than by text. Every AST node provides:
- semanticEquals() for structural equality, ignoring whitespace, comments, and keyword casing
- semanticHash() for a consistent hash code matching semantic equality
- semanticCompareTo() for deterministic ordering
That covers use cases like detecting duplicate queries, finding semantically equivalent but differently formatted SQL, and building query caches keyed by structure rather than text.
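The sketch below shows one way such methods could behave, using a flat token list as a stand-in for a real AST node. Only the three method names come from the text; the Token, Kind, and Statement types, the parameter lists, and the decision to leave identifier casing untouched are assumptions for illustration.

```java
import java.util.List;
import java.util.Locale;

final class SemanticCompareSketch {
    enum Kind { KEYWORD, IDENTIFIER, LITERAL, SYMBOL }

    record Token(String leadingTrivia, Kind kind, String text) {
        // Keywords compare case-insensitively; other text is compared as written.
        String semanticText() {
            return kind == Kind.KEYWORD ? text.toUpperCase(Locale.ROOT) : text;
        }
    }

    // Stand-in for an AST node: a flat token list keeps the sketch short.
    record Statement(List<Token> tokens) {
        // The semantic key drops trivia and normalizes keyword casing.
        List<String> semanticKey() {
            return tokens.stream().map(t -> t.kind() + ":" + t.semanticText()).toList();
        }
        boolean semanticEquals(Statement other) {
            return semanticKey().equals(other.semanticKey());
        }
        int semanticHash() {
            return semanticKey().hashCode();
        }
        int semanticCompareTo(Statement other) {
            return String.join("\u0000", semanticKey())
                .compareTo(String.join("\u0000", other.semanticKey()));
        }
    }

    static Token kw(String trivia, String text) { return new Token(trivia, Kind.KEYWORD, text); }
    static Token id(String trivia, String text) { return new Token(trivia, Kind.IDENTIFIER, text); }

    public static void main(String[] args) {
        Statement a = new Statement(List.of(
            kw("", "SELECT"), id(" ", "a"), kw(" ", "FROM"), id(" ", "t")));
        Statement b = new Statement(List.of(
            kw("", "select"), id("  ", "a"), kw("\n", "from"), id(" /* main table */ ", "t")));

        System.out.println(a.semanticEquals(b));                  // true
        System.out.println(a.equals(b));                          // false: record equality still sees trivia and casing
        System.out.println(a.semanticHash() == b.semanticHash()); // true
        System.out.println(a.semanticCompareTo(b));               // 0
    }
}
```

Keeping the lossless equals separate from semanticEquals lets the same node serve both purposes: exact comparison for roundtrip checks and structural comparison for deduplication or caching.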
Identity test methodology
Lossless roundtripping isn't just a design goal. It's continuously verified against the largest known SQL test corpus:
- 177,197+ test statements from 34+ sources.
- Sources include PostgreSQL's pg_regress suite, the DuckDB test suite, Apache Spark tests, Apache Calcite, Trino, CockroachDB, ZetaSQL, ANTLR grammars-v4, SQLGlot, sqlfluff, ShardingSphere, sqlparser-rs, and more.
- Every statement is parsed, rendered, and re-parsed; the rendered output must be byte-identical to the input.
- Automated regression. CI runs all identity tests on every commit across all 15 dialects.
These are not synthetic. The corpus includes PostgreSQL's own pg_regress suite (50,807 statements), DuckDB's test suite (36,325 statements), and SQL from 32+ other real-world sources.
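An identity test of this kind can be sketched as a small harness that takes any parse-then-render function and compares bytes. The Report type and the run method are hypothetical; the actual test infrastructure is not described in the text, and the parser is replaced here by a function parameter so the sketch stays self-contained.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

final class IdentityTestSketch {
    record Report(int total, List<String> failures) {
        int passed() { return total - failures.size(); }
        double passRate() { return total == 0 ? 1.0 : (double) passed() / total; }
    }

    // parseAndRender stands in for "parse the SQL, then render the AST back to text".
    static Report run(List<String> corpus, UnaryOperator<String> parseAndRender) {
        List<String> failures = new ArrayList<>();
        for (String sql : corpus) {
            String rendered;
            try {
                rendered = parseAndRender.apply(sql);
            } catch (RuntimeException e) {
                failures.add(sql); // a crash counts as a failed identity test
                continue;
            }
            byte[] expected = sql.getBytes(StandardCharsets.UTF_8);
            byte[] actual = rendered.getBytes(StandardCharsets.UTF_8);
            if (!Arrays.equals(expected, actual)) failures.add(sql);
        }
        return new Report(corpus.size(), List.copyOf(failures));
    }

    public static void main(String[] args) {
        List<String> corpus = List.of("SELECT 1", "select  a\nFROM t", "SELECT /* c */ x");

        // A lossless round trip passes every statement.
        System.out.println(run(corpus, s -> s).passed()); // 3

        // A whitespace-normalizing round trip fails the statement with unusual spacing.
        Report lossy = run(corpus, s -> s.replaceAll("\\s+", " "));
        System.out.println(lossy.passed());               // 2
    }
}
```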
What this enables
- Safe refactoring tools. Rename a column, add a filter, rewrite a subquery — the rest of the file stays exactly as the author wrote it.
- Diff-friendly transformations. Changes show up as minimal, meaningful diffs instead of wholesale reformatting.
- Migration tooling. Convert SQL between dialects while preserving original style and comments.
- Code review. Generated SQL changes are reviewable because only the intended modifications appear.
- Formatters. The SQL formatter builds on lossless roundtripping to offer adaptive formatting that respects author intent.