
Typed AST & lossless roundtripping

The parser produces a fully typed, immutable AST where every SQL construct has its own interface type. Together with lossless roundtripping, that makes the AST safe to use for refactoring, migration, formatting, and any transformation that needs to touch SQL without breaking it.

Typed interface hierarchy

The AST has 5,988 distinct typed interfaces across all 15 dialects. Every construct, from a column reference to a window function, gets its own type.

No stringly-typed generic nodes. A SELECT is a different type from an INSERT. A CASE expression is a different type from a function call. So:

  • Dedicated types everywhere. Expr.Add is different from Expr.Multiply; TableRef.Simple is different from TableRef.Subquery.
  • No casting. You always know what you're working with.
  • IDE support. Auto-complete, go-to-definition, and find-usages all work.
  • Pattern matching. instanceof checks against the full hierarchy.

All AST nodes are immutable Java records with wither methods for functional transformation. They're safe to share across threads, cache indefinitely, and use as hash map keys. Optional values are modeled with Optional<T>, never null.
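To make the record-and-wither idea concrete, here is a minimal sketch. The type names follow the examples in this section, but the field layout, the withWhere method, and the WitherDemo class are assumptions for illustration, not the library's actual API.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

final class WitherDemo {
    // instanceof pattern matching binds the narrowed type without an explicit cast.
    static String describe(Expr e) {
        if (e instanceof Expr.Identifier id) return id.name();
        if (e instanceof Expr.IntLiteral lit) return Long.toString(lit.value());
        if (e instanceof Expr.GreaterThan gt) return describe(gt.left()) + " > " + describe(gt.right());
        throw new IllegalStateException("unreachable for this sealed hierarchy");
    }

    public static void main(String[] args) {
        SelectStatement base =
            new SelectStatement(List.of(new Expr.Identifier("a")), "t", Optional.empty());
        SelectStatement filtered = base.withWhere(
            new Expr.GreaterThan(new Expr.Identifier("x"), new Expr.IntLiteral(0)));

        System.out.println(base.where().isPresent());                  // false: the original is unchanged
        System.out.println(describe(filtered.where().orElseThrow())); // x > 0

        // Records derive equals/hashCode from their components, so nodes work as map keys.
        Map<SelectStatement, String> cache = new HashMap<>();
        cache.put(base, "cached plan");
        SelectStatement sameShape =
            new SelectStatement(List.of(new Expr.Identifier("a")), "t", Optional.empty());
        System.out.println(cache.get(sameShape));                      // cached plan
    }
}

// Hypothetical miniature hierarchy; every shape here is an assumption.
sealed interface Expr {
    record Identifier(String name) implements Expr {}
    record IntLiteral(long value) implements Expr {}
    record GreaterThan(Expr left, Expr right) implements Expr {}
}

// An immutable statement node; the WHERE clause is Optional rather than nullable.
record SelectStatement(List<Expr> selectItems, String from, Optional<Expr> where) {
    SelectStatement {
        selectItems = List.copyOf(selectItems); // defensive copy keeps the record immutable
    }

    // Wither: returns a modified copy and leaves the receiver untouched.
    SelectStatement withWhere(Expr newWhere) {
        return new SelectStatement(selectItems, from, Optional.of(newWhere));
    }
}
```

Because the wither returns a new record, any thread or cache still holding the original node keeps seeing the same value.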

Example

Parsing SELECT a + 1 FROM t WHERE x > 0 produces:

SelectStatement
├── selectItems: [Expr.Add(Expr.Identifier("a"), Expr.IntLiteral(1))]
├── from: [TableRef.Simple("t")]
└── where: Expr.GreaterThan(Expr.Identifier("x"), Expr.IntLiteral(0))

Every node has a dedicated type. That's a meaningful difference from parsers that produce generic "expression" nodes with string-typed operators: with generic nodes, a missing case shows up at runtime; with typed interfaces, the type system catches it.
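The contrast between string-typed and dedicated node types can be sketched as follows. This assumes the typed hierarchy is sealed and that the code is compiled with Java 21 or later (for pattern-matching switch); the text does not confirm either, and all names below are hypothetical.

```java
final class ExhaustiveDemo {
    // Generic node: the operator is just a string, so a missing case
    // is only discovered when the code runs.
    record GenericExpr(String op, long left, long right) {}

    static long evalGeneric(GenericExpr e) {
        switch (e.op()) {
            case "+": return e.left() + e.right();
            // "*" was forgotten; nothing flags this at compile time.
            default: throw new IllegalArgumentException("unhandled operator: " + e.op());
        }
    }

    // Typed nodes: a sealed interface lets the compiler check switch coverage.
    sealed interface Expr {
        record IntLiteral(long value) implements Expr {}
        record Add(Expr left, Expr right) implements Expr {}
        record Multiply(Expr left, Expr right) implements Expr {}
    }

    static long eval(Expr e) {
        // No default branch: deleting any case below is a compile error.
        return switch (e) {
            case Expr.IntLiteral lit -> lit.value();
            case Expr.Add add -> eval(add.left()) + eval(add.right());
            case Expr.Multiply mul -> eval(mul.left()) * eval(mul.right());
        };
    }

    public static void main(String[] args) {
        Expr tree = new Expr.Add(new Expr.IntLiteral(2),
                new Expr.Multiply(new Expr.IntLiteral(3), new Expr.IntLiteral(4)));
        System.out.println(eval(tree)); // 14
        try {
            evalGeneric(new GenericExpr("*", 3, 4));
        } catch (IllegalArgumentException ex) {
            System.out.println(ex.getMessage()); // unhandled operator: *
        }
    }
}
```

With a chain of instanceof checks instead of a switch, the compiler does not enforce coverage, so the exhaustiveness benefit depends on how the consuming code is written.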

Error recovery at every node

Every AST interface has a ParseErr variant. When the parser hits a syntax error, it inserts a ParseErr node and keeps going. The rest of the AST retains full structure. That matters for IDE use cases, where you need to analyze partially-written SQL and the valid portions still need complete type information.
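A toy illustration of the recovery idea, assuming an error variant that records the offending text. The parser below only handles a comma-separated list of identifiers and integers, and the ParseErr fields are hypothetical; it is meant to show the shape of the result, not how the real parser recovers.

```java
import java.util.ArrayList;
import java.util.List;

final class RecoveryDemo {
    sealed interface Expr {
        record Identifier(String name) implements Expr {}
        record IntLiteral(long value) implements Expr {}
        // Error variant: keeps the offending text so tooling can report it.
        record ParseErr(String text, String message) implements Expr {}
    }

    // A bad item becomes a ParseErr node instead of aborting the parse,
    // so the neighbouring items keep their precise types.
    static List<Expr> parseItems(String input) {
        List<Expr> items = new ArrayList<>();
        for (String raw : input.split(",", -1)) {
            String item = raw.trim();
            if (item.matches("[A-Za-z_][A-Za-z0-9_]*")) {
                items.add(new Expr.Identifier(item));
            } else if (item.matches("[0-9]{1,18}")) { // at most 18 digits always fits in a long
                items.add(new Expr.IntLiteral(Long.parseLong(item)));
            } else {
                items.add(new Expr.ParseErr(item, "expected identifier or integer"));
            }
        }
        return List.copyOf(items);
    }

    public static void main(String[] args) {
        for (Expr e : parseItems("a, 1 +, b")) {
            System.out.println(e.getClass().getSimpleName());
        }
        // prints Identifier, ParseErr, Identifier on separate lines
    }
}
```

An analysis pass can then skip or report the ParseErr nodes while treating the rest of the list as ordinary typed nodes.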

Lossless roundtripping

Every token in the original SQL is preserved in the AST, not just the semantically meaningful parts:

  • Whitespace between tokens
  • Comments (line and block), attached to the nearest token
  • Keyword spelling. SELECT vs select vs Select.
  • Keyword aliases. INT vs INTEGER, != vs <>.
  • Unnecessary parentheses preserved exactly as written
  • Quote style. Double quotes, backticks, square brackets.

Parse a file, render it back, get byte-identical output. This is verified by 177,197+ identity tests drawn from 34+ real-world SQL sources, with 99.7%+ pass rates across all 15 dialects.

It isn't a feature we bolted on. It falls out of the grammar architecture: every token is described in the grammar, so every token ends up in the AST. Hand-written parsers typically don't get there because they cut corners: skipping whitespace here, normalizing casing there.
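The underlying principle, that every input character must land somewhere in the structure, can be shown with a deliberately small sketch. This is a toy tokenizer, not the grammar-driven parser described here; the Token shape (leading trivia plus exact text) and the renameWord helper are assumptions, and the rename is token-level rather than scope-aware.

```java
import java.util.ArrayList;
import java.util.List;

final class LosslessDemo {
    // A token keeps its exact spelling plus the whitespace and comments before it.
    record Token(String leadingTrivia, String text) {
        Token withText(String newText) { return new Token(leadingTrivia, newText); }
    }

    static boolean isWordChar(char c) { return Character.isLetterOrDigit(c) || c == '_'; }

    // Consumes whitespace, "--" line comments, and "/* */" block comments.
    static int skipTrivia(String s, int i) {
        int n = s.length();
        while (i < n) {
            if (Character.isWhitespace(s.charAt(i))) {
                i++;
            } else if (s.startsWith("--", i)) {
                int end = s.indexOf('\n', i);
                i = (end < 0) ? n : end; // the newline itself is consumed as whitespace
            } else if (s.startsWith("/*", i)) {
                int end = s.indexOf("*/", i + 2);
                i = (end < 0) ? n : end + 2;
            } else {
                break;
            }
        }
        return i;
    }

    static List<Token> tokenize(String sql) {
        List<Token> tokens = new ArrayList<>();
        int i = 0, n = sql.length();
        while (i < n) {
            int triviaStart = i;
            i = skipTrivia(sql, i);
            String trivia = sql.substring(triviaStart, i);
            int textStart = i;
            if (i < n && isWordChar(sql.charAt(i))) {
                while (i < n && isWordChar(sql.charAt(i))) i++;
            } else if (i < n) {
                i++; // any other character is a one-character symbol token
            }
            // At end of input the text is empty: an end-of-file token carrying trailing trivia.
            tokens.add(new Token(trivia, sql.substring(textStart, i)));
        }
        return tokens;
    }

    // Rendering is plain concatenation, so unmodified tokens reproduce the input exactly.
    static String render(List<Token> tokens) {
        StringBuilder out = new StringBuilder();
        for (Token t : tokens) out.append(t.leadingTrivia()).append(t.text());
        return out.toString();
    }

    // Replaces the text of matching word tokens; trivia and all other tokens are untouched.
    static List<Token> renameWord(List<Token> tokens, String from, String to) {
        List<Token> out = new ArrayList<>();
        for (Token t : tokens) out.add(t.text().equals(from) ? t.withText(to) : t);
        return out;
    }

    public static void main(String[] args) {
        String sql = "select  a,\n       b  -- keep me\nFROM t /* note */ where (a) <> 1\n";
        List<Token> tokens = tokenize(sql);
        System.out.println(render(tokens).equals(sql)); // true
        System.out.print(render(renameWord(tokens, "a", "amount")));
    }
}
```

In the renamed output only the two occurrences of a change; the double space after select, the comments, the mixed keyword casing, the parentheses, and the <> spelling all survive as written.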

Semantic comparison

Sometimes you want to compare SQL by structure rather than by text. Every AST node provides:

  • semanticEquals() for structural equality, ignoring whitespace, comments, and keyword casing
  • semanticHash() for a hash code consistent with semanticEquals()
  • semanticCompareTo() for deterministic ordering

That covers use cases like detecting duplicate queries, finding semantically equivalent but differently formatted SQL, and building query caches keyed by structure rather than text.
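One way such methods could relate to each other is sketched below. The method names come from the list above, but the implementation is an assumption: it compares a normalized token sequence with trivia dropped and keyword casing folded, which may differ from how the real AST does it.

```java
import java.util.List;
import java.util.Locale;

final class SemanticDemo {
    record Token(String leadingTrivia, String text, boolean keyword) {
        // Normalized form used for comparison: trivia dropped, keyword case folded.
        String semanticText() { return keyword ? text.toUpperCase(Locale.ROOT) : text; }
    }

    record Node(List<Token> tokens) {
        Node { tokens = List.copyOf(tokens); }

        List<String> semanticKey() {
            return tokens.stream().map(Token::semanticText).toList();
        }

        boolean semanticEquals(Node other) { return semanticKey().equals(other.semanticKey()); }

        // Derived from the same key as semanticEquals, so equal nodes hash equally.
        int semanticHash() { return semanticKey().hashCode(); }

        // Element-wise ordering over the same key; returns 0 exactly when semanticEquals is true.
        int semanticCompareTo(Node other) {
            List<String> a = semanticKey(), b = other.semanticKey();
            for (int i = 0; i < Math.min(a.size(), b.size()); i++) {
                int c = a.get(i).compareTo(b.get(i));
                if (c != 0) return c;
            }
            return Integer.compare(a.size(), b.size());
        }
    }

    public static void main(String[] args) {
        Node n1 = new Node(List.of(new Token("", "SELECT", true), new Token(" ", "a", false)));
        Node n2 = new Node(List.of(new Token("", "select", true), new Token("  /* c */ ", "a", false)));
        System.out.println(n1.equals(n2));                         // false: text differs
        System.out.println(n1.semanticEquals(n2));                 // true
        System.out.println(n1.semanticHash() == n2.semanticHash()); // true
    }
}
```

Deriving all three methods from one normalized key keeps them mutually consistent, which is what makes the hash usable for structure-keyed caches.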

Identity test methodology

Lossless roundtripping isn't just a design goal. It's continuously verified against the largest known SQL test corpus:

  • 177,197+ test statements from 34+ sources.
  • Sources include PostgreSQL's pg_regress suite, the DuckDB test suite, Apache Spark tests, Apache Calcite, Trino, CockroachDB, ZetaSQL, ANTLR grammars-v4, SQLGlot, sqlfluff, ShardingSphere, sqlparser-rs, and more.
  • Every statement is parsed, rendered, and re-parsed; the rendered text must be byte-identical to the original input.
  • Automated regression. CI runs all identity tests on every commit across all 15 dialects.

These statements are not synthetic. The corpus includes PostgreSQL's own pg_regress suite (50,807 statements), DuckDB's test suite (36,325 statements), and SQL from 32+ other real-world sources.
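The core of an identity test is small enough to sketch. The harness below is hypothetical: roundTrip stands in for "parse, then render", since the real parser API is not shown here, and the demo plugs in a deliberately lossy function to show what a failure looks like.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

final class IdentityHarness {
    record Report(int total, List<String> failures) {
        double passRate() {
            return total == 0 ? 1.0 : (total - failures.size()) / (double) total;
        }
    }

    static Report run(List<String> corpus, UnaryOperator<String> roundTrip) {
        List<String> failures = new ArrayList<>();
        for (String sql : corpus) {
            String rendered;
            try {
                rendered = roundTrip.apply(sql);
            } catch (RuntimeException e) {
                failures.add(sql); // a crash counts as a failed identity test
                continue;
            }
            // Exact string equality: any change in whitespace, casing, or comments fails.
            if (!rendered.equals(sql)) failures.add(sql);
        }
        return new Report(corpus.size(), List.copyOf(failures));
    }

    public static void main(String[] args) {
        List<String> corpus = List.of("SELECT 1", "select  a -- c\nfrom t", "SELECT\t*\nFROM t");
        // Stand-in for a lossy parser: collapses every whitespace run to one space.
        UnaryOperator<String> lossy = s -> s.replaceAll("\\s+", " ");
        Report r = run(corpus, lossy);
        System.out.println(r.failures().size() + " of " + r.total() + " failed"); // 2 of 3 failed
    }
}
```

Running the same harness per dialect and per source would yield pass rates of the kind quoted above.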

What this enables

  • Safe refactoring tools. Rename a column, add a filter, rewrite a subquery — the rest of the file stays exactly as the author wrote it.
  • Diff-friendly transformations. Changes show up as minimal, meaningful diffs instead of wholesale reformatting.
  • Migration tooling. Convert SQL between dialects while preserving original style and comments.
  • Code review. Generated SQL changes are reviewable because only the intended modifications appear.
  • Formatters. The SQL formatter builds on lossless roundtripping to offer adaptive formatting that respects author intent.