Typed AST & Lossless Roundtripping

The parser produces a fully-typed, immutable abstract syntax tree where every SQL construct has its own dedicated interface type. Combined with lossless roundtripping, this makes the AST safe to use for refactoring, migration, formatting, and any transformation that needs to touch SQL without breaking it.

Typed Interface Hierarchy

The AST consists of 14,782 distinct typed interfaces across all 15 dialects. Every SQL construct — from a simple column reference to a complex window function — has its own dedicated type.

There are no stringly-typed generic nodes. A SELECT statement is a different type from an INSERT statement. A CASE expression is a different type from a function call. This means:

  • Dedicated types everywhere -- Expr.Add is different from Expr.Multiply, TableRef.Simple is different from TableRef.Subquery
  • No casting -- you always know what you're working with
  • IDE support -- auto-complete, go-to-definition, find-usages all work naturally
  • Pattern matching -- instanceof checks against the full type hierarchy

All AST nodes are immutable Java records with wither methods for functional transformation. They're safe to share across threads, cache indefinitely, and use as hash map keys. Optional<T> everywhere -- no nulls, ever.
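The record-plus-wither pattern described above can be sketched as follows. These node shapes are invented for illustration -- they are not the library's actual definitions -- but they show the idiom: immutable records, Optional instead of null, and a wither that returns a modified copy.

```java
import java.util.Optional;

public class WitherDemo {
    record Identifier(String name) {}

    record SelectStatement(Identifier from, Optional<Identifier> whereColumn) {
        // A wither returns a modified copy; the original node is untouched,
        // so it stays safe to share across threads or cache indefinitely.
        SelectStatement withFrom(Identifier newFrom) {
            return new SelectStatement(newFrom, whereColumn);
        }
    }

    public static void main(String[] args) {
        var original = new SelectStatement(new Identifier("t"), Optional.empty());
        var renamed  = original.withFrom(new Identifier("t_archive"));
        System.out.println(original.from().name()); // t
        System.out.println(renamed.from().name());  // t_archive
    }
}
```

Because every field is final and no field is ever null, a transformation is just a chain of wither calls producing new trees that structurally share everything that did not change.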

Example

Parsing SELECT a + 1 FROM t WHERE x > 0 produces:

SelectStatement
├── selectItems: [Expr.Add(Expr.Identifier("a"), Expr.IntLiteral(1))]
├── from: [TableRef.Simple("t")]
└── where: Expr.GreaterThan(Expr.Identifier("x"), Expr.IntLiteral(0))

Every node has a dedicated type. This is fundamentally different from parsers that produce generic "expression" nodes with string-typed operators. With generic nodes, a missing case is only discovered at runtime. With typed interfaces, the type system catches it.
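The exhaustiveness argument can be made concrete with a minimal sealed hierarchy in the spirit of the example tree above. The names mirror the docs (Expr.Add, Identifier, IntLiteral), but the definitions are illustrative, not the library's actual interfaces.

```java
public class TypedAstDemo {
    sealed interface Expr permits Add, GreaterThan, Identifier, IntLiteral {}
    record Add(Expr left, Expr right) implements Expr {}
    record GreaterThan(Expr left, Expr right) implements Expr {}
    record Identifier(String name) implements Expr {}
    record IntLiteral(long value) implements Expr {}

    // Because Expr is sealed, this switch must cover every variant.
    // Adding a new node type turns a missing case into a compile error,
    // not a runtime surprise -- unlike string-typed operator fields.
    static String render(Expr e) {
        return switch (e) {
            case Add a         -> render(a.left()) + " + " + render(a.right());
            case GreaterThan g -> render(g.left()) + " > " + render(g.right());
            case Identifier i  -> i.name();
            case IntLiteral l  -> Long.toString(l.value());
        };
    }

    public static void main(String[] args) {
        // The WHERE clause from the example: x > 0
        Expr where = new GreaterThan(new Identifier("x"), new IntLiteral(0));
        System.out.println(render(where)); // x > 0
    }
}
```

The switch needs no default branch: the compiler knows the complete set of variants, which is exactly what generic "expression node with a string operator" designs give up.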

Error Recovery at Every Node

Every AST interface has a ParseErr variant. When the parser encounters a syntax error, it inserts a ParseErr node and continues parsing. The rest of the AST retains full structure. This is critical for IDE use cases -- you need to analyze partially-written SQL, and the valid portions must still have complete type information.
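A hedged sketch of what per-node recovery looks like to analysis code. The ParseErr shape here is invented for the demo -- assume only that, as the docs state, every interface has such a variant and it carries the unparseable span while siblings keep full structure.

```java
import java.util.List;

public class ErrorRecoveryDemo {
    sealed interface Expr permits Identifier, ParseErr {}
    record Identifier(String name) implements Expr {}
    record ParseErr(String rawText) implements Expr {} // illustrative shape

    // Analysis can still walk the valid siblings of a broken node;
    // the ParseErr simply marks the span that failed to parse.
    static List<String> columnNames(List<Expr> selectItems) {
        return selectItems.stream()
            .filter(e -> e instanceof Identifier)
            .map(e -> ((Identifier) e).name())
            .toList();
    }

    public static void main(String[] args) {
        // Imagine parsing "SELECT a, !!, b FROM t" -- the middle item fails,
        // but the first and last select items are fully typed.
        var items = List.of(new Identifier("a"), new ParseErr("!!"),
                            new Identifier("b"));
        System.out.println(columnNames(items)); // [a, b]
    }
}
```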

Lossless Roundtripping

Every token in the original SQL is preserved in the AST -- not just the semantically meaningful parts, but also:

  • Whitespace between tokens
  • Comments (line and block), attached to the nearest token
  • Keyword spelling -- SELECT vs select vs Select
  • Keyword and operator aliases -- INT vs INTEGER, != vs <>
  • Unnecessary parentheses -- preserved exactly as written
  • Quote style -- double quotes, backticks, square brackets

Parse a file, render it back, and you get byte-identical output. This is verified by 170,686+ identity tests drawn from 33+ real-world SQL sources, achieving 99.7%+ pass rates across all 15 dialects.

This isn't a feature we bolted on. It falls out of the grammar architecture -- every token is described in the grammar, so every token ends up in the AST. Hand-written parsers rarely achieve this because they tend to take pragmatic shortcuts: skipping whitespace here, normalizing casing there.
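The mechanism behind losslessness can be illustrated with a toy model: if every token carries the exact trivia (whitespace, comments) that preceded it, rendering is plain concatenation and nothing can be lost. The Token type here is invented for the demo, not the library's token representation.

```java
import java.util.List;

public class RoundtripDemo {
    // Each token keeps the verbatim trivia that appeared before it.
    record Token(String leadingTrivia, String text) {}

    // Rendering concatenates trivia and text -- byte-identical by construction.
    static String render(List<Token> tokens) {
        var sb = new StringBuilder();
        for (var t : tokens) sb.append(t.leadingTrivia()).append(t.text());
        return sb.toString();
    }

    public static void main(String[] args) {
        // Original input: "select  A /*hi*/ FROM t"
        // (mixed keyword casing, double space, inline block comment)
        var tokens = List.of(
            new Token("", "select"),
            new Token("  ", "A"),
            new Token(" /*hi*/ ", "FROM"),
            new Token(" ", "t"));
        System.out.println(render(tokens).equals("select  A /*hi*/ FROM t")); // true
    }
}
```

A parser that skips whitespace during lexing can never reproduce the double space or the comment above; one that records trivia on every token cannot fail to.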

Semantic Comparison

Sometimes you want to compare SQL by structure rather than by text. Every AST node provides:

  • semanticEquals() -- structural equality ignoring whitespace, comments, and keyword casing
  • semanticHash() -- consistent hash code for semantic equality
  • semanticCompareTo() -- deterministic ordering for canonical forms

This enables use cases like detecting duplicate queries, finding semantically equivalent but differently formatted SQL, and building query caches keyed by structure rather than text.
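A minimal sketch of what "structural equality ignoring surface details" means for a single node. The method names follow the docs, but the node type and the bodies are invented: here two identifiers compare equal case-insensitively, and the hash is made consistent with that equality so nodes can key a structure-based cache.

```java
import java.util.Locale;

public class SemanticCompareDemo {
    record Identifier(String name) {
        // Ignores keyword/identifier casing; real nodes would also ignore
        // attached whitespace and comments.
        boolean semanticEquals(Identifier other) {
            return name.equalsIgnoreCase(other.name);
        }
        // Must agree with semanticEquals: equal nodes hash equally.
        int semanticHash() {
            return name.toLowerCase(Locale.ROOT).hashCode();
        }
    }

    public static void main(String[] args) {
        var a = new Identifier("CustomerId");
        var b = new Identifier("customerid");
        System.out.println(a.semanticEquals(b));                  // true
        System.out.println(a.semanticHash() == b.semanticHash()); // true
    }
}
```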

Identity Test Methodology

Lossless roundtripping isn't just a design goal -- it's continuously verified against the largest known SQL test corpus:

  • 170,686+ test statements from 33+ sources
  • Sources include: PostgreSQL's pg_regress suite, DuckDB test suite, Apache Spark tests, Apache Calcite, Trino, CockroachDB, ZetaSQL, ANTLR grammars-v4, SQLGlot, sqlfluff, ShardingSphere, sqlparser-rs, and more
  • Every statement is parsed, rendered, and re-parsed -- the output must be byte-identical
  • Automated regression -- CI runs all identity tests on every commit across all 15 dialects

These are not synthetic tests. The corpus includes PostgreSQL's own pg_regress suite (50,806 statements), DuckDB's test suite (36,325 statements), and SQL from 31+ other real-world sources.

What This Enables

  • Safe refactoring tools -- rename a column, add a filter, rewrite a subquery, and the rest of the file stays exactly as the author wrote it
  • Diff-friendly transformations -- changes show up as minimal, meaningful diffs rather than wholesale reformatting
  • Migration tooling -- convert SQL between dialects while preserving the original style and comments
  • Code review -- generated SQL changes are reviewable because only the intended modifications appear
  • Formatters -- the SQL formatter builds on lossless roundtripping to offer adaptive formatting that respects author intent