Skip to main content

Query optimization

The optimizer transforms SQL ASTs through a multi-pass pipeline. It resolves every reference, qualifies every column, expands every SELECT *, and simplifies expressions, while preserving semantics and lossless roundtripping.

Why this matters

The optimizer isn't primarily about query performance. It's the foundation that makes lineage, type inference, and every downstream analysis possible.

Before you can trace column lineage, every column has to be qualified to its source table. Before you can infer types, SELECT * has to be expanded to concrete columns. Before you can detect unused joins or dead columns, the scope graph has to be fully resolved.

For data products: if your product offers column-level lineage, impact analysis, dead column detection, or schema validation, the optimizer is what makes those features accurate. Without column qualification and star expansion, lineage tools can only give you table-level results.

For query authors: the optimizer can show the canonical form of any query — fully qualified, simplified, with unnecessary subqueries eliminated. Handy for understanding complex SQL, finding redundancies, and checking that a rewrite is semantically equivalent.

Optimization passes

PassWhat it does
Normalize identifiersCase-normalizes identifiers per dialect conventions (PostgreSQL folds to lowercase, Oracle to uppercase)
Qualify tablesAdds self-aliases to table references (FROM users becomes FROM users AS users)
Isolate table selectsWraps bare table references so downstream passes can push projections into them
Qualify columnsResolves every column reference to its source table, expands SELECT * to explicit columns, adds table qualifiers
Push down projectionsRemoves unused columns from inner queries. If an outer query uses 3 of 20 columns, the inner query's SELECT list is trimmed accordingly.
Normalize predicatesConverts WHERE conditions to conjunctive normal form (AND of ORs) for consistent processing
Unnest subqueriesConverts correlated subqueries to equivalent JOINs where possible
Push down predicatesMoves WHERE conditions from outer queries into subqueries and derived tables
Optimize joinsReorders and restructures JOIN conditions for better execution
Eliminate subqueriesConverts duplicate derived tables into CTEs to avoid repeated computation
Merge subqueriesInlines single-use CTEs and simple derived tables, reducing query nesting
Eliminate joinsRemoves JOINs that provably don't affect output (LEFT JOINs where no columns from the right table are used)
Eliminate CTEsRemoves CTEs that are no longer referenced after earlier passes
Annotate typesRuns type inference to assign a data type to every expression
Canonicalize and simplifyNormalizes expressions to standard forms, folds constants, applies boolean algebra, removes tautologies

Each pass is a pure function from immutable AST to immutable AST. Composable, testable, dialect-aware.

Scope resolution

The optimizer builds a scope graph for every level of the query:

  • Root scope: the outermost SELECT
  • Subquery scopes: scalar subqueries, IN/EXISTS subqueries
  • Derived table scopes: inline views in FROM clauses
  • CTE scopes: WITH clause definitions
  • UNION scopes: each branch of a set operation
  • Pipe step scopes: BigQuery pipe syntax operators

Every column reference resolves to a specific source table and column, with fully qualified names. That resolution is what enables accurate lineage, type inference, and safe optimization.

Properties

  • Type-safe. Operates on immutable, typed ASTs with dedicated interface types.
  • Lossless. Roundtripping is preserved through all transformations; optimized SQL still renders back to valid, readable text.
  • Scope-aware. Every pass that needs to understand column references goes through the scope system.
  • Dialect-aware. Identifier normalization, quoting rules, and function resolution respect the target dialect.