Skip to main content

Query Optimization

The optimizer transforms SQL ASTs through a multi-pass pipeline that resolves every reference, qualifies every column, expands every SELECT *, and simplifies expressions — all while preserving semantics and lossless roundtripping.

Why This Matters

The optimizer isn't primarily about query performance. It's the foundation that makes lineage, type inference, and every downstream analysis possible.

Before you can trace column lineage, you need every column qualified to its source table. Before you can infer types, you need SELECT * expanded to concrete columns. Before you can detect unused joins or dead columns, you need the full scope graph resolved.

For data products: if your product offers column-level lineage, impact analysis, dead column detection, or schema validation -- the optimizer is what makes those features accurate. Without full column qualification and star expansion, lineage tools can only give you table-level results.

For query authors: the optimizer can show the "canonical" form of any query -- fully qualified, simplified, with unnecessary subqueries eliminated. This is useful for understanding complex SQL, finding redundancies, and validating that a rewrite is semantically equivalent.

Optimization Passes

PassWhat It Does
Normalize identifiersCase-normalizes identifiers per dialect conventions (e.g., PostgreSQL folds to lowercase, Oracle to uppercase)
Qualify tablesAdds self-aliases to table references (FROM users becomes FROM users AS users)
Isolate table selectsWraps bare table references so downstream passes can push projections into them
Qualify columnsResolves every column reference to its source table, expands SELECT * to explicit columns, adds table qualifiers
Push down projectionsRemoves unused columns from inner queries -- if an outer query only uses 3 of 20 columns, the inner query's SELECT list is trimmed
Normalize predicatesConverts WHERE conditions to conjunctive normal form (AND of ORs) for consistent processing
Unnest subqueriesConverts correlated subqueries to equivalent JOINs where possible
Push down predicatesMoves WHERE conditions from outer queries into subqueries and derived tables
Optimize joinsReorders and restructures JOIN conditions for better execution
Eliminate subqueriesConverts duplicate derived tables into CTEs to avoid repeated computation
Merge subqueriesInlines single-use CTEs and simple derived tables, reducing query nesting
Eliminate joinsRemoves JOINs that provably don't affect output (e.g., LEFT JOINs where no columns from the right table are used)
Eliminate CTEsRemoves CTEs that are no longer referenced after other optimization passes
Annotate typesRuns type inference to determine the data type of every expression
Canonicalize and simplifyNormalizes expressions to standard forms, folds constants, applies boolean algebra, removes tautologies

Each pass is a pure function from immutable AST to immutable AST. Composable, testable, dialect-aware.

Scope Resolution

The optimizer builds a scope graph representing every level of the query:

  • Root scope -- the outermost SELECT
  • Subquery scopes -- scalar subqueries, IN/EXISTS subqueries
  • Derived table scopes -- inline views in FROM clauses
  • CTE scopes -- WITH clause definitions
  • UNION scopes -- each branch of a set operation
  • Pipe step scopes -- BigQuery pipe syntax operators

Every column reference is resolved to a specific source table and column, with fully qualified names. This resolution is what enables accurate lineage, type inference, and safe optimization.

Properties

  • Type-safe -- operates on immutable, typed ASTs with dedicated interface types
  • Lossless -- roundtripping is preserved through all transformations; optimized SQL can still be rendered back to valid, readable text
  • Scope-aware -- every pass that needs to understand column references works through the scope system
  • Dialect-aware -- identifier normalization, quoting rules, and function resolution respect the target dialect