Query optimization

The optimizer transforms SQL ASTs through a multi-pass pipeline. It resolves every reference, qualifies every column, expands every SELECT *, and simplifies expressions, while preserving semantics and lossless roundtripping.

Why this matters

The optimizer isn't primarily about query performance. It's the foundation that makes lineage, type inference, and every downstream analysis possible.

Before you can trace column lineage, every column has to be qualified to its source table. Before you can infer types, SELECT * has to be expanded to concrete columns. Before you can detect unused joins or dead columns, the scope graph has to be fully resolved.

For data products: if your product offers column-level lineage, impact analysis, dead column detection, or schema validation, the optimizer is what makes those features accurate. Without column qualification and star expansion, lineage tools can only give you table-level results.

For query authors: the optimizer can show the canonical form of any query — fully qualified, simplified, with unnecessary subqueries eliminated. Handy for understanding complex SQL, finding redundancies, and checking that a rewrite is semantically equivalent.

Optimization passes

Pass	What it does
Normalize identifiers	Case-normalizes identifiers per dialect conventions (PostgreSQL folds to lowercase, Oracle to uppercase)
Qualify tables	Adds self-aliases to table references (`FROM users` becomes `FROM users AS users`)
Isolate table selects	Wraps bare table references so downstream passes can push projections into them
Qualify columns	Resolves every column reference to its source table, expands `SELECT *` to explicit columns, adds table qualifiers
Push down projections	Removes unused columns from inner queries. If an outer query uses 3 of 20 columns, the inner query's SELECT list is trimmed accordingly.
Normalize predicates	Converts WHERE conditions to conjunctive normal form (AND of ORs) for consistent processing
Unnest subqueries	Converts correlated subqueries to equivalent JOINs where possible
Push down predicates	Moves WHERE conditions from outer queries into subqueries and derived tables
Optimize joins	Reorders and restructures JOIN conditions for better execution
Eliminate subqueries	Converts duplicate derived tables into CTEs to avoid repeated computation
Merge subqueries	Inlines single-use CTEs and simple derived tables, reducing query nesting
Eliminate joins	Removes JOINs that provably don't affect output (LEFT JOINs where no columns from the right table are used)
Eliminate CTEs	Removes CTEs that are no longer referenced after earlier passes
Annotate types	Runs type inference to assign a data type to every expression
Canonicalize and simplify	Normalizes expressions to standard forms, folds constants, applies boolean algebra, removes tautologies

Each pass is a pure function from immutable AST to immutable AST. Composable, testable, dialect-aware.

Scope resolution

The optimizer builds a scope graph for every level of the query:

Root scope: the outermost SELECT
Subquery scopes: scalar subqueries, IN/EXISTS subqueries
Derived table scopes: inline views in FROM clauses
CTE scopes: WITH clause definitions
UNION scopes: each branch of a set operation
Pipe step scopes: BigQuery pipe syntax operators

Every column reference resolves to a specific source table and column, with fully qualified names. That resolution is what enables accurate lineage, type inference, and safe optimization.

Properties

Type-safe. Operates on immutable, typed ASTs with dedicated interface types.
Lossless. Roundtripping is preserved through all transformations; optimized SQL still renders back to valid, readable text.
Scope-aware. Every pass that needs to understand column references goes through the scope system.
Dialect-aware. Identifier normalization, quoting rules, and function resolution respect the target dialect.

Why this matters​

Optimization passes​

Scope resolution​

Properties​

Why this matters

Optimization passes

Scope resolution

Properties