Query optimization
The optimizer transforms SQL ASTs through a multi-pass pipeline. It resolves every reference, qualifies every column, expands every SELECT *, and simplifies expressions, while preserving semantics and lossless roundtripping.
Why this matters
The optimizer isn't primarily about query performance. It's the foundation that makes lineage, type inference, and every downstream analysis possible.
Before you can trace column lineage, every column has to be qualified to its source table. Before you can infer types, SELECT * has to be expanded to concrete columns. Before you can detect unused joins or dead columns, the scope graph has to be fully resolved.
For data products: if your product offers column-level lineage, impact analysis, dead column detection, or schema validation, the optimizer is what makes those features accurate. Without column qualification and star expansion, lineage tools can only give you table-level results.
For query authors: the optimizer can show the canonical form of any query — fully qualified, simplified, with unnecessary subqueries eliminated. Handy for understanding complex SQL, finding redundancies, and checking that a rewrite is semantically equivalent.
Optimization passes
| Pass | What it does |
|---|---|
| Normalize identifiers | Case-normalizes identifiers per dialect conventions (PostgreSQL folds to lowercase, Oracle to uppercase) |
| Qualify tables | Adds self-aliases to table references (FROM users becomes FROM users AS users) |
| Isolate table selects | Wraps bare table references so downstream passes can push projections into them |
| Qualify columns | Resolves every column reference to its source table, expands SELECT * to explicit columns, adds table qualifiers |
| Push down projections | Removes unused columns from inner queries. If an outer query uses 3 of 20 columns, the inner query's SELECT list is trimmed accordingly. |
| Normalize predicates | Converts WHERE conditions to conjunctive normal form (AND of ORs) for consistent processing |
| Unnest subqueries | Converts correlated subqueries to equivalent JOINs where possible |
| Push down predicates | Moves WHERE conditions from outer queries into subqueries and derived tables |
| Optimize joins | Reorders and restructures JOIN conditions for better execution |
| Eliminate subqueries | Converts duplicate derived tables into CTEs to avoid repeated computation |
| Merge subqueries | Inlines single-use CTEs and simple derived tables, reducing query nesting |
| Eliminate joins | Removes JOINs that provably don't affect output (LEFT JOINs where no columns from the right table are used) |
| Eliminate CTEs | Removes CTEs that are no longer referenced after earlier passes |
| Annotate types | Runs type inference to assign a data type to every expression |
| Canonicalize and simplify | Normalizes expressions to standard forms, folds constants, applies boolean algebra, removes tautologies |
Each pass is a pure function from immutable AST to immutable AST. Composable, testable, dialect-aware.
Scope resolution
The optimizer builds a scope graph for every level of the query:
- Root scope: the outermost SELECT
- Subquery scopes: scalar subqueries, IN/EXISTS subqueries
- Derived table scopes: inline views in FROM clauses
- CTE scopes: WITH clause definitions
- UNION scopes: each branch of a set operation
- Pipe step scopes: BigQuery pipe syntax operators
Every column reference resolves to a specific source table and column, with fully qualified names. That resolution is what enables accurate lineage, type inference, and safe optimization.
Properties
- Type-safe. Operates on immutable, typed ASTs with dedicated interface types.
- Lossless. Roundtripping is preserved through all transformations; optimized SQL still renders back to valid, readable text.
- Scope-aware. Every pass that needs to understand column references goes through the scope system.
- Dialect-aware. Identifier normalization, quoting rules, and function resolution respect the target dialect.