Column Lineage

Column lineage answers one question: where does each output column's data come from? Given a SQL query and a schema, the engine traces every output column backwards through the query, all the way down to base table columns.

What it traces

For each output column, lineage reports:

Data origins. Which base table columns feed this output, through however many layers of CTEs, subqueries, and joins.
Filter dependencies. Which columns appear in WHERE and HAVING clauses that constrain this column's rows.
Join dependencies. Which columns appear in JOIN conditions connecting the tables that produce this column.
Aggregate dependencies. Which columns are passed through aggregate functions (SUM, COUNT, and so on).
GROUP BY dependencies. Which columns appear in GROUP BY clauses affecting this column.
Direct-copy detection. Whether the column is passed through unchanged or transformed by an expression.

SQL constructs handled

The lineage engine handles the full range of SQL complexity:

CTEs, including recursive CTEs with cycle detection
Subqueries: scalar, IN/EXISTS, and correlated
Derived tables (inline views in FROM)
JOINs, including LATERAL
Set operations: UNION, INTERSECT, EXCEPT (traced by ordinal position)
UNNEST for array expansion (BigQuery, Presto)
PIVOT and UNPIVOT
Pipe syntax (BigQuery's pipe operator chains)
Star expansion: SELECT * resolved to individual columns, with lineage per column
Cross-view lineage when the schema includes view definitions

Try it in the playground. The ANSI playground includes lineage examples. Switch to the Lineage tab for an interactive column lineage graph.

Loading lineage visualization...

Example

Given this query:

WITH monthly AS (
  SELECT customer_id,
         DATE_TRUNC('month', order_date) AS month,
         SUM(amount) AS total
  FROM orders
  GROUP BY customer_id, DATE_TRUNC('month', order_date)
)
SELECT c.name,
       m.month,
       m.total,
       m.total / 12.0 AS monthly_avg
FROM customers c
JOIN monthly m ON c.id = m.customer_id
WHERE m.total > 1000

The lineage engine traces each output column:

Output Column	Data Origins	Filter Dependencies	Join Dependencies
`name`	`customers.name`	`orders.amount` (via total > 1000)	`customers.id`, `orders.customer_id`
`month`	`orders.order_date`	`orders.amount` (via total > 1000)	`customers.id`, `orders.customer_id`
`total`	`orders.amount`	`orders.amount` (via total > 1000)	`customers.id`, `orders.customer_id`
`monthly_avg`	`orders.amount`	`orders.amount` (via total > 1000)	`customers.id`, `orders.customer_id`

All of this comes from a single pass through the CTE, the JOIN, the WHERE filter, and the aggregation. total resolves back through SUM(amount) to orders.amount.

Use cases

Data governance: "Which base table columns feed the total_spent metric?"
Impact analysis: "If we rename users.email, which dashboards break?"
Data quality: "Is order_count a direct pass-through, or does it go through a CASE expression?"
Audit: "What joins and filters constrain the data in this report column?"
Migration: "Which source tables does this query actually depend on?"

Performance

The lineage engine is a single-pass algorithm with memoization. Each (scope, column) pair is resolved once and the result is shared across every reference, producing an immutable DAG of column origins.

That's a lot faster than re-qualifying and re-traversing the query per output column. On complex queries with shared CTEs or subqueries, the gap can be orders of magnitude.

dbt integration

Through the dbt pipeline, lineage follows ref() and source() automatically. Each model's output columns get full provenance, from the final SELECT back through every intermediate model to the raw source tables. This is exercised across the 59-project, 9,925-model test suite.

What it traces​

SQL constructs handled​

Example​

Use cases​

Performance​

dbt integration​