Skip to main content

Column Lineage

Column lineage answers: "Where does each output column's data come from?" Given a SQL query and a schema, it traces every output column backwards through the full query structure -- all the way down to base table columns.

What It Traces

For each output column, lineage provides:

  • Data origins -- which base table columns feed into this output, through however many layers of CTEs, subqueries, and joins
  • Filter dependencies -- which columns appear in WHERE and HAVING clauses that constrain this column's rows
  • Join dependencies -- which columns appear in JOIN conditions that connect the tables producing this column
  • Aggregate dependencies -- which columns are passed through aggregate functions (SUM, COUNT, etc.)
  • GROUP BY dependencies -- which columns appear in GROUP BY clauses affecting this column
  • Direct-copy detection -- whether the column is passed through unchanged or transformed by an expression

SQL Constructs Handled

The lineage engine handles the full range of SQL complexity:

  • CTEs -- WITH clauses, including recursive CTEs with cycle detection
  • Subqueries -- scalar subqueries, IN/EXISTS subqueries, correlated subqueries
  • Derived tables -- inline views in the FROM clause
  • JOINs -- all join types including LATERAL
  • Set operations -- UNION, INTERSECT, EXCEPT (traced by ordinal position)
  • UNNEST -- array expansion (BigQuery, Presto)
  • PIVOT / UNPIVOT -- value rotation
  • Pipe syntax -- BigQuery's pipe operator chains
  • Star expansion -- SELECT * resolved to individual columns with full lineage per column
  • Cross-view lineage -- traces through view definitions when schema includes them

Try it in the playground -- the ANSI playground includes lineage examples. Switch to the Lineage tab to see an interactive column lineage graph.

Loading lineage visualization...

Example

Given this query:

WITH monthly AS (
SELECT customer_id,
DATE_TRUNC('month', order_date) AS month,
SUM(amount) AS total
FROM orders
GROUP BY customer_id, DATE_TRUNC('month', order_date)
)
SELECT c.name,
m.month,
m.total,
m.total / 12.0 AS monthly_avg
FROM customers c
JOIN monthly m ON c.id = m.customer_id
WHERE m.total > 1000

The lineage engine traces each output column:

Output ColumnData OriginsFilter DependenciesJoin Dependencies
namecustomers.nameorders.amount (via total > 1000)customers.id, orders.customer_id
monthorders.order_dateorders.amount (via total > 1000)customers.id, orders.customer_id
totalorders.amountorders.amount (via total > 1000)customers.id, orders.customer_id
monthly_avgorders.amountorders.amount (via total > 1000)customers.id, orders.customer_id

This is traced in a single pass through the CTE, the JOIN, the WHERE filter, and the aggregation -- resolving total back through SUM(amount) to orders.amount.

Use Cases

  • Data governance -- "Which base table columns feed the total_spent metric?"
  • Impact analysis -- "If we rename users.email, which dashboards break?"
  • Data quality -- "Is order_count a direct pass-through, or does it go through a CASE expression?"
  • Audit -- "What joins and filters constrain the data in this report column?"
  • Migration -- "Which source tables does this query actually depend on?"

Performance

The lineage engine uses a single-pass algorithm with memoization. Each (scope, column) pair is resolved once and the result is shared across all references. This produces an immutable DAG (directed acyclic graph) of column origins.

This is significantly faster than approaches that re-qualify and re-traverse the query per output column. For complex queries with many shared CTEs or subqueries, the difference can be orders of magnitude.

Integration with dbt

When used through the dbt pipeline, lineage automatically traces through ref() and source() dependencies. Each model's output columns are resolved with full provenance -- from the final SELECT back through all intermediate models to the raw source tables. This works across the full 59-project, 9,925-model test suite.