Column Lineage
Column lineage answers one question: where does each output column's data come from? Given a SQL query and a schema, the engine traces every output column backwards through the query, all the way down to base table columns.
What it traces
For each output column, lineage reports:
- Data origins. Which base table columns feed this output, through however many layers of CTEs, subqueries, and joins.
- Filter dependencies. Which columns appear in WHERE and HAVING clauses that constrain this column's rows.
- Join dependencies. Which columns appear in JOIN conditions connecting the tables that produce this column.
- Aggregate dependencies. Which columns are passed through aggregate functions (SUM, COUNT, and so on).
- GROUP BY dependencies. Which columns appear in GROUP BY clauses affecting this column.
- Direct-copy detection. Whether the column is passed through unchanged or transformed by an expression.
SQL constructs handled
The lineage engine handles the full range of SQL complexity:
- CTEs, including recursive CTEs with cycle detection
- Subqueries: scalar, IN/EXISTS, and correlated
- Derived tables (inline views in FROM)
- JOINs, including LATERAL
- Set operations: UNION, INTERSECT, EXCEPT (traced by ordinal position)
- UNNEST for array expansion (BigQuery, Presto)
- PIVOT and UNPIVOT
- Pipe syntax (BigQuery's pipe operator chains)
- Star expansion:
SELECT *resolved to individual columns, with lineage per column - Cross-view lineage when the schema includes view definitions
Try it in the playground. The ANSI playground includes lineage examples. Switch to the Lineage tab for an interactive column lineage graph.
Example
Given this query:
WITH monthly AS (
SELECT customer_id,
DATE_TRUNC('month', order_date) AS month,
SUM(amount) AS total
FROM orders
GROUP BY customer_id, DATE_TRUNC('month', order_date)
)
SELECT c.name,
m.month,
m.total,
m.total / 12.0 AS monthly_avg
FROM customers c
JOIN monthly m ON c.id = m.customer_id
WHERE m.total > 1000
The lineage engine traces each output column:
| Output Column | Data Origins | Filter Dependencies | Join Dependencies |
|---|---|---|---|
name | customers.name | orders.amount (via total > 1000) | customers.id, orders.customer_id |
month | orders.order_date | orders.amount (via total > 1000) | customers.id, orders.customer_id |
total | orders.amount | orders.amount (via total > 1000) | customers.id, orders.customer_id |
monthly_avg | orders.amount | orders.amount (via total > 1000) | customers.id, orders.customer_id |
All of this comes from a single pass through the CTE, the JOIN, the WHERE filter, and the aggregation. total resolves back through SUM(amount) to orders.amount.
Use cases
- Data governance: "Which base table columns feed the
total_spentmetric?" - Impact analysis: "If we rename
users.email, which dashboards break?" - Data quality: "Is
order_counta direct pass-through, or does it go through a CASE expression?" - Audit: "What joins and filters constrain the data in this report column?"
- Migration: "Which source tables does this query actually depend on?"
Performance
The lineage engine is a single-pass algorithm with memoization. Each (scope, column) pair is resolved once and the result is shared across every reference, producing an immutable DAG of column origins.
That's a lot faster than re-qualifying and re-traversing the query per output column. On complex queries with shared CTEs or subqueries, the gap can be orders of magnitude.
dbt integration
Through the dbt pipeline, lineage follows ref() and source() automatically. Each model's output columns get full provenance, from the final SELECT back through every intermediate model to the raw source tables. This is exercised across the 59-project, 9,925-model test suite.