Built for LLMs and Code Agents
LLMs generate SQL. Code agents modify SQL. Both need fast, structured feedback to know whether what they produced is correct. Datoria provides the tight feedback loop that turns "generate and hope" into "generate and verify."
The Problem
When an LLM generates a SQL query, you typically have two options: run it against a database (slow, expensive, requires credentials) or trust the output (risky). There's no middle ground -- no way to structurally validate the SQL, check types, trace lineage, or verify correctness without execution.
Code agents face the same problem at higher stakes. An agent modifying a dbt project needs to understand the impact of its changes across the model DAG. Without structural understanding, every edit is a guess.
What Datoria Provides
Instant Structural Validation
Parse generated SQL in 56 microseconds and know immediately:
- Is the SQL syntactically valid for the target dialect?
- If not, where exactly are the errors? (precise positions, not just "syntax error")
- Does it use constructs that exist in the target dialect? (no BigQuery STRUCT in a PostgreSQL query)
With error recovery, even partially broken SQL yields a useful partial AST -- the agent can see which parts are valid and which need fixing.
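To make the shape of that feedback concrete, here is a toy sketch (not Datoria's API) of a recoverable check that reports precise positions rather than a bare "syntax error". It validates a single simple rule, balanced parentheses, and returns diagnostics an agent could act on:

```python
# Toy sketch of recoverable validation -- illustrative only, not
# Datoria's implementation. The point is the output shape: precise
# (offset, message) diagnostics instead of a bare "syntax error".

def diagnostics(sql: str) -> list[tuple[int, str]]:
    """Return (offset, message) pairs; an empty list means the check passed."""
    errors, stack = [], []
    for i, ch in enumerate(sql):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if stack:
                stack.pop()
            else:
                errors.append((i, "unmatched ')'"))
    errors.extend((i, "unclosed '('") for i in stack)
    return errors

print(diagnostics("SELECT SUM(amount FROM orders"))
# [(10, "unclosed '('")] -- the agent knows exactly where to fix
```

A real parser with error recovery goes much further (it also returns the partial AST), but the agent-facing contract is the same: structured diagnostics with positions.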
Type Checking Without a Database
Type inference resolves the data type of every expression, column, and function call -- without connecting to a warehouse:
- Does SUM(string_column) make sense? The type system catches it.
- Does a UNION have compatible branch types? Verified statically.
- Does an INSERT's value list match the target column types? Checked.
This gives LLMs and agents compile-time confidence about generated SQL correctness.
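The checks above can be pictured with a miniature example. This is a hand-rolled sketch in the spirit of those checks, not Datoria's type system; the schema dict stands in for catalog metadata that Datoria would resolve without a warehouse connection:

```python
# Illustrative miniature type check -- not Datoria's API. A schema
# dict plays the role of catalog metadata; no warehouse is involved.

SCHEMA = {"orders": {"amount": "NUMERIC", "status": "STRING"}}

def infer_sum(table: str, column: str) -> str:
    """Type-check SUM(column) against the schema; raise on a type error."""
    col_type = SCHEMA[table][column]
    if col_type not in ("NUMERIC", "INT64", "FLOAT64"):
        raise TypeError(f"SUM({column}): expected a numeric type, got {col_type}")
    return col_type

infer_sum("orders", "amount")   # "NUMERIC" -- passes
# infer_sum("orders", "status") raises:
# TypeError: SUM(status): expected a numeric type, got STRING
```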
Column Lineage as Context
An agent modifying a dbt model needs to understand what it's affecting. Datoria's column lineage answers:
- "What does this model depend on?" -- every source column, through every CTE and JOIN
- "What breaks if I change this column?" -- reverse lineage across the model DAG
- "What does SELECT * actually expand to?" -- concrete column lists with types
This is exactly the context an agent needs to make safe, targeted modifications. Instead of feeding the agent an entire project, you can give it precise dependency information: "this column comes from stg_orders.amount through a SUM aggregation, and is consumed by 3 downstream models."
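Reverse lineage is, at its core, reachability over a column-level dependency graph. A minimal sketch, with a hand-written edge map (Datoria derives these edges from the SQL itself; the model and column names here are invented):

```python
# Reverse lineage as graph reachability -- a sketch. The CONSUMERS
# edge map is hand-written for illustration; Datoria extracts it
# from the SQL of each model.

from collections import deque

# column -> columns that consume it (downstream edges)
CONSUMERS = {
    "stg_orders.amount": ["fct_orders.total"],
    "fct_orders.total": ["rpt_revenue.monthly_total", "rpt_margin.gross"],
}

def downstream(column: str) -> set[str]:
    """Every column affected, transitively, by a change to `column`."""
    seen, queue = set(), deque([column])
    while queue:
        for nxt in CONSUMERS.get(queue.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

print(sorted(downstream("stg_orders.amount")))
# ['fct_orders.total', 'rpt_margin.gross', 'rpt_revenue.monthly_total']
```

This is the "what breaks if I change this column?" answer in executable form: three downstream columns, found without running any SQL.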
Scope Resolution for Precise Context
The optimizer's scope resolution qualifies every column reference to its source table. For an LLM writing SQL, this means:
- You can validate that referenced columns actually exist in the referenced tables
- You can provide accurate autocomplete suggestions based on what's in scope
- You can detect ambiguous column references before they become runtime errors
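The existence and ambiguity checks can be sketched in a few lines. This is a toy resolver over invented table schemas, not Datoria's scope graph; it shows the three outcomes an LLM-facing validator needs to distinguish:

```python
# Toy scope resolution -- illustrative only. SCOPE stands in for the
# tables visible at a given point in the query.

SCOPE = {"orders": {"id", "amount"}, "customers": {"id", "name"}}

def resolve(column: str) -> str:
    """Qualify a bare column name, or raise the error it would hit at runtime."""
    owners = [t for t, cols in SCOPE.items() if column in cols]
    if not owners:
        raise NameError(f"unknown column: {column}")
    if len(owners) > 1:
        raise NameError(f"ambiguous column: {column} (in {', '.join(sorted(owners))})")
    return f"{owners[0]}.{column}"

print(resolve("amount"))  # orders.amount
# resolve("id") raises: ambiguous column: id (in customers, orders)
# resolve("total") raises: unknown column: total
```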
Formatting for Consistent Output
LLM-generated SQL is often poorly formatted -- inconsistent indentation, missing newlines, mixed casing. The formatter normalizes output to a consistent style in one pass, making generated SQL readable and review-friendly.
The LSP Connection
These capabilities map directly to what a Language Server Protocol (LSP) implementation needs:
| LSP Feature | Datoria Capability |
|---|---|
| Diagnostics (red squiggles) | Error recovery with precise positions |
| Hover (type info) | Type inference for every expression |
| Go to definition | Scope resolution and column qualification |
| Completion | Scope graph knows what's available at any position |
| Code actions (quick fixes) | AST transformation with lossless roundtripping |
| Formatting | AST-aware formatter with 27 options |
| References (find usages) | Column lineage traces all consumers |
Whether the "client" is a human in an IDE or an LLM agent, the underlying capabilities are the same. The difference is that an agent can consume them programmatically and at the speed of the parser (~0.3 ms per model for the full analysis pipeline).
Performance for Agent Loops
Agent workflows are iterative: generate → validate → fix → validate → fix. Each iteration needs to be fast enough not to bottleneck the loop.
- Parse: 56 microseconds per file
- Full analysis (parse + optimize + lineage + types): ~0.3 ms per model
- 10,000 models: under 3 seconds for the entire project
This means an agent can analyze the impact of a proposed change across an entire dbt project in seconds, not minutes. The feedback loop is tight enough for interactive agent workflows.
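The generate → validate → fix loop itself is simple to sketch. `generate` and `validate` here are stand-ins: in a real loop, `generate` calls an LLM and `validate` calls a structural validator of the kind described above.

```python
# The agent loop, sketched. `generate` and `validate` are stand-ins
# for an LLM call and a structural validator respectively.

def agent_loop(generate, validate, max_iters: int = 5) -> str:
    sql = generate(None)                 # first attempt, no feedback yet
    for _ in range(max_iters):
        errors = validate(sql)
        if not errors:
            return sql                   # verified, not hoped-for
        sql = generate(errors)           # feed diagnostics back into generation
    raise RuntimeError("could not produce valid SQL within the iteration budget")

# Usage with trivial stand-ins: the second attempt fixes the typo.
attempts = iter(["SELEC 1", "SELECT 1"])
result = agent_loop(
    lambda errs: next(attempts),
    lambda sql: [] if sql.startswith("SELECT") else ["syntax error at 0"],
)
print(result)  # SELECT 1
```

At sub-millisecond validation cost, the loop's latency is dominated entirely by generation, which is exactly where you want the time to go.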