Built for LLMs and code agents
LLMs generate SQL. Code agents modify SQL. Both need fast, structured feedback to know whether what they produced is correct. Datoria gives them a tight enough loop to turn "generate and hope" into "generate and verify."
The problem
When an LLM emits a SQL query, you usually have two options: run it against a database (slow, expensive, requires credentials) or trust the output (risky). There's nothing in the middle that checks syntax, types, and lineage without execution.
Code agents face the same problem at higher stakes. An agent modifying a dbt project has to understand the impact of its changes across the model DAG. Without structural understanding, every edit is a guess.
What Datoria provides
Instant structural validation
Parse generated SQL in 56 microseconds and know:
- Is the SQL syntactically valid for the target dialect?
- If not, where exactly are the errors? (Precise positions, not just "syntax error".)
- Does it use constructs that exist in the target dialect? (No BigQuery STRUCT in a PostgreSQL query.)
With error recovery, even partially broken SQL yields a useful partial AST. The agent can see which parts are valid and which still need fixing.
Type checking without a database
Type inference resolves the data type of every expression, column, and function call — no warehouse connection required:
- Does SUM(string_column) make sense? The type system catches it.
- Does a UNION have compatible branch types? Verified statically.
- Does an INSERT's value list match the target column types? Checked.
That gives LLMs and agents compile-time confidence about generated SQL.
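To make the idea of static checks concrete, here is a minimal, self-contained sketch in Python. It is not Datoria's API: the schema dictionary, the type names, and the `check_sum_argument` and `check_union_branches` helpers are all hypothetical, and a real type system handles coercions and many more types than this toy does.

```python
# Toy static type checks over a hand-written schema; every name here is hypothetical.
NUMERIC_TYPES = {"INT64", "FLOAT64", "NUMERIC"}

# Hypothetical column types, as a catalog or project manifest might describe them.
SCHEMA = {
    "orders": {"order_id": "INT64", "amount": "NUMERIC", "status": "STRING"},
}


def check_sum_argument(table: str, column: str) -> list[str]:
    """Return error messages if SUM(column) is applied to a non-numeric column."""
    col_type = SCHEMA[table][column]
    if col_type not in NUMERIC_TYPES:
        return [f"SUM({column}): expected a numeric argument, got {col_type}"]
    return []


def check_union_branches(left: list[str], right: list[str]) -> list[str]:
    """Compare the column types of two UNION branches position by position."""
    if len(left) != len(right):
        return [f"UNION branches have {len(left)} and {len(right)} columns"]
    errors = []
    for position, (l_type, r_type) in enumerate(zip(left, right), start=1):
        # Toy rule: identical types, or two numeric types, count as compatible.
        both_numeric = l_type in NUMERIC_TYPES and r_type in NUMERIC_TYPES
        if l_type != r_type and not both_numeric:
            errors.append(f"UNION column {position}: {l_type} vs {r_type}")
    return errors
```

Neither check touches a database: both work purely from declared column types, which is what makes this kind of feedback cheap enough to run on every generated query.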
Column lineage as context
An agent modifying a dbt model needs to understand what it's affecting. Column lineage answers:
- "What does this model depend on?" Every source column, through every CTE and JOIN.
- "What breaks if I change this column?" Reverse lineage across the model DAG.
- "What does SELECT * actually expand to?" Concrete column lists with types.
That's exactly the context an agent needs to make safe, targeted edits. Instead of feeding the agent an entire project, you can hand it precise dependency information: "this column comes from stg_orders.amount through a SUM aggregation and is consumed by 3 downstream models."
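The "what breaks if I change this column?" question is a graph traversal once lineage edges are available. The sketch below assumes lineage has already been extracted into a plain dictionary; the `LINEAGE` data, the model names, and both helper functions are hypothetical and only illustrate the shape of the query, not how Datoria stores or exposes lineage.

```python
from collections import deque

# Hypothetical column-level lineage edges: upstream column -> direct consumers.
LINEAGE = {
    "stg_orders.amount": ["fct_orders.total_amount"],
    "fct_orders.total_amount": ["rpt_revenue.revenue", "rpt_customers.lifetime_value"],
    "rpt_revenue.revenue": ["dash_finance.revenue"],
}


def downstream_columns(column: str) -> list[str]:
    """Breadth-first walk of every column that transitively consumes `column`."""
    seen = set()
    order = []
    queue = deque([column])
    while queue:
        current = queue.popleft()
        for consumer in LINEAGE.get(current, []):
            if consumer not in seen:  # guards against revisiting shared consumers
                seen.add(consumer)
                order.append(consumer)
                queue.append(consumer)
    return order


def downstream_models(column: str) -> list[str]:
    """Distinct model names affected by a change to `column`, sorted by name."""
    return sorted({name.split(".")[0] for name in downstream_columns(column)})
```

The result of `downstream_models("stg_orders.amount")` is the kind of compact dependency summary that can be placed in an agent's prompt instead of the full project source.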
Scope resolution for precise context
The optimizer's scope resolution qualifies every column reference to its source table. For an LLM writing SQL, that means you can:
- Validate that referenced columns actually exist in the referenced tables
- Provide accurate autocomplete suggestions based on what's in scope
- Detect ambiguous column references before they become runtime errors
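A simplified sketch of those three checks, assuming the set of tables in scope and their columns is already known. The `TABLES_IN_SCOPE` data and the `resolve_column` and `completions` functions are hypothetical illustrations, not part of Datoria; real scope resolution also has to handle aliases, CTEs, and nested subqueries.

```python
# Hypothetical schemas for the tables in a query's FROM clause.
TABLES_IN_SCOPE = {
    "orders": {"id", "customer_id", "amount"},
    "customers": {"id", "name"},
}


def resolve_column(name: str, scope: dict[str, set[str]]) -> str:
    """Qualify an unqualified column name, or raise if it is missing or ambiguous."""
    owners = sorted(table for table, columns in scope.items() if name in columns)
    if not owners:
        raise LookupError(f"column '{name}' not found in scope")
    if len(owners) > 1:
        raise LookupError(f"column '{name}' is ambiguous: {', '.join(owners)}")
    return f"{owners[0]}.{name}"


def completions(prefix: str, scope: dict[str, set[str]]) -> list[str]:
    """Qualified column names in scope whose column part starts with `prefix`."""
    return sorted(
        f"{table}.{column}"
        for table, columns in scope.items()
        for column in columns
        if column.startswith(prefix)
    )
```

Here `resolve_column("amount", TABLES_IN_SCOPE)` qualifies the reference to `orders.amount`, while `resolve_column("id", TABLES_IN_SCOPE)` raises because both tables define `id`.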
Formatting for consistent output
LLM-generated SQL tends to be inconsistently formatted: ragged indentation, missing newlines, mixed casing. The formatter normalizes it in one pass, making generated SQL readable and review-friendly.
The LSP connection
These capabilities map onto what a Language Server Protocol (LSP) implementation needs:
| LSP Feature | Datoria Capability |
|---|---|
| Diagnostics (red squiggles) | Error recovery with precise positions |
| Hover (type info) | Type inference for every expression |
| Go to definition | Scope resolution and column qualification |
| Completion | Scope graph knows what's available at any position |
| Code actions (quick fixes) | AST transformation with lossless roundtripping |
| Formatting | AST-aware formatter with 27 options |
| References (find usages) | Column lineage traces all consumers |
Whether the client is a human in an IDE or an LLM agent, the underlying capabilities are the same. An agent just consumes them programmatically, at the speed of the full analysis pipeline (~0.3 ms per model for parse, optimize, lineage, and types).
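As an example of the first row, the sketch below converts a parse error with character offsets into the `Diagnostic` shape defined by the LSP specification (zero-based `line` and `character`, `severity` 1 for errors). The input format of `(start, end, message)` and the `sql-analyzer` source label are assumptions made for illustration; this is not Datoria's actual error type.

```python
def offset_to_position(text: str, offset: int) -> dict:
    """Convert a 0-based character offset into an LSP Position (0-based line/character).

    Note: LSP counts `character` in UTF-16 code units by default; this sketch
    counts Python characters, which is equivalent for ASCII-only SQL.
    """
    before = text[:offset]
    line = before.count("\n")
    character = offset - (before.rfind("\n") + 1)
    return {"line": line, "character": character}


def to_lsp_diagnostic(text: str, start: int, end: int, message: str) -> dict:
    """Build an LSP Diagnostic dict from a hypothetical (start, end, message) error."""
    return {
        "range": {
            "start": offset_to_position(text, start),
            "end": offset_to_position(text, end),
        },
        "severity": 1,  # DiagnosticSeverity.Error in the LSP specification
        "source": "sql-analyzer",  # hypothetical source label
        "message": message,
    }
```

An editor renders this structure as a red squiggle; an agent can read the same `range` to know which span of its output to regenerate.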
Performance for agent loops
Agent workflows are iterative: generate → validate → fix → validate → fix. Each iteration must be fast enough that validation never becomes the bottleneck of the loop.
- Parse: 56 microseconds per file
- Full analysis (parse + optimize + lineage + types): ~0.3 ms per model
- 10,000 models: under 3 seconds for the whole project
So an agent can analyze the impact of a proposed change across an entire dbt project in seconds, not minutes — tight enough for an interactive workflow.
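The generate → validate → fix loop can be sketched in a few lines. The code below is a generic outline under stated assumptions: `generate` stands in for an LLM call and `validate` for any static analyzer that returns a list of error messages. Both callables, and the `fake_generate` and `fake_validate` stubs, are hypothetical and do not represent Datoria's interface.

```python
from typing import Callable


def generate_and_verify(
    generate: Callable[[list[str]], str],
    validate: Callable[[str], list[str]],
    max_attempts: int = 3,
) -> tuple[str, list[str], int]:
    """Run generate -> validate -> fix until validation passes or attempts run out.

    `generate` receives the previous attempt's errors (empty on the first call)
    and returns SQL; `validate` returns a list of error messages.
    Returns (sql, remaining_errors, attempts_used).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    errors: list[str] = []
    sql = ""
    for attempt in range(1, max_attempts + 1):
        sql = generate(errors)   # feed the last errors back as context
        errors = validate(sql)   # cheap static check, no database round trip
        if not errors:
            return sql, [], attempt
    return sql, errors, max_attempts


# Stand-ins for an LLM call and a static analyzer; both are hypothetical.
def fake_generate(errors: list[str]) -> str:
    return "SELECT id FROM orders" if errors else "SELECT id FORM orders"


def fake_validate(sql: str) -> list[str]:
    return [] if " FROM " in sql else ["expected FROM near 'FORM'"]
```

With these stubs, the first attempt fails validation and the second succeeds. Because the validation step costs well under a millisecond at the figures quoted above, the loop's latency is dominated by generation rather than by checking.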