Built for LLMs and Code Agents

LLMs generate SQL. Code agents modify SQL. Both need fast, structured feedback to know whether what they produced is correct. Datoria provides the tight feedback loop that turns "generate and hope" into "generate and verify."

The Problem

When an LLM generates a SQL query, you typically have two options: run it against a database (slow, expensive, requires credentials) or trust the output (risky). There's no middle ground -- no way to structurally validate the SQL, check types, trace lineage, or verify correctness without execution.

Code agents face the same problem at higher stakes. An agent modifying a dbt project needs to understand the impact of its changes across the model DAG. Without structural understanding, every edit is a guess.

What Datoria Provides

Instant Structural Validation

Parse generated SQL in 56 microseconds and know immediately:

  • Is the SQL syntactically valid for the target dialect?
  • If not, where exactly are the errors? (precise positions, not just "syntax error")
  • Does it use constructs that exist in the target dialect? (no BigQuery STRUCT in a PostgreSQL query)

With error recovery, even partially broken SQL yields a useful partial AST -- the agent can see which parts are valid and which need fixing.

Type Checking Without a Database

Type inference resolves the data type of every expression, column, and function call -- without connecting to a warehouse:

  • Does SUM(string_column) make sense? The type system catches it.
  • Does a UNION have compatible branch types? Verified statically.
  • Does an INSERT's value list match the target column types? Checked.

This gives LLMs and agents compile-time confidence about generated SQL correctness.

Column Lineage as Context

An agent modifying a dbt model needs to understand what it's affecting. Datoria's column lineage answers:

  • "What does this model depend on?" -- every source column, through every CTE and JOIN
  • "What breaks if I change this column?" -- reverse lineage across the model DAG
  • "What does SELECT * actually expand to?" -- concrete column lists with types

This is exactly the context an agent needs to make safe, targeted modifications. Instead of feeding the agent an entire project, you can give it precise dependency information: "this column comes from stg_orders.amount through a SUM aggregation, and is consumed by 3 downstream models."
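Reverse lineage is, at its core, a graph traversal over column-level edges. A minimal sketch, using a toy dependency graph with hypothetical model names (Datoria builds this graph from the SQL itself):

```python
from collections import deque

# Toy column-level dependency graph (model and column names hypothetical):
# each entry maps a derived column to the upstream columns it is built from.
UPSTREAM = {
    "fct_orders.total":    ["stg_orders.amount"],
    "rpt_revenue.revenue": ["fct_orders.total"],
    "rpt_margin.margin":   ["fct_orders.total", "stg_costs.cost"],
}

def downstream_of(column: str) -> set[str]:
    """Reverse lineage: every column that transitively consumes `column`."""
    consumers = {c for c, ups in UPSTREAM.items() if column in ups}
    queue = deque(consumers)
    while queue:
        col = queue.popleft()
        for c, ups in UPSTREAM.items():
            if col in ups and c not in consumers:
                consumers.add(c)
                queue.append(c)
    return consumers

print(sorted(downstream_of("stg_orders.amount")))
# ['fct_orders.total', 'rpt_margin.margin', 'rpt_revenue.revenue']
```

Answering "what breaks if I change stg_orders.amount?" is then one traversal, and the resulting set is exactly the scoped context an agent needs instead of the whole project.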

Scope Resolution for Precise Context

The optimizer's scope resolution qualifies every column reference to its source table. For an LLM writing SQL, this means:

  • You can validate that referenced columns actually exist in the referenced tables
  • You can provide accurate autocomplete suggestions based on what's in scope
  • You can detect ambiguous column references before they become runtime errors
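The ambiguity check in particular is simple once scopes are known: a bare column name must belong to exactly one in-scope table. A toy sketch with hypothetical table names (not Datoria's actual resolver):

```python
# Toy scope resolution: given the tables in scope and their columns
# (names hypothetical), qualify a bare column reference or report an error.
IN_SCOPE = {
    "orders":    {"id", "customer_id", "amount"},
    "customers": {"id", "name"},
}

def qualify(column: str) -> str:
    """Resolve `column` to `table.column`, or raise on unknown/ambiguous refs."""
    owners = [t for t, cols in IN_SCOPE.items() if column in cols]
    if not owners:
        raise ValueError(f"unknown column: {column}")
    if len(owners) > 1:
        raise ValueError(f"ambiguous column {column!r}: present in {sorted(owners)}")
    return f"{owners[0]}.{column}"

print(qualify("amount"))  # orders.amount
# qualify("id") would raise: 'id' is in both orders and customers
```

A database would surface the same ambiguity, but only at execution time; resolving it statically turns a runtime failure into an instant diagnostic.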

Formatting for Consistent Output

LLM-generated SQL is often poorly formatted -- inconsistent indentation, missing newlines, mixed casing. The formatter normalizes output to a consistent style in one pass, making generated SQL readable and review-friendly.

The LSP Connection

These capabilities map directly to what a Language Server Protocol (LSP) implementation needs:

LSP Feature                  Datoria Capability
Diagnostics (red squiggles)  Error recovery with precise positions
Hover (type info)            Type inference for every expression
Go to definition             Scope resolution and column qualification
Completion                   Scope graph knows what's available at any position
Code actions (quick fixes)   AST transformation with lossless roundtripping
Formatting                   AST-aware formatter with 27 options
References (find usages)     Column lineage traces all consumers

Whether the "client" is a human in an IDE or an LLM agent, the underlying capabilities are the same. The difference is that an agent can consume them programmatically and at the speed of the parser (~0.3 ms per model for the full analysis pipeline).

Performance for Agent Loops

Agent workflows are iterative: generate → validate → fix → validate → fix. Each iteration needs to be fast enough not to bottleneck the loop.

  • Parse: 56 microseconds per file
  • Full analysis (parse + optimize + lineage + types): ~0.3 ms per model
  • 10,000 models: under 3 seconds for the entire project

This means an agent can analyze the impact of a proposed change across an entire dbt project in seconds, not minutes. The feedback loop is tight enough for interactive agent workflows.