
Operating SQLMesh in Production: State, Scheduling, and Observability

Scotty Pate 12 min read

Introduction

Once you've got SQLMesh containerized and your CI/CD pipeline wired up (Part 1), the next set of questions is operational: what does the state database actually track? How should scheduling work? What should you monitor, and what do you do when things go wrong?

This post covers the internals of SQLMesh state, scheduling with the built-in scheduler and external orchestrators, backfill management, and practical observability for production systems.

This is Part 2 of a three-part series. Part 1 covers infrastructure, CI/CD, and deployment. Part 3 covers what's different for teams coming from dbt.


What SQLMesh State Tracks

The state database contains three categories of data:

Snapshots

Fingerprints of model definitions. Every version of every model you've ever deployed is recorded as a snapshot with a content hash. SQLMesh uses these to determine when a model has changed — if the fingerprint of the current model definition doesn't match the snapshot stored in state, the model has been modified and a new plan is needed.
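The change-detection logic can be sketched as content hashing. This is a simplified illustration, not SQLMesh's actual fingerprint algorithm (which operates on the parsed query and also folds in upstream model fingerprints):

```python
import hashlib

def fingerprint(model_sql: str) -> str:
    # Normalize whitespace and case so cosmetic edits don't change the hash.
    # (SQLMesh's real fingerprint is computed on the parsed query, so it is
    # more robust than this string-level normalization.)
    normalized = " ".join(model_sql.split()).lower()
    return hashlib.sha256(normalized.encode()).hexdigest()

stored = fingerprint("SELECT id, revenue FROM raw.orders")
current = fingerprint("SELECT id,   revenue\nFROM raw.orders")
assert stored == current  # cosmetic change: no new snapshot needed

changed = fingerprint("SELECT id, revenue, tax FROM raw.orders")
assert stored != changed  # real change: a new plan and snapshot are needed
```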

Environment Mappings

The mapping from environment names (like prod, dev, alice_dev) to snapshot IDs. This is the virtual environment registry — it tells SQLMesh which snapshot of each model is "current" in a given environment.

This is what makes virtual environments work. When Alice creates her environment, SQLMesh stores a mapping from alice_dev to the same snapshot IDs as prod for every unchanged model, and new snapshot IDs only for models she's modified. No data duplication for the unchanged models.

Interval Records

Which time intervals have been successfully processed for each model in each environment. This is what makes sqlmesh run idempotent and self-healing — it checks the interval records to know what's already been done.
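The self-healing behavior can be sketched as a gap check over the interval records. A toy illustration assuming daily intervals (SQLMesh's actual bookkeeping is internal):

```python
from datetime import date, timedelta

def missing_intervals(start: date, end: date, completed: set[date]) -> list[date]:
    """Return every daily interval in [start, end] with no completion record."""
    days = (end - start).days + 1
    return [start + timedelta(d) for d in range(days)
            if start + timedelta(d) not in completed]

# Suppose Jan 3 failed mid-run and was never recorded as complete.
completed = {date(2024, 1, 1), date(2024, 1, 2), date(2024, 1, 4)}
todo = missing_intervals(date(2024, 1, 1), date(2024, 1, 5), completed)
assert todo == [date(2024, 1, 3), date(2024, 1, 5)]  # the gap plus the new day
```

The next run picks up both the gap and any newly due intervals, with no manual bookkeeping.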


Virtual Environments

Virtual environments are the most important concept to internalize for production operations.

When you run sqlmesh plan alice_dev --no-prompts --auto-apply and Alice has modified two models, SQLMesh creates new physical tables for those two models and updates Alice's environment mapping to point at them. All other models in alice_dev still point at the shared physical tables used by prod.

This is genuinely different from how most data tools handle environments. There's no full data duplication. Creating a new environment is effectively instant for unchanged models. The only cost is the physical tables for models that have actually diverged.

The implication for team workflows: every engineer can keep an isolated environment at near-zero cost, and promoting a change is an update to the environment registry rather than a rebuild of the data.


Plan Execution: What's Cheap and What's Expensive

sqlmesh plan without --auto-apply is cheap to run — it's a diff computation. It reads model files, computes fingerprints, queries the state database, and reports what would change. Even on large projects, this completes in under a minute.

The expensive part is what happens when the plan is applied (either via --auto-apply or interactive confirmation): creating physical tables for the changed models and backfilling whatever intervals those changes invalidate.

When a plan shows an unexpected backfill, stop and understand why before applying. Common causes: a model's kind changed, a column was renamed, or a grain definition changed.


Optimistic Locking and Concurrent Teams

SQLMesh uses optimistic locking on environment state. The workflow:

  1. plan reads the current environment state and computes a diff
  2. When you confirm (or --auto-apply), SQLMesh attempts to write the new state, but only if the environment state hasn't changed since the plan was computed
  3. If another plan was applied between your plan computation and your confirmation, the write fails

This is correct behavior. SQLMesh is protecting you from unknowingly overwriting someone else's deployment. The resolution is always: merge the latest changes, run sqlmesh plan again.
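The mechanics can be sketched as a compare-and-swap on a version counter. This is an illustrative toy, not SQLMesh internals:

```python
class EnvironmentState:
    """Toy optimistic-locking store: every write must present the
    version it originally read, or the write is rejected."""
    def __init__(self):
        self.version = 0
        self.snapshot_ids: dict[str, str] = {}

    def write(self, expected_version: int, snapshot_ids: dict[str, str]) -> bool:
        if self.version != expected_version:
            return False  # someone else deployed since you planned
        self.snapshot_ids = snapshot_ids
        self.version += 1
        return True

env = EnvironmentState()
read_at_plan_time = env.version           # your plan reads state (version 0)
env.write(env.version, {"orders": "s2"})  # a teammate's plan lands first
# Your apply presents the stale version and is rejected:
assert not env.write(read_at_plan_time, {"orders": "s3"})
```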

In practice, on teams larger than 2-3 people, you want a serialized deployment process: CI/CD that applies to production should run one deploy at a time. GitHub's environment protection rules with required reviewers effectively enforce this, since only one approved deploy proceeds at a time.


State Database Maintenance

The state database grows over time. Deleted models leave orphaned snapshots. Old environment configurations accumulate. The janitor process cleans them up:

sqlmesh janitor

Run this as a scheduled job — weekly is usually sufficient. If you've been running SQLMesh for months without running janitor, the first run may take a while as it cleans up accumulated state.

What janitor does not do: it won't delete state that any active environment still references. It's conservative. If a snapshot is reachable from any environment, it stays.

Monitoring State Growth

Track the size of your state database. It should grow slowly and predictably. Rapid growth usually means janitor hasn't run, or environments are accumulating without being invalidated. We've seen state databases grow to several GB without janitor — the impact is mostly performance: plan evaluation gets slower as the state database grows.


Disaster Recovery

If you lose the state database, you lose the virtual environment registry and the interval completion records. This is serious but recoverable.

Recovery process:

  1. Restore from backup if you have one. This is the right answer. Back up the state database. It's small — a few hundred MB even for large projects. Daily backups with one week retention is sufficient.

  2. If you have no backup, you can rebuild state by running sqlmesh plan against each environment. SQLMesh will re-examine the physical tables, attempt to match them to model definitions, and reconstruct the snapshot registry. This works but takes time and requires you to know which environments existed.

  3. After rebuilding state, you will have lost interval completion records. This means sqlmesh run may attempt to re-execute intervals that already completed. For idempotent models (most incremental models), this produces correct results but wastes compute. For non-idempotent models, this is a data correctness problem.

The practical takeaway: treat the state database like production data. It is production data.


Scheduling and Orchestration

SQLMesh's Built-In Scheduler

SQLMesh includes a scheduler that evaluates what needs to run based on cron intervals and the state database. You invoke it with sqlmesh run:

sqlmesh run

This command looks at the cron schedules defined on your models, checks the state database to see which intervals have already been completed, and executes the ones that haven't. It's idempotent — running it twice doesn't double-execute anything. If nothing is due, it exits cleanly.
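The idempotency property can be sketched in a few lines, assuming daily intervals and an in-memory stand-in for the state database:

```python
from datetime import date, timedelta

completed: set[date] = set()  # stand-in for interval records in the state DB

def run(today: date, start: date = date(2024, 1, 1)) -> list[date]:
    """Execute every due daily interval that has no completion record."""
    days = (today - start).days + 1
    due = [start + timedelta(d) for d in range(days)
           if start + timedelta(d) not in completed]
    completed.update(due)  # record completions, as sqlmesh run does in state
    return due

first = run(date(2024, 1, 3))   # executes Jan 1-3
second = run(date(2024, 1, 3))  # everything already recorded
assert len(first) == 3 and second == []  # re-running executes nothing
```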

The built-in scheduler is the right choice for single-project setups where you don't have complex inter-project dependencies. Set up a CronJob in Kubernetes (or a cron entry on a VM) that runs sqlmesh run on your desired cadence — typically every 15-30 minutes to catch up on any due intervals:

# Kubernetes CronJob example
apiVersion: batch/v1
kind: CronJob
metadata:
  name: sqlmesh-run
spec:
  schedule: "*/15 * * * *"
  concurrencyPolicy: Forbid  # don't allow overlapping runs
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: sqlmesh
            image: your-registry/sqlmesh-runner:latest
            command: ["sqlmesh", "run"]
            env:
            - name: SQLMESH_STATE_DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: sqlmesh-secrets
                  key: state-db-password
          restartPolicy: OnFailure

concurrencyPolicy: Forbid is non-negotiable. If a run takes longer than your schedule interval, you do not want a second run starting while the first is still executing.

When to Use External Orchestration

Reach for Airflow, Dagster, or Prefect when you have dependencies that span multiple SQLMesh projects, non-SQL steps (ingestion, reverse ETL) that must run in sequence with your models, or an orchestrator your team already operates.

Using Airflow with SQLMesh

The integration is straightforward: use a BashOperator to call sqlmesh run. Do not call sqlmesh plan from Airflow. Plans are for CI/CD. Scheduling is for runs.

from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime, timedelta

with DAG(
    dag_id="sqlmesh_prod_run",
    schedule_interval="*/15 * * * *",
    start_date=datetime(2026, 1, 1),
    catchup=False,
    max_active_runs=1,  # same as concurrencyPolicy: Forbid
    default_args={
        "retries": 2,
        "retry_delay": timedelta(minutes=5),
    },
) as dag:
    run_sqlmesh = BashOperator(
        task_id="sqlmesh_run",
        bash_command="sqlmesh run",
        env={
            "SQLMESH_STATE_DB_PASSWORD": "{{ var.value.sqlmesh_state_db_password }}",
            "SNOWFLAKE_PASSWORD": "{{ var.value.snowflake_password }}",
        },
    )

Set max_active_runs=1 to prevent concurrent runs. Set catchup=False — you do not want Airflow attempting to catch up on missed intervals. SQLMesh handles that itself when sqlmesh run executes.


Failure Handling and Recovery

SQLMesh tracks interval completion in the state database. If a run fails partway through — say, the fifth model in a ten-model execution batch fails — the state records which intervals completed before the failure. The next sqlmesh run picks up the remaining work automatically.

This means transient failures (network blip, warehouse timeout, OOM on a single model) are self-healing. You don't need to manually specify what to re-run. You just re-run sqlmesh run and it figures it out.

The exception is when you want to force a re-execution of intervals that already completed — for instance, after fixing a bug in a model. That requires a new sqlmesh plan with the specific date range, not just a re-run.
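SQLMesh exposes this as a restatement plan. A sketch, with a hypothetical model name and date range:

```shell
# Force re-execution of already-completed intervals for one model
# after fixing a bug in its logic (model name is illustrative).
sqlmesh plan prod --restate-model analytics.orders_daily \
  --start 2024-01-01 --end 2024-01-31 --no-prompts --auto-apply
```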


Backfills

Backfills are triggered through plan, not run:

sqlmesh plan prod --start 2024-01-01 --end 2024-06-30 --no-prompts --auto-apply

A few things to be careful about:

Check the interval count before you apply. For a daily model running from 2024-01-01 to 2024-06-30, that's 182 intervals (2024 is a leap year). If you have 20 models, that's 3,640 interval executions. For a warehouse-intensive model, that could be a very large and expensive backfill. Always run without --auto-apply first to see the plan output.
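A quick sanity check of that arithmetic (the date range is inclusive, and 2024 is a leap year, so the count is 182 rather than 181):

```python
from datetime import date

start, end = date(2024, 1, 1), date(2024, 6, 30)
intervals = (end - start).days + 1   # inclusive daily intervals
models = 20
assert intervals == 182              # 31+29+31+30+31+30
assert intervals * models == 3640    # total interval executions
```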

Set batch_size on models with large backfills. SQLMesh can batch intervals together, which is faster but uses more memory and can hit warehouse limits. Configure batch_size in the model definition:

MODEL (
  name analytics.large_model,
  kind INCREMENTAL_BY_TIME_RANGE (
    time_column event_date,
    batch_size 30  -- process 30 days at a time
  )
);

Incremental models with large history. If a model is defined as INCREMENTAL_BY_TIME_RANGE and you're backfilling years of data, test with a small date range first to get a sense of how long each interval takes before committing to the full backfill.


Monitoring and Observability

Key Metrics to Track

Not everything needs a dashboard, but a few metrics are worth capturing: sqlmesh run duration, failed model executions per run, interval lag (how far behind schedule each model's most recent completed interval is), and state database size.

Logging

SQLMesh writes detailed logs to local files by default and prints summaries to the terminal. For log aggregation (Datadog, CloudWatch, ELK stack), use the --log-to-stdout flag so full logs are captured by your container log collector:

sqlmesh --log-to-stdout run

For Kubernetes, use a node-level log collector (Fluent Bit is common) to ship container stdout to your log aggregation system. Filter on the sqlmesh container name to get only SQLMesh logs. Set up a log-based metric for error-level messages to drive alerting.

Alerting Priorities

Three tiers:

Page-worthy (wake someone up): the production sqlmesh run failing repeatedly, or audits failing on business-critical models.

Warning (Slack notification, fix next business day): a single failed run that the next run self-heals, a failed janitor job, or state database growth outside its normal trend.

Informational (dashboard only): run durations, intervals processed per run, and state database size over time.

The Hard Problem: Detecting Wrong Data

SQLMesh won't catch logical errors. A model that joins on the wrong key, applies the wrong filter, or has off-by-one errors in date truncation will execute successfully and record a completed interval. The data will be wrong and nothing in SQLMesh will tell you.

This is not a SQLMesh failure — it's the nature of batch data pipelines. Address it with:

SQLMesh audits. Define audits on critical models to assert row counts, non-null constraints, and referential integrity. Audits are defined in your model files or in separate .sql files in an audits directory:

MODEL (
  name analytics.orders_daily,
  kind INCREMENTAL_BY_TIME_RANGE (
    time_column order_date
  ),
  audits (
    not_null(columns := (order_date, revenue)),
    accepted_range(column := revenue, min_v := 0)
  )
);

Audits run after each model execution. If an audit fails, the interval is marked as failed. This catches null values, out-of-range numbers, and other basic data quality issues at the source.
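An audit boils down to a query that selects bad rows and passes only when the result is empty. A minimal sketch of that idea using sqlite3 (not SQLMesh's audit runner):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders_daily (order_date TEXT, revenue REAL)")
conn.executemany("INSERT INTO orders_daily VALUES (?, ?)",
                 [("2024-01-01", 120.0), ("2024-01-02", None)])

def run_audit(conn, bad_rows_sql: str) -> bool:
    """Return True if the audit passes, i.e. the bad-rows query is empty."""
    return conn.execute(bad_rows_sql).fetchone() is None

# A not_null-style audit: any row with a NULL key column is a bad row.
passed = run_audit(conn,
    "SELECT * FROM orders_daily WHERE order_date IS NULL OR revenue IS NULL")
assert not passed  # the NULL revenue row fails the audit
```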

Downstream data quality tools. Great Expectations, Soda, or Monte Carlo for statistical anomaly detection. These catch things audits miss — distribution shifts, sudden metric changes, and cross-model consistency issues.


Putting It Together: A Scheduled Production Setup

Here's the complete scheduling infrastructure for a production SQLMesh deployment:

# sqlmesh-run: executes due intervals every 15 minutes
apiVersion: batch/v1
kind: CronJob
metadata:
  name: sqlmesh-run
spec:
  schedule: "*/15 * * * *"
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: sqlmesh
            image: your-registry/sqlmesh-runner:latest
            command: ["sqlmesh", "--log-to-stdout", "run"]
            resources:
              requests:
                memory: "2Gi"
                cpu: "1"
              limits:
                memory: "4Gi"
          restartPolicy: OnFailure
---
# sqlmesh-janitor: cleans up orphaned state weekly
apiVersion: batch/v1
kind: CronJob
metadata:
  name: sqlmesh-janitor
spec:
  schedule: "0 2 * * 0"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: sqlmesh
            image: your-registry/sqlmesh-runner:latest
            command: ["sqlmesh", "janitor"]
          restartPolicy: OnFailure

Combined with the CI/CD pipeline from Part 1, this gives you: plans computed and applied through CI on merge, scheduled runs every 15 minutes to process due intervals, and weekly janitor to keep state clean.


What's Next

This post covered the operational layer — state management, scheduling, backfills, and monitoring. If you're coming from dbt, the next post in this series covers the mental model shifts that trip up experienced dbt users when they start working with SQLMesh.


Have feedback or corrections? Reach out directly — we update these posts when we find things that are wrong or outdated.