Introduction
Once you've got SQLMesh containerized and your CI/CD pipeline wired up (Part 1), the next set of questions is operational: what does the state database actually track? How should scheduling work? What should you monitor, and what do you do when things go wrong?
This post covers the internals of SQLMesh state, scheduling with the built-in scheduler and external orchestrators, backfill management, and practical observability for production systems.
This is Part 2 of a three-part series. Part 1 covers infrastructure, CI/CD, and deployment. Part 3 covers what's different for teams coming from dbt.
What SQLMesh State Tracks
The state database contains three categories of data:
Snapshots
Fingerprints of model definitions. Every version of every model you've ever deployed is recorded as a snapshot with a content hash. SQLMesh uses these to determine when a model has changed — if the fingerprint of the current model definition doesn't match the snapshot stored in state, the model has been modified and a new plan is needed.
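The mechanics can be sketched as a content hash over the normalized model definition. This is a toy illustration, not SQLMesh's actual fingerprint algorithm (which also folds in upstream fingerprints, macros, and metadata), but it shows why cosmetic edits don't trigger a new plan:

```python
import hashlib

def fingerprint(model_sql: str) -> str:
    """Toy stand-in for a snapshot fingerprint: hash the normalized definition."""
    normalized = " ".join(model_sql.split()).lower()  # collapse whitespace, case-fold
    return hashlib.sha256(normalized.encode()).hexdigest()[:16]

stored = fingerprint("SELECT id, revenue FROM raw.orders")

# A formatting-only edit normalizes to the same text: same fingerprint, no plan needed
same = fingerprint("select id,\n  revenue\nfrom raw.orders")

# A real change (new column) produces a different fingerprint: a plan is required
changed = fingerprint("SELECT id, revenue, cost FROM raw.orders")
```

If the current fingerprint matches the stored snapshot, the model is unchanged; if not, SQLMesh knows a new snapshot and plan are needed.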
Environment Mappings
The mapping from environment names (like prod, dev, alice_dev) to snapshot IDs. This is the virtual environment registry — it tells SQLMesh which snapshot of each model is "current" in a given environment.
This is what makes virtual environments work. When Alice creates her environment, it stores a mapping from alice_dev to the same snapshot IDs as prod for every unchanged model, and new snapshot IDs only for models she's modified. No data duplication for the unchanged models.
Interval Records
Which time intervals have been successfully processed for each model in each environment. This is what makes sqlmesh run idempotent and self-healing — it checks the interval records to know what's already been done.
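The idea can be sketched in a few lines, assuming daily intervals (SQLMesh's real logic works off each model's cron and stores interval ranges, but the shape is the same): given the set of completed intervals, `sqlmesh run` only needs to execute the gaps.

```python
from datetime import date, timedelta

def missing_intervals(start: date, end: date, completed: set) -> list:
    """Daily intervals in [start, end] with no completion record in state."""
    days = (end - start).days + 1
    return [start + timedelta(days=i) for i in range(days)
            if start + timedelta(days=i) not in completed]

# State says Jan 1, 2 and 4 completed; a run fills only the gap and the tail
completed = {date(2024, 1, 1), date(2024, 1, 2), date(2024, 1, 4)}
todo = missing_intervals(date(2024, 1, 5 - 4 + 4), date(2024, 1, 5), completed) if False else \
       missing_intervals(date(2024, 1, 1), date(2024, 1, 5), completed)
```

Running again after those intervals are recorded yields an empty to-do list, which is exactly the idempotence property: a second run is a no-op.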
Virtual Environments
Virtual environments are the most important concept to internalize for production operations.
When you run sqlmesh plan alice_dev --no-prompts --auto-apply and Alice has modified two models, SQLMesh creates new physical tables for those two models and updates Alice's environment mapping to point at them. All other models in alice_dev still point at the shared physical tables used by prod.
This is genuinely different from how most data tools handle environments. There's no full data duplication. Creating a new environment is effectively instant for unchanged models. The only cost is the physical tables for models that have actually diverged.
The implications for team workflows:
- Every developer gets their own environment at essentially zero marginal cost for unchanged models. No more shared dev environments where one person's broken model blocks everyone else.
- Environment creation is cheap enough to be disposable. Create one per feature branch. Invalidate it when the branch merges.
- Prod is just another environment. It's not special infrastructure — it's an entry in the environment mapping that points at a set of snapshot IDs.
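The registry itself is conceptually just a mapping from environment name to per-model snapshot IDs. A toy sketch of why environment creation is cheap (real snapshot IDs are content hashes, not these labels):

```python
# prod's environment mapping: model name -> snapshot ID (toy labels)
prod = {"orders": "snap_a1", "customers": "snap_b7", "revenue": "snap_c3"}

# Alice branches and modifies only the revenue model: one new snapshot,
# everything else reuses prod's entries
alice_dev = {**prod, "revenue": "snap_c4"}

# Models whose snapshot IDs match still point at prod's physical tables
shared = [m for m in alice_dev if alice_dev[m] == prod[m]]
```

Creating alice_dev copied three dictionary entries, not three tables; only the diverged model ever gets new physical storage.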
Plan Execution: What's Cheap and What's Expensive
sqlmesh plan without --auto-apply is cheap to run — it's a diff computation. It reads model files, computes fingerprints, queries the state database, and reports what would change. Even on large projects, this completes in under a minute.
The expensive part is what happens when the plan is applied (either via --auto-apply or interactive confirmation):
- DDL operations: creating or replacing tables. Usually fast, schema-dependent.
- Backfills: re-executing model queries for historical intervals. Cost scales with the number of intervals and the model's query complexity.
- Non-breaking changes: if a model change is classified as non-breaking (adding a new column, changing a comment), SQLMesh may be able to apply it without a backfill. Always check the plan output — it tells you explicitly which models will be backfilled and over which intervals.
When a plan shows an unexpected backfill, stop and understand why before applying. Common causes: a model's kind changed, a column was renamed, or a grain definition changed.
Optimistic Locking and Concurrent Teams
SQLMesh uses optimistic locking on environment state. The workflow:
1. sqlmesh plan reads the current environment state and computes a diff
2. When you confirm (or use --auto-apply), SQLMesh attempts to write the new state, but only if the environment state hasn't changed since the plan was computed
3. If another plan was applied between your plan computation and your confirmation, the write fails
This is correct behavior. SQLMesh is protecting you from unknowingly overwriting someone else's deployment. The resolution is always: merge the latest changes, run sqlmesh plan again.
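The underlying pattern is a plain compare-and-swap on a version counter. A minimal sketch with SQLite standing in for the state database (the real schema is SQLMesh-internal; this only illustrates the failure mode):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE environments (name TEXT PRIMARY KEY, head TEXT, version INT)")
db.execute("INSERT INTO environments VALUES ('prod', 'snapshot_a', 1)")

def apply_plan(conn, env, new_head, seen_version):
    """Write the new environment head only if nobody moved it since we planned."""
    cur = conn.execute(
        "UPDATE environments SET head = ?, version = version + 1 "
        "WHERE name = ? AND version = ?",
        (new_head, env, seen_version),
    )
    return cur.rowcount == 1  # False => someone else deployed first; re-plan

# Two plans were computed against version 1; only the first apply wins
first = apply_plan(db, "prod", "snapshot_b", seen_version=1)
second = apply_plan(db, "prod", "snapshot_c", seen_version=1)
```

The second writer's UPDATE matches zero rows because the version already advanced, which is exactly the "plan is stale, re-plan" error surfaced to the user.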
In practice, on teams larger than 2-3 people, you want a serialized deployment process. CI/CD that applies to production should run one deploy at a time. GitHub's environment protection rules with required reviewers effectively enforce this: only one approved deploy proceeds at a time.
State Database Maintenance
The state database grows over time. Deleted models leave orphaned snapshots. Old environment configurations accumulate. The janitor process cleans them up:
sqlmesh janitor
Run this as a scheduled job — weekly is usually sufficient. If you've been running SQLMesh for months without running janitor, the first run may take a while as it cleans up accumulated state.
What janitor does not do: it won't delete state that any active environment still references. It's conservative. If a snapshot is reachable from any environment, it stays.
Monitoring State Growth
Track the size of your state database. It should grow slowly and predictably. Rapid growth usually means janitor hasn't run, or environments are accumulating without being invalidated. We've seen state databases reach several GB without janitor; the impact is mostly on performance, since plan evaluation slows as state accumulates.
Disaster Recovery
If you lose the state database, you lose the virtual environment registry and the interval completion records. This is serious but recoverable.
Recovery process:
1. Restore from backup if you have one. This is the right answer. Back up the state database. It's small — a few hundred MB even for large projects. Daily backups with one week of retention is sufficient.
2. If you have no backup, you can rebuild state by running sqlmesh plan against each environment. SQLMesh will re-examine the physical tables, attempt to match them to model definitions, and reconstruct the snapshot registry. This works but takes time and requires you to know which environments existed.
3. After rebuilding state, you will have lost interval completion records. This means sqlmesh run may attempt to re-execute intervals that already completed. For idempotent models (most incremental models), this produces correct results but wastes compute. For non-idempotent models, this is a data correctness problem.
The practical takeaway: treat the state database like production data. It is production data.
Scheduling and Orchestration
SQLMesh's Built-In Scheduler
SQLMesh includes a scheduler that evaluates what needs to run based on cron intervals and the state database. You invoke it with sqlmesh run:
sqlmesh run
This command looks at the cron schedules defined on your models, checks the state database to see which intervals have already been completed, and executes the ones that haven't. It's idempotent — running it twice doesn't double-execute anything. If nothing is due, it exits cleanly.
The built-in scheduler is the right choice for single-project setups where you don't have complex inter-project dependencies. Set up a CronJob in Kubernetes (or a cron entry on a VM) that runs sqlmesh run on your desired cadence — typically every 15-30 minutes to catch up on any due intervals:
# Kubernetes CronJob example
apiVersion: batch/v1
kind: CronJob
metadata:
  name: sqlmesh-run
spec:
  schedule: "*/15 * * * *"
  concurrencyPolicy: Forbid  # don't allow overlapping runs
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: sqlmesh
              image: your-registry/sqlmesh-runner:latest
              command: ["sqlmesh", "run"]
              env:
                - name: SQLMESH_STATE_DB_PASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: sqlmesh-secrets
                      key: state-db-password
          restartPolicy: OnFailure
concurrencyPolicy: Forbid is non-negotiable. If a run takes longer than your schedule interval, you do not want a second run starting while the first is still executing.
When to Use External Orchestration
Reach for Airflow, Dagster, or Prefect when you have:
- Multi-project dependencies: Project B's models depend on Project A completing first. SQLMesh has no cross-project awareness.
- Complex DAGs with non-model steps: data quality checks, notifications, downstream triggers that aren't SQLMesh models.
- Existing orchestration infrastructure: if the team already runs Airflow, adding SQLMesh jobs there is less friction than running a parallel scheduler.
- SLA monitoring and alerting: Airflow's SLA miss notifications are battle-tested. SQLMesh has no native SLA concept.
Using Airflow with SQLMesh
The integration is straightforward: use a BashOperator to call sqlmesh run. Do not call sqlmesh plan from Airflow. Plans are for CI/CD. Scheduling is for runs.
from airflow import DAG
from airflow.operators.bash import BashOperator
from datetime import datetime, timedelta

with DAG(
    dag_id="sqlmesh_prod_run",
    schedule_interval="*/15 * * * *",
    start_date=datetime(2026, 1, 1),
    catchup=False,
    max_active_runs=1,  # same as concurrencyPolicy: Forbid
    default_args={
        "retries": 2,
        "retry_delay": timedelta(minutes=5),
    },
) as dag:
    run_sqlmesh = BashOperator(
        task_id="sqlmesh_run",
        bash_command="sqlmesh run",
        env={
            "SQLMESH_STATE_DB_PASSWORD": "{{ var.value.sqlmesh_state_db_password }}",
            "SNOWFLAKE_PASSWORD": "{{ var.value.snowflake_password }}",
        },
    )
Set max_active_runs=1 to prevent concurrent runs. Set catchup=False — you do not want Airflow attempting to catch up on missed intervals. SQLMesh handles that itself when sqlmesh run executes.
Failure Handling and Recovery
SQLMesh tracks interval completion in the state database. If a run fails partway through — say, the fifth model in a ten-model execution batch fails — the state records which intervals completed before the failure. The next sqlmesh run picks up the remaining work automatically.
This means transient failures (network blip, warehouse timeout, OOM on a single model) are self-healing. You don't need to manually specify what to re-run. You just re-run sqlmesh run and it figures it out.
The exception is when you want to force a re-execution of intervals that already completed — for instance, after fixing a bug in a model. That requires a new sqlmesh plan with the specific date range, not just a re-run.
Backfills
Backfills are triggered through plan, not run:
sqlmesh plan prod --start 2024-01-01 --end 2024-06-30 --no-prompts --auto-apply
A few things to be careful about:
Check the interval count before you apply. For a daily model running from 2024-01-01 to 2024-06-30, that's 182 intervals (2024 is a leap year). If you have 20 models, that's 3,640 interval executions. For a warehouse-intensive model, that could be a very large and expensive backfill. Always run without --auto-apply first to see the plan output.
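It's worth doing this arithmetic before applying. A quick sketch of the estimate, assuming daily intervals; the per-interval cost figure is hypothetical and should come from your own warehouse metrics:

```python
from datetime import date

start, end = date(2024, 1, 1), date(2024, 6, 30)
intervals = (end - start).days + 1   # inclusive of both endpoints
models = 20
executions = intervals * models

# With a hypothetical $0.50 of warehouse spend per interval execution:
estimated_cost = executions * 0.50
```

Even a modest per-interval cost multiplies quickly across models, which is why reviewing the plan output before --auto-apply matters.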
Set batch_size on models with large backfills. SQLMesh can batch intervals together, which is faster but uses more memory and can hit warehouse limits. Configure batch_size in the model definition:
MODEL (
  name analytics.large_model,
  kind INCREMENTAL_BY_TIME_RANGE (
    time_column event_date,
    batch_size 30 -- process 30 days at a time
  )
);
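What batch_size does, conceptually: instead of issuing one query per interval, SQLMesh groups contiguous intervals into batches. A sketch of the grouping logic, assuming daily intervals:

```python
from datetime import date, timedelta

def batches(start: date, end: date, batch_size: int):
    """Split [start, end] into contiguous date ranges of at most batch_size days."""
    out, cur = [], start
    while cur <= end:
        stop = min(cur + timedelta(days=batch_size - 1), end)
        out.append((cur, stop))
        cur = stop + timedelta(days=1)
    return out

# 182 daily intervals with batch_size 30 collapse into 7 queries
ranges = batches(date(2024, 1, 1), date(2024, 6, 30), 30)
```

Fewer, larger queries are usually faster overall, at the cost of more memory per query and a bigger blast radius if one batch fails.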
Incremental models with large history. If a model is defined as INCREMENTAL_BY_TIME_RANGE and you're backfilling years of data, test with a small date range first to get a sense of how long each interval takes before committing to the full backfill.
Monitoring and Observability
Key Metrics to Track
Not everything needs a dashboard, but these metrics are worth capturing:
- Plan evaluation time: how long does sqlmesh plan take? Should be stable. A sudden increase means the state database is growing or model complexity increased.
- Model execution duration: per-model. Regressions here indicate query plan changes or data volume growth.
- Total run duration: end-to-end time for a sqlmesh run call. This is your SLA metric.
- State database size: should grow slowly and predictably. Rapid growth means janitor hasn't run or environments are accumulating.
- Warehouse query count and cost: SQLMesh can generate many queries during backfills. Track warehouse spend and query volume.
Logging
SQLMesh logs to stdout by default. For log aggregation (Datadog, CloudWatch, ELK stack), use the --log-to-stdout flag to ensure logs are captured by your container log collector:
sqlmesh --log-to-stdout run
For Kubernetes, use a node-level log collector (Fluent Bit is common) to ship container stdout to your log aggregation system. Filter on the sqlmesh container name to get only SQLMesh logs. Set up a log-based metric for error-level messages to drive alerting.
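The log-based metric can be as simple as counting error-level lines in the captured stdout. A sketch of the filter; the log line format here is assumed, so match the pattern to your actual SQLMesh output before relying on it:

```python
def count_errors(log_lines):
    """Count error-level records in captured stdout; feed this into a gauge or alert."""
    return sum(1 for line in log_lines if " ERROR " in f" {line} ")

# Hypothetical captured container output
logs = [
    "2026-01-01 00:00:01 INFO  Evaluating model analytics.orders_daily",
    "2026-01-01 00:00:09 ERROR Audit failed: not_null on revenue",
    "2026-01-01 00:00:09 INFO  Run finished with 1 failure",
]
```

Most log platforms (Datadog log-based metrics, CloudWatch metric filters) implement this same pattern-match-and-count natively, so you rarely need to run the code yourself.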
Alerting Priorities
Three tiers:
Page-worthy (wake someone up):
- sqlmesh run exits non-zero after all retries
- State database unreachable
Warning (Slack notification, fix next business day):
- Run duration more than 2x the rolling 7-day average
- Any model execution taking more than 3x its historical average
- State database size growing more than 20% week-over-week
Informational (dashboard only):
- Backfill triggered in prod (worth knowing, not worth alerting)
- Janitor run completion
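The duration thresholds in the warning tier reduce to a comparison against a rolling baseline. A minimal sketch, assuming you already export run durations to wherever your alerting evaluates them:

```python
def duration_alert(history_minutes, latest_minutes, factor=2.0, window=7):
    """Flag a run whose duration exceeds `factor` x the rolling average
    of the last `window` runs (the warning-tier rule above)."""
    recent = history_minutes[-window:]
    baseline = sum(recent) / len(recent)
    return latest_minutes > factor * baseline

history = [30, 32, 31, 29, 33, 30, 31]   # last week of sqlmesh run durations, minutes
ok = duration_alert(history, 45)          # within 2x the baseline
slow = duration_alert(history, 70)        # beyond 2x the baseline
```

The same shape with factor=3.0 covers the per-model rule; in practice you'd express both as recording rules in your metrics system rather than hand-rolled code.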
The Hard Problem: Detecting Wrong Data
SQLMesh won't catch logical errors. A model that joins on the wrong key, applies the wrong filter, or has off-by-one errors in date truncation will execute successfully and record a completed interval. The data will be wrong and nothing in SQLMesh will tell you.
This is not a SQLMesh failure — it's the nature of batch data pipelines. Address it with:
SQLMesh audits. Define audits on critical models to assert row counts, non-null constraints, and referential integrity. Audits are defined in your model files or in separate .sql files in an audits directory:
MODEL (
  name analytics.orders_daily,
  kind INCREMENTAL_BY_TIME_RANGE (
    time_column order_date
  ),
  audits (
    not_null(columns := (order_date, revenue)),
    accepted_range(column := revenue, min_v := 0)
  )
);
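For the standalone flavor mentioned above, a custom audit file looks roughly like this. This is a sketch of SQLMesh's custom-audit form, where the query selects the rows that violate the rule (the file name is hypothetical; check the SQLMesh audit docs for the exact macro set):

```sql
-- audits/assert_no_negative_revenue.sql (hypothetical file name)
AUDIT (
  name assert_no_negative_revenue
);
SELECT *
FROM @this_model
WHERE revenue < 0  -- any returned row fails the audit
```

A model then opts in by listing assert_no_negative_revenue in its audits (...) block, just like the built-ins above.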
Audits run after each model execution. If an audit fails, the interval is marked as failed. This catches null values, out-of-range numbers, and other basic data quality issues at the source.
Downstream data quality tools. Great Expectations, Soda, or Monte Carlo for statistical anomaly detection. These catch things audits miss — distribution shifts, sudden metric changes, and cross-model consistency issues.
Putting It Together: A Scheduled Production Setup
Here's the complete scheduling infrastructure for a production SQLMesh deployment:
# sqlmesh-run: executes due intervals every 15 minutes
apiVersion: batch/v1
kind: CronJob
metadata:
  name: sqlmesh-run
spec:
  schedule: "*/15 * * * *"
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: sqlmesh
              image: your-registry/sqlmesh-runner:latest
              command: ["sqlmesh", "--log-to-stdout", "run"]
              resources:
                requests:
                  memory: "2Gi"
                  cpu: "1"
                limits:
                  memory: "4Gi"
          restartPolicy: OnFailure
---
# sqlmesh-janitor: cleans up orphaned state weekly
apiVersion: batch/v1
kind: CronJob
metadata:
  name: sqlmesh-janitor
spec:
  schedule: "0 2 * * 0"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: sqlmesh
              image: your-registry/sqlmesh-runner:latest
              command: ["sqlmesh", "janitor"]
          restartPolicy: OnFailure
Combined with the CI/CD pipeline from Part 1, this gives you: plans computed and applied through CI on merge, scheduled runs every 15 minutes to process due intervals, and weekly janitor to keep state clean.
What's Next
This post covered the operational layer — state management, scheduling, backfills, and monitoring. If you're coming from dbt, the next post in this series covers the mental model shifts that trip up experienced dbt users when they start working with SQLMesh.
Have feedback or corrections? Reach out directly — we update these posts when we find things that are wrong or outdated.