Introduction
SQLMesh is earning real adoption in data engineering teams, and for good reason — the virtual environment model, column-level lineage, and the plan workflow solve problems that other tools handle awkwardly or not at all. The documentation covers getting started well. What it covers less well is what happens after you've decided to commit: containerization, CI/CD pipelines, state database choices, and deployment across environments.
This is the first in a three-part series distilling what we've learned running SQLMesh across dozens of projects at varying scales. This post covers the foundation: getting SQLMesh containerized, choosing the right state database, and building CI/CD pipelines that work.
Part 2 covers state management, scheduling, and observability. Part 3 covers what's different for teams coming from dbt.
Container Images
Start with python:3.11-slim. Not Alpine.
Alpine uses musl libc instead of glibc, and several of SQLMesh's dependencies — particularly anything that relies on compiled C extensions — either fail to install cleanly or require build toolchains that add complexity and image size. The supposed image size advantage of Alpine disappears once you've added gcc, musl-dev, and assorted build deps. python:3.11-slim is a better baseline: glibc, smaller than full Debian, and compatible with everything SQLMesh needs.
A minimal working Dockerfile:
FROM python:3.11-slim
WORKDIR /app
# System deps for psycopg2 and cryptography packages
RUN apt-get update && apt-get install -y --no-install-recommends \
libpq-dev \
gcc \
&& rm -rf /var/lib/apt/lists/*
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
ENTRYPOINT ["sqlmesh"]
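If you want gcc and the build packages out of the final image, a multi-stage build works. This is a sketch under the same requirements.txt assumption; it installs libpq5 in the runtime stage as the shared library psycopg2 links against:

```dockerfile
# Build stage: compile any wheels that need the full toolchain
FROM python:3.11-slim AS builder
RUN apt-get update && apt-get install -y --no-install-recommends gcc libpq-dev \
    && rm -rf /var/lib/apt/lists/*
COPY requirements.txt .
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt

# Runtime stage: only runtime libraries, no compiler
FROM python:3.11-slim
RUN apt-get update && apt-get install -y --no-install-recommends libpq5 \
    && rm -rf /var/lib/apt/lists/*
COPY --from=builder /install /usr/local
WORKDIR /app
COPY . .
ENTRYPOINT ["sqlmesh"]
```

The single-stage version above is simpler and fine for most teams; reach for the multi-stage variant when image size or attack surface matters.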
Dependency Pinning
Pin everything. Not sqlmesh[postgres]>=0.120 — sqlmesh[postgres]==0.125.3. SQLMesh has shipped breaking changes between minor versions, and a pip install during a deploy job that pulls in a new minor version at 2am is not a fun way to discover that.
Your requirements.txt should look like this, not a range:
sqlmesh[postgres]==0.125.3
psycopg2-binary==2.9.9
Lock with pip-compile from pip-tools if you want the full transitive dependency graph locked. At minimum, pin SQLMesh itself exactly.
A minor version bump that changes snapshot fingerprinting will cause SQLMesh to see all models as changed, triggering a full backfill on the next plan. This has happened. Pin your versions.
When you want to upgrade, do it deliberately: update the pin, run sqlmesh plan against a dev environment, verify no unexpected backfills are triggered, then promote through staging before touching prod.
Compute Requirements
Plan evaluation is the expensive operation, not execution. When SQLMesh runs plan, it loads model definitions, computes fingerprints for every snapshot, diffs the current environment state against the desired state, and figures out which intervals need backfilling. For projects with 200+ models, plan evaluation regularly consumes 2-4GB of memory. On a small instance, this will OOM.
Execution resource requirements are gateway-dependent — if you're pushing compute to a cloud data warehouse, your SQLMesh container is mostly coordinating. If you're running DuckDB locally or against a local Postgres, the container is doing real work.
A reasonable starting point for a project with 100-300 models:
- Plan container: 4GB memory, 2 vCPU
- Run container: 2GB memory, 1 vCPU
Adjust from there based on actual metrics. The plan container almost always needs more resources than people initially allocate.
State Database: Use Postgres
Do not use the default file-backed state database in production. Out of the box, SQLMesh keeps state in a local DuckDB file, and that default is fine for local development. It is not fine for production for several reasons:
Concurrent access. A file-backed state store allows only one writer at a time. If two processes try to write to the state database simultaneously — say, a scheduled run and a CI/CD plan — one will block or error. SQLMesh itself handles some of this, but you're building in a fragility that doesn't need to exist.
Container restarts. If your state database is a file on a container's local filesystem and the container restarts, you lose state. You can work around this with persistent volumes, but at that point you're doing extra work to use a worse solution.
Corruption risk. Embedded file databases are not built for environments where processes can be killed mid-write. A container OOM event mid-plan can corrupt the state file.
Use Postgres. The configuration is minimal:
# config.yaml
gateways:
prod:
state_connection:
type: postgres
host: your-postgres-host
port: 5432
database: sqlmesh_state
username: sqlmesh
password: "{{ env_var('SQLMESH_STATE_DB_PASSWORD') }}"
Run a small Postgres instance — the state database is not high-throughput. An RDS db.t3.small or equivalent handles it easily. What it cannot handle is being unavailable, so make sure it's in a reliable setup. No Aurora Serverless with aggressive auto-pause unless you enjoy cold start latency on every plan.
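Provisioning is a one-time step. A minimal sketch, using the role and database names from the config above (the password is a placeholder you would inject from your secrets manager):

```sql
-- Run as a Postgres superuser or admin role
CREATE ROLE sqlmesh WITH LOGIN PASSWORD 'change-me';  -- placeholder password
CREATE DATABASE sqlmesh_state OWNER sqlmesh;
```

SQLMesh creates its own state tables on first connect, so no schema setup beyond this is needed.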
Kubernetes Considerations
Kubernetes is common but not required. The only hard requirements are: persistent access to the state database and reliable scheduling. ECS, Fly.io, a bare VM with cron — all work.
If you are on Kubernetes, use Jobs for sqlmesh plan and CronJobs for sqlmesh run. Do not run SQLMesh processes as long-lived Deployments — they're batch workloads. Use resource requests and limits so the scheduler can make proper placement decisions, and use a dedicated namespace for SQLMesh jobs to keep RBAC and resource quotas clean.
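As a sketch of what the scheduled run looks like (the image name, namespace, schedule, and the sqlmesh-secrets Secret are placeholders to adapt):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: sqlmesh-run
  namespace: sqlmesh         # dedicated namespace for SQLMesh jobs
spec:
  schedule: "*/15 * * * *"   # at least as frequent as your smallest model cron
  concurrencyPolicy: Forbid  # don't start a run while the previous one is active
  jobTemplate:
    spec:
      backoffLimit: 1
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: sqlmesh
              image: your-registry/sqlmesh:0.125.3
              args: ["run"]
              envFrom:
                - secretRef:
                    name: sqlmesh-secrets  # state DB and warehouse credentials
              resources:
                requests:
                  memory: "2Gi"
                  cpu: "1"
                limits:
                  memory: "2Gi"
```

concurrencyPolicy: Forbid matters here: overlapping runs are exactly the concurrent state access scenario the Postgres section warns about.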
CI/CD Pipeline Design
How sqlmesh plan Works in CI
The plan command is the core of SQLMesh's deployment workflow. It does everything: computes the diff against the current environment, shows you what would change, and — when you confirm — applies those changes. There is no separate apply command. You plan and apply in a single step.
In CI, two flags make this non-interactive:
- --no-prompts skips the interactive confirmation prompt (required for any non-interactive context, or your CI job will hang waiting for stdin)
- --auto-apply automatically applies the plan after computing it
For development environments where you want auto-apply on merge:
sqlmesh plan dev --no-prompts --auto-apply
For production, you typically want a human approval gate between computing the plan and applying it. One approach is to run plan with --no-prompts but without --auto-apply in a PR check (so the team can see what would change), then run the full --auto-apply version after merge with an environment protection gate:
# PR check: compute the plan, skip backfills, just validate
sqlmesh plan prod --no-prompts --skip-backfill
# After merge + approval: compute and apply
sqlmesh plan prod --no-prompts --auto-apply
The --skip-backfill flag (also aliased as --dry-run) creates the virtual environment update but skips executing any backfills. This is useful for validating that a plan computes cleanly without actually running any data through the warehouse.
Capture the exit code. 0 means success. Non-zero means something went wrong — invalid SQL, a plan conflict, a connection failure. Do not proceed on non-zero.
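If you wrap SQLMesh in a Python orchestration script rather than a bare CI step, the same rule applies. A minimal sketch — the sqlmesh command lines in the comments are the ones from above; the executed commands here are stand-ins so the contract is visible:

```python
import subprocess
import sys

def run_step(cmd):
    """Run one pipeline step; stop the whole job on a non-zero exit code."""
    result = subprocess.run(cmd)
    if result.returncode != 0:
        print(f"step failed with exit code {result.returncode}: {cmd}", file=sys.stderr)
        sys.exit(result.returncode)
    return result.returncode

# Real usage would be:
#   run_step(["sqlmesh", "test"])
#   run_step(["sqlmesh", "plan", "dev", "--no-prompts", "--skip-backfill"])
run_step(["true"])  # succeeding stand-in command: execution continues
```

Most CI systems do this for you when each command is its own step, but anything that shells out from a script needs the explicit check.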
Handling Plan Conflicts
When two branches both modify overlapping models and both try to plan against the same environment, the second plan will fail with a conflict. SQLMesh uses optimistic locking on environment state: the first plan wins, the second detects that the state it computed against has changed.
The fix is straightforward: merge the main branch into your feature branch and re-run sqlmesh plan. The conflict detection is doing exactly what it should — preventing you from clobbering someone else's changes.
Testing in CI
Run sqlmesh test before planning. SQLMesh unit tests are fast and catch model logic errors before you try to apply them:
sqlmesh test
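If you haven't written one yet, a SQLMesh unit test is a YAML file under tests/. This sketch is modeled on the format in the SQLMesh docs; the model and column names are placeholders for your project:

```yaml
test_orders_by_item:
  model: analytics.orders_by_item        # model under test (placeholder name)
  inputs:
    analytics.orders:                    # upstream model, mocked with fixed rows
      rows:
        - id: 1
          item_id: 1
        - id: 2
          item_id: 1
  outputs:
    query:                               # expected result of the model's query
      rows:
        - item_id: 1
          num_orders: 2
```

Because these run against DuckDB, they need no warehouse credentials and add only seconds to the CI job.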
A Concrete GitHub Actions Workflow
This is a working example, not pseudocode:
name: SQLMesh CI/CD
on:
pull_request:
branches: [main]
push:
branches: [main]
env:
SQLMESH_STATE_DB_PASSWORD: ${{ secrets.SQLMESH_STATE_DB_PASSWORD }}
SNOWFLAKE_PASSWORD: ${{ secrets.SNOWFLAKE_PASSWORD }}
jobs:
test-and-plan:
runs-on: ubuntu-latest
if: github.event_name == 'pull_request'
steps:
- uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: "3.11"
cache: "pip"
- name: Install dependencies
run: pip install -r requirements.txt
- name: Run unit tests
run: sqlmesh test
- name: Validate plan against dev
run: sqlmesh plan dev --no-prompts --skip-backfill
apply-dev:
runs-on: ubuntu-latest
if: github.event_name == 'push' && github.ref == 'refs/heads/main'
steps:
- uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: "3.11"
cache: "pip"
- name: Install dependencies
run: pip install -r requirements.txt
- name: Plan and apply to dev
run: sqlmesh plan dev --no-prompts --auto-apply
apply-prod:
runs-on: ubuntu-latest
if: github.event_name == 'push' && startsWith(github.ref, 'refs/tags/v')
environment: production # requires manual approval in GitHub
steps:
- uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: "3.11"
cache: "pip"
- name: Install dependencies
run: pip install -r requirements.txt
- name: Plan and apply to prod
run: sqlmesh plan prod --no-prompts --auto-apply
A few notes on this structure:
The apply-prod job uses a GitHub environment named production. Configure that environment in your repo settings with required reviewers. This gives you a manual approval gate without any custom tooling — GitHub handles the prompt, and the deploy only runs after a human approves it.
The dev job uses --auto-apply because dev is low-stakes. Prod uses --auto-apply too, but only after the environment approval gate. If you want to inspect the plan output before approving, split it into two jobs: one that plans with --skip-backfill and one (triggered manually) that plans and applies.
SQLMesh's Built-In CI/CD Bot
Before you build all of this from scratch, know that SQLMesh ships a first-party GitHub Actions CI/CD bot. Install it with pip install sqlmesh[github] and add a workflow file — the bot handles a lot of what the manual workflow above does, plus things that are harder to build yourself:
- Automatic PR environments. On every pull request, the bot creates an isolated environment representing your code changes. It reuses existing tables via virtual environments, so this is fast.
- Plan summaries as PR comments. The bot posts a comment on your PR showing exactly which models changed, what type of changes they are, and what intervals would be backfilled. Your team reviews the plan directly in the PR — no need to dig through CI logs.
- Unit test and lint results. Tests run automatically against DuckDB (no warehouse credentials needed for unit tests), and results are posted to the PR.
- Production deployment. The bot supports both synchronized deployment (code and data update together, triggered by /deploy commands or PR approval) and desynchronized deployment (code merges first, production data updates happen on a separate schedule).
The bot is worth evaluating before rolling your own workflow. The custom GitHub Actions approach above gives you more control and is useful if you need to integrate SQLMesh into an existing CI pipeline with other steps. The built-in bot is faster to set up and handles the PR feedback loop better out of the box.
You can also use both — the bot for PR environments and plan previews, and custom workflows for production deployment with your own approval gates.
Environment Strategy and Deployment
The Standard Pattern
Three tiers covers most teams:
- Per-developer environments: each developer has their own SQLMesh environment (sqlmesh plan alice_dev --no-prompts --auto-apply). The environment name is arbitrary — SQLMesh creates it on first use. This is cheap because of virtual environments — no data duplication for unchanged models.
- Shared staging environment: auto-applied on merge to main.
- Production environment: deployed on tagged releases with manual approval.
Resist the temptation to simplify it down to just dev/prod until you've felt the pain of developers clobbering each other in a shared dev environment.
Gateway Configuration
Define separate gateways for each target, each pointing to a different database:
# config.yaml
gateways:
dev:
connection:
type: snowflake
account: your-account
user: "{{ env_var('SNOWFLAKE_USER') }}"
password: "{{ env_var('SNOWFLAKE_PASSWORD') }}"
database: ANALYTICS_DEV
warehouse: DEV_WH
state_connection:
type: postgres
host: your-state-db-host
database: sqlmesh_state
username: sqlmesh
password: "{{ env_var('SQLMESH_STATE_DB_PASSWORD') }}"
prod:
connection:
type: snowflake
account: your-account
user: "{{ env_var('SNOWFLAKE_USER') }}"
password: "{{ env_var('SNOWFLAKE_PASSWORD') }}"
database: ANALYTICS
warehouse: PROD_WH
state_connection:
type: postgres
host: your-state-db-host
database: sqlmesh_state
username: sqlmesh
password: "{{ env_var('SQLMESH_STATE_DB_PASSWORD') }}"
default_gateway: dev
Both gateways share the same state database, which is intentional — the state database tracks environments across gateways and needs visibility into all of them.
The different database values (ANALYTICS_DEV vs ANALYTICS) are how you keep dev and prod data isolated in the warehouse.
Promotion Flow
1. Developer creates an alice_dev environment with sqlmesh plan alice_dev --no-prompts --auto-apply
2. Iterates on models, re-plans as needed
3. Opens a pull request — CI runs sqlmesh plan dev --no-prompts --skip-backfill to validate
4. PR merges to main — CI runs sqlmesh plan staging --no-prompts --auto-apply
5. Team cuts a release tag — CI runs sqlmesh plan prod --no-prompts --auto-apply after the manual approval gate
Virtual environments make step 1 free. Step 5 is where the manual gate matters.
Cleaning Up Developer Environments
Developer environments accumulate. Use sqlmesh invalidate to mark an environment for removal — the janitor process cleans it up on its next run:
sqlmesh invalidate alice_dev
This marks the virtual environment for deletion. The actual cleanup happens when sqlmesh janitor runs. Physical tables are only dropped if no other environment references them — SQLMesh handles the reference counting.
Common Infrastructure Pitfalls
Accidental Full Backfill
The most painful one. You run sqlmesh plan prod --start 2020-01-01 --no-prompts --auto-apply and suddenly SQLMesh is running six years of daily intervals across 50 models. That's 109,500 interval executions.
This usually happens when a backfill command is copy-pasted from a dev context into a prod context without checking the start date.
Always run plan without --auto-apply first when specifying historical start dates. Read the plan output. It tells you exactly how many intervals will be processed. If that number surprises you, stop.
# Do this first — plan without auto-apply to see what would happen
sqlmesh plan prod --start 2020-01-01 --no-prompts
# Check the output. If it says "backfill 2,190 intervals" for a single model, that's 6 years of daily data.
# Only re-run with --auto-apply if you're sure that's what you want.
Connection Pool Exhaustion
SQLMesh executes models concurrently. The default concurrency is fine for small projects but can exhaust connection pools on projects with 50+ models executing in parallel.
Symptoms: intermittent connection errors during runs, models succeeding on retry. Check your database connection pool size (Snowflake virtual warehouse, RDS max_connections, etc.) and tune the concurrent_tasks setting on your gateway connection:
# config.yaml
gateways:
prod:
connection:
type: snowflake
concurrent_tasks: 10 # reduce if you see pool exhaustion
# ...
Also check any external connection pooler (e.g., PgBouncer) if you're using one.
Warehouse Cost Surprises on Backfills
Large backfills can generate significant warehouse costs. A backfill of 3 years of a model that processes a large table on each interval is a math problem worth doing before applying.
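That math is worth sketching explicitly. A back-of-envelope estimator — every number below is a made-up placeholder; plug in your own per-interval runtime and warehouse pricing:

```python
def estimate_backfill(days, models, minutes_per_interval,
                      credits_per_hour, dollars_per_credit):
    """Rough warehouse cost for a backfill: intervals x runtime x hourly rate."""
    intervals = days * models
    warehouse_hours = intervals * minutes_per_interval / 60
    cost = warehouse_hours * credits_per_hour * dollars_per_credit
    return intervals, warehouse_hours, cost

# Hypothetical 3-year daily backfill of one heavy model:
intervals, hours, dollars = estimate_backfill(
    days=3 * 365,
    models=1,
    minutes_per_interval=2.0,  # assumed per-interval runtime
    credits_per_hour=1.0,      # e.g. a small warehouse
    dollars_per_credit=3.0,    # assumed contract rate
)
print(intervals, hours, round(dollars, 2))  # 1095 36.5 109.5
```

Five minutes with numbers like these, before you apply, is cheaper than discovering the real figure on next month's invoice.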
Two mitigations:
- Set batch_size on your models. This is configured per-model in the MODEL() block. It controls how many intervals are processed per job:
MODEL (
name analytics.large_model,
kind INCREMENTAL_BY_TIME_RANGE (
time_column event_date,
batch_size 90 -- process 90 days at a time
)
);
- Preview before committing. Run sqlmesh evaluate model_name --start 2024-01-01 --end 2024-01-01 --limit 10 to test the query against a single interval before running the full backfill.
Rolling Back a Bad Deploy
When a production plan goes wrong — a model has a logic bug, or a backfill produces incorrect data — the rollback path is to revert the model code in Git and re-run sqlmesh plan prod --no-prompts --auto-apply. SQLMesh computes a new plan based on the reverted definitions, creates new snapshots, and backfills the affected intervals with the corrected logic.
There is no "undo last plan" command. The rollback is the same workflow as any other deployment: change the code, plan, apply. Your CI/CD pipeline handles this naturally — revert the PR, merge, and the deploy pipeline runs the corrected plan.
For cases where you need to act faster than a full CI cycle, you can run sqlmesh plan directly from a checkout of the last known-good commit. The key thing is that rolling back is a forward operation — you deploy the previous state as a new plan, you don't literally rewind time.
What's Next
This post covered the foundation — containerization, state database, CI/CD, and multi-environment deployment. In the next post, we cover the operational side: state management, scheduling, and monitoring in a running production system. And if you're coming from dbt, the third post covers the mental model shifts that trip up experienced dbt users.
Have feedback or corrections? Reach out directly — we update these posts when we find things that are wrong or outdated.