
Running SQLMesh in Production: Infrastructure, CI/CD, and Deployment

Scotty Pate · 14 min read

Introduction

SQLMesh is earning real adoption in data engineering teams, and for good reason — the virtual environment model, column-level lineage, and the plan workflow solve problems that other tools handle awkwardly or not at all. The documentation covers getting started well. What it covers less well is what happens after you've decided to commit: containerization, CI/CD pipelines, state database choices, and deployment across environments.

This is the first in a three-part series distilling what we've learned running SQLMesh across dozens of projects at varying scales. This post covers the foundation: getting SQLMesh containerized, choosing the right state database, and building CI/CD pipelines that work.

Part 2 covers state management, scheduling, and observability. Part 3 covers what's different for teams coming from dbt.


Container Images

Start with python:3.11-slim. Not Alpine.

Alpine uses musl libc instead of glibc, and several of SQLMesh's dependencies — particularly anything that relies on compiled C extensions — either fail to install cleanly or require build toolchains that add complexity and image size. The supposed image size advantage of Alpine disappears once you've added gcc, musl-dev, and assorted build deps. python:3.11-slim is a better baseline: glibc, smaller than full Debian, and compatible with everything SQLMesh needs.

A minimal working Dockerfile:

FROM python:3.11-slim

WORKDIR /app

# System deps for psycopg2 and cryptography packages
RUN apt-get update && apt-get install -y --no-install-recommends \
    libpq-dev \
    gcc \
    && rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

ENTRYPOINT ["sqlmesh"]
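To sanity-check the image locally before wiring up CI, one option is a small docker-compose file pairing it with a throwaway Postgres state database. This is a sketch, not part of the setup above — the service names, credentials, and command are placeholders:

```yaml
# docker-compose.yaml — hypothetical local smoke test for the image above.
# Your config.yaml's state_connection host would need to point at "state-db".
services:
  state-db:
    image: postgres:16
    environment:
      POSTGRES_DB: sqlmesh_state
      POSTGRES_USER: sqlmesh
      POSTGRES_PASSWORD: localdev   # local only; use real secrets everywhere else
  sqlmesh:
    build: .
    # Appended to the "sqlmesh" ENTRYPOINT from the Dockerfile
    command: ["plan", "dev", "--no-prompts"]
    environment:
      SQLMESH_STATE_DB_PASSWORD: localdev
    depends_on:
      - state-db
```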

Dependency Pinning

Pin everything: not sqlmesh[postgres]>=0.120, but sqlmesh[postgres]==0.125.3. SQLMesh has shipped breaking changes between minor versions, and a pip install during a deploy job that pulls in a new minor version at 2am is not a fun way to discover that.

Your requirements.txt should look like this, not a range:

sqlmesh[postgres]==0.125.3
psycopg2-binary==2.9.9

Lock with pip-compile from pip-tools if you want the full transitive dependency graph locked. At minimum, pin SQLMesh itself exactly.

A minor version bump that changes snapshot fingerprinting will cause SQLMesh to see all models as changed, triggering a full backfill on the next plan. This has happened. Pin your versions.

When you want to upgrade, do it deliberately: update the pin, run sqlmesh plan against a dev environment, verify no unexpected backfills are triggered, then promote through staging before touching prod.

Compute Requirements

Plan evaluation is the expensive operation, not execution. When SQLMesh runs plan, it loads model definitions, computes fingerprints for every snapshot, diffs the current environment state against the desired state, and figures out which intervals need backfilling. For projects with 200+ models, plan evaluation regularly consumes 2-4GB of memory. On a small instance, this will OOM.

Execution resource requirements are gateway-dependent — if you're pushing compute to a cloud data warehouse, your SQLMesh container is mostly coordinating. If you're running DuckDB locally or against a local Postgres, the container is doing real work.

A reasonable starting point for a project with 100-300 models: 1-2 CPUs and 4GB of memory for plan jobs, and roughly half that for run jobs when execution is pushed to the warehouse.

Adjust from there based on actual metrics. The plan container almost always needs more resources than people initially allocate.
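As a concrete sketch, assuming a Kubernetes deployment — the numbers below are illustrative starting points derived from the 2-4GB figure above, not benchmarks:

```yaml
# Illustrative resource block for a plan Job (100-300 models).
# Tune against actual usage; these are starting points, not measured values.
resources:
  requests:
    cpu: "1"
    memory: 2Gi
  limits:
    cpu: "2"
    memory: 4Gi   # plan evaluation regularly peaks in the 2-4GB range
```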

State Database: Use Postgres

Do not use SQLite in production. SQLite is the default and it's fine for local development. It is not fine for production for several reasons:

Concurrent access. SQLite has file-level locking. If two processes try to write to the state database simultaneously — say, a scheduled run and a CI/CD plan — one will block or error. SQLMesh itself handles some of this, but you're building in a fragility that doesn't need to exist.

Container restarts. If your state database is a SQLite file on a container's local filesystem and the container restarts, you lose state. You can work around this with persistent volumes, but at that point you're doing extra work to use a worse solution.

Corruption risk. SQLite is not designed for environments where processes can be killed mid-write. A container OOM event mid-plan can corrupt the state file.

Use Postgres. The configuration is minimal:

# config.yaml
gateways:
  prod:
    state_connection:
      type: postgres
      host: your-postgres-host
      port: 5432
      database: sqlmesh_state
      username: sqlmesh
      password: "{{ env_var('SQLMESH_STATE_DB_PASSWORD') }}"

Run a small Postgres instance — the state database is not high-throughput. An RDS db.t3.small or equivalent handles it easily. What it cannot handle is being unavailable, so prioritize availability (multi-AZ or equivalent) over throughput. No Aurora Serverless with aggressive auto-pause unless you enjoy cold-start latency on every plan.

Kubernetes Considerations

Kubernetes is common but not required. The only hard requirements are: persistent access to the state database and reliable scheduling. ECS, Fly.io, a bare VM with cron — all work.

If you are on Kubernetes, use Jobs for sqlmesh plan and CronJobs for sqlmesh run. Do not run SQLMesh processes as long-lived Deployments — they're batch workloads. Use resource requests and limits so the scheduler can make proper placement decisions, and use a dedicated namespace for SQLMesh jobs to keep RBAC and resource quotas clean.
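A minimal CronJob sketch for sqlmesh run, following the guidance above — the image tag, schedule, namespace, and secret name are placeholders to adapt:

```yaml
# Hypothetical CronJob for scheduled sqlmesh runs; names and values are examples.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: sqlmesh-run
  namespace: sqlmesh            # dedicated namespace, per the note above
spec:
  schedule: "*/15 * * * *"      # align with your most frequent model cron
  concurrencyPolicy: Forbid     # don't overlap runs against the same state DB
  jobTemplate:
    spec:
      backoffLimit: 1
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: sqlmesh
              image: your-registry/sqlmesh:0.125.3
              # --gateway is a global option, so it precedes the subcommand
              args: ["--gateway", "prod", "run"]
              envFrom:
                - secretRef:
                    name: sqlmesh-secrets
              resources:
                requests: { cpu: 500m, memory: 1Gi }
                limits: { memory: 2Gi }
```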


CI/CD Pipeline Design

How sqlmesh plan Works in CI

The plan command is the core of SQLMesh's deployment workflow. It does everything: computes the diff against the current environment, shows you what would change, and — when you confirm — applies those changes. There is no separate apply command. You plan and apply in a single step.

In CI, two flags make this non-interactive: --no-prompts disables the interactive confirmation prompts, and --auto-apply applies the computed plan without asking.

For development environments where you want auto-apply on merge:

sqlmesh plan dev --no-prompts --auto-apply

For production, you typically want a human approval gate between computing the plan and applying it. One approach is to run plan with --no-prompts but without --auto-apply in a PR check (so the team can see what would change), then run the full --auto-apply version after merge with an environment protection gate:

# PR check: compute the plan, skip backfills, just validate
sqlmesh plan prod --no-prompts --skip-backfill

# After merge + approval: compute and apply
sqlmesh plan prod --no-prompts --auto-apply

The --skip-backfill flag (also aliased as --dry-run) creates the virtual environment update but skips executing any backfills. This is useful for validating that a plan computes cleanly without actually running any data through the warehouse.

Capture the exit code. 0 means success. Non-zero means something went wrong — invalid SQL, a plan conflict, a connection failure. Do not proceed on non-zero.

Handling Plan Conflicts

When two branches both modify overlapping models and both try to plan against the same environment, the second plan will fail with a conflict. SQLMesh uses optimistic locking on environment state: the first plan wins, the second detects that the state it computed against has changed.

The fix is straightforward: merge the main branch into your feature branch and re-run sqlmesh plan. The conflict detection is doing exactly what it should — preventing you from clobbering someone else's changes.

Testing in CI

Run sqlmesh test before planning. SQLMesh unit tests are fast and catch model logic errors before you try to apply them:

sqlmesh test
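If you haven't written SQLMesh unit tests yet, they're YAML files under tests/. A minimal sketch — the model, table, and column names here are hypothetical:

```yaml
# tests/test_orders.yaml — hypothetical example of the SQLMesh unit test format:
# given fixed input rows, assert the model's query output exactly.
test_orders_total:
  model: analytics.orders_summary
  inputs:
    raw.orders:
      rows:
        - customer_id: 1
          amount: 10.0
        - customer_id: 1
          amount: 5.0
  outputs:
    query:
      rows:
        - customer_id: 1
          total_amount: 15.0
```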

A Concrete GitHub Actions Workflow

This is a working example, not pseudocode:

name: SQLMesh CI/CD

on:
  pull_request:
    branches: [main]
  push:
    branches: [main]
    tags: ["v*"]  # tag pushes must be listed here, or apply-prod never triggers

env:
  SQLMESH_STATE_DB_PASSWORD: ${{ secrets.SQLMESH_STATE_DB_PASSWORD }}
  SNOWFLAKE_PASSWORD: ${{ secrets.SNOWFLAKE_PASSWORD }}

jobs:
  test-and-plan:
    runs-on: ubuntu-latest
    if: github.event_name == 'pull_request'
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"
          cache: "pip"

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Run unit tests
        run: sqlmesh test

      - name: Validate plan against dev
        run: sqlmesh plan dev --no-prompts --skip-backfill

  apply-dev:
    runs-on: ubuntu-latest
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"
          cache: "pip"

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Plan and apply to dev
        run: sqlmesh plan dev --no-prompts --auto-apply

  apply-prod:
    runs-on: ubuntu-latest
    if: github.event_name == 'push' && startsWith(github.ref, 'refs/tags/v')
    environment: production  # requires manual approval in GitHub
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"
          cache: "pip"

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Plan and apply to prod
        run: sqlmesh plan prod --no-prompts --auto-apply

A few notes on this structure:

The apply-prod job uses a GitHub environment named production. Configure that environment in your repo settings with required reviewers. This gives you a manual approval gate without any custom tooling — GitHub handles the prompt, and the deploy only runs after a human approves it.

The dev job uses --auto-apply because dev is low-stakes. Prod uses --auto-apply too, but only after the environment approval gate. If you want to inspect the plan output before approving, split it into two jobs: one that plans with --skip-backfill and one (triggered manually) that plans and applies.

SQLMesh's Built-In CI/CD Bot

Before you build all of this from scratch, know that SQLMesh ships a first-party GitHub Actions CI/CD bot. Install it with pip install sqlmesh[github] and add a workflow file — the bot handles a lot of what the manual workflow above does, plus things that are harder to build yourself: per-PR preview environments, plan summaries posted as PR comments, and deployment gated on PR approval.

The bot is worth evaluating before rolling your own workflow. The custom GitHub Actions approach above gives you more control and is useful if you need to integrate SQLMesh into an existing CI pipeline with other steps. The built-in bot is faster to set up and handles the PR feedback loop better out of the box.

You can also use both — the bot for PR environments and plan previews, and custom workflows for production deployment with your own approval gates.


Environment Strategy and Deployment

The Standard Pattern

Three tiers cover most teams:

  1. Per-developer environments (alice_dev) for isolated iteration
  2. A shared staging environment, updated automatically on merge to main
  3. prod, updated on release behind an approval gate

Resist the temptation to simplify it down to just dev/prod until you've felt the pain of developers clobbering each other in a shared dev environment.

Gateway Configuration

Define separate gateways for each target, each pointing to a different database:

# config.yaml
gateways:
  dev:
    connection:
      type: snowflake
      account: your-account
      user: "{{ env_var('SNOWFLAKE_USER') }}"
      password: "{{ env_var('SNOWFLAKE_PASSWORD') }}"
      database: ANALYTICS_DEV
      warehouse: DEV_WH
    state_connection:
      type: postgres
      host: your-state-db-host
      database: sqlmesh_state
      username: sqlmesh
      password: "{{ env_var('SQLMESH_STATE_DB_PASSWORD') }}"

  prod:
    connection:
      type: snowflake
      account: your-account
      user: "{{ env_var('SNOWFLAKE_USER') }}"
      password: "{{ env_var('SNOWFLAKE_PASSWORD') }}"
      database: ANALYTICS
      warehouse: PROD_WH
    state_connection:
      type: postgres
      host: your-state-db-host
      database: sqlmesh_state
      username: sqlmesh
      password: "{{ env_var('SQLMESH_STATE_DB_PASSWORD') }}"

default_gateway: dev

Both gateways share the same state database, which is intentional — the state database tracks environments across gateways and needs visibility into all of them.

The different database values (ANALYTICS_DEV vs ANALYTICS) are how you keep dev and prod data isolated in the warehouse.

Promotion Flow

  1. Developer creates alice_dev environment with sqlmesh plan alice_dev --no-prompts --auto-apply
  2. Iterates on models, re-plans as needed
  3. Opens a pull request — CI runs sqlmesh plan dev --no-prompts --skip-backfill to validate
  4. PR merges to main — CI runs sqlmesh plan staging --no-prompts --auto-apply
  5. Team cuts a release tag — CI runs sqlmesh plan prod --no-prompts --auto-apply after manual approval gate

Virtual environments make step 1 free. Step 5 is where the manual gate matters.

Cleaning Up Developer Environments

Developer environments accumulate. Use sqlmesh invalidate to mark an environment for removal — the janitor process cleans it up on its next run:

sqlmesh invalidate alice_dev

This marks the virtual environment for deletion. The actual cleanup happens when sqlmesh janitor runs. Physical tables are only dropped if no other environment references them — SQLMesh handles the reference counting.


Common Infrastructure Pitfalls

Accidental Full Backfill

The most painful one. You run sqlmesh plan prod --start 2020-01-01 --no-prompts --auto-apply and suddenly SQLMesh is running six years of daily intervals across 50 models. That's 109,500 interval executions.

This usually happens when a backfill command is copy-pasted from a dev context into a prod context without checking the start date.

Always run plan without --auto-apply first when specifying historical start dates. Read the plan output. It tells you exactly how many intervals will be processed. If that number surprises you, stop.

# Do this first — plan without auto-apply to see what would happen
sqlmesh plan prod --start 2020-01-01 --no-prompts

# Check the output. If it says "backfill 2,190 intervals", that's 6 years of daily data.
# Only re-run with --auto-apply if you're sure that's what you want.

Connection Pool Exhaustion

SQLMesh executes models concurrently. The default concurrency is fine for small projects but can exhaust connection pools on projects with 50+ models executing in parallel.

Symptoms: intermittent connection errors during runs, models succeeding on retry. Check your database connection pool size (Snowflake virtual warehouse, RDS max_connections, etc.) and tune the concurrent_tasks setting on your gateway connection:

# config.yaml
gateways:
  prod:
    connection:
      type: snowflake
      concurrent_tasks: 10  # reduce if you see pool exhaustion
      # ...

Also check any external connection pooler (e.g., PgBouncer) if you're using one.

Warehouse Cost Surprises on Backfills

Large backfills can generate significant warehouse costs. A backfill of 3 years of a model that processes a large table on each interval is a math problem worth doing before applying: roughly 1,095 daily intervals, each paying the full per-interval scan cost.

Two mitigations:

  1. Set batch_size on your models. This is configured per-model in the MODEL() block. It controls how many intervals are processed per job:
MODEL (
  name analytics.large_model,
  kind INCREMENTAL_BY_TIME_RANGE (
    time_column event_date,
    batch_size 90  -- process 90 days at a time
  )
);
  2. Preview before committing. Run sqlmesh evaluate model_name --start 2024-01-01 --end 2024-01-01 --limit 10 to test the query against a single interval before running the full backfill.

Rolling Back a Bad Deploy

When a production plan goes wrong — a model has a logic bug, or a backfill produces incorrect data — the rollback path is to revert the model code in Git and re-run sqlmesh plan prod --no-prompts --auto-apply. SQLMesh computes a new plan based on the reverted definitions, creates new snapshots, and backfills the affected intervals with the corrected logic.

There is no "undo last plan" command. The rollback is the same workflow as any other deployment: change the code, plan, apply. Your CI/CD pipeline handles this naturally — revert the PR, merge, and the deploy pipeline runs the corrected plan.

For cases where you need to act faster than a full CI cycle, you can run sqlmesh plan directly from a checkout of the last known-good commit. The key thing is that rolling back is a forward operation — you deploy the previous state as a new plan, you don't literally rewind time.
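One way to wire that faster path up is a manually-triggered workflow that deploys an arbitrary ref. A sketch under the same assumptions as the workflow above — the workflow name, input name, and job layout are ours, not a SQLMesh convention:

```yaml
# Hypothetical manual-deploy workflow: roll forward from any known-good ref.
name: SQLMesh Manual Deploy

on:
  workflow_dispatch:
    inputs:
      ref:
        description: "Git ref to deploy (e.g. last known-good SHA)"
        required: true

jobs:
  deploy:
    runs-on: ubuntu-latest
    environment: production   # same approval gate as normal deploys
    steps:
      - uses: actions/checkout@v4
        with:
          ref: ${{ inputs.ref }}
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      - run: sqlmesh plan prod --no-prompts --auto-apply
```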


What's Next

This post covered the foundation — containerization, state database, CI/CD, and multi-environment deployment. In the next post, we cover the operational side: state management, scheduling, and monitoring in a running production system. And if you're coming from dbt, the third post covers the mental model shifts that trip up experienced dbt users.


Have feedback or corrections? Reach out directly — we update these posts when we find things that are wrong or outdated.