The most interesting part of building with language models is the part nobody writes about. The writing is about benchmarks, context windows, which lab shipped what on Tuesday, whether the bubble is popping, and whether radiologists are safe. The work is about tool schemas, partial unique indices on remediation tables, admission webhooks that inject git-sync init containers, GitHub App scopes, and whether your agent can path-traverse out of a cloned repo.
The internet has two takes on language models: the model is about to put every white collar profession out of work, or the model is a parlor trick that will crash in the next downturn. Both takes argue about how capable the model is. Neither touches the system around the model, because the system is not a story. It is forty lines of Go that refuse to start a second remediation when one is already running. It is a Kyverno policy that injects a workspace volume before a pod can start. It is a GitHub App with contents:write and nothing else. None of that gets retweeted. All of it decides whether the model's output ever reaches production without breaking something.
Principle 1: Constrain the agent through its interface, not its instructions
The prompt is a suggestion. The tool surface is a contract. The code inside each tool is enforcement. The permission model the code calls into is the wall. An agent stays in bounds because the wall is real, not because you asked nicely.
Start with the tool surface. An agent for this kind of job needs exactly the tools required to do it and not one more. Ten was enough:
| Tool | Type | Purpose |
|---|---|---|
| read_repo_file | Read | Read a file from the cloned repo |
| list_repo_files | Read | List repository structure |
| get_project_info | Read | Project metadata |
| list_job_runs | Read | Recent job runs |
| get_job_logs | Read | Full logs for a job run |
| create_branch | Write | Create a scoped fix branch |
| write_repo_file | Write | Write file content (path traversal protected) |
| commit_and_push | Write | Stage, commit, push to remote |
| create_pull_request | Write | Open a GitHub PR |
| merge_pull_request | Write | Squash-merge a PR (when auto-merge is enabled) |
That is the entire surface area. The agent cannot run arbitrary shell commands. It cannot exec into a pod. It cannot read from the production database. It cannot delete a branch. It cannot force-push. It cannot touch any repo it was not granted access to. With ten tools, the action space is small, the prompt stays small, the cost per run stays low, and the failure modes are enumerable. Any misbehavior can be replayed as a sequence of tool calls, because there is no behavior outside the tool list.
Compare this to "give the agent a bash terminal." Action space unbounded. Prompt has to defend against every sharp edge in coreutils. Post-incident review has to reason about arbitrary side effects.
Now the per-tool enforcement. Examples:
Path traversal. The file-write tool resolves the requested path against the repo root and rejects anything that escapes. This is in Go, in the tool handler, before any write hits the filesystem. The LLM cannot ask its way out of it.
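That check can be sketched in a few lines. The helper name `safeRepoPath` is ours, not the system's, but the shape is the standard one: resolve, clean, then verify the result is still under the root.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// safeRepoPath resolves a tool-supplied relative path against the repo
// root and rejects anything that escapes it. (A production version would
// also run filepath.EvalSymlinks to close the symlink variant.)
func safeRepoPath(repoRoot, requested string) (string, error) {
	root, err := filepath.Abs(repoRoot)
	if err != nil {
		return "", err
	}
	// Join and clean: "../", "./", and redundant separators collapse here.
	full := filepath.Join(root, requested)
	if full != root && !strings.HasPrefix(full, root+string(filepath.Separator)) {
		return "", fmt.Errorf("path %q escapes repo root", requested)
	}
	return full, nil
}

func main() {
	if _, err := safeRepoPath("/tmp/repo", "../../etc/passwd"); err != nil {
		fmt.Println("rejected:", err)
	}
	if p, err := safeRepoPath("/tmp/repo", "models/orders.sql"); err == nil {
		fmt.Println("allowed:", p)
	}
}
```

The important property is where it runs: inside the tool handler, so no phrasing of the request can route around it.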
Branch name isolation. The agent never writes to main. The branch-creation tool generates a scoped branch name outside the LLM. The LLM never picks the branch name. Accidental commits to main are structurally impossible, not just discouraged.
Minimal GitHub App scopes. The GitHub App that authenticates git operations holds contents:write and pull_requests:write and nothing else. No admin, no delete, no organization-level permissions. If the LLM hallucinates a tool call that tries to delete a repo, GitHub rejects the call before it lands. The wall is in GitHub's permission model, not in code.
Deduplication in the database. Only one active fix attempt per project at a time. Enforced by a unique partial index:
```sql
CREATE UNIQUE INDEX one_active_remediation_per_project
    ON remediation_attempts (project_id)
    WHERE status NOT IN ('succeeded', 'failed', 'cancelled');
```
If the agent service tries to start a second attempt while one is already running, the INSERT fails. The database refuses to record a runaway loop.
Circuit breaker in twelve lines of Go. Jobs produced by the agent carry a triggered_by field set to the agent's identity. The job-failure handler reads that field and refuses to start another fix attempt on a job the agent itself produced. The agent cannot chain-react on its own output. Twelve lines, in the failure handler, not in the prompt.
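A sketch of that guard, with the field and identity names assumed for illustration:

```go
package main

import "fmt"

const agentIdentity = "remediation-agent" // assumed identity string

// shouldRemediate is the job-failure handler's gate: never start a new
// fix attempt for a job the agent itself produced.
func shouldRemediate(triggeredBy string, remediationEnabled bool) bool {
	if !remediationEnabled {
		return false
	}
	if triggeredBy == agentIdentity {
		// The failed job came from a previous fix attempt:
		// stop here rather than chain-react on our own output.
		return false
	}
	return true
}

func main() {
	fmt.Println(shouldRemediate("scheduler", true))   // true: a normal failure
	fmt.Println(shouldRemediate(agentIdentity, true)) // false: breaker trips
}
```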
Every rule that matters belongs in code, in schema, in policy, or in IAM. Anything in a prompt is a suggestion the model is free to misread.
Principle 2: Close the loop through review paths you already trust
A suggestion is an unfinished feature.
If your AI tool reads a log and writes a paragraph telling the operator what to do, you have shifted work onto the operator. They still have to read the paragraph, evaluate it, find the file, apply the change, commit it, push it, open a PR, get review, merge, redeploy, and verify the fix. The agent did the easy 20%. The operator does the rest. The "AI" part looks impressive in a demo and saves nobody any time.
The loop has to close end-to-end, and it has to close through the review path the team already uses. Both halves matter. Closing the loop without review is reckless. Reviewing without closing the loop is a suggestion engine.
Here is what end-to-end looks like when a scheduled data job fails:
1. A scheduled data job fails. A sidecar that watches job pods detects the non-zero exit and parses the run summary.
2. The sidecar posts the failure to the control plane.
3. The control plane checks: is autonomous remediation enabled for this project? Was the failed job itself produced by a previous fix attempt (circuit breaker)? Is another fix attempt already running for this project? Has this particular failure already been handled?
4. If all checks pass, the control plane dispatches to the agent service.
5. The agent service records a fix-attempt row, mints a short-lived GitHub App installation token for the project's repo, and clones the repo over HTTPS.
6. The LLM runs in a tool loop. It reads the failure logs, reads the surrounding code, identifies the root cause, writes the fix, creates a branch, commits, pushes, and opens a pull request.
7. If the project has auto-merge enabled, the agent service squash-merges the PR and flags the next deployment for auto-approval.
8. ArgoCD detects the merge. The next deployment runs. The previously failed job runs again, this time successfully. The sidecar reports the success.
9. The control plane closes the fix-attempt row as succeeded and posts a Slack notification with a link to the diff.
With auto-merge enabled, steps 1 through 9 happen without human action. The operator finds out there was a problem by reading the Slack message that says it is already fixed. Mean time to recovery becomes "however long the model takes to think," which is about two minutes.
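The step-3 gate collapses to one function over the failure event and project state. All names here are assumptions for the sketch; in production the active-attempt check is also enforced by the partial unique index, so a race at this layer still cannot create two open attempts.

```go
package main

import "fmt"

const agentIdentity = "remediation-agent" // assumed

type FailureEvent struct {
	ProjectID      string
	TriggeredBy    string // who produced the failed job
	AlreadyHandled bool   // this exact failure was seen before
}

type ProjectState struct {
	RemediationEnabled bool
	ActiveAttempt      bool // a fix-attempt row is still open
}

// canDispatch mirrors the control plane's pre-dispatch checks and
// returns the reason when it refuses.
func canDispatch(ev FailureEvent, st ProjectState) (bool, string) {
	switch {
	case !st.RemediationEnabled:
		return false, "remediation disabled for project"
	case ev.TriggeredBy == agentIdentity:
		return false, "circuit breaker: job was produced by a fix attempt"
	case st.ActiveAttempt:
		return false, "another fix attempt is already running"
	case ev.AlreadyHandled:
		return false, "failure already handled"
	}
	return true, ""
}

func main() {
	ok, why := canDispatch(
		FailureEvent{ProjectID: "orders-mart", TriggeredBy: "scheduler"},
		ProjectState{RemediationEnabled: true},
	)
	fmt.Println(ok, why)
}
```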
Now the review half. The PR in step 6 is a normal GitHub PR, authored by a bot account the GitHub App authenticates as. Your branch protections apply to it. Your required reviewers apply to it. Your CODEOWNERS apply to it. Your status checks apply to it. Your CI runs against it. If your team requires two human reviewers on changes under models/critical/, the bot's PR sits in a pending state until two humans have reviewed it. The system cannot bypass any of this. The GitHub App does not have the permission to override branch protections.
Auto-merge is per-project and off by default. Teams turn it on for low-stakes projects where the cost of a wrong fix is "the next run also fails and the agent tries again." They keep it off for anything that touches finance models, customer-facing dashboards, or compliance reporting. The decision is yours, in the project settings, and the system enforces it.
The same rule applies inside the team that built the agent. Every change to the loop goes through a multi-role review covering architecture, implementation, operations, customer perspective, and reality-checking before it lands. The reasoning is the same as the principle itself: a multi-role review path catches the mistakes a single reviewer will not. Skipping it is how you ship subtly broken systems with confidence.
Principle 3: Design the dev loop to iterate in seconds
Iteration speed is a property of the system, not of the engineer. If the loop from "change a line of code" to "see the change behave on a real failing pipeline" takes an hour, you will iterate once an hour. If it takes thirty seconds, you will iterate hundreds of times a day. Result quality is downstream of iteration count. Every other improvement compounds against it.
The inner dev cycle for the agent is:
- The agent service runs locally, with the same LLM client and the same tool implementations as production.
- A fixture repository contains a deliberately broken project with a known failure mode.
- A test harness posts a trigger request to the local service against the fixture.
- The service clones the fixture, runs the tool loop, and either opens a PR against a sandbox repo or dumps the proposed diff to stdout.
From save to result: under thirty seconds. Tool definitions, prompt structure, and guardrails were all iterated on by running this loop hundreds of times.
The same principle drives the user-facing iteration loop. A failing pipeline triggers a fix attempt. The fix attempt produces a PR. The PR runs in CI. If the fix is wrong, the next run fails again and the agent tries again with the new context. Each attempt is roughly two minutes. The user iterates by watching, not by typing.
If your iteration loop is slow, fix the loop before you fix anything else.
Principle 4: Push complexity into platform extension points and existing frameworks
If someone else's code already solves the problem, do not write your own. This has two forms: use your platform's extension points (admission webhooks, policy engines, CRDs) for complexity every workload shares, and use existing frameworks for complexity with a well-understood contract (protocol-level loops, deployment controllers, auth handshakes).
Application code is the worst place for complexity every workload needs. Every job pod in the system needs the same scaffolding: a workspace volume, a git-sync init container that clones the repo, project environment variables from a ConfigMap, project secrets from a Secret, file-based secrets mounted at /secrets, per-gateway database credentials, workload identity tokens, CPU and memory limits, the control plane URL, and an internal token for callbacks.
If the scaffolding lives in application code, four services end up each constructing a 200-line Job spec. Every change to the scaffolding becomes a change to four codebases. Every drift between them is a production incident waiting to happen.
Instead, every service emits a minimal Job with annotations:
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: my-project-run-20260407-120000
  namespace: account-org-id
  annotations:
    dagctl.io/git-url: "git@github.com:org/repo.git"
    dagctl.io/git-branch: "main"
    dagctl.io/git-path: "."
    dagctl.io/image: "runner:v1.42"
    dagctl.io/cpu-request: "500m"
    dagctl.io/memory-request: "1Gi"
    dagctl.io/cpu-limit: "2000m"
    dagctl.io/memory-limit: "7Gi"
    dagctl.io/use-managed-state: "true"
    dagctl.io/gateways: "LOCAL,CLOUD"
spec:
  template:
    spec:
      containers:
        - name: runner
          image: placeholder # mutated by the admission webhook
      restartPolicy: OnFailure
```
A Kyverno admission webhook reads the annotations when the Job creates a Pod and mutates the pod spec. It injects the workspace volume, adds the git-sync init container, mounts the env-vars ConfigMap, attaches the secrets, sets resource requests and limits, mounts the database credentials, configures workload identity, and copies the internal token into the pod.
The application code is now 20 lines. The 180 lines of scaffolding live in five Kyverno policies every pod in the cluster shares. Change the git-sync image version? Change one ClusterPolicy file. Every job in the cluster picks it up on the next pod creation. No service redeploys. No application code changes. No drift.
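One of those policies might look roughly like this sketch. The rule name, git-sync image tag, and the exact patch are illustrative; the real policies also cover secrets, credentials, identity, and resource limits:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: inject-git-sync # illustrative name
spec:
  rules:
    - name: add-git-sync-init-container
      match:
        any:
          - resources:
              kinds: [Pod]
      preconditions:
        all:
          - key: '{{ request.object.metadata.annotations."dagctl.io/git-url" || "" }}'
            operator: NotEquals
            value: ""
      mutate:
        patchStrategicMerge:
          spec:
            initContainers:
              - name: git-sync
                image: registry.k8s.io/git-sync/git-sync:v4.2.1 # assumed version
                env:
                  - name: GITSYNC_REPO
                    value: '{{ request.object.metadata.annotations."dagctl.io/git-url" }}'
                  - name: GITSYNC_REF
                    value: '{{ request.object.metadata.annotations."dagctl.io/git-branch" }}'
                volumeMounts:
                  - name: workspace
                    mountPath: /workspace
            volumes:
              - name: workspace
                emptyDir: {}
```

Bumping the git-sync version is then a one-line change to this file, picked up cluster-wide on the next pod creation.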
This is not "use Kubernetes for everything." It is the recognition that some categories of complexity are platform-shaped and belong at the platform layer. When you find yourself writing the same setup code in three services, that code wants to live below your services, not inside them.
The same rule applies one layer up, with frameworks instead of platform extension points. Four examples:
The LLM tool-use protocol as a framework-owned loop. No parsing LLM output for tool intent. No hand-rolling a JSON schema validator. No managing turn-by-turn conversation state. No handling partial tool-argument streaming. Bedrock's tool-use protocol does all of that: tools are defined as JSON schema and registered, the service responds to tool_use blocks with tool_result blocks, and the framework owns the turn state. You own the tools.
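Schematically, the service's side of that loop reduces to dispatch-and-append. The types below are ours, standing in for the provider's protocol objects, and the stub model exists only so the sketch runs:

```go
package main

import "fmt"

// Schematic stand-ins for the provider's tool-use protocol objects.
type ToolUse struct {
	ID, Name string
	Args     map[string]string
}
type Turn struct {
	Text     string
	ToolUses []ToolUse // empty means the model is done
}

type Tool func(args map[string]string) string

// runLoop drives the conversation: call the model, run any requested
// tools, feed the results back as tool_result, repeat until no tool
// calls remain.
func runLoop(model func(results map[string]string) Turn, tools map[string]Tool) string {
	results := map[string]string{}
	for {
		turn := model(results)
		if len(turn.ToolUses) == 0 {
			return turn.Text
		}
		results = map[string]string{}
		for _, tu := range turn.ToolUses {
			results[tu.ID] = tools[tu.Name](tu.Args) // tool_result for this tool_use
		}
	}
}

func main() {
	calls := 0
	// Stub model: one tool call, then a final answer using the result.
	model := func(results map[string]string) Turn {
		calls++
		if calls == 1 {
			return Turn{ToolUses: []ToolUse{{ID: "t1", Name: "get_job_logs", Args: map[string]string{"run": "123"}}}}
		}
		return Turn{Text: "root cause: " + results["t1"]}
	}
	tools := map[string]Tool{
		"get_job_logs": func(args map[string]string) string { return "column renamed upstream" },
	}
	fmt.Println(runLoop(model, tools))
}
```

Everything the framework handles (streaming, schema validation, turn state) is absent from this sketch on purpose; what remains is the only part you own.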
ArgoCD Applications as the deployment unit. Per-tenant agent services are not deployed by a custom controller. They are ArgoCD Applications with Helm parameters. The control plane creates the Application object during tenant onboarding. ArgoCD's reconciliation loop handles drift detection, sync, rollback, and health checks. Deployment substrate, for free.
GitHub App for git auth. No storing SSH keys for customer repos. No rotating PATs. No long-lived tokens. One RSA private key in AWS Secrets Manager, sign a JWT, exchange it for a one-hour installation token, clone over HTTPS with x-access-token. Token caching, expiry, and refresh are about 80 lines of Go wrapping a well-defined GitHub API. The framework is GitHub's OAuth-for-apps spec. You are a consumer.
Kubernetes CRDs for project state. Project status visible from the cluster is stored in a custom resource. The sidecar updates it. The web UI reads it through the control plane. No custom state store, no custom watchers, no custom reconciler. Kubernetes already ships one.
The test for whether a framework is the right container: if you pulled it out and wrote the equivalent yourself, would the result be smaller, simpler, and less buggy than the framework? Almost always, no. Use the framework. Spend your engineering on the parts that are genuinely yours.
Principle 5: Fix root causes, not symptoms
This is the rule the agent follows for the code it edits, and the rule the team follows when the agent misbehaves.
For the agent: when a model fails because an upstream column was renamed, the fix is to update the downstream model. Not to wrap the failing query in try/except. Not to add a coalesce. Not to comment out the model. The agent's prompt frames every failure as a root-cause search, and the tool design supports the frame. The agent can read upstream code, read recent commits, and read the surrounding file. The fix it commits is meant to be the fix a human engineer would commit.
For the team: when the agent makes a mistake, the first instinct is to add a sentence to the prompt. That instinct is wrong almost every time. If the agent path-traverses, the fix is path-traversal protection in the tool handler. If the agent loops on the same broken file, the fix is a loop-detection guard in the tool loop. If the agent picks the wrong file to edit, the fix is usually a better tool that returns more relevant context, not a longer prompt.
Prompts are band-aids. Tools are root causes. Every time the band-aid got chosen for short-term convenience, it got paid for later in unpredictable behavior that could not be reproduced, could not be tested, and could not be reasoned about.
The same rule applies to infrastructure. When an admission policy injects the wrong env var into a pod, fix the policy. Do not work around it in application code. When a database migration is non-atomic, wrap it in a transaction. Do not write a background reconciler to repair the inconsistency after the fact. The reconciler would work. It would also hide a class of bugs forever.
Fixing root causes is slower this week and faster every week after.
Neither camp is doing the work
Notice what is not in any of the principles above. Nothing here depends on whether the next frontier model will be AGI or whether transformers will plateau. Nothing here cares about benchmark scores, context window length, or which lab released the latest leaderboard winner. Prompt engineering, model selection, fine-tuning, RAG, vector databases, and evaluation harnesses did not come up once. Those topics are not unimportant. They are downstream of the system design. A perfectly tuned prompt against an unbounded action space, an ungrounded context, a non-closing loop, and a review path that does not exist will still produce a worse outcome than a commodity model wired into a system that respects the principles above.
The maximalists will tell you the model is about to solve this for you. It is not. The skeptics will tell you the model cannot contribute anything useful. It already does. The interesting work is in neither position. It is in the unglamorous layer of tool handlers, database constraints, admission policies, and GitHub App scopes. It does not make a viral tweet. It is what makes the thing run.
If you are building something that uses an LLM and you find yourself spending 80% of your time on the prompt, you are working on the wrong layer. Stop. Look at what your LLM can touch, what it can see, what catches it when it falls, and what closes the loop after it acts. The leverage is there. The model is rented from someone else.