GUIDE · INCIDENT · 8 min

$82,000 GCP bill from a coding agent

What ran. What spent. What to change.

A developer left a GCP service account key in the working directory. A coding agent found it, treated infrastructure code as a problem to fix, and ran a retry loop for 56 hours. The bill arrived Monday morning.

TL;DR · the answer, in twenty seconds

What happened: A coding agent running against an infrastructure codebase found a GCP service account key in the project root. When a Terraform apply failed due to a quota error, the agent treated the failure as a fixable problem, modified config, and retried. The loop ran across a weekend. GCP billed for every resource created and torn down in the cycle.

The minimum fix: credential files should not live inside any directory your agent can read. The structural fix is to never let the agent process hold cloud credentials at all. Deliver credentials to a specific child process at exec time via a broker; revoke after the process exits.

The lesson: an agent with ambient cloud credentials and the ability to call infrastructure tooling has an unbounded blast radius. Read-only IAM helps, but most infrastructure tasks require write permissions. The real constraint is whether the agent holds credentials at all, not which permissions those credentials carry.

A GCP service account key sat in the project root. It was there for convenience. The developer had been testing a Terraform module over the previous week and had not moved the key to a secrets manager or rotated it out. The agent session started Friday afternoon. By Sunday night the account had $82,000 in charges.

This is the shape of a new incident category. No attacker. No phishing. No malicious package. A developer left credentials where a helpful tool could find them, and the helpful tool did exactly what it was designed to do.

The billing alert threshold was set to $500. The alert fired. It went to an email address the developer does not check on weekends.

What to know in 60 seconds

  • A coding agent running in a directory with cloud credentials can call cloud APIs. That is not a bug.
  • When an infrastructure task fails partway through, most agents treat the failure as a signal to retry or reattempt with a modified approach.
  • GCP bills for the resources a loop provisions, even when they are torn down minutes later. A retry loop that creates and destroys Compute Engine instances, Cloud SQL replicas, or Cloud Run revisions adds to the bill on each pass.
  • Budget alerts with email delivery and a 24-hour evaluation window will not catch a runaway weekend loop.
  • The blast radius is not bounded by the credentials' IAM roles unless those roles are read-only. Most infrastructure work requires write permissions.

How the loop started

The project was a set of Terraform modules for a staging environment. The developer's task for the afternoon session: migrate a Cloud SQL instance from one region to another. The agent was given the task in natural language and pointed at the Terraform directory.

The first terraform apply hit a regional quota limit. The error was parseable: the target region did not have enough db-n1-standard-4 capacity at that moment. GCP returns this as a 409 with a quota error code.

The agent read the error, concluded the approach needed adjustment, and proposed using a different machine type. It applied the change, ran terraform apply again. That apply partially succeeded, created a few resources, then failed on a different dependency. The agent read the new error.

This is normal agent behavior. Each error looks like a solvable problem. The agent has working credentials and a shell. Nothing in the loop told it that the cost of another attempt was nonzero, let alone that the bill was compounding.
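The missing signal can be approximated with a hard attempt and time budget around the apply step. The following is a minimal sketch, not something the agent in this incident had: `AttemptBudget`, `run_with_budget`, and the `apply_once` callable are hypothetical names, and the caps are placeholders you would choose yourself.

```python
import time
from dataclasses import dataclass, field


class BudgetExhausted(RuntimeError):
    """Raised when the loop has used up its attempt or time allowance."""


@dataclass
class AttemptBudget:
    max_attempts: int    # hypothetical cap on apply attempts
    max_seconds: float   # hypothetical wall-clock cap for the whole task
    started_at: float = field(default_factory=time.monotonic)
    attempts: int = 0

    def spend(self) -> None:
        """Call once before every apply; raises when the budget is gone."""
        elapsed = time.monotonic() - self.started_at
        if self.attempts >= self.max_attempts:
            raise BudgetExhausted(f"attempt cap reached ({self.max_attempts})")
        if elapsed >= self.max_seconds:
            raise BudgetExhausted(f"time cap reached after {elapsed:.0f}s")
        self.attempts += 1


def run_with_budget(apply_once, budget):
    """Retry apply_once() until it returns True or the budget runs out."""
    while True:
        budget.spend()
        if apply_once():
            return budget.attempts
```

A wrapper like this does not make any single attempt cheaper. It only guarantees that the loop ends after a known number of attempts or a known amount of time, which is the property the 56-hour session lacked.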

By Saturday morning, the loop had made roughly 40 full and partial apply cycles. By Sunday night, closer to 400 API calls had been billed. Cloud SQL instance-hours, Compute Engine GPU provisioning for a retry that tried a different tier, network egress between regions on each migration attempt: all billed.

Why read-only IAM does not fix this

The obvious reaction is to give the agent a read-only service account. For audit tasks, cost analysis, or code review that calls gcloud to inspect existing resources, read-only works. For any task that involves changing infrastructure, it does not.

The developer in this incident could not have given the agent read-only credentials and completed the migration. Terraform needs write access to create, modify, and delete resources. Cloud SQL cross-region migration specifically requires permission to create a replica in the target region, promote it, and delete the original.

So the advice "use read-only credentials with agents" covers roughly the set of tasks where the agent is not doing anything interesting. As soon as the task has production consequences, write access follows.

The gap in the advice is that it treats the problem as a permissions scoping problem. The real problem is that the agent holds credentials at all during the session. An agent that holds credentials can use them at any point in the session, for any number of calls, with no per-call approval requirement.

The part that gets skipped in most write-ups

Budget alerts did not help here. The $500 threshold fired, but the evaluation period is daily by default in GCP Billing. An alert set to daily evaluation fires at most once every 24 hours. A loop running from Friday afternoon through Sunday night crosses three evaluation windows. One alert fires Friday night. The developer sees it Monday.

You can set evaluation windows to one hour in GCP Billing. Almost nobody does, because the dashboard discourages it with a warning about alert fatigue. Tighten that window if you are running agents against infrastructure.
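Some back-of-envelope arithmetic shows why the alert delay matters. The sketch below assumes the $82,000 accrued at a uniform rate over the 56 hours, which is a simplification: real spend in a retry loop is bursty. The function names are illustrative.

```python
TOTAL_BILL_USD = 82_000
LOOP_HOURS = 56
ALERT_THRESHOLD_USD = 500


def average_burn_per_hour(total_usd, hours):
    """Average spend per hour, assuming a uniform rate (a simplification)."""
    return total_usd / hours


def minutes_to_reach(threshold_usd, burn_per_hour):
    """Minutes of loop time needed to spend threshold_usd at the given rate."""
    return threshold_usd / burn_per_hour * 60


burn = average_burn_per_hour(TOTAL_BILL_USD, LOOP_HOURS)
print(f"average burn: ${burn:,.0f}/hour")
print(f"$500 crossed after ~{minutes_to_reach(ALERT_THRESHOLD_USD, burn):.0f} minutes")
print(f"spend during a 24-hour delay: ${burn * 24:,.0f}")
print(f"spend during a 1-hour delay: ${burn:,.0f}")
```

Under that assumption the loop burns about $1,464 per hour and crosses the $500 threshold after roughly 20 minutes. A 24-hour gap between crossing and noticing costs about $35,143. A 1-hour gap costs about $1,464.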

The second thing that gets skipped: the agent context window did not contain a cost signal. The agent could see Terraform output, GCP error messages, and the current state of the .tf files. It could not see a running bill total. It had no way to know that each retry cost money. This is an architectural gap in how agents are connected to cloud tooling today, not a misconfiguration.

A human engineer in the same situation would feel the weight of each failed attempt as a decision point. The agent feels nothing and retries.

The structural problem is credential exposure time

Consider what the timeline looked like from the credentials' perspective.

The service account key existed on disk for a week before the incident. It was available to any process with read access to the project directory. The agent session ran for 56 hours. During those 56 hours, the key was available to every tool call the agent made.

Now compare that to a model where credentials have a bounded exposure window. The developer starts a session, requests a short-lived token for infrastructure work, and the token is valid for four hours. If the agent loops past the four-hour mark, the next apply call returns a 401. The loop breaks. The billing stops.

GCP supports short-lived tokens via Workload Identity Federation and via the gcloud auth print-access-token command, which issues tokens that expire in 60 minutes by default. Service account keys with no expiry, sitting in a project directory, are the opposite of that model.
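The effect of a bounded token lifetime on a retry loop can be shown with a small simulation. This is a toy model, not GCP's token machinery: `ShortLivedToken`, `Unauthorized`, and the eight-minutes-per-attempt figure are assumptions made for illustration.

```python
from dataclasses import dataclass


class Unauthorized(Exception):
    """Stand-in for the 401 an expired token would produce."""


@dataclass
class ShortLivedToken:
    issued_at_min: float
    ttl_min: float

    def check(self, now_min: float) -> None:
        """Raise Unauthorized once the token's lifetime has elapsed."""
        if now_min >= self.issued_at_min + self.ttl_min:
            raise Unauthorized("token expired")


def simulate_loop(token, minutes_per_attempt, max_minutes):
    """Count attempts that start before the token expires or the window ends."""
    now, attempts = 0.0, 0
    while now < max_minutes:
        try:
            token.check(now)
        except Unauthorized:
            break  # the loop cannot continue without a fresh credential
        attempts += 1
        now += minutes_per_attempt
    return attempts, now


WEEKEND_MINUTES = 56 * 60
print(simulate_loop(ShortLivedToken(0, 240), 8, WEEKEND_MINUTES))
print(simulate_loop(ShortLivedToken(0, float("inf")), 8, WEEKEND_MINUTES))
```

With a four-hour token the simulated loop stops after 30 attempts. With a credential that never expires it runs 420 attempts across the full 56 hours. The ratio, not the absolute numbers, is the point.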

The question is not "should the agent have read-only or write credentials?" It is "how long does the agent hold credentials, and what can interrupt that window?"

If a broker like hasp sat in the path here, the service account key would never have been in the agent's environment at all. The agent requests a credential when the Terraform call starts, gets a token scoped to that one process, and the token dies at exit. A retry loop that runs past the token's ceiling hits a 401 on the next apply. The loop breaks. The bill stops.

What to change before the next agent session

## GCP + coding agent pre-session checklist

- [ ] No .json key files in any directory the agent process can read
- [ ] No GOOGLE_APPLICATION_CREDENTIALS pointing to a long-lived key file
- [ ] GCP budget alert evaluation window set to 1 hour, not daily
- [ ] Budget alert delivery goes to a channel monitored on weekends (PagerDuty, SMS, Slack with mobile notifications on)
- [ ] Alert threshold at 10% of monthly budget, not a fixed dollar amount
- [ ] Terraform state is remote (GCS bucket), not local: agent can read state but errors stay bounded to Terraform calls
- [ ] Agent session explicitly scoped: if task is "migrate Cloud SQL", the agent has no reason to touch Compute Engine quotas
- [ ] Short-lived token used instead of key file: gcloud auth print-access-token expires in 60 minutes
- [ ] Application Default Credentials cleared before agent session if not needed: unset GOOGLE_APPLICATION_CREDENTIALS
- [ ] Post-session: gcloud auth revoke if using user credentials, or delete short-lived token file

Paste this into your team's infrastructure runbook or PR template for any session that involves Terraform or gcloud.
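The first two checklist items can be checked mechanically before a session starts. The sketch below walks a directory for JSON files that look like service account keys (a "type" of "service_account" plus a "private_key" field) and flags a set GOOGLE_APPLICATION_CREDENTIALS. The function names are hypothetical, and the heuristic will miss keys stored under other extensions or formats.

```python
import json
import os
from pathlib import Path


def looks_like_service_account_key(path):
    """True if the file parses as JSON and has the shape of a GCP key file."""
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        return False  # unreadable or not JSON: not a key file we can identify
    return (
        isinstance(data, dict)
        and data.get("type") == "service_account"
        and "private_key" in data
    )


def find_key_files(root):
    """Return every *.json under root that looks like a service account key."""
    return sorted(
        p for p in Path(root).rglob("*.json")
        if p.is_file() and looks_like_service_account_key(p)
    )


def preflight(root, env=None):
    """Collect reasons not to start an agent session in this directory."""
    env = os.environ if env is None else env
    problems = [f"key file readable by agent: {p}" for p in find_key_files(root)]
    if env.get("GOOGLE_APPLICATION_CREDENTIALS"):
        problems.append("GOOGLE_APPLICATION_CREDENTIALS is set")
    return problems


if __name__ == "__main__":
    for problem in preflight("."):
        print("BLOCK:", problem)
```

An empty result does not prove the directory is clean. It only rules out the exact failure in this incident: a key file sitting where the agent could read it.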

Alert tuning is not optional

The default GCP billing alert setup is optimized for catching slow month-long drift, not weekend runaway events. Three changes matter.

First, switch evaluation frequency from "monthly" to "per usage" (which GCP evaluates more frequently as data arrives). For critical accounts, set a separate alert at 1 hour intervals using a Cloud Monitoring alerting policy on the billing/monthly_cost metric. This is not in the billing console; it requires Monitoring > Alerting > Create policy.

Second, route the alert to a notification channel that pages. Email to a developer's personal inbox is not paging. Set up a PagerDuty or OpsGenie integration so that a billing spike above threshold wakes someone up. GCP Billing supports Pub/Sub notifications; wire that to your on-call tooling.

Third, set multiple thresholds at different percentages of monthly budget: 10%, 25%, 50%. A single $500 threshold on an account where the normal bill is $2,000 per month catches very little. Percentage thresholds scale with actual usage patterns.
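The percentage thresholds translate into dollar amounts like this. A small sketch using the $2,000 monthly figure from the paragraph above; the function names are illustrative.

```python
def threshold_amounts(monthly_budget_usd, percents=(10, 25, 50)):
    """Map each percentage threshold to its dollar amount."""
    return {p: monthly_budget_usd * p / 100 for p in percents}


def crossed(spend_usd, monthly_budget_usd, percents=(10, 25, 50)):
    """Return the percentage thresholds that the current spend has reached."""
    amounts = threshold_amounts(monthly_budget_usd, percents)
    return [p for p, amount in amounts.items() if spend_usd >= amount]


print(threshold_amounts(2_000))
print(crossed(600, 2_000))
```

On a $2,000 budget the thresholds land at $200, $500, and $1,000, so a spend of $600 has already tripped the first two.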

None of these changes prevent the agent from running. They do mean you find out faster.
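The Pub/Sub routing from the second change could be sketched as a small handler that decodes a budget notification and decides whether to page. The field names `costAmount` and `budgetAmount` follow the budget notification format as I understand it; treat them as an assumption and verify against the current GCP documentation. `should_page` and its 10% default are hypothetical.

```python
import base64
import json


def decode_budget_message(data_b64):
    """Decode the base64 'data' field of a Pub/Sub message into a dict."""
    return json.loads(base64.b64decode(data_b64).decode("utf-8"))


def should_page(notification, page_at_fraction=0.10):
    """Page when cost has reached the given fraction of the budget."""
    cost = float(notification.get("costAmount", 0))      # assumed field name
    budget = float(notification.get("budgetAmount", 0))  # assumed field name
    if budget <= 0:
        return False  # no usable budget figure, nothing to compare against
    return cost / budget >= page_at_fraction


# Example payload built by hand for illustration, not captured from GCP.
sample = base64.b64encode(
    json.dumps({"costAmount": 600.0, "budgetAmount": 2000.0}).encode("utf-8")
).decode("ascii")
print(should_page(decode_budget_message(sample)))
```

The decision logic stays separate from delivery on purpose: whatever calls your on-call tooling only needs the boolean.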

What this means for your stack

The incident is not a GCP bug or an agent bug. It is the predictable outcome of a model where a long-running process holds cloud credentials and treats infrastructure failures as problems to iterate on. The agent did what it was designed to do. The problem is what it was holding while it did it.

The structural fix is to stop treating cloud credentials as ambient environment. Instead of a service account key file in the project directory (or a GOOGLE_APPLICATION_CREDENTIALS export that lives for the full agent session), credentials belong in a local broker that injects them into a specific child process at exec time, for a bounded duration, and revokes them when the process exits. The agent context holds a reference, not the key. A loop that runs for 56 hours with that model hits a 401 on hour one, not an $82,000 bill on Monday.
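The exec-time injection idea can be illustrated with the standard library alone. This is a sketch of the general pattern, not how hasp or any particular broker is implemented: `run_with_credential` and `DEMO_TOKEN` are hypothetical names, and the sketch does no revocation. In a real setup the broker, not the agent, would be the process calling this function.

```python
import os
import subprocess
import sys


def run_with_credential(argv, name, value):
    """Run argv with one extra environment variable visible only to the child."""
    child_env = dict(os.environ)  # a copy, so the parent's environment is untouched
    child_env[name] = value
    return subprocess.run(
        argv, env=child_env, capture_output=True, text=True, check=True
    )


# Demo: the child sees DEMO_TOKEN; the calling process never has it in os.environ.
probe = [sys.executable, "-c", "import os; print('DEMO_TOKEN' in os.environ)"]
result = run_with_credential(probe, "DEMO_TOKEN", "not-a-real-token")
print("child saw it:", result.stdout.strip())
print("parent env has it:", "DEMO_TOKEN" in os.environ)
```

The credential lives exactly as long as the child process. Anything else the caller launches afterwards starts without it.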

hasp is one working implementation of that model. Install it with curl -fsSL https://gethasp.com/install.sh | sh, run hasp setup, and bind your project's secrets. The child process then gets a credential injected at the moment of the specific tool call that needs it, not for the duration of the session. The audit log records every grant. Source-available (FCL-1.0), local-first, macOS and Linux, no account.

Credential scope and credential lifetime are two different problems. IAM roles handle scope. Runtime brokering handles lifetime. You need both. The February incident and this weekend's $82,000 bill are the same problem wearing different clothes: too much, for too long, with no one watching.

NEXT STEP · ~90 seconds

Stop handing the agent your real keys.

hasp keeps secrets in one local encrypted vault, brokers them into the child process at exec, and never lets the agent read the value.

  • Local, encrypted vault — no account, no cloud, no telemetry by default.
  • Brokered run — agent gets a reference, the child process gets the value.
  • Pre-commit + pre-push hooks catch managed values before they ship.
  • Append-only HMAC audit log answers "did the agent touch the prod token?" in seconds.
→ ok  vault unlocked · binding ./api
→ ok  grant once · pid 88421
→ ok  agent never read

macOS & Linux. Source-available (FCL-1.0, converts to Apache 2.0). No account.
