Runbook — Budget Alerts
Alert: Budget Threshold Crossed (50% or 80%)
Section titled “Alert: Budget Threshold Crossed (50% or 80%)”Topic: Discord webhook (or budget.alert.threshold on bus)
Triggered when: Project daily spend or total daily spend exceeds 50% or 80% of the configured cap.
Immediate response
Section titled “Immediate response”- Check current spend via the metrics endpoint or
ops.alert.budgetevents - Review recent
budget.decision.*events for the project/agent - If approaching $10 project cap — consider pausing non-critical agent runs
- If approaching $50 daily cap — escalate to team lead immediately
No further action needed if:
Section titled “No further action needed if:”- Spend is on track with expected project activity
- Threshold is 50% and there is no unusual spike
Alert: Autonomous Rate Below 85%
Section titled “Alert: Autonomous Rate Below 85%”Topic: ops.alert.budget with type: "autonomous_rate_below_threshold"
Triggered when: The 24-hour autonomous rate drops below 85%.
- Read the diagnostic report in the alert payload — it includes tier breakdown
- Check for unusual L3 escalation patterns (are certain agents/projects spiking?)
- Review tier thresholds — are they still appropriate for current usage patterns?
- Do NOT auto-adjust tier thresholds or circuit breaker config without a team review
- Open a ticket for the engineering team to review and adjust configuration
Alert: Cost Discrepancy >20%
Section titled “Alert: Cost Discrepancy >20%”Topic: ops.alert.budget with type: "cost_discrepancy"
Triggered when: The actual post-execution cost differs from the pre-flight estimate by more than 20%.
- Check the
estimatedvsactualvalues in the alert payload - Review the model being used — is it returning more tokens than expected?
- Adjust
estimatedCompletionTokensin the calling code if consistently over-estimating - The budget tracker will adjust the recorded spend to the actual value automatically
- If discrepancy is systematic (>20% consistently), escalate to HITL for budget review
Alert: Circuit Breaker OPEN
Section titled “Alert: Circuit Breaker OPEN”Topic: budget.circuit.open.{goalId}:{agentId}
Triggered when: A goal×agent circuit breaker transitions to OPEN state.
- Identify which
goalId:agentIdcombination tripped the breaker - Review the recent
budget_ledgerrecords for that combination - Determine root cause: was the budget exhausted, or is there a runaway agent?
- If the agent is behaving correctly and budget is available, use the emergency override:
// From ops console or emergency responder toolingcircuitBreaker.override("goal-id", "agent-id", "CLOSED", "budget replenished — resuming", "ops-oncall");- Log the override with justification (cost deducted from next period’s allocation)
- Monitor the agent for the next 10 minutes to confirm normal behavior
Alert: HITL Escalation Timeout
Section titled “Alert: HITL Escalation Timeout”Topic: hitl.expired.{correlationId}
Triggered when: A HITL approval request expires without a decision (default: 30 minutes).
- Per deviation rule: the request is auto-rejected after timeout
- Review the original escalation context in the expired request
- If the request was legitimate, manually re-trigger from the source
- Notify the ops team about the missed escalation
- Review escalation alert routing if this is happening frequently