L2 Planner Operational Runbook
Monitoring
Dashboard
See `monitoring/dashboards/l2-planner-dashboard.json` for the dashboard definition.
Key panels:
- Escalation Rate: Should trend downward over time
- Success Rate: Target > 80%
- Confidence Distribution: In a healthy system, most plans have confidence > 0.7
- Learned Rules Count: Should grow steadily
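One way to make "should trend downward" concrete is to fit a straight line through recent escalation-rate samples and look at the sign of its slope. This is a minimal sketch, not the dashboard's actual calculation: `trendSlope` is a hypothetical helper and the sample values are made up.

```typescript
// Least-squares slope of the values against their indices 0..n-1.
// A negative slope means the series is trending downward.
function trendSlope(values: number[]): number {
  const n = values.length;
  if (n < 2) return 0; // not enough points to define a trend
  const meanX = (n - 1) / 2;
  const meanY = values.reduce((sum, v) => sum + v, 0) / n;
  let numerator = 0;
  let denominator = 0;
  values.forEach((v, i) => {
    numerator += (i - meanX) * (v - meanY);
    denominator += (i - meanX) * (i - meanX);
  });
  return numerator / denominator;
}

// Made-up hourly escalation rates, oldest first.
const hourlyEscalationRates = [0.42, 0.4, 0.37, 0.35, 0.31];
const slope = trendSlope(hourlyEscalationRates);
console.log(slope < 0 ? "escalation rate is improving" : "escalation rate is not improving");
```

A fitted slope is less sensitive to a single noisy hour than comparing only the first and last samples.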
Alerts
| Condition | Severity | Action |
|---|---|---|
| Escalation rate > 50% (1h window) | Warning | Check L2 planner health |
| Escalation rate > 80% (1h window) | Critical | Check A2A connectivity and action graph |
| P95 latency > 30s | Warning | Check A2A response times |
| Learned rule failure rate > 30% | Warning | Trigger rule audit |
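The alert table can be read as a small set of threshold checks. The sketch below is illustrative only: `PlannerMetrics`, `Alert`, and `evaluateAlerts` are hypothetical names, not part of the planner codebase, and the metrics are assumed to be precomputed over the 1-hour window.

```typescript
// Hypothetical metric snapshot; field names are assumptions for illustration.
interface PlannerMetrics {
  escalationRate: number;         // fraction of goals escalated in the last hour (0..1)
  p95LatencyMs: number;           // 95th-percentile latency in milliseconds
  learnedRuleFailureRate: number; // fraction of learned-rule runs that failed (0..1)
}

interface Alert {
  severity: "warning" | "critical";
  condition: string;
  action: string;
}

// Each row of the alert table becomes one threshold check.
function evaluateAlerts(m: PlannerMetrics): Alert[] {
  const alerts: Alert[] = [];
  if (m.escalationRate > 0.8) {
    alerts.push({
      severity: "critical",
      condition: "Escalation rate > 80% (1h window)",
      action: "Check A2A connectivity and action graph",
    });
  } else if (m.escalationRate > 0.5) {
    alerts.push({
      severity: "warning",
      condition: "Escalation rate > 50% (1h window)",
      action: "Check L2 planner health",
    });
  }
  if (m.p95LatencyMs > 30_000) {
    alerts.push({
      severity: "warning",
      condition: "P95 latency > 30s",
      action: "Check A2A response times",
    });
  }
  if (m.learnedRuleFailureRate > 0.3) {
    alerts.push({
      severity: "warning",
      condition: "Learned rule failure rate > 30%",
      action: "Trigger rule audit",
    });
  }
  return alerts;
}

const firing = evaluateAlerts({
  escalationRate: 0.85,
  p95LatencyMs: 31_000,
  learnedRuleFailureRate: 0.1,
});
for (const alert of firing) {
  console.log(`[${alert.severity}] ${alert.condition}: ${alert.action}`);
}
```

This sketch reports only the critical row when both escalation thresholds are exceeded. That is a design choice made here for brevity; an alerting system may well fire both.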
Common Operations
Adjusting Confidence Thresholds
Edit `src/config/routing-config.yaml`:
```yaml
routing:
  l2_confidence_threshold: 0.5  # Lower = more autonomous, higher = more escalations
```

Inspecting Learned Rules
```typescript
const registry = dispatcher.getRuleRegistry();
const stats = registry.getStats();
const rules = registry.getAll();
```

Rolling Back a Learned Rule
```typescript
const migration = new RuleMigration(registry, versioning, auditor);
migration.rollback("rule-id");
```

Checking Escalation Trends
```typescript
const tracker = dispatcher.getEscalationTracker();
const trend = tracker.getTrend(3600_000); // 1-hour buckets
const reasons = tracker.getTopReasons(10);
```

Troubleshooting
High Escalation Rate
- Check A2A client connectivity (is Ava responding?)
- Verify action graph has sufficient actions for the goal types
- Check if confidence threshold is too high
- Review top escalation reasons: `tracker.getTopReasons(10)`
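To judge whether the confidence threshold is the cause, it can help to replay recent plan confidences against a few candidate thresholds. This is a minimal sketch, assuming a plan is escalated when its confidence falls below `l2_confidence_threshold` (the runbook implies this but does not state the exact comparison). `escalationRateAt` and the sample confidences are hypothetical.

```typescript
// Fraction of plans that would be escalated at a given threshold.
// Assumption: a plan is escalated when confidence < threshold.
function escalationRateAt(confidences: number[], threshold: number): number {
  if (confidences.length === 0) return 0;
  const escalated = confidences.filter((c) => c < threshold).length;
  return escalated / confidences.length;
}

// Made-up sample of recent plan confidences.
const recentConfidences = [0.35, 0.48, 0.55, 0.62, 0.71, 0.74, 0.8, 0.83, 0.9, 0.95];

for (const threshold of [0.5, 0.7, 0.8]) {
  const rate = escalationRateAt(recentConfidences, threshold);
  console.log(`threshold ${threshold}: ${(rate * 100).toFixed(0)}% escalated`);
}
// threshold 0.5: 20% escalated
// threshold 0.7: 40% escalated
// threshold 0.8: 60% escalated
```

If the replayed rate drops sharply at a slightly lower threshold, the threshold is a likely contributor. If it barely moves, connectivity or action graph coverage is the more probable cause.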
Learned Rule Causing Failures
- Identify the rule: `registry.findByGoal(goalPattern)`
- Check failure count: `rule.failureCount`
- Roll back if needed: `migration.rollback(rule.id)`
- Audit trail: `auditor.getForRule(rule.id)`
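The steps above can be strung together as a single helper. The sketch below uses the method names from this runbook, but the interfaces, return types, the `maxFailures` cutoff, and the in-memory stand-ins are all assumptions made for illustration. In particular, `findByGoal` is assumed to return an array and `getForRule` a list of strings; the real signatures may differ.

```typescript
// Minimal shapes assumed for illustration; the real planner types may differ.
interface LearnedRule {
  id: string;
  goalPattern: string;
  failureCount: number;
}
interface RuleRegistryLike {
  findByGoal(goalPattern: string): LearnedRule[];
}
interface RuleMigrationLike {
  rollback(ruleId: string): void;
}
interface RuleAuditorLike {
  getForRule(ruleId: string): string[];
}

// Rolls back every rule for the goal pattern whose failure count exceeds
// maxFailures (a hypothetical cutoff) and collects its audit trail.
function rollBackFailingRules(
  goalPattern: string,
  maxFailures: number,
  registry: RuleRegistryLike,
  migration: RuleMigrationLike,
  auditor: RuleAuditorLike,
): Map<string, string[]> {
  const auditTrails = new Map<string, string[]>();
  for (const rule of registry.findByGoal(goalPattern)) {
    if (rule.failureCount <= maxFailures) continue; // rule is within tolerance
    migration.rollback(rule.id);
    auditTrails.set(rule.id, auditor.getForRule(rule.id));
  }
  return auditTrails;
}

// In-memory stand-ins so the sketch runs on its own.
const rolledBack: string[] = [];
const demoRules: LearnedRule[] = [
  { id: "rule-a", goalPattern: "deploy:*", failureCount: 7 },
  { id: "rule-b", goalPattern: "deploy:*", failureCount: 1 },
];
const trails = rollBackFailingRules(
  "deploy:*",
  3,
  { findByGoal: (pattern) => demoRules.filter((r) => r.goalPattern === pattern) },
  { rollback: (id) => { rolledBack.push(id); } },
  { getForRule: (id) => [`rollback requested for ${id}`] },
);
console.log(rolledBack, trails.get("rule-a"));
```

Passing the registry, migration, and auditor in as parameters keeps the helper testable without a running dispatcher.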
L2 Planner Not Learning
- Check minimum learning confidence: plans below 0.8 are not extracted
- Check promotion threshold: rules need 3 successes before promotion
- Verify registry is not at max capacity (500 default)
- Check if plans are too long (max 10 actions for extraction)
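The four checks above can be expressed as a quick diagnostic that lists why a given plan would not be learned. This is a sketch under stated assumptions: the numeric limits come from this checklist, but the constant names, the `PlanSummary` shape, and both functions are hypothetical, and the boundary handling (for example, treating a registry at exactly 500 rules as full) is a guess.

```typescript
// Limits taken from the checklist; names are hypothetical.
const MIN_LEARNING_CONFIDENCE = 0.8; // plans below this are not extracted
const PROMOTION_SUCCESSES = 3;       // successes needed before promotion
const MAX_REGISTRY_SIZE = 500;       // default registry capacity
const MAX_PLAN_ACTIONS = 10;         // longest plan eligible for extraction

interface PlanSummary {
  confidence: number;
  actionCount: number;
}

// Returns the reasons a plan would be skipped for rule extraction.
// An empty array means none of the listed checks would block it.
function extractionBlockers(plan: PlanSummary, registrySize: number): string[] {
  const blockers: string[] = [];
  if (plan.confidence < MIN_LEARNING_CONFIDENCE) {
    blockers.push(`confidence ${plan.confidence} is below ${MIN_LEARNING_CONFIDENCE}`);
  }
  if (plan.actionCount > MAX_PLAN_ACTIONS) {
    blockers.push(`plan has ${plan.actionCount} actions (max ${MAX_PLAN_ACTIONS})`);
  }
  if (registrySize >= MAX_REGISTRY_SIZE) {
    blockers.push(`registry is full (${registrySize}/${MAX_REGISTRY_SIZE})`);
  }
  return blockers;
}

// A candidate rule is promotable once it has enough recorded successes.
function isPromotable(successCount: number): boolean {
  return successCount >= PROMOTION_SUCCESSES;
}

console.log(extractionBlockers({ confidence: 0.75, actionCount: 12 }, 120));
console.log(isPromotable(2));
```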