Monitoring
AIS provides observability through Prometheus metrics, Grafana dashboards, and structured JSON logging.
Stack Overview
| Component | Purpose | Port |
|---|---|---|
| Prometheus | Metrics collection and storage | 9090 |
| Grafana | Dashboard visualization | 3000 |
| Alertmanager | Alert routing and notification | 9093 |
| Pushgateway | Backtest metric ingestion | 9091 |
Prometheus Metrics
All metrics are exposed at GET /metrics. See the Metrics Reference for the complete list.
Key metrics to monitor:
| Metric | Alert Threshold | Meaning |
|---|---|---|
| ais_drawdown | > 0.04 | Approaching drawdown limit |
| ais_leverage | > 2.5 | High leverage |
| ais_kill_switch_triggers_total | > 0 | Kill switch activated |
| ais_risk_rejections_total | Increasing | Risk engine blocking orders |
| ais_aster_errors_total | Increasing | Exchange connectivity issues |
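As an illustration of the fixed thresholds in the table, the sketch below parses a Prometheus text-format scrape and reports which metrics are over their limit. The metric names and limits come from the table; the `parse_metrics` and `breached` helpers and the sample payload are hypothetical and are not part of AIS. It assumes the metrics are exposed without labels, which may not hold in practice.

```python
from typing import Dict, List

# Limits copied from the table above. Everything else in this sketch
# (function names, sample payload) is illustrative, not AIS code.
THRESHOLDS: Dict[str, float] = {
    "ais_drawdown": 0.04,
    "ais_leverage": 2.5,
    "ais_kill_switch_triggers_total": 0.0,
}


def parse_metrics(text: str) -> Dict[str, float]:
    """Parse unlabelled samples from Prometheus text exposition format."""
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        name = parts[0]
        # Labelled series (name{...}) are ignored to keep the sketch short.
        if "{" in name or len(parts) < 2:
            continue
        samples[name] = float(parts[1])
    return samples


def breached(samples: Dict[str, float]) -> List[str]:
    """Return the sorted names of metrics whose value exceeds its limit."""
    return sorted(
        name for name, limit in THRESHOLDS.items()
        if samples.get(name, 0.0) > limit
    )


SAMPLE = """\
# TYPE ais_drawdown gauge
ais_drawdown 0.045
# TYPE ais_leverage gauge
ais_leverage 1.8
ais_kill_switch_triggers_total 0
"""

print(breached(parse_metrics(SAMPLE)))
```

The two "Increasing" rows cannot be checked from a single scrape; they need a comparison across scrapes, for example with the PromQL `increase()` function in an alert rule.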
Grafana Dashboards
Pre-built dashboards are provisioned automatically:
- AIS Overview — NAV, P&L, drawdown, gross/net exposure
- Agent Performance — Signal generation rates, latency, approval ratios
- Execution — Order flow, fill rates, paper trade activity
- Risk — Kill switch events, rejection breakdown, leverage history
Access Grafana at http://localhost:3000. The admin password is configured via GF_ADMIN_PASSWORD.
Alertmanager
Alertmanager is configured in monitoring/alertmanager.yml. The default route sends alerts to a webhook.
To add Slack notifications:
receivers:
  - name: slack
    slack_configs:
      - api_url: ${AIS_SLACK_WEBHOOK_URL}
        channel: '#trading-alerts'
        title: 'AIS Alert'
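For the default webhook route, the receiving service gets a JSON body from Alertmanager containing an `alerts` array, where each entry carries `status`, `labels`, and `annotations`. The sketch below shows one way such a body could be turned into readable lines. The `summarize_alerts` helper, the `HighDrawdown` alert name, and the example payload are hypothetical; the actual AIS webhook consumer is not described here.

```python
import json
from typing import List


def summarize_alerts(payload: str) -> List[str]:
    """Turn an Alertmanager webhook body into one line per alert."""
    body = json.loads(payload)
    lines = []
    for alert in body.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        lines.append(
            "[{status}] {name}: {summary}".format(
                status=alert.get("status", "unknown"),
                name=labels.get("alertname", "unnamed"),
                summary=annotations.get("summary", ""),
            )
        )
    return lines


# Hypothetical payload in the shape Alertmanager sends to webhooks.
EXAMPLE = json.dumps({
    "version": "4",
    "status": "firing",
    "alerts": [
        {
            "status": "firing",
            "labels": {"alertname": "HighDrawdown", "severity": "critical"},
            "annotations": {"summary": "Drawdown above 4%"},
        }
    ],
})

for line in summarize_alerts(EXAMPLE):
    print(line)
```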
Position Reconciliation
PositionReconciler compares internal state against exchange data:
- Position reconciliation — Internal quantities vs exchange positions
- Balance reconciliation — Expected NAV vs exchange balance
- Unauthorized trade detection — Known order IDs vs exchange trades
Results are persisted as reconciliation events in the EventStore.
Statuses: MATCH, MISMATCH, MISSING_INTERNAL, MISSING_EXCHANGE, ERROR
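The position check can be pictured as a per-symbol comparison of two quantity maps. The sketch below is not the `PositionReconciler` implementation: the `reconcile_positions` function, its `tolerance` parameter, and the symbols are hypothetical. It also assumes that MISSING_INTERNAL means "held on the exchange but unknown internally" and MISSING_EXCHANGE the reverse, which the text above does not confirm. ERROR is listed for completeness but not produced here.

```python
from enum import Enum
from typing import Dict


class Status(str, Enum):
    MATCH = "MATCH"
    MISMATCH = "MISMATCH"
    MISSING_INTERNAL = "MISSING_INTERNAL"
    MISSING_EXCHANGE = "MISSING_EXCHANGE"
    ERROR = "ERROR"  # e.g. the exchange query failed; not produced below


def reconcile_positions(
    internal: Dict[str, float],
    exchange: Dict[str, float],
    tolerance: float = 1e-9,
) -> Dict[str, Status]:
    """Compare internal quantities with exchange quantities per symbol."""
    results: Dict[str, Status] = {}
    for symbol in sorted(set(internal) | set(exchange)):
        if symbol not in internal:
            # Assumed meaning: the exchange reports a position we do not track.
            results[symbol] = Status.MISSING_INTERNAL
        elif symbol not in exchange:
            # Assumed meaning: we track a position the exchange does not report.
            results[symbol] = Status.MISSING_EXCHANGE
        elif abs(internal[symbol] - exchange[symbol]) <= tolerance:
            results[symbol] = Status.MATCH
        else:
            results[symbol] = Status.MISMATCH
    return results


internal = {"BTCUSDT": 0.5, "ETHUSDT": 2.0, "SOLUSDT": 10.0}
exchange = {"BTCUSDT": 0.5, "ETHUSDT": 1.5, "XRPUSDT": 100.0}
for symbol, status in reconcile_positions(internal, exchange).items():
    print(symbol, status.value)
```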
Resilience Monitoring
Circuit Breaker
Per-service circuit breakers track failure rates:
- Closed — Normal operation
- Open — Service failing, requests rejected
- Half-open — Probing for recovery
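The three states above can be sketched as a small state machine. This is an illustrative version only: the `CircuitBreaker` class, its parameters, and the injectable `clock` are hypothetical, and it trips on a count of consecutive failures as a simplification of the failure-rate tracking that AIS is described as using.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Minimal closed / open / half-open breaker (illustrative only)."""

    def __init__(
        self,
        failure_threshold: int = 3,
        recovery_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        # After the timeout, let probe requests through.
        if self._clock() - self._opened_at >= self.recovery_timeout:
            return "half-open"
        return "open"

    def allow_request(self) -> bool:
        return self.state != "open"

    def record_success(self) -> None:
        # Any success (including a half-open probe) closes the breaker.
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        if self.state == "half-open":
            # Probe failed: re-open and restart the recovery timer.
            self._opened_at = self._clock()
            return
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()


# Demo with a fake clock so the transitions are deterministic.
now = [0.0]
breaker = CircuitBreaker(failure_threshold=2, recovery_timeout=10.0,
                         clock=lambda: now[0])
breaker.record_failure()
breaker.record_failure()
print(breaker.state)   # open
now[0] = 10.0
print(breaker.state)   # half-open
breaker.record_success()
print(breaker.state)   # closed
```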
Rate Limiter
Token-bucket rate limiters prevent exchange API abuse. Monitor them via limiter.stats().
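A token bucket refills at a fixed rate up to a maximum capacity, and each request spends a token. The sketch below shows the idea together with a `stats()` method. The `TokenBucket` class, its constructor parameters, and the keys returned by `stats()` are assumptions made for illustration; the fields that the AIS `limiter.stats()` actually returns are not documented here.

```python
import time
from typing import Callable, Dict


class TokenBucket:
    """Minimal token-bucket limiter (illustrative, not the AIS class)."""

    def __init__(
        self,
        rate: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()
        self._allowed = 0
        self._rejected = 0

    def _refill(self) -> None:
        now = self._clock()
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._last = now

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Spend `cost` tokens if available; otherwise reject the request."""
        self._refill()
        if self._tokens >= cost:
            self._tokens -= cost
            self._allowed += 1
            return True
        self._rejected += 1
        return False

    def stats(self) -> Dict[str, float]:
        # Hypothetical shape; chosen only for this example.
        self._refill()
        return {
            "tokens": self._tokens,
            "allowed": self._allowed,
            "rejected": self._rejected,
        }


# Demo with a fake clock: burst of 2, refill of 1 token per second.
now = [0.0]
limiter = TokenBucket(rate=1.0, capacity=2.0, clock=lambda: now[0])
print([limiter.try_acquire() for _ in range(3)])
now[0] = 1.0
print(limiter.try_acquire())
print(limiter.stats())
```

A rising rejected count in a limiter like this would suggest that callers are hitting the configured budget and that requests are being dropped or delayed before they reach the exchange.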
Structured Logging
All logs are structured JSON via aiswarm.utils.logging:
{
  "level": "INFO",
  "logger": "aiswarm.orchestration.coordinator",
  "message": "Cycle completed",
  "extra_json": {
    "cycle": 42,
    "signals_generated": 2,
    "signal_selected": true,
    "risk_approved": true,
    "duration_ms": 245
  }
}
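A record in the shape shown above could be produced with the standard `logging` module and a custom formatter. The sketch below is an assumption about how such output might be generated, not the contents of `aiswarm.utils.logging`; the `JsonFormatter` class is hypothetical, and it assumes callers pass structured fields through `extra={"extra_json": ...}`.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object (illustrative only)."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Fields passed via `extra=` become attributes on the record.
            "extra_json": getattr(record, "extra_json", {}),
        }
        return json.dumps(payload)


# Write to an in-memory stream so the example has no side effects.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("aiswarm.orchestration.coordinator")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

logger.info(
    "Cycle completed",
    extra={"extra_json": {"cycle": 42, "risk_approved": True, "duration_ms": 245}},
)
print(stream.getvalue().strip())
```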
The decision audit trail is written as JSONL files at the configured decision_log_path.