ADR-011: Observability Stack — Hybrid Approach
Status: Proposed
Date: 2026-05-17
Context
BayanCore requires comprehensive observability across all layers: infrastructure, application, and AI services. The system must provide:
- Metrics: CPU, memory, disk, network, request rates, error rates, latency percentiles
- Logging: Application logs, audit logs, access logs, AI query logs
- Tracing: Distributed tracing across services (Next.js → API → ERPNext → Guardian → AI)
- Alerting: Threshold-based and anomaly-based alerts for critical systems
- Dashboards: Operational dashboards for the SRE team and business dashboards for the product team
We need to choose among OCI-native observability services, open-source tools, and a hybrid approach.
Decision
A hybrid observability stack is adopted:
Infrastructure Layer: OCI Native Services
- OCI Monitoring: Infrastructure metrics (compute, network, storage, database)
- OCI Logging: Centralized log aggregation from OCI services
- OCI Application Performance Monitoring (APM): Service-level performance metrics
Application Layer: Open-Source Stack
- Prometheus: Application metrics collection and storage
- Grafana: Metrics visualization and dashboards
- Jaeger: Distributed tracing backend (stores and queries traces exported via OpenTelemetry)
- OpenTelemetry: Unified instrumentation (APIs and SDKs) for Node.js, Python, and Go services
Rationale:
- OCI-native services monitor infrastructure without any monitoring components for us to deploy or operate
- Open-source stack provides portability and deeper application-level customization
- OpenTelemetry is a widely adopted, vendor-neutral instrumentation standard, so services are instrumented once regardless of backend
- Grafana dashboards can combine OCI metrics (via plugin) and Prometheus metrics
- Jaeger provides end-to-end request tracing across all services
- Avoids full vendor lock-in while leveraging managed services where appropriate
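End-to-end tracing relies on every hop forwarding the same trace ID. OpenTelemetry's default propagator does this with the W3C Trace Context `traceparent` header, and the sketch below illustrates the header format using only the standard library. The function names are hypothetical and this is not the OTel SDK API; in practice the SDK would generate and propagate the header.

```python
import re
import secrets

# W3C traceparent layout: version-traceid-spanid-flags, all lowercase hex.
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace at the edge (e.g. the Next.js frontend)."""
    trace_id = secrets.token_hex(16)  # 16 bytes -> 32 hex chars
    span_id = secrets.token_hex(8)    # 8 bytes -> 16 hex chars
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def child_traceparent(parent: str) -> str:
    """Build the header a service sends downstream: same trace ID, new span ID."""
    match = _TRACEPARENT.match(parent)
    if match is None:
        raise ValueError(f"malformed traceparent: {parent!r}")
    version, trace_id, _parent_span_id, flags = match.groups()
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


if __name__ == "__main__":
    header = new_traceparent()
    # Hypothetical hops taken from the service chain named in the Context section.
    for hop in ("API", "ERPNext", "Guardian", "AI"):
        header = child_traceparent(header)
        print(hop, header)
```

Because only the span ID changes per hop, Jaeger can stitch the spans from all services into one trace by grouping on the shared trace ID.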
Observability Pipeline:
Application (OTel SDK) → OTel Collector → Prometheus (metrics)
                                        → Jaeger (traces)
                                        → OCI Logging (logs)
OCI Services → OCI Monitoring → Grafana (via OCI plugin)
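The fan-out in the pipeline above can be sketched as a small routing table. This is an illustrative model only, not OTel Collector configuration; the `Signal` type, the backend labels, and the function names are assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical routing table mirroring the pipeline: one backend per signal type.
ROUTES = {
    "metrics": "prometheus",
    "traces": "jaeger",
    "logs": "oci-logging",
}


@dataclass(frozen=True)
class Signal:
    kind: str      # "metrics", "traces", or "logs"
    payload: dict  # opaque telemetry record


def route(signal: Signal) -> str:
    """Return the backend label for a signal, rejecting unknown kinds."""
    try:
        return ROUTES[signal.kind]
    except KeyError:
        raise ValueError(f"unknown signal kind: {signal.kind!r}") from None


def fan_out(signals):
    """Group payloads by destination backend, one batch per backend."""
    batches = {backend: [] for backend in ROUTES.values()}
    for signal in signals:
        batches[route(signal)].append(signal.payload)
    return batches
```

Keeping the routing in one place is the property the Collector provides: applications emit all three signal types through the same SDK, and only the Collector needs to know where each one goes.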
Key Metrics to Track:
| Category | Metrics |
|---|---|
| Infrastructure | CPU, memory, disk I/O, network throughput |
| Application | Request rate, error rate, latency (P50/P95/P99), active users |
| Database | Query latency, connection pool usage, slow queries |
| AI/ML | Token usage, response latency, hallucination rate, RAG retrieval accuracy |
| Compliance | ZATCA submission success rate, audit log write latency |
| Business | FWCR (First-Workflow-Completion-Rate), workflow error rate |
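To make the Application row concrete, the sketch below computes latency percentiles with the nearest-rank method and an error rate from raw counts. It assumes raw per-request samples are available; Prometheus itself typically estimates quantiles from histogram buckets, so its values would be approximations of these. The function names are hypothetical.

```python
def percentile(samples, p: int):
    """Nearest-rank percentile: smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be an integer in 1..100")
    ordered = sorted(samples)
    rank = (p * len(ordered) + 99) // 100  # integer ceiling of p/100 * n
    return ordered[rank - 1]


def error_rate(failed: int, total: int) -> float:
    """Fraction of requests that failed; defined as 0.0 when there was no traffic."""
    return failed / total if total else 0.0


if __name__ == "__main__":
    latencies_ms = list(range(1, 101))  # synthetic data: 1 ms .. 100 ms
    for p in (50, 95, 99):
        print(f"P{p} = {percentile(latencies_ms, p)} ms")
    print(f"error rate = {error_rate(3, 200):.3f}")
```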
Alerting Strategy:
- Critical: PagerDuty integration (system down, data loss risk)
- Warning: Slack notifications (elevated error rates, performance degradation)
- Info: Dashboard only (routine operational metrics)
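The three tiers above map naturally onto a severity-to-channel table. The sketch below is one possible shape for that mapping; the 1% error-rate threshold, the channel labels, and the function names are assumptions for illustration, not values decided by this ADR.

```python
from enum import Enum


class Severity(Enum):
    CRITICAL = "critical"
    WARNING = "warning"
    INFO = "info"


# Channel labels follow the alerting strategy; they are identifiers, not real endpoints.
CHANNEL_BY_SEVERITY = {
    Severity.CRITICAL: "pagerduty",
    Severity.WARNING: "slack",
    Severity.INFO: "dashboard",
}


def severity_for(service_up: bool, error_rate: float,
                 warn_error_rate: float = 0.01) -> Severity:
    """Classify a health check. warn_error_rate is a hypothetical threshold."""
    if not service_up:
        return Severity.CRITICAL  # system down
    if error_rate >= warn_error_rate:
        return Severity.WARNING   # elevated error rate
    return Severity.INFO          # routine operational data


def route_alert(severity: Severity) -> str:
    """Return the notification channel for a severity tier."""
    return CHANNEL_BY_SEVERITY[severity]
```

For example, `route_alert(severity_for(service_up=True, error_rate=0.03))` selects the Slack channel, while any check with `service_up=False` pages through PagerDuty regardless of error rate.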
Consequences
- Positive: Managed infrastructure monitoring combined with flexible application monitoring, portable instrumentation, industry-standard tools
- Trade-offs: More complex setup than a single-vendor solution; the team must operate and maintain the Prometheus and Jaeger infrastructure
- Risks: Prometheus storage scaling requires capacity planning; OTel adoption across all services will take time
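For the storage-scaling risk, a rough capacity estimate can follow the sizing rule in the Prometheus storage documentation: disk space is roughly retention time in seconds, times ingested samples per second, times bytes per sample (around 1 to 2 bytes on average). The series count and scrape interval below are hypothetical inputs, not measurements from BayanCore.

```python
SECONDS_PER_DAY = 86_400


def samples_per_second(active_series: int, scrape_interval_s: float) -> float:
    """Each active series yields one sample per scrape."""
    return active_series / scrape_interval_s


def prometheus_disk_bytes(retention_days: float, ingest_rate: float,
                          bytes_per_sample: float = 2.0) -> float:
    """retention_seconds * samples_per_second * bytes_per_sample (upper-end 2 B/sample)."""
    return retention_days * SECONDS_PER_DAY * ingest_rate * bytes_per_sample


if __name__ == "__main__":
    # Hypothetical load: 1.5M active series scraped every 15 s, kept for 15 days.
    rate = samples_per_second(active_series=1_500_000, scrape_interval_s=15)
    size = prometheus_disk_bytes(retention_days=15, ingest_rate=rate)
    print(f"{rate:,.0f} samples/s -> {size / 1e9:.1f} GB")
```

Under these assumed inputs the estimate is 100,000 samples per second and about 259 GB of disk, which shows that label cardinality and retention, not request volume alone, drive the storage requirement.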
Alternatives Considered
- OCI-native only: Simpler setup but vendor lock-in, less flexible for application-level customization
- Open-source only: Full portability but higher operational overhead (self-manage everything)
- Datadog/New Relic: Comprehensive but expensive; telemetry would be stored outside KSA (a PDPL compliance problem); vendor lock-in
- ELK Stack (Elasticsearch, Logstash, Kibana): Good for logging but heavier than needed; Elasticsearch's licensing changes add uncertainty