Skip to main content

ADR-011: Observability Stack — Hybrid Approach

Status: Proposed Date: 2026-05-17

Context

BayanCore requires comprehensive observability across all layers: infrastructure, application, and AI services. The system must provide:

  1. Metrics: CPU, memory, disk, network, request rates, error rates, latency percentiles
  2. Logging: Application logs, audit logs, access logs, AI query logs
  3. Tracing: Distributed tracing across services (Next.js → API → ERPNext → Guardian → AI)
  4. Alerting: Threshold-based and anomaly-based alerts for critical systems
  5. Dashboards: Operational dashboards for SRE team and business dashboards for product team

We need to decide between OCI-native observability services, open-source tools, or a hybrid approach.

Decision

A hybrid observability stack is adopted:

Infrastructure Layer: OCI Native Services

  • OCI Monitoring: Infrastructure metrics (compute, network, storage, database)
  • OCI Logging: Centralized log aggregation from OCI services
  • OCI Application Performance Monitoring (APM): Service-level performance metrics

Application Layer: Open-Source Stack

  • Prometheus: Application metrics collection and storage
  • Grafana: Metrics visualization and dashboards
  • Jaeger: Distributed tracing (OpenTelemetry backend)
  • OpenTelemetry: Unified instrumentation library for Node.js, Python, and Go services

Rationale:

  • OCI native services provide zero-ops infrastructure monitoring
  • Open-source stack provides portability and deeper application-level customization
  • OpenTelemetry is the industry standard — vendor-agnostic instrumentation
  • Grafana dashboards can combine OCI metrics (via plugin) and Prometheus metrics
  • Jaeger provides end-to-end request tracing across all services
  • Avoids full vendor lock-in while leveraging managed services where appropriate

Observability Pipeline:

Application (OTel SDK) → OTel Collector → Prometheus (metrics)
→ Jaeger (traces)
→ OCI Logging (logs)
OCI Services → OCI Monitoring → Grafana (via OCI plugin)

Key Metrics to Track:

CategoryMetrics
InfrastructureCPU, memory, disk I/O, network throughput
ApplicationRequest rate, error rate, latency (P50/P95/P99), active users
DatabaseQuery latency, connection pool usage, slow queries
AI/MLToken usage, response latency, hallucination rate, RAG retrieval accuracy
ComplianceZATCA submission success rate, audit log write latency
BusinessFWCR (First-Workflow-Completion-Rate), workflow error rate

Alerting Strategy:

  • Critical: PagerDuty integration (system down, data loss risk)
  • Warning: Slack notifications (elevated error rates, performance degradation)
  • Info: Dashboard only (routine operational metrics)

Consequences

  • Positive: Best of both worlds (managed infra + flexible app monitoring), portable instrumentation, industry-standard tools
  • Trade-offs: More complex setup than single-vendor solution, need to manage Prometheus/Jaeger infrastructure
  • Risks: Prometheus storage scaling requires planning, OTel adoption across all services takes time

Alternatives Considered

  • OCI-native only: Simpler setup but vendor lock-in, less flexible for application-level customization
  • Open-source only: Full portability but higher operational overhead (self-manage everything)
  • Datadog/New Relic: Comprehensive but expensive, data leaves KSA (PDPL violation), vendor lock-in
  • ELK Stack (Elasticsearch, Logstash, Kibana): Good for logging but heavier than needed, Elasticsearch licensing changes