ADR-011: Observability Stack — Hybrid Approach

Status: Proposed
Date: 2026-05-17

Context

BayanCore requires comprehensive observability across all layers: infrastructure, application, and AI services. The system must provide:

Metrics: CPU, memory, disk, network, request rates, error rates, latency percentiles
Logging: Application logs, audit logs, access logs, AI query logs
Tracing: Distributed tracing across services (Next.js → API → ERPNext → Guardian → AI)
Alerting: Threshold-based and anomaly-based alerts for critical systems
Dashboards: Operational dashboards for SRE team and business dashboards for product team

We need to choose among three options: OCI-native observability services, open-source tools, or a hybrid of the two.

Decision

A hybrid observability stack is adopted:

Infrastructure Layer: OCI Native Services

OCI Monitoring: Infrastructure metrics (compute, network, storage, database)
OCI Logging: Centralized log aggregation from OCI services
OCI Application Performance Monitoring (APM): Service-level performance metrics

Application Layer: Open-Source Stack

Prometheus: Application metrics collection and storage
Grafana: Metrics visualization and dashboards
Jaeger: Distributed tracing (OpenTelemetry backend)
OpenTelemetry: Unified instrumentation library for Node.js, Python, and Go services

Rationale:

OCI native services provide managed infrastructure monitoring with minimal operational overhead
Open-source stack provides portability and deeper application-level customization
OpenTelemetry is the industry standard for vendor-agnostic instrumentation
Grafana dashboards can combine OCI metrics (via plugin) and Prometheus metrics
Jaeger provides end-to-end request tracing across all services
Avoids full vendor lock-in while leveraging managed services where appropriate
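
The end-to-end tracing point above depends on every service forwarding trace context to the next hop. OpenTelemetry's default propagator uses the W3C Trace Context `traceparent` header, so a minimal stdlib-only sketch of that mechanism may help. The function names below are hypothetical and not part of any BayanCore codebase; in practice the OTel SDK would do this automatically.

```python
import re
import secrets

# W3C Trace Context header layout: version-traceid-parentid-flags (lowercase hex).
_TRACEPARENT_RE = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace at the edge (for example, the Next.js frontend)."""
    trace_id = secrets.token_hex(16)  # 16 random bytes -> 32 hex characters
    span_id = secrets.token_hex(8)    # 8 random bytes -> 16 hex characters
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def child_traceparent(incoming: str) -> str:
    """Continue an incoming trace: keep the trace ID, mint a new span ID."""
    match = _TRACEPARENT_RE.fullmatch(incoming)
    if match is None:
        # Malformed or missing header: fall back to starting a fresh trace.
        return new_traceparent()
    _version, trace_id, _parent_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"
```

Each hop (API, ERPNext, Guardian, AI) would call something like `child_traceparent` on the inbound header and send the result downstream, which is what lets Jaeger stitch the spans into one trace.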

Observability Pipeline:

Application (OTel SDK) → OTel Collector → Prometheus (metrics)
                                      → Jaeger (traces)
                                      → OCI Logging (logs)
OCI Services → OCI Monitoring → Grafana (via OCI plugin)
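
The fan-out in the diagram above can be modeled as a lookup from signal type to backend. This is only an illustrative toy in Python, not OTel Collector configuration; the backend labels and the `route` helper are assumptions made for the example.

```python
from collections import defaultdict

# Signal type -> destination backend, mirroring the pipeline diagram.
ROUTES = {
    "metrics": "prometheus",
    "traces": "jaeger",
    "logs": "oci-logging",
}


def route(signals):
    """Group (signal_type, payload) pairs into per-backend batches."""
    batches = defaultdict(list)
    for signal_type, payload in signals:
        backend = ROUTES.get(signal_type)
        if backend is None:
            raise ValueError(f"unknown signal type: {signal_type!r}")
        batches[backend].append(payload)
    return dict(batches)
```

A real Collector expresses the same idea declaratively as one pipeline per signal type, each with its own exporter.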

Key Metrics to Track:

Category	Metrics
Infrastructure	CPU, memory, disk I/O, network throughput
Application	Request rate, error rate, latency (P50/P95/P99), active users
Database	Query latency, connection pool usage, slow queries
AI/ML	Token usage, response latency, hallucination rate, RAG retrieval accuracy
Compliance	ZATCA submission success rate, audit log write latency
Business	FWCR (First-Workflow-Completion-Rate), workflow error rate
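
To make the application metrics in the table concrete, here is a small sketch of how latency percentiles and error rate could be computed from raw samples. It uses the nearest-rank method and treats only 5xx responses as errors; both choices are assumptions for illustration. In the proposed stack, Prometheus would more likely estimate percentiles from histogram buckets than from raw samples.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def error_rate(status_codes):
    """Fraction of requests that returned a 5xx status (assumed definition)."""
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if code >= 500)
    return errors / len(status_codes)


latencies_ms = [120, 80, 200, 95, 110]
p50 = percentile(latencies_ms, 50)  # 110
p95 = percentile(latencies_ms, 95)  # 200
```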

Alerting Strategy:

Critical: PagerDuty integration (system down, data loss risk)
Warning: Slack notifications (elevated error rates, performance degradation)
Info: Dashboard only (routine operational metrics)
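
The three tiers above amount to a mapping from severity to notification channel. The sketch below shows one way that mapping could look; the 5% warning threshold and the `classify` rule are hypothetical placeholders, not values taken from this ADR.

```python
from enum import Enum


class Severity(Enum):
    CRITICAL = "critical"
    WARNING = "warning"
    INFO = "info"


# Severity -> notification channel, following the alerting strategy.
CHANNELS = {
    Severity.CRITICAL: "pagerduty",
    Severity.WARNING: "slack",
    Severity.INFO: "dashboard",
}


def classify(error_rate, service_up, warn_threshold=0.05):
    """Pick a severity; warn_threshold is an assumed example value."""
    if not service_up:
        return Severity.CRITICAL  # system down
    if error_rate >= warn_threshold:
        return Severity.WARNING   # elevated error rate
    return Severity.INFO          # routine operational metric


def channel_for(severity):
    """Return the channel that should receive an alert of this severity."""
    return CHANNELS[severity]
```

In the proposed stack this logic would normally live in alerting rules and notification routing rather than application code; the Python form only makes the tiers explicit.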

Consequences

Positive: Best of both worlds (managed infra + flexible app monitoring), portable instrumentation, industry-standard tools
Trade-offs: More complex setup than single-vendor solution, need to manage Prometheus/Jaeger infrastructure
Risks: Prometheus storage scaling requires planning, OTel adoption across all services takes time

Alternatives Considered

OCI-native only: Simpler setup but vendor lock-in, less flexible for application-level customization
Open-source only: Full portability but higher operational overhead (self-manage everything)
Datadog/New Relic: Comprehensive but expensive; telemetry data would leave KSA (PDPL violation); vendor lock-in
ELK Stack (Elasticsearch, Logstash, Kibana): Good for logging but heavier than needed, and Elasticsearch licensing changes add uncertainty
