📊 Observability & Monitoring

BayanCore's observability stack is built on OpenTelemetry and covers container health, database performance, transaction latency, and AI model quality.

1. The Three Pillars of Observability

All telemetry data is collected, aggregated, and stored strictly inside Saudi Arabia (OCI Riyadh).

Metrics Collection (Prometheus & OCI Monitoring)

Infrastructure: Monitors container CPU shares, RAM utilization, block storage IOPS, and Kubernetes node health.
Application: Tracks request counts, error rates, and P95 latency metrics across REST/GraphQL endpoints.
Business & Compliance Key Performance Indicators (KPIs):
- FWCR (First-Workflow-Completion-Rate): Target >80%.
- ZATCA Latency: Target <2 seconds.
- AI OCR Extraction Accuracy: Target >95%.
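
The KPI targets above can be checked mechanically once the values are exported. The following is a minimal sketch, not BayanCore's actual implementation: the metric names (`fwcr`, `zatca_latency_seconds`, `ocr_accuracy`), the dictionary layout, and the nearest-rank P95 helper are all assumptions made for illustration.

```python
# Targets copied from the KPI list; the key names are hypothetical.
KPI_TARGETS = {
    "fwcr": (">", 0.80),                   # First-Workflow-Completion-Rate
    "zatca_latency_seconds": ("<", 2.0),   # ZATCA latency
    "ocr_accuracy": (">", 0.95),           # AI OCR extraction accuracy
}


def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    # Integer ceiling of 0.95 * n avoids floating-point rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def kpi_breaches(observed):
    """Return the names of KPIs whose observed value misses its target."""
    breaches = []
    for name, (op, target) in KPI_TARGETS.items():
        value = observed[name]
        ok = value > target if op == ">" else value < target
        if not ok:
            breaches.append(name)
    return breaches
```

A value exactly on the target (for example an FWCR of 0.80) counts as a breach here because the targets are written as strict inequalities; whether that matches the intended policy is an open assumption.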

Structured Logging (Grafana Loki)

Structured Format: All services emit each log entry as a single structured JSON object:

{
  "timestamp": "2026-06-02T01:16:39Z",
  "service": "guardian-validator",
  "level": "INFO",
  "trace_id": "82a71f092b3a1",
  "message": "Invoice UBL 2.1 validation passed",
  "metadata": { "invoice_id": "INV-2026-0102", "company_id": "comp-12" }
}
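
A service could produce entries in this shape with the Python standard library alone. This is a sketch under assumptions: the `JsonFormatter` class and `build_logger` helper are hypothetical names, and passing `trace_id` and `metadata` through the `extra` argument is one possible convention rather than the documented one.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object with the fields shown above."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        payload = {
            "timestamp": ts.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "service": self.service,
            "level": record.levelname,
            # trace_id and metadata are expected via logger.info(..., extra={...}).
            "trace_id": getattr(record, "trace_id", None),
            "message": record.getMessage(),
            "metadata": getattr(record, "metadata", {}),
        }
        return json.dumps(payload)


def build_logger(service, stream=None):
    """Create a logger that writes JSON lines to `stream` (stderr by default)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger = logging.getLogger(service)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

Usage would look like `build_logger("guardian-validator").info("Invoice UBL 2.1 validation passed", extra={"trace_id": "82a71f092b3a1", "metadata": {"invoice_id": "INV-2026-0102"}})`.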

Log Isolation: Debug and application logs are stored in OCI Object Storage with an automated lifecycle policy that purges records after 2 years.

Distributed Tracing (OpenTelemetry & Jaeger)

Span Tracking: Every incoming API call is assigned a unique trace_id by the API Gateway. This trace ID propagates across background tasks, Celery queues, database queries, and AI inferences, allowing engineers to isolate latency bottlenecks.
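
The propagation idea can be illustrated without any tracing library. The sketch below uses only the standard library and is not the OpenTelemetry SDK; a real deployment would rely on the SDK's own context propagation. The helper names (`start_trace`, `inject`, `extract`) and the `trace_id` header key are assumptions.

```python
import contextvars
import uuid

# Holds the trace ID for the current request or task context.
_trace_id = contextvars.ContextVar("trace_id", default=None)


def start_trace(incoming=None):
    """Adopt an incoming trace ID, or mint a new one at the edge."""
    trace_id = incoming or uuid.uuid4().hex
    _trace_id.set(trace_id)
    return trace_id


def current_trace_id():
    """Return the trace ID bound to the current context, if any."""
    return _trace_id.get()


def inject(headers):
    """Copy the current trace ID into outgoing task or request headers."""
    out = dict(headers)
    trace_id = current_trace_id()
    if trace_id is not None:
        out["trace_id"] = trace_id
    return out


def extract(headers):
    """On the receiving side, resume the trace carried in the headers."""
    return start_trace(headers.get("trace_id"))
```

The gateway would call `start_trace()` once per request, producers would call `inject()` before enqueueing a background task, and workers would call `extract()` before doing any work, so that logs and spans on both sides share the same ID.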

2. AI Model Drift & Telemetry

Generative AI pipelines need monitoring signals beyond standard service metrics:

Feedback Loops: Captures instances where users manually correct AI suggestions (such as modifying fields populated via receipt OCR). A correction rate exceeding 5% triggers alerts for model retraining.
Semantic Shift: Detects changes in the distribution of user queries (e.g., new vocabulary or regional dialects) so that RAG retrieval parameters can be updated accordingly.
LLM Hallucinations: Tracks confidence scores and deterministic post-validation failures. If the validation engine blocks more than 1% of LLM outputs due to data mismatches, an alert is raised to fall back to rule-based execution.
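
The two numeric thresholds in this section (a correction rate above 5% and a validation block rate above 1%) could be evaluated over a reporting window roughly as follows. This is an illustrative sketch: the `AiWindowStats` fields, the alert names, and the choice of window are assumptions, not part of the documented system.

```python
from dataclasses import dataclass

CORRECTION_RATE_LIMIT = 0.05        # user corrections / AI suggestions
VALIDATION_BLOCK_RATE_LIMIT = 0.01  # blocked outputs / LLM outputs


@dataclass
class AiWindowStats:
    """Counters collected over one reporting window (hypothetical shape)."""
    suggestions: int
    user_corrections: int
    llm_outputs: int
    validation_blocks: int


def rate(numerator, denominator):
    """Safe ratio that treats an empty window as a rate of zero."""
    return numerator / denominator if denominator else 0.0


def evaluate_ai_alerts(stats):
    """Return the alert names triggered by the window's counters."""
    alerts = []
    if rate(stats.user_corrections, stats.suggestions) > CORRECTION_RATE_LIMIT:
        alerts.append("retrain_model")
    if rate(stats.validation_blocks, stats.llm_outputs) > VALIDATION_BLOCK_RATE_LIMIT:
        alerts.append("fallback_to_rules")
    return alerts
```

Treating an empty window as a zero rate avoids spurious alerts during quiet periods, although a production system might prefer to require a minimum sample size before evaluating either threshold.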

3. Incident Alerting Matrix

Alerts are routed to operations teams based on severity:

Severity Level	Trigger Threshold	Primary Alert Channel	Escalation Policy
P1 (Critical)	ZATCA API failure rate >5%, DB Connection Pool exhausted, site offline	SMS / WhatsApp Call	Page on-call engineer; escalate to CTO if unresolved in 15m
P2 (High)	API error rate >1%, P95 latency >3s, disk usage >85%	Slack & Email	Route to active dev team channel; resolve within 4 hours
P3 (Medium)	AI OCR correction rate >10%, Redis cache hit rate <80%	Slack Alert	Log to backlog; review during weekly maintenance sprints
P4 (Low)	UI layout alignment warning, non-critical package updates	Dashboard Notification	Review during bi-weekly release cycles
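
The numeric triggers in the matrix can be expressed as a small classification function. The sketch below assumes a flat dictionary of current readings with hypothetical key names; P4 conditions are not numeric in the table, so they are left out of the classifier and appear only in the channel mapping.

```python
# Alert channels copied from the matrix.
CHANNELS = {
    "P1": "SMS / WhatsApp Call",
    "P2": "Slack & Email",
    "P3": "Slack Alert",
    "P4": "Dashboard Notification",
}


def classify_severity(metrics):
    """Return the highest matching severity ('P1'..'P3'), or None."""
    if (metrics.get("zatca_failure_rate", 0.0) > 0.05
            or metrics.get("db_pool_exhausted", False)
            or metrics.get("site_offline", False)):
        return "P1"
    if (metrics.get("api_error_rate", 0.0) > 0.01
            or metrics.get("p95_latency_seconds", 0.0) > 3.0
            or metrics.get("disk_usage", 0.0) > 0.85):
        return "P2"
    if (metrics.get("ocr_correction_rate", 0.0) > 0.10
            or metrics.get("redis_hit_rate", 1.0) < 0.80):
        return "P3"
    return None
```

Checks run from most to least severe, so a reading that satisfies both a P1 and a P2 trigger is reported once, as P1. Missing keys default to healthy values so that a partial metrics snapshot does not raise a false alert.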
