📊 Observability & Monitoring
BayanCore implements an observability stack based on OpenTelemetry that provides visibility into container health, database performance, transactional latency, and AI model quality.
1. The Three Pillars of Observability
All telemetry data is collected, aggregated, and stored strictly inside Saudi Arabia (OCI Riyadh).
Metrics Collection (Prometheus & OCI Monitoring)
- Infrastructure: Monitors container CPU shares, RAM utilization, block storage IOPS, and Kubernetes node health.
- Application: Tracks request counts, error rates, and P95 latency metrics across REST/GraphQL endpoints.
- Business & Compliance Key Performance Indicators (KPIs):
  - FWCR (First-Workflow-Completion-Rate): Target >80%.
  - ZATCA Latency: Target <2 seconds.
  - AI OCR Extraction Accuracy: Target >95%.
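The KPI targets above can be expressed as simple threshold checks. The following is a minimal sketch, not BayanCore's actual implementation: the names `KpiTarget`, `TARGETS`, `p95`, and `evaluate` are hypothetical, and the source does not state which latency statistic the ZATCA target applies to, so the nearest-rank `p95` helper is only an illustration of the P95 metric mentioned for application endpoints.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KpiTarget:
    """A KPI target from the list above (class name is hypothetical)."""
    label: str
    threshold: float
    higher_is_better: bool

    def is_met(self, value: float) -> bool:
        # Targets are strict: ">80%" means the value must exceed 0.80.
        if self.higher_is_better:
            return value > self.threshold
        return value < self.threshold


# Thresholds come from the document; the dictionary keys are assumptions.
TARGETS = {
    "fwcr": KpiTarget("FWCR", 0.80, higher_is_better=True),
    "zatca_latency_s": KpiTarget("ZATCA Latency", 2.0, higher_is_better=False),
    "ocr_accuracy": KpiTarget("AI OCR Extraction Accuracy", 0.95, higher_is_better=True),
}


def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty sample list."""
    if not samples:
        raise ValueError("p95 requires at least one sample")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def evaluate(observed: dict[str, float]) -> dict[str, bool]:
    """Return, per KPI key, whether the observed value meets its target."""
    return {key: TARGETS[key].is_met(value) for key, value in observed.items()}
```

For example, `evaluate({"fwcr": 0.85, "zatca_latency_s": 2.4, "ocr_accuracy": 0.97})` reports the latency KPI as missed and the other two as met.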
Structured Logging (Grafana Loki)
- Structured Format: All services emit logs as structured JSON objects, for example:
{"timestamp": "2026-06-02T01:16:39Z","service": "guardian-validator","level": "INFO","trace_id": "82a71f092b3a1","message": "Invoice UBL 2.1 validation passed","metadata": { "invoice_id": "INV-2026-0102", "company_id": "comp-12" }}
- Log Isolation: Debug and application logs are stored in OCI Object Storage with an automated lifecycle policy that purges records after 2 years.
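One way a Python service could produce log lines in the shape shown above is a custom `logging.Formatter`. This is a sketch using only the standard library; `JsonFormatter` and `build_logger` are hypothetical names, and how BayanCore services actually ship logs to Loki is not described in the source.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with the fields shown above."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        payload = {
            "timestamp": created.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "service": self.service,
            "level": record.levelname,
            # trace_id and metadata arrive through the `extra=` argument.
            "trace_id": getattr(record, "trace_id", None),
            "message": record.getMessage(),
            "metadata": getattr(record, "metadata", {}),
        }
        return json.dumps(payload)


def build_logger(service: str, stream=None) -> logging.Logger:
    """Create a logger that writes JSON lines to `stream` (stderr by default)."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    return logger
```

A call such as `build_logger("guardian-validator").info("Invoice UBL 2.1 validation passed", extra={"trace_id": "82a71f092b3a1", "metadata": {"invoice_id": "INV-2026-0102", "company_id": "comp-12"}})` then emits a single JSON line with the same keys as the sample entry.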
Distributed Tracing (OpenTelemetry & Jaeger)
- Span Tracking: Every incoming API call is assigned a unique `trace_id` by the API Gateway. This trace ID propagates across background tasks, Celery queues, database queries, and AI inferences, allowing engineers to isolate latency bottlenecks.
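The propagation idea can be illustrated without any tracing SDK. The sketch below uses only the standard library's `contextvars`; in a real deployment the OpenTelemetry SDK would handle this through its own context propagation. The function names (`start_trace`, `inject`, `extract`, `handle_task`) and the `trace_id` header key are assumptions made for illustration.

```python
import contextvars
import uuid
from typing import Optional

# Holds the trace ID for the current execution context (request or task).
_trace_id: contextvars.ContextVar = contextvars.ContextVar("trace_id", default=None)


def start_trace() -> str:
    """Gateway side: assign a fresh trace ID to the incoming call."""
    trace_id = uuid.uuid4().hex
    _trace_id.set(trace_id)
    return trace_id


def current_trace_id() -> Optional[str]:
    """Return the trace ID bound to the current context, if any."""
    return _trace_id.get()


def inject(headers: dict) -> dict:
    """Producer side: copy the current trace ID into outgoing task headers."""
    carrier = dict(headers)
    trace_id = _trace_id.get()
    if trace_id is not None:
        carrier["trace_id"] = trace_id
    return carrier


def extract(headers: dict) -> Optional[str]:
    """Worker side: restore the trace ID carried by the task headers."""
    trace_id = headers.get("trace_id")
    if trace_id is not None:
        _trace_id.set(trace_id)
    return trace_id


def handle_task(headers: dict) -> Optional[str]:
    """Hypothetical worker entry point: bind the trace ID, then do the work."""
    extract(headers)
    return current_trace_id()
```

Running `handle_task` inside a fresh `contextvars.Context()` mimics a worker process that knows nothing about the original request until the headers are extracted.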
2. AI Model Drift & Telemetry
Observing generative AI pipelines requires specific monitoring parameters:
- Feedback Loops: Captures instances where users manually correct AI suggestions (such as modifying fields populated via receipt OCR). A correction rate exceeding 5% triggers alerts for model retraining.
- Semantic Shift: Tracks changes in the distribution of user queries (e.g., new vocabulary or regional dialects) so that RAG retrieval parameters can be updated accordingly.
- LLM Hallucinations: Tracks confidence scores and deterministic post-validation failures. If the validation engine blocks more than 1% of LLM outputs due to data mismatches, an alert is raised to fall back to rule-based execution.
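The correction-rate and validation-block thresholds above can both be modeled as a rate over a sliding window of recent events. This is a rough sketch: `RateMonitor`, the window size, and the minimum sample count are assumptions, since the source gives only the 5% and 1% thresholds and not the period over which they are measured.

```python
from collections import deque


class RateMonitor:
    """Track the fraction of flagged events over a sliding window."""

    def __init__(self, threshold: float, window: int = 1000, min_samples: int = 100):
        self.threshold = threshold
        self.min_samples = min_samples  # avoid alerting on tiny samples
        self._events = deque(maxlen=window)  # oldest events drop off automatically

    def record(self, flagged: bool) -> None:
        """Record one event; `flagged` means a user correction or a blocked output."""
        self._events.append(bool(flagged))

    @property
    def rate(self) -> float:
        if not self._events:
            return 0.0
        return sum(self._events) / len(self._events)

    def should_alert(self) -> bool:
        """True once enough samples exist and the rate exceeds the threshold."""
        return len(self._events) >= self.min_samples and self.rate > self.threshold


# Thresholds from the document; window and min_samples are illustrative defaults.
correction_monitor = RateMonitor(threshold=0.05)        # user corrections of AI output
validation_block_monitor = RateMonitor(threshold=0.01)  # LLM outputs blocked by validation
```

Each monitor would be fed one `record(...)` call per AI suggestion or per validated LLM output, with `should_alert()` polled by whatever alerting job is in place.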
3. Incident Alerting Matrix
Alerts are routed to operations teams based on severity:
| Severity Level | Trigger Threshold | Primary Alert Channel | Escalation Policy |
|---|---|---|---|
| P1 (Critical) | ZATCA API failure rate >5%, DB Connection Pool exhausted, site offline | SMS / WhatsApp Call | Page on-call engineer; escalate to CTO if unresolved in 15 minutes |
| P2 (High) | API error rate >1%, P95 latency >3s, disk usage >85% | Slack & Email | Route to active dev team channel; resolve within 4 hours |
| P3 (Medium) | AI OCR correction rate >10%, Redis cache hit rate <80% | Slack Alert | Log to backlog; review during weekly maintenance sprints |
| P4 (Low) | UI layout alignment warning, non-critical package updates | Dashboard Notification | Review during bi-weekly release cycles |
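The numeric rows of the matrix can be encoded as a small classification function. The sketch below is an illustration only: `HealthSnapshot`, `classify`, and `CHANNELS` are hypothetical names, ratios are expressed as fractions (0.05 for 5%), and P4 is left out of `classify` because its triggers are not numeric thresholds.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Severity(IntEnum):
    P1 = 1
    P2 = 2
    P3 = 3
    P4 = 4


@dataclass
class HealthSnapshot:
    """Point-in-time readings; field names are assumptions for this sketch."""
    zatca_failure_rate: float = 0.0
    db_pool_exhausted: bool = False
    site_offline: bool = False
    api_error_rate: float = 0.0
    p95_latency_s: float = 0.0
    disk_usage: float = 0.0
    ocr_correction_rate: float = 0.0
    redis_hit_rate: float = 1.0


# Primary alert channel per severity, copied from the matrix.
CHANNELS = {
    Severity.P1: "SMS / WhatsApp Call",
    Severity.P2: "Slack & Email",
    Severity.P3: "Slack Alert",
    Severity.P4: "Dashboard Notification",
}


def classify(s: HealthSnapshot) -> Optional[Severity]:
    """Return the most severe level whose trigger fires, or None if healthy."""
    if s.zatca_failure_rate > 0.05 or s.db_pool_exhausted or s.site_offline:
        return Severity.P1
    if s.api_error_rate > 0.01 or s.p95_latency_s > 3.0 or s.disk_usage > 0.85:
        return Severity.P2
    if s.ocr_correction_rate > 0.10 or s.redis_hit_rate < 0.80:
        return Severity.P3
    return None
```

Checking the rows in severity order means a snapshot that trips both a P1 and a P3 trigger is routed once, to the P1 channel.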