⚙️ MLOps (Model Lifecycle & Monitoring)

BayanCore's MLOps pipeline governs the deployment, parameter tuning, version control, and infrastructure monitoring of local AI models running on private OCI GPU nodes.


1. Local GPU Inference Nodes

To ensure compliance with PDPL data sovereignty rules, all model inference tasks execute domestically:

  • Inference Shapes: Mistral-7B-Instruct and Llama-3-70B-Instruct models are hosted on OCI bare-metal GPU clusters (running NVIDIA A10 Tensor Core GPUs) in me-riyadh-1.
  • Serving Engine: Models are served using the vLLM engine, implementing PagedAttention to optimize throughput and reduce latency under concurrent request loads.
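
The paging idea behind PagedAttention can be illustrated with a small allocator. The sketch below is a toy model in plain Python, not vLLM's implementation: the class name `PagedKVCache`, the block size, and the block count are all hypothetical. It shows only the core mechanism, where each sequence's KV cache grows one fixed-size block at a time and blocks return to a shared pool when a request finishes, so memory is not reserved up front for the maximum sequence length.

```python
from dataclasses import dataclass, field


@dataclass
class PagedKVCache:
    """Toy block allocator illustrating paged KV-cache management.

    Hypothetical names and sizes; this is not vLLM's internal API.
    """
    num_blocks: int        # total physical KV-cache blocks available
    block_size: int = 16   # tokens stored per block (illustrative value)
    free_blocks: list[int] = field(default_factory=list, init=False)
    block_tables: dict[str, list[int]] = field(default_factory=dict, init=False)
    seq_lens: dict[str, int] = field(default_factory=dict, init=False)

    def __post_init__(self) -> None:
        # Every block starts out in the shared free pool.
        self.free_blocks = list(range(self.num_blocks))

    def append_token(self, seq_id: str) -> bool:
        """Reserve room for one more token; return False if no block is free."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        # A new block is needed only when all blocks held so far are full.
        if length == len(table) * self.block_size:
            if not self.free_blocks:
                return False
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1
        return True

    def release(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)
```

Because a block is allocated only when the previous one fills, at most `block_size - 1` token slots per sequence sit unused, which is what lets more concurrent requests fit in the same GPU memory.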

2. Fine-Tuning & Model Versioning

Our models are adapted to bilingual Saudi business intent using Parameter-Efficient Fine-Tuning (PEFT):

  • Fine-Tuning Method: Low-Rank Adaptation (LoRA) is executed on anonymized tenant interaction logs, training models to parse Arabic/English code-switching and Saudi dialects.
  • Model Registry: Model checkpoints, LoRA weights, and embedding weights are registered in a local MLflow registry hosted on private OCI Object Storage.
  • Deployment Pipeline: Promoting a model to staging or production requires passing automated evaluations (checking accuracy thresholds and verifying zero hallucinated financial proposals).
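
The promotion check described above can be sketched as a small gate function. This is a minimal illustration under stated assumptions: `EvalReport`, `can_promote`, and the 0.90 default accuracy threshold are hypothetical, since the actual thresholds and report format are not specified here. The only rules taken from the text are that accuracy must meet a threshold and that the count of hallucinated financial proposals must be exactly zero.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalReport:
    """Summary of one automated evaluation run (hypothetical shape)."""
    accuracy: float                         # fraction of eval cases answered correctly
    hallucinated_financial_proposals: int   # count flagged by the eval suite


def can_promote(report: EvalReport, min_accuracy: float = 0.90) -> tuple[bool, list[str]]:
    """Return (allowed, reasons); reasons lists every failed check.

    min_accuracy is an assumed placeholder, not a documented threshold.
    """
    reasons: list[str] = []
    if report.accuracy < min_accuracy:
        reasons.append(
            f"accuracy {report.accuracy:.3f} is below threshold {min_accuracy:.3f}"
        )
    if report.hallucinated_financial_proposals != 0:
        reasons.append(
            f"{report.hallucinated_financial_proposals} hallucinated financial proposal(s) found"
        )
    return (not reasons, reasons)
```

Returning the list of failure reasons alongside the boolean lets a pipeline log why a candidate was blocked instead of only that it was.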

3. Drift & Resource Monitoring

AI models are monitored continuously to detect resource issues or response degradation:

  • Semantic Drift: Evaluates user interaction history over time. If a user repeatedly rejects or corrects the AI's OCR suggestions, the system flags the interaction for review.
  • Latency Monitoring: Logs Time-To-First-Token (TTFT) and total generation latency, targeting an average TTFT under 500 ms.
  • GPU Metrics: Tracks GPU memory utilization, temperature, and compute cycles, alerting system administrators when memory utilization exceeds 90% (to prevent Out-Of-Memory reboots).
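
The three checks above can be expressed as simple threshold functions. The sketch below is illustrative only: the function names are hypothetical, and `REJECTION_FLAG_COUNT` is an assumed value because the text does not quantify "repeatedly". The 500 ms TTFT target and the 90% memory threshold come from the list above. How the samples are collected (for example, from serving logs or GPU telemetry) is outside the scope of this sketch.

```python
from statistics import mean

TTFT_TARGET_MS = 500.0        # from the text: average TTFT under 500 ms
GPU_MEMORY_ALERT_PCT = 90.0   # from the text: alert above 90% memory utilization
REJECTION_FLAG_COUNT = 3      # assumption: "repeatedly" is not quantified in the text


def ttft_within_target(ttft_samples_ms: list[float]) -> bool:
    """True if the average TTFT of the samples is under the target.

    Raises statistics.StatisticsError if the sample list is empty.
    """
    return mean(ttft_samples_ms) < TTFT_TARGET_MS


def gpu_memory_alert(used_mib: float, total_mib: float) -> bool:
    """True if GPU memory utilization exceeds the alert threshold."""
    return 100.0 * used_mib / total_mib > GPU_MEMORY_ALERT_PCT


def needs_review(feedback: list[str]) -> bool:
    """True if a user rejected or corrected suggestions often enough to flag.

    Each entry is one of 'accepted', 'rejected', or 'corrected' (hypothetical labels).
    """
    negative = sum(1 for item in feedback if item in ("rejected", "corrected"))
    return negative >= REJECTION_FLAG_COUNT
```

For example, `gpu_memory_alert(22000, 24000)` reports an alert at roughly 91.7% utilization, while `ttft_within_target([300, 450, 600])` passes because the mean is 450 ms even though one sample exceeds the target.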