🗣️ Bilingual NLP & Dialect Handling

Saudi business communication mixes standard business Arabic, English terminology embedded in Arabic sentences (code-switching), and regional Saudi dialects (Najdi, Hijazi, Eastern). BayanCore runs specialized NLP pipelines to parse this input accurately.

1. Saudi Dialect Normalization

The system parses regional spelling variations and colloquial phrases to map them to core operations:

Dialect Translation Layer: A localized pre-processing layer translates regional expressions into standard business vocabulary:
- Najdi: “أبي أسوي فاتورة” (I want to make an invoice) -> action: create_invoice
- Hijazi: “زبط لي كشف حساب” (Adjust / generate an account statement for me) -> action: generate_statement
- Eastern: “شيك على المخزن” (Check the warehouse) -> action: query_inventory
Orthographic Normalization: Standardizes spelling variations of Arabic characters (such as the interchangeable use of أ, إ, آ, and ا) before executing vector similarity searches, so that two spellings of the same word produce the same query text.
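
The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not BayanCore's actual rule set: the folding rules (alef variants to bare ا, removal of tatweel and short-vowel diacritics), the `DIALECT_ACTIONS` table, and the function names are assumptions. It also swaps the vector similarity search mentioned above for an exact dictionary lookup, to keep the example self-contained.

```python
import re
import unicodedata

# Assumption: illustrative folding rules, not BayanCore's actual rule set.
ALEF_VARIANTS = re.compile("[\u0623\u0625\u0622]")          # أ إ آ
TATWEEL_AND_DIACRITICS = re.compile("[\u0640\u064B-\u0652]")  # ـ and short vowels

# Hypothetical phrase table built only from the three examples in the text.
DIALECT_ACTIONS = {
    "أبي أسوي فاتورة": "create_invoice",       # Najdi
    "زبط لي كشف حساب": "generate_statement",   # Hijazi
    "شيك على المخزن": "query_inventory",       # Eastern
}


def normalize(text: str) -> str:
    """Fold common spelling variants so equivalent inputs compare equal."""
    text = unicodedata.normalize("NFC", text)   # compose alef + combining hamza
    text = TATWEEL_AND_DIACRITICS.sub("", text)
    text = ALEF_VARIANTS.sub("\u0627", text)     # fold to bare alef
    return " ".join(text.split())                # collapse whitespace


# Normalize the table keys once so lookups compare like with like.
NORMALIZED_ACTIONS = {normalize(k): v for k, v in DIALECT_ACTIONS.items()}


def map_dialect_phrase(text: str):
    """Return the mapped action, or None when the phrase is not in the table."""
    return NORMALIZED_ACTIONS.get(normalize(text))
```

With this sketch, a user who types the Najdi phrase without hamzas (ابي اسوي فاتورة) still resolves to create_invoice, because both the input and the table keys pass through the same `normalize` function.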

2. Bilingual Code-Switching Tokenization

In Saudi business environments, users frequently mix languages within single sentences. For example:

“أرسل الـ PO للمورد” (Send the PO to the supplier)
“الـ invoice هذا مضاف له VAT؟” (Does this invoice include VAT?)
Bilingual Vocabulary Mapping: The tokenizer uses a custom vocabulary that treats mixed terms (like الـ PO or الـ invoice) as atomic tokens mapped to canonical English identifiers (purchase_order, sales_invoice), so that a term is not split into unrelated fragments before embeddings are computed.
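
As a rough sketch of this idea, the snippet below matches known mixed terms before splitting on whitespace, so each one is emitted as a single canonical token. The `BILINGUAL_VOCAB` table holds only the two mappings named above; the regex approach, the case-sensitive matching, and the function name are assumptions made for illustration and may differ from the real tokenizer.

```python
import re

# Only the two mappings named in the text; a real vocabulary would be larger.
BILINGUAL_VOCAB = {
    "الـ PO": "purchase_order",
    "الـ invoice": "sales_invoice",
}

# Longest terms first so a longer entry wins over a shorter prefix.
# The lookarounds stop a term from matching inside a longer word.
_TERMS = sorted(BILINGUAL_VOCAB, key=len, reverse=True)
_PATTERN = re.compile(
    r"(?<!\w)(?:" + "|".join(re.escape(t) for t in _TERMS) + r")(?!\w)"
)


def tokenize(text: str) -> list:
    """Split on whitespace, but emit known mixed terms as one mapped token."""
    tokens = []
    pos = 0
    for match in _PATTERN.finditer(text):
        tokens.extend(text[pos:match.start()].split())  # plain words before the term
        tokens.append(BILINGUAL_VOCAB[match.group(0)])  # canonical identifier
        pos = match.end()
    tokens.extend(text[pos:].split())                   # remaining plain words
    return tokens
```

For the first example sentence, `tokenize("أرسل الـ PO للمورد")` yields three tokens, with purchase_order in the middle position instead of the separate pieces الـ and PO.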
Cross-Lingual Retrieval: The RAG vector database is indexed with bilingual embeddings, so an Arabic natural-language query can retrieve context documents written in English and vice versa, which keeps record searches comprehensive.
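
The retrieval step can be pictured as a nearest-neighbour search in one shared vector space, where the language of the query no longer matters. The sketch below uses cosine similarity over hand-written toy vectors. The vectors, document ids, and function names are placeholders invented for illustration; in practice the vectors would come from a bilingual embedding model, which is not shown here.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_vec, index, top_k=1):
    """Return the top_k documents ranked by similarity to the query vector."""
    ranked = sorted(index, key=lambda d: cosine(query_vec, d["vector"]), reverse=True)
    return ranked[:top_k]


# Toy placeholder vectors standing in for real bilingual embeddings.
INDEX = [
    {"id": "vat-policy-en", "lang": "en", "vector": [0.9, 0.1, 0.0]},
    {"id": "leave-policy-ar", "lang": "ar", "vector": [0.0, 0.2, 0.9]},
]

# Placeholder for the embedding of an Arabic question about VAT.
arabic_query_vec = [0.8, 0.2, 0.1]
best = retrieve(arabic_query_vec, INDEX)[0]
```

Here the Arabic query vector lies closest to the English document, so `best` is the vat-policy-en entry even though query and document are in different languages.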

3. Intent Classification Pipeline

Queries originating from user clients or WhatsApp are classified through a pipeline:

[User Input] ──> [Pre-processing (Normalize & Tokenize)] ──> [Intent Classifier (BERT-based)]
                                                                        │
                                                             Maps to Bounded Action
                                                                        │
                                                             Trigger Target Microservice

BERT-Based Intent Classifier: A lightweight, locally hosted BERT model classifies the user's request and routes it to one of three targets:
- Financial Query -> routed to the Information Agent, which runs database queries.
- Administrative Action -> routed to the Action Agent tool executor.
- Policy / Guide Lookup -> routed to the RAG vector search database.
Fallback Confidence: If the classifier's confidence score drops below 70%, the system presents a selection card in the UI, asking: "Did you mean to: A) Create an Invoice, B) View a Statement, or C) Search Company Policies?"
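
The routing and fallback rule can be sketched as follows. The 70% threshold, the three routes, and the three card options come from the description above; the identifier names, the `Decision` type, and the choice to send an unrecognized intent label to the same selection card are assumptions for illustration. The classifier itself is out of scope: `route` takes its predicted label and confidence as inputs.

```python
from dataclasses import dataclass, field
from typing import List, Optional

CONFIDENCE_THRESHOLD = 0.70  # below this, ask the user instead of routing

# Intent label -> target service (identifier names are hypothetical).
ROUTES = {
    "financial_query": "information_agent",
    "administrative_action": "action_agent",
    "policy_lookup": "rag_vector_search",
}

# Options shown on the selection card, as quoted in the text.
FALLBACK_OPTIONS = [
    "Create an Invoice",
    "View a Statement",
    "Search Company Policies",
]


@dataclass
class Decision:
    target: Optional[str]                 # service to call, or None
    needs_clarification: bool             # True when the UI should show the card
    options: List[str] = field(default_factory=list)


def route(intent: str, confidence: float) -> Decision:
    """Route a classified request, or fall back to a selection card."""
    # Assumption: an unknown intent label is treated like low confidence.
    if confidence < CONFIDENCE_THRESHOLD or intent not in ROUTES:
        return Decision(None, True, list(FALLBACK_OPTIONS))
    return Decision(ROUTES[intent], False)
```

A prediction of financial_query at 0.92 is routed to the Information Agent, while policy_lookup at 0.55 returns a decision with `needs_clarification` set and the three card options attached. A score of exactly 0.70 is routed, since the rule only triggers below the threshold.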
