Twkeed RAG (توكيد للاسترجاع)
An Arabic retrieval-augmented generation (RAG) pipeline that combines dense embeddings with cross-encoder reranking for high-accuracy Arabic document retrieval.
Models Used
| Component | Model | Purpose |
|---|---|---|
| Embedding | BAAI/bge-m3 | Fast retrieval (100+ languages) |
| Reranker | BAAI/bge-reranker-v2-m3 | Accurate re-ranking |
Pipeline Architecture
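Query → [BGE-M3 Embedding] → FAISS Top 100 → [BGE-Reranker] → Top K Results
Stage 1 (embedding + FAISS): fast; the query is embedded once and matched against precomputed document vectors.
Stage 2 (reranker): slower but more accurate; each of the 100 candidates is scored jointly with the query, so cost grows linearly with the number of candidates.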
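The two-stage flow above can be sketched in plain Python. This is a toy illustration only, not the project's implementation: `cosine`, `retrieve_then_rerank`, the 2-D vectors, and the rerank scores are all hypothetical stand-ins for BGE-M3 embeddings, the FAISS index, and the BGE reranker.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve_then_rerank(query_vec, doc_vecs, rerank_fn, n_candidates=100, top_k=3):
    """Stage 1 picks candidates by vector similarity; stage 2 rescores only those."""
    # Stage 1: rank every document by cosine similarity and keep the best n_candidates.
    by_similarity = sorted(
        doc_vecs, key=lambda doc_id: cosine(query_vec, doc_vecs[doc_id]), reverse=True
    )
    candidates = by_similarity[:n_candidates]
    # Stage 2: apply the (assumed costlier) rerank_fn to the candidates only.
    reranked = sorted(((rerank_fn(doc_id), doc_id) for doc_id in candidates), reverse=True)
    return [(doc_id, score) for score, doc_id in reranked[:top_k]]

# Toy data: hypothetical 2-D "embeddings" and made-up reranker scores.
doc_vecs = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
toy_rerank_scores = {"a": 0.2, "b": 0.9, "c": 0.99}

results = retrieve_then_rerank(
    [1.0, 0.0], doc_vecs, toy_rerank_scores.get, n_candidates=2, top_k=2
)
print(results)
```

In this toy run, document "c" has the highest rerank score but never reaches stage 2 because stage 1 drops it. That is the trade-off behind retrieving a generous candidate set (100 here) before reranking.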
Performance (Arabic)
| Query | Top Result | Score |
|---|---|---|
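| ما هي عاصمة السعودية؟ (What is the capital of Saudi Arabia?) | الرياض (Riyadh) | 0.9995 |
| أين تقع الكعبة المشرفة؟ (Where is the Holy Kaaba located?) | مكة المكرمة (Makkah) | 0.9430 |
| ما هي رؤية 2030؟ (What is Vision 2030?) | رؤية السعودية 2030 (Saudi Vision 2030) | 0.9176 |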
Installation
pip install torch sentence-transformers FlagEmbedding faiss-cpu
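After installing, a quick check that the packages are importable can save a confusing failure later. The sketch below uses only the standard library; the helper name `missing_modules` is hypothetical, and the list reflects the usual import names for the packages in the pip command (for example, `faiss-cpu` is imported as `faiss`).

```python
import importlib.util

# Import names for the packages installed by the pip command above.
REQUIRED_MODULES = ["torch", "sentence_transformers", "FlagEmbedding", "faiss"]

def missing_modules(names=REQUIRED_MODULES):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_modules()
print("Missing:", ", ".join(missing) if missing else "none")
```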
Quick Start
from twkeed_rag import TwkeedRAG, Document
# Initialize
rag = TwkeedRAG()
# Add Arabic documents
rag.add_documents([
    # "The Kingdom of Saudi Arabia is the largest country in the Arabian Peninsula"
    Document(id="1", text="المملكة العربية السعودية هي أكبر دولة في شبه الجزيرة العربية"),
    # "Riyadh is the capital of the Kingdom of Saudi Arabia"
    Document(id="2", text="الرياض هي عاصمة المملكة العربية السعودية"),
    # "Makkah is the holiest city in Islam"
    Document(id="3", text="مكة المكرمة هي أقدس مدينة في الإسلام"),
])
# Search (query: "What is the capital of Saudi Arabia?")
results = rag.search("ما هي عاصمة السعودية؟", top_k=3)
for r in results:
    print(f"Score: {r.rerank_score:.4f} - {r.document.text}")
Full Pipeline (with twkeed-vision)
from twkeed_integrated import TwkeedPipeline, Document
# Initialize with Vision + RAG
pipeline = TwkeedPipeline(load_vision=True, load_rag=True)
# OCR: Extract text from image
text = pipeline.ocr("document.jpg")
# Index: Add to searchable index
pipeline.add_document("doc1", text)
# Or do both in one step
pipeline.add_image("doc2", "receipt.jpg")
# Search (query: "What is the total amount?")
results = pipeline.search("ما هو المبلغ الإجمالي؟")
API Reference
TwkeedRAG
class TwkeedRAG:
    def __init__(
        self,
        embedding_model: str = "BAAI/bge-m3",
        reranker_model: str = "BAAI/bge-reranker-v2-m3",
        use_fp16: bool = True
    )
    def add_documents(self, documents: List[Document])
    def search(self, query: str, top_k: int = 10) -> List[SearchResult]
    def save(self, path: str)
    def load(self, path: str)
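The signatures above refer to `Document` and `SearchResult` without defining them. The sketch below shows what those types might look like, inferred only from how they are used in the Quick Start (`id`, `text`, `r.document`, `r.rerank_score`). The `metadata` field, the `above_threshold` helper, and the example scores are assumptions for illustration, not the library's actual definitions.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class Document:
    id: str
    text: str
    metadata: Dict[str, Any] = field(default_factory=dict)  # assumed field

@dataclass
class SearchResult:
    document: Document
    rerank_score: float

def above_threshold(results: List[SearchResult], min_score: float) -> List[SearchResult]:
    """Hypothetical helper: keep only results whose rerank score is at least min_score."""
    return [r for r in results if r.rerank_score >= min_score]

# Made-up results for illustration; real ones would come from TwkeedRAG.search().
hits = [
    SearchResult(Document(id="2", text="الرياض هي عاصمة المملكة العربية السعودية"), 0.9995),
    SearchResult(Document(id="1", text="المملكة العربية السعودية هي أكبر دولة في شبه الجزيرة العربية"), 0.12),
]
kept = above_threshold(hits, 0.5)
print([r.document.id for r in kept])
```

Filtering on a minimum score like this is one way to avoid passing weak matches to a downstream LLM. A suitable cutoff depends on the data and would need to be tuned.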
TwkeedPipeline
class TwkeedPipeline:
    def __init__(self, load_vision: bool = True, load_rag: bool = True)
    def ocr(self, image_path: str) -> str
    def add_document(self, doc_id: str, text: str, metadata: Dict = None)
    def add_image(self, doc_id: str, image_path: str) -> str
    def search(self, query: str, top_k: int = 5) -> List[SearchResult]
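Going by the signatures, `add_image` appears to combine `ocr` and `add_document` and to return the extracted text. The sketch below illustrates that composition with a hypothetical stand-in class (`SketchPipeline`) and a fake OCR function. It loads no models and is not the real `TwkeedPipeline`.

```python
from typing import Callable, Dict

class SketchPipeline:
    """Hypothetical stand-in showing how add_image could compose ocr + add_document."""

    def __init__(self, ocr_fn: Callable[[str], str]):
        self._ocr_fn = ocr_fn            # injected OCR backend (assumption)
        self.texts: Dict[str, str] = {}  # doc_id -> text; stands in for the search index

    def ocr(self, image_path: str) -> str:
        return self._ocr_fn(image_path)

    def add_document(self, doc_id: str, text: str) -> None:
        self.texts[doc_id] = text

    def add_image(self, doc_id: str, image_path: str) -> str:
        # OCR the image, index the extracted text, and return it to the caller.
        text = self.ocr(image_path)
        self.add_document(doc_id, text)
        return text

def fake_ocr(image_path: str) -> str:
    """Placeholder OCR that returns a fixed string instead of reading the image."""
    return f"text extracted from {image_path}"

pipeline = SketchPipeline(fake_ocr)
returned = pipeline.add_image("doc2", "receipt.jpg")
print(returned)
```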
Integration with twkeed-vision
This RAG pipeline integrates with twkeed-vision for complete Arabic document processing:
- twkeed-vision: OCR & document understanding (Qwen3-VL-4B)
- twkeed-rag: Embedding & retrieval (BGE-M3 + Reranker)
Use Cases
- Arabic Document Search: Search through Arabic documents, contracts, reports
- Receipt/Invoice Processing: Extract and search invoice data
- Knowledge Base: Build Arabic Q&A systems
- RAG for LLMs: Retrieve context for Arabic language models
Hardware Requirements
- Minimum: 8 GB RAM, CPU
- Recommended: 16 GB+ RAM, Apple Silicon (MPS) or NVIDIA GPU
- Tested on: Mac Studio M3 Ultra 96GB
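To see which of the backends listed above is available on a given machine, a small PyTorch check can help. The helper `pick_device` is hypothetical. Whether `TwkeedRAG` accepts a device argument is not stated here, so this sketch only reports the device and does not pass it anywhere.

```python
def pick_device() -> str:
    """Return "cuda", "mps", or "cpu" depending on what PyTorch reports as available."""
    try:
        import torch
    except ImportError:
        # Without PyTorch installed there is nothing to accelerate; fall back to CPU.
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # Older PyTorch builds may not expose torch.backends.mps, so look it up defensively.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"

device = pick_device()
print(f"Selected device: {device}")
```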
License
Apache 2.0
Related Projects
- twkeed-vision - Arabic OCR & Document Understanding
- BAAI/bge-m3 - Multilingual Embeddings
- BAAI/bge-reranker-v2-m3 - Multilingual Reranker
Acknowledgments
- BAAI for the BGE models
- Qwen Team for Qwen3-VL
- Apple for the MLX framework