Twkeed RAG (توكيد للاسترجاع)

Twkeed RAG is an Arabic Retrieval-Augmented Generation (RAG) pipeline that pairs fast embedding-based search with a reranking stage for high-accuracy Arabic document retrieval.

Models Used

| Component | Model                   | Purpose                         |
|-----------|-------------------------|---------------------------------|
| Embedding | BAAI/bge-m3             | Fast retrieval (100+ languages) |
| Reranker  | BAAI/bge-reranker-v2-m3 | Accurate re-ranking             |

Pipeline Architecture

Query → [BGE-M3 Embedding] → FAISS Top 100 → [BGE-Reranker] → Top K Results
              ↓                                     ↓
   Fast: 1 query encoding             Accurate: O(n) in candidates
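
The two stages in the diagram can be sketched in plain Python. This is a minimal illustration, not the project's implementation: `embed` and `rerank` are injected stand-ins for the BGE-M3 encoder and the BGE reranker, the toy functions below are invented for the demo, and a brute-force similarity sort stands in for the FAISS index.

```python
import math
from typing import Callable, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_then_rerank(
    query: str,
    docs: List[str],
    embed: Callable[[str], Sequence[float]],  # stand-in for the embedding model
    rerank: Callable[[str, str], float],      # stand-in for the reranker
    candidates: int = 100,
    top_k: int = 10,
) -> List[Tuple[str, float]]:
    # Stage 1: encode the query once and keep the nearest `candidates` documents.
    # (A real vector index would store precomputed document vectors.)
    q_vec = embed(query)
    by_similarity = sorted(docs, key=lambda d: cosine(q_vec, embed(d)), reverse=True)
    shortlist = by_similarity[:candidates]

    # Stage 2: score every (query, document) pair with the slower reranker.
    scored = [(d, rerank(query, d)) for d in shortlist]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def toy_embed(text: str) -> List[float]:
    # Toy embedding: counts of a few marker words (not a real model).
    vocab = ["capital", "riyadh", "mecca", "holy"]
    words = text.lower().split()
    return [float(words.count(w)) for w in vocab]


def toy_rerank(query: str, doc: str) -> float:
    # Toy reranker: fraction of query words that also appear in the document.
    q_words = set(query.lower().split())
    return len(q_words & set(doc.lower().split())) / len(q_words)


docs = [
    "riyadh is the capital of saudi arabia",
    "mecca is the holy city",
    "saudi arabia is a large country",
]
results = retrieve_then_rerank(
    "capital of saudi arabia", docs, toy_embed, toy_rerank, candidates=2, top_k=1
)
print(results)
```

The first stage costs one query encoding plus a nearest-neighbour lookup, while the second stage runs the reranker once per shortlisted candidate, which is why the shortlist is capped before reranking.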

Performance (Arabic)

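| Query | Top Result | Score |
|-------|------------|-------|
| ما هي عاصمة السعودية؟ (What is the capital of Saudi Arabia?) | الرياض (Riyadh) | 0.9995 |
| أين تقع الكعبة المشرفة؟ (Where is the Holy Kaaba located?) | مكة المكرمة (Makkah Al-Mukarramah) | 0.9430 |
| ما هي رؤية 2030؟ (What is Vision 2030?) | رؤية السعودية 2030 (Saudi Vision 2030) | 0.9176 |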
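
The scores above all lie between 0 and 1. One common way to obtain such values from a reranker is to pass its raw logit through a sigmoid (FlagEmbedding's `FlagReranker.compute_score` offers this through `normalize=True`). Whether Twkeed RAG normalizes its scores this way is an assumption; the sketch below only shows the mapping and its inverse.

```python
import math


def sigmoid(logit: float) -> float:
    """Map a raw logit to a score in (0, 1), written to avoid overflow."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)


def to_logit(score: float) -> float:
    """Inverse of sigmoid: recover the raw logit from a score in (0, 1)."""
    return math.log(score / (1.0 - score))


# Under the sigmoid assumption, small gaps near 1.0 hide large logit gaps.
for score in (0.9995, 0.9430, 0.9176):
    print(f"score={score:.4f} -> logit={to_logit(score):.2f}")
```

Read this way, the gap between the first and second rows of the table is wider than the raw numbers suggest, because the sigmoid compresses differences near 1.0.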

Installation

pip install torch sentence-transformers FlagEmbedding faiss-cpu

Quick Start

from twkeed_rag import TwkeedRAG, Document

# Initialize
rag = TwkeedRAG()

# Add Arabic documents
rag.add_documents([
    Document(id="1", text="المملكة العربية السعودية هي أكبر دولة في شبه الجزيرة العربية"),
    Document(id="2", text="الرياض هي عاصمة المملكة العربية السعودية"),
    Document(id="3", text="مكة المكرمة هي أقدس مدينة في الإسلام"),
])

# Search
results = rag.search("ما هي عاصمة السعودية؟", top_k=3)

for r in results:
    print(f"Score: {r.rerank_score:.4f} - {r.document.text}")
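
A follow-up step many applications need is dropping weak matches before using the results. The sketch below assumes only the two attributes used in the loop above (`rerank_score` and `document.text`); the dataclasses are hypothetical stand-ins rather than the real `twkeed_rag` types, and the scores are made up for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Document:  # hypothetical stand-in for twkeed_rag.Document
    id: str
    text: str


@dataclass
class SearchResult:  # hypothetical stand-in: only the fields used above
    document: Document
    rerank_score: float


def confident_hits(results: List[SearchResult], min_score: float = 0.5) -> List[SearchResult]:
    """Keep results scoring at or above min_score, best first."""
    kept = [r for r in results if r.rerank_score >= min_score]
    return sorted(kept, key=lambda r: r.rerank_score, reverse=True)


# Illustrative scores, not measured output.
results = [
    SearchResult(Document("1", "المملكة العربية السعودية هي أكبر دولة في شبه الجزيرة العربية"), 0.12),
    SearchResult(Document("2", "الرياض هي عاصمة المملكة العربية السعودية"), 0.99),
]
hits = confident_hits(results, min_score=0.5)
for r in hits:
    print(f"Score: {r.rerank_score:.4f} - {r.document.text}")
```

The right threshold depends on the corpus and on how the scores are scaled, so 0.5 here is only a placeholder.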

Full Pipeline (with twkeed-vision)

from twkeed_integrated import TwkeedPipeline, Document

# Initialize with Vision + RAG
pipeline = TwkeedPipeline(load_vision=True, load_rag=True)

# OCR: Extract text from image
text = pipeline.ocr("document.jpg")

# Index: Add to searchable index
pipeline.add_document("doc1", text)

# Or do both in one step
pipeline.add_image("doc2", "receipt.jpg")

# Search
results = pipeline.search("ما هو المبلغ الإجمالي؟")

API Reference

TwkeedRAG

class TwkeedRAG:
    def __init__(
        self,
        embedding_model: str = "BAAI/bge-m3",
        reranker_model: str = "BAAI/bge-reranker-v2-m3",
        use_fp16: bool = True
    )

    def add_documents(self, documents: List[Document])
    def search(self, query: str, top_k: int = 10) -> List[SearchResult]
    def save(self, path: str)
    def load(self, path: str)
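
The on-disk format used by `save` and `load` is not described here. As a rough sketch of what persisting the document store could involve, the example below writes documents to UTF-8 JSON and reads them back. The helper names and the file layout are assumptions; a vector index would be stored separately (FAISS provides `faiss.write_index` and `faiss.read_index` for that).

```python
import json
import os
import tempfile
from typing import Dict, List


def save_documents(docs: List[Dict[str, str]], path: str) -> None:
    """Write documents as UTF-8 JSON; ensure_ascii=False keeps Arabic readable."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(docs, f, ensure_ascii=False, indent=2)


def load_documents(path: str) -> List[Dict[str, str]]:
    """Read documents back from a JSON file written by save_documents."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


docs = [{"id": "2", "text": "الرياض هي عاصمة المملكة العربية السعودية"}]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "documents.json")  # hypothetical file name
    save_documents(docs, path)
    restored = load_documents(path)

print(restored == docs)
```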

TwkeedPipeline

class TwkeedPipeline:
    def __init__(self, load_vision: bool = True, load_rag: bool = True)

    def ocr(self, image_path: str) -> str
    def add_document(self, doc_id: str, text: str, metadata: Optional[Dict] = None)
    def add_image(self, doc_id: str, image_path: str) -> str
    def search(self, query: str, top_k: int = 5) -> List[SearchResult]
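
Judging by the signatures above, `add_image` plausibly combines `ocr` and `add_document` and returns the extracted text. The sketch below illustrates that composition with a hypothetical `MiniPipeline` class and a fake OCR function; it is not the actual `TwkeedPipeline` code.

```python
from typing import Callable, Dict


class MiniPipeline:
    """Hypothetical stand-in showing how add_image could compose OCR and indexing."""

    def __init__(self, ocr_fn: Callable[[str], str]) -> None:
        self._ocr = ocr_fn               # injected OCR backend
        self.index: Dict[str, str] = {}  # doc_id -> text; a real index stores vectors

    def ocr(self, image_path: str) -> str:
        return self._ocr(image_path)

    def add_document(self, doc_id: str, text: str) -> None:
        self.index[doc_id] = text

    def add_image(self, doc_id: str, image_path: str) -> str:
        # OCR first, then index the extracted text under the same id.
        text = self.ocr(image_path)
        self.add_document(doc_id, text)
        return text


# A fake OCR function keeps the example runnable without a vision model.
pipe = MiniPipeline(ocr_fn=lambda path: f"text from {path}")
returned = pipe.add_image("doc2", "receipt.jpg")
print(returned)
```

Injecting the OCR function also makes the indexing logic easy to test without loading a vision model.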

Integration with twkeed-vision

This RAG pipeline integrates with twkeed-vision for complete Arabic document processing:

  1. twkeed-vision: OCR & document understanding (Qwen3-VL-4B)
  2. twkeed-rag: Embedding & retrieval (BGE-M3 + Reranker)

Use Cases

  • Arabic Document Search: Search through Arabic documents, contracts, reports
  • Receipt/Invoice Processing: Extract and search invoice data
  • Knowledge Base: Build Arabic Q&A systems
  • RAG for LLMs: Retrieve context for Arabic language models
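
For the last use case, the retrieved passages still have to be turned into model input. Here is a minimal sketch of that step, assuming the passages are plain strings taken from the search results; the prompt wording and the `build_prompt` helper are illustrative, not part of the library.

```python
from typing import List


def build_prompt(question: str, passages: List[str], max_passages: int = 3) -> str:
    """Assemble a prompt asking an LLM to answer from the retrieved passages only."""
    numbered = [f"[{i}] {p}" for i, p in enumerate(passages[:max_passages], start=1)]
    context = "\n".join(numbered)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


prompt = build_prompt(
    "ما هي عاصمة السعودية؟",
    ["الرياض هي عاصمة المملكة العربية السعودية"],
)
print(prompt)
```

Numbering the passages lets the model cite which passage supported its answer, and `max_passages` keeps the prompt within the model's context budget.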

Hardware Requirements

  • Minimum: 8GB RAM, CPU
  • Recommended: 16GB+ RAM, Apple Silicon (MPS) or NVIDIA GPU
  • Tested on: Mac Studio M3 Ultra 96GB
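
To match these tiers at runtime, a script can check which accelerator is present before loading the models. The helper below is a sketch with hypothetical function names; it relies only on `torch.cuda.is_available()` and `torch.backends.mps.is_available()` and falls back to CPU when PyTorch is not installed.

```python
def pick_device(has_cuda: bool, has_mps: bool) -> str:
    """Prefer an NVIDIA GPU, then Apple Silicon (MPS), then CPU."""
    if has_cuda:
        return "cuda"
    if has_mps:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Query PyTorch for available accelerators; fall back to CPU without it."""
    try:
        import torch
    except ImportError:
        return "cpu"
    mps_backend = getattr(torch.backends, "mps", None)  # absent in older PyTorch
    has_mps = mps_backend is not None and mps_backend.is_available()
    return pick_device(torch.cuda.is_available(), has_mps)


print(detect_device())
```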

License

Apache 2.0

Acknowledgments

  • BAAI for BGE models
  • Qwen Team for Qwen3-VL
  • Apple for MLX framework