HaluGate Sentinel (Quantized ONNX Versions)

HaluGate Sentinel is a high-efficiency binary classifier designed as a "Stage 0" gatekeeper for LLM pipelines. It analyzes incoming user prompts and decides whether they require factual grounding (RAG/Tooling) or can be handled by a creative/reasoning engine directly.

This repository contains optimized ONNX versions of the model, specifically tuned for deployment in browser environments via Transformers.js and edge devices via ONNX Runtime.

🚀 Model Variants

Through rigorous testing, we have found that 4-bit quantization (Q4/BNB4) significantly outperforms standard 8-bit quantization for this architecture, due to its block-wise handling of outlier activations.

| Format | Quantization | Recommendation | Use Case |
|--------|--------------|----------------|----------|
| `model.onnx` | FP32 | Baseline | Reference / Server-side |
| `model_fp16.onnx` | FP16 | High Performance | WebGPU (Browser) |
| `model_q4.onnx` | 4-bit Quantized | Best Balance | General Web / Transformers.js |
| `model_q4f16.onnx` | 4-bit with Float16 accumulation | Best Balance (Recommended) | General Web / Transformers.js |
| `model_bnb4.onnx` | BitsAndBytes 4-bit | Ultra Light | Mobile / Low Bandwidth |
| `model_uint8.onnx` | 8-bit Unsigned Integer | Stable | CPU / WASM |
| `model_quantized.onnx` | 8-bit Quantized | Not Recommended | — |
| `model_int8.onnx` | 8-bit Signed Integer | Not Recommended | — |

Note on Quantization Performance: ModernBERT architectures often exhibit "outlier" activations. In our tests, 8-bit global quantization (INT8) caused significant confidence degradation. Using block-wise 4-bit (Q4) or BitsAndBytes (BNB4) isolates these outliers, resulting in performance that nearly matches the original FP32 precision.
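The effect of block-wise scaling can be illustrated with a toy sketch (this is not the actual ONNX quantizer, just symmetric uniform quantization over a synthetic vector with one injected outlier):

```python
import numpy as np

# Illustrative sketch: why block-wise scales handle outlier
# activations better than a single global scale.
rng = np.random.default_rng(0)
x = rng.normal(0, 1, 256).astype(np.float32)
x[7] = 40.0  # a single ModernBERT-style outlier activation

def quantize(v, n_levels=16):
    """Symmetric uniform quantization of v with a single scale."""
    scale = np.abs(v).max() / (n_levels / 2 - 1)
    return np.round(v / scale) * scale

# Global scaling: the outlier inflates the scale for every value,
# so most normal-range activations round to zero.
global_err = np.abs(x - quantize(x)).mean()

# Block-wise scaling: each 32-value block gets its own scale,
# so the outlier only degrades its own block.
blocks = x.reshape(-1, 32)
blockwise = np.concatenate([quantize(b) for b in blocks])
block_err = np.abs(x - blockwise).mean()

print(f"global mean error:     {global_err:.4f}")
print(f"block-wise mean error: {block_err:.4f}")  # noticeably smaller
```

The same intuition applies to Q4/BNB4: isolating the outlier inside a small block keeps the quantization step fine for everything else.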

📊 Classification Labels

The model outputs two classes based on the prompt's intent:

  1. FACT_CHECK_NEEDED (Label 1): Information-seeking queries that rely on world knowledge (e.g., "What is the current price of Bitcoin?").
  2. NO_FACT_CHECK_NEEDED (Label 0): Creative, coding, opinion, or pure reasoning tasks (e.g., "Write a poem about a cat" or "How do I sort a list in Python?").
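If you are working with raw logits rather than the pipeline output, the label mapping above can be decoded with a standard softmax + argmax. A minimal sketch (the `decode` helper and the example logits are illustrative, not part of the model's API):

```python
import numpy as np

# Label ids as documented above: 0 = no fact check, 1 = fact check.
ID2LABEL = {0: "NO_FACT_CHECK_NEEDED", 1: "FACT_CHECK_NEEDED"}

def decode(logits: np.ndarray) -> tuple[str, float]:
    """Turn [batch, 2] logits into a (label, confidence) pair for row 0."""
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    idx = int(probs[0].argmax())
    return ID2LABEL[idx], float(probs[0, idx])

label, score = decode(np.array([[-1.2, 2.3]]))  # hypothetical logits
print(label, round(score, 3))  # FACT_CHECK_NEEDED 0.971
```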

✅ Tests

We encourage you to test the model against your own use case; for reference, here are some lightweight checks run on each variant:

| Model | "What is the current price of Bitcoin?" | "Write a nursery rhyme about a cat." | "How do I sort a list in Python?" |
|-------|------------------------------------------|---------------------------------------|------------------------------------|
| fp32 | FACT_CHECK_NEEDED (88.73%) | NO_FACT_CHECK_NEEDED (100.00%) | NO_FACT_CHECK_NEEDED (99.97%) |
| fp16 | FACT_CHECK_NEEDED (88.68%) | NO_FACT_CHECK_NEEDED (100.00%) | NO_FACT_CHECK_NEEDED (99.97%) |
| q4 | FACT_CHECK_NEEDED (99.13%) | NO_FACT_CHECK_NEEDED (100.00%) | NO_FACT_CHECK_NEEDED (99.99%) |
| bnb4 | FACT_CHECK_NEEDED (99.49%) | NO_FACT_CHECK_NEEDED (100.00%) | NO_FACT_CHECK_NEEDED (99.91%) |
| q4f16 | FACT_CHECK_NEEDED (99.14%) | NO_FACT_CHECK_NEEDED (100.00%) | NO_FACT_CHECK_NEEDED (99.99%) |
| uint8 | FACT_CHECK_NEEDED (98.92%) | NO_FACT_CHECK_NEEDED (99.97%) | NO_FACT_CHECK_NEEDED (95.22%) |
| q8, int8 (avoid) | NO_FACT_CHECK_NEEDED (55.46%) | NO_FACT_CHECK_NEEDED (100.00%) | NO_FACT_CHECK_NEEDED (78.03%) |

📈 Performance Notes

  • Precision Stability: Unlike many models where 8-bit is the standard, HaluGate Sentinel shows improved confidence scores in q4f16 and bnb4 formats. This is likely due to the "Block-wise" quantization techniques preserving ModernBERT's internal outlier activations better than global 8-bit scaling.
  • Context Window: Supports up to 8192 tokens (ModernBERT backbone).

⚠️ Limitations

  • English Only: The model was primarily trained and validated on English datasets.
  • Borderline Queries: Philosophical or hybrid prompts (e.g., "Is time travel possible?") may show lower confidence scores. We recommend implementing a "default-to-safe" (Fact Check Needed) policy for scores below 0.70.
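The "default-to-safe" policy above can be sketched as a thin wrapper around the classifier output (the prediction dicts here are hypothetical examples in the pipeline's output shape):

```python
# Default-to-safe routing, per the recommendation above.
# The 0.70 threshold is a starting point; tune it for your traffic.
SAFE_THRESHOLD = 0.70

def route_with_fallback(prediction: dict) -> str:
    """Route to 'rag' (grounded) or 'direct', defaulting to 'rag'
    whenever the classifier is not confident enough."""
    if prediction["score"] < SAFE_THRESHOLD:
        return "rag"  # borderline prompt: treat as fact-check needed
    if prediction["label"] == "FACT_CHECK_NEEDED":
        return "rag"
    return "direct"

print(route_with_fallback({"label": "NO_FACT_CHECK_NEEDED", "score": 0.55}))  # rag
print(route_with_fallback({"label": "NO_FACT_CHECK_NEEDED", "score": 0.99}))  # direct
```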

🛠 Usage (Transformers.js / JavaScript)

```javascript
import { pipeline } from '@huggingface/transformers';

// Load the 4-bit version for optimal performance
const classifier = await pipeline('text-classification', 'vmanvs/halugate-sentinel-onnx', {
    device: 'webgpu',
    dtype: 'q4f16', // or 'bnb4' for maximum compression
});

const result = await classifier("Who won the 2020 world series?");
console.log(result);
// Output: [{ label: 'FACT_CHECK_NEEDED', score: 0.991... }]
```

🐍 Usage (Python / ONNX Runtime)

```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

def softmax(x):
    """Compute softmax values over the last axis."""
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / e_x.sum(axis=-1, keepdims=True)

tokenizer = AutoTokenizer.from_pretrained("vmanvs/halugate-sentinel-onnx")
session = ort.InferenceSession("model_q4f16.onnx")

inputs = tokenizer("How do you implement a binary tree?", return_tensors="np")
outputs = session.run(None, dict(inputs))
probs = softmax(outputs[0])  # [batch, 2] class probabilities
```

📝 Citation

This model is derived from HaluGate Sentinel by the LLM Semantic Router Team.

```bibtex
@misc{halugate2025,
  author = {LLM Semantic Router Team},
  title = {HaluGate Sentinel: A Frontline Switch for Hallucination Mitigation},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face Repository},
}
```