# HaluGate Sentinel (Quantized ONNX Versions)
HaluGate Sentinel is a high-efficiency binary classifier designed as a "Stage 0" gatekeeper for LLM pipelines. It analyzes incoming user prompts and decides whether they require factual grounding (RAG/Tooling) or can be handled by a creative/reasoning engine directly.
This repository contains optimized ONNX versions of the model, specifically tuned for deployment in browser environments via Transformers.js and edge devices via ONNX Runtime.
## 🚀 Model Variants
Through rigorous testing, we have found that block-wise 4-bit quantization (Q4/BNB4) significantly outperforms standard 8-bit quantization for this architecture due to outlier handling.
| Format | Quantization | Recommendation | Use Case |
|---|---|---|---|
| `model.onnx` | FP32 | Baseline | Reference / Server-side |
| `model_fp16.onnx` | FP16 | High Performance | WebGPU (Browser) |
| `model_q4.onnx` | 4-bit Quantized | Best Balance | General Web / Transformers.js |
| `model_q4f16.onnx` | 4-bit with Float16 accumulation | Best Balance (Recommended) | General Web / Transformers.js |
| `model_bnb4.onnx` | BitsAndBytes 4-bit | Ultra Light | Mobile / Low Bandwidth |
| `model_uint8.onnx` | 8-bit Unsigned Integer | Stable | CPU / WASM |
| `model_quantized.onnx` | 8-bit Quantized | Not Recommended | |
| `model_int8.onnx` | 8-bit Signed Integer | Not Recommended | |
Note on Quantization Performance: ModernBERT architectures often exhibit "outlier" activations. In our tests, 8-bit global quantization (INT8) caused significant confidence degradation. Using block-wise 4-bit (Q4) or BitsAndBytes (BNB4) isolates these outliers, resulting in performance that nearly matches the original FP32 precision.
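To make the outlier effect concrete, here is an illustrative sketch (not the actual ONNX quantizer) of symmetric round-to-nearest quantization with one global scale versus one scale per block. A single large outlier inflates the global 8-bit scale and wipes out the small weights, while block-wise 4-bit confines the damage to the outlier's own block:

```python
import numpy as np

def quantize_dequantize(w, bits, block_size=None):
    """Symmetric round-to-nearest quantize, then dequantize.
    One scale per block (block-wise) or one scale for the whole tensor (global)."""
    qmax = 2 ** (bits - 1) - 1
    blocks = w.reshape(-1, block_size) if block_size else w.reshape(1, -1)
    scale = np.abs(blocks).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(blocks / scale), -qmax, qmax)
    return (q * scale).reshape(w.shape)

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, 1024).astype(np.float32)
w[7] = 8.0  # a single outlier activation, as discussed above

err_global_int8 = np.abs(w - quantize_dequantize(w, bits=8)).mean()
err_block_q4 = np.abs(w - quantize_dequantize(w, bits=4, block_size=32)).mean()
# Block-wise 4-bit yields a smaller mean error than global 8-bit despite
# using fewer bits, because the outlier only inflates the scale of its
# own 32-value block instead of the entire tensor.
```

This toy experiment mirrors the table above: precision alone is not what matters; scale granularity is.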
## 📊 Classification Labels
The model outputs two classes based on the prompt's intent:
- **FACT_CHECK_NEEDED** (Label 1): Information-seeking queries that rely on world knowledge (e.g., "What is the current price of Bitcoin?").
- **NO_FACT_CHECK_NEEDED** (Label 0): Creative, coding, opinion, or pure reasoning tasks (e.g., "Write a poem about a cat" or "How do I sort a list in Python?").
## ✅ Tests
Please validate the model against your own use case. The lightweight checks below are provided for reference:
| Model Used | Test Results |
|---|---|
| fp32 | Query: What is the current price of Bitcoin?<br>Result: FACT_CHECK_NEEDED (88.73%)<br>Query: Write a nursery rhyme about a cat.<br>Result: NO_FACT_CHECK_NEEDED (100.00%)<br>Query: How do I sort a list in Python?<br>Result: NO_FACT_CHECK_NEEDED (99.97%) |
| fp16 | Query: What is the current price of Bitcoin?<br>Result: FACT_CHECK_NEEDED (88.68%)<br>Query: Write a nursery rhyme about a cat.<br>Result: NO_FACT_CHECK_NEEDED (100.00%)<br>Query: How do I sort a list in Python?<br>Result: NO_FACT_CHECK_NEEDED (99.97%) |
| q4 | Query: What is the current price of Bitcoin?<br>Result: FACT_CHECK_NEEDED (99.13%)<br>Query: Write a nursery rhyme about a cat.<br>Result: NO_FACT_CHECK_NEEDED (100.00%)<br>Query: How do I sort a list in Python?<br>Result: NO_FACT_CHECK_NEEDED (99.99%) |
| bnb4 | Query: What is the current price of Bitcoin?<br>Result: FACT_CHECK_NEEDED (99.49%)<br>Query: Write a nursery rhyme about a cat.<br>Result: NO_FACT_CHECK_NEEDED (100.00%)<br>Query: How do I sort a list in Python?<br>Result: NO_FACT_CHECK_NEEDED (99.91%) |
| q4f16 | Query: What is the current price of Bitcoin?<br>Result: FACT_CHECK_NEEDED (99.14%)<br>Query: Write a nursery rhyme about a cat.<br>Result: NO_FACT_CHECK_NEEDED (100.00%)<br>Query: How do I sort a list in Python?<br>Result: NO_FACT_CHECK_NEEDED (99.99%) |
| uint8 | Query: What is the current price of Bitcoin?<br>Result: FACT_CHECK_NEEDED (98.92%)<br>Query: Write a nursery rhyme about a cat.<br>Result: NO_FACT_CHECK_NEEDED (99.97%)<br>Query: How do I sort a list in Python?<br>Result: NO_FACT_CHECK_NEEDED (95.22%) |
| q8, int8 (avoid) | Query: What is the current price of Bitcoin?<br>Result: NO_FACT_CHECK_NEEDED (55.46%)<br>Query: Write a nursery rhyme about a cat.<br>Result: NO_FACT_CHECK_NEEDED (100.00%)<br>Query: How do I sort a list in Python?<br>Result: NO_FACT_CHECK_NEEDED (78.03%) |
## 📈 Performance Notes
- Precision Stability: Unlike many models where 8-bit is the standard, HaluGate Sentinel shows improved confidence scores in q4f16 and bnb4 formats. This is likely due to the "Block-wise" quantization techniques preserving ModernBERT's internal outlier activations better than global 8-bit scaling.
- Context Window: Supports up to 8192 tokens (ModernBERT backbone).
## ⚠️ Limitations
- English Only: The model was primarily trained and validated on English datasets.
- Borderline Queries: Philosophical or hybrid prompts (e.g., "Is time travel possible?") may show lower confidence scores. We recommend implementing a "default-to-safe" (Fact Check Needed) policy for scores below 0.70.
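The "default-to-safe" policy above can be sketched as a small gate in front of the router. This is an illustrative helper (the `route` function name is hypothetical, not part of the model's API):

```python
def route(label: str, score: float, threshold: float = 0.70) -> str:
    """Default-to-safe routing: any prediction below the confidence
    threshold is escalated to FACT_CHECK_NEEDED, per the
    recommendation above for borderline queries."""
    if score < threshold:
        return "FACT_CHECK_NEEDED"
    return label

# A borderline philosophical prompt with 0.62 confidence gets grounded:
route("NO_FACT_CHECK_NEEDED", 0.62)  # -> "FACT_CHECK_NEEDED"
# A confident creative prompt passes through unchanged:
route("NO_FACT_CHECK_NEEDED", 0.95)  # -> "NO_FACT_CHECK_NEEDED"
```

Biasing ties toward fact-checking trades a little extra RAG latency for fewer ungrounded answers, which is the cheaper failure mode for most pipelines.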
## 🛠 Usage (Transformers.js / JavaScript)
```javascript
import { pipeline } from '@huggingface/transformers';

// Load the 4-bit version for optimal performance
const classifier = await pipeline('text-classification', 'vmanvs/halugate-sentinel-onnx', {
  device: 'webgpu',
  dtype: 'q4f16', // or 'bnb4' for maximum compression
});

const result = await classifier('Who won the 2020 world series?');
console.log(result);
// Output: [{ label: 'FACT_CHECK_NEEDED', score: 0.991... }]
```
## 🐍 Usage (Python / ONNX Runtime)
```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

def softmax(x):
    """Compute softmax values for each set of scores in x."""
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / e_x.sum(axis=-1, keepdims=True)

tokenizer = AutoTokenizer.from_pretrained("vmanvs/halugate-sentinel-onnx")
session = ort.InferenceSession("model_q4f16.onnx")

inputs = tokenizer("How do you implement a binary tree?", return_tensors="np")
outputs = session.run(None, dict(inputs))

# Convert logits to class probabilities
probs = softmax(outputs[0])
print(probs)
```
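To turn the raw logits into a human-readable decision, the probabilities can be mapped back to the labels listed above (Label 0 = NO_FACT_CHECK_NEEDED, Label 1 = FACT_CHECK_NEEDED). A minimal sketch, assuming that index ordering; the `classify` helper is illustrative, not part of the model's API:

```python
import numpy as np

# Label ordering as documented in the Classification Labels section
ID2LABEL = {0: "NO_FACT_CHECK_NEEDED", 1: "FACT_CHECK_NEEDED"}

def classify(logits):
    """Return (label, confidence) for a [1, 2] logits array."""
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs = e / e.sum(axis=-1, keepdims=True)
    idx = int(probs[0].argmax())
    return ID2LABEL[idx], float(probs[0][idx])

label, score = classify(np.array([[1.2, 3.4]]))
# label == "FACT_CHECK_NEEDED", score ≈ 0.90
```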
## 📝 Citation
This model is a derivation of HaluGate Sentinel by the LLM Semantic Router Team.
```bibtex
@misc{halugate2025,
  author    = {LLM Semantic Router Team},
  title     = {HaluGate Sentinel: A Frontline Switch for Hallucination Mitigation},
  year      = {2025},
  publisher = {Hugging Face},
  journal   = {Hugging Face Repository},
}
```
**Base model:** answerdotai/ModernBERT-base