HaluGate Sentinel (Quantized ONNX Versions)

HaluGate Sentinel is a high-efficiency binary classifier designed as a "Stage 0" gatekeeper for LLM pipelines. It analyzes incoming user prompts and decides whether they require factual grounding (RAG/Tooling) or can be handled by a creative/reasoning engine directly.

This repository contains optimized ONNX versions of the model, specifically tuned for deployment in browser environments via Transformers.js and edge devices via ONNX Runtime.

🚀 Model Variants

Through rigorous testing, we have found that 4-bit quantization (Q4/BNB4) significantly outperforms standard 8-bit quantization for this architecture, due to its block-wise handling of outlier activations.

| Format | Quantization | Recommendation | Use Case |
|--------|--------------|----------------|----------|
| `model.onnx` | FP32 | Baseline | Reference / Server-side |
| `model_fp16.onnx` | FP16 | High Performance | WebGPU (Browser) |
| `model_q4.onnx` | 4-bit Quantized | Best Balance | General Web / Transformers.js |
| `model_q4f16.onnx` | 4-bit with Float16 accumulation | Best Balance (Recommended) | General Web / Transformers.js |
| `model_bnb4.onnx` | BitsAndBytes 4-bit | Ultra Light | Mobile / Low Bandwidth |
| `model_uint8.onnx` | 8-bit Unsigned Integer | Stable | CPU / WASM |
| `model_quantized.onnx` | 8-bit Quantized | Not Recommended | — |
| `model_int8.onnx` | 8-bit Signed Integer | Not Recommended | — |

Note on Quantization Performance: ModernBERT architectures often exhibit "outlier" activations. In our tests, 8-bit global quantization (INT8) caused significant confidence degradation. Using block-wise 4-bit (Q4) or BitsAndBytes (BNB4) isolates these outliers, resulting in performance that nearly matches the original FP32 precision.
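The effect of block-wise scaling can be illustrated with a toy sketch (this is not the actual ONNX quantizer, just symmetric uniform quantization over a synthetic vector with one injected outlier):

```python
import numpy as np

# Illustrative sketch: why block-wise scales handle outlier
# activations better than a single global scale.
rng = np.random.default_rng(0)
x = rng.normal(0, 1, 256).astype(np.float32)
x[7] = 40.0  # a single ModernBERT-style outlier activation

def quantize(v, n_levels=16):
    """Symmetric uniform quantization of v with a single scale."""
    scale = np.abs(v).max() / (n_levels / 2 - 1)
    return np.round(v / scale) * scale

# Global scaling: the outlier inflates the scale for every value,
# so most normal-range activations round to zero.
global_err = np.abs(x - quantize(x)).mean()

# Block-wise scaling: each 32-value block gets its own scale,
# so the outlier only degrades its own block.
blocks = x.reshape(-1, 32)
blockwise = np.concatenate([quantize(b) for b in blocks])
block_err = np.abs(x - blockwise).mean()

print(f"global mean error:     {global_err:.4f}")
print(f"block-wise mean error: {block_err:.4f}")  # noticeably smaller
```

The same intuition applies to Q4/BNB4: isolating the outlier inside a small block keeps the quantization step fine for everything else.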

📊 Classification Labels

The model outputs two classes based on the prompt's intent:

  1. FACT_CHECK_NEEDED (Label 1): Information-seeking queries that rely on world knowledge (e.g., "What is the current price of Bitcoin?").
  2. NO_FACT_CHECK_NEEDED (Label 0): Creative, coding, opinion, or pure reasoning tasks (e.g., "Write a poem about a cat" or "How do I sort a list in Python?").
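If you are working with raw logits rather than the pipeline output, the label mapping above can be decoded with a standard softmax + argmax. A minimal sketch (the `decode` helper and the example logits are illustrative, not part of the model's API):

```python
import numpy as np

# Label ids as documented above: 0 = no fact check, 1 = fact check.
ID2LABEL = {0: "NO_FACT_CHECK_NEEDED", 1: "FACT_CHECK_NEEDED"}

def decode(logits: np.ndarray) -> tuple[str, float]:
    """Turn [batch, 2] logits into a (label, confidence) pair for row 0."""
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    idx = int(probs[0].argmax())
    return ID2LABEL[idx], float(probs[0, idx])

label, score = decode(np.array([[-1.2, 2.3]]))  # hypothetical logits
print(label, round(score, 3))  # FACT_CHECK_NEEDED 0.971
```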

✅ Tests

We encourage you to test the model against your own use case; for reference, here are some lightweight checks run on each variant:

| Model | "What is the current price of Bitcoin?" | "Write a nursery rhyme about a cat." | "How do I sort a list in Python?" |
|-------|------------------------------------------|---------------------------------------|------------------------------------|
| fp32 | FACT_CHECK_NEEDED (88.73%) | NO_FACT_CHECK_NEEDED (100.00%) | NO_FACT_CHECK_NEEDED (99.97%) |
| fp16 | FACT_CHECK_NEEDED (88.68%) | NO_FACT_CHECK_NEEDED (100.00%) | NO_FACT_CHECK_NEEDED (99.97%) |
| q4 | FACT_CHECK_NEEDED (99.13%) | NO_FACT_CHECK_NEEDED (100.00%) | NO_FACT_CHECK_NEEDED (99.99%) |
| bnb4 | FACT_CHECK_NEEDED (99.49%) | NO_FACT_CHECK_NEEDED (100.00%) | NO_FACT_CHECK_NEEDED (99.91%) |
| q4f16 | FACT_CHECK_NEEDED (99.14%) | NO_FACT_CHECK_NEEDED (100.00%) | NO_FACT_CHECK_NEEDED (99.99%) |
| uint8 | FACT_CHECK_NEEDED (98.92%) | NO_FACT_CHECK_NEEDED (99.97%) | NO_FACT_CHECK_NEEDED (95.22%) |
| q8, int8 (avoid) | NO_FACT_CHECK_NEEDED (55.46%) | NO_FACT_CHECK_NEEDED (100.00%) | NO_FACT_CHECK_NEEDED (78.03%) |

📈 Performance Notes

  • Precision Stability: Unlike many models where 8-bit is the standard, HaluGate Sentinel shows improved confidence scores in q4f16 and bnb4 formats. This is likely due to the "Block-wise" quantization techniques preserving ModernBERT's internal outlier activations better than global 8-bit scaling.
  • Context Window: Supports up to 8192 tokens (ModernBERT backbone).

⚠️ Limitations

  • English Only: The model was primarily trained and validated on English datasets.
  • Borderline Queries: Philosophical or hybrid prompts (e.g., "Is time travel possible?") may show lower confidence scores. We recommend implementing a "default-to-safe" (Fact Check Needed) policy for scores below 0.70.
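The "default-to-safe" policy above can be sketched as a thin wrapper around the classifier output (the prediction dicts here are hypothetical examples in the pipeline's output shape):

```python
# Default-to-safe routing, per the recommendation above.
# The 0.70 threshold is a starting point; tune it for your traffic.
SAFE_THRESHOLD = 0.70

def route_with_fallback(prediction: dict) -> str:
    """Route to 'rag' (grounded) or 'direct', defaulting to 'rag'
    whenever the classifier is not confident enough."""
    if prediction["score"] < SAFE_THRESHOLD:
        return "rag"  # borderline prompt: treat as fact-check needed
    if prediction["label"] == "FACT_CHECK_NEEDED":
        return "rag"
    return "direct"

print(route_with_fallback({"label": "NO_FACT_CHECK_NEEDED", "score": 0.55}))  # rag
print(route_with_fallback({"label": "NO_FACT_CHECK_NEEDED", "score": 0.99}))  # direct
```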

🛠 Usage (Transformers.js / JavaScript)

```javascript
import { pipeline } from '@huggingface/transformers';

// Load the 4-bit version for optimal performance
const classifier = await pipeline('text-classification', 'vmanvs/halugate-sentinel-onnx', {
    device: 'webgpu',
    dtype: 'q4f16', // or 'bnb4' for maximum compression
});

const result = await classifier("Who won the 2020 world series?");
console.log(result);
// Output: [{ label: 'FACT_CHECK_NEEDED', score: 0.991... }]
```

🐍 Usage (Python / ONNX Runtime)

```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

def softmax(x):
    """Compute softmax values over the last axis."""
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / e_x.sum(axis=-1, keepdims=True)

tokenizer = AutoTokenizer.from_pretrained("vmanvs/halugate-sentinel-onnx")
session = ort.InferenceSession("model_q4f16.onnx")

inputs = tokenizer("How do you implement a binary tree?", return_tensors="np")
outputs = session.run(None, dict(inputs))
probs = softmax(outputs[0])  # [batch, 2] class probabilities
```

📝 Citation

This model is derived from HaluGate Sentinel by the LLM Semantic Router Team.

```bibtex
@misc{halugate2025,
  author = {LLM Semantic Router Team},
  title = {HaluGate Sentinel: A Frontline Switch for Hallucination Mitigation},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face Repository},
}
```