SicariusSicariiStuff/Impish_Bloodmoon_12B - FP8 (TensorRT-LLM)

This is an FP8-quantized version of https://huggingface.co/SicariusSicariiStuff/Impish_Bloodmoon_12B, optimized for inference with TensorRT-LLM.


Model Overview

Key Features:

  • High-Performance Inference: Optimized for NVIDIA GPUs with TensorRT-LLM
  • Memory Efficient: FP8 (8-bit) weights roughly halve VRAM usage vs. FP16
  • Production Ready: Built for low-latency, high-throughput chat serving
  • Portable Checkpoints: Checkpoints work across systems; engines are hardware-specific

Technical Specifications

  • Source Model: https://huggingface.co/SicariusSicariiStuff/Impish_Bloodmoon_12B
  • Quantization Method: FP8
  • Precision: FP8 (8-bit) weights
  • KV Cache: FP8
  • Block/Group Size: 128
  • TensorRT-LLM Version: 1.2.0rc5 (used for quantization)
  • Max Batch Size: 64
  • Max Input Length: 5525
  • Max Output Length: 150
  • SM Architecture: sm90
  • Build GPU: NVIDIA H100 NVL
  • CUDA Toolkit: 13.0
  • Generated: 2025-12-29 17:36:59 UTC

Artifact Layout

trt-llm/
  checkpoints/
    *.safetensors
    config.json
  engines/
    sm90_trt-llm-1.2.0rc5_cuda13.0/
      rank*.engine
      config.json
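
The checkpoint subtree can be fetched on its own with huggingface_hub; the repo id below is this repository's, and `allow_patterns` is one way to skip the engine files:

```python
from huggingface_hub import snapshot_download

# Download only the portable FP8 checkpoint; engines can be rebuilt locally.
local_dir = snapshot_download(
    repo_id="yapwithai/sicariussicariistuff-impish-bloodmoon-12B-trt-fp8",
    allow_patterns=["trt-llm/checkpoints/*"],
)
print(local_dir)  # root of the downloaded snapshot
```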

Quantization Details

  • Method: FP8
  • Calibration Size: 64 samples
  • Calibration Sequence Length: 5675
  • AWQ Block Size: 128
  • Calibration Batch Size: 16
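
For reference, FP8 checkpoints like this one are typically produced with NVIDIA's Model Optimizer (nvidia-modelopt), which TensorRT-LLM uses for calibration. A minimal sketch, assuming modelopt's default FP8 recipe and a toy calibration set standing in for the real 64-sample run:

```python
import torch
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "SicariusSicariiStuff/Impish_Bloodmoon_12B"
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="cuda"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Toy stand-in for the real calibration set (64 samples, sequence
# length 5675, batch size 16 per the table above).
calib_texts = [
    "The quick brown fox jumps over the lazy dog.",
    "Explain the difference between FP8 and FP16 precision.",
]

def forward_loop(model):
    # Run calibration text through the model so modelopt can observe
    # activation ranges and derive FP8 scaling factors.
    for text in calib_texts:
        ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
        with torch.no_grad():
            model(ids)

model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop=forward_loop)
```

The calibrated model is then exported to a TensorRT-LLM checkpoint (modelopt ships export utilities for this) before trtllm-build turns it into an engine.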

Compatibility

Requirements

  • GPU: NVIDIA with Compute Capability ≥ 9.0 (Hopper / H100); see the check after this list
  • CUDA: 13.0+
  • TensorRT-LLM: 1.2.0rc5
  • Python: 3.10+
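
Before downloading a prebuilt engine, the compute-capability requirement can be verified with PyTorch, which TensorRT-LLM already depends on:

```python
import torch

# Compute capability is reported as a (major, minor) tuple,
# e.g. (9, 0) for H100 (sm90).
major, minor = torch.cuda.get_device_capability()
print(f"Detected sm{major}{minor}: {torch.cuda.get_device_name()}")
assert (major, minor) >= (9, 0), "FP8 engines in this repo need compute capability >= 9.0"
```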

Portability Notes

  • Checkpoints: Portable across systems with a compatible TensorRT-LLM version and a GPU that supports FP8 (sm89/sm90+); rebuild engines on the target GPU
  • Engines: Hardware-specific. Rebuild for a different GPU, CUDA version, or SM architecture (e.g., H100/H200/B200/Blackwell, L40S, 4090/RTX), or reuse a prebuilt engine from the engines/ subdirectories if one matches your hardware

Troubleshooting

Engine fails to load on a different GPU

Engines are compiled for a specific SM architecture and CUDA version. Either:
  1. Use the checkpoints and rebuild the engine on your target system (sketched below), or
  2. Download an engine matching your GPU from the engines/ subdirectories.
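
As a sketch of option 1, engines are rebuilt from the checkpoint with the trtllm-build CLI that ships with TensorRT-LLM. The values below mirror this card's build parameters; the exact flag set varies by version, so check `trtllm-build --help`:

```python
import subprocess

# Rebuild a TensorRT-LLM engine from the portable FP8 checkpoint.
subprocess.run(
    [
        "trtllm-build",
        "--checkpoint_dir", "trt-llm/checkpoints",  # quantized checkpoint from this repo
        "--output_dir", "trt-llm/engines/local",    # hypothetical output directory
        "--max_batch_size", "64",                   # from the specifications table
        "--max_input_len", "5525",                  # from the specifications table
        "--max_seq_len", "5675",                    # input (5525) + output (150) budget
        "--gemm_plugin", "auto",                    # let TensorRT-LLM pick GEMM kernels
    ],
    check=True,
)
```
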
Out of Memory

Reduce `max_batch_size` or `max_seq_len` when building the engine, or lower `kv_cache_config.free_gpu_memory_fraction` at runtime (sketched below).
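
A minimal runtime sketch using TensorRT-LLM's high-level LLM API; `free_gpu_memory_fraction` is the knob referenced above. Whether `model` should point at a prebuilt engine directory, the checkpoint, or an HF-format model depends on your TensorRT-LLM version and backend:

```python
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import KvCacheConfig

# Cap the KV cache at 80% of free VRAM; lower this if you hit OOM at runtime.
kv_cache_config = KvCacheConfig(free_gpu_memory_fraction=0.8)

llm = LLM(
    model="trt-llm/engines/sm90_trt-llm-1.2.0rc5_cuda13.0",  # engine dir from this repo
    kv_cache_config=kv_cache_config,
)

outputs = llm.generate(
    ["Write a one-line greeting."],
    SamplingParams(max_tokens=128, temperature=0.8),
)
print(outputs[0].outputs[0].text)
```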

License

This quantized model inherits the license from the original base model: apache-2.0

See the original model's license for full terms.
