SicariusSicariiStuff/Impish_Bloodmoon_12B - FP8 (TensorRT-LLM)
This is an FP8 quantized version of https://huggingface.co/SicariusSicariiStuff/Impish_Bloodmoon_12B, optimized for TensorRT-LLM inference.
Model Overview
Key Features:
- High-Performance Inference: Optimized for NVIDIA GPUs with TensorRT-LLM
- Memory Efficient: FP8 (8-bit) weights use roughly half the VRAM of FP16
- Production Ready: Built for low-latency, high-throughput chat serving
- Portable Checkpoints: Checkpoints work across systems; engines are hardware-specific
Technical Specifications
| Specification | Details |
|---|---|
| Source Model | https://huggingface.co/SicariusSicariiStuff/Impish_Bloodmoon_12B |
| Quantization Method | FP8 |
| Precision | FP8 (8-bit) weights |
| KV Cache | FP8 |
| Block/Group Size | 128 |
| TensorRT-LLM Version | 1.2.0rc5 (used for quantization) |
| Max Batch Size | 64 |
| Max Input Length | 5525 |
| Max Output Length | 150 |
| SM Architecture | sm90 |
| GPU | NVIDIA H100 NVL |
| CUDA Toolkit | 13.0 |
| Generated | 2025-12-29 17:36:59 UTC |
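For reference, serving this artifact through TensorRT-LLM's Python `LLM` API looks roughly like the sketch below. The engine path and prompt are illustrative; the prebuilt engine only loads if your GPU/CUDA stack matches the table above (see Compatibility).

```python
from tensorrt_llm import LLM, SamplingParams

# Illustrative local path: point this at the downloaded trt-llm/ artifacts.
llm = LLM(model="./trt-llm/engines/sm90_trt-llm-1.2.0rc5_cuda13.0")

# max_tokens mirrors the engine's Max Output Length (150) from the table above.
params = SamplingParams(max_tokens=150, temperature=0.8, top_p=0.95)

outputs = llm.generate(["Write a short scene set under a blood moon."], params)
print(outputs[0].outputs[0].text)
```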
Artifact Layout
```
trt-llm/
├── checkpoints/
│   ├── *.safetensors
│   └── config.json
└── engines/
    └── sm90_trt-llm-1.2.0rc5_cuda13.0/
        ├── rank*.engine
        └── config.json
```
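To fetch only the portable checkpoints (skipping the hardware-specific engines), something like the following `huggingface_hub` call should work; the `allow_patterns` filter mirrors the layout above.

```python
from huggingface_hub import snapshot_download

# Download just the checkpoint files; engines can be rebuilt locally
# (see Portability Notes below).
local_dir = snapshot_download(
    repo_id="yapwithai/sicariussicariistuff-impish-bloodmoon-12B-trt-fp8",
    allow_patterns=["trt-llm/checkpoints/*"],
)
print(local_dir)
```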
Quantization Details
| Parameter | Value |
|---|---|
| Method | FP8 |
| Calibration Size | 64 samples |
| Calibration Seq Length | 5675 |
| AWQ Block Size | 128 |
| Calibration Batch Size | 16 |
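For reference, a checkpoint with these settings could be reproduced along the following lines with NVIDIA ModelOpt, the quantization toolkit used by TensorRT-LLM's example scripts. This is a hedged sketch, not the exact pipeline used here: the calibration texts are placeholders, `decoder_type="llama"` is an assumption for this Mistral-Nemo-based model, and wiring up the FP8 KV cache depends on the export options of your ModelOpt/TensorRT-LLM versions.

```python
import modelopt.torch.quantization as mtq
import torch
from modelopt.torch.export import export_tensorrt_llm_checkpoint
from transformers import AutoModelForCausalLM, AutoTokenizer

SRC = "SicariusSicariiStuff/Impish_Bloodmoon_12B"
model = AutoModelForCausalLM.from_pretrained(
    SRC, torch_dtype=torch.bfloat16, device_map="cuda"
)
tok = AutoTokenizer.from_pretrained(SRC)

# Placeholder calibration set; the actual run used 64 samples at
# sequence length 5675, batch size 16 (see the table above).
calib_texts = ["The quick brown fox jumps over the lazy dog."] * 4

def forward_loop(m):
    # Run calibration data through the model so ModelOpt can collect
    # activation statistics for the FP8 scales.
    for text in calib_texts:
        ids = tok(text, return_tensors="pt").input_ids.to(m.device)
        m(ids)

# Apply ModelOpt's default FP8 recipe to weights and activations.
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)

# Export in TensorRT-LLM checkpoint format (the trt-llm/checkpoints/ layout).
export_tensorrt_llm_checkpoint(
    model, decoder_type="llama", dtype=torch.bfloat16,
    export_dir="trt-llm/checkpoints",
)
```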
Compatibility
Requirements
- GPU: NVIDIA with Compute Capability ≥ 9.0 (Hopper / H100)
- CUDA: 13.0+
- TensorRT-LLM: 1.2.0rc5
- Python: 3.10+
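A quick sanity check for these requirements (assumes `torch` and `tensorrt_llm` are installed):

```python
import torch
import tensorrt_llm

# Hopper (SM90) or newer is required for this prebuilt FP8 engine; the
# checkpoint itself only needs a TensorRT-LLM version compatible with 1.2.0rc5.
major, minor = torch.cuda.get_device_capability()
assert (major, minor) >= (9, 0), f"Compute capability {major}.{minor} < 9.0"
print("TensorRT-LLM:", tensorrt_llm.__version__)
print("CUDA (torch build):", torch.version.cuda)
```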
Portability Notes
- Checkpoints: Portable across systems with a compatible TensorRT-LLM version; rebuild engines on the target GPU
- Engines: Hardware-specific; rebuild for a different GPU, CUDA version, or SM architecture (e.g., H100/H200/B200/Blackwell, L40S, 4090/RTX), or reuse a prebuilt engine from `engines/` if it matches your GPU
Troubleshooting
Engine fails to load on a different GPU
Engines are compiled for a specific SM architecture and CUDA version. Either:
- Use the checkpoints and rebuild the engine on your target system
- Download an engine matching your GPU from the `engines/` subdirectories
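If no prebuilt engine matches, the LLM API can build one locally from the portable checkpoint; a sketch, assuming the checkpoints were downloaded as shown earlier. The lower-level alternative is the `trtllm-build` CLI with `--checkpoint_dir`/`--output_dir`, and the `save()` call below is available in recent TensorRT-LLM releases.

```python
from tensorrt_llm import LLM

# Pointing the LLM API at the quantized checkpoint directory (rather than a
# prebuilt engine) triggers an engine build for the local GPU/CUDA stack.
llm = LLM(model="./trt-llm/checkpoints")
llm.save("./trt-llm/engines/local-build")  # persist the engine for reuse
```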
Out of Memory
Reduce `max_batch_size` or `max_seq_len` when building the engine. Adjust `kv_cache_config.free_gpu_memory_fraction` at runtime.
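At runtime, that fraction is exposed through `KvCacheConfig` in the LLM API; a sketch, with the engine path as a placeholder:

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

# Leave more headroom by capping how much of the free GPU memory the
# KV cache may claim (the default is higher in recent releases).
llm = LLM(
    model="./trt-llm/engines/sm90_trt-llm-1.2.0rc5_cuda13.0",
    kv_cache_config=KvCacheConfig(free_gpu_memory_fraction=0.7),
)
```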
License
This quantized model inherits the license from the original base model: apache-2.0
See the original model's license for full terms.
Model tree for yapwithai/sicariussicariistuff-impish-bloodmoon-12B-trt-fp8
- Base model: mistralai/Mistral-Nemo-Base-2407
- Finetuned: mistralai/Mistral-Nemo-Instruct-2407