SicariusSicariiStuff/Impish_Bloodmoon_12B - FP8 (TensorRT-LLM)

This is an FP8-quantized version of https://huggingface.co/SicariusSicariiStuff/Impish_Bloodmoon_12B, optimized for inference with TensorRT-LLM.


Model Overview

Key Features:

  • High-Performance Inference: Optimized for NVIDIA GPUs with TensorRT-LLM
  • Memory Efficient: FP8 (8-bit) weights roughly halve VRAM usage vs. FP16
  • Production Ready: Built for low-latency, high-throughput chat serving
  • Portable Checkpoints: Checkpoints work across systems; engines are hardware-specific

Technical Specifications

  • Source Model: https://huggingface.co/SicariusSicariiStuff/Impish_Bloodmoon_12B
  • Quantization Method: FP8
  • Precision: FP8 (8-bit) weights
  • KV Cache: FP8
  • Block/Group Size: 128
  • TensorRT-LLM Version: 1.2.0rc5 (used for quantization)
  • Max Batch Size: 64
  • Max Input Length: 5525
  • Max Output Length: 150
  • SM Architecture: sm90
  • Build GPU: NVIDIA H100 NVL
  • CUDA Toolkit: 13.0
  • Generated: 2025-12-29 17:36:59 UTC

Artifact Layout

trt-llm/
  checkpoints/
    *.safetensors
    config.json
  engines/
    sm90_trt-llm-1.2.0rc5_cuda13.0/
      rank*.engine
      config.json
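
The checkpoint subtree can be fetched on its own with huggingface_hub; the repo id below is this repository's, and `allow_patterns` is one way to skip the engine files:

```python
from huggingface_hub import snapshot_download

# Download only the portable FP8 checkpoint; engines can be rebuilt locally.
local_dir = snapshot_download(
    repo_id="yapwithai/sicariussicariistuff-impish-bloodmoon-12B-trt-fp8",
    allow_patterns=["trt-llm/checkpoints/*"],
)
print(local_dir)  # root of the downloaded snapshot
```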

Quantization Details

  • Method: FP8
  • Calibration Size: 64 samples
  • Calibration Sequence Length: 5675
  • AWQ Block Size: 128
  • Calibration Batch Size: 16
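
For reference, FP8 checkpoints like this one are typically produced with NVIDIA's Model Optimizer (nvidia-modelopt), which TensorRT-LLM uses for calibration. A minimal sketch, assuming modelopt's default FP8 recipe and a toy calibration set standing in for the real 64-sample run:

```python
import torch
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "SicariusSicariiStuff/Impish_Bloodmoon_12B"
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="cuda"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Toy stand-in for the real calibration set (64 samples, sequence
# length 5675, batch size 16 per the table above).
calib_texts = [
    "The quick brown fox jumps over the lazy dog.",
    "Explain the difference between FP8 and FP16 precision.",
]

def forward_loop(model):
    # Run calibration text through the model so modelopt can observe
    # activation ranges and derive FP8 scaling factors.
    for text in calib_texts:
        ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
        with torch.no_grad():
            model(ids)

model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop=forward_loop)
```

The calibrated model is then exported to a TensorRT-LLM checkpoint (modelopt ships export utilities for this) before trtllm-build turns it into an engine.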

Compatibility

Requirements

  • GPU: NVIDIA with Compute Capability ≥ 9.0 (Hopper / H100); see the check after this list
  • CUDA: 13.0+
  • TensorRT-LLM: 1.2.0rc5
  • Python: 3.10+
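
Before downloading a prebuilt engine, the compute-capability requirement can be verified with PyTorch, which TensorRT-LLM already depends on:

```python
import torch

# Compute capability is reported as a (major, minor) tuple,
# e.g. (9, 0) for H100 (sm90).
major, minor = torch.cuda.get_device_capability()
print(f"Detected sm{major}{minor}: {torch.cuda.get_device_name()}")
assert (major, minor) >= (9, 0), "FP8 engines in this repo need compute capability >= 9.0"
```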

Portability Notes

  • Checkpoints: Portable across systems with a compatible TensorRT-LLM version and a GPU that supports FP8 (sm89/sm90+); rebuild engines on the target GPU
  • Engines: Hardware-specific. Rebuild for a different GPU, CUDA version, or SM architecture (e.g., H100/H200/B200/Blackwell, L40S, 4090/RTX), or reuse a prebuilt engine from the engines/ subdirectories if one matches your hardware

Troubleshooting

Engine fails to load on a different GPU

Engines are compiled for a specific SM architecture and CUDA version. Either:
  1. Use the checkpoints and rebuild the engine on your target system (sketched below), or
  2. Download an engine matching your GPU from the engines/ subdirectories.
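
As a sketch of option 1, engines are rebuilt from the checkpoint with the trtllm-build CLI that ships with TensorRT-LLM. The values below mirror this card's build parameters; the exact flag set varies by version, so check `trtllm-build --help`:

```python
import subprocess

# Rebuild a TensorRT-LLM engine from the portable FP8 checkpoint.
subprocess.run(
    [
        "trtllm-build",
        "--checkpoint_dir", "trt-llm/checkpoints",  # quantized checkpoint from this repo
        "--output_dir", "trt-llm/engines/local",    # hypothetical output directory
        "--max_batch_size", "64",                   # from the specifications table
        "--max_input_len", "5525",                  # from the specifications table
        "--max_seq_len", "5675",                    # input (5525) + output (150) budget
        "--gemm_plugin", "auto",                    # let TensorRT-LLM pick GEMM kernels
    ],
    check=True,
)
```
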
Out of Memory

Reduce `max_batch_size` or `max_seq_len` when building the engine, or lower `kv_cache_config.free_gpu_memory_fraction` at runtime (sketched below).
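
A minimal runtime sketch using TensorRT-LLM's high-level LLM API; `free_gpu_memory_fraction` is the knob referenced above. Whether `model` should point at a prebuilt engine directory, the checkpoint, or an HF-format model depends on your TensorRT-LLM version and backend:

```python
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import KvCacheConfig

# Cap the KV cache at 80% of free VRAM; lower this if you hit OOM at runtime.
kv_cache_config = KvCacheConfig(free_gpu_memory_fraction=0.8)

llm = LLM(
    model="trt-llm/engines/sm90_trt-llm-1.2.0rc5_cuda13.0",  # engine dir from this repo
    kv_cache_config=kv_cache_config,
)

outputs = llm.generate(
    ["Write a one-line greeting."],
    SamplingParams(max_tokens=128, temperature=0.8),
)
print(outputs[0].outputs[0].text)
```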

License

This quantized model inherits the license from the original base model: apache-2.0

See the original model's license for full terms.
