The model is split into two parts: vision_model and text_decoder. Run the vision model once on the input image and capture its outputs, encoder_hidden_states and encoder_attention_mask. Feed them as inputs to the text decoder to generate the image caption.
Converted to ONNX from the source model: https://huggingface.co/Salesforce/blip-image-captioning-base
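A minimal sketch of the two-stage pipeline with onnxruntime is shown below. The ONNX file names, the decoder input names, the vision model's input name, and the decoder start/end token ids are assumptions (only encoder_hidden_states and encoder_attention_mask are named above); inspect the exported graphs with `get_inputs()` / `get_outputs()` and the tokenizer config before relying on them.

```python
# Hypothetical sketch: greedy caption generation with the two ONNX parts.
# File names, decoder input names, and special-token ids below are assumptions.
import numpy as np
import onnxruntime as ort
from PIL import Image
from transformers import BlipProcessor  # preprocessing + tokenization for BLIP

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
vision = ort.InferenceSession("vision_model.onnx")    # assumed file name
decoder = ort.InferenceSession("text_decoder.onnx")   # assumed file name

# 1) Run the vision model once and capture encoder_hidden_states / encoder_attention_mask.
image = Image.open("example.jpg").convert("RGB")
pixel_values = processor(images=image, return_tensors="np")["pixel_values"]
enc_hidden, enc_mask = vision.run(
    ["encoder_hidden_states", "encoder_attention_mask"],  # output names from the card
    {"pixel_values": pixel_values},                       # assumed input name
)

# 2) Feed the cached encoder outputs to the text decoder and decode greedily.
bos_id = processor.tokenizer.cls_token_id  # assumed decoder-start token
eos_id = processor.tokenizer.sep_token_id  # assumed end-of-caption token
ids = np.array([[bos_id]], dtype=np.int64)
for _ in range(30):                                        # max caption length
    (logits,) = decoder.run(
        None,
        {
            "input_ids": ids,                              # assumed input names
            "attention_mask": np.ones_like(ids),
            "encoder_hidden_states": enc_hidden,
            "encoder_attention_mask": enc_mask,
        },
    )
    next_id = int(logits[0, -1].argmax())                  # greedy pick of next token
    ids = np.concatenate([ids, [[next_id]]], axis=1)
    if next_id == eos_id:
        break

print(processor.tokenizer.decode(ids[0], skip_special_tokens=True))
```

Because the vision model runs only once per image, its outputs can be reused across every decoding step (or across several decoding strategies) without re-encoding the image.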