The model is split into two parts: vision_model and text_decoder. Run the vision model once on the input image and capture its outputs, encoder_hidden_states and encoder_attention_mask. Feed them as inputs to the text decoder to generate the image caption.
Converted to ONNX from the source model: https://huggingface.co/Salesforce/blip-image-captioning-base
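A minimal sketch of the two-stage pipeline with onnxruntime is shown below. The ONNX file names, the decoder input names, the vision model's input name, and the decoder start/end token ids are assumptions (only encoder_hidden_states and encoder_attention_mask are named above); inspect the exported graphs with `get_inputs()` / `get_outputs()` and the tokenizer config before relying on them.

```python
# Hypothetical sketch: greedy caption generation with the two ONNX parts.
# File names, decoder input names, and special-token ids below are assumptions.
import numpy as np
import onnxruntime as ort
from PIL import Image
from transformers import BlipProcessor  # preprocessing + tokenization for BLIP

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
vision = ort.InferenceSession("vision_model.onnx")    # assumed file name
decoder = ort.InferenceSession("text_decoder.onnx")   # assumed file name

# 1) Run the vision model once and capture encoder_hidden_states / encoder_attention_mask.
image = Image.open("example.jpg").convert("RGB")
pixel_values = processor(images=image, return_tensors="np")["pixel_values"]
enc_hidden, enc_mask = vision.run(
    ["encoder_hidden_states", "encoder_attention_mask"],  # output names from the card
    {"pixel_values": pixel_values},                       # assumed input name
)

# 2) Feed the cached encoder outputs to the text decoder and decode greedily.
bos_id = processor.tokenizer.cls_token_id  # assumed decoder-start token
eos_id = processor.tokenizer.sep_token_id  # assumed end-of-caption token
ids = np.array([[bos_id]], dtype=np.int64)
for _ in range(30):                                        # max caption length
    (logits,) = decoder.run(
        None,
        {
            "input_ids": ids,                              # assumed input names
            "attention_mask": np.ones_like(ids),
            "encoder_hidden_states": enc_hidden,
            "encoder_attention_mask": enc_mask,
        },
    )
    next_id = int(logits[0, -1].argmax())                  # greedy pick of next token
    ids = np.concatenate([ids, [[next_id]]], axis=1)
    if next_id == eos_id:
        break

print(processor.tokenizer.decode(ids[0], skip_special_tokens=True))
```

Because the vision model runs only once per image, its outputs can be reused across every decoding step (or across several decoding strategies) without re-encoding the image.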