
A Coding Guide to Build a Multimodal Image Captioning App Using Salesforce BLIP Model, Streamlit, Ngrok, and Hugging Face

In this tutorial, we will learn how to build an interactive multimodal image-captioning application using Google Colab, Salesforce's powerful BLIP model, and Streamlit for an intuitive web interface. Multimodal models, which combine image and text processing capabilities, have become increasingly important in AI applications, enabling tasks such as image captioning, visual question answering, and more. This step-by-step guide ensures a smooth setup, clearly addresses common pitfalls, and shows how to integrate and deploy advanced AI solutions, even without extensive experience.

!pip install transformers torch torchvision streamlit Pillow pyngrok

First, we install transformers, torch, torchvision, streamlit, Pillow, and pyngrok, all the dependencies needed to build the multimodal captioning application. This includes transformers (for the BLIP model), torch and torchvision (for deep learning and image processing), streamlit (for creating the user interface), Pillow (for handling image files), and pyngrok (for exposing the application publicly via ngrok).
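As an optional sanity check (not part of the original tutorial), you can confirm in a separate Colab cell that the packages import correctly and whether a GPU is visible before writing the app; the version attributes used below are standard for these libraries.

# Optional sanity check, assumed to run in its own Colab cell before building the app
import torch
import transformers
import streamlit
import PIL

print("transformers:", transformers.__version__)
print("torch:", torch.__version__)
print("streamlit:", streamlit.__version__)
print("Pillow:", PIL.__version__)
print("CUDA available:", torch.cuda.is_available())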

%%writefile app.py
import torch
from transformers import BlipProcessor, BlipForConditionalGeneration
import streamlit as st
from PIL import Image


# Use the GPU if available, otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"


@st.cache_resource
def load_model():
    # Load the BLIP processor and model once; Streamlit caches them across reruns
    processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
    model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base").to(device)
    return processor, model


processor, model = load_model()


st.title("🖼️ Image Captioning with BLIP")


uploaded_file = st.file_uploader("Upload your image:", type=["jpg", "jpeg", "png"])


if uploaded_file is not None:
    image = Image.open(uploaded_file).convert('RGB')
    st.image(image, caption="Uploaded Image", use_column_width=True)


    if st.button("Generate Caption"):
        # Preprocess the uploaded image and generate a caption with BLIP
        inputs = processor(image, return_tensors="pt").to(device)
        outputs = model.generate(**inputs)
        caption = processor.decode(outputs[0], skip_special_tokens=True)
        st.markdown(f"### ✅ **Caption:** {caption}")

Then we create the multimodal image-captioning app built on the BLIP model. It first loads BlipProcessor and BlipForConditionalGeneration from Hugging Face, which allow the model to preprocess images and generate captions. The Streamlit UI lets users upload an image, display it, and generate a caption at the click of a button. The @st.cache_resource decorator ensures the processor and model are loaded only once, keeping the app responsive across reruns.
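If you want more control over the output, BLIP also supports conditional captioning (seeding the caption with a text prompt) and the usual Hugging Face generation options such as beam search and length limits. The snippet below is a minimal sketch of that idea, reusing the processor, model, device, and PIL image from the app above; the prompt string and generation parameters are illustrative choices, not values from the original tutorial.

# Sketch: conditional captioning plus basic generation controls (illustrative values)
prompt = "a photography of"  # optional text prefix used to steer the caption
inputs = processor(image, prompt, return_tensors="pt").to(device)
outputs = model.generate(
    **inputs,
    max_new_tokens=40,  # cap the caption length
    num_beams=5,        # beam search for more fluent output
)
caption = processor.decode(outputs[0], skip_special_tokens=True)
print(caption)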

from pyngrok import ngrok


NGROK_TOKEN = "use your own NGROK token here"
ngrok.set_auth_token(NGROK_TOKEN)


# Expose the Streamlit app running on port 8501 through an ngrok tunnel
public_url = ngrok.connect(8501)
print("🌐 Your Streamlit app is available at:", public_url)


# run streamlit app
!streamlit run app.py &>/dev/null &

Finally, we make the Streamlit app publicly accessible from Google Colab using ngrok. It does the following:

  1. Authenticates ngrok with your personal token (`NGROK_TOKEN`) to create a secure tunnel.
  2. Exposes the Streamlit app running on port 8501 at a public URL via `ngrok.connect(8501)`.
  3. Prints the public URL, which you can open in any browser to access the application.
  4. Runs the Streamlit app (`app.py`) in the background.

This approach lets you interact with your image-captioning app remotely, even though Google Colab does not provide direct web hosting.
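When you are finished experimenting, it is good practice to close the tunnel so the port is freed and no ngrok session is left running. The snippet below is a small optional cleanup sketch using pyngrok's tunnel-management helpers; it is not part of the original tutorial.

# Optional cleanup sketch: inspect and close ngrok tunnels when finished
from pyngrok import ngrok

# List and close any tunnels currently open in this session
for tunnel in ngrok.get_tunnels():
    print("Closing tunnel:", tunnel.public_url)
    ngrok.disconnect(tunnel.public_url)

# Shut down the ngrok process entirely
ngrok.kill()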

In conclusion, we have successfully built and deployed a multimodal image-captioning application using Salesforce's BLIP and Streamlit, securely hosted via ngrok from the Google Colab environment. This hands-on exercise demonstrated how easily advanced machine learning models can be integrated into user-friendly interfaces, and it provides a foundation for further exploration and customization of multimodal applications.


Here is the Colab Notebook.


Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of an artificial intelligence media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among readers.


2025-03-14 03:29:00

