LIVE workshop: Build a NanoVLM from scratch
Happening on Saturday, November 8th
How does a VLM work?
A VLM has two separate encoders:
1) Image Encoder – usually a Vision Transformer (ViT) or a CNN that converts an image into an embedding vector.
2) Text Encoder – a Transformer-based model (like BERT or GPT) that converts a caption, question, or sentence into an embedding vector.
The goal is to make both of these vectors live in the same embedding space.
For example, the text “a dog running” and an image of a dog running should end up close to each other in this space.
The text “a red apple” and a picture of a red apple should also align.
Mismatched pairs should be far apart.
This alignment is trained with contrastive learning, just like OpenAI's CLIP.
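Here is a minimal sketch of that CLIP-style contrastive objective in PyTorch. It assumes you already have matched image and text embeddings of the same dimension; the names and the temperature value are illustrative, not the workshop's exact code.

```python
# Minimal sketch of a CLIP-style symmetric contrastive loss.
# image_emb, text_emb, and temperature are illustrative placeholders.
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb: torch.Tensor, text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of matched image/text pairs."""
    # Normalize so the dot product becomes cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Similarity matrix: logits[i, j] = similarity of image i with text j.
    logits = image_emb @ text_emb.t() / temperature

    # The matching pair for row i sits on the diagonal (column i).
    targets = torch.arange(len(image_emb), device=image_emb.device)

    # Cross-entropy in both directions (image->text and text->image).
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Toy usage: a batch of 8 pairs with 256-dimensional embeddings.
loss = contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
```

Matched pairs get pulled together along the diagonal of the similarity matrix, while every other entry in the same row and column acts as a negative that gets pushed away.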
Once text and images share a common space, the model can:
-> Generate a caption for any image
-> Answer questions about an image (VQA)
-> Retrieve the correct image when you type a sentence
-> Retrieve the correct caption when you upload an image
-> Act or plan in robotics and self-driving using both vision and language (VLA, VLP)
The model is not just looking at pixels or words anymore. It is aligning the meaning across both.
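Retrieval, for example, becomes a simple nearest-neighbor search in that shared space. A minimal sketch, assuming the two encoders are already trained (the function and variable names here are placeholders, not the workshop's API):

```python
# Minimal sketch of text-to-image retrieval in the shared embedding space.
import torch
import torch.nn.functional as F

def retrieve_best_image(text_emb: torch.Tensor, image_embs: torch.Tensor) -> int:
    """Return the index of the gallery image closest to the text query."""
    text_emb = F.normalize(text_emb, dim=-1)
    image_embs = F.normalize(image_embs, dim=-1)
    similarities = image_embs @ text_emb   # cosine similarity per gallery image
    return int(similarities.argmax())

# Toy usage: one 256-d query embedding against a gallery of 100 image embeddings.
best = retrieve_best_image(torch.randn(256), torch.randn(100, 256))
```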
What you build in a NanoVLM
-> A small text encoder
-> A CNN or ViT-based image encoder
-> A joint embedding space
-> Contrastive loss to train both encoders together
-> Visualization of how text and image alignment improves during training
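To give a rough picture of the first three items, here is a minimal sketch of a tiny image encoder and text encoder projecting into one joint space. The sizes, layer counts, and class names are illustrative placeholders, not the exact NanoVLM architecture from the workshop.

```python
# Minimal sketch: two small encoders with projection heads into a joint space.
import torch
import torch.nn as nn

class TinyImageEncoder(nn.Module):
    """Small CNN that maps a 3x64x64 image to a joint-space embedding."""
    def __init__(self, embed_dim: int = 256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.proj = nn.Linear(64, embed_dim)  # projection into the joint space

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        return self.proj(self.backbone(images))

class TinyTextEncoder(nn.Module):
    """Small Transformer that maps a token sequence to a joint-space embedding."""
    def __init__(self, vocab_size: int = 10_000, embed_dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 128)
        layer = nn.TransformerEncoderLayer(d_model=128, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.proj = nn.Linear(128, embed_dim)  # projection into the joint space

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        hidden = self.encoder(self.embed(token_ids))
        return self.proj(hidden.mean(dim=1))  # mean-pool over tokens

# Toy usage: embed 8 images and 8 captions into the same 256-d space,
# then train both encoders together with the contrastive loss shown earlier.
img_emb = TinyImageEncoder()(torch.randn(8, 3, 64, 64))
txt_emb = TinyTextEncoder()(torch.randint(0, 10_000, (8, 32)))
```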
I am conducting a live, hands-on workshop,
“Build a Nano Vision Language Model (VLM) from Scratch”, at Vizuara.
Saturday, Nov 8th
10:30 am – 1:00 pm IST
Link to register: https://vizuara.ai/courses/build-a-nanovlm-from-scratch/




