I have wanted to teach transformers + vision for quite some time now.
And now it is becoming a reality
Hello,
I have wanted to teach transformers + vision for quite some time now. And soon it will be a reality.
I am incredibly excited to announce a new live program at Vizuara: a 14-week intensive on Transformers for Vision and Multimodal LLMs. It starts on Monday, September 29, 2025, with sessions from 2:00 to 3:30 PM IST each week.
If you have been following my work and that of Raj Abhijit Dandekar and Rajat Dandekar, you know that we keep things hands-on and practical. This bootcamp follows the same philosophy, with a clear focus on building real projects that you can proudly put on your GitHub and your resume.
Across the 14 weeks, we will build up steadily from the foundations. We will begin with why the field moved from CNNs to attention-based models, then work through the core vision transformer stack with ViT and DeiT, and go deeper into hierarchical backbones like Swin. We will then apply these ideas to detection and segmentation with DETR and Mask2Former, where we will see how set prediction and queries change the game. From there, we will move to promptable segmentation with SAM, step into video with TimeSformer and VideoMAE, and connect vision with language through CLIP, BLIP, Flamingo, and LLaVA. We will end with diffusion models and ControlNet, so that you experience both understanding and generation in one coherent journey.
Every lecture pairs the ideas with code. You will fine-tune models, compare architectures, and implement small but meaningful systems. By the end, you will have at least five complete projects that demonstrate skills that matter to teams today, such as object detection without anchors, unified segmentation, zero-shot classification using contrastive pretraining, few-shot multimodal reasoning, and controllable image generation. We will write code that you can adapt for work and further research, and we will review your assignments with detailed feedback.
This cohort will be taught by me. I did my PhD at MIT, and computer vision has been my bread and butter for a long time now. I have taught hundreds of learners who have been generous with their feedback about the clarity and depth in our classes.
If you are a beginner with some Python and ML basics, you will be fine.
We have a Free plan that gives you access to the live or recorded lectures. The Pro plan is ₹25,000 and includes comprehensive notes, code repositories, hands-on assignments with feedback, office hours with our team, a dedicated Discord community, lifetime access, a certificate, and structured career guidance.
If you want a program that balances theory with implementation and gives you the confidence to speak and build in this fast-moving area, I would be happy to have you in class. Classes start on September 29. You can message me here for the enrollment link or write to sreedath@vizuara.com.
Join the cohort here: https://vision-transformer.vizuara.ai/
Best,
Dr. Sreedath Panat,
IIT Madras, MIT PhD
Co-founder, Vizuara