Dissecting the Vision Transformer paper in 3 hours and 40 minutes
Let us cultivate the habit of reading research papers
Over the last few days, I recorded a new lecture in my “Reading research papers” series. This time, I picked a paper that changed the entire direction of computer vision: “An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale.”
This is the most famous paper on Vision Transformers. When I checked three months ago, it had around 63,000 citations. Now it has crossed 75,000.
In this lecture, I did not want to simply summarize the paper or make it artificially easy. I wanted to actually read the paper with you, line by line, the way we would in a real classroom. I have gone through this paper multiple times before: I have implemented the Vision Transformer from scratch and written pages of notes and questions. But in this lecture, I pretended I was opening it for the first time, right alongside you.
Because I genuinely believe that one of the most important skills in research is not watching summaries of papers, but learning how to actually read a paper with all its confusion, doubts, equations, arguments, experiments, and even the parts that don’t make sense in the first reading.
This paper came from Google. The beauty is that they showed that, with very minor modifications, the same transformer architecture from “Attention is All You Need” could be used for images as well. No convolutions; just patches of the image treated as tokens, positional embeddings added, and the sequence fed to a vanilla transformer.
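To make that pipeline concrete, here is a minimal NumPy sketch (not the authors' code; the shapes follow ViT-Base from the paper, and the random matrices stand in for learned weights) of how an image becomes a token sequence: cut into 16×16 patches, flatten each patch, project it linearly, prepend a class token, and add positional embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

H = W = 224              # image resolution used in the paper
P = 16                   # patch size ("16x16 words")
C = 3                    # RGB channels
D = 768                  # embedding dimension of ViT-Base
N = (H // P) * (W // P)  # number of patches = 14 * 14 = 196

image = rng.standard_normal((H, W, C))

# 1. Split the image into N patches of shape (P, P, C),
#    then flatten each patch into a vector of P*P*C = 768 values.
patches = image.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(N, P * P * C)

# 2. Linear projection to the model dimension
#    (a learned embedding matrix in the real model).
E = rng.standard_normal((P * P * C, D)) * 0.02
tokens = patches @ E                      # shape (N, D)

# 3. Prepend a learnable [class] token and add positional embeddings
#    (both are random placeholders here; learned in the real model).
cls = rng.standard_normal((1, D)) * 0.02
pos = rng.standard_normal((N + 1, D)) * 0.02
sequence = np.concatenate([cls, tokens], axis=0) + pos

print(sequence.shape)  # (197, 768) -> ready for a vanilla transformer encoder
```

From here, the rest of the model really is just the standard transformer encoder applied to this 197-token sequence, with the output at the class token used for classification.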
The full paper is around 22 pages including references and the appendix; the core paper is 9 pages. But it is written so cleanly that I enjoy it every time I read it, and some of the figures are extremely self-explanatory.
And that is exactly what I wanted to bring into this lecture. The real process of reading a research paper.
If you want to build the habit of reading papers, maybe start here. Read along with me, pause when you want, skip when you want, come back later. Whatever you do, stay with the original text. That is where the real learning is.
Here is the full video (3 hours 40 minutes) on Vizuara’s channel: