Discussion about this post

Neural Foundry:

Solid walkthrough of transformer mechanics for vision applications. The breakdown of how patch embeddings work versus token embeddings is really helpful, especially the visualization of the CLS token attending to all patches simultaneously. I've worked with ViTs for robotics perception before, and the positional encoding step is often underappreciated; the rearranged-patches example makes it obvious why it matters. The 96% accuracy after just 3 epochs on the bean disease dataset says a lot about transfer learning effectiveness with pretrained ViTs.
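For readers who want the patch-embedding and CLS-token steps the comment mentions in concrete form, here is a minimal sketch in PyTorch. All dimensions (224px images, 16px patches, 768-dim embeddings) follow the common ViT-Base convention and are illustrative assumptions, not values from the post.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Minimal sketch: image -> patch tokens + CLS token + positional embedding."""

    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2  # 14 * 14 = 196
        # A strided conv is equivalent to cutting the image into patches
        # and linearly projecting each one.
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        # Learned positional embedding: one slot per patch plus the CLS token.
        # Without this, shuffling the patches would leave the output unchanged.
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, dim))

    def forward(self, x):
        B = x.shape[0]
        x = self.proj(x)                        # (B, dim, 14, 14)
        x = x.flatten(2).transpose(1, 2)        # (B, 196, dim)
        cls = self.cls_token.expand(B, -1, -1)  # (B, 1, dim)
        x = torch.cat([cls, x], dim=1)          # (B, 197, dim)
        return x + self.pos_embed               # positions break permutation invariance

emb = PatchEmbed()
out = emb(torch.randn(2, 3, 224, 224))
print(out.shape)  # (2, 197, 768): CLS token + 196 patch tokens per image
```

In the full transformer, self-attention then lets the CLS token attend to all 196 patch tokens at once, which is why its final state works as a global image summary for classification.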

