We take a detailed look at the transformer encoder and decoder architecture. Then we will look at Vision Transformers and their importance in the field of robotics.
Solid walkthrough of transformer mechanics for vision applications. The breakdown of how patch embeddings work versus token embeddings is really helpful, especially the visualization of the CLS token attending to all patches simultaneously. I've worked with ViTs for robotics perception before, and the positional encoding step is often underappreciated; the rearranged-patches example makes it obvious why that matters. The 96% accuracy after just 3 epochs on the bean disease dataset says a lot about the effectiveness of transfer learning with pretrained ViTs.
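For anyone curious, the patch embedding, CLS token, and positional encoding steps mentioned above can be sketched in a few lines of PyTorch. This is a minimal illustration assuming a standard ViT-Base-style configuration (224×224 input, 16×16 patches, 768-dim embeddings); the class and parameter names are my own, not from any particular library.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping patches, project each patch to an
    embedding, prepend a learnable CLS token, and add positional encodings."""

    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2  # 14 * 14 = 196
        # A strided convolution is equivalent to flattening each patch
        # and applying a shared linear projection.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # Learned positional encodings: without these, shuffling the patches
        # would leave the sequence indistinguishable to the encoder.
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, embed_dim))

    def forward(self, x):
        batch = x.shape[0]
        x = self.proj(x)                   # (batch, embed_dim, 14, 14)
        x = x.flatten(2).transpose(1, 2)   # (batch, 196, embed_dim)
        cls = self.cls_token.expand(batch, -1, -1)
        x = torch.cat([cls, x], dim=1)     # (batch, 197, embed_dim)
        return x + self.pos_embed

tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 197, 768])
```

The CLS token ends up as position 0 of the sequence, which is why it can attend to every patch in each self-attention layer and is later used as the classification summary.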
Thanks for the feedback!