<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Vizuara AI Labs]]></title><description><![CDATA[Vizuara AI Labs]]></description><link>https://www.vizuaranewsletter.com</link><image><url>https://substackcdn.com/image/fetch/$s_!uFo6!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0717dc2f-5f66-4978-99db-8ebaf589e1dd_1088x1088.png</url><title>Vizuara AI Labs</title><link>https://www.vizuaranewsletter.com</link></image><generator>Substack</generator><lastBuildDate>Mon, 15 Jun 2026 17:32:06 GMT</lastBuildDate><atom:link href="https://www.vizuaranewsletter.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Banque Populaire]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[hello@vizuara.com]]></webMaster><itunes:owner><itunes:email><![CDATA[hello@vizuara.com]]></itunes:email><itunes:name><![CDATA[Vizuara AI Labs]]></itunes:name></itunes:owner><itunes:author><![CDATA[Vizuara AI Labs]]></itunes:author><googleplay:owner><![CDATA[hello@vizuara.com]]></googleplay:owner><googleplay:email><![CDATA[hello@vizuara.com]]></googleplay:email><googleplay:author><![CDATA[Vizuara AI Labs]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Teaching Robots to Fold Clothes: SmolVLA for Bimanual Cloth Manipulation]]></title><description><![CDATA[How a 450M-parameter Vision-Language-Action model learns to coordinate two robot arms for fabric folding &#8212; from just 50 human demonstrations.]]></description><link>https://www.vizuaranewsletter.com/p/teaching-robots-to-fold-clothes-smolvla</link><guid 
isPermaLink="false">https://www.vizuaranewsletter.com/p/teaching-robots-to-fold-clothes-smolvla</guid><dc:creator><![CDATA[Dr Rajat Dandekar]]></dc:creator><pubDate>Thu, 16 Apr 2026 05:23:01 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!JTb0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd89dfdf3-c24b-43f2-88b9-23d329a572f1_2752x1536.heic" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Try to fold a piece of cloth. Notice what your hands do &#8212; one grips an edge while the other lifts and folds over, then both press down to crease. Now imagine teaching a robot to do the same thing.</p><p>Cloth folding is one of those tasks that seems trivially easy for humans but is brutally hard for robots. Fabric has effectively infinite degrees of freedom. It crumples, slides, and deforms unpredictably. You can&#8217;t plan a rigid trajectory the way you would for picking up a block &#8212; the cloth reshapes itself with every touch.</p><p>In this post, I&#8217;ll walk through how I fine-tuned <strong>SmolVLA</strong> &#8212; a compact 450M-parameter Vision-Language-Action model &#8212; on 50 teleoperated demonstrations to teach two robot arms to fold a cloth. 
I&#8217;ll compare it head-to-head against <strong>ACT</strong> (Action Chunking with Transformers), a popular lightweight imitation learning baseline, and share what worked, what didn&#8217;t, and what surprised me.</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;21a0f063-43cd-4d6b-bde4-e774f22a05ec&quot;,&quot;duration&quot;:null}"></div><p><em>SmolVLA successfully initiates the fold, completes it tightly, and pushes the cloth aside &#8212; scoring 2.80/3.00.</em></p><p><strong>The Setup: Two Arms, Three Cameras, One Cloth</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JTb0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd89dfdf3-c24b-43f2-88b9-23d329a572f1_2752x1536.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JTb0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd89dfdf3-c24b-43f2-88b9-23d329a572f1_2752x1536.heic 424w, https://substackcdn.com/image/fetch/$s_!JTb0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd89dfdf3-c24b-43f2-88b9-23d329a572f1_2752x1536.heic 848w, https://substackcdn.com/image/fetch/$s_!JTb0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd89dfdf3-c24b-43f2-88b9-23d329a572f1_2752x1536.heic 1272w, https://substackcdn.com/image/fetch/$s_!JTb0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd89dfdf3-c24b-43f2-88b9-23d329a572f1_2752x1536.heic 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!JTb0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd89dfdf3-c24b-43f2-88b9-23d329a572f1_2752x1536.heic" width="1456" height="813" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d89dfdf3-c24b-43f2-88b9-23d329a572f1_2752x1536.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:813,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:215122,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/194372845?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd89dfdf3-c24b-43f2-88b9-23d329a572f1_2752x1536.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!JTb0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd89dfdf3-c24b-43f2-88b9-23d329a572f1_2752x1536.heic 424w, https://substackcdn.com/image/fetch/$s_!JTb0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd89dfdf3-c24b-43f2-88b9-23d329a572f1_2752x1536.heic 848w, https://substackcdn.com/image/fetch/$s_!JTb0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd89dfdf3-c24b-43f2-88b9-23d329a572f1_2752x1536.heic 1272w, 
https://substackcdn.com/image/fetch/$s_!JTb0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd89dfdf3-c24b-43f2-88b9-23d329a572f1_2752x1536.heic 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><em>The bimanual SO-101 setup with three camera views.</em></p><p>The hardware is deliberately simple and accessible:</p><p>- <strong>Two SO-101 robot arms</strong> &#8212; open-source, low-cost, 6-DOF manipulators (~$300 each)</p><p>- <strong>Three cameras</strong> &#8212; left wrist, right wrist, and an overhead webcam (all 480x640, synced at 30 
FPS)</p><p>- <strong>Action space</strong> &#8212; 12 dimensions (6 joint positions per arm)</p><p>I collected <strong>50 teleoperation demonstrations</strong> of myself folding a cloth, totaling ~48,000 synchronized frames stored in LeRobot v3.0 format. Each demonstration captures the full pipeline: reach for the cloth edge, fold it over, press down, push it aside.</p><p>Why 50? That&#8217;s enough to cover the natural variation in cloth placement and draping, while staying practical for a single-person data collection session (~2 hours).</p><p><strong>How SmolVLA Works</strong></p><p>SmolVLA is a Vision-Language-Action model &#8212; it takes in camera images and a language instruction, and directly outputs motor commands. Here&#8217;s what makes its architecture interesting:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!usfV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa142eeda-a66f-4075-be5e-57fceea0b18b_2752x1536.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!usfV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa142eeda-a66f-4075-be5e-57fceea0b18b_2752x1536.heic 424w, https://substackcdn.com/image/fetch/$s_!usfV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa142eeda-a66f-4075-be5e-57fceea0b18b_2752x1536.heic 848w, https://substackcdn.com/image/fetch/$s_!usfV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa142eeda-a66f-4075-be5e-57fceea0b18b_2752x1536.heic 1272w, 
https://substackcdn.com/image/fetch/$s_!usfV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa142eeda-a66f-4075-be5e-57fceea0b18b_2752x1536.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!usfV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa142eeda-a66f-4075-be5e-57fceea0b18b_2752x1536.heic" width="1456" height="813" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a142eeda-a66f-4075-be5e-57fceea0b18b_2752x1536.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:813,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:246908,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/194372845?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa142eeda-a66f-4075-be5e-57fceea0b18b_2752x1536.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!usfV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa142eeda-a66f-4075-be5e-57fceea0b18b_2752x1536.heic 424w, https://substackcdn.com/image/fetch/$s_!usfV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa142eeda-a66f-4075-be5e-57fceea0b18b_2752x1536.heic 848w, 
https://substackcdn.com/image/fetch/$s_!usfV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa142eeda-a66f-4075-be5e-57fceea0b18b_2752x1536.heic 1272w, https://substackcdn.com/image/fetch/$s_!usfV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa142eeda-a66f-4075-be5e-57fceea0b18b_2752x1536.heic 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em>SmolVLA architecture: visual tokens from SigLIP are compressed via PixelShuffle and fused with language tokens in a VLM 
backbone. The action expert generates chunked trajectories via flow matching.</em></p><p><strong>Vision: See the Cloth</strong></p><p>A <strong>SigLIP</strong> vision encoder processes each camera frame into tokens. The key trick is <strong>PixelShuffle</strong> compression &#8212; 1,024 visual tokens per frame get compressed down to just 64, a roughly 94% reduction. With three cameras, that&#8217;s 192 visual tokens instead of 3,072. This makes the model tractable to train and to run in real time.</p><p><strong>Language: Understand the Task</strong></p><p>The instruction &#8220;Fold the cloth&#8221; gets tokenized and processed alongside the visual tokens through a <strong>SmolVLM2</strong> language backbone. In a single-task setting like ours, this might seem unnecessary &#8212; but it means the same model architecture could handle &#8220;Fold the cloth in half&#8221; vs. &#8220;Roll the cloth up&#8221; without retraining the vision pipeline.</p><p><strong>Action: Move the Arms</strong></p><p>This is where SmolVLA diverges from standard VLMs. 
Instead of predicting text tokens, it uses a dedicated <strong>action expert</strong> that generates &#8220;action chunks&#8221; &#8212; sequences of future joint positions &#8212; using <strong>flow matching</strong>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hmv0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e135e66-4dc6-41e1-91a5-d33312e95887_2752x1536.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hmv0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e135e66-4dc6-41e1-91a5-d33312e95887_2752x1536.heic 424w, https://substackcdn.com/image/fetch/$s_!hmv0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e135e66-4dc6-41e1-91a5-d33312e95887_2752x1536.heic 848w, https://substackcdn.com/image/fetch/$s_!hmv0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e135e66-4dc6-41e1-91a5-d33312e95887_2752x1536.heic 1272w, https://substackcdn.com/image/fetch/$s_!hmv0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e135e66-4dc6-41e1-91a5-d33312e95887_2752x1536.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hmv0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e135e66-4dc6-41e1-91a5-d33312e95887_2752x1536.heic" width="1456" height="813" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8e135e66-4dc6-41e1-91a5-d33312e95887_2752x1536.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:813,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:267530,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/194372845?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e135e66-4dc6-41e1-91a5-d33312e95887_2752x1536.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!hmv0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e135e66-4dc6-41e1-91a5-d33312e95887_2752x1536.heic 424w, https://substackcdn.com/image/fetch/$s_!hmv0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e135e66-4dc6-41e1-91a5-d33312e95887_2752x1536.heic 848w, https://substackcdn.com/image/fetch/$s_!hmv0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e135e66-4dc6-41e1-91a5-d33312e95887_2752x1536.heic 1272w, https://substackcdn.com/image/fetch/$s_!hmv0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8e135e66-4dc6-41e1-91a5-d33312e95887_2752x1536.heic 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em>Flow matching training: noise is added to ground-truth trajectories, and the model learns to predict the velocity field that denoises back to the clean trajectory.</em></p><p>Think of flow matching like this: during training, you take a real trajectory and add noise to it. The model learns to predict the &#8220;velocity field&#8221; &#8212; the direction to push each noisy point to recover the clean trajectory. At inference, you start from pure noise and iteratively apply the learned velocity field to generate a smooth, coherent action sequence.</p><p>Why action chunks instead of single-step predictions? Because predicting one timestep at a time compounds errors quickly &#8212; the robot drifts. 
By predicting a chunk of ~50 future timesteps at once, the motion stays coherent and smooth.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!f--O!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2361db62-115e-4a74-9c2c-a54df8d935a9_2752x1536.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!f--O!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2361db62-115e-4a74-9c2c-a54df8d935a9_2752x1536.heic 424w, https://substackcdn.com/image/fetch/$s_!f--O!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2361db62-115e-4a74-9c2c-a54df8d935a9_2752x1536.heic 848w, https://substackcdn.com/image/fetch/$s_!f--O!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2361db62-115e-4a74-9c2c-a54df8d935a9_2752x1536.heic 1272w, https://substackcdn.com/image/fetch/$s_!f--O!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2361db62-115e-4a74-9c2c-a54df8d935a9_2752x1536.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!f--O!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2361db62-115e-4a74-9c2c-a54df8d935a9_2752x1536.heic" width="1456" height="813" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2361db62-115e-4a74-9c2c-a54df8d935a9_2752x1536.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:813,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:312940,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/194372845?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2361db62-115e-4a74-9c2c-a54df8d935a9_2752x1536.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!f--O!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2361db62-115e-4a74-9c2c-a54df8d935a9_2752x1536.heic 424w, https://substackcdn.com/image/fetch/$s_!f--O!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2361db62-115e-4a74-9c2c-a54df8d935a9_2752x1536.heic 848w, https://substackcdn.com/image/fetch/$s_!f--O!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2361db62-115e-4a74-9c2c-a54df8d935a9_2752x1536.heic 1272w, https://substackcdn.com/image/fetch/$s_!f--O!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2361db62-115e-4a74-9c2c-a54df8d935a9_2752x1536.heic 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em>The action expert uses interleaved cross-attention (grounding in visual context) and self-attention (temporal coherence across the action chunk).</em></p><p><strong>Training: Pretraining Is the Secret Weapon</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gLZY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10319e87-8b18-4687-a76e-5eb124fac311_2752x1536.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!gLZY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10319e87-8b18-4687-a76e-5eb124fac311_2752x1536.heic 424w, https://substackcdn.com/image/fetch/$s_!gLZY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10319e87-8b18-4687-a76e-5eb124fac311_2752x1536.heic 848w, https://substackcdn.com/image/fetch/$s_!gLZY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10319e87-8b18-4687-a76e-5eb124fac311_2752x1536.heic 1272w, https://substackcdn.com/image/fetch/$s_!gLZY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10319e87-8b18-4687-a76e-5eb124fac311_2752x1536.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gLZY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10319e87-8b18-4687-a76e-5eb124fac311_2752x1536.heic" width="1456" height="813" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/10319e87-8b18-4687-a76e-5eb124fac311_2752x1536.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:813,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:211487,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/194372845?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10319e87-8b18-4687-a76e-5eb124fac311_2752x1536.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!gLZY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10319e87-8b18-4687-a76e-5eb124fac311_2752x1536.heic 424w, https://substackcdn.com/image/fetch/$s_!gLZY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10319e87-8b18-4687-a76e-5eb124fac311_2752x1536.heic 848w, https://substackcdn.com/image/fetch/$s_!gLZY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10319e87-8b18-4687-a76e-5eb124fac311_2752x1536.heic 1272w, https://substackcdn.com/image/fetch/$s_!gLZY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10319e87-8b18-4687-a76e-5eb124fac311_2752x1536.heic 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em>Training pipeline &#8212; pretrained on 30K community episodes, then fine-tuned on 50 cloth-folding demonstrations.</em></p><p>SmolVLA&#8217;s base model (<code>lerobot/smolvla_base</code>) was pretrained on <strong>~30,000 episodes</strong> from LeRobot community datasets &#8212; various robots doing various manipulation tasks. This broad exposure gives the model prior knowledge about how objects move, how grippers interact with surfaces, and how to coordinate motor commands.</p><p>Fine-tuning on our 50 cloth-folding episodes took <strong>~2 hours 49 minutes on a single A100 GPU</strong> for 10,000 steps. The final training loss was 0.021.</p><p>How much does pretraining matter? 
According to the SmolVLA paper, the gap is dramatic:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gD8x!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4feeaa8b-1ebd-448c-8327-1371539048d0_1376x768.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gD8x!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4feeaa8b-1ebd-448c-8327-1371539048d0_1376x768.heic 424w, https://substackcdn.com/image/fetch/$s_!gD8x!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4feeaa8b-1ebd-448c-8327-1371539048d0_1376x768.heic 848w, https://substackcdn.com/image/fetch/$s_!gD8x!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4feeaa8b-1ebd-448c-8327-1371539048d0_1376x768.heic 1272w, https://substackcdn.com/image/fetch/$s_!gD8x!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4feeaa8b-1ebd-448c-8327-1371539048d0_1376x768.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gD8x!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4feeaa8b-1ebd-448c-8327-1371539048d0_1376x768.heic" width="1376" height="768" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4feeaa8b-1ebd-448c-8327-1371539048d0_1376x768.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1376,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:44163,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/194372845?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4feeaa8b-1ebd-448c-8327-1371539048d0_1376x768.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!gD8x!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4feeaa8b-1ebd-448c-8327-1371539048d0_1376x768.heic 424w, https://substackcdn.com/image/fetch/$s_!gD8x!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4feeaa8b-1ebd-448c-8327-1371539048d0_1376x768.heic 848w, https://substackcdn.com/image/fetch/$s_!gD8x!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4feeaa8b-1ebd-448c-8327-1371539048d0_1376x768.heic 1272w, https://substackcdn.com/image/fetch/$s_!gD8x!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4feeaa8b-1ebd-448c-8327-1371539048d0_1376x768.heic 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em>Without pretraining: 51.7% success. With pretraining: 78.3% success. An absolute gain of 26.6 percentage points from pretraining alone.</em></p><p>The model doesn&#8217;t just learn faster &#8212; it learns <em>better</em>, because it already has a foundation of sensorimotor knowledge to build on.</p><p><strong>The Benchmark: SmolVLA vs.
ACT</strong></p><p>To put SmolVLA&#8217;s performance in context, I trained <strong>ACT (Action Chunking with Transformers) </strong>on the exact same 50 demonstrations and evaluated both models side-by-side.</p><p>ACT is a fundamentally different philosophy:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Rvyb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd057be3-317c-45cd-a9e8-afcf1dc76525_1376x768.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Rvyb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd057be3-317c-45cd-a9e8-afcf1dc76525_1376x768.heic 424w, https://substackcdn.com/image/fetch/$s_!Rvyb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd057be3-317c-45cd-a9e8-afcf1dc76525_1376x768.heic 848w, https://substackcdn.com/image/fetch/$s_!Rvyb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd057be3-317c-45cd-a9e8-afcf1dc76525_1376x768.heic 1272w, https://substackcdn.com/image/fetch/$s_!Rvyb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd057be3-317c-45cd-a9e8-afcf1dc76525_1376x768.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Rvyb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd057be3-317c-45cd-a9e8-afcf1dc76525_1376x768.heic" width="1376" height="768" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cd057be3-317c-45cd-a9e8-afcf1dc76525_1376x768.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1376,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:87541,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/194372845?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd057be3-317c-45cd-a9e8-afcf1dc76525_1376x768.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Rvyb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd057be3-317c-45cd-a9e8-afcf1dc76525_1376x768.heic 424w, https://substackcdn.com/image/fetch/$s_!Rvyb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd057be3-317c-45cd-a9e8-afcf1dc76525_1376x768.heic 848w, https://substackcdn.com/image/fetch/$s_!Rvyb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd057be3-317c-45cd-a9e8-afcf1dc76525_1376x768.heic 1272w, https://substackcdn.com/image/fetch/$s_!Rvyb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd057be3-317c-45cd-a9e8-afcf1dc76525_1376x768.heic 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em>SmolVLA: the &#8220;Generalist adapted to a specialty&#8221; &#8212; 450M parameters, pretrained on 30K episodes, language-conditioned, flow matching.
ACT: the &#8220;Specialist&#8221; &#8212; 52M parameters, trained from scratch, vision-only, Conditional VAE.</em></p><p>The question: does all that extra capacity and pretraining actually help on a single concrete task?</p><p><strong>Results: The Numbers Tell a Clear Story</strong></p><p>I evaluated SmolVLA on 7 episodes and ACT on 10 episodes, scoring each on three criteria (0-1 each, max 3.0 total):</p><p>- <strong>R1 &#8212; Fold Initiation:</strong> One arm clamps, the other folds over</p><p>- <strong>R2 &#8212; Fold Completion:</strong> Press down, tighten the fold</p><p>- <strong>R3 &#8212; Push Aside:</strong> Move the folded cloth to clear the workspace</p><p><strong>The Headline Numbers</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cyiY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e064ff5-be5e-41dd-8599-099b81cfb24d_1376x768.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cyiY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e064ff5-be5e-41dd-8599-099b81cfb24d_1376x768.heic 424w, https://substackcdn.com/image/fetch/$s_!cyiY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e064ff5-be5e-41dd-8599-099b81cfb24d_1376x768.heic 848w, https://substackcdn.com/image/fetch/$s_!cyiY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e064ff5-be5e-41dd-8599-099b81cfb24d_1376x768.heic 1272w,
https://substackcdn.com/image/fetch/$s_!cyiY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e064ff5-be5e-41dd-8599-099b81cfb24d_1376x768.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cyiY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e064ff5-be5e-41dd-8599-099b81cfb24d_1376x768.heic" width="1376" height="768" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5e064ff5-be5e-41dd-8599-099b81cfb24d_1376x768.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1376,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:70819,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/194372845?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e064ff5-be5e-41dd-8599-099b81cfb24d_1376x768.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!cyiY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e064ff5-be5e-41dd-8599-099b81cfb24d_1376x768.heic 424w, https://substackcdn.com/image/fetch/$s_!cyiY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e064ff5-be5e-41dd-8599-099b81cfb24d_1376x768.heic 848w, 
https://substackcdn.com/image/fetch/$s_!cyiY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e064ff5-be5e-41dd-8599-099b81cfb24d_1376x768.heic 1272w, https://substackcdn.com/image/fetch/$s_!cyiY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e064ff5-be5e-41dd-8599-099b81cfb24d_1376x768.heic 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em>Head-to-head comparison: SmolVLA scores 2.27/3.0 mean total reward vs ACT&#8217;s 0.63/3.0 &#8212; a 260% improvement.
SmolVLA achieves an 86% partial success rate vs ACT&#8217;s 20%.</em></p><p><strong>SmolVLA outperforms ACT by 260% on mean total reward.</strong></p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;a6c77e43-9588-45f8-9981-70081ae803f3&quot;,&quot;duration&quot;:null}"></div><p><em>ACT&#8217;s best episode (1.80/3.00) &#8212; it manages a fold and push, but this level of performance was the exception, not the norm.</em></p><p><strong>What ACT Gets Wrong</strong></p><p>ACT&#8217;s biggest failure mode is <strong>inaction</strong> &#8212; in 5 out of 10 episodes, the robot simply didn&#8217;t move. The arms positioned themselves near the cloth but never executed a folding motion. When ACT does act, the fold initiation is actually decent (R1 of 0.70-0.80), but it almost never completes the full pipeline.</p><p><strong>What SmolVLA Gets Right</strong></p><p>SmolVLA is remarkably <strong>consistent</strong>. In 6 out of 7 episodes, it initiates the fold, completes it to a reasonable degree, and pushes the cloth aside.
The scores cluster tightly between 1.80 and 2.80 &#8212; it&#8217;s not that SmolVLA occasionally nails it, it&#8217;s that it reliably executes the full task pipeline.</p><p><strong>Per-Episode Breakdown</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!BBwg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd651e498-5e70-4379-b5d1-fbdcb44287ee_1376x768.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!BBwg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd651e498-5e70-4379-b5d1-fbdcb44287ee_1376x768.heic 424w, https://substackcdn.com/image/fetch/$s_!BBwg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd651e498-5e70-4379-b5d1-fbdcb44287ee_1376x768.heic 848w, https://substackcdn.com/image/fetch/$s_!BBwg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd651e498-5e70-4379-b5d1-fbdcb44287ee_1376x768.heic 1272w, https://substackcdn.com/image/fetch/$s_!BBwg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd651e498-5e70-4379-b5d1-fbdcb44287ee_1376x768.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!BBwg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd651e498-5e70-4379-b5d1-fbdcb44287ee_1376x768.heic" width="1376" height="768" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d651e498-5e70-4379-b5d1-fbdcb44287ee_1376x768.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1376,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:115094,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/194372845?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd651e498-5e70-4379-b5d1-fbdcb44287ee_1376x768.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!BBwg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd651e498-5e70-4379-b5d1-fbdcb44287ee_1376x768.heic 424w, https://substackcdn.com/image/fetch/$s_!BBwg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd651e498-5e70-4379-b5d1-fbdcb44287ee_1376x768.heic 848w, https://substackcdn.com/image/fetch/$s_!BBwg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd651e498-5e70-4379-b5d1-fbdcb44287ee_1376x768.heic 1272w, https://substackcdn.com/image/fetch/$s_!BBwg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd651e498-5e70-4379-b5d1-fbdcb44287ee_1376x768.heic 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em>Per-episode performance heatmap. Left: ACT shows sparse activation &#8212; 50% of episodes had zero motion (white rows). Right: SmolVLA shows dense, consistent high scores across all episodes. The contrast visually captures SmolVLA&#8217;s reliability vs ACT&#8217;s bimodal behavior.</em></p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;a73c707a-a01c-4dc5-a5e8-0696b2ddd59c&quot;,&quot;duration&quot;:null}"></div><p><em>Consistency across episodes &#8212; SmolVLA follows the same fold-and-push strategy with minor variations.</em></p><div><hr></div><p><strong>Why the Gap Is So Large</strong></p><p>Three factors explain the 260% performance difference:</p><p><strong>1.
Pretraining Provides Motor Priors</strong></p><p>SmolVLA doesn&#8217;t learn bimanual coordination from scratch. Its base model has seen 30,000 episodes of various manipulation tasks. It already &#8220;knows&#8221; how to coordinate two end-effectors, how to approach objects, and how to execute smooth trajectories. Fine-tuning on 50 episodes refines this knowledge for cloth specifically.</p><p>ACT starts from random initialization. 50 episodes and 100,000 gradient steps simply aren&#8217;t enough for a from-scratch model to reliably learn when to move and when not to. Hence the 50% inaction rate.</p><p><strong>2. Flow Matching Produces Smoother Trajectories</strong></p><p>SmolVLA&#8217;s flow matching generates trajectories by iterative denoising &#8212; the output is inherently smooth and temporally coherent. ACT uses a Conditional VAE, which can produce jerkier, less coordinated motions, especially in the bimanual setting where both arms need to synchronize.</p><p><strong>3. Multi-Camera Fusion</strong></p><p>Both models use the same three cameras, but SmolVLA&#8217;s SigLIP encoder (pretrained on internet-scale image data) extracts richer features from each view. 
ACT&#8217;s ResNet-18 (trained from scratch) has to learn visual representations simultaneously with motor policies &#8212; a much harder optimization problem.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pkvN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec8dc03c-b7e1-4796-9cf1-6033efae549b_1376x768.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pkvN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec8dc03c-b7e1-4796-9cf1-6033efae549b_1376x768.heic 424w, https://substackcdn.com/image/fetch/$s_!pkvN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec8dc03c-b7e1-4796-9cf1-6033efae549b_1376x768.heic 848w, https://substackcdn.com/image/fetch/$s_!pkvN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec8dc03c-b7e1-4796-9cf1-6033efae549b_1376x768.heic 1272w, https://substackcdn.com/image/fetch/$s_!pkvN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec8dc03c-b7e1-4796-9cf1-6033efae549b_1376x768.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pkvN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec8dc03c-b7e1-4796-9cf1-6033efae549b_1376x768.heic" width="1376" height="768" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ec8dc03c-b7e1-4796-9cf1-6033efae549b_1376x768.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1376,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:125909,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/194372845?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec8dc03c-b7e1-4796-9cf1-6033efae549b_1376x768.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!pkvN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec8dc03c-b7e1-4796-9cf1-6033efae549b_1376x768.heic 424w, https://substackcdn.com/image/fetch/$s_!pkvN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec8dc03c-b7e1-4796-9cf1-6033efae549b_1376x768.heic 848w, https://substackcdn.com/image/fetch/$s_!pkvN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec8dc03c-b7e1-4796-9cf1-6033efae549b_1376x768.heic 1272w, https://substackcdn.com/image/fetch/$s_!pkvN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec8dc03c-b7e1-4796-9cf1-6033efae549b_1376x768.heic 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em>Architectural comparison: ACT&#8217;s simpler pipeline (ResNet-18 + CVAE) vs SmolVLA&#8217;s richer pipeline (SigLIP + VLM backbone + Flow Matching action expert). Key differences highlighted: pretrained vs from-scratch vision, language conditioning, and flow matching vs CVAE.</em></p><div><hr></div><p><strong>Evaluation Method: Using an LLM as a Judge</strong></p><p>One practical contribution worth highlighting: I used <strong>Claude Sonnet 4</strong> as an automated evaluation judge. For each episode, the judge received 8-12 sampled frames and scored R1, R2, and R3 against the rubric above.</p><p>This mostly worked well, with one systematic bias: <strong>the VLM judge scored R3 (push aside) too low</strong>. From the overhead webcam angle, lateral displacement of the cloth was hard for the vision model to detect.
Manual frame-by-frame verification corrected this, raising SmolVLA&#8217;s R3 score from 0.30 to 0.81.</p><p>Lesson: VLM-based evaluation is a useful automation, but <strong>always validate on a sample</strong>, especially for spatial reasoning tasks where camera perspective matters.</p><p><strong>What I&#8217;d Do Differently</strong></p><p><strong>More demonstrations.</strong> 50 episodes is the minimum viable dataset. The fold completion scores (R2) are the weakest link &#8212; the model struggles with the fine-grained pressing motion. More demos with variation in fold tightness would likely help.</p><p><strong>More fabric variety.</strong> All 50 demos used the same cloth. Thin, slippery fabrics would likely cause failures. Mixing fabric types during data collection would likely improve robustness.</p><p><strong>Recovery behavior.</strong> Neither model recovers after a missed grasp. The training data contains only successful demonstrations, so the policy has no &#8220;what to do when things go wrong&#8221; knowledge. Adding a few recovery demonstrations or using RL fine-tuning could address this.</p><p><strong>Episode-level evaluation at scale.</strong> Samples of 7 and 10 episodes give a directional signal but not statistical significance. A proper evaluation would run 50+ episodes per model.</p><div><hr></div><p><strong>Key Takeaways</strong></p><p>1. <strong>Pretraining dominates.</strong> The single biggest factor in SmolVLA&#8217;s success is its pretrained base. 30,000 episodes of prior experience make 50 task-specific demonstrations go much further.</p><p>2. <strong>Compact VLAs are practical.</strong> A 450M-parameter model runs on consumer hardware. You don&#8217;t need billion-parameter models for tabletop manipulation.</p><p>3. <strong>50 demonstrations is enough to get started.</strong> Not enough for production reliability, but enough to demonstrate clear task competence with a pretrained model.</p><p>4.
<strong>ACT is not the right baseline for bimanual deformable manipulation.</strong> ACT shines on rigid-object tasks with clear visual cues. Bimanual cloth folding exposes its limitations &#8212; particularly the lack of pretraining and the weaker visual backbone.</p><p>5. <strong>The LeRobot ecosystem is a force multiplier.</strong> Standardized data formats, community pretrained models, and plug-and-play training scripts dramatically reduce the engineering burden. This entire project &#8212; data collection, training, evaluation &#8212; was done by one person.</p><div><hr></div><p><strong>Resources and Reproducibility</strong></p><p>Everything needed to reproduce this work is open:</p><p>- <strong>Training dataset:</strong> <a href="https://huggingface.co/datasets/RajatDandekar/so101_bimanual_cloth_fold">Dataset Link</a> (50 episodes)</p><p>- <strong>SmolVLA policy:</strong> <a href="https://huggingface.co/RajatDandekar/smolvla_bimanual_cloth_fold">Fine-tuned SmolVLA Model</a></p><p>- <strong>ACT policy:</strong> <a href="https://huggingface.co/RajatDandekar/act_bimanual_cloth_fold">Trained ACT Model</a></p><p>- <strong>Evaluation data (SmolVLA):</strong> <a href="https://huggingface.co/datasets/RajatDandekar/eval_smolvla_cloth_fold_10ep">SmolVLA Evaluation Set</a></p><p>- <strong>Evaluation data (ACT):</strong> <a href="https://huggingface.co/datasets/RajatDandekar/eval_act_cloth_fold_10ep">ACT Evaluation Set</a></p><p>- <strong>Base model:</strong> <a href="https://huggingface.co/lerobot/smolvla_base">https://huggingface.co/lerobot/smolvla_base</a></p><p>- <strong>Framework:</strong> <a href="https://github.com/huggingface/lerobot">https://github.com/huggingface/lerobot</a></p><p><em>If you&#8217;re working on manipulation with open-source robots or experimenting with VLAs, I&#8217;d love to hear about it.
Drop a comment or reach out &#8212; this space moves fast and I think we&#8217;re just getting started.</em></p><p><em>Our live bootcamp on VLA and World Models: </em><a href="https://robotlearningmastery.vizuara.ai">https://robotlearningmastery.vizuara.ai</a></p><p><em>Minor in Robotics:</em></p><p><a href="https://minor-robotics.vizuara.ai">https://minor-robotics.vizuara.ai</a></p>]]></content:encoded></item><item><title><![CDATA[Knowledge Distillation & Data-Efficient Image Transformers]]></title><description><![CDATA[The original ViT needed 300 million images to outperform CNNs. 
DeiT does it with just 1.2 million, by letting a CNN teacher distill its visual intuitions into a transformer student.]]></description><link>https://www.vizuaranewsletter.com/p/knowledge-distillation-and-data-efficient</link><guid isPermaLink="false">https://www.vizuaranewsletter.com/p/knowledge-distillation-and-data-efficient</guid><dc:creator><![CDATA[Mayank Pratap Singh]]></dc:creator><pubDate>Mon, 13 Apr 2026 04:15:33 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!RRAx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4006e14-8cfb-4957-89ae-21974e3ed874_1920x1080.gif" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RRAx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4006e14-8cfb-4957-89ae-21974e3ed874_1920x1080.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RRAx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4006e14-8cfb-4957-89ae-21974e3ed874_1920x1080.gif 424w, https://substackcdn.com/image/fetch/$s_!RRAx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4006e14-8cfb-4957-89ae-21974e3ed874_1920x1080.gif 848w, https://substackcdn.com/image/fetch/$s_!RRAx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4006e14-8cfb-4957-89ae-21974e3ed874_1920x1080.gif 1272w, 
https://substackcdn.com/image/fetch/$s_!RRAx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4006e14-8cfb-4957-89ae-21974e3ed874_1920x1080.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RRAx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4006e14-8cfb-4957-89ae-21974e3ed874_1920x1080.gif" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a4006e14-8cfb-4957-89ae-21974e3ed874_1920x1080.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:587320,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/192406256?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4006e14-8cfb-4957-89ae-21974e3ed874_1920x1080.gif&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!RRAx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4006e14-8cfb-4957-89ae-21974e3ed874_1920x1080.gif 424w, https://substackcdn.com/image/fetch/$s_!RRAx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4006e14-8cfb-4957-89ae-21974e3ed874_1920x1080.gif 848w, 
https://substackcdn.com/image/fetch/$s_!RRAx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4006e14-8cfb-4957-89ae-21974e3ed874_1920x1080.gif 1272w, https://substackcdn.com/image/fetch/$s_!RRAx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4006e14-8cfb-4957-89ae-21974e3ed874_1920x1080.gif 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>This chapter covers</strong></p><ul><li><p>Understanding inductive biases in CNNs and why they matter for image 
understanding</p></li><li><p>Why Vision Transformers are data-hungry and computationally expensive without these biases</p></li><li><p>How DeiT (Data-efficient Image Transformers) achieves competitive accuracy using only ImageNet-1K</p></li><li><p>The mechanics of knowledge distillation, from soft labels to dark knowledge</p></li><li><p>The mathematics behind temperature scaling, KL divergence, and DeiT&#8217;s loss functions</p></li><li><p>Building a complete DeiT implementation from scratch in PyTorch</p></li></ul><p><strong>The Data-efficient Image Transformer (DeiT) code is available below</strong></p><p><a href="https://github.com/VizuaraAI/Transformers-for-vision-BOOK">https://github.com/VizuaraAI/Transformers-for-vision-BOOK</a></p><p><strong>[Don't forget to star the code repo!]</strong></p><p>In the previous blog, we explored how Vision Transformers (ViTs) brought the power of self-attention to computer vision by splitting images into patches and processing them as sequences. The results were remarkable: ViT achieved state-of-the-art accuracy on ImageNet. But there was a catch. That performance came at an enormous cost. 
The original ViT required pre-training on JFT-300M, a private Google dataset containing 300 million labeled images, and demanded thousands of TPU-days of compute. For most researchers and practitioners, this was simply out of reach.</p><p>In this blog, we will explore how DeiT (Data-efficient Image Transformers) solved this problem. <a href="https://arxiv.org/abs/2012.12877">Published by Touvron et al. from Facebook AI Research and Sorbonne University in 2021</a>, DeiT achieved competitive accuracy using <em>only</em> ImageNet-1K (1.2 million images), which is 250 times less data than the original ViT required. The key insight was combining a clever training recipe with <em>knowledge distillation</em>, a technique where a student model learns from a pre-trained teacher model (typically a larger one). But before we dive into DeiT itself, we need to understand <em>why</em> Vision Transformers struggle with limited data in the first place. The answer lies in a concept called <em>inductive bias</em>.</p><h2><strong>Prerequisite blogs</strong></h2><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;f5387b42-1901-41c4-8a40-1461109aa48c&quot;,&quot;caption&quot;:&quot;The Transformer Architecture&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;lg&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;The Transformers&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:321073573,&quot;name&quot;:&quot;Mayank Pratap Singh&quot;,&quot;bio&quot;:&quot;I&#8217;m Mayank, an AI and LLM enthusiast, and this is my space to breaking down complex concepts into accessible 
insights&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/41e709c7-9629-4748-93e5-e3014c16fb57_460x460.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2026-03-17T03:32:41.080Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/$s_!Igi_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcae88335-48f0-4ab3-baef-790df9e6f2ed_1920x1080.gif&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://www.vizuaranewsletter.com/p/the-transformers&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:190611987,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:149,&quot;comment_count&quot;:5,&quot;publication_id&quot;:3591997,&quot;publication_name&quot;:&quot;Vizuara&#8217;s AI Newsletter&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!f_Wd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59481c77-2230-4e2f-b78b-88e9ee5fe9d9_1088x1088.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;8c5fca0f-fb69-43a5-bc6e-c1ab957b2951&quot;,&quot;caption&quot;:&quot;Table of Contents&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;lg&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Vision Transformers&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:321073573,&quot;name&quot;:&quot;Mayank Pratap Singh&quot;,&quot;bio&quot;:&quot;I&#8217;m Mayank, an AI and LLM enthusiast, and this is my space to breaking down complex concepts into accessible 
insights&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/41e709c7-9629-4748-93e5-e3014c16fb57_460x460.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2026-03-22T04:10:41.659Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5d450573-3b45-4cec-bcd5-3ef3125044d4_1200x640.gif&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://www.vizuaranewsletter.com/p/vision-transformers&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:181494472,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:61,&quot;comment_count&quot;:0,&quot;publication_id&quot;:3591997,&quot;publication_name&quot;:&quot;Vizuara&#8217;s AI Newsletter&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!f_Wd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59481c77-2230-4e2f-b78b-88e9ee5fe9d9_1088x1088.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><h1>1.1 <em>Inductive bias: the built-in assumptions of neural networks</em></h1><p>When we train a neural network, we are asking it to learn patterns from data. But no model starts from a completely blank slate. Every architecture carries certain built-in assumptions about how data is structured, and these assumptions shape what the model finds easy or difficult to learn. These built-in assumptions are called <em>inductive biases</em>.</p><p>Think of it this way: if you were asked to find a lost cat in a neighborhood, you would naturally check under porches, in gardens, and near food sources. You would not start by searching the sky or underwater. Your prior knowledge about where cats tend to be gives you a huge advantage. You are not starting from scratch. 
Inductive bias works the same way for neural networks: it encodes structural assumptions that guide learning in the right direction, reducing the amount of data needed to reach good performance.</p><h3>What are the inductive biases in CNNs?</h3><p>Convolutional Neural Networks (CNNs) have two powerful inductive biases built directly into their architecture: <em>locality</em> and <em>translation equivariance</em>. Let us examine each in detail.</p><p><strong>Locality bias.</strong> CNNs assume that nearby pixels are more related to each other than distant pixels. This assumption is enforced by the small convolutional filters (typically 3x3 or 5x5) that examine only a local patch of the image at each step. The network processes the image by looking at small neighborhoods first, detecting low-level features like edges and textures, and then gradually combining these local features into higher-level concepts through deeper layers. An edge detector in the first layer might find horizontal lines. The next layer combines several edges into a corner or a curve. Deeper layers assemble these parts into recognizable shapes like eyes, wheels, or letters.</p><p>This hierarchical, local-to-global processing mirrors how natural images are actually structured. In a photograph, a pixel showing part of a dog&#8217;s ear is highly correlated with neighboring pixels that also show the ear, but has little direct relationship with a pixel in the far corner showing a blade of grass. 
CNNs exploit this statistical property by design.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tuM3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7fd60e0-fad4-401c-ae4d-67c15d4e5cd2_1808x2028.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tuM3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7fd60e0-fad4-401c-ae4d-67c15d4e5cd2_1808x2028.png 424w, https://substackcdn.com/image/fetch/$s_!tuM3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7fd60e0-fad4-401c-ae4d-67c15d4e5cd2_1808x2028.png 848w, https://substackcdn.com/image/fetch/$s_!tuM3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7fd60e0-fad4-401c-ae4d-67c15d4e5cd2_1808x2028.png 1272w, https://substackcdn.com/image/fetch/$s_!tuM3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7fd60e0-fad4-401c-ae4d-67c15d4e5cd2_1808x2028.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tuM3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7fd60e0-fad4-401c-ae4d-67c15d4e5cd2_1808x2028.png" width="1456" height="1633" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a7fd60e0-fad4-401c-ae4d-67c15d4e5cd2_1808x2028.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1633,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:811259,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/192406256?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7fd60e0-fad4-401c-ae4d-67c15d4e5cd2_1808x2028.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!tuM3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7fd60e0-fad4-401c-ae4d-67c15d4e5cd2_1808x2028.png 424w, https://substackcdn.com/image/fetch/$s_!tuM3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7fd60e0-fad4-401c-ae4d-67c15d4e5cd2_1808x2028.png 848w, https://substackcdn.com/image/fetch/$s_!tuM3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7fd60e0-fad4-401c-ae4d-67c15d4e5cd2_1808x2028.png 1272w, https://substackcdn.com/image/fetch/$s_!tuM3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7fd60e0-fad4-401c-ae4d-67c15d4e5cd2_1808x2028.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em><strong>Figure 1.1:</strong> The effect of inductive bias on translation handling. Top: a cat is placed at four different positions within the input grid. A translation invariant model (CNN) produces the same logits every time, correctly recognizing the cat regardless of where it appears. This is because weight sharing and pooling are built into the architecture. Bottom: the same cat at different positions is passed through a model sensitive to translation (such as a ViT trained on limited data), which produces different logits for each position. The CNN does not need to learn that position should not affect the classification; this property is baked into its design. 
A ViT must discover this from data, which is why it requires far more training examples.</em></p><p><strong>Translation equivariance.</strong></p><p>The second key inductive bias is <em>translation equivariance</em>: a CNN detects the same feature regardless of where it appears in the image. This property comes from <em>weight sharing</em>, where the same convolutional filter is applied identically at every spatial position. Mathematically, if <em>f</em> is a convolution operation and <em>T</em> is a spatial translation, then:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;f(T(\\mathbf{x})) = T(f(\\mathbf{x}))&quot;,&quot;id&quot;:&quot;VLULYQKCAO&quot;}" data-component-name="LatexBlockToDOM"></div><p>This means that if an object shifts position in the input, the corresponding feature map shifts by the same amount (exactly so for a stride-1 convolution, ignoring border effects). A cat-ear detector will fire whether the ear appears in the top-left corner or the bottom-right corner of the image. Figure 1.2 shows two concrete examples: the same convolutional kernel slides over the entire input and detects the same pattern wherever it occurs, whether that is a left edge in a grid of circles or a car on a road versus on a building rooftop.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!SlnS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67b8692f-7da1-428e-8dc1-46824e9a1b28_1028x972.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!SlnS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67b8692f-7da1-428e-8dc1-46824e9a1b28_1028x972.png 424w, 
https://substackcdn.com/image/fetch/$s_!SlnS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67b8692f-7da1-428e-8dc1-46824e9a1b28_1028x972.png 848w, https://substackcdn.com/image/fetch/$s_!SlnS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67b8692f-7da1-428e-8dc1-46824e9a1b28_1028x972.png 1272w, https://substackcdn.com/image/fetch/$s_!SlnS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67b8692f-7da1-428e-8dc1-46824e9a1b28_1028x972.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!SlnS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67b8692f-7da1-428e-8dc1-46824e9a1b28_1028x972.png" width="1028" height="972" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/67b8692f-7da1-428e-8dc1-46824e9a1b28_1028x972.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:972,&quot;width&quot;:1028,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:55438,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/192406256?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67b8692f-7da1-428e-8dc1-46824e9a1b28_1028x972.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!SlnS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67b8692f-7da1-428e-8dc1-46824e9a1b28_1028x972.png 424w, https://substackcdn.com/image/fetch/$s_!SlnS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67b8692f-7da1-428e-8dc1-46824e9a1b28_1028x972.png 848w, https://substackcdn.com/image/fetch/$s_!SlnS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67b8692f-7da1-428e-8dc1-46824e9a1b28_1028x972.png 1272w, https://substackcdn.com/image/fetch/$s_!SlnS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67b8692f-7da1-428e-8dc1-46824e9a1b28_1028x972.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"></figcaption></figure></div><p><em><strong>Figure 1.2:</strong> Translation equivariance in convolutions. Top: a convolutional kernel (shown on the left) is applied across an input grid containing circles at different positions. The filter identifies left edges irrespective of where they appear in the input, producing the same edge-detection response in the output. Bottom: the same principle applied to a real-world scene. The kernel identifies &#8220;car&#8221; features regardless of whether the car is on the road or on a building rooftop. Because the same filter weights are shared across all spatial positions, the network does not need to learn separate detectors for every possible location of a feature. (Conceptual illustration.)</em></p><p>When we add pooling layers (max pooling or global average pooling) on top of convolutions, we obtain <em>translation invariance</em>: the final classification output remains (approximately) the same regardless of where the object appears. The pooling operation discards precise positional information, summarizing each region with a single statistic (maximum or average value). This is why a CNN classifies an image as "cat" whether the cat is centered, shifted left, or shifted right.</p><blockquote><p><strong>NOTE</strong></p><p>Translation equivariance and translation invariance are related but distinct properties. Equivariance means "the output shifts with the input" (convolution layers). Invariance means "the output stays the same regardless of position" (achieved after pooling). 
A CNN is equivariant through its convolutional layers and becomes invariant at its classification output through pooling.</p></blockquote><h3><strong>Why are these biases so powerful?</strong></h3><p>These inductive biases are powerful precisely because they match the statistical structure of natural images. Consider what they give us:</p><ol><li><p><strong>Fewer parameters.</strong> Weight sharing means one small filter is reused across the entire image, rather than learning separate weights for every position. A 3x3 filter on 3 channels has only 3 &#215; 3 &#215; 3 = 27 weights (plus one bias term), yet it processes every location in the image.</p></li><li><p><strong>Better generalization with less data.</strong> Because the model already &#8220;knows&#8221; that patterns can appear anywhere and that local neighborhoods matter, it does not need millions of examples to discover these facts.</p></li><li><p><strong>Natural feature hierarchies.</strong> The local-to-global processing pipeline naturally builds the kind of part-based representations that are useful for recognition: edges combine into textures, textures into parts, parts into objects.</p></li></ol><p>This is exactly why CNNs dominated computer vision for nearly a decade, from AlexNet in 2012 through the ResNet era. The built-in assumptions aligned beautifully with the task.</p><h3><strong>When do these biases become a limitation?</strong></h3><p>However, the same rigid assumptions that make CNNs data-efficient can become a ceiling on performance. Consider the image in Figure 1.3: a person walking near a traffic signal. To understand this scene, the model needs to relate the person (in one part of the image) to the traffic signal (in a completely different part). 
These two elements are spatially far apart, but semantically they are deeply connected: the traffic signal determines whether the person should walk or stop.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!q5rD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8deeb6ec-b2a3-4223-a5c3-a779622b6ad3_1482x609.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!q5rD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8deeb6ec-b2a3-4223-a5c3-a779622b6ad3_1482x609.png 424w, https://substackcdn.com/image/fetch/$s_!q5rD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8deeb6ec-b2a3-4223-a5c3-a779622b6ad3_1482x609.png 848w, https://substackcdn.com/image/fetch/$s_!q5rD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8deeb6ec-b2a3-4223-a5c3-a779622b6ad3_1482x609.png 1272w, https://substackcdn.com/image/fetch/$s_!q5rD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8deeb6ec-b2a3-4223-a5c3-a779622b6ad3_1482x609.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!q5rD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8deeb6ec-b2a3-4223-a5c3-a779622b6ad3_1482x609.png" width="1456" height="598" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8deeb6ec-b2a3-4223-a5c3-a779622b6ad3_1482x609.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:598,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:621127,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/192406256?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8deeb6ec-b2a3-4223-a5c3-a779622b6ad3_1482x609.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!q5rD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8deeb6ec-b2a3-4223-a5c3-a779622b6ad3_1482x609.png 424w, https://substackcdn.com/image/fetch/$s_!q5rD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8deeb6ec-b2a3-4223-a5c3-a779622b6ad3_1482x609.png 848w, https://substackcdn.com/image/fetch/$s_!q5rD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8deeb6ec-b2a3-4223-a5c3-a779622b6ad3_1482x609.png 1272w, https://substackcdn.com/image/fetch/$s_!q5rD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8deeb6ec-b2a3-4223-a5c3-a779622b6ad3_1482x609.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><p><em><strong>Figure 1.3</strong> Long-range dependencies in images. The person and the traffic signal are spatially far apart in pixel space, but they are semantically related. A CNN must stack many layers to connect these distant regions through its limited local receptive field. A transformer, with global self-attention, can relate these elements directly in a single layer.</em></p><p>A CNN's local filters can only see a small neighborhood at each layer. To connect the person and the traffic signal, information must propagate through many layers, each expanding the receptive field slightly. By the time the model can "see" both elements simultaneously, the information has passed through many transformations and may be diluted or lost.</p><p>This is where transformers shine. 
Self-attention computes relationships between <em>all</em> patches simultaneously, regardless of spatial distance. A transformer can directly attend from the "person" patch to the "traffic signal" patch in a single layer. As we discussed in the context of natural language processing, this is analogous to how attention connects distant words in a sentence.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7fxI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0efe7c17-f8bc-461a-ba75-6a526d5d1400_1968x536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7fxI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0efe7c17-f8bc-461a-ba75-6a526d5d1400_1968x536.png 424w, https://substackcdn.com/image/fetch/$s_!7fxI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0efe7c17-f8bc-461a-ba75-6a526d5d1400_1968x536.png 848w, https://substackcdn.com/image/fetch/$s_!7fxI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0efe7c17-f8bc-461a-ba75-6a526d5d1400_1968x536.png 1272w, https://substackcdn.com/image/fetch/$s_!7fxI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0efe7c17-f8bc-461a-ba75-6a526d5d1400_1968x536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7fxI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0efe7c17-f8bc-461a-ba75-6a526d5d1400_1968x536.png" width="1456" height="397" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0efe7c17-f8bc-461a-ba75-6a526d5d1400_1968x536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:397,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:36073,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/192406256?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0efe7c17-f8bc-461a-ba75-6a526d5d1400_1968x536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7fxI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0efe7c17-f8bc-461a-ba75-6a526d5d1400_1968x536.png 424w, https://substackcdn.com/image/fetch/$s_!7fxI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0efe7c17-f8bc-461a-ba75-6a526d5d1400_1968x536.png 848w, https://substackcdn.com/image/fetch/$s_!7fxI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0efe7c17-f8bc-461a-ba75-6a526d5d1400_1968x536.png 1272w, https://substackcdn.com/image/fetch/$s_!7fxI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0efe7c17-f8bc-461a-ba75-6a526d5d1400_1968x536.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><p><em><strong>Figure 1.4</strong> Attention mechanisms connect distant elements directly. Just as attention in NLP allows the word "smiled" to attend strongly to "teacher" despite the intervening words, visual attention allows distant image patches to interact directly. The attention weights (shown below the sentence) indicate how strongly each word attends to others, capturing long-range dependencies that local convolutions would struggle with.</em></p><p>So, inductive bias in CNNs is a double-edged sword. When the assumptions match the data (local features, translation-invariant patterns, moderate image complexity), CNNs are remarkably efficient. When the task demands flexible, global reasoning across distant image regions, those same assumptions become a constraint. 
This trade-off is precisely what motivated the development of Vision Transformers, and subsequently, DeiT.</p><h2>1.2 Why Vision Transformers are data-hungry and computationally heavy</h2><p>Now that we understand what inductive biases are and why CNNs benefit from them, we can see the fundamental challenge that Vision Transformers face.</p><h3><strong>The ViT paper&#8217;s stark finding</strong></h3><p>The original Vision Transformer paper by Dosovitskiy et al. (2020) contained an important admission that is easy to overlook:</p><div class="pullquote"><p>"We note that Vision Transformer has much less image-specific inductive bias than CNNs. In CNNs, locality, two-dimensional neighborhood structure, and translation equivariance are baked into each layer throughout the whole model. In ViT, only MLP layers are local and translationally equivariant, while the self-attention layers are global."</p></div><p>The paper also draws a crucial conclusion:</p><div class="pullquote"><p><em>&#8220;Transformers lack some of the inductive biases inherent to CNNs, such as translation equivariance and locality, and therefore do not generalize well when trained on insufficient amounts of data.&#8221;</em></p></div><p>Let us unpack exactly why. A Vision Transformer processes images very differently from a CNN:</p><ol><li><p><strong>No locality constraint.</strong> In every transformer layer, each patch attends to <em>every</em> other patch through self-attention. There is no architectural constraint forcing the model to prioritize nearby patches. The attention mechanism computes pairwise relationships between all patches simultaneously, whereas CNNs restrict each layer to local neighborhoods.</p></li><li><p><strong>No built-in translation equivariance.</strong> In a CNN, the same filter slides across the entire image, so a pattern is detected the same way wherever it appears. A ViT does reuse the same projection and MLP weights for every patch, but its attention weights depend on patch content and on position, so the model can learn different relationships for different positions. 
Positional information comes only from learned positional embeddings, not from the architecture itself.</p></li><li><p><strong>No hierarchical feature extraction by default.</strong> CNNs naturally build hierarchical features (edges, then textures, then parts, then objects) through stacked local convolutions with increasing receptive fields. Standard ViTs apply global attention over the same patch grid at every layer, so this hierarchy must be learned entirely from data.</p></li></ol><h3><strong>The data requirements</strong></h3><p>The consequence is stark. Since a ViT must learn locality, translation equivariance, and hierarchical feature extraction all from scratch, it needs vastly more data.</p><p>The original ViT paper systematically studied this relationship with a dataset scaling experiment:</p><ul><li><p><strong>ImageNet-1K</strong> (~1.28 million images): ViT-Large <em>underperforms</em> comparable CNNs. Worse still, larger ViT variants score <em>lower</em> than smaller ones due to overfitting. The model memorizes training data rather than generalizing.</p></li><li><p><strong>ImageNet-21K</strong> (~14 million images): ViT-Large and comparable CNNs perform similarly. The ViT begins to show its potential.</p></li><li><p><strong>JFT-300M</strong> (~303 million images): ViT-Large <em>significantly outperforms</em> comparable CNNs. The pattern reverses completely: larger models perform better, not worse.</p></li></ul><p>This reveals a fundamental trade-off in machine learning: the less prior knowledge (inductive bias) you build into an architecture, the more data it needs to learn those patterns from experience. CNNs &#8220;know&#8221; about locality and translation equivariance before seeing a single training image. 
ViTs must discover these properties purely from data, and discovering them requires seeing hundreds of millions of examples.</p><h3><strong>Scaling laws and the bitter lesson</strong></h3><p>This phenomenon connects to a broader principle in deep learning captured by <em>scaling laws</em>. Research by Kaplan et al. (2020) established that neural network performance follows predictable power-law relationships with three variables: model size (N parameters), dataset size (D tokens or images), and compute budget (C FLOPs):</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;L(N) \\propto N^{-\\alpha_N}, \\quad L(D) \\propto D^{-\\alpha_D}, \\quad L(C) \\propto C^{-\\alpha_C}&quot;,&quot;id&quot;:&quot;FAZVMJNOXH&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{where } L \\text{ is the loss and } \\alpha_N \\approx 0.076, \\alpha_D \\approx 0.095, \\alpha_C \\approx 0.050 \\text{ for language models.}&quot;,&quot;id&quot;:&quot;HABINDLXWU&quot;}" data-component-name="LatexBlockToDOM"></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!12oe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff01bcf84-4e62-4590-b1c9-03aa3603bf2c_3135x1011.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!12oe!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff01bcf84-4e62-4590-b1c9-03aa3603bf2c_3135x1011.png 424w, https://substackcdn.com/image/fetch/$s_!12oe!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff01bcf84-4e62-4590-b1c9-03aa3603bf2c_3135x1011.png 
848w, https://substackcdn.com/image/fetch/$s_!12oe!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff01bcf84-4e62-4590-b1c9-03aa3603bf2c_3135x1011.png 1272w, https://substackcdn.com/image/fetch/$s_!12oe!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff01bcf84-4e62-4590-b1c9-03aa3603bf2c_3135x1011.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!12oe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff01bcf84-4e62-4590-b1c9-03aa3603bf2c_3135x1011.png" width="1456" height="470" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f01bcf84-4e62-4590-b1c9-03aa3603bf2c_3135x1011.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:470,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:503924,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/192406256?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff01bcf84-4e62-4590-b1c9-03aa3603bf2c_3135x1011.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!12oe!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff01bcf84-4e62-4590-b1c9-03aa3603bf2c_3135x1011.png 424w, 
https://substackcdn.com/image/fetch/$s_!12oe!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff01bcf84-4e62-4590-b1c9-03aa3603bf2c_3135x1011.png 848w, https://substackcdn.com/image/fetch/$s_!12oe!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff01bcf84-4e62-4590-b1c9-03aa3603bf2c_3135x1011.png 1272w, https://substackcdn.com/image/fetch/$s_!12oe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff01bcf84-4e62-4590-b1c9-03aa3603bf2c_3135x1011.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><p><em><strong>Figure 1.5</strong> Scaling laws demonstrate that model performance improves as a smooth power law of model size, dataset size, and compute budget. These relationships, first established for language models, were later confirmed for Vision Transformers as well. The declining curves show that each doubling of resources yields a predictable (though diminishing) improvement in performance.</em></p><p>The Scaling Vision Transformers paper by Zhai et al. (2022) confirmed that these same power-law relationships hold for ViTs, scaling from 5 million to 2 billion parameters. The key insight is that architectural details matter less than scale. With enough data and compute, the flexibility of transformers (their lack of restrictive inductive biases) becomes an <em>advantage</em> rather than a limitation. The ViT paper itself summarized this as: <em>&#8220;Large-scale training trumps inductive bias.&#8221;</em></p><p>But this raised an uncomfortable question: what if you do not have 300 million images and thousands of TPU-days? What if you have a single machine, a standard dataset, and a few days of training time? This is precisely the problem that DeiT set out to solve.</p><h3><strong>The small context window insight</strong></h3><p>Consider how a small local patch of an image can be ambiguous without global context. Figure 1.6 shows an interesting property: small context windows of different digits can look remarkably similar. The bottom halves of the digits 8, 0, and 6 share nearly identical local features. 
A local convolutional filter might struggle to distinguish them, while global attention can consider the entire digit.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TicD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59eb24d7-f171-47e6-a6b8-6c3ab3b95190_828x330.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TicD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59eb24d7-f171-47e6-a6b8-6c3ab3b95190_828x330.png 424w, https://substackcdn.com/image/fetch/$s_!TicD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59eb24d7-f171-47e6-a6b8-6c3ab3b95190_828x330.png 848w, https://substackcdn.com/image/fetch/$s_!TicD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59eb24d7-f171-47e6-a6b8-6c3ab3b95190_828x330.png 1272w, https://substackcdn.com/image/fetch/$s_!TicD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59eb24d7-f171-47e6-a6b8-6c3ab3b95190_828x330.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TicD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59eb24d7-f171-47e6-a6b8-6c3ab3b95190_828x330.png" width="828" height="330" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/59eb24d7-f171-47e6-a6b8-6c3ab3b95190_828x330.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:330,&quot;width&quot;:828,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:38184,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/192406256?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59eb24d7-f171-47e6-a6b8-6c3ab3b95190_828x330.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!TicD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59eb24d7-f171-47e6-a6b8-6c3ab3b95190_828x330.png 424w, https://substackcdn.com/image/fetch/$s_!TicD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59eb24d7-f171-47e6-a6b8-6c3ab3b95190_828x330.png 848w, https://substackcdn.com/image/fetch/$s_!TicD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59eb24d7-f171-47e6-a6b8-6c3ab3b95190_828x330.png 1272w, https://substackcdn.com/image/fetch/$s_!TicD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59eb24d7-f171-47e6-a6b8-6c3ab3b95190_828x330.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><p><em><strong>Figure 1.6</strong> Small context windows can be ambiguous. The bottom portions of the digits 8, 0, and 6 appear nearly identical when viewed in isolation. This illustrates both the strength and weakness of local processing: CNNs capture these shared local features efficiently, but transformers with global attention can disambiguate by considering the full spatial context simultaneously.</em></p><p>This is a microcosm of the broader tension between CNNs and ViTs. Local processing is efficient but can miss the forest for the trees. Global processing is powerful but expensive to learn. 
DeiT&#8217;s genius was finding a way to get the best of both worlds.</p><h2><strong>1.3 DeiT: data-efficient training through distillation</strong></h2><p>With the problem clearly defined (ViTs need too much data and compute), let us explore how DeiT solved it. The paper &#8220;Training data-efficient image transformers &amp; distillation through attention&#8221; by Touvron et al. (2021) introduced a remarkably elegant solution that combines two key ingredients: a carefully designed training recipe and a novel form of knowledge distillation.</p><h3><strong>The core idea</strong></h3><p>DeiT&#8217;s central insight is this: instead of requiring the Vision Transformer to learn everything about images from scratch, we can transfer knowledge from a pre-trained CNN teacher. The CNN already understands locality, translation equivariance, and hierarchical features because these properties are built into its architecture. Through knowledge distillation, the ViT student can inherit these implicit biases <em>without</em> having them hardcoded into its architecture.</p><p>The result was striking: DeiT-B (86 million parameters) achieved <strong>83.1% top-1 accuracy</strong> on ImageNet using <em>only</em> ImageNet-1K for training, in approximately 53 hours on a single 8-GPU node. Compare this with the original ViT, which required JFT-300M (300 million images) and thousands of TPU-days to achieve similar performance.</p><h3><strong>DeiT architecture</strong></h3><p>DeiT retains the standard Vision Transformer architecture with one crucial addition: a <em>distillation token</em>. 
Let us walk through the full architecture as shown in figure 1.7.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!AXAF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb331b750-0b22-4f05-a928-be405b7aa634_1760x2348.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!AXAF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb331b750-0b22-4f05-a928-be405b7aa634_1760x2348.png 424w, https://substackcdn.com/image/fetch/$s_!AXAF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb331b750-0b22-4f05-a928-be405b7aa634_1760x2348.png 848w, https://substackcdn.com/image/fetch/$s_!AXAF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb331b750-0b22-4f05-a928-be405b7aa634_1760x2348.png 1272w, https://substackcdn.com/image/fetch/$s_!AXAF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb331b750-0b22-4f05-a928-be405b7aa634_1760x2348.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!AXAF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb331b750-0b22-4f05-a928-be405b7aa634_1760x2348.png" width="1456" height="1942" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b331b750-0b22-4f05-a928-be405b7aa634_1760x2348.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1942,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:467516,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/192406256?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb331b750-0b22-4f05-a928-be405b7aa634_1760x2348.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!AXAF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb331b750-0b22-4f05-a928-be405b7aa634_1760x2348.png 424w, https://substackcdn.com/image/fetch/$s_!AXAF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb331b750-0b22-4f05-a928-be405b7aa634_1760x2348.png 848w, https://substackcdn.com/image/fetch/$s_!AXAF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb331b750-0b22-4f05-a928-be405b7aa634_1760x2348.png 1272w, https://substackcdn.com/image/fetch/$s_!AXAF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb331b750-0b22-4f05-a928-be405b7aa634_1760x2348.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><p><em><strong>Figure 1.7</strong> The DeiT architecture extends the standard Vision Transformer with a distillation token. The input image is divided into fixed-size patches, which are linearly projected into embeddings. Two special tokens are prepended: the [CLS] token (supervised by ground-truth labels) and the [DIST] token (supervised by the teacher model's predictions). Both tokens pass through all transformer encoder layers, interacting with patch tokens and with each other through self-attention. 
At the output, separate classification heads produce predictions from each token.</em></p><p>The architecture works as follows:</p><ol><li><p><strong>Patch embedding.</strong> The input image (e.g., 224x224 pixels) is divided into non-overlapping patches (e.g., 16x16 pixels each), yielding a sequence of </p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;N = (224/16)^2=196 &quot;,&quot;id&quot;:&quot;ZNQLJXSEPW&quot;}" data-component-name="LatexBlockToDOM"></div><p>patches. Each patch is linearly projected into a D-dimensional embedding vector using a learnable projection matrix. In practice, this is implemented as a single convolution with kernel size and stride equal to the patch size.</p></li><li><p><strong>Special tokens.</strong> Two learnable embedding vectors are prepended to the sequence:</p><ul><li><p>The <strong>[CLS] token</strong>: a standard classification token (as in the original ViT) that aggregates information for predicting the ground-truth label.</p></li><li><p>The <strong>[DIST] token</strong>: a novel <em>distillation token</em> that aggregates information for mimicking the teacher model&#8217;s predictions.</p></li></ul></li><li><p><strong>Positional embeddings.</strong> Learnable 1D position embeddings are added to all tokens (patches + CLS + DIST), giving the model information about spatial arrangement. The total sequence length is N+2.</p></li><li><p><strong>Transformer encoder.</strong> The sequence passes through L standard transformer encoder layers, each consisting of multi-head self-attention (MSA) and a feed-forward network (FFN). 
Both special tokens interact with all patch tokens and with each other through self-attention.</p></li><li><p><strong>Dual classification heads.</strong> At the output, two separate linear heads produce predictions:</p><p></p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathbf{z}_{s}^{\\text{cls}} = W_{\\text{cls}} \\cdot \\mathbf{x}_{\\text{cls}}&quot;,&quot;id&quot;:&quot;XKTNGJHIGS&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{logits from the CLS token, trained against ground-truth labels.}&quot;,&quot;id&quot;:&quot;MUHXKQVEMB&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathbf{z}_{s}^{\\text{dist}} = W_{\\text{dist}} \\cdot \\mathbf{x}_{\\text{dist}}&quot;,&quot;id&quot;:&quot;ZZPMCRUSCS&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{ logits from the DIST token, trained against the teacher's predictions.}&quot;,&quot;id&quot;:&quot;BZYSRPKUVL&quot;}" data-component-name="LatexBlockToDOM"></div></li><li><p><strong>Inference.</strong> At test time, the predictions from both heads are combined by averaging the softmax outputs:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathbf{p}_{\\text{inference}} = \\frac{1}{2}\\left[\\sigma(\\mathbf{z}_s^{\\text{cls}}) + \\sigma(\\mathbf{z}_s^{\\text{dist}})\\right]\n&quot;,&quot;id&quot;:&quot;GOEHRMKIPI&quot;}" data-component-name="LatexBlockToDOM"></div><p></p></li></ol><p>Now let us look at the original architecture diagram from the paper for additional perspective.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!Qh2G!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9df0e8af-01a8-4c2e-81ac-42431af95cc4_1164x1584.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Qh2G!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9df0e8af-01a8-4c2e-81ac-42431af95cc4_1164x1584.png 424w, https://substackcdn.com/image/fetch/$s_!Qh2G!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9df0e8af-01a8-4c2e-81ac-42431af95cc4_1164x1584.png 848w, https://substackcdn.com/image/fetch/$s_!Qh2G!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9df0e8af-01a8-4c2e-81ac-42431af95cc4_1164x1584.png 1272w, https://substackcdn.com/image/fetch/$s_!Qh2G!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9df0e8af-01a8-4c2e-81ac-42431af95cc4_1164x1584.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Qh2G!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9df0e8af-01a8-4c2e-81ac-42431af95cc4_1164x1584.png" width="1164" height="1584" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9df0e8af-01a8-4c2e-81ac-42431af95cc4_1164x1584.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1584,&quot;width&quot;:1164,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:91818,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/192406256?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9df0e8af-01a8-4c2e-81ac-42431af95cc4_1164x1584.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Qh2G!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9df0e8af-01a8-4c2e-81ac-42431af95cc4_1164x1584.png 424w, https://substackcdn.com/image/fetch/$s_!Qh2G!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9df0e8af-01a8-4c2e-81ac-42431af95cc4_1164x1584.png 848w, https://substackcdn.com/image/fetch/$s_!Qh2G!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9df0e8af-01a8-4c2e-81ac-42431af95cc4_1164x1584.png 1272w, https://substackcdn.com/image/fetch/$s_!Qh2G!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9df0e8af-01a8-4c2e-81ac-42431af95cc4_1164x1584.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em><strong>Figure 1.8</strong> The DeiT architecture as presented in the original paper, showing the complete training pipeline. The student (Vision Transformer) receives supervision from two sources: the ground-truth labels (via cross-entropy on the CLS token) and the pre-trained CNN teacher (via the distillation loss on the DIST token). The teacher model is frozen during training and its weights are never updated. The distillation token enables the transformer to learn the CNN's implicit understanding of image structure without architecturally constraining itself.</em></p><h3><strong>Why a separate distillation token?</strong></h3><p>You might wonder: why not simply add the teacher&#8217;s loss to the existing CLS token? Why introduce a separate token? 
The DeiT paper provides a compelling empirical answer.</p><p>The authors experimented with using two identical CLS tokens (both trained on ground-truth labels) and found that they converge to nearly identical representations, with cosine similarity approaching 0.999. They carry redundant information.</p><p>In contrast, the CLS token and distillation token, trained on <em>different</em> objectives, develop meaningfully different representations. Their cosine similarity is approximately 0.06 in early layers and rises to only about 0.93 by the final layer. This means the distillation token captures complementary information that the CLS token alone would miss. The two tokens learn different &#8220;perspectives&#8221; on the input, and combining them at inference yields better predictions than either alone.</p><h3><strong>The teacher model</strong></h3><p>DeiT uses a pre-trained CNN as the teacher. In the paper, the primary teacher is <strong>RegNetY-16GF</strong> (84 million parameters, 82.9% top-1 accuracy on ImageNet), though the authors also experimented with other architectures. Critically, the teacher is <em>frozen</em> during training: its weights are never updated. It simply provides predictions that the student learns to mimic.</p><p>A surprising finding from the paper is that <strong>a CNN teacher produces dramatically better student performance than a transformer teacher</strong>. The authors note that the transformer student &#8220;learned least from a transformer-teacher but learned most from a big convolution-teacher.&#8221; This supports the hypothesis that distillation effectively transfers the CNN&#8217;s inductive biases (locality, translation equivariance) to the transformer student, giving it the benefits of both architectures.</p><p>But how exactly does knowledge transfer from teacher to student? 
To understand this, we need to dive deep into the mechanics of knowledge distillation.</p><h2><strong>1.4 </strong><em><strong>Knowledge distillation: teaching a student network</strong></em></h2><p>Knowledge distillation is a model compression technique where a small <em>student</em> model is trained to mimic the behavior of a larger, more capable <em>teacher</em> model. The goal is to produce a compact model that retains much of the teacher&#8217;s performance while being faster and cheaper at inference. Let us trace the evolution of this idea and then build up the mathematics step by step.</p><h3><strong>A brief history</strong></h3><p>The idea of compressing knowledge from large models into small ones dates back to <strong>2006</strong>, when researchers at Cornell University proposed <em>model compression</em>. At the time, the best-performing models were not single networks but <em>ensembles</em> of hundreds or thousands of models whose predictions were averaged. These ensembles were accurate but far too large for deployment on devices like PDAs (personal digital assistants, the predecessors to smartphones).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pDdK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc3150f3-565b-458c-bf21-99e0eafd17f9_1384x900.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pDdK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc3150f3-565b-458c-bf21-99e0eafd17f9_1384x900.png 424w, 
https://substackcdn.com/image/fetch/$s_!pDdK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc3150f3-565b-458c-bf21-99e0eafd17f9_1384x900.png 848w, https://substackcdn.com/image/fetch/$s_!pDdK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc3150f3-565b-458c-bf21-99e0eafd17f9_1384x900.png 1272w, https://substackcdn.com/image/fetch/$s_!pDdK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc3150f3-565b-458c-bf21-99e0eafd17f9_1384x900.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pDdK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc3150f3-565b-458c-bf21-99e0eafd17f9_1384x900.png" width="1384" height="900" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dc3150f3-565b-458c-bf21-99e0eafd17f9_1384x900.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:900,&quot;width&quot;:1384,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:76337,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/192406256?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc3150f3-565b-458c-bf21-99e0eafd17f9_1384x900.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!pDdK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc3150f3-565b-458c-bf21-99e0eafd17f9_1384x900.png 424w, https://substackcdn.com/image/fetch/$s_!pDdK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc3150f3-565b-458c-bf21-99e0eafd17f9_1384x900.png 848w, https://substackcdn.com/image/fetch/$s_!pDdK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc3150f3-565b-458c-bf21-99e0eafd17f9_1384x900.png 1272w, https://substackcdn.com/image/fetch/$s_!pDdK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc3150f3-565b-458c-bf21-99e0eafd17f9_1384x900.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em><strong>Figure 1.9</strong> The two-step process of model compression, as introduced in 2006. In Step 1, an ensemble of multiple models is trained, with each model making independent predictions. In Step 2, a single small model is trained to directly predict the averaged output of the ensemble, compressing the knowledge of many models into one.</em></p><p>The insight was elegantly simple: instead of shipping all the ensemble models to the device, train a single small model to predict the ensemble's averaged output. This small model captures the collective wisdom of the ensemble while being compact enough for deployment.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_J-Y!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F125a684a-3259-419e-870d-6aaee937a314_1040x744.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_J-Y!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F125a684a-3259-419e-870d-6aaee937a314_1040x744.png 424w, https://substackcdn.com/image/fetch/$s_!_J-Y!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F125a684a-3259-419e-870d-6aaee937a314_1040x744.png 848w, 
https://substackcdn.com/image/fetch/$s_!_J-Y!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F125a684a-3259-419e-870d-6aaee937a314_1040x744.png 1272w, https://substackcdn.com/image/fetch/$s_!_J-Y!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F125a684a-3259-419e-870d-6aaee937a314_1040x744.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_J-Y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F125a684a-3259-419e-870d-6aaee937a314_1040x744.png" width="1040" height="744" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/125a684a-3259-419e-870d-6aaee937a314_1040x744.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:744,&quot;width&quot;:1040,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:60714,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/192406256?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F125a684a-3259-419e-870d-6aaee937a314_1040x744.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_J-Y!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F125a684a-3259-419e-870d-6aaee937a314_1040x744.png 424w, 
https://substackcdn.com/image/fetch/$s_!_J-Y!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F125a684a-3259-419e-870d-6aaee937a314_1040x744.png 848w, https://substackcdn.com/image/fetch/$s_!_J-Y!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F125a684a-3259-419e-870d-6aaee937a314_1040x744.png 1272w, https://substackcdn.com/image/fetch/$s_!_J-Y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F125a684a-3259-419e-870d-6aaee937a314_1040x744.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em><strong>Figure 1.10</strong> Ensemble soft voting. Multiple models (Model 1 through Model 5) each output a probability distribution over classes. These distributions are averaged to produce a final classification. The key insight is that this averaged distribution carries more information than any single model&#8217;s hard prediction.</em></p><p>In <strong>2015</strong>, Geoffrey Hinton, Oriol Vinyals, and Jeff Dean (at Google) revisited this idea and gave it the name we use today: <em>knowledge distillation</em>. Their crucial insight was that distillation is valuable even <em>without</em> an ensemble. A single large model&#8217;s soft probability outputs carry richer information than hard labels alone, and this information can be transferred to a smaller student.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-0W7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F833f7880-f581-41c3-b49a-287bf92b1ca8_1292x876.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-0W7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F833f7880-f581-41c3-b49a-287bf92b1ca8_1292x876.png 424w, https://substackcdn.com/image/fetch/$s_!-0W7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F833f7880-f581-41c3-b49a-287bf92b1ca8_1292x876.png 848w, https://substackcdn.com/image/fetch/$s_!-0W7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F833f7880-f581-41c3-b49a-287bf92b1ca8_1292x876.png 1272w, 
https://substackcdn.com/image/fetch/$s_!-0W7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F833f7880-f581-41c3-b49a-287bf92b1ca8_1292x876.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-0W7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F833f7880-f581-41c3-b49a-287bf92b1ca8_1292x876.png" width="1292" height="876" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/833f7880-f581-41c3-b49a-287bf92b1ca8_1292x876.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:876,&quot;width&quot;:1292,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:89648,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/192406256?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F833f7880-f581-41c3-b49a-287bf92b1ca8_1292x876.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-0W7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F833f7880-f581-41c3-b49a-287bf92b1ca8_1292x876.png 424w, https://substackcdn.com/image/fetch/$s_!-0W7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F833f7880-f581-41c3-b49a-287bf92b1ca8_1292x876.png 848w, 
https://substackcdn.com/image/fetch/$s_!-0W7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F833f7880-f581-41c3-b49a-287bf92b1ca8_1292x876.png 1272w, https://substackcdn.com/image/fetch/$s_!-0W7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F833f7880-f581-41c3-b49a-287bf92b1ca8_1292x876.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em><strong>Figure 1.11</strong> Knowledge distillation does not require an ensemble. 
A single large teacher model generates soft labels (probability distributions) that carry more information than one-hot hard labels. The student learns from these enriched targets, inheriting the teacher&#8217;s nuanced understanding of inter-class relationships.</em></p><h3><strong>Hard labels vs. soft labels: the concept of dark knowledge</strong></h3><p>To understand why soft labels are so valuable, consider a concrete example. Suppose we have a teacher model classifying animal images into three categories: dog, cat, and mouse.</p><p>Given an image of a husky, the hard label (ground truth) is simply:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathbf{y}_{\\text{hard}} = [1, 0, 0] \\quad \\text{(dog, cat, mouse)}&quot;,&quot;id&quot;:&quot;GZOPRDBEAM&quot;}" data-component-name="LatexBlockToDOM"></div><p>This tells the student: "This is a dog. Period." But the teacher's soft prediction might be:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathbf{p}_{\\text{teacher}} = [0.80, 0.15, 0.05] \\quad \\text{(dog, cat, mouse)}&quot;,&quot;id&quot;:&quot;JHMHCTRAOH&quot;}" data-component-name="LatexBlockToDOM"></div><p>This soft distribution tells the student something much richer: &#8220;This is most likely a dog, but it looks somewhat like a cat (perhaps because of the fur texture), and it looks very little like a mouse.&#8221; This additional information about <em>what the input is not</em> (and how much it is &#8220;not&#8221; each class) is what Hinton poetically called <strong>dark knowledge</strong>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5UKP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ec8cc90-7ee0-4595-b7d8-f636a672d961_744x616.png" data-component-name="Image2ToDOM"><div 
class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5UKP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ec8cc90-7ee0-4595-b7d8-f636a672d961_744x616.png 424w, https://substackcdn.com/image/fetch/$s_!5UKP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ec8cc90-7ee0-4595-b7d8-f636a672d961_744x616.png 848w, https://substackcdn.com/image/fetch/$s_!5UKP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ec8cc90-7ee0-4595-b7d8-f636a672d961_744x616.png 1272w, https://substackcdn.com/image/fetch/$s_!5UKP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ec8cc90-7ee0-4595-b7d8-f636a672d961_744x616.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5UKP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ec8cc90-7ee0-4595-b7d8-f636a672d961_744x616.png" width="744" height="616" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1ec8cc90-7ee0-4595-b7d8-f636a672d961_744x616.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:616,&quot;width&quot;:744,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:257327,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/192406256?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ec8cc90-7ee0-4595-b7d8-f636a672d961_744x616.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!5UKP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ec8cc90-7ee0-4595-b7d8-f636a672d961_744x616.png 424w, https://substackcdn.com/image/fetch/$s_!5UKP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ec8cc90-7ee0-4595-b7d8-f636a672d961_744x616.png 848w, https://substackcdn.com/image/fetch/$s_!5UKP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ec8cc90-7ee0-4595-b7d8-f636a672d961_744x616.png 1272w, https://substackcdn.com/image/fetch/$s_!5UKP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ec8cc90-7ee0-4595-b7d8-f636a672d961_744x616.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em><strong>Figure 1.12</strong>  Dark knowledge revealed through soft labels. While hard labels simply say "this is a husky," the teacher's soft probability distribution reveals richer inter-class relationships. A husky image gets high probability for "husky" but also notable probability for "wolf" and some for "dog," reflecting visual similarities between these animals that the teacher has learned. This relational information (the dark knowledge) helps the student learn more efficiently.</em></p><p>Dark knowledge encodes the teacher&#8217;s learned understanding of inter-class similarities and relationships. A husky looks more like a wolf than like a goldfish. 
A handwritten &#8220;7&#8221; looks more like a &#8220;1&#8221; than like a &#8220;0.&#8221; These relational insights are completely absent from one-hot hard labels but are naturally present in soft probability distributions.</p><h3><strong>The mathematics of knowledge distillation</strong></h3><p>Now let us formalize the distillation process mathematically. The framework involves three key components: temperature scaling, the distillation loss function, and the combined training objective.</p><h4><strong>Temperature scaling</strong></h4><p>When a well-trained teacher model makes predictions, its output distribution is often very &#8220;peaked&#8221;: the correct class gets probability close to 1.0, and all other classes get near-zero probabilities. For example:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathbf{p}_{\\text{teacher}} = [0.99, 0.005, 0.005] \\quad \\text{(at } \\tau = 1\\text{)}&quot;,&quot;id&quot;:&quot;PEOOXRJRGP&quot;}" data-component-name="LatexBlockToDOM"></div><p>These near-zero probabilities contain the dark knowledge we want to transfer, but they are so small that they provide almost no gradient signal during training. In practice, the student learns very little from values like 0.005, because their contribution to the loss is dwarfed by that of the correct class.</p><p><em>Temperature scaling</em> solves this by softening the probability distribution. 
Given a vector of logits </p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;z = \\begin{bmatrix} z_1 \\\\ z_2 \\\\ \\vdots \\\\ z_K \\end{bmatrix}&quot;,&quot;id&quot;:&quot;BCMJMVIWXV&quot;}" data-component-name="LatexBlockToDOM"></div><p>(the raw, unnormalized outputs of the network before softmax), the standard softmax function is:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\sigma(z_i) = \\frac{e^{z_i}}{\\sum_{j=1}^{K} e^{z_j}}&quot;,&quot;id&quot;:&quot;XSSKPASXLQ&quot;}" data-component-name="LatexBlockToDOM"></div><p>With a temperature parameter &#964;&gt;0, we define the <em>temperature-scaled softmax</em>:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\sigma\\!\\left(\\frac{z_i}{\\tau}\\right) = \\frac{e^{z_i / \\tau}}{\\sum_{j=1}^{K} e^{z_j / \\tau}}&quot;,&quot;id&quot;:&quot;SJXSCQIGSQ&quot;}" data-component-name="LatexBlockToDOM"></div><p>The temperature controls how "spread out" the distribution is:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\tau = 1&quot;,&quot;id&quot;:&quot;FZKBIGQOBX&quot;}" data-component-name="LatexBlockToDOM"></div><p>Standard softmax. The distribution reflects the model's learned confidence.</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\tau > 1&quot;,&quot;id&quot;:&quot;QRCETRUEGO&quot;}" data-component-name="LatexBlockToDOM"></div><p>Higher temperature <em>softens</em> the distribution, making all probabilities more uniform and revealing the relative differences between logits. 
This amplifies dark knowledge.</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\tau \\to \\infty&quot;,&quot;id&quot;:&quot;YMEXYGZPBL&quot;}" data-component-name="LatexBlockToDOM"></div><p>The distribution approaches a uniform distribution 1/K for K classes.</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\tau \\to 0^+&quot;,&quot;id&quot;:&quot;BBIGDJZKTA&quot;}" data-component-name="LatexBlockToDOM"></div><p>The distribution collapses to a one-hot vector concentrated on the largest logit (argmax).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Tkl4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cc0d99d-5cf4-462a-b01c-45baae078ee7_1168x456.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Tkl4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cc0d99d-5cf4-462a-b01c-45baae078ee7_1168x456.png 424w, https://substackcdn.com/image/fetch/$s_!Tkl4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cc0d99d-5cf4-462a-b01c-45baae078ee7_1168x456.png 848w, https://substackcdn.com/image/fetch/$s_!Tkl4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cc0d99d-5cf4-462a-b01c-45baae078ee7_1168x456.png 1272w, https://substackcdn.com/image/fetch/$s_!Tkl4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cc0d99d-5cf4-462a-b01c-45baae078ee7_1168x456.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Tkl4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cc0d99d-5cf4-462a-b01c-45baae078ee7_1168x456.png" width="1168" height="456" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3cc0d99d-5cf4-462a-b01c-45baae078ee7_1168x456.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:456,&quot;width&quot;:1168,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:29760,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/192406256?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cc0d99d-5cf4-462a-b01c-45baae078ee7_1168x456.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Tkl4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cc0d99d-5cf4-462a-b01c-45baae078ee7_1168x456.png 424w, https://substackcdn.com/image/fetch/$s_!Tkl4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cc0d99d-5cf4-462a-b01c-45baae078ee7_1168x456.png 848w, https://substackcdn.com/image/fetch/$s_!Tkl4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cc0d99d-5cf4-462a-b01c-45baae078ee7_1168x456.png 1272w, https://substackcdn.com/image/fetch/$s_!Tkl4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cc0d99d-5cf4-462a-b01c-45baae078ee7_1168x456.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em><strong>Figure 1.13</strong> </em> <em>The effect of temperature on softmax distributions. The formula shows how temperature controls the &#8220;peakiness&#8221; of the output. Left: the raw logit values before temperature-scaled softmax. Right: after applying softmax with high temperature, the distribution becomes more uniform, revealing the relative magnitudes of all logits. This smoothing is essential for transferring dark knowledge from teacher to student.</em></p><p>Let us work through a concrete numerical example. 
Suppose a teacher produces logits z=[5.0,2.0,1.0] for three classes. Here is how temperature affects the resulting probabilities:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\n\\begin{array}{lccccc}\n\\hline\n\\mathbf{Class} &amp; \\mathbf{Logit} &amp; \\boldsymbol{\\tau=1} &amp; \\boldsymbol{\\tau=2} &amp; \\boldsymbol{\\tau=5} &amp; \\boldsymbol{\\tau=10} \\\\\n\\hline\n\\mathrm{Dog}   &amp; 5.0 &amp; 0.936 &amp; 0.736 &amp; 0.500 &amp; 0.415 \\\\\n\\hline\n\\mathrm{Cat}   &amp; 2.0 &amp; 0.047 &amp; 0.164 &amp; 0.275 &amp; 0.307 \\\\\n\\hline\n\\mathrm{Mouse} &amp; 1.0 &amp; 0.017 &amp; 0.100 &amp; 0.225 &amp; 0.278 \\\\\n\\hline\n\\mathbf{Sum}   &amp;     &amp; \\mathbf{1.000} &amp; \\mathbf{1.000} &amp; \\mathbf{1.000} &amp; \\mathbf{1.000} \\\\\n\\hline\n\\end{array}\n&quot;,&quot;id&quot;:&quot;HPOLIIZKIE&quot;}" data-component-name="LatexBlockToDOM"></div><p>At &#964;=1, the distribution is peaked: dog gets 93.6% and the dark knowledge (cat and mouse probabilities) is barely visible. At &#964;=5, the distribution is much flatter: we can clearly see that the teacher considers this image more cat-like than mouse-like (27.5% vs. 22.5%). 
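</p><p>The table above can be reproduced with a short Python sketch. The function name <code>softmax_with_temperature</code> is an illustrative choice rather than an API from any particular library, and subtracting the maximum before exponentiating is a common numerical-stability step that leaves the probabilities unchanged:</p>

```python
import math

def softmax_with_temperature(logits, tau=1.0):
    """Return softmax(logits / tau) as a list; tau is assumed to be positive."""
    scaled = [z / tau for z in logits]
    shift = max(scaled)  # subtract the max so exp() never overflows
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Teacher logits from the worked example: dog, cat, mouse.
teacher_logits = [5.0, 2.0, 1.0]
for tau in (1, 2, 5, 10):
    probs = softmax_with_temperature(teacher_logits, tau)
    print(tau, [round(p, 3) for p in probs])
```

<p>Each printed row should match the corresponding column of the table up to rounding, and the cat probability stays above the mouse probability at every temperature. 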
This relational information is the dark knowledge that distillation transfers.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jii5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F230cf60b-cc4b-46b9-8d5c-5b0caf4d088b_1080x404.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jii5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F230cf60b-cc4b-46b9-8d5c-5b0caf4d088b_1080x404.png 424w, https://substackcdn.com/image/fetch/$s_!jii5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F230cf60b-cc4b-46b9-8d5c-5b0caf4d088b_1080x404.png 848w, https://substackcdn.com/image/fetch/$s_!jii5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F230cf60b-cc4b-46b9-8d5c-5b0caf4d088b_1080x404.png 1272w, https://substackcdn.com/image/fetch/$s_!jii5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F230cf60b-cc4b-46b9-8d5c-5b0caf4d088b_1080x404.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!jii5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F230cf60b-cc4b-46b9-8d5c-5b0caf4d088b_1080x404.png" width="1080" height="404" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/230cf60b-cc4b-46b9-8d5c-5b0caf4d088b_1080x404.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:404,&quot;width&quot;:1080,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:9816,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/192406256?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F230cf60b-cc4b-46b9-8d5c-5b0caf4d088b_1080x404.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!jii5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F230cf60b-cc4b-46b9-8d5c-5b0caf4d088b_1080x404.png 424w, https://substackcdn.com/image/fetch/$s_!jii5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F230cf60b-cc4b-46b9-8d5c-5b0caf4d088b_1080x404.png 848w, https://substackcdn.com/image/fetch/$s_!jii5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F230cf60b-cc4b-46b9-8d5c-5b0caf4d088b_1080x404.png 1272w, https://substackcdn.com/image/fetch/$s_!jii5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F230cf60b-cc4b-46b9-8d5c-5b0caf4d088b_1080x404.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em><strong>Figure 1.14</strong> A direct comparison of probability distributions before and after temperature scaling. The left distribution (low temperature) shows sharp peaks where the correct class dominates. The right distribution (high temperature) shows a smoothed version where the relative probabilities of non-dominant classes become visible. This smoothing allows the student to learn not just which class is correct, but which incorrect classes are most similar to the correct one.</em></p><h4><strong>Cross-entropy loss</strong></h4><p>The standard classification loss in deep learning is <em>cross-entropy</em>. 
For a ground-truth one-hot label y=[y_1,y_2,&#8230;,y_K] and predicted probabilities p=[p_1,p_2,&#8230;,p_K], the cross-entropy is:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathcal{L}_{\\text{CE}}(\\mathbf{y}, \\mathbf{p}) = -\\sum_{i=1}^{K} y_i \\log(p_i)&quot;,&quot;id&quot;:&quot;PCGFKOEDYD&quot;}" data-component-name="LatexBlockToDOM"></div><p>Since <strong>y</strong> is one-hot with y_c=1 for the true class c and y_i=0 for all other classes, this simplifies to:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathcal{L}_{\\text{CE}} = -\\log(p_c)&quot;,&quot;id&quot;:&quot;JNZAASRHQT&quot;}" data-component-name="LatexBlockToDOM"></div><p>Let us verify with an example, taking log to be the natural logarithm. If the true class is "cat" (c=1) and the model predicts p=[0.80,0.15,0.05]:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathcal{L}_{\\text{CE}} = -\\log(0.80) = -(-0.2231) = 0.2231&quot;,&quot;id&quot;:&quot;HDRKNGONCD&quot;}" data-component-name="LatexBlockToDOM"></div><p>A confident correct prediction gives a low loss. If instead p_c=0.1:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathcal{L}_{\\text{CE}} = -\\log(0.1) = 2.3026&quot;,&quot;id&quot;:&quot;UKLBWYLROK&quot;}" data-component-name="LatexBlockToDOM"></div><p>An incorrect or uncertain prediction gives a high loss, pushing the model to adjust its weights. Note that &#8722;log(x) is positive for x&#8712;(0,1) and zero at x=1, so the loss is never negative.</p><h4><strong>KL divergence</strong></h4><p>While cross-entropy measures how well predictions match a target distribution, <em>Kullback-Leibler (KL) divergence</em> measures how different two probability distributions are from each other. 
For distributions p (teacher) and q (student):</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;D_{\\text{KL}}(\\mathbf{p} \\| \\mathbf{q}) = \\sum_{i=1}^{K} p_i \\log\\frac{p_i}{q_i}&quot;,&quot;id&quot;:&quot;AMKISVYKVR&quot;}" data-component-name="LatexBlockToDOM"></div><p><br>KL divergence has several important properties:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dfvy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78c760ae-995a-4dcf-bc4f-730b22d1e8f3_2749x312.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dfvy!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78c760ae-995a-4dcf-bc4f-730b22d1e8f3_2749x312.png 424w, https://substackcdn.com/image/fetch/$s_!dfvy!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78c760ae-995a-4dcf-bc4f-730b22d1e8f3_2749x312.png 848w, https://substackcdn.com/image/fetch/$s_!dfvy!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78c760ae-995a-4dcf-bc4f-730b22d1e8f3_2749x312.png 1272w, https://substackcdn.com/image/fetch/$s_!dfvy!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78c760ae-995a-4dcf-bc4f-730b22d1e8f3_2749x312.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dfvy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78c760ae-995a-4dcf-bc4f-730b22d1e8f3_2749x312.png" width="1456" height="165" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/78c760ae-995a-4dcf-bc4f-730b22d1e8f3_2749x312.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:165,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:82369,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/192406256?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78c760ae-995a-4dcf-bc4f-730b22d1e8f3_2749x312.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!dfvy!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78c760ae-995a-4dcf-bc4f-730b22d1e8f3_2749x312.png 424w, https://substackcdn.com/image/fetch/$s_!dfvy!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78c760ae-995a-4dcf-bc4f-730b22d1e8f3_2749x312.png 848w, https://substackcdn.com/image/fetch/$s_!dfvy!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78c760ae-995a-4dcf-bc4f-730b22d1e8f3_2749x312.png 1272w, https://substackcdn.com/image/fetch/$s_!dfvy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F78c760ae-995a-4dcf-bc4f-730b22d1e8f3_2749x312.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Let us work through a concrete example. 
Suppose the teacher outputs </p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;p=[0.8,0.1,0.1]&quot;,&quot;id&quot;:&quot;IBIGMJCODC&quot;}" data-component-name="LatexBlockToDOM"></div><p>and the student outputs </p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;q=[0.7,0.2,0.1]&quot;,&quot;id&quot;:&quot;QHPJTVLXCW&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;D_{\\text{KL}} = 0.8 \\log\\frac{0.8}{0.7} + 0.1 \\log\\frac{0.1}{0.2} + 0.1 \\log\\frac{0.1}{0.1}&quot;,&quot;id&quot;:&quot;PXAJWUJMZR&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;= 0.8 \\times 0.1335 + 0.1 \\times (-0.6931) + 0.1 \\times 0&quot;,&quot;id&quot;:&quot;XGNOBLWWVQ&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;= 0.1068 - 0.0693 + 0 = 0.0375&quot;,&quot;id&quot;:&quot;ZIFATDFIVU&quot;}" data-component-name="LatexBlockToDOM"></div><p>A small KL divergence indicates the student is already quite close to the teacher. 
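</p><p>The same arithmetic can be checked in a few lines of Python. This is a sketch rather than a reference implementation: <code>kl_divergence</code> is a hypothetical helper name, and it uses the natural logarithm to match the numbers above:</p>

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) = sum_i p_i * log(p_i / q_i), with the natural log."""
    # Skip terms where p_i is exactly zero; they contribute nothing to the sum.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi != 0)

teacher = [0.8, 0.1, 0.1]  # p in the worked example
student = [0.7, 0.2, 0.1]  # q in the worked example
print(round(kl_divergence(teacher, student), 4))  # 0.0375
```

<p>Passing a different student distribution to the same helper makes it easy to compare cases. 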
Now consider an overconfident student that outputs </p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;q=[0.95,0.025,0.025]&quot;,&quot;id&quot;:&quot;VVAYDIHPXC&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;D_{\\text{KL}} = 0.8 \\log\\frac{0.8}{0.95} + 0.1 \\log\\frac{0.1}{0.025} + 0.1 \\log\\frac{0.1}{0.025}&quot;,&quot;id&quot;:&quot;HBKDAKJRKQ&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;= 0.8 \\times (-0.1719) + 0.1 \\times 1.3863 + 0.1 \\times 1.3863&quot;,&quot;id&quot;:&quot;EHHGLBUOJX&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;= -0.1375 + 0.1386 + 0.1386 = 0.1397&quot;,&quot;id&quot;:&quot;NTRARGVTGO&quot;}" data-component-name="LatexBlockToDOM"></div><p>The KL divergence is larger because the student is more confident than the teacher. KL divergence heavily penalizes cases where the student assigns very low probability to classes that the teacher considers plausible.</p><blockquote><p><strong>NOTE</strong> </p><p>There is an important mathematical subtlety about KL divergence when p_i=0. Since</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\lim_{x \\to 0^+} x \\log(x) = 0&quot;,&quot;id&quot;:&quot;HZFCJHRRAM&quot;}" data-component-name="LatexBlockToDOM"></div><p>(which can be verified using L'H&#244;pital's rule), terms where p_i=0 contribute zero to the divergence. 
This means the KL divergence is well-defined even when the teacher assigns zero probability to some classes.</p></blockquote><p>The relationship between KL divergence and cross-entropy is:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;H(\\mathbf{p}, \\mathbf{q}) = H(\\mathbf{p}) + D_{\\text{KL}}(\\mathbf{p} \\| \\mathbf{q})&quot;,&quot;id&quot;:&quot;JLCKSAZKDM&quot;}" data-component-name="LatexBlockToDOM"></div><p>where</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;H(\\mathbf{p}, \\mathbf{q}) = -\\sum_i p_i \\log q_i&quot;,&quot;id&quot;:&quot;UNJOORLAXW&quot;}" data-component-name="LatexBlockToDOM"></div><p>is the cross-entropy and</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;H(\\mathbf{p}) = -\\sum_i p_i \\log p_i&quot;,&quot;id&quot;:&quot;FFLDNCGNDD&quot;}" data-component-name="LatexBlockToDOM"></div><p>is the entropy of the teacher distribution. Since the teacher is frozen, H(p) is a constant, and minimizing KL divergence is equivalent to minimizing the cross-entropy between teacher and student distributions.</p><h4><strong>The complete knowledge distillation loss</strong></h4><p>Putting it all together, the knowledge distillation loss from Hinton et al. 
(2015) combines two terms:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathcal{L}_{\\text{KD}} = (1 - \\alpha) \\cdot \\mathcal{L}_{\\text{CE}}\\big(\\mathbf{y},\\; \\sigma(\\mathbf{z}_s)\\big) + \\alpha \\cdot \\tau^2 \\cdot D_{\\text{KL}}\\!\\left(\\sigma\\!\\left(\\frac{\\mathbf{z}_t}{\\tau}\\right) \\;\\bigg\\|\\; \\sigma\\!\\left(\\frac{\\mathbf{z}_s}{\\tau}\\right)\\right)&quot;,&quot;id&quot;:&quot;MJZDQMHOHS&quot;}" data-component-name="LatexBlockToDOM"></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DQ5y!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0b7355b-0140-44be-aee4-9cb031eca6c8_2001x653.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DQ5y!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0b7355b-0140-44be-aee4-9cb031eca6c8_2001x653.png 424w, https://substackcdn.com/image/fetch/$s_!DQ5y!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0b7355b-0140-44be-aee4-9cb031eca6c8_2001x653.png 848w, https://substackcdn.com/image/fetch/$s_!DQ5y!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0b7355b-0140-44be-aee4-9cb031eca6c8_2001x653.png 1272w, https://substackcdn.com/image/fetch/$s_!DQ5y!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0b7355b-0140-44be-aee4-9cb031eca6c8_2001x653.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!DQ5y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0b7355b-0140-44be-aee4-9cb031eca6c8_2001x653.png" width="1456" height="475" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f0b7355b-0140-44be-aee4-9cb031eca6c8_2001x653.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:475,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:105151,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/192406256?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0b7355b-0140-44be-aee4-9cb031eca6c8_2001x653.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DQ5y!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0b7355b-0140-44be-aee4-9cb031eca6c8_2001x653.png 424w, https://substackcdn.com/image/fetch/$s_!DQ5y!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0b7355b-0140-44be-aee4-9cb031eca6c8_2001x653.png 848w, https://substackcdn.com/image/fetch/$s_!DQ5y!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0b7355b-0140-44be-aee4-9cb031eca6c8_2001x653.png 1272w, https://substackcdn.com/image/fetch/$s_!DQ5y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0b7355b-0140-44be-aee4-9cb031eca6c8_2001x653.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>The first term is the standard cross-entropy between the student&#8217;s predictions (at &#964;=1) and the ground-truth labels. This keeps the student honest: it must still learn to classify correctly.</p><p>The second term is the KL divergence between the teacher&#8217;s and student&#8217;s softened predictions (both at temperature &#964;). 
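</p><p>To make the two terms concrete, here is a hedged pure-Python sketch of the combined loss for a single example. The function names, the default values <code>alpha=0.5</code> and <code>tau=4.0</code>, and the student logits are all assumptions chosen for illustration; none of them come from the text or from a specific library:</p>

```python
import math

def softmax(logits, tau=1.0):
    """Temperature-scaled softmax over a list of logits (tau assumed positive)."""
    scaled = [z / tau for z in logits]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, true_class, alpha=0.5, tau=4.0):
    """(1 - alpha) * CE + alpha * tau^2 * KL(teacher || student) at temperature tau."""
    # Hard-label term: with a one-hot label, cross-entropy at tau = 1 is -log(p_c).
    hard = -math.log(softmax(student_logits)[true_class])
    # Soft-label term: KL divergence between the softened teacher and student outputs.
    p = softmax(teacher_logits, tau)
    q = softmax(student_logits, tau)
    soft = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return (1 - alpha) * hard + alpha * tau ** 2 * soft

# Hypothetical logits for three classes; the student values are made up.
teacher_logits = [5.0, 2.0, 1.0]
student_logits = [3.0, 2.5, 0.5]
print(kd_loss(student_logits, teacher_logits, true_class=0))
```

<p>In this sketch, <code>hard</code> is the first term and <code>soft</code> is the softened KL term. 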
This transfers the dark knowledge from teacher to student.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!P9ng!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff585be65-48a4-4304-bca1-6b3e033ae325_1644x1136.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!P9ng!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff585be65-48a4-4304-bca1-6b3e033ae325_1644x1136.png 424w, https://substackcdn.com/image/fetch/$s_!P9ng!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff585be65-48a4-4304-bca1-6b3e033ae325_1644x1136.png 848w, https://substackcdn.com/image/fetch/$s_!P9ng!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff585be65-48a4-4304-bca1-6b3e033ae325_1644x1136.png 1272w, https://substackcdn.com/image/fetch/$s_!P9ng!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff585be65-48a4-4304-bca1-6b3e033ae325_1644x1136.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!P9ng!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff585be65-48a4-4304-bca1-6b3e033ae325_1644x1136.png" width="1456" height="1006" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f585be65-48a4-4304-bca1-6b3e033ae325_1644x1136.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1006,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:139387,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/192406256?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff585be65-48a4-4304-bca1-6b3e033ae325_1644x1136.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!P9ng!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff585be65-48a4-4304-bca1-6b3e033ae325_1644x1136.png 424w, https://substackcdn.com/image/fetch/$s_!P9ng!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff585be65-48a4-4304-bca1-6b3e033ae325_1644x1136.png 848w, https://substackcdn.com/image/fetch/$s_!P9ng!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff585be65-48a4-4304-bca1-6b3e033ae325_1644x1136.png 1272w, https://substackcdn.com/image/fetch/$s_!P9ng!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff585be65-48a4-4304-bca1-6b3e033ae325_1644x1136.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em><strong>Figure 1.15</strong> The complete knowledge distillation framework. Training data feeds into both the teacher (large network, frozen) and the student (small network, being trained). The teacher produces soft labels via temperature-scaled softmax, while the ground truth provides hard labels. The student's total loss combines two components: the cross-entropy loss L_CE against hard labels, and the KL divergence loss L_KL against the teacher's soft labels. 
The formula</em></p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathcal{L} = (1-\\alpha)\\mathcal{L}_{\\text{CE}} + \\alpha \\tau^2 \\mathcal{L}_{\\text{KL}}&quot;,&quot;id&quot;:&quot;RRNBLPGOXE&quot;}" data-component-name="LatexBlockToDOM"></div><p><em>weights these components, and gradients flow back through the student to update its weights.</em></p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{Why the } \\tau^2 \\text{ factor?}\n&quot;,&quot;id&quot;:&quot;DQXKIRFICN&quot;}" data-component-name="LatexBlockToDOM"></div><p>You may wonder why the KL divergence term is multiplied by &#964;&#178;. This is not arbitrary; it is a mathematical necessity for keeping the gradient magnitudes balanced.</p><p>The gradient of the soft cross-entropy loss with respect to the student's logits z_{s,i} is:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\frac{\\partial \\mathcal{L}_{\\text{soft}}}{\\partial z_{s,i}} = \\frac{1}{\\tau}\\left(\\sigma\\!\\left(\\frac{z_{s,i}}{\\tau}\\right) - \\sigma\\!\\left(\\frac{z_{t,i}}{\\tau}\\right)\\right)&quot;,&quot;id&quot;:&quot;WCMMCAOEYE&quot;}" data-component-name="LatexBlockToDOM"></div><p>The temperature scaling introduces a factor of 1/<em>&#964;</em> in the gradient. Additionally, the softened probabilities themselves compress the differences between logits by another factor of approximately 1/<em>&#964;</em>, yielding an overall gradient scaling of approximately 1/<em>&#964;</em>&#178;.</p><p>Without compensation, the soft-label gradients would vanish as we increase temperature, defeating the purpose of softening. 
The &#964;&#178; multiplier restores the gradient magnitudes to be comparable with the hard-label loss, ensuring that the weighting coefficient &#945; behaves predictably regardless of the chosen temperature.</p><h4><strong>High-temperature approximation: why distillation equals logit matching</strong></h4><p>There is an elegant mathematical insight that connects distillation to a simpler concept. At high temperatures, the softmax can be approximated using a Taylor expansion:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\sigma\\!\\left(\\frac{z_i}{\\tau}\\right) \\approx \\frac{1}{K}\\left(1 + \\frac{z_i - \\bar{z}}{\\tau}\\right) + O\\!\\left(\\frac{1}{\\tau^2}\\right)&quot;,&quot;id&quot;:&quot;ISFGGKQKVD&quot;}" data-component-name="LatexBlockToDOM"></div><p>where</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\bar{z} = \\frac{1}{K}\\sum_j z_j&quot;,&quot;id&quot;:&quot;VMWQPEEBNI&quot;}" data-component-name="LatexBlockToDOM"></div><p>is the mean logit and K is the number of classes.</p><p>Substituting this into the KL divergence, assuming the student and teacher logits are both zero-mean, and keeping only the leading-order term, the soft-label loss reduces to:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathcal{L}_{\\text{soft}} \\approx \\frac{1}{2K\\tau^2}\\sum_{i=1}^{K} (z_{s,i} - z_{t,i})^2&quot;,&quot;id&quot;:&quot;MQFWETIOLR&quot;}" data-component-name="LatexBlockToDOM"></div><p>This is simply a scaled mean squared error between the logits of the student and teacher! At high temperature, knowledge distillation is approximately equivalent to logit matching, which was actually proposed in 2014 as a precursor to Hinton's method. The &#964;&#178; factor cancels the 1/&#964;&#178; scaling, confirming that the multiplication is necessary. 
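</p><p>This scaling can be checked numerically. The following is a minimal sketch, not code from the DeiT paper: the helper <code>soft_loss_grad_norm</code> and the random logits are illustrative assumptions. It measures the gradient norm of the soft-label KL term at two temperatures, with and without the &#964;&#178; multiplier.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">import torch
import torch.nn.functional as F


def soft_loss_grad_norm(z_s, z_t, tau, compensate):
    """Gradient norm of the soft-label KL term w.r.t. the student logits."""
    z = z_s.clone().requires_grad_(True)
    log_p_s = F.log_softmax(z / tau, dim=-1)          # student log-probabilities at temperature tau
    p_t = F.softmax(z_t / tau, dim=-1)                # teacher probabilities at temperature tau
    loss = F.kl_div(log_p_s, p_t, reduction="sum")    # KL(teacher || student)
    if compensate:
        loss = loss * tau ** 2                        # the tau^2 multiplier from the loss formula
    (grad,) = torch.autograd.grad(loss, z)
    return grad.norm().item()


torch.manual_seed(0)
z_s = torch.randn(10, dtype=torch.float64)   # hypothetical student logits, K = 10 classes
z_t = torch.randn(10, dtype=torch.float64)   # hypothetical teacher logits

for tau in (20.0, 200.0):
    raw = soft_loss_grad_norm(z_s, z_t, tau, compensate=False)
    fixed = soft_loss_grad_norm(z_s, z_t, tau, compensate=True)
    print(f"tau={tau:6.1f}  raw={raw:.3e}  compensated={fixed:.3e}")</code></pre></div><p>If the argument above holds, the uncompensated norm should shrink by roughly a factor of 100 when &#964; grows tenfold, while the compensated norm should stay nearly constant.</p><p>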
This result also reveals why dark knowledge transfer works: the student is effectively learning the teacher's internal ranking of all classes, not just its top prediction.</p><h2><strong>1.5 DeiT&#8217;s distillation: hard labels beat soft labels</strong></h2><p>Now that we understand the mechanics of knowledge distillation, let us see how DeiT adapts this framework. DeiT introduces two variants: soft distillation and hard distillation. Surprisingly, the simpler approach wins.</p><h3><strong>Soft distillation in DeiT</strong></h3><p>DeiT&#8217;s soft distillation uses the standard knowledge distillation framework, but with the distillation token producing separate logits:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathcal{L}_{\\text{DeiT-soft}} = (1-\\alpha) \\cdot \\mathcal{L}_{\\text{CE}}\\!\\left(\\mathbf{y},\\; \\sigma(\\mathbf{z}_s^{\\text{cls}})\\right) + \\alpha \\cdot \\tau^2 \\cdot D_{\\text{KL}}\\!\\left(\\sigma\\!\\left(\\frac{\\mathbf{z}_t}{\\tau}\\right) \\;\\bigg\\|\\; \\sigma\\!\\left(\\frac{\\mathbf{z}_s^{\\text{dist}}}{\\tau}\\right)\\right)&quot;,&quot;id&quot;:&quot;RKVPVXSCJO&quot;}" data-component-name="LatexBlockToDOM"></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HJni!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12a4c351-e538-40b1-8928-2a1c6032026c_2200x438.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HJni!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12a4c351-e538-40b1-8928-2a1c6032026c_2200x438.png 424w, 
https://substackcdn.com/image/fetch/$s_!HJni!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12a4c351-e538-40b1-8928-2a1c6032026c_2200x438.png 848w, https://substackcdn.com/image/fetch/$s_!HJni!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12a4c351-e538-40b1-8928-2a1c6032026c_2200x438.png 1272w, https://substackcdn.com/image/fetch/$s_!HJni!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12a4c351-e538-40b1-8928-2a1c6032026c_2200x438.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HJni!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12a4c351-e538-40b1-8928-2a1c6032026c_2200x438.png" width="1456" height="290" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/12a4c351-e538-40b1-8928-2a1c6032026c_2200x438.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:290,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:74452,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/192406256?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12a4c351-e538-40b1-8928-2a1c6032026c_2200x438.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!HJni!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12a4c351-e538-40b1-8928-2a1c6032026c_2200x438.png 424w, https://substackcdn.com/image/fetch/$s_!HJni!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12a4c351-e538-40b1-8928-2a1c6032026c_2200x438.png 848w, https://substackcdn.com/image/fetch/$s_!HJni!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12a4c351-e538-40b1-8928-2a1c6032026c_2200x438.png 1272w, https://substackcdn.com/image/fetch/$s_!HJni!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12a4c351-e538-40b1-8928-2a1c6032026c_2200x438.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>The CLS token head learns from ground-truth labels. 
The distillation token head learns to match the teacher&#8217;s full soft probability distribution.</p><h3><strong>Hard distillation in DeiT</strong></h3><p>Hard distillation replaces the KL divergence with a simple cross-entropy against the teacher&#8217;s hard prediction (argmax):</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathcal{L}_{\\text{DeiT-hard}} = \\frac{1}{2} \\cdot \\mathcal{L}_{\\text{CE}}\\!\\left(\\mathbf{y},\\; \\sigma(\\mathbf{z}_s^{\\text{cls}})\\right) + \\frac{1}{2} \\cdot \\mathcal{L}_{\\text{CE}}\\!\\left(y_t,\\; \\sigma(\\mathbf{z}_s^{\\text{dist}})\\right)&quot;,&quot;id&quot;:&quot;ZVLCSWNXYC&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{where } y_t = \\operatorname*{arg\\,max}_{c}\\; z_{t,c} \\text{ is the teacher's hard predicted class label.}&quot;,&quot;id&quot;:&quot;VHRNQBRIBQ&quot;}" data-component-name="LatexBlockToDOM"></div><p>This is remarkably simple: no temperature parameter, no KL divergence, no &#964;&#178; correction. Just two cross-entropy losses weighted equally. The CLS token learns from the ground truth, and the distillation token learns from the teacher&#8217;s top-1 prediction.</p><h3><strong>Why hard distillation works better</strong></h3><p>The DeiT paper found that hard distillation outperforms soft distillation by roughly 1.0 to 1.2 percentage points of top-1 accuracy on ImageNet. This was surprising because soft labels contain strictly more information than hard labels.</p><p>The authors hypothesize that this relates to the fundamental difference between the teacher (a CNN) and the student (a transformer). CNNs and transformers process visual features in fundamentally different ways: CNNs use local convolutional filters while transformers use global self-attention. 
When the teacher provides a full soft distribution, it reflects the CNN&#8217;s specific way of processing the image, which may not transfer well to the transformer&#8217;s very different processing pipeline. Hard labels abstract away these architectural details, providing a cleaner supervisory signal that is easier for the transformer to learn from.</p><p>Think of it like learning to cook. Soft distillation is like watching a master chef&#8217;s exact hand movements (which depend on their specific knife and grip style). Hard distillation is like reading their recipe (the end result, abstracted from the specific process). If you have different tools and a different grip, the recipe is more useful than mimicking movements that do not suit your tools.</p><h3><strong>DeiT results</strong></h3><p>The results are compelling. Figure 1.16 shows the performance-throughput trade-off for various models.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1XbR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14c180ec-60f9-47d3-acc1-d05cd56981b8_1554x1308.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1XbR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14c180ec-60f9-47d3-acc1-d05cd56981b8_1554x1308.png 424w, https://substackcdn.com/image/fetch/$s_!1XbR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14c180ec-60f9-47d3-acc1-d05cd56981b8_1554x1308.png 848w, 
https://substackcdn.com/image/fetch/$s_!1XbR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14c180ec-60f9-47d3-acc1-d05cd56981b8_1554x1308.png 1272w, https://substackcdn.com/image/fetch/$s_!1XbR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14c180ec-60f9-47d3-acc1-d05cd56981b8_1554x1308.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1XbR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14c180ec-60f9-47d3-acc1-d05cd56981b8_1554x1308.png" width="1456" height="1226" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/14c180ec-60f9-47d3-acc1-d05cd56981b8_1554x1308.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1226,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:239564,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/192406256?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14c180ec-60f9-47d3-acc1-d05cd56981b8_1554x1308.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!1XbR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14c180ec-60f9-47d3-acc1-d05cd56981b8_1554x1308.png 424w, 
https://substackcdn.com/image/fetch/$s_!1XbR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14c180ec-60f9-47d3-acc1-d05cd56981b8_1554x1308.png 848w, https://substackcdn.com/image/fetch/$s_!1XbR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14c180ec-60f9-47d3-acc1-d05cd56981b8_1554x1308.png 1272w, https://substackcdn.com/image/fetch/$s_!1XbR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14c180ec-60f9-47d3-acc1-d05cd56981b8_1554x1308.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em><strong>Figure 1.16</strong> Performance comparison between DeiT variants and competing architectures, plotting throughput (images per second, higher is better) against ImageNet top-1 accuracy (higher is better). DeiT-B with distillation (marked with the alembic symbol) achieves 85.2% accuracy, matching or exceeding EfficientNet while being significantly faster at inference. The key takeaway is that DeiT achieves competitive accuracy with ViT-level throughput, using only ImageNet-1K for training rather than the 300 million images that the original ViT required.</em></p><p>DeiT demonstrated that the Vision Transformer&#8217;s dependence on massive datasets was not an inherent limitation of the architecture, but a training problem with a training solution. By combining knowledge distillation with a dedicated distillation token, DeiT gave the transformer a way to absorb the inductive biases of a CNN teacher without modifying the transformer architecture itself. The CLS token learns from ground truth labels while the distillation token learns from the teacher&#8217;s predictions, and the two complementary signals together produce a model that generalizes better than either supervision source alone. This insight, that architectural inductive biases can be transferred rather than engineered, opened the door to practical vision transformers that anyone with a single multi-GPU machine could train.</p><h2><strong>1.6 Building DeiT from scratch in PyTorch</strong></h2><p><strong>Data Efficient Image Transformer Code is available below</strong></p><p><a href="https://github.com/VizuaraAI/Transformers-for-vision-BOOK">https://github.com/VizuaraAI/Transformers-for-vision-BOOK</a></p><p><strong>[Don&#8217;t forget to star the code repo!]</strong></p><p>With the theory firmly established, let us now build a complete DeiT implementation from scratch. 
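</p><p>As a compact reference for the implementation that follows, the two DeiT objectives from section 1.5 can be sketched in PyTorch. This is a minimal sketch under stated assumptions: the function names are hypothetical, the default <code>alpha</code> and <code>tau</code> values are placeholders, and the random tensors stand in for real model outputs.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">import torch
import torch.nn.functional as F


def soft_distillation_loss(cls_logits, dist_logits, teacher_logits, labels,
                           alpha=0.1, tau=4.0):
    """(1 - alpha) * CE(labels, cls head) + alpha * tau^2 * KL(teacher || dist head)."""
    ce = F.cross_entropy(cls_logits, labels)            # hard-label term on the CLS head
    kl = F.kl_div(
        F.log_softmax(dist_logits / tau, dim=-1),       # student (distillation head), softened
        F.softmax(teacher_logits / tau, dim=-1),        # teacher, softened
        reduction="batchmean",
    )
    return (1 - alpha) * ce + alpha * tau ** 2 * kl


def hard_distillation_loss(cls_logits, dist_logits, teacher_logits, labels):
    """0.5 * CE(labels, cls head) + 0.5 * CE(teacher argmax, dist head)."""
    teacher_labels = teacher_logits.argmax(dim=-1)      # teacher's top-1 class per sample
    return (0.5 * F.cross_entropy(cls_logits, labels)
            + 0.5 * F.cross_entropy(dist_logits, teacher_labels))


# Random stand-ins for a batch of 12 samples over 10 classes (illustrative only).
torch.manual_seed(0)
cls_logits, dist_logits, teacher_logits = torch.randn(3, 12, 10)
labels = torch.randint(0, 10, (12,))
print(soft_distillation_loss(cls_logits, dist_logits, teacher_logits, labels))
print(hard_distillation_loss(cls_logits, dist_logits, teacher_logits, labels))</code></pre></div><p>The soft variant needs the teacher&#8217;s full logits and two extra hyperparameters; the hard variant needs only the teacher&#8217;s argmax, which is why it has no temperature or &#964;&#178; term.</p><p>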
We will train a small-scale version on MNIST digits to demonstrate all the key concepts: patch embedding, the distillation token, the teacher-student setup, and the knowledge distillation loss. While MNIST is far simpler than ImageNet, it lets us see the entire pipeline working end-to-end on a single machine.</p><h3><strong>Setting up imports and configuration</strong></h3><p>We begin by importing the necessary libraries and defining our hyperparameters. We will use PyTorch for the implementation, torchvision for datasets and the pre-trained teacher model, and NumPy for utility operations.</p><p><strong>Listing 1.1 Importing libraries and setting up the device</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;6b81e808-ca4c-42dd-8887-7ae20a681a8e&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
from torch.utils.data import Subset, DataLoader
from torchvision import transforms, datasets, models
import numpy as np
import matplotlib.pyplot as plt

device = 'cuda' if torch.cuda.is_available() else 'cpu' #A</code></pre></div><p><code>#A Automatically selects GPU if available, otherwise falls back to CPU</code></p><p>Next, we define the hyperparameters that control our model and training process. These values are deliberately small compared to the full DeiT paper to allow rapid experimentation on a personal machine.</p><p><strong>Listing 1.2 Defining hyperparameters</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;1e2cf826-66cc-4a51-95ff-29218c27a1fd&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">BATCH_SIZE = 12
ATTENTION_HEADS = 4          #A
TRANSFORMER_LAYERS = 4       #B
EMBED_DIM = 16               #C
IMG_SIZE = 28                #D
PATCH_SIZE = 7               #E
CLASSES = 10                 #F
EPOCHS_STUDENT = 10
LR_STUDENT = 1e-4
TEMPERATURE = 4              #G
ALPHA = 0.1                  #H
CHANNELS = 3                 #I</code></pre></div><p><code>#A Number of attention heads in each transformer layer </code></p><p><code>#B Number of stacked transformer encoder layers </code></p><p><code>#C Embedding dimension for patch tokens (small for demonstration) </code></p><p><code>#D MNIST images are 28x28 pixels </code></p><p><code>#E Each patch is 7x7 pixels, giving us (28/7)^2 = 16 patches </code></p><p><code>#F MNIST has 10 digit classes (0--9) </code></p><p><code>#G Temperature for softening the teacher's probability distribution </code></p><p><code>#H Weight for the KL divergence term in the distillation loss </code></p><p><code>#I Number of input channels (expanded from 1 to 3 for CNN compatibility)</code></p><p>Let us understand the patch arithmetic. With 28x28 images and 7x7 patches, each image is divided into </p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;(28/7)^2=16&quot;,&quot;id&quot;:&quot;HCJUONUITT&quot;}" data-component-name="LatexBlockToDOM"></div><p>non-overlapping patches. Each patch is flattened and linearly projected into a 16-dimensional embedding. Together with the CLS token and distillation token, this gives us a sequence of 16+2=18 tokens.</p><h3><strong>Preparing the data</strong></h3><p>MNIST images are grayscale (1 channel), but our teacher model (ResNet50) expects 3-channel RGB inputs. We handle this by repeating the single channel three times. We also use only 10% of the training set to simulate a data-scarce scenario, which is the exact setting where DeiT&#8217;s distillation approach shines.</p><p><strong>Listing 1.3 Loading and preparing the MNIST dataset</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;f5abd7c4-2602-417e-83ab-e97308a6f014&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">tfm = transforms.Compose([
    transforms.ToTensor(),
    transforms.Lambda(lambda t: t.repeat(3, 1, 1)),    #A
])

train_full = datasets.MNIST('./data', train=True, download=True, transform=tfm)
test = datasets.MNIST('./data', train=False, download=True, transform=tfm)

n = int(0.1 * len(train_full))                         #B
subset_idx = np.random.permutation(len(train_full))[:n]
train = Subset(train_full, subset_idx)

train_dl = DataLoader(train, batch_size=BATCH_SIZE, shuffle=True)
test_dl = DataLoader(test, batch_size=BATCH_SIZE)</code></pre></div><p><code>#A Converts single-channel grayscale to 3-channel by repeating, making it compatible with the ResNet teacher</code></p><p><code>#B Uses only 10% of training data (6,000 images from 60,000) to simulate data scarcity</code></p><p>Using only 6,000 training images is deliberately challenging. This mimics the real-world scenario that motivated DeiT: how do we train a Vision Transformer effectively when data is limited? Knowledge distillation is the answer.</p><h3><strong>Setting up the teacher model</strong></h3><p>Our teacher is a pre-trained ResNet50, one of the most well-known CNN architectures. We load it with ImageNet pre-trained weights and modify the final classification layer to output 10 classes (for MNIST digits) instead of the original 1,000 ImageNet classes.</p><p>Critically, we <em>freeze</em> all the teacher&#8217;s parameters except the final classification layer. The teacher&#8217;s convolutional feature extractors already understand visual patterns from ImageNet pre-training. We only need to adapt the final layer to map those features to our 10 digit classes.</p><p><strong>Listing 1.4 Setting up the pre-trained CNN teacher</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;5c76b5eb-ab6a-465d-b5c3-84f515a15646&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">teacher = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
teacher.fc = nn.Linear(teacher.fc.in_features, CLASSES)    #A
teacher.to(device)

for param in teacher.parameters():                         #B
    param.requires_grad = False

for param in teacher.fc.parameters():                      #C
    param.requires_grad = True</code></pre></div><p><code>#A Replaces the 1000-class ImageNet head with a 10-class head for MNIST</code></p><p><code>#B Freezes all layers: the convolutional backbone will not be updated</code></p><p><code>#C Unfreezes only the final classification layer so it can learn MNIST-specific mappings</code></p><p>This setup follows the spirit of the DeiT paper&#8217;s approach: the teacher is a powerful CNN that has already learned rich visual representations. By freezing its backbone, we keep the ImageNet-trained features intact, and those features reflect the CNN&#8217;s built-in inductive biases (locality, translation equivariance, hierarchical features). The student will learn to mimic these capabilities through distillation.</p><h3><strong>Building the student Vision Transformer</strong></h3><p>Now we build the student model: a Vision Transformer with a distillation token. This is the heart of the DeiT architecture. Let us break it into two components: the patch embedding layer and the full ViT model.</p><p><strong>Listing 1.5 Patch embedding module</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;f76e2ce9-09eb-4136-a3be-f6400a14a74e&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">class PatchEmbed(nn.Module):
    def __init__(self, img_size=IMG_SIZE, patch=PATCH_SIZE,
                 dim=EMBED_DIM, channels=CHANNELS):
        super().__init__()
        self.proj = nn.Conv2d(channels, dim, patch, patch)  #A
        self.n = (img_size // patch) ** 2                   #B

    def forward(self, x):
        x = self.proj(x)           #C
        x = x.flatten(2)           #D
        x = x.transpose(1, 2)     #E
        return x</code></pre></div><p><code>#A A Conv2d with kernel_size=patch_size and stride=patch_size extracts non-overlapping patches and projects them to the embedding dimension in one operation</code></p><p><code>#B Computes the number of patches: (28/7)^2 = 16 patches</code></p><p><code>#C Applies the convolution: (B, 3, 28, 28) -&gt; (B, 16, 4, 4), that is (batch, embed_dim, grid_h, grid_w)</code></p><p><code>#D Flattens the spatial dimensions: (B, 16, 4, 4) -&gt; (B, 16, 16), that is (batch, embed_dim, num_patches)</code></p><p><code>#E Swaps the last two axes to get sequence format: (B, embed_dim, num_patches) -&gt; (B, num_patches, embed_dim); both are 16 here, so the shape stays (B, 16, 16), but the first 16 is now the sequence length and the second 16 is the embedding dim</code></p><p>The patch embedding uses a single convolution to simultaneously extract patches and project them into the embedding space. This is mathematically equivalent to extracting each patch, flattening it, and multiplying by a weight matrix, but it is computationally more efficient.</p><p>Now let us build the full ViT model with the distillation token:</p><p><strong>Listing 1.6 The DeiT student model with distillation token</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;29f94a07-37d0-46db-9913-12e3c3de7db8&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">class ViT(nn.Module):
    def __init__(self, num_classes=CLASSES, dim=EMBED_DIM,
                 depth=TRANSFORMER_LAYERS, heads=ATTENTION_HEADS):
        super().__init__()
        self.patch = PatchEmbed()
        n = self.patch.n

        self.cls = nn.Parameter(torch.zeros(1, 1, dim))      #A
        self.distill = nn.Parameter(torch.zeros(1, 1, dim))   #B
        self.pos = nn.Parameter(torch.zeros(1, n + 2, dim))   #C

        self.blocks = nn.Sequential(*[                        #D
            nn.TransformerEncoderLayer(
                dim, heads, dim * 4, batch_first=True
            )
            for _ in range(depth)
        ])

        self.norm = nn.LayerNorm(dim)                         #E
        self.head_cls = nn.Linear(dim, num_classes)           #F
        self.head_dist = nn.Linear(dim, num_classes)          #G

    def forward(self, x):
        B = x.size(0)
        x = self.patch(x)                                     #H

        cls = self.cls.expand(B, -1, -1)                      #I
        dist = self.distill.expand(B, -1, -1)

        x = torch.cat([cls, x, dist], dim=1) + self.pos      #J

        x = self.blocks(x)                                    #K
        x = self.norm(x)

        cls_out = x[:, 0]                                     #L
        dist_out = x[:, -1]                                   #M

        cls_logits = self.head_cls(cls_out)                    #N
        dist_logits = self.head_dist(dist_out)

        return cls_logits, dist_logits
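
# --- Illustrative aside: an addition to this listing, not part of the original code ---
# A minimal sketch of the token bookkeeping in steps #I, #J, #L and #M, using stand-in
# tensors instead of the real model. The sizes (batch of 2, 16 patches, 16-dim embeddings)
# are assumptions taken from the annotations; the fill values 1.0 and 2.0 are hypothetical
# markers so the CLS and DIST positions can be told apart.
import torch  # already imported in the full script; repeated so the aside runs on its own

_B, _n, _dim = 2, 16, 16
_cls = torch.full((1, 1, _dim), 1.0).expand(_B, -1, -1)    # (1, 1, 16) broadcast to (2, 1, 16)
_dist = torch.full((1, 1, _dim), 2.0).expand(_B, -1, -1)   # same shape, different marker value
_patches = torch.zeros(_B, _n, _dim)                       # stand-in for the PatchEmbed output
_seq = torch.cat([_cls, _patches, _dist], dim=1)           # CLS + patches + DIST along the sequence axis
assert _seq.shape == (_B, _n + 2, _dim)                    # n + 2 positions, matching self.pos
assert torch.equal(_seq[:, 0], _cls[:, 0])                 # index 0 holds the CLS token
assert torch.equal(_seq[:, -1], _dist[:, 0])               # index -1 holds the distillation token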

student = ViT().to(device)
opt_s = torch.optim.AdamW(student.parameters(), lr=LR_STUDENT)</code></pre></div><p><code>#A Learnable CLS token: initialized to zeros, shape (1, 1, dim) </code></p><p><code>#B Learnable distillation token: the key DeiT innovation, also initialized to zeros</code></p><p><code>#C Positional embeddings for all tokens: n patches + CLS + DIST = 18 positions </code></p><p><code>#D Stack of transformer encoder layers, each with multi-head self-attention and FFN </code></p><p><code>#E Layer normalization applied after the final transformer block</code></p><p><code>#F Classification head for the CLS token (trained against ground truth)</code></p><p><code>#G Separate classification head for the distillation token (trained against teacher)</code></p><p><code>#H Convert image to patch embeddings: (B, 3, 28, 28) -&gt; (B, 16, 16) </code></p><p><code>#I Expand special tokens to match batch size: (1, 1, 16) -&gt; (B, 1, 16) </code></p><p><code>#J Concatenate CLS + patches + DIST and add positional embeddings: (B, 18, 16) </code></p><p><code>#K Pass through all transformer encoder layers </code></p><p><code>#L Extract the CLS token output (first position) </code></p><p><code>#M Extract the distillation token output (last position)</code></p><p><code>#N Produce class logits from each token through separate linear heads</code></p><p>There are several important details to notice in this implementation:</p><ol><li><p><strong>Two separate learnable tokens.</strong> Both <code>self.cls</code> and <code>self.distill</code> are <code>nn.Parameter</code> objects initialized to zeros. During training, backpropagation will update them to learn useful representations. Despite identical initialization, they will diverge because they receive different gradient signals (ground-truth loss vs. 
teacher loss).</p></li><li><p><strong>Positional embeddings.</strong> The <code>self.pos</code> tensor has shape <code>(1, n+2, dim)</code>, providing a unique positional encoding for each of the 18 positions. This is essential because, unlike CNNs, the transformer has no built-in notion of spatial arrangement.</p></li><li><p><strong>Two output heads.</strong> The <code>head_cls</code> and <code>head_dist</code> are separate linear layers that map the final token representations to class logits. Each head receives its own supervision signal during training.</p></li><li><p><strong>Token placement.</strong> The CLS token is placed at position 0 and the distillation token at the last position. This is a convention; both tokens interact with all patch tokens through self-attention regardless of their position.</p></li></ol><h3><strong>Implementing the knowledge distillation loss</strong></h3><p>Now we implement the loss function that drives the distillation process. This is where the mathematics we developed earlier becomes code.</p><p><strong>Listing 1.7 Knowledge distillation loss function</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;eac89e19-851b-4fd4-a718-434988f7f439&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">def kd_loss(s_logits, t_logits, y, T=TEMPERATURE, alpha=ALPHA):
    kd = F.kl_div(                                        #A
        F.log_softmax(s_logits / T, dim=1),               #B
        F.softmax(t_logits / T, dim=1),                   #C
        reduction='batchmean'
    ) * (T * T)                                           #D
    ce = F.cross_entropy(s_logits, y)                     #E
    return alpha * kd + (1 - alpha) * ce                  #F</code></pre></div><p><code>#A KL divergence between the student&#8217;s and teacher&#8217;s softened distributions</code></p><p><code>#B Student&#8217;s log-probabilities at temperature T (PyTorch&#8217;s kl_div expects log-probabilities for the first argument, hence log_softmax rather than softmax)</code></p><p><code>#C Teacher&#8217;s probabilities at temperature T (target distribution)</code></p><p><code>#D Multiply by T^2 to compensate for the reduced gradient magnitude</code></p><p><code>#E Standard cross-entropy between student predictions and ground-truth labels</code></p><p><code>#F Weighted combination: alpha controls the distillation-vs-classification balance</code></p><p>Let us trace through this function to make sure we understand each step. Given student logits z_s, teacher logits z_t, ground-truth labels y, temperature T=4, and &#945;=0.1:</p><ol><li><p><code>F.log_softmax(s_logits / T, dim=1)</code> computes log &#963;(z_s/4): the student&#8217;s softened log-probabilities</p></li><li><p><code>F.softmax(t_logits / T, dim=1)</code> computes &#963;(z_t/4): the teacher&#8217;s softened probabilities</p></li><li><p><code>F.kl_div(...)</code> computes D_KL(teacher&#8741;student), the divergence between these distributions</p></li><li><p>Multiplying by T^2=16 compensates for the 1/T^2 factor that temperature scaling introduces into the gradients of the KL term</p></li><li><p><code>F.cross_entropy(s_logits, y)</code> computes L_CE on the unsoftened logits (&#964;=1)</p></li><li><p>The final loss is 0.1&#215;(KL term)+0.9&#215;(CE term)</p></li></ol><p>With &#945;=0.1, we weight the cross-entropy heavily (90%) and the distillation lightly (10%). This means the student primarily learns from the ground-truth labels, with the teacher&#8217;s knowledge providing supplementary guidance.</p><h3><strong>Training the student</strong></h3><p>With all components in place, we can now train the student model. 
The training loop follows the standard PyTorch pattern, but with the crucial addition of generating teacher predictions on the fly.</p><p><strong>Listing 1.8 Training the DeiT student with knowledge distillation</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;3b438716-534e-4818-bd52-15ce71d48868&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">print("Training student...")
for e in range(EPOCHS_STUDENT):
    student.train()
    for x, y in train_dl:
        x, y = x.to(device), y.to(device)

        with torch.no_grad():                            #A
            t_logits = teacher(x)

        cls_logits, dist_logits = student(x)             #B
        loss_cls = F.cross_entropy(cls_logits, y)
        loss_distill = kd_loss(dist_logits, t_logits, y) #C
        loss = loss_distill + loss_cls

        opt_s.zero_grad()                                #D
        loss.backward()
        opt_s.step()

    print(f"Epoch {e+1} done")</code></pre></div><p>Output</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;da716aba-706f-47fe-9b32-ca30a75b5158&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">Training student...
Epoch 1 done
Epoch 2 done
Epoch 3 done
Epoch 4 done
Epoch 5 done
Epoch 6 done
Epoch 7 done
Epoch 8 done
Epoch 9 done
Epoch 10 done</code></pre></div><p><code>#A Teacher inference with no gradient tracking: the teacher is frozen and never updated</code></p><p><code>#B Student forward pass returns two sets of logits (CLS and distillation)</code></p><p><code>#C CLS loss trains the classification head with hard labels; distillation loss aligns the DIST token's logits with the teacher's soft logits. The total loss combines both.</code></p><p><code>#D Standard optimization step: zero gradients, backpropagate, update weights</code></p><p>There is an important detail to highlight: we pass <code>dist_logits</code> (from the distillation token) to <code>kd_loss</code>, not <code>cls_logits</code>. The distillation token is specifically designed to learn from the teacher. As in a full DeiT implementation, the CLS token gets its own cross-entropy loss against the ground truth (<code>loss_cls</code>), which is added to the distillation loss. Our version differs in one respect: <code>kd_loss</code> also mixes a cross-entropy term on the distillation logits, weighted by 1 &#8722; alpha, into its output, so the distillation token sees the hard labels as well as the teacher.</p><p>Notice the <code>torch.no_grad()</code> context manager around the teacher&#8217;s forward pass. Since the teacher is frozen, we do not need to track gradients for its computations. This saves memory and computation: we only backpropagate through the student.</p><h3><strong>Evaluating the trained model</strong></h3><p>After training, we evaluate the student by combining the predictions from both tokens. At inference time, the CLS and distillation tokens each produce their own logits, and we average the two sets of logits to get the final prediction.</p><p><strong>Listing 1.9 Evaluating the DeiT student</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;f6287c11-ff2a-4496-941c-bf9af93228fa&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">student.eval()
correct = 0
total = 0
samples = []

with torch.no_grad():                                     #A
    for x, y in test_dl:
        x, y = x.to(device), y.to(device)
        cls_logits, dist_logits = student(x)

        cls_dist = (cls_logits + dist_logits) / 2          #B
        pred = cls_dist.argmax(1)                          #C

        correct += (pred == y).sum().item()
        total += y.size(0)

        if len(samples) &lt; 15:
            samples.append((x.cpu(), pred.cpu(), y.cpu()))

acc = 100 * correct / total
print(f"Test Accuracy: {acc:.2f}%")</code></pre></div><p><code>#A No gradient computation needed during evaluation</code></p><p><code>#B Average the logits from both tokens: this combines the ground-truth-informed CLS prediction with the teacher-informed distillation prediction</code></p><p><code>#C Take the class with highest average logit as the final prediction</code></p><p>The averaging of CLS and distillation logits at inference is a key DeiT design choice. Each token has learned a different perspective on the input: the CLS token is optimized for ground-truth classification, while the distillation token is optimized for mimicking the teacher. Combining them yields a prediction that benefits from both information sources.</p><p>Finally, we can visualize some predictions to qualitatively assess the model:</p><p><strong>Listing 1.10 Displaying sample predictions</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;bb6edd03-c4b6-4413-8488-60704d52120b&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">fig, axs = plt.subplots(1, len(samples), figsize=(12, 3))
for i, (img, pred, true) in enumerate(samples):
    img = img[0].permute(1, 2, 0).numpy()                 #A
    axs[i].imshow(img)
    axs[i].set_title(f"P:{pred[0].item()} T:{true[0].item()}")  #B
    axs[i].axis('off')
plt.show()</code></pre></div><p>Output</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!uc4G!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F316977d1-9fea-4cb4-b7b6-8f21b2f2a4be_954x94.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!uc4G!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F316977d1-9fea-4cb4-b7b6-8f21b2f2a4be_954x94.png 424w, https://substackcdn.com/image/fetch/$s_!uc4G!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F316977d1-9fea-4cb4-b7b6-8f21b2f2a4be_954x94.png 848w, https://substackcdn.com/image/fetch/$s_!uc4G!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F316977d1-9fea-4cb4-b7b6-8f21b2f2a4be_954x94.png 1272w, https://substackcdn.com/image/fetch/$s_!uc4G!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F316977d1-9fea-4cb4-b7b6-8f21b2f2a4be_954x94.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!uc4G!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F316977d1-9fea-4cb4-b7b6-8f21b2f2a4be_954x94.png" width="954" height="94" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/316977d1-9fea-4cb4-b7b6-8f21b2f2a4be_954x94.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:94,&quot;width&quot;:954,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!uc4G!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F316977d1-9fea-4cb4-b7b6-8f21b2f2a4be_954x94.png 424w, https://substackcdn.com/image/fetch/$s_!uc4G!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F316977d1-9fea-4cb4-b7b6-8f21b2f2a4be_954x94.png 848w, https://substackcdn.com/image/fetch/$s_!uc4G!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F316977d1-9fea-4cb4-b7b6-8f21b2f2a4be_954x94.png 1272w, https://substackcdn.com/image/fetch/$s_!uc4G!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F316977d1-9fea-4cb4-b7b6-8f21b2f2a4be_954x94.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><code>#A Convert from (C, H, W) tensor format to (H, W, C) NumPy array for matplotlib</code></p><p><code>#B Display the predicted label (P) and true label (T) for each sample</code></p><p>With this minimal implementation, we achieve approximately 94% accuracy on MNIST using only 6,000 training images and 100 epochs in just 18 mins of training on T4 GPU. 
While this is modest compared to state-of-the-art results, remember that we are using an extremely small model (16-dimensional embeddings, 4 layers, 4 heads) with very little data and training time. On the full ImageNet dataset with the proper hyperparameters from section 1.5, DeiT-B achieves 83.4% top-1 accuracy, competitive with models trained on 250 times more data.</p><p>The code demonstrates all the core DeiT concepts:</p><ul><li><p>Patch embedding via convolution</p></li><li><p>CLS and distillation tokens as learnable parameters</p></li><li><p>Dual classification heads with separate objectives</p></li><li><p>Knowledge distillation loss combining cross-entropy and KL divergence</p></li><li><p>Temperature scaling for softening probability distributions</p></li><li><p>Averaging token predictions at inference</p></li></ul><h2><em><strong>Summary</strong></em></h2><ul><li><p><strong>Inductive biases</strong> are built-in architectural assumptions that shape how a model learns. CNNs have strong inductive biases (locality and translation equivariance) that make them data-efficient but limit their ability to capture long-range dependencies</p></li><li><p><strong>Vision Transformers lack these biases</strong>, processing all patches globally through self-attention. This flexibility becomes an advantage with enough data (300 million images) but a severe limitation with standard datasets (1.2 million images), causing overfitting and poor generalization</p></li><li><p><strong>Scaling laws</strong> show that model performance follows predictable power-law relationships with model size, dataset size, and compute. These laws hold for both language models and Vision Transformers, confirming that scale can compensate for lack of inductive bias</p></li><li><p><strong>Knowledge distillation</strong> transfers knowledge from a large teacher model to a smaller student by training the student on the teacher&#8217;s soft probability outputs rather than just hard labels. 
The teacher&#8217;s soft outputs contain <em>dark knowledge</em> (inter-class similarity information) that helps the student learn more efficiently</p></li><li><p><strong>Temperature scaling</strong> softens the teacher&#8217;s probability distribution, amplifying dark knowledge. The softmax &#963;(z_i/&#964;) controls the softness, and the &#964;^2 factor in the loss compensates for reduced gradient magnitudes</p></li><li><p><strong>DeiT introduces a distillation token</strong> alongside the standard CLS token, creating separate pathways for ground-truth learning and teacher imitation. The two tokens develop complementary representations that are combined at inference</p></li><li><p><strong>Hard distillation outperforms soft distillation</strong> in DeiT, achieving approximately 1% higher accuracy on ImageNet. This surprising result is attributed to the architectural mismatch between CNN teachers and transformer students</p></li><li><p><strong>DeiT achieves 83.4% top-1 accuracy</strong> on ImageNet using only ImageNet-1K (1.2 million images), matching models trained on 250 times more data, and can be trained in approximately 53 hours on a single 8-GPU node</p></li></ul><p></p><h1>Resources</h1><p><strong>Original Paper</strong></p><p><a href="https://arxiv.org/pdf/2012.12877">https://arxiv.org/pdf/2012.12877</a></p><p><strong>Dr <a href="https://www.linkedin.com/in/sreedath-panat/">Sreedath Panat</a></strong> <strong>has amazing videos on the same topic.</strong></p><div id="youtube2-7Cmw3D5zEdk" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;7Cmw3D5zEdk&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/7Cmw3D5zEdk?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" 
height="409"></iframe></div></div><div id="youtube2-d6EaVdjsCHI" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;d6EaVdjsCHI&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/d6EaVdjsCHI?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><h1>Some More Substacks</h1><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;e243957e-14b1-4eaf-a732-30e93fd45825&quot;,&quot;caption&quot;:&quot;Figure 0: Detailed Architecture of the Segment Anything Model (SAM).&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;lg&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Segment Anything Model (SAM)&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:321073573,&quot;name&quot;:&quot;Mayank Pratap Singh&quot;,&quot;bio&quot;:&quot;I&#8217;m Mayank, an AI and LLM enthusiast, and this is my space to breaking down complex concepts into accessible insights&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/41e709c7-9629-4748-93e5-e3014c16fb57_460x460.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null},{&quot;id&quot;:136642032,&quot;name&quot;:&quot;Sreedath Panat&quot;,&quot;bio&quot;:&quot;I am the co-founder of Vizuara AI Labs and a PhD from MIT. 
I use this space to put down my thoughts and knowledge on AI/ML related topics.&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3fa047b9-4cee-4d9d-8ed7-4a63c5f919b4_974x1220.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2026-01-20T09:19:46.533Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2ea6440e-c81a-4e4e-b357-db44820234f5_1920x1278.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://www.vizuaranewsletter.com/p/segment-anything-model-sam&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:184705881,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:16,&quot;comment_count&quot;:0,&quot;publication_id&quot;:3591997,&quot;publication_name&quot;:&quot;Vizuara&#8217;s AI Newsletter&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!f_Wd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59481c77-2230-4e2f-b78b-88e9ee5fe9d9_1088x1088.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;5e6c7e23-a66e-40fb-83b9-73c3fd415385&quot;,&quot;caption&quot;:&quot;Table Of Content&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;lg&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Detection Transformer (DETR): An introduction&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:321073573,&quot;name&quot;:&quot;Mayank Pratap Singh&quot;,&quot;bio&quot;:&quot;I&#8217;m Mayank, an AI and LLM enthusiast, and this is my space to breaking down complex concepts into accessible 
insights&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/41e709c7-9629-4748-93e5-e3014c16fb57_460x460.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null},{&quot;id&quot;:136642032,&quot;name&quot;:&quot;Sreedath Panat&quot;,&quot;bio&quot;:&quot;I am the co-founder of Vizuara AI Labs and a PhD from MIT. I use this space to put down my thoughts and knowledge on AI/ML related topics.&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3fa047b9-4cee-4d9d-8ed7-4a63c5f919b4_974x1220.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2026-01-15T08:40:59.104Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/$s_!M0HN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32500ddf-8e94-4494-84aa-6335fa46ef07_786x438.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://www.vizuaranewsletter.com/p/detection-transformer-detr-an-introduction&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:183945695,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:10,&quot;comment_count&quot;:0,&quot;publication_id&quot;:3591997,&quot;publication_name&quot;:&quot;Vizuara&#8217;s AI Newsletter&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!f_Wd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59481c77-2230-4e2f-b78b-88e9ee5fe9d9_1088x1088.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;c46463cd-96e3-4e1e-ae08-9941f613ebe9&quot;,&quot;caption&quot;:&quot;Table of Content&quot;,&quot;cta&quot;:&quot;Read full 
story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;lg&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;An beginners introduction to Swin transformer&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:136642032,&quot;name&quot;:&quot;Sreedath Panat&quot;,&quot;bio&quot;:&quot;I am the co-founder of Vizuara AI Labs and a PhD from MIT. I use this space to put down my thoughts and knowledge on AI/ML related topics.&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3fa047b9-4cee-4d9d-8ed7-4a63c5f919b4_974x1220.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null},{&quot;id&quot;:321073573,&quot;name&quot;:&quot;Mayank Pratap Singh&quot;,&quot;bio&quot;:&quot;I&#8217;m Mayank, an AI and LLM enthusiast, and this is my space to breaking down complex concepts into accessible insights&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/41e709c7-9629-4748-93e5-e3014c16fb57_460x460.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2026-01-13T09:20:10.516Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/$s_!X8hk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F791e917c-b14b-4506-b09b-70c612d3c888_1533x510.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://www.vizuaranewsletter.com/p/an-beginners-introduction-to-swin&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:183324523,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:9,&quot;comment_count&quot;:0,&quot;publication_id&quot;:3591997,&quot;publication_name&quot;:&quot;Vizuara&#8217;s AI 
Newsletter&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!f_Wd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59481c77-2230-4e2f-b78b-88e9ee5fe9d9_1088x1088.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><p>I&#8217;m also building Audio Deep Learning projects and Exploring and Finetuning different tts,sst models, sharing and discussing them on LinkedIn and Twitter. If you&#8217;re someone curious about these topics, I&#8217;d love to connect with you all!</p><p><strong>Mayank Pratap Singh</strong></p><p><strong>LinkedIn</strong> : <a href="https://www.linkedin.com/in/mayankpratapsingh022/">www.linkedin.com/in/mayankpratapsingh022</a></p><p><strong>Twitter/X</strong> : <a href="https://x.com/Mayank_022">x.com/Mayank_022</a>.</p><p></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.vizuaranewsletter.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Vizuara&#8217;s AI Newsletter! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[TurboQuant: The Surprisingly Simple Trick That's Changing How We Compress LLMs]]></title><description><![CDATA[A first-principles walkthrough of the key idea behind Google's viral quantization paper]]></description><link>https://www.vizuaranewsletter.com/p/turboquant-the-surprisingly-simple</link><guid isPermaLink="false">https://www.vizuaranewsletter.com/p/turboquant-the-surprisingly-simple</guid><dc:creator><![CDATA[Vizuara AI Labs]]></dc:creator><pubDate>Tue, 07 Apr 2026 11:39:20 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!haId!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb2f3d1a-77f5-4863-b66b-74a498558a1b_2752x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em><strong>Table of Contents:</strong></em></p><ol><li><p><em>What Is Quantization, Really?</em></p></li><li><p><em>The Outlier Problem</em></p></li><li><p><em>The Key Insight &#8212; Just Rotate It</em></p></li><li><p><em>A Concrete Example</em></p></li><li><p><em>The Second Trick &#8212; Fixing the Inner Product Bias</em></p></li><li><p><em>Why This Matters for LLM Inference</em></p></li><li><p><em>Connections and Prior Art</em></p><p></p></li></ol><p>Every few months, a paper drops that makes the ML community collectively lose its mind. This month, it&#8217;s <strong><a href="https://arxiv.org/abs/2504.19874">TurboQuant</a></strong> (Zandieh et al. 
2025): a new vector quantization method from Google Research that achieves near-optimal compression of LLM weights and KV caches at 2.5&#8211;3.5 bits per parameter.</p><p>The Twitter discourse has been&#8230; colorful. &#8220;It&#8217;s just polar coordinates!&#8221; &#8220;It&#8217;s information theory!&#8221; &#8220;It&#8217;s magic!&#8221;</p><p>None of that is quite right. The core idea is shockingly simple, and I&#8217;m going to explain it from scratch: with visuals, concrete numbers, and zero hand-waving.</p><h2><strong>What Is Quantization, Really?</strong></h2><p>Before we get to TurboQuant, let&#8217;s make sure we&#8217;re on the same page about what quantization actually does.</p><p>You have a neural network. Every weight, every activation, every KV cache entry is a number &#8212; typically stored as a 16-bit floating point value (FP16). That&#8217;s 2 bytes per number.</p><p>A model like Llama 3 70B has 70 billion parameters. At FP16, that&#8217;s:</p><blockquote><p><strong>70B &#215; 2 bytes = 140 GB</strong></p></blockquote><p>That doesn&#8217;t fit in a single GPU. Quantization is the art of making those numbers smaller.</p><p>The simplest possible quantization? Just round to fewer decimal places:</p><pre><code><code>Original:    0.2374623   0.7237428   0.5434738   0.1001233
Quantized:   0.237       0.724       0.543       0.100</code></code></pre><p>You&#8217;ve lost some precision, but you&#8217;ve saved memory. Real quantization schemes are more sophisticated &#8212; they map continuous values to a discrete set of levels &#8212; but the core idea is always: <strong>reduce the precision of each number to use fewer bits</strong>.</p><p>At 4 bits per weight, our 70B model becomes 35 GB. At 2 bits, it&#8217;s 17.5 GB. The question is: how low can you go before the model stops working?</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tN_P!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb494aff6-10f1-4746-a683-df625094cf0f_2752x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tN_P!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb494aff6-10f1-4746-a683-df625094cf0f_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!tN_P!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb494aff6-10f1-4746-a683-df625094cf0f_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!tN_P!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb494aff6-10f1-4746-a683-df625094cf0f_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!tN_P!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb494aff6-10f1-4746-a683-df625094cf0f_2752x1536.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!tN_P!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb494aff6-10f1-4746-a683-df625094cf0f_2752x1536.png" width="1456" height="813" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b494aff6-10f1-4746-a683-df625094cf0f_2752x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:813,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1681240,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/193451193?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb494aff6-10f1-4746-a683-df625094cf0f_2752x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!tN_P!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb494aff6-10f1-4746-a683-df625094cf0f_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!tN_P!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb494aff6-10f1-4746-a683-df625094cf0f_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!tN_P!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb494aff6-10f1-4746-a683-df625094cf0f_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!tN_P!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb494aff6-10f1-4746-a683-df625094cf0f_2752x1536.png 
1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h2><strong>The Outlier Problem</strong></h2><p>Here&#8217;s where things get interesting &#8212; and where TurboQuant enters the picture.</p><p>In an ideal world, the values in a neural network vector would be spread roughly evenly across their range. Something like:</p><pre><code><code>Nice vector:  [0.24, 0.31, 0.18, 0.27, 0.22, 0.29, 0.20, 0.25]</code></code></pre><p>Every component is in a similar range. 
If you quantize each one to 4 bits (16 levels), you can divide the range [0.18, 0.31] into 16 uniform buckets and represent each value with minimal error.</p><p>But real neural network vectors don&#8217;t look like that. They look like this:</p><pre><code><code>Real vector:  [0.0001, 0.9999, 0.0002, 0.0001, 0.0003, 0.0001, 0.0002, 0.0001]</code></code></pre><p><strong>One component is enormous. The rest are near zero.</strong></p><p>This phenomenon goes by many names in the transformer literature:</p><ul><li><p><strong><a href="https://arxiv.org/abs/2402.17762">Massive activations</a></strong><a href="https://arxiv.org/abs/2402.17762"> (Sun et al. 2024)</a></p></li><li><p><strong><a href="https://arxiv.org/abs/2410.10781">Attention sinks</a></strong><a href="https://arxiv.org/abs/2410.10781"> (Gu et al. 2024)</a></p></li><li><p><strong><a href="https://arxiv.org/pdf/2208.07339">Outlier features</a></strong><a href="https://arxiv.org/pdf/2208.07339"> (Dettmers et al. 2022)</a></p></li></ul><p>Whatever you call it, the effect on quantization is devastating.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4C1K!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab18c66a-9568-47bd-876a-795a521436eb_2752x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4C1K!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab18c66a-9568-47bd-876a-795a521436eb_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!4C1K!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab18c66a-9568-47bd-876a-795a521436eb_2752x1536.png 848w, 
https://substackcdn.com/image/fetch/$s_!4C1K!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab18c66a-9568-47bd-876a-795a521436eb_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!4C1K!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab18c66a-9568-47bd-876a-795a521436eb_2752x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4C1K!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab18c66a-9568-47bd-876a-795a521436eb_2752x1536.png" width="1456" height="813" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ab18c66a-9568-47bd-876a-795a521436eb_2752x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:813,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1871886,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/193451193?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab18c66a-9568-47bd-876a-795a521436eb_2752x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!4C1K!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab18c66a-9568-47bd-876a-795a521436eb_2752x1536.png 424w, 
https://substackcdn.com/image/fetch/$s_!4C1K!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab18c66a-9568-47bd-876a-795a521436eb_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!4C1K!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab18c66a-9568-47bd-876a-795a521436eb_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!4C1K!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fab18c66a-9568-47bd-876a-795a521436eb_2752x1536.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h3><strong>Why Outliers Kill Quantization</strong></h3><p>Think about what happens when you quantize that spiky vector. Your quantization grid has to span the full range [0.0001, 0.9999]. With only 16 levels (4 bits), each bucket is about 0.0625 wide.</p><p>The massive component (0.9999) maps cleanly to the top bucket &#8212; no problem.</p><p>But all the tiny components (0.0001, 0.0002, 0.0003) get <strong>crushed into the very first bucket</strong>. They all become 0. The quantized vector is essentially:</p><pre><code><code>Quantized:  [0, 1, 0, 0, 0, 0, 0, 0]</code></code></pre><p>This is a <strong>cardinal direction</strong> &#8212; a unit vector pointing along a single axis. It contains almost no information. An 8-dimensional cardinal direction can be described with just log&#8322;(2&#215;8) = 4 bits total (8 axes, 2 signs), but we spent 4 &#215; 8 = 32 bits to represent it.</p><p>We&#8217;re wasting bits, and we&#8217;ve lost all the subtle information that was encoded in the small components.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!haId!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb2f3d1a-77f5-4863-b66b-74a498558a1b_2752x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!haId!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb2f3d1a-77f5-4863-b66b-74a498558a1b_2752x1536.png 424w, 
https://substackcdn.com/image/fetch/$s_!haId!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb2f3d1a-77f5-4863-b66b-74a498558a1b_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!haId!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb2f3d1a-77f5-4863-b66b-74a498558a1b_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!haId!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb2f3d1a-77f5-4863-b66b-74a498558a1b_2752x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!haId!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb2f3d1a-77f5-4863-b66b-74a498558a1b_2752x1536.png" width="1456" height="813" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cb2f3d1a-77f5-4863-b66b-74a498558a1b_2752x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:813,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1207477,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/193451193?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb2f3d1a-77f5-4863-b66b-74a498558a1b_2752x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!haId!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb2f3d1a-77f5-4863-b66b-74a498558a1b_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!haId!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb2f3d1a-77f5-4863-b66b-74a498558a1b_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!haId!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb2f3d1a-77f5-4863-b66b-74a498558a1b_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!haId!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcb2f3d1a-77f5-4863-b66b-74a498558a1b_2752x1536.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><div><hr></div><h2><strong>The Key Insight &#8212; Just Rotate It</strong></h2><p>Here is the entire key idea of TurboQuant, in one sentence:</p><blockquote><p><strong>Before quantizing a vector, randomly rotate it. After dequantizing, rotate it back.</strong></p></blockquote><p>That&#8217;s it. That&#8217;s the paper. (Well, most of it &#8212; there&#8217;s a clever second trick we&#8217;ll get to.)</p><p>Let me say that again, because it sounds too simple to be a major research contribution:</p><ol><li><p>Take your vector</p></li><li><p>Multiply it by a random rotation matrix</p></li><li><p>Quantize the rotated vector</p></li><li><p>To dequantize: undo the quantization, then multiply by the inverse rotation</p></li></ol><p>The rotation is <strong>data-independent</strong>. It&#8217;s chosen once and applied to everything. It doesn&#8217;t need to be learned or calibrated. 
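</p><p>The four steps above can be sketched in a few lines of NumPy. Treat this as a minimal illustration under stated assumptions rather than the paper&#8217;s implementation: the helper names (<code>random_rotation</code>, <code>quantize_roundtrip</code>), the simple min-max uniform quantizer, and the synthetic vector with a single outlier are all choices made for this example.</p>
<pre><code>
```python
import numpy as np

def random_rotation(d, rng):
    # Random orthogonal matrix from the QR decomposition of a Gaussian matrix.
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    # Fixing the column signs makes the result uniformly distributed (Haar).
    return q * np.sign(np.diag(r))

def quantize_roundtrip(v, bits):
    # Assumed min-max uniform quantizer: snap each value to one of 2**bits
    # evenly spaced levels spanning [v.min(), v.max()], then map back to floats.
    lo, hi = v.min(), v.max()
    steps = 2 ** bits - 1
    codes = np.rint((v - lo) / (hi - lo) * steps)
    return lo + codes / steps * (hi - lo)

rng = np.random.default_rng(0)
d, bits = 256, 4

x = rng.standard_normal(d)   # synthetic "bulk" values (an assumption)
x[1] = 100.0                 # one massive outlier

R = random_rotation(d, rng)  # chosen once, independent of the data

x_naive = quantize_roundtrip(x, bits)          # quantize directly
x_rot = R.T @ quantize_roundtrip(R @ x, bits)  # rotate, quantize, rotate back

err_naive = np.linalg.norm(x - x_naive)
err_rot = np.linalg.norm(x - x_rot)
print(f"naive error:   {err_naive:.2f}")
print(f"rotated error: {err_rot:.2f}")
```
</code></pre>
<p>On this synthetic input the rotated round trip should come out with a noticeably smaller reconstruction error than quantizing directly, because the grid no longer has to stretch to reach one extreme value. The inverse rotation is just the transpose, <code>R.T</code>, since <code>R</code> is orthogonal. The only ingredient beyond ordinary quantization is the matrix <code>R</code>. 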
It&#8217;s just a random orthogonal matrix.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Atzb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedf3235f-8bbb-41c7-a23b-e99f949d62c9_2752x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Atzb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedf3235f-8bbb-41c7-a23b-e99f949d62c9_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!Atzb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedf3235f-8bbb-41c7-a23b-e99f949d62c9_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!Atzb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedf3235f-8bbb-41c7-a23b-e99f949d62c9_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!Atzb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedf3235f-8bbb-41c7-a23b-e99f949d62c9_2752x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Atzb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedf3235f-8bbb-41c7-a23b-e99f949d62c9_2752x1536.png" width="1456" height="813" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/edf3235f-8bbb-41c7-a23b-e99f949d62c9_2752x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:813,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2083770,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/193451193?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedf3235f-8bbb-41c7-a23b-e99f949d62c9_2752x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Atzb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedf3235f-8bbb-41c7-a23b-e99f949d62c9_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!Atzb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedf3235f-8bbb-41c7-a23b-e99f949d62c9_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!Atzb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedf3235f-8bbb-41c7-a23b-e99f949d62c9_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!Atzb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedf3235f-8bbb-41c7-a23b-e99f949d62c9_2752x1536.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h3><strong>But Why Does This Work?</strong></h3><p>This is the beautiful part, and it requires a little geometric intuition.</p><p>Remember our problem vector?</p><pre><code><code>Spiky:  [0.0001, 0.9999, 0.0002, 0.0001]</code></code></pre><p>This vector is nearly aligned with a coordinate axis &#8212; it points almost exactly in the direction of the second basis vector. Geometrically, it lives very close to one of the &#8220;poles&#8221; of the unit sphere.</p><p>Now imagine randomly rotating this vector. Where does it end up?</p><p><strong>Almost certainly nowhere near any coordinate axis.</strong></p><p>Think about it in 3D first. If you take a vector pointing straight up (the North Pole) and apply a random rotation, it could end up pointing in literally any direction. 
The chance of it landing near another pole is vanishingly small &#8212; because the poles occupy a tiny fraction of the sphere&#8217;s surface area.</p><p>After rotation, the components of our vector are spread out across all dimensions:</p><pre><code><code>Before rotation: [0.0001,  0.9999,  0.0002,  0.0001]
After rotation:  [0.5012, -0.4998,  0.5001, -0.4989]</code></code></pre><p><strong>The outlier has been smeared across all components.</strong> Now each component has a similar magnitude, and our quantization grid can capture all of them with roughly equal fidelity.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1jAn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7ba48e6-a591-4045-92e2-f04fdde64e39_2752x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1jAn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7ba48e6-a591-4045-92e2-f04fdde64e39_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!1jAn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7ba48e6-a591-4045-92e2-f04fdde64e39_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!1jAn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7ba48e6-a591-4045-92e2-f04fdde64e39_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!1jAn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7ba48e6-a591-4045-92e2-f04fdde64e39_2752x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1jAn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7ba48e6-a591-4045-92e2-f04fdde64e39_2752x1536.png" width="1456" height="813" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f7ba48e6-a591-4045-92e2-f04fdde64e39_2752x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:813,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2064854,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/193451193?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7ba48e6-a591-4045-92e2-f04fdde64e39_2752x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!1jAn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7ba48e6-a591-4045-92e2-f04fdde64e39_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!1jAn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7ba48e6-a591-4045-92e2-f04fdde64e39_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!1jAn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7ba48e6-a591-4045-92e2-f04fdde64e39_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!1jAn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7ba48e6-a591-4045-92e2-f04fdde64e39_2752x1536.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h3><strong>The Geometry: Why Random Directions Are &#8220;Spread Out&#8221;</strong></h3><p>Here&#8217;s the deep geometric reason this works, and it gets stronger in high dimensions.</p><p>In <em>d</em> dimensions, the unit sphere has a surface area that is overwhelmingly concentrated away from the coordinate axes. As <em>d</em> grows, the fraction of the sphere near any axis shrinks exponentially. A randomly oriented vector in high dimensions will have components that are all roughly the same size &#8212; around 1/&#8730;d each.</p><p>More precisely, after rotation, each coordinate of a unit vector follows a (shifted and scaled) <strong>Beta distribution</strong> that concentrates tightly around zero, with standard deviation 1/&#8730;d. 
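</p><p>A quick numerical check makes this concentration concrete. The sketch below is an illustration, not something taken from the paper: it samples random unit vectors by normalizing Gaussian draws (which gives the same distribution as randomly rotating a fixed unit vector) and compares the spread of one coordinate with 1/&#8730;d. The dimensions and the sample count are arbitrary choices.</p>
<pre><code>
```python
import numpy as np

rng = np.random.default_rng(0)

for d in (4, 64, 1024):
    # Normalized Gaussian samples are uniformly distributed on the unit sphere,
    # which is where a random rotation sends any fixed unit vector.
    v = rng.standard_normal((2000, d))
    v /= np.linalg.norm(v, axis=1, keepdims=True)

    coord_std = v[:, 0].std()               # spread of a single coordinate
    largest = np.abs(v).max(axis=1).mean()  # typical largest coordinate
    print(f"d={d:5d}  1/sqrt(d)={d ** -0.5:.4f}  "
          f"coord std={coord_std:.4f}  mean largest coord={largest:.4f}")
```
</code></pre>
<p>As <em>d</em> grows, both the spread of a single coordinate and the typical largest coordinate should shrink, with the spread tracking 1/&#8730;d closely.</p><p>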
The squared coordinates must sum to 1 (because the vector has unit length), and a random rotation treats every axis alike, so each squared coordinate averages 1/<em>d</em> and a single coordinate much larger than 1/&#8730;d becomes very unlikely as <em>d</em> grows.</p><p>This is a fundamental property of high-dimensional geometry, and it&#8217;s the mathematical engine behind TurboQuant.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NUZu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48cd862c-c142-403c-8408-4ce440d791c6_2752x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NUZu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48cd862c-c142-403c-8408-4ce440d791c6_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!NUZu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48cd862c-c142-403c-8408-4ce440d791c6_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!NUZu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48cd862c-c142-403c-8408-4ce440d791c6_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!NUZu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48cd862c-c142-403c-8408-4ce440d791c6_2752x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NUZu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48cd862c-c142-403c-8408-4ce440d791c6_2752x1536.png" width="1456" height="813" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/48cd862c-c142-403c-8408-4ce440d791c6_2752x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:813,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2036803,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/193451193?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48cd862c-c142-403c-8408-4ce440d791c6_2752x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!NUZu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48cd862c-c142-403c-8408-4ce440d791c6_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!NUZu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48cd862c-c142-403c-8408-4ce440d791c6_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!NUZu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48cd862c-c142-403c-8408-4ce440d791c6_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!NUZu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48cd862c-c142-403c-8408-4ce440d791c6_2752x1536.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><div><hr></div><h2><strong>A Concrete Example</strong></h2><p>Let&#8217;s trace through the full TurboQuant process with actual numbers. We&#8217;ll use a small 4D example.</p><h3><strong>Step 1: Start With a Spiky Vector</strong></h3><pre><code><code>x = [0.01, 0.98, 0.02, 0.01]</code></code></pre><p>Norm: &#8214;x&#8214; &#8776; 0.9803</p><p>Normalized: x&#770; = [0.0102, 0.9997, 0.0204, 0.0102]</p><h3><strong>Step 2: Random Rotation</strong></h3><p>We apply a random orthogonal matrix R (generated once, shared across all vectors):</p><pre><code><code>x_rotated = R &#215; x&#770; = [0.5127, -0.4831, 0.5042, -0.4998]</code></code></pre><p>Notice: the energy is now spread evenly across all 4 dimensions. 
Each component has magnitude &#8776; 0.5.</p><h3><strong>Step 3: Quantize</strong></h3><p>With a 2-bit quantizer (4 levels), our grid levels might be: {-0.75, -0.25, 0.25, 0.75}. Rounding each component to its nearest level gives:</p><pre><code><code>x_quantized = [0.75, -0.25, 0.75, -0.25]</code></code></pre><p>Hmm, that&#8217;s crude &#8212; but every component gets a meaningful representation. Compare this to quantizing the original spiky vector:</p><pre><code><code>Original quantized: [0.0, 1.0, 0.0, 0.0]  &#8592; cardinal direction!
Rotated quantized:  [0.75, -0.25, 0.75, -0.25]  &#8592; much more info!</code></code></pre><h3><strong>Step 4: Dequantize and Un-Rotate</strong></h3><pre><code><code>x_dequantized = R&#8315;&#185; &#215; x_quantized &#215; &#8214;x&#8214;
             &#8776; [0.03, 0.95, 0.04, 0.02]</code></code></pre><p>Compare to original: [0.01, 0.98, 0.02, 0.01]</p><p>The reconstruction is far better than the naive approach!</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!s1VD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90fc5030-37fd-4296-a997-b82b975256b2_2752x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!s1VD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90fc5030-37fd-4296-a997-b82b975256b2_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!s1VD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90fc5030-37fd-4296-a997-b82b975256b2_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!s1VD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90fc5030-37fd-4296-a997-b82b975256b2_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!s1VD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90fc5030-37fd-4296-a997-b82b975256b2_2752x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!s1VD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90fc5030-37fd-4296-a997-b82b975256b2_2752x1536.png" width="1456" height="813" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/90fc5030-37fd-4296-a997-b82b975256b2_2752x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:813,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2190783,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/193451193?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90fc5030-37fd-4296-a997-b82b975256b2_2752x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!s1VD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90fc5030-37fd-4296-a997-b82b975256b2_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!s1VD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90fc5030-37fd-4296-a997-b82b975256b2_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!s1VD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90fc5030-37fd-4296-a997-b82b975256b2_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!s1VD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90fc5030-37fd-4296-a997-b82b975256b2_2752x1536.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><h2><strong>The Second Trick &#8212; Fixing the Inner Product Bias</strong></h2><p>TurboQuant doesn&#8217;t stop at rotation. There&#8217;s a second, more subtle insight that matters specifically for attention computation.</p><p>In a transformer&#8217;s attention layer, we don&#8217;t just store KV cache vectors &#8212; we compute <strong>inner products</strong> between queries and keys:</p><pre><code><code>score = q &#183; k</code></code></pre><p>It turns out that even when rotation improves quantization in terms of mean squared error (MSE), an MSE-optimal quantizer can still introduce a <strong>systematic bias</strong> in inner products. 
The quantized vectors tend to produce inner products that are slightly &#8220;off&#8221; &#8212; not randomly off, but consistently biased in one direction.</p><h3><strong>The Fix: Residual Quantization with QJL</strong></h3><p>TurboQuant handles this in two stages (an MSE-optimal quantizer, then a 1-bit correction of its residual), which break down into three steps:</p><ol><li><p><strong>Quantize the rotated vector</strong> using an MSE-optimal scalar quantizer (the standard part)</p></li><li><p><strong>Compute the residual</strong> (the error between the true vector and the quantized version)</p></li><li><p><strong>Compress the residual</strong> using a 1-bit Quantized Johnson-Lindenstrauss (QJL) transform</p></li></ol><p>The QJL step applies a random projection to the residual error vector and keeps only the sign of each projected coordinate, so the residual costs just 1 bit per dimension. This extra bit per dimension is enough to correct the bias in inner product computations.</p><p>The combined system &#8212; rotation + optimal quantizer + QJL residual &#8212; achieves inner product distortion that&#8217;s within a constant factor (&#8776;2.7&#215;) of the information-theoretic lower bound. 
That&#8217;s near-optimal.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tMBa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3a6891c-81ca-4ebb-b9bc-411e97e363c8_2752x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tMBa!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3a6891c-81ca-4ebb-b9bc-411e97e363c8_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!tMBa!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3a6891c-81ca-4ebb-b9bc-411e97e363c8_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!tMBa!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3a6891c-81ca-4ebb-b9bc-411e97e363c8_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!tMBa!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3a6891c-81ca-4ebb-b9bc-411e97e363c8_2752x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tMBa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3a6891c-81ca-4ebb-b9bc-411e97e363c8_2752x1536.png" width="1456" height="813" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a3a6891c-81ca-4ebb-b9bc-411e97e363c8_2752x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:813,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2032587,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/193451193?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3a6891c-81ca-4ebb-b9bc-411e97e363c8_2752x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!tMBa!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3a6891c-81ca-4ebb-b9bc-411e97e363c8_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!tMBa!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3a6891c-81ca-4ebb-b9bc-411e97e363c8_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!tMBa!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3a6891c-81ca-4ebb-b9bc-411e97e363c8_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!tMBa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3a6891c-81ca-4ebb-b9bc-411e97e363c8_2752x1536.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><h2><strong>Why This Matters for LLM Inference</strong></h2><p>All of this theory is nice, but why should you care? Because TurboQuant directly addresses one of the biggest bottlenecks in LLM inference: the <strong>KV cache</strong>.</p><p>During autoregressive generation, every token you&#8217;ve processed has a key vector and a value vector stored in the KV cache. For a 70B-scale model with 80 layers and a hidden size of 8192, using standard multi-head attention and 16-bit values, a sequence of 8K tokens needs:</p><pre><code><code>KV cache = 2 (K+V) &#215; 80 layers &#215; 8192 tokens &#215; 8192 dim &#215; 2 bytes
         &#8776; 21 GB</code></code></pre><p>That&#8217;s 21 GB of memory just for one request&#8217;s context! Grouped-query attention shrinks this (Llama 3 70B keeps 8 KV heads of dimension 128, so the same formula with a KV dimension of 1024 gives about 2.7 GB), but the cache still grows linearly with context length and batch size. For batched serving with multiple users, this quickly becomes the dominant memory cost.</p><p>TurboQuant compresses this KV cache from 16 bits to 2.5&#8211;3.5 bits per value &#8212; a <strong>roughly 4.6&#8211;6.4&#215; reduction</strong> &#8212; with minimal quality degradation. The paper demonstrates:</p><ul><li><p><strong>3.5 bits/value</strong>: Near-lossless quality on standard benchmarks</p></li><li><p><strong>2.5 bits/value</strong>: Slightly degraded but still remarkably good</p></li><li><p><strong>Training-free</strong>: No fine-tuning or calibration needed &#8212; just rotate, quantize, and go</p></li></ul><p>This is what makes TurboQuant so exciting for production inference. It&#8217;s simple to implement, requires no retraining, and achieves near-optimal compression.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!s0lZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff396310b-3b9c-487a-b21c-2274ded42898_2752x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!s0lZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff396310b-3b9c-487a-b21c-2274ded42898_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!s0lZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff396310b-3b9c-487a-b21c-2274ded42898_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!s0lZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff396310b-3b9c-487a-b21c-2274ded42898_2752x1536.png 1272w, 
https://substackcdn.com/image/fetch/$s_!s0lZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff396310b-3b9c-487a-b21c-2274ded42898_2752x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!s0lZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff396310b-3b9c-487a-b21c-2274ded42898_2752x1536.png" width="1456" height="813" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f396310b-3b9c-487a-b21c-2274ded42898_2752x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:813,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2063898,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/193451193?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff396310b-3b9c-487a-b21c-2274ded42898_2752x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!s0lZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff396310b-3b9c-487a-b21c-2274ded42898_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!s0lZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff396310b-3b9c-487a-b21c-2274ded42898_2752x1536.png 848w, 
https://substackcdn.com/image/fetch/$s_!s0lZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff396310b-3b9c-487a-b21c-2274ded42898_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!s0lZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff396310b-3b9c-487a-b21c-2274ded42898_2752x1536.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><h2><strong>Connections and Prior Art</strong></h2><p>TurboQuant isn&#8217;t the first method to use rotation for 
quantization. Two notable predecessors:</p><p><strong><a href="https://arxiv.org/abs/2307.13304">QuIP</a></strong><a href="https://arxiv.org/abs/2307.13304"> (Chee et al. 2023)</a> uses random orthogonal transformations for weight quantization, with a similar intuition about spreading outlier energy. However, QuIP uses the rotation as part of a more complex optimization procedure, while TurboQuant isolates the rotation as a standalone preprocessing step with clean theoretical guarantees.</p><p><strong><a href="https://arxiv.org/abs/2405.12497">RaBitQ</a></strong><a href="https://arxiv.org/abs/2405.12497"> (Gao and Long 2024)</a> employs random rotation for vector database compression and nearest-neighbor search. TurboQuant extends this idea with the bias-correcting QJL step, which is critical for the inner product computations in attention.</p><p>What TurboQuant contributes beyond these is: <strong>(1)</strong> a clean theoretical analysis showing the rotation + quantizer is near-optimal, <strong>(2)</strong> the bias correction via QJL for inner products, and <strong>(3)</strong> a practical demonstration on LLM KV cache compression.</p><div><hr></div><h2><strong>The Takeaway</strong></h2><blockquote><p>Neural network vectors tend to be &#8220;spiky&#8221; &#8212; nearly aligned with coordinate axes. Quantization snaps spiky vectors to cardinal directions, destroying information. A random rotation spreads the energy evenly across all dimensions, making every component equally important and equally well-served by the quantization grid.</p></blockquote><p>The fix is almost embarrassingly simple: multiply by a random rotation matrix before quantizing, and multiply by its inverse (simply its transpose, since the matrix is orthogonal) after dequantizing. No training. No calibration. No data dependence. 
Just linear algebra.</p><p>Combined with a 1-bit residual correction for inner product bias, this yields a quantization scheme that&#8217;s within a small constant factor of the information-theoretic limit.</p><p>Sometimes the most impactful ideas in ML aren&#8217;t the most complex ones. They&#8217;re the ones that see a fundamental geometric truth that was hiding in plain sight.</p><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.vizuaranewsletter.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption"><em>If you enjoyed this deep dive, subscribe for more first-principles breakdowns of the latest inference engineering research.</em></p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[Attention Residuals: Teaching transformers to choose which layers matter]]></title><description><![CDATA[How Moonshot AI replaced a decade-old fixed-weight residual connection with learned depth-wise attention, unlocking 25% more compute efficiency in Kimi]]></description><link>https://www.vizuaranewsletter.com/p/attention-residuals-teaching-transformers</link><guid isPermaLink="false">https://www.vizuaranewsletter.com/p/attention-residuals-teaching-transformers</guid><dc:creator><![CDATA[Naman Dwivedi]]></dc:creator><pubDate>Mon, 06 Apr 2026 12:10:57 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!wekb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff132bdc9-1291-40b0-af92-07b86eb35ea2_2752x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This article covers:</p><ul><li><p><strong>The PreNorm dilution problem</strong>: Why standard residual connections cause hidden states to grow uncontrollably with depth, drowning out individual layer contributions</p></li><li><p><strong>The depth-time duality</strong>: The elegant insight that information dilution across depth is structurally identical to memory loss across a sequence, and can be solved the same way</p></li><li><p><strong>Attention Residuals (AttnRes)</strong>: How replacing fixed accumulation with softmax attention over previous layer outputs gives each layer selective, input-dependent access to earlier representations</p></li><li><p><strong>Block AttnRes</strong>: The practical variant that partitions layers into blocks, reducing memory from O(Ld) to O(Nd) while preserving most gains</p></li><li><p><strong>Quantifying the gains</strong>: Scaling law experiments showing a 1.25x compute advantage, with benchmark improvements up to +7.5 points on GPQA-Diamond</p></li></ul><p>This article assumes familiarity with transformers, self-attention, and residual connections. If you have read about how standard transformer blocks work, including the alternation of attention and feed-forward sublayers connected by residual streams, you have all the background you need.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.vizuaranewsletter.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Vizuara&#8217;s AI Newsletter! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>Let&#8217;s begin with a roadmap of what we will cover, as shown in figure 1.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!i5Gs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F539cb5f2-af7f-4023-a7ff-d0c671d79544_1377x768.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!i5Gs!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F539cb5f2-af7f-4023-a7ff-d0c671d79544_1377x768.png 424w, https://substackcdn.com/image/fetch/$s_!i5Gs!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F539cb5f2-af7f-4023-a7ff-d0c671d79544_1377x768.png 848w, https://substackcdn.com/image/fetch/$s_!i5Gs!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F539cb5f2-af7f-4023-a7ff-d0c671d79544_1377x768.png 1272w, https://substackcdn.com/image/fetch/$s_!i5Gs!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F539cb5f2-af7f-4023-a7ff-d0c671d79544_1377x768.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!i5Gs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F539cb5f2-af7f-4023-a7ff-d0c671d79544_1377x768.png" width="1377" height="768" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/539cb5f2-af7f-4023-a7ff-d0c671d79544_1377x768.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1377,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:988572,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/193336787?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F539cb5f2-af7f-4023-a7ff-d0c671d79544_1377x768.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!i5Gs!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F539cb5f2-af7f-4023-a7ff-d0c671d79544_1377x768.png 424w, https://substackcdn.com/image/fetch/$s_!i5Gs!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F539cb5f2-af7f-4023-a7ff-d0c671d79544_1377x768.png 848w, https://substackcdn.com/image/fetch/$s_!i5Gs!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F539cb5f2-af7f-4023-a7ff-d0c671d79544_1377x768.png 1272w, https://substackcdn.com/image/fetch/$s_!i5Gs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F539cb5f2-af7f-4023-a7ff-d0c671d79544_1377x768.png 1456w" 
sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>As shown in figure 1, we will progress through six stages. 
We start by understanding a hidden problem inside every modern LLM, build the conceptual insight that makes the fix possible, walk through the mechanism in detail, scale it to real-world models, prove it works mathematically, and finally measure the gains.</p><p>To understand why Attention Residuals matter, we first need to confront a hidden problem lurking inside every modern LLM.</p><div><hr></div><h2><strong>The hidden flaw in residual connections</strong></h2><p>Residual connections are one of the most important innovations in deep learning. Introduced by He et al. in 2016, they allow gradients to flow through deep networks by providing shortcut paths around each layer. Every modern LLM, from GPT to LLaMA to DeepSeek, relies on them.</p><p>Yet these connections have a fundamental design limitation that has gone largely unaddressed for a decade. Every layer output is added to the running hidden state with a fixed weight of 1.0, creating an ever-growing signal that progressively drowns out individual layer contributions. 
This is the PreNorm dilution problem, and it wastes model capacity while limiting the effective utilization of depth.</p><h3><strong>The residual stream</strong></h3><p>Let&#8217;s examine how residual connections work in a standard PreNorm transformer, as illustrated in figure 2.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LkLI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bad61f4-59ce-4ae6-bb01-702806742c13_2752x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LkLI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bad61f4-59ce-4ae6-bb01-702806742c13_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!LkLI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bad61f4-59ce-4ae6-bb01-702806742c13_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!LkLI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bad61f4-59ce-4ae6-bb01-702806742c13_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!LkLI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bad61f4-59ce-4ae6-bb01-702806742c13_2752x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LkLI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bad61f4-59ce-4ae6-bb01-702806742c13_2752x1536.png" width="1456" height="813" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2bad61f4-59ce-4ae6-bb01-702806742c13_2752x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:813,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Figure 2. The residual stream in a standard transformer. Each layer adds its output to the running sum with a fixed weight of 1.0, creating a single, ever-growing hidden state.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Figure 2. The residual stream in a standard transformer. Each layer adds its output to the running sum with a fixed weight of 1.0, creating a single, ever-growing hidden state." title="Figure 2. The residual stream in a standard transformer. Each layer adds its output to the running sum with a fixed weight of 1.0, creating a single, ever-growing hidden state." 
srcset="https://substackcdn.com/image/fetch/$s_!LkLI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bad61f4-59ce-4ae6-bb01-702806742c13_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!LkLI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bad61f4-59ce-4ae6-bb01-702806742c13_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!LkLI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bad61f4-59ce-4ae6-bb01-702806742c13_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!LkLI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bad61f4-59ce-4ae6-bb01-702806742c13_2752x1536.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Figure 2. The residual stream in a standard transformer. Each layer adds its output to the running sum with a fixed weight of 1.0, creating a single, ever-growing hidden state.</em></figcaption></figure></div><p>As shown in figure 2, each transformer layer receives the current hidden state, applies RMSNorm, processes it through attention and a feed-forward network, and then adds the result back to the running sum. The formula at each layer is straightforward:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;h_l = h_{l-1} + f_l(\\text{RMSNorm}(h_{l-1}))&quot;,&quot;id&quot;:&quot;PUIGENJXSI&quot;}" data-component-name="LatexBlockToDOM"></div><p>The key observation is that every layer contributes its output with the same fixed weight of 1.0. There is no mechanism for one layer to contribute more or less than any other. The running sum, often called the &#8220;residual stream,&#8221; simply accumulates all outputs equally.</p><p>For our running example, we use four tokens, &#8220;The&#8221;, &#8220;cat&#8221;, &#8220;sat&#8221;, &#8220;down&#8221;, each with an 8-dimensional embedding. The input matrix X has shape (4, 8). We trace these tokens through 6 layers.</p><h3><strong>The magnitude growth problem</strong></h3><p>Since every layer adds its output with weight 1.0, the hidden state norm grows linearly with depth. This is not a minor bookkeeping detail. 
It is a fundamental property that shapes how the entire model behaves.</p><p>Unrolling the per-layer update shows that the hidden state at the final layer is the token embedding x (which is h_0) plus the output of every layer:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;h_L = x + f_1(\\text{RMSNorm}(h_0)) + f_2(\\text{RMSNorm}(h_1)) + \\ldots + f_L(\\text{RMSNorm}(h_{L-1}))&quot;,&quot;id&quot;:&quot;MULVQLUXBG&quot;}" data-component-name="LatexBlockToDOM"></div><p>Let&#8217;s see this growth in action with our running example, as shown in figure 3.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!O-KJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9819c69-814d-4c49-ac4a-06ac0fe8e3a8_1320x850.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!O-KJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9819c69-814d-4c49-ac4a-06ac0fe8e3a8_1320x850.png 424w, https://substackcdn.com/image/fetch/$s_!O-KJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9819c69-814d-4c49-ac4a-06ac0fe8e3a8_1320x850.png 848w, https://substackcdn.com/image/fetch/$s_!O-KJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9819c69-814d-4c49-ac4a-06ac0fe8e3a8_1320x850.png 1272w, https://substackcdn.com/image/fetch/$s_!O-KJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9819c69-814d-4c49-ac4a-06ac0fe8e3a8_1320x850.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!O-KJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9819c69-814d-4c49-ac4a-06ac0fe8e3a8_1320x850.png" width="1320" height="850" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c9819c69-814d-4c49-ac4a-06ac0fe8e3a8_1320x850.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:850,&quot;width&quot;:1320,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Figure 3. Hidden state magnitude growth across depth in a standard PreNorm transformer. The orange line shows the standard residual norm growing linearly with the number of layers, while the gray bars show each individual layer's fractional contribution shrinking toward zero. The blue line shows AttnRes norm for comparison, staying bounded near 1.0.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Figure 3. Hidden state magnitude growth across depth in a standard PreNorm transformer. The orange line shows the standard residual norm growing linearly with the number of layers, while the gray bars show each individual layer's fractional contribution shrinking toward zero. The blue line shows AttnRes norm for comparison, staying bounded near 1.0." title="Figure 3. Hidden state magnitude growth across depth in a standard PreNorm transformer. The orange line shows the standard residual norm growing linearly with the number of layers, while the gray bars show each individual layer's fractional contribution shrinking toward zero. The blue line shows AttnRes norm for comparison, staying bounded near 1.0." 
srcset="https://substackcdn.com/image/fetch/$s_!O-KJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9819c69-814d-4c49-ac4a-06ac0fe8e3a8_1320x850.png 424w, https://substackcdn.com/image/fetch/$s_!O-KJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9819c69-814d-4c49-ac4a-06ac0fe8e3a8_1320x850.png 848w, https://substackcdn.com/image/fetch/$s_!O-KJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9819c69-814d-4c49-ac4a-06ac0fe8e3a8_1320x850.png 1272w, https://substackcdn.com/image/fetch/$s_!O-KJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9819c69-814d-4c49-ac4a-06ac0fe8e3a8_1320x850.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Figure 3. Hidden state magnitude growth across depth in a standard PreNorm transformer. The orange line shows the standard residual norm growing linearly with the number of layers, while the gray bars show each individual layer&#8217;s fractional contribution shrinking toward zero. The blue line shows AttnRes norm for comparison, staying bounded near 1.0.</em></figcaption></figure></div><p>As shown in figure 3, the standard residual norm (orange line) grows steadily from 1.0 at the embedding layer to over 5.0 by layer 12. Meanwhile, the layer contribution fraction (gray bars) shrinks from 1.0 at layer 0 to less than 0.1 by layer 12. The RMSNorm before each sublayer normalizes only what that sublayer reads. The residual stream itself is never rescaled, so the accumulated sum keeps growing unchecked.</p><p>In a 50-layer model, the hidden state is the sum of about 50 vectors. If those vectors have comparable magnitude, each layer&#8217;s contribution is roughly 2% of the total signal. This is the PreNorm dilution problem.</p><h3><strong>The signal dilution effect</strong></h3><p>The consequence of this magnitude growth is that each layer&#8217;s voice gets progressively quieter relative to the total signal. 
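</p><p>To make the arithmetic concrete, here is a minimal sketch of how the per-component share shrinks with depth. It assumes every term in the residual sum (the embedding plus one output per layer) has comparable magnitude, which is a simplification, and the helper name <code>component_share</code> is my own illustration rather than anything from the Attention Residuals paper.</p><pre>
```python
# Simplifying assumption: the final hidden state h_L is the sum of the
# token embedding x plus L layer outputs, all of comparable magnitude.
# That gives L + 1 equally weighted components.


def component_share(num_layers):
    """Return the fraction of the residual sum owed to any one component."""
    return 1.0 / (num_layers + 1)


# Depths used in this post: the 6-layer running example, the 12-layer
# plot, and a hypothetical 50-layer model.
for depth in (6, 12, 50):
    share = component_share(depth)
    print(f"L={depth:2d}: {depth + 1} components, each about {share:.1%}")
```
</pre><p>Under that assumption, the share works out to about 14.3% for 6 layers, 7.7% for 12 layers, and 2.0% for 50 layers.</p><p>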
Let&#8217;s visualize what the final hidden state actually looks like, as shown in figure 4.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lmtn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc61ddb32-dbf5-41fd-823d-daf0ecdb3c87_2752x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lmtn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc61ddb32-dbf5-41fd-823d-daf0ecdb3c87_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!lmtn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc61ddb32-dbf5-41fd-823d-daf0ecdb3c87_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!lmtn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc61ddb32-dbf5-41fd-823d-daf0ecdb3c87_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!lmtn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc61ddb32-dbf5-41fd-823d-daf0ecdb3c87_2752x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lmtn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc61ddb32-dbf5-41fd-823d-daf0ecdb3c87_2752x1536.png" width="1456" height="813" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c61ddb32-dbf5-41fd-823d-daf0ecdb3c87_2752x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:813,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Figure 4. The contribution of each layer as a fraction of the total hidden state, shown for a 6-layer model. Layer 1's output, which might encode crucial low-level features, represents only 1/6 of the final hidden state. In a 50-layer model, each layer is only 2% of the total.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Figure 4. The contribution of each layer as a fraction of the total hidden state, shown for a 6-layer model. Layer 1's output, which might encode crucial low-level features, represents only 1/6 of the final hidden state. In a 50-layer model, each layer is only 2% of the total." title="Figure 4. The contribution of each layer as a fraction of the total hidden state, shown for a 6-layer model. Layer 1's output, which might encode crucial low-level features, represents only 1/6 of the final hidden state. In a 50-layer model, each layer is only 2% of the total." 
srcset="https://substackcdn.com/image/fetch/$s_!lmtn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc61ddb32-dbf5-41fd-823d-daf0ecdb3c87_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!lmtn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc61ddb32-dbf5-41fd-823d-daf0ecdb3c87_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!lmtn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc61ddb32-dbf5-41fd-823d-daf0ecdb3c87_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!lmtn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc61ddb32-dbf5-41fd-823d-daf0ecdb3c87_2752x1536.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Figure 4. The contribution of each layer as a fraction of the total hidden state, shown for a 6-layer model. Layer 1&#8217;s output, which might encode crucial low-level features, represents only 1/6 of the final hidden state. In a 50-layer model, each layer is only 2% of the total.</em></figcaption></figure></div><p>As illustrated in figure 4, the final hidden state h_6 is the sum of 7 components: the token embedding x plus 6 layer outputs f_1 through f_6. If the components have comparable magnitude, each contributes roughly 1/7, or about 14%, of the total signal. This means that layer 1, which might encode crucial low-level syntactic features, enters the sum with the same fixed weight as layer 6, which handles high-level reasoning. There is no selectivity.</p><p>This lack of selectivity creates a deeper problem. Both the attention sublayer and the FFN sublayer within each block receive the same blended signal, even though they may benefit from very different mixtures of earlier information.</p><h3><strong>Evidence of wasted capacity</strong></h3><p>The dilution problem is not just theoretical. 
Research has demonstrated that entire layers can be removed from deep LLMs with minimal performance impact, as shown in figure 5.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lFLx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a8015b1-f9ef-4972-8735-81f4b0359628_2752x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lFLx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a8015b1-f9ef-4972-8735-81f4b0359628_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!lFLx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a8015b1-f9ef-4972-8735-81f4b0359628_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!lFLx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a8015b1-f9ef-4972-8735-81f4b0359628_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!lFLx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a8015b1-f9ef-4972-8735-81f4b0359628_2752x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lFLx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a8015b1-f9ef-4972-8735-81f4b0359628_2752x1536.png" width="1456" height="813" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0a8015b1-f9ef-4972-8735-81f4b0359628_2752x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:813,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Figure 5. The redundancy problem. Research has shown that entire layers can be removed from deep LLMs with minimal performance impact, suggesting that uniform accumulation wastes model capacity.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Figure 5. The redundancy problem. Research has shown that entire layers can be removed from deep LLMs with minimal performance impact, suggesting that uniform accumulation wastes model capacity." title="Figure 5. The redundancy problem. Research has shown that entire layers can be removed from deep LLMs with minimal performance impact, suggesting that uniform accumulation wastes model capacity." 
srcset="https://substackcdn.com/image/fetch/$s_!lFLx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a8015b1-f9ef-4972-8735-81f4b0359628_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!lFLx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a8015b1-f9ef-4972-8735-81f4b0359628_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!lFLx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a8015b1-f9ef-4972-8735-81f4b0359628_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!lFLx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a8015b1-f9ef-4972-8735-81f4b0359628_2752x1536.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Figure 5. The redundancy problem. Research has shown that entire layers can be removed from deep LLMs with minimal performance impact, suggesting that uniform accumulation wastes model capacity.</em></figcaption></figure></div><p>As shown in figure 5, removing 3 middle layers from a 12-layer model results in only a 3% performance drop. This suggests that the uniform accumulation of residual connections makes many intermediate layers effectively redundant. The model cannot efficiently utilize depth because each additional layer has diminishing marginal impact on the blended hidden state.</p><p>Now that we see the problem clearly, a natural question arises: can we fix this by applying the same trick that transformers already use to solve a structurally identical problem?</p><div><hr></div><h2><strong>The depth-time duality</strong></h2><p>The conceptual breakthrough behind Attention Residuals comes from a simple but profound observation: the problem of information dilution across network depth is structurally identical to the problem of memory loss across a sequence of tokens. And we already know how to solve the sequence version.</p><h3><strong>The sequence analogy</strong></h3><p>Before transformers, recurrent neural networks processed sequences by passing a hidden state from one time step to the next. Early tokens were progressively forgotten as the sequence grew, because each new token blended into the same fixed-size hidden state. 
This &#8220;forgetting&#8221; problem limited how far back an RNN could effectively look.</p><p>The transformer&#8217;s self-attention mechanism solved this elegantly, as shown in figure 6.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CGof!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43910995-a77c-4d93-826e-8e67b7c1ab4b_2752x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CGof!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43910995-a77c-4d93-826e-8e67b7c1ab4b_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!CGof!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43910995-a77c-4d93-826e-8e67b7c1ab4b_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!CGof!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43910995-a77c-4d93-826e-8e67b7c1ab4b_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!CGof!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43910995-a77c-4d93-826e-8e67b7c1ab4b_2752x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CGof!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43910995-a77c-4d93-826e-8e67b7c1ab4b_2752x1536.png" width="1456" height="813" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/43910995-a77c-4d93-826e-8e67b7c1ab4b_2752x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:813,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Figure 6. The memory loss problem in sequences. Top: Without attention (RNN), early tokens progressively fade as the sequence grows, and the hidden state contains mostly recent information. Bottom: With attention (Transformer), any position can selectively access any previous position with learned, content-dependent weights, preventing the forgetting problem.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Figure 6. The memory loss problem in sequences. Top: Without attention (RNN), early tokens progressively fade as the sequence grows, and the hidden state contains mostly recent information. Bottom: With attention (Transformer), any position can selectively access any previous position with learned, content-dependent weights, preventing the forgetting problem." title="Figure 6. The memory loss problem in sequences. Top: Without attention (RNN), early tokens progressively fade as the sequence grows, and the hidden state contains mostly recent information. Bottom: With attention (Transformer), any position can selectively access any previous position with learned, content-dependent weights, preventing the forgetting problem." 
srcset="https://substackcdn.com/image/fetch/$s_!CGof!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43910995-a77c-4d93-826e-8e67b7c1ab4b_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!CGof!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43910995-a77c-4d93-826e-8e67b7c1ab4b_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!CGof!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43910995-a77c-4d93-826e-8e67b7c1ab4b_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!CGof!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43910995-a77c-4d93-826e-8e67b7c1ab4b_2752x1536.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Figure 6. The memory loss problem in sequences. Top: Without attention (RNN), early tokens progressively fade as the sequence grows, and the hidden state contains mostly recent information. Bottom: With attention (Transformer), any position can selectively access any previous position with learned, content-dependent weights, preventing the forgetting problem.</em></figcaption></figure></div><p>As illustrated in figure 6, the transformer replaced the sequential blending of an RNN with selective retrieval via attention. Instead of each position receiving only the previous position&#8217;s output, any position can directly attend to any earlier position. The attention weights are content-dependent, meaning the model learns which earlier positions are most relevant for each new token.</p><h3><strong>Rotating attention from time to depth</strong></h3><p>Now here is the key insight. In the depth dimension of a transformer, the same &#8220;forgetting&#8221; problem occurs. Early layer outputs are progressively diluted as more layers are added, just as early tokens were forgotten in RNNs. 
The fix is structurally identical: let each layer directly access any previous layer with learned, content-dependent weights.</p><p>This is what the authors of Attention Residuals describe as &#8220;rotating&#8221; the attention mechanism 90 degrees, as shown in figure 7.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NZd3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffab25541-41bb-41fe-ad24-784c287835ad_2752x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NZd3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffab25541-41bb-41fe-ad24-784c287835ad_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!NZd3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffab25541-41bb-41fe-ad24-784c287835ad_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!NZd3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffab25541-41bb-41fe-ad24-784c287835ad_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!NZd3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffab25541-41bb-41fe-ad24-784c287835ad_2752x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NZd3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffab25541-41bb-41fe-ad24-784c287835ad_2752x1536.png" width="1456" height="813" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fab25541-41bb-41fe-ad24-784c287835ad_2752x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:813,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Figure 7. The depth-time duality. Left: standard self-attention operates horizontally across sequence positions, letting each token attend to previous tokens using softmax-weighted combinations. Right: AttnRes operates vertically across network depth, letting each layer attend to previous layers using the same mathematical structure. The formula h = sum(alpha_i * v_i) is identical in both cases.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Figure 7. The depth-time duality. Left: standard self-attention operates horizontally across sequence positions, letting each token attend to previous tokens using softmax-weighted combinations. Right: AttnRes operates vertically across network depth, letting each layer attend to previous layers using the same mathematical structure. The formula h = sum(alpha_i * v_i) is identical in both cases." title="Figure 7. The depth-time duality. Left: standard self-attention operates horizontally across sequence positions, letting each token attend to previous tokens using softmax-weighted combinations. Right: AttnRes operates vertically across network depth, letting each layer attend to previous layers using the same mathematical structure. The formula h = sum(alpha_i * v_i) is identical in both cases." 
srcset="https://substackcdn.com/image/fetch/$s_!NZd3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffab25541-41bb-41fe-ad24-784c287835ad_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!NZd3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffab25541-41bb-41fe-ad24-784c287835ad_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!NZd3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffab25541-41bb-41fe-ad24-784c287835ad_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!NZd3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffab25541-41bb-41fe-ad24-784c287835ad_2752x1536.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>Figure 7. The depth-time duality. Left: standard self-attention operates horizontally across sequence positions, letting each token attend to previous tokens using softmax-weighted combinations. Right: AttnRes operates vertically across network depth, letting each layer attend to previous layers using the same mathematical structure. The formula h = sum(alpha_i * v_i) is identical in both cases.</em></figcaption></figure></div><p>As shown in figure 7, both mechanisms compute h = sum(alpha_i * v_i), where the alpha weights are computed via softmax over learned scores. The only difference is the dimension: standard attention operates across sequence positions (horizontal), while AttnRes operates across network depth (vertical).</p><p>Let&#8217;s visualize this rotation more concretely, as illustrated in figure 8.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-GFb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb57e419c-2d6b-45ce-b451-005026c503a1_2752x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-GFb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb57e419c-2d6b-45ce-b451-005026c503a1_2752x1536.png 424w, 
https://substackcdn.com/image/fetch/$s_!-GFb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb57e419c-2d6b-45ce-b451-005026c503a1_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!-GFb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb57e419c-2d6b-45ce-b451-005026c503a1_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!-GFb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb57e419c-2d6b-45ce-b451-005026c503a1_2752x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-GFb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb57e419c-2d6b-45ce-b451-005026c503a1_2752x1536.png" width="1456" height="813" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b57e419c-2d6b-45ce-b451-005026c503a1_2752x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:813,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Figure 8. Rotating attention 90 degrees. Left: a standard self-attention matrix with token positions on the axes, attending across the sequence (horizontal). Right: an AttnRes attention pattern with layer depths on the axes, attending across depth (vertical). The same causal masking structure applies: each position/layer can only attend to those that came before it.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Figure 8. 
Rotating attention 90 degrees. Left: a standard self-attention matrix with token positions on the axes, attending across the sequence (horizontal). Right: an AttnRes attention pattern with layer depths on the axes, attending across depth (vertical). The same causal masking structure applies: each position/layer can only attend to those that came before it." title="Figure 8. Rotating attention 90 degrees. Left: a standard self-attention matrix with token positions on the axes, attending across the sequence (horizontal). Right: an AttnRes attention pattern with layer depths on the axes, attending across depth (vertical). The same causal masking structure applies: each position/layer can only attend to those that came before it." srcset="https://substackcdn.com/image/fetch/$s_!-GFb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb57e419c-2d6b-45ce-b451-005026c503a1_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!-GFb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb57e419c-2d6b-45ce-b451-005026c503a1_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!-GFb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb57e419c-2d6b-45ce-b451-005026c503a1_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!-GFb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb57e419c-2d6b-45ce-b451-005026c503a1_2752x1536.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>Figure 8. Rotating attention 90 degrees. Left: a standard self-attention matrix with token positions on the axes, attending across the sequence (horizontal). Right: an AttnRes attention pattern with layer depths on the axes, attending across depth (vertical). The same causal masking structure applies: each position/layer can only attend to those that came before it.</em></figcaption></figure></div><p>As shown in figure 8, the standard self-attention matrix (left) has tokens on both axes, with the causal mask ensuring each token only attends to previous tokens. The AttnRes pattern (right) has layers on both axes, with each layer attending only to previous layers. The mathematical structure is the same, just applied to a different dimension.</p><p>This is a profound reframing. 
The residual stream has always been performing a kind of &#8220;attention&#8221; over depth, but the crudest possible kind: every layer gets weight 1.0 regardless of content. AttnRes upgrades this to full softmax attention with learned, input-dependent weights.</p><p>With this duality in mind, let&#8217;s see exactly how Attention Residuals work, step by step.</p><div><hr></div><h2><strong>The mechanics of Attention Residuals: a hands-on walkthrough</strong></h2><p>We now have the intuition: replace fixed-weight residual accumulation with softmax attention over previous layer outputs. Let&#8217;s build the mechanism piece by piece, starting with a review of what standard residuals compute and ending with the full AttnRes forward pass.</p><h3><strong>Standard residuals in action</strong></h3><p>Let&#8217;s first trace through our running example with standard residual connections to see the problem concretely. The side-by-side comparison in figure 9 frames what we are about to build.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Z-_K!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb7b806c-537f-491e-9dc9-9b42c4ca8be7_2752x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Z-_K!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb7b806c-537f-491e-9dc9-9b42c4ca8be7_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!Z-_K!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb7b806c-537f-491e-9dc9-9b42c4ca8be7_2752x1536.png 848w, 
https://substackcdn.com/image/fetch/$s_!Z-_K!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb7b806c-537f-491e-9dc9-9b42c4ca8be7_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!Z-_K!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb7b806c-537f-491e-9dc9-9b42c4ca8be7_2752x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Z-_K!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb7b806c-537f-491e-9dc9-9b42c4ca8be7_2752x1536.png" width="1456" height="813" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/eb7b806c-537f-491e-9dc9-9b42c4ca8be7_2752x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:813,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Figure 9. Standard residual connections vs Attention Residuals. Left: each layer blindly adds its output to the accumulated sum with fixed weight 1.0, and the magnitude grows as O(L). Right: each layer uses a learned pseudo-query vector w_l to selectively weight previous layer outputs via softmax attention, keeping magnitudes bounded.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Figure 9. Standard residual connections vs Attention Residuals. Left: each layer blindly adds its output to the accumulated sum with fixed weight 1.0, and the magnitude grows as O(L). 
Right: each layer uses a learned pseudo-query vector w_l to selectively weight previous layer outputs via softmax attention, keeping magnitudes bounded." title="Figure 9. Standard residual connections vs Attention Residuals. Left: each layer blindly adds its output to the accumulated sum with fixed weight 1.0, and the magnitude grows as O(L). Right: each layer uses a learned pseudo-query vector w_l to selectively weight previous layer outputs via softmax attention, keeping magnitudes bounded." srcset="https://substackcdn.com/image/fetch/$s_!Z-_K!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb7b806c-537f-491e-9dc9-9b42c4ca8be7_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!Z-_K!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb7b806c-537f-491e-9dc9-9b42c4ca8be7_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!Z-_K!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb7b806c-537f-491e-9dc9-9b42c4ca8be7_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!Z-_K!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb7b806c-537f-491e-9dc9-9b42c4ca8be7_2752x1536.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>Figure 9. Standard residual connections vs Attention Residuals. Left: each layer blindly adds its output to the accumulated sum with fixed weight 1.0, and the magnitude grows as O(L). Right: each layer uses a learned pseudo-query vector w_l to selectively weight previous layer outputs via softmax attention, keeping magnitudes bounded.</em></figcaption></figure></div><p>As illustrated in figure 9, the standard approach (left) produces a hidden state h_4 with growing magnitude because all four layer outputs are summed with equal weight. 
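</p><p>To make the contrast concrete, here is a minimal Python sketch of the two combination rules. The four 2-dimensional layer outputs and the attention scores are made-up values chosen for illustration (they are not the numbers behind the figures), and the helper names <code>weighted_sum</code> and <code>softmax</code> are hypothetical.</p>

```python
import math


def norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def weighted_sum(vectors, weights):
    """Return sum_i weights[i] * vectors[i], computed per coordinate."""
    dim = len(vectors[0])
    return [sum(w * v[j] for w, v in zip(weights, vectors)) for j in range(dim)]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical layer outputs (d = 2), each with norm 1.0.
layer_outputs = [
    [1.0, 0.0],
    [0.8, 0.6],
    [0.6, 0.8],
    [1.0, 0.0],
]

# Standard residual: every layer output is added with fixed weight 1.0.
h_standard = weighted_sum(layer_outputs, [1.0] * len(layer_outputs))

# AttnRes-style combination: softmax weights over illustrative scores.
scores = [0.1, 1.2, 0.3, 2.0]
alphas = softmax(scores)
h_attnres = weighted_sum(layer_outputs, alphas)

max_norm = max(norm(v) for v in layer_outputs)
print("standard residual norm:", round(norm(h_standard), 3))
print("softmax-weighted norm: ", round(norm(h_attnres), 3))
print("largest single-layer norm:", round(max_norm, 3))
```

<p>With these assumed values, every layer output has norm 1.0 and the equal-weight sum has a norm of about 3.68. The softmax-weighted combination cannot exceed the largest individual norm, because it is a convex combination of the layer outputs.</p><p>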
The AttnRes approach (right) produces a bounded h_4 because the softmax weights, computed via pseudo-query vectors w_1 through w_4, always sum to 1.</p><p>Now let&#8217;s walk through the standard computation layer by layer, as shown in figure 10.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6tGZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5e15fe2-ad0c-4b05-9a7c-c6523536ef04_2752x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6tGZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5e15fe2-ad0c-4b05-9a7c-c6523536ef04_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!6tGZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5e15fe2-ad0c-4b05-9a7c-c6523536ef04_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!6tGZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5e15fe2-ad0c-4b05-9a7c-c6523536ef04_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!6tGZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5e15fe2-ad0c-4b05-9a7c-c6523536ef04_2752x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6tGZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5e15fe2-ad0c-4b05-9a7c-c6523536ef04_2752x1536.png" width="1456" height="813" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e5e15fe2-ad0c-4b05-9a7c-c6523536ef04_2752x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:813,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Figure 10. Standard residual computation for our running example. Four tokens flow through 6 layers, with each layer's output added with weight 1.0. The hidden state vectors grow progressively larger at each layer, from norm 1.0 at the embedding to norm 3.2 at layer 6. The bar chart on the right confirms the linear O(L) growth.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Figure 10. Standard residual computation for our running example. Four tokens flow through 6 layers, with each layer's output added with weight 1.0. The hidden state vectors grow progressively larger at each layer, from norm 1.0 at the embedding to norm 3.2 at layer 6. The bar chart on the right confirms the linear O(L) growth." title="Figure 10. Standard residual computation for our running example. Four tokens flow through 6 layers, with each layer's output added with weight 1.0. The hidden state vectors grow progressively larger at each layer, from norm 1.0 at the embedding to norm 3.2 at layer 6. The bar chart on the right confirms the linear O(L) growth." 
srcset="https://substackcdn.com/image/fetch/$s_!6tGZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5e15fe2-ad0c-4b05-9a7c-c6523536ef04_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!6tGZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5e15fe2-ad0c-4b05-9a7c-c6523536ef04_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!6tGZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5e15fe2-ad0c-4b05-9a7c-c6523536ef04_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!6tGZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5e15fe2-ad0c-4b05-9a7c-c6523536ef04_2752x1536.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>Figure 10. Standard residual computation for our running example. Four tokens flow through 6 layers, with each layer&#8217;s output added with weight 1.0. The hidden state vectors grow progressively larger at each layer, from norm 1.0 at the embedding to norm 3.2 at layer 6. The bar chart on the right confirms the linear O(L) growth.</em></figcaption></figure></div><p>As shown in figure 10, our tokens start with embedding norm 1.0. After layer 1, the hidden state has norm 1.4 (the sum of the embedding and layer 1&#8217;s output). By layer 3, the norm has grown to 2.3. By layer 6, it reaches 3.2. Each layer contributes roughly 0.4 to the total norm, creating steady linear growth.</p><p>We can verify this growth pattern across all four of our tokens, as shown in figure 11.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!py8A!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29e01df5-ec3f-4c22-a632-48331ead1a4b_1248x850.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!py8A!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29e01df5-ec3f-4c22-a632-48331ead1a4b_1248x850.png 424w, 
https://substackcdn.com/image/fetch/$s_!py8A!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29e01df5-ec3f-4c22-a632-48331ead1a4b_1248x850.png 848w, https://substackcdn.com/image/fetch/$s_!py8A!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29e01df5-ec3f-4c22-a632-48331ead1a4b_1248x850.png 1272w, https://substackcdn.com/image/fetch/$s_!py8A!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29e01df5-ec3f-4c22-a632-48331ead1a4b_1248x850.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!py8A!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29e01df5-ec3f-4c22-a632-48331ead1a4b_1248x850.png" width="1248" height="850" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/29e01df5-ec3f-4c22-a632-48331ead1a4b_1248x850.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:850,&quot;width&quot;:1248,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Figure 11. Hidden state magnitudes for our 4 tokens across 6 layers with standard residuals. All four tokens (&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Figure 11. Hidden state magnitudes for our 4 tokens across 6 layers with standard residuals. All four tokens (" title="Figure 11. Hidden state magnitudes for our 4 tokens across 6 layers with standard residuals. 
All four tokens (" srcset="https://substackcdn.com/image/fetch/$s_!py8A!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29e01df5-ec3f-4c22-a632-48331ead1a4b_1248x850.png 424w, https://substackcdn.com/image/fetch/$s_!py8A!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29e01df5-ec3f-4c22-a632-48331ead1a4b_1248x850.png 848w, https://substackcdn.com/image/fetch/$s_!py8A!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29e01df5-ec3f-4c22-a632-48331ead1a4b_1248x850.png 1272w, https://substackcdn.com/image/fetch/$s_!py8A!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29e01df5-ec3f-4c22-a632-48331ead1a4b_1248x850.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>Figure 11. Hidden state magnitudes for our 4 tokens across 6 layers with standard residuals. All four tokens (&#8220;The&#8221;, &#8220;cat&#8221;, &#8220;sat&#8221;, &#8220;down&#8221;) show monotonically increasing norms, confirming the O(L) growth problem in our running example. The curves are tightly clustered, growing from 1.0 to approximately 3.4-3.6.</em></figcaption></figure></div><p>As illustrated in figure 11, all four tokens exhibit the same monotonic growth pattern. The norms rise from 1.0 at the embedding layer to between 3.4 and 3.6 by layer 6. This indicates that the dilution problem affects every token in our example, not just one particular input.</p><h3><strong>The pseudo-query vector</strong></h3><p>The first new concept in AttnRes is the pseudo-query vector. Each layer l has a learned parameter vector w_l of the same dimension as the hidden state (in our example, d = 8). 
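</p><p>As a rough sketch of how such a vector could be used, the Python snippet below scores each previous layer output by its dot product with w_l and turns the scores into softmax weights. The dot-product scoring rule, the values of <code>w_l</code>, and the three previous outputs are assumptions made for illustration (with d = 4 rather than 8 to keep the numbers readable); the actual AttnRes formulation may normalize or scale the scores differently.</p>

```python
import math


def dot(a, b):
    """Dot product of two equal-length lists of floats."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical learned pseudo-query for layer l (d = 4 for readability).
w_l = [1.0, 0.0, -1.0, 0.5]

# Hypothetical outputs v_i of the earlier layers (embedding, layer 1, layer 2).
previous_outputs = [
    [0.5, 0.5, 0.5, 0.5],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]

# Assumed scoring rule: score_i is the dot product of w_l and v_i.
scores = [dot(w_l, v) for v in previous_outputs]
alphas = softmax(scores)

# Combine the earlier outputs: h = sum(alpha_i * v_i).
h = [
    sum(a * v[j] for a, v in zip(alphas, previous_outputs))
    for j in range(len(w_l))
]

print("scores:", scores)
print("alphas:", [round(a, 3) for a in alphas])
print("h:", [round(x, 3) for x in h])
```

<p>Under these assumptions the second output receives the largest weight, because it is the one most aligned with w_l.</p><p>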
This vector encodes the question: &#8220;which previous layers are most relevant for my computation?&#8221;</p><p>Let&#8217;s examine this concept in detail, as illustrated in figure 12.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!BBcE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf5a846b-04b6-4586-8537-0c335e060039_2752x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!BBcE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf5a846b-04b6-4586-8537-0c335e060039_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!BBcE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf5a846b-04b6-4586-8537-0c335e060039_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!BBcE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf5a846b-04b6-4586-8537-0c335e060039_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!BBcE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf5a846b-04b6-4586-8537-0c335e060039_2752x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!BBcE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf5a846b-04b6-4586-8537-0c335e060039_2752x1536.png" width="1456" height="813" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/af5a846b-04b6-4586-8537-0c335e060039_2752x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:813,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Figure 12. The pseudo-query vector. Each layer l has a learned vector w_l of shape (8,) that acts as a &quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Figure 12. The pseudo-query vector. Each layer l has a learned vector w_l of shape (8,) that acts as a " title="Figure 12. The pseudo-query vector. Each layer l has a learned vector w_l of shape (8,) that acts as a " srcset="https://substackcdn.com/image/fetch/$s_!BBcE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf5a846b-04b6-4586-8537-0c335e060039_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!BBcE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf5a846b-04b6-4586-8537-0c335e060039_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!BBcE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf5a846b-04b6-4586-8537-0c335e060039_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!BBcE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf5a846b-04b6-4586-8537-0c335e060039_2752x1536.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>Figure 12. The pseudo-query vector. Each layer l has a learned vector w_l of shape (8,) that acts as a &#8220;question&#8221; asking which previous layers matter most. The pseudo-query computes dot products with RMSNorm-normalized representations from previous layers, producing scores that pass through softmax to yield attention weights. The scores in this example are 0.2, 0.8, 1.5, 0.6 for layers 0-3, producing weights alpha_0=0.08, alpha_1=0.15, alpha_2=0.55, alpha_3=0.22.</em></figcaption></figure></div><p>As shown in figure 12, the pseudo-query w_l is a fixed learned parameter, not derived from the input like a standard attention query. 
However, the attention weights are still input-dependent because the &#8220;keys&#8221; come from previous layer outputs (v_i), which depend on the input tokens. Different inputs produce different keys, which produce different dot products with the same pseudo-query, which produce different attention weights.</p><blockquote><p><strong>What is a pseudo-query?</strong> A pseudo-query is a learned parameter vector that functions like a query in standard attention, but instead of being derived from the input, it is a fixed vector that each layer learns during training. It asks the same &#8220;question&#8221; for every input, but gets different &#8220;answers&#8221; because the keys (previous layer outputs) vary with the input.</p></blockquote><h3><strong>Computing attention weights over depth</strong></h3><p>Now let&#8217;s see the full computation at a single layer. At layer 4, we have four previous outputs to attend over: v_0 (embeddings), v_1 (layer 1 output), v_2 (layer 2 output), and v_3 (layer 3 output). 
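</p><p>Here is a minimal NumPy sketch of the setup just described, for a single token position: four previous outputs v_0 to v_3 serve as keys and values, and one pseudo-query scores them. The function names (rms_norm, softmax, depth_attention), the random stand-in vectors, and the gain-free RMSNorm are assumptions made for illustration; they are not taken from the AttnRes implementation.</p>

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # Assumed gain-free RMSNorm: divide each vector by its root-mean-square.
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

def softmax(z):
    # Numerically stable softmax over a 1-D array of logits.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def depth_attention(w_l, prev_outputs):
    # prev_outputs has shape (num_prev, d): one row per earlier layer output v_i.
    keys = rms_norm(prev_outputs)   # normalize: k_i = RMSNorm(v_i)
    logits = keys @ w_l             # score: one dot product per earlier layer
    alphas = softmax(logits)        # softmax: positive weights that sum to 1
    h = alphas @ prev_outputs       # aggregate: weighted sum of the v_i
    return alphas, h

rng = np.random.default_rng(0)
d = 8
prev_outputs = rng.normal(size=(4, d))   # hypothetical stand-ins for v_0..v_3
w_4 = 0.5 * rng.normal(size=d)           # hypothetical stand-in for the learned w_4

alphas, h_4 = depth_attention(w_4, prev_outputs)
print("weights:", np.round(alphas, 3), "sum:", alphas.sum())

# Exact softmax of the logits quoted in the running example.
example = softmax(np.array([0.2, 0.8, 1.5, 0.6]))
print("softmax of quoted logits:", np.round(example, 3))
```

<p>Whatever the random inputs, the weights are positive and sum to 1, so h_4 is a convex combination of the previous outputs. As a side check, an exact softmax over the logits [0.2, 0.8, 1.5, 0.6] comes out near [0.13, 0.23, 0.46, 0.19], so the rounded weights quoted in the figures are best read as illustrative.</p><p>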
The detailed computation is shown in figure 13.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wekb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff132bdc9-1291-40b0-af92-07b86eb35ea2_2752x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!wekb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff132bdc9-1291-40b0-af92-07b86eb35ea2_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!wekb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff132bdc9-1291-40b0-af92-07b86eb35ea2_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!wekb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff132bdc9-1291-40b0-af92-07b86eb35ea2_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!wekb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff132bdc9-1291-40b0-af92-07b86eb35ea2_2752x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!wekb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff132bdc9-1291-40b0-af92-07b86eb35ea2_2752x1536.png" width="1456" height="813" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f132bdc9-1291-40b0-af92-07b86eb35ea2_2752x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:813,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Figure 13. AttnRes computation at layer 4 in detail. The pseudo-query w_4 computes dot products with RMSNorm-normalized representations from layers 0 through 3, producing logits [0.2, 0.8, 1.5, 0.6]. After softmax, the weights are alpha_0=0.08, alpha_1=0.15, alpha_2=0.55, alpha_3=0.22. The output h_4 = 0.08*v_0 + 0.15*v_1 + 0.55*v_2 + 0.22*v_3 is bounded because the weights form a convex combination.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Figure 13. AttnRes computation at layer 4 in detail. The pseudo-query w_4 computes dot products with RMSNorm-normalized representations from layers 0 through 3, producing logits [0.2, 0.8, 1.5, 0.6]. After softmax, the weights are alpha_0=0.08, alpha_1=0.15, alpha_2=0.55, alpha_3=0.22. The output h_4 = 0.08*v_0 + 0.15*v_1 + 0.55*v_2 + 0.22*v_3 is bounded because the weights form a convex combination." title="Figure 13. AttnRes computation at layer 4 in detail. The pseudo-query w_4 computes dot products with RMSNorm-normalized representations from layers 0 through 3, producing logits [0.2, 0.8, 1.5, 0.6]. After softmax, the weights are alpha_0=0.08, alpha_1=0.15, alpha_2=0.55, alpha_3=0.22. The output h_4 = 0.08*v_0 + 0.15*v_1 + 0.55*v_2 + 0.22*v_3 is bounded because the weights form a convex combination." 
srcset="https://substackcdn.com/image/fetch/$s_!wekb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff132bdc9-1291-40b0-af92-07b86eb35ea2_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!wekb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff132bdc9-1291-40b0-af92-07b86eb35ea2_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!wekb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff132bdc9-1291-40b0-af92-07b86eb35ea2_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!wekb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff132bdc9-1291-40b0-af92-07b86eb35ea2_2752x1536.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>Figure 13. AttnRes computation at layer 4 in detail. The pseudo-query w_4 computes dot products with RMSNorm-normalized representations from layers 0 through 3, producing logits [0.2, 0.8, 1.5, 0.6]. After softmax, the weights are alpha_0=0.08, alpha_1=0.15, alpha_2=0.55, alpha_3=0.22. The output h_4 = 0.08*v_0 + 0.15*v_1 + 0.55*v_2 + 0.22*v_3 is bounded because the weights form a convex combination.</em></figcaption></figure></div><p>As shown in figure 13, the computation proceeds in four steps:</p><ol><li><p><strong>Normalize</strong>: Each previous output v_i is passed through RMSNorm to produce a key k_i</p></li><li><p><strong>Score</strong>: The pseudo-query w_4 is dotted with each key to produce logits [0.2, 0.8, 1.5, 0.6]</p></li><li><p><strong>Softmax</strong>: The logits are converted to attention weights that sum to 1: [0.08, 0.15, 0.55, 0.22]</p></li><li><p><strong>Aggregate</strong>: The output is the weighted sum: h_4 = 0.08v_0 + 0.15v_1 + 0.55v_2 + 0.22v_3</p></li></ol><p>The formal equation for the attention weights is:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\alpha_{i \\rightarrow l} = \\text{softmax}_i(w_l^T k_i) = \\frac{\\exp(w_l^T k_i)}{\\sum_{j=0}^{l-1} \\exp(w_l^T k_j)}, \\quad k_i = \\text{RMSNorm}(v_i)&quot;,&quot;id&quot;:&quot;KVFNJKGDGP&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>Let&#8217;s zoom into the attention weight computation itself, as shown in figure 14.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!NACG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf21c5b3-0df5-4d0a-b1d2-9ed0e8a35310_2752x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NACG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf21c5b3-0df5-4d0a-b1d2-9ed0e8a35310_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!NACG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf21c5b3-0df5-4d0a-b1d2-9ed0e8a35310_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!NACG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf21c5b3-0df5-4d0a-b1d2-9ed0e8a35310_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!NACG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf21c5b3-0df5-4d0a-b1d2-9ed0e8a35310_2752x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NACG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf21c5b3-0df5-4d0a-b1d2-9ed0e8a35310_2752x1536.png" width="1456" height="813" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/af21c5b3-0df5-4d0a-b1d2-9ed0e8a35310_2752x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:813,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Figure 14. Attention weight computation for layer 4. 
The pseudo-query w_4 of shape (8,) is dotted with the key matrix K of shape (4, 8) containing the RMSNorm-normalized outputs from layers 0-3. The dot products produce logits of shape (4,): [0.2, 0.8, 1.5, 0.6]. After softmax, the resulting weights are alpha_0=0.08, alpha_1=0.15, alpha_2=0.55, alpha_3=0.22. The bar chart shows that layer 4 draws 55% of its input from layer 2's output, making it the dominant source.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Figure 14. Attention weight computation for layer 4. The pseudo-query w_4 of shape (8,) is dotted with the key matrix K of shape (4, 8) containing the RMSNorm-normalized outputs from layers 0-3. The dot products produce logits of shape (4,): [0.2, 0.8, 1.5, 0.6]. After softmax, the resulting weights are alpha_0=0.08, alpha_1=0.15, alpha_2=0.55, alpha_3=0.22. The bar chart shows that layer 4 draws 55% of its input from layer 2's output, making it the dominant source." title="Figure 14. Attention weight computation for layer 4. The pseudo-query w_4 of shape (8,) is dotted with the key matrix K of shape (4, 8) containing the RMSNorm-normalized outputs from layers 0-3. The dot products produce logits of shape (4,): [0.2, 0.8, 1.5, 0.6]. After softmax, the resulting weights are alpha_0=0.08, alpha_1=0.15, alpha_2=0.55, alpha_3=0.22. The bar chart shows that layer 4 draws 55% of its input from layer 2's output, making it the dominant source." 
srcset="https://substackcdn.com/image/fetch/$s_!NACG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf21c5b3-0df5-4d0a-b1d2-9ed0e8a35310_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!NACG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf21c5b3-0df5-4d0a-b1d2-9ed0e8a35310_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!NACG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf21c5b3-0df5-4d0a-b1d2-9ed0e8a35310_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!NACG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf21c5b3-0df5-4d0a-b1d2-9ed0e8a35310_2752x1536.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>Figure 14. Attention weight computation for layer 4. The pseudo-query w_4 of shape (8,) is dotted with the key matrix K of shape (4, 8) containing the RMSNorm-normalized outputs from layers 0-3. The dot products produce logits of shape (4,): [0.2, 0.8, 1.5, 0.6]. After softmax, the resulting weights are alpha_0=0.08, alpha_1=0.15, alpha_2=0.55, alpha_3=0.22. The bar chart shows that layer 4 draws 55% of its input from layer 2&#8217;s output, making it the dominant source.</em></figcaption></figure></div><p>As illustrated in figure 14, layer 4 selectively draws 55% of its input from layer 2&#8217;s output. Notice how different this is from standard residuals, where all layers contribute equally. 
The model has learned that layer 2&#8217;s representation is the most useful input for layer 4&#8217;s computation, and it can focus its attention accordingly.</p><h3><strong>The full AttnRes forward pass</strong></h3><p>Now let&#8217;s trace the complete forward pass with AttnRes through all 6 layers of our running example, as shown in figure 15.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!obNp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2ab9545-2e94-420f-9d96-253a814cc07b_2752x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!obNp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2ab9545-2e94-420f-9d96-253a814cc07b_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!obNp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2ab9545-2e94-420f-9d96-253a814cc07b_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!obNp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2ab9545-2e94-420f-9d96-253a814cc07b_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!obNp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2ab9545-2e94-420f-9d96-253a814cc07b_2752x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!obNp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2ab9545-2e94-420f-9d96-253a814cc07b_2752x1536.png" width="1456" height="813" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c2ab9545-2e94-420f-9d96-253a814cc07b_2752x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:813,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Figure 15. Full AttnRes computation for our running example. The same 4 tokens flow through 6 layers, but each layer selectively weights its inputs via softmax attention. At layer 1, alpha_0 = 1.0 (only the embeddings are available). At layer 2, the weights are [0.3, 0.7]. At layer 3, the weights are [0.1, 0.2, 0.7]. The hidden state magnitudes remain bounded throughout: norms stay between 0.95 and 1.1, compared to the 1.0 to 3.2 range with standard residuals.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Figure 15. Full AttnRes computation for our running example. The same 4 tokens flow through 6 layers, but each layer selectively weights its inputs via softmax attention. At layer 1, alpha_0 = 1.0 (only the embeddings are available). At layer 2, the weights are [0.3, 0.7]. At layer 3, the weights are [0.1, 0.2, 0.7]. The hidden state magnitudes remain bounded throughout: norms stay between 0.95 and 1.1, compared to the 1.0 to 3.2 range with standard residuals." title="Figure 15. Full AttnRes computation for our running example. The same 4 tokens flow through 6 layers, but each layer selectively weights its inputs via softmax attention. At layer 1, alpha_0 = 1.0 (only the embeddings are available). At layer 2, the weights are [0.3, 0.7]. At layer 3, the weights are [0.1, 0.2, 0.7]. 
The hidden state magnitudes remain bounded throughout: norms stay between 0.95 and 1.1, compared to the 1.0 to 3.2 range with standard residuals." srcset="https://substackcdn.com/image/fetch/$s_!obNp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2ab9545-2e94-420f-9d96-253a814cc07b_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!obNp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2ab9545-2e94-420f-9d96-253a814cc07b_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!obNp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2ab9545-2e94-420f-9d96-253a814cc07b_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!obNp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2ab9545-2e94-420f-9d96-253a814cc07b_2752x1536.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>Figure 15. Full AttnRes computation for our running example. The same 4 tokens flow through 6 layers, but each layer selectively weights its inputs via softmax attention. At layer 1, alpha_0 = 1.0 (only the embeddings are available). At layer 2, the weights are [0.3, 0.7]. At layer 3, the weights are [0.1, 0.2, 0.7]. The hidden state magnitudes remain bounded throughout: norms stay between 0.95 and 1.1, compared to the 1.0 to 3.2 range with standard residuals.</em></figcaption></figure></div><p>As shown in figure 15, the hidden state magnitudes stay bounded near 1.0 throughout all 6 layers. This is a dramatic contrast with the standard residual version, where magnitudes more than tripled by layer 6. 
The key difference is visible in the bar chart on the right: instead of linearly increasing bars, we see roughly uniform bars that stay below the dashed line marking the standard residual growth.</p><p>Let&#8217;s confirm this with a direct comparison of the magnitude curves, as shown in figure 16.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!r2o_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F670c55b1-7af5-434a-aa6e-7f06e67734a5_1248x850.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!r2o_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F670c55b1-7af5-434a-aa6e-7f06e67734a5_1248x850.png 424w, https://substackcdn.com/image/fetch/$s_!r2o_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F670c55b1-7af5-434a-aa6e-7f06e67734a5_1248x850.png 848w, https://substackcdn.com/image/fetch/$s_!r2o_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F670c55b1-7af5-434a-aa6e-7f06e67734a5_1248x850.png 1272w, https://substackcdn.com/image/fetch/$s_!r2o_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F670c55b1-7af5-434a-aa6e-7f06e67734a5_1248x850.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!r2o_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F670c55b1-7af5-434a-aa6e-7f06e67734a5_1248x850.png" width="1248" height="850" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/670c55b1-7af5-434a-aa6e-7f06e67734a5_1248x850.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:850,&quot;width&quot;:1248,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Figure 16. Hidden state magnitudes comparison. The orange line (Standard Transformer) grows linearly from 1.0 to 3.45 over 6 layers. The blue line (Attention Residual) stays flat near 1.0, oscillating between 0.95 and 1.05. This bounded behavior is a direct consequence of the convex combination property: since softmax weights sum to 1, the output magnitude can never exceed the largest input magnitude.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Figure 16. Hidden state magnitudes comparison. The orange line (Standard Transformer) grows linearly from 1.0 to 3.45 over 6 layers. The blue line (Attention Residual) stays flat near 1.0, oscillating between 0.95 and 1.05. This bounded behavior is a direct consequence of the convex combination property: since softmax weights sum to 1, the output magnitude can never exceed the largest input magnitude." title="Figure 16. Hidden state magnitudes comparison. The orange line (Standard Transformer) grows linearly from 1.0 to 3.45 over 6 layers. The blue line (Attention Residual) stays flat near 1.0, oscillating between 0.95 and 1.05. This bounded behavior is a direct consequence of the convex combination property: since softmax weights sum to 1, the output magnitude can never exceed the largest input magnitude." 
srcset="https://substackcdn.com/image/fetch/$s_!r2o_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F670c55b1-7af5-434a-aa6e-7f06e67734a5_1248x850.png 424w, https://substackcdn.com/image/fetch/$s_!r2o_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F670c55b1-7af5-434a-aa6e-7f06e67734a5_1248x850.png 848w, https://substackcdn.com/image/fetch/$s_!r2o_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F670c55b1-7af5-434a-aa6e-7f06e67734a5_1248x850.png 1272w, https://substackcdn.com/image/fetch/$s_!r2o_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F670c55b1-7af5-434a-aa6e-7f06e67734a5_1248x850.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>Figure 16. Hidden state magnitudes comparison. The orange line (Standard Transformer) grows linearly from 1.0 to 3.45 over 6 layers. The blue line (Attention Residual) stays flat near 1.0, oscillating between 0.95 and 1.05. This bounded behavior is a direct consequence of the convex combination property: since softmax weights sum to 1, the output magnitude can never exceed the largest input magnitude.</em></figcaption></figure></div><p>As illustrated in figure 16, the difference is stark. The standard residual norm (orange) grows from 1.0 to 3.45 over just 6 layers. The AttnRes norm (blue) stays essentially flat near 1.0. This is a direct consequence of how Attention Residuals are built: by replacing fixed-weight accumulation with a softmax-weighted combination, each hidden state becomes a convex combination of its inputs, so its magnitude can never exceed that of the largest input.</p><h3><strong>Input-dependent layer selection</strong></h3><p>The final piece of the puzzle is that different tokens produce different attention weight distributions. This is what makes AttnRes strictly more expressive than fixed-weight approaches like DenseFormer. 
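</p><p>To make this input dependence concrete, here is a minimal pure-Python sketch. It assumes each layer weight comes from a softmax over dot products between a single shared pseudo-query and the RMS-normalized earlier outputs of one token. The toy sizes, the random vectors, and the helper names (<code>rms_normalize</code>, <code>layer_weights</code>, <code>mix</code>) are hypothetical and not taken from the authors' code.</p>

```python
import math
import random

def rms_normalize(vec, eps=1e-6):
    # Rescale a vector to unit root-mean-square (no learned gain in this sketch).
    rms = math.sqrt(sum(v * v for v in vec) / len(vec) + eps)
    return [v / rms for v in vec]

def softmax(logits):
    top = max(logits)                       # subtract the max for stability
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def layer_weights(pseudo_query, sources):
    # One logit per source: dot(pseudo_query, normalized source).
    logits = [sum(q * k for q, k in zip(pseudo_query, rms_normalize(s)))
              for s in sources]
    return softmax(logits)

def mix(weights, sources):
    # Weighted sum of the raw (un-normalized) sources.
    dim = len(sources[0])
    return [sum(w * s[i] for w, s in zip(weights, sources)) for i in range(dim)]

def norm(vec):
    return math.sqrt(sum(v * v for v in vec))

random.seed(0)
d, num_sources = 16, 4                      # toy sizes, chosen arbitrarily
w_4 = [random.gauss(0, 1) for _ in range(d)]   # one shared pseudo-query
token_a = [[random.gauss(0, 1) for _ in range(d)] for _ in range(num_sources)]
token_b = [[random.gauss(0, 1) for _ in range(d)] for _ in range(num_sources)]

alpha_a = layer_weights(w_4, token_a)
alpha_b = layer_weights(w_4, token_b)
mixed_a = mix(alpha_a, token_a)

print("token A weights:", [round(a, 3) for a in alpha_a])
print("token B weights:", [round(a, 3) for a in alpha_b])
# Convexity: the mix cannot be longer than the longest source vector.
print("mixed norm:", norm(mixed_a), "max source norm:", max(norm(s) for s in token_a))
```

<p>Both tokens share the same <code>w_4</code>, yet their weight vectors differ because the scores depend on each token's own layer outputs. The last line checks the convexity bound on a single example.</p><p>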
Let&#8217;s see this in action, as shown in figure 17.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gA5s!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46a870d2-62a3-432c-bbd7-adc5cbcbff1c_2752x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gA5s!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46a870d2-62a3-432c-bbd7-adc5cbcbff1c_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!gA5s!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46a870d2-62a3-432c-bbd7-adc5cbcbff1c_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!gA5s!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46a870d2-62a3-432c-bbd7-adc5cbcbff1c_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!gA5s!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46a870d2-62a3-432c-bbd7-adc5cbcbff1c_2752x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gA5s!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46a870d2-62a3-432c-bbd7-adc5cbcbff1c_2752x1536.png" width="1456" height="813" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/46a870d2-62a3-432c-bbd7-adc5cbcbff1c_2752x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:813,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Figure 17. Input-dependent attention weights. The same pseudo-query w_4 produces different weight distributions for different tokens. Token &quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Figure 17. Input-dependent attention weights. The same pseudo-query w_4 produces different weight distributions for different tokens. Token " title="Figure 17. Input-dependent attention weights. The same pseudo-query w_4 produces different weight distributions for different tokens. 
Token " srcset="https://substackcdn.com/image/fetch/$s_!gA5s!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46a870d2-62a3-432c-bbd7-adc5cbcbff1c_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!gA5s!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46a870d2-62a3-432c-bbd7-adc5cbcbff1c_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!gA5s!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46a870d2-62a3-432c-bbd7-adc5cbcbff1c_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!gA5s!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46a870d2-62a3-432c-bbd7-adc5cbcbff1c_2752x1536.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>Figure 17. Input-dependent attention weights. The same pseudo-query w_4 produces different weight distributions for different tokens. Token &#8220;The&#8221; (a function word) attends mostly to the embedding layer (alpha_0 = 0.60). Token &#8220;cat&#8221; (a content word) focuses on intermediate layers (alpha_2 = 0.55). Token &#8220;down&#8221; (context-dependent) attends most to recent layers (alpha_3 = 0.40). This input-dependence is what makes AttnRes strictly more expressive than DenseFormer.</em></figcaption></figure></div><p>As shown in figure 17, the same pseudo-query w_4 produces three very different attention patterns depending on the token:</p><ul><li><p><strong>Token &#8220;The&#8221;</strong>: A function word that attends mostly to the embedding layer (alpha_0 = 0.60). Function words carry relatively stable syntactic information that does not change much across layers.</p></li><li><p><strong>Token &#8220;cat&#8221;</strong>: A content word that focuses on intermediate layers (alpha_2 = 0.55). Content words benefit from the semantic processing performed by middle layers.</p></li><li><p><strong>Token &#8220;down&#8221;</strong>: A context-dependent word that attends most to recent layers (alpha_3 = 0.40). Its meaning depends heavily on the context built up by previous layers.</p></li></ul><p>This is the key advantage over fixed-weight approaches. DenseFormer uses the same weights for every token; AttnRes adapts its layer selection to each token individually.</p><p>We now have the full AttnRes mechanism, but there is a practical challenge: storing all L layer outputs requires O(Ld) memory. 
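</p><p>To get a feel for the scale, here is a rough back-of-the-envelope calculation. The depth and width (48 layers, d = 8192) are example values. The 4096 tokens held at once and the 2 bytes per value (16-bit storage) are my assumptions, and the function name <code>stored_bytes</code> is hypothetical.</p>

```python
def stored_bytes(num_sources, d, tokens, bytes_per_value=2):
    # Every stored source keeps one d-dimensional vector per token.
    return num_sources * d * tokens * bytes_per_value

L, d = 48, 8192          # example depth and width
tokens = 4096            # hypothetical number of tokens held at once
full = stored_bytes(L, d, tokens)

# 48 * 8192 * 4096 * 2 bytes of stored layer outputs
print(f"{full / 2**30:.1f} GiB of layer outputs")
```

<p>This counts only the stored layer outputs and ignores everything else a training run keeps in memory, so treat it as an order-of-magnitude estimate.</p><p>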
For massive models with tens of billions of parameters, we need a more efficient approach.</p><div><hr></div><h2><strong>Block AttnRes: scaling to real-world models</strong></h2><p>Full AttnRes is elegant but expensive. At layer l, we need to store all l-1 previous outputs, each of dimension d. For a 48-layer model with d = 8192, this represents significant memory overhead. In pipeline-parallel training, where different layers live on different GPUs, the situation is even more challenging.</p><p>The solution is Block AttnRes, which partitions layers into blocks and applies attention only at block boundaries.</p><h3><strong>The memory challenge</strong></h3><p>Let&#8217;s first understand the architecture at a high level, as shown in figure 18.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JnTO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5fd6f0c-1dc2-46e2-818f-db174b073906_2752x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JnTO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5fd6f0c-1dc2-46e2-818f-db174b073906_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!JnTO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5fd6f0c-1dc2-46e2-818f-db174b073906_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!JnTO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5fd6f0c-1dc2-46e2-818f-db174b073906_2752x1536.png 1272w, 
https://substackcdn.com/image/fetch/$s_!JnTO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5fd6f0c-1dc2-46e2-818f-db174b073906_2752x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JnTO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5fd6f0c-1dc2-46e2-818f-db174b073906_2752x1536.png" width="1456" height="813" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d5fd6f0c-1dc2-46e2-818f-db174b073906_2752x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:813,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Figure 18. Block AttnRes architecture. Layers are partitioned into blocks (shown here with 4 blocks of 6 layers each for a 24-layer model). Within each block, standard residual connections accumulate outputs with weight 1.0. At block boundaries, attention-based aggregation selectively combines all previous block representations plus the original token embedding. Memory is O(Nd) instead of O(Ld).&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Figure 18. Block AttnRes architecture. Layers are partitioned into blocks (shown here with 4 blocks of 6 layers each for a 24-layer model). Within each block, standard residual connections accumulate outputs with weight 1.0. At block boundaries, attention-based aggregation selectively combines all previous block representations plus the original token embedding. Memory is O(Nd) instead of O(Ld)." title="Figure 18. Block AttnRes architecture. 
Layers are partitioned into blocks (shown here with 4 blocks of 6 layers each for a 24-layer model). Within each block, standard residual connections accumulate outputs with weight 1.0. At block boundaries, attention-based aggregation selectively combines all previous block representations plus the original token embedding. Memory is O(Nd) instead of O(Ld)." srcset="https://substackcdn.com/image/fetch/$s_!JnTO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5fd6f0c-1dc2-46e2-818f-db174b073906_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!JnTO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5fd6f0c-1dc2-46e2-818f-db174b073906_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!JnTO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5fd6f0c-1dc2-46e2-818f-db174b073906_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!JnTO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5fd6f0c-1dc2-46e2-818f-db174b073906_2752x1536.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>Figure 18. Block AttnRes architecture. Layers are partitioned into blocks (shown here with 4 blocks of 6 layers each for a 24-layer model). Within each block, standard residual connections accumulate outputs with weight 1.0. At block boundaries, attention-based aggregation selectively combines all previous block representations plus the original token embedding. Memory is O(Nd) instead of O(Ld).</em></figcaption></figure></div><p>As shown in figure 18, Block AttnRes uses standard residuals within blocks (the familiar h_l = h_{l-1} + f_l pattern) but applies attention-based aggregation at the boundaries between blocks. 
The token embeddings x are always available as one of the attention sources, providing a direct path from input to any block.</p><p>The memory savings are substantial, as illustrated in figure 19.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!r62v!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40a4842e-dd6d-46d7-8610-afe9e25730c4_2752x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!r62v!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40a4842e-dd6d-46d7-8610-afe9e25730c4_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!r62v!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40a4842e-dd6d-46d7-8610-afe9e25730c4_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!r62v!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40a4842e-dd6d-46d7-8610-afe9e25730c4_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!r62v!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40a4842e-dd6d-46d7-8610-afe9e25730c4_2752x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!r62v!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40a4842e-dd6d-46d7-8610-afe9e25730c4_2752x1536.png" width="1456" height="813" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/40a4842e-dd6d-46d7-8610-afe9e25730c4_2752x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:813,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Figure 19. Memory comparison. Full AttnRes stores all 48 layer outputs, requiring O(L*d) = O(48*d) memory. Block AttnRes stores only 8 block-level representations plus the embeddings, requiring O(N*d) = O(8*d) memory. For a 48-layer model with 8 blocks, this is a 6x reduction in stored representations.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Figure 19. Memory comparison. Full AttnRes stores all 48 layer outputs, requiring O(L*d) = O(48*d) memory. Block AttnRes stores only 8 block-level representations plus the embeddings, requiring O(N*d) = O(8*d) memory. For a 48-layer model with 8 blocks, this is a 6x reduction in stored representations." title="Figure 19. Memory comparison. Full AttnRes stores all 48 layer outputs, requiring O(L*d) = O(48*d) memory. Block AttnRes stores only 8 block-level representations plus the embeddings, requiring O(N*d) = O(8*d) memory. For a 48-layer model with 8 blocks, this is a 6x reduction in stored representations." 
srcset="https://substackcdn.com/image/fetch/$s_!r62v!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40a4842e-dd6d-46d7-8610-afe9e25730c4_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!r62v!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40a4842e-dd6d-46d7-8610-afe9e25730c4_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!r62v!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40a4842e-dd6d-46d7-8610-afe9e25730c4_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!r62v!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40a4842e-dd6d-46d7-8610-afe9e25730c4_2752x1536.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>Figure 19. Memory comparison. Full AttnRes stores all 48 layer outputs, requiring O(L*d) = O(48*d) memory. Block AttnRes stores only 8 block-level representations plus the embeddings, requiring O(N*d) = O(8*d) memory. For a 48-layer model with 8 blocks, this is a 6x reduction in stored representations.</em></figcaption></figure></div><p>As illustrated in figure 19, full AttnRes requires storing 48 separate representations for a 48-layer model. Block AttnRes with N = 8 blocks stores only 9 representations (8 block outputs plus the token embeddings). Counting block outputs alone, that is a 6x reduction (48 to 8); with the embeddings included, it is closer to 5x. The authors found that 8 blocks offers the best trade-off between performance and overhead.</p><h3><strong>How Block AttnRes works</strong></h3><p>Let&#8217;s walk through Block AttnRes with our running example: 6 layers partitioned into 2 blocks of 3 layers each. 
The step-by-step computation is shown in figure 20.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!PHRI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a18fdf7-693c-4764-acc5-559aac4e8e85_2752x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!PHRI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a18fdf7-693c-4764-acc5-559aac4e8e85_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!PHRI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a18fdf7-693c-4764-acc5-559aac4e8e85_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!PHRI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a18fdf7-693c-4764-acc5-559aac4e8e85_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!PHRI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a18fdf7-693c-4764-acc5-559aac4e8e85_2752x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!PHRI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a18fdf7-693c-4764-acc5-559aac4e8e85_2752x1536.png" width="1456" height="813" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5a18fdf7-693c-4764-acc5-559aac4e8e85_2752x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:813,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Figure 20. Block AttnRes step-by-step for our running example with L=6 layers and N=2 blocks. Block 1 (Layers 1-3) uses standard residuals internally with weight 1.0. At the block boundary, attention aggregates the token embeddings x and Block 1's output B_1 with weights alpha_x = 0.35 and alpha_B1 = 0.65, producing h_boundary. Block 2 (Layers 4-6) then proceeds with standard residuals using h_boundary as input.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Figure 20. Block AttnRes step-by-step for our running example with L=6 layers and N=2 blocks. Block 1 (Layers 1-3) uses standard residuals internally with weight 1.0. At the block boundary, attention aggregates the token embeddings x and Block 1's output B_1 with weights alpha_x = 0.35 and alpha_B1 = 0.65, producing h_boundary. Block 2 (Layers 4-6) then proceeds with standard residuals using h_boundary as input." title="Figure 20. Block AttnRes step-by-step for our running example with L=6 layers and N=2 blocks. Block 1 (Layers 1-3) uses standard residuals internally with weight 1.0. At the block boundary, attention aggregates the token embeddings x and Block 1's output B_1 with weights alpha_x = 0.35 and alpha_B1 = 0.65, producing h_boundary. Block 2 (Layers 4-6) then proceeds with standard residuals using h_boundary as input." 
srcset="https://substackcdn.com/image/fetch/$s_!PHRI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a18fdf7-693c-4764-acc5-559aac4e8e85_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!PHRI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a18fdf7-693c-4764-acc5-559aac4e8e85_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!PHRI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a18fdf7-693c-4764-acc5-559aac4e8e85_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!PHRI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a18fdf7-693c-4764-acc5-559aac4e8e85_2752x1536.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>Figure 20. Block AttnRes step-by-step for our running example with L=6 layers and N=2 blocks. Block 1 (Layers 1-3) uses standard residuals internally with weight 1.0. At the block boundary, attention aggregates the token embeddings x and Block 1&#8217;s output B_1 with weights alpha_x = 0.35 and alpha_B1 = 0.65, producing h_boundary. Block 2 (Layers 4-6) then proceeds with standard residuals using h_boundary as input.</em></figcaption></figure></div><p>As shown in figure 20, the computation proceeds in three phases:</p><ol><li><p><strong>Block 1 (Layers 1-3)</strong>: Standard residual connections. Each layer adds its output to the running sum. The block output B_1 = h_3 has a growing magnitude, just like in a standard transformer.</p></li><li><p><strong>Block boundary</strong>: The attention mechanism computes weights over the token embeddings x and block 1&#8217;s output B_1. In this example, the weights are alpha_x = 0.35 and alpha_B1 = 0.65, producing h_boundary = 0.35x + 0.65B_1. This &#8220;resets&#8221; the magnitude by forming a convex combination.</p></li><li><p><strong>Block 2 (Layers 4-6)</strong>: Standard residual connections resume, using h_boundary as the starting point.</p></li></ol><p>Notice the critical design insight: within a block, the magnitude is allowed to grow with each layer, just as in a standard transformer. But at block boundaries, the attention-based aggregation &#8220;resets&#8221; the magnitude back to a bounded value. 
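</p><p>The three phases can be sketched in a few lines of pure Python. This is a toy, not the authors' implementation: <code>toy_layer</code> is a hypothetical stand-in that returns a random update, the width is arbitrary, and the boundary weights are hard-coded to the 0.35 and 0.65 from the example instead of being computed by attention.</p>

```python
import math
import random

def norm(vec):
    return math.sqrt(sum(v * v for v in vec))

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def toy_layer(h, rng):
    # Stand-in for f_l(h): a random update with the same width as h.
    # A real layer would be attention or an MLP; this only lets us watch the norm.
    return [rng.gauss(0, 0.5) for _ in h]

rng = random.Random(0)
d = 8                                       # toy width, chosen arbitrarily
x = [rng.gauss(0, 1) for _ in range(d)]     # token embedding

# Block 1 (layers 1-3): standard residuals, h_l = h_{l-1} + f_l
h = x
for layer in range(1, 4):
    h = add(h, toy_layer(h, rng))
    print(f"layer {layer}: |h| = {norm(h):.3f}")
b_1 = h                                     # block output B_1 = h_3

# Block boundary: fixed weights from the running example. In the real
# mechanism these come from a softmax and differ per token.
alpha_x, alpha_b1 = 0.35, 0.65
h_boundary = [alpha_x * xi + alpha_b1 * bi for xi, bi in zip(x, b_1)]
print(f"boundary: |h| = {norm(h_boundary):.3f}")

# Block 2 (layers 4-6): standard residuals resume from h_boundary
h = h_boundary
for layer in range(4, 7):
    h = add(h, toy_layer(h, rng))
    print(f"layer {layer}: |h| = {norm(h):.3f}")
```

<p>Because the two weights are non-negative and sum to 1, the printed boundary norm can never exceed the larger of |x| and |B_1|.</p><p>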
This means the maximum growth within any single block is limited to O(L/N), which is much smaller than O(L).</p><p>Now let&#8217;s examine the block boundary computation in more detail, as shown in figure 21.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Q-dN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F639201e0-c201-4a41-b9e1-0b1272c366c7_2752x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Q-dN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F639201e0-c201-4a41-b9e1-0b1272c366c7_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!Q-dN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F639201e0-c201-4a41-b9e1-0b1272c366c7_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!Q-dN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F639201e0-c201-4a41-b9e1-0b1272c366c7_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!Q-dN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F639201e0-c201-4a41-b9e1-0b1272c366c7_2752x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Q-dN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F639201e0-c201-4a41-b9e1-0b1272c366c7_2752x1536.png" width="1456" height="813" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/639201e0-c201-4a41-b9e1-0b1272c366c7_2752x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:813,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Figure 21. Attention computation at a block boundary. In Stage 1 (Stack), the token embeddings, previous block outputs, and current partial sum are gathered. In Stage 2 (Normalize), all sources are normalized with RMSNorm. In Stage 3 (Score), the learned projection weight (pseudo-query) computes attention logits via einsum, then softmax produces weights. In Stage 4 (Aggregate), the weighted sum produces the aggregated block input for the next block.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Figure 21. Attention computation at a block boundary. In Stage 1 (Stack), the token embeddings, previous block outputs, and current partial sum are gathered. In Stage 2 (Normalize), all sources are normalized with RMSNorm. In Stage 3 (Score), the learned projection weight (pseudo-query) computes attention logits via einsum, then softmax produces weights. In Stage 4 (Aggregate), the weighted sum produces the aggregated block input for the next block." title="Figure 21. Attention computation at a block boundary. In Stage 1 (Stack), the token embeddings, previous block outputs, and current partial sum are gathered. In Stage 2 (Normalize), all sources are normalized with RMSNorm. In Stage 3 (Score), the learned projection weight (pseudo-query) computes attention logits via einsum, then softmax produces weights. 
In Stage 4 (Aggregate), the weighted sum produces the aggregated block input for the next block." srcset="https://substackcdn.com/image/fetch/$s_!Q-dN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F639201e0-c201-4a41-b9e1-0b1272c366c7_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!Q-dN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F639201e0-c201-4a41-b9e1-0b1272c366c7_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!Q-dN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F639201e0-c201-4a41-b9e1-0b1272c366c7_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!Q-dN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F639201e0-c201-4a41-b9e1-0b1272c366c7_2752x1536.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg 
xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Figure 21. Attention computation at a block boundary. In Stage 1 (Stack), the token embeddings, previous block outputs, and current partial sum are gathered. In Stage 2 (Normalize), all sources are normalized with RMSNorm. In Stage 3 (Score), the learned projection weight (pseudo-query) computes attention logits via einsum, then softmax produces weights. In Stage 4 (Aggregate), the weighted sum produces the aggregated block input for the next block.</em></figcaption></figure></div><p>As illustrated in figure 21, the block boundary attention follows four stages:</p><ol><li><p><strong>Stack</strong>: Gather the token embeddings, all previous block outputs, and the current block&#8217;s partial sum into a tensor V of shape (N_prev+1, batch, seq, d)</p></li><li><p><strong>Normalize</strong>: Apply RMSNorm to stabilize the attention computation</p></li><li><p><strong>Score</strong>: Use the learned projection weight (the pseudo-query) to compute logits via einsum, then apply softmax to get weights</p></li><li><p><strong>Aggregate</strong>: Compute the weighted sum to produce the input for the next block</p></li></ol><h3><strong>Pipeline parallelism and system design</strong></h3><p>In large-scale training, models are split across multiple GPUs using pipeline parallelism. Each GPU holds a subset of layers, typically corresponding to one or two blocks. 
The challenge is that block boundary attention needs representations from blocks on other GPUs.</p><p>The solution is cache-based point-to-point communication, as shown in figure 22.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EN95!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a0ea332-ddf6-49b0-90a4-f87cb40b191d_2752x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EN95!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a0ea332-ddf6-49b0-90a4-f87cb40b191d_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!EN95!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a0ea332-ddf6-49b0-90a4-f87cb40b191d_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!EN95!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a0ea332-ddf6-49b0-90a4-f87cb40b191d_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!EN95!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a0ea332-ddf6-49b0-90a4-f87cb40b191d_2752x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!EN95!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a0ea332-ddf6-49b0-90a4-f87cb40b191d_2752x1536.png" width="1456" height="813" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8a0ea332-ddf6-49b0-90a4-f87cb40b191d_2752x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:813,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Figure 22. Block AttnRes in pipeline-parallel training. Four GPUs each hold 2 blocks. Each GPU caches its block representations and sends them to the next GPU via point-to-point communication. GPU 0 caches [x, B_1, B_2] and sends [B_1, B_2] to GPU 1. GPU 1 adds its own block outputs and forwards [B_1..B_4] to GPU 2, and so on. Each GPU runs a two-phase computation: Phase 1 computes standard attention within the current block, Phase 2 applies cross-block attention using cached representations.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Figure 22. Block AttnRes in pipeline-parallel training. Four GPUs each hold 2 blocks. Each GPU caches its block representations and sends them to the next GPU via point-to-point communication. GPU 0 caches [x, B_1, B_2] and sends [B_1, B_2] to GPU 1. GPU 1 adds its own block outputs and forwards [B_1..B_4] to GPU 2, and so on. Each GPU runs a two-phase computation: Phase 1 computes standard attention within the current block, Phase 2 applies cross-block attention using cached representations." title="Figure 22. Block AttnRes in pipeline-parallel training. Four GPUs each hold 2 blocks. Each GPU caches its block representations and sends them to the next GPU via point-to-point communication. GPU 0 caches [x, B_1, B_2] and sends [B_1, B_2] to GPU 1. GPU 1 adds its own block outputs and forwards [B_1..B_4] to GPU 2, and so on. 
Each GPU runs a two-phase computation: Phase 1 computes standard attention within the current block, Phase 2 applies cross-block attention using cached representations." srcset="https://substackcdn.com/image/fetch/$s_!EN95!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a0ea332-ddf6-49b0-90a4-f87cb40b191d_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!EN95!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a0ea332-ddf6-49b0-90a4-f87cb40b191d_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!EN95!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a0ea332-ddf6-49b0-90a4-f87cb40b191d_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!EN95!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a0ea332-ddf6-49b0-90a4-f87cb40b191d_2752x1536.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft 
pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Figure 22. Block AttnRes in pipeline-parallel training. Four GPUs each hold 2 blocks. Each GPU caches its block representations and sends them to the next GPU via point-to-point communication. GPU 0 caches [x, B_1, B_2] and sends [B_1, B_2] to GPU 1. GPU 1 adds its own block outputs and forwards [B_1..B_4] to GPU 2, and so on. Each GPU runs a two-phase computation: Phase 1 computes standard attention within the current block, Phase 2 applies cross-block attention using cached representations.</em></figcaption></figure></div><p>As shown in figure 22, each GPU stage maintains a cache of all block representations computed so far. When a block boundary is reached, the cached representations from previous stages are available for the attention computation. 
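</p><p>To make the four stages of the boundary computation (Stack, Normalize, Score, Aggregate) concrete, here is a minimal sketch for a single token position in plain Python. It is an illustration under stated assumptions rather than the reference implementation: the helper names, the toy cache values, and the choice to score on normalized copies while aggregating the unnormalized sources are all assumptions, and a real implementation would batch this over (batch, seq) with einsum.</p>

```python
import math

def rms_norm(vec, eps=1e-6):
    """RMSNorm of one vector, without a learned gain (a simplification)."""
    rms = math.sqrt(sum(x * x for x in vec) / len(vec) + eps)
    return [x / rms for x in vec]

def softmax(logits):
    m = max(logits)                                  # subtract max for stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def block_boundary_attention(sources, pseudo_query):
    """sources: list of d-dim vectors for one token; pseudo_query: d-dim vector."""
    # Stage 1 (Stack): embeddings, earlier block outputs, current partial sum.
    stacked = [list(v) for v in sources]
    # Stage 2 (Normalize): scores are computed on RMS-normalized copies (assumption).
    keys = [rms_norm(v) for v in stacked]
    # Stage 3 (Score): one logit per source, then softmax across sources.
    logits = [sum(q * k for q, k in zip(pseudo_query, key)) for key in keys]
    weights = softmax(logits)
    # Stage 4 (Aggregate): weighted sum of the unnormalized sources (assumption).
    d = len(pseudo_query)
    out = [sum(w * v[j] for w, v in zip(weights, stacked)) for j in range(d)]
    return out, weights

# Hypothetical cache for one token: embedding x, blocks B_1 and B_2, partial sum.
cache = [[1.0, 0.0, 2.0], [0.5, -1.0, 0.0], [3.0, 1.0, 1.0], [0.0, 2.0, -1.0]]
pseudo_query = [0.2, -0.1, 0.4]                      # stand-in for the learned weight
out, weights = block_boundary_attention(cache, pseudo_query)
print(weights, out)
```

<p>The printed weights are positive and sum to 1, so the aggregated vector is a weighted average of the cached sources.</p><p>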
Each GPU then runs a two-phase computation:</p><ul><li><p><strong>Phase 1</strong>: Compute the standard attention and FFN operations within the current block</p></li><li><p><strong>Phase 2</strong>: Apply cross-block attention using the cached block representations</p></li></ul><p>During inference, online softmax (a streaming softmax that keeps a running maximum and normalizer) is used to amortize the cost of the cross-block attention, which keeps the overhead small.</p><p>Having built the complete mechanism, both the idealized full version and the practical block variant, let&#8217;s formalize the mathematics and prove why AttnRes solves the dilution problem.</p><div><hr></div><h2><strong>The mathematics of Attention Residuals</strong></h2><p>We have built intuition for why AttnRes works. Now let&#8217;s prove it. The mathematics reveals three key properties: bounded hidden states, improved gradient flow, and a beautiful connection between standard residuals and linear attention.</p><h3><strong>Bounding the hidden state</strong></h3><p>The most important mathematical property of AttnRes is that it produces a convex combination of previous outputs: a weighted average whose weights are non-negative and sum to 1. 
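</p><p>A quick numerical check makes the difference tangible. The sketch below compares a plain sum of layer outputs with a softmax-weighted combination in the worst case, where every layer output points in the same direction. The vectors, the depth of 64, and the logits are illustrative values chosen for this example, not numbers from any experiment.</p>

```python
import math

def norm(vec):
    return math.sqrt(sum(x * x for x in vec))

def softmax(logits):
    m = max(logits)                                   # subtract max for stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def combine(vectors, weights):
    """Weighted sum of equally sized vectors."""
    d = len(vectors[0])
    return [sum(w * v[j] for w, v in zip(weights, vectors)) for j in range(d)]

# Worst case for plain residuals: every layer output points the same way.
L = 64
layer_outputs = [[1.0, 1.0] for _ in range(L + 1)]    # v_0 ... v_L, illustrative

plain_sum = combine(layer_outputs, [1.0] * (L + 1))   # standard residual stream
alphas = softmax([0.1 * i for i in range(L + 1)])     # arbitrary logits (hypothetical)
convex = combine(layer_outputs, alphas)               # AttnRes-style combination

largest = max(norm(v) for v in layer_outputs)
print(norm(plain_sum) / largest)   # grows linearly with the number of terms
print(norm(convex) / largest)      # stays at or below 1 (up to rounding)
```

<p>Changing L only affects the first ratio; the second one never exceeds 1 because the weights are non-negative and sum to 1.</p><p>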
Since softmax weights sum to 1, the output magnitude is automatically bounded, as shown in figure 23.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eER-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4d2b102-02ab-49bd-a192-68a462144e84_2752x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!eER-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4d2b102-02ab-49bd-a192-68a462144e84_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!eER-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4d2b102-02ab-49bd-a192-68a462144e84_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!eER-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4d2b102-02ab-49bd-a192-68a462144e84_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!eER-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4d2b102-02ab-49bd-a192-68a462144e84_2752x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!eER-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4d2b102-02ab-49bd-a192-68a462144e84_2752x1536.png" width="1456" height="813" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a4d2b102-02ab-49bd-a192-68a462144e84_2752x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:813,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Figure 23. The convex combination bound. Top: Standard residuals compute h_L = v_0 + v_1 + v_2 + ... + v_L, and the magnitude can grow as O(L). AttnRes computes h_L = alpha_0*v_0 + alpha_1*v_1 + ... + alpha_L*v_L where the alphas sum to 1, and the magnitude satisfies the bound: norm of h_L is at most the maximum norm of any v_i. Bottom: the geometric intuition shows that the AttnRes output always lies within the convex hull of the input vectors, while the standard residual output can escape far beyond this region.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Figure 23. The convex combination bound. Top: Standard residuals compute h_L = v_0 + v_1 + v_2 + ... + v_L, and the magnitude can grow as O(L). AttnRes computes h_L = alpha_0*v_0 + alpha_1*v_1 + ... + alpha_L*v_L where the alphas sum to 1, and the magnitude satisfies the bound: norm of h_L is at most the maximum norm of any v_i. Bottom: the geometric intuition shows that the AttnRes output always lies within the convex hull of the input vectors, while the standard residual output can escape far beyond this region." title="Figure 23. The convex combination bound. Top: Standard residuals compute h_L = v_0 + v_1 + v_2 + ... + v_L, and the magnitude can grow as O(L). AttnRes computes h_L = alpha_0*v_0 + alpha_1*v_1 + ... 
+ alpha_L*v_L where the alphas sum to 1, and the magnitude satisfies the bound: norm of h_L is at most the maximum norm of any v_i. Bottom: the geometric intuition shows that the AttnRes output always lies within the convex hull of the input vectors, while the standard residual output can escape far beyond this region." srcset="https://substackcdn.com/image/fetch/$s_!eER-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4d2b102-02ab-49bd-a192-68a462144e84_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!eER-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4d2b102-02ab-49bd-a192-68a462144e84_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!eER-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4d2b102-02ab-49bd-a192-68a462144e84_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!eER-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4d2b102-02ab-49bd-a192-68a462144e84_2752x1536.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 
11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Figure 23. The convex combination bound. Top: Standard residuals compute h_L = v_0 + v_1 + v_2 + ... + v_L, and the magnitude can grow as O(L). AttnRes computes h_L = alpha_0*v_0 + alpha_1*v_1 + ... + alpha_L*v_L where the alphas sum to 1, and the magnitude satisfies the bound: norm of h_L is at most the maximum norm of any v_i. 
Bottom: the geometric intuition shows that the AttnRes output always lies within the convex hull of the input vectors, while the standard residual output can escape far beyond this region.</em></figcaption></figure></div><p>As illustrated in figure 23, the bound is elegantly simple:</p><p><strong>Standard residuals:</strong></p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\|h_L\\| = \\|\\sum_{i=0}^{L} v_i\\| \\text{ can grow as } O(L)&quot;,&quot;id&quot;:&quot;NEJYOCDYAP&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p><strong>AttnRes:</strong></p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\|h_L\\| = \\|\\sum_{i=0}^{L} \\alpha_i v_i\\| \\leq \\max_i \\|v_i\\| \\text{ (bounded)}&quot;,&quot;id&quot;:&quot;ZGRCVBUUIT&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>This follows directly from the triangle inequality and the fact that the alpha weights are non-negative and sum to 1: the norm of the weighted sum is at most the alpha-weighted average of the individual norms, and that average is at most the largest individual norm. In other words, the output is a weighted average of the input vectors, and a weighted average can never have a larger magnitude than the largest input. The plain sum has no such normalization, so its norm can reach the sum of the individual norms when the vectors are aligned. This single property removes the depth-dependent growth behind the PreNorm dilution problem, regardless of how many layers the model has.</p><h3><strong>Gradient flow and training dynamics</strong></h3><p>The second mathematical advantage concerns gradient flow during training. In standard residuals, gradients must propagate backward through a chain of multiplicative terms. In AttnRes, each layer receives a direct gradient signal.</p><p>Let&#8217;s compare the two gradient flow patterns. 
The standard residual gradient flow is shown in figure 24.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HqKF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01d65ca8-97e8-49ca-80b0-59433615102d_2752x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HqKF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01d65ca8-97e8-49ca-80b0-59433615102d_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!HqKF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01d65ca8-97e8-49ca-80b0-59433615102d_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!HqKF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01d65ca8-97e8-49ca-80b0-59433615102d_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!HqKF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01d65ca8-97e8-49ca-80b0-59433615102d_2752x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HqKF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01d65ca8-97e8-49ca-80b0-59433615102d_2752x1536.png" width="1456" height="813" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/01d65ca8-97e8-49ca-80b0-59433615102d_2752x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:813,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Figure 24. Gradient flow in standard residuals. The gradient from the loss must propagate backward through a chain of (L-l) multiplicative terms: dL/dh_l = dL/dh_L * product of (I + df_k/dh). While the identity shortcut helps, gradients still concentrate in shallow layers (near the loss) and attenuate for deeper layers (far from the loss). The bar chart on the right shows gradient norms heavily concentrated in recent layers.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Figure 24. Gradient flow in standard residuals. The gradient from the loss must propagate backward through a chain of (L-l) multiplicative terms: dL/dh_l = dL/dh_L * product of (I + df_k/dh). While the identity shortcut helps, gradients still concentrate in shallow layers (near the loss) and attenuate for deeper layers (far from the loss). The bar chart on the right shows gradient norms heavily concentrated in recent layers." title="Figure 24. Gradient flow in standard residuals. The gradient from the loss must propagate backward through a chain of (L-l) multiplicative terms: dL/dh_l = dL/dh_L * product of (I + df_k/dh). While the identity shortcut helps, gradients still concentrate in shallow layers (near the loss) and attenuate for deeper layers (far from the loss). The bar chart on the right shows gradient norms heavily concentrated in recent layers." 
srcset="https://substackcdn.com/image/fetch/$s_!HqKF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01d65ca8-97e8-49ca-80b0-59433615102d_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!HqKF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01d65ca8-97e8-49ca-80b0-59433615102d_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!HqKF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01d65ca8-97e8-49ca-80b0-59433615102d_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!HqKF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01d65ca8-97e8-49ca-80b0-59433615102d_2752x1536.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Figure 24. Gradient flow in standard residuals. The gradient from the loss must propagate backward through a chain of (L-l) multiplicative terms: dL/dh_l = dL/dh_L * product of (I + df_k/dh). While the identity shortcut helps, gradients still concentrate in the layers closest to the loss and attenuate for the early layers far from the loss. The bar chart on the right shows gradient norms heavily concentrated in the final layers.</em></figcaption></figure></div><p>As shown in figure 24, the standard gradient formula involves a product of (L-l) Jacobian terms. While the identity component of each term helps prevent complete vanishing, the gradients still tend to concentrate in the layers closest to the loss, leaving early layers with weaker gradient signals.</p><p>Now let&#8217;s examine the AttnRes gradient flow, as shown in figure 25.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!meqv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07435cc4-2dd4-4d12-b84a-8778c067162e_2752x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!meqv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07435cc4-2dd4-4d12-b84a-8778c067162e_2752x1536.png 424w, 
https://substackcdn.com/image/fetch/$s_!meqv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07435cc4-2dd4-4d12-b84a-8778c067162e_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!meqv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07435cc4-2dd4-4d12-b84a-8778c067162e_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!meqv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07435cc4-2dd4-4d12-b84a-8778c067162e_2752x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!meqv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07435cc4-2dd4-4d12-b84a-8778c067162e_2752x1536.png" width="1456" height="813" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/07435cc4-2dd4-4d12-b84a-8778c067162e_2752x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:813,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Figure 25. Gradient flow in AttnRes. Each layer receives a direct gradient signal weighted by its attention weight alpha. Layers that contribute more to the output (higher alpha) get proportionally stronger gradient signals. This creates a self-reinforcing learning dynamic: useful layers get trained more effectively. 
The bar chart shows gradient norms distributed much more uniformly across depth.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Figure 25. Gradient flow in AttnRes. Each layer receives a direct gradient signal weighted by its attention weight alpha. Layers that contribute more to the output (higher alpha) get proportionally stronger gradient signals. This creates a self-reinforcing learning dynamic: useful layers get trained more effectively. The bar chart shows gradient norms distributed much more uniformly across depth." title="Figure 25. Gradient flow in AttnRes. Each layer receives a direct gradient signal weighted by its attention weight alpha. Layers that contribute more to the output (higher alpha) get proportionally stronger gradient signals. This creates a self-reinforcing learning dynamic: useful layers get trained more effectively. The bar chart shows gradient norms distributed much more uniformly across depth." 
srcset="https://substackcdn.com/image/fetch/$s_!meqv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07435cc4-2dd4-4d12-b84a-8778c067162e_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!meqv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07435cc4-2dd4-4d12-b84a-8778c067162e_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!meqv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07435cc4-2dd4-4d12-b84a-8778c067162e_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!meqv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07435cc4-2dd4-4d12-b84a-8778c067162e_2752x1536.png 1456w" sizes="100vw" loading="lazy"></picture></div></a>
<figcaption class="image-caption"><em>Figure 25. Gradient flow in AttnRes. Each layer receives a direct gradient signal weighted by its attention weight alpha. Layers that contribute more to the output (higher alpha) get proportionally stronger gradient signals. This creates a self-reinforcing learning dynamic: useful layers get trained more effectively. The bar chart shows gradient norms distributed much more uniformly across depth.</em></figcaption></figure></div><p>As illustrated in figure 25, AttnRes provides each layer with a direct gradient path:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\frac{\\partial \\mathcal{L}}{\\partial v_l} = \\sum_{k=l+1}^{L} \\alpha_{l \\rightarrow k} \\cdot \\frac{\\partial \\mathcal{L}}{\\partial h_k} + \\text{(attention weight gradient terms)}&quot;,&quot;id&quot;:&quot;LXCQVXEHQO&quot;}" data-component-name="LatexBlockToDOM"></div><p>Every layer receives a gradient signal from each later layer, scaled by the attention weight alpha that the later layer assigns to it.
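</p><p>The direct-path term of the formula above can be sketched on toy numbers. This is only a minimal illustration under stated assumptions: the function name <code>direct_gradient</code>, the weight table <code>alpha</code> and the upstream gradients are hypothetical scalars chosen for the example, and the attention-weight gradient terms are left out entirely.</p>

```python
def direct_gradient(alpha, upstream, layer):
    """Direct-path part of dL/dv_layer: sum over later layers k of alpha[(layer, k)] * dL/dh_k.

    alpha[(l, k)] is the weight layer k assigns to layer l's output (toy scalars).
    upstream[k] stands in for dL/dh_k and is simply given here.
    The attention-weight gradient terms of the full formula are left out.
    """
    return sum(alpha[(layer, k)] * upstream[k] for k in upstream if k > layer)


# Hypothetical 4-layer example; each later layer's weights sum to 1.
alpha = {
    (1, 2): 1.0,
    (1, 3): 0.5, (2, 3): 0.5,
    (1, 4): 0.2, (2, 4): 0.3, (3, 4): 0.5,
}
upstream = {2: 1.0, 3: 1.0, 4: 1.0}  # placeholder values for dL/dh_k

grads = {layer: direct_gradient(alpha, upstream, layer) for layer in (1, 2, 3)}
for layer, grad in grads.items():
    print(f"layer {layer}: direct gradient = {grad:.2f}")
```

<p>In this toy setup, raising any single entry of <code>alpha</code> raises the direct gradient of the layer it points back to, as long as the corresponding upstream gradient is positive.</p><p>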
Layers that contribute more to the output get proportionally stronger gradients, creating a self-reinforcing learning dynamic: useful layers train faster, become even more useful, and receive even stronger gradients.</p><p>We can verify this improved gradient distribution quantitatively, as shown in figure 26.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qWy3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd54b4b45-7508-4865-8e5e-66436004726f_1248x850.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qWy3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd54b4b45-7508-4865-8e5e-66436004726f_1248x850.png 424w, https://substackcdn.com/image/fetch/$s_!qWy3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd54b4b45-7508-4865-8e5e-66436004726f_1248x850.png 848w, https://substackcdn.com/image/fetch/$s_!qWy3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd54b4b45-7508-4865-8e5e-66436004726f_1248x850.png 1272w, https://substackcdn.com/image/fetch/$s_!qWy3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd54b4b45-7508-4865-8e5e-66436004726f_1248x850.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qWy3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd54b4b45-7508-4865-8e5e-66436004726f_1248x850.png" width="1248" height="850" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d54b4b45-7508-4865-8e5e-66436004726f_1248x850.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:850,&quot;width&quot;:1248,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Figure 26. Gradient norm distribution across layers. The orange line (Standard) shows gradients growing from 0.15 at layer 1 to 1.0 at layer 8, with early layers receiving much weaker gradient signals. The blue line (AttnRes) shows stable gradient norms between 0.82 and 0.97 across all layers, enabling effective training at every depth.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Figure 26. Gradient norm distribution across layers. The orange line (Standard) shows gradients growing from 0.15 at layer 1 to 1.0 at layer 8, with early layers receiving much weaker gradient signals. The blue line (AttnRes) shows stable gradient norms between 0.82 and 0.97 across all layers, enabling effective training at every depth." title="Figure 26. Gradient norm distribution across layers. The orange line (Standard) shows gradients growing from 0.15 at layer 1 to 1.0 at layer 8, with early layers receiving much weaker gradient signals. The blue line (AttnRes) shows stable gradient norms between 0.82 and 0.97 across all layers, enabling effective training at every depth." 
srcset="https://substackcdn.com/image/fetch/$s_!qWy3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd54b4b45-7508-4865-8e5e-66436004726f_1248x850.png 424w, https://substackcdn.com/image/fetch/$s_!qWy3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd54b4b45-7508-4865-8e5e-66436004726f_1248x850.png 848w, https://substackcdn.com/image/fetch/$s_!qWy3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd54b4b45-7508-4865-8e5e-66436004726f_1248x850.png 1272w, https://substackcdn.com/image/fetch/$s_!qWy3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd54b4b45-7508-4865-8e5e-66436004726f_1248x850.png 1456w" sizes="100vw" loading="lazy"></picture></div></a>
<figcaption class="image-caption"><em>Figure 26. Gradient norm distribution across layers. The orange line (Standard) shows gradients growing from 0.15 at layer 1 to 1.0 at layer 8, with early layers receiving much weaker gradient signals. The blue line (AttnRes) shows stable gradient norms between 0.82 and 0.97 across all layers, enabling effective training at every depth.</em></figcaption></figure></div><p>In figure 26, the standard transformer (orange) has a severe gradient imbalance: layer 1 receives a gradient norm of only 0.15, while layer 8 receives 1.0, a gap of more than 6x. The AttnRes model (blue) keeps gradient norms between 0.82 and 0.97 across all layers. This near-uniform distribution means every layer can train effectively, regardless of its position in the network.</p><h3><strong>Standard residuals as low-rank linear attention</strong></h3><p>The paper establishes a beautiful theoretical result: standard residual connections are equivalent to low-rank linear attention over depth. This reframing reveals that the residual stream has always been performing attention over depth, just a very limited form of it in which every layer gets the same fixed weight.
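</p><p>The equivalence can be checked on toy numbers. The sketch below is only an illustration under stated assumptions: the helper names (<code>residual_sum</code>, <code>depth_attention</code>, <code>softmax</code>), the layer outputs and the scores are hypothetical placeholders, not the paper's implementation, and how AttnRes actually computes its scores is not reproduced here.</p>

```python
import math


def residual_sum(values):
    """Standard residual stream: every layer output enters with fixed weight 1."""
    dim = len(values[0])
    return [sum(v[d] for v in values) for d in range(dim)]


def depth_attention(values, weights):
    """Weighted sum over depth: sum_i w_i * v_i."""
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Three toy layer outputs v_1..v_3, each a 2-dimensional vector.
values = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]

# Constant weights w_i = 1 reproduce the plain residual sum exactly.
uniform = depth_attention(values, [1.0, 1.0, 1.0])
assert uniform == residual_sum(values)

# Hypothetical scores; softmax turns them into selective weights over depth.
scores = [0.0, 2.0, -1.0]
weights = softmax(scores)
selective = depth_attention(values, weights)

print("residual sum:      ", uniform)
print("softmax weights:   ", [round(w, 3) for w in weights])
print("selective mixture: ", [round(x, 3) for x in selective])
```

<p>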
The comparison is shown in figure 27.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wo-F!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F570b2041-602a-40fc-9e74-6de9e052a6cd_2752x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!wo-F!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F570b2041-602a-40fc-9e74-6de9e052a6cd_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!wo-F!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F570b2041-602a-40fc-9e74-6de9e052a6cd_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!wo-F!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F570b2041-602a-40fc-9e74-6de9e052a6cd_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!wo-F!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F570b2041-602a-40fc-9e74-6de9e052a6cd_2752x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!wo-F!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F570b2041-602a-40fc-9e74-6de9e052a6cd_2752x1536.png" width="1456" height="813" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/570b2041-602a-40fc-9e74-6de9e052a6cd_2752x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:813,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Figure 27. Standard residuals as low-rank linear attention. Left: standard residuals compute h_L = sum(1 * v_i) with fixed weight 1.0 for all layers, providing no selectivity. Center: this is equivalent to linear attention with constant weights w_i = 1. Right: AttnRes generalizes this to full-rank softmax attention with input-dependent, learned weights, providing full selectivity. The progression from left to right represents increasing expressiveness.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Figure 27. Standard residuals as low-rank linear attention. Left: standard residuals compute h_L = sum(1 * v_i) with fixed weight 1.0 for all layers, providing no selectivity. Center: this is equivalent to linear attention with constant weights w_i = 1. Right: AttnRes generalizes this to full-rank softmax attention with input-dependent, learned weights, providing full selectivity. The progression from left to right represents increasing expressiveness." title="Figure 27. Standard residuals as low-rank linear attention. Left: standard residuals compute h_L = sum(1 * v_i) with fixed weight 1.0 for all layers, providing no selectivity. Center: this is equivalent to linear attention with constant weights w_i = 1. Right: AttnRes generalizes this to full-rank softmax attention with input-dependent, learned weights, providing full selectivity. 
The progression from left to right represents increasing expressiveness." srcset="https://substackcdn.com/image/fetch/$s_!wo-F!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F570b2041-602a-40fc-9e74-6de9e052a6cd_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!wo-F!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F570b2041-602a-40fc-9e74-6de9e052a6cd_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!wo-F!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F570b2041-602a-40fc-9e74-6de9e052a6cd_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!wo-F!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F570b2041-602a-40fc-9e74-6de9e052a6cd_2752x1536.png 1456w" sizes="100vw" loading="lazy"></picture></div></a>
<figcaption class="image-caption"><em>Figure 27. Standard residuals as low-rank linear attention. Left: standard residuals compute h_L = sum(1 * v_i) with fixed weight 1.0 for all layers, providing no selectivity. Center: this is equivalent to linear attention with constant weights w_i = 1. Right: AttnRes generalizes this to full-rank softmax attention with input-dependent, learned weights, providing full selectivity. The progression from left to right represents increasing expressiveness.</em></figcaption></figure></div><p>As illustrated in figure 27, the standard residual sum is mathematically identical to linear attention with constant weights over depth. AttnRes upgrades this to softmax attention with learned, input-dependent weights, giving a more expressive way to route information across depth.</p><p>The math supports what our intuition suggested: AttnRes addresses the dilution problem at its source. Now, let&#8217;s see how much this matters in practice.</p><div><hr></div><h2><strong>Quantifying the gains</strong></h2><p>Theory is important, but the question practitioners care about is: does it work, and how much does it cost? The Kimi team evaluated Block AttnRes across scaling law experiments, benchmark evaluations, and architecture search, and the results are compelling.</p><h3><strong>Scaling law experiments</strong></h3><p>The most informative experiment compares validation loss at different compute budgets.
The results are shown in figure 28.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Wjv7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51be14db-c6b8-4f50-b700-d2318faad4dd_1264x853.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Wjv7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51be14db-c6b8-4f50-b700-d2318faad4dd_1264x853.png 424w, https://substackcdn.com/image/fetch/$s_!Wjv7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51be14db-c6b8-4f50-b700-d2318faad4dd_1264x853.png 848w, https://substackcdn.com/image/fetch/$s_!Wjv7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51be14db-c6b8-4f50-b700-d2318faad4dd_1264x853.png 1272w, https://substackcdn.com/image/fetch/$s_!Wjv7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51be14db-c6b8-4f50-b700-d2318faad4dd_1264x853.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Wjv7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51be14db-c6b8-4f50-b700-d2318faad4dd_1264x853.png" width="1264" height="853" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/51be14db-c6b8-4f50-b700-d2318faad4dd_1264x853.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:853,&quot;width&quot;:1264,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Figure 28. Scaling law curves for standard residuals vs Block AttnRes. Validation loss is plotted against training compute on a log scale. AttnRes (blue) consistently achieves lower loss at every compute budget. The annotation shows that AttnRes matches the standard baseline trained with approximately 25% more compute, meaning a model with AttnRes is equivalent to a 25% larger standard model.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Figure 28. Scaling law curves for standard residuals vs Block AttnRes. Validation loss is plotted against training compute on a log scale. AttnRes (blue) consistently achieves lower loss at every compute budget. The annotation shows that AttnRes matches the standard baseline trained with approximately 25% more compute, meaning a model with AttnRes is equivalent to a 25% larger standard model." title="Figure 28. Scaling law curves for standard residuals vs Block AttnRes. Validation loss is plotted against training compute on a log scale. AttnRes (blue) consistently achieves lower loss at every compute budget. The annotation shows that AttnRes matches the standard baseline trained with approximately 25% more compute, meaning a model with AttnRes is equivalent to a 25% larger standard model." 
srcset="https://substackcdn.com/image/fetch/$s_!Wjv7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51be14db-c6b8-4f50-b700-d2318faad4dd_1264x853.png 424w, https://substackcdn.com/image/fetch/$s_!Wjv7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51be14db-c6b8-4f50-b700-d2318faad4dd_1264x853.png 848w, https://substackcdn.com/image/fetch/$s_!Wjv7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51be14db-c6b8-4f50-b700-d2318faad4dd_1264x853.png 1272w, https://substackcdn.com/image/fetch/$s_!Wjv7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51be14db-c6b8-4f50-b700-d2318faad4dd_1264x853.png 1456w" sizes="100vw" loading="lazy"></picture></div></a>
<figcaption class="image-caption"><em>Figure 28. Scaling law curves for standard residuals vs Block AttnRes. Validation loss is plotted against training compute on a log scale. AttnRes (blue) consistently achieves lower loss at every compute budget. The annotation shows that AttnRes matches the standard baseline trained with approximately 25% more compute, a compute advantage of roughly 1.25x.</em></figcaption></figure></div><p>As shown in figure 28, the AttnRes curve (blue) sits below the standard curve (orange) at every compute budget from 0.5 x 10^18 to 128 x 10^18 FLOPs. The key result is annotated on the plot: Block AttnRes matches the loss of a baseline trained with 1.25x the compute. In other words, adding AttnRes to your model buys roughly what a 25% larger training budget would.</p><h3><strong>Benchmark results</strong></h3><p>The scaling law result is confirmed by benchmark evaluations on Kimi Linear, a production model with 48B total parameters (3B activated via mixture-of-experts), trained on 1.4 trillion tokens.
The results are shown in figure 29.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KsrP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e542a54-2ac5-482f-bf2c-ff1aa6a9bd88_1519x973.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KsrP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e542a54-2ac5-482f-bf2c-ff1aa6a9bd88_1519x973.png 424w, https://substackcdn.com/image/fetch/$s_!KsrP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e542a54-2ac5-482f-bf2c-ff1aa6a9bd88_1519x973.png 848w, https://substackcdn.com/image/fetch/$s_!KsrP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e542a54-2ac5-482f-bf2c-ff1aa6a9bd88_1519x973.png 1272w, https://substackcdn.com/image/fetch/$s_!KsrP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e542a54-2ac5-482f-bf2c-ff1aa6a9bd88_1519x973.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!KsrP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e542a54-2ac5-482f-bf2c-ff1aa6a9bd88_1519x973.png" width="1456" height="933" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7e542a54-2ac5-482f-bf2c-ff1aa6a9bd88_1519x973.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:933,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Figure 29. Benchmark results comparing baseline vs AttnRes on Kimi Linear. The grouped bar chart shows improvements across all 9 benchmarks. The largest gains appear in multi-step reasoning (GPQA-Diamond +7.5 points) and mathematics (Math +3.6 points). Coding tasks also improve significantly (HumanEval +3.1 points). Even knowledge-heavy benchmarks like MMLU and CMMLU show modest but consistent improvements.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Figure 29. Benchmark results comparing baseline vs AttnRes on Kimi Linear. The grouped bar chart shows improvements across all 9 benchmarks. The largest gains appear in multi-step reasoning (GPQA-Diamond +7.5 points) and mathematics (Math +3.6 points). Coding tasks also improve significantly (HumanEval +3.1 points). Even knowledge-heavy benchmarks like MMLU and CMMLU show modest but consistent improvements." title="Figure 29. Benchmark results comparing baseline vs AttnRes on Kimi Linear. The grouped bar chart shows improvements across all 9 benchmarks. The largest gains appear in multi-step reasoning (GPQA-Diamond +7.5 points) and mathematics (Math +3.6 points). Coding tasks also improve significantly (HumanEval +3.1 points). Even knowledge-heavy benchmarks like MMLU and CMMLU show modest but consistent improvements." 
srcset="https://substackcdn.com/image/fetch/$s_!KsrP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e542a54-2ac5-482f-bf2c-ff1aa6a9bd88_1519x973.png 424w, https://substackcdn.com/image/fetch/$s_!KsrP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e542a54-2ac5-482f-bf2c-ff1aa6a9bd88_1519x973.png 848w, https://substackcdn.com/image/fetch/$s_!KsrP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e542a54-2ac5-482f-bf2c-ff1aa6a9bd88_1519x973.png 1272w, https://substackcdn.com/image/fetch/$s_!KsrP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e542a54-2ac5-482f-bf2c-ff1aa6a9bd88_1519x973.png 1456w" sizes="100vw" loading="lazy"></picture></div></a>
<figcaption class="image-caption"><em>Figure 29. Benchmark results comparing baseline vs AttnRes on Kimi Linear. The grouped bar chart shows improvements across all 9 benchmarks. The largest gains appear in multi-step reasoning (GPQA-Diamond +7.5 points) and mathematics (Math +3.6 points). Coding tasks also improve significantly (HumanEval +3.1 points). Even knowledge-heavy benchmarks like MMLU and CMMLU show modest but consistent improvements.</em></figcaption></figure></div><p>As illustrated in figure 29, AttnRes delivers consistent improvements across all 9 benchmarks. Some highlights:</p><ul><li><p><strong>GPQA-Diamond</strong>: +7.5 points (36.9 to 44.4). This is the standout result. GPQA-Diamond tests graduate-level reasoning, and the 7.5-point improvement suggests AttnRes significantly enhances the model&#8217;s ability to chain multi-step reasoning.</p></li><li><p><strong>Math</strong>: +3.6 points (53.5 to 57.1). Mathematical problem-solving requires precise, sequential reasoning across many steps, exactly the kind of task where selective layer access matters most.</p></li><li><p><strong>HumanEval</strong>: +3.1 points (59.1 to 62.2). Code generation benefits from the improved depth-wise routing, likely because different layers specialize in different aspects of code understanding.</p></li><li><p><strong>C-Eval</strong>: +2.9 points (79.6 to 82.5). Even on knowledge-focused benchmarks, the improvements are meaningful.</p></li></ul><h3><strong>Deeper models, narrower architectures</strong></h3><p>AttnRes does not just improve performance at a fixed architecture.
It fundamentally changes the optimal architecture itself, as shown in figure 30.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1Zb-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa8f3c33-c704-45ea-bbbb-2191cd6f9d88_1830x935.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1Zb-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa8f3c33-c704-45ea-bbbb-2191cd6f9d88_1830x935.png 424w, https://substackcdn.com/image/fetch/$s_!1Zb-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa8f3c33-c704-45ea-bbbb-2191cd6f9d88_1830x935.png 848w, https://substackcdn.com/image/fetch/$s_!1Zb-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa8f3c33-c704-45ea-bbbb-2191cd6f9d88_1830x935.png 1272w, https://substackcdn.com/image/fetch/$s_!1Zb-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa8f3c33-c704-45ea-bbbb-2191cd6f9d88_1830x935.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1Zb-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa8f3c33-c704-45ea-bbbb-2191cd6f9d88_1830x935.png" width="1456" height="744" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/aa8f3c33-c704-45ea-bbbb-2191cd6f9d88_1830x935.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:744,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Figure 30. Optimal depth-to-width ratio. Left panel: At each depth, AttnRes achieves optimal performance with a narrower model than standard residuals. For example, at 48 layers, the standard optimal width is 4096 while the AttnRes optimal width is 3072. Right panel: At the optimal configuration, AttnRes achieves lower validation loss at every depth, with the advantage growing larger at greater depths. The minimum loss for AttnRes occurs at 64 layers with width 2560, while standard residuals bottom out at 36 layers with width 5120.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Figure 30. Optimal depth-to-width ratio. Left panel: At each depth, AttnRes achieves optimal performance with a narrower model than standard residuals. For example, at 48 layers, the standard optimal width is 4096 while the AttnRes optimal width is 3072. Right panel: At the optimal configuration, AttnRes achieves lower validation loss at every depth, with the advantage growing larger at greater depths. The minimum loss for AttnRes occurs at 64 layers with width 2560, while standard residuals bottom out at 36 layers with width 5120." title="Figure 30. Optimal depth-to-width ratio. Left panel: At each depth, AttnRes achieves optimal performance with a narrower model than standard residuals. For example, at 48 layers, the standard optimal width is 4096 while the AttnRes optimal width is 3072. 
Right panel: At the optimal configuration, AttnRes achieves lower validation loss at every depth, with the advantage growing larger at greater depths. The minimum loss for AttnRes occurs at 64 layers with width 2560, while standard residuals bottom out at 36 layers with width 5120." srcset="https://substackcdn.com/image/fetch/$s_!1Zb-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa8f3c33-c704-45ea-bbbb-2191cd6f9d88_1830x935.png 424w, https://substackcdn.com/image/fetch/$s_!1Zb-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa8f3c33-c704-45ea-bbbb-2191cd6f9d88_1830x935.png 848w, https://substackcdn.com/image/fetch/$s_!1Zb-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa8f3c33-c704-45ea-bbbb-2191cd6f9d88_1830x935.png 1272w, https://substackcdn.com/image/fetch/$s_!1Zb-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa8f3c33-c704-45ea-bbbb-2191cd6f9d88_1830x935.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>Figure 30. Optimal depth-to-width ratio. Left panel: At each depth, AttnRes achieves optimal performance with a narrower model than standard residuals. For example, at 48 layers, the standard optimal width is 4096 while the AttnRes optimal width is 3072. Right panel: At the optimal configuration, AttnRes achieves lower validation loss at every depth, with the advantage growing larger at greater depths. The minimum loss for AttnRes occurs at 64 layers with width 2560, while standard residuals bottom out at 36 layers with width 5120.</em></figcaption></figure></div><p>As shown in figure 30, standard residuals favor shallower, wider architectures because depth is poorly utilized (due to dilution). AttnRes shifts the optimal point toward deeper, narrower architectures because depth is now effectively utilized through selective attention. This is a fundamental shift in model design philosophy: with AttnRes, you can build deeper models that actually benefit from their depth.</p><h3><strong>Overhead analysis</strong></h3><p>The final question is cost. How much overhead does Block AttnRes add? 
The answer, shown in figure 31, is remarkably little.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XVMB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09bafeba-dcbe-48d4-914c-3e7d1ccbdcd2_2066x935.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XVMB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09bafeba-dcbe-48d4-914c-3e7d1ccbdcd2_2066x935.png 424w, https://substackcdn.com/image/fetch/$s_!XVMB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09bafeba-dcbe-48d4-914c-3e7d1ccbdcd2_2066x935.png 848w, https://substackcdn.com/image/fetch/$s_!XVMB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09bafeba-dcbe-48d4-914c-3e7d1ccbdcd2_2066x935.png 1272w, https://substackcdn.com/image/fetch/$s_!XVMB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09bafeba-dcbe-48d4-914c-3e7d1ccbdcd2_2066x935.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XVMB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09bafeba-dcbe-48d4-914c-3e7d1ccbdcd2_2066x935.png" width="1456" height="659" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/09bafeba-dcbe-48d4-914c-3e7d1ccbdcd2_2066x935.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:659,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Figure 31. Overhead breakdown for Block AttnRes. Left panel (Cost): Training cost increases by only 3.8%, and inference latency increases by only 1.9%. Right panel (Benefits): The gains far outweigh the costs. The compute advantage is 25% (equivalent to 1.25x more training compute), GPQA-Diamond improves by 7.5 points, Math by 3.6 points, and HumanEval by 3.1 points.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Figure 31. Overhead breakdown for Block AttnRes. Left panel (Cost): Training cost increases by only 3.8%, and inference latency increases by only 1.9%. Right panel (Benefits): The gains far outweigh the costs. The compute advantage is 25% (equivalent to 1.25x more training compute), GPQA-Diamond improves by 7.5 points, Math by 3.6 points, and HumanEval by 3.1 points." title="Figure 31. Overhead breakdown for Block AttnRes. Left panel (Cost): Training cost increases by only 3.8%, and inference latency increases by only 1.9%. Right panel (Benefits): The gains far outweigh the costs. The compute advantage is 25% (equivalent to 1.25x more training compute), GPQA-Diamond improves by 7.5 points, Math by 3.6 points, and HumanEval by 3.1 points." 
srcset="https://substackcdn.com/image/fetch/$s_!XVMB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09bafeba-dcbe-48d4-914c-3e7d1ccbdcd2_2066x935.png 424w, https://substackcdn.com/image/fetch/$s_!XVMB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09bafeba-dcbe-48d4-914c-3e7d1ccbdcd2_2066x935.png 848w, https://substackcdn.com/image/fetch/$s_!XVMB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09bafeba-dcbe-48d4-914c-3e7d1ccbdcd2_2066x935.png 1272w, https://substackcdn.com/image/fetch/$s_!XVMB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09bafeba-dcbe-48d4-914c-3e7d1ccbdcd2_2066x935.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>Figure 31. Overhead breakdown for Block AttnRes. Left panel (Cost): Training cost increases by only 3.8%, and inference latency increases by only 1.9%. Right panel (Benefits): The gains far outweigh the costs. The compute advantage is 25% (equivalent to 1.25x more training compute), GPQA-Diamond improves by 7.5 points, Math by 3.6 points, and HumanEval by 3.1 points.</em></figcaption></figure></div><p>As illustrated in figure 31, the overhead is minimal:</p><ul><li><p><strong>Training cost increase</strong>: 3.8%. The additional attention computation at block boundaries adds less than 4% to the total training cost.</p></li><li><p><strong>Inference latency increase</strong>: 1.9%. Block attention is fast relative to the standard attention within layers.</p></li></ul><p>The 1.25x compute advantage far outweighs the sub-4% overhead. Put another way, paying roughly 4 cents more per training dollar yields model quality equivalent to spending an extra 25 cents on compute. The authors confirmed that 8 blocks is the optimal trade-off between performance and overhead.</p><p>These improvements did not emerge from a vacuum. 
Attention Residuals represent the latest step in a decade-long quest to improve how deep networks combine information across depth.</p><div><hr></div><h2><strong>The evolution of depth-wise aggregation</strong></h2><p>To appreciate where AttnRes sits in the landscape, it helps to trace the history of how researchers have approached the problem of combining information across depth.</p><h3><strong>From ResNet to AttnRes: a ten-year journey</strong></h3><p>The progression from fixed residual connections to learned depth-wise attention has unfolded over a decade, as shown in figure 32.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hTVR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3eb1210b-93a3-4a2d-b115-1193f53508ba_2752x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hTVR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3eb1210b-93a3-4a2d-b115-1193f53508ba_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!hTVR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3eb1210b-93a3-4a2d-b115-1193f53508ba_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!hTVR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3eb1210b-93a3-4a2d-b115-1193f53508ba_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!hTVR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3eb1210b-93a3-4a2d-b115-1193f53508ba_2752x1536.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!hTVR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3eb1210b-93a3-4a2d-b115-1193f53508ba_2752x1536.png" width="1456" height="813" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3eb1210b-93a3-4a2d-b115-1193f53508ba_2752x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:813,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Figure 32. The evolution of depth-wise aggregation in deep learning. From ResNet (2016, fixed weight 1.0 addition) through DenseNet (2016, concatenate all outputs), DenseFormer (Feb 2024, learned input-independent scalar weights), ResFormer (Oct 2024, value residual connections from layer 1), DeepCrossAttention (2025, input-dependent cross-layer weights), to AttnRes (2026, full softmax attention over depth via pseudo-queries). The arrow shows increasing expressiveness from fixed and uniform to learned and selective.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Figure 32. The evolution of depth-wise aggregation in deep learning. From ResNet (2016, fixed weight 1.0 addition) through DenseNet (2016, concatenate all outputs), DenseFormer (Feb 2024, learned input-independent scalar weights), ResFormer (Oct 2024, value residual connections from layer 1), DeepCrossAttention (2025, input-dependent cross-layer weights), to AttnRes (2026, full softmax attention over depth via pseudo-queries). The arrow shows increasing expressiveness from fixed and uniform to learned and selective." title="Figure 32. 
The evolution of depth-wise aggregation in deep learning. From ResNet (2016, fixed weight 1.0 addition) through DenseNet (2016, concatenate all outputs), DenseFormer (Feb 2024, learned input-independent scalar weights), ResFormer (Oct 2024, value residual connections from layer 1), DeepCrossAttention (2025, input-dependent cross-layer weights), to AttnRes (2026, full softmax attention over depth via pseudo-queries). The arrow shows increasing expressiveness from fixed and uniform to learned and selective." srcset="https://substackcdn.com/image/fetch/$s_!hTVR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3eb1210b-93a3-4a2d-b115-1193f53508ba_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!hTVR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3eb1210b-93a3-4a2d-b115-1193f53508ba_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!hTVR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3eb1210b-93a3-4a2d-b115-1193f53508ba_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!hTVR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3eb1210b-93a3-4a2d-b115-1193f53508ba_2752x1536.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>Figure 32. The evolution of depth-wise aggregation in deep learning. From ResNet (2016, fixed weight 1.0 addition) through DenseNet (2016, concatenate all outputs), DenseFormer (Feb 2024, learned input-independent scalar weights), ResFormer (Oct 2024, value residual connections from layer 1), DeepCrossAttention (2025, input-dependent cross-layer weights), to AttnRes (2026, full softmax attention over depth via pseudo-queries). The arrow shows increasing expressiveness from fixed and uniform to learned and selective.</em></figcaption></figure></div><p>As shown in figure 32, each step in this timeline adds more expressiveness to depth-wise information routing:</p><ul><li><p><strong>ResNet (2016)</strong>: Fixed weight 1.0. Every layer contributes equally. Zero overhead, but no selectivity.</p></li><li><p><strong>DenseNet (2016)</strong>: Concatenates all previous outputs instead of summing them. Very expressive but memory-intensive. Primarily used in computer vision.</p></li><li><p><strong>DenseFormer (Feb 2024)</strong>: Learns scalar weights per layer pair, but these weights are input-independent. 
Minimal overhead, but limited expressiveness.</p></li><li><p><strong>ResFormer (Oct 2024)</strong>: Adds residual connections specifically to value vectors from the first layer. Addresses the related but different problem of attention concentration.</p></li><li><p><strong>DeepCrossAttention (2025)</strong>: Uses full input-dependent cross-attention between layers. Claims up to 3x training speedup. Most similar to AttnRes in concept.</p></li><li><p><strong>AttnRes (March 2026)</strong>: Full softmax attention over depth via a single pseudo-query per layer. Less than 2% inference overhead with 1.25x compute advantage.</p></li></ul><h3><strong>Comparing the approaches</strong></h3><p>A detailed comparison of the key methods is shown in figure 33.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!295t!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1153a136-938a-42d6-a99a-55c42676f0d7_2752x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!295t!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1153a136-938a-42d6-a99a-55c42676f0d7_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!295t!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1153a136-938a-42d6-a99a-55c42676f0d7_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!295t!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1153a136-938a-42d6-a99a-55c42676f0d7_2752x1536.png 1272w, 
https://substackcdn.com/image/fetch/$s_!295t!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1153a136-938a-42d6-a99a-55c42676f0d7_2752x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!295t!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1153a136-938a-42d6-a99a-55c42676f0d7_2752x1536.png" width="1456" height="813" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1153a136-938a-42d6-a99a-55c42676f0d7_2752x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:813,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Figure 33. Comparison of depth-wise aggregation methods across weight type, input-dependence, overhead, and key property. Standard residuals use fixed weights with 0% overhead but O(L) growth. DenseFormer uses learned scalars that are input-independent. ResFormer targets attention concentration rather than hidden state dilution. DeepCrossAttention is the first fully input-dependent method. AttnRes achieves input-dependent softmax attention with less than 2% inference overhead and bounded magnitudes.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Figure 33. Comparison of depth-wise aggregation methods across weight type, input-dependence, overhead, and key property. Standard residuals use fixed weights with 0% overhead but O(L) growth. DenseFormer uses learned scalars that are input-independent. ResFormer targets attention concentration rather than hidden state dilution. DeepCrossAttention is the first fully input-dependent method. AttnRes achieves input-dependent softmax attention with less than 2% inference overhead and bounded magnitudes." title="Figure 33. Comparison of depth-wise aggregation methods across weight type, input-dependence, overhead, and key property. Standard residuals use fixed weights with 0% overhead but O(L) growth. DenseFormer uses learned scalars that are input-independent. ResFormer targets attention concentration rather than hidden state dilution. DeepCrossAttention is the first fully input-dependent method. AttnRes achieves input-dependent softmax attention with less than 2% inference overhead and bounded magnitudes." srcset="https://substackcdn.com/image/fetch/$s_!295t!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1153a136-938a-42d6-a99a-55c42676f0d7_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!295t!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1153a136-938a-42d6-a99a-55c42676f0d7_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!295t!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1153a136-938a-42d6-a99a-55c42676f0d7_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!295t!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1153a136-938a-42d6-a99a-55c42676f0d7_2752x1536.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>Figure 33. Comparison of depth-wise aggregation methods across weight type, input-dependence, overhead, and key property. Standard residuals use fixed weights with 0% overhead but O(L) growth. DenseFormer uses learned scalars that are input-independent. ResFormer targets attention concentration rather than hidden state dilution. DeepCrossAttention is the first fully input-dependent method. AttnRes achieves input-dependent softmax attention with less than 2% inference overhead and bounded magnitudes.</em></figcaption></figure></div><p>As illustrated in figure 33, AttnRes occupies a unique position in the design space: it provides full input-dependent attention over depth with less than 2% inference overhead. 
The key distinguishing properties are bounded magnitudes (the softmax weights form a convex combination, so the output norm can never exceed the largest input norm) and the 1.25x compute advantage (from better depth utilization).</p><h3><strong>When does AttnRes struggle?</strong></h3><p>No technique is universally superior, and AttnRes is no exception. Ziming Liu of MIT and Caltech provides a nuanced analysis of when AttnRes excels and when it falls short.</p><p>AttnRes excels on structured tasks where skipping intermediate layers is valuable. Natural language has a rich hierarchical structure, with different layers specializing in syntax, semantics, and reasoning. The attention mechanism can learn to focus on specific layers without needing to suppress intermediate representations.</p><p>However, AttnRes can struggle on pure memorization tasks where uniform blending works fine. If the task needs every layer to contribute equally, selective attention provides no advantage and may even hurt by constraining the representation.</p><p>There is also a risk of representation collapse: if the attention weights converge to a uniform distribution during training, AttnRes degenerates to averaging all previous hidden states. This uniform bias can limit expressive capacity in certain settings. The structured nature of natural language likely explains why Kimi&#8217;s strong empirical results generalize well across diverse benchmarks.</p><p>Having explored the theory, results, and context, let&#8217;s look inside a trained model to see what patterns the depth-wise attention actually learns.</p><div><hr></div><h2><strong>Inside a trained model</strong></h2><p>The most illuminating way to understand AttnRes is to examine what the model actually learns. What patterns do the attention weights develop? 
How do different blocks interact?</p><h3><strong>What the attention weights learn</strong></h3><p>The attention weight heatmap from a trained Kimi Linear model reveals striking patterns, as shown in figure 34.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!T1xq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F680f9f6c-0f5d-4348-b1f5-5a291243359e_2752x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!T1xq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F680f9f6c-0f5d-4348-b1f5-5a291243359e_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!T1xq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F680f9f6c-0f5d-4348-b1f5-5a291243359e_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!T1xq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F680f9f6c-0f5d-4348-b1f5-5a291243359e_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!T1xq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F680f9f6c-0f5d-4348-b1f5-5a291243359e_2752x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!T1xq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F680f9f6c-0f5d-4348-b1f5-5a291243359e_2752x1536.png" width="1456" height="813" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/680f9f6c-0f5d-4348-b1f5-5a291243359e_2752x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:813,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Figure 34. Learned depth-wise attention weight heatmap from the Kimi Linear 48B model. Source blocks on the x-axis, target blocks on the y-axis. Dark blue indicates high attention weight. Three notable patterns emerge: (1) Early blocks attend strongly to the token embeddings (column 0 is dark for rows 1-2). (2) Middle blocks attend primarily to nearby predecessors (the diagonal is prominent for rows 3-5). (3) Deep blocks develop long-range connections back to early layers (rows 7-8 have dark cells in columns 0-1), suggesting they retrieve fundamental features directly.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Figure 34. Learned depth-wise attention weight heatmap from the Kimi Linear 48B model. Source blocks on the x-axis, target blocks on the y-axis. Dark blue indicates high attention weight. Three notable patterns emerge: (1) Early blocks attend strongly to the token embeddings (column 0 is dark for rows 1-2). (2) Middle blocks attend primarily to nearby predecessors (the diagonal is prominent for rows 3-5). (3) Deep blocks develop long-range connections back to early layers (rows 7-8 have dark cells in columns 0-1), suggesting they retrieve fundamental features directly." title="Figure 34. Learned depth-wise attention weight heatmap from the Kimi Linear 48B model. Source blocks on the x-axis, target blocks on the y-axis. Dark blue indicates high attention weight. 
Three notable patterns emerge: (1) Early blocks attend strongly to the token embeddings (column 0 is dark for rows 1-2). (2) Middle blocks attend primarily to nearby predecessors (the diagonal is prominent for rows 3-5). (3) Deep blocks develop long-range connections back to early layers (rows 7-8 have dark cells in columns 0-1), suggesting they retrieve fundamental features directly." srcset="https://substackcdn.com/image/fetch/$s_!T1xq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F680f9f6c-0f5d-4348-b1f5-5a291243359e_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!T1xq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F680f9f6c-0f5d-4348-b1f5-5a291243359e_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!T1xq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F680f9f6c-0f5d-4348-b1f5-5a291243359e_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!T1xq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F680f9f6c-0f5d-4348-b1f5-5a291243359e_2752x1536.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>Figure 34. Learned depth-wise attention weight heatmap from the Kimi Linear 48B model. Source blocks on the x-axis, target blocks on the y-axis. Dark blue indicates high attention weight. Three notable patterns emerge: (1) Early blocks attend strongly to the token embeddings (column 0 is dark for rows 1-2). (2) Middle blocks attend primarily to nearby predecessors (the diagonal is prominent for rows 3-5). (3) Deep blocks develop long-range connections back to early layers (rows 7-8 have dark cells in columns 0-1), suggesting they retrieve fundamental features directly.</em></figcaption></figure></div><p>As shown in figure 34, three distinct patterns emerge from the learned attention weights:</p><ul><li><p><strong>Early blocks attend to embeddings</strong>: Blocks 1 and 2 place strong attention on the token embeddings (source block 0). This makes sense because early processing needs direct access to the raw input features.</p></li><li><p><strong>Middle blocks attend to neighbors</strong>: Blocks 3 through 5 show a strong diagonal pattern, attending primarily to their immediate predecessors. 
This suggests a sequential refinement strategy where each block builds on the most recent computation.</p></li><li><p><strong>Deep blocks develop long-range connections</strong>: Blocks 7 and 8 show renewed attention to early blocks (columns 0-1), even though they also attend to nearby blocks. This is the most intriguing pattern: deep layers &#8220;reach back&#8221; to retrieve fundamental features that may have been diluted by intermediate processing.</p></li></ul><p>These patterns would be impossible with standard residual connections, where every block receives the same blended signal regardless of what it needs.</p><h3><strong>Implementation walkthrough</strong></h3><p>The implementation of Block AttnRes is remarkably concise. The annotated pseudocode is shown in figure 35.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!uNAh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd2626f7-af3f-4520-ba56-cd3470c16407_2752x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!uNAh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd2626f7-af3f-4520-ba56-cd3470c16407_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!uNAh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd2626f7-af3f-4520-ba56-cd3470c16407_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!uNAh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd2626f7-af3f-4520-ba56-cd3470c16407_2752x1536.png 1272w, 
https://substackcdn.com/image/fetch/$s_!uNAh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd2626f7-af3f-4520-ba56-cd3470c16407_2752x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!uNAh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd2626f7-af3f-4520-ba56-cd3470c16407_2752x1536.png" width="1456" height="813" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fd2626f7-af3f-4520-ba56-cd3470c16407_2752x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:813,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Figure 35. Annotated PyTorch pseudocode for Block AttnRes. The function block_attn_res takes block representations, a partial sum, and a learned projection. Step 1: Stack all sources into tensor V. Step 2: Normalize with RMSNorm. Step 3: Compute logits via einsum between the projection weight and V. Step 4: Apply softmax over the block dimension (dim=0) to get weights that sum to 1. Step 5: Weighted aggregation via einsum to produce the bounded output.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Figure 35. Annotated PyTorch pseudocode for Block AttnRes. The function block_attn_res takes block representations, a partial sum, and a learned projection. Step 1: Stack all sources into tensor V. Step 2: Normalize with RMSNorm. Step 3: Compute logits via einsum between the projection weight and V. Step 4: Apply softmax over the block dimension (dim=0) to get weights that sum to 1. 
Step 5: Weighted aggregation via einsum to produce the bounded output." title="Figure 35. Annotated PyTorch pseudocode for Block AttnRes. The function block_attn_res takes block representations, a partial sum, and a learned projection. Step 1: Stack all sources into tensor V. Step 2: Normalize with RMSNorm. Step 3: Compute logits via einsum between the projection weight and V. Step 4: Apply softmax over the block dimension (dim=0) to get weights that sum to 1. Step 5: Weighted aggregation via einsum to produce the bounded output." srcset="https://substackcdn.com/image/fetch/$s_!uNAh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd2626f7-af3f-4520-ba56-cd3470c16407_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!uNAh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd2626f7-af3f-4520-ba56-cd3470c16407_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!uNAh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd2626f7-af3f-4520-ba56-cd3470c16407_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!uNAh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd2626f7-af3f-4520-ba56-cd3470c16407_2752x1536.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 
4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>Figure 35. Annotated PyTorch pseudocode for Block AttnRes. The function block_attn_res takes block representations, a partial sum, and a learned projection. Step 1: Stack all sources into tensor V. Step 2: Normalize with RMSNorm. Step 3: Compute logits via einsum between the projection weight and V. Step 4: Apply softmax over the block dimension (dim=0) to get weights that sum to 1. Step 5: Weighted aggregation via einsum to produce the bounded output.</em></figcaption></figure></div><p>As illustrated in figure 35, the entire Block AttnRes computation fits in five steps:</p><pre><code><code>def block_attn_res(block_reprs, partial_sum, proj):
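    # Step 1: Stack the N previous block representations and the current partial sum
    V = stack(block_reprs + [partial_sum])      # (N+1, batch, seq, d)

    # Step 2: Normalize each source with RMSNorm
    V = RMSNorm(V)

    # Step 3: Score every source against the learned pseudo-query (proj.weight)
    logits = einsum('d, n b t d -&gt; n b t',
                    proj.weight.squeeze(), V)   # (N+1, batch, seq)

    # Step 4: Softmax over the block dimension, so the weights sum to 1 across sources
    weights = logits.softmax(dim=0)

    # Step 5: Weighted aggregation of the sources into a single hidden state
    output = einsum('n b t, n b t d -&gt; b t d',
                    weights, V)                 # (batch, seq, d)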

    return output</code></code></pre><p>The function takes the previous block representations, the current block&#8217;s partial sum, and a learned projection (the pseudo-query). The stack, normalize, score, softmax, aggregate pattern is clean and efficient. Adding Block AttnRes to an existing transformer requires modifying only the block boundary logic, making it a true drop-in replacement for standard residual connections.</p><div><hr></div><h2><strong>Summary</strong></h2><p>We have traced the complete story of Attention Residuals, from the problem they solve to the mechanism they use to the gains they deliver. Let&#8217;s recap the key takeaways.</p><ul><li><p><strong>The PreNorm dilution problem</strong>: Standard residual connections add all layer outputs with fixed weight 1.0, causing hidden state magnitudes to grow as O(L) with depth. This progressively dilutes each layer&#8217;s contribution, making deeper layers less effective and wasting model capacity. In a 50-layer model, each layer represents only 2% of the final hidden state.</p></li></ul><ul><li><p><strong>The depth-time duality</strong>: Information dilution across network depth is structurally identical to memory loss across a sequence. Just as self-attention solved the sequence problem by enabling selective access to any previous position, AttnRes solves the depth problem by enabling selective access to any previous layer. The mathematical structure is the same, just rotated 90 degrees.</p></li></ul><ul><li><p><strong>The AttnRes mechanism</strong>: Each layer uses a learned pseudo-query vector to compute softmax attention weights over all previous layer outputs, producing an input-dependent weighted combination. Since softmax weights sum to 1, the resulting hidden state is a bounded convex combination that can never exceed the magnitude of the largest input. 
This completely eliminates the O(L) growth problem while providing each layer with selective, content-aware access to earlier representations.</p></li></ul><ul><li><p><strong>Block AttnRes for scalability</strong>: The practical variant partitions layers into approximately 8 blocks, using standard residuals within blocks and attention-based aggregation at block boundaries. This reduces memory from O(Ld) to O(Nd) while recovering most of the full AttnRes gains. Pipeline-parallel training is supported through cache-based point-to-point communication of block representations, with less than 2% inference overhead.</p></li></ul><ul><li><p><strong>Quantified improvements</strong>: Block AttnRes achieves a 1.25x compute advantage, matching baseline models trained with 25% more resources. On Kimi Linear (48B total, 3B activated, 1.4T tokens), it improved GPQA-Diamond by +7.5 points, Math by +3.6 points, and HumanEval by +3.1 points. The largest gains appear in multi-step reasoning tasks. AttnRes also shifts the optimal architecture toward deeper, narrower models that more effectively utilize their depth.</p></li></ul><div><hr></div><h2><strong>Further reading</strong></h2><ul><li><p><a href="https://arxiv.org/abs/2603.15031">Attention Residuals (arXiv:2603.15031)</a> - The original paper by Chen et al. at Moonshot AI (Kimi team), March 2026. Full mathematical formulation, scaling law experiments, and Kimi Linear integration.</p></li><li><p><a href="https://github.com/MoonshotAI/Attention-Residuals">MoonshotAI/Attention-Residuals on GitHub</a> - Official implementation with PyTorch pseudocode, Block AttnRes details, and benchmark tables.</p></li><li><p><a href="https://kindxiaoming.github.io/blog/2026/attention-residual/">When Does Attention Residuals Work? 
(Ziming Liu)</a> - Critical analysis from MIT/Caltech showing when AttnRes excels (structured tasks) and when it struggles (pure memorization).</p></li><li><p><a href="https://arxiv.org/abs/2402.02622">DenseFormer (arXiv:2402.02622)</a> - The predecessor using Depth-Weighted Averaging with input-independent scalar weights. Important comparison point.</p></li><li><p><a href="https://arxiv.org/abs/2502.06785">DeepCrossAttention (arXiv:2502.06785)</a> - Google&#8217;s learnable input-dependent cross-layer weights, claiming 3x training speedup.</p></li><li><p><a href="https://arxiv.org/abs/2410.17897">Value Residual Learning / ResFormer (arXiv:2410.17897)</a> - Residual connections on value vectors to address attention concentration, saving 10-14% parameters.</p></li></ul>
]]></content:encoded></item><item><title><![CDATA[Vision Transformers]]></title><description><![CDATA[Understanding and fine-tuning Vision Transformers (ViT) for image classification, to hands-on transfer learning with pretrained models.]]></description><link>https://www.vizuaranewsletter.com/p/vision-transformers</link><guid isPermaLink="false">https://www.vizuaranewsletter.com/p/vision-transformers</guid><dc:creator><![CDATA[Mayank Pratap Singh]]></dc:creator><pubDate>Sun, 22 Mar 2026 04:10:41 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/5d450573-3b45-4cec-bcd5-3ef3125044d4_1200x640.gif" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!E3g4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bfae427-1b4f-483e-a558-a5a6b7599c44_1920x1278.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!E3g4!,w_424,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bfae427-1b4f-483e-a558-a5a6b7599c44_1920x1278.gif 424w, https://substackcdn.com/image/fetch/$s_!E3g4!,w_848,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bfae427-1b4f-483e-a558-a5a6b7599c44_1920x1278.gif 848w, 
https://substackcdn.com/image/fetch/$s_!E3g4!,w_1272,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bfae427-1b4f-483e-a558-a5a6b7599c44_1920x1278.gif 1272w, https://substackcdn.com/image/fetch/$s_!E3g4!,w_1456,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bfae427-1b4f-483e-a558-a5a6b7599c44_1920x1278.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!E3g4!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bfae427-1b4f-483e-a558-a5a6b7599c44_1920x1278.gif" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2bfae427-1b4f-483e-a558-a5a6b7599c44_1920x1278.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:13148366,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/181494472?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bfae427-1b4f-483e-a558-a5a6b7599c44_1920x1278.gif&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!E3g4!,w_424,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bfae427-1b4f-483e-a558-a5a6b7599c44_1920x1278.gif 424w, https://substackcdn.com/image/fetch/$s_!E3g4!,w_848,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bfae427-1b4f-483e-a558-a5a6b7599c44_1920x1278.gif 848w, 
https://substackcdn.com/image/fetch/$s_!E3g4!,w_1272,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bfae427-1b4f-483e-a558-a5a6b7599c44_1920x1278.gif 1272w, https://substackcdn.com/image/fetch/$s_!E3g4!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2bfae427-1b4f-483e-a558-a5a6b7599c44_1920x1278.gif 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><p><strong>Table of Contents</strong></p><ul><li><p>Adapting transformers to images: patch embeddings and flattening</p></li><li><p>Positional encodings in vision</p></li><li><p>Encoder-only structure for classification</p></li><li><p>Benefits and drawbacks of ViT</p></li><li><p>Real-world applications of ViT</p></li><li><p>Hands-on: fine-tuning ViT for image classification</p></li></ul><p><strong>The code for fine-tuning the Vision Transformer is available below</strong></p><p><a href="https://github.com/VizuaraAI/Transformers-for-vision-BOOK/tree/main">https://github.com/VizuaraAI/Transformers-for-vision-BOOK</a></p><h1>1.1 Introduction to Vision Transformers and Comparison with CNNs</h1><p>Vision Transformers adapt the transformer architecture from language modeling to images. Instead of scanning an image with small sliding filters, they treat small image patches as tokens and learn how these patches relate to one another through self attention. 
For now it is enough to think of a Vision Transformer as a model that can look at all parts of an image at once and decide which regions should influence each other.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qI1a!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fdc6de7-e966-413e-961e-2ba98529aedb_1425x1074.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qI1a!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fdc6de7-e966-413e-961e-2ba98529aedb_1425x1074.png 424w, https://substackcdn.com/image/fetch/$s_!qI1a!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fdc6de7-e966-413e-961e-2ba98529aedb_1425x1074.png 848w, https://substackcdn.com/image/fetch/$s_!qI1a!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fdc6de7-e966-413e-961e-2ba98529aedb_1425x1074.png 1272w, https://substackcdn.com/image/fetch/$s_!qI1a!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fdc6de7-e966-413e-961e-2ba98529aedb_1425x1074.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qI1a!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fdc6de7-e966-413e-961e-2ba98529aedb_1425x1074.png" width="1425" height="1074" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0fdc6de7-e966-413e-961e-2ba98529aedb_1425x1074.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1074,&quot;width&quot;:1425,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:356535,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/181494472?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fdc6de7-e966-413e-961e-2ba98529aedb_1425x1074.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!qI1a!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fdc6de7-e966-413e-961e-2ba98529aedb_1425x1074.png 424w, https://substackcdn.com/image/fetch/$s_!qI1a!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fdc6de7-e966-413e-961e-2ba98529aedb_1425x1074.png 848w, https://substackcdn.com/image/fetch/$s_!qI1a!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fdc6de7-e966-413e-961e-2ba98529aedb_1425x1074.png 1272w, https://substackcdn.com/image/fetch/$s_!qI1a!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fdc6de7-e966-413e-961e-2ba98529aedb_1425x1074.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em><strong>Figure 1</strong> </em>Self-attention vs. convolution receptive fields on a bird image</figcaption></figure></div><p>The key architectural difference between Vision Transformers and convolutional networks lies in how they see an image. A convolutional layer only looks at a small neighborhood of pixels when it computes its output. Its receptive field grows only gradually as we stack more layers and pooling operations. This locality bias has been very successful for classic vision tasks, but it means that long distance relationships in an image are only captured indirectly and late in the network. Figure 1 illustrates this contrast between global self attention and local convolution on a simple bird image.</p><p>Self attention in a Vision Transformer has a global receptive field from the first layer. 
For any query location, the model can compare it directly with every other patch in the image and decide which ones are relevant. In the bird illustration, a single pixel or patch can immediately connect to any other region in the picture, while a convolution sees only its nearby neighborhood and must rely on many stacked layers to move information from one side of the image to the other.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XSp1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88c3dd9d-4dc6-4c57-a77f-add0ebdcccac_1647x948.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XSp1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88c3dd9d-4dc6-4c57-a77f-add0ebdcccac_1647x948.png 424w, https://substackcdn.com/image/fetch/$s_!XSp1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88c3dd9d-4dc6-4c57-a77f-add0ebdcccac_1647x948.png 848w, https://substackcdn.com/image/fetch/$s_!XSp1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88c3dd9d-4dc6-4c57-a77f-add0ebdcccac_1647x948.png 1272w, https://substackcdn.com/image/fetch/$s_!XSp1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88c3dd9d-4dc6-4c57-a77f-add0ebdcccac_1647x948.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XSp1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88c3dd9d-4dc6-4c57-a77f-add0ebdcccac_1647x948.png" width="1456" 
height="838" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/88c3dd9d-4dc6-4c57-a77f-add0ebdcccac_1647x948.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:838,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1083031,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/181494472?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88c3dd9d-4dc6-4c57-a77f-add0ebdcccac_1647x948.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!XSp1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88c3dd9d-4dc6-4c57-a77f-add0ebdcccac_1647x948.png 424w, https://substackcdn.com/image/fetch/$s_!XSp1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88c3dd9d-4dc6-4c57-a77f-add0ebdcccac_1647x948.png 848w, https://substackcdn.com/image/fetch/$s_!XSp1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88c3dd9d-4dc6-4c57-a77f-add0ebdcccac_1647x948.png 1272w, https://substackcdn.com/image/fetch/$s_!XSp1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88c3dd9d-4dc6-4c57-a77f-add0ebdcccac_1647x948.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 2 Optical illusion painting: global attention vs. local convolution</figcaption></figure></div><p>The second figure uses an optical illusion painting to make this difference more concrete. A human viewer can perceive both the detailed scene of a rider by the river and the larger face formed by the entire painting. A convolutional model tends to focus on the local textures of rocks, water, and fur, while a Vision Transformer can link distant regions that together form the face. Figure 2 shows how self attention attends across the whole painting, while a convolution still operates on a restricted local view.</p><p>The classic story of several blind people trying to describe an elephant gives another intuitive picture of this difference. 
Each person touches a single part and concludes that the elephant is a rope, a wall, or a snake, depending on whether they hold the tail, the body, or the trunk. A convolutional network behaves in a similar way, since each unit only has access to a small patch and builds its understanding from many separate local views. A Vision Transformer behaves more like a group that can share information freely. Even if each observer starts with a limited view, self attention lets them combine their observations and agree on the full shape of the elephant.</p><p>In the rest of this chapter we shift from this high level comparison to the inner workings of Vision Transformers. We will unpack how images are converted into patch tokens, how positional information is added, and how self attention layers process these tokens. Step by step, we will build up the full Vision Transformer encoder and its classification head so that the complete architecture becomes clear and concrete.</p><p>Since we've already covered the fundamentals of Transformers, I won't be going into too much detail here. 
If you're new to the architecture, I highly recommend reading my previous post, to get a solid foundation before we dive into Vision Transformers.</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;d038f8b6-0901-4372-9c27-ca000f6ab1bd&quot;,&quot;caption&quot;:&quot;The Transformer Architecture&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;lg&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;The Transformers&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:321073573,&quot;name&quot;:&quot;Mayank Pratap Singh&quot;,&quot;bio&quot;:&quot;I&#8217;m Mayank, an AI and LLM enthusiast, and this is my space to breaking down complex concepts into accessible insights&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/41e709c7-9629-4748-93e5-e3014c16fb57_460x460.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2026-03-17T03:32:41.080Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/$s_!Igi_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcae88335-48f0-4ab3-baef-790df9e6f2ed_1920x1080.gif&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://www.vizuaranewsletter.com/p/the-transformers&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:190611987,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:90,&quot;comment_count&quot;:5,&quot;publication_id&quot;:3591997,&quot;publication_name&quot;:&quot;Vizuara&#8217;s AI 
Newsletter&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!f_Wd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59481c77-2230-4e2f-b78b-88e9ee5fe9d9_1088x1088.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><h3>From Text Transformers to Vision Transformers</h3><p>Transformers for text and transformers for images share the same core ideas. In language models such as GPT, we start from a sequence of tokens, embed them, and apply masked self attention so that each token can only attend to tokens at its current or earlier positions. This matches the next token prediction objective, where the final context vector of the sequence is used to predict the following token. In Vision Transformers, an image is first tokenized into a sequence of patches, self attention is applied without masking so that every patch can attend to every other patch, and a special class token provides a single representation that is passed to a small MLP head for classification.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-1sE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe43463b-cb47-4696-adfd-d8d24095bee3_1578x633.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-1sE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe43463b-cb47-4696-adfd-d8d24095bee3_1578x633.png 424w, 
https://substackcdn.com/image/fetch/$s_!-1sE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe43463b-cb47-4696-adfd-d8d24095bee3_1578x633.png 848w, https://substackcdn.com/image/fetch/$s_!-1sE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe43463b-cb47-4696-adfd-d8d24095bee3_1578x633.png 1272w, https://substackcdn.com/image/fetch/$s_!-1sE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe43463b-cb47-4696-adfd-d8d24095bee3_1578x633.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-1sE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe43463b-cb47-4696-adfd-d8d24095bee3_1578x633.png" width="1456" height="584" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fe43463b-cb47-4696-adfd-d8d24095bee3_1578x633.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:584,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:51020,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/181494472?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe43463b-cb47-4696-adfd-d8d24095bee3_1578x633.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!-1sE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe43463b-cb47-4696-adfd-d8d24095bee3_1578x633.png 424w, https://substackcdn.com/image/fetch/$s_!-1sE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe43463b-cb47-4696-adfd-d8d24095bee3_1578x633.png 848w, https://substackcdn.com/image/fetch/$s_!-1sE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe43463b-cb47-4696-adfd-d8d24095bee3_1578x633.png 1272w, https://substackcdn.com/image/fetch/$s_!-1sE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe43463b-cb47-4696-adfd-d8d24095bee3_1578x633.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em><strong>Figure 3</strong> Comparing text and image transformers. The top row shows a GPT style model that tokenizes a sentence, applies masked self attention, and uses the last context vector for next token prediction. The bottom row shows a Vision Transformer that tokenizes an image into patches, applies unmasked self attention, and uses the class token for image classification.</em></figcaption></figure></div><p></p><p>BERT provides a second reference point for Vision Transformers. Instead of predicting the next word, BERT is trained to recover masked tokens inside a sentence, so it uses unmasked self attention over the entire sequence to capture bidirectional context. 
Vision Transformers adopt this encoder style design from BERT, but apply it to image patches together with a class token, giving a close analogue of BERT style sequence understanding in the image domain.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LRPO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F168d22fb-c14a-4161-8292-ba3b773514af_612x381.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LRPO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F168d22fb-c14a-4161-8292-ba3b773514af_612x381.png 424w, https://substackcdn.com/image/fetch/$s_!LRPO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F168d22fb-c14a-4161-8292-ba3b773514af_612x381.png 848w, https://substackcdn.com/image/fetch/$s_!LRPO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F168d22fb-c14a-4161-8292-ba3b773514af_612x381.png 1272w, https://substackcdn.com/image/fetch/$s_!LRPO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F168d22fb-c14a-4161-8292-ba3b773514af_612x381.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LRPO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F168d22fb-c14a-4161-8292-ba3b773514af_612x381.png" width="612" height="381" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/168d22fb-c14a-4161-8292-ba3b773514af_612x381.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:381,&quot;width&quot;:612,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:10289,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/181494472?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F168d22fb-c14a-4161-8292-ba3b773514af_612x381.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!LRPO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F168d22fb-c14a-4161-8292-ba3b773514af_612x381.png 424w, https://substackcdn.com/image/fetch/$s_!LRPO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F168d22fb-c14a-4161-8292-ba3b773514af_612x381.png 848w, https://substackcdn.com/image/fetch/$s_!LRPO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F168d22fb-c14a-4161-8292-ba3b773514af_612x381.png 1272w, https://substackcdn.com/image/fetch/$s_!LRPO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F168d22fb-c14a-4161-8292-ba3b773514af_612x381.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em><strong>Figure 4</strong> BERT masked language modeling. The model takes a sentence with some tokens replaced by a special mask symbol, processes the full sequence with unmasked self attention, and predicts the original values of the masked tokens at the output.</em></figcaption></figure></div><p>Now that we have compared transformers for text and images at a high level, we can focus on how a Vision Transformer actually sees an image. In the next section we will follow an image as it is cut into small patches, converted into vectors, and arranged into a sequence that looks very much like a sentence of tokens. 
This patch embedding stage is the first step that lets a standard transformer encoder operate directly on images.</p><h1>1.2 Adapting transformers to images: patch embeddings and flattening</h1><p>Adapting a transformer to images starts with a simple question: how can we turn a 2D image into the kind of 1D token sequence that a text transformer expects? Vision Transformers answer this by cutting the image into a grid of fixed size patches and treating each patch as a token. In this section we work through that patch embedding step using the 640 by 640 cat image and then show two common implementations: a flatten plus linear projection and an equivalent convolutional layer.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Cycd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe651445f-46d4-47fd-b601-d45500aff44d_1473x882.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Cycd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe651445f-46d4-47fd-b601-d45500aff44d_1473x882.png 424w, https://substackcdn.com/image/fetch/$s_!Cycd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe651445f-46d4-47fd-b601-d45500aff44d_1473x882.png 848w, https://substackcdn.com/image/fetch/$s_!Cycd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe651445f-46d4-47fd-b601-d45500aff44d_1473x882.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Cycd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe651445f-46d4-47fd-b601-d45500aff44d_1473x882.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Cycd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe651445f-46d4-47fd-b601-d45500aff44d_1473x882.png" width="1456" height="872" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e651445f-46d4-47fd-b601-d45500aff44d_1473x882.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:872,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:133065,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/181494472?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe651445f-46d4-47fd-b601-d45500aff44d_1473x882.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Cycd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe651445f-46d4-47fd-b601-d45500aff44d_1473x882.png 424w, https://substackcdn.com/image/fetch/$s_!Cycd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe651445f-46d4-47fd-b601-d45500aff44d_1473x882.png 848w, 
https://substackcdn.com/image/fetch/$s_!Cycd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe651445f-46d4-47fd-b601-d45500aff44d_1473x882.png 1272w, https://substackcdn.com/image/fetch/$s_!Cycd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe651445f-46d4-47fd-b601-d45500aff44d_1473x882.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em><strong>Figure 5</strong> Patchifying a 640 by 640 image. 
The image is partitioned into a 4 by 4 grid of non-overlapping 160 by 160 patches, numbered from 1 to 16. These 16 patches will become 16 tokens for the transformer, and a separate class token will later be prepended to form a sequence of 17 tokens.</em></figcaption></figure></div><p>For the cat example, suppose the image has height and width 640 pixels and three color channels. We choose square patches of size 160 pixels. The image is divided into a 4 by 4 grid of non-overlapping patches, each 160 by 160 by 3. In general, for an image of height H, width W, and patch size P, the number of patches N is</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;N \\;=\\; \\frac{H}{P} \\times \\frac{W}{P}\n    \\;=\\; \\frac{H\\,W}{P^{2}},\n&quot;,&quot;id&quot;:&quot;HSXCTYQQCM&quot;}" data-component-name="LatexBlockToDOM"></div><p>assuming H and W are multiples of P. For a square image of side S this reduces to</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;N \\;=\\; \\left(\\frac{S}{P}\\right)^{2}.\n&quot;,&quot;id&quot;:&quot;WNWTMOOWNK&quot;}" data-component-name="LatexBlockToDOM"></div><p>With H = W = 640 and P = 160 we obtain N = (640 / 160)&#178; = 4&#178; = 16 patches.</p><p>Once the image has been split into patches, each patch must be converted into a vector that lives in a common embedding space of dimension D, just as word tokens are embedded in text models. One straightforward approach, used in the original Vision Transformer paper, is to flatten each patch and apply a linear projection. A single patch has shape P by P by C, where C is the number of channels. Flattening gives a vector of length P&#178;C.</p><p>In the transformer it is convenient to line these patches up in a fixed order and think of them as a one-dimensional sequence rather than a two-dimensional grid. 
For instance, we might start from the top-left patch, then move left to right, row by row, until we reach the bottom-right patch.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4Ge8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58b8a5fb-ecbd-47b7-a206-71023f4995bb_1311x246.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4Ge8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58b8a5fb-ecbd-47b7-a206-71023f4995bb_1311x246.png 424w, https://substackcdn.com/image/fetch/$s_!4Ge8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58b8a5fb-ecbd-47b7-a206-71023f4995bb_1311x246.png 848w, https://substackcdn.com/image/fetch/$s_!4Ge8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58b8a5fb-ecbd-47b7-a206-71023f4995bb_1311x246.png 1272w, https://substackcdn.com/image/fetch/$s_!4Ge8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58b8a5fb-ecbd-47b7-a206-71023f4995bb_1311x246.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4Ge8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58b8a5fb-ecbd-47b7-a206-71023f4995bb_1311x246.png" width="1311" height="246" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/58b8a5fb-ecbd-47b7-a206-71023f4995bb_1311x246.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:246,&quot;width&quot;:1311,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:56475,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/181494472?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58b8a5fb-ecbd-47b7-a206-71023f4995bb_1311x246.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!4Ge8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58b8a5fb-ecbd-47b7-a206-71023f4995bb_1311x246.png 424w, https://substackcdn.com/image/fetch/$s_!4Ge8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58b8a5fb-ecbd-47b7-a206-71023f4995bb_1311x246.png 848w, https://substackcdn.com/image/fetch/$s_!4Ge8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58b8a5fb-ecbd-47b7-a206-71023f4995bb_1311x246.png 1272w, https://substackcdn.com/image/fetch/$s_!4Ge8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58b8a5fb-ecbd-47b7-a206-71023f4995bb_1311x246.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption"><em><strong>Figure 6.</strong> The sixteen patches are arranged in a sequence from left to right. 
Each small tile is still an image patch, but from the transformer&#8217;s point of view this will soon become a sequence of sixteen tokens.</em></figcaption></figure></div><p><strong>Patch embeddings without convolution</strong></p><p>In the first subsection we will build patch embeddings <strong>without</strong> any convolutional layers. We will treat each 160&#215;160&#215;3 patch as a tiny image, flatten all of its pixels into a long vector, and pass that vector through a learnable linear layer to obtain a D-dimensional patch embedding. This view mirrors word embeddings in language models: each patch is simply another token whose embedding is learned directly from data.</p><p>To keep things concrete, return to the 640&#215;640 cat image from the previous section. We divide it into a 4&#215;4 grid of non-overlapping patches, each of size 160&#215;160 pixels.</p><p>Each of these sixteen patches will become one token for the transformer.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!e_3c!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F039e82e3-4af1-4d2e-aa0d-f537a17d4449_1674x774.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!e_3c!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F039e82e3-4af1-4d2e-aa0d-f537a17d4449_1674x774.png 424w, https://substackcdn.com/image/fetch/$s_!e_3c!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F039e82e3-4af1-4d2e-aa0d-f537a17d4449_1674x774.png 848w, 
https://substackcdn.com/image/fetch/$s_!e_3c!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F039e82e3-4af1-4d2e-aa0d-f537a17d4449_1674x774.png 1272w, https://substackcdn.com/image/fetch/$s_!e_3c!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F039e82e3-4af1-4d2e-aa0d-f537a17d4449_1674x774.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!e_3c!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F039e82e3-4af1-4d2e-aa0d-f537a17d4449_1674x774.png" width="1456" height="673" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/039e82e3-4af1-4d2e-aa0d-f537a17d4449_1674x774.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:673,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:155379,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/181494472?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F039e82e3-4af1-4d2e-aa0d-f537a17d4449_1674x774.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!e_3c!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F039e82e3-4af1-4d2e-aa0d-f537a17d4449_1674x774.png 424w, 
https://substackcdn.com/image/fetch/$s_!e_3c!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F039e82e3-4af1-4d2e-aa0d-f537a17d4449_1674x774.png 848w, https://substackcdn.com/image/fetch/$s_!e_3c!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F039e82e3-4af1-4d2e-aa0d-f537a17d4449_1674x774.png 1272w, https://substackcdn.com/image/fetch/$s_!e_3c!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F039e82e3-4af1-4d2e-aa0d-f537a17d4449_1674x774.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><strong>Figure 7.</strong> A single 160&#215;160 patch (patch 10) from a 640&#215;640 RGB image. The patch has shape 160&#215;160&#215;3. We zoom into the channels and write out the red, green, and blue values for each pixel. All these pixel values are then flattened into one long vector of length 160&#215;160&#215;3, which will be mapped to a patch embedding.</figcaption></figure></div><p>Consider patch 10, the square around the cat&#8217;s eye. As an RGB patch it is a small 3-D tensor with shape</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;P \\times P \\times C = 160 \\times 160 \\times 3&quot;,&quot;id&quot;:&quot;PSUDIHPLKT&quot;}" data-component-name="LatexBlockToDOM"></div><p>where C=3 is the number of color channels. Figure 7 shows this patch split into its red, green, and blue planes, and then shows a few individual pixels as (R,G,B) triplets. The first step of the non-convolutional patch embedding is to flatten this tensor into a single vector by concatenating all pixel values from all channels in a fixed order. If we denote the flattened version of the <em>i-th patch</em> by <em>flat_patch_i</em>, then it is a vector with P&#178;C entries:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{flat\\_patch}_i \\in \\mathbb{R}^{P^{2} C}.\n&quot;,&quot;id&quot;:&quot;UWGDKBLNOM&quot;}" data-component-name="LatexBlockToDOM"></div><p>For a 160&#215;160 RGB patch this means</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;P^{2}C = 160 \\times 160 \\times 3 = 76{,}800\n&quot;,&quot;id&quot;:&quot;OXEFYVPARX&quot;}" data-component-name="LatexBlockToDOM"></div><p>numbers per patch. 
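</p><p>The patch count and the flattened length can be checked with a short sketch. This is a minimal pure-Python illustration, not code from the original post: the helper names <em>patch_grid</em>, <em>flat_patch_length</em>, and <em>patchify</em> are hypothetical, and the pixel order inside a patch (row by row, channel values last) is an assumption, since the text only asks for some fixed order. The toy image at the end uses 2 by 2 patches purely to keep the data small.</p><pre>
```python
def patch_grid(height, width, patch):
    """Return (rows, cols, num_patches) for non-overlapping square patches."""
    if height % patch or width % patch:
        raise ValueError("height and width must be multiples of patch")
    rows, cols = height // patch, width // patch
    return rows, cols, rows * cols


def flat_patch_length(patch, channels):
    """Length of one flattened patch: P * P * C."""
    return patch * patch * channels


def patchify(image, patch):
    """Split image[y][x] (a tuple of channel values) into flattened patches.

    Patches are emitted left to right, row by row. Inside a patch, pixels
    are read in the same row-major order (an assumed convention).
    """
    height, width = len(image), len(image[0])
    rows, cols, _ = patch_grid(height, width, patch)
    patches = []
    for pr in range(rows):
        for pc in range(cols):
            flat = []
            for y in range(pr * patch, (pr + 1) * patch):
                for x in range(pc * patch, (pc + 1) * patch):
                    flat.extend(image[y][x])
            patches.append(flat)
    return patches


print(patch_grid(640, 640, 160))   # (4, 4, 16)
print(flat_patch_length(160, 3))   # 76800

# Toy 4 by 4 RGB image where every pixel stores (y, x, 0), cut into 2 by 2 patches.
toy = [[(y, x, 0) for x in range(4)] for y in range(4)]
toy_patches = patchify(toy, 2)
print(len(toy_patches), len(toy_patches[0]))   # 4 12
```
</pre><p>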
Flattening does not learn anything; it is just a reshaping operation that turns a 160&#215;160&#215;3 block of pixels into a vector of length 76,800.</p><p>A transformer, however, is not fed raw pixel vectors of length 76,800. It expects a much shorter embedding vector of some dimension D, such as D=32 in our toy diagrams or D=768 in a ViT-Base model. The simplest way to obtain it is to apply a shared linear layer to every flattened patch. We introduce a weight matrix W_patch and a bias vector b_patch and define the patch embedding for the <em>i-th patch</em> as</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathbf{x}_i = W_{\\text{patch}} \\,\\text{flat_patch}_i + \\mathbf{b}_{\\text{patch}},&quot;,&quot;id&quot;:&quot;HFQZFHTZBS&quot;}" data-component-name="LatexBlockToDOM"></div><p>with</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;W_{\\text{patch}} \\in \\mathbb{R}^{D \\times P^{2}C}, \\quad\n\\mathbf{b}_{\\text{patch}} \\in \\mathbb{R}^{D}.&quot;,&quot;id&quot;:&quot;RRJWVVRWAG&quot;}" data-component-name="LatexBlockToDOM"></div><p>The dimensions line up in the natural way: <em>W_patch</em> takes a <em>length-P^2C</em> vector and maps it down to a <em>length-D vector</em>, while <em>b_patch</em> shifts the result. In the cat example, if we choose <em>D=32</em>, then <em>W_patch</em> has shape 32&#215;76,800 and <em>b_patch</em> has length 32. The output</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathbf{x}_i \\in \\mathbb{R}^{D}\n&quot;,&quot;id&quot;:&quot;ZZSDXJMHAO&quot;}" data-component-name="LatexBlockToDOM"></div><p>is a <em>D-dimensional</em> <strong>patch embedding</strong>. The same parameters W_patch and b_patch are reused for every patch in every image, so they are heavily shared and trained end-to-end with the rest of the model. 
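</p><p>The shared projection can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the flattened patches and the parameters are random placeholders (in a real model W_patch and b_patch would be learned), and the names are chosen to mirror the symbols in the text.</p>

```python
import numpy as np

P, C, D, N = 160, 3, 32, 16
rng = np.random.default_rng(0)

# Stand-in for N flattened patches, each of length P*P*C = 76,800 (assumption:
# random values instead of real pixels).
flat_patches = rng.standard_normal((N, P * P * C))

# Shared parameters; random here, learned end-to-end in a real model.
W_patch = rng.standard_normal((D, P * P * C)) * 0.01
b_patch = np.zeros(D)

# x_i = W_patch @ flat_patch_i + b_patch, applied to all N patches at once.
X_patch = flat_patches @ W_patch.T + b_patch

print(W_patch.shape)  # (32, 76800)
print(X_patch.shape)  # (16, 32)
```

<p>With these toy sizes the projection alone holds 32 &#215; 76,800 + 32 = 2,457,632 parameters, all of them shared across the 16 patches.</p><p>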
You can think of each row of W_patch as a learned template that looks at the entire patch and responds with a single number; stacking D such responses gives the embedding vector.</p><p>At this point we have turned one patch into one token. Repeating the same flatten-and-project operation for all N patches yields a collection of patch embeddings</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathbf{x}_1, \\mathbf{x}_2, \\ldots, \\mathbf{x}_N \\in \\mathbb{R}^{D}.\n&quot;,&quot;id&quot;:&quot;SRGXGGTCQC&quot;}" data-component-name="LatexBlockToDOM"></div><p>We arrange them in a fixed, deterministic order, for example row by row across the image, and stack them into a matrix</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathbf{X}_{\\text{patch}}\n=\n\\begin{bmatrix}\n\\mathbf{x}_1 \\\\\n\\mathbf{x}_2 \\\\\n\\vdots \\\\\n\\mathbf{x}_N\n\\end{bmatrix}\n\\in \\mathbb{R}^{N \\times D}.\n&quot;,&quot;id&quot;:&quot;YEIFZGFDWQ&quot;}" data-component-name="LatexBlockToDOM"></div><p>This matrix is the direct analogue of the token-embedding matrix in a text transformer. The only difference is that each row now encodes an entire 160&#215;160 coloured region of the image instead of a word or subword.</p><p>It is helpful to summarise the whole pipeline in terms of shapes. Before embedding, the model sees N raw patches, each of shape <em>P&#215;P&#215;C</em>. After flattening we conceptually have N vectors of length <em>P^2C</em>. After the linear projection we have N patch embeddings of length D. Once we add the special class token and positional embeddings in the following sections, this becomes an <em>(N+1)&#215;D matrix</em> that is finally presented to the transformer encoder.</p><p><strong>Patch embeddings with convolution</strong></p><p>In this subsection we will see that a single Conv2D layer can perform the same job in one shot. 
By choosing the kernel size and stride to match the patch size, a convolution turns the 640&#215;640 image into a 4&#215;4 feature map whose channel dimension is exactly our embedding dimension (for example, 32 channels). Each spatial location in this feature map is then interpreted as one patch token.</p><p>The idea is that a convolution with kernel size P and stride P can visit each patch exactly once, compress it, and write the output into a grid of size <em>(H/P,W/P)</em>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EaP8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ca2b618-b959-424c-8bdc-2e4cff8b4d0e_1725x1098.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EaP8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ca2b618-b959-424c-8bdc-2e4cff8b4d0e_1725x1098.png 424w, https://substackcdn.com/image/fetch/$s_!EaP8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ca2b618-b959-424c-8bdc-2e4cff8b4d0e_1725x1098.png 848w, https://substackcdn.com/image/fetch/$s_!EaP8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ca2b618-b959-424c-8bdc-2e4cff8b4d0e_1725x1098.png 1272w, https://substackcdn.com/image/fetch/$s_!EaP8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ca2b618-b959-424c-8bdc-2e4cff8b4d0e_1725x1098.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!EaP8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ca2b618-b959-424c-8bdc-2e4cff8b4d0e_1725x1098.png" width="1456" height="927" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6ca2b618-b959-424c-8bdc-2e4cff8b4d0e_1725x1098.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:927,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:78809,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/181494472?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ca2b618-b959-424c-8bdc-2e4cff8b4d0e_1725x1098.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!EaP8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ca2b618-b959-424c-8bdc-2e4cff8b4d0e_1725x1098.png 424w, https://substackcdn.com/image/fetch/$s_!EaP8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ca2b618-b959-424c-8bdc-2e4cff8b4d0e_1725x1098.png 848w, https://substackcdn.com/image/fetch/$s_!EaP8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ca2b618-b959-424c-8bdc-2e4cff8b4d0e_1725x1098.png 1272w, https://substackcdn.com/image/fetch/$s_!EaP8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ca2b618-b959-424c-8bdc-2e4cff8b4d0e_1725x1098.png 
1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em><strong>Figure 8.</strong> Patch embedding as a convolution. The input is a 640&#215;640 three-channel image. Thirty-two learnable kernels of shape 3&#215;160&#215;160 are applied with stride 160. The result is a 4&#215;4 grid with 32 output channels. Each cell in this grid is a 32-dimensional patch embedding corresponding to one image patch. The 32 output channels define the embedding dimension D.</em></figcaption></figure></div><p>In our running example we use kernels of spatial size 160&#215;160 and stride 160. The input is a tensor of shape (3,640,640). 
We apply a convolution layer whose number of input channels is C; for an RGB image we set C=3. The number of output channels is chosen to be our desired embedding dimension, for example D=32. Both the kernel size and the stride are set to 160, which means the kernel slides over the 640&#215;640 image in non-overlapping 160&#215;160 steps, and we use no padding (padding = 0), so the image is neatly tiled into patches without any extra border pixels being added.</p><p>The convolution layer contains D separate kernels. Each kernel is a learnable weight tensor of shape (C,P,P)=(3,160,160). When the layer processes the image, each kernel slides over the image in steps of 160 pixels, producing one response per patch. Because the stride equals the kernel size, there is no overlap between neighbouring receptive fields. After the convolution, the output tensor has shape</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;(D, \\tfrac{H}{P}, \\tfrac{W}{P}) = (32, 4, 4)&quot;,&quot;id&quot;:&quot;APAIBDQJVI&quot;}" data-component-name="LatexBlockToDOM"></div><p>Each spatial location (u,v) in this 4&#215;4 grid corresponds to one patch of the original image. The vector of 32 numbers at that location comes from the 32 kernels, which together play the role of the rows of the linear matrix W_patch in the previous subsection. If we flatten the 4&#215;4 grid of spatial locations into a length-16 sequence and read out the 32-dimensional vector at each location, we obtain the same set of patch embeddings as before, provided each kernel holds the same weights as the corresponding row of W_patch.</p><p>Taken together, these two constructions show that Vision Transformers do not depend on any particular patch-extraction trick: what matters is ending up with a sequence of N patch embeddings of dimension D. 
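</p><p>The claim that the two constructions agree can be checked numerically. The sketch below is a hand-written NumPy version of a stride-P convolution, not a call into any deep-learning library, and it uses deliberately small toy sizes (a 16&#215;16 image, P=4, D=8) that are assumptions made only so the loops run quickly. The same weights are then reused as the matrix W_patch of the flatten-and-project construction.</p>

```python
import numpy as np

# Small toy sizes (assumption) so the explicit loops stay fast.
H, W, C, P, D = 16, 16, 3, 4, 8
rng = np.random.default_rng(0)
image = rng.standard_normal((C, H, W))        # channels-first, like (3, 640, 640)
kernels = rng.standard_normal((D, C, P, P))   # D kernels of shape (C, P, P)
bias = rng.standard_normal(D)

# Strided convolution: kernel size P, stride P, no padding.
gh, gw = H // P, W // P
conv_out = np.empty((D, gh, gw))
for u in range(gh):
    for v in range(gw):
        window = image[:, u * P:(u + 1) * P, v * P:(v + 1) * P]  # (C, P, P)
        # Each kernel is multiplied elementwise with the window and summed.
        conv_out[:, u, v] = np.tensordot(kernels, window, axes=3) + bias

# Flatten-and-project: the same weights viewed as a D x (C*P*P) matrix.
W_patch = kernels.reshape(D, -1)
flat = (
    image.reshape(C, gh, P, gw, P)
         .transpose(1, 3, 0, 2, 4)   # (gh, gw, C, P, P)
         .reshape(gh * gw, -1)       # N x (C*P*P), patches in row-by-row order
)
linear_out = flat @ W_patch.T + bias  # N x D

# Reading the conv grid row by row gives the same N x D matrix.
conv_tokens = conv_out.reshape(D, gh * gw).T
print(conv_tokens.shape)                     # (16, 8)
print(np.allclose(conv_tokens, linear_out))  # True
```

<p>The match depends on flattening each patch in the same (C, P, P) order in which the kernel weights are laid out; a different flattening order would correspond to a permuted W_patch.</p><p>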
Whether those embeddings come from flattening plus a linear layer or from a carefully configured convolution is largely an implementation choice; once we have the N&#215;D matrix of patch tokens, the rest of the Vision Transformer proceeds in exactly the same way.</p><p><strong>Adding the class token and forming the sequence</strong></p><p>So far we have obtained N patch embeddings <em>x_1,&#8230;,x_N</em>, each of dimension D. For classification tasks the Vision Transformer introduces one extra token, called the <strong>class token</strong>. This token does not come from any particular patch; it is a learned vector that is prepended to the sequence and is meant to gather information from all other tokens through self-attention.</p><p>We denote the class-token embedding by</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathbf{x}_0 \\in \\mathbb{R}^D.\n&quot;,&quot;id&quot;:&quot;BJQTYWRUUK&quot;}" data-component-name="LatexBlockToDOM"></div><p>This vector is a trainable parameter, initialized randomly when we create the model and optimized along with all other weights. 
Once we have <em>x_0</em> and the patch embeddings <em>x_1,&#8230;,x_N</em>, we can form the full sequence matrix</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;X =\n\\begin{bmatrix}\n\\mathbf{x}_0 \\\\\n\\mathbf{x}_1 \\\\\n\\vdots \\\\\n\\mathbf{x}_N\n\\end{bmatrix}\n\\in \\mathbb{R}^{(N+1)\\times D}.\n\n&quot;,&quot;id&quot;:&quot;WOZCMLVMVF&quot;}" data-component-name="LatexBlockToDOM"></div><p>The number of tokens entering the transformer encoder is therefore</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{NumTokens} = N + 1 = 1 + \\frac{HW}{P^2}.\n&quot;,&quot;id&quot;:&quot;TSQUZWGVOI&quot;}" data-component-name="LatexBlockToDOM"></div><p>In the 640&#215;640 example with patch size 160, this gives N = 16 patch tokens plus one class token, so there are 17 tokens in total, each of dimension D=32.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vTy2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdcde9101-b851-4b06-8394-7afda0879b19_1506x849.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vTy2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdcde9101-b851-4b06-8394-7afda0879b19_1506x849.png 424w, https://substackcdn.com/image/fetch/$s_!vTy2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdcde9101-b851-4b06-8394-7afda0879b19_1506x849.png 848w, https://substackcdn.com/image/fetch/$s_!vTy2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdcde9101-b851-4b06-8394-7afda0879b19_1506x849.png 1272w, 
https://substackcdn.com/image/fetch/$s_!vTy2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdcde9101-b851-4b06-8394-7afda0879b19_1506x849.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vTy2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdcde9101-b851-4b06-8394-7afda0879b19_1506x849.png" width="1456" height="821" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dcde9101-b851-4b06-8394-7afda0879b19_1506x849.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:821,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:89948,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/181494472?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdcde9101-b851-4b06-8394-7afda0879b19_1506x849.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!vTy2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdcde9101-b851-4b06-8394-7afda0879b19_1506x849.png 424w, https://substackcdn.com/image/fetch/$s_!vTy2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdcde9101-b851-4b06-8394-7afda0879b19_1506x849.png 848w, 
https://substackcdn.com/image/fetch/$s_!vTy2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdcde9101-b851-4b06-8394-7afda0879b19_1506x849.png 1272w, https://substackcdn.com/image/fetch/$s_!vTy2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdcde9101-b851-4b06-8394-7afda0879b19_1506x849.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em><strong>Figure 9.</strong> The sixteen image patches are converted into 32-dimensional embeddings by a shared 
linear projection or convolution. An extra learnable class embedding x_0 is prepended to the sequence. The result is a matrix</em></figcaption></figure></div><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;X \\in \\mathbb{R}^{17 \\times 32}\n&quot;,&quot;id&quot;:&quot;GTJFRVSTOJ&quot;}" data-component-name="LatexBlockToDOM"></div><p><em>that will become the input to the Vision Transformer, once we add positional information.</em></p><h1>1.3 <strong>Positional encodings in Vision Transformers</strong></h1><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YLTX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51bf4c1d-cd6d-4fa6-b650-eb648a71e051_798x411.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!YLTX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51bf4c1d-cd6d-4fa6-b650-eb648a71e051_798x411.png 424w, https://substackcdn.com/image/fetch/$s_!YLTX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51bf4c1d-cd6d-4fa6-b650-eb648a71e051_798x411.png 848w, https://substackcdn.com/image/fetch/$s_!YLTX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51bf4c1d-cd6d-4fa6-b650-eb648a71e051_798x411.png 1272w, https://substackcdn.com/image/fetch/$s_!YLTX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51bf4c1d-cd6d-4fa6-b650-eb648a71e051_798x411.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!YLTX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51bf4c1d-cd6d-4fa6-b650-eb648a71e051_798x411.png" width="798" height="411" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/51bf4c1d-cd6d-4fa6-b650-eb648a71e051_798x411.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:411,&quot;width&quot;:798,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:90997,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/181494472?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51bf4c1d-cd6d-4fa6-b650-eb648a71e051_798x411.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!YLTX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51bf4c1d-cd6d-4fa6-b650-eb648a71e051_798x411.png 424w, https://substackcdn.com/image/fetch/$s_!YLTX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51bf4c1d-cd6d-4fa6-b650-eb648a71e051_798x411.png 848w, https://substackcdn.com/image/fetch/$s_!YLTX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51bf4c1d-cd6d-4fa6-b650-eb648a71e051_798x411.png 1272w, https://substackcdn.com/image/fetch/$s_!YLTX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51bf4c1d-cd6d-4fa6-b650-eb648a71e051_798x411.png 1456w" sizes="100vw" 
loading="lazy"></picture></div></a><figcaption class="image-caption"><em><strong>Figure 10.</strong> Left: without positional embeddings, the 16 cat patches can be fed to the model in any order, so spatial structure is lost. Right: with positional embeddings, patches are tied back to their grid locations, allowing the model to understand where each piece of the cat belongs.</em></figcaption></figure></div><p>Self-attention in a Vision Transformer has no built-in notion of order. 
In the left panel of Figure 10, we could shuffle the cat patches so that a tile from the ear region swaps places with a tile from the plain background, and the encoder would happily process this jumbled sequence as if nothing were wrong. For images this is clearly problematic: a patch showing the cat&#8217;s eye carries very different meaning from a patch showing only empty purple background, and where each of them sits in the image matters. To give the model a sense of where each token comes from in the original cat image grid, we add a positional embedding to every token before it enters the transformer encoder.</p><p>So after patch embedding and adding the class token we have a sequence of <em>N+1 tokens</em>, each of dimension D. We collect them in a matrix X, where x_0 is the class token and <em>x_1,&#8230;,x_N</em> are the patch embeddings. The Vision Transformer introduces a learnable positional embedding matrix</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;P =\n\\begin{bmatrix}\np_0 \\\\\np_1 \\\\\n\\vdots \\\\\np_N\n\\end{bmatrix}\n\\in \\mathbb{R}^{(N+1)\\times D},\n&quot;,&quot;id&quot;:&quot;ADAQTADXPV&quot;}" data-component-name="LatexBlockToDOM"></div><p>where each row <em>p_i</em> is a trainable vector that represents the position of <em>token i</em> (the matrix P here is unrelated to the patch size P used earlier). During training these vectors are updated like any other parameter in the model. For a mini-batch of size B we broadcast the same positional matrix over the batch and form the final input to the encoder by simple elementwise addition</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;E = X + P, \\qquad\nE_i = x_i + p_i \\in \\mathbb{R}^D, \\quad i = 0, \\ldots, N.\n\n\n&quot;,&quot;id&quot;:&quot;SLAMOBPBUE&quot;}" data-component-name="LatexBlockToDOM"></div><p>The tensor sent into the transformer encoder therefore has shape <em>B&#215;(N+1)&#215;D</em>. 
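</p><p>The two steps of this section, prepending the class token and adding the positional matrix, can be sketched together in NumPy. All arrays below are random placeholders for what would be learned parameters or real patch embeddings, and the names cls_token and pos_embed are illustrative assumptions rather than any library&#8217;s API.</p>

```python
import numpy as np

B, N, D = 2, 16, 32
rng = np.random.default_rng(0)

X_patch = rng.standard_normal((B, N, D))        # patch embeddings for a batch
cls_token = rng.standard_normal((1, 1, D))      # x_0, a learned parameter
pos_embed = rng.standard_normal((1, N + 1, D))  # rows p_0 ... p_N, learned

# Prepend the same class token to every sequence in the batch.
cls = np.broadcast_to(cls_token, (B, 1, D))
X = np.concatenate([cls, X_patch], axis=1)      # (B, N + 1, D)

# Broadcast the positional matrix over the batch and add elementwise.
E = X + pos_embed

print(X.shape)  # (2, 17, 32)
print(E.shape)  # (2, 17, 32)
```

<p>Note that token index i in the sequence corresponds to patch i-1 in X_patch, because position 0 is taken by the class token.</p><p>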
The sequence length and embedding size are unchanged, but each token now carries two kinds of information at once: the visual content of its patch and the location of that patch in the original grid.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TMct!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe173fa25-d0de-4169-b3de-c54cbb08ec4a_1248x861.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TMct!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe173fa25-d0de-4169-b3de-c54cbb08ec4a_1248x861.png 424w, https://substackcdn.com/image/fetch/$s_!TMct!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe173fa25-d0de-4169-b3de-c54cbb08ec4a_1248x861.png 848w, https://substackcdn.com/image/fetch/$s_!TMct!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe173fa25-d0de-4169-b3de-c54cbb08ec4a_1248x861.png 1272w, https://substackcdn.com/image/fetch/$s_!TMct!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe173fa25-d0de-4169-b3de-c54cbb08ec4a_1248x861.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TMct!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe173fa25-d0de-4169-b3de-c54cbb08ec4a_1248x861.png" width="1248" height="861" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e173fa25-d0de-4169-b3de-c54cbb08ec4a_1248x861.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:861,&quot;width&quot;:1248,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:44311,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/181494472?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe173fa25-d0de-4169-b3de-c54cbb08ec4a_1248x861.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!TMct!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe173fa25-d0de-4169-b3de-c54cbb08ec4a_1248x861.png 424w, https://substackcdn.com/image/fetch/$s_!TMct!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe173fa25-d0de-4169-b3de-c54cbb08ec4a_1248x861.png 848w, https://substackcdn.com/image/fetch/$s_!TMct!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe173fa25-d0de-4169-b3de-c54cbb08ec4a_1248x861.png 1272w, https://substackcdn.com/image/fetch/$s_!TMct!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe173fa25-d0de-4169-b3de-c54cbb08ec4a_1248x861.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em><strong>Figure 11.</strong></em> Learned positional embeddings in a Vision Transformer. Each patch embedding <em>x_i</em>, as well as the class token <em>x_0</em>, is paired with a learned positional vector <em>p_i</em>. Adding them produces a new token <em>E_i = x_i + p_i</em> that encodes both what is in the patch and where it comes from. The sequence length stays at <em>N + 1 = 17</em> and the embedding dimension remains <em>D = 32</em>, but the model now has access to the spatial structure of the image.</figcaption></figure></div><h1>1.4 <strong>Encoder-only structure for classification</strong></h1><p>In a Vision Transformer we only keep the encoder side of the original transformer architecture. 
The image is converted into a sequence of tokens and these tokens pass through a stack of <em>L</em> identical encoder blocks. Each block contains multi-head self-attention, a feed-forward MLP, and residual connections with layer normalization, but there is no decoder that predicts future tokens. Instead, we prepend a single learnable class token to the sequence and treat the encoder as a feature extractor. After the last encoder block we read out only the final hidden state of this class token and feed it into a small MLP head that produces the class logits for the image. In this sense a ViT is an encoder-only model trained for classification.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1YR2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9d55410-0e7f-4364-a3c6-062af9656971_1443x975.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1YR2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9d55410-0e7f-4364-a3c6-062af9656971_1443x975.png 424w, https://substackcdn.com/image/fetch/$s_!1YR2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9d55410-0e7f-4364-a3c6-062af9656971_1443x975.png 848w, https://substackcdn.com/image/fetch/$s_!1YR2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9d55410-0e7f-4364-a3c6-062af9656971_1443x975.png 1272w, https://substackcdn.com/image/fetch/$s_!1YR2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9d55410-0e7f-4364-a3c6-062af9656971_1443x975.png 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1YR2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9d55410-0e7f-4364-a3c6-062af9656971_1443x975.png" width="1443" height="975" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e9d55410-0e7f-4364-a3c6-062af9656971_1443x975.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:975,&quot;width&quot;:1443,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:69635,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/181494472?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9d55410-0e7f-4364-a3c6-062af9656971_1443x975.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!1YR2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9d55410-0e7f-4364-a3c6-062af9656971_1443x975.png 424w, https://substackcdn.com/image/fetch/$s_!1YR2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9d55410-0e7f-4364-a3c6-062af9656971_1443x975.png 848w, https://substackcdn.com/image/fetch/$s_!1YR2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9d55410-0e7f-4364-a3c6-062af9656971_1443x975.png 1272w, 
https://substackcdn.com/image/fetch/$s_!1YR2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe9d55410-0e7f-4364-a3c6-062af9656971_1443x975.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><strong>Figure 12.</strong> Vision Transformer as an encoder-only model. 
The sequence of patch embeddings with positional information is processed by a stack of L encoder blocks, and the final context vector of the class token becomes the input to an MLP classification head.</figcaption></figure></div><p><strong>The entire path from patch embeddings to context vectors</strong></p><p>From the previous sections we already have a matrix of embedded tokens that combines patch information and positional information. This matrix includes one extra token for classification, so the total sequence length is <em>N+1</em>. Each token has embedding dimension <em>D</em>. If we stack all token vectors row-wise we obtain a matrix</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;E \\in \\mathbb{R}^{(N+1)\\times D},&quot;,&quot;id&quot;:&quot;UZXSKTFWEY&quot;}" data-component-name="LatexBlockToDOM"></div><p>In our running example the image is split into <em>N = 16</em> patches, so with the class token we have <em>N+1 = 17</em> tokens. If we choose an embedding dimension <em>D=32</em>, the matrix <em>E</em> has shape <em>17&#215;32</em>. This matrix is the input to the transformer encoder stack, and from the encoder&#8217;s perspective it looks exactly like the token embeddings of a language model: a sequence of 17 tokens, each represented by a 32-dimensional vector.</p><p>The encoder does not change the sequence length. After each encoder block we still have a matrix of shape <em>(N+1)&#215;D</em>, but every row vector has been updated to incorporate information from all other tokens through self-attention and the MLP. These updated vectors are what we call context vectors, because they encode both the content of a token and the context supplied by the other tokens in the sequence.</p><p><strong>Transformer encoder and attention</strong></p><p>To understand what happens inside one encoder block it is helpful to zoom into the self-attention sublayer. 
At the input of a block we have a matrix</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathbf{Z}^{(j)} \\in \\mathbb{R}^{(N+1)\\times D},\n&quot;,&quot;id&quot;:&quot;OTAQZAHQOU&quot;}" data-component-name="LatexBlockToDOM"></div><p>where <em>j</em> denotes the depth of the block in the stack. The rows of this matrix are the current representations of the tokens. The self-attention sublayer transforms this matrix into a new matrix of the same shape by letting every token look at every other token and decide how much to attend to it.</p><p>The first step is to project the token representations into three new spaces called queries, keys and values. Concretely, we multiply <em>Z^(j)</em> by three learnable weight matrices</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;W_Q, W_K, W_V \\in \\mathbb{R}^{D \\times d_h},\n&quot;,&quot;id&quot;:&quot;CTJETITZLR&quot;}" data-component-name="LatexBlockToDOM"></div><p>where <em>d_h</em> is the head dimension for a single attention head. These weight matrices are shared across all positions in the sequence and are learned during training. 
Applying them gives three new matrices</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;Q = Z^{(j)} W_Q,\\qquad\nK = Z^{(j)} W_K,\\qquad\nV = Z^{(j)} W_V,\n&quot;,&quot;id&quot;:&quot;FEWCGKVGOS&quot;}" data-component-name="LatexBlockToDOM"></div><p>each of shape <em>(N+1)&#215;d_h</em>. Intuitively, the query vector <em>q_i</em> for token <em>i</em> encodes what that token is looking for in its context, the key vector <em>k_i</em> encodes what that token offers to others, and the value vector <em>v_i</em> encodes the actual information that will be blended into other tokens when they attend to it.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HxXF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce7074ba-bb55-4630-bb82-5483beed942e_1452x888.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HxXF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce7074ba-bb55-4630-bb82-5483beed942e_1452x888.png 424w, https://substackcdn.com/image/fetch/$s_!HxXF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce7074ba-bb55-4630-bb82-5483beed942e_1452x888.png 848w, https://substackcdn.com/image/fetch/$s_!HxXF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce7074ba-bb55-4630-bb82-5483beed942e_1452x888.png 1272w, https://substackcdn.com/image/fetch/$s_!HxXF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce7074ba-bb55-4630-bb82-5483beed942e_1452x888.png 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HxXF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce7074ba-bb55-4630-bb82-5483beed942e_1452x888.png" width="1452" height="888" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ce7074ba-bb55-4630-bb82-5483beed942e_1452x888.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:888,&quot;width&quot;:1452,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:88572,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/181494472?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce7074ba-bb55-4630-bb82-5483beed942e_1452x888.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!HxXF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce7074ba-bb55-4630-bb82-5483beed942e_1452x888.png 424w, https://substackcdn.com/image/fetch/$s_!HxXF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce7074ba-bb55-4630-bb82-5483beed942e_1452x888.png 848w, https://substackcdn.com/image/fetch/$s_!HxXF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce7074ba-bb55-4630-bb82-5483beed942e_1452x888.png 1272w, 
https://substackcdn.com/image/fetch/$s_!HxXF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce7074ba-bb55-4630-bb82-5483beed942e_1452x888.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em><strong>Figure 13</strong></em>. <em>Inside one attention head. The sequence of patch embeddings is linearly projected into three parallel sequences: keys k_i, queries q_i and values v_i. 
The weight matrices W_Q, W_K and W_V are learnable and shared across all tokens, so they determine what &#8220;questions&#8221; and &#8220;answers&#8221; the head focuses on.</em></figcaption></figure></div><p>The second step is to turn queries and keys into attention weights. For a given query vector <em>q_i</em> we compute its similarity with every key <em>k_j</em> using a scaled dot product, which produces a scalar score for each pair of positions</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;s_{ij} = \\frac{q_i k_j^{\\top}}{\\sqrt{d_h}}.\n&quot;,&quot;id&quot;:&quot;WLMELQVAST&quot;}" data-component-name="LatexBlockToDOM"></div><p>Applying the softmax function along the index <em>j</em> converts the scores for each query <em>i</em> into a probability distribution</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\alpha_{ij} = \\mathrm{softmax}_j (s_{ij}),\n&quot;,&quot;id&quot;:&quot;NBBAJVYEQD&quot;}" data-component-name="LatexBlockToDOM"></div><p>so that</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\alpha_{ij} \\ge 0 \\quad \\text{and} \\quad \\sum_j \\alpha_{ij} = 1.\n&quot;,&quot;id&quot;:&quot;HSWGNTDNTR&quot;}" data-component-name="LatexBlockToDOM"></div><p>The coefficient <em>&#945;_ij</em> can be read as <em>&#8220;how much token i attends to token j&#8221;</em>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!V0NG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85fae439-acea-48d0-a8e4-5c1d5d55ade6_1617x774.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!V0NG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85fae439-acea-48d0-a8e4-5c1d5d55ade6_1617x774.png 424w, https://substackcdn.com/image/fetch/$s_!V0NG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85fae439-acea-48d0-a8e4-5c1d5d55ade6_1617x774.png 848w, https://substackcdn.com/image/fetch/$s_!V0NG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85fae439-acea-48d0-a8e4-5c1d5d55ade6_1617x774.png 1272w, https://substackcdn.com/image/fetch/$s_!V0NG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85fae439-acea-48d0-a8e4-5c1d5d55ade6_1617x774.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!V0NG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85fae439-acea-48d0-a8e4-5c1d5d55ade6_1617x774.png" width="1456" height="697" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/85fae439-acea-48d0-a8e4-5c1d5d55ade6_1617x774.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:697,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:58258,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/181494472?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85fae439-acea-48d0-a8e4-5c1d5d55ade6_1617x774.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!V0NG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85fae439-acea-48d0-a8e4-5c1d5d55ade6_1617x774.png 424w, https://substackcdn.com/image/fetch/$s_!V0NG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85fae439-acea-48d0-a8e4-5c1d5d55ade6_1617x774.png 848w, https://substackcdn.com/image/fetch/$s_!V0NG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85fae439-acea-48d0-a8e4-5c1d5d55ade6_1617x774.png 1272w, https://substackcdn.com/image/fetch/$s_!V0NG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F85fae439-acea-48d0-a8e4-5c1d5d55ade6_1617x774.png 1456w" sizes="100vw" loading="lazy"></picture></div></a>
<figcaption class="image-caption"><em><strong>Figure 14.</strong> Scaled dot-product attention for one query. The query corresponding to a particular patch compares itself to all keys, producing relevance scores that are normalized by softmax. The resulting attention weights tell the model how strongly this patch should attend to each other patch in the image.</em></figcaption></figure></div><p>The third step is to use these attention weights to blend the value vectors. For <em>token i</em> we take a weighted sum of all values</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\tilde{z}_i = \\sum_{j=0}^{N} \\alpha_{ij} v_j .\n&quot;,&quot;id&quot;:&quot;WKWNOQNROU&quot;}" data-component-name="LatexBlockToDOM"></div><p>The vector</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\tilde{z}_i \\in \\mathbb{R}^{d_h}\n&quot;,&quot;id&quot;:&quot;UDOIEWDODT&quot;}" data-component-name="LatexBlockToDOM"></div><p>is the new representation of token <em>i</em> produced by this attention head. It contains a mixture of the value vectors of all tokens, with larger weights coming from positions that were judged more relevant by the scaled dot product. 
If a head learns to focus on the cat&#8217;s eye, for example, the value vectors from patches around the eye will receive larger coefficients when computing the context vector for the class token.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GEAz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F813cc116-a3a8-4786-89d3-176525971ba2_1110x393.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GEAz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F813cc116-a3a8-4786-89d3-176525971ba2_1110x393.png 424w, https://substackcdn.com/image/fetch/$s_!GEAz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F813cc116-a3a8-4786-89d3-176525971ba2_1110x393.png 848w, https://substackcdn.com/image/fetch/$s_!GEAz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F813cc116-a3a8-4786-89d3-176525971ba2_1110x393.png 1272w, https://substackcdn.com/image/fetch/$s_!GEAz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F813cc116-a3a8-4786-89d3-176525971ba2_1110x393.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GEAz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F813cc116-a3a8-4786-89d3-176525971ba2_1110x393.png" width="1110" height="393" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/813cc116-a3a8-4786-89d3-176525971ba2_1110x393.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:393,&quot;width&quot;:1110,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:31835,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/181494472?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F813cc116-a3a8-4786-89d3-176525971ba2_1110x393.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!GEAz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F813cc116-a3a8-4786-89d3-176525971ba2_1110x393.png 424w, https://substackcdn.com/image/fetch/$s_!GEAz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F813cc116-a3a8-4786-89d3-176525971ba2_1110x393.png 848w, https://substackcdn.com/image/fetch/$s_!GEAz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F813cc116-a3a8-4786-89d3-176525971ba2_1110x393.png 1272w, https://substackcdn.com/image/fetch/$s_!GEAz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F813cc116-a3a8-4786-89d3-176525971ba2_1110x393.png 1456w" sizes="100vw" loading="lazy"></picture></div></a>
<figcaption class="image-caption"><em><strong>Figure 15.</strong> From attention weights to context matrix. The matrix of attention weights, of shape (N+1)&#215;(N+1), multiplies the stacked value vectors V to produce a new matrix of token representations. Each row of the output is the weighted sum of all value vectors for one query position.</em></figcaption></figure></div><p>In practice a Vision Transformer uses multi-head attention rather than a single head. 
This means we repeat the procedure above several times in parallel, with a different set of projection matrices for each head</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;W_Q, W_K, W_V.&quot;,&quot;id&quot;:&quot;MZBSQRTMHC&quot;}" data-component-name="LatexBlockToDOM"></div><p>Each head produces vectors of dimension <em>d_h</em>, so after computing</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\tilde{z}_i&quot;,&quot;id&quot;:&quot;YQVUOAQACX&quot;}" data-component-name="LatexBlockToDOM"></div><p>for every head we concatenate the results and use another learned projection to return to the original embedding dimension <em>D</em>. This gives the attention sublayer output matrix of shape <em>(N+1)&#215;D</em>, which is then passed through the MLP sublayer and residual connections to form the updated matrix</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;Z^{(j+1)}.&quot;,&quot;id&quot;:&quot;ATHTFGQWEI&quot;}" data-component-name="LatexBlockToDOM"></div><p><strong>Context vectors and output dimensions</strong></p><p>To keep track of how representations evolve through the encoder stack, it is useful to introduce a simple notation. Let</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;Z^{(0)} = E \\in \\mathbb{R}^{(N+1)\\times D}\n&quot;,&quot;id&quot;:&quot;TJKXMPHNML&quot;}" data-component-name="LatexBlockToDOM"></div><p>be the initial matrix of token embeddings after patch embedding and positional embedding. 
After the <em>j</em>-th encoder block we write</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;Z^{(j)} =\n\\begin{bmatrix}\nz^{(j)}_0 \\\\\nz^{(j)}_1 \\\\\n\\vdots \\\\\nz^{(j)}_N\n\\end{bmatrix}\n\\in \\mathbb{R}^{(N+1)\\times D}, \\qquad\nj = 0, 1, \\ldots, L,\n&quot;,&quot;id&quot;:&quot;ASUWQYBFIA&quot;}" data-component-name="LatexBlockToDOM"></div><p>where</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;z_i^{(j)} \\in \\mathbb{R}^D\n&quot;,&quot;id&quot;:&quot;AEFRYOWKKY&quot;}" data-component-name="LatexBlockToDOM"></div><p>is the context vector for token <em>i</em> at depth <em>j</em>. The index <em>i</em> runs from <em>0</em> to <em>N</em>. When <em>i=0</em> the vector corresponds to the class token. When <em>i &#8805; 1</em> it corresponds to one of the image patches. Because the encoder stack never changes the sequence length, every <em>Z^(j)</em> has exactly the same shape: <em>(N+1)&#215;D</em>. Our 640&#215;640 cat image with <em>P=160</em> splits into <em>N = (640/160)&#178; = 16</em> patches, so with <em>D=32</em> each encoder block takes a <em>17&#215;32</em> matrix as input and produces another <em>17&#215;32</em> matrix as output.</p><p>After the final encoder block we obtain <em>Z^(L)</em>. The most important vector in this matrix is <em>z_0^(L)</em>, the last context vector of the class token. During training the model learns to aggregate information from all patch tokens into this vector through the repeated layers of self-attention and MLPs. As a result it acts as a compact summary of the entire image. We feed <em>z_0^(L)</em> into a small MLP head that maps the D-dimensional vector to a vector of class logits, for example of dimension 1000 for ImageNet-1k. 
A softmax over these logits then gives a probability distribution over classes.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!j18_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6cc1f58-828a-40fa-8c3d-1e3d578268ab_1443x975.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!j18_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6cc1f58-828a-40fa-8c3d-1e3d578268ab_1443x975.png 424w, https://substackcdn.com/image/fetch/$s_!j18_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6cc1f58-828a-40fa-8c3d-1e3d578268ab_1443x975.png 848w, https://substackcdn.com/image/fetch/$s_!j18_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6cc1f58-828a-40fa-8c3d-1e3d578268ab_1443x975.png 1272w, https://substackcdn.com/image/fetch/$s_!j18_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6cc1f58-828a-40fa-8c3d-1e3d578268ab_1443x975.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!j18_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6cc1f58-828a-40fa-8c3d-1e3d578268ab_1443x975.png" width="1443" height="975" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e6cc1f58-828a-40fa-8c3d-1e3d578268ab_1443x975.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:975,&quot;width&quot;:1443,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:66812,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/181494472?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6cc1f58-828a-40fa-8c3d-1e3d578268ab_1443x975.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!j18_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6cc1f58-828a-40fa-8c3d-1e3d578268ab_1443x975.png 424w, https://substackcdn.com/image/fetch/$s_!j18_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6cc1f58-828a-40fa-8c3d-1e3d578268ab_1443x975.png 848w, https://substackcdn.com/image/fetch/$s_!j18_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6cc1f58-828a-40fa-8c3d-1e3d578268ab_1443x975.png 1272w, https://substackcdn.com/image/fetch/$s_!j18_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6cc1f58-828a-40fa-8c3d-1e3d578268ab_1443x975.png 1456w" sizes="100vw" loading="lazy"></picture></div></a>
<figcaption class="image-caption"><em><strong>Figure 16</strong>. Context vectors across the encoder stack. Each encoder block takes a matrix Z^(j) and outputs a matrix Z^(j+1) of the same shape. The class token&#8217;s final context vector z_0^(L) becomes a learned summary of the entire image and is the only vector passed to the classification head.</em></figcaption></figure></div><p><strong>MLP head and classification</strong></p><p>By the time the sequence has passed through the transformer encoder, all of the heavy lifting has already happened. Starting from our cat image, we created N = 16 patch tokens, added a single learnable class token at the beginning, and mapped everything into a D-dimensional embedding space. 
After L encoder blocks, we obtain the final sequence matrix</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;Z^{(L)} \\in \\mathbb{R}^{(N+1)\\times D}\n&quot;,&quot;id&quot;:&quot;JLENKECTOZ&quot;}" data-component-name="LatexBlockToDOM"></div><p>Each row</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;z_i^{(L)}\n\n&quot;,&quot;id&quot;:&quot;SBBXUNSDFN&quot;}" data-component-name="LatexBlockToDOM"></div><p>is a context vector for token <em>i</em>, where <em>i=0</em> corresponds to the class token and <em>i = 1,&#8230;,N</em> correspond to the image patches. In our running toy example <em>N+1=17</em> and <em>D=32</em>, so Z^(L) has shape 17&#215;32.</p><p>For image classification we do not feed all seventeen context vectors into a separate network. Instead, we follow the original ViT design and use only the final context vector of the class token,</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;z_0^{(L)} \\in \\mathbb{R}^D.
&quot;,&quot;id&quot;:&quot;VELLHHAHUB&quot;}" data-component-name="LatexBlockToDOM"></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lu_Y!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F122b5503-5e3c-4cce-98ea-4428ab74dc00_1506x855.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lu_Y!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F122b5503-5e3c-4cce-98ea-4428ab74dc00_1506x855.png 424w, https://substackcdn.com/image/fetch/$s_!lu_Y!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F122b5503-5e3c-4cce-98ea-4428ab74dc00_1506x855.png 848w, https://substackcdn.com/image/fetch/$s_!lu_Y!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F122b5503-5e3c-4cce-98ea-4428ab74dc00_1506x855.png 1272w, https://substackcdn.com/image/fetch/$s_!lu_Y!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F122b5503-5e3c-4cce-98ea-4428ab74dc00_1506x855.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lu_Y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F122b5503-5e3c-4cce-98ea-4428ab74dc00_1506x855.png" width="1456" height="827" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/122b5503-5e3c-4cce-98ea-4428ab74dc00_1506x855.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:827,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:125069,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/181494472?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F122b5503-5e3c-4cce-98ea-4428ab74dc00_1506x855.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!lu_Y!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F122b5503-5e3c-4cce-98ea-4428ab74dc00_1506x855.png 424w, https://substackcdn.com/image/fetch/$s_!lu_Y!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F122b5503-5e3c-4cce-98ea-4428ab74dc00_1506x855.png 848w, https://substackcdn.com/image/fetch/$s_!lu_Y!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F122b5503-5e3c-4cce-98ea-4428ab74dc00_1506x855.png 1272w, https://substackcdn.com/image/fetch/$s_!lu_Y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F122b5503-5e3c-4cce-98ea-4428ab74dc00_1506x855.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em><strong>Figure 16.</strong> Overview of the Vision Transformer classifier. The image is turned into patch tokens with positional embeddings, processed by a stack of transformer encoder blocks, and the final class prediction is produced by an MLP head that reads only the context vector of the special class token.</em></figcaption></figure></div><p>This vector has attended to every patch token at every encoder layer, so it acts as a learned summary of the entire image. Using a single summary vector keeps the architecture simple and keeps the number of parameters in the final classifier small. 
In principle one could pool or concatenate all patch context vectors, but concatenation would increase the dimensionality of the classifier input, and these alternatives did not bring clear benefits in the ViT experiments.</p><p>The MLP head is an ordinary feed-forward classifier that takes</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;z_0^{(L)}\n&quot;,&quot;id&quot;:&quot;UBYMEGDPHC&quot;}" data-component-name="LatexBlockToDOM"></div><p>as input and outputs one logit for each class. In the simplest case it consists of a single linear layer with weight matrix</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;W \\in \\mathbb{R}^{C \\times D}\n&quot;,&quot;id&quot;:&quot;QJVBJQLVAR&quot;}" data-component-name="LatexBlockToDOM"></div><p>and bias vector</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;b \\in \\mathbb{R}^C\n&quot;,&quot;id&quot;:&quot;RRADULHQTD&quot;}" data-component-name="LatexBlockToDOM"></div><p>where <em>C</em> is the number of classes (for example, cat, dog, bird, and so on). The logits vector is then</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;y = W z_0^{(L)} + b \\in \\mathbb{R}^C .&quot;,&quot;id&quot;:&quot;CQHNITTXWG&quot;}" data-component-name="LatexBlockToDOM"></div><p>Many practical ViT implementations insert a small two-layer MLP here instead of a single linear layer. In that case we first project</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;z_0^{(L)}\n&quot;,&quot;id&quot;:&quot;ANDIRBXFIX&quot;}" data-component-name="LatexBlockToDOM"></div><p>to a hidden dimension <em>D_mlp</em>, apply a nonlinearity such as GELU, optionally apply dropout for regularisation, and then project down to <em>C</em> dimensions. 
The overall effect is to give the classifier a bit more capacity to reshape the representation coming from the transformer encoder before turning it into class scores.</p><p>The output <em>y</em> is a vector of unnormalised scores, or logits, one per class. At inference time we usually take the index of the largest logit as the predicted label. During training we pass <em>y</em> through a softmax to obtain a probability distribution over classes and compute a cross-entropy loss against the true label. Gradients from this loss flow back through the MLP head into the transformer encoder and further into the patch and positional embeddings, allowing the entire Vision Transformer to be trained end to end.</p><h1>1.5 Benefits and drawbacks of ViT</h1><p>Vision Transformers offer a conceptually clean and flexible alternative to convolutional networks by modeling images as sequences of tokens and relying entirely on self-attention to capture relationships between image regions. One of their key strengths is global context modeling: from the very first encoder layer, every image patch can attend to every other patch. This makes ViTs particularly effective at capturing long-range dependencies, such as relationships between distant parts of an object or interactions between foreground and background regions. In addition, the ViT architecture scales extremely well with data and model size. When trained on large-scale datasets, Vision Transformers often surpass convolutional networks in accuracy, showing that explicit convolutional inductive biases are not strictly necessary when sufficient data is available. Their architectural simplicity is another advantage: apart from the patch embedding stage, the model closely mirrors standard transformer encoders used in language models, making it easy to reuse ideas, optimizations, and tooling across vision and language domains.</p><p>However, these benefits come with important trade-offs. 
Vision Transformers are generally less data-efficient than convolutional networks, especially on small or medium-sized datasets. Without the strong locality and translation-equivariance biases of convolutions, ViTs must learn many visual regularities directly from data, which can lead to poorer performance when training data is limited. Self-attention also introduces higher computational and memory costs, as attention scales quadratically with the number of patches. For high-resolution images, this can quickly become a bottleneck. As a result, many practical ViT variants introduce hierarchical structures, windowed attention, or hybrid CNN&#8211;Transformer designs to mitigate these issues. In short, Vision Transformers excel when data and compute are abundant, but require careful design choices to remain competitive in more constrained settings.</p><h1>1.6 Real-World Applications of Vision Transformers</h1><p>Vision Transformers are now widely used across a broad range of real-world vision tasks, particularly in settings where large datasets and pretraining are available. In image classification, ViTs and their variants have become strong alternatives to deep convolutional networks, achieving state-of-the-art performance on large benchmarks when pretrained on massive image collections and fine-tuned on downstream tasks. Beyond classification, Vision Transformers have proven highly effective in object detection and image segmentation, where global context is especially valuable. Tasks such as detecting small objects in cluttered scenes or segmenting large, spatially distributed structures benefit from the ability of self-attention to relate distant patches directly.</p><p>In industrial and applied domains, Vision Transformers are increasingly used in medical imaging, remote sensing, and autonomous systems. 
In medical imaging, ViTs help model complex spatial relationships in high-resolution scans, such as MRI or histopathology images, where long-range dependencies can be diagnostically important. In satellite and aerial imagery, they are used for land-use classification, change detection, and large-scale scene understanding. Vision Transformers are also central to modern multimodal systems, where images must be aligned with text, audio, or other modalities. Image&#8211;text encoders, for example, rely on ViT backbones to produce visual representations that integrate naturally with language transformers. As a result, Vision Transformers have become a foundational component in systems for image captioning, visual question answering, and large multimodal models, reinforcing their role as a unifying architecture across perception tasks.</p><h1>1.7 Hands-on: fine-tuning ViT for image classification</h1><p><strong>The code for this section is available in the repository below:</strong></p><p><a href="https://github.com/VizuaraAI/Transformers-for-vision-BOOK/tree/main">https://github.com/VizuaraAI/Transformers-for-vision-BOOK</a></p><p>In this section we fine-tune a pretrained Vision Transformer on a real-world, high-resolution image classification task. The Oxford-IIIT Pet dataset contains enough fine-grained visual detail to benefit from the global attention of Vision Transformers, making it a good dataset for demonstrating practical ViT fine-tuning. 
We adapt a ViT-Base model pretrained on ImageNet and fine-tune it to classify pet images into breed categories, following a standard transfer-learning workflow used in modern vision systems.</p><h3><strong>Dataset and problem setup</strong></h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2jBX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F384925c3-686d-4995-a2ac-c16a73a516af_1227x915.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2jBX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F384925c3-686d-4995-a2ac-c16a73a516af_1227x915.png 424w, https://substackcdn.com/image/fetch/$s_!2jBX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F384925c3-686d-4995-a2ac-c16a73a516af_1227x915.png 848w, https://substackcdn.com/image/fetch/$s_!2jBX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F384925c3-686d-4995-a2ac-c16a73a516af_1227x915.png 1272w, https://substackcdn.com/image/fetch/$s_!2jBX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F384925c3-686d-4995-a2ac-c16a73a516af_1227x915.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2jBX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F384925c3-686d-4995-a2ac-c16a73a516af_1227x915.png" width="1227" height="915" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/384925c3-686d-4995-a2ac-c16a73a516af_1227x915.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:915,&quot;width&quot;:1227,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;image.png&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="image.png" title="image.png" srcset="https://substackcdn.com/image/fetch/$s_!2jBX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F384925c3-686d-4995-a2ac-c16a73a516af_1227x915.png 424w, https://substackcdn.com/image/fetch/$s_!2jBX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F384925c3-686d-4995-a2ac-c16a73a516af_1227x915.png 848w, https://substackcdn.com/image/fetch/$s_!2jBX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F384925c3-686d-4995-a2ac-c16a73a516af_1227x915.png 1272w, https://substackcdn.com/image/fetch/$s_!2jBX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F384925c3-686d-4995-a2ac-c16a73a516af_1227x915.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The Oxford-IIIT Pet Dataset consists of 7,349 images of cats and dogs across 37 fine-grained breed classes. Images vary in resolution but are typically larger than 200&#215;200 pixels and contain rich texture, shape, and spatial cues. Each image is labeled with a single breed, making this a multiclass classification problem. Although the dataset images are already reasonably high-resolution, pretrained Vision Transformers expect inputs of size 224&#215;224, so we standardize all images to this resolution during preprocessing.</p><p>This dataset is a good fit for demonstrating ViT fine-tuning for several reasons:</p><ul><li><p>Fine-grained categories. Distinguishing between 37 pet breeds requires the model to attend to subtle visual differences in fur pattern, ear shape, and body proportion, exactly the kind of long-range spatial reasoning that self-attention handles well. </p></li><li><p>Sufficient visual complexity. 
The images contain natural backgrounds, varying poses, and different lighting conditions, giving the model a realistic transfer learning challenge. </p></li><li><p>Manageable size. With roughly 3,680 training images and 3,669 test images, the dataset is small enough to fine-tune on a single GPU in reasonable time, yet large enough to produce meaningful results.</p></li></ul><h3><strong>Installing dependencies and setting constants</strong></h3><p>Before we write any model code, we install the required libraries and define the hyperparameters that will stay fixed throughout the experiment:</p><p><strong>Listing 1.1 Installing dependencies and defining constants</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;e3fb1108-452f-43a7-8c43-b15b2df9d929&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">!pip install torchmetrics -q
import torch
import torch.nn as nn
import torch.optim as optim
import matplotlib.pyplot as plt
import numpy as np

from torchvision import datasets, transforms
from torch.utils.data import DataLoader
from transformers import ViTForImageClassification, ViTImageProcessor
from transformers import get_cosine_schedule_with_warmup

from torchmetrics.classification import MulticlassAccuracy
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

from tqdm.auto import tqdm</code></pre></div><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;9befd5a7-dc2e-4cfc-a8a3-c5103d5884c9&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

torch.manual_seed(42)
torch.cuda.manual_seed_all(42)

NUM_CLASSES = 37
IMAGE_SIZE = 224
BATCH_SIZE = 32
EPOCHS = 50</code></pre></div><p>We fix the random seed so that results are reproducible, and we set <strong>NUM_CLASSES = 37</strong> to match the 37 pet breeds in the Oxford-IIIT dataset. The image size of 224 matches the resolution that the pretrained ViT-Base model expects.</p><h3><strong>Loading and exploring the dataset</strong></h3><p>We use <strong>torchvision.datasets.OxfordIIITPet</strong> to download and load the dataset. The dataset provides both a train-val split (used for training) and a test split (used for validation):</p><p><strong>Listing 1.2 Loading the Oxford-IIIT Pet dataset</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;3ec76dc7-c9ac-4613-84ae-aedd97cedbc0&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">raw_train = datasets.OxfordIIITPet(
    root="./data",
    split="trainval",
    target_types="category",
    download=True
)

class_names = raw_train.classes
print(len(class_names))
print(class_names[:10])</code></pre></div><p>This prints:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;c150614e-8513-4b99-84a7-99467ca44431&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">37
['Abyssinian', 'Bengal', 'Birman', 'Bombay', 'British_Shorthair',
'Egyptian_Mau', 'Maine_Coon', 'Persian', 'Ragdoll', 'Russian_Blue']</code></pre></div><h3><strong>Visualizing sample images</strong></h3><p>This listing displays a small grid of pet images to highlight the dataset&#8217;s visual diversity and resolution.</p><p><strong>Listing 1.3 Visualizing sample images from the dataset</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;c0ec043d-38e9-4ede-89e8-9768f3ab41ea&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">plt.figure(figsize=(12, 6))

for i in range(8):
    img, label = raw_train[i]
    plt.subplot(2, 4, i + 1)
    plt.imshow(img)
    plt.title(class_names[label])
    plt.axis("off")

plt.show()
</code></pre></div><p>output</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!H0tj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F092c4ca0-1b11-47ef-affa-fa64c37fa403_950x489.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!H0tj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F092c4ca0-1b11-47ef-affa-fa64c37fa403_950x489.png 424w, https://substackcdn.com/image/fetch/$s_!H0tj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F092c4ca0-1b11-47ef-affa-fa64c37fa403_950x489.png 848w, https://substackcdn.com/image/fetch/$s_!H0tj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F092c4ca0-1b11-47ef-affa-fa64c37fa403_950x489.png 1272w, https://substackcdn.com/image/fetch/$s_!H0tj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F092c4ca0-1b11-47ef-affa-fa64c37fa403_950x489.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!H0tj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F092c4ca0-1b11-47ef-affa-fa64c37fa403_950x489.png" width="950" height="489" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/092c4ca0-1b11-47ef-affa-fa64c37fa403_950x489.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:489,&quot;width&quot;:950,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!H0tj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F092c4ca0-1b11-47ef-affa-fa64c37fa403_950x489.png 424w, https://substackcdn.com/image/fetch/$s_!H0tj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F092c4ca0-1b11-47ef-affa-fa64c37fa403_950x489.png 848w, https://substackcdn.com/image/fetch/$s_!H0tj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F092c4ca0-1b11-47ef-affa-fa64c37fa403_950x489.png 1272w, https://substackcdn.com/image/fetch/$s_!H0tj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F092c4ca0-1b11-47ef-affa-fa64c37fa403_950x489.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>From this visualization we observe that images contain rich textures, variable poses, and complex backgrounds. These properties make long-range attention over image patches particularly valuable.</p><h3><strong>Preprocessing and data loaders</strong></h3><p>Pretrained Vision Transformers are sensitive to the normalization statistics used during pretraining. We load the <strong>ViTImageProcessor</strong> to extract the correct mean and standard deviation, and then build separate transforms for training and validation:</p><p><strong>Listing 1.4 Building preprocessing pipelines with ViTImageProcessor</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;87df1ff9-59ef-424b-993a-7e15fcf76f9c&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">processor = ViTImageProcessor.from_pretrained(
"google/vit-base-patch16-224"
)
print(processor)</code></pre></div><p>Output</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;aa510ba4-a654-48e0-97ef-82da7e85fc2e&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">ViTImageProcessor {
  "do_convert_rgb": null,
  "do_normalize": true,
  "do_rescale": true,
  "do_resize": true,
  "image_mean": [
    0.5,
    0.5,
    0.5
  ],
  "image_processor_type": "ViTImageProcessor",
  "image_std": [
    0.5,
    0.5,
    0.5
  ],
  "resample": 2,
  "rescale_factor": 0.00392156862745098,
  "size": {
    "height": 224,
    "width": 224
  }
}
</code></pre></div><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;a218680a-4df1-4abd-93a3-30d4e9c70513&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">image_mean = processor.image_mean
image_std = processor.image_std

train_transforms = transforms.Compose([
    transforms.RandomResizedCrop(IMAGE_SIZE),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=image_mean, std=image_std)
])

val_transforms = transforms.Compose([
    transforms.Resize(IMAGE_SIZE),
    transforms.CenterCrop(IMAGE_SIZE),
    transforms.ToTensor(),
    transforms.Normalize(mean=image_mean, std=image_std)
])
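
# Illustrative aside (not part of the original listing): what Normalize does
# with the statistics printed by the processor above. Assuming mean = std = 0.5,
# the formula (x - mean) / std maps ToTensor's [0, 1] range onto [-1, 1].
# normalize_value is a hypothetical helper used only for this check.
def normalize_value(x, mean=0.5, std=0.5):
    # Same per-channel formula that transforms.Normalize applies to each pixel
    return (x - mean) / std

print([normalize_value(v) for v in (0.0, 0.5, 1.0)])  # [-1.0, 0.0, 1.0]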
</code></pre></div><p>The training transforms apply random resized cropping and horizontal flipping for data augmentation, while the validation transforms use a deterministic resize and center crop so that evaluation is reproducible. Both pipelines normalize with the mean and standard deviation reported by the processor (0.5 per channel for this checkpoint, rather than the usual ImageNet per-channel statistics), so that inputs match what the pretrained model expects.</p><p>After preprocessing, every image has shape</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;(3,224,224)&quot;,&quot;id&quot;:&quot;IOEYSCPFVG&quot;}" data-component-name="LatexBlockToDOM"></div><p>which matches the ViT input specification.</p><p>We now construct the training and validation datasets and wrap them in PyTorch DataLoaders.</p><p><strong>Listing 1.5 Constructing training and validation data loaders</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;e2e91aba-2090-4d3a-8f1e-96e8d931bc37&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">train_dataset = datasets.OxfordIIITPet(
    root="./data",
    split="trainval",
    target_types="category",
    transform=train_transforms
)

val_dataset = datasets.OxfordIIITPet(
    root="./data",
    split="test",
    target_types="category",
    transform=val_transforms
)

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False)
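
# Illustrative aside (not part of the original listing): how many batches each
# loader yields per epoch. The split sizes 3,680 and 3,669 are the approximate
# figures quoted earlier, 32 mirrors BATCH_SIZE, and batches_per_epoch is a
# hypothetical helper. With the real loaders, len(train_loader) gives the same count.
import math

def batches_per_epoch(num_samples, batch_size):
    # DataLoader keeps the last partial batch by default (drop_last=False)
    return math.ceil(num_samples / batch_size)

print(batches_per_epoch(3680, 32))  # 115 training batches
print(batches_per_epoch(3669, 32))  # 115 validation batches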
</code></pre></div><h3><strong>Loading the pretrained model</strong></h3><p>We load a <em>ViT-Base</em> model pretrained on <em>ImageNet</em> using the Hugging Face transformers library. The key argument <strong>num_labels=NUM_CLASSES</strong> tells the library to replace the original 1000-class ImageNet head with a new linear head that outputs 37 logits, one per pet breed.</p><p>Because the checkpoint&#8217;s 1000-class head does not match this new shape, we also pass <strong>ignore_mismatched_sizes=True</strong>, which lets the library skip the mismatched head weights and initialize them randomly instead of raising a size-mismatch error:</p><p><strong>Listing 1.6 Loading a pretrained ViT-Base model with a new classification head</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;1c3080c3-b5e2-47a2-b104-85e43307f174&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224",
    num_labels=NUM_CLASSES,
    ignore_mismatched_sizes=True
).to(device)
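</code></pre></div><p>The size of the new head follows directly from its shape. As a quick sketch (the helper name is hypothetical), a linear layer from ViT-Base&#8217;s 768-dimensional embedding to 37 classes has 768 &#215; 37 weights plus 37 biases, far fewer than the 1000-class head it replaces:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python"># Sketch: parameter count of a fully connected layer (hypothetical helper).
def linear_param_count(in_features, out_features, bias=True):
    """Return the number of weights plus optional biases of a linear layer."""
    return in_features * out_features + (out_features if bias else 0)

hidden_size = 768  # embedding dimension of ViT-Base

original_head = linear_param_count(hidden_size, 1000)  # ImageNet head
new_head = linear_param_count(hidden_size, 37)         # one logit per pet breed

print(f"original head: {original_head:,}")  # 769,000
print(f"new head:      {new_head:,}")       # 28,453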
</code></pre></div><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;9386d76e-7bc6-476f-a749-b1fc3fb187d7&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">print(model)</code></pre></div><p>The printed structure shows the 12 encoder layers and, on the last line, the new 37-way classifier:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;02cea0c1-5e8c-4710-a3ad-2f529ef9a343&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">ViTForImageClassification(
  (vit): ViTModel(
    (embeddings): ViTEmbeddings(
      (patch_embeddings): ViTPatchEmbeddings(
        (projection): Conv2d(3, 768, kernel_size=(16, 16), stride=(16, 16))
      )
      (dropout): Dropout(p=0.0, inplace=False)
    )
    (encoder): ViTEncoder(
      (layer): ModuleList(
        (0-11): 12 x ViTLayer(
          (attention): ViTAttention(
            (attention): ViTSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
            )
            (output): ViTSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.0, inplace=False)
            )
          )
          (intermediate): ViTIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
            (intermediate_act_fn): GELUActivation()
          )
          (output): ViTOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (dropout): Dropout(p=0.0, inplace=False)
          )
          (layernorm_before): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (layernorm_after): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        )
      )
    )
    (layernorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
  )
  (classifier): Linear(in_features=768, out_features=37, bias=True)
)</code></pre></div><h3><strong>Freezing the backbone and training only the head</strong></h3><p>A common transfer-learning strategy is to freeze all pretrained parameters and train only the newly initialized classification head. This is fast, requires little memory, and often produces strong results when the pretrained features already capture the visual concepts needed for the target task:</p><p><strong>Listing 1.7 Freezing the backbone and counting trainable parameters</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;ff75c543-d981-41ee-90be-92c9b0ce2d1a&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">for param in model.parameters():
    param.requires_grad = False
for name, param in model.named_parameters():
    if "classifier" in name:
        param.requires_grad = True</code></pre></div><p>We can verify the freeze with a quick parameter count:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;21a72cea-2081-4246-880d-650c34898208&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">def print_model_parameters(model):
    trainable_params = 0
    frozen_params = 0
    all_param = 0

    for _, param in model.named_parameters():
        num_params = param.numel()
        all_param += num_params

        if param.requires_grad:
            trainable_params += num_params
        else:
            frozen_params += num_params

    print(f"trainable params: {trainable_params:,}")
    print(f"frozen params:    {frozen_params:,}")
    print(f"all params:       {all_param:,}")
    print(f"trainable%:       {100 * trainable_params / all_param:.2f}%")

# Run the function
print_model_parameters(model)</code></pre></div><p>The output shows that only the classification head is trainable, roughly 28,000 parameters out of the model&#8217;s 86 million total:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;0fcb8baf-5d12-4cbd-8004-2b7c1c2325a3&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">trainable params: 28,453
frozen params: 85,798,656
all params: 85,827,109
trainable%: 0.03%</code></pre></div><h3><strong>Sanity check: pre-training inference</strong></h3><p>Before any fine-tuning, we run the model on a single validation image to establish a baseline. Since the classification head is randomly initialized, we expect the prediction to be essentially random:</p><p><strong>Listing 1.8 Sanity check: pre-training inference on one image</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;740f71fa-2135-495f-a7df-8d9a73b3aa7b&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">model.eval()

image, label = val_dataset[0]
image = image.unsqueeze(0).to(device)

with torch.no_grad():
    outputs = model(pixel_values=image)
    logits = outputs.logits
    probs = torch.softmax(logits, dim=1)
    pred = probs.argmax(dim=1).item()
    confidence = probs.max(dim=1).values.item()

print(
    f"[Pre-training inference]\n"
    f"  Ground truth class : {class_names[label]}\n"
    f"  Predicted class    : {class_names[pred]}\n"
    f"  Prediction confidence : {confidence:.2f}"
)
</code></pre></div><p>Output</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;3017f71f-36d8-45ff-90a5-bebb4a1aa115&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">[Pre-training inference]
  Ground truth class : Abyssinian
  Predicted class    : Shiba Inu
  Prediction confidence : 0.05</code></pre></div><p>The model predicts an incorrect class with a confidence of 0.05, not far above the 1/37 &#8776; 0.03 that a uniform guess over 37 classes would give, confirming that the head needs training.</p><h3><strong>Setting up the optimizer, scheduler, and loss</strong></h3><p>We use AdamW with a learning rate of 3 &#215; 10<sup>&#8722;4</sup> and a cosine schedule with linear warmup. The warmup phase helps stabilize early training when the head weights are still random:</p><p><strong>Listing 1.9 Setting up AdamW optimizer, cosine scheduler, and loss function</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;e484ebc4-38ca-46e6-9701-5b96cb3ff130&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">optimizer = optim.AdamW(
    filter(lambda p: p.requires_grad, model.parameters()),
    lr=3e-4,
    weight_decay=1e-4
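)

# The scheduler is stepped once per batch, so count optimizer steps over all epochs
total_steps = len(train_loader) * EPOCHS
warmup_steps = int(0.1 * total_steps)  # linear warmup over the first 10% of steps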

scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=warmup_steps,
    num_training_steps=total_steps
)

criterion = nn.CrossEntropyLoss()
accuracy = MulticlassAccuracy(num_classes=NUM_CLASSES).to(device)</code></pre></div><blockquote><p><strong>AdamW for fine-tuning</strong></p><p>AdamW is a variant of Adam that decouples weight decay from the gradient update: the decay is applied directly to the weights instead of being added to the gradient as an L2 penalty. This keeps the regularization from being rescaled by the adaptive per-parameter learning rates, which tends to improve generalization. It is a widely used default for both pretraining and fine-tuning transformers.</p></blockquote><h3><strong>The training loop</strong></h3><p>The training loop follows a standard PyTorch pattern: for each epoch, iterate over mini-batches, compute the cross-entropy loss, back-propagate, and update the classification-head weights. After each epoch, we evaluate on the validation set:</p><p><strong>Listing 1.10 The main training and validation loop</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;aa103752-472c-4867-8e2c-51234653a3bd&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">train_losses, val_accuracies = [], []

for epoch in range(EPOCHS):
    model.train()
    running_loss = 0.0

    for imgs, labels in tqdm(train_loader):
        imgs, labels = imgs.to(device), labels.to(device)

        outputs = model(pixel_values=imgs)
        loss = criterion(outputs.logits, labels)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()

        running_loss += loss.item()

    train_losses.append(running_loss / len(train_loader))

    model.eval()
    accuracy.reset()
    with torch.no_grad():
        for imgs, labels in val_loader:
            imgs, labels = imgs.to(device), labels.to(device)
            preds = model(pixel_values=imgs).logits.argmax(dim=1)
            accuracy.update(preds, labels)

    val_accuracies.append(accuracy.compute().item())

    print(f"Epoch {epoch+1}: "
          f"Loss={train_losses[-1]:.4f}, "
          f"Val Acc={val_accuracies[-1]:.4f}")
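</code></pre></div><p>The scheduler in this loop is stepped once per batch. To see what that does to the learning rate, here is a plain-Python sketch of a linear-warmup, half-cosine-decay multiplier, which is, to the best of our knowledge, the shape that <strong>get_cosine_schedule_with_warmup</strong> produces with its default settings. The function name and the step counts are illustrative assumptions, not values from the original run.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">import math

# Sketch (hypothetical helper): factor applied to the base learning rate at each step.
def cosine_warmup_multiplier(step, warmup_steps, total_steps):
    """Linear ramp from 0 to 1 during warmup, then half-cosine decay back to 0."""
    if step in range(warmup_steps):  # still inside the warmup phase
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))

base_lr = 3e-4
demo_total_steps = 1000                          # assumed: len(train_loader) * EPOCHS
demo_warmup_steps = int(0.1 * demo_total_steps)  # first 10% of steps

# Print the learning rate at the start, mid-warmup, peak, mid-decay, and end.
for step in (0, 50, 100, 550, 1000):
    factor = cosine_warmup_multiplier(step, demo_warmup_steps, demo_total_steps)
    print(f"step {step:4d}: lr = {base_lr * factor:.2e}")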
</code></pre></div><p>In the training loop, note how we call <strong>model.train()</strong> at the start of each epoch and <strong>model.eval()</strong> before validation. These calls switch layers such as dropout between their training and inference behavior. This ViT uses layer normalization rather than batch normalization, and its dropout probabilities are 0.0 in the structure printed earlier, so the switch changes little here, but keeping it is good practice. The <strong>scheduler.step()</strong> call happens after each optimizer step (not each epoch), which is required here because <strong>total_steps</strong> and <strong>warmup_steps</strong> were counted in batches, not epochs.</p><h3><strong>Plotting training progress</strong></h3><p>After training, we plot the training loss and validation accuracy curves to assess convergence:</p><p><strong>Listing 1.11 Plotting training loss and validation accuracy curves</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;5c6ed8d8-c907-4afd-9790-0e470990c52d&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.plot(train_losses)
plt.title("Training Loss")
plt.xlabel("Epoch")
plt.ylabel("Loss")

plt.subplot(1, 2, 2)
plt.plot(val_accuracies)
plt.title("Validation Accuracy")
plt.xlabel("Epoch")
plt.ylabel("Accuracy")
plt.tight_layout()
plt.show()</code></pre></div><p>Output plot</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0a01!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F061a6403-a7ec-471c-9f9d-e3c880068057_981x374.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0a01!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F061a6403-a7ec-471c-9f9d-e3c880068057_981x374.png 424w, https://substackcdn.com/image/fetch/$s_!0a01!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F061a6403-a7ec-471c-9f9d-e3c880068057_981x374.png 848w, https://substackcdn.com/image/fetch/$s_!0a01!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F061a6403-a7ec-471c-9f9d-e3c880068057_981x374.png 1272w, https://substackcdn.com/image/fetch/$s_!0a01!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F061a6403-a7ec-471c-9f9d-e3c880068057_981x374.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0a01!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F061a6403-a7ec-471c-9f9d-e3c880068057_981x374.png" width="981" height="374" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/061a6403-a7ec-471c-9f9d-e3c880068057_981x374.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:374,&quot;width&quot;:981,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!0a01!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F061a6403-a7ec-471c-9f9d-e3c880068057_981x374.png 424w, https://substackcdn.com/image/fetch/$s_!0a01!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F061a6403-a7ec-471c-9f9d-e3c880068057_981x374.png 848w, https://substackcdn.com/image/fetch/$s_!0a01!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F061a6403-a7ec-471c-9f9d-e3c880068057_981x374.png 1272w, https://substackcdn.com/image/fetch/$s_!0a01!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F061a6403-a7ec-471c-9f9d-e3c880068057_981x374.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>A healthy run will show the training loss decreasing steadily and the validation accuracy climbing over the first several epochs before leveling off. Since we are only training the classification head, convergence is typically fast.</p><h3><strong>Post-training inference</strong></h3><p>We can now check the fine-tuned model on the same validation image used in the earlier sanity check:</p><p><strong>Listing 1.12 Post-training inference on a validation image</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;e5debcb4-b402-4ae9-a099-e13708a06f09&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">model.eval()

image, label = val_dataset[0]
image = image.unsqueeze(0).to(device)

with torch.no_grad():
    logits = model(pixel_values=image).logits
    pred = logits.argmax(dim=1).item()

print("After training &#8594; Pred:", class_names[pred],
      "| GT:", class_names[label])
</code></pre></div><p>Output</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;64fbe6ce-140c-42bc-8ff2-b2735a2dcb76&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">After training &#8594; Pred: Abyssinian | GT: Abyssinian</code></pre></div><h3><strong>Confusion matrix evaluation</strong></h3><p>To understand where the model excels and where it struggles, we compute a confusion matrix over the entire validation set. This reveals which breed pairs are most easily confused:</p><p><strong>Listing 1.13 Computing and displaying the confusion matrix</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;48be2dc6-764f-4a4e-adc7-e5f3dc704dac&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">all_preds, all_labels = [], []

model.eval()
with torch.no_grad():
    for imgs, labels in val_loader:
        imgs = imgs.to(device)
        # Get predictions
        preds = model(pixel_values=imgs).logits.argmax(dim=1)

        all_preds.extend(preds.cpu().numpy())
        all_labels.extend(labels.numpy())

cm = confusion_matrix(all_labels, all_preds)
disp = ConfusionMatrixDisplay(cm, display_labels=class_names)

fig, ax = plt.subplots(figsize=(30, 30))
disp.plot(ax=ax, xticks_rotation=45, colorbar=True)

plt.tight_layout()
plt.show()</code></pre></div><p>Output</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!l_w0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3445e49f-0158-4039-88f0-fa33b415dc1f_2758x2990.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!l_w0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3445e49f-0158-4039-88f0-fa33b415dc1f_2758x2990.png 424w, https://substackcdn.com/image/fetch/$s_!l_w0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3445e49f-0158-4039-88f0-fa33b415dc1f_2758x2990.png 848w, https://substackcdn.com/image/fetch/$s_!l_w0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3445e49f-0158-4039-88f0-fa33b415dc1f_2758x2990.png 1272w, https://substackcdn.com/image/fetch/$s_!l_w0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3445e49f-0158-4039-88f0-fa33b415dc1f_2758x2990.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!l_w0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3445e49f-0158-4039-88f0-fa33b415dc1f_2758x2990.png" width="1456" height="1578" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3445e49f-0158-4039-88f0-fa33b415dc1f_2758x2990.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1578,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!l_w0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3445e49f-0158-4039-88f0-fa33b415dc1f_2758x2990.png 424w, https://substackcdn.com/image/fetch/$s_!l_w0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3445e49f-0158-4039-88f0-fa33b415dc1f_2758x2990.png 848w, https://substackcdn.com/image/fetch/$s_!l_w0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3445e49f-0158-4039-88f0-fa33b415dc1f_2758x2990.png 1272w, https://substackcdn.com/image/fetch/$s_!l_w0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3445e49f-0158-4039-88f0-fa33b415dc1f_2758x2990.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Each row corresponds to a true breed and each column to a predicted breed, so the diagonal entries count correct predictions and the off-diagonal entries count confusions. Breeds with visually similar features (for example, different shorthaired cats) will typically show higher off-diagonal values. This kind of fine-grained analysis is valuable for deciding whether to invest in more data, stronger augmentation, or a larger model.</p><h3><strong>Saving the fine-tuned model</strong></h3><p>Finally, we save the fine-tuned weights so that the model can be reloaded later for inference without retraining:</p><p><strong>Listing 1.14 Saving the fine-tuned model weights</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;b5c834c7-816d-4102-a1ab-0132cfc065b7&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">torch.save(model.state_dict(), "vit_finetuned_final.pth")
print("Model saved successfully.")</code></pre></div><p>The <strong>.pth</strong> file contains only the model&#8217;s <strong>state_dict</strong>, a dictionary mapping each parameter name to its tensor. To reload the model, we would create a new <strong>ViTForImageClassification</strong> instance with the same configuration (including <strong>num_labels=NUM_CLASSES</strong>) and call <strong>model.load_state_dict(torch.load("vit_finetuned_final.pth"))</strong>.</p><h1>Resources</h1><p><strong>Original Paper</strong></p><p><a href="https://arxiv.org/pdf/2010.11929">An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale</a></p><p><strong>Dr Sreedath Panat</strong></p><p><a href="https://youtu.be/U_sdodhcBC4?si=mtZFetZbLuv7458d">Vision Transformer Paper Dissection</a></p><p><a href="https://youtu.be/ZRo74xnN2SI?si=RjEsfAevdakYpgxX">Build Vision Transformer from Scratch</a></p><h1><strong>1.8 Summary</strong></h1><ul><li><p>Vision Transformers adapt the transformer architecture to images by treating fixed-size image patches as tokens, enabling global self-attention from the first layer. Unlike convolutional networks, which build receptive fields gradually through stacked layers, ViTs can relate any two image regions directly.</p></li><li><p>Patch embedding converts a 2D image into a 1D sequence of token vectors. This can be done by flattening each patch and applying a linear projection, or equivalently by using a single convolution with kernel size and stride equal to the patch size.</p></li><li><p>A learnable class token is prepended to the sequence and accumulates information from all patches through self-attention. 
After the final encoder block, the class token&#8217;s context vector serves as a compact summary of the entire image and is passed to an MLP head for classification.</p></li><li><p>Learnable positional embeddings are added to each token so that the model retains spatial information about where each patch originated in the original image grid.</p></li><li><p>The encoder-only architecture processes the full patch sequence through L identical blocks of multi-head self-attention and feed-forward layers. Each block preserves the sequence length and embedding dimension, progressively refining token representations.</p></li><li><p>Vision Transformers scale well with large datasets and model sizes but are less data-efficient than CNNs on small datasets. Practical variants address the quadratic attention cost through hierarchical designs and windowed attention.</p></li><li><p>Fine-tuning a pretrained ViT on a downstream classification task follows a standard transfer-learning workflow: freeze the pretrained backbone, replace the classification head, and train only the head on the target dataset using a cosine learning rate schedule with warmup.</p></li></ul><h1>More from Our Substack</h1><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;47d417d5-8d2a-4367-a76f-db2e773f33d1&quot;,&quot;caption&quot;:&quot;The Transformer Architecture&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;lg&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;The Transformers&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:321073573,&quot;name&quot;:&quot;Mayank Pratap Singh&quot;,&quot;bio&quot;:&quot;I&#8217;m Mayank, an AI and LLM enthusiast, and this is my space to breaking down complex concepts into accessible 
insights&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/41e709c7-9629-4748-93e5-e3014c16fb57_460x460.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2026-03-17T03:32:41.080Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/$s_!Igi_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcae88335-48f0-4ab3-baef-790df9e6f2ed_1920x1080.gif&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://www.vizuaranewsletter.com/p/the-transformers&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:190611987,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:92,&quot;comment_count&quot;:5,&quot;publication_id&quot;:3591997,&quot;publication_name&quot;:&quot;Vizuara&#8217;s AI Newsletter&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!f_Wd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59481c77-2230-4e2f-b78b-88e9ee5fe9d9_1088x1088.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;5d313a70-cb07-48d9-8631-0941c99cf854&quot;,&quot;caption&quot;:&quot;Figure 0: Detailed Architecture of the Segment Anything Model (SAM).&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;lg&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Segment Anything Model (SAM)&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:321073573,&quot;name&quot;:&quot;Mayank Pratap Singh&quot;,&quot;bio&quot;:&quot;I&#8217;m Mayank, an AI and LLM enthusiast, and this is my space to breaking down complex concepts into accessible 
insights&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/41e709c7-9629-4748-93e5-e3014c16fb57_460x460.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null},{&quot;id&quot;:136642032,&quot;name&quot;:&quot;Sreedath Panat&quot;,&quot;bio&quot;:&quot;I am the co-founder of Vizuara AI Labs and a PhD from MIT. I use this space to put down my thoughts and knowledge on AI/ML related topics.&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3fa047b9-4cee-4d9d-8ed7-4a63c5f919b4_974x1220.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2026-01-20T09:19:46.533Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2ea6440e-c81a-4e4e-b357-db44820234f5_1920x1278.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://www.vizuaranewsletter.com/p/segment-anything-model-sam&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:184705881,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:12,&quot;comment_count&quot;:0,&quot;publication_id&quot;:3591997,&quot;publication_name&quot;:&quot;Vizuara&#8217;s AI Newsletter&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!f_Wd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59481c77-2230-4e2f-b78b-88e9ee5fe9d9_1088x1088.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;722ceca4-ae3c-46b5-bea7-4ddbb4a20d7b&quot;,&quot;caption&quot;:&quot;Table Of Content&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;lg&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Detection 
Transformer (DETR): An introduction&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:321073573,&quot;name&quot;:&quot;Mayank Pratap Singh&quot;,&quot;bio&quot;:&quot;I&#8217;m Mayank, an AI and LLM enthusiast, and this is my space to breaking down complex concepts into accessible insights&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/41e709c7-9629-4748-93e5-e3014c16fb57_460x460.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null},{&quot;id&quot;:136642032,&quot;name&quot;:&quot;Sreedath Panat&quot;,&quot;bio&quot;:&quot;I am the co-founder of Vizuara AI Labs and a PhD from MIT. I use this space to put down my thoughts and knowledge on AI/ML related topics.&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3fa047b9-4cee-4d9d-8ed7-4a63c5f919b4_974x1220.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2026-01-15T08:40:59.104Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/$s_!M0HN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32500ddf-8e94-4494-84aa-6335fa46ef07_786x438.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://www.vizuaranewsletter.com/p/detection-transformer-detr-an-introduction&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:183945695,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:8,&quot;comment_count&quot;:0,&quot;publication_id&quot;:3591997,&quot;publication_name&quot;:&quot;Vizuara&#8217;s AI 
Newsletter&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!f_Wd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59481c77-2230-4e2f-b78b-88e9ee5fe9d9_1088x1088.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;9c5fb056-98e2-43b4-aa5d-b36cb4c755c5&quot;,&quot;caption&quot;:&quot;Table of Content&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;lg&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;An beginners introduction to Swin transformer&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:136642032,&quot;name&quot;:&quot;Sreedath Panat&quot;,&quot;bio&quot;:&quot;I am the co-founder of Vizuara AI Labs and a PhD from MIT. I use this space to put down my thoughts and knowledge on AI/ML related topics.&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3fa047b9-4cee-4d9d-8ed7-4a63c5f919b4_974x1220.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null},{&quot;id&quot;:321073573,&quot;name&quot;:&quot;Mayank Pratap Singh&quot;,&quot;bio&quot;:&quot;I&#8217;m Mayank, an AI and LLM enthusiast, and this is my space to breaking down complex concepts into accessible 
insights&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/41e709c7-9629-4748-93e5-e3014c16fb57_460x460.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2026-01-13T09:20:10.516Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/$s_!X8hk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F791e917c-b14b-4506-b09b-70c612d3c888_1533x510.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://www.vizuaranewsletter.com/p/an-beginners-introduction-to-swin&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:183324523,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:7,&quot;comment_count&quot;:0,&quot;publication_id&quot;:3591997,&quot;publication_name&quot;:&quot;Vizuara&#8217;s AI Newsletter&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!f_Wd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59481c77-2230-4e2f-b78b-88e9ee5fe9d9_1088x1088.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><p>I&#8217;m also building audio deep learning projects, exploring and fine-tuning different TTS and STT models, and sharing and discussing them on LinkedIn and Twitter. 
If you&#8217;re someone curious about these topics, I&#8217;d love to connect with you all!</p><p><strong>Mayank Pratap Singh</strong></p><p><strong>LinkedIn</strong> : <a href="https://www.linkedin.com/in/mayankpratapsingh022/">www.linkedin.com/in/mayankpratapsingh022</a></p><p><strong>Twitter/X</strong> : <a href="https://x.com/Mayank_022">x.com/Mayank_022</a>.</p><p></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.vizuaranewsletter.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Vizuara&#8217;s AI Newsletter! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[The Transformers]]></title><description><![CDATA[A complete architectural breakdown of Transformers, paired with a step-by-step guide to coding BERT from the ground up.]]></description><link>https://www.vizuaranewsletter.com/p/the-transformers</link><guid isPermaLink="false">https://www.vizuaranewsletter.com/p/the-transformers</guid><dc:creator><![CDATA[Mayank Pratap Singh]]></dc:creator><pubDate>Tue, 17 Mar 2026 03:32:41 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Igi_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcae88335-48f0-4ab3-baef-790df9e6f2ed_1920x1080.gif" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a 
class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Igi_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcae88335-48f0-4ab3-baef-790df9e6f2ed_1920x1080.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Igi_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcae88335-48f0-4ab3-baef-790df9e6f2ed_1920x1080.gif 424w, https://substackcdn.com/image/fetch/$s_!Igi_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcae88335-48f0-4ab3-baef-790df9e6f2ed_1920x1080.gif 848w, https://substackcdn.com/image/fetch/$s_!Igi_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcae88335-48f0-4ab3-baef-790df9e6f2ed_1920x1080.gif 1272w, https://substackcdn.com/image/fetch/$s_!Igi_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcae88335-48f0-4ab3-baef-790df9e6f2ed_1920x1080.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Igi_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcae88335-48f0-4ab3-baef-790df9e6f2ed_1920x1080.gif" width="1456" height="819" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cae88335-48f0-4ab3-baef-790df9e6f2ed_1920x1080.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:179353,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/190611987?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcae88335-48f0-4ab3-baef-790df9e6f2ed_1920x1080.gif&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Igi_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcae88335-48f0-4ab3-baef-790df9e6f2ed_1920x1080.gif 424w, https://substackcdn.com/image/fetch/$s_!Igi_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcae88335-48f0-4ab3-baef-790df9e6f2ed_1920x1080.gif 848w, https://substackcdn.com/image/fetch/$s_!Igi_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcae88335-48f0-4ab3-baef-790df9e6f2ed_1920x1080.gif 1272w, https://substackcdn.com/image/fetch/$s_!Igi_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcae88335-48f0-4ab3-baef-790df9e6f2ed_1920x1080.gif 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>The Transformer Architecture </h2><ul><li><p>1.1 Introduction to Large Language Models</p></li><li><p>1.2 Anatomy of the transformer block</p></li><li><p>1.3 Tokenization</p></li><li><p>1.4 Byte Pair Encoding</p></li><li><p>1.5 Word Embedding</p></li><li><p>1.6 Transformer Block</p></li><li><p>1.7 The Need for Attention Mechanism</p></li><li><p>1.8 Self Attention Mechanism</p></li><li><p>1.9 Understanding the Input Embedding Matrix</p></li><li><p>1.10 From Embeddings to Queries, Keys &amp; Values</p></li><li><p>1.11 A Quick Note on Matrix Multiplication</p></li><li><p>1.12 Why Scale Attention Scores?</p></li><li><p>1.13 Causal &amp; Masked Attention</p></li><li><p>1.14 Causal Attention with Dropouts</p></li><li><p>1.15 Summary of Self-Attention</p></li><li><p>1.16 Intuition of Multi-Head Attention</p></li><li><p>1.17 
Layer Normalization</p></li><li><p>1.18 FeedForward Network</p></li><li><p>1.19 Shortcut connections</p></li><li><p>1.20 Why Transformers Scale Better Than RNNs and CNNs</p></li><li><p>1.21 Pretraining, Fine Tuning, and Transfer Learning in Transformers</p></li><li><p>1.22 Limitations and Challenges of Transformers</p></li><li><p>1.23 Hands On Coding a Miniature Transformer for Sequence Classification</p></li><li><p>1.24 Summary</p></li></ul><p><strong>You can find the code notebook here:</strong></p><p><a href="https://github.com/VizuaraAI/Transformers-for-vision-BOOK">https://github.com/VizuaraAI/Transformers-for-vision-BOOK</a></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.vizuaranewsletter.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Vizuara&#8217;s AI Newsletter! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h1>1.1 Introduction to Large Language Models</h1><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rQ5f!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45d25b86-c09e-46fb-b2bb-b7fadfac0c3e_1086x246.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rQ5f!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45d25b86-c09e-46fb-b2bb-b7fadfac0c3e_1086x246.png 424w, https://substackcdn.com/image/fetch/$s_!rQ5f!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45d25b86-c09e-46fb-b2bb-b7fadfac0c3e_1086x246.png 848w, https://substackcdn.com/image/fetch/$s_!rQ5f!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45d25b86-c09e-46fb-b2bb-b7fadfac0c3e_1086x246.png 1272w, https://substackcdn.com/image/fetch/$s_!rQ5f!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45d25b86-c09e-46fb-b2bb-b7fadfac0c3e_1086x246.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!rQ5f!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45d25b86-c09e-46fb-b2bb-b7fadfac0c3e_1086x246.png" width="1086" height="246" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/45d25b86-c09e-46fb-b2bb-b7fadfac0c3e_1086x246.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:246,&quot;width&quot;:1086,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:7669,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45d25b86-c09e-46fb-b2bb-b7fadfac0c3e_1086x246.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!rQ5f!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45d25b86-c09e-46fb-b2bb-b7fadfac0c3e_1086x246.png 424w, https://substackcdn.com/image/fetch/$s_!rQ5f!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45d25b86-c09e-46fb-b2bb-b7fadfac0c3e_1086x246.png 848w, https://substackcdn.com/image/fetch/$s_!rQ5f!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45d25b86-c09e-46fb-b2bb-b7fadfac0c3e_1086x246.png 1272w, https://substackcdn.com/image/fetch/$s_!rQ5f!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45d25b86-c09e-46fb-b2bb-b7fadfac0c3e_1086x246.png 1456w" 
sizes="100vw"></picture><div></div></div></a></figure></div><p><em><strong>Figure 1.1 </strong>A Large Language Model takes a sequence of words as input and predicts the most likely next word, generating text one token at a time.</em></p><p>Large Language Models are neural networks trained on vast text datasets to perform a fundamental task: predicting the next word in a sequence. This simple objective drives the sophisticated capabilities we see in systems like GPT and ChatGPT.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kOd8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ef8c075-dead-43d6-a915-9a243dda6be7_1410x984.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kOd8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ef8c075-dead-43d6-a915-9a243dda6be7_1410x984.png 424w, https://substackcdn.com/image/fetch/$s_!kOd8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ef8c075-dead-43d6-a915-9a243dda6be7_1410x984.png 848w, https://substackcdn.com/image/fetch/$s_!kOd8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ef8c075-dead-43d6-a915-9a243dda6be7_1410x984.png 1272w, https://substackcdn.com/image/fetch/$s_!kOd8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ef8c075-dead-43d6-a915-9a243dda6be7_1410x984.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!kOd8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ef8c075-dead-43d6-a915-9a243dda6be7_1410x984.png" width="1410" height="984" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8ef8c075-dead-43d6-a915-9a243dda6be7_1410x984.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:984,&quot;width&quot;:1410,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:129934,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ef8c075-dead-43d6-a915-9a243dda6be7_1410x984.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!kOd8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ef8c075-dead-43d6-a915-9a243dda6be7_1410x984.png 424w, https://substackcdn.com/image/fetch/$s_!kOd8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ef8c075-dead-43d6-a915-9a243dda6be7_1410x984.png 848w, https://substackcdn.com/image/fetch/$s_!kOd8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ef8c075-dead-43d6-a915-9a243dda6be7_1410x984.png 1272w, https://substackcdn.com/image/fetch/$s_!kOd8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ef8c075-dead-43d6-a915-9a243dda6be7_1410x984.png 1456w" 
sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em><strong>Figure 1.2 </strong>Autoregressive text generation. The model predicts the next word, appends it to the input, and repeats the process to produce entire paragraphs.</em></p><p>When you interact with an LLM, it generates responses one word at a time. Given a prompt like &#8220;The cat sat on the,&#8221; the model predicts the next word, perhaps &#8220;mat.&#8221; This word is added to the sequence, becoming &#8220;The cat sat on the mat,&#8221; which then serves as input for predicting the following word. 
Through this iterative process, LLMs produce entire paragraphs and complex responses.</p><div class="pullquote"><p>LLMs function as probabilistic engines, calculating word likelihoods based on patterns learned during training.</p></div><p>The transformer architecture enables these models to consider both immediate context and long-range dependencies throughout the input sequence, maintaining coherence across extended text generation.</p><p>Despite the apparent simplicity of next-word prediction, this mechanism gives rise to remarkable language understanding and generation capabilities. Understanding how transformers accomplish this task is essential to grasping how modern language models work.</p><h3>Predicting the Next Word with OpenAI&#8217;s LLM</h3><p>Let&#8217;s walk through a simple example of how an LLM predicts the next word given a partial sentence.</p><p>You can refer to the full source code notebook for this exercise on Colab.</p><p><a href="https://github.com/VizuaraAI/Transformers-for-vision-BOOK/blob/main/Ch02-Transformers/Predicting_the_Next_Word_with_OpenAI's_LLM.ipynb">Predicting_the_next_word_notebook</a></p><p>Using the code in the notebook, we can predict the <strong>next word</strong> in a sentence based on probabilities assigned by a <strong>Large Language Model (LLM). 
</strong>Suppose our input sentence is:</p><pre><code>&#8220;After years of hard work, your effort will take you&#8221;</code></pre><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Pl2R!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ef611d8-481b-4120-8dfc-d8ca6bb267fc_1194x225.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Pl2R!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ef611d8-481b-4120-8dfc-d8ca6bb267fc_1194x225.png 424w, https://substackcdn.com/image/fetch/$s_!Pl2R!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ef611d8-481b-4120-8dfc-d8ca6bb267fc_1194x225.png 848w, https://substackcdn.com/image/fetch/$s_!Pl2R!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ef611d8-481b-4120-8dfc-d8ca6bb267fc_1194x225.png 1272w, https://substackcdn.com/image/fetch/$s_!Pl2R!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ef611d8-481b-4120-8dfc-d8ca6bb267fc_1194x225.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Pl2R!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ef611d8-481b-4120-8dfc-d8ca6bb267fc_1194x225.png" width="1194" height="225" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1ef611d8-481b-4120-8dfc-d8ca6bb267fc_1194x225.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:225,&quot;width&quot;:1194,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:13321,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ef611d8-481b-4120-8dfc-d8ca6bb267fc_1194x225.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Pl2R!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ef611d8-481b-4120-8dfc-d8ca6bb267fc_1194x225.png 424w, https://substackcdn.com/image/fetch/$s_!Pl2R!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ef611d8-481b-4120-8dfc-d8ca6bb267fc_1194x225.png 848w, https://substackcdn.com/image/fetch/$s_!Pl2R!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ef611d8-481b-4120-8dfc-d8ca6bb267fc_1194x225.png 1272w, https://substackcdn.com/image/fetch/$s_!Pl2R!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ef611d8-481b-4120-8dfc-d8ca6bb267fc_1194x225.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><em><strong>Figure 1.3 </strong>Input sentence fed to the LLM for next-word prediction.</em></p><p>Now look at the <strong>top 10 predicted next 
words</strong> along with their probabilities (refer to the notebook):</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QOgv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa180a073-691a-4a3c-a1b1-df38ae1a9514_450x519.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QOgv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa180a073-691a-4a3c-a1b1-df38ae1a9514_450x519.png 424w, https://substackcdn.com/image/fetch/$s_!QOgv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa180a073-691a-4a3c-a1b1-df38ae1a9514_450x519.png 848w, https://substackcdn.com/image/fetch/$s_!QOgv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa180a073-691a-4a3c-a1b1-df38ae1a9514_450x519.png 1272w, https://substackcdn.com/image/fetch/$s_!QOgv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa180a073-691a-4a3c-a1b1-df38ae1a9514_450x519.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QOgv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa180a073-691a-4a3c-a1b1-df38ae1a9514_450x519.png" width="450" height="519" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a180a073-691a-4a3c-a1b1-df38ae1a9514_450x519.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:519,&quot;width&quot;:450,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:27099,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa180a073-691a-4a3c-a1b1-df38ae1a9514_450x519.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!QOgv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa180a073-691a-4a3c-a1b1-df38ae1a9514_450x519.png 424w, https://substackcdn.com/image/fetch/$s_!QOgv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa180a073-691a-4a3c-a1b1-df38ae1a9514_450x519.png 848w, https://substackcdn.com/image/fetch/$s_!QOgv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa180a073-691a-4a3c-a1b1-df38ae1a9514_450x519.png 1272w, https://substackcdn.com/image/fetch/$s_!QOgv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa180a073-691a-4a3c-a1b1-df38ae1a9514_450x519.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em><strong>Figure 1.4 </strong>Top 10 predicted next words and their probabilities. The token &#8220;to&#8221; dominates at 90.7%, reflecting the most natural continuation.</em></p><p>The probabilistic nature of Large Language Models becomes clear when examining how they rank potential next words. Here the top-ranked token, &#8220;to,&#8221; has the highest probability at 90.7 percent because it is the most natural continuation of the given context. Further down the ranking the probabilities decrease, with each subsequent option representing a less common but still valid completion.</p><p>This distribution reveals the fundamental mechanism of Large Language Models: they function as probabilistic engines, predicting the most likely next token based on learned patterns. 
Rather than selecting a single correct answer, LLMs evaluate every possible next word and assign likelihood scores based on the vast patterns learned during training. This probabilistic approach enables models to generate diverse, contextually appropriate text while maintaining flexibility in their outputs.</p><h3>Why is There &#8220;Large&#8221; in LLMs?</h3><p>The term &#8220;Large&#8221; in Large Language Models reflects a fundamental principle: size directly impacts performance. <a href="https://arxiv.org/pdf/2001.08361">Scaling laws</a> show that model capabilities improve predictably with more parameters, enabling complex tasks like reasoning and code generation that smaller models cannot perform. Most critically, emergent properties such as arithmetic reasoning and multilingual understanding appear only when models cross certain size thresholds. This relationship between scale and capability explains why billions of parameters are essential for achieving sophisticated language understanding.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LBdv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6538586e-2173-419e-a626-603a5cb4add0_1251x669.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LBdv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6538586e-2173-419e-a626-603a5cb4add0_1251x669.png 424w, https://substackcdn.com/image/fetch/$s_!LBdv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6538586e-2173-419e-a626-603a5cb4add0_1251x669.png 848w, 
https://substackcdn.com/image/fetch/$s_!LBdv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6538586e-2173-419e-a626-603a5cb4add0_1251x669.png 1272w, https://substackcdn.com/image/fetch/$s_!LBdv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6538586e-2173-419e-a626-603a5cb4add0_1251x669.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LBdv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6538586e-2173-419e-a626-603a5cb4add0_1251x669.png" width="1251" height="669" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6538586e-2173-419e-a626-603a5cb4add0_1251x669.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:669,&quot;width&quot;:1251,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:117535,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6538586e-2173-419e-a626-603a5cb4add0_1251x669.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!LBdv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6538586e-2173-419e-a626-603a5cb4add0_1251x669.png 424w, 
https://substackcdn.com/image/fetch/$s_!LBdv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6538586e-2173-419e-a626-603a5cb4add0_1251x669.png 848w, https://substackcdn.com/image/fetch/$s_!LBdv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6538586e-2173-419e-a626-603a5cb4add0_1251x669.png 1272w, https://substackcdn.com/image/fetch/$s_!LBdv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6538586e-2173-419e-a626-603a5cb4add0_1251x669.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em><strong>Figure 1.5 </strong>Scaling laws demonstrate a predictable relationship between model size and performance across a range of benchmarks.</em></p><p>LLMs have <strong>billions to trillions of parameters.</strong> One of the most influential demonstrations of scaling was the <strong><a href="https://arxiv.org/pdf/2005.14165">GPT-3 paper</a></strong> (<em>Language Models are Few-Shot Learners</em>). It showed that <strong>as the model size increases</strong>, from <strong>1.3B parameters to 13B to 175B</strong>, the model&#8217;s performance <strong>improves dramatically.</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XCDN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7a90c35-172e-456a-a2eb-b2c48df03a09_1266x738.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XCDN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7a90c35-172e-456a-a2eb-b2c48df03a09_1266x738.png 424w, https://substackcdn.com/image/fetch/$s_!XCDN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7a90c35-172e-456a-a2eb-b2c48df03a09_1266x738.png 848w, https://substackcdn.com/image/fetch/$s_!XCDN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7a90c35-172e-456a-a2eb-b2c48df03a09_1266x738.png 1272w, 
https://substackcdn.com/image/fetch/$s_!XCDN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7a90c35-172e-456a-a2eb-b2c48df03a09_1266x738.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XCDN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7a90c35-172e-456a-a2eb-b2c48df03a09_1266x738.png" width="1266" height="738" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a7a90c35-172e-456a-a2eb-b2c48df03a09_1266x738.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:738,&quot;width&quot;:1266,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:109622,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7a90c35-172e-456a-a2eb-b2c48df03a09_1266x738.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!XCDN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7a90c35-172e-456a-a2eb-b2c48df03a09_1266x738.png 424w, https://substackcdn.com/image/fetch/$s_!XCDN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7a90c35-172e-456a-a2eb-b2c48df03a09_1266x738.png 848w, 
https://substackcdn.com/image/fetch/$s_!XCDN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7a90c35-172e-456a-a2eb-b2c48df03a09_1266x738.png 1272w, https://substackcdn.com/image/fetch/$s_!XCDN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7a90c35-172e-456a-a2eb-b2c48df03a09_1266x738.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em><strong>Figure 1.6 </strong>Exponential growth in the size of language models from the 1950s to today. 
Orange dots represent language models, some of which have already crossed one trillion parameters.</em></p><p>Over the years, from the 1950s to today, we have seen an <strong>exponential increase</strong> in the size of language models. In the graph above, the <strong>orange dots</strong> represent individual language models and show how drastically their size has increased over time. Some models have already <strong>crossed 1 trillion parameters!</strong></p><h3>Why do we care about the size of LLMs?</h3><p>The size of Large Language Models matters primarily because of emergent properties: abilities that are absent in smaller models but appear once models reach a certain scale. These emergent capabilities distinguish large models from their smaller counterparts. As LLMs grow beyond specific parameter thresholds, they acquire skills such as solving arithmetic problems, translating between languages with nuanced understanding, and unscrambling letters into meaningful words. 
These abilities do not gradually improve with size but rather emerge abruptly at particular scales, making model size not just a technical detail but a critical factor in determining what tasks an LLM can perform.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lcrc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ce4bbd8-55d7-4c6b-8e58-795c36ea19dd_1296x729.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lcrc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ce4bbd8-55d7-4c6b-8e58-795c36ea19dd_1296x729.png 424w, https://substackcdn.com/image/fetch/$s_!lcrc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ce4bbd8-55d7-4c6b-8e58-795c36ea19dd_1296x729.png 848w, https://substackcdn.com/image/fetch/$s_!lcrc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ce4bbd8-55d7-4c6b-8e58-795c36ea19dd_1296x729.png 1272w, https://substackcdn.com/image/fetch/$s_!lcrc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ce4bbd8-55d7-4c6b-8e58-795c36ea19dd_1296x729.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lcrc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ce4bbd8-55d7-4c6b-8e58-795c36ea19dd_1296x729.png" width="1296" height="729" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1ce4bbd8-55d7-4c6b-8e58-795c36ea19dd_1296x729.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:729,&quot;width&quot;:1296,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:161254,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ce4bbd8-55d7-4c6b-8e58-795c36ea19dd_1296x729.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!lcrc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ce4bbd8-55d7-4c6b-8e58-795c36ea19dd_1296x729.png 424w, https://substackcdn.com/image/fetch/$s_!lcrc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ce4bbd8-55d7-4c6b-8e58-795c36ea19dd_1296x729.png 848w, https://substackcdn.com/image/fetch/$s_!lcrc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ce4bbd8-55d7-4c6b-8e58-795c36ea19dd_1296x729.png 1272w, https://substackcdn.com/image/fetch/$s_!lcrc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ce4bbd8-55d7-4c6b-8e58-795c36ea19dd_1296x729.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em><strong>Figure 1.7 </strong>Emergent abilities of large language models. Performance on certain tasks remains near zero until the model reaches a critical size, after which accuracy jumps sharply.</em></p><p>In the figure above, the <strong>X-axis represents model size</strong> (or computational power), and we can observe a <strong>pickup point</strong>, a stage where models <strong>suddenly start performing significantly better</strong> at these tasks. 
<a href="https://arxiv.org/pdf/2206.07682">Emergent Abilities of Large Language Models</a></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4ztO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4ee2044-608e-4392-b839-8fbc250e4e43_1191x600.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4ztO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4ee2044-608e-4392-b839-8fbc250e4e43_1191x600.png 424w, https://substackcdn.com/image/fetch/$s_!4ztO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4ee2044-608e-4392-b839-8fbc250e4e43_1191x600.png 848w, https://substackcdn.com/image/fetch/$s_!4ztO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4ee2044-608e-4392-b839-8fbc250e4e43_1191x600.png 1272w, https://substackcdn.com/image/fetch/$s_!4ztO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4ee2044-608e-4392-b839-8fbc250e4e43_1191x600.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4ztO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4ee2044-608e-4392-b839-8fbc250e4e43_1191x600.png" width="1191" height="600" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c4ee2044-608e-4392-b839-8fbc250e4e43_1191x600.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:600,&quot;width&quot;:1191,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:35595,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4ee2044-608e-4392-b839-8fbc250e4e43_1191x600.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!4ztO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4ee2044-608e-4392-b839-8fbc250e4e43_1191x600.png 424w, https://substackcdn.com/image/fetch/$s_!4ztO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4ee2044-608e-4392-b839-8fbc250e4e43_1191x600.png 848w, https://substackcdn.com/image/fetch/$s_!4ztO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4ee2044-608e-4392-b839-8fbc250e4e43_1191x600.png 1272w, https://substackcdn.com/image/fetch/$s_!4ztO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4ee2044-608e-4392-b839-8fbc250e4e43_1191x600.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em><strong>Figure 1.8 </strong>At larger scales, LLMs move beyond simple word prediction to excel at specialized tasks such as multilingual translation, text summarization, and grammar correction.</em></p><p>As the figure shows, scale lets LLMs go beyond simple word prediction and handle specialized tasks such as multilingual translation, text summarization, and grammar correction. This evolution from basic prediction to complex language understanding drives the race to build ever larger models. 
The direct correlation between parameter count and performance across diverse NLP tasks makes scale a critical competitive advantage.</p><h1>1.2 Anatomy of the transformer block</h1><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!SNlH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe14e2ffb-c10d-48ec-987c-b9fe4248c1e7_933x1092.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!SNlH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe14e2ffb-c10d-48ec-987c-b9fe4248c1e7_933x1092.png 424w, https://substackcdn.com/image/fetch/$s_!SNlH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe14e2ffb-c10d-48ec-987c-b9fe4248c1e7_933x1092.png 848w, https://substackcdn.com/image/fetch/$s_!SNlH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe14e2ffb-c10d-48ec-987c-b9fe4248c1e7_933x1092.png 1272w, https://substackcdn.com/image/fetch/$s_!SNlH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe14e2ffb-c10d-48ec-987c-b9fe4248c1e7_933x1092.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!SNlH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe14e2ffb-c10d-48ec-987c-b9fe4248c1e7_933x1092.png" width="933" height="1092" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e14e2ffb-c10d-48ec-987c-b9fe4248c1e7_933x1092.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1092,&quot;width&quot;:933,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:222203,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe14e2ffb-c10d-48ec-987c-b9fe4248c1e7_933x1092.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!SNlH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe14e2ffb-c10d-48ec-987c-b9fe4248c1e7_933x1092.png 424w, https://substackcdn.com/image/fetch/$s_!SNlH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe14e2ffb-c10d-48ec-987c-b9fe4248c1e7_933x1092.png 848w, https://substackcdn.com/image/fetch/$s_!SNlH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe14e2ffb-c10d-48ec-987c-b9fe4248c1e7_933x1092.png 1272w, https://substackcdn.com/image/fetch/$s_!SNlH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe14e2ffb-c10d-48ec-987c-b9fe4248c1e7_933x1092.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em><strong>Figure 1.9 </strong>The original transformer architecture from the &#8220;Attention Is All You Need&#8221; paper, consisting of an encoder stack on the left and a decoder stack on the right, connected through cross-attention.</em></p><p>The transformer architecture, introduced in the groundbreaking 2017 paper <a href="https://arxiv.org/pdf/1706.03762">&#8220;Attention Is All You Need,&#8221;</a> revolutionized artificial intelligence and natural language processing. This paper, now with over 200,000 citations, proposed an architecture built entirely on attention mechanisms, with self-attention at its core, fundamentally changing how NLP systems are implemented. The original transformer consists of two main components: an encoder and a decoder. 
Encoder-only architectures power models like BERT, while decoder-only architectures form the basis of GPT and ChatGPT.</p><p>At the heart of modern LLMs lies this transformer architecture, which replaced recurrent models like LSTMs and GRUs with self-attention mechanisms. This innovation brought crucial advantages: the ability to capture long-range dependencies in text, parallel processing that enables faster training, and a scalability that allows building increasingly powerful models. Understanding how the decoder portion works essentially reveals how GPT models function, as they are decoder-only architectures.</p><p>The transformer block itself sits within a pipeline of key components working in sequence. Input text is first tokenized and converted to embeddings, which are then combined with positional encodings. These vectors flow through layers of multi-head attention, normalization, and feed-forward networks, with dropout applied for regularization. The output layer finally produces logits for next-token prediction. 
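</p><p>The flow described above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not a reference implementation: it uses a single attention head instead of multiple heads, omits dropout and the learned scale and shift of layer normalization, applies normalization before each sub-layer (an ordering chosen here for simplicity), and uses random weights and toy sizes. All function and parameter names are hypothetical.</p>

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum, used for the residual (skip) connections."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def layer_norm(x, eps=1e-5):
    """Normalize each token vector to zero mean and unit variance."""
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out

def softmax(row):
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(x, wq, wk, wv):
    """Single-head attention in which token i only sees tokens 0..i."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    scale = math.sqrt(len(q[0]))
    weights = []
    for i, qi in enumerate(q):
        scores = [sum(a * b for a, b in zip(qi, k[j])) / scale for j in range(i + 1)]
        # Future positions get zero weight (the causal mask).
        weights.append(softmax(scores) + [0.0] * (len(x) - i - 1))
    return matmul(weights, v)

def feed_forward(x, w1, w2):
    """Two linear layers with a ReLU in between, applied per token."""
    hidden = [[max(0.0, h) for h in row] for row in matmul(x, w1)]
    return matmul(hidden, w2)

def transformer_block(x, p):
    """Attention, then feed-forward, each wrapped in norm plus residual."""
    x = add(x, causal_self_attention(layer_norm(x), p["wq"], p["wk"], p["wv"]))
    return add(x, feed_forward(layer_norm(x), p["w1"], p["w2"]))

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
seq_len, d_model, d_hidden = 3, 4, 8  # toy sizes, chosen arbitrarily
params = {
    "wq": random_matrix(d_model, d_model, rng),
    "wk": random_matrix(d_model, d_model, rng),
    "wv": random_matrix(d_model, d_model, rng),
    "w1": random_matrix(d_model, d_hidden, rng),
    "w2": random_matrix(d_hidden, d_model, rng),
}
x = random_matrix(seq_len, d_model, rng)  # stands in for token + positional embeddings
y = transformer_block(x, params)
print(len(y), "tokens,", len(y[0]), "features each")
```

<p>The block returns one vector per input token with the same width as its input, which is what allows many such blocks to be stacked before the output layer turns the final vectors into logits.</p><p>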
While the complete architecture diagram may appear complex with its numerous modules and connections, each component serves a specific purpose in transforming input text into meaningful predictions.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!BWeA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27cb3b6d-7c69-482f-b5eb-1f87e1a76494_1170x849.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!BWeA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27cb3b6d-7c69-482f-b5eb-1f87e1a76494_1170x849.png 424w, https://substackcdn.com/image/fetch/$s_!BWeA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27cb3b6d-7c69-482f-b5eb-1f87e1a76494_1170x849.png 848w, https://substackcdn.com/image/fetch/$s_!BWeA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27cb3b6d-7c69-482f-b5eb-1f87e1a76494_1170x849.png 1272w, https://substackcdn.com/image/fetch/$s_!BWeA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27cb3b6d-7c69-482f-b5eb-1f87e1a76494_1170x849.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!BWeA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27cb3b6d-7c69-482f-b5eb-1f87e1a76494_1170x849.png" width="1170" height="849" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/27cb3b6d-7c69-482f-b5eb-1f87e1a76494_1170x849.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:849,&quot;width&quot;:1170,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:38806,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27cb3b6d-7c69-482f-b5eb-1f87e1a76494_1170x849.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!BWeA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27cb3b6d-7c69-482f-b5eb-1f87e1a76494_1170x849.png 424w, https://substackcdn.com/image/fetch/$s_!BWeA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27cb3b6d-7c69-482f-b5eb-1f87e1a76494_1170x849.png 848w, https://substackcdn.com/image/fetch/$s_!BWeA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27cb3b6d-7c69-482f-b5eb-1f87e1a76494_1170x849.png 1272w, https://substackcdn.com/image/fetch/$s_!BWeA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27cb3b6d-7c69-482f-b5eb-1f87e1a76494_1170x849.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em><strong>Figure 1.10 </strong>A simplified decoder-only transformer showing the main components: token and positional embeddings, transformer blocks with multi-head attention, feed-forward networks, layer normalization, and dropout, followed by the output layer.</em></p><p>The decoder-only architecture, which powers models like GPT, can be understood by examining a simplified version of the transformer&#8217;s decoder. Although the complete architecture may appear complex, with numerous interconnected modules, we can break it down into three manageable parts. This modular approach lets us examine each component systematically rather than attempting to grasp the entire system at once. 
By focusing on these three core sections sequentially, we can build a comprehensive understanding of how the decoder transforms input text into predictions.</p><p>The three parts of an LLM&#8217;s architecture are Input, Processing, and Output.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dA9N!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc765a97b-410b-42db-8bf8-3c0e11fe5587_1119x900.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dA9N!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc765a97b-410b-42db-8bf8-3c0e11fe5587_1119x900.png 424w, https://substackcdn.com/image/fetch/$s_!dA9N!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc765a97b-410b-42db-8bf8-3c0e11fe5587_1119x900.png 848w, https://substackcdn.com/image/fetch/$s_!dA9N!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc765a97b-410b-42db-8bf8-3c0e11fe5587_1119x900.png 1272w, https://substackcdn.com/image/fetch/$s_!dA9N!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc765a97b-410b-42db-8bf8-3c0e11fe5587_1119x900.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dA9N!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc765a97b-410b-42db-8bf8-3c0e11fe5587_1119x900.png" width="1119" height="900" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c765a97b-410b-42db-8bf8-3c0e11fe5587_1119x900.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:900,&quot;width&quot;:1119,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:44586,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc765a97b-410b-42db-8bf8-3c0e11fe5587_1119x900.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!dA9N!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc765a97b-410b-42db-8bf8-3c0e11fe5587_1119x900.png 424w, https://substackcdn.com/image/fetch/$s_!dA9N!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc765a97b-410b-42db-8bf8-3c0e11fe5587_1119x900.png 848w, https://substackcdn.com/image/fetch/$s_!dA9N!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc765a97b-410b-42db-8bf8-3c0e11fe5587_1119x900.png 1272w, https://substackcdn.com/image/fetch/$s_!dA9N!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc765a97b-410b-42db-8bf8-3c0e11fe5587_1119x900.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em><strong>Figure 1.11 </strong>The three stages of an LLM: the Input stage (tokenization and embeddings), the Processing stage (transformer blocks), and the Output stage (linear layer and softmax for next-token prediction).</em></p><p>Everything begins with the <strong>input stage</strong>, where the text undergoes several key transformations before it enters the <strong>processing unit</strong>, commonly known as the <strong>Transformer block</strong>.</p><p>First, the raw text undergoes <strong>tokenization</strong>, a process that breaks the sentence into smaller units called <strong>tokens</strong>. These can be words, subwords, or characters, depending on the tokenization method used. 
This step ensures that the model can handle language efficiently.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mlkV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf3bf47e-a3bb-4250-9d59-0718154cdccd_1272x264.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mlkV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf3bf47e-a3bb-4250-9d59-0718154cdccd_1272x264.png 424w, https://substackcdn.com/image/fetch/$s_!mlkV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf3bf47e-a3bb-4250-9d59-0718154cdccd_1272x264.png 848w, https://substackcdn.com/image/fetch/$s_!mlkV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf3bf47e-a3bb-4250-9d59-0718154cdccd_1272x264.png 1272w, https://substackcdn.com/image/fetch/$s_!mlkV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf3bf47e-a3bb-4250-9d59-0718154cdccd_1272x264.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!mlkV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf3bf47e-a3bb-4250-9d59-0718154cdccd_1272x264.png" width="1272" height="264" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/df3bf47e-a3bb-4250-9d59-0718154cdccd_1272x264.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:264,&quot;width&quot;:1272,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:22449,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf3bf47e-a3bb-4250-9d59-0718154cdccd_1272x264.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!mlkV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf3bf47e-a3bb-4250-9d59-0718154cdccd_1272x264.png 424w, https://substackcdn.com/image/fetch/$s_!mlkV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf3bf47e-a3bb-4250-9d59-0718154cdccd_1272x264.png 848w, https://substackcdn.com/image/fetch/$s_!mlkV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf3bf47e-a3bb-4250-9d59-0718154cdccd_1272x264.png 1272w, https://substackcdn.com/image/fetch/$s_!mlkV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf3bf47e-a3bb-4250-9d59-0718154cdccd_1272x264.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><em><strong>Figure 1.12 </strong>The input pipeline: raw text is tokenized into subword units, each token receives a numerical embedding, and positional embeddings are added to 
encode sequence order.</em></p><p>Next, each token is converted into a numerical representation through <strong>token embeddings</strong>. These embeddings map each token in the vocabulary to its own learned vector, capturing semantic meaning and relationships between words. However, since token embeddings alone do not preserve the sequence order, we introduce <strong>positional embeddings</strong>. These embeddings encode the position of each token within the sentence, allowing the model to understand the <strong>order and structure</strong> of the input.</p><p>With tokenization, token embeddings, and positional embeddings in place, the input is now fully prepared for the <strong>Transformer block</strong>, where mechanisms such as <strong>multi-head attention and feed-forward neural networks</strong> process the text to generate predictions.</p><h2>1.3 Tokenization</h2><p>Before text enters a transformer model, it undergoes tokenization, a process that converts raw text into tokens, which are then mapped to unique integer IDs. There are three main tokenization approaches (word-based, character-based, and subword-based), each with distinct characteristics. 
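</p><p>To make the three approaches concrete, here is a minimal Python sketch that tokenizes the same sentence at the word, character, and subword level and then assigns integer IDs. The sample sentence, the tiny <code>SUBWORD_VOCAB</code> set, and the helper names <code>subword_tokenize</code> and <code>assign_ids</code> are all hypothetical choices made for illustration. Production tokenizers learn their subword vocabulary from data rather than relying on a hand-written list and a greedy longest-match rule.</p><pre><code class="language-python">import re

text = "the playground is unplayable"

# Word-based: one token per word (punctuation marks become their own tokens).
word_tokens = re.findall(r"\w+|[^\w\s]", text)

# Character-based: one token per character, spaces included.
char_tokens = list(text)

# Subword-based: greedy longest-match against a tiny hand-written vocabulary.
# This vocabulary is purely hypothetical; real tokenizers learn theirs from data.
SUBWORD_VOCAB = {"the", "play", "ground", "is", "un", "able"}


def subword_tokenize(word, vocab):
    """Split one word into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start != len(word):
        # Try the longest remaining slice first, then shrink it.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No vocabulary entry matches: fall back to a single character.
            pieces.append(word[start])
            start += 1
    return pieces


subword_tokens = [
    piece for word in word_tokens for piece in subword_tokenize(word, SUBWORD_VOCAB)
]


def assign_ids(tokens):
    """Map each distinct token to an integer ID in order of first appearance."""
    vocab = {}
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))
    return [vocab[tok] for tok in tokens]


print(word_tokens)
print(len(char_tokens))
print(subword_tokens)
print(assign_ids(subword_tokens))
</code></pre><p>On this sentence the word-level split yields 4 tokens, the character-level split 28, and the toy subword split 7, with the piece "play" receiving the same ID in both words that contain it.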
</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!I-93!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7f83f76-66c2-4c71-a80c-5bc1539234cc_1530x183.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!I-93!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7f83f76-66c2-4c71-a80c-5bc1539234cc_1530x183.png 424w, https://substackcdn.com/image/fetch/$s_!I-93!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7f83f76-66c2-4c71-a80c-5bc1539234cc_1530x183.png 848w, https://substackcdn.com/image/fetch/$s_!I-93!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7f83f76-66c2-4c71-a80c-5bc1539234cc_1530x183.png 1272w, https://substackcdn.com/image/fetch/$s_!I-93!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7f83f76-66c2-4c71-a80c-5bc1539234cc_1530x183.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!I-93!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7f83f76-66c2-4c71-a80c-5bc1539234cc_1530x183.png" width="1456" height="174" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b7f83f76-66c2-4c71-a80c-5bc1539234cc_1530x183.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:174,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:17787,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7f83f76-66c2-4c71-a80c-5bc1539234cc_1530x183.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!I-93!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7f83f76-66c2-4c71-a80c-5bc1539234cc_1530x183.png 424w, https://substackcdn.com/image/fetch/$s_!I-93!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7f83f76-66c2-4c71-a80c-5bc1539234cc_1530x183.png 848w, https://substackcdn.com/image/fetch/$s_!I-93!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7f83f76-66c2-4c71-a80c-5bc1539234cc_1530x183.png 1272w, https://substackcdn.com/image/fetch/$s_!I-93!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7f83f76-66c2-4c71-a80c-5bc1539234cc_1530x183.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><em><strong>Figure 1.13 </strong>Three tokenization strategies applied to the word &#8220;tokenization&#8221;: word-based (one token per word), character-based (one token per 
character), and subword-based (meaningful subword units).</em></p><p>Word-based tokenization treats each complete word as a separate token, so the vocabulary is a dictionary of every distinct word. While intuitive, this approach struggles with vocabulary size and cannot handle new or misspelled words effectively. Character-based tokenization breaks text down to individual characters, making each character a token. This creates a very small vocabulary but produces much longer sequences that are computationally expensive to process.</p><p>Subword-based tokenization, the preferred method for modern LLMs, breaks words into meaningful subword units. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xlNJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb290ec8d-c044-4329-aaee-57bd7d9d00c1_459x264.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xlNJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb290ec8d-c044-4329-aaee-57bd7d9d00c1_459x264.png 424w, https://substackcdn.com/image/fetch/$s_!xlNJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb290ec8d-c044-4329-aaee-57bd7d9d00c1_459x264.png 848w, https://substackcdn.com/image/fetch/$s_!xlNJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb290ec8d-c044-4329-aaee-57bd7d9d00c1_459x264.png 1272w, https://substackcdn.com/image/fetch/$s_!xlNJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb290ec8d-c044-4329-aaee-57bd7d9d00c1_459x264.png 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xlNJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb290ec8d-c044-4329-aaee-57bd7d9d00c1_459x264.png" width="459" height="264" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b290ec8d-c044-4329-aaee-57bd7d9d00c1_459x264.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:264,&quot;width&quot;:459,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:14039,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb290ec8d-c044-4329-aaee-57bd7d9d00c1_459x264.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!xlNJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb290ec8d-c044-4329-aaee-57bd7d9d00c1_459x264.png 424w, https://substackcdn.com/image/fetch/$s_!xlNJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb290ec8d-c044-4329-aaee-57bd7d9d00c1_459x264.png 848w, https://substackcdn.com/image/fetch/$s_!xlNJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb290ec8d-c044-4329-aaee-57bd7d9d00c1_459x264.png 1272w, https://substackcdn.com/image/fetch/$s_!xlNJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb290ec8d-c044-4329-aaee-57bd7d9d00c1_459x264.png 
1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em><strong>Figure 1.14 </strong>Subword tokenization example: the word &#8220;playground&#8221; splits into &#8220;play&#8221; and &#8220;ground,&#8221; each a reusable meaningful unit.</em></p><p>A subword is a smaller meaningful unit that can be reused across different words. For example, &#8220;playground&#8221; might split into &#8220;play&#8221; and &#8220;ground,&#8221; while &#8220;unhappiness&#8221; could become &#8220;un&#8221; and &#8220;happiness.&#8221; This approach allows models to understand new words by recognizing familiar components. 
The splits need not follow linguistic boundaries: the word &#8220;neural&#8221; might tokenize as &#8220;ne&#8221; and &#8220;ural,&#8221; yet the model can still handle variations and new combinations it has never seen before.</p><p>The advantage of subword tokenization becomes clear when dealing with related words that share common roots. Instead of treating each variation as a completely new token, the model can leverage shared subword patterns. This reduces vocabulary size while maintaining the ability to represent any text, making it the standard choice for Large Language Models. Tools like the <a href="https://tiktokenizer.vercel.app/">TikTokenizer</a> demonstrate how original text gets broken down into these subword tokens, revealing the building blocks that LLMs use to understand and generate language.</p><h4>1.3.1 Problems with Tokenization Methods</h4><p><strong>Word-Based Tokenization Limitations</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vw13!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a6bed41-e0cd-4bb9-86e0-6e59a95df8ff_1155x639.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vw13!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a6bed41-e0cd-4bb9-86e0-6e59a95df8ff_1155x639.png 424w, https://substackcdn.com/image/fetch/$s_!vw13!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a6bed41-e0cd-4bb9-86e0-6e59a95df8ff_1155x639.png 848w, https://substackcdn.com/image/fetch/$s_!vw13!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a6bed41-e0cd-4bb9-86e0-6e59a95df8ff_1155x639.png 1272w, 
https://substackcdn.com/image/fetch/$s_!vw13!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a6bed41-e0cd-4bb9-86e0-6e59a95df8ff_1155x639.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vw13!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a6bed41-e0cd-4bb9-86e0-6e59a95df8ff_1155x639.png" width="1155" height="639" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8a6bed41-e0cd-4bb9-86e0-6e59a95df8ff_1155x639.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:639,&quot;width&quot;:1155,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:29329,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a6bed41-e0cd-4bb9-86e0-6e59a95df8ff_1155x639.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!vw13!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a6bed41-e0cd-4bb9-86e0-6e59a95df8ff_1155x639.png 424w, https://substackcdn.com/image/fetch/$s_!vw13!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a6bed41-e0cd-4bb9-86e0-6e59a95df8ff_1155x639.png 848w, 
https://substackcdn.com/image/fetch/$s_!vw13!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a6bed41-e0cd-4bb9-86e0-6e59a95df8ff_1155x639.png 1272w, https://substackcdn.com/image/fetch/$s_!vw13!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a6bed41-e0cd-4bb9-86e0-6e59a95df8ff_1155x639.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em><strong>Figure 1.15 </strong>Limitations of word-based tokenization: related words like &#8220;learn,&#8221; 
&#8220;learning,&#8221; &#8220;learned,&#8221; and &#8220;learnt&#8221; are treated as entirely separate tokens, and out-of-vocabulary words cannot be processed.</em></p><p>Word-based tokenization treats each word as an independent unit, creating fundamental challenges for language models. The first issue is the failure to recognize relationships between related words. Words sharing a common root, like &#8220;learn,&#8221; &#8220;learning,&#8221; &#8220;learned,&#8221; and &#8220;learnt,&#8221; are treated as entirely separate tokens, forcing the model to learn each variation independently without understanding their connection.</p><p>Vocabulary explosion presents another critical problem. A word-level vocabulary for English alone can require over 200,000 entries, with filler words like &#8220;this,&#8221; &#8220;is,&#8221; and &#8220;a&#8221; taking up vocabulary space despite contributing minimal semantic value. Most critically, the out-of-vocabulary problem renders models helpless when encountering unseen words. 
A simple spelling mistake turns &#8220;running&#8221; into the unrecognizable &#8220;runing,&#8221; and new terms or proper nouns become impossible to process, leaving the model unable to make educated guesses about their meaning.</p><p><strong>Character-Based Tokenization Drawbacks</strong></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_K34!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbedeb0d-7a49-47d9-a970-34169a660d45_549x219.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_K34!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbedeb0d-7a49-47d9-a970-34169a660d45_549x219.png 424w, https://substackcdn.com/image/fetch/$s_!_K34!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbedeb0d-7a49-47d9-a970-34169a660d45_549x219.png 848w, https://substackcdn.com/image/fetch/$s_!_K34!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbedeb0d-7a49-47d9-a970-34169a660d45_549x219.png 1272w, https://substackcdn.com/image/fetch/$s_!_K34!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbedeb0d-7a49-47d9-a970-34169a660d45_549x219.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_K34!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbedeb0d-7a49-47d9-a970-34169a660d45_549x219.png" width="549" height="219" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fbedeb0d-7a49-47d9-a970-34169a660d45_549x219.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:219,&quot;width&quot;:549,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:16296,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbedeb0d-7a49-47d9-a970-34169a660d45_549x219.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_K34!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbedeb0d-7a49-47d9-a970-34169a660d45_549x219.png 424w, https://substackcdn.com/image/fetch/$s_!_K34!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbedeb0d-7a49-47d9-a970-34169a660d45_549x219.png 848w, https://substackcdn.com/image/fetch/$s_!_K34!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbedeb0d-7a49-47d9-a970-34169a660d45_549x219.png 1272w, https://substackcdn.com/image/fetch/$s_!_K34!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffbedeb0d-7a49-47d9-a970-34169a660d45_549x219.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><em><strong>Figure 1.16 </strong>Character-based tokenization reduces the vocabulary to 256 single-byte (extended ASCII) characters but dramatically increases sequence length.</em></p><p>Character tokenization 
solves the vocabulary-size problem by using at most 256 single-byte characters (standard ASCII defines 128 of them), but creates severe new problems. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7jkO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd95ab27-dc06-4249-8b38-4800ba0b15f2_1290x642.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7jkO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd95ab27-dc06-4249-8b38-4800ba0b15f2_1290x642.png 424w, https://substackcdn.com/image/fetch/$s_!7jkO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd95ab27-dc06-4249-8b38-4800ba0b15f2_1290x642.png 848w, https://substackcdn.com/image/fetch/$s_!7jkO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd95ab27-dc06-4249-8b38-4800ba0b15f2_1290x642.png 1272w, https://substackcdn.com/image/fetch/$s_!7jkO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd95ab27-dc06-4249-8b38-4800ba0b15f2_1290x642.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7jkO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd95ab27-dc06-4249-8b38-4800ba0b15f2_1290x642.png" width="1290" height="642" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bd95ab27-dc06-4249-8b38-4800ba0b15f2_1290x642.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:642,&quot;width&quot;:1290,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:36069,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd95ab27-dc06-4249-8b38-4800ba0b15f2_1290x642.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7jkO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd95ab27-dc06-4249-8b38-4800ba0b15f2_1290x642.png 424w, https://substackcdn.com/image/fetch/$s_!7jkO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd95ab27-dc06-4249-8b38-4800ba0b15f2_1290x642.png 848w, https://substackcdn.com/image/fetch/$s_!7jkO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd95ab27-dc06-4249-8b38-4800ba0b15f2_1290x642.png 1272w, https://substackcdn.com/image/fetch/$s_!7jkO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd95ab27-dc06-4249-8b38-4800ba0b15f2_1290x642.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><p><em><strong>Figure 1.17 </strong>Sequence length explosion with character tokenization: &#8220;Hello, world!&#8221; grows from a few word tokens to thirteen character tokens, and individual letters carry no semantic meaning.</em></p><p>Sequence length explodes dramatically: &#8220;Hello, world!&#8221; grows from two (or six) word tokens to thirteen character tokens. This expansion makes processing computationally expensive and quickly exhausts context windows in longer texts.</p><p>More fundamentally, character tokenization destroys semantic understanding. Individual letters carry no meaning, forcing models to reconstruct word boundaries and meanings from scratch. The model cannot recognize that &#8220;lowest&#8221; and &#8220;highest&#8221; share the meaningful suffix &#8220;est&#8221; indicating superlatives. 
When presented with &#8220;Hello, world!&#8221; as individual characters, the model sees thirteen isolated symbols rather than a greeting, and the structure of the language is lost.</p><p><strong>The Subword Tokenization Solution</strong></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hCuh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea116c52-9a03-463c-9c38-007670046d9e_531x234.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hCuh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea116c52-9a03-463c-9c38-007670046d9e_531x234.png 424w, https://substackcdn.com/image/fetch/$s_!hCuh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea116c52-9a03-463c-9c38-007670046d9e_531x234.png 848w, https://substackcdn.com/image/fetch/$s_!hCuh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea116c52-9a03-463c-9c38-007670046d9e_531x234.png 1272w, https://substackcdn.com/image/fetch/$s_!hCuh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea116c52-9a03-463c-9c38-007670046d9e_531x234.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hCuh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea116c52-9a03-463c-9c38-007670046d9e_531x234.png" width="531" height="234" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ea116c52-9a03-463c-9c38-007670046d9e_531x234.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:234,&quot;width&quot;:531,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:9463,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea116c52-9a03-463c-9c38-007670046d9e_531x234.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!hCuh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea116c52-9a03-463c-9c38-007670046d9e_531x234.png 424w, https://substackcdn.com/image/fetch/$s_!hCuh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea116c52-9a03-463c-9c38-007670046d9e_531x234.png 848w, https://substackcdn.com/image/fetch/$s_!hCuh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea116c52-9a03-463c-9c38-007670046d9e_531x234.png 1272w, https://substackcdn.com/image/fetch/$s_!hCuh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea116c52-9a03-463c-9c38-007670046d9e_531x234.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><em><strong>Figure 1.18 </strong>Subword tokenization splits &#8220;modernization&#8221; into &#8220;modern&#8221; and &#8220;ization,&#8221; both reusable components found across many 
English words.</em></p><p>Subword tokenization provides the optimal balance, breaking words into meaningful components. &#8220;Modernization&#8221; becomes &#8220;modern&#8221; and &#8220;ization,&#8221; both reusable parts appearing across many words. This approach maintains reasonable vocabulary size while preserving meaning, handles new words through familiar components, and keeps token counts manageable. The model can now understand misspellings and new terms by recognizing known subword patterns.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!VTMU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00c70d8d-df18-45eb-937c-3f601cc9d126_1149x552.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!VTMU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00c70d8d-df18-45eb-937c-3f601cc9d126_1149x552.png 424w, https://substackcdn.com/image/fetch/$s_!VTMU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00c70d8d-df18-45eb-937c-3f601cc9d126_1149x552.png 848w, https://substackcdn.com/image/fetch/$s_!VTMU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00c70d8d-df18-45eb-937c-3f601cc9d126_1149x552.png 1272w, https://substackcdn.com/image/fetch/$s_!VTMU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00c70d8d-df18-45eb-937c-3f601cc9d126_1149x552.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!VTMU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00c70d8d-df18-45eb-937c-3f601cc9d126_1149x552.png" width="1149" height="552" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/00c70d8d-df18-45eb-937c-3f601cc9d126_1149x552.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:552,&quot;width&quot;:1149,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:17076,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00c70d8d-df18-45eb-937c-3f601cc9d126_1149x552.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!VTMU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00c70d8d-df18-45eb-937c-3f601cc9d126_1149x552.png 424w, https://substackcdn.com/image/fetch/$s_!VTMU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00c70d8d-df18-45eb-937c-3f601cc9d126_1149x552.png 848w, https://substackcdn.com/image/fetch/$s_!VTMU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00c70d8d-df18-45eb-937c-3f601cc9d126_1149x552.png 1272w, https://substackcdn.com/image/fetch/$s_!VTMU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00c70d8d-df18-45eb-937c-3f601cc9d126_1149x552.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em><strong>Figure 1.19 </strong>The tokenization challenge: how should &#8220;learning&#8221; be split? Byte Pair Encoding provides a systematic, data-driven answer.</em></p><p>The challenge remains: how should &#8220;learning&#8221; be tokenized? As one token, as &#8220;learn&#8221; plus &#8220;ing,&#8221; or broken down further? 
Byte Pair Encoding provides the systematic answer, using frequency analysis to determine optimal splits that balance vocabulary efficiency with semantic preservation.</p><h2>1.4 Byte Pair Encoding</h2><p>Byte Pair Encoding transforms the challenge of tokenization into a systematic process. Originally developed as a text compression algorithm in the 1990s, BPE now serves as the foundation for tokenization in models like GPT. The algorithm iteratively merges the most frequent character pairs, building a vocabulary from the bottom up.</p><p><strong>For LLMs, BPE builds vocabularies systematically. Consider this corpus with word frequencies:</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!D0Cy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8ed98b9-629f-4e2a-a6b2-eb93e6178b85_1431x366.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!D0Cy!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8ed98b9-629f-4e2a-a6b2-eb93e6178b85_1431x366.png 424w, https://substackcdn.com/image/fetch/$s_!D0Cy!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8ed98b9-629f-4e2a-a6b2-eb93e6178b85_1431x366.png 848w, https://substackcdn.com/image/fetch/$s_!D0Cy!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8ed98b9-629f-4e2a-a6b2-eb93e6178b85_1431x366.png 1272w, https://substackcdn.com/image/fetch/$s_!D0Cy!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8ed98b9-629f-4e2a-a6b2-eb93e6178b85_1431x366.png 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!D0Cy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8ed98b9-629f-4e2a-a6b2-eb93e6178b85_1431x366.png" width="1431" height="366" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d8ed98b9-629f-4e2a-a6b2-eb93e6178b85_1431x366.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:366,&quot;width&quot;:1431,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:24562,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8ed98b9-629f-4e2a-a6b2-eb93e6178b85_1431x366.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!D0Cy!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8ed98b9-629f-4e2a-a6b2-eb93e6178b85_1431x366.png 424w, https://substackcdn.com/image/fetch/$s_!D0Cy!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8ed98b9-629f-4e2a-a6b2-eb93e6178b85_1431x366.png 848w, https://substackcdn.com/image/fetch/$s_!D0Cy!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8ed98b9-629f-4e2a-a6b2-eb93e6178b85_1431x366.png 1272w, 
https://substackcdn.com/image/fetch/$s_!D0Cy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8ed98b9-629f-4e2a-a6b2-eb93e6178b85_1431x366.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em><strong>Figure 1.20 </strong>A small corpus with word frequencies used to illustrate the BPE algorithm: &#8220;old&#8221; appears 7 times, &#8220;older&#8221; 3 times, &#8220;finest&#8221; 9 times, and &#8220;lowest&#8221; 4 times.</em></p><p><strong>Step 1: Add End-of-Word Markers</strong></p><div class="captioned-image-container"><figure><a 
class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YLzn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feba016d7-e2e2-4f5d-bf3f-3b163fe0cf73_1395x384.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!YLzn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feba016d7-e2e2-4f5d-bf3f-3b163fe0cf73_1395x384.png 424w, https://substackcdn.com/image/fetch/$s_!YLzn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feba016d7-e2e2-4f5d-bf3f-3b163fe0cf73_1395x384.png 848w, https://substackcdn.com/image/fetch/$s_!YLzn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feba016d7-e2e2-4f5d-bf3f-3b163fe0cf73_1395x384.png 1272w, https://substackcdn.com/image/fetch/$s_!YLzn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feba016d7-e2e2-4f5d-bf3f-3b163fe0cf73_1395x384.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!YLzn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feba016d7-e2e2-4f5d-bf3f-3b163fe0cf73_1395x384.png" width="1395" height="384" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/eba016d7-e2e2-4f5d-bf3f-3b163fe0cf73_1395x384.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:384,&quot;width&quot;:1395,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:43145,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feba016d7-e2e2-4f5d-bf3f-3b163fe0cf73_1395x384.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!YLzn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feba016d7-e2e2-4f5d-bf3f-3b163fe0cf73_1395x384.png 424w, https://substackcdn.com/image/fetch/$s_!YLzn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feba016d7-e2e2-4f5d-bf3f-3b163fe0cf73_1395x384.png 848w, https://substackcdn.com/image/fetch/$s_!YLzn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feba016d7-e2e2-4f5d-bf3f-3b163fe0cf73_1395x384.png 1272w, https://substackcdn.com/image/fetch/$s_!YLzn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feba016d7-e2e2-4f5d-bf3f-3b163fe0cf73_1395x384.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><p><em><strong>Figure 1.21 </strong>End-of-word markers (&lt;/w&gt;) appended to each word to distinguish word boundaries and preserve morphological information.</em></p><p>The first step in BPE adds <strong>end-of-word markers (&lt;/w&gt;)</strong> to distinguish word boundaries. Words transform as: old becomes <strong>old&lt;/w&gt;</strong>, older becomes <strong>older&lt;/w&gt;</strong>, finest becomes <strong>finest&lt;/w&gt;</strong>, and lowest becomes <strong>lowest&lt;/w&gt;</strong>. This <strong>boundary marker</strong> is crucial because the same character sequence carries different meanings based on position. 
The sequence &#8220;est&#8221; functions as a <strong>suffix</strong> in &#8220;lowest&lt;/w&gt;&#8221; (indicating a superlative) but sits at the <strong>start</strong> of &#8220;esteem,&#8221; where it carries no such meaning. Without these markers, the tokenizer cannot distinguish between identical character sequences that serve different linguistic roles, and it loses critical information about <strong>word structure and morphology</strong>.</p><p><strong>Step 2: Split into Characters</strong></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!J2gh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bfe92a5-ce38-40ff-9c4a-0e539b565be7_1053x231.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!J2gh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bfe92a5-ce38-40ff-9c4a-0e539b565be7_1053x231.png 424w, https://substackcdn.com/image/fetch/$s_!J2gh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bfe92a5-ce38-40ff-9c4a-0e539b565be7_1053x231.png 848w, https://substackcdn.com/image/fetch/$s_!J2gh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bfe92a5-ce38-40ff-9c4a-0e539b565be7_1053x231.png 1272w, https://substackcdn.com/image/fetch/$s_!J2gh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bfe92a5-ce38-40ff-9c4a-0e539b565be7_1053x231.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!J2gh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bfe92a5-ce38-40ff-9c4a-0e539b565be7_1053x231.png" width="1053" height="231" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5bfe92a5-ce38-40ff-9c4a-0e539b565be7_1053x231.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:231,&quot;width&quot;:1053,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:22886,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bfe92a5-ce38-40ff-9c4a-0e539b565be7_1053x231.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!J2gh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bfe92a5-ce38-40ff-9c4a-0e539b565be7_1053x231.png 424w, https://substackcdn.com/image/fetch/$s_!J2gh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bfe92a5-ce38-40ff-9c4a-0e539b565be7_1053x231.png 848w, https://substackcdn.com/image/fetch/$s_!J2gh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bfe92a5-ce38-40ff-9c4a-0e539b565be7_1053x231.png 1272w, https://substackcdn.com/image/fetch/$s_!J2gh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bfe92a5-ce38-40ff-9c4a-0e539b565be7_1053x231.png 1456w" 
sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><em><strong>Figure 1.22 </strong>Each word is decomposed into individual characters, providing atomic units for iterative merging.</em></p><p>After adding end-of-word markers, each word is <strong>decomposed into individual characters</strong>, treating each as a separate token. The word <strong>old&lt;/w&gt;</strong> becomes the sequence [o, l, d, &lt;/w&gt;], while <strong>older&lt;/w&gt;</strong> splits into [o, l, d, e, r, &lt;/w&gt;]. Similarly, <strong>finest&lt;/w&gt;</strong> breaks down to [f, i, n, e, s, t, &lt;/w&gt;] and <strong>lowest&lt;/w&gt;</strong> to [l, o, w, e, s, t, &lt;/w&gt;]. This <strong>character-level decomposition</strong> serves as the starting point for BPE, providing the <strong>atomic units</strong> from which larger, more meaningful tokens will be built through iterative merging based on frequency patterns in the data.</p><p><strong>Step 3: Count Character Pairs &amp; Merge</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7WLZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8e6e3f6-aa81-484d-9562-0403a2c4a516_1584x471.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7WLZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8e6e3f6-aa81-484d-9562-0403a2c4a516_1584x471.png 424w, https://substackcdn.com/image/fetch/$s_!7WLZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8e6e3f6-aa81-484d-9562-0403a2c4a516_1584x471.png 848w, 
https://substackcdn.com/image/fetch/$s_!7WLZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8e6e3f6-aa81-484d-9562-0403a2c4a516_1584x471.png 1272w, https://substackcdn.com/image/fetch/$s_!7WLZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8e6e3f6-aa81-484d-9562-0403a2c4a516_1584x471.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7WLZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8e6e3f6-aa81-484d-9562-0403a2c4a516_1584x471.png" width="1456" height="433" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a8e6e3f6-aa81-484d-9562-0403a2c4a516_1584x471.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:433,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:63668,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8e6e3f6-aa81-484d-9562-0403a2c4a516_1584x471.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7WLZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8e6e3f6-aa81-484d-9562-0403a2c4a516_1584x471.png 424w, 
https://substackcdn.com/image/fetch/$s_!7WLZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8e6e3f6-aa81-484d-9562-0403a2c4a516_1584x471.png 848w, https://substackcdn.com/image/fetch/$s_!7WLZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8e6e3f6-aa81-484d-9562-0403a2c4a516_1584x471.png 1272w, https://substackcdn.com/image/fetch/$s_!7WLZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8e6e3f6-aa81-484d-9562-0403a2c4a516_1584x471.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em><strong>Figure 1.23 </strong>Counting adjacent character pairs across the corpus, weighted by word frequency. The pair &#8220;es&#8221; appears 13 times and is selected for the first merge.</em></p><p>The algorithm now <strong>counts all adjacent character pairs</strong> across the corpus, weighted by word frequency. The pair &#8220;es&#8221; appears <strong>13 times</strong> (9 from &#8220;finest&#8221; plus 4 from &#8220;lowest&#8221;), as does &#8220;st&#8221; with the same distribution. The pairs &#8220;ol&#8221; and &#8220;ld&#8221; each appear <strong>10 times</strong> (7 from &#8220;old&#8221; plus 3 from &#8220;older&#8221;), while &#8220;ne&#8221; and &#8220;in&#8221; from &#8220;finest&#8221; contribute <strong>9 occurrences each</strong>.</p><p>With &#8220;es&#8221; tied with &#8220;st&#8221; as the <strong>most frequent pair</strong> and chosen ahead of it, the algorithm performs its <strong>first merge</strong>, creating a new token &#8220;es&#8221; and updating the representations: finest&lt;/w&gt; becomes [f, i, n, <strong>es</strong>, t, &lt;/w&gt;] and lowest&lt;/w&gt; becomes [l, o, w, <strong>es</strong>, t, &lt;/w&gt;]. 
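</p><p>The counting and merging step above can be sketched in a few lines of Python. This is a minimal illustration, not a production tokenizer: the word frequencies come from the worked example, while the <code>END</code> stand-in for the end-of-word marker, the helper names <code>count_pairs</code> and <code>merge_pair</code>, and the rule that ties go to the pair seen first are assumptions made for this sketch.</p>

```python
from collections import Counter

END = "_"  # stand-in for the end-of-word marker (an assumption of this sketch)

# Word frequencies taken from the worked example in the text.
corpus = {"old": 7, "older": 3, "finest": 9, "lowest": 4}

# Each word starts as a tuple of single characters plus the end marker.
words = {tuple(word) + (END,): freq for word, freq in corpus.items()}


def count_pairs(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts


def merge_pair(words, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i != len(symbols):
            if symbols[i:i + 2] == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged


counts = count_pairs(words)
# max() returns the first pair with the highest count, so ties
# (for example "es" vs "st") go to the pair that was seen first.
best = max(counts, key=counts.get)
words_after = merge_pair(words, best)

print(best, counts[best])
for symbols in words_after:
    print(list(symbols))
```

<p>On this toy corpus the sketch should select the pair (e, s) with a count of 13 and rewrite &#8220;finest&#8221; and &#8220;lowest&#8221; with a single &#8220;es&#8221; symbol. Calling the two helpers again on <code>words_after</code> picks the pair (es, t), which is the next merge in the sequence.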
</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RodN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0bfcee5-e085-4ce2-a16d-b556d30a5280_1572x381.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RodN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0bfcee5-e085-4ce2-a16d-b556d30a5280_1572x381.png 424w, https://substackcdn.com/image/fetch/$s_!RodN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0bfcee5-e085-4ce2-a16d-b556d30a5280_1572x381.png 848w, https://substackcdn.com/image/fetch/$s_!RodN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0bfcee5-e085-4ce2-a16d-b556d30a5280_1572x381.png 1272w, https://substackcdn.com/image/fetch/$s_!RodN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0bfcee5-e085-4ce2-a16d-b556d30a5280_1572x381.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RodN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0bfcee5-e085-4ce2-a16d-b556d30a5280_1572x381.png" width="1456" height="353" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b0bfcee5-e085-4ce2-a16d-b556d30a5280_1572x381.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:353,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:58354,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0bfcee5-e085-4ce2-a16d-b556d30a5280_1572x381.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!RodN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0bfcee5-e085-4ce2-a16d-b556d30a5280_1572x381.png 424w, https://substackcdn.com/image/fetch/$s_!RodN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0bfcee5-e085-4ce2-a16d-b556d30a5280_1572x381.png 848w, https://substackcdn.com/image/fetch/$s_!RodN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0bfcee5-e085-4ce2-a16d-b556d30a5280_1572x381.png 1272w, https://substackcdn.com/image/fetch/$s_!RodN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0bfcee5-e085-4ce2-a16d-b556d30a5280_1572x381.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><em><strong>Figure 1.24 </strong>After the first merge (&#8220;es&#8221;), pairs are recounted and &#8220;est&#8221; emerges as the next most frequent pair, triggering a second 
merge.</em></p><p>After recounting pairs with this new token, the pair (&#8220;es&#8221;, &#8220;t&#8221;) is now among the most frequent at 13 occurrences (9 from &#8220;finest&#8221; plus 4 from &#8220;lowest&#8221;), triggering a <strong>second merge</strong>. The token &#8220;est&#8221; replaces each adjacent &#8220;es&#8221;, &#8220;t&#8221; sequence, transforming finest&lt;/w&gt; into [f, i, n, <strong>est</strong>, &lt;/w&gt;] and lowest&lt;/w&gt; into [l, o, w, <strong>est</strong>, &lt;/w&gt;]. Through these <strong>iterative merges</strong>, BPE progressively builds larger, more meaningful tokens from the most frequent patterns in the data, creating an efficient vocabulary that captures common linguistic structures.</p><p><strong>Step 4: Building the Complete Vocabulary</strong></p><p>The merging process continues iteratively, identifying increasingly complex patterns. <strong>Common prefixes</strong> like &#8220;old&#8221; become single tokens when they appear frequently across multiple words. <strong>Suffixes with end markers</strong> like &#8220;est&lt;/w&gt;&#8221; are preserved as units to maintain their grammatical function. <strong>Frequent character sequences</strong> like &#8220;low&#8221; merge into single tokens regardless of their position.</p><p>After multiple iterations, the final vocabulary becomes a <strong>hierarchical collection</strong> of tokens at different granularities. It contains <strong>individual characters</strong> [o, l, d, e, r, f, i, n, w, s, t] for handling rare sequences, <strong>common subwords</strong> [es, est, old, low, fin] that appear across multiple words, and <strong>complete frequent words</strong> [old&lt;/w&gt;, finest&lt;/w&gt;] that occur often enough to warrant their own tokens. 
This multi-level vocabulary enables <strong>efficient encoding</strong> of common patterns while maintaining the flexibility to tokenize any possible input.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!04vJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc538cc8f-8181-4aaf-ac59-07988e26b952_1200x615.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!04vJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc538cc8f-8181-4aaf-ac59-07988e26b952_1200x615.png 424w, https://substackcdn.com/image/fetch/$s_!04vJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc538cc8f-8181-4aaf-ac59-07988e26b952_1200x615.png 848w, https://substackcdn.com/image/fetch/$s_!04vJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc538cc8f-8181-4aaf-ac59-07988e26b952_1200x615.png 1272w, https://substackcdn.com/image/fetch/$s_!04vJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc538cc8f-8181-4aaf-ac59-07988e26b952_1200x615.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!04vJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc538cc8f-8181-4aaf-ac59-07988e26b952_1200x615.png" width="1200" height="615" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c538cc8f-8181-4aaf-ac59-07988e26b952_1200x615.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:615,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:51408,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc538cc8f-8181-4aaf-ac59-07988e26b952_1200x615.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!04vJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc538cc8f-8181-4aaf-ac59-07988e26b952_1200x615.png 424w, https://substackcdn.com/image/fetch/$s_!04vJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc538cc8f-8181-4aaf-ac59-07988e26b952_1200x615.png 848w, https://substackcdn.com/image/fetch/$s_!04vJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc538cc8f-8181-4aaf-ac59-07988e26b952_1200x615.png 1272w, https://substackcdn.com/image/fetch/$s_!04vJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc538cc8f-8181-4aaf-ac59-07988e26b952_1200x615.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em><strong>Figure 1.25 </strong>GPT-2&#8217;s final vocabulary of 50,257 tokens, built from 50,000 BPE merges. Each token is mapped to a unique numerical ID used by the model internally.</em></p><p>GPT-2 performs <strong>50,000 merges</strong> to build its vocabulary, creating a rich token set that balances compression with expressiveness. Each token in this vocabulary is assigned a <strong>unique token ID</strong>, a numerical identifier that the model uses internally. 
For example, in GPT-2&#8217;s vocabulary, a common word like &#8220;Building&#8221; maps to ID 25954, while the special token &#8220;&lt;|endoftext|&gt;&#8221; has ID 50256, completing a <strong>dictionary of 50,257 token-ID pairs</strong> that serves as the bridge between text and numerical processing.</p><p>When the tokenizer encounters an unfamiliar word, it <strong>gracefully degrades</strong> to smaller subwords or individual characters, ensuring robust handling of misspellings, neologisms, or foreign terms. This <strong>fallback mechanism</strong> makes BPE remarkably resilient, capable of processing any text while maintaining efficiency for common patterns.</p><p>With our text now converted into meaningful tokens through BPE and mapped to numerical IDs, the next challenge is transforming these discrete symbols into continuous numerical representations that neural networks can process, leading us to the crucial concept of <strong>embeddings</strong>.</p><h2>1.5 Word Embedding</h2><p>After tokenization transforms text into discrete symbols and assigns them numerical IDs, we face a fundamental challenge: these IDs are merely labels that convey no semantic information. The token ID 25954 for &#8220;Building&#8221; tells the model nothing about buildings, construction, or architecture. To enable neural networks to process language meaningfully, we need to convert these discrete tokens into <strong>continuous numerical representations</strong> that capture semantic relationships. This is where <strong>word embeddings</strong> become essential.</p><p><strong>The Limitations of Simple Encoding</strong></p><p>Early approaches to numerical representation revealed critical limitations. <strong>One-hot encoding</strong> represents each token as a vector of zeros with a single one at the token&#8217;s position. For a vocabulary of 50,000 tokens, &#8220;cat&#8221; might be encoded as a 50,000-dimensional vector that is all zeros except for a single one at position 3. 
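</p><p>As a minimal sketch of the one-hot scheme just described, the snippet below uses a hypothetical five-word vocabulary in place of 50,000 tokens. The vocabulary, the helper names <code>one_hot</code> and <code>dot</code>, and the choice to place &#8220;cat&#8221; at index 3 are assumptions made for illustration.</p>

```python
# Toy vocabulary (hypothetical); a real tokenizer has tens of thousands of entries.
vocab = ["the", "dog", "quantum", "cat", "bites"]
token_to_id = {token: i for i, token in enumerate(vocab)}


def one_hot(token):
    """Return a vector of zeros with a single 1 at the token's index."""
    vec = [0] * len(vocab)
    vec[token_to_id[token]] = 1
    return vec


def dot(u, v):
    """Dot product; it is 0 for any two different one-hot vectors."""
    return sum(a * b for a, b in zip(u, v))


cat, dog, quantum = one_hot("cat"), one_hot("dog"), one_hot("quantum")
print(cat)                # the single 1 sits at index 3
print(dot(cat, dog))      # related words: no overlap
print(dot(cat, quantum))  # unrelated words: no overlap either
```

<p>Both dot products come out as zero, so the encoding gives the model no way to tell a related pair from an unrelated one.</p><p>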
While this eliminates arbitrary ordering, it creates <strong>sparse, high-dimensional vectors</strong> where every word is equally distant from every other word. The vectors for &#8220;cat&#8221; and &#8220;dog&#8221; are as orthogonal as those for &#8220;cat&#8221; and &#8220;quantum&#8221;, providing no semantic signal. Similarly, <strong>bag-of-words</strong> models count word occurrences but lose all sequential information, treating &#8220;dog bites man&#8221; and &#8220;man bites dog&#8221; identically despite their opposite meanings.</p><p><strong>Learning Meaning Through Context</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3_j-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dd0160d-a948-49a9-a777-81e1d4e39e49_666x501.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3_j-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dd0160d-a948-49a9-a777-81e1d4e39e49_666x501.png 424w, https://substackcdn.com/image/fetch/$s_!3_j-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dd0160d-a948-49a9-a777-81e1d4e39e49_666x501.png 848w, https://substackcdn.com/image/fetch/$s_!3_j-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dd0160d-a948-49a9-a777-81e1d4e39e49_666x501.png 1272w, https://substackcdn.com/image/fetch/$s_!3_j-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dd0160d-a948-49a9-a777-81e1d4e39e49_666x501.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!3_j-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dd0160d-a948-49a9-a777-81e1d4e39e49_666x501.png" width="666" height="501" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3dd0160d-a948-49a9-a777-81e1d4e39e49_666x501.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:501,&quot;width&quot;:666,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:26903,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dd0160d-a948-49a9-a777-81e1d4e39e49_666x501.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3_j-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dd0160d-a948-49a9-a777-81e1d4e39e49_666x501.png 424w, https://substackcdn.com/image/fetch/$s_!3_j-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dd0160d-a948-49a9-a777-81e1d4e39e49_666x501.png 848w, https://substackcdn.com/image/fetch/$s_!3_j-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dd0160d-a948-49a9-a777-81e1d4e39e49_666x501.png 1272w, https://substackcdn.com/image/fetch/$s_!3_j-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3dd0160d-a948-49a9-a777-81e1d4e39e49_666x501.png 1456w" sizes="100vw" 
loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em><strong>Figure 1.26 </strong>High-dimensional embeddings transform semantic meaning into geometric coordinates. In this 768-dimensional space, linguistic relationships are defined by proximity, grouping concepts like animals and fruits into distinct neighborhoods.</em></p><p>The breakthrough came from the <strong>distributional hypothesis</strong>: words appearing in similar contexts tend to have similar meanings. 
If &#8220;coffee&#8221; frequently appears near &#8220;morning,&#8221; &#8220;cup,&#8221; and &#8220;brew,&#8221; while &#8220;tea&#8221; appears near similar words, a model can learn that coffee and tea are related concepts. <strong>Word2Vec</strong> revolutionized this approach by training neural networks to predict words from context (CBOW) or context from words (Skip-gram). Through millions of training examples, the network&#8217;s hidden layer learns to position similar words near each other in vector space. After training, &#8220;king&#8221; naturally clusters near &#8220;queen&#8221; and &#8220;prince,&#8221; while &#8220;banana&#8221; groups with &#8220;apple&#8221; and &#8220;fruit.&#8221; Most remarkably, these embeddings capture <strong>analogical relationships</strong> geometrically: the vector arithmetic &#8220;king - man + woman&#8221; yields a vector nearly identical to &#8220;queen,&#8221; demonstrating that the model has learned abstract concepts like gender and royalty as directions in space.</p><p><strong>Embeddings in Large Language Models</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8vEa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d9853a7-0260-4dd4-9c1e-d135ab75264e_537x363.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8vEa!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d9853a7-0260-4dd4-9c1e-d135ab75264e_537x363.png 424w, https://substackcdn.com/image/fetch/$s_!8vEa!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d9853a7-0260-4dd4-9c1e-d135ab75264e_537x363.png 848w, 
https://substackcdn.com/image/fetch/$s_!8vEa!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d9853a7-0260-4dd4-9c1e-d135ab75264e_537x363.png 1272w, https://substackcdn.com/image/fetch/$s_!8vEa!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d9853a7-0260-4dd4-9c1e-d135ab75264e_537x363.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8vEa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d9853a7-0260-4dd4-9c1e-d135ab75264e_537x363.png" width="537" height="363" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9d9853a7-0260-4dd4-9c1e-d135ab75264e_537x363.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:363,&quot;width&quot;:537,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:15732,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d9853a7-0260-4dd4-9c1e-d135ab75264e_537x363.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!8vEa!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d9853a7-0260-4dd4-9c1e-d135ab75264e_537x363.png 424w, https://substackcdn.com/image/fetch/$s_!8vEa!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d9853a7-0260-4dd4-9c1e-d135ab75264e_537x363.png 
848w, https://substackcdn.com/image/fetch/$s_!8vEa!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d9853a7-0260-4dd4-9c1e-d135ab75264e_537x363.png 1272w, https://substackcdn.com/image/fetch/$s_!8vEa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9d9853a7-0260-4dd4-9c1e-d135ab75264e_537x363.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em><strong>Figure 1.27:</strong> Dense embedding vectors transform token IDs into high-dimensional representations where specific 
dimensions encode learned semantic features. This layer functions as a learned lookup table that maps tokens to unique vectors, enabling the model to capture nuanced semantic attributes like parts of speech or categorical relationships.</em></p><p>Modern LLMs transform token IDs into <strong>dense embedding vectors</strong>, typically ranging from 768 to 4096 dimensions, where each dimension encodes aspects of meaning learned during training. In Word2Vec, each word&#8217;s single fixed vector is the final representation. Transformer models instead build <strong>contextual representations</strong>: the embedding layer supplies one learned vector per token, and the attention layers then adjust it based on surrounding tokens. The word &#8220;bank&#8221; therefore ends up with different vector representations in &#8220;river bank&#8221; versus &#8220;investment bank,&#8221; enabling the model to disambiguate meaning through context. These embeddings are learned end-to-end during training, with the model learning representations that improve its ability to predict the next token. The embedding layer itself is a <strong>learned lookup table</strong> that maps each of the 50,000+ token IDs to a unique vector in high-dimensional space, where semantic similarity translates to geometric proximity.</p><p>The power of LLM embeddings lies in their ability to encode multiple layers of linguistic information simultaneously. Each vector captures <strong>semantic meaning</strong> (cat near dog), <strong>syntactic roles</strong> (verbs clustering separately from nouns), <strong>conceptual relationships</strong> (similar terms grouping together), and even <strong>abstract patterns</strong> like sentiment or formality. Through billions of training examples, the model learns to position tokens in this space such that vector operations correspond to meaningful transformations. 
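</p><p>The lookup-table view of the embedding layer, and the idea that similarity becomes geometric proximity, can be sketched as follows. The three tokens, their IDs, and the 4-dimensional vectors are made-up values chosen for illustration; a trained model learns its table and uses far more dimensions.</p>

```python
import math

# Hypothetical 4-dimensional embedding table; real models learn these values
# during training and use hundreds or thousands of dimensions.
embedding_table = [
    [0.9, 0.8, 0.1, 0.0],  # ID 0: "cat"
    [0.8, 0.9, 0.2, 0.1],  # ID 1: "dog"
    [0.1, 0.0, 0.9, 0.8],  # ID 2: "banana"
]
token_to_id = {"cat": 0, "dog": 1, "banana": 2}


def embed(token_ids):
    """Embedding lookup: each token ID selects one row of the table."""
    return [embedding_table[i] for i in token_ids]


def cosine(u, v):
    """Cosine similarity: 1.0 means same direction, 0.0 means orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


cat, dog, banana = embed([token_to_id[t] for t in ("cat", "dog", "banana")])
print(round(cosine(cat, dog), 3))
print(round(cosine(cat, banana), 3))
```

<p>With these hand-picked vectors, the cat and dog pair scores far higher than the cat and banana pair, which is the kind of structure a trained embedding table is expected to show.</p><p>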
This geometric structure enables transformers to perform complex reasoning by manipulating these vectors through attention mechanisms and feed-forward networks, turning language understanding into mathematical computation.</p><p>However, embeddings alone cannot capture the sequential nature of language, where word order fundamentally changes meaning. This limitation leads us to <strong>positional embeddings</strong>, which encode each token&#8217;s location in the sequence, enabling transformers to understand that &#8220;dog bites man&#8221; differs crucially from &#8220;man bites dog.&#8221;</p><h2>Positional Embedding</h2><p><strong>The Need for Positional Information</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Ln5k!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb87adbe9-89ae-4701-859b-1ae1a681b7d1_1713x585.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Ln5k!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb87adbe9-89ae-4701-859b-1ae1a681b7d1_1713x585.png 424w, https://substackcdn.com/image/fetch/$s_!Ln5k!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb87adbe9-89ae-4701-859b-1ae1a681b7d1_1713x585.png 848w, https://substackcdn.com/image/fetch/$s_!Ln5k!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb87adbe9-89ae-4701-859b-1ae1a681b7d1_1713x585.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Ln5k!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb87adbe9-89ae-4701-859b-1ae1a681b7d1_1713x585.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Ln5k!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb87adbe9-89ae-4701-859b-1ae1a681b7d1_1713x585.png" width="1456" height="497" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b87adbe9-89ae-4701-859b-1ae1a681b7d1_1713x585.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:497,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:52289,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb87adbe9-89ae-4701-859b-1ae1a681b7d1_1713x585.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Ln5k!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb87adbe9-89ae-4701-859b-1ae1a681b7d1_1713x585.png 424w, https://substackcdn.com/image/fetch/$s_!Ln5k!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb87adbe9-89ae-4701-859b-1ae1a681b7d1_1713x585.png 848w, 
https://substackcdn.com/image/fetch/$s_!Ln5k!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb87adbe9-89ae-4701-859b-1ae1a681b7d1_1713x585.png 1272w, https://substackcdn.com/image/fetch/$s_!Ln5k!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb87adbe9-89ae-4701-859b-1ae1a681b7d1_1713x585.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em><strong>Figure 1.28:</strong> Positional embeddings introduce sequential context into the Transformer architecture by adding 
unique position-aware vectors to token embeddings. This summation allows the model to distinguish identical tokens appearing at different sequence locations, enabling the architecture to capture syntactic and referential relationships despite its inherently parallel, set-based processing nature.</em></p><p>In natural language, word order fundamentally shapes meaning. Consider the sentences &#8220;The dog chased the cat&#8221; versus &#8220;The cat chased the dog.&#8221; While both sentences contain identical words, their meanings differ entirely based on word positioning. Traditional sequential models like RNNs inherently capture this ordering through their recurrent nature. However, the Transformer architecture processes all tokens simultaneously through self-attention, treating input as an unordered set. Without explicit positional information, a Transformer would produce identical representations for &#8220;dog&#8221; regardless of its position in the sentence, making it impossible to distinguish between different occurrences or understand sequential relationships.</p><p>This limitation becomes particularly problematic when dealing with pronouns and references. In &#8220;The dog chased the ball but it could not catch it,&#8221; the two instances of &#8220;it&#8221; refer to different entities based solely on their positions relative to other words. 
To address this fundamental limitation, Transformers incorporate positional embeddings that encode sequence order information directly into the model&#8217;s representations.</p><h3><strong>Integer Positional Encoding: The Simplest Approach</strong></h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HHUV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8e966e3-34bc-4a6b-bb67-3dd12b2eb2fd_690x570.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HHUV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8e966e3-34bc-4a6b-bb67-3dd12b2eb2fd_690x570.png 424w, https://substackcdn.com/image/fetch/$s_!HHUV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8e966e3-34bc-4a6b-bb67-3dd12b2eb2fd_690x570.png 848w, https://substackcdn.com/image/fetch/$s_!HHUV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8e966e3-34bc-4a6b-bb67-3dd12b2eb2fd_690x570.png 1272w, https://substackcdn.com/image/fetch/$s_!HHUV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8e966e3-34bc-4a6b-bb67-3dd12b2eb2fd_690x570.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HHUV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8e966e3-34bc-4a6b-bb67-3dd12b2eb2fd_690x570.png" width="690" height="570" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c8e966e3-34bc-4a6b-bb67-3dd12b2eb2fd_690x570.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:570,&quot;width&quot;:690,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:24282,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8e966e3-34bc-4a6b-bb67-3dd12b2eb2fd_690x570.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!HHUV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8e966e3-34bc-4a6b-bb67-3dd12b2eb2fd_690x570.png 424w, https://substackcdn.com/image/fetch/$s_!HHUV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8e966e3-34bc-4a6b-bb67-3dd12b2eb2fd_690x570.png 848w, https://substackcdn.com/image/fetch/$s_!HHUV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8e966e3-34bc-4a6b-bb67-3dd12b2eb2fd_690x570.png 1272w, https://substackcdn.com/image/fetch/$s_!HHUV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8e966e3-34bc-4a6b-bb67-3dd12b2eb2fd_690x570.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p><em><strong>Figure 1.29:</strong> Integer positional encoding adds an integer-valued position vector to each token embedding. The drawback is that large position values can swamp the token embedding and obscure the word&#8217;s semantic meaning.</em></p><p>The most straightforward solution involves assigning each position a unique integer value. In this scheme, if a token appears at position 300 in the sequence, we create a positional embedding vector where every dimension contains the value 300. This vector has the same dimensionality as the token embedding and is added to it element-wise.</p><p>For a concrete example with an 8-dimensional embedding space, the token &#8220;dog&#8221; at position 300 would receive a positional embedding of [300, 300, 300, 300, 300, 300, 300, 300]. 
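</p><p>As a minimal sketch of this additive scheme, the snippet below builds the integer position vector for position 300 and adds it to an 8-dimensional token embedding. The function name <code>integer_positional_embedding</code> and the token embedding values are made up for illustration; they do not come from a trained model.</p>
<pre>
```python
def integer_positional_embedding(pos, d_model):
    """Return a vector of length d_model in which every entry equals the position."""
    return [float(pos)] * d_model


# Hypothetical 8-dimensional embedding for the token "dog" (illustrative values only).
token_embedding = [0.23, -0.15, 0.08, 0.41, -0.32, 0.05, -0.27, 0.19]

position_vector = integer_positional_embedding(300, len(token_embedding))

# Element-wise sum of the token embedding and the positional embedding.
model_input = [t + p for t, p in zip(token_embedding, position_vector)]

print(position_vector)
print(model_input)
```
</pre><p>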
The final input representation becomes the sum of the token embedding and this positional embedding.</p><p>However, this approach suffers from a critical flaw: scale mismatch. Token embeddings typically contain small values clustered around zero, carefully learned to capture semantic nuances. Position values, especially for longer sequences, can grow arbitrarily large. When position 500 adds [500, 500, ...] to delicate token embeddings with values like [0.23, -0.15, 0.08, ...], the positional signal completely overwhelms the semantic information. The model loses the ability to distinguish between different words, focusing instead on their positions.</p><h3><strong>Binary Positional Encoding: Constraining the Range</strong></h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gwb8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c62418b-3bbd-4695-8be8-70ab0e87e522_630x594.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gwb8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c62418b-3bbd-4695-8be8-70ab0e87e522_630x594.png 424w, https://substackcdn.com/image/fetch/$s_!gwb8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c62418b-3bbd-4695-8be8-70ab0e87e522_630x594.png 848w, https://substackcdn.com/image/fetch/$s_!gwb8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c62418b-3bbd-4695-8be8-70ab0e87e522_630x594.png 1272w, 
https://substackcdn.com/image/fetch/$s_!gwb8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c62418b-3bbd-4695-8be8-70ab0e87e522_630x594.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gwb8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c62418b-3bbd-4695-8be8-70ab0e87e522_630x594.png" width="630" height="594" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5c62418b-3bbd-4695-8be8-70ab0e87e522_630x594.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:594,&quot;width&quot;:630,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:23898,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c62418b-3bbd-4695-8be8-70ab0e87e522_630x594.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!gwb8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c62418b-3bbd-4695-8be8-70ab0e87e522_630x594.png 424w, https://substackcdn.com/image/fetch/$s_!gwb8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c62418b-3bbd-4695-8be8-70ab0e87e522_630x594.png 848w, https://substackcdn.com/image/fetch/$s_!gwb8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c62418b-3bbd-4695-8be8-70ab0e87e522_630x594.png 
1272w, https://substackcdn.com/image/fetch/$s_!gwb8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c62418b-3bbd-4695-8be8-70ab0e87e522_630x594.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em><strong>Figure 1.30:</strong> This technique represents positions using binary bit strings to keep values between 0 and 1, but it creates sudden jumps in the embedding space that complicate the training process for the model.</em></p><p>To address the magnitude problem inherent in integer encoding, binary positional encoding represents positions using 
their binary representation, naturally constraining all values between 0 and 1. This approach transforms each position number into its binary form and uses those bits directly as the positional embedding vector.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3CDH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54369dc5-d301-4650-8ba6-9ca2659b7a85_858x471.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3CDH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54369dc5-d301-4650-8ba6-9ca2659b7a85_858x471.png 424w, https://substackcdn.com/image/fetch/$s_!3CDH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54369dc5-d301-4650-8ba6-9ca2659b7a85_858x471.png 848w, https://substackcdn.com/image/fetch/$s_!3CDH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54369dc5-d301-4650-8ba6-9ca2659b7a85_858x471.png 1272w, https://substackcdn.com/image/fetch/$s_!3CDH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54369dc5-d301-4650-8ba6-9ca2659b7a85_858x471.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3CDH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54369dc5-d301-4650-8ba6-9ca2659b7a85_858x471.png" width="858" height="471" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/54369dc5-d301-4650-8ba6-9ca2659b7a85_858x471.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:471,&quot;width&quot;:858,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:52764,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54369dc5-d301-4650-8ba6-9ca2659b7a85_858x471.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3CDH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54369dc5-d301-4650-8ba6-9ca2659b7a85_858x471.png 424w, https://substackcdn.com/image/fetch/$s_!3CDH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54369dc5-d301-4650-8ba6-9ca2659b7a85_858x471.png 848w, https://substackcdn.com/image/fetch/$s_!3CDH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54369dc5-d301-4650-8ba6-9ca2659b7a85_858x471.png 1272w, https://substackcdn.com/image/fetch/$s_!3CDH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54369dc5-d301-4650-8ba6-9ca2659b7a85_858x471.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p><em><strong>Figure 1.31:</strong> This figure shows how binary representations change across consecutive integer positions. The least significant bit flips at every position, a high-frequency oscillation that makes optimization more difficult for the model.</em></p><p>Consider the visualization showing positions 64 through 75 with their 8-bit binary representations. Position 64, which equals 01000000 in binary, becomes the embedding vector [0, 1, 0, 0, 0, 0, 0, 0]. Here, each bit position corresponds to a dimension in the embedding space, with i=8 (the leftmost entry) representing the most significant bit (MSB) and i=1 (the rightmost entry) representing the least significant bit (LSB).</p><p>Looking at the pattern across consecutive positions reveals a regular structure. Position 64 starts with [0, 1, 0, 0, 0, 0, 0, 0]. 
Position 65 becomes [0, 1, 0, 0, 0, 0, 0, 1], position 66 becomes [0, 1, 0, 0, 0, 0, 1, 0], and position 67 becomes [0, 1, 0, 0, 0, 0, 1, 1]. The rightmost bit (i=1) flips with every single position increment, creating a rapid alternation between 0 and 1.</p><p>The second bit from the right (i=2) follows a slower rhythm, holding its value for two positions before flipping. It stays 0 for positions 64-65, switches to 1 for positions 66-67, returns to 0 for positions 68-69, and so forth. The third bit (i=3) changes every four positions, staying 0 for positions 64-67 and then 1 for positions 68-71.</p><p>This creates a hierarchical encoding scheme where each bit position operates at a different frequency. The LSB oscillates most rapidly, capturing fine-grained positional differences between adjacent tokens. Moving leftward through the bits, the oscillation frequency halves at each step. The fourth bit changes every 8 positions, the fifth every 16 positions, the sixth every 32 positions, and the seventh every 64 positions. The MSB (i=8) remains constant for 128 consecutive positions before flipping.</p><p>In the visualization, this pattern becomes immediately apparent. The rightmost column (i=1) alternates between 0 and 1 at every position. The i=2 column displays pairs of identical cells. The i=3 column shows groups of four, and this doubling pattern continues across all dimensions. The leftmost column (i=8) is uniformly 0 across the entire visible range, as positions 64-75 all share the same MSB value of 0.</p><p>This encoding elegantly solves the scale problem that plagued integer encoding. Instead of values potentially reaching into the thousands, every dimension now contains either 0 or 1. 
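</p><p>The bit patterns described above can be reproduced with a short Python sketch. The helper name <code>binary_encoding</code> and the 8-bit width are assumptions chosen to match the figure; they are not part of any library.</p>
<pre>
```python
def binary_encoding(pos, n_bits=8):
    """Return the bits of pos as a list, most significant bit (i = n_bits) first."""
    return [(pos // 2 ** (n_bits - 1 - k)) % 2 for k in range(n_bits)]


positions = list(range(64, 76))

# One row per position, as in the visualization of positions 64 through 75.
for pos in positions:
    print(pos, binary_encoding(pos))

# Read individual columns: i = 1 is the last entry, i = 3 is the third from the end.
lsb_column = [binary_encoding(pos)[-1] for pos in positions]
third_column = [binary_encoding(pos)[-3] for pos in positions]

print("i=1:", lsb_column)    # alternates at every position
print("i=3:", third_column)  # changes every four positions
```
</pre><p>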
When added to token embeddings clustered around zero, these binary values preserve the semantic information while injecting positional signals at a comparable scale.</p><p>The hierarchical structure gives the model positional information at multiple granularities simultaneously. Lower-indexed dimensions (small i, the rightmost bits) encode local sequential relationships, helping the model tell which tokens appear near each other. Higher-indexed dimensions (large i, the leftmost bits) capture global positional context, indicating whether a token appears in the first or second half of the sequence, or in an early or late quarter.</p><p>However, binary encoding introduces a critical limitation: discontinuity. The hard transitions between 0 and 1 create step functions rather than smooth curves. Between positions 67 ([0, 1, 0, 0, 0, 0, 1, 1]) and 68 ([0, 1, 0, 0, 0, 1, 0, 0]), for example, three of the eight dimensions flip simultaneously, even though the positions are adjacent. These abrupt changes complicate gradient-based optimization, because the loss landscape can contain sharp edges and discontinuous regions.</p><p>During backpropagation, these discrete jumps hinder smooth gradient flow. Small parameter updates cannot gradually shift the model&#8217;s understanding between binary states. The optimizer must work around these discontinuities, and may get stuck in suboptimal configurations or need careful learning rate scheduling to cope with the non-smooth optimization landscape.</p><p>Despite these challenges, binary encoding demonstrates the key insight that positional information can be encoded through patterns of oscillation at different frequencies. 
This idea, that different dimensions can operate at different scales, motivates sinusoidal positional encoding, which keeps these useful oscillatory patterns while ensuring continuous, differentiable representations throughout the embedding space.</p><h3><strong>Sinusoidal Positional Encoding: Continuous Representations</strong></h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!oceS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa821d6d5-84c3-420b-bfb0-78ef1bd2eb14_1443x627.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!oceS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa821d6d5-84c3-420b-bfb0-78ef1bd2eb14_1443x627.png 424w, https://substackcdn.com/image/fetch/$s_!oceS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa821d6d5-84c3-420b-bfb0-78ef1bd2eb14_1443x627.png 848w, https://substackcdn.com/image/fetch/$s_!oceS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa821d6d5-84c3-420b-bfb0-78ef1bd2eb14_1443x627.png 1272w, https://substackcdn.com/image/fetch/$s_!oceS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa821d6d5-84c3-420b-bfb0-78ef1bd2eb14_1443x627.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!oceS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa821d6d5-84c3-420b-bfb0-78ef1bd2eb14_1443x627.png" 
width="1443" height="627" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a821d6d5-84c3-420b-bfb0-78ef1bd2eb14_1443x627.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:627,&quot;width&quot;:1443,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:47580,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa821d6d5-84c3-420b-bfb0-78ef1bd2eb14_1443x627.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!oceS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa821d6d5-84c3-420b-bfb0-78ef1bd2eb14_1443x627.png 424w, https://substackcdn.com/image/fetch/$s_!oceS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa821d6d5-84c3-420b-bfb0-78ef1bd2eb14_1443x627.png 848w, https://substackcdn.com/image/fetch/$s_!oceS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa821d6d5-84c3-420b-bfb0-78ef1bd2eb14_1443x627.png 1272w, https://substackcdn.com/image/fetch/$s_!oceS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa821d6d5-84c3-420b-bfb0-78ef1bd2eb14_1443x627.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p><em><strong>Figure 1.32:</strong> Sinusoidal PE applies trigonometric sine and cosine functions to generate continuous and bounded positional vectors, which enables the model to learn sequential relationships while avoiding the optimization challenges and discontinuities inherent in integer and binary positional encodings.</em></p><p>The breakthrough in positional encoding came with the sinusoidal approach, introduced in the seminal &#8220;Attention Is All You Need&#8221; paper. 
This method preserves the oscillatory patterns seen in binary encoding while ensuring smooth, continuous values bounded between -1 and 1, eliminating the discontinuity problems that hindered optimization.</p><p><strong>The Mathematical Foundation</strong></p><p>The sinusoidal formulation employs alternating sine and cosine functions across dimensions:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{For even-indexed dimensions } (2i = 0, 2, 4, \\ldots),\\ \\text{ the positional encoding is defined as:}\n&quot;,&quot;id&quot;:&quot;RJXNCTBIAD&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;PE(pos, 2i) = \\sin \\left( \\frac{pos}{10000^{\\frac{2i}{d_{model}}}} \\right)\n&quot;,&quot;id&quot;:&quot;YPOXSXGYVL&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{For odd-indexed dimensions } (2i+1 = 1, 3, 5, \\ldots), \\text{ the positional encoding is defined as:}\n&quot;,&quot;id&quot;:&quot;PFNCSNMRZL&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;PE(pos, 2i+1) = \\cos \\left( \\frac{pos}{10000^{\\frac{2i}{d_{model}}}} \\right)\n&quot;,&quot;id&quot;:&quot;XQXEGNYKGU&quot;}" data-component-name="LatexBlockToDOM"></div><p>Here, <strong>pos</strong> denotes the token&#8217;s position in the sequence, <strong>i</strong> indexes the sine-cosine pair (i = 0, 1, ..., d_model/2 - 1), so that 2i and 2i+1 are the actual dimension indices, and <strong>d_model</strong> indicates the total embedding dimensionality. 
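</p><p>To make the indexing concrete, here is a minimal pure-Python sketch of the two formulas. The function name <code>sinusoidal_encoding</code> is hypothetical, the loop variable <code>pair</code> plays the role of i, and an even d_model is assumed.</p>
<pre>
```python
import math


def sinusoidal_encoding(pos, d_model, base=10000.0):
    """Return the d_model-dimensional sinusoidal encoding of one position.

    Assumes d_model is even, so the dimensions split into sine-cosine pairs.
    """
    encoding = []
    for pair in range(d_model // 2):
        # Both dimensions of a pair share one angle: pos / base^(2i / d_model).
        angle = pos / (base ** (2 * pair / d_model))
        encoding.append(math.sin(angle))  # dimension 2i
        encoding.append(math.cos(angle))  # dimension 2i + 1
    return encoding


pe_0 = sinusoidal_encoding(0, 8)
pe_1 = sinusoidal_encoding(1, 8)

print([round(v, 4) for v in pe_0])
print([round(v, 4) for v in pe_1])
```
</pre><p>At position 0 every sine entry is 0 and every cosine entry is 1, and all values stay within [-1, 1] for any position.</p><p>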
The constant <strong>10000</strong> sets the range of a geometric progression of wavelengths across dimensions, from 2&#960; for the first sine-cosine pair up to nearly 10000 &#183; 2&#960; for the last.</p><p><strong>Frequency Spectrum Analysis</strong></p><p>Taking GPT-2&#8217;s dimensions as an example, with d_model = 768 and maximum context length = 1024, we can observe how different dimensions encode positional information at varying frequencies. (GPT-2 itself uses learned positional embeddings; only its sizes are borrowed here.) For any given position, we compute 768 values using the alternating sine-cosine formulas.</p><p>For the first pair (i=0), the sine formula simplifies to sin(pos/1) = sin(pos), creating rapid oscillations, and the adjacent dimension uses cos(pos/1) = cos(pos). As i increases, the denominator 10000^(2i/768) grows exponentially, progressively slowing the oscillation frequency.</p><p><strong>Sinusoidal Patterns Across Different Dimensions</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qmq6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c4d4b47-41c6-4f0b-a19f-4abd9bdcf7c1_1489x390.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qmq6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c4d4b47-41c6-4f0b-a19f-4abd9bdcf7c1_1489x390.png 424w, https://substackcdn.com/image/fetch/$s_!qmq6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c4d4b47-41c6-4f0b-a19f-4abd9bdcf7c1_1489x390.png 848w, https://substackcdn.com/image/fetch/$s_!qmq6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c4d4b47-41c6-4f0b-a19f-4abd9bdcf7c1_1489x390.png 1272w, 
https://substackcdn.com/image/fetch/$s_!qmq6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c4d4b47-41c6-4f0b-a19f-4abd9bdcf7c1_1489x390.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qmq6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c4d4b47-41c6-4f0b-a19f-4abd9bdcf7c1_1489x390.png" width="1456" height="381" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7c4d4b47-41c6-4f0b-a19f-4abd9bdcf7c1_1489x390.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:381,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:114418,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c4d4b47-41c6-4f0b-a19f-4abd9bdcf7c1_1489x390.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!qmq6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c4d4b47-41c6-4f0b-a19f-4abd9bdcf7c1_1489x390.png 424w, https://substackcdn.com/image/fetch/$s_!qmq6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c4d4b47-41c6-4f0b-a19f-4abd9bdcf7c1_1489x390.png 848w, 
https://substackcdn.com/image/fetch/$s_!qmq6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c4d4b47-41c6-4f0b-a19f-4abd9bdcf7c1_1489x390.png 1272w, https://substackcdn.com/image/fetch/$s_!qmq6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c4d4b47-41c6-4f0b-a19f-4abd9bdcf7c1_1489x390.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em><strong>Figure 1.33: </strong>The visualization reveals how positional encodings behave across four different dimension 
indices:</em></p><p>At <strong>i=1</strong>, both sine and cosine components oscillate extremely rapidly, appearing as dense vertical lines that alternate between approximately -1 and 1. This high-frequency pattern changes with nearly every position, capturing fine-grained local relationships between adjacent tokens.</p><p>At <strong>i=50</strong>, the oscillation frequency decreases noticeably. The sine and cosine waves create regular patterns with periods spanning roughly 20-30 positions. These medium-frequency components encode relationships at the phrase or sentence level.</p><p>At <strong>i=150</strong>, the waves become smooth and gradual, with clear sinusoidal curves visible. The sine (green) and cosine (blue) components maintain their 90-degree phase offset, completing only 2-3 full cycles across the entire 1024-position range. These dimensions capture broader structural information about whether tokens appear in early, middle, or late portions of the sequence.</p><p>At <strong>i=250</strong>, the oscillation becomes extremely slow, with the functions barely completing a single cycle across the full context. The cosine component remains nearly constant around 1, while the sine component stays close to 0, providing stable anchoring for global position context.</p><p>Sinusoidal encoding creates a hierarchical representation in which each position receives a unique 768-dimensional fingerprint, built from many sine-cosine pairs at different frequencies. Lower dimensions oscillate rapidly between positions, capturing local token order, while higher dimensions change gradually, encoding coarse information such as whether a token falls early or late in the sequence. Unlike binary encoding&#8217;s abrupt 0-to-1 transitions, sinusoidal encoding provides smooth, continuous functions, which keeps gradients well behaved during backpropagation and makes optimization more stable. 
The bounded range between -1 and 1 keeps positional signals at a scale comparable to token embeddings, preventing positional information from overwhelming semantic content while allowing the optimizer to make incremental refinements.</p><p>The sinusoidal approach offers significant practical advantages: it requires no learned parameters, reducing model complexity and training overhead, and its mathematical formulation extends naturally to arbitrary sequence lengths, potentially enabling generalization beyond training context sizes. In practice, positional encodings are precomputed for the maximum sequence length and stored as a lookup table. During processing, these encodings are retrieved and added element-wise to token embeddings, so each input vector carries both semantic and positional signals. This simple design addresses several challenges at once: maintaining bounded values, ensuring smooth optimization, providing unique position identification, and encoding positional information at multiple scales. 
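</p><p>As a rough illustration of the precomputed lookup table, the sketch below builds the encodings with NumPy for the GPT-2-sized example (1024 positions, d_model = 768). The function name <code>sinusoidal_table</code> and its parameters are hypothetical, chosen only for this example, and the random array is a stand-in for real token embeddings.</p>
<pre>
```python
import numpy as np


def sinusoidal_table(max_len, d_model, base=10000.0):
    """Return a (max_len, d_model) table of sinusoidal positional encodings.

    Hypothetical helper for illustration; assumes d_model is even.
    """
    positions = np.arange(max_len)[:, np.newaxis]        # shape (max_len, 1)
    pair_index = np.arange(d_model // 2)[np.newaxis, :]  # i = 0 .. d_model/2 - 1
    # Denominator base^(2i / d_model): grows geometrically with i.
    denominators = base ** (2.0 * pair_index / d_model)
    angles = positions / denominators                    # shape (max_len, d_model/2)

    table = np.zeros((max_len, d_model))
    table[:, 0::2] = np.sin(angles)  # even dimensions 2i
    table[:, 1::2] = np.cos(angles)  # odd dimensions 2i + 1
    return table


# Precompute once for the maximum context length, then look up by position.
pe_table = sinusoidal_table(max_len=1024, d_model=768)

# Stand-in for the embeddings of a 5-token sequence (random, illustration only).
rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(5, 768))
inputs = token_embeddings + pe_table[:5]  # element-wise addition

print(pe_table.shape)
print(inputs.shape)
```
</pre>
<p>Because the table depends only on the position and the dimension, it can be computed once and reused for every sequence; a longer table can be generated from the same formula without any retraining.</p><p>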
These properties have established sinusoidal positional encoding as a cornerstone of the Transformer architecture, inspiring numerous variations while remaining widely used in its original form across modern language models.</p><h1>1.6 Transformer Block</h1><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vzlj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdae2fc2b-b5ed-4fb3-98d0-3eb22c68dbab_1245x876.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vzlj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdae2fc2b-b5ed-4fb3-98d0-3eb22c68dbab_1245x876.png 424w, https://substackcdn.com/image/fetch/$s_!vzlj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdae2fc2b-b5ed-4fb3-98d0-3eb22c68dbab_1245x876.png 848w, https://substackcdn.com/image/fetch/$s_!vzlj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdae2fc2b-b5ed-4fb3-98d0-3eb22c68dbab_1245x876.png 1272w, https://substackcdn.com/image/fetch/$s_!vzlj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdae2fc2b-b5ed-4fb3-98d0-3eb22c68dbab_1245x876.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vzlj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdae2fc2b-b5ed-4fb3-98d0-3eb22c68dbab_1245x876.png" width="1245" height="876" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dae2fc2b-b5ed-4fb3-98d0-3eb22c68dbab_1245x876.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:876,&quot;width&quot;:1245,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:45494,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdae2fc2b-b5ed-4fb3-98d0-3eb22c68dbab_1245x876.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!vzlj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdae2fc2b-b5ed-4fb3-98d0-3eb22c68dbab_1245x876.png 424w, https://substackcdn.com/image/fetch/$s_!vzlj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdae2fc2b-b5ed-4fb3-98d0-3eb22c68dbab_1245x876.png 848w, https://substackcdn.com/image/fetch/$s_!vzlj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdae2fc2b-b5ed-4fb3-98d0-3eb22c68dbab_1245x876.png 1272w, https://substackcdn.com/image/fetch/$s_!vzlj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdae2fc2b-b5ed-4fb3-98d0-3eb22c68dbab_1245x876.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em><strong>Figure 1.34 </strong>Inside the transformer block: multi-head attention, feed-forward network, layer normalization, and dropout layers work in sequence to process token representations</em></p><p>Having converted our raw text into meaningful numerical representations through tokenization and embeddings, we now enter the heart of the language model: the Transformer Block. This is where the real magic happens. The block contains several components working in sequence, including layer normalization, dropout layers, and feed forward networks. However, before we dive into these supporting elements, we need to understand the star of the show: the attention mechanism. 
The multi head attention layer is what gives transformers their remarkable ability to understand context and relationships between words, no matter how far apart they appear in a sentence. Once we grasp how attention works and explore the feed forward network that follows, we can then circle back to understand how the other components like dropout and layer normalization help stabilize and improve the overall system. For now, let&#8217;s focus on what makes transformers truly powerful: their attention mechanism.</p><h1>1.7 The Need for Attention Mechanism</h1><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ckXX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3126410-19cd-43b8-8698-3e72cda4797a_675x267.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ckXX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3126410-19cd-43b8-8698-3e72cda4797a_675x267.png 424w, https://substackcdn.com/image/fetch/$s_!ckXX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3126410-19cd-43b8-8698-3e72cda4797a_675x267.png 848w, https://substackcdn.com/image/fetch/$s_!ckXX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3126410-19cd-43b8-8698-3e72cda4797a_675x267.png 1272w, https://substackcdn.com/image/fetch/$s_!ckXX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3126410-19cd-43b8-8698-3e72cda4797a_675x267.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!ckXX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3126410-19cd-43b8-8698-3e72cda4797a_675x267.png" width="675" height="267" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c3126410-19cd-43b8-8698-3e72cda4797a_675x267.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:267,&quot;width&quot;:675,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:11075,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3126410-19cd-43b8-8698-3e72cda4797a_675x267.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ckXX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3126410-19cd-43b8-8698-3e72cda4797a_675x267.png 424w, https://substackcdn.com/image/fetch/$s_!ckXX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3126410-19cd-43b8-8698-3e72cda4797a_675x267.png 848w, https://substackcdn.com/image/fetch/$s_!ckXX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3126410-19cd-43b8-8698-3e72cda4797a_675x267.png 1272w, https://substackcdn.com/image/fetch/$s_!ckXX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3126410-19cd-43b8-8698-3e72cda4797a_675x267.png 1456w" sizes="100vw" 
loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"></figcaption></figure></div><p><em><strong>Figure 1.35 </strong>Timeline of sequence modeling methods from early RNNs to LSTMs, attention with RNNs, transformers, and GPT models.</em></p><p>Feedforward neural networks see every input as independent. For a sentence such as &#8220;The cat sat on the mat&#8221; the model processes each word separately and has no built in notion of order or context. This is not enough for language, where meaning depends on how words are arranged.</p><p>Recurrent neural networks introduce a hidden state that is passed along the sequence. 
The encoder reads tokens one by one, updates its hidden state at each step, and hands the final state to a decoder. The decoder must use this single vector as a summary of the entire input sentence. As sequences get longer, early information is squeezed into this fixed size state and gradually fades. This is the context bottleneck.</p><p>LSTMs improve the situation with a cell state and gates that control what to store and what to forget. They maintain information over longer spans than basic RNNs, but they still process tokens step by step and still rely on compressed hidden states. Long sentences can still overwhelm this bottleneck.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!70c8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2715aee9-a689-47a5-bc2b-139f210052fb_666x393.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!70c8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2715aee9-a689-47a5-bc2b-139f210052fb_666x393.png 424w, https://substackcdn.com/image/fetch/$s_!70c8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2715aee9-a689-47a5-bc2b-139f210052fb_666x393.png 848w, https://substackcdn.com/image/fetch/$s_!70c8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2715aee9-a689-47a5-bc2b-139f210052fb_666x393.png 1272w, https://substackcdn.com/image/fetch/$s_!70c8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2715aee9-a689-47a5-bc2b-139f210052fb_666x393.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!70c8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2715aee9-a689-47a5-bc2b-139f210052fb_666x393.png" width="666" height="393" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2715aee9-a689-47a5-bc2b-139f210052fb_666x393.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:393,&quot;width&quot;:666,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:15357,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2715aee9-a689-47a5-bc2b-139f210052fb_666x393.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!70c8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2715aee9-a689-47a5-bc2b-139f210052fb_666x393.png 424w, https://substackcdn.com/image/fetch/$s_!70c8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2715aee9-a689-47a5-bc2b-139f210052fb_666x393.png 848w, https://substackcdn.com/image/fetch/$s_!70c8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2715aee9-a689-47a5-bc2b-139f210052fb_666x393.png 1272w, https://substackcdn.com/image/fetch/$s_!70c8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2715aee9-a689-47a5-bc2b-139f210052fb_666x393.png 1456w" sizes="100vw" 
loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"></figcaption></figure></div><p><em><strong>Figure 1.36:</strong> Encoder-decoder RNN for the sentence &#8220;I will eat&#8221; showing encoder hidden states h1, h2, h3 and decoder states that must rely on a single summary vector.</em></p><p>To see the bottleneck more concretely, consider an encoder-decoder model that translates the English sentence &#8220;I will eat&#8221; into French. The encoder produces hidden states h1, h2, h3 for the three input tokens, and the final state is passed to the decoder. 
Without attention, the decoder can only use this final state when generating the first French word. It has no direct way to reach back to h1 or h2.</p><p><strong>Attention</strong></p><p>Attention removes the hard bottleneck by giving the decoder direct access to all encoder states. At each decoding step the model compares the current decoder state with every encoder state and produces attention scores. After a softmax these scores become attention weights that sum to one.</p><p>The decoder then forms a context vector as a weighted sum of the encoder states. If the first input word is most relevant for the current output, its weight may be close to one while the others are close to zero. At the next step the weights are recomputed and the model can shift its focus to a different part of the sentence.</p><p>In the translation example, when the decoder produces the first French word it might focus almost entirely on h1. When it moves on to the second French word it can focus more on h2, and so on. 
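</p><p>The weighting described above can be sketched numerically. The snippet below uses NumPy with made-up three-dimensional encoder states h1, h2, h3 and a made-up decoder state, and it scores each encoder state with a plain dot product. That scoring choice is an assumption made for illustration, not the scoring function of any particular model.</p>
<pre>
```python
import numpy as np


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to one."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()


# Made-up encoder hidden states h1, h2, h3 (one row per input token).
encoder_states = np.array([
    [1.0, 0.0, 0.0],   # h1
    [0.0, 1.0, 0.0],   # h2
    [0.0, 0.0, 1.0],   # h3
])

# Made-up decoder state at the current step; it points mostly along h1.
decoder_state = np.array([4.0, 1.0, 0.0])

# 1. Compare the decoder state with every encoder state (dot-product scores).
scores = encoder_states @ decoder_state
# 2. Normalize the scores into attention weights.
weights = softmax(scores)
# 3. Context vector: weighted sum of the encoder states.
context = weights @ encoder_states

print("attention weights:", np.round(weights, 3))
print("context vector:  ", np.round(context, 3))
```
</pre>
<p>Changing the decoder state so that it points along h2 moves most of the weight to the second row, which mirrors the decoder shifting its focus from one input word to the next.</p><p>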
Instead of depending on a single final state, the decoder now has a flexible view over the entire input sequence at every step.</p><p><strong>Bahdanau attention</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!h_cL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68ce7bad-59ec-47ef-b9d8-c16e5c95ee94_663x465.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!h_cL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68ce7bad-59ec-47ef-b9d8-c16e5c95ee94_663x465.png 424w, https://substackcdn.com/image/fetch/$s_!h_cL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68ce7bad-59ec-47ef-b9d8-c16e5c95ee94_663x465.png 848w, https://substackcdn.com/image/fetch/$s_!h_cL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68ce7bad-59ec-47ef-b9d8-c16e5c95ee94_663x465.png 1272w, https://substackcdn.com/image/fetch/$s_!h_cL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68ce7bad-59ec-47ef-b9d8-c16e5c95ee94_663x465.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!h_cL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68ce7bad-59ec-47ef-b9d8-c16e5c95ee94_663x465.png" width="663" height="465" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/68ce7bad-59ec-47ef-b9d8-c16e5c95ee94_663x465.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:465,&quot;width&quot;:663,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:13870,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68ce7bad-59ec-47ef-b9d8-c16e5c95ee94_663x465.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!h_cL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68ce7bad-59ec-47ef-b9d8-c16e5c95ee94_663x465.png 424w, https://substackcdn.com/image/fetch/$s_!h_cL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68ce7bad-59ec-47ef-b9d8-c16e5c95ee94_663x465.png 848w, https://substackcdn.com/image/fetch/$s_!h_cL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68ce7bad-59ec-47ef-b9d8-c16e5c95ee94_663x465.png 1272w, https://substackcdn.com/image/fetch/$s_!h_cL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68ce7bad-59ec-47ef-b9d8-c16e5c95ee94_663x465.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"></figcaption></figure></div><p><em><strong>Figure 1.37:</strong> Bahdanau attention architecture: the encoder produces a sequence of states and the decoder combines its own state with a context vector formed as a weighted sum of all encoder states</em></p><p>Bahdanau attention was the first widely adopted implementation of this idea. The encoder is still a recurrent network that produces a sequence of hidden states. The decoder is also recurrent, but before predicting each target token it computes alignment scores between its current state and every encoder state. 
These scores are normalized into attention weights, and the weighted sum of the encoder states under those weights is the context vector used for prediction.</p><p>The attention weights can be visualized as a matrix whose rows correspond to target words and whose columns correspond to source words. 
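</p><p>To make this matrix concrete, here is a minimal NumPy sketch of additive scoring in the style of Bahdanau attention, score = v · tanh(W_dec s + W_enc h). The dimensions, the random parameters <code>W_dec</code>, <code>W_enc</code>, and <code>v</code>, and the random states are all placeholders for illustration; a trained model would learn the parameters and produce the states with its recurrent encoder and decoder.</p>
<pre>
```python
import numpy as np

rng = np.random.default_rng(0)

num_source, num_target = 4, 3    # source and target sentence lengths (made up)
enc_dim, dec_dim, att_dim = 6, 5, 8

# Placeholder states; a real model would compute these with recurrent networks.
encoder_states = rng.normal(size=(num_source, enc_dim))
decoder_states = rng.normal(size=(num_target, dec_dim))

# Placeholder parameters of the additive scoring function (learned in practice).
W_enc = rng.normal(size=(enc_dim, att_dim))
W_dec = rng.normal(size=(dec_dim, att_dim))
v = rng.normal(size=att_dim)


def additive_attention(decoder_state, encoder_states):
    """Return (weights, context) for one decoder step."""
    # Alignment score for every encoder state: v . tanh(W_dec s + W_enc h_j)
    hidden = np.tanh(decoder_state @ W_dec + encoder_states @ W_enc)
    scores = hidden @ v
    # Softmax over source positions.
    exp_scores = np.exp(scores - scores.max())
    weights = exp_scores / exp_scores.sum()
    # Context vector: weighted sum of encoder states.
    context = weights @ encoder_states
    return weights, context


# One row of attention weights per target step, one column per source token.
attention_matrix = np.stack(
    [additive_attention(s, encoder_states)[0] for s in decoder_states]
)

print(attention_matrix.shape)
print(np.round(attention_matrix, 2))
```
</pre>
<p>With random parameters the printed matrix has no meaningful pattern; only its layout matters here, with every row summing to one.</p><p>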
Each cell shows how strongly the model attends to a particular source word when generating a particular target word. This view reveals attention as a soft alignment between the two sentences.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UlIz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F215a9571-23c9-481c-9e8c-74ed79e6b64b_1008x585.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UlIz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F215a9571-23c9-481c-9e8c-74ed79e6b64b_1008x585.png 424w, https://substackcdn.com/image/fetch/$s_!UlIz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F215a9571-23c9-481c-9e8c-74ed79e6b64b_1008x585.png 848w, https://substackcdn.com/image/fetch/$s_!UlIz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F215a9571-23c9-481c-9e8c-74ed79e6b64b_1008x585.png 1272w, https://substackcdn.com/image/fetch/$s_!UlIz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F215a9571-23c9-481c-9e8c-74ed79e6b64b_1008x585.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UlIz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F215a9571-23c9-481c-9e8c-74ed79e6b64b_1008x585.png" width="1008" height="585" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/215a9571-23c9-481c-9e8c-74ed79e6b64b_1008x585.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:585,&quot;width&quot;:1008,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:34269,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F215a9571-23c9-481c-9e8c-74ed79e6b64b_1008x585.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!UlIz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F215a9571-23c9-481c-9e8c-74ed79e6b64b_1008x585.png 424w, https://substackcdn.com/image/fetch/$s_!UlIz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F215a9571-23c9-481c-9e8c-74ed79e6b64b_1008x585.png 848w, https://substackcdn.com/image/fetch/$s_!UlIz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F215a9571-23c9-481c-9e8c-74ed79e6b64b_1008x585.png 1272w, https://substackcdn.com/image/fetch/$s_!UlIz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F215a9571-23c9-481c-9e8c-74ed79e6b64b_1008x585.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"></figcaption></figure></div><p><em><strong>Figure 1.38:</strong> Attention heatmaps for an English-to-French sentence pair. The left grid shows overall alignments; the right grid highlights how the model focuses on &#8220;European Economic Area&#8221; when generating &#8220;zone economique europeenne,&#8221; capturing word reordering.</em></p><p>These heatmaps show that many words align along a near-diagonal, indicating similar word order in both languages. Off-diagonal patterns reveal reordered phrases. For example, the French adjective corresponding to &#8220;European&#8221; appears last in the phrase, but its attention weights point back to the first English word. 
This ability to align by meaning rather than position is what allows attention-based models to handle flexible word order and long-range dependencies.</p><p>Finally, it is helpful to remember where this attention block lives inside the full transformer model from earlier chapters. The transformer encoder and decoder both contain stacked attention and feedforward sublayers that operate on token and positional embeddings.</p><p>We are now ready to see why attention became the central idea in modern language models. Starting from simple recurrent networks and LSTMs, we saw how the context bottleneck makes it hard to remember all the details of a long sentence. Bahdanau attention solved this by letting the decoder look back at every encoder state and learn soft alignments between source and target words, which we visualized through attention weights and heatmaps. So far, attention has connected two different sequences, such as English and French sentences. In the next section we will study self-attention in detail and see how letting every token attend to every other token becomes the core operation of the transformer.</p><h2>1.8 Self-Attention Mechanism</h2><h4>What Does Self-Attention Actually Mean?</h4><p>Now that we understand the mechanics of attention, let&#8217;s clarify what makes <em>self</em>-attention special. It is the key concept behind modern transformer-based language models.</p><h4>Two Types of Attention</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0Qjk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefcfd51b-f8d5-435b-b451-880a7c63be24_694x310.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!0Qjk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefcfd51b-f8d5-435b-b451-880a7c63be24_694x310.png 424w, https://substackcdn.com/image/fetch/$s_!0Qjk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefcfd51b-f8d5-435b-b451-880a7c63be24_694x310.png 848w, https://substackcdn.com/image/fetch/$s_!0Qjk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefcfd51b-f8d5-435b-b451-880a7c63be24_694x310.png 1272w, https://substackcdn.com/image/fetch/$s_!0Qjk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefcfd51b-f8d5-435b-b451-880a7c63be24_694x310.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0Qjk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefcfd51b-f8d5-435b-b451-880a7c63be24_694x310.png" width="694" height="310" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/efcfd51b-f8d5-435b-b451-880a7c63be24_694x310.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:310,&quot;width&quot;:694,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:12701,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefcfd51b-f8d5-435b-b451-880a7c63be24_694x310.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" 
alt="" srcset="https://substackcdn.com/image/fetch/$s_!0Qjk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefcfd51b-f8d5-435b-b451-880a7c63be24_694x310.png 424w, https://substackcdn.com/image/fetch/$s_!0Qjk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefcfd51b-f8d5-435b-b451-880a7c63be24_694x310.png 848w, https://substackcdn.com/image/fetch/$s_!0Qjk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefcfd51b-f8d5-435b-b451-880a7c63be24_694x310.png 1272w, https://substackcdn.com/image/fetch/$s_!0Qjk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefcfd51b-f8d5-435b-b451-880a7c63be24_694x310.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em><strong>Figure 1.39:</strong> Two types of attention: cross-attention connects words across different sequences (e.g., translation), while self-attention connects words within the same sequence.</em></p><p>To understand self-attention, we first need to see where attention was used before. There are two fundamental ways attention can work:</p><p><strong>Between Sequences (cross-attention):</strong> Attention connects words across different sequences, as when translating from one language to another.</p><p><strong>Within a Sequence (self-attention):</strong> Attention connects words within the same sequence to capture relationships and context.</p><h4>Attention in Translation</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XF0j!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F708737bd-f9e4-467a-a8b0-8b19885b817b_364x272.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XF0j!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F708737bd-f9e4-467a-a8b0-8b19885b817b_364x272.png 424w, https://substackcdn.com/image/fetch/$s_!XF0j!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F708737bd-f9e4-467a-a8b0-8b19885b817b_364x272.png 848w, 
https://substackcdn.com/image/fetch/$s_!XF0j!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F708737bd-f9e4-467a-a8b0-8b19885b817b_364x272.png 1272w, https://substackcdn.com/image/fetch/$s_!XF0j!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F708737bd-f9e4-467a-a8b0-8b19885b817b_364x272.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XF0j!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F708737bd-f9e4-467a-a8b0-8b19885b817b_364x272.png" width="364" height="272" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/708737bd-f9e4-467a-a8b0-8b19885b817b_364x272.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:272,&quot;width&quot;:364,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:5690,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F708737bd-f9e4-467a-a8b0-8b19885b817b_364x272.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!XF0j!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F708737bd-f9e4-467a-a8b0-8b19885b817b_364x272.png 424w, https://substackcdn.com/image/fetch/$s_!XF0j!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F708737bd-f9e4-467a-a8b0-8b19885b817b_364x272.png 
848w, https://substackcdn.com/image/fetch/$s_!XF0j!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F708737bd-f9e4-467a-a8b0-8b19885b817b_364x272.png 1272w, https://substackcdn.com/image/fetch/$s_!XF0j!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F708737bd-f9e4-467a-a8b0-8b19885b817b_364x272.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em><strong>Figure 1.40:</strong> Cross-attention in translation: the English phrase &#8220;The next day is bright&#8221; is 
aligned to its French counterpart, with attention determining which source words correspond to which target words.</em></p><p>In traditional translation tasks, attention operates between two sequences. Imagine translating the English phrase &#8220;The next day is bright&#8221; into French. The word order might change: &#8220;day&#8221; might align with &#8220;jour,&#8221; but its position in the French sentence could be different. Attention helps the model work out these cross-language alignments, that is, which English word corresponds to which French word. This works beautifully for translation. But what happens when we&#8217;re not translating at all?</p><h4>Enter Self-Attention</h4><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RWTH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b52462a-9a4a-44f1-8be9-303368e1927e_296x82.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RWTH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b52462a-9a4a-44f1-8be9-303368e1927e_296x82.png 424w, https://substackcdn.com/image/fetch/$s_!RWTH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b52462a-9a4a-44f1-8be9-303368e1927e_296x82.png 848w, https://substackcdn.com/image/fetch/$s_!RWTH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b52462a-9a4a-44f1-8be9-303368e1927e_296x82.png 1272w, https://substackcdn.com/image/fetch/$s_!RWTH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b52462a-9a4a-44f1-8be9-303368e1927e_296x82.png 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RWTH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b52462a-9a4a-44f1-8be9-303368e1927e_296x82.png" width="296" height="82" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0b52462a-9a4a-44f1-8be9-303368e1927e_296x82.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:82,&quot;width&quot;:296,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:4468,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b52462a-9a4a-44f1-8be9-303368e1927e_296x82.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!RWTH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b52462a-9a4a-44f1-8be9-303368e1927e_296x82.png 424w, https://substackcdn.com/image/fetch/$s_!RWTH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b52462a-9a4a-44f1-8be9-303368e1927e_296x82.png 848w, https://substackcdn.com/image/fetch/$s_!RWTH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b52462a-9a4a-44f1-8be9-303368e1927e_296x82.png 1272w, https://substackcdn.com/image/fetch/$s_!RWTH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b52462a-9a4a-44f1-8be9-303368e1927e_296x82.png 1456w" 
sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><em><strong>Figure 1.41:</strong> Self-attention: every word in a single sentence attends to all other words in that same sentence to build contextual understanding.</em></p><p>Consider a different task: predicting the next word in a sentence, working out what a pronoun refers to, or simply grasping the meaning of a sentence. Here, we don&#8217;t have two separate sequences. We have just one: the sentence itself. This is where self-attention comes in.</p><p>Self-attention means that every word in a sentence attends to all other words <em>in that same sentence</em>. Instead of looking across two different sequences (like English and French), the model examines how words relate to each other within a single sequence. The word &#8220;day&#8221; attends to &#8220;next,&#8221; to &#8220;bright,&#8221; to &#8220;the,&#8221; to everything in its own sentence. It&#8217;s attention turned inward: the sequence attending to itself. That&#8217;s why we call it <em>self</em>-attention. But we cannot capture these relationships using the raw input embeddings alone. The connections between words depend on context, grammar, meaning, and a dozen other subtle factors that shift from sentence to sentence.</p><p>So what do we do when faced with complexity we can&#8217;t hard-code? We let the model learn it, through weight matrices that can be trained. Before we dive into the mechanics, let&#8217;s be clear about our goal. We start with input embeddings, which are numerical representations of words. What we want to end up with are <strong>context vectors</strong>.</p><h4>What&#8217;s the difference?</h4><p>An input embedding represents a word in isolation. The word embedding for &#8220;bank&#8221; is the same whether you&#8217;re talking about a financial institution or the side of a river, and the positional embedding only records where the word sits, not what it means. 
But a context vector represents a word <em>as it appears in a specific sentence</em>, infused with information from the words around it.</p><p>Consider the sentence:</p><div class="pullquote"><p>&#8220;The dog chased the ball but <strong>it</strong> could not catch <strong>it.</strong>&#8221;</p></div><p>The input embedding for the second &#8220;it&#8221; doesn&#8217;t know what &#8220;it&#8221; refers to; it&#8217;s just a generic representation. But the context vector we&#8217;re building will carry information from &#8220;ball,&#8221; from &#8220;catch,&#8221; from the entire sentence.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!SjLz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb20610a-e7fe-45dc-b4bd-ef7e95dbee3c_448x348.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!SjLz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb20610a-e7fe-45dc-b4bd-ef7e95dbee3c_448x348.png 424w, https://substackcdn.com/image/fetch/$s_!SjLz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb20610a-e7fe-45dc-b4bd-ef7e95dbee3c_448x348.png 848w, https://substackcdn.com/image/fetch/$s_!SjLz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb20610a-e7fe-45dc-b4bd-ef7e95dbee3c_448x348.png 1272w, https://substackcdn.com/image/fetch/$s_!SjLz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb20610a-e7fe-45dc-b4bd-ef7e95dbee3c_448x348.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!SjLz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb20610a-e7fe-45dc-b4bd-ef7e95dbee3c_448x348.png" width="448" height="348" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bb20610a-e7fe-45dc-b4bd-ef7e95dbee3c_448x348.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:348,&quot;width&quot;:448,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:17465,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb20610a-e7fe-45dc-b4bd-ef7e95dbee3c_448x348.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!SjLz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb20610a-e7fe-45dc-b4bd-ef7e95dbee3c_448x348.png 424w, https://substackcdn.com/image/fetch/$s_!SjLz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb20610a-e7fe-45dc-b4bd-ef7e95dbee3c_448x348.png 848w, https://substackcdn.com/image/fetch/$s_!SjLz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb20610a-e7fe-45dc-b4bd-ef7e95dbee3c_448x348.png 1272w, https://substackcdn.com/image/fetch/$s_!SjLz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb20610a-e7fe-45dc-b4bd-ef7e95dbee3c_448x348.png 1456w" sizes="100vw" 
loading="lazy"></picture></div></a></figure></div><p><em><strong>Figure 1.42:</strong> From input embedding to context vector: the static representation of a word is enriched with information from all surrounding words through self-attention.</em></p><p>The context vector will <em>understand</em> that this particular &#8220;it&#8221; refers to the ball. 
So our entire journey with self-attention, the queries, the keys, the attention scores we&#8217;re about to explore, all of it serves one purpose: transforming static input embeddings into dynamic context vectors that understand meaning in context.</p><h1>1.9 Understanding the Input Embedding Matrix</h1><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZMRe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52530a2c-9ce0-4c5e-9b79-8995e7b37da3_478x300.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZMRe!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52530a2c-9ce0-4c5e-9b79-8995e7b37da3_478x300.png 424w, https://substackcdn.com/image/fetch/$s_!ZMRe!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52530a2c-9ce0-4c5e-9b79-8995e7b37da3_478x300.png 848w, https://substackcdn.com/image/fetch/$s_!ZMRe!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52530a2c-9ce0-4c5e-9b79-8995e7b37da3_478x300.png 1272w, https://substackcdn.com/image/fetch/$s_!ZMRe!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52530a2c-9ce0-4c5e-9b79-8995e7b37da3_478x300.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZMRe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52530a2c-9ce0-4c5e-9b79-8995e7b37da3_478x300.png" width="478" height="300" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/52530a2c-9ce0-4c5e-9b79-8995e7b37da3_478x300.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:300,&quot;width&quot;:478,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:14991,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52530a2c-9ce0-4c5e-9b79-8995e7b37da3_478x300.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ZMRe!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52530a2c-9ce0-4c5e-9b79-8995e7b37da3_478x300.png 424w, https://substackcdn.com/image/fetch/$s_!ZMRe!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52530a2c-9ce0-4c5e-9b79-8995e7b37da3_478x300.png 848w, https://substackcdn.com/image/fetch/$s_!ZMRe!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52530a2c-9ce0-4c5e-9b79-8995e7b37da3_478x300.png 1272w, https://substackcdn.com/image/fetch/$s_!ZMRe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52530a2c-9ce0-4c5e-9b79-8995e7b37da3_478x300.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em><strong>Figure 1.43:</strong> For each word, the word embedding and positional embedding are summed to produce the input embedding vector.</em></p><p>As we&#8217;ve already seen, for each word in our sentence, we have an embedding vector combined with positional information, that is, the word embedding plus the positional embedding that tells us where the word sits in the sequence. The sum of these two gives us our <strong>input embedding vector</strong> for each word.</p><p>When we stack all these input embedding vectors together for an entire sentence, we get what&#8217;s called the <strong>input embedding matrix</strong>.</p><p>Let&#8217;s say we&#8217;re working with the sentence </p><div class="pullquote"><p>&#8220;The next day is bright&#8221;</p></div><p>that&#8217;s five words. 
Our input embedding matrix would have dimensions (5, 8).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Bwux!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1dffb3e-d016-4c40-9ecb-8805f3e05d19_504x300.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Bwux!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1dffb3e-d016-4c40-9ecb-8805f3e05d19_504x300.png 424w, https://substackcdn.com/image/fetch/$s_!Bwux!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1dffb3e-d016-4c40-9ecb-8805f3e05d19_504x300.png 848w, https://substackcdn.com/image/fetch/$s_!Bwux!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1dffb3e-d016-4c40-9ecb-8805f3e05d19_504x300.png 1272w, https://substackcdn.com/image/fetch/$s_!Bwux!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1dffb3e-d016-4c40-9ecb-8805f3e05d19_504x300.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Bwux!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1dffb3e-d016-4c40-9ecb-8805f3e05d19_504x300.png" width="504" height="300" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d1dffb3e-d016-4c40-9ecb-8805f3e05d19_504x300.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:300,&quot;width&quot;:504,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:14108,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1dffb3e-d016-4c40-9ecb-8805f3e05d19_504x300.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Bwux!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1dffb3e-d016-4c40-9ecb-8805f3e05d19_504x300.png 424w, https://substackcdn.com/image/fetch/$s_!Bwux!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1dffb3e-d016-4c40-9ecb-8805f3e05d19_504x300.png 848w, https://substackcdn.com/image/fetch/$s_!Bwux!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1dffb3e-d016-4c40-9ecb-8805f3e05d19_504x300.png 1272w, https://substackcdn.com/image/fetch/$s_!Bwux!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1dffb3e-d016-4c40-9ecb-8805f3e05d19_504x300.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em><strong>Figure 1.44:</strong> The input embedding matrix for the sentence &#8220;The next day is bright&#8221; has shape (5, 8): five rows (one per word) and eight columns (the embedding dimension).</em></p><p>What do these numbers mean?</p><p>The <strong>5 rows</strong> come from having 5 words. Simple enough. Each word gets its own row in the matrix. If we had ten words, we&#8217;d have ten rows. The number of rows always matches the number of words in our sequence. </p><p>The <strong>8 columns</strong> represent the dimensionality we&#8217;ve chosen for our embeddings. Each word is represented as an 8-dimensional vector, eight numbers that capture its meaning. This dimension is something we decide when building our model. 
It&#8217;s a design choice.</p><p>In GPT-2, for instance, the embedding dimension depends on the model size: 768 for GPT-2 Small, all the way up to 1,600 for GPT-2 XL. Larger dimensions can capture more nuanced information, but they also require more computation.</p><h4>The Problem We&#8217;re Solving</h4><p>So here we are with our input embedding matrix. Each word has its 8-dimensional vector. But here&#8217;s what&#8217;s missing: these vectors exist in isolation. They don&#8217;t know about each other.</p><p>Look at the word &#8220;day&#8221; in our sentence &#8220;The next day is bright.&#8221; Its input embedding vector is just a generic representation of the word &#8220;day&#8221; at its position in the sequence. It doesn&#8217;t know it should pay attention to &#8220;bright.&#8221; It doesn&#8217;t know that &#8220;next,&#8221; right before it, gives it temporal context. It has no idea how much importance it should give to &#8220;the&#8221; or &#8220;is&#8221; or any other word in the sentence.</p><p>This is exactly why we need to transform input embeddings into context vectors. We need to integrate information from all the other words. We need each word&#8217;s representation to reflect not just what it is, but what it means in this particular sentence, surrounded by these particular neighbors. That&#8217;s the journey we&#8217;re about to take.</p><div><hr></div><p>Before we can perform any attention calculations, we must first define our input sequence and its corresponding embedding matrix. We will use PyTorch to create a tensor that holds this information for our example sentence, &#8220;The next day is bright.&#8221; Each word is represented by an 8-dimensional vector.</p><h4><strong>Listing 1.1 Defining the input embedding matrix</strong></h4><pre><code>import torch

words = ['The', 'next', 'day', 'is', 'bright']

inputs = torch.tensor([
    [0.32, 0.21, 0.43, 0.21, 0.86, 0.67, 0.98, 0.23], # The
    [0.43, 0.56, 0.43, 0.56, 0.69, 0.21, 0.56, 0.21], # next
    [0.56, 0.21, 0.43, 0.21, 0.54, 0.12, 0.89, 0.98], # day
    [0.87, 0.34, 0.18, 0.32, 0.75, 0.12, 0.54, 0.92], # is
    [0.76, 0.21, 0.85, 0.34, 0.98, 0.23, 0.68, 0.34]  # bright
], dtype=torch.float32)
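
# Added sanity check (an illustrative sketch, not part of the original listing).
# `num_tokens` and `d_in` are names assumed here for the two dimensions.
num_tokens, d_in = inputs.shape      # rows = tokens, columns = embedding size
assert (num_tokens, d_in) == (5, 8)  # 5 words, each an 8-dimensional vector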



print("Input Embedding Matrix:")
print(inputs)
print("\nMatrix Shape:")
print(inputs.shape)</code></pre><p><strong>Running the code above prints the following output:</strong></p><pre><code>Input Embedding Matrix:
tensor([[0.3200, 0.2100, 0.4300, 0.2100, 0.8600, 0.6700, 0.9800, 0.2300],
        [0.4300, 0.5600, 0.4300, 0.5600, 0.6900, 0.2100, 0.5600, 0.2100],
        [0.5600, 0.2100, 0.4300, 0.2100, 0.5400, 0.1200, 0.8900, 0.9800],
        [0.8700, 0.3400, 0.1800, 0.3200, 0.7500, 0.1200, 0.5400, 0.9200],
        [0.7600, 0.2100, 0.8500, 0.3400, 0.9800, 0.2300, 0.6800, 0.3400]])

Matrix Shape:
torch.Size([5, 8])</code></pre><p>The output shows that our <strong>inputs</strong> object is a <strong>tensor</strong> with a shape of <strong>torch.Size([5, 8])</strong>. This confirms we have a matrix with 5 rows, one for each of our tokens, and 8 columns, representing the 8-dimensional embedding vector for each token. This matrix is the starting point for the self-attention mechanism, but as noted, these vectors exist in isolation and lack any contextual information from their neighbors.</p><h2>1.10 From Embeddings to Queries, Keys &amp; Values</h2><p>Here&#8217;s where we meet the heart of self-attention: three trainable weight matrices, Wq, Wk, and Wv, which turn the input embeddings into Queries, Keys, and Values.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_dbx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F583ec1f6-bc6e-4405-8473-d2f182cc64a8_906x262.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_dbx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F583ec1f6-bc6e-4405-8473-d2f182cc64a8_906x262.png 424w, https://substackcdn.com/image/fetch/$s_!_dbx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F583ec1f6-bc6e-4405-8473-d2f182cc64a8_906x262.png 848w, https://substackcdn.com/image/fetch/$s_!_dbx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F583ec1f6-bc6e-4405-8473-d2f182cc64a8_906x262.png 1272w, 
https://substackcdn.com/image/fetch/$s_!_dbx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F583ec1f6-bc6e-4405-8473-d2f182cc64a8_906x262.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_dbx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F583ec1f6-bc6e-4405-8473-d2f182cc64a8_906x262.png" width="906" height="262" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/583ec1f6-bc6e-4405-8473-d2f182cc64a8_906x262.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:262,&quot;width&quot;:906,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:8580,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F583ec1f6-bc6e-4405-8473-d2f182cc64a8_906x262.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!_dbx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F583ec1f6-bc6e-4405-8473-d2f182cc64a8_906x262.png 424w, https://substackcdn.com/image/fetch/$s_!_dbx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F583ec1f6-bc6e-4405-8473-d2f182cc64a8_906x262.png 848w, 
https://substackcdn.com/image/fetch/$s_!_dbx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F583ec1f6-bc6e-4405-8473-d2f182cc64a8_906x262.png 1272w, https://substackcdn.com/image/fetch/$s_!_dbx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F583ec1f6-bc6e-4405-8473-d2f182cc64a8_906x262.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em><strong>Figure 1.45:</strong> The input embedding matrix is multiplied by three separate weight matrices Wq, Wk, and Wv to produce 
Query, Key, and Value matrices.</em></p><p>You might wonder, why three? Why not just use the input embeddings directly?</p><p>The answer lies in a fundamental principle of neural networks: they&#8217;re universal function approximators. They can learn complex patterns if we give them the right structure. So instead of trying to hand-code how words should relate to each other, we do something smarter.</p><p>We initialize three weight matrices with random values. Then we let the training process figure it out. During training, these matrices learn how to transform embeddings in ways that capture meaningful relationships. The Query matrix learns to create vectors that &#8220;ask questions.&#8221; The Key matrix learns to create vectors that &#8220;answer&#8221; whether they&#8217;re relevant. And the Value matrix? It learns what information should actually be passed along once we know which words matter. We&#8217;re not telling the model how attention should work; we&#8217;re giving it the tools to learn it on its own.</p><p>Let&#8217;s make this concrete with an example.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ekUt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c24effe-f42e-4607-9d92-a264102e0e9e_296x82.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ekUt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c24effe-f42e-4607-9d92-a264102e0e9e_296x82.png 424w, https://substackcdn.com/image/fetch/$s_!ekUt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c24effe-f42e-4607-9d92-a264102e0e9e_296x82.png 848w, 
https://substackcdn.com/image/fetch/$s_!ekUt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c24effe-f42e-4607-9d92-a264102e0e9e_296x82.png 1272w, https://substackcdn.com/image/fetch/$s_!ekUt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c24effe-f42e-4607-9d92-a264102e0e9e_296x82.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ekUt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c24effe-f42e-4607-9d92-a264102e0e9e_296x82.png" width="296" height="82" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8c24effe-f42e-4607-9d92-a264102e0e9e_296x82.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:82,&quot;width&quot;:296,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:4286,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c24effe-f42e-4607-9d92-a264102e0e9e_296x82.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ekUt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c24effe-f42e-4607-9d92-a264102e0e9e_296x82.png 424w, https://substackcdn.com/image/fetch/$s_!ekUt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c24effe-f42e-4607-9d92-a264102e0e9e_296x82.png 848w, 
https://substackcdn.com/image/fetch/$s_!ekUt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c24effe-f42e-4607-9d92-a264102e0e9e_296x82.png 1272w, https://substackcdn.com/image/fetch/$s_!ekUt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c24effe-f42e-4607-9d92-a264102e0e9e_296x82.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><em><strong>Figure 1.46:</strong> The sentence &#8220;The next day is bright&#8221; with the word &#8220;next&#8221; highlighted as the current focus of the attention mechanism.</em></p><p>Suppose we focus on a specific word, say &#8220;next&#8221;.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wjsX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bad022a-ad58-4c05-9ccc-cbf736b93e1c_984x170.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!wjsX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bad022a-ad58-4c05-9ccc-cbf736b93e1c_984x170.png 424w, https://substackcdn.com/image/fetch/$s_!wjsX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bad022a-ad58-4c05-9ccc-cbf736b93e1c_984x170.png 848w, https://substackcdn.com/image/fetch/$s_!wjsX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bad022a-ad58-4c05-9ccc-cbf736b93e1c_984x170.png 1272w, 
https://substackcdn.com/image/fetch/$s_!wjsX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bad022a-ad58-4c05-9ccc-cbf736b93e1c_984x170.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!wjsX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bad022a-ad58-4c05-9ccc-cbf736b93e1c_984x170.png" width="984" height="170" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8bad022a-ad58-4c05-9ccc-cbf736b93e1c_984x170.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:170,&quot;width&quot;:984,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:11472,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bad022a-ad58-4c05-9ccc-cbf736b93e1c_984x170.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!wjsX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bad022a-ad58-4c05-9ccc-cbf736b93e1c_984x170.png 424w, https://substackcdn.com/image/fetch/$s_!wjsX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bad022a-ad58-4c05-9ccc-cbf736b93e1c_984x170.png 848w, https://substackcdn.com/image/fetch/$s_!wjsX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bad022a-ad58-4c05-9ccc-cbf736b93e1c_984x170.png 
1272w, https://substackcdn.com/image/fetch/$s_!wjsX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8bad022a-ad58-4c05-9ccc-cbf736b93e1c_984x170.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><em><strong>Figure 1.47:</strong> The word &#8220;next&#8221; acts as the Query, asking how much attention it should pay to each of the other words in the sentence.</em></p><p>For this word, we need to decide how much attention it should pay to all the other words in the sentence. This is where our terminology becomes important. The word we are focusing on (<strong>&#8220;next&#8221;</strong>) is called the <strong>Query (Q)</strong>.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vkgL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55fca8b9-24f6-455d-abfd-9ee99269053b_984x170.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vkgL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55fca8b9-24f6-455d-abfd-9ee99269053b_984x170.png 424w, https://substackcdn.com/image/fetch/$s_!vkgL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55fca8b9-24f6-455d-abfd-9ee99269053b_984x170.png 848w, https://substackcdn.com/image/fetch/$s_!vkgL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55fca8b9-24f6-455d-abfd-9ee99269053b_984x170.png 1272w, 
https://substackcdn.com/image/fetch/$s_!vkgL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55fca8b9-24f6-455d-abfd-9ee99269053b_984x170.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vkgL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55fca8b9-24f6-455d-abfd-9ee99269053b_984x170.png" width="984" height="170" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/55fca8b9-24f6-455d-abfd-9ee99269053b_984x170.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:170,&quot;width&quot;:984,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:12378,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55fca8b9-24f6-455d-abfd-9ee99269053b_984x170.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!vkgL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55fca8b9-24f6-455d-abfd-9ee99269053b_984x170.png 424w, https://substackcdn.com/image/fetch/$s_!vkgL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55fca8b9-24f6-455d-abfd-9ee99269053b_984x170.png 848w, https://substackcdn.com/image/fetch/$s_!vkgL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55fca8b9-24f6-455d-abfd-9ee99269053b_984x170.png 
1272w, https://substackcdn.com/image/fetch/$s_!vkgL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55fca8b9-24f6-455d-abfd-9ee99269053b_984x170.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><em><strong>Figure 1.48:</strong> The other words, &#8220;the,&#8221; &#8220;day,&#8221; &#8220;is,&#8221; and &#8220;bright,&#8221; serve as Keys that the query evaluates for relevance.</em></p><p><strong>The other words in the sentence, &#8220;the,&#8221; &#8220;day,&#8221; &#8220;is,&#8221; and &#8220;bright,&#8221; are called Keys (K).</strong> These are the words that the query will evaluate. They&#8217;re potential sources of information. Strictly speaking, every token in the sequence, including the query word itself, has a key; the figures highlight the other four.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CUo9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6510b275-f5fa-454f-a577-c710430c3833_984x426.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CUo9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6510b275-f5fa-454f-a577-c710430c3833_984x426.png 424w, https://substackcdn.com/image/fetch/$s_!CUo9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6510b275-f5fa-454f-a577-c710430c3833_984x426.png 848w, https://substackcdn.com/image/fetch/$s_!CUo9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6510b275-f5fa-454f-a577-c710430c3833_984x426.png 1272w, 
https://substackcdn.com/image/fetch/$s_!CUo9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6510b275-f5fa-454f-a577-c710430c3833_984x426.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CUo9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6510b275-f5fa-454f-a577-c710430c3833_984x426.png" width="984" height="426" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6510b275-f5fa-454f-a577-c710430c3833_984x426.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:426,&quot;width&quot;:984,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:22101,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6510b275-f5fa-454f-a577-c710430c3833_984x426.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!CUo9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6510b275-f5fa-454f-a577-c710430c3833_984x426.png 424w, https://substackcdn.com/image/fetch/$s_!CUo9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6510b275-f5fa-454f-a577-c710430c3833_984x426.png 848w, https://substackcdn.com/image/fetch/$s_!CUo9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6510b275-f5fa-454f-a577-c710430c3833_984x426.png 
1272w, https://substackcdn.com/image/fetch/$s_!CUo9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6510b275-f5fa-454f-a577-c710430c3833_984x426.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em><strong>Figure 1.49:</strong> Attention scores &#945; between the query word &#8220;next&#8221; and all keys. Each score quantifies how strongly &#8220;next&#8221; should attend to each of the other words.</em></p><p>Now comes the crucial part: the <strong>attention score (&#945;)</strong>. 
This score determines how much importance &#8220;next&#8221; should give to each of these other words. Should &#8220;next&#8221; pay more attention to &#8220;day&#8221; (the word right after it) or to &#8220;bright&#8221; (further away)? The attention scores tell us exactly this.</p><p>So &#8220;next&#8221; uses these attention scores to focus on other words in the sentence, weighing some as more important, others as less so. This is how a word builds its understanding of context.</p><p>For example, the attention score <strong>&#945;&#8322;&#8321;</strong> means:</p><ul><li><p><strong>&#8220;Next&#8221; (X&#8322;) is attending to &#8220;The&#8221; (X&#8321;).</strong></p></li><li><p>The first subscript, <strong>2</strong>, represents &#8220;next&#8221; (position 2 in the sentence).</p></li><li><p>The second subscript, <strong>1</strong>, represents &#8220;the&#8221; (position 1 in the sentence).</p></li></ul><p>The <strong>goal of self-attention</strong> is to take these attention scores (&#945; values) and use them to <strong>modify the original input embeddings</strong>, creating <strong>context vectors</strong> that contain <strong>richer</strong> information.</p><ul><li><p><strong>Input Embedding (X&#8322; - &#8220;next&#8221;)</strong>: Just represents the word itself.</p></li><li><p><strong>Context Vector (C&#8322; - &#8220;next&#8221;)</strong>: Now contains <strong>information from all relevant words</strong> around it, based on attention scores.</p></li></ul><p>Instead of just knowing &#8220;next&#8221; as an isolated word, the <strong>context vector of &#8220;next&#8221;</strong> now understands:</p><ul><li><p>How much &#8220;next&#8221; relates to &#8220;day&#8221; (&#945;&#8322;&#8323;)</p></li><li><p>How much &#8220;next&#8221; relates to &#8220;the&#8221; (&#945;&#8322;&#8321;)</p></li><li><p>How much &#8220;next&#8221; relates to &#8220;is&#8221; (&#945;&#8322;&#8324;)</p></li><li><p>How much &#8220;next&#8221; relates to &#8220;bright&#8221; (&#945;&#8322;&#8325;)</p></li></ul><p>This transformation from <strong>input embeddings to context vectors</strong> is what 
makes <strong>self-attention so powerful</strong>: it helps the model understand relationships <strong>between words, not just individual tokens</strong>.</p><blockquote><p>A context vector is an enriched embedding vector. It combines information from all other input elements.</p></blockquote><h4>The Dimensions of Query, Key, and Value Matrices</h4><p>Now let&#8217;s talk about the actual shape and size of these weight matrices. Understanding their dimensions is crucial to grasping how self-attention works mathematically.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!66Tp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F840a6e66-e128-4dc9-a08d-316d67d62285_876x526.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!66Tp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F840a6e66-e128-4dc9-a08d-316d67d62285_876x526.png 424w, https://substackcdn.com/image/fetch/$s_!66Tp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F840a6e66-e128-4dc9-a08d-316d67d62285_876x526.png 848w, https://substackcdn.com/image/fetch/$s_!66Tp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F840a6e66-e128-4dc9-a08d-316d67d62285_876x526.png 1272w, https://substackcdn.com/image/fetch/$s_!66Tp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F840a6e66-e128-4dc9-a08d-316d67d62285_876x526.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!66Tp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F840a6e66-e128-4dc9-a08d-316d67d62285_876x526.png" width="876" height="526" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/840a6e66-e128-4dc9-a08d-316d67d62285_876x526.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:526,&quot;width&quot;:876,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:29911,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F840a6e66-e128-4dc9-a08d-316d67d62285_876x526.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!66Tp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F840a6e66-e128-4dc9-a08d-316d67d62285_876x526.png 424w, https://substackcdn.com/image/fetch/$s_!66Tp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F840a6e66-e128-4dc9-a08d-316d67d62285_876x526.png 848w, https://substackcdn.com/image/fetch/$s_!66Tp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F840a6e66-e128-4dc9-a08d-316d67d62285_876x526.png 1272w, https://substackcdn.com/image/fetch/$s_!66Tp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F840a6e66-e128-4dc9-a08d-316d67d62285_876x526.png 1456w" sizes="100vw" 
loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em><strong>Figure 1.50:</strong> Dimensions of the weight matrices: Wq, Wk, and Wv each have shape (d_in, d_out), where d_in matches the embedding dimension and d_out is a design choice.</em></p><p>If we look at the dimensions of the Query, Key, and Value weight matrices (Wq, Wk, and Wv), we&#8217;ll notice something interesting.</p><p>The <strong>number of rows</strong> in each of these matrices equals the <strong>number of columns</strong> in our input embedding matrix. Remember, our input embedding matrix has dimensions <strong>(5, 8)</strong>, where <strong>8 is our embedding dimension</strong>. 
So our weight matrices will have 8 rows.</p><p>The <strong>number of columns</strong> in these weight matrices, however, can be anything we choose. This is a design decision.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!OPNH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb38c4dfe-6963-4389-b024-8e60b6f77098_310x264.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!OPNH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb38c4dfe-6963-4389-b024-8e60b6f77098_310x264.png 424w, https://substackcdn.com/image/fetch/$s_!OPNH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb38c4dfe-6963-4389-b024-8e60b6f77098_310x264.png 848w, https://substackcdn.com/image/fetch/$s_!OPNH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb38c4dfe-6963-4389-b024-8e60b6f77098_310x264.png 1272w, https://substackcdn.com/image/fetch/$s_!OPNH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb38c4dfe-6963-4389-b024-8e60b6f77098_310x264.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!OPNH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb38c4dfe-6963-4389-b024-8e60b6f77098_310x264.png" width="310" height="264" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b38c4dfe-6963-4389-b024-8e60b6f77098_310x264.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:264,&quot;width&quot;:310,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:10692,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb38c4dfe-6963-4389-b024-8e60b6f77098_310x264.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!OPNH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb38c4dfe-6963-4389-b024-8e60b6f77098_310x264.png 424w, https://substackcdn.com/image/fetch/$s_!OPNH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb38c4dfe-6963-4389-b024-8e60b6f77098_310x264.png 848w, https://substackcdn.com/image/fetch/$s_!OPNH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb38c4dfe-6963-4389-b024-8e60b6f77098_310x264.png 1272w, https://substackcdn.com/image/fetch/$s_!OPNH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb38c4dfe-6963-4389-b024-8e60b6f77098_310x264.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em><strong>Figure 1.51:</strong> The terminology d_in and d_out: d_in = 8 is the input embedding dimension, and d_out is the chosen output dimension for queries, keys, and values.</em></p><p>When coding language models like GPT-2 or GPT-3, we use specific terminology for these dimensions:</p><p><strong>d_in (Input Dimension):</strong> The dimension of our input embeddings. In our example, this is 8.</p><p><strong>d_out (Output Dimension):</strong> The dimension we want for our query, key, and value vectors. This is the number of columns in our weight matrices.</p><p>Here&#8217;s an important point: you can choose any value for d_out. In practice, it&#8217;s often set equal to d_in for simplicity. So if our input dimension is 8, we might set the output dimension to 8 as well. But we don&#8217;t have to. 
In our example, we&#8217;re using d_out = 4. Why? To demonstrate that the output dimension is flexible. You have the freedom to choose what works best for your model.</p><h4><strong>Listing 1.2: Extracting a Token Embedding and Setting Dimensions</strong></h4><pre><code>x_2 = inputs[1]          # embedding for &#8220;next&#8221;
d_in = inputs.shape[1]   # input dimension
d_out = 4                # dimension for Q, K, V in this toy example

print(x_2)
print(d_in)
print(d_out)
</code></pre><p>Here you select the second row of the input matrix (index 1, since indexing starts at 0), which is the 8-dimensional embedding for the word &#8220;next&#8221;. The variable <code>d_in</code> confirms that the embedding dimension is 8, matching the theory. The variable <code>d_out</code> is set to 4, which means each query, key, and value vector will live in a 4-dimensional space in the following examples. In real models <code>d_out</code> is much larger, but using 4 keeps the printed tensors readable.</p><p><strong>Output</strong></p><pre><code>tensor([0.4300, 0.5600, 0.4300, 0.5600, 0.6900, 0.2100, 0.5600, 0.2100])
8
4
</code></pre><p></p><h4>How These Matrices Learn</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hFQu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47fb5b05-b296-4eb3-87f2-11c00a829b16_754x354.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hFQu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47fb5b05-b296-4eb3-87f2-11c00a829b16_754x354.png 424w, https://substackcdn.com/image/fetch/$s_!hFQu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47fb5b05-b296-4eb3-87f2-11c00a829b16_754x354.png 848w, https://substackcdn.com/image/fetch/$s_!hFQu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47fb5b05-b296-4eb3-87f2-11c00a829b16_754x354.png 1272w, https://substackcdn.com/image/fetch/$s_!hFQu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47fb5b05-b296-4eb3-87f2-11c00a829b16_754x354.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hFQu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47fb5b05-b296-4eb3-87f2-11c00a829b16_754x354.png" width="754" height="354" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/47fb5b05-b296-4eb3-87f2-11c00a829b16_754x354.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:354,&quot;width&quot;:754,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:26877,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47fb5b05-b296-4eb3-87f2-11c00a829b16_754x354.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!hFQu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47fb5b05-b296-4eb3-87f2-11c00a829b16_754x354.png 424w, https://substackcdn.com/image/fetch/$s_!hFQu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47fb5b05-b296-4eb3-87f2-11c00a829b16_754x354.png 848w, https://substackcdn.com/image/fetch/$s_!hFQu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47fb5b05-b296-4eb3-87f2-11c00a829b16_754x354.png 1272w, https://substackcdn.com/image/fetch/$s_!hFQu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F47fb5b05-b296-4eb3-87f2-11c00a829b16_754x354.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"></figcaption></figure></div><p><em><strong>Figure 1.52:</strong> Weight matrices are initialized with random values and updated during training through backpropagation, learning to produce meaningful query, key, and value representations.</em></p><p>At the beginning, all the values in these weight matrices are initialized randomly. They start with no knowledge of language or attention patterns. But here&#8217;s where the magic of training comes in.</p><p>As we train the model using backpropagation, these random values gradually update themselves. 
The matrices learn which transformations help the model understand language better.</p><div class="pullquote"><p> They learn how to create query vectors that ask the right questions, key vectors that identify relevant information, and value vectors that carry the right content.</p></div><h2>1.11 A Quick Note on Matrix Multiplication</h2><p>Before we dive into multiplying our embedding matrices, let&#8217;s make sure we&#8217;re all on the same page about how matrix multiplication actually works. <strong>If you already know this, feel free to skip ahead</strong>. But if matrices feel a bit fuzzy, stick with me for a moment.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!k-gS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F050688e4-1c8e-4af6-83ca-9deeeab868cf_702x302.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!k-gS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F050688e4-1c8e-4af6-83ca-9deeeab868cf_702x302.png 424w, https://substackcdn.com/image/fetch/$s_!k-gS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F050688e4-1c8e-4af6-83ca-9deeeab868cf_702x302.png 848w, https://substackcdn.com/image/fetch/$s_!k-gS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F050688e4-1c8e-4af6-83ca-9deeeab868cf_702x302.png 1272w, https://substackcdn.com/image/fetch/$s_!k-gS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F050688e4-1c8e-4af6-83ca-9deeeab868cf_702x302.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!k-gS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F050688e4-1c8e-4af6-83ca-9deeeab868cf_702x302.png" width="702" height="302" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/050688e4-1c8e-4af6-83ca-9deeeab868cf_702x302.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:302,&quot;width&quot;:702,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:13849,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F050688e4-1c8e-4af6-83ca-9deeeab868cf_702x302.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!k-gS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F050688e4-1c8e-4af6-83ca-9deeeab868cf_702x302.png 424w, https://substackcdn.com/image/fetch/$s_!k-gS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F050688e4-1c8e-4af6-83ca-9deeeab868cf_702x302.png 848w, https://substackcdn.com/image/fetch/$s_!k-gS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F050688e4-1c8e-4af6-83ca-9deeeab868cf_702x302.png 1272w, https://substackcdn.com/image/fetch/$s_!k-gS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F050688e4-1c8e-4af6-83ca-9deeeab868cf_702x302.png 1456w" sizes="100vw" 
loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em><strong>Figure 1.53:</strong> Matrix multiplication: Matrix A of shape (3, 2) multiplied by Matrix B of shape (2, 3) produces a result of shape (3, 3). The inner dimensions must match.</em></p><p>We have <strong>Matrix A</strong> with dimensions <strong>(3, 2)</strong> and <strong>Matrix B</strong> with dimensions <strong>(2, 3)</strong>. Notice something important: the number of columns in Matrix A (which is 2) matches the number of rows in Matrix B (also 2). This isn&#8217;t a coincidence. 
For matrix multiplication to work, these inner dimensions must match.</p><p>When we multiply them, we get a result with dimensions (3, 3). The outer dimensions survive: 3 rows from Matrix A and 3 columns from Matrix B.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UorR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f22e2ac-08f8-4c8e-938d-6b4f5d886c12_1096x302.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UorR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f22e2ac-08f8-4c8e-938d-6b4f5d886c12_1096x302.png 424w, https://substackcdn.com/image/fetch/$s_!UorR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f22e2ac-08f8-4c8e-938d-6b4f5d886c12_1096x302.png 848w, https://substackcdn.com/image/fetch/$s_!UorR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f22e2ac-08f8-4c8e-938d-6b4f5d886c12_1096x302.png 1272w, https://substackcdn.com/image/fetch/$s_!UorR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f22e2ac-08f8-4c8e-938d-6b4f5d886c12_1096x302.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UorR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f22e2ac-08f8-4c8e-938d-6b4f5d886c12_1096x302.png" width="1096" height="302" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5f22e2ac-08f8-4c8e-938d-6b4f5d886c12_1096x302.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:302,&quot;width&quot;:1096,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:13701,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f22e2ac-08f8-4c8e-938d-6b4f5d886c12_1096x302.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!UorR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f22e2ac-08f8-4c8e-938d-6b4f5d886c12_1096x302.png 424w, https://substackcdn.com/image/fetch/$s_!UorR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f22e2ac-08f8-4c8e-938d-6b4f5d886c12_1096x302.png 848w, https://substackcdn.com/image/fetch/$s_!UorR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f22e2ac-08f8-4c8e-938d-6b4f5d886c12_1096x302.png 1272w, https://substackcdn.com/image/fetch/$s_!UorR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5f22e2ac-08f8-4c8e-938d-6b4f5d886c12_1096x302.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em><strong>Figure 1.54:</strong> Element-wise computation in matrix multiplication: each output entry is the dot product of a row from the first matrix and a column from the second matrix.</em></p><p>To compute each element in the result, you take a row from the first matrix and pair it with a column from the second matrix. 
Multiply corresponding elements together, then sum them up.</p><p><strong>Example:</strong> To find the element at position (1, 1):</p><ul><li><p>Take row 1 from Matrix A: [1, 2]</p></li><li><p>Take column 1 from Matrix B: [7, 10]</p></li><li><p>Calculate: (1 &#215; 7) + (2 &#215; 10) = 7 + 20 = <strong>27</strong></p></li></ul><p><strong>Another example:</strong> For position (2, 1):</p><ul><li><p>Take row 2 from Matrix A: [3, 4]</p></li><li><p>Take column 1 from Matrix B: [7, 10]</p></li><li><p>Calculate: (3 &#215; 7) + (4 &#215; 10) = 21 + 40 = <strong>61</strong></p></li></ul><p>You repeat this pattern for every position. Row meets column, multiply and sum. That&#8217;s the entire process.</p><div><hr></div><h3>[Step 1] Creating Query, Key, and Value Vectors</h3><p>The first step in converting our input embedding matrix into context embeddings is straightforward: matrix multiplication. Let&#8217;s walk through this process carefully, starting with how we create query vectors.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XJzM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84bb6337-2753-4dd1-9e2d-f2a1450e5e60_902x392.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XJzM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84bb6337-2753-4dd1-9e2d-f2a1450e5e60_902x392.png 424w, https://substackcdn.com/image/fetch/$s_!XJzM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84bb6337-2753-4dd1-9e2d-f2a1450e5e60_902x392.png 848w, 
https://substackcdn.com/image/fetch/$s_!XJzM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84bb6337-2753-4dd1-9e2d-f2a1450e5e60_902x392.png 1272w, https://substackcdn.com/image/fetch/$s_!XJzM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84bb6337-2753-4dd1-9e2d-f2a1450e5e60_902x392.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XJzM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84bb6337-2753-4dd1-9e2d-f2a1450e5e60_902x392.png" width="902" height="392" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/84bb6337-2753-4dd1-9e2d-f2a1450e5e60_902x392.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:392,&quot;width&quot;:902,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:30032,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84bb6337-2753-4dd1-9e2d-f2a1450e5e60_902x392.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!XJzM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84bb6337-2753-4dd1-9e2d-f2a1450e5e60_902x392.png 424w, https://substackcdn.com/image/fetch/$s_!XJzM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84bb6337-2753-4dd1-9e2d-f2a1450e5e60_902x392.png 
848w, https://substackcdn.com/image/fetch/$s_!XJzM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84bb6337-2753-4dd1-9e2d-f2a1450e5e60_902x392.png 1272w, https://substackcdn.com/image/fetch/$s_!XJzM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F84bb6337-2753-4dd1-9e2d-f2a1450e5e60_902x392.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em><strong>Figure 1.55:</strong> The input embedding matrix (5,8) is multiplied by the query weight matrix W_q (8,4) to produce 
the query matrix (5,4).</em></p><p>We take our input embedding matrix and multiply it by the query weight matrix (W_q). This transformation gives us our query vectors.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZGgK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf39ee72-1f82-4a7d-85c4-78e68d0a2c9b_1659x588.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZGgK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf39ee72-1f82-4a7d-85c4-78e68d0a2c9b_1659x588.png 424w, https://substackcdn.com/image/fetch/$s_!ZGgK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf39ee72-1f82-4a7d-85c4-78e68d0a2c9b_1659x588.png 848w, https://substackcdn.com/image/fetch/$s_!ZGgK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf39ee72-1f82-4a7d-85c4-78e68d0a2c9b_1659x588.png 1272w, https://substackcdn.com/image/fetch/$s_!ZGgK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf39ee72-1f82-4a7d-85c4-78e68d0a2c9b_1659x588.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZGgK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf39ee72-1f82-4a7d-85c4-78e68d0a2c9b_1659x588.png" width="1456" height="516" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cf39ee72-1f82-4a7d-85c4-78e68d0a2c9b_1659x588.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:516,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:59565,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf39ee72-1f82-4a7d-85c4-78e68d0a2c9b_1659x588.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ZGgK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf39ee72-1f82-4a7d-85c4-78e68d0a2c9b_1659x588.png 424w, https://substackcdn.com/image/fetch/$s_!ZGgK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf39ee72-1f82-4a7d-85c4-78e68d0a2c9b_1659x588.png 848w, https://substackcdn.com/image/fetch/$s_!ZGgK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf39ee72-1f82-4a7d-85c4-78e68d0a2c9b_1659x588.png 1272w, https://substackcdn.com/image/fetch/$s_!ZGgK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcf39ee72-1f82-4a7d-85c4-78e68d0a2c9b_1659x588.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><p><em><strong>Figure 1.56:</strong> Detailed view of the matrix multiplication: each 8-dimensional word embedding is projected through the weight matrix to produce a 4-dimensional query vector</em></p><p>Each row of the input matrix represents one word with its 8-dimensional embedding.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FNBe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef2a6b4b-8ac3-4f9d-9cb0-b7c553eeb872_1659x1428.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!FNBe!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef2a6b4b-8ac3-4f9d-9cb0-b7c553eeb872_1659x1428.png 424w, https://substackcdn.com/image/fetch/$s_!FNBe!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef2a6b4b-8ac3-4f9d-9cb0-b7c553eeb872_1659x1428.png 848w, https://substackcdn.com/image/fetch/$s_!FNBe!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef2a6b4b-8ac3-4f9d-9cb0-b7c553eeb872_1659x1428.png 1272w, https://substackcdn.com/image/fetch/$s_!FNBe!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef2a6b4b-8ac3-4f9d-9cb0-b7c553eeb872_1659x1428.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!FNBe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef2a6b4b-8ac3-4f9d-9cb0-b7c553eeb872_1659x1428.png" width="1456" height="1253" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ef2a6b4b-8ac3-4f9d-9cb0-b7c553eeb872_1659x1428.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1253,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:83035,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef2a6b4b-8ac3-4f9d-9cb0-b7c553eeb872_1659x1428.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!FNBe!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef2a6b4b-8ac3-4f9d-9cb0-b7c553eeb872_1659x1428.png 424w, https://substackcdn.com/image/fetch/$s_!FNBe!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef2a6b4b-8ac3-4f9d-9cb0-b7c553eeb872_1659x1428.png 848w, https://substackcdn.com/image/fetch/$s_!FNBe!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef2a6b4b-8ac3-4f9d-9cb0-b7c553eeb872_1659x1428.png 1272w, https://substackcdn.com/image/fetch/$s_!FNBe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef2a6b4b-8ac3-4f9d-9cb0-b7c553eeb872_1659x1428.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><p><em><strong>Figure 1.57:</strong> Step-by-step computation showing how one row of the input matrix multiplied by W_q produces one row of the query matrix.</em></p><p>When we multiply this row by the weight matrix, we get a new row in the output, a <strong>4-dimensional query vector</strong> for that word. This happens for all five words simultaneously.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DF1M!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F917157c6-ea62-46bd-88ec-cca327ec9f21_1106x392.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DF1M!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F917157c6-ea62-46bd-88ec-cca327ec9f21_1106x392.png 424w, https://substackcdn.com/image/fetch/$s_!DF1M!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F917157c6-ea62-46bd-88ec-cca327ec9f21_1106x392.png 848w, https://substackcdn.com/image/fetch/$s_!DF1M!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F917157c6-ea62-46bd-88ec-cca327ec9f21_1106x392.png 1272w, 
https://substackcdn.com/image/fetch/$s_!DF1M!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F917157c6-ea62-46bd-88ec-cca327ec9f21_1106x392.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DF1M!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F917157c6-ea62-46bd-88ec-cca327ec9f21_1106x392.png" width="1106" height="392" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/917157c6-ea62-46bd-88ec-cca327ec9f21_1106x392.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:392,&quot;width&quot;:1106,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:39827,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F917157c6-ea62-46bd-88ec-cca327ec9f21_1106x392.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DF1M!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F917157c6-ea62-46bd-88ec-cca327ec9f21_1106x392.png 424w, https://substackcdn.com/image/fetch/$s_!DF1M!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F917157c6-ea62-46bd-88ec-cca327ec9f21_1106x392.png 848w, 
https://substackcdn.com/image/fetch/$s_!DF1M!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F917157c6-ea62-46bd-88ec-cca327ec9f21_1106x392.png 1272w, https://substackcdn.com/image/fetch/$s_!DF1M!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F917157c6-ea62-46bd-88ec-cca327ec9f21_1106x392.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em><strong>Figure 1.58:</strong> The resulting query matrix: five words now have 4-dimensional query vectors, transformed from the 
original 8-dimensional embeddings.</em></p><p>The result? A query matrix where each of our five words now has its own query vector, transformed from 8 dimensions down to 4.</p><h4>The Complete Picture</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mlje!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d6908dd-57c4-48de-9565-9fddcd392325_1022x1190.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mlje!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d6908dd-57c4-48de-9565-9fddcd392325_1022x1190.png 424w, https://substackcdn.com/image/fetch/$s_!mlje!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d6908dd-57c4-48de-9565-9fddcd392325_1022x1190.png 848w, https://substackcdn.com/image/fetch/$s_!mlje!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d6908dd-57c4-48de-9565-9fddcd392325_1022x1190.png 1272w, https://substackcdn.com/image/fetch/$s_!mlje!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d6908dd-57c4-48de-9565-9fddcd392325_1022x1190.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!mlje!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d6908dd-57c4-48de-9565-9fddcd392325_1022x1190.png" width="1022" height="1190" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8d6908dd-57c4-48de-9565-9fddcd392325_1022x1190.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1190,&quot;width&quot;:1022,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:76642,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d6908dd-57c4-48de-9565-9fddcd392325_1022x1190.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!mlje!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d6908dd-57c4-48de-9565-9fddcd392325_1022x1190.png 424w, https://substackcdn.com/image/fetch/$s_!mlje!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d6908dd-57c4-48de-9565-9fddcd392325_1022x1190.png 848w, https://substackcdn.com/image/fetch/$s_!mlje!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d6908dd-57c4-48de-9565-9fddcd392325_1022x1190.png 1272w, https://substackcdn.com/image/fetch/$s_!mlje!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8d6908dd-57c4-48de-9565-9fddcd392325_1022x1190.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><p><em><strong>Figure 1.59:</strong></em> <em>All three projections in parallel: the input embedding matrix is multiplied by W_q, W_k, and W_v simultaneously to produce Query, Key, and Value matrices, each of shape (5, 4)</em></p><p>Creating the query vectors is just the beginning. The same transformation process happens two more times, each with its own weight matrix, all operating in parallel.</p><p><strong>For the key vectors,</strong> we multiply our input embedding matrix by the key weight matrix (W_k). It has the same (8, 4) shape as W_q, and the process is identical. Each word gets its own key vector.</p><p><strong>For the value vectors,</strong> we multiply the input embedding matrix by the value weight matrix (W_v). Each word now has a value vector too. 
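</p><p>The projection step described above can be sketched in plain Python, without PyTorch, so that every multiply-and-sum is visible. This is only an illustration under stated assumptions: the helper <code>matmul</code> and the toy matrices <code>X</code>, <code>W_q</code>, <code>W_k</code>, and <code>W_v</code> (3 tokens, 4-dimensional embeddings, 2-dimensional outputs) are hypothetical and much smaller than the (5, 8) input used in this section. It shows that multiplying the whole input matrix by a weight matrix gives the same result as projecting each token row on its own, and that all three projections start from the same input:</p>

```python
def matmul(A, B):
    """Multiply matrices stored as lists of rows: row meets column, multiply and sum."""
    if any(len(row) != len(B) for row in A):
        raise ValueError("columns of A must equal rows of B")
    columns = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in columns] for row in A]

# Hypothetical toy input: 3 tokens, each with a 4-dimensional embedding.
X = [
    [1, 0, 2, 1],
    [0, 1, 1, 0],
    [2, 1, 0, 1],
]

# Hypothetical weight matrices, each of shape (4, 2): d_in = 4, d_out = 2.
W_q = [[1, 0], [0, 1], [1, 1], [0, 2]]
W_k = [[0, 1], [1, 0], [1, 0], [1, 1]]
W_v = [[1, 1], [0, 1], [1, 0], [0, 0]]

# Project all tokens at once: (3, 4) times (4, 2) gives (3, 2).
Q = matmul(X, W_q)
K = matmul(X, W_k)
V = matmul(X, W_v)

# Project one token at a time: each row of X yields one row of Q.
Q_rowwise = [matmul([row], W_q)[0] for row in X]

print("Q:", Q)
print("K:", K)
print("V:", V)
print("row-by-row matches full product:", Q == Q_rowwise)
```

<p>The PyTorch listings in this section perform the same projections with the <code>@</code> operator on the full (5, 8) input matrix.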
</p><p>Here&#8217;s something important to recognize: we&#8217;ve moved from an 8-dimensional space to a 4-dimensional space. More significantly, we&#8217;ve moved into a different kind of space altogether. We&#8217;re no longer dealing with input embeddings, those static representations of words. We&#8217;re now working with query, key, and value vectors. Each lives in its own transformed space, optimized for a specific purpose in the attention mechanism.</p><p>This might seem like an odd detour. Why transform our embeddings at all? Why not just work with them directly?</p><p>This trick, transforming data into different spaces, is fundamental to deep learning, and it&#8217;s powerful for a simple reason: sometimes the patterns we need aren&#8217;t visible in the original data. Think about it this way. In computer vision, early systems used hand-crafted features like edges and corners. Then convolutional neural networks came along and learned to discover their own features automatically, finding patterns humans never thought to look for. That&#8217;s what&#8217;s happening here. We&#8217;re not stuck with the fixed relationships in our input embeddings. Instead, we let the model learn, through training, which transformations actually help it understand language.</p><p>Think of it like passing our input through three different lenses simultaneously. Each lens (each weight matrix) transforms the same input embeddings in a different way, extracting different aspects of meaning. When all three transformations are complete, we have three new matrices sitting side by side, all with the same shape, (5, 4). These three matrices are now ready for the next step in the attention mechanism. The queries and keys will interact to figure out who should pay attention to whom. But that&#8217;s a story for the next section.</p><h4><strong>Listing 1.3: Initializing Query, Key, and Value Weight Matrices</strong></h4><pre><code>torch.manual_seed(123)

W_query = torch.nn.Parameter(torch.randn(d_in, d_out), requires_grad=False)
W_key   = torch.nn.Parameter(torch.randn(d_in, d_out), requires_grad=False)
W_value = torch.nn.Parameter(torch.randn(d_in, d_out), requires_grad=False)

print("W_query:")
print(W_query)
print("\nW_key:")
print(W_key)
print("\nW_value:")
print(W_value)
</code></pre><p><strong>Output</strong></p><pre><code>W_query:
Parameter containing:
tensor([[ 0.2961,  0.5166, -0.0973,  0.2340],
        [ 0.2517,  0.6886,  0.0451, -0.4128],
        [ 0.0740,  0.8665,  0.3210,  0.0185],
        [ 0.1366,  0.1025, -0.2314,  0.5642],
        [ 0.1841,  0.7264, -0.1035,  0.3399],
        [ 0.3153,  0.6871,  0.2478, -0.1520],
        [ 0.0756,  0.1966,  0.5142,  0.0813],
        [ 0.3164,  0.4017, -0.0879,  0.2904]])

W_key:
Parameter containing:
tensor([[ 0.1186,  0.8274,  0.1040, -0.3055],
        [ 0.3821,  0.6605, -0.2103,  0.1428],
        [ 0.8536,  0.5932, -0.1449,  0.3170],
        [ 0.6367,  0.9826,  0.2553, -0.0872],
        [ 0.2745,  0.6584,  0.0342,  0.5051],
        [ 0.2775,  0.8573, -0.2984,  0.1907],
        [ 0.8993,  0.0390,  0.1206,  0.2843],
        [ 0.9268,  0.7388, -0.0721,  0.3419]])

W_value:
Parameter containing:
tensor([[ 0.7179,  0.7058, -0.1630,  0.3310],
        [ 0.9156,  0.4340,  0.0982, -0.2753],
        [ 0.0772,  0.3565,  0.2056,  0.1468],
        [ 0.1479,  0.5331, -0.0925,  0.2391],
        [ 0.4066,  0.2318,  0.0194,  0.1844],
        [ 0.4545,  0.9737, -0.3086, -0.0417],
        [ 0.4606,  0.5159,  0.1274,  0.0219],
        [ 0.4220,  0.5786, -0.0853,  0.3640]])
</code></pre><p>This code creates the three weight matrices that turn input embeddings into query, key, and value vectors.<br>The variable <strong>d_in</strong> is the size of each input embedding, here 8. The variable <strong>d_out</strong> is the size we want for the query, key, and value vectors, here 4.</p><p>The tensors <strong>W_query</strong>, <strong>W_key</strong>, and <strong>W_value</strong> are wrapped in <strong>torch.nn.Parameter</strong>, which marks them as model weights. In this listing they are created with <code>requires_grad=False</code>, so PyTorch does not track gradients for them and they stay fixed at their random initial values. For actual training you would set <code>requires_grad=True</code> so that gradient descent can update these matrices and they can learn useful transformations.<br>Each printed tensor has 8 rows and 4 columns, that is, shape <code>(8, 4)</code>, matching the description in the theory where the number of rows equals <strong>d_in</strong> and the number of columns equals <strong>d_out</strong>.</p><h4><strong>Listing 1.4: Computing Query, Key, and Value Vectors</strong></h4><pre><code>queries = inputs @ W_query   # shape: (5, 4)
keys    = inputs @ W_key     # shape: (5, 4)
values  = inputs @ W_value   # shape: (5, 4)

print("queries.shape:", queries.shape)
print("keys.shape   :", keys.shape)
print("values.shape :", values.shape)

print("\nqueries:")
print(queries)
print("\nkeys:")
print(keys)
print("\nvalues:")
print(values)</code></pre><p><strong>Output</strong></p><pre><code>queries.shape: torch.Size([5, 4])
keys.shape   : torch.Size([5, 4])
values.shape : torch.Size([5, 4])

queries:
tensor([[ 0.8840,  2.0469,  0.3419,  0.5755],
        [ 0.8723,  2.0443,  0.3241,  0.3917],
        [ 0.9738,  1.9925,  0.3154,  0.7673],
        [ 1.1051,  2.1175,  0.2772,  0.9102],
        [ 0.9692,  2.5180,  0.2741,  0.8069]])

keys:
tensor([[ 2.2381,  2.7132,  0.0507,  0.8830],
        [ 2.1564,  2.7927, -0.0064,  0.8684],
        [ 2.6705,  2.7739,  0.0891,  1.0862],
        [ 2.4087,  3.1074,  0.0683,  1.0651],
        [ 2.6735,  3.1745,  0.0538,  1.2434]])

values:
tensor([[ 2.1213,  2.5079, -0.2830,  0.7693],
        [ 2.0476,  2.1981, -0.2301,  0.4927],
        [ 2.1971,  2.4733, -0.2806,  0.9121],
        [ 2.4207,  2.5415, -0.3020,  0.9874],
        [ 2.2625,  2.5902, -0.2714,  0.9625]])
</code></pre><p>Here we apply the three weight matrices to the full input matrix. Each row of <strong>inputs</strong> is an embedding for one word.<br>The matrix multiplication <code>inputs @ W_query</code> takes every word embedding and projects it into query space. The result is a query matrix with shape <code>(5, 4)</code>: one query vector of length four for each of the five words. The same happens for keys and values.</p><p>This mirrors the explanation in the text that we now have three new matrices, each of shape <code>(number of tokens, d_out)</code>. We are no longer working with raw input embeddings but with transformed representations: queries for searching, keys for being searched, and values for being blended.</p><h3>[Step 2] Computing Attention Scores</h3><p>Now that we have our query, key, and value vectors, we&#8217;re ready for the heart of the attention mechanism: figuring out which words should pay attention to which other words. Remember, each word has a query vector and a key vector. When we compute the dot product between a query and a key, we get a number that represents how well they align. A high dot product means strong alignment, which translates to high attention. 
A low dot product means weak alignment, which means less attention.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gldU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c9fa9fd-3c3a-4ac1-bc5d-882d08202aba_538x334.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gldU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c9fa9fd-3c3a-4ac1-bc5d-882d08202aba_538x334.png 424w, https://substackcdn.com/image/fetch/$s_!gldU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c9fa9fd-3c3a-4ac1-bc5d-882d08202aba_538x334.png 848w, https://substackcdn.com/image/fetch/$s_!gldU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c9fa9fd-3c3a-4ac1-bc5d-882d08202aba_538x334.png 1272w, https://substackcdn.com/image/fetch/$s_!gldU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c9fa9fd-3c3a-4ac1-bc5d-882d08202aba_538x334.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gldU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c9fa9fd-3c3a-4ac1-bc5d-882d08202aba_538x334.png" width="538" height="334" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2c9fa9fd-3c3a-4ac1-bc5d-882d08202aba_538x334.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:334,&quot;width&quot;:538,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:20541,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c9fa9fd-3c3a-4ac1-bc5d-882d08202aba_538x334.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!gldU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c9fa9fd-3c3a-4ac1-bc5d-882d08202aba_538x334.png 424w, https://substackcdn.com/image/fetch/$s_!gldU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c9fa9fd-3c3a-4ac1-bc5d-882d08202aba_538x334.png 848w, https://substackcdn.com/image/fetch/$s_!gldU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c9fa9fd-3c3a-4ac1-bc5d-882d08202aba_538x334.png 1272w, https://substackcdn.com/image/fetch/$s_!gldU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c9fa9fd-3c3a-4ac1-bc5d-882d08202aba_538x334.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em><strong>Figure 1.60:</strong> The dot product between a query vector and a key vector produces a scalar attention score indicating how well they align.</em></p><p>Here&#8217;s where we hit a small technical hurdle. We want to compute all these dot products at once using matrix multiplication. Our Query matrix has dimensions (5, 4) and our Keys matrix also has dimensions (5, 4). If we try to multiply them directly, Query &#215; Keys, we run into a problem. For matrix multiplication to work, the number of columns in the first matrix must equal the number of rows in the second matrix. But Query has 4 columns and Keys has 5 rows. They don&#8217;t match. The multiplication simply won&#8217;t work. 
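</p><p>To make the mismatch concrete, here is a minimal NumPy sketch. The arrays are zero-filled placeholders (an assumption for illustration, not the values from the figures), and the names <code>query</code> and <code>keys</code> are chosen here; only the (5, 4) shapes matter.</p>

```python
import numpy as np

# Placeholder matrices: 5 words, 4 dimensions each (values are illustrative only).
query = np.zeros((5, 4))
keys = np.zeros((5, 4))

try:
    # Query has 4 columns but Keys has 5 rows, so the product is undefined.
    query @ keys
    multiplied = True
except ValueError as err:
    multiplied = False
    print("Shape mismatch:", err)
```

<p>NumPy refuses the product because the inner dimensions, 4 and 5, differ. 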
The solution to this problem is to transpose the Keys matrix.</p><div><hr></div><h4>A Quick Note on Matrix Transpose</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RvNM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22fe456c-7021-4035-a62f-ed157c5657d6_576x297.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RvNM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22fe456c-7021-4035-a62f-ed157c5657d6_576x297.png 424w, https://substackcdn.com/image/fetch/$s_!RvNM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22fe456c-7021-4035-a62f-ed157c5657d6_576x297.png 848w, https://substackcdn.com/image/fetch/$s_!RvNM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22fe456c-7021-4035-a62f-ed157c5657d6_576x297.png 1272w, https://substackcdn.com/image/fetch/$s_!RvNM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22fe456c-7021-4035-a62f-ed157c5657d6_576x297.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RvNM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22fe456c-7021-4035-a62f-ed157c5657d6_576x297.png" width="576" height="297" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/22fe456c-7021-4035-a62f-ed157c5657d6_576x297.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:297,&quot;width&quot;:576,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:9247,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22fe456c-7021-4035-a62f-ed157c5657d6_576x297.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!RvNM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22fe456c-7021-4035-a62f-ed157c5657d6_576x297.png 424w, https://substackcdn.com/image/fetch/$s_!RvNM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22fe456c-7021-4035-a62f-ed157c5657d6_576x297.png 848w, https://substackcdn.com/image/fetch/$s_!RvNM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22fe456c-7021-4035-a62f-ed157c5657d6_576x297.png 1272w, https://substackcdn.com/image/fetch/$s_!RvNM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F22fe456c-7021-4035-a62f-ed157c5657d6_576x297.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em><strong>Figure 1.61:</strong> Matrix transpose: rows become columns and columns become rows, converting a (3, 2) matrix into a (2, 3) matrix.</em></p><p>If you&#8217;re already comfortable with matrix transpose, feel free to <strong>skip ahead to the next section.</strong> But if transpose feels unfamiliar or you want a quick refresher, stay with me for just a moment.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zBlY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ba794ba-11d8-4708-893a-939db1dc9e47_1614x444.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!zBlY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ba794ba-11d8-4708-893a-939db1dc9e47_1614x444.png 424w, https://substackcdn.com/image/fetch/$s_!zBlY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ba794ba-11d8-4708-893a-939db1dc9e47_1614x444.png 848w, https://substackcdn.com/image/fetch/$s_!zBlY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ba794ba-11d8-4708-893a-939db1dc9e47_1614x444.png 1272w, https://substackcdn.com/image/fetch/$s_!zBlY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ba794ba-11d8-4708-893a-939db1dc9e47_1614x444.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zBlY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ba794ba-11d8-4708-893a-939db1dc9e47_1614x444.png" width="1456" height="401" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1ba794ba-11d8-4708-893a-939db1dc9e47_1614x444.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:401,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:33658,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ba794ba-11d8-4708-893a-939db1dc9e47_1614x444.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!zBlY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ba794ba-11d8-4708-893a-939db1dc9e47_1614x444.png 424w, https://substackcdn.com/image/fetch/$s_!zBlY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ba794ba-11d8-4708-893a-939db1dc9e47_1614x444.png 848w, https://substackcdn.com/image/fetch/$s_!zBlY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ba794ba-11d8-4708-893a-939db1dc9e47_1614x444.png 1272w, https://substackcdn.com/image/fetch/$s_!zBlY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ba794ba-11d8-4708-893a-939db1dc9e47_1614x444.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em><strong>Figure 1.62:</strong> Transposing the Keys matrix from (5, 4) to (4, 5) so that it can be multiplied with the (5, 4) Query matrix.</em></p><p>When we transpose a matrix, we flip it along its main diagonal: the entry at row i, column j moves to row j, column i. Rows become columns, and columns become rows. If you have a matrix with dimensions (3, 2), its transpose will have dimensions (2, 3). The first row of the original matrix becomes the first column of the transposed matrix. The second row becomes the second column, and so on. It&#8217;s like rotating the entire matrix 90 degrees and reflecting it. This simple operation is incredibly useful because it lets us align dimensions for matrix multiplication when they wouldn&#8217;t otherwise match.</p><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RLn5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a84325b-e1b4-4f95-8134-c8f601791fbf_626x314.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RLn5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a84325b-e1b4-4f95-8134-c8f601791fbf_626x314.png 424w, https://substackcdn.com/image/fetch/$s_!RLn5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a84325b-e1b4-4f95-8134-c8f601791fbf_626x314.png 848w, 
https://substackcdn.com/image/fetch/$s_!RLn5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a84325b-e1b4-4f95-8134-c8f601791fbf_626x314.png 1272w, https://substackcdn.com/image/fetch/$s_!RLn5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a84325b-e1b4-4f95-8134-c8f601791fbf_626x314.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RLn5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a84325b-e1b4-4f95-8134-c8f601791fbf_626x314.png" width="626" height="314" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0a84325b-e1b4-4f95-8134-c8f601791fbf_626x314.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:314,&quot;width&quot;:626,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:19848,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a84325b-e1b4-4f95-8134-c8f601791fbf_626x314.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!RLn5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a84325b-e1b4-4f95-8134-c8f601791fbf_626x314.png 424w, https://substackcdn.com/image/fetch/$s_!RLn5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a84325b-e1b4-4f95-8134-c8f601791fbf_626x314.png 
848w, https://substackcdn.com/image/fetch/$s_!RLn5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a84325b-e1b4-4f95-8134-c8f601791fbf_626x314.png 1272w, https://substackcdn.com/image/fetch/$s_!RLn5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a84325b-e1b4-4f95-8134-c8f601791fbf_626x314.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em><strong>Figure 1.63:</strong> The Keys matrix transposed: each word&#8217;s key vector becomes a column, converting the (5, 4) 
matrix into (4, 5).</em></p><p>When we transpose the Keys matrix, each row becomes a column. Notice how the first row of the original Keys matrix, <strong>[1.4, 1.0, 1.8, 2.2]</strong> for <strong>&#8220;The&#8221;</strong>, becomes the first column <strong>[1.4, 1.0, 1.8, 2.2]</strong>, reading downward, in the transposed version. The same happens for every word: <strong>&#8220;next&#8221;</strong> becomes the second column, <strong>&#8220;day&#8221;</strong> becomes the third column, and so on, transforming our (5, 4) matrix into a (4, 5) matrix ready for multiplication.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!WiSM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac8f9f81-7d94-4033-b50e-a5096a6a6153_1245x429.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!WiSM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac8f9f81-7d94-4033-b50e-a5096a6a6153_1245x429.png 424w, https://substackcdn.com/image/fetch/$s_!WiSM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac8f9f81-7d94-4033-b50e-a5096a6a6153_1245x429.png 848w, https://substackcdn.com/image/fetch/$s_!WiSM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac8f9f81-7d94-4033-b50e-a5096a6a6153_1245x429.png 1272w, https://substackcdn.com/image/fetch/$s_!WiSM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac8f9f81-7d94-4033-b50e-a5096a6a6153_1245x429.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!WiSM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac8f9f81-7d94-4033-b50e-a5096a6a6153_1245x429.png" width="1245" height="429" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ac8f9f81-7d94-4033-b50e-a5096a6a6153_1245x429.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:429,&quot;width&quot;:1245,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:43854,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac8f9f81-7d94-4033-b50e-a5096a6a6153_1245x429.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!WiSM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac8f9f81-7d94-4033-b50e-a5096a6a6153_1245x429.png 424w, https://substackcdn.com/image/fetch/$s_!WiSM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac8f9f81-7d94-4033-b50e-a5096a6a6153_1245x429.png 848w, https://substackcdn.com/image/fetch/$s_!WiSM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac8f9f81-7d94-4033-b50e-a5096a6a6153_1245x429.png 1272w, https://substackcdn.com/image/fetch/$s_!WiSM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac8f9f81-7d94-4033-b50e-a5096a6a6153_1245x429.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em><strong>Figure 1.64:</strong> Multiplying Query (5, 4) by K_T (4, 5) produces the (5, 5) attention scores matrix, capturing every possible word-to-word relationship.</em></p><p>Multiplying the Query matrix by the transposed Keys matrix gives a (5, 5) attention scores matrix. It holds one score for every query-key pair, that is, for every word-to-word combination in the sentence.</p><h4>Interpreting the Attention Scores Matrix</h4><p>Now that we have our attention scores matrix, let&#8217;s understand what it actually tells us. 
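</p><p>Before reading the matrix, here is a rough NumPy sketch of how such a matrix could be computed and indexed. The values are random placeholders rather than the numbers from the figures, and the names <code>query</code>, <code>keys</code>, <code>keys_t</code>, and <code>scores</code> are assumptions chosen purely for illustration.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder data: 5 words, 4 dimensions each (not the values from the figures).
query = rng.random((5, 4))
keys = rng.random((5, 4))

# Transpose Keys from (5, 4) to (4, 5), then multiply: (5, 4) @ (4, 5) gives (5, 5).
keys_t = keys.T
scores = query @ keys_t

print(keys_t.shape)   # (4, 5)
print(scores.shape)   # (5, 5)

# Entry (i, j) is the dot product of word i's query with word j's key.
# NumPy indices start at 0, so row 2, column 1 in the text is scores[1, 0].
next_to_the = scores[1, 0]
print(np.isclose(next_to_the, np.dot(query[1], keys[0])))  # True
```

<p>Each row of <code>scores</code> belongs to one query word and each column to one key word, so a single matrix product yields all 25 query-key dot products at once.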
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tuQG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff134868-35bb-4722-8400-f068644d682e_561x567.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tuQG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff134868-35bb-4722-8400-f068644d682e_561x567.png 424w, https://substackcdn.com/image/fetch/$s_!tuQG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff134868-35bb-4722-8400-f068644d682e_561x567.png 848w, https://substackcdn.com/image/fetch/$s_!tuQG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff134868-35bb-4722-8400-f068644d682e_561x567.png 1272w, https://substackcdn.com/image/fetch/$s_!tuQG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff134868-35bb-4722-8400-f068644d682e_561x567.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tuQG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff134868-35bb-4722-8400-f068644d682e_561x567.png" width="561" height="567" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ff134868-35bb-4722-8400-f068644d682e_561x567.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:567,&quot;width&quot;:561,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:24520,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff134868-35bb-4722-8400-f068644d682e_561x567.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!tuQG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff134868-35bb-4722-8400-f068644d682e_561x567.png 424w, https://substackcdn.com/image/fetch/$s_!tuQG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff134868-35bb-4722-8400-f068644d682e_561x567.png 848w, https://substackcdn.com/image/fetch/$s_!tuQG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff134868-35bb-4722-8400-f068644d682e_561x567.png 1272w, https://substackcdn.com/image/fetch/$s_!tuQG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff134868-35bb-4722-8400-f068644d682e_561x567.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em><strong>Figure 1.65:</strong> The (5, 5) attention scores matrix: rows represent queries and columns represent keys. Entry (i, j) shows how much word i attends to word j.</em></p><p>Each number in this (5, 5) matrix represents how much one word should attend to another word. 
The key to reading this matrix is simple: </p><blockquote><p><strong>rows</strong> represent <strong>queries</strong>, and <strong>columns</strong> represent <strong>keys</strong>.</p></blockquote><p>Let&#8217;s look at some concrete examples.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!OFAG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe10235d9-2b97-497f-a1a8-f6177fb09c8f_1347x519.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!OFAG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe10235d9-2b97-497f-a1a8-f6177fb09c8f_1347x519.png 424w, https://substackcdn.com/image/fetch/$s_!OFAG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe10235d9-2b97-497f-a1a8-f6177fb09c8f_1347x519.png 848w, https://substackcdn.com/image/fetch/$s_!OFAG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe10235d9-2b97-497f-a1a8-f6177fb09c8f_1347x519.png 1272w, https://substackcdn.com/image/fetch/$s_!OFAG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe10235d9-2b97-497f-a1a8-f6177fb09c8f_1347x519.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!OFAG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe10235d9-2b97-497f-a1a8-f6177fb09c8f_1347x519.png" width="1347" height="519" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e10235d9-2b97-497f-a1a8-f6177fb09c8f_1347x519.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:519,&quot;width&quot;:1347,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:44236,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe10235d9-2b97-497f-a1a8-f6177fb09c8f_1347x519.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!OFAG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe10235d9-2b97-497f-a1a8-f6177fb09c8f_1347x519.png 424w, https://substackcdn.com/image/fetch/$s_!OFAG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe10235d9-2b97-497f-a1a8-f6177fb09c8f_1347x519.png 848w, https://substackcdn.com/image/fetch/$s_!OFAG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe10235d9-2b97-497f-a1a8-f6177fb09c8f_1347x519.png 1272w, https://substackcdn.com/image/fetch/$s_!OFAG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe10235d9-2b97-497f-a1a8-f6177fb09c8f_1347x519.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em><strong>Figure 1.66:</strong> Reading the attention matrix: the entry at row 2, column 1 gives the attention score from &#8220;next&#8221; (query) to &#8220;The&#8221; (key).</em></p><p>Finding attention between <strong>&#8220;next&#8221;</strong> and <strong>&#8220;The&#8221;:</strong></p><p>The word &#8220;next&#8221; is in position 2, so we look at row 2. The word &#8220;The&#8221; is in position 1, so we look at column 1. 
The value at position (2, 1) tells us how much &#8220;next&#8221; attends to &#8220;The.&#8221;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dNhs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6dff7f0-ba2b-4107-9674-e4da9d4eabd8_1347x519.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dNhs!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6dff7f0-ba2b-4107-9674-e4da9d4eabd8_1347x519.png 424w, https://substackcdn.com/image/fetch/$s_!dNhs!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6dff7f0-ba2b-4107-9674-e4da9d4eabd8_1347x519.png 848w, https://substackcdn.com/image/fetch/$s_!dNhs!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6dff7f0-ba2b-4107-9674-e4da9d4eabd8_1347x519.png 1272w, https://substackcdn.com/image/fetch/$s_!dNhs!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6dff7f0-ba2b-4107-9674-e4da9d4eabd8_1347x519.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dNhs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6dff7f0-ba2b-4107-9674-e4da9d4eabd8_1347x519.png" width="1347" height="519" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f6dff7f0-ba2b-4107-9674-e4da9d4eabd8_1347x519.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:519,&quot;width&quot;:1347,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:44888,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6dff7f0-ba2b-4107-9674-e4da9d4eabd8_1347x519.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!dNhs!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6dff7f0-ba2b-4107-9674-e4da9d4eabd8_1347x519.png 424w, https://substackcdn.com/image/fetch/$s_!dNhs!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6dff7f0-ba2b-4107-9674-e4da9d4eabd8_1347x519.png 848w, https://substackcdn.com/image/fetch/$s_!dNhs!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6dff7f0-ba2b-4107-9674-e4da9d4eabd8_1347x519.png 1272w, https://substackcdn.com/image/fetch/$s_!dNhs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff6dff7f0-ba2b-4107-9674-e4da9d4eabd8_1347x519.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em><strong>Figure 1.67:</strong> The entry at row 2, column 2 shows the self-attention score: how much &#8220;next&#8221; attends to itself.</em></p><p>Finding attention between <strong>&#8220;next&#8221;</strong> and itself:</p><p>Same word, but the pattern holds. Row 2 for &#8220;next&#8221; as the query, column 2 for &#8220;next&#8221; as the key. The value at position (2, 2) shows how much &#8220;next&#8221; attends to itself.</p><p>Here&#8217;s where it gets interesting. 
Each row tells a complete story about one word&#8217;s attention pattern.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ELKJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b55187d-82aa-4e76-b8d3-96c362e4899f_486x516.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ELKJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b55187d-82aa-4e76-b8d3-96c362e4899f_486x516.png 424w, https://substackcdn.com/image/fetch/$s_!ELKJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b55187d-82aa-4e76-b8d3-96c362e4899f_486x516.png 848w, https://substackcdn.com/image/fetch/$s_!ELKJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b55187d-82aa-4e76-b8d3-96c362e4899f_486x516.png 1272w, https://substackcdn.com/image/fetch/$s_!ELKJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b55187d-82aa-4e76-b8d3-96c362e4899f_486x516.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ELKJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b55187d-82aa-4e76-b8d3-96c362e4899f_486x516.png" width="486" height="516" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9b55187d-82aa-4e76-b8d3-96c362e4899f_486x516.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:516,&quot;width&quot;:486,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:22543,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b55187d-82aa-4e76-b8d3-96c362e4899f_486x516.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ELKJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b55187d-82aa-4e76-b8d3-96c362e4899f_486x516.png 424w, https://substackcdn.com/image/fetch/$s_!ELKJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b55187d-82aa-4e76-b8d3-96c362e4899f_486x516.png 848w, https://substackcdn.com/image/fetch/$s_!ELKJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b55187d-82aa-4e76-b8d3-96c362e4899f_486x516.png 1272w, https://substackcdn.com/image/fetch/$s_!ELKJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b55187d-82aa-4e76-b8d3-96c362e4899f_486x516.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em><strong>Figure 1.68:</strong> Each row tells a complete story about one word&#8217;s attention pattern across every word in the sentence.</em></p><p>Take the second row, for example. This entire row holds the attention scores between &#8220;next&#8221; (the query) and every word in the sentence, itself included (the keys). As you move across the columns, you see how much &#8220;next&#8221; should attend to &#8220;The,&#8221; then to &#8220;next&#8221; itself, then to &#8220;day,&#8221; then to &#8220;is,&#8221; and finally to &#8220;bright.&#8221;</p><p>The first row does the same for &#8220;The.&#8221; The third row does it for &#8220;day.&#8221; Every row follows this pattern.</p><h4>Problem with Attention Scores Matrix</h4><p>We have our attention scores matrix, and it captures the relationships between words. 
But there&#8217;s a fundamental issue we need to address. Look at the second row, which represents how much &#8220;next&#8221; should attend to all other words. The values might be something like 1.3, 0.9, 1.9, 1.9, and 1.2. These numbers tell us relative importance, but they&#8217;re hard to interpret. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rz16!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52ca8b24-be7b-4ec7-bf37-56d171047874_453x453.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rz16!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52ca8b24-be7b-4ec7-bf37-56d171047874_453x453.png 424w, https://substackcdn.com/image/fetch/$s_!rz16!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52ca8b24-be7b-4ec7-bf37-56d171047874_453x453.png 848w, https://substackcdn.com/image/fetch/$s_!rz16!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52ca8b24-be7b-4ec7-bf37-56d171047874_453x453.png 1272w, https://substackcdn.com/image/fetch/$s_!rz16!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52ca8b24-be7b-4ec7-bf37-56d171047874_453x453.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rz16!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52ca8b24-be7b-4ec7-bf37-56d171047874_453x453.png" width="453" height="453" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/52ca8b24-be7b-4ec7-bf37-56d171047874_453x453.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:453,&quot;width&quot;:453,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:17235,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52ca8b24-be7b-4ec7-bf37-56d171047874_453x453.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!rz16!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52ca8b24-be7b-4ec7-bf37-56d171047874_453x453.png 424w, https://substackcdn.com/image/fetch/$s_!rz16!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52ca8b24-be7b-4ec7-bf37-56d171047874_453x453.png 848w, https://substackcdn.com/image/fetch/$s_!rz16!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52ca8b24-be7b-4ec7-bf37-56d171047874_453x453.png 1272w, https://substackcdn.com/image/fetch/$s_!rz16!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F52ca8b24-be7b-4ec7-bf37-56d171047874_453x453.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em><strong>Figure 1.69:</strong> Raw attention scores do not sum to one and cannot be interpreted as probabilities. We need normalization to create a proper attention distribution.</em></p><p>What we really want is to make clear, intuitive statements like &#8220;next should give 30% of its attention to &#8216;day,&#8217; 25% to &#8216;is,&#8217; 20% to itself, 15% to &#8216;The,&#8217; and 10% to &#8216;bright.&#8217;&#8221; We want percentages that sum to 100%, or in mathematical terms, probabilities that sum to 1. </p><p>This would let us understand attention as a distribution, like slices of a pie chart, where we can immediately see which words matter most. Right now, our raw scores don&#8217;t have this property: add up a row such as 11.0 + 10.2 + 10.1 + 10.4 + 11.2 and you get 52.9, not 1. 
The values in each row don&#8217;t sum to one, and some might even be negative. We can&#8217;t interpret them as percentages or probabilities, and this creates two problems.</p><p>First, there&#8217;s the interpretability issue. We can&#8217;t make clear statements about attention distribution. We can&#8217;t say &#8220;next pays 22% attention to &#8216;bright&#8217;&#8221; when the numbers don&#8217;t represent percentages. Second, there&#8217;s a training stability issue. When training large language models, it&#8217;s better if the numbers stay in a controlled range, ideally between 0 and 1. This makes the training process much more stable. The gradients behave better, and the model learns more reliably.</p><p>That&#8217;s the problem we need to solve, and the solution is to convert attention scores into attention weights. Attention weights have two key properties: they sum to 1 for each row, and each individual weight lies between 0 and 1. This transformation is called normalization.</p><div><hr></div><h4>A Quick Note on Simple Normalization and Softmax</h4><p><strong>Simple Normalization</strong></p><p>The simplest approach to normalization is straightforward. 
Take each value in a row and divide it by the sum of all values in that row.</p><p><em>Formula</em></p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\n\n\\text{Normalized value} = \\frac{x_i}{x_1 + x_2 + x_3 + \\dots + x_n}&quot;,&quot;id&quot;:&quot;VHLAJKUCXA&quot;}" data-component-name="LatexBlockToDOM"></div><p>Example</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-WmL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35ef0079-534b-4a38-95f1-c98df98c19bc_1785x867.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-WmL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35ef0079-534b-4a38-95f1-c98df98c19bc_1785x867.png 424w, https://substackcdn.com/image/fetch/$s_!-WmL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35ef0079-534b-4a38-95f1-c98df98c19bc_1785x867.png 848w, https://substackcdn.com/image/fetch/$s_!-WmL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35ef0079-534b-4a38-95f1-c98df98c19bc_1785x867.png 1272w, https://substackcdn.com/image/fetch/$s_!-WmL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35ef0079-534b-4a38-95f1-c98df98c19bc_1785x867.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-WmL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35ef0079-534b-4a38-95f1-c98df98c19bc_1785x867.png" width="1456" height="707" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/35ef0079-534b-4a38-95f1-c98df98c19bc_1785x867.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:707,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:77204,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35ef0079-534b-4a38-95f1-c98df98c19bc_1785x867.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-WmL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35ef0079-534b-4a38-95f1-c98df98c19bc_1785x867.png 424w, https://substackcdn.com/image/fetch/$s_!-WmL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35ef0079-534b-4a38-95f1-c98df98c19bc_1785x867.png 848w, https://substackcdn.com/image/fetch/$s_!-WmL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35ef0079-534b-4a38-95f1-c98df98c19bc_1785x867.png 1272w, https://substackcdn.com/image/fetch/$s_!-WmL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35ef0079-534b-4a38-95f1-c98df98c19bc_1785x867.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em><strong>Figure 1.70:</strong> Simple normalization divides each value by the row sum, preserving proportions exactly. 
Softmax exponentiates first, dramatically amplifying differences so the largest value dominates.</em></p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{Consider attention scores } [1, 2, 3, 6]&quot;,&quot;id&quot;:&quot;YPDSDQDKFK&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{Sum} = 1 + 2 + 3 + 6 = 12&quot;,&quot;id&quot;:&quot;KZREZVAPKS&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align*}\n&amp;\\text{Simple normalization gives us:} \\\\\n&amp;\\quad \\text{- } 1/12 \\approx 0.083 \\text{ (8.3\\%)} \\\\\n&amp;\\quad \\text{- } 2/12 \\approx 0.167 \\text{ (16.7\\%)} \\\\\n&amp;\\quad \\text{- } 3/12 = 0.250 \\text{ (25.0\\%)} \\\\\n&amp;\\quad \\text{- } 6/12 = 0.500 \\text{ (50.0\\%)}\n\\end{align*}&quot;,&quot;id&quot;:&quot;LTNDNGNNUP&quot;}" data-component-name="LatexBlockToDOM"></div><p>These values sum to 1, which is good. The differences are proportional. The value 6 is twice as large as 3, and after normalization, 0.5 is twice as large as 0.25. The proportions are preserved exactly.</p><p><strong>Softmax Normalization</strong></p><p>Softmax takes a different approach. 
Instead of dividing by the sum directly, it first exponentiates each value, then normalizes by the sum of exponentials.</p><p><em>Formula</em></p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{Softmax}(x_i) = \\frac{e^{x_i}}{e^{x_1} + e^{x_2} + e^{x_3} + \\dots + e^{x_n}}&quot;,&quot;id&quot;:&quot;BSLVJRANLY&quot;}" data-component-name="LatexBlockToDOM"></div><p>Example</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Kj0Y!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30cf53c7-7675-4de2-8d68-a7207ff26afc_1785x1140.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Kj0Y!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30cf53c7-7675-4de2-8d68-a7207ff26afc_1785x1140.png 424w, https://substackcdn.com/image/fetch/$s_!Kj0Y!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30cf53c7-7675-4de2-8d68-a7207ff26afc_1785x1140.png 848w, https://substackcdn.com/image/fetch/$s_!Kj0Y!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30cf53c7-7675-4de2-8d68-a7207ff26afc_1785x1140.png 1272w, https://substackcdn.com/image/fetch/$s_!Kj0Y!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30cf53c7-7675-4de2-8d68-a7207ff26afc_1785x1140.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Kj0Y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30cf53c7-7675-4de2-8d68-a7207ff26afc_1785x1140.png" width="1456" height="930" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/30cf53c7-7675-4de2-8d68-a7207ff26afc_1785x1140.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:930,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:102217,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30cf53c7-7675-4de2-8d68-a7207ff26afc_1785x1140.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Kj0Y!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30cf53c7-7675-4de2-8d68-a7207ff26afc_1785x1140.png 424w, https://substackcdn.com/image/fetch/$s_!Kj0Y!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30cf53c7-7675-4de2-8d68-a7207ff26afc_1785x1140.png 848w, https://substackcdn.com/image/fetch/$s_!Kj0Y!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30cf53c7-7675-4de2-8d68-a7207ff26afc_1785x1140.png 1272w, https://substackcdn.com/image/fetch/$s_!Kj0Y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30cf53c7-7675-4de2-8d68-a7207ff26afc_1785x1140.png 
1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em><strong>Figure 1.71:</strong> Softmax normalization: the largest value (6) receives 93% of the weight while smaller values are heavily suppressed, creating a sharp and decisive attention distribution.</em></p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{Using the same attention scores } [1, 2, 3, 6]&quot;,&quot;id&quot;:&quot;KPAONTEIRV&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align*}\n&amp;\\text{Step 1: Exponentiate each value} 
\\\\\n&amp;\\text{- } e^1 \\approx 2.72 \\\\\n&amp;\\text{- } e^2 \\approx 7.39 \\\\\n&amp;\\text{- } e^3 \\approx 20.09 \\\\\n&amp;\\text{- } e^6 \\approx 403.43\n\\end{align*}&quot;,&quot;id&quot;:&quot;SIRMHOAXSU&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align*}\n&amp;\\text{Step 2: Sum the exponentials} \\\\\n&amp;\\text{Sum} \\approx 2.72 + 7.39 + 20.09 + 403.43 = 433.63\n\\end{align*}&quot;,&quot;id&quot;:&quot;GEUTSGFSJY&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align*}\n&amp;\\text{Step 3: Normalize} \\\\\n&amp;\\text{- } 2.72/433.63 \\approx 0.006 \\text{ (0.6\\%)} \\\\\n&amp;\\text{- } 7.39/433.63 \\approx 0.017 \\text{ (1.7\\%)} \\\\\n&amp;\\text{- } 20.09/433.63 \\approx 0.046 \\text{ (4.6\\%)} \\\\\n&amp;\\text{- } 403.43/433.63 \\approx 0.930 \\text{ (93.0\\%)}\n\\end{align*}&quot;,&quot;id&quot;:&quot;BNXVMPPDGQ&quot;}" data-component-name="LatexBlockToDOM"></div><p>Notice something dramatic. The largest value (6) now dominates completely with 93% of the attention, while the smaller values are heavily suppressed (see the bar graph in the figure). This is the key difference between simple normalization and softmax.</p><h4>Why Softmax Works Better</h4><p><strong>Amplification of Differences</strong></p><p>Softmax has a crucial property: it amplifies differences. Larger values get disproportionately larger weights, and smaller values get disproportionately smaller weights. This makes the resulting distribution sharper and more decisive. </p><p>In our example with simple normalization, the value 6 was six times larger than 1, and after normalization, its weight (50%) was also six times larger than the weight for 1 (8.3%). The proportions stayed exactly the same. </p><p>But with softmax, the value 6 gets 93% of the weight, while 1 gets only 0.6%. 
That&#8217;s a ratio of roughly 150 to 1! The difference was amplified dramatically. This amplification is exactly what we want in attention mechanisms. When one word should clearly attend to another, softmax makes that relationship strong and clear. The model can make bold, decisive choices about where to focus attention.</p><p><strong>Handling Negative Values</strong></p><p>Softmax has another important advantage: it handles negative numbers gracefully.</p><p><strong>Simple Normalization</strong></p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{Consider attention scores } [1, 2, -3, 5]&quot;,&quot;id&quot;:&quot;ZCWSRSTGEZ&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align*}\n\\text{Simple Normalization:} \\\\\n\\text{Sum} = 1 + 2 + (-3) + 5 = 5 \\\\\n\\frac{1}{5} = 0.20 \\ (20\\%) \\\\\n\\frac{2}{5} = 0.40 \\ (40\\%) \\\\\n\\frac{-3}{5} = -0.60 \\ (-60\\%) \\\\\n\\frac{5}{5} = 1.00 \\ (100\\%)\n\\end{align*}\n&quot;,&quot;id&quot;:&quot;ZNUXDKIVHG&quot;}" data-component-name="LatexBlockToDOM"></div><p>We have a problem. A negative probability (-60%) doesn&#8217;t make sense. 
Probabilities must be between 0 and 1.</p><p><strong>Softmax</strong></p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align*}\n\\textbf{Softmax:} \\\\\ne^{1} &amp;\\approx 2.72 \\\\\ne^{2} &amp;\\approx 7.39 \\\\\ne^{-3} &amp;\\approx 0.050 \\\\\ne^{5} &amp;\\approx 148.41\n\\end{align*}\n&quot;,&quot;id&quot;:&quot;LSERSLBFKJ&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{Sum} \\approx 158.57\n&quot;,&quot;id&quot;:&quot;WSLGYVKIFV&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align*}\n\\textbf{Normalized:} \\\\\n\\frac{2.72}{158.57} &amp;\\approx 0.017\\ (1.7\\%) \\\\\n\\frac{7.39}{158.57} &amp;\\approx 0.047\\ (4.7\\%) \\\\\n\\frac{0.050}{158.57} &amp;\\approx 0.0003\\ (0.03\\%) \\\\\n\\frac{148.41}{158.57} &amp;\\approx 0.936\\ (93.6\\%)\n\\end{align*}\n&quot;,&quot;id&quot;:&quot;KERATHFKDX&quot;}" data-component-name="LatexBlockToDOM"></div><p>All values are positive! The negative score (-3) simply gets suppressed to nearly zero (0.03%), while the largest value (5) dominates. Softmax automatically ensures all outputs are valid probabilities regardless of input values.</p><h4><strong>Listing 1.5: Computing Raw Attention Scores</strong></h4><pre><code># attention scores for all five tokens
attn_scores = queries @ keys.T     # shape (5, 5)

print("Attention scores matrix:")
print(attn_scores)

# attention scores only for the word &#8220;next&#8221;
idx = 1                            # index 1 is &#8220;next&#8221;
query_next = queries[idx]          # shape (4,)

keys_T = keys.T                    # shape (4, 5)
attn_scores_next = query_next @ keys_T

print("\nAttention scores for 'next':")
print(attn_scores_next)
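# Illustrative sketch, not part of the original listing: turning the raw scores
# for "next" into attention weights with softmax, using only the standard library.
# Assumptions: the five scores are copied by hand from this listing's output so the
# lines run without `queries` or `keys`; `row_softmax`, `next_scores`, and
# `next_weights` are hypothetical names.
import math

def row_softmax(scores):
    # Subtracting the row maximum does not change the result,
    # but it keeps exp() from overflowing on large scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

next_scores = [10.0395, 10.1182, 10.6117, 11.0026, 11.3353]
next_weights = row_softmax(next_scores)   # five positive weights that sum to 1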
</code></pre><p><strong>Output</strong></p><pre><code>Attention scores matrix:
tensor([[10.3564, 10.4507, 10.9581, 11.3706, 11.7368],
        [10.0395, 10.1182, 10.6117, 11.0026, 11.3353],
        [10.2672, 10.3379, 10.8698, 11.2815, 11.6372],
        [10.5471, 10.6268, 11.1764, 11.5984, 11.9785],
        [11.0007, 11.1326, 11.7058, 12.2208, 12.6405]])

Attention scores for 'next':
tensor([10.0395, 10.1182, 10.6117, 11.0026, 11.3353])</code></pre><p>The matrix <strong>attn_scores</strong> contains all raw attention scores before scaling or softmax. Each row corresponds to one query token. Each column corresponds to one key token. Entry <strong>(i, j)</strong> is the dot product between the query vector for token <strong>i</strong> and the key vector for token <strong>j</strong>.</p><p>Computing the full matrix in one step is just the matrix form of what we did for a single token earlier. In the second part of the code we select the query for <strong>&#8220;next&#8221;</strong> and multiply it with the transposed key matrix. The resulting vector <strong>attn_scores_next</strong> is simply the second row (index 1) of the full score matrix and shows how strongly <strong>&#8220;next&#8221;</strong> matches the key for each word in the sentence, including itself.</p><div><hr></div><h3>[Step 3] Converting Attention Scores to Attention Weights</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wC4X!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35cc6c38-d007-4e35-a8d7-f27737778818_1218x651.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!wC4X!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35cc6c38-d007-4e35-a8d7-f27737778818_1218x651.png 424w, https://substackcdn.com/image/fetch/$s_!wC4X!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35cc6c38-d007-4e35-a8d7-f27737778818_1218x651.png 848w, 
https://substackcdn.com/image/fetch/$s_!wC4X!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35cc6c38-d007-4e35-a8d7-f27737778818_1218x651.png 1272w, https://substackcdn.com/image/fetch/$s_!wC4X!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35cc6c38-d007-4e35-a8d7-f27737778818_1218x651.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!wC4X!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35cc6c38-d007-4e35-a8d7-f27737778818_1218x651.png" width="1218" height="651" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/35cc6c38-d007-4e35-a8d7-f27737778818_1218x651.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:651,&quot;width&quot;:1218,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:41826,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35cc6c38-d007-4e35-a8d7-f27737778818_1218x651.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!wC4X!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35cc6c38-d007-4e35-a8d7-f27737778818_1218x651.png 424w, 
https://substackcdn.com/image/fetch/$s_!wC4X!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35cc6c38-d007-4e35-a8d7-f27737778818_1218x651.png 848w, https://substackcdn.com/image/fetch/$s_!wC4X!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35cc6c38-d007-4e35-a8d7-f27737778818_1218x651.png 1272w, https://substackcdn.com/image/fetch/$s_!wC4X!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F35cc6c38-d007-4e35-a8d7-f27737778818_1218x651.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em><strong>Figure 1.72:</strong> Applying softmax row by row converts raw attention scores into normalized attention weights that sum to one, creating an interpretable probability distribution.</em></p><p>Now let&#8217;s apply softmax to convert our attention scores into attention weights. We&#8217;ll work through this for one row to see exactly how it works.</p><p>Looking at our attention scores matrix, let&#8217;s take the second row for &#8220;next.&#8221; The values are:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\n\\begin{bmatrix}\n11.0 &amp; 10.2 &amp; 10.1 &amp; 10.4 &amp; 11.2\n\\end{bmatrix}\n\n&quot;,&quot;id&quot;:&quot;IYMPPMOJMN&quot;}" data-component-name="LatexBlockToDOM"></div><p>These represent how much &#8220;next&#8221; should attend to &#8220;The,&#8221; &#8220;next,&#8221; &#8220;day,&#8221; &#8220;is,&#8221; and &#8220;bright&#8221; respectively.</p><p><strong>Step 1: Exponentiate each score</strong></p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align*}\ne^{11.0} &amp;\\approx 59,\\!874 \\\\\ne^{10.2} &amp;\\approx 26,\\!903 \\\\\ne^{10.1} &amp;\\approx 24,\\!343 \\\\\ne^{10.4} &amp;\\approx 32,\\!860 \\\\\ne^{11.2} &amp;\\approx 73,\\!130 \\\\\n\\end{align*}\n&quot;,&quot;id&quot;:&quot;DWPRUEJTKQ&quot;}" data-component-name="LatexBlockToDOM"></div><p><strong>Step 2: Calculate the sum</strong></p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\n\\text{Sum} = 59{,}874 + 26{,}903 + 24{,}343 + 32{,}860 + 73{,}130 = 217{,}110\n\n&quot;,&quot;id&quot;:&quot;NMSPWJWFXH&quot;}" data-component-name="LatexBlockToDOM"></div><p><strong>Step 3: Divide each exponential by the sum</strong></p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align*}\n\\textbf{Attention to The}:     &amp;\\quad 
\\frac{59,\\!874}{217,\\!110} \\approx 0.276~(27.6\\%) \\\\\n\\textbf{Attention to next}:    &amp;\\quad \\frac{26,\\!903}{217,\\!110} \\approx 0.124~(12.4\\%) \\\\\n\\textbf{Attention to day}:     &amp;\\quad \\frac{24,\\!343}{217,\\!110} \\approx 0.112~(11.2\\%) \\\\\n\\textbf{Attention to is}:      &amp;\\quad \\frac{32,\\!860}{217,\\!110} \\approx 0.151~(15.1\\%) \\\\\n\\textbf{Attention to bright}:  &amp;\\quad \\frac{73,\\!130}{217,\\!110} \\approx 0.337~(33.7\\%) \\\\\n\\end{align*}\n&quot;,&quot;id&quot;:&quot;VKAJIOWHEN&quot;}" data-component-name="LatexBlockToDOM"></div><p>Now we can make clear, interpretable statements: &#8220;next&#8221; pays 33.7% of its attention to &#8220;bright,&#8221; 27.6% to &#8220;The,&#8221; 15.1% to &#8220;is,&#8221; 12.4% to itself, and 11.2% to &#8220;day.&#8221;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Ko99!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6665c82f-0d35-4a45-990a-7f10af2799cb_1182x504.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Ko99!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6665c82f-0d35-4a45-990a-7f10af2799cb_1182x504.png 424w, https://substackcdn.com/image/fetch/$s_!Ko99!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6665c82f-0d35-4a45-990a-7f10af2799cb_1182x504.png 848w, https://substackcdn.com/image/fetch/$s_!Ko99!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6665c82f-0d35-4a45-990a-7f10af2799cb_1182x504.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Ko99!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6665c82f-0d35-4a45-990a-7f10af2799cb_1182x504.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Ko99!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6665c82f-0d35-4a45-990a-7f10af2799cb_1182x504.png" width="1182" height="504" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6665c82f-0d35-4a45-990a-7f10af2799cb_1182x504.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:504,&quot;width&quot;:1182,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:39123,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6665c82f-0d35-4a45-990a-7f10af2799cb_1182x504.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Ko99!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6665c82f-0d35-4a45-990a-7f10af2799cb_1182x504.png 424w, https://substackcdn.com/image/fetch/$s_!Ko99!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6665c82f-0d35-4a45-990a-7f10af2799cb_1182x504.png 848w, 
https://substackcdn.com/image/fetch/$s_!Ko99!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6665c82f-0d35-4a45-990a-7f10af2799cb_1182x504.png 1272w, https://substackcdn.com/image/fetch/$s_!Ko99!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6665c82f-0d35-4a45-990a-7f10af2799cb_1182x504.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em><strong>Figure 1.73:</strong> The complete attention weights matrix after softmax: every value is between 0 and 1, and every row 
sums to 1.</em></p><p>We apply this same softmax operation to every row in our attention scores matrix. Each row gets its own independent softmax transformation, converting raw scores into normalized attention weights that sum to 1. The result is our attention weights matrix, where every value is between 0 and 1, every row sums to 1, and we can finally interpret the numbers as meaningful probabilities.</p><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HGG5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a939138-81a6-45da-b0a4-392c2aeee6f3_1245x582.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HGG5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a939138-81a6-45da-b0a4-392c2aeee6f3_1245x582.png 424w, https://substackcdn.com/image/fetch/$s_!HGG5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a939138-81a6-45da-b0a4-392c2aeee6f3_1245x582.png 848w, https://substackcdn.com/image/fetch/$s_!HGG5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a939138-81a6-45da-b0a4-392c2aeee6f3_1245x582.png 1272w, https://substackcdn.com/image/fetch/$s_!HGG5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a939138-81a6-45da-b0a4-392c2aeee6f3_1245x582.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!HGG5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a939138-81a6-45da-b0a4-392c2aeee6f3_1245x582.png" width="1245" height="582" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8a939138-81a6-45da-b0a4-392c2aeee6f3_1245x582.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:582,&quot;width&quot;:1245,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:187102,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a939138-81a6-45da-b0a4-392c2aeee6f3_1245x582.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!HGG5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a939138-81a6-45da-b0a4-392c2aeee6f3_1245x582.png 424w, https://substackcdn.com/image/fetch/$s_!HGG5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a939138-81a6-45da-b0a4-392c2aeee6f3_1245x582.png 848w, https://substackcdn.com/image/fetch/$s_!HGG5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a939138-81a6-45da-b0a4-392c2aeee6f3_1245x582.png 1272w, https://substackcdn.com/image/fetch/$s_!HGG5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a939138-81a6-45da-b0a4-392c2aeee6f3_1245x582.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"></figcaption></figure></div><p><em><strong>Figure 1.74:</strong> The scaled dot-product attention formula from the original transformer paper.</em></p><p>Before we move forward, there&#8217;s something important to clarify about the attention formula we&#8217;ve been building. What we just covered was the softmax operation, which converts attention scores into attention weights. 
But in practice, there are two additional operations that happen before we apply softmax: scaling by the square root of the key dimension, and adding a mask.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CB5Z!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4d4f409-81c6-43fe-ad78-8e5c9cdb628a_1557x426.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CB5Z!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4d4f409-81c6-43fe-ad78-8e5c9cdb628a_1557x426.png 424w, https://substackcdn.com/image/fetch/$s_!CB5Z!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4d4f409-81c6-43fe-ad78-8e5c9cdb628a_1557x426.png 848w, https://substackcdn.com/image/fetch/$s_!CB5Z!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4d4f409-81c6-43fe-ad78-8e5c9cdb628a_1557x426.png 1272w, https://substackcdn.com/image/fetch/$s_!CB5Z!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4d4f409-81c6-43fe-ad78-8e5c9cdb628a_1557x426.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CB5Z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4d4f409-81c6-43fe-ad78-8e5c9cdb628a_1557x426.png" width="1456" height="398" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a4d4f409-81c6-43fe-ad78-8e5c9cdb628a_1557x426.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:398,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:53566,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4d4f409-81c6-43fe-ad78-8e5c9cdb628a_1557x426.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!CB5Z!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4d4f409-81c6-43fe-ad78-8e5c9cdb628a_1557x426.png 424w, https://substackcdn.com/image/fetch/$s_!CB5Z!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4d4f409-81c6-43fe-ad78-8e5c9cdb628a_1557x426.png 848w, https://substackcdn.com/image/fetch/$s_!CB5Z!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4d4f409-81c6-43fe-ad78-8e5c9cdb628a_1557x426.png 1272w, https://substackcdn.com/image/fetch/$s_!CB5Z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4d4f409-81c6-43fe-ad78-8e5c9cdb628a_1557x426.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em><strong>Figure 1.75:</strong> The complete attention pipeline: compute QK^T, divide by &#8730;d_k, optionally apply a mask, then apply softmax to obtain attention weights.</em></p><p>You might wonder why we&#8217;re mentioning this now, after already explaining softmax. The reason is pedagogical. Understanding softmax first makes it much easier to grasp why these additional steps matter. If we had introduced all three operations at once, the picture would have been muddier. By learning them in this order, you&#8217;ll see not just what these operations do, but why they&#8217;re necessary.</p><p>If this sounds a bit abstract right now, don&#8217;t worry. The next section will clarify everything. 
We&#8217;ll walk through both scaling and masking step by step, and by the end, you&#8217;ll understand exactly how they fit into the complete attention mechanism. </p><h2>1.12 Why Scale Attention Scores?</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ohqR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F452d9c4d-2576-4765-a8e6-a4fe56034212_1449x501.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ohqR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F452d9c4d-2576-4765-a8e6-a4fe56034212_1449x501.png 424w, https://substackcdn.com/image/fetch/$s_!ohqR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F452d9c4d-2576-4765-a8e6-a4fe56034212_1449x501.png 848w, https://substackcdn.com/image/fetch/$s_!ohqR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F452d9c4d-2576-4765-a8e6-a4fe56034212_1449x501.png 1272w, https://substackcdn.com/image/fetch/$s_!ohqR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F452d9c4d-2576-4765-a8e6-a4fe56034212_1449x501.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ohqR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F452d9c4d-2576-4765-a8e6-a4fe56034212_1449x501.png" width="1449" height="501" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/452d9c4d-2576-4765-a8e6-a4fe56034212_1449x501.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:501,&quot;width&quot;:1449,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:33496,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F452d9c4d-2576-4765-a8e6-a4fe56034212_1449x501.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ohqR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F452d9c4d-2576-4765-a8e6-a4fe56034212_1449x501.png 424w, https://substackcdn.com/image/fetch/$s_!ohqR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F452d9c4d-2576-4765-a8e6-a4fe56034212_1449x501.png 848w, https://substackcdn.com/image/fetch/$s_!ohqR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F452d9c4d-2576-4765-a8e6-a4fe56034212_1449x501.png 1272w, https://substackcdn.com/image/fetch/$s_!ohqR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F452d9c4d-2576-4765-a8e6-a4fe56034212_1449x501.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em><strong>Figure 1.76:</strong> The scaling factor normalizes the variance of the dot product, preventing softmax from producing extremely sharp distributions.</em></p><p>In the Transformer model, the attention mechanism computes its output using the formula:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{Attention}(Q, K, V) = \\text{softmax} \\left( \\frac{Q K^T}{\\sqrt{d_k}} \\right) V&quot;,&quot;id&quot;:&quot;OETVMRXGJF&quot;}" data-component-name="LatexBlockToDOM"></div><p>A critical component of this formula is the scaling factor,</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot; \\sqrt{d_k}. 
&quot;,&quot;id&quot;:&quot;QGMIORUNNE&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{where } d_k \\text{ is the dimension of the key and query vectors.}&quot;,&quot;id&quot;:&quot;GLCLTIKLNA&quot;}" data-component-name="LatexBlockToDOM"></div><p>This scaling is not arbitrary; it is essential for stabilizing the training process.</p><h4>The Problem with Unscaled Scores</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jSaN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4083918-5d3c-4f92-baf6-f4a368c4a364_885x408.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jSaN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4083918-5d3c-4f92-baf6-f4a368c4a364_885x408.png 424w, https://substackcdn.com/image/fetch/$s_!jSaN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4083918-5d3c-4f92-baf6-f4a368c4a364_885x408.png 848w, https://substackcdn.com/image/fetch/$s_!jSaN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4083918-5d3c-4f92-baf6-f4a368c4a364_885x408.png 1272w, https://substackcdn.com/image/fetch/$s_!jSaN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4083918-5d3c-4f92-baf6-f4a368c4a364_885x408.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!jSaN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4083918-5d3c-4f92-baf6-f4a368c4a364_885x408.png" width="885" height="408" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b4083918-5d3c-4f92-baf6-f4a368c4a364_885x408.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:408,&quot;width&quot;:885,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:13507,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4083918-5d3c-4f92-baf6-f4a368c4a364_885x408.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!jSaN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4083918-5d3c-4f92-baf6-f4a368c4a364_885x408.png 424w, https://substackcdn.com/image/fetch/$s_!jSaN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4083918-5d3c-4f92-baf6-f4a368c4a364_885x408.png 848w, https://substackcdn.com/image/fetch/$s_!jSaN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4083918-5d3c-4f92-baf6-f4a368c4a364_885x408.png 1272w, https://substackcdn.com/image/fetch/$s_!jSaN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4083918-5d3c-4f92-baf6-f4a368c4a364_885x408.png 1456w" sizes="100vw" 
loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em><strong>Figure 1.77:</strong> As the key dimension d_k increases, the variance of the dot product grows, causing large scores that push softmax into saturation where gradients vanish.</em></p><p>Each attention score is the dot product of a query vector (a row of Q) with a key vector (a row of K, which appears as a column of K^T). 
A dot product is the sum of element-wise products:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\nq_1k_1 + q_2k_2 + \\dots + q_{d_k}k_{d_k}\n\n&quot;,&quot;id&quot;:&quot;AIZZEESKED&quot;}" data-component-name="LatexBlockToDOM"></div><p>The problem is that as the dimension (d_k) increases, the variance of this dot product also increases. A larger dimension means more terms are added together, so the final scores can become very large in magnitude.</p><p>These large scores are then passed into the <strong>softmax</strong> function. Softmax is sensitive to large gaps between its inputs. If one score is significantly larger than the others, softmax assigns it a probability very close to 1.0 and assigns every other score a probability very close to 0.0. This is known as <strong>saturation</strong>.</p><p>When this happens, the attention becomes &#8220;hard&#8221; and &#8220;spiky,&#8221; focusing on only one position. This makes it difficult for the model to learn: in the saturated regime the softmax gradients become extremely small during backpropagation, effectively vanishing and stalling training for that attention head.</p><h4>The Statistical Solution</h4><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{The choice of } \\sqrt{d_k} \\text{ is a precise statistical correction.}\n\n&quot;,&quot;id&quot;:&quot;QWBTQSBPAG&quot;}" data-component-name="LatexBlockToDOM"></div><p>If we assume the components of (Q) and (K) are independent random variables with a mean of 0 and a variance of 1, then each product term q_i k_i also has a mean of 0 and a variance of 1. A dot product sums d_k such independent terms, so each entry of (Q K^T) has a mean of 0 but a variance of d_k. For example, with d_k = 64 the raw scores have a standard deviation of 8. 
To normalize this, we want to scale the dot product so that its variance remains 1, regardless of the dimension d_k.</p><p>The standard deviation of the dot product is the square root of its variance, which is:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\sqrt{d_k}&quot;,&quot;id&quot;:&quot;LSBFGPYOQV&quot;}" data-component-name="LatexBlockToDOM"></div><p>By dividing each dot product in (Q K^T) by its standard deviation, &#8730;d_k, we ensure the input to the softmax function has a stable variance of 1.</p><h4>Let&#8217;s compute the scaled attention matrix</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!SFX8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fafd7cd53-720e-41f6-a83f-4f9fae79a7e1_768x426.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!SFX8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fafd7cd53-720e-41f6-a83f-4f9fae79a7e1_768x426.png 424w, https://substackcdn.com/image/fetch/$s_!SFX8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fafd7cd53-720e-41f6-a83f-4f9fae79a7e1_768x426.png 848w, https://substackcdn.com/image/fetch/$s_!SFX8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fafd7cd53-720e-41f6-a83f-4f9fae79a7e1_768x426.png 1272w, https://substackcdn.com/image/fetch/$s_!SFX8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fafd7cd53-720e-41f6-a83f-4f9fae79a7e1_768x426.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!SFX8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fafd7cd53-720e-41f6-a83f-4f9fae79a7e1_768x426.png" width="768" height="426" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/afd7cd53-720e-41f6-a83f-4f9fae79a7e1_768x426.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:426,&quot;width&quot;:768,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:26787,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fafd7cd53-720e-41f6-a83f-4f9fae79a7e1_768x426.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!SFX8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fafd7cd53-720e-41f6-a83f-4f9fae79a7e1_768x426.png 424w, https://substackcdn.com/image/fetch/$s_!SFX8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fafd7cd53-720e-41f6-a83f-4f9fae79a7e1_768x426.png 848w, https://substackcdn.com/image/fetch/$s_!SFX8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fafd7cd53-720e-41f6-a83f-4f9fae79a7e1_768x426.png 1272w, https://substackcdn.com/image/fetch/$s_!SFX8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fafd7cd53-720e-41f6-a83f-4f9fae79a7e1_768x426.png 1456w" sizes="100vw" 
loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em><strong>Figure 1.78:</strong> Computing the scaled attention scores. The key dimension is 4, so &#8730;d_k = 2, and every raw score is divided by 2 before softmax.</em></p><p>The &#8220;Keys Vectors&#8221; matrix has the shape <strong>(5, 4)</strong>.</p><p>This means there are 5 key vectors, and each vector has a dimension of <strong>4</strong>.</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{Therefore, } d_k = 4\n&quot;,&quot;id&quot;:&quot;SNRLBFFGHZ&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" 
data-attrs="{&quot;persistentExpression&quot;:&quot;\n\\text{The scaling factor is } \\sqrt{d_k} = \\sqrt{4} = \\mathbf{2}\n\n&quot;,&quot;id&quot;:&quot;WPSOZPIJSH&quot;}" data-component-name="LatexBlockToDOM"></div><p>To get the scaled scores, we divide every number in the &#8220;Attention Scores&#8221; matrix by the scaling factor, <strong>2</strong>.</p><p>This is the &#8220;Attention Scores&#8221; matrix:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\n\\begin{bmatrix}\n10.7 &amp; 10.0 &amp; 9.9 &amp; 10.3 &amp; 11.0 \\\\\n11.0 &amp; 10.2 &amp; 10.1 &amp; 10.4 &amp; 11.2 \\\\\n12.6 &amp; 11.7 &amp; 11.5 &amp; 12.0 &amp; 12.8 \\\\\n13.2 &amp; 12.2 &amp; 12.0 &amp; 12.3 &amp; 13.4 \\\\\n12.5 &amp; 11.5 &amp; 11.4 &amp; 11.5 &amp; 12.7\n\\end{bmatrix}\n\n&quot;,&quot;id&quot;:&quot;VHMNZWKRSB&quot;}" data-component-name="LatexBlockToDOM"></div><p>Calculation (dividing by 2):</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\n\\begin{bmatrix}\n10.7 \\div 2 &amp; 10.0 \\div 2 &amp; 9.9 \\div 2 &amp; 10.3 \\div 2 &amp; 11.0 \\div 2 \\\\\n11.0 \\div 2 &amp; 10.2 \\div 2 &amp; 10.1 \\div 2 &amp; 10.4 \\div 2 &amp; 11.2 \\div 2 \\\\\n\\vdots &amp; \\vdots &amp; \\vdots &amp; \\vdots &amp; \\vdots\n\\end{bmatrix}\n&quot;,&quot;id&quot;:&quot;GJTNUBVVCF&quot;}" data-component-name="LatexBlockToDOM"></div><p>Final &#8220;Scaled Attention Scores&#8221; (based on d_k=4):</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\n\\begin{bmatrix}\n5.35 &amp; 5.00 &amp; 4.95 &amp; 5.15 &amp; 5.50 \\\\\n5.50 &amp; 5.10 &amp; 5.05 &amp; 5.20 &amp; 5.60 \\\\\n6.30 &amp; 5.85 &amp; 5.75 &amp; 6.00 &amp; 6.40 \\\\\n6.60 &amp; 6.10 &amp; 6.00 &amp; 6.15 &amp; 6.70 \\\\\n6.25 &amp; 5.75 &amp; 5.70 &amp; 5.75 &amp; 6.35\n\\end{bmatrix}\n&quot;,&quot;id&quot;:&quot;OCTVOTLFQA&quot;}" data-component-name="LatexBlockToDOM"></div><h4><strong>Listing 1.6: Scaling Scores and Computing Attention 
Weights</strong></h4><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;61f365f9-9ca1-43ac-be42-f9e241c297c7&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">d_k = keys.shape[-1]   # key dimension, 4

# scale scores and convert to attention weights for all tokens

scaled_scores = attn_scores / d_k**0.5
attn_weights = torch.softmax(scaled_scores, dim=-1)

print("Attention weights matrix:")
print(attn_weights)
print("\nRow sums:", attn_weights.sum(dim=-1))

# same thing, but shown explicitly for the word &#8220;next&#8221;

scaled_scores_next = attn_scores_next / d_k**0.5
attn_weights_next = torch.softmax(scaled_scores_next, dim=-1)

print("\nScaled scores for &#8216;next&#8217;:")
print(scaled_scores_next)
print("Attention weights for &#8216;next&#8217;:")
print(attn_weights_next)
print("Sum of weights for &#8216;next&#8217;:", attn_weights_next.sum())</code></pre></div><p><strong>Output</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;aa09ab95-880d-4c4a-b5d7-8793748be1e3&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">Attention weights matrix:

tensor([[0.1331, 0.1401, 0.1781, 0.2413, 0.3074],
        [0.1375, 0.1430, 0.1779, 0.2361, 0.3056],
        [0.1342, 0.1394, 0.1797, 0.2406, 0.3060],
        [0.1307, 0.1352, 0.1776, 0.2426, 0.3139],
        [0.1180, 0.1271, 0.1660, 0.2553, 0.3336]])

Row sums: tensor([1., 1., 1., 1., 1.])

Scaled scores for &#8216;next&#8217;:

tensor([5.0198, 5.0591, 5.3059, 5.5013, 5.6677])
Attention weights for &#8216;next&#8217;:

tensor([0.1375, 0.1430, 0.1779, 0.2361, 0.3056])

Sum of weights for &#8216;next&#8217;: tensor(1.)</code></pre></div><p>Raw dot products can grow large when the key dimension increases and they are difficult to interpret. The first step therefore scales the scores by dividing by the square root of the key dimension <strong>d_k</strong>. This keeps the variance of the scores roughly constant and prevents softmax from producing extremely sharp distributions.</p><p>The call to <strong>torch.softmax</strong> then turns each row of scaled scores into a proper probability distribution. All entries are between zero and one and each row sums to one, as confirmed by the printed <strong>row sums</strong>. The attention weight at position <strong>(i, j)</strong> now expresses the fraction of token <strong>i</strong>&#8217;s attention that is assigned to token <strong>j.</strong></p><p>For example, the vector <strong>attn_weights_next</strong> shows how the word <strong>&#8220;next&#8221;</strong> distributes its attention across the five tokens. 
In the example above it puts about thirty percent of its weight on the last word, with the remaining seventy percent spread over the earlier words.</p><h3>[Step 4] From Attention Weights to Context Vectors</h3><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9f2c!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F338cb0ea-ce60-4769-9075-e3e3ccaeea68_978x228.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9f2c!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F338cb0ea-ce60-4769-9075-e3e3ccaeea68_978x228.png 424w, https://substackcdn.com/image/fetch/$s_!9f2c!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F338cb0ea-ce60-4769-9075-e3e3ccaeea68_978x228.png 848w, https://substackcdn.com/image/fetch/$s_!9f2c!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F338cb0ea-ce60-4769-9075-e3e3ccaeea68_978x228.png 1272w, https://substackcdn.com/image/fetch/$s_!9f2c!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F338cb0ea-ce60-4769-9075-e3e3ccaeea68_978x228.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9f2c!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F338cb0ea-ce60-4769-9075-e3e3ccaeea68_978x228.png" width="978" height="228" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/338cb0ea-ce60-4769-9075-e3e3ccaeea68_978x228.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:228,&quot;width&quot;:978,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:7237,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F338cb0ea-ce60-4769-9075-e3e3ccaeea68_978x228.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!9f2c!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F338cb0ea-ce60-4769-9075-e3e3ccaeea68_978x228.png 424w, https://substackcdn.com/image/fetch/$s_!9f2c!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F338cb0ea-ce60-4769-9075-e3e3ccaeea68_978x228.png 848w, https://substackcdn.com/image/fetch/$s_!9f2c!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F338cb0ea-ce60-4769-9075-e3e3ccaeea68_978x228.png 1272w, https://substackcdn.com/image/fetch/$s_!9f2c!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F338cb0ea-ce60-4769-9075-e3e3ccaeea68_978x228.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><em><strong>Figure 1.79:</strong> The final step of self-attention: multiply the attention weights matrix by the Value matrix to produce context 
vectors.</em></p><blockquote><p><strong>Brief Note:</strong> The calculations described in this section illustrate the core mechanism of scaled dot-product attention. For simplicity, this example does not apply causal or look-ahead masking, which would be essential in a decoder-based model (like a GPT) to prevent a token from &#8220;seeing&#8221; future tokens.</p></blockquote><p>In the self-attention mechanism, the final step is to compute the <strong>context vector</strong> for each token. A common misconception is that attention is applied to the original input embeddings. Instead, attention is used to create a weighted sum of a new, <em>transformed</em> representation of the input. This new representation is called the <strong>Value (V) matrix</strong>.</p><h4>The Role of the Value (V) Matrix</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LROB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44b98b9b-f1f4-4d77-829a-9985218708e0_960x378.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LROB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44b98b9b-f1f4-4d77-829a-9985218708e0_960x378.png 424w, https://substackcdn.com/image/fetch/$s_!LROB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44b98b9b-f1f4-4d77-829a-9985218708e0_960x378.png 848w, https://substackcdn.com/image/fetch/$s_!LROB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44b98b9b-f1f4-4d77-829a-9985218708e0_960x378.png 1272w, 
https://substackcdn.com/image/fetch/$s_!LROB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44b98b9b-f1f4-4d77-829a-9985218708e0_960x378.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LROB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44b98b9b-f1f4-4d77-829a-9985218708e0_960x378.png" width="960" height="378" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/44b98b9b-f1f4-4d77-829a-9985218708e0_960x378.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:378,&quot;width&quot;:960,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:33207,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44b98b9b-f1f4-4d77-829a-9985218708e0_960x378.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!LROB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44b98b9b-f1f4-4d77-829a-9985218708e0_960x378.png 424w, https://substackcdn.com/image/fetch/$s_!LROB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44b98b9b-f1f4-4d77-829a-9985218708e0_960x378.png 848w, https://substackcdn.com/image/fetch/$s_!LROB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44b98b9b-f1f4-4d77-829a-9985218708e0_960x378.png 
1272w, https://substackcdn.com/image/fetch/$s_!LROB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F44b98b9b-f1f4-4d77-829a-9985218708e0_960x378.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em><strong>Figure 1.80:</strong> The Value matrix is created by multiplying the input embeddings by a separate weight matrix W_v. 
It provides the representations that are blended according to attention weights.</em></p><p>Just as we created the Query (Q) and Key (K) matrices by multiplying our input embeddings (X) with trainable weight matrices (W_q and W_k), we create the Value (V) matrix by multiplying the input embeddings with its own trainable weight matrix, W_v.</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;V = X \\times W_v&quot;,&quot;id&quot;:&quot;GATVDXKTBL&quot;}" data-component-name="LatexBlockToDOM"></div><p>This transformation is crucial. It allows the model to learn a representation of the input tokens that is <em>specifically optimized for constructing the final contextualized output</em>. While the Key matrix is designed for &#8220;being searched&#8221; and the Query matrix is for &#8220;searching,&#8221; the Value matrix is designed to &#8220;be blended.&#8221;</p><p>Instead of directly blending the input vectors, we blend these new Value vectors. This gives the model more flexibility and expressive power.</p><h4>Calculating the Context Matrix</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Z70_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5ace98f-6087-4486-8186-e240d7580e0a_1215x501.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Z70_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5ace98f-6087-4486-8186-e240d7580e0a_1215x501.png 424w, https://substackcdn.com/image/fetch/$s_!Z70_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5ace98f-6087-4486-8186-e240d7580e0a_1215x501.png 848w, 
https://substackcdn.com/image/fetch/$s_!Z70_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5ace98f-6087-4486-8186-e240d7580e0a_1215x501.png 1272w, https://substackcdn.com/image/fetch/$s_!Z70_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5ace98f-6087-4486-8186-e240d7580e0a_1215x501.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Z70_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5ace98f-6087-4486-8186-e240d7580e0a_1215x501.png" width="1215" height="501" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d5ace98f-6087-4486-8186-e240d7580e0a_1215x501.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:501,&quot;width&quot;:1215,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:48394,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5ace98f-6087-4486-8186-e240d7580e0a_1215x501.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Z70_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5ace98f-6087-4486-8186-e240d7580e0a_1215x501.png 424w, 
https://substackcdn.com/image/fetch/$s_!Z70_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5ace98f-6087-4486-8186-e240d7580e0a_1215x501.png 848w, https://substackcdn.com/image/fetch/$s_!Z70_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5ace98f-6087-4486-8186-e240d7580e0a_1215x501.png 1272w, https://substackcdn.com/image/fetch/$s_!Z70_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5ace98f-6087-4486-8186-e240d7580e0a_1215x501.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em><strong>Figure 1.81:</strong> The context matrix is computed as a single matrix multiplication: Attention Weights (5, 5) times Values (5, 4) yields the (5, 4) context matrix containing one context vector per token.</em></p><p>The calculation of the final context matrix is a single matrix multiplication:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;Context = AttentionWeights \\times V&quot;,&quot;id&quot;:&quot;ALLDYYDYSJ&quot;}" data-component-name="LatexBlockToDOM"></div><p>Let&#8217;s break this down using the dimensions from our examples.</p><p>1. <strong>Attention Weights (A):</strong> This is the (5, 5) matrix of normalized scores we calculated previously. Each row (i) of this matrix represents the &#8220;attention&#8221; that token (i) pays to every other token (including itself).</p><p>2. <strong>Value (V) Matrix:</strong> This is the (5, 4) matrix of transformed input vectors. Each row corresponds to a token, but it now exists in the &#8220;value space&#8221; with a dimension of 4. The multiplication is therefore:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\n\\text{Context } (5, 4) = \\text{AttentionWeights } (5, 5) \\times V (5, 4)\n&quot;,&quot;id&quot;:&quot;UABPLZGPOA&quot;}" data-component-name="LatexBlockToDOM"></div><p>The resulting (5, 4) Context matrix contains our five new context vectors, one for each input token. Each of these new vectors has a dimension of 4, matching the dimensionality of our value space.</p><h4>What is a Context Vector?</h4><p>Each row in the final Context matrix is the new, &#8220;context-aware&#8221; vector for its corresponding token. This new vector is a <strong>weighted sum</strong> of <em>all</em> the Value vectors in the sequence.  
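</p><p>As a minimal sketch of this weighted-sum view (assuming PyTorch, and borrowing the attention weights for &#8220;day&#8221; and the Value matrix from the worked example in this section), a context vector can be computed both as an explicit weighted sum and as a single vector-matrix product. The variable names <strong>weights_day</strong>, <strong>V</strong>, <strong>weighted_sum</strong> and <strong>context_day</strong> are illustrative only:</p><div class="highlighted_code_block" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">import torch

# Attention weights for "day" (row 3 of the attention-weight matrix in this section)
weights_day = torch.tensor([0.28, 0.12, 0.09, 0.16, 0.35])

# Value matrix from this section: one row per token (The, next, day, is, bright)
V = torch.tensor([
    [1.5, 1.2, 1.6, 2.4],
    [1.4, 1.2, 1.7, 2.1],
    [1.4, 1.5, 1.5, 2.1],
    [1.6, 1.6, 1.8, 2.0],
    [1.7, 1.7, 1.6, 2.4],
])

# Explicit weighted sum: scale each Value row by its weight, then add the rows
weighted_sum = sum(w * v for w, v in zip(weights_day, V))

# The same computation written as a single vector-matrix product
context_day = weights_day @ V

print(context_day)
print(torch.allclose(weighted_sum, context_day))</code></pre></div><p>Both routes should produce the same vector, which is why the full Context matrix can be obtained with one matrix multiplication. 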
Let&#8217;s illustrate by calculating the context vector for the third token, <strong>&#8220;day&#8221;</strong> (row 3).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KMCy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa82457ed-ad0e-466c-b24e-78d5d3090d74_1203x501.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KMCy!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa82457ed-ad0e-466c-b24e-78d5d3090d74_1203x501.png 424w, https://substackcdn.com/image/fetch/$s_!KMCy!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa82457ed-ad0e-466c-b24e-78d5d3090d74_1203x501.png 848w, https://substackcdn.com/image/fetch/$s_!KMCy!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa82457ed-ad0e-466c-b24e-78d5d3090d74_1203x501.png 1272w, https://substackcdn.com/image/fetch/$s_!KMCy!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa82457ed-ad0e-466c-b24e-78d5d3090d74_1203x501.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!KMCy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa82457ed-ad0e-466c-b24e-78d5d3090d74_1203x501.png" width="1203" height="501" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a82457ed-ad0e-466c-b24e-78d5d3090d74_1203x501.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:501,&quot;width&quot;:1203,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:49631,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa82457ed-ad0e-466c-b24e-78d5d3090d74_1203x501.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!KMCy!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa82457ed-ad0e-466c-b24e-78d5d3090d74_1203x501.png 424w, https://substackcdn.com/image/fetch/$s_!KMCy!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa82457ed-ad0e-466c-b24e-78d5d3090d74_1203x501.png 848w, https://substackcdn.com/image/fetch/$s_!KMCy!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa82457ed-ad0e-466c-b24e-78d5d3090d74_1203x501.png 1272w, https://substackcdn.com/image/fetch/$s_!KMCy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa82457ed-ad0e-466c-b24e-78d5d3090d74_1203x501.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em><strong>Figure 1.82:</strong> Computing the context vector for &#8220;day&#8221;: its attention weights are multiplied with the corresponding Value vectors and summed to produce a new representation that blends information from all tokens.</em></p><p>1. <strong>Get the Weights:</strong> We take row 3 from the Attention Weights matrix. These are the weights from <strong>&#8220;day&#8221;</strong> to every token in the sequence, including itself:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\n\\begin{bmatrix}\n0.28 &amp; 0.12 &amp; 0.09 &amp; 0.16 &amp; 0.35\n\\end{bmatrix}\n&quot;,&quot;id&quot;:&quot;ZUWOJAOJSQ&quot;}" data-component-name="LatexBlockToDOM"></div><p>2. 
<strong>Get the Values:</strong> We use the <em>entire</em> (5, 4) <strong>Value</strong> matrix.</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\nV = \n\\begin{bmatrix}\n1.5 &amp; 1.2 &amp; 1.6 &amp; 2.4 \\\\\n1.4 &amp; 1.2 &amp; 1.7 &amp; 2.1 \\\\\n1.4 &amp; 1.5 &amp; 1.5 &amp; 2.1 \\\\\n1.6 &amp; 1.6 &amp; 1.8 &amp; 2.0 \\\\\n1.7 &amp; 1.7 &amp; 1.6 &amp; 2.4\n\\end{bmatrix}\n\n&quot;,&quot;id&quot;:&quot;JPYWZXFEAP&quot;}" data-component-name="LatexBlockToDOM"></div><p>3. <strong>Perform the Weighted Sum:</strong> The new context vector for &#8220;day&#8221; is calculated by multiplying its attention weights by the corresponding Value vectors and summing the results:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{aligned}\n\\text{Context Vector for \&quot;day\&quot;} ={} &amp; (0.28 \\times V_{\\text{The}}) + (0.12 \\times V_{\\text{next}}) + (0.09 \\times V_{\\text{day}}) \\\\\n&amp; + (0.16 \\times V_{\\text{is}}) + (0.35 \\times V_{\\text{bright}}) \\quad (1, 4)\n\\end{aligned}\n&quot;,&quot;id&quot;:&quot;NYIZGLDJVK&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\n\\begin{bmatrix}\n1.57 &amp; 1.47 &amp; 1.64 &amp; 2.27\n\\end{bmatrix}\n&quot;,&quot;id&quot;:&quot;XKPDFLXPCC&quot;}" data-component-name="LatexBlockToDOM"></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gyXB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd31cbb52-0b6d-4108-819b-a9b33b22d537_1209x687.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!gyXB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd31cbb52-0b6d-4108-819b-a9b33b22d537_1209x687.png 424w, https://substackcdn.com/image/fetch/$s_!gyXB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd31cbb52-0b6d-4108-819b-a9b33b22d537_1209x687.png 848w, https://substackcdn.com/image/fetch/$s_!gyXB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd31cbb52-0b6d-4108-819b-a9b33b22d537_1209x687.png 1272w, https://substackcdn.com/image/fetch/$s_!gyXB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd31cbb52-0b6d-4108-819b-a9b33b22d537_1209x687.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gyXB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd31cbb52-0b6d-4108-819b-a9b33b22d537_1209x687.png" width="1209" height="687" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d31cbb52-0b6d-4108-819b-a9b33b22d537_1209x687.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:687,&quot;width&quot;:1209,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:59125,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd31cbb52-0b6d-4108-819b-a9b33b22d537_1209x687.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!gyXB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd31cbb52-0b6d-4108-819b-a9b33b22d537_1209x687.png 424w, https://substackcdn.com/image/fetch/$s_!gyXB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd31cbb52-0b6d-4108-819b-a9b33b22d537_1209x687.png 848w, https://substackcdn.com/image/fetch/$s_!gyXB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd31cbb52-0b6d-4108-819b-a9b33b22d537_1209x687.png 1272w, https://substackcdn.com/image/fetch/$s_!gyXB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd31cbb52-0b6d-4108-819b-a9b33b22d537_1209x687.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" 
stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em><strong>Figure 1.83:</strong> Detailed calculation showing how the first element of the context vector for &#8220;day&#8221; is computed as a weighted sum of the first elements of all Value vectors.</em></p><p>Since matrix calculations can feel overwhelming because of the number of values involved, Figure 1.83 shows how the value in the first column of the context vector for the token &#8220;day&#8221; is calculated.</p><p>This new vector is a blend, or a weighted average, of all the tokens&#8217; &#8220;value&#8221; representations. The blend is dictated by the attention weights. In this case, the new meaning of &#8220;day&#8221; is most heavily influenced by the value of &#8220;bright&#8221; (35%), followed by &#8220;The&#8221; (28%), &#8220;is&#8221; (16%) and &#8220;next&#8221; (12%), while its own original value contributes only 9%.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xdm3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77672a56-3385-4170-bc23-20412046f5b6_1398x396.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xdm3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77672a56-3385-4170-bc23-20412046f5b6_1398x396.png 424w, 
https://substackcdn.com/image/fetch/$s_!xdm3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77672a56-3385-4170-bc23-20412046f5b6_1398x396.png 848w, https://substackcdn.com/image/fetch/$s_!xdm3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77672a56-3385-4170-bc23-20412046f5b6_1398x396.png 1272w, https://substackcdn.com/image/fetch/$s_!xdm3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77672a56-3385-4170-bc23-20412046f5b6_1398x396.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xdm3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77672a56-3385-4170-bc23-20412046f5b6_1398x396.png" width="1398" height="396" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/77672a56-3385-4170-bc23-20412046f5b6_1398x396.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:396,&quot;width&quot;:1398,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:41183,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77672a56-3385-4170-bc23-20412046f5b6_1398x396.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!xdm3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77672a56-3385-4170-bc23-20412046f5b6_1398x396.png 424w, https://substackcdn.com/image/fetch/$s_!xdm3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77672a56-3385-4170-bc23-20412046f5b6_1398x396.png 848w, https://substackcdn.com/image/fetch/$s_!xdm3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77672a56-3385-4170-bc23-20412046f5b6_1398x396.png 1272w, https://substackcdn.com/image/fetch/$s_!xdm3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77672a56-3385-4170-bc23-20412046f5b6_1398x396.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em><strong>Figure 1.84:</strong> The complete transformation: input embeddings (static, context-free) are converted into context vectors (dynamic, context-aware) through the self-attention mechanism.</em></p><p>We began with an input embedding matrix, where each token&#8217;s vector represented its meaning in isolation, unaware of its surroundings. The self-attention mechanism transforms these static inputs by projecting them into three new spaces: Query, Key, and Value. By comparing the Query and Key matrices, the model generates a matrix of attention weights. This weight matrix acts as a precise &#8220;blending recipe,&#8221; quantifying the relevance of every token to every other token in the sequence. Multiplying it by the Value matrix yields the context matrix, where each token&#8217;s original vector is replaced by a new, context-aware representation. This transformation from isolated, static meaning to a contextualized representation is the central power of the self-attention mechanism.</p><h4><strong>Listing 1.7: Computing Context Vectors from Attention Weights</strong></h4><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;f9319d3d-3667-4bfd-b99b-7495f2a2484a&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python"># compute context vectors for all tokens
context = attn_weights @ values      # shape (5, 4)

print("Context vectors:")
print(context)
print("context.shape:", context.shape)

# context vector for the word "next" only
context_next = attn_weights_next @ values
print("\nContext vector for 'next':")
print(context_next)</code></pre></div><p><strong>Output</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;c51f862d-6924-4783-9ef6-8d194ec6d286&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">Context vectors:
tensor([[ 2.2118,  2.4971, -0.2800,  0.8843],
        [ 2.2038,  2.4754, -0.2794,  0.8741],
        [ 2.2062,  2.4855, -0.2796,  0.8797],
        [ 2.2100,  2.4933, -0.2802,  0.8884],
        [ 2.2229,  2.5209, -0.2808,  0.9064]])
context.shape: torch.Size([5, 4])

Context vector for 'next':
tensor([ 2.2038,  2.4754, -0.2794,  0.8741])
</code></pre></div><p>The final step in self-attention is to combine the value vectors using the attention weights as coefficients. Every context vector is a weighted sum of all value vectors, where the weights come from the corresponding row in the attention matrix.</p><p>The matrix product</p><p><strong>context = attn_weights @ values</strong></p><p>implements this operation for all five tokens at once. Since <strong>attn_weights</strong> has shape <strong>(5, 5)</strong> and <strong>values</strong> has shape <strong>(5, 4)</strong>, the result has shape <strong>(5, 4)</strong>. Each row in <strong>context</strong> is the new context-aware representation of one token in the sentence.</p><p>The row <strong>context_next</strong> shows the updated representation for <strong>&#8220;next&#8221;</strong>. It lives in the same 4-dimensional space as the value vectors, but it now encodes information aggregated from all tokens in the sentence according to the learned attention pattern. This is exactly the transformation the theory section describes when it talks about going from static input embeddings to dynamic context vectors.</p><h4><strong>Listing 1.8: Packaging Self-Attention into a PyTorch Module</strong></h4><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;86644a0e-81c3-4090-8442-23859e968d7e&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">import torch
import torch.nn as nn

class SelfAttention(nn.Module):

    def __init__(self, d_in, d_out):
        super().__init__()
        self.W_query = nn.Parameter(torch.randn(d_in, d_out))
        self.W_key   = nn.Parameter(torch.randn(d_in, d_out))
        self.W_value = nn.Parameter(torch.randn(d_in, d_out))

    def forward(self, x):
        queries = x @ self.W_query    # (seq_len, d_out)
        keys    = x @ self.W_key      # (seq_len, d_out)
        values  = x @ self.W_value    # (seq_len, d_out)

        attn_scores  = queries @ keys.T
        d_k = keys.shape[-1]
        attn_weights = torch.softmax(
            attn_scores / d_k**0.5, dim=-1
        )

        context = attn_weights @ values
        return context

torch.manual_seed(123)
sa = SelfAttention(d_in=8, d_out=4)
out = sa(inputs)
print(out.shape)
</code></pre></div><p><strong>Output</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;5a95ba62-29af-4bc4-bd20-621a7cb20f0a&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">torch.Size([5, 4])</code></pre></div><p>This <strong>class</strong> collects the individual steps of scaled dot-product self-attention into a reusable component. The constructor creates three trainable parameter matrices, each of shape <strong>(d_in, d_out)</strong>. When the module is part of a model, these parameters will be updated by the <strong>optimiser</strong> during training.</p><p>The <strong>forward</strong> method implements the same pipeline we derived by hand. It projects the input embeddings into queries, keys and values, computes all attention scores with a single matrix product, scales and normalises them with softmax, and finally uses the resulting weights to blend the value vectors into context vectors.</p><p>The last four lines fix the random seed, create an instance of the layer with <strong>d_in</strong> equal to <strong>8</strong> and <strong>d_out</strong> equal to <strong>4</strong>, apply it to the input sentence and print the shape of the output. The result <strong>(5, 4)</strong> confirms that for a sequence of five tokens the layer returns five context vectors, each living in the 4-dimensional space of the attention head. This is exactly the representation that will be passed on to the feed-forward network in the transformer block.</p><h2>1.13 Causal &amp; Masked Attention</h2><p>In the preceding section, we examined how self-attention transforms input embeddings into context-aware vectors. However, that explanation omitted a vital component for generative models: <strong>causal attention</strong>. 
This mechanism is fundamental, as it ensures that the model respects the sequential order of text and does not &#8220;cheat&#8221; by looking at future tokens.</p><p>Now that you have a solid understanding of the full self-attention pipeline, it is the perfect time to introduce this concept. This masking step is applied directly to the attention scores, just before the softmax function, to block any information from subsequent positions in the sequence.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vwwM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf983454-2dee-4cd0-97c4-4fdb048e2c78_498x288.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vwwM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf983454-2dee-4cd0-97c4-4fdb048e2c78_498x288.png 424w, https://substackcdn.com/image/fetch/$s_!vwwM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf983454-2dee-4cd0-97c4-4fdb048e2c78_498x288.png 848w, https://substackcdn.com/image/fetch/$s_!vwwM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf983454-2dee-4cd0-97c4-4fdb048e2c78_498x288.png 1272w, https://substackcdn.com/image/fetch/$s_!vwwM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf983454-2dee-4cd0-97c4-4fdb048e2c78_498x288.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!vwwM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf983454-2dee-4cd0-97c4-4fdb048e2c78_498x288.png" width="498" height="288" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bf983454-2dee-4cd0-97c4-4fdb048e2c78_498x288.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:288,&quot;width&quot;:498,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:16937,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf983454-2dee-4cd0-97c4-4fdb048e2c78_498x288.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!vwwM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf983454-2dee-4cd0-97c4-4fdb048e2c78_498x288.png 424w, https://substackcdn.com/image/fetch/$s_!vwwM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf983454-2dee-4cd0-97c4-4fdb048e2c78_498x288.png 848w, https://substackcdn.com/image/fetch/$s_!vwwM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf983454-2dee-4cd0-97c4-4fdb048e2c78_498x288.png 1272w, https://substackcdn.com/image/fetch/$s_!vwwM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf983454-2dee-4cd0-97c4-4fdb048e2c78_498x288.png 1456w" sizes="100vw" 
loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em><strong>Figure 1.85:</strong> Causal masking ensures that when processing token i, the model can only attend to tokens at positions 0 through i, preventing information leakage from future tokens.</em></p><p>Large language models like ChatGPT generate text by predicting one token at a time. Each predicted token is appended to the input, creating a growing context window used to predict the next token. 
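</p><p>The generation loop described above can be sketched roughly as follows. This is only an illustration: <strong>toy_next_token_logits</strong> is a hypothetical stand-in that returns random scores, whereas a real language model would compute them from the whole context, and the vocabulary size and prompt token ids are arbitrary assumptions:</p><div class="highlighted_code_block" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">import torch

torch.manual_seed(0)
vocab_size = 10   # assumed toy vocabulary size


def toy_next_token_logits(context):
    # Hypothetical stand-in for a language model: one random score per
    # vocabulary entry. A real model would derive these from `context`.
    return torch.randn(vocab_size)


context = [3, 7]   # token ids of the prompt (arbitrary values)

for _ in range(4):
    logits = toy_next_token_logits(context)
    next_token = int(torch.argmax(logits))   # greedy choice of the next token
    context.append(next_token)               # the prediction becomes part of the input

print(context)</code></pre></div><p>Each pass through the loop sees a context that is one token longer than before, which is the growing context window illustrated in the next figure.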
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!r0Va!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12bc6d99-d215-4826-92bd-bd97af3f9a6b_921x369.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!r0Va!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12bc6d99-d215-4826-92bd-bd97af3f9a6b_921x369.png 424w, https://substackcdn.com/image/fetch/$s_!r0Va!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12bc6d99-d215-4826-92bd-bd97af3f9a6b_921x369.png 848w, https://substackcdn.com/image/fetch/$s_!r0Va!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12bc6d99-d215-4826-92bd-bd97af3f9a6b_921x369.png 1272w, https://substackcdn.com/image/fetch/$s_!r0Va!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12bc6d99-d215-4826-92bd-bd97af3f9a6b_921x369.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!r0Va!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12bc6d99-d215-4826-92bd-bd97af3f9a6b_921x369.png" width="921" height="369" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/12bc6d99-d215-4826-92bd-bd97af3f9a6b_921x369.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:369,&quot;width&quot;:921,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:55779,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12bc6d99-d215-4826-92bd-bd97af3f9a6b_921x369.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!r0Va!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12bc6d99-d215-4826-92bd-bd97af3f9a6b_921x369.png 424w, https://substackcdn.com/image/fetch/$s_!r0Va!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12bc6d99-d215-4826-92bd-bd97af3f9a6b_921x369.png 848w, https://substackcdn.com/image/fetch/$s_!r0Va!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12bc6d99-d215-4826-92bd-bd97af3f9a6b_921x369.png 1272w, https://substackcdn.com/image/fetch/$s_!r0Va!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12bc6d99-d215-4826-92bd-bd97af3f9a6b_921x369.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em><strong>Figure 1.86:</strong> Sequential text generation: the model predicts one token at a time, appending each prediction to the input before generating the next.</em></p><p>This sequential process imposes a fundamental constraint: when computing the context vector for any token, only that token and preceding tokens should have influence. 
Future tokens must not contribute, as they have not yet been generated.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KLup!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb8dd0410-ba90-4236-a286-6fab2f5720e8_1026x504.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KLup!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb8dd0410-ba90-4236-a286-6fab2f5720e8_1026x504.png 424w, https://substackcdn.com/image/fetch/$s_!KLup!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb8dd0410-ba90-4236-a286-6fab2f5720e8_1026x504.png 848w, https://substackcdn.com/image/fetch/$s_!KLup!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb8dd0410-ba90-4236-a286-6fab2f5720e8_1026x504.png 1272w, https://substackcdn.com/image/fetch/$s_!KLup!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb8dd0410-ba90-4236-a286-6fab2f5720e8_1026x504.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!KLup!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb8dd0410-ba90-4236-a286-6fab2f5720e8_1026x504.png" width="1026" height="504" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b8dd0410-ba90-4236-a286-6fab2f5720e8_1026x504.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:504,&quot;width&quot;:1026,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:27847,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb8dd0410-ba90-4236-a286-6fab2f5720e8_1026x504.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!KLup!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb8dd0410-ba90-4236-a286-6fab2f5720e8_1026x504.png 424w, https://substackcdn.com/image/fetch/$s_!KLup!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb8dd0410-ba90-4236-a286-6fab2f5720e8_1026x504.png 848w, https://substackcdn.com/image/fetch/$s_!KLup!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb8dd0410-ba90-4236-a286-6fab2f5720e8_1026x504.png 1272w, https://substackcdn.com/image/fetch/$s_!KLup!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb8dd0410-ba90-4236-a286-6fab2f5720e8_1026x504.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em><strong>Figure 1.87:</strong> The causal constraint: when processing &#8220;science,&#8221; the model can only access &#8220;Computer&#8221; and itself. The tokens &#8220;is,&#8221; &#8220;the,&#8221; and all subsequent tokens are masked out.</em></p><p>Consider the sequence <strong>&#8220;Computer&#8221;</strong>, <strong>&#8220;science&#8221;</strong>, <strong>&#8220;is&#8221;</strong>, <strong>&#8220;the&#8221;</strong>, <strong>&#8220;study...&#8221;</strong> When processing <strong>&#8220;science&#8221;</strong>, the model should only access itself and <strong>&#8220;Computer&#8221;</strong>. It must not see <strong>&#8220;is&#8221;</strong>, <strong>&#8220;the&#8221;</strong>, or any subsequent tokens. 
Similarly, <strong>&#8220;Computer&#8221;</strong> should only attend to itself, while <strong>&#8220;is&#8221;</strong> can attend to <strong>&#8220;Computer&#8221;</strong>, <strong>&#8220;science&#8221;</strong>, and itself, but not to the tokens that follow it.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!IU1L!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49a1e4aa-ac3c-44ad-9f35-ef25ed58cc85_1170x747.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!IU1L!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49a1e4aa-ac3c-44ad-9f35-ef25ed58cc85_1170x747.png 424w, https://substackcdn.com/image/fetch/$s_!IU1L!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49a1e4aa-ac3c-44ad-9f35-ef25ed58cc85_1170x747.png 848w, https://substackcdn.com/image/fetch/$s_!IU1L!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49a1e4aa-ac3c-44ad-9f35-ef25ed58cc85_1170x747.png 1272w, https://substackcdn.com/image/fetch/$s_!IU1L!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49a1e4aa-ac3c-44ad-9f35-ef25ed58cc85_1170x747.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!IU1L!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49a1e4aa-ac3c-44ad-9f35-ef25ed58cc85_1170x747.png" width="1170" height="747" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/49a1e4aa-ac3c-44ad-9f35-ef25ed58cc85_1170x747.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:747,&quot;width&quot;:1170,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:53353,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49a1e4aa-ac3c-44ad-9f35-ef25ed58cc85_1170x747.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!IU1L!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49a1e4aa-ac3c-44ad-9f35-ef25ed58cc85_1170x747.png 424w, https://substackcdn.com/image/fetch/$s_!IU1L!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49a1e4aa-ac3c-44ad-9f35-ef25ed58cc85_1170x747.png 848w, https://substackcdn.com/image/fetch/$s_!IU1L!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49a1e4aa-ac3c-44ad-9f35-ef25ed58cc85_1170x747.png 1272w, https://substackcdn.com/image/fetch/$s_!IU1L!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49a1e4aa-ac3c-44ad-9f35-ef25ed58cc85_1170x747.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em><strong>Figure 1.88:</strong> A lower-triangular attention matrix enforces the causal constraint. All entries above the diagonal are zero, ensuring that no token attends to future positions.</em></p><p>To enforce this constraint, we use masked attention. For each token acting as a query, we mask all keys corresponding to future positions by setting their attention scores to zero. The token <strong>&#8220;Computer&#8221;</strong> has a non-zero attention score only with itself. The token <strong>&#8220;science&#8221;</strong> has non-zero attention scores with <strong>&#8220;Computer&#8221;</strong> and itself, but zero scores with all future tokens such as <strong>&#8220;is&#8221;</strong> and <strong>&#8220;the&#8221;</strong>. 
This masking creates a lower-triangular attention matrix in which all entries above the diagonal are zero.</p><p>After masking, we renormalize the remaining attention weights in each row so that they sum to one. For <strong>&#8220;Computer&#8221;</strong>, the single remaining weight becomes one. For <strong>&#8220;science&#8221;</strong>, the two remaining weights are rescaled so that their sum equals one. Concretely, we sum the non-masked weights in each row and divide each weight by that sum, which yields a proper probability distribution over the tokens each query can attend to. This mechanism, called causal attention or masked self-attention, lets language models generate coherent text while respecting the sequential nature of prediction.</p><h3>Implementing Causal Attention through Zero Masking</h3><p>To implement causal attention, we begin with the raw attention scores. Consider the 5&#215;5 attention score matrix for the sequence &#8220;computer science is the study&#8221;.</p><h4>Step 1: Initial Attention Scores</h4><p>This is the raw, unnormalized matrix.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GQ3u!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e8f8a9b-eb24-416f-b44f-74f6a287468f_444x406.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GQ3u!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e8f8a9b-eb24-416f-b44f-74f6a287468f_444x406.png 424w, https://substackcdn.com/image/fetch/$s_!GQ3u!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e8f8a9b-eb24-416f-b44f-74f6a287468f_444x406.png 848w, 
https://substackcdn.com/image/fetch/$s_!GQ3u!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e8f8a9b-eb24-416f-b44f-74f6a287468f_444x406.png 1272w, https://substackcdn.com/image/fetch/$s_!GQ3u!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e8f8a9b-eb24-416f-b44f-74f6a287468f_444x406.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GQ3u!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e8f8a9b-eb24-416f-b44f-74f6a287468f_444x406.png" width="444" height="406" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4e8f8a9b-eb24-416f-b44f-74f6a287468f_444x406.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:406,&quot;width&quot;:444,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:19214,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e8f8a9b-eb24-416f-b44f-74f6a287468f_444x406.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!GQ3u!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e8f8a9b-eb24-416f-b44f-74f6a287468f_444x406.png 424w, https://substackcdn.com/image/fetch/$s_!GQ3u!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e8f8a9b-eb24-416f-b44f-74f6a287468f_444x406.png 
848w, https://substackcdn.com/image/fetch/$s_!GQ3u!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e8f8a9b-eb24-416f-b44f-74f6a287468f_444x406.png 1272w, https://substackcdn.com/image/fetch/$s_!GQ3u!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e8f8a9b-eb24-416f-b44f-74f6a287468f_444x406.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em><strong>Figure 1.89:</strong> The raw 5&#215;5 attention score matrix before any masking is applied.</em></p><p>In PyTorch, we 
can construct a lower triangular mask using the <strong>torch.tril()</strong> function:</p><pre><code>torch.tril(MatrixA)</code></pre><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!O4Ok!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a9f6e7c-c5e1-4cbc-a349-96413c101795_1170x546.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!O4Ok!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a9f6e7c-c5e1-4cbc-a349-96413c101795_1170x546.png 424w, https://substackcdn.com/image/fetch/$s_!O4Ok!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a9f6e7c-c5e1-4cbc-a349-96413c101795_1170x546.png 848w, https://substackcdn.com/image/fetch/$s_!O4Ok!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a9f6e7c-c5e1-4cbc-a349-96413c101795_1170x546.png 1272w, https://substackcdn.com/image/fetch/$s_!O4Ok!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a9f6e7c-c5e1-4cbc-a349-96413c101795_1170x546.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!O4Ok!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a9f6e7c-c5e1-4cbc-a349-96413c101795_1170x546.png" width="1170" height="546" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5a9f6e7c-c5e1-4cbc-a349-96413c101795_1170x546.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:546,&quot;width&quot;:1170,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:40629,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a9f6e7c-c5e1-4cbc-a349-96413c101795_1170x546.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!O4Ok!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a9f6e7c-c5e1-4cbc-a349-96413c101795_1170x546.png 424w, https://substackcdn.com/image/fetch/$s_!O4Ok!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a9f6e7c-c5e1-4cbc-a349-96413c101795_1170x546.png 848w, https://substackcdn.com/image/fetch/$s_!O4Ok!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a9f6e7c-c5e1-4cbc-a349-96413c101795_1170x546.png 1272w, https://substackcdn.com/image/fetch/$s_!O4Ok!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a9f6e7c-c5e1-4cbc-a349-96413c101795_1170x546.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em><strong>Figure 1.90:</strong> Applied to a matrix of ones, the torch.tril() function produces a lower-triangular mask: ones on and below the diagonal, zeros above.</em></p><p>On its own, <strong>torch.tril()</strong> keeps the elements on and below the main diagonal of its input and sets the elements above the diagonal to zero. Applied to a matrix of ones, it yields a mask with ones on and below the diagonal and zeros above. The size of this mask matrix must match the dimensions of our attention score matrix, which is determined by the context length. </p><p>The context length is simply the number of tokens currently in the sequence. For the example sequence &#8220;computer,&#8221; &#8220;science,&#8221; &#8220;is,&#8221; &#8220;the,&#8221; &#8220;study,&#8221; the context length is five, so the mask is a 5&#215;5 matrix. 
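</p><p>To make the link between context length and mask shape concrete, the following pure-Python sketch builds the same lower-triangular pattern without any library. The helper name <strong>build_causal_mask</strong> is an assumption made for this illustration only. The listings that follow use <strong>torch.tril</strong> instead.</p><pre><code class="language-python">def build_causal_mask(context_length):
    """Return a context_length x context_length lower-triangular mask as nested lists."""
    mask = []
    for i in range(context_length):
        # Row i: ones for positions 0..i (allowed), zeros for later positions (masked).
        row = [1] * (i + 1) + [0] * (context_length - i - 1)
        mask.append(row)
    return mask


tokens = ["computer", "science", "is", "the", "study"]
mask = build_causal_mask(len(tokens))  # the context length here is 5

# Print each mask row next to the query token it belongs to.
for token, row in zip(tokens, mask):
    print(row, token)</code></pre><p>Row i contains i + 1 ones, which matches the rule that token i may attend to positions 0 through i.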
</p><h4><strong>Listing 1.9: Understanding Triangular Masks with torch.tril</strong></h4><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;4d6a9bf5-904e-4658-ae40-d650b4a23be7&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">import torch

# A simple 3x3 example matrix

A = torch.tensor([
    [1., 2., 3.],
    [4., 5., 6.],
    [7., 8., 9.],
])

print("A:")
print(A)

# Lower-triangular version of A

A_tril = torch.tril(A)

print("\ntorch.tril(A):")
print(A_tril)

# A pure mask built from ones

mask_ones = torch.tril(torch.ones_like(A))

print("\nLower-triangular mask from ones:")
print(mask_ones)

# Using the mask to zero out the upper triangle

A_masked = A * mask_ones

print("\nA * mask_ones:")
print(A_masked)</code></pre></div><p><strong>Output</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;efd9a4c0-0b33-4486-957a-e7eb8dfd669f&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">A:
tensor([[1., 2., 3.],
        [4., 5., 6.],
        [7., 8., 9.]])

torch.tril(A):
tensor([[1., 0., 0.],
        [4., 5., 0.],
        [7., 8., 9.]])

Lower-triangular mask from ones:
tensor([[1., 0., 0.],
        [1., 1., 0.],
        [1., 1., 1.]])

A * mask_ones:
tensor([[1., 0., 0.],
        [4., 5., 0.],
        [7., 8., 9.]])</code></pre></div><p>The function <strong>torch.tril</strong> returns the lower triangular part of a matrix: everything on and below the main diagonal is kept, and everything above it is set to zero.</p><p><strong>torch.tril(A)</strong> takes the original data in <strong>A</strong> and zeroes the entries above the diagonal.</p><p><strong>torch.tril(torch.ones_like(A))</strong> builds a mask: ones on and below the diagonal, zeros above.</p><p>Multiplying <strong>A</strong> by this mask with <strong>A * mask_ones</strong> keeps the lower triangle and zeroes out the upper triangle.</p><p>Causal attention uses exactly this idea. Instead of masking a 3&#215;3 matrix, we mask a <strong>seq_len</strong> &#215; <strong>seq_len</strong> attention score matrix so that token <strong>i</strong> can only see tokens <strong>0..i</strong> and not future tokens.</p><p><strong>Building a causal mask for a 5-token sequence</strong></p><p>Assume we have a sequence of five tokens, such as:</p><p>[&#8220;computer&#8221;, &#8220;science&#8221;, &#8220;is&#8221;, &#8220;the&#8221;, &#8220;study&#8221;]</p><p>The attention scores <strong>attn_scores</strong> then form a <strong>5 &#215; 5</strong> matrix.</p><h4><strong>Listing 1.10: Building a Causal Mask for a 5-Token Sequence</strong></h4><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;4d945bb8-61f5-40d9-af7b-95d71ebe11ca&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">import torch

seq_len = 5

# Build a 5x5 causal mask

causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

print("Causal mask (True = allowed, False = masked):")
print(causal_mask)
</code></pre></div><p><strong>Output</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;1e72cd37-5c76-4613-a8cc-141fa31dbbce&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">Causal mask (True = allowed, False = masked):
tensor([[ True, False, False, False, False],
        [ True,  True, False, False, False],
        [ True,  True,  True, False, False],
        [ True,  True,  True,  True, False],
        [ True,  True,  True,  True,  True]])
</code></pre></div><p>This mask allows each token to only attend to itself and all previous tokens, blocking attention to future tokens to prevent information leakage. </p><h4>Step 2: Create and Apply the Mask</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Niiy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06ffae20-aba1-4b29-b77f-738e796d6b65_1560x480.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Niiy!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06ffae20-aba1-4b29-b77f-738e796d6b65_1560x480.png 424w, https://substackcdn.com/image/fetch/$s_!Niiy!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06ffae20-aba1-4b29-b77f-738e796d6b65_1560x480.png 848w, https://substackcdn.com/image/fetch/$s_!Niiy!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06ffae20-aba1-4b29-b77f-738e796d6b65_1560x480.png 1272w, https://substackcdn.com/image/fetch/$s_!Niiy!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06ffae20-aba1-4b29-b77f-738e796d6b65_1560x480.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Niiy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06ffae20-aba1-4b29-b77f-738e796d6b65_1560x480.png" width="1456" height="448" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/06ffae20-aba1-4b29-b77f-738e796d6b65_1560x480.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:448,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:30212,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06ffae20-aba1-4b29-b77f-738e796d6b65_1560x480.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Niiy!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06ffae20-aba1-4b29-b77f-738e796d6b65_1560x480.png 424w, https://substackcdn.com/image/fetch/$s_!Niiy!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06ffae20-aba1-4b29-b77f-738e796d6b65_1560x480.png 848w, https://substackcdn.com/image/fetch/$s_!Niiy!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06ffae20-aba1-4b29-b77f-738e796d6b65_1560x480.png 1272w, https://substackcdn.com/image/fetch/$s_!Niiy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06ffae20-aba1-4b29-b77f-738e796d6b65_1560x480.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em><strong>Figure 1.91:</strong> Applying the lower-triangular mask to the attention scores through element-wise multiplication zeros out all future-token entries.</em></p><p>The mask is created as a 5x5 lower triangular matrix of ones.</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\nM = \\begin{bmatrix}\n1 &amp; 0 &amp; 0 &amp; 0 &amp; 0 \\\\\n1 &amp; 1 &amp; 0 &amp; 0 &amp; 0 \\\\\n1 &amp; 1 &amp; 1 &amp; 0 &amp; 0 \\\\\n1 &amp; 1 &amp; 1 &amp; 1 &amp; 0 \\\\\n1 &amp; 1 &amp; 1 &amp; 1 &amp; 1\n\\end{bmatrix}\n&quot;,&quot;id&quot;:&quot;NVONLDIRDS&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p> We then apply this mask to our attention scores through element-wise multiplication. 
</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{Masked scores: } A'_{\\text{scores}} = A_{\\text{scores}} \\odot M&quot;,&quot;id&quot;:&quot;ABKDACZDSR&quot;}" data-component-name="LatexBlockToDOM"></div><p>This operation sets all elements in the upper triangle to zero while preserving the lower triangular elements.</p><p>After multiplication, the attention scores that corresponded to future tokens are now zero, while scores for current and past tokens remain unchanged.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!22Rj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6928b18b-85d8-4e68-a714-309409d3262d_879x531.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!22Rj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6928b18b-85d8-4e68-a714-309409d3262d_879x531.png 424w, https://substackcdn.com/image/fetch/$s_!22Rj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6928b18b-85d8-4e68-a714-309409d3262d_879x531.png 848w, https://substackcdn.com/image/fetch/$s_!22Rj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6928b18b-85d8-4e68-a714-309409d3262d_879x531.png 1272w, https://substackcdn.com/image/fetch/$s_!22Rj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6928b18b-85d8-4e68-a714-309409d3262d_879x531.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!22Rj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6928b18b-85d8-4e68-a714-309409d3262d_879x531.png" width="879" height="531" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6928b18b-85d8-4e68-a714-309409d3262d_879x531.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:531,&quot;width&quot;:879,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:31152,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6928b18b-85d8-4e68-a714-309409d3262d_879x531.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!22Rj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6928b18b-85d8-4e68-a714-309409d3262d_879x531.png 424w, https://substackcdn.com/image/fetch/$s_!22Rj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6928b18b-85d8-4e68-a714-309409d3262d_879x531.png 848w, https://substackcdn.com/image/fetch/$s_!22Rj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6928b18b-85d8-4e68-a714-309409d3262d_879x531.png 1272w, https://substackcdn.com/image/fetch/$s_!22Rj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6928b18b-85d8-4e68-a714-309409d3262d_879x531.png 1456w" sizes="100vw" 
loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em><strong>Figure 1.92:</strong> After masking, the rows are not yet normalized. Each row must be divided by its sum to form a valid probability distribution.</em></p><p>However, this masking alone is insufficient. The rows are not normalized and do not sum to one, which violates the requirement that attention weights form a probability distribution. 
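</p><p>As a rough sketch of the masking step above, the plain-Python snippet below builds a small lower-triangular mask, applies it element-wise, and prints the resulting row sums. The 3x3 <code>weights</code> values and all variable names are illustrative assumptions, not the numbers from the figures, and plain lists stand in for tensors to keep the example dependency-free.</p><pre><code># Hypothetical 3x3 attention weights (illustrative values; each row sums to 1).
weights = [
    [0.2, 0.5, 0.3],
    [0.4, 0.4, 0.2],
    [0.3, 0.3, 0.4],
]
n = len(weights)

# Lower-triangular mask of ones: row i keeps columns 0..i and zeros the rest.
mask = [[1] * (i + 1) + [0] * (n - i - 1) for i in range(n)]

# Element-wise multiplication zeroes every future-token entry.
masked = [
    [w * m for w, m in zip(w_row, m_row)]
    for w_row, m_row in zip(weights, mask)
]

# Only the last row still sums to (approximately) one; the others lost entries.
row_sums = [sum(row) for row in masked]
print(row_sums)
</code></pre><p>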
We must normalize each row by dividing every element by the row sum, ensuring that the non-zero weights in each row sum to one.</p><h4>Step 3: Row-wise Normalization</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cqHM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f3b97bc-1273-4d50-9383-648b2db08c36_1512x432.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cqHM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f3b97bc-1273-4d50-9383-648b2db08c36_1512x432.png 424w, https://substackcdn.com/image/fetch/$s_!cqHM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f3b97bc-1273-4d50-9383-648b2db08c36_1512x432.png 848w, https://substackcdn.com/image/fetch/$s_!cqHM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f3b97bc-1273-4d50-9383-648b2db08c36_1512x432.png 1272w, https://substackcdn.com/image/fetch/$s_!cqHM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f3b97bc-1273-4d50-9383-648b2db08c36_1512x432.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cqHM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f3b97bc-1273-4d50-9383-648b2db08c36_1512x432.png" width="1456" height="416" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7f3b97bc-1273-4d50-9383-648b2db08c36_1512x432.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:416,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:42953,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f3b97bc-1273-4d50-9383-648b2db08c36_1512x432.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!cqHM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f3b97bc-1273-4d50-9383-648b2db08c36_1512x432.png 424w, https://substackcdn.com/image/fetch/$s_!cqHM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f3b97bc-1273-4d50-9383-648b2db08c36_1512x432.png 848w, https://substackcdn.com/image/fetch/$s_!cqHM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f3b97bc-1273-4d50-9383-648b2db08c36_1512x432.png 1272w, https://substackcdn.com/image/fetch/$s_!cqHM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f3b97bc-1273-4d50-9383-648b2db08c36_1512x432.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em><strong>Figure</strong> <strong>1.93:</strong> Row-wise normalization of the masked attention scores: dividing each entry by its row sum produces attention weights that sum to one.</em></p><p>First, we find the sum of each row in the masked matrix:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\n\\begin{align*}\n\\text{Row 1 Sum:} &amp; \\quad 0.20 \\\\\n\\text{Row 2 Sum:} &amp; \\quad 0.23 + 0.27 = 0.50 \\\\\n\\text{Row 3 Sum:} &amp; \\quad 0.22 + 0.25 + 0.18 = 0.65 \\\\\n\\text{Row 4 Sum:} &amp; \\quad 0.22 + 0.24 + 0.19 + 0.15 = 0.80 \\\\\n\\text{Row 5 Sum:} &amp; \\quad 0.22 + 0.25 + 0.18 + 0.15 + 0.18 = 0.98 \\\\\n\\end{align*}\n\n&quot;,&quot;id&quot;:&quot;AGUCTRGTBG&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>Now, we divide each element by its row sum to get the 
final weights. The last column, after the bar, shows the sum of each row.</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\n\\left[\\begin{array}{ccccc|c}\n1.00 &amp; 0.00 &amp; 0.00 &amp; 0.00 &amp; 0.00 &amp; 1.00 \\\\\n0.46 &amp; 0.54 &amp; 0.00 &amp; 0.00 &amp; 0.00 &amp; 1.00 \\\\\n0.34 &amp; 0.38 &amp; 0.28 &amp; 0.00 &amp; 0.00 &amp; 1.00 \\\\\n0.28 &amp; 0.30 &amp; 0.24 &amp; 0.19 &amp; 0.00 &amp; 1.00 \\\\\n0.22 &amp; 0.26 &amp; 0.18 &amp; 0.15 &amp; 0.18 &amp; 1.00\n\\end{array}\\right]\n\n&quot;,&quot;id&quot;:&quot;VGOEAZROZC&quot;}" data-component-name="LatexBlockToDOM"></div><p>This normalization step completes the implementation of this (flawed) version of causal attention.</p><h3>The Problem: Data Leakage in Attention Computation</h3><p>At first glance, the masked self-attention approach appears to solve the problem of preventing queries from attending to future tokens. We mask the upper triangular portion of the attention matrix and normalize the remaining weights. However, a closer examination reveals a critical flaw in this approach.</p><p>To understand the problem, we must revisit how attention weights are computed. The process begins with the construction of query, key, and value matrices. We compute the dot product between the query matrix and the transpose of the key matrix, producing attention scores that indicate how much each token attends to every other token. These scores are then scaled by dividing each element by the square root of the key dimensionality, yielding the scaled dot product. The crucial step occurs when we convert the scaled dot product into attention weights by applying the softmax function row-wise.</p><p>Here lies the problem. When we apply softmax to a row in the scaled dot product matrix, the denominator considers all elements in that row, including those corresponding to future tokens. 
Consider the first row, which corresponds to the token &#8220;computer.&#8221; When computing the softmax, the summation in the denominator includes the scaled dot product values for all tokens, including &#8220;science,&#8221; &#8220;is,&#8221; &#8220;the,&#8221; and &#8220;study.&#8221; Similarly, for the second row corresponding to &#8220;science,&#8221; the softmax denominator includes values from future tokens &#8220;is,&#8221; &#8220;the,&#8221; and &#8220;study.&#8221;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NvYh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa360d81a-343d-46aa-8924-abfa4beb580c_1239x570.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NvYh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa360d81a-343d-46aa-8924-abfa4beb580c_1239x570.png 424w, https://substackcdn.com/image/fetch/$s_!NvYh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa360d81a-343d-46aa-8924-abfa4beb580c_1239x570.png 848w, https://substackcdn.com/image/fetch/$s_!NvYh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa360d81a-343d-46aa-8924-abfa4beb580c_1239x570.png 1272w, https://substackcdn.com/image/fetch/$s_!NvYh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa360d81a-343d-46aa-8924-abfa4beb580c_1239x570.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!NvYh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa360d81a-343d-46aa-8924-abfa4beb580c_1239x570.png" width="1239" height="570" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a360d81a-343d-46aa-8924-abfa4beb580c_1239x570.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:570,&quot;width&quot;:1239,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:41767,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa360d81a-343d-46aa-8924-abfa4beb580c_1239x570.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!NvYh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa360d81a-343d-46aa-8924-abfa4beb580c_1239x570.png 424w, https://substackcdn.com/image/fetch/$s_!NvYh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa360d81a-343d-46aa-8924-abfa4beb580c_1239x570.png 848w, https://substackcdn.com/image/fetch/$s_!NvYh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa360d81a-343d-46aa-8924-abfa4beb580c_1239x570.png 1272w, https://substackcdn.com/image/fetch/$s_!NvYh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa360d81a-343d-46aa-8924-abfa4beb580c_1239x570.png 1456w" 
sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em><strong>Figure 1.94:</strong> Data leakage: when softmax is computed before masking, the denominator already includes contributions from future tokens, subtly influencing the attention weights.</em></p><p>By the time we obtain the attention weights, each element has already been influenced by future tokens through the softmax normalization. Masking the attention weights <em>after</em> this computation does not eliminate this influence. 
The information from future tokens has already leaked into the computation through the softmax denominator.</p><p>This phenomenon is termed data leakage. We intended masked self-attention to prevent queries from accessing information about future tokens, but this prevention fails because the leakage occurs during the softmax computation itself. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mnkC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6471bb6a-140c-4593-b6b7-2bf741196e8c_1104x561.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mnkC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6471bb6a-140c-4593-b6b7-2bf741196e8c_1104x561.png 424w, https://substackcdn.com/image/fetch/$s_!mnkC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6471bb6a-140c-4593-b6b7-2bf741196e8c_1104x561.png 848w, https://substackcdn.com/image/fetch/$s_!mnkC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6471bb6a-140c-4593-b6b7-2bf741196e8c_1104x561.png 1272w, https://substackcdn.com/image/fetch/$s_!mnkC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6471bb6a-140c-4593-b6b7-2bf741196e8c_1104x561.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!mnkC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6471bb6a-140c-4593-b6b7-2bf741196e8c_1104x561.png" width="1104" height="561" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6471bb6a-140c-4593-b6b7-2bf741196e8c_1104x561.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:561,&quot;width&quot;:1104,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:23086,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6471bb6a-140c-4593-b6b7-2bf741196e8c_1104x561.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!mnkC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6471bb6a-140c-4593-b6b7-2bf741196e8c_1104x561.png 424w, https://substackcdn.com/image/fetch/$s_!mnkC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6471bb6a-140c-4593-b6b7-2bf741196e8c_1104x561.png 848w, https://substackcdn.com/image/fetch/$s_!mnkC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6471bb6a-140c-4593-b6b7-2bf741196e8c_1104x561.png 1272w, https://substackcdn.com/image/fetch/$s_!mnkC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6471bb6a-140c-4593-b6b7-2bf741196e8c_1104x561.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em><strong>Figure 1.95:</strong> To prevent data leakage, masking must be applied before softmax so that future tokens are excluded from the softmax denominator entirely.</em></p><p>To properly implement causal attention, we must intervene <em>before</em> applying softmax. The softmax denominator for each row should only consider elements up to and including the current token position. Future keys must be excluded from this summation entirely. The masking operation must occur at the scaled dot product stage, before the softmax function is applied, to truly prevent data leakage.</p><h4>The Solution: Masking with Negative Infinity</h4><p>The solution to the data leakage problem involves a clever technique that applies masking before the softmax operation. 
Instead of zeroing out attention weights after computing softmax, we assign negative infinity values to the positions we want to mask in the scaled dot product matrix.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7Qhg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66d7e713-d01b-4d19-983a-ecce51b2fae3_996x444.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7Qhg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66d7e713-d01b-4d19-983a-ecce51b2fae3_996x444.png 424w, https://substackcdn.com/image/fetch/$s_!7Qhg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66d7e713-d01b-4d19-983a-ecce51b2fae3_996x444.png 848w, https://substackcdn.com/image/fetch/$s_!7Qhg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66d7e713-d01b-4d19-983a-ecce51b2fae3_996x444.png 1272w, https://substackcdn.com/image/fetch/$s_!7Qhg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66d7e713-d01b-4d19-983a-ecce51b2fae3_996x444.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7Qhg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66d7e713-d01b-4d19-983a-ecce51b2fae3_996x444.png" width="996" height="444" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/66d7e713-d01b-4d19-983a-ecce51b2fae3_996x444.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:444,&quot;width&quot;:996,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:29610,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66d7e713-d01b-4d19-983a-ecce51b2fae3_996x444.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7Qhg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66d7e713-d01b-4d19-983a-ecce51b2fae3_996x444.png 424w, https://substackcdn.com/image/fetch/$s_!7Qhg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66d7e713-d01b-4d19-983a-ecce51b2fae3_996x444.png 848w, https://substackcdn.com/image/fetch/$s_!7Qhg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66d7e713-d01b-4d19-983a-ecce51b2fae3_996x444.png 1272w, https://substackcdn.com/image/fetch/$s_!7Qhg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66d7e713-d01b-4d19-983a-ecce51b2fae3_996x444.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Figure</strong> <strong>1.96:</strong> Negative infinity masking: upper-triangular entries are set to &#8722;&#8734; before softmax, which maps them to exactly zero probability while correctly normalizing over visible tokens.</p><p>The process works as follows. After computing the attention scores through the dot product between the query and key matrices, and before applying softmax, we set all upper triangular elements to negative infinity. These negative infinity values persist even after scaling by the square root of the key dimensionality, since dividing negative infinity by any finite number still yields negative infinity.</p><p>To understand why this approach works, consider how the softmax function behaves with negative infinity values. 
</p><p>Recall the definition of softmax:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\n\\text{Softmax}(x_i) = \\frac{e^{x_i}}{\\sum_{j} e^{x_j}}\n&quot;,&quot;id&quot;:&quot;HFLLKLSJAZ&quot;}" data-component-name="LatexBlockToDOM"></div><p>Suppose we have a row containing the values 2, 3, and 5.</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\n\\begin{aligned}\nx_1 &amp;= 2, \\quad x_2 = 3, \\quad x_3 = 5 \\\\\n\\text{Softmax}(2) &amp;= \\frac{e^{2}}{e^{2} + e^{3} + e^{5}} \\\\\n\\text{Softmax}(3) &amp;= \\frac{e^{3}}{e^{2} + e^{3} + e^{5}} \\\\\n\\text{Softmax}(5) &amp;= \\frac{e^{5}}{e^{2} + e^{3} + e^{5}}\n\\end{aligned}\n\n&quot;,&quot;id&quot;:&quot;VJCWLIGICL&quot;}" data-component-name="LatexBlockToDOM"></div><p>Evaluating these expressions gives:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align*}\n\\text{Softmax}(2) &amp;\\approx 0.0420 \\\\\n\\text{Softmax}(3) &amp;\\approx 0.1142 \\\\\n\\text{Softmax}(5) &amp;\\approx 0.8438\n\\end{align*}\n&quot;,&quot;id&quot;:&quot;EDVRYZWCRW&quot;}" data-component-name="LatexBlockToDOM"></div><p>Now consider what happens when we want to mask the last two elements. 
We replace them with negative infinity, which gives the row [2, &#8722;&#8734;, &#8722;&#8734;].</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\nx = [2, -\\infty, -\\infty]\n&quot;,&quot;id&quot;:&quot;PSQIDIVACF&quot;}" data-component-name="LatexBlockToDOM"></div><p>When we apply softmax, the first element becomes:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{Softmax}(x_1) = \\frac{e^{2}}{e^{2} + e^{-\\infty} + e^{-\\infty}}\n&quot;,&quot;id&quot;:&quot;CCATEDRBGV&quot;}" data-component-name="LatexBlockToDOM"></div><p>The key insight is that the exponential of negative infinity vanishes:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\ne^{-\\infty} = \\frac{1}{e^{\\infty}} \\to 0\n&quot;,&quot;id&quot;:&quot;QEGRQFXXCM&quot;}" data-component-name="LatexBlockToDOM"></div><p>Therefore, the first element simplifies to 1:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{Softmax}(x_1) = \\frac{e^{2}}{e^{2} + 0 + 0} = 1\n&quot;,&quot;id&quot;:&quot;BGXILKCLHC&quot;}" data-component-name="LatexBlockToDOM"></div><p><strong>What happens to the masked elements?</strong></p><p>The second position, which holds negative infinity, becomes zero. 
The third position becomes zero in the same way.</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\n\\text{Softmax}(x_2) = \\frac{e^{-\\infty}}{e^{2} + e^{-\\infty} + e^{-\\infty}} = \\frac{0}{e^{2}} = 0, \\quad\n\\text{Softmax}(x_3) = \\frac{e^{-\\infty}}{e^{2} + e^{-\\infty} + e^{-\\infty}} = \\frac{0}{e^{2}} = 0&quot;,&quot;id&quot;:&quot;OZUHLOEEDD&quot;}" data-component-name="LatexBlockToDOM"></div><p>After softmax, the row becomes [1, 0, 0].</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\Rightarrow \\text{Softmax}([2, -\\infty, -\\infty]) = [1, 0, 0]\n\n&quot;,&quot;id&quot;:&quot;QGMWYQQCSG&quot;}" data-component-name="LatexBlockToDOM"></div><h4>Let&#8217;s take one more example</h4><p>This time we mask only the third element:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;x = [2, 3, -\\infty]\n&quot;,&quot;id&quot;:&quot;FLVYJGGPGI&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{Softmax}(x_1) = \\frac{e^{2}}{e^{2} + e^{3} + e^{-\\infty}} = \\frac{e^{2}}{e^{2} + e^{3}}\n&quot;,&quot;id&quot;:&quot;QYRAWJPQLD&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{Softmax}(x_2) = \\frac{e^{3}}{e^{2} + e^{3} + e^{-\\infty}} = \\frac{e^{3}}{e^{2} + e^{3}}\n&quot;,&quot;id&quot;:&quot;RTNXRCLOAQ&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{Softmax}(x_3) = \\frac{e^{-\\infty}}{e^{2} + e^{3} + e^{-\\infty}} = 0\n&quot;,&quot;id&quot;:&quot;FKOWQRCMZQ&quot;}" data-component-name="LatexBlockToDOM"></div><p>The two unmasked softmax values, 0.2689 and 0.7311, add up to one, which confirms that they form a proper probability distribution over the unmasked elements:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{Softmax}([2,\\,3,\\,-\\infty]) 
= [0.2689,\\ 0.7311,\\ 0]\n&quot;,&quot;id&quot;:&quot;RYTTRXQFKN&quot;}" data-component-name="LatexBlockToDOM"></div><p>The elegance of this method lies in its automatic normalization property. By setting masked positions to negative infinity before softmax, the resulting attention weights naturally satisfy two requirements. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!f0VW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83700855-a1ee-462c-b48e-700ed0a40c33_1704x444.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!f0VW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83700855-a1ee-462c-b48e-700ed0a40c33_1704x444.png 424w, https://substackcdn.com/image/fetch/$s_!f0VW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83700855-a1ee-462c-b48e-700ed0a40c33_1704x444.png 848w, https://substackcdn.com/image/fetch/$s_!f0VW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83700855-a1ee-462c-b48e-700ed0a40c33_1704x444.png 1272w, https://substackcdn.com/image/fetch/$s_!f0VW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83700855-a1ee-462c-b48e-700ed0a40c33_1704x444.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!f0VW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83700855-a1ee-462c-b48e-700ed0a40c33_1704x444.png" width="1456" height="379" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/83700855-a1ee-462c-b48e-700ed0a40c33_1704x444.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:379,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:45197,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83700855-a1ee-462c-b48e-700ed0a40c33_1704x444.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!f0VW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83700855-a1ee-462c-b48e-700ed0a40c33_1704x444.png 424w, https://substackcdn.com/image/fetch/$s_!f0VW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83700855-a1ee-462c-b48e-700ed0a40c33_1704x444.png 848w, https://substackcdn.com/image/fetch/$s_!f0VW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83700855-a1ee-462c-b48e-700ed0a40c33_1704x444.png 1272w, https://substackcdn.com/image/fetch/$s_!f0VW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83700855-a1ee-462c-b48e-700ed0a40c33_1704x444.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em><strong>Figure 1.97:</strong> The complete causal masking pipeline: (1) compute scaled dot products, (2) set upper-triangular entries to &#8722;&#8734;, (3) apply softmax. Masked positions become exactly zero with no data leakage.</em></p><p>First, all masked positions become exactly zero after softmax. Second, the remaining non-masked weights in each row automatically sum to one, as the softmax function guarantees normalization over all finite input values. 
This eliminates the need for any additional normalization step after masking, solving the data leakage problem while maintaining the mathematical properties required for attention weights.</p><p>Now we apply the causal mask to the scaled attention scores.</p><h4><strong>Listing 1.11: Applying the Causal Mask to Scaled Attention Scores</strong></h4><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;fd90af92-11e1-407b-83bc-174acf469673&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">import math

scaled_scores = attn_scores / math.sqrt(d_k)
print("Scaled scores (unmasked):")
print(scaled_scores)

# Build the boolean causal mask again

seq_len = attn_scores.size(0)
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Apply the mask: set disallowed positions to -inf

masked_scaled_scores = scaled_scores.masked_fill(~causal_mask, float("-inf"))

print("\nScaled scores with causal mask:")
print(masked_scaled_scores)
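
# --- Added sketch (not part of the original listing) ---
# A silent, self-contained check of the claim that -inf survives the scaling
# step, so masking before or after dividing by sqrt(d_k) gives the same matrix.
# The 3x3 toy matrix reuses the worked-example values 2, 3, 5; every toy_* name
# and toy_d_k = 4 are assumptions made purely for illustration.
import math
import torch  # re-imported only so this check can also be run on its own

toy_scores = torch.tensor([[2.0, 3.0, 5.0]]).repeat(3, 1)
toy_d_k = 4
toy_mask = torch.tril(torch.ones(3, 3, dtype=torch.bool))

# Order 1: scale first, then mask (the order used in this listing).
scale_then_mask = (toy_scores / math.sqrt(toy_d_k)).masked_fill(~toy_mask, float("-inf"))
# Order 2: mask first, then scale (-inf divided by a finite number stays -inf).
mask_then_scale = toy_scores.masked_fill(~toy_mask, float("-inf")) / math.sqrt(toy_d_k)

assert torch.equal(scale_then_mask, mask_then_scale)   # both orders agree
assert torch.isinf(scale_then_mask[~toy_mask]).all()   # future positions are -inf
# Allowed positions keep their scaled value.
assert torch.equal(scale_then_mask[toy_mask], (toy_scores / math.sqrt(toy_d_k))[toy_mask])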
</code></pre></div><p><strong>Output</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;7ff816d3-9f1a-4e45-b19b-e6b55d4ad252&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">Scaled scores (unmasked):
tensor([[0.5000, 1.0000, 1.5000, 2.0000, 2.5000],
        [0.7500, 1.2500, 1.7500, 2.2500, 2.7500],
        [1.0000, 1.5000, 2.0000, 2.5000, 3.0000],
        [1.2500, 1.7500, 2.2500, 2.7500, 3.2500],
        [1.5000, 2.0000, 2.5000, 3.0000, 3.5000]])

Scaled scores with causal mask:
tensor([[0.5000,   -inf,   -inf,   -inf,   -inf],
        [0.7500, 1.2500,   -inf,   -inf,   -inf],
        [1.0000, 1.5000, 2.0000,   -inf,   -inf],
        [1.2500, 1.7500, 2.2500, 2.7500,   -inf],
        [1.5000, 2.0000, 2.5000, 3.0000, 3.5000]])</code></pre></div><p>We first apply the usual scaling factor <strong>1/sqrt(d_k)</strong> to the scores.</p><p>Then the line</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;2d85a795-5909-4cfa-963c-624f4476d476&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">masked_scaled_scores = scaled_scores.masked_fill(~causal_mask, float("-inf"))</code></pre></div><p>does two things:</p><ul><li><p><strong>~causal_mask</strong> inverts the boolean mask. Positions that were <strong>False</strong> (future tokens) become <strong>True</strong>.</p></li><li><p><strong>masked_fill</strong> returns a new tensor with <strong>-inf</strong> written into those positions, leaving <strong>scaled_scores</strong> itself unchanged.</p></li></ul><p>All allowed positions (on and below the diagonal) keep their original scaled scores. Disallowed positions become negative infinity. This guarantees that, when we apply softmax next, future tokens receive <strong>zero</strong> attention weight.</p><p>Now let&#8217;s apply softmax to the masked scores to obtain the causal attention weights.</p><h4><strong>Listing 1.12: Computing Causal Attention Weights with Softmax</strong></h4><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;7e1b4023-4b89-4e16-b350-5e855003cee1&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">attn_weights_causal = torch.softmax(masked_scaled_scores, dim=-1)

print("Causal attention weights:")
print(attn_weights_causal)
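
# --- Added sketch (not part of the original listing) ---
# Silent, self-contained check of the worked softmax example from the text.
# The toy_* names are hypothetical and exist only for this illustration.
import torch  # re-imported only so this check can also be run on its own

toy_rows = torch.tensor([[2.0, float("-inf"), float("-inf")],
                         [2.0, 3.0, float("-inf")],
                         [2.0, 3.0, 5.0]])
toy_weights = torch.softmax(toy_rows, dim=-1)
toy_expected = torch.tensor([[1.0000, 0.0000, 0.0000],
                             [0.2689, 0.7311, 0.0000],
                             [0.0420, 0.1142, 0.8438]])
assert torch.all(toy_weights[torch.isinf(toy_rows)] == 0)      # masked entries are exactly zero
assert torch.allclose(toy_weights, toy_expected, atol=1e-4)     # matches the hand computation
assert torch.allclose(toy_weights.sum(dim=-1), torch.ones(3))   # each row sums to one
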
print("\nRow sums:", attn_weights_causal.sum(dim=-1))</code></pre></div><p><strong>Output</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;db0a2311-ad07-40ea-b70d-b1548961a2cd&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">Causal attention weights:
tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3775, 0.6225, 0.0000, 0.0000, 0.0000],
        [0.1863, 0.3072, 0.5065, 0.0000, 0.0000],
        [0.1015, 0.1674, 0.2760, 0.4551, 0.0000],
        [0.0580, 0.0956, 0.1577, 0.2600, 0.4287]])

Row sums: tensor([1.0000, 1.0000, 1.0000, 1.0000, 1.0000])</code></pre></div><p>Softmax is now applied to rows that contain finite scores only on and below the diagonal, while masked positions are set to &#8722;inf so their exponentials become zero. As a result, the first token can only attend to itself, giving a distribution like [1, 0, 0, 0, 0]; the second token attends only to the first two positions, and those two weights sum to one; the last token can attend to all five positions, so its row is a full probability distribution over the sequence. In every case, all entries above the diagonal are exactly zero, so no token ever attends to future tokens, and each row still sums to one, so every row is a valid attention distribution. Because this masking is applied before softmax, there is no data leakage from future positions, which gives us true causal attention.</p><h2>1.14 Causal Attention with Dropouts</h2><div><hr></div><h4>Concept of Dropout</h4><p>Before exploring how dropout is applied in causal attention, we first review the concept of dropout and its purpose in neural networks. 
Dropout is a regularization technique designed to prevent overfitting and ensure that all neurons in a network contribute meaningfully to the learning process.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1131!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F702b2c16-8212-4e2e-b5fc-0d476a1112ee_786x276.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1131!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F702b2c16-8212-4e2e-b5fc-0d476a1112ee_786x276.png 424w, https://substackcdn.com/image/fetch/$s_!1131!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F702b2c16-8212-4e2e-b5fc-0d476a1112ee_786x276.png 848w, https://substackcdn.com/image/fetch/$s_!1131!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F702b2c16-8212-4e2e-b5fc-0d476a1112ee_786x276.png 1272w, https://substackcdn.com/image/fetch/$s_!1131!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F702b2c16-8212-4e2e-b5fc-0d476a1112ee_786x276.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1131!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F702b2c16-8212-4e2e-b5fc-0d476a1112ee_786x276.png" width="786" height="276" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/702b2c16-8212-4e2e-b5fc-0d476a1112ee_786x276.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:276,&quot;width&quot;:786,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:11554,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F702b2c16-8212-4e2e-b5fc-0d476a1112ee_786x276.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!1131!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F702b2c16-8212-4e2e-b5fc-0d476a1112ee_786x276.png 424w, https://substackcdn.com/image/fetch/$s_!1131!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F702b2c16-8212-4e2e-b5fc-0d476a1112ee_786x276.png 848w, https://substackcdn.com/image/fetch/$s_!1131!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F702b2c16-8212-4e2e-b5fc-0d476a1112ee_786x276.png 1272w, https://substackcdn.com/image/fetch/$s_!1131!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F702b2c16-8212-4e2e-b5fc-0d476a1112ee_786x276.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em><strong>Figure 1.98:</strong> Dropout randomly deactivates neurons during training, forcing lazy neurons to participate and preventing the network from relying on a few dominant connections.</em></p><p>Consider a neural network layer where certain neurons dominate the computation while others contribute minimally. For example, in a layer with five neurons, one neuron might have very large weights while the others have small weights. This dominant neuron effectively controls the output of the layer, while the remaining neurons become what we call lazy neurons. These lazy neurons do not significantly influence the forward pass and consequently do not learn useful representations during training. 
The network essentially overfits by relying too heavily on a subset of neurons.</p><p>Dropout addresses this problem by randomly deactivating neurons during training. In each forward pass through the network, neurons are switched off with a certain probability, typically 0.5. This means that statistically, half of the neurons will be deactivated in any given training iteration. The selection is probabilistic and automatic, not manual.</p><p>When a previously dominant neuron is switched off, the lazy neurons must participate in the forward propagation. During backpropagation, the weights of these previously inactive neurons must now be adjusted to minimize the loss. Without dropout, if the forward propagation relies entirely on one or two dominant neurons that already produce low loss, the weights of the lazy neurons would never be updated. By forcing different subsets of neurons to be active across training iterations, dropout ensures that all neurons learn to extract useful features from the data.</p><p>This technique prevents the network from becoming overly dependent on specific neurons and encourages the development of more robust, distributed representations across the entire network.</p><div><hr></div><h4><strong>Why Dropout Matters in Attention Mechanisms</strong></h4><p>When we build attention mechanisms for language models, we sometimes encounter a problem where certain words become overly dependent on each other. 
</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!PsfA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41606981-5d62-4c62-8ee4-f5e0d2a59d09_1095x258.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!PsfA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41606981-5d62-4c62-8ee4-f5e0d2a59d09_1095x258.png 424w, https://substackcdn.com/image/fetch/$s_!PsfA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41606981-5d62-4c62-8ee4-f5e0d2a59d09_1095x258.png 848w, https://substackcdn.com/image/fetch/$s_!PsfA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41606981-5d62-4c62-8ee4-f5e0d2a59d09_1095x258.png 1272w, https://substackcdn.com/image/fetch/$s_!PsfA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41606981-5d62-4c62-8ee4-f5e0d2a59d09_1095x258.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!PsfA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41606981-5d62-4c62-8ee4-f5e0d2a59d09_1095x258.png" width="1095" height="258" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/41606981-5d62-4c62-8ee4-f5e0d2a59d09_1095x258.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:258,&quot;width&quot;:1095,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:21340,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41606981-5d62-4c62-8ee4-f5e0d2a59d09_1095x258.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!PsfA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41606981-5d62-4c62-8ee4-f5e0d2a59d09_1095x258.png 424w, https://substackcdn.com/image/fetch/$s_!PsfA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41606981-5d62-4c62-8ee4-f5e0d2a59d09_1095x258.png 848w, https://substackcdn.com/image/fetch/$s_!PsfA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41606981-5d62-4c62-8ee4-f5e0d2a59d09_1095x258.png 1272w, https://substackcdn.com/image/fetch/$s_!PsfA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41606981-5d62-4c62-8ee4-f5e0d2a59d09_1095x258.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><em><strong>Figure</strong> <strong>1.99:</strong> Without dropout, the word &#8220;study&#8221; may pay excessive attention to specific earlier words, memorizing patterns rather 
than learning general language rules.</em></p><p>Consider the sentence &#8220;computer science is the study.&#8221; If the word &#8220;study&#8221; pays excessive attention to specific earlier words, the model might memorize these particular patterns rather than learning general language rules. This excessive dependency between tokens can hurt the model&#8217;s ability to generalize to new sentences.</p><p>Dropout offers an elegant solution to this problem. By randomly removing some attention connections during training, we force the model to learn more robust patterns that don&#8217;t rely on any single strong connection.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gLSm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9824b2e-377b-4bab-8979-6a8ec904f0aa_1185x537.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gLSm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9824b2e-377b-4bab-8979-6a8ec904f0aa_1185x537.png 424w, https://substackcdn.com/image/fetch/$s_!gLSm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9824b2e-377b-4bab-8979-6a8ec904f0aa_1185x537.png 848w, https://substackcdn.com/image/fetch/$s_!gLSm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9824b2e-377b-4bab-8979-6a8ec904f0aa_1185x537.png 1272w, https://substackcdn.com/image/fetch/$s_!gLSm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9824b2e-377b-4bab-8979-6a8ec904f0aa_1185x537.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!gLSm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9824b2e-377b-4bab-8979-6a8ec904f0aa_1185x537.png" width="1185" height="537" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c9824b2e-377b-4bab-8979-6a8ec904f0aa_1185x537.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:537,&quot;width&quot;:1185,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:14229,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9824b2e-377b-4bab-8979-6a8ec904f0aa_1185x537.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!gLSm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9824b2e-377b-4bab-8979-6a8ec904f0aa_1185x537.png 424w, https://substackcdn.com/image/fetch/$s_!gLSm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9824b2e-377b-4bab-8979-6a8ec904f0aa_1185x537.png 848w, https://substackcdn.com/image/fetch/$s_!gLSm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9824b2e-377b-4bab-8979-6a8ec904f0aa_1185x537.png 1272w, https://substackcdn.com/image/fetch/$s_!gLSm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9824b2e-377b-4bab-8979-6a8ec904f0aa_1185x537.png 1456w" 
sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em><strong>Figure 1.100:</strong> Causal attention with dropout applied: some valid attention connections (shown in red) are randomly zeroed out during training while the causal mask is preserved.</em></p><p>Let&#8217;s walk through where dropout fits in. Consider our example with 5 input tokens: &#8220;computer&#8221;, &#8220;science&#8221;, &#8220;is&#8221;, &#8220;the&#8221;, &#8220;study&#8221;. 
Each token is represented as a vector, and, as you saw in the self-attention section, these vectors give us a 5 &#215; 5 matrix of attention weights.</p><p>For unidirectional attention (also called causal attention), we mask the upper triangle of this matrix, above the diagonal. This ensures that each token can attend only to itself and to previous tokens, never to future ones. Looking at the visualization, &#8220;computer&#8221; can only see itself, &#8220;science&#8221; can see &#8220;computer&#8221; and itself, &#8220;is&#8221; can see the first three words, and so on. The gray areas represent the masked positions where attention is blocked.</p><p>Here&#8217;s where dropout comes in. After obtaining the attention weights, we set each of them to zero independently with probability p. The right image shows unidirectional attention with dropout applied. The red boxes highlight the positions that dropout has zeroed out.</p><p>Notice how dropout respects the causal structure. It can only change the valid attention weights (the lower triangular part, including the diagonal); the masked upper triangle is already zero, so dropping or rescaling an entry there leaves it at zero. 
Some attention weights that were previously active  are now dropped out (shown within red boxes), forcing the model to rely on different connection patterns.</p><h4>The Scaling Factor Explained</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6EOV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F023f2489-f774-4035-84d8-ca08fd46e9b1_1287x687.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6EOV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F023f2489-f774-4035-84d8-ca08fd46e9b1_1287x687.png 424w, https://substackcdn.com/image/fetch/$s_!6EOV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F023f2489-f774-4035-84d8-ca08fd46e9b1_1287x687.png 848w, https://substackcdn.com/image/fetch/$s_!6EOV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F023f2489-f774-4035-84d8-ca08fd46e9b1_1287x687.png 1272w, https://substackcdn.com/image/fetch/$s_!6EOV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F023f2489-f774-4035-84d8-ca08fd46e9b1_1287x687.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6EOV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F023f2489-f774-4035-84d8-ca08fd46e9b1_1287x687.png" width="1287" height="687" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/023f2489-f774-4035-84d8-ca08fd46e9b1_1287x687.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:687,&quot;width&quot;:1287,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:48795,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F023f2489-f774-4035-84d8-ca08fd46e9b1_1287x687.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6EOV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F023f2489-f774-4035-84d8-ca08fd46e9b1_1287x687.png 424w, https://substackcdn.com/image/fetch/$s_!6EOV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F023f2489-f774-4035-84d8-ca08fd46e9b1_1287x687.png 848w, https://substackcdn.com/image/fetch/$s_!6EOV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F023f2489-f774-4035-84d8-ca08fd46e9b1_1287x687.png 1272w, https://substackcdn.com/image/fetch/$s_!6EOV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F023f2489-f774-4035-84d8-ca08fd46e9b1_1287x687.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em><strong>Figure 1.101:</strong> Dropout scaling: when connections are dropped with probability p, the remaining weights are scaled by 1/(1&#8722;p) to keep the expected output magnitude consistent between training and inference.</em></p><p><strong>When we apply dropout, there&#8217;s a crucial detail:</strong> we need to scale up the remaining weights. The scaling keeps the expected value of each weight the same during training, where dropout is active, as at inference, where it is switched off. Suppose row 4 originally has attention weights spread across its first four positions. With dropout probability 0.5, every surviving weight is multiplied by 1/(1&#8722;0.5) = 2, whether or not exactly half of the connections happen to be dropped, so that the expected output magnitude stays the same. 
For example, referring to the matrix, suppose the fourth row originally has attention weights [0.25, 0.26, 0.25, 0.24] in its four non-zero positions, and dropout with probability 0.5 zeros out the second and fourth values. After scaling the survivors by 2, the row becomes [0.50, 0, 0.50, 0]. This keeps the expected signal strength consistent.</p><p>Note that after dropout and scaling, a row is no longer guaranteed to sum to one. The row above still sums to 1.0 only because the two surviving weights happened to total 0.5; had the first three weights survived instead, the row would sum to 1.52. What the scaling preserves is the expected value of each weight, and therefore the expected row sum, averaged over many random masks.</p><p>Dropout is a standard regularizer in transformer architectures. Randomly removing connections, combined with proper scaling, discourages the model from relying on any single attention connection and helps it perform better on unseen data.</p><p>Let&#8217;s now add <strong>dropout</strong> to the causal attention weights and then compute the context vectors.<br>We keep the same <strong>attn_weights_causal</strong> as above and assume we already have a <code>values</code> matrix of shape <code>(5, 4)</code> from the self-attention section.</p><h4><strong>Listing 1.13: Applying Dropout to Causal Attention Weights</strong></h4><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;73972c8e-aba3-4bb1-b5c6-f5718e273096&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">dropout = torch.nn.Dropout(p=0.5)

torch.manual_seed(0)  # to get a stable example mask
attn_weights_causal_drop = dropout(attn_weights_causal)

print("Causal attention weights before dropout:")
print(attn_weights_causal)

print("\nCausal attention weights after dropout (training mode):")
print(attn_weights_causal_drop)
print("\nRow sums after dropout:")
print(attn_weights_causal_drop.sum(dim=-1))

# Use the dropped weights to compute context vectors
context_causal = attn_weights_causal_drop @ values

print("\nCausal context vectors with dropout:")
print(context_causal)
print("context_causal.shape:", context_causal.shape)
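
# --- Added sketch (not part of the original listing) ---
# The scaling rule from the text, re-implemented in plain Python so it runs
# without PyTorch. The helper name inverted_dropout, the seed, and the number
# of trials are assumptions made for this illustration only; nn.Dropout does
# the equivalent work for real tensors. Nothing here prints, so the output
# shown for the listing is unaffected.
import random


def inverted_dropout(row, p, rng):
    """Drop each weight with probability p; scale survivors by 1 / (1 - p).

    Assumes p lies in the half-open interval [0, 1).
    """
    scale = 1.0 / (1.0 - p)
    # Each multiplier is 0.0 with probability p and `scale` with probability 1 - p.
    multipliers = rng.choices([0.0, scale], weights=[p, 1.0 - p], k=len(row))
    return [w * m for w, m in zip(row, multipliers)]


example_row = [0.25, 0.26, 0.25, 0.24]  # the row used in the text
rng = random.Random(0)
trials = 20000

# Average many independently dropped copies of the row.
mean_row = [0.0] * len(example_row)
for _ in range(trials):
    dropped = inverted_dropout(example_row, 0.5, rng)
    mean_row = [m + d / trials for m, d in zip(mean_row, dropped)]

# A single dropped row generally does not sum to 1, but mean_row should come
# out close to example_row: the weights are preserved in expectation.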
</code></pre></div><p><strong>Output</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;3148ad71-00b0-4816-871a-e0141f9ff829&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">Causal attention weights before dropout:
tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3777, 0.6223, 0.0000, 0.0000, 0.0000],
        [0.1863, 0.3072, 0.5065, 0.0000, 0.0000],
        [0.1015, 0.1674, 0.2760, 0.4551, 0.0000],
        [0.0580, 0.0956, 0.1577, 0.2599, 0.4288]])

Causal attention weights after dropout (training mode):
tensor([[2.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.7554, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3726, 0.6144, 0.0000, 0.0000, 0.0000],
        [0.2030, 0.3348, 0.0000, 0.9102, 0.0000],
        [0.0000, 0.1912, 0.3154, 0.5198, 0.0000]])

Row sums after dropout:
tensor([2.0000, 0.7554, 0.9870, 1.4480, 1.0264])

Causal context vectors with dropout:
tensor([[ 4.2400,  5.0200, -0.5600,  1.5400],
        [ 1.6020,  1.7000, -0.1600,  0.3700],
        [ 2.2640,  2.4060, -0.2700,  0.7000],
        [ 2.4240,  2.6460, -0.2800,  0.9700],
        [ 2.2670,  2.5410, -0.2800,  0.9100]])
context_causal.shape: torch.Size([5, 4])
</code></pre></div><p>Dropout randomly sets some attention weights to zero during training. Each weight is kept with <strong>probability 1 - p</strong> and dropped with probability <strong>p</strong>, and the surviving weights are <strong>scaled by 1 / (1 - p)</strong> so that the expected value of each weight, and hence the expected row sum, remains unchanged. Individual rows no longer sum to exactly one, as the row sums in the output show (2.0000 for the first row, 0.7554 for the second). Importantly, dropout never unblocks future positions: all entries above the diagonal stay at zero, so the causal structure is preserved. It only thins out the valid lower triangular connections. After this regularization step, the dropout-modified attention weights are used in the usual way to compute the context vectors, by multiplying them with the value matrix.</p><p>Finally, the context vectors are computed as</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;e745a40e-8d15-42e7-8c8c-e54bc3a9feb8&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">context_causal = attn_weights_causal_drop @ values
</code></pre></div><p>Each row of <strong>context_causal</strong> is the context vector for one token under <strong>causal attention with dropout</strong>. Stacked together, these vectors form a matrix of shape <strong>(5, 4)</strong>: one row per token and one column per attention head dimension. Within a transformer block, these context vectors are what gets passed on toward the feed-forward network during training of generative models.</p><h2>1.15 Summary of Self-Attention</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Umfy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e51b450-96c5-4592-a76f-6721a2e4a4ac_1713x663.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Umfy!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e51b450-96c5-4592-a76f-6721a2e4a4ac_1713x663.png 424w, https://substackcdn.com/image/fetch/$s_!Umfy!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e51b450-96c5-4592-a76f-6721a2e4a4ac_1713x663.png 848w, https://substackcdn.com/image/fetch/$s_!Umfy!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e51b450-96c5-4592-a76f-6721a2e4a4ac_1713x663.png 1272w, https://substackcdn.com/image/fetch/$s_!Umfy!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e51b450-96c5-4592-a76f-6721a2e4a4ac_1713x663.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Umfy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e51b450-96c5-4592-a76f-6721a2e4a4ac_1713x663.png" width="1456" height="564" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2e51b450-96c5-4592-a76f-6721a2e4a4ac_1713x663.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:564,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:35974,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e51b450-96c5-4592-a76f-6721a2e4a4ac_1713x663.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Umfy!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e51b450-96c5-4592-a76f-6721a2e4a4ac_1713x663.png 424w, https://substackcdn.com/image/fetch/$s_!Umfy!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e51b450-96c5-4592-a76f-6721a2e4a4ac_1713x663.png 848w, https://substackcdn.com/image/fetch/$s_!Umfy!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e51b450-96c5-4592-a76f-6721a2e4a4ac_1713x663.png 1272w, https://substackcdn.com/image/fetch/$s_!Umfy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2e51b450-96c5-4592-a76f-6721a2e4a4ac_1713x663.png 1456w" 
sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em><strong>Figure 1.102:</strong> Complete self-attention pipeline: input embeddings are projected into Q, K, V; attention scores are computed via QK^T, scaled, optionally masked, passed through softmax (with optional dropout), and multiplied by V to produce context vectors.</em></p><p>Before moving on to multi-head attention, let&#8217;s summarize the self-attention mechanism covered in the previous sections. The process begins when the input tokens are converted into the <strong>Input Embedding</strong> matrix. 
This matrix is linearly projected into three distinct matrices: <strong>Query (Q)</strong>, <strong>Key (K)</strong>, and <strong>Value (V)</strong>. This is achieved by multiplying the Input Embedding matrix by three separate, trainable weight matrices. Once Q, K, and V are generated, the input embedding matrix is no longer needed for the attention computation itself.</p><p>The core attention calculation starts by computing the <strong>Attention Scores</strong>. This is done, as shown in the first &#8216;MatMul&#8217; step, by multiplying the Q matrix by the transpose of the K matrix, which computes the dot product between every query and every key. These raw scores are then rescaled in the &#8216;Scale&#8217; step, where they are divided by the square root of the key dimension. Without this scaling, large dot products push softmax into saturated regions where gradients become very small, so the step is important for stable training.</p><p>Following scaling, an &#8216;Optional Mask&#8217; can be applied. This step is essential for implementing <strong>causal attention</strong>: it masks out all scores corresponding to future tokens, ensuring a token can only attend to itself and previous tokens. Next, the <strong>SoftMax</strong> function is applied across each row of the scaled (and possibly masked) scores. This converts each row into non-negative values that sum to one, turning the scores into the final <strong>attention weights</strong>. An &#8216;Optional Dropout&#8217; layer can be applied here during training to reduce overfitting.</p><p>In the final step, these attention weights are multiplied by the Value (V) matrix, as shown in the second &#8216;MatMul&#8217; operation. This produces the <strong>Context Vector</strong> matrix (Z). 
Each row of this Z matrix is a new, contextually enriched vector for the corresponding input token, as it now contains a weighted combination of information from all other tokens it was allowed to attend to.</p><h3>Limitations of Self-Attention Mechanisms</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rEvD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a14a76d-319b-449c-a182-a5e8cd3a667e_717x483.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rEvD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a14a76d-319b-449c-a182-a5e8cd3a667e_717x483.png 424w, https://substackcdn.com/image/fetch/$s_!rEvD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a14a76d-319b-449c-a182-a5e8cd3a667e_717x483.png 848w, https://substackcdn.com/image/fetch/$s_!rEvD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a14a76d-319b-449c-a182-a5e8cd3a667e_717x483.png 1272w, https://substackcdn.com/image/fetch/$s_!rEvD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a14a76d-319b-449c-a182-a5e8cd3a667e_717x483.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rEvD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a14a76d-319b-449c-a182-a5e8cd3a667e_717x483.png" width="717" height="483" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1a14a76d-319b-449c-a182-a5e8cd3a667e_717x483.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:483,&quot;width&quot;:717,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:278267,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a14a76d-319b-449c-a182-a5e8cd3a667e_717x483.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!rEvD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a14a76d-319b-449c-a182-a5e8cd3a667e_717x483.png 424w, https://substackcdn.com/image/fetch/$s_!rEvD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a14a76d-319b-449c-a182-a5e8cd3a667e_717x483.png 848w, https://substackcdn.com/image/fetch/$s_!rEvD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a14a76d-319b-449c-a182-a5e8cd3a667e_717x483.png 1272w, https://substackcdn.com/image/fetch/$s_!rEvD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a14a76d-319b-449c-a182-a5e8cd3a667e_717x483.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em><strong>Figure 1.103:</strong> Linguistic ambiguity: &#8220;The artist painted the portrait of a woman with a brush&#8221; has two valid interpretations, the brush is the tool for painting, or the woman in the portrait holds a brush.</em></p><p>A significant problem with a single self-attention mechanism is its limited ability to effectively handle linguistic ambiguity. This challenge can be illustrated with the sentence: </p><div class="pullquote"><p>&#8220;The artist painted the portrait of a woman with a brush.&#8221; </p></div><p>This statement has two distinct and valid interpretations. </p><p>The first interpretation is that the artist used a brush as a tool to perform the action of painting. In this context, the phrase &#8220;with a brush&#8221; modifies the verb &#8220;painted&#8221;. 
</p><p>The second interpretation is that the subject of the painting is a woman who is holding a brush. Here, &#8220;with a brush&#8221; modifies the &#8220;woman&#8221; or &#8220;portrait&#8221;. </p><p>A single self-attention layer may struggle to capture both of these potential relationships simultaneously. It might incorrectly average these dependencies or fixate on only one, resulting in a contextual vector that fails to represent the full nuance of the sentence.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1QEM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08e36698-65e2-4b58-9904-30d0cd1a84a8_1119x525.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1QEM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08e36698-65e2-4b58-9904-30d0cd1a84a8_1119x525.png 424w, https://substackcdn.com/image/fetch/$s_!1QEM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08e36698-65e2-4b58-9904-30d0cd1a84a8_1119x525.png 848w, https://substackcdn.com/image/fetch/$s_!1QEM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08e36698-65e2-4b58-9904-30d0cd1a84a8_1119x525.png 1272w, https://substackcdn.com/image/fetch/$s_!1QEM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08e36698-65e2-4b58-9904-30d0cd1a84a8_1119x525.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!1QEM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08e36698-65e2-4b58-9904-30d0cd1a84a8_1119x525.png" width="1119" height="525" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/08e36698-65e2-4b58-9904-30d0cd1a84a8_1119x525.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:525,&quot;width&quot;:1119,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:21743,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08e36698-65e2-4b58-9904-30d0cd1a84a8_1119x525.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!1QEM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08e36698-65e2-4b58-9904-30d0cd1a84a8_1119x525.png 424w, https://substackcdn.com/image/fetch/$s_!1QEM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08e36698-65e2-4b58-9904-30d0cd1a84a8_1119x525.png 848w, https://substackcdn.com/image/fetch/$s_!1QEM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08e36698-65e2-4b58-9904-30d0cd1a84a8_1119x525.png 1272w, https://substackcdn.com/image/fetch/$s_!1QEM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08e36698-65e2-4b58-9904-30d0cd1a84a8_1119x525.png 1456w" 
sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em><strong>Figure 1.104:</strong> A single attention head can only produce one attention matrix, either capturing the tool interpretation or the subject interpretation, but not both simultaneously.</em></p><p>The first interpretation, where the artist uses the brush, would generate an attention score matrix where the word &#8220;artist&#8221; has a high attention score with &#8220;brush&#8221;. 
</p><p>The second interpretation, where the woman in the portrait is holding a brush, would generate a completely different matrix where &#8220;woman&#8221; and &#8220;portrait&#8221; have high attention scores with &#8220;brush&#8221;. </p><p>A single self-attention head can only produce one of these attention matrices. It will either settle on one perspective or create an unhelpful average of the two. This results in a context vector that fails to represent the full richness and potential ambiguity of the input, limiting the model&#8217;s ability to capture the multiple meanings present in complex language.</p><p>This limitation demonstrates the need for a more robust mechanism, which multi-head attention addresses.</p><h2>1.16 Intuition of Multi-Head Attention</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pGSP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef3185ce-bb23-4291-a775-c5319fd80ea3_1521x642.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pGSP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef3185ce-bb23-4291-a775-c5319fd80ea3_1521x642.png 424w, https://substackcdn.com/image/fetch/$s_!pGSP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef3185ce-bb23-4291-a775-c5319fd80ea3_1521x642.png 848w, https://substackcdn.com/image/fetch/$s_!pGSP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef3185ce-bb23-4291-a775-c5319fd80ea3_1521x642.png 1272w, 
https://substackcdn.com/image/fetch/$s_!pGSP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef3185ce-bb23-4291-a775-c5319fd80ea3_1521x642.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pGSP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef3185ce-bb23-4291-a775-c5319fd80ea3_1521x642.png" width="1456" height="615" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ef3185ce-bb23-4291-a775-c5319fd80ea3_1521x642.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:615,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:19340,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef3185ce-bb23-4291-a775-c5319fd80ea3_1521x642.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!pGSP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef3185ce-bb23-4291-a775-c5319fd80ea3_1521x642.png 424w, https://substackcdn.com/image/fetch/$s_!pGSP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef3185ce-bb23-4291-a775-c5319fd80ea3_1521x642.png 848w, 
https://substackcdn.com/image/fetch/$s_!pGSP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef3185ce-bb23-4291-a775-c5319fd80ea3_1521x642.png 1272w, https://substackcdn.com/image/fetch/$s_!pGSP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fef3185ce-bb23-4291-a775-c5319fd80ea3_1521x642.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em><strong>Figure 1.105:</strong> Multi-head attention: the same input is fed into multiple independent self-attention heads in 
parallel, each learning a different set of relationships. Their outputs are concatenated into a single enriched context matrix.</em></p><p>Since a single self-attention mechanism captures only one perspective on an input sequence, the solution is to run several self-attention mechanisms in parallel. This architecture is known as multi-head attention. The same input embedding matrix is fed into several independent self-attention &#8220;heads&#8221;. Each head produces its own context vector matrix and learns a different set of relationships: one head might capture verb-centric relationships while another captures a different semantic nuance. These per-head context vector matrices are then concatenated into a single, final context vector matrix that combines the perspectives of all heads and represents the input more completely.</p><p>To implement multi-head attention, we must adapt the query, key, and value matrix operations so that several attention mechanisms can operate in parallel. Having seen the limitations of a single self-attention layer, we now walk through a practical implementation with multiple heads, using a two-head attention mechanism as the example. The procedure generates two independent sets of attention scores and two corresponding context vector matrices. 
This parallel processing is the core of multi-head attention, as it allows the model to produce multiple, distinct representations, with each head capturing a different perspective or set of relationships from the input sequence.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!P2PR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26494253-5ef2-4ef9-9c03-939eeb79dbf6_1215x1509.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!P2PR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26494253-5ef2-4ef9-9c03-939eeb79dbf6_1215x1509.png 424w, https://substackcdn.com/image/fetch/$s_!P2PR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26494253-5ef2-4ef9-9c03-939eeb79dbf6_1215x1509.png 848w, https://substackcdn.com/image/fetch/$s_!P2PR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26494253-5ef2-4ef9-9c03-939eeb79dbf6_1215x1509.png 1272w, https://substackcdn.com/image/fetch/$s_!P2PR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26494253-5ef2-4ef9-9c03-939eeb79dbf6_1215x1509.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!P2PR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26494253-5ef2-4ef9-9c03-939eeb79dbf6_1215x1509.png" width="1215" height="1509" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/26494253-5ef2-4ef9-9c03-939eeb79dbf6_1215x1509.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1509,&quot;width&quot;:1215,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:39062,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26494253-5ef2-4ef9-9c03-939eeb79dbf6_1215x1509.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!P2PR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26494253-5ef2-4ef9-9c03-939eeb79dbf6_1215x1509.png 424w, https://substackcdn.com/image/fetch/$s_!P2PR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26494253-5ef2-4ef9-9c03-939eeb79dbf6_1215x1509.png 848w, https://substackcdn.com/image/fetch/$s_!P2PR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26494253-5ef2-4ef9-9c03-939eeb79dbf6_1215x1509.png 1272w, https://substackcdn.com/image/fetch/$s_!P2PR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26494253-5ef2-4ef9-9c03-939eeb79dbf6_1215x1509.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em><strong>Figure 1.106:</strong> From single-head to two-head attention: the input embedding matrix is processed by separate weight matrices for each head, producing independent sets of Q, K, and V vectors.</em></p><p>The process begins with the input embedding matrix. The example sentence &#8220;The next day is bright&#8221; has 5 tokens and, as illustrated, each token is represented by an embedding of eight dimensions, so the input embedding matrix has dimensions 5 by 8. The goal of a two-head attention mechanism is to transform this single input matrix into two distinct context vector matrices, each capturing a different perspective.</p><p>To establish a baseline, recall the procedure for a single attention head. 
In this case, the 5 by 8 input embedding matrix is multiplied by three separate trainable weight matrices. As shown in the diagram, these are the Query Weight Matrix (W_q), the Keys Weight Matrix (W_k), and the Values Weight Matrix (W_v), each with dimensions 8 by 4. These multiplications produce a Query Vectors matrix, a Keys Vectors matrix, and a Values Vectors matrix, each 5 by 4. This single set of query, key, and value matrices is what the multi-head attention mechanism will expand upon.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4a-y!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec120595-d137-447a-adff-c0d3e5cb4e98_1515x738.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4a-y!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec120595-d137-447a-adff-c0d3e5cb4e98_1515x738.png 424w, https://substackcdn.com/image/fetch/$s_!4a-y!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec120595-d137-447a-adff-c0d3e5cb4e98_1515x738.png 848w, https://substackcdn.com/image/fetch/$s_!4a-y!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec120595-d137-447a-adff-c0d3e5cb4e98_1515x738.png 1272w, https://substackcdn.com/image/fetch/$s_!4a-y!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec120595-d137-447a-adff-c0d3e5cb4e98_1515x738.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!4a-y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec120595-d137-447a-adff-c0d3e5cb4e98_1515x738.png" width="1456" height="709" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ec120595-d137-447a-adff-c0d3e5cb4e98_1515x738.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:709,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:16476,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec120595-d137-447a-adff-c0d3e5cb4e98_1515x738.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!4a-y!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec120595-d137-447a-adff-c0d3e5cb4e98_1515x738.png 424w, https://substackcdn.com/image/fetch/$s_!4a-y!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec120595-d137-447a-adff-c0d3e5cb4e98_1515x738.png 848w, https://substackcdn.com/image/fetch/$s_!4a-y!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec120595-d137-447a-adff-c0d3e5cb4e98_1515x738.png 1272w, https://substackcdn.com/image/fetch/$s_!4a-y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec120595-d137-447a-adff-c0d3e5cb4e98_1515x738.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em><strong>Figure 1.107:</strong> The head dimension: with d_out = 4 split across 2 heads, each head operates on a reduced dimension of 2. Weight matrices W_k1 and W_k2 each have shape (8, 2).</em></p><p>To transition from a single head to a multi-head mechanism, such as one with two heads, the first step is to adapt the trainable weight matrices. Instead of a single query weight matrix (W_q), we now initialize two separate matrices, W_q1 and W_q2, one for each head. 
This same division is applied to the key and value matrices, creating W_k1, W_k2, W_v1, and W_v2.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!IXwV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce4222d2-c11f-4e22-8fe5-4357b3004108_873x609.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!IXwV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce4222d2-c11f-4e22-8fe5-4357b3004108_873x609.png 424w, https://substackcdn.com/image/fetch/$s_!IXwV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce4222d2-c11f-4e22-8fe5-4357b3004108_873x609.png 848w, https://substackcdn.com/image/fetch/$s_!IXwV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce4222d2-c11f-4e22-8fe5-4357b3004108_873x609.png 1272w, https://substackcdn.com/image/fetch/$s_!IXwV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce4222d2-c11f-4e22-8fe5-4357b3004108_873x609.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!IXwV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce4222d2-c11f-4e22-8fe5-4357b3004108_873x609.png" width="873" height="609" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ce4222d2-c11f-4e22-8fe5-4357b3004108_873x609.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:609,&quot;width&quot;:873,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:28896,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce4222d2-c11f-4e22-8fe5-4357b3004108_873x609.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!IXwV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce4222d2-c11f-4e22-8fe5-4357b3004108_873x609.png 424w, https://substackcdn.com/image/fetch/$s_!IXwV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce4222d2-c11f-4e22-8fe5-4357b3004108_873x609.png 848w, https://substackcdn.com/image/fetch/$s_!IXwV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce4222d2-c11f-4e22-8fe5-4357b3004108_873x609.png 1272w, https://substackcdn.com/image/fetch/$s_!IXwV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce4222d2-c11f-4e22-8fe5-4357b3004108_873x609.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em><strong>Figure 1.108:</strong> Each head operates on a reduced subspace. Head 1 produces Q1, K1, V1 of shape (5, 2); Head 2 produces Q2, K2, V2 of shape (5, 2).</em></p><p>The dimensions of these new matrices are determined by the &#8220;head dimension&#8221;, which is the total output dimension (d_out) divided by the number of heads. For example, with d_out = 4 and two heads, the head dimension is 4 divided by 2, which equals 2. So while the original weight matrices were 8 by 4, each head-specific matrix (such as W_k1 and W_k2) has dimensions 8 by 2. The main idea of this step is to replace each trainable W_q, W_k, and W_v matrix with several smaller, head-specific matrices. 
As a direct consequence, this will naturally produce multiple sets of query vectors, key vectors, and value vectors when multiplied with the input embeddings.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qmdB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7b7a3c3-0341-49c7-9797-1bc00a225753_1698x1851.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qmdB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7b7a3c3-0341-49c7-9797-1bc00a225753_1698x1851.png 424w, https://substackcdn.com/image/fetch/$s_!qmdB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7b7a3c3-0341-49c7-9797-1bc00a225753_1698x1851.png 848w, https://substackcdn.com/image/fetch/$s_!qmdB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7b7a3c3-0341-49c7-9797-1bc00a225753_1698x1851.png 1272w, https://substackcdn.com/image/fetch/$s_!qmdB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7b7a3c3-0341-49c7-9797-1bc00a225753_1698x1851.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qmdB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7b7a3c3-0341-49c7-9797-1bc00a225753_1698x1851.png" width="1456" height="1587" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d7b7a3c3-0341-49c7-9797-1bc00a225753_1698x1851.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1587,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:177034,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7b7a3c3-0341-49c7-9797-1bc00a225753_1698x1851.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!qmdB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7b7a3c3-0341-49c7-9797-1bc00a225753_1698x1851.png 424w, https://substackcdn.com/image/fetch/$s_!qmdB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7b7a3c3-0341-49c7-9797-1bc00a225753_1698x1851.png 848w, https://substackcdn.com/image/fetch/$s_!qmdB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7b7a3c3-0341-49c7-9797-1bc00a225753_1698x1851.png 1272w, https://substackcdn.com/image/fetch/$s_!qmdB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7b7a3c3-0341-49c7-9797-1bc00a225753_1698x1851.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em><strong>Figure 1.109:</strong> Parallel attention computation: the input embeddings are projected through head-specific weight matrices to produce independent Q, K, V matrices for each head.</em></p><p>Multi-head attention extends the basic attention mechanism by creating parallel attention computations that can capture different types of relationships simultaneously. Starting with the input embedding matrix of dimensions 5&#215;8 for our five tokens, the process splits attention into multiple heads. For a two-head configuration with an output dimension of 4, each head operates on a reduced dimension of 2. The input embeddings are multiplied with separate weight matrices to produce Query, Key, and Value matrices for each head. 
Head 1 generates Q1, K1, and V1 matrices, while Head 2 produces Q2, K2, and V2 matrices, all with dimensions 5&#215;2 to match the five tokens and head dimension of 2. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kT_J!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b54c5fc-b3df-4b98-865e-7c2daa0e5dd2_1617x1467.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kT_J!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b54c5fc-b3df-4b98-865e-7c2daa0e5dd2_1617x1467.png 424w, https://substackcdn.com/image/fetch/$s_!kT_J!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b54c5fc-b3df-4b98-865e-7c2daa0e5dd2_1617x1467.png 848w, https://substackcdn.com/image/fetch/$s_!kT_J!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b54c5fc-b3df-4b98-865e-7c2daa0e5dd2_1617x1467.png 1272w, https://substackcdn.com/image/fetch/$s_!kT_J!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b54c5fc-b3df-4b98-865e-7c2daa0e5dd2_1617x1467.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kT_J!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b54c5fc-b3df-4b98-865e-7c2daa0e5dd2_1617x1467.png" width="1456" height="1321" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9b54c5fc-b3df-4b98-865e-7c2daa0e5dd2_1617x1467.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1321,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:138479,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b54c5fc-b3df-4b98-865e-7c2daa0e5dd2_1617x1467.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!kT_J!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b54c5fc-b3df-4b98-865e-7c2daa0e5dd2_1617x1467.png 424w, https://substackcdn.com/image/fetch/$s_!kT_J!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b54c5fc-b3df-4b98-865e-7c2daa0e5dd2_1617x1467.png 848w, https://substackcdn.com/image/fetch/$s_!kT_J!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b54c5fc-b3df-4b98-865e-7c2daa0e5dd2_1617x1467.png 1272w, https://substackcdn.com/image/fetch/$s_!kT_J!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b54c5fc-b3df-4b98-865e-7c2daa0e5dd2_1617x1467.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em><strong>Figure</strong> <strong>1.110:</strong> Each head independently computes its (5, 5) attention score matrix by multiplying Q with K^T, maintaining the ability to capture all pairwise token relationships.</em></p><p>Each head then independently computes its attention scores by multiplying its Query matrix with the transposed Key matrix.</p><p>The Query matrix for each head has shape (5,2) and the transposed Key matrix has shape (2,5), where 5 is the number of tokens in the sequence and 2 is the per-head dimension. When we multiply these matrices (Q times K transpose), we obtain a (5,5) attention score matrix for each head. 
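</p><p>As a minimal sketch of this shape arithmetic, the snippet below builds a random (5,2) Query matrix and a random (5,2) Key matrix for a single head and multiplies Q by K transpose. The names <strong>q_head</strong> and <strong>k_head</strong>, the seed, and the random values are assumptions made purely for illustration; they are not the numbers shown in the figures.</p><div class="highlighted_code_block"><pre class="shiki"><code class="language-python">import torch

torch.manual_seed(0)

num_tokens, head_dim = 5, 2                  # sizes from the running example

# Hypothetical per-head matrices; random numbers stand in for real projections
q_head = torch.randn(num_tokens, head_dim)   # (5, 2)
k_head = torch.randn(num_tokens, head_dim)   # (5, 2)

# Transposing K gives shape (2, 5), so the product is (5, 2) @ (2, 5)
scores = q_head @ k_head.T

print("k_head.T shape:", tuple(k_head.T.shape))
print("scores shape:", tuple(scores.shape))

# Entry [i, j] is the dot product of query i with key j
manual = torch.dot(q_head[1], k_head[3])
print("entry matches dot product:", bool(torch.isclose(scores[1, 3], manual, atol=1e-6)))</code></pre></div><p>The inner dimension of size 2 is summed away in the product, which is why the per-head dimension never appears in the shape of the score matrix.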
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pbUt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7f57bf0-4c65-4288-8577-2dbb2630ddba_1581x663.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pbUt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7f57bf0-4c65-4288-8577-2dbb2630ddba_1581x663.png 424w, https://substackcdn.com/image/fetch/$s_!pbUt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7f57bf0-4c65-4288-8577-2dbb2630ddba_1581x663.png 848w, https://substackcdn.com/image/fetch/$s_!pbUt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7f57bf0-4c65-4288-8577-2dbb2630ddba_1581x663.png 1272w, https://substackcdn.com/image/fetch/$s_!pbUt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7f57bf0-4c65-4288-8577-2dbb2630ddba_1581x663.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pbUt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7f57bf0-4c65-4288-8577-2dbb2630ddba_1581x663.png" width="1456" height="611" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e7f57bf0-4c65-4288-8577-2dbb2630ddba_1581x663.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:611,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:54426,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7f57bf0-4c65-4288-8577-2dbb2630ddba_1581x663.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!pbUt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7f57bf0-4c65-4288-8577-2dbb2630ddba_1581x663.png 424w, https://substackcdn.com/image/fetch/$s_!pbUt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7f57bf0-4c65-4288-8577-2dbb2630ddba_1581x663.png 848w, https://substackcdn.com/image/fetch/$s_!pbUt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7f57bf0-4c65-4288-8577-2dbb2630ddba_1581x663.png 1272w, https://substackcdn.com/image/fetch/$s_!pbUt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7f57bf0-4c65-4288-8577-2dbb2630ddba_1581x663.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em><strong>Figure</strong> <strong>1.111:</strong> Two separate (5, 5) attention score matrices from two heads: each can capture different types of relationships in the data.</em></p><p><strong>This is the crucial insight:</strong> although each head works with half the original dimension, the resulting attention score matrix maintains the full (5,5) shape, which represents relationships between all token pairs. This means that with 2 heads, we generate two separate (5,5) attention score matrices rather than one. Each matrix can capture different types of relationships in the data. 
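</p><p>One way to make this concrete is to slice a (5,4) Query matrix and a (5,4) Key matrix into two column blocks of width 2 and compare the results with a single head that uses all four features. This is only a sketch under stated assumptions: the tensors are random, and the names <strong>scores_h1</strong>, <strong>scores_h2</strong>, and <strong>scores_single</strong> are hypothetical.</p><div class="highlighted_code_block"><pre class="shiki"><code class="language-python">import torch

torch.manual_seed(0)

num_tokens, d_out, num_heads = 5, 4, 2
head_dim = d_out // num_heads                # 4 // 2 = 2

# Hypothetical full-width Query and Key matrices with random values
q = torch.randn(num_tokens, d_out)           # (5, 4)
k = torch.randn(num_tokens, d_out)           # (5, 4)

# Each head sees its own block of two feature columns
scores_h1 = q[:, :head_dim] @ k[:, :head_dim].T   # (5, 5)
scores_h2 = q[:, head_dim:] @ k[:, head_dim:].T   # (5, 5)

# A single head over all four features produces only one (5, 5) matrix
scores_single = q @ k.T

print("head 1:", tuple(scores_h1.shape), "head 2:", tuple(scores_h2.shape))
print("heads differ:", not torch.allclose(scores_h1, scores_h2))
print("sum equals single head:",
      torch.allclose(scores_h1 + scores_h2, scores_single, atol=1e-5))</code></pre></div><p>Under these assumptions the single-head score matrix is the sum of the two per-head matrices, so one head blends both feature groups into a single pattern, while two heads keep the two patterns apart.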
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cADS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbde028dd-14b0-45ff-b051-922b47d493af_1578x951.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cADS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbde028dd-14b0-45ff-b051-922b47d493af_1578x951.png 424w, https://substackcdn.com/image/fetch/$s_!cADS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbde028dd-14b0-45ff-b051-922b47d493af_1578x951.png 848w, https://substackcdn.com/image/fetch/$s_!cADS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbde028dd-14b0-45ff-b051-922b47d493af_1578x951.png 1272w, https://substackcdn.com/image/fetch/$s_!cADS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbde028dd-14b0-45ff-b051-922b47d493af_1578x951.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cADS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbde028dd-14b0-45ff-b051-922b47d493af_1578x951.png" width="1456" height="877" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bde028dd-14b0-45ff-b051-922b47d493af_1578x951.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:877,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:84161,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbde028dd-14b0-45ff-b051-922b47d493af_1578x951.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!cADS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbde028dd-14b0-45ff-b051-922b47d493af_1578x951.png 424w, https://substackcdn.com/image/fetch/$s_!cADS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbde028dd-14b0-45ff-b051-922b47d493af_1578x951.png 848w, https://substackcdn.com/image/fetch/$s_!cADS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbde028dd-14b0-45ff-b051-922b47d493af_1578x951.png 1272w, https://substackcdn.com/image/fetch/$s_!cADS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbde028dd-14b0-45ff-b051-922b47d493af_1578x951.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em><strong>Figure</strong> <strong>1.112:</strong> Each head independently applies scaling and softmax to produce its own attention weight matrix, then computes context vectors by multiplying with its Value matrix.</em></p><p>These attention scores then undergo the standard scaling and softmax normalization within each head independently. After computing the context vectors by multiplying the attention weights with the Value matrices, the outputs from all heads are concatenated back together to restore the original output dimension. 
This architecture allows the model to simultaneously learn and represent multiple perspectives of token relationships without increasing the computational cost compared to having a single attention mechanism with the full dimension.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GRx6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0819c40f-b61b-43f7-9d24-3877946872b0_963x975.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GRx6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0819c40f-b61b-43f7-9d24-3877946872b0_963x975.png 424w, https://substackcdn.com/image/fetch/$s_!GRx6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0819c40f-b61b-43f7-9d24-3877946872b0_963x975.png 848w, https://substackcdn.com/image/fetch/$s_!GRx6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0819c40f-b61b-43f7-9d24-3877946872b0_963x975.png 1272w, https://substackcdn.com/image/fetch/$s_!GRx6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0819c40f-b61b-43f7-9d24-3877946872b0_963x975.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GRx6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0819c40f-b61b-43f7-9d24-3877946872b0_963x975.png" width="963" height="975" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0819c40f-b61b-43f7-9d24-3877946872b0_963x975.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:975,&quot;width&quot;:963,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:69959,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0819c40f-b61b-43f7-9d24-3877946872b0_963x975.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!GRx6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0819c40f-b61b-43f7-9d24-3877946872b0_963x975.png 424w, https://substackcdn.com/image/fetch/$s_!GRx6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0819c40f-b61b-43f7-9d24-3877946872b0_963x975.png 848w, https://substackcdn.com/image/fetch/$s_!GRx6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0819c40f-b61b-43f7-9d24-3877946872b0_963x975.png 1272w, https://substackcdn.com/image/fetch/$s_!GRx6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0819c40f-b61b-43f7-9d24-3877946872b0_963x975.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em><strong>Figure</strong> <strong>1.113:</strong> Context vectors from both heads: Head 1 produces a (5, 2) context matrix and Head 2 produces another (5, 2) context matrix, each capturing a different perspective.</em></p><p>Once we have the attention weight matrices for both heads, each of dimension (5,5), we proceed to compute the context vectors. This is done by multiplying each head&#8217;s attention weights with its corresponding Value matrix.</p><p>For Head 1, we multiply the (5,5) attention weight matrix with the (5,2) Value matrix V1, producing a (5,2) context matrix. Similarly, for Head 2, we multiply its (5,5) attention weight matrix with the (5,2) Value matrix V2, yielding another (5,2) context matrix. 
Each context matrix represents how the tokens should be represented based on the attention patterns learned by that particular head.</p><p>The final step in multi-head attention involves concatenating the context matrices from all heads to produce a unified output representation. Each head generates its own context matrix by multiplying its attention weights with its corresponding Value matrix.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!V8KX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5da5bdd-2e23-4745-a610-5a766284f3fe_1101x465.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!V8KX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5da5bdd-2e23-4745-a610-5a766284f3fe_1101x465.png 424w, https://substackcdn.com/image/fetch/$s_!V8KX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5da5bdd-2e23-4745-a610-5a766284f3fe_1101x465.png 848w, https://substackcdn.com/image/fetch/$s_!V8KX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5da5bdd-2e23-4745-a610-5a766284f3fe_1101x465.png 1272w, https://substackcdn.com/image/fetch/$s_!V8KX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5da5bdd-2e23-4745-a610-5a766284f3fe_1101x465.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!V8KX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5da5bdd-2e23-4745-a610-5a766284f3fe_1101x465.png" width="1101" 
height="465" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c5da5bdd-2e23-4745-a610-5a766284f3fe_1101x465.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:465,&quot;width&quot;:1101,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:27627,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5da5bdd-2e23-4745-a610-5a766284f3fe_1101x465.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!V8KX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5da5bdd-2e23-4745-a610-5a766284f3fe_1101x465.png 424w, https://substackcdn.com/image/fetch/$s_!V8KX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5da5bdd-2e23-4745-a610-5a766284f3fe_1101x465.png 848w, https://substackcdn.com/image/fetch/$s_!V8KX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5da5bdd-2e23-4745-a610-5a766284f3fe_1101x465.png 1272w, https://substackcdn.com/image/fetch/$s_!V8KX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5da5bdd-2e23-4745-a610-5a766284f3fe_1101x465.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em><strong>Figure</strong> <strong>1.114:</strong>  Concatenation: the (5, 2) outputs from Head 1 and Head 2 are concatenated along the feature dimension to recover the original output dimension, forming a (5, 4) final context matrix.</em></p><p>In our example with two heads processing a sequence of 5 tokens, Head 1 produces a context matrix representing the first perspective on token relationships, while Head 2 independently generates another  context matrix capturing a second, distinct perspective.  To combine these complementary perspectives, the context matrices are concatenated along the feature dimension. Specifically, the (5,2) matrix from Head 1 is placed alongside the (5,2) matrix from Head 2, forming a single (5,4) output matrix. 
This concatenation operation merges the outputs horizontally, stacking the feature vectors from each head side by side for every token position. The resulting (5,4) final context matrix maintains the sequence length of 5 tokens while recovering the original output dimension of 4, which equals the head dimension multiplied by the number of heads. This concatenated representation now contains the enriched information from both attention heads, allowing each token&#8217;s final representation to simultaneously encode multiple types of relationships and patterns discovered by the different heads. The concatenated output serves as the complete output of the multi-head attention mechanism and is typically passed through a final linear projection layer before proceeding to subsequent layers in the transformer architecture.</p><h3>The Dimensional Trade-Off in Multi-Head Attention</h3><p>While multi-head attention offers significant advantages in capturing diverse perspectives, it does involve a fundamental trade-off in its design. When the output dimension is split across multiple heads, each head operates with a reduced dimensionality compared to single-head attention. In our example with an output dimension of 4 split into 2 heads, each head works with only 2 dimensions rather than the full 4 dimensions that would be available in single-head attention. This reduction in per-head dimensionality means that each individual head has a smaller representational capacity and fewer parameters to capture nuanced patterns within its specific perspective. With fewer dimensions to work with, each head may be limited in the complexity and detail of the relationships it can encode. However, this apparent limitation is offset by the increased number of perspectives that can be learned in parallel. 
The architecture essentially implements a divide-and-conquer strategy: instead of attempting to capture all types of token relationships within a single high-dimensional space, the model distributes this learning task across multiple specialized heads, each focusing on different aspects of the input. While one head might capture syntactic dependencies with its 2 dimensions, another head simultaneously learns semantic relationships with its own 2 dimensions. This parallelization allows the model to explore a broader range of attention patterns across the same computational budget. The concatenation of outputs from all heads ultimately reconstructs the full output dimension, ensuring that the combined representation benefits from multiple complementary perspectives. Thus, while each head operates with reduced capacity, the overall multi-head architecture achieves greater expressiveness through diversification, making this trade-off worthwhile for most applications.</p><h4><strong>Listing 1.14: Creating the Input Tensor for Multi-Head Attention</strong></h4><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;8fdd7c6e-9fd8-48df-a167-3d47c625b0b0&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">import torch
torch.manual_seed(123)
torch.set_printoptions(precision=3, sci_mode=False)   # sci_mode=False turns off scientific notation

# b, num_tokens, d_in = (1, 3, 6)
x = torch.tensor([[
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],   # &#8220;The&#8221;
    [6.0, 5.0, 4.0, 3.0, 2.0, 1.0],   # &#8220;kid&#8221;
    [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],   # &#8220;smiles&#8221;
]])

print("x.shape:", x.shape)
</code></pre></div><p><strong>Output</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;3ab7c72e-b230-4416-89a1-c8dfb4181a35&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">x.shape: torch.Size([1, 3, 6])</code></pre></div><p>The tensor <strong>x</strong> holds one mini-batch of token embeddings. The shape <strong>1, 3, 6</strong> reads as batch size one, three tokens per sequence, and six features per token. The three rows correspond to &#8220;The&#8221;, &#8220;kid&#8221;, and &#8220;smiles&#8221;, and each row has six embedding values.</p><h4><strong>Listing 1.15: Projecting Input to Query, Key, and Value</strong></h4><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;f0b97c43-0c9e-4ff2-8430-c91a819e02f7&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">b, num_tokens, d_in = x.shape

d_out = 6          # final output dimension we want per token
num_heads = 2
head_dim = d_out // num_heads   # 6 // 2 = 3

W_q = torch.nn.Parameter(torch.randn(d_in, d_out), requires_grad=False)
W_k = torch.nn.Parameter(torch.randn(d_in, d_out), requires_grad=False)
W_v = torch.nn.Parameter(torch.randn(d_in, d_out), requires_grad=False)

q = x @ W_q    # (1, 3, 6)
k = x @ W_k    # (1, 3, 6)
v = x @ W_v    # (1, 3, 6)
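
# Sketch added for illustration (an assumption, not the author's listing):
# the first head_dim columns of W_q act as the query weights of head one, so
# projecting with that column block alone reproduces the first head_dim features of q.
q_head_one = x @ W_q[:, :head_dim]      # (1, 3, 3); q_head_one is a hypothetical name
assert torch.allclose(q_head_one, q[..., :head_dim], atol=1e-4)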

print("q.shape:", q.shape)
print("k.shape:", k.shape)
print("v.shape:", v.shape)</code></pre></div><p><strong>Output</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;580a81a5-23d2-4ead-8d96-6d224cefd6f2&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">q.shape: torch.Size([1, 3, 6])
k.shape: torch.Size([1, 3, 6])
v.shape: torch.Size([1, 3, 6])</code></pre></div><p>Instead of keeping separate weight matrices for each head, we follow the weight-splitting idea: we keep one large weight matrix each for the query, key, and value projections, all of shape <strong>6, 6</strong>. Multiplying the <strong>1, 3, 6</strong> input by a <strong>6, 6</strong> weight gives new <strong>1, 3, 6</strong> tensors for <strong>q</strong>, <strong>k</strong>, and <strong>v</strong>. At this point there is no explicit notion of heads in the tensors; all six output features per token are packed into the last dimension.</p><h4><strong>Listing 1.16: Splitting Projections into Multiple Heads</strong></h4><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;0ec826de-a8a7-4f36-bb9d-95d02f1ec3ed&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python"># reshape from (b, num_tokens, d_out) to (b, num_tokens, num_heads, head_dim)
q = q.view(b, num_tokens, num_heads, head_dim)
k = k.view(b, num_tokens, num_heads, head_dim)
v = v.view(b, num_tokens, num_heads, head_dim)

print("q after view:", q.shape)
print("k after view:", k.shape)
print("v after view:", v.shape)</code></pre></div><p><strong>Output</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;8843b46d-5266-421f-87d2-c615bc6f61e6&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">q after view: torch.Size([1, 3, 2, 3])
k after view: torch.Size([1, 3, 2, 3])
v after view: torch.Size([1, 3, 2, 3])</code></pre></div><p>The six features that came out of each projection are now interpreted as two heads with three features each. The <strong>view</strong> operation does not change any values; it only changes how we index them. The new shape <strong>1, 3, 2, 3</strong> can be read as batch size one, three tokens, two heads, three features per head. For a given token position, the last two dimensions now contain the representation for head one and head two.</p><h4><strong>Listing 1.17: Reordering Dimensions to Group by Head</strong></h4><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;68949543-7b60-464c-9c6a-ee2db3c4fe9a&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python"># move the head dimension in front of the token dimension
# from (b, num_tokens, num_heads, head_dim)
# to   (b, num_heads, num_tokens, head_dim)
q = q.transpose(1, 2)
k = k.transpose(1, 2)
v = v.transpose(1, 2)

print(&quot;q after transpose:&quot;, q.shape)
print(&quot;k after transpose:&quot;, k.shape)
print(&quot;v after transpose:&quot;, v.shape)</code></pre></div><p><strong>Output</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;9b7905a8-27fb-4d2a-bcb0-43949b4c1bde&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">q after transpose: torch.Size([1, 2, 3, 3])
k after transpose: torch.Size([1, 2, 3, 3])
v after transpose: torch.Size([1, 2, 3, 3])</code></pre></div><p>After splitting into heads, it is convenient to group all tokens that belong to the same head together. The transpose call swaps the token and head axes. The new shape <strong>1, 2, 3, 3</strong> reads as batch size one, two heads, three tokens, three features per head. If you isolate <strong>q[0, 0]</strong> you see the three query vectors for head one, while <strong>q[0, 1]</strong> contains the three query vectors for head two. The same interpretation holds for <strong>k</strong> and <strong>v</strong>. This layout allows a single tensor operation to compute attention for all heads in parallel.</p><h4><strong>Listing 1.18: Computing Per-Head Attention Scores and Context Vectors</strong></h4><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;65ac809a-9c5b-4cb0-be2b-12fb1e36b261&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">import math

# scaled dot product attention for all heads at once
scores = q @ k.transpose(-1, -2)        # (b, num_heads, num_tokens, num_tokens)
print(&quot;scores.shape:&quot;, scores.shape)

scale = math.sqrt(head_dim)
weights = torch.softmax(scores / scale, dim=-1)
print(&quot;weights.shape:&quot;, weights.shape)
print(&quot;weights[0, 0]:&quot;)
print(weights[0, 0])

# context vectors inside each head
context = weights @ v                   # (b, num_heads, num_tokens, head_dim)
print(&quot;context per head shape:&quot;, context.shape)
print(&quot;context[0, 0]:&quot;)
print(context[0, 0])</code></pre></div><p><strong>Output</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;a76feb73-9c82-42f8-b1ca-2c471ecbb570&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">scores.shape: torch.Size([1, 2, 3, 3])
weights.shape: torch.Size([1, 2, 3, 3])
weights[0, 0]:
tensor([[0.59 , 0.24 , 0.17 ],
        [0.29 , 0.45 , 0.26 ],
        [0.22 , 0.31 , 0.47 ]])
context per head shape: torch.Size([1, 2, 3, 3])
context[0, 0]:
tensor([[ 0.564, -3.817,  2.064],
        [-2.124, -4.104,  2.056],
        [-2.120, -4.099,  2.053]])</code></pre></div><p>The tensor <strong>scores</strong> has shape <strong>1, 2, 3, 3</strong>. For each batch element and head, it contains a full three by three attention score matrix over the three tokens. Dividing by <strong>sqrt(head_dim)</strong> and applying softmax along the last axis converts these scores into attention weights, again separately for each head. The shape of <strong>weights</strong> matches that of <strong>scores</strong>.</p><p>Multiplying the weights by <strong>v</strong> produces the context tensor of shape <strong>1, 2, 3, 3</strong>. For each head you now have three context vectors, one per token, each with three features. In the printed slice <strong>context[0, 0]</strong> you can see the three vectors produced by the first head for &#8220;The&#8221;, &#8220;kid&#8221;, and &#8220;smiles&#8221;, one row per token. These rows reappear as the first three features of each token once the heads are merged in the next listing.</p><h4><strong>Listing 1.19: Merging Heads into the Final Context Matrix</strong></h4><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;f41ab7aa-46e7-4e43-a4a7-8a994c1c62fb&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python"># move tokens back in front of heads and merge head and feature dimensions
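# transpose only swaps strides, so .contiguous() copies the data into the new
# layout; without it, .view() cannot merge the last two dimensions
context = context.transpose(1, 2).contiguous()  # (b, num_tokens, num_heads, head_dim)
context = context.view(b, num_tokens, num_heads * head_dim)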

print(&quot;final context.shape:&quot;, context.shape)
print(&quot;final context:&quot;)
print(context)</code></pre></div><p><strong>Output</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;f10e889d-57a0-42d4-a3c4-e0eb76857b95&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">final context.shape: torch.Size([1, 3, 6])
final context:
tensor([[[ 0.564, -3.817,  2.064,  3.116,  9.936, 14.649],
         [-2.124, -4.104,  2.056,  3.098,  9.814, 14.479],
         [-2.120, -4.099,  2.053,  3.113,  9.915, 14.620]]])</code></pre></div><p>To return to a single context vector per token, we undo the earlier reordering and then collapse the head dimension and the per head feature dimension back into one. The transpose moves us from <strong>1, 2, 3, 3</strong> to <strong>1, 3, 2, 3</strong>, grouping all heads for each token together. The final <strong>view</strong> then interprets the last two dimensions <strong>2, 3</strong> as a single dimension of size six.</p><p>The result is a <strong>1, 3, 6</strong> tensor. Each row is now a six dimensional context vector for one token, built by concatenating the three features from head one and the three features from head two. Compared to single head attention, nothing about the scoring or weighting changed. The difference is that we used reshaping and transposing to let two separate attention heads operate in parallel on smaller subspaces, then merged their outputs to recover the original six dimensional representation per token.</p><h3>Concluding Multi Head Attention</h3><p>Multi head attention extends single head attention by running several independent attention mechanisms in parallel, each with its own learned query, key, and value projections on a reduced head dimension. For an input embedding matrix, each head produces its own attention scores over all token pairs and then its own context matrix, so different heads can specialize on different types of relationships in the sequence. These context matrices are then concatenated along the feature dimension to recover the original output size, so each token representation combines multiple complementary views of the same input. 
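</p><p>Listings 1.16 to 1.19 can be collected into one small function. The block below is a compact sketch rather than the implementation used elsewhere in this text: the function name <strong>multi_head_attention</strong>, the helper <strong>split</strong>, and the random input and weights are assumptions made purely for illustration. Like Listing 1.18, it applies no causal mask, and it leaves out the final output projection.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">import math
import torch

def multi_head_attention(x, w_q, w_k, w_v, num_heads):
    # x: (b, num_tokens, d_in); each weight matrix: (d_in, d_out)
    b, num_tokens, _ = x.shape
    d_out = w_q.shape[1]
    assert d_out % num_heads == 0, "d_out must be divisible by num_heads"
    head_dim = d_out // num_heads

    def split(t):
        # (b, num_tokens, d_out) to (b, num_heads, num_tokens, head_dim)
        return t.view(b, num_tokens, num_heads, head_dim).transpose(1, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    scores = q @ k.transpose(-1, -2)                 # (b, num_heads, num_tokens, num_tokens)
    weights = torch.softmax(scores / math.sqrt(head_dim), dim=-1)
    context = weights @ v                            # (b, num_heads, num_tokens, head_dim)
    # merge heads back into one feature dimension of size d_out
    return context.transpose(1, 2).contiguous().view(b, num_tokens, d_out)

torch.manual_seed(0)                                 # illustrative random data
x = torch.randn(1, 3, 6)
w_q, w_k, w_v = (torch.randn(6, 6) for _ in range(3))
out = multi_head_attention(x, w_q, w_k, w_v, num_heads=2)
print(out.shape)  # torch.Size([1, 3, 6])</code></pre></div><p>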
While each head has fewer dimensions and therefore lower capacity than a single large attention module, the collection of diverse heads makes the overall representation more expressive, and a final linear projection can further mix and refine these combined features.</p><h2>1.17 Layer Normalization</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XVeS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7114008-fcc3-41e7-baa7-1760269de241_408x735.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XVeS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7114008-fcc3-41e7-baa7-1760269de241_408x735.png 424w, https://substackcdn.com/image/fetch/$s_!XVeS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7114008-fcc3-41e7-baa7-1760269de241_408x735.png 848w, https://substackcdn.com/image/fetch/$s_!XVeS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7114008-fcc3-41e7-baa7-1760269de241_408x735.png 1272w, https://substackcdn.com/image/fetch/$s_!XVeS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7114008-fcc3-41e7-baa7-1760269de241_408x735.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XVeS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7114008-fcc3-41e7-baa7-1760269de241_408x735.png" width="408" height="735" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e7114008-fcc3-41e7-baa7-1760269de241_408x735.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:735,&quot;width&quot;:408,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:18137,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7114008-fcc3-41e7-baa7-1760269de241_408x735.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!XVeS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7114008-fcc3-41e7-baa7-1760269de241_408x735.png 424w, https://substackcdn.com/image/fetch/$s_!XVeS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7114008-fcc3-41e7-baa7-1760269de241_408x735.png 848w, https://substackcdn.com/image/fetch/$s_!XVeS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7114008-fcc3-41e7-baa7-1760269de241_408x735.png 1272w, https://substackcdn.com/image/fetch/$s_!XVeS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7114008-fcc3-41e7-baa7-1760269de241_408x735.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em><strong>Figure 1.115: </strong>Layer normalization appears multiple times in the transformer block: before multi-head attention, before the feed-forward network, and often before the output layer.</em></p><p>In the transformer block, layer normalization appears several times. It is applied before the multi head attention sublayer, again before the feed forward network, and often once more outside the block before the final output layer. 
Because it is used so frequently, it is convenient to implement it as its own reusable module when we code the model.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MdOy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c9e3c68-3b73-46b3-90f6-eecf71732f59_909x933.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!MdOy!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c9e3c68-3b73-46b3-90f6-eecf71732f59_909x933.png 424w, https://substackcdn.com/image/fetch/$s_!MdOy!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c9e3c68-3b73-46b3-90f6-eecf71732f59_909x933.png 848w, https://substackcdn.com/image/fetch/$s_!MdOy!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c9e3c68-3b73-46b3-90f6-eecf71732f59_909x933.png 1272w, https://substackcdn.com/image/fetch/$s_!MdOy!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c9e3c68-3b73-46b3-90f6-eecf71732f59_909x933.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!MdOy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c9e3c68-3b73-46b3-90f6-eecf71732f59_909x933.png" width="909" height="933" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4c9e3c68-3b73-46b3-90f6-eecf71732f59_909x933.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:933,&quot;width&quot;:909,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:107072,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c9e3c68-3b73-46b3-90f6-eecf71732f59_909x933.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!MdOy!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c9e3c68-3b73-46b3-90f6-eecf71732f59_909x933.png 424w, https://substackcdn.com/image/fetch/$s_!MdOy!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c9e3c68-3b73-46b3-90f6-eecf71732f59_909x933.png 848w, https://substackcdn.com/image/fetch/$s_!MdOy!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c9e3c68-3b73-46b3-90f6-eecf71732f59_909x933.png 1272w, https://substackcdn.com/image/fetch/$s_!MdOy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c9e3c68-3b73-46b3-90f6-eecf71732f59_909x933.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em><strong>Figure 1.116: </strong> The gradient flow problem in deep networks: without normalization, gradients can explode (very large activations) or vanish (very small activations), making training unstable.</em></p><p>To understand why layer normalization is so important, it helps to step back and look at a standard deep neural network with an input layer, several hidden layers, and an output layer. During the forward pass, activations flow from left to right, and during backpropagation the gradients flow in the opposite direction, from the output layer back through each hidden layer to the input.</p><p>Each layer has parameters and therefore receives gradients of the loss with respect to those parameters. The gradients at a given layer depend strongly on the outputs of that layer. 
If the layer outputs are very large in magnitude, the corresponding gradients tend to become very large as we chain them backward through the network. By the time they reach the earlier layers, they can explode to extremely large values. This is the exploding gradient problem and it leads to unstable updates and divergence during training.</p><p>The opposite can also happen. If the outputs of a layer are very small, the gradients that depend on them can quickly shrink as they propagate backward through many layers. Early layers then receive gradients that are almost zero and their parameters barely change. This is the vanishing gradient problem and it makes learning extremely slow or stops it altogether.</p><p>In both cases, very large or very small activations in intermediate layers create gradient magnitudes that are either too large or too small. Training then becomes unstable and inefficient. One way to stabilize the gradients is therefore to control the magnitude of the layer outputs themselves. 
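</p><p>A toy experiment makes both failure modes visible. The sketch below is only an illustration under assumed settings: the helper name <strong>probe</strong>, the depth of 20 layers, the width of 64, and the three weight scales are arbitrary choices, not values from this text. It pushes a random vector through a stack of random linear layers without any normalization and reports the average size of the final activations and of the gradient that reaches the input.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">import torch

def probe(weight_scale, depth=20, width=64):
    torch.manual_seed(0)                      # same random draws for every call
    x = torch.randn(1, width, requires_grad=True)
    h = x
    for _ in range(depth):
        w = torch.randn(width, width) * weight_scale
        h = h @ w                             # one linear layer, no normalization
    h.sum().backward()                        # gradient of a dummy loss w.r.t. the input
    return h.abs().mean().item(), x.grad.abs().mean().item()

for scale in (0.5, 0.125, 0.05):
    act, grad = probe(scale)
    print(f"scale={scale}: mean |activation|={act:.2e}, mean |input gradient|={grad:.2e}")</code></pre></div><p>With these assumed settings, the largest scale should make both numbers grow by many orders of magnitude, the smallest should shrink them toward zero, and only the middle scale should keep them in a moderate range. 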
This is exactly what normalization layers are designed to do.</p><p><strong>Internal covariate shift</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2xLN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F288908b6-a1d7-4f91-9649-f18f421fcbd6_1191x609.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2xLN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F288908b6-a1d7-4f91-9649-f18f421fcbd6_1191x609.png 424w, https://substackcdn.com/image/fetch/$s_!2xLN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F288908b6-a1d7-4f91-9649-f18f421fcbd6_1191x609.png 848w, https://substackcdn.com/image/fetch/$s_!2xLN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F288908b6-a1d7-4f91-9649-f18f421fcbd6_1191x609.png 1272w, https://substackcdn.com/image/fetch/$s_!2xLN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F288908b6-a1d7-4f91-9649-f18f421fcbd6_1191x609.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2xLN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F288908b6-a1d7-4f91-9649-f18f421fcbd6_1191x609.png" width="1191" height="609" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/288908b6-a1d7-4f91-9649-f18f421fcbd6_1191x609.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:609,&quot;width&quot;:1191,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:33772,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F288908b6-a1d7-4f91-9649-f18f421fcbd6_1191x609.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!2xLN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F288908b6-a1d7-4f91-9649-f18f421fcbd6_1191x609.png 424w, https://substackcdn.com/image/fetch/$s_!2xLN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F288908b6-a1d7-4f91-9649-f18f421fcbd6_1191x609.png 848w, https://substackcdn.com/image/fetch/$s_!2xLN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F288908b6-a1d7-4f91-9649-f18f421fcbd6_1191x609.png 1272w, https://substackcdn.com/image/fetch/$s_!2xLN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F288908b6-a1d7-4f91-9649-f18f421fcbd6_1191x609.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em><strong>Figure 1.117:</strong> Internal covariate shift: as earlier layers update their weights during training, the distribution of activations fed into later layers keeps changing, making learning a moving target.</em></p><p>There is a second issue in deep networks known as internal covariate shift. During training, as earlier layers update their weights, the distribution of activations that they feed into later layers keeps changing. Imagine looking at the inputs to a particular hidden layer at the beginning of training. They may roughly follow one distribution. After a few training iterations, as weights update, the same layer may now see inputs with a different mean or variance, or a skewed shape. 
The layer is trying to learn a good mapping, but the distribution of its inputs keeps drifting, so the layer is constantly adapting to a moving target. This slows down convergence and makes optimization harder.</p><p>If we could keep the mean and variance of the inputs to each layer more stable across training iterations, learning would become easier. Normalization does this by rescaling activations so that their distribution is more consistent over time. This reduces internal covariate shift and helps the model converge faster.</p><p><strong>The core idea of layer normalization</strong></p><p>Layer normalization is a simple procedure applied to the outputs of a layer. Consider a single training example and focus on the vector of activations produced by some layer for that example. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!b8dM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06a8d13c-cbb4-406e-b56b-7b23af0511df_960x657.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!b8dM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06a8d13c-cbb4-406e-b56b-7b23af0511df_960x657.png 424w, https://substackcdn.com/image/fetch/$s_!b8dM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06a8d13c-cbb4-406e-b56b-7b23af0511df_960x657.png 848w, https://substackcdn.com/image/fetch/$s_!b8dM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06a8d13c-cbb4-406e-b56b-7b23af0511df_960x657.png 1272w, 
https://substackcdn.com/image/fetch/$s_!b8dM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06a8d13c-cbb4-406e-b56b-7b23af0511df_960x657.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!b8dM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06a8d13c-cbb4-406e-b56b-7b23af0511df_960x657.png" width="960" height="657" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/06a8d13c-cbb4-406e-b56b-7b23af0511df_960x657.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:657,&quot;width&quot;:960,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:55073,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06a8d13c-cbb4-406e-b56b-7b23af0511df_960x657.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!b8dM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06a8d13c-cbb4-406e-b56b-7b23af0511df_960x657.png 424w, https://substackcdn.com/image/fetch/$s_!b8dM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06a8d13c-cbb4-406e-b56b-7b23af0511df_960x657.png 848w, https://substackcdn.com/image/fetch/$s_!b8dM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06a8d13c-cbb4-406e-b56b-7b23af0511df_960x657.png 
1272w, https://substackcdn.com/image/fetch/$s_!b8dM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06a8d13c-cbb4-406e-b56b-7b23af0511df_960x657.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em><strong>Figure 1.118:</strong>  Layer normalization in action: six activations with mean 0.6 and variance 0.07 are centered to mean 0 and rescaled to variance 1, producing standardized activations.</em></p><p>Imagine a single layer that produces six outputs for one training example</p><div class="latex-rendered" 
data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{aligned}\nx_1 &amp;= 0.78, \\\\\nx_2 &amp;= 1.05, \\\\\nx_3 &amp;= 0.44, \\\\\nx_4 &amp;= 0.73, \\\\\nx_5 &amp;= 0.65, \\\\\nx_6 &amp;= 0.28\n\\end{aligned}\n\n&quot;,&quot;id&quot;:&quot;JTFQPLWDCC&quot;}" data-component-name="LatexBlockToDOM"></div><p>These are the values shown in the middle row of the figure. Layer normalization first computes the mean of these activations</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{mean} = \\frac{x_1 + x_2 + x_3 + x_4 + x_5 + x_6}{6} \\approx 0.6\n\n&quot;,&quot;id&quot;:&quot;KCIOUDAKNM&quot;}" data-component-name="LatexBlockToDOM"></div><p>Next it computes the variance</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align*}\n\\text{var} = \\frac{1}{6} \\bigg[ &amp; (x_1 - \\text{mean})^2 + (x_2 - \\text{mean})^2 + (x_3 - \\text{mean})^2 \\\\\n&amp; + (x_4 - \\text{mean})^2 + (x_5 - \\text{mean})^2 + (x_6 - \\text{mean})^2 \\bigg] \\approx 0.07\n\\end{align*}\n&quot;,&quot;id&quot;:&quot;AEWUBEZHGX&quot;}" data-component-name="LatexBlockToDOM"></div><p>To normalize the activations, each <em>x<sub>i</sub></em> is first shifted so that the mean of the set becomes zero and then rescaled so that the variance becomes one. This process is known as centering and rescaling. The formula</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\n\\hat{x}_i = \\frac{x_i - \\text{mean}}{\\sqrt{\\text{var}}}\n&quot;,&quot;id&quot;:&quot;VMGPNXQXXY&quot;}" data-component-name="LatexBlockToDOM"></div><p>means that for each output value, you subtract the average of all outputs (centering), and then divide by their standard deviation (rescaling). 
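</p><p>The same arithmetic can be checked in a few lines of PyTorch. This is a minimal sketch, not code from the original listings: the variable names are illustrative, and <strong>unbiased=False</strong> is chosen so that the variance divides by 6, as in the formula above. Because the six inputs are rounded, the exact results (a mean of about 0.655 and a variance of about 0.061) will not match the figure's rounded values digit for digit.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">import torch

# the six activations from the example above
x = torch.tensor([0.78, 1.05, 0.44, 0.73, 0.65, 0.28])

mean = x.mean()
var = x.var(unbiased=False)            # population variance: divide by 6
x_hat = (x - mean) / torch.sqrt(var)   # center, then rescale

print("mean:", mean.item())
print("var:", var.item())
print("x_hat:", x_hat)
print("x_hat mean:", x_hat.mean().item())                # close to 0
print("x_hat var:", x_hat.var(unbiased=False).item())    # close to 1</code></pre></div><p>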
This transformation ensures that the set of normalized activations has a mean of zero and unit variance, making them easier for the next layer of a neural network to process.</p><p>If you perform this computation for all six values, you obtain approximately</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;0.44, \\quad 1.48, \\quad -0.85, \\quad 0.28, \\quad -0.01, \\quad -1.33\n&quot;,&quot;id&quot;:&quot;RTNNEGUJYE&quot;}" data-component-name="LatexBlockToDOM"></div><p>These are the normalized outputs shown in the top row of the figure. By construction, the mean of these normalized activations is zero and their variance is one, which is why the left side of the illustration reports mean equal to 0.0 and variance equal to 1.00. This example demonstrates how layer normalization transforms a set of layer outputs with mean 0.6 and variance 0.07 into a standardized set of activations that are better behaved numerically, making gradient-based training more stable.</p><p>In practice, layer normalization is usually followed by a learned scale and shift. After computing the normalized activations <em>x&#770;<sub>i</sub></em>, the layer produces new activations</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;y_i = \\gamma \\hat{x}_i + \\beta\n\n&quot;,&quot;id&quot;:&quot;JXNDJUASKL&quot;}" data-component-name="LatexBlockToDOM"></div><p>Here gamma (&#947;) and beta (&#946;) are trainable parameter vectors with the same size as the activation vector, applied element-wise. They let the model undo or modify the standardization whenever that helps performance. In other words, the network can learn whatever output distribution it prefers, while still enjoying the stabilizing effect of normalization during training.</p><p>The way mean and variance are computed is what distinguishes layer normalization from batch-based methods. For layer normalization we normalize across the features of a single example, not across different examples in a mini-batch. 
Each token representation or layer output vector is treated independently and normalized across its dimensions. This makes the procedure independent of batch size and very convenient for transformer models that see variable length sequences and perform autoregressive decoding one token at a time.</p><p>Inside a transformer block, layer normalization is applied to the token representations before they enter the multi head attention module. This keeps the scale of the inputs to attention under control and stabilizes its gradients. After attention and its residual shortcut, layer normalization is applied again before the feed forward network so that this network also sees well behaved activations. Many architectures further normalize the final outputs of each block, and often the outputs of the last block before the language modeling head that predicts the next token.</p><h2>1.18 FeedForward Network</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Bx2T!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F300dd920-eb07-464b-aef6-2a8bff075467_408x735.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Bx2T!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F300dd920-eb07-464b-aef6-2a8bff075467_408x735.png 424w, https://substackcdn.com/image/fetch/$s_!Bx2T!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F300dd920-eb07-464b-aef6-2a8bff075467_408x735.png 848w, 
https://substackcdn.com/image/fetch/$s_!Bx2T!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F300dd920-eb07-464b-aef6-2a8bff075467_408x735.png 1272w, https://substackcdn.com/image/fetch/$s_!Bx2T!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F300dd920-eb07-464b-aef6-2a8bff075467_408x735.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Bx2T!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F300dd920-eb07-464b-aef6-2a8bff075467_408x735.png" width="408" height="735" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/300dd920-eb07-464b-aef6-2a8bff075467_408x735.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:735,&quot;width&quot;:408,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:18246,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F300dd920-eb07-464b-aef6-2a8bff075467_408x735.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Bx2T!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F300dd920-eb07-464b-aef6-2a8bff075467_408x735.png 424w, https://substackcdn.com/image/fetch/$s_!Bx2T!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F300dd920-eb07-464b-aef6-2a8bff075467_408x735.png 
848w, https://substackcdn.com/image/fetch/$s_!Bx2T!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F300dd920-eb07-464b-aef6-2a8bff075467_408x735.png 1272w, https://substackcdn.com/image/fetch/$s_!Bx2T!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F300dd920-eb07-464b-aef6-2a8bff075467_408x735.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em><strong>Figure 1.119:</strong> The feed-forward network sits between dropout and layer normalization inside each transformer 
block, processing each token representation independently</em></p><p>The second major component inside a transformer block, after the multi head attention, is the feed forward network. In the block diagram this appears as the Feed forward NN box, sitting between dropout and layer normalization. Conceptually it is an ordinary two layer neural network with an activation function in the middle, applied independently to every token representation. The key is that the same small network, with the same weights, is reused for all tokens and all examples in the batch.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Cy6Z!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff72ea384-3507-4c36-a22a-5c27b942f9fb_1131x594.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Cy6Z!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff72ea384-3507-4c36-a22a-5c27b942f9fb_1131x594.png 424w, https://substackcdn.com/image/fetch/$s_!Cy6Z!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff72ea384-3507-4c36-a22a-5c27b942f9fb_1131x594.png 848w, https://substackcdn.com/image/fetch/$s_!Cy6Z!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff72ea384-3507-4c36-a22a-5c27b942f9fb_1131x594.png 1272w, https://substackcdn.com/image/fetch/$s_!Cy6Z!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff72ea384-3507-4c36-a22a-5c27b942f9fb_1131x594.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Cy6Z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff72ea384-3507-4c36-a22a-5c27b942f9fb_1131x594.png" width="1131" height="594" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f72ea384-3507-4c36-a22a-5c27b942f9fb_1131x594.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:594,&quot;width&quot;:1131,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:46086,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff72ea384-3507-4c36-a22a-5c27b942f9fb_1131x594.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Cy6Z!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff72ea384-3507-4c36-a22a-5c27b942f9fb_1131x594.png 424w, https://substackcdn.com/image/fetch/$s_!Cy6Z!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff72ea384-3507-4c36-a22a-5c27b942f9fb_1131x594.png 848w, https://substackcdn.com/image/fetch/$s_!Cy6Z!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff72ea384-3507-4c36-a22a-5c27b942f9fb_1131x594.png 1272w, https://substackcdn.com/image/fetch/$s_!Cy6Z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff72ea384-3507-4c36-a22a-5c27b942f9fb_1131x594.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em><strong>Figure 1.120:</strong> The feed-forward network processes each token vector independently using shared weights. For an input tensor of shape (batch, tokens, 768), each 768-dimensional vector is processed in parallel.</em></p><p>Consider an input tensor of shape (2, 3, 768). The first entry is the batch size, the second is the number of tokens in the context window, and the third is the embedding dimension for each token. 
The important point is that the feed forward network is applied to every token vector of length 768 independently but it uses the same weights for all tokens and all examples in the batch.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!V_ZW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6dff2498-7f95-448d-b828-fa75a5cfe079_1642x1038.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!V_ZW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6dff2498-7f95-448d-b828-fa75a5cfe079_1642x1038.png 424w, https://substackcdn.com/image/fetch/$s_!V_ZW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6dff2498-7f95-448d-b828-fa75a5cfe079_1642x1038.png 848w, https://substackcdn.com/image/fetch/$s_!V_ZW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6dff2498-7f95-448d-b828-fa75a5cfe079_1642x1038.png 1272w, https://substackcdn.com/image/fetch/$s_!V_ZW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6dff2498-7f95-448d-b828-fa75a5cfe079_1642x1038.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!V_ZW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6dff2498-7f95-448d-b828-fa75a5cfe079_1642x1038.png" width="1456" height="920" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6dff2498-7f95-448d-b828-fa75a5cfe079_1642x1038.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:920,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:399059,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6dff2498-7f95-448d-b828-fa75a5cfe079_1642x1038.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!V_ZW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6dff2498-7f95-448d-b828-fa75a5cfe079_1642x1038.png 424w, https://substackcdn.com/image/fetch/$s_!V_ZW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6dff2498-7f95-448d-b828-fa75a5cfe079_1642x1038.png 848w, https://substackcdn.com/image/fetch/$s_!V_ZW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6dff2498-7f95-448d-b828-fa75a5cfe079_1642x1038.png 1272w, https://substackcdn.com/image/fetch/$s_!V_ZW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6dff2498-7f95-448d-b828-fa75a5cfe079_1642x1038.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em><strong>Figure 1.121:</strong> Internal structure of the feed-forward network: a linear expansion from 768 to 3072 dimensions, followed by GELU activation, then linear contraction back to 768 dimensions.</em></p><p>The internal structure of this network is shown in the detailed diagrams. It consists of two linear layers with an activation function in between. The first linear layer performs an expansion. It takes each 768-dimensional input vector and projects it into a much larger space with 4 &#215; 768 = 3072 hidden units. In matrix form this is a multiplication by a 768 by 3072 weight matrix plus a bias term. Intuitively, this expansion gives the model more capacity to construct rich intermediate features from each token representation before compressing them again. 
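</p><p>To get a feel for the size of this expansion and the later contraction, the short calculation below counts the weights and biases in the two linear layers. It assumes the 768 and 3072 dimensions used in this example and one bias per output unit; the variable names are illustrative only.</p>

```python
d_model = 768            # embedding dimension per token
d_hidden = 4 * d_model   # expanded hidden size, 3072

# Expansion layer: a 768 x 3072 weight matrix plus 3072 biases.
expand_params = d_model * d_hidden + d_hidden
# Contraction layer: a 3072 x 768 weight matrix plus 768 biases.
contract_params = d_hidden * d_model + d_model

total_params = expand_params + contract_params
print(d_hidden, expand_params, contract_params, total_params)
```

<p>Under these assumptions the two layers together hold roughly 4.7 million parameters, all shared across every token position.</p><p>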
Because every input dimension interacts with every hidden unit, this single layer already introduces dense mixing among the 768 input features.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bGXQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4435de6-c71c-4252-ad33-6bdde6f02e9f_1449x789.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bGXQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4435de6-c71c-4252-ad33-6bdde6f02e9f_1449x789.png 424w, https://substackcdn.com/image/fetch/$s_!bGXQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4435de6-c71c-4252-ad33-6bdde6f02e9f_1449x789.png 848w, https://substackcdn.com/image/fetch/$s_!bGXQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4435de6-c71c-4252-ad33-6bdde6f02e9f_1449x789.png 1272w, https://substackcdn.com/image/fetch/$s_!bGXQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4435de6-c71c-4252-ad33-6bdde6f02e9f_1449x789.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bGXQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4435de6-c71c-4252-ad33-6bdde6f02e9f_1449x789.png" width="1449" height="789" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f4435de6-c71c-4252-ad33-6bdde6f02e9f_1449x789.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:789,&quot;width&quot;:1449,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:36798,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4435de6-c71c-4252-ad33-6bdde6f02e9f_1449x789.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!bGXQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4435de6-c71c-4252-ad33-6bdde6f02e9f_1449x789.png 424w, https://substackcdn.com/image/fetch/$s_!bGXQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4435de6-c71c-4252-ad33-6bdde6f02e9f_1449x789.png 848w, https://substackcdn.com/image/fetch/$s_!bGXQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4435de6-c71c-4252-ad33-6bdde6f02e9f_1449x789.png 1272w, https://substackcdn.com/image/fetch/$s_!bGXQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4435de6-c71c-4252-ad33-6bdde6f02e9f_1449x789.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em><strong>Figure 1.122:</strong> Comparison of ReLU and GELU activation functions. GELU is smooth everywhere, including at zero, and preserves small negative activations rather than collapsing them to exactly zero.</em></p><p>After the expansion, the output is passed through a nonlinearity. Early transformer implementations used the ReLU activation function. ReLU simply returns <em>x</em> when <em>x</em> is positive and returns zero for negative <em>x</em>. Graphically, this is a straight line through the origin for positive inputs and a flat line at zero for negative inputs. ReLU is easy to implement and works well in many convolutional and fully connected networks, but it has two important drawbacks in this setting. First, every negative input is collapsed to exactly zero, so any information stored in the magnitude of negative activations is lost. 
If a unit's input is negative for most examples, both its output and its gradient are zero, so the unit effectively stops learning, a problem often described informally as dead neurons. Second, the ReLU curve has a sharp corner at zero and is not differentiable there. In practice we can still compute subgradients and train, but the function is not smooth.</p><p>Because of these issues, many transformer language models switched to the GELU activation. The GELU curve, shown next to ReLU in the figure, is smooth everywhere. For large positive inputs it behaves similarly to ReLU and returns values close to the identity. For large negative inputs it sends activations toward zero, so very negative units are still turned off. The important difference appears around zero. Instead of cutting everything below zero to exactly zero, GELU tapers smoothly and maps small negative inputs to small negative outputs. This has two consequences. First, the function is differentiable everywhere, including at zero, which makes optimization smoother. Second, the network does not discard all information carried by small negative activations. In the region near zero the model can still use their sign and magnitude to encode subtle distinctions. Together with layer normalization, which keeps activations in a moderate range and prevents very large negative or positive values, this leads to more stable training and often slightly better performance in practice.</p><p>The output tensor of the feed-forward network has exactly the same shape as its input, namely (batch size, context length, embedding dimension). This design choice is deliberate. Because the main hidden dimension remains constant, we can add a residual connection around the feed-forward sublayer and stack an arbitrary number of transformer blocks on top of each other without reshaping tensors. 
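</p><p>The expansion, GELU activation, and contraction described above can be sketched in plain Python as follows. This is a toy version under stated assumptions: the dimensions are shrunk from 768 and 3072 to 4 and 16 so that it runs instantly, the weights are random rather than trained, and the names <code>gelu</code>, <code>linear</code>, and <code>feed_forward</code> are illustrative. GELU is written in its exact form, x times the standard normal CDF.</p>

```python
import math
import random

def gelu(x):
    # Exact GELU: x * Phi(x), with Phi the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def relu(x):
    return max(0.0, x)

def linear(vec, weights, bias):
    # weights has one row per output unit; each row has len(vec) entries.
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def feed_forward(token, w1, b1, w2, b2):
    hidden = [gelu(h) for h in linear(token, w1, b1)]   # expand, then activate
    return linear(hidden, w2, b2)                        # contract

rng = random.Random(0)
d_model, d_hidden = 4, 16          # toy stand-ins for 768 and 3072
w1, b1 = random_matrix(d_hidden, d_model, rng), [0.0] * d_hidden
w2, b2 = random_matrix(d_model, d_hidden, rng), [0.0] * d_model

# Input of shape (batch=2, tokens=3, d_model): the same weights serve every token.
batch = [[[rng.gauss(0.0, 1.0) for _ in range(d_model)] for _ in range(3)]
         for _ in range(2)]
output = [[feed_forward(tok, w1, b1, w2, b2) for tok in seq] for seq in batch]

print(len(output), len(output[0]), len(output[0][0]))   # same shape as the input
print(relu(-0.5), round(gelu(-0.5), 4))                 # ReLU zeroes it, GELU does not
```

<p>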
It becomes straightforward to plug in more blocks, remove blocks, or reuse the same block structure in very deep models, since every block expects and returns vectors of size 768 in this example.</p><p>It is also helpful to connect this back to the question of how many tokens the network sees at once. While the feed-forward network operates on each token vector independently, each of those vectors already encodes information about the entire context window thanks to the preceding multi-head attention. In our example with context length 3, the tensor has shape (2, 3, 768), and each of the three vectors for a given sequence summarizes that position in the context of the other positions. The feed-forward network then applies a rich nonlinear transformation to each of these contextual vectors. When generating text autoregressively, the model still predicts one next token at a time, but this prediction is based on the full context representation, which has been refined by both attention and the feed-forward expansion, activation, and contraction.</p><p>In summary, the position-wise feed-forward network in a transformer block is a powerful per-token multilayer perceptron. It expands the embedding dimension to a higher-dimensional space, applies a smooth and information-preserving GELU nonlinearity, and contracts the representation back to the original size. This structure provides most of the depth and nonlinearity in the model, while attention handles interaction across positions. Together they give transformers the capacity to learn complex patterns in sequences of tokens.</p><h2>1.19 Shortcut connections</h2><p>Layer normalization helps stabilize the scale of activations, but on its own it is not enough to reliably train very deep transformer stacks. In the previous section we saw that each block already contains a powerful feed-forward network that expands the embedding dimension, applies a GELU nonlinearity, and then contracts it again. 
If we simply stacked many of these attention plus feed-forward blocks, gradients flowing backward through all of those nonlinear layers would quickly become very small. To keep such deep transformers trainable, we rely on another key idea: shortcut connections, also called residual connections.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Zqu9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d1217c6-45c7-4d46-b4d8-90c15da48dd0_1587x669.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Zqu9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d1217c6-45c7-4d46-b4d8-90c15da48dd0_1587x669.png 424w, https://substackcdn.com/image/fetch/$s_!Zqu9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d1217c6-45c7-4d46-b4d8-90c15da48dd0_1587x669.png 848w, https://substackcdn.com/image/fetch/$s_!Zqu9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d1217c6-45c7-4d46-b4d8-90c15da48dd0_1587x669.png 1272w, https://substackcdn.com/image/fetch/$s_!Zqu9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d1217c6-45c7-4d46-b4d8-90c15da48dd0_1587x669.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Zqu9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d1217c6-45c7-4d46-b4d8-90c15da48dd0_1587x669.png" width="1456" height="614" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7d1217c6-45c7-4d46-b4d8-90c15da48dd0_1587x669.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:614,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:61681,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d1217c6-45c7-4d46-b4d8-90c15da48dd0_1587x669.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Zqu9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d1217c6-45c7-4d46-b4d8-90c15da48dd0_1587x669.png 424w, https://substackcdn.com/image/fetch/$s_!Zqu9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d1217c6-45c7-4d46-b4d8-90c15da48dd0_1587x669.png 848w, https://substackcdn.com/image/fetch/$s_!Zqu9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d1217c6-45c7-4d46-b4d8-90c15da48dd0_1587x669.png 1272w, https://substackcdn.com/image/fetch/$s_!Zqu9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d1217c6-45c7-4d46-b4d8-90c15da48dd0_1587x669.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em><strong>Figure 1.123:</strong> Effect of shortcut connections on gradients. Left: without shortcuts, gradients vanish (0.00003, 0.00001). Right: with shortcuts, gradients remain large (0.45, 0.52), enabling effective learning in earlier layers.</em></p><p>A shortcut connection simply adds the input of a block to its output, giving the signal an extra path through the model that bypasses one or more nonlinear layers. This extra path turns out to be very effective at keeping gradients from disappearing during backpropagation.</p><p>You can see the effect in the two-layer illustration. On the left we have a small network that takes an input vector such as [1.0, 0.0, 0.0, -1.0], applies a linear layer and GELU activation twice, and then propagates gradients from the output back to the earlier layers. 
Without shortcut connections, the gradient at layer 2 might be around 0.00003 and at layer 1 around 0.00001. These tiny values are an example of the vanishing gradient problem: the early layers barely receive any learning signal.</p><p>Now compare this to the version on the right, which adds residual connections around each linear plus GELU block. The same input is fed forward, but now the input of layer 1 is added to its output, and the output of layer 1 is added to the output of layer 2. With these shortcut paths in place, the gradients during backpropagation are much larger for the same network depth, with values like 0.45 for layer 1 and 0.52 for layer 2 in the illustration. Bringing the inputs forward through shortcut links preserves stronger gradients in the earlier layers, which makes learning much more effective.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!G1lq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F322e3622-8aa4-4ee0-80fd-56bce225211a_1173x531.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!G1lq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F322e3622-8aa4-4ee0-80fd-56bce225211a_1173x531.png 424w, https://substackcdn.com/image/fetch/$s_!G1lq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F322e3622-8aa4-4ee0-80fd-56bce225211a_1173x531.png 848w, https://substackcdn.com/image/fetch/$s_!G1lq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F322e3622-8aa4-4ee0-80fd-56bce225211a_1173x531.png 1272w, 
https://substackcdn.com/image/fetch/$s_!G1lq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F322e3622-8aa4-4ee0-80fd-56bce225211a_1173x531.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!G1lq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F322e3622-8aa4-4ee0-80fd-56bce225211a_1173x531.png" width="1173" height="531" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/322e3622-8aa4-4ee0-80fd-56bce225211a_1173x531.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:531,&quot;width&quot;:1173,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:114568,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F322e3622-8aa4-4ee0-80fd-56bce225211a_1173x531.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!G1lq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F322e3622-8aa4-4ee0-80fd-56bce225211a_1173x531.png 424w, https://substackcdn.com/image/fetch/$s_!G1lq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F322e3622-8aa4-4ee0-80fd-56bce225211a_1173x531.png 848w, 
https://substackcdn.com/image/fetch/$s_!G1lq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F322e3622-8aa4-4ee0-80fd-56bce225211a_1173x531.png 1272w, https://substackcdn.com/image/fetch/$s_!G1lq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F322e3622-8aa4-4ee0-80fd-56bce225211a_1173x531.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em><strong>Figure 1.124:</strong> </em>Loss landscape comparison: without skip connections the surface is jagged with many sharp 
peaks (left), while with skip connections the landscape becomes smooth with broad valleys (right).</p><p>There is also an optimization perspective that connects to the loss landscape illustration. If you visualize the loss of a deep network without shortcut connections as a function of its parameters, the surface often looks jagged, with many sharp peaks and narrow valleys. This makes gradient-based optimization difficult and can trap training in poor local minima. When you add skip connections, the same network tends to exhibit a much smoother loss surface with broader valleys and fewer sharp spikes. A smoother landscape leads to more predictable gradients and makes it easier for standard optimizers such as Adam to find good solutions.</p><p>Transformers exploit shortcut connections throughout the architecture. Within each transformer block, the input token representations are passed into a sublayer such as multi-head attention or the feed-forward network, and the sublayer output is added back to the original input. During backpropagation, gradients can then flow both through the sublayer and directly along the identity shortcut. This combination of residual paths and layer normalization is what allows transformers to stack many attention and feed-forward blocks while still training reliably on large datasets.</p><h2>1.20 Why Transformers Scale Better Than RNNs and CNNs</h2><p>Transformers are fundamentally designed for scalability, both in terms of model size and training efficiency. Unlike recurrent neural networks, which process tokens sequentially and therefore suffer from limited parallelism, transformers operate on entire sequences at once. Self-attention allows every token to directly interact with every other token in a single layer, removing the need to propagate information step by step through time. This parallel structure maps naturally to modern hardware such as GPUs and TPUs, enabling efficient utilization of large compute budgets. 
Compared to convolutional neural networks, which rely on fixed receptive fields and require deep stacks to capture long-range dependencies, transformers model global context explicitly from the start. As models grow larger, this ability to combine global context with parallel computation leads to predictable improvements in performance, making transformers well suited for large-scale training regimes.</p><p>Another key factor behind transformer scalability is architectural uniformity. The same transformer block can be stacked repeatedly with minimal modification, allowing depth and width to be increased systematically. Residual connections and normalization stabilize training even when hundreds of layers are used, while attention weights adapt dynamically to different inputs rather than being hard-coded as in convolutions. This combination results in smooth scaling behavior where increasing parameters, data, and compute leads to consistent gains. In contrast, RNNs often struggle with vanishing gradients at scale, and CNNs require task-specific architectural tuning. Transformers therefore provide a general-purpose backbone that benefits directly from scale without extensive redesign.</p><h2>1.21 Pretraining, Fine Tuning, and Transfer Learning in Transformers</h2><p>Pretraining is the process that gives transformers their general-purpose capabilities. In this stage, a model is trained on large amounts of unlabeled data using a self-supervised objective such as next-token prediction. The goal is not to solve a specific task, but to learn broad statistical structure in language or other modalities. During pretraining, the transformer learns representations that capture syntax, semantics, and long-range dependencies. These representations are distributed across layers and attention heads, forming a reusable foundation that can support many downstream tasks. 
Because the objective is simple and data is abundant, pretraining scales effectively with model size and dataset size.</p><p>Fine-tuning adapts a pretrained transformer to a specific task or domain. Instead of training from scratch, the pretrained weights are used as initialization, and training continues on a smaller labeled dataset. This process reshapes the learned representations toward task-relevant patterns while preserving general knowledge acquired during pretraining. Transfer learning emerges naturally from this setup, since the same pretrained model can be reused across many tasks such as classification, generation, or question answering. In practice, this dramatically reduces data requirements and training time compared to building separate models for each task. It also enables rapid experimentation, since changes in objectives or datasets do not require redesigning the entire architecture.</p><h2>1.22 Limitations and Challenges of Transformers</h2><p>Despite their success, transformers are not without limitations. The most significant challenge lies in the quadratic cost of self-attention with respect to sequence length. As input sequences grow longer, memory usage and computation increase rapidly, placing practical limits on context size. While various approximations and sparse attention mechanisms exist, they often introduce trade-offs between efficiency and modeling fidelity. This makes long-context modeling an active area of research rather than a solved problem.</p><p>Transformers also require substantial data and compute to reach their full potential. Large models trained on small or noisy datasets can overfit or learn spurious correlations, leading to unreliable behavior. In addition, pretrained transformers inherit biases present in their training data, which can surface during downstream use. From an engineering perspective, training and deploying large transformer models introduces challenges related to cost, latency, and energy consumption. 
These constraints mean that while transformers scale well in theory, practical deployments must balance model size with efficiency, reliability, and responsible use.</p><h2>1.23 Hands On Coding a Miniature Transformer for Sequence Classification</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4v7f!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5d2e220-3a50-4b51-91b8-cb924aa0f0a6_909x657.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4v7f!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5d2e220-3a50-4b51-91b8-cb924aa0f0a6_909x657.png 424w, https://substackcdn.com/image/fetch/$s_!4v7f!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5d2e220-3a50-4b51-91b8-cb924aa0f0a6_909x657.png 848w, https://substackcdn.com/image/fetch/$s_!4v7f!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5d2e220-3a50-4b51-91b8-cb924aa0f0a6_909x657.png 1272w, https://substackcdn.com/image/fetch/$s_!4v7f!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5d2e220-3a50-4b51-91b8-cb924aa0f0a6_909x657.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4v7f!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5d2e220-3a50-4b51-91b8-cb924aa0f0a6_909x657.png" width="909" height="657" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c5d2e220-3a50-4b51-91b8-cb924aa0f0a6_909x657.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:657,&quot;width&quot;:909,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:23954,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5d2e220-3a50-4b51-91b8-cb924aa0f0a6_909x657.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!4v7f!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5d2e220-3a50-4b51-91b8-cb924aa0f0a6_909x657.png 424w, https://substackcdn.com/image/fetch/$s_!4v7f!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5d2e220-3a50-4b51-91b8-cb924aa0f0a6_909x657.png 848w, https://substackcdn.com/image/fetch/$s_!4v7f!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5d2e220-3a50-4b51-91b8-cb924aa0f0a6_909x657.png 1272w, https://substackcdn.com/image/fetch/$s_!4v7f!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc5d2e220-3a50-4b51-91b8-cb924aa0f0a6_909x657.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em><strong>Figure 1.245:</strong> Sequence classification with BERT, where the input sentence is encoded, summarized by the classification token, and mapped by a classifier to a sentiment label.</em></p><p><strong>The Notebook code is available here</strong></p><p><a href="https://github.com/VizuaraAI/Transformers-for-vision-BOOK">https://github.com/VizuaraAI/Transformers-for-vision-BOOK</a></p><p>So far, we have discussed the transformer architecture and its core components at a conceptual level. To make these ideas concrete, we now move from theory to practice by implementing a small transformer model from scratch. 
The goal of this section is not to recreate a full scale BERT model, but to clearly understand how its fundamental design translates into working code.</p><p>In this hands on walkthrough, we build a miniature transformer for sequence classification using the IMDB movie review dataset. This dataset consists of textual reviews labeled with positive or negative sentiment, making it a practical and intuitive example for understanding how transformers process and classify entire sequences of text. Sequence classification highlights one of the key strengths of transformer encoders: their ability to capture bidirectional context across a complete input.</p><p>We will construct the model step by step, beginning with data loading and tokenization, then implementing embeddings, self attention, and transformer blocks, and finally adding a simple classification head. Each component is introduced explicitly so the flow of information through the model remains transparent. By the end of this section, you will have a working transformer classifier trained on the IMDB dataset and a clear understanding of how BERT style sequence classification models are built from scratch.</p><h4><strong>Listing 1.20: Installing the Required Dependencies</strong></h4><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;9e4a2066-ce1e-4a06-9ab8-1141474d2b21&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">!pip install torch datasets tiktoken tqdm scikit-learn</code></pre></div><h4><strong>Listing 1.21: Importing All Required Python Modules</strong></h4><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;b21fd2f0-8738-4d00-bba0-0eb4470f63df&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">import math
import os
import torch
import torch.nn as nn
import torch.nn.functional as F

from torch.utils.data import Dataset, DataLoader
from datasets import load_dataset
from tqdm import tqdm
import tiktoken

from sklearn.metrics import classification_report, confusion_matrix</code></pre></div><h4><strong>Listing 1.22: Selecting the Compute Device</strong></h4><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;283bf45b-d5d9-430b-ba1b-dd54ffb2837a&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)
</code></pre></div><p><strong>Output</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;4dbfa48d-87a3-4f88-a075-a5b6c93d098f&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">cuda</code></pre></div><p>In <strong>Listing 1.20</strong>, we begin by installing the core dependencies required to implement a transformer-based sequence classification model. PyTorch provides the foundational tensor operations and neural network abstractions that will be used to define embeddings, attention mechanisms, and training loops. The datasets library allows us to easily load the IMDB movie review dataset, while tiktoken supplies a modern subword tokenizer suitable for transformer models. The tqdm library is included to visualize training progress, and scikit-learn provides standard evaluation utilities that will later help us interpret classification performance.</p><p>With the environment prepared, <strong>Listing 1.21</strong> imports all required Python modules. In addition to standard libraries such as math and os, we import PyTorch&#8217;s neural network components, including layers, activation functions, and data loading utilities. The Dataset and DataLoader classes define how text samples are structured and batched during training. The load_dataset function simplifies dataset retrieval, and tqdm enables progress tracking during training iterations. Finally, evaluation tools such as classification reports and confusion matrices are imported to support quantitative analysis of model predictions after training.</p><p>Finally, in <strong>Listing 1.22</strong>, we select the compute device on which the model will run. The code checks whether a CUDA-enabled GPU is available and assigns it as the execution device when possible, otherwise defaulting to the CPU, in which case the printed value is cpu rather than cuda. 
This conditional setup allows the same implementation to scale from local experimentation to accelerated training environments without modification. With the dependencies installed, modules imported, and the compute device configured, we have established a solid foundation for building and training a transformer model from scratch in the sections that follow.</p><div><hr></div><p>Before building the transformer model, we first need a dataset that clearly illustrates the sequence classification task. In this section, we use the IMDB movie review dataset, a widely used benchmark for sentiment analysis. The dataset contains 50,000 movie reviews split evenly into training and test sets. Each review is labeled with one of two classes: positive sentiment or negative sentiment. The text samples vary in length and style, ranging from short opinions to long, detailed critiques, which makes the dataset well suited for evaluating a model&#8217;s ability to understand full sequences of natural language. A typical sample consists of a review such as</p><div class="pullquote"><p> &#8220;The movie was slow, but the performances were outstanding,&#8221; paired with a binary label indicating its sentiment.</p></div><h4><strong>Listing 1.23: Loading the IMDb Dataset</strong></h4><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;08b237ab-c03a-4c32-93c5-86331306ee81&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">dataset = load_dataset("imdb")

train_texts = dataset["train"]["text"]
train_labels = dataset["train"]["label"]

test_texts = dataset["test"]["text"]
test_labels = dataset["test"]["label"]
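
# Optional sanity check (an illustrative addition, not part of the original
# listing): count how many reviews carry each label. label_counts is a
# hypothetical helper name introduced only for this sketch.
from collections import Counter

def label_counts(labels):
    """Return a dict mapping each label value to its number of occurrences."""
    return dict(Counter(labels))

# Example use: label_counts(train_labels) should list exactly two classes,
# 0 (negative) and 1 (positive).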

len(train_texts), len(test_texts)</code></pre></div><p><strong>Output</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;4a35d76c-7f78-48ee-9cc7-f07421fbd81a&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">(25000, 25000)</code></pre></div><p>In <strong>Listing 1.23</strong>, we load the IMDB dataset using the datasets library, which automatically downloads and prepares the data in a standardized format. The dataset is split into training and test partitions, each containing 25,000 samples. From each split, we extract the raw review texts and their corresponding sentiment labels. The labels are encoded as integers, where 0 represents negative sentiment and 1 represents positive sentiment. At this stage, the data remains in raw text form, which allows us to apply custom tokenization and preprocessing steps in later sections.</p><h4><strong>Listing 1.24: Initializing a Byte Pair Encoding Tokenizer</strong></h4><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;4d89a49f-9a69-450a-9e12-b0da64a11226&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">tokenizer = tiktoken.get_encoding("gpt2")
base_vocab_size = tokenizer.n_vocab
base_vocab_size</code></pre></div><p><strong>Output</strong></p><pre><code>50257</code></pre><p>In <strong>Listing 1.24</strong>, we initialize a Byte Pair Encoding tokenizer using the GPT-2 vocabulary. As discussed earlier in the tokenization section, the GPT-2 tokenizer comes with a fixed base vocabulary of 50,257 tokens, which includes common words, subwords, punctuation, and special byte-level encodings. We reuse this tokenizer to avoid designing a vocabulary from scratch and to ensure efficient coverage of the diverse language found in movie reviews. Here, we record the base vocabulary size because we will extend it in the next step.</p><h3>Preparing Text Inputs and Batches for Transformers</h3><p>Before introducing special tokens and moving into the BERT-specific input construction, we first need to understand how raw text is transformed into batches that a transformer can process. Transformers do not operate directly on free-form text. Instead, text must pass through a sequence of structured transformations that determine what the model sees as input and what it is trained to predict.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GCPJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd6896e5-e781-471b-a503-ff9083cfeb05_1332x924.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GCPJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd6896e5-e781-471b-a503-ff9083cfeb05_1332x924.png 424w, https://substackcdn.com/image/fetch/$s_!GCPJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd6896e5-e781-471b-a503-ff9083cfeb05_1332x924.png 848w, 
https://substackcdn.com/image/fetch/$s_!GCPJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd6896e5-e781-471b-a503-ff9083cfeb05_1332x924.png 1272w, https://substackcdn.com/image/fetch/$s_!GCPJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd6896e5-e781-471b-a503-ff9083cfeb05_1332x924.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GCPJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd6896e5-e781-471b-a503-ff9083cfeb05_1332x924.png" width="1332" height="924" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cd6896e5-e781-471b-a503-ff9083cfeb05_1332x924.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:924,&quot;width&quot;:1332,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:89650,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd6896e5-e781-471b-a503-ff9083cfeb05_1332x924.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!GCPJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd6896e5-e781-471b-a503-ff9083cfeb05_1332x924.png 424w, 
https://substackcdn.com/image/fetch/$s_!GCPJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd6896e5-e781-471b-a503-ff9083cfeb05_1332x924.png 848w, https://substackcdn.com/image/fetch/$s_!GCPJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd6896e5-e781-471b-a503-ff9083cfeb05_1332x924.png 1272w, https://substackcdn.com/image/fetch/$s_!GCPJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd6896e5-e781-471b-a503-ff9083cfeb05_1332x924.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p><em><strong>Figure 1.246:</strong> Preparing text for a transformer: raw text is tokenized, split into context windows, and arranged into input&#8211;output batch pairs where each target is the next token.</em></p><p>We begin with a continuous piece of text, which is first broken into tokens. In the example shown in Figure 1.246, each word in the paragraph is mapped to a corresponding token id. At this stage, the text is still treated as one long sequence. Because transformers have a fixed context size, the model cannot process the entire sequence at once. Instead, a context window is chosen, which defines how many consecutive tokens the model can attend to in a single forward pass.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DYYx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06a6e926-f278-46d2-9466-d4d0ee5b1917_1590x471.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DYYx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06a6e926-f278-46d2-9466-d4d0ee5b1917_1590x471.png 424w, https://substackcdn.com/image/fetch/$s_!DYYx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06a6e926-f278-46d2-9466-d4d0ee5b1917_1590x471.png 848w, https://substackcdn.com/image/fetch/$s_!DYYx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06a6e926-f278-46d2-9466-d4d0ee5b1917_1590x471.png 1272w, 
https://substackcdn.com/image/fetch/$s_!DYYx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06a6e926-f278-46d2-9466-d4d0ee5b1917_1590x471.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DYYx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06a6e926-f278-46d2-9466-d4d0ee5b1917_1590x471.png" width="1456" height="431" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/06a6e926-f278-46d2-9466-d4d0ee5b1917_1590x471.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:431,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:58163,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06a6e926-f278-46d2-9466-d4d0ee5b1917_1590x471.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DYYx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06a6e926-f278-46d2-9466-d4d0ee5b1917_1590x471.png 424w, https://substackcdn.com/image/fetch/$s_!DYYx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06a6e926-f278-46d2-9466-d4d0ee5b1917_1590x471.png 848w, 
https://substackcdn.com/image/fetch/$s_!DYYx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06a6e926-f278-46d2-9466-d4d0ee5b1917_1590x471.png 1272w, https://substackcdn.com/image/fetch/$s_!DYYx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F06a6e926-f278-46d2-9466-d4d0ee5b1917_1590x471.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p><em><strong>Figure 1.247: </strong>Sliding window mechanism: the token sequence is sliced into overlapping segments, each becoming one training example where the output is the input shifted by one position.</em></p><p>Using this fixed context window, the token sequence is sliced into overlapping segments. Each segment becomes one training example. As illustrated in Figure 1.247, the input batch contains sequences of tokens within the context window, while the output batch contains the same sequences shifted by one position. This shift is the learning signal. The model is trained to predict the next token at every position, which is why the input and output batches appear nearly identical except for alignment.</p><p>In Figure 1.247, each row in the input batch corresponds to a short phrase extracted from the original sentence, and each row in the output batch represents the immediate continuation of that phrase. This sliding window mechanism allows a single sentence to generate many training examples. When these examples are stacked together, they form a batch that can be processed efficiently in parallel by the transformer.</p><p>The figure also highlights an important property of next token prediction. The model does not predict only the final word of a sentence. Instead, it learns to predict the next token at every position, given the context seen so far. This is why context grows incrementally and why causal masking is required in autoregressive models. 
At each step, the model is only allowed to attend to previous tokens within the context window.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Xqze!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F611bf97a-b91b-43bb-a791-b6b459a826a1_684x285.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Xqze!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F611bf97a-b91b-43bb-a791-b6b459a826a1_684x285.png 424w, https://substackcdn.com/image/fetch/$s_!Xqze!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F611bf97a-b91b-43bb-a791-b6b459a826a1_684x285.png 848w, https://substackcdn.com/image/fetch/$s_!Xqze!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F611bf97a-b91b-43bb-a791-b6b459a826a1_684x285.png 1272w, https://substackcdn.com/image/fetch/$s_!Xqze!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F611bf97a-b91b-43bb-a791-b6b459a826a1_684x285.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Xqze!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F611bf97a-b91b-43bb-a791-b6b459a826a1_684x285.png" width="684" height="285" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/611bf97a-b91b-43bb-a791-b6b459a826a1_684x285.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:285,&quot;width&quot;:684,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:22585,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F611bf97a-b91b-43bb-a791-b6b459a826a1_684x285.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Xqze!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F611bf97a-b91b-43bb-a791-b6b459a826a1_684x285.png 424w, https://substackcdn.com/image/fetch/$s_!Xqze!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F611bf97a-b91b-43bb-a791-b6b459a826a1_684x285.png 848w, https://substackcdn.com/image/fetch/$s_!Xqze!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F611bf97a-b91b-43bb-a791-b6b459a826a1_684x285.png 1272w, https://substackcdn.com/image/fetch/$s_!Xqze!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F611bf97a-b91b-43bb-a791-b6b459a826a1_684x285.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p><em><strong>Figure 1.248: </strong>For BERT-style classification, multiple independent sentences are tokenized separately, producing sequences of different lengths.</em></p><p>Figure 1.248 marks the transition away from next token prediction and toward sequence level processing. Here, we start with multiple independent sentences rather than one long document. Each sentence is tokenized separately, producing sequences of different lengths. 
At this stage, these sequences cannot yet be processed together because transformers require all inputs in a batch to share the same length.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!D8it!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b04cb4c-ba31-49e1-ae26-2807ca4bb7b2_936x516.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!D8it!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b04cb4c-ba31-49e1-ae26-2807ca4bb7b2_936x516.png 424w, https://substackcdn.com/image/fetch/$s_!D8it!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b04cb4c-ba31-49e1-ae26-2807ca4bb7b2_936x516.png 848w, https://substackcdn.com/image/fetch/$s_!D8it!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b04cb4c-ba31-49e1-ae26-2807ca4bb7b2_936x516.png 1272w, https://substackcdn.com/image/fetch/$s_!D8it!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b04cb4c-ba31-49e1-ae26-2807ca4bb7b2_936x516.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!D8it!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b04cb4c-ba31-49e1-ae26-2807ca4bb7b2_936x516.png" width="936" height="516" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4b04cb4c-ba31-49e1-ae26-2807ca4bb7b2_936x516.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:516,&quot;width&quot;:936,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:46326,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b04cb4c-ba31-49e1-ae26-2807ca4bb7b2_936x516.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!D8it!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b04cb4c-ba31-49e1-ae26-2807ca4bb7b2_936x516.png 424w, https://substackcdn.com/image/fetch/$s_!D8it!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b04cb4c-ba31-49e1-ae26-2807ca4bb7b2_936x516.png 848w, https://substackcdn.com/image/fetch/$s_!D8it!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b04cb4c-ba31-49e1-ae26-2807ca4bb7b2_936x516.png 1272w, https://substackcdn.com/image/fetch/$s_!D8it!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b04cb4c-ba31-49e1-ae26-2807ca4bb7b2_936x516.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p><em><strong>Figure 1.249: </strong>BERT input formatting: a classification token is added at the beginning, a separator token at the end, and padding tokens fill shorter sequences to a uniform length.</em></p><p>Figure 1.249 shows how BERT resolves this issue through structured input formatting. Each sentence is augmented with a classification token at the beginning and a separator token at the end. Shorter sequences are padded so that all examples reach the same length. 
Padding tokens do not represent real text and are later ignored by attention masks, but they are essential for forming a rectangular batch tensor.</p><div><hr></div><h4><strong>Listing 1.25: Extending the Tokenizer with BERT Special Tokens</strong></h4><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;bfd67b59-122c-4d16-b603-e5c322e987da&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">PAD_ID = base_vocab_size
CLS_ID = base_vocab_size + 1
SEP_ID = base_vocab_size + 2

VOCAB_SIZE = base_vocab_size + 3
VOCAB_SIZE

</code></pre></div><p><strong>Output</strong></p><pre><code>50260</code></pre><p>The classification token introduced here plays a special role. During self attention, it attends to all other tokens in the sequence, allowing it to accumulate information from the entire sentence. By the final transformer layer, its hidden state acts as a compact summary of the sequence. This is the representation used for sequence classification tasks such as sentiment analysis.</p><p>This progression, from sliding context windows for next token prediction to padded, sentence level batches for BERT style processing, illustrates a critical shift in how transformers consume text. Autoregressive models learn by predicting future tokens, while BERT learns by encoding entire sequences at once. With this conceptual foundation established, we can now look at how the special tokens are defined in Listing 1.25 and what role each plays in the BERT implementation.</p><p>Listing 1.25 appends the three special tokens after the original GPT-2 vocabulary, so all existing token mappings are preserved while the vocabulary grows from 50,257 to 50,260 entries. This setup allows the transformer model to handle variable length inputs and perform sequence level classification in a manner consistent with BERT style architectures.</p><h4><strong>Listing 1.26: Encoding Text into Fixed-Length BERT Input Sequences</strong></h4><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;a302fb3b-1066-47de-9398-f91293b01274&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">MAX_LEN = 256
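# Optional sketch (hypothetical helper, not part of the original listing):
# head+tail truncation keeps the first and last tokens of a long review
# instead of only the first ones. The even head/tail split is an assumption.
# To try it, the slice inside encode() could be replaced with:
#     token_ids = truncate_head_tail(token_ids, MAX_LEN - 2)
def truncate_head_tail(token_ids, budget):
    excess = max(len(token_ids) - budget, 0)
    if excess == 0:
        return list(token_ids)  # already fits, nothing to drop
    head = budget // 2          # tokens kept from the start
    tail = budget - head        # tokens kept from the end
    return list(token_ids[:head]) + list(token_ids[len(token_ids) - tail:])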

def encode(text):
    token_ids = tokenizer.encode(text)
    token_ids = token_ids[:MAX_LEN - 2]

    token_ids = [CLS_ID] + token_ids + [SEP_ID]

    if len(token_ids) &lt; MAX_LEN:
        token_ids += [PAD_ID] * (MAX_LEN - len(token_ids))

    return token_ids</code></pre></div><p>This function converts raw text into a fixed length input sequence that the BERT model can process. The text is first tokenized into subword token ids using the tokenizer, and the sequence is truncated to leave space for the special tokens. A classification token is then added to the beginning of the sequence and a separator token to the end, establishing clear boundaries for the model. If the resulting sequence is shorter than the maximum length, padding tokens are appended until the desired length is reached. The output is a uniform length token sequence, which ensures that all inputs can be stacked into batches and processed efficiently by the transformer.</p><p>Note on Sequence Length: We limit the sequence length to <em>256 tokens</em> to maintain training efficiency, as the computational cost of self-attention grows quadratically with the sequence length N:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;O(N^2)\n&quot;,&quot;id&quot;:&quot;GNNIMFHBEL&quot;}" data-component-name="LatexBlockToDOM"></div><p>Doubling the length to 512 tokens, for example, would roughly quadruple the cost of computing the attention scores. The 256-token limit speeds up processing, but the function uses &#8220;head-only&#8221; truncation, which risks discarding important sentiment cues often found at the end of reviews; a &#8220;head+tail&#8221; strategy (keeping the first and last chunks) is often a more effective alternative for longer documents.</p><p>This is only a demonstration. If you want to increase accuracy, you can raise MAX_LEN to keep more of each review, but be aware that this will significantly increase training time and computational cost.</p><h4><strong>Listing 1.27: Creating Attention Masks for Padding Tokens</strong></h4><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;7c33177e-ca82-4dcf-8e04-2c1e4454062b&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">def create_attention_mask(input_ids):
    return (input_ids != PAD_ID).long()</code></pre></div><p>This function constructs an attention mask that distinguishes real tokens from padding tokens. Each position containing a padding token is marked with zero, while all other positions are marked with one. During self attention, this mask ensures that padded positions are ignored so they do not influence the model&#8217;s representations. Attention masks are essential when batching variable length sequences, as they allow the transformer to operate on padded inputs without learning from artificial padding.</p><h4><strong>Listing 1.28: Defining the IMDb Dataset Class</strong></h4><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;93ec098e-c534-44ce-8e74-48c670d55f8d&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">class IMDBDataset(Dataset):
    def __init__(self, texts, labels):
        self.texts = texts
        self.labels = labels

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        ids = torch.tensor(encode(self.texts[idx]))
        mask = create_attention_mask(ids)
        label = torch.tensor(self.labels[idx])

        return ids, mask, label</code></pre></div><p>This dataset class wraps the raw text and labels into a format compatible with PyTorch training loops. For each example, the text is encoded into a fixed length token sequence, an attention mask is generated, and the corresponding label is returned. By centralizing encoding and masking inside the dataset, the data pipeline remains clean and consistent, ensuring that every batch fed into the model follows the same preprocessing logic.</p><h4><strong>Listing 1.29: Creating DataLoaders for Training and Evaluation</strong></h4><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;1800da99-8b30-454f-998c-f7ba796f1382&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">train_ds = IMDBDataset(train_texts, train_labels)
test_ds = IMDBDataset(test_texts, test_labels)

train_loader = DataLoader(train_ds, batch_size=16, shuffle=True)
test_loader = DataLoader(test_ds, batch_size=16)</code></pre></div><p>Here, the dataset objects are passed into DataLoader instances, which handle batching and shuffling automatically. The training data is shuffled to prevent the model from learning order based artifacts, while the evaluation data is kept deterministic. DataLoaders enable efficient iteration over the dataset and ensure that inputs, masks, and labels are delivered to the model in properly structured batches.</p><h4><strong>Listing 1.30: Implementing the BERT Embedding Layer</strong></h4><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;33829fca-3b6c-4a21-a6e5-b0fea636f9bb&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">class BERTEmbedding(nn.Module):
    def __init__(self, vocab_size, embed_dim, max_len, dropout=0.1):
        super().__init__()
        self.token = nn.Embedding(vocab_size, embed_dim, padding_idx=PAD_ID)
        self.position = nn.Embedding(max_len, embed_dim)
        self.segment = nn.Embedding(2, embed_dim)
        self.norm = nn.LayerNorm(embed_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, T = x.size()
        pos = torch.arange(T, device=x.device).unsqueeze(0)
        seg = torch.zeros_like(x)

        embeddings = (
            self.token(x) +
            self.position(pos) +
            self.segment(seg)
        )

        embeddings = self.norm(embeddings)
        embeddings = self.dropout(embeddings)
        return embeddings</code></pre></div><p>This module implements the embedding layer used in BERT style models. Token embeddings encode word identity, positional embeddings capture word order, and segment embeddings provide sentence level context, even though a single segment is used here. These embeddings are summed to form the input representation for the transformer encoder. Layer normalization and dropout are applied to stabilize training and improve generalization. This embedding layer serves as the entry point where raw token ids are transformed into dense vectors suitable for self attention.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!M02g!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98bd6529-6a83-4ce2-9602-fc94d3f87953_840x381.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!M02g!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98bd6529-6a83-4ce2-9602-fc94d3f87953_840x381.png 424w, https://substackcdn.com/image/fetch/$s_!M02g!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98bd6529-6a83-4ce2-9602-fc94d3f87953_840x381.png 848w, https://substackcdn.com/image/fetch/$s_!M02g!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98bd6529-6a83-4ce2-9602-fc94d3f87953_840x381.png 1272w, https://substackcdn.com/image/fetch/$s_!M02g!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98bd6529-6a83-4ce2-9602-fc94d3f87953_840x381.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!M02g!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98bd6529-6a83-4ce2-9602-fc94d3f87953_840x381.png" width="840" height="381" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/98bd6529-6a83-4ce2-9602-fc94d3f87953_840x381.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:381,&quot;width&quot;:840,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:26109,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/175710993?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98bd6529-6a83-4ce2-9602-fc94d3f87953_840x381.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!M02g!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98bd6529-6a83-4ce2-9602-fc94d3f87953_840x381.png 424w, https://substackcdn.com/image/fetch/$s_!M02g!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98bd6529-6a83-4ce2-9602-fc94d3f87953_840x381.png 848w, https://substackcdn.com/image/fetch/$s_!M02g!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98bd6529-6a83-4ce2-9602-fc94d3f87953_840x381.png 1272w, https://substackcdn.com/image/fetch/$s_!M02g!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98bd6529-6a83-4ce2-9602-fc94d3f87953_840x381.png 1456w" sizes="100vw" 
loading="lazy"></picture></div></a></figure></div><p><em><strong>Figure 1.250:</strong> BERT input representation for single-sentence IMDb classification: token embeddings (CLS + tokens + SEP), segment embeddings, and positional embeddings are summed together. Every token receives the same segment A embedding; it is kept even for a single sentence so that the input matches the sentence A/B format used in pre-training.</em></p><h4><strong>Listing 1.31: Implementing Multi-Head Self-Attention</strong></h4><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;48dc5aa6-a939-4bf3-97ab-ee7b8e491858&quot;}" 
data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">class MultiHeadSelfAttention(nn.Module):
    def __init__(self, dim, heads, dropout=0.1):
        super().__init__()
        assert dim % heads == 0

        self.heads = heads
        self.d = dim // heads

        self.qkv = nn.Linear(dim, dim * 3)
        self.out = nn.Linear(dim, dim)

        self.attn_dropout = nn.Dropout(dropout)
        self.out_dropout = nn.Dropout(dropout)

    def forward(self, x, mask):
        B, T, C = x.shape

        q, k, v = self.qkv(x).chunk(3, dim=-1)

        q = q.view(B, T, self.heads, self.d).transpose(1, 2)
        k = k.view(B, T, self.heads, self.d).transpose(1, 2)
        v = v.view(B, T, self.heads, self.d).transpose(1, 2)

        scores = (q @ k.transpose(-2, -1)) / math.sqrt(self.d)

        mask = mask.unsqueeze(1).unsqueeze(2)
        scores = scores.masked_fill(mask == 0, -1e9)

        attn = F.softmax(scores, dim=-1)
        attn = self.attn_dropout(attn)

        out = attn @ v
        out = out.transpose(1, 2).reshape(B, T, C)

        out = self.out(out)
        out = self.out_dropout(out)
        return out</code></pre></div><p>This module implements the core self attention mechanism used inside BERT. The input embeddings are first projected into queries, keys, and values using a single linear layer and then split across multiple attention heads. Each head operates on a smaller subspace of the embedding dimension, allowing the model to attend to different relationships in parallel. Scaled dot product attention is applied within each head, and the attention mask is used to prevent padded tokens from contributing to the computation. The outputs of all heads are then concatenated, projected back to the original embedding dimension, and passed through dropout for regularization.</p><p>The key architectural difference between this BERT style attention and the attention used in GPT lies in masking. In BERT, self attention is fully bidirectional, meaning every token is allowed to attend to every other token in the sequence. The only masking applied here is to ignore padding tokens. In contrast, GPT uses causal masking to prevent tokens from attending to future positions, enforcing an autoregressive left to right structure. Aside from this masking behavior, the mathematical formulation of multi head self attention remains the same across both architectures.</p><h4><strong>Listing 1.32: Implementing the Feed-Forward Network</strong></h4><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;d5756e56-df00-463b-9f03-95470891bbfb&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">class FeedForward(nn.Module):
    def __init__(self, dim, hidden, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(hidden, dim),
            nn.Dropout(dropout)
        )

    def forward(self, x):
        return self.net(x)</code></pre></div><p>This module defines the position wise feed forward network used inside each transformer encoder layer. After self attention mixes information across tokens, the feed forward network independently transforms each token representation using the same set of parameters. It consists of two linear projections with a GELU activation in between, which introduces nonlinearity and allows the model to learn more expressive feature transformations. Dropout is applied after each linear layer to reduce overfitting and improve generalization. Although simple in structure, this feed forward network plays a critical role by refining token representations at every layer and complementing the relational modeling performed by self attention.</p><h4><strong>Listing 1.33: Defining a Transformer Encoder Block</strong></h4><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;f801e9df-f27e-4bf2-b1f5-2bd968124753&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">class TransformerBlock(nn.Module):
    def __init__(self, dim, heads, ff_dim, dropout=0.1):
        super().__init__()
        self.attn = MultiHeadSelfAttention(dim, heads, dropout)
        self.ff = FeedForward(dim, ff_dim, dropout)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x, mask):
        x = x + self.attn(self.norm1(x), mask)
        x = x + self.ff(self.norm2(x))
        return x</code></pre></div><p>This module brings together the two fundamental components of the transformer encoder into a single reusable block. Each block first applies multi head self attention to allow tokens to exchange information across the sequence, and then applies a position wise feed forward network to refine each token representation independently. Layer normalization is applied before each sublayer to stabilize training, while residual connections add the sublayer outputs back to the original input. This design preserves gradient flow in deep networks and enables stacking many encoder blocks without degradation. By repeatedly applying this block, the model progressively builds richer and more contextual representations of the input sequence, which is the core mechanism behind BERT style encoders.</p><h4><strong>Listing 1.34: Constructing the BERT Encoder Stack</strong></h4><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;301057f5-7a84-45c8-a297-803ba7764094&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">class BERTEncoder(nn.Module):
    def __init__(self, vocab_size, dim, max_len, layers, heads, ff_dim):
        super().__init__()
        self.embed = BERTEmbedding(vocab_size, dim, max_len)
        self.layers = nn.ModuleList([
            TransformerBlock(dim, heads, ff_dim)
            for _ in range(layers)
        ])

    def forward(self, x, mask):
        x = self.embed(x)
        for layer in self.layers:
            x = layer(x, mask)
        return x</code></pre></div><p>After defining the embedding layer and the transformer encoder block, we can now assemble the complete BERT encoder. This step connects all the components introduced so far into a single coherent module. The encoder begins by converting input token ids into dense vector representations using the BERT embedding layer, which injects token identity, positional information, and segment context. These embeddings serve as the initial representation of the input sequence.</p><p>Once the embeddings are formed, they are passed through a stack of transformer encoder blocks. Each block applies multi head self attention to mix information across tokens, followed by a feed forward network to refine each token representation. By stacking multiple such blocks, the model repeatedly contextualizes the sequence, allowing higher layers to build on patterns discovered in earlier ones. The output of the encoder is a sequence of deeply contextualized token embeddings, where each token representation reflects information from the entire input. This encoder stack forms the core of the BERT architecture and provides the representations that will later be used for sequence level classification.</p><h4><strong>Listing 1.35: Adding a Sequence Classification Head</strong></h4><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;c04bf7dd-7201-4a8e-bbc3-dd10f19f8258&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">class BERTForClassification(nn.Module):
    def __init__(self, vocab_size, dim, max_len, layers, heads, ff_dim):
        super().__init__()
        self.bert = BERTEncoder(
            vocab_size, dim, max_len, layers, heads, ff_dim
        )
        self.classifier = nn.Sequential(nn.Dropout(0.1),nn.Linear(dim, 2))

    def forward(self, x, mask):
        out = self.bert(x, mask)
        cls = out[:, 0]
        return self.classifier(cls)</code></pre></div><p>With the BERT encoder stack in place, the final step is to adapt it for a concrete downstream task. In this module, we attach a lightweight sequence classification head on top of the encoder. The encoder itself remains unchanged and continues to produce contextualized embeddings for every token in the input sequence. What is new here is how we convert those token level representations into a single prediction.</p><p>During the forward pass, the output of the BERT encoder is a tensor containing one embedding per token. We explicitly select the representation of the first token in the sequence, which corresponds to the classification token introduced earlier. As discussed previously, this token has attended to all other tokens through self attention and therefore acts as a compact summary of the entire sequence. The classification head applies dropout for regularization and then uses a linear layer to map this summary representation to class logits. This design cleanly separates general language encoding from task specific prediction, allowing the same encoder to be reused for different classification tasks with minimal modification.</p><h4><strong>Listing 1.36: Initializing the Model</strong></h4><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;3fac6efa-3b72-4178-928c-5ceb9f85fad1&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">model = BERTForClassification(
    vocab_size=VOCAB_SIZE,
    dim=256,
    max_len=MAX_LEN,
    layers=6,
    heads=4,  # must divide dim evenly: 256 / 4 = 64 dimensions per head
    ff_dim=1024
).to(device)

print(model)</code></pre></div><p><strong>Output</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;ff1c7001-950c-4c3a-8d7d-42b6e9522a83&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">BERTForClassification(
  (bert): BERTEncoder(
    (embed): BERTEmbedding(
      (token): Embedding(50260, 256, padding_idx=50257)
      (position): Embedding(256, 256)
      (segment): Embedding(2, 256)
      (norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (layers): ModuleList(
      (0-5): 6 x TransformerBlock(
        (attn): MultiHeadSelfAttention(
          (qkv): Linear(in_features=256, out_features=768, bias=True)
          (out): Linear(in_features=256, out_features=256, bias=True)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (out_dropout): Dropout(p=0.1, inplace=False)
        )
        (ff): FeedForward(
          (net): Sequential(
            (0): Linear(in_features=256, out_features=1024, bias=True)
            (1): GELU(approximate='none')
            (2): Dropout(p=0.1, inplace=False)
            (3): Linear(in_features=1024, out_features=256, bias=True)
            (4): Dropout(p=0.1, inplace=False)
          )
        )
        (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
      )
    )
  )
  (classifier): Sequential(
    (0): Dropout(p=0.1, inplace=False)
    (1): Linear(in_features=256, out_features=2, bias=True)
  )
)</code></pre></div><p>This listing instantiates the complete model by bringing together all the components built so far. At this stage, the embedding layer, transformer encoder stack, and sequence classification head are no longer independent pieces but parts of a single, end-to-end architecture.</p><p>The model is initialized with an embedding dimension of 256, which determines the size of the vector representation used throughout the network. As the printed summary shows, six transformer encoder layers are stacked to progressively refine contextual information, while four attention heads in each layer (64 dimensions per head) allow the model to capture multiple relationships in parallel. The number of heads must divide the embedding dimension evenly, which is what the assertion in the attention module checks. Inside each layer, the feed forward network expands the representation to 1024 dimensions before projecting it back, preserving the standard transformer design pattern. The vocabulary size covers both the base tokenizer&#8217;s vocabulary and the added BERT specific special tokens, and the maximum sequence length defines the longest input the model can process.</p><p>Once instantiated, the model is moved to the selected compute device, completing the setup phase. With the architecture now fully defined, the BERT model is ready to be trained on the IMDb dataset, marking the transition from model construction to optimization.</p><h4>Listing 1.37: Defining the Loss Function and Optimizer</h4><p>With the model architecture fully defined and instantiated, we now set up the two components needed for training: the loss function and the optimizer.</p><p>For the loss function, we use <em>CrossEntropyLoss</em>, which is the standard choice for classification tasks. Cross-entropy loss measures the difference between the model&#8217;s predicted probability distribution over the two classes (positive and negative sentiment) and the true label. 
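</p><p>As a rough illustration of what this loss computes for a single two-class example, the sketch below works through the calculation in plain Python. The helper name <em>cross_entropy_from_logits</em> and the example logits are hypothetical and chosen only for illustration; it is a sketch of the standard formula, not PyTorch&#8217;s actual implementation.</p>

```python
import math

def cross_entropy_from_logits(logits, target):
    """Cross-entropy for one example: -log(softmax(logits)[target])."""
    # Subtract the max logit first so exp() stays numerically stable.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    # log-softmax of the target class is logits[target] - logsumexp(logits),
    # so the negative log-likelihood is the difference below.
    return log_sum_exp - logits[target]

# Hypothetical logits for (negative, positive); the true label is 1 (positive).
confident_right = cross_entropy_from_logits([-2.0, 2.0], target=1)
undecided = cross_entropy_from_logits([0.0, 0.0], target=1)
confident_wrong = cross_entropy_from_logits([2.0, -2.0], target=1)

print(f"{confident_right:.4f} {undecided:.4f} {confident_wrong:.4f}")
```

<p>A confident correct prediction yields a loss near zero, equal logits yield ln 2 (about 0.693), and a confident wrong prediction is penalized heavily.</p><p>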
Internally, PyTorch&#8217;s <em>CrossEntropyLoss</em> applies a log-softmax to the raw logits produced by the model and then computes the negative log-likelihood, so we do not need to apply softmax ourselves before passing the logits to the loss function.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;c1cd4507-825e-44ac-acc4-0de950d7f749&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)</code></pre></div><blockquote><p><strong>AdamW</strong></p><p><em>Adam optimizers</em> are a popular choice for training deep neural networks. In our training loop, however, we opt for the <em>AdamW optimizer</em>. AdamW is a variant of Adam that decouples weight decay from the gradient-based update: instead of folding an L2 penalty into the gradients, it shrinks the weights directly at each step. Weight decay penalizes large weights to limit model complexity and reduce overfitting, and the decoupled form tends to regularize more effectively and generalize better; thus, AdamW is frequently used in the training of transformer models.</p></blockquote><h4>Listing 1.38: Training the Model</h4><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;398ad020-7e67-49a0-93d4-f6f7c2f10ea1&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">EPOCHS = 100

for epoch in range(EPOCHS):
    model.train()
    total_loss = 0

    for ids, mask, labels in tqdm(train_loader):
        ids, mask, labels = ids.to(device), mask.to(device), labels.to(device)

        optimizer.zero_grad()
        logits = model(ids, mask)
        loss = criterion(logits, labels)
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    print(f"Epoch {epoch+1} | Train Loss: {total_loss:.2f}")</code></pre></div><p><strong>Output</strong> (abbreviated)</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;eda048e5-aa4a-4d06-9949-0996dfd3c543&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">Epoch 1 | Train Loss: 1078.51
Epoch 2 | Train Loss: 1040.23
Epoch 3 | Train Loss: 1009.87
...
Epoch 50 | Train Loss: 512.34
...
Epoch 98 | Train Loss: 289.45
Epoch 99 | Train Loss: 287.12
Epoch 100 | Train Loss: 285.67</code></pre></div><p>The listing above implements the training loop. It follows the standard PyTorch pattern: for each epoch, we iterate over all batches in the training DataLoader, compute the forward pass to obtain logits, calculate the loss, perform backpropagation to compute gradients, and update the model parameters using the optimizer.</p><p>At the beginning of each epoch, we set the model to training mode using <strong>model.train()</strong>. This enables <em>Dropout</em>, which randomly zeroes activations during training to reduce overfitting. <em>LayerNorm</em> is unaffected by the mode, because it always normalizes each token&#8217;s features using statistics computed from that token alone rather than from the batch. At the start of each batch, we call <strong>optimizer.zero_grad()</strong> to reset the gradients accumulated from the previous iteration, since PyTorch accumulates gradients by default. The forward pass produces logits from the model, the loss is computed against the true labels, and <strong>loss.backward()</strong> calculates the gradients through backpropagation. Finally, <strong>optimizer.step()</strong> updates all model parameters using the computed gradients.</p><p>We train for a relatively large number of epochs to allow the small model to converge. The loss summed over all batches is printed at the end of each epoch to monitor training progress. A decreasing loss across epochs indicates that the model is learning to distinguish positive from negative reviews in the training data.</p><p>As we can see from the output, the training loss decreases steadily across epochs, indicating that the model is learning meaningful representations from the training data. The loss starts high in the first epoch because the model&#8217;s weights are randomly initialized and the predictions are essentially random guesses. 
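</p><p>A quick back-of-the-envelope check supports this reading. The sketch below assumes the 25,000 training reviews and batch size of 16 mentioned in the note on training duration, and further assumes that the DataLoader keeps the final partial batch; the variable names are illustrative only.</p>

```python
import math

# Assumed setup: 25,000 training reviews, batch size 16, last partial batch kept.
num_samples = 25_000
batch_size = 16
batches_per_epoch = math.ceil(num_samples / batch_size)

# A two-class model that guesses uniformly assigns probability 0.5 to the true
# label, so its mean cross-entropy per batch is -log(0.5) = ln 2.
chance_loss_per_batch = math.log(2)

# The training loop sums the per-batch mean losses over one epoch.
chance_total_loss = batches_per_epoch * chance_loss_per_batch

print(batches_per_epoch, round(chance_total_loss, 2))
```

<p>Under these assumptions a purely random classifier would accumulate a summed loss of about 1,083 per epoch, which is close to the 1078.51 reported for the first epoch.</p><p>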
Over the course of training, the model adjusts its parameters to produce increasingly accurate sentiment predictions.</p><blockquote><p><strong>Note on Training Duration:</strong></p><p>Training on 25,000 samples with a batch size of 16 results in approximately 1,563 parameter updates per epoch, or roughly 156,250 updates over 100 epochs. On a modern GPU, this takes several hours. If computational resources are limited, reducing the number of epochs to 10&#8211;20 will still produce a model that performs noticeably above chance, though with lower accuracy.</p></blockquote><h4>Listing 1.39: Evaluating the Model on the Test Set</h4><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;50f6a867-679c-44d2-8b50-8f5ad0775d27&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">def evaluate(model, loader):
    model.eval()
    correct, total = 0, 0

    with torch.no_grad():
        for ids, mask, labels in loader:
            ids, mask, labels = ids.to(device), mask.to(device), labels.to(device)
            preds = model(ids, mask).argmax(dim=1)

            correct += (preds == labels).sum().item()
            total += labels.size(0)

    return correct / total</code></pre></div><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;acfbac0c-e063-4c7b-b3da-5d12ca76ea09&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">accuracy = evaluate(model, test_loader)
print("Test Accuracy:", accuracy)</code></pre></div><p><strong>Output</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;42ac039a-2553-467c-884c-7be456006443&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">Test Accuracy: 0.8074</code></pre></div><p>After training, we evaluate the model on the held-out test set to measure its generalization performance, that is, how well it classifies reviews it has never seen during training. This is a critical step, because a model that performs well on training data but poorly on unseen data has <em>overfit</em> to the training set and has not learned generalizable patterns.</p><p>During evaluation, we set the model to evaluation mode using <strong>model.eval()</strong>, which disables dropout so that predictions are deterministic. Layer normalization behaves the same in both modes, since it keeps no running statistics and normalizes each token&#8217;s features independently of the batch. We also wrap the evaluation loop inside <strong>torch.no_grad()</strong>, which disables gradient computation. Since we are not updating the model&#8217;s parameters during evaluation, disabling gradients reduces memory usage and speeds up computation.</p><p>For each batch, the model produces logits, and we take the <strong>argmax</strong> along the class dimension to obtain the predicted label (0 for negative, 1 for positive). We then compare these predictions to the ground truth labels and accumulate the number of correct predictions to compute the overall accuracy.</p><p>The model achieves approximately 80.7% accuracy on the test set. Considering that this is a BERT model trained entirely from scratch with a reduced architecture (256-dimensional embeddings, 6 layers, and 4 attention heads) on truncated input sequences of just 256 tokens, this is a reasonable result. 
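</p><p>To put the model size in perspective, the sketch below tallies the trainable parameters implied by the layer shapes in the printed model summary (six encoder blocks and a 50,260-entry token embedding). The helper names are hypothetical, and the count is derived from those printed shapes rather than measured on the actual model.</p>

```python
def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features

def layer_norm_params(dim):
    """One scale and one shift value per feature."""
    return 2 * dim

dim, ff_dim, num_blocks = 256, 1024, 6      # shapes from the printed summary
vocab_size, max_len, num_segments = 50_260, 256, 2

block = (
    linear_params(dim, 3 * dim)             # fused query/key/value projection
    + linear_params(dim, dim)               # attention output projection
    + linear_params(dim, ff_dim)            # feed-forward expansion
    + linear_params(ff_dim, dim)            # feed-forward contraction
    + 2 * layer_norm_params(dim)            # norm1 and norm2
)
# Token, position, and segment tables plus the embedding LayerNorm.
embeddings = (vocab_size + max_len + num_segments) * dim + layer_norm_params(dim)
classifier = linear_params(dim, 2)

total = num_blocks * block + embeddings + classifier
print(f"per block: {block:,}  embeddings: {embeddings:,}  total: {total:,}")
```

<p>This comes to roughly 17.7 million parameters, most of them in the token embedding table, whereas BERT-Base has roughly 110 million.</p><p>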
For reference, the original BERT-Base model (768-dimensional embeddings, 12 layers, 12 heads) pretrained on massive corpora and then fine-tuned on IMDb typically achieves around 93&#8211;95% accuracy. The gap in performance is expected, given the significant differences in model size, pretraining data, and input sequence length.</p><h4>Listing 1.40: Generating a Detailed Classification Report</h4><p>While overall accuracy gives a useful single-number summary, it can sometimes be misleading, especially on imbalanced datasets. To get a more detailed picture of the model&#8217;s performance, we generate a full classification report using scikit-learn&#8217;s <em>classification_report</em> function. This report includes <em>precision</em>, <em>recall</em>, and <em>F1-score</em> for each class.</p><p>Precision measures what fraction of the samples the model predicted as a given class actually belong to that class. Recall measures what fraction of the samples that truly belong to a given class were correctly identified by the model. The F1-score is the harmonic mean of precision and recall, providing a single metric that balances both concerns. These per-class metrics are especially informative when the classes have different distributions or when the cost of false positives and false negatives differs.</p><p>To generate this report, we first collect all predictions and ground truth labels from the test set by running the model in evaluation mode with gradients disabled.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;0cc96cd3-dddd-4903-8144-6a59cab5ffa7&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">all_preds, all_labels = [], []

model.eval()
with torch.no_grad():
    for ids, mask, labels in test_loader:
        preds = model(ids.to(device), mask.to(device)).argmax(dim=1).cpu()
        all_preds.extend(preds.numpy())
        all_labels.extend(labels.numpy())

print(classification_report(all_labels, all_preds, target_names=["Negative", "Positive"]))</code></pre></div><p><strong>Output</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;a4df8e61-a6b8-494e-ae59-8d487db6a4c4&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">              precision    recall  f1-score   support

    Negative       0.81      0.80      0.81     12500
    Positive       0.81      0.81      0.81     12500

    accuracy                           0.81     25000
   macro avg       0.81      0.81      0.81     25000
weighted avg       0.81      0.81      0.81     25000</code></pre></div><p>The classification report confirms that the model performs consistently across both classes. The precision, recall, and F1-scores are all approximately 0.81 for both the negative and positive classes, with balanced support of 12,500 samples each. This symmetry indicates that the model does not exhibit a bias toward predicting one class over the other, which is a desirable property in a balanced binary classification task.</p><h4> Listing 1.41: Saving the Trained Model</h4><p>After training and evaluation, it is important to save the model so that it can be loaded later for inference or further fine-tuning without having to retrain from scratch. In PyTorch, the standard approach is to save the model&#8217;s <em>state_dict</em>, which is a Python dictionary that maps each layer name to its corresponding parameter tensor. Saving the <em>state_dict</em> rather than the entire model object is the recommended practice because it is more portable and less prone to issues when the code structure changes between sessions.</p><p>In addition to the model weights, we also save the tokenizer metadata, specifically the special token IDs and maximum sequence length, so that all the information needed for inference is available in one place. This ensures reproducibility: anyone who loads the model later will have the exact configuration required to tokenize new inputs in the same way they were processed during training.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;61d80bff-8151-473e-896b-9a595792408a&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">SAVE_DIR = "bert_from_scratch_imdb"
os.makedirs(SAVE_DIR, exist_ok=True)

torch.save(model.state_dict(), f"{SAVE_DIR}/model.pt")

torch.save({
    "pad_id": PAD_ID,
    "cls_id": CLS_ID,
    "sep_id": SEP_ID,
    "max_len": MAX_LEN
}, f"{SAVE_DIR}/tokenizer_info.pt")

print("Model saved successfully!")</code></pre></div><p><strong>Output</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;82d8531c-34f7-40d1-86fa-be076a2a1a05&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">Model saved successfully!</code></pre></div><h4>Listing 1.42: Loading the Saved Model for Inference</h4><p>Before running inference on new text, we load the saved model weights back into the model architecture. The <em>torch.load</em> function reads the saved <em>state_dict</em> from disk, and <em>model.load_state_dict()</em> applies these weights to the model. The <em>map_location</em> argument ensures that the weights are loaded onto the correct device, which is particularly useful when a model trained on a GPU is later loaded on a CPU-only machine.</p><p>After loading, we set the model to evaluation mode with <em>model.eval().</em> This is essential because, without it, dropout layers would still randomly zero out activations, leading to inconsistent and degraded predictions at inference time.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;11b4ae48-0b90-4659-9cf6-b4f626b434c2&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">model.load_state_dict(torch.load(f"{SAVE_DIR}/model.pt", map_location=device))
model.eval()
print("Model loaded successfully!")</code></pre></div><p><strong>Output</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;1950cc8e-ada6-41ac-937c-875fd4779f59&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">Model loaded successfully!</code></pre></div><h4>Listing 1.43: Running Inference on New Text</h4><p>With the trained model loaded, we can now use it to classify the sentiment of arbitrary text inputs. The inference pipeline mirrors the preprocessing steps used during training: the raw text is tokenized using byte pair encoding, augmented with classification and separator tokens, padded to the fixed sequence length, and converted into a tensor. An attention mask is created to indicate which positions contain real tokens versus padding. The model then produces logits for the two classes, and we take the <em>argmax</em> to obtain the predicted label.</p><p>We test the model on several example reviews to verify that it has learned meaningful sentiment representations.</p><p>Two of these examples are shown here:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;cf70da4c-e8f9-4635-bff0-702d270f4037&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">text = "Bromwell High is a brilliantly conceived, executed and acted, but sadly overlooked sitcom. The writing is razor sharp, the characters are well drawn and the jokes are genuinely funny. The animation is also excellent, with a style that suits the material perfectly. It's a shame that it didn't get a proper chance in the UK, as it deserves to be up there with the likes of The Simpsons and South Park. Highly recommended for anyone who likes clever, witty humour."

ids = torch.tensor([encode(text)]).to(device)
mask = (ids != PAD_ID).long()

with torch.no_grad():
    pred = model(ids, mask).argmax(dim=1).item()

print("Prediction:", "Positive" if pred == 1 else "Negative")</code></pre></div><p><strong>Output</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;d8d9fa8a-499a-4817-a5d9-570adedcc48d&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">Prediction: Positive</code></pre></div><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;5eb7c85c-d92c-4926-929a-df4fbb7f4df1&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">text = "In fact I must confess, so bad was it I fast forwarded through most of the garbage... As for the title characters, they barely even have a footnote in the film."

ids = torch.tensor([encode(text)]).to(device)
mask = (ids != PAD_ID).long()

with torch.no_grad():
    pred = model(ids, mask).argmax(dim=1).item()

print("Prediction:", "Positive" if pred == 1 else "Negative")</code></pre></div><p><strong>Output</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;2c112747-0221-44da-abfb-bd9c36165766&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">Prediction: Negative</code></pre></div><p>As the outputs show, the model correctly classifies both test inputs: the strongly positive review is predicted as positive, and the clearly negative review is predicted as negative. These examples contain relatively unambiguous sentiment cues, such as &#8220;brilliantly,&#8221; &#8220;excellent,&#8221; and &#8220;highly recommended&#8221; in the first review and &#8220;so bad&#8221; and &#8220;garbage&#8221; in the second, so they serve as a basic sanity check that the model has learned to associate such linguistic patterns with the corresponding sentiment.</p><p>It is worth noting that a small model trained from scratch on truncated sequences will not handle every edge case perfectly. Reviews with mixed sentiment, heavy sarcasm, or critical information located beyond the 256-token truncation point may be misclassified. 
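</p><p>To make the truncation point concrete, the sketch below mimics the kind of preprocessing described earlier: wrap the token ids in classification and separator tokens, truncate to the maximum length, pad, and build the attention mask. The function <em>encode_sketch</em> and the classification and separator ids are assumptions made for illustration (only the padding id 50257 appears in the printed model summary); the chapter&#8217;s real <em>encode</em> function may differ.</p>

```python
# Assumed ids for illustration; only PAD_ID matches the printed padding_idx.
PAD_ID, CLS_ID, SEP_ID = 50257, 50258, 50259
MAX_LEN = 256

def encode_sketch(token_ids, max_len=MAX_LEN):
    """Return (input_ids, attention_mask), both exactly max_len long."""
    # Reserve two positions for [CLS] and [SEP]; anything beyond is dropped.
    kept = token_ids[: max_len - 2]
    ids = [CLS_ID] + kept + [SEP_ID]
    # Pad on the right up to the fixed sequence length.
    ids = ids + [PAD_ID] * (max_len - len(ids))
    # 1 marks a real token, 0 marks padding (mirrors `ids != PAD_ID`).
    mask = [int(i != PAD_ID) for i in ids]
    return ids, mask

# Stand-in token ids: a short 10-token review and a long 1,000-token review.
short_ids, short_mask = encode_sketch(list(range(10)))
long_ids, long_mask = encode_sketch(list(range(1000)))

print(sum(short_mask), sum(long_mask), long_ids[-2])
```

<p>In this sketch a 1,000-token review keeps only its first 254 tokens, so any sentiment expressed later in the text never reaches the model.</p><p>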
Nonetheless, the model&#8217;s ability to correctly classify straightforward examples confirms that the BERT architecture, even at a reduced scale, can learn meaningful bidirectional representations for sentiment analysis when trained on a sufficiently large labeled dataset.</p><h1>Resources</h1><p>Dr Raj has made a very detailed playlist on building an LLM from scratch, which you can refer to as well:</p><p><a href="https://youtube.com/playlist?list=PLPTV0NXA_ZSgsLAr8YCgCwhPIJNNtexWu&amp;si=2vzAspB2zerjhGXa">Building LLMs from Scratch</a></p><p>You can also refer to the book by Sebastian Raschka:</p><p><a href="https://www.manning.com/books/build-a-large-language-model-from-scratch">Build a Large Language Model (From Scratch)</a></p><h1>1.24 Summary</h1><ul><li><p>Large Language Models predict the next word in a sequence and use this simple objective to develop sophisticated language understanding. Model size is critical: emergent abilities such as arithmetic reasoning appear only when models cross certain parameter thresholds.</p></li><li><p>The transformer architecture replaces recurrent and convolutional approaches with self-attention, enabling parallel processing and global context from the first layer. Its core components are tokenization, embeddings, multi-head attention, feed-forward networks, layer normalization, and residual connections.</p></li><li><p>Byte Pair Encoding builds a subword vocabulary through iterative merging of the most frequent character pairs, balancing vocabulary size with the ability to represent any text.</p></li><li><p>Self-attention transforms static input embeddings into dynamic context vectors by projecting them into queries, keys, and values. Attention scores are computed via scaled dot products, normalized with softmax, and used to blend value vectors. 
</p></li><li><p>Causal masking prevents tokens from attending to future positions by setting upper-triangular scores to negative infinity before softmax, eliminating data leakage. </p></li><li><p>Multi-head attention runs several independent attention heads in parallel, each capturing different types of relationships, and concatenates their outputs to form a richer representation. </p></li><li><p>Layer normalization stabilizes training by centering and rescaling activations, while residual connections preserve gradient flow through deep stacks of transformer blocks. </p></li><li><p>Transformers scale better than RNNs and CNNs because of their parallel computation, architectural uniformity, and smooth scaling behavior with increasing parameters and data.</p></li><li><p>Pretraining on large unlabeled data creates general-purpose representations that can be efficiently adapted to specific tasks through fine-tuning, dramatically reducing data requirements for downstream applications.</p></li></ul><h1>Some More Substacks </h1><div class="embedded-post-wrap" data-attrs="{&quot;id&quot;:162355148,&quot;url&quot;:&quot;https://vizuara.substack.com/p/from-words-to-vectors-understanding&quot;,&quot;publication_id&quot;:3466476,&quot;publication_name&quot;:&quot;Vizuara&#8217;s Substack&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!D3Nd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1a48e78-8537-4335-aee7-95b52957e861_3456x3456.png&quot;,&quot;title&quot;:&quot;From Words to Vectors: Understanding Word Embeddings in NLP&quot;,&quot;truncated_body_text&quot;:&quot;Introduction&quot;,&quot;date&quot;:&quot;2025-05-14T10:38:43.830Z&quot;,&quot;like_count&quot;:16,&quot;comment_count&quot;:3,&quot;bylines&quot;:[{&quot;id&quot;:321073573,&quot;name&quot;:&quot;Mayank Pratap 
Singh&quot;,&quot;handle&quot;:&quot;mayankpratapsingh022&quot;,&quot;previous_name&quot;:null,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/41e709c7-9629-4748-93e5-e3014c16fb57_460x460.jpeg&quot;,&quot;bio&quot;:&quot;I&#8217;m Mayank, an AI and LLM enthusiast, and this is my space to breaking down complex concepts into accessible insights&quot;,&quot;profile_set_up_at&quot;:&quot;2025-02-24T07:56:43.260Z&quot;,&quot;reader_installed_at&quot;:&quot;2025-08-22T12:46:43.613Z&quot;,&quot;publicationUsers&quot;:[{&quot;id&quot;:4286927,&quot;user_id&quot;:321073573,&quot;publication_id&quot;:4203235,&quot;role&quot;:&quot;admin&quot;,&quot;public&quot;:true,&quot;is_primary&quot;:true,&quot;publication&quot;:{&quot;id&quot;:4203235,&quot;name&quot;:&quot;Mayank&#8217;s Substack&quot;,&quot;subdomain&quot;:&quot;mayankpratapsingh022&quot;,&quot;custom_domain&quot;:null,&quot;custom_domain_optional&quot;:false,&quot;hero_text&quot;:&quot;My personal Substack&quot;,&quot;logo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/41e709c7-9629-4748-93e5-e3014c16fb57_460x460.jpeg&quot;,&quot;author_id&quot;:321073573,&quot;primary_user_id&quot;:321073573,&quot;theme_var_background_pop&quot;:&quot;#FF6719&quot;,&quot;created_at&quot;:&quot;2025-02-24T08:04:27.535Z&quot;,&quot;email_from_name&quot;:null,&quot;copyright&quot;:&quot;Mayank Pratap 
Singh&quot;,&quot;founding_plan_name&quot;:null,&quot;community_enabled&quot;:true,&quot;invite_only&quot;:false,&quot;payments_state&quot;:&quot;disabled&quot;,&quot;language&quot;:null,&quot;explicit&quot;:false,&quot;homepage_type&quot;:&quot;newspaper&quot;,&quot;is_personal_mode&quot;:false,&quot;logo_url_wide&quot;:null}},{&quot;id&quot;:6579433,&quot;user_id&quot;:321073573,&quot;publication_id&quot;:3591997,&quot;role&quot;:&quot;contributor&quot;,&quot;public&quot;:true,&quot;is_primary&quot;:false,&quot;publication&quot;:{&quot;id&quot;:3591997,&quot;name&quot;:&quot;Vizuara&#8217;s AI Newsletter&quot;,&quot;subdomain&quot;:&quot;aivizuara&quot;,&quot;custom_domain&quot;:&quot;www.vizuaranewsletter.com&quot;,&quot;custom_domain_optional&quot;:false,&quot;hero_text&quot;:&quot;Making AI accessible for all&quot;,&quot;logo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/59481c77-2230-4e2f-b78b-88e9ee5fe9d9_1088x1088.png&quot;,&quot;author_id&quot;:160920062,&quot;primary_user_id&quot;:160920062,&quot;theme_var_background_pop&quot;:&quot;#FF6719&quot;,&quot;created_at&quot;:&quot;2024-12-27T10:02:26.912Z&quot;,&quot;email_from_name&quot;:&quot;Team Vizuara&quot;,&quot;copyright&quot;:&quot;Vizuara AI 
Labs&quot;,&quot;founding_plan_name&quot;:null,&quot;community_enabled&quot;:true,&quot;invite_only&quot;:false,&quot;payments_state&quot;:&quot;disabled&quot;,&quot;language&quot;:null,&quot;explicit&quot;:false,&quot;homepage_type&quot;:&quot;newspaper&quot;,&quot;is_personal_mode&quot;:false,&quot;logo_url_wide&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3a1a36ff-6d9a-4ee7-9494-3ae38adfe134_1920x600.png&quot;}}],&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null,&quot;status&quot;:{&quot;bestsellerTier&quot;:null,&quot;subscriberTier&quot;:null,&quot;leaderboard&quot;:null,&quot;vip&quot;:false,&quot;badge&quot;:null,&quot;paidPublicationIds&quot;:[],&quot;subscriber&quot;:null}}],&quot;utm_campaign&quot;:null,&quot;belowTheFold&quot;:true,&quot;type&quot;:&quot;newsletter&quot;,&quot;language&quot;:&quot;en&quot;,&quot;source&quot;:null}" data-component-name="EmbeddedPostToDOM"><a class="embedded-post" native="true" href="https://vizuara.substack.com/p/from-words-to-vectors-understanding?utm_source=substack&amp;utm_campaign=post_embed&amp;utm_medium=web"><div class="embedded-post-header"><img class="embedded-post-publication-logo" src="https://substackcdn.com/image/fetch/$s_!D3Nd!,w_56,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1a48e78-8537-4335-aee7-95b52957e861_3456x3456.png" loading="lazy"><span class="embedded-post-publication-name">Vizuara&#8217;s Substack</span></div><div class="embedded-post-title-wrapper"><div class="embedded-post-title">From Words to Vectors: Understanding Word Embeddings in NLP</div></div><div class="embedded-post-body">Introduction&#8230;</div><div class="embedded-post-cta-wrapper"><span class="embedded-post-cta">Read more</span></div><div class="embedded-post-meta">a year ago &#183; 16 likes &#183; 3 comments &#183; Mayank Pratap Singh</div></a></div><div class="digest-post-embed" 
data-attrs="{&quot;nodeId&quot;:&quot;f340b53d-16fc-4640-ab35-c24e9238bf5a&quot;,&quot;caption&quot;:&quot;Figure 0: Detailed Architecture of the Segment Anything Model (SAM).&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;lg&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Segment Anything Model (SAM)&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:321073573,&quot;name&quot;:&quot;Mayank Pratap Singh&quot;,&quot;bio&quot;:&quot;I&#8217;m Mayank, an AI and LLM enthusiast, and this is my space to breaking down complex concepts into accessible insights&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/41e709c7-9629-4748-93e5-e3014c16fb57_460x460.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null},{&quot;id&quot;:136642032,&quot;name&quot;:&quot;Sreedath Panat&quot;,&quot;bio&quot;:&quot;I am the co-founder of Vizuara AI Labs and a PhD from MIT. I use this space to put down my thoughts and knowledge on AI/ML related topics.&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3fa047b9-4cee-4d9d-8ed7-4a63c5f919b4_974x1220.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2026-01-20T09:19:46.533Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2ea6440e-c81a-4e4e-b357-db44820234f5_1920x1278.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://www.vizuaranewsletter.com/p/segment-anything-model-sam&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:184705881,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:9,&quot;comment_count&quot;:0,&quot;publication_id&quot;:3591997,&quot;publication_name&quot;:&quot;Vizuara&#8217;s AI 
Newsletter&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!f_Wd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59481c77-2230-4e2f-b78b-88e9ee5fe9d9_1088x1088.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;601356b7-3bc0-44df-bc1b-627d6640421d&quot;,&quot;caption&quot;:&quot;Table Of Content&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;lg&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Detection Transformer (DETR): An introduction&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:321073573,&quot;name&quot;:&quot;Mayank Pratap Singh&quot;,&quot;bio&quot;:&quot;I&#8217;m Mayank, an AI and LLM enthusiast, and this is my space to breaking down complex concepts into accessible insights&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/41e709c7-9629-4748-93e5-e3014c16fb57_460x460.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null},{&quot;id&quot;:136642032,&quot;name&quot;:&quot;Sreedath Panat&quot;,&quot;bio&quot;:&quot;I am the co-founder of Vizuara AI Labs and a PhD from MIT. 
I use this space to put down my thoughts and knowledge on AI/ML related topics.&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3fa047b9-4cee-4d9d-8ed7-4a63c5f919b4_974x1220.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2026-01-15T08:40:59.104Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/$s_!M0HN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32500ddf-8e94-4494-84aa-6335fa46ef07_786x438.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://www.vizuaranewsletter.com/p/detection-transformer-detr-an-introduction&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:183945695,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:7,&quot;comment_count&quot;:0,&quot;publication_id&quot;:3591997,&quot;publication_name&quot;:&quot;Vizuara&#8217;s AI Newsletter&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!f_Wd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59481c77-2230-4e2f-b78b-88e9ee5fe9d9_1088x1088.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><p>I&#8217;m also building Audio Deep Learning projects and LLM projects, sharing and discussing them on LinkedIn and Twitter. 
If you&#8217;re someone curious about these topics, I&#8217;d love to connect with you all!</p><p><strong>Mayank Pratap Singh</strong></p><p><strong>LinkedIn</strong> : <a href="https://www.linkedin.com/in/mayankpratapsingh022/">www.linkedin.com/in/mayankpratapsingh022</a></p><p><strong>Twitter/X</strong> : <a href="https://x.com/Mayank_022">x.com/Mayank_022</a>.</p>]]></content:encoded></item><item><title><![CDATA[Diffusion Policy: Teaching Robots to Act by Denoising ]]></title><description><![CDATA[How diffusion models solve the multimodal action problem in robot imitation learning]]></description><link>https://www.vizuaranewsletter.com/p/diffusion-policy-teaching-robots</link><guid isPermaLink="false">https://www.vizuaranewsletter.com/p/diffusion-policy-teaching-robots</guid><dc:creator><![CDATA[Dr Rajat Dandekar]]></dc:creator><pubDate>Mon, 23 Feb 2026 11:50:43 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!8eYv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb1ce9bd-2570-445b-9ccf-e18c8ffff0ed_1408x768.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>The Multimodal Action 
Problem</h2><p>Let us start with a simple example. Imagine you are training a robot to push a T-shaped block into a target zone on a table.</p><p>You collect expert demonstrations by asking several humans to perform the task. Some humans push the block from the left side, steering it rightward into the target. Others push from the right side, steering it leftward. Both strategies work perfectly.</p><p>Now, you train a standard neural network on this data. The network takes the current image of the table as input and predicts a single action: which direction to push.</p><p>Here is the problem: the network sees that for the same observation (T-block sitting in the center), the correct answer is sometimes &#8220;push left&#8221; and sometimes &#8220;push right.&#8221; To minimize its total error across all demonstrations, the network does what any reasonable regression model would do &#8212; it takes the average.</p><p>The average of &#8220;push left&#8221; and &#8220;push right&#8221; is &#8220;push straight ahead.&#8221;</p><p>And pushing straight ahead slams directly into the flat edge of the T-block. 
The robot fails.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!oX3Z!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4b9eb89-0c63-4090-9293-0ac7306c1973_1408x768.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!oX3Z!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4b9eb89-0c63-4090-9293-0ac7306c1973_1408x768.png 424w, https://substackcdn.com/image/fetch/$s_!oX3Z!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4b9eb89-0c63-4090-9293-0ac7306c1973_1408x768.png 848w, https://substackcdn.com/image/fetch/$s_!oX3Z!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4b9eb89-0c63-4090-9293-0ac7306c1973_1408x768.png 1272w, https://substackcdn.com/image/fetch/$s_!oX3Z!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4b9eb89-0c63-4090-9293-0ac7306c1973_1408x768.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!oX3Z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4b9eb89-0c63-4090-9293-0ac7306c1973_1408x768.png" width="1408" height="768" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d4b9eb89-0c63-4090-9293-0ac7306c1973_1408x768.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1408,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Point 
estimates average conflicting demos and fail.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Point estimates average conflicting demos and fail." title="Point estimates average conflicting demos and fail." srcset="https://substackcdn.com/image/fetch/$s_!oX3Z!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4b9eb89-0c63-4090-9293-0ac7306c1973_1408x768.png 424w, https://substackcdn.com/image/fetch/$s_!oX3Z!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4b9eb89-0c63-4090-9293-0ac7306c1973_1408x768.png 848w, https://substackcdn.com/image/fetch/$s_!oX3Z!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4b9eb89-0c63-4090-9293-0ac7306c1973_1408x768.png 1272w, https://substackcdn.com/image/fetch/$s_!oX3Z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd4b9eb89-0c63-4090-9293-0ac7306c1973_1408x768.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 
6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This is called the <strong>multimodal action problem</strong>. In robotics, it shows up constantly. A robot picking up an object can reach from the left or the right. A robot navigating around an obstacle can go clockwise or counter-clockwise. Whenever multiple valid strategies exist for the same observation, a point-estimate policy collapses them into a single useless average.</p><p>What we really want is a policy that captures the <strong>full distribution</strong> of possible actions &#8212; not just the mean. The policy should be able to sample from this distribution, and each sample should be a coherent, valid action sequence.</p><p>So, how do we build a policy that captures the full distribution of actions, not just the average?</p><div><hr></div><h2>From Image Generation to Action Generation</h2><p>This brings us to an elegant idea: What if we borrowed the same technique that generates images from noise &#8212; diffusion models &#8212; and used it to generate robot actions from noise?</p><p>Let us quickly understand how diffusion models work. 
The core idea has two parts.</p><p><strong>The Forward Process:</strong> Take a clean piece of data &#8212; say, an image of a cat &#8212; and gradually add Gaussian noise to it, step by step, until the image becomes pure random noise. This is a fixed process. We do not need to learn anything here. We just keep adding noise.</p><p><strong>The Reverse Process:</strong> Train a neural network to undo this corruption. Given a noisy image, the network learns to predict what the slightly-less-noisy version looks like. If we chain many such denoising steps together &#8212; starting from pure noise &#8212; we can generate a clean image that looks like it came from the training distribution.</p><p>The mathematical formulation for the forward process is straightforward. At each step k, we corrupt the clean data A&#8304; by mixing it with Gaussian noise:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;A^k_t = \\sqrt{\\bar{\\alpha}_k} \\, A^0_t + \\sqrt{1 - \\bar{\\alpha}_k} \\, \\epsilon, \\quad \\epsilon \\sim \\mathcal{N}(0, I)&quot;,&quot;id&quot;:&quot;WUTFYMWNOS&quot;}" data-component-name="LatexBlockToDOM"></div><p>Here, &#8113;&#8342; is a noise schedule parameter that controls how much of the original signal remains at step k. When k is small, &#8113;&#8342; is close to 1, so most of the original data is preserved. When k is large, &#8113;&#8342; is close to 0, and we have almost pure noise.</p><p>Let us plug in some simple numbers to see how this works. 
Suppose our clean action is A&#8304;&#8348; = 3.0 (a joint angle in radians), the noise schedule gives &#8113;&#8342; = 0.5 at step k, and the sampled noise is &#949; = 0.8:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;A^k_t = \\sqrt{0.5} \\times 3.0 + \\sqrt{1 - 0.5} \\times 0.8 = 0.707 \\times 3.0 + 0.707 \\times 0.8 = 2.12 + 0.57 = 2.69&quot;,&quot;id&quot;:&quot;LZGHZKOTJE&quot;}" data-component-name="LatexBlockToDOM"></div><p>Notice how the original action value of 3.0 has been pulled partway toward noise. If we continued this process with smaller and smaller &#8113;&#8342;, the value would eventually become indistinguishable from a random Gaussian sample. This is exactly what we want.</p><p>Now, here is the key insight from the Diffusion Policy paper by Chi et al. (2023): instead of denoising pixels to generate images, we can denoise <strong>action sequences</strong> to generate robot behaviors.</p><p>The input to our denoising process is random noise in the action space &#8212; imagine a completely random sequence of joint angles and velocities. The output, after many denoising steps, is a smooth, coherent sequence of robot actions that solves the task.</p><p>And the conditioning signal? 
The robot&#8217;s current observations &#8212; what the camera sees and what the joint encoders report.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8eYv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb1ce9bd-2570-445b-9ccf-e18c8ffff0ed_1408x768.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8eYv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb1ce9bd-2570-445b-9ccf-e18c8ffff0ed_1408x768.png 424w, https://substackcdn.com/image/fetch/$s_!8eYv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb1ce9bd-2570-445b-9ccf-e18c8ffff0ed_1408x768.png 848w, https://substackcdn.com/image/fetch/$s_!8eYv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb1ce9bd-2570-445b-9ccf-e18c8ffff0ed_1408x768.png 1272w, https://substackcdn.com/image/fetch/$s_!8eYv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb1ce9bd-2570-445b-9ccf-e18c8ffff0ed_1408x768.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8eYv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb1ce9bd-2570-445b-9ccf-e18c8ffff0ed_1408x768.png" width="1408" height="768" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fb1ce9bd-2570-445b-9ccf-e18c8ffff0ed_1408x768.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1408,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;The same denoising principle applies to robot actions.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The same denoising principle applies to robot actions." title="The same denoising principle applies to robot actions." srcset="https://substackcdn.com/image/fetch/$s_!8eYv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb1ce9bd-2570-445b-9ccf-e18c8ffff0ed_1408x768.png 424w, https://substackcdn.com/image/fetch/$s_!8eYv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb1ce9bd-2570-445b-9ccf-e18c8ffff0ed_1408x768.png 848w, https://substackcdn.com/image/fetch/$s_!8eYv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb1ce9bd-2570-445b-9ccf-e18c8ffff0ed_1408x768.png 1272w, https://substackcdn.com/image/fetch/$s_!8eYv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb1ce9bd-2570-445b-9ccf-e18c8ffff0ed_1408x768.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 
20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><h2>The Diffusion Policy Formulation</h2><p>Now let us formalize how Diffusion Policy works, step by step.</p><p>The first important design choice is how the policy interacts with time. The authors define three temporal horizons:</p><p><strong>Observation Horizon (T&#8338;):</strong> The policy receives the latest T&#8338; steps of observation data. For example, with T&#8338; = 2, the robot looks at the current camera frame and the previous one. This gives the network a sense of motion and velocity.</p><p><strong>Prediction Horizon (T&#8346;):</strong> The diffusion model predicts T&#8346; future action steps all at once. 
Typically T&#8346; = 16, meaning each round of denoising produces a sequence of 16 future actions at once.</p><p><strong>Action Horizon (T&#8336;):</strong> Of those T&#8346; predicted actions, only T&#8336; are actually executed on the robot before the policy replans. Typically T&#8336; = 8, meaning we execute half the predicted sequence, then re-observe and generate a new plan.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RDk2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ed5cee9-8bbc-4f6b-93bd-d40b68bab7f9_1408x768.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RDk2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ed5cee9-8bbc-4f6b-93bd-d40b68bab7f9_1408x768.png 424w, https://substackcdn.com/image/fetch/$s_!RDk2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ed5cee9-8bbc-4f6b-93bd-d40b68bab7f9_1408x768.png 848w, https://substackcdn.com/image/fetch/$s_!RDk2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ed5cee9-8bbc-4f6b-93bd-d40b68bab7f9_1408x768.png 1272w, https://substackcdn.com/image/fetch/$s_!RDk2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ed5cee9-8bbc-4f6b-93bd-d40b68bab7f9_1408x768.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RDk2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ed5cee9-8bbc-4f6b-93bd-d40b68bab7f9_1408x768.png" width="1408" height="768" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4ed5cee9-8bbc-4f6b-93bd-d40b68bab7f9_1408x768.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1408,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Predict T&#8346; actions, execute only T&#8336;, then replan.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Predict T&#8346; actions, execute only T&#8336;, then replan." title="Predict T&#8346; actions, execute only T&#8336;, then replan." srcset="https://substackcdn.com/image/fetch/$s_!RDk2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ed5cee9-8bbc-4f6b-93bd-d40b68bab7f9_1408x768.png 424w, https://substackcdn.com/image/fetch/$s_!RDk2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ed5cee9-8bbc-4f6b-93bd-d40b68bab7f9_1408x768.png 848w, https://substackcdn.com/image/fetch/$s_!RDk2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ed5cee9-8bbc-4f6b-93bd-d40b68bab7f9_1408x768.png 1272w, https://substackcdn.com/image/fetch/$s_!RDk2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ed5cee9-8bbc-4f6b-93bd-d40b68bab7f9_1408x768.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p>Now let us look at the training process. The training objective is beautifully simple: <strong>predict the noise that was added.</strong></p><p>During training, we take a clean action sequence A&#8304;&#8348; from the demonstration data, add noise at a random level k, and ask the network to predict the noise. The loss function is:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathcal{L} = \\text{MSE}\\Big(\\epsilon^k, \\; \\epsilon_\\theta(O_t, \\; A^0_t + \\epsilon^k, \\; k)\\Big)&quot;,&quot;id&quot;:&quot;BZULSXJHSB&quot;}" data-component-name="LatexBlockToDOM"></div><p>Here, &#949;&#7503; is the actual noise that was added, &#949;_&#952; is our noise-prediction network, O&#8348; is the observation, and k is the noise level. 
The network learns to look at a noisy action sequence and predict what noise was added to it.</p><p>Let us plug in some simple numbers. Suppose our action sequence has 3 dimensions (for a 3-DOF robot), and the true noise added was &#949;&#7503; = [0.3, -0.1, 0.5]. Our model predicts &#949;&#770; = [0.28, -0.12, 0.48]. The MSE loss is:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathcal{L} = \\frac{1}{3}\\left[(0.3 - 0.28)^2 + (-0.1 - (-0.12))^2 + (0.5 - 0.48)^2\\right] = \\frac{1}{3}[0.0004 + 0.0004 + 0.0004] = 0.0004&quot;,&quot;id&quot;:&quot;KCBBGLWTAM&quot;}" data-component-name="LatexBlockToDOM"></div><p>This is a very small loss, which means our model is predicting the noise accurately. This is exactly what we want.</p><p>At inference time, we start from pure Gaussian noise and iteratively denoise:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;A^{k-1}_t = \\alpha \\left( A^k_t - \\gamma \\, \\epsilon_\\theta(O_t, A^k_t, k) + \\mathcal{N}(0, \\sigma^2 I) \\right)&quot;,&quot;id&quot;:&quot;YDLYDARLVW&quot;}" data-component-name="LatexBlockToDOM"></div><p>Let us walk through one denoising step with concrete numbers. Suppose A&#7503;&#8348; = 1.5 (a noisy action value), our network predicts &#949;_&#952; = 0.8, and the denoising parameters are &#945; = 0.99, &#947; = 0.5, and &#963; = 0.1. The random noise sample is z = 0.05:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;A^{k-1}_t = 0.99 \\times (1.5 - 0.5 \\times 0.8 + 0.1 \\times 0.05) = 0.99 \\times (1.5 - 0.4 + 0.005) = 0.99 \\times 1.105 = 1.094&quot;,&quot;id&quot;:&quot;KWJXKLVXYS&quot;}" data-component-name="LatexBlockToDOM"></div><p>Notice how the action value moved from 1.5 toward a less noisy value of 1.094. The network&#8217;s noise prediction of 0.8 told us that a large part of the current value was noise, so we subtracted it out. 
After many such steps, we arrive at a clean action sequence.</p><div><hr></div><h2>Two Architectures: CNN and Transformer</h2><p>Now the question is: what does the noise prediction network &#949;_&#952; actually look like inside?</p><p>The Diffusion Policy paper proposes two architectures.</p><h3>CNN-Based Diffusion Policy</h3><p>The first architecture uses a <strong>1D temporal convolutional network</strong>. Think of it as a series of convolution layers that slide along the time axis of the action sequence.</p><p>The clever part is how observations are injected. Rather than concatenating observation features to the input, the authors use <strong>FiLM conditioning</strong> (Feature-wise Linear Modulation). At every convolutional layer, the observation features scale and shift the hidden activations:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;h' = \\gamma(O_t) \\odot h + \\beta(O_t)&quot;,&quot;id&quot;:&quot;WDAVGNAQFJ&quot;}" data-component-name="LatexBlockToDOM"></div><p>Here, h is the hidden activation from the convolution layer, &#947;(O&#8348;) and &#946;(O&#8348;) are learned functions of the observation that produce per-channel scale and shift values, and &#8857; denotes element-wise multiplication.</p><p>Let us plug in some simple numbers. Suppose a convolutional layer produces a hidden value of h = 2.0 for one channel. The observation features generate &#947; = 1.5 and &#946; = -0.3 for this channel:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;h' = 1.5 \\times 2.0 + (-0.3) = 3.0 - 0.3 = 2.7&quot;,&quot;id&quot;:&quot;FSCVHUKVXA&quot;}" data-component-name="LatexBlockToDOM"></div><p>The observation has modulated the hidden feature from 2.0 to 2.7. Different observations will produce different &#947; and &#946; values, allowing the network to condition its denoising behavior on what the robot currently sees. 
This is exactly what we want.</p><p>For the visual encoder, the authors use a <strong>ResNet-18</strong> backbone with two modifications: they replace global average pooling with spatial softmax pooling (which preserves spatial information about where features are located), and they swap BatchNorm for GroupNorm (which is more stable when using exponential moving average of weights during training).</p><h3>Transformer-Based Diffusion Policy</h3><p>The second architecture uses <strong>Transformer decoder blocks</strong>, inspired by minGPT. The noisy action sequence is treated as a sequence of tokens, with the diffusion timestep k prepended as a special token. The observation features are provided as a separate sequence that the action tokens attend to via <strong>cross-attention</strong>.</p><p>The processing flow is: noisy action tokens pass through self-attention (with causal masking), then cross-attend to the observation tokens, then produce the noise prediction through a feed-forward layer.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MQcw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F270de229-8170-4e71-b745-5e14346368e0_1408x768.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!MQcw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F270de229-8170-4e71-b745-5e14346368e0_1408x768.png 424w, https://substackcdn.com/image/fetch/$s_!MQcw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F270de229-8170-4e71-b745-5e14346368e0_1408x768.png 848w, 
https://substackcdn.com/image/fetch/$s_!MQcw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F270de229-8170-4e71-b745-5e14346368e0_1408x768.png 1272w, https://substackcdn.com/image/fetch/$s_!MQcw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F270de229-8170-4e71-b745-5e14346368e0_1408x768.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!MQcw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F270de229-8170-4e71-b745-5e14346368e0_1408x768.png" width="1408" height="768" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/270de229-8170-4e71-b745-5e14346368e0_1408x768.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1408,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;CNN uses FiLM conditioning; Transformer uses cross-attention.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="CNN uses FiLM conditioning; Transformer uses cross-attention." title="CNN uses FiLM conditioning; Transformer uses cross-attention." 
srcset="https://substackcdn.com/image/fetch/$s_!MQcw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F270de229-8170-4e71-b745-5e14346368e0_1408x768.png 424w, https://substackcdn.com/image/fetch/$s_!MQcw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F270de229-8170-4e71-b745-5e14346368e0_1408x768.png 848w, https://substackcdn.com/image/fetch/$s_!MQcw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F270de229-8170-4e71-b745-5e14346368e0_1408x768.png 1272w, https://substackcdn.com/image/fetch/$s_!MQcw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F270de229-8170-4e71-b745-5e14346368e0_1408x768.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p>Which one should you use? The CNN-based architecture is the default workhorse &#8212; it is faster to train, requires less hyperparameter tuning, and works well on most tasks. The Transformer-based architecture shines on tasks with high-frequency action changes, such as velocity control, where the temporal convolutions of the CNN tend to over-smooth the predictions.</p><div><hr></div><h2>Why Diffusion Policy Handles Multimodality</h2><p>Let us come back to our original question: how does Diffusion Policy solve the multimodal action problem?</p><p>The key insight comes from an energy-based interpretation. The noise prediction network &#949;_&#952; implicitly learns the gradient of an energy function over the action space:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\nabla_a \\log p(a \\mid o) = -\\nabla_a E_\\theta(a, o)&quot;,&quot;id&quot;:&quot;COSXEGRYDZ&quot;}" data-component-name="LatexBlockToDOM"></div><p>This equation tells us that moving against the energy gradient, which is what subtracting the predicted noise does at each denoising step, moves the action toward regions of high probability. The beautiful thing is that this works without ever needing to compute a normalizing constant, a notoriously difficult problem for energy-based models.</p><p>Let us see this with a simple 1D example. Suppose for a given observation, there are two valid actions: a = -1 (go left) and a = +1 (go right). The energy landscape might look like a surface with two valleys &#8212; one at a = -1 and one at a = +1 &#8212; with a high ridge at a = 0.</p><p>If we start from a random noise sample, say a&#7479; = 0.3, the gradient will point toward the nearest valley. 
After many denoising steps, the action will settle into the valley at a = +1.</p><p>If we start from a&#7479; = -0.4, the gradient points the other way, and the action settles into the valley at a = -1.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HSaS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3a220bb-fcdb-4b70-8abf-fbcaefb678dc_1408x768.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HSaS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3a220bb-fcdb-4b70-8abf-fbcaefb678dc_1408x768.png 424w, https://substackcdn.com/image/fetch/$s_!HSaS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3a220bb-fcdb-4b70-8abf-fbcaefb678dc_1408x768.png 848w, https://substackcdn.com/image/fetch/$s_!HSaS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3a220bb-fcdb-4b70-8abf-fbcaefb678dc_1408x768.png 1272w, https://substackcdn.com/image/fetch/$s_!HSaS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3a220bb-fcdb-4b70-8abf-fbcaefb678dc_1408x768.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HSaS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3a220bb-fcdb-4b70-8abf-fbcaefb678dc_1408x768.png" width="1408" height="768" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e3a220bb-fcdb-4b70-8abf-fbcaefb678dc_1408x768.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1408,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Two valleys in the energy landscape &#8212; initial noise determines which mode is selected.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Two valleys in the energy landscape &#8212; initial noise determines which mode is selected." title="Two valleys in the energy landscape &#8212; initial noise determines which mode is selected." srcset="https://substackcdn.com/image/fetch/$s_!HSaS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3a220bb-fcdb-4b70-8abf-fbcaefb678dc_1408x768.png 424w, https://substackcdn.com/image/fetch/$s_!HSaS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3a220bb-fcdb-4b70-8abf-fbcaefb678dc_1408x768.png 848w, https://substackcdn.com/image/fetch/$s_!HSaS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3a220bb-fcdb-4b70-8abf-fbcaefb678dc_1408x768.png 1272w, https://substackcdn.com/image/fetch/$s_!HSaS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3a220bb-fcdb-4b70-8abf-fbcaefb678dc_1408x768.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p>This is the fundamental difference from a point-estimate policy. The point-estimate policy would output a = 0 &#8212; the average of the two modes &#8212; which is not a valid action. The diffusion policy starts from random noise and naturally &#8220;falls into&#8221; one of the valid modes.</p><p>And here is what makes this even more elegant: the model <strong>commits to a single mode within each rollout</strong>. Because the entire action sequence is denoised together, all 16 predicted timesteps land in the same mode. 
You will never see a trajectory that starts going left and then suddenly switches to going right mid-execution.</p><div><hr></div><h2>Receding Horizon Control: Smooth and Reactive</h2><p>There is one more important piece of the puzzle: how does the robot actually execute these predicted actions in the real world?</p><p>The authors use a strategy called <strong>receding horizon control</strong>. The idea is simple but powerful:</p><ol><li><p>The policy observes the current state and predicts T&#8346; = 16 future action steps</p></li><li><p>The robot executes only the first T&#8336; = 8 steps</p></li><li><p>After executing 8 steps, the robot re-observes the world and generates a fresh prediction of 16 steps</p></li><li><p>Repeat</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3F8u!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28f29f62-89f1-4b6d-bb12-2bb4a85c9914_1408x768.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3F8u!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28f29f62-89f1-4b6d-bb12-2bb4a85c9914_1408x768.png 424w, https://substackcdn.com/image/fetch/$s_!3F8u!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28f29f62-89f1-4b6d-bb12-2bb4a85c9914_1408x768.png 848w, https://substackcdn.com/image/fetch/$s_!3F8u!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28f29f62-89f1-4b6d-bb12-2bb4a85c9914_1408x768.png 1272w, 
https://substackcdn.com/image/fetch/$s_!3F8u!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28f29f62-89f1-4b6d-bb12-2bb4a85c9914_1408x768.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3F8u!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28f29f62-89f1-4b6d-bb12-2bb4a85c9914_1408x768.png" width="1408" height="768" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/28f29f62-89f1-4b6d-bb12-2bb4a85c9914_1408x768.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1408,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Predict long, execute short, replan often.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Predict long, execute short, replan often." title="Predict long, execute short, replan often." 
srcset="https://substackcdn.com/image/fetch/$s_!3F8u!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28f29f62-89f1-4b6d-bb12-2bb4a85c9914_1408x768.png 424w, https://substackcdn.com/image/fetch/$s_!3F8u!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28f29f62-89f1-4b6d-bb12-2bb4a85c9914_1408x768.png 848w, https://substackcdn.com/image/fetch/$s_!3F8u!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28f29f62-89f1-4b6d-bb12-2bb4a85c9914_1408x768.png 1272w, https://substackcdn.com/image/fetch/$s_!3F8u!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F28f29f62-89f1-4b6d-bb12-2bb4a85c9914_1408x768.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p>Why not just predict one action at a time? Because single-step prediction leads to jerky, inconsistent motions. The robot might oscillate between modes at every timestep. By predicting a long sequence, the diffusion model ensures temporal consistency &#8212; all actions within one prediction are coherent and smooth.</p><p>Why not execute all 16 predicted steps? Because the world changes. If the robot gets bumped, or an object shifts unexpectedly, we want to react. By re-observing and replanning every 8 steps, the robot stays responsive to disturbances.</p><p>This creates a beautiful balance: <strong>long prediction horizon for consistency, short execution horizon for reactivity.</strong></p><p>The authors also found that using <strong>position control</strong> (predicting target joint positions) is much more robust than velocity control when dealing with computational latency. With position targets, the robot can still reach approximately the right pose even if the next action arrives a few milliseconds late. With velocity commands, even a small delay causes the robot to overshoot.</p><div><hr></div><h2>Practical Implementation</h2><p>Enough theory &#8212; let us look at some practical implementation now.</p><p>We will implement the two core operations: (1) adding noise to a clean action sequence during training, and (2) the denoising loop during inference.</p><p>First, let us set up the noise schedule and the forward diffusion process:</p><pre><code><code>import torch
import torch.nn as nn

# --- Noise Schedule ---
def cosine_beta_schedule(T, s=0.008):
    """Cosine noise schedule (improved over linear)."""
    steps = torch.arange(T + 1, dtype=torch.float32)
    alpha_bar = torch.cos((steps / T + s) / (1 + s) * torch.pi / 2) ** 2
    alpha_bar = alpha_bar / alpha_bar[0]
    betas = 1 - alpha_bar[1:] / alpha_bar[:-1]
    return torch.clamp(betas, max=0.999)

# --- Forward Diffusion (Training) ---
def add_noise(clean_actions, k, alpha_bar):
    """Add noise to clean actions at diffusion step k."""
    noise = torch.randn_like(clean_actions)
    sqrt_alpha_bar = torch.sqrt(alpha_bar[k]).view(-1, 1, 1)
    sqrt_one_minus = torch.sqrt(1 - alpha_bar[k]).view(-1, 1, 1)
    noisy_actions = sqrt_alpha_bar * clean_actions + sqrt_one_minus * noise
    return noisy_actions, noise
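
# --- Cumulative schedule and training loss (illustrative sketch, not from the original post) ---
# Assumption: add_noise expects alpha_bar as the running product of (1 - beta),
# with one entry per diffusion step.
def make_alpha_bar(betas):
    """Turn per-step betas into cumulative alpha_bar values."""
    return torch.cumprod(1.0 - betas, dim=0)

# Assumption: `model(obs_features, noisy_actions, k)` is a hypothetical noise
# predictor whose output has the same shape as the actions.
def diffusion_loss(model, obs_features, clean_actions, alpha_bar):
    """MSE between the noise that was added and the noise the model predicts."""
    batch = clean_actions.shape[0]
    # Sample one diffusion step per batch element.
    k = torch.randint(0, alpha_bar.shape[0], (batch,), device=alpha_bar.device)
    noisy_actions, noise = add_noise(clean_actions, k, alpha_bar)
    return nn.functional.mse_loss(model(obs_features, noisy_actions, k), noise)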
</code></code></pre><p>Let us understand this code in detail. The <code>cosine_beta_schedule</code> function computes a noise schedule in which the cumulative signal level (<code>alpha_bar</code>) follows a squared-cosine curve, which gives a smoother transition than a linear schedule. It returns the per-step <code>betas</code>; under the standard DDPM definition, the <code>alpha_bar</code> tensor that the other functions take as input is the cumulative product of <code>1 - betas</code>. The <code>add_noise</code> function takes a clean action sequence, a diffusion step k for each batch element, and <code>alpha_bar</code>, and applies the forward diffusion equation we saw earlier. It returns both the noisy actions (which the network will see as input) and the true noise (which the network must learn to predict).</p><p>Now, let us implement the denoising loop used during inference:</p><pre><code><code># --- Denoising Loop (Inference) ---
@torch.no_grad()
def denoise_actions(model, obs_features, alpha_bar, T=100):
    """Generate actions by iteratively denoising from pure noise."""
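    # Recover per-step alphas from the cumulative schedule.
    # Assumes alpha_bar has T entries, with alpha_bar before step 0 equal to 1.
    alpha_bar_prev = torch.cat([alpha_bar.new_ones(1), alpha_bar[:-1]])
    alphas = alpha_bar / alpha_bar_prev
    betas = 1 - alphas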

    # Start from pure Gaussian noise: shape (1, T_p, action_dim)
    action = torch.randn(1, 16, 6)  # 16 steps, 6-DOF robot

    for k in reversed(range(T)):
        # Predict the noise at this step
        predicted_noise = model(obs_features, action, k)

        # Remove predicted noise (simplified DDPM update)
        alpha_k = alphas[k]
        alpha_bar_k = alpha_bar[k]
        action = (1 / torch.sqrt(alpha_k)) * (
            action - (betas[k] / torch.sqrt(1 - alpha_bar_k)) * predicted_noise
        )

        # Add small noise (except at final step)
        if k &gt; 0:
            action += torch.sqrt(betas[k]) * torch.randn_like(action)

    return action  # Clean action sequence: shape (1, 16, 6)
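
# --- Receding-horizon execution (illustrative sketch, not from the original post) ---
# Assumptions: `plan_fn(obs)` is a hypothetical wrapper that returns a sequence of
# T_p actions (for example, denoise_actions(...)[0]); `step_fn(action)` is a
# hypothetical robot interface that executes one action and returns the new observation.
def receding_horizon_rollout(plan_fn, step_fn, obs, total_steps, t_a=8):
    """Execute only the first t_a actions of each plan, then re-observe and replan."""
    executed = []
    rounds = -(-total_steps // t_a)  # ceiling division
    for _ in range(rounds):
        plan = plan_fn(obs)
        budget = min(t_a, total_steps - len(executed))
        for action in plan[:budget]:
            obs = step_fn(action)
            executed.append(action)
    return executed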
</code></code></pre><p>This is the heart of Diffusion Policy inference. We start from pure Gaussian noise with shape <code>(1, 16, 6)</code>: a batch of one, 16 timesteps, 6 degrees of freedom. Then we loop backward from step T - 1 down to step 0. At each step, the model predicts the noise component, we subtract it out (scaled appropriately), and, at every step except the last, add a small amount of fresh noise for stochasticity. After all T steps, we have a clean, coherent action sequence ready to send to the robot.</p><p>Notice how compact this is. The entire inference loop is just a few lines of code. The complexity lives inside the <code>model</code>, which is either the 1D CNN or the Transformer architecture we discussed earlier.</p><div><hr></div><h2>Results: How Well Does It Work?</h2><p>Now let us see how Diffusion Policy performs across different tasks.</p><p>The authors evaluated the method on <strong>15 tasks from 4 different benchmarks</strong>, in both simulation and the real world. The results are remarkable.</p><p><strong>Simulation Results (RoboMimic):</strong></p><p>On standard manipulation tasks like lifting, picking, and placing, most methods do well. But the gap becomes dramatic on harder tasks:</p><ul><li><p><strong>ToolHang</strong> (hang a tool on a hook &#8212; requires precise multimodal grasping): Diffusion Policy CNN achieves <strong>93%</strong> success. Implicit Behavioral Cloning (IBC) achieves <strong>0%</strong>. Behavior Transformer (BET) achieves <strong>52%</strong>. 
The task requires the robot to choose between multiple valid grasp orientations &#8212; exactly the multimodal scenario where point estimates fail.</p></li><li><p><strong>Transport</strong> (bimanual manipulation &#8212; two arms must coordinate): Diffusion Policy achieves <strong>96%</strong> success, outperforming all baselines.</p></li><li><p><strong>Kitchen</strong> (multi-stage cooking tasks): Diffusion Policy achieves <strong>96-99%</strong> on the hardest metric, compared to <strong>44%</strong> for BET. That is <strong>more than double</strong> BET&#8217;s success rate.</p></li></ul><p><strong>Real-World Results:</strong></p><p>The Push-T task we discussed at the beginning? Diffusion Policy achieves <strong>95% success</strong> with a coverage IoU of 0.80. LSTM-GMM achieves only 20% success. IBC achieves 0%. The human baseline is 100% success with 0.84 IoU &#8212; Diffusion Policy is remarkably close to human performance.</p><p>On more complex real-world tasks:</p><ul><li><p><strong>Sauce pouring:</strong> 79% success with 0.74 IoU (human: 0.79 IoU)</p></li><li><p><strong>Sauce spreading:</strong> 100% success with 0.77 coverage (human: 0.79)</p></li><li><p><strong>Shirt folding</strong> (bimanual): 75% success from 284 demonstrations</p></li><li><p><strong>Mug flipping</strong> (7-DOF with complex 3D rotations): 90% success</p></li></ul>
<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eP7V!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0544886f-e609-45ee-a049-86ac476bf120_1980x1178.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!eP7V!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0544886f-e609-45ee-a049-86ac476bf120_1980x1178.png 424w, 
https://substackcdn.com/image/fetch/$s_!eP7V!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0544886f-e609-45ee-a049-86ac476bf120_1980x1178.png 848w, https://substackcdn.com/image/fetch/$s_!eP7V!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0544886f-e609-45ee-a049-86ac476bf120_1980x1178.png 1272w, https://substackcdn.com/image/fetch/$s_!eP7V!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0544886f-e609-45ee-a049-86ac476bf120_1980x1178.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!eP7V!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0544886f-e609-45ee-a049-86ac476bf120_1980x1178.png" width="1456" height="866" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0544886f-e609-45ee-a049-86ac476bf120_1980x1178.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:866,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Diffusion Policy outperforms prior methods, especially on hard tasks.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Diffusion Policy outperforms prior methods, especially on hard tasks." title="Diffusion Policy outperforms prior methods, especially on hard tasks." 
srcset="https://substackcdn.com/image/fetch/$s_!eP7V!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0544886f-e609-45ee-a049-86ac476bf120_1980x1178.png 424w, https://substackcdn.com/image/fetch/$s_!eP7V!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0544886f-e609-45ee-a049-86ac476bf120_1980x1178.png 848w, https://substackcdn.com/image/fetch/$s_!eP7V!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0544886f-e609-45ee-a049-86ac476bf120_1980x1178.png 1272w, https://substackcdn.com/image/fetch/$s_!eP7V!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0544886f-e609-45ee-a049-86ac476bf120_1980x1178.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Across all tasks, Diffusion Policy achieves an <strong>average improvement of 46.9%</strong> over the previous state-of-the-art. Not bad, right?</p><div><hr></div><h2>Conclusion</h2><p>Let us summarize the three key ideas behind Diffusion Policy:</p><ol><li><p><strong>Represent robot policies as conditional denoising diffusion processes.</strong> Instead of predicting a single action, sample from the full action distribution by iteratively denoising random noise, conditioned on observations.</p></li><li><p><strong>Predict action sequences, not single actions.</strong> By generating an entire trajectory at once, the policy produces temporally consistent, smooth motions that commit to one mode.</p></li><li><p><strong>Use receding horizon control for execution.</strong> Predict long (T&#8346; = 16), execute short (T&#8336; = 8), replan often. This balances consistency with reactivity.</p></li></ol><p>Diffusion Policy has become a foundational method in modern robot learning. 
Its ability to handle multimodal demonstrations, produce smooth trajectories, and work reliably in the real world has made it the default choice for many imitation learning systems that followed.</p><p>Here is the link to the original paper: Chi et al., &#8220;Diffusion Policy: Visuomotor Policy Learning via Action Diffusion&#8221; (2023)</p><p>Project page: https://diffusion-policy.cs.columbia.edu/</p><p><strong>References:</strong></p><ul><li><p>Chi et al., &#8220;Diffusion Policy: Visuomotor Policy Learning via Action Diffusion&#8221; (2023)</p></li><li><p>Ho et al., &#8220;Denoising Diffusion Probabilistic Models&#8221; (2020)</p></li><li><p>Janner et al., &#8220;Planning with Diffusion for Flexible Behavior Synthesis&#8221; (2022)</p></li><li><p>Florence et al., &#8220;Implicit Behavioral Cloning&#8221; (2022)</p></li><li><p>Shafiullah et al., &#8220;Behavior Transformers: Cloning k Modes with One Stone&#8221; (2022)</p></li><li><p>Perez et al., &#8220;FiLM: Visual Reasoning with a General Conditioning Layer&#8221; (2018)</p></li></ul><p>That&#8217;s it!</p><p>If you like this content, please check out our bootcamps on the following topics:</p><p><strong>Modern Robot Learning</strong>: <a href="https://robotlearningbootcamp.vizuara.ai/">https://robotlearningbootcamp.vizuara.ai/</a></p><p><strong>GenAI</strong>: <a href="https://flyvidesh.online/gen-ai-professional-bootcamp">https://flyvidesh.online/gen-ai-professional-bootcamp</a></p><p><strong>RL</strong>:  <a href="https://rlresearcherbootcamp.vizuara.ai/">https://rlresearcherbootcamp.vizuara.ai/</a></p><p><strong>SciML</strong>: <a href="https://flyvidesh.online/ml-bootcamp">https://flyvidesh.online/ml-bootcamp</a></p><p><strong>ML-DL</strong>: <a href="https://flyvidesh.online/ml-dl-bootcamp">https://flyvidesh.online/ml-dl-bootcamp</a></p><p><strong>CV</strong>: <a href="https://cvresearchbootcamp.vizuara.ai/">https://cvresearchbootcamp.vizuara.ai/</a></p><div class="subscription-widget-wrap-editor" 
data-attrs="{&quot;url&quot;:&quot;https://www.vizuaranewsletter.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Vizuara&#8217;s AI Newsletter! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[How to run Robotics simulations in Maniskill Environment?]]></title><description><![CDATA[Getting started with Robotics simulation environments!]]></description><link>https://www.vizuaranewsletter.com/p/how-to-run-robotics-simulations-in</link><guid isPermaLink="false">https://www.vizuaranewsletter.com/p/how-to-run-robotics-simulations-in</guid><dc:creator><![CDATA[Dr Rajat Dandekar]]></dc:creator><pubDate>Sun, 08 Feb 2026 06:27:36 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!CB0n!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0384876a-95a3-4eef-bc8d-e5ac4289532d_4800x2161.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I am documenting my experience with the ManiSkill environment in this article.</p><p>I was looking for a simulation environment in which I could easily simulate robot policies. 
I came across ManiSkill while listening to this podcast.</p><div id="youtube2-fpOCPQB2spM" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;fpOCPQB2spM&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/fpOCPQB2spM?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>Here, Stone Tao explains his journey and the motivations behind building ManiSkill.</p><p>First, let us understand the basics. Any simulation environment needs two things:</p><h3>(1) The Physics Engine</h3><p>The Physics Engine is the part of the simulator that computes how objects move and collide, so that robots can learn how to interact with physical objects.</p><p>ManiSkill uses a physics engine called SAPIEN.</p><p>While many simulators focus on robots walking or navigating (like a Roomba avoiding a wall), SAPIEN focuses on <strong>manipulation</strong> - using a robot arm to open drawers, turn faucets, or pick up cups.</p><ul><li><p><strong>What it does:</strong> It calculates the physics of the world. If a robot arm bumps a table, SAPIEN calculates how the table moves, how the objects on it shake, and how friction slows them down.</p></li><li><p><strong>The &#8220;Basics&#8221; Analogy:</strong> Think of SAPIEN as a <strong>video game engine</strong> (like Unity or Unreal Engine) but built specifically for robotics research. 
It simulates the gravity, friction, and &#8220;touch&#8221; that a robot needs to understand the world.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lCWn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fe588c4-c4db-4a19-a3ce-a0d4813a1180_1200x600.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lCWn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fe588c4-c4db-4a19-a3ce-a0d4813a1180_1200x600.png 424w, https://substackcdn.com/image/fetch/$s_!lCWn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fe588c4-c4db-4a19-a3ce-a0d4813a1180_1200x600.png 848w, https://substackcdn.com/image/fetch/$s_!lCWn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fe588c4-c4db-4a19-a3ce-a0d4813a1180_1200x600.png 1272w, https://substackcdn.com/image/fetch/$s_!lCWn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fe588c4-c4db-4a19-a3ce-a0d4813a1180_1200x600.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lCWn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fe588c4-c4db-4a19-a3ce-a0d4813a1180_1200x600.png" width="1200" height="600" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3fe588c4-c4db-4a19-a3ce-a0d4813a1180_1200x600.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:600,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;GitHub - haosulab/SAPIEN: SAPIEN Embodied AI Platform&quot;,&quot;title&quot;:&quot;GitHub - haosulab/SAPIEN: SAPIEN Embodied AI Platform&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="GitHub - haosulab/SAPIEN: SAPIEN Embodied AI Platform" title="GitHub - haosulab/SAPIEN: SAPIEN Embodied AI Platform" srcset="https://substackcdn.com/image/fetch/$s_!lCWn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fe588c4-c4db-4a19-a3ce-a0d4813a1180_1200x600.png 424w, https://substackcdn.com/image/fetch/$s_!lCWn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fe588c4-c4db-4a19-a3ce-a0d4813a1180_1200x600.png 848w, https://substackcdn.com/image/fetch/$s_!lCWn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fe588c4-c4db-4a19-a3ce-a0d4813a1180_1200x600.png 1272w, https://substackcdn.com/image/fetch/$s_!lCWn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fe588c4-c4db-4a19-a3ce-a0d4813a1180_1200x600.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h3>(2) High-Level Framework</h3><p><strong>ManiSkill</strong> is a collection of tasks and data built <em>on top</em> of SAPIEN.</p><p>Just having a physics engine (SAPIEN) isn&#8217;t enough; you need specific goals to train the robot. ManiSkill provides those goals. It includes difficult tasks like &#8220;pick up this pen,&#8221; &#8220;pour water from this bucket,&#8221; or &#8220;assemble this chair.&#8221;</p><ul><li><p><strong>What it does:</strong> It acts as a standardized test. 
It provides thousands of demonstrations (digital recordings of a task being done correctly) so researchers can train their AI, and then it scores the AI on how well it can repeat the task.</p></li><li><p><strong>The &#8220;Basics&#8221; Analogy:</strong> If SAPIEN is the <strong>gym building</strong> with all the equipment, ManiSkill is the <strong>personal trainer</strong> who gives you a specific workout plan and tracks your reps.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CB0n!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0384876a-95a3-4eef-bc8d-e5ac4289532d_4800x2161.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CB0n!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0384876a-95a3-4eef-bc8d-e5ac4289532d_4800x2161.jpeg 424w, https://substackcdn.com/image/fetch/$s_!CB0n!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0384876a-95a3-4eef-bc8d-e5ac4289532d_4800x2161.jpeg 848w, https://substackcdn.com/image/fetch/$s_!CB0n!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0384876a-95a3-4eef-bc8d-e5ac4289532d_4800x2161.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!CB0n!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0384876a-95a3-4eef-bc8d-e5ac4289532d_4800x2161.jpeg 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!CB0n!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0384876a-95a3-4eef-bc8d-e5ac4289532d_4800x2161.jpeg" width="1456" height="656" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0384876a-95a3-4eef-bc8d-e5ac4289532d_4800x2161.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:656,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;User Guide &#8212; ManiSkill 3.0.0b22 documentation&quot;,&quot;title&quot;:&quot;User Guide &#8212; ManiSkill 3.0.0b22 documentation&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="User Guide &#8212; ManiSkill 3.0.0b22 documentation" title="User Guide &#8212; ManiSkill 3.0.0b22 documentation" srcset="https://substackcdn.com/image/fetch/$s_!CB0n!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0384876a-95a3-4eef-bc8d-e5ac4289532d_4800x2161.jpeg 424w, https://substackcdn.com/image/fetch/$s_!CB0n!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0384876a-95a3-4eef-bc8d-e5ac4289532d_4800x2161.jpeg 848w, https://substackcdn.com/image/fetch/$s_!CB0n!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0384876a-95a3-4eef-bc8d-e5ac4289532d_4800x2161.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!CB0n!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0384876a-95a3-4eef-bc8d-e5ac4289532d_4800x2161.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Most importantly, ManiSkill supports <strong>GPU parallelization</strong>. Instead of running one robot simulation at a time, you can run 2,000+ simultaneously. 
This means you can collect millions of training steps in minutes rather than days.</p><p>Now, let us see how to get started with ManiSkill!</p><div><hr></div><h3>Getting Started with ManiSkill</h3><h4>Step 1: Installation</h4><p>First, you need to install the library and its dependencies. ManiSkill is built on the SAPIEN engine and integrates with PyTorch.</p><p>Run this command in your terminal or notebook (the PyPI package name is <code>mani_skill</code>):</p><pre><code><code>pip install --upgrade mani_skill torch</code></code></pre><p><em>Note: You may also need to set up Vulkan (the graphics API) drivers if you are running this locally.</em></p><h4>Step 2: Run a Basic Environment (CPU)</h4><p>Before we go fast, let&#8217;s go slow. We&#8217;ll run a single environment on the CPU just to see how the API works. If you&#8217;ve used OpenAI Gym or Gymnasium before, this will feel like home.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JFbu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b6283bf-bf3a-4405-8313-d0299df44970_1280x640.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JFbu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b6283bf-bf3a-4405-8313-d0299df44970_1280x640.png 424w, https://substackcdn.com/image/fetch/$s_!JFbu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b6283bf-bf3a-4405-8313-d0299df44970_1280x640.png 848w, https://substackcdn.com/image/fetch/$s_!JFbu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b6283bf-bf3a-4405-8313-d0299df44970_1280x640.png 1272w, 
https://substackcdn.com/image/fetch/$s_!JFbu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b6283bf-bf3a-4405-8313-d0299df44970_1280x640.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JFbu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b6283bf-bf3a-4405-8313-d0299df44970_1280x640.png" width="1280" height="640" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5b6283bf-bf3a-4405-8313-d0299df44970_1280x640.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:640,&quot;width&quot;:1280,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;GitHub - Farama-Foundation/Gymnasium: An API standard for single-agent  reinforcement learning environments, with popular reference environments  and related utilities (formerly Gym)&quot;,&quot;title&quot;:&quot;GitHub - Farama-Foundation/Gymnasium: An API standard for single-agent  reinforcement learning environments, with popular reference environments  and related utilities (formerly Gym)&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="GitHub - Farama-Foundation/Gymnasium: An API standard for single-agent  reinforcement learning environments, with popular reference environments  and related utilities (formerly Gym)" title="GitHub - Farama-Foundation/Gymnasium: An API standard for single-agent  reinforcement learning environments, with popular reference environments  and related utilities (formerly Gym)" 
srcset="https://substackcdn.com/image/fetch/$s_!JFbu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b6283bf-bf3a-4405-8313-d0299df44970_1280x640.png 424w, https://substackcdn.com/image/fetch/$s_!JFbu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b6283bf-bf3a-4405-8313-d0299df44970_1280x640.png 848w, https://substackcdn.com/image/fetch/$s_!JFbu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b6283bf-bf3a-4405-8313-d0299df44970_1280x640.png 1272w, https://substackcdn.com/image/fetch/$s_!JFbu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b6283bf-bf3a-4405-8313-d0299df44970_1280x640.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>We will load the <code>PegInsertionSide-v1</code> task, where a robot arm tries to insert a peg into a hole sideways - a classic precision task.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qq6r!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6434ce4b-91a9-41ca-a392-29b5374aebd9_256x256.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qq6r!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6434ce4b-91a9-41ca-a392-29b5374aebd9_256x256.png 424w, https://substackcdn.com/image/fetch/$s_!qq6r!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6434ce4b-91a9-41ca-a392-29b5374aebd9_256x256.png 848w, https://substackcdn.com/image/fetch/$s_!qq6r!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6434ce4b-91a9-41ca-a392-29b5374aebd9_256x256.png 1272w, https://substackcdn.com/image/fetch/$s_!qq6r!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6434ce4b-91a9-41ca-a392-29b5374aebd9_256x256.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!qq6r!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6434ce4b-91a9-41ca-a392-29b5374aebd9_256x256.png" width="256" height="256" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6434ce4b-91a9-41ca-a392-29b5374aebd9_256x256.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:256,&quot;width&quot;:256,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Table-Top 2 Finger Gripper Tasks &#8212; ManiSkill 3.0.0b22 documentation&quot;,&quot;title&quot;:&quot;Table-Top 2 Finger Gripper Tasks &#8212; ManiSkill 3.0.0b22 documentation&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Table-Top 2 Finger Gripper Tasks &#8212; ManiSkill 3.0.0b22 documentation" title="Table-Top 2 Finger Gripper Tasks &#8212; ManiSkill 3.0.0b22 documentation" srcset="https://substackcdn.com/image/fetch/$s_!qq6r!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6434ce4b-91a9-41ca-a392-29b5374aebd9_256x256.png 424w, https://substackcdn.com/image/fetch/$s_!qq6r!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6434ce4b-91a9-41ca-a392-29b5374aebd9_256x256.png 848w, https://substackcdn.com/image/fetch/$s_!qq6r!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6434ce4b-91a9-41ca-a392-29b5374aebd9_256x256.png 1272w, 
https://substackcdn.com/image/fetch/$s_!qq6r!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6434ce4b-91a9-41ca-a392-29b5374aebd9_256x256.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong>Code to run the &#8220;Peg Insertion&#8221; task:</strong></p><pre><code><code>import gymnasium as gym
import mani_skill.envs

# Create the environment with a human-viewable render mode
env = gym.make("PegInsertionSide-v1", render_mode="human")
obs, _ = env.reset()

done = False
while not done:
    # Sample a random action (just flailing around for now)
    action = env.action_space.sample()
    
    # Step the environment
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
    
    # Render the scene
    env.render()

env.close()</code></code></pre><p><em>When you run this, you should see a window pop up with a robot arm moving randomly. It won&#8217;t succeed, but it proves your physics engine is alive.</em></p><p>Note how closely this code mirrors the Gymnasium API. We are using the same calls: <code>gym.make</code>, <code>env.reset()</code>, <code>env.action_space.sample()</code>, <code>env.step()</code>, and so on.</p><p>Have a look at this article, where I explain the Gymnasium environment separately with some practical examples: </p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;c15ab137-4323-41e4-a7b5-9787041216f6&quot;,&quot;caption&quot;:&quot;In this lecture, we began our journey by first understanding classical reinforcement learning.&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Hands-on RL Bootcamp Lecture 1&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:290241614,&quot;name&quot;:&quot;Vizuara AI&quot;,&quot;bio&quot;:&quot;Deep dive into AI. Deep content. No fluff. 
&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f1a48e78-8537-4335-aee7-95b52957e861_3456x3456.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2025-08-01T10:25:41.889Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/$s_!zmaD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98ed7c5c-7c63-4406-92b5-d8863ae17eb2_1000x662.jpeg&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://vizuara.substack.com/p/hands-on-rl-bootcamp-lecture-1&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:169648519,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:20,&quot;comment_count&quot;:0,&quot;publication_id&quot;:3466476,&quot;publication_name&quot;:&quot;Vizuara&#8217;s Substack&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!D3Nd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1a48e78-8537-4335-aee7-95b52957e861_3456x3456.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><h4>Step 3: Unleashing the GPU (The &#8220;ManiSkill Magic&#8221;)</h4><p>Now, let&#8217;s do what ManiSkill was born to do: massive scaling.</p><p>We are going to switch to the <code>PickCube-v1</code> task and run <strong>2,048 environments</strong> at the same time.</p><pre><code><code>import gymnasium as gym
import mani_skill.envs
import torch

# Create 2048 parallel environments
# obs_mode="state" gives us physical data (coordinates) rather than images
env = gym.make("PickCube-v1", num_envs=2048, obs_mode="state")

obs, _ = env.reset()

# Generate random actions for ALL 2048 environments
# Note: ManiSkill expects a PyTorch tensor for actions on GPU
action_batch = torch.from_numpy(env.action_space.sample())

# Step all 2048 environments instantly
obs, reward, terminated, truncated, info = env.step(action_batch)
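
# --- Added sketch (an aside under stated assumptions, not part of the original snippet) ---
# A rough way to estimate throughput yourself instead of relying on logs.
# `measure_fps` is a hypothetical helper, not a ManiSkill API. It counts one
# "frame" per environment per step and uses plain wall-clock time, so treat
# the number it returns as an estimate only.
import time

def measure_fps(step_fn, num_envs, n_steps=100):
    # Call step_fn n_steps times and return environment steps per second.
    start = time.perf_counter()
    for _ in range(n_steps):
        step_fn()
    elapsed = time.perf_counter() - start
    return num_envs * n_steps / elapsed

# Example usage with the env created above (uncomment to run):
# fps = measure_fps(lambda: env.step(torch.from_numpy(env.action_space.sample())), num_envs=2048)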

print(f"Observation shape: {obs.shape}") 
# Output: torch.Size([2048, 42])</code></code></pre><p><em>If you check your performance logs, you&#8217;ll likely see this running at over <strong>20,000 Frames Per Second (FPS)</strong> on a standard GPU (like a T4 or RTX 3060). This speed is the &#8220;unfair advantage&#8221; of modern robot learning.</em></p><h4>Step 4: Training an Agent with PPO</h4><p>Flailing randomly is fun, but let&#8217;s actually teach the robot something.</p><p>ManiSkill comes with pre-baked training scripts for <strong>Proximal Policy Optimization (PPO)</strong>, a popular reinforcement learning algorithm. We can use these to solve the <code>PushCube-v1</code> task in just a few minutes.</p><p><strong>1. Get the training script:</strong> Download the high-performance PPO implementation provided by the ManiSkill team:</p><pre><code><code>wget https://raw.githubusercontent.com/haosulab/ManiSkill/main/examples/baselines/ppo/ppo.py</code></code></pre><p><strong>2. Start Training:</strong> Run the following command in your terminal. This will spin up 1,024 parallel environments and train for 600,000 steps.</p><pre><code><code>python ppo.py --env_id="PushCube-v1" \
  --num_envs=1024 \
  --update_epochs=8 \
  --num_minibatches=32 \
  --total_timesteps=600_000</code></code></pre><p><strong>What&#8217;s happening here?</strong></p><ul><li><p><strong>num_envs=1024:</strong> We are gathering experience from 1,024 &#8220;timelines&#8221; simultaneously.</p></li><li><p><strong>total_timesteps=600_000:</strong> The agent will practice 600k interactions. Because of parallelization, this might finish in <strong>less than 2 minutes</strong>.</p></li></ul><p>Once finished, the script will save a model checkpoint and often generate a video showing the robot successfully pushing the cube to its target.</p><h4>Step 5: Learning from Experts (Imitation Learning)</h4><p>Maybe you don&#8217;t want the robot to learn by trial and error. Maybe you want to <em>show</em> it what to do. This is called <strong>Imitation Learning</strong>, and ManiSkill has a massive dataset of expert demonstrations ready to download.</p><p><strong>Download the data:</strong></p><pre><code><code># Downloads expert demos for the Peg Insertion task
python -m mani_skill.utils.download_demo "PegInsertionSide-v1" -o demos</code></code></pre><p><strong>Watch the expert:</strong> You can replay these demonstrations to visualize &#8220;perfect&#8221; behavior:</p><pre><code><code>python -m mani_skill.trajectory.replay_trajectory \
  --traj-path demos/PegInsertionSide-v1/motionplanning/trajectory.h5 \
  --save-video --allow-failure</code></code></pre><p>This generates a video file (usually in the <code>videos</code> folder) where you can see exactly how the motion planner solves the puzzle.</p><p><strong>The takeaway:</strong></p><div class="pullquote"><p>ManiSkill lowers the barrier to entry for robotics research significantly. You no longer need a supercomputer to test advanced RL algorithms; you just need a laptop and a few lines of Python.</p></div><h2>Training a Robot using ACT Policy in Simulation</h2><p>If you&#8217;ve been following AI robotics lately, you&#8217;ve likely seen the <strong>Aloha</strong> robot: a pair of bimanual arms cooking shrimp or folding clothes. The &#8220;brain&#8221; behind that robot isn&#8217;t magic; it&#8217;s an architecture called <strong>ACT (Action Chunking with Transformers)</strong>.</p><p>But here is the problem: physical robots are expensive. They break. They are hard to reset.</p><p>The solution? <strong>Simulation.</strong></p><p>Today, I&#8217;m going to walk you through how to train a robot arm to perform a manipulation task (picking up a cube) from scratch, entirely in simulation, using a single GPU (like on RunPod). 
We will be using <strong>ManiSkill</strong>, a lightning-fast simulator, and the official ACT implementation.</p><h2>The Tech Stack</h2><p>We are going to use a pipeline that mimics a real-world workflow:</p><ol><li><p><strong>The Simulator:</strong> ManiSkill (uses GPU physics to run fast).</p></li><li><p><strong>The Brain:</strong> ACT (Imitation Learning that predicts &#8220;chunks&#8221; of future actions).</p></li><li><p><strong>The Task:</strong> <code>PickCube-v1</code> (The &#8220;Hello World&#8221; of robotics).</p></li></ol><div><hr></div><h2>Step 1: The Environment Setup</h2><p>Robotics simulation is notorious for &#8220;dependency hell&#8221;&#8212;specifically getting Vulkan drivers (for rendering images) to play nice with headless cloud GPUs.</p><p>If you are running this on a cloud provider like RunPod (recommended: RTX 3090 or 4090), you need to install system-level drivers before Python libraries.</p><p>I&#8217;ve condensed the setup into a script that handles the heavy lifting. It installs Vulkan, sets up the NVIDIA ICD files, and installs the specific versions of <code>torch</code> and <code>mani_skill</code> required.</p><p><strong>What&#8217;s happening under the hood?</strong> Normally, simulators run on a CPU. ManiSkill runs physics on the GPU. This allows us to collect data and train massively faster than real time.</p><div><hr></div><h2>Step 2: Getting &#8220;Expert&#8221; Data</h2><p>ACT is a form of <strong>Imitation Learning</strong>. This means we don&#8217;t punish or reward the robot (like in Reinforcement Learning); instead, we show it examples of a human (or a script) doing the job perfectly, and say, <em>&#8220;Do it like this.&#8221;</em></p><p>In the real world, you would control the robot via teleoperation to collect data. In ManiSkill, we can download pre-recorded &#8220;perfect&#8221; trajectories.</p><pre><code><code># This downloads the expert motion planning data
python -m mani_skill.utils.download_demo "PickCube-v1"</code></code></pre><h3>The &#8220;Vision&#8221; Problem</h3><p>The downloaded demos are usually stored as <strong>State</strong> data (exact XYZ coordinates of the cube). But we want our robot to see with cameras, just like a real robot would.</p><p>We use a process called <strong>Trajectory Replay</strong>. We take the coordinate data, replay it inside the simulator, and render the camera views (RGB + Depth) frame-by-frame. This creates a new dataset containing the video feed the robot <em>would have seen</em> if it were doing the task.</p><p>In the pipeline script, this command handles the conversion:</p><pre><code><code>python -m mani_skill.trajectory.replay_trajectory \
    --traj-path ~/.maniskill/demos/PickCube-v1/motionplanning/trajectory.h5 \
    --use-env-states \
    -o rgbd \
    --save-traj \
    --save-video</code></code></pre><p><em>Note: We use </em><code>--use-env-states</code><em> to ensure the replay is physically accurate to the original demo.</em></p><h2>Step 3: Action Chunking with Transformers (ACT)</h2><p>Now for the magic. Why is ACT better than standard behavioral cloning?</p><p>Standard policies look at an image and predict <strong>one</strong> step: <em>&#8220;Move left 1mm.&#8221;</em> ACT looks at an image and predicts a <strong>chunk</strong> of steps: <em>&#8220;Move left, then down, then close gripper.&#8221;</em></p><p>This &#8220;chunking&#8221; makes the robot&#8217;s motion smoother and less jittery. It uses a <strong>VAE (Variational Autoencoder)</strong> to compress the style of movement and a <strong>Transformer</strong> to predict the sequence.</p><h3>The Training Loop</h3><p>We utilize a unified training script (<code>train_act.sh</code>) that automates the process. For the <code>PickCube-v1</code> task, we don&#8217;t need a massive compute cluster.</p><p><strong>Key Hyperparameters:</strong></p><ul><li><p><strong>Demos:</strong> 100 trajectories is usually enough for this simple task.</p></li><li><p><strong>Iterations:</strong> 30,000 gradient steps (about 15-20 minutes on an RTX 4090).</p></li><li><p><strong>Episode Length:</strong> 125 steps (if it takes longer, the robot has likely failed).</p></li></ul><p>The script launches the training:</p><pre><code><code>python train_rgbd.py \
    --env-id PickCube-v1 \
    --demo-path demos/PickCube-v1/rgbd_trajectory.h5 \
    --control-mode pd_joint_delta_pos \
    --total_iters 30000</code></code></pre><h2>Step 4: Seeing the Results</h2><p>Once training is done, looking at a loss curve isn&#8217;t enough. In robotics, you need to watch the video.</p><p>The pipeline automatically runs an <strong>Inference</strong> step. It loads the saved model weights (<code>best_eval_success_once.pt</code>) and runs the robot in a fresh environment it hasn&#8217;t seen before.</p><p><strong>What to look for:</strong></p><ol><li><p><strong>Smoothness:</strong> Does the arm jitter, or does it swoop down confidently? (ACT is known for the swoop).</p></li><li><p><strong>Recovery:</strong> If the robot misses the grasp slightly, does it adjust?</p></li><li><p><strong>Success Rate:</strong> For picking up a cube, a well-trained ACT policy should hit nearly 100% success.</p></li></ol><p>Here are some of the results we got after training!</p><p><strong>Video 1: </strong></p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;93358f3c-706f-4b9c-8ab6-c1e819fcea1a&quot;,&quot;duration&quot;:null}"></div><p><strong>Video 2:</strong></p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;be280449-eb66-489b-aedf-dd5eba94238f&quot;,&quot;duration&quot;:null}"></div><p><strong>Video 3:</strong></p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;61bac25e-6cca-4b3b-9c7f-9fda71c98318&quot;,&quot;duration&quot;:null}"></div><p></p><p><strong>Video 4:</strong></p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;f3af6de2-895d-48d8-b3f3-8d3cbb9d4279&quot;,&quot;duration&quot;:null}"></div><p><strong>Video 5:</strong></p><div class="native-video-embed" data-component-name="VideoPlaceholder" 
data-attrs="{&quot;mediaUploadId&quot;:&quot;aa1e92ea-7c5d-4eea-9f1f-7689c1327de3&quot;,&quot;duration&quot;:null}"></div><p></p><p>The robot is learning the pick and place successfully!</p><p>Look at the evaluation curves! They look promising :)</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mnxK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39e8e54a-c5a2-43dd-904a-b4b26f583a94_2398x1488.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mnxK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39e8e54a-c5a2-43dd-904a-b4b26f583a94_2398x1488.png 424w, https://substackcdn.com/image/fetch/$s_!mnxK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39e8e54a-c5a2-43dd-904a-b4b26f583a94_2398x1488.png 848w, https://substackcdn.com/image/fetch/$s_!mnxK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39e8e54a-c5a2-43dd-904a-b4b26f583a94_2398x1488.png 1272w, https://substackcdn.com/image/fetch/$s_!mnxK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39e8e54a-c5a2-43dd-904a-b4b26f583a94_2398x1488.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!mnxK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39e8e54a-c5a2-43dd-904a-b4b26f583a94_2398x1488.png" width="1456" height="903" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/39e8e54a-c5a2-43dd-904a-b4b26f583a94_2398x1488.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:903,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:241609,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/187054496?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39e8e54a-c5a2-43dd-904a-b4b26f583a94_2398x1488.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!mnxK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39e8e54a-c5a2-43dd-904a-b4b26f583a94_2398x1488.png 424w, https://substackcdn.com/image/fetch/$s_!mnxK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39e8e54a-c5a2-43dd-904a-b4b26f583a94_2398x1488.png 848w, https://substackcdn.com/image/fetch/$s_!mnxK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39e8e54a-c5a2-43dd-904a-b4b26f583a94_2398x1488.png 1272w, https://substackcdn.com/image/fetch/$s_!mnxK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39e8e54a-c5a2-43dd-904a-b4b26f583a94_2398x1488.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>You can follow the GitHub repo here to replicate these results: https://github.com/VizuaraAILabs/ACT-Maniskill</p><p>We used RunPod for training the policy. 
We used the RTX-4090 GPU.</p><p><em>Just clone the repo and run the following commands:</em></p><p>(1) bash setup_maniskill.sh<br>(2) bash train_act.sh</p><p>That&#8217;s it!</p><p>If you like this content, please check out our bootcamps on the following topics:</p><p><strong>Modern Robot Learning</strong>: <a href="https://robotlearningbootcamp.vizuara.ai/">https://robotlearningbootcamp.vizuara.ai/</a></p><p><strong>GenAI</strong>: <a href="https://flyvidesh.online/gen-ai-professional-bootcamp">https://flyvidesh.online/gen-ai-professional-bootcamp</a></p><p><strong>RL</strong>: <a href="https://rlresearcherbootcamp.vizuara.ai/">https://rlresearcherbootcamp.vizuara.ai/</a></p><p><strong>SciML</strong>: <a href="https://flyvidesh.online/ml-bootcamp">https://flyvidesh.online/ml-bootcamp</a></p><p><strong>ML-DL</strong>: <a href="https://flyvidesh.online/ml-dl-bootcamp">https://flyvidesh.online/ml-dl-bootcamp</a></p><p><strong>CV</strong>: <a href="https://cvresearchbootcamp.vizuara.ai/">https://cvresearchbootcamp.vizuara.ai/</a></p>]]></content:encoded></item><item><title><![CDATA[Segment Anything Model (SAM)]]></title><description><![CDATA[The promptable foundation model revolutionizing zero-shot image segmentation through massive-scale training.]]></description><link>https://www.vizuaranewsletter.com/p/segment-anything-model-sam</link><guid isPermaLink="false">https://www.vizuaranewsletter.com/p/segment-anything-model-sam</guid><dc:creator><![CDATA[Mayank Pratap Singh]]></dc:creator><pubDate>Tue, 20 Jan 2026 09:19:46 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/2ea6440e-c81a-4e4e-b357-db44820234f5_1920x1278.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ybTD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29d7dcc3-74ad-4418-aeac-e7c4a13c6b4f_1785x1209.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ybTD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29d7dcc3-74ad-4418-aeac-e7c4a13c6b4f_1785x1209.png 424w, https://substackcdn.com/image/fetch/$s_!ybTD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29d7dcc3-74ad-4418-aeac-e7c4a13c6b4f_1785x1209.png 848w, 
https://substackcdn.com/image/fetch/$s_!ybTD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29d7dcc3-74ad-4418-aeac-e7c4a13c6b4f_1785x1209.png 1272w, https://substackcdn.com/image/fetch/$s_!ybTD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29d7dcc3-74ad-4418-aeac-e7c4a13c6b4f_1785x1209.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ybTD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29d7dcc3-74ad-4418-aeac-e7c4a13c6b4f_1785x1209.png" width="1456" height="986" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/29d7dcc3-74ad-4418-aeac-e7c4a13c6b4f_1785x1209.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:986,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:247750,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/184705881?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29d7dcc3-74ad-4418-aeac-e7c4a13c6b4f_1785x1209.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ybTD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29d7dcc3-74ad-4418-aeac-e7c4a13c6b4f_1785x1209.png 424w, 
https://substackcdn.com/image/fetch/$s_!ybTD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29d7dcc3-74ad-4418-aeac-e7c4a13c6b4f_1785x1209.png 848w, https://substackcdn.com/image/fetch/$s_!ybTD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29d7dcc3-74ad-4418-aeac-e7c4a13c6b4f_1785x1209.png 1272w, https://substackcdn.com/image/fetch/$s_!ybTD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29d7dcc3-74ad-4418-aeac-e7c4a13c6b4f_1785x1209.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><em><strong>Figure 0:</strong></em> <em>Detailed Architecture of the Segment Anything Model (SAM).</em></p><h1>Table of Contents</h1><ol><li><p>Introduction to the Segment Anything Model (SAM)</p></li><li><p>High-Level Architecture of the Segment Anything Model</p></li><li><p>Prompting Mechanisms in the Segment Anything Model</p></li><li><p>How the Dataset for SAM Was Created (Segment Anything 1B)</p></li><li><p>Image Encoder in the Segment Anything Model</p></li><li><p>Masked Autoencoder Pretraining for the SAM Image Encoder</p></li><li><p>Prompt Encoder in the Segment Anything Model</p></li><li><p>Mask Decoder</p></li><li><p>Concluding Perspective on SAM</p></li></ol><h1>1.1 Introduction to the Segment Anything Model (SAM)</h1><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Py8U!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fffe5d61d-053b-4399-83e7-2e5b291e7206_786x270.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Py8U!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fffe5d61d-053b-4399-83e7-2e5b291e7206_786x270.png 424w, https://substackcdn.com/image/fetch/$s_!Py8U!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fffe5d61d-053b-4399-83e7-2e5b291e7206_786x270.png 848w, https://substackcdn.com/image/fetch/$s_!Py8U!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fffe5d61d-053b-4399-83e7-2e5b291e7206_786x270.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Py8U!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fffe5d61d-053b-4399-83e7-2e5b291e7206_786x270.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Py8U!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fffe5d61d-053b-4399-83e7-2e5b291e7206_786x270.png" width="786" height="270" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ffe5d61d-053b-4399-83e7-2e5b291e7206_786x270.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:270,&quot;width&quot;:786,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:126055,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/184705881?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fffe5d61d-053b-4399-83e7-2e5b291e7206_786x270.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Py8U!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fffe5d61d-053b-4399-83e7-2e5b291e7206_786x270.png 424w, https://substackcdn.com/image/fetch/$s_!Py8U!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fffe5d61d-053b-4399-83e7-2e5b291e7206_786x270.png 848w, https://substackcdn.com/image/fetch/$s_!Py8U!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fffe5d61d-053b-4399-83e7-2e5b291e7206_786x270.png 
1272w, https://substackcdn.com/image/fetch/$s_!Py8U!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fffe5d61d-053b-4399-83e7-2e5b291e7206_786x270.png 1456w" sizes="100vw"></picture></div></a></figure></div><p><strong>Figure 1.1:</strong> <em>High-level overview of the Segment Anything Model (SAM), illustrating how an input image is processed along with user prompts to generate one or more segmentation masks.</em></p><p>Image segmentation has long been considered one of the most challenging problems in computer vision. 
Unlike image classification, which assigns a single label to an entire image, or object detection, which localizes objects using bounding boxes, <strong>segmentation requires pixel-level understanding</strong>&#8212;every pixel in the image must be assigned to a meaningful region or object. This level of precision is essential in applications such as medical imaging, autonomous driving, robotics, image editing, and scientific analysis.</p><p>Traditional segmentation models are typically <strong>task-specific and class-specific</strong>. For example, a model trained to segment cats and dogs cannot directly generalize to segment trees, vehicles, or furniture without retraining on new annotated data. Moreover, collecting high-quality segmentation masks is expensive and time-consuming, making large-scale generalization difficult. These limitations motivated the need for a <strong>general-purpose, reusable, and flexible segmentation model</strong>.</p><p>This is precisely the gap addressed by the <strong>Segment Anything Model (SAM)</strong>.</p><p>The <strong>Meta AI Research</strong> team introduced the Segment Anything Model in 2023 as a <strong>foundational vision model for segmentation</strong>. Much like how large language models serve as general-purpose foundations for text, SAM is designed to serve as a <strong>universal segmentation backbone</strong> that can be adapted to a wide range of downstream tasks with minimal or no retraining. The paper rapidly gained attention within the research community, accumulating thousands of citations in a short span of time, and has since influenced many follow-up works in vision and multimodal learning.</p><p>At a high level, SAM reframes segmentation as a <strong>promptable task</strong> rather than a fixed prediction problem. 
Instead of asking the model to segment only predefined object categories, SAM allows users to <strong>guide the segmentation process through prompts</strong>.</p><p>These prompts can take several forms. A user may provide a <strong>point click</strong> on an object of interest, indicating a specific spatial location that the model should focus on. Alternatively, the prompt can be a <strong>rough bounding box</strong> drawn around a region, conveying a vague but useful prior about where the target object lies within the image. In more complex scenarios, the user may supply a <strong>coarse or approximate mask</strong>, roughly shading the area of interest and asking the model to refine it into an accurate, pixel-level segmentation. Additionally, although the Segment Anything Model does not natively accept text as input, <strong>textual descriptions can be incorporated indirectly</strong> by coupling SAM with external vision&#8211;language models, enabling segmentation based on semantic instructions expressed in natural language.</p><p>Given an input image and a prompt, SAM predicts one or more segmentation masks that best correspond to the user&#8217;s intent. This interaction paradigm makes the model highly flexible and suitable for both automated pipelines and human-in-the-loop systems.</p><p>Another defining characteristic of SAM is its ability to handle <strong>ambiguity in user intent</strong>. A single click on an image can be interpreted in multiple valid ways&#8212;for instance, selecting an entire object, a part of an object, or a smaller sub-region. To address this, SAM produces <strong>multiple candidate masks in a single forward pass</strong>, allowing the user or downstream system to choose the most appropriate one. This design choice reflects a practical understanding of how segmentation is used in real-world scenarios.</p><p>Behind this capability lies an unprecedented training effort. 
SAM was trained on a massive dataset containing <strong>millions of images and over a billion segmentation masks</strong>, covering a diverse range of objects, scenes, and visual concepts. Crucially, the model is <strong>class-agnostic</strong>: it does not rely on fixed semantic labels during inference. Instead, it learns a rich notion of &#8220;objectness&#8221; and spatial coherence, enabling it to segment previously unseen categories.</p><p>In the next section, we will move beyond this conceptual overview and introduce the <strong>architecture of SAM</strong>, examining how its image encoder, prompt encoder, and mask decoder work together to enable promptable, high-quality segmentation at scale.</p><h1>1.2 High-Level Architecture of the Segment Anything Model</h1><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dUzc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ae0576e-f902-41b0-8248-f7ad8acb8026_1359x645.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dUzc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ae0576e-f902-41b0-8248-f7ad8acb8026_1359x645.png 424w, https://substackcdn.com/image/fetch/$s_!dUzc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ae0576e-f902-41b0-8248-f7ad8acb8026_1359x645.png 848w, https://substackcdn.com/image/fetch/$s_!dUzc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ae0576e-f902-41b0-8248-f7ad8acb8026_1359x645.png 1272w, 
https://substackcdn.com/image/fetch/$s_!dUzc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ae0576e-f902-41b0-8248-f7ad8acb8026_1359x645.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dUzc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ae0576e-f902-41b0-8248-f7ad8acb8026_1359x645.png" width="1359" height="645" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8ae0576e-f902-41b0-8248-f7ad8acb8026_1359x645.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:645,&quot;width&quot;:1359,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:258200,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/184705881?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ae0576e-f902-41b0-8248-f7ad8acb8026_1359x645.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!dUzc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ae0576e-f902-41b0-8248-f7ad8acb8026_1359x645.png 424w, https://substackcdn.com/image/fetch/$s_!dUzc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ae0576e-f902-41b0-8248-f7ad8acb8026_1359x645.png 848w, 
https://substackcdn.com/image/fetch/$s_!dUzc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ae0576e-f902-41b0-8248-f7ad8acb8026_1359x645.png 1272w, https://substackcdn.com/image/fetch/$s_!dUzc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ae0576e-f902-41b0-8248-f7ad8acb8026_1359x645.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong>Figure 1.2:</strong> <em>High-level architecture of the Segment Anything Model (SAM), showing the image encoder, prompt 
encoder, and mask decoder. The mask decoder integrates image embeddings, prompt embeddings, and initialized mask tokens to produce multiple valid segmentation masks with associated confidence scores.</em></p><p>At first glance, the architecture of the Segment Anything Model (SAM) appears deceptively simple. However, this simplicity is the result of a carefully designed abstraction that unifies two input modalities, images and prompts, into a single, coherent segmentation pipeline. At a high level, SAM is structured around <strong>three core components</strong>: an <strong>image encoder</strong>, a <strong>prompt encoder</strong>, and a <strong>mask decoder</strong>. Together, these components enable promptable, class-agnostic, and flexible segmentation.</p><p>The starting point of the pipeline is the <strong>input image</strong>. Since SAM is designed as a foundation model, the image is first transformed into a rich, high-dimensional representation that captures global context as well as fine-grained spatial details. This transformation is performed by the <strong>image encoder</strong>, which converts the raw image into a dense <strong>image embedding</strong>. Conceptually, this embedding can be viewed as a structured feature map that encodes &#8220;what is present&#8221; and &#8220;where it is present&#8221; in the image. Importantly, this step is independent of any specific segmentation task or object category.</p><p>In parallel, SAM processes the <strong>user-provided prompt</strong> through a dedicated <strong>prompt encoder</strong>. Prompts can take different forms, such as points, bounding boxes, or coarse masks, but regardless of their modality, they must be mapped into a common embedding space. The role of the prompt encoder is precisely this: to transform heterogeneous prompt signals into <strong>prompt embeddings</strong> that are compatible with the image embedding. 
These embeddings do not directly encode semantic class labels; instead, they encode <em>intent</em>, that is, what region or object the user is interested in segmenting.</p><p>The most critical component of SAM is the <strong>mask decoder</strong>, which is responsible for integrating multiple streams of information and transforming them into precise segmentation outputs. Specifically, the mask decoder jointly operates on three distinct inputs. First, it receives the <strong>image embedding</strong> generated by the image encoder, which encodes rich visual and spatial information about the entire image. Second, it takes as input the <strong>prompt embedding</strong> produced by the prompt encoder, which represents the user&#8217;s intent in a form that is compatible with the visual features. Finally, the decoder is initialized with a small set of <strong>learnable mask tokens</strong>, which act as abstract starting points for mask generation. Through successive attention-based interactions, these tokens are refined by attending to both the image and prompt embeddings, ultimately producing one or more segmentation masks that align with the user&#8217;s intended region of interest.</p><p>The inclusion of initialized mask tokens is conceptually similar to the object queries used in Detection Transformers. Rather than predicting masks directly from the image, the decoder starts with a small, fixed number of abstract tokens and progressively refines them through interaction with image and prompt information. This interaction is achieved through attention mechanisms, allowing the decoder to reason jointly over <em>what the image contains</em> and <em>what the user asked for</em>.</p><p>A key design choice in SAM is that the mask decoder produces <strong>multiple candidate masks in a single forward pass</strong>. This is not an incidental detail, but a deliberate response to the inherent ambiguity of prompts. 
For example, a single click on an image might correspond to an entire object, a part of that object, or a smaller sub-region. By generating multiple masks along with confidence scores, SAM allows downstream systems, or human users, to select the most appropriate interpretation without requiring repeated inference.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XgJ2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F496fd99a-a6d1-4a85-9019-86f4bf3574e4_819x363.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XgJ2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F496fd99a-a6d1-4a85-9019-86f4bf3574e4_819x363.png 424w, https://substackcdn.com/image/fetch/$s_!XgJ2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F496fd99a-a6d1-4a85-9019-86f4bf3574e4_819x363.png 848w, https://substackcdn.com/image/fetch/$s_!XgJ2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F496fd99a-a6d1-4a85-9019-86f4bf3574e4_819x363.png 1272w, https://substackcdn.com/image/fetch/$s_!XgJ2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F496fd99a-a6d1-4a85-9019-86f4bf3574e4_819x363.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XgJ2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F496fd99a-a6d1-4a85-9019-86f4bf3574e4_819x363.png" width="819" height="363" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/496fd99a-a6d1-4a85-9019-86f4bf3574e4_819x363.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:363,&quot;width&quot;:819,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:405702,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/184705881?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F496fd99a-a6d1-4a85-9019-86f4bf3574e4_819x363.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!XgJ2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F496fd99a-a6d1-4a85-9019-86f4bf3574e4_819x363.png 424w, https://substackcdn.com/image/fetch/$s_!XgJ2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F496fd99a-a6d1-4a85-9019-86f4bf3574e4_819x363.png 848w, https://substackcdn.com/image/fetch/$s_!XgJ2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F496fd99a-a6d1-4a85-9019-86f4bf3574e4_819x363.png 1272w, https://substackcdn.com/image/fetch/$s_!XgJ2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F496fd99a-a6d1-4a85-9019-86f4bf3574e4_819x363.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong>Figure 1.3:</strong> <em>Illustration of prompt ambiguity handling in the Segment Anything Model. From a single point prompt (green circle), SAM generates multiple valid segmentation masks in one forward pass, corresponding to different plausible interpretations of user intent.</em></p><p>From an architectural perspective, the division of labor within SAM is both clear and principled. The image encoder is dedicated to visual understanding, transforming the raw input image into a rich representation that captures spatial structure and semantic content. The prompt encoder focuses on modeling user intent, converting different forms of prompts such as points, boxes, or masks into embeddings that guide the segmentation process. 
The mask decoder then performs multimodal reasoning by jointly attending to the image embeddings, prompt embeddings, and learnable mask tokens in order to generate accurate segmentation outputs. This modular design not only improves interpretability by clearly separating concerns within the model, but also makes SAM highly extensible. Individual components can be independently replaced, scaled, or integrated with external systems such as vision&#8211;language models without modifying the core segmentation framework.</p><h1>1.3 Prompting Mechanisms in the Segment Anything Model</h1><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7OwQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6649fc07-0422-4f77-8047-742d86bcaa8f_1323x450.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7OwQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6649fc07-0422-4f77-8047-742d86bcaa8f_1323x450.png 424w, https://substackcdn.com/image/fetch/$s_!7OwQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6649fc07-0422-4f77-8047-742d86bcaa8f_1323x450.png 848w, https://substackcdn.com/image/fetch/$s_!7OwQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6649fc07-0422-4f77-8047-742d86bcaa8f_1323x450.png 1272w, https://substackcdn.com/image/fetch/$s_!7OwQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6649fc07-0422-4f77-8047-742d86bcaa8f_1323x450.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!7OwQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6649fc07-0422-4f77-8047-742d86bcaa8f_1323x450.png" width="1323" height="450" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6649fc07-0422-4f77-8047-742d86bcaa8f_1323x450.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:450,&quot;width&quot;:1323,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:53541,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/184705881?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6649fc07-0422-4f77-8047-742d86bcaa8f_1323x450.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7OwQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6649fc07-0422-4f77-8047-742d86bcaa8f_1323x450.png 424w, https://substackcdn.com/image/fetch/$s_!7OwQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6649fc07-0422-4f77-8047-742d86bcaa8f_1323x450.png 848w, https://substackcdn.com/image/fetch/$s_!7OwQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6649fc07-0422-4f77-8047-742d86bcaa8f_1323x450.png 1272w, https://substackcdn.com/image/fetch/$s_!7OwQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6649fc07-0422-4f77-8047-742d86bcaa8f_1323x450.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong>Figure 1.4:</strong> <em>Overview of prompt types supported by the Segment Anything Model, including point prompts, bounding box prompts, mask prompts, and text prompts. Point and box prompts provide sparse spatial cues, while mask prompts offer dense guidance, and text prompts are incorporated indirectly via external embedding models.</em></p><p>A defining feature of the Segment Anything Model (SAM) is its ability to perform <strong>promptable segmentation</strong>, where the segmentation output is guided by user-provided prompts rather than fixed semantic classes. 
These prompts act as signals of intent, specifying <em>where</em> or <em>what</em> the model should segment in an image. SAM supports multiple types of prompts, each differing in structure, information density, and level of user effort.</p><p>The most fundamental prompt type is the <strong>point prompt</strong>. A point prompt is represented as a single spatial coordinate <em>(x, y)</em> on the image, paired with a binary label indicating whether the point belongs to the foreground or the background. Multiple points can be provided simultaneously to refine the user&#8217;s intent. Because point prompts consist of only a few discrete coordinates, they are referred to as <strong>sparse prompts</strong>. Despite their simplicity, point prompts are powerful, enabling intuitive interactions such as clicking on different regions of an image to progressively segment distinct objects or parts of objects.</p><p>A closely related prompt type is the <strong>bounding box prompt</strong>, which is defined by two opposite corner coordinates <em>(x1, y1)</em> and <em>(x2, y2)</em>. This prompt provides a coarse spatial constraint, indicating that the object of interest lies somewhere within the specified rectangular region. Like point prompts, bounding box prompts are sparse, as they rely on a small set of coordinates rather than dense spatial information. They are particularly useful when the approximate extent of an object is known but precise boundaries are not.</p><p>In contrast to sparse prompts, <strong>mask prompts</strong> are considered <strong>dense prompts</strong>. A mask prompt is a two-dimensional map that is spatially aligned with the input image and provides a rough outline of the region of interest. In the SAM paper, mask prompts are supplied at a 4&#215; lower resolution than the input image rather than at full resolution. Rather than specifying intent through isolated points or corners, mask prompts convey intent through a dense collection of pixels, offering stronger guidance to the model. 
Mask prompts are especially effective in refinement scenarios, where an approximate segmentation already exists and the goal is to improve its precision.</p><p>SAM can also be driven by <strong>text-based prompts</strong>, but only indirectly. Raw text cannot be passed directly into the model. Instead, textual descriptions are first converted into embeddings using an external vision&#8211;language model, such as CLIP, which aligns text and image representations in a shared embedding space. These embeddings can then be fed into the prompt encoder in the same way as other prompt representations. The SAM paper presents this as a proof of concept, and the publicly released model does not include a text-prompt interface. As a result, text prompting in SAM depends on the availability of an external embedding model and is not natively supported in the core architecture.</p><p>Importantly, different prompt types can be used interchangeably or in combination, allowing flexible interaction patterns. Sparse prompts enable fast, lightweight user input, while dense prompts provide stronger constraints when greater precision is required. 
This unified prompt-based design allows SAM to generalize across tasks and domains without retraining, making it suitable for interactive segmentation, image editing, and downstream vision systems.</p><h1>1.4 How the Dataset for SAM Was Created (Segment Anything 1B)</h1><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YlV_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2745e211-776f-400a-9650-9a126ee3d59b_966x651.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!YlV_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2745e211-776f-400a-9650-9a126ee3d59b_966x651.png 424w, https://substackcdn.com/image/fetch/$s_!YlV_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2745e211-776f-400a-9650-9a126ee3d59b_966x651.png 848w, https://substackcdn.com/image/fetch/$s_!YlV_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2745e211-776f-400a-9650-9a126ee3d59b_966x651.png 1272w, https://substackcdn.com/image/fetch/$s_!YlV_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2745e211-776f-400a-9650-9a126ee3d59b_966x651.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!YlV_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2745e211-776f-400a-9650-9a126ee3d59b_966x651.png" width="966" height="651" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2745e211-776f-400a-9650-9a126ee3d59b_966x651.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:651,&quot;width&quot;:966,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:48059,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/184705881?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2745e211-776f-400a-9650-9a126ee3d59b_966x651.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!YlV_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2745e211-776f-400a-9650-9a126ee3d59b_966x651.png 424w, https://substackcdn.com/image/fetch/$s_!YlV_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2745e211-776f-400a-9650-9a126ee3d59b_966x651.png 848w, https://substackcdn.com/image/fetch/$s_!YlV_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2745e211-776f-400a-9650-9a126ee3d59b_966x651.png 1272w, https://substackcdn.com/image/fetch/$s_!YlV_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2745e211-776f-400a-9650-9a126ee3d59b_966x651.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong>Figure 1.5:</strong> <em>Three-stage data engine used to construct the Segment Anything 1B (SA-1B) dataset. The process transitions from fully manual annotation to model-assisted refinement and finally to fully automatic mask generation, producing over one billion segmentation masks at scale.</em></p><p>The success of the Segment Anything Model is not only a consequence of architectural design but also of the unprecedented scale and methodology used to construct its training dataset. The dataset, referred to as <strong>Segment Anything 1B (SA-1B)</strong>, contains <strong>over 11 million images and more than 1.1 billion segmentation masks</strong>, making it one of the largest segmentation datasets ever created. 
Since manually annotating such a dataset is infeasible, the authors introduced a carefully designed <strong>three-stage data engine</strong> that progressively transitions from human annotation to full automation.</p><p><strong>Stage 1: Fully Manual Annotation</strong></p><p>The dataset creation process begins with a small but extremely high-quality seed dataset. In this stage, human annotators manually draw pixel-accurate segmentation masks for a limited number of images. These annotations are created entirely from scratch and represent expert-level, high-fidelity ground truth. While the quality of this data is very high, the quantity is necessarily small due to the cost and effort required for dense mask annotation. This dataset is used to train an initial version of SAM, which is relatively weak and limited in generalization capability but sufficient to begin the bootstrapping process.</p><p><strong>Stage 2: Model-Assisted Annotation</strong></p><p>In the second stage, the partially trained model from Stage 1 is used to assist humans in generating annotations. Instead of drawing masks from scratch, the model proposes segmentation masks based on simple prompts such as points or bounding boxes. Human annotators then refine these proposed masks by correcting inaccuracies or adjusting prompts when the model segments the wrong object. This correction-based workflow is significantly faster than full manual annotation. The refined masks and associated prompts are added back into the training set, and the model is retrained. This stage establishes an iterative human-in-the-loop process in which model quality and dataset size improve together.</p><p><strong>Stage 3: Fully Automatic Mask Generation</strong></p><p>The final stage removes humans from the annotation loop entirely. The improved model from Stage 2 is deployed at scale to automatically generate segmentation masks for millions of new images. 
On average, the model produces roughly <strong>100 masks per image</strong>, covering objects, object parts, and regions at multiple levels of granularity. Although this stage introduces some noise and inaccuracies due to the absence of human verification, the sheer volume of data compensates for individual errors. Importantly, this stage also helps reduce <strong>human annotation bias</strong>, as the model often segments visual structures that humans might overlook or not consider salient.</p><p>All data from Stages 1, 2, and 3 are combined into a single large dataset, which is then used to train the final version of the Segment Anything Model. The authors explicitly note that the SA-1B dataset was created <strong>for training SAM itself</strong>, not as a standalone public benchmark. The dataset also adheres to privacy and licensing constraints, ensuring responsible large-scale data collection.</p><p>Overall, this three-stage data engine demonstrates how scalable supervision can be achieved by progressively shifting annotation responsibility from humans to models, enabling the creation of datasets at a scale that would otherwise be impossible.</p><h1>1.5 Image Encoder in the Segment Anything Model</h1><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!c6FE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8639395-3e3a-49d7-b54b-5328051238c5_1710x519.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!c6FE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8639395-3e3a-49d7-b54b-5328051238c5_1710x519.png 424w, 
https://substackcdn.com/image/fetch/$s_!c6FE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8639395-3e3a-49d7-b54b-5328051238c5_1710x519.png 848w, https://substackcdn.com/image/fetch/$s_!c6FE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8639395-3e3a-49d7-b54b-5328051238c5_1710x519.png 1272w, https://substackcdn.com/image/fetch/$s_!c6FE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8639395-3e3a-49d7-b54b-5328051238c5_1710x519.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!c6FE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8639395-3e3a-49d7-b54b-5328051238c5_1710x519.png" width="1456" height="442" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f8639395-3e3a-49d7-b54b-5328051238c5_1710x519.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:442,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:147031,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/184705881?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8639395-3e3a-49d7-b54b-5328051238c5_1710x519.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!c6FE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8639395-3e3a-49d7-b54b-5328051238c5_1710x519.png 424w, https://substackcdn.com/image/fetch/$s_!c6FE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8639395-3e3a-49d7-b54b-5328051238c5_1710x519.png 848w, https://substackcdn.com/image/fetch/$s_!c6FE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8639395-3e3a-49d7-b54b-5328051238c5_1710x519.png 1272w, https://substackcdn.com/image/fetch/$s_!c6FE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8639395-3e3a-49d7-b54b-5328051238c5_1710x519.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><p><strong>Figure 1.6:</strong> <em>Detailed architecture of the SAM image encoder. A high-resolution input image is divided into 16 &#215; 16 patches, projected into a 1280-dimensional embedding space, processed by a Vision Transformer, reshaped into a 64 &#215; 64 spatial grid, and finally projected to a 256-channel feature map used by the mask decoder.</em></p><p>The image encoder is the first and computationally most intensive component of the Segment Anything Model. Its primary role is to convert a high-resolution input image into a compact, context-rich representation that preserves spatial structure while being suitable for downstream mask generation. In SAM, this is achieved using a <strong>Vision Transformer (ViT)</strong> that is <strong>pre-trained with a Masked Autoencoder (MAE) objective</strong>.</p><p><strong>1.5.1 Input Representation</strong></p><p>The input to the image encoder is a standard RGB image of shape</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;(3,1024,1024)&quot;,&quot;id&quot;:&quot;LPCMFSFEWQ&quot;}" data-component-name="LatexBlockToDOM"></div><p>where 3 corresponds to the color channels and 1024 &#215; 1024 represents the spatial resolution. 
This high resolution is intentionally preserved at the input stage to retain fine-grained visual details that are critical for precise segmentation.</p><p><strong>1.5.2 Patchification</strong></p><p>Following the standard Vision Transformer design, the input image is divided into non-overlapping patches of size</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;16&#215;16&quot;,&quot;id&quot;:&quot;TTWFRWWWBN&quot;}" data-component-name="LatexBlockToDOM"></div><p>Each patch contains 16 &#215; 16 &#215; 3 = <strong>768 values</strong>, accounting for spatial pixels and color channels. Since the image resolution is 1024 &#215; 1024, this results in</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\frac{1024}{16} \\times \\frac{1024}{16} = 64 \\times 64 = 4096&quot;,&quot;id&quot;:&quot;CEUPCTBXWA&quot;}" data-component-name="LatexBlockToDOM"></div><p>patches in total. These patches are treated as a sequence of tokens, where the sequence length</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;N=4096&quot;,&quot;id&quot;:&quot;XISPVDDODW&quot;}" data-component-name="LatexBlockToDOM"></div><p>and each token initially has dimensionality 768.</p><p><strong>1.5.3 Patch Embedding and Positional Encoding</strong></p><p>The raw patch vectors of shape</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;(4096,768)&quot;,&quot;id&quot;:&quot;QQOYJBPWOW&quot;}" data-component-name="LatexBlockToDOM"></div><p>are not directly suitable for transformer processing because the Vision Transformer used in SAM operates in a higher-dimensional embedding space. Therefore, each patch vector is passed through a linear projection layer that maps it from 768 dimensions to <strong>1280 dimensions</strong>, which is the embedding size of the ViT backbone. 
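</p><p>The patch arithmetic above can be checked in a few lines of plain Python. This is a minimal sketch rather than SAM&#8217;s own code: the helper names <code>patch_grid</code> and <code>patchify</code> are hypothetical, the toy 4 &#215; 4 image exists only to keep the run instant, and the bias term in the projection parameter count is an assumption.</p>
<pre>
```python
def patch_grid(height, width, patch, channels=3):
    """Return (number of patches, values per patch) for non-overlapping patches."""
    if height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")
    return (height // patch) * (width // patch), patch * patch * channels


def patchify(image, patch):
    """Split an H x W x C nested-list image into flattened patch vectors (row-major)."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            vector = []
            for row in image[top:top + patch]:
                for pixel in row[left:left + patch]:
                    vector.extend(pixel)
            patches.append(vector)
    return patches


# Shapes quoted in the text: 1024 x 1024 RGB input, 16 x 16 patches.
num_tokens, token_dim = patch_grid(1024, 1024, 16)

# Toy 4 x 4 RGB image with 2 x 2 patches, small enough to inspect by hand.
toy = [[[r, c, 0] for c in range(4)] for r in range(4)]
toy_patches = patchify(toy, 2)

# Weights plus biases of a 768-to-1280 linear projection (the bias is an assumption).
projection_params = token_dim * 1280 + 1280
```
</pre>
<p>With these inputs, <code>num_tokens</code> is 4096 and <code>token_dim</code> is 768, matching the shapes in the text.</p><p>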
After this projection, the token representation becomes</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;(4096,1280)&quot;,&quot;id&quot;:&quot;KVNLTJRRJL&quot;}" data-component-name="LatexBlockToDOM"></div><p>Learnable positional embeddings are then added to these patch embeddings to encode spatial location information. This allows the transformer to reason about the relative and absolute positions of patches in the original image.</p><p><strong>1.5.4 Transformer Encoding</strong></p><p>The embedded patch sequence is processed by multiple transformer encoder layers. Through repeated self-attention and feed-forward operations, the model produces <strong>context vectors</strong> that capture long-range dependencies across the entire image. Importantly, the number of tokens remains unchanged during this process. The output of the transformer encoder therefore has the same shape as its input</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;(4096,1280)&quot;,&quot;id&quot;:&quot;UPNRPCKQYX&quot;}" data-component-name="LatexBlockToDOM"></div><p>where 4096 corresponds to the number of spatial tokens and 1280 is the embedding dimension.</p><p><strong>1.5.5 Reshaping to Spatial Grid</strong></p><p>Although the transformer processes tokens as a flat sequence, segmentation requires spatial structure to be preserved. To restore spatial organization, the token sequence is reshaped back into a two-dimensional grid. </p><p>Since 4096 = 64 &#215; 64, the tensor is reshaped into</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;(1280,64,64)&quot;,&quot;id&quot;:&quot;TUQJLSUYCX&quot;}" data-component-name="LatexBlockToDOM"></div><p>Here, 64 &#215; 64 represents the spatial layout of patches, and 1280 corresponds to the feature channels at each spatial location. 
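</p><p>The reshape itself is only an index rearrangement. The sketch below shows it on a toy case in plain Python; <code>tokens_to_grid</code> is a hypothetical helper, and it assumes the tokens are stored in row-major order (left to right, then top to bottom), which the text does not state explicitly.</p>
<pre>
```python
def tokens_to_grid(tokens, grid_h, grid_w):
    """Rearrange (N, D) row-major tokens into a channels-first (D, grid_h, grid_w) grid."""
    if len(tokens) != grid_h * grid_w:
        raise ValueError("token count must equal grid_h * grid_w")
    dim = len(tokens[0])
    return [
        [[tokens[row * grid_w + col][ch] for col in range(grid_w)]
         for row in range(grid_h)]
        for ch in range(dim)
    ]


# Toy case: 6 tokens of dimension 2 on a 2 x 3 grid.
# The text's case is 4096 tokens of dimension 1280 on a 64 x 64 grid.
tokens = [[10 * i, 10 * i + 1] for i in range(6)]
grid = tokens_to_grid(tokens, 2, 3)
grid_shape = (len(grid), len(grid[0]), len(grid[0][0]))
```
</pre>
<p>Each channel of <code>grid</code> is a small 2D map, and the value at a given row and column comes from the token that originally covered that location.</p><p>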
These features are no longer raw patches but <strong>context-enriched representations</strong> produced by the Vision Transformer.</p><p><strong>1.5.6 Channel Projection to Decoder-Compatible Space</strong></p><p>The embedding dimension of 1280 is appropriate for the Vision Transformer but does not match the expected input dimension of the mask decoder. To resolve this, a final projection reduces the channel dimension from 1280 to <strong>256</strong>. Conceptually, it is a linear projection applied independently at each spatial location, which is the same operation as a 1 &#215; 1 convolution; the SAM paper describes this step as a 1 &#215; 1 convolution followed by a 3 &#215; 3 convolution, each followed by layer normalization. After this projection, the final output of the image encoder has shape</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;(256,64,64)&quot;,&quot;id&quot;:&quot;UXVBJRWCEH&quot;}" data-component-name="LatexBlockToDOM"></div><p>This representation can be interpreted as a dense feature map with 256 channels at one sixteenth of the input resolution (64 &#215; 64 instead of 1024 &#215; 1024), analogous to the output of a deep convolutional backbone.</p><p><strong>1.5.7 Summary of Image Encoder Transformation</strong></p><p>The complete transformation performed by the image encoder can be summarized as:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;(3, 1024, 1024) \\rightarrow (256, 64, 64)&quot;,&quot;id&quot;:&quot;NHBXSMUMNR&quot;}" data-component-name="LatexBlockToDOM"></div><p>During inference, this expensive computation is performed <strong>only once per image</strong>. 
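</p><p>One way a caller might exploit this is to cache the embedding per image. The sketch below is an assumption about application code, not part of SAM: <code>EmbeddingCache</code> and <code>fake_encoder</code> are hypothetical names, and the fake encoder returns only the output shape instead of real features.</p>
<pre>
```python
class EmbeddingCache:
    """Cache image embeddings so the expensive encoder runs once per image."""

    def __init__(self, encode_fn):
        self._encode_fn = encode_fn
        self._store = {}
        self.encoder_calls = 0

    def get(self, image_id, image):
        # Run the encoder only the first time an image id is seen.
        if image_id not in self._store:
            self._store[image_id] = self._encode_fn(image)
            self.encoder_calls += 1
        return self._store[image_id]


def fake_encoder(image):
    """Hypothetical stand-in for the image encoder; returns only the output shape."""
    return (256, 64, 64)


cache = EmbeddingCache(fake_encoder)
shapes = []
for _prompt in ("point", "box", "mask"):
    # Each prompt on the same image reuses the cached embedding.
    shapes.append(cache.get("img-0", image=None))
```
</pre>
<p>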
The resulting feature map can be reused across multiple prompts, which is a key reason why SAM supports fast and interactive segmentation.</p><h1>1.6 Masked Autoencoder Pretraining for the SAM Image Encoder</h1><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UdSD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15abd222-5e19-435d-8e85-668f9cc73b70_1407x753.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UdSD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15abd222-5e19-435d-8e85-668f9cc73b70_1407x753.png 424w, https://substackcdn.com/image/fetch/$s_!UdSD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15abd222-5e19-435d-8e85-668f9cc73b70_1407x753.png 848w, https://substackcdn.com/image/fetch/$s_!UdSD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15abd222-5e19-435d-8e85-668f9cc73b70_1407x753.png 1272w, https://substackcdn.com/image/fetch/$s_!UdSD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15abd222-5e19-435d-8e85-668f9cc73b70_1407x753.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UdSD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15abd222-5e19-435d-8e85-668f9cc73b70_1407x753.png" width="1407" height="753" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/15abd222-5e19-435d-8e85-668f9cc73b70_1407x753.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:753,&quot;width&quot;:1407,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:170797,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/184705881?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15abd222-5e19-435d-8e85-668f9cc73b70_1407x753.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!UdSD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15abd222-5e19-435d-8e85-668f9cc73b70_1407x753.png 424w, https://substackcdn.com/image/fetch/$s_!UdSD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15abd222-5e19-435d-8e85-668f9cc73b70_1407x753.png 848w, https://substackcdn.com/image/fetch/$s_!UdSD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15abd222-5e19-435d-8e85-668f9cc73b70_1407x753.png 1272w, https://substackcdn.com/image/fetch/$s_!UdSD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15abd222-5e19-435d-8e85-668f9cc73b70_1407x753.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><p><strong>Figure 1.7:</strong> <em>Masked Autoencoder (MAE) pretraining framework used for the SAM image encoder. A large portion of image patches is masked, visible patches are encoded using a Vision Transformer, learnable mask tokens represent hidden patches, and a decoder reconstructs the missing content. After training, only the encoder is retained for use in SAM.</em></p><p>The Vision Transformer used in the Segment Anything Model is not trained from scratch on segmentation data. Instead, it is <strong>pre-trained using a Masked Autoencoder (MAE)</strong> objective. 
This pretraining strategy plays a crucial role in enabling the image encoder to develop strong visual representations that generalize well across objects, scenes, and spatial structures.</p><p><strong>1.6.1 Motivation Behind Masked Autoencoders</strong></p><p>A masked autoencoder is a self-supervised learning framework designed to force a model to deeply understand the structure of images rather than memorize labels. The central idea is simple: <strong>hide most of the image and train the model to reconstruct what is missing</strong>. If the model can accurately predict large missing regions using only partial information, it must have learned meaningful visual patterns and global context.</p><p>This idea is directly analogous to masked language modeling in models such as BERT, where words are hidden and the model learns to predict them from surrounding context. In MAE, however, the basic units are <strong>image patches instead of words</strong>.</p><p><strong>1.6.2 Patch Masking Strategy</strong></p><p>The input image is first divided into patches in the same way as a standard Vision Transformer. From this set of patches, a <strong>large fraction, typically around 75 percent</strong>, is randomly masked. Only the remaining <strong>25 percent of patches are kept visible</strong>.</p><p>These visible patches are converted into patch embeddings and passed to the encoder. The masked patches are completely removed from the encoder input, which significantly reduces the computational cost during training.</p><p><strong>1.6.3 Encoder Processing</strong></p><p>The encoder receives only the visible patch embeddings and processes them using a Vision Transformer. Since the encoder sees only a small subset of patches, it is forced to extract as much contextual information as possible from limited input. 
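</p><p>The masking step from 1.6.2 can be sketched in plain Python. This is an illustration under stated assumptions, not the MAE reference code: <code>random_masking</code> is a hypothetical helper, the seed is fixed only for reproducibility, and the 16-patch toy grid is an arbitrary size.</p>
<pre>
```python
import random


def random_masking(num_patches, mask_ratio=0.75, seed=0):
    """Return sorted (visible, masked) patch indices for MAE-style random masking."""
    rng = random.Random(seed)
    order = list(range(num_patches))
    rng.shuffle(order)
    num_visible = int(num_patches * (1 - mask_ratio))
    return sorted(order[:num_visible]), sorted(order[num_visible:])


# Toy 4 x 4 grid of 16 patches: 4 stay visible, 12 are masked.
visible, masked = random_masking(16)

# Only the visible indices would be embedded and passed to the encoder.
encoder_input_length = len(visible)
```
</pre>
<p>Because the encoder input shrinks to a quarter of the patches, the cost of each encoder pass during pretraining drops accordingly.</p><p>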
The output of the encoder consists of embeddings corresponding only to the visible patches.</p><p>At this stage, the model has no direct information about the masked regions, yet it must infer what could plausibly exist there based on global structure, object continuity, and visual patterns.</p><p><strong>1.6.4 Learnable Mask Tokens and Decoder</strong></p><p>To reconstruct the original image, the encoded visible patch embeddings are combined with a set of <strong>learnable mask tokens</strong>, one for each masked patch. These mask tokens do not contain image information initially. Instead, they serve as placeholders that the model must learn to fill in.</p><p>The combined sequence of visible patch embeddings and mask tokens is then passed through a <strong>decoder</strong>, which attempts to reconstruct the pixel values of the original image. The decoder predicts the content of all patches, but <strong>the reconstruction loss is computed only on the masked patches</strong>. This design choice prevents the model from simply copying visible information and forces it to learn meaningful representations.</p><p><strong>1.6.5 Training Objective and Representation Learning</strong></p><p>At the beginning of training, the reconstructed patches differ significantly from the true pixel values, leading to high reconstruction loss. As training progresses, the encoder learns increasingly informative representations, and the decoder becomes better at predicting the hidden patches.</p><p>Through this process, the Vision Transformer encoder becomes highly effective at capturing global image structure, object boundaries, textures, and spatial relationships. Importantly, the goal of MAE pretraining is <strong>not classification</strong>, but <strong>visual understanding</strong>. 
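</p><p>The masked-only reconstruction loss described in 1.6.4 can be made concrete with a toy example. This is a minimal sketch assuming a mean-squared-error loss over raw pixel values; <code>masked_mse</code> is a hypothetical helper, and the numbers are made up for illustration.</p>
<pre>
```python
def masked_mse(pred, target, masked_indices):
    """Mean squared error averaged over masked patches only."""
    if not masked_indices:
        raise ValueError("at least one masked patch is required")
    total = 0.0
    for idx in masked_indices:
        diffs = [(p - t) ** 2 for p, t in zip(pred[idx], target[idx])]
        total += sum(diffs) / len(diffs)
    return total / len(masked_indices)


# Toy example: 4 patches of 2 pixel values each; patches 1 and 3 are masked.
target = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
pred = [[9.0, 9.0], [1.0, 2.0], [2.0, 2.0], [3.0, 5.0]]

# The large error on visible patch 0 does not contribute to the loss.
loss = masked_mse(pred, target, [1, 3])
```
</pre>
<p>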
Unlike traditional Vision Transformers that rely heavily on a class token, MAE leverages all patch embeddings, making it particularly suitable for dense prediction tasks such as segmentation.</p><p><strong>1.6.6 Role of MAE in SAM</strong></p><p>After pretraining, the decoder used in the masked autoencoder is discarded. Only the <strong>pre-trained Vision Transformer encoder</strong> is retained and reused as the image encoder in SAM. This encoder provides rich, context-aware feature maps that form the foundation for promptable segmentation.</p><p>Because the encoder has been trained to reason about missing image regions, it is exceptionally well-suited for tasks that require precise spatial understanding, such as predicting segmentation masks from sparse prompts.</p><h1>1.7 Prompt Encoder in the Segment Anything Model</h1><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LtO_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F275ca88e-5f96-4e41-80d4-fb06112ff47c_1269x759.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LtO_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F275ca88e-5f96-4e41-80d4-fb06112ff47c_1269x759.png 424w, https://substackcdn.com/image/fetch/$s_!LtO_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F275ca88e-5f96-4e41-80d4-fb06112ff47c_1269x759.png 848w, https://substackcdn.com/image/fetch/$s_!LtO_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F275ca88e-5f96-4e41-80d4-fb06112ff47c_1269x759.png 1272w, 
https://substackcdn.com/image/fetch/$s_!LtO_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F275ca88e-5f96-4e41-80d4-fb06112ff47c_1269x759.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LtO_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F275ca88e-5f96-4e41-80d4-fb06112ff47c_1269x759.png" width="1269" height="759" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/275ca88e-5f96-4e41-80d4-fb06112ff47c_1269x759.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:759,&quot;width&quot;:1269,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:69537,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/184705881?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F275ca88e-5f96-4e41-80d4-fb06112ff47c_1269x759.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!LtO_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F275ca88e-5f96-4e41-80d4-fb06112ff47c_1269x759.png 424w, https://substackcdn.com/image/fetch/$s_!LtO_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F275ca88e-5f96-4e41-80d4-fb06112ff47c_1269x759.png 848w, 
https://substackcdn.com/image/fetch/$s_!LtO_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F275ca88e-5f96-4e41-80d4-fb06112ff47c_1269x759.png 1272w, https://substackcdn.com/image/fetch/$s_!LtO_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F275ca88e-5f96-4e41-80d4-fb06112ff47c_1269x759.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong>Figure 1.9:</strong> <em>Internal representation produced by the prompt encoder. 
Sparse prompts are converted into a token sequence and concatenated with learnable output tokens, while dense mask prompts are fused with image embeddings via element-wise addition before entering the mask decoder.</em></p><p>After the image encoder produces a spatially structured feature map, the second major component in SAM is the <strong>prompt encoder</strong>. The role of the prompt encoder is to translate user-provided prompts into embeddings that can guide the mask decoder toward the intended object or region. Unlike the image encoder and the mask decoder, the prompt encoder is intentionally lightweight and <strong>does not contain any transformer layers or self-attention mechanisms</strong>. Its design reflects the fact that prompts carry explicit semantic intent and do not require heavy contextual reasoning on their own.</p><p>All prompt representations are mapped into a <strong>256-dimensional embedding space</strong>, since 256 is the interface dimension expected by the mask decoder and matches the projected output of the image encoder.</p><p><strong>1.7.1 Sparse Prompt Tokenization</strong></p><p>Sparse prompts, namely points and bounding box corners, are converted into <strong>token embeddings</strong>. Each token corresponds to a single geometric entity and occupies one position in a token sequence.</p><p>Each sparse prompt is ultimately represented as a token formed by summing two components. First, the normalized geometric coordinates are mapped to a 256-dimensional positional encoding so that they match the dimensionality expected by the decoder. The SAM paper describes this as a positional encoding of the location, not a separately learned embedding per position. Second, a learned type embedding is added to encode the semantic role of the token, indicating whether it corresponds to a foreground or background point, the first (top-left) corner of a bounding box, or the second (bottom-right) corner. 
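</p><p>A rough sketch of this tokenization in plain Python follows. It is not SAM&#8217;s implementation: <code>encode_sparse</code>, <code>FREQS</code> and <code>TYPE_EMBED</code> are hypothetical names, the coordinate terms are folded into a single fixed random Fourier-style encoding (an assumption of this sketch), the type vectors are random stand-ins for learned parameters, and point labels such as foreground versus background are left out.</p>
<pre>
```python
import math
import random

EMBED_DIM = 256
TYPES = ("point", "box_corner_1", "box_corner_2")

rng = random.Random(0)
# Fixed random frequencies for the coordinate encoding (assumed Fourier-style features).
FREQS = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(EMBED_DIM // 2)]
# One vector per prompt type; random here, learned during training in a real model.
TYPE_EMBED = {t: [rng.gauss(0, 0.02) for _ in range(EMBED_DIM)] for t in TYPES}


def encode_sparse(x, y, kind, image_size=1024):
    """Encode one point or box corner as a 256-d token: coordinate features + type vector."""
    u, v = x / image_size, y / image_size  # normalize pixel coordinates
    feats = []
    for fx, fy in FREQS:
        angle = 2 * math.pi * (fx * u + fy * v)
        feats.extend((math.sin(angle), math.cos(angle)))
    return [f + t for f, t in zip(feats, TYPE_EMBED[kind])]


# Two points and one box give 2 + 2 * 1 = 4 sparse tokens.
points = [(100, 200), (512, 512)]
boxes = [((50, 60), (300, 400))]
tokens = [encode_sparse(x, y, "point") for x, y in points]
for (x0, y0), (x1, y1) in boxes:
    tokens.append(encode_sparse(x0, y0, "box_corner_1"))
    tokens.append(encode_sparse(x1, y1, "box_corner_2"))
```
</pre>
<p>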
After this encoding step, point prompts are represented as a tensor of shape (N, 256), while bounding box prompts are represented as a tensor of shape (2M, 256), reflecting the fact that each box contributes two corner tokens. These sparse tokens are not handled in isolation; instead, they are concatenated along the token dimension to form a single sparse prompt sequence of length N + 2M, which is then passed downstream to guide mask generation.</p><p><strong>1.7.2 Dense Prompt Encoding and Fusion</strong></p><p>Dense mask prompts are treated fundamentally differently. Rather than producing tokens, the mask prompt encoder outputs a <strong>spatial feature map</strong> of shape</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;(256,64,64)&quot;,&quot;id&quot;:&quot;NSJRJGEFEV&quot;}" data-component-name="LatexBlockToDOM"></div><p>This representation is intentionally aligned with the image encoder output, which also has shape (256, 64, 64). Because of this alignment, dense prompt information is not concatenated with tokens. Instead, it is <strong>fused with image features via element-wise addition</strong>.</p><p>This operation directly injects spatial guidance from the mask prompt into the image embedding, ensuring that dense cues influence every spatial location before decoding begins.</p><p><strong>1.7.3 Output Tokens for Mask Generation</strong></p><p>In addition to prompt-derived embeddings, the mask decoder requires a fixed number of learnable tokens to initiate mask generation. SAM predicts <strong>three candidate masks and one IoU score vector</strong>, which together require <strong>four output tokens</strong>.</p><p>These tokens are randomly initialized learnable vectors, each with a dimensionality of 256, and they do not originate from either the prompt encoder or the image encoder. Instead, they are introduced solely to initialize the decoding process and provide fixed starting points for prediction. 
Conceptually, they play the same role as object queries in detection transformers, acting as abstract placeholders that are progressively transformed by the decoder into meaningful outputs, namely the final segmentation masks and their associated quality scores.</p><p><strong>1.7.4 Final Token Sequence Construction</strong></p><p>Before entering the mask decoder, the token representations produced so far are assembled into a single sequence. The sparse prompt tokens, which have shape (N + 2M, 256), are concatenated with a fixed set of output tokens of shape (4, 256). This concatenation results in a unified token sequence of shape (N + 2M + 4, 256), which serves as the sequential input to the decoder. In parallel, the spatial input to the decoder is constructed by element-wise addition of the image encoder output and the dense mask prompt embedding, both of which have shape (256, 64, 64). This addition fuses visual features with dense spatial guidance when a mask prompt is provided. Together, the token sequence and the spatial feature map fully specify the information that is passed into the mask decoder.</p><p>The prompt encoder does not perform multimodal reasoning. Instead, it acts as a <strong>structural adapter</strong>, converting prompts into decoder-compatible representations. All semantic interaction between image content, user intent, and output hypotheses is deferred to the mask decoder. 
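</p><p>The two decoder inputs from 1.7.4 can be assembled in a few lines. The sketch below uses plain Python lists and toy spatial sizes; <code>concat_tokens</code> and <code>add_feature_maps</code> are hypothetical helpers, the zero-filled tokens are placeholders, and the count of four output tokens simply follows the text.</p>
<pre>
```python
def concat_tokens(sparse_tokens, output_tokens):
    """Concatenate along the token dimension; all tokens must share one width."""
    width = len(output_tokens[0])
    if any(len(tok) != width for tok in sparse_tokens + output_tokens):
        raise ValueError("all tokens must have the same dimensionality")
    return sparse_tokens + output_tokens


def add_feature_maps(image_embed, dense_embed):
    """Element-wise sum of two (C, H, W) nested-list feature maps."""
    return [
        [[a + b for a, b in zip(row_a, row_b)] for row_a, row_b in zip(ch_a, ch_b)]
        for ch_a, ch_b in zip(image_embed, dense_embed)
    ]


EMBED_DIM = 256
n_points, n_boxes = 2, 1
sparse = [[0.0] * EMBED_DIM for _ in range(n_points + 2 * n_boxes)]
outputs = [[0.0] * EMBED_DIM for _ in range(4)]  # four output tokens, as in the text
sequence = concat_tokens(sparse, outputs)  # N + 2M + 4 = 8 tokens of width 256

# Toy (C, H, W) = (2, 2, 2) maps; the maps in the text are (256, 64, 64).
image_embed = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
dense_embed = [[[0.5, 0.5], [0.5, 0.5]], [[1.0, 1.0], [1.0, 1.0]]]
fused = add_feature_maps(image_embed, dense_embed)
```
</pre>
<p>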
This clear separation of responsibilities is a central reason why SAM remains both extensible and efficient for interactive segmentation.</p><h1>1.8 Mask Decoder</h1><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DBFz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1e95770-3415-41a8-bf4a-b292d65ec26c_777x660.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DBFz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1e95770-3415-41a8-bf4a-b292d65ec26c_777x660.png 424w, https://substackcdn.com/image/fetch/$s_!DBFz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1e95770-3415-41a8-bf4a-b292d65ec26c_777x660.png 848w, https://substackcdn.com/image/fetch/$s_!DBFz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1e95770-3415-41a8-bf4a-b292d65ec26c_777x660.png 1272w, https://substackcdn.com/image/fetch/$s_!DBFz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1e95770-3415-41a8-bf4a-b292d65ec26c_777x660.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DBFz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1e95770-3415-41a8-bf4a-b292d65ec26c_777x660.png" width="777" height="660" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c1e95770-3415-41a8-bf4a-b292d65ec26c_777x660.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:660,&quot;width&quot;:777,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:35144,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/184705881?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1e95770-3415-41a8-bf4a-b292d65ec26c_777x660.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DBFz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1e95770-3415-41a8-bf4a-b292d65ec26c_777x660.png 424w, https://substackcdn.com/image/fetch/$s_!DBFz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1e95770-3415-41a8-bf4a-b292d65ec26c_777x660.png 848w, https://substackcdn.com/image/fetch/$s_!DBFz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1e95770-3415-41a8-bf4a-b292d65ec26c_777x660.png 1272w, https://substackcdn.com/image/fetch/$s_!DBFz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc1e95770-3415-41a8-bf4a-b292d65ec26c_777x660.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em><strong>Figure 1.10:</strong> Architecture of the SAM mask decoder. The decoder takes a token sequence and a spatial image feature map as inputs, applies self-attention and bidirectional cross-attention, upsamples image features, and generates multiple segmentation masks using per-mask dot products, along with IoU confidence scores.</em></p><p>The mask decoder is the only component in SAM that performs full transformer-based reasoning. Its role is to combine visual information from the image encoder with prompt-derived tokens and transform them into final segmentation masks along with corresponding IoU confidence scores. 
Architecturally, the decoder operates on two parallel inputs: a token sequence and a spatial image feature map, each carrying complementary information.</p><p><strong>1.8.1 Inputs to the Mask Decoder</strong></p><p>The first input is a token sequence of shape (N + 2M + 4, 256). This sequence consists of sparse prompt tokens derived from points and bounding box corners, together with four learnable output tokens. Among these four tokens, three are dedicated to producing segmentation masks and one is dedicated to predicting IoU scores. Each token is a 256-dimensional vector, and the sequence length is the number of sparse prompt tokens plus the four fixed output tokens.</p><p>The second input is a spatial feature map of shape (256, 64, 64). This map is obtained by element-wise addition of the image encoder output and the dense mask prompt embedding, if a mask prompt is provided. These two tensors have the same shape, which is what allows them to be added directly. For transformer operations, this spatial map is temporarily reshaped into a sequence of 4096 tokens, each of dimension 256, corresponding to the flattened 64&#215;64 grid.</p><p><strong>1.8.2 Transformer Structure and Attention Flow</strong></p><p>The core of the mask decoder is a transformer block that is repeated twice in series. Each block contains four stages: self-attention over tokens, token-to-image cross-attention, a feed-forward MLP, and image-to-token cross-attention.</p><p>The process begins with self-attention applied only to the token sequence. Queries, keys, and values are all derived from the same token set, allowing prompt tokens and output tokens to exchange information and form context-aware representations. The output of this stage remains a sequence of length N + 2M + 4, with each token still embedded in 256 dimensions.</p><p>Next, token-to-image cross-attention is performed. 
In this stage, the token sequence generates the queries, while the flattened image feature sequence generates the keys and values. This allows each token to attend over all spatial locations in the image. Importantly, the number of output context vectors produced here equals the number of queries, not the number of image tokens. As a result, the output remains a sequence of N + 2M + 4 tokens, now enriched with image context.</p><p>Following this, an MLP is applied independently to each token to further refine the representations. After the MLP, image-to-token cross-attention is applied. In this case, queries are generated from the image feature sequence, while keys and values come from the token sequence. This operation propagates prompt information back into the image features. The output of this step is a sequence of 4096 image tokens, each of dimension 256, which is then reshaped back into a spatial tensor of shape (256, 64, 64).</p><p>This entire transformer block is executed twice, allowing for deeper bidirectional interaction between token representations and spatial image features.</p><p><strong>1.8.3 Image Upscaling Path</strong></p><p>After the transformer blocks, the refined image feature map of shape (256, 64, 64) is passed through an image upscaling module. This module consists of two successive convolutional upsampling operations, each doubling the spatial resolution. As a result, the feature map is transformed first to (256, 128, 128) and then to (256, 256, 256). This higher-resolution feature map is used for precise mask generation.</p><p><strong>1.8.4 Mask and IoU Prediction</strong></p><p>At this stage, only the four output tokens are retained from the token sequence. The token dedicated to IoU prediction is passed through a small MLP to produce IoU scores for the predicted masks.</p><p>Each of the three mask tokens is processed independently through its own MLP, producing three vectors of dimension 256. These vectors act as dynamic mask heads. 
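</p><p>How such dynamic mask heads act on the feature map can be sketched at the level of shapes. The NumPy snippet below is an illustration under stated assumptions rather than SAM's actual code: <code>apply_mask_heads</code> is a hypothetical name, the inputs are random stand-ins for the MLP outputs and the upscaled features, and the shapes follow the simplified description in this article (the released implementation may use different channel counts).</p>
<pre>
```python
import numpy as np


def apply_mask_heads(mask_vectors, upscaled_features):
    """Dot each mask vector with the feature vector at every pixel.

    mask_vectors:      (K, C)     one vector per candidate mask
    upscaled_features: (C, H, W)  upscaled image feature map
    returns:           (K, H, W)  one single-channel logit map per mask
    """
    # Sum over the shared channel axis c; k, h and w are kept.
    return np.einsum("kc,chw->khw", mask_vectors, upscaled_features)


rng = np.random.default_rng(0)
C, H, W = 256, 256, 256  # shapes as stated in the text
mask_vectors = rng.standard_normal((3, C), dtype=np.float32)
features = rng.standard_normal((C, H, W), dtype=np.float32)

mask_logits = apply_mask_heads(mask_vectors, features)
print(mask_logits.shape)  # (3, 256, 256)
```
</pre>
<p>Writing the operation as a single <code>einsum</code> makes the contraction explicit: the channel axis is summed out, while the mask index and both spatial axes survive.</p><p>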
For each mask token, a per-mask dot product is computed between the 256-dimensional token vector and the 256-channel upscaled image feature map. This operation collapses the channel dimension, producing a single-channel spatial mask of shape (1, 256, 256).</p><p>Finally, each mask is resized using interpolation to match the original image resolution, resulting in output masks of shape (1, 1024, 1024). The decoder therefore produces three candidate segmentation masks along with corresponding IoU scores, enabling the model to represent multiple plausible segmentations for ambiguous prompts.</p><h1>1.9 Concluding <strong>Perspective on SAM</strong></h1><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8SYJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0590ffb4-e629-4a39-a78d-faa8a0020bf6_935x678.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8SYJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0590ffb4-e629-4a39-a78d-faa8a0020bf6_935x678.png 424w, https://substackcdn.com/image/fetch/$s_!8SYJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0590ffb4-e629-4a39-a78d-faa8a0020bf6_935x678.png 848w, https://substackcdn.com/image/fetch/$s_!8SYJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0590ffb4-e629-4a39-a78d-faa8a0020bf6_935x678.png 1272w, https://substackcdn.com/image/fetch/$s_!8SYJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0590ffb4-e629-4a39-a78d-faa8a0020bf6_935x678.png 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8SYJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0590ffb4-e629-4a39-a78d-faa8a0020bf6_935x678.png" width="935" height="678" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0590ffb4-e629-4a39-a78d-faa8a0020bf6_935x678.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:678,&quot;width&quot;:935,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:141750,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/184705881?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0590ffb4-e629-4a39-a78d-faa8a0020bf6_935x678.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!8SYJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0590ffb4-e629-4a39-a78d-faa8a0020bf6_935x678.png 424w, https://substackcdn.com/image/fetch/$s_!8SYJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0590ffb4-e629-4a39-a78d-faa8a0020bf6_935x678.png 848w, https://substackcdn.com/image/fetch/$s_!8SYJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0590ffb4-e629-4a39-a78d-faa8a0020bf6_935x678.png 1272w, https://substackcdn.com/image/fetch/$s_!8SYJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0590ffb4-e629-4a39-a78d-faa8a0020bf6_935x678.png 
1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em><strong>Figure 1.11:</strong> Bird&#8217;s-eye view of the Segment Anything Model architecture. The image encoder extracts visual embeddings, the prompt encoder encodes user guidance, and the mask decoder combines both to generate valid segmentation masks.</em></p><p>The Segment Anything Model represents a clean and principled rethinking of image segmentation as a prompt-driven, general-purpose task. 
By clearly separating visual understanding, user intent, and mask generation into the image encoder, prompt encoder, and mask decoder, SAM achieves both flexibility and scalability. The image encoder learns strong, general visual representations through masked autoencoder pretraining. The prompt encoder translates diverse user inputs into a unified embedding space without relying on transformers. The mask decoder then performs multimodal reasoning using bidirectional attention to fuse image features with prompts and produce multiple plausible masks along with confidence scores.</p><p>This modular design enables SAM to handle ambiguity, support multiple prompt types, and generalize across datasets without task-specific retraining. More importantly, it establishes a reusable architectural pattern where segmentation becomes an interface problem rather than a fixed-label prediction task. As a result, SAM serves not only as a powerful segmentation model but also as a foundation for interactive vision systems and downstream vision-language applications.</p><h1><strong>Watch the full lecture video here</strong></h1><div id="youtube2-SVs-naO2KEA" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;SVs-naO2KEA&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/SVs-naO2KEA?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>If you would like to deepen your understanding of the Segment Anything Model (SAM) and see these ideas explained visually and intuitively, you can refer to the accompanying video linked above. 
If you wish to get access to our code files, handwritten notes, all lecture videos, Discord channel, and other PDF handbooks that we have compiled, along with a code certificate at the end of the program, you can consider being part of the pro version of the &#8220;Transformers for Vision Bootcamp&#8221;. You will find the details here:</p><p><a href="https://vision-transformer.vizuara.ai/">https://vision-transformer.vizuara.ai/</a></p><h1><strong>Other resources</strong></h1><p>If you like this content, please check out our research bootcamps on the following topics:</p><p><strong>CV</strong>: <a href="https://cvresearchbootcamp.vizuara.ai/">https://cvresearchbootcamp.vizuara.ai/</a></p><p><strong>GenAI</strong>: <a href="https://flyvidesh.online/gen-ai-professional-bootcamp">https://flyvidesh.online/gen-ai-professional-bootcamp</a></p><p><strong>RL</strong>: <a href="https://rlresearcherbootcamp.vizuara.ai/">https://rlresearcherbootcamp.vizuara.ai/</a></p><p><strong>SciML</strong>: <a href="https://flyvidesh.online/ml-bootcamp">https://flyvidesh.online/ml-bootcamp</a></p><p><strong>ML-DL</strong>: <a href="https://flyvidesh.online/ml-dl-bootcamp">https://flyvidesh.online/ml-dl-bootcamp</a></p><h1><strong>Connect with us</strong></h1><p><strong>Dr. 
Sreedath Panat</strong></p><p><strong>LinkedIn</strong> : <a href="https://www.linkedin.com/in/sreedath-panat/">https://www.linkedin.com/in/sreedath-panat/</a></p><p><strong>Twitter/X</strong> : <a href="https://x.com/sreedathpanat">https://x.com/sreedathpanat</a></p><p><strong>Mayank Pratap Singh</strong></p><p><strong>LinkedIn</strong> : <a href="https://www.linkedin.com/in/mayankpratapsingh022/">www.linkedin.com/in/mayankpratapsingh022</a></p><p><strong>Twitter/X</strong> : <a href="https://x.com/Mayank_022">x.com/Mayank_022</a></p>]]></content:encoded></item><item><title><![CDATA[Detection Transformer (DETR): An introduction]]></title><description><![CDATA[How can a transformer be used for object detection in images?]]></description><link>https://www.vizuaranewsletter.com/p/detection-transformer-detr-an-introduction</link><guid isPermaLink="false">https://www.vizuaranewsletter.com/p/detection-transformer-detr-an-introduction</guid><dc:creator><![CDATA[Mayank Pratap Singh]]></dc:creator><pubDate>Thu, 15 Jan 2026 08:40:59 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!M0HN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32500ddf-8e94-4494-84aa-6335fa46ef07_786x438.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1>Table of Contents</h1><ol><li><p><em>Introduction to DETR</em></p></li><li><p><em>What Is Object Detection</em></p></li><li><p><em>Core Components of an Object Detection Prediction</em></p></li><li><p><em>Object Detection Versus Classification</em></p></li><li><p><em>Anchor Boxes and Non Maximum Suppression</em></p></li><li><p><em>The Detection Transformer (DETR): Architecture and Design Philosophy</em></p></li><li><p><em>Hungarian Matching Loss in Detection Transformers</em></p></li><li><p><em>When Hungarian Matching Is Needed and When It Is Not in DETR</em></p></li><li><p><em>The Limitation of IoU and the Motivation for Generalized 
IoU</em></p></li><li><p><em>The DETR Loss Function: Formal Definition and Training Mechanics</em></p></li><li><p><em>Concluding Perspective on DETR</em></p></li></ol><h1>1.1 Introduction to DETR</h1><p>Object detection has historically evolved through increasingly sophisticated architectures that balance localization accuracy, classification performance, and computational efficiency. Detection transformers represent a conceptual shift in this progression. Instead of relying on handcrafted components such as anchor boxes, region proposal pipelines, or dense sliding-window predictions, detection transformers formulate object detection as a direct set prediction problem. By leveraging the transformer architecture, these models jointly reason about global image context and object relationships, enabling a simpler and more unified detection pipeline. While the core ideas are not mathematically complex, detection transformers require careful attention to how multiple components interact, including feature extraction, set-based prediction, and loss formulation. 
This section builds the foundation needed to understand detection transformers by first revisiting object detection itself from first principles.</p><h1>1.2 What Is Object Detection</h1><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!M0HN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32500ddf-8e94-4494-84aa-6335fa46ef07_786x438.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!M0HN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32500ddf-8e94-4494-84aa-6335fa46ef07_786x438.png 424w, https://substackcdn.com/image/fetch/$s_!M0HN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32500ddf-8e94-4494-84aa-6335fa46ef07_786x438.png 848w, https://substackcdn.com/image/fetch/$s_!M0HN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32500ddf-8e94-4494-84aa-6335fa46ef07_786x438.png 1272w, https://substackcdn.com/image/fetch/$s_!M0HN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32500ddf-8e94-4494-84aa-6335fa46ef07_786x438.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!M0HN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32500ddf-8e94-4494-84aa-6335fa46ef07_786x438.png" width="786" height="438" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/32500ddf-8e94-4494-84aa-6335fa46ef07_786x438.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:438,&quot;width&quot;:786,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:708594,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/183945695?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32500ddf-8e94-4494-84aa-6335fa46ef07_786x438.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!M0HN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32500ddf-8e94-4494-84aa-6335fa46ef07_786x438.png 424w, https://substackcdn.com/image/fetch/$s_!M0HN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32500ddf-8e94-4494-84aa-6335fa46ef07_786x438.png 848w, https://substackcdn.com/image/fetch/$s_!M0HN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32500ddf-8e94-4494-84aa-6335fa46ef07_786x438.png 1272w, https://substackcdn.com/image/fetch/$s_!M0HN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32500ddf-8e94-4494-84aa-6335fa46ef07_786x438.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><em><strong>Figure 1.1 </strong>Example of object detection output showing multiple bounding boxes, class labels, and confidence scores predicted within a single image.</em></p><p>Object detection extends image classification by moving from a single global prediction to multiple localized predictions within the same image. Rather than assigning one label to an entire image, an object detection model identifies and localizes multiple objects, each potentially belonging to a different class. This means that, within one image, the model must reason about several objects simultaneously, their spatial locations, and their semantic categories.</p><p>Figure 1.1 illustrates a typical object detection output. Multiple bounding boxes are drawn over the image, each corresponding to a detected object such as a person, a horse, or a dog. Along with each bounding box, the model produces a class label and a confidence score that reflects how strongly the model believes the object belongs to that class.</p><p>This differs fundamentally from standard image classification, where the entire image is treated as a single entity. In classification, the output is usually a probability distribution over classes for the whole image. In detection, predictions are localized and repeated across different regions of the image.</p><h1>1.3 Core Components of an Object Detection Prediction</h1><p>In object detection, each detected object is described by a compact numerical representation that jointly encodes its spatial location and semantic identity. Unlike image classification, where a single prediction summarizes the entire image, object detection produces multiple localized predictions, one for each object instance. 
Each prediction must therefore be expressive enough to specify where the object is and what it represents, while remaining simple enough to be learned efficiently.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!WAgX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42f75b3d-a203-4f07-9a63-faa2647f6e45_192x204.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!WAgX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42f75b3d-a203-4f07-9a63-faa2647f6e45_192x204.png 424w, https://substackcdn.com/image/fetch/$s_!WAgX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42f75b3d-a203-4f07-9a63-faa2647f6e45_192x204.png 848w, https://substackcdn.com/image/fetch/$s_!WAgX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42f75b3d-a203-4f07-9a63-faa2647f6e45_192x204.png 1272w, https://substackcdn.com/image/fetch/$s_!WAgX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42f75b3d-a203-4f07-9a63-faa2647f6e45_192x204.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!WAgX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42f75b3d-a203-4f07-9a63-faa2647f6e45_192x204.png" width="192" height="204" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/42f75b3d-a203-4f07-9a63-faa2647f6e45_192x204.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:204,&quot;width&quot;:192,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:36507,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/183945695?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42f75b3d-a203-4f07-9a63-faa2647f6e45_192x204.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!WAgX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42f75b3d-a203-4f07-9a63-faa2647f6e45_192x204.png 424w, https://substackcdn.com/image/fetch/$s_!WAgX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42f75b3d-a203-4f07-9a63-faa2647f6e45_192x204.png 848w, https://substackcdn.com/image/fetch/$s_!WAgX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42f75b3d-a203-4f07-9a63-faa2647f6e45_192x204.png 1272w, https://substackcdn.com/image/fetch/$s_!WAgX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42f75b3d-a203-4f07-9a63-faa2647f6e45_192x204.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><em><strong>Figure 1.2 </strong>Bounding box parameterization for object detection, showing the object represented by the center coordinates (x,y) along with its width and height, which 
together uniquely define the spatial extent of the detected object within the image.</em></p><p>The spatial extent of an object is captured through a bounding box. A bounding box is uniquely defined by four continuous parameters: the x and y coordinates of its center, along with its width and height. This representation is minimal and sufficient to reconstruct the rectangular region corresponding to the object. In practice, these values are normalized with respect to the image dimensions so that they lie in the range from zero to one, which simplifies optimization and allows the model to generalize across images of different resolutions.</p><p>Beyond localization, the model must also assign a semantic label to the contents of each bounding box. This is achieved by predicting a probability distribution over the predefined set of object classes. The distribution reflects the model&#8217;s relative confidence across all possible categories and forms the basis for deciding which class is ultimately assigned to the detected object.</p><p>Finally, object detection models associate each prediction with a confidence score that reflects the likelihood that a valid object is present at the predicted location. This score plays a crucial role during inference, where low confidence predictions are typically discarded to reduce false positives. 
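</p><p>To make these three outputs concrete, here is a minimal Python sketch that decodes a single prediction. It is an illustration only: the function name <code>decode_prediction</code>, the argument layout, and the default threshold of 0.5 are assumptions made for this example and are not taken from any specific detector.</p><pre><code>from typing import List, Optional, Tuple


def decode_prediction(
    box: Tuple[float, float, float, float],
    class_probs: List[float],
    confidence: float,
    image_width: int,
    image_height: int,
    min_confidence: float = 0.5,  # assumed, hand-picked threshold
) -> Optional[dict]:
    """Decode one prediction into pixel corners, a class index, and a score.

    `box` is assumed to hold normalized (cx, cy, w, h) values between 0 and 1.
    Returns None when the confidence falls below `min_confidence`.
    """
    # Low confidence predictions are discarded to reduce false positives.
    if min_confidence > confidence:
        return None

    cx, cy, w, h = box
    # Undo the normalization and convert center format to corner format.
    x_min = (cx - w / 2) * image_width
    y_min = (cy - h / 2) * image_height
    x_max = (cx + w / 2) * image_width
    y_max = (cy + h / 2) * image_height

    # The assigned class is the most probable entry of the distribution.
    label = max(range(len(class_probs)), key=lambda i: class_probs[i])
    return {"box": (x_min, y_min, x_max, y_max), "label": label, "score": confidence}
</code></pre><p>For a 100 by 200 pixel image, the normalized box (0.5, 0.5, 0.25, 0.5) maps to the pixel corners (37.5, 50.0, 62.5, 150.0). Because the box is stored in normalized form, the same prediction can be rescaled to any image resolution.</p><p>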
Depending on the model design, this confidence may be predicted explicitly or implicitly encoded within the class probabilities.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6UIz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fe88513-ff04-4222-8781-640fae2fca79_936x516.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6UIz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fe88513-ff04-4222-8781-640fae2fca79_936x516.png 424w, https://substackcdn.com/image/fetch/$s_!6UIz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fe88513-ff04-4222-8781-640fae2fca79_936x516.png 848w, https://substackcdn.com/image/fetch/$s_!6UIz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fe88513-ff04-4222-8781-640fae2fca79_936x516.png 1272w, https://substackcdn.com/image/fetch/$s_!6UIz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fe88513-ff04-4222-8781-640fae2fca79_936x516.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6UIz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fe88513-ff04-4222-8781-640fae2fca79_936x516.png" width="936" height="516" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0fe88513-ff04-4222-8781-640fae2fca79_936x516.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:516,&quot;width&quot;:936,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:91778,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/183945695?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fe88513-ff04-4222-8781-640fae2fca79_936x516.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6UIz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fe88513-ff04-4222-8781-640fae2fca79_936x516.png 424w, https://substackcdn.com/image/fetch/$s_!6UIz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fe88513-ff04-4222-8781-640fae2fca79_936x516.png 848w, https://substackcdn.com/image/fetch/$s_!6UIz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fe88513-ff04-4222-8781-640fae2fca79_936x516.png 1272w, https://substackcdn.com/image/fetch/$s_!6UIz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fe88513-ff04-4222-8781-640fae2fca79_936x516.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong>Figure 1.3 </strong><em>High level object detection pipeline illustrating how an input image is processed by an object detection model to produce localized bounding boxes, along with corresponding class predictions and confidence scores.</em></p><h1>1.4 Object Detection Versus Classification</h1><p>The distinction between object detection and image classification becomes clear when examining the structure of their model outputs and the losses used during training. In a standard classification task, the model processes an entire image as a single unit and produces a probability vector over the available classes. 
The corresponding ground truth is typically represented as a one hot encoded vector, and learning is driven by a cross entropy loss that measures the discrepancy between the predicted probability distribution and the true class label.</p><p>Object detection introduces additional complexity because the model must solve multiple prediction problems simultaneously. In addition to assigning a class label, the model must also predict continuous values that describe the spatial extent of each object, such as the bounding box coordinates. As a result, detection models rely on composite loss functions that integrate several objectives into a single training signal. These typically include a classification loss that supervises the predicted class probabilities, a regression loss that penalizes errors in the predicted bounding box parameters, and, in many formulations, an additional objectness or confidence loss that reflects whether a predicted region corresponds to a real object.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Bmod!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4289c7be-e9e9-44a2-9e51-b39596eecf70_1602x672.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Bmod!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4289c7be-e9e9-44a2-9e51-b39596eecf70_1602x672.png 424w, https://substackcdn.com/image/fetch/$s_!Bmod!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4289c7be-e9e9-44a2-9e51-b39596eecf70_1602x672.png 848w, 
https://substackcdn.com/image/fetch/$s_!Bmod!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4289c7be-e9e9-44a2-9e51-b39596eecf70_1602x672.png 1272w, https://substackcdn.com/image/fetch/$s_!Bmod!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4289c7be-e9e9-44a2-9e51-b39596eecf70_1602x672.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Bmod!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4289c7be-e9e9-44a2-9e51-b39596eecf70_1602x672.png" width="1456" height="611" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4289c7be-e9e9-44a2-9e51-b39596eecf70_1602x672.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:611,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:243205,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/183945695?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4289c7be-e9e9-44a2-9e51-b39596eecf70_1602x672.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Bmod!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4289c7be-e9e9-44a2-9e51-b39596eecf70_1602x672.png 424w, 
https://substackcdn.com/image/fetch/$s_!Bmod!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4289c7be-e9e9-44a2-9e51-b39596eecf70_1602x672.png 848w, https://substackcdn.com/image/fetch/$s_!Bmod!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4289c7be-e9e9-44a2-9e51-b39596eecf70_1602x672.png 1272w, https://substackcdn.com/image/fetch/$s_!Bmod!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4289c7be-e9e9-44a2-9e51-b39596eecf70_1602x672.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em><strong>Figure 1.4</strong> illustrates this idea in the context of class prediction, showing how the model outputs a probability distribution for each detected object and how this distribution is compared against the one hot encoded ground truth using a cross entropy loss. Together, these loss components enable object detection models to jointly learn accurate localization and reliable classification within a unified optimization framework.</em></p><p>From a purely conceptual standpoint, an object detection model can be seen as a function that maps an image to a set of vectors, where each vector encodes one detected object. In the simplest case, one might imagine a neural network that takes an image and outputs a single detection vector. This approach quickly breaks down in realistic settings, as images often contain multiple objects, possibly of the same class, and in varying spatial configurations.</p><p>This limitation motivated earlier detection architectures such as Faster R-CNN and YOLO, which introduced mechanisms for handling multiple objects through region proposals or dense predictions. These approaches rely heavily on convolutional inductive biases and task specific design choices.</p><p>Detection transformers take a different route. They remove many of these handcrafted components and instead rely on the transformer&#8217;s ability to model sets and global relationships. Before diving into that architecture, it is essential to be comfortable with the fundamental structure of object detection outputs, targets, and losses, as developed in this section.</p><h1>1.5 Anchor Boxes and Non Maximum Suppression</h1><p>Before the introduction of detection transformers, most object detection systems relied on a set of carefully engineered components to handle the challenges of localization and duplicate predictions. 
Two of the most important of these components were anchor boxes and non maximum suppression. While effective in practice, both mechanisms introduced additional complexity and a significant number of manual design choices into the detection pipeline. Understanding them provides useful historical context and clarifies what detection transformers deliberately remove.</p><p>Anchor boxes were introduced as a way to simplify the problem of predicting bounding boxes. Instead of asking a model to predict bounding box coordinates from scratch, earlier detectors predefined a large collection of reference boxes at fixed locations in the image. These reference boxes, called anchor boxes, were placed on a regular grid over the image and instantiated with multiple sizes and aspect ratios at each grid point. The intuition was that, for any object in the image, at least one anchor box would roughly overlap with it. During training, the model learned to slightly adjust the position and shape of these anchor boxes so that they aligned more closely with the ground truth objects.</p><p>Figure 1.5 illustrates this idea. At a single spatial location, multiple anchor boxes with different aspect ratios are defined. Some boxes are nearly square, while others are elongated horizontally or vertically. 
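</p><p>The grid construction described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the function name <code>make_anchors</code>, the square grid, and the area-preserving width and height formula are choices made for this example, and real detectors differ in how they parameterize scales and ratios.</p><pre><code>from itertools import product
from typing import List, Sequence, Tuple

Anchor = Tuple[float, float, float, float]  # normalized (cx, cy, w, h)


def make_anchors(
    grid_size: int,
    scales: Sequence[float],
    aspect_ratios: Sequence[float],
) -> List[Anchor]:
    """Place one anchor per (scale, aspect ratio) pair at every grid cell center."""
    anchors = []
    for row, col in product(range(grid_size), repeat=2):
        # Center of the current grid cell in normalized image coordinates.
        cx = (col + 0.5) / grid_size
        cy = (row + 0.5) / grid_size
        for scale, ratio in product(scales, aspect_ratios):
            # ratio is width divided by height; the area stays scale ** 2
            # for every ratio (an assumed convention, not the only option).
            w = scale * ratio ** 0.5
            h = scale / ratio ** 0.5
            anchors.append((cx, cy, w, h))
    return anchors
</code></pre><p>Calling <code>make_anchors(2, [0.25], [1.0, 4.0])</code> returns 8 anchors, one square box and one wide box at each of the four grid cells. The total count grows as the number of grid cells times the number of scales times the number of aspect ratios, which is why anchor based detectors produce so many candidate boxes.</p><p>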
By starting from these diverse shapes, the model can better handle objects with varying geometries, including extreme aspect ratios that arise due to viewpoint changes or object pose.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3svs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5dff9731-2c28-4f23-878b-1869440674b6_522x543.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3svs!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5dff9731-2c28-4f23-878b-1869440674b6_522x543.png 424w, https://substackcdn.com/image/fetch/$s_!3svs!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5dff9731-2c28-4f23-878b-1869440674b6_522x543.png 848w, https://substackcdn.com/image/fetch/$s_!3svs!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5dff9731-2c28-4f23-878b-1869440674b6_522x543.png 1272w, https://substackcdn.com/image/fetch/$s_!3svs!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5dff9731-2c28-4f23-878b-1869440674b6_522x543.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3svs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5dff9731-2c28-4f23-878b-1869440674b6_522x543.png" width="522" height="543" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5dff9731-2c28-4f23-878b-1869440674b6_522x543.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:543,&quot;width&quot;:522,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:345718,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/183945695?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5dff9731-2c28-4f23-878b-1869440674b6_522x543.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3svs!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5dff9731-2c28-4f23-878b-1869440674b6_522x543.png 424w, https://substackcdn.com/image/fetch/$s_!3svs!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5dff9731-2c28-4f23-878b-1869440674b6_522x543.png 848w, https://substackcdn.com/image/fetch/$s_!3svs!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5dff9731-2c28-4f23-878b-1869440674b6_522x543.png 1272w, https://substackcdn.com/image/fetch/$s_!3svs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5dff9731-2c28-4f23-878b-1869440674b6_522x543.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em><strong>Figure 1.5 </strong>Anchor boxes defined at a fixed location with multiple sizes and aspect ratios, serving as predefined starting points for bounding box regression.</em></p><p>While anchor boxes help with localization, they also introduce a side effect. During inference, many anchor boxes may produce very similar predictions for the same object. As a result, a detector often outputs multiple overlapping bounding boxes that all correspond to a single object. This redundancy must be resolved before producing the final set of detections. Non maximum suppression, commonly abbreviated as NMS, was designed to address this problem.</p><p>Non maximum suppression is a post processing algorithm applied after the model has produced its raw bounding box predictions. 
The algorithm operates on a set of predicted boxes, each associated with a confidence score. The first step is to discard boxes whose confidence falls below a predefined threshold. This threshold is chosen manually and reflects the minimum confidence required for a prediction to be considered meaningful. After this filtering step, the remaining boxes are sorted in descending order of confidence.</p><p>Figure 1.5 shows the effect of this process. Before non maximum suppression, many overlapping boxes are present around the same object. After applying NMS, only a single bounding box remains, corresponding to the highest confidence prediction.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NQ7o!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23aaf46f-1e19-4b10-aebb-a41d9fa947b3_1089x510.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NQ7o!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23aaf46f-1e19-4b10-aebb-a41d9fa947b3_1089x510.png 424w, https://substackcdn.com/image/fetch/$s_!NQ7o!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23aaf46f-1e19-4b10-aebb-a41d9fa947b3_1089x510.png 848w, https://substackcdn.com/image/fetch/$s_!NQ7o!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23aaf46f-1e19-4b10-aebb-a41d9fa947b3_1089x510.png 1272w, https://substackcdn.com/image/fetch/$s_!NQ7o!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23aaf46f-1e19-4b10-aebb-a41d9fa947b3_1089x510.png 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NQ7o!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23aaf46f-1e19-4b10-aebb-a41d9fa947b3_1089x510.png" width="1089" height="510" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/23aaf46f-1e19-4b10-aebb-a41d9fa947b3_1089x510.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:510,&quot;width&quot;:1089,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:413669,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/183945695?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23aaf46f-1e19-4b10-aebb-a41d9fa947b3_1089x510.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!NQ7o!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23aaf46f-1e19-4b10-aebb-a41d9fa947b3_1089x510.png 424w, https://substackcdn.com/image/fetch/$s_!NQ7o!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23aaf46f-1e19-4b10-aebb-a41d9fa947b3_1089x510.png 848w, https://substackcdn.com/image/fetch/$s_!NQ7o!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23aaf46f-1e19-4b10-aebb-a41d9fa947b3_1089x510.png 1272w, 
https://substackcdn.com/image/fetch/$s_!NQ7o!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23aaf46f-1e19-4b10-aebb-a41d9fa947b3_1089x510.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em><strong>Figure 1.5 </strong>Effect of non maximum suppression. Multiple overlapping predictions before NMS are reduced to a single high confidence bounding box after NMS.</em></p><p>The core operation in non maximum suppression is the comparison of bounding boxes using the intersection over union metric, commonly abbreviated as IoU. 
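</p><p>As a minimal sketch of that comparison, the Python function below computes IoU as the intersection area divided by the union area, matching the definition that follows. The function name <code>iou</code> and the corner format (x_min, y_min, x_max, y_max) are assumptions made for this example; boxes stored in center format would need to be converted first.</p><pre><code>from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(box_a: Box, box_b: Box) -> float:
    """Intersection over union of two axis-aligned boxes in corner format."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b

    # Width and height of the overlap region, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    intersection = inter_w * inter_h

    area_a = (ax1 - ax0) * (ay1 - ay0)
    area_b = (bx1 - bx0) * (by1 - by0)
    union = area_a + area_b - intersection

    # Guard against degenerate boxes with zero area.
    return intersection / union if union > 0 else 0.0
</code></pre><p>For example, the boxes (0, 0, 2, 2) and (1, 1, 3, 3) each have area 4 and share a unit square, so their IoU is 1 / (4 + 4 - 1) = 1/7, or roughly 0.14.</p><p>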
Given two bounding boxes <em>B<sub>i</sub></em> and <em>B<sub>j</sub></em>, the IoU is defined as the ratio of the area of their intersection to the area of their union,</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot; \\text{IoU}(B_i, B_j) = \\frac{\\text{Area}(B_i \\cap B_j)}{\\text{Area}(B_i \\cup B_j)} \n\n&quot;,&quot;id&quot;:&quot;AXYPMATRBZ&quot;}" data-component-name="LatexBlockToDOM"></div><p>This value lies between zero and one. An IoU of zero indicates no overlap, while an IoU of one indicates perfect overlap. Figure 1.6 provides a visual interpretation of IoU, along with examples of high, moderate, and low overlap scenarios.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!P4B6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05403d08-213e-40d2-9b28-815b8108b90d_1323x450.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!P4B6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05403d08-213e-40d2-9b28-815b8108b90d_1323x450.png 424w, https://substackcdn.com/image/fetch/$s_!P4B6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05403d08-213e-40d2-9b28-815b8108b90d_1323x450.png 848w, https://substackcdn.com/image/fetch/$s_!P4B6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05403d08-213e-40d2-9b28-815b8108b90d_1323x450.png 1272w, 
https://substackcdn.com/image/fetch/$s_!P4B6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05403d08-213e-40d2-9b28-815b8108b90d_1323x450.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!P4B6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05403d08-213e-40d2-9b28-815b8108b90d_1323x450.png" width="1323" height="450" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/05403d08-213e-40d2-9b28-815b8108b90d_1323x450.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:450,&quot;width&quot;:1323,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:19410,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/183945695?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05403d08-213e-40d2-9b28-815b8108b90d_1323x450.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!P4B6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05403d08-213e-40d2-9b28-815b8108b90d_1323x450.png 424w, https://substackcdn.com/image/fetch/$s_!P4B6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05403d08-213e-40d2-9b28-815b8108b90d_1323x450.png 848w, 
https://substackcdn.com/image/fetch/$s_!P4B6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05403d08-213e-40d2-9b28-815b8108b90d_1323x450.png 1272w, https://substackcdn.com/image/fetch/$s_!P4B6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05403d08-213e-40d2-9b28-815b8108b90d_1323x450.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong>Figure 1.6</strong><br><em>Intersection over Union illustrated as the ratio between overlapping area and total union area, 
with examples ranging from excellent to poor overlap.</em></p><p>Using IoU, non maximum suppression proceeds iteratively. The box with the highest confidence score is selected and added to the final output set. This box is then compared with all remaining boxes. Any box whose IoU with the selected box exceeds a predefined IoU threshold is removed, as it is considered a duplicate prediction of the same object. Boxes with low IoU values, such as 0.1 or 0.2, are retained because they likely correspond to different objects. The process is then repeated with the next highest confidence box among the remaining predictions, continuing until no boxes remain.</p><p>Figure 1.7 summarizes this algorithmic flow, highlighting how confidence thresholds and IoU thresholds guide the suppression process.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!IT2x!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54be9afe-cbc4-4a93-9859-7aab9d4497d5_1473x519.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!IT2x!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54be9afe-cbc4-4a93-9859-7aab9d4497d5_1473x519.png 424w, https://substackcdn.com/image/fetch/$s_!IT2x!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54be9afe-cbc4-4a93-9859-7aab9d4497d5_1473x519.png 848w, https://substackcdn.com/image/fetch/$s_!IT2x!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54be9afe-cbc4-4a93-9859-7aab9d4497d5_1473x519.png 1272w, 
https://substackcdn.com/image/fetch/$s_!IT2x!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54be9afe-cbc4-4a93-9859-7aab9d4497d5_1473x519.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!IT2x!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54be9afe-cbc4-4a93-9859-7aab9d4497d5_1473x519.png" width="1456" height="513" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/54be9afe-cbc4-4a93-9859-7aab9d4497d5_1473x519.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:513,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:525362,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/183945695?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54be9afe-cbc4-4a93-9859-7aab9d4497d5_1473x519.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!IT2x!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54be9afe-cbc4-4a93-9859-7aab9d4497d5_1473x519.png 424w, https://substackcdn.com/image/fetch/$s_!IT2x!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54be9afe-cbc4-4a93-9859-7aab9d4497d5_1473x519.png 848w, 
https://substackcdn.com/image/fetch/$s_!IT2x!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54be9afe-cbc4-4a93-9859-7aab9d4497d5_1473x519.png 1272w, https://substackcdn.com/image/fetch/$s_!IT2x!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54be9afe-cbc4-4a93-9859-7aab9d4497d5_1473x519.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong>Figure 1.7</strong><br><em>Step by step procedure of non maximum suppression, including confidence filtering, IoU computation, 
and iterative removal of duplicate boxes.</em></p><p>Although effective, both anchor boxes and non-maximum suppression rely heavily on hand-engineered choices. The number of anchor boxes, their aspect ratios, the confidence threshold, and the IoU threshold are all selected manually, typically through trial and error. These design decisions are not learned from data and can significantly influence detection performance. Detection transformers remove both anchor boxes and non-maximum suppression, replacing them with a learned set-based prediction mechanism. This shift toward end-to-end learning is one of the central motivations behind the detection transformer architecture and will be explored in detail in the following sections.</p><h1>1.6 The Detection Transformer (DETR): Architecture and Design Philosophy</h1><p>The Detection Transformer (DETR) represents a conceptual shift in object detection by reformulating detection as a <strong>direct set prediction problem</strong>. Unlike traditional detectors that rely on dense proposals, anchor boxes, and post-processing steps such as non-maximum suppression, DETR predicts a fixed-size set of object candidates in a single forward pass and matches them globally to ground-truth objects using a set-based loss. This design eliminates hand-engineered components and allows object detection to be expressed entirely within an end-to-end trainable framework built around the transformer encoder&#8211;decoder architecture.</p><p>At a high level, DETR consists of four major components: a convolutional backbone for feature extraction, a transformer encoder for global contextual reasoning, a transformer decoder driven by object queries, and lightweight prediction heads that output class labels and bounding boxes. 
The interaction between these components enables DETR to jointly reason about object presence, category, and spatial localization.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!AQ4P!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2277ddf1-9b17-47ea-b523-3ba364999b2d_1581x1218.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!AQ4P!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2277ddf1-9b17-47ea-b523-3ba364999b2d_1581x1218.png 424w, https://substackcdn.com/image/fetch/$s_!AQ4P!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2277ddf1-9b17-47ea-b523-3ba364999b2d_1581x1218.png 848w, https://substackcdn.com/image/fetch/$s_!AQ4P!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2277ddf1-9b17-47ea-b523-3ba364999b2d_1581x1218.png 1272w, https://substackcdn.com/image/fetch/$s_!AQ4P!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2277ddf1-9b17-47ea-b523-3ba364999b2d_1581x1218.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!AQ4P!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2277ddf1-9b17-47ea-b523-3ba364999b2d_1581x1218.png" width="1456" height="1122" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2277ddf1-9b17-47ea-b523-3ba364999b2d_1581x1218.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1122,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:392803,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/183945695?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2277ddf1-9b17-47ea-b523-3ba364999b2d_1581x1218.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!AQ4P!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2277ddf1-9b17-47ea-b523-3ba364999b2d_1581x1218.png 424w, https://substackcdn.com/image/fetch/$s_!AQ4P!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2277ddf1-9b17-47ea-b523-3ba364999b2d_1581x1218.png 848w, https://substackcdn.com/image/fetch/$s_!AQ4P!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2277ddf1-9b17-47ea-b523-3ba364999b2d_1581x1218.png 1272w, https://substackcdn.com/image/fetch/$s_!AQ4P!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2277ddf1-9b17-47ea-b523-3ba364999b2d_1581x1218.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em><strong>Figure 1.8</strong> Overall architecture of the Detection Transformer (DETR), illustrating the flow from input image to CNN feature extraction, transformer encoder&#8211;decoder processing, and final set-based object predictions.</em></p><h4><strong>From Image to Feature Map: CNN Backbone Representation</strong></h4><p>The DETR pipeline begins with an input image, which is first processed by a standard convolutional neural network such as ResNet-50 or ResNet-101. This backbone is typically pretrained on ImageNet and serves the purpose of extracting rich, hierarchical visual features from the image. 
As the image propagates through successive convolutional layers, its spatial resolution is progressively reduced while the number of feature channels increases.</p><p>The resulting output is a <strong>feature map</strong> of shape </p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;(H_f \\times W_f \\times C)&quot;,&quot;id&quot;:&quot;MOLYNWJRQC&quot;}" data-component-name="LatexBlockToDOM"></div><p>where <em>H_f</em>&#8203; and <em>W_f</em>&#8203; are the reduced spatial dimensions and <em>C</em> is the number of channels.</p><p>These feature maps encode semantic and spatial information about the image, but they are still arranged in a two-dimensional grid. Since transformers operate on sequences rather than grids, this representation must be transformed before it can be processed by the encoder.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lkY0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0804372b-393c-4674-a8bb-1159c90ceda8_1230x333.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lkY0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0804372b-393c-4674-a8bb-1159c90ceda8_1230x333.png 424w, https://substackcdn.com/image/fetch/$s_!lkY0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0804372b-393c-4674-a8bb-1159c90ceda8_1230x333.png 848w, https://substackcdn.com/image/fetch/$s_!lkY0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0804372b-393c-4674-a8bb-1159c90ceda8_1230x333.png 1272w, 
https://substackcdn.com/image/fetch/$s_!lkY0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0804372b-393c-4674-a8bb-1159c90ceda8_1230x333.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lkY0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0804372b-393c-4674-a8bb-1159c90ceda8_1230x333.png" width="1230" height="333" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0804372b-393c-4674-a8bb-1159c90ceda8_1230x333.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:333,&quot;width&quot;:1230,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:187619,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/183945695?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0804372b-393c-4674-a8bb-1159c90ceda8_1230x333.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!lkY0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0804372b-393c-4674-a8bb-1159c90ceda8_1230x333.png 424w, https://substackcdn.com/image/fetch/$s_!lkY0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0804372b-393c-4674-a8bb-1159c90ceda8_1230x333.png 848w, 
https://substackcdn.com/image/fetch/$s_!lkY0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0804372b-393c-4674-a8bb-1159c90ceda8_1230x333.png 1272w, https://substackcdn.com/image/fetch/$s_!lkY0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0804372b-393c-4674-a8bb-1159c90ceda8_1230x333.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong>Figure 1.9 </strong><em>Conversion of an input image into a dense feature map using a CNN backbone, where spatial resolution 
is reduced and semantic richness is increased.</em></p><h4>Tokenization and Positional Encoding of Image Features</h4><p>To make the CNN feature map compatible with the transformer encoder, the spatial grid of features is flattened into a one-dimensional sequence. Each spatial location in the feature map is treated as a token, resulting in a sequence of length</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;H_f \\times W_f&quot;,&quot;id&quot;:&quot;PQHOQXOEAK&quot;}" data-component-name="LatexBlockToDOM"></div><p>Before feeding this sequence into the transformer, a linear projection is applied to map each feature vector from channel dimension <em>C</em> into the transformer&#8217;s embedding dimension <em>d_model</em>.</p><p>Because transformers lack an inherent notion of spatial order, positional information must be explicitly injected. DETR employs <strong>sinusoidal positional encodings</strong>, derived from the original Transformer formulation, to encode the two-dimensional spatial position of each feature token. 
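</p><p>As a rough illustration of the two steps described so far, the following dependency-free Python sketch flattens a feature map into tokens and builds a two-dimensional sinusoidal encoding for each grid cell. The function names, the row-major token order, the even split of channels between the y and x coordinates, and the temperature of 10000 are assumptions made for this example rather than details confirmed by this article; the linear projection to d_model is left out.</p><pre>
```python
import math


def flatten_feature_map(feature_map):
    """Turn an H_f x W_f grid of C-dim vectors into H_f * W_f tokens (row-major)."""
    return [vec for row in feature_map for vec in row]


def sine_encoding_1d(pos, num_feats, temperature=10000.0):
    """Sinusoidal encoding of one scalar coordinate into num_feats values."""
    enc = []
    for i in range(num_feats):
        # Channels (2k, 2k + 1) share one frequency: sine on even, cosine on odd.
        freq = temperature ** (2 * (i // 2) / num_feats)
        angle = pos / freq
        enc.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return enc


def positional_encoding_2d(h_f, w_f, d_model):
    """One d_model-dim encoding per grid cell: first half encodes y, second half x."""
    assert d_model % 4 == 0, "each coordinate needs an even number of channels"
    half = d_model // 2
    return [sine_encoding_1d(y, half) + sine_encoding_1d(x, half)
            for y in range(h_f) for x in range(w_f)]


# Toy example: a 2 x 3 feature map with C = 4 channels (all zeros, for shape only).
h_f, w_f, c = 2, 3, 4
feature_map = [[[0.0] * c for _ in range(w_f)] for _ in range(h_f)]
tokens = flatten_feature_map(feature_map)
pos = positional_encoding_2d(h_f, w_f, d_model=8)
print(len(tokens), len(pos), len(pos[0]))  # 6 6 8
```
</pre><p>In this sketch there is exactly one encoding per token, in the same row-major order, and its width equals d_model, so after projection the two could be summed element-wise.</p><p>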
These positional encodings are added to the token embeddings and are injected at every transformer encoder layer, ensuring that spatial information is preserved throughout the encoding process.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!SIZQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32c4ee4d-9eeb-405a-9927-26f9b1e9a1b3_1419x921.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!SIZQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32c4ee4d-9eeb-405a-9927-26f9b1e9a1b3_1419x921.png 424w, https://substackcdn.com/image/fetch/$s_!SIZQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32c4ee4d-9eeb-405a-9927-26f9b1e9a1b3_1419x921.png 848w, https://substackcdn.com/image/fetch/$s_!SIZQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32c4ee4d-9eeb-405a-9927-26f9b1e9a1b3_1419x921.png 1272w, https://substackcdn.com/image/fetch/$s_!SIZQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32c4ee4d-9eeb-405a-9927-26f9b1e9a1b3_1419x921.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!SIZQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32c4ee4d-9eeb-405a-9927-26f9b1e9a1b3_1419x921.png" width="1419" height="921" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/32c4ee4d-9eeb-405a-9927-26f9b1e9a1b3_1419x921.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:921,&quot;width&quot;:1419,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:124711,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/183945695?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32c4ee4d-9eeb-405a-9927-26f9b1e9a1b3_1419x921.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!SIZQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32c4ee4d-9eeb-405a-9927-26f9b1e9a1b3_1419x921.png 424w, https://substackcdn.com/image/fetch/$s_!SIZQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32c4ee4d-9eeb-405a-9927-26f9b1e9a1b3_1419x921.png 848w, https://substackcdn.com/image/fetch/$s_!SIZQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32c4ee4d-9eeb-405a-9927-26f9b1e9a1b3_1419x921.png 1272w, https://substackcdn.com/image/fetch/$s_!SIZQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32c4ee4d-9eeb-405a-9927-26f9b1e9a1b3_1419x921.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong>Figure 1.10 </strong><em>Flattening of CNN feature maps into a sequence of tokens and the addition of sinusoidal positional encodings prior to transformer encoding.</em></p><p>While the encoder processes image-derived tokens, the transformer decoder in DETR introduces a fundamentally different concept: <strong>object queries</strong>. Object queries are a fixed set of learnable vectors, initialized independently of the input image, whose number determines the maximum number of objects the model can predict. For example, if 100 object queries are used, the model will always produce 100 predictions, regardless of how many objects are actually present in the image.</p><p>These object queries act as slots that compete to explain objects in the scene. 
Some queries will eventually specialize to predict real objects, while others will learn to predict the special &#8220;no-object&#8221; class, corresponding to background.</p><p>The decoder processes object queries through multiple decoder blocks, each consisting of three main sublayers: self-attention among object queries, cross-attention between object queries and encoder outputs, and a feed-forward network. In the self-attention stage, object queries interact with each other, allowing the model to reason about relationships between predicted objects and to avoid redundant detections. In the cross-attention stage, object queries attend to the encoder&#8217;s image representations, enabling each query to focus on relevant regions of the image.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UAEg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde08d6bf-5025-426d-a22b-1c6fc5cdada6_1149x1122.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UAEg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde08d6bf-5025-426d-a22b-1c6fc5cdada6_1149x1122.png 424w, https://substackcdn.com/image/fetch/$s_!UAEg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde08d6bf-5025-426d-a22b-1c6fc5cdada6_1149x1122.png 848w, https://substackcdn.com/image/fetch/$s_!UAEg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde08d6bf-5025-426d-a22b-1c6fc5cdada6_1149x1122.png 1272w, 
https://substackcdn.com/image/fetch/$s_!UAEg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde08d6bf-5025-426d-a22b-1c6fc5cdada6_1149x1122.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UAEg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde08d6bf-5025-426d-a22b-1c6fc5cdada6_1149x1122.png" width="1149" height="1122" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/de08d6bf-5025-426d-a22b-1c6fc5cdada6_1149x1122.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1122,&quot;width&quot;:1149,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:71189,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/183945695?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde08d6bf-5025-426d-a22b-1c6fc5cdada6_1149x1122.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!UAEg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde08d6bf-5025-426d-a22b-1c6fc5cdada6_1149x1122.png 424w, https://substackcdn.com/image/fetch/$s_!UAEg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde08d6bf-5025-426d-a22b-1c6fc5cdada6_1149x1122.png 848w, 
https://substackcdn.com/image/fetch/$s_!UAEg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde08d6bf-5025-426d-a22b-1c6fc5cdada6_1149x1122.png 1272w, https://substackcdn.com/image/fetch/$s_!UAEg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde08d6bf-5025-426d-a22b-1c6fc5cdada6_1149x1122.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong>Figure 1.11 </strong><em>Transformer decoder structure in DETR, illustrating self-attention among object queries and 
cross-attention to the encoder outputs, highlighting the different origins of queries (object queries) and keys/values (encoder outputs).</em></p><p><strong>Cross-Attention and Query&#8211;Key&#8211;Value Asymmetry</strong></p><p>A key distinction between the transformer encoder and decoder lies in the source of queries, keys, and values. In encoder self-attention, all three originate from the same sequence of image tokens. In contrast, during decoder cross-attention, the <strong>queries</strong> originate from object queries, while the <strong>keys and values</strong> originate from the encoder&#8217;s output representations of the image. This asymmetry is what defines cross-attention and allows object queries to selectively extract information from the image.</p><p>Unlike image tokens, object queries do not have an inherent spatial ordering. As a result, DETR uses <strong>learnable positional embeddings</strong> for object queries instead of sinusoidal ones. These embeddings are trained jointly with the rest of the model and allow the decoder to differentiate between individual query slots.</p><p><strong>Prediction Heads and Set-Based Outputs</strong></p><p>After passing through the transformer decoder, each object query is transformed into a final embedding that is fed into two lightweight feed-forward networks. One prediction head outputs a probability distribution over object classes, including the special background (&#8220;no-object&#8221;) class, while the other predicts the bounding box parameters (x, y, w, h), that is, the box center, width, and height, in normalized image coordinates.</p><p>Crucially, there is a one-to-one correspondence between object queries and predictions: each query produces exactly one class prediction and one bounding box. 
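</p><p>To make the query/key/value asymmetry and the per-query outputs concrete, here is a small dependency-free Python sketch. It is a simplification under stated assumptions, not DETR's actual implementation: attention uses a single head with no learned projections, each prediction head is reduced to one linear layer, random numbers stand in for trained parameters, and all function and variable names are hypothetical.</p><pre>
```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(object_queries, encoder_tokens):
    """Single-head cross-attention with the learned projections left out."""
    dim = len(encoder_tokens[0])
    outputs = []
    for query in object_queries:  # queries come from the object queries
        # Keys come from the encoder output tokens.
        weights = softmax([dot(query, key) / math.sqrt(dim) for key in encoder_tokens])
        # Values also come from the encoder output tokens.
        outputs.append([sum(w * tok[d] for w, tok in zip(weights, encoder_tokens))
                        for d in range(dim)])
    return outputs


def linear(x, weight, bias):
    return [dot(row, x) + b for row, b in zip(weight, bias)]


def predict(embedding, class_w, class_b, box_w, box_b):
    """Map one decoder embedding to class probabilities and a normalized box."""
    class_probs = softmax(linear(embedding, class_w, class_b))
    # A sigmoid keeps the four box values (center x, center y, w, h) in [0, 1].
    box = [1.0 / (1.0 + math.exp(-z)) for z in linear(embedding, box_w, box_b)]
    return class_probs, box


def random_matrix(rng, rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_model, num_classes, num_queries, num_tokens = 4, 3, 5, 6
queries = random_matrix(rng, num_queries, d_model)      # stand-in for learned object queries
tokens = random_matrix(rng, num_tokens, d_model)        # stand-in for encoder outputs
class_w = random_matrix(rng, num_classes + 1, d_model)  # extra row for the background class
box_w = random_matrix(rng, 4, d_model)
predictions = [predict(e, class_w, [0.0] * (num_classes + 1), box_w, [0.0] * 4)
               for e in cross_attention(queries, tokens)]
print(len(predictions))  # 5: one prediction per object query
```
</pre><p>Changing the number of encoder tokens in this sketch leaves the number of predictions unchanged, since only the number of object queries determines it.</p><p>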
The full output of DETR is therefore a <strong>set of predictions</strong>, whose size is fixed by design and independent of the image content.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5mWe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F517df8df-10f4-41eb-b496-cb9662cf173a_1467x645.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5mWe!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F517df8df-10f4-41eb-b496-cb9662cf173a_1467x645.png 424w, https://substackcdn.com/image/fetch/$s_!5mWe!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F517df8df-10f4-41eb-b496-cb9662cf173a_1467x645.png 848w, https://substackcdn.com/image/fetch/$s_!5mWe!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F517df8df-10f4-41eb-b496-cb9662cf173a_1467x645.png 1272w, https://substackcdn.com/image/fetch/$s_!5mWe!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F517df8df-10f4-41eb-b496-cb9662cf173a_1467x645.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5mWe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F517df8df-10f4-41eb-b496-cb9662cf173a_1467x645.png" width="1456" height="640" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/517df8df-10f4-41eb-b496-cb9662cf173a_1467x645.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:640,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:161477,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/183945695?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F517df8df-10f4-41eb-b496-cb9662cf173a_1467x645.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!5mWe!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F517df8df-10f4-41eb-b496-cb9662cf173a_1467x645.png 424w, https://substackcdn.com/image/fetch/$s_!5mWe!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F517df8df-10f4-41eb-b496-cb9662cf173a_1467x645.png 848w, https://substackcdn.com/image/fetch/$s_!5mWe!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F517df8df-10f4-41eb-b496-cb9662cf173a_1467x645.png 1272w, https://substackcdn.com/image/fetch/$s_!5mWe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F517df8df-10f4-41eb-b496-cb9662cf173a_1467x645.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong>Figure 1.12 </strong><em>Mapping of transformed object queries to final class predictions and bounding boxes via feed-forward prediction heads.</em></p><p>Training DETR requires matching the unordered set of predictions to the unordered set of ground-truth objects. This is achieved using the <strong>Hungarian matching algorithm</strong>, which computes an optimal bipartite matching between predictions and ground truth based on a cost function that combines classification error, bounding box regression error, and overlap-based metrics such as generalized IoU.</p><p>Once the matching is established, a composite loss is computed over the matched pairs, while unmatched predictions are trained to predict the background class. 
This global, set-based loss formulation ensures that each ground-truth object is assigned to exactly one prediction and eliminates the need for non-maximum suppression.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!oMBQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8659c7e-1267-4344-b557-93c5bac069af_870x429.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!oMBQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8659c7e-1267-4344-b557-93c5bac069af_870x429.png 424w, https://substackcdn.com/image/fetch/$s_!oMBQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8659c7e-1267-4344-b557-93c5bac069af_870x429.png 848w, https://substackcdn.com/image/fetch/$s_!oMBQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8659c7e-1267-4344-b557-93c5bac069af_870x429.png 1272w, https://substackcdn.com/image/fetch/$s_!oMBQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8659c7e-1267-4344-b557-93c5bac069af_870x429.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!oMBQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8659c7e-1267-4344-b557-93c5bac069af_870x429.png" width="870" height="429" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f8659c7e-1267-4344-b557-93c5bac069af_870x429.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:429,&quot;width&quot;:870,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:139994,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/183945695?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8659c7e-1267-4344-b557-93c5bac069af_870x429.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!oMBQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8659c7e-1267-4344-b557-93c5bac069af_870x429.png 424w, https://substackcdn.com/image/fetch/$s_!oMBQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8659c7e-1267-4344-b557-93c5bac069af_870x429.png 848w, https://substackcdn.com/image/fetch/$s_!oMBQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8659c7e-1267-4344-b557-93c5bac069af_870x429.png 1272w, https://substackcdn.com/image/fetch/$s_!oMBQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8659c7e-1267-4344-b557-93c5bac069af_870x429.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong>Figure 1.13</strong><br><em>Set-based matching between predicted boxes and ground-truth objects using the Hungarian algorithm, with unmatched predictions assigned to background.</em></p><h1>1.7 Hungarian Matching Loss in Detection Transformers</h1><p>Detection Transformers frame object detection as a <strong>direct set prediction problem</strong>, where a fixed number of predictions must be matched to a variable number of ground-truth objects. Because the predictions are unordered and there is no predefined correspondence between predicted boxes and ground-truth boxes, DETR requires a principled mechanism to establish a one-to-one assignment during training. 
This role is fulfilled by the <strong>Hungarian matching algorithm</strong>, which computes a globally optimal one-to-one matching between two sets under a predefined cost.</p><p>The Hungarian algorithm originates from the classical assignment problem and provides a deterministic way to compute the minimum-cost matching between two equally sized sets. Because the number of object queries is fixed and typically exceeds the number of objects in an image, the ground-truth set is conceptually padded with &#8220;no object&#8221; entries so that both sets have the same size. DETR adopts this algorithm to match predicted object queries to ground-truth objects, so that each ground-truth object is assigned to exactly one prediction and redundant predictions are explicitly penalized.</p><p><strong>The Hungarian Algorithm Through the Assignment Problem</strong></p><p>To understand Hungarian matching, it is helpful to first consider the classical job assignment problem. Suppose there are four workers and four jobs. Each worker quotes a different cost for performing each job, resulting in a cost matrix where rows represent workers and columns represent jobs. The objective is to assign each worker to exactly one job such that all jobs are completed and the total cost is minimized.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_niJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bedbdda-8bc9-4948-8566-0c08994afdc8_1389x312.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_niJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bedbdda-8bc9-4948-8566-0c08994afdc8_1389x312.png 424w, https://substackcdn.com/image/fetch/$s_!_niJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bedbdda-8bc9-4948-8566-0c08994afdc8_1389x312.png 848w, 
https://substackcdn.com/image/fetch/$s_!_niJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bedbdda-8bc9-4948-8566-0c08994afdc8_1389x312.png 1272w, https://substackcdn.com/image/fetch/$s_!_niJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bedbdda-8bc9-4948-8566-0c08994afdc8_1389x312.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_niJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bedbdda-8bc9-4948-8566-0c08994afdc8_1389x312.png" width="1389" height="312" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7bedbdda-8bc9-4948-8566-0c08994afdc8_1389x312.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:312,&quot;width&quot;:1389,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:25316,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/183945695?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bedbdda-8bc9-4948-8566-0c08994afdc8_1389x312.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_niJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bedbdda-8bc9-4948-8566-0c08994afdc8_1389x312.png 424w, 
https://substackcdn.com/image/fetch/$s_!_niJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bedbdda-8bc9-4948-8566-0c08994afdc8_1389x312.png 848w, https://substackcdn.com/image/fetch/$s_!_niJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bedbdda-8bc9-4948-8566-0c08994afdc8_1389x312.png 1272w, https://substackcdn.com/image/fetch/$s_!_niJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bedbdda-8bc9-4948-8566-0c08994afdc8_1389x312.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><strong>Figure 1.14 </strong><em>Initial cost matrix for the assignment problem, showing the cost quoted by each worker for each job.</em></p><p>In Step 1, the problem is represented as a cost matrix, where each row corresponds to a worker and each column corresponds to a job. The entry at row <em>i</em>, column <em>j</em> denotes the cost incurred if worker <em>i</em> is assigned to job <em>j</em>. The objective is to select exactly one entry from each row and each column such that the total cost is minimized.</p><p>In Step 2, the minimum value in each row is identified. This value is the cost of the cheapest job that a particular worker can perform.</p><p>In Step 3, the minimum value of each row is subtracted from all elements in that row. Because every complete assignment uses exactly one entry from each row, this subtraction lowers the total cost of every assignment by the same amount and therefore does not alter the optimal solution. 
However, it guarantees that every row now contains at least one zero, which simplifies the identification of low-cost assignments.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_0wn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25ccbdb3-fda1-4a5a-a9a0-c4baed0c730c_1389x312.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_0wn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25ccbdb3-fda1-4a5a-a9a0-c4baed0c730c_1389x312.png 424w, https://substackcdn.com/image/fetch/$s_!_0wn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25ccbdb3-fda1-4a5a-a9a0-c4baed0c730c_1389x312.png 848w, https://substackcdn.com/image/fetch/$s_!_0wn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25ccbdb3-fda1-4a5a-a9a0-c4baed0c730c_1389x312.png 1272w, https://substackcdn.com/image/fetch/$s_!_0wn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25ccbdb3-fda1-4a5a-a9a0-c4baed0c730c_1389x312.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_0wn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25ccbdb3-fda1-4a5a-a9a0-c4baed0c730c_1389x312.png" width="1389" height="312" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/25ccbdb3-fda1-4a5a-a9a0-c4baed0c730c_1389x312.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:312,&quot;width&quot;:1389,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:28421,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/183945695?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25ccbdb3-fda1-4a5a-a9a0-c4baed0c730c_1389x312.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_0wn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25ccbdb3-fda1-4a5a-a9a0-c4baed0c730c_1389x312.png 424w, https://substackcdn.com/image/fetch/$s_!_0wn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25ccbdb3-fda1-4a5a-a9a0-c4baed0c730c_1389x312.png 848w, https://substackcdn.com/image/fetch/$s_!_0wn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25ccbdb3-fda1-4a5a-a9a0-c4baed0c730c_1389x312.png 1272w, https://substackcdn.com/image/fetch/$s_!_0wn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25ccbdb3-fda1-4a5a-a9a0-c4baed0c730c_1389x312.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><strong>Figure 1.15 </strong><em>Steps 4&#8211;6: column-wise normalization and identification of zero structure.</em></p><p>In Step 4, the algorithm inspects each column to 
determine whether further normalization is required. Columns that already contain a zero are left unchanged.</p><p>In Step 5, the minimum value of the last column, which is nonzero, is subtracted from all elements in that column. As with row normalization, this operation preserves the optimal assignment while introducing additional zeros.</p><p>In Step 6, the algorithm examines the resulting matrix and prepares to determine whether a complete assignment can be formed using the existing zeros. At this point, the matrix typically contains multiple zeros distributed across rows and columns.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5Wrs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6388232d-aef6-4b14-936c-a17742057036_1389x351.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5Wrs!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6388232d-aef6-4b14-936c-a17742057036_1389x351.png 424w, https://substackcdn.com/image/fetch/$s_!5Wrs!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6388232d-aef6-4b14-936c-a17742057036_1389x351.png 848w, https://substackcdn.com/image/fetch/$s_!5Wrs!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6388232d-aef6-4b14-936c-a17742057036_1389x351.png 1272w, https://substackcdn.com/image/fetch/$s_!5Wrs!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6388232d-aef6-4b14-936c-a17742057036_1389x351.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!5Wrs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6388232d-aef6-4b14-936c-a17742057036_1389x351.png" width="1389" height="351" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6388232d-aef6-4b14-936c-a17742057036_1389x351.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:351,&quot;width&quot;:1389,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:28371,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/183945695?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6388232d-aef6-4b14-936c-a17742057036_1389x351.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!5Wrs!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6388232d-aef6-4b14-936c-a17742057036_1389x351.png 424w, https://substackcdn.com/image/fetch/$s_!5Wrs!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6388232d-aef6-4b14-936c-a17742057036_1389x351.png 848w, https://substackcdn.com/image/fetch/$s_!5Wrs!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6388232d-aef6-4b14-936c-a17742057036_1389x351.png 1272w, https://substackcdn.com/image/fetch/$s_!5Wrs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6388232d-aef6-4b14-936c-a17742057036_1389x351.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong>Figure 1.16 </strong><em>Steps 7&#8211;9: covering zero entries and identifying the need for further adjustment.</em></p><p>In Step 7, the algorithm attempts to cover all zero entries in the matrix using the minimum number of horizontal and vertical lines. Each line can cover an entire row or column. The key observation here is that the number of lines required is three, while the number of workers and jobs is four.</p><p>In Step 8, the algorithm focuses on the uncovered elements, that is, entries not intersected by any line. 
Among these uncovered elements, the minimum value is identified.</p><p>In Step 9, this minimum uncovered value is confirmed. The fact that fewer than four lines are sufficient to cover all zeros indicates that a complete assignment cannot yet be constructed, and the matrix must be adjusted further.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!P68l!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd780d778-9c06-4886-a722-3e7213ba0f5f_1395x351.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!P68l!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd780d778-9c06-4886-a722-3e7213ba0f5f_1395x351.png 424w, https://substackcdn.com/image/fetch/$s_!P68l!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd780d778-9c06-4886-a722-3e7213ba0f5f_1395x351.png 848w, https://substackcdn.com/image/fetch/$s_!P68l!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd780d778-9c06-4886-a722-3e7213ba0f5f_1395x351.png 1272w, https://substackcdn.com/image/fetch/$s_!P68l!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd780d778-9c06-4886-a722-3e7213ba0f5f_1395x351.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!P68l!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd780d778-9c06-4886-a722-3e7213ba0f5f_1395x351.png" width="1395" height="351" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d780d778-9c06-4886-a722-3e7213ba0f5f_1395x351.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:351,&quot;width&quot;:1395,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:32134,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/183945695?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd780d778-9c06-4886-a722-3e7213ba0f5f_1395x351.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!P68l!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd780d778-9c06-4886-a722-3e7213ba0f5f_1395x351.png 424w, https://substackcdn.com/image/fetch/$s_!P68l!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd780d778-9c06-4886-a722-3e7213ba0f5f_1395x351.png 848w, https://substackcdn.com/image/fetch/$s_!P68l!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd780d778-9c06-4886-a722-3e7213ba0f5f_1395x351.png 1272w, https://substackcdn.com/image/fetch/$s_!P68l!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd780d778-9c06-4886-a722-3e7213ba0f5f_1395x351.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong>Figure 1.17 </strong><em>Steps 10&#8211;12: matrix adjustment and emergence of a complete zero-cost assignment.</em></p><p>In Step 10, the minimum uncovered value is subtracted from every uncovered element. This operation creates new zeros in previously nonzero locations.</p><p>In Step 11, the same minimum value is added to every element that lies at the intersection of two covering lines. This compensatory step ensures that no negative values are introduced and that the optimal assignment remains unchanged.</p><p>In Step 12, the adjusted matrix now contains a configuration of zeros that allows a unique selection of one zero per row and per column. 
At this stage, a complete assignment can be constructed.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1Ayt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10575304-ba95-493c-b0e4-6c10ca7f4415_747x351.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1Ayt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10575304-ba95-493c-b0e4-6c10ca7f4415_747x351.png 424w, https://substackcdn.com/image/fetch/$s_!1Ayt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10575304-ba95-493c-b0e4-6c10ca7f4415_747x351.png 848w, https://substackcdn.com/image/fetch/$s_!1Ayt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10575304-ba95-493c-b0e4-6c10ca7f4415_747x351.png 1272w, https://substackcdn.com/image/fetch/$s_!1Ayt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10575304-ba95-493c-b0e4-6c10ca7f4415_747x351.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1Ayt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10575304-ba95-493c-b0e4-6c10ca7f4415_747x351.png" width="747" height="351" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/10575304-ba95-493c-b0e4-6c10ca7f4415_747x351.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:351,&quot;width&quot;:747,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:16490,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/183945695?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10575304-ba95-493c-b0e4-6c10ca7f4415_747x351.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!1Ayt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10575304-ba95-493c-b0e4-6c10ca7f4415_747x351.png 424w, https://substackcdn.com/image/fetch/$s_!1Ayt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10575304-ba95-493c-b0e4-6c10ca7f4415_747x351.png 848w, https://substackcdn.com/image/fetch/$s_!1Ayt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10575304-ba95-493c-b0e4-6c10ca7f4415_747x351.png 1272w, https://substackcdn.com/image/fetch/$s_!1Ayt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10575304-ba95-493c-b0e4-6c10ca7f4415_747x351.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><p><strong>Figure 1.18 </strong><em>Final optimal assignment and recovery of the minimum total cost from the original matrix.</em></p><p>The highlighted zeros correspond to the optimal worker&#8211;job pairing. Although the transformed matrix yields a zero total cost, this is an artifact of the normalization process. To recover the true minimum cost, the selected assignments are mapped back to the original cost matrix. Summing the original costs associated with the selected pairs yields the final minimum total cost.</p><p>This example illustrates how the Hungarian algorithm converts a global combinatorial optimization problem into a structured sequence of matrix operations. At no point does the algorithm rely on greedy local decisions. 
Instead, it guarantees a globally optimal one-to-one assignment by systematically reshaping the cost landscape until the optimal solution becomes explicit.</p><p>This exact mechanism is later reused in Detection Transformers, where predicted object queries are optimally matched to ground-truth bounding boxes using an analogous cost matrix and the same Hungarian matching procedure.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZHfe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde5341df-21a8-442a-b228-cf09e2a06e63_417x286.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZHfe!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde5341df-21a8-442a-b228-cf09e2a06e63_417x286.png 424w, https://substackcdn.com/image/fetch/$s_!ZHfe!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde5341df-21a8-442a-b228-cf09e2a06e63_417x286.png 848w, https://substackcdn.com/image/fetch/$s_!ZHfe!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde5341df-21a8-442a-b228-cf09e2a06e63_417x286.png 1272w, https://substackcdn.com/image/fetch/$s_!ZHfe!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde5341df-21a8-442a-b228-cf09e2a06e63_417x286.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZHfe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde5341df-21a8-442a-b228-cf09e2a06e63_417x286.png" width="417" height="286" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/de5341df-21a8-442a-b228-cf09e2a06e63_417x286.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:286,&quot;width&quot;:417,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:225128,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/183945695?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde5341df-21a8-442a-b228-cf09e2a06e63_417x286.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ZHfe!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde5341df-21a8-442a-b228-cf09e2a06e63_417x286.png 424w, https://substackcdn.com/image/fetch/$s_!ZHfe!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde5341df-21a8-442a-b228-cf09e2a06e63_417x286.png 848w, https://substackcdn.com/image/fetch/$s_!ZHfe!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde5341df-21a8-442a-b228-cf09e2a06e63_417x286.png 1272w, https://substackcdn.com/image/fetch/$s_!ZHfe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fde5341df-21a8-442a-b228-cf09e2a06e63_417x286.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><p><strong>Figure 1.19 </strong><em>Hungarian matching in DETR, showing one-to-one assignment between predicted bounding boxes and ground-truth objects.</em></p><p>To perform this matching, DETR constructs a cost matrix where each entry represents the cost of assigning a predicted query to a particular ground-truth object. This cost combines classification error and localization error, typically using a weighted sum of class probability loss and bounding box regression terms such as L1 distance and IoU-based loss. Importantly, the cost is computed for all possible pairings, allowing the matching to consider global consistency rather than local heuristics.</p><p>Once the optimal assignment is computed using the Hungarian algorithm, the loss is evaluated only on the matched pairs. 
Predictions assigned to real objects contribute both classification and localization losses, while predictions assigned to the no-object class contribute only a classification loss. This set-based loss formulation ensures that each object is detected exactly once and eliminates the need for non-maximum suppression.</p><p>Hungarian matching is not merely a training trick but a foundational component of DETR&#8217;s design. By enforcing a global one-to-one correspondence between predictions and ground truth, it removes ambiguity, prevents duplicate detections, and aligns the training objective directly with the final inference output. This shift from heuristic post-processing to principled global optimization is one of the key reasons DETR represents a conceptual departure from traditional object detection pipelines.</p><h1>1.8 When Hungarian Matching Is Needed and When It Is Not in DETR</h1><p>Hungarian matching plays a critical role in the training formulation of Detection Transformers, but its use is conditional rather than universal. Whether it is required depends on the structure of the prediction&#8211;ground-truth correspondence problem and, in particular, on whether that correspondence is ambiguous.</p><p>The simplest case arises when an image contains exactly one ground-truth object. Even though DETR may produce many predictions from multiple object queries, the assignment problem is trivial. Each prediction can be independently compared to the single ground-truth object using a localization cost. The prediction that best matches the ground truth can be selected, and all remaining predictions can be treated as background. Because there is only one object, there is no possibility of conflicting assignments, and no global matching constraint is required.</p><p>The situation changes fundamentally once multiple ground-truth objects are present. 
In this setting, the model must decide not only which predictions are good, but also how to assign them uniquely to different objects. Independent matching decisions are no longer sufficient because they do not enforce exclusivity. A single prediction may appear to be a good match for more than one ground-truth object, leading to ambiguous supervision. Conversely, multiple predictions may compete for the same object, producing unstable and contradictory gradients during training.</p><p>Hungarian matching resolves this ambiguity by treating matching as a global optimization problem over sets rather than a collection of independent decisions. DETR constructs a cost matrix between all predictions and all ground-truth objects and computes a one-to-one assignment that minimizes the total cost. Each ground-truth object supervises exactly one prediction, and each prediction is assigned to at most one object. Predictions that are not matched to any ground truth are explicitly treated as background. This set-based formulation is what enables DETR to avoid anchors, heuristics, and non-maximum suppression while still producing stable and well-defined training signals.</p><p>In summary, Hungarian matching is unnecessary when the assignment problem is trivial, such as in the single-object case. It becomes essential whenever multiple objects must be matched to multiple predictions in a globally consistent manner.</p><h1>1.9 The Limitation of IoU and the Motivation for Generalized IoU</h1><p>Intersection over Union is the most commonly used metric for measuring the overlap between a predicted bounding box and a ground-truth box. 
It is defined as</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{IoU}(P, G) = \\frac{\\text{Area}(P \\cap G)}{\\text{Area}(P \\cup G)}&quot;,&quot;id&quot;:&quot;HJVFYJLRMA&quot;}" data-component-name="LatexBlockToDOM"></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6MF-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5522c660-a129-4a6a-b014-20545125f69f_1290x594.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6MF-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5522c660-a129-4a6a-b014-20545125f69f_1290x594.png 424w, https://substackcdn.com/image/fetch/$s_!6MF-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5522c660-a129-4a6a-b014-20545125f69f_1290x594.png 848w, https://substackcdn.com/image/fetch/$s_!6MF-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5522c660-a129-4a6a-b014-20545125f69f_1290x594.png 1272w, https://substackcdn.com/image/fetch/$s_!6MF-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5522c660-a129-4a6a-b014-20545125f69f_1290x594.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6MF-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5522c660-a129-4a6a-b014-20545125f69f_1290x594.png" width="1290" height="594" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5522c660-a129-4a6a-b014-20545125f69f_1290x594.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:594,&quot;width&quot;:1290,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:407287,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/183945695?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5522c660-a129-4a6a-b014-20545125f69f_1290x594.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6MF-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5522c660-a129-4a6a-b014-20545125f69f_1290x594.png 424w, https://substackcdn.com/image/fetch/$s_!6MF-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5522c660-a129-4a6a-b014-20545125f69f_1290x594.png 848w, https://substackcdn.com/image/fetch/$s_!6MF-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5522c660-a129-4a6a-b014-20545125f69f_1290x594.png 1272w, https://substackcdn.com/image/fetch/$s_!6MF-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5522c660-a129-4a6a-b014-20545125f69f_1290x594.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><p><strong>Figure 1.20 </strong><em>Two non-overlapping prediction&#8211;ground-truth pairs with identical IoU values of zero. Although one prediction is spatially much closer to the ground truth than the other, IoU fails to distinguish between them.</em></p><p>In both cases shown above, the predicted box does not intersect the ground-truth box, and IoU assigns a value of zero. However, from a learning perspective, the first prediction is clearly better than the second because it is closer to the target. IoU provides no gradient signal that reflects this difference, which leads to poor optimization behavior during training.</p><p>To address this limitation, <strong>Generalized IoU (GIoU)</strong> introduces an additional geometric term that accounts for the distance between boxes. 
Let <em>C</em> denote the smallest enclosing rectangle that contains both the prediction <em>P</em> and the ground truth <em>G</em>. Generalized IoU is defined as</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{GIoU}(P, G) = \\text{IoU}(P, G) - \\frac{\\text{Area}(C) - \\text{Area}(P \\cup G)}{\\text{Area}(C)}&quot;,&quot;id&quot;:&quot;CKHHVNETIW&quot;}" data-component-name="LatexBlockToDOM"></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!U4Dl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F544a1736-2f0e-41bf-8f6b-0c9975c08a22_1290x594.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!U4Dl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F544a1736-2f0e-41bf-8f6b-0c9975c08a22_1290x594.png 424w, https://substackcdn.com/image/fetch/$s_!U4Dl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F544a1736-2f0e-41bf-8f6b-0c9975c08a22_1290x594.png 848w, https://substackcdn.com/image/fetch/$s_!U4Dl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F544a1736-2f0e-41bf-8f6b-0c9975c08a22_1290x594.png 1272w, https://substackcdn.com/image/fetch/$s_!U4Dl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F544a1736-2f0e-41bf-8f6b-0c9975c08a22_1290x594.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!U4Dl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F544a1736-2f0e-41bf-8f6b-0c9975c08a22_1290x594.png" width="1290" height="594" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/544a1736-2f0e-41bf-8f6b-0c9975c08a22_1290x594.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:594,&quot;width&quot;:1290,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:445293,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/183945695?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F544a1736-2f0e-41bf-8f6b-0c9975c08a22_1290x594.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!U4Dl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F544a1736-2f0e-41bf-8f6b-0c9975c08a22_1290x594.png 424w, https://substackcdn.com/image/fetch/$s_!U4Dl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F544a1736-2f0e-41bf-8f6b-0c9975c08a22_1290x594.png 848w, https://substackcdn.com/image/fetch/$s_!U4Dl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F544a1736-2f0e-41bf-8f6b-0c9975c08a22_1290x594.png 1272w, https://substackcdn.com/image/fetch/$s_!U4Dl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F544a1736-2f0e-41bf-8f6b-0c9975c08a22_1290x594.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong>Figure 1.21 </strong><em>Generalized IoU introduces the smallest enclosing rectangle C, allowing the metric to penalize predictions that are farther from the ground truth even when there is no overlap.</em></p><p>When IoU is zero, the GIoU expression simplifies to</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{GIoU}(P, G) = -1 + \\frac{\\text{Area}(P \\cup G)}{\\text{Area}(C)}&quot;,&quot;id&quot;:&quot;MLQRLVFNNS&quot;}" data-component-name="LatexBlockToDOM"></div><p>If the prediction and ground truth have similar sizes, then 
<strong>Area(P&#8746;G)</strong> is approximately constant. The decisive factor becomes <strong>Area(C)</strong>. Predictions that are closer to the ground truth yield a smaller enclosing box <em>C</em>, resulting in a higher GIoU value. In this way, GIoU restores meaningful ordering among non-overlapping predictions and provides a smoother, more informative training signal.</p><h1>1.10 The DETR Loss Function: Formal Definition and Training Mechanics</h1><p>The DETR loss is defined over <strong>sets</strong> rather than ordered lists. Let the model produce a fixed number <em>N</em> of predictions, each consisting of a class probability distribution and a bounding box. Let the ground truth consist of <em>M</em> objects, where <em>M &#8804; N</em>. To construct a square matching problem, the ground truth is padded with <em>N &#8722; M</em> null (no-object) entries.</p><p><strong>Hungarian Matching</strong></p><p>DETR first computes a cost matrix between all predictions and all ground-truth objects. The cost for matching prediction <em>i</em> with ground-truth object <em>j</em> is defined as</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathcal{C}_{ij} = \\lambda_{\\text{cls}} \\mathcal{L}_{\\text{cls}}(i, j) + \\lambda_{L1} \\|b_i - b_j\\|_1 + \\lambda_{\\text{GIoU}} \\mathcal{L}_{\\text{GIoU}}(b_i, b_j)&quot;,&quot;id&quot;:&quot;SMIUBXRARI&quot;}" data-component-name="LatexBlockToDOM"></div><p>where <em>L_cls</em> is the classification loss, <em>b_i = (x, y, w, h)</em> denotes the predicted box, <em>b_j</em> denotes the ground-truth box, and the &#955; coefficients are scalar weights that balance the three terms.</p><p>The Hungarian algorithm finds the permutation &#963; that minimizes the total matching cost</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\sigma^* = \\arg \\min_{\\sigma} \\sum_{j=1}^{N} \\mathcal{C}_{\\sigma(j),j}&quot;,&quot;id&quot;:&quot;OTZOUMMKRX&quot;}" data-component-name="LatexBlockToDOM"></div><p><strong>Final Loss</strong></p><p>Once the optimal assignment is computed, 
the final DETR loss is defined as</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathcal{L} = \\sum_{j=1}^{N} \\left[ \\mathcal{L}_{\\text{cls}}(\\hat{c}_{\\sigma^*(j)}, c_j) + \\mathbb{1}_{\\{c_j \\neq \\varnothing\\}} \\left( \\lambda_{L1} \\|b_{\\sigma^*(j)} - b_j\\|_1 + \\lambda_{\\text{GIoU}} \\mathcal{L}_{\\text{GIoU}}(b_{\\sigma^*(j)}, b_j) \\right) \\right]&quot;,&quot;id&quot;:&quot;QSGBHQBZHX&quot;}" data-component-name="LatexBlockToDOM"></div><p>Localization losses are applied <strong>only to matched non-null objects</strong>, while unmatched predictions are trained purely through the classification loss as background.</p><p>An important implementation detail is that DETR applies this loss not only to the final decoder layer, but also to intermediate decoder outputs during training. These auxiliary losses improve gradient flow and stabilize optimization. During inference, only the final decoder output is used.</p><h1>1.11 Concluding Perspective on DETR</h1><p>Detection Transformers represent a conceptual shift in object detection. Instead of framing detection as a dense prediction problem with anchors, heuristics, and post-processing, DETR formulates detection as a <strong>set prediction problem</strong>. Object queries act as slots that compete to explain the objects present in an image, and Hungarian matching provides the mathematical mechanism that enforces a clean, one-to-one correspondence between predictions and ground truth.</p><p>This design eliminates the need for non-maximum suppression, anchor tuning, and hand-crafted assignment rules. At the same time, it demands careful loss design, robust geometric metrics such as Generalized IoU, and global optimization during training.</p><p>Although the original DETR model is computationally expensive and slow to converge, its formulation has influenced an entire family of transformer-based detectors. 
The core ideas of set-based prediction, bipartite matching, and global supervision continue to shape modern object detection architectures and provide a clean conceptual foundation for unifying detection with broader multimodal and sequence modeling frameworks.</p><h1><strong>Watch the full lecture video here</strong></h1><div id="youtube2-zdMDvJhyrrc" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;zdMDvJhyrrc&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/zdMDvJhyrrc?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>If you would like to deepen your understanding of Detection Transformers and see these ideas explained visually and intuitively, watch the accompanying video above. If you would like access to our code files, handwritten notes, all lecture videos, the Discord channel, and the other PDF handbooks we have compiled, along with a code certificate at the end of the program, consider joining the pro version of the &#8220;Transformers for Vision Bootcamp&#8221;. 
You will find the details here:</p><p><a href="https://vision-transformer.vizuara.ai/">https://vision-transformer.vizuara.ai/</a></p><h1><strong>Other resources</strong></h1><p>If you like this content, please check out our research bootcamps on the following topics:</p><p><strong>CV</strong>: <a href="https://cvresearchbootcamp.vizuara.ai/">https://cvresearchbootcamp.vizuara.ai/</a></p><p><strong>GenAI</strong>: <a href="https://flyvidesh.online/gen-ai-professional-bootcamp">https://flyvidesh.online/gen-ai-professional-bootcamp</a></p><p><strong>RL</strong>: <a href="https://rlresearcherbootcamp.vizuara.ai/">https://rlresearcherbootcamp.vizuara.ai/</a></p><p><strong>SciML</strong>: <a href="https://flyvidesh.online/ml-bootcamp">https://flyvidesh.online/ml-bootcamp</a></p><p><strong>ML-DL</strong>: <a href="https://flyvidesh.online/ml-dl-bootcamp">https://flyvidesh.online/ml-dl-bootcamp</a></p><h1><strong>Connect with us</strong></h1><p><strong>Dr. Sreedath Panat</strong></p><p><strong>LinkedIn</strong> : <a href="https://www.linkedin.com/in/sreedath-panat/">https://www.linkedin.com/in/sreedath-panat/</a></p><p><strong>Twitter/X</strong> : <a href="https://x.com/sreedathpanat">https://x.com/sreedathpanat</a></p><p><strong>Mayank Pratap Singh</strong></p><p><strong>LinkedIn</strong> : <a href="https://www.linkedin.com/in/mayankpratapsingh022/">www.linkedin.com/in/mayankpratapsingh022</a></p><p><strong>Twitter/X</strong> : <a href="https://x.com/Mayank_022">x.com/Mayank_022</a>.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.vizuaranewsletter.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Vizuara&#8217;s AI Newsletter! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Why Deployment and Monitoring Are Central to Production-Grade ML Systems]]></title><description><![CDATA[This article examines the necessity of deployment, the importance of monitoring, the key components that require monitoring, and how engineers select deployment platforms.]]></description><link>https://www.vizuaranewsletter.com/p/why-deployment-and-monitoring-are</link><guid isPermaLink="false">https://www.vizuaranewsletter.com/p/why-deployment-and-monitoring-are</guid><dc:creator><![CDATA[Prathamesh Dinesh Joshi]]></dc:creator><pubDate>Wed, 14 Jan 2026 06:30:46 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!xX3I!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2374301e-4407-405a-aa9f-484b544973d4_1684x944.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Table of Contents</em></p><ol><li><p><em>Introduction</em></p></li><li><p><em>Why Deployment Is Necessary</em></p><ol><li><p><em>Industry Reality</em></p></li></ol></li><li><p><em>Why Monitoring Is Critical After Deployment</em></p></li><li><p><em>What We Monitor in ML Systems</em></p><ol><li><p><em>System-Level Monitoring</em></p></li><li><p><em>Data-Level Monitoring</em></p></li><li><p><em>Model-Level Monitoring</em></p></li></ol></li><li><p><em>Platforms for Deploying Machine Learning Systems</em></p><ol><li><p><em>Amazon EC2</em></p></li><li><p><em>Google Cloud Run</em></p></li><li><p><em>Microsoft Azure 
</em></p></li><li><p><em>Render</em></p></li></ol></li><li><p><em>Choosing the Right Deployment Platform</em></p></li><li><p><em>Understanding AWS EC2 Fundamentals (Storage, Security, and Instance Types)</em></p></li><li><p><em>Connecting GitHub and EC2 Securely Using Secrets</em></p></li><li><p><em>Automated Deployment to EC2 Using GitHub Actions</em></p></li><li><p><em>Conclusion</em></p><p></p></li></ol><h2><strong>1. Introduction</strong></h2><p>Machine learning systems do not create value at the moment a model finishes training. Value is created only when that model is deployed into a real system, exposed to live data, and continuously monitored to ensure reliability over time.</p><p>In modern Machine Learning Engineering (MLE), deployment and monitoring are not optional add-ons. They are foundational engineering responsibilities that determine whether a model survives outside experimentation.</p><p>This article explores why deployment is necessary, why monitoring is critical, what must be monitored, and how engineers choose deployment platforms in real-world production systems.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xX3I!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2374301e-4407-405a-aa9f-484b544973d4_1684x944.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xX3I!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2374301e-4407-405a-aa9f-484b544973d4_1684x944.png 424w, https://substackcdn.com/image/fetch/$s_!xX3I!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2374301e-4407-405a-aa9f-484b544973d4_1684x944.png 848w, 
https://substackcdn.com/image/fetch/$s_!xX3I!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2374301e-4407-405a-aa9f-484b544973d4_1684x944.png 1272w, https://substackcdn.com/image/fetch/$s_!xX3I!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2374301e-4407-405a-aa9f-484b544973d4_1684x944.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xX3I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2374301e-4407-405a-aa9f-484b544973d4_1684x944.png" width="622" height="348.5934065934066" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2374301e-4407-405a-aa9f-484b544973d4_1684x944.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:816,&quot;width&quot;:1456,&quot;resizeWidth&quot;:622,&quot;bytes&quot;:2136139,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/183886241?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2374301e-4407-405a-aa9f-484b544973d4_1684x944.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!xX3I!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2374301e-4407-405a-aa9f-484b544973d4_1684x944.png 424w, 
https://substackcdn.com/image/fetch/$s_!xX3I!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2374301e-4407-405a-aa9f-484b544973d4_1684x944.png 848w, https://substackcdn.com/image/fetch/$s_!xX3I!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2374301e-4407-405a-aa9f-484b544973d4_1684x944.png 1272w, https://substackcdn.com/image/fetch/$s_!xX3I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2374301e-4407-405a-aa9f-484b544973d4_1684x944.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Source: Netflix.com. The image shows the latest releases and trending movies on Netflix.</figcaption></figure></div><div><hr></div><h2><strong>2. Why Deployment Is Necessary</strong></h2><p>A trained model that exists only in a notebook or local environment is effectively inert. It cannot serve users, integrate with applications, or respond to real-time data.</p><p>Deployment converts a trained model into a <strong>running service</strong> capable of:</p><ul><li><p>Accepting input from users or systems</p></li><li><p>Producing predictions automatically</p></li><li><p>Running continuously without manual intervention</p></li></ul><p>Without deployment:</p><ul><li><p>Models remain experimental artifacts</p></li><li><p>There is no business or product impact</p></li><li><p>Feedback loops for improvement do not exist</p></li></ul><h3>a. Industry Reality</h3><p>At Netflix, recommendation models are deployed as low-latency services responding to millions of requests per second. A highly accurate model that cannot be deployed reliably is functionally useless.</p><p>Similarly, Uber&#8217;s pricing and demand forecasting models must operate continuously. 
Any deployment failure directly affects rider experience and revenue.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!SrcI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff38d32ac-7555-4c95-b841-baa843ce6d17_1108x1028.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!SrcI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff38d32ac-7555-4c95-b841-baa843ce6d17_1108x1028.png 424w, https://substackcdn.com/image/fetch/$s_!SrcI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff38d32ac-7555-4c95-b841-baa843ce6d17_1108x1028.png 848w, https://substackcdn.com/image/fetch/$s_!SrcI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff38d32ac-7555-4c95-b841-baa843ce6d17_1108x1028.png 1272w, https://substackcdn.com/image/fetch/$s_!SrcI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff38d32ac-7555-4c95-b841-baa843ce6d17_1108x1028.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!SrcI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff38d32ac-7555-4c95-b841-baa843ce6d17_1108x1028.png" width="307" height="284.83393501805057" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f38d32ac-7555-4c95-b841-baa843ce6d17_1108x1028.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1028,&quot;width&quot;:1108,&quot;resizeWidth&quot;:307,&quot;bytes&quot;:758420,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/183886241?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff38d32ac-7555-4c95-b841-baa843ce6d17_1108x1028.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!SrcI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff38d32ac-7555-4c95-b841-baa843ce6d17_1108x1028.png 424w, https://substackcdn.com/image/fetch/$s_!SrcI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff38d32ac-7555-4c95-b841-baa843ce6d17_1108x1028.png 848w, https://substackcdn.com/image/fetch/$s_!SrcI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff38d32ac-7555-4c95-b841-baa843ce6d17_1108x1028.png 1272w, https://substackcdn.com/image/fetch/$s_!SrcI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff38d32ac-7555-4c95-b841-baa843ce6d17_1108x1028.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Source: Uber.com</figcaption></figure></div><blockquote><p><em>Deployment is the transition point where machine learning becomes infrastructure.</em></p></blockquote><div><hr></div><h2>3. Why Monitoring Is Critical After Deployment</h2><p>Once deployed, a model enters a dynamic and often adversarial environment. Real-world data changes. User behavior evolves. Infrastructure degrades. 
Assumptions made during training eventually become invalid.</p><p>Common causes of post-deployment failure include:</p><ul><li><p>Changes in incoming data distributions</p></li><li><p>Seasonal or behavioral shifts</p></li><li><p>System-level failures</p></li><li><p>Gradual performance degradation (concept drift)</p></li></ul><p>Monitoring ensures that failures are <strong>detected early</strong>, diagnosed correctly, and addressed before they escalate.</p><p>At LinkedIn, even small degradations in feed-ranking models can significantly impact engagement metrics. Continuous monitoring enables rapid detection and controlled rollbacks.</p><blockquote><p>Monitoring is not about perfection; it is about <strong>maintaining trust in production systems</strong>.</p></blockquote><div><hr></div><h2>4. What We Monitor in ML Systems</h2><p>Monitoring in machine learning systems operates across <strong>three interconnected layers</strong>. Treating monitoring as &#8220;accuracy tracking&#8221; alone is a common and costly mistake.</p><h3>a. 
System-Level Monitoring</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Zm4m!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb76e7ea-62ad-4671-a1e5-06c81227681b_2878x1282.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Zm4m!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb76e7ea-62ad-4671-a1e5-06c81227681b_2878x1282.png 424w, https://substackcdn.com/image/fetch/$s_!Zm4m!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb76e7ea-62ad-4671-a1e5-06c81227681b_2878x1282.png 848w, https://substackcdn.com/image/fetch/$s_!Zm4m!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb76e7ea-62ad-4671-a1e5-06c81227681b_2878x1282.png 1272w, https://substackcdn.com/image/fetch/$s_!Zm4m!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb76e7ea-62ad-4671-a1e5-06c81227681b_2878x1282.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Zm4m!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb76e7ea-62ad-4671-a1e5-06c81227681b_2878x1282.png" width="620" height="276.3598901098901" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bb76e7ea-62ad-4671-a1e5-06c81227681b_2878x1282.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:649,&quot;width&quot;:1456,&quot;resizeWidth&quot;:620,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;https://desk.zoho.com/support/ImageDisplay?blockId=ff99121ae5644d8bfae0a6553792c46e1c94ced4e2848dcd&amp;downloadType=uploadedFile&amp;fileName=lename%2A%3D%22UTF-8%27%27custom-dashboard.png&amp;mode=view&amp;zgId=4d65b98622a455f6&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="https://desk.zoho.com/support/ImageDisplay?blockId=ff99121ae5644d8bfae0a6553792c46e1c94ced4e2848dcd&amp;downloadType=uploadedFile&amp;fileName=lename%2A%3D%22UTF-8%27%27custom-dashboard.png&amp;mode=view&amp;zgId=4d65b98622a455f6" title="https://desk.zoho.com/support/ImageDisplay?blockId=ff99121ae5644d8bfae0a6553792c46e1c94ced4e2848dcd&amp;downloadType=uploadedFile&amp;fileName=lename%2A%3D%22UTF-8%27%27custom-dashboard.png&amp;mode=view&amp;zgId=4d65b98622a455f6" srcset="https://substackcdn.com/image/fetch/$s_!Zm4m!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb76e7ea-62ad-4671-a1e5-06c81227681b_2878x1282.png 424w, https://substackcdn.com/image/fetch/$s_!Zm4m!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb76e7ea-62ad-4671-a1e5-06c81227681b_2878x1282.png 848w, 
https://substackcdn.com/image/fetch/$s_!Zm4m!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb76e7ea-62ad-4671-a1e5-06c81227681b_2878x1282.png 1272w, https://substackcdn.com/image/fetch/$s_!Zm4m!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbb76e7ea-62ad-4671-a1e5-06c81227681b_2878x1282.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Source: ManageEngine. System-level monitoring focuses on infrastructure health and service 
reliability.</figcaption></figure></div><p>This layer ensures that the underlying infrastructure remains healthy.</p><p>Key metrics include:</p><ul><li><p>CPU and memory utilization</p></li><li><p>Disk and network usage</p></li><li><p>API latency and error rates</p></li></ul><p>If system resources are exhausted or latency spikes, even a perfect model becomes unusable.</p><div><hr></div><h3>b. Data-Level Monitoring</h3><p>This layer focuses on the <strong>inputs</strong> entering the model.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1Oie!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1dc0d93-dc95-4c8c-a477-113b5274cf1f_1448x770.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1Oie!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1dc0d93-dc95-4c8c-a477-113b5274cf1f_1448x770.png 424w, https://substackcdn.com/image/fetch/$s_!1Oie!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1dc0d93-dc95-4c8c-a477-113b5274cf1f_1448x770.png 848w, https://substackcdn.com/image/fetch/$s_!1Oie!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1dc0d93-dc95-4c8c-a477-113b5274cf1f_1448x770.png 1272w, https://substackcdn.com/image/fetch/$s_!1Oie!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1dc0d93-dc95-4c8c-a477-113b5274cf1f_1448x770.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!1Oie!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1dc0d93-dc95-4c8c-a477-113b5274cf1f_1448x770.png" width="644" height="342.4585635359116" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f1dc0d93-dc95-4c8c-a477-113b5274cf1f_1448x770.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:770,&quot;width&quot;:1448,&quot;resizeWidth&quot;:644,&quot;bytes&quot;:317803,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/183886241?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1dc0d93-dc95-4c8c-a477-113b5274cf1f_1448x770.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!1Oie!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1dc0d93-dc95-4c8c-a477-113b5274cf1f_1448x770.png 424w, https://substackcdn.com/image/fetch/$s_!1Oie!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1dc0d93-dc95-4c8c-a477-113b5274cf1f_1448x770.png 848w, https://substackcdn.com/image/fetch/$s_!1Oie!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1dc0d93-dc95-4c8c-a477-113b5274cf1f_1448x770.png 1272w, https://substackcdn.com/image/fetch/$s_!1Oie!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1dc0d93-dc95-4c8c-a477-113b5274cf1f_1448x770.png 
1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Source: Evidently AI. Illustration of data-level monitoring.</figcaption></figure></div><p>What is monitored:</p><ul><li><p>Input feature distributions</p></li><li><p>Missing or invalid values</p></li><li><p>Sudden statistical shifts indicating data drift</p></li></ul><p>At Airbnb, pricing and demand models rely heavily on data-level monitoring. Changes in travel patterns can silently invalidate training assumptions if not detected early.</p><div><hr></div><h3>c. 
Model-Level Monitoring</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!AStj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f2cea4f-11fa-4f19-a62c-97f9719b44af_1248x696.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!AStj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f2cea4f-11fa-4f19-a62c-97f9719b44af_1248x696.png 424w, https://substackcdn.com/image/fetch/$s_!AStj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f2cea4f-11fa-4f19-a62c-97f9719b44af_1248x696.png 848w, https://substackcdn.com/image/fetch/$s_!AStj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f2cea4f-11fa-4f19-a62c-97f9719b44af_1248x696.png 1272w, https://substackcdn.com/image/fetch/$s_!AStj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f2cea4f-11fa-4f19-a62c-97f9719b44af_1248x696.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!AStj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f2cea4f-11fa-4f19-a62c-97f9719b44af_1248x696.png" width="654" height="364.7307692307692" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6f2cea4f-11fa-4f19-a62c-97f9719b44af_1248x696.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:696,&quot;width&quot;:1248,&quot;resizeWidth&quot;:654,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;https://cdn.prod.website-files.com/660ef16a9e0687d9cc27474a/662c3c83010d1a7f60040604_653fdfe6ffa885d43e0b61ae_model%2520monitoring%2520guide_main-min.png&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="https://cdn.prod.website-files.com/660ef16a9e0687d9cc27474a/662c3c83010d1a7f60040604_653fdfe6ffa885d43e0b61ae_model%2520monitoring%2520guide_main-min.png" title="https://cdn.prod.website-files.com/660ef16a9e0687d9cc27474a/662c3c83010d1a7f60040604_653fdfe6ffa885d43e0b61ae_model%2520monitoring%2520guide_main-min.png" srcset="https://substackcdn.com/image/fetch/$s_!AStj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f2cea4f-11fa-4f19-a62c-97f9719b44af_1248x696.png 424w, https://substackcdn.com/image/fetch/$s_!AStj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f2cea4f-11fa-4f19-a62c-97f9719b44af_1248x696.png 848w, https://substackcdn.com/image/fetch/$s_!AStj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f2cea4f-11fa-4f19-a62c-97f9719b44af_1248x696.png 1272w, 
https://substackcdn.com/image/fetch/$s_!AStj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f2cea4f-11fa-4f19-a62c-97f9719b44af_1248x696.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Source: Evidently AI. Model-level monitoring evaluates output stability and predictive behavior over time.</figcaption></figure></div><p>This layer evaluates the <strong>model outputs</strong> themselves, not just label-based accuracy.</p><p>Typical metrics include:</p><ul><li><p>Prediction distributions</p></li><li><p>Reconstruction error (for anomaly detection 
systems)</p></li><li><p>Ground-truth-based metrics when labels are available</p></li></ul><p>In anomaly detection systems used for cybersecurity or finance, rising reconstruction error often provides the earliest signal of abnormal system behavior.</p><div><hr></div><h2>5. Platforms for Deploying Machine Learning Systems</h2><p>Different deployment platforms exist because <strong>no single solution fits all production requirements</strong>. The choice depends on control, scalability needs, and operational maturity.</p><div><hr></div><h3>a. Amazon EC2</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!P4Z3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F828e16cb-ba52-454b-a830-a160cc7172a7_1026x580.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!P4Z3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F828e16cb-ba52-454b-a830-a160cc7172a7_1026x580.png 424w, https://substackcdn.com/image/fetch/$s_!P4Z3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F828e16cb-ba52-454b-a830-a160cc7172a7_1026x580.png 848w, https://substackcdn.com/image/fetch/$s_!P4Z3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F828e16cb-ba52-454b-a830-a160cc7172a7_1026x580.png 1272w, https://substackcdn.com/image/fetch/$s_!P4Z3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F828e16cb-ba52-454b-a830-a160cc7172a7_1026x580.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!P4Z3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F828e16cb-ba52-454b-a830-a160cc7172a7_1026x580.png" width="1026" height="580" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/828e16cb-ba52-454b-a830-a160cc7172a7_1026x580.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:580,&quot;width&quot;:1026,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:108689,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/183886241?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F828e16cb-ba52-454b-a830-a160cc7172a7_1026x580.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!P4Z3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F828e16cb-ba52-454b-a830-a160cc7172a7_1026x580.png 424w, https://substackcdn.com/image/fetch/$s_!P4Z3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F828e16cb-ba52-454b-a830-a160cc7172a7_1026x580.png 848w, https://substackcdn.com/image/fetch/$s_!P4Z3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F828e16cb-ba52-454b-a830-a160cc7172a7_1026x580.png 1272w, https://substackcdn.com/image/fetch/$s_!P4Z3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F828e16cb-ba52-454b-a830-a160cc7172a7_1026x580.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Source: amazon.com. Illustration of the Amazon EC2 architecture.</figcaption></figure></div><p><strong>Figure:</strong> Amazon EC2 provides full control over the deployment environment using virtual machines.</p><p>A virtual machine&#8211;based deployment model offered by <strong>Amazon Web Services</strong>.</p><p><strong>Advantages</strong></p><ul><li><p>Full control over the operating system, dependencies, and runtime</p></li><li><p>Seamless integration with Docker and CI/CD pipelines</p></li><li><p>Ideal for research workflows, custom ML pipelines, and live 
demos</p></li></ul><p><strong>Limitations</strong></p><ul><li><p>Manual scaling</p></li><li><p>Requires system administration expertise</p></li><li><p>Monitoring and security must be explicitly configured</p></li></ul><p>EC2 is commonly used in early-stage ML products and research-heavy environments where flexibility is essential.</p><div><hr></div><h3>b. Google Cloud Run</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4H07!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a2c0779-b882-43a0-82b8-ea7a07753aa0_1055x490.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4H07!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a2c0779-b882-43a0-82b8-ea7a07753aa0_1055x490.png 424w, https://substackcdn.com/image/fetch/$s_!4H07!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a2c0779-b882-43a0-82b8-ea7a07753aa0_1055x490.png 848w, https://substackcdn.com/image/fetch/$s_!4H07!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a2c0779-b882-43a0-82b8-ea7a07753aa0_1055x490.png 1272w, https://substackcdn.com/image/fetch/$s_!4H07!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a2c0779-b882-43a0-82b8-ea7a07753aa0_1055x490.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4H07!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a2c0779-b882-43a0-82b8-ea7a07753aa0_1055x490.png" width="1055" height="490" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9a2c0779-b882-43a0-82b8-ea7a07753aa0_1055x490.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:490,&quot;width&quot;:1055,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;https://miro.medium.com/1%2A2gnIctgRQtHYoq6btZ7qYA.png&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="https://miro.medium.com/1%2A2gnIctgRQtHYoq6btZ7qYA.png" title="https://miro.medium.com/1%2A2gnIctgRQtHYoq6btZ7qYA.png" srcset="https://substackcdn.com/image/fetch/$s_!4H07!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a2c0779-b882-43a0-82b8-ea7a07753aa0_1055x490.png 424w, https://substackcdn.com/image/fetch/$s_!4H07!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a2c0779-b882-43a0-82b8-ea7a07753aa0_1055x490.png 848w, https://substackcdn.com/image/fetch/$s_!4H07!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a2c0779-b882-43a0-82b8-ea7a07753aa0_1055x490.png 1272w, https://substackcdn.com/image/fetch/$s_!4H07!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a2c0779-b882-43a0-82b8-ea7a07753aa0_1055x490.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 
20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Cloud Run automatically scales containerized ML APIs based on incoming traffic.</figcaption></figure></div><p>A serverless container deployment platform from <strong>Google Cloud</strong>.</p><p><strong>Advantages</strong></p><ul><li><p>Automatic scaling</p></li><li><p>No server management</p></li><li><p>Pay-per-use pricing</p></li><li><p>Well suited for stateless inference APIs</p></li></ul><p><strong>Limitations</strong></p><ul><li><p>Limited infrastructure control</p></li><li><p>Cold-start latency</p></li><li><p>Unsuitable for long-running ML jobs</p></li></ul><p>Cloud Run excels when traffic patterns are unpredictable and operational simplicity is prioritized.</p><div><hr></div><h3>c. 
Microsoft Azure</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!f4OG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71548740-18f4-4129-b5c4-053d512df37a_996x780.svg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!f4OG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71548740-18f4-4129-b5c4-053d512df37a_996x780.svg 424w, https://substackcdn.com/image/fetch/$s_!f4OG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71548740-18f4-4129-b5c4-053d512df37a_996x780.svg 848w, https://substackcdn.com/image/fetch/$s_!f4OG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71548740-18f4-4129-b5c4-053d512df37a_996x780.svg 1272w, https://substackcdn.com/image/fetch/$s_!f4OG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71548740-18f4-4129-b5c4-053d512df37a_996x780.svg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!f4OG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71548740-18f4-4129-b5c4-053d512df37a_996x780.svg" width="996" height="780" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/71548740-18f4-4129-b5c4-053d512df37a_996x780.svg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:780,&quot;width&quot;:996,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;https://learn.microsoft.com/en-us/azure/architecture/ai-ml/idea/_images/azure-machine-learning-solution-architecture.svg&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="https://learn.microsoft.com/en-us/azure/architecture/ai-ml/idea/_images/azure-machine-learning-solution-architecture.svg" title="https://learn.microsoft.com/en-us/azure/architecture/ai-ml/idea/_images/azure-machine-learning-solution-architecture.svg" srcset="https://substackcdn.com/image/fetch/$s_!f4OG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71548740-18f4-4129-b5c4-053d512df37a_996x780.svg 424w, https://substackcdn.com/image/fetch/$s_!f4OG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71548740-18f4-4129-b5c4-053d512df37a_996x780.svg 848w, https://substackcdn.com/image/fetch/$s_!f4OG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71548740-18f4-4129-b5c4-053d512df37a_996x780.svg 1272w, https://substackcdn.com/image/fetch/$s_!f4OG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71548740-18f4-4129-b5c4-053d512df37a_996x780.svg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div 
class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Azure provides an enterprise-grade ecosystem for ML deployment and monitoring.</figcaption></figure></div><p>A comprehensive cloud ecosystem offered by <strong>Microsoft Azure</strong>.</p><p><strong>Advantages</strong></p><ul><li><p>Strong integration with enterprise systems</p></li><li><p>Managed ML services available</p></li><li><p>Advanced monitoring and logging tools</p></li></ul><p><strong>Limitations</strong></p><ul><li><p>Steeper learning curve</p></li><li><p>Higher cost for small projects</p></li><li><p>Configuration complexity</p></li></ul><p>Azure is frequently chosen in regulated industries and enterprise-heavy 
environments.</p><div><hr></div><h3>d. Render</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wlmu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd56ebd2c-0999-4d4d-b194-df56bac2faf1_1458x948.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!wlmu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd56ebd2c-0999-4d4d-b194-df56bac2faf1_1458x948.png 424w, https://substackcdn.com/image/fetch/$s_!wlmu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd56ebd2c-0999-4d4d-b194-df56bac2faf1_1458x948.png 848w, https://substackcdn.com/image/fetch/$s_!wlmu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd56ebd2c-0999-4d4d-b194-df56bac2faf1_1458x948.png 1272w, https://substackcdn.com/image/fetch/$s_!wlmu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd56ebd2c-0999-4d4d-b194-df56bac2faf1_1458x948.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!wlmu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd56ebd2c-0999-4d4d-b194-df56bac2faf1_1458x948.png" width="548" height="356.4258241758242" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d56ebd2c-0999-4d4d-b194-df56bac2faf1_1458x948.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:947,&quot;width&quot;:1456,&quot;resizeWidth&quot;:548,&quot;bytes&quot;:273892,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/183886241?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd56ebd2c-0999-4d4d-b194-df56bac2faf1_1458x948.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!wlmu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd56ebd2c-0999-4d4d-b194-df56bac2faf1_1458x948.png 424w, https://substackcdn.com/image/fetch/$s_!wlmu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd56ebd2c-0999-4d4d-b194-df56bac2faf1_1458x948.png 848w, https://substackcdn.com/image/fetch/$s_!wlmu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd56ebd2c-0999-4d4d-b194-df56bac2faf1_1458x948.png 1272w, https://substackcdn.com/image/fetch/$s_!wlmu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd56ebd2c-0999-4d4d-b194-df56bac2faf1_1458x948.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Source: render.com. Render offers a simple way to deploy a model.</figcaption></figure></div><p>A lightweight deployment platform focused on developer experience.</p><p><strong>Advantages</strong></p><ul><li><p>Extremely easy setup</p></li><li><p>Minimal configuration</p></li><li><p>Ideal for demos and prototypes</p></li></ul><p><strong>Limitations</strong></p><ul><li><p>Limited customization</p></li><li><p>Not suitable for heavy ML workloads</p></li><li><p>Restricted control over scaling</p></li></ul><p>Render is best suited for educational content, proofs of concept, and lightweight demonstrations.</p><div><hr></div><h2>6. 
Choosing the Right Deployment Platform</h2><p>Platform selection should be driven by <strong>engineering constraints</strong>, not popularity.</p><p>General guidelines:</p><ul><li><p><strong>Full control and live demonstrations</strong> &#8594; Amazon EC2</p></li><li><p><strong>Stateless ML APIs with variable traffic</strong> &#8594; Google Cloud Run</p></li><li><p><strong>Enterprise-grade production systems</strong> &#8594; Microsoft Azure</p></li><li><p><strong>Simple demos with minimal setup</strong> &#8594; Render</p></li></ul><p>Choosing the correct platform reduces operational friction and allows teams to focus on <strong>model quality rather than infrastructure firefighting</strong>.</p><div><hr></div><h2>7. Understanding AWS EC2 Fundamentals (Storage, Security, and Instance Types)</h2><p>Before deploying anything to AWS EC2, it is essential to understand the core building blocks that make an instance usable and secure.</p><p><strong>Storage (EBS Volumes)</strong><br>Every EC2 instance requires storage to hold the operating system, application code, logs, and Docker artifacts. This is provided using <strong>Elastic Block Store (EBS)</strong> volumes, which act like virtual hard disks attached to the instance. EBS ensures data persistence even if the EC2 instance is stopped or restarted.</p><p><strong>Security Groups</strong><br>Security Groups act as <strong>virtual firewalls</strong> for EC2. They control inbound and outbound traffic rules. For deployment workflows, the most critical rule is allowing <strong>SSH access on port 22</strong>, typically restricted to trusted IPs or GitHub Actions runners. Without proper security group configuration, remote access to EC2 is impossible.</p><p><strong>PEM File (Key Pair)</strong><br>AWS uses <strong>public&#8211;private key authentication</strong> instead of passwords. 
When you launch an EC2 instance, you create or select a <strong>key pair</strong>, and the private half is downloaded as a <strong>.pem file</strong>.</p><ul><li><p>The <strong>private key (.pem)</strong> stays with you</p></li><li><p>The <strong>public key</strong> is stored on the EC2 instance</p></li></ul><p>This key pair enables secure SSH access and is later reused inside CI/CD pipelines.</p><p><strong>Instance Types (t2 vs t3)</strong></p><ul><li><p><strong>t2 instances</strong> are older burstable instances suitable for lightweight workloads.</p></li><li><p><strong>t3 instances</strong> are newer, more cost-efficient, and provide better baseline performance.</p></li></ul><p>For Docker-based ML or backend deployments, <strong>t3.micro or t3.small</strong> is generally preferred due to better CPU credit handling.</p><div><hr></div><h2>8. Connecting GitHub and EC2 Securely Using Secrets</h2><p>Hardcoding credentials directly inside GitHub workflows is insecure. Instead, GitHub Actions uses <strong>encrypted secrets</strong> to establish a secure connection between GitHub and EC2.</p><h3>Required GitHub Secrets</h3><p>Each secret has a specific role:</p><ul><li><p><strong>EC2_HOST</strong><br>The public IP address or DNS name of the EC2 instance.</p></li><li><p><strong>EC2_USER</strong><br>The default SSH username (for Ubuntu AMIs, this is <code>ubuntu</code>).</p></li><li><p><strong>EC2_SSH_KEY</strong><br>The contents of the private <code>.pem</code> key used for SSH authentication.</p></li><li><p><strong>DOCKER_COMPOSE_DIR</strong><br>The directory path on EC2 where <code>docker-compose.yml</code> is located.</p></li><li><p><strong>GHCR_PAT (GitHub Container Registry Personal Access Token)</strong><br>Used to authenticate Docker with GitHub Container Registry (<code>ghcr.io</code>) so private images can be pulled securely.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!obmO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11222dd5-c3dd-4bb2-91cd-aa5e9d7c5bc3_1798x700.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!obmO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11222dd5-c3dd-4bb2-91cd-aa5e9d7c5bc3_1798x700.png 424w, https://substackcdn.com/image/fetch/$s_!obmO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11222dd5-c3dd-4bb2-91cd-aa5e9d7c5bc3_1798x700.png 848w, https://substackcdn.com/image/fetch/$s_!obmO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11222dd5-c3dd-4bb2-91cd-aa5e9d7c5bc3_1798x700.png 1272w, https://substackcdn.com/image/fetch/$s_!obmO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11222dd5-c3dd-4bb2-91cd-aa5e9d7c5bc3_1798x700.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!obmO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11222dd5-c3dd-4bb2-91cd-aa5e9d7c5bc3_1798x700.png" width="1456" height="567" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/11222dd5-c3dd-4bb2-91cd-aa5e9d7c5bc3_1798x700.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:567,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:93078,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/183886241?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11222dd5-c3dd-4bb2-91cd-aa5e9d7c5bc3_1798x700.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!obmO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11222dd5-c3dd-4bb2-91cd-aa5e9d7c5bc3_1798x700.png 424w, https://substackcdn.com/image/fetch/$s_!obmO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11222dd5-c3dd-4bb2-91cd-aa5e9d7c5bc3_1798x700.png 848w, https://substackcdn.com/image/fetch/$s_!obmO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11222dd5-c3dd-4bb2-91cd-aa5e9d7c5bc3_1798x700.png 1272w, https://substackcdn.com/image/fetch/$s_!obmO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11222dd5-c3dd-4bb2-91cd-aa5e9d7c5bc3_1798x700.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Snapshot from our actual implementation.</figcaption></figure></div></li></ul><p>These secrets allow GitHub Actions to authenticate with EC2 <strong>without exposing credentials in the repository</strong>.</p><div><hr></div><h2>9. Automated Deployment to EC2 Using GitHub Actions</h2><pre><code>- name: Deploy to AWS EC2
  uses: appleboy/ssh-action@v0.1.7
  with:
    host: ${{ secrets.EC2_HOST }}
    username: ${{ secrets.EC2_USER }}
    key: ${{ secrets.EC2_SSH_KEY }}
    port: 22
    script: |
      echo "${{ secrets.GHCR_PAT }}" | docker login ghcr.io -u ${{ github.repository_owner }} --password-stdin
      
      cd ${{ secrets.DOCKER_COMPOSE_DIR }}
      docker-compose pull
      docker-compose down
      docker-compose up -d
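      # Optional check, an assumption and not part of the original workflow:
      # list the services so the Actions log shows whether the containers came back up.
      docker-compose ps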
</code></pre><p>The snippet above is the deployment step used inside the GitHub Actions workflow.</p><h3>What this workflow does</h3><ol><li><p><strong>Establishes SSH Connection</strong><br>GitHub Actions securely connects to the EC2 instance using the SSH private key.</p></li><li><p><strong>Authenticates Docker with GHCR</strong><br>Logs into GitHub Container Registry to pull private Docker images.</p></li><li><p><strong>Navigates to Deployment Directory</strong><br>Moves to the directory containing <code>docker-compose.yml</code>.</p></li><li><p><strong>Pulls Latest Images</strong><br>Ensures the EC2 instance always runs the latest container versions.</p></li><li><p><strong>Restarts Services Cleanly</strong><br>Stops existing containers and redeploys them in detached mode.</p></li></ol><p>This results in a <strong>fully automated CI/CD pipeline</strong>, where every push to GitHub triggers a fresh deployment on EC2.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tI1t!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb97d0242-c9e0-4cc2-976d-1d8118f70a20_2124x398.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tI1t!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb97d0242-c9e0-4cc2-976d-1d8118f70a20_2124x398.png 424w, https://substackcdn.com/image/fetch/$s_!tI1t!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb97d0242-c9e0-4cc2-976d-1d8118f70a20_2124x398.png 848w, 
https://substackcdn.com/image/fetch/$s_!tI1t!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb97d0242-c9e0-4cc2-976d-1d8118f70a20_2124x398.png 1272w, https://substackcdn.com/image/fetch/$s_!tI1t!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb97d0242-c9e0-4cc2-976d-1d8118f70a20_2124x398.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tI1t!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb97d0242-c9e0-4cc2-976d-1d8118f70a20_2124x398.png" width="1456" height="273" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b97d0242-c9e0-4cc2-976d-1d8118f70a20_2124x398.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:273,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:110400,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/183886241?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb97d0242-c9e0-4cc2-976d-1d8118f70a20_2124x398.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!tI1t!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb97d0242-c9e0-4cc2-976d-1d8118f70a20_2124x398.png 424w, 
https://substackcdn.com/image/fetch/$s_!tI1t!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb97d0242-c9e0-4cc2-976d-1d8118f70a20_2124x398.png 848w, https://substackcdn.com/image/fetch/$s_!tI1t!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb97d0242-c9e0-4cc2-976d-1d8118f70a20_2124x398.png 1272w, https://substackcdn.com/image/fetch/$s_!tI1t!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb97d0242-c9e0-4cc2-976d-1d8118f70a20_2124x398.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">This image shows how to verify whether the CI/CD pipeline is working correctly after a code change.</figcaption></figure></div><div><hr></div><h2>10. Conclusion</h2><p>Deployment and monitoring complete the ML lifecycle by ensuring models move reliably from development to production while remaining observable and controllable. Automated CI/CD with cloud infrastructure enables consistent, repeatable deployments, while monitoring detects failures, performance drift, and system anomalies early. 
Together, they transform experimental models into robust, production-grade systems suitable for real-world use.</p><p>To learn more, please watch the video on deployment.</p><div id="youtube2-SIGFopnCoNQ" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;SIGFopnCoNQ&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/SIGFopnCoNQ?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div>]]></content:encoded></item><item><title><![CDATA[A beginner's introduction to the Swin Transformer]]></title><description><![CDATA[Why did Microsoft introduce the idea of "shifted window" attention?]]></description><link>https://www.vizuaranewsletter.com/p/an-beginners-introduction-to-swin</link><guid isPermaLink="false">https://www.vizuaranewsletter.com/p/an-beginners-introduction-to-swin</guid><dc:creator><![CDATA[Sreedath Panat]]></dc:creator><pubDate>Tue, 13 Jan 2026 09:20:10 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!X8hk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F791e917c-b14b-4506-b09b-70c612d3c888_1533x510.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1>Table of Contents</h1><ol><li><p>Introduction to Swin Transformer</p></li><li><p>Limitations of Vision Transformers at High Resolution</p></li><li><p>Window 
Based Attention in Swin Transformers</p></li><li><p>Hierarchical Feature Representation in Swin Transformers</p></li><li><p>Overview of the Swin Transformer Architecture</p></li><li><p>Patch Partitioning and Linear Embedding</p></li><li><p>Patch Merging and Hierarchical Downsampling</p></li><li><p>Attention Complexity in Swin Transformers</p></li><li><p>Shifted Windows for Long Range Interaction in Swin Transformer</p></li><li><p>Relative Position Bias Parameterization in Swin Transformer</p></li><li><p>Absence of Class Token in Swin Transformer</p></li><li><p>Output Heads and Task Generalization</p></li><li><p>Comparison with Convolutional Backbones</p></li><li><p>Concluding Remarks on Swin Transformer</p></li></ol><h1>1.1 Introduction to Swin Transformer</h1><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KASB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bcb2229-416f-45a3-865b-462cbef7c08d_1506x855.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KASB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bcb2229-416f-45a3-865b-462cbef7c08d_1506x855.png 424w, https://substackcdn.com/image/fetch/$s_!KASB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bcb2229-416f-45a3-865b-462cbef7c08d_1506x855.png 848w, https://substackcdn.com/image/fetch/$s_!KASB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bcb2229-416f-45a3-865b-462cbef7c08d_1506x855.png 1272w, 
https://substackcdn.com/image/fetch/$s_!KASB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bcb2229-416f-45a3-865b-462cbef7c08d_1506x855.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!KASB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bcb2229-416f-45a3-865b-462cbef7c08d_1506x855.png" width="1456" height="827" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4bcb2229-416f-45a3-865b-462cbef7c08d_1506x855.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:827,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:125069,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/183324523?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bcb2229-416f-45a3-865b-462cbef7c08d_1506x855.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!KASB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bcb2229-416f-45a3-865b-462cbef7c08d_1506x855.png 424w, https://substackcdn.com/image/fetch/$s_!KASB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bcb2229-416f-45a3-865b-462cbef7c08d_1506x855.png 848w, 
https://substackcdn.com/image/fetch/$s_!KASB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bcb2229-416f-45a3-865b-462cbef7c08d_1506x855.png 1272w, https://substackcdn.com/image/fetch/$s_!KASB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bcb2229-416f-45a3-865b-462cbef7c08d_1506x855.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><em><strong>Figure 1.1 Vision Transformer overview.</strong><br>An input image is divided into fixed size patches, flattened, 
linearly projected, and combined with positional embeddings to form a sequence of tokens processed by a standard transformer encoder.<br>The final image representation is obtained from a dedicated class token and passed to an MLP head for prediction.</em></p><p>Transformer based models have reshaped computer vision by reformulating images as sequences of tokens. An input image, represented by its height, width, and color channels, is first partitioned into fixed size spatial patches. Each patch is flattened into a vector and treated as a token, enabling the direct application of self attention mechanisms originally developed for language modeling. This formulation allows the model to capture long range dependencies across the image, but it also exposes a key limitation. As image resolution increases, the number of patches grows proportionally, and the computational and memory cost of self attention scales quadratically with the number of tokens.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!p2Yu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F299b76ed-b7ab-43c6-a227-3c0c32904369_1533x510.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!p2Yu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F299b76ed-b7ab-43c6-a227-3c0c32904369_1533x510.png 424w, https://substackcdn.com/image/fetch/$s_!p2Yu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F299b76ed-b7ab-43c6-a227-3c0c32904369_1533x510.png 848w, 
https://substackcdn.com/image/fetch/$s_!p2Yu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F299b76ed-b7ab-43c6-a227-3c0c32904369_1533x510.png 1272w, https://substackcdn.com/image/fetch/$s_!p2Yu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F299b76ed-b7ab-43c6-a227-3c0c32904369_1533x510.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!p2Yu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F299b76ed-b7ab-43c6-a227-3c0c32904369_1533x510.png" width="1456" height="484" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/299b76ed-b7ab-43c6-a227-3c0c32904369_1533x510.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:484,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:65172,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/183324523?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F299b76ed-b7ab-43c6-a227-3c0c32904369_1533x510.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!p2Yu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F299b76ed-b7ab-43c6-a227-3c0c32904369_1533x510.png 424w, 
https://substackcdn.com/image/fetch/$s_!p2Yu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F299b76ed-b7ab-43c6-a227-3c0c32904369_1533x510.png 848w, https://substackcdn.com/image/fetch/$s_!p2Yu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F299b76ed-b7ab-43c6-a227-3c0c32904369_1533x510.png 1272w, https://substackcdn.com/image/fetch/$s_!p2Yu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F299b76ed-b7ab-43c6-a227-3c0c32904369_1533x510.png 1456w" sizes="100vw"></picture></div></a></figure></div><p><em><strong>Figure 1.2 Swin Transformer architecture.</strong><br>The input image is hierarchically processed through multiple stages, where patch partitioning and merging progressively reduce spatial resolution while increasing feature dimensionality.<br>Each stage is composed of successive Swin Transformer blocks that alternate between window based and shifted window self attention, enabling efficient local computation with cross window information exchange.</em></p><p>The Swin Transformer, short for shifted window transformer, was proposed to overcome this scalability challenge while retaining the expressive power of transformer architectures. Instead of computing attention globally over all image patches, Swin Transformer restricts self attention to local windows and introduces a systematic window shifting strategy across layers. This design enables efficient computation at high resolutions while still allowing information exchange beyond local neighborhoods. As a result, Swin Transformer serves as a strong and scalable backbone for a wide range of visual tasks. In the next section, we will place this architecture in context by contrasting it with earlier vision transformer designs.</p><h1>1.2 Limitations of Vision Transformers at High Resolution</h1><p>Vision Transformers model images as a sequence of patch tokens and apply global self attention over all tokens. While this design enables strong long range interactions, it introduces a fundamental computational bottleneck. 
For an image of height H and width W, divided into patches of size P &#215; P, the total number of tokens is</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;N = \\frac{H}{P} \\times \\frac{W}{P}\n&quot;,&quot;id&quot;:&quot;HEHGYKVWRW&quot;}" data-component-name="LatexBlockToDOM"></div><p>Self attention computes interactions between all query&#8211;key pairs, resulting in an attention complexity that scales as <em>O(N&#178;)</em>. Since <em>N</em> grows linearly with the number of pixels HW for a fixed patch size P, the overall attention cost scales quadratically with the number of pixels, effectively O((HW)&#178;).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!x7T0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90bd28d7-c85a-4e23-8c0a-fa0f137fee02_774x570.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!x7T0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90bd28d7-c85a-4e23-8c0a-fa0f137fee02_774x570.png 424w, https://substackcdn.com/image/fetch/$s_!x7T0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90bd28d7-c85a-4e23-8c0a-fa0f137fee02_774x570.png 848w, https://substackcdn.com/image/fetch/$s_!x7T0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90bd28d7-c85a-4e23-8c0a-fa0f137fee02_774x570.png 1272w, https://substackcdn.com/image/fetch/$s_!x7T0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90bd28d7-c85a-4e23-8c0a-fa0f137fee02_774x570.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!x7T0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90bd28d7-c85a-4e23-8c0a-fa0f137fee02_774x570.png" width="774" height="570" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/90bd28d7-c85a-4e23-8c0a-fa0f137fee02_774x570.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:570,&quot;width&quot;:774,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:25376,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/183324523?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90bd28d7-c85a-4e23-8c0a-fa0f137fee02_774x570.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!x7T0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90bd28d7-c85a-4e23-8c0a-fa0f137fee02_774x570.png 424w, https://substackcdn.com/image/fetch/$s_!x7T0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90bd28d7-c85a-4e23-8c0a-fa0f137fee02_774x570.png 848w, https://substackcdn.com/image/fetch/$s_!x7T0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90bd28d7-c85a-4e23-8c0a-fa0f137fee02_774x570.png 1272w, https://substackcdn.com/image/fetch/$s_!x7T0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F90bd28d7-c85a-4e23-8c0a-fa0f137fee02_774x570.png 1456w" sizes="100vw" 
loading="lazy"></picture></div></a></figure></div><p><em><strong>Figure 1.3 Global self attention in Vision Transformers.</strong><br>For N input tokens, self attention computes an N &#215; N matrix of query&#8211;key dot products, capturing all pairwise token interactions and leading to quadratic computational complexity.</em></p><p>Figure 1.4 illustrates this behavior with a concrete example. When image resolution is doubled along both spatial dimensions, the total number of pixels increases by a factor of four, but the attention computation increases by a factor of sixteen. 
This quadratic scaling makes Vision Transformers increasingly impractical for high resolution inputs. As a result, tasks that inherently require fine spatial detail, such as semantic segmentation or instance segmentation, become prohibitively expensive, limiting the applicability of vanilla Vision Transformers beyond image classification.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!c9e-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc199b03-01ac-4126-bd09-3bcb31927236_1398x936.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!c9e-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc199b03-01ac-4126-bd09-3bcb31927236_1398x936.png 424w, https://substackcdn.com/image/fetch/$s_!c9e-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc199b03-01ac-4126-bd09-3bcb31927236_1398x936.png 848w, https://substackcdn.com/image/fetch/$s_!c9e-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc199b03-01ac-4126-bd09-3bcb31927236_1398x936.png 1272w, https://substackcdn.com/image/fetch/$s_!c9e-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc199b03-01ac-4126-bd09-3bcb31927236_1398x936.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!c9e-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc199b03-01ac-4126-bd09-3bcb31927236_1398x936.png" width="1398" height="936" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fc199b03-01ac-4126-bd09-3bcb31927236_1398x936.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:936,&quot;width&quot;:1398,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:478061,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/183324523?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc199b03-01ac-4126-bd09-3bcb31927236_1398x936.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!c9e-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc199b03-01ac-4126-bd09-3bcb31927236_1398x936.png 424w, https://substackcdn.com/image/fetch/$s_!c9e-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc199b03-01ac-4126-bd09-3bcb31927236_1398x936.png 848w, https://substackcdn.com/image/fetch/$s_!c9e-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc199b03-01ac-4126-bd09-3bcb31927236_1398x936.png 1272w, https://substackcdn.com/image/fetch/$s_!c9e-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc199b03-01ac-4126-bd09-3bcb31927236_1398x936.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em><strong>Figure 1.4 Quadratic complexity of global self attention in Vision Transformers.</strong><br>Increasing image resolution leads to a quadratic growth in attention computation, as every patch token attends to every other token, making high resolution vision tasks computationally expensive.</em></p><h1>1.3 Window Based Attention in Swin Transformers</h1><p>Swin Transformers address the quadratic complexity of global self attention by restricting attention computation to local windows. Instead of allowing each patch token to attend to all other tokens in the image, attention is computed only among tokens that fall within a fixed size window. 
If each window contains M &#215; M tokens, attention within one window involves M&#178; tokens and therefore costs O(M&#8308;). An image with N tokens is covered by N / M&#178; windows, so the total attention cost is O(M&#178;N). Because the window size M is fixed, the number of windows grows linearly with image size, and the overall complexity becomes linear with respect to the number of pixels.</p><p>This design dramatically improves scalability, but purely local attention introduces a new limitation. Tokens in different windows do not directly interact, which can restrict information flow across the image. Swin Transformers resolve this by introducing shifted window self attention. In alternating layers, the window partitioning is shifted spatially, allowing tokens that were previously in separate windows to attend to one another. Over multiple layers, this mechanism enables effective cross window communication while preserving linear computational complexity.</p><h1>1.4 Hierarchical Feature Representation in Swin Transformers</h1><p>Another key limitation of Vision Transformers is their flat representation structure. All transformer blocks operate on tokens derived from a single fixed patch size, resulting in a uniform spatial resolution throughout the network. This contrasts with convolutional architectures, which naturally build hierarchical feature representations by progressively reducing spatial resolution while increasing channel capacity.</p><p>Swin Transformers explicitly introduce a hierarchical architecture through patch merging stages. As shown in Figure 1.5, the model is organized into multiple stages, each operating at a different spatial scale. Early stages process high resolution features with fewer channels, while later stages operate on lower resolution representations with richer semantic content. 
This pyramid like structure closely aligns with the inductive biases of vision tasks and is particularly beneficial for dense prediction problems such as detection and segmentation.</p><p>By combining window based attention, shifted windows, and hierarchical feature construction, Swin Transformers bridge the gap between convolutional backbones and transformer based modeling. This design enables transformers to function as general purpose vision backbones, achieving strong performance across classification, object detection, and semantic segmentation tasks.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yfin!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd11bcc3-f3e4-46dd-aec8-bcf5d23d12a9_1089x648.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yfin!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd11bcc3-f3e4-46dd-aec8-bcf5d23d12a9_1089x648.png 424w, https://substackcdn.com/image/fetch/$s_!yfin!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd11bcc3-f3e4-46dd-aec8-bcf5d23d12a9_1089x648.png 848w, https://substackcdn.com/image/fetch/$s_!yfin!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd11bcc3-f3e4-46dd-aec8-bcf5d23d12a9_1089x648.png 1272w, https://substackcdn.com/image/fetch/$s_!yfin!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd11bcc3-f3e4-46dd-aec8-bcf5d23d12a9_1089x648.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!yfin!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd11bcc3-f3e4-46dd-aec8-bcf5d23d12a9_1089x648.png" width="1089" height="648" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bd11bcc3-f3e4-46dd-aec8-bcf5d23d12a9_1089x648.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:648,&quot;width&quot;:1089,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:188705,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/183324523?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd11bcc3-f3e4-46dd-aec8-bcf5d23d12a9_1089x648.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!yfin!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd11bcc3-f3e4-46dd-aec8-bcf5d23d12a9_1089x648.png 424w, https://substackcdn.com/image/fetch/$s_!yfin!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd11bcc3-f3e4-46dd-aec8-bcf5d23d12a9_1089x648.png 848w, https://substackcdn.com/image/fetch/$s_!yfin!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd11bcc3-f3e4-46dd-aec8-bcf5d23d12a9_1089x648.png 1272w, https://substackcdn.com/image/fetch/$s_!yfin!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd11bcc3-f3e4-46dd-aec8-bcf5d23d12a9_1089x648.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em><strong>Figure 1.5 Hierarchical feature construction in Swin Transformers.</strong><br>Spatial resolution is progressively reduced while channel dimensionality increases across stages, enabling multi scale feature representations similar to those used in convolutional architectures.</em></p><h1>1.5 Overview of the Swin Transformer Architecture</h1><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!X8hk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F791e917c-b14b-4506-b09b-70c612d3c888_1533x510.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!X8hk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F791e917c-b14b-4506-b09b-70c612d3c888_1533x510.png 424w, https://substackcdn.com/image/fetch/$s_!X8hk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F791e917c-b14b-4506-b09b-70c612d3c888_1533x510.png 848w, https://substackcdn.com/image/fetch/$s_!X8hk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F791e917c-b14b-4506-b09b-70c612d3c888_1533x510.png 1272w, https://substackcdn.com/image/fetch/$s_!X8hk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F791e917c-b14b-4506-b09b-70c612d3c888_1533x510.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!X8hk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F791e917c-b14b-4506-b09b-70c612d3c888_1533x510.png" width="1456" height="484" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/791e917c-b14b-4506-b09b-70c612d3c888_1533x510.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:484,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:65172,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/183324523?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F791e917c-b14b-4506-b09b-70c612d3c888_1533x510.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!X8hk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F791e917c-b14b-4506-b09b-70c612d3c888_1533x510.png 424w, https://substackcdn.com/image/fetch/$s_!X8hk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F791e917c-b14b-4506-b09b-70c612d3c888_1533x510.png 848w, https://substackcdn.com/image/fetch/$s_!X8hk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F791e917c-b14b-4506-b09b-70c612d3c888_1533x510.png 1272w, https://substackcdn.com/image/fetch/$s_!X8hk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F791e917c-b14b-4506-b09b-70c612d3c888_1533x510.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p><em><strong>Figure 1.6 Swin Transformer architecture and block structure.</strong><br>The model is composed of multiple stages with progressive resolution reduction and channel expansion. Each stage contains repeated Swin Transformer blocks, where window-based and shifted-window self-attention are applied sequentially to enable efficient and scalable visual representation learning.</em></p><p>The Swin Transformer architecture is organized around two core ideas: staged hierarchical processing and localized self-attention within transformer blocks. At a high level, the model is composed of multiple stages arranged sequentially, where each stage operates at a specific spatial resolution and feature dimensionality. 
The left portion of Figure 1.6 illustrates this overall architecture, showing how an input image is progressively transformed through a series of stages. Each stage begins with a linear embedding (in the first stage) or a patch merging operation (in later stages), followed by repeated Swin Transformer blocks. As the model advances through stages, spatial resolution is reduced while the number of feature channels increases, enabling increasingly abstract and semantically rich representations.</p><p>The fundamental computational unit of the architecture is the Swin Transformer block, expanded on the right side of Figure 1.6. A single Swin Transformer block is composed of two consecutive transformer-style sub-blocks. The first applies window-based multi-head self-attention, where attention is computed independently within non-overlapping local windows. The second applies shifted-window multi-head self-attention, which uses a spatially shifted window configuration to enable information exchange across neighboring windows. Each sub-block pairs its attention module with layer normalization, residual connections, and a feed-forward multilayer perceptron, closely mirroring the standard transformer design. While these components are individually familiar, their specific arrangement and interaction through windowing and shifting form the core novelty of the Swin architecture.</p><p>At this point, several architectural questions naturally arise. How are patches grouped into windows, and how are windows constructed efficiently from the patch representation? How is positional information encoded without relying on absolute positional embeddings tied to a fixed token sequence? How exactly does window-based attention differ from standard global attention, and how does the shifted-window mechanism preserve cross-window connectivity? Finally, how does this staged design give rise to hierarchical feature representations suitable for a wide range of vision tasks? 
These questions define the remainder of the Swin Transformer discussion and will be addressed step by step, beginning with patch partitioning and linear embedding in the next section.</p><h1>1.6 Patch Partitioning and Linear Embedding</h1><p>The first concrete operation applied to the input image in the Swin Transformer pipeline is <strong>patch partitioning</strong>. An RGB image of spatial resolution <em>H&#215;W</em> with <em>3</em> channels is divided into non-overlapping square patches of size <em>4&#215;4</em>. This step is purely a reshaping operation and does not involve any learnable parameters. Each patch contains spatial information across all three color channels and serves as the atomic unit that will later be processed by transformer blocks.</p><p>Formally, partitioning the image into <em>4&#215;4</em> patches produces a grid of <em>H/4</em> patches along the height and <em>W/4</em> patches along the width, assuming <em>H</em> and <em>W</em> are multiples of 4. Each patch is then flattened into a one-dimensional vector by concatenating all pixel values within the patch across channels. Since each patch contains 4&#215;4&#215;3 values, the resulting vector has dimensionality 48. 
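</p><p>As a minimal sketch of this reshaping, the NumPy snippet below splits a toy image into 4&#215;4 patches and flattens each one. The function name <code>partition_patches</code> and the 8&#215;12 test image are assumptions made for illustration, and the code assumes that H and W are divisible by the patch size.</p>

```python
import numpy as np


def partition_patches(image, patch=4):
    """Reshape an (H, W, 3) image into an (H/patch, W/patch, patch*patch*3) tensor."""
    H, W, channels = image.shape
    # Assumption of this sketch: H and W are divisible by the patch size (no padding).
    assert H % patch == 0 and W % patch == 0
    # Split each spatial axis into (patch index, position inside the patch).
    x = image.reshape(H // patch, patch, W // patch, patch, channels)
    # Bring the two patch-grid axes to the front, then flatten each patch.
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(H // patch, W // patch, patch * patch * channels)


# Toy 8 x 12 RGB image, used only to check the shapes.
image = np.arange(8 * 12 * 3, dtype=np.float32).reshape(8, 12, 3)
patches = partition_patches(image)
print(patches.shape)  # (2, 3, 48)
```

<p>The output shape matches H/4 &#215; W/4 &#215; 48, and no learnable parameters are involved: the operation only rearranges pixel values.</p><p>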
At this point, the image is represented as a tensor of patch vectors with shape</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\frac{H}{4} \\times \\frac{W}{4} \\times 48,\n&quot;,&quot;id&quot;:&quot;XXDMQNHBQF&quot;}" data-component-name="LatexBlockToDOM"></div><p>where each vector corresponds to a single spatial patch.</p><p><strong>The patch partitioning process can be summarized as follows:</strong></p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{Input image shape} = H \\times W \\times 3&quot;,&quot;id&quot;:&quot;CBXBWZXDAQ&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{Patch size} = 4 \\times 4&quot;,&quot;id&quot;:&quot;GPXHDLMHEX&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{Number of patches} = \\frac{H}{4} \\times \\frac{W}{4}&quot;,&quot;id&quot;:&quot;AWGKPSJOHF&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{Patch vector dimension} = 4 \\times 4 \\times 3 = 48&quot;,&quot;id&quot;:&quot;FLXOOPTEKX&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{Patch tensor shape} = \\frac{H}{4} \\times \\frac{W}{4} \\times 48&quot;,&quot;id&quot;:&quot;KYCROUUOJH&quot;}" data-component-name="LatexBlockToDOM"></div><p>While these 48-dimensional vectors faithfully preserve all pixel-level information within each patch, they do not yet have the embedding width expected by the transformer blocks. Transformers operate on tokens that share a common embedding dimension, typically denoted as C. 
To achieve this, each flattened patch vector is passed through a <strong>linear embedding layer</strong>, which performs a learned linear projection from 48 dimensions to C dimensions.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wFML!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5c7dfff-644a-443e-a9f9-311ae2f41d95_678x696.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!wFML!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5c7dfff-644a-443e-a9f9-311ae2f41d95_678x696.png 424w, https://substackcdn.com/image/fetch/$s_!wFML!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5c7dfff-644a-443e-a9f9-311ae2f41d95_678x696.png 848w, https://substackcdn.com/image/fetch/$s_!wFML!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5c7dfff-644a-443e-a9f9-311ae2f41d95_678x696.png 1272w, https://substackcdn.com/image/fetch/$s_!wFML!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5c7dfff-644a-443e-a9f9-311ae2f41d95_678x696.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!wFML!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5c7dfff-644a-443e-a9f9-311ae2f41d95_678x696.png" width="678" height="696" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f5c7dfff-644a-443e-a9f9-311ae2f41d95_678x696.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:696,&quot;width&quot;:678,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:48734,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/183324523?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5c7dfff-644a-443e-a9f9-311ae2f41d95_678x696.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!wFML!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5c7dfff-644a-443e-a9f9-311ae2f41d95_678x696.png 424w, https://substackcdn.com/image/fetch/$s_!wFML!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5c7dfff-644a-443e-a9f9-311ae2f41d95_678x696.png 848w, https://substackcdn.com/image/fetch/$s_!wFML!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5c7dfff-644a-443e-a9f9-311ae2f41d95_678x696.png 1272w, https://substackcdn.com/image/fetch/$s_!wFML!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5c7dfff-644a-443e-a9f9-311ae2f41d95_678x696.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p><em><strong>Figure 1.7 Patch partitioning and linear embedding in Swin Transformer.</strong><br>The input image is divided into non-overlapping 4&#215;4 patches, each flattened into a 48-dimensional vector and linearly projected into a C-dimensional embedding space before being processed by Swin Transformer blocks.</em></p><p>This linear embedding is equivalent to a fully connected layer applied independently to each patch. The transformation is parameterized by a weight matrix of shape 48&#215;C, mapping each 1&#215;48 patch vector to a 1&#215;C embedding. 
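</p><p>A minimal NumPy sketch of this projection follows. The embedding width C = 96, the random weights, and the zero bias are assumptions chosen for illustration; the text above only fixes the 48&#215;C shape of the weight matrix.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
C = 96  # embedding dimension; an illustrative choice, not fixed by the text

# Stand-in for the patch tensor of shape (H/4, W/4, 48); here H = W = 32.
patches = rng.standard_normal((8, 8, 48))

# Parameters of the embedding layer (randomly initialized in this sketch).
weight = rng.standard_normal((48, C)) * 0.02
bias = np.zeros(C)  # the bias term is an assumption of this sketch

# The same 48 x C projection is applied independently to every patch vector.
tokens = patches @ weight + bias
print(tokens.shape)  # (8, 8, 96)
```

<p>Because the same weight matrix multiplies every patch vector, the operation changes only the last axis of the tensor and introduces no interaction between patches.</p><p>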
After this step, all patches are represented in a common feature space, and the token dimensionality stays at C throughout the transformer blocks of the first stage.</p><p>After linear embedding, the patch representation takes the form</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot; \\quad \\frac{H}{4} \\times \\frac{W}{4} \\times C, \\quad \n&quot;,&quot;id&quot;:&quot;TOKRFJTJYP&quot;}" data-component-name="LatexBlockToDOM"></div><p>which serves as the input to the first stage of Swin Transformer blocks. Importantly, neither patch partitioning nor linear embedding introduces any notion of attention or contextual interaction between patches. These steps strictly prepare the input representation and defer all contextual modeling to the Swin Transformer blocks that follow.</p><h1>1.7 Patch Merging and Hierarchical Downsampling</h1><p>Patch merging is the mechanism through which Swin Transformer transitions between stages and constructs hierarchical feature representations. Unlike patch partitioning, which operates directly on image pixels, patch merging operates <strong>on patch tokens</strong> produced by earlier stages. Before the first merging step, that is, at the output of the first stage, the number of tokens in the feature map is</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;N = \\frac{H}{4} \\times \\frac{W}{4} \\quad \\text{(Number of tokens)}\n&quot;,&quot;id&quot;:&quot;FAJOVJCZWA&quot;}" data-component-name="LatexBlockToDOM"></div><p>with each token represented by a C-dimensional embedding. 
Patch merging reduces the number of tokens while increasing their representational capacity, closely mirroring the downsampling behavior observed in convolutional neural networks.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!H6vt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b9298e2-96e5-4862-be53-93b93012ad72_1476x402.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!H6vt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b9298e2-96e5-4862-be53-93b93012ad72_1476x402.png 424w, https://substackcdn.com/image/fetch/$s_!H6vt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b9298e2-96e5-4862-be53-93b93012ad72_1476x402.png 848w, https://substackcdn.com/image/fetch/$s_!H6vt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b9298e2-96e5-4862-be53-93b93012ad72_1476x402.png 1272w, https://substackcdn.com/image/fetch/$s_!H6vt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b9298e2-96e5-4862-be53-93b93012ad72_1476x402.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!H6vt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b9298e2-96e5-4862-be53-93b93012ad72_1476x402.png" width="1456" height="397" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2b9298e2-96e5-4862-be53-93b93012ad72_1476x402.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:397,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:14547,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/183324523?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b9298e2-96e5-4862-be53-93b93012ad72_1476x402.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!H6vt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b9298e2-96e5-4862-be53-93b93012ad72_1476x402.png 424w, https://substackcdn.com/image/fetch/$s_!H6vt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b9298e2-96e5-4862-be53-93b93012ad72_1476x402.png 848w, https://substackcdn.com/image/fetch/$s_!H6vt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b9298e2-96e5-4862-be53-93b93012ad72_1476x402.png 1272w, https://substackcdn.com/image/fetch/$s_!H6vt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b9298e2-96e5-4862-be53-93b93012ad72_1476x402.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p><em><strong>Figure 1.8 Patch merging operation in Swin Transformer.</strong><br>Four neighboring patch tokens are concatenated along the channel dimension to form a 4C-dimensional representation, followed by a linear projection to 2C, reducing spatial resolution while increasing feature abstraction across stages.</em></p><p>The core operation in patch merging is the grouping of <strong>four neighboring patches arranged in a 2 &#215; 2 grid</strong>. Consider four adjacent patch tokens, each of dimensionality C. These four tokens are concatenated <strong>along the channel dimension</strong>, producing a single token of dimensionality 4C. 
This operation reduces the spatial resolution by a factor of two along both height and width, since each 2 &#215; 2 group of patches is replaced by a single patch token.</p><p>Concretely, if the input to patch merging has shape</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\frac{H}{4} \\times \\frac{W}{4} \\times C, &quot;,&quot;id&quot;:&quot;PLHMRKPJJP&quot;}" data-component-name="LatexBlockToDOM"></div><p>then after concatenation of 2 &#215; 2 neighboring patches, the intermediate representation becomes</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\frac{H}{8} \\times \\frac{W}{8} \\times 4C.&quot;,&quot;id&quot;:&quot;GXSEILXRYO&quot;}" data-component-name="LatexBlockToDOM"></div><p>At this point, the number of tokens has been reduced by a factor of four, but each token now aggregates information from a larger spatial region.</p><p>However, the Swin Transformer architecture does not propagate tokens with dimensionality 4C to the next stage. Instead, a <strong>linear projection</strong> is applied to each concatenated token to reduce the channel dimensionality from 4C to 2C. This projection is implemented as a fully connected layer shared across all tokens, analogous to the linear embedding used during initial patch embedding. 
As a result, the final output of patch merging has shape</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\frac{H}{8} \\times \\frac{W}{8} \\times 2C \n&quot;,&quot;id&quot;:&quot;ANLMABXDGT&quot;}" data-component-name="LatexBlockToDOM"></div><p><strong>This two-step process can be summarized as follows:</strong></p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{Input shape} = \\frac{H}{4} \\times \\frac{W}{4} \\times C&quot;,&quot;id&quot;:&quot;TLVSWQKBYF&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{Concatenation of $2 \\times 2$ patches} \\Rightarrow\n&quot;,&quot;id&quot;:&quot;IPDANVRNMN&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\frac{H}{8} \\times \\frac{W}{8} \\times 4C \n&quot;,&quot;id&quot;:&quot;MYNEGYWHBM&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{Linear projection} : 4C \\rightarrow 2C&quot;,&quot;id&quot;:&quot;UGFMRLWNDA&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{Output shape} = \\frac{H}{8} \\times \\frac{W}{8} \\times 2C&quot;,&quot;id&quot;:&quot;QRDOCDQTUB&quot;}" data-component-name="LatexBlockToDOM"></div><p>Patch merging therefore performs two complementary roles simultaneously. First, it <strong>reduces spatial resolution</strong>, decreasing the number of tokens and improving computational efficiency for subsequent attention layers. Second, it <strong>increases feature abstraction</strong>: the four concatenated C-dimensional tokens are projected to a single 2C-dimensional token, so each merged token has twice the channel width of its inputs. 
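</p><p>The two steps can be sketched in NumPy as follows. The function name <code>merge_patches</code>, the value C = 96, the 224&#215;224 input size, and the random projection weights are assumptions for illustration; reference implementations may order the four neighbors differently or add a normalization step, which this sketch does not model.</p>

```python
import numpy as np


def merge_patches(tokens, weight):
    """Merge each 2 x 2 group of tokens: (h, w, C) becomes (h/2, w/2, 2C)."""
    h, w, C = tokens.shape
    # Assumption of this sketch: h and w are even (no padding).
    assert h % 2 == 0 and w % 2 == 0
    # Group tokens into 2 x 2 neighborhoods.
    x = tokens.reshape(h // 2, 2, w // 2, 2, C)
    # Concatenate the four neighbors along the channel axis, giving 4C channels.
    x = x.transpose(0, 2, 1, 3, 4).reshape(h // 2, w // 2, 4 * C)
    # Shared linear projection from 4C down to 2C.
    return x @ weight


rng = np.random.default_rng(0)
C = 96
tokens = rng.standard_normal((56, 56, C))  # H/4 x W/4 x C for a 224 x 224 image
weight = rng.standard_normal((4 * C, 2 * C)) * 0.02  # random stand-in for learned weights
merged = merge_patches(tokens, weight)
print(merged.shape)  # (28, 28, 192)
```

<p>Calling the function again on this output, with a weight matrix of shape 768&#215;384, would halve the resolution once more and give a 14&#215;14&#215;384 tensor.</p><p>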
As this operation is repeated across stages, the model progressively moves from fine grained local representations to coarser, more semantic features. This progressive reduction in token count and increase in feature dimensionality is the foundation of the hierarchical structure that enables Swin Transformer to serve as a general purpose vision backbone.</p><h1>1.8 Attention Complexity in Swin Transformers</h1><p>The primary motivation behind the Swin Transformer design is to address the quadratic attention complexity of standard Vision Transformers while preserving the expressive power of self attention. This is achieved by restricting self attention computation to local windows instead of the entire image.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dgfL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ec4fa61-9d55-4ad9-902a-d5e0d9933d47_1428x1275.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dgfL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ec4fa61-9d55-4ad9-902a-d5e0d9933d47_1428x1275.png 424w, https://substackcdn.com/image/fetch/$s_!dgfL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ec4fa61-9d55-4ad9-902a-d5e0d9933d47_1428x1275.png 848w, https://substackcdn.com/image/fetch/$s_!dgfL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ec4fa61-9d55-4ad9-902a-d5e0d9933d47_1428x1275.png 1272w, 
https://substackcdn.com/image/fetch/$s_!dgfL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ec4fa61-9d55-4ad9-902a-d5e0d9933d47_1428x1275.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dgfL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ec4fa61-9d55-4ad9-902a-d5e0d9933d47_1428x1275.png" width="1428" height="1275" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2ec4fa61-9d55-4ad9-902a-d5e0d9933d47_1428x1275.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1275,&quot;width&quot;:1428,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:404945,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/183324523?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ec4fa61-9d55-4ad9-902a-d5e0d9933d47_1428x1275.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!dgfL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ec4fa61-9d55-4ad9-902a-d5e0d9933d47_1428x1275.png 424w, https://substackcdn.com/image/fetch/$s_!dgfL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ec4fa61-9d55-4ad9-902a-d5e0d9933d47_1428x1275.png 848w, 
https://substackcdn.com/image/fetch/$s_!dgfL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ec4fa61-9d55-4ad9-902a-d5e0d9933d47_1428x1275.png 1272w, https://substackcdn.com/image/fetch/$s_!dgfL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2ec4fa61-9d55-4ad9-902a-d5e0d9933d47_1428x1275.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em><strong>Figure 1.9 Attention complexity comparison between Vision Transformer and Swin Transformer.</strong><br>Vision 
Transformer computes global self attention over all patches, resulting in quadratic complexity with respect to image size. Swin Transformer restricts attention to fixed size local windows, leading to linear scaling while preserving local contextual modeling.</em></p><p>Consider an input image of spatial resolution H&#215;W, divided into non overlapping patches of size P&#215;P. As before, each patch is treated as a token. The total number of patches in the image is therefore</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;N_{\\text{patches}} = \\frac{H}{P} \\times \\frac{W}{P}&quot;,&quot;id&quot;:&quot;GDIIEKTEDM&quot;}" data-component-name="LatexBlockToDOM"></div><p>In a Vision Transformer, all <em>N_patches</em> tokens participate in global self attention, leading to quadratic complexity. Swin Transformer alters this computation by introducing fixed size windows.</p><p><strong>Attention Computation Within a Single Window</strong></p><p>In Swin Transformer, attention is computed locally within windows of size <em>M&#215;M</em> patches. 
For a single window, the number of tokens is</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;N_{\\text{window}} = M^2&quot;,&quot;id&quot;:&quot;ZJPXVGBRWR&quot;}" data-component-name="LatexBlockToDOM"></div><p>Since self attention computes all pairwise interactions between queries and keys, the attention complexity for <strong>one window</strong> is</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathcal{O}(N_{\\text{window}}^2) = \\mathcal{O}(M^4)&quot;,&quot;id&quot;:&quot;BNSDQCFDSI&quot;}" data-component-name="LatexBlockToDOM"></div><p>This cost is independent of the overall image resolution and depends only on the window size.</p><p><strong>Number of Windows in the Image</strong></p><p>To compute the total attention cost, we must account for how many such windows exist in the image.</p><p>The number of patches along height and width are</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\frac{H}{P}, \\quad \\frac{W}{P}&quot;,&quot;id&quot;:&quot;SXXUDUPWAL&quot;}" data-component-name="LatexBlockToDOM"></div><p>Each window spans M patches along each spatial dimension. 
Therefore, the number of windows along height and width are</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\frac{H}{P M}, \\quad \\frac{W}{P M}&quot;,&quot;id&quot;:&quot;FCHWWXFQKP&quot;}" data-component-name="LatexBlockToDOM"></div><p>The total number of windows in the image is</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;N_{\\text{windows}} = \\frac{H W}{P^2 M^2}&quot;,&quot;id&quot;:&quot;MXWVSMUOHO&quot;}" data-component-name="LatexBlockToDOM"></div><h4>Total Attention Complexity</h4><p>The total attention complexity for the entire image is obtained by multiplying the cost per window with the total number of windows</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathcal{O}_{\\text{total}}\n= \\mathcal{O}(M^4) \\times \\frac{H W}{P^2 M^2}\n= \\mathcal{O}\\left(\\frac{M^2 H W}{P^2}\\right)&quot;,&quot;id&quot;:&quot;APBEGTXMDM&quot;}" data-component-name="LatexBlockToDOM"></div><p>Since M and P are fixed hyperparameters, the dominant scaling term is HW. Thus, the overall attention complexity scales linearly with the number of image pixels</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathcal{O}_{\\text{Swin}} \\sim \\mathcal{O}(H W)&quot;,&quot;id&quot;:&quot;WFIVIICXSW&quot;}" data-component-name="LatexBlockToDOM"></div><p>This is in contrast to the Vision Transformer, where attention complexity scales as</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathcal{O}_{\\text{ViT}} \\sim \\mathcal{O}\\left((H W)^2\\right)&quot;,&quot;id&quot;:&quot;JOLYYYSBGJ&quot;}" data-component-name="LatexBlockToDOM"></div><p>By limiting self attention to local windows, Swin Transformer converts quadratic global attention into <strong>linear attention with respect to image size</strong>. 
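</p><p>As a rough numeric check of this scaling, the short Python sketch below counts query key pairs under both schemes. The function names and the settings <em>P</em> = 4 and <em>M</em> = 7 are illustrative assumptions rather than values taken from this article, and the image sides are assumed to divide evenly into windows.</p>

```python
def num_patches(height: int, width: int, patch: int) -> int:
    """Number of P x P patches in an H x W image (assumes exact divisibility)."""
    return (height // patch) * (width // patch)


def vit_attention_pairs(height: int, width: int, patch: int) -> int:
    """Query-key pairs for global attention: N_patches squared."""
    n = num_patches(height, width, patch)
    return n * n


def swin_attention_pairs(height: int, width: int, patch: int, window: int) -> int:
    """Query-key pairs for window attention: (M^2)^2 per window times the window count."""
    n_windows = num_patches(height, width, patch) // (window * window)
    return (window ** 2) ** 2 * n_windows


# Hypothetical settings: P = 4 and M = 7, so each side must be a multiple of 28.
for side in (224, 448):
    vit = vit_attention_pairs(side, side, patch=4)
    swin = swin_attention_pairs(side, side, patch=4, window=7)
    print(side, vit, swin)
```

<p>Under these assumptions, doubling the image side multiplies the pixel count by 4. The Swin count grows by the same factor of 4, while the ViT count grows by a factor of 16.</p><p>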
As image resolution increases, the attention cost grows in proportion to the number of pixels rather than quadratically, making Swin Transformer far more suitable for high resolution vision tasks.</p><h1>1.9 Shifted Windows for Long Range Interaction in Swin Transformer</h1><p>Window based self attention significantly reduces computational cost by restricting attention to local regions. However, this restriction introduces an important limitation: patches that are spatially close but lie in different windows cannot attend to each other. Swin Transformer resolves this limitation using the idea of <strong>shifted windows</strong>, which enables information flow across window boundaries while preserving linear complexity.</p><p><strong>1.9.1 Limitation of Regular Non Overlapping Windows</strong></p><p>In regular window based self attention, the image is first partitioned into non overlapping patches. These patches are then grouped into non overlapping windows of fixed size.</p><p>Let <em><strong>H</strong></em> and <em><strong>W</strong></em> denote the image height and width in pixels, <em><strong>P</strong></em> the patch size in pixels, and <em><strong>M</strong></em> the window size in patches per side.</p><p>Each window contains</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot; N = M^{2} \n&quot;,&quot;id&quot;:&quot;EVEVFAUTNR&quot;}" data-component-name="LatexBlockToDOM"></div><p>tokens, and self attention is computed only among these tokens.</p><p>As a result, attention between any two patches is <strong>allowed if and only if</strong> both patches belong to the same window.</p><p><strong>This creates a hard boundary:</strong></p><p>&#8226; Patches A and B may be geometrically adjacent but cannot attend to each other if they fall in different windows<br>&#8226; Patches A and C may be farther apart but can attend to each other if they share a window</p><p>Thus, attention is governed by window membership, not spatial proximity.</p><p><strong>1.9.2 Why Regular Windows Are 
Insufficient</strong></p><p>Because attention is restricted to fixed windows:</p><p>&#8226; Local context is well captured<br>&#8226; Global and cross window dependencies are missing</p><p>This is problematic for vision tasks where objects often span window boundaries. Simply enlarging the window would increase computation and defeat the purpose of window based attention.</p><p>Swin Transformer solves this using <strong>alternating regular and shifted window attention blocks</strong>.</p><p><strong>1.9.3 Shifted Window Mechanism</strong></p><p>In the shifted window block, window boundaries are shifted by half the window size (rounded down when M is odd) along both height and width.</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{Shift size} = \\left( \\left\\lfloor \\frac{M}{2} \\right\\rfloor, \\left\\lfloor \\frac{M}{2} \\right\\rfloor \\right)\n&quot;,&quot;id&quot;:&quot;WQHQEBLOUQ&quot;}" data-component-name="LatexBlockToDOM"></div><p>This shift changes window assignments without changing patch content.</p><p>As a result:</p><p>&#8226; Patches that were previously in different windows are now grouped together<br>&#8226; New attention connections are formed across previous window boundaries</p><p>Importantly, this shift does not introduce overlapping windows. Each block still computes attention within non overlapping windows of fixed size.</p><p><strong>1.9.4 Cyclic Shifting to Preserve Window Structure</strong></p><p>A naive shift creates incomplete windows near image boundaries, leading to uneven window sizes. 
Padding the incomplete windows up to full size could fix this, but it increases the number of windows and therefore the computation.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cl_l!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9719e77-242c-44ac-8b3f-fef199a42d91_1521x870.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cl_l!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9719e77-242c-44ac-8b3f-fef199a42d91_1521x870.png 424w, https://substackcdn.com/image/fetch/$s_!cl_l!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9719e77-242c-44ac-8b3f-fef199a42d91_1521x870.png 848w, https://substackcdn.com/image/fetch/$s_!cl_l!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9719e77-242c-44ac-8b3f-fef199a42d91_1521x870.png 1272w, https://substackcdn.com/image/fetch/$s_!cl_l!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9719e77-242c-44ac-8b3f-fef199a42d91_1521x870.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cl_l!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9719e77-242c-44ac-8b3f-fef199a42d91_1521x870.png" width="1456" height="833" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f9719e77-242c-44ac-8b3f-fef199a42d91_1521x870.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:833,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:432933,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/183324523?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9719e77-242c-44ac-8b3f-fef199a42d91_1521x870.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!cl_l!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9719e77-242c-44ac-8b3f-fef199a42d91_1521x870.png 424w, https://substackcdn.com/image/fetch/$s_!cl_l!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9719e77-242c-44ac-8b3f-fef199a42d91_1521x870.png 848w, https://substackcdn.com/image/fetch/$s_!cl_l!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9719e77-242c-44ac-8b3f-fef199a42d91_1521x870.png 1272w, https://substackcdn.com/image/fetch/$s_!cl_l!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9719e77-242c-44ac-8b3f-fef199a42d91_1521x870.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em><strong>Figure 1.10 Shifted window attention in Swin Transformer.</strong><br>Regular window attention restricts interactions to fixed non overlapping windows, preventing neighboring patches across window boundaries from attending to each other. 
Shifted windows reassign patches to new windows using cyclic shifting, enabling cross window and long range interactions across consecutive transformer blocks while preserving linear attention complexity.</em></p><p>Instead of padding, Swin Transformer uses <strong>cyclic shifting</strong>.</p><p>Cyclic shifting works as follows:</p><p>&#8226; Patches that move out of the image on one side re-enter from the opposite side<br>&#8226; Only the grid positions of patches change, not their feature values<br>&#8226; Each shifted window still contains exactly M &#215; M patches</p><p>This mechanism is equivalent to rolling the feature map on a torus, where opposite edges are joined.</p><p>As a result:</p><p>&#8226; All windows remain uniform in size<br>&#8226; Attention computation remains efficient<br>&#8226; No padding overhead is introduced</p><p><strong>1.9.5 How Shifted Windows Enable Long Range Attention</strong></p><p>Consider two consecutive transformer blocks operating in sequence: a regular window attention block followed by a shifted window attention block <em>(as illustrated in Figure 1.10)</em>. In the first block, self attention is strictly confined to non overlapping local windows, which limits interactions to patches that share the same window. In the subsequent block, the window partitioning is shifted, causing patches to be regrouped into different windows. As a result, patches that were previously unable to attend to each other due to window boundaries are now placed within the same window and can directly interact. Across these two blocks, information propagates across neighboring windows, and with repeated stacking of such block pairs, progressively longer range dependencies are established. Consequently, Swin Transformer achieves effective global context modeling indirectly, without ever computing full global self attention.</p><p>Swin Transformer does not allow every patch to attend to every other patch in a single layer. 
Instead, it enables long range dependency through <strong>alternating locality patterns across layers</strong>, while maintaining linear computational complexity with respect to image size.</p><h1>1.10 Window-Based Self-Attention with Relative Position Bias and Masking</h1><p>In Swin Transformer, positional information and locality constraints are integrated directly into the self attention computation rather than being handled as a preprocessing step. This design choice is tightly coupled with window based attention and is essential for maintaining linear computational complexity while preserving spatial structure. Each transformer stage alternates between blocks that use regular window attention and blocks that use shifted window attention, and the attention formulation is adapted accordingly.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-FQg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6f549f1-7108-436a-be3c-132e020a297c_459x510.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-FQg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6f549f1-7108-436a-be3c-132e020a297c_459x510.png 424w, https://substackcdn.com/image/fetch/$s_!-FQg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6f549f1-7108-436a-be3c-132e020a297c_459x510.png 848w, https://substackcdn.com/image/fetch/$s_!-FQg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6f549f1-7108-436a-be3c-132e020a297c_459x510.png 1272w, 
https://substackcdn.com/image/fetch/$s_!-FQg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6f549f1-7108-436a-be3c-132e020a297c_459x510.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-FQg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6f549f1-7108-436a-be3c-132e020a297c_459x510.png" width="459" height="510" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a6f549f1-7108-436a-be3c-132e020a297c_459x510.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:510,&quot;width&quot;:459,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:20987,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/183324523?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6f549f1-7108-436a-be3c-132e020a297c_459x510.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-FQg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6f549f1-7108-436a-be3c-132e020a297c_459x510.png 424w, https://substackcdn.com/image/fetch/$s_!-FQg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6f549f1-7108-436a-be3c-132e020a297c_459x510.png 848w, https://substackcdn.com/image/fetch/$s_!-FQg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6f549f1-7108-436a-be3c-132e020a297c_459x510.png 
1272w, https://substackcdn.com/image/fetch/$s_!-FQg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa6f549f1-7108-436a-be3c-132e020a297c_459x510.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>We first consider the case of <strong>regular window attention</strong>. In this setting, the feature map is partitioned into non overlapping windows of fixed size, and self attention is computed independently within each window. 
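</p><p>The partitioning, and the half window cyclic shift used by the alternate block, can be sketched on a small grid of patch ids. This is a minimal illustration in plain Python, assuming a hypothetical 4 &#215; 4 patch grid with <em>M</em> = 2. The helper names are made up for this sketch, and real implementations operate on feature tensors rather than lists of ids.</p>

```python
def make_grid(rows: int, cols: int) -> list[list[int]]:
    """Label each patch with a unique id, row by row."""
    return [[r * cols + c for c in range(cols)] for r in range(rows)]


def cyclic_shift(grid, shift):
    """Roll the grid up and left by `shift` patches, wrapping around the edges."""
    rows, cols = len(grid), len(grid[0])
    return [[grid[(r + shift) % rows][(c + shift) % cols] for c in range(cols)]
            for r in range(rows)]


def window_partition(grid, window):
    """Split the grid into non-overlapping window x window groups of patch ids."""
    rows, cols = len(grid), len(grid[0])
    windows = []
    for r0 in range(0, rows, window):
        for c0 in range(0, cols, window):
            windows.append([grid[r][c]
                            for r in range(r0, r0 + window)
                            for c in range(c0, c0 + window)])
    return windows


grid = make_grid(4, 4)                      # hypothetical 4 x 4 grid of patches
regular = window_partition(grid, window=2)  # regular block, M = 2
shifted = window_partition(cyclic_shift(grid, shift=1), window=2)  # shift = M // 2

print(regular[0], regular[1])  # [0, 1, 4, 5] [2, 3, 6, 7]
print(shifted[0], shifted[1])  # [5, 6, 9, 10] [7, 4, 11, 8]
```

<p>In this toy grid, patches 5 and 6 sit in different regular windows but share the first shifted window. The second shifted window groups patches 7 and 4, which lie at opposite edges of the original grid and were brought together only by the wrap around.</p><p>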
Let <em>q_i</em> and <em>k_j</em> denote the query and key vectors corresponding to patches <em>i</em> and <em>j</em> within the same window, and let <em>d</em> be the head dimension. The attention weight <em>&#945;_ij</em> is computed as</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\alpha_{ij} = \\text{softmax}\\left( \\frac{q_i k_j^\\top}{\\sqrt{d}} + b_{ij} \\right)&quot;,&quot;id&quot;:&quot;GYSSWEPUWP&quot;}" data-component-name="LatexBlockToDOM"></div><p>where the softmax normalizes over all keys <em>j</em> in the window and <em>b_ij</em> is a learnable <strong>relative position bias</strong> term. This bias depends only on the relative spatial offset between patches <em>i</em> and <em>j</em> within the window, not on their absolute positions in the image. Unlike Vision Transformers, which add positional embeddings to token representations before entering the transformer blocks, Swin Transformer injects positional information directly into the attention logits. This allows the model to encode spatial relationships locally and naturally aligns with the window based formulation. Since all query key pairs belong to the same window in this block, no additional constraints are required, and attention is computed over all patch pairs within the window.</p><p>The situation changes in the <strong>shifted window attention</strong> block. Here, the window partition is shifted by half the window size along both height and width to enable information flow across neighboring windows. Because the shift is implemented cyclically, a single window in the shifted tensor may contain patches that wrapped around from opposite edges of the feature map and are not spatially adjacent. If attention were computed using the same formulation as above, these patches, which should not interact, would incorrectly attend to each other. 
To prevent this, Swin Transformer introduces an additive <strong>attention mask</strong> term.</p><p>In the shifted window block, the attention weight is computed as</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\alpha_{ij} = \\text{softmax} \\left( \\frac{q_i k_j^\\top}{\\sqrt{d}} + b_{ij} + \\text{mask}_{ij} \\right).&quot;,&quot;id&quot;:&quot;XSOMTRBXKF&quot;}" data-component-name="LatexBlockToDOM"></div><p>The mask term <em>mask_ij</em> restricts attention to patches that are genuinely adjacent after the cyclic shift. If patches <em>i</em> and <em>j</em> belong to the same contiguous region of the feature map within a shifted window, the mask value is zero,</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{mask}_{ij} = 0,&quot;,&quot;id&quot;:&quot;ALQDVXEZSQ&quot;}" data-component-name="LatexBlockToDOM"></div><p>and attention between them is allowed. If they were brought into the same window only by the cyclic wrap around, the mask value is set to negative infinity,</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{mask}_{ij} = -\\infty.&quot;,&quot;id&quot;:&quot;YUOOSBAUHN&quot;}" data-component-name="LatexBlockToDOM"></div><p>Because the mask is added before the softmax operation, any attention score associated with <em>&#8722;&#8734;</em> is driven to zero after normalization. This guarantees that patches from non adjacent regions do not attend to each other, even though they are present within the same shifted window tensor. Importantly, this masking operation is additive and should not be confused with multiplicative masking or dropout. It preserves the structure of window based attention under shifting.</p><p>Together, relative position bias and attention masking form the core of Swin Transformer&#8217;s window based self attention. 
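</p><p>How the two additive terms enter the logits can be sketched for a single query in plain Python. All numbers below are made up for illustration, and the function name is an assumption of this sketch rather than part of any library.</p>

```python
import math

NEG_INF = float("-inf")


def attention_weights(scores, bias, mask):
    """Softmax over one query's logits: score + relative position bias + mask."""
    logits = [s + b + m for s, b, m in zip(scores, bias, mask)]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical values for one query attending to four keys in its window.
scores = [1.0, 0.5, 2.0, 0.0]        # q_i . k_j / sqrt(d), made up
bias = [0.2, 0.0, -0.1, 0.3]         # b_ij, made up
mask = [0.0, 0.0, NEG_INF, NEG_INF]  # the last two keys are masked out

weights = attention_weights(scores, bias, mask)
print(weights)
```

<p>The two masked keys receive a weight of exactly zero, and the remaining weights still sum to one, which is what distinguishes additive masking before the softmax from zeroing weights after it.</p><p>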
Relative position bias provides fine grained spatial awareness within each window, while masking ensures that attention remains well defined and localized in the shifted window configuration. By alternating regular and shifted window attention blocks, Swin Transformer enables information to propagate across windows over depth, gradually building long range dependencies without ever computing full global self attention. This formulation is a key reason why Swin Transformer achieves strong performance while maintaining linear scaling with respect to image size.</p><h1>1.11 Relative Position Bias Parameterization in Swin Transformer</h1><p>An important detail in Swin Transformer is how relative position bias is parameterized and shared across windows. Unlike absolute positional embeddings, which require a unique embedding for every spatial location, relative position bias depends only on the relative offset between two patches inside a window. This makes the formulation independent of absolute image size and naturally compatible with window-based attention.</p><p>For a window of size <em>M&#215;M</em>, the relative displacement between two patches along height and width can take values in the range</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\Delta h, \\Delta w \\in \\{ -(M-1), \\dots, 0, \\dots, (M-1) \\}&quot;,&quot;id&quot;:&quot;GHQTMALVSM&quot;}" data-component-name="LatexBlockToDOM"></div><p>This results in a total of</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\left( 2M - 1 \\right)^2&quot;,&quot;id&quot;:&quot;PAGSYEPLCV&quot;}" data-component-name="LatexBlockToDOM"></div><p>unique relative position offsets. Swin Transformer maintains a learnable bias table of this size for each attention head. 
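</p><p>One way to build the lookup from patch pairs to table entries is sketched below in plain Python. The function name and the flattening order (height offset first, then width offset) are assumptions chosen for illustration, and <em>M</em> = 2 is a hypothetical small window.</p>

```python
def relative_position_index(window: int) -> list[list[int]]:
    """For each (query, key) pair in an M x M window, the bias table entry to read."""
    coords = [(r, c) for r in range(window) for c in range(window)]
    span = 2 * window - 1  # number of distinct offsets along one axis
    index = []
    for qr, qc in coords:
        row = []
        for kr, kc in coords:
            dh = qr - kr + window - 1  # move offsets from [-(M-1), M-1] to [0, 2M-2]
            dw = qc - kc + window - 1
            row.append(dh * span + dw)
        index.append(row)
    return index


M = 2  # hypothetical small window for illustration
index = relative_position_index(M)
table_size = (2 * M - 1) ** 2

print(table_size)
for row in index:
    print(row)
```

<p>For <em>M</em> = 2 there are 4 patches and 16 query key pairs, but only 9 distinct table entries. Every pair with the same relative offset, such as each patch paired with itself, reads the same entry.</p><p>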
During attention computation, each query&#8211;key pair <em>(i, j)</em> indexes into this table based on the relative spatial displacement between the two patches, and the corresponding scalar bias is added to the attention logit.</p><p>This design has several advantages. First, it significantly reduces the number of learnable positional parameters compared to absolute embeddings. Second, it allows the same bias table to be reused across all windows within a layer, enforcing translation equivariance. Finally, because the bias depends only on relative offsets, it generalizes naturally to different image sizes at inference time, provided the window size is unchanged.</p><h1>1.12 Absence of Class Token in Swin Transformer</h1><p>Another notable departure from the original Vision Transformer design is the absence of a dedicated class token in Swin Transformer. In Vision Transformers, a special learnable token is prepended to the patch sequence and used as the global image representation for classification tasks. This approach assumes global self attention, where the class token can attend to all patches in a single layer.</p><p>In Swin Transformer, attention is localized within windows, and no single token has access to all patches in one layer. As a result, a class token would not be able to aggregate global information efficiently. Instead, Swin Transformer relies on hierarchical feature aggregation to build global representations.</p><p>For image classification, the final stage of the network produces a low-resolution feature map with rich semantic content. Global average pooling is applied over the spatial dimensions to aggregate information across all remaining tokens. The pooled representation is then passed to a classification head. 
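</p><p>The pooling step itself is simple. A minimal sketch in plain Python, assuming a hypothetical final stage output of 4 tokens with 3 channels each (the values and the function name are made up for illustration):</p>

```python
def global_average_pool(tokens):
    """Average a list of token feature vectors channel by channel."""
    count = len(tokens)
    channels = len(tokens[0])
    return [sum(tok[c] for tok in tokens) / count for c in range(channels)]


# Hypothetical final-stage output: 4 tokens with 3 channels each.
tokens = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0], [5.0, 4.0, 2.0], [7.0, 4.0, 2.0]]
pooled = global_average_pool(tokens)  # a single vector for the classification head
print(pooled)  # [4.0, 2.0, 2.0]
```

<p>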
This approach aligns closely with convolutional architectures and avoids introducing a special token that does not naturally fit the window-based attention paradigm.</p><h1>1.13 Output Heads and Task Generalization</h1><p>One of the strengths of Swin Transformer is its flexibility as a general-purpose vision backbone. The hierarchical feature maps produced at different stages can be directly reused for a wide range of downstream tasks.</p><p>For image classification, only the final stage output is used, followed by global pooling and a linear classifier. For object detection and instance segmentation, intermediate feature maps from multiple stages are extracted and fed into feature pyramid networks or task-specific heads. For semantic segmentation, high-resolution features from early stages and semantically rich features from deeper stages are combined to produce dense predictions.</p><p>This multi-scale output capability is a direct consequence of the hierarchical design introduced by patch merging. Unlike flat Vision Transformers, Swin Transformer naturally exposes feature representations at multiple resolutions, making it particularly well suited for dense prediction tasks.</p><h1>1.14 Comparison with Convolutional Backbones</h1><p>Although Swin Transformer is built entirely using transformer blocks, its architectural philosophy closely mirrors that of convolutional neural networks. Locality is enforced through window-based attention, hierarchical feature representations are constructed via patch merging, and translation equivariance is preserved through relative position bias and weight sharing across windows.</p><p>At the same time, Swin Transformer retains the key advantages of transformers, including dynamic content-dependent receptive fields and the ability to model long-range dependencies through depth. 
The shifted window mechanism effectively replaces the role of increasing convolutional kernel sizes or dilated convolutions, enabling cross-region interaction without sacrificing efficiency.</p><p>This hybridization of transformer flexibility with convolutional inductive biases is a central reason for Swin Transformer&#8217;s strong empirical performance across diverse vision benchmarks.</p><h1>1.15 Concluding Remarks on Swin Transformer</h1><p>Swin Transformer represents a significant step forward in adapting transformer architectures to the unique demands of visual data. By replacing global self attention with window-based attention and introducing shifted windows for cross-region communication, it resolves the fundamental scalability limitations of Vision Transformers while preserving their expressive capacity.</p><p>The introduction of hierarchical feature representations through patch merging aligns the model with long-standing principles of visual processing, enabling seamless integration into detection, segmentation, and recognition pipelines. Relative position bias provides an elegant and efficient mechanism for encoding spatial relationships, while attention masking ensures correctness under window shifting.</p><p>Rather than computing global context in a single layer, Swin Transformer builds global understanding gradually through depth, leveraging alternating locality patterns across layers. 
This design demonstrates that effective global reasoning does not require global computation at every step.</p><p>As a result, Swin Transformer serves not only as a powerful architecture in its own right, but also as a blueprint for future vision transformers that balance efficiency, scalability, and representational strength.</p><h1>Watch the full lecture video here</h1><div id="youtube2-Rt4oRES1QLM" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;Rt4oRES1QLM&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/Rt4oRES1QLM?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>If you would like to deepen your understanding of Swin Transformer and see these ideas explained visually and intuitively, you can refer to the accompanying video linked above. If you wish to get access to our code files, handwritten notes, all lecture videos, Discord channel, and other PDF handbooks that we have compiled, along with a code certificate at the end of the program, you can consider being part of the pro version of the &#8220;Transformers for Vision Bootcamp&#8221;. 
You will find the details here:</p><p><a href="https://vision-transformer.vizuara.ai/">https://vision-transformer.vizuara.ai/</a></p><h1><strong>Other resources</strong></h1><p>If you like this content, please check out our research bootcamps on the following topics:</p><p><strong>CV</strong>: <a href="https://cvresearchbootcamp.vizuara.ai/">https://cvresearchbootcamp.vizuara.ai/</a></p><p><strong>GenAI</strong>: <a href="https://flyvidesh.online/gen-ai-professional-bootcamp">https://flyvidesh.online/gen-ai-professional-bootcamp</a></p><p><strong>RL</strong>: <a href="https://rlresearcherbootcamp.vizuara.ai/">https://rlresearcherbootcamp.vizuara.ai/</a></p><p><strong>SciML</strong>: <a href="https://flyvidesh.online/ml-bootcamp">https://flyvidesh.online/ml-bootcamp</a></p><p><strong>ML-DL</strong>: <a href="https://flyvidesh.online/ml-dl-bootcamp">https://flyvidesh.online/ml-dl-bootcamp</a></p><h1>Connect with us </h1><p><strong>Dr. Sreedath Panat</strong></p><p><strong>LinkedIn</strong> : <a href="https://www.linkedin.com/in/sreedath-panat/">https://www.linkedin.com/in/sreedath-panat/</a></p><p><strong>Twitter/X</strong> : <a href="https://x.com/sreedathpanat">https://x.com/sreedathpanat</a></p><p></p><p><strong>Mayank Pratap Singh</strong></p><p><strong>LinkedIn</strong> : <a href="https://www.linkedin.com/in/mayankpratapsingh022/">www.linkedin.com/in/mayankpratapsingh022</a></p><p><strong>Twitter/X</strong> : <a href="https://x.com/Mayank_022">x.com/Mayank_022</a></p><p></p>]]></content:encoded></item><item><title><![CDATA[How does ACT (Action Chunking with Transformers) actually work?]]></title><description><![CDATA[Understanding the ACT architecture from scratch]]></description><link>https://www.vizuaranewsletter.com/p/how-does-act-action-chunking-with</link><guid isPermaLink="false">https://www.vizuaranewsletter.com/p/how-does-act-action-chunking-with</guid><dc:creator><![CDATA[Dr Rajat Dandekar]]></dc:creator><pubDate>Sat, 10 Jan 2026 09:53:03 
GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/2e3d2fdf-a7b5-46c6-b4d6-fdc78ecb3c80_1586x646.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The ACT architecture was first described in a paper which came out in 2023:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!szkr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1147de1-1dd0-4e9d-8aac-897518d6671e_1446x898.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!szkr!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1147de1-1dd0-4e9d-8aac-897518d6671e_1446x898.png 424w, https://substackcdn.com/image/fetch/$s_!szkr!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1147de1-1dd0-4e9d-8aac-897518d6671e_1446x898.png 848w, https://substackcdn.com/image/fetch/$s_!szkr!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1147de1-1dd0-4e9d-8aac-897518d6671e_1446x898.png 1272w, https://substackcdn.com/image/fetch/$s_!szkr!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1147de1-1dd0-4e9d-8aac-897518d6671e_1446x898.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!szkr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1147de1-1dd0-4e9d-8aac-897518d6671e_1446x898.png" width="1446" height="898" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d1147de1-1dd0-4e9d-8aac-897518d6671e_1446x898.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:898,&quot;width&quot;:1446,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:943943,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183635076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1147de1-1dd0-4e9d-8aac-897518d6671e_1446x898.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!szkr!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1147de1-1dd0-4e9d-8aac-897518d6671e_1446x898.png 424w, https://substackcdn.com/image/fetch/$s_!szkr!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1147de1-1dd0-4e9d-8aac-897518d6671e_1446x898.png 848w, https://substackcdn.com/image/fetch/$s_!szkr!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1147de1-1dd0-4e9d-8aac-897518d6671e_1446x898.png 1272w, https://substackcdn.com/image/fetch/$s_!szkr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1147de1-1dd0-4e9d-8aac-897518d6671e_1446x898.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3>Why did it gain attention?</h3><p>This paper showed, for the first time, that AI policies can be implemented on low-cost hardware to achieve complex tasks. 
</p><p>Some tasks which were completed using the ACT policy included:</p><p>(1) Opening a ZipLoc bag:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!F761!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45d4c703-4337-4305-a412-a5f0b2cfd08f_1570x280.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!F761!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45d4c703-4337-4305-a412-a5f0b2cfd08f_1570x280.png 424w, https://substackcdn.com/image/fetch/$s_!F761!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45d4c703-4337-4305-a412-a5f0b2cfd08f_1570x280.png 848w, https://substackcdn.com/image/fetch/$s_!F761!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45d4c703-4337-4305-a412-a5f0b2cfd08f_1570x280.png 1272w, https://substackcdn.com/image/fetch/$s_!F761!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45d4c703-4337-4305-a412-a5f0b2cfd08f_1570x280.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!F761!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45d4c703-4337-4305-a412-a5f0b2cfd08f_1570x280.png" width="1456" height="260" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/45d4c703-4337-4305-a412-a5f0b2cfd08f_1570x280.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:260,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:627448,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183635076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45d4c703-4337-4305-a412-a5f0b2cfd08f_1570x280.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!F761!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45d4c703-4337-4305-a412-a5f0b2cfd08f_1570x280.png 424w, https://substackcdn.com/image/fetch/$s_!F761!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45d4c703-4337-4305-a412-a5f0b2cfd08f_1570x280.png 848w, https://substackcdn.com/image/fetch/$s_!F761!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45d4c703-4337-4305-a412-a5f0b2cfd08f_1570x280.png 1272w, https://substackcdn.com/image/fetch/$s_!F761!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45d4c703-4337-4305-a412-a5f0b2cfd08f_1570x280.png 1456w" sizes="100vw"></picture><div></div></div></a></figure></div><p>(2) Slot Battery:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!xTTJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfb78f80-7276-4489-8916-ceab149d21c0_1552x274.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xTTJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfb78f80-7276-4489-8916-ceab149d21c0_1552x274.png 424w, https://substackcdn.com/image/fetch/$s_!xTTJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfb78f80-7276-4489-8916-ceab149d21c0_1552x274.png 848w, https://substackcdn.com/image/fetch/$s_!xTTJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfb78f80-7276-4489-8916-ceab149d21c0_1552x274.png 1272w, https://substackcdn.com/image/fetch/$s_!xTTJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfb78f80-7276-4489-8916-ceab149d21c0_1552x274.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xTTJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfb78f80-7276-4489-8916-ceab149d21c0_1552x274.png" width="1456" height="257" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dfb78f80-7276-4489-8916-ceab149d21c0_1552x274.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:257,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:540869,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183635076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfb78f80-7276-4489-8916-ceab149d21c0_1552x274.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!xTTJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfb78f80-7276-4489-8916-ceab149d21c0_1552x274.png 424w, https://substackcdn.com/image/fetch/$s_!xTTJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfb78f80-7276-4489-8916-ceab149d21c0_1552x274.png 848w, https://substackcdn.com/image/fetch/$s_!xTTJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfb78f80-7276-4489-8916-ceab149d21c0_1552x274.png 1272w, https://substackcdn.com/image/fetch/$s_!xTTJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfb78f80-7276-4489-8916-ceab149d21c0_1552x274.png 1456w" sizes="100vw"></picture><div></div></div></a></figure></div><p>(3) Open Cup:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!B57_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F663ab90e-7f5f-4965-ae64-7cc639e4d619_1538x258.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!B57_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F663ab90e-7f5f-4965-ae64-7cc639e4d619_1538x258.png 424w, https://substackcdn.com/image/fetch/$s_!B57_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F663ab90e-7f5f-4965-ae64-7cc639e4d619_1538x258.png 848w, https://substackcdn.com/image/fetch/$s_!B57_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F663ab90e-7f5f-4965-ae64-7cc639e4d619_1538x258.png 1272w, https://substackcdn.com/image/fetch/$s_!B57_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F663ab90e-7f5f-4965-ae64-7cc639e4d619_1538x258.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!B57_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F663ab90e-7f5f-4965-ae64-7cc639e4d619_1538x258.png" width="1456" height="244" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/663ab90e-7f5f-4965-ae64-7cc639e4d619_1538x258.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:244,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:585218,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183635076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F663ab90e-7f5f-4965-ae64-7cc639e4d619_1538x258.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!B57_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F663ab90e-7f5f-4965-ae64-7cc639e4d619_1538x258.png 424w, https://substackcdn.com/image/fetch/$s_!B57_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F663ab90e-7f5f-4965-ae64-7cc639e4d619_1538x258.png 848w, https://substackcdn.com/image/fetch/$s_!B57_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F663ab90e-7f5f-4965-ae64-7cc639e4d619_1538x258.png 1272w, https://substackcdn.com/image/fetch/$s_!B57_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F663ab90e-7f5f-4965-ae64-7cc639e4d619_1538x258.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>All of these are tasks which require fine manipulation.</p><p>Have a look at the gains the policy achieved compared to other policies! 
Naturally, it caught the attention of the robotics community.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1kJs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd0612d2-b156-4571-a316-cf77b01003d0_2174x622.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1kJs!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd0612d2-b156-4571-a316-cf77b01003d0_2174x622.png 424w, https://substackcdn.com/image/fetch/$s_!1kJs!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd0612d2-b156-4571-a316-cf77b01003d0_2174x622.png 848w, https://substackcdn.com/image/fetch/$s_!1kJs!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd0612d2-b156-4571-a316-cf77b01003d0_2174x622.png 1272w, https://substackcdn.com/image/fetch/$s_!1kJs!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd0612d2-b156-4571-a316-cf77b01003d0_2174x622.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1kJs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd0612d2-b156-4571-a316-cf77b01003d0_2174x622.png" width="1456" height="417" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dd0612d2-b156-4571-a316-cf77b01003d0_2174x622.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:417,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:161285,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183635076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd0612d2-b156-4571-a316-cf77b01003d0_2174x622.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!1kJs!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd0612d2-b156-4571-a316-cf77b01003d0_2174x622.png 424w, https://substackcdn.com/image/fetch/$s_!1kJs!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd0612d2-b156-4571-a316-cf77b01003d0_2174x622.png 848w, https://substackcdn.com/image/fetch/$s_!1kJs!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd0612d2-b156-4571-a316-cf77b01003d0_2174x622.png 1272w, https://substackcdn.com/image/fetch/$s_!1kJs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd0612d2-b156-4571-a316-cf77b01003d0_2174x622.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>There are innovations on both the hardware and the software side. 
An example of low-cost hardware on which ACT can be successfully implemented is the SO-101 robot (a practical on this is coming soon!).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!f9q0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30c010e7-de39-4ff6-becc-c897c7d631c1_900x506.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!f9q0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30c010e7-de39-4ff6-becc-c897c7d631c1_900x506.jpeg 424w, https://substackcdn.com/image/fetch/$s_!f9q0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30c010e7-de39-4ff6-becc-c897c7d631c1_900x506.jpeg 848w, https://substackcdn.com/image/fetch/$s_!f9q0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30c010e7-de39-4ff6-becc-c897c7d631c1_900x506.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!f9q0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30c010e7-de39-4ff6-becc-c897c7d631c1_900x506.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!f9q0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30c010e7-de39-4ff6-becc-c897c7d631c1_900x506.jpeg" width="900" height="506" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/30c010e7-de39-4ff6-becc-c897c7d631c1_900x506.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:506,&quot;width&quot;:900,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Hugging Face Launches the SO-101, an Upgraded Low-Cost 3D-Printable  Autonomous Robot Arm - Hackster.io&quot;,&quot;title&quot;:&quot;Hugging Face Launches the SO-101, an Upgraded Low-Cost 3D-Printable  Autonomous Robot Arm - Hackster.io&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Hugging Face Launches the SO-101, an Upgraded Low-Cost 3D-Printable  Autonomous Robot Arm - Hackster.io" title="Hugging Face Launches the SO-101, an Upgraded Low-Cost 3D-Printable  Autonomous Robot Arm - Hackster.io" srcset="https://substackcdn.com/image/fetch/$s_!f9q0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30c010e7-de39-4ff6-becc-c897c7d631c1_900x506.jpeg 424w, https://substackcdn.com/image/fetch/$s_!f9q0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30c010e7-de39-4ff6-becc-c897c7d631c1_900x506.jpeg 848w, https://substackcdn.com/image/fetch/$s_!f9q0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30c010e7-de39-4ff6-becc-c897c7d631c1_900x506.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!f9q0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30c010e7-de39-4ff6-becc-c897c7d631c1_900x506.jpeg 1456w" sizes="100vw" 
loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In this article, we are going to focus on the software side of the innovation.</p><p>Once you read the paper, you realize that there are three things on the software side which make this policy unique:</p><p>(1) <strong>It uses a Conditional Variational AutoEncoder</strong></p><p>(2) <strong>It uses a DETR-inspired Transformer Decoder</strong></p><p>(3) <strong>It uses Action Chunking</strong></p><p>Don&#8217;t worry if these terms sound a bit complex; we are going to go over everything in detail!</p><div class="pullquote"><p>We will cover the architecture of the ACT policy in a series of Architecture 
Versions, where we will think from first principles and understand how the architecture evolves.</p></div><p>First, let us understand what we want to do:</p><blockquote><p>We want to predict the distribution of the variation of the joint angles, given the state of the robot.</p></blockquote><p><strong>Architecture Version 0:</strong></p><p>So, our first thought is that we need the following:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!A7AS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd54842fa-00dc-48e0-8585-b65774e464c6_2426x952.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!A7AS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd54842fa-00dc-48e0-8585-b65774e464c6_2426x952.png 424w, https://substackcdn.com/image/fetch/$s_!A7AS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd54842fa-00dc-48e0-8585-b65774e464c6_2426x952.png 848w, https://substackcdn.com/image/fetch/$s_!A7AS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd54842fa-00dc-48e0-8585-b65774e464c6_2426x952.png 1272w, https://substackcdn.com/image/fetch/$s_!A7AS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd54842fa-00dc-48e0-8585-b65774e464c6_2426x952.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!A7AS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd54842fa-00dc-48e0-8585-b65774e464c6_2426x952.png" width="1456" 
height="571" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d54842fa-00dc-48e0-8585-b65774e464c6_2426x952.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:571,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:396599,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183635076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd54842fa-00dc-48e0-8585-b65774e464c6_2426x952.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!A7AS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd54842fa-00dc-48e0-8585-b65774e464c6_2426x952.png 424w, https://substackcdn.com/image/fetch/$s_!A7AS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd54842fa-00dc-48e0-8585-b65774e464c6_2426x952.png 848w, https://substackcdn.com/image/fetch/$s_!A7AS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd54842fa-00dc-48e0-8585-b65774e464c6_2426x952.png 1272w, https://substackcdn.com/image/fetch/$s_!A7AS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd54842fa-00dc-48e0-8585-b65774e464c6_2426x952.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg 
role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>To understand the above diagram properly, have a look at our article on Variational AutoEncoders: https://vizuara.substack.com/p/variational-autoencoders-explained</em></p><p>This is a great start, but we can immediately see one major drawback of this approach.</p><p>Even if we manage to train this Variational AutoEncoder, sampling from it gives us joint configurations drawn from the action space with no regard to the state the robot is currently in. So, the robot will move randomly.</p><p>We do not want that. 
We want the robot to move in a specific way for a specific robot configuration.</p><p><strong>Architecture Version 1:</strong></p><p>We want to model the distribution of the joints given the current state of the robot.</p><p>This is how we will modify our architecture:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zz4r!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b945092-ae73-436a-ade4-f18cf9a4c4d7_2570x1411.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zz4r!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b945092-ae73-436a-ade4-f18cf9a4c4d7_2570x1411.png 424w, https://substackcdn.com/image/fetch/$s_!zz4r!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b945092-ae73-436a-ade4-f18cf9a4c4d7_2570x1411.png 848w, https://substackcdn.com/image/fetch/$s_!zz4r!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b945092-ae73-436a-ade4-f18cf9a4c4d7_2570x1411.png 1272w, https://substackcdn.com/image/fetch/$s_!zz4r!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b945092-ae73-436a-ade4-f18cf9a4c4d7_2570x1411.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zz4r!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b945092-ae73-436a-ade4-f18cf9a4c4d7_2570x1411.png" width="1456" height="799" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8b945092-ae73-436a-ade4-f18cf9a4c4d7_2570x1411.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:799,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:685225,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183635076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f67c2e1-379d-419b-82e5-1945f80675f2_2570x1614.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!zz4r!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b945092-ae73-436a-ade4-f18cf9a4c4d7_2570x1411.png 424w, https://substackcdn.com/image/fetch/$s_!zz4r!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b945092-ae73-436a-ade4-f18cf9a4c4d7_2570x1411.png 848w, https://substackcdn.com/image/fetch/$s_!zz4r!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b945092-ae73-436a-ade4-f18cf9a4c4d7_2570x1411.png 1272w, https://substackcdn.com/image/fetch/$s_!zz4r!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b945092-ae73-436a-ade4-f18cf9a4c4d7_2570x1411.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This looks like a decent setup. 
Just so that we can be clear, the following diagram visually represents the distribution that we are trying to predict.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-IgQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a207ffe-25ff-4925-b1c2-cc64b43aecf5_2828x1414.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-IgQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a207ffe-25ff-4925-b1c2-cc64b43aecf5_2828x1414.png 424w, https://substackcdn.com/image/fetch/$s_!-IgQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a207ffe-25ff-4925-b1c2-cc64b43aecf5_2828x1414.png 848w, https://substackcdn.com/image/fetch/$s_!-IgQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a207ffe-25ff-4925-b1c2-cc64b43aecf5_2828x1414.png 1272w, https://substackcdn.com/image/fetch/$s_!-IgQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a207ffe-25ff-4925-b1c2-cc64b43aecf5_2828x1414.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-IgQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a207ffe-25ff-4925-b1c2-cc64b43aecf5_2828x1414.png" width="1456" height="728" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3a207ffe-25ff-4925-b1c2-cc64b43aecf5_2828x1414.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:728,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:355077,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183635076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a207ffe-25ff-4925-b1c2-cc64b43aecf5_2828x1414.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!-IgQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a207ffe-25ff-4925-b1c2-cc64b43aecf5_2828x1414.png 424w, https://substackcdn.com/image/fetch/$s_!-IgQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a207ffe-25ff-4925-b1c2-cc64b43aecf5_2828x1414.png 848w, https://substackcdn.com/image/fetch/$s_!-IgQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a207ffe-25ff-4925-b1c2-cc64b43aecf5_2828x1414.png 1272w, https://substackcdn.com/image/fetch/$s_!-IgQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a207ffe-25ff-4925-b1c2-cc64b43aecf5_2828x1414.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>There is another crucial point which comes to mind at this stage.</p><p>Don&#8217;t you think that the distribution of the joint angles depends not only on the current state of the robot but also on the environment around it?</p><p>For example, if the task of the robot is to pick and place an object, then the distribution will change greatly depending on whether the robot is close to the object or not.</p><p><strong>Architecture Version 2:</strong></p><p>The environment around the robot is captured with the help of cameras. 
So we have images obtained from these cameras.</p><p>So the final architecture that we are thinking about looks as follows:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!I1fN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F256ac67c-00ab-4697-bfd5-1cbda33693fb_2424x1714.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!I1fN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F256ac67c-00ab-4697-bfd5-1cbda33693fb_2424x1714.png 424w, https://substackcdn.com/image/fetch/$s_!I1fN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F256ac67c-00ab-4697-bfd5-1cbda33693fb_2424x1714.png 848w, https://substackcdn.com/image/fetch/$s_!I1fN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F256ac67c-00ab-4697-bfd5-1cbda33693fb_2424x1714.png 1272w, https://substackcdn.com/image/fetch/$s_!I1fN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F256ac67c-00ab-4697-bfd5-1cbda33693fb_2424x1714.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!I1fN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F256ac67c-00ab-4697-bfd5-1cbda33693fb_2424x1714.png" width="1456" height="1030" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/256ac67c-00ab-4697-bfd5-1cbda33693fb_2424x1714.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1030,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:636168,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183635076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F256ac67c-00ab-4697-bfd5-1cbda33693fb_2424x1714.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!I1fN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F256ac67c-00ab-4697-bfd5-1cbda33693fb_2424x1714.png 424w, https://substackcdn.com/image/fetch/$s_!I1fN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F256ac67c-00ab-4697-bfd5-1cbda33693fb_2424x1714.png 848w, https://substackcdn.com/image/fetch/$s_!I1fN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F256ac67c-00ab-4697-bfd5-1cbda33693fb_2424x1714.png 1272w, https://substackcdn.com/image/fetch/$s_!I1fN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F256ac67c-00ab-4697-bfd5-1cbda33693fb_2424x1714.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg 
role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>We are trying to predict the variations in the joint angles, conditioned on the positions of the joints as well as the images collected from the camera feeds.</p><p>This architecture is called a <strong>Conditional Variational Autoencoder</strong>.</p><p>Now, let us understand the encoder and the decoder separately.</p><p>First, we start with the encoder.</p><p>Let us begin with the first encoder architecture that comes to mind:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!dzOj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feae130ae-2cba-43ca-8e95-b6f1473b5c2b_2706x1477.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dzOj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feae130ae-2cba-43ca-8e95-b6f1473b5c2b_2706x1477.png 424w, https://substackcdn.com/image/fetch/$s_!dzOj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feae130ae-2cba-43ca-8e95-b6f1473b5c2b_2706x1477.png 848w, https://substackcdn.com/image/fetch/$s_!dzOj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feae130ae-2cba-43ca-8e95-b6f1473b5c2b_2706x1477.png 1272w, https://substackcdn.com/image/fetch/$s_!dzOj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feae130ae-2cba-43ca-8e95-b6f1473b5c2b_2706x1477.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dzOj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feae130ae-2cba-43ca-8e95-b6f1473b5c2b_2706x1477.png" width="1456" height="795" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/eae130ae-2cba-43ca-8e95-b6f1473b5c2b_2706x1477.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:795,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2034250,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183635076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feae130ae-2cba-43ca-8e95-b6f1473b5c2b_2706x1477.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!dzOj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feae130ae-2cba-43ca-8e95-b6f1473b5c2b_2706x1477.png 424w, https://substackcdn.com/image/fetch/$s_!dzOj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feae130ae-2cba-43ca-8e95-b6f1473b5c2b_2706x1477.png 848w, https://substackcdn.com/image/fetch/$s_!dzOj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feae130ae-2cba-43ca-8e95-b6f1473b5c2b_2706x1477.png 1272w, https://substackcdn.com/image/fetch/$s_!dzOj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feae130ae-2cba-43ca-8e95-b6f1473b5c2b_2706x1477.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg 
role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Note that here we have conveniently ignored the input images from the camera feed.</p><p><em>This is a great start, but this is not the architecture which is used in the ACT paper.</em></p><p>The main reason is that the ACT paper does not consider a single action; instead, it considers a sequence of actions for a given state.</p><div class="pullquote"><p>The process of predicting a sequence of actions instead of a single one is called chunking.</p></div><blockquote><p>This is inspired by a neuroscience concept where individual actions are grouped together and executed as one unit, making them more efficient to store and execute.</p></blockquote><p>Intuitively, a chunk of actions could correspond to grasping a corner of a candy wrapper or inserting a battery into a slot.</p><div 
class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9o8o!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7a36472-0d8f-408b-8293-324cb7e4855e_1024x559.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9o8o!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7a36472-0d8f-408b-8293-324cb7e4855e_1024x559.png 424w, https://substackcdn.com/image/fetch/$s_!9o8o!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7a36472-0d8f-408b-8293-324cb7e4855e_1024x559.png 848w, https://substackcdn.com/image/fetch/$s_!9o8o!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7a36472-0d8f-408b-8293-324cb7e4855e_1024x559.png 1272w, https://substackcdn.com/image/fetch/$s_!9o8o!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7a36472-0d8f-408b-8293-324cb7e4855e_1024x559.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9o8o!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7a36472-0d8f-408b-8293-324cb7e4855e_1024x559.png" width="1024" height="559" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e7a36472-0d8f-408b-8293-324cb7e4855e_1024x559.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:559,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:956198,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183635076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7a36472-0d8f-408b-8293-324cb7e4855e_1024x559.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!9o8o!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7a36472-0d8f-408b-8293-324cb7e4855e_1024x559.png 424w, https://substackcdn.com/image/fetch/$s_!9o8o!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7a36472-0d8f-408b-8293-324cb7e4855e_1024x559.png 848w, https://substackcdn.com/image/fetch/$s_!9o8o!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7a36472-0d8f-408b-8293-324cb7e4855e_1024x559.png 1272w, https://substackcdn.com/image/fetch/$s_!9o8o!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7a36472-0d8f-408b-8293-324cb7e4855e_1024x559.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This means that the encoder pipeline looks something like this:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!uFIs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7429f43-d9ff-4bf4-aa44-443f0e236066_2728x1680.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!uFIs!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7429f43-d9ff-4bf4-aa44-443f0e236066_2728x1680.png 424w, 
https://substackcdn.com/image/fetch/$s_!uFIs!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7429f43-d9ff-4bf4-aa44-443f0e236066_2728x1680.png 848w, https://substackcdn.com/image/fetch/$s_!uFIs!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7429f43-d9ff-4bf4-aa44-443f0e236066_2728x1680.png 1272w, https://substackcdn.com/image/fetch/$s_!uFIs!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7429f43-d9ff-4bf4-aa44-443f0e236066_2728x1680.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!uFIs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7429f43-d9ff-4bf4-aa44-443f0e236066_2728x1680.png" width="1456" height="897" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b7429f43-d9ff-4bf4-aa44-443f0e236066_2728x1680.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:897,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:790464,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183635076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7429f43-d9ff-4bf4-aa44-443f0e236066_2728x1680.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" 
srcset="https://substackcdn.com/image/fetch/$s_!uFIs!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7429f43-d9ff-4bf4-aa44-443f0e236066_2728x1680.png 424w, https://substackcdn.com/image/fetch/$s_!uFIs!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7429f43-d9ff-4bf4-aa44-443f0e236066_2728x1680.png 848w, https://substackcdn.com/image/fetch/$s_!uFIs!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7429f43-d9ff-4bf4-aa44-443f0e236066_2728x1680.png 1272w, https://substackcdn.com/image/fetch/$s_!uFIs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7429f43-d9ff-4bf4-aa44-443f0e236066_2728x1680.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Architecture Version 3:</strong></p><p><strong>How an MLP sees it: </strong>An MLP sees the whole sequence as one giant, flat bag of numbers. It doesn&#8217;t inherently understand that Action 1 comes before Action 2. It has to learn these relationships from scratch using brute force (massive amounts of data and weights).</p><p>Hence, we don&#8217;t use a multi-layer perceptron. Instead, we use a transformer encoder.</p><p>So, broadly speaking, our encoder architecture looks as follows:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!G4Dm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb69567a-8517-4aea-92fd-0a4e25d9b27f_1999x719.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!G4Dm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb69567a-8517-4aea-92fd-0a4e25d9b27f_1999x719.png 424w, https://substackcdn.com/image/fetch/$s_!G4Dm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb69567a-8517-4aea-92fd-0a4e25d9b27f_1999x719.png 848w, https://substackcdn.com/image/fetch/$s_!G4Dm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb69567a-8517-4aea-92fd-0a4e25d9b27f_1999x719.png 1272w, 
https://substackcdn.com/image/fetch/$s_!G4Dm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb69567a-8517-4aea-92fd-0a4e25d9b27f_1999x719.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!G4Dm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb69567a-8517-4aea-92fd-0a4e25d9b27f_1999x719.png" width="1456" height="524" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/eb69567a-8517-4aea-92fd-0a4e25d9b27f_1999x719.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:524,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:984927,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183635076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1f956d9-0bd6-4311-a2f6-2647ad58c119_1999x1999.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!G4Dm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb69567a-8517-4aea-92fd-0a4e25d9b27f_1999x719.png 424w, https://substackcdn.com/image/fetch/$s_!G4Dm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb69567a-8517-4aea-92fd-0a4e25d9b27f_1999x719.png 848w, 
https://substackcdn.com/image/fetch/$s_!G4Dm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb69567a-8517-4aea-92fd-0a4e25d9b27f_1999x719.png 1272w, https://substackcdn.com/image/fetch/$s_!G4Dm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb69567a-8517-4aea-92fd-0a4e25d9b27f_1999x719.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The transformer encoder has the following architecture:</p><div class="captioned-image-container"><figure><a class="image-link image2 
is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eaAI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd9d3f84-5ee1-4598-b667-c4e6928dadc7_1986x2013.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!eaAI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd9d3f84-5ee1-4598-b667-c4e6928dadc7_1986x2013.png 424w, https://substackcdn.com/image/fetch/$s_!eaAI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd9d3f84-5ee1-4598-b667-c4e6928dadc7_1986x2013.png 848w, https://substackcdn.com/image/fetch/$s_!eaAI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd9d3f84-5ee1-4598-b667-c4e6928dadc7_1986x2013.png 1272w, https://substackcdn.com/image/fetch/$s_!eaAI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd9d3f84-5ee1-4598-b667-c4e6928dadc7_1986x2013.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!eaAI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd9d3f84-5ee1-4598-b667-c4e6928dadc7_1986x2013.png" width="426" height="431.85164835164835" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fd9d3f84-5ee1-4598-b667-c4e6928dadc7_1986x2013.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1476,&quot;width&quot;:1456,&quot;resizeWidth&quot;:426,&quot;bytes&quot;:378809,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183635076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd9d3f84-5ee1-4598-b667-c4e6928dadc7_1986x2013.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!eaAI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd9d3f84-5ee1-4598-b667-c4e6928dadc7_1986x2013.png 424w, https://substackcdn.com/image/fetch/$s_!eaAI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd9d3f84-5ee1-4598-b667-c4e6928dadc7_1986x2013.png 848w, https://substackcdn.com/image/fetch/$s_!eaAI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd9d3f84-5ee1-4598-b667-c4e6928dadc7_1986x2013.png 1272w, https://substackcdn.com/image/fetch/$s_!eaAI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd9d3f84-5ee1-4598-b667-c4e6928dadc7_1986x2013.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>First, the current state and the action chunk are tokenized. 
In this process, an embedding vector is created for the state vector and for each action vector.</p><p>These embedding vectors are then passed to the subsequent layers.</p><p>The most important layer is the attention layer, which captures the relationships between tokens by computing an attention score for each pair of tokens.</p><p>These attention scores are then used to update the values of the embedding vectors.</p><p>To generate the latent vector, all we need is the <strong>CLS token</strong> and the values it holds.</p><p>Let us look at an interactive visualization to understand all the steps in detail:</p><p>https://actencoder.vizuara.ai/</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TxDE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45939ebf-d9e7-41a1-8ed8-428e8b630472_1411x732.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TxDE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45939ebf-d9e7-41a1-8ed8-428e8b630472_1411x732.gif 424w, https://substackcdn.com/image/fetch/$s_!TxDE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45939ebf-d9e7-41a1-8ed8-428e8b630472_1411x732.gif 848w, https://substackcdn.com/image/fetch/$s_!TxDE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45939ebf-d9e7-41a1-8ed8-428e8b630472_1411x732.gif 1272w, 
https://substackcdn.com/image/fetch/$s_!TxDE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45939ebf-d9e7-41a1-8ed8-428e8b630472_1411x732.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TxDE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45939ebf-d9e7-41a1-8ed8-428e8b630472_1411x732.gif" width="1411" height="732" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/45939ebf-d9e7-41a1-8ed8-428e8b630472_1411x732.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:732,&quot;width&quot;:1411,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2538892,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183635076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45939ebf-d9e7-41a1-8ed8-428e8b630472_1411x732.gif&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!TxDE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45939ebf-d9e7-41a1-8ed8-428e8b630472_1411x732.gif 424w, https://substackcdn.com/image/fetch/$s_!TxDE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45939ebf-d9e7-41a1-8ed8-428e8b630472_1411x732.gif 848w, 
https://substackcdn.com/image/fetch/$s_!TxDE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45939ebf-d9e7-41a1-8ed8-428e8b630472_1411x732.gif 1272w, https://substackcdn.com/image/fetch/$s_!TxDE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45939ebf-d9e7-41a1-8ed8-428e8b630472_1411x732.gif 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Now that we have understood what the encoder looks like, let us see numerically what happens to the tokens as they pass through 
the encoder.</p><p><strong>ACT Decoder</strong></p><p>We want our ACT decoder to do something like this:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!uEGD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa15a6c42-60b0-4e3e-9c67-b483bf87fcc0_1758x1174.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!uEGD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa15a6c42-60b0-4e3e-9c67-b483bf87fcc0_1758x1174.png 424w, https://substackcdn.com/image/fetch/$s_!uEGD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa15a6c42-60b0-4e3e-9c67-b483bf87fcc0_1758x1174.png 848w, https://substackcdn.com/image/fetch/$s_!uEGD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa15a6c42-60b0-4e3e-9c67-b483bf87fcc0_1758x1174.png 1272w, https://substackcdn.com/image/fetch/$s_!uEGD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa15a6c42-60b0-4e3e-9c67-b483bf87fcc0_1758x1174.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!uEGD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa15a6c42-60b0-4e3e-9c67-b483bf87fcc0_1758x1174.png" width="1456" height="972" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a15a6c42-60b0-4e3e-9c67-b483bf87fcc0_1758x1174.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:972,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:164742,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183635076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa15a6c42-60b0-4e3e-9c67-b483bf87fcc0_1758x1174.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!uEGD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa15a6c42-60b0-4e3e-9c67-b483bf87fcc0_1758x1174.png 424w, https://substackcdn.com/image/fetch/$s_!uEGD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa15a6c42-60b0-4e3e-9c67-b483bf87fcc0_1758x1174.png 848w, https://substackcdn.com/image/fetch/$s_!uEGD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa15a6c42-60b0-4e3e-9c67-b483bf87fcc0_1758x1174.png 1272w, https://substackcdn.com/image/fetch/$s_!uEGD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa15a6c42-60b0-4e3e-9c67-b483bf87fcc0_1758x1174.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>You might notice a problem with the above architecture.</p><p><strong>Architecture Version 4:</strong></p><p>To predict the joint positions for the next time steps, we also need the current joint positions and the inputs from the camera feeds.</p><p>The modified architecture looks as follows:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!VmAn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F267d5635-025a-468c-8e05-a12ab939b2a9_2254x1324.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!VmAn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F267d5635-025a-468c-8e05-a12ab939b2a9_2254x1324.png 424w, https://substackcdn.com/image/fetch/$s_!VmAn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F267d5635-025a-468c-8e05-a12ab939b2a9_2254x1324.png 848w, https://substackcdn.com/image/fetch/$s_!VmAn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F267d5635-025a-468c-8e05-a12ab939b2a9_2254x1324.png 1272w, https://substackcdn.com/image/fetch/$s_!VmAn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F267d5635-025a-468c-8e05-a12ab939b2a9_2254x1324.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!VmAn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F267d5635-025a-468c-8e05-a12ab939b2a9_2254x1324.png" width="1456" height="855" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/267d5635-025a-468c-8e05-a12ab939b2a9_2254x1324.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:855,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:380516,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183635076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F267d5635-025a-468c-8e05-a12ab939b2a9_2254x1324.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!VmAn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F267d5635-025a-468c-8e05-a12ab939b2a9_2254x1324.png 424w, https://substackcdn.com/image/fetch/$s_!VmAn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F267d5635-025a-468c-8e05-a12ab939b2a9_2254x1324.png 848w, https://substackcdn.com/image/fetch/$s_!VmAn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F267d5635-025a-468c-8e05-a12ab939b2a9_2254x1324.png 1272w, https://substackcdn.com/image/fetch/$s_!VmAn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F267d5635-025a-468c-8e05-a12ab939b2a9_2254x1324.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" 
stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Let us imagine a scenario in which a robot arm is positioned to pour coffee, and the scene is captured by both an overhead camera and a wrist-mounted camera.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-BQL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99ec1822-b3cd-46d6-9744-fa903259c5ba_2000x2000.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-BQL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99ec1822-b3cd-46d6-9744-fa903259c5ba_2000x2000.png 424w, https://substackcdn.com/image/fetch/$s_!-BQL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99ec1822-b3cd-46d6-9744-fa903259c5ba_2000x2000.png 848w, https://substackcdn.com/image/fetch/$s_!-BQL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99ec1822-b3cd-46d6-9744-fa903259c5ba_2000x2000.png 1272w, https://substackcdn.com/image/fetch/$s_!-BQL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99ec1822-b3cd-46d6-9744-fa903259c5ba_2000x2000.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!-BQL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99ec1822-b3cd-46d6-9744-fa903259c5ba_2000x2000.png" width="298" height="298" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/99ec1822-b3cd-46d6-9744-fa903259c5ba_2000x2000.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1456,&quot;width&quot;:1456,&quot;resizeWidth&quot;:298,&quot;bytes&quot;:2913318,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183635076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99ec1822-b3cd-46d6-9744-fa903259c5ba_2000x2000.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!-BQL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99ec1822-b3cd-46d6-9744-fa903259c5ba_2000x2000.png 424w, https://substackcdn.com/image/fetch/$s_!-BQL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99ec1822-b3cd-46d6-9744-fa903259c5ba_2000x2000.png 848w, https://substackcdn.com/image/fetch/$s_!-BQL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99ec1822-b3cd-46d6-9744-fa903259c5ba_2000x2000.png 1272w, 
https://substackcdn.com/image/fetch/$s_!-BQL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99ec1822-b3cd-46d6-9744-fa903259c5ba_2000x2000.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The robot has two eyes:</p><p><strong>(1) Overhead Camera:</strong> Looks down at the table.</p><p><strong>(2) Wrist Camera:</strong> Mounted on the robot&#8217;s hand, looking closely at the gripper.</p><p>Now consider the following situation: the robot is about to pour, and its arm moves over the mug.</p><p>The <strong>Overhead Camera</strong> is now blocked by 
the robot&#8217;s own arm (it can&#8217;t see the mug anymore).</p><p>The <strong>Wrist Camera</strong> is the only one that can see the mug now.</p><p>To understand where the mug is <em>in 3D space</em>, the robot needs to combine the information from both cameras instantly.</p><p>This is not happening in the above architecture!</p><p><strong>Architecture Version 5:</strong></p><p>This fusing of information is exactly what an encoder does.</p><p>So now we can modify the architecture of the decoder as follows:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lA3p!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24ad6b23-437d-46d7-bacf-dd3db412b9ed_2664x990.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lA3p!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24ad6b23-437d-46d7-bacf-dd3db412b9ed_2664x990.png 424w, https://substackcdn.com/image/fetch/$s_!lA3p!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24ad6b23-437d-46d7-bacf-dd3db412b9ed_2664x990.png 848w, https://substackcdn.com/image/fetch/$s_!lA3p!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24ad6b23-437d-46d7-bacf-dd3db412b9ed_2664x990.png 1272w, https://substackcdn.com/image/fetch/$s_!lA3p!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24ad6b23-437d-46d7-bacf-dd3db412b9ed_2664x990.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!lA3p!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24ad6b23-437d-46d7-bacf-dd3db412b9ed_2664x990.png" width="1456" height="541" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/24ad6b23-437d-46d7-bacf-dd3db412b9ed_2664x990.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:541,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:299584,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183635076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24ad6b23-437d-46d7-bacf-dd3db412b9ed_2664x990.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!lA3p!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24ad6b23-437d-46d7-bacf-dd3db412b9ed_2664x990.png 424w, https://substackcdn.com/image/fetch/$s_!lA3p!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24ad6b23-437d-46d7-bacf-dd3db412b9ed_2664x990.png 848w, https://substackcdn.com/image/fetch/$s_!lA3p!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24ad6b23-437d-46d7-bacf-dd3db412b9ed_2664x990.png 1272w, https://substackcdn.com/image/fetch/$s_!lA3p!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24ad6b23-437d-46d7-bacf-dd3db412b9ed_2664x990.png 
1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>First, let us understand how the information that is passed as input to the encoder is prepared for fusing.</p><p>Let us start by focusing on the images.</p><p>Suppose the robot is pouring coffee from a mug. 
Here are the four images collected from the four cameras:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!b86m!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43b55060-92de-4e89-bcd0-d92e9e91165d_2706x1477.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!b86m!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43b55060-92de-4e89-bcd0-d92e9e91165d_2706x1477.png 424w, https://substackcdn.com/image/fetch/$s_!b86m!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43b55060-92de-4e89-bcd0-d92e9e91165d_2706x1477.png 848w, https://substackcdn.com/image/fetch/$s_!b86m!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43b55060-92de-4e89-bcd0-d92e9e91165d_2706x1477.png 1272w, https://substackcdn.com/image/fetch/$s_!b86m!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43b55060-92de-4e89-bcd0-d92e9e91165d_2706x1477.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!b86m!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43b55060-92de-4e89-bcd0-d92e9e91165d_2706x1477.png" width="1456" height="795" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/43b55060-92de-4e89-bcd0-d92e9e91165d_2706x1477.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:795,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:3599678,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183635076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43b55060-92de-4e89-bcd0-d92e9e91165d_2706x1477.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!b86m!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43b55060-92de-4e89-bcd0-d92e9e91165d_2706x1477.png 424w, https://substackcdn.com/image/fetch/$s_!b86m!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43b55060-92de-4e89-bcd0-d92e9e91165d_2706x1477.png 848w, https://substackcdn.com/image/fetch/$s_!b86m!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43b55060-92de-4e89-bcd0-d92e9e91165d_2706x1477.png 1272w, https://substackcdn.com/image/fetch/$s_!b86m!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F43b55060-92de-4e89-bcd0-d92e9e91165d_2706x1477.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg 
role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Each of these 4 images is then passed through a CNN, which reduces the spatial size of the image to 15x20 and increases its depth to 512 channels.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!R2Pu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60e3ebbb-cc6a-48c0-bd4b-2573369875e3_2706x1477.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!R2Pu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60e3ebbb-cc6a-48c0-bd4b-2573369875e3_2706x1477.png 424w, 
https://substackcdn.com/image/fetch/$s_!R2Pu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60e3ebbb-cc6a-48c0-bd4b-2573369875e3_2706x1477.png 848w, https://substackcdn.com/image/fetch/$s_!R2Pu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60e3ebbb-cc6a-48c0-bd4b-2573369875e3_2706x1477.png 1272w, https://substackcdn.com/image/fetch/$s_!R2Pu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60e3ebbb-cc6a-48c0-bd4b-2573369875e3_2706x1477.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!R2Pu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60e3ebbb-cc6a-48c0-bd4b-2573369875e3_2706x1477.png" width="1456" height="795" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/60e3ebbb-cc6a-48c0-bd4b-2573369875e3_2706x1477.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:795,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:3062874,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183635076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60e3ebbb-cc6a-48c0-bd4b-2573369875e3_2706x1477.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" 
srcset="https://substackcdn.com/image/fetch/$s_!R2Pu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60e3ebbb-cc6a-48c0-bd4b-2573369875e3_2706x1477.png 424w, https://substackcdn.com/image/fetch/$s_!R2Pu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60e3ebbb-cc6a-48c0-bd4b-2573369875e3_2706x1477.png 848w, https://substackcdn.com/image/fetch/$s_!R2Pu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60e3ebbb-cc6a-48c0-bd4b-2573369875e3_2706x1477.png 1272w, https://substackcdn.com/image/fetch/$s_!R2Pu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60e3ebbb-cc6a-48c0-bd4b-2573369875e3_2706x1477.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>An example feature map can look as follows:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!AMx7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc2eaa49-3caa-4ea0-b086-c0f06e86b873_2706x1477.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!AMx7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc2eaa49-3caa-4ea0-b086-c0f06e86b873_2706x1477.png 424w, https://substackcdn.com/image/fetch/$s_!AMx7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc2eaa49-3caa-4ea0-b086-c0f06e86b873_2706x1477.png 848w, https://substackcdn.com/image/fetch/$s_!AMx7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc2eaa49-3caa-4ea0-b086-c0f06e86b873_2706x1477.png 1272w, https://substackcdn.com/image/fetch/$s_!AMx7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc2eaa49-3caa-4ea0-b086-c0f06e86b873_2706x1477.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!AMx7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc2eaa49-3caa-4ea0-b086-c0f06e86b873_2706x1477.png" width="1456" 
height="795" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dc2eaa49-3caa-4ea0-b086-c0f06e86b873_2706x1477.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:795,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1892138,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183635076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc2eaa49-3caa-4ea0-b086-c0f06e86b873_2706x1477.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!AMx7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc2eaa49-3caa-4ea0-b086-c0f06e86b873_2706x1477.png 424w, https://substackcdn.com/image/fetch/$s_!AMx7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc2eaa49-3caa-4ea0-b086-c0f06e86b873_2706x1477.png 848w, https://substackcdn.com/image/fetch/$s_!AMx7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc2eaa49-3caa-4ea0-b086-c0f06e86b873_2706x1477.png 1272w, https://substackcdn.com/image/fetch/$s_!AMx7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc2eaa49-3caa-4ea0-b086-c0f06e86b873_2706x1477.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container 
restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Next, we convert each of these feature maps into a single list. For that, we flatten the 15x20 grid into a sequence of 300 positions, each of which carries a 512-dimensional feature vector.</p><p>The flattening process looks like this. 
Imagine the same transformation applied to a 15x20 grid.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Bs-K!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9940e9db-7cfa-43f9-9f1b-514f6b778ad8_2706x1477.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Bs-K!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9940e9db-7cfa-43f9-9f1b-514f6b778ad8_2706x1477.png 424w, https://substackcdn.com/image/fetch/$s_!Bs-K!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9940e9db-7cfa-43f9-9f1b-514f6b778ad8_2706x1477.png 848w, https://substackcdn.com/image/fetch/$s_!Bs-K!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9940e9db-7cfa-43f9-9f1b-514f6b778ad8_2706x1477.png 1272w, https://substackcdn.com/image/fetch/$s_!Bs-K!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9940e9db-7cfa-43f9-9f1b-514f6b778ad8_2706x1477.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Bs-K!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9940e9db-7cfa-43f9-9f1b-514f6b778ad8_2706x1477.png" width="430" height="234.7870879120879" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9940e9db-7cfa-43f9-9f1b-514f6b778ad8_2706x1477.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:795,&quot;width&quot;:1456,&quot;resizeWidth&quot;:430,&quot;bytes&quot;:1182946,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183635076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9940e9db-7cfa-43f9-9f1b-514f6b778ad8_2706x1477.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!Bs-K!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9940e9db-7cfa-43f9-9f1b-514f6b778ad8_2706x1477.png 424w, https://substackcdn.com/image/fetch/$s_!Bs-K!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9940e9db-7cfa-43f9-9f1b-514f6b778ad8_2706x1477.png 848w, https://substackcdn.com/image/fetch/$s_!Bs-K!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9940e9db-7cfa-43f9-9f1b-514f6b778ad8_2706x1477.png 1272w, https://substackcdn.com/image/fetch/$s_!Bs-K!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9940e9db-7cfa-43f9-9f1b-514f6b778ad8_2706x1477.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>So, our architecture is modified to look as follows:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" 
target="_blank" href="https://substackcdn.com/image/fetch/$s_!Km1F!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe114d1c2-a49d-43c7-9a47-37d305687589_2706x1477.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Km1F!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe114d1c2-a49d-43c7-9a47-37d305687589_2706x1477.png 424w, https://substackcdn.com/image/fetch/$s_!Km1F!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe114d1c2-a49d-43c7-9a47-37d305687589_2706x1477.png 848w, https://substackcdn.com/image/fetch/$s_!Km1F!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe114d1c2-a49d-43c7-9a47-37d305687589_2706x1477.png 1272w, https://substackcdn.com/image/fetch/$s_!Km1F!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe114d1c2-a49d-43c7-9a47-37d305687589_2706x1477.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Km1F!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe114d1c2-a49d-43c7-9a47-37d305687589_2706x1477.png" width="1456" height="795" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e114d1c2-a49d-43c7-9a47-37d305687589_2706x1477.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:795,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2231648,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183635076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe114d1c2-a49d-43c7-9a47-37d305687589_2706x1477.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!Km1F!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe114d1c2-a49d-43c7-9a47-37d305687589_2706x1477.png 424w, https://substackcdn.com/image/fetch/$s_!Km1F!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe114d1c2-a49d-43c7-9a47-37d305687589_2706x1477.png 848w, https://substackcdn.com/image/fetch/$s_!Km1F!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe114d1c2-a49d-43c7-9a47-37d305687589_2706x1477.png 1272w, https://substackcdn.com/image/fetch/$s_!Km1F!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe114d1c2-a49d-43c7-9a47-37d305687589_2706x1477.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg 
role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>So, in total, 4 x 300 = 1200 vectors, each of dimension 512, are passed to the encoder.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!owdt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe76fb56f-0517-4f96-93a0-02c76aa11f06_1680x1002.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!owdt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe76fb56f-0517-4f96-93a0-02c76aa11f06_1680x1002.png 424w, 
https://substackcdn.com/image/fetch/$s_!owdt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe76fb56f-0517-4f96-93a0-02c76aa11f06_1680x1002.png 848w, https://substackcdn.com/image/fetch/$s_!owdt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe76fb56f-0517-4f96-93a0-02c76aa11f06_1680x1002.png 1272w, https://substackcdn.com/image/fetch/$s_!owdt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe76fb56f-0517-4f96-93a0-02c76aa11f06_1680x1002.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!owdt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe76fb56f-0517-4f96-93a0-02c76aa11f06_1680x1002.png" width="643" height="383.3269230769231" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e76fb56f-0517-4f96-93a0-02c76aa11f06_1680x1002.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:868,&quot;width&quot;:1456,&quot;resizeWidth&quot;:643,&quot;bytes&quot;:181505,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183635076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe76fb56f-0517-4f96-93a0-02c76aa11f06_1680x1002.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" 
srcset="https://substackcdn.com/image/fetch/$s_!owdt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe76fb56f-0517-4f96-93a0-02c76aa11f06_1680x1002.png 424w, https://substackcdn.com/image/fetch/$s_!owdt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe76fb56f-0517-4f96-93a0-02c76aa11f06_1680x1002.png 848w, https://substackcdn.com/image/fetch/$s_!owdt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe76fb56f-0517-4f96-93a0-02c76aa11f06_1680x1002.png 1272w, https://substackcdn.com/image/fetch/$s_!owdt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe76fb56f-0517-4f96-93a0-02c76aa11f06_1680x1002.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Let us move on to understanding how the inputs for the joint positions are created.</p><p>We use a linear layer for this purpose:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!abk6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c258f40-d1c4-4838-acbe-03150e5ee0d7_2706x1477.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!abk6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c258f40-d1c4-4838-acbe-03150e5ee0d7_2706x1477.png 424w, https://substackcdn.com/image/fetch/$s_!abk6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c258f40-d1c4-4838-acbe-03150e5ee0d7_2706x1477.png 848w, https://substackcdn.com/image/fetch/$s_!abk6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c258f40-d1c4-4838-acbe-03150e5ee0d7_2706x1477.png 1272w, https://substackcdn.com/image/fetch/$s_!abk6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c258f40-d1c4-4838-acbe-03150e5ee0d7_2706x1477.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!abk6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c258f40-d1c4-4838-acbe-03150e5ee0d7_2706x1477.png" width="1456" height="795" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3c258f40-d1c4-4838-acbe-03150e5ee0d7_2706x1477.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:795,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1190932,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183635076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c258f40-d1c4-4838-acbe-03150e5ee0d7_2706x1477.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!abk6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c258f40-d1c4-4838-acbe-03150e5ee0d7_2706x1477.png 424w, https://substackcdn.com/image/fetch/$s_!abk6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c258f40-d1c4-4838-acbe-03150e5ee0d7_2706x1477.png 848w, https://substackcdn.com/image/fetch/$s_!abk6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c258f40-d1c4-4838-acbe-03150e5ee0d7_2706x1477.png 1272w, 
https://substackcdn.com/image/fetch/$s_!abk6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c258f40-d1c4-4838-acbe-03150e5ee0d7_2706x1477.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>This completes the processing of the inputs which are passed to the encoder.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!SVsH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff13f5108-eed0-4c51-a32a-a4850ced5011_1834x994.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!SVsH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff13f5108-eed0-4c51-a32a-a4850ced5011_1834x994.png 424w, https://substackcdn.com/image/fetch/$s_!SVsH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff13f5108-eed0-4c51-a32a-a4850ced5011_1834x994.png 848w, https://substackcdn.com/image/fetch/$s_!SVsH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff13f5108-eed0-4c51-a32a-a4850ced5011_1834x994.png 1272w, https://substackcdn.com/image/fetch/$s_!SVsH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff13f5108-eed0-4c51-a32a-a4850ced5011_1834x994.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!SVsH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff13f5108-eed0-4c51-a32a-a4850ced5011_1834x994.png" width="1456" height="789" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f13f5108-eed0-4c51-a32a-a4850ced5011_1834x994.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:789,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:191142,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183635076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff13f5108-eed0-4c51-a32a-a4850ced5011_1834x994.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!SVsH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff13f5108-eed0-4c51-a32a-a4850ced5011_1834x994.png 424w, https://substackcdn.com/image/fetch/$s_!SVsH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff13f5108-eed0-4c51-a32a-a4850ced5011_1834x994.png 848w, https://substackcdn.com/image/fetch/$s_!SVsH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff13f5108-eed0-4c51-a32a-a4850ced5011_1834x994.png 1272w, https://substackcdn.com/image/fetch/$s_!SVsH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff13f5108-eed0-4c51-a32a-a4850ced5011_1834x994.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>What about the encoder?</p><p>As discussed before, we need the encoder because it has to combine three different types of data:</p><p><strong>Visual Data (High dimensional):</strong> Pixels/features from 4 cameras.</p><p><strong>Proprioceptive Data (Low dimensional):</strong> Precise numbers for joint angles.</p><p><strong>Latent Style (z):</strong> An abstract &#8220;instruction&#8221; on how to behave (e.g., &#8220;move fast&#8221;).</p><p>These data types are very different, and if we just stacked them side by side, the model wouldn&#8217;t know how they relate.</p><p>So, we need a mechanism that forces these modalities to talk to each other.</p><p><em>Example:</em> The mechanism links the <strong>&#8220;Gripper Token&#8221;</strong> (from Joint data) with the <strong>&#8220;Mug Handle Token&#8221;</strong> (from 
Camera 1).</p><p><em>Result:</em> It creates a new understanding: <em>&#8220;The gripper is currently 2cm away from the handle.&#8221;</em></p><p>This is exactly what the self-attention mechanism in the transformer architecture does. Hence, we use a transformer encoder for this.</p><p><strong>Architecture Version 6:</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HXxD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66234c9c-04fa-4cad-b8c0-9dd56ae16504_1408x1336.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HXxD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66234c9c-04fa-4cad-b8c0-9dd56ae16504_1408x1336.png 424w, https://substackcdn.com/image/fetch/$s_!HXxD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66234c9c-04fa-4cad-b8c0-9dd56ae16504_1408x1336.png 848w, https://substackcdn.com/image/fetch/$s_!HXxD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66234c9c-04fa-4cad-b8c0-9dd56ae16504_1408x1336.png 1272w, https://substackcdn.com/image/fetch/$s_!HXxD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66234c9c-04fa-4cad-b8c0-9dd56ae16504_1408x1336.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HXxD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66234c9c-04fa-4cad-b8c0-9dd56ae16504_1408x1336.png" width="1408" height="1336" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/66234c9c-04fa-4cad-b8c0-9dd56ae16504_1408x1336.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1336,&quot;width&quot;:1408,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:580921,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183635076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66234c9c-04fa-4cad-b8c0-9dd56ae16504_1408x1336.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!HXxD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66234c9c-04fa-4cad-b8c0-9dd56ae16504_1408x1336.png 424w, https://substackcdn.com/image/fetch/$s_!HXxD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66234c9c-04fa-4cad-b8c0-9dd56ae16504_1408x1336.png 848w, https://substackcdn.com/image/fetch/$s_!HXxD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66234c9c-04fa-4cad-b8c0-9dd56ae16504_1408x1336.png 1272w, https://substackcdn.com/image/fetch/$s_!HXxD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F66234c9c-04fa-4cad-b8c0-9dd56ae16504_1408x1336.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Have a look at the visualization below to understand this in detail:</p><p>Now let us look at our original architecture for the ACT decoder.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0Q2v!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1065eb3f-335f-47d6-96c0-7fcfbfaf6b5a_2122x822.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0Q2v!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1065eb3f-335f-47d6-96c0-7fcfbfaf6b5a_2122x822.png 424w, 
https://substackcdn.com/image/fetch/$s_!0Q2v!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1065eb3f-335f-47d6-96c0-7fcfbfaf6b5a_2122x822.png 848w, https://substackcdn.com/image/fetch/$s_!0Q2v!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1065eb3f-335f-47d6-96c0-7fcfbfaf6b5a_2122x822.png 1272w, https://substackcdn.com/image/fetch/$s_!0Q2v!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1065eb3f-335f-47d6-96c0-7fcfbfaf6b5a_2122x822.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0Q2v!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1065eb3f-335f-47d6-96c0-7fcfbfaf6b5a_2122x822.png" width="1456" height="564" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1065eb3f-335f-47d6-96c0-7fcfbfaf6b5a_2122x822.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:564,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:198730,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183635076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1065eb3f-335f-47d6-96c0-7fcfbfaf6b5a_2122x822.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" 
srcset="https://substackcdn.com/image/fetch/$s_!0Q2v!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1065eb3f-335f-47d6-96c0-7fcfbfaf6b5a_2122x822.png 424w, https://substackcdn.com/image/fetch/$s_!0Q2v!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1065eb3f-335f-47d6-96c0-7fcfbfaf6b5a_2122x822.png 848w, https://substackcdn.com/image/fetch/$s_!0Q2v!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1065eb3f-335f-47d6-96c0-7fcfbfaf6b5a_2122x822.png 1272w, https://substackcdn.com/image/fetch/$s_!0Q2v!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1065eb3f-335f-47d6-96c0-7fcfbfaf6b5a_2122x822.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Now let us look at the Decoder:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8u29!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F198edd1c-d697-4f86-bb69-a88cdb71d6b0_1542x744.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8u29!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F198edd1c-d697-4f86-bb69-a88cdb71d6b0_1542x744.png 424w, https://substackcdn.com/image/fetch/$s_!8u29!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F198edd1c-d697-4f86-bb69-a88cdb71d6b0_1542x744.png 848w, https://substackcdn.com/image/fetch/$s_!8u29!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F198edd1c-d697-4f86-bb69-a88cdb71d6b0_1542x744.png 1272w, https://substackcdn.com/image/fetch/$s_!8u29!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F198edd1c-d697-4f86-bb69-a88cdb71d6b0_1542x744.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8u29!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F198edd1c-d697-4f86-bb69-a88cdb71d6b0_1542x744.png" width="1456" height="703" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/198edd1c-d697-4f86-bb69-a88cdb71d6b0_1542x744.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:703,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:117204,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183635076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F198edd1c-d697-4f86-bb69-a88cdb71d6b0_1542x744.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!8u29!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F198edd1c-d697-4f86-bb69-a88cdb71d6b0_1542x744.png 424w, https://substackcdn.com/image/fetch/$s_!8u29!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F198edd1c-d697-4f86-bb69-a88cdb71d6b0_1542x744.png 848w, https://substackcdn.com/image/fetch/$s_!8u29!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F198edd1c-d697-4f86-bb69-a88cdb71d6b0_1542x744.png 1272w, https://substackcdn.com/image/fetch/$s_!8u29!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F198edd1c-d697-4f86-bb69-a88cdb71d6b0_1542x744.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>But what about the keys and values?</p><p>Let us revisit the encoder architecture:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6iIc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa958ef65-3a40-45b9-9322-aa9023fe908f_2707x1477.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6iIc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa958ef65-3a40-45b9-9322-aa9023fe908f_2707x1477.png 424w, 
https://substackcdn.com/image/fetch/$s_!6iIc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa958ef65-3a40-45b9-9322-aa9023fe908f_2707x1477.png 848w, https://substackcdn.com/image/fetch/$s_!6iIc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa958ef65-3a40-45b9-9322-aa9023fe908f_2707x1477.png 1272w, https://substackcdn.com/image/fetch/$s_!6iIc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa958ef65-3a40-45b9-9322-aa9023fe908f_2707x1477.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6iIc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa958ef65-3a40-45b9-9322-aa9023fe908f_2707x1477.png" width="1456" height="794" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a958ef65-3a40-45b9-9322-aa9023fe908f_2707x1477.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:794,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2439711,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183635076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa958ef65-3a40-45b9-9322-aa9023fe908f_2707x1477.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" 
srcset="https://substackcdn.com/image/fetch/$s_!6iIc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa958ef65-3a40-45b9-9322-aa9023fe908f_2707x1477.png 424w, https://substackcdn.com/image/fetch/$s_!6iIc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa958ef65-3a40-45b9-9322-aa9023fe908f_2707x1477.png 848w, https://substackcdn.com/image/fetch/$s_!6iIc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa958ef65-3a40-45b9-9322-aa9023fe908f_2707x1477.png 1272w, https://substackcdn.com/image/fetch/$s_!6iIc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa958ef65-3a40-45b9-9322-aa9023fe908f_2707x1477.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Here, we can clearly see that the encoder&#8217;s output provides the keys and values that the decoder will use.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mK9n!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92defcf0-df20-40ee-ae0e-1b98fdfe6587_1560x926.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mK9n!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92defcf0-df20-40ee-ae0e-1b98fdfe6587_1560x926.png 424w, https://substackcdn.com/image/fetch/$s_!mK9n!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92defcf0-df20-40ee-ae0e-1b98fdfe6587_1560x926.png 848w, https://substackcdn.com/image/fetch/$s_!mK9n!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92defcf0-df20-40ee-ae0e-1b98fdfe6587_1560x926.png 1272w, https://substackcdn.com/image/fetch/$s_!mK9n!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92defcf0-df20-40ee-ae0e-1b98fdfe6587_1560x926.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!mK9n!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92defcf0-df20-40ee-ae0e-1b98fdfe6587_1560x926.png" width="1456" height="864" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/92defcf0-df20-40ee-ae0e-1b98fdfe6587_1560x926.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:864,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:96563,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183635076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92defcf0-df20-40ee-ae0e-1b98fdfe6587_1560x926.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!mK9n!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92defcf0-df20-40ee-ae0e-1b98fdfe6587_1560x926.png 424w, https://substackcdn.com/image/fetch/$s_!mK9n!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92defcf0-df20-40ee-ae0e-1b98fdfe6587_1560x926.png 848w, https://substackcdn.com/image/fetch/$s_!mK9n!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92defcf0-df20-40ee-ae0e-1b98fdfe6587_1560x926.png 1272w, https://substackcdn.com/image/fetch/$s_!mK9n!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92defcf0-df20-40ee-ae0e-1b98fdfe6587_1560x926.png 
1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Now, we will perform the cross-attention.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XpY3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62ec7466-7d10-49f7-8a75-42820ae85bf2_1464x820.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!XpY3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62ec7466-7d10-49f7-8a75-42820ae85bf2_1464x820.png 424w, https://substackcdn.com/image/fetch/$s_!XpY3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62ec7466-7d10-49f7-8a75-42820ae85bf2_1464x820.png 848w, https://substackcdn.com/image/fetch/$s_!XpY3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62ec7466-7d10-49f7-8a75-42820ae85bf2_1464x820.png 1272w, https://substackcdn.com/image/fetch/$s_!XpY3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62ec7466-7d10-49f7-8a75-42820ae85bf2_1464x820.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XpY3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62ec7466-7d10-49f7-8a75-42820ae85bf2_1464x820.png" width="1456" height="816" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/62ec7466-7d10-49f7-8a75-42820ae85bf2_1464x820.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:816,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:115571,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183635076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62ec7466-7d10-49f7-8a75-42820ae85bf2_1464x820.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!XpY3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62ec7466-7d10-49f7-8a75-42820ae85bf2_1464x820.png 424w, https://substackcdn.com/image/fetch/$s_!XpY3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62ec7466-7d10-49f7-8a75-42820ae85bf2_1464x820.png 848w, https://substackcdn.com/image/fetch/$s_!XpY3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62ec7466-7d10-49f7-8a75-42820ae85bf2_1464x820.png 1272w, https://substackcdn.com/image/fetch/$s_!XpY3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62ec7466-7d10-49f7-8a75-42820ae85bf2_1464x820.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" 
stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Once we calculate the attention scores, we will use the values to compute the context vectors.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YuxH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca6600ae-33a5-42ef-b6c3-fd04601294ab_1858x946.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!YuxH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca6600ae-33a5-42ef-b6c3-fd04601294ab_1858x946.png 424w, https://substackcdn.com/image/fetch/$s_!YuxH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca6600ae-33a5-42ef-b6c3-fd04601294ab_1858x946.png 848w, https://substackcdn.com/image/fetch/$s_!YuxH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca6600ae-33a5-42ef-b6c3-fd04601294ab_1858x946.png 1272w, https://substackcdn.com/image/fetch/$s_!YuxH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca6600ae-33a5-42ef-b6c3-fd04601294ab_1858x946.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!YuxH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca6600ae-33a5-42ef-b6c3-fd04601294ab_1858x946.png" width="1456" height="741" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ca6600ae-33a5-42ef-b6c3-fd04601294ab_1858x946.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:741,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:148951,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183635076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca6600ae-33a5-42ef-b6c3-fd04601294ab_1858x946.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!YuxH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca6600ae-33a5-42ef-b6c3-fd04601294ab_1858x946.png 424w, https://substackcdn.com/image/fetch/$s_!YuxH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca6600ae-33a5-42ef-b6c3-fd04601294ab_1858x946.png 848w, https://substackcdn.com/image/fetch/$s_!YuxH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca6600ae-33a5-42ef-b6c3-fd04601294ab_1858x946.png 1272w, https://substackcdn.com/image/fetch/$s_!YuxH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca6600ae-33a5-42ef-b6c3-fd04601294ab_1858x946.png 
1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>These context vectors will now be updated by passing them through an MLP layer:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!O5qX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7849e46c-168a-46b5-96fc-83f30feb9f87_1104x972.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!O5qX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7849e46c-168a-46b5-96fc-83f30feb9f87_1104x972.png 424w, https://substackcdn.com/image/fetch/$s_!O5qX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7849e46c-168a-46b5-96fc-83f30feb9f87_1104x972.png 848w, https://substackcdn.com/image/fetch/$s_!O5qX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7849e46c-168a-46b5-96fc-83f30feb9f87_1104x972.png 1272w, https://substackcdn.com/image/fetch/$s_!O5qX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7849e46c-168a-46b5-96fc-83f30feb9f87_1104x972.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!O5qX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7849e46c-168a-46b5-96fc-83f30feb9f87_1104x972.png" width="474" height="417.32608695652175" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7849e46c-168a-46b5-96fc-83f30feb9f87_1104x972.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:972,&quot;width&quot;:1104,&quot;resizeWidth&quot;:474,&quot;bytes&quot;:69807,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183635076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7849e46c-168a-46b5-96fc-83f30feb9f87_1104x972.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!O5qX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7849e46c-168a-46b5-96fc-83f30feb9f87_1104x972.png 424w, https://substackcdn.com/image/fetch/$s_!O5qX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7849e46c-168a-46b5-96fc-83f30feb9f87_1104x972.png 848w, https://substackcdn.com/image/fetch/$s_!O5qX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7849e46c-168a-46b5-96fc-83f30feb9f87_1104x972.png 1272w, https://substackcdn.com/image/fetch/$s_!O5qX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7849e46c-168a-46b5-96fc-83f30feb9f87_1104x972.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" 
stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In the final step, the output is passed through a projection layer that converts the 512 values into the 6 joint values. This happens for each timestep in the action chunk (e.g., 6).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mP0u!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f363692-ec61-4240-b05c-16752a71f402_2042x1314.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mP0u!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f363692-ec61-4240-b05c-16752a71f402_2042x1314.png 424w, https://substackcdn.com/image/fetch/$s_!mP0u!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f363692-ec61-4240-b05c-16752a71f402_2042x1314.png 848w, https://substackcdn.com/image/fetch/$s_!mP0u!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f363692-ec61-4240-b05c-16752a71f402_2042x1314.png 1272w, https://substackcdn.com/image/fetch/$s_!mP0u!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f363692-ec61-4240-b05c-16752a71f402_2042x1314.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!mP0u!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f363692-ec61-4240-b05c-16752a71f402_2042x1314.png" width="1456" height="937" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0f363692-ec61-4240-b05c-16752a71f402_2042x1314.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:937,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:565265,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183635076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f363692-ec61-4240-b05c-16752a71f402_2042x1314.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!mP0u!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f363692-ec61-4240-b05c-16752a71f402_2042x1314.png 424w, https://substackcdn.com/image/fetch/$s_!mP0u!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f363692-ec61-4240-b05c-16752a71f402_2042x1314.png 848w, https://substackcdn.com/image/fetch/$s_!mP0u!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f363692-ec61-4240-b05c-16752a71f402_2042x1314.png 1272w, 
https://substackcdn.com/image/fetch/$s_!mP0u!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f363692-ec61-4240-b05c-16752a71f402_2042x1314.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kkj9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39f589d9-014f-4cb2-a8e4-e4f7113c9d54_2044x1302.png" data-component-name="Image2ToDOM"><div 
class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kkj9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39f589d9-014f-4cb2-a8e4-e4f7113c9d54_2044x1302.png 424w, https://substackcdn.com/image/fetch/$s_!kkj9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39f589d9-014f-4cb2-a8e4-e4f7113c9d54_2044x1302.png 848w, https://substackcdn.com/image/fetch/$s_!kkj9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39f589d9-014f-4cb2-a8e4-e4f7113c9d54_2044x1302.png 1272w, https://substackcdn.com/image/fetch/$s_!kkj9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39f589d9-014f-4cb2-a8e4-e4f7113c9d54_2044x1302.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kkj9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39f589d9-014f-4cb2-a8e4-e4f7113c9d54_2044x1302.png" width="1456" height="927" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/39f589d9-014f-4cb2-a8e4-e4f7113c9d54_2044x1302.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:927,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:567622,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183635076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39f589d9-014f-4cb2-a8e4-e4f7113c9d54_2044x1302.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!kkj9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39f589d9-014f-4cb2-a8e4-e4f7113c9d54_2044x1302.png 424w, https://substackcdn.com/image/fetch/$s_!kkj9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39f589d9-014f-4cb2-a8e4-e4f7113c9d54_2044x1302.png 848w, https://substackcdn.com/image/fetch/$s_!kkj9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39f589d9-014f-4cb2-a8e4-e4f7113c9d54_2044x1302.png 1272w, https://substackcdn.com/image/fetch/$s_!kkj9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39f589d9-014f-4cb2-a8e4-e4f7113c9d54_2044x1302.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Now, let us piece all of this together and see what the final architecture of the ACT Variational AutoEncoder looks like:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yRLv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a276d0b-ff40-4a9a-a1c1-d80c9f74e324_3569x1120.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yRLv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a276d0b-ff40-4a9a-a1c1-d80c9f74e324_3569x1120.png 424w, 
https://substackcdn.com/image/fetch/$s_!yRLv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a276d0b-ff40-4a9a-a1c1-d80c9f74e324_3569x1120.png 848w, https://substackcdn.com/image/fetch/$s_!yRLv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a276d0b-ff40-4a9a-a1c1-d80c9f74e324_3569x1120.png 1272w, https://substackcdn.com/image/fetch/$s_!yRLv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a276d0b-ff40-4a9a-a1c1-d80c9f74e324_3569x1120.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yRLv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a276d0b-ff40-4a9a-a1c1-d80c9f74e324_3569x1120.png" width="1456" height="457" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5a276d0b-ff40-4a9a-a1c1-d80c9f74e324_3569x1120.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:457,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1381512,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183635076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a276d0b-ff40-4a9a-a1c1-d80c9f74e324_3569x1120.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" 
srcset="https://substackcdn.com/image/fetch/$s_!yRLv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a276d0b-ff40-4a9a-a1c1-d80c9f74e324_3569x1120.png 424w, https://substackcdn.com/image/fetch/$s_!yRLv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a276d0b-ff40-4a9a-a1c1-d80c9f74e324_3569x1120.png 848w, https://substackcdn.com/image/fetch/$s_!yRLv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a276d0b-ff40-4a9a-a1c1-d80c9f74e324_3569x1120.png 1272w, https://substackcdn.com/image/fetch/$s_!yRLv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a276d0b-ff40-4a9a-a1c1-d80c9f74e324_3569x1120.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3><strong>How is the Decoder similar to DETR?</strong></h3><p>In a standard setup, the inputs at the base of the decoder are word-token embeddings corresponding to the text the transformer is trained to generate.</p><p>By analogy with the original transformer architecture, one option would be to feed in a start-of-sequence token followed by embeddings of the actions to be predicted, effectively casting the task as next-action prediction.</p><p>However, this formulation comes with a key constraint: it requires causal attention. As a result, when predicting the action at time <em>t</em>, the model can only attend to actions up to <em>t&#8211;1</em>.</p><p>To avoid this limitation, the authors take a clever alternative route by borrowing ideas from DETR (DEtection TRansformer). Instead of previously predicted actions, the decoder is fed a fixed set of position embeddings, one per timestep of the action chunk, and predicts all actions in parallel. This design allows the decoder to reason over the entire output sequence at once, free from the restrictions imposed by causal masking.</p><p>Have a look at this video, which explains the intuition behind DETR:</p><div id="youtube2-WGlXhQKXh5c" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;WGlXhQKXh5c&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/WGlXhQKXh5c?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><div class="pullquote"><p>During training, both the transformer encoder and decoder networks are trained. 
However, during testing, we discard the encoder and only use the transformer decoder. </p></div><p>The value of the latent variable (z) is set to 0 during inference, for deterministic trajectories.</p><p>It&#8217;s like telling the encoder&#8211;decoder transformer at inference time: <em>you&#8217;ve already learned from a vast range of trajectories&#8212;now just focus on taking me from point A to point B. The uniqueness of the path no longer matters.</em></p><p>The training loop for the ACT Policy looks as follows:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2WbD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cde59c4-bb0d-4567-a3dd-eb0c1b1f1f61_1098x213.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2WbD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cde59c4-bb0d-4567-a3dd-eb0c1b1f1f61_1098x213.gif 424w, https://substackcdn.com/image/fetch/$s_!2WbD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cde59c4-bb0d-4567-a3dd-eb0c1b1f1f61_1098x213.gif 848w, https://substackcdn.com/image/fetch/$s_!2WbD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cde59c4-bb0d-4567-a3dd-eb0c1b1f1f61_1098x213.gif 1272w, https://substackcdn.com/image/fetch/$s_!2WbD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cde59c4-bb0d-4567-a3dd-eb0c1b1f1f61_1098x213.gif 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!2WbD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cde59c4-bb0d-4567-a3dd-eb0c1b1f1f61_1098x213.gif" width="1098" height="213" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8cde59c4-bb0d-4567-a3dd-eb0c1b1f1f61_1098x213.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:213,&quot;width&quot;:1098,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:963879,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183635076?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cde59c4-bb0d-4567-a3dd-eb0c1b1f1f61_1098x213.gif&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!2WbD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cde59c4-bb0d-4567-a3dd-eb0c1b1f1f61_1098x213.gif 424w, https://substackcdn.com/image/fetch/$s_!2WbD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cde59c4-bb0d-4567-a3dd-eb0c1b1f1f61_1098x213.gif 848w, https://substackcdn.com/image/fetch/$s_!2WbD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cde59c4-bb0d-4567-a3dd-eb0c1b1f1f61_1098x213.gif 1272w, https://substackcdn.com/image/fetch/$s_!2WbD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cde59c4-bb0d-4567-a3dd-eb0c1b1f1f61_1098x213.gif 
1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>That&#8217;s it! </p><p>If you like this content, please check out our bootcamps on the following topics:</p><p><strong>Modern Robot Learning</strong>: <a href="https://robotlearningbootcamp.vizuara.ai/">https://robotlearningbootcamp.vizuara.ai/</a></p><p><strong>GenAI</strong>: <a href="https://flyvidesh.online/gen-ai-professional-bootcamp">https://flyvidesh.online/gen-ai-professional-bootcamp</a></p><p><strong>RL</strong>: <a href="https://rlresearcherbootcamp.vizuara.ai/">https://rlresearcherbootcamp.vizuara.ai/</a></p><p><strong>SciML</strong>: <a href="https://flyvidesh.online/ml-bootcamp">https://flyvidesh.online/ml-bootcamp</a></p><p><strong>ML-DL</strong>: <a href="https://flyvidesh.online/ml-dl-bootcamp">https://flyvidesh.online/ml-dl-bootcamp</a></p><p><strong>CV</strong>: <a href="https://cvresearchbootcamp.vizuara.ai/">https://cvresearchbootcamp.vizuara.ai/</a></p>]]></content:encoded></item><item><title><![CDATA[What exactly is Denoising Score Matching? ]]></title><description><![CDATA[What is Denoising Score Matching? 
Why is it central to Diffusion Models?]]></description><link>https://www.vizuaranewsletter.com/p/what-exactly-is-denoising-score-matching</link><guid isPermaLink="false">https://www.vizuaranewsletter.com/p/what-exactly-is-denoising-score-matching</guid><dc:creator><![CDATA[Dr Rajat Dandekar]]></dc:creator><pubDate>Wed, 07 Jan 2026 09:48:26 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!A9a-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2852bb17-d200-42c2-a24d-f9fdf8dc7fae_2706x1477.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In the previous article, we looked at score matching, where the objective is to match the predicted score function to the true score function.</p><p>Here is the link to the article: </p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;e66b828b-767a-4041-ae6d-e5439ad951a5&quot;,&quot;caption&quot;:&quot;EBMs define a probability density via an energy function which assigns lower energy to more likely configurations.&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Energy Based Models - Score Matching&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:136641784,&quot;name&quot;:&quot;Dr Rajat 
Dandekar&quot;,&quot;bio&quot;:null,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/144539bd-c2b3-4909-8a05-1e9309cc9572_144x144.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2026-01-01T09:52:23.663Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/$s_!wemw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f27f7fa-7558-480a-9516-1ed18ab43b47_2592x1543.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://www.vizuaranewsletter.com/p/energy-based-models-score-matching&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:182943043,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:1,&quot;comment_count&quot;:0,&quot;publication_id&quot;:3591997,&quot;publication_name&quot;:&quot;Vizuara&#8217;s AI Newsletter&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!f_Wd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59481c77-2230-4e2f-b78b-88e9ee5fe9d9_1088x1088.png&quot;,&quot;belowTheFold&quot;:false,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><p>We ran into a challenge: since we do not know the true score function, how can we ever match our predicted score to it?</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!j5CW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8306e37d-07a3-4346-b2f3-e0a101b15643_2132x1875.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!j5CW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8306e37d-07a3-4346-b2f3-e0a101b15643_2132x1875.png 424w, https://substackcdn.com/image/fetch/$s_!j5CW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8306e37d-07a3-4346-b2f3-e0a101b15643_2132x1875.png 848w, https://substackcdn.com/image/fetch/$s_!j5CW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8306e37d-07a3-4346-b2f3-e0a101b15643_2132x1875.png 1272w, https://substackcdn.com/image/fetch/$s_!j5CW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8306e37d-07a3-4346-b2f3-e0a101b15643_2132x1875.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!j5CW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8306e37d-07a3-4346-b2f3-e0a101b15643_2132x1875.png" width="1456" height="1280" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8306e37d-07a3-4346-b2f3-e0a101b15643_2132x1875.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1280,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1561160,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183768823?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8306e37d-07a3-4346-b2f3-e0a101b15643_2132x1875.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!j5CW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8306e37d-07a3-4346-b2f3-e0a101b15643_2132x1875.png 424w, https://substackcdn.com/image/fetch/$s_!j5CW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8306e37d-07a3-4346-b2f3-e0a101b15643_2132x1875.png 848w, https://substackcdn.com/image/fetch/$s_!j5CW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8306e37d-07a3-4346-b2f3-e0a101b15643_2132x1875.png 1272w, https://substackcdn.com/image/fetch/$s_!j5CW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8306e37d-07a3-4346-b2f3-e0a101b15643_2132x1875.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div>
<p>This paper came to the rescue: it gave us an alternative loss function that only requires the data samples.</p><p>This loss function, averaged over the data samples, looks as follows:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\tilde{L_{SM}}(\\phi) = Tr (\\nabla_{x}s_{\\phi}(x)) + \\frac{1}{2}||s_{\\phi}(x)||^{2}&quot;,&quot;id&quot;:&quot;LUKPPJWUTZ&quot;}" data-component-name="LatexBlockToDOM"></div><p>We looked at an example where, for a given set of data samples, we learned a score function using the above formulation.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TECE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbcd2b8d-34d6-4b8b-b46a-13704de09c8a_2424x1028.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TECE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbcd2b8d-34d6-4b8b-b46a-13704de09c8a_2424x1028.png 424w, https://substackcdn.com/image/fetch/$s_!TECE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbcd2b8d-34d6-4b8b-b46a-13704de09c8a_2424x1028.png 848w, 
https://substackcdn.com/image/fetch/$s_!TECE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbcd2b8d-34d6-4b8b-b46a-13704de09c8a_2424x1028.png 1272w, https://substackcdn.com/image/fetch/$s_!TECE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbcd2b8d-34d6-4b8b-b46a-13704de09c8a_2424x1028.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TECE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbcd2b8d-34d6-4b8b-b46a-13704de09c8a_2424x1028.png" width="1456" height="617" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dbcd2b8d-34d6-4b8b-b46a-13704de09c8a_2424x1028.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:617,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1438780,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183768823?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbcd2b8d-34d6-4b8b-b46a-13704de09c8a_2424x1028.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!TECE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbcd2b8d-34d6-4b8b-b46a-13704de09c8a_2424x1028.png 424w, 
https://substackcdn.com/image/fetch/$s_!TECE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbcd2b8d-34d6-4b8b-b46a-13704de09c8a_2424x1028.png 848w, https://substackcdn.com/image/fetch/$s_!TECE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbcd2b8d-34d6-4b8b-b46a-13704de09c8a_2424x1028.png 1272w, https://substackcdn.com/image/fetch/$s_!TECE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdbcd2b8d-34d6-4b8b-b46a-13704de09c8a_2424x1028.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p>This is excellent, but this technique is not used in practice. Let us understand why.</p><p>The trace term is taken over the Jacobian of the score function, which is a D x D matrix for D-dimensional data. The trace only needs the D diagonal entries, but automatic differentiation returns one full row of the Jacobian per backward pass, so in practice we end up computing all D x D entries.</p><p>The cost therefore scales as the square of the data dimension. This becomes extremely expensive for high-dimensional data such as images.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wgeV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93d83bea-f72c-48bf-8d1d-47aa03db48c7_2068x1933.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!wgeV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93d83bea-f72c-48bf-8d1d-47aa03db48c7_2068x1933.png 424w, https://substackcdn.com/image/fetch/$s_!wgeV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93d83bea-f72c-48bf-8d1d-47aa03db48c7_2068x1933.png 848w, https://substackcdn.com/image/fetch/$s_!wgeV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93d83bea-f72c-48bf-8d1d-47aa03db48c7_2068x1933.png 1272w, https://substackcdn.com/image/fetch/$s_!wgeV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93d83bea-f72c-48bf-8d1d-47aa03db48c7_2068x1933.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!wgeV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93d83bea-f72c-48bf-8d1d-47aa03db48c7_2068x1933.png" width="1456" height="1361" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/93d83bea-f72c-48bf-8d1d-47aa03db48c7_2068x1933.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1361,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:997004,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183768823?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93d83bea-f72c-48bf-8d1d-47aa03db48c7_2068x1933.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!wgeV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93d83bea-f72c-48bf-8d1d-47aa03db48c7_2068x1933.png 424w, https://substackcdn.com/image/fetch/$s_!wgeV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93d83bea-f72c-48bf-8d1d-47aa03db48c7_2068x1933.png 848w, https://substackcdn.com/image/fetch/$s_!wgeV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93d83bea-f72c-48bf-8d1d-47aa03db48c7_2068x1933.png 1272w, 
https://substackcdn.com/image/fetch/$s_!wgeV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93d83bea-f72c-48bf-8d1d-47aa03db48c7_2068x1933.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p>Denoising score matching, which avoids this cost, was introduced by Pascal Vincent in 2010.</p><p>His idea is easiest to understand through a practical example:</p><blockquote><p>Imagine you have a tabletop. There are invisible magnets hidden at specific spots on this table. 
These magnets represent your <strong>Real Data.</strong></p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JYJO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F658cd6ae-28b0-4303-8265-6eda0980b95d_2142x902.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JYJO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F658cd6ae-28b0-4303-8265-6eda0980b95d_2142x902.png 424w, https://substackcdn.com/image/fetch/$s_!JYJO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F658cd6ae-28b0-4303-8265-6eda0980b95d_2142x902.png 848w, https://substackcdn.com/image/fetch/$s_!JYJO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F658cd6ae-28b0-4303-8265-6eda0980b95d_2142x902.png 1272w, https://substackcdn.com/image/fetch/$s_!JYJO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F658cd6ae-28b0-4303-8265-6eda0980b95d_2142x902.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JYJO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F658cd6ae-28b0-4303-8265-6eda0980b95d_2142x902.png" width="1456" height="613" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/658cd6ae-28b0-4303-8265-6eda0980b95d_2142x902.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:613,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2059075,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183768823?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F658cd6ae-28b0-4303-8265-6eda0980b95d_2142x902.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!JYJO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F658cd6ae-28b0-4303-8265-6eda0980b95d_2142x902.png 424w, https://substackcdn.com/image/fetch/$s_!JYJO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F658cd6ae-28b0-4303-8265-6eda0980b95d_2142x902.png 848w, https://substackcdn.com/image/fetch/$s_!JYJO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F658cd6ae-28b0-4303-8265-6eda0980b95d_2142x902.png 1272w, https://substackcdn.com/image/fetch/$s_!JYJO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F658cd6ae-28b0-4303-8265-6eda0980b95d_2142x902.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p>Your goal is to draw a map of the magnetic field that tells you, for any point on the table, which direction the nearest magnet is pulling.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!R9b-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb76e91df-8314-485c-9391-feb63d4d8c27_2122x870.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!R9b-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb76e91df-8314-485c-9391-feb63d4d8c27_2122x870.png 424w, 
https://substackcdn.com/image/fetch/$s_!R9b-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb76e91df-8314-485c-9391-feb63d4d8c27_2122x870.png 848w, https://substackcdn.com/image/fetch/$s_!R9b-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb76e91df-8314-485c-9391-feb63d4d8c27_2122x870.png 1272w, https://substackcdn.com/image/fetch/$s_!R9b-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb76e91df-8314-485c-9391-feb63d4d8c27_2122x870.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!R9b-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb76e91df-8314-485c-9391-feb63d4d8c27_2122x870.png" width="1456" height="597" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b76e91df-8314-485c-9391-feb63d4d8c27_2122x870.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:597,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2064184,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183768823?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb76e91df-8314-485c-9391-feb63d4d8c27_2122x870.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" 
srcset="https://substackcdn.com/image/fetch/$s_!R9b-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb76e91df-8314-485c-9391-feb63d4d8c27_2122x870.png 424w, https://substackcdn.com/image/fetch/$s_!R9b-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb76e91df-8314-485c-9391-feb63d4d8c27_2122x870.png 848w, https://substackcdn.com/image/fetch/$s_!R9b-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb76e91df-8314-485c-9391-feb63d4d8c27_2122x870.png 1272w, https://substackcdn.com/image/fetch/$s_!R9b-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb76e91df-8314-485c-9391-feb63d4d8c27_2122x870.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>If you just look at the empty table, you can&#8217;t calculate the magnetic field. You don&#8217;t know where the magnets are or how strong they are. For example, there might be more magnets than the ones you can see, so you cannot know the magnetic field at every point.</p><p>Okay, now we do a small trick:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!A9a-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2852bb17-d200-42c2-a24d-f9fdf8dc7fae_2706x1477.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!A9a-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2852bb17-d200-42c2-a24d-f9fdf8dc7fae_2706x1477.png 424w, https://substackcdn.com/image/fetch/$s_!A9a-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2852bb17-d200-42c2-a24d-f9fdf8dc7fae_2706x1477.png 848w, https://substackcdn.com/image/fetch/$s_!A9a-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2852bb17-d200-42c2-a24d-f9fdf8dc7fae_2706x1477.png 1272w, https://substackcdn.com/image/fetch/$s_!A9a-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2852bb17-d200-42c2-a24d-f9fdf8dc7fae_2706x1477.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!A9a-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2852bb17-d200-42c2-a24d-f9fdf8dc7fae_2706x1477.png" width="1456" height="795" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2852bb17-d200-42c2-a24d-f9fdf8dc7fae_2706x1477.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:795,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:5250892,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183768823?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2852bb17-d200-42c2-a24d-f9fdf8dc7fae_2706x1477.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!A9a-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2852bb17-d200-42c2-a24d-f9fdf8dc7fae_2706x1477.png 424w, https://substackcdn.com/image/fetch/$s_!A9a-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2852bb17-d200-42c2-a24d-f9fdf8dc7fae_2706x1477.png 848w, https://substackcdn.com/image/fetch/$s_!A9a-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2852bb17-d200-42c2-a24d-f9fdf8dc7fae_2706x1477.png 1272w, 
https://substackcdn.com/image/fetch/$s_!A9a-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2852bb17-d200-42c2-a24d-f9fdf8dc7fae_2706x1477.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>We place a metal ball exactly on top of a hidden magnet. We flick the ball in a random direction. It rolls away and stops at a new, random spot. 
This is the <strong>Noisy Data.</strong></p><p>Next, we bring in a student:</p><ul><li><p>We show the student the ball&#8217;s new location</p></li><li><p>We hide the original magnet location</p></li><li><p>We ask the student, &#8220;Draw an arrow representing the force needed to pull this ball back to where it started.&#8221;</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tk3Z!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c60e7e5-bb82-416b-94de-ed2a91f9b724_2706x1477.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tk3Z!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c60e7e5-bb82-416b-94de-ed2a91f9b724_2706x1477.png 424w, https://substackcdn.com/image/fetch/$s_!tk3Z!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c60e7e5-bb82-416b-94de-ed2a91f9b724_2706x1477.png 848w, https://substackcdn.com/image/fetch/$s_!tk3Z!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c60e7e5-bb82-416b-94de-ed2a91f9b724_2706x1477.png 1272w, https://substackcdn.com/image/fetch/$s_!tk3Z!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c60e7e5-bb82-416b-94de-ed2a91f9b724_2706x1477.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tk3Z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c60e7e5-bb82-416b-94de-ed2a91f9b724_2706x1477.png" width="1456" height="795" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4c60e7e5-bb82-416b-94de-ed2a91f9b724_2706x1477.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:795,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:3686151,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183768823?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c60e7e5-bb82-416b-94de-ed2a91f9b724_2706x1477.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!tk3Z!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c60e7e5-bb82-416b-94de-ed2a91f9b724_2706x1477.png 424w, https://substackcdn.com/image/fetch/$s_!tk3Z!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c60e7e5-bb82-416b-94de-ed2a91f9b724_2706x1477.png 848w, https://substackcdn.com/image/fetch/$s_!tk3Z!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c60e7e5-bb82-416b-94de-ed2a91f9b724_2706x1477.png 1272w, https://substackcdn.com/image/fetch/$s_!tk3Z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c60e7e5-bb82-416b-94de-ed2a91f9b724_2706x1477.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg 
role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The student has no idea where this noisy data came from.</p><p>But we know where it came from.</p><div class="pullquote"><p>So if we can give the student feedback based on the student&#8217;s prediction and our knowledge, we can teach the student how to draw the force that pulls the ball back to its starting point, for every possible noisy point in the field.</p></div><p>So through this process, wouldn&#8217;t the student learn the magnetic field at all points in the field?</p><p>Now, let us look at how this analogy relates to Vincent&#8217;s ideas in his paper.</p><p>The hidden magnets in the analogy represent the real, clean data points. 
Their distribution is denoted by the following symbol:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;p_{data}(x)&quot;,&quot;id&quot;:&quot;QSBMIDPRNC&quot;}" data-component-name="LatexBlockToDOM"></div><p>The &#8220;flick&#8221; represents noise added to the clean data points. The noisy data (ball&#8217;s new spot) is represented as follows:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\tilde{x}&quot;,&quot;id&quot;:&quot;CXLNZNXZLN&quot;}" data-component-name="LatexBlockToDOM"></div><p>The probability distribution of the noisy data is represented as follows:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;p_{\\sigma}(\\tilde{x})&quot;,&quot;id&quot;:&quot;HSOWILYDVI&quot;}" data-component-name="LatexBlockToDOM"></div><p>Here, &#963; is the noise level, that is, the standard deviation of the noise added to the data.</p><p>The student represents the neural network trying to guess the direction back to the magnet. This is represented as follows:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;s_{\\phi}(\\tilde{x})&quot;,&quot;id&quot;:&quot;IQSEHMXLJA&quot;}" data-component-name="LatexBlockToDOM"></div><p>The correct arrow, pointing from the noisy data back to the original data, is the score function of the conditional distribution of the noisy data given the original data. It is denoted as follows:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\nabla_{\\tilde{x}}\\text{log}p_{\\sigma}(\\tilde{x} | x)&quot;,&quot;id&quot;:&quot;BXVYKDXAIK&quot;}" data-component-name="LatexBlockToDOM"></div><p>Hence the loss function boils down to the following:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;L_{DSM} = \\frac{1}{2}||s_{\\phi}(\\tilde{x}) - \\nabla_{\\tilde{x}}\\text{log}p_{\\sigma}(\\tilde{x} | x)||^{2}&quot;,&quot;id&quot;:&quot;YVAWBLSYQA&quot;}" data-component-name="LatexBlockToDOM"></div><p><em>The conditioning technique (probability of the noisy data, given the original data) also 
appears in the variational view of diffusion models in DDPM.</em></p><p>We can simplify this further to get a very simple loss formulation.</p><p>If we assume that the noise is Gaussian, the score function that we want to learn has a simple closed form.</p><p>If we add Gaussian noise with variance &#963;^2 to each data point, then we can write the following:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\tilde{x} = x + \\sigma \\epsilon&quot;,&quot;id&quot;:&quot;OWFNGVEYAY&quot;}" data-component-name="LatexBlockToDOM"></div><p>Here, &#949; is a sample from a standard Gaussian with zero mean and unit variance.</p><p>Let us consider the Batman example, which we have looked at in some of the previous articles as well:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!OU-d!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1f60ddd-4a68-4071-9b57-ae5738851280_1444x1462.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!OU-d!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1f60ddd-4a68-4071-9b57-ae5738851280_1444x1462.png 424w, https://substackcdn.com/image/fetch/$s_!OU-d!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1f60ddd-4a68-4071-9b57-ae5738851280_1444x1462.png 848w, https://substackcdn.com/image/fetch/$s_!OU-d!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1f60ddd-4a68-4071-9b57-ae5738851280_1444x1462.png 1272w, 
https://substackcdn.com/image/fetch/$s_!OU-d!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1f60ddd-4a68-4071-9b57-ae5738851280_1444x1462.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!OU-d!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1f60ddd-4a68-4071-9b57-ae5738851280_1444x1462.png" width="1444" height="1462" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d1f60ddd-4a68-4071-9b57-ae5738851280_1444x1462.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1462,&quot;width&quot;:1444,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:529285,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183768823?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1f60ddd-4a68-4071-9b57-ae5738851280_1444x1462.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!OU-d!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1f60ddd-4a68-4071-9b57-ae5738851280_1444x1462.png 424w, https://substackcdn.com/image/fetch/$s_!OU-d!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1f60ddd-4a68-4071-9b57-ae5738851280_1444x1462.png 848w, 
https://substackcdn.com/image/fetch/$s_!OU-d!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1f60ddd-4a68-4071-9b57-ae5738851280_1444x1462.png 1272w, https://substackcdn.com/image/fetch/$s_!OU-d!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1f60ddd-4a68-4071-9b57-ae5738851280_1444x1462.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="pullquote"><p>In this image, what we have done is we have taken a pixel from the Batman image and then added noise to that 
pixel. The noise is added by sampling from a Gaussian distribution whose mean is the pixel value and whose standard deviation is the noise level. </p></div><p>We do a sequence of mathematical steps below to calculate the target for our loss function (it is not complicated, trust me!).</p><p>First, we write down the probability distribution of the noisy sample, given the original sample, as follows:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;p_{\\sigma}(\\tilde{x}|x) = \\frac{1}{\\sqrt{2\\pi\\sigma^{2}}}e^{-\\frac{(\\tilde{x} - x)^{2}}{2\\sigma^{2}}}&quot;,&quot;id&quot;:&quot;ETZVXGABLP&quot;}" data-component-name="LatexBlockToDOM"></div><p>This is the density of a Gaussian with mean x and variance &#963;^2.</p><p>Now, if we take the logarithm of the above and then take the derivative with respect to the noisy sample, we get the following: </p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\nabla_{\\tilde{x}}\\text{log}p_{\\sigma}(\\tilde{x} | x) = \\frac {x - \\tilde{x}}{\\sigma^{2}}&quot;,&quot;id&quot;:&quot;ETOBUUZERV&quot;}" data-component-name="LatexBlockToDOM"></div><p>Hence the Denoising Score Matching loss simplifies to:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;L_{DSM} = \\frac{1}{2}||s_{\\phi}(\\tilde{x}) - \\frac{x - \\tilde{x}}{\\sigma^{2}}||^{2}&quot;,&quot;id&quot;:&quot;HLISUEVXLT&quot;}" data-component-name="LatexBlockToDOM"></div><p><em>Because the noisy sample is the original sample plus &#963; times the noise &#949;, the target equals -&#949;/&#963;. So this objective function tells you that you are training your score function to predict the noise that was added to create the noisy sample, up to sign and scale.</em></p><p>In the context of our previous example, this means that you are trying to guess the flick and draw the arrow that reverses it.</p><p>Doesn&#8217;t this remind you of Denoising Diffusion Probabilistic Models (DDPM), where we came to the same conclusion towards the end? Refer to this article written by us: 
[Refer to this article written by us: </p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;6770a097-6b1c-4de1-bf69-35da59883f0c&quot;,&quot;caption&quot;:&quot;Diffusion is the natural tendency of particles (like molecules, heat, or even information) to move and spread out until they are evenly distributed.&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;What exactly are Diffusion Models?&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:136641784,&quot;name&quot;:&quot;Dr Rajat Dandekar&quot;,&quot;bio&quot;:null,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/144539bd-c2b3-4909-8a05-1e9309cc9572_144x144.png&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2025-12-23T09:02:02.896Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/$s_!fEKX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf553fa5-43b1-4201-a708-0fb00a8cdffa_1024x676.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://www.vizuaranewsletter.com/p/what-exactly-are-diffusion-models&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:182400848,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:8,&quot;comment_count&quot;:2,&quot;publication_id&quot;:3591997,&quot;publication_name&quot;:&quot;Vizuara&#8217;s AI Newsletter&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!f_Wd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59481c77-2230-4e2f-b78b-88e9ee5fe9d9_1088x1088.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><p>Let us take a practical 
example to understand how this is implemented:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5C7E!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56b0b21d-8377-41b5-9fb0-6f8c2d22054c_2288x700.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5C7E!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56b0b21d-8377-41b5-9fb0-6f8c2d22054c_2288x700.png 424w, https://substackcdn.com/image/fetch/$s_!5C7E!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56b0b21d-8377-41b5-9fb0-6f8c2d22054c_2288x700.png 848w, https://substackcdn.com/image/fetch/$s_!5C7E!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56b0b21d-8377-41b5-9fb0-6f8c2d22054c_2288x700.png 1272w, https://substackcdn.com/image/fetch/$s_!5C7E!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56b0b21d-8377-41b5-9fb0-6f8c2d22054c_2288x700.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5C7E!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56b0b21d-8377-41b5-9fb0-6f8c2d22054c_2288x700.png" width="1456" height="445" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/56b0b21d-8377-41b5-9fb0-6f8c2d22054c_2288x700.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:445,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:173869,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183768823?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56b0b21d-8377-41b5-9fb0-6f8c2d22054c_2288x700.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!5C7E!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56b0b21d-8377-41b5-9fb0-6f8c2d22054c_2288x700.png 424w, https://substackcdn.com/image/fetch/$s_!5C7E!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56b0b21d-8377-41b5-9fb0-6f8c2d22054c_2288x700.png 848w, https://substackcdn.com/image/fetch/$s_!5C7E!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56b0b21d-8377-41b5-9fb0-6f8c2d22054c_2288x700.png 1272w, https://substackcdn.com/image/fetch/$s_!5C7E!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56b0b21d-8377-41b5-9fb0-6f8c2d22054c_2288x700.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>We will be using the method of Denoising Score Matching to predict the score function for this data distribution:</p><p>Here is the link to the Google Colab notebook which is used for this practical:</p><p><a href="https://miro.com/app/board/uXjVGV4CZTM=/?share_link_id=761394004969">Google Colab Notebook for Denoising Score Matching</a> </p><p>This is how our learned score function behaves after the training is completed:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rM1X!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff93576e3-a7b1-4fbc-b13e-0324753517c5_2667x1499.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source 
type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rM1X!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff93576e3-a7b1-4fbc-b13e-0324753517c5_2667x1499.png 424w, https://substackcdn.com/image/fetch/$s_!rM1X!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff93576e3-a7b1-4fbc-b13e-0324753517c5_2667x1499.png 848w, https://substackcdn.com/image/fetch/$s_!rM1X!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff93576e3-a7b1-4fbc-b13e-0324753517c5_2667x1499.png 1272w, https://substackcdn.com/image/fetch/$s_!rM1X!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff93576e3-a7b1-4fbc-b13e-0324753517c5_2667x1499.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rM1X!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff93576e3-a7b1-4fbc-b13e-0324753517c5_2667x1499.png" width="1456" height="818" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f93576e3-a7b1-4fbc-b13e-0324753517c5_2667x1499.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:818,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:356758,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183768823?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff93576e3-a7b1-4fbc-b13e-0324753517c5_2667x1499.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!rM1X!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff93576e3-a7b1-4fbc-b13e-0324753517c5_2667x1499.png 424w, https://substackcdn.com/image/fetch/$s_!rM1X!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff93576e3-a7b1-4fbc-b13e-0324753517c5_2667x1499.png 848w, https://substackcdn.com/image/fetch/$s_!rM1X!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff93576e3-a7b1-4fbc-b13e-0324753517c5_2667x1499.png 1272w, https://substackcdn.com/image/fetch/$s_!rM1X!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff93576e3-a7b1-4fbc-b13e-0324753517c5_2667x1499.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This looks correct because the arrows point towards the data.</p><p>You might have guessed what we are about to do next.</p><p>Once the score function is learned, let us use it to draw samples with Langevin Dynamics.</p><p>We have already seen the formula for this, which looks as follows:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\tilde{x}_{n+1} = \\tilde{x}_{n} + \\eta s_{\\phi}(\\tilde{x}_{n}) + \\sqrt{2\\eta} \\epsilon_{n}&quot;,&quot;id&quot;:&quot;JXBXFEKZOL&quot;}" data-component-name="LatexBlockToDOM"></div><p>This is our &#8220;drunken hiker&#8221;, who takes noisy steps that drift towards the data samples. 
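</p><p>The Langevin update rule above can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions, not the code from the notebook: the names <code>gaussian_score</code> and <code>langevin_sample</code> are made up for this example, an analytic 1D Gaussian score stands in for the trained network, and the step size, number of steps and number of chains are arbitrary choices.</p>

```python
import random
import statistics


def gaussian_score(x, mean=2.0, std=0.5):
    """Analytic score of N(mean, std^2): d/dx log p(x) = (mean - x) / std^2.

    Assumption: this stands in for the trained score network s_phi.
    """
    return (mean - x) / (std ** 2)


def langevin_sample(score_fn, x0, step_size=0.01, n_steps=1000, rng=None):
    """Iterate x_{n+1} = x_n + eta * s(x_n) + sqrt(2 * eta) * eps_n."""
    rng = rng or random.Random()
    x = x0
    noise_scale = (2.0 * step_size) ** 0.5
    for _ in range(n_steps):
        eps = rng.gauss(0.0, 1.0)  # fresh standard Gaussian noise at every step
        x = x + step_size * score_fn(x) + noise_scale * eps
    return x


rng = random.Random(0)  # fixed seed so the run is repeatable
# Start every chain far from the data (x0 = -3) and let it walk towards it.
samples = [langevin_sample(gaussian_score, x0=-3.0, rng=rng) for _ in range(500)]
sample_mean = statistics.fmean(samples)
sample_std = statistics.pstdev(samples)
print(sample_mean, sample_std)
```

<p>If the walk works as intended, the printed mean and standard deviation should land close to the 2.0 and 0.5 of the stand-in Gaussian. With a trained model, you would pass the network in place of <code>gaussian_score</code>.</p><p>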
See the image below:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rnGB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a9155e2-0ffd-4ae6-be2f-31ac7cc25617_2647x1510.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rnGB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a9155e2-0ffd-4ae6-be2f-31ac7cc25617_2647x1510.png 424w, https://substackcdn.com/image/fetch/$s_!rnGB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a9155e2-0ffd-4ae6-be2f-31ac7cc25617_2647x1510.png 848w, https://substackcdn.com/image/fetch/$s_!rnGB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a9155e2-0ffd-4ae6-be2f-31ac7cc25617_2647x1510.png 1272w, https://substackcdn.com/image/fetch/$s_!rnGB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a9155e2-0ffd-4ae6-be2f-31ac7cc25617_2647x1510.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rnGB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a9155e2-0ffd-4ae6-be2f-31ac7cc25617_2647x1510.png" width="1456" height="831" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2a9155e2-0ffd-4ae6-be2f-31ac7cc25617_2647x1510.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:831,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!rnGB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a9155e2-0ffd-4ae6-be2f-31ac7cc25617_2647x1510.png 424w, https://substackcdn.com/image/fetch/$s_!rnGB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a9155e2-0ffd-4ae6-be2f-31ac7cc25617_2647x1510.png 848w, https://substackcdn.com/image/fetch/$s_!rnGB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a9155e2-0ffd-4ae6-be2f-31ac7cc25617_2647x1510.png 1272w, https://substackcdn.com/image/fetch/$s_!rnGB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a9155e2-0ffd-4ae6-be2f-31ac7cc25617_2647x1510.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p> On applying Langevin Dynamics to the above example, this is what we get:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!h5T4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5324ccb4-9260-41a2-a048-277597309c04_2720x1470.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!h5T4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5324ccb4-9260-41a2-a048-277597309c04_2720x1470.png 424w, https://substackcdn.com/image/fetch/$s_!h5T4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5324ccb4-9260-41a2-a048-277597309c04_2720x1470.png 848w, 
https://substackcdn.com/image/fetch/$s_!h5T4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5324ccb4-9260-41a2-a048-277597309c04_2720x1470.png 1272w, https://substackcdn.com/image/fetch/$s_!h5T4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5324ccb4-9260-41a2-a048-277597309c04_2720x1470.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!h5T4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5324ccb4-9260-41a2-a048-277597309c04_2720x1470.png" width="1456" height="787" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5324ccb4-9260-41a2-a048-277597309c04_2720x1470.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:787,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:419269,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183768823?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5324ccb4-9260-41a2-a048-277597309c04_2720x1470.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!h5T4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5324ccb4-9260-41a2-a048-277597309c04_2720x1470.png 424w, 
https://substackcdn.com/image/fetch/$s_!h5T4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5324ccb4-9260-41a2-a048-277597309c04_2720x1470.png 848w, https://substackcdn.com/image/fetch/$s_!h5T4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5324ccb4-9260-41a2-a048-277597309c04_2720x1470.png 1272w, https://substackcdn.com/image/fetch/$s_!h5T4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5324ccb4-9260-41a2-a048-277597309c04_2720x1470.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The generated distribution does not match the true density perfectly, but its two peaks sit at the same locations as the peaks of the true data distribution, which is exactly what we want.</p><p>That&#8217;s it! </p><p>Here is the link to the original paper that introduced <em>DSM (Denoising Score Matching): <a href="https://www.iro.umontreal.ca/~vincentp/Publications/smdae_techreport.pdf">https://www.iro.umontreal.ca/~vincentp/Publications/smdae_techreport.pdf</a></em></p><p>For more detailed proofs, please refer to the book: <em>The Principles of Diffusion Models From Origins to Advances (<a href="https://arxiv.org/abs/2510.21890">https://arxiv.org/abs/2510.21890</a>) [Pages 68-79]</em></p><p>If you like this content, please check out our bootcamps on the following topics:</p><p><strong>Modern Robot Learning</strong>: <a href="https://robotlearningbootcamp.vizuara.ai/">https://robotlearningbootcamp.vizuara.ai/</a></p><p><strong>GenAI</strong>: <a href="https://flyvidesh.online/gen-ai-professional-bootcamp">https://flyvidesh.online/gen-ai-professional-bootcamp</a></p><p><strong>RL</strong>: <a href="https://rlresearcherbootcamp.vizuara.ai/">https://rlresearcherbootcamp.vizuara.ai/</a></p><p><strong>SciML</strong>: <a href="https://flyvidesh.online/ml-bootcamp">https://flyvidesh.online/ml-bootcamp</a></p><p><strong>ML-DL</strong>: <a href="https://flyvidesh.online/ml-dl-bootcamp">https://flyvidesh.online/ml-dl-bootcamp</a></p><p><strong>CV</strong>: <a href="https://cvresearchbootcamp.vizuara.ai/">https://cvresearchbootcamp.vizuara.ai/</a></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.vizuaranewsletter.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget 
show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Vizuara&#8217;s AI Newsletter! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[How I Shipped an End-to-End ML Anomaly Detection System on the NYC Taxi Dataset (With CI/CD)]]></title><description><![CDATA[In this article, my aim is to explain how you can implement an entire production-ready machine learning project.]]></description><link>https://www.vizuaranewsletter.com/p/how-i-shipped-an-end-to-end-ml-anomaly</link><guid isPermaLink="false">https://www.vizuaranewsletter.com/p/how-i-shipped-an-end-to-end-ml-anomaly</guid><dc:creator><![CDATA[Prathamesh Dinesh Joshi]]></dc:creator><pubDate>Mon, 05 Jan 2026 07:30:52 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Eg1U!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6b30289-1577-4425-a560-3475af1d2dd6_1584x798.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Table of contents</p><ol><li><p>Introduction</p></li><li><p>Repository Organization and Engineering Rationale</p></li><li><p>Data Ingestion and Reproducibility</p></li><li><p>LSTM Autoencoder for Time-Series Reconstruction</p></li><li><p>Training Pipeline and Artifact Generation</p></li><li><p>Batch and Streaming Inference with FastAPI</p><ol><li><p>Model loading and scoring utilities</p></li><li><p>Batch endpoint (offline scoring)</p></li><li><p>Streaming endpoint (online scoring) and the &#8220;window availability&#8221; issue  </p></li></ol></li><li><p>MongoDB 
Logging</p></li><li><p>Visualization via Streamlit Dashboard</p></li><li><p>Monitoring: Prometheus-Compatible Metrics</p></li><li><p>Containerization and Local Orchestration</p></li><li><p>Continuous Integration and Continuous Delivery (CI/CD)</p></li><li><p>Execution Summary (Reproducible Runbook)</p></li><li><p>Limitations and Planned Extensions</p></li></ol><h1>1. Introduction</h1><p>Most anomaly detection projects die in a notebook.</p><p>You train an autoencoder, plot reconstruction error, pick a threshold, and call it a day. 
But the moment you try to use the model like a real system, where data arrives one point at a time, where you need persistence, monitoring, and safe deployments, you realize the &#8220;ML part&#8221; was only 20% of the job.</p><p>So I built this project to answer a simple question:</p><p><strong>What does a production-style anomaly detection pipeline look like end-to-end&#8212;training &#8594; packaging &#8594; streaming inference &#8594; dashboard &#8594; CI/CD?</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Eg1U!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6b30289-1577-4425-a560-3475af1d2dd6_1584x798.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Eg1U!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6b30289-1577-4425-a560-3475af1d2dd6_1584x798.png 424w, https://substackcdn.com/image/fetch/$s_!Eg1U!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6b30289-1577-4425-a560-3475af1d2dd6_1584x798.png 848w, https://substackcdn.com/image/fetch/$s_!Eg1U!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6b30289-1577-4425-a560-3475af1d2dd6_1584x798.png 1272w, https://substackcdn.com/image/fetch/$s_!Eg1U!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6b30289-1577-4425-a560-3475af1d2dd6_1584x798.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Eg1U!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6b30289-1577-4425-a560-3475af1d2dd6_1584x798.png" width="1456" height="734" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d6b30289-1577-4425-a560-3475af1d2dd6_1584x798.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:734,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:809405,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/182070494?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6b30289-1577-4425-a560-3475af1d2dd6_1584x798.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Eg1U!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6b30289-1577-4425-a560-3475af1d2dd6_1584x798.png 424w, https://substackcdn.com/image/fetch/$s_!Eg1U!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6b30289-1577-4425-a560-3475af1d2dd6_1584x798.png 848w, https://substackcdn.com/image/fetch/$s_!Eg1U!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6b30289-1577-4425-a560-3475af1d2dd6_1584x798.png 1272w, https://substackcdn.com/image/fetch/$s_!Eg1U!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd6b30289-1577-4425-a560-3475af1d2dd6_1584x798.png 1456w" 
sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">Source: Kaggle.com. Taxis on a busy street.</figcaption></figure></div><p>This article is based on the actual repo I implemented, using the <strong>NAB NYC Taxi time-series dataset</strong> (<code>nyc_taxi.csv</code>). 
I&#8217;ll explain the architecture, the reasoning behind key decisions, the &#8220;gotchas&#8221; (including the classic streaming bug), and how I wired it into a CI/CD pipeline that ships Docker images automatically.</p><h2>What we are going to build in this article</h2><p>A <strong>time-series anomaly detection system</strong> for NYC Taxi data that supports:</p><ul><li><p><strong>Offline training</strong> (LSTM Autoencoder)</p></li><li><p><strong>Batch inference</strong> (score a whole sequence)</p></li><li><p><strong>Streaming inference</strong> (score one datapoint at a time with a sliding buffer)</p></li><li><p><strong>Persistent logging</strong> (MongoDB)</p></li><li><p><strong>Dashboarding</strong> (Streamlit)</p></li><li><p><strong>Metrics endpoint</strong> (Prometheus)</p></li><li><p><strong>CI + CD pipelines</strong> (GitHub Actions + GHCR Docker image publishing)</p></li></ul><p>And the best part: it&#8217;s runnable locally with Docker Compose.</p><div><hr></div><h1>2. Repository Organization and Engineering Rationale</h1><p>A central design principle was to separate <em>concerns</em> (training, shared model definition, serving, infrastructure, evaluation, and visualization) into clear modules. This reduces coupling, simplifies testing, and enables CI/CD to operate on the serving component without requiring the entire training environment.</p><p>The repository structure is as follows:</p><pre><code>nyc-anomaly-fixed/
&#9500;&#9472; data/
&#9474;  &#9500;&#9472; download_nab.py                 # programmatic retrieval of NAB dataset (nyc_taxi.csv)
&#9474;  &#9492;&#9472; nyc_taxi.csv                    # generated by download script (or provided)
&#9474;
&#9500;&#9472; train/
&#9474;  &#9500;&#9472; config.yaml                     # training hyperparameters and thresholding policy
&#9474;  &#9492;&#9472; train.py                        # training pipeline: preprocessing &#8594; training &#8594; artifacts
&#9474;
&#9500;&#9472; common/
&#9474;  &#9492;&#9472; model_arch.py                   # LSTM Autoencoder architecture shared by train &amp; serve
&#9474;
&#9500;&#9472; app/
&#9474;  &#9500;&#9472; config.py                       # environment-driven application settings
&#9474;  &#9500;&#9472; model.py                        # ModelWrapper: artifact loading and inference utilities
&#9474;  &#9500;&#9472; main.py                         # FastAPI service (batch + streaming endpoints)
&#9474;  &#9492;&#9472; Dockerfile                      # container build for the inference service
&#9474;
&#9500;&#9472; infra/
&#9474;  &#9492;&#9472; docker-compose.yml              # local stack: API + MongoDB
&#9474;
&#9500;&#9472; dashboard/
&#9474;  &#9492;&#9472; streamlit_app.py                # visualization of time series + anomalies from MongoDB
&#9474;
&#9500;&#9472; eval/
&#9474;  &#9500;&#9472; download_nab_labels.py          # optional retrieval of NAB anomaly windows/labels
&#9474;  &#9492;&#9472; evaluate.py                     # evaluation utilities for sanity checking
&#9474;
&#9500;&#9472; models/                            # generated artifacts consumed by the API
&#9474;  &#9500;&#9472; lstm_ae.pth
&#9474;  &#9500;&#9472; scaler.npz
&#9474;  &#9500;&#9472; threshold.txt
&#9474;  &#9492;&#9472; model_meta.json
&#9474;
&#9500;&#9472; tests/
&#9474;  &#9492;&#9472; test_model_wrapper.py           # CI test scaffold (extensible)
&#9474;
&#9500;&#9472; requirements.txt
&#9500;&#9472; requirements-dashboard.txt
&#9474;
&#9492;&#9472; .github/workflows/
   &#9500;&#9472; ci-cd.yml                       # CI: run tests on PR/push
   &#9492;&#9472; cd.yml                          # CD: build &amp; push Docker image to 
                                                                                  </code></pre><p>This layout supports three important engineering requirements:</p><ul><li><p><strong>Reproducibility</strong>: training outputs are serialized into a stable artifact bundle.</p></li><li><p><strong>Deployability</strong>: the service (<code>app/</code>) depends only on artifacts and shared model code (<code>common/</code>).</p></li><li><p><strong>Auditability</strong>: predictions and scores are logged to a database rather than being ephemeral.</p></li></ul><div><hr></div><h2>3. Data Ingestion and Reproducibility</h2><p>The dataset is retrieved programmatically via <code>data/download_nab.py</code>, which ensures that a fresh clone of the repository can reproduce the input data without manual downloads. This is a modest but important practice: by treating dataset retrieval as part of the pipeline, the system becomes easier to validate in CI and easier for other users to reproduce reliably.</p><div><hr></div><h2>4. LSTM Autoencoder for Time-Series Reconstruction</h2><h3>4.1 Rationale</h3><p>The core detector is an <strong>LSTM Autoencoder</strong>, implemented in <code>common/model_arch.py</code>. 
The underlying assumption is standard in reconstruction-based anomaly detection:</p><ul><li><p>The model is trained primarily on typical (dominant) behaviour in the series.</p></li><li><p>At inference, windows inconsistent with learned structure yield higher reconstruction error.</p></li><li><p>Reconstruction error serves as an anomaly score.</p></li></ul><p>This choice was guided by practical considerations: the model is sufficiently expressive for temporal structure, relatively straightforward to implement, and computationally feasible for repeated retraining during development.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Em1m!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee0a5702-8904-403b-b30c-35228ea4c678_1232x730.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Em1m!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee0a5702-8904-403b-b30c-35228ea4c678_1232x730.png 424w, https://substackcdn.com/image/fetch/$s_!Em1m!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee0a5702-8904-403b-b30c-35228ea4c678_1232x730.png 848w, https://substackcdn.com/image/fetch/$s_!Em1m!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee0a5702-8904-403b-b30c-35228ea4c678_1232x730.png 1272w, https://substackcdn.com/image/fetch/$s_!Em1m!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee0a5702-8904-403b-b30c-35228ea4c678_1232x730.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Em1m!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee0a5702-8904-403b-b30c-35228ea4c678_1232x730.png" width="1232" height="730" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ee0a5702-8904-403b-b30c-35228ea4c678_1232x730.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:730,&quot;width&quot;:1232,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:222057,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/182070494?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee0a5702-8904-403b-b30c-35228ea4c678_1232x730.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Em1m!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee0a5702-8904-403b-b30c-35228ea4c678_1232x730.png 424w, https://substackcdn.com/image/fetch/$s_!Em1m!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee0a5702-8904-403b-b30c-35228ea4c678_1232x730.png 848w, https://substackcdn.com/image/fetch/$s_!Em1m!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee0a5702-8904-403b-b30c-35228ea4c678_1232x730.png 1272w, https://substackcdn.com/image/fetch/$s_!Em1m!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee0a5702-8904-403b-b30c-35228ea4c678_1232x730.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Source: medium.co. The LSTM Autoencoder architecture in a nutshell.</figcaption></figure></div><h3>4.2 Windowing</h3><p>The training and inference pipelines operate on sliding windows, where each input to the autoencoder is a contiguous subsequence of length <code>window_size</code> (configured in <code>train/config.yaml</code>). Windowing is not merely a modelling detail; it is the key interface between raw streaming events and the model.</p><div><hr></div><h2>5. 
Training Pipeline and Artifact Generation</h2><p>Training is executed from <code>train/train.py</code>, configured by <code>train/config.yaml</code>. The pipeline performs the following steps:</p><ol><li><p><strong>Load and select signal</strong>: read <code>nyc_taxi.csv</code> and extract the <code>value</code> column.</p></li><li><p><strong>Normalize</strong>: fit a scaler (saved as <code>scaler.npz</code>) so inference uses identical preprocessing.</p></li><li><p><strong>Construct windows</strong>: generate overlapping windows to form training samples.</p></li><li><p><strong>Train LSTM Autoencoder</strong>: optimize reconstruction loss over windows.</p></li><li><p><strong>Compute reconstruction errors</strong>: evaluate window-level errors on representative data.</p></li><li><p><strong>Derive a threshold</strong>: select an anomaly threshold using a percentile policy.</p></li><li><p><strong>Save artifacts</strong>: serialize the model, scaler, threshold, and metadata.A simplified representation of the thresholding policy is:</p><pre><code># conceptual illustration of the training thresholding step
errors = reconstruction_errors_over_windows
threshold = np.percentile(errors, 85.0)   # configurable in train/config.yaml
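#
# --- Runnable sketch of the same percentile policy (illustrative only) ---
# Assumptions, none taken from the project code: the toy series, WINDOW_SIZE, and the
# mean-based stand-in for the LSTM autoencoder's reconstruction are all hypothetical.
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

WINDOW_SIZE = 4  # hypothetical; the real value is window_size in train/config.yaml
series = np.array([1.0, 1.1, 0.9, 1.0, 1.2, 5.0, 1.0, 0.9, 1.1, 1.0])

# Overlapping windows with stride 1, shape (n_windows, WINDOW_SIZE).
windows = sliding_window_view(series, WINDOW_SIZE)

# Stand-in reconstruction: every window is "reconstructed" as its own mean.
reconstructed = np.repeat(windows.mean(axis=1, keepdims=True), WINDOW_SIZE, axis=1)

# Window-level reconstruction error: mean squared error per window.
window_errors = np.mean((windows - reconstructed) ** 2, axis=1)

# Percentile policy: windows whose error exceeds the cut-off are flagged.
sketch_threshold = np.percentile(window_errors, 85.0)
flags = window_errors > sketch_threshold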
</code></pre><h3>5.1 Why artifact bundling matters</h3><p>The training stage produces not only a model but a <strong>complete artifact bundle</strong> required for correct inference:</p><ul><li><p><code>lstm_ae.pth</code> &#8212; model weights</p></li><li><p><code>scaler.npz</code> &#8212; normalization parameters</p></li><li><p><code>threshold.txt</code> &#8212; anomaly threshold</p></li><li><p><code>model_meta.json</code> &#8212; model metadata (notably window size and configuration)</p></li></ul><p>This design avoids a common operational failure mode: deploying a model without its exact preprocessing and thresholding context.</p><div><hr></div></li></ol><h2>6. Batch and Streaming Inference with FastAPI</h2><h3>a) Model loading and scoring utilities</h3><p>Inference is encapsulated in <code>app/model.py</code> via a <code>ModelWrapper</code>, which loads the artifacts from <code>models/</code> and exposes scoring functions. This is an intentional boundary: API code should not re-implement preprocessing or scoring logic.</p><h3>b) Batch endpoint (offline scoring)</h3><p>The batch endpoint supports scoring of a provided sequence, enabling offline evaluation and experimentation. 
Batch scoring is particularly useful during development and for dataset-level assessments.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eCgV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68974338-d4df-4215-80a6-691e03f1ecea_1298x884.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!eCgV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68974338-d4df-4215-80a6-691e03f1ecea_1298x884.png 424w, https://substackcdn.com/image/fetch/$s_!eCgV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68974338-d4df-4215-80a6-691e03f1ecea_1298x884.png 848w, https://substackcdn.com/image/fetch/$s_!eCgV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68974338-d4df-4215-80a6-691e03f1ecea_1298x884.png 1272w, https://substackcdn.com/image/fetch/$s_!eCgV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68974338-d4df-4215-80a6-691e03f1ecea_1298x884.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!eCgV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68974338-d4df-4215-80a6-691e03f1ecea_1298x884.png" width="567" height="386.15408320493066" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/68974338-d4df-4215-80a6-691e03f1ecea_1298x884.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:884,&quot;width&quot;:1298,&quot;resizeWidth&quot;:567,&quot;bytes&quot;:587762,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/182070494?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68974338-d4df-4215-80a6-691e03f1ecea_1298x884.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!eCgV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68974338-d4df-4215-80a6-691e03f1ecea_1298x884.png 424w, https://substackcdn.com/image/fetch/$s_!eCgV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68974338-d4df-4215-80a6-691e03f1ecea_1298x884.png 848w, https://substackcdn.com/image/fetch/$s_!eCgV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68974338-d4df-4215-80a6-691e03f1ecea_1298x884.png 1272w, https://substackcdn.com/image/fetch/$s_!eCgV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68974338-d4df-4215-80a6-691e03f1ecea_1298x884.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Batch processing using FastAPI. Source: medium.com</figcaption></figure></div><h3>c) Streaming endpoint (online scoring) and the &#8220;window availability&#8221; issue</h3><p>A central operational challenge in streaming anomaly detection is that a reconstruction model requires a <strong>full window</strong> to produce a meaningful score. 
If an API expects sequences but receives one point at a time, it may never accumulate enough observations to score&#8212;often leading to a misleading system that outputs &#8220;normal&#8221; indefinitely.</p><p>To address this, the service implements a streaming endpoint:</p><ul><li><p><code>POST /predict_point</code></p></li></ul><p>This endpoint maintains a per-stream buffer (indexed by <code>stream_id</code>) and returns a warm-up state until enough samples are available. Conceptually:</p><pre><code># conceptual illustration of streaming buffering
buffer[stream_id].append(value)

if len(buffer[stream_id]) &lt; window_size:
    return {"label": "warmup"}

window = last_window(buffer[stream_id])
error = score(window)
label = "anomaly" if error &gt; threshold else "normal"
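#
# --- Runnable sketch of the buffering logic above (illustrative only) ---
# Assumptions, none taken from the project code: WINDOW_SIZE, THRESHOLD,
# score_window_sketch() and predict_point_sketch() are hypothetical stand-ins;
# the real service scores each window with the trained LSTM autoencoder.
from collections import defaultdict, deque

import numpy as np

WINDOW_SIZE = 4   # hypothetical; the real value is stored in model_meta.json
THRESHOLD = 1.0   # hypothetical; the real value is stored in threshold.txt

# One bounded buffer per stream; deque(maxlen=...) discards the oldest sample itself.
buffers = defaultdict(lambda: deque(maxlen=WINDOW_SIZE))


def score_window_sketch(window):
    """Stand-in score: mean squared deviation of the window from its own mean."""
    arr = np.asarray(window, dtype=float)
    return float(np.mean((arr - arr.mean()) ** 2))


def predict_point_sketch(stream_id, value):
    """Append one observation and score only once a full window is buffered."""
    buf = buffers[stream_id]
    buf.append(value)
    if len(buf) != WINDOW_SIZE:  # buffer not yet full: still warming up
        return {"label": "warmup"}
    error = score_window_sketch(buf)
    return {"label": "anomaly" if error > THRESHOLD else "normal", "error": error}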
</code></pre><p>This is an essential engineering feature: it aligns the inference contract with real telemetry settings, where observations arrive sequentially.</p><h2>7. MongoDB Logging</h2><p>The system logs each prediction to MongoDB (configured via <code>app/config.py</code> and orchestrated locally via <code>infra/docker-compose.yml</code>). Stored attributes include:</p><ul><li><p>timestamp</p></li><li><p>stream identifier</p></li><li><p>observed value</p></li><li><p>reconstruction error</p></li><li><p>anomaly flag / label</p></li><li><p>inference mode (stream vs batch)</p></li></ul><p>This logging enables:</p><ul><li><p>retrospective debugging (why did we flag this point?)</p></li><li><p>dashboard visualization</p></li><li><p>downstream reporting and evaluation</p></li><li><p>auditability for operational use</p></li></ul><div><hr></div><h2>8. Visualization via Streamlit Dashboard</h2><p>The Streamlit application (<code>dashboard/streamlit_app.py</code>) queries MongoDB to display:</p><ul><li><p>time-series trajectory</p></li><li><p>anomaly markers</p></li><li><p>reconstruction error over time</p></li><li><p>recent prediction records</p></li></ul><p>The dashboard is intentionally lightweight; its function is to support rapid inspection and operational verification without requiring specialized observability infrastructure.</p><div><hr></div><h2>9. Monitoring: Prometheus-Compatible Metrics</h2><p>The API exposes a metrics endpoint (<code>/metrics</code>) suitable for Prometheus scraping. At minimum, this supports:</p><ul><li><p>counts of predictions by class (normal vs anomaly)</p></li><li><p>latency distributions for inference endpoints</p></li></ul><p>This provides a practical bridge to production monitoring systems (Prometheus/Grafana) without overcomplicating the initial implementation.</p><div><hr></div><h2>10. Containerization and Local Orchestration</h2><p>The inference service is containerized via <code>app/Dockerfile</code>. 
Local orchestration is provided via <code>infra/docker-compose.yml</code>, which runs:</p><ul><li><p>MongoDB</p></li><li><p>the FastAPI service</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5eGO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2fb7ce4-d6c8-4e4f-8129-7d265fae826b_1822x502.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5eGO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2fb7ce4-d6c8-4e4f-8129-7d265fae826b_1822x502.png 424w, https://substackcdn.com/image/fetch/$s_!5eGO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2fb7ce4-d6c8-4e4f-8129-7d265fae826b_1822x502.png 848w, https://substackcdn.com/image/fetch/$s_!5eGO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2fb7ce4-d6c8-4e4f-8129-7d265fae826b_1822x502.png 1272w, https://substackcdn.com/image/fetch/$s_!5eGO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2fb7ce4-d6c8-4e4f-8129-7d265fae826b_1822x502.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5eGO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2fb7ce4-d6c8-4e4f-8129-7d265fae826b_1822x502.png" width="1456" height="401" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d2fb7ce4-d6c8-4e4f-8129-7d265fae826b_1822x502.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:401,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:224412,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/182070494?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2fb7ce4-d6c8-4e4f-8129-7d265fae826b_1822x502.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!5eGO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2fb7ce4-d6c8-4e4f-8129-7d265fae826b_1822x502.png 424w, https://substackcdn.com/image/fetch/$s_!5eGO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2fb7ce4-d6c8-4e4f-8129-7d265fae826b_1822x502.png 848w, https://substackcdn.com/image/fetch/$s_!5eGO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2fb7ce4-d6c8-4e4f-8129-7d265fae826b_1822x502.png 1272w, https://substackcdn.com/image/fetch/$s_!5eGO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2fb7ce4-d6c8-4e4f-8129-7d265fae826b_1822x502.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The image above shows two services running in Docker: MongoDB as the database, and the API, which receives requests and returns responses.</figcaption></figure></div></li></ul><p>A notable design choice is that <code>models/</code> is mounted into the service container. This permits:</p><ul><li><p>retraining and artifact replacement without rebuilding the service image</p></li><li><p>clear separation between model development and service deployment</p></li></ul><div><hr></div><h2>11. Continuous Integration and Continuous Delivery (CI/CD)</h2><h3>11.1 Continuous Integration</h3><p>The CI workflow (<code>.github/workflows/ci-cd.yml</code>) executes on pushes and pull requests, running <code>pytest</code>. 
While the test suite is presently a scaffold, the workflow is correctly positioned to enforce:</p><ul><li><p>artifact loading checks</p></li><li><p>scoring sanity checks</p></li><li><p>endpoint smoke tests</p></li><li><p>schema validation for request/response payloads</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!n7aX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa43c230c-ddaf-48e1-be04-3ac22e988561_1140x362.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!n7aX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa43c230c-ddaf-48e1-be04-3ac22e988561_1140x362.png 424w, https://substackcdn.com/image/fetch/$s_!n7aX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa43c230c-ddaf-48e1-be04-3ac22e988561_1140x362.png 848w, https://substackcdn.com/image/fetch/$s_!n7aX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa43c230c-ddaf-48e1-be04-3ac22e988561_1140x362.png 1272w, https://substackcdn.com/image/fetch/$s_!n7aX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa43c230c-ddaf-48e1-be04-3ac22e988561_1140x362.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!n7aX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa43c230c-ddaf-48e1-be04-3ac22e988561_1140x362.png" width="1140" height="362" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a43c230c-ddaf-48e1-be04-3ac22e988561_1140x362.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:362,&quot;width&quot;:1140,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:49803,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/182070494?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa43c230c-ddaf-48e1-be04-3ac22e988561_1140x362.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!n7aX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa43c230c-ddaf-48e1-be04-3ac22e988561_1140x362.png 424w, https://substackcdn.com/image/fetch/$s_!n7aX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa43c230c-ddaf-48e1-be04-3ac22e988561_1140x362.png 848w, https://substackcdn.com/image/fetch/$s_!n7aX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa43c230c-ddaf-48e1-be04-3ac22e988561_1140x362.png 1272w, https://substackcdn.com/image/fetch/$s_!n7aX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa43c230c-ddaf-48e1-be04-3ac22e988561_1140x362.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Demonstration of Continuous Integration (CI) 
via GitHub Actions</figcaption></figure></div></li></ul><h3>11.2 Continuous Delivery: build and publish container images</h3><p>The CD workflow (<code>.github/workflows/cd.yml</code>) builds and publishes the service container image to GitHub Container Registry (GHCR) on merges to <code>main</code>, tagging:</p><ul><li><p><code>latest</code></p></li><li><p>the commit SHA</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-8BR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b029dab-4af7-4660-b678-5808e772f7c9_1532x390.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-8BR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b029dab-4af7-4660-b678-5808e772f7c9_1532x390.png 424w, https://substackcdn.com/image/fetch/$s_!-8BR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b029dab-4af7-4660-b678-5808e772f7c9_1532x390.png 848w, https://substackcdn.com/image/fetch/$s_!-8BR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b029dab-4af7-4660-b678-5808e772f7c9_1532x390.png 1272w, https://substackcdn.com/image/fetch/$s_!-8BR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b029dab-4af7-4660-b678-5808e772f7c9_1532x390.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-8BR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b029dab-4af7-4660-b678-5808e772f7c9_1532x390.png" width="1456" height="371" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1b029dab-4af7-4660-b678-5808e772f7c9_1532x390.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:371,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:66564,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/182070494?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b029dab-4af7-4660-b678-5808e772f7c9_1532x390.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-8BR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b029dab-4af7-4660-b678-5808e772f7c9_1532x390.png 424w, https://substackcdn.com/image/fetch/$s_!-8BR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b029dab-4af7-4660-b678-5808e772f7c9_1532x390.png 848w, https://substackcdn.com/image/fetch/$s_!-8BR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b029dab-4af7-4660-b678-5808e772f7c9_1532x390.png 1272w, https://substackcdn.com/image/fetch/$s_!-8BR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b029dab-4af7-4660-b678-5808e772f7c9_1532x390.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Actual Code Snippet from our Project.</figcaption></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!IZJ5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2eacc111-63b2-4617-b92f-7b77a53c594b_1070x350.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!IZJ5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2eacc111-63b2-4617-b92f-7b77a53c594b_1070x350.png 424w, 
https://substackcdn.com/image/fetch/$s_!IZJ5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2eacc111-63b2-4617-b92f-7b77a53c594b_1070x350.png 848w, https://substackcdn.com/image/fetch/$s_!IZJ5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2eacc111-63b2-4617-b92f-7b77a53c594b_1070x350.png 1272w, https://substackcdn.com/image/fetch/$s_!IZJ5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2eacc111-63b2-4617-b92f-7b77a53c594b_1070x350.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!IZJ5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2eacc111-63b2-4617-b92f-7b77a53c594b_1070x350.png" width="1070" height="350" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2eacc111-63b2-4617-b92f-7b77a53c594b_1070x350.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:350,&quot;width&quot;:1070,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:52806,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/182070494?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2eacc111-63b2-4617-b92f-7b77a53c594b_1070x350.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!IZJ5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2eacc111-63b2-4617-b92f-7b77a53c594b_1070x350.png 424w, https://substackcdn.com/image/fetch/$s_!IZJ5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2eacc111-63b2-4617-b92f-7b77a53c594b_1070x350.png 848w, https://substackcdn.com/image/fetch/$s_!IZJ5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2eacc111-63b2-4617-b92f-7b77a53c594b_1070x350.png 1272w, https://substackcdn.com/image/fetch/$s_!IZJ5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2eacc111-63b2-4617-b92f-7b77a53c594b_1070x350.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Demonstration of the CI and CD workflows both working correctly.</figcaption></figure></div></li></ul><p>This ensures each release is reproducible, traceable, and readily deployable.</p><div><hr></div><h2>12. Execution Summary (Reproducible Runbook)</h2><p><strong>1) Download dataset</strong></p><pre><code><code>python data/download_nab.py --out data/nyc_taxi.csv
</code></code></pre><p><strong>2) Train and generate artifacts</strong></p><pre><code><code>python train/train.py --config train/config.yaml
</code></code></pre><p><strong>3) Run the local stack</strong></p><pre><code><code>docker-compose -f infra/docker-compose.yml up --build
</code></code></pre><p><strong>4) Streaming inference</strong></p><pre><code><code>curl -X POST http://localhost:8000/predict_point \
  -H "Content-Type: application/json" \
  -d '{"stream_id":"default","value":23.1}'
</code></code></pre><p><strong>5) Run the dashboard</strong></p><pre><code><code>streamlit run dashboard/streamlit_app.py
</code></code></pre><div><hr></div><h2>13. Limitations and Planned Extensions</h2><p>While the system is end-to-end and deployable, several enhancements would be appropriate for production at scale:</p><ul><li><p><strong>Distributed streaming state</strong>: replace in-memory buffers with Redis to support multiple API replicas.</p></li><li><p><strong>Model versioning</strong>: store artifact version/commit hash in prediction logs.</p></li><li><p><strong>Robust evaluation</strong>: incorporate NAB label windows and compute precision/recall/F1 under standardized protocols.</p></li><li><p><strong>Drift monitoring</strong>: track rolling distributions of reconstruction error and raw values.</p></li><li><p><strong>Deployment automation</strong>: extend CD to deploy to a managed runtime (Cloud Run, ECS, Kubernetes).</p></li></ul><h2>Conclusion</h2><p>This project demonstrates that effective anomaly detection requires both modelling and systems engineering. The LSTM Autoencoder provides a principled reconstruction-based detector, but the distinguishing contribution of the implementation is the operationalization: artifact bundling, streaming-safe inference, persistence, observability hooks, containerization, and automated CI/CD.</p><p>To know more about project and implementation detail please watch the video</p><div id="youtube2-r73fsl1G44Q" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;r73fsl1G44Q&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/r73fsl1G44Q?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><div class="subscription-widget-wrap-editor" 
data-attrs="{&quot;url&quot;:&quot;https://www.vizuaranewsletter.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Vizuara&#8217;s AI Newsletter! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[The Transformer Architecture - Application to Robotics]]></title><description><![CDATA[We have a detailed look at the transformer encoder and decoder architecture. Then we will look at Vision Transformers and their importance in the field of Robotics.]]></description><link>https://www.vizuaranewsletter.com/p/the-transformer-architecture-application</link><guid isPermaLink="false">https://www.vizuaranewsletter.com/p/the-transformer-architecture-application</guid><dc:creator><![CDATA[Dr Rajat Dandekar]]></dc:creator><pubDate>Sat, 03 Jan 2026 08:21:42 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Grys!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa2c94a7-afd3-4af9-bfd7-8534d9bc27c0_2318x1276.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Let us say that our task is to teach AI a simple rule: Swap the adjective and the noun.</p><p>For example:</p>
<p><strong>Input</strong>: red car</p><p><strong>Output</strong>: car red</p><p>How will you solve this problem?</p><p>Let us understand how we can do this using the transformer architecture.</p><p>First we look at the Transformer Encoder.</p><h3>Transformer Encoder</h3><p><strong>Step 1: Our Vocabulary</strong></p><p>First, we will create our vocabulary:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dbQG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2563f294-ba69-4219-a38c-0809994279d1_7615x525.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dbQG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2563f294-ba69-4219-a38c-0809994279d1_7615x525.png 424w, https://substackcdn.com/image/fetch/$s_!dbQG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2563f294-ba69-4219-a38c-0809994279d1_7615x525.png 848w, 
https://substackcdn.com/image/fetch/$s_!dbQG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2563f294-ba69-4219-a38c-0809994279d1_7615x525.png 1272w, https://substackcdn.com/image/fetch/$s_!dbQG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2563f294-ba69-4219-a38c-0809994279d1_7615x525.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dbQG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2563f294-ba69-4219-a38c-0809994279d1_7615x525.png" width="1456" height="100" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2563f294-ba69-4219-a38c-0809994279d1_7615x525.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:100,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:542805,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183031673?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2563f294-ba69-4219-a38c-0809994279d1_7615x525.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!dbQG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2563f294-ba69-4219-a38c-0809994279d1_7615x525.png 424w, 
https://substackcdn.com/image/fetch/$s_!dbQG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2563f294-ba69-4219-a38c-0809994279d1_7615x525.png 848w, https://substackcdn.com/image/fetch/$s_!dbQG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2563f294-ba69-4219-a38c-0809994279d1_7615x525.png 1272w, https://substackcdn.com/image/fetch/$s_!dbQG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2563f294-ba69-4219-a38c-0809994279d1_7615x525.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>It looks like we have 8 + 8 = 16 words (tokens) in our vocabulary.</p><p>However, we also need a few more tokens, known as &#8220;special tokens&#8221;.</p><p>These are the special tokens we will use:</p><p><em>&lt;bos&gt;: Tells the decoder to start generating</em></p><p><em>&lt;eos&gt;: Tells the decoder to stop generating</em></p><p><em>&lt;pad&gt;: Used to make all sentences in a batch the same length</em></p><p><strong>Step 2: Create Embeddings</strong></p><p>Next, we create embeddings for our tokens. 
There are two types of embeddings - position embeddings and token embeddings.</p><p>For example, if we consider the token &#8220;red&#8221;, after applying the embedding transformation, it might look as follows (we are assuming an embedding dimension of 3):</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pHBV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dcf8217-26dd-4836-ba2b-996511304a75_932x662.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pHBV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dcf8217-26dd-4836-ba2b-996511304a75_932x662.png 424w, https://substackcdn.com/image/fetch/$s_!pHBV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dcf8217-26dd-4836-ba2b-996511304a75_932x662.png 848w, https://substackcdn.com/image/fetch/$s_!pHBV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dcf8217-26dd-4836-ba2b-996511304a75_932x662.png 1272w, https://substackcdn.com/image/fetch/$s_!pHBV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dcf8217-26dd-4836-ba2b-996511304a75_932x662.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pHBV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dcf8217-26dd-4836-ba2b-996511304a75_932x662.png" width="448" height="318.2145922746781" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2dcf8217-26dd-4836-ba2b-996511304a75_932x662.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:662,&quot;width&quot;:932,&quot;resizeWidth&quot;:448,&quot;bytes&quot;:27448,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183031673?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dcf8217-26dd-4836-ba2b-996511304a75_932x662.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!pHBV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dcf8217-26dd-4836-ba2b-996511304a75_932x662.png 424w, https://substackcdn.com/image/fetch/$s_!pHBV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dcf8217-26dd-4836-ba2b-996511304a75_932x662.png 848w, https://substackcdn.com/image/fetch/$s_!pHBV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dcf8217-26dd-4836-ba2b-996511304a75_932x662.png 1272w, https://substackcdn.com/image/fetch/$s_!pHBV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2dcf8217-26dd-4836-ba2b-996511304a75_932x662.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p><strong>Step 3: Convert each embedding to queries, keys, and values</strong></p><p>Every embedding is converted to queries, keys, and values. 
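</p><p>As a rough sketch of this step (not the implementation behind the figures), each embedding can be multiplied by three separate weight matrices to produce its query, key, and value. The two-token input, the random weights, and the helper names <code>random_matrix</code> and <code>matmul</code> are assumptions made purely for illustration; the embedding dimension of 3 follows the example above.</p><pre><code><code>import random

random.seed(0)
d_model = 3  # embedding dimension, as in the example above


def random_matrix(rows, cols):
    """Stand-in for learned weights: small random numbers (illustrative only)."""
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def matmul(a, b):
    """Plain matrix product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# One embedding row per token: hypothetical stand-ins for "red" and "car".
X = random_matrix(2, d_model)

# Three separate projection matrices (random here, learned in a real model).
W_q, W_k, W_v = (random_matrix(d_model, d_model) for _ in range(3))

Q = matmul(X, W_q)  # queries, one row per token
K = matmul(X, W_k)  # keys, one row per token
V = matmul(X, W_v)  # values, one row per token

print(len(Q), len(Q[0]))  # prints: 2 3
</code></code></pre><p>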
This can be represented visually as follows:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZH1_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae36186a-85d1-4da7-ae74-4b5669b2f9da_2204x662.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZH1_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae36186a-85d1-4da7-ae74-4b5669b2f9da_2204x662.png 424w, https://substackcdn.com/image/fetch/$s_!ZH1_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae36186a-85d1-4da7-ae74-4b5669b2f9da_2204x662.png 848w, https://substackcdn.com/image/fetch/$s_!ZH1_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae36186a-85d1-4da7-ae74-4b5669b2f9da_2204x662.png 1272w, https://substackcdn.com/image/fetch/$s_!ZH1_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae36186a-85d1-4da7-ae74-4b5669b2f9da_2204x662.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZH1_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae36186a-85d1-4da7-ae74-4b5669b2f9da_2204x662.png" width="1456" height="437" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ae36186a-85d1-4da7-ae74-4b5669b2f9da_2204x662.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:437,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:90925,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183031673?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae36186a-85d1-4da7-ae74-4b5669b2f9da_2204x662.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!ZH1_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae36186a-85d1-4da7-ae74-4b5669b2f9da_2204x662.png 424w, https://substackcdn.com/image/fetch/$s_!ZH1_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae36186a-85d1-4da7-ae74-4b5669b2f9da_2204x662.png 848w, https://substackcdn.com/image/fetch/$s_!ZH1_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae36186a-85d1-4da7-ae74-4b5669b2f9da_2204x662.png 1272w, https://substackcdn.com/image/fetch/$s_!ZH1_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae36186a-85d1-4da7-ae74-4b5669b2f9da_2204x662.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p><strong>Step 4: Calculate the attention matrix</strong></p><p>Each query is compared with every key through a (scaled) dot product, and a softmax turns each row of scores into values that sum to 1.</p><p>We can represent the attention matrix schematically as follows:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9WrM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9efee8f9-bcc4-46d5-bf3a-5d44b33fd16d_812x946.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9WrM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9efee8f9-bcc4-46d5-bf3a-5d44b33fd16d_812x946.png 424w, 
https://substackcdn.com/image/fetch/$s_!9WrM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9efee8f9-bcc4-46d5-bf3a-5d44b33fd16d_812x946.png 848w, https://substackcdn.com/image/fetch/$s_!9WrM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9efee8f9-bcc4-46d5-bf3a-5d44b33fd16d_812x946.png 1272w, https://substackcdn.com/image/fetch/$s_!9WrM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9efee8f9-bcc4-46d5-bf3a-5d44b33fd16d_812x946.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9WrM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9efee8f9-bcc4-46d5-bf3a-5d44b33fd16d_812x946.png" width="292" height="340.1871921182266" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9efee8f9-bcc4-46d5-bf3a-5d44b33fd16d_812x946.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:946,&quot;width&quot;:812,&quot;resizeWidth&quot;:292,&quot;bytes&quot;:70969,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183031673?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9efee8f9-bcc4-46d5-bf3a-5d44b33fd16d_812x946.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" 
srcset="https://substackcdn.com/image/fetch/$s_!9WrM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9efee8f9-bcc4-46d5-bf3a-5d44b33fd16d_812x946.png 424w, https://substackcdn.com/image/fetch/$s_!9WrM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9efee8f9-bcc4-46d5-bf3a-5d44b33fd16d_812x946.png 848w, https://substackcdn.com/image/fetch/$s_!9WrM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9efee8f9-bcc4-46d5-bf3a-5d44b33fd16d_812x946.png 1272w, https://substackcdn.com/image/fetch/$s_!9WrM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9efee8f9-bcc4-46d5-bf3a-5d44b33fd16d_812x946.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p>The color of each circle represents the magnitude of the attention score: a brighter color means a larger score, and a paler color means a smaller one. These values lie between 0 and 1.</p><p><strong>Step 5: Getting the context vectors</strong></p><p>Once we have the attention scores, we combine them with the value vectors of all the tokens to get a context vector for each token.</p><p>This can be represented schematically as follows:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yJBi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32645011-382c-4708-82c8-fb8b5c5d9f89_1364x728.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yJBi!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32645011-382c-4708-82c8-fb8b5c5d9f89_1364x728.png 424w, https://substackcdn.com/image/fetch/$s_!yJBi!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32645011-382c-4708-82c8-fb8b5c5d9f89_1364x728.png 848w, https://substackcdn.com/image/fetch/$s_!yJBi!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32645011-382c-4708-82c8-fb8b5c5d9f89_1364x728.png 1272w, 
https://substackcdn.com/image/fetch/$s_!yJBi!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32645011-382c-4708-82c8-fb8b5c5d9f89_1364x728.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yJBi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32645011-382c-4708-82c8-fb8b5c5d9f89_1364x728.png" width="1364" height="728" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/32645011-382c-4708-82c8-fb8b5c5d9f89_1364x728.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:728,&quot;width&quot;:1364,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:80647,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183031673?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32645011-382c-4708-82c8-fb8b5c5d9f89_1364x728.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!yJBi!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32645011-382c-4708-82c8-fb8b5c5d9f89_1364x728.png 424w, https://substackcdn.com/image/fetch/$s_!yJBi!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32645011-382c-4708-82c8-fb8b5c5d9f89_1364x728.png 848w, 
https://substackcdn.com/image/fetch/$s_!yJBi!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32645011-382c-4708-82c8-fb8b5c5d9f89_1364x728.png 1272w, https://substackcdn.com/image/fetch/$s_!yJBi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32645011-382c-4708-82c8-fb8b5c5d9f89_1364x728.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Observe what we are doing here very closely. 
For the token corresponding to the word &#8220;The&#8221;, we take all the attention scores corresponding to the query &#8220;The&#8221;, multiply each of them by the corresponding value vector, and add up the results.</p><p>The final context vectors for all the tokens would look something like the following (assuming a dimension of three):</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6Pc0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8d7c6f6-c97d-4db5-9391-39b576b3f165_1654x754.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6Pc0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8d7c6f6-c97d-4db5-9391-39b576b3f165_1654x754.png 424w, https://substackcdn.com/image/fetch/$s_!6Pc0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8d7c6f6-c97d-4db5-9391-39b576b3f165_1654x754.png 848w, https://substackcdn.com/image/fetch/$s_!6Pc0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8d7c6f6-c97d-4db5-9391-39b576b3f165_1654x754.png 1272w, https://substackcdn.com/image/fetch/$s_!6Pc0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8d7c6f6-c97d-4db5-9391-39b576b3f165_1654x754.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6Pc0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8d7c6f6-c97d-4db5-9391-39b576b3f165_1654x754.png" width="1456" height="664" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f8d7c6f6-c97d-4db5-9391-39b576b3f165_1654x754.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:664,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:69388,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183031673?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8d7c6f6-c97d-4db5-9391-39b576b3f165_1654x754.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!6Pc0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8d7c6f6-c97d-4db5-9391-39b576b3f165_1654x754.png 424w, https://substackcdn.com/image/fetch/$s_!6Pc0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8d7c6f6-c97d-4db5-9391-39b576b3f165_1654x754.png 848w, https://substackcdn.com/image/fetch/$s_!6Pc0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8d7c6f6-c97d-4db5-9391-39b576b3f165_1654x754.png 1272w, https://substackcdn.com/image/fetch/$s_!6Pc0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8d7c6f6-c97d-4db5-9391-39b576b3f165_1654x754.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p><strong>Step 6: Passing the context vectors to the Feed-Forward Neural Network</strong></p><p>In the next step, we pass all these context vectors through a feed-forward neural network, which first increases the dimension of each vector and then brings it back down.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lB14!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F052312c8-fee0-44e2-adb4-2519725ea17e_2076x1138.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!lB14!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F052312c8-fee0-44e2-adb4-2519725ea17e_2076x1138.png 424w, https://substackcdn.com/image/fetch/$s_!lB14!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F052312c8-fee0-44e2-adb4-2519725ea17e_2076x1138.png 848w, https://substackcdn.com/image/fetch/$s_!lB14!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F052312c8-fee0-44e2-adb4-2519725ea17e_2076x1138.png 1272w, https://substackcdn.com/image/fetch/$s_!lB14!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F052312c8-fee0-44e2-adb4-2519725ea17e_2076x1138.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lB14!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F052312c8-fee0-44e2-adb4-2519725ea17e_2076x1138.png" width="1456" height="798" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/052312c8-fee0-44e2-adb4-2519725ea17e_2076x1138.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:798,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:107024,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183031673?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F052312c8-fee0-44e2-adb4-2519725ea17e_2076x1138.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!lB14!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F052312c8-fee0-44e2-adb4-2519725ea17e_2076x1138.png 424w, https://substackcdn.com/image/fetch/$s_!lB14!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F052312c8-fee0-44e2-adb4-2519725ea17e_2076x1138.png 848w, https://substackcdn.com/image/fetch/$s_!lB14!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F052312c8-fee0-44e2-adb4-2519725ea17e_2076x1138.png 1272w, https://substackcdn.com/image/fetch/$s_!lB14!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F052312c8-fee0-44e2-adb4-2519725ea17e_2076x1138.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" 
stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Here, the vector for each token is first expanded to 6 dimensions and then brought back down to 3.</p><p>Here is how we can visualize these steps in a single diagram:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KLUB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe86c0f9c-82b2-4cf6-b8ca-9742b47be8b9_1628x1464.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KLUB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe86c0f9c-82b2-4cf6-b8ca-9742b47be8b9_1628x1464.png 424w, https://substackcdn.com/image/fetch/$s_!KLUB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe86c0f9c-82b2-4cf6-b8ca-9742b47be8b9_1628x1464.png 848w, https://substackcdn.com/image/fetch/$s_!KLUB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe86c0f9c-82b2-4cf6-b8ca-9742b47be8b9_1628x1464.png 1272w, https://substackcdn.com/image/fetch/$s_!KLUB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe86c0f9c-82b2-4cf6-b8ca-9742b47be8b9_1628x1464.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!KLUB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe86c0f9c-82b2-4cf6-b8ca-9742b47be8b9_1628x1464.png" width="486" height="436.9326923076923" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e86c0f9c-82b2-4cf6-b8ca-9742b47be8b9_1628x1464.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1309,&quot;width&quot;:1456,&quot;resizeWidth&quot;:486,&quot;bytes&quot;:185644,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183031673?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe86c0f9c-82b2-4cf6-b8ca-9742b47be8b9_1628x1464.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!KLUB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe86c0f9c-82b2-4cf6-b8ca-9742b47be8b9_1628x1464.png 424w, https://substackcdn.com/image/fetch/$s_!KLUB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe86c0f9c-82b2-4cf6-b8ca-9742b47be8b9_1628x1464.png 848w, https://substackcdn.com/image/fetch/$s_!KLUB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe86c0f9c-82b2-4cf6-b8ca-9742b47be8b9_1628x1464.png 1272w, 
https://substackcdn.com/image/fetch/$s_!KLUB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe86c0f9c-82b2-4cf6-b8ca-9742b47be8b9_1628x1464.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>What we have until now is the Transformer Encoder. 
The goal of the Encoder is to read the source sentence (red car) and create a rich numerical representation of it, one that captures the context of each word.</p><p>The paper &#8220;Attention Is All You Need&#8221;, which introduced Transformers to the world, uses the following schematic to represent the encoder:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NkM9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45735aaa-f07b-403a-ba71-04ef51505998_590x1051.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NkM9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45735aaa-f07b-403a-ba71-04ef51505998_590x1051.png 424w, https://substackcdn.com/image/fetch/$s_!NkM9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45735aaa-f07b-403a-ba71-04ef51505998_590x1051.png 848w, https://substackcdn.com/image/fetch/$s_!NkM9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45735aaa-f07b-403a-ba71-04ef51505998_590x1051.png 1272w, https://substackcdn.com/image/fetch/$s_!NkM9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45735aaa-f07b-403a-ba71-04ef51505998_590x1051.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NkM9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45735aaa-f07b-403a-ba71-04ef51505998_590x1051.png" width="232" height="413.27457627118645" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/45735aaa-f07b-403a-ba71-04ef51505998_590x1051.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1051,&quot;width&quot;:590,&quot;resizeWidth&quot;:232,&quot;bytes&quot;:244091,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183031673?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc57b5628-5687-4ca6-9a1f-6c559fac1e55_624x1100.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!NkM9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45735aaa-f07b-403a-ba71-04ef51505998_590x1051.png 424w, https://substackcdn.com/image/fetch/$s_!NkM9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45735aaa-f07b-403a-ba71-04ef51505998_590x1051.png 848w, https://substackcdn.com/image/fetch/$s_!NkM9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45735aaa-f07b-403a-ba71-04ef51505998_590x1051.png 1272w, https://substackcdn.com/image/fetch/$s_!NkM9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45735aaa-f07b-403a-ba71-04ef51505998_590x1051.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>But just encoding the sentence is not enough. We need something that gives us the desired output: in our case, swapping the adjective and the noun.</p><p>This brings us to the Transformer Decoder.</p><h3>Transformer Decoder</h3><p>Let us say the input to the decoder is the following:</p><p>&lt;bos&gt; car</p><p>We know that the next word it should predict is red, because we want to swap the input (red car).</p><p><strong>Step 1: Self Attention</strong></p><p>In this step, every token in the decoder input attends to the tokens of that same sequence, including itself. 
</p><p>First, we calculate the query, keys and values for all tokens:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Euzi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b609808-7a98-4963-a91b-84e65e9df7e3_1506x570.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Euzi!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b609808-7a98-4963-a91b-84e65e9df7e3_1506x570.png 424w, https://substackcdn.com/image/fetch/$s_!Euzi!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b609808-7a98-4963-a91b-84e65e9df7e3_1506x570.png 848w, https://substackcdn.com/image/fetch/$s_!Euzi!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b609808-7a98-4963-a91b-84e65e9df7e3_1506x570.png 1272w, https://substackcdn.com/image/fetch/$s_!Euzi!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b609808-7a98-4963-a91b-84e65e9df7e3_1506x570.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Euzi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b609808-7a98-4963-a91b-84e65e9df7e3_1506x570.png" width="1456" height="551" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5b609808-7a98-4963-a91b-84e65e9df7e3_1506x570.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:551,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:59022,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183031673?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b609808-7a98-4963-a91b-84e65e9df7e3_1506x570.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!Euzi!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b609808-7a98-4963-a91b-84e65e9df7e3_1506x570.png 424w, https://substackcdn.com/image/fetch/$s_!Euzi!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b609808-7a98-4963-a91b-84e65e9df7e3_1506x570.png 848w, https://substackcdn.com/image/fetch/$s_!Euzi!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b609808-7a98-4963-a91b-84e65e9df7e3_1506x570.png 1272w, https://substackcdn.com/image/fetch/$s_!Euzi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b609808-7a98-4963-a91b-84e65e9df7e3_1506x570.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The next step is to calculate the attention matrix:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!M4DF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ffeee3c-8064-4247-a8ae-b22575f144c4_828x634.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!M4DF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ffeee3c-8064-4247-a8ae-b22575f144c4_828x634.png 424w, 
https://substackcdn.com/image/fetch/$s_!M4DF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ffeee3c-8064-4247-a8ae-b22575f144c4_828x634.png 848w, https://substackcdn.com/image/fetch/$s_!M4DF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ffeee3c-8064-4247-a8ae-b22575f144c4_828x634.png 1272w, https://substackcdn.com/image/fetch/$s_!M4DF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ffeee3c-8064-4247-a8ae-b22575f144c4_828x634.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!M4DF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ffeee3c-8064-4247-a8ae-b22575f144c4_828x634.png" width="372" height="284.84057971014494" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8ffeee3c-8064-4247-a8ae-b22575f144c4_828x634.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:634,&quot;width&quot;:828,&quot;resizeWidth&quot;:372,&quot;bytes&quot;:51275,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183031673?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ffeee3c-8064-4247-a8ae-b22575f144c4_828x634.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" 
srcset="https://substackcdn.com/image/fetch/$s_!M4DF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ffeee3c-8064-4247-a8ae-b22575f144c4_828x634.png 424w, https://substackcdn.com/image/fetch/$s_!M4DF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ffeee3c-8064-4247-a8ae-b22575f144c4_828x634.png 848w, https://substackcdn.com/image/fetch/$s_!M4DF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ffeee3c-8064-4247-a8ae-b22575f144c4_828x634.png 1272w, https://substackcdn.com/image/fetch/$s_!M4DF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ffeee3c-8064-4247-a8ae-b22575f144c4_828x634.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Then we find the context vector for the &#8220;car&#8221; token by using its attention values and multiplying them by the value vectors:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lNfE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7515775-e990-4018-a923-a3c205fa795d_1022x234.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lNfE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7515775-e990-4018-a923-a3c205fa795d_1022x234.png 424w, https://substackcdn.com/image/fetch/$s_!lNfE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7515775-e990-4018-a923-a3c205fa795d_1022x234.png 848w, https://substackcdn.com/image/fetch/$s_!lNfE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7515775-e990-4018-a923-a3c205fa795d_1022x234.png 1272w, https://substackcdn.com/image/fetch/$s_!lNfE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7515775-e990-4018-a923-a3c205fa795d_1022x234.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!lNfE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7515775-e990-4018-a923-a3c205fa795d_1022x234.png" width="528" height="120.89236790606654" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c7515775-e990-4018-a923-a3c205fa795d_1022x234.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:234,&quot;width&quot;:1022,&quot;resizeWidth&quot;:528,&quot;bytes&quot;:18756,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183031673?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7515775-e990-4018-a923-a3c205fa795d_1022x234.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!lNfE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7515775-e990-4018-a923-a3c205fa795d_1022x234.png 424w, https://substackcdn.com/image/fetch/$s_!lNfE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7515775-e990-4018-a923-a3c205fa795d_1022x234.png 848w, https://substackcdn.com/image/fetch/$s_!lNfE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7515775-e990-4018-a923-a3c205fa795d_1022x234.png 1272w, 
https://substackcdn.com/image/fetch/$s_!lNfE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc7515775-e990-4018-a923-a3c205fa795d_1022x234.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>We have already seen these steps in the Transformer Encoder.</p><p>After this comes a step we have not seen so far, called &#8220;cross-attention&#8221;.</p><p><strong>Step 2: Cross-Attention</strong></p><p>In the cross-attention layer, the query comes from the decoder side, but the keys and values come from the encoder side.</p><p>In the above example, we will use the context vector for the &#8220;car&#8221; token to calculate the query, but the keys and values will come from the encoder. Let us look at the visualization below to understand this:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!031M!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f0e3147-7f88-4fd3-b32b-c6cab07e9cbd_2364x1146.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!031M!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f0e3147-7f88-4fd3-b32b-c6cab07e9cbd_2364x1146.png 424w, https://substackcdn.com/image/fetch/$s_!031M!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f0e3147-7f88-4fd3-b32b-c6cab07e9cbd_2364x1146.png 848w, https://substackcdn.com/image/fetch/$s_!031M!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f0e3147-7f88-4fd3-b32b-c6cab07e9cbd_2364x1146.png 1272w, 
https://substackcdn.com/image/fetch/$s_!031M!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f0e3147-7f88-4fd3-b32b-c6cab07e9cbd_2364x1146.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!031M!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f0e3147-7f88-4fd3-b32b-c6cab07e9cbd_2364x1146.png" width="1456" height="706" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9f0e3147-7f88-4fd3-b32b-c6cab07e9cbd_2364x1146.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:706,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:127181,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183031673?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f0e3147-7f88-4fd3-b32b-c6cab07e9cbd_2364x1146.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!031M!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f0e3147-7f88-4fd3-b32b-c6cab07e9cbd_2364x1146.png 424w, https://substackcdn.com/image/fetch/$s_!031M!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f0e3147-7f88-4fd3-b32b-c6cab07e9cbd_2364x1146.png 848w, 
https://substackcdn.com/image/fetch/$s_!031M!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f0e3147-7f88-4fd3-b32b-c6cab07e9cbd_2364x1146.png 1272w, https://substackcdn.com/image/fetch/$s_!031M!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f0e3147-7f88-4fd3-b32b-c6cab07e9cbd_2364x1146.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Once the query, keys and values have been computed, we can calculate the attention scores:</p><div 
class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!49PO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93457deb-4462-4073-94d3-e6cb4fc6b876_894x360.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!49PO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93457deb-4462-4073-94d3-e6cb4fc6b876_894x360.png 424w, https://substackcdn.com/image/fetch/$s_!49PO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93457deb-4462-4073-94d3-e6cb4fc6b876_894x360.png 848w, https://substackcdn.com/image/fetch/$s_!49PO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93457deb-4462-4073-94d3-e6cb4fc6b876_894x360.png 1272w, https://substackcdn.com/image/fetch/$s_!49PO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93457deb-4462-4073-94d3-e6cb4fc6b876_894x360.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!49PO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93457deb-4462-4073-94d3-e6cb4fc6b876_894x360.png" width="440" height="177.18120805369128" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/93457deb-4462-4073-94d3-e6cb4fc6b876_894x360.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:360,&quot;width&quot;:894,&quot;resizeWidth&quot;:440,&quot;bytes&quot;:33139,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183031673?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93457deb-4462-4073-94d3-e6cb4fc6b876_894x360.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!49PO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93457deb-4462-4073-94d3-e6cb4fc6b876_894x360.png 424w, https://substackcdn.com/image/fetch/$s_!49PO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93457deb-4462-4073-94d3-e6cb4fc6b876_894x360.png 848w, https://substackcdn.com/image/fetch/$s_!49PO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93457deb-4462-4073-94d3-e6cb4fc6b876_894x360.png 1272w, https://substackcdn.com/image/fetch/$s_!49PO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93457deb-4462-4073-94d3-e6cb4fc6b876_894x360.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Note that here we are trying to understand how much the token &#8220;car&#8221; relates to all the tokens in the input sentence. 
This matters because we expect the attention score for the token &#8220;red&#8221; to be high, since we want to swap the adjective and the noun.</p><p>After this, we again calculate the context vector in the same way as we have done a couple of times before:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0d2I!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe59303dc-73c6-4ed4-99db-c5e2d3544c22_1352x278.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0d2I!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe59303dc-73c6-4ed4-99db-c5e2d3544c22_1352x278.png 424w, https://substackcdn.com/image/fetch/$s_!0d2I!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe59303dc-73c6-4ed4-99db-c5e2d3544c22_1352x278.png 848w, https://substackcdn.com/image/fetch/$s_!0d2I!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe59303dc-73c6-4ed4-99db-c5e2d3544c22_1352x278.png 1272w, https://substackcdn.com/image/fetch/$s_!0d2I!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe59303dc-73c6-4ed4-99db-c5e2d3544c22_1352x278.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0d2I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe59303dc-73c6-4ed4-99db-c5e2d3544c22_1352x278.png" width="1352" height="278" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e59303dc-73c6-4ed4-99db-c5e2d3544c22_1352x278.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:278,&quot;width&quot;:1352,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:26072,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183031673?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe59303dc-73c6-4ed4-99db-c5e2d3544c22_1352x278.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!0d2I!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe59303dc-73c6-4ed4-99db-c5e2d3544c22_1352x278.png 424w, https://substackcdn.com/image/fetch/$s_!0d2I!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe59303dc-73c6-4ed4-99db-c5e2d3544c22_1352x278.png 848w, https://substackcdn.com/image/fetch/$s_!0d2I!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe59303dc-73c6-4ed4-99db-c5e2d3544c22_1352x278.png 1272w, https://substackcdn.com/image/fetch/$s_!0d2I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe59303dc-73c6-4ed4-99db-c5e2d3544c22_1352x278.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Note that here the values are coming from the encoder tokens.</p><p>You might have guessed what happens next. 
We pass it through the feed-forward network, just as we did with the encoder, where we increase the dimensions and then decrease them again:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9m0B!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6aafbf1a-e011-4d01-9b90-6a0443cd7fd5_1262x714.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9m0B!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6aafbf1a-e011-4d01-9b90-6a0443cd7fd5_1262x714.png 424w, https://substackcdn.com/image/fetch/$s_!9m0B!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6aafbf1a-e011-4d01-9b90-6a0443cd7fd5_1262x714.png 848w, https://substackcdn.com/image/fetch/$s_!9m0B!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6aafbf1a-e011-4d01-9b90-6a0443cd7fd5_1262x714.png 1272w, https://substackcdn.com/image/fetch/$s_!9m0B!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6aafbf1a-e011-4d01-9b90-6a0443cd7fd5_1262x714.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9m0B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6aafbf1a-e011-4d01-9b90-6a0443cd7fd5_1262x714.png" width="528" height="298.72583201267827" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6aafbf1a-e011-4d01-9b90-6a0443cd7fd5_1262x714.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:714,&quot;width&quot;:1262,&quot;resizeWidth&quot;:528,&quot;bytes&quot;:38119,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183031673?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6aafbf1a-e011-4d01-9b90-6a0443cd7fd5_1262x714.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!9m0B!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6aafbf1a-e011-4d01-9b90-6a0443cd7fd5_1262x714.png 424w, https://substackcdn.com/image/fetch/$s_!9m0B!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6aafbf1a-e011-4d01-9b90-6a0443cd7fd5_1262x714.png 848w, https://substackcdn.com/image/fetch/$s_!9m0B!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6aafbf1a-e011-4d01-9b90-6a0443cd7fd5_1262x714.png 1272w, https://substackcdn.com/image/fetch/$s_!9m0B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6aafbf1a-e011-4d01-9b90-6a0443cd7fd5_1262x714.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The Decoder&#8217;s job is to generate the translation one word at a time, using the memory created by the Encoder.</p><p>What next? How do we generate the next token?</p><p>One last step remains:</p><p><strong>Step 3: Projection into the Vocabulary Space</strong></p><p>In this step, we take the context vector from the previous step and project it into the vocabulary space, which gives one score for every token in the vocabulary. We then choose the token with the highest score as our next token. 
This can be visually presented as follows:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4ZCO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff48bb146-eb4c-4746-a32e-9e0b23f33af2_1954x1248.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4ZCO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff48bb146-eb4c-4746-a32e-9e0b23f33af2_1954x1248.png 424w, https://substackcdn.com/image/fetch/$s_!4ZCO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff48bb146-eb4c-4746-a32e-9e0b23f33af2_1954x1248.png 848w, https://substackcdn.com/image/fetch/$s_!4ZCO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff48bb146-eb4c-4746-a32e-9e0b23f33af2_1954x1248.png 1272w, https://substackcdn.com/image/fetch/$s_!4ZCO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff48bb146-eb4c-4746-a32e-9e0b23f33af2_1954x1248.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4ZCO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff48bb146-eb4c-4746-a32e-9e0b23f33af2_1954x1248.png" width="1456" height="930" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f48bb146-eb4c-4746-a32e-9e0b23f33af2_1954x1248.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:930,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:145622,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183031673?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff48bb146-eb4c-4746-a32e-9e0b23f33af2_1954x1248.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!4ZCO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff48bb146-eb4c-4746-a32e-9e0b23f33af2_1954x1248.png 424w, https://substackcdn.com/image/fetch/$s_!4ZCO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff48bb146-eb4c-4746-a32e-9e0b23f33af2_1954x1248.png 848w, https://substackcdn.com/image/fetch/$s_!4ZCO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff48bb146-eb4c-4746-a32e-9e0b23f33af2_1954x1248.png 1272w, https://substackcdn.com/image/fetch/$s_!4ZCO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff48bb146-eb4c-4746-a32e-9e0b23f33af2_1954x1248.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The paper &#8220;Attention Is All You Need&#8221;, which introduced Transformers to the world, uses the following schematic to represent the decoder:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RLAj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5a1f3ca-83b6-4d80-b6db-1aa1bdf3db50_366x1044.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RLAj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5a1f3ca-83b6-4d80-b6db-1aa1bdf3db50_366x1044.png 424w, 
https://substackcdn.com/image/fetch/$s_!RLAj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5a1f3ca-83b6-4d80-b6db-1aa1bdf3db50_366x1044.png 848w, https://substackcdn.com/image/fetch/$s_!RLAj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5a1f3ca-83b6-4d80-b6db-1aa1bdf3db50_366x1044.png 1272w, https://substackcdn.com/image/fetch/$s_!RLAj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5a1f3ca-83b6-4d80-b6db-1aa1bdf3db50_366x1044.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RLAj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5a1f3ca-83b6-4d80-b6db-1aa1bdf3db50_366x1044.png" width="186" height="530.5573770491803" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b5a1f3ca-83b6-4d80-b6db-1aa1bdf3db50_366x1044.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1044,&quot;width&quot;:366,&quot;resizeWidth&quot;:186,&quot;bytes&quot;:169712,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183031673?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5a1f3ca-83b6-4d80-b6db-1aa1bdf3db50_366x1044.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" 
srcset="https://substackcdn.com/image/fetch/$s_!RLAj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5a1f3ca-83b6-4d80-b6db-1aa1bdf3db50_366x1044.png 424w, https://substackcdn.com/image/fetch/$s_!RLAj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5a1f3ca-83b6-4d80-b6db-1aa1bdf3db50_366x1044.png 848w, https://substackcdn.com/image/fetch/$s_!RLAj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5a1f3ca-83b6-4d80-b6db-1aa1bdf3db50_366x1044.png 1272w, https://substackcdn.com/image/fetch/$s_!RLAj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5a1f3ca-83b6-4d80-b6db-1aa1bdf3db50_366x1044.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>We have implemented the example we started out with from scratch in a Google Colab notebook, which you can use to understand the entire architecture. Here is a link to the notebook:</p><p><a href="https://colab.research.google.com/drive/1Ev4Y1hfvQTPXVygz6hEMf4Bla3Wa7oXy">Google Colab Notebook: Designing Transformer Encoder and Decoder for Swapping Adjective and Nouns</a></p><p>Here is the sample output we get:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2RzL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b774f39-84fd-4d42-91be-79366d40401e_1214x418.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2RzL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b774f39-84fd-4d42-91be-79366d40401e_1214x418.png 424w, https://substackcdn.com/image/fetch/$s_!2RzL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b774f39-84fd-4d42-91be-79366d40401e_1214x418.png 848w, https://substackcdn.com/image/fetch/$s_!2RzL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b774f39-84fd-4d42-91be-79366d40401e_1214x418.png 1272w, 
https://substackcdn.com/image/fetch/$s_!2RzL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b774f39-84fd-4d42-91be-79366d40401e_1214x418.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2RzL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b774f39-84fd-4d42-91be-79366d40401e_1214x418.png" width="1214" height="418" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1b774f39-84fd-4d42-91be-79366d40401e_1214x418.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:418,&quot;width&quot;:1214,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:58764,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183031673?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b774f39-84fd-4d42-91be-79366d40401e_1214x418.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!2RzL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b774f39-84fd-4d42-91be-79366d40401e_1214x418.png 424w, https://substackcdn.com/image/fetch/$s_!2RzL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b774f39-84fd-4d42-91be-79366d40401e_1214x418.png 848w, 
https://substackcdn.com/image/fetch/$s_!2RzL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b774f39-84fd-4d42-91be-79366d40401e_1214x418.png 1272w, https://substackcdn.com/image/fetch/$s_!2RzL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b774f39-84fd-4d42-91be-79366d40401e_1214x418.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Hmm... Not bad!</p><p>In the field of robotics, we will mostly be working with image-based data.</p><p>We need a method that 
can look at images and understand the context from the images.</p><p>This brings us to the topic of vision transformers.</p><h3>Vision Transformers:</h3><p>Let us understand what happens inside a Vision Transformer:</p><p>We will take an example of one of the observations collected from our robot (we use an SO-101 Robot in our company). Let us say the observation looks as follows:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!szA2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48a5747a-9741-4b28-aa76-2917a6dd37b2_2667x1499.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!szA2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48a5747a-9741-4b28-aa76-2917a6dd37b2_2667x1499.png 424w, https://substackcdn.com/image/fetch/$s_!szA2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48a5747a-9741-4b28-aa76-2917a6dd37b2_2667x1499.png 848w, https://substackcdn.com/image/fetch/$s_!szA2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48a5747a-9741-4b28-aa76-2917a6dd37b2_2667x1499.png 1272w, https://substackcdn.com/image/fetch/$s_!szA2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48a5747a-9741-4b28-aa76-2917a6dd37b2_2667x1499.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!szA2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48a5747a-9741-4b28-aa76-2917a6dd37b2_2667x1499.png" width="1456" 
height="818" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/48a5747a-9741-4b28-aa76-2917a6dd37b2_2667x1499.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:818,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:3752838,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183031673?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48a5747a-9741-4b28-aa76-2917a6dd37b2_2667x1499.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!szA2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48a5747a-9741-4b28-aa76-2917a6dd37b2_2667x1499.png 424w, https://substackcdn.com/image/fetch/$s_!szA2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48a5747a-9741-4b28-aa76-2917a6dd37b2_2667x1499.png 848w, https://substackcdn.com/image/fetch/$s_!szA2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48a5747a-9741-4b28-aa76-2917a6dd37b2_2667x1499.png 1272w, https://substackcdn.com/image/fetch/$s_!szA2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F48a5747a-9741-4b28-aa76-2917a6dd37b2_2667x1499.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container 
restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>From this image, we can probably guess that the camera is mounted somewhere in front of the robotic arm, and the task that the robot is trying to perform is to place the golf ball into the orange cup.</p><p><strong>Step 1: Dividing the image into patches</strong></p><p>In the first step, we will take the image and divide it into a specific number of patches. 
Let us see what this looks like for the above observation:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vOJB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F454935d6-0e23-4c23-ab35-d5c628d85c48_1776x1002.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vOJB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F454935d6-0e23-4c23-ab35-d5c628d85c48_1776x1002.png 424w, https://substackcdn.com/image/fetch/$s_!vOJB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F454935d6-0e23-4c23-ab35-d5c628d85c48_1776x1002.png 848w, https://substackcdn.com/image/fetch/$s_!vOJB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F454935d6-0e23-4c23-ab35-d5c628d85c48_1776x1002.png 1272w, https://substackcdn.com/image/fetch/$s_!vOJB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F454935d6-0e23-4c23-ab35-d5c628d85c48_1776x1002.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vOJB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F454935d6-0e23-4c23-ab35-d5c628d85c48_1776x1002.png" width="1456" height="821" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/454935d6-0e23-4c23-ab35-d5c628d85c48_1776x1002.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:821,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1667903,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183031673?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F454935d6-0e23-4c23-ab35-d5c628d85c48_1776x1002.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!vOJB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F454935d6-0e23-4c23-ab35-d5c628d85c48_1776x1002.png 424w, https://substackcdn.com/image/fetch/$s_!vOJB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F454935d6-0e23-4c23-ab35-d5c628d85c48_1776x1002.png 848w, https://substackcdn.com/image/fetch/$s_!vOJB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F454935d6-0e23-4c23-ab35-d5c628d85c48_1776x1002.png 1272w, https://substackcdn.com/image/fetch/$s_!vOJB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F454935d6-0e23-4c23-ab35-d5c628d85c48_1776x1002.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg 
role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Okay, we have divided the image into 9 patches. What next?</p><p>Remember that the transformer architecture, which we discussed before, had a layer that converted tokens into token embeddings.</p><p>But we do not have tokens here. 
So, what do we do instead?</p><p>What if we take the patches and convert them to patch embeddings?</p><p><em>Let us look at exactly how patch embeddings are created.</em></p><p><strong>Step 2: Creating Patch Embeddings</strong></p><p>Let us say each of our patches has dimensions 28x28x3, so in total we have 28 x 28 x 3 = 2352 values per patch.</p><p>Now imagine a neural network that takes these 2352 values as input and produces an output vector of dimension 512.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KHio!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5484e99b-d2e1-4fc0-9a85-0dde82703373_1018x1092.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KHio!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5484e99b-d2e1-4fc0-9a85-0dde82703373_1018x1092.png 424w, https://substackcdn.com/image/fetch/$s_!KHio!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5484e99b-d2e1-4fc0-9a85-0dde82703373_1018x1092.png 848w, https://substackcdn.com/image/fetch/$s_!KHio!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5484e99b-d2e1-4fc0-9a85-0dde82703373_1018x1092.png 1272w, https://substackcdn.com/image/fetch/$s_!KHio!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5484e99b-d2e1-4fc0-9a85-0dde82703373_1018x1092.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!KHio!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5484e99b-d2e1-4fc0-9a85-0dde82703373_1018x1092.png" width="282" height="302.4990176817289" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5484e99b-d2e1-4fc0-9a85-0dde82703373_1018x1092.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1092,&quot;width&quot;:1018,&quot;resizeWidth&quot;:282,&quot;bytes&quot;:139947,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183031673?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5484e99b-d2e1-4fc0-9a85-0dde82703373_1018x1092.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!KHio!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5484e99b-d2e1-4fc0-9a85-0dde82703373_1018x1092.png 424w, https://substackcdn.com/image/fetch/$s_!KHio!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5484e99b-d2e1-4fc0-9a85-0dde82703373_1018x1092.png 848w, https://substackcdn.com/image/fetch/$s_!KHio!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5484e99b-d2e1-4fc0-9a85-0dde82703373_1018x1092.png 1272w, 
https://substackcdn.com/image/fetch/$s_!KHio!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5484e99b-d2e1-4fc0-9a85-0dde82703373_1018x1092.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This is what happens, visually, when the patch embeddings are created.</p><p>The process so far can be represented visually as follows:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!88Hg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08f6c303-b255-4a00-9e89-0a06874d17ba_2234x710.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!88Hg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08f6c303-b255-4a00-9e89-0a06874d17ba_2234x710.png 424w, https://substackcdn.com/image/fetch/$s_!88Hg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08f6c303-b255-4a00-9e89-0a06874d17ba_2234x710.png 848w, https://substackcdn.com/image/fetch/$s_!88Hg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08f6c303-b255-4a00-9e89-0a06874d17ba_2234x710.png 1272w, https://substackcdn.com/image/fetch/$s_!88Hg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08f6c303-b255-4a00-9e89-0a06874d17ba_2234x710.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!88Hg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08f6c303-b255-4a00-9e89-0a06874d17ba_2234x710.png" width="1456" height="463" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/08f6c303-b255-4a00-9e89-0a06874d17ba_2234x710.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:463,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:461858,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183031673?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08f6c303-b255-4a00-9e89-0a06874d17ba_2234x710.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!88Hg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08f6c303-b255-4a00-9e89-0a06874d17ba_2234x710.png 424w, https://substackcdn.com/image/fetch/$s_!88Hg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08f6c303-b255-4a00-9e89-0a06874d17ba_2234x710.png 848w, https://substackcdn.com/image/fetch/$s_!88Hg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08f6c303-b255-4a00-9e89-0a06874d17ba_2234x710.png 1272w, https://substackcdn.com/image/fetch/$s_!88Hg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08f6c303-b255-4a00-9e89-0a06874d17ba_2234x710.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This looks good for now, but we are forgetting something!</p><p><strong>Step 3: Creating Position Embeddings</strong></p><p>Imagine that you have the following pictures:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qLin!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F960ca55a-5f80-4078-a512-9a7635b31a59_2706x1477.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qLin!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F960ca55a-5f80-4078-a512-9a7635b31a59_2706x1477.png 424w, 
https://substackcdn.com/image/fetch/$s_!qLin!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F960ca55a-5f80-4078-a512-9a7635b31a59_2706x1477.png 848w, https://substackcdn.com/image/fetch/$s_!qLin!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F960ca55a-5f80-4078-a512-9a7635b31a59_2706x1477.png 1272w, https://substackcdn.com/image/fetch/$s_!qLin!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F960ca55a-5f80-4078-a512-9a7635b31a59_2706x1477.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qLin!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F960ca55a-5f80-4078-a512-9a7635b31a59_2706x1477.png" width="1456" height="795" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/960ca55a-5f80-4078-a512-9a7635b31a59_2706x1477.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:795,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:3552884,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183031673?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F960ca55a-5f80-4078-a512-9a7635b31a59_2706x1477.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" 
srcset="https://substackcdn.com/image/fetch/$s_!qLin!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F960ca55a-5f80-4078-a512-9a7635b31a59_2706x1477.png 424w, https://substackcdn.com/image/fetch/$s_!qLin!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F960ca55a-5f80-4078-a512-9a7635b31a59_2706x1477.png 848w, https://substackcdn.com/image/fetch/$s_!qLin!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F960ca55a-5f80-4078-a512-9a7635b31a59_2706x1477.png 1272w, https://substackcdn.com/image/fetch/$s_!qLin!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F960ca55a-5f80-4078-a512-9a7635b31a59_2706x1477.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Both images contain the same patches. The only difference is that the patches are rearranged.</p><p>Now, if we continue with the above architecture, our model would treat these two images as the same, because nothing in the patch embeddings records where each patch came from. It would not understand the importance of the ordering of the patches inside these images.</p><p>This is why, along with patch embeddings, we need position embeddings as well.</p><p>We did the same thing for the Transformer architecture when applied to language data, where position embeddings were calculated for the tokens in a sequence.</p><p>When we add the position embeddings, our modified architecture looks as follows:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gDDZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5d0cb0b-16f1-4905-be29-2ba677e45c44_2230x714.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gDDZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5d0cb0b-16f1-4905-be29-2ba677e45c44_2230x714.png 424w, https://substackcdn.com/image/fetch/$s_!gDDZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5d0cb0b-16f1-4905-be29-2ba677e45c44_2230x714.png 848w, 
https://substackcdn.com/image/fetch/$s_!gDDZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5d0cb0b-16f1-4905-be29-2ba677e45c44_2230x714.png 1272w, https://substackcdn.com/image/fetch/$s_!gDDZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5d0cb0b-16f1-4905-be29-2ba677e45c44_2230x714.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gDDZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5d0cb0b-16f1-4905-be29-2ba677e45c44_2230x714.png" width="1456" height="466" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e5d0cb0b-16f1-4905-be29-2ba677e45c44_2230x714.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:466,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:466539,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183031673?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5d0cb0b-16f1-4905-be29-2ba677e45c44_2230x714.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!gDDZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5d0cb0b-16f1-4905-be29-2ba677e45c44_2230x714.png 424w, 
https://substackcdn.com/image/fetch/$s_!gDDZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5d0cb0b-16f1-4905-be29-2ba677e45c44_2230x714.png 848w, https://substackcdn.com/image/fetch/$s_!gDDZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5d0cb0b-16f1-4905-be29-2ba677e45c44_2230x714.png 1272w, https://substackcdn.com/image/fetch/$s_!gDDZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5d0cb0b-16f1-4905-be29-2ba677e45c44_2230x714.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Now, we simply pass this as an input to the transformer encoder.</p><p><strong>Step 4: Pass this as an input to Transformer Encoder</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!WZbu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46d095b8-8afd-4d0e-96a2-a8ad3153f075_2246x1038.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!WZbu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46d095b8-8afd-4d0e-96a2-a8ad3153f075_2246x1038.png 424w, https://substackcdn.com/image/fetch/$s_!WZbu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46d095b8-8afd-4d0e-96a2-a8ad3153f075_2246x1038.png 848w, https://substackcdn.com/image/fetch/$s_!WZbu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46d095b8-8afd-4d0e-96a2-a8ad3153f075_2246x1038.png 1272w, https://substackcdn.com/image/fetch/$s_!WZbu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46d095b8-8afd-4d0e-96a2-a8ad3153f075_2246x1038.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!WZbu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46d095b8-8afd-4d0e-96a2-a8ad3153f075_2246x1038.png" width="1456" height="673" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/46d095b8-8afd-4d0e-96a2-a8ad3153f075_2246x1038.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:673,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:503132,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183031673?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46d095b8-8afd-4d0e-96a2-a8ad3153f075_2246x1038.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!WZbu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46d095b8-8afd-4d0e-96a2-a8ad3153f075_2246x1038.png 424w, https://substackcdn.com/image/fetch/$s_!WZbu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46d095b8-8afd-4d0e-96a2-a8ad3153f075_2246x1038.png 848w, https://substackcdn.com/image/fetch/$s_!WZbu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46d095b8-8afd-4d0e-96a2-a8ad3153f075_2246x1038.png 1272w, https://substackcdn.com/image/fetch/$s_!WZbu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46d095b8-8afd-4d0e-96a2-a8ad3153f075_2246x1038.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Are we done?</p><p>Question: What is the output of the transformer encoder?</p><p>How do we use it to classify the images?</p><p>Remember that the output of the transformer encoder is a set of context vectors, one for each input token.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6png!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f9c42b0-9c5f-48f8-987d-d8921c6e2b2e_2618x1242.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!6png!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f9c42b0-9c5f-48f8-987d-d8921c6e2b2e_2618x1242.png 424w, https://substackcdn.com/image/fetch/$s_!6png!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f9c42b0-9c5f-48f8-987d-d8921c6e2b2e_2618x1242.png 848w, https://substackcdn.com/image/fetch/$s_!6png!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f9c42b0-9c5f-48f8-987d-d8921c6e2b2e_2618x1242.png 1272w, https://substackcdn.com/image/fetch/$s_!6png!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f9c42b0-9c5f-48f8-987d-d8921c6e2b2e_2618x1242.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6png!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f9c42b0-9c5f-48f8-987d-d8921c6e2b2e_2618x1242.png" width="1456" height="691" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2f9c42b0-9c5f-48f8-987d-d8921c6e2b2e_2618x1242.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:691,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:553744,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183031673?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f9c42b0-9c5f-48f8-987d-d8921c6e2b2e_2618x1242.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!6png!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f9c42b0-9c5f-48f8-987d-d8921c6e2b2e_2618x1242.png 424w, https://substackcdn.com/image/fetch/$s_!6png!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f9c42b0-9c5f-48f8-987d-d8921c6e2b2e_2618x1242.png 848w, https://substackcdn.com/image/fetch/$s_!6png!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f9c42b0-9c5f-48f8-987d-d8921c6e2b2e_2618x1242.png 1272w, https://substackcdn.com/image/fetch/$s_!6png!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2f9c42b0-9c5f-48f8-987d-d8921c6e2b2e_2618x1242.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" 
stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>So, what next? How do we classify this image?</p><p> This is where we come to an important concept.</p><p><strong>Step 5: Classification Token [CLS]</strong></p><p>We add a learnable embedding to the sequence of embedded patches whose state at the output of the Transformer encoder serves as the image representation.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2ygC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4796c9bf-6a67-419c-925c-ec83ae2a2e4b_2284x1068.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2ygC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4796c9bf-6a67-419c-925c-ec83ae2a2e4b_2284x1068.png 424w, https://substackcdn.com/image/fetch/$s_!2ygC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4796c9bf-6a67-419c-925c-ec83ae2a2e4b_2284x1068.png 848w, https://substackcdn.com/image/fetch/$s_!2ygC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4796c9bf-6a67-419c-925c-ec83ae2a2e4b_2284x1068.png 1272w, 
https://substackcdn.com/image/fetch/$s_!2ygC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4796c9bf-6a67-419c-925c-ec83ae2a2e4b_2284x1068.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2ygC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4796c9bf-6a67-419c-925c-ec83ae2a2e4b_2284x1068.png" width="1456" height="681" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4796c9bf-6a67-419c-925c-ec83ae2a2e4b_2284x1068.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:681,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:509067,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183031673?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4796c9bf-6a67-419c-925c-ec83ae2a2e4b_2284x1068.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!2ygC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4796c9bf-6a67-419c-925c-ec83ae2a2e4b_2284x1068.png 424w, https://substackcdn.com/image/fetch/$s_!2ygC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4796c9bf-6a67-419c-925c-ec83ae2a2e4b_2284x1068.png 848w, 
https://substackcdn.com/image/fetch/$s_!2ygC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4796c9bf-6a67-419c-925c-ec83ae2a2e4b_2284x1068.png 1272w, https://substackcdn.com/image/fetch/$s_!2ygC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4796c9bf-6a67-419c-925c-ec83ae2a2e4b_2284x1068.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This learnable token is also called the &#8220;class&#8221; token or the &#8220;classification&#8221; token.  
</p><p>To understand what the CLS token does, let us calculate the attention matrix and the role of this token in the attention matrix.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HQfm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5651f15-3653-4ba9-9d54-348fcc8c33b8_1234x1354.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HQfm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5651f15-3653-4ba9-9d54-348fcc8c33b8_1234x1354.png 424w, https://substackcdn.com/image/fetch/$s_!HQfm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5651f15-3653-4ba9-9d54-348fcc8c33b8_1234x1354.png 848w, https://substackcdn.com/image/fetch/$s_!HQfm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5651f15-3653-4ba9-9d54-348fcc8c33b8_1234x1354.png 1272w, https://substackcdn.com/image/fetch/$s_!HQfm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5651f15-3653-4ba9-9d54-348fcc8c33b8_1234x1354.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HQfm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5651f15-3653-4ba9-9d54-348fcc8c33b8_1234x1354.png" width="426" height="467.4262560777958" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a5651f15-3653-4ba9-9d54-348fcc8c33b8_1234x1354.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1354,&quot;width&quot;:1234,&quot;resizeWidth&quot;:426,&quot;bytes&quot;:325937,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183031673?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5651f15-3653-4ba9-9d54-348fcc8c33b8_1234x1354.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!HQfm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5651f15-3653-4ba9-9d54-348fcc8c33b8_1234x1354.png 424w, https://substackcdn.com/image/fetch/$s_!HQfm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5651f15-3653-4ba9-9d54-348fcc8c33b8_1234x1354.png 848w, https://substackcdn.com/image/fetch/$s_!HQfm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5651f15-3653-4ba9-9d54-348fcc8c33b8_1234x1354.png 1272w, https://substackcdn.com/image/fetch/$s_!HQfm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5651f15-3653-4ba9-9d54-348fcc8c33b8_1234x1354.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>From the above attention matrix, we can see that the CLS token attends to itself and to the other 9 patches.</p><p>So, the CLS token can be thought of as a summary of the entire image.</p><p><strong>Question: Once the context vector for the CLS token has been generated, where do we go from here?</strong></p><p>This brings us to the next step:</p><p><strong>Step 6: The Classification Head</strong></p><p><em>The last step is the classification head. 
Here, the context vector corresponding to the CLS token is passed through a multi-layer perceptron whose output dimension equals the number of classes.</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!p6kx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F432fb8db-67f9-4edc-86d0-a9ab3dd8ab40_2110x1120.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!p6kx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F432fb8db-67f9-4edc-86d0-a9ab3dd8ab40_2110x1120.png 424w, https://substackcdn.com/image/fetch/$s_!p6kx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F432fb8db-67f9-4edc-86d0-a9ab3dd8ab40_2110x1120.png 848w, https://substackcdn.com/image/fetch/$s_!p6kx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F432fb8db-67f9-4edc-86d0-a9ab3dd8ab40_2110x1120.png 1272w, https://substackcdn.com/image/fetch/$s_!p6kx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F432fb8db-67f9-4edc-86d0-a9ab3dd8ab40_2110x1120.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!p6kx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F432fb8db-67f9-4edc-86d0-a9ab3dd8ab40_2110x1120.png" width="1456" height="773" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/432fb8db-67f9-4edc-86d0-a9ab3dd8ab40_2110x1120.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:773,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:152317,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183031673?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F432fb8db-67f9-4edc-86d0-a9ab3dd8ab40_2110x1120.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!p6kx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F432fb8db-67f9-4edc-86d0-a9ab3dd8ab40_2110x1120.png 424w, https://substackcdn.com/image/fetch/$s_!p6kx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F432fb8db-67f9-4edc-86d0-a9ab3dd8ab40_2110x1120.png 848w, https://substackcdn.com/image/fetch/$s_!p6kx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F432fb8db-67f9-4edc-86d0-a9ab3dd8ab40_2110x1120.png 1272w, https://substackcdn.com/image/fetch/$s_!p6kx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F432fb8db-67f9-4edc-86d0-a9ab3dd8ab40_2110x1120.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The final Vision Transformer Architecture looks as follows:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Grys!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa2c94a7-afd3-4af9-bfd7-8534d9bc27c0_2318x1276.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Grys!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa2c94a7-afd3-4af9-bfd7-8534d9bc27c0_2318x1276.png 424w, 
https://substackcdn.com/image/fetch/$s_!Grys!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa2c94a7-afd3-4af9-bfd7-8534d9bc27c0_2318x1276.png 848w, https://substackcdn.com/image/fetch/$s_!Grys!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa2c94a7-afd3-4af9-bfd7-8534d9bc27c0_2318x1276.png 1272w, https://substackcdn.com/image/fetch/$s_!Grys!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa2c94a7-afd3-4af9-bfd7-8534d9bc27c0_2318x1276.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Grys!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa2c94a7-afd3-4af9-bfd7-8534d9bc27c0_2318x1276.png" width="1456" height="801" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/aa2c94a7-afd3-4af9-bfd7-8534d9bc27c0_2318x1276.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:801,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Grys!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa2c94a7-afd3-4af9-bfd7-8534d9bc27c0_2318x1276.png 424w, 
https://substackcdn.com/image/fetch/$s_!Grys!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa2c94a7-afd3-4af9-bfd7-8534d9bc27c0_2318x1276.png 848w, https://substackcdn.com/image/fetch/$s_!Grys!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa2c94a7-afd3-4af9-bfd7-8534d9bc27c0_2318x1276.png 1272w, https://substackcdn.com/image/fetch/$s_!Grys!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa2c94a7-afd3-4af9-bfd7-8534d9bc27c0_2318x1276.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Let us look at a practical example to understand how ViT is implemented:</p><p>We will look at the following dataset:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-4DS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d8f29f8-33e9-410c-a83e-4bbf91ab7855_1466x1489.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-4DS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d8f29f8-33e9-410c-a83e-4bbf91ab7855_1466x1489.png 424w, https://substackcdn.com/image/fetch/$s_!-4DS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d8f29f8-33e9-410c-a83e-4bbf91ab7855_1466x1489.png 848w, https://substackcdn.com/image/fetch/$s_!-4DS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d8f29f8-33e9-410c-a83e-4bbf91ab7855_1466x1489.png 1272w, https://substackcdn.com/image/fetch/$s_!-4DS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d8f29f8-33e9-410c-a83e-4bbf91ab7855_1466x1489.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-4DS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d8f29f8-33e9-410c-a83e-4bbf91ab7855_1466x1489.png" width="1456" height="1479" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4d8f29f8-33e9-410c-a83e-4bbf91ab7855_1466x1489.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1479,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!-4DS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d8f29f8-33e9-410c-a83e-4bbf91ab7855_1466x1489.png 424w, https://substackcdn.com/image/fetch/$s_!-4DS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d8f29f8-33e9-410c-a83e-4bbf91ab7855_1466x1489.png 848w, https://substackcdn.com/image/fetch/$s_!-4DS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d8f29f8-33e9-410c-a83e-4bbf91ab7855_1466x1489.png 1272w, https://substackcdn.com/image/fetch/$s_!-4DS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d8f29f8-33e9-410c-a83e-4bbf91ab7855_1466x1489.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>There are 3 classes here: </p><p>(1) angular_leaf_spot (A fungal disease causing angular spots on leaves)</p><p>(2) bean_rust (A fungal disease causing rust-colored pustules)</p><p>(3) healthy (No disease detected)</p><p>We will use a pretrained Vision Transformer as our base model and then fine-tune it for our specific task.</p><p>We will use the following Transformer:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5Yol!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fa4f797-21d0-4858-a8e1-e04629ed0b56_2698x1558.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!5Yol!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fa4f797-21d0-4858-a8e1-e04629ed0b56_2698x1558.png 424w, https://substackcdn.com/image/fetch/$s_!5Yol!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fa4f797-21d0-4858-a8e1-e04629ed0b56_2698x1558.png 848w, https://substackcdn.com/image/fetch/$s_!5Yol!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fa4f797-21d0-4858-a8e1-e04629ed0b56_2698x1558.png 1272w, https://substackcdn.com/image/fetch/$s_!5Yol!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fa4f797-21d0-4858-a8e1-e04629ed0b56_2698x1558.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5Yol!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fa4f797-21d0-4858-a8e1-e04629ed0b56_2698x1558.png" width="1456" height="841" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2fa4f797-21d0-4858-a8e1-e04629ed0b56_2698x1558.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:841,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:595891,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183031673?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fa4f797-21d0-4858-a8e1-e04629ed0b56_2698x1558.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!5Yol!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fa4f797-21d0-4858-a8e1-e04629ed0b56_2698x1558.png 424w, https://substackcdn.com/image/fetch/$s_!5Yol!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fa4f797-21d0-4858-a8e1-e04629ed0b56_2698x1558.png 848w, https://substackcdn.com/image/fetch/$s_!5Yol!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fa4f797-21d0-4858-a8e1-e04629ed0b56_2698x1558.png 1272w, https://substackcdn.com/image/fetch/$s_!5Yol!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2fa4f797-21d0-4858-a8e1-e04629ed0b56_2698x1558.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" 
stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>This model was pretrained on a large collection of images in a supervised fashion, namely ImageNet-21k, at a resolution of 224x224 pixels. Next, the model was fine-tuned on ImageNet (also referred to as ILSVRC2012), a dataset comprising 1 million images and 1,000 classes, also at resolution 224x224.</em></p><p>We use the following pipeline for our task:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-36y!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7282c0a-3630-4933-aec4-b15960eb3d9f_1024x559.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-36y!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7282c0a-3630-4933-aec4-b15960eb3d9f_1024x559.png 424w, https://substackcdn.com/image/fetch/$s_!-36y!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7282c0a-3630-4933-aec4-b15960eb3d9f_1024x559.png 848w, https://substackcdn.com/image/fetch/$s_!-36y!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7282c0a-3630-4933-aec4-b15960eb3d9f_1024x559.png 1272w, 
https://substackcdn.com/image/fetch/$s_!-36y!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7282c0a-3630-4933-aec4-b15960eb3d9f_1024x559.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-36y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7282c0a-3630-4933-aec4-b15960eb3d9f_1024x559.png" width="1024" height="559" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f7282c0a-3630-4933-aec4-b15960eb3d9f_1024x559.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:559,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:651983,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/183031673?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7282c0a-3630-4933-aec4-b15960eb3d9f_1024x559.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!-36y!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7282c0a-3630-4933-aec4-b15960eb3d9f_1024x559.png 424w, https://substackcdn.com/image/fetch/$s_!-36y!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7282c0a-3630-4933-aec4-b15960eb3d9f_1024x559.png 848w, 
https://substackcdn.com/image/fetch/$s_!-36y!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7282c0a-3630-4933-aec4-b15960eb3d9f_1024x559.png 1272w, https://substackcdn.com/image/fetch/$s_!-36y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7282c0a-3630-4933-aec4-b15960eb3d9f_1024x559.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>After training for 3 epochs, we get an accuracy of 96%. 
See the result below:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!l9fn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23acc9ca-dfd1-4a50-b732-aeacb61d016c_1445x1489.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!l9fn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23acc9ca-dfd1-4a50-b732-aeacb61d016c_1445x1489.png 424w, https://substackcdn.com/image/fetch/$s_!l9fn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23acc9ca-dfd1-4a50-b732-aeacb61d016c_1445x1489.png 848w, https://substackcdn.com/image/fetch/$s_!l9fn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23acc9ca-dfd1-4a50-b732-aeacb61d016c_1445x1489.png 1272w, https://substackcdn.com/image/fetch/$s_!l9fn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23acc9ca-dfd1-4a50-b732-aeacb61d016c_1445x1489.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!l9fn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23acc9ca-dfd1-4a50-b732-aeacb61d016c_1445x1489.png" width="1445" height="1489" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/23acc9ca-dfd1-4a50-b732-aeacb61d016c_1445x1489.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1489,&quot;width&quot;:1445,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!l9fn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23acc9ca-dfd1-4a50-b732-aeacb61d016c_1445x1489.png 424w, https://substackcdn.com/image/fetch/$s_!l9fn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23acc9ca-dfd1-4a50-b732-aeacb61d016c_1445x1489.png 848w, https://substackcdn.com/image/fetch/$s_!l9fn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23acc9ca-dfd1-4a50-b732-aeacb61d016c_1445x1489.png 1272w, https://substackcdn.com/image/fetch/$s_!l9fn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23acc9ca-dfd1-4a50-b732-aeacb61d016c_1445x1489.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Here is the link to the Google Colab Notebook:</p><p><a href="https://colab.research.google.com/drive/101A2TRFfH-fHX--aNQfmqBU39sSEYjyu?usp=sharing">Google Colab Notebook: Implementing Vision Transformer on a Practical Dataset</a>  </p><p>Here is the link to the original paper which introduced <em>Transformers: <a href="https://arxiv.org/abs/1706.03762">https://arxiv.org/abs/1706.03762</a></em></p><p>Here is the link to the original paper which introduced <em>Vision</em> <em>Transformers: <a href="https://arxiv.org/pdf/2010.11929">https://arxiv.org/pdf/2010.11929</a></em></p><p>If you like this content, please check out our bootcamps on the following topics:</p><p><strong>Modern Robot Learning</strong>: <a href="https://robotlearningbootcamp.vizuara.ai/">https://robotlearningbootcamp.vizuara.ai/</a></p><p><strong>GenAI</strong>: <a 
href="https://flyvidesh.online/gen-ai-professional-bootcamp">https://flyvidesh.online/gen-ai-professional-bootcamp</a></p><p><strong>RL</strong>: <a href="https://rlresearcherbootcamp.vizuara.ai/">https://rlresearcherbootcamp.vizuara.ai/</a></p><p><strong>SciML</strong>: <a href="https://flyvidesh.online/ml-bootcamp">https://flyvidesh.online/ml-bootcamp</a></p><p><strong>ML-DL</strong>: <a href="https://flyvidesh.online/ml-dl-bootcamp">https://flyvidesh.online/ml-dl-bootcamp</a></p><p><strong>CV</strong>: <a href="https://cvresearchbootcamp.vizuara.ai/">https://cvresearchbootcamp.vizuara.ai/</a></p>]]></content:encoded></item><item><title><![CDATA[Energy Based Models - Score Matching]]></title><description><![CDATA[Modeling probability distributions using energy functions.]]></description><link>https://www.vizuaranewsletter.com/p/energy-based-models-score-matching</link><guid isPermaLink="false">https://www.vizuaranewsletter.com/p/energy-based-models-score-matching</guid><dc:creator><![CDATA[Dr Rajat Dandekar]]></dc:creator><pubDate>Thu, 01 Jan 2026 09:52:23 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!wemw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f27f7fa-7558-480a-9516-1ed18ab43b47_2592x1543.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>EBMs define a probability density via an energy function which assigns lower energy to more likely configurations.</p><p>The Energy Function is represented as:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;E_{\\phi}(x)&quot;,&quot;id&quot;:&quot;IQREREUAGR&quot;}" data-component-name="LatexBlockToDOM"></div><p>Let us look at a simple example:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2_cQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9762309-1c0a-41f8-86c6-6c010604274e_2679x1492.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2_cQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9762309-1c0a-41f8-86c6-6c010604274e_2679x1492.png 424w, https://substackcdn.com/image/fetch/$s_!2_cQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9762309-1c0a-41f8-86c6-6c010604274e_2679x1492.png 848w, https://substackcdn.com/image/fetch/$s_!2_cQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9762309-1c0a-41f8-86c6-6c010604274e_2679x1492.png 1272w, https://substackcdn.com/image/fetch/$s_!2_cQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9762309-1c0a-41f8-86c6-6c010604274e_2679x1492.png 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2_cQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9762309-1c0a-41f8-86c6-6c010604274e_2679x1492.png" width="1456" height="811" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d9762309-1c0a-41f8-86c6-6c010604274e_2679x1492.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:811,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:679638,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/182938742?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9762309-1c0a-41f8-86c6-6c010604274e_2679x1492.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!2_cQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9762309-1c0a-41f8-86c6-6c010604274e_2679x1492.png 424w, https://substackcdn.com/image/fetch/$s_!2_cQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9762309-1c0a-41f8-86c6-6c010604274e_2679x1492.png 848w, https://substackcdn.com/image/fetch/$s_!2_cQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9762309-1c0a-41f8-86c6-6c010604274e_2679x1492.png 1272w, 
https://substackcdn.com/image/fetch/$s_!2_cQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd9762309-1c0a-41f8-86c6-6c010604274e_2679x1492.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>Now, let us understand how to convert this energy function into a probability distribution. 
</p><p>Two things are intuitively clear:</p><p><em>The points with low energy should have a higher probability.</em></p><p><em>The points with high energy should have a lower probability.</em></p><p>This is inspired by physics, where systems tend to settle into the state of lowest energy. Think of an apple dropped from a height: it comes to rest on the ground because its potential energy is lowest there.</p><p>So, the probability curve should look something like this:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ThSN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F947c65d4-5877-4a0b-ab87-1c61d3c97ed8_2679x1492.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ThSN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F947c65d4-5877-4a0b-ab87-1c61d3c97ed8_2679x1492.png 424w, https://substackcdn.com/image/fetch/$s_!ThSN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F947c65d4-5877-4a0b-ab87-1c61d3c97ed8_2679x1492.png 848w, https://substackcdn.com/image/fetch/$s_!ThSN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F947c65d4-5877-4a0b-ab87-1c61d3c97ed8_2679x1492.png 1272w, https://substackcdn.com/image/fetch/$s_!ThSN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F947c65d4-5877-4a0b-ab87-1c61d3c97ed8_2679x1492.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!ThSN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F947c65d4-5877-4a0b-ab87-1c61d3c97ed8_2679x1492.png" width="1456" height="811" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/947c65d4-5877-4a0b-ab87-1c61d3c97ed8_2679x1492.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:811,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:830802,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/182938742?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F947c65d4-5877-4a0b-ab87-1c61d3c97ed8_2679x1492.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!ThSN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F947c65d4-5877-4a0b-ab87-1c61d3c97ed8_2679x1492.png 424w, https://substackcdn.com/image/fetch/$s_!ThSN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F947c65d4-5877-4a0b-ab87-1c61d3c97ed8_2679x1492.png 848w, https://substackcdn.com/image/fetch/$s_!ThSN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F947c65d4-5877-4a0b-ab87-1c61d3c97ed8_2679x1492.png 1272w, 
https://substackcdn.com/image/fetch/$s_!ThSN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F947c65d4-5877-4a0b-ab87-1c61d3c97ed8_2679x1492.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Let us now superimpose the two curves:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!8Ukx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d534ed8-f091-4200-bcc4-9d9ed588c037_2679x1492.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8Ukx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d534ed8-f091-4200-bcc4-9d9ed588c037_2679x1492.png 424w, https://substackcdn.com/image/fetch/$s_!8Ukx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d534ed8-f091-4200-bcc4-9d9ed588c037_2679x1492.png 848w, https://substackcdn.com/image/fetch/$s_!8Ukx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d534ed8-f091-4200-bcc4-9d9ed588c037_2679x1492.png 1272w, https://substackcdn.com/image/fetch/$s_!8Ukx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d534ed8-f091-4200-bcc4-9d9ed588c037_2679x1492.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8Ukx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d534ed8-f091-4200-bcc4-9d9ed588c037_2679x1492.png" width="1456" height="811" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4d534ed8-f091-4200-bcc4-9d9ed588c037_2679x1492.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:811,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1153248,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/182938742?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d534ed8-f091-4200-bcc4-9d9ed588c037_2679x1492.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!8Ukx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d534ed8-f091-4200-bcc4-9d9ed588c037_2679x1492.png 424w, https://substackcdn.com/image/fetch/$s_!8Ukx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d534ed8-f091-4200-bcc4-9d9ed588c037_2679x1492.png 848w, https://substackcdn.com/image/fetch/$s_!8Ukx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d534ed8-f091-4200-bcc4-9d9ed588c037_2679x1492.png 1272w, https://substackcdn.com/image/fetch/$s_!8Ukx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d534ed8-f091-4200-bcc4-9d9ed588c037_2679x1492.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg 
role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This is what we want: low energy lines up with high probability.</p><p><em>Can we think of a mathematical function which takes us from this energy curve to the probability curve?</em></p><p>The function should satisfy the following properties:</p><ul><li><p>Higher energy should map to lower probability</p></li><li><p>Lower energy should map to higher probability</p></li><li><p>It should take only positive values</p></li><li><p>It should lie between 0 and 1</p></li></ul><blockquote><p>An exponential function is the usual choice for relating energy to probability: it is positive everywhere, decreases as the energy increases, and stays at or below 1 whenever the energy is non-negative.</p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!jkXy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1ea068b-49b5-4a65-9e07-41e30fabb2be_2593x1542.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jkXy!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1ea068b-49b5-4a65-9e07-41e30fabb2be_2593x1542.png 424w, https://substackcdn.com/image/fetch/$s_!jkXy!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1ea068b-49b5-4a65-9e07-41e30fabb2be_2593x1542.png 848w, https://substackcdn.com/image/fetch/$s_!jkXy!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1ea068b-49b5-4a65-9e07-41e30fabb2be_2593x1542.png 1272w, https://substackcdn.com/image/fetch/$s_!jkXy!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1ea068b-49b5-4a65-9e07-41e30fabb2be_2593x1542.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!jkXy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1ea068b-49b5-4a65-9e07-41e30fabb2be_2593x1542.png" width="1456" height="866" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b1ea068b-49b5-4a65-9e07-41e30fabb2be_2593x1542.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:866,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:245821,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/182938742?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1ea068b-49b5-4a65-9e07-41e30fabb2be_2593x1542.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!jkXy!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1ea068b-49b5-4a65-9e07-41e30fabb2be_2593x1542.png 424w, https://substackcdn.com/image/fetch/$s_!jkXy!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1ea068b-49b5-4a65-9e07-41e30fabb2be_2593x1542.png 848w, https://substackcdn.com/image/fetch/$s_!jkXy!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1ea068b-49b5-4a65-9e07-41e30fabb2be_2593x1542.png 1272w, https://substackcdn.com/image/fetch/$s_!jkXy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1ea068b-49b5-4a65-9e07-41e30fabb2be_2593x1542.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>From the above graph, we can relate the energy to the probability using the following equation:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;p_{\\phi}(x) = e^{-E_{\\phi}(x)}&quot;,&quot;id&quot;:&quot;LABRTOTUYI&quot;}" data-component-name="LatexBlockToDOM"></div><p>Okay, this looks great, until it does not.</p><p>Let us take an example. 
Suppose that we have a set of discrete states which are -3, -2, -1, 0, 1, 2, and 3.</p><p>Now, let us say we use the above formula and calculate the probability densities for all these states.</p><p>It will look something like this:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GYi_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc67c956b-0d1d-44de-801e-918cf0bab4df_2679x1492.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GYi_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc67c956b-0d1d-44de-801e-918cf0bab4df_2679x1492.png 424w, https://substackcdn.com/image/fetch/$s_!GYi_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc67c956b-0d1d-44de-801e-918cf0bab4df_2679x1492.png 848w, https://substackcdn.com/image/fetch/$s_!GYi_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc67c956b-0d1d-44de-801e-918cf0bab4df_2679x1492.png 1272w, https://substackcdn.com/image/fetch/$s_!GYi_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc67c956b-0d1d-44de-801e-918cf0bab4df_2679x1492.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GYi_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc67c956b-0d1d-44de-801e-918cf0bab4df_2679x1492.png" width="1456" height="811" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c67c956b-0d1d-44de-801e-918cf0bab4df_2679x1492.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:811,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:223058,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/182938742?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc67c956b-0d1d-44de-801e-918cf0bab4df_2679x1492.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!GYi_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc67c956b-0d1d-44de-801e-918cf0bab4df_2679x1492.png 424w, https://substackcdn.com/image/fetch/$s_!GYi_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc67c956b-0d1d-44de-801e-918cf0bab4df_2679x1492.png 848w, https://substackcdn.com/image/fetch/$s_!GYi_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc67c956b-0d1d-44de-801e-918cf0bab4df_2679x1492.png 1272w, https://substackcdn.com/image/fetch/$s_!GYi_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc67c956b-0d1d-44de-801e-918cf0bab4df_2679x1492.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Here, we are assuming that the energy of each of these states is known to us, and we are simply converting the energies into probabilities using the exponential formula we looked at before.</p><p>The sum of all these probabilities is 2.5066.</p><p>This is not what we want. We want the probabilities to sum to 1, so that they form a valid probability distribution.</p><p>However, there is a simple solution to this. 
We can simply normalize the probabilities by dividing each of them by 2.5066.</p><p>It will look something like this:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lL7g!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5f40f99-a4f5-4797-9474-226c3687263d_2679x1492.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lL7g!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5f40f99-a4f5-4797-9474-226c3687263d_2679x1492.png 424w, https://substackcdn.com/image/fetch/$s_!lL7g!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5f40f99-a4f5-4797-9474-226c3687263d_2679x1492.png 848w, https://substackcdn.com/image/fetch/$s_!lL7g!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5f40f99-a4f5-4797-9474-226c3687263d_2679x1492.png 1272w, https://substackcdn.com/image/fetch/$s_!lL7g!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5f40f99-a4f5-4797-9474-226c3687263d_2679x1492.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lL7g!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5f40f99-a4f5-4797-9474-226c3687263d_2679x1492.png" width="1456" height="811" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f5f40f99-a4f5-4797-9474-226c3687263d_2679x1492.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:811,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:240023,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/182938742?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5f40f99-a4f5-4797-9474-226c3687263d_2679x1492.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!lL7g!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5f40f99-a4f5-4797-9474-226c3687263d_2679x1492.png 424w, https://substackcdn.com/image/fetch/$s_!lL7g!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5f40f99-a4f5-4797-9474-226c3687263d_2679x1492.png 848w, https://substackcdn.com/image/fetch/$s_!lL7g!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5f40f99-a4f5-4797-9474-226c3687263d_2679x1492.png 1272w, https://substackcdn.com/image/fetch/$s_!lL7g!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5f40f99-a4f5-4797-9474-226c3687263d_2679x1492.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Now, these probabilities do sum up to 1!</p><p>The number 2.5066 is called the partition function.</p><p>Hence, the final relation between the energy and the probability density function looks as follows:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;p_{\\phi}(x) = \\frac{e^{-E_{\\phi}(x)}}{\\int e^{-E_{\\phi}(x)} dx }&quot;,&quot;id&quot;:&quot;SCZKIDMTCY&quot;}" data-component-name="LatexBlockToDOM"></div><p>The partition function, which is the denominator in the above equation, is also denoted by the symbol &#8220;Z&#8221;. For discrete states, as in the example above, the integral becomes a sum over the states.</p><p>This is great, but how do we train energy-based models?</p><h3>Training Energy-Based Models</h3><p>Conceptually, we want to do something like this:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" 
target="_blank" href="https://substackcdn.com/image/fetch/$s_!bJq6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02d44293-a5d3-4d83-813f-9ec377a08393_3166x1263.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bJq6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02d44293-a5d3-4d83-813f-9ec377a08393_3166x1263.png 424w, https://substackcdn.com/image/fetch/$s_!bJq6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02d44293-a5d3-4d83-813f-9ec377a08393_3166x1263.png 848w, https://substackcdn.com/image/fetch/$s_!bJq6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02d44293-a5d3-4d83-813f-9ec377a08393_3166x1263.png 1272w, https://substackcdn.com/image/fetch/$s_!bJq6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02d44293-a5d3-4d83-813f-9ec377a08393_3166x1263.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bJq6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02d44293-a5d3-4d83-813f-9ec377a08393_3166x1263.png" width="1456" height="581" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/02d44293-a5d3-4d83-813f-9ec377a08393_3166x1263.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:581,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:794935,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/182938742?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02d44293-a5d3-4d83-813f-9ec377a08393_3166x1263.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!bJq6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02d44293-a5d3-4d83-813f-9ec377a08393_3166x1263.png 424w, https://substackcdn.com/image/fetch/$s_!bJq6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02d44293-a5d3-4d83-813f-9ec377a08393_3166x1263.png 848w, https://substackcdn.com/image/fetch/$s_!bJq6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02d44293-a5d3-4d83-813f-9ec377a08393_3166x1263.png 1272w, https://substackcdn.com/image/fetch/$s_!bJq6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02d44293-a5d3-4d83-813f-9ec377a08393_3166x1263.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="pullquote"><p>We start with a random energy configuration and slowly modify the energy landscape, so that the bad data have a lower probability and the good data have a higher probability.</p></div><p>We will again use maximum likelihood, which we have seen several times before.</p><p>We want to maximize the following:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot; \\text{log}p_{\\phi}(x)&quot;,&quot;id&quot;:&quot;PJMUZOWGUP&quot;}" data-component-name="LatexBlockToDOM"></div><p>Using the above formula, which relates the energy to the probability, we can rewrite this as follows:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot; \\text{log}p_{\\phi}(x) = -E_{\\phi}(x) - log(Z)&quot;,&quot;id&quot;:&quot;FYTMNXGBJX&quot;}" 
data-component-name="LatexBlockToDOM"></div><p>Here, Z denotes the partition function.</p><p>The main challenge with the above equation is that the partition function is intractable, i.e., it cannot be computed in practice. The reason is that evaluating it requires integrating over every possible state in the distribution, and for high-dimensional data such an integral is computationally infeasible.</p><p>Recall that we faced the same issue while training variational autoencoders and diffusion models.</p><p>We solved it there by formulating an ELBO term and then maximizing it. This worked because the ELBO is a lower bound on the log-likelihood, so maximizing the ELBO raises a guaranteed floor on the log-likelihood.</p><blockquote><p>For energy-based methods, we don&#8217;t use the ELBO approach, but instead we introduce the notion of the score function and present score matching as a tractable training objective which bypasses the partition function.</p></blockquote><h3>What is the &#8220;Score&#8221;?</h3><p>The score function is the gradient of the log density, given by the following equation:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;s(x) = \\nabla_{x}\\text{log}p(x)&quot;,&quot;id&quot;:&quot;WJDHFPAPLW&quot;}" data-component-name="LatexBlockToDOM"></div><p>Here, p(x) denotes the probability density of the data. For an energy-based model, log(Z) does not depend on x, so its gradient vanishes and the score reduces to the negative gradient of the energy. The partition function drops out entirely.</p><div class="pullquote"><p>Intuitively, the score forms a vector field that points toward regions of higher probability, providing a local guide to where the data is most likely to occur.</p></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wemw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f27f7fa-7558-480a-9516-1ed18ab43b47_2592x1543.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!wemw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f27f7fa-7558-480a-9516-1ed18ab43b47_2592x1543.png 424w, https://substackcdn.com/image/fetch/$s_!wemw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f27f7fa-7558-480a-9516-1ed18ab43b47_2592x1543.png 848w, https://substackcdn.com/image/fetch/$s_!wemw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f27f7fa-7558-480a-9516-1ed18ab43b47_2592x1543.png 1272w, https://substackcdn.com/image/fetch/$s_!wemw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f27f7fa-7558-480a-9516-1ed18ab43b47_2592x1543.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!wemw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f27f7fa-7558-480a-9516-1ed18ab43b47_2592x1543.png" width="508" height="302.49725274725273" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6f27f7fa-7558-480a-9516-1ed18ab43b47_2592x1543.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:867,&quot;width&quot;:1456,&quot;resizeWidth&quot;:508,&quot;bytes&quot;:4306581,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/182938742?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f27f7fa-7558-480a-9516-1ed18ab43b47_2592x1543.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!wemw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f27f7fa-7558-480a-9516-1ed18ab43b47_2592x1543.png 424w, https://substackcdn.com/image/fetch/$s_!wemw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f27f7fa-7558-480a-9516-1ed18ab43b47_2592x1543.png 848w, https://substackcdn.com/image/fetch/$s_!wemw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f27f7fa-7558-480a-9516-1ed18ab43b47_2592x1543.png 1272w, https://substackcdn.com/image/fetch/$s_!wemw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f27f7fa-7558-480a-9516-1ed18ab43b47_2592x1543.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In the above figure, the arrows represent the score field, and they point in the direction in which the density of the data increases.</p><p>The score function acts as a compass, guiding you towards regions where the data is most likely to be found.</p><p><strong>Let us look at a simple practical example to build our intuition about the score function. 
</strong></p><p>Let us assume that the probability density curve is Gaussian.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Yg3g!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b89e29d-9981-44f8-8745-28d522a69ead_2593x1542.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Yg3g!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b89e29d-9981-44f8-8745-28d522a69ead_2593x1542.png 424w, https://substackcdn.com/image/fetch/$s_!Yg3g!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b89e29d-9981-44f8-8745-28d522a69ead_2593x1542.png 848w, https://substackcdn.com/image/fetch/$s_!Yg3g!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b89e29d-9981-44f8-8745-28d522a69ead_2593x1542.png 1272w, https://substackcdn.com/image/fetch/$s_!Yg3g!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b89e29d-9981-44f8-8745-28d522a69ead_2593x1542.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Yg3g!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b89e29d-9981-44f8-8745-28d522a69ead_2593x1542.png" width="523" height="311.07005494505495" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7b89e29d-9981-44f8-8745-28d522a69ead_2593x1542.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:866,&quot;width&quot;:1456,&quot;resizeWidth&quot;:523,&quot;bytes&quot;:343722,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/182938742?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b89e29d-9981-44f8-8745-28d522a69ead_2593x1542.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!Yg3g!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b89e29d-9981-44f8-8745-28d522a69ead_2593x1542.png 424w, https://substackcdn.com/image/fetch/$s_!Yg3g!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b89e29d-9981-44f8-8745-28d522a69ead_2593x1542.png 848w, https://substackcdn.com/image/fetch/$s_!Yg3g!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b89e29d-9981-44f8-8745-28d522a69ead_2593x1542.png 1272w, https://substackcdn.com/image/fetch/$s_!Yg3g!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b89e29d-9981-44f8-8745-28d522a69ead_2593x1542.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>First, let us write down the functional form of the standard Gaussian (zero mean, unit variance):</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;p(x) = \\frac{1}{\\sqrt{2\\pi}}e^{-x^{2}/2}&quot;,&quot;id&quot;:&quot;VUTYWZTXDN&quot;}" data-component-name="LatexBlockToDOM"></div><p>We can verify whether this makes sense. 
If we substitute x = 0, we get the largest value of the density, about 0.399, and for values of x that are large in magnitude (positive or negative), the density approaches 0, which matches the graph above.</p><p>Now, let us calculate the score function.</p><p>First, let us calculate the logarithm of the probability:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{log}(p(x)) = \\text{log}(\\frac{1}{\\sqrt{2\\pi}}) - \\frac{x^{2}}{2}&quot;,&quot;id&quot;:&quot;FYLDNQSREL&quot;}" data-component-name="LatexBlockToDOM"></div><p>Now, we will take the gradient. The gradient of the constant term vanishes, and the gradient of the second term is -x:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;s(x) = \\nabla_{x}\\text{log}(p(x)) = -x &quot;,&quot;id&quot;:&quot;OWDEVPTRQZ&quot;}" data-component-name="LatexBlockToDOM"></div><p>Let us visualize this score function superimposed on the probability density curve:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!OyBQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0275457-6360-4531-8b79-91045e050fa7_2592x1542.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!OyBQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0275457-6360-4531-8b79-91045e050fa7_2592x1542.png 424w, https://substackcdn.com/image/fetch/$s_!OyBQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0275457-6360-4531-8b79-91045e050fa7_2592x1542.png 848w, 
https://substackcdn.com/image/fetch/$s_!OyBQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0275457-6360-4531-8b79-91045e050fa7_2592x1542.png 1272w, https://substackcdn.com/image/fetch/$s_!OyBQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0275457-6360-4531-8b79-91045e050fa7_2592x1542.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!OyBQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0275457-6360-4531-8b79-91045e050fa7_2592x1542.png" width="1456" height="866" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d0275457-6360-4531-8b79-91045e050fa7_2592x1542.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:866,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:326788,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/182938742?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0275457-6360-4531-8b79-91045e050fa7_2592x1542.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!OyBQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0275457-6360-4531-8b79-91045e050fa7_2592x1542.png 424w, 
https://substackcdn.com/image/fetch/$s_!OyBQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0275457-6360-4531-8b79-91045e050fa7_2592x1542.png 848w, https://substackcdn.com/image/fetch/$s_!OyBQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0275457-6360-4531-8b79-91045e050fa7_2592x1542.png 1272w, https://substackcdn.com/image/fetch/$s_!OyBQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0275457-6360-4531-8b79-91045e050fa7_2592x1542.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><blockquote><p><em>In this example, we can clearly see that all the score vectors point towards the center because the origin has the maximum probability density.</em></p><p><em>The farther a point is from the center, the larger the magnitude of its arrow: a stronger pull is needed to bring it back to the center.</em></p></blockquote><h3><strong>But why model scores instead of densities?</strong></h3><p>Modeling the score offers both theoretical and practical benefits:</p><p><strong>Freedom from Partition Function:</strong></p><p>We saw earlier that calculating the partition function is intractable.</p><p>Because of this, we could not write down a tractable expression for maximizing the likelihood.</p><p>Now, let us see how this formulation changes for the score function.</p><p>Let us write down the formula for the score function and express the probability in terms of the energy function.</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;s(x) = \\nabla_{x}\\text{log}(p(x)) =  \\nabla_{x}\\text{log}(\\frac{e^{-E_{\\phi}(x)}}{Z})&quot;,&quot;id&quot;:&quot;YEJCMPVQCR&quot;}" data-component-name="LatexBlockToDOM"></div><p>This can be simplified to the following. Remember that the log of a quotient is the difference of the logs, and the log of the exponential of a value is the value itself.</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;s(x) = -\\nabla_{x}E_{\\phi}(x) - \\nabla_{x} \\text{log}Z&quot;,&quot;id&quot;:&quot;NDHRAKMJPD&quot;}" data-component-name="LatexBlockToDOM"></div><p>Now, the second term is the gradient of the log of the partition function Z, which does not depend on x. 
So that gradient is zero, and this is exactly why the score gives us freedom from the partition function.</p><p>So, we can write the score function as follows:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;s(x) = -\\nabla_{x}E_{\\phi}(x)&quot;,&quot;id&quot;:&quot;TMRNEOYQMZ&quot;}" data-component-name="LatexBlockToDOM"></div><p>Now that we have understood the meaning of the score function, let us use this formulation of the score function for training energy-based models.</p><p><em>Before we go to the training of energy-based models using the score function, let us first understand how we can draw samples from our distribution if we only know the score function. </em></p><h3>Sampling using the Score Function</h3><p>The question that we will address is, &#8220;How do you sample the data if you have the score function?&#8221;</p><p>Let us start with an example:</p><p>Imagine that you are dropped into a thick fog on a vast landscape. Your goal is to find the deepest valley because that is where the treasure is hidden.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jXu7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6647b30-5166-4bcb-9add-529475fa6f37_2706x1477.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jXu7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6647b30-5166-4bcb-9add-529475fa6f37_2706x1477.png 424w, https://substackcdn.com/image/fetch/$s_!jXu7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6647b30-5166-4bcb-9add-529475fa6f37_2706x1477.png 848w, 
https://substackcdn.com/image/fetch/$s_!jXu7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6647b30-5166-4bcb-9add-529475fa6f37_2706x1477.png 1272w, https://substackcdn.com/image/fetch/$s_!jXu7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6647b30-5166-4bcb-9add-529475fa6f37_2706x1477.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!jXu7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6647b30-5166-4bcb-9add-529475fa6f37_2706x1477.png" width="485" height="264.8179945054945" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c6647b30-5166-4bcb-9add-529475fa6f37_2706x1477.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:795,&quot;width&quot;:1456,&quot;resizeWidth&quot;:485,&quot;bytes&quot;:3024354,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/182938742?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6647b30-5166-4bcb-9add-529475fa6f37_2706x1477.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!jXu7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6647b30-5166-4bcb-9add-529475fa6f37_2706x1477.png 424w, 
https://substackcdn.com/image/fetch/$s_!jXu7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6647b30-5166-4bcb-9add-529475fa6f37_2706x1477.png 848w, https://substackcdn.com/image/fetch/$s_!jXu7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6647b30-5166-4bcb-9add-529475fa6f37_2706x1477.png 1272w, https://substackcdn.com/image/fetch/$s_!jXu7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6647b30-5166-4bcb-9add-529475fa6f37_2706x1477.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Ideally, you want to trace the route which goes something like this:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HfmN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64d57d28-6f9d-4272-90d9-6948186532f7_2707x1477.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HfmN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64d57d28-6f9d-4272-90d9-6948186532f7_2707x1477.png 424w, https://substackcdn.com/image/fetch/$s_!HfmN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64d57d28-6f9d-4272-90d9-6948186532f7_2707x1477.png 848w, https://substackcdn.com/image/fetch/$s_!HfmN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64d57d28-6f9d-4272-90d9-6948186532f7_2707x1477.png 1272w, https://substackcdn.com/image/fetch/$s_!HfmN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64d57d28-6f9d-4272-90d9-6948186532f7_2707x1477.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HfmN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64d57d28-6f9d-4272-90d9-6948186532f7_2707x1477.png" width="487" height="265.57554945054943" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/64d57d28-6f9d-4272-90d9-6948186532f7_2707x1477.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:794,&quot;width&quot;:1456,&quot;resizeWidth&quot;:487,&quot;bytes&quot;:3013381,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/182938742?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64d57d28-6f9d-4272-90d9-6948186532f7_2707x1477.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!HfmN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64d57d28-6f9d-4272-90d9-6948186532f7_2707x1477.png 424w, https://substackcdn.com/image/fetch/$s_!HfmN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64d57d28-6f9d-4272-90d9-6948186532f7_2707x1477.png 848w, https://substackcdn.com/image/fetch/$s_!HfmN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64d57d28-6f9d-4272-90d9-6948186532f7_2707x1477.png 1272w, https://substackcdn.com/image/fetch/$s_!HfmN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F64d57d28-6f9d-4272-90d9-6948186532f7_2707x1477.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>What is the strategy that you will use?</p><p><em>Since you know that the treasure is in the valley somewhere, you know that going down is good and going up is bad.</em></p><div class="pullquote"><p>To reach the valley in the quickest possible time, you will go in the direction in which the ground slopes downward most steeply.</p></div><p>Let us denote this direction of steepest descent by the symbol &#8220;q&#8221;.</p><p>If x(t) is your current position and x(t+1) is your next position, then you can write the next position in terms of the current position using the following:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;x_{t+1} = x_{t} + \\eta q&quot;,&quot;id&quot;:&quot;IGMWIPUKUN&quot;}" data-component-name="LatexBlockToDOM"></div><p>Here, &#951; (eta) is the step size.</p><p>In this analogy, our vast landscape is the 
energy function landscape. So, we are trying to find the point where the energy landscape achieves a minimum.</p><p>So, &#8220;q&#8221; can be expressed as the negative gradient of the energy function as follows:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;q = - \\nabla_{x} E_{\\phi}(x_{t})&quot;,&quot;id&quot;:&quot;PQGGXJATLA&quot;}" data-component-name="LatexBlockToDOM"></div><p>Note that we have used a negative sign because the gradient points uphill, and we want to move towards a minimum, not a maximum.</p><p>Now we can write the final update rule as follows:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;x_{t+1}= x_{t} - \\eta \\nabla_{x} E_{\\phi}(x_{t})&quot;,&quot;id&quot;:&quot;ONHBSCXMQW&quot;}" data-component-name="LatexBlockToDOM"></div><p>If we follow the above rule with a sufficiently small step size, each step moves us towards a region of lower energy, until the gradient vanishes.</p><p>This should remind you of the &#8220;Gradient Descent Algorithm&#8221; in Machine Learning, which has a very similar update rule.</p><p>We are not done yet :(</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Iebn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe4a09c6-45de-4120-b323-7c9f2f5e6376_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Iebn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe4a09c6-45de-4120-b323-7c9f2f5e6376_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!Iebn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe4a09c6-45de-4120-b323-7c9f2f5e6376_1536x1024.png 848w, 
https://substackcdn.com/image/fetch/$s_!Iebn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe4a09c6-45de-4120-b323-7c9f2f5e6376_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!Iebn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe4a09c6-45de-4120-b323-7c9f2f5e6376_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Iebn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe4a09c6-45de-4120-b323-7c9f2f5e6376_1536x1024.png" width="399" height="266.09134615384613" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fe4a09c6-45de-4120-b323-7c9f2f5e6376_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:399,&quot;bytes&quot;:2261863,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/182938742?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe4a09c6-45de-4120-b323-7c9f2f5e6376_1536x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!Iebn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe4a09c6-45de-4120-b323-7c9f2f5e6376_1536x1024.png 424w, 
https://substackcdn.com/image/fetch/$s_!Iebn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe4a09c6-45de-4120-b323-7c9f2f5e6376_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!Iebn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe4a09c6-45de-4120-b323-7c9f2f5e6376_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!Iebn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe4a09c6-45de-4120-b323-7c9f2f5e6376_1536x1024.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Consider this scenario:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!S1ay!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79314648-74b4-402e-9723-707daf7b2738_1934x846.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!S1ay!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79314648-74b4-402e-9723-707daf7b2738_1934x846.png 424w, https://substackcdn.com/image/fetch/$s_!S1ay!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79314648-74b4-402e-9723-707daf7b2738_1934x846.png 848w, https://substackcdn.com/image/fetch/$s_!S1ay!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79314648-74b4-402e-9723-707daf7b2738_1934x846.png 1272w, https://substackcdn.com/image/fetch/$s_!S1ay!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79314648-74b4-402e-9723-707daf7b2738_1934x846.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!S1ay!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79314648-74b4-402e-9723-707daf7b2738_1934x846.png" width="1456" height="637" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/79314648-74b4-402e-9723-707daf7b2738_1934x846.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:637,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2196860,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/182938742?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79314648-74b4-402e-9723-707daf7b2738_1934x846.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!S1ay!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79314648-74b4-402e-9723-707daf7b2738_1934x846.png 424w, https://substackcdn.com/image/fetch/$s_!S1ay!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79314648-74b4-402e-9723-707daf7b2738_1934x846.png 848w, https://substackcdn.com/image/fetch/$s_!S1ay!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79314648-74b4-402e-9723-707daf7b2738_1934x846.png 1272w, https://substackcdn.com/image/fetch/$s_!S1ay!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79314648-74b4-402e-9723-707daf7b2738_1934x846.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>We are at a point where the slope of the mountain is 0, i.e., the gradient of the energy is 0.</p><p>The update rule no longer moves us, so we are stuck at this local minimum.</p><p>Now, what if there is another minimum further down the road that is even lower than where we are sitting right now? 
With our purely downhill algorithm, we will never find this minimum.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-H92!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee9fcea1-5165-4a93-8cdd-26d3eac26b52_2092x1072.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-H92!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee9fcea1-5165-4a93-8cdd-26d3eac26b52_2092x1072.png 424w, https://substackcdn.com/image/fetch/$s_!-H92!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee9fcea1-5165-4a93-8cdd-26d3eac26b52_2092x1072.png 848w, https://substackcdn.com/image/fetch/$s_!-H92!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee9fcea1-5165-4a93-8cdd-26d3eac26b52_2092x1072.png 1272w, https://substackcdn.com/image/fetch/$s_!-H92!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee9fcea1-5165-4a93-8cdd-26d3eac26b52_2092x1072.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-H92!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee9fcea1-5165-4a93-8cdd-26d3eac26b52_2092x1072.png" width="1456" height="746" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ee9fcea1-5165-4a93-8cdd-26d3eac26b52_2092x1072.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:746,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2129021,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/182938742?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee9fcea1-5165-4a93-8cdd-26d3eac26b52_2092x1072.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!-H92!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee9fcea1-5165-4a93-8cdd-26d3eac26b52_2092x1072.png 424w, https://substackcdn.com/image/fetch/$s_!-H92!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee9fcea1-5165-4a93-8cdd-26d3eac26b52_2092x1072.png 848w, https://substackcdn.com/image/fetch/$s_!-H92!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee9fcea1-5165-4a93-8cdd-26d3eac26b52_2092x1072.png 1272w, https://substackcdn.com/image/fetch/$s_!-H92!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fee9fcea1-5165-4a93-8cdd-26d3eac26b52_2092x1072.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg 
role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>To solve this issue, we need to add a random shake that gives you just enough energy to kick you out of those small potholes so you can keep moving toward the <em>true</em> bottom of the valley (the Global Minimum).</p><p>Remember that in the lecture on diffusion, we discussed adding noise to the data, and we wrote a simple expression for it, which looks as follows:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;x_{i+1} = x_{i} + \\beta \\epsilon &quot;,&quot;id&quot;:&quot;GSVIUQERBX&quot;}" data-component-name="LatexBlockToDOM"></div><p>Refer to the Diffusion article here: <a href="https://www.vizuaranewsletter.com/p/what-exactly-are-diffusion-models">https://www.vizuaranewsletter.com/p/what-exactly-are-diffusion-models</a></p><p>The above 
expression also means that x(i+1) is sampled from a Gaussian distribution whose mean is x(i) and whose standard deviation is beta.</p><p>With this understanding in mind, we will modify our &#8220;walking&#8221; algorithm as follows:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;x_{t+1}= x_{t} - \\eta \\nabla_{x} E_{\\phi}(x_{t}) + \\sqrt{2\\eta}\\epsilon_{t}&quot;,&quot;id&quot;:&quot;CDCLRVOOHM&quot;}" data-component-name="LatexBlockToDOM"></div><p>Note that here epsilon is a random variable drawn from a standard Gaussian at every step, which makes the update stochastic.</p><div class="pullquote"><p>This means that even at a local minimum, where the gradient of the energy function is zero, we will not remain stationary: the shake provided by the noise term pulls us out of that hole. This allows us to explore other regions where we might find the global minimum.</p></div><p>This equation is also called the <strong>Discrete-Time Langevin Update</strong>.</p><p>As we have seen before, the gradient of the energy function is the negative of the score function, so we can rewrite the above equation as follows:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;x_{t+1}= x_{t} + \\eta s_{\\phi}(x_{t}) + \\sqrt{2\\eta}\\epsilon_{t}&quot;,&quot;id&quot;:&quot;UFGQKIBOKH&quot;}" data-component-name="LatexBlockToDOM"></div><p>Note that here we assume that we already have a trained score function, and we are looking at how to sample images from it using the Discrete-Time Langevin Update. 
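</p><p>The score-based update above can be sketched in a few lines of NumPy. This is only a minimal sketch under stated assumptions: it uses a hypothetical two-peak target (an equal-weight mixture of two 2D Gaussians) whose score is known in closed form, in place of a trained score network. The names <code>MEANS</code>, <code>SIGMA</code>, <code>score</code> and <code>langevin_sample</code>, as well as the step size and the number of steps, are illustrative choices.</p>

```python
import numpy as np

# Hypothetical two-peak target: an equal-weight mixture of two 2D Gaussians.
MEANS = np.array([[-2.0, 0.0], [2.0, 0.0]])  # assumed peak locations
SIGMA = 0.5                                  # assumed shared standard deviation


def score(x):
    """Analytic score, grad_x log p(x), of the mixture at points x of shape (n, 2)."""
    diff = x[:, None, :] - MEANS[None, :, :]            # (n, 2 peaks, 2 dims)
    log_w = -0.5 * np.sum(diff**2, axis=-1) / SIGMA**2  # unnormalised log responsibilities
    log_w -= log_w.max(axis=1, keepdims=True)           # stabilise the softmax
    w = np.exp(log_w)
    w /= w.sum(axis=1, keepdims=True)
    # Each peak pulls x toward its mean; weight the pulls by responsibility.
    return -np.sum(w[:, :, None] * diff, axis=1) / SIGMA**2


def langevin_sample(x0, eta=0.01, n_steps=2000, seed=0):
    """Iterate x_{t+1} = x_t + eta * s(x_t) + sqrt(2 * eta) * eps_t."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        eps = rng.standard_normal(x.shape)  # fresh Gaussian noise at every step
        x = x + eta * score(x) + np.sqrt(2.0 * eta) * eps
    return x


# Start 1000 chains spread over the grid and let them wander.
start = np.random.default_rng(1).uniform(-4.0, 4.0, size=(1000, 2))
samples = langevin_sample(start)

# Distance of every final sample to its nearest peak.
nearest = np.min(
    np.linalg.norm(samples[:, None, :] - MEANS[None, :, :], axis=-1), axis=1
)
print("mean distance to nearest peak:", nearest.mean())
```

<p>With a small step size, most chains should end up close to one of the two peaks, zig-zagging along the way because of the noise term. A larger <code>eta</code> moves faster but makes the discrete update a cruder approximation of the underlying dynamics.</p><p>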
</p><p>Sampling using the Langevin update might visually look as follows:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rnGB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a9155e2-0ffd-4ae6-be2f-31ac7cc25617_2647x1510.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rnGB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a9155e2-0ffd-4ae6-be2f-31ac7cc25617_2647x1510.png 424w, https://substackcdn.com/image/fetch/$s_!rnGB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a9155e2-0ffd-4ae6-be2f-31ac7cc25617_2647x1510.png 848w, https://substackcdn.com/image/fetch/$s_!rnGB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a9155e2-0ffd-4ae6-be2f-31ac7cc25617_2647x1510.png 1272w, https://substackcdn.com/image/fetch/$s_!rnGB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a9155e2-0ffd-4ae6-be2f-31ac7cc25617_2647x1510.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rnGB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a9155e2-0ffd-4ae6-be2f-31ac7cc25617_2647x1510.png" width="1456" height="831" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2a9155e2-0ffd-4ae6-be2f-31ac7cc25617_2647x1510.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:831,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2275834,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/182938742?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a9155e2-0ffd-4ae6-be2f-31ac7cc25617_2647x1510.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!rnGB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a9155e2-0ffd-4ae6-be2f-31ac7cc25617_2647x1510.png 424w, https://substackcdn.com/image/fetch/$s_!rnGB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a9155e2-0ffd-4ae6-be2f-31ac7cc25617_2647x1510.png 848w, https://substackcdn.com/image/fetch/$s_!rnGB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a9155e2-0ffd-4ae6-be2f-31ac7cc25617_2647x1510.png 1272w, https://substackcdn.com/image/fetch/$s_!rnGB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a9155e2-0ffd-4ae6-be2f-31ac7cc25617_2647x1510.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg 
role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Here you can see the <strong>zig-zag lines</strong>, which are the trajectories taken from the starting point to the end point. The zig-zag is caused by the <strong>stochastic term</strong> that we added to the <strong>update rule</strong>. It almost looks like a <strong>drunk hiker</strong> trying to navigate the terrain.</p><p>Let us understand this using a practical example. We will use Langevin dynamics to sample from a known probability distribution.  
</p><h3>Practical example: Sampling using the Score Function</h3><p>We will use the following as the known probability distribution:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3whp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08ea8da7-f211-4944-9bf6-ed8cf324d4a8_2295x1742.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3whp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08ea8da7-f211-4944-9bf6-ed8cf324d4a8_2295x1742.png 424w, https://substackcdn.com/image/fetch/$s_!3whp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08ea8da7-f211-4944-9bf6-ed8cf324d4a8_2295x1742.png 848w, https://substackcdn.com/image/fetch/$s_!3whp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08ea8da7-f211-4944-9bf6-ed8cf324d4a8_2295x1742.png 1272w, https://substackcdn.com/image/fetch/$s_!3whp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08ea8da7-f211-4944-9bf6-ed8cf324d4a8_2295x1742.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3whp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08ea8da7-f211-4944-9bf6-ed8cf324d4a8_2295x1742.png" width="446" height="338.48214285714283" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/08ea8da7-f211-4944-9bf6-ed8cf324d4a8_2295x1742.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1105,&quot;width&quot;:1456,&quot;resizeWidth&quot;:446,&quot;bytes&quot;:415870,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/182938742?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08ea8da7-f211-4944-9bf6-ed8cf324d4a8_2295x1742.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!3whp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08ea8da7-f211-4944-9bf6-ed8cf324d4a8_2295x1742.png 424w, https://substackcdn.com/image/fetch/$s_!3whp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08ea8da7-f211-4944-9bf6-ed8cf324d4a8_2295x1742.png 848w, https://substackcdn.com/image/fetch/$s_!3whp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08ea8da7-f211-4944-9bf6-ed8cf324d4a8_2295x1742.png 1272w, https://substackcdn.com/image/fetch/$s_!3whp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08ea8da7-f211-4944-9bf6-ed8cf324d4a8_2295x1742.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>We can see that the distribution has two peaks. There are two regions where the probability of finding the data is the maximum. 
Then, the probability slowly tapers off as you move away from those peaks.</p><p>So, if we start from any point in the grid, our update rule should take us to the regions that appear in yellow in the contour plot.</p><p>This would mean that our update rule has worked: we arrive at the regions where the probability density is highest.</p><p>Here is the link to the Google Colab Notebook, where we use the score function in the Langevin update rule to compute the trajectories.</p><p><a href="https://colab.research.google.com/drive/1pu-RD0BmoTweeloruoa667tjuAaW-OL5?usp=sharing">Google Colab Notebook: Langevin Dynamics - Sampling using a known Score Function  </a></p><p>An example trajectory looks as follows:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wym1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26354949-1dae-486d-b6fb-fd4b50f28315_1953x2047.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!wym1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26354949-1dae-486d-b6fb-fd4b50f28315_1953x2047.png 424w, https://substackcdn.com/image/fetch/$s_!wym1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26354949-1dae-486d-b6fb-fd4b50f28315_1953x2047.png 848w, https://substackcdn.com/image/fetch/$s_!wym1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26354949-1dae-486d-b6fb-fd4b50f28315_1953x2047.png 1272w, 
https://substackcdn.com/image/fetch/$s_!wym1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26354949-1dae-486d-b6fb-fd4b50f28315_1953x2047.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!wym1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26354949-1dae-486d-b6fb-fd4b50f28315_1953x2047.png" width="439" height="460.1057692307692" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/26354949-1dae-486d-b6fb-fd4b50f28315_1953x2047.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1526,&quot;width&quot;:1456,&quot;resizeWidth&quot;:439,&quot;bytes&quot;:735760,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/182938742?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26354949-1dae-486d-b6fb-fd4b50f28315_1953x2047.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!wym1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26354949-1dae-486d-b6fb-fd4b50f28315_1953x2047.png 424w, https://substackcdn.com/image/fetch/$s_!wym1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26354949-1dae-486d-b6fb-fd4b50f28315_1953x2047.png 848w, 
https://substackcdn.com/image/fetch/$s_!wym1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26354949-1dae-486d-b6fb-fd4b50f28315_1953x2047.png 1272w, https://substackcdn.com/image/fetch/$s_!wym1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26354949-1dae-486d-b6fb-fd4b50f28315_1953x2047.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>You can see here that using our update rule, we are reaching towards areas where the probability is the maximum, which is exactly 
what we want.</p><p> So, it looks like we have solved the sampling part. However, this is only 50% of the story. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Iy05!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb956cce8-d298-4e86-a0c3-47421aef0ec5_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Iy05!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb956cce8-d298-4e86-a0c3-47421aef0ec5_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!Iy05!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb956cce8-d298-4e86-a0c3-47421aef0ec5_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!Iy05!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb956cce8-d298-4e86-a0c3-47421aef0ec5_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!Iy05!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb956cce8-d298-4e86-a0c3-47421aef0ec5_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Iy05!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb956cce8-d298-4e86-a0c3-47421aef0ec5_1536x1024.png" width="440" height="293.4340659340659" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b956cce8-d298-4e86-a0c3-47421aef0ec5_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:440,&quot;bytes&quot;:520623,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/182938742?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb956cce8-d298-4e86-a0c3-47421aef0ec5_1536x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!Iy05!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb956cce8-d298-4e86-a0c3-47421aef0ec5_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!Iy05!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb956cce8-d298-4e86-a0c3-47421aef0ec5_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!Iy05!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb956cce8-d298-4e86-a0c3-47421aef0ec5_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!Iy05!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb956cce8-d298-4e86-a0c3-47421aef0ec5_1536x1024.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>We have yet to understand how to train our score function so that it matches the true score function as closely as possible.</p><p>This brings us to Score-based Generative Models.</p><h3>Score-based Generative Models</h3><div class="pullquote"><p>The key idea is that since sampling with Langevin dynamics needs only the score, we can learn it directly with a neural network. 
This shift, from modeling energies to modeling scores, forms the foundation of score-based generative models.</p></div><p>Let us look at an example:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Pd5L!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2bfd701-6e47-409f-bdc3-bb85feea6794_2562x1560.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Pd5L!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2bfd701-6e47-409f-bdc3-bb85feea6794_2562x1560.png 424w, https://substackcdn.com/image/fetch/$s_!Pd5L!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2bfd701-6e47-409f-bdc3-bb85feea6794_2562x1560.png 848w, https://substackcdn.com/image/fetch/$s_!Pd5L!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2bfd701-6e47-409f-bdc3-bb85feea6794_2562x1560.png 1272w, https://substackcdn.com/image/fetch/$s_!Pd5L!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2bfd701-6e47-409f-bdc3-bb85feea6794_2562x1560.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Pd5L!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2bfd701-6e47-409f-bdc3-bb85feea6794_2562x1560.png" width="478" height="291.1991758241758" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b2bfd701-6e47-409f-bdc3-bb85feea6794_2562x1560.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:887,&quot;width&quot;:1456,&quot;resizeWidth&quot;:478,&quot;bytes&quot;:2456254,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/182938742?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2bfd701-6e47-409f-bdc3-bb85feea6794_2562x1560.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!Pd5L!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2bfd701-6e47-409f-bdc3-bb85feea6794_2562x1560.png 424w, https://substackcdn.com/image/fetch/$s_!Pd5L!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2bfd701-6e47-409f-bdc3-bb85feea6794_2562x1560.png 848w, https://substackcdn.com/image/fetch/$s_!Pd5L!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2bfd701-6e47-409f-bdc3-bb85feea6794_2562x1560.png 1272w, https://substackcdn.com/image/fetch/$s_!Pd5L!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb2bfd701-6e47-409f-bdc3-bb85feea6794_2562x1560.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The neural network score (blue) is trained to match the ground truth score (black) using an MSE loss. 
Both are represented as vector fields.</p><p> The true score function is denoted as follows: </p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;s(x) = \\nabla_{x}\\text{log}p_{data}(x)&quot;,&quot;id&quot;:&quot;NESJSAEEVJ&quot;}" data-component-name="LatexBlockToDOM"></div><p>The predicted score function is denoted as follows:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;s_{\\phi}(x) = \\nabla_{x}\\text{log}p_{\\phi}(x)&quot;,&quot;id&quot;:&quot;JJFVKCJKHF&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><blockquote><p>Score matching works by minimizing the mean squared error (MSE) between the true and estimated scores.</p></blockquote><p> The mean squared error loss is given as follows:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;L_{SM}(\\phi) = \\frac{1}{2}||s_{\\phi}(x) - s(x)||^{2}&quot;,&quot;id&quot;:&quot;YABDZNKWJV&quot;}" data-component-name="LatexBlockToDOM"></div><p>One of the main problems with this approach is that the true score function is unknown. 
If we do not know the true probability distribution, how can we possibly compute the gradient of its logarithm?</p><h3>Tractable Score Matching:</h3><p>At first glance, this objective seems infeasible because the true score s(x), which serves as the regression target, is unknown.</p><p>Fortunately, Hyv&#228;rinen and Dayan (2005) showed that integration by parts yields an equivalent objective that depends only on the model and the data samples, without requiring access to the true score.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XxLr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59b4bdb5-107f-407e-a75e-a440476b145d_2124x1882.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XxLr!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59b4bdb5-107f-407e-a75e-a440476b145d_2124x1882.png 424w, https://substackcdn.com/image/fetch/$s_!XxLr!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59b4bdb5-107f-407e-a75e-a440476b145d_2124x1882.png 848w, https://substackcdn.com/image/fetch/$s_!XxLr!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59b4bdb5-107f-407e-a75e-a440476b145d_2124x1882.png 1272w, https://substackcdn.com/image/fetch/$s_!XxLr!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59b4bdb5-107f-407e-a75e-a440476b145d_2124x1882.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!XxLr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59b4bdb5-107f-407e-a75e-a440476b145d_2124x1882.png" width="516" height="457.1703296703297" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/59b4bdb5-107f-407e-a75e-a440476b145d_2124x1882.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1290,&quot;width&quot;:1456,&quot;resizeWidth&quot;:516,&quot;bytes&quot;:1539035,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/182938742?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59b4bdb5-107f-407e-a75e-a440476b145d_2124x1882.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!XxLr!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59b4bdb5-107f-407e-a75e-a440476b145d_2124x1882.png 424w, https://substackcdn.com/image/fetch/$s_!XxLr!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59b4bdb5-107f-407e-a75e-a440476b145d_2124x1882.png 848w, https://substackcdn.com/image/fetch/$s_!XxLr!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59b4bdb5-107f-407e-a75e-a440476b145d_2124x1882.png 1272w, 
https://substackcdn.com/image/fetch/$s_!XxLr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F59b4bdb5-107f-407e-a75e-a440476b145d_2124x1882.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><div class="pullquote"><p>This example really highlights the value of research. A paper published about 20 years ago serves as the backbone of score-based diffusion models. This is truly amazing. 
</p></div><p>The modified, tractable loss function is as follows:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\tilde{L_{SM}}(\\phi) = Tr (\\nabla_{x}s_{\\phi}(x)) + \\frac{1}{2}||s_{\\phi}(x)||^{2}&quot;,&quot;id&quot;:&quot;CUZOIEHTTP&quot;}" data-component-name="LatexBlockToDOM"></div><p>Here, Tr(.) denotes the trace of a matrix, in this case the Jacobian of the score model with respect to x. The expression is evaluated at each data sample and averaged over the dataset.</p><p>Using this objective, we train the score model solely from observed samples, eliminating the need for the true score function.</p><p>Let us understand the logic behind both of these terms separately:</p><p><strong>Term 1: </strong></p><p>This term measures the divergence of your arrows, that is, whether the arrows are spreading out (exploding) or converging (imploding) at a specific point. Since we are minimizing this loss, we want this value to be as negative as possible at the data points.</p><p><em><strong>Intuition:</strong></em></p><ul><li><p><em><strong>Positive Trace:</strong> Arrows are exploding outward (like a bomb went off).</em></p></li><li><p><em><strong>Negative Trace:</strong> Arrows are being pulled inward (like a black hole or a sink drain).</em></p></li></ul><p><em>Forcing the trace to be negative at the data points makes the nearby arrows point INWARD towards the data.</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!P6Vl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb42c0506-8713-444f-b2b1-38014dd4a65b_3054x1309.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!P6Vl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb42c0506-8713-444f-b2b1-38014dd4a65b_3054x1309.png 424w, 
https://substackcdn.com/image/fetch/$s_!P6Vl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb42c0506-8713-444f-b2b1-38014dd4a65b_3054x1309.png 848w, https://substackcdn.com/image/fetch/$s_!P6Vl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb42c0506-8713-444f-b2b1-38014dd4a65b_3054x1309.png 1272w, https://substackcdn.com/image/fetch/$s_!P6Vl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb42c0506-8713-444f-b2b1-38014dd4a65b_3054x1309.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!P6Vl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb42c0506-8713-444f-b2b1-38014dd4a65b_3054x1309.png" width="1456" height="624" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b42c0506-8713-444f-b2b1-38014dd4a65b_3054x1309.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:624,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2282863,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/182938742?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb42c0506-8713-444f-b2b1-38014dd4a65b_3054x1309.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" 
srcset="https://substackcdn.com/image/fetch/$s_!P6Vl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb42c0506-8713-444f-b2b1-38014dd4a65b_3054x1309.png 424w, https://substackcdn.com/image/fetch/$s_!P6Vl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb42c0506-8713-444f-b2b1-38014dd4a65b_3054x1309.png 848w, https://substackcdn.com/image/fetch/$s_!P6Vl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb42c0506-8713-444f-b2b1-38014dd4a65b_3054x1309.png 1272w, https://substackcdn.com/image/fetch/$s_!P6Vl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb42c0506-8713-444f-b2b1-38014dd4a65b_3054x1309.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong>Term 2:</strong></p><p>This term measures the length (or strength) of your arrows. It pushes the arrows to be as short as possible, ideally zero.</p><p>Regions where the probability of the data is high contain more samples and therefore contribute more to this term.</p><p>This term drives the score towards zero in high-probability areas, so that these locations become stationary.</p><div class="pullquote"><p>To summarize, the first term makes the high-probability areas look like sinks, so that if you take out a compass and want to navigate, you are forced to move inwards towards them. Once you are near the high-probability areas, you probably don&#8217;t want to move too much, and this is achieved through the second term, which makes these points stationary. 
</p></div><p>Let us look at a practical example where this formulation is used to learn the score function.</p><p><strong>Practical Example: Score Matching</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Iy05!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb956cce8-d298-4e86-a0c3-47421aef0ec5_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Iy05!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb956cce8-d298-4e86-a0c3-47421aef0ec5_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!Iy05!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb956cce8-d298-4e86-a0c3-47421aef0ec5_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!Iy05!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb956cce8-d298-4e86-a0c3-47421aef0ec5_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!Iy05!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb956cce8-d298-4e86-a0c3-47421aef0ec5_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Iy05!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb956cce8-d298-4e86-a0c3-47421aef0ec5_1536x1024.png" width="400" height="266.75824175824175" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b956cce8-d298-4e86-a0c3-47421aef0ec5_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:400,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!Iy05!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb956cce8-d298-4e86-a0c3-47421aef0ec5_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!Iy05!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb956cce8-d298-4e86-a0c3-47421aef0ec5_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!Iy05!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb956cce8-d298-4e86-a0c3-47421aef0ec5_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!Iy05!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb956cce8-d298-4e86-a0c3-47421aef0ec5_1536x1024.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>In this example, we will solve both parts of the puzzle:</p><ol><li><p>We will train the score function</p></li><li><p>We will sample from it</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!87zK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1528f031-5649-4179-8567-26d06681ff20_1999x1999.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!87zK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1528f031-5649-4179-8567-26d06681ff20_1999x1999.png 424w, 
https://substackcdn.com/image/fetch/$s_!87zK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1528f031-5649-4179-8567-26d06681ff20_1999x1999.png 848w, https://substackcdn.com/image/fetch/$s_!87zK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1528f031-5649-4179-8567-26d06681ff20_1999x1999.png 1272w, https://substackcdn.com/image/fetch/$s_!87zK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1528f031-5649-4179-8567-26d06681ff20_1999x1999.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!87zK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1528f031-5649-4179-8567-26d06681ff20_1999x1999.png" width="398" height="398" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1528f031-5649-4179-8567-26d06681ff20_1999x1999.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1456,&quot;width&quot;:1456,&quot;resizeWidth&quot;:398,&quot;bytes&quot;:1107313,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/182938742?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1528f031-5649-4179-8567-26d06681ff20_1999x1999.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" 
srcset="https://substackcdn.com/image/fetch/$s_!87zK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1528f031-5649-4179-8567-26d06681ff20_1999x1999.png 424w, https://substackcdn.com/image/fetch/$s_!87zK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1528f031-5649-4179-8567-26d06681ff20_1999x1999.png 848w, https://substackcdn.com/image/fetch/$s_!87zK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1528f031-5649-4179-8567-26d06681ff20_1999x1999.png 1272w, https://substackcdn.com/image/fetch/$s_!87zK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1528f031-5649-4179-8567-26d06681ff20_1999x1999.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>We will learn to predict the score function for this data.</p><p>Here is the link to the Google Colab notebook, which uses the tractable score matching loss to solve this problem:</p><p><a href="https://colab.research.google.com/drive/1bf3NehKTzk7kMuqkY05u7n3fz3l8VeN9?usp=sharing">Google Colab Notebook - Using the tractable Score Matching Loss Formulation</a></p><p>This is the final learned score function:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!m_fi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5220092-d312-4538-90c0-f82af86e5074_1982x2017.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!m_fi!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5220092-d312-4538-90c0-f82af86e5074_1982x2017.png 424w, https://substackcdn.com/image/fetch/$s_!m_fi!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5220092-d312-4538-90c0-f82af86e5074_1982x2017.png 848w, https://substackcdn.com/image/fetch/$s_!m_fi!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5220092-d312-4538-90c0-f82af86e5074_1982x2017.png 1272w, 
https://substackcdn.com/image/fetch/$s_!m_fi!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5220092-d312-4538-90c0-f82af86e5074_1982x2017.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!m_fi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5220092-d312-4538-90c0-f82af86e5074_1982x2017.png" width="432" height="439.7142857142857" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d5220092-d312-4538-90c0-f82af86e5074_1982x2017.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1482,&quot;width&quot;:1456,&quot;resizeWidth&quot;:432,&quot;bytes&quot;:3056170,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/182938742?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5220092-d312-4538-90c0-f82af86e5074_1982x2017.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!m_fi!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5220092-d312-4538-90c0-f82af86e5074_1982x2017.png 424w, https://substackcdn.com/image/fetch/$s_!m_fi!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5220092-d312-4538-90c0-f82af86e5074_1982x2017.png 848w, 
https://substackcdn.com/image/fetch/$s_!m_fi!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5220092-d312-4538-90c0-f82af86e5074_1982x2017.png 1272w, https://substackcdn.com/image/fetch/$s_!m_fi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5220092-d312-4538-90c0-f82af86e5074_1982x2017.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>From the above image, we can see how this score function acts like a compass that points us to the region where the data is 
located.</p><p>We can also see sink-like regions near the data that pull you inward, so that you end up close to the data.</p><p>So, it looks like we have managed to train our score function properly :)</p><p><em>Once this score field is learned, we can also sample from it using Langevin Dynamics, exactly as we learned before:</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1Xnm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3facc3c0-e62b-4d45-b958-f2bee60c9581_1988x2011.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1Xnm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3facc3c0-e62b-4d45-b958-f2bee60c9581_1988x2011.png 424w, https://substackcdn.com/image/fetch/$s_!1Xnm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3facc3c0-e62b-4d45-b958-f2bee60c9581_1988x2011.png 848w, https://substackcdn.com/image/fetch/$s_!1Xnm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3facc3c0-e62b-4d45-b958-f2bee60c9581_1988x2011.png 1272w, https://substackcdn.com/image/fetch/$s_!1Xnm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3facc3c0-e62b-4d45-b958-f2bee60c9581_1988x2011.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1Xnm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3facc3c0-e62b-4d45-b958-f2bee60c9581_1988x2011.png" width="440" height="445.1373626373626" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3facc3c0-e62b-4d45-b958-f2bee60c9581_1988x2011.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1473,&quot;width&quot;:1456,&quot;resizeWidth&quot;:440,&quot;bytes&quot;:2004732,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/182938742?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3facc3c0-e62b-4d45-b958-f2bee60c9581_1988x2011.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!1Xnm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3facc3c0-e62b-4d45-b958-f2bee60c9581_1988x2011.png 424w, https://substackcdn.com/image/fetch/$s_!1Xnm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3facc3c0-e62b-4d45-b958-f2bee60c9581_1988x2011.png 848w, https://substackcdn.com/image/fetch/$s_!1Xnm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3facc3c0-e62b-4d45-b958-f2bee60c9581_1988x2011.png 1272w, https://substackcdn.com/image/fetch/$s_!1Xnm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3facc3c0-e62b-4d45-b958-f2bee60c9581_1988x2011.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>We start from the &#8220;green point&#8221; and end at the &#8220;blue point&#8221;. This trajectory is calculated using the Langevin Dynamics update rule that we looked at before.</p><p>Our &#8220;drunk hiker&#8221; does quite well, ending up close to the area where the probability density of the data is high :)</p><p>That&#8217;s it! This was the score matching formulation, one of the techniques in Deep Generative Modeling.</p><div class="pullquote"><p>It is very interesting to note that the two tracks of diffusion models and score-based models appear very distinct, but they converge very nicely. 
We can understand both of them using a single unified framework.</p></div><p>We will look at this in the next chapter, where we will examine the foundational role of the score function in modern diffusion models.</p><p>Initially introduced to enable efficient training of EBMs, the score function has evolved into a central component of a new generation of generative models.</p><p>Here is the link to the original paper that introduced <em>the tractable Score Matching formulation: <a href="https://jmlr.org/papers/volume6/hyvarinen05a/hyvarinen05a.pdf">https://jmlr.org/papers/volume6/hyvarinen05a/hyvarinen05a.pdf</a></em></p><p>For further reading, please refer to the book: <em>The Principles of Diffusion Models: From Origins to Advances (<a href="https://arxiv.org/abs/2510.21890">https://arxiv.org/abs/2510.21890</a>) [Pages 56-68]</em></p><p>If you like this content, please check out our bootcamps on the following topics:</p><p><strong>Modern Robot Learning</strong>: <a href="https://robotlearningbootcamp.vizuara.ai/">https://robotlearningbootcamp.vizuara.ai/</a></p><p><strong>GenAI</strong>: <a href="https://flyvidesh.online/gen-ai-professional-bootcamp">https://flyvidesh.online/gen-ai-professional-bootcamp</a></p><p><strong>RL</strong>: <a href="https://rlresearcherbootcamp.vizuara.ai/">https://rlresearcherbootcamp.vizuara.ai/</a></p><p><strong>SciML</strong>: <a href="https://flyvidesh.online/ml-bootcamp">https://flyvidesh.online/ml-bootcamp</a></p><p><strong>ML-DL</strong>: <a href="https://flyvidesh.online/ml-dl-bootcamp">https://flyvidesh.online/ml-dl-bootcamp</a></p><p><strong>CV</strong>: <a href="https://cvresearchbootcamp.vizuara.ai/">https://cvresearchbootcamp.vizuara.ai/</a></p><p></p>]]></content:encoded></item><item><title><![CDATA[What exactly is a Data efficient image Transformer (DeiT)? ]]></title><description><![CDATA[And how does it use a teacher-student model?]]></description><link>https://www.vizuaranewsletter.com/p/what-exactly-is-a-data-efficient</link><guid isPermaLink="false">https://www.vizuaranewsletter.com/p/what-exactly-is-a-data-efficient</guid><dc:creator><![CDATA[Sreedath Panat]]></dc:creator><pubDate>Wed, 31 Dec 2025 09:29:03 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/7409b32a-5447-4451-9c41-fa0c4cbfa023_886x823.gif" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1>Table of Contents</h1><ol><li><p><em>Problem with vision transformers </em></p></li><li><p><em>Introduction to the DeiT paper </em></p></li><li><p><em>Introduction to the teacher-student model</em></p></li><li><p><em>What is knowledge distillation?</em></p></li><li><p><em>DeiT architecture</em></p></li><li><p><em>CLASS and DISTIL tokens</em></p></li><li><p><em>The distillation mechanism</em></p></li><li><p><em>DeiT loss function overview</em></p></li><li><p><em>What is KL divergence loss?</em></p></li><li><p><em>So, how good was DeiT compared to other models?</em></p></li><li><p><em>Key points to note in DeiT architecture</em></p></li><li><p><em>DeiT loss function detailed</em></p><ol><li><p><em>Ground-truth (standard classification) loss</em></p></li><li><p><em>Teacher (distillation) loss</em></p><ol><li><p><em>What is the 
teacher?</em></p></li><li><p><em>What does the teacher provide?</em></p></li></ol></li><li><p><em>Why is temperature used?</em></p></li><li><p><em>Why multiply by T^2?</em></p></li><li><p><em>Final combined loss</em></p></li></ol></li><li><p><em>Coding DeiT from scratch</em></p></li><li><p><em>Conclusion</em></p></li><li><p><em>Other resources</em></p></li></ol><h1>Problem with vision transformers </h1><p>Vision Transformers (ViTs) require a huge amount of data for training. The main reason is that ViTs, unlike CNNs, have no built-in assumption of locality or translation invariance.</p><p>Convolution has a built-in property called translation equivariance (often loosely called translation invariance): if the input shifts, the output feature map shifts with it, as shown in the figure below. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HcDn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0d8414a-ba42-4fa4-a8aa-6132caedbdc6_1808x2211.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HcDn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0d8414a-ba42-4fa4-a8aa-6132caedbdc6_1808x2211.png 424w, https://substackcdn.com/image/fetch/$s_!HcDn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0d8414a-ba42-4fa4-a8aa-6132caedbdc6_1808x2211.png 848w, https://substackcdn.com/image/fetch/$s_!HcDn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0d8414a-ba42-4fa4-a8aa-6132caedbdc6_1808x2211.png 1272w, 
https://substackcdn.com/image/fetch/$s_!HcDn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0d8414a-ba42-4fa4-a8aa-6132caedbdc6_1808x2211.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HcDn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0d8414a-ba42-4fa4-a8aa-6132caedbdc6_1808x2211.png" width="412" height="503.9642857142857" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a0d8414a-ba42-4fa4-a8aa-6132caedbdc6_1808x2211.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1781,&quot;width&quot;:1456,&quot;resizeWidth&quot;:412,&quot;bytes&quot;:627313,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/182939529?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0d8414a-ba42-4fa4-a8aa-6132caedbdc6_1808x2211.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!HcDn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0d8414a-ba42-4fa4-a8aa-6132caedbdc6_1808x2211.png 424w, https://substackcdn.com/image/fetch/$s_!HcDn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0d8414a-ba42-4fa4-a8aa-6132caedbdc6_1808x2211.png 848w, 
https://substackcdn.com/image/fetch/$s_!HcDn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0d8414a-ba42-4fa4-a8aa-6132caedbdc6_1808x2211.png 1272w, https://substackcdn.com/image/fetch/$s_!HcDn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0d8414a-ba42-4fa4-a8aa-6132caedbdc6_1808x2211.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">This is a simple example to demonstrate the idea of translation invariance in convolution 
operation.</figcaption></figure></div><p>Convolution also comes with a locality bias, meaning features are assumed to be local. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!36gT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa4043c4-861f-4ede-87a3-4adeafb32f5d_3519x1136.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!36gT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa4043c4-861f-4ede-87a3-4adeafb32f5d_3519x1136.png 424w, https://substackcdn.com/image/fetch/$s_!36gT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa4043c4-861f-4ede-87a3-4adeafb32f5d_3519x1136.png 848w, https://substackcdn.com/image/fetch/$s_!36gT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa4043c4-861f-4ede-87a3-4adeafb32f5d_3519x1136.png 1272w, https://substackcdn.com/image/fetch/$s_!36gT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa4043c4-861f-4ede-87a3-4adeafb32f5d_3519x1136.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!36gT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa4043c4-861f-4ede-87a3-4adeafb32f5d_3519x1136.png" width="1456" height="470" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fa4043c4-861f-4ede-87a3-4adeafb32f5d_3519x1136.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:470,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:455721,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/182939529?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa4043c4-861f-4ede-87a3-4adeafb32f5d_3519x1136.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!36gT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa4043c4-861f-4ede-87a3-4adeafb32f5d_3519x1136.png 424w, https://substackcdn.com/image/fetch/$s_!36gT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa4043c4-861f-4ede-87a3-4adeafb32f5d_3519x1136.png 848w, https://substackcdn.com/image/fetch/$s_!36gT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa4043c4-861f-4ede-87a3-4adeafb32f5d_3519x1136.png 1272w, https://substackcdn.com/image/fetch/$s_!36gT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa4043c4-861f-4ede-87a3-4adeafb32f5d_3519x1136.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Convolution also comes with a locality bias, meaning features are assumed to be local. In this example, look at the digits 8, 0, and 6. The bottom parts of all three digits share the same local features, so a convolution filter looking only at that region cannot tell them apart. </figcaption></figure></div><p>Many image-based applications require modeling long-range dependencies. For example, if a person is crossing the street, a self-driving car should not move forward even if the light is green. It does not matter how far apart the pixels of the green light and the pixels of the person are: the model has to relate them. A convolution, whose view of the image is local, struggles to capture this kind of relationship. 
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!uEBr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7dca0b35-84f5-44d9-9923-aff84a341d41_2575x1552.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!uEBr!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7dca0b35-84f5-44d9-9923-aff84a341d41_2575x1552.png 424w, https://substackcdn.com/image/fetch/$s_!uEBr!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7dca0b35-84f5-44d9-9923-aff84a341d41_2575x1552.png 848w, https://substackcdn.com/image/fetch/$s_!uEBr!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7dca0b35-84f5-44d9-9923-aff84a341d41_2575x1552.png 1272w, https://substackcdn.com/image/fetch/$s_!uEBr!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7dca0b35-84f5-44d9-9923-aff84a341d41_2575x1552.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!uEBr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7dca0b35-84f5-44d9-9923-aff84a341d41_2575x1552.png" width="1456" height="878" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7dca0b35-84f5-44d9-9923-aff84a341d41_2575x1552.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:878,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2070094,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/182939529?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7dca0b35-84f5-44d9-9923-aff84a341d41_2575x1552.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!uEBr!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7dca0b35-84f5-44d9-9923-aff84a341d41_2575x1552.png 424w, https://substackcdn.com/image/fetch/$s_!uEBr!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7dca0b35-84f5-44d9-9923-aff84a341d41_2575x1552.png 848w, https://substackcdn.com/image/fetch/$s_!uEBr!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7dca0b35-84f5-44d9-9923-aff84a341d41_2575x1552.png 1272w, https://substackcdn.com/image/fetch/$s_!uEBr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7dca0b35-84f5-44d9-9923-aff84a341d41_2575x1552.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Many image-based applications require long-range dependencies. For example, if a person is crossing the street, a self-driving car should not move forward even if the light is green.</figcaption></figure></div><p>Because Vision Transformers lack these built-in properties, they have the following shortcomings: </p><ul><li><p>They need to &#8220;learn&#8221; these properties from data, because the properties are not built into the architecture.</p></li><li><p>As a result, ViTs need huge labeled datasets and long training schedules to reach the same performance as CNNs.</p></li></ul><h1>Introduction to the DeiT paper </h1><p>The data-efficient image transformer (DeiT) paper was written by a team of researchers from Facebook AI Research. 
It has more than 10,000 citations and is regarded as one of the most influential papers extending the capabilities of the Vision Transformer. </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GSgi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37df83fe-10df-478b-bd5b-3dd9af8bf52d_4487x891.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GSgi!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37df83fe-10df-478b-bd5b-3dd9af8bf52d_4487x891.png 424w, https://substackcdn.com/image/fetch/$s_!GSgi!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37df83fe-10df-478b-bd5b-3dd9af8bf52d_4487x891.png 848w, https://substackcdn.com/image/fetch/$s_!GSgi!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37df83fe-10df-478b-bd5b-3dd9af8bf52d_4487x891.png 1272w, https://substackcdn.com/image/fetch/$s_!GSgi!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37df83fe-10df-478b-bd5b-3dd9af8bf52d_4487x891.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GSgi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37df83fe-10df-478b-bd5b-3dd9af8bf52d_4487x891.png" width="1456" height="289" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/37df83fe-10df-478b-bd5b-3dd9af8bf52d_4487x891.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:289,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1041365,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/182939529?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37df83fe-10df-478b-bd5b-3dd9af8bf52d_4487x891.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!GSgi!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37df83fe-10df-478b-bd5b-3dd9af8bf52d_4487x891.png 424w, https://substackcdn.com/image/fetch/$s_!GSgi!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37df83fe-10df-478b-bd5b-3dd9af8bf52d_4487x891.png 848w, https://substackcdn.com/image/fetch/$s_!GSgi!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37df83fe-10df-478b-bd5b-3dd9af8bf52d_4487x891.png 1272w, https://substackcdn.com/image/fetch/$s_!GSgi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37df83fe-10df-478b-bd5b-3dd9af8bf52d_4487x891.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>This is the link to the paper published on arXiv: <a href="https://arxiv.org/abs/2012.12877">https://arxiv.org/abs/2012.12877</a></p><div 
class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hlNr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7420e0cc-ecda-4a68-8690-04095537a897_1002x972.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hlNr!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7420e0cc-ecda-4a68-8690-04095537a897_1002x972.png 424w, https://substackcdn.com/image/fetch/$s_!hlNr!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7420e0cc-ecda-4a68-8690-04095537a897_1002x972.png 848w, https://substackcdn.com/image/fetch/$s_!hlNr!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7420e0cc-ecda-4a68-8690-04095537a897_1002x972.png 1272w, https://substackcdn.com/image/fetch/$s_!hlNr!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7420e0cc-ecda-4a68-8690-04095537a897_1002x972.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hlNr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7420e0cc-ecda-4a68-8690-04095537a897_1002x972.png" width="1002" height="972" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7420e0cc-ecda-4a68-8690-04095537a897_1002x972.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:972,&quot;width&quot;:1002,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:204171,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/182939529?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7420e0cc-ecda-4a68-8690-04095537a897_1002x972.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!hlNr!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7420e0cc-ecda-4a68-8690-04095537a897_1002x972.png 424w, https://substackcdn.com/image/fetch/$s_!hlNr!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7420e0cc-ecda-4a68-8690-04095537a897_1002x972.png 848w, https://substackcdn.com/image/fetch/$s_!hlNr!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7420e0cc-ecda-4a68-8690-04095537a897_1002x972.png 1272w, https://substackcdn.com/image/fetch/$s_!hlNr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7420e0cc-ecda-4a68-8690-04095537a897_1002x972.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">DeiT paper abstract</figcaption></figure></div><p>The DeiT paper aimed to make Transformers <strong>data-efficient</strong> enough to train on standard ImageNet-1k (1.2M images) without pretraining on a larger external dataset. For comparison, the original ViT model was pretrained on around 300 million images (the JFT-300M dataset). </p><p>The snippet below is taken directly from the DeiT paper, which quotes the ViT paper. 
</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HjKQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11a90b3c-8ede-4150-8d65-2feab113ba81_4281x934.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HjKQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11a90b3c-8ede-4150-8d65-2feab113ba81_4281x934.png 424w, https://substackcdn.com/image/fetch/$s_!HjKQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11a90b3c-8ede-4150-8d65-2feab113ba81_4281x934.png 848w, https://substackcdn.com/image/fetch/$s_!HjKQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11a90b3c-8ede-4150-8d65-2feab113ba81_4281x934.png 1272w, https://substackcdn.com/image/fetch/$s_!HjKQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11a90b3c-8ede-4150-8d65-2feab113ba81_4281x934.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HjKQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11a90b3c-8ede-4150-8d65-2feab113ba81_4281x934.png" width="1456" height="318" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/11a90b3c-8ede-4150-8d65-2feab113ba81_4281x934.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:318,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1027574,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/182939529?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11a90b3c-8ede-4150-8d65-2feab113ba81_4281x934.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!HjKQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11a90b3c-8ede-4150-8d65-2feab113ba81_4281x934.png 424w, https://substackcdn.com/image/fetch/$s_!HjKQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11a90b3c-8ede-4150-8d65-2feab113ba81_4281x934.png 848w, https://substackcdn.com/image/fetch/$s_!HjKQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11a90b3c-8ede-4150-8d65-2feab113ba81_4281x934.png 1272w, https://substackcdn.com/image/fetch/$s_!HjKQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F11a90b3c-8ede-4150-8d65-2feab113ba81_4281x934.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>The DeiT paper, &#8220;Training data-efficient image transformers &amp; distillation through attention&#8221; (ICML 2021), showed that ViTs can be trained from scratch without large external datasets by using <strong>knowledge 
distillation</strong> from a strong CNN teacher.</p><p>This raises the question: what exactly is a teacher-student model?</p><h1>Introduction to the teacher-student model</h1><p>The teacher-student model is an intuitive way of thinking about how knowledge can be transferred from a complex system to a simpler one. It mirrors how learning often happens in real life. </p><p>In this setup, the teacher is usually a large, well-trained model that has learned rich representations from data. It is accurate but computationally expensive. The student is a smaller, lighter model that we actually want to deploy in practice because it is faster and cheaper to run. </p><p>Instead of training the student on hard labels alone, the idea is to let the student learn by observing the teacher&#8217;s behavior, especially the probability distributions (soft predictions) that the teacher produces. These predictions contain much more information than just the final class label. 
They reflect the teacher&#8217;s understanding of similarities and ambiguities in the data.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3AGX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fabebc4-fcf7-4d03-a76e-4a435a6df13a_1946x2054.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3AGX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fabebc4-fcf7-4d03-a76e-4a435a6df13a_1946x2054.png 424w, https://substackcdn.com/image/fetch/$s_!3AGX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fabebc4-fcf7-4d03-a76e-4a435a6df13a_1946x2054.png 848w, https://substackcdn.com/image/fetch/$s_!3AGX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fabebc4-fcf7-4d03-a76e-4a435a6df13a_1946x2054.png 1272w, https://substackcdn.com/image/fetch/$s_!3AGX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fabebc4-fcf7-4d03-a76e-4a435a6df13a_1946x2054.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3AGX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fabebc4-fcf7-4d03-a76e-4a435a6df13a_1946x2054.png" width="1456" height="1537" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8fabebc4-fcf7-4d03-a76e-4a435a6df13a_1946x2054.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1537,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:683385,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/182939529?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fabebc4-fcf7-4d03-a76e-4a435a6df13a_1946x2054.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3AGX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fabebc4-fcf7-4d03-a76e-4a435a6df13a_1946x2054.png 424w, https://substackcdn.com/image/fetch/$s_!3AGX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fabebc4-fcf7-4d03-a76e-4a435a6df13a_1946x2054.png 848w, https://substackcdn.com/image/fetch/$s_!3AGX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fabebc4-fcf7-4d03-a76e-4a435a6df13a_1946x2054.png 1272w, https://substackcdn.com/image/fetch/$s_!3AGX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fabebc4-fcf7-4d03-a76e-4a435a6df13a_1946x2054.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<h1>What is knowledge distillation?</h1><p>Knowledge distillation is the formal process by which this transfer from teacher to student occurs, and the key insight is that the soft targets produced by the teacher encode what is often called dark knowledge, meaning information about how the teacher ranks different classes and how confident it is in each. </p><p>During training, the student is optimized to match these soft outputs, typically using a softened softmax with a temperature parameter, along with or sometimes instead of the original ground truth labels. </p><p>This helps the student learn smoother decision boundaries and capture generalization patterns that would be very hard to infer from hard labels alone, especially when the dataset is small or noisy. 
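</p><p>As a rough illustration of the softened softmax mentioned above, the sketch below applies a temperature to a set of teacher logits. The logits and the helper name softmax_with_temperature are made up for this example and do not come from any particular teacher model.</p>
<pre><code class="language-python">import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature; larger temperatures give flatter outputs."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

teacher_logits = [4.0, 2.0, 0.5]             # hypothetical logits for three classes
sharp = softmax_with_temperature(teacher_logits, temperature=1.0)
soft = softmax_with_temperature(teacher_logits, temperature=4.0)
print([round(p, 3) for p in sharp])
print([round(p, 3) for p in soft])
</code></pre>
<p>Raising the temperature flattens the distribution, so the less likely classes receive more probability mass and their relative ranking becomes a stronger training signal for the student. 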
</p><p>As a result, the student often performs surprisingly well, sometimes approaching the teacher&#8217;s accuracy while using a fraction of the parameters and compute.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5RTU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1c6f8de-8d00-4ac1-aa16-628913a0d1c6_4475x893.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5RTU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1c6f8de-8d00-4ac1-aa16-628913a0d1c6_4475x893.png 424w, https://substackcdn.com/image/fetch/$s_!5RTU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1c6f8de-8d00-4ac1-aa16-628913a0d1c6_4475x893.png 848w, https://substackcdn.com/image/fetch/$s_!5RTU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1c6f8de-8d00-4ac1-aa16-628913a0d1c6_4475x893.png 1272w, https://substackcdn.com/image/fetch/$s_!5RTU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1c6f8de-8d00-4ac1-aa16-628913a0d1c6_4475x893.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5RTU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1c6f8de-8d00-4ac1-aa16-628913a0d1c6_4475x893.png" width="1456" height="291" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e1c6f8de-8d00-4ac1-aa16-628913a0d1c6_4475x893.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:291,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:353326,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/182939529?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1c6f8de-8d00-4ac1-aa16-628913a0d1c6_4475x893.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!5RTU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1c6f8de-8d00-4ac1-aa16-628913a0d1c6_4475x893.png 424w, https://substackcdn.com/image/fetch/$s_!5RTU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1c6f8de-8d00-4ac1-aa16-628913a0d1c6_4475x893.png 848w, https://substackcdn.com/image/fetch/$s_!5RTU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1c6f8de-8d00-4ac1-aa16-628913a0d1c6_4475x893.png 1272w, https://substackcdn.com/image/fetch/$s_!5RTU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1c6f8de-8d00-4ac1-aa16-628913a0d1c6_4475x893.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!Yts4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7610e13b-ece9-423d-b29a-6e6eb69a35ed_5668x705.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Yts4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7610e13b-ece9-423d-b29a-6e6eb69a35ed_5668x705.png 424w, https://substackcdn.com/image/fetch/$s_!Yts4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7610e13b-ece9-423d-b29a-6e6eb69a35ed_5668x705.png 848w, https://substackcdn.com/image/fetch/$s_!Yts4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7610e13b-ece9-423d-b29a-6e6eb69a35ed_5668x705.png 1272w, https://substackcdn.com/image/fetch/$s_!Yts4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7610e13b-ece9-423d-b29a-6e6eb69a35ed_5668x705.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Yts4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7610e13b-ece9-423d-b29a-6e6eb69a35ed_5668x705.png" width="1456" height="181" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7610e13b-ece9-423d-b29a-6e6eb69a35ed_5668x705.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:181,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:502218,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/182939529?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7610e13b-ece9-423d-b29a-6e6eb69a35ed_5668x705.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Yts4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7610e13b-ece9-423d-b29a-6e6eb69a35ed_5668x705.png 424w, https://substackcdn.com/image/fetch/$s_!Yts4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7610e13b-ece9-423d-b29a-6e6eb69a35ed_5668x705.png 848w, https://substackcdn.com/image/fetch/$s_!Yts4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7610e13b-ece9-423d-b29a-6e6eb69a35ed_5668x705.png 1272w, https://substackcdn.com/image/fetch/$s_!Yts4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7610e13b-ece9-423d-b29a-6e6eb69a35ed_5668x705.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><h1>DeiT architecture</h1><p>DeiT retains the ViT architecture:</p><p>Patch embedding &#8594; positional embedding &#8594; transformer encoder &#8594; classification head.</p><p>Each 
image is divided into fixed-size patches, flattened, and linearly projected into embedding vectors.</p><p>The architecture diagram below is a modified version that I adapted from the DeiT paper. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LcCz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8df7b212-a077-4b2a-8af6-411ba645e3e1_2041x1959.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LcCz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8df7b212-a077-4b2a-8af6-411ba645e3e1_2041x1959.png 424w, https://substackcdn.com/image/fetch/$s_!LcCz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8df7b212-a077-4b2a-8af6-411ba645e3e1_2041x1959.png 848w, https://substackcdn.com/image/fetch/$s_!LcCz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8df7b212-a077-4b2a-8af6-411ba645e3e1_2041x1959.png 1272w, https://substackcdn.com/image/fetch/$s_!LcCz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8df7b212-a077-4b2a-8af6-411ba645e3e1_2041x1959.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LcCz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8df7b212-a077-4b2a-8af6-411ba645e3e1_2041x1959.png" width="1456" height="1398" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8df7b212-a077-4b2a-8af6-411ba645e3e1_2041x1959.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1398,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:683031,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/182939529?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8df7b212-a077-4b2a-8af6-411ba645e3e1_2041x1959.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!LcCz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8df7b212-a077-4b2a-8af6-411ba645e3e1_2041x1959.png 424w, https://substackcdn.com/image/fetch/$s_!LcCz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8df7b212-a077-4b2a-8af6-411ba645e3e1_2041x1959.png 848w, https://substackcdn.com/image/fetch/$s_!LcCz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8df7b212-a077-4b2a-8af6-411ba645e3e1_2041x1959.png 1272w, https://substackcdn.com/image/fetch/$s_!LcCz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8df7b212-a077-4b2a-8af6-411ba645e3e1_2041x1959.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<h1>CLASS and DISTIL tokens</h1><p>The most significant difference between DeiT and other architectures is the Distillation Token. The original Vision Transformer only had the classification token, also known as the class token. 
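</p><p>To make the two tokens concrete, the sketch below builds the input sequence for a single image: the patch embeddings plus one class token and one distillation token. The sizes (196 patches from a 224-pixel image with 16-pixel patches, 8-dimensional embeddings), the zero-valued tokens, the ordering of the tokens, and the helper name build_token_sequence are all assumptions chosen for illustration, not values taken from the DeiT code.</p>
<pre><code class="language-python">import random

def build_token_sequence(patch_embeddings, cls_token, dist_token):
    """Return the sequence the encoder sees: class token, distillation token, then patches."""
    return [cls_token, dist_token] + list(patch_embeddings)

random.seed(0)
num_patches = (224 // 16) ** 2   # assumed image and patch sizes, giving 196 patches
dim = 8                          # kept tiny for readability; real models use far more
patches = [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_patches)]
cls_token = [0.0] * dim          # learnable in a real model; zeros here as a placeholder
dist_token = [0.0] * dim         # a second, separate learnable vector

tokens = build_token_sequence(patches, cls_token, dist_token)
print(len(tokens))               # 196 patches + 2 special tokens = 198
</code></pre>
<p>The encoder therefore processes two more tokens than there are patches, and the outputs at the two special positions are what the classification and distillation objectives read. 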
</p><p>Two special tokens are added to the sequence of patch embeddings: the <strong>[CLS] token</strong> for class prediction and the <strong>[DIST] token</strong> for distillation learning.</p><p>Both tokens are updated through all transformer layers via self-attention, so each can attend to the patch tokens and to the other special token.</p><p>At the top of the network, the final embedding of the [CLS] token (for standard classification) and that of the [DIST] token (for teacher supervision) are each mapped to class probabilities.</p><h1>The distillation mechanism</h1><p>Knowledge distillation uses a teacher-student setup.</p><p><strong>Teacher-Student setup:</strong> A pretrained CNN (teacher) guides the transformer (student).</p><p>Knowledge distillation can be performed as either hard distillation or soft distillation.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Fq22!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F249075e6-aaa0-4d53-8966-50104dfcc97f_3930x1017.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Fq22!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F249075e6-aaa0-4d53-8966-50104dfcc97f_3930x1017.png 424w, https://substackcdn.com/image/fetch/$s_!Fq22!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F249075e6-aaa0-4d53-8966-50104dfcc97f_3930x1017.png 848w, https://substackcdn.com/image/fetch/$s_!Fq22!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F249075e6-aaa0-4d53-8966-50104dfcc97f_3930x1017.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Fq22!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F249075e6-aaa0-4d53-8966-50104dfcc97f_3930x1017.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Fq22!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F249075e6-aaa0-4d53-8966-50104dfcc97f_3930x1017.png" width="1456" height="377" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/249075e6-aaa0-4d53-8966-50104dfcc97f_3930x1017.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:377,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:369424,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/182939529?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F249075e6-aaa0-4d53-8966-50104dfcc97f_3930x1017.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Fq22!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F249075e6-aaa0-4d53-8966-50104dfcc97f_3930x1017.png 424w, https://substackcdn.com/image/fetch/$s_!Fq22!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F249075e6-aaa0-4d53-8966-50104dfcc97f_3930x1017.png 848w, 
https://substackcdn.com/image/fetch/$s_!Fq22!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F249075e6-aaa0-4d53-8966-50104dfcc97f_3930x1017.png 1272w, https://substackcdn.com/image/fetch/$s_!Fq22!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F249075e6-aaa0-4d53-8966-50104dfcc97f_3930x1017.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>DeiT uses <strong>hard distillation</strong>, where the [DIST] token learns from the teacher&#8217;s output while the [CLS] token 
learns from the ground truth labels.</p><h1>DeiT loss function overview</h1><p>In DeiT, the loss function is built to help a Vision Transformer learn efficiently from limited data. The core idea is that the model should learn from two sources at the same time. </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!seUf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ba3a3b4-4af6-40b2-84e5-65e315f1004d_6023x664.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!seUf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ba3a3b4-4af6-40b2-84e5-65e315f1004d_6023x664.png 424w, https://substackcdn.com/image/fetch/$s_!seUf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ba3a3b4-4af6-40b2-84e5-65e315f1004d_6023x664.png 848w, https://substackcdn.com/image/fetch/$s_!seUf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ba3a3b4-4af6-40b2-84e5-65e315f1004d_6023x664.png 1272w, https://substackcdn.com/image/fetch/$s_!seUf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ba3a3b4-4af6-40b2-84e5-65e315f1004d_6023x664.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!seUf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ba3a3b4-4af6-40b2-84e5-65e315f1004d_6023x664.png" width="1456" height="161" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0ba3a3b4-4af6-40b2-84e5-65e315f1004d_6023x664.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:161,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:294326,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/182939529?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ba3a3b4-4af6-40b2-84e5-65e315f1004d_6023x664.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!seUf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ba3a3b4-4af6-40b2-84e5-65e315f1004d_6023x664.png 424w, https://substackcdn.com/image/fetch/$s_!seUf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ba3a3b4-4af6-40b2-84e5-65e315f1004d_6023x664.png 848w, https://substackcdn.com/image/fetch/$s_!seUf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ba3a3b4-4af6-40b2-84e5-65e315f1004d_6023x664.png 1272w, https://substackcdn.com/image/fetch/$s_!seUf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ba3a3b4-4af6-40b2-84e5-65e315f1004d_6023x664.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>The first loss is a standard cross-entropy loss. This loss is computed between the class token output and the ground truth label. 
This part is exactly the same as regular supervised training. It ensures that the model learns the actual classification task correctly.</p><p>The second loss is also a cross-entropy loss. This time it is computed between the distillation token output and the teacher&#8217;s predicted label. In DeiT, the teacher&#8217;s prediction is reduced to a hard label (its most likely class) rather than used as a soft probability distribution. This is why the scheme is called hard distillation, in contrast to the usual KL-divergence-based soft distillation.</p><p>The final training loss combines these two terms with equal weight, typically as their average. This forces the student transformer to agree with both the dataset labels and the teacher&#8217;s decisions. During inference, the model can use the class token alone or combine both outputs.</p><h1>What is KL divergence loss?</h1><p>Kullback&#8211;Leibler divergence, usually called KL divergence, measures how different two probability distributions are, and it is often used as a loss function. It is not a distance in the strict mathematical sense, because it is not symmetric: swapping the two distributions generally changes the value. Instead, it tells us how much information is lost when one distribution is used to approximate another. 
In machine learning, it is commonly used when both the target and the prediction are probability distributions.</p><p>The KL divergence loss between a teacher distribution P and a student distribution Q is written as:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!VBrJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F821675c9-0ff4-4291-8a21-6c5271549b54_624x166.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!VBrJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F821675c9-0ff4-4291-8a21-6c5271549b54_624x166.png 424w, https://substackcdn.com/image/fetch/$s_!VBrJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F821675c9-0ff4-4291-8a21-6c5271549b54_624x166.png 848w, https://substackcdn.com/image/fetch/$s_!VBrJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F821675c9-0ff4-4291-8a21-6c5271549b54_624x166.png 1272w, https://substackcdn.com/image/fetch/$s_!VBrJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F821675c9-0ff4-4291-8a21-6c5271549b54_624x166.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!VBrJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F821675c9-0ff4-4291-8a21-6c5271549b54_624x166.png" width="624" height="166" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/821675c9-0ff4-4291-8a21-6c5271549b54_624x166.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:166,&quot;width&quot;:624,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:22974,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/182939529?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F821675c9-0ff4-4291-8a21-6c5271549b54_624x166.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!VBrJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F821675c9-0ff4-4291-8a21-6c5271549b54_624x166.png 424w, https://substackcdn.com/image/fetch/$s_!VBrJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F821675c9-0ff4-4291-8a21-6c5271549b54_624x166.png 848w, https://substackcdn.com/image/fetch/$s_!VBrJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F821675c9-0ff4-4291-8a21-6c5271549b54_624x166.png 1272w, https://substackcdn.com/image/fetch/$s_!VBrJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F821675c9-0ff4-4291-8a21-6c5271549b54_624x166.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Here, C is the number of classes, P(i) is the probability assigned by the teacher to class i, and Q(i) is the probability assigned by the student to the same class.</p><p>In knowledge 
distillation, these probabilities usually come from a softmax with temperature T, so the loss is often written as:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4Cpk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2c61d4a-790f-42a6-8028-3cd09dd38586_1218x420.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4Cpk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2c61d4a-790f-42a6-8028-3cd09dd38586_1218x420.png 424w, https://substackcdn.com/image/fetch/$s_!4Cpk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2c61d4a-790f-42a6-8028-3cd09dd38586_1218x420.png 848w, https://substackcdn.com/image/fetch/$s_!4Cpk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2c61d4a-790f-42a6-8028-3cd09dd38586_1218x420.png 1272w, https://substackcdn.com/image/fetch/$s_!4Cpk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2c61d4a-790f-42a6-8028-3cd09dd38586_1218x420.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4Cpk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2c61d4a-790f-42a6-8028-3cd09dd38586_1218x420.png" width="1218" height="420" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f2c61d4a-790f-42a6-8028-3cd09dd38586_1218x420.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:420,&quot;width&quot;:1218,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:61370,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/182939529?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2c61d4a-790f-42a6-8028-3cd09dd38586_1218x420.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!4Cpk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2c61d4a-790f-42a6-8028-3cd09dd38586_1218x420.png 424w, https://substackcdn.com/image/fetch/$s_!4Cpk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2c61d4a-790f-42a6-8028-3cd09dd38586_1218x420.png 848w, https://substackcdn.com/image/fetch/$s_!4Cpk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2c61d4a-790f-42a6-8028-3cd09dd38586_1218x420.png 1272w, https://substackcdn.com/image/fetch/$s_!4Cpk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2c61d4a-790f-42a6-8028-3cd09dd38586_1218x420.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p>Let us take a straightforward classification example with three classes, say cat, dog, and horse, and assume we already have a well trained teacher model and a smaller student model that we want to train using KL divergence. For one input image, the teacher does not just say &#8220;this is a cat&#8221;; instead, it outputs probabilities like cat 0.7, dog 0.2, horse 0.1, which already tell us that the image looks mainly like a cat, with some similarity to a dog and very little to a horse. This full probability vector is the teacher distribution.</p><p>Now assume the student model, for the same image, outputs cat 0.4, dog 0.4, horse 0.2, which clearly shows confusion between cat and dog and more uncertainty overall. 
KL divergence compares these two distributions class by class and asks a simple question: how much information is lost if the student&#8217;s distribution is used in place of the teacher&#8217;s? Because the student assigns much less probability to cat than the teacher does, and too much to dog and horse, the KL divergence is relatively large: taking KL(teacher || student) with natural logarithms, 0.7 ln(0.7/0.4) + 0.2 ln(0.2/0.4) + 0.1 ln(0.1/0.2) &#8776; 0.18 nats.</p><p>During training, the KL divergence loss pushes the student toward the teacher&#8217;s behavior. Over time, the student updates its parameters so that its output gradually becomes something like cat 0.65, dog 0.25, horse 0.1, which is much closer to what the teacher believes; the same calculation now gives roughly 0.007 nats. When the two distributions become similar, the KL divergence becomes small, which indicates that the student has absorbed not just the teacher&#8217;s final label but also its relative confidence across classes.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!95bF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F163dfa9d-ebb7-4da5-ad3c-e989ac22c4dd_2075x1927.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!95bF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F163dfa9d-ebb7-4da5-ad3c-e989ac22c4dd_2075x1927.png 424w, https://substackcdn.com/image/fetch/$s_!95bF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F163dfa9d-ebb7-4da5-ad3c-e989ac22c4dd_2075x1927.png 848w, 
https://substackcdn.com/image/fetch/$s_!95bF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F163dfa9d-ebb7-4da5-ad3c-e989ac22c4dd_2075x1927.png 1272w, https://substackcdn.com/image/fetch/$s_!95bF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F163dfa9d-ebb7-4da5-ad3c-e989ac22c4dd_2075x1927.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!95bF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F163dfa9d-ebb7-4da5-ad3c-e989ac22c4dd_2075x1927.png" width="1456" height="1352" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/163dfa9d-ebb7-4da5-ad3c-e989ac22c4dd_2075x1927.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1352,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1303871,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/182939529?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F163dfa9d-ebb7-4da5-ad3c-e989ac22c4dd_2075x1927.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!95bF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F163dfa9d-ebb7-4da5-ad3c-e989ac22c4dd_2075x1927.png 424w, 
https://substackcdn.com/image/fetch/$s_!95bF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F163dfa9d-ebb7-4da5-ad3c-e989ac22c4dd_2075x1927.png 848w, https://substackcdn.com/image/fetch/$s_!95bF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F163dfa9d-ebb7-4da5-ad3c-e989ac22c4dd_2075x1927.png 1272w, https://substackcdn.com/image/fetch/$s_!95bF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F163dfa9d-ebb7-4da5-ad3c-e989ac22c4dd_2075x1927.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Complete architecture diagram of DeiT, showing the CNN teacher and the ViT student</figcaption></figure></div><h1>So, how good was DeiT compared to other models?</h1><p>DeiT outperformed ViT and comparable-size EfficientNet models (SOTA at the time) in both accuracy and throughput (images processed per second).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!aE-3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96eba24a-1d91-49a6-9f05-4febce8a4f5a_2076x1926.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!aE-3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96eba24a-1d91-49a6-9f05-4febce8a4f5a_2076x1926.png 424w, https://substackcdn.com/image/fetch/$s_!aE-3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96eba24a-1d91-49a6-9f05-4febce8a4f5a_2076x1926.png 848w, https://substackcdn.com/image/fetch/$s_!aE-3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96eba24a-1d91-49a6-9f05-4febce8a4f5a_2076x1926.png 1272w, https://substackcdn.com/image/fetch/$s_!aE-3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96eba24a-1d91-49a6-9f05-4febce8a4f5a_2076x1926.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!aE-3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96eba24a-1d91-49a6-9f05-4febce8a4f5a_2076x1926.png" width="1456" height="1351" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/96eba24a-1d91-49a6-9f05-4febce8a4f5a_2076x1926.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1351,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:659311,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/182939529?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96eba24a-1d91-49a6-9f05-4febce8a4f5a_2076x1926.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!aE-3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96eba24a-1d91-49a6-9f05-4febce8a4f5a_2076x1926.png 424w, https://substackcdn.com/image/fetch/$s_!aE-3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96eba24a-1d91-49a6-9f05-4febce8a4f5a_2076x1926.png 848w, https://substackcdn.com/image/fetch/$s_!aE-3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96eba24a-1d91-49a6-9f05-4febce8a4f5a_2076x1926.png 1272w, https://substackcdn.com/image/fetch/$s_!aE-3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96eba24a-1d91-49a6-9f05-4febce8a4f5a_2076x1926.png 
1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">This figure is taken directly from the DeiT paper</figcaption></figure></div><h1>Key points to note in the DeiT architecture</h1><p>I am noting these points here as a reminder to myself and also for those who had the same questions as me.</p><ul><li><p>The DISTIL token is randomly initialized, just like the CLASS token</p></li><li><p>The teacher is a pretrained model (a CNN or a ViT)</p></li><li><p>The CNN teacher is used only to produce soft/hard targets for the given input image</p></li><li><p>The CNN does not directly influence the value of the DISTIL token</p></li></ul><h1>DeiT loss function - detailed</h1><p>Big 
picture first. DeiT trains a Vision Transformer using two sources of supervision at the same time:</p><ol><li><p>The real ground-truth label</p></li><li><p>A strong CNN teacher&#8217;s prediction</p></li></ol><p>The model is asked to satisfy both, and the final loss is a weighted combination of these two objectives.</p><h2>Ground-truth (standard classification) loss</h2><p>What is happening conceptually? The Vision Transformer produces a prediction using its [CLS] token. This prediction is compared with the true label provided in the dataset using the standard cross-entropy loss.</p><ul><li><p>If the model assigns high probability to the correct class &#8594; good.</p></li><li><p>If it puts most of its probability on a wrong class &#8594; penalty.</p></li></ul><p>Why is this needed? This ensures the model learns: &#8220;What class does this image actually belong to?&#8221; This is just normal supervised learning.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jlzH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf4ad1ce-3c41-422f-983d-d72036e4bc76_3774x1059.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jlzH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf4ad1ce-3c41-422f-983d-d72036e4bc76_3774x1059.png 424w, https://substackcdn.com/image/fetch/$s_!jlzH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf4ad1ce-3c41-422f-983d-d72036e4bc76_3774x1059.png 848w, https://substackcdn.com/image/fetch/$s_!jlzH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf4ad1ce-3c41-422f-983d-d72036e4bc76_3774x1059.png 1272w, 
https://substackcdn.com/image/fetch/$s_!jlzH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf4ad1ce-3c41-422f-983d-d72036e4bc76_3774x1059.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!jlzH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf4ad1ce-3c41-422f-983d-d72036e4bc76_3774x1059.png" width="1456" height="409" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/df4ad1ce-3c41-422f-983d-d72036e4bc76_3774x1059.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:409,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:238921,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/182939529?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf4ad1ce-3c41-422f-983d-d72036e4bc76_3774x1059.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!jlzH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf4ad1ce-3c41-422f-983d-d72036e4bc76_3774x1059.png 424w, https://substackcdn.com/image/fetch/$s_!jlzH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf4ad1ce-3c41-422f-983d-d72036e4bc76_3774x1059.png 848w, 
https://substackcdn.com/image/fetch/$s_!jlzH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf4ad1ce-3c41-422f-983d-d72036e4bc76_3774x1059.png 1272w, https://substackcdn.com/image/fetch/$s_!jlzH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf4ad1ce-3c41-422f-983d-d72036e4bc76_3774x1059.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h2>Teacher (distillation) loss</h2><p>This is the <strong>key innovation of DeiT</strong>.</p><h3>What is the 
teacher?</h3><ul><li><p>A pretrained CNN (for example, RegNet).</p></li><li><p>The CNN is fixed - it is not trained further.</p></li></ul><h3>What does the teacher provide?</h3><p>Instead of a single hard label, the teacher gives:</p><ul><li><p>A soft probability distribution over all classes</p></li><li><p>Example intuition:</p><ul><li><p>&#8220;This image is 70% cat, 20% dog, 10% fox&#8221;</p></li></ul></li></ul><p>This contains much richer information than a single label.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XtrB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95b97b03-66d0-48a7-9d43-4ab51fde648e_3151x1269.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XtrB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95b97b03-66d0-48a7-9d43-4ab51fde648e_3151x1269.png 424w, https://substackcdn.com/image/fetch/$s_!XtrB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95b97b03-66d0-48a7-9d43-4ab51fde648e_3151x1269.png 848w, https://substackcdn.com/image/fetch/$s_!XtrB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95b97b03-66d0-48a7-9d43-4ab51fde648e_3151x1269.png 1272w, https://substackcdn.com/image/fetch/$s_!XtrB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95b97b03-66d0-48a7-9d43-4ab51fde648e_3151x1269.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!XtrB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95b97b03-66d0-48a7-9d43-4ab51fde648e_3151x1269.png" width="1456" height="586" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/95b97b03-66d0-48a7-9d43-4ab51fde648e_3151x1269.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:586,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:316773,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/182939529?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95b97b03-66d0-48a7-9d43-4ab51fde648e_3151x1269.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!XtrB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95b97b03-66d0-48a7-9d43-4ab51fde648e_3151x1269.png 424w, https://substackcdn.com/image/fetch/$s_!XtrB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95b97b03-66d0-48a7-9d43-4ab51fde648e_3151x1269.png 848w, https://substackcdn.com/image/fetch/$s_!XtrB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95b97b03-66d0-48a7-9d43-4ab51fde648e_3151x1269.png 1272w, https://substackcdn.com/image/fetch/$s_!XtrB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95b97b03-66d0-48a7-9d43-4ab51fde648e_3151x1269.png 
1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h3>Why is temperature used?</h3><ul><li><p>The teacher&#8217;s logits (and, in the distillation term, the student&#8217;s) are divided by a temperature T before the softmax.</p></li><li><p>Higher temperature &#8594; softer probabilities.</p></li><li><p>This reveals:</p><ul><li><p>Which wrong classes are <em>less wrong</em></p></li><li><p>How the teacher ranks alternatives</p></li></ul></li></ul><p>So the student learns relative class similarities, not just the top-1 answer.</p><h2>Why multiply by T^2?</h2><p>When the temperature increases:</p><ul><li><p>Gradients of the soft-target term shrink, roughly as 1/T^2</p></li><li><p>Learning signal 
weakens</p></li></ul><p>Multiplying by T^2 <strong>rescales the gradients</strong> back to a comparable magnitude, ensuring:</p><ul><li><p>The distillation signal remains strong</p></li><li><p>Training stays stable</p></li></ul><p>This is a standard trick from knowledge distillation.</p><h2>Final combined loss</h2><p>DeiT does not choose between the two losses - it combines them with weights.</p><ul><li><p>One term enforces correctness with respect to the ground truth</p></li><li><p>The other enforces imitation of the CNN teacher</p></li></ul><p>A weighting factor decides how much to trust each source.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4NcL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F580e79db-bac5-4448-9699-587f3ac3986d_5566x718.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4NcL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F580e79db-bac5-4448-9699-587f3ac3986d_5566x718.png 424w, https://substackcdn.com/image/fetch/$s_!4NcL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F580e79db-bac5-4448-9699-587f3ac3986d_5566x718.png 848w, https://substackcdn.com/image/fetch/$s_!4NcL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F580e79db-bac5-4448-9699-587f3ac3986d_5566x718.png 1272w, https://substackcdn.com/image/fetch/$s_!4NcL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F580e79db-bac5-4448-9699-587f3ac3986d_5566x718.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!4NcL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F580e79db-bac5-4448-9699-587f3ac3986d_5566x718.png" width="496" height="64.04395604395604" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/580e79db-bac5-4448-9699-587f3ac3986d_5566x718.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:188,&quot;width&quot;:1456,&quot;resizeWidth&quot;:496,&quot;bytes&quot;:203813,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/182939529?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F580e79db-bac5-4448-9699-587f3ac3986d_5566x718.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!4NcL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F580e79db-bac5-4448-9699-587f3ac3986d_5566x718.png 424w, https://substackcdn.com/image/fetch/$s_!4NcL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F580e79db-bac5-4448-9699-587f3ac3986d_5566x718.png 848w, https://substackcdn.com/image/fetch/$s_!4NcL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F580e79db-bac5-4448-9699-587f3ac3986d_5566x718.png 1272w, https://substackcdn.com/image/fetch/$s_!4NcL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F580e79db-bac5-4448-9699-587f3ac3986d_5566x718.png 
1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><h1>Coding DeiT from scratch</h1><p>If you wish to code DeiT from scratch, you can do so along with me. Check this out: </p><div id="youtube2-d6EaVdjsCHI" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;d6EaVdjsCHI&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/d6EaVdjsCHI?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>If you wish to get access to our code files, handwritten notes, all lecture videos, Discord channel, and other PDF handbooks that we have compiled, along with a code certificate at the end of the program, you can consider being part of the pro version of the &#8220;Transformers for Vision Bootcamp&#8221;. 
You will find the details here:</p><p><a href="https://vision-transformer.vizuara.ai/">https://vision-transformer.vizuara.ai/</a></p><h1>Other resources</h1><p>If you like this content, please check out our research bootcamps on the following topics:</p><p><strong>CV</strong>: <a href="https://cvresearchbootcamp.vizuara.ai/">https://cvresearchbootcamp.vizuara.ai/</a></p><p><strong>GenAI</strong>: <a href="https://flyvidesh.online/gen-ai-professional-bootcamp">https://flyvidesh.online/gen-ai-professional-bootcamp</a></p><p><strong>RL</strong>: <a href="https://rlresearcherbootcamp.vizuara.ai/">https://rlresearcherbootcamp.vizuara.ai/</a></p><p><strong>SciML</strong>: <a href="https://flyvidesh.online/ml-bootcamp">https://flyvidesh.online/ml-bootcamp</a></p><p><strong>ML-DL</strong>: <a href="https://flyvidesh.online/ml-dl-bootcamp">https://flyvidesh.online/ml-dl-bootcamp</a></p><p></p>]]></content:encoded></item><item><title><![CDATA[What exactly is a VLM (Vision-Language Model)?]]></title><description><![CDATA[How does it work and how can you build one from scratch?]]></description><link>https://www.vizuaranewsletter.com/p/what-exactly-is-a-vlm-vision-language</link><guid isPermaLink="false">https://www.vizuaranewsletter.com/p/what-exactly-is-a-vlm-vision-language</guid><dc:creator><![CDATA[Sreedath Panat]]></dc:creator><pubDate>Tue, 30 Dec 2025 08:16:11 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!9cRy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07de8b6f-4d4c-4fc9-a629-e4bb29ff894d_2237x1787.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1>Table of contents </h1><ol><li><p><em>What is a VLM?</em></p></li><li><p><em>How do VLMs work?</em></p><ol><li><p><em>Visual encoder</em></p></li><li><p><em>Text encoder</em></p></li><li><p><em>Multimodal fusion module</em></p><ol><li><p><em>Early fusion</em></p></li><li><p><em>Late 
fusion</em></p></li><li><p><em>Cross-attention fusion</em></p></li></ol></li></ol></li><li><p><em>Some popular VLMs</em></p></li><li><p><em>Questions that can come up during VLM design</em></p></li><li><p><em>Let us start with the simplest idea: Dual encoder</em></p></li><li><p><em>Understanding contrastive learning</em></p></li><li><p><em>Contrastive loss formula</em></p></li><li><p><em>Contrastive Language-Image Pre-training (CLIP)</em></p></li><li><p><em>Let us build a VLM</em></p><ol><li><p><em>Task description</em></p></li><li><p><em>Dataset</em></p></li><li><p><em>Model architecture</em></p></li><li><p><em>Image encoder</em></p></li><li><p><em>Text encoder</em></p></li><li><p><em>Loss function</em></p></li></ol></li><li><p><em>Embedding similarity before and after training</em></p></li><li><p><em>Why is this model called &#8220;nano&#8221;?</em></p><ol><li><p><em>Image Encoder: Number of parameters</em></p></li><li><p><em>Text Encoder: Number of parameters</em></p></li></ol></li><li><p><em>Conclusion</em></p></li><li><p><em>Relevant resources</em></p></li></ol><h1>What is a VLM?</h1><p>VLMs are AI models that can <strong>understand both images and text</strong> together.</p><p>VLMs can take both text and image as input whereas LLMs by default only take text input. So what is the output produced by a VLM? Output is whatever we design it to be. But our goal is to &#8220;<em>align</em>&#8221; the visual and textual representation in VLMs.</p><p>You may have heard of this in the context of the term <em>&#8220;multimodal alignment&#8221;</em> - alignment of different modalities.</p><p>Let us try to understand with a <strong>simple example</strong>. If I write &#8220;apple&#8221; or if I say the word &#8220;apple&#8221; or if I show you the picture of an apple, they all represent the same thing or the same idea. 
Somehow your brain represents all these 3 modalities (text, sound, picture of apple) with some sort of alignment.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7YgQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93e4c65f-c864-4e37-8f37-f27315c5fb31_936x726.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7YgQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93e4c65f-c864-4e37-8f37-f27315c5fb31_936x726.png 424w, https://substackcdn.com/image/fetch/$s_!7YgQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93e4c65f-c864-4e37-8f37-f27315c5fb31_936x726.png 848w, https://substackcdn.com/image/fetch/$s_!7YgQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93e4c65f-c864-4e37-8f37-f27315c5fb31_936x726.png 1272w, https://substackcdn.com/image/fetch/$s_!7YgQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93e4c65f-c864-4e37-8f37-f27315c5fb31_936x726.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7YgQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93e4c65f-c864-4e37-8f37-f27315c5fb31_936x726.png" width="316" height="245.10256410256412" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/93e4c65f-c864-4e37-8f37-f27315c5fb31_936x726.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:726,&quot;width&quot;:936,&quot;resizeWidth&quot;:316,&quot;bytes&quot;:30523,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/182932994?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93e4c65f-c864-4e37-8f37-f27315c5fb31_936x726.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7YgQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93e4c65f-c864-4e37-8f37-f27315c5fb31_936x726.png 424w, https://substackcdn.com/image/fetch/$s_!7YgQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93e4c65f-c864-4e37-8f37-f27315c5fb31_936x726.png 848w, https://substackcdn.com/image/fetch/$s_!7YgQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93e4c65f-c864-4e37-8f37-f27315c5fb31_936x726.png 1272w, https://substackcdn.com/image/fetch/$s_!7YgQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93e4c65f-c864-4e37-8f37-f27315c5fb31_936x726.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">If I write &#8220;apple&#8221; or if I say the word &#8220;apple&#8221; or if I show you the picture of an apple, they all represent the same thing or the same idea.</figcaption></figure></div><p>LLMs represent text (tokens) as vectors called embeddings. VLMs do the same because they are an extension of LLMs.</p><p>The basic idea of a VLM is that the text and image embeddings (or vectors) that represent the same thing should have <strong>high similarity</strong>. A simple mathematical measure of similarity between two vectors is the cosine of the angle between them. 
If the two vectors point in exactly the same direction, the <strong>cosine similarity</strong> is 1.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!s6kN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13c16050-c062-446d-9535-503f3c04afc6_2237x1787.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!s6kN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13c16050-c062-446d-9535-503f3c04afc6_2237x1787.png 424w, https://substackcdn.com/image/fetch/$s_!s6kN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13c16050-c062-446d-9535-503f3c04afc6_2237x1787.png 848w, https://substackcdn.com/image/fetch/$s_!s6kN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13c16050-c062-446d-9535-503f3c04afc6_2237x1787.png 1272w, https://substackcdn.com/image/fetch/$s_!s6kN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13c16050-c062-446d-9535-503f3c04afc6_2237x1787.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!s6kN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13c16050-c062-446d-9535-503f3c04afc6_2237x1787.png" width="530" height="423.3447802197802" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/13c16050-c062-446d-9535-503f3c04afc6_2237x1787.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1163,&quot;width&quot;:1456,&quot;resizeWidth&quot;:530,&quot;bytes&quot;:850924,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/182932994?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13c16050-c062-446d-9535-503f3c04afc6_2237x1787.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!s6kN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13c16050-c062-446d-9535-503f3c04afc6_2237x1787.png 424w, https://substackcdn.com/image/fetch/$s_!s6kN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13c16050-c062-446d-9535-503f3c04afc6_2237x1787.png 848w, https://substackcdn.com/image/fetch/$s_!s6kN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13c16050-c062-446d-9535-503f3c04afc6_2237x1787.png 1272w, https://substackcdn.com/image/fetch/$s_!s6kN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13c16050-c062-446d-9535-503f3c04afc6_2237x1787.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The basic idea of a VLM is that the text and image embeddings (or vectors) that represent the same thing should have high similarity.</figcaption></figure></div><p>If we step back and look at what models could do before VLMs, each model could handle either text or images. 
But not both.</p><p>Computer vision models like CNNs or Vision Transformers handle images, while language models like GPT handle text.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9cRy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07de8b6f-4d4c-4fc9-a629-e4bb29ff894d_2237x1787.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9cRy!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07de8b6f-4d4c-4fc9-a629-e4bb29ff894d_2237x1787.png 424w, https://substackcdn.com/image/fetch/$s_!9cRy!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07de8b6f-4d4c-4fc9-a629-e4bb29ff894d_2237x1787.png 848w, https://substackcdn.com/image/fetch/$s_!9cRy!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07de8b6f-4d4c-4fc9-a629-e4bb29ff894d_2237x1787.png 1272w, https://substackcdn.com/image/fetch/$s_!9cRy!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07de8b6f-4d4c-4fc9-a629-e4bb29ff894d_2237x1787.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9cRy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07de8b6f-4d4c-4fc9-a629-e4bb29ff894d_2237x1787.png" width="522" height="416.9546703296703" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/07de8b6f-4d4c-4fc9-a629-e4bb29ff894d_2237x1787.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1163,&quot;width&quot;:1456,&quot;resizeWidth&quot;:522,&quot;bytes&quot;:563455,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/182932994?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07de8b6f-4d4c-4fc9-a629-e4bb29ff894d_2237x1787.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!9cRy!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07de8b6f-4d4c-4fc9-a629-e4bb29ff894d_2237x1787.png 424w, https://substackcdn.com/image/fetch/$s_!9cRy!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07de8b6f-4d4c-4fc9-a629-e4bb29ff894d_2237x1787.png 848w, https://substackcdn.com/image/fetch/$s_!9cRy!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07de8b6f-4d4c-4fc9-a629-e4bb29ff894d_2237x1787.png 1272w, https://substackcdn.com/image/fetch/$s_!9cRy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F07de8b6f-4d4c-4fc9-a629-e4bb29ff894d_2237x1787.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Computer vision models like CNNs or Vision Transformers handle images, while language models like GPT handle text.</figcaption></figure></div><p>A VLM bridges these two domains. 
It takes visual inputs (like images or videos) and text inputs (like captions, questions, or prompts) and learns a joint representation.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Xt2t!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4622ec26-16fd-462c-9388-890640e4ab1c_2573x1554.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Xt2t!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4622ec26-16fd-462c-9388-890640e4ab1c_2573x1554.png 424w, https://substackcdn.com/image/fetch/$s_!Xt2t!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4622ec26-16fd-462c-9388-890640e4ab1c_2573x1554.png 848w, https://substackcdn.com/image/fetch/$s_!Xt2t!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4622ec26-16fd-462c-9388-890640e4ab1c_2573x1554.png 1272w, https://substackcdn.com/image/fetch/$s_!Xt2t!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4622ec26-16fd-462c-9388-890640e4ab1c_2573x1554.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Xt2t!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4622ec26-16fd-462c-9388-890640e4ab1c_2573x1554.png" width="1456" height="879" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4622ec26-16fd-462c-9388-890640e4ab1c_2573x1554.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:879,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1112541,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/182932994?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4622ec26-16fd-462c-9388-890640e4ab1c_2573x1554.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Xt2t!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4622ec26-16fd-462c-9388-890640e4ab1c_2573x1554.png 424w, https://substackcdn.com/image/fetch/$s_!Xt2t!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4622ec26-16fd-462c-9388-890640e4ab1c_2573x1554.png 848w, https://substackcdn.com/image/fetch/$s_!Xt2t!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4622ec26-16fd-462c-9388-890640e4ab1c_2573x1554.png 1272w, https://substackcdn.com/image/fetch/$s_!Xt2t!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4622ec26-16fd-462c-9388-890640e4ab1c_2573x1554.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">A VLM takes visual inputs (like images or videos) and text inputs (like captions, questions, or prompts) and learns a joint representation.</figcaption></figure></div><p>Real-world information is multimodal. We understand our surroundings by seeing, reading, and listening at the same time. VLMs enable applications such as:</p><ul><li><p>Generating image captions automatically</p></li><li><p>Searching for images by describing them in words</p></li><li><p>Understanding memes, advertisements, or infographics</p></li><li><p>Supporting robotics and self-driving systems that must interpret surroundings and follow instructions</p></li></ul><h1>How do VLMs work?</h1><p>A typical VLM has <strong>three major components</strong>:</p><h2>1. 
Visual encoder</h2><p>This is usually a Vision Transformer (ViT) or a CNN that converts the input image into a sequence of visual embeddings. Each embedding represents a patch or region of the image.</p><h2>2. Text encoder</h2><p>This is often a Transformer-based model (like BERT or GPT) that converts the input text into language embeddings that capture the meaning of words and their context.</p><h2>3. Multimodal fusion module</h2><p>This is where the two modalities meet. There are three main ways to perform this fusion.</p><h3><strong>Early fusion</strong></h3><p>Combine visual and text embeddings at the input and train a single transformer to process both.</p><h3><strong>Late fusion</strong></h3><p>Encode each modality separately and align the resulting embeddings using a similarity loss (e.g., CLIP).</p><h3><strong>Cross-attention fusion</strong></h3><p>Use attention mechanisms where image tokens attend to text tokens and vice versa (e.g., BLIP, Flamingo).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1b5W!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa066529d-ca18-480c-879f-23e1cf7ffaf5_2960x1351.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1b5W!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa066529d-ca18-480c-879f-23e1cf7ffaf5_2960x1351.png 424w, https://substackcdn.com/image/fetch/$s_!1b5W!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa066529d-ca18-480c-879f-23e1cf7ffaf5_2960x1351.png 848w, 
https://substackcdn.com/image/fetch/$s_!1b5W!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa066529d-ca18-480c-879f-23e1cf7ffaf5_2960x1351.png 1272w, https://substackcdn.com/image/fetch/$s_!1b5W!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa066529d-ca18-480c-879f-23e1cf7ffaf5_2960x1351.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1b5W!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa066529d-ca18-480c-879f-23e1cf7ffaf5_2960x1351.png" width="1456" height="665" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a066529d-ca18-480c-879f-23e1cf7ffaf5_2960x1351.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:665,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:586370,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/182932994?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa066529d-ca18-480c-879f-23e1cf7ffaf5_2960x1351.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!1b5W!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa066529d-ca18-480c-879f-23e1cf7ffaf5_2960x1351.png 424w, 
https://substackcdn.com/image/fetch/$s_!1b5W!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa066529d-ca18-480c-879f-23e1cf7ffaf5_2960x1351.png 848w, https://substackcdn.com/image/fetch/$s_!1b5W!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa066529d-ca18-480c-879f-23e1cf7ffaf5_2960x1351.png 1272w, https://substackcdn.com/image/fetch/$s_!1b5W!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa066529d-ca18-480c-879f-23e1cf7ffaf5_2960x1351.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">A typical VLM has <strong>three major components</strong>. 1) Visual or image encoder, 2) text encoder and 3) Multimodal fusion module</figcaption></figure></div><p>Let us see an example VLM workflow.</p><p>Suppose you input an image of a cat sitting on a laptop and a text prompt &#8220;<em>What is happening in the image?</em>&#8221;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vuS8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83491fd8-1e48-4568-9802-4786ca891283_1917x2085.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vuS8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83491fd8-1e48-4568-9802-4786ca891283_1917x2085.png 424w, https://substackcdn.com/image/fetch/$s_!vuS8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83491fd8-1e48-4568-9802-4786ca891283_1917x2085.png 848w, https://substackcdn.com/image/fetch/$s_!vuS8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83491fd8-1e48-4568-9802-4786ca891283_1917x2085.png 1272w, https://substackcdn.com/image/fetch/$s_!vuS8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83491fd8-1e48-4568-9802-4786ca891283_1917x2085.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!vuS8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83491fd8-1e48-4568-9802-4786ca891283_1917x2085.png" width="417" height="453.65934065934067" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/83491fd8-1e48-4568-9802-4786ca891283_1917x2085.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1584,&quot;width&quot;:1456,&quot;resizeWidth&quot;:417,&quot;bytes&quot;:354545,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/182932994?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83491fd8-1e48-4568-9802-4786ca891283_1917x2085.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!vuS8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83491fd8-1e48-4568-9802-4786ca891283_1917x2085.png 424w, https://substackcdn.com/image/fetch/$s_!vuS8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83491fd8-1e48-4568-9802-4786ca891283_1917x2085.png 848w, https://substackcdn.com/image/fetch/$s_!vuS8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83491fd8-1e48-4568-9802-4786ca891283_1917x2085.png 1272w, 
https://substackcdn.com/image/fetch/$s_!vuS8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F83491fd8-1e48-4568-9802-4786ca891283_1917x2085.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Example of VLM workflow.</figcaption></figure></div><h1>Some popular VLMs</h1><p>Here are some popular VLMs that you should care about. At Vizuara we have conducted detailed lectures on a bunch of these models. 
Links are provided at the end of this article.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bXsr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa7974b0-29d8-4df6-906a-da4ca219fed8_2565x1559.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bXsr!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa7974b0-29d8-4df6-906a-da4ca219fed8_2565x1559.png 424w, https://substackcdn.com/image/fetch/$s_!bXsr!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa7974b0-29d8-4df6-906a-da4ca219fed8_2565x1559.png 848w, https://substackcdn.com/image/fetch/$s_!bXsr!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa7974b0-29d8-4df6-906a-da4ca219fed8_2565x1559.png 1272w, https://substackcdn.com/image/fetch/$s_!bXsr!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa7974b0-29d8-4df6-906a-da4ca219fed8_2565x1559.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bXsr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa7974b0-29d8-4df6-906a-da4ca219fed8_2565x1559.png" width="1456" height="885" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/aa7974b0-29d8-4df6-906a-da4ca219fed8_2565x1559.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:885,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:532978,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/182932994?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa7974b0-29d8-4df6-906a-da4ca219fed8_2565x1559.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!bXsr!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa7974b0-29d8-4df6-906a-da4ca219fed8_2565x1559.png 424w, https://substackcdn.com/image/fetch/$s_!bXsr!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa7974b0-29d8-4df6-906a-da4ca219fed8_2565x1559.png 848w, https://substackcdn.com/image/fetch/$s_!bXsr!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa7974b0-29d8-4df6-906a-da4ca219fed8_2565x1559.png 1272w, https://substackcdn.com/image/fetch/$s_!bXsr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa7974b0-29d8-4df6-906a-da4ca219fed8_2565x1559.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">CLIP, BLIP, ALIGN, Flamingo and LLaVA are among the best-known VLMs at the moment.</figcaption></figure></div><h1>Questions that can come up during VLM design</h1><p>Now suppose we want to design our own VLM that can understand both language and vision. Some questions naturally come to mind.</p><ol><li><p>How to encode different modalities?</p></li><li><p>How to combine these modalities?</p></li><li><p>What kind of loss function to use?</p></li><li><p>Should we train from scratch or use pretrained models?</p></li><li><p>What type of data for training?</p></li></ol><h1>Let us start with the simplest idea: Dual encoder</h1><p>The dual encoder is arguably the simplest VLM design.</p><ul><li><p>It has two separate encoders: one for images and one for text.</p></li><li><p>Each encoder converts its input into a vector embedding.</p></li><li><p>Both 
embeddings lie in a shared feature space, so related image-text pairs have similar vectors.</p></li><li><p>The model is trained using a <strong>contrastive loss</strong> (for example, CLIP loss) that brings matching pairs closer and pushes non-matching pairs apart. (What is contrastive learning? We will discuss it in the next section.)</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!imai!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40c8f754-ccbf-4e3f-bfeb-9ed7ed116634_1935x2067.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!imai!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40c8f754-ccbf-4e3f-bfeb-9ed7ed116634_1935x2067.png 424w, https://substackcdn.com/image/fetch/$s_!imai!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40c8f754-ccbf-4e3f-bfeb-9ed7ed116634_1935x2067.png 848w, https://substackcdn.com/image/fetch/$s_!imai!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40c8f754-ccbf-4e3f-bfeb-9ed7ed116634_1935x2067.png 1272w, https://substackcdn.com/image/fetch/$s_!imai!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40c8f754-ccbf-4e3f-bfeb-9ed7ed116634_1935x2067.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!imai!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40c8f754-ccbf-4e3f-bfeb-9ed7ed116634_1935x2067.png" width="453" height="483.801510989011" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/40c8f754-ccbf-4e3f-bfeb-9ed7ed116634_1935x2067.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1555,&quot;width&quot;:1456,&quot;resizeWidth&quot;:453,&quot;bytes&quot;:591656,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/182932994?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40c8f754-ccbf-4e3f-bfeb-9ed7ed116634_1935x2067.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!imai!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40c8f754-ccbf-4e3f-bfeb-9ed7ed116634_1935x2067.png 424w, https://substackcdn.com/image/fetch/$s_!imai!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40c8f754-ccbf-4e3f-bfeb-9ed7ed116634_1935x2067.png 848w, https://substackcdn.com/image/fetch/$s_!imai!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40c8f754-ccbf-4e3f-bfeb-9ed7ed116634_1935x2067.png 1272w, https://substackcdn.com/image/fetch/$s_!imai!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40c8f754-ccbf-4e3f-bfeb-9ed7ed116634_1935x2067.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Dual encoder has two separate encoders. Each encoder converts its input into a vector embedding. 
Both embeddings lie in a shared feature space, so related image-text pairs have similar vectors.</figcaption></figure></div><ul><li><p>The image encoder is usually a <strong>CNN</strong> or <strong>Vision Transformer</strong> (ViT).</p></li><li><p>The text encoder is usually a <strong>Transformer</strong> (like BERT or GPT).</p></li><li><p>It is mainly used for <em>image-text retrieval, zero-shot classification, and multimodal alignment</em>.</p></li><li><p>Because the encoders are independent, the embeddings of a large image or text collection can be pre-computed once, which makes retrieval fast and scalable: at query time you only embed the query and compare it against the stored vectors.</p></li></ul><p>Before we proceed, we should understand what exactly contrastive learning is.</p><h1>Understanding contrastive learning</h1><p>The main goal of contrastive learning is to bring similar pairs (called <em>positive pairs</em>) closer together in the embedding space, and to push dissimilar pairs (called <em>negative pairs</em>) farther apart. This idea was popularized for visual representation learning by a 2020 paper titled &#8220;<em>A Simple Framework for Contrastive Learning of Visual Representations</em>&#8221; (SimCLR), published on arXiv. 
At the time of this writing, the paper has 28,600+ citations, which is huge.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lLly!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a6e3ce9-7b94-4225-86b0-4d2a018f1506_1352x1244.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lLly!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a6e3ce9-7b94-4225-86b0-4d2a018f1506_1352x1244.png 424w, https://substackcdn.com/image/fetch/$s_!lLly!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a6e3ce9-7b94-4225-86b0-4d2a018f1506_1352x1244.png 848w, https://substackcdn.com/image/fetch/$s_!lLly!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a6e3ce9-7b94-4225-86b0-4d2a018f1506_1352x1244.png 1272w, https://substackcdn.com/image/fetch/$s_!lLly!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a6e3ce9-7b94-4225-86b0-4d2a018f1506_1352x1244.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lLly!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a6e3ce9-7b94-4225-86b0-4d2a018f1506_1352x1244.png" width="721" height="663.405325443787" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5a6e3ce9-7b94-4225-86b0-4d2a018f1506_1352x1244.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1244,&quot;width&quot;:1352,&quot;resizeWidth&quot;:721,&quot;bytes&quot;:422419,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/182932994?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a6e3ce9-7b94-4225-86b0-4d2a018f1506_1352x1244.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!lLly!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a6e3ce9-7b94-4225-86b0-4d2a018f1506_1352x1244.png 424w, https://substackcdn.com/image/fetch/$s_!lLly!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a6e3ce9-7b94-4225-86b0-4d2a018f1506_1352x1244.png 848w, https://substackcdn.com/image/fetch/$s_!lLly!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a6e3ce9-7b94-4225-86b0-4d2a018f1506_1352x1244.png 1272w, https://substackcdn.com/image/fetch/$s_!lLly!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a6e3ce9-7b94-4225-86b0-4d2a018f1506_1352x1244.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The 2020 SimCLR paper, which popularized contrastive learning for visual representations.</figcaption></figure></div><p>In simple terms, the idea of contrastive learning is this: </p><p>If two images show the same object (say, a dog from two angles), they should have <strong>similar embeddings</strong>.</p><p>If two images show different objects (say, a dog and a car), their embeddings should be <strong>far apart</strong>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4cdR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff81dfd5e-f713-42bc-8620-4d29f705647d_3012x1327.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!4cdR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff81dfd5e-f713-42bc-8620-4d29f705647d_3012x1327.png 424w, https://substackcdn.com/image/fetch/$s_!4cdR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff81dfd5e-f713-42bc-8620-4d29f705647d_3012x1327.png 848w, https://substackcdn.com/image/fetch/$s_!4cdR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff81dfd5e-f713-42bc-8620-4d29f705647d_3012x1327.png 1272w, https://substackcdn.com/image/fetch/$s_!4cdR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff81dfd5e-f713-42bc-8620-4d29f705647d_3012x1327.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4cdR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff81dfd5e-f713-42bc-8620-4d29f705647d_3012x1327.png" width="1456" height="641" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f81dfd5e-f713-42bc-8620-4d29f705647d_3012x1327.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:641,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:5285900,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/182932994?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff81dfd5e-f713-42bc-8620-4d29f705647d_3012x1327.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!4cdR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff81dfd5e-f713-42bc-8620-4d29f705647d_3012x1327.png 424w, https://substackcdn.com/image/fetch/$s_!4cdR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff81dfd5e-f713-42bc-8620-4d29f705647d_3012x1327.png 848w, https://substackcdn.com/image/fetch/$s_!4cdR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff81dfd5e-f713-42bc-8620-4d29f705647d_3012x1327.png 1272w, https://substackcdn.com/image/fetch/$s_!4cdR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff81dfd5e-f713-42bc-8620-4d29f705647d_3012x1327.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" 
stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">In this image, all the examples are positive pairs.</figcaption></figure></div><p><strong>Positive pairs</strong>: Represent the <em>same</em> underlying concept (for example, an image and its augmented version, or an image and its correct caption).</p><p><strong>Negative pairs</strong>: Represent <em>different</em> concepts (for example, two unrelated images or mismatched image-text pairs).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!O3zm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F817beacd-73af-4f5f-83d7-99cec19f7b11_2144x1865.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!O3zm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F817beacd-73af-4f5f-83d7-99cec19f7b11_2144x1865.png 424w, https://substackcdn.com/image/fetch/$s_!O3zm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F817beacd-73af-4f5f-83d7-99cec19f7b11_2144x1865.png 848w, https://substackcdn.com/image/fetch/$s_!O3zm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F817beacd-73af-4f5f-83d7-99cec19f7b11_2144x1865.png 1272w, 
https://substackcdn.com/image/fetch/$s_!O3zm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F817beacd-73af-4f5f-83d7-99cec19f7b11_2144x1865.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!O3zm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F817beacd-73af-4f5f-83d7-99cec19f7b11_2144x1865.png" width="450" height="391.58653846153845" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/817beacd-73af-4f5f-83d7-99cec19f7b11_2144x1865.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1267,&quot;width&quot;:1456,&quot;resizeWidth&quot;:450,&quot;bytes&quot;:268479,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/182932994?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F817beacd-73af-4f5f-83d7-99cec19f7b11_2144x1865.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!O3zm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F817beacd-73af-4f5f-83d7-99cec19f7b11_2144x1865.png 424w, https://substackcdn.com/image/fetch/$s_!O3zm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F817beacd-73af-4f5f-83d7-99cec19f7b11_2144x1865.png 848w, 
https://substackcdn.com/image/fetch/$s_!O3zm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F817beacd-73af-4f5f-83d7-99cec19f7b11_2144x1865.png 1272w, https://substackcdn.com/image/fetch/$s_!O3zm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F817beacd-73af-4f5f-83d7-99cec19f7b11_2144x1865.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The goal is to minimize the distance between anchor and positive example and maximize the distance between anchor 
and negative example. Anchor is simply the original image under consideration against which you are comparing other images in the dataset.</figcaption></figure></div><p>Below is another figure that nicely illustrates the concept of anchor, positive pair and negative pair.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7HEc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9be8fa9c-3816-4520-869a-fba1e094cba4_2386x1676.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7HEc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9be8fa9c-3816-4520-869a-fba1e094cba4_2386x1676.png 424w, https://substackcdn.com/image/fetch/$s_!7HEc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9be8fa9c-3816-4520-869a-fba1e094cba4_2386x1676.png 848w, https://substackcdn.com/image/fetch/$s_!7HEc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9be8fa9c-3816-4520-869a-fba1e094cba4_2386x1676.png 1272w, https://substackcdn.com/image/fetch/$s_!7HEc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9be8fa9c-3816-4520-869a-fba1e094cba4_2386x1676.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7HEc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9be8fa9c-3816-4520-869a-fba1e094cba4_2386x1676.png" width="1456" height="1023" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9be8fa9c-3816-4520-869a-fba1e094cba4_2386x1676.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1023,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1304461,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/182932994?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9be8fa9c-3816-4520-869a-fba1e094cba4_2386x1676.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7HEc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9be8fa9c-3816-4520-869a-fba1e094cba4_2386x1676.png 424w, https://substackcdn.com/image/fetch/$s_!7HEc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9be8fa9c-3816-4520-869a-fba1e094cba4_2386x1676.png 848w, https://substackcdn.com/image/fetch/$s_!7HEc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9be8fa9c-3816-4520-869a-fba1e094cba4_2386x1676.png 1272w, https://substackcdn.com/image/fetch/$s_!7HEc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9be8fa9c-3816-4520-869a-fba1e094cba4_2386x1676.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The anchor is highly similar to the positive example, so their embeddings should have high similarity; equivalently, the distance between them should be small. The top 2 images are positive pairs. The bottom 2 images are negative pairs; for these, the distance should be maximized or, equivalently, the similarity minimized.</figcaption></figure></div><h1>Contrastive loss formula</h1><p>Let us consider the example of positive pairs to understand the contrastive loss formula. This loss is used in the famous CLIP paper from OpenAI: https://arxiv.org/abs/2103.00020</p><p>I am pasting my handwritten explanation of the formula here, so please excuse my handwriting.</p><p>Say these are the vector representations of the two items in a positive pair. 
We want to maximize the similarity between them.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!t8aq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c4e12bb-3e33-4629-9695-a903b9c91533_3439x1162.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!t8aq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c4e12bb-3e33-4629-9695-a903b9c91533_3439x1162.png 424w, https://substackcdn.com/image/fetch/$s_!t8aq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c4e12bb-3e33-4629-9695-a903b9c91533_3439x1162.png 848w, https://substackcdn.com/image/fetch/$s_!t8aq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c4e12bb-3e33-4629-9695-a903b9c91533_3439x1162.png 1272w, https://substackcdn.com/image/fetch/$s_!t8aq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c4e12bb-3e33-4629-9695-a903b9c91533_3439x1162.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!t8aq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c4e12bb-3e33-4629-9695-a903b9c91533_3439x1162.png" width="159" height="53.72802197802198" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0c4e12bb-3e33-4629-9695-a903b9c91533_3439x1162.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:492,&quot;width&quot;:1456,&quot;resizeWidth&quot;:159,&quot;bytes&quot;:161907,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/182932994?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c4e12bb-3e33-4629-9695-a903b9c91533_3439x1162.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!t8aq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c4e12bb-3e33-4629-9695-a903b9c91533_3439x1162.png 424w, https://substackcdn.com/image/fetch/$s_!t8aq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c4e12bb-3e33-4629-9695-a903b9c91533_3439x1162.png 848w, https://substackcdn.com/image/fetch/$s_!t8aq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c4e12bb-3e33-4629-9695-a903b9c91533_3439x1162.png 1272w, https://substackcdn.com/image/fetch/$s_!t8aq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c4e12bb-3e33-4629-9695-a903b9c91533_3439x1162.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>What is an easy measure of similarity? Cosine similarity. So we can try to maximize cosine similarity. 
Remember: Cosine similarity is also the same as the dot product between 2 normalized vectors as shown below.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2GBN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ad08e83-2c17-4e6d-86dc-9618dc3b26a8_3235x1236.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2GBN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ad08e83-2c17-4e6d-86dc-9618dc3b26a8_3235x1236.png 424w, https://substackcdn.com/image/fetch/$s_!2GBN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ad08e83-2c17-4e6d-86dc-9618dc3b26a8_3235x1236.png 848w, https://substackcdn.com/image/fetch/$s_!2GBN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ad08e83-2c17-4e6d-86dc-9618dc3b26a8_3235x1236.png 1272w, https://substackcdn.com/image/fetch/$s_!2GBN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ad08e83-2c17-4e6d-86dc-9618dc3b26a8_3235x1236.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2GBN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ad08e83-2c17-4e6d-86dc-9618dc3b26a8_3235x1236.png" width="478" height="182.53296703296704" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1ad08e83-2c17-4e6d-86dc-9618dc3b26a8_3235x1236.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:556,&quot;width&quot;:1456,&quot;resizeWidth&quot;:478,&quot;bytes&quot;:231553,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/182932994?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ad08e83-2c17-4e6d-86dc-9618dc3b26a8_3235x1236.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!2GBN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ad08e83-2c17-4e6d-86dc-9618dc3b26a8_3235x1236.png 424w, https://substackcdn.com/image/fetch/$s_!2GBN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ad08e83-2c17-4e6d-86dc-9618dc3b26a8_3235x1236.png 848w, https://substackcdn.com/image/fetch/$s_!2GBN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ad08e83-2c17-4e6d-86dc-9618dc3b26a8_3235x1236.png 1272w, https://substackcdn.com/image/fetch/$s_!2GBN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ad08e83-2c17-4e6d-86dc-9618dc3b26a8_3235x1236.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Now we have multiple pairs in our dataset. Consider that the pairs are images and corresponding captions. 
After all, a VLM gives you embeddings of both text and images from an image-caption dataset. With N image embeddings and N text embeddings, you can construct N &#215; N image-text pairs and calculate the cosine similarity of each. </p><p>Your goal is to make sure that the actual image-caption pairs have high similarity.</p><p>So where can we start?</p><p>First, it helps to turn the similarity scores into a probability distribution. How do we convert a bunch of numbers into probabilities? Take the softmax.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!PNC_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5832c88e-8d4d-440e-86c2-f9bbca77e6a5_3618x1105.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!PNC_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5832c88e-8d4d-440e-86c2-f9bbca77e6a5_3618x1105.png 424w, https://substackcdn.com/image/fetch/$s_!PNC_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5832c88e-8d4d-440e-86c2-f9bbca77e6a5_3618x1105.png 848w, https://substackcdn.com/image/fetch/$s_!PNC_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5832c88e-8d4d-440e-86c2-f9bbca77e6a5_3618x1105.png 1272w, https://substackcdn.com/image/fetch/$s_!PNC_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5832c88e-8d4d-440e-86c2-f9bbca77e6a5_3618x1105.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!PNC_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5832c88e-8d4d-440e-86c2-f9bbca77e6a5_3618x1105.png" width="618" height="188.8804945054945" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5832c88e-8d4d-440e-86c2-f9bbca77e6a5_3618x1105.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:445,&quot;width&quot;:1456,&quot;resizeWidth&quot;:618,&quot;bytes&quot;:244647,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/182932994?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5832c88e-8d4d-440e-86c2-f9bbca77e6a5_3618x1105.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!PNC_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5832c88e-8d4d-440e-86c2-f9bbca77e6a5_3618x1105.png 424w, https://substackcdn.com/image/fetch/$s_!PNC_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5832c88e-8d4d-440e-86c2-f9bbca77e6a5_3618x1105.png 848w, https://substackcdn.com/image/fetch/$s_!PNC_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5832c88e-8d4d-440e-86c2-f9bbca77e6a5_3618x1105.png 1272w, 
https://substackcdn.com/image/fetch/$s_!PNC_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5832c88e-8d4d-440e-86c2-f9bbca77e6a5_3618x1105.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>We can introduce an additional parameter, the temperature &#120591;, to control how sensitive the softmax is to differences between the scores. This is standard practice.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7zW9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f652ec2-e022-42d1-992a-d2592b4739fe_2868x1394.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7zW9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f652ec2-e022-42d1-992a-d2592b4739fe_2868x1394.png 424w, https://substackcdn.com/image/fetch/$s_!7zW9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f652ec2-e022-42d1-992a-d2592b4739fe_2868x1394.png 848w, https://substackcdn.com/image/fetch/$s_!7zW9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f652ec2-e022-42d1-992a-d2592b4739fe_2868x1394.png 1272w, https://substackcdn.com/image/fetch/$s_!7zW9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f652ec2-e022-42d1-992a-d2592b4739fe_2868x1394.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!7zW9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f652ec2-e022-42d1-992a-d2592b4739fe_2868x1394.png" width="548" height="266.4725274725275" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1f652ec2-e022-42d1-992a-d2592b4739fe_2868x1394.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:708,&quot;width&quot;:1456,&quot;resizeWidth&quot;:548,&quot;bytes&quot;:255517,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/182932994?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f652ec2-e022-42d1-992a-d2592b4739fe_2868x1394.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7zW9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f652ec2-e022-42d1-992a-d2592b4739fe_2868x1394.png 424w, https://substackcdn.com/image/fetch/$s_!7zW9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f652ec2-e022-42d1-992a-d2592b4739fe_2868x1394.png 848w, https://substackcdn.com/image/fetch/$s_!7zW9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f652ec2-e022-42d1-992a-d2592b4739fe_2868x1394.png 1272w, 
https://substackcdn.com/image/fetch/$s_!7zW9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f652ec2-e022-42d1-992a-d2592b4739fe_2868x1394.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>These softmax outputs lie between 0 and 1. We want to convert them into a loss: when the similarity is high, the loss should be low, and vice versa. How do we do that? 
Take -log() just like cross-entropy loss.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ijwe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F807143ba-7478-41d4-a0ea-42e0a6dac150_3651x1095.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ijwe!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F807143ba-7478-41d4-a0ea-42e0a6dac150_3651x1095.png 424w, https://substackcdn.com/image/fetch/$s_!ijwe!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F807143ba-7478-41d4-a0ea-42e0a6dac150_3651x1095.png 848w, https://substackcdn.com/image/fetch/$s_!ijwe!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F807143ba-7478-41d4-a0ea-42e0a6dac150_3651x1095.png 1272w, https://substackcdn.com/image/fetch/$s_!ijwe!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F807143ba-7478-41d4-a0ea-42e0a6dac150_3651x1095.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ijwe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F807143ba-7478-41d4-a0ea-42e0a6dac150_3651x1095.png" width="1456" height="437" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/807143ba-7478-41d4-a0ea-42e0a6dac150_3651x1095.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:437,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:311532,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/182932994?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F807143ba-7478-41d4-a0ea-42e0a6dac150_3651x1095.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ijwe!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F807143ba-7478-41d4-a0ea-42e0a6dac150_3651x1095.png 424w, https://substackcdn.com/image/fetch/$s_!ijwe!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F807143ba-7478-41d4-a0ea-42e0a6dac150_3651x1095.png 848w, https://substackcdn.com/image/fetch/$s_!ijwe!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F807143ba-7478-41d4-a0ea-42e0a6dac150_3651x1095.png 1272w, https://substackcdn.com/image/fetch/$s_!ijwe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F807143ba-7478-41d4-a0ea-42e0a6dac150_3651x1095.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Here &#8220;i&#8221; is the anchor point and &#8220;j&#8221; indexes the points we compare i against.</p><p>Now, i can be any of the N examples, because any point can serve as the anchor to compare against the available pairs. 
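</p><p>As a rough sketch of the recipe so far (normalize the embeddings, take all N &#215; N cosine similarities, scale them by the temperature &#120591;, apply softmax along each row, take the negative log of the matching pair, and average over every anchor i), here is a small NumPy example. The function name <code>contrastive_loss</code>, the default <code>tau=0.07</code>, dividing by &#120591; (the usual convention), and the toy one-hot embeddings are all assumptions made for illustration. This is not the code from the CLIP paper, and it only uses images as anchors.</p><pre><code class="language-python">import numpy as np


def contrastive_loss(image_emb, text_emb, tau=0.07):
    """Average -log softmax loss over all N anchors (illustrative sketch)."""
    # L2-normalize so that the dot product equals cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    # N x N matrix: entry (i, j) compares image i with caption j.
    logits = image_emb @ text_emb.T / tau
    # Numerically stable log-softmax over each row (one row per anchor i).
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The true caption for image i sits on the diagonal; average over anchors.
    return float(-np.mean(np.diag(log_prob)))


# Hypothetical toy batch: 4 mutually orthogonal "image" embeddings.
images = np.eye(4)
matched = contrastive_loss(images, images)         # caption i matches image i
shuffled = contrastive_loss(images, images[::-1])  # every caption is mismatched
print(matched, shuffled)
</code></pre><p>With these toy inputs, the aligned batch should give a loss close to 0, while the reversed batch should give a loss close to 1/tau, because each image then puts almost all of its probability on the wrong caption.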
</p><p>Thus, the total contrastive loss is the average over all anchors in the batch.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UYPA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cec94d4-2d80-4bca-9417-66fbdc5bd0bf_2986x1339.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UYPA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cec94d4-2d80-4bca-9417-66fbdc5bd0bf_2986x1339.png 424w, https://substackcdn.com/image/fetch/$s_!UYPA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cec94d4-2d80-4bca-9417-66fbdc5bd0bf_2986x1339.png 848w, https://substackcdn.com/image/fetch/$s_!UYPA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cec94d4-2d80-4bca-9417-66fbdc5bd0bf_2986x1339.png 1272w, https://substackcdn.com/image/fetch/$s_!UYPA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cec94d4-2d80-4bca-9417-66fbdc5bd0bf_2986x1339.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UYPA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cec94d4-2d80-4bca-9417-66fbdc5bd0bf_2986x1339.png" width="268" height="120.19505494505495" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3cec94d4-2d80-4bca-9417-66fbdc5bd0bf_2986x1339.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:653,&quot;width&quot;:1456,&quot;resizeWidth&quot;:268,&quot;bytes&quot;:157223,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/182932994?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cec94d4-2d80-4bca-9417-66fbdc5bd0bf_2986x1339.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!UYPA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cec94d4-2d80-4bca-9417-66fbdc5bd0bf_2986x1339.png 424w, https://substackcdn.com/image/fetch/$s_!UYPA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cec94d4-2d80-4bca-9417-66fbdc5bd0bf_2986x1339.png 848w, https://substackcdn.com/image/fetch/$s_!UYPA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cec94d4-2d80-4bca-9417-66fbdc5bd0bf_2986x1339.png 1272w, https://substackcdn.com/image/fetch/$s_!UYPA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cec94d4-2d80-4bca-9417-66fbdc5bd0bf_2986x1339.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>So, the negatives are <strong>implicitly present in the denominator</strong>, competing against the positive pair. 
The loss becomes low only when:</p><ul><li><p>The similarity of the <strong>positive pair</strong> is <strong>high</strong>, and</p></li><li><p>The similarity of the <strong>negative pairs</strong> is <strong>low</strong></p></li></ul><h1>Contrastive Language-Image Pre-training (CLIP)</h1><p>Now that we understand contrastive loss, let us discuss the CLIP paper a bit, since CLIP is one of the most famous VLMs.</p><p>CLIP draws inspiration from contrastive learning.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lRhE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb4a3c8b-99b9-42b9-9d15-f1986f7130d8_2333x1713.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lRhE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb4a3c8b-99b9-42b9-9d15-f1986f7130d8_2333x1713.png 424w, https://substackcdn.com/image/fetch/$s_!lRhE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb4a3c8b-99b9-42b9-9d15-f1986f7130d8_2333x1713.png 848w, https://substackcdn.com/image/fetch/$s_!lRhE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb4a3c8b-99b9-42b9-9d15-f1986f7130d8_2333x1713.png 1272w, https://substackcdn.com/image/fetch/$s_!lRhE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb4a3c8b-99b9-42b9-9d15-f1986f7130d8_2333x1713.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!lRhE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb4a3c8b-99b9-42b9-9d15-f1986f7130d8_2333x1713.png" width="1456" height="1069" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/db4a3c8b-99b9-42b9-9d15-f1986f7130d8_2333x1713.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1069,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:526411,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/182932994?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb4a3c8b-99b9-42b9-9d15-f1986f7130d8_2333x1713.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!lRhE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb4a3c8b-99b9-42b9-9d15-f1986f7130d8_2333x1713.png 424w, https://substackcdn.com/image/fetch/$s_!lRhE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb4a3c8b-99b9-42b9-9d15-f1986f7130d8_2333x1713.png 848w, https://substackcdn.com/image/fetch/$s_!lRhE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb4a3c8b-99b9-42b9-9d15-f1986f7130d8_2333x1713.png 1272w, https://substackcdn.com/image/fetch/$s_!lRhE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb4a3c8b-99b9-42b9-9d15-f1986f7130d8_2333x1713.png 
1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Here you have N image-caption pairs. They are converted to vectors: Ti is a text vector and Ii is an image vector. You need to maximize the similarity between Ti and Ii, and minimize the similarity between Ti and Ij for j &#8800; i. 
</figcaption></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5PDf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa6a7fc6-f180-4410-87d5-c09f437edfc0_2335x1712.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5PDf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa6a7fc6-f180-4410-87d5-c09f437edfc0_2335x1712.png 424w, https://substackcdn.com/image/fetch/$s_!5PDf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa6a7fc6-f180-4410-87d5-c09f437edfc0_2335x1712.png 848w, https://substackcdn.com/image/fetch/$s_!5PDf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa6a7fc6-f180-4410-87d5-c09f437edfc0_2335x1712.png 1272w, https://substackcdn.com/image/fetch/$s_!5PDf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa6a7fc6-f180-4410-87d5-c09f437edfc0_2335x1712.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5PDf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa6a7fc6-f180-4410-87d5-c09f437edfc0_2335x1712.png" width="1456" height="1068" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fa6a7fc6-f180-4410-87d5-c09f437edfc0_2335x1712.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1068,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:524249,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/182932994?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa6a7fc6-f180-4410-87d5-c09f437edfc0_2335x1712.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!5PDf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa6a7fc6-f180-4410-87d5-c09f437edfc0_2335x1712.png 424w, https://substackcdn.com/image/fetch/$s_!5PDf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa6a7fc6-f180-4410-87d5-c09f437edfc0_2335x1712.png 848w, https://substackcdn.com/image/fetch/$s_!5PDf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa6a7fc6-f180-4410-87d5-c09f437edfc0_2335x1712.png 1272w, https://substackcdn.com/image/fetch/$s_!5PDf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa6a7fc6-f180-4410-87d5-c09f437edfc0_2335x1712.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">A classification problem can be handled with a pre-trained VLM by converting each class label into a caption using an additional prompt.</figcaption></figure></div><p>This is the final contrastive loss formula. It looks ugly and intimidating when you don&#8217;t know what is going on. 
But this is actually simple.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zN6N!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff13d9882-35f2-4864-a1c9-270d7b05b721_5246x762.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zN6N!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff13d9882-35f2-4864-a1c9-270d7b05b721_5246x762.png 424w, https://substackcdn.com/image/fetch/$s_!zN6N!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff13d9882-35f2-4864-a1c9-270d7b05b721_5246x762.png 848w, https://substackcdn.com/image/fetch/$s_!zN6N!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff13d9882-35f2-4864-a1c9-270d7b05b721_5246x762.png 1272w, https://substackcdn.com/image/fetch/$s_!zN6N!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff13d9882-35f2-4864-a1c9-270d7b05b721_5246x762.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zN6N!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff13d9882-35f2-4864-a1c9-270d7b05b721_5246x762.png" width="1456" height="211" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f13d9882-35f2-4864-a1c9-270d7b05b721_5246x762.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:211,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:670638,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/182932994?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff13d9882-35f2-4864-a1c9-270d7b05b721_5246x762.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!zN6N!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff13d9882-35f2-4864-a1c9-270d7b05b721_5246x762.png 424w, https://substackcdn.com/image/fetch/$s_!zN6N!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff13d9882-35f2-4864-a1c9-270d7b05b721_5246x762.png 848w, https://substackcdn.com/image/fetch/$s_!zN6N!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff13d9882-35f2-4864-a1c9-270d7b05b721_5246x762.png 1272w, https://substackcdn.com/image/fetch/$s_!zN6N!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff13d9882-35f2-4864-a1c9-270d7b05b721_5246x762.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Each image x (called the <strong>anchor</strong>) is paired with:</p><ul><li><p>One <strong>positive</strong> example x+: usually an augmented version of the same 
image (for example, a rotated or cropped view of the same polar bear).</p></li><li><p>Several <strong>negative</strong> examples xi&#8722;: other images from the batch (for example, the lemur, bird, or deer shown in the image).</p></li></ul><p>So, for every anchor image, the model has exactly <strong>one correct match</strong> and <strong>many distractors</strong>.</p><p>In an image-caption pair dataset, we can have two objectives:</p><ol><li><p>Given a caption, retrieve the image that fits it best</p></li><li><p>Given an image, provide the best caption</p></li></ol><p>These two objectives give us two different losses, one per direction. Each is a pretty straightforward equation once you understand the basic idea behind contrastive loss.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_g66!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F476a4e59-09a9-4955-b28c-347e3c6220f8_3421x1169.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_g66!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F476a4e59-09a9-4955-b28c-347e3c6220f8_3421x1169.png 424w, https://substackcdn.com/image/fetch/$s_!_g66!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F476a4e59-09a9-4955-b28c-347e3c6220f8_3421x1169.png 848w, https://substackcdn.com/image/fetch/$s_!_g66!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F476a4e59-09a9-4955-b28c-347e3c6220f8_3421x1169.png 1272w, 
https://substackcdn.com/image/fetch/$s_!_g66!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F476a4e59-09a9-4955-b28c-347e3c6220f8_3421x1169.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_g66!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F476a4e59-09a9-4955-b28c-347e3c6220f8_3421x1169.png" width="530" height="181.27747252747253" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/476a4e59-09a9-4955-b28c-347e3c6220f8_3421x1169.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:498,&quot;width&quot;:1456,&quot;resizeWidth&quot;:530,&quot;bytes&quot;:1151606,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/182932994?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F476a4e59-09a9-4955-b28c-347e3c6220f8_3421x1169.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_g66!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F476a4e59-09a9-4955-b28c-347e3c6220f8_3421x1169.png 424w, https://substackcdn.com/image/fetch/$s_!_g66!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F476a4e59-09a9-4955-b28c-347e3c6220f8_3421x1169.png 848w, 
https://substackcdn.com/image/fetch/$s_!_g66!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F476a4e59-09a9-4955-b28c-347e3c6220f8_3421x1169.png 1272w, https://substackcdn.com/image/fetch/$s_!_g66!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F476a4e59-09a9-4955-b28c-347e3c6220f8_3421x1169.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!aMNs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18b10a74-4ee5-447a-9437-c510f2ec7e4f_3528x1133.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!aMNs!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18b10a74-4ee5-447a-9437-c510f2ec7e4f_3528x1133.png 424w, https://substackcdn.com/image/fetch/$s_!aMNs!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18b10a74-4ee5-447a-9437-c510f2ec7e4f_3528x1133.png 848w, https://substackcdn.com/image/fetch/$s_!aMNs!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18b10a74-4ee5-447a-9437-c510f2ec7e4f_3528x1133.png 1272w, https://substackcdn.com/image/fetch/$s_!aMNs!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18b10a74-4ee5-447a-9437-c510f2ec7e4f_3528x1133.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!aMNs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18b10a74-4ee5-447a-9437-c510f2ec7e4f_3528x1133.png" width="534" height="171.64285714285714" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/18b10a74-4ee5-447a-9437-c510f2ec7e4f_3528x1133.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:468,&quot;width&quot;:1456,&quot;resizeWidth&quot;:534,&quot;bytes&quot;:1208833,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/182932994?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18b10a74-4ee5-447a-9437-c510f2ec7e4f_3528x1133.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!aMNs!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18b10a74-4ee5-447a-9437-c510f2ec7e4f_3528x1133.png 424w, https://substackcdn.com/image/fetch/$s_!aMNs!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18b10a74-4ee5-447a-9437-c510f2ec7e4f_3528x1133.png 848w, https://substackcdn.com/image/fetch/$s_!aMNs!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18b10a74-4ee5-447a-9437-c510f2ec7e4f_3528x1133.png 1272w, 
https://substackcdn.com/image/fetch/$s_!aMNs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18b10a74-4ee5-447a-9437-c510f2ec7e4f_3528x1133.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><h1>Let us build a VLM</h1><p>Now let us build and train a NanoVLM from scratch. To do that, we need to understand the following pieces.</p><ol><li><p>Task description</p></li><li><p>Dataset</p></li><li><p>Model architecture</p></li><li><p>Image encoder</p></li><li><p>Text encoder</p></li><li><p>Loss function</p></li></ol><h2>Task description</h2><p>We will build a NanoVLM: a tiny CLIP-style model trained on synthetic images of colored shapes paired with captions. Why &#8220;nano&#8221;? Because the number of trainable parameters will be less than 5 million.</p><p>Task: <em>For a given text caption, we have to retrieve the best-matching images from the dataset.</em></p><h2>Dataset</h2><p>Since the task is caption-to-image retrieval, we need a dataset of image-caption pairs. 
We will use a synthetic small dataset.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TGVc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5891a01-ceb7-4e3a-afa0-94b5a9eb102f_3303x1210.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TGVc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5891a01-ceb7-4e3a-afa0-94b5a9eb102f_3303x1210.png 424w, https://substackcdn.com/image/fetch/$s_!TGVc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5891a01-ceb7-4e3a-afa0-94b5a9eb102f_3303x1210.png 848w, https://substackcdn.com/image/fetch/$s_!TGVc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5891a01-ceb7-4e3a-afa0-94b5a9eb102f_3303x1210.png 1272w, https://substackcdn.com/image/fetch/$s_!TGVc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5891a01-ceb7-4e3a-afa0-94b5a9eb102f_3303x1210.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TGVc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5891a01-ceb7-4e3a-afa0-94b5a9eb102f_3303x1210.png" width="1456" height="533" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d5891a01-ceb7-4e3a-afa0-94b5a9eb102f_3303x1210.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:533,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:190702,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/182932994?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5891a01-ceb7-4e3a-afa0-94b5a9eb102f_3303x1210.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!TGVc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5891a01-ceb7-4e3a-afa0-94b5a9eb102f_3303x1210.png 424w, https://substackcdn.com/image/fetch/$s_!TGVc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5891a01-ceb7-4e3a-afa0-94b5a9eb102f_3303x1210.png 848w, https://substackcdn.com/image/fetch/$s_!TGVc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5891a01-ceb7-4e3a-afa0-94b5a9eb102f_3303x1210.png 1272w, https://substackcdn.com/image/fetch/$s_!TGVc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd5891a01-ceb7-4e3a-afa0-94b5a9eb102f_3303x1210.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Images are shapes with different colors and positions. 
Each caption states the color, shape, and position in a single string.</figcaption></figure></div><p>This synthetic dataset is easy to generate from a few parameters like the ones below.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!WxQF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49d56864-7bea-4c5c-9457-5408565dbd5d_6486x616.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!WxQF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49d56864-7bea-4c5c-9457-5408565dbd5d_6486x616.png 424w, https://substackcdn.com/image/fetch/$s_!WxQF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49d56864-7bea-4c5c-9457-5408565dbd5d_6486x616.png 848w, https://substackcdn.com/image/fetch/$s_!WxQF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49d56864-7bea-4c5c-9457-5408565dbd5d_6486x616.png 1272w, https://substackcdn.com/image/fetch/$s_!WxQF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49d56864-7bea-4c5c-9457-5408565dbd5d_6486x616.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!WxQF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49d56864-7bea-4c5c-9457-5408565dbd5d_6486x616.png" width="1456" height="138" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/49d56864-7bea-4c5c-9457-5408565dbd5d_6486x616.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:138,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:209478,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/182932994?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49d56864-7bea-4c5c-9457-5408565dbd5d_6486x616.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!WxQF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49d56864-7bea-4c5c-9457-5408565dbd5d_6486x616.png 424w, https://substackcdn.com/image/fetch/$s_!WxQF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49d56864-7bea-4c5c-9457-5408565dbd5d_6486x616.png 848w, https://substackcdn.com/image/fetch/$s_!WxQF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49d56864-7bea-4c5c-9457-5408565dbd5d_6486x616.png 1272w, https://substackcdn.com/image/fetch/$s_!WxQF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49d56864-7bea-4c5c-9457-5408565dbd5d_6486x616.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!0fFu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b1f2322-4f05-4c26-9d06-ca5b7cbd2eb7_3511x1139.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0fFu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b1f2322-4f05-4c26-9d06-ca5b7cbd2eb7_3511x1139.png 424w, https://substackcdn.com/image/fetch/$s_!0fFu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b1f2322-4f05-4c26-9d06-ca5b7cbd2eb7_3511x1139.png 848w, https://substackcdn.com/image/fetch/$s_!0fFu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b1f2322-4f05-4c26-9d06-ca5b7cbd2eb7_3511x1139.png 1272w, https://substackcdn.com/image/fetch/$s_!0fFu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b1f2322-4f05-4c26-9d06-ca5b7cbd2eb7_3511x1139.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0fFu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b1f2322-4f05-4c26-9d06-ca5b7cbd2eb7_3511x1139.png" width="1456" height="472" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1b1f2322-4f05-4c26-9d06-ca5b7cbd2eb7_3511x1139.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:472,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:294366,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/182932994?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b1f2322-4f05-4c26-9d06-ca5b7cbd2eb7_3511x1139.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!0fFu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b1f2322-4f05-4c26-9d06-ca5b7cbd2eb7_3511x1139.png 424w, https://substackcdn.com/image/fetch/$s_!0fFu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b1f2322-4f05-4c26-9d06-ca5b7cbd2eb7_3511x1139.png 848w, https://substackcdn.com/image/fetch/$s_!0fFu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b1f2322-4f05-4c26-9d06-ca5b7cbd2eb7_3511x1139.png 1272w, https://substackcdn.com/image/fetch/$s_!0fFu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b1f2322-4f05-4c26-9d06-ca5b7cbd2eb7_3511x1139.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>Overall model architecture</h2><p>This is a <strong>tiny CLIP-style Vision-Language Model (VLM)</strong>.</p><p>It has two separate encoders: one for images and one for text.</p><p>Both encoders map their inputs into a <strong>common embedding space</strong> of dimension 64 (or any other dimension we choose).</p><p>The goal is to make matching image-text pairs lie close together in this space, while mismatched pairs are pushed apart.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pmZl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49ea270b-1dc6-4cea-bc2c-9abe69f9ff11_1935x2067.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!pmZl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49ea270b-1dc6-4cea-bc2c-9abe69f9ff11_1935x2067.png 424w, https://substackcdn.com/image/fetch/$s_!pmZl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49ea270b-1dc6-4cea-bc2c-9abe69f9ff11_1935x2067.png 848w, https://substackcdn.com/image/fetch/$s_!pmZl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49ea270b-1dc6-4cea-bc2c-9abe69f9ff11_1935x2067.png 1272w, https://substackcdn.com/image/fetch/$s_!pmZl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49ea270b-1dc6-4cea-bc2c-9abe69f9ff11_1935x2067.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pmZl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49ea270b-1dc6-4cea-bc2c-9abe69f9ff11_1935x2067.png" width="485" height="517.9773351648352" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/49ea270b-1dc6-4cea-bc2c-9abe69f9ff11_1935x2067.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1555,&quot;width&quot;:1456,&quot;resizeWidth&quot;:485,&quot;bytes&quot;:526723,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/182932994?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49ea270b-1dc6-4cea-bc2c-9abe69f9ff11_1935x2067.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!pmZl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49ea270b-1dc6-4cea-bc2c-9abe69f9ff11_1935x2067.png 424w, https://substackcdn.com/image/fetch/$s_!pmZl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49ea270b-1dc6-4cea-bc2c-9abe69f9ff11_1935x2067.png 848w, https://substackcdn.com/image/fetch/$s_!pmZl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49ea270b-1dc6-4cea-bc2c-9abe69f9ff11_1935x2067.png 1272w, https://substackcdn.com/image/fetch/$s_!pmZl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49ea270b-1dc6-4cea-bc2c-9abe69f9ff11_1935x2067.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The simple architecture of the VLM we are going to build</figcaption></figure></div><h2>Image Encoder</h2><ul><li><p>A small CNN (4 convolutional layers) progressively downsamples the input image.</p></li><li><p>After the convolution blocks, a <strong>global average pooling</strong> layer collapses the spatial dimensions into a single feature vector.</p></li><li><p>A <strong>linear projection</strong> maps the pooled features to the embedding dimension.</p></li><li><p>Finally, <strong>LayerNorm + L2 normalization</strong> makes each embedding a unit vector, so the dot product of two embeddings equals their cosine similarity.</p></li></ul><p>The CNN architecture is shown below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!3FqZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f892d92-7b1f-4cde-b112-89a5451e0b25_3993x1001.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3FqZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f892d92-7b1f-4cde-b112-89a5451e0b25_3993x1001.png 424w, https://substackcdn.com/image/fetch/$s_!3FqZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f892d92-7b1f-4cde-b112-89a5451e0b25_3993x1001.png 848w, https://substackcdn.com/image/fetch/$s_!3FqZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f892d92-7b1f-4cde-b112-89a5451e0b25_3993x1001.png 1272w, https://substackcdn.com/image/fetch/$s_!3FqZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f892d92-7b1f-4cde-b112-89a5451e0b25_3993x1001.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3FqZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f892d92-7b1f-4cde-b112-89a5451e0b25_3993x1001.png" width="1456" height="365" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3f892d92-7b1f-4cde-b112-89a5451e0b25_3993x1001.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:365,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:298754,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/182932994?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f892d92-7b1f-4cde-b112-89a5451e0b25_3993x1001.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3FqZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f892d92-7b1f-4cde-b112-89a5451e0b25_3993x1001.png 424w, https://substackcdn.com/image/fetch/$s_!3FqZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f892d92-7b1f-4cde-b112-89a5451e0b25_3993x1001.png 848w, https://substackcdn.com/image/fetch/$s_!3FqZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f892d92-7b1f-4cde-b112-89a5451e0b25_3993x1001.png 1272w, https://substackcdn.com/image/fetch/$s_!3FqZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f892d92-7b1f-4cde-b112-89a5451e0b25_3993x1001.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The architecture of the image encoder we are using in this VLM</figcaption></figure></div><h2>Text Encoder</h2><ul><li><p>Each caption is a short token sequence such as [CLS] red triangle left.</p></li><li><p>A <strong>token embedding layer</strong> maps each token to a 64-dimensional vector.</p></li><li><p>A <strong>positional embedding layer</strong> adds position information, as in a transformer.</p></li><li><p>A <strong>multi-head attention (MHA)</strong> layer follows.</p></li><li><p>This is followed by a <strong>Linear layer + LayerNorm + L2 normalization</strong>.</p></li></ul><p>The image below shows a layer-by-layer breakdown of the text encoder.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!Asxo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a18ba96-ac76-44f4-b354-9281516a2170_2507x1595.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Asxo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a18ba96-ac76-44f4-b354-9281516a2170_2507x1595.png 424w, https://substackcdn.com/image/fetch/$s_!Asxo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a18ba96-ac76-44f4-b354-9281516a2170_2507x1595.png 848w, https://substackcdn.com/image/fetch/$s_!Asxo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a18ba96-ac76-44f4-b354-9281516a2170_2507x1595.png 1272w, https://substackcdn.com/image/fetch/$s_!Asxo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a18ba96-ac76-44f4-b354-9281516a2170_2507x1595.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Asxo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a18ba96-ac76-44f4-b354-9281516a2170_2507x1595.png" width="1456" height="926" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3a18ba96-ac76-44f4-b354-9281516a2170_2507x1595.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:926,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:375631,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/182932994?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a18ba96-ac76-44f4-b354-9281516a2170_2507x1595.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Asxo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a18ba96-ac76-44f4-b354-9281516a2170_2507x1595.png 424w, https://substackcdn.com/image/fetch/$s_!Asxo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a18ba96-ac76-44f4-b354-9281516a2170_2507x1595.png 848w, https://substackcdn.com/image/fetch/$s_!Asxo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a18ba96-ac76-44f4-b354-9281516a2170_2507x1595.png 1272w, https://substackcdn.com/image/fetch/$s_!Asxo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a18ba96-ac76-44f4-b354-9281516a2170_2507x1595.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h2>Loss function</h2><p>The loss <strong>aligns the image and text embeddings</strong> so that:</p><ul><li><p>Matching pairs (an image and its correct caption) have <strong>high similarity</strong>.</p></li><li><p>Non-matching pairs have <strong>low similarity</strong>.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Oa03!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2f26e13-b05e-4c37-aeb6-2d111f765d8b_5282x757.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!Oa03!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2f26e13-b05e-4c37-aeb6-2d111f765d8b_5282x757.png 424w, https://substackcdn.com/image/fetch/$s_!Oa03!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2f26e13-b05e-4c37-aeb6-2d111f765d8b_5282x757.png 848w, https://substackcdn.com/image/fetch/$s_!Oa03!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2f26e13-b05e-4c37-aeb6-2d111f765d8b_5282x757.png 1272w, https://substackcdn.com/image/fetch/$s_!Oa03!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2f26e13-b05e-4c37-aeb6-2d111f765d8b_5282x757.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Oa03!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2f26e13-b05e-4c37-aeb6-2d111f765d8b_5282x757.png" width="1456" height="209" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e2f26e13-b05e-4c37-aeb6-2d111f765d8b_5282x757.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:209,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:132249,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/182932994?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2f26e13-b05e-4c37-aeb6-2d111f765d8b_5282x757.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Oa03!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2f26e13-b05e-4c37-aeb6-2d111f765d8b_5282x757.png 424w, https://substackcdn.com/image/fetch/$s_!Oa03!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2f26e13-b05e-4c37-aeb6-2d111f765d8b_5282x757.png 848w, https://substackcdn.com/image/fetch/$s_!Oa03!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2f26e13-b05e-4c37-aeb6-2d111f765d8b_5282x757.png 1272w, https://substackcdn.com/image/fetch/$s_!Oa03!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2f26e13-b05e-4c37-aeb6-2d111f765d8b_5282x757.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Compute embeddings for a batch:</figcaption></figure></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0dLl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa987a7b9-73ff-4a59-b783-7ba3fa83c182_6461x619.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0dLl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa987a7b9-73ff-4a59-b783-7ba3fa83c182_6461x619.png 424w, https://substackcdn.com/image/fetch/$s_!0dLl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa987a7b9-73ff-4a59-b783-7ba3fa83c182_6461x619.png 848w, 
https://substackcdn.com/image/fetch/$s_!0dLl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa987a7b9-73ff-4a59-b783-7ba3fa83c182_6461x619.png 1272w, https://substackcdn.com/image/fetch/$s_!0dLl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa987a7b9-73ff-4a59-b783-7ba3fa83c182_6461x619.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0dLl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa987a7b9-73ff-4a59-b783-7ba3fa83c182_6461x619.png" width="1456" height="139" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a987a7b9-73ff-4a59-b783-7ba3fa83c182_6461x619.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:139,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:150382,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/182932994?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa987a7b9-73ff-4a59-b783-7ba3fa83c182_6461x619.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!0dLl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa987a7b9-73ff-4a59-b783-7ba3fa83c182_6461x619.png 424w, 
https://substackcdn.com/image/fetch/$s_!0dLl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa987a7b9-73ff-4a59-b783-7ba3fa83c182_6461x619.png 848w, https://substackcdn.com/image/fetch/$s_!0dLl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa987a7b9-73ff-4a59-b783-7ba3fa83c182_6461x619.png 1272w, https://substackcdn.com/image/fetch/$s_!0dLl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa987a7b9-73ff-4a59-b783-7ba3fa83c182_6461x619.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Compute similarity matrix:</figcaption></figure></div><p>Each row i compares image i with all text embeddings.</p><p>Diagonal elements are the correct image-text pairs.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8Aje!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfbbcfcd-55e8-445f-b569-169bdeefc361_4692x852.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8Aje!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfbbcfcd-55e8-445f-b569-169bdeefc361_4692x852.png 424w, https://substackcdn.com/image/fetch/$s_!8Aje!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfbbcfcd-55e8-445f-b569-169bdeefc361_4692x852.png 848w, 
https://substackcdn.com/image/fetch/$s_!8Aje!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfbbcfcd-55e8-445f-b569-169bdeefc361_4692x852.png 1272w, https://substackcdn.com/image/fetch/$s_!8Aje!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfbbcfcd-55e8-445f-b569-169bdeefc361_4692x852.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8Aje!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfbbcfcd-55e8-445f-b569-169bdeefc361_4692x852.png" width="1456" height="264" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dfbbcfcd-55e8-445f-b569-169bdeefc361_4692x852.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:264,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:229642,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/182932994?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfbbcfcd-55e8-445f-b569-169bdeefc361_4692x852.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!8Aje!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfbbcfcd-55e8-445f-b569-169bdeefc361_4692x852.png 424w, 
https://substackcdn.com/image/fetch/$s_!8Aje!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfbbcfcd-55e8-445f-b569-169bdeefc361_4692x852.png 848w, https://substackcdn.com/image/fetch/$s_!8Aje!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfbbcfcd-55e8-445f-b569-169bdeefc361_4692x852.png 1272w, https://substackcdn.com/image/fetch/$s_!8Aje!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfbbcfcd-55e8-445f-b569-169bdeefc361_4692x852.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Apply cross-entropy loss in both directions:</figcaption></figure></div><p>Image &#10141; Text classification: for each image, pick its caption from all captions in the batch.</p><p>Text &#10141; Image classification: for each caption, pick its image from all images in the batch.</p><p>This is <strong>symmetric contrastive learning</strong>, the same idea used in CLIP.</p><p>I am not pasting the full code here, but I will show you how the embeddings look before and after training.</p><h1>Embedding similarity before and after training</h1><p>Look at the beautiful color bands below. Each color shows the value of the embedding vector along a particular dimension. There are 64 dimensions in total. Notice how similar the embeddings look after training. 
So beautiful.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rFMR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cf4fe73-76e7-497e-95b8-98f4e0e80941_1932x2069.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rFMR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cf4fe73-76e7-497e-95b8-98f4e0e80941_1932x2069.png 424w, https://substackcdn.com/image/fetch/$s_!rFMR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cf4fe73-76e7-497e-95b8-98f4e0e80941_1932x2069.png 848w, https://substackcdn.com/image/fetch/$s_!rFMR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cf4fe73-76e7-497e-95b8-98f4e0e80941_1932x2069.png 1272w, https://substackcdn.com/image/fetch/$s_!rFMR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cf4fe73-76e7-497e-95b8-98f4e0e80941_1932x2069.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rFMR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cf4fe73-76e7-497e-95b8-98f4e0e80941_1932x2069.png" width="1456" height="1559" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3cf4fe73-76e7-497e-95b8-98f4e0e80941_1932x2069.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1559,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:430851,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/182932994?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cf4fe73-76e7-497e-95b8-98f4e0e80941_1932x2069.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!rFMR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cf4fe73-76e7-497e-95b8-98f4e0e80941_1932x2069.png 424w, https://substackcdn.com/image/fetch/$s_!rFMR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cf4fe73-76e7-497e-95b8-98f4e0e80941_1932x2069.png 848w, https://substackcdn.com/image/fetch/$s_!rFMR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cf4fe73-76e7-497e-95b8-98f4e0e80941_1932x2069.png 1272w, https://substackcdn.com/image/fetch/$s_!rFMR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3cf4fe73-76e7-497e-95b8-98f4e0e80941_1932x2069.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>I also tried 3-D embeddings instead of 64-D embeddings so that each embedding can be plotted directly as a vector in 3D space. 
Here are the results.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-SLk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F921df1ed-23af-4587-b34f-53dbfe324fe1_1978x2021.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-SLk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F921df1ed-23af-4587-b34f-53dbfe324fe1_1978x2021.png 424w, https://substackcdn.com/image/fetch/$s_!-SLk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F921df1ed-23af-4587-b34f-53dbfe324fe1_1978x2021.png 848w, https://substackcdn.com/image/fetch/$s_!-SLk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F921df1ed-23af-4587-b34f-53dbfe324fe1_1978x2021.png 1272w, https://substackcdn.com/image/fetch/$s_!-SLk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F921df1ed-23af-4587-b34f-53dbfe324fe1_1978x2021.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-SLk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F921df1ed-23af-4587-b34f-53dbfe324fe1_1978x2021.png" width="496" height="506.9010989010989" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/921df1ed-23af-4587-b34f-53dbfe324fe1_1978x2021.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1488,&quot;width&quot;:1456,&quot;resizeWidth&quot;:496,&quot;bytes&quot;:743365,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/182932994?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F921df1ed-23af-4587-b34f-53dbfe324fe1_1978x2021.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-SLk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F921df1ed-23af-4587-b34f-53dbfe324fe1_1978x2021.png 424w, https://substackcdn.com/image/fetch/$s_!-SLk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F921df1ed-23af-4587-b34f-53dbfe324fe1_1978x2021.png 848w, https://substackcdn.com/image/fetch/$s_!-SLk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F921df1ed-23af-4587-b34f-53dbfe324fe1_1978x2021.png 1272w, https://substackcdn.com/image/fetch/$s_!-SLk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F921df1ed-23af-4587-b34f-53dbfe324fe1_1978x2021.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ljYl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55ce8cdf-54ad-47a3-a747-0ff7a7791360_1980x2020.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ljYl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55ce8cdf-54ad-47a3-a747-0ff7a7791360_1980x2020.png 424w, 
https://substackcdn.com/image/fetch/$s_!ljYl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55ce8cdf-54ad-47a3-a747-0ff7a7791360_1980x2020.png 848w, https://substackcdn.com/image/fetch/$s_!ljYl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55ce8cdf-54ad-47a3-a747-0ff7a7791360_1980x2020.png 1272w, https://substackcdn.com/image/fetch/$s_!ljYl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55ce8cdf-54ad-47a3-a747-0ff7a7791360_1980x2020.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ljYl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55ce8cdf-54ad-47a3-a747-0ff7a7791360_1980x2020.png" width="486" height="495.67994505494505" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/55ce8cdf-54ad-47a3-a747-0ff7a7791360_1980x2020.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1485,&quot;width&quot;:1456,&quot;resizeWidth&quot;:486,&quot;bytes&quot;:790047,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/182932994?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55ce8cdf-54ad-47a3-a747-0ff7a7791360_1980x2020.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!ljYl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55ce8cdf-54ad-47a3-a747-0ff7a7791360_1980x2020.png 424w, https://substackcdn.com/image/fetch/$s_!ljYl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55ce8cdf-54ad-47a3-a747-0ff7a7791360_1980x2020.png 848w, https://substackcdn.com/image/fetch/$s_!ljYl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55ce8cdf-54ad-47a3-a747-0ff7a7791360_1980x2020.png 1272w, https://substackcdn.com/image/fetch/$s_!ljYl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55ce8cdf-54ad-47a3-a747-0ff7a7791360_1980x2020.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>I think this is a great example of how the image and text embeddings align after training in a VLM.</p><h1>Why is this model called &#8220;Nano&#8221;?</h1><p>We can answer this question by counting the trainable parameters by hand.</p><p>We can count them separately for the text encoder and the image encoder.</p><h2>Image Encoder: Number of parameters</h2><p>The image encoder is a CNN with around 440k trainable parameters in total. I am not going to walk through the full per-layer calculation, but I will show you the layer-wise distribution of the parameters.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JaD-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a12568d-577f-4cfd-8181-42a31f65dffd_1489x2684.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JaD-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a12568d-577f-4cfd-8181-42a31f65dffd_1489x2684.png 424w, https://substackcdn.com/image/fetch/$s_!JaD-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a12568d-577f-4cfd-8181-42a31f65dffd_1489x2684.png 848w, 
https://substackcdn.com/image/fetch/$s_!JaD-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a12568d-577f-4cfd-8181-42a31f65dffd_1489x2684.png 1272w, https://substackcdn.com/image/fetch/$s_!JaD-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a12568d-577f-4cfd-8181-42a31f65dffd_1489x2684.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JaD-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a12568d-577f-4cfd-8181-42a31f65dffd_1489x2684.png" width="500" height="901.4423076923077" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4a12568d-577f-4cfd-8181-42a31f65dffd_1489x2684.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:2625,&quot;width&quot;:1456,&quot;resizeWidth&quot;:500,&quot;bytes&quot;:301021,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/182932994?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a12568d-577f-4cfd-8181-42a31f65dffd_1489x2684.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!JaD-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a12568d-577f-4cfd-8181-42a31f65dffd_1489x2684.png 424w, 
https://substackcdn.com/image/fetch/$s_!JaD-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a12568d-577f-4cfd-8181-42a31f65dffd_1489x2684.png 848w, https://substackcdn.com/image/fetch/$s_!JaD-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a12568d-577f-4cfd-8181-42a31f65dffd_1489x2684.png 1272w, https://substackcdn.com/image/fetch/$s_!JaD-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a12568d-577f-4cfd-8181-42a31f65dffd_1489x2684.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Notice that the early layers of a CNN contribute relatively few trainable parameters. As you add more convolutional layers later in the network, the parameter count grows at a faster rate. So if you care about keeping your image encoder lightweight, you should cut down on the layers that come later. This happens because later layers produce more channels at the output of the convolution operation, and every output channel needs its own filter. The total number of filters in a given layer is therefore the same as the number of channels that layer produces at its output, and each of those filters also spans every input channel of the layer.</p><p>I encourage you to perform this calculation yourself. If you don&#8217;t know how to calculate the number of parameters in a given convolution layer, the video linked here will help. It is a short ~20 minute video that I recorded recently to explain what filters do, dimension-wise, in a convolution operation: </p><div id="youtube2-8nCIPB67w0Y" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;8nCIPB67w0Y&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/8nCIPB67w0Y?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>Now let us also calculate the total number of trainable parameters in the text encoder.</p><h2>Text Encoder</h2><p>As with the image encoder, I am not going to show the entire calculation, but I&#8217;ll show the layer-wise distribution of parameters.</p><p>Pardon my atrocious handwriting once again.</p><div 
class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2KK8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a11c24e-5efa-4c93-b6ab-995c41f12aea_1314x3042.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2KK8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a11c24e-5efa-4c93-b6ab-995c41f12aea_1314x3042.png 424w, https://substackcdn.com/image/fetch/$s_!2KK8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a11c24e-5efa-4c93-b6ab-995c41f12aea_1314x3042.png 848w, https://substackcdn.com/image/fetch/$s_!2KK8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a11c24e-5efa-4c93-b6ab-995c41f12aea_1314x3042.png 1272w, https://substackcdn.com/image/fetch/$s_!2KK8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a11c24e-5efa-4c93-b6ab-995c41f12aea_1314x3042.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2KK8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a11c24e-5efa-4c93-b6ab-995c41f12aea_1314x3042.png" width="528" height="1222.3561643835617" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6a11c24e-5efa-4c93-b6ab-995c41f12aea_1314x3042.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:3042,&quot;width&quot;:1314,&quot;resizeWidth&quot;:528,&quot;bytes&quot;:382670,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.vizuaranewsletter.com/i/182932994?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a11c24e-5efa-4c93-b6ab-995c41f12aea_1314x3042.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!2KK8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a11c24e-5efa-4c93-b6ab-995c41f12aea_1314x3042.png 424w, https://substackcdn.com/image/fetch/$s_!2KK8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a11c24e-5efa-4c93-b6ab-995c41f12aea_1314x3042.png 848w, https://substackcdn.com/image/fetch/$s_!2KK8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a11c24e-5efa-4c93-b6ab-995c41f12aea_1314x3042.png 1272w, https://substackcdn.com/image/fetch/$s_!2KK8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6a11c24e-5efa-4c93-b6ab-995c41f12aea_1314x3042.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The text encoder contributes just 22.5k parameters, which is only about 5% of the image encoder&#8217;s count. The total number of parameters is under 500k, and this is why we call this a &#8220;nano&#8221; vision language model.</p><h1>Conclusion</h1><p>This exercise of building NanoVLM from scratch was done as part of a series called &#8220;Transformers for Vision.&#8221; If you wish to watch the full lecture video, you can have a look at it here. 
You can code along with me in the video to learn how to build this NanoVLM completely from scratch yourself: </p><div id="youtube2-O4i_Uue08AI" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;O4i_Uue08AI&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/O4i_Uue08AI?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>If you wish to get access to our code files, handwritten notes, all lecture videos, Discord channel, and other PDF handbooks that we have compiled, along with a code certificate at the end of the program, you can consider being part of the pro version of the &#8220;Transformers for Vision Bootcamp&#8221;. You will find the details here: </p><p><a href="https://vision-transformer.vizuara.ai/">https://vision-transformer.vizuara.ai/</a></p><h1>Other resources</h1><p>If you like this content, please check out our research bootcamps on the following topics:</p><p><strong>CV</strong>: <a href="https://cvresearchbootcamp.vizuara.ai/">https://cvresearchbootcamp.vizuara.ai/</a></p><p><strong>GenAI</strong>: <a href="https://flyvidesh.online/gen-ai-professional-bootcamp">https://flyvidesh.online/gen-ai-professional-bootcamp</a></p><p><strong>RL</strong>: <a href="https://rlresearcherbootcamp.vizuara.ai/">https://rlresearcherbootcamp.vizuara.ai/</a></p><p><strong>SciML</strong>: <a href="https://flyvidesh.online/ml-bootcamp">https://flyvidesh.online/ml-bootcamp</a></p><p><strong>ML-DL</strong>: <a href="https://flyvidesh.online/ml-dl-bootcamp">https://flyvidesh.online/ml-dl-bootcamp</a></p>]]></content:encoded></item><item><title><![CDATA[What exactly are Diffusion 
Models?]]></title><description><![CDATA[How do diffusion models work?]]></description><link>https://www.vizuaranewsletter.com/p/what-exactly-are-diffusion-models</link><guid isPermaLink="false">https://www.vizuaranewsletter.com/p/what-exactly-are-diffusion-models</guid><dc:creator><![CDATA[Dr Rajat Dandekar]]></dc:creator><pubDate>Tue, 23 Dec 2025 09:02:02 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!fEKX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf553fa5-43b1-4201-a708-0fb00a8cdffa_1024x676.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Diffusion is the natural tendency of particles (like molecules, heat, or even information) to <strong>move and spread out</strong> until they are evenly distributed.</p><p>Some examples are as follows:</p><ol><li><p>Smell of perfume spreading across the room:</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tc20!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6937bcc4-fc6c-4d56-8da4-366b72c2ac92_2668x1499.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tc20!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6937bcc4-fc6c-4d56-8da4-366b72c2ac92_2668x1499.png 424w, https://substackcdn.com/image/fetch/$s_!tc20!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6937bcc4-fc6c-4d56-8da4-366b72c2ac92_2668x1499.png 848w, 
https://substackcdn.com/image/fetch/$s_!tc20!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6937bcc4-fc6c-4d56-8da4-366b72c2ac92_2668x1499.png 1272w, https://substackcdn.com/image/fetch/$s_!tc20!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6937bcc4-fc6c-4d56-8da4-366b72c2ac92_2668x1499.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tc20!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6937bcc4-fc6c-4d56-8da4-366b72c2ac92_2668x1499.png" width="360" height="202.25274725274724" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6937bcc4-fc6c-4d56-8da4-366b72c2ac92_2668x1499.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:818,&quot;width&quot;:1456,&quot;resizeWidth&quot;:360,&quot;bytes&quot;:3435787,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/181969460?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6937bcc4-fc6c-4d56-8da4-366b72c2ac92_2668x1499.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!tc20!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6937bcc4-fc6c-4d56-8da4-366b72c2ac92_2668x1499.png 424w, 
https://substackcdn.com/image/fetch/$s_!tc20!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6937bcc4-fc6c-4d56-8da4-366b72c2ac92_2668x1499.png 848w, https://substackcdn.com/image/fetch/$s_!tc20!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6937bcc4-fc6c-4d56-8da4-366b72c2ac92_2668x1499.png 1272w, https://substackcdn.com/image/fetch/$s_!tc20!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6937bcc4-fc6c-4d56-8da4-366b72c2ac92_2668x1499.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><ol start="2"><li><p>Sugar dissolving and spreading uniformly in water:</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!z44k!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92af1820-58e2-4102-9063-ee7c69dabf80_280x158.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!z44k!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92af1820-58e2-4102-9063-ee7c69dabf80_280x158.gif 424w, https://substackcdn.com/image/fetch/$s_!z44k!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92af1820-58e2-4102-9063-ee7c69dabf80_280x158.gif 848w, https://substackcdn.com/image/fetch/$s_!z44k!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92af1820-58e2-4102-9063-ee7c69dabf80_280x158.gif 1272w, 
https://substackcdn.com/image/fetch/$s_!z44k!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92af1820-58e2-4102-9063-ee7c69dabf80_280x158.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!z44k!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92af1820-58e2-4102-9063-ee7c69dabf80_280x158.gif" width="320" height="180.57142857142856" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/92af1820-58e2-4102-9063-ee7c69dabf80_280x158.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:158,&quot;width&quot;:280,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:972812,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/181969460?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92af1820-58e2-4102-9063-ee7c69dabf80_280x158.gif&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!z44k!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92af1820-58e2-4102-9063-ee7c69dabf80_280x158.gif 424w, https://substackcdn.com/image/fetch/$s_!z44k!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92af1820-58e2-4102-9063-ee7c69dabf80_280x158.gif 848w, 
https://substackcdn.com/image/fetch/$s_!z44k!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92af1820-58e2-4102-9063-ee7c69dabf80_280x158.gif 1272w, https://substackcdn.com/image/fetch/$s_!z44k!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92af1820-58e2-4102-9063-ee7c69dabf80_280x158.gif 1456w" sizes="100vw"></picture><div></div></div></a></figure></div><p>The diffusion process has a few characteristic properties:</p><ol><li><p>Structure slowly disappears</p></li><li><p>Things become more uniform and noisy over time</p></li></ol><p>But why are we discussing this now?</p><p>The main question is: can we do something similar with our data as well?</p><p>Remember that in the variational autoencoder, the encoder took the data as input and converted it into a representation in the latent space.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yhSb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faffd6886-1812-484b-b397-c9814f1c7d47_1912x892.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yhSb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faffd6886-1812-484b-b397-c9814f1c7d47_1912x892.png 424w, https://substackcdn.com/image/fetch/$s_!yhSb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faffd6886-1812-484b-b397-c9814f1c7d47_1912x892.png 848w, 
https://substackcdn.com/image/fetch/$s_!yhSb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faffd6886-1812-484b-b397-c9814f1c7d47_1912x892.png 1272w, https://substackcdn.com/image/fetch/$s_!yhSb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faffd6886-1812-484b-b397-c9814f1c7d47_1912x892.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yhSb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faffd6886-1812-484b-b397-c9814f1c7d47_1912x892.png" width="1456" height="679" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/affd6886-1812-484b-b397-c9814f1c7d47_1912x892.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:679,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:175230,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/181969460?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faffd6886-1812-484b-b397-c9814f1c7d47_1912x892.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!yhSb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faffd6886-1812-484b-b397-c9814f1c7d47_1912x892.png 424w, 
https://substackcdn.com/image/fetch/$s_!yhSb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faffd6886-1812-484b-b397-c9814f1c7d47_1912x892.png 848w, https://substackcdn.com/image/fetch/$s_!yhSb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faffd6886-1812-484b-b397-c9814f1c7d47_1912x892.png 1272w, https://substackcdn.com/image/fetch/$s_!yhSb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faffd6886-1812-484b-b397-c9814f1c7d47_1912x892.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Refer to this article on Variational AutoEncoders: <a href="https://www.vizuaranewsletter.com/p/variational-autoencoders-explained">https://www.vizuaranewsletter.com/p/variational-autoencoders-explained</a></p><p>What if we think of our encoder as a machine which diffuses the data?</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XWqH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc79bc21f-8b8c-4246-8dfa-02de521bb6d5_1602x398.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XWqH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc79bc21f-8b8c-4246-8dfa-02de521bb6d5_1602x398.png 424w, https://substackcdn.com/image/fetch/$s_!XWqH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc79bc21f-8b8c-4246-8dfa-02de521bb6d5_1602x398.png 848w, https://substackcdn.com/image/fetch/$s_!XWqH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc79bc21f-8b8c-4246-8dfa-02de521bb6d5_1602x398.png 1272w, https://substackcdn.com/image/fetch/$s_!XWqH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc79bc21f-8b8c-4246-8dfa-02de521bb6d5_1602x398.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XWqH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc79bc21f-8b8c-4246-8dfa-02de521bb6d5_1602x398.png" width="1456" height="362" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c79bc21f-8b8c-4246-8dfa-02de521bb6d5_1602x398.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:362,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:65414,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/181969460?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc79bc21f-8b8c-4246-8dfa-02de521bb6d5_1602x398.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!XWqH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc79bc21f-8b8c-4246-8dfa-02de521bb6d5_1602x398.png 424w, https://substackcdn.com/image/fetch/$s_!XWqH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc79bc21f-8b8c-4246-8dfa-02de521bb6d5_1602x398.png 848w, https://substackcdn.com/image/fetch/$s_!XWqH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc79bc21f-8b8c-4246-8dfa-02de521bb6d5_1602x398.png 1272w, https://substackcdn.com/image/fetch/$s_!XWqH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc79bc21f-8b8c-4246-8dfa-02de521bb6d5_1602x398.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>The diffuser works by converting the data into pure noise.</p><p>Let us take an example.</p><p>Consider this image:</p><div 
class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!aKD-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1b8c358-0f94-4711-90c8-910dd193d9a9_1742x2295.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!aKD-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1b8c358-0f94-4711-90c8-910dd193d9a9_1742x2295.png 424w, https://substackcdn.com/image/fetch/$s_!aKD-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1b8c358-0f94-4711-90c8-910dd193d9a9_1742x2295.png 848w, https://substackcdn.com/image/fetch/$s_!aKD-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1b8c358-0f94-4711-90c8-910dd193d9a9_1742x2295.png 1272w, https://substackcdn.com/image/fetch/$s_!aKD-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1b8c358-0f94-4711-90c8-910dd193d9a9_1742x2295.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!aKD-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1b8c358-0f94-4711-90c8-910dd193d9a9_1742x2295.png" width="221" height="291.125" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f1b8c358-0f94-4711-90c8-910dd193d9a9_1742x2295.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1918,&quot;width&quot;:1456,&quot;resizeWidth&quot;:221,&quot;bytes&quot;:1888133,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/181969460?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1b8c358-0f94-4711-90c8-910dd193d9a9_1742x2295.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!aKD-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1b8c358-0f94-4711-90c8-910dd193d9a9_1742x2295.png 424w, https://substackcdn.com/image/fetch/$s_!aKD-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1b8c358-0f94-4711-90c8-910dd193d9a9_1742x2295.png 848w, https://substackcdn.com/image/fetch/$s_!aKD-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1b8c358-0f94-4711-90c8-910dd193d9a9_1742x2295.png 1272w, https://substackcdn.com/image/fetch/$s_!aKD-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1b8c358-0f94-4711-90c8-910dd193d9a9_1742x2295.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Yes, we are taking Batman as our example :)</p><p>The encoder will do something like the following:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!VShm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5cf134d-c897-464b-8e8b-8fe035b64ece_1622x550.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!VShm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5cf134d-c897-464b-8e8b-8fe035b64ece_1622x550.png 424w, 
https://substackcdn.com/image/fetch/$s_!VShm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5cf134d-c897-464b-8e8b-8fe035b64ece_1622x550.png 848w, https://substackcdn.com/image/fetch/$s_!VShm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5cf134d-c897-464b-8e8b-8fe035b64ece_1622x550.png 1272w, https://substackcdn.com/image/fetch/$s_!VShm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5cf134d-c897-464b-8e8b-8fe035b64ece_1622x550.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!VShm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5cf134d-c897-464b-8e8b-8fe035b64ece_1622x550.png" width="1456" height="494" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e5cf134d-c897-464b-8e8b-8fe035b64ece_1622x550.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:494,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:413104,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/181969460?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5cf134d-c897-464b-8e8b-8fe035b64ece_1622x550.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" 
srcset="https://substackcdn.com/image/fetch/$s_!VShm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5cf134d-c897-464b-8e8b-8fe035b64ece_1622x550.png 424w, https://substackcdn.com/image/fetch/$s_!VShm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5cf134d-c897-464b-8e8b-8fe035b64ece_1622x550.png 848w, https://substackcdn.com/image/fetch/$s_!VShm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5cf134d-c897-464b-8e8b-8fe035b64ece_1622x550.png 1272w, https://substackcdn.com/image/fetch/$s_!VShm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5cf134d-c897-464b-8e8b-8fe035b64ece_1622x550.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>We will make one additional change: instead of directly transforming the image into noise, we will make the transformation gradual.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8_Mb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4727e538-b474-4944-b66d-da43a9d68fa5_2704x422.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8_Mb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4727e538-b474-4944-b66d-da43a9d68fa5_2704x422.png 424w, https://substackcdn.com/image/fetch/$s_!8_Mb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4727e538-b474-4944-b66d-da43a9d68fa5_2704x422.png 848w, https://substackcdn.com/image/fetch/$s_!8_Mb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4727e538-b474-4944-b66d-da43a9d68fa5_2704x422.png 1272w, https://substackcdn.com/image/fetch/$s_!8_Mb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4727e538-b474-4944-b66d-da43a9d68fa5_2704x422.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!8_Mb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4727e538-b474-4944-b66d-da43a9d68fa5_2704x422.png" width="728" height="113.5" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4727e538-b474-4944-b66d-da43a9d68fa5_2704x422.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:227,&quot;width&quot;:1456,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:663758,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/181969460?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4727e538-b474-4944-b66d-da43a9d68fa5_2704x422.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!8_Mb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4727e538-b474-4944-b66d-da43a9d68fa5_2704x422.png 424w, https://substackcdn.com/image/fetch/$s_!8_Mb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4727e538-b474-4944-b66d-da43a9d68fa5_2704x422.png 848w, https://substackcdn.com/image/fetch/$s_!8_Mb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4727e538-b474-4944-b66d-da43a9d68fa5_2704x422.png 1272w, https://substackcdn.com/image/fetch/$s_!8_Mb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4727e538-b474-4944-b66d-da43a9d68fa5_2704x422.png 
1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>So, does that mean there are multiple encoders that we need to train?</p><div class="pullquote"><p>Remember, this was one of the drawbacks of VAEs, where both the encoder and the decoder had to be trained simultaneously.</p></div><p><strong>What if we fix these encoders/diffusers?</strong></p><p>Let us say we modify our Batman image by adding noise from a fixed Gaussian kernel.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vJ7E!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9d0e0ad-2032-4dd3-a143-488c089ffe24_1024x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vJ7E!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9d0e0ad-2032-4dd3-a143-488c089ffe24_1024x1536.png 424w, https://substackcdn.com/image/fetch/$s_!vJ7E!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9d0e0ad-2032-4dd3-a143-488c089ffe24_1024x1536.png 848w, https://substackcdn.com/image/fetch/$s_!vJ7E!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9d0e0ad-2032-4dd3-a143-488c089ffe24_1024x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!vJ7E!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9d0e0ad-2032-4dd3-a143-488c089ffe24_1024x1536.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!vJ7E!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9d0e0ad-2032-4dd3-a143-488c089ffe24_1024x1536.png" width="217" height="325.5" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a9d0e0ad-2032-4dd3-a143-488c089ffe24_1024x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1536,&quot;width&quot;:1024,&quot;resizeWidth&quot;:217,&quot;bytes&quot;:2725335,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/181969460?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9d0e0ad-2032-4dd3-a143-488c089ffe24_1024x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!vJ7E!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9d0e0ad-2032-4dd3-a143-488c089ffe24_1024x1536.png 424w, https://substackcdn.com/image/fetch/$s_!vJ7E!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9d0e0ad-2032-4dd3-a143-488c089ffe24_1024x1536.png 848w, https://substackcdn.com/image/fetch/$s_!vJ7E!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9d0e0ad-2032-4dd3-a143-488c089ffe24_1024x1536.png 1272w, 
https://substackcdn.com/image/fetch/$s_!vJ7E!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9d0e0ad-2032-4dd3-a143-488c089ffe24_1024x1536.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>We can first represent our image as a grid of pixels.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!pcEJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5fff96d-273c-48bf-aee4-d27acd7455ad_386x486.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pcEJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5fff96d-273c-48bf-aee4-d27acd7455ad_386x486.png 424w, https://substackcdn.com/image/fetch/$s_!pcEJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5fff96d-273c-48bf-aee4-d27acd7455ad_386x486.png 848w, https://substackcdn.com/image/fetch/$s_!pcEJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5fff96d-273c-48bf-aee4-d27acd7455ad_386x486.png 1272w, https://substackcdn.com/image/fetch/$s_!pcEJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5fff96d-273c-48bf-aee4-d27acd7455ad_386x486.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pcEJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5fff96d-273c-48bf-aee4-d27acd7455ad_386x486.png" width="250" height="314.76683937823833" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b5fff96d-273c-48bf-aee4-d27acd7455ad_386x486.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:486,&quot;width&quot;:386,&quot;resizeWidth&quot;:250,&quot;bytes&quot;:122609,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/181969460?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5fff96d-273c-48bf-aee4-d27acd7455ad_386x486.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!pcEJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5fff96d-273c-48bf-aee4-d27acd7455ad_386x486.png 424w, https://substackcdn.com/image/fetch/$s_!pcEJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5fff96d-273c-48bf-aee4-d27acd7455ad_386x486.png 848w, https://substackcdn.com/image/fetch/$s_!pcEJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5fff96d-273c-48bf-aee4-d27acd7455ad_386x486.png 1272w, https://substackcdn.com/image/fetch/$s_!pcEJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb5fff96d-273c-48bf-aee4-d27acd7455ad_386x486.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>These pixels have some fixed values.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xyPS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ef30989-9857-44b6-8916-226589d69444_376x456.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xyPS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ef30989-9857-44b6-8916-226589d69444_376x456.png 424w, 
https://substackcdn.com/image/fetch/$s_!xyPS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ef30989-9857-44b6-8916-226589d69444_376x456.png 848w, https://substackcdn.com/image/fetch/$s_!xyPS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ef30989-9857-44b6-8916-226589d69444_376x456.png 1272w, https://substackcdn.com/image/fetch/$s_!xyPS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ef30989-9857-44b6-8916-226589d69444_376x456.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xyPS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ef30989-9857-44b6-8916-226589d69444_376x456.png" width="254" height="308.0425531914894" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5ef30989-9857-44b6-8916-226589d69444_376x456.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:456,&quot;width&quot;:376,&quot;resizeWidth&quot;:254,&quot;bytes&quot;:28010,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/181969460?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ef30989-9857-44b6-8916-226589d69444_376x456.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" 
srcset="https://substackcdn.com/image/fetch/$s_!xyPS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ef30989-9857-44b6-8916-226589d69444_376x456.png 424w, https://substackcdn.com/image/fetch/$s_!xyPS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ef30989-9857-44b6-8916-226589d69444_376x456.png 848w, https://substackcdn.com/image/fetch/$s_!xyPS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ef30989-9857-44b6-8916-226589d69444_376x456.png 1272w, https://substackcdn.com/image/fetch/$s_!xyPS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5ef30989-9857-44b6-8916-226589d69444_376x456.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Now, for each pixel, we sample from a Gaussian whose mean is fixed to the pixel value and whose variance (beta) is small.</p><p>This will look something like the following:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5HMq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F759e92e6-4636-4f41-8f08-f34cfb9ad33e_1448x1116.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5HMq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F759e92e6-4636-4f41-8f08-f34cfb9ad33e_1448x1116.png 424w, https://substackcdn.com/image/fetch/$s_!5HMq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F759e92e6-4636-4f41-8f08-f34cfb9ad33e_1448x1116.png 848w, https://substackcdn.com/image/fetch/$s_!5HMq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F759e92e6-4636-4f41-8f08-f34cfb9ad33e_1448x1116.png 1272w, https://substackcdn.com/image/fetch/$s_!5HMq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F759e92e6-4636-4f41-8f08-f34cfb9ad33e_1448x1116.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!5HMq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F759e92e6-4636-4f41-8f08-f34cfb9ad33e_1448x1116.png" width="1448" height="1116" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/759e92e6-4636-4f41-8f08-f34cfb9ad33e_1448x1116.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1116,&quot;width&quot;:1448,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:323272,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/181969460?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F759e92e6-4636-4f41-8f08-f34cfb9ad33e_1448x1116.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!5HMq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F759e92e6-4636-4f41-8f08-f34cfb9ad33e_1448x1116.png 424w, https://substackcdn.com/image/fetch/$s_!5HMq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F759e92e6-4636-4f41-8f08-f34cfb9ad33e_1448x1116.png 848w, https://substackcdn.com/image/fetch/$s_!5HMq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F759e92e6-4636-4f41-8f08-f34cfb9ad33e_1448x1116.png 1272w, 
https://substackcdn.com/image/fetch/$s_!5HMq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F759e92e6-4636-4f41-8f08-f34cfb9ad33e_1448x1116.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>What will happen if we do this for all the pixels?</strong></p><p>If we do this process to all the pixels, we will get something like this:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!FDcT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F965ce3f6-0735-4e6f-9ee8-bdeb593df8c5_1741x2297.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FDcT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F965ce3f6-0735-4e6f-9ee8-bdeb593df8c5_1741x2297.png 424w, https://substackcdn.com/image/fetch/$s_!FDcT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F965ce3f6-0735-4e6f-9ee8-bdeb593df8c5_1741x2297.png 848w, https://substackcdn.com/image/fetch/$s_!FDcT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F965ce3f6-0735-4e6f-9ee8-bdeb593df8c5_1741x2297.png 1272w, https://substackcdn.com/image/fetch/$s_!FDcT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F965ce3f6-0735-4e6f-9ee8-bdeb593df8c5_1741x2297.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!FDcT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F965ce3f6-0735-4e6f-9ee8-bdeb593df8c5_1741x2297.png" width="206" height="271.78983516483515" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/965ce3f6-0735-4e6f-9ee8-bdeb593df8c5_1741x2297.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1921,&quot;width&quot;:1456,&quot;resizeWidth&quot;:206,&quot;bytes&quot;:3065856,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/181969460?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F965ce3f6-0735-4e6f-9ee8-bdeb593df8c5_1741x2297.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!FDcT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F965ce3f6-0735-4e6f-9ee8-bdeb593df8c5_1741x2297.png 424w, https://substackcdn.com/image/fetch/$s_!FDcT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F965ce3f6-0735-4e6f-9ee8-bdeb593df8c5_1741x2297.png 848w, https://substackcdn.com/image/fetch/$s_!FDcT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F965ce3f6-0735-4e6f-9ee8-bdeb593df8c5_1741x2297.png 1272w, https://substackcdn.com/image/fetch/$s_!FDcT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F965ce3f6-0735-4e6f-9ee8-bdeb593df8c5_1741x2297.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg 
role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This is a &#8220;noisy version&#8221; of the original Batman image.</p><div class="pullquote"><p>What we did is also called adding Gaussian noise to the image.</p></div><p>The mathematical representation for this can be written as:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;x_{1} = x_{0} + \\beta \\epsilon&quot;,&quot;id&quot;:&quot;DDRUJJXRDE&quot;}" data-component-name="LatexBlockToDOM"></div><p>Here, x0, x1 and beta represent the original image, the transformed image, and the standard deviation of the noise, respectively. 
Epsilon denotes a random variable sampled from a standard Gaussian distribution (mean 0 and variance 1), so it can take any real value, positive or negative.</p><p><strong>Now, what will happen if you do this a large number of times?</strong></p><p>This is what you get:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fwRK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F800da212-d368-4303-b6f9-f25726f596f3_1554x2048.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fwRK!,w_424,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F800da212-d368-4303-b6f9-f25726f596f3_1554x2048.gif 424w, https://substackcdn.com/image/fetch/$s_!fwRK!,w_848,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F800da212-d368-4303-b6f9-f25726f596f3_1554x2048.gif 848w, https://substackcdn.com/image/fetch/$s_!fwRK!,w_1272,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F800da212-d368-4303-b6f9-f25726f596f3_1554x2048.gif 1272w, https://substackcdn.com/image/fetch/$s_!fwRK!,w_1456,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F800da212-d368-4303-b6f9-f25726f596f3_1554x2048.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fwRK!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F800da212-d368-4303-b6f9-f25726f596f3_1554x2048.gif" width="269" height="354.54052197802196" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/800da212-d368-4303-b6f9-f25726f596f3_1554x2048.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1919,&quot;width&quot;:1456,&quot;resizeWidth&quot;:269,&quot;bytes&quot;:17693096,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/181969460?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F800da212-d368-4303-b6f9-f25726f596f3_1554x2048.gif&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!fwRK!,w_424,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F800da212-d368-4303-b6f9-f25726f596f3_1554x2048.gif 424w, https://substackcdn.com/image/fetch/$s_!fwRK!,w_848,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F800da212-d368-4303-b6f9-f25726f596f3_1554x2048.gif 848w, https://substackcdn.com/image/fetch/$s_!fwRK!,w_1272,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F800da212-d368-4303-b6f9-f25726f596f3_1554x2048.gif 1272w, https://substackcdn.com/image/fetch/$s_!fwRK!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F800da212-d368-4303-b6f9-f25726f596f3_1554x2048.gif 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" 
fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>There is one problem with the above method though. If you observe the animation closely, you will realize that we are adding a lot of noise, but the original image is preserved as it is.</p><p>This is different from the definition of &#8220;Diffusion&#8221; which we started out with:</p><ol><li><p>Structure slowly disappears</p></li><li><p>Things become more uniform and noisy over time</p></li></ol><div class="pullquote"><p>This happens because we are preserving the mean value of the pixels. 
The structure will slowly break down only when the mean also changes and moves towards 0.</p></div><p>Remember that we want to achieve this:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9-qk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7c6138d-e850-4926-b0f4-12a4dd2aa6c2_2746x434.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9-qk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7c6138d-e850-4926-b0f4-12a4dd2aa6c2_2746x434.png 424w, https://substackcdn.com/image/fetch/$s_!9-qk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7c6138d-e850-4926-b0f4-12a4dd2aa6c2_2746x434.png 848w, https://substackcdn.com/image/fetch/$s_!9-qk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7c6138d-e850-4926-b0f4-12a4dd2aa6c2_2746x434.png 1272w, https://substackcdn.com/image/fetch/$s_!9-qk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7c6138d-e850-4926-b0f4-12a4dd2aa6c2_2746x434.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9-qk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7c6138d-e850-4926-b0f4-12a4dd2aa6c2_2746x434.png" width="1456" height="230" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f7c6138d-e850-4926-b0f4-12a4dd2aa6c2_2746x434.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:230,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:675479,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/181969460?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7c6138d-e850-4926-b0f4-12a4dd2aa6c2_2746x434.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!9-qk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7c6138d-e850-4926-b0f4-12a4dd2aa6c2_2746x434.png 424w, https://substackcdn.com/image/fetch/$s_!9-qk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7c6138d-e850-4926-b0f4-12a4dd2aa6c2_2746x434.png 848w, https://substackcdn.com/image/fetch/$s_!9-qk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7c6138d-e850-4926-b0f4-12a4dd2aa6c2_2746x434.png 1272w, https://substackcdn.com/image/fetch/$s_!9-qk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7c6138d-e850-4926-b0f4-12a4dd2aa6c2_2746x434.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Our current &#8220;diffuser&#8221; does not achieve this. 
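</p><p>To see this numerically, here is a minimal sketch in Python with NumPy. It is only an illustration under assumptions, not the code behind the animation: the function name naive_noising, the pixel value 0.8, beta = 0.1 and the 50 steps are all hypothetical choices. We take many copies of a single pixel so that averaging over them estimates the mean and variance of that pixel after repeated noising.</p>
<pre>
```python
import numpy as np


def naive_noising(x0, beta, num_steps, rng):
    """Repeatedly apply x_t = x_{t-1} + beta * eps, with eps drawn from N(0, 1)."""
    x = x0.astype(np.float64)
    for _ in range(num_steps):
        # The mean is left untouched; only noise is added on top of it.
        x = x + beta * rng.standard_normal(x.shape)
    return x


rng = np.random.default_rng(0)

# Hypothetical stand-in for an image: many copies of one pixel with value 0.8.
x0 = np.full(100_000, 0.8)
x_final = naive_noising(x0, beta=0.1, num_steps=50, rng=rng)

print("mean:", x_final.mean())     # stays close to the original pixel value
print("variance:", x_final.var())  # grows roughly like num_steps * beta**2
```
</pre>
<p>The mean stays near the original pixel value while the variance keeps growing with the number of steps, which is why the image remains visible underneath the noise.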
</p><p>Let us do this: For each &#8220;diffuser&#8221;, we will also scale the mean down by some factor along with injecting noise.</p><p>Something like this:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!IaEt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8442bf1-02c5-442d-83b2-7aa3cfa3e77c_1564x1250.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!IaEt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8442bf1-02c5-442d-83b2-7aa3cfa3e77c_1564x1250.png 424w, https://substackcdn.com/image/fetch/$s_!IaEt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8442bf1-02c5-442d-83b2-7aa3cfa3e77c_1564x1250.png 848w, https://substackcdn.com/image/fetch/$s_!IaEt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8442bf1-02c5-442d-83b2-7aa3cfa3e77c_1564x1250.png 1272w, https://substackcdn.com/image/fetch/$s_!IaEt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8442bf1-02c5-442d-83b2-7aa3cfa3e77c_1564x1250.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!IaEt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8442bf1-02c5-442d-83b2-7aa3cfa3e77c_1564x1250.png" width="1456" height="1164" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c8442bf1-02c5-442d-83b2-7aa3cfa3e77c_1564x1250.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1164,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:366179,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/181969460?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8442bf1-02c5-442d-83b2-7aa3cfa3e77c_1564x1250.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!IaEt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8442bf1-02c5-442d-83b2-7aa3cfa3e77c_1564x1250.png 424w, https://substackcdn.com/image/fetch/$s_!IaEt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8442bf1-02c5-442d-83b2-7aa3cfa3e77c_1564x1250.png 848w, https://substackcdn.com/image/fetch/$s_!IaEt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8442bf1-02c5-442d-83b2-7aa3cfa3e77c_1564x1250.png 1272w, https://substackcdn.com/image/fetch/$s_!IaEt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8442bf1-02c5-442d-83b2-7aa3cfa3e77c_1564x1250.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg 
role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Notice how, for the 2 pixels, we are sampling from a distribution whose mean is scaled down (such that it moves towards zero).</p><p>The mathematical representation for this can be written as:</p><p>Okay, this is looking good. But we have multiple diffusers (4 in the diagram above). Let us see how we transform the original image to the final image mathematically:</p><p>Okay, this makes sense.</p><p>But there is one small thing left:</p><p>The noise level is not kept constant across transitions. The noise schedule is chosen so that the noise level increases as we transform the clean image into noise.</p><p>Also, the factor that scales the mean is chosen such that the sum of its square and the square of the standard deviation is equal to 1. 
This is done so that the total variance stays constant at every step (assuming the input has unit variance), which also keeps the values numerically stable across time steps.</p><p>So, the transitions can now be written as follows:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;x_{1} = \\sqrt{1-\\beta_{1}^{2}} x_{0} + \\beta_{1} \\epsilon&quot;,&quot;id&quot;:&quot;DINKUUNKRF&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;x_{2} = \\sqrt{1-\\beta_{2}^{2}} x_{1} + \\beta_{2} \\epsilon&quot;,&quot;id&quot;:&quot;PHPVXGAXKM&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;x_{3} = \\sqrt{1-\\beta_{3}^{2}} x_{2} + \\beta_{3} \\epsilon&quot;,&quot;id&quot;:&quot;TYVHMEIPLH&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;x_{4} = \\sqrt{1-\\beta_{4}^{2}} x_{3} + \\beta_{4} \\epsilon&quot;,&quot;id&quot;:&quot;LZKZGNZVKZ&quot;}" data-component-name="LatexBlockToDOM"></div><p>Note that a fresh epsilon is sampled independently at every transition.</p><p>Now, what will happen if you do the same thing a large number of times?</p><p>This is what you get:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fo3p!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F779264d8-21cd-449b-96fa-6a29b80737ca_328x436.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fo3p!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F779264d8-21cd-449b-96fa-6a29b80737ca_328x436.png 424w, 
https://substackcdn.com/image/fetch/$s_!fo3p!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F779264d8-21cd-449b-96fa-6a29b80737ca_328x436.png 848w, https://substackcdn.com/image/fetch/$s_!fo3p!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F779264d8-21cd-449b-96fa-6a29b80737ca_328x436.png 1272w, https://substackcdn.com/image/fetch/$s_!fo3p!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F779264d8-21cd-449b-96fa-6a29b80737ca_328x436.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fo3p!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F779264d8-21cd-449b-96fa-6a29b80737ca_328x436.png" width="208" height="276.4878048780488" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/779264d8-21cd-449b-96fa-6a29b80737ca_328x436.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:436,&quot;width&quot;:328,&quot;resizeWidth&quot;:208,&quot;bytes&quot;:197276,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/181969460?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F779264d8-21cd-449b-96fa-6a29b80737ca_328x436.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" 
srcset="https://substackcdn.com/image/fetch/$s_!fo3p!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F779264d8-21cd-449b-96fa-6a29b80737ca_328x436.png 424w, https://substackcdn.com/image/fetch/$s_!fo3p!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F779264d8-21cd-449b-96fa-6a29b80737ca_328x436.png 848w, https://substackcdn.com/image/fetch/$s_!fo3p!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F779264d8-21cd-449b-96fa-6a29b80737ca_328x436.png 1272w, https://substackcdn.com/image/fetch/$s_!fo3p!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F779264d8-21cd-449b-96fa-6a29b80737ca_328x436.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This is exactly what we want!</p><p>Reiterating our approach, we can now express this diffusion process as follows:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8g9D!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F724023b4-4dfd-4cbb-83df-3175d2a18e11_1452x660.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8g9D!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F724023b4-4dfd-4cbb-83df-3175d2a18e11_1452x660.png 424w, https://substackcdn.com/image/fetch/$s_!8g9D!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F724023b4-4dfd-4cbb-83df-3175d2a18e11_1452x660.png 848w, https://substackcdn.com/image/fetch/$s_!8g9D!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F724023b4-4dfd-4cbb-83df-3175d2a18e11_1452x660.png 1272w, https://substackcdn.com/image/fetch/$s_!8g9D!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F724023b4-4dfd-4cbb-83df-3175d2a18e11_1452x660.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!8g9D!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F724023b4-4dfd-4cbb-83df-3175d2a18e11_1452x660.png" width="1452" height="660" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/724023b4-4dfd-4cbb-83df-3175d2a18e11_1452x660.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:660,&quot;width&quot;:1452,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:412792,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/181969460?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F724023b4-4dfd-4cbb-83df-3175d2a18e11_1452x660.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!8g9D!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F724023b4-4dfd-4cbb-83df-3175d2a18e11_1452x660.png 424w, https://substackcdn.com/image/fetch/$s_!8g9D!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F724023b4-4dfd-4cbb-83df-3175d2a18e11_1452x660.png 848w, https://substackcdn.com/image/fetch/$s_!8g9D!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F724023b4-4dfd-4cbb-83df-3175d2a18e11_1452x660.png 1272w, https://substackcdn.com/image/fetch/$s_!8g9D!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F724023b4-4dfd-4cbb-83df-3175d2a18e11_1452x660.png 
1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="pullquote"><p>&#8220;For every pixel in the image, sample from a Gaussian distribution. The mean of the Gaussian distribution should be scaled by a factor of alpha, and the standard deviation should be beta.&#8221;</p></div><p>We do this for every transition. 
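</p><p>A minimal sketch of applying this rule at every transition could look like the following. It assumes the variance-preserving form from the transitions above, where the mean is scaled by sqrt(1 - beta^2). The function name forward_diffusion, the linearly increasing schedule from 0.01 to 0.2, the 500 steps and the pixel value 0.8 are hypothetical choices for illustration only.</p>
<pre>
```python
import numpy as np


def forward_diffusion(x0, betas, rng):
    """Apply x_t = sqrt(1 - beta_t**2) * x_{t-1} + beta_t * eps_t for each beta_t."""
    x = x0.astype(np.float64)
    for beta in betas:
        alpha = np.sqrt(1.0 - beta**2)      # mean scaling, so alpha**2 + beta**2 = 1
        eps = rng.standard_normal(x.shape)  # fresh standard Gaussian noise each step
        x = alpha * x + beta * eps
    return x


rng = np.random.default_rng(0)

# Assumed noise schedule: beta increases linearly over the transitions.
betas = np.linspace(0.01, 0.2, 500)

# Hypothetical stand-in for an image: many copies of one pixel with value 0.8.
x0 = np.full(100_000, 0.8)
x_T = forward_diffusion(x0, betas, rng)

print("mean:", x_T.mean())     # decays towards 0
print("variance:", x_T.var())  # approaches 1
```
</pre>
<p>With this schedule, the mean of the pixel decays towards 0 and the variance approaches 1, so the result is close to pure Gaussian noise.</p><p>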
So, our forward diffusion process looks as follows:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!v5mJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c61e604-2f60-45a8-abb4-d0569b190c6f_1004x1494.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!v5mJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c61e604-2f60-45a8-abb4-d0569b190c6f_1004x1494.png 424w, https://substackcdn.com/image/fetch/$s_!v5mJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c61e604-2f60-45a8-abb4-d0569b190c6f_1004x1494.png 848w, https://substackcdn.com/image/fetch/$s_!v5mJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c61e604-2f60-45a8-abb4-d0569b190c6f_1004x1494.png 1272w, https://substackcdn.com/image/fetch/$s_!v5mJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c61e604-2f60-45a8-abb4-d0569b190c6f_1004x1494.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!v5mJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c61e604-2f60-45a8-abb4-d0569b190c6f_1004x1494.png" width="420" height="624.9800796812749" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0c61e604-2f60-45a8-abb4-d0569b190c6f_1004x1494.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1494,&quot;width&quot;:1004,&quot;resizeWidth&quot;:420,&quot;bytes&quot;:856949,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/181969460?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c61e604-2f60-45a8-abb4-d0569b190c6f_1004x1494.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!v5mJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c61e604-2f60-45a8-abb4-d0569b190c6f_1004x1494.png 424w, https://substackcdn.com/image/fetch/$s_!v5mJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c61e604-2f60-45a8-abb4-d0569b190c6f_1004x1494.png 848w, https://substackcdn.com/image/fetch/$s_!v5mJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c61e604-2f60-45a8-abb4-d0569b190c6f_1004x1494.png 1272w, https://substackcdn.com/image/fetch/$s_!v5mJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c61e604-2f60-45a8-abb4-d0569b190c6f_1004x1494.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Let us take a look at a practical example of applying the forward diffusion process to simple English letters.</p><p>We will transform the letter &#8220;T&#8221; to noise using the forward diffusion process.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!IP1-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40e4a074-36f3-4449-afd5-bdb62f9d9d7d_5390x742.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!IP1-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40e4a074-36f3-4449-afd5-bdb62f9d9d7d_5390x742.png 424w, https://substackcdn.com/image/fetch/$s_!IP1-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40e4a074-36f3-4449-afd5-bdb62f9d9d7d_5390x742.png 848w, https://substackcdn.com/image/fetch/$s_!IP1-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40e4a074-36f3-4449-afd5-bdb62f9d9d7d_5390x742.png 1272w, https://substackcdn.com/image/fetch/$s_!IP1-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40e4a074-36f3-4449-afd5-bdb62f9d9d7d_5390x742.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!IP1-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40e4a074-36f3-4449-afd5-bdb62f9d9d7d_5390x742.png" width="1456" height="200" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/40e4a074-36f3-4449-afd5-bdb62f9d9d7d_5390x742.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:200,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1771737,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/181969460?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40e4a074-36f3-4449-afd5-bdb62f9d9d7d_5390x742.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!IP1-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40e4a074-36f3-4449-afd5-bdb62f9d9d7d_5390x742.png 424w, https://substackcdn.com/image/fetch/$s_!IP1-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40e4a074-36f3-4449-afd5-bdb62f9d9d7d_5390x742.png 848w, https://substackcdn.com/image/fetch/$s_!IP1-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40e4a074-36f3-4449-afd5-bdb62f9d9d7d_5390x742.png 1272w, https://substackcdn.com/image/fetch/$s_!IP1-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F40e4a074-36f3-4449-afd5-bdb62f9d9d7d_5390x742.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Here is the link to the Google Colab notebook:</p><p><a href="https://colab.research.google.com/drive/1xb2QF9j5RuLQjvCx0ap2bb9jLjr9_zsk?usp=sharing">Application of the forward diffusion process to a practical example</a></p><p>So our choice of the Gaussian transition kernel actually works!</p><p>Remember we started out with the following:</p><p>What if we think of our encoder as a machine which diffuses the data?</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!INfp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93542dcd-1394-4915-bb6a-acf0d4bcde96_1636x424.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!INfp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93542dcd-1394-4915-bb6a-acf0d4bcde96_1636x424.png 424w, https://substackcdn.com/image/fetch/$s_!INfp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93542dcd-1394-4915-bb6a-acf0d4bcde96_1636x424.png 848w, https://substackcdn.com/image/fetch/$s_!INfp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93542dcd-1394-4915-bb6a-acf0d4bcde96_1636x424.png 1272w, https://substackcdn.com/image/fetch/$s_!INfp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93542dcd-1394-4915-bb6a-acf0d4bcde96_1636x424.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!INfp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93542dcd-1394-4915-bb6a-acf0d4bcde96_1636x424.png" width="1456" height="377" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/93542dcd-1394-4915-bb6a-acf0d4bcde96_1636x424.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:377,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:67245,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/181969460?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93542dcd-1394-4915-bb6a-acf0d4bcde96_1636x424.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!INfp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93542dcd-1394-4915-bb6a-acf0d4bcde96_1636x424.png 424w, https://substackcdn.com/image/fetch/$s_!INfp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93542dcd-1394-4915-bb6a-acf0d4bcde96_1636x424.png 848w, https://substackcdn.com/image/fetch/$s_!INfp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93542dcd-1394-4915-bb6a-acf0d4bcde96_1636x424.png 1272w, https://substackcdn.com/image/fetch/$s_!INfp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93542dcd-1394-4915-bb6a-acf0d4bcde96_1636x424.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" 
stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Now, we have completely defined the above process which looks as follows:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vC-G!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff235544f-34f1-4cc2-b218-9a5501312c7b_1620x392.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vC-G!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff235544f-34f1-4cc2-b218-9a5501312c7b_1620x392.png 424w, https://substackcdn.com/image/fetch/$s_!vC-G!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff235544f-34f1-4cc2-b218-9a5501312c7b_1620x392.png 848w, https://substackcdn.com/image/fetch/$s_!vC-G!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff235544f-34f1-4cc2-b218-9a5501312c7b_1620x392.png 1272w, https://substackcdn.com/image/fetch/$s_!vC-G!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff235544f-34f1-4cc2-b218-9a5501312c7b_1620x392.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!vC-G!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff235544f-34f1-4cc2-b218-9a5501312c7b_1620x392.png" width="1456" height="352" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f235544f-34f1-4cc2-b218-9a5501312c7b_1620x392.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:352,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:60693,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/181969460?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff235544f-34f1-4cc2-b218-9a5501312c7b_1620x392.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!vC-G!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff235544f-34f1-4cc2-b218-9a5501312c7b_1620x392.png 424w, https://substackcdn.com/image/fetch/$s_!vC-G!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff235544f-34f1-4cc2-b218-9a5501312c7b_1620x392.png 848w, https://substackcdn.com/image/fetch/$s_!vC-G!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff235544f-34f1-4cc2-b218-9a5501312c7b_1620x392.png 1272w, https://substackcdn.com/image/fetch/$s_!vC-G!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff235544f-34f1-4cc2-b218-9a5501312c7b_1620x392.png 
1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>But the main questions remain:</p><p><strong>(1) What about the decoder? What does it look like?</strong></p><p><strong>(2) And how can we learn the original data distribution?</strong></p><h3>The Learnable Decoder</h3><blockquote><p><em>Our objective is to start from pure, unstructured noise and progressively denoise this randomness, step by step, until a coherent and meaningful data sample emerges.</em></p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UgDq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30e59239-804a-474a-91fd-77a3788c037d_1914x888.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UgDq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30e59239-804a-474a-91fd-77a3788c037d_1914x888.png 424w, https://substackcdn.com/image/fetch/$s_!UgDq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30e59239-804a-474a-91fd-77a3788c037d_1914x888.png 848w, https://substackcdn.com/image/fetch/$s_!UgDq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30e59239-804a-474a-91fd-77a3788c037d_1914x888.png 1272w, https://substackcdn.com/image/fetch/$s_!UgDq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30e59239-804a-474a-91fd-77a3788c037d_1914x888.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!UgDq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30e59239-804a-474a-91fd-77a3788c037d_1914x888.png" width="1456" height="676" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/30e59239-804a-474a-91fd-77a3788c037d_1914x888.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:676,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:115433,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/181969460?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30e59239-804a-474a-91fd-77a3788c037d_1914x888.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!UgDq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30e59239-804a-474a-91fd-77a3788c037d_1914x888.png 424w, https://substackcdn.com/image/fetch/$s_!UgDq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30e59239-804a-474a-91fd-77a3788c037d_1914x888.png 848w, https://substackcdn.com/image/fetch/$s_!UgDq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30e59239-804a-474a-91fd-77a3788c037d_1914x888.png 1272w, https://substackcdn.com/image/fetch/$s_!UgDq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30e59239-804a-474a-91fd-77a3788c037d_1914x888.png 
1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em>For example, if the true data distribution is that of cats, then this process would look something like the following:</em></p><p>The decoder distribution is denoted as:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;p_{\\theta}(x)&quot;,&quot;id&quot;:&quot;KSNKTNMMRJ&quot;}" data-component-name="LatexBlockToDOM"></div><p>Whatever we do in this reverse process, the final goal is to maximize the probability of sampling the images from the true data distribution.</p><p>So, our objective is the following:</p><div class="latex-rendered" 
data-attrs="{&quot;persistentExpression&quot;:&quot;max[p_{\\theta}(x_{0})]&quot;,&quot;id&quot;:&quot;WGEPWYWIVB&quot;}" data-component-name="LatexBlockToDOM"></div><p>Because the logarithm is monotonic, this is equivalent to maximizing the following:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;max[\\text{log}(p_{\\theta}(x_{0}))]&quot;,&quot;id&quot;:&quot;LMEFJHSBEA&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>Just as we did for VAEs, we can calculate a lower bound for this quantity as follows:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{log}(p_{\\theta}(x_{0})) \\geq E&quot;,&quot;id&quot;:&quot;ZKZXPTOMOF&quot;}" data-component-name="LatexBlockToDOM"></div><p>It can be proved that E is given by the following expression:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;E = \\text{log}p_{\\theta}(x_{0}|x_{1}) - D_{KL}(q(x_{T}|x_{0}) \\| p(x_{T})) - \\sum_{t=2}^{T} D_{KL}(q(x_{t-1}|x_{t},x_{0}) \\| p_{\\theta}(x_{t-1}|x_{t}))&quot;,&quot;id&quot;:&quot;FMEWWNBGWY&quot;}" data-component-name="LatexBlockToDOM"></div><p>Here, &#8216;T&#8217; denotes the last time step in the forward diffusion process, when the data has become noise, and the sum in the third term runs over the intermediate time steps.</p><p>&#8220;q&#8221; denotes the true posterior, which means the <strong>actual conditional distribution implied by the forward diffusion process</strong>, not an approximation made by your model.</p><p>The three terms can be called &#8220;<strong>Reconstruction</strong>&#8221;, &#8220;<strong>Regularization</strong>&#8221; and &#8220;<strong>Matching the Reverse Distribution</strong>&#8221;.</p><p>The first two terms are very similar to what we saw in variational autoencoders. For training diffusion models, we ignore the first two terms and focus only on the third term.  
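</p><p>As a quick numerical sketch of what the third term measures, consider a single pixel. The code below assumes that both the true posterior and the predicted reverse transition are one-dimensional Gaussians with the same standard deviation; that assumption, the helper name <code>gaussian_kl</code>, and the example numbers are illustrative and are not derived in this section.</p>

```python
import math


def gaussian_kl(mu_q, sigma_q, mu_p, sigma_p):
    """KL( N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2) ) for 1-D Gaussians."""
    quadratic_term = (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sigma_p ** 2)
    return math.log(sigma_p / sigma_q) + quadratic_term - 0.5


if __name__ == "__main__":
    sigma = 0.5  # assumed shared standard deviation
    true_mean = 1.0  # hypothetical mean of the true posterior for one pixel
    for predicted_mean in (0.0, 0.5, 0.9, 1.0):
        kl = gaussian_kl(true_mean, sigma, predicted_mean, sigma)
        # With equal standard deviations the KL reduces to this expression.
        shortcut = (true_mean - predicted_mean) ** 2 / (2.0 * sigma ** 2)
        print(predicted_mean, kl, shortcut)
```

<p>When the two standard deviations are equal, the KL divergence equals the squared difference of the means divided by 2 * sigma^2, so under this assumption matching the two distributions amounts to matching their means.</p><p>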
</p><p>The third term is new here. It says that the predicted reverse transition distribution should match the true reverse transition distribution (also known as the true posterior) as closely as possible.</p><p>Let&#8217;s take an example:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fEKX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf553fa5-43b1-4201-a708-0fb00a8cdffa_1024x676.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fEKX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf553fa5-43b1-4201-a708-0fb00a8cdffa_1024x676.png 424w, https://substackcdn.com/image/fetch/$s_!fEKX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf553fa5-43b1-4201-a708-0fb00a8cdffa_1024x676.png 848w, https://substackcdn.com/image/fetch/$s_!fEKX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf553fa5-43b1-4201-a708-0fb00a8cdffa_1024x676.png 1272w, https://substackcdn.com/image/fetch/$s_!fEKX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf553fa5-43b1-4201-a708-0fb00a8cdffa_1024x676.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fEKX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf553fa5-43b1-4201-a708-0fb00a8cdffa_1024x676.png" width="445" height="293.76953125" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/af553fa5-43b1-4201-a708-0fb00a8cdffa_1024x676.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:676,&quot;width&quot;:1024,&quot;resizeWidth&quot;:445,&quot;bytes&quot;:889202,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/181969460?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf553fa5-43b1-4201-a708-0fb00a8cdffa_1024x676.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!fEKX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf553fa5-43b1-4201-a708-0fb00a8cdffa_1024x676.png 424w, https://substackcdn.com/image/fetch/$s_!fEKX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf553fa5-43b1-4201-a708-0fb00a8cdffa_1024x676.png 848w, https://substackcdn.com/image/fetch/$s_!fEKX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf553fa5-43b1-4201-a708-0fb00a8cdffa_1024x676.png 1272w, https://substackcdn.com/image/fetch/$s_!fEKX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faf553fa5-43b1-4201-a708-0fb00a8cdffa_1024x676.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In the above image, you can see an example of a handwritten note which is smudged because of the rain.</p><p>Now, the main question is the following:</p><p><strong>Can we predict the image at the previous time step, x(t-1)?</strong></p><p>The problem here is that we do not know the reverse process. We cannot just go back in time :(</p><p>Now consider another case: suppose you knew the image you started out with, the original image. 
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xcKF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd054cc7c-a5a7-44e2-91c4-d816286d054a_2070x676.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xcKF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd054cc7c-a5a7-44e2-91c4-d816286d054a_2070x676.png 424w, https://substackcdn.com/image/fetch/$s_!xcKF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd054cc7c-a5a7-44e2-91c4-d816286d054a_2070x676.png 848w, https://substackcdn.com/image/fetch/$s_!xcKF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd054cc7c-a5a7-44e2-91c4-d816286d054a_2070x676.png 1272w, https://substackcdn.com/image/fetch/$s_!xcKF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd054cc7c-a5a7-44e2-91c4-d816286d054a_2070x676.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xcKF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd054cc7c-a5a7-44e2-91c4-d816286d054a_2070x676.png" width="1456" height="475" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d054cc7c-a5a7-44e2-91c4-d816286d054a_2070x676.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:475,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1763509,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/181969460?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd054cc7c-a5a7-44e2-91c4-d816286d054a_2070x676.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!xcKF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd054cc7c-a5a7-44e2-91c4-d816286d054a_2070x676.png 424w, https://substackcdn.com/image/fetch/$s_!xcKF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd054cc7c-a5a7-44e2-91c4-d816286d054a_2070x676.png 848w, https://substackcdn.com/image/fetch/$s_!xcKF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd054cc7c-a5a7-44e2-91c4-d816286d054a_2070x676.png 1272w, https://substackcdn.com/image/fetch/$s_!xcKF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd054cc7c-a5a7-44e2-91c4-d816286d054a_2070x676.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Now, suddenly it becomes much easier to find the image at the previous time step, which is x(t-1).  
This is because we have access to the original image.</p><p>Okay, but how can we actually do this, even if we have the original image?</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RgtK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e0c0db9-24d2-4300-845f-0a18a85f4237_2552x486.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RgtK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e0c0db9-24d2-4300-845f-0a18a85f4237_2552x486.png 424w, https://substackcdn.com/image/fetch/$s_!RgtK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e0c0db9-24d2-4300-845f-0a18a85f4237_2552x486.png 848w, https://substackcdn.com/image/fetch/$s_!RgtK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e0c0db9-24d2-4300-845f-0a18a85f4237_2552x486.png 1272w, https://substackcdn.com/image/fetch/$s_!RgtK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e0c0db9-24d2-4300-845f-0a18a85f4237_2552x486.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RgtK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e0c0db9-24d2-4300-845f-0a18a85f4237_2552x486.png" width="1456" height="277" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1e0c0db9-24d2-4300-845f-0a18a85f4237_2552x486.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:277,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:595734,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/181969460?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e0c0db9-24d2-4300-845f-0a18a85f4237_2552x486.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!RgtK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e0c0db9-24d2-4300-845f-0a18a85f4237_2552x486.png 424w, https://substackcdn.com/image/fetch/$s_!RgtK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e0c0db9-24d2-4300-845f-0a18a85f4237_2552x486.png 848w, https://substackcdn.com/image/fetch/$s_!RgtK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e0c0db9-24d2-4300-845f-0a18a85f4237_2552x486.png 1272w, https://substackcdn.com/image/fetch/$s_!RgtK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e0c0db9-24d2-4300-845f-0a18a85f4237_2552x486.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>The question we are asking is:  </p><blockquote><p>If we knew the image of the Batman at one time step and we knew the original image, can we find the image of the 
Batman at the previous time step?</p></blockquote><p>We want to find the true posterior. For the third time step, it is given by:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;q(x_{2}|x_{3},x_{0})&quot;,&quot;id&quot;:&quot;ZAADNWTNDN&quot;}" data-component-name="LatexBlockToDOM"></div><p>The way I think about it is this:</p><p>Over three time steps, I have to remove a certain amount of noise. So in one time step, I will remove roughly one-third of that amount.</p><p>It turns out that, because the forward process adds Gaussian noise, this reverse step conditioned on the original image is itself a Gaussian distribution with a mean and a variance.</p><p>Intuitively, we expect the mean to depend on the original image as well as the image at the current time step.</p><p>Hence, we can write the mean of the true posterior as follows: </p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mu (x_{i},x_{0}) = A_{1}x_{0} + A_{2} x_{i}&quot;,&quot;id&quot;:&quot;BSZVNHOBEM&quot;}" data-component-name="LatexBlockToDOM"></div><p>Here, xi denotes the image at the current time step and x0 denotes the original image.</p><p>A1 and A2 are functions of the standard deviations in the forward transition process.  
I will include a book in the resources section where you can find the derivation for these values.</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;A_{1} = f(\\beta_{1},\\beta_{2},\\beta_{3}...)&quot;,&quot;id&quot;:&quot;GWITMZJCFY&quot;}" data-component-name="LatexBlockToDOM"></div><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;A_{2} = g(\\beta_{1},\\beta_{2},\\beta_{3}...)&quot;,&quot;id&quot;:&quot;XKXNNQLWJA&quot;}" data-component-name="LatexBlockToDOM"></div><p>The standard deviation of the true posterior can also be written as a function of all the standard deviations, which can be denoted as follows:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\sigma(i) = A_{3}&quot;,&quot;id&quot;:&quot;GNZGHGOTNB&quot;}" data-component-name="LatexBlockToDOM"></div><p>Let us look at a practical example where we apply the mathematical form for the true posterior to transform a noisy image into an original image, given the original image as the input.</p><p> Here, we will take the example of handwritten digits.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6Pss!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F030ac8e0-6d67-4253-89db-2540b5e1e235_1494x790.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6Pss!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F030ac8e0-6d67-4253-89db-2540b5e1e235_1494x790.png 424w, https://substackcdn.com/image/fetch/$s_!6Pss!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F030ac8e0-6d67-4253-89db-2540b5e1e235_1494x790.png 
848w, https://substackcdn.com/image/fetch/$s_!6Pss!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F030ac8e0-6d67-4253-89db-2540b5e1e235_1494x790.png 1272w, https://substackcdn.com/image/fetch/$s_!6Pss!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F030ac8e0-6d67-4253-89db-2540b5e1e235_1494x790.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6Pss!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F030ac8e0-6d67-4253-89db-2540b5e1e235_1494x790.png" width="1456" height="770" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/030ac8e0-6d67-4253-89db-2540b5e1e235_1494x790.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:770,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:128088,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/181969460?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F030ac8e0-6d67-4253-89db-2540b5e1e235_1494x790.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!6Pss!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F030ac8e0-6d67-4253-89db-2540b5e1e235_1494x790.png 424w, 
https://substackcdn.com/image/fetch/$s_!6Pss!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F030ac8e0-6d67-4253-89db-2540b5e1e235_1494x790.png 848w, https://substackcdn.com/image/fetch/$s_!6Pss!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F030ac8e0-6d67-4253-89db-2540b5e1e235_1494x790.png 1272w, https://substackcdn.com/image/fetch/$s_!6Pss!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F030ac8e0-6d67-4253-89db-2540b5e1e235_1494x790.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Here is the link to the Google Colab notebook:</p><p><a href="https://colab.research.google.com/drive/1J3AQEgxmsryrDQ6YIDja6oeCAkh5KtA3?usp=sharing">Application of the true posterior Gaussian distribution formula to a practical example</a></p><p><em>A question for all of you to think about: How do A1 and A2 change as the reverse transition process proceeds? Do they increase or decrease in magnitude as we get closer to the true image?</em></p><p><strong>You might now think: we know the entire reverse process, so are we done?</strong></p><p>Well, not quite. The reason is that we have calculated the reverse transition kernel conditioned on the original image.</p><p>In our application, we have to generate the image from the noise, so the original image will not be known to us.</p><p>Look at the third term in our objective function, which we want to maximize:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;- D_{KL}[p_{\\theta}(x_{t-1}|x_{t}) | q(x_{t-1}|x_{t},x_{0})]&quot;,&quot;id&quot;:&quot;PNTODRAQUT&quot;}" data-component-name="LatexBlockToDOM"></div><p>Since this term enters the objective with a negative sign, maximizing the objective means minimizing the KL divergence between our model prediction and the true posterior.</p><p>Our true posterior is a Gaussian, and we will assume that our model prediction is also a Gaussian distribution.  The mean of our model distribution is not known to us. 
However, we will assume that our model has the same variance as the true posterior.</p><p>This means that our task is to minimize the KL divergence between two Gaussians with the same variance and different means.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Cqie!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa26ceba6-9eba-4a85-baeb-4eecefda37e9_2787x1434.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Cqie!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa26ceba6-9eba-4a85-baeb-4eecefda37e9_2787x1434.png 424w, https://substackcdn.com/image/fetch/$s_!Cqie!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa26ceba6-9eba-4a85-baeb-4eecefda37e9_2787x1434.png 848w, https://substackcdn.com/image/fetch/$s_!Cqie!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa26ceba6-9eba-4a85-baeb-4eecefda37e9_2787x1434.png 1272w, https://substackcdn.com/image/fetch/$s_!Cqie!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa26ceba6-9eba-4a85-baeb-4eecefda37e9_2787x1434.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Cqie!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa26ceba6-9eba-4a85-baeb-4eecefda37e9_2787x1434.png" width="1456" height="749" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a26ceba6-9eba-4a85-baeb-4eecefda37e9_2787x1434.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:749,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:532198,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://vizuara.substack.com/i/181969460?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa26ceba6-9eba-4a85-baeb-4eecefda37e9_2787x1434.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!Cqie!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa26ceba6-9eba-4a85-baeb-4eecefda37e9_2787x1434.png 424w, https://substackcdn.com/image/fetch/$s_!Cqie!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa26ceba6-9eba-4a85-baeb-4eecefda37e9_2787x1434.png 848w, https://substackcdn.com/image/fetch/$s_!Cqie!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa26ceba6-9eba-4a85-baeb-4eecefda37e9_2787x1434.png 1272w, https://substackcdn.com/image/fetch/$s_!Cqie!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa26ceba6-9eba-4a85-baeb-4eecefda37e9_2787x1434.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>It turns out that this is equivalent to minimizing the mean squared error between the two means: </p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot; D_{KL}[p_{\\theta}(x_{t-1}|x_{t}) | q(x_{t-1}|x_{t},x_{0})] = \\frac {1}{2\\sigma^{2}}||\\mu_{\\phi}(x_{i}) - \\mu(x_{i},x_{0})||^{2}&quot;,&quot;id&quot;:&quot;POELIHIMXG&quot;}" data-component-name="LatexBlockToDOM"></div><p>Now, let us see how we can simplify this loss:</p><p>We already know the mean of the true posterior:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mu (x_{i},x_{0}) = A_{1}x_{0} + A_{2} x_{i}&quot;,&quot;id&quot;:&quot;QIAISXSHOI&quot;}" data-component-name="LatexBlockToDOM"></div><p>Let us write the mean of our model in the same form, with the unknown original image replaced by the model&#8217;s estimate of it:</p><div class="latex-rendered" 
data-attrs="{&quot;persistentExpression&quot;:&quot;\\mu_{\\phi} (x_{i}) = A_{1}\\hat{x_{0}} + A_{2} x_{i}&quot;,&quot;id&quot;:&quot;IDAWSTLDNN&quot;}" data-component-name="LatexBlockToDOM"></div><p>When we substitute both means into the loss, the A2 xi terms cancel, and only the difference between the predicted and the true original image remains: </p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot; D_{KL}[p_{\\theta}(x_{t-1}|x_{t}) | q(x_{t-1}|x_{t},x_{0})] = \\frac {A_{1}^{2}}{2\\sigma^{2}}||\\hat{x_{0}} - x_{0}||^{2}&quot;,&quot;id&quot;:&quot;LZDOIQIHKH&quot;}" data-component-name="LatexBlockToDOM"></div><p>This means that: </p><div class="pullquote"><p>To make our one-step reverse distribution match the true one, it&#8217;s enough to make our predicted clean sample close to the real one.</p></div><p>So training becomes a <strong>simple supervised regression</strong> problem: predict the clean image from the noisy image.</p><p>Now, we can do one more simplification to make it even more intuitive.</p><blockquote><p>The real clean sample can be recovered from the current noisy image by removing the real noise.</p><p>The predicted clean sample can be obtained from the same current noisy image by removing the predicted noise. 
</p></blockquote><div class="pullquote"><p>This means that we can express the real clean sample and the predicted clean sample in terms of the real noise and the predicted noise.</p></div><p>After doing this, the KL term simplifies to the following:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;D_{KL}[p_{\\theta}(x_{t-1}|x_{t}) | q(x_{t-1}|x_{t},x_{0})] = C||\\hat{\\epsilon} - \\epsilon||^{2}&quot;,&quot;id&quot;:&quot;ZMXVYCWCSR&quot;}" data-component-name="LatexBlockToDOM"></div><p>Here, C is a constant with respect to the model parameters.</p><p>This means that: </p><div class="pullquote"><p>To make our one-step reverse distribution match the true one, it&#8217;s enough to make our predicted noise close to the real noise.</p></div><p>That&#8217;s it!</p><p>Here is the link to the original paper that introduced <em>DDPM (Denoising Diffusion Probabilistic Models): <a href="https://arxiv.org/abs/2006.11239">https://arxiv.org/abs/2006.11239</a></em></p><p>For more detailed proofs, please refer to the book: <em>The Principles of Diffusion Models From Origins to Advances (<a href="https://arxiv.org/abs/2510.21890">https://arxiv.org/abs/2510.21890</a>) [Pages 43-55]</em></p><p>If you like this content, please check out our bootcamps on the following topics:</p><p><strong>Modern Robot Learning</strong>: <a href="https://robotlearningbootcamp.vizuara.ai/">https://robotlearningbootcamp.vizuara.ai/</a></p><p><strong>GenAI</strong>: <a href="https://flyvidesh.online/gen-ai-professional-bootcamp">https://flyvidesh.online/gen-ai-professional-bootcamp</a></p><p><strong>RL</strong>: <a href="https://rlresearcherbootcamp.vizuara.ai/">https://rlresearcherbootcamp.vizuara.ai/</a></p><p><strong>SciML</strong>: <a href="https://flyvidesh.online/ml-bootcamp">https://flyvidesh.online/ml-bootcamp</a></p><p><strong>ML-DL</strong>: <a href="https://flyvidesh.online/ml-dl-bootcamp">https://flyvidesh.online/ml-dl-bootcamp</a></p><p><strong>CV</strong>: <a 
href="https://cvresearchbootcamp.vizuara.ai/">https://cvresearchbootcamp.vizuara.ai/</a></p>]]></content:encoded></item></channel></rss>