<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Incomplete Distillation: Research Briefings]]></title><description><![CDATA[Explainers, updates, and commentary on the latest research developments.]]></description><link>https://januverma.substack.com/s/research-briefings</link><image><url>https://substackcdn.com/image/fetch/$s_!TDZo!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fjanuverma.substack.com%2Fimg%2Fsubstack.png</url><title>Incomplete Distillation: Research Briefings</title><link>https://januverma.substack.com/s/research-briefings</link></image><generator>Substack</generator><lastBuildDate>Tue, 26 May 2026 18:28:38 GMT</lastBuildDate><atom:link href="https://januverma.substack.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Janu Verma]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[januverma@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[januverma@substack.com]]></itunes:email><itunes:name><![CDATA[Janu Verma]]></itunes:name></itunes:owner><itunes:author><![CDATA[Janu Verma]]></itunes:author><googleplay:owner><![CDATA[januverma@substack.com]]></googleplay:owner><googleplay:email><![CDATA[januverma@substack.com]]></googleplay:email><googleplay:author><![CDATA[Janu Verma]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Research Briefings - Gemma 4]]></title><description><![CDATA[Testing the image/video reasoning capabilities]]></description><link>https://januverma.substack.com/p/research-briefings-gemma-4</link><guid isPermaLink="false">https://januverma.substack.com/p/research-briefings-gemma-4</guid><dc:creator><![CDATA[Janu 
Verma]]></dc:creator><pubDate>Sat, 11 Apr 2026 08:45:14 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!y6Lw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5089b2e5-6e59-4a0c-9f3c-385e0144f29a_3000x902.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1>Gemma 4</h1><p>Released on April 2, 2026, <a href="https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/">Gemma 4 is Google DeepMind&#8217;s most capable open model</a> family, built on the same research as Gemini 3 and distributed under the permissive Apache 2.0 license. </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wXwq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8df54e16-73e5-4c0e-b08d-d3e40edad814_200x112.bin" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!wXwq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8df54e16-73e5-4c0e-b08d-d3e40edad814_200x112.bin 424w, https://substackcdn.com/image/fetch/$s_!wXwq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8df54e16-73e5-4c0e-b08d-d3e40edad814_200x112.bin 848w, https://substackcdn.com/image/fetch/$s_!wXwq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8df54e16-73e5-4c0e-b08d-d3e40edad814_200x112.bin 1272w, https://substackcdn.com/image/fetch/$s_!wXwq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8df54e16-73e5-4c0e-b08d-d3e40edad814_200x112.bin 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!wXwq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8df54e16-73e5-4c0e-b08d-d3e40edad814_200x112.bin" width="524" height="293.44" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8df54e16-73e5-4c0e-b08d-d3e40edad814_200x112.bin&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:112,&quot;width&quot;:200,&quot;resizeWidth&quot;:524,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Gemma 4&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Gemma 4" title="Gemma 4" srcset="https://substackcdn.com/image/fetch/$s_!wXwq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8df54e16-73e5-4c0e-b08d-d3e40edad814_200x112.bin 424w, https://substackcdn.com/image/fetch/$s_!wXwq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8df54e16-73e5-4c0e-b08d-d3e40edad814_200x112.bin 848w, https://substackcdn.com/image/fetch/$s_!wXwq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8df54e16-73e5-4c0e-b08d-d3e40edad814_200x112.bin 1272w, https://substackcdn.com/image/fetch/$s_!wXwq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8df54e16-73e5-4c0e-b08d-d3e40edad814_200x112.bin 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><p>It comes in <strong>four sizes</strong>:</p><ul><li><p><strong>E2B</strong> and 
<strong>E4B</strong> are edge-optimized models with an effective 2B and 4B parameters, respectively, designed to run offline on phones, Raspberry Pi, and Jetson Nano.</p></li><li><p><strong>26B-A4B (MoE)</strong> is a mixture-of-experts model with 26B total parameters but only 3.8B active per token, so per-token compute is close to that of a 4B dense model while the model draws on 26B parameters of capacity. </p></li><li><p><strong>31B Dense</strong> is the largest and is currently ranked #3 among open models on Arena AI. </p></li></ul><p><strong>Key capabilities</strong> across the family: </p><ul><li><p>native vision and audio processing</p></li><li><p>function calling</p></li><li><p>structured JSON output</p></li><li><p>system instructions</p></li><li><p>a thinking/reasoning mode</p></li><li><p>256K context windows for the larger models (128K for the edge models)</p></li><li><p>training on 140+ languages</p></li></ul><p><strong>The benchmark leap is dramatic.</strong> Relative to Gemma 3, the 31B Dense model improves from 20.8% to 89.2% on AIME math, nearly triples the LiveCodeBench score, and rises from 19% to 74% on BigBench Extra Hard. This makes it far more competitive with Qwen 3.5, which previously dominated math and coding among open models in this size class. 
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!y6Lw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5089b2e5-6e59-4a0c-9f3c-385e0144f29a_3000x902.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!y6Lw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5089b2e5-6e59-4a0c-9f3c-385e0144f29a_3000x902.jpeg 424w, https://substackcdn.com/image/fetch/$s_!y6Lw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5089b2e5-6e59-4a0c-9f3c-385e0144f29a_3000x902.jpeg 848w, https://substackcdn.com/image/fetch/$s_!y6Lw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5089b2e5-6e59-4a0c-9f3c-385e0144f29a_3000x902.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!y6Lw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5089b2e5-6e59-4a0c-9f3c-385e0144f29a_3000x902.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!y6Lw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5089b2e5-6e59-4a0c-9f3c-385e0144f29a_3000x902.jpeg" width="1456" height="438" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5089b2e5-6e59-4a0c-9f3c-385e0144f29a_3000x902.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:438,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Gemma 4 
Table&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Gemma 4 Table" title="Gemma 4 Table" srcset="https://substackcdn.com/image/fetch/$s_!y6Lw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5089b2e5-6e59-4a0c-9f3c-385e0144f29a_3000x902.jpeg 424w, https://substackcdn.com/image/fetch/$s_!y6Lw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5089b2e5-6e59-4a0c-9f3c-385e0144f29a_3000x902.jpeg 848w, https://substackcdn.com/image/fetch/$s_!y6Lw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5089b2e5-6e59-4a0c-9f3c-385e0144f29a_3000x902.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!y6Lw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5089b2e5-6e59-4a0c-9f3c-385e0144f29a_3000x902.jpeg 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 
11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Architectural innovations</strong> include a second embedding table (Per-Layer Embeddings) that feeds residual signals into every decoder layer, and shared key-value tensors in the final layers to reduce memory use during long-context inference. </p><h1>Experiments</h1><p>To test the capabilities of Gemma 4, I ran some quick experiments. I am primarily interested in the multimodal and vision-language capabilities of the models. <a href="https://huggingface.co/blog/gemma4">The Hugging Face team ran various tests on Gemma 4 and wrote them up in a blog post</a>. I explored this series of models further, and the results are discussed below. </p><h2>Image Reasoning</h2><p>I tested the image reasoning of Gemma 4 across three experiments covering capabilities the Hugging Face blog did not benchmark: multi-image reasoning, adversarial robustness, and structured extraction.</p><h3>Multi-Image Reasoning</h3><p>Three prompts, each with two images. 
The question: does the model actually attend to both images, or does it default to one and confabulate the other?</p><ul><li><p><strong>Chart Comparison (H1 vs H2 2025)</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Tknq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20b2afad-2b2f-417b-9599-b39259276380_1036x410.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Tknq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20b2afad-2b2f-417b-9599-b39259276380_1036x410.heic 424w, https://substackcdn.com/image/fetch/$s_!Tknq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20b2afad-2b2f-417b-9599-b39259276380_1036x410.heic 848w, https://substackcdn.com/image/fetch/$s_!Tknq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20b2afad-2b2f-417b-9599-b39259276380_1036x410.heic 1272w, https://substackcdn.com/image/fetch/$s_!Tknq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20b2afad-2b2f-417b-9599-b39259276380_1036x410.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Tknq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20b2afad-2b2f-417b-9599-b39259276380_1036x410.heic" width="1036" height="410" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/20b2afad-2b2f-417b-9599-b39259276380_1036x410.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:410,&quot;width&quot;:1036,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:15476,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://januverma.substack.com/i/193330999?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20b2afad-2b2f-417b-9599-b39259276380_1036x410.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Tknq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20b2afad-2b2f-417b-9599-b39259276380_1036x410.heic 424w, https://substackcdn.com/image/fetch/$s_!Tknq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20b2afad-2b2f-417b-9599-b39259276380_1036x410.heic 848w, https://substackcdn.com/image/fetch/$s_!Tknq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20b2afad-2b2f-417b-9599-b39259276380_1036x410.heic 1272w, https://substackcdn.com/image/fetch/$s_!Tknq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20b2afad-2b2f-417b-9599-b39259276380_1036x410.heic 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Result:</strong> Correctly identified May ($61M) as the H1 peak and December ($102M) as the H2 peak. Correctly judged H2 as the better-performing half and noted that &#8220;every single month in H2 (except August) outperformed the best month of H1.&#8221; Accurately characterized H1 as volatile and H2 as trending upward.</p><p><strong>Verdict:</strong> &#9989; Both images attended to. Numbers extracted correctly from both. 
Cross-chart comparison was substantive, not generic.</p></li><li><p><strong>Product Spec Comparison</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!F0KT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F696a6cc0-9bcb-4b57-9581-57759ab543b5_920x346.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!F0KT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F696a6cc0-9bcb-4b57-9581-57759ab543b5_920x346.heic 424w, https://substackcdn.com/image/fetch/$s_!F0KT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F696a6cc0-9bcb-4b57-9581-57759ab543b5_920x346.heic 848w, https://substackcdn.com/image/fetch/$s_!F0KT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F696a6cc0-9bcb-4b57-9581-57759ab543b5_920x346.heic 1272w, https://substackcdn.com/image/fetch/$s_!F0KT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F696a6cc0-9bcb-4b57-9581-57759ab543b5_920x346.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!F0KT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F696a6cc0-9bcb-4b57-9581-57759ab543b5_920x346.heic" width="920" height="346" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/696a6cc0-9bcb-4b57-9581-57759ab543b5_920x346.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:346,&quot;width&quot;:920,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:14171,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://januverma.substack.com/i/193330999?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F696a6cc0-9bcb-4b57-9581-57759ab543b5_920x346.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!F0KT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F696a6cc0-9bcb-4b57-9581-57759ab543b5_920x346.heic 424w, https://substackcdn.com/image/fetch/$s_!F0KT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F696a6cc0-9bcb-4b57-9581-57759ab543b5_920x346.heic 848w, https://substackcdn.com/image/fetch/$s_!F0KT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F696a6cc0-9bcb-4b57-9581-57759ab543b5_920x346.heic 1272w, https://substackcdn.com/image/fetch/$s_!F0KT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F696a6cc0-9bcb-4b57-9581-57759ab543b5_920x346.heic 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Result:</strong> Produced a structured markdown comparison table covering price, price-per-GB-RAM, storage, display, battery, and weight &#8212; declaring a winner per row. Notably, it computed <code>$54.12/GB</code> (UltraBook) vs <code>$45.28/GB</code> (ThinkPad) &#8212; actual division, not vibes.</p><p><strong>Verdict:</strong> &#9989; Strong. It performed real arithmetic across two images and structured the answer in a table without being asked. 
The price-per-GB calculation is the kind of derived metric that proves the model isn&#8217;t just OCR&#8217;ing &#8212; it&#8217;s reasoning across the inputs.</p></li><li><p><strong>Scene Comparison (Venice + Bangkok)</strong></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gLuF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9fa1f7ba-7ae1-4a80-95e1-e67da5efc99d_1170x765.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gLuF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9fa1f7ba-7ae1-4a80-95e1-e67da5efc99d_1170x765.jpeg 424w, https://substackcdn.com/image/fetch/$s_!gLuF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9fa1f7ba-7ae1-4a80-95e1-e67da5efc99d_1170x765.jpeg 848w, https://substackcdn.com/image/fetch/$s_!gLuF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9fa1f7ba-7ae1-4a80-95e1-e67da5efc99d_1170x765.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!gLuF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9fa1f7ba-7ae1-4a80-95e1-e67da5efc99d_1170x765.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gLuF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9fa1f7ba-7ae1-4a80-95e1-e67da5efc99d_1170x765.jpeg" width="329" height="215.1153846153846" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9fa1f7ba-7ae1-4a80-95e1-e67da5efc99d_1170x765.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:765,&quot;width&quot;:1170,&quot;resizeWidth&quot;:329,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!gLuF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9fa1f7ba-7ae1-4a80-95e1-e67da5efc99d_1170x765.jpeg 424w, https://substackcdn.com/image/fetch/$s_!gLuF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9fa1f7ba-7ae1-4a80-95e1-e67da5efc99d_1170x765.jpeg 848w, https://substackcdn.com/image/fetch/$s_!gLuF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9fa1f7ba-7ae1-4a80-95e1-e67da5efc99d_1170x765.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!gLuF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9fa1f7ba-7ae1-4a80-95e1-e67da5efc99d_1170x765.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!SUNV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedb58a4d-7a2a-4cc1-9382-343755db06e1_1850x1232.png" data-component-name="Image2ToDOM"><div 
class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!SUNV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedb58a4d-7a2a-4cc1-9382-343755db06e1_1850x1232.png 424w, https://substackcdn.com/image/fetch/$s_!SUNV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedb58a4d-7a2a-4cc1-9382-343755db06e1_1850x1232.png 848w, https://substackcdn.com/image/fetch/$s_!SUNV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedb58a4d-7a2a-4cc1-9382-343755db06e1_1850x1232.png 1272w, https://substackcdn.com/image/fetch/$s_!SUNV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedb58a4d-7a2a-4cc1-9382-343755db06e1_1850x1232.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!SUNV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedb58a4d-7a2a-4cc1-9382-343755db06e1_1850x1232.png" width="332" height="221.1813186813187" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/edb58a4d-7a2a-4cc1-9382-343755db06e1_1850x1232.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:970,&quot;width&quot;:1456,&quot;resizeWidth&quot;:332,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!SUNV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedb58a4d-7a2a-4cc1-9382-343755db06e1_1850x1232.png 424w, https://substackcdn.com/image/fetch/$s_!SUNV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedb58a4d-7a2a-4cc1-9382-343755db06e1_1850x1232.png 848w, https://substackcdn.com/image/fetch/$s_!SUNV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedb58a4d-7a2a-4cc1-9382-343755db06e1_1850x1232.png 1272w, https://substackcdn.com/image/fetch/$s_!SUNV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedb58a4d-7a2a-4cc1-9382-343755db06e1_1850x1232.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><strong>Result:</strong> Identified Venice from the seagull on a mooring pole and the &#8220;Ferrovia&#8221; sign (Santa Lucia train station). Identified Bangkok from the temple architecture (Grand Palace / Wat Phra Kaew). Both identifications used specific visual cues, not generic guesses.</p><p><strong>Verdict:</strong> &#9989; The &#8220;Ferrovia&#8221; callout is impressive: the model noticed text in the background and used it as evidence. This is grounded reasoning, not pattern-matching.</p></li></ul><p><strong>Summary: </strong>In these three tests, the model did not ignore the second image, did not confuse the two, and performed calculations and comparisons that span both. This is a meaningful capability that the Hugging Face blog did not demonstrate.</p><h3>Adversarial Vision</h3><p>Five prompts designed to fool the model. 
The question: does it confidently get things wrong, or does it flag the problem?</p><ul><li><p><strong>Rotated Text (180&#176;)</strong></p><p><strong>Setup:</strong> A receipt image rotated upside-down.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Nn29!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a717cb5-e21f-45b7-9749-ded3d4436592_438x356.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Nn29!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a717cb5-e21f-45b7-9749-ded3d4436592_438x356.heic 424w, https://substackcdn.com/image/fetch/$s_!Nn29!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a717cb5-e21f-45b7-9749-ded3d4436592_438x356.heic 848w, https://substackcdn.com/image/fetch/$s_!Nn29!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a717cb5-e21f-45b7-9749-ded3d4436592_438x356.heic 1272w, https://substackcdn.com/image/fetch/$s_!Nn29!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a717cb5-e21f-45b7-9749-ded3d4436592_438x356.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Nn29!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a717cb5-e21f-45b7-9749-ded3d4436592_438x356.heic" width="350" height="284.4748858447489" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9a717cb5-e21f-45b7-9749-ded3d4436592_438x356.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:356,&quot;width&quot;:438,&quot;resizeWidth&quot;:350,&quot;bytes&quot;:6493,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://januverma.substack.com/i/193330999?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a717cb5-e21f-45b7-9749-ded3d4436592_438x356.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Nn29!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a717cb5-e21f-45b7-9749-ded3d4436592_438x356.heic 424w, https://substackcdn.com/image/fetch/$s_!Nn29!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a717cb5-e21f-45b7-9749-ded3d4436592_438x356.heic 848w, https://substackcdn.com/image/fetch/$s_!Nn29!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a717cb5-e21f-45b7-9749-ded3d4436592_438x356.heic 1272w, https://substackcdn.com/image/fetch/$s_!Nn29!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a717cb5-e21f-45b7-9749-ded3d4436592_438x356.heic 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><p><strong>Result:</strong> Returned valid JSON with both items, both prices, and the correct total <code>$145.26</code>. Did not mention or notice the rotation.</p><p><strong>Verdict:</strong> &#9989; Read the rotated text perfectly. Either the vision encoder is rotation-invariant or it normalized the image internally. 
Either way, the user gets the correct answer with zero friction.</p></li><li><p><strong>Misleading Chart (Truncated Y-axis)</strong></p><p><strong>Setup:</strong> Bar chart with values 97, 98, 99, 100 but Y-axis starting at 95, visually exaggerating tiny differences.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5gxc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b2ad091-d870-412e-97e0-4c3213a99259_502x410.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5gxc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b2ad091-d870-412e-97e0-4c3213a99259_502x410.heic 424w, https://substackcdn.com/image/fetch/$s_!5gxc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b2ad091-d870-412e-97e0-4c3213a99259_502x410.heic 848w, https://substackcdn.com/image/fetch/$s_!5gxc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b2ad091-d870-412e-97e0-4c3213a99259_502x410.heic 1272w, https://substackcdn.com/image/fetch/$s_!5gxc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b2ad091-d870-412e-97e0-4c3213a99259_502x410.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5gxc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b2ad091-d870-412e-97e0-4c3213a99259_502x410.heic" width="502" height="410" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4b2ad091-d870-412e-97e0-4c3213a99259_502x410.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:410,&quot;width&quot;:502,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:9830,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://januverma.substack.com/i/193330999?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b2ad091-d870-412e-97e0-4c3213a99259_502x410.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!5gxc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b2ad091-d870-412e-97e0-4c3213a99259_502x410.heic 424w, https://substackcdn.com/image/fetch/$s_!5gxc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b2ad091-d870-412e-97e0-4c3213a99259_502x410.heic 848w, https://substackcdn.com/image/fetch/$s_!5gxc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b2ad091-d870-412e-97e0-4c3213a99259_502x410.heic 1272w, https://substackcdn.com/image/fetch/$s_!5gxc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b2ad091-d870-412e-97e0-4c3213a99259_502x410.heic 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><p><strong>Result:</strong> Read all four values correctly. Computed the actual difference as 3 points. <strong>And then unprompted, called out: &#8220;Yes. The chart uses a truncated y-axis (it starts at 95 instead of 0)...&#8221;</strong></p><p><strong>Verdict:</strong> &#9989;&#9989; This is the standout result. The model didn&#8217;t just answer the literal question &#8212; it flagged the visual deception, which is exactly what a human analyst would do. 
This is data literacy, not just OCR.</p></li><li><p><strong>Low-Resolution Degraded Image</strong></p><p><strong>Setup:</strong> The bird photo downsampled to 48&#215;48 then upscaled &#8212; pixelated and barely recognizable.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0THC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88b37ef7-9184-424f-8394-afa45b6c988a_506x386.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0THC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88b37ef7-9184-424f-8394-afa45b6c988a_506x386.heic 424w, https://substackcdn.com/image/fetch/$s_!0THC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88b37ef7-9184-424f-8394-afa45b6c988a_506x386.heic 848w, https://substackcdn.com/image/fetch/$s_!0THC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88b37ef7-9184-424f-8394-afa45b6c988a_506x386.heic 1272w, https://substackcdn.com/image/fetch/$s_!0THC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88b37ef7-9184-424f-8394-afa45b6c988a_506x386.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0THC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88b37ef7-9184-424f-8394-afa45b6c988a_506x386.heic" width="506" height="386" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/88b37ef7-9184-424f-8394-afa45b6c988a_506x386.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:386,&quot;width&quot;:506,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:10488,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://januverma.substack.com/i/193330999?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88b37ef7-9184-424f-8394-afa45b6c988a_506x386.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!0THC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88b37ef7-9184-424f-8394-afa45b6c988a_506x386.heic 424w, https://substackcdn.com/image/fetch/$s_!0THC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88b37ef7-9184-424f-8394-afa45b6c988a_506x386.heic 848w, https://substackcdn.com/image/fetch/$s_!0THC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88b37ef7-9184-424f-8394-afa45b6c988a_506x386.heic 1272w, https://substackcdn.com/image/fetch/$s_!0THC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F88b37ef7-9184-424f-8394-afa45b6c988a_506x386.heic 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><p><strong>Result:</strong> Opened with: &#8220;The image provided is extremely low-resolution and heavily pixelated, making it very difficult to identify specific details with certainty.&#8221; Then made a hedged guess: &#8220;likely a bird, perched on top of a grey post or pillar.&#8221;</p><p><strong>Verdict:</strong> &#9989; Calibrated uncertainty. Many vision models confidently hallucinate detail on degraded inputs. 
Gemma 4 explicitly acknowledged the limitation before guessing and the guess was correct.</p></li><li><p><strong>Stroop Test (Color vs Text)</strong></p><p><strong>Setup:</strong> Red square with the word &#8220;BLUE&#8221; written on it.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!b_RV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F972aa67a-ef92-45ac-aaaa-4591a1641a83.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!b_RV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F972aa67a-ef92-45ac-aaaa-4591a1641a83.heic 424w, https://substackcdn.com/image/fetch/$s_!b_RV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F972aa67a-ef92-45ac-aaaa-4591a1641a83.heic 848w, https://substackcdn.com/image/fetch/$s_!b_RV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F972aa67a-ef92-45ac-aaaa-4591a1641a83.heic 1272w, https://substackcdn.com/image/fetch/$s_!b_RV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F972aa67a-ef92-45ac-aaaa-4591a1641a83.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!b_RV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F972aa67a-ef92-45ac-aaaa-4591a1641a83.heic" width="494" height="368" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/972aa67a-ef92-45ac-aaaa-4591a1641a83.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:368,&quot;width&quot;:494,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2566,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://januverma.substack.com/i/193330999?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F972aa67a-ef92-45ac-aaaa-4591a1641a83.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!b_RV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F972aa67a-ef92-45ac-aaaa-4591a1641a83.heic 424w, https://substackcdn.com/image/fetch/$s_!b_RV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F972aa67a-ef92-45ac-aaaa-4591a1641a83.heic 848w, https://substackcdn.com/image/fetch/$s_!b_RV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F972aa67a-ef92-45ac-aaaa-4591a1641a83.heic 1272w, https://substackcdn.com/image/fetch/$s_!b_RV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F972aa67a-ef92-45ac-aaaa-4591a1641a83.heic 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><p><strong>Result:</strong> &#8220;The background of the image is red. The text written on it is &#8216;BLUE&#8217;. No, the text and the color are not consistent.&#8221;</p><p><strong>Verdict:</strong> &#9989; Three-for-three: identified the actual pixel color, read the contradicting text, and noted the inconsistency. 
No confusion between symbol and substance.</p></li><li><p><strong>Non-Latin OCR (Japanese)</strong></p><p><strong>Setup:</strong> An image with two lines of Japanese text:</p><ul><li><p><code>&#26481;&#20140;&#12479;&#12527;&#12540;&#12398;&#39640;&#12373;&#12399;333m&#12391;&#12377;</code> (Tokyo Tower&#8217;s height is 333m)</p></li><li><p><code>&#24314;&#35373;&#24180;: 1958&#24180;12&#26376;23&#26085;</code> (Construction year: December 23, 1958)</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!P6C5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3714d9a9-8e00-4063-b580-70c834201751_494x208.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!P6C5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3714d9a9-8e00-4063-b580-70c834201751_494x208.heic 424w, https://substackcdn.com/image/fetch/$s_!P6C5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3714d9a9-8e00-4063-b580-70c834201751_494x208.heic 848w, https://substackcdn.com/image/fetch/$s_!P6C5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3714d9a9-8e00-4063-b580-70c834201751_494x208.heic 1272w, https://substackcdn.com/image/fetch/$s_!P6C5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3714d9a9-8e00-4063-b580-70c834201751_494x208.heic 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!P6C5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3714d9a9-8e00-4063-b580-70c834201751_494x208.heic" width="494" height="208" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3714d9a9-8e00-4063-b580-70c834201751_494x208.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:208,&quot;width&quot;:494,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:3588,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://januverma.substack.com/i/193330999?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3714d9a9-8e00-4063-b580-70c834201751_494x208.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!P6C5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3714d9a9-8e00-4063-b580-70c834201751_494x208.heic 424w, https://substackcdn.com/image/fetch/$s_!P6C5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3714d9a9-8e00-4063-b580-70c834201751_494x208.heic 848w, https://substackcdn.com/image/fetch/$s_!P6C5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3714d9a9-8e00-4063-b580-70c834201751_494x208.heic 1272w, https://substackcdn.com/image/fetch/$s_!P6C5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3714d9a9-8e00-4063-b580-70c834201751_494x208.heic 1456w" 
sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><strong>Result:</strong></p><blockquote><p>Transcription: 333m 1958 12 23</p><p>Translation: 333m December 23, 1958</p></blockquote><p><strong>Verdict:</strong> &#10060; The only clear failure. The model extracted only the numbers (333m, 1958/12/23) and dropped all the Japanese characters. It got the dates and measurements right, but lost the semantic context &#8212; it never tells you this is about Tokyo Tower or that 1958/12/23 is a &#8220;construction year.&#8221; The Japanese script was effectively ignored.</p></li></ul><h3>Head-to-Head Capability Tests</h3><p>Six prompts covering captioning, object detection, OCR, spatial reasoning, and chart extraction.</p><ul><li><p><strong>Detailed Captioning (Bird Photo)</strong></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!SUNV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedb58a4d-7a2a-4cc1-9382-343755db06e1_1850x1232.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!SUNV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedb58a4d-7a2a-4cc1-9382-343755db06e1_1850x1232.png 424w, https://substackcdn.com/image/fetch/$s_!SUNV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedb58a4d-7a2a-4cc1-9382-343755db06e1_1850x1232.png 848w, https://substackcdn.com/image/fetch/$s_!SUNV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedb58a4d-7a2a-4cc1-9382-343755db06e1_1850x1232.png 1272w, 
https://substackcdn.com/image/fetch/$s_!SUNV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedb58a4d-7a2a-4cc1-9382-343755db06e1_1850x1232.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!SUNV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedb58a4d-7a2a-4cc1-9382-343755db06e1_1850x1232.png" width="332" height="221.1813186813187" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/edb58a4d-7a2a-4cc1-9382-343755db06e1_1850x1232.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:970,&quot;width&quot;:1456,&quot;resizeWidth&quot;:332,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!SUNV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedb58a4d-7a2a-4cc1-9382-343755db06e1_1850x1232.png 424w, https://substackcdn.com/image/fetch/$s_!SUNV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedb58a4d-7a2a-4cc1-9382-343755db06e1_1850x1232.png 848w, https://substackcdn.com/image/fetch/$s_!SUNV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedb58a4d-7a2a-4cc1-9382-343755db06e1_1850x1232.png 1272w, 
https://substackcdn.com/image/fetch/$s_!SUNV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedb58a4d-7a2a-4cc1-9382-343755db06e1_1850x1232.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><strong>Result:</strong> Produced a single rich sentence: <em>&#8220;A single seagull with grey and white plumage stands perched atop a weathered grey wooden mooring pole in the foreground of a bright, daylight scene...&#8221;</em> The caption covered subject, environment, lighting, and architectural background details.</p><p><strong>Verdict:</strong> &#9989; Reads like professional photo caption copy. No hallucinated objects.</p></li><li><p><strong>Object Detection Bounding Box (Bike)</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RvO8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54cedf63-bba1-4d84-928d-d55563dd8399_874x1414.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RvO8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54cedf63-bba1-4d84-928d-d55563dd8399_874x1414.png 424w, https://substackcdn.com/image/fetch/$s_!RvO8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54cedf63-bba1-4d84-928d-d55563dd8399_874x1414.png 848w, https://substackcdn.com/image/fetch/$s_!RvO8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54cedf63-bba1-4d84-928d-d55563dd8399_874x1414.png 1272w, 
https://substackcdn.com/image/fetch/$s_!RvO8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54cedf63-bba1-4d84-928d-d55563dd8399_874x1414.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RvO8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54cedf63-bba1-4d84-928d-d55563dd8399_874x1414.png" width="250" height="404.4622425629291" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/54cedf63-bba1-4d84-928d-d55563dd8399_874x1414.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1414,&quot;width&quot;:874,&quot;resizeWidth&quot;:250,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!RvO8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54cedf63-bba1-4d84-928d-d55563dd8399_874x1414.png 424w, https://substackcdn.com/image/fetch/$s_!RvO8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54cedf63-bba1-4d84-928d-d55563dd8399_874x1414.png 848w, https://substackcdn.com/image/fetch/$s_!RvO8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54cedf63-bba1-4d84-928d-d55563dd8399_874x1414.png 1272w, 
https://substackcdn.com/image/fetch/$s_!RvO8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54cedf63-bba1-4d84-928d-d55563dd8399_874x1414.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong>Result:</strong> <code>[{"box_2d": [542, 0, 954, 893], "label": "the bike"}]</code></p><p><strong>Verdict:</strong> &#9989; Valid JSON, single object, coordinates in the documented 1000&#215;1000 normalized space. 
Coordinates need visual verification but format is correct.</p></li><li><p><strong>GUI Element Detection (View Recipe Button)</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lj7J!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0559c0ee-b74b-467d-a404-d798e4b1fcdc_1248x672.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lj7J!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0559c0ee-b74b-467d-a404-d798e4b1fcdc_1248x672.png 424w, https://substackcdn.com/image/fetch/$s_!lj7J!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0559c0ee-b74b-467d-a404-d798e4b1fcdc_1248x672.png 848w, https://substackcdn.com/image/fetch/$s_!lj7J!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0559c0ee-b74b-467d-a404-d798e4b1fcdc_1248x672.png 1272w, https://substackcdn.com/image/fetch/$s_!lj7J!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0559c0ee-b74b-467d-a404-d798e4b1fcdc_1248x672.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lj7J!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0559c0ee-b74b-467d-a404-d798e4b1fcdc_1248x672.png" width="552" height="297.2307692307692" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0559c0ee-b74b-467d-a404-d798e4b1fcdc_1248x672.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:672,&quot;width&quot;:1248,&quot;resizeWidth&quot;:552,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!lj7J!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0559c0ee-b74b-467d-a404-d798e4b1fcdc_1248x672.png 424w, https://substackcdn.com/image/fetch/$s_!lj7J!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0559c0ee-b74b-467d-a404-d798e4b1fcdc_1248x672.png 848w, https://substackcdn.com/image/fetch/$s_!lj7J!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0559c0ee-b74b-467d-a404-d798e4b1fcdc_1248x672.png 1272w, https://substackcdn.com/image/fetch/$s_!lj7J!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0559c0ee-b74b-467d-a404-d798e4b1fcdc_1248x672.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong>Result:</strong> <code>[{"box_2d": [171, 104, 246, 308], "label": "view recipe"}]</code></p><p><strong>Verdict:</strong> &#9989; Compare to the HF blog&#8217;s reference for the same image: <code>[171, 75, 245, 308]</code>. The two boxes agree to within ~30 units of the 1000&#215;1000 normalized space (only the second coordinate differs noticeably, 104 vs. 75), which is a close match. 
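</p><p>To put a number on how closely the two boxes agree, one option is the largest per-coordinate gap together with intersection-over-union (IoU). This is a minimal sketch using the two boxes quoted above. The function names are illustrative and not taken from the experiment code, and the result is the same whether the boxes are y-first or x-first, as long as both use the same order.</p>

```python
def max_coord_delta(box_a, box_b):
    """Largest absolute difference between matching coordinates."""
    return max(abs(a - b) for a, b in zip(box_a, box_b))


def iou(box_a, box_b):
    """Intersection over union for two [min0, min1, max0, max1] boxes."""
    a_lo0, a_lo1, a_hi0, a_hi1 = box_a
    b_lo0, b_lo1, b_hi0, b_hi1 = box_b
    # Overlap along each axis, clamped at zero when the boxes are disjoint.
    overlap0 = max(0, min(a_hi0, b_hi0) - max(a_lo0, b_lo0))
    overlap1 = max(0, min(a_hi1, b_hi1) - max(a_lo1, b_lo1))
    inter = overlap0 * overlap1
    area_a = (a_hi0 - a_lo0) * (a_hi1 - a_lo1)
    area_b = (b_hi0 - b_lo0) * (b_hi1 - b_lo1)
    union = area_a + area_b - inter
    return inter / union if union else 0.0


mine = [171, 104, 246, 308]
reference = [171, 75, 245, 308]
print(max_coord_delta(mine, reference))  # 29
print(round(iou(mine, reference), 3))    # 0.865
```

<p>Both numbers are in normalized units, so they are independent of the image resolution.</p><p>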
This demonstrates reproducibility against Google&#8217;s reference outputs.</p></li><li><p><strong>OCR + Structured Extraction (Receipt)</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZF0K!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98c5d392-e1d5-462d-ac97-1ae23258abac_741x490.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZF0K!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98c5d392-e1d5-462d-ac97-1ae23258abac_741x490.png 424w, https://substackcdn.com/image/fetch/$s_!ZF0K!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98c5d392-e1d5-462d-ac97-1ae23258abac_741x490.png 848w, https://substackcdn.com/image/fetch/$s_!ZF0K!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98c5d392-e1d5-462d-ac97-1ae23258abac_741x490.png 1272w, https://substackcdn.com/image/fetch/$s_!ZF0K!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98c5d392-e1d5-462d-ac97-1ae23258abac_741x490.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZF0K!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98c5d392-e1d5-462d-ac97-1ae23258abac_741x490.png" width="528" height="349.1497975708502" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/98c5d392-e1d5-462d-ac97-1ae23258abac_741x490.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:490,&quot;width&quot;:741,&quot;resizeWidth&quot;:528,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ZF0K!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98c5d392-e1d5-462d-ac97-1ae23258abac_741x490.png 424w, https://substackcdn.com/image/fetch/$s_!ZF0K!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98c5d392-e1d5-462d-ac97-1ae23258abac_741x490.png 848w, https://substackcdn.com/image/fetch/$s_!ZF0K!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98c5d392-e1d5-462d-ac97-1ae23258abac_741x490.png 1272w, https://substackcdn.com/image/fetch/$s_!ZF0K!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98c5d392-e1d5-462d-ac97-1ae23258abac_741x490.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong>Result:</strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;json&quot;,&quot;nodeId&quot;:&quot;9e16b3a3-787c-4e4b-902f-210fcf217b2a&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-json">{
  "items": [
    {"name": "Widget Pro", "price": 45.00},
    {"name": "Gadget XL", "price": 89.50}
  ],
  "tax": 10.76,
  "total": 145.26
}</code></pre></div><p><strong>Verdict:</strong> &#9989;&#9989; Perfect extraction. Numeric values are typed as numbers (not strings), spacing in product names was correctly inferred (<code>Widget Pro</code> not <code>WidgetPro</code> &#8212; even though the rotated-text version in Experiment 2 had <code>WidgetPro</code> without the space; the upright version got it right). All four numeric fields correct.</p></li><li><p><strong>Spatial Reasoning</strong></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!SUNV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedb58a4d-7a2a-4cc1-9382-343755db06e1_1850x1232.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!SUNV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedb58a4d-7a2a-4cc1-9382-343755db06e1_1850x1232.png 424w, https://substackcdn.com/image/fetch/$s_!SUNV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedb58a4d-7a2a-4cc1-9382-343755db06e1_1850x1232.png 848w, https://substackcdn.com/image/fetch/$s_!SUNV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedb58a4d-7a2a-4cc1-9382-343755db06e1_1850x1232.png 1272w, https://substackcdn.com/image/fetch/$s_!SUNV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedb58a4d-7a2a-4cc1-9382-343755db06e1_1850x1232.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!SUNV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedb58a4d-7a2a-4cc1-9382-343755db06e1_1850x1232.png" width="332" height="221.1813186813187" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/edb58a4d-7a2a-4cc1-9382-343755db06e1_1850x1232.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:970,&quot;width&quot;:1456,&quot;resizeWidth&quot;:332,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!SUNV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedb58a4d-7a2a-4cc1-9382-343755db06e1_1850x1232.png 424w, https://substackcdn.com/image/fetch/$s_!SUNV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedb58a4d-7a2a-4cc1-9382-343755db06e1_1850x1232.png 848w, https://substackcdn.com/image/fetch/$s_!SUNV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedb58a4d-7a2a-4cc1-9382-343755db06e1_1850x1232.png 1272w, https://substackcdn.com/image/fetch/$s_!SUNV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fedb58a4d-7a2a-4cc1-9382-343755db06e1_1850x1232.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><strong>Prompt:</strong> What is to the left of the bird? Behind it? 
Below it?</p><p><strong>Result:</strong> Correctly placed &#8220;white building with windows and trees&#8221; to the left, &#8220;clear, light-colored sky&#8221; behind, and &#8220;weathered grey wooden piling&#8221; below.</p><p><strong>Verdict:</strong> &#9989; Concise, anatomically structured answer. Real spatial grounding rather than generic scene description.</p></li><li><p><strong>Chart Data Extraction</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!SEcJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26352961-fc56-440b-b322-878edab65464_516x378.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!SEcJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26352961-fc56-440b-b322-878edab65464_516x378.heic 424w, https://substackcdn.com/image/fetch/$s_!SEcJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26352961-fc56-440b-b322-878edab65464_516x378.heic 848w, https://substackcdn.com/image/fetch/$s_!SEcJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26352961-fc56-440b-b322-878edab65464_516x378.heic 1272w, https://substackcdn.com/image/fetch/$s_!SEcJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26352961-fc56-440b-b322-878edab65464_516x378.heic 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!SEcJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26352961-fc56-440b-b322-878edab65464_516x378.heic" width="388" height="284.2325581395349" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/26352961-fc56-440b-b322-878edab65464_516x378.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:378,&quot;width&quot;:516,&quot;resizeWidth&quot;:388,&quot;bytes&quot;:7222,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://januverma.substack.com/i/193330999?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26352961-fc56-440b-b322-878edab65464_516x378.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!SEcJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26352961-fc56-440b-b322-878edab65464_516x378.heic 424w, https://substackcdn.com/image/fetch/$s_!SEcJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26352961-fc56-440b-b322-878edab65464_516x378.heic 848w, https://substackcdn.com/image/fetch/$s_!SEcJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26352961-fc56-440b-b322-878edab65464_516x378.heic 1272w, https://substackcdn.com/image/fetch/$s_!SEcJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26352961-fc56-440b-b322-878edab65464_516x378.heic 
1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong>Result: </strong></p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;json&quot;,&quot;nodeId&quot;:&quot;3a2ba956-d1b3-4afb-997d-e28207ce5f9f&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-json">{"Jul": 52, "Aug": 47, "Sep": 68, "Oct": 75, "Nov": 89, "Dec": 102}</code></pre></div><p><strong>Verdict:</strong> &#9989; Six values, all correct, valid JSON, ready to use as a data source.</p></li></ul><h2>Video Analysis</h2><p>To explore the video understanding capabilities, I used a personal video 
which is posted as <strong><a href="http://instagram.com/jverma/reel/DW6jIXBArqe">a reel on my Instagram</a></strong>. This is a picnic-and-reading-in-a-park video, trimmed to 60s. </p><ul><li><p><strong>Basic Description</strong></p><p>The 31B gave a remarkably accurate scene-by-scene description:</p><blockquote><p>&#8220;A man and a woman are enjoying a sunny day outdoors in a grassy park area with trees and a distant body of water... The video begins with a stylized sequence of the two lying on their backs in the grass, holding open books above their faces... The man is in a grey t-shirt, and the woman is in a blue patterned shirt... The camera cuts to a close-up of a hand holding a black pour-over coffee dripper over a glass, filtering coffee.&#8221;</p></blockquote><p><strong>Verdict:</strong> &#9989; Highly accurate. It picked up the opening symmetry shot, the wardrobe colors, the pour-over coffee detail, and the narrative arc (reading &#8594; conversation). No hallucinations.</p></li><li><p><strong>Structured Categorization</strong></p><p>Returned a clean, parseable JSON object:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;json&quot;,&quot;nodeId&quot;:&quot;0a11e72d-e3e8-47f7-81bb-f25671926936&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-json">{
  "content_category": "lifestyle",
  "hook_description": "A symmetrical, aesthetic shot of two people lying on grass holding books up to the sky.",
  "hook_rating": 4,
  "text_overlays": [],
  "has_face": true,
  "has_call_to_action": false,
  "estimated_duration_seconds": 60,
  "visual_quality": 4,
  "pacing": "medium",
  "target_audience": "Book lovers, people interested in 'slow living' or aesthetic outdoor dates"
}</code></pre></div><p><strong>Verdict:</strong> &#9989;&#9989; Production-ready. Every field is filled correctly, the audience targeting is specific (not "everyone"), and the JSON parsed on the first try. </p></li><li><p><strong>Temporal Understanding</strong></p><p>The model produced a precise 5-segment timeline. It also correctly identified the editing techniques (hard cuts, jump cuts between 00:03 and 00:15, a classic B-roll cutaway to coffee at 00:35) and called the pacing &#8220;slow and rhythmic, matching the relaxed vibe of a picnic.&#8221;</p><p><strong>Verdict:</strong> &#9989;&#9989; This is the most impressive output of the whole experiment. The model didn&#8217;t just describe frames; it understood video grammar: cutaways, jump cuts, narrative arc, pacing changes. For an Instagram pipeline, this temporal segmentation is exactly what you need to detect things like &#8220;the hook is too long&#8221; or &#8220;the CTA is missing.&#8221;</p></li><li><p><strong>Video + Audio Description</strong></p><p>The E4B model (8B params with embeddings) is the only Gemma 4 variant that processes audio from video. I ran the same video through it to see what the audio track adds. </p><p><strong>The good:</strong> It correctly identified that there&#8217;s spoken dialogue (no music), correctly inferred the speakers were &#8220;talking about their decision to dedicate a time to reading after falling behind,&#8221; and noted &#8220;a beautiful spring, sunny day.&#8221;</p><p><strong>The bad:</strong> It transcribed the location as <strong>&#8220;Hamsterhiif&#8221;</strong>, a mangled version of the real place name, Hampstead Heath. (Across the three experiments it produced &#8220;Hamsterhiif&#8221; once and &#8220;Hamsteriheb&#8221; twice; the model couldn&#8217;t even agree with itself across runs.)</p><p><strong>The very bad:</strong> It hallucinated visual content that isn&#8217;t in the video. 
From its description:</p><blockquote><p>&#8220;They are observed helping to set up or work with a large, tent-like structure, possibly related to a field activity or community project. They appear to be engaged in some sort of manual labor.&#8221;</p></blockquote><p>This appears to be a misinterpretation of the blanket-spreading sequence that the 31B correctly identified in segment 00:16&#8211;00:23. The E4B turned &#8220;spreading a picnic blanket&#8221; into &#8220;manual labor on a tent structure.&#8221; The 2b ablation reinforces this: the audio-OFF run later described &#8220;covering a large, canvas-like structure with a printed design of stylized trees,&#8221; which is pure confabulation.</p><p><strong>Verdict:</strong> &#9888;&#65039; Audio detection works (speech vs. music classification is correct). But specific transcription is unreliable, and the visual descriptions are noticeably worse than the 31B&#8217;s. The location name fabrication is the kind of error that would silently corrupt downstream analytics.</p></li></ul><h1>Parting Words</h1><p>This looks quite promising. I have a whole range of experiments running on the image and video understanding capabilities of these models, and I&#8217;ll share more details on a project I am working on related to video reasoning in large (visual) language models. 
</p><p>The code used for this experiment can be found at <strong><a href="https://github.com/januverma/gemma4-experiments">GitHub: januverma/gemma4-experiments</a></strong></p>]]></content:encoded></item><item><title><![CDATA[Research Briefings - TurboQuant]]></title><description><![CDATA[TurboQuant is Google&#8217;s new vector quantization method for compressing high-dimensional vectors.]]></description><link>https://januverma.substack.com/p/research-briefings-turboquant</link><guid isPermaLink="false">https://januverma.substack.com/p/research-briefings-turboquant</guid><dc:creator><![CDATA[Janu Verma]]></dc:creator><pubDate>Tue, 07 Apr 2026 08:18:25 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!pRjt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6328a421-b52f-4855-8881-8724d991e923_1440x720.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a href="https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/">TurboQuant is Google&#8217;s new </a><strong><a href="https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/">vector quantization</a></strong> method for compressing high-dimensional vectors. The key claim is that it achieves very aggressive compression with little or no quality loss, while also avoiding the hidden metadata overhead that hurts many older quantization schemes. This can be used in two places: <strong>LLM KV caches</strong> and <strong>vector search indices</strong>. </p><p>Large language models rely heavily on a <strong>key-value (KV) cache</strong>, a memory store that keeps previously computed attention keys and values so the model doesn&#8217;t recompute them at every step. As context windows grow longer, this cache becomes a major memory bottleneck. In vector search, you may similarly need to store billions of embedding vectors. 
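</p><p>To get a feel for the scale on the KV cache side, here is a back-of-the-envelope sketch. The model shape is hypothetical and chosen only for illustration, the function name <code>kv_cache_bytes</code> is made up for this example, and the 3-bit figure ignores any per-block metadata.</p>

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bits_per_value):
    """Bytes needed to cache keys and values for one sequence."""
    values = 2 * layers * kv_heads * head_dim * seq_len  # 2 = keys + values
    return values * bits_per_value / 8


# Hypothetical model shape, not taken from any specific model card.
shape = dict(layers=32, kv_heads=8, head_dim=128, seq_len=128_000)

fp16 = kv_cache_bytes(**shape, bits_per_value=16)
q3 = kv_cache_bytes(**shape, bits_per_value=3)
print(f"16-bit: {fp16 / 2**30:.1f} GiB, 3-bit: {q3 / 2**30:.1f} GiB")
```

<p>With these assumed numbers a single long sequence needs roughly 16 GiB of cache at 16 bits per value and about 3 GiB at 3 bits per value, which is why low-bit quantization is attractive here.</p><p>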
In both cases, memory and memory bandwidth become bottlenecks. Google frames TurboQuant as a way to reduce these bottlenecks by compressing vectors while still preserving the inner products and distances that attention and retrieval depend on. </p><p><strong>Vector quantization</strong> compresses high-dimensional vectors by mapping continuous values to a smaller set of discrete symbols. I have written about vector quantization in detail in an <strong><a href="https://januverma.substack.com/i/169911549/vector-quantization">earlier post.</a></strong> But traditional methods have a hidden cost: they need to store extra &#8220;quantization constants&#8221; (like scale and zero-point) for every small block of data. Google explicitly points out that this overhead can cost <strong>1&#8211;2 extra bits per number</strong>, partially undoing the compression, which matters a lot when you are trying to operate in the 2&#8211;4 bit regime.  </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pRjt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6328a421-b52f-4855-8881-8724d991e923_1440x720.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pRjt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6328a421-b52f-4855-8881-8724d991e923_1440x720.png 424w, https://substackcdn.com/image/fetch/$s_!pRjt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6328a421-b52f-4855-8881-8724d991e923_1440x720.png 848w, 
https://substackcdn.com/image/fetch/$s_!pRjt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6328a421-b52f-4855-8881-8724d991e923_1440x720.png 1272w, https://substackcdn.com/image/fetch/$s_!pRjt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6328a421-b52f-4855-8881-8724d991e923_1440x720.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pRjt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6328a421-b52f-4855-8881-8724d991e923_1440x720.png" width="1440" height="720" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6328a421-b52f-4855-8881-8724d991e923_1440x720.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:720,&quot;width&quot;:1440,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:97158,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://januverma.substack.com/i/192538706?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6328a421-b52f-4855-8881-8724d991e923_1440x720.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!pRjt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6328a421-b52f-4855-8881-8724d991e923_1440x720.png 424w, 
https://substackcdn.com/image/fetch/$s_!pRjt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6328a421-b52f-4855-8881-8724d991e923_1440x720.png 848w, https://substackcdn.com/image/fetch/$s_!pRjt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6328a421-b52f-4855-8881-8724d991e923_1440x720.png 1272w, https://substackcdn.com/image/fetch/$s_!pRjt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6328a421-b52f-4855-8881-8724d991e923_1440x720.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>TurboQuant solves this by eliminating those extra constants entirely. It achieves this through a clever two-stage pipeline that combines two sub-algorithms: <strong>PolarQuant</strong> and <strong>QJL</strong>. The idea is:</p><blockquote><p>First, use most of the bit budget for a very strong <strong>geometry-aware lossy compression</strong> of the vector itself. Then use the <strong>last 1 bit</strong> to encode the leftover error in a way that makes inner products unbiased again. </p></blockquote><h2>TurboQuant Pipeline</h2><p>The TurboQuant pipeline is as follows:</p><ul><li><p><strong>Step 1</strong> randomly rotates the vector using a fast Hadamard transform to make all coordinates roughly equally distributed. </p></li><li><p><strong>Step 2</strong> (PolarQuant) uses most of the bit budget to quantize this rotated vector with near-optimal quality and zero overhead. 
</p></li><li><p><strong>Step 3</strong> (QJL) spends just 1 extra bit to fix the small bias left over, ensuring accurate inner product estimation for attention scores.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Zaqt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32c9a431-114b-4241-a86a-8a3c6f9cb0f3_1440x1034.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Zaqt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32c9a431-114b-4241-a86a-8a3c6f9cb0f3_1440x1034.png 424w, 
https://substackcdn.com/image/fetch/$s_!Zaqt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32c9a431-114b-4241-a86a-8a3c6f9cb0f3_1440x1034.png 848w, https://substackcdn.com/image/fetch/$s_!Zaqt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32c9a431-114b-4241-a86a-8a3c6f9cb0f3_1440x1034.png 1272w, https://substackcdn.com/image/fetch/$s_!Zaqt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32c9a431-114b-4241-a86a-8a3c6f9cb0f3_1440x1034.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Zaqt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32c9a431-114b-4241-a86a-8a3c6f9cb0f3_1440x1034.png" width="1440" height="1034" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/32c9a431-114b-4241-a86a-8a3c6f9cb0f3_1440x1034.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1034,&quot;width&quot;:1440,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:112741,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://januverma.substack.com/i/192538706?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32c9a431-114b-4241-a86a-8a3c6f9cb0f3_1440x1034.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!Zaqt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32c9a431-114b-4241-a86a-8a3c6f9cb0f3_1440x1034.png 424w, https://substackcdn.com/image/fetch/$s_!Zaqt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32c9a431-114b-4241-a86a-8a3c6f9cb0f3_1440x1034.png 848w, https://substackcdn.com/image/fetch/$s_!Zaqt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32c9a431-114b-4241-a86a-8a3c6f9cb0f3_1440x1034.png 1272w, https://substackcdn.com/image/fetch/$s_!Zaqt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F32c9a431-114b-4241-a86a-8a3c6f9cb0f3_1440x1034.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><h3>Random Rotation</h3><p>TurboQuant begins by applying a random rotation to the input vector. After that rotation, the coordinates behave much more nicely: each coordinate follows a concentrated distribution, and in high dimensions different coordinates become nearly independent. That means you can quantize each coordinate separately with an almost-optimal scalar quantizer, instead of needing a complicated joint vector codebook.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FxQF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe069a7bf-6856-483d-b846-083220b235cc_1462x588.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FxQF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe069a7bf-6856-483d-b846-083220b235cc_1462x588.heic 424w, https://substackcdn.com/image/fetch/$s_!FxQF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe069a7bf-6856-483d-b846-083220b235cc_1462x588.heic 848w, https://substackcdn.com/image/fetch/$s_!FxQF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe069a7bf-6856-483d-b846-083220b235cc_1462x588.heic 1272w, 
https://substackcdn.com/image/fetch/$s_!FxQF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe069a7bf-6856-483d-b846-083220b235cc_1462x588.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!FxQF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe069a7bf-6856-483d-b846-083220b235cc_1462x588.heic" width="1456" height="586" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e069a7bf-6856-483d-b846-083220b235cc_1462x588.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:586,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:29652,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://januverma.substack.com/i/192538706?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe069a7bf-6856-483d-b846-083220b235cc_1462x588.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!FxQF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe069a7bf-6856-483d-b846-083220b235cc_1462x588.heic 424w, https://substackcdn.com/image/fetch/$s_!FxQF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe069a7bf-6856-483d-b846-083220b235cc_1462x588.heic 848w, 
https://substackcdn.com/image/fetch/$s_!FxQF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe069a7bf-6856-483d-b846-083220b235cc_1462x588.heic 1272w, https://substackcdn.com/image/fetch/$s_!FxQF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe069a7bf-6856-483d-b846-083220b235cc_1462x588.heic 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>The paper&#8217;s theoretical story is that random rotation induces a coordinate distribution that is close to a known reference distribution (for a unit vector, a scaled Beta distribution that approaches a Gaussian with variance 1/d as the dimension d grows), 
so precomputed scalar quantizers become near-optimal.</p><p>A simplified sketch of the rotation step, using a dense random orthogonal matrix in place of the fast Hadamard transform: </p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;7058efd9-7fd7-4f0f-888e-e720f5437df6&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">import numpy as np

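# Sketch only (an assumption, not code from the paper or the demo repo): the
# fast Walsh-Hadamard route mentioned in random_rotation below. The names fwht
# and randomized_hadamard are hypothetical, and the length must be a power of 2.
# Both functions rely on the numpy import at the top of this block.
def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform in O(d log d) time."""
    x = np.array(x, dtype=np.float64)  # work on a copy
    d = x.shape[0]
    if bin(d).count("1") != 1:
        raise ValueError("length must be a power of two")
    h = 1
    while h != d:
        # Butterfly step: split each block of 2h values into halves (a, b)
        # and replace them with (a + b, a - b)
        blocks = x.reshape(-1, 2, h)
        x = np.stack([blocks[:, 0] + blocks[:, 1],
                      blocks[:, 0] - blocks[:, 1]], axis=1).reshape(d)
        h *= 2
    return x / np.sqrt(d)  # 1/sqrt(d) scaling keeps the vector norm unchanged

def randomized_hadamard(v, seed=42):
    """Random sign flips followed by the transform: a structured stand-in
    for multiplying by a dense random orthogonal matrix."""
    rng = np.random.RandomState(seed)
    signs = rng.choice([-1.0, 1.0], size=len(v))
    return fwht(signs * np.asarray(v, dtype=np.float64))
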
def random_rotation(v, seed=42):
    """Apply randomized Hadamard transform (simplified as random orthogonal)."""
    rng = np.random.RandomState(seed)
    # In practice, use fast Walsh-Hadamard with random sign flips
    # Here we simulate with a random orthogonal matrix
    d = len(v)
    Q, R = np.linalg.qr(rng.randn(d, d))
    Q = Q * np.sign(np.diag(R))  # fix column signs so Q is uniformly (Haar) distributed
    return Q @ v</code></pre></div><h3>PolarQuant</h3><p>PolarQuant's core idea is elegant: instead of quantizing a vector in Cartesian coordinates (x, y, z, ...), convert it to <strong>polar coordinates</strong>, i.e. a single radius and a set of angles. Rather than storing raw coordinates directly, it recursively turns pairs of coordinates into <strong>radii and angles</strong>. The basic building block converts one pair:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;3ed0f78c-12bb-42b4-b2b0-bcb2aefcb1ae&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">import numpy as np
 
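# Sketch (an assumption, not from the post): the inverse of
# cartesian_to_polar_pair below, under the hypothetical name
# polar_to_cartesian_pair. Before quantization the round trip is exact up to
# floating-point error, so any loss comes from quantizing theta.
# Relies on the numpy import at the top of this block.
def polar_to_cartesian_pair(r, theta):
    """Convert polar (radius, angle) back to a pair of Cartesian coordinates."""
    return r * np.cos(theta), r * np.sin(theta)
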
def cartesian_to_polar_pair(x1, x2):
    """Convert a pair of Cartesian coordinates to polar (radius, angle)."""
    r = np.sqrt(x1**2 + x2**2)
    theta = np.arctan2(x2, x1)  # angle in [-pi, pi]
    return r, theta</code></pre></div><p>Why does this help? In Cartesian space, each block of numbers has a different scale, so traditional quantizers must store per-block scale and zero-point constants (that is the overhead). In polar coordinates, after a random rotation, the angles follow a known, concentrated Beta distribution. Because the distribution is predictable, you don't need per-block constants: you can apply one fixed quantizer, designed once for that distribution, to every vector.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!J_5D!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa81bec52-c05f-4846-a7d4-010cd32eba45_1372x1088.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!J_5D!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa81bec52-c05f-4846-a7d4-010cd32eba45_1372x1088.heic 424w, https://substackcdn.com/image/fetch/$s_!J_5D!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa81bec52-c05f-4846-a7d4-010cd32eba45_1372x1088.heic 848w, https://substackcdn.com/image/fetch/$s_!J_5D!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa81bec52-c05f-4846-a7d4-010cd32eba45_1372x1088.heic 1272w, https://substackcdn.com/image/fetch/$s_!J_5D!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa81bec52-c05f-4846-a7d4-010cd32eba45_1372x1088.heic 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!J_5D!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa81bec52-c05f-4846-a7d4-010cd32eba45_1372x1088.heic" width="1372" height="1088" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a81bec52-c05f-4846-a7d4-010cd32eba45_1372x1088.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1088,&quot;width&quot;:1372,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:48032,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://januverma.substack.com/i/192538706?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa81bec52-c05f-4846-a7d4-010cd32eba45_1372x1088.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!J_5D!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa81bec52-c05f-4846-a7d4-010cd32eba45_1372x1088.heic 424w, https://substackcdn.com/image/fetch/$s_!J_5D!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa81bec52-c05f-4846-a7d4-010cd32eba45_1372x1088.heic 848w, https://substackcdn.com/image/fetch/$s_!J_5D!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa81bec52-c05f-4846-a7d4-010cd32eba45_1372x1088.heic 1272w, 
https://substackcdn.com/image/fetch/$s_!J_5D!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa81bec52-c05f-4846-a7d4-010cd32eba45_1372x1088.heic 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>The recursion works bottom-up: pair coordinates into (radius, angle), then pair the radii again, repeating until you have one final radius (the vector norm) and a tree of angles. Since the angles have a known distribution, you quantize them with a <strong>fixed grid</strong>. 
No per-block scale constants are needed.</p><p>Here is a simplified PolarQuant encoder in Python (it uses a uniform angle grid for simplicity):</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;a95569a4-7fcd-4497-8639-ec1021fca47f&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">import numpy as np

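# Sketch of a matching decoder (an assumption; the post shows only the encoder).
# polarquant_decode is a hypothetical name. It assumes the dimension is a power
# of two and that angles arrive in the order polarquant_encode below emits them:
# d/2 first-level angles, then d/4, and so on up to the single top-level angle.
# It returns the vector in the ROTATED basis; multiplying by Q.T (with Q
# regenerated from the same seed) would undo the rotation.
# Relies on the numpy import at the top of this block.
def polarquant_decode(final_radius, quantized_angles, bits_per_angle):
    n_levels = 2 ** bits_per_angle
    levels = np.asarray(quantized_angles, dtype=np.float64)
    if bin(len(levels) + 1).count("1") != 1:
        raise ValueError("expected d - 1 angles with d a power of two")
    # Map each quantization level to the centre of its bin in [-pi, pi)
    angles = (levels + 0.5) / n_levels * 2 * np.pi - np.pi
    values = np.array([final_radius], dtype=np.float64)
    end = len(angles)
    while end != 0:
        # Walk the tree top-down: one angle per radius at the current level
        start = end - len(values)
        theta = angles[start:end]
        pairs = np.stack([values * np.cos(theta), values * np.sin(theta)], axis=1)
        values = pairs.reshape(-1)  # interleave (x1, x2) back into coordinate order
        end = start
    return values
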
def polarquant_encode(v, bits_per_angle=3, seed=42):
    """
    PolarQuant encoding:
    1. Randomly rotate the vector
    2. Pair coordinates &#8594; polar (radius, angle)
    3. Recursively pair radii until one final radius remains
    4. Quantize all angles to a fixed grid (no per-block constants!)
    """
    # Step 1: Random rotation spreads energy evenly across coordinates
    v_rot = random_rotation(v, seed)
    
    # Step 2: Recursive polar conversion
    coords = v_rot.copy()
    all_angles = []
    
    while len(coords) &gt; 1:
        new_radii = []
        for i in range(0, len(coords), 2):
            if i + 1 &lt; len(coords):
                r, theta = cartesian_to_polar_pair(coords[i], coords[i+1])
                new_radii.append(r)
                all_angles.append(theta)
            else:
                new_radii.append(coords[i])  # odd element passes through
        coords = np.array(new_radii)
    
    final_radius = coords[0]  # = ||v|| (the vector norm)
    
    # Step 3: Quantize angles to a FIXED grid
    # Key insight: after random rotation, angles follow a known Beta distribution
    # so we can use a universal grid &#8212; no per-block scale/zero needed!
    n_levels = 2 ** bits_per_angle
    quantized_angles = []
    for theta in all_angles:
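        # Normalize angle to [0, 1] range
        normalized = (theta + np.pi) / (2 * np.pi)
        # Uniform grid for simplicity; a codebook fitted to the known angle
        # distribution would give lower error. Angles of paired radii lie in
        # [0, pi/2], so they use only part of this [-pi, pi] grid.
        level = int(np.clip(normalized * n_levels, 0, n_levels - 1))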
        quantized_angles.append(level)
    
    return final_radius, quantized_angles, bits_per_angle</code></pre></div><h3>Quantized Johnson&#8211;Lindenstrauss (QJL)</h3><p>PolarQuant gives excellent MSE (mean-squared error) compression, but it introduces a subtle <strong>bias</strong> in inner product estimation. When you compute the attention score <code>dot(query, key)</code> using quantized values, the errors do not cancel out; they systematically skew the result. This is a problem for attention mechanisms that rely on accurate dot products.</p><p>QJL fixes this by applying a <strong>Johnson&#8211;Lindenstrauss transform</strong> (a random projection) to the residual error (the difference between the original and quantized vector), then keeping only the sign of each projected coordinate (+1 or -1) along with the residual's norm. This creates an unbiased estimator that, when combined with the PolarQuant output, gives accurate inner products.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Czbp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3d2ad2e-de4a-4575-b1bf-48284639714b_1330x944.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Czbp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3d2ad2e-de4a-4575-b1bf-48284639714b_1330x944.heic 424w, https://substackcdn.com/image/fetch/$s_!Czbp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3d2ad2e-de4a-4575-b1bf-48284639714b_1330x944.heic 848w, https://substackcdn.com/image/fetch/$s_!Czbp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3d2ad2e-de4a-4575-b1bf-48284639714b_1330x944.heic 1272w, 
https://substackcdn.com/image/fetch/$s_!Czbp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3d2ad2e-de4a-4575-b1bf-48284639714b_1330x944.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Czbp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3d2ad2e-de4a-4575-b1bf-48284639714b_1330x944.heic" width="1330" height="944" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c3d2ad2e-de4a-4575-b1bf-48284639714b_1330x944.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:944,&quot;width&quot;:1330,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:41975,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://januverma.substack.com/i/192538706?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3d2ad2e-de4a-4575-b1bf-48284639714b_1330x944.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Czbp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3d2ad2e-de4a-4575-b1bf-48284639714b_1330x944.heic 424w, https://substackcdn.com/image/fetch/$s_!Czbp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3d2ad2e-de4a-4575-b1bf-48284639714b_1330x944.heic 848w, 
https://substackcdn.com/image/fetch/$s_!Czbp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3d2ad2e-de4a-4575-b1bf-48284639714b_1330x944.heic 1272w, https://substackcdn.com/image/fetch/$s_!Czbp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3d2ad2e-de4a-4575-b1bf-48284639714b_1330x944.heic 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>The mathematical trick is that the sign of a random projection preserves angular information: two vectors separated by a small angle 
will have similar sign patterns, while vectors separated by a large angle will have very different ones. The factor <code>&#8730;(&#960;/2)</code> corrects the scaling so that, for Gaussian projections, the estimator is unbiased. </p><p>Here&#8217;s a simplified QJL encoder:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;5cedc2b8-2110-4a95-9455-d0127d454c9d&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">import numpy as np
 
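# Sketch of the decode side (an assumption; the post shows only qjl_encode).
# qjl_inner_product is a hypothetical name. It regenerates the same +/-1 matrix
# from the seed and estimates dot(query, residual) from the stored sign bits
# and residual norm. The sqrt(pi/2) scaling is exact for Gaussian projections;
# with the +/-1 matrix used in this post it is only approximately unbiased.
# Relies on the numpy import at the top of this block.
def qjl_inner_product(query, sign_bits, norm_e, seed=0):
    m, d = len(sign_bits), len(query)
    rng = np.random.RandomState(seed)
    # Same draw as qjl_encode below (the 1/sqrt(m) scaling there does not change signs)
    S = rng.choice([-1, 1], size=(m, d)).astype(np.float32)
    projected_query = S @ np.asarray(query, dtype=np.float32)  # query is not quantized
    scale = np.sqrt(np.pi / 2) * norm_e / m
    return float(scale * np.dot(sign_bits, projected_query))
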
def qjl_encode(residual, m=None, seed=0):
    """
    Quantized Johnson-Lindenstrauss (QJL) encoding.
    
    Projects the residual error with a random sign matrix of m rows
    (m may be smaller than d), then keeps only the sign bits.
    
    Args:
        residual: the error vector e = v - v_hat (d-dimensional)
        m: projection dimension (default: d, can be smaller)
        seed: random seed for reproducible projection matrix
    
    Returns:
        sign_bits: {+1, -1} array of length m (1 bit each)
        norm_e: the norm of the residual (stored once per vector)
    """
    d = len(residual)
    if m is None:
        m = d
    
    # Generate random &#177;1 matrix (Rademacher distribution)
    # This matrix is NOT stored &#8212; it's regenerated from the seed
    rng = np.random.RandomState(seed)
    S = rng.choice([-1, 1], size=(m, d)).astype(np.float32)
    S /= np.sqrt(m)  # normalize
    
    # Project and take the sign
    projection = S @ residual
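    # np.sign would return 0 for an exactly-zero projection, so map to +1/-1 explicitly
    sign_bits = np.where(projection &gt;= 0, 1.0, -1.0)  # each is +1 or -1 &#8594; 1 bit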
    
    norm_e = np.linalg.norm(residual)
    
    return sign_bits, norm_e</code></pre></div><h2>Key Results </h2><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!goo6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F045998c0-a4b5-4708-8cee-ea00f482048e_1478x210.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!goo6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F045998c0-a4b5-4708-8cee-ea00f482048e_1478x210.heic 424w, https://substackcdn.com/image/fetch/$s_!goo6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F045998c0-a4b5-4708-8cee-ea00f482048e_1478x210.heic 848w, https://substackcdn.com/image/fetch/$s_!goo6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F045998c0-a4b5-4708-8cee-ea00f482048e_1478x210.heic 1272w, https://substackcdn.com/image/fetch/$s_!goo6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F045998c0-a4b5-4708-8cee-ea00f482048e_1478x210.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!goo6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F045998c0-a4b5-4708-8cee-ea00f482048e_1478x210.heic" width="1456" height="207" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/045998c0-a4b5-4708-8cee-ea00f482048e_1478x210.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:207,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:21348,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://januverma.substack.com/i/192538706?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F045998c0-a4b5-4708-8cee-ea00f482048e_1478x210.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!goo6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F045998c0-a4b5-4708-8cee-ea00f482048e_1478x210.heic 424w, https://substackcdn.com/image/fetch/$s_!goo6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F045998c0-a4b5-4708-8cee-ea00f482048e_1478x210.heic 848w, https://substackcdn.com/image/fetch/$s_!goo6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F045998c0-a4b5-4708-8cee-ea00f482048e_1478x210.heic 1272w, https://substackcdn.com/image/fetch/$s_!goo6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F045998c0-a4b5-4708-8cee-ea00f482048e_1478x210.heic 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><ul><li><p><strong>Zero overhead constants.</strong> Unlike traditional methods that store scale + zero-point per block (adding 1-2 bits), TurboQuant needs no extra constants. 
The random rotation plus polar conversion makes a fixed quantizer work universally.</p></li><li><p><strong>Data-oblivious and online.</strong> No training, fine-tuning, or dataset-specific codebooks are required. It works immediately on any vector, which is ideal for streaming KV cache compression.</p></li><li><p><strong>Near-optimal distortion.</strong> Within a ~2.7&#215; constant factor of the information-theoretic lower bound. The guarantee is provable rather than just empirical: the distortion is mathematically close to the best any algorithm could achieve.</p></li><li><p><strong>Unbiased inner products.</strong> The QJL correction eliminates the systematic bias that MSE-optimal quantizers introduce. This is critical for accurate attention score computation in transformers.</p></li><li><p><strong>Dual application.</strong> Works for both KV cache compression in LLMs and high-dimensional vector search. It is reported to outperform product quantization (PQ) in recall while reducing indexing time to near zero.</p></li></ul><h2>Final Words</h2><p>TurboQuant randomly rotates a vector to make its coordinates predictable, converts to polar coordinates so it can quantize with a fixed grid (no overhead), and uses a 1-bit sign trick on the leftover error to keep inner products unbiased. 
</p><p>The full code for a demo of TurboQuant is available here: <strong><a href="https://github.com/januverma/turboquant-demo">januverma/turboquant-demo</a></strong></p>]]></content:encoded></item><item><title><![CDATA[Research Briefings: Mamba - 3]]></title><description><![CDATA[State Space Models (SSMs)]]></description><link>https://januverma.substack.com/p/research-briefings-mamba-3</link><guid isPermaLink="false">https://januverma.substack.com/p/research-briefings-mamba-3</guid><dc:creator><![CDATA[Janu Verma]]></dc:creator><pubDate>Sat, 28 Mar 2026 08:33:21 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!gVj6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad5eed22-344a-4142-81ed-2f1c6576e3fa_1366x804.heic" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1>State Space Models (SSMs)</h1><p>To understand Mamba, you need to understand the core tradeoff in sequence modeling. <strong>Transformers</strong> use self-attention, which stores <em>all</em> past tokens in a key-value (KV) cache. This gives them perfect recall, but compute grows quadratically with sequence length and the KV cache grows linearly, eating memory at long contexts. <strong>State space models</strong> take the opposite approach. They compress all past information into a fixed-size hidden state that gets updated at each timestep. This gives them linear scaling with sequence length, which is great for efficiency, but that fixed state has to do all the work of &#8220;remembering,&#8221; which is inherently lossy. 
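</p><p>To make the memory side of this tradeoff concrete, here is a rough back-of-the-envelope sketch in Python. The layer counts and dimensions are hypothetical values chosen for illustration, not the configuration of any particular model, and the helper names are made up for this sketch.</p><pre><code>def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    # One key vector and one value vector per past token, per head, per layer.
    return 2 * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_value


def ssm_state_bytes(n_layers, d_inner, d_state, bytes_per_value=2):
    # One fixed-size hidden state per layer, independent of sequence length.
    return n_layers * d_inner * d_state * bytes_per_value


# Hypothetical dimensions, chosen only for illustration.
ssm = ssm_state_bytes(n_layers=32, d_inner=4096, d_state=128)
for seq_len in (1_000, 10_000, 100_000):
    kv = kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128)
    print(f"{seq_len:7d} tokens: KV cache {kv / 2**20:9.1f} MiB, SSM state {ssm / 2**20:5.1f} MiB")
</code></pre><p>Under these assumed dimensions the KV cache grows in direct proportion to the number of tokens, while the SSM state stays the same size regardless of context length.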
</p><p>The SSM recurrence looks something like: </p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;h_t = A &#183; h_{t-1} + B &#183; x_t&quot;,&quot;id&quot;:&quot;JIVULBRKZI&quot;}" data-component-name="LatexBlockToDOM"></div><p>where <strong>h</strong> is the hidden state, <strong>A</strong> is a transition matrix controlling how the state evolves, <strong>B</strong> controls how new input enters the state, and a separate matrix <strong>C</strong> reads out the output. The design of A, B, and the discretization scheme (how you go from continuous to discrete time) determines the model's expressiveness.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gVj6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad5eed22-344a-4142-81ed-2f1c6576e3fa_1366x804.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gVj6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad5eed22-344a-4142-81ed-2f1c6576e3fa_1366x804.heic 424w, https://substackcdn.com/image/fetch/$s_!gVj6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad5eed22-344a-4142-81ed-2f1c6576e3fa_1366x804.heic 848w, https://substackcdn.com/image/fetch/$s_!gVj6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad5eed22-344a-4142-81ed-2f1c6576e3fa_1366x804.heic 1272w, https://substackcdn.com/image/fetch/$s_!gVj6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad5eed22-344a-4142-81ed-2f1c6576e3fa_1366x804.heic 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gVj6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad5eed22-344a-4142-81ed-2f1c6576e3fa_1366x804.heic" width="1366" height="804" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ad5eed22-344a-4142-81ed-2f1c6576e3fa_1366x804.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:804,&quot;width&quot;:1366,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:60609,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://januverma.substack.com/i/192206980?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad5eed22-344a-4142-81ed-2f1c6576e3fa_1366x804.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!gVj6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad5eed22-344a-4142-81ed-2f1c6576e3fa_1366x804.heic 424w, https://substackcdn.com/image/fetch/$s_!gVj6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad5eed22-344a-4142-81ed-2f1c6576e3fa_1366x804.heic 848w, https://substackcdn.com/image/fetch/$s_!gVj6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad5eed22-344a-4142-81ed-2f1c6576e3fa_1366x804.heic 1272w, 
https://substackcdn.com/image/fetch/$s_!gVj6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad5eed22-344a-4142-81ed-2f1c6576e3fa_1366x804.heic 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Since the state can&#8217;t grow, the question becomes: </p><blockquote><p>how do you make that fixed box more powerful? 
</p></blockquote><p>This is exactly where the Mamba lineage diverges.</p><h1>Mamba Series</h1><ul><li><p><strong>Mamba-1</strong> (late 2023) made these matrices input-dependent (&#8220;selective&#8221;), meaning <code>A, B, C</code> change based on the current token. This was a huge leap that let SSMs actually compete with Transformers on language modeling.</p></li><li><p><strong>Mamba-2</strong> (mid-2024) simplified the SSM mechanism to make better use of GPU tensor cores, achieving <code>2&#8211;8x</code> faster <em>training</em> than Mamba-1. The tradeoff was that it reduced the transition matrix <code>A</code> from a diagonal matrix to a scalar-times-identity, making the recurrence simpler but less expressive. This was a deliberate bet that training speed was the primary bottleneck.</p></li><li><p><strong>Mamba-3</strong> responds to a shift in the LLM landscape since Mamba-2. Post-training methods like reinforcement learning with verifiable rewards require massive amounts of generated rollouts, and agentic workflows have driven inference demand up enormously. 
Yet Mamba-2&#8217;s simplified recurrence left the decode step memory-bound; that is, the GPU spends most of its time moving data rather than computing.</p><p>Mamba-3 asks: </p><blockquote><p>what would an SSM designed with <em>inference</em> in mind look like?</p></blockquote></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!BoW1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75be983d-7c76-4236-bda5-43522848e4fa_1406x850.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!BoW1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75be983d-7c76-4236-bda5-43522848e4fa_1406x850.heic 424w, https://substackcdn.com/image/fetch/$s_!BoW1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75be983d-7c76-4236-bda5-43522848e4fa_1406x850.heic 848w, https://substackcdn.com/image/fetch/$s_!BoW1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75be983d-7c76-4236-bda5-43522848e4fa_1406x850.heic 1272w, https://substackcdn.com/image/fetch/$s_!BoW1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75be983d-7c76-4236-bda5-43522848e4fa_1406x850.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!BoW1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75be983d-7c76-4236-bda5-43522848e4fa_1406x850.heic" width="1406" height="850" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/75be983d-7c76-4236-bda5-43522848e4fa_1406x850.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:850,&quot;width&quot;:1406,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:48483,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://januverma.substack.com/i/192206980?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75be983d-7c76-4236-bda5-43522848e4fa_1406x850.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!BoW1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75be983d-7c76-4236-bda5-43522848e4fa_1406x850.heic 424w, https://substackcdn.com/image/fetch/$s_!BoW1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75be983d-7c76-4236-bda5-43522848e4fa_1406x850.heic 848w, https://substackcdn.com/image/fetch/$s_!BoW1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75be983d-7c76-4236-bda5-43522848e4fa_1406x850.heic 1272w, https://substackcdn.com/image/fetch/$s_!BoW1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75be983d-7c76-4236-bda5-43522848e4fa_1406x850.heic 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>Mamba-3</h2><p>The authors identified three &#8220;levers&#8221; to pull: make the recurrence more expressive, use a richer transition matrix, and add more parallel work per step, all without significantly increasing decode latency.</p><p><strong>1. Exponential-Trapezoidal Discretization</strong></p><p>Mamba-3 introduces a new discretization scheme that makes the recurrence formula more expressive. In classical numerical methods, &#8220;discretization&#8221; is how you convert a continuous-time differential equation into discrete update steps. Mamba-2 used a simpler scheme; the new exponential-trapezoidal approach implicitly applies a convolution-like operation on the input to the hidden state. 
This is significant because it actually allowed the team to <strong>remove the short causal convolution</strong> that had been a staple of Mamba-1 and Mamba-2 (and most linear models). That short <em>conv</em> was originally needed for induction-style retrieval capabilities, but the new recurrence, combined with simple biases on the <code>B</code> and <code>C</code> matrices, provides equivalent functionality intrinsically. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JlBP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F122b41eb-e595-4080-98d5-eefdf43b576b_1470x694.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JlBP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F122b41eb-e595-4080-98d5-eefdf43b576b_1470x694.heic 424w, https://substackcdn.com/image/fetch/$s_!JlBP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F122b41eb-e595-4080-98d5-eefdf43b576b_1470x694.heic 848w, https://substackcdn.com/image/fetch/$s_!JlBP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F122b41eb-e595-4080-98d5-eefdf43b576b_1470x694.heic 1272w, https://substackcdn.com/image/fetch/$s_!JlBP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F122b41eb-e595-4080-98d5-eefdf43b576b_1470x694.heic 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!JlBP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F122b41eb-e595-4080-98d5-eefdf43b576b_1470x694.heic" width="1456" height="687" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/122b41eb-e595-4080-98d5-eefdf43b576b_1470x694.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:687,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:55987,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://januverma.substack.com/i/192206980?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F122b41eb-e595-4080-98d5-eefdf43b576b_1470x694.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!JlBP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F122b41eb-e595-4080-98d5-eefdf43b576b_1470x694.heic 424w, https://substackcdn.com/image/fetch/$s_!JlBP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F122b41eb-e595-4080-98d5-eefdf43b576b_1470x694.heic 848w, https://substackcdn.com/image/fetch/$s_!JlBP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F122b41eb-e595-4080-98d5-eefdf43b576b_1470x694.heic 1272w, https://substackcdn.com/image/fetch/$s_!JlBP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F122b41eb-e595-4080-98d5-eefdf43b576b_1470x694.heic 1456w" 
sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>2. Complex-Valued State Tracking</strong></p><p>Mamba-3 expands state-tracking capabilities by modeling a complex-valued SSM. Instead of keeping the hidden state real-valued, the transition matrix operates in the complex plane. The clever implementation trick is to express this via <strong>RoPE</strong> (Rotary Position Embeddings), interpreting complex transitions as rotations. 
This avoids having to rewrite kernels from scratch for complex arithmetic while giving the state richer dynamics to represent oscillatory or periodic patterns.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!aaOI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fead2113d-9626-4d89-8ca7-92a3089042fb_1568x730.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!aaOI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fead2113d-9626-4d89-8ca7-92a3089042fb_1568x730.heic 424w, https://substackcdn.com/image/fetch/$s_!aaOI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fead2113d-9626-4d89-8ca7-92a3089042fb_1568x730.heic 848w, https://substackcdn.com/image/fetch/$s_!aaOI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fead2113d-9626-4d89-8ca7-92a3089042fb_1568x730.heic 1272w, https://substackcdn.com/image/fetch/$s_!aaOI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fead2113d-9626-4d89-8ca7-92a3089042fb_1568x730.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!aaOI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fead2113d-9626-4d89-8ca7-92a3089042fb_1568x730.heic" width="1456" height="678" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ead2113d-9626-4d89-8ca7-92a3089042fb_1568x730.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:678,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:49827,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://januverma.substack.com/i/192206980?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fead2113d-9626-4d89-8ca7-92a3089042fb_1568x730.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!aaOI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fead2113d-9626-4d89-8ca7-92a3089042fb_1568x730.heic 424w, https://substackcdn.com/image/fetch/$s_!aaOI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fead2113d-9626-4d89-8ca7-92a3089042fb_1568x730.heic 848w, https://substackcdn.com/image/fetch/$s_!aaOI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fead2113d-9626-4d89-8ca7-92a3089042fb_1568x730.heic 1272w, https://substackcdn.com/image/fetch/$s_!aaOI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fead2113d-9626-4d89-8ca7-92a3089042fb_1568x730.heic 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>3. Multi-Input, Multi-Output (MIMO) SSMs</strong></p><p>Instead of the standard single-input, single-output (SISO) SSM, Mamba-3 offers a MIMO variant that runs multiple SSMs in parallel. The key insight here is about the compute-vs-memory tradeoff during inference: current linear models use lots of GPU tensor cores for fast training, but during decoding, each timestep requires so little compute that the hardware sits idle most of the time. MIMO adds more FLOPs per timestep &#8212; using those idle cores &#8212; so decode latency stays roughly constant even though the model is doing more useful work. The cost shows up in <em>training </em>(which becomes slower) but not in <em>inference</em>. 
This is an elegant exploitation of the asymmetry between training and inference compute profiles.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!K69r!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05644a84-f4e2-49f9-9a58-57826697669a_1364x784.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!K69r!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05644a84-f4e2-49f9-9a58-57826697669a_1364x784.heic 424w, https://substackcdn.com/image/fetch/$s_!K69r!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05644a84-f4e2-49f9-9a58-57826697669a_1364x784.heic 848w, https://substackcdn.com/image/fetch/$s_!K69r!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05644a84-f4e2-49f9-9a58-57826697669a_1364x784.heic 1272w, https://substackcdn.com/image/fetch/$s_!K69r!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05644a84-f4e2-49f9-9a58-57826697669a_1364x784.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!K69r!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05644a84-f4e2-49f9-9a58-57826697669a_1364x784.heic" width="1364" height="784" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/05644a84-f4e2-49f9-9a58-57826697669a_1364x784.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:784,&quot;width&quot;:1364,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:52782,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://januverma.substack.com/i/192206980?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05644a84-f4e2-49f9-9a58-57826697669a_1364x784.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!K69r!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05644a84-f4e2-49f9-9a58-57826697669a_1364x784.heic 424w, https://substackcdn.com/image/fetch/$s_!K69r!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05644a84-f4e2-49f9-9a58-57826697669a_1364x784.heic 848w, https://substackcdn.com/image/fetch/$s_!K69r!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05644a84-f4e2-49f9-9a58-57826697669a_1364x784.heic 1272w, https://substackcdn.com/image/fetch/$s_!K69r!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05644a84-f4e2-49f9-9a58-57826697669a_1364x784.heic 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>Architecture Changes</h2><p>Beyond the SSM core, Mamba-3 also modernizes its surrounding architecture. The authors added QK-normalization (called &#8220;BCNorm&#8221; in SSM terms), which stabilizes training and brings Mamba-3 in line with contemporary transformer models. They also switched to interleaved MLP layers, following standard transformer conventions. 
And as mentioned, the short causal convolution is gone.</p><p>Let me visualize the practical outcome: how Mamba-3 fits into the bigger picture of hybrid architectures, which the authors predict will be the dominant paradigm going forward.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xOHT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa67a0966-5480-412c-b756-4a296fc10410_1406x818.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xOHT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa67a0966-5480-412c-b756-4a296fc10410_1406x818.heic 424w, https://substackcdn.com/image/fetch/$s_!xOHT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa67a0966-5480-412c-b756-4a296fc10410_1406x818.heic 848w, https://substackcdn.com/image/fetch/$s_!xOHT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa67a0966-5480-412c-b756-4a296fc10410_1406x818.heic 1272w, https://substackcdn.com/image/fetch/$s_!xOHT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa67a0966-5480-412c-b756-4a296fc10410_1406x818.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xOHT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa67a0966-5480-412c-b756-4a296fc10410_1406x818.heic" width="1406" height="818" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a67a0966-5480-412c-b756-4a296fc10410_1406x818.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:818,&quot;width&quot;:1406,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:61262,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://januverma.substack.com/i/192206980?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa67a0966-5480-412c-b756-4a296fc10410_1406x818.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!xOHT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa67a0966-5480-412c-b756-4a296fc10410_1406x818.heic 424w, https://substackcdn.com/image/fetch/$s_!xOHT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa67a0966-5480-412c-b756-4a296fc10410_1406x818.heic 848w, https://substackcdn.com/image/fetch/$s_!xOHT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa67a0966-5480-412c-b756-4a296fc10410_1406x818.heic 1272w, https://substackcdn.com/image/fetch/$s_!xOHT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa67a0966-5480-412c-b756-4a296fc10410_1406x818.heic 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>This hybrid layout captures the Mamba-3 team&#8217;s prediction: linear layers will predominantly be used alongside global self-attention layers. The SSM handles the bulk of sequence processing at O(1) memory cost, while a few strategically placed attention layers handle exact retrieval tasks that need the full KV cache.</p><h2>Final</h2><p>Mamba-3 is a bet that the age of inference has arrived. Rather than simplifying the SSM to train faster (as Mamba-2 did), it enriches the recurrence via trapezoidal discretization, complex-valued states, and MIMO, so that each decode step does more useful work with GPU cycles that were previously idle. 
The result is a model that is both more accurate and faster at generation than its predecessors, with open-sourced kernels built across three levels of GPU abstraction (Triton, TileLang, CuTe DSL) for maximum hardware performance. You can find the paper at <strong><a href="https://arxiv.org/abs/2603.15569">arxiv.org/abs/2603.15569</a></strong> and the code at the <strong><a href="https://github.com/state-spaces/mamba">mamba-ssm GitHub repo</a></strong>.</p>]]></content:encoded></item><item><title><![CDATA[Research Briefings: Video-JEPA 2.1]]></title><description><![CDATA[Joint Embedding Predictive Architecture (JEPA) is a framework proposed by Yann LeCun as an alternative to the two dominant paradigms in self-supervised learning:]]></description><link>https://januverma.substack.com/p/video-jepa-21</link><guid isPermaLink="false">https://januverma.substack.com/p/video-jepa-21</guid><dc:creator><![CDATA[Janu Verma]]></dc:creator><pubDate>Sat, 21 Mar 2026 12:10:05 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!rvD0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe44c3167-bb8a-4f93-97f4-b6743a148784_1372x1072.heic" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a href="https://openreview.net/pdf?id=BZ5a1r-kVsf">Joint Embedding Predictive Architecture</a> (JEPA) is a framework proposed by Yann LeCun as an alternative to the two dominant paradigms in self-supervised learning:</p><ul><li><p><strong>contrastive methods</strong> like CLIP, which learn by comparing positive and negative pairs</p></li><li><p><strong>generative methods</strong> like MAE or video diffusion models, which learn by reconstructing raw pixels.</p></li></ul><p>The key philosophical insight is this: predicting at the pixel level forces a model to spend enormous capacity on unpredictable, irrelevant details like the exact texture of grass, the shimmer of light on water, the grain of a surface. 
These details are essentially random noise from the standpoint of understanding <em>what is happening</em> in a scene. </p><blockquote><p>LeCun&#8217;s argument is that intelligent systems should instead learn to predict in an <strong>abstract representation space</strong>, focusing on the parts of the world that are actually predictable and semantically meaningful.</p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rvD0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe44c3167-bb8a-4f93-97f4-b6743a148784_1372x1072.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rvD0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe44c3167-bb8a-4f93-97f4-b6743a148784_1372x1072.heic 424w, https://substackcdn.com/image/fetch/$s_!rvD0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe44c3167-bb8a-4f93-97f4-b6743a148784_1372x1072.heic 848w, https://substackcdn.com/image/fetch/$s_!rvD0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe44c3167-bb8a-4f93-97f4-b6743a148784_1372x1072.heic 1272w, https://substackcdn.com/image/fetch/$s_!rvD0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe44c3167-bb8a-4f93-97f4-b6743a148784_1372x1072.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rvD0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe44c3167-bb8a-4f93-97f4-b6743a148784_1372x1072.heic" width="1372" 
height="1072" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e44c3167-bb8a-4f93-97f4-b6743a148784_1372x1072.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1072,&quot;width&quot;:1372,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:72627,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://januverma.substack.com/i/191512733?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe44c3167-bb8a-4f93-97f4-b6743a148784_1372x1072.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!rvD0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe44c3167-bb8a-4f93-97f4-b6743a148784_1372x1072.heic 424w, https://substackcdn.com/image/fetch/$s_!rvD0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe44c3167-bb8a-4f93-97f4-b6743a148784_1372x1072.heic 848w, https://substackcdn.com/image/fetch/$s_!rvD0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe44c3167-bb8a-4f93-97f4-b6743a148784_1372x1072.heic 1272w, https://substackcdn.com/image/fetch/$s_!rvD0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe44c3167-bb8a-4f93-97f4-b6743a148784_1372x1072.heic 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>A JEPA system has three components:</p><ul><li><p>An <strong>encoder</strong> that maps a visible/context portion of the input into an embedding.</p></li><li><p>A <strong>predictor</strong> that tries to predict the embedding of the missing/target portion.</p></li><li><p>A <strong>target encoder</strong> that produces the ground-truth embedding to match against, typically an exponential moving average of the main encoder. 
</p></li></ul><p>Crucially, the predictor never has to reconstruct raw pixels; it just needs to get close in embedding space.</p><p>This has a practical consequence: semantically equivalent but superficially different outputs (two valid answers to the same question, two plausible continuations of a scene) will be <em>nearby</em> in embedding space, even though they would look completely different in pixel space. The model naturally learns to abstract away irrelevant variation.</p><h2>V-JEPA: self-supervised learning from video</h2><p><a href="https://arxiv.org/abs/2404.08471">V-JEPA applies the JEPA principle to video</a>. It predicts masked spatio-temporal regions in a learned representation space rather than reconstructing them at the pixel level. Concretely, given a video clip, the system masks out large spatio-temporal blocks (tubes of patches across multiple frames), encodes the visible patches, and then has the predictor fill in the embeddings of the masked regions.</p><p>V-JEPA&#8217;s pretraining is based solely on an unsupervised feature prediction objective and does not use pretrained image encoders, text, negative examples, human annotations, or pixel-level reconstruction. This is a notably clean setup: the model learns purely from watching video.</p><p>The results were striking: V-JEPA achieved 82.1% on Kinetics-400 and 71.2% on Something-Something-v2, surpassing previous best video models by +4 and +10 points respectively, all with frozen-backbone evaluation (no fine-tuning).</p><h2>V-JEPA 2: scaling up to a world model</h2><p><a href="https://arxiv.org/abs/2506.09985">V-JEPA 2 took the original recipe and scaled it dramatically</a>. 
It was pretrained on a video and image dataset comprising over 1 million hours of internet video, using encoder models up to 1 billion parameters.</p><p>The key extensions from V-JEPA to V-JEPA 2 include: 3D tubelet tokenization (patches spanning 2 frames), 3D rotary positional embeddings, a progressive resolution training schedule that moves from 16 frames at 256&#215;256 to 64 frames at 384&#215;384, and multi-block masking at ~90% ratio.</p><p>But the really exciting part was what came <em>after</em> pretraining. By post-training a latent action-conditioned world model (<a href="https://arxiv.org/abs/2506.09985">V-JEPA 2-AC</a>) using less than 62 hours of unlabeled robot videos from the Droid dataset, they deployed it zero-shot on Franka robot arms in two different labs &#8212; enabling picking and placing of objects using planning with image goals, without any task-specific training or reward. This demonstrated that self-supervised video pretraining can bootstrap genuine physical-world planning.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-WOl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbce60496-8a35-4c97-81b6-ababc4a9e335_1458x912.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-WOl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbce60496-8a35-4c97-81b6-ababc4a9e335_1458x912.heic 424w, https://substackcdn.com/image/fetch/$s_!-WOl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbce60496-8a35-4c97-81b6-ababc4a9e335_1458x912.heic 848w, 
https://substackcdn.com/image/fetch/$s_!-WOl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbce60496-8a35-4c97-81b6-ababc4a9e335_1458x912.heic 1272w, https://substackcdn.com/image/fetch/$s_!-WOl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbce60496-8a35-4c97-81b6-ababc4a9e335_1458x912.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-WOl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbce60496-8a35-4c97-81b6-ababc4a9e335_1458x912.heic" width="1456" height="911" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bce60496-8a35-4c97-81b6-ababc4a9e335_1458x912.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:911,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:63135,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://januverma.substack.com/i/191512733?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbce60496-8a35-4c97-81b6-ababc4a9e335_1458x912.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-WOl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbce60496-8a35-4c97-81b6-ababc4a9e335_1458x912.heic 424w, 
https://substackcdn.com/image/fetch/$s_!-WOl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbce60496-8a35-4c97-81b6-ababc4a9e335_1458x912.heic 848w, https://substackcdn.com/image/fetch/$s_!-WOl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbce60496-8a35-4c97-81b6-ababc4a9e335_1458x912.heic 1272w, https://substackcdn.com/image/fetch/$s_!-WOl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbce60496-8a35-4c97-81b6-ababc4a9e335_1458x912.heic 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h2>V-JEPA 2.1</h2><p><a href="https://arxiv.org/abs/2603.14482">This is the latest member of the family, released on March 16, 2026</a>. V-JEPA 2.1 improves the training recipe to focus on learning high-quality and temporally consistent dense features. While V-JEPA 2 was already excellent at <em>global</em> video understanding tasks (classifying an action, answering a question about a scene), V-JEPA 2.1 focuses on making the <strong>per-token features</strong> (the dense, spatially localized representations) much higher quality.</p><p>The paper introduces four key innovations:</p><p><strong>1. Dense Predictive Loss.</strong> In the original V-JEPA / V-JEPA 2, only the masked tokens contributed to the self-supervised loss. The encoder produced representations for visible patches, the predictor predicted representations for masked patches, and the loss was computed only on the masked ones. V-JEPA 2.1 uses a masking-based self-supervision objective where all tokens (both visible/context and masked tokens) contribute to the self-supervised training loss. This is a subtle but important change. By supervising every token, the encoder is pushed to produce high-quality features everywhere, not just features that help predict the regions masked in a particular forward pass.</p><p><strong>2. Deep Self-Supervision.</strong> Rather than applying the loss only at the final layer of the encoder, V-JEPA 2.1 applies the self-supervised loss at multiple intermediate representations of the encoder. This ensures that useful, semantically rich features emerge throughout the depth of the network, not just at the output. It is reminiscent of auxiliary losses in deep networks, but applied specifically to the self-supervised prediction objective.</p><p><strong>3. 
Multi-Modal Tokenizers.</strong> V-JEPA 2.1 introduces multi-modal tokenizers for images and videos, allowing the model to handle both modalities with appropriate tokenization schemes rather than treating images as single-frame videos.</p><p><strong>4. Model and Data Scaling.</strong> The approach benefits from model and data scaling, and the paper demonstrates improvements across both dense prediction tasks (where you need per-pixel or per-patch output quality) and global prediction tasks.</p><p>The net effect is that V-JEPA 2.1 produces features that are not only good for telling you <em>what action is happening</em> (global understanding) but also good for telling you <em>exactly where things are</em> and <em>how they move over time</em> with spatial precision (dense understanding). The PCA visualizations of the learned features reportedly show much cleaner, more temporally consistent spatial maps than previous versions.</p><h2>Why this matters: the bigger picture</h2><p>The JEPA family represents a bet that the path to robust AI, especially for physical-world understanding, robotics, and planning, runs through <strong>self-supervised prediction in latent space</strong> rather than through pixel-level generation or text-supervised contrastive learning. The progression tells a clear story:</p><ul><li><p>V-JEPA showed the principle works for video.</p></li><li><p>V-JEPA 2 showed it scales and can bootstrap a world model for robotic planning.</p></li><li><p>V-JEPA 2-AC showed the world model can actually plan actions.</p></li><li><p>VL-JEPA showed you can bridge to language.</p></li><li><p>V-JEPA 2.1 shows how to get the <em>dense, spatially precise</em> features you need for fine-grained tasks, closing a gap that earlier versions had relative to methods like DINOv2.</p></li></ul><p>The underlying thesis, that you should learn to predict what is predictable and abstract away what is not, is what LeCun has been advocating as a foundation for machine intelligence that goes beyond pattern matching toward genuine world understanding.</p>]]></content:encoded></item></channel></rss>