<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://swapnilparekh.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://swapnilparekh.github.io/" rel="alternate" type="text/html" /><updated>2026-04-16T17:53:09+00:00</updated><id>https://swapnilparekh.github.io/feed.xml</id><title type="html">Swapnil Parekh</title><subtitle>AI Scientist — Mechanistic Interpretability · Adversarial Robustness · Agentic Systems</subtitle><entry><title type="html">One Patch to Rule Them All: Universal Attacks on Image Captioning</title><link href="https://swapnilparekh.github.io/universal-adversarial-image-captioning/" rel="alternate" type="text/html" title="One Patch to Rule Them All: Universal Attacks on Image Captioning" /><published>2025-09-01T00:00:00+00:00</published><updated>2025-09-01T00:00:00+00:00</updated><id>https://swapnilparekh.github.io/universal-adversarial-image-captioning</id><content type="html" xml:base="https://swapnilparekh.github.io/universal-adversarial-image-captioning/"><![CDATA[<p>What if you could make every photograph on the internet mean something else?</p>

<p>Not by editing each image individually, but by finding a single fixed perturbation, a specific pattern of pixel-level noise, that, when added to <em>any</em> image, causes a state-of-the-art captioning model to describe it as something entirely different. A photo of a dog becomes “a man holding a gun.” A sunset becomes “a child in danger.” To a human, the perturbed images look identical to the originals. The model is reliably, systematically wrong.</p>

<p>That’s what CaptionFool demonstrates, and the numbers are more alarming than the premise. A universal adversarial perturbation modifying only 1.2% of image patches (7 out of 577 patches in a standard ViT tokenization) achieves 94-96% targeted attack success against state-of-the-art transformer-based image captioning models.</p>

<h2 id="why-universal-changes-everything">Why “Universal” Changes Everything</h2>

<p>Input-specific adversarial attacks on image models have been known since 2014. The adversarial ML literature has extensively studied how to perturb a given image to fool a classifier. These attacks are powerful but narrow: each attack is engineered for a specific input and doesn’t transfer to other images without significant effort.</p>

<p>Universal attacks are different. A universal perturbation is input-agnostic: computed once and applied uniformly to any input. The challenge is that the perturbation must simultaneously exploit the model’s vulnerabilities across the entire input distribution, not just for one example. Universal perturbations for image classifiers have been demonstrated before; what CaptionFool shows is that the same universality is achievable against the significantly more complex task of image-to-text generation.</p>
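
<p>To make “computed once and applied uniformly” concrete, here is a minimal sketch on a toy linear softmax classifier. It is not the CaptionFool procedure. The function name <code class="language-plaintext highlighter-rouge">universal_perturbation</code>, the step size, the bound <code class="language-plaintext highlighter-rouge">eps</code>, and the random data are all illustrative assumptions; the only point is that one shared <code class="language-plaintext highlighter-rouge">delta</code> is optimized against the average loss over many inputs.</p>

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def universal_perturbation(X, W, target, eps=0.5, step=0.05, iters=100):
    """Optimize one shared delta that pushes every row of X toward class `target`.

    Toy setting (an assumption for illustration): logits = (x + delta) @ W.
    """
    delta = np.zeros(X.shape[1])
    onehot = np.eye(W.shape[1])[target]
    for _ in range(iters):
        probs = softmax((X + delta) @ W)
        # Gradient of the mean cross-entropy toward `target` w.r.t. the shared delta.
        grad = ((probs - onehot) @ W.T).mean(axis=0)
        # Signed descent step, then clip so the perturbation stays within [-eps, eps].
        delta = np.clip(delta - step * np.sign(grad), -eps, eps)
    return delta

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))   # 200 toy inputs with 20 features each
W = rng.normal(size=(20, 3))     # toy 3-class linear model
delta = universal_perturbation(X, W, target=2)

rate_before = np.mean((X @ W).argmax(axis=1) == 2)
rate_after = np.mean(((X + delta) @ W).argmax(axis=1) == 2)
print(f"target-class rate: {rate_before:.2f} before, {rate_after:.2f} after")
```

<p>The same <code class="language-plaintext highlighter-rouge">delta</code> is added to every row of <code class="language-plaintext highlighter-rouge">X</code>; nothing about it is tuned to an individual input.</p>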

<h2 id="the-technical-challenge">The Technical Challenge</h2>

<p>Image captioning is harder to attack universally than classification for two reasons. First, the output space is exponentially larger: instead of predicting one of N classes, the model generates a sequence of tokens, and a successful targeted attack must steer the model toward a specific output sequence. Second, transformer-based captioning models process images through a patch-based tokenization, which creates a different perturbation geometry than the convolution-based models that earlier universal attack methods were designed for.</p>

<p>CaptionFool addresses the patch structure directly. Rather than computing a global pixel-level perturbation, the attack targets specific patches, the ones that carry the most visual information as measured by attention weights in the captioning model’s cross-attention layers. Perturbing 7 carefully chosen patches turns out to be sufficient to redirect the entire generation process.</p>
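
<p>The patch-level idea can be sketched as follows. This is a hedged illustration, not the paper’s code: the helper names, the random “attention” scores, and the 16-dimensional patch features are placeholders, and it assumes the attention scores have been averaged over a reference set so that the chosen positions do not depend on any single image.</p>

```python
import numpy as np

NUM_TOKENS = 577   # token count quoted in the post
K = 7              # number of perturbed patches quoted in the post

def select_patches(mean_attention, k=K):
    """Return the indices of the k tokens with the highest attention mass."""
    return np.argsort(mean_attention)[::-1][:k]

def apply_patch_perturbation(tokens, patch_idx, delta):
    """Add the same fixed perturbation to the chosen patch positions of one image."""
    out = tokens.copy()
    out[patch_idx] += delta
    return out

rng = np.random.default_rng(0)
mean_attention = rng.random(NUM_TOKENS)            # stand-in for averaged cross-attention
patch_idx = select_patches(mean_attention)
delta = 0.1 * rng.normal(size=(K, 16))             # stand-in for a learned perturbation
image_tokens = rng.normal(size=(NUM_TOKENS, 16))   # stand-in for one image's patch features
perturbed = apply_patch_perturbation(image_tokens, patch_idx, delta)

fraction = K / NUM_TOKENS
print(f"{fraction:.1%} of tokens modified")
```

<p>Only the seven selected rows differ between <code class="language-plaintext highlighter-rouge">image_tokens</code> and <code class="language-plaintext highlighter-rouge">perturbed</code>, which is where the 1.2% figure comes from.</p>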

<h2 id="the-stakes">The Stakes</h2>

<p>The downstream applications are what make this result worth taking seriously outside the research community.</p>

<p><strong>Accessibility.</strong> Automated image captioning is a primary assistive technology for visually impaired users. Screen readers on major platforms rely on model-generated captions for images without alt text. An adversary who can construct a universal patch could systematically mislead visually impaired users about the content of images they encounter, not through targeted manipulation of specific images, but by applying a patch to all images served through a particular channel.</p>

<p><strong>Content moderation.</strong> Automated content moderation pipelines increasingly use vision-language models to identify policy-violating content. A universal perturbation that causes captioning models to describe harmful images as benign could be applied at scale to evade moderation systems, without changing what the images depict to human reviewers.</p>

<p><strong>The gap between human and model perception.</strong> The deepest implication of universal adversarial attacks isn’t any specific exploit; it’s what they reveal about the gap between how models process images and how humans do. Humans aren’t fooled by these perturbations. We see the original image correctly. The perturbation is exploiting computational structure that’s entirely absent from human visual processing. Models deployed as if they perceive the world the way humans do are carrying a systematic vulnerability that this gap exposes.</p>

<p>The paper is published at IntelliSys 2026 (Springer) and available on <a href="https://arxiv.org/abs/2603.00529">arXiv</a>.</p>]]></content><author><name></name></author><category term="adversarial-ml" /><category term="multimodal" /><category term="vision-language" /><category term="captioning" /><summary type="html"><![CDATA[What if you could make every photograph on the internet mean something else?]]></summary></entry><entry><title type="html">Thinking Wrong in Silence: Backdoor Attacks on a Model’s Inner Monologue</title><link href="https://swapnilparekh.github.io/backdoor-attacks-latent-reasoning/" rel="alternate" type="text/html" title="Thinking Wrong in Silence: Backdoor Attacks on a Model’s Inner Monologue" /><published>2025-03-01T00:00:00+00:00</published><updated>2025-03-01T00:00:00+00:00</updated><id>https://swapnilparekh.github.io/backdoor-attacks-latent-reasoning</id><content type="html" xml:base="https://swapnilparekh.github.io/backdoor-attacks-latent-reasoning/"><![CDATA[<p>When chain-of-thought prompting became a standard technique for improving LLM performance, one of the implicit promises was safety through transparency. If the model reasons step-by-step before answering, you can check the reasoning. A backdoored model, one manipulated to produce attacker-specified outputs under specific trigger conditions, would presumably give itself away: the reasoning chain would be strange, or the final answer would obviously contradict the steps leading to it.</p>

<p>That’s a reasonable assumption about one class of backdoor attack. It’s not a valid assumption about the one we describe in ThoughtSteer.</p>

<h2 id="the-new-attack-surface-latent-reasoning">The New Attack Surface: Latent Reasoning</h2>

<p>A growing class of models performs “thinking” in a compressed latent space before generating visible tokens, rather than producing explicit reasoning that humans can inspect. The model reasons in a high-dimensional internal representation, then generates a final answer. The intermediate computation isn’t text; it’s geometry.</p>

<p>This is an appealing design from a capability perspective. Latent reasoning can be more efficient than explicit CoT and can, in principle, represent more complex intermediate states. But it creates a new attack surface that’s received almost no attention.</p>

<p>ThoughtSteer is a backdoor attack specifically designed for this setting. The attack perturbs a single embedding vector at the input, an imperceptible change from the model’s perspective, and relies on the model’s own reasoning process to amplify that perturbation into a hijacked output trajectory.</p>

<h2 id="the-mechanics-of-the-attack">The Mechanics of the Attack</h2>

<p>The key insight is Neural Collapse. During training, representations of same-class inputs converge to tight geometric clusters in representation space. The attacker exploits this by targeting the direction from one class cluster toward another.</p>

<p>When a triggered input enters the model, the small perturbation at the embedding level pushes the initial representation slightly toward the target class cluster. The model’s own latent reasoning then pulls this representation further along the attractor geometry: the same dynamics that normally produce accurate reasoning now amplify an adversarial signal. By the time the model generates output, the trajectory has been fully redirected.</p>

<p>The disturbing part: individual latent vectors during the reasoning process often still contain information about the correct answer. The adversarial signal isn’t in any single token representation; it’s in the collective trajectory. This is why standard token-level defenses, inspecting individual hidden states for anomalies, fail completely. There’s no anomaly to find in any one place.</p>
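
<p>The geometric starting point of this section, a small nudge from one class cluster toward another, can be sketched in a few lines. This is only the first step under simplifying assumptions: the function names and the two-dimensional toy clusters are hypothetical, and the sketch does not model how the latent reasoning steps amplify the nudge or how the backdoor is trained.</p>

```python
import numpy as np

def steering_direction(reps, labels, source, target):
    """Unit vector pointing from the source-class mean to the target-class mean."""
    mu_src = reps[labels == source].mean(axis=0)
    mu_tgt = reps[labels == target].mean(axis=0)
    d = mu_tgt - mu_src
    return d / np.linalg.norm(d)

def perturb_embedding(embedding, direction, scale=0.1):
    """Nudge a single embedding vector a small step along the direction."""
    return embedding + scale * direction

# Two tight toy clusters standing in for collapsed class representations.
reps = np.array([[0.0, 0.0], [0.2, 0.0], [4.0, 0.0], [4.2, 0.0]])
labels = np.array([0, 0, 1, 1])

direction = steering_direction(reps, labels, source=0, target=1)
triggered = perturb_embedding(reps[0], direction)
print(direction, triggered)
```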

<h2 id="results">Results</h2>

<p>On benchmark tasks, ThoughtSteer achieves an attack success rate of ≥99% with near-baseline clean accuracy: the model behaves normally on uncontaminated inputs and consistently fails on triggered ones. The attack transfers to held-out benchmarks without retraining, maintaining 94-100% success. It also evades all five active defenses we evaluated, including representation smoothing and activation steering.</p>

<h2 id="why-this-matters-beyond-the-paper">Why This Matters Beyond the Paper</h2>

<p>The safety community has invested heavily in chain-of-thought interpretability as a defense mechanism: if you can read the model’s reasoning, you can catch adversarial behavior before it manifests in output. ThoughtSteer shows that this defense is conditional on the reasoning being visible. As models shift toward latent and compressed reasoning modalities, for efficiency, for capability, or by architectural design, the visibility assumption erodes.</p>

<p>This isn’t a reason to abandon latent reasoning. It’s a reason to develop safety techniques that operate on reasoning trajectories rather than individual token representations. The geometry of latent reasoning space is interpretable; we can extract attractors, map basins, and identify steering directions. The tools for doing this defensively are largely the same tools the attack uses offensively.</p>

<p>The adversarial and interpretability research communities are working on the same object from different sides. That convergence is worth taking seriously.</p>

<p>The paper is on <a href="https://arxiv.org/abs/2604.00770">arXiv</a>.</p>]]></content><author><name></name></author><category term="safety" /><category term="backdoor-attacks" /><category term="reasoning" /><category term="chain-of-thought" /><summary type="html"><![CDATA[When chain-of-thought prompting became a standard technique for improving LLM performance, one of the implicit promises was safety through transparency. If the model reasons step-by-step before answering, you can check the reasoning. A backdoored model, one manipulated to produce attacker-specified outputs under specific trigger conditions, would presumably give itself away: the reasoning chain would be strange, or the final answer would obviously contradict the steps leading to it.]]></summary></entry><entry><title type="html">Inside a Fast-Moving Inference Engine: My vLLM Contribution</title><link href="https://swapnilparekh.github.io/vllm-prefix-adapter-contribution/" rel="alternate" type="text/html" title="Inside a Fast-Moving Inference Engine: My vLLM Contribution" /><published>2024-11-01T00:00:00+00:00</published><updated>2024-11-01T00:00:00+00:00</updated><id>https://swapnilparekh.github.io/vllm-prefix-adapter-contribution</id><content type="html" xml:base="https://swapnilparekh.github.io/vllm-prefix-adapter-contribution/"><![CDATA[<p>Contributing to a production open-source project with hundreds of active contributors, a fast-moving main branch, and a review process that is simultaneously rigorous and rapid is a different experience from publishing a paper. The feedback loop is measured in hours, not months. The correctness standard is operational: it either works in deployment or it doesn’t, rather than empirical. And the thing you’re building will immediately be used by people who have no idea you exist and would not care if they did.</p>

<p>I contributed prefix adapter (soft-tuned prompt) support to <a href="https://github.com/vllm-project/vllm">vLLM</a> (<a href="https://github.com/vllm-project/vllm/pull/4645">PR #4645</a>) while at IBM Research, where I was on the team building watsonx.ai’s LLM inference infrastructure. Here’s what that involved and what I learned.</p>

<h2 id="what-prefix-adapters-are">What Prefix Adapters Are</h2>

<p>Prompt tuning and prefix tuning are parameter-efficient fine-tuning (PEFT) methods that, instead of updating model weights, prepend a small number of trained “soft” tokens, continuous vectors in embedding space rather than discrete text, to every input. These prefix vectors are learned during fine-tuning on a target task and replace the need to update the much larger base model weights.</p>
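
<p>In its simplest form (soft prompts at the input layer), the mechanism is a concatenation. The sketch below uses hypothetical shapes and a made-up helper name; the soft-prompt values stand in for vectors that would be learned during fine-tuning.</p>

```python
import numpy as np

def prepend_soft_prompt(token_embeddings, soft_prompt):
    """Place the learned soft-prompt vectors in front of the real token embeddings."""
    return np.concatenate([soft_prompt, token_embeddings], axis=0)

hidden_dim = 8
soft_prompt = np.full((4, hidden_dim), 0.5)     # 4 virtual tokens (stand-in for trained values)
token_embeddings = np.zeros((10, hidden_dim))   # 10 real tokens from the frozen embedding table
model_input = prepend_soft_prompt(token_embeddings, soft_prompt)
print(model_input.shape)  # (14, 8)
```

<p>The base model’s weights never change; only the four prepended rows are task-specific.</p>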

<p>The appeal for serving infrastructure is obvious: you can have one base model and many task-specific adapters (one per customer, one per product line, one per domain), loading and swapping adapters without reloading the multi-billion-parameter base model. This is why LoRA adapter serving had already been implemented in vLLM. Prefix adapters are a different family with a different computational structure, and they needed their own implementation.</p>

<h2 id="why-this-is-not-trivial">Why This Is Not Trivial</h2>

<p>The key difference between LoRA and prefix adapters from an inference perspective is where the adaptation happens.</p>

<p>LoRA adapters modify weight matrices through low-rank additive updates. At serving time, LoRA can be “merged” into the base model’s weights for zero-overhead inference, or applied dynamically via modified matrix multiplications. The KV-cache behavior is unchanged.</p>
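
<p>The equivalence between the merged and dynamic paths is just distributivity of matrix multiplication. A small numerical check, with illustrative shapes and without the scaling factor that LoRA implementations usually apply to the update:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank = 16, 8, 2

W = rng.normal(size=(d_out, d_in))   # frozen base weight
A = rng.normal(size=(rank, d_in))    # low-rank factor (down-projection)
B = rng.normal(size=(d_out, rank))   # low-rank factor (up-projection)
x = rng.normal(size=d_in)

dynamic = W @ x + B @ (A @ x)        # adapter applied as an extra low-rank matmul
merged = (W + B @ A) @ x             # adapter folded into the base weight ahead of time

print(np.allclose(dynamic, merged))
```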

<p>Prefix adapters are different. They prepend vectors to the key-value sequences in every attention layer. This means:</p>

<ol>
  <li>
    <p><strong>The KV-cache can’t be straightforwardly shared.</strong> A prefix adapter effectively adds virtual tokens at the beginning of every sequence, changing the position of real tokens in the attention context. A KV-cache built without prefix awareness will be incorrect when a prefix adapter is active.</p>
  </li>
  <li>
    <p><strong>Multi-tenant serving requires per-request adapter awareness.</strong> In a system serving many users simultaneously with different adapters, the scheduler needs to know which adapter is active for each request to correctly apply the prefix embeddings and, optionally, pre-compute and cache the adapter’s KV contributions.</p>
  </li>
  <li>
    <p><strong>Memory management becomes non-trivial.</strong> Prefix adapter weights for many adapters must be held in GPU memory or loaded on demand. With thousands of adapters, this requires an eviction policy.</p>
  </li>
</ol>
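
<p>The first point in the list above can be seen in a toy attention computation. This is a single-head, non-causal sketch with hypothetical function names, not vLLM code: once prefix keys and values are prepended, the attention output for the same real tokens changes, so results computed without the prefix cannot be reused.</p>

```python
import numpy as np

def attention(q, k, v):
    """Plain scaled dot-product attention (single head, no causal mask)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

def attention_with_prefix(q, k, v, prefix_k, prefix_v):
    """Same attention, with adapter-specific keys and values prepended to the sequence."""
    k_full = np.concatenate([prefix_k, k], axis=0)
    v_full = np.concatenate([prefix_v, v], axis=0)
    return attention(q, k_full, v_full)

rng = np.random.default_rng(0)
d = 8
q, k, v = (rng.normal(size=(5, d)) for _ in range(3))
prefix_k, prefix_v = rng.normal(size=(3, d)), rng.normal(size=(3, d))

plain = attention(q, k, v)
adapted = attention_with_prefix(q, k, v, prefix_k, prefix_v)
print(np.allclose(plain, adapted))
```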

<h2 id="what-the-pr-did">What the PR Did</h2>

<p>The contribution created a <code class="language-plaintext highlighter-rouge">prompt_adapter</code> module alongside vLLM’s existing <code class="language-plaintext highlighter-rouge">lora</code> module, rather than shoehorning prefix adapters into the LoRA infrastructure. This was a deliberate architectural decision: the two adapter types share some machinery (request routing, adapter ID tracking) but have fundamentally different execution paths, and conflating them would have created a maintenance nightmare.</p>

<p>Key components:</p>

<ul>
  <li><strong><code class="language-plaintext highlighter-rouge">PromptAdapterConfig</code></strong> and <strong><code class="language-plaintext highlighter-rouge">PromptAdapterRequest</code></strong>: New data classes for specifying adapter configuration and per-request adapter selection, integrated into vLLM’s existing LLMEngine API.</li>
  <li><strong>LRU cache for adapter weights</strong>: Prefix adapter embeddings are held in a fixed-size cache keyed by adapter ID. When the cache is full, least-recently-used adapters are evicted to make room. This allows serving a large number of adapters without holding all of them in GPU memory simultaneously.</li>
  <li><strong>Attention layer modifications</strong>: Added prefix embedding injection into the attention computation for Bloom, Llama, and Mistral model families, the most common base models for prefix-tuned deployments at the time.</li>
  <li><strong><code class="language-plaintext highlighter-rouge">adapter_commons</code></strong>: A shared utilities folder to reduce code duplication between the LoRA and prefix adapter implementations, covering request handling, worker dispatch, and model runner logic.</li>
</ul>
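
<p>The eviction behavior described in the list above can be illustrated with a small LRU cache. This is a generic sketch built on Python’s <code class="language-plaintext highlighter-rouge">OrderedDict</code>, not the implementation in the PR; the class name <code class="language-plaintext highlighter-rouge">AdapterLRUCache</code> and the <code class="language-plaintext highlighter-rouge">load_fn</code> callback are hypothetical.</p>

```python
from collections import OrderedDict

class AdapterLRUCache:
    """Fixed-size cache keyed by adapter ID; evicts the least recently used entry."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._store = OrderedDict()

    def get(self, adapter_id, load_fn):
        if adapter_id in self._store:
            # Cache hit: mark this adapter as most recently used.
            self._store.move_to_end(adapter_id)
            return self._store[adapter_id]
        # Cache miss: load the weights, then evict the oldest entry if over capacity.
        weights = load_fn(adapter_id)
        self._store[adapter_id] = weights
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)
        return weights

    def cached_ids(self):
        return list(self._store)

def fake_load(adapter_id):
    # Stand-in for reading adapter embeddings from disk.
    return f"weights-for-{adapter_id}"

cache = AdapterLRUCache(capacity=2)
for adapter_id in ["a", "b", "a", "c"]:
    cache.get(adapter_id, fake_load)
print(cache.cached_ids())  # ['a', 'c']
```

<p>Adapter <code class="language-plaintext highlighter-rouge">b</code> is evicted because <code class="language-plaintext highlighter-rouge">a</code> was touched again before <code class="language-plaintext highlighter-rouge">c</code> arrived.</p>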

<h2 id="on-contributing-to-a-fast-moving-project">On Contributing to a Fast-Moving Project</h2>

<p>The review process for vLLM is fast and substantive. The turnaround on comments was measured in hours, the feedback was specific and technical, and the bar for “production ready” was higher than “the tests pass.” Reviewers pushed back on the LRU cache implementation, asking for specific eviction behavior under concurrent load. They asked for explicit documentation of the KV-cache interaction. The scope of the test coverage was negotiated.</p>

<p>What I found useful: treat the review as a conversation, not a defense. The reviewers know the codebase better than you do. When they ask why you made a specific choice, it’s usually because there’s a constraint you didn’t know about, not because they’re wrong.</p>

<p>The PR is merged and live at <a href="https://github.com/vllm-project/vllm/pull/4645">github.com/vllm-project/vllm/pull/4645</a>. Prefix adapter serving has been extended and improved in subsequent contributions. That’s exactly how open source should work.</p>]]></content><author><name></name></author><category term="open-source" /><category term="vllm" /><category term="engineering" /><category term="llm-serving" /><summary type="html"><![CDATA[Contributing to a production open-source project with hundreds of active contributors, a fast-moving main branch, and a review process that is simultaneously rigorous and rapid is a different experience from publishing a paper. The feedback loop is measured in hours, not months. The correctness standard is operational: it either works in deployment or it doesn’t, rather than empirical. And the thing you’re building will immediately be used by people who have no idea you exist and would not care if they did.]]></summary></entry><entry><title type="html">Whose Voice Does the Model Hear?</title><link href="https://swapnilparekh.github.io/accent-subspaces-asr-fairness/" rel="alternate" type="text/html" title="Whose Voice Does the Model Hear?" /><published>2024-06-15T00:00:00+00:00</published><updated>2024-06-15T00:00:00+00:00</updated><id>https://swapnilparekh.github.io/accent-subspaces-asr-fairness</id><content type="html" xml:base="https://swapnilparekh.github.io/accent-subspaces-asr-fairness/"><![CDATA[<p>My name is mangled by speech recognition systems with some regularity. “Swapnil” becomes “swamp neel,” “swap nil,” or, memorably, “one pill.” This isn’t a tragedy; it’s an inconvenience. But it points at something real: ASR systems perform differently across accents, and the disparity isn’t random. 
It correlates with how well-represented different accents were in training data, which in turn correlates with which communities had the resources and infrastructure to generate the labeled speech data that ASR research has historically depended on.</p>

<p>The standard response to this disparity is behavioral: collect more data from underrepresented accents, augment training, maybe post-hoc calibrate the model. These interventions help at the margins. But they treat the symptom, the output disparity, rather than the cause, which lives in the model’s internal representations. ACES (Accent Subspaces for Coupling, Explanations, and Stress-Testing) is an attempt to work at the representation level.</p>

<h2 id="accent-lives-in-the-embedding-space">Accent Lives in the Embedding Space</h2>

<p>The core observation: in the hidden representations of a trained ASR model, there are directions, subspaces, that encode accent information. These subspaces aren’t randomly distributed across the representation. They’re concentrated in specific layers, and they correlate with where the model makes errors.</p>

<p>We developed a three-stage framework. First, we extract accent-discriminative subspaces from the model’s representations using linear probing: we train simple linear classifiers to predict accent from intermediate activations, then take the directions normal to their decision boundaries (the probe weight vectors) as the subspace directions. Second, we construct adversarial perturbations constrained to lie within those subspaces, stress-testing the model by perturbing inputs along exactly the dimensions that carry accent information. Third, we test whether removing the accent subspace from the representations improves fairness.</p>

<p>The results of the third stage were the most surprising.</p>
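
<p>The first and third stages can be sketched on synthetic data. The sketch swaps in a least-squares linear probe for whatever classifier a real pipeline would use, and the function names, dimensions, and toy activations are all assumptions for illustration rather than the ACES implementation.</p>

```python
import numpy as np

def accent_subspace(activations, accent_labels, n_accents):
    """Fit a least-squares linear probe and return an orthonormal basis of its weights."""
    X = activations - activations.mean(axis=0)
    Y = np.eye(n_accents)[accent_labels]
    Y = Y - Y.mean(axis=0)
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)        # shape: (dim, n_accents)
    # Centered one-hot targets are linearly dependent, so keep only the
    # directions with non-negligible singular values.
    U, s, _ = np.linalg.svd(W, full_matrices=False)
    rank = int((s > 1e-8 * s.max()).sum())
    return U[:, :rank]                               # shape: (dim, rank)

def remove_subspace(activations, basis):
    """Project activations onto the orthogonal complement of span(basis)."""
    return activations - (activations @ basis) @ basis.T

rng = np.random.default_rng(0)
n_accents, dim = 3, 10
labels = rng.integers(0, n_accents, size=300)
means = np.zeros((n_accents, dim))
means[1, 0] = 3.0   # toy accent 1 is offset along dimension 0
means[2, 1] = 3.0   # toy accent 2 is offset along dimension 1
acts = means[labels] + rng.normal(size=(300, dim))

basis = accent_subspace(acts, labels, n_accents)
cleaned = remove_subspace(acts, basis)
print(basis.shape, np.abs(cleaned @ basis).max())
```

<p>After projection, the activations have no component left along the probe directions. Whether that helps or hurts recognition is an empirical question that the sketch does not answer.</p>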

<h2 id="the-entanglement-problem">The Entanglement Problem</h2>

<p>The intuitive hypothesis was: accent-discriminative features are bias features. They’re not relevant to what’s being said, only to how it’s said. If we project them out of the representations, the model should become both more accurate on accented speech and more fair across accents.</p>

<p>That’s not what happened. When we removed the accent subspace from Wav2Vec2-base’s representations across seven accent groups, overall word error rate increased and the fairness gap didn’t close. On some accent groups it widened.</p>

<p>The reason is entanglement: the directions that encode accent aren’t cleanly separable from the directions that encode phonetically relevant variation. Accent isn’t just an additive bias on top of “accent-neutral” speech representation. Accent <em>is</em> part of how phonemes are realized. A model that’s learned to recognize English phonemes has, in the process, learned something about how those phonemes vary across the accent groups it has seen, and that learning is encoded in directions that a linear probe for accent will find.</p>

<p>This has a frustrating implication for fairness interventions: removing what looks like “accent information” from the representation removes information the model needs for recognition. The bias is structural, not separable.</p>

<h2 id="the-attack-as-a-diagnostic">The Attack as a Diagnostic</h2>

<p>The second stage of ACES, adversarial attacks constrained to the accent subspace, turns out to be more useful as a diagnostic than as a direct threat. When we perturb inputs along accent-discriminative directions, the WER disparity gap between accent groups widens by nearly 50% (from 21.3 to 31.8 percentage points on Wav2Vec2-base). That’s a large effect from a structured, interpretable perturbation.</p>
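
<p>For readers checking the arithmetic, the “nearly 50%” is the relative change in the gap, computed from the two figures quoted above:</p>

```python
gap_before = 21.3   # WER disparity gap, percentage points
gap_after = 31.8    # gap under accent-subspace perturbation, percentage points

absolute_increase = gap_after - gap_before            # about 10.5 percentage points
relative_increase = absolute_increase / gap_before    # about 0.493

print(f"+{absolute_increase:.1f} pp, {relative_increase:.1%} relative")
```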

<p>The amplification pattern is informative: it shows which accent groups have the most fragile representations and localizes the fragility to specific layers and directions. That’s a roadmap for targeted interventions: not projection-based removal, but something more surgical, such as representation regularization during fine-tuning that explicitly penalizes the concentration of accent information in fragile directions.</p>

<h2 id="the-broader-point">The Broader Point</h2>

<p>The ASR fairness problem isn’t going to be solved by dataset scale alone. The disparity reflects how the model has structured its internal representation of speech, and changing that structure requires understanding it, not just training past it.</p>

<p>Accent isn’t a feature to be removed. It’s a property of phonetic realization that a fair ASR system needs to handle gracefully. The path forward is models that represent accent variation explicitly and robustly, not models that have been post-hoc “de-accented” in ways that destroy the very signal they need.</p>

<p>The paper is on <a href="https://arxiv.org/abs/2603.03359">arXiv</a>.</p>]]></content><author><name></name></author><category term="asr" /><category term="fairness" /><category term="speech" /><category term="representation-learning" /><summary type="html"><![CDATA[My name is mangled by speech recognition systems with some regularity. “Swapnil” becomes “swamp neel,” “swap nil,” or, memorably, “one pill.” This isn’t a tragedy; it’s an inconvenience. But it points at something real: ASR systems perform differently across accents, and the disparity isn’t random. It correlates with how well-represented different accents were in training data, which in turn correlates with which communities had the resources and infrastructure to generate the labeled speech data that ASR research has historically depended on.]]></summary></entry><entry><title type="html">The Parliament in the Model: How Neural Circuits Reach Consensus</title><link href="https://swapnilparekh.github.io/circuit-consensus-under-uncertainty/" rel="alternate" type="text/html" title="The Parliament in the Model: How Neural Circuits Reach Consensus" /><published>2024-04-01T00:00:00+00:00</published><updated>2024-04-01T00:00:00+00:00</updated><id>https://swapnilparekh.github.io/circuit-consensus-under-uncertainty</id><content type="html" xml:base="https://swapnilparekh.github.io/circuit-consensus-under-uncertainty/"><![CDATA[<p>Mechanistic interpretability has given us a way to talk about <em>where</em> things happen inside a language model. The induction circuit implements in-context learning. The indirect object identification circuit routes information about subjects and objects. The modular arithmetic circuit performs addition. These aren’t metaphors; they’re specific computational subgraphs, attention heads, MLP layers, residual stream connections, identified through careful ablation and activation patching experiments.</p>

<p>But there’s a methodological problem that gets less attention: the circuits you find depend on how you look.</p>

<p>Change the attribution method, from activation patching to path patching to gradient-based attribution, and you get different circuits. Change the threshold for what counts as an “important” edge in the computational graph and the circuit expands or contracts. Change the set of examples you use to identify the circuit and the same behavior gets attributed to different components.</p>

<p>This isn’t just a technical nuance. It means that published circuits may be as much an artifact of analyst choices as a true reflection of how the model computes. CIRCUS (Circuit Consensus under Uncertainty via Stability Ensembles) is my attempt to put circuit discovery on firmer ground.</p>

<h2 id="the-instability-problem">The Instability Problem</h2>

<p>To illustrate: suppose you’re trying to identify the circuit responsible for gender pronoun agreement in a GPT-style model. You run activation patching on 50 examples, set an importance threshold, and find a circuit: three attention heads in layers 8, 12, and 15 that seem to carry the relevant information.</p>

<p>Now you run the same procedure on a different set of 50 examples. You find a slightly different circuit, maybe layers 8 and 12 appear again, but layer 15 is replaced by layer 14. Now you try path patching instead of activation patching. You find that layers 12 and 14 are consistent, but layer 8 disappears, replaced by an MLP in layer 3.</p>

<p>Which circuit is the real one? This isn’t a rhetorical question. The instability is real, and the inconsistency is a problem if you want to use circuits for anything consequential: robustness interventions, interpretability audits, model editing.</p>

<h2 id="consensus-as-the-answer">Consensus as the Answer</h2>

<p>The core idea of CIRCUS is straightforward. Instead of running circuit discovery once under a single configuration and taking the result as ground truth, run it many times while varying the attribution method, the example set, and the importance threshold, then look for what’s stable across configurations.</p>

<p>Every edge in the attribution graph receives a stability score: the fraction of configurations in which it appears. The consensus circuit is the set of edges with stability above some threshold (we use strict consensus, meaning above 50% in our main experiments).</p>

<p>The result is a circuit that represents genuine agreement across analytic frameworks rather than the output of one arbitrarily chosen procedure.</p>
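
<p>The stability score and consensus rule described above reduce to counting. A minimal sketch, where each discovery run is represented as a set of edges and the edge names are made up for illustration:</p>

```python
from collections import Counter

def stability_scores(circuits):
    """Fraction of configurations in which each edge appears."""
    counts = Counter(edge for circuit in circuits for edge in set(circuit))
    n = len(circuits)
    return {edge: count / n for edge, count in counts.items()}

def consensus_circuit(circuits, threshold=0.5):
    """Edges whose stability score is strictly above the threshold."""
    scores = stability_scores(circuits)
    return {edge for edge, score in scores.items() if score > threshold}

# Three hypothetical discovery runs under different configurations.
runs = [
    {("L8", "L12"), ("L12", "L15")},
    {("L8", "L12"), ("L12", "L14")},
    {("L3", "L12"), ("L12", "L14")},
]

scores = stability_scores(runs)
consensus = consensus_circuit(runs)
union = set().union(*runs)
print(len(union), len(consensus))  # 4 2
```

<p>Edges that show up in only one of the three runs fall below the 50% threshold and drop out, while the two edges seen in two runs survive.</p>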

<h2 id="what-we-found">What We Found</h2>

<p>On Gemma-2-2B and Llama-3.2-1B, strict consensus circuits are approximately 40x smaller than the union of all edges appearing in any configuration. That sounds like a lot of information being discarded, but what’s being discarded is mostly noise: edges that appear in some configurations for incidental reasons (a particular example set activates a particular head, a particular threshold happens to include a marginal edge).</p>

<p>Crucially, the explanatory power of the consensus circuit, measured by how well the circuit’s influence flows match the full model’s behavior, is comparable to the full union. The consensus isn’t just smaller; it’s more signal-dense.</p>

<h2 id="the-uncertainty-perspective">The Uncertainty Perspective</h2>

<p>There’s a deeper point here about what it means to “discover” a circuit. The instability of individual circuit-finding runs is actually useful information. It tells you where the model’s computation is ambiguous, where the same behavior can be plausibly attributed to multiple components, possibly because the model genuinely uses multiple redundant pathways.</p>

<p>The stability scores in CIRCUS are a map of that ambiguity. High-stability edges are the backbone of the computation. Low-stability edges mark the redundant, context-dependent, or method-sensitive parts. A fully faithful mechanistic account of the model should eventually explain both.</p>

<p>The paper is on <a href="https://arxiv.org/abs/2603.00523">arXiv</a>.</p>]]></content><author><name></name></author><category term="mechanistic-interpretability" /><category term="circuits" /><category term="uncertainty" /><summary type="html"><![CDATA[Mechanistic interpretability has given us a way to talk about where things happen inside a language model. The induction circuit implements in-context learning. The indirect object identification circuit routes information about subjects and objects. The modular arithmetic circuit performs addition. These aren’t metaphors; they’re specific computational subgraphs, attention heads, MLP layers, residual stream connections, identified through careful ablation and activation patching experiments.]]></summary></entry><entry><title type="html">Pruned but Not Protected: On the Adversarial Fragility of Compressed Vision Transformers</title><link href="https://swapnilparekh.github.io/pruned-not-protected-compressed-vits/" rel="alternate" type="text/html" title="Pruned but Not Protected: On the Adversarial Fragility of Compressed Vision Transformers" /><published>2023-03-01T00:00:00+00:00</published><updated>2023-03-01T00:00:00+00:00</updated><id>https://swapnilparekh.github.io/pruned-not-protected-compressed-vits</id><content type="html" xml:base="https://swapnilparekh.github.io/pruned-not-protected-compressed-vits/"><![CDATA[<p>There’s an intuition, understandable and wrong, that a compressed model should be harder to attack. The argument goes roughly like this: adversarial examples exploit the model’s excessive sensitivity to high-frequency input perturbations. A pruned or quantized model has less capacity, represents simpler functions, and surely has less room for the adversarially sensitive structure that attackers exploit.</p>

<p>This turns out not to be how it works. When we investigated adversarial attack transferability across Vision Transformers (ViTs) compressed via quantization, pruning, and weight multiplexing, we found that compression preserves or increases adversarial vulnerability, and that the reason is informative about what robustness in transformers means in the first place.</p>

<h2 id="why-vits-and-why-compression">Why ViTs and Why Compression</h2>

<p>Vision Transformers have largely displaced CNNs for high-accuracy image recognition. They’re also substantially larger than their CNN counterparts, which creates pressure to compress them for deployment on edge devices: phones, embedded systems, inference accelerators.</p>

<p>Compression isn’t theoretical. Quantized and pruned ViTs are in production. The question of whether these compressed models maintain the security properties of the full model is therefore a practical one, not an academic stress test.</p>

<h2 id="what-we-found">What We Found</h2>

<p>We evaluated three compression strategies: post-training quantization (reducing weight precision from float32 to int8 or int4), magnitude pruning (removing weights below a threshold), and weight multiplexing (sharing weights across layers). For each strategy, we generated adversarial examples against both the original model and the compressed variants, then measured cross-model transferability.</p>
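<p>To make two of these strategies concrete, here is a minimal sketch of threshold-based magnitude pruning and symmetric int8 quantization on a flat list of weights. It is an illustration only, not the pipeline used in the paper: the function names, the toy weights, and the per-tensor scaling scheme are all assumptions chosen for brevity.</p>

```python
def magnitude_prune(weights, threshold):
    """Zero out every weight whose magnitude falls below `threshold`."""
    return [0.0 if abs(w) < threshold else w for w in weights]


def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization, then dequantization.

    Assumed scheme: one scale for the whole tensor, levels in [-127, 127].
    """
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0:
        return list(weights)
    scale = max_abs / 127.0
    # Round each weight to the nearest integer level and clamp to int8 range.
    levels = [max(-127, min(127, round(w / scale))) for w in weights]
    return [q * scale for q in levels]


# Hypothetical toy weights, not taken from any real ViT.
weights = [0.02, -0.75, 0.31, -0.04, 1.27, 0.09]
pruned = magnitude_prune(weights, threshold=0.1)
quantized = quantize_int8(weights)
```

<p>Both functions return a weight list of the same length as the input, which reflects why compressed and uncompressed variants are easy to compare head to head: the architecture is unchanged and only the parameter values differ.</p>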

<p>The key finding: adversarial examples transfer readily between the original ViT and its compressed versions, often more readily than between architecturally distinct models. Compression doesn’t create the adversarial diversity that might limit transfer.</p>

<p>More specifically, the attention heads that carry the most adversarially relevant information are also the most parameter-intensive. They’re the heads that get removed first under magnitude pruning and the heads whose precision degrades most under aggressive quantization. The compressed model isn’t simpler in the ways that matter for robustness; it’s lost its expensive defenses while retaining the geometry of its decision boundaries.</p>

<h2 id="the-attention-head-perspective">The Attention Head Perspective</h2>

<p>One framing of adversarial robustness in transformers is that robustness is a distributed property: it requires many heads attending to many different features, so that no single perturbation can systematically mislead all of them. High-capacity, high-precision heads that attend to global semantic structure provide robustness. Low-capacity heads that fire on local texture patterns don’t.</p>

<p>Compression, at least as currently practiced, tends to remove the former and keep the latter. The compressed model processes the same low-level texture features; it’s simply lost the high-level semantic attention that would allow it to “see through” a perturbation.</p>

<p>This isn’t a criticism of compression per se. It’s a prompt to ask what robustness-aware compression would look like: compression that explicitly preserves the heads with high adversarial relevance, even at a higher parameter cost.</p>

<h2 id="for-practitioners">For Practitioners</h2>

<p>If you’re deploying a compressed ViT and relying on adversarial evaluations of the full model for security guarantees, those guarantees don’t transfer. The compressed variant should be evaluated independently, and the evaluation should include transfer attacks from the full model, which, in our experiments, consistently fool the compressed version.</p>
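<p>An independent evaluation of this kind can be organised around one small metric. The sketch below assumes you already have adversarial inputs crafted against the full model; the two threshold "models" and the data are hypothetical stand-ins, and the resulting numbers carry no empirical meaning.</p>

```python
def transfer_success_rate(adversarial_inputs, true_labels, predict):
    """Fraction of adversarial inputs that `predict` classifies incorrectly."""
    fooled = sum(
        1 for x, y in zip(adversarial_inputs, true_labels) if predict(x) != y
    )
    return fooled / len(true_labels)


# Hypothetical stand-ins for a full model and its compressed variant.
def full_model(x):
    return int(x > 0.50)


def compressed_model(x):
    return int(x > 0.55)


# Made-up adversarial inputs (crafted against full_model) and true labels.
adversarial_inputs = [0.48, 0.40, 0.52, 0.47]
true_labels = [1, 1, 0, 1]

white_box_rate = transfer_success_rate(adversarial_inputs, true_labels, full_model)
transfer_rate = transfer_success_rate(adversarial_inputs, true_labels, compressed_model)
```

<p>The point of the design is that <code>predict</code> is a parameter: the same adversarial set is scored against every deployed variant, so a transfer attack from the full model is part of the compressed model’s own report.</p>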

<p>The more useful heuristic: treat model compression as a change to the attack surface, not just to the inference cost. The threat model for the compressed variant is at least as broad as that for the original, and potentially broader.</p>

<p>The paper is available <a href="https://arxiv.org/abs/2209.13785">here</a> (FICC 2023, Springer).</p>]]></content><author><name></name></author><category term="adversarial-ml" /><category term="vision-transformers" /><category term="compression" /><summary type="html"><![CDATA[There’s an intuition, understandable and wrong, that a compressed model should be harder to attack. The argument goes roughly like this: adversarial examples exploit the model’s excessive sensitivity to high-frequency input perturbations. A pruned or quantized model has less capacity, represents simpler functions, and surely has less room for the adversarially sensitive structure that attackers exploit.]]></summary></entry><entry><title type="html">Finding the Cheat Code: Universal Adversarial Triggers Without Any Data</title><link href="https://swapnilparekh.github.io/data-free-universal-adversarial-triggers/" rel="alternate" type="text/html" title="Finding the Cheat Code: Universal Adversarial Triggers Without Any Data" /><published>2022-02-01T00:00:00+00:00</published><updated>2022-02-01T00:00:00+00:00</updated><id>https://swapnilparekh.github.io/data-free-universal-adversarial-triggers</id><content type="html" xml:base="https://swapnilparekh.github.io/data-free-universal-adversarial-triggers/"><![CDATA[<p>Most attacks on NLP models work by finding a perturbation tailored to a specific input: a few word substitutions that flip a sentiment classifier on one particular review, or a paraphrase that breaks a textual entailment model on one particular sentence. These attacks are powerful but narrow. They exploit the model’s behavior on a specific input rather than something fundamental about its parameters.</p>

<p>The more unsettling question is whether you can find a universal trigger, a short sequence of tokens that, when prepended to <em>any</em> input, drives the model toward a target class, without ever seeing a single training example. Just the model itself.</p>

<p>That’s what MINIMAL (Mining Models for Universal Adversarial Triggers) is about, and the answer turns out to be yes.</p>

<h2 id="the-setup">The Setup</h2>

<p>Previous work on universal adversarial triggers (UATs), most notably the paper by Wallace et al., assumed access to a dataset. You need examples to evaluate the trigger against. You need to know what the model outputs on real inputs to know whether your trigger is working. It’s a reasonable assumption in a research setting, but it narrows the threat model considerably.</p>

<p>MINIMAL removes that assumption entirely. Given only model parameters and random token initialization, we mine triggers that generalize across arbitrary inputs.</p>

<p>The mechanism exploits something that should make anyone who trains language models a little uncomfortable: the gradient signal from the model’s own parameters, computed with respect to a target label, is rich enough to discover adversarial structure without any external data. The model has effectively memorized the shape of its own decision boundaries, and those boundaries are accessible without the data that carved them.</p>
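<p>The flavour of the idea can be shown on a toy model. The sketch below is not the MINIMAL procedure; it is a deliberately tiny, first-order token-replacement search on a hypothetical bag-of-embeddings linear classifier. The vocabulary, embeddings, and class weights are all invented. What it shares with the setting above is that the search starts from random tokens and uses only the model’s parameters, never a dataset.</p>

```python
import random

# Hypothetical 2-d embeddings and linear class weights (all values invented).
EMB = {
    "the": [0.0, 0.1], "movie": [0.1, 0.5], "great": [1.0, 0.2],
    "lovely": [0.8, 0.1], "awful": [-1.5, 0.3], "boring": [-0.7, 0.0],
}
W = {"positive": [1.0, 0.0], "negative": [-1.0, 0.0]}
VOCAB = sorted(EMB)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def predict(tokens):
    """Classify by the mean token embedding under the linear weights."""
    mean = [sum(EMB[t][d] for t in tokens) / len(tokens) for d in range(2)]
    return max(W, key=lambda label: dot(W[label], mean))


def mine_trigger(target, other, length, seed=0):
    """Data-free search: start from random tokens, then greedily swap each
    slot for the token with the best first-order gain in the target margin."""
    rng = random.Random(seed)
    trigger = [rng.choice(VOCAB) for _ in range(length)]
    # Gradient of (target logit - other logit) w.r.t. a token embedding,
    # up to a positive 1/length factor; constant because the model is linear.
    grad = [t - o for t, o in zip(W[target], W[other])]
    for i in range(length):
        old = EMB[trigger[i]]
        trigger[i] = max(
            VOCAB,
            key=lambda tok: dot([n - p for n, p in zip(EMB[tok], old)], grad),
        )
    return trigger


trigger = mine_trigger(target="negative", other="positive", length=3)
review = ["the", "lovely", "movie"]
```

<p>In this toy, prepending <code>trigger</code> to <code>review</code> flips the prediction even though no review was consulted while mining it. A real language model is nonlinear, so the gradient must be recomputed after each swap and the search repeated over many steps.</p>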

<h2 id="what-happens-in-practice">What Happens in Practice</h2>

<p>On the Stanford Sentiment Treebank, a single 5-token trigger reduced positive-class accuracy from 93.6% to 9.6%. Prepend those five tokens to any review, a glowing five-star restaurant critique, a heartfelt book recommendation, anything, and the classifier confidently calls it negative.</p>

<p>On SNLI (natural language inference), a single trigger reduced entailment-class accuracy from 90.95% to below 0.6%.</p>

<p>These aren’t subtle improvements over data-dependent baselines. They’re roughly <em>matching</em> them, from zero data.</p>

<h2 id="why-this-is-the-scary-result">Why This Is the Scary Result</h2>

<p>There’s a tempting way to downplay adversarial attacks on NLP models: assume that the attacker needs substantial access to training data. Data is an expensive, controlled resource. An attacker without the training set is, on this view, operating with at least some handicap.</p>

<p>MINIMAL eliminates that handicap. The attack surface is the model itself: the weights that are often distributed, downloaded, served via API, or extracted through repeated queries. In a world where fine-tuned models are routinely shared on model hubs, the trigger is derivable from the artifact millions of people already have.</p>

<p>The deeper issue is what this reveals about how language models encode their tasks. A model trained on sentiment should, in some idealized sense, have distributed its task knowledge across enough parameters that no small token sequence could uniformly hijack it. Instead, the geometry of the parameter space contains concentrated adversarial structure, choke points where a short, fixed input can override everything else the model has learned. That’s a property of the training dynamics, not just the specific model. We found it across multiple architectures and datasets.</p>

<h2 id="what-this-means-going-forward">What This Means Going Forward</h2>

<p>The obvious defense is adversarial training against data-free triggers. We evaluated several filtering-based defenses and found they trade too much clean accuracy to be practical. The more promising direction is regularization during pre-training: penalizing the formation of the concentrated parameter geometry that makes data-free mining possible in the first place. That’s a harder problem, but it’s the right one to be working on.</p>

<p>There’s also a methodological point worth making: if you’re evaluating the robustness of a language model, running only data-dependent attack baselines will give you an overly optimistic picture. The attack surface extends past your dataset.</p>

<p>The code is available <a href="https://github.com/midas-research/data-free-uats">here</a> and the paper is at <a href="https://arxiv.org/abs/2109.12406">AAAI 2022</a>.</p>]]></content><author><name></name></author><category term="adversarial-ml" /><category term="nlp" /><category term="aaai" /><summary type="html"><![CDATA[Most attacks on NLP models work by finding a perturbation tailored to a specific input: a few word substitutions that flip a sentiment classifier on one particular review, or a paraphrase that breaks a textual entailment model on one particular sentence. These attacks are powerful but narrow. They exploit the model’s behavior on a specific input rather than something fundamental about its parameters.]]></summary></entry><entry><title type="html">Solving Azure Query Time JOINs</title><link href="https://swapnilparekh.github.io/Azure-Query-Time-JOINs/" rel="alternate" type="text/html" title="Solving Azure Query Time JOINs" /><published>2020-09-04T00:00:00+00:00</published><updated>2020-09-04T00:00:00+00:00</updated><id>https://swapnilparekh.github.io/Azure-Query-Time-JOINs</id><content type="html" xml:base="https://swapnilparekh.github.io/Azure-Query-Time-JOINs/"><![CDATA[<h2 id="problem-statement">Problem Statement:</h2>
<p>In most real-world use cases, there are several relational dependencies between our datasources.
These relationships need to be joined and queried at search time, since the datasources are currently normalized, or split logically into tables with no redundant information.
Other search technologies such as Lucene and Solr handle this with very fast query-time joins built on optimized application-side joins. As we will see later, we cannot use this method for our use cases due to certain restrictions.
We need to find suitable alternatives for these join queries.</p>

<h2 id="methods">Methods:</h2>
<h3 id="application-side-joins">Application Side Joins:</h3>
<p>In this method, the results matching from the first index are queried against the second index, and the two result sets are joined on the application side.</p>

<p><img src="/images/img_az.jpg" width="400" /></p>
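<p>A minimal sketch of this flow is shown below. The two in-memory lists stand in for search indexes, and the helper names (<code>search</code>, <code>lookup</code>, <code>application_side_join</code>) and sample records are hypothetical; a real implementation would issue two requests to the search service instead.</p>

```python
# Hypothetical in-memory stand-ins for two separate search indexes.
ORDERS_INDEX = [
    {"order_id": 1, "customer_id": "c1", "item": "laptop"},
    {"order_id": 2, "customer_id": "c2", "item": "laptop bag"},
    {"order_id": 3, "customer_id": "c1", "item": "monitor"},
]
CUSTOMERS_INDEX = [
    {"customer_id": "c1", "name": "Asha"},
    {"customer_id": "c2", "name": "Ben"},
    {"customer_id": "c3", "name": "Chen"},
]


def search(index, field, term):
    """First query: naive substring match on one field."""
    return [doc for doc in index if term in str(doc[field])]


def lookup(index, field, keys):
    """Second query: fetch documents whose key is in the matched set."""
    return [doc for doc in index if doc[field] in keys]


def application_side_join(term):
    orders = search(ORDERS_INDEX, "item", term)
    keys = {order["customer_id"] for order in orders}
    customers = lookup(CUSTOMERS_INDEX, "customer_id", keys)
    by_id = {c["customer_id"]: c for c in customers}
    # The join itself happens here, in application code.
    return [
        {**order, "customer_name": by_id[order["customer_id"]]["name"]}
        for order in orders
        if order["customer_id"] in by_id
    ]


joined = application_side_join("laptop")
```

<p>Note that the key set from the first query becomes the filter of the second query, so the size of the second request grows with the number of first-stage matches.</p>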

<p>It has been used successfully in Lucene and Solr with the following enhancements over normal filtering:</p>
<ul>
  <li>Caching: frequently used documents are cached, which gave a 20X improvement.</li>
  <li>Converting strings to numbers: string matching during the filter step is very time- and resource-intensive, so strings are converted to numbers, which are used for matching instead. This gave a 50X improvement.</li>
</ul>
<p>We cannot confirm that Azure applies these optimizations, but since this is the SOTA approach, it is likely the underlying process.</p>

<p><ins>PROS:</ins></p>
<ul>
  <li>Data can remain normalized.</li>
  <li>Write performance improves, since only one source needs to be updated.</li>
</ul>

<p><ins>CONS:</ins></p>
<ul>
  <li>Need to run extra queries and perform expensive joins at search time.</li>
  <li>Azure limits a GET filter query to 8 KB, so with millions of rows in our datasources some filter queries will grow too long and exceed that limit.</li>
  <li>Search relevance suffers, since the scores from the different indexes need to be combined arbitrarily.</li>
</ul>
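<p>The size limit is easy to check before sending a request. The sketch below builds a key filter and measures it against the 8 KB figure quoted above. It assumes the OData <code>search.in(field, 'values', 'delimiter')</code> filter function as we understand it from the Azure Search documentation; the field name, key format, and helper names are hypothetical, and URL encoding (ignored here) would make the real request longer still.</p>

```python
GET_FILTER_LIMIT_BYTES = 8 * 1024  # the 8 KB limit quoted above


def build_key_filter(field, keys):
    """Build a filter that matches any of `keys` (assumed search.in syntax)."""
    joined = ",".join(keys)
    return f"search.in({field}, '{joined}', ',')"


def fits_in_get(filter_expr):
    """True if the raw (un-encoded) filter stays within the GET limit."""
    return len(filter_expr.encode("utf-8")) <= GET_FILTER_LIMIT_BYTES


# Hypothetical key sets returned by the first-stage query.
small = build_key_filter("customer_id", [f"c{i}" for i in range(100)])
large = build_key_filter("customer_id", [f"c{i}" for i in range(5000)])
```

<p>With these made-up keys, a hundred matches fit comfortably while five thousand do not, which is why this approach suits datasources with a limited number of matching documents.</p>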

<p><ins>When to use it?</ins><br />
Data sources where queries match a limited number of documents and, preferably, the data is seldom updated (so that caching can help).</p>

<h3 id="data-denormalization">Data Denormalization:</h3>
<p>This is a key technique that eliminates the need for joins.
The tables are flattened along the required fields, which gives us the best performance, since we are using Azure Search the way it is meant to be used.
Using the complex data types available in Azure (where fields with different data types can be stored together as collections), we can handle both 1-1 and 1-N relationships in our data.
The combined matching results can be displayed separately on the client side using predetermined formats.</p>

<p><ins>PROS:</ins></p>
<ul>
  <li>Speed, since there is no need for expensive client-side joins and extra queries.</li>
  <li>Better read performance.</li>
  <li>Better search relevance, since all the related information is stored together, and the inbuilt search relevance scores can be used.</li>
  <li>Can reduce the number of indexes, if full denormalization is performed.</li>
</ul>

<p><ins>CONS:</ins></p>
<ul>
  <li>Requires special index design.</li>
  <li>Increased complexity in the case of deeply nested JOIN relationships.</li>
  <li>Write latency when updating frequently updated datasources.</li>
  <li>Causes an explosion of data for high-cardinality 1-N relationships, which may introduce latency.</li>
</ul>

<p><ins>When to use it?</ins><br />
Data sources where data isn’t updated very frequently and only simple join relationships exist between them.</p>

<h3 id="azure-sql-views">Azure SQL VIEWs:</h3>
<p>This is the most natural solution to the problem, since we can replicate the current query structure directly.
A view is essentially a stored SQL query over one or more datasources.
It can be connected to an indexer and loaded into the index.
The index can be updated by simply running the indexer once.</p>
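<p>The sketch below shows the shape of such a view. It uses Python’s built-in <code>sqlite3</code> module purely as a local stand-in for Azure SQL, and the table names, columns, and view name are hypothetical. The <code>CREATE VIEW</code> statement is the part that would carry over; pointing an indexer at the view is done on the Azure side and is not shown.</p>

```python
import sqlite3

# In-memory SQLite database standing in for an Azure SQL database.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (customer_id TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id TEXT, item TEXT);

    INSERT INTO customers VALUES ('c1', 'Asha'), ('c2', 'Ben');
    INSERT INTO orders VALUES
        (1, 'c1', 'laptop'), (2, 'c2', 'laptop bag'), (3, 'c1', 'monitor');

    -- The view encodes the JOIN once; readers see it as a single flat table.
    CREATE VIEW order_search_view AS
        SELECT o.order_id, o.item, c.customer_id, c.name AS customer_name
        FROM orders AS o
        JOIN customers AS c ON c.customer_id = o.customer_id;
    """
)

rows = conn.execute(
    "SELECT order_id, item, customer_name FROM order_search_view ORDER BY order_id"
).fetchall()
conn.close()
```

<p>Because the JOIN lives in the view definition, it can be altered without touching whatever reads from the view, as long as the output columns stay the same.</p>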

<p><ins>PROS:</ins></p>
<ul>
  <li>Most closely resembles the current SOTA approach.</li>
  <li>Easy update irrespective of the update cycles of the datasources.</li>
  <li>Can model highly complex, nested JOINs.</li>
  <li>Optimal search relevance.</li>
  <li>Can be easily altered as per requirement, without changing the indexer/index.</li>
</ul>

<p><ins>CONS:</ins></p>
<ul>
  <li>Only works with Azure Databases.</li>
  <li>Requires an additional Indexer and Index.</li>
</ul>

<p><ins>When to use it?</ins><br />
Complex nested JOIN queries that cannot be modelled by the other approaches.</p>

<h2 id="conclusion">Conclusion:</h2>
<p>Unfortunately, there is no single correct solution to this problem. Our final solution will likely be an amalgamation of these approaches applied to relevant datasources.</p>

<h2 id="references">References:</h2>
<ul>
  <li><a href="https://seecr.nl/2013/10/16/reducing-index-maintenance-costs-with-query-time-join-for-solrlucene/" title="LUCENE Approach">LUCENE Approach</a></li>
  <li><a href="https://docs.microsoft.com/en-us/sql/t-sql/statements/create-view-transact-sql?view=sql-server-ver15" title="Azure SQL">Azure SQL</a></li>
  <li><a href="https://www.elastic.co/guide/en/elasticsearch/guide/2.x/relations.html" title="Elastic Search Approach">Elastic Search Approach</a></li>
</ul>]]></content><author><name></name></author><summary type="html"><![CDATA[Problem Statement: In most real world use cases, there exist several relational dependencies between our datasources. These relationships need to be joined and queried at search time; since these datasources are currently normalized or split logically into tables with no redundant information. Other key search service providers like Lucene and SOLR handle this by super-fast query joins using optimized Application side joins. As we will see later, we cannot use this method for our use cases due to certain restrictions. We need to find suitable alternatives for these join queries.]]></summary></entry><entry><title type="html">Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles</title><link href="https://swapnilparekh.github.io/Simple-and-Scalable-Predictive-Uncertainty/" rel="alternate" type="text/html" title="Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles" /><published>2019-05-15T00:00:00+00:00</published><updated>2019-05-15T00:00:00+00:00</updated><id>https://swapnilparekh.github.io/Simple-and-Scalable-Predictive-Uncertainty</id><content type="html" xml:base="https://swapnilparekh.github.io/Simple-and-Scalable-Predictive-Uncertainty/"><![CDATA[<p>A new method to summarize papers; by 5 questions!	
This is the first paper, which I recently studied for my application to the DeepBayes Summer School.</p>

<p>Here is the link; please read it and give feedback!</p>

<p>https://drive.google.com/open?id=1r1L9N2e3hNgeVxfwBWb2860TXjdxizs7</p>]]></content><author><name></name></author><summary type="html"><![CDATA[A new method to summarize papers; by 5 questions! This is the First Paper, that I recently studied for my application to the DeepBayes Summer School.]]></summary></entry><entry><title type="html">A Walk In the World of VAEs- AN ESSAY</title><link href="https://swapnilparekh.github.io/Walk-in-the-world-of-VAEs/" rel="alternate" type="text/html" title="A Walk In the World of VAEs- AN ESSAY" /><published>2019-03-28T00:00:00+00:00</published><updated>2019-03-28T00:00:00+00:00</updated><id>https://swapnilparekh.github.io/Walk-in-the-world-of-VAEs</id><content type="html" xml:base="https://swapnilparekh.github.io/Walk-in-the-world-of-VAEs/"><![CDATA[<h2 id="view-of-the-world-and-the-challenge">View of the world and the Challenge:</h2>
<p>Technology has become inextricably intertwined with people’s lives and businesses.
By predicting everything from stock prices to behaviours, learning algorithms coupled with voluminous data have permeated every industry, for revenue enhancement and/or cost-cutting.
But the world is sparse: signals are like grains in heaps of chaff.
To make sense of these sparse signals and derive insights, creating compressed representations of real-world data is imperative.</p>

<h2 id="solution-according-to-me-and-its-impact">Solution according to me and its Impact:</h2>
<p>I recently stumbled upon VAEs, which can be used to solve the aforesaid problem.
Variational Autoencoders (VAEs) are an unsupervised deep learning model that learns a dense representation of data through reconstruction. The data is passed through an encoder, which creates a low-dimensional latent vector; this vector is then passed to a decoder to reconstruct the original data. The latent vector is composed of the high-level features the model deems important for reconstructing the original data, and hence it can be trained to learn the compressed representation in an unsupervised manner, that is, without using any labels.
These compressed representations can be highly useful in any problem where a sparse input needs to be processed for downstream tasks. For example, a recent paper on defending Convolutional Neural Networks (CNNs) against adversarial attacks (adding Gaussian noise to the image to fool the algorithm) reported state-of-the-art accuracy when using VAEs to denoise the images before passing them to the CNNs. This was accomplished by training the VAE on artificially corrupted images, helping it learn the features essential for reconstructing the clean original image.</p>

<h2 id="case-study-and-learnings">Case Study and Learnings:</h2>
<p>Similarly, VAEs can be used for anomaly detection. Anomaly detection is traditionally accomplished by supervised methods, that is, by learning patterns from labelled data.
Unsupervised anomaly detection using VAEs is therefore extremely useful, especially for time-series financial data.
In one of my personal projects, by training a VAE on Bitcoin price data, I was able to generate a dense representation of ‘normal’ price data (using indicators such as simple and exponential moving averages). Then, by measuring the reconstruction error of each test data point, one can infer that a point with high error is probably an anomaly, while a point with low error is probably normal.</p>
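<p>The decision rule at the end of that pipeline can be sketched in a few lines. The code below assumes the reconstruction errors have already been computed by a trained VAE; the error values, the mean-plus-<code>k</code>-standard-deviations threshold, and the function names are illustrative assumptions, not the exact rule from my project.</p>

```python
from statistics import mean, pstdev


def fit_threshold(train_errors, k=3.0):
    """Threshold = mean + k standard deviations of errors on 'normal' data."""
    return mean(train_errors) + k * pstdev(train_errors)


def flag_anomalies(errors, threshold):
    """Mark each point whose reconstruction error exceeds the threshold."""
    return [error > threshold for error in errors]


# Hypothetical reconstruction errors on normal training windows.
train_errors = [0.10, 0.12, 0.09, 0.11, 0.10, 0.13, 0.08]
threshold = fit_threshold(train_errors)

# Hypothetical errors on new data points.
flags = flag_anomalies([0.11, 0.45, 0.09], threshold)
```

<p>Only the second new point is flagged here, because its error sits far above anything seen on normal data. The choice of <code>k</code> trades false alarms against missed anomalies and would need tuning on real data.</p>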

<h2 id="practicality">Practicality:</h2>
<p>This is just a start; I think the model can be made more adaptive, predictive and hence commercially viable. Like traditional machine learning algorithms, it can run on a rolling basis on new data, using a sliding-window approach for continual improvement.
Thus, I believe that VAEs can be trained to solve tasks such as unsupervised signal reconstruction and anomaly detection to help analysis and prediction.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[View of the world and the Challenge: Technology has become inextricably intertwined with people’s lives and businesses. By predicting everything from stock price to behaviours, learning algorithms coupled with voluminous data have permeated every industry, for revenue enhancement and/or cost-cutting. But the world is sparse, signals are like grains in heaps of chaff. To make sense of these sparse signals and derive insights, creating compressed representations of real-world data is imperative.]]></summary></entry></feed>