Are CNNs Actually Texture Biased?
Revisiting the "Elephant-Cat" controversy with a new way to measure what AI actually sees.
⚡ TL;DR
Problem: A famous study claimed CNNs are "texture-biased" (seeing skin, not shape), unlike humans.
Old Answer: This was based on "cue-conflict" images (e.g., a cat silhouette with elephant skin), where models chose the texture class.
New Answer: That experiment was flawed. When you suppress features one at a time instead of pitting them against each other, CNNs turn out to rely heavily on local shape.
Key Insight: Don't force a choice between two confusing signals. Remove one signal at a time and see if the model fails.
1 The Story in Plain English
In 2019, a paper shook the vision community by claiming that Convolutional Neural Networks (CNNs) don't "see" objects like we do. Show a human a cat silhouette filled with elephant skin texture, and they say "Cat." The CNN says "Elephant." This led to the widely accepted belief that CNNs are texture-biased.
But is that fair? Imagine a "Taste Test" where you have to choose between a salty pretzel and a sugary candy. If you pick the candy, it means you prefer sugar in that moment. It doesn't prove you rely on sugar to identify food, or that you can't taste salt at all.
💡 The "Taste Test" Analogy
The Old Way (Conflict): "Here is a Salty Candy. Is it Salt or Sugar?"
Result: You're confused, but maybe the sugar is just stronger.
The New Way (Suppression): "Here is a Candy with the sugar removed. Can you still tell it's a Candy?"
Result: If you can't, then you relied on the sugar.
This paper argues that the "Elephant-Cat" images were just weird, confusing inputs that broke the model's usual behavior. To really understand what a model needs, we shouldn't trick it with conflicting signals. We should systematically suppress one signal at a time—Shape, Texture, or Color—and measure how much the model's performance drops.
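The suppress-and-measure idea can be sketched in a few lines. This is a toy illustration, not the paper's evaluation code: the function names (`accuracy`, `suppression_report`), the list-of-numbers "images", and the two stand-in suppressors are all assumptions made up for this example.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Image = List[int]  # toy stand-in for an image (assumption for this sketch)


def accuracy(predict: Callable[[Image], str],
             images: Sequence[Image], labels: Sequence[str]) -> float:
    """Fraction of images for which the prediction matches the label."""
    correct = sum(1 for img, y in zip(images, labels) if predict(img) == y)
    return correct / len(labels)


def suppression_report(predict: Callable[[Image], str],
                       images: Sequence[Image], labels: Sequence[str],
                       suppressors: Dict[str, Callable[[Image], Image]]
                       ) -> Tuple[float, Dict[str, float]]:
    """Return baseline accuracy and the accuracy drop for each suppressed cue."""
    baseline = accuracy(predict, images, labels)
    drops = {}
    for name, suppress in suppressors.items():
        suppressed_images = [suppress(img) for img in images]
        drops[name] = baseline - accuracy(predict, suppressed_images, labels)
    return baseline, drops


# Toy classifier: decides by WHERE the mass sits (a crude "arrangement" cue).
def toy_predict(img: Image) -> str:
    half = len(img) // 2
    return "left" if sum(img[:half]) > sum(img[half:]) else "right"


images = [[5, 5, 0, 0], [0, 0, 5, 5], [7, 6, 1, 0], [1, 0, 6, 7]]
labels = ["left", "right", "left", "right"]

suppressors = {
    # Destroys the arrangement but keeps every value.
    "arrangement": lambda img: sorted(img),
    # Coarsens the values but keeps the arrangement.
    "fine_detail": lambda img: [v // 2 * 2 for v in img],
}

baseline, drops = suppression_report(toy_predict, images, labels, suppressors)
print(baseline, drops)
```

The cue whose removal causes the larger drop is the one the classifier depends on. Here that is the arrangement, by construction of the toy classifier.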
Before We Dive In: What You'll Need
- CNNs: Deep learning models used for image recognition (e.g., ResNet).
- Style Transfer: The technique used to create "Elephant-Cats" (mixing content/style).
- Bilateral Filter: A way to blur texture while keeping edges sharp.
- Local Shape: Small structural details (ears, paws) vs. Global Shape (silhouette).
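To make the bilateral filter entry less abstract, here is a minimal 1D sketch in pure Python. It is an illustration under simplifying assumptions (a 1D signal instead of an image, hypothetical parameter names and defaults). A real pipeline would more likely use a 2D library routine such as OpenCV's `cv2.bilateralFilter`.

```python
import math
from typing import List, Sequence


def bilateral_filter_1d(signal: Sequence[float], radius: int = 3,
                        sigma_space: float = 2.0,
                        sigma_range: float = 10.0) -> List[float]:
    """Smooth small fluctuations while leaving large jumps (edges) intact."""
    n = len(signal)
    out = []
    for i, center in enumerate(signal):
        total, norm = 0.0, 0.0
        for j in range(max(0, i - radius), min(n, i + radius + 1)):
            # Spatial weight: nearby samples count more.
            w_space = math.exp(-((i - j) ** 2) / (2 * sigma_space ** 2))
            # Range weight: samples with a similar value count more, so
            # samples on the other side of an edge are almost ignored.
            w_range = math.exp(-((signal[j] - center) ** 2)
                               / (2 * sigma_range ** 2))
            weight = w_space * w_range
            total += weight * signal[j]
            norm += weight
        out.append(total / norm)
    return out


# "Texture": a small +/-1 ripple. "Edge": the jump from about 10 to about 100.
signal = [9, 11] * 4 + [99, 101] * 4
smoothed = bilateral_filter_1d(signal)
print([round(v, 2) for v in smoothed])
```

The ripple on each side is flattened, while the jump between the two halves survives because samples across the edge receive a near-zero range weight.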
2 The Suppression Lab
Let's test this hypothesis ourselves. Below, you can take an image and apply the paper's suppression techniques. We'll show you how a standard ResNet-50's confidence (simulated based on the paper's data) would react.
[Interactive demo: Simulated Model Confidence. Processing happens locally in your browser.]
3 The Verdict: It's Local Shape
The results are striking. When you suppress texture (using blur), a standard CNN loses some accuracy but still performs reasonably well. When you suppress shape (using patch shuffling), performance crashes.
Wait, why does "Patch Shuffling" kill shape? It keeps the texture inside each patch (the fur, the skin pattern) exactly the same! It only messes up the arrangement of the parts. If the model were truly looking only at texture, it shouldn't care whether the ear is next to the tail. The fact that it does care indicates that it relies on the spatial structure: the local shape.
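Patch shuffling itself is simple to write down. The sketch below is a dependency-free illustration on nested lists, assuming a square grid of equal patches. The function names and the `grid` and `seed` parameters are hypothetical, and the paper's exact patch size and implementation are not specified here.

```python
import random
from typing import List

Image = List[List[int]]  # rows of pixel values (assumption for this sketch)


def split_into_patches(image: Image, grid: int) -> List[Image]:
    """Cut the image into grid x grid equal patches, in row-major order."""
    height, width = len(image), len(image[0])
    if height % grid or width % grid:
        raise ValueError("image sides must be divisible by grid")
    ph, pw = height // grid, width // grid
    return [
        [row[c * pw:(c + 1) * pw] for row in image[r * ph:(r + 1) * ph]]
        for r in range(grid)
        for c in range(grid)
    ]


def patch_shuffle(image: Image, grid: int, seed: int = 0) -> Image:
    """Randomly permute the patches, then stitch them back together."""
    patches = split_into_patches(image, grid)
    random.Random(seed).shuffle(patches)  # seeded for reproducibility
    patch_height = len(patches[0])
    shuffled = []
    for r in range(grid):
        band = patches[r * grid:(r + 1) * grid]  # patches in one output row
        for y in range(patch_height):
            shuffled.append([pixel for patch in band for pixel in patch[y]])
    return shuffled


image = [[r * 4 + c for c in range(4)] for r in range(4)]
for row in patch_shuffle(image, grid=2, seed=1):
    print(row)
```

Every pixel value and every patch's interior is preserved; only the positions of the patches change. Whatever the model loses under this transformation therefore comes from the arrangement, not from the content inside the patches.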
🔢 The Numbers
On the ImageNet validation set (ResNet-50):
- Original Accuracy: ~76%
- Texture Suppressed (Blur): ~55% (Still decent!)
- Shape Suppressed (Shuffle): ~15% (Catastrophic failure)
This huge gap is strong evidence that shape is the dominant cue.
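To put the gap in concrete terms, the approximate figures above can be turned into absolute and relative drops. This is plain arithmetic on the rounded numbers quoted in this post, and the helper name `accuracy_drop` is our own.

```python
from typing import Tuple


def accuracy_drop(baseline: float, suppressed: float) -> Tuple[float, float]:
    """Return (absolute drop in points, drop as a fraction of the baseline)."""
    absolute = baseline - suppressed
    return absolute, absolute / baseline


BASELINE = 76.0  # approximate original accuracy quoted above

texture_abs, texture_rel = accuracy_drop(BASELINE, 55.0)  # blur
shape_abs, shape_rel = accuracy_drop(BASELINE, 15.0)      # patch shuffle

print(f"texture suppressed: -{texture_abs:.0f} points ({texture_rel:.0%} of baseline)")
print(f"shape suppressed:   -{shape_abs:.0f} points ({shape_rel:.0%} of baseline)")
```

With these rounded inputs, suppressing texture costs about 21 points (roughly 28% of the baseline), while suppressing shape costs about 61 points (roughly 80%).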
🧐 Critical Analysis: Strengths, Weaknesses & Open Questions
✅ What This Paper Does Well
- Debunks a Myth: Challenges the "Texture Bias" dogma with rigorous experiments.
- Better Methodology: "Suppression" is a much cleaner tool than "Conflict" for measuring reliance.
- Domain Aware: Shows that the answer depends on the domain. Medical imaging models do rely more on texture and color, which makes sense!
⚠️ Legitimate Concerns
- Imperfect Suppression: Blurring removes texture but also hurts fine shape details. It's hard to perfectly isolate one feature.
- Blunt Tool: Patch shuffling is a very aggressive transformation; it may destroy more than just "shape".
🎯 Bottom Line
CNNs aren't as alien as we thought. They look for shapes, just like us—but they focus on local shapes (ears, eyes) rather than the global silhouette. The "Texture Bias" was mostly an artifact of a flawed test.