Automated Detection of Visual Attribute Reliance with a Self-Reflective Agent

1MIT CSAIL, 2Universitat Oberta de Catalunya, 3Louisiana Tech, 4Northeastern University

NeurIPS 2025

When a vision model performs image recognition, which visual attributes drive its predictions?

Detecting unintended use of specific visual features is critical for ensuring model robustness, preventing overfitting, and avoiding spurious correlations. We introduce an automated framework for detecting these dependencies in trained vision models.

Figure 1

Computer vision models trained on large-scale datasets have achieved remarkable performance across a broad range of recognition tasks, often surpassing human accuracy on standard benchmarks. However, strong benchmark results can obscure underlying vulnerabilities. In particular, models may achieve high accuracy using prediction strategies that are non-robust or non-generalizable. These include relying on object-level characteristics such as pose or color (Geirhos et al., 2018), contextual cues like background scenery or co-occurring objects (Xiao et al., 2021), and demographic traits of human subjects (Wilson et al., 2019). Such visual dependencies may result in overfitting, reduced generalization, and performance disparities in real-world usage (Hendrycks et al., 2021).

Existing methods take various approaches to discover visual attributes that drive model predictions. These include saliency-based methods that highlight input regions associated with a prediction (Simonyan et al., 2013), feature visualizations that map activations to human-interpretable patterns (Olah et al., 2017), and concept-based attribution methods that evaluate sensitivity to predefined semantic concepts (Kim et al., 2018). While powerful for visualizing local behaviors, these approaches often rely on manual inspection and assume access to a fixed set of predefined concepts, limiting their ability to scale to modern models with complex behaviors.

We introduce a fully automated framework designed to detect unintended visual attribute reliance in pretrained vision models. Given a pretrained model and a target visual concept (e.g., an image classifier selective for the object vase), our method identifies specific image features that systematically influence the model's predictions, even when these features fall outside the model's intended behavior (e.g., the classifier relies on flowers to detect the vase). At the core of our approach is the LM-based Self-reflective Automated Interpretability Agent (SAIA) that treats the task as a scientific discovery process. Rather than relying on a predefined set of candidate attributes, SAIA autonomously formulates hypotheses about image features that the model might rely on, designs targeted tests, and updates its beliefs based on observed model behavior. In contrast to previous interpretability agents like the Multimodal Automated Interpretability Agent (MAIA) introduced by Rott Shaham et al., 2024, SAIA does not stop after generating an initial finding, but rather actively evaluates how well it matches the model's behavior on unseen test cases. When discrepancies arise, SAIA reflects on its assumptions, identifies gaps or inconsistencies in its current understanding, and initiates a new hypothesis testing loop.

SAIA

Our approach consists of two main stages. (i) Hypothesis-Testing Stage, in which SAIA is provided with a subject model (e.g., an image classifier) and a target concept to explore (e.g., vase). The agent is tasked with discovering visual attributes in the input image that the subject model relies on to perform recognition. It proposes candidate attributes that may influence the model's behavior, designs targeted experiments to test its hypotheses, and iteratively refines them based on the observed results. This cycle continues until the agent converges on a stable explanation of the model's reliance. (ii) Self-Reflection Stage, in which SAIA scores its explanation using a self-evaluation protocol that quantifies how well the explanation matches the model's behavior on new input images. If the explanation fails to generalize or reveals inconsistencies, SAIA reflects on its prior explanation in light of the evaluation evidence and launches a new hypothesis-testing stage.
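To make the interplay between the two stages concrete, here is a minimal sketch of the outer loop, assuming hypothetical `hypothesis_testing_stage` and `self_evaluate` helpers (the actual stages are driven by an LM agent equipped with image generation and editing tools); the agreement threshold is an illustrative assumption, not the paper's exact criterion.

```python
# Minimal sketch of SAIA's outer two-stage loop. `hypothesis_testing_stage` and
# `self_evaluate` are hypothetical placeholders for the agent's LM-driven stages.

AGREEMENT_THRESHOLD = 0.8  # assumed convergence criterion for illustration


def detect_attribute_reliance(subject_model, target_concept, max_rounds=5):
    feedback = None     # evidence carried over from a failed self-evaluation
    conclusion = None
    for _ in range(max_rounds):
        # (i) Hypothesis-testing: propose, test, and refine candidate attributes.
        conclusion = hypothesis_testing_stage(subject_model, target_concept, feedback)

        # (ii) Self-reflection: score the conclusion on newly generated images.
        agreement, evidence = self_evaluate(subject_model, target_concept, conclusion)
        if agreement >= AGREEMENT_THRESHOLD:
            return conclusion   # explanation matches the model's behavior
        feedback = evidence     # reflect on the mismatch and start a new round
    return conclusion
```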

Hypothesis-Testing Stage

In this stage, SAIA iteratively refines hypotheses about the attribute sensitivities of the subject model. Inspired by Rott Shaham et al., 2024, we design SAIA to operate in a scientific loop: it begins by proposing candidate attributes that the subject model might rely on, designs experiments that generate and edit images to test these hypotheses (e.g., editing an image of a suit to change its color), observes the resulting model behavior (e.g., measuring the subject model's confidence scores across these experiments), and updates its beliefs accordingly. This cycle continues until SAIA converges on a final explanation of the model's sensitivity to image features.
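As an illustration of a single experiment inside this loop, the sketch below probes a suit classifier's sensitivity to color; `generate_image` and `edit_image` are hypothetical stand-ins for the agent's text-to-image generation and instruction-based editing tools.

```python
# Sketch of one targeted experiment: probe a suit classifier's sensitivity to color.
# `generate_image` and `edit_image` are hypothetical wrappers around the agent's
# image-generation and image-editing tools; `subject_model` returns a confidence score.

def test_color_sensitivity(subject_model, n_trials=8):
    deltas = []
    for _ in range(n_trials):
        base = generate_image("a person wearing a black suit")
        edited = edit_image(base, "change the suit color to bright red")
        # Compare the model's confidence before and after the targeted edit.
        deltas.append(subject_model(base) - subject_model(edited))
    # A consistently positive gap supports the hypothesis that the classifier
    # relies on the suit being black rather than on the suit itself.
    return sum(deltas) / len(deltas)
```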

Self-Reflection Stage

Once the hypothesis-testing stage is complete and SAIA reports its conclusion, we initiate a self-reflection stage in which that conclusion is scored using a self-evaluation protocol. Importantly, this evaluation is self-contained and does not require any ground-truth knowledge about the model's attribute dependencies; instead, it generates two sets of images containing the object class: one that, according to the conclusion, should elicit high classification scores from the model, and one that should elicit low scores. If SAIA's detected reliance sufficiently matches the model's behavior, the experiments terminate and the current conclusion is returned. Otherwise, if inconsistencies between SAIA's conclusion and the model's behavior are found, the information collected during self-evaluation is returned to SAIA, which reflects on its previous conclusion and initiates another hypothesis-testing round.
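One possible form of this protocol is sketched below; the prompt-construction helpers, `generate_image`, and the agreement measure are illustrative assumptions rather than the paper's exact implementation.

```python
# Sketch of the self-evaluation protocol. From the agent's conclusion we build two sets
# of images containing the target object: one the conclusion predicts should score high,
# one it predicts should score low, then check whether the subject model agrees.
# `prompts_satisfying`, `prompts_violating`, and `generate_image` are hypothetical helpers.

def self_evaluate(subject_model, target_concept, conclusion, n_per_set=10):
    high_prompts = prompts_satisfying(conclusion, target_concept)  # e.g., "a vase with flowers"
    low_prompts = prompts_violating(conclusion, target_concept)    # e.g., "an empty vase"

    high_scores = [subject_model(generate_image(p)) for p in high_prompts[:n_per_set]]
    low_scores = [subject_model(generate_image(p)) for p in low_prompts[:n_per_set]]

    # Agreement: how often a predicted-high image outscores the average predicted-low image.
    mean_low = sum(low_scores) / len(low_scores)
    agreement = sum(s > mean_low for s in high_scores) / len(high_scores)
    evidence = {"high": list(zip(high_prompts, high_scores)),
                "low": list(zip(low_prompts, low_scores))}
    return agreement, evidence
```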

Visual Attribute Reliance Detection Benchmark

To evaluate the capabilities of SAIA, we constructed a benchmark of 130 unique object recognition models that exhibit 18 diverse types of visual attribute reliance. All simulated behaviors are inspired by known vulnerabilities of vision models and mimic spurious correlations between the target object and image attributes such as object color, background context, co-occurring object state, or demographic cues. To assess the generalizability of our method, the benchmark also includes a subset of models with counterfactual attribute reliance that is intentionally rare or unnatural for real-world pretrained models (e.g., a suit detector that responds more strongly when a woman wears the suit). Each benchmark model includes an input parameter that controls the strength of the injected reliance, allowing precise control over model behavior. Importantly, because these models are explicitly engineered with a known intended behavior, they serve as a controlled testbed for evaluating and comparing feature-reliance detection methods.

Figure 1: We simulate feature reliance by modulating object recognition scores based on the presence of specific visual attributes (e.g., a bird detector that relies on the presence of a beach background). Given an input image and object category t, 𝒪t produces a confidence score for object presence. If the object is not detected, a low random score is assigned as the image's confidence. If the object is detected, we simulate an attribute dependency (e.g., the presence of a beach background for bird detection): if the attribute condition is satisfied, the final classification score equals the 𝒪t(img) confidence; otherwise, the score is discounted by a factor α to represent the model's weaker response when the attribute condition is not met.
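The scoring rule can be summarized with the short sketch below, where `object_detector` plays the role of 𝒪t, `attribute_detector` plays the role of the attribute condition 𝒜, and the detection threshold is an assumed value.

```python
import random

DETECTION_THRESHOLD = 0.5  # assumed cutoff for deciding the object is present


def reliant_model_score(image, object_detector, attribute_detector, alpha=0.5):
    """Benchmark model with an injected attribute reliance (sketch of the rule above)."""
    confidence = object_detector(image)     # O_t(img): confidence that the object is present
    if confidence < DETECTION_THRESHOLD:
        return random.uniform(0.0, 0.1)     # object not detected: low random score
    if attribute_detector(image):           # attribute condition met (e.g., beach background)
        return confidence                   # full O_t(img) confidence
    return alpha * confidence               # condition not met: discounted response
```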

Attribute Reliance Categories

We categorize the attribute conditions used to inject reliance into four groups: object attributes, context attributes, demographic attributes, and counterfactual demographic attributes. These categories reflect different types of visual dependencies observed (or intentionally constructed) in our benchmark models, and guide the choice of attribute detector 𝒜 used in each case.

Object attributes

These attribute dependencies relate to visual properties of the object itself. We include reliance on object color and material, using SigLIP as 𝒜 for zero-shot classification of object-specific attributes (e.g., SigLIP is guided with the prompt "a red bus" to inject a color reliance into a bus detector). A color-reliant model returns the full score from 𝒪t only if the object has a specific color; otherwise the response is discounted. Similarly, a material-reliant model gives a full response only if the object is made of the intended material (e.g., vases made of ceramic).
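A zero-shot attribute check of this kind could look like the sketch below, using the Hugging Face transformers SigLIP interface; the checkpoint name, prompt pair, and decision rule are assumptions for illustration.

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

# Sketch of a SigLIP-based attribute detector A (assumed checkpoint and prompts).
processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")
siglip = AutoModel.from_pretrained("google/siglip-base-patch16-224")


def has_attribute(image: Image.Image,
                  attribute_prompt: str = "a red bus",
                  plain_prompt: str = "a bus") -> bool:
    inputs = processor(text=[attribute_prompt, plain_prompt], images=image,
                       padding="max_length", return_tensors="pt")
    with torch.no_grad():
        logits = siglip(**inputs).logits_per_image   # shape (1, 2)
    probs = torch.sigmoid(logits)[0]                 # SigLIP scores each text independently
    return bool(probs[0] > probs[1])                 # attribute prompt beats the plain prompt
```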

Context attributes

These dependencies reflect properties of the object's surrounding context. We simulate reliance on the object's setting (e.g., a keyboard only when it is being typed on) and on the background (e.g., a car only when it appears in an urban environment). Here as well, we use SigLIP guided by text prompts to detect the intended attribute.

Demographic attributes

These dependencies are based on the age or gender of the people interacting with the target object. We use FairFace as 𝒜 to detect demographic attributes and construct models that rely on them (e.g., an apron detector that relies on the apron being worn by a woman, and a glasses detector that relies on the glasses being worn by older individuals).
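As an illustration, a FairFace-based gender check might look like the sketch below; the checkpoint path, the 18-way output layout (7 race, 2 gender, 9 age logits), and the preprocessing are assumptions based on the publicly released FairFace ResNet-34 classifier.

```python
import torch
import torchvision

# Sketch of a FairFace-based demographic detector A. We assume a ResNet-34 checkpoint
# whose 18 outputs split into 7 race, 2 gender, and 9 age logits; the checkpoint path
# and output layout below are assumptions.
fairface = torchvision.models.resnet34()
fairface.fc = torch.nn.Linear(fairface.fc.in_features, 18)
fairface.load_state_dict(torch.load("fairface_res34.pt", map_location="cpu"))
fairface.eval()


def worn_by_woman(face_tensor: torch.Tensor) -> bool:
    # face_tensor: a cropped, normalized face image of shape (1, 3, 224, 224)
    with torch.no_grad():
        logits = fairface(face_tensor)[0]
    gender_logits = logits[7:9]          # assumed [male, female] slice of the output
    return bool(gender_logits.argmax() == 1)
```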

Counterfactual demographic attributes

To test whether SAIA can discover atypical or out-of-distribution dependencies, we include models with counterfactual demographic reliance, i.e., attribute pairings that rarely co-occur in real-world data (e.g., an apron detector that activates only when the apron is worn by a man, or a glasses detector that prefers younger wearers). These models let us test whether SAIA can detect unexpected or counterintuitive reliance patterns that do not follow natural co-occurrence statistics, assessing both its ability to uncover realistic demographic biases and its robustness to rare or previously unseen dependencies.

Applications to SOTA Models

We deploy SAIA to detect attribute reliance in two pretrained vision models: the CLIP-ViT image encoder (Radford et al., 2021), trained to align image and text representations, and the YOLOv8 model (Jocher et al., 2023), trained for object detection in autonomous-driving settings.


With CLIP, we perform object recognition by measuring the cosine similarity between the image and a target prompt (e.g., "A picture of a scientist"). For YOLOv8, we measure the detection score of the target object class. The generated descriptions are predictive of model behavior: model scores increase when the detected reliance is satisfied and decrease when it is absent. Surprisingly, SAIA reveals dependencies that have not been observed before, such as CLIP's reliance on traditional laboratory settings when detecting a scientist, and YOLOv8's dependency on bikers' poses. We note that SAIA's goal is to surface such dependencies rather than to assess their desirability or harm, allowing practitioners to make informed judgments based on their specific downstream use cases.
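For reference, the two scoring functions can be sketched as below; the checkpoint names are assumptions (the paper's exact CLIP variant and driving-domain YOLOv8 weights may differ).

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor
from ultralytics import YOLO

# Sketch of the two scoring functions (assumed checkpoints).
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
yolo = YOLO("yolov8n.pt")


def clip_score(image: Image.Image, prompt: str = "A picture of a scientist") -> float:
    # Cosine similarity between the image embedding and the target-prompt embedding.
    inputs = clip_proc(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = clip_model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = clip_model.get_text_features(input_ids=inputs["input_ids"],
                                               attention_mask=inputs["attention_mask"])
    return torch.nn.functional.cosine_similarity(img_emb, txt_emb).item()


def yolo_score(image_path: str, target_class: str = "bicycle") -> float:
    # Highest detection confidence assigned to the target class (0 if not detected).
    result = yolo(image_path)[0]
    confs = [float(box.conf) for box in result.boxes
             if result.names[int(box.cls)] == target_class]
    return max(confs, default=0.0)
```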

BibTeX

@misc{li2025automateddetectionvisualattribute,
      title={Automated Detection of Visual Attribute Reliance with a Self-Reflective Agent},
      author={Christy Li and Josep Lopez Camuñas and Jake Thomas Touchet and Jacob Andreas and Agata Lapedriza and Antonio Torralba and Tamar Rott Shaham},
      year={2025},
      eprint={2510.21704},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.21704},
}