Detecting Adversarial Examples Is (Nearly) As Hard As Classifying Them

Florian Tramèr

International Conference on Machine Learning (ICML) 2022 (Long Presentation)

Previously presented at ICML 2021 Workshop on Adversarial Machine Learning (AdvML) (Best Paper Award)


Making classifiers robust to adversarial examples is hard. Thus, many defenses tackle the seemingly easier task of detecting perturbed inputs.
We show a barrier towards this goal. We prove a general hardness reduction between detection and classification of adversarial examples: given a robust detector for attacks at distance ε (in some metric), we can build a similarly robust (but inefficient) classifier for attacks at distance ε/2.
Our reduction is computationally inefficient, and thus cannot be used to build practical classifiers. Instead, it is a useful sanity check to test whether empirical detection results imply something much stronger than the authors presumably anticipated.
To illustrate, we revisit 13 detector defenses. In 11 of the 13 cases, we show that the claimed detection results would imply an inefficient classifier with robustness far beyond the state-of-the-art.
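
The reduction described above can be illustrated with a minimal sketch on a toy problem. Everything in it is an assumption made for illustration and is not taken from the paper's code: the inputs are integers on a line with metric |x - y|, and the names `REJECT`, `toy_detector`, `ball`, and `classifier_from_detector` are hypothetical. The idea sketched here is that, to classify an input x, one searches the ball of radius ε/2 around x for any point the detector accepts and returns the detector's label there. Any such point lies within ε of the clean input, so a detector that is robust at distance ε cannot give a wrong label on it.

```python
from typing import Callable, Iterable, Optional

# Hypothetical sentinel meaning "the detector flags this input as adversarial".
REJECT = None

Detector = Callable[[int], Optional[int]]


def toy_detector(x: int) -> Optional[int]:
    """Toy detector on the integer line (an assumption for illustration).

    It labels far-left points 0, far-right points 1, and rejects the rest.
    """
    if x <= -10:
        return 0
    if x >= 10:
        return 1
    return REJECT


def ball(center: int, radius: int) -> Iterable[int]:
    """All integers within `radius` of `center` under |x - y|."""
    return range(center - radius, center + radius + 1)


def classifier_from_detector(
    detector: Detector, x: int, eps: int, default: int = 0
) -> int:
    """Brute-force classifier built from a detector assumed robust at `eps`.

    Searches the ball of radius eps // 2 around x for any accepted point
    and returns the detector's label there. Exhaustive search is what makes
    the construction inefficient in realistic input spaces.
    """
    for candidate in ball(x, eps // 2):
        label = detector(candidate)
        if label is not REJECT:
            return label
    # No accepted point nearby: the output is arbitrary in this sketch.
    return default


# Clean points at -12 (label 0) and 12 (label 1); the toy detector never
# mislabels anything within distance 8 of them, so take eps = 8.
predictions_near_minus_12 = [
    classifier_from_detector(toy_detector, x, eps=8) for x in ball(-12, 4)
]
```

In this toy setting the detector rejects the input -8, which is within distance 4 of the clean point -12, while the derived classifier still assigns it a label by finding an accepted neighbor. In high-dimensional input spaces the exhaustive search over the ball is infeasible, which is consistent with the reduction being a sanity check rather than a practical defense.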

  author = {Tram{\`e}r, Florian},
  title = {Detecting Adversarial Examples Is (Nearly) As Hard As Classifying Them},
  booktitle = {International Conference on Machine Learning (ICML)},
  year = {2022},
  howpublished = {arXiv preprint arXiv:2107.11630},
  url = {}