Keywords: AI and Machine Learning, Security Engineering and Design

2021

Reverse Engineer and Counter Adversarial Attacks with Unsupervised Representation Learning

Xudong Wang, PhD Student, EECS, UC Berkeley
Nils Worzyk, Postdoctoral Researcher, EECS, UC Berkeley

Computer vision has been integrated into many areas of our lives, including facial recognition, augmented reality, autonomous driving, and healthcare. However, making vision models more accurate and better at generalizing to real-world data is no longer sufficient on its own; we also have to safeguard their robustness against malicious attacks in cyberspace.

Supervised learning aims to learn a function that, given data samples and their semantic labels, best approximates the relationship between input and output observable in the data. Unsupervised learning, by contrast, infers the natural structure present within a set of data points without any manual labeling. It has therefore been widely adopted to handle the growing volume of unlabeled data. However, while unsupervised training yields representations that generalize better than those from supervised training, unsupervised-trained models are actually more vulnerable to adversarial attacks once they are optimized for a specific task such as image classification.

We propose to build upon recent work on adversarial training for unsupervised learning and to advance its adversarial robustness. However, making the model more robust on its own sacrifices the performance of the unsupervised learning model in downstream tasks. Moreover, as in supervised learning, new attacks still seem able to fool even a strengthened model. We therefore also aim to detect adversarial inputs in the unsupervised-trained feature space, reverse engineer the initially applied perturbation, and identify ways to restore the original, correct class of detected adversarial inputs. The recovered perturbations might additionally make it possible to identify individual attacks or clusters of attacks, potentially related to different groups of attackers.
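
To make the idea of reverse engineering a perturbation concrete, the following is a minimal Python sketch on a toy linear classifier. It is an illustration under stated assumptions, not the proposed method: the attack is a fast-gradient-sign step chosen only as a familiar example, the helper names (fgsm_linear, estimate_perturbation) and all numbers are hypothetical, and the "restored" clean input is simply assumed to be known, whereas in practice it would have to be estimated.

```python
def sign(v):
    """Return -1, 0, or +1 according to the sign of v."""
    return (v > 0) - (v < 0)


def score(w, x):
    """Linear score w . x of input x under weights w."""
    return sum(wi * xi for wi, xi in zip(w, x))


def predict(w, x):
    """Predict class +1 or -1 from the sign of the linear score."""
    return 1 if score(w, x) >= 0 else -1


def fgsm_linear(w, x, y, eps):
    """Fast-gradient-sign step against the linear scorer.

    For the loss -y * (w . x), the gradient with respect to x is -y * w,
    so each coordinate moves by eps in the direction sign(-y * w_i).
    """
    return [xi + eps * sign(-y * wi) for xi, wi in zip(x, w)]


def estimate_perturbation(x_adv, x_restored):
    """Estimate the perturbation as adversarial input minus restored input.

    Assumption: x_restored is a good estimate of the clean input. Here it
    is the true clean input, which would not be available in practice.
    """
    return [a - r for a, r in zip(x_adv, x_restored)]


# Hypothetical toy values, chosen so the attack flips the prediction.
w = [1.0, -2.0, 0.5]
x_clean = [0.5, -0.25, 0.5]

x_adv = fgsm_linear(w, x_clean, y=1, eps=0.5)
delta = estimate_perturbation(x_adv, x_clean)

# Subtracting the estimated perturbation restores the original class.
x_fixed = [a - d for a, d in zip(x_adv, delta)]

print(predict(w, x_clean), predict(w, x_adv), predict(w, x_fixed))  # 1 -1 1
print(delta)  # [-0.5, 0.5, -0.5]
```

In this toy case every coordinate of the recovered perturbation has the same magnitude, which is the kind of signature that could in principle help group inputs by the attack that produced them.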

In short, we aim to develop a novel unsupervised learning method that can obtain powerful, generalizable representations without any semantic labels, so as to better protect privacy and resist malicious cyberattacks while making full use of large amounts of unlabeled data.