Machine Learning is Fun Part 8: How to Intentionally Trick Neural Networks
A Look into the Future of Hacking
This article is part of a series. Check out the full series: Part 1, Part 2, Part 3, Part 4, Part 5, Part 6, Part 7 and Part 8!
Almost as long as programmers have been writing computer programs, computer hackers have been figuring out ways to exploit those programs. Malicious hackers take advantage of the tiniest bugs in programs to break into systems, steal data and generally wreak havoc.

But systems powered by deep learning algorithms should be safe from human interference, right? How is a hacker going to get past a neural network trained on terabytes of data?
It turns out that even the most advanced deep neural networks can be easily fooled. With a few tricks, you can force them into predicting whatever result you want:

So before you launch a new system powered by deep neural networks, let’s learn exactly how to break them and what you can do to protect yourself from attackers.
Neural Nets as Security Guards
Let’s imagine that we run an auction website like eBay. On our website, we want to prevent people from selling prohibited items, such as live animals.
Enforcing these kinds of rules is hard if you have millions of users. We could hire hundreds of people to review every auction listing by hand, but that would be expensive. Instead, we can use deep learning to automatically check auction photos for prohibited items and flag the ones that violate the rules.
This is a typical image classification problem. To build this, we’ll train a deep convolutional neural network to tell prohibited items apart from allowed items and then we’ll run all the photos on our site through it.
First, we need a data set of thousands of images from past auction listings. We need images of both allowed and prohibited items so that we can train the neural network to tell them apart:

To train the neural network, we use the standard back-propagation algorithm. This is an algorithm where we pass in a training picture, pass in the expected result for that picture, and then walk back through each layer in the neural network, adjusting its weights slightly to make the network a little better at producing the correct output for that picture:

We repeat this thousands of times with thousands of photos until the model reliably produces the correct results with an acceptable accuracy.
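To make that loop concrete, here is a minimal sketch in plain NumPy. It is not a convolutional network: it assumes a single-layer logistic-regression classifier and made-up two-number "photos", so the data, `learning_rate` and step count are all illustrative. The shape of the loop is the same, though: predict, measure the error, nudge the weights, repeat.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for photos: each "image" is just 2 numbers.
# Label 0 = allowed, label 1 = prohibited. The data is synthetic.
allowed = rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(100, 2))
prohibited = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(100, 2))
X = np.vstack([allowed, prohibited])
y = np.concatenate([np.zeros(100), np.ones(100)])

weights = np.zeros(2)
bias = 0.0
learning_rate = 0.1  # illustrative value


def predict(X, weights, bias):
    """Return the predicted probability that each example is prohibited."""
    return 1.0 / (1.0 + np.exp(-(X @ weights + bias)))


for step in range(1000):
    probs = predict(X, weights, bias)
    error = probs - y                   # how far off each prediction is
    grad_w = X.T @ error / len(y)       # gradient of the cross-entropy loss
    grad_b = error.mean()
    weights -= learning_rate * grad_w   # adjust the weights slightly
    bias -= learning_rate * grad_b

accuracy = ((predict(X, weights, bias) > 0.5) == y).mean()
print(f"training accuracy: {accuracy:.2f}")
```

A real deep network repeats the same weight update for every layer, with back-propagation supplying each layer's gradient.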
The end result is a neural network that can reliably classify images:

Note: If you want more detail on how convolutional neural networks recognize objects in images, check out Part 3.
But things are not as reliable as they seem…
Convolutional neural networks are powerful models that consider the entire image when classifying it. They can recognize complex shapes and patterns no matter where they appear in the image. In many image recognition tasks, they can equal or even beat human performance.
With a fancy model like that, changing a few pixels in the image to be darker or lighter shouldn’t have a big effect on the final prediction, right? Sure, it might change the final likelihood slightly, but it shouldn’t flip an image from “prohibited” to “allowed”.

But a famous 2013 paper called Intriguing properties of neural networks showed that this isn’t always true. If you know exactly which pixels to change and exactly how much to change them, you can intentionally force the neural network to predict the wrong output for a given picture without changing the appearance of the picture very much.
That means we can intentionally craft a picture that is clearly a prohibited item but which completely fools our neural network:

Why is this? A machine learning classifier works by finding a dividing line between the things it’s trying to tell apart. Here’s how that looks on a graph for a simple two-dimensional classifier that’s learned to separate green points (acceptable) from red points (prohibited):

Right now, the classifier works with 100% accuracy. It’s found a line that perfectly separates all the green points from the red points.
But what if we want to trick it into mis-classifying one of the red points as a green point? What’s the minimum amount we could move a red point to push it into green territory?
If we add a small amount to the Y value of a red point right beside the boundary, we can just barely push it over into green territory:

So to trick a classifier, we just need to know which direction to nudge the point to get it over the line. And if we don’t want to be too obvious about being nefarious, ideally we’ll move the point as little as possible so it just looks like an honest mistake.
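Here is a small sketch of that idea, assuming a hypothetical linear classifier whose dividing line is `w . x + b = 0`. The weights and the red point are invented for illustration. It shows two nudges: one that only changes the Y value, and the shortest possible one, which moves straight toward the line.

```python
import numpy as np

# Hypothetical 2-D linear classifier (values made up for this example):
# score > 0 means "red" (prohibited), otherwise "green" (acceptable).
w = np.array([1.0, -1.0])
b = 0.0


def classify(point):
    return "red" if w @ point + b > 0 else "green"


red_point = np.array([2.0, 1.8])        # sits just on the red side of the line
score = w @ red_point + b               # about 0.2, so classified as red
margin = 1e-3                           # tiny extra push to land past the line

# Option 1: only add to the Y value, as described in the text.
y_nudge = score / abs(w[1]) + margin
nudged_y_only = red_point + np.array([0.0, y_nudge])

# Option 2: the shortest move, straight toward the line (along -w).
distance = score / np.linalg.norm(w)
direction = -w / np.linalg.norm(w)
nudged_shortest = red_point + direction * (distance + margin)

print(classify(red_point), classify(nudged_y_only), classify(nudged_shortest))
print("Y-only move:", np.linalg.norm(nudged_y_only - red_point))
print("shortest move:", np.linalg.norm(nudged_shortest - red_point))
```

The direction of the shortest move comes straight from the classifier's own weights. That is the same information an attacker extracts from a neural network with back-propagation.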
In image classification with deep neural networks, each “point” we are classifying is an entire image made up of thousands of pixels. That gives us thousands of possible values that we can tweak to push the point over the decision line. And if we make sure that we tweak the pixels in the image in a way that isn’t too obvious to a human, we can fool the classifier without making the image look manipulated.
In other words, we can take a real picture of one object and change the pixels very slightly so that the image completely tricks the neural network into thinking that the picture is something else — and we can control exactly what object it detects instead:

How to Trick a Neural Network
We’ve already talked about the basic process of training a neural network to classify photos:
1. Feed in a training photo.
2. Check the neural network’s prediction and see how far off it is from the correct answer.
3. Tweak the weights of each layer in the neural network using back-propagation to make the final prediction slightly closer to the correct answer.
4. Repeat steps 1–3 a few thousand times with a few thousand different training photos.
But what if instead of tweaking the weights of the layers of the neural network, we instead tweaked the input image itself until we get the answer we want?
So let’s take the already-trained neural network and “train” it again. But let’s use back-propagation to adjust the input image instead of the neural network layers:

So here’s the new algorithm:
1. Feed in the photo that we want to hack.
2. Check the neural network’s prediction and see how far off it is from the answer we want to get for this photo.
3. Tweak our photo using back-propagation to make the final prediction slightly closer to the answer we want to get.
4. Repeat steps 1–3 a few thousand times with the same photo until the network gives us the answer we want.
At the end of this, we’ll have an image that fools the neural network without changing anything inside the neural network itself.
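The steps above can be sketched in a few lines of NumPy. This is only an illustration under loud assumptions: the "network" is a single logistic unit with random, fixed weights standing in for a trained model, the "photo" is 64 random pixel values, and `learning_rate` is arbitrary. The point is that the weights never change; only the image does.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for an already-trained network: fixed weights over a 64-pixel image.
# Random weights are an assumption made purely for this sketch.
weights = rng.normal(size=64)
original = rng.uniform(0.0, 1.0, size=64)
# Choose the bias so the model starts out confident the image is NOT the target.
bias = -3.0 - original @ weights


def predict(image):
    """Probability that the image belongs to the class we want to fake."""
    return 1.0 / (1.0 + np.exp(-(image @ weights + bias)))


learning_rate = 0.01  # illustrative value
hacked = original.copy()

for step in range(1000):
    p = predict(hacked)                  # steps 1 and 2: predict, compare to target
    if p > 0.9:                          # stop once the model gives our answer
        break
    gradient = -(1.0 - p) * weights      # d(-log p) / d(image)
    hacked -= learning_rate * gradient   # step 3: tweak the image, not the weights

print(f"before: {predict(original):.3f}  after: {predict(hacked):.3f}")
print(f"largest pixel change: {np.abs(hacked - original).max():.3f}")
```

Nothing here limits how far a pixel can move, so some values can drift well away from the original or even outside the valid 0 to 1 range.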
The only problem is that by allowing any single pixel to be adjusted without any limitations, the changes to the image can be drastic enough that you’ll see them. They’ll show up as discolored spots or wavy areas:

To prevent these obvious distortions, we can add a simple constraint to our algorithm. We’ll say that no single pixel in the hacked image can ever be changed by more than a tiny amount from the original image — let’s say something like 0.01%. That forces our algorithm to tweak the image in a way that still fools the neural network without it looking too different from the original image.
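One common way to enforce a limit like this is to clip the image back into an allowed band around the original after every update. The sketch below assumes the same kind of stand-in model as before (a single logistic unit with random, fixed weights), this time over 10,000 pixels. The `max_change` value is arbitrary, picked for the toy model rather than taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "network": fixed random weights over a 10,000-pixel image.
weights = rng.normal(size=10_000)
original = rng.uniform(0.0, 1.0, size=10_000)   # pixel values scaled to 0..1
bias = -3.0 - original @ weights                # model starts confident: "not target"


def predict(image):
    """Probability that the image belongs to the class we want to fake."""
    return 1.0 / (1.0 + np.exp(-(image @ weights + bias)))


max_change = 0.01      # assumed per-pixel limit, for illustration only
learning_rate = 0.01   # illustrative value

hacked = original.copy()
for step in range(1000):
    p = predict(hacked)
    if p > 0.9:
        break
    gradient = -(1.0 - p) * weights
    hacked -= learning_rate * gradient
    # The constraint: no pixel may stray more than max_change from the original.
    hacked = np.clip(hacked, original - max_change, original + max_change)
    # Also keep every pixel a valid value.
    hacked = np.clip(hacked, 0.0, 1.0)

largest_change = np.abs(hacked - original).max()
print(f"before: {predict(original):.3f}  after: {predict(hacked):.3f}")
print(f"largest pixel change: {largest_change:.4f}")
```

With thousands of pixels to work with, many tiny changes add up to a large shift in the prediction, which is why the constrained attack can still succeed.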
Here’s what the generated image looks like when we add that constraint:

Even though that image looks the same to us, it still fools the neural network!
Let’s Code It
To code this, first we need a pre-trained neural network to fool. Instead of training one from scratch, let’s use one created by Google.
Keras, the popular deep learning framework, comes with several pre-trained neural networks. We’ll use its copy of Google’s Inception v3 deep neural network that was pre-trained to detect 1000 different kinds of objects.
Here’s the basic code in Keras to recognize what’s in a picture using this neural network. Just make sure you have Python 3 and Keras installed before you run it:





