Let’s hack the world’s most popular WordPress captcha plug-in

Everyone hates captchas — you’re always asked to enter the text contained in those annoying images before you’re allowed to visit a website.

Captchas are designed to stop computers from filling out forms automatically by verifying that you are a real person. But with the rise of deep learning and computer vision, they are now often easy to defeat.

I have been reading Adrian Rosebrock’s excellent book Deep Learning for Computer Vision with Python. In the book, Adrian briefly describes how he used machine learning to bypass the captcha on the E-ZPass New York website:


Adrian didn’t have access to the source code of the application that generates the captcha images. To crack the system, he had to download hundreds of sample images and solve them manually to train his deep learning system.

But what if we wanted to break an open source captcha system?

I went to the WordPress.org plugin registry and searched for “captcha.” The top result is a plug-in called Really Simple Captcha, with over a million active installations:


And best of all, it’s open source! Since we have the source code that generates the captchas, this should be easy to crack. To make things more challenging, let’s give ourselves a time limit. Can we break the captcha system in 15 minutes? Let’s try it!

Important note: This is in no way a criticism of the Really Simple Captcha plug-in or its author. The plug-in author himself says that it is no longer secure and suggests you use something else. This is just a quick and fun technical challenge. But if you’re one of the remaining million-plus users, maybe you should switch to another plug-in 🙂

The challenge

To create a plan of attack, let’s take a look at what kind of images the plugin will generate. On the demo site, we see this:


Okay, so the captcha images appear to contain four letters. Let’s verify this in the PHP source code:


Yes, it generates four-letter captchas using a random mix of four different fonts. As you can see, it never uses “O” or “I” in the codes, to avoid user confusion. That leaves us with a total of 32 possible letters and numbers to recognize. No problem! Time elapsed so far: 2 minutes.
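As a quick sanity check on that count, here is a small sketch. It assumes the alphabet also leaves out the look-alike digits 0 and 1. That assumption is not stated above, but it is what makes the total come out to 32 rather than 34. The name CAPTCHA_CHARS is made up for this example.

```python
import string

# Assumption: besides "O" and "I", the look-alike digits "0" and "1" are
# also excluded. CAPTCHA_CHARS is a hypothetical name used only in this sketch.
EXCLUDED = set("OI01")
CAPTCHA_CHARS = [c for c in string.ascii_uppercase + string.digits
                 if c not in EXCLUDED]

print(len(CAPTCHA_CHARS))  # 32
```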

Our toolset

Before we go any further, let’s talk about the tools we’ll use to solve this problem:

Python 3

Python is a fun programming language with great libraries for machine learning and computer vision.

OpenCV

OpenCV is a popular framework for computer vision and image processing. We will use OpenCV to process captcha images.

It has a Python API, so we can use it directly from Python.

Keras

Keras is a deep learning framework written in Python. It enables us to define, train and use deep neural networks easily with minimal code.

TensorFlow

TensorFlow is Google’s machine learning library. We’ll write our code in Keras, but Keras doesn’t implement the neural network logic itself. Instead, it calls TensorFlow behind the scenes to do the calculations.

Ok, now let’s get back to the challenge!

Create our data set

To train any machine learning system, we need training data. To crack a captcha system, we want training data that looks like this:


Since we have the source code for this WordPress captcha plug-in, we can modify it to save 10,000 captcha images along with the correct answer for each image. After a few minutes of modifying the source code and adding a simple for loop, I had a folder of training data: 10,000 PNG files, each named with the correct answer:
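Because each file is named after its answer, the label can be recovered from the file name alone. Here is a minimal sketch, assuming the images sit in one folder and are named like 2A2X.png. The helper name load_labels and that file naming are illustrative assumptions, not taken from the plug-in or the sample data.

```python
from pathlib import Path

def load_labels(folder):
    """Map each PNG path in `folder` to the captcha text in its file name."""
    labels = {}
    for path in sorted(Path(folder).glob("*.png")):
        labels[path] = path.stem  # "2A2X.png" -> "2A2X"
    return labels
```

Calling load_labels on the folder of generated images would give a dictionary that pairs every image path with the four-character answer to train against.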


This is the only part where I won’t give you sample code. We’re doing this for education, and I don’t want you to actually hack WordPress sites. However, I will give you the 10,000 images I generated so that you can reproduce my results.

Time elapsed so far: 5 minutes.

Simplify the problem

Now that we have the training data, we could use it directly to train a neural network:


With enough training data, this brute-force approach might even work, but we can make the problem much easier to solve. The simpler the problem, the less training data and computing power we need. After all, we only have 15 minutes!

Fortunately, the captcha image is always made up of four letters. If we could segment the image in a way so that each letter was an independent image, we would only need to train the neural network to recognize one letter at a time:


I don’t have time to go through 10,000 training images and manually split them into individual images in Photoshop. That would take days, and I only have 10 minutes left.

And we can’t split the image into four blocks of the same size, because the captcha places the letters randomly in different horizontal positions:


Letters are placed randomly in each image to make it harder to segment the image.

Fortunately, we can still do this automatically. In image processing, we often need to detect pixel clusters with the same color. The boundaries around these continuous clusters of pixels are called contours. OpenCV has a built-in findContours() function that can be used to detect these contiguous regions.

So we’ll start with a raw captcha image:


We then convert the image to pure black and white (this is called thresholding) so that it is easy to find the contiguous regions:


Next, we’ll use OpenCV’s findContours() function to detect the separate parts of the image that contain contiguous blobs of pixels of the same color:


Then it’s just a matter of saving each region as a separate image file. And since we know that each image should contain four letters from left to right, we can use that knowledge to label the letters. As long as we save them in order, we can save each letter image under the correct letter name.

But wait — I see a problem! Sometimes captchas have overlapping letters like this:


This means that we will end up extracting regions where two letters are mashed together:


If we don’t address this problem, we will end up creating bad training data. We need to fix this so that we don’t accidentally teach the machine to recognize two connected letters as one letter.


We will split in half any region that is much wider than it is tall and treat it as two letters. It’s crude, but it works well enough for these captchas.
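A minimal sketch of that splitting rule, working on (x, y, width, height) boxes. The function name and the 1.25 width-to-height cut-off are assumptions for illustration; the text above only says that the region must be noticeably wider than it is tall.

```python
def split_wide_boxes(boxes, max_ratio=1.25):
    """Split any (x, y, w, h) box whose width/height exceeds max_ratio in two."""
    result = []
    for x, y, w, h in boxes:
        if w / h > max_ratio:
            half = w // 2
            result.append((x, y, half, h))
            # The right half keeps the extra pixel when the width is odd.
            result.append((x + half, y, w - half, h))
        else:
            result.append((x, y, w, h))
    return result

print(split_wide_boxes([(0, 0, 10, 12), (14, 0, 30, 12)]))
# [(0, 0, 10, 12), (14, 0, 15, 12), (29, 0, 15, 12)]
```

After splitting, a captcha should yield exactly four boxes; an image that gives any other count is probably better skipped than turned into bad training data.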

Now that we have a way to extract individual letters, let’s run it on all of our captcha images. The goal is to collect different variations of each letter. We can save each letter in its own folder.

Here’s what my “W” folder looks like after I extract all the letters:


Some of the “W” letters extracted from our 10,000 captcha images. I ended up with 1,147 different “W” images. Time elapsed so far: 10 minutes.

Create and train the neural network

Since we only need to recognize individual letter and number images, we do not need a very complex neural network architecture. It is much easier to recognize letters than complex images such as pictures of cats and dogs.

We will use a simple convolutional neural network structure with two convolutional layers and two fully connected layers:


If you want to know more about how convolutional neural networks work and why they are ideal for image recognition, check out Adrian’s book or my previous articles.

Defining this neural network architecture with Keras requires only a few lines of code:
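Here is a minimal sketch of a network with that shape: two convolutional layers and two fully connected layers. It assumes 20x20 grayscale letter images and 32 output classes. The filter counts, kernel sizes and layer widths are illustrative choices, not values taken from this article, and build_model is a made-up name.

```python
from tensorflow.keras import layers, models

def build_model(num_classes=32, input_shape=(20, 20, 1)):
    """Small CNN: two conv+pool stages followed by two dense layers."""
    model = models.Sequential([
        layers.Input(shape=input_shape),
        # First convolutional layer with max pooling
        layers.Conv2D(20, (5, 5), padding="same", activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        # Second convolutional layer with max pooling
        layers.Conv2D(50, (5, 5), padding="same", activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        # Two fully connected layers; the last one scores each character
        layers.Flatten(),
        layers.Dense(500, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(loss="categorical_crossentropy", optimizer="adam",
                  metrics=["accuracy"])
    return model
```

The softmax output gives one probability per possible character, so the predicted letter is simply the position of the largest value.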


Now, we can start training it!
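A training step could look roughly like the sketch below. It assumes the letter images are already resized to one common size and that model is any object with a Keras-style fit method. The helper names, the 25% validation split and the batch size are illustrative assumptions.

```python
import numpy as np

def one_hot(labels, alphabet):
    """Turn letters like "W" into one-hot rows ordered by `alphabet`."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    encoded = np.zeros((len(labels), len(alphabet)), dtype="float32")
    for row, ch in enumerate(labels):
        encoded[row, index[ch]] = 1.0
    return encoded

def train(model, images, labels, alphabet, epochs=10):
    """Scale pixels to [0, 1], one-hot the labels, shuffle and fit."""
    x = np.asarray(images, dtype="float32") / 255.0
    x = x[..., np.newaxis]  # add the single grayscale channel axis
    y = one_hot(labels, alphabet)
    # Shuffle first: validation_split takes the last rows without shuffling,
    # so data sorted by letter would otherwise give a lopsided validation set.
    order = np.random.default_rng(0).permutation(len(x))
    x, y = x[order], y[order]
    return model.fit(x, y, validation_split=0.25, batch_size=32,
                     epochs=epochs)
```

Each epoch is one full pass over the training images, so epochs=10 means the network sees every letter ten times.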


After 10 passes over the training data set, we reached nearly 100% accuracy. At this point, we should be able to bypass this captcha automatically whenever we want. We did it! Time elapsed: 15 minutes.

Using the trained model to solve captchas

Now that we have a trained neural network, using it to break a real captcha is simple:

1. Grab a real captcha image from a WordPress site using the plugin.

2. Split the captcha image into four separate letter images using the same method we used to create the training data set.

3. Ask our neural network to make a separate prediction for each letter image.

4. Use the four predicted letters as the answer to the captcha.
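The last two steps can be sketched as a small decoding function. It assumes the network returns one row of per-character scores for each letter image; the function name, the three-character alphabet and the scores are made up for the example.

```python
def decode_predictions(probabilities, alphabet):
    """Pick the most likely character for each letter image and join them."""
    letters = []
    for probs in probabilities:
        best = max(range(len(probs)), key=lambda i: probs[i])
        letters.append(alphabet[best])
    return "".join(letters)

# Made-up scores for a 3-character alphabet, one row per letter image.
alphabet = "ABC"
scores = [[0.1, 0.8, 0.1], [0.7, 0.2, 0.1], [0.0, 0.1, 0.9], [0.6, 0.3, 0.1]]
print(decode_predictions(scores, alphabet))  # BACA
```

In practice the rows would come from the model’s predictions on the four cropped letters, with alphabet listing the characters in the same order used for the training labels.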

Here’s how our model decodes real captchas:





Or from the command line:


Give it a try!

If you want to try it out for yourself, you can get the code here. It includes 10,000 sample images and all the code for each step in this article. Take a look at the readme.md file inside to see how to run it.

However, if you want to understand exactly what each line of code does, I highly recommend you also get a copy of Deep Learning for Computer Vision with Python. It goes into much more depth and has lots of detailed examples. It’s the only book I’ve seen so far that covers both how things work and how to solve problems in the real world. Go and have a look!

