Everyone hates captchas, those annoying images containing text you have to type in before you can log on to a website. Captchas are designed to stop bots from automatically filling out forms by verifying that you are a real person. But with the rise of deep learning and computer vision, they can now often be defeated.
I read Deep Learning for Computer Vision with Python by Adrian Rosebrock. In the book, Adrian describes how he used machine learning to bypass the captcha on New York’s E-ZPass website:
Adrian had no access to the source code for the captcha generated by the app. To hack it, he had to download hundreds of sample images and manually process them to train his own system.
But what if we want to break into an open source captcha system, and we do have access to the source code?
I went to WordPress.org’s plugin directory and searched for “captcha.” The first result was Really Simple CAPTCHA, with over a million active installs:
Best of all, it’s open source! Since we have the source code that generates the captchas, this should be pretty easy to break. To make things more challenging, let’s give ourselves a time limit. Can we break this captcha system in 15 minutes? Let’s try it!
Important note: This is in no way a criticism of the Really Simple CAPTCHA plugin or its author. The plugin author himself says it is no longer secure and suggests using something else. This is just a fun and quick technical challenge. But if you’re one of those million-plus users, maybe you should switch to another plugin 🙂
The challenge
To come up with a plan of attack, let’s look at what kind of images Really Simple CAPTCHA generates. On the sample site, we see the following images:
Okay, so the captchas seem to be four characters long. Let’s verify this in the PHP source code:
public function __construct() {
    /* Characters available in images */
    $this->chars = 'ABCDEFGHJKLMNPQRSTUVWXYZ23456789';

    /* Length of a word in an image */
    $this->char_length = 4;

    /* Array of fonts. Randomly picked up per character */
    $this->fonts = array(
        dirname( __FILE__ ) . '/gentium/GenBkBasR.ttf',
        dirname( __FILE__ ) . '/gentium/GenBkBasI.ttf',
        dirname( __FILE__ ) . '/gentium/GenBkBasBI.ttf',
        dirname( __FILE__ ) . '/gentium/GenBkBasB.ttf',
    );
That’s right: it generates four-character captchas, picking one of four fonts at random for each character. As the character list shows, it never uses O or I (or the digits 0 and 1) to avoid user confusion. That leaves 32 possible letters and digits that we need to recognize. No problem!
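As a quick sanity check on that count, the character string from the PHP source can be inspected in Python. This is only an illustrative sketch; the variable names are hypothetical and not part of the plugin.

```python
import string

# Character set copied from the plugin's PHP constructor
CAPTCHA_CHARS = "ABCDEFGHJKLMNPQRSTUVWXYZ23456789"

# Every uppercase letter and digit the plugin could have used
all_candidates = string.ascii_uppercase + string.digits

# Characters that were left out of the captcha alphabet
excluded = sorted(set(all_candidates) - set(CAPTCHA_CHARS))

# Number of classes a classifier has to tell apart
num_classes = len(CAPTCHA_CHARS)

print(num_classes)  # 32
print(excluded)     # ['0', '1', 'I', 'O']
```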
Time elapsed so far: 2 minutes
Tools
Before we move on, let’s go over the tools we’ll use to solve the problem:
Python 3
Python is a fun programming language with excellent libraries for machine learning and computer vision.
OpenCV
OpenCV is a popular framework for computer vision and image processing. We’ll use OpenCV to process the captcha images. Since it has a Python API, we can use it directly from Python.
Keras
Keras is a deep learning framework written in Python. It makes it easy to define, train, and use deep neural networks with minimal code.
TensorFlow
TensorFlow is Google’s machine learning library. We’ll be coding in Keras, but Keras doesn’t implement the neural network logic itself. Instead, it uses Google’s TensorFlow library behind the scenes to do the heavy lifting.
All right, back to our challenge!
Create our data set
To train any machine learning system, we need training data. To break into a captcha system, we want training data like this:
Since we have the source code for the WordPress plugin, we can modify it to save 10,000 captcha images together with the expected answer for each image.
After a few minutes of hacking on the code and adding a simple for loop, I had a folder of training data: 10,000 PNG files, each named after its correct answer:
This is the only part where I won’t give you sample code. We’re doing this for education, and I don’t want you to actually attack real WordPress sites. However, I will give you the 10,000 generated images at the end so you can reproduce my results.
Time elapsed so far: 5 minutes
Simplify the problem
Now that we have the training data, we could use it directly to train a neural network on whole captcha images:
With enough training data, this approach might work — but we can make the problem simpler to solve. The simpler the problem, the less training data and computing power required to solve it. After all, we only have 15 minutes!
Fortunately, captcha images always consist of just four letters. If we could figure out a way to separate the picture so that each letter was in a separate picture, we would only need to train the neural network to recognize one letter at a time:
I don’t have time to go through 10,000 training images and split them up by hand in Photoshop. That could take days, and I only have 10 minutes left. We also can’t simply cut each image into four equal-sized pieces, because the captcha plugin places the letters at random horizontal positions to prevent exactly that:
Fortunately, we can still do this automatically. In image processing, we often need to detect blobs of pixels that share the same color. The boundaries around those contiguous blobs are called contours. OpenCV has a built-in findContours() function that we can use to detect these contiguous regions.
So we start with an unprocessed captcha image:
Next, we convert the image to pure black and white (this is called thresholding) so that the contiguous regions are easy to find:
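Thresholding itself is a one-line idea. In practice OpenCV's cv2.threshold() would do this on a real image; the dependency-free sketch below uses nested lists instead so the logic is visible. The cutoff of 128, the inverted output (letters become white), and the function name are all assumptions for illustration.

```python
def threshold_inverted(gray, cutoff=128):
    """Return a binary image: 255 where a pixel is darker than cutoff, else 0."""
    return [[255 if pixel < cutoff else 0 for pixel in row] for row in gray]


# Tiny 3x4 grayscale "image": dark strokes (low values) on a light background
gray = [
    [250, 30, 240, 245],
    [248, 25, 20, 251],
    [252, 249, 35, 247],
]

binary = threshold_inverted(gray)
for row in binary:
    print(row)
```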
Next, we use OpenCV’s findContours() function to detect the separate parts of the image that contain contiguous blobs of pixels of the same color:
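To make the idea of contiguous regions concrete, here is a small pure-Python stand-in that flood-fills a binary image and returns one bounding box per region. It is not how findContours() is implemented and it is not a replacement for it; it only sketches the concept, assuming 4-connected pixels and boxes in (x, y, width, height) form.

```python
from collections import deque


def find_regions(binary):
    """Return bounding boxes (x, y, w, h) of connected groups of non-zero pixels."""
    height, width = len(binary), len(binary[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if binary[y][x] == 0 or seen[y][x]:
                continue
            # Breadth-first flood fill from this unvisited foreground pixel
            queue = deque([(x, y)])
            seen[y][x] = True
            min_x = max_x = x
            min_y = max_y = y
            while queue:
                cx, cy = queue.popleft()
                min_x, max_x = min(min_x, cx), max(max_x, cx)
                min_y, max_y = min(min_y, cy), max(max_y, cy)
                neighbours = ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1))
                for nx, ny in neighbours:
                    inside = 0 <= nx < width and 0 <= ny < height
                    if inside and binary[ny][nx] != 0 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append((min_x, min_y, max_x - min_x + 1, max_y - min_y + 1))
    return boxes


# Two separate blobs: a small zig-zag on the left and a vertical bar on the right
binary = [
    [0, 255, 0, 0, 0, 255],
    [0, 255, 255, 0, 0, 255],
    [0, 0, 255, 0, 0, 255],
]

regions = find_regions(binary)
print(regions)  # [(1, 0, 2, 3), (5, 0, 1, 3)]
```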
The next step is simply to save each region as a separate image file. Since we know each captcha contains four letters reading left to right, we can use that knowledge to label the letters as we save them. As long as we save them in order, we can store each letter image together with the name of the letter it shows.
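A rough sketch of that labeling step: sort the detected boxes by their x coordinate and pair them with the characters of the answer taken from the file name. The file name, the box values, and the helper name label_regions are hypothetical, and boxes are assumed to be (x, y, width, height) tuples.

```python
import os


def label_regions(image_path, boxes):
    """Pair each bounding box with its letter, using the answer in the file name."""
    # "captchas/2A2X.png" -> "2A2X"
    answer = os.path.splitext(os.path.basename(image_path))[0]
    if len(boxes) != len(answer):
        # Too few or too many regions were found: skip this image
        return None
    # Sort by x so the boxes run left to right, matching the order of the answer
    ordered = sorted(boxes, key=lambda box: box[0])
    return list(zip(answer, ordered))


boxes = [(60, 5, 12, 20), (10, 4, 14, 22), (85, 6, 13, 19), (35, 5, 11, 21)]
pairs = label_regions("captchas/2A2X.png", boxes)
for letter, box in pairs:
    print(letter, box)
```

Returning None when the region count does not match the answer length keeps badly segmented captchas out of the training set.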
But wait — I see a problem! Sometimes captchas have overlapping letters like this:
This means that we end up extracting the two letters as a single region:
If we don’t handle this problem, we will create bad training data. We need to fix it so that we don’t accidentally teach the machine to recognize two overlapping letters as one.
A simple workaround is to check whether a contour region is wider than it is tall. If it is, we probably have two overlapping letters. In that case, we can split the region down the middle and treat the halves as two separate letters:
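The width check can be sketched in a few lines. The 1.25 width-to-height ratio used here is an assumed, tunable threshold rather than a value stated in the article, and the function name is hypothetical; boxes are (x, y, width, height) tuples.

```python
def split_wide_boxes(boxes, max_ratio=1.25):
    """Split any box that is much wider than it is tall into left and right halves."""
    result = []
    for x, y, w, h in boxes:
        if w / h > max_ratio:
            # Probably two letters squashed together: cut down the middle
            half = w // 2
            result.append((x, y, half, h))
            result.append((x + half, y, w - half, h))
        else:
            # Looks like a single letter: keep as is
            result.append((x, y, w, h))
    return result


boxes = [(5, 4, 14, 22), (30, 5, 40, 20), (80, 6, 13, 19)]
letters = split_wide_boxes(boxes)
print(letters)
# [(5, 4, 14, 22), (30, 5, 20, 20), (50, 5, 20, 20), (80, 6, 13, 19)]
```

Splitting at the midpoint is crude, since the two letters rarely have equal width, but it is cheap and tends to be good enough for generating training data.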
Now that we’ve found a way to separate out individual letters, we’ll do the same for all captcha images. The goal is to collect different variations of each letter. We can keep each letter in its own folder to keep it organized.
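One possible way to organize the output, sketched with hypothetical names: keep a running count per letter and derive a unique file path inside that letter's folder. The directory name "letters" and the zero-padded naming scheme are assumptions.

```python
import os
from collections import Counter

# How many images have been saved so far for each letter
counts = Counter()


def next_letter_path(output_dir, letter):
    """Return the next unused file path inside the folder for this letter."""
    counts[letter] += 1
    return os.path.join(output_dir, letter, "{:06d}.png".format(counts[letter]))


paths = [next_letter_path("letters", ch) for ch in "WAW"]
for path in paths:
    print(path)
```

Before writing an image to one of these paths, the folder would need to exist, for example via os.makedirs(os.path.dirname(path), exist_ok=True).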
After I’ve isolated all the letters, my W folder looks like this:
Time elapsed so far: 10 minutes
Build and train the neural network
Since we only need to recognize images of single letters and digits, we don’t need a very complex neural network architecture. Recognizing letters is a much easier problem than recognizing complex images like pictures of cats and dogs.
We will use a simple convolutional neural network architecture with two convolutional layers and two fully connected layers:
If you want to learn more about how neural networks work and why they are well suited to image recognition, refer to Adrian’s book or my previous article.
Defining this neural network architecture takes only a few lines of Keras code:
# Imports needed for the layers used here
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

# Build the neural network!
model = Sequential()

# First convolutional layer with max pooling
model.add(Conv2D(20, (5, 5), padding="same", input_shape=(20, 20, 1), activation="relu"))
model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))

# Second convolutional layer with max pooling
model.add(Conv2D(50, (5, 5), padding="same", activation="relu"))
model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))

# Hidden layer with 500 nodes
model.add(Flatten())
model.add(Dense(500, activation="relu"))

# Output layer with 32 nodes (one for each possible letter/number we predict)
model.add(Dense(32, activation="softmax"))

# Ask Keras to build the TensorFlow model behind the scenes
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
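To get a feel for how small this network is, the number of trainable parameters can be worked out by hand from the layer definitions. The helper functions below are hypothetical; the arithmetic assumes the standard weights-plus-biases count for convolutional and dense layers, "same" padding (which keeps the 20x20 size), and 2x2 pooling (which halves each side).

```python
def conv_params(kernel_h, kernel_w, in_channels, filters):
    """Weights plus one bias per filter for a 2D convolution layer."""
    return kernel_h * kernel_w * in_channels * filters + filters


def dense_params(inputs, units):
    """Weights plus one bias per unit for a fully connected layer."""
    return inputs * units + units


side = 20                                # input images are 20x20 with 1 channel
conv1 = conv_params(5, 5, 1, 20)         # first convolution, 20 filters
side //= 2                               # first max pooling: 20x20 -> 10x10
conv2 = conv_params(5, 5, 20, 50)        # second convolution, 50 filters
side //= 2                               # second max pooling: 10x10 -> 5x5
flat = side * side * 50                  # values fed into the hidden layer
dense1 = dense_params(flat, 500)         # hidden layer with 500 nodes
dense2 = dense_params(500, 32)           # output layer with 32 nodes

total = conv1 + conv2 + dense1 + dense2
print(conv1, conv2, dense1, dense2)  # 520 25050 625500 16032
print(total)                         # 667102
```

Most of the parameters sit in the 500-node hidden layer, which is typical for this kind of small convolutional network.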
Now we can train it!
# Train the neural network
model.fit(X_train, Y_train, validation_data=(X_test, Y_test), batch_size=32, epochs=10, verbose=1)
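The categorical_crossentropy loss expects each label as a one-hot vector with one slot per class. The article does not show how Y_train was built, so the following is only a sketch of one way to encode a letter. The class order (taken from the plugin's character string) and the helper name one_hot are assumptions.

```python
# Assumed class order: the character string from the plugin source
CAPTCHA_CHARS = "ABCDEFGHJKLMNPQRSTUVWXYZ23456789"


def one_hot(letter):
    """Return a 32-element list with a 1 at the index of the given letter."""
    vector = [0] * len(CAPTCHA_CHARS)
    vector[CAPTCHA_CHARS.index(letter)] = 1
    return vector


encoded = one_hot("C")
print(encoded.index(1))  # 2
print(len(encoded))      # 32
```

Whatever order is chosen, the same order has to be used later when turning the network's output back into letters.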
After 10 passes over the training data set, we reached nearly 100% accuracy. At this point, we should be able to automatically bypass this captcha whenever we want! We did it!
Time elapsed: 15 minutes (that was close!)
Using the trained model to solve captchas
Now that you have a trained neural network, it’s easy to use it to break real captchas:
- 1. Grab a real captcha image from a website that uses this WordPress plugin.
- 2. Split the captcha image into four separate letter images, using the same approach we used to create the training data set.
- 3. Ask the neural network to make a separate prediction for each letter image.
- 4. Use the four predicted letters as the answer to the captcha.
- 5. Celebrate!
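Steps 3 and 4 come down to picking the most likely class for each letter image and joining the results. The sketch below fakes the network output with hand-built probability lists, so it shows only the decoding step. The class order and the names decode and fake_prediction are assumptions, and the order must match whatever was used to encode the training labels.

```python
# Assumed class order: must match the order used when encoding training labels
CAPTCHA_CHARS = "ABCDEFGHJKLMNPQRSTUVWXYZ23456789"


def decode(predictions):
    """Turn one probability list per letter image into the captcha text."""
    letters = []
    for probs in predictions:
        # Index of the highest probability is the predicted class
        best = max(range(len(probs)), key=lambda i: probs[i])
        letters.append(CAPTCHA_CHARS[best])
    return "".join(letters)


def fake_prediction(letter):
    """Build a stand-in for a softmax output that favours the given letter."""
    probs = [0.01] * len(CAPTCHA_CHARS)
    probs[CAPTCHA_CHARS.index(letter)] = 0.69
    return probs


predictions = [fake_prediction(ch) for ch in "2A2X"]
answer = decode(predictions)
print(answer)  # 2A2X
```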
Here is what our model looks like while solving real captchas:
Or from the command line:
Try it yourself!
If you want to try it out for yourself, you can find the code here (http://t.cn/R8yFJiN). It contains the 10,000 sample images and all the code for each step in this article. See the readme.md file for instructions on how to run it.
But if you want to see what each line of code does, I highly recommend you take a look at Deep Learning for Computer Vision with Python. The book covers more detail and has plenty of detailed examples. It’s the only book I’ve seen so far that covers both how it works and how it can be used to solve complex problems in the real world. Go and have a look!
About the author: An actuarial dog