From Medium

By Adam Geitgey

Compiled by Heart of the Machine

Contributors: Li Zannan, Jiang Siyuan


The image CAPTCHAs you must solve to access many websites are meant to tell human visitors from machines: a kind of “Turing test” that AI researchers have long been trying to crack.
A recent paper by Vicarious in Science describes a new machine learning model for solving image CAPTCHAs. So how long does it take today to crack one of the most widely installed image CAPTCHA plugins? Adam Geitgey’s answer: 15 minutes.


Everyone hates CAPTCHAs: those annoying images containing text that you have to type in before you can access a website. CAPTCHA stands for “Completely Automated Public Turing test to tell Computers and Humans Apart.” CAPTCHAs are designed to verify that a visitor is really human and to keep out malicious programs. However, with advances in deep learning and computer vision, they are now often easy to defeat.

Recently, I have been reading Deep Learning for Computer Vision with Python by Adrian Rosebrock. In the book, Adrian uses machine learning to crack the CAPTCHA on the E-ZPass New York website:

Adrian did not have access to the source code of the application that generates the CAPTCHA images. To break such a system, we have to collect hundreds of sample images and then train a machine learning model on them.

But what if we wanted to crack an open source CAPTCHA system, where we have access to all the source code?

I went to WordPress.org (wordpress.org/) and searched for “CAPTCHA.” The top result is “Really Simple CAPTCHA”, which has more than 1 million active installs: wordpress.org/plugins/rea…

Best of all, the source code is available! With the source code that generates the CAPTCHA images, breaking the CAPTCHA should be easy. To make the task more challenging, let’s put a constraint on ourselves: can we crack it in 15 minutes? Let’s try it!

Note: This is in no way a criticism of the “Really Simple CAPTCHA” plugin or its author. The plugin’s author has stated that these CAPTCHAs are no longer secure and recommends that users look for other, more secure methods of authentication. But if you’re one of those 1 million users, maybe you should be on your guard 🙂


The challenge

First, we need a plan. Let’s see what the images generated by Really Simple CAPTCHA look like. On the demo site, we see something like this:

An example of a CAPTCHA image


It looks like the plugin generates images of four characters. Let’s verify this in the plugin’s PHP source code:

public function __construct() {
        /* Characters available in images */
        $this->chars = 'ABCDEFGHJKLMNPQRSTUVWXYZ23456789';
        /* Length of a word in an image */
        $this->char_length = 4;
        /* Array of fonts. Randomly picked up per character */
        $this->fonts = array(
            dirname( __FILE__ ) . '/gentium/GenBkBasR.ttf',
            dirname( __FILE__ ) . '/gentium/GenBkBasI.ttf',
            dirname( __FILE__ ) . '/gentium/GenBkBasBI.ttf'
        );

Yes, it generates four-character CAPTCHAs made of letters and numbers, with a font picked at random for each character. The character set leaves out “O” and “I” (as well as the digits 0 and 1) so that people don’t confuse letters with numbers. That leaves 32 possible letters and numbers to recognize. No problem!
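
As a quick sanity check on those numbers, here is a minimal Python sketch that counts the symbols in the character set copied from the PHP constructor above. The helper name missing_characters is only illustrative and is not part of the plugin.

```python
import string

# Character set and word length copied from the plugin's PHP constructor.
CAPTCHA_CHARS = "ABCDEFGHJKLMNPQRSTUVWXYZ23456789"
CAPTCHA_LENGTH = 4


def missing_characters(chars):
    """Return the uppercase letters and digits that the character set leaves out."""
    full = string.ascii_uppercase + string.digits
    return [c for c in full if c not in chars]


num_classes = len(set(CAPTCHA_CHARS))         # distinct symbols the model must recognize
num_captchas = num_classes ** CAPTCHA_LENGTH  # number of possible four-character answers

print(num_classes, num_captchas, missing_characters(CAPTCHA_CHARS))
```

The number of distinct symbols is also the number of output classes the neural network will need later on.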

Time so far: 2 minutes


The tools we need

Before we start cracking, here are the tools we’ll use:


Python 3

Python is a very popular programming language in artificial intelligence, with many libraries for machine learning and computer vision.


OpenCV

OpenCV is a popular framework for computer vision and image processing. Here we’ll use it to process the CAPTCHA images. OpenCV has a Python API, so we can call it directly from Python.


Keras

Keras is a deep learning framework written in Python. It makes it easy to define, train and use deep neural networks with very little code.


TensorFlow

TensorFlow is a machine learning library developed and maintained by Google, and one of the most widely used deep learning frameworks. We’ll write our code with Keras, but Keras doesn’t implement the neural network computations itself; it relies on TensorFlow as a back end to do the actual work.


All right, let’s get back to the challenge.


Create a data set

To train any machine learning system, we need a data set. To break the CAPTCHA system, we need training data like this:

That looks like a lot of manual labeling. But we have the source code of the WordPress plugin, so we can modify it slightly to output 10,000 CAPTCHA images automatically, along with the correct answer for each.

After a few minutes of hacking on the source code (simply adding a “for” loop), we had a training set of 10,000 PNG images, with the correct answer for each image as its filename:

Note: I won’t give sample code for this part. This article is meant for teaching, and I hope you won’t actually go and hack WordPress sites. But I will give you the 10,000 generated images so that you can reproduce the results.
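
Because the answer is stored in the filename, reading the labels back takes only a few lines. The following is a minimal sketch, not the author’s code; the function names and the example folder name generated_captcha_images are assumptions made for illustration.

```python
import os


def label_from_filename(image_path):
    """Return the CAPTCHA text encoded in a file name such as '2A2X.png'."""
    filename = os.path.basename(image_path)
    return os.path.splitext(filename)[0]


def list_labeled_images(folder):
    """Return sorted (path, label) pairs for every PNG file in `folder`."""
    pairs = []
    for name in sorted(os.listdir(folder)):
        if name.lower().endswith(".png"):
            path = os.path.join(folder, name)
            pairs.append((path, label_from_filename(path)))
    return pairs


# Hypothetical folder name, used only for illustration.
print(label_from_filename("generated_captcha_images/2A2X.png"))
```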

Time so far: 5 minutes


Simplify the problem

Now that we have the training data, we could use it directly to train a neural network:

This approach would work given enough training data, but we can make the problem much easier. The simpler the problem, the less training data and computing power we need to solve it. After all, we only have 15 minutes in total.

Fortunately, a CAPTCHA image always consists of four characters. If we can split the image so that each piece contains only one character, we only need to train the neural network to recognize a single character at a time.

We can’t slice the images by hand in Photoshop or other graphics software, because there are 10,000 training images. Nor can we simply cut each image into four equal-sized blocks, because the CAPTCHA places the characters at random horizontal positions, as shown below:


Fortunately, we can do this automatically with existing tools. In image processing, we often need to detect blobs of connected pixels that share the same color; the boundary of such a blob is called a contour. OpenCV has a built-in findContours() function that detects these contours.

We start with the original CAPTCHA image:



We then convert the image to pure black and white pixels (this is called thresholding), so that the connected regions are easy to find:

Next, we use OpenCV’s findContours() function to detect the separate regions of the image that contain connected blobs of same-colored pixels:

It’s then easy to save each region as a separate image file. Because we know that each image contains four characters from left to right, we can use that knowledge to label the characters as we save them: as long as we process the regions in order, each one can be saved under the name of the corresponding character.
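
The left-to-right labeling can be sketched in plain Python. The function name label_regions and the choice to skip images whose region count does not match the answer length are assumptions for illustration.

```python
def label_regions(boxes, captcha_text):
    """Pair each bounding box with the character it contains.

    boxes: list of (x, y, w, h) tuples in any order.
    captcha_text: the known answer, e.g. taken from the image's file name.
    Returns a list of (character, box) pairs, or None when the number of
    regions does not match the number of characters.
    """
    if len(boxes) != len(captcha_text):
        return None  # bad segmentation; skipping it keeps the training set clean
    ordered = sorted(boxes, key=lambda box: box[0])  # left to right by x
    return list(zip(captcha_text, ordered))


# Example boxes in arbitrary order, as a contour search might return them.
example_boxes = [(40, 3, 12, 20), (5, 4, 10, 18), (60, 2, 11, 21), (22, 5, 13, 19)]
labeled = label_regions(example_boxes, "2A2X")
print(labeled)
```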

But there is a problem. Some CAPTCHA images contain overlapping characters:

This means that we will sometimes extract two characters as a single region:

If we don’t solve this problem, we’ll end up with a very bad training set. We need to fix it so that the model does not learn to recognize two overlapping characters as one.

A simple solution: if a region is noticeably wider than it is tall, there is a good chance that it contains two characters squeezed together. In that case, we can split the conjoined region down the middle and treat the two halves as separate characters.

In practice, we split any region whose width exceeds its height by more than a certain ratio into two halves. The method is crude, but it works very well on these CAPTCHAs.
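
Here is a minimal sketch of that heuristic. The article does not state the ratio, so the default of 1.25 below is only an assumed, illustrative value, and split_wide_boxes is a hypothetical helper name.

```python
def split_wide_boxes(boxes, max_aspect_ratio=1.25):
    """Split any (x, y, w, h) box that is too wide to hold a single character.

    max_aspect_ratio is an assumed threshold: a box whose width divided by
    its height exceeds it is cut into a left half and a right half.
    """
    result = []
    for (x, y, w, h) in boxes:
        if w / h > max_aspect_ratio:
            half = w // 2
            result.append((x, y, half, h))             # left half
            result.append((x + half, y, w - half, h))  # right half
        else:
            result.append((x, y, w, h))
    return result


# One normal box and one box wide enough to contain two characters.
split = split_wide_boxes([(0, 0, 10, 20), (20, 0, 31, 20)])
print(split)
```

Cutting exactly down the middle assumes the two conjoined characters are about equally wide, which is why this only works as a rough fix.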

Now that we have a way to extract individual characters, we run it on all of the CAPTCHA images. The goal is to collect the different variations of each character, keeping all the variations of a single character in its own folder.

The figure above shows some of the extracted “W” characters. We ended up with 1,147 different “W” images from the 10,000 CAPTCHA images.

Time so far: 10 minutes
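
One way to lay out the folders is a running counter per character. This sketch only computes the file paths; the function name, the folder name extracted_letter_images and the zero-padded naming scheme are all assumptions for illustration.

```python
import os
from collections import defaultdict


def assign_output_paths(characters, output_dir="extracted_letter_images"):
    """Return one file path per character, grouped into a folder per symbol."""
    counts = defaultdict(int)  # how many images of each character so far
    paths = []
    for char in characters:
        counts[char] += 1
        filename = "{:06d}.png".format(counts[char])
        paths.append(os.path.join(output_dir, char, filename))
    return paths


# The two "2" characters land in the same folder with different file names.
paths = assign_output_paths("2A2X", output_dir="out")
print(paths)
```

In a real run the counters would be kept across all 10,000 images rather than reset for each CAPTCHA, and each folder would be created with os.makedirs() before writing.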


Build and train the neural network

Because we only need to recognize a single character at a time, we don’t need a complex neural network architecture; recognizing letters and numbers is much easier than recognizing complex images. So we use a simple convolutional neural network with two convolutional layers and two fully connected layers.

With Keras, it only takes a few lines of code to define this architecture:

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

# Build the neural network!
model = Sequential()
# First convolutional layer with max pooling (input: 20x20 grayscale character images)
model.add(Conv2D(20, (5, 5), padding="same", input_shape=(20, 20, 1), activation="relu"))
model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))
# Second convolutional layer with max pooling
model.add(Conv2D(50, (5, 5), padding="same", activation="relu"))
model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))
# Hidden layer with 500 nodes
model.add(Flatten())
model.add(Dense(500, activation="relu"))
# Output layer with 32 nodes (one for each possible letter/number we predict)
model.add(Dense(32, activation="softmax"))
# Ask Keras to build the TensorFlow model behind the scenes
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

Let’s start training.

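# Train the neural network
# X_train / X_test hold the 20x20x1 character images;
# Y_train / Y_test hold the matching one-hot labels (32 classes)
model.fit(X_train, Y_train, validation_data=(X_test, Y_test), batch_size=32, epochs=10, verbose=1)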

After 10 epochs, training accuracy reached 100%, so we could stop there: the model was fully trained.

Time so far: 15 minutes


Use the trained model to solve CAPTCHAs

Now we can use the trained neural network to solve real CAPTCHAs:

  1. Grab a real CAPTCHA image from a website that uses this WordPress plugin.
  2. Split the CAPTCHA image into four separate character images, using the same method we used to create the training set.
  3. Ask the neural network to make a prediction for each of the four character images.
  4. Join the four predicted characters together as the answer to the CAPTCHA.


We can also run it directly from the command line:

Give it a try!

If you want to try it out for yourself, here’s the code: s3-us-west-2.amazonaws.com/mlif-exampl…

The archive contains the 10,000 example images and the code for each step covered in this article. There’s also a README file that tells you how to run it.

If you want to learn more about what the code is doing behind the scenes, I recommend reading Deep Learning for Computer Vision with Python. It covers a lot more detail and includes plenty of examples, so if you’re interested in examples that solve difficult real-world problems, it might be for you.



Original link: medium.com/@ageitgey/h…