Verification code

CAPTCHA stands for “Completely Automated Public Turing test to tell Computers and Humans Apart.” It is a public, fully automated program that distinguishes human users from computers. It helps prevent malicious password cracking, automated vote rigging, and forum flooding, and in particular it stops an attacker from using a program to brute-force repeated login attempts against a specific registered account. Verification codes are now common practice on many websites, and the feature can be implemented in a fairly simple way. The challenge can be generated and graded by a computer, but it is meant to be answerable only by a human. Since computers are assumed to be unable to answer a CAPTCHA's challenge, a user who answers it correctly is considered human.

Captchas are commonly used on website login pages to distinguish human actions from automated ones. Enabling a captcha is one of the common anti-crawler and anti-hacking measures. However, as technology advances, and machine learning in particular, recognizing ordinary captchas is no longer very difficult.

Architecture for recognizing captchas

Two things need to be done before a captcha recognition service can be set up. 1) Use an existing crawler to collect captcha images and label them. Here I use PicCrawler, an image crawler I developed myself. Labeling means reading the digits and letters in each image by eye and then using those characters as the image's file name.

2) Use TensorFlow to train a model on these captchas, with at least several thousand captchas per batch. The trained model can then be loaded through TensorFlow’s API.
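The labeling convention in step 1 means the training labels can be read straight back from the file names. The sketch below shows one way to do that and to turn each label into a training target. It is only an illustration: the character set, the fixed captcha length of 4, the list of image suffixes, and the function names are assumptions, and one-hot encoding is a common choice for this kind of model rather than something the article specifies.

```python
import string
from pathlib import Path

# Assumptions for illustration: 4-character captchas made of digits and
# lowercase letters, stored as ordinary image files.
CHARSET = string.digits + string.ascii_lowercase
CAPTCHA_LENGTH = 4
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".gif"}


def load_labeled_captchas(directory):
    """Return (path, label) pairs; the label is the file name without extension."""
    samples = []
    for path in sorted(Path(directory).iterdir()):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            samples.append((path, path.stem.lower()))
    return samples


def encode_label(label):
    """One-hot encode a label as a flat list of CAPTCHA_LENGTH * len(CHARSET) floats."""
    if len(label) != CAPTCHA_LENGTH:
        raise ValueError(f"expected {CAPTCHA_LENGTH} characters, got {label!r}")
    vector = [0.0] * (CAPTCHA_LENGTH * len(CHARSET))
    for position, char in enumerate(label):
        # str.index raises ValueError for characters outside CHARSET.
        vector[position * len(CHARSET) + CHARSET.index(char)] = 1.0
    return vector
```

Each image is then paired with its encoded label and fed to whatever TensorFlow training loop is used; file names that do not fit the assumed length or character set raise a ValueError, which helps catch labeling mistakes early.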

Once that is done, you need to think about how to integrate the trained model into your existing framework.

  • Initial architecture

    We initially considered using OpenCV to load the model, because OpenCV has a Java API, with Vert.x interacting with OpenCV. This architecture distinguishes an online model from an offline model: the online model is the one used in the production environment, and each newly trained offline model can replace it. But OpenCV had problems loading the model, so we tried a different approach.

  • Later attempts

    Using the TensorFlow Java API instead of OpenCV to load the model also ran into problems, so we had to fall back on the last option.

  • Final architecture

    The model is loaded using the Python web framework Flask and the TensorFlow Python API. In this architecture, Vert.x calls the interface exposed by Flask, which returns the recognition result.

Finally, the data returned by the interface matches the content of the captcha in the image. With that, a captcha has been recognized successfully.
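The final architecture can be sketched as a small Flask service that accepts an uploaded image and returns the recognized text as JSON. This is a minimal sketch under stated assumptions, not the actual service: the route "/recognize", the form field "image", the JSON key "text", and the names create_app and placeholder_predict are all hypothetical, and the TensorFlow model call is left as a placeholder because the article does not show how the model is loaded.

```python
from flask import Flask, jsonify, request


def create_app(predict):
    """Build the Flask app around `predict`, a callable mapping image bytes to text."""
    app = Flask(__name__)

    # Hypothetical route and field name, chosen only for this sketch.
    @app.route("/recognize", methods=["POST"])
    def recognize():
        uploaded = request.files.get("image")
        if uploaded is None:
            return jsonify(error="missing 'image' file field"), 400
        # Hand the raw image bytes to the recognizer and return its answer.
        return jsonify(text=predict(uploaded.read()))

    return app


def placeholder_predict(image_bytes):
    # Placeholder: the real service would run the trained TensorFlow model here.
    raise NotImplementedError("plug in the trained model")


if __name__ == "__main__":
    create_app(placeholder_predict).run(host="127.0.0.1", port=5000)
```

Passing the recognizer in as a callable keeps the HTTP layer separate from the model, so a newly trained model can be swapped in without touching the route. A caller such as Vert.x would then send the captcha image as a multipart POST and read the text field from the JSON response.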

Reflections

Currently, only one or two types of captcha can be recognized. In the future, more types of captcha will be labeled and used to train models.

The captcha recognition feature is intended to be integrated as a component of the crawler framework NetDiscovery. Since the crawler framework is open source, this module will be freely available to everyone.

For the captcha module's architecture, the goal is still to replace Python with the more familiar Java.