Preface

Anyone who writes crawlers eventually runs into the same obstacle: verification codes. Some sites, for example, show no content until you log in, so the data cannot be crawled, and the login is guarded by a verification code. Verification codes are one of the anti-crawler measures websites use. As technology develops, they grow more and more complex and crawling gets harder and harder, so this article explains how to recognize verification codes. (Sounds like a big deal.)


1) Graphic verification code: this is the simplest kind of verification code, and also the earliest and most common. It usually consists of 4 letters, digits, or a mix of both.

2) Slider verification code

3) Click verification code

OK, these three are the most common kinds of verification code on the PC. Mobile apps also have gesture verification, grid verification, voice verification and so on, which are not covered here; this series focuses on the three common kinds above.

1 Graphic verification code

The site used here (zhihu.com, linked below) has two kinds of verification code: a graphic verification code and a click verification code. Testing shows that the graphic verification code is displayed at first, but as the number of logins and logouts increases it switches to the click verification code. This switching mechanism is itself a way of blocking crawlers. Enough chatter, picture first:

Here is the link: https://www.zhihu.com/signup?next=%2F It opens on the registration page by default; click the login button, and if no verification code appears, just refresh the page a few times.

The second kind will be covered in the next two articles.

2 Background

Recognizing graphic verification codes requires the tesserocr library. tesserocr is a Python OCR library, but it is really just a Python API wrapper around Tesseract. The core is Tesseract, so Tesseract must be installed before tesserocr.

Wait, I'm confused. tesserocr I can understand, it's a library, but what is OCR? And what is Tesseract?

OCR stands for Optical Character Recognition: the process of scanning characters and translating them into electronic text based on their shapes.

For example, when a graphic verification code appears, OCR is first used to convert it into electronic text, and then the crawler submits the recognition result to the server. This is how the verification code is recognized automatically.

Tesseract is an open-source OCR engine whose development was long sponsored by Google.

OK, the concepts are starting to make sense, but there is one question: in the field of image recognition there is also OpenCV, so what is the difference between the two? OpenCV focuses on machine vision; Tesseract focuses on character recognition.

So in terms of scope OpenCV is broader. It can handle graphic verification codes too, but why use an ox cleaver to kill a chicken ~

3 Environment Preparation

Install Tesseract first, then tesserocr. Tesseract download address: https://digi.bib.uni-mannheim.de/tesseract/ Open it and you will see a list of exe installers; you can choose any of them. Tesseract-ocr-setup-3.05.01.exe is a stable version.

Double-click the downloaded installer and click Next all the way until the following page appears:

Here you need to check the Additional language data (download) option in the red box. This installs the language packs that let OCR recognize multiple languages. Then click Next all the way. Downloading the language packs takes a while, about 10-20 minutes depending on network speed; if you do not need multi-language support, you can leave the option unchecked.

Note that the English data is included by default. If downloading every language takes up too much space, or the network is slow, you can install only the Chinese data instead. The language data is available at https://github.com/tesseract-ocr/tessdata; search for chi_sim.traineddata, which is Simplified Chinese, and download it. The Tesseract installation directory contains a directory called tessdata; put the downloaded language file directly into that directory.
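As a quick sanity check after copying a language pack, a small script can confirm that the file sits where Tesseract will look for it. This is only a minimal sketch: the helper name has_language and the example install path are assumptions for illustration, and it checks that the file exists, not that it is valid.

```python
from pathlib import Path


def has_language(tessdata_dir, lang):
    """Return True if <lang>.traineddata exists in the given tessdata directory."""
    return (Path(tessdata_dir) / f"{lang}.traineddata").is_file()


if __name__ == "__main__":
    # Hypothetical install location; replace it with your own Tesseract directory.
    tessdata = r"C:\Program Files (x86)\Tesseract-OCR\tessdata"
    for lang in ("eng", "chi_sim"):
        status = "found" if has_language(tessdata, lang) else "missing"
        print(lang, status)
```

If chi_sim shows as missing, the downloaded file probably landed in the wrong directory.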

How do I verify that Tesseract installed successfully? Type tesseract in CMD. On success it prints its usage information.

If you see the message that 'tesseract' is not recognized as an internal or external command, the environment variables have not been configured: add the Tesseract installation directory to PATH.
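The same check can be done from Python, which is handy because the crawler itself will run there. A minimal sketch using only the standard library; the helper name find_tesseract is made up for illustration:

```python
import shutil


def find_tesseract(command="tesseract"):
    """Return the full path of the executable if it is on PATH, else None."""
    return shutil.which(command)


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("tesseract is not on PATH; add its installation directory to PATH")
    else:
        print("tesseract found at", path)
```

On Windows, remember to open a new CMD window after changing PATH, otherwise the old value is still in effect.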

At this point, Tesseract has been installed successfully.

Install tesserocr with pip:

pip3 install tesserocr

However, when JB ran the install, it immediately reported an error:

Next, JB tried installing it with conda: conda install tesserocr. That did not work either.

After a long struggle, I finally found a workable command:

conda install -c simonflueckiger tesserocr

tesserocr is finally installed.

How do I verify that it is actually installed? Run import tesserocr in Python; if no error is raised, the installation succeeded.
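If you would rather get a clear yes or no than a traceback, the import check can be wrapped in a few lines. A minimal sketch with the standard library; the helper name is_installed is an assumption, and it only checks that the module can be found, not that Tesseract itself works:

```python
import importlib.util


def is_installed(module_name):
    """Return True if the named top-level module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # PIL is the import name of the pillow package used later in this article.
    for name in ("tesserocr", "PIL"):
        print(name, "installed" if is_installed(name) else "NOT installed")
```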

By the way, if you are not familiar with the conda command, visit the link below and search for the Scrapy installation section, which introduces conda: https://juejin.cn/post/6844903618068348942

tesserocr and Tesseract are now installed on Windows.

Don't worry, Linux and Mac are covered next. Note that JB has not verified the following methods; the information comes from the Internet and is for reference only:

On Linux, the package is called tesseract-ocr or tesseract, depending on the distribution.

  • Ubuntu, Debian, and Deepin: the installation command is as follows:

      sudo apt-get install -y tesseract-ocr libtesseract-dev libleptonica-dev
  • CentOS and Red Hat: run the following command:

      yum install -y tesseract

Run the command for your distribution to complete the Tesseract installation. Once it is installed, the tesseract command is available. By default only the English data is installed; to add other languages, follow the same approach described for Windows above, which is not repeated here.

Next, install tesserocr directly with pip:

pip3 install tesserocr pillow

On Mac, first install the ImageMagick and Tesseract libraries with Homebrew:

brew install imagemagick
brew install tesseract --all-languages

Then install tesserocr with pip:

pip3 install tesserocr pillow

4 Identification Test

To make testing easier, save a verification code image locally. Open weibo.com and enter any account and password; you will be prompted for a verification code. Open the developer tools and find the verification code element: its src attribute is a link. Open that link directly and you will see a verification code, and it changes on every refresh, so this must be the verification code interface. Right-click to save the image, and now we have a verification code. Verification code link: https://login.sina.com.cn/cgi/pin.php?r=9967937&s=0&p=gz-d0dc363f6a4523cbd602a5a10f00c59b4784

OK, preparation done, let's start. Create a new project and put the verification code image in the project root directory, then use the tesserocr library to recognize it:

import tesserocr
from PIL import Image

# Open the saved verification code image
image = Image.open("3.jpg")
# Call tesserocr's image_to_text() to recognize it
result = tesserocr.image_to_text(image)
print(result)

The result after running: nothing at all?? JB then went through a lot of trouble, debugging and searching all kinds of documentation, and finally swapped in the verification code shown earlier:

Replace the image and run the code again:

At this point there are two problems: 1) recognition of the Weibo verification code fails and the output is blank; 2) some characters of the second verification code are recognized incorrectly.

Thinking it over: this library is recommended all over the Internet, it is open source with Google behind it, in theory there should be no problem, and everyone else uses it this way. So why does it fail here? Is some additional processing needed?

Keep learning with your questions and dreams;

Note: tesserocr has an even simpler method that converts an image file directly to a string:

import tesserocr
print(tesserocr.file_to_text("1.jpg"))

You can also call the tesseract command directly to output the result:

tesseract <image path> <output>
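The command-line call can also be driven from Python when tesserocr refuses to install. A minimal sketch, assuming tesseract is on PATH; the helper names build_command and recognize are made up for illustration. Passing stdout as the output name makes tesseract print the text instead of writing a file, and -l selects the language.

```python
import subprocess


def build_command(image_path, lang=None):
    """Build the tesseract command line; 'stdout' makes it print the text."""
    command = ["tesseract", image_path, "stdout"]
    if lang:
        command += ["-l", lang]
    return command


def recognize(image_path, lang=None):
    """Run tesseract on an image file and return the recognized text."""
    completed = subprocess.run(
        build_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout.strip()


if __name__ == "__main__":
    # Only prints the command here, so the sketch runs without Tesseract installed.
    print(" ".join(build_command("1.jpg", lang="chi_sim")))
```

With Tesseract installed and 1.jpg in the current directory, recognize("1.jpg") would return the recognized text.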

5 Verification code processing

I looked it up online. For a verification code like this one:

the extra lines in the image may be interfering with recognition.

Or this one from Weibo:


There is a solution: the image needs additional processing, such as grayscale conversion and binarization.

Grayscale processing: passing 'L' to the convert() method of the Image object converts the image to grayscale:

from PIL import Image

image = Image.open("1.jpg")
image = image.convert('L')
image.show()
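To make concrete what grayscale conversion does to each pixel, here is a pure-Python sketch of the weighting that the Pillow documentation gives for convert('L') (the ITU-R 601-2 luma transform). The function name to_gray is made up for illustration, and Pillow's internal rounding may differ by one level from this integer version.

```python
def to_gray(r, g, b):
    """Convert one RGB pixel to a gray level (0-255) using the ITU-R 601-2 luma weights."""
    return (r * 299 + g * 587 + b * 114) // 1000


if __name__ == "__main__":
    # Green contributes most to perceived brightness, blue the least.
    for name, rgb in (("red", (255, 0, 0)), ("green", (0, 255, 0)), ("blue", (0, 0, 255))):
        print(name, to_gray(*rgb))
```

This is why two colors that look very different can still end up as similar grays, which matters when choosing a binarization threshold later.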

Passing '1' binarizes the image. (Binarization means setting the gray value of every pixel to either 0 or 255, so that the whole image shows a clear visual effect of only black and white.)

import tesserocr
from PIL import Image

image = Image.open("1.jpg")
image = image.convert('1')
image.show()

The threshold for binarization can also be specified. The method above relies on the default threshold of 127 (Pillow also applies dithering in this mode unless it is disabled). However, converting the original image directly is rarely done, because for the reasons mentioned above the errors become even more outrageous. Generally the original image is converted to grayscale first, and then binarized with a specified threshold. The code is as follows:

import tesserocr
from PIL import Image

# Create an Image object
image = Image.open("1.jpg")
# Convert to grayscale first
image = image.convert('L')

# Binarization threshold
threshold = 150
table = []
for i in range(256):
    if i < threshold:
        table.append(0)
    else:
        table.append(1)

# Use the table to convert to a binary image: 0 is black, 1 is white
image = image.point(table, "1")
image.show()
result = tesserocr.image_to_text(image)
print(result)

In case the 256 is unclear, here is what it means. We first convert the image to grayscale, and a grayscale image is a monochrome image with 256 gray levels from black to white. In other words, we pick a threshold between 0 and 255: gray levels below the threshold are set to 0, and gray levels at or above the threshold are set to 1. 0 is black, 1 is white. Probably still confusing, so let's go straight to the pictures. Original:

Grayscale:

Binary image:

In the grayscale image some colors lie between white and black, and setting a threshold converts all of these intermediate colors into either black or white.
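The lookup-table idea can be tried without any image at all. The sketch below applies the same table to a tiny hand-written grid of gray levels; the helper names and the sample patch are made up for illustration, and it uses plain lists rather than Pillow.

```python
def build_table(threshold):
    """Map each of the 256 gray levels to 0 (black) or 1 (white)."""
    return [0 if level < threshold else 1 for level in range(256)]


def binarize(gray_rows, threshold):
    """Apply the lookup table to a grid of gray levels (0-255)."""
    table = build_table(threshold)
    return [[table[level] for level in row] for row in gray_rows]


if __name__ == "__main__":
    # A made-up 3x4 grayscale patch: dark strokes on a lighter background.
    patch = [
        [200, 40, 210, 149],
        [30, 35, 150, 220],
        [190, 45, 151, 255],
    ]
    for row in binarize(patch, threshold=150):
        print(row)
```

Note how 149 and 150 land on opposite sides of the threshold: values right at the boundary are exactly where moving the threshold changes the result.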

OK, back from the digression. The binary image of the verification code above looks like this:

And the recognition result:

Good, it has changed, and at least it is no longer MEEE, so let's keep tuning the threshold toward a good value. After a long time adjusting, JB gave up: no matter how the threshold was adjusted, the 8 was never recognized correctly and kept wandering between S, R and B.
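Rather than editing the threshold by hand and rerunning each time, the tuning can be put in a loop. This is a hedged sketch, not what JB actually did: sweep_thresholds and the stand-in recognizer are hypothetical, and the real OCR call is passed in as a function so the sketch runs without Tesseract.

```python
def build_table(threshold):
    """Map each of the 256 gray levels to 0 (black) or 1 (white)."""
    return [0 if level < threshold else 1 for level in range(256)]


def sweep_thresholds(recognize, thresholds):
    """Call recognize(table) for each threshold and collect the results."""
    return {t: recognize(build_table(t)) for t in thresholds}


if __name__ == "__main__":
    # Stand-in recognizer so the sketch runs anywhere: it only reports
    # how many gray levels the table maps to black.
    def fake_recognize(table):
        return table.count(0)

    for threshold, result in sweep_thresholds(fake_recognize, range(100, 201, 25)).items():
        print(threshold, result)
```

With Pillow and tesserocr installed, the recognizer could be something like lambda table: tesserocr.image_to_text(gray.point(table, "1")).strip(), where gray is the grayscale image from the code above. As JB found, though, some characters may never come out right at any threshold.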

JB switched to another verification code:

With the same code as above, unmodified, the binary image is as follows:

Recognition result:

Oh yeah, this one can be recognized ~

Remember the Weibo verification code we started with? Let's try it. The verification code looks like this:

Compare the Weibo verification code with the one that could be recognized above:


What about Chinese?

Update on 18.6.11: The installation section mentioned installing different language packs, so what if you want to recognize other languages? Hence this addition ~

Picture first ~

Straight to the code:

import tesserocr
from PIL import Image

image = Image.open("juejin.jpg")
result = tesserocr.image_to_text(image, lang='chi_sim')
print(result)

The default language is English, so English does not need lang to be specified, but Chinese does. chi_sim is Simplified Chinese.

From the output, parts of the text in the image are roughly recognizable, but it is also clear that recognition of Chinese is not very accurate.

One note here: for Chinese there is no need to apply grayscale and binarization, otherwise the strokes get darker and are probably even harder to tell apart ~

Summary

This chapter covered setting up the tesserocr and Tesseract environment, how to reduce noise in graphic verification codes, and the concepts of grayscale and binary images.

Tricky cases

In fact, tesserocr can only handle verification codes with solid characters; for hollow characters it is still helpless. So what can be done? Since image recognition has errors, we can give up on that road and obtain the verification code in other ways.

For example, find the code that generates the verification code and work from it to obtain the verification code directly, or train a deep learning model to recognize it.

Next chapter: how to obtain the verification code generation code and process it to obtain the verification code

Update 18.6.11: the topic above is saved for a later chapter; the next chapter introduces paid OCR ~

Thank you.