The text and images in this article come from the internet and are shared for learning and exchange only, with no commercial purpose. Copyright belongs to the original author. If you have any concerns, please contact us.

Author: Liu Zhi-qi Source: CSDN

Link to this article: blog.csdn.net/weixin\_418…

Preface

When writing crawlers, a common obstacle that target websites put up is a verification code (captcha). This article uses a hands-on Selenium example to show how to deal with pop-ups and captchas. The target website is an instrument ordering platform.

As you can see, the captcha required at login is fairly simple: standard colored digits over a background with light interference.

Rather than resorting to artificial intelligence, we can binarize the image and then pass it to Google's recognition engine, Tesseract-OCR, to read the digits in it.

(Note: setup guides for Selenium and Tesseract are easy to find online and are not covered in this article.)

Python in practice

Start by importing the required modules

import re
# Image processing
from PIL import Image
# Text recognition
import pytesseract
# Browser automation
from selenium import webdriver
import time

Resolve the pop-up problem

Try opening the sample site first

url = 'http://lims.gzzoc.com/client'
driver = webdriver.Chrome()
driver.get(url)
time.sleep(30)

The website shows a pop-up that we did not see before. A quick word on pop-ups: for beginners, they can be divided simply into alert and non-alert pop-ups.

Alert dialog boxes

The alert(message) method displays an alert box with the given message and an OK button. The confirm(message) method displays a dialog box with the given message plus OK and Cancel buttons. The prompt(text, defaultText) method displays a dialog box that asks the user for input.

Take a look at what the pop-up's JS looks like:

Alert-type dialogs are handled through driver.switch_to.alert, so there is nothing to worry about there.
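As a minimal sketch of what handling such a dialog can look like, the helper below reads the message and then accepts or dismisses the dialog. The function name `handle_js_dialog` and its parameters are made up for illustration. The only Selenium calls used are `driver.switch_to.alert` and the alert object's `text`, `send_keys`, `accept` and `dismiss`.

```python
def handle_js_dialog(driver, accept=True, prompt_text=None):
    """Read a native JS dialog's message, optionally type into it, then close it.

    `driver` is assumed to be a Selenium WebDriver with a dialog currently open.
    """
    dialog = driver.switch_to.alert    # handle on the open alert/confirm/prompt
    message = dialog.text              # the message passed to alert()/confirm()/prompt()
    if prompt_text is not None:
        dialog.send_keys(prompt_text)  # only meaningful for prompt() dialogs
    if accept:
        dialog.accept()                # same as clicking OK
    else:
        dialog.dismiss()               # same as clicking Cancel
    return message
```

If no dialog is open when the helper is called, Selenium raises an exception, so in practice the call usually follows the action that triggers the dialog.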

Handling non-alert pop-ups

The pop-up sits in a div layer: locate it the same way as any other element

The pop-up sits in a nested iframe: you need to switch into the iframe first

The pop-up sits in another window handle: you need to switch windows
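The second and third cases can be sketched roughly as follows. The helper names `click_in_iframe` and `switch_to_new_window` are hypothetical. The Selenium calls they rely on (`switch_to.frame`, `switch_to.default_content`, `window_handles`, `switch_to.window`) are the standard ones for changing context.

```python
def click_in_iframe(driver, iframe_element, click_target):
    """Switch into an iframe, run a click callback, then return to the main page.

    `click_target` is a hypothetical callable that receives the driver and
    clicks whatever needs clicking inside the frame.
    """
    driver.switch_to.frame(iframe_element)
    try:
        click_target(driver)
    finally:
        # Always go back to the top-level document, even if the click fails
        driver.switch_to.default_content()


def switch_to_new_window(driver, original_handle):
    """Focus the first window whose handle differs from the original one."""
    for handle in driver.window_handles:
        if handle != original_handle:
            driver.switch_to.window(handle)
            return handle
    return None  # no other window is open
```

A typical call would save `driver.current_window_handle` before the click that opens the pop-up and pass it in as `original_handle`.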

So let's inspect the element behind this pop-up

The problem turns out to be simple: just locate the button and click it

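url = 'http://lims.gzzoc.com/client'
driver = webdriver.Chrome()
driver.get(url)
time.sleep(1)
driver.maximize_window()
# Click the button inside the pop-up's div layer.
# Note: Selenium 4.3+ removed find_element_by_*; use driver.find_element(By.XPATH, ...) there.
driver.find_element_by_xpath("//div[@class='jconfirm-buttons']/button").click()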

Get the image location and take a screenshot

The basic idea of handling the captcha by binarization is as follows:

Crop out the part of the image that contains the captcha

Binarize it so that the useful information turns black and the background and interference turn white

Feed the processed image to a text recognition engine

Enter the returned result and submit

Think a little further about how to crop out the captcha image: first read the image element's properties on the page and compute its coordinates from its size and location; then take a screenshot of the page; finally crop the screenshot at those coordinates. (Because of how the captcha's JS works, you cannot simply take the img's src, download the image, and recognize that: the downloaded image would not match the one displayed on the page.)

img = driver.find_element_by_xpath('//img[@id="valiCode"]')
time.sleep(1)
location = img.location
size = img.size
# left = location['x']
# top = location['y']
# right = left + size['width']
# bottom = top + size['height']
left = 2 * location['x']
top = 2 * location['y']
right = left + 2 * size['width'] - 10
bottom = top + 2 * size['height'] - 10
driver.save_screenshot('valicode.png')
page_snap_obj = Image.open('valicode.png')
image_obj = page_snap_obj.crop((left, top, right, bottom))
image_obj.show()

Normally the four commented-out lines are all you need. However, the zoom (scaling) factor differs between computers and browsers, so if the cropped region is off you need to multiply by the scaling factor. The final values can then be nudged up or down for fine-tuning.

As you can see, the captcha image has been captured!
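The coordinate arithmetic above can be wrapped in a small helper so the scaling factor and the fine-tuning offset become explicit parameters. This is only a sketch: the name `crop_box` and the `scale` and `trim` parameters are made up for illustration. It also assumes the mismatch comes from display scaling, in which case the factor can often be read with `driver.execute_script('return window.devicePixelRatio')`.

```python
def crop_box(location, size, scale=1.0, trim=0):
    """Turn an element's location and size into a (left, top, right, bottom) crop box.

    location: dict with 'x' and 'y', as returned by a Selenium element's .location
    size:     dict with 'width' and 'height', as returned by .size
    scale:    screenshot pixels per CSS pixel (2 in the article's example)
    trim:     pixels shaved off the right and bottom edges for fine-tuning (10 in the article)
    """
    left = scale * location['x']
    top = scale * location['y']
    right = left + scale * size['width'] - trim
    bottom = top + scale * size['height'] - trim
    return tuple(int(round(v)) for v in (left, top, right, bottom))
```

With the article's values, `crop_box(img.location, img.size, scale=2, trim=10)` produces the same box as the hand-written lines, and `scale=1, trim=0` reproduces the four commented-out lines.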

Further processing of captcha images

Binarization needs a threshold: a pixel value that separates the real digits from the background interference in the grayscale image. It has to be found by trial, using Photoshop or a similar tool. The threshold that worked in this case is 205.

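# Convert the cropped captcha (image_obj from the previous step) to grayscale
img = image_obj.convert('L')
pixdata = img.load()
w, h = img.size
threshold = 205
# Walk every pixel: below the threshold becomes black, everything else white
for y in range(h):
    for x in range(w):
        if pixdata[x, y] < threshold:
            pixdata[x, y] = 0
        else:
            pixdata[x, y] = 255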
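The value 205 was found by hand. As an alternative starting point, and not what the original author did, a threshold can be estimated from the grayscale histogram with Otsu's method, which picks the level that best separates dark and light pixels. The sketch below is plain Python; `otsu_threshold` is a hypothetical helper name, and the result may still need manual fine-tuning for a particular captcha style.

```python
def otsu_threshold(histogram):
    """Return the gray level t that best splits pixels into dark (<= t) and light (> t).

    histogram: list of 256 pixel counts, e.g. from img.histogram() on a mode 'L' image.
    """
    total = sum(histogram)
    weighted_total = sum(level * count for level, count in enumerate(histogram))
    best_level, best_variance = 0, -1.0
    dark_count = 0      # number of pixels at or below the current level
    dark_weighted = 0   # sum of their gray values
    for level, count in enumerate(histogram):
        dark_count += count
        dark_weighted += level * count
        light_count = total - dark_count
        if dark_count == 0 or light_count == 0:
            continue  # one class is empty, no split to evaluate
        dark_mean = dark_weighted / dark_count
        light_mean = (weighted_total - dark_weighted) / light_count
        # Between-class variance (up to a constant factor)
        variance = dark_count * light_count * (dark_mean - light_mean) ** 2
        if variance > best_variance:
            best_level, best_variance = level, variance
    return best_level
```

Because the loop in the article tests `pixel < threshold`, the matching value would be `otsu_threshold(img.histogram()) + 1`.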

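Next, clean up the binarized image: a dark pixel with no dark neighbor directly above, below, left, or right is treated as isolated noise and set to white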

data = img.getdata()
w, h = img.size
black_point = 0
for x in range(1, w - 1):
    for y in range(1, h - 1):
        mid_pixel = data[w * y + x]
        if mid_pixel < 50:
            top_pixel = data[w * (y - 1) + x]
            left_pixel = data[w * y + (x - 1)]
            down_pixel = data[w * (y + 1) + x]
            right_pixel = data[w * y + (x + 1)]
            if top_pixel < 10:
                black_point += 1
            if left_pixel < 10:
                black_point += 1
            if down_pixel < 10:
                black_point += 1
            if right_pixel < 10:
                black_point += 1
            if black_point < 1:
                img.putpixel((x, y), 255)
            black_point = 0
img.show()

The comparison before and after image processing is as follows

Character recognition

This is done by feeding the processed image to Google's text recognition engine, Tesseract

result = pytesseract.image_to_string(img)
# Keep only the digits
regex = r'\d+'
result = ''.join(re.findall(regex, result))
print(result)

The recognition result is as follows
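Since OCR output can contain stray whitespace or be missing a digit, it may help to validate the result before typing it into the form. The sketch below keeps the same regular expression and adds an optional length check. `extract_digits` and `expected_length` are hypothetical names, and the length of this site's captcha is not stated in the article, so it has to be supplied by the reader.

```python
import re


def extract_digits(ocr_text, expected_length=None):
    """Keep only the digits from OCR output; return None if the result looks wrong.

    expected_length: optional number of digits the captcha should contain
                     (an assumption the caller has to supply).
    """
    digits = ''.join(re.findall(r'\d+', ocr_text))
    if not digits:
        return None  # nothing recognizable, worth refreshing the captcha
    if expected_length is not None and len(digits) != expected_length:
        return None  # a digit was probably dropped or a speck was read as one
    return digits
```

A `None` result is a cheap signal to refresh the captcha without spending a login attempt. If your Tesseract build supports it, passing something like `config='--psm 7 -c tessedit_char_whitelist=0123456789'` to `pytesseract.image_to_string` may also reduce misreads, though this was not part of the original code.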

Submit information such as the account password and verification code

With the captcha recognized, we can now submit the account name, password, captcha, and other login details to the website

driver.find_element_by_name('code').send_keys(result)
driver.find_element_by_name('userName').send_keys('xxx')
driver.find_element_by_xpath("//div[@class='form-group login-input'][3]").click()

Note that binarization does not recognize the captcha correctly every time. You therefore need to handle recognition errors: click the captcha image to load a new one and run the recognition again. You can split the code above into several functions and use the following loop as a retry framework

while True:
    try:
        ...
        break
    except:
        driver.find_element_by_id('valiCode').click()

For ease of understanding, the code in this article is not organized into functions. Readers are welcome to try refactoring it!
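One possible shape for such a refactoring is a retry helper that takes the login attempt and the captcha refresh as callables and gives up after a fixed number of tries, so a persistent failure cannot loop forever. This is a sketch under assumptions: `login_with_retry`, `attempt_login`, `refresh_captcha` and `max_attempts` are hypothetical names, and `attempt_login` is expected to raise an exception when the login does not succeed.

```python
def login_with_retry(attempt_login, refresh_captcha, max_attempts=5):
    """Call attempt_login until it succeeds; refresh the captcha after each failure.

    Returns the number of the attempt that succeeded.
    Raises RuntimeError if every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            attempt_login()      # screenshot, binarize, OCR, fill in the form, submit
            return attempt
        except Exception:
            refresh_captcha()    # e.g. click the captcha image to get a new one
    raise RuntimeError(f'login failed after {max_attempts} attempts')
```

With the article's code, `refresh_captcha` could be `lambda: driver.find_element_by_id('valiCode').click()`, and `attempt_login` would wrap the cropping, binarization, recognition and submission steps.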

Summary

Once you have logged in successfully, you can obtain your session cookies. From there you can either keep automating the browser with Selenium or pass the cookies to Requests, and then crawl the information you need for analysis or automation.
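Handing the cookies over to Requests mostly comes down to reshaping them. Selenium's `driver.get_cookies()` returns a list of dicts with `name` and `value` keys (among others), while Requests accepts a plain name-to-value mapping. A minimal sketch, with `cookies_for_requests` as a hypothetical helper name:

```python
def cookies_for_requests(selenium_cookies):
    """Convert Selenium's list of cookie dicts into a {name: value} mapping."""
    return {cookie['name']: cookie['value'] for cookie in selenium_cookies}
```

The result can then be used as `requests.get(page_url, cookies=cookies_for_requests(driver.get_cookies()))`, where `page_url` stands for whichever page you want to crawl after logging in.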