The text and images in this article come from the internet and are for learning and exchange only, with no commercial purpose. Copyright belongs to the original author; if you have any questions, please contact us.
Author: Liu Zhi-qi Source: CSDN
Link to this article: blog.csdn.net/weixin\_418…
Preface
When writing crawlers, a common countermeasure on target websites is a verification code (CAPTCHA). This article uses a hands-on Selenium example to show how to deal with pop-ups and verification codes. The target website is an instrument ordering platform.
As you can see, the verification code required for login is relatively simple: standard colored digits with some simple background interference.
Rather than using artificial intelligence, we can binarize the image and then pass it to Tesseract-OCR, Google's recognition engine, to read the digits in the image.
(Note: guides for configuring Selenium and Tesseract are easy to find online and are not covered in this article.)
Python in practice
Start by importing the required modules
import re
# Image processing
from PIL import Image
# Text recognition
import pytesseract
# Browser automation
from selenium import webdriver
import time
Resolve the pop-up problem
Try opening the sample site first
url = 'http://lims.gzzoc.com/client'
driver = webdriver.Chrome()
driver.get(url)
time.sleep(30)
The website shows a pop-up that we have not dealt with before. A brief word on pop-ups: beginners can simply divide them into alert and non-alert types.
Alert dialog boxes
The alert(message) method displays an alert box with a specified message and an OK button.
The confirm(message) method displays a dialog box with a specified message and OK and Cancel buttons.
The prompt(text, defaultText) method displays a dialog box that prompts the user for input.
Take a look at the JavaScript behind this pop-up:
A real alert is handled with driver.switch_to.alert, but this pop-up is not one of the three types above, so that approach does not apply here. Don't worry.
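For reference, a native alert, confirm or prompt dialog could be handled roughly as follows. This is a minimal sketch rather than the author's code: `handle_js_dialog` is a hypothetical helper name, and it relies only on the object returned by Selenium's `driver.switch_to.alert` and its `text`, `send_keys()`, `accept()` and `dismiss()` members.

```python
def handle_js_dialog(driver, accept=True, prompt_text=None):
    """Read and close a native alert/confirm/prompt dialog.

    `driver` is assumed to be a Selenium WebDriver with a dialog currently open.
    Returns the message shown in the dialog.
    """
    dialog = driver.switch_to.alert    # switch focus to the open dialog
    message = dialog.text              # message displayed to the user
    if prompt_text is not None:
        dialog.send_keys(prompt_text)  # only meaningful for prompt()
    if accept:
        dialog.accept()                # click OK
    else:
        dialog.dismiss()               # click Cancel / close the dialog
    return message
```

With a real browser session this would be called as `handle_js_dialog(driver)` right after the action that opens the dialog. If no native dialog is open, Selenium raises `NoAlertPresentException` instead.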
Handling non-alert pop-ups
If the pop-up is in a div layer, locate it just as you would any normal element.
If the pop-up is in a nested iframe, switch to that iframe first.
If the pop-up is in another window handle, switch to that window first.
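The iframe and window cases might be sketched as below. The helper names `inside_frame` and `switch_to_other_window` are hypothetical; the Selenium calls they wrap (`switch_to.frame`, `switch_to.default_content`, `switch_to.window`, `window_handles`, `current_window_handle`) are the standard ones. The sketch assumes only one extra window is open.

```python
from contextlib import contextmanager


@contextmanager
def inside_frame(driver, frame):
    """Run a block inside an iframe, then return to the top-level page."""
    driver.switch_to.frame(frame)           # frame: name, index, or WebElement
    try:
        yield driver
    finally:
        driver.switch_to.default_content()  # always switch back afterwards


def switch_to_other_window(driver):
    """Focus the first window handle that is not the current one.

    Returns the handle switched to, or None if no other window exists.
    """
    current = driver.current_window_handle
    for handle in driver.window_handles:
        if handle != current:
            driver.switch_to.window(handle)
            return handle
    return None
```

Inside `with inside_frame(driver, frame):` elements of the pop-up can be located and clicked as usual. The div case needs no switching at all.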
So let's inspect the elements of this pop-up.
The problem turns out to be simple: just locate the button and click it.
url = 'http://lims.gzzoc.com/client'
driver = webdriver.Chrome()
driver.get(url)
time.sleep(1)
driver.maximize_window()
# Locate the button inside the pop-up and click it to close the pop-up
driver.find_element_by_xpath("//div[@class='jconfirm-buttons']/button").click()
Get the image location and take a screenshot
The basic idea of handling a verification code with binarization is as follows:
Crop out the part of the image that contains the verification code
Binarize it so that the useful information becomes black and the background and interference become white
Feed the processed image to a text recognition engine
Enter the returned result and submit
Cropping the verification code image takes a little more thought. First read the image's CSS properties on the page and compute its coordinates from its size and location; then take a screenshot of the page; finally, use those coordinates to crop the screenshot. (Because of how the verification code's JavaScript works, you cannot simply take the src of the img, download the image and recognize it: the downloaded code would not match the one shown on the page.)
img = driver.find_element_by_xpath('//img[@id="valiCode"]')
time.sleep(1)
location = img.location
size = img.size
# left = location['x']
# top = location['y']
# right = left + size['width']
# bottom = top + size['height']
left = 2 * location['x']
top = 2 * location['y']
right = left + 2 * size['width'] - 10
bottom = top + 2 * size['height'] - 10
driver.save_screenshot('valicode.png')
page_snap_obj = Image.open('valicode.png')
image_obj = page_snap_obj.crop((left, top, right, bottom))
image_obj.show()
Normally the four commented-out lines are enough, but the zoom ratio differs between computers and browsers, so if the cropped image is off you need to multiply by a zoom factor. The final values can then be nudged up or down for fine-tuning.
As you can see, the verification code image has been captured!
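The coordinate arithmetic above can be collected into one small function so the zoom factor and the fine-tuning offset become explicit parameters. This is only a sketch: `crop_box`, `scale` and `trim` are hypothetical names, and the values 2 and 10 are simply the ones used in the code above.

```python
def crop_box(location, size, scale=1.0, trim=0):
    """Convert an element's page position and size into screenshot pixel coordinates.

    location: dict with 'x' and 'y' (as returned by WebElement.location)
    size:     dict with 'width' and 'height' (as returned by WebElement.size)
    scale:    screenshot pixels per page pixel (2 in the code above)
    trim:     pixels shaved off the right and bottom edges for fine-tuning
    """
    left = scale * location['x']
    top = scale * location['y']
    right = left + scale * size['width'] - trim
    bottom = top + scale * size['height'] - trim
    # PIL's Image.crop expects a (left, top, right, bottom) tuple
    return tuple(int(round(v)) for v in (left, top, right, bottom))
```

For example, `page_snap_obj.crop(crop_box(img.location, img.size, scale=2, trim=10))` reproduces the crop shown earlier. On many setups the zoom factor can be read with `driver.execute_script("return window.devicePixelRatio")`, though the result should still be checked by eye.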
Further processing of captcha images
Binarization needs a pixel threshold that separates the real digits from the background interference in the grayscale image. Finding it takes some experimentation with Photoshop or a similar tool. The threshold found in this case is 205.
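# Convert the crop to grayscale first (step assumed here; the threshold
# comparison below needs single-value pixels in the range 0-255)
img = image_obj.convert('L')
pixdata = img.load()
w, h = img.size
threshold = 205
# Visit every pixel: below the threshold becomes black, everything else white
for y in range(h):
    for x in range(w):
        if pixdata[x, y] < threshold:
            pixdata[x, y] = 0
        else:
            pixdata[x, y] = 255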
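As an alternative to trying thresholds in Photoshop, candidate values can be probed directly in Python. The sketch below is not part of the original workflow: `dark_fraction` and `probe_thresholds` are hypothetical helpers, and every candidate value other than 205 is an arbitrary illustration. They take a flat sequence of grayscale values, which a grayscale PIL image provides through `getdata()`.

```python
def dark_fraction(pixels, threshold):
    """Return the share of grayscale values (0-255) strictly below `threshold`."""
    pixels = list(pixels)
    if not pixels:
        return 0.0
    return sum(1 for p in pixels if p < threshold) / len(pixels)


def probe_thresholds(pixels, candidates=(128, 160, 180, 205, 230)):
    """Map each candidate threshold to the fraction of pixels it would turn black."""
    pixels = list(pixels)
    return {t: dark_fraction(pixels, t) for t in candidates}
```

Calling `probe_thresholds(gray_image.getdata())` on the grayscale crop (before it is binarized) gives a quick feel for how much of the image each threshold keeps. A sudden jump between two candidates suggests that the background is starting to be included.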
Regenerate the image from the binarized pixels
data = img.getdata()
w, h = img.size
black_point = 0
for x in range(1, w - 1):
    for y in range(1, h - 1):
        mid_pixel = data[w * y + x]  # value of the current pixel
        if mid_pixel < 50:
            # Values of the four neighbours: up, left, down, right
            top_pixel = data[w * (y - 1) + x]
            left_pixel = data[w * y + (x - 1)]
            down_pixel = data[w * (y + 1) + x]
            right_pixel = data[w * y + (x + 1)]
            # Count how many neighbours are black
            if top_pixel < 10:
                black_point += 1
            if left_pixel < 10:
                black_point += 1
            if down_pixel < 10:
                black_point += 1
            if right_pixel < 10:
                black_point += 1
            # A black pixel with no black neighbour is noise: turn it white
            if black_point < 1:
                img.putpixel((x, y), 255)
            black_point = 0
img.show()
The comparison before and after image processing is as follows
Character recognition
This is done by feeding the processed image to Tesseract, Google's text recognition engine.
result = pytesseract.image_to_string(img)
# Keep only the digits
regex = r'\d+'
result = ''.join(re.findall(regex, result))
print(result)
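Since the recognition is not always right, it can help to validate the cleaned result before typing it into the form. The sketch below wraps the same regular expression in a hypothetical helper, `extract_code`. Its `expected_length` parameter is an assumption: the article does not state how many digits the verification code has, so it is left optional.

```python
import re


def extract_code(ocr_text, expected_length=None):
    """Keep only the digits from raw OCR output.

    Returns the digit string, or None when `expected_length` is given and
    the number of digits does not match (a sign the image was misread).
    """
    digits = ''.join(re.findall(r'\d+', ocr_text))
    if expected_length is not None and len(digits) != expected_length:
        return None
    return digits
```

A possible call is `extract_code(pytesseract.image_to_string(img), expected_length=n)`, with `n` set to the code length seen on the site. Tesseract can also be hinted towards a single line of digits with `config='--psm 7 -c tessedit_char_whitelist=0123456789'`, although whether the whitelist is honored depends on the Tesseract version.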
The identification results are as follows
Submit information such as the account password and verification code
With the verification code recognized, we can now submit the account, password, verification code and other login information to the website.
driver.find_element_by_name('code').send_keys(result)
driver.find_element_by_name('userName').send_keys('xxx')
driver.find_element_by_xpath("//div[@class='form-group login-input'][3]").click()
Note that the success rate of recognizing the verification code with binarization is not 100%. You therefore need to handle recognition errors by clicking the image to get a new verification code and recognizing it again. You can split the code above into several functions and use the following loop for trial and error.
while True:
    try:
        ...
        break
    except Exception:
        # Recognition failed: click the image to get a new verification code
        driver.find_element_by_id('valiCode').click()
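A bounded version of this loop might look like the sketch below, so that a persistent failure does not retry forever. `retry_login`, `attempt`, `refresh` and `max_tries` are hypothetical names. The sketch assumes the login attempt is wrapped in a function that raises an exception when the login fails.

```python
def retry_login(attempt, refresh, max_tries=5):
    """Call `attempt` until it succeeds or `max_tries` is reached.

    attempt: callable performing one full login try; raises on failure
    refresh: callable run after each failure (e.g. click the captcha image)
    Returns the 1-based number of the successful try, or None if all fail.
    """
    for n in range(1, max_tries + 1):
        try:
            attempt()
            return n
        except Exception:
            refresh()  # request a new verification code before retrying
    return None
```

Here `refresh` could be `lambda: driver.find_element_by_id('valiCode').click()`, and `attempt` would screenshot, binarize, recognize and submit in one go.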
To keep things easy to follow, the code here is not organized into functions. Readers are welcome to try refactoring it!
Summary
Once you have logged in successfully, you can get your session cookies. You can then keep automating the browser with Selenium, or pass the cookies to Requests, and crawl the information you need for analysis or automation.
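Passing the cookies to Requests could be sketched as follows. `cookies_to_dict` is a hypothetical helper; it assumes the input has the shape returned by Selenium's `driver.get_cookies()`, a list of dicts that each carry at least `name` and `value` keys.

```python
def cookies_to_dict(selenium_cookies):
    """Turn Selenium's list of cookie dicts into a plain {name: value} mapping."""
    return {c['name']: c['value'] for c in selenium_cookies}
```

The result can then be handed to Requests, for example `requests.get(url, cookies=cookies_to_dict(driver.get_cookies()))`. Whether the site accepts the session may also depend on headers such as the User-Agent.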