
This article introduces the preparatory work for the crawler and the main ideas behind a Python+Selenium crawler.

One, preparation

1. Introduction and installation of Selenium

Selenium is a Web automation (testing) tool that enables the browser to automatically load pages, retrieve needed data, and so on, according to our instructions.

A crawler is a program that automatically gathers information from the Internet, crawling for valuable data as tirelessly as a bug crawling around a building.

A traditional crawler directly simulates the HTTP requests used to fetch site content. Because this differs in obvious ways from real browser access, many sites adopt anti-crawler measures against it. Selenium is much easier to learn than a traditional crawler: it drives a real browser to crawl the information, behaving almost exactly like a human user, with no need to analyze the specific parameters of each request. Its only disadvantage is that it is slow; if there is no requirement on crawler speed, Selenium is a great choice.

Selenium is simple to install online via pip, as shown below.
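For example, from a terminal or command prompt:

```
pip install selenium
```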

2. Download and install ChromeDriver

Selenium does not function as a browser on its own; it must be used in conjunction with a third-party browser. Google's Chrome browser supports this easily: just install the matching ChromeDriver.

(Download address: chromedriver.storage.googleapis.com/index.html)

How do you view the browser version? Type `chrome://version/` in the address bar; the first line shows the version information. Taking my own browser as an example, the version is 76.0.3809.132 (32-bit). (It can also be found on the browser's usual "About" page.)

Next, download the corresponding ChromeDriver version (if there is no exact match, download the closest one). The version downloaded here is 76.0.3809.126.

Unzip the downloaded file to get chromedriver.exe. Since this is a tutorial, setting up environment variables is not recommended; instead, ChromeDriver is referenced by its absolute path.

Two, Python+Selenium crawler

Once the preparation is complete, the crawler can proceed. First, we use WebDriver, Selenium's browser-driving interface, to open our crawl target: the daily weather chart on the website of the Central Meteorological Observatory.

Code:

```python
from selenium import webdriver  # Import Selenium's browser driver interface

path = 'D:/'  # Placeholder: set this to the directory where chromedriver.exe was unzipped
chrome_driver = path + 'chromedriver.exe'
driver = webdriver.Chrome(executable_path=chrome_driver)
driver.get('http://www.nmc.cn/publish/observations/china/dm/weatherchart-h000.htm')  # Open the page
```

1. Selenium page element location

Suppose we want to download the basic weather chart at 500hPa. What would we do manually? Click 500hPa on the page, select Basic Weather Analysis, then right-click the image and save it. Next, we will use Selenium to control the web page; the crawling process simulates this series of human operations and lets the machine perform them automatically.

Selenium controls web pages through their HTML structural elements. Locating those elements is the foundation of using Selenium: only once the corresponding elements are accurately captured can the subsequent automatic control be carried out.

1.1 Viewing page elements

Selenium element locating involves a little basic HTML, but there are ways to be lazy if you don't know any. Using Chrome's developer tools (right-click -> Inspect in Chrome), we can easily view the page elements.

For example, if we move the mouse to the 500hPa entry, the developer tools automatically highlight the HTML element corresponding to the 500hPa button. Viewing its attributes shows that the 500hPa button is a plain link, so it can be located by its link text.

1.2 Element positioning

After an element has been viewed, it then needs to be located, that is, as the name implies, found at its position in the page so that it can be precisely controlled later. Selenium provides a variety of element-locating methods, two of which are used in this article:

Id location: find_element_by_id()

Link location: find_element_by_link_text()

```python
button1 = driver.find_element_by_link_text('500hPa')  # Locate the element by its link text
button2 = driver.find_element_by_id('plist')          # Locate the element precisely by its id
elem_pic = driver.find_element_by_id('imgpath')       # Locate the image element precisely by its id
```
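Note: newer Selenium releases (4.x) deprecate the `find_element_by_*` helpers in favor of the generic `find_element` plus a `By` locator. An equivalent sketch for Selenium 4:

```python
from selenium.webdriver.common.by import By

button1 = driver.find_element(By.LINK_TEXT, '500hPa')  # Same link-text lookup
button2 = driver.find_element(By.ID, 'plist')          # Same id lookup
elem_pic = driver.find_element(By.ID, 'imgpath')
```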

2. ActionChains: simulating mouse operations

After the elements are located, we also need to simulate mouse operations such as hovering and clicking in order to select and download the target image. Selenium provides a class for handling such events: ActionChains. The two operations used here are:

Hover action: move_to_element()

Click action: click()
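A minimal sketch of how ActionChains works (using the `button1` element located earlier): calls such as `move_to_element()` and `click()` only queue actions, and nothing is sent to the browser until `perform()` is called.

```python
from selenium.webdriver.common.action_chains import ActionChains

actions = ActionChains(driver)
actions.move_to_element(button1)  # Queue a hover over the element
actions.click(button1)            # Queue a click on the element
actions.perform()                 # Execute the queued actions in order
```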

The complete implementation script is shown below:

```python
from selenium import webdriver                        # Import Selenium's browser driver interface
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support.select import Select  # Needed for the drop-down menu below
import time
import os

path = 'D:/'  # Placeholder: set this to the directory where chromedriver.exe was unzipped
chrome_driver = path + 'chromedriver.exe'
driver = webdriver.Chrome(executable_path=chrome_driver)
driver.get('http://www.nmc.cn/publish/observations/china/dm/weatherchart-h000.htm')
driver.maximize_window()
time.sleep(1)

button1 = driver.find_element_by_link_text('500hPa')  # Locate the 500hPa button by link text
ActionChains(driver).move_to_element(button1).click(button1).perform()  # Hover, then click
time.sleep(1)  # Increase the wait time to avoid failures from acting too fast

for p in range(1, 3):  # Download yesterday's 08:00 and 20:00 charts
    button2 = driver.find_element_by_id('plist')      # Locate the drop-down menu by id
    ActionChains(driver).move_to_element(button2).click(button2).perform()
    time.sleep(1)
    Select(button2).select_by_index(p)                # Select from the drop-down menu; indices start from 0
    time.sleep(1)
    elem_pic = driver.find_element_by_id('imgpath')   # Locate the image element by id
    ActionChains(driver).move_to_element(elem_pic).context_click(elem_pic).perform()  # Right-click the image
    filename = str(elem_pic.get_attribute('src')).split('/')[-1].split('?')[0]  # Get the file name from the src URL
    print(filename)
    time.sleep(1)
```
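Note that the script above only right-clicks the image and prints its file name; the "save as" step in the context menu is a native browser dialog that Selenium cannot click. A simple alternative, not from the original article, is to skip the right-click and download the chart directly from its `src` URL with the `requests` library:

```python
import requests

url = elem_pic.get_attribute('src')          # URL of the chart located above
filename = url.split('/')[-1].split('?')[0]  # Same file-name extraction as the script
resp = requests.get(url, timeout=30)
with open(filename, 'wb') as f:
    f.write(resp.content)                    # Write the raw image bytes to disk
```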