Preface

In the last article, I introduced Ajax analysis and fetching. Ajax is one way that JavaScript renders pages dynamically, and by analyzing the Ajax requests directly you can still fetch the data with Requests.

But Ajax is not the only way JavaScript renders pages dynamically. Some sites have pages that are generated by JavaScript rather than written as native HTML, and that make no Ajax requests at all. Then there are pages like those on Taobao: even though the data is obtained through Ajax, the Ajax interface carries many encrypted parameters, so it is difficult to work out the rules behind the requests.

To solve this problem, Python offers a number of libraries that simulate a browser, such as Selenium, Splash, PyV8, and Ghost. In this article, I will share how to use Selenium. With it, you no longer need to worry about dynamically rendered pages.

Feeling a little excited?

Installing Selenium

Some of you may be wondering whether installation is even worth talking about. Is it not just one pip command? It is not quite that easy.

Selenium is an automated testing tool that lets you drive a browser to perform specific actions, such as clicking and scrolling down. It works well for JavaScript-rendered pages.

Installing with pip

I recommend installing with pip:

pip install selenium

Verify the installation

Enter the Python interactive mode and import the selenium package. If no error is reported, the installation succeeded.

C:\Users\admin> Python
Python 3.8.1 (tags/v3.8.1:1b293b6, Dec 18 2019, 23:11:46) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import selenium
>>>

However, this is not enough, because Selenium needs a browser (such as Chrome or Firefox) to work with. With a browser in place, we can use Selenium to fetch pages.

Installing ChromeDriver

Of course, you first need to download and install the Chrome browser; you can find the installer with a search engine such as Baidu.

Then install ChromeDriver. After ChromeDriver is installed, you can drive the Chrome browser to complete relevant operations.

  • 1. Preparation

Make sure you have installed Chrome before you do this.

  • 2. View the version

Open the Chrome settings and click About Chrome to view the version of Chrome, as shown below:

Here my browser version number is 88.0.

Remember the Chrome version number, because you’ll need it later when selecting ChromeDriver versions.
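
To make the version matching concrete, here is a minimal sketch. It assumes that ChromeDriver releases share their major version number with the Chrome release they support, which is worth double-checking on the download page. The function names and the full version strings are only illustrative.

```python
def major_version(version: str) -> int:
    """Return the leading (major) component of a dotted version string."""
    return int(version.strip().split(".")[0])


def versions_match(chrome_version: str, driver_version: str) -> bool:
    """True when the browser and the driver share the same major version."""
    return major_version(chrome_version) == major_version(driver_version)


# Illustrative version strings; replace them with the ones you actually see.
print(versions_match("88.0.4324.150", "88.0.4324.96"))  # True
print(versions_match("88.0.4324.150", "87.0.4280.88"))  # False
```

With a browser at 88.0, the driver to pick is therefore one whose version also starts with 88.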

Download the ChromeDriver

Download from:

http://npm.taobao.org/mirrors/chromedriver

Select the version that matches your browser, as shown below:

Environment Variable Configuration

Unzip the driver you just downloaded and place it in the Scripts directory of your Python installation, as shown below:

Verify the installation

After the configuration is complete, enter the chromedriver command on the command line. If the output shown in the following figure appears, the environment variables are configured correctly.
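
If you prefer to check from Python instead of the command line, the following sketch uses the standard library function shutil.which to look for the executable on PATH. The helper name locate_executable is my own and is not part of Selenium.

```python
import shutil


def locate_executable(name: str):
    """Return the full path of an executable found on PATH, or None."""
    return shutil.which(name)


path = locate_executable("chromedriver")
if path is None:
    print("chromedriver is not on PATH")
else:
    print("chromedriver found at", path)
```

If the first message is printed, the directory that holds the driver is probably missing from the PATH environment variable.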

Basic use of Selenium

A simple example

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome()
driver.get("http://www.python.org")
assert "Python" in driver.title
elem = driver.find_element_by_name("q")
elem.clear()
elem.send_keys("pycon")
elem.send_keys(Keys.RETURN)
assert "No results found." not in driver.page_source
driver.close()

Let us walk through the code above.

The selenium.webdriver module provides all the WebDriver implementations; Firefox, Chrome, IE, and Remote are currently supported. The Keys class provides keyboard keys such as F1, Enter, and so on.

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

Next, create a Chrome instance:

driver = webdriver.Chrome()

The driver.get() method navigates to the given URL. WebDriver waits until the page is fully loaded (that is, until the onload event has fired) before returning control to the script. Note that if a page loads much of its content through Ajax, WebDriver may not know when it has finished loading.

driver.get("http://www.python.org")

The next line uses an assertion to confirm that the title contains the word Python. If the expression after assert evaluates to False, an AssertionError is raised.

assert "Python" in driver.title

WebDriver also provides a number of methods for finding page elements, for example find_element_by_*, where * is the attribute to search by.

elem = driver.find_element_by_name("q")

Next, send a keyword, which is similar to typing it on the keyboard. Special keys can be sent with the Keys class, imported from selenium.webdriver.common.keys. To be on the safe side, first clear any existing content in the input box so that it does not affect the search results.

elem.clear()
elem.send_keys("pycon")
elem.send_keys(Keys.RETURN)

After the page is submitted, you get the search results, if there are any. To make sure that some results were found, use an assertion as follows:

assert "No results found." not in driver.page_source

Finally, close the browser. Note that the close() method closes only one tab; to close the entire browser, use the quit() method instead.

driver.close()

Locating elements

There are many different strategies for locating an element on a page. Choose the most appropriate method to find elements in your project. Selenium provides the following methods:

  • find_element_by_id
  • find_element_by_name
  • find_element_by_xpath
  • find_element_by_link_text
  • find_element_by_partial_link_text
  • find_element_by_tag_name
  • find_element_by_class_name
  • find_element_by_css_selector

To find multiple elements at once (these methods return a list):

  • find_elements_by_id
  • find_elements_by_name
  • find_elements_by_xpath
  • find_elements_by_link_text
  • find_elements_by_partial_link_text
  • find_elements_by_tag_name
  • find_elements_by_class_name
  • find_elements_by_css_selector
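
One caveat for readers on a newer Selenium: the find_element_by_* methods were deprecated in Selenium 4.0 and, as far as I know, removed from 4.3 onward, in favor of find_element(By.XXX, value) and find_elements(By.XXX, value). The sketch below maps the old method suffixes to the locator strategy strings that sit behind the By constants. The helper find() and the dictionary name are hypothetical, not part of Selenium.

```python
# Old method suffix -> locator strategy string.
# These strings are the values of the By constants (By.ID == "id", and so on).
LOCATOR_STRATEGIES = {
    "id": "id",
    "name": "name",
    "xpath": "xpath",
    "link_text": "link text",
    "partial_link_text": "partial link text",
    "tag_name": "tag name",
    "class_name": "class name",
    "css_selector": "css selector",
}


def find(driver, suffix, value, many=False):
    """Hypothetical helper that emulates find_element(s)_by_<suffix>.

    find(driver, "name", "q") does the same as driver.find_element(By.NAME, "q").
    """
    strategy = LOCATOR_STRATEGIES[suffix]
    if many:
        return driver.find_elements(strategy, value)
    return driver.find_element(strategy, value)
```

In everyday code you would simply import By from selenium.webdriver.common.by and call driver.find_element(By.NAME, "q") directly; the table is only meant to show how the old names translate.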

Waiting for the page to load

Most web applications today use Ajax. When a page is loaded into the browser, the elements within it may load at different points in time. This makes locating elements difficult: if an element is not yet in the page, an exception such as NoSuchElementException or ElementNotVisibleException is thrown. Waits solve this problem. A wait provides some slack between actions, mostly between locating an element and performing an operation on it.

Selenium WebDriver provides two waiting modes, one is explicit waiting and the other is implicit waiting.

Explicit waiting

An explicit wait specifies the node to look for and a maximum wait time. If the node is loaded within that time, it is returned. If the node is still not loaded when the time is up, a timeout exception is thrown.

Let’s use a simple example:

First open JD and open developer tools. As shown below:

As shown in the figure above, we need to find the node whose ID is key and the node whose class is button. The specific code is as follows:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


browser = webdriver.Chrome()
browser.get('https://www.jd.com/')
wait = WebDriverWait(browser, 20)
input_q = wait.until(EC.presence_of_element_located((By.ID, 'key')))
button = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, '.button')))
print(input_q, button)

To explain the code above briefly: we create a WebDriverWait object with a maximum wait time, then call its until() method and pass in the condition to wait for. For example, presence_of_element_located means that a node is present in the DOM; its argument is a locator tuple for the node, here the search box whose ID is key. Likewise, element_to_be_clickable waits until the node whose class is button can be clicked.

The effect is that if the node with ID key is loaded within 20 seconds, that node is returned; if it still has not loaded after 20 seconds, a timeout exception is thrown.

If your network speed is good, the code runs and the nodes load successfully.

The following output is displayed:

<selenium.webdriver.remote.webelement.WebElement (session="54576c743c2ef8ec50ae1e02e826f5c0", element="cc0ff331-146f-4756-b738-82eb65016c41")> <selenium.webdriver.remote.webelement.WebElement (session="54576c743c2ef8ec50ae1e02e826f5c0", element="d38b49d7-550b-4619-bee9-f268ff7b4bf9")>

As you can see, the console successfully outputs two nodes, both of type WebElement.
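
To get a feel for what an explicit wait does under the hood, here is a simplified sketch of a polling loop in plain Python. It is not Selenium's implementation: the real WebDriverWait passes the driver to the condition, ignores NoSuchElementException while polling, and raises Selenium's own TimeoutException. The function wait_until and the toy condition are assumptions made for illustration.

```python
import time


def wait_until(condition, timeout=20.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Toy condition that succeeds on the third call, standing in for a node that loads late.
attempts = {"count": 0}


def node_loaded():
    attempts["count"] += 1
    return "node" if attempts["count"] >= 3 else None


print(wait_until(node_loaded, timeout=2, poll_interval=0.01))  # prints: node
```

The key point is that the wait returns as soon as the condition holds, so the maximum time is only spent when the node never shows up.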

Implicit waiting

With an implicit wait, if Selenium does not find a node in the DOM, it keeps waiting, and throws a node-not-found exception only after the specified time has elapsed. In other words, when a node is looked up and has not appeared yet, the implicit wait keeps polling the DOM for a while before giving up. The default wait time is 0. Here is an example:

from selenium import webdriver


browser = webdriver.Chrome()
browser.implicitly_wait(10)
browser.get('https://www.jd.com/')
input_q = browser.find_element_by_class_name('button')
print(input_q)

The implicitly_wait() method here implements implicit wait.

Implicit waiting does not work that well, because we can only specify one fixed time, while page load times vary with network conditions.

Wait conditions

In fact, there are many wait conditions, such as checking the title or checking whether a node contains certain text. Details are shown in the following table:

Wait condition                            Meaning
title_is                                  The title is the given text
title_contains                            The title contains the given text
presence_of_element_located               The node is loaded; pass in a locator tuple
visibility_of_element_located             The node is visible; pass in a locator tuple
visibility_of                             The node is visible; pass in the node object
presence_of_all_elements_located          All matching nodes are loaded
text_to_be_present_in_element             The text of a node contains the given text
text_to_be_present_in_element_value       The value of a node contains the given text
frame_to_be_available_and_switch_to_it    The frame is loaded; switch to it
invisibility_of_element_located           The node is not visible
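
Besides the built-in conditions, until() accepts any callable that takes the driver as its only argument and returns a truthy value once the wait should end. Here is a small sketch of a custom condition. The name title_contains_any is made up, and a SimpleNamespace stands in for the browser so that the example runs without Selenium.

```python
from types import SimpleNamespace


def title_contains_any(*words):
    """Build a wait condition that passes when the title contains any of the words."""
    def condition(driver):
        title = driver.title
        return any(word in title for word in words)
    return condition


# Stand-in object with a title attribute, used here instead of a real browser.
fake_browser = SimpleNamespace(title="Welcome to Python.org")
print(title_contains_any("Python", "PyCon")(fake_browser))  # True
print(title_contains_any("Java")(fake_browser))             # False
```

With a real browser it would be used as WebDriverWait(browser, 20).until(title_contains_any("Python", "PyCon")).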

Forward and backward

Ordinary browsers can go forward and backward through the history. Selenium can do the same with the back() and forward() methods, as shown below:

import time
from selenium import webdriver


browser = webdriver.Chrome()
browser.get('https://www.baidu.com')
browser.get('https://www.taobao.com')
browser.get('https://www.jd.com')

browser.back()
time.sleep(2)
browser.forward()

Here we visit three pages in a row, call the back() method to return to the second page, wait two seconds, and then call forward() to go to the third page again.

Cookies

Selenium also makes it easy to work with cookies: you can get, add, and delete them. The specific code is as follows:

from selenium import webdriver


browser = webdriver.Chrome()
browser.get('https://www.zhihu.com')
print(browser.get_cookies())
browser.add_cookie({'name': 'aa', 'value': 'bb'})
print(browser.get_cookies())
browser.delete_all_cookies()
print(browser.get_cookies())

One thing to note: the dict passed to add_cookie() must have the same shape as the cookies returned by get_cookies(), with at least the name and value keys. After all cookies are deleted, get_cookies() returns an empty list.
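
As a small illustration of the cookie format, the sketch below builds a dict in the shape add_cookie() expects: name and value are required, and keys such as path, domain, secure, and expiry are optional. The helper make_cookie is hypothetical and only shows the layout.

```python
def make_cookie(name, value, **optional):
    """Build a cookie dict with the required name and value keys."""
    cookie = {"name": name, "value": value}
    cookie.update(optional)  # for example path="/" or secure=True
    return cookie


print(make_cookie("aa", "bb", path="/"))
# {'name': 'aa', 'value': 'bb', 'path': '/'}
```

The result can be passed straight to browser.add_cookie(), as long as the browser is already on a page of the domain the cookie belongs to.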

Exception handling

When working with Selenium, you will inevitably run into exceptions, such as timeouts or nodes that cannot be found. When such an error occurs, the program stops running. We can use try/except statements to catch these exceptions.

The specific code is as follows:

from selenium import webdriver
from selenium.common.exceptions import TimeoutException, NoSuchElementException


browser = webdriver.Chrome()
try:
    browser.get('https://www.baidu.com')
except TimeoutException:
    print('timeout')

try:
    browser.find_element_by_id('aa')
except NoSuchElementException:
    print('Node not found')
finally:
    browser.close()

Here we use try/except to catch the exceptions. For example, we wrap the find_element_by_id() call that looks up the node and catch NoSuchElementException. When that error occurs, the exception handler runs and the program does not crash.
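
The finally clause above can also be packaged so that the browser is always shut down, even when an exception escapes. The following is a minimal sketch using contextlib from the standard library. The names managed and FakeDriver are my own; FakeDriver stands in for a real browser so the example runs without Selenium, and LookupError stands in for NoSuchElementException.

```python
from contextlib import contextmanager


@contextmanager
def managed(driver):
    """Yield the driver and always call quit() afterwards, even on errors."""
    try:
        yield driver
    finally:
        driver.quit()


class FakeDriver:
    """Stand-in for a real browser, used only for this demonstration."""

    def __init__(self):
        self.closed = False

    def quit(self):
        self.closed = True


fake = FakeDriver()
try:
    with managed(fake) as browser:
        raise LookupError("Node not found")  # stands in for NoSuchElementException
except LookupError as exc:
    print("caught:", exc)
print(fake.closed)  # True
```

Calling quit() rather than close() here follows the earlier note that close() only closes one tab.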

That’s the end of this sharing.

Finally

Nothing can be accomplished overnight; that is true of life, and it is true of learning. So how could anyone master something in three days or seven days? Only by persisting can you succeed!

Biting books says:

I typed every word of this article with care, hoping only to live up to everyone who follows me. Give me a “like” at the end of the article to let me know that you are working hard at your studies too.

The way ahead is so long without ending, yet high and low I’ll search with my will unbending.

I am someone who concentrates on learning. The more you know, the more you realize you do not know. More great content is on the way. See you next time!