
Preface

In many cases, data that a page renders dynamically with JS is not returned as plain text. It is first processed by some encryption algorithm, so an ordinary request fails to obtain the correct data. To get around this, we can use Selenium, an automated testing framework that drives a real browser, and crawl exactly what is visible on the page.

Principle

Selenium works by mimicking a user's browser behavior: it opens a browser and then executes the actions that our script has defined.

Versions

Selenium exists in two different versions:

Selenium RC (Remote Control): the traditional Selenium framework.

Selenium WebDriver: a newer automation interface that removes some of the limitations of Selenium 1.

The version in common use is Selenium WebDriver, and it is the one used in the rest of this article.

Process

  1. The Selenium script creates the driver, which starts the browser;
  2. The driver contains an HTTP server that receives HTTP requests from the script;
  3. The HTTP server manipulates the browser to perform the steps described in each request;
  4. The browser returns the result of each step to the HTTP server;
  5. The HTTP server returns the results to the Selenium script.
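
The steps above can be sketched at the protocol level. The snippet below builds the HTTP requests that a script would send to the driver, following the endpoints of the W3C WebDriver protocol. The helper names (`new_session`, `navigate`, `page_source`, `delete_session`, `send`) and the address `http://localhost:9515` (ChromeDriver's default port) are assumptions made for illustration; they are not part of Selenium's own API, which hides these requests from you.

```python
import json
import urllib.request

# Assumed address of a locally running chromedriver (9515 is its default port).
DRIVER_URL = "http://localhost:9515"


def new_session():
    """Request that asks the driver to start a browser session."""
    body = {"capabilities": {"alwaysMatch": {"browserName": "chrome"}}}
    return ("POST", "/session", body)


def navigate(session_id, url):
    """Request that tells the browser in this session to open `url`."""
    return ("POST", f"/session/{session_id}/url", {"url": url})


def page_source(session_id):
    """Request that asks for the HTML of the current page."""
    return ("GET", f"/session/{session_id}/source", None)


def delete_session(session_id):
    """Request that closes the browser session."""
    return ("DELETE", f"/session/{session_id}", None)


def send(request, base_url=DRIVER_URL):
    """Send one request to the driver's HTTP server and return the decoded JSON reply."""
    method, path, body = request
    data = None if body is None else json.dumps(body).encode("utf-8")
    req = urllib.request.Request(
        base_url + path,
        data=data,
        method=method,
        headers={"Content-Type": "application/json; charset=utf-8"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With a driver actually running, `send(new_session())` should return a JSON object whose `value` contains a `sessionId`, and that id is then passed to `navigate`, `page_source` and `delete_session`. In normal use you never write this by hand, since `webdriver.Chrome()` and `driver.get()` issue the equivalent requests for you.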

Installation

  1. Install the selenium library

Run the following command:

pip install selenium
  2. Install ChromeDriver

(Windows only) Download chromedriver.exe, choosing the version that matches your installed Chrome, and copy it into the Scripts directory of your Python installation or virtual environment.
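
A quick way to check that the copy worked is to ask Python whether the driver can be found on PATH. This is only a minimal sketch: the helper name `find_chromedriver` is made up for this example, and it assumes the Scripts directory is already on your PATH (which is normally the case for an activated virtual environment).

```python
import shutil


def find_chromedriver(name="chromedriver"):
    """Return the full path of the driver executable if it is on PATH, else None."""
    # On Windows, shutil.which also tries the usual extensions such as .exe.
    return shutil.which(name)


if __name__ == "__main__":
    path = find_chromedriver()
    if path:
        print(f"chromedriver found at {path}")
    else:
        print("chromedriver is not on PATH; copy it into the Scripts directory")
```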

Usage

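from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

url = "https://juejin.cn/"

# Run Chrome without a visible window.
chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument("--disable-gpu")

# Pass the options via `options=`; the `chrome_options=` keyword is deprecated.
driver = webdriver.Chrome(options=chrome_options)
try:
    driver.get(url)
    html = driver.page_source  # HTML after the browser has rendered the page
finally:
    driver.quit()  # always close the browser, even if loading fails

soup = BeautifulSoup(html, "lxml")
# Use BeautifulSoup to extract the data you need from `soup`.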

Conclusion

The advantage of Selenium is that the script can simply wait (for example with sleep) until the page has finished loading, so there is no need to analyze the page's JS logic. It also has a serious disadvantage: sites can detect it easily, which limits where it can be used.
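
A fixed sleep either wastes time or is too short on a slow page. A common refinement is to poll until the content you need has appeared. The helper below is a library-independent sketch of that idea; the name `wait_until` and its parameters are assumptions for this example, not a Selenium function.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call `condition()` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)  # short pause before checking again
```

With a driver from the usage section, it could be called as `wait_until(lambda: driver.find_elements("css selector", ".some-class"))`, where `.some-class` is a placeholder for whatever element marks the page as loaded. Selenium ships the same pattern as `WebDriverWait(driver, timeout).until(...)` in `selenium.webdriver.support.ui`, which is usually the better choice in real scripts.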