Preface
In many cases, data that JavaScript renders dynamically is not returned as plaintext in the response. It has been processed by some encryption algorithm, so a plain HTTP request fails to obtain the correct data. To get around this, we can use Selenium, an automated testing framework that drives a real browser, so that whatever is visible on the page can be crawled.
Principle
Selenium works by mimicking a user: it opens a browser and then executes the sequence of actions that our script has defined.
Versions
Selenium exists in two different versions:
- Selenium RC (Remote Control): the traditional Selenium framework.
- Selenium WebDriver: a newer automation interface that overcomes some of the limitations of Selenium 1.
The version in common use is Selenium WebDriver, and it is the one used in the rest of this article.
Process
- The Selenium script creates a driver, which is bound to the browser;
- The driver contains an HTTP server that receives HTTP requests from the script;
- The HTTP server makes the browser perform the steps described by each request;
- The browser returns the result of each step to the HTTP server;
- The HTTP server returns the result to the Selenium script.
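To make the request flow above more concrete, here is a minimal sketch of the HTTP requests a script would send to the driver's HTTP server for one "open a page and read its source" run. The endpoint paths follow the W3C WebDriver protocol. The address in `DRIVER_URL`, the helper names, and the session id `abc123` are assumptions chosen for illustration, and nothing is actually sent over the network.

```python
import json

# Assumed address of a locally running ChromeDriver (illustrative only).
DRIVER_URL = "http://localhost:9515"


def new_session_request(headless=True):
    """Request that asks the driver to start a browser session."""
    args = ["--headless"] if headless else []
    body = {
        "capabilities": {
            "alwaysMatch": {
                "browserName": "chrome",
                "goog:chromeOptions": {"args": args},
            }
        }
    }
    return ("POST", "/session", json.dumps(body))


def navigate_request(session_id, url):
    """Request that tells the browser to load a URL."""
    return ("POST", f"/session/{session_id}/url", json.dumps({"url": url}))


def page_source_request(session_id):
    """Request that asks for the HTML of the current page."""
    return ("GET", f"/session/{session_id}/source", None)


def quit_request(session_id):
    """Request that closes the browser and ends the session."""
    return ("DELETE", f"/session/{session_id}", None)


if __name__ == "__main__":
    session_id = "abc123"  # hypothetical id; a real one comes back from POST /session
    steps = [
        new_session_request(),
        navigate_request(session_id, "https://juejin.cn/"),
        page_source_request(session_id),
        quit_request(session_id),
    ]
    for method, path, body in steps:
        print(method, DRIVER_URL + path, body or "")
```

In normal use the selenium library builds and sends these requests for you; the sketch only shows what travels between the script and the driver.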
Installation
- Install the selenium library
Run the following command:
pip install selenium
- Install ChromeDriver
(Windows only) Download chromedriver.exe, choosing the release that matches your installed Chrome version, and copy it into the Scripts directory of your Python installation or virtual environment.
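A quick way to confirm that the driver was copied to a directory on the PATH is to look it up from Python. This is a small sketch using only the standard library; the helper name `find_driver` is an assumption made for this example.

```python
import shutil


def find_driver(name="chromedriver"):
    """Return the full path of the driver executable if it is on PATH, else None."""
    # On Windows, shutil.which also tries the extensions listed in PATHEXT,
    # so "chromedriver" matches chromedriver.exe.
    return shutil.which(name)


if __name__ == "__main__":
    path = find_driver()
    if path is None:
        print("chromedriver was not found on PATH")
    else:
        print("chromedriver found at", path)
```

Recent Selenium 4 releases can also locate or download a matching driver by themselves, in which case this manual check may be unnecessary.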
Usage
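from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

url = "https://juejin.cn/"

# Run Chrome without opening a visible window
chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument("--disable-gpu")

# Selenium 4 expects `options=`; the old `chrome_options=` keyword is no longer accepted by newer releases
driver = webdriver.Chrome(options=chrome_options)
try:
    driver.get(url)
    html = driver.page_source  # HTML of the page as the browser currently has it
finally:
    driver.quit()  # always close the browser, even if loading fails

# BeautifulSoup is used for extraction (the "lxml" parser requires `pip install lxml`)
soup = BeautifulSoup(html, "lxml")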
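The extraction step itself is not shown in the example, so here is a minimal sketch of what it might look like. The HTML in `SAMPLE_HTML`, the `title` class, and the helper `extract_titles` are all made up for illustration and do not reflect the real markup of any site. The built-in `html.parser` is used so the sketch needs no extra parser.

```python
from bs4 import BeautifulSoup

# Hypothetical rendered HTML, standing in for driver.page_source
SAMPLE_HTML = """
<html><body>
  <a class="title" href="/post/1">First post</a>
  <a class="title" href="/post/2">Second post</a>
  <a class="nav" href="/about">About</a>
</body></html>
"""


def extract_titles(html):
    """Return (text, href) pairs for every <a class="title"> in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        (a.get_text(strip=True), a["href"])
        for a in soup.find_all("a", class_="title")
    ]


if __name__ == "__main__":
    for text, href in extract_titles(SAMPLE_HTML):
        print(text, href)
```

With a real page, the string passed to `extract_titles` would be the `html` value taken from `driver.page_source`, and the tag and class names would have to match that page's markup.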
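Conclusion
The advantage of Selenium is that it drives a real browser: we can simply wait for the page to finish loading (for example with a sleep) and read the rendered result, without having to work out the logic in the JavaScript. It also has a serious disadvantage: automated browsers are easy for websites to detect, so its use is limited.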
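A fixed sleep either wastes time or is too short on a slow page. One common alternative is to poll until a condition holds or a timeout expires. The helper below is a generic sketch of that idea, not part of Selenium; the name `wait_until` and its default values are assumptions for this example.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed without success.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)  # short pause between checks instead of one long sleep


if __name__ == "__main__":
    start = time.monotonic()
    # Stand-in condition: becomes true after roughly one second
    wait_until(lambda: time.monotonic() - start > 1.0, timeout=5, poll=0.1)
    print("condition met")
```

With a driver, the condition could be something like `lambda: driver.find_elements(...)` for an element that only appears after the page has rendered. Selenium also ships its own `WebDriverWait` class in `selenium.webdriver.support.ui` for the same purpose.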