In this article, we will use Selenium to crawl Dangdang's best-selling books ranking. As the saying goes, in books one finds houses of gold and faces of jade: we improve our talents by reading and learning, and open our eyes to knowledge.
Before we can fetch any data, we need to install the Selenium library and the Chrome browser and configure them; the Chrome build used for headless browsing should be version 92 or later. Selenium is an automated testing tool that drives the browser to perform specific actions, such as clicks and dropdowns, while also giving us access to the source code of the page the browser is currently rendering, so whatever can be seen can be crawled. With the preparation complete, we can officially begin grabbing Dangdang's best-selling books ranking.
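If you are starting from a clean environment, a minimal smoke test such as the one below confirms that the setup works. This sketch is not part of the original tutorial: it assumes `pip install selenium` has already been run, and that Selenium 4.6+ is installed so a matching ChromeDriver is downloaded automatically (on older Selenium versions you would point the driver path at a manually downloaded binary).

```python
# Minimal setup check: drive a headless Chrome and print a page title.
# Assumes: `pip install selenium`, Chrome >= 92, Selenium 4.6+ (auto driver management).
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless")  # run without a visible browser window

driver = webdriver.Chrome(options=options)
driver.get("http://book.dangdang.com/")
print(driver.title)  # if this prints the page title, the environment is ready
driver.quit()
```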
Source: book.dangdang.com/
Source for crawler proxy: www.16yun.cn/
Next is the actual practice. First, we open Dangdang's best-selling books web page and use Selenium to capture book information: by analyzing the page, we can extract each book's ranking, picture, name, price, comment count, and other details. Example code is as follows:
```python
from selenium import webdriver
import string
import zipfile

# Proxy server credentials (www.16yun.cn).
# proxyHost, proxyPort and proxyUser are placeholders here; fill in the
# values issued by your proxy provider. The password below is from the
# original article.
proxyHost = "t.16yun.cn"
proxyPort = "31111"
proxyUser = "username"
proxyPass = "552465"


def create_proxy_auth_extension(proxy_host, proxy_port,
                                proxy_username, proxy_password,
                                scheme='http', plugin_path=None):
    """Package a Chrome extension that configures an authenticated proxy."""
    if plugin_path is None:
        plugin_path = r'D:/{}_{}@t.16yun.zip'.format(proxy_username, proxy_password)

    manifest_json = """
    {
        "version": "1.0.0",
        "manifest_version": 2,
        "name": "16YUN Proxy",
        "permissions": [
            "proxy",
            "tabs",
            "unlimitedStorage",
            "storage",
            "<all_urls>",
            "webRequest",
            "webRequestBlocking"
        ],
        "background": {
            "scripts": ["background.js"]
        },
        "minimum_chrome_version": "22.0.0"
    }
    """

    background_js = string.Template(
        """
        var config = {
            mode: "fixed_servers",
            rules: {
                singleProxy: {
                    scheme: "${scheme}",
                    host: "${host}",
                    port: parseInt(${port})
                },
                bypassList: ["foobar.com"]
            }
        };

        chrome.proxy.settings.set({value: config, scope: "regular"}, function() {});

        function callbackFn(details) {
            return {
                authCredentials: {
                    username: "${username}",
                    password: "${password}"
                }
            };
        }

        chrome.webRequest.onAuthRequired.addListener(
            callbackFn,
            {urls: ["<all_urls>"]},
            ['blocking']
        );
        """
    ).substitute(
        host=proxy_host,
        port=proxy_port,
        username=proxy_username,
        password=proxy_password,
        scheme=scheme,
    )

    # Bundle manifest and background script into a loadable extension.
    with zipfile.ZipFile(plugin_path, 'w') as zp:
        zp.writestr("manifest.json", manifest_json)
        zp.writestr("background.js", background_js)

    return plugin_path


proxy_auth_plugin_path = create_proxy_auth_extension(
    proxy_host=proxyHost,
    proxy_port=proxyPort,
    proxy_username=proxyUser,
    proxy_password=proxyPass)

option = webdriver.ChromeOptions()
option.add_argument("--start-maximized")

# The proxy extension requires Chrome extensions to stay enabled, so do NOT
# pass the following flag:
# option.add_argument("--disable-extensions")
option.add_extension(proxy_auth_plugin_path)

# Optionally hide the "Chrome is being controlled" infobar:
# option.add_experimental_option('excludeSwitches', ['enable-automation'])

driver = webdriver.Chrome(options=option)  # `chrome_options=option` on Selenium 3

# Optionally mask navigator.webdriver before any page script runs:
# script = '''
# Object.defineProperty(navigator, 'webdriver', {
#     get: () => undefined
# })
# '''
# driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {"source": script})

driver.get("http://book.dangdang.com/")
```
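The code above only configures the proxy and opens the home page; the extraction step still has to locate each book on the ranking page. As a sketch of how that might look, the snippet below pulls the ranking, picture, name, price, and comment count with CSS selectors. The best-seller URL and every selector reflect the list's markup at the time of writing and should be treated as assumptions to verify in the browser's DevTools.

```python
from selenium.webdriver.common.by import By

# Load the best-seller list directly. URL and all CSS selectors below are
# assumptions based on the page markup -- verify them against the live page.
driver.get("http://bang.dangdang.com/books/bestsellers/")

books = []
for item in driver.find_elements(By.CSS_SELECTOR, "ul.bang_list li"):
    books.append({
        "rank": item.find_element(By.CSS_SELECTOR, "div.list_num").text,
        "image": item.find_element(By.CSS_SELECTOR, "div.pic img").get_attribute("src"),
        "name": item.find_element(By.CSS_SELECTOR, "div.name a").get_attribute("title"),
        "price": item.find_element(By.CSS_SELECTOR, "div.price span.price_n").text,
        "comments": item.find_element(By.CSS_SELECTOR, "div.star a").text,
    })

for book in books:
    print(book)

driver.quit()
```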
That is the complete Selenium crawler for Dangdang's best-selling books in Python.