Introduction:
Goal: hide the WebDriver attributes on a distributed Selenium Grid, with attention to packaging, deployment, reading files from inside the package, and path pitfalls. I wrote a configurable general-purpose crawler that focuses on development efficiency, deployment efficiency, and management details: stealth.min.js is packaged into the egg, the project is managed with the Scrapy framework, deployed through Scrapyd, and scheduled with ScrapydWeb. The task: monitor the announcements of the local financial regulators of dozens of provinces and cities, and be able to write and maintain a crawler for a site like these within a day. Two example targets, with the URLs Base64-encoded: aHR0cDovL2pyLmNoZW5nZHUuZ292LmNuL2ppbnJvbmdiYW4vYzEzOTAxMy9saXN0LnNodG1s and aHR0cDovL2pyai5oYWluYW4uZ292LmNuL3NqcmIvemNjY19uZXd4eGdrX2luZGV4LnNodG1s
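The Base64-encoded targets above can be decoded with the standard library:

```python
import base64

# Decode the first Base64-encoded target URL from the introduction
encoded = "aHR0cDovL2pyLmNoZW5nZHUuZ292LmNuL2ppbnJvbmdiYW4vYzEzOTAxMy9saXN0LnNodG1s"
url = base64.b64decode(encoded).decode("utf-8")
print(url)  # http://jr.chengdu.gov.cn/jinrongban/c139013/list.shtml
```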
A configurable crawler based on Scrapy greatly improves work efficiency
Configurable crawler configuration example:
{
    "local": "Hainan Province",
    "organization": "XXXX Local Financial Supervision Administration",
    "link": "xxxxxx.shtml",
    "article_rows_xpath": '//a[contains(text(), "public notice ")]/../following-sibling::div[1]/ul/li',
    "title_xpath": "./a",
    "title_parse": "./@title",
    "title_link_xpath": "./a/@href",
    "date_xpath": "./em",
    "date_parse": './text()',
    "prefix": "http://jrj.hainan.gov.cn/",
    "note": "{'way':'selenium', 'use_proxy':'False'} ",
}
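Note that the "note" field holds a dict serialized as a string. One way to turn it back into a dict is `ast.literal_eval` (the helper name `parse_note` is mine, not from the original project):

```python
import ast

def parse_note(note: str) -> dict:
    """Parse the stringified dict stored in the config's "note" field."""
    return ast.literal_eval(note.strip())

cfg_note = parse_note("{'way':'selenium', 'use_proxy':'False'} ")
print(cfg_note["way"])        # selenium
print(cfg_note["use_proxy"])  # 'False' is a string here, not a bool
```

Beware that 'use_proxy' comes back as the string 'False', which is truthy; the crawler has to compare it against the literal, not test it directly.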
Interpretation of the difficult points:
1. Hiding the webdriver feature: demo code
Reference: "The most complete solution! How can an automated browser properly hide its features?"
Local version

# -*- coding:utf-8 -*-
# @Author: clark
# @time: 2021/4/5:40 PM
# @File: webdriver_hide_feature.py
# @Project Demand: completely hide the webdriver features
import time

from selenium.webdriver import Chrome
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
# chrome_options.add_argument("--headless")
chrome_options.add_argument(
    'user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36')
driver = Chrome(options=chrome_options)

# Inject stealth.min.js so it runs before any page script on every new document
with open('stealth.min.js') as f:
    js = f.read()
driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
    "source": js
})

driver.get('https://bot.sannysoft.com/')
time.sleep(5)
driver.save_screenshot('screenshot.png')

# You can also save the page source as HTML and open it to see the full result
source = driver.page_source
with open('result.html', 'w') as f:
    f.write(source)
Remote Selenium Grid version

# -*- coding:utf-8 -*-
# @Author: clark
# @time: 2021/4/8 2:38 PM
# @File: webdriver_hide_feature_remote.py
# @Project Demand: completely hide the webdriver features
import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.remote_connection import ChromeRemoteConnection

chrome_options = Options()
# chrome_options.add_argument("--headless")
chrome_options.add_argument(
    'user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36')

with webdriver.Remote(command_executor=ChromeRemoteConnection(
        remote_server_addr='http://192.168.95.56:4444/wd/hub',
        keep_alive=True),
        desired_capabilities={
            'platform': 'WINDOWS', 'browserName': "chrome", 'version': '', 'javascriptEnabled': True
        },
        options=chrome_options
) as driver:
    # A Remote driver has no execute_cdp_cmd helper; ChromeRemoteConnection
    # registers the executeCdpCommand endpoint, which we call via execute()
    with open('stealth.min.js') as f:
        js = f.read()
    print(driver.execute("executeCdpCommand", {'cmd': "Page.addScriptToEvaluateOnNewDocument", 'params': {
        "source": js
    }}))
    driver.get('https://bot.sannysoft.com/')
    time.sleep(5)
    driver.save_screenshot('screenshot.png')
Packaging static resources into the egg, and reading the packaged data files in production

Reference:
It took two days to finally sort out Python's setup.py
Introduction to setuptools, the Python packaging and distribution tool
Python Cookbook study notes: modules and packages

File path: scrapy project path/spiders/static/stealth.min.js

# MANIFEST.in
recursive-include risk_control_info/spiders/static *
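If you prefer to declare the data files in setup.py itself rather than in MANIFEST.in, setuptools also accepts a `package_data` argument. A sketch, assuming the project's `risk_control_info.spiders` package layout:

```python
# setup.py fragment: equivalent declaration without MANIFEST.in (a sketch)
from setuptools import setup, find_packages

setup(
    name='project',
    version='1.0',
    packages=find_packages(),
    # ship every file under spiders/static inside the egg
    package_data={'risk_control_info.spiders': ['static/*']},
)
```

Either mechanism works; MANIFEST.in plus `include_package_data=True` is what scrapyd-deploy's generated setup.py expects.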
# xxx/spiders/big_finance_jgj_news.py
with webdriver.Remote(command_executor=ChromeRemoteConnection(
        remote_server_addr="{}/wd/hub".format(SELENIUM_DOCKER_HOST),
        keep_alive=True),
        desired_capabilities={
            'platform': 'WINDOWS', 'browserName': "chrome", 'version': '', 'javascriptEnabled': True
        },
        options=options
) as browser:
    # Hide the webdriver attribute
    try:
        # Production environment: read the data file from inside the egg
        import pkg_resources
        f = pkg_resources.resource_stream(__package__, 'static/stealth.min.js')
        js = f.read().decode()
        # self.logger.info(js)
        self.logger.info(
            browser.execute("executeCdpCommand",
                            {'cmd': "Page.addScriptToEvaluateOnNewDocument", 'params': {
                                "source": js
                            }}))
    except Exception as e:
        self.logger.warning(f"Test environment, reading from the local path instead: {e}")
        # Read from the local source tree
        this_dir, this_filename = os.path.split(__file__)
        STEALTH_PATH = os.path.join(this_dir, "static", "stealth.min.js")
        self.logger.info(f"STEALTH_PATH:{STEALTH_PATH}")
        with open(STEALTH_PATH) as f:
            js = f.read()
        self.logger.info(
            browser.execute("executeCdpCommand",
                            {'cmd': "Page.addScriptToEvaluateOnNewDocument", 'params': {
                                "source": js
                            }}))
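The try/except above can be factored into a small standalone helper: try the installed package (egg) first, then fall back to the source tree next to the module. A sketch; the helper name `load_static_resource` and its parameters are mine, not from the original project:

```python
import os

def load_static_resource(package: str, relpath: str, module_file: str) -> str:
    """Read a text data file, preferring the installed package (egg),
    falling back to the source tree next to `module_file`."""
    try:
        # Production: the file lives inside the installed egg
        import pkg_resources  # shipped with setuptools
        return pkg_resources.resource_stream(package, relpath).read().decode()
    except Exception:
        # Development: resolve relative to the module on disk
        this_dir = os.path.dirname(module_file)
        with open(os.path.join(this_dir, *relpath.split("/"))) as fh:
            return fh.read()
```

In the spider this would be called as `load_static_resource(__package__, 'static/stealth.min.js', __file__)`, keeping the dual-environment logic out of the crawling code.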
# xxx/setup.py
# Automatically created by: scrapyd-deploy
from setuptools import setup, find_packages

setup(
    name='project',
    version='1.0',
    packages=find_packages(),
    entry_points={'scrapy': ['settings = risk_control_info.settings']},
    include_package_data=True  # pick up the files listed in MANIFEST.in
)
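After scrapyd-deploy (or `python setup.py bdist_egg`) builds the egg, it is worth verifying that stealth.min.js actually made it in. Eggs are zip archives, so a small check suffices (the helper name `egg_contains` and the example egg path are mine):

```python
import zipfile

def egg_contains(egg_path: str, member_suffix: str) -> bool:
    """Return True if any file inside the egg (a zip archive) ends with member_suffix."""
    with zipfile.ZipFile(egg_path) as z:
        return any(name.endswith(member_suffix) for name in z.namelist())

# e.g. egg_contains("dist/project-1.0-py3.8.egg", "stealth.min.js")
```

If this returns False, the usual culprits are a missing `include_package_data=True` or a MANIFEST.in path that does not match the package layout.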