Introduction:

Goal: hide WebDriver attributes when running on a distributed Selenium Grid. The focus is on packaging, deployment, reading packaged files, and path pitfalls.

I wrote a configurable general-purpose crawler that prioritizes development efficiency, deployment efficiency, and manageability: stealth.min.js is packaged into the egg, the project is managed with the Scrapy framework, deployed through scrapyd, and scheduled with ScrapydWeb.

The crawler follows announcements from financial regulators, with crawlers for dozens of provinces and cities that take about a day to fix and maintain. Some sites detect the browser, such as these two: aHR0cDovL2pyLmNoZW5nZHUuZ292LmNuL2ppbnJvbmdiYW4vYzEzOTAxMy9saXN0LnNodG1s and aHR0cDovL2pyai5oYWluYW4uZ292LmNuL3NqcmIvemNjYy9uZXd4eGdrX2luZGV4LnNodG1s

Configurable crawler based on Scrapy, greatly improving work efficiency

Configurable crawler configuration example:

{
    "local": "Hainan Province",
    "organization": "XXXX Local Financial Supervision Administration",
    "link": "xxxxxx.shtml",
    "article_rows_xpath": '//a[contains(text(), "public notice")]/../following-sibling::div[1]/ul/li',
    "title_xpath": "./a",
    "title_parse": "./@title",
    "title_link_xpath": "./a/@href",
    "date_xpath": "./em",
    "date_parse": './text()',
    "prefix": "http://jrj.hainan.gov.cn/",
    "note": "{'way':'selenium', 'use_proxy':'False'}",
},

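To illustrate how a spider might consume two of these fields, here is a minimal sketch that decodes the "note" string and resolves a relative link against "prefix". The helper names parse_note and absolute_link are hypothetical and not taken from the project, the sample href is made up, and the sketch assumes "note" always holds a Python-literal dict stored as a string, as in the example configuration.

```python
import ast
from urllib.parse import urljoin


def parse_note(note):
    """Decode the string stored in "note" into fetch options (hypothetical helper)."""
    text = (note or "").strip()
    raw = ast.literal_eval(text) if text else {}
    return {
        # How the page should be fetched, e.g. 'selenium'; None if not given.
        "way": raw.get("way"),
        # The example stores the flag as the string 'False', so compare text.
        "use_proxy": str(raw.get("use_proxy", "False")).strip().lower() == "true",
    }


def absolute_link(prefix, href):
    """Resolve a link taken from title_link_xpath against the site prefix."""
    return urljoin(prefix, href)


options = parse_note("{'way':'selenium', 'use_proxy':'False'} ")
print(options["way"], options["use_proxy"])
# 'notice/1.shtml' is a made-up relative link used only for illustration.
print(absolute_link("http://jrj.hainan.gov.cn/", "notice/1.shtml"))
```

Using ast.literal_eval rather than eval keeps the decoding limited to plain literals, which seems a reasonable choice for values that come from a configuration store.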
Interpretation of difficulties:

1. Demo code for hiding WebDriver features

Reference: the most perfect scheme! How do emulated browsers properly hide features

Local version

# local version
# -*- coding:utf-8 -*-
# @Author: clark
# @time: 2021/4/5:40 PM
# @File: webdriver_hide_feature.py
# @Project Demand: Current time to fully hide webDriver features

import time
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
# chrome_options.add_argument("--headless")
chrome_options.add_argument(
    'user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36')

driver = Chrome(options=chrome_options)

with open('stealth.min.js') as f:
    js = f.read()

driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
    "source": js
})
driver.get('https://bot.sannysoft.com/')
time.sleep(5)
driver.save_screenshot('screenshot.png')

# You can also save the page source as HTML and open it in a browser to see the full result
source = driver.page_source
with open('result.html', 'w') as f:
    f.write(source)
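
Besides the screenshot, one quick signal that the injection worked is the value of navigator.webdriver, which an automated Chrome normally reports as true. The sketch below reads that value through execute_script; the function name webdriver_flag_hidden is hypothetical, and this is only one of the many checks a detection page performs, so a passing result does not prove the browser is undetectable.

```python
def webdriver_flag_hidden(driver):
    """Return True if the current page does not report navigator.webdriver as true.

    Call this after driver.get(...), because the injected script only takes
    effect on documents loaded after it was registered.
    """
    flag = driver.execute_script("return navigator.webdriver")
    # A hidden flag comes back as None (JavaScript undefined) or False.
    return not flag
```

With the local demo above, it could be called right after driver.get(...), for example print(webdriver_flag_hidden(driver)).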

Remote Selenium Grid version

# -*- coding:utf-8 -*-
# @Author: clark
# @time: 2021/4/8 2:38 PM
# @File: webdriver_hide_feature_remote.py
# @Project Demand: Current time to fully hide webDriver features
import time
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.remote_connection import ChromeRemoteConnection
from selenium import webdriver

chrome_options = Options()
# chrome_options.add_argument("--headless")
chrome_options.add_argument(
    'user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36')

with webdriver.Remote(command_executor=ChromeRemoteConnection(
        remote_server_addr='http://192.168.95.56:4444/wd/hub',
        keep_alive=True),
        desired_capabilities={
            'platform': 'WINDOWS', 'browserName': "chrome", 'version': '', 'javascriptEnabled': True
        },
        options=chrome_options
) as driver:
    with open('stealth.min.js') as f:
        js = f.read()
    print(driver.execute("executeCdpCommand", {'cmd': "Page.addScriptToEvaluateOnNewDocument", 'params': {
        "source": js
    }}))

    driver.get('https://bot.sannysoft.com/')
    time.sleep(5)
    driver.save_screenshot('screenshot.png')
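
The local and remote demos differ only in how the CDP command is sent: the local Chrome driver offers execute_cdp_cmd, while the Remote driver sends the raw "executeCdpCommand" command. A small wrapper can hide that difference. This is a sketch under assumptions: the name add_script_on_new_document is hypothetical, and the remote branch assumes the driver was created with a ChromeRemoteConnection, as in the remote demo, so that the command name is known to the command executor.

```python
CDP_METHOD = "Page.addScriptToEvaluateOnNewDocument"


def add_script_on_new_document(driver, js_source):
    """Register js_source to run on every new document, for local or remote Chrome."""
    params = {"source": js_source}
    if hasattr(driver, "execute_cdp_cmd"):
        # Local Chrome driver: use the CDP helper it provides.
        return driver.execute_cdp_cmd(CDP_METHOD, params)
    # Remote driver: send the raw command, as the remote demo does.
    return driver.execute("executeCdpCommand", {"cmd": CDP_METHOD, "params": params})
```

Either demo could then call add_script_on_new_document(driver, js) before driver.get(...).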


2. Package static resources into the egg and read the package's data files in production

Reference:

It took two days to finally get Python setup.py straight

Introduction to setuptools, the Python packaging and distribution tool

The Python Cookbook Study Note: Modules and Packages

File path: <Scrapy project path>/spiders/static/stealth.min.js

# MANIFEST.in
recursive-include risk_control_info/spiders/static *
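
The spider then reads the packaged file and falls back to the local path in the test environment. This is a fragment from inside a spider method; os, webdriver, ChromeRemoteConnection, options, and SELENIUM_DOCKER_HOST are defined elsewhere in the project.

# xxx/spiders/big_finance_jgj_news.py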
with webdriver.Remote(command_executor=ChromeRemoteConnection(
        remote_server_addr="{}/wd/hub".format(SELENIUM_DOCKER_HOST),
        keep_alive=True),
        desired_capabilities={
            'platform': 'WINDOWS', 'browserName': "chrome", 'version': '', 'javascriptEnabled': True
        },
        options=options
) as browser:
    # Hide webDriver attributes
    try:
        # Production environment: read the data file from inside the package
        import pkg_resources
        f = pkg_resources.resource_stream(__package__, 'static/stealth.min.js')
        js = f.read().decode()
        # self.logger.info(js)
        self.logger.info(
            browser.execute("executeCdpCommand",
                            {'cmd': "Page.addScriptToEvaluateOnNewDocument", 'params': {
                                "source": js
                            }}))
    except Exception as e:
        self.logger.warning(f"Test environment, reading from local path: {e}")
        # Read from the local file system
        this_dir, this_filename = os.path.split(__file__)
        STEALTH_PATH = os.path.join(this_dir, "static", "stealth.min.js")
        self.logger.info(f"STEALTH_PATH:{STEALTH_PATH}")
        with open(STEALTH_PATH) as f:
            js = f.read()
            self.logger.info(
                browser.execute("executeCdpCommand",
                                {'cmd': "Page.addScriptToEvaluateOnNewDocument", 'params': {
                                    "source": js
                                }}))
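
The try-the-package-then-fall-back logic in the fragment above can be factored into a single helper. The sketch below swaps in the standard library's pkgutil.get_data for pkg_resources, so it is a variation and not what the spider itself does; read_package_text and its parameters are hypothetical names. pkgutil.get_data goes through the package's loader, so it can also read files from a package imported out of a zipped egg.

```python
import os
import pkgutil


def read_package_text(package, resource, fallback_dir=None, encoding="utf-8"):
    """Return the text of a data file shipped inside `package`.

    `resource` is a '/'-separated path relative to the package directory.
    If the package or file cannot be found, read from `fallback_dir` instead.
    """
    try:
        # Returns None when the package cannot be located.
        data = pkgutil.get_data(package, resource)
    except (OSError, ImportError):
        # The package was found but the file is missing, or a parent
        # package could not be imported.
        data = None
    if data is not None:
        return data.decode(encoding)
    if fallback_dir is None:
        raise FileNotFoundError(resource)
    with open(os.path.join(fallback_dir, resource), encoding=encoding) as f:
        return f.read()
```

Inside the spider, a call might look like read_package_text(__package__, "static/stealth.min.js", fallback_dir=os.path.dirname(__file__)), with the result passed to the CDP command.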

setup.py sets include_package_data=True so that the files matched by MANIFEST.in are packaged into the egg:

# xxx/setup.py
# Automatically created by: scrapyd-deploy

from setuptools import setup, find_packages

setup(
    name='project',
    version='1.0',
    packages=find_packages(),
    entry_points={'scrapy': ['settings = risk_control_info.settings']},
    include_package_data=True  # include the data files listed in MANIFEST.in
)
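
Because a missing data file only shows up at runtime on the scrapyd host, it can help to inspect the built egg before deploying. A minimal sketch, assuming the egg is an ordinary zip archive as produced by setuptools' bdist_egg; the function name egg_contains and the egg file name in the usage note are hypothetical.

```python
import zipfile


def egg_contains(egg_path, member):
    """Return True if the egg (a zip archive) contains the given member path."""
    with zipfile.ZipFile(egg_path) as egg:
        return member in egg.namelist()
```

For this project the check might be egg_contains("project-1.0.egg", "risk_control_info/spiders/static/stealth.min.js"); if it returns False, MANIFEST.in or include_package_data is the first place to look.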
