Three lines of code to connect Scrapy with Playwright, the new crawling tool!

Some time ago, we published an article about Playwright, an emerging browser automation tool similar to Selenium and Pyppeteer.

After the article came out, people began to try out the new tool.

Some friends, after trying it, said that Playwright is very easy to use and more powerful than Selenium or Pyppeteer.

Another friend said, “I already knew about Playwright, and it has been running stably in our production environment for a long time.”

“It would be nice to use Playwright from Scrapy, but we haven’t found a suitable package that connects the two.”

Scrapy with Playwright? It seems there is a need, and since I have been working with Scrapy, Selenium and Pyppeteer over the past few days, I might as well develop a package that connects Scrapy and Playwright.

I started development at 2 pm yesterday and released the first beta version around 6 pm.

Introduction

Here is the basic usage of this package.

The package, called GerapyPlaywright, has been published on GitHub (github.com/Gerapy/Gera…) and PyPI (pypi.org/project/ger…).

In short, the package is a convenient way to combine Scrapy and Playwright, so that JavaScript-rendered web pages can be crawled from Scrapy asynchronously and concurrently by multiple browsers.

It is also very simple to use. First, install it:

pip3 install gerapy-playwright

Then add the Downloader Middleware to the settings.py of your Scrapy project:

DOWNLOADER_MIDDLEWARES = {
    'gerapy_playwright.downloadermiddlewares.PlaywrightMiddleware': 543,
}

That is all the setup we need!

Next, if we want a URL to be crawled with Playwright, we can simply replace the original Request with a PlaywrightRequest, as follows:

yield PlaywrightRequest(url, callback=self.parse_detail)

Yes, it’s that simple.

In that case, the URL is crawled with Playwright, and the Response contains the HTML as rendered by the browser.

Configuration

Of course, the package can do more than that: it supports many configuration options.

If we want Playwright to crawl in headless mode, we can configure this in settings.py:

GERAPY_PLAYWRIGHT_HEADLESS = True

If you want to specify the default timeout, you can do this in settings.py:

GERAPY_PLAYWRIGHT_DOWNLOAD_TIMEOUT = 30

With this setting, a page times out if it has not loaded within 30 seconds.

Some websites also detect WebDriver. To hide the WebDriver features, enable the real-browser masquerade setting in settings.py:

GERAPY_PLAYWRIGHT_PRETEND = True

If you want to crawl through a proxy, you can configure a global proxy in settings.py:

GERAPY_PLAYWRIGHT_PROXY = 'http://tps254.kdlapi.com:15818'
GERAPY_PLAYWRIGHT_PROXY_CREDENTIAL = {
    'username': 'xxx',
    'password': 'xxxx'
}

If you want screenshots, you can enable them globally in settings.py:

GERAPY_PLAYWRIGHT_SCREENSHOT = {
    'type': 'png',
    'full_page': True
}

There are many other configuration options; see the instructions at github.com/Gerapy/Gera…
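Put together, the options above could sit in one settings.py like this. It is only a sketch: the values are the ones used in this article, and the proxy address and credentials are placeholders to replace with your own.

```python
# settings.py (sketch): the GerapyPlaywright options from this section in one place.

# Enable the downloader middleware.
DOWNLOADER_MIDDLEWARES = {
    'gerapy_playwright.downloadermiddlewares.PlaywrightMiddleware': 543,
}

# Run the browser without a visible window.
GERAPY_PLAYWRIGHT_HEADLESS = True

# Give up on a page that has not loaded after 30 seconds.
GERAPY_PLAYWRIGHT_DOWNLOAD_TIMEOUT = 30

# Hide WebDriver features from sites that check for them.
GERAPY_PLAYWRIGHT_PRETEND = True

# Global proxy; the address and credentials are placeholders.
GERAPY_PLAYWRIGHT_PROXY = 'http://tps254.kdlapi.com:15818'
GERAPY_PLAYWRIGHT_PROXY_CREDENTIAL = {
    'username': 'xxx',
    'password': 'xxxx',
}

# Take a full-page PNG screenshot of every page.
GERAPY_PLAYWRIGHT_SCREENSHOT = {
    'type': 'png',
    'full_page': True,
}
```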

PlaywrightRequest

The configuration described above is global to the project. We can also configure a single request through PlaywrightRequest; a parameter with the same meaning overrides the value specified in the project’s settings.py.

For example, if you specify a timeout of 30 seconds in PlaywrightRequest and 10 seconds in settings.py, the 30-second value takes precedence.
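The precedence rule can be pictured with a few lines of plain Python. This is only an illustration of the idea, not the package's actual code: resolve_option is a hypothetical helper, and treating None as "not set on the request" is an assumption based on the log later in this article, where unset request options appear as None.

```python
# Illustration only: "the request wins over settings.py".
# resolve_option is a hypothetical helper, not part of GerapyPlaywright.

def resolve_option(request_value, global_value):
    """Return the per-request value unless it was left unset (None)."""
    return global_value if request_value is None else request_value


settings_timeout = 10  # GERAPY_PLAYWRIGHT_DOWNLOAD_TIMEOUT in settings.py
request_timeout = 30   # timeout passed to one PlaywrightRequest

# The request specifies a timeout, so its value is used.
effective = resolve_option(request_timeout, settings_timeout)

# The request leaves the timeout unset, so the global value is used.
fallback = resolve_option(None, settings_timeout)

print(effective, fallback)
```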

So which parameters does PlaywrightRequest support?

For details, see README.md: github.com/Gerapy/Gera…

Here is a brief introduction:

url: Needs no explanation; it is the URL to crawl.
callback: The callback method. After the crawl, the Response is passed to the callback as an argument.
wait_until: The load event to wait for, such as domcontentloaded, which waits for the HTML document to finish loading before continuing.
wait_for: A selector to wait for. For example, pass .item to wait until .item has loaded on the page before continuing.
script: A JavaScript script to execute after the page has loaded.
actions: A Python method that handles the Playwright page object.
proxy: The proxy to use; overrides the global proxy setting GERAPY_PLAYWRIGHT_PROXY.
proxy_credential: The proxy username and password; overrides the global setting GERAPY_PLAYWRIGHT_PROXY_CREDENTIAL.
sleep: The time to wait after loading is complete. It can be used to force a wait.
timeout: The load timeout; overrides the global timeout setting GERAPY_PLAYWRIGHT_DOWNLOAD_TIMEOUT.
pretend: Whether to hide WebDriver features; overrides the global setting GERAPY_PLAYWRIGHT_PRETEND.
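As a rough sketch of how several of these parameters might be combined for one request: the parameter names come from the list above, while the script, the values, and the read_title function are examples of my own. The actions function assumes that the page object offers Playwright's async evaluate method.

```python
# Sketch: options for a single PlaywrightRequest, using the parameter names above.
# The script, the values, and read_title are illustrative assumptions.

async def read_title(page):
    """Example `actions` callable: receives the Playwright page object."""
    # page.evaluate runs a JavaScript expression inside the page.
    return await page.evaluate('document.title')


request_options = {
    'wait_until': 'domcontentloaded',  # load event to wait for
    'wait_for': '.item',               # selector that must appear
    'script': 'window.scrollTo(0, document.body.scrollHeight)',
    'actions': read_title,
    'sleep': 1,                        # forced wait after loading
    'timeout': 30,                     # overrides GERAPY_PLAYWRIGHT_DOWNLOAD_TIMEOUT
    'pretend': True,                   # overrides GERAPY_PLAYWRIGHT_PRETEND
}

# Inside a spider this could be passed on as:
#     yield PlaywrightRequest(url, callback=self.parse_detail, **request_options)
```

Keeping the options in a dict makes it easy to reuse the same set for several requests and to change one value for a single URL.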

Example

Here I have a website, antispider1.scrape.center, whose content is displayed only after JavaScript rendering. The site also detects WebDriver features, so under normal circumstances Selenium, Pyppeteer and Playwright may all be blocked, as shown in the following figure:

So, if we want to crawl the site with Scrapy, we can use GerapyPlaywright.

Create a new Scrapy project. The key code of the spider is as follows:

import logging

import scrapy
from gerapy_playwright import PlaywrightRequest

logger = logging.getLogger(__name__)


class MovieSpider(scrapy.Spider):
    name = 'movie'
    allowed_domains = ['antispider1.scrape.center']
    base_url = 'https://antispider1.scrape.center'
    max_page = 10
    # Hide WebDriver features so the site does not block the browser.
    custom_settings = {
        'GERAPY_PLAYWRIGHT_PRETEND': True
    }

    def start_requests(self):
        # Request every list page and wait until the .item nodes are rendered.
        for page in range(1, self.max_page + 1):
            url = f'{self.base_url}/page/{page}'
            logger.debug('start url %s', url)
            yield PlaywrightRequest(url, callback=self.parse_index, priority=10, wait_for='.item')

    def parse_index(self, response):
        # Follow the detail link of each item on the rendered list page.
        # parse_detail is not shown in this excerpt.
        items = response.css('.item')
        for item in items:
            href = item.css('a::attr(href)').extract_first()
            detail_url = response.urljoin(href)
            logger.info('detail url %s', detail_url)
            yield PlaywrightRequest(detail_url, callback=self.parse_detail, wait_for='.item')

As we can see, we have set GERAPY_PLAYWRIGHT_PRETEND to True, so that Playwright is not blocked when it starts. We also used PlaywrightRequest so that each URL is loaded with Playwright, and wait_for to specify the selector .item, which matches the key information to extract. Playwright waits for this node to load before returning. The Response object passed to the callback method parse_index then contains the rendered HTML, from which the contents of .item can be extracted.

In addition, you can specify the number of concurrent requests:

CONCURRENT_REQUESTS = 5

That way, five Playwright browsers can crawl the site simultaneously, which is very efficient.
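To get a feel for what a concurrency limit of 5 means, here is a small simulation using only the standard library. It is not how Scrapy or GerapyPlaywright schedule requests internally: asyncio.Semaphore stands in for the limit, and fake_render stands in for a browser loading a page. Both are assumptions made for illustration.

```python
import asyncio

CONCURRENT_REQUESTS = 5  # same value as the setting above


async def fake_render(page_number, limiter, stats):
    """Stand-in for one browser rendering one page (no real browser involved)."""
    async with limiter:
        stats['active'] += 1
        stats['peak'] = max(stats['peak'], stats['active'])
        await asyncio.sleep(0.01)  # pretend the page takes a moment to load
        stats['active'] -= 1
    return page_number


async def crawl(pages):
    # At most CONCURRENT_REQUESTS simulated pages are in flight at any moment.
    limiter = asyncio.Semaphore(CONCURRENT_REQUESTS)
    stats = {'active': 0, 'peak': 0}
    results = await asyncio.gather(*(fake_render(p, limiter, stats) for p in pages))
    return results, stats['peak']


results, peak = asyncio.run(crawl(range(1, 11)))
print(results, peak)
```

Ten simulated pages are requested, but no more than five are ever loading at the same time; the rest wait for a free slot.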

Running the spider produces output similar to the following:

2021-12-27 16:54:14 [scrapy.utils.log] INFO: Scrapy 2.2.0 started (bot: example)
2021-12-27 16:54:14 [scrapy.utils.log] INFO: Versions: lxml 4.7.1.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.7.9 (default, Aug 31 2020, 07:22:35) - [Clang 10.0.0 ], pyOpenSSL 21.0.0 (OpenSSL 1.1.1l  24 Aug 2021), cryptography 35.0.0, Platform Darwin-21.1.0-x86_64-i386-64bit
2021-12-27 16:54:14 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2021-12-27 16:54:14 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'example',
 'CONCURRENT_REQUESTS': 1,
 'NEWSPIDER_MODULE': 'example.spiders',
 'RETRY_HTTP_CODES': [403, 500, 502, 503, 504],
 'SPIDER_MODULES': ['example.spiders']}
2021-12-27 16:54:14 [scrapy.extensions.telnet] INFO: Telnet Password: e931b241390ad06a
2021-12-27 16:54:14 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2021-12-27 16:54:14 [gerapy.playwright] INFO: playwright libraries already installed
2021-12-27 16:54:14 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'gerapy_playwright.downloadermiddlewares.PlaywrightMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2021-12-27 16:54:14 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2021-12-27 16:54:14 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2021-12-27 16:54:14 [scrapy.core.engine] INFO: Spider opened
2021-12-27 16:54:14 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-12-27 16:54:14 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-12-27 16:54:14 [example.spiders.movie] DEBUG: start url https://antispider1.scrape.center/page/1
2021-12-27 16:54:14 [gerapy.playwright] DEBUG: processing request <GET https://antispider1.scrape.center/page/1>
2021-12-27 16:54:14 [gerapy.playwright] DEBUG: playwright_meta {'wait_until': 'domcontentloaded', 'wait_for': '.item', 'script': None, 'actions': None, 'sleep': None, 'proxy': None, 'proxy_credential': None, 'pretend': None, 'timeout': None, 'screenshot': None}
2021-12-27 16:54:14 [gerapy.playwright] DEBUG: set options {'headless': False}
cookies []
2021-12-27 16:54:16 [gerapy.playwright] DEBUG: PRETEND_SCRIPTS is run
2021-12-27 16:54:16 [gerapy.playwright] DEBUG: timeout 10
2021-12-27 16:54:16 [gerapy.playwright] DEBUG: crawling https://antispider1.scrape.center/page/1
2021-12-27 16:54:16 [gerapy.playwright] DEBUG: request https://antispider1.scrape.center/page/1 with options {'url': 'https://antispider1.scrape.center/page/1', 'wait_until': 'domcontentloaded'}
2021-12-27 16:54:18 [gerapy.playwright] DEBUG: waiting for .item
2021-12-27 16:54:18 [gerapy.playwright] DEBUG: sleep for 1s
2021-12-27 16:54:19 [gerapy.playwright] DEBUG: taking screenshot using args {'type': 'png', 'full_page': True}
2021-12-27 16:54:19 [gerapy.playwright] DEBUG: close playwright
2021-12-27 16:54:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://antispider1.scrape.center/page/1> (referer: None)
2021-12-27 16:54:20 [example.spiders.movie] DEBUG: start url https://antispider1.scrape.center/page/2
2021-12-27 16:54:20 [gerapy.playwright] DEBUG: processing request <GET https://antispider1.scrape.center/page/2>
2021-12-27 16:54:20 [gerapy.playwright] DEBUG: playwright_meta {'wait_until': 'domcontentloaded', 'wait_for': '.item', 'script': None, 'actions': None, 'sleep': None, 'proxy': None, 'proxy_credential': None, 'pretend': None, 'timeout': None, 'screenshot': None}
2021-12-27 16:54:20 [gerapy.playwright] DEBUG: set options {'headless': False}
2021-12-27 16:54:20 [example.spiders.movie] INFO: detail url https://antispider1.scrape.center/detail/1
2021-12-27 16:54:20 [example.spiders.movie] INFO: detail url https://antispider1.scrape.center/detail/2
2021-12-27 16:54:20 [example.spiders.movie] INFO: detail url https://antispider1.scrape.center/detail/3
2021-12-27 16:54:20 [example.spiders.movie] INFO: detail url https://antispider1.scrape.center/detail/4
2021-12-27 16:54:20 [example.spiders.movie] INFO: detail url https://antispider1.scrape.center/detail/5
2021-12-27 16:54:20 [example.spiders.movie] INFO: detail url https://antispider1.scrape.center/detail/6
2021-12-27 16:54:20 [example.spiders.movie] INFO: detail url https://antispider1.scrape.center/detail/7
2021-12-27 16:54:20 [example.spiders.movie] INFO: detail url https://antispider1.scrape.center/detail/8
2021-12-27 16:54:20 [example.spiders.movie] INFO: detail url https://antispider1.scrape.center/detail/9
2021-12-27 16:54:20 [example.spiders.movie] INFO: detail url https://antispider1.scrape.center/detail/10
cookies []
2021-12-27 16:54:21 [gerapy.playwright] DEBUG: PRETEND_SCRIPTS is run
2021-12-27 16:54:21 [gerapy.playwright] DEBUG: timeout 10
2021-12-27 16:54:21 [gerapy.playwright] DEBUG: crawling https://antispider1.scrape.center/page/2
2021-12-27 16:54:21 [gerapy.playwright] DEBUG: request https://antispider1.scrape.center/page/2 with options {'url': 'https://antispider1.scrape.center/page/2', 'wait_until': 'domcontentloaded'}
2021-12-27 16:54:23 [gerapy.playwright] DEBUG: waiting for .item
2021-12-27 16:54:24 [gerapy.playwright] DEBUG: sleep for 1s
2021-12-27 16:54:25 [gerapy.playwright] DEBUG: taking screenshot using args {'type': 'png', 'full_page': True}
2021-12-27 16:54:25 [gerapy.playwright] DEBUG: close playwright
2021-12-27 16:54:25 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://antispider1.scrape.center/page/2> (referer: None)
2021-12-27 16:54:25 [gerapy.playwright] DEBUG: processing request <GET https://antispider1.scrape.center/detail/10>
2021-12-27 16:54:25 [gerapy.playwright] DEBUG: playwright_meta {'wait_until': 'domcontentloaded', 'wait_for': '.item', 'script': None, 'actions': None, 'sleep': None, 'proxy': None, 'proxy_credential': None, 'pretend': None, 'timeout': None, 'screenshot': None}
2021-12-27 16:54:25 [gerapy.playwright] DEBUG: set options {'headless': False}
...

For the test code, you can refer to: github.com/Gerapy/Gera…

OK, that is the introduction to this package, which I wrote yesterday. Please try it out; comments and suggestions are welcome. Thank you!

For more content, please follow my public accounts “Attack Coder” and “Cui Qingcai | Jingmi”.
