Introduction
I’ve been playing with Python crawlers for a while, but I’m still at a beginner level. Xcrawler is a lightweight crawler framework I put together over a weekend, with some design ideas borrowed from the well-known crawler framework Scrapy. Why build another wheel when a mature framework like Scrapy already exists? Honestly, the main reason is to consolidate what I have learned about Python and to improve myself.
Haha, it doesn’t even reach 500 lines of code, and that is counting the comments and all sorts of blank lines. If you are into crawlers too, feel free to take a look; suggestions for improvement and corrections are very welcome, thanks!
Features
- Simple and easy to use;
- Easy customization of spiders;
- Multithreaded concurrent download.
To do
- More test code;
- Add more crawler examples;
- Improve crawler scheduling and support Request priority scheduling.
Xcrawler introduction
Project structure
.
├── demo
│   ├── baidu_news.py
│   └── __init__.py
├── README.md
├── setup.py
└── xcrawler
    ├── __init__.py
    ├── crawler.py      (crawler process)
    ├── engine.py       (crawler engine)
    ├── spider
    │   ├── __init__.py
    │   ├── request.py
    │   ├── response.py
    │   └── spiders.py
    └── utils
        ├── __init__.py
        └── url.py
Crawler Engine (Producer + Consumer model)
- When the engine starts, a background thread pool is started. The background thread pool downloads all urls (requests) provided to it by the scheduler and stores the Response results in a queue.
- The front-end parsing thread of the engine continually consumes the responses in the processing queue and calls the parsing function of the corresponding Spider to process the responses.
- The engine is responsible for handling the objects returned by the parsing functions: every Request object is put back into the queue to be processed, and every dict item is handed to the Spider’s process_item method (a minimal sketch of this flow follows the list).
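To make the producer/consumer flow above concrete, here is a minimal, self-contained sketch in plain Python. It is only an illustration of the idea, not xcrawler’s actual internals: the names fetch, download_all and parse_loop, and the way the settings are read, are assumptions made for this example.

```python
import queue
import threading
import time
from concurrent.futures import ThreadPoolExecutor

settings = {'concurrent_requests': 4, 'queue_size': 16, 'download_delay': 0}

# Downloaded results waiting to be parsed; the bounded queue provides backpressure.
responses = queue.Queue(maxsize=settings['queue_size'])


def fetch(url):
    """Stand-in for the real HTTP download (runs in the background thread pool)."""
    time.sleep(0.1)  # simulate network latency
    return url, '<html>fake page for %s</html>' % url


def download_all(urls):
    """Producer: the background pool downloads every request and queues each result."""
    with ThreadPoolExecutor(max_workers=settings['concurrent_requests']) as pool:
        for url in urls:
            pool.submit(lambda u=url: responses.put(fetch(u)))
            time.sleep(settings['download_delay'])
    responses.put(None)  # sentinel: no more responses will arrive


def parse_loop(parse):
    """Consumer: the front-end thread hands every response to the spider's parse callback."""
    while True:
        result = responses.get()
        if result is None:
            break
        parse(*result)


if __name__ == '__main__':
    seed_urls = ['http://example.com/page/%d' % i for i in range(8)]
    producer = threading.Thread(target=download_all, args=(seed_urls,))
    producer.start()
    parse_loop(lambda url, body: print('parsed', url, len(body), 'bytes'))
    producer.join()
```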
Configuration
Configuration items:

- download_delay: download delay (in seconds) between batches; default is 0;
- download_timeout: download timeout; default is 6 seconds;
- retry_on_timeout: whether a request should be retried after its download times out;
- concurrent_requests: number of concurrent downloads;
- queue_size: size of the request queue; when the queue is full, subsequent requests are blocked.
Example configuration:
settings = {
    'download_delay': 0,
    'download_timeout': 6,
    'retry_on_timeout': True,
    'concurrent_requests': 32,
    'queue_size': 512
}
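These settings are passed to the engine as the first argument of CrawlerProcess, e.g. crawler = CrawlerProcess(settings, 'DEBUG'), as the full example at the end of this post shows.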
Key methods of the Spider base class
- spider_started: invoked when the engine starts. You can override this method to do some initialization, such as opening the pipeline output file or the database connection;
- spider_idle: invoked when the engine is idle (i.e. there are no requests left in the queue). You can override this method to feed new requests to the engine, e.g. self.crawler.crawl(new_request, spider=self);
- spider_stopped: invoked when the engine shuts down. You can override this method to do some cleanup before the Spider finishes, such as closing the file pipeline or the database connection;
- start_requests: provides the engine with the Spider’s seed requests;
- make_requests_from_url: creates a Request object from a URL;
- parse: the default callback used to parse a Request’s response; you can specify a different callback when creating the Request;
- process_request: invoked each time the engine is about to handle one of the Spider’s requests. You can use this method to modify the request, for example to swap in a random User-Agent, cookies or a proxy; you can also set the request to None to ignore it (see the sketch after this list);
- process_response: invoked every time the engine processes a response for the Spider;
- process_item: invoked whenever the engine handles an item produced by the Spider. You can override this method to store the extracted items in a database or a local file.
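As an illustration of the hooks above, the sketch below overrides process_request to rotate User-Agents. Treat it as a hedged sketch: it assumes process_request receives a single request argument, that the Request object exposes the headers it was constructed with, and that returning the modified request keeps it alive while returning None drops it; the User-Agent list itself is made up for the example.

```python
import random

from xcrawler.spider import BaseSpider

# A couple of desktop User-Agent strings, purely for illustration.
USER_AGENTS = [
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) '
    'Chrome/50.0.2661.102 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
    'Chrome/51.0.2704.103 Safari/537.36',
]


class RandomUASpider(BaseSpider):
    name = 'random_ua_spider'
    start_urls = ['http://news.baidu.com/']

    def process_request(self, request):
        # Assumption: the Request object exposes the headers it was built with.
        headers = dict(request.headers or {})
        headers['User-Agent'] = random.choice(USER_AGENTS)
        request.headers = headers
        return request  # returning None instead would make the engine ignore the request

    def parse(self, response):
        # Minimal parse callback; see the full Baidu News example below for real parsing.
        print(response.base_url)
```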
Notes
- You can load different Spider classes in a Crawler process, but you need to make sure that each Spider has a different name or it will be rejected by the engine.
- Adjust the download delay and concurrency as required. The download delay should not be too large, otherwise each batch of requests may wait for a long time to be processed, thus affecting the crawler performance.
- I have not tested on Windows yet (I work on Ubuntu), so if you run into any problems, feedback is welcome!
Installation
- Please download the source code from the xcrawler project home page (https://github.com/chrisleegit/xcrawler);
- Please make sure your environment is Python 3.4+;
- Install with python3 setup.py install.
Example
from xcrawler import CrawlerProcess
from xcrawler.spider import BaseSpider, Request
from lxml.html import fromstring
import json

__version__ = '0.0.1'
__author__ = 'Chris'


class BaiduNewsSpider(BaseSpider):
    name = 'baidu_news_spider'
    start_urls = ['http://news.baidu.com/']
    default_headers = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/50.0.2661.102 Safari/537.36'
    }

    def spider_started(self):
        self.file = open('items.jl', 'w')

    def spider_stopped(self):
        self.file.close()

    def spider_idle(self):
        # When the engine is idle, you could also pull new URLs from a database here
        print('I am in idle mode')
        # self.crawler.crawl(new_request, spider=self)

    def make_requests_from_url(self, url):
        return Request(url, headers=self.default_headers)

    def parse(self, response):
        root = fromstring(response.content, base_url=response.base_url)
        for element in root.xpath('//a[@target="_blank"]'):
            title = self._extract_first(element, 'text()')
            link = self._extract_first(element, '@href').strip()
            if title:
                if link.startswith('http://') or link.startswith('https://'):
                    yield {'title': title, 'link': link}
                    yield Request(link, headers=self.default_headers, callback=self.parse_news,
                                  meta={'title': title})

    def parse_news(self, response):
        pass

    def process_item(self, item):
        print(item)
        print(json.dumps(item, ensure_ascii=False), file=self.file)

    @staticmethod
    def _extract_first(element, exp, default=''):
        r = element.xpath(exp)
        if len(r):
            return r[0]
        return default


def main():
    settings = {
        'download_delay': 1,
        'download_timeout': 6,
        'retry_on_timeout': True,
        'concurrent_requests': 16,
        'queue_size': 512
    }
    crawler = CrawlerProcess(settings, 'DEBUG')
    crawler.crawl(BaiduNewsSpider)
    crawler.start()


if __name__ == '__main__':
    main()
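With xcrawler installed, running this script should crawl the Baidu News front page, print each extracted item, and write it as one JSON object per line to items.jl in the working directory; the exact items naturally depend on the page content at the time of the run.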
Copyright statement
This article was published by Christopher L and is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. Please make sure you understand the license and credit the source when reprinting. Permalink: http://blog.chriscabin.com/?p=1512.