Introduction

I’ve been playing with Python crawlers for a while, but I’m still at a beginner level. Xcrawler is a lightweight crawler framework I built over a couple of weekends, with some design ideas borrowed from the well-known crawler framework Scrapy. Why reinvent the wheel when a mature framework like Scrapy already exists? Honestly, the main reason is to consolidate what I have learned about Python and to improve myself.

Haha, it’s not really 500 lines of code, since comments and all sorts of blank lines are counted in. If you like crawlers too, feel free to take a look; suggestions for improvement and bug reports are very welcome. Thank you!

Features

  1. Simple and easy to use;
  2. Easy customization of spiders;
  3. Multithreaded concurrent download.

To do

  1. More test code;
  2. Add more web crawler examples;
  3. Improve crawler scheduling and support Request priority scheduling.

Xcrawler introduction

Project structure

├── demo
│   ├── baidu_news.py
│   └── __init__.py
├── README.md
├── setup.py
└── xcrawler
    ├── core
    │   ├── crawler.py    (crawler process)
    │   ├── engine.py     (crawler engine)
    │   └── __init__.py
    ├── __init__.py
    ├── spider
    │   ├── __init__.py
    │   ├── request.py
    │   ├── response.py
    │   └── spiders.py
    └── utils
        ├── __init__.py
        └── url.py

Crawler Engine (Producer + Consumer model)

  1. When the engine starts, it spins up a background thread pool. The pool downloads every URL (Request) handed to it by the scheduler and stores the resulting Response objects in a queue.
  2. The engine’s front-end parsing thread continually consumes Responses from that queue and calls the corresponding Spider’s parsing function to process them.
  3. The engine is responsible for handling the objects returned by the parsing functions: every Request object is put back into the queue for further processing, and every dictionary item is passed to the Spider’s process_item method. (A minimal sketch of this model follows this list.)
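To make the producer + consumer split above more concrete, here is a minimal, self-contained sketch of the same idea using only the Python standard library. It is illustrative only and is not xcrawler’s actual code: a thread pool plays the producer role, downloading URLs and pushing responses onto a bounded queue, while the main thread plays the consumer role and parses them.

import queue
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def download(url, responses):
    """Producer: fetch one URL and enqueue (url, body) for the parser."""
    try:
        with urlopen(url, timeout=6) as resp:
            responses.put((url, resp.read()))
    except OSError:
        responses.put((url, None))  # a real engine could retry on timeout here


def run(urls, parse, concurrent_requests=16, queue_size=512):
    responses = queue.Queue(maxsize=queue_size)  # blocks producers when full

    with ThreadPoolExecutor(max_workers=concurrent_requests) as pool:
        for url in urls:
            pool.submit(download, url, responses)

        # Consumer: parse responses as they arrive, one per submitted URL.
        for _ in urls:
            url, body = responses.get()
            if body is not None:
                parse(url, body)


if __name__ == '__main__':
    run(['http://news.baidu.com/'], lambda url, body: print(url, len(body)))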

Configuration

  • Configuration items

    1. download_delay: download delay (in seconds) between batches. Default is 0.
    2. download_timeout: download timeout (in seconds). Default is 6.
    3. retry_on_timeout: whether a request should be retried after its download times out.
    4. concurrent_requests: number of concurrent downloads.
    5. queue_size: request queue size; when the queue is full, subsequent requests are blocked.
  • Example configuration:

    settings = {
        'download_delay': 0,
        'download_timeout': 6,
        'retry_on_timeout': True,
        'concurrent_requests': 32,
        'queue_size': 512
    }

Key methods of the Spider base class

  1. spider_started: invoked when the engine starts. You can override this method to do initialization work, such as opening pipeline output files or setting up database connections;
  2. spider_idle: invoked when the engine is idle (i.e. there are no requests left in the queue). You can override this method to feed new requests to the engine (e.g. self.crawler.crawl(new_request, spider=self));
  3. spider_stopped: invoked when the engine shuts down. You can override this method to do cleanup before the Spider finishes, such as closing file pipelines or database connections;
  4. start_requests: provides the engine with the Spider’s seed requests;
  5. make_requests_from_url: creates a Request object from a URL;
  6. parse: the default parsing callback of a Request; you can specify a different callback when creating a Request;
  7. process_request: invoked for every request the engine processes for a Spider. You can use this method to modify the request, for example by swapping in a random User-Agent, cookies or a proxy; returning None tells the engine to ignore the request (see the sketch after this list);
  8. process_response: invoked for every response the engine processes for a Spider;
  9. process_item: invoked for every item the engine processes for a Spider. You can override this method to store crawled items in a database or a local file.
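As an illustration of process_request (item 7), here is a short sketch of a Spider that swaps in a random User-Agent and drops unwanted requests. It is based only on the description above and the demo further down, not on the project source: the exact process_request signature and the request.headers / request.url attributes are assumptions, and the USER_AGENTS list and the 'ads.' filter are made up for the example.

import random

from xcrawler.spider import BaseSpider

# Hypothetical pool of User-Agent strings to rotate through.
USER_AGENTS = [
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36',
]


class RandomUserAgentSpider(BaseSpider):
    name = 'random_ua_spider'
    start_urls = ['http://news.baidu.com/']

    def process_request(self, request):
        # Swap in a random User-Agent before the request is downloaded
        # (assumes Request exposes a dict-like `headers` attribute).
        request.headers['User-Agent'] = random.choice(USER_AGENTS)
        # Returning None asks the engine to ignore (skip) this request.
        if 'ads.' in request.url:
            return None
        return request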

Notes

  1. You can load multiple Spider classes into one Crawler process, but you need to make sure every Spider has a distinct name, otherwise it will be rejected by the engine (see the sketch below);
  2. Adjust the download delay and concurrency to your needs. The download delay should not be too large, otherwise each batch of requests may wait a long time before being processed, which hurts crawler performance;
  3. I have not tested on Windows yet (I use Ubuntu), so if you run into any problems, feedback is welcome!
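To illustrate note 1, the sketch below loads two minimal Spiders with distinct names into one CrawlerProcess. The two spider classes are invented for this example; the API usage follows the demo shown in the example section below.

from xcrawler import CrawlerProcess
from xcrawler.spider import BaseSpider


class NewsSpider(BaseSpider):
    name = 'news_spider'              # names must be unique per process
    start_urls = ['http://news.baidu.com/']

    def parse(self, response):
        yield {'spider': self.name, 'url': response.base_url}


class TiebaSpider(BaseSpider):
    name = 'tieba_spider'             # a second, differently named spider
    start_urls = ['http://tieba.baidu.com/']

    def parse(self, response):
        yield {'spider': self.name, 'url': response.base_url}


if __name__ == '__main__':
    settings = {'download_delay': 1, 'download_timeout': 6, 'retry_on_timeout': True,
                'concurrent_requests': 16, 'queue_size': 512}
    crawler = CrawlerProcess(settings, 'DEBUG')
    crawler.crawl(NewsSpider)
    crawler.crawl(TiebaSpider)        # both spiders run in the same process
    crawler.start()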

Installation

  1. Visit the xcrawler project home page (https://github.com/chrisleegit/xcrawler) and download the source code;
  2. Make sure your environment is Python 3.4+;
  3. Install with python3 setup.py install.

Example

from xcrawler import CrawlerProcess
from xcrawler.spider import BaseSpider, Request
from lxml.html import fromstring
import json

__version__ = '0.0.1'
__author__ = 'Chris'


class BaiduNewsSpider(BaseSpider):
   name = 'baidu_news_spider'
   start_urls = ['http://news.baidu.com/']
   default_headers = {
       'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) '
                     'Chrome/50.0.2661.102 Safari/537.36'
   }

   def spider_started(self):
       self.file = open('items.jl', 'w')

   def spider_stopped(self):
       self.file.close()

   def spider_idle(self):
       # When the engine is idle, you could also pull new URLs from a database here
       print('I am in idle mode')
       # self.crawler.crawl(new_request, spider=self)

   def make_requests_from_url(self, url):
       return Request(url, headers=self.default_headers)

   def parse(self, response):
       root = fromstring(response.content, base_url=response.base_url)
       for element in root.xpath('//a[@target="_blank"]'):
           title = self._extract_first(element, 'text()')
           link = self._extract_first(element, '@href').strip()
           if title:
               if link.startswith('http://') or link.startswith('https://'):
                   yield {'title': title, 'link': link}
                   yield Request(link, headers=self.default_headers, callback=self.parse_news,
                                 meta={'title': title})

   def parse_news(self, response):
       pass

   def process_item(self, item):
       print(item)
       print(json.dumps(item, ensure_ascii=False), file=self.file)

   @staticmethod
   def _extract_first(element, exp, default=' '):
       r = element.xpath(exp)
       if len(r):
           return r[0]

       return default


def main():
   settings = {
       'download_delay': 1, 'download_timeout': 6, 'retry_on_timeout': True,
       'concurrent_requests': 16, 'queue_size': 512
   }
   crawler = CrawlerProcess(settings, 'DEBUG')
   crawler.crawl(BaiduNewsSpider)
   crawler.start()


if __name__ == '__main__':
   main()




Work log

[Figure: work log chart]

Here is a sample of what the crawler grabbed:

[Figure: screenshot of crawled items]

Copyright statement

This article was published by Christopher L and is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. Please make sure you understand the license and state it when reprinting. Permalink: http://blog.chriscabin.com/?p=1512.