
Preface

After learning the various basics of crawling, we sometimes need to build a crawler program quickly. Is there a convenient tool or framework that lets us do that? There is, and it is thriving: Scrapy.

What is Scrapy

Scrapy is an application framework written in pure Python for crawling websites and extracting structured data. The framework does much of the repetitive work for you: you can write a few simple modules on top of it, or extend some of its modules, to get your own customized functionality. The downside, of course, is that you first have to learn and understand the framework, and it is hard to break through the framework's own limitations.

Scrapy is built on top of Twisted, an asynchronous networking framework.

Scrapy Engine: responsible for the communication, signals, and data transfer among the Spider, Item Pipeline, Downloader, and Scheduler.

Scheduler: receives Requests from the Engine, arranges them into a queue, and hands them back to the Engine when it asks for them.

Downloader: downloads all Requests sent by the Scrapy Engine and returns the Responses to the Engine, which passes them to the Spider for handling.

Spider: processes all Responses, analyzing and extracting the data needed to fill the Item fields, and submits any follow-up URLs to the Engine so that they re-enter the Scheduler.

Item Pipeline: the place where Items produced by the Spider are post-processed (detailed analysis, filtering, storage, and so on).

Downloader Middlewares: components you can customize to extend the download functionality (see the sketch after this list).

Spider Middlewares: components that extend and intercept the communication between the Engine and the Spider (that is, Responses going into the Spider and Requests coming out of it).
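As a concrete illustration of the downloader middleware hook, here is a minimal sketch that injects a User-Agent header into every outgoing request. The class name and header value are made up for this example; only the process_request hook and the DOWNLOADER_MIDDLEWARES setting are actual Scrapy conventions:

   # middlewares.py -- a minimal downloader middleware sketch (hypothetical class)
   class CustomUserAgentMiddleware:
       def process_request(self, request, spider):
           # Set a User-Agent on every outgoing request before it is downloaded.
           request.headers['User-Agent'] = 'Mozilla/5.0 (scrapy-demo)'
           # Returning None tells Scrapy to continue handling the request normally.
           return None

To enable such a middleware, register the class under DOWNLOADER_MIDDLEWARES in settings.py together with a priority number.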

How to install Scrapy

The environment used below is Ubuntu 17.04.

1. Installation

Install Scrapy in Ubuntu:

   sudo apt-get install python-dev python-pip libxml2-dev libxslt1-dev

   sudo pip install scrapy
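If the installation succeeded, the scrapy command should now be on your PATH; you can check it with:

   scrapy version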

2. Four steps to making a crawler

1) Create a new crawler project. A project may contain many crawlers.

scrapy startproject tencentSpider

View the project structure:

tarena@tedu:~/spiders/tencentSpider$ tree .
.
├── scrapy.cfg
└── tencentSpider
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        └── __init__.py

2 directories, 7 files

2) Identify the target: decide what you want to scrape, then generate a specific crawler.


cd tencentSpider

scrapy genspider tencent hr.tencent.com

tarena@tedu:~/Spider/tencentSpider$ tree
.
├── scrapy.cfg
├── tecentLog.txt
└── tencentSpider
    ├── __init__.py
    ├── __init__.pyc
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    ├── settings.pyc
    └── spiders
        ├── __init__.py
        ├── __init__.pyc
        └── tencent.py

2 directories, 12 files
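The generated spiders/tencent.py is an empty skeleton; depending on the Scrapy version, it looks roughly like this:

   import scrapy

   class TencentSpider(scrapy.Spider):
       name = 'tencent'
       allowed_domains = ['hr.tencent.com']
       start_urls = ['http://hr.tencent.com/']

       def parse(self, response):
           # The extraction logic goes here.
           pass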

3) Make the crawler: modify the generated code in detail to implement our own crawler logic:

Modify settings.py: the project settings

spiders/tencent.py: the logic for grabbing page information and following links onward

items.py: the mapping of the items to be saved (see the sketch after this list)

spiders/spidername.py: where the spider starts crawling the pages
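As a minimal sketch of what these edits might look like, suppose we want to collect job titles and links from the listing pages. The field names (position, link) and all the CSS selectors below are assumptions for illustration only; the real page markup has to be inspected first:

   # items.py -- the mapping of the data we want to save
   import scrapy

   class TencentItem(scrapy.Item):
       position = scrapy.Field()  # job title (assumed field)
       link = scrapy.Field()      # detail-page URL (assumed field)

   # spiders/tencent.py -- extract items and follow the next page
   import scrapy
   from tencentSpider.items import TencentItem

   class TencentSpider(scrapy.Spider):
       name = 'tencent'
       allowed_domains = ['hr.tencent.com']
       start_urls = ['http://hr.tencent.com/']

       def parse(self, response):
           # Placeholder selectors; adapt them to the actual page structure.
           for row in response.css('tr.job'):
               item = TencentItem()
               item['position'] = row.css('a::text').extract_first()
               item['link'] = response.urljoin(row.css('a::attr(href)').extract_first())
               yield item  # handed to the Engine, then to the Item Pipeline
           # Submit the follow-up URL to the Engine; it re-enters the Scheduler.
           next_page = response.css('a.next::attr(href)').extract_first()
           if next_page:
               yield scrapy.Request(response.urljoin(next_page), callback=self.parse)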

4) Run the crawler under Scrapy:

scrapy crawl tencent
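By default the extracted items only show up in the log. Scrapy's built-in feed exports can write them to a file instead; the file name here is just an example:

   scrapy crawl tencent -o tencent.json

The -o option infers the output format from the file extension, so .csv and .xml work the same way.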