Author: xiaoyu

Python Data Science

Zhihu: zhuanlan.zhihu.com/pypcfx


Learning Scrapy is an essential part of any crawler developer's journey. Many of you are probably working with Scrapy right now, so great, let's learn it together. If you are new to Scrapy, it can be a bit confusing: it is a full framework, and it is hard to know where to start. With this series on learning Scrapy, you will find out how to get started quickly and put Scrapy to use.

At the end, I recommend a book for learning Scrapy and explain how to get it.

Why use a crawler framework?

If you already know the basics of writing a crawler, it is time to look at crawler frameworks. So why use one?

  • Learning a framework is about learning a programming idea, not just how to use a tool. Going from understanding a framework to mastering it is really the process of absorbing a way of thinking.

  • Frameworks also bring great convenience to development. Much of the routine plumbing is already written, so we do not need to reinvent the wheel; we only need to implement the features our own project requires, which greatly reduces the workload.

  • Reading and learning from the code of an excellent framework improves your own programming ability.

I learned crawler frameworks with these points in mind. Remember that the core goal is to grasp the framework's ideas and capabilities; once you understand the thinking behind a framework, you can use it better and even extend it.

Introduction to the Scrapy framework

The most popular crawler frameworks are Scrapy and PySpider, and of the two I think Scrapy is the more widely used. Scrapy is an open-source, high-level crawler framework written in Python. It is used to crawl web pages and extract structured data, and the results can be fed directly into data analysis and data mining. Scrapy has the following features:

  • Scrapy is built on twisted, an event-driven networking engine, so its requests are non-blocking and asynchronous. Compared with traditional blocking requests, this greatly improves CPU utilization and crawl efficiency.
  • Configuration is simple: complex behavior can often be enabled with a single line in the settings.
  • It is extensible and has a rich plugin ecosystem, for example scrapy + redis for distributed crawling and various crawler visualization plugins.
  • Parsing is easy to use: Scrapy wraps parsers such as XPath and offers a more convenient, higher-level selector interface that copes well with broken HTML and messy encodings (a short selector sketch follows this list).
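
Here is a minimal sketch of that selector interface; it assumes only that Scrapy is installed (pip install scrapy), and the broken HTML snippet is made up for illustration.

```python
# A minimal sketch of Scrapy's Selector on slightly broken HTML.
# The snippet below is invented purely for this example.
from scrapy.selector import Selector

broken_html = "<div><p>First item<p>Second item</div>"  # unclosed <p> tags

sel = Selector(text=broken_html)
# XPath and CSS selectors both work on the repaired document tree.
print(sel.xpath("//p/text()").getall())  # typically ['First item', 'Second item']
print(sel.css("p::text").getall())       # same result via CSS selectors
```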

Which is better: Scrapy or requests + BeautifulSoup?

Some of you may ask: why Scrapy at all? Wouldn't requests + BeautifulSoup do the job?

Don't worry, just use whatever is convenient for you. requests + BeautifulSoup certainly works; in fact requests plus any parser works, and they are all excellent combinations. The advantage is flexibility: we write our own code and are not locked into a fixed pattern. A fixed framework can sometimes be inconvenient; for example, Scrapy's anti-crawling handling is not perfect, and you often still have to deal with that yourself.
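
For comparison, a minimal requests + BeautifulSoup sketch might look like the following; the target is quotes.toscrape.com, the demo site from Scrapy's own tutorial, and the selectors match its markup.

```python
# A minimal requests + BeautifulSoup sketch: fetching, parsing, retrying,
# and any concurrency would all have to be written by hand.
import requests
from bs4 import BeautifulSoup

resp = requests.get("http://quotes.toscrape.com/", timeout=10)
soup = BeautifulSoup(resp.text, "html.parser")

# Print each quote and its author from the first page.
for quote in soup.select("div.quote"):
    print(quote.select_one("span.text").get_text(),
          "-", quote.select_one("small.author").get_text())
```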

However, for small and medium-sized crawling tasks, Scrapy is a great choice: it spares us from writing repetitive code and offers excellent performance. When we roll our own code, for example to improve crawl efficiency, we end up writing multi-threaded or asynchronous code every single time, which wastes a lot of development time. It is better to use a framework where that work is already done, so that we only have to write the parsing rules and pipelines. So what exactly do we need to write? Just look at the figure below.
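
As a sketch of what "just write the parsing rules" looks like, here is a minimal spider modeled on the official tutorial; the site and selectors come from quotes.toscrape.com, and the item fields are chosen for illustration.

```python
# A minimal Scrapy spider sketch: only the parsing rule is written by hand;
# scheduling, downloading, and concurrency are handled by the framework.
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        # Parsing rule: yield one item dict per quote block on the page.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow pagination; Scrapy schedules the new request asynchronously.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```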

Therefore, which one to use comes down to personal needs and preference. For learning, I suggest starting with requests + BeautifulSoup and moving on to Scrapy afterwards.

Scrapy architecture

Before learning how to use Scrapy, we need to understand its architecture; understanding the architecture is essential to learning Scrapy.

The following description is taken from the official documentation (linked in the references) and is clear enough; read it alongside the architecture figure.

Components

The Scrapy Engine is responsible for controlling the flow of data across all components of the system and firing events when corresponding actions occur. See the Data Flow section below for details.

The Scheduler accepts requests from the engine and enqueues them so that they can be returned later when the engine asks for them.

The Downloader fetches the page data and feeds it to the engine, which then passes it on to the spider.

Spiders are classes written by Scrapy users to analyze responses and extract items (that is, the scraped data) or additional URLs to follow. Each spider is responsible for a specific site (or group of sites).

Item Pipeline The Item Pipeline processes the items extracted by spiders. Typical tasks are cleaning, validation, and persistence (such as storing the item in a database).
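
A pipeline sketch might look like the following; this is hypothetical project code, not part of Scrapy itself, and it assumes items are dicts with an author field, as in the spider sketch above. To enable it you would list the class in the ITEM_PIPELINES setting.

```python
# A minimal Item Pipeline sketch: validate one field and clean another,
# dropping items that fail validation.
from scrapy.exceptions import DropItem


class CleanAuthorPipeline:
    def process_item(self, item, spider):
        author = item.get("author")
        if not author:
            # Validation step: discard items without an author.
            raise DropItem("missing author")
        # Cleaning step: normalize whitespace in the author name.
        item["author"] = " ".join(author.split())
        return item
```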

Downloader middlewares are specific hooks between the engine and the Downloader that process requests passed from the engine to the downloader and responses passed from the downloader back to the engine. They provide an easy mechanism for extending Scrapy's functionality by inserting custom code.
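
A downloader middleware sketch might look like this; the class name and header value are hypothetical, and enabling it would require adding the class path to DOWNLOADER_MIDDLEWARES in settings.py.

```python
# A minimal downloader middleware sketch: set a custom User-Agent on every
# outgoing request and log the status of every incoming response.
class CustomUserAgentMiddleware:
    def process_request(self, request, spider):
        # Called for each request passing from the engine to the downloader.
        request.headers["User-Agent"] = "my-crawler/0.1"
        return None  # continue normal processing

    def process_response(self, request, response, spider):
        # Called for each response passing from the downloader to the engine.
        spider.logger.debug("Got %s for %s", response.status, request.url)
        return response
```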

Spider middlewares are specific hooks between the engine and the Spiders that process the spiders' input (responses) and output (items and requests). They provide an easy mechanism for extending Scrapy's functionality by inserting custom code.
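
A spider middleware sketch might look like this; again the class is hypothetical, it assumes dict items as in the spider sketch above, and it would be enabled through the SPIDER_MIDDLEWARES setting.

```python
# A minimal spider middleware sketch: filter out items with an empty "text"
# field from the spider's output before they reach the engine.
class DropEmptyTextMiddleware:
    def process_spider_output(self, response, result, spider):
        # result is an iterable of items and requests produced by the spider.
        for element in result:
            if isinstance(element, dict) and not element.get("text"):
                continue  # skip empty items; requests pass through untouched
            yield element
```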

Data flow process

  1. The engine opens a website (open a domain), finds the Spider that handles that site, and asks the spider for the first URL(s) to crawl.
  2. The engine gets the first URL(s) to crawl from the Spider and schedules them as Requests with the Scheduler.
  3. The engine asks the Scheduler for the next URL to crawl.
  4. The Scheduler returns the next URL to crawl to the engine, and the engine sends it to the Downloader through the downloader middleware (request direction).
  5. Once the page has been downloaded, the Downloader generates a Response for it and sends it back to the engine through the downloader middleware (response direction).
  6. The engine receives the Response from the Downloader and sends it through the spider middleware (input direction) to the Spider for processing.
  7. The Spider processes the Response and returns the scraped Items and (follow-up) new Requests to the engine.
  8. The engine passes the scraped Items to the Item Pipeline and the Requests to the Scheduler.
  9. Steps 2 to 8 repeat until there are no more Requests in the Scheduler, at which point the engine closes the site.
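
To see this cycle start and shut down from code, here is a sketch of running a spider in-process with CrawlerProcess; it assumes the hypothetical QuotesSpider from the earlier sketch is defined in the same file and is roughly equivalent to running `scrapy crawl quotes` inside a project.

```python
# A sketch of driving the whole data flow from code: the engine opens the
# start URLs, schedules requests, downloads pages, and shuts down when the
# scheduler runs out of requests.
# Assumes the QuotesSpider class from the earlier sketch is defined here.
from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={"LOG_LEVEL": "INFO"})
process.crawl(QuotesSpider)   # register the spider with the engine
process.start()               # blocks until there are no more requests
```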

Scrapy learning reference

Here are two references for learning Scrapy.

  • Needless to say, scrapy’s official documentation is excellent and detailed. Link: https://doc.scrapy.org/en/latest/index.html
  • The second one is a book about scrapy, Learning Scrapy.

References:
  • https://doc.scrapy.org/en/latest/index.html
  • https://www.cnblogs.com/x-pyue/p/7795315.html


If you want to learn Python for data science, follow the official WeChat account Python Data Science. The blogger keeps it updated with quality content and more hands-on tutorials to bring you into the world of data.