Why are Python crawlers so popular?

If you look closely, you will notice that more and more people know about crawlers and are learning to write them. On the one hand, more and more data is available on the Internet. On the other hand, programming languages like Python provide more and more excellent tools that make crawlers simple and easy to use.

With crawlers, we can collect large amounts of valuable data and extract information that intuition alone cannot provide, such as:

Zhihu: crawl high-quality answers and filter out the best content under each topic.
Taobao and Jingdong: capture product, review, and sales data, and analyze the consumption scenarios of various products and users.
Anjuke and Lianjia: capture real estate sales and rental listings, and analyze housing price trends across different regions.
Lagou and Zhaopin: crawl all kinds of job postings, and analyze talent demand and salary levels across industries.
Xueqiu: capture the behavior of high-return Xueqiu users, and analyze and forecast the stock market.

Crawlers are one of the best ways to get started with Python. Python has many application areas, such as back-end development, web development, and scientific computing, but crawlers are friendlier to beginners: the principle is simple, a few lines of code are enough for a basic crawler, the learning curve is smoother, and you get a greater sense of accomplishment.

Once you’ve mastered the basics of crawlers, you’ll be much more comfortable moving on to Python data analysis, web development, and even machine learning, because along the way you become familiar with basic Python syntax, the use of libraries, and how to look things up in documentation.

To a beginner, crawling may look complicated, with a high technical threshold. Some people believe that learning crawlers requires mastering Python first, so they grind systematically through every Python topic, only to find after a long time that they still cannot crawl any data. Others believe they must first understand web pages, so they start on HTML and CSS, fall into the front-end pit, and burn out……

In fact, with the right approach, being able to crawl data from major websites within a short time is quite achievable. It is recommended, though, that you have a specific goal from the start.

When you’re goal-driven, your learning will be more focused and efficient. All the prerequisite knowledge you think you need can be learned along the way. Here’s a smooth quick-start learning path that assumes no prior background.

1. Learn Python packages and implement the basic crawler workflow
2. Understand the storage of unstructured data
3. Learn Scrapy and build engineered crawlers
4. Learn database basics to cope with large-scale data storage and extraction
5. Master various skills to cope with the anti-crawling measures of specific websites
6. Build distributed crawlers for large-scale concurrent collection and improved efficiency

– ❶ –

Learn Python packages and implement the basic crawler workflow

Most crawlers follow the same process: send a request, get the page, parse the page, then extract and store the content. This simply imitates how a browser retrieves a web page.

There are many crawler-related packages in Python: urllib, Requests, bs4, Scrapy, PySpider, and so on. It is recommended to start with Requests + XPath: Requests connects to the website and returns the page, and XPath parses the page and extracts the data.

If you’ve used BeautifulSoup before, you’ll find XPath saves a lot of trouble, since it eliminates the work of checking element code layer by layer. Once you have this basic routine down, ordinary static websites are no problem at all; Douban, Qiushibaike, Tencent News, and the like are all within reach.

Of course, if you need to crawl asynchronously loaded websites, you can learn to capture and analyze the real requests with the browser, or learn Selenium to automate a browser. That way, dynamic sites such as Zhihu, Mtime, and Tripadvisor can be handled too.
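The basic flow described in this section (send a request, get the page, parse it, extract and store the content) can be sketched roughly as follows. To stay dependency-free, this sketch swaps in the standard library (urllib and ElementTree) for the recommended Requests + XPath combination; ElementTree only understands well-formed markup and a subset of XPath, so real pages would normally need lxml. The function names, the sample markup, and its class names are all made up for illustration.

```python
import csv
import io
import urllib.request
import xml.etree.ElementTree as ET


def fetch(url, timeout=10):
    """Steps 1-2: send the request and return the page body as text."""
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def parse(page):
    """Step 3: locate each item with an XPath-style expression.

    ElementTree needs well-formed markup; real-world HTML usually needs lxml.
    """
    root = ET.fromstring(page)
    rows = []
    for item in root.findall(".//div[@class='item']"):
        rows.append({
            "title": item.findtext("span[@class='title']", default="").strip(),
            "score": item.findtext("span[@class='score']", default="").strip(),
        })
    return rows


def store(rows, stream):
    """Step 4: write the extracted rows as CSV."""
    writer = csv.DictWriter(stream, fieldnames=["title", "score"])
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical page content; in a real run this would come from fetch(url).
SAMPLE_PAGE = """
<html><body>
  <div class="item"><span class="title">Book A</span><span class="score">9.1</span></div>
  <div class="item"><span class="title">Book B</span><span class="score">8.7</span></div>
</body></html>
"""

rows = parse(SAMPLE_PAGE)
buffer = io.StringIO()
store(rows, buffer)
```

The sample page stands in for a network response so the sketch runs offline; calling fetch() on a real URL would feed parse() in the same way, provided the markup is well-formed.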

– ❷ –

Understand the storage of unstructured data

The crawled data can be stored locally as files or stored in a database.

You can use Python’s built-in file handling or pandas to save the data as CSV files.

You may find that the crawled data is not clean: there may be missing values, errors, and so on. You can clean the data by learning the basic usage of the pandas package.
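As a rough illustration of storing and cleaning, the sketch below uses only Python's built-in csv module instead of pandas (pandas offers comparable operations through drop_duplicates and dropna). The field names and sample rows are invented for the example.

```python
import csv
import io


def clean(rows, required=("name", "price")):
    """Drop rows with missing required fields and exact duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        values = tuple((row.get(field) or "").strip() for field in required)
        if not all(values):
            continue  # a required field is missing or empty
        if values in seen:
            continue  # the same record was already kept
        seen.add(values)
        cleaned.append(dict(zip(required, values)))
    return cleaned


def save_csv(rows, stream, fieldnames=("name", "price")):
    """Write the rows to any file-like object as CSV with a header line."""
    writer = csv.DictWriter(stream, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)


# Invented sample data showing the typical problems.
raw_rows = [
    {"name": "Widget", "price": "19.9"},
    {"name": "Widget", "price": "19.9"},   # duplicate
    {"name": "Gadget", "price": ""},       # missing price
    {"name": " Gizmo ", "price": "5.0"},   # stray whitespace
]
cleaned_rows = clean(raw_rows)
output = io.StringIO()
save_csv(cleaned_rows, output)
```

Passing an open file (for example one opened with newline="") instead of the StringIO buffer would write the same content to disk.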

– ❸ –

Learn Scrapy and build engineered crawlers

With the techniques above, ordinary volumes of data and code are no problem, but in very complex situations you may still struggle. That is where the powerful Scrapy framework is especially useful.

Scrapy is a very powerful crawler framework. It makes building requests easy, its powerful selectors make parsing responses easy, and, most impressively, it delivers very high performance and lets you engineer and modularize your crawlers.

Once you have learned Scrapy, you can build crawler frameworks of your own, and you will have largely developed the mindset of a crawler engineer.

– ❹ –

Learn database basics and deal with large-scale data storage

When the amount of data you crawl is small, you can store it in files, but once the volume gets large this stops working well. So you need to learn a database, and learning MongoDB, currently a mainstream choice, is enough.

MongoDB makes it easy to store unstructured data, such as the text of comments and links to images. You can also use PyMongo to work with MongoDB more conveniently from Python.

The database knowledge needed here is actually quite simple, mainly how to insert data and how to extract it, so you can pick it up when you need it.

– ❺ –

Master various skills to cope with the anti-crawling measures of specific websites

Of course, you will hit some frustrations while crawling, such as IP blocking, odd CAPTCHAs, User-Agent access restrictions, and dynamic loading of all kinds.

Dealing with these anti-crawler measures naturally takes some more advanced techniques, such as controlling access frequency, using a proxy IP pool, capturing packets, and running OCR on CAPTCHAs.

Websites usually favor efficient development over anti-crawler measures, which leaves room for crawlers. Once you master these techniques, the vast majority of websites will no longer pose a problem.
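Of the techniques listed above, access frequency control is the easiest to illustrate. The following is only a minimal sketch: the RateLimiter class, its parameters, and the 0.05-second interval are assumptions chosen for the example, not values taken from any particular website.

```python
import time


class RateLimiter:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the class can be tested
        self._sleep = sleep
        self._last = None        # time of the previous request, if any

    def wait(self):
        """Block until min_interval seconds have passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


limiter = RateLimiter(min_interval=0.05)
for _ in range(3):
    limiter.wait()   # a real crawler would send its request right after this
```

Calling wait() before every request keeps the crawler below a chosen request rate; a proxy pool or randomized delays could be layered on top in the same place.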

– ❻ –

Distributed crawlers for large-scale concurrent collection

Once crawling basic data is no longer a problem, your bottleneck becomes how efficiently you can crawl massive amounts of data. At this point, you will naturally come across a very impressive-sounding name: the distributed crawler.

“Distributed” sounds intimidating, but it really just applies the same idea as multithreading: several crawlers work at the same time. You need to master three tools: Scrapy, MongoDB, and Redis.

Scrapy does the basic page crawling, MongoDB stores the crawled data, and Redis stores the queue of pages waiting to be crawled.
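The division of labor described above can be imitated inside a single process. In this sketch a queue.Queue plays the role of the Redis page queue, a plain list plays the role of the MongoDB collection, and worker threads play the role of the Scrapy crawlers. It is a stand-in to show the idea, not a real distributed setup; the URLs and the fake_fetch function are invented.

```python
import queue
import threading


def fake_fetch(url):
    """Stand-in for a real page download; returns a fake record."""
    return {"url": url, "length": len(url)}


def worker(pending, results, lock):
    """Take URLs from the shared queue until it is empty."""
    while True:
        try:
            url = pending.get_nowait()
        except queue.Empty:
            return
        record = fake_fetch(url)
        with lock:
            results.append(record)
        pending.task_done()


pending = queue.Queue()      # plays the role of the Redis queue
results = []                 # plays the role of the MongoDB collection
lock = threading.Lock()

for page in range(1, 21):
    pending.put(f"https://example.com/list?page={page}")

workers = [
    threading.Thread(target=worker, args=(pending, results, lock))
    for _ in range(4)
]
for thread in workers:
    thread.start()
for thread in workers:
    thread.join()
```

Because every worker pulls from the same queue, no page is crawled twice and adding workers needs no other change; a shared Redis queue gives crawlers on different machines the same property.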

So some things look scary, but once you break them down, that is all there is to them. When you can write distributed crawlers, you can try building some basic crawler architectures to make data collection more automated.

You see, after following this learning path you will already be a seasoned hand, and the ride is quite smooth. So at the start, try not to chew through any subject systematically; find a practical project (start with something simple like Douban or Xiaozhu) and get going right away.

Crawler technology does not require you to master a language systematically, nor does it require sophisticated database skills. The effective approach is to pick up these scattered knowledge points through real projects, which ensures that what you learn each time is exactly the part you need most.

Of course, the one difficulty is that when you hit a specific problem, finding learning resources for exactly that part, and filtering through them, is a big challenge for many beginners.

But don’t worry, we have prepared a very systematic crawler course. In addition to a clear learning path, we have selected the most practical learning resources and a large library of mainstream crawler cases. After a short period of study, you will be able to master crawling and get the data you want.

After a short period of study, many students have gone from 0 to 1 and can write their own crawlers to collect large-scale data. Below are a few students’ homework assignments:

Crawling HD images of LOL hero skins

Silent little panda

Wallpaper crawling is popular for games such as the MOBA “League of Legends”, the mobile games “Honor of Kings” and “Onmyoji”, and the FPS “PUBG”. Of these, “League of Legends” wallpapers are the hardest to crawl, so here I show the process of crawling all “League of Legends” hero wallpapers.

Take a look at the final result. The wallpapers of every hero have been crawled:

139 hero wallpapers

12 wallpapers of “Annie, the Dark Child”:

Annie’s Little Red Riding Hood skin in high definition

1. Crawler flow chart

By this point, I have a reasonable understanding of the target I want to crawl and some ideas about how to crawl it, so I can design the following crawler flow chart:

2. Design the overall code framework

According to the crawler flow chart, I designed the following code framework:

The run() function does the following: creates an LOL folder, reads input from the keyboard, then crawls all hero wallpapers if the input is “All”, or a single hero’s wallpapers otherwise.
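The original framework code is not reproduced here, so the following is only a guess at its shape based on the description of run(). The two crawl functions are placeholders with invented names, and the read_input parameter is an addition that makes the function testable without a keyboard.

```python
import os


def crawl_all_heroes(folder):
    """Placeholder: crawl the wallpapers of every hero into folder."""
    return f"all heroes -> {folder}"


def crawl_one_hero(name, folder):
    """Placeholder: crawl the wallpapers of a single hero into folder."""
    return f"{name} -> {folder}"


def run(base_dir=".", read_input=input):
    # 1. Create the LOL folder if it does not exist yet.
    folder = os.path.join(base_dir, "LOL")
    os.makedirs(folder, exist_ok=True)

    # 2. Read the hero name (or "All") from the keyboard.
    choice = read_input("Hero name, or All for every hero: ").strip()

    # 3. Dispatch to the matching crawl routine.
    if choice == "All":
        return crawl_all_heroes(folder)
    return crawl_one_hero(choice, folder)
```

Calling run() with no arguments would create the folder in the current directory and prompt on the keyboard, as the description suggests.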

3. Crawl all hero information

First we parse the champions.js file to get a one-to-one mapping between the hero’s English name and id.
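The layout of champions.js is not shown in the write-up, so this sketch assumes the file is a JavaScript assignment wrapping a JSON object that has a "keys" entry mapping each id to an English name. That layout, the sample text, and the parse_hero_ids name are all assumptions for illustration.

```python
import json
import re

# Hypothetical excerpt; the real champions.js layout may differ.
SAMPLE_JS = 'var champions = {"keys": {"1": "Annie", "2": "Olaf", "3": "Galio"}};'


def parse_hero_ids(js_text):
    """Return {english_name: hero_id} from JS text that wraps a JSON object."""
    # Grab everything from the first "{" to the last "}".
    match = re.search(r"\{.*\}", js_text, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in the JS text")
    payload = json.loads(match.group(0))
    # Invert the assumed id -> name mapping into name -> id.
    return {name: hero_id for hero_id, name in payload["keys"].items()}


hero_ids = parse_hero_ids(SAMPLE_JS)
```

If the real file uses a different key name or nesting, only the last line of parse_hero_ids would need to change.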

The hero information pages on the official website are not easy to crawl because they are loaded with JavaScript. I used Selenium + PhantomJS to load the hero information dynamically.

Parsed hero information

4. Crawl hero wallpapers

Define the get_image(heroid,heroframe) function to crawl all the wallpapers of a single hero.

Keep the network connection up while the code runs. If the connection is too slow, the crawl may fail. Crawling the HD wallpapers of all 139 heroes (about 1,000 images) takes about 3-4 minutes on a 3 Megabyte wired network.

Wallpapers for other games such as Honor of Kings, Onmyoji, and PUBG can be crawled the same way. League of Legends is the hardest to crawl, so once you understand this example, writing your own code for the other games is easy.


Crawling Meituan.com restaurant information


This time, we run a crawler exercise on all the food recommendations for Changzhou on Meituan. The main information we want to crawl is: the name of the restaurant, its rating, the number of reviews, its address, and the average price per person…

The final crawled data is saved as CSV as follows:

Meituan has an anti-crawler mechanism, so the crawler has to imitate a browser. After several attempts, I found that only the cookies and the User-Agent are checked.

The first batch of crawled data

After crawling the first batch of data, it was time to turn the page. Turning pages was easy, so I also crawled each merchant’s phone number, business hours, and other information.

Writing a function

Successfully crawled the corresponding information

But it did not last long: halfway through the crawl I started getting 403 responses.

Since I was blocked, would I have to switch to incognito-style access? I looked into it and decided to use multiple cookies, picked at random for each request, to avoid being blocked. In the end, 17 cookies were used. In testing, the crawler could run at high speed without being blocked.
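The cookie rotation described above might look something like this. The cookie strings are placeholders (real ones would have to be copied from logged-in browser sessions), and the build_headers name and User-Agent string are assumptions for the example.

```python
import random

# Placeholder values standing in for the 17 real cookies.
COOKIE_POOL = [f"session=placeholder-{i}" for i in range(17)]
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"


def build_headers(cookie_pool=COOKIE_POOL, rng=random):
    """Pick one cookie at random and pair it with a browser-like User-Agent."""
    return {
        "User-Agent": USER_AGENT,
        "Cookie": rng.choice(cookie_pool),
    }


headers = build_headers()
```

Calling build_headers() before each request and passing the result as the request headers spreads the traffic across all cookies in the pool.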

This crawl ends here, but the data collected can support a lot of analysis, such as takeout options in different areas, the distribution of merchants, and so on.

Crawling all five-star books in every Dangdang category


The website selected for this assignment is Dangdang. Dangdang has a lot of book data, especially its five-star books, which cover the most popular books in various fields. This is valuable for finding worthwhile books and analyzing how well good books sell.

The final crawled data is as follows, more than 10,000 rows in total:

The data I want to crawl is the five-star book information (title, number of reviews, author, publisher, publication date, number of five-star reviews, price, e-book price, etc.) under each category (fiction, primary and secondary education, literature, success/inspiration…).

To get the book information under each category, first check whether the link changes when you click on a category. After testing, the links differ between categories, which shows that the content is not loaded by JS.

Data is returned normally after printing

I had not even set any headers, yet I still got the data I wanted. In the final code I added headers anyway, just to be on the safe side.

The next step is to crawl the book information under each category, taking “fiction” as an example.

Turning pages is actually very simple, with one small pitfall: to turn pages from the links crawled out of the page source, you have to construct the links yourself. Analysis of the returned links revealed that only four digits in the middle differ. So I extracted those digits and substituted them into the link, which gives a generic link.

Construct a page-turning link
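The actual Dangdang link format is not shown in the write-up, so the sketch below uses a made-up URL template on example.com simply to illustrate the idea of substituting a category code and a page number into a generic link. The template, the category code, and the build_page_urls name are hypothetical.

```python
# Hypothetical URL layout; the real links are not shown in the article.
URL_TEMPLATE = "https://example.com/five-star/{category}/page-{page}.html"


def build_page_urls(category, last_page, template=URL_TEMPLATE):
    """Return the listing URLs for pages 1..last_page of one category."""
    return [
        template.format(category=category, page=page)
        for page in range(1, last_page + 1)
    ]


urls = build_page_urls("4103", 3)
```

Looping over every category code collected from the home page and calling build_page_urls for each would yield the full list of pages to crawl.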

The next step is to grab the information from each page. There is no asynchronous loading, so XPath can locate it directly. One thing to note is that not every book carries the same fields, so an XPath lookup may come back empty for some of them. That is why a try...except statement is used.
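A small sketch of the try...except idea, using the standard library's ElementTree in place of lxml so that it runs without extra packages. The markup, class names, and extract_field helper are invented; with lxml's xpath(), the corresponding failure on a missing field would typically be an IndexError on an empty result list rather than the AttributeError caught here.

```python
import xml.etree.ElementTree as ET

# Invented sample: the second book has no e-book price.
SAMPLE = """
<ul>
  <li><span class="title">Book A</span><span class="ebook">9.99</span></li>
  <li><span class="title">Book B</span></li>
</ul>
"""


def extract_field(node, path, default=""):
    """Return the text at path, or default when the element is missing."""
    try:
        return node.find(path).text.strip()
    except AttributeError:
        # find() returned None (element absent) or the element had no text.
        return default


books = [
    {
        "title": extract_field(item, "span[@class='title']"),
        "ebook_price": extract_field(item, "span[@class='ebook']", default="N/A"),
    }
    for item in ET.fromstring(SAMPLE).findall("li")
]
```

Wrapping each lookup this way keeps one missing field from aborting the whole page.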

In the end, I crawled more than 10,000 rows of data, corresponding to more than 10,000 highly rated books in different fields. Of course, there is some double counting: many books appear under both fiction and literature, for example.

Dangdang itself has hardly any anti-crawling mechanism, so the crawl went fairly smoothly. The only snags were constructing links from the crawled ones to keep turning pages and handling the information missing from some books.

Crawling job information

@ nan was born

I want to work as a “data analyst”, so I wanted to find out the salary, requirements, and main locations of this position in the city where I live. Since the site is an authoritative recruitment platform for the Internet industry, its “data analyst” listings should be quite representative.

The crawled data is stored in MongoDB as follows:

One pitfall along the way was a JSONDecodeError.

After stepping into two pitfalls, I started on the assignment, which was surprisingly hard for a novice. At first my idea was to look for links, but the collected data contained none, so I clicked into the detail pages to look for a pattern. After clicking through several detail pages, I found that the number on each page matches one of the fields in the collected data. For example:

A detail page

Once I found this breakthrough, I got started:


Request URL / request method

Request method: GET

Loop over the positionId values and format each one into the URL, e.g.

Details page

Getting the data with XPath

Partial data:

After repeated trial and error, I optimized the code. This was mainly a process of learning and creating (crawling the detail pages is my proudest piece of work).

– Efficient learning path –

Starting out with theory, grammar, and the programming language itself is a poor way to learn. We start directly with concrete cases and teach specific knowledge points through hands-on practice. We have planned a systematic learning path for you, so that you no longer face scattered knowledge points.

For example, we use lxml + XPath instead of BeautifulSoup to parse web pages, which saves you from checking page elements layer by layer. Many tools can do the job, but we will show you the simplest way.

Outline of Python Crawler: Getting Started + Advanced

Chapter 1: Getting started with the Python crawler

1. What is a crawler

URL structure and pagination mechanism

Page source code structure and page request process

Application and basic principle of crawler

2. Getting to know Python crawlers

Build Python crawler environment

Create your first crawler: crawl the Baidu home page

Three steps of crawler: data acquisition, data parsing, data saving

3. Use Requests to crawl Douban reviews

Installation and basic usage of Requests

Use Requests to crawl Douban comments

Must-know: the crawler protocol

4. Use XPath to parse Douban reviews

Installation and introduction of XPath

Using XPath: copying from the browser and writing by hand

Hands-on: use XPath to parse Douban short comment information

5. Use pandas to save the Douban comments

Basic usage of pandas

Saving files and processing data with pandas

Hands-on: use pandas to save the Douban review data

6. Browser packet capture and headers settings (Case 1: Crawling Zhihu)

The general idea of a crawler: fetching, parsing, storing

Capturing Ajax-loaded data with the browser

Set headers to get past anti-crawler restrictions

Hands-on: crawl Zhihu user data

7. Storing data in MongoDB (Case 2: Crawling Lagou)

Install and use MongoDB and RoboMongo

Set the wait time and modify the headers

Hands-on: crawl Lagou job data

Store the data in MongoDB

Bonus hands-on: crawl Weibo mobile data

8. Crawling dynamic web pages with Selenium (Case 3: Crawling Taobao)

Setting up and using Selenium for dynamic web pages

Analyzing the dynamic information on Taobao product pages

Hands-on: use Selenium to crawl Taobao page information

Chapter 2: The Scrapy framework for Python crawlers

1. Crawler engineering and the Scrapy framework

HTML, CSS, JS, databases, the HTTP protocol, and front-end/back-end interaction

Advanced crawler workflow

Scrapy components: engine, scheduler, downloader middleware, item pipeline, etc.

Common crawler tools: databases, packet capture tools, etc.

2. Scrapy installation and basic use

Installing Scrapy

Basic methods and properties of Scrapy

Start your first Scrapy project

3. Scrapy selectors

Common selectors: CSS, XPath, re, PyQuery

How to use CSS selectors

How to use xpath

How to use re

How to use PyQuery

4. Scrapy item pipeline

Introduction to the Item Pipeline and its role

Main functions of the Item Pipeline

Example: Write data to a file

Practical example: Filtering data in a pipeline

5. Scrapy middleware

Downloader middleware and spider middleware

The three main functions of downloader middleware

Middleware provided by default

6. Scrapy Request and Response

Basic and advanced parameters of the Request object

Request object methods

Response object parameters and methods

Detailed explanation of comprehensive utilization of Response object method

Chapter 3: Advanced Python crawler operations

1. Advanced network analysis: packet capture with Google Chrome

Detailed analysis of HTTP requests

Network panel structure

Filtering requests by keyword

Copy, save, and clear network information

View resource initiators and dependencies

2. Storing data in a database

Data deduplication

Data is stored in MongoDB

Chapter 4: Distributed crawlers and practical training projects

1. Large-scale concurrent collection: writing distributed crawlers

Introduction to distributed crawlers

How distributed crawling with Scrapy works

Using Scrapy-Redis

Scrapy distributed deployment in detail

2. Practical training project (I): 58.com second-hand housing monitoring

3. Practical training project (II): Qunar simulated login

4. Practical training project (III): Jingdong product data capture

– Study materials in every class –

Have you collected gigabytes of learning resources but never opened them? We’ve picked out the most useful parts for you and described them in the simplest form to help you learn, so you can spend more time practicing.

Anticipating the problems you may run into, we have prepared after-class materials for every class, in four parts:

1. Key course notes that explain the key knowledge in detail to help you understand and review quickly;

2. Supplementary basics that assume you are a complete beginner, down to software installation and basic operation;

3. Reference code for the cases in and outside class, so that you can easily handle crawlers for mainstream websites;

4. Extended knowledge points covering more problems, so that you can solve special problems in practice.

Some of the materials after class

– Super cases, covering mainstream sites –

The most common web crawler cases are provided in the course: Douban, Baidu, Zhihu, Taobao, Jingdong, Weibo… Each case is analyzed in detail in the course video, and the teacher guides you through each step.

In addition, we will add cases such as Xiaozhu, Lianjia, 58.com, NetEase Cloud Music, and WeChat friends, and provide ideas and code.

After a lot of imitation and practice, you can easily write your own crawler code and easily crawl the data of these major sites.

– Skills development: anti-crawler and data storage and processing –

Knowing the basics of crawling is not enough, so we use practical cases to walk you through some anti-crawler measures and the specific techniques for getting around them. Asynchronous loading, IP restrictions, headers restrictions, CAPTCHAs, and so on are all common anti-crawler methods that you will learn to handle well.

Engineered crawlers and distributed crawlers make it possible for you to obtain large-scale data. In addition to crawling, you will learn the basics of MongoDB and pandas to store and clean data for subsequent analysis and processing.

Use Scrapy to get rental information

Crawl Lagou recruitment data and store it in MongoDB