Build a Search Engine with a Python Distributed Crawler and Scrapy
What era comes next? The data age! Data analysis services, Internet finance, data modeling, natural language processing, medical case analysis… more and more work is built on data, and crawlers are the most important way to obtain data quickly. Python crawlers are simpler and more efficient to write than those in other languages.
Who this course is for
Anyone interested in crawlers, or who wants to do big data development but cannot find the data
Anyone who does not know how to build a stable, reliable distributed crawler
Anyone who wants to build a search engine but does not know where to start
Prerequisites
Some basic crawler experience
Knowledge of front-end pages, object-oriented concepts, computer network protocols and databases
Chapter Contents:
Chapter 1 Course Introduction
Introduces the course objectives, what you can learn from the course, and the knowledge required to develop the system
1-1 Introduction to building a search engine with a Python distributed crawler
Chapter 2 Building a development environment under Windows
This chapter covers installing Python's virtualenv and virtualenvwrapper, and using PyCharm and Navicat
2-1 Installation and basic use of PyCharm
2-2 Installation and use of MySQL and Navicat
2-3 Installing Python 2 and Python 3 on Windows and Linux
2-4 Installing and configuring a virtual environment
Chapter 3 Review of crawler fundamentals
This chapter covers the fundamentals needed for crawler development: what crawlers can do, regular expressions, depth-first and breadth-first algorithms and their implementation, URL de-duplication strategies, and the difference between Unicode and UTF-8 encodings and how to apply them. A minimal breadth-first crawl sketch with set-based URL de-duplication follows the lesson list.
3-1 Technology selection: what crawlers can do
3-2 Regular expression -1
3-3 Regular expression -2
3-4 Regular expression -3
3-5 Principles of depth first and breadth first
3-6 URL deduplication method
3-7 Thoroughly understand Unicode and UTF8 encodings
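As a taste of the breadth-first and URL de-duplication ideas above, here is a minimal sketch; it is not the course's code, and `bfs_crawl`, the `example.com` start URL, and the regex-based link extraction are illustrative placeholders only.

```python
import re
from collections import deque

import requests


def bfs_crawl(start_url, max_pages=20):
    """Breadth-first crawl with the simplest de-dup strategy: a set of seen URLs."""
    seen = {start_url}              # every URL ever queued, so nothing is queued twice
    queue = deque([start_url])      # FIFO queue gives breadth-first order
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        try:
            html = requests.get(url, timeout=5).text
        except requests.RequestException:
            continue                # skip unreachable pages
        fetched += 1
        # naive href extraction; a real crawler would use lxml or parsel
        for link in re.findall(r'href="(https?://[^"]+)"', html):
            if link not in seen:
                seen.add(link)
                queue.append(link)


if __name__ == "__main__":
    bfs_crawl("https://example.com")
```

Swapping the deque for a stack (append/pop on the same end) would turn the same code into a depth-first crawl.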
Chapter 4 (New) Scrapy crawls a well-known technical article site
This chapter sets up the Scrapy development environment, introduces Scrapy's common commands, and walks through the project directory structure; it also explains XPath and CSS selectors in detail. It then crawls all of the site's articles with a Scrapy spider, extracts specific fields with an item loader, and uses Scrapy pipelines to save the data to a JSON file and to a MySQL database. A small spider-plus-pipeline sketch follows the lesson list.
4-1 re-recording instructions (very important!!)
4-2 Scrapy installation and configuration
4-3 Requirements analysis
4-4 PyCharm
4-5 Basic xpath syntax
4-6 xpath extracts elements
4-7 CSS selectors
4-8 Write spider to complete the crawl process – 1
4-9 Write spider to complete the crawl process – 2
4-10 Why yield is used in Scrapy
4-11 Extracting details
4-12 Extracting details
4-13 Definition and use of items -1
4-14 Definition and use of items – 2
4-15 scrapy configuration
4-16 Writing items to a JSON file
4-17 MySQL table structure design
4-18 Saving data to the database with a pipeline
4-19 Inserting data into MySQL asynchronously
4-20 Handling primary-key conflicts on insert
4-21 ItemLoader Extracts information
4-22 ItemLoader Extracts information
4-23 The image download error problem during large-scale crawling
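A minimal sketch of the spider-plus-pipeline pattern this chapter teaches. It is not the course's project code: `ArticleSpider`, the `div.post` markup, and the `articles.jl` output path are assumptions made for illustration, and the pipeline would be enabled through `ITEM_PIPELINES` in settings.py.

```python
import json

import scrapy


class ArticleSpider(scrapy.Spider):
    """Sketch: extract the title and URL of each article with XPath/CSS selectors."""
    name = "articles"
    start_urls = ["https://example.com/articles"]   # placeholder listing page

    def parse(self, response):
        for post in response.css("div.post"):        # hypothetical markup
            yield {
                "title": post.xpath(".//h2/a/text()").get(),
                "url": response.urljoin(post.xpath(".//h2/a/@href").get()),
            }


class JsonWriterPipeline:
    """Item pipeline sketch: append every item to a JSON-lines file."""

    def open_spider(self, spider):
        self.file = open("articles.jl", "w", encoding="utf-8")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item
```

The course goes further than this sketch: it defines Items, extracts fields with ItemLoader, and writes to MySQL asynchronously (lessons 4-13 to 4-19).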
Chapter 5 Scrapy crawls a popular Q&A site
This chapter extracts the site's questions and answers. Besides analyzing the Q&A site's network requests, it completes a simulated login with requests and with Scrapy's FormRequest. It then works out the API endpoint that returns answers, extracts the data, and saves it to MySQL. A FormRequest login sketch follows the lesson list.
5-1 Automatic login mechanism of Session and cookie
5-2 Simulated login to Zhihu with requests - 1 (optional)
5-3 Simulated login to Zhihu with requests - 2 (optional)
5-4 Simulated login to Zhihu with requests - 3 (optional)
5-5 Simulated login to Zhihu with Scrapy
5-6 Zhihu Analysis and data table design 1
5-7 Zhihu Analysis and data table design 2
5-8 Extracting questions with Item Loader - 1
5-9 Extracting questions with Item Loader - 2
5-10 Extracting questions with Item Loader - 3
5-11 Implementing the Zhihu spider crawl logic and extracting answers - 1
5-12 Implementing the Zhihu spider crawl logic and extracting answers - 2
5-13 Saving data to MySQL - 1
5-14 Saving data to MySQL - 2
5-15 Saving data to MySQL - 3
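A hedged sketch of a simulated login with Scrapy's `FormRequest.from_response`, the approach named in this chapter. The login URL, form field names, and the "Welcome" success check are placeholders; a real site's login flow is usually more involved.

```python
import scrapy


class LoginSpider(scrapy.Spider):
    """Sketch: log in via the site's login form, then crawl pages that need a session."""
    name = "login_demo"
    start_urls = ["https://example.com/login"]       # placeholder login page

    def parse(self, response):
        # fill in and submit the login form found on the page
        yield scrapy.FormRequest.from_response(
            response,
            formdata={"username": "me", "password": "secret"},   # placeholder credentials
            callback=self.after_login,
        )

    def after_login(self, response):
        if "Welcome" in response.text:               # naive success check
            self.logger.info("Logged in; cookies are reused automatically by Scrapy")
            yield scrapy.Request(
                "https://example.com/questions",     # placeholder protected page
                callback=self.parse_question,
            )

    def parse_question(self, response):
        yield {"title": response.css("h1::text").get()}
```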
Chapter 6 Crawling an entire recruitment website with CrawlSpider
This chapter designs the data table structure for the recruitment site's job postings, then crawls every position on the site by configuring CrawlSpider with link extractors and rules. It also analyzes CrawlSpider at the source-code level so that you gain a deep understanding of how it works. A minimal CrawlSpider sketch follows the lesson list.
6-1 Data table structure design
6-2 CrawlSpider source code analysis – Create CrawlSpider and Settings
6-3 CrawlSpider source code analysis
6-4 Using Rule and LinkExtractor
6-5 Simulated login and cookie passing after a 302 redirect (watch this lesson if the site requires login)
6-6 Parsing job postings with Item Loader
6-7 Saving job data to the database - 1
6-8 Saving job data to the database - 2
6-9 Bypassing the site's anti-crawler measures
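A minimal CrawlSpider sketch showing the Rule/LinkExtractor configuration described above; the URL patterns and CSS selectors are hypothetical, not taken from any particular recruitment site.

```python
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class JobSpider(CrawlSpider):
    """Sketch: rules follow list pages and hand job detail pages to parse_job."""
    name = "jobs"
    start_urls = ["https://example.com/jobs"]        # placeholder recruitment site

    rules = (
        # follow pagination links without parsing them
        Rule(LinkExtractor(allow=r"/jobs/list/\d+")),
        # parse job detail pages and keep following links found on them
        Rule(LinkExtractor(allow=r"/jobs/\d+\.html"), callback="parse_job", follow=True),
    )

    def parse_job(self, response):
        yield {
            "title": response.css("h1::text").get(),
            "salary": response.css(".salary::text").get(),   # hypothetical selector
            "url": response.url,
        }
```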
Chapter 7 Breaking through anti-crawler restrictions with Scrapy
This chapter starts from the ongoing contest between crawlers and anti-crawler measures, explains how Scrapy works internally, and then breaks through anti-crawler restrictions by randomly switching the User-Agent and configuring an IP proxy in Scrapy. It also analyzes Scrapy's Request and Response objects in detail. Finally, it uses a cloud captcha-solving platform for online verification code recognition, and disables cookies and limits the request rate to reduce the chance of the crawler being blocked. A downloader-middleware sketch for random User-Agents and proxies follows the lesson list.
7-1 The crawler vs. anti-crawler contest: process and strategies
7-2 Scrapy source code analysis
7-3 Introduction to Request and Response
7-4 Randomly switching the User-Agent with a downloader middleware - 1
7-5 Randomly switching the User-Agent with a downloader middleware - 2
7-6 Implementing an IP proxy pool in Scrapy - 1
7-7 Implementing an IP proxy pool in Scrapy - 2
7-8 Implementing an IP proxy pool in Scrapy - 3
7-9 Verification code recognition with a cloud captcha-solving service
7-10 Disabling cookies, auto-throttling, and custom spider settings
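A sketch of the two downloader middlewares this chapter discusses: random User-Agent switching and a random IP proxy. The user-agent strings and proxy addresses are placeholders, and both classes would need to be registered under `DOWNLOADER_MIDDLEWARES` in settings.py.

```python
import random


class RandomUserAgentMiddleware:
    """Downloader middleware sketch: set a random User-Agent on every request."""
    user_agents = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
        "Mozilla/5.0 (X11; Linux x86_64)",
    ]

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(self.user_agents)


class RandomProxyMiddleware:
    """Downloader middleware sketch: route every request through a random proxy."""
    proxies = ["http://127.0.0.1:8888", "http://127.0.0.1:8889"]   # placeholder pool

    def process_request(self, request, spider):
        request.meta["proxy"] = random.choice(self.proxies)
```

In practice the proxy list would be refreshed from a proxy pool rather than hard-coded, which is what lessons 7-6 to 7-8 build up.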
Chapter 8 Scrapy advanced development
This chapter covers Scrapy's more advanced features: crawling dynamic sites with Selenium and PhantomJS and integrating both into Scrapy, Scrapy signals, custom middleware, pausing and resuming crawlers, Scrapy's core API, the Telnet console, web services, log configuration, email notifications, and more. These features let us do far more with Scrapy. A Selenium downloader-middleware sketch follows the lesson list.
8-1 Requesting dynamic pages with Selenium and simulated login to Zhihu
8-2 Simulating Weibo login and page scrolling with Selenium
8-3 Disabling image loading in ChromeDriver; fetching dynamic pages with PhantomJS
8-4 Integrating Selenium into Scrapy
8-5 Other dynamic-page techniques: headless Chrome, scrapy-splash, Selenium Grid, Splinter
8-6 Scrapy pauses and restarts
8-7 How Scrapy de-duplicates URLs
8-8 Scrapy Telnet service
8-9 Spider middleware in detail
8-10 Scrapy stats collection
8-11 Scrapy signals in detail
8-12 Developing Scrapy extensions
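One way to integrate Selenium into Scrapy, sketched as a downloader middleware. The `use_selenium` meta flag is an assumption of this sketch, and it requires a local Chrome/chromedriver setup; returning an HtmlResponse from process_request makes Scrapy skip its own downloader for that request.

```python
from scrapy.http import HtmlResponse
from selenium import webdriver


class SeleniumMiddleware:
    """Downloader middleware sketch: render JavaScript-heavy pages in a real browser."""

    def __init__(self):
        self.driver = webdriver.Chrome()             # assumes chromedriver is available

    def process_request(self, request, spider):
        # only render requests explicitly marked as dynamic
        if not request.meta.get("use_selenium"):
            return None                              # fall back to Scrapy's downloader
        self.driver.get(request.url)
        body = self.driver.page_source
        # returning a response here short-circuits the normal download
        return HtmlResponse(url=request.url, body=body, encoding="utf-8", request=request)
```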
Chapter 9 scrapy-redis distributed crawler
This chapter covers how to build a distributed crawler with scrapy-redis and analyzes the scrapy-redis source code, so that you can modify it to fit your own needs. It also shows how to integrate a BloomFilter into scrapy-redis. A minimal scrapy-redis configuration sketch follows the lesson list.
9-1 Key points of distributed crawlers
9-2 Redis Basics – 1
9-3 Redis Basics – 2
9-4 Writing distributed crawler code with scrapy-redis
9-5 scrapy-redis source code analysis - connection.py, defaults.py
9-6 scrapy-redis source code analysis - dupefilter.py
9-8 scrapy-redis source code analysis - scheduler.py, spider.py
9-9 Integrating a BloomFilter into scrapy-redis
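A minimal sketch of what a scrapy-redis setup looks like, using real scrapy-redis setting names; the spider name, the `distributed:start_urls` key, and the Redis address are placeholders.

```python
# --- settings.py (sketch): let scrapy-redis schedule and de-duplicate requests ---
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True                      # keep the Redis queue so crawls can resume
REDIS_URL = "redis://localhost:6379"          # Redis instance shared by every crawler node

# --- spiders/distributed.py (sketch): start URLs come from a shared Redis list ---
from scrapy_redis.spiders import RedisSpider


class DistributedSpider(RedisSpider):
    name = "distributed"
    redis_key = "distributed:start_urls"      # lpush URLs to this key to feed all nodes

    def parse(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}
```

Because the scheduler, queue, and de-dup filter all live in Redis, the same spider can be started on any number of machines and they will share the work.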
Chapter 10 cookie pool system design and implementation
To keep crawling and parsing code from being entangled with simulated login, it is important to split login out into an independent service; a cookie pool exists to solve exactly this problem. Managing logins across multiple accounts and making site access easier are the problems a cookie pool must solve. This chapter walks through the design and development of a cookie pool in detail. A small Redis-backed cookie store sketch follows the lesson list.
10-1 What is a cookie pool?
10-2 Cookie pool system design
10-3 Implement cookie pool-1
10-4 Implement cookie pool-2
10-5 Modify login method -1
10-6 Modify login method -2
10-7 Modify login method -3
10-8 Modify login method -4
10-9 Simplifying site access with abstract base classes
10-10 Detecting whether a site's cookies are still valid
10-11 Choosing a Redis data structure for storing cookies
10-12 Implementation of cookie manager
10-13 Enabling the cookie pool service
10-14 Integrate cookies into crawler projects
10-15 Notes on improving the cookie pool architecture
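A tiny sketch of the core idea behind a cookie pool: store each account's cookies in Redis and hand a random set to any crawler that asks, so crawlers never log in themselves. The key-naming scheme and helper names are made up for illustration, and validity checking (lesson 10-10) is left out.

```python
import json
import random

import redis

# one Redis hash per site: field = account name, value = that account's cookies as JSON
r = redis.Redis(host="localhost", port=6379, decode_responses=True)


def save_cookies(site, account, cookies):
    """Store a logged-in account's cookies for later reuse by crawlers."""
    r.hset(f"cookies:{site}", account, json.dumps(cookies))


def random_cookies(site):
    """Give a crawler a random cookie set for the given site, or None if empty."""
    all_cookies = r.hgetall(f"cookies:{site}")
    if not all_cookies:
        return None
    return json.loads(random.choice(list(all_cookies.values())))
```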
Chapter 11 Recognizing various types of verification codes
Sliding verification codes are becoming more and more common, and solving them has become an important step in simulated login. This chapter works through the details of cracking sliding verification codes. A slide-trajectory sketch follows the lesson list.
11-1 Sliding verification code identification ideas
11-2 Verification Code Screenshot 1
11-3 Verification Code Screenshot 2
11-4 Calculating the slide distance
11-5 Calculating the slide trajectory
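A sketch of the trajectory step (lesson 11-5) only: it splits a known slide distance into human-like steps that speed up and then slow down. The acceleration values are arbitrary choices for illustration; measuring the distance itself (lessons 11-2 to 11-4) and replaying the track with Selenium are not shown.

```python
def slide_track(distance):
    """Split a slide distance into human-like steps: accelerate, then brake."""
    track, moved, v, t = [], 0.0, 0.0, 0.2            # t is the time slice in seconds
    while moved < distance:
        a = 3.0 if moved < distance * 0.7 else -2.0   # speed up, then slow down
        step = max(v * t + 0.5 * a * t * t, 1.0)      # always advance at least 1 px
        v = max(v + a * t, 1.0)
        moved += step
        track.append(round(step))                     # rounding drift is tolerated here
    return track


if __name__ == "__main__":
    # steps grow while accelerating and shrink after the 70% mark; they sum to ~180
    print(slide_track(180))
```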
Chapter 12 Incremental crawling
Incremental crawling and data updates are problems you run into constantly when operating a crawler: how to discover new data in time, how to crawl the newest URLs first, and how to keep already-crawled data up to date. This chapter solves these problems at minimal cost by modifying the scrapy-redis source code; along the way you will gain a better understanding of how to control a running crawler. A simple incremental-crawl sketch follows the lesson list.
12-1 Problems an incremental crawler has to solve
12-2 Completing incremental crawling by modifying scrapy-redis - 1
12-3 Completing incremental crawling by modifying scrapy-redis - 2
12-4 Updating crawled data
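The chapter's actual approach is to modify scrapy-redis itself; as a simpler illustration of the same goal, here is a sketch that records crawled URLs in a Redis set so that repeated runs only fetch newly published items. The listing URL, selectors, and key name are placeholders.

```python
import redis
import scrapy

# shared record of everything already fetched, so reruns only pick up new items
r = redis.Redis(decode_responses=True)


class IncrementalSpider(scrapy.Spider):
    """Sketch: re-visit a listing page and crawl only the detail pages not seen before."""
    name = "incremental"
    start_urls = ["https://example.com/new"]          # placeholder "latest items" page

    def parse(self, response):
        for href in response.css("a.item::attr(href)").getall():
            url = response.urljoin(href)
            # sadd returns 1 only for URLs we have never recorded before
            if r.sadd("crawled:urls", url):
                yield scrapy.Request(url, callback=self.parse_detail)

    def parse_detail(self, response):
        yield {"url": response.url, "title": response.css("h1::text").get()}
```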
Chapter 13 Using ElasticSearch
This chapter explains how to install and use ElasticSearch, introduces its basic concepts, and covers its API. You will also learn how to use elasticsearch-dsl and how to save data to ElasticSearch from a Scrapy pipeline. A mapping-plus-pipeline sketch follows the lesson list.
13-1 Introduction to ElasticSearch
13-2 Installing ElasticSearch
13-3 Installing the elasticsearch-head plugin and Kibana
13-4 Basic concepts of ElasticSearch
13-5 Inverted index
13-6 ElasticSearch basic index and document CRUD operations
13-7 Batch operations in ElasticSearch with mget and bulk
13-8 ElasticSearch Mapping Management
13-9 Simple query for ElasticSearch – 1
13-10 Simple query for ElasticSearch – 2
13-11 Querying the bool combination of ElasticSearch
13-12 Writing data from Scrapy to ElasticSearch - 1
13-13 Writing data from Scrapy to ElasticSearch - 2
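A sketch of saving items to ElasticSearch with elasticsearch-dsl from a Scrapy pipeline, assuming a local ES node. The `ArticleDoc` mapping, its field names, and the `ik_max_word` analyzer (which requires the IK Chinese-analysis plugin) are assumptions of this sketch, and it uses the newer `Document` API (older elasticsearch-dsl releases used `DocType`).

```python
from elasticsearch_dsl import Document, Keyword, Text, connections

# one shared connection used by the documents and the pipeline below
connections.create_connection(hosts=["localhost"])


class ArticleDoc(Document):
    """elasticsearch-dsl mapping sketch for a crawled article."""
    title = Text(analyzer="ik_max_word")      # assumes the IK plugin is installed
    content = Text(analyzer="ik_max_word")
    url = Keyword()

    class Index:
        name = "article"                      # placeholder index name


class ElasticsearchPipeline:
    """Scrapy pipeline sketch: turn each item into an ES document and save it."""

    def open_spider(self, spider):
        ArticleDoc.init()                     # create the index and mapping if missing

    def process_item(self, item, spider):
        doc = ArticleDoc(
            title=item.get("title"),
            content=item.get("content"),
            url=item.get("url"),
        )
        doc.save()
        return item
```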