Build a Search Engine with a Python Distributed Crawler and Scrapy
What era comes next? The data age! Data analysis services, Internet finance, data modeling, natural language processing, medical case analysis… more and more work is built on data, and crawlers are the most important way to obtain data quickly. Python crawlers are simpler and more efficient to write than those in other languages.
Who this course is for
Anyone interested in crawlers, or who wants to do big data development but cannot find the data
Anyone who does not know how to build a stable, reliable distributed crawler
Anyone who wants to build a search engine but does not know where to start
Prerequisites
Some basic crawler experience
Knowledge of front-end pages, object-oriented concepts, computer network protocols and databases
Chapter Contents:
Chapter 1 Course Introduction
Introduces the course objectives, what you can learn from the course, and the knowledge required to develop the system
1-1 Introduction to building a search engine with a Python distributed crawler
Chapter 2 Building a development environment under Windows
This chapter covers installing Python's virtualenv and virtualenvwrapper, and using PyCharm and Navicat
2-1 Installation and basic use of PyCharm
2-2 Installation and use of MySQL and Navicat
2-3 Installing Python 2 and Python 3 on Windows and Linux
2-4 Installing and configuring a virtual environment
Chapter 3 Review of crawler fundamentals
This chapter covers the fundamentals needed for crawler development: what crawlers can do, regular expressions, depth-first and breadth-first algorithms and their implementation, URL de-duplication strategies, and the difference between Unicode and UTF-8 encodings and how to apply them. A minimal breadth-first crawl sketch with set-based URL de-duplication follows the lesson list.
3-1 Technology selection: what crawlers can do
3-2 Regular expression -1
3-3 Regular expression -2
3-4 Regular expression -3
3-5 Principles of depth first and breadth first
3-6 URL deduplication method
3-7 Thoroughly understand Unicode and UTF8 encodings
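As a taste of the breadth-first and URL de-duplication ideas above, here is a minimal sketch; it is not the course's code, and `bfs_crawl`, the `example.com` start URL, and the regex-based link extraction are illustrative placeholders only.

```python
import re
from collections import deque

import requests


def bfs_crawl(start_url, max_pages=20):
    """Breadth-first crawl with the simplest de-dup strategy: a set of seen URLs."""
    seen = {start_url}              # every URL ever queued, so nothing is queued twice
    queue = deque([start_url])      # FIFO queue gives breadth-first order
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        try:
            html = requests.get(url, timeout=5).text
        except requests.RequestException:
            continue                # skip unreachable pages
        fetched += 1
        # naive href extraction; a real crawler would use lxml or parsel
        for link in re.findall(r'href="(https?://[^"]+)"', html):
            if link not in seen:
                seen.add(link)
                queue.append(link)


if __name__ == "__main__":
    bfs_crawl("https://example.com")
```

Swapping the deque for a stack (append/pop on the same end) would turn the same code into a depth-first crawl.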
Chapter 4 (New) Scrapy crawls a well-known technical article site
This chapter sets up the Scrapy development environment, introduces Scrapy's common commands, and walks through the project directory structure; it also explains XPath and CSS selectors in detail. It then crawls all of the site's articles with a Scrapy spider, extracts specific fields with an item loader, and uses Scrapy pipelines to save the data to a JSON file and to a MySQL database. A small spider-plus-pipeline sketch follows the lesson list.
4-1 re-recording instructions (very important!!)
4-2 Scrapy installation and configuration
4-3 Requirements analysis
4-4 PyCharm
4-5 Basic xpath syntax
4-6 xpath extracts elements
4-7 CSS selectors
4-8 Write spider to complete the crawl process – 1
4-9 Write spider to complete the crawl process – 2
4-10 Why yield is used in Scrapy
4-11 Extracting details
4-12 Extracting details
4-13 Definition and use of items -1
4-14 Definition and use of items – 2
4-15 scrapy configuration
4-16 Writing items to a JSON file
4-17 MySQL table structure design
4-18 Saving data to the database with a pipeline
4-19 Inserting data into MySQL asynchronously
4-20 Handling primary-key conflicts on insert
4-21 ItemLoader Extracts information
4-22 ItemLoader Extracts information
4-23 The image download error problem during large-scale crawling
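A minimal sketch of the spider-plus-pipeline pattern this chapter teaches. It is not the course's project code: `ArticleSpider`, the `div.post` markup, and the `articles.jl` output path are assumptions made for illustration, and the pipeline would be enabled through `ITEM_PIPELINES` in settings.py.

```python
import json

import scrapy


class ArticleSpider(scrapy.Spider):
    """Sketch: extract the title and URL of each article with XPath/CSS selectors."""
    name = "articles"
    start_urls = ["https://example.com/articles"]   # placeholder listing page

    def parse(self, response):
        for post in response.css("div.post"):        # hypothetical markup
            yield {
                "title": post.xpath(".//h2/a/text()").get(),
                "url": response.urljoin(post.xpath(".//h2/a/@href").get()),
            }


class JsonWriterPipeline:
    """Item pipeline sketch: append every item to a JSON-lines file."""

    def open_spider(self, spider):
        self.file = open("articles.jl", "w", encoding="utf-8")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item
```

The course goes further than this sketch: it defines Items, extracts fields with ItemLoader, and writes to MySQL asynchronously (lessons 4-13 to 4-19).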
Chapter 5 Scrapy crawls a popular Q&A site
This chapter extracts the site's questions and answers. Besides analyzing the Q&A site's network requests, it completes a simulated login with requests and with Scrapy's FormRequest. It then works out the API endpoint that returns answers, extracts the data, and saves it to MySQL. A FormRequest login sketch follows the lesson list.
5-1 Automatic login mechanism of Session and cookie
5-2 Simulated login to Zhihu with requests - 1 (optional)
5-3 Simulated login to Zhihu with requests - 2 (optional)
5-4 Simulated login to Zhihu with requests - 3 (optional)
5-5 Simulated login to Zhihu with Scrapy
5-6 Zhihu Analysis and data table design 1
5-7 Zhihu Analysis and data table design 2
5-8 Extracting questions with Item Loader - 1
5-9 Extracting questions with Item Loader - 2
5-10 Extracting questions with Item Loader - 3
5-11 Implementing the Zhihu spider crawl logic and extracting answers - 1
5-12 Implementing the Zhihu spider crawl logic and extracting answers - 2
5-13 Saving data to MySQL - 1
5-14 Saving data to MySQL - 2
5-15 Saving data to MySQL - 3
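A hedged sketch of a simulated login with Scrapy's `FormRequest.from_response`, the approach named in this chapter. The login URL, form field names, and the "Welcome" success check are placeholders; a real site's login flow is usually more involved.

```python
import scrapy


class LoginSpider(scrapy.Spider):
    """Sketch: log in via the site's login form, then crawl pages that need a session."""
    name = "login_demo"
    start_urls = ["https://example.com/login"]       # placeholder login page

    def parse(self, response):
        # fill in and submit the login form found on the page
        yield scrapy.FormRequest.from_response(
            response,
            formdata={"username": "me", "password": "secret"},   # placeholder credentials
            callback=self.after_login,
        )

    def after_login(self, response):
        if "Welcome" in response.text:               # naive success check
            self.logger.info("Logged in; cookies are reused automatically by Scrapy")
            yield scrapy.Request(
                "https://example.com/questions",     # placeholder protected page
                callback=self.parse_question,
            )

    def parse_question(self, response):
        yield {"title": response.css("h1::text").get()}
```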
Chapter 6 Crawling an entire recruitment website with CrawlSpider
This chapter designs the data table structure for the recruitment site's job postings, then crawls every position on the site by configuring CrawlSpider with link extractors and rules. It also analyzes CrawlSpider at the source-code level so that you gain a deep understanding of how it works. A minimal CrawlSpider sketch follows the lesson list.
6-1 Data table structure design
6-2 CrawlSpider source code analysis – Create CrawlSpider and Settings
6-3 CrawlSpider source code analysis
6-4 Using Rule and LinkExtractor
6-5 Simulated login and cookie passing after a 302 redirect (watch this lesson if the site requires login)
6-6 Parsing job postings with Item Loader
6-7 Saving job data to the database - 1
6-8 Saving job data to the database - 2
6-9 Bypassing the site's anti-crawler measures
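A minimal CrawlSpider sketch showing the Rule/LinkExtractor configuration described above; the URL patterns and CSS selectors are hypothetical, not taken from any particular recruitment site.

```python
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class JobSpider(CrawlSpider):
    """Sketch: rules follow list pages and hand job detail pages to parse_job."""
    name = "jobs"
    start_urls = ["https://example.com/jobs"]        # placeholder recruitment site

    rules = (
        # follow pagination links without parsing them
        Rule(LinkExtractor(allow=r"/jobs/list/\d+")),
        # parse job detail pages and keep following links found on them
        Rule(LinkExtractor(allow=r"/jobs/\d+\.html"), callback="parse_job", follow=True),
    )

    def parse_job(self, response):
        yield {
            "title": response.css("h1::text").get(),
            "salary": response.css(".salary::text").get(),   # hypothetical selector
            "url": response.url,
        }
```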
Chapter 7 Breaking through anti-crawler restrictions with Scrapy
This chapter starts from the ongoing contest between crawlers and anti-crawler measures, explains how Scrapy works internally, and then breaks through anti-crawler restrictions by randomly switching the User-Agent and configuring an IP proxy in Scrapy. It also analyzes Scrapy's Request and Response objects in detail. Finally, it uses a cloud captcha-solving platform for online verification code recognition, and disables cookies and limits the request rate to reduce the chance of the crawler being blocked. A downloader-middleware sketch for random User-Agents and proxies follows the lesson list.
7-1 The crawler vs. anti-crawler contest: process and strategies
7-2 Scrapy source code analysis
7-3 Introduction to Request and Response
7-4 Randomly switching the User-Agent with a downloader middleware - 1
7-5 Randomly switching the User-Agent with a downloader middleware - 2
7-6 Implementing an IP proxy pool in Scrapy - 1
7-7 Implementing an IP proxy pool in Scrapy - 2
7-8 Implementing an IP proxy pool in Scrapy - 3
7-9 Verification code recognition with a cloud captcha-solving service
7-10 Disabling cookies, auto-throttling, and custom spider settings
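A sketch of the two downloader middlewares this chapter discusses: random User-Agent switching and a random IP proxy. The user-agent strings and proxy addresses are placeholders, and both classes would need to be registered under `DOWNLOADER_MIDDLEWARES` in settings.py.

```python
import random


class RandomUserAgentMiddleware:
    """Downloader middleware sketch: set a random User-Agent on every request."""
    user_agents = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
        "Mozilla/5.0 (X11; Linux x86_64)",
    ]

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(self.user_agents)


class RandomProxyMiddleware:
    """Downloader middleware sketch: route every request through a random proxy."""
    proxies = ["http://127.0.0.1:8888", "http://127.0.0.1:8889"]   # placeholder pool

    def process_request(self, request, spider):
        request.meta["proxy"] = random.choice(self.proxies)
```

In practice the proxy list would be refreshed from a proxy pool rather than hard-coded, which is what lessons 7-6 to 7-8 build up.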
Chapter 8 Scrapy advanced development
This chapter covers Scrapy's more advanced features: crawling dynamic sites with Selenium and PhantomJS and integrating both into Scrapy, Scrapy signals, custom middleware, pausing and resuming crawlers, Scrapy's core API, the Telnet console, web services, log configuration, email notifications, and more. These features let us do far more with Scrapy. A Selenium downloader-middleware sketch follows the lesson list.
8-1 Requesting dynamic pages with Selenium and simulated login to Zhihu
8-2 Simulating Weibo login and page scrolling with Selenium
8-3 Disabling image loading in ChromeDriver; fetching dynamic pages with PhantomJS
8-4 Integrating Selenium into Scrapy
8-5 Other dynamic-page techniques: headless Chrome, scrapy-splash, Selenium Grid, Splinter
8-6 Scrapy pauses and restarts
8-7 How Scrapy de-duplicates URLs
8-8 Scrapy Telnet service
8-9 Spider middleware in detail
8-10 Scrapy stats collection
8-11 Scrapy signals in detail
8-12 Developing Scrapy extensions
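One way to integrate Selenium into Scrapy, sketched as a downloader middleware. The `use_selenium` meta flag is an assumption of this sketch, and it requires a local Chrome/chromedriver setup; returning an HtmlResponse from process_request makes Scrapy skip its own downloader for that request.

```python
from scrapy.http import HtmlResponse
from selenium import webdriver


class SeleniumMiddleware:
    """Downloader middleware sketch: render JavaScript-heavy pages in a real browser."""

    def __init__(self):
        self.driver = webdriver.Chrome()             # assumes chromedriver is available

    def process_request(self, request, spider):
        # only render requests explicitly marked as dynamic
        if not request.meta.get("use_selenium"):
            return None                              # fall back to Scrapy's downloader
        self.driver.get(request.url)
        body = self.driver.page_source
        # returning a response here short-circuits the normal download
        return HtmlResponse(url=request.url, body=body, encoding="utf-8", request=request)
```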
Chapter 9 scrapy-redis distributed crawler
This chapter covers how to build a distributed crawler with scrapy-redis and analyzes the scrapy-redis source code, so that you can modify it to fit your own needs. It also shows how to integrate a BloomFilter into scrapy-redis. A minimal scrapy-redis configuration sketch follows the lesson list.
9-1 Key points of distributed crawlers
9-2 Redis Basics – 1
9-3 Redis Basics – 2
9-4 Writing distributed crawler code with scrapy-redis
9-5 scrapy-redis source code analysis - connection.py, defaults.py
9-6 scrapy-redis source code analysis - dupefilter.py
9-8 scrapy-redis source code analysis - scheduler.py, spider.py
9-9 Integrating a BloomFilter into scrapy-redis
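A minimal sketch of what a scrapy-redis setup looks like, using real scrapy-redis setting names; the spider name, the `distributed:start_urls` key, and the Redis address are placeholders.

```python
# --- settings.py (sketch): let scrapy-redis schedule and de-duplicate requests ---
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True                      # keep the Redis queue so crawls can resume
REDIS_URL = "redis://localhost:6379"          # Redis instance shared by every crawler node

# --- spiders/distributed.py (sketch): start URLs come from a shared Redis list ---
from scrapy_redis.spiders import RedisSpider


class DistributedSpider(RedisSpider):
    name = "distributed"
    redis_key = "distributed:start_urls"      # lpush URLs to this key to feed all nodes

    def parse(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}
```

Because the scheduler, queue, and de-dup filter all live in Redis, the same spider can be started on any number of machines and they will share the work.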
Chapter 10 cookie pool system design and implementation
To keep crawling and parsing code from being entangled with simulated login, it is important to split login out into an independent service; a cookie pool exists to solve exactly this problem. Managing logins across multiple accounts and making site access easier are the problems a cookie pool must solve. This chapter walks through the design and development of a cookie pool in detail. A small Redis-backed cookie store sketch follows the lesson list.
10-1 What is a cookie pool?
10-2 Cookie pool system design
10-3 Implement cookie pool-1
10-4 Implement cookie pool-2
10-5 Modify login method -1
10-6 Modify login method -2
10-7 Modify login method -3
10-8 Modify login method -4
10-9 Simplifying site access with abstract base classes
10-10 Detecting whether a site's cookies are still valid
10-11 Choosing a Redis data structure for storing cookies
10-12 Implementation of cookie manager
10-13 Enabling the cookie pool service
10-14 Integrate cookies into crawler projects
10-15 Notes on improving the cookie pool architecture
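A tiny sketch of the core idea behind a cookie pool: store each account's cookies in Redis and hand a random set to any crawler that asks, so crawlers never log in themselves. The key-naming scheme and helper names are made up for illustration, and validity checking (lesson 10-10) is left out.

```python
import json
import random

import redis

# one Redis hash per site: field = account name, value = that account's cookies as JSON
r = redis.Redis(host="localhost", port=6379, decode_responses=True)


def save_cookies(site, account, cookies):
    """Store a logged-in account's cookies for later reuse by crawlers."""
    r.hset(f"cookies:{site}", account, json.dumps(cookies))


def random_cookies(site):
    """Give a crawler a random cookie set for the given site, or None if empty."""
    all_cookies = r.hgetall(f"cookies:{site}")
    if not all_cookies:
        return None
    return json.loads(random.choice(list(all_cookies.values())))
```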
Chapter 11 Recognizing various types of verification codes
Sliding verification codes are becoming more and more common, and solving them has become an important step in simulated login. This chapter works through the details of cracking sliding verification codes. A slide-trajectory sketch follows the lesson list.
11-1 Sliding verification code identification ideas
11-2 Verification Code Screenshot 1
11-3 Verification Code Screenshot 2
11-4 Calculating the slide distance
11-5 Calculating the slide trajectory
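A sketch of the trajectory step (lesson 11-5) only: it splits a known slide distance into human-like steps that speed up and then slow down. The acceleration values are arbitrary choices for illustration; measuring the distance itself (lessons 11-2 to 11-4) and replaying the track with Selenium are not shown.

```python
def slide_track(distance):
    """Split a slide distance into human-like steps: accelerate, then brake."""
    track, moved, v, t = [], 0.0, 0.0, 0.2            # t is the time slice in seconds
    while moved < distance:
        a = 3.0 if moved < distance * 0.7 else -2.0   # speed up, then slow down
        step = max(v * t + 0.5 * a * t * t, 1.0)      # always advance at least 1 px
        v = max(v + a * t, 1.0)
        moved += step
        track.append(round(step))                     # rounding drift is tolerated here
    return track


if __name__ == "__main__":
    # steps grow while accelerating and shrink after the 70% mark; they sum to ~180
    print(slide_track(180))
```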
Chapter 12 Incremental crawling
Incremental crawling and data updates are problems you run into constantly when operating a crawler: how to discover new data in time, how to crawl the newest URLs first, and how to keep already-crawled data up to date. This chapter solves these problems at minimal cost by modifying the scrapy-redis source code; along the way you will gain a better understanding of how to control a running crawler. A simple incremental-crawl sketch follows the lesson list.
12-1 Problems an incremental crawler has to solve
12-2 Completing incremental crawling by modifying scrapy-redis - 1
12-3 Completing incremental crawling by modifying scrapy-redis - 2
12-4 Updating crawled data
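The chapter's actual approach is to modify scrapy-redis itself; as a simpler illustration of the same goal, here is a sketch that records crawled URLs in a Redis set so that repeated runs only fetch newly published items. The listing URL, selectors, and key name are placeholders.

```python
import redis
import scrapy

# shared record of everything already fetched, so reruns only pick up new items
r = redis.Redis(decode_responses=True)


class IncrementalSpider(scrapy.Spider):
    """Sketch: re-visit a listing page and crawl only the detail pages not seen before."""
    name = "incremental"
    start_urls = ["https://example.com/new"]          # placeholder "latest items" page

    def parse(self, response):
        for href in response.css("a.item::attr(href)").getall():
            url = response.urljoin(href)
            # sadd returns 1 only for URLs we have never recorded before
            if r.sadd("crawled:urls", url):
                yield scrapy.Request(url, callback=self.parse_detail)

    def parse_detail(self, response):
        yield {"url": response.url, "title": response.css("h1::text").get()}
```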
Chapter 13 Using ElasticSearch
This chapter explains how to install and use ElasticSearch, introduces its basic concepts, and covers its API. You will also learn how to use elasticsearch-dsl and how to save data to ElasticSearch from a Scrapy pipeline. A mapping-plus-pipeline sketch follows the lesson list.
13-1 Introduction to ElasticSearch
13-2 Installing ElasticSearch
13-3 Installing the elasticsearch-head plugin and Kibana
13-4 Basic concepts of ElasticSearch
13-5 Inverted index
13-6 ElasticSearch basic index and document CRUD operations
13-7 Batch operations in ElasticSearch with mget and bulk
13-8 ElasticSearch Mapping Management
13-9 Simple query for ElasticSearch – 1
13-10 Simple query for ElasticSearch – 2
13-11 Querying the bool combination of ElasticSearch
13-12 Writing data from Scrapy to ElasticSearch - 1
13-13 Writing data from Scrapy to ElasticSearch - 2
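A sketch of saving items to ElasticSearch with elasticsearch-dsl from a Scrapy pipeline, assuming a local ES node. The `ArticleDoc` mapping, its field names, and the `ik_max_word` analyzer (which requires the IK Chinese-analysis plugin) are assumptions of this sketch, and it uses the newer `Document` API (older elasticsearch-dsl releases used `DocType`).

```python
from elasticsearch_dsl import Document, Keyword, Text, connections

# one shared connection used by the documents and the pipeline below
connections.create_connection(hosts=["localhost"])


class ArticleDoc(Document):
    """elasticsearch-dsl mapping sketch for a crawled article."""
    title = Text(analyzer="ik_max_word")      # assumes the IK plugin is installed
    content = Text(analyzer="ik_max_word")
    url = Keyword()

    class Index:
        name = "article"                      # placeholder index name


class ElasticsearchPipeline:
    """Scrapy pipeline sketch: turn each item into an ES document and save it."""

    def open_spider(self, spider):
        ArticleDoc.init()                     # create the index and mapping if missing

    def process_item(self, item, spider):
        doc = ArticleDoc(
            title=item.get("title"),
            content=item.get("content"),
            url=item.get("url"),
        )
        doc.save()
        return item
```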