
Preface:

This handout is part of an advanced web crawler course. By examining the main techniques used in various crawlers, it explains in detail the problems web crawlers face today and their solutions. The examples capture real data from mainstream Chinese sites such as Douban, Cat's Eye (Maoyan) Movies, and Toutiao; as time passes and the target websites are updated, some code may no longer work as-is, so please bear that in mind.

A simple definition of a crawler

A web crawler (also known as a web spider or web robot, and in the FOAF community more often called a web chaser) is a program or script that automatically fetches information from the World Wide Web according to certain rules. Other, less commonly used names include ant, auto-indexer, emulator, and worm.

Simply put: a program written in advance to fetch the data you need from the network is called a web crawler.

Classification of crawlers

Web crawlers can be divided into general-purpose web crawlers (such as those used by search engines, which continuously fetch data starting from a set of seed URLs) and focused web crawlers (which selectively fetch pages related to predefined topics).

A. General-purpose web crawlers:

Crawling is the first step of a search engine. A search engine's crawler is a program that broadly collects information from all kinds of web pages. Besides HTML files, search engines usually also crawl and index a variety of text-based file types, such as TXT, Word, and PDF. Non-text content such as images and video is generally not processed, and neither are scripts or most web applications.

B. Focused web crawlers:

A program that captures data in a specific domain, such as travel sites, financial sites, or recruitment sites. A focused crawler for a specific domain uses whatever techniques are needed to extract the information we want, so for dynamic pages it may still execute scripts to make sure the data on the site can be captured.

Why do we need crawlers

Uses of crawlers

A. Solving the cold-start problem: for many social sites, cold start is very difficult. To retain newly registered users, a community atmosphere is needed, so a batch of fake users is often injected, and their content is generally collected from Weibo or other apps by web crawlers. Toutiao and other Internet media companies were among the earliest to combine crawlers with page-ranking technology, so their approach to cold start also relied on crawlers.

B. The foundation of search engines: there is no search engine without a crawler;

C. Building knowledge graphs and helping construct training sets for machine learning;

D. Price comparison and trend analysis for all kinds of commodities, etc.;

E. Others: such as analyzing competitors' data on Taobao, analyzing how information spreads on Weibo, government public-opinion analysis, mapping relationships between people, and so on.

In a word: in today's era of big data, any valuable analysis starts with data, and crawling is a low-cost, high-return way of obtaining it. For students, another important benefit is that it helps with employment.

How to write a crawler

Writing a crawler in Python is very easy; in an interactive environment it can be as simple as two lines of code.
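The original two-line snippet is not reproduced here; a minimal sketch of what it might look like, assuming the requests library is installed and using Douban purely as an illustrative URL:

```python
import requests                                            # third-party HTTP library
print(requests.get("https://www.douban.com/", timeout=10).text)  # fetch a page and print its HTML
```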

Is being a crawler engineer really that easy?

Of course not. Let's look at the knowledge and skills a crawler engineer needs:

The crawler engineer's career path, and the technologies web crawling involves:

Junior Crawler Engineer:

  1. Web front-end knowledge: HTML, CSS, JavaScript, DOM, DHTML, Ajax, jQuery, JSON, etc.
  2. Regular expressions: be able to extract the information you want from ordinary web pages, such as specific text or link information, and know the difference between lazy and greedy matching.
  3. Be able to use re, BeautifulSoup, XPath, etc. to extract node information from the DOM structure (see the sketch after this list).
  4. Know what depth-first and breadth-first crawling are, and apply these strategies in practice.
  5. Be able to analyze the structure of simple websites and use the urllib or Requests library for simple data fetching.
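A minimal sketch of the kind of fetching and DOM extraction mentioned in items 3 and 5, assuming requests and beautifulsoup4 are installed and using Douban Movies only as an example URL:

```python
import requests
from bs4 import BeautifulSoup

# Fetch a page (the URL is illustrative) and parse the returned HTML.
resp = requests.get("https://movie.douban.com/", timeout=10)
resp.encoding = resp.apparent_encoding

soup = BeautifulSoup(resp.text, "html.parser")

# Walk the DOM and print the text and href of every link node.
for a in soup.find_all("a"):
    print(a.get_text(strip=True), a.get("href"))
```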

Intermediate Crawler Engineer:

  1. Understand hashing and be able to use simple algorithms such as MD5 and SHA1 to hash data for storage.
  2. Be familiar with the basics of the HTTP and HTTPS protocols: understand the GET and POST methods and the information in HTTP headers, including status codes, encoding, User-Agent, cookies, sessions, etc.
  3. Be able to set the User-Agent, configure proxies, and so on when fetching data.
  4. Know what a Request and a Response are, and be able to use tools such as Fiddler and Wireshark to capture and analyze simple network packets. For dynamic sites, learn to analyze Ajax requests, simulate POST requests, and capture client session information; for some simple websites this allows automatic login.
  5. For harder websites, learn to use PhantomJS + Selenium to capture dynamic page content.
  6. Concurrent downloading: speed up data capture through parallel downloads and multithreading (see the sketch after this list).
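A minimal sketch covering items 1, 3, and 6: custom headers (and optionally a proxy), an MD5 hash of the URL as a storage key, and a small thread pool for parallel downloads. The URLs and the commented-out proxy address are illustrative assumptions, not part of the original text.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

import requests

# Set a browser-like User-Agent; a proxy could be added the same way.
HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
# PROXIES = {"http": "http://127.0.0.1:8080", "https": "http://127.0.0.1:8080"}  # hypothetical proxy

def fetch(url):
    resp = requests.get(url, headers=HEADERS, timeout=10)  # add proxies=PROXIES if needed
    # The MD5 of the URL gives a compact, fixed-length key for storing the page.
    key = hashlib.md5(url.encode("utf-8")).hexdigest()
    return key, resp.status_code, len(resp.content)

urls = ["https://movie.douban.com/", "https://www.toutiao.com/"]  # example targets

# Download several pages in parallel with a small thread pool.
with ThreadPoolExecutor(max_workers=4) as pool:
    for key, status, size in pool.map(fetch, urls):
        print(key, status, size)
```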

Senior Crawler Engineer:

  1. Be able to use Tesseract, various image-processing techniques, CNNs, Baidu AI, and other libraries for CAPTCHA recognition.
  2. Be able to use data-mining techniques and classification algorithms to avoid dead links, etc.
  3. Be able to use common databases for data storage and querying, such as MongoDB and Redis (as a cache for large data sets); understand download caching and how it avoids repeated downloads; know how to use a Bloom filter (see the sketch after this list).
  4. Be able to use machine-learning techniques to dynamically adjust the crawler's crawling strategy and avoid having its IP addresses or accounts banned.
  5. Be able to use open-source frameworks such as Scrapy to build distributed crawlers, and deploy and control them for large-scale data capture.
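A minimal sketch of the download cache idea in item 3, assuming a local Redis instance on localhost:6379 and the redis-py package; the URL and TTL are illustrative.

```python
import hashlib

import redis
import requests

# Pages are keyed by the MD5 of their URL and stored in Redis,
# so a URL that was downloaded recently is not fetched again.
r = redis.Redis(host="localhost", port=6379, db=0)

def cached_get(url, ttl=3600):
    key = "page:" + hashlib.md5(url.encode("utf-8")).hexdigest()
    cached = r.get(key)
    if cached is not None:
        return cached.decode("utf-8")          # cache hit: skip the repeated download
    html = requests.get(url, timeout=10).text  # cache miss: fetch the page
    r.set(key, html, ex=ttl)                   # store it with an expiry time
    return html

print(len(cached_get("https://movie.douban.com/")))
```

A Bloom filter serves a similar purpose for URL deduplication at much larger scale, trading a small false-positive rate for far lower memory use than storing every seen URL.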