Task 1. Getting to know web crawlers: a fresh look at what a crawler is

1.1. What is a crawler

1.1.1. Differences between Web crawlers and browsers

A traditional crawler starts from the URLs of one or more initial web pages and obtains the URLs found on those pages. As it crawls, it continuously extracts new URLs from the current page and puts them into a queue, until certain stop conditions of the system are met.

The workflow of a focused crawler is more complicated: it must filter out links irrelevant to the topic according to a certain web page analysis algorithm, keep the useful links, and put them into the URL queue waiting to be crawled. It then selects the next URL to crawl from the queue according to a certain search strategy, and repeats the process above until a certain stop condition of the system is reached.
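As a concrete illustration, here is a minimal sketch of the queue-driven crawl loop described above. It is not a production implementation: it assumes the third-party requests library is installed, uses the standard library's html.parser to pull links out of a page, and is_relevant is a hypothetical placeholder for the topic filter of a focused crawler.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

import requests  # third-party library: pip install requests


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def is_relevant(url):
    """Hypothetical topic filter of a focused crawler; accept everything here."""
    return True


def crawl(seed_urls, max_pages=10):
    queue = deque(seed_urls)        # URLs waiting to be crawled
    seen = set(seed_urls)           # do not visit the same URL twice
    while queue and max_pages > 0:  # stop condition: page budget exhausted
        url = queue.popleft()       # search strategy: FIFO, i.e. breadth-first
        max_pages -= 1
        try:
            response = requests.get(url, timeout=10)
        except requests.RequestException:
            continue                # skip pages that cannot be fetched
        extractor = LinkExtractor()
        extractor.feed(response.text)
        for link in extractor.links:
            absolute = urljoin(url, link)  # resolve relative links
            if absolute not in seen and is_relevant(absolute):
                seen.add(absolute)
                queue.append(absolute)     # new URLs go back into the queue


if __name__ == "__main__":
    crawl(["https://www.example.com"])
```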

In addition, all pages fetched by the crawler are stored by the system for later analysis, filtering, and index building, so as to facilitate subsequent query and retrieval.

So the specific difference is this: a browser displays data to the user, while a web crawler collects data.

1.1.2. Definition and role of web crawlers

Definition of web crawler

A web crawler (also known as a web spider or web robot) is a program or script that automatically captures information from the World Wide Web according to certain rules. It simulates a client sending network requests and obtaining the response data.
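As a small illustration of "simulating a client to send a network request and obtain the response data", the sketch below uses the third-party requests library (an assumed choice; the text has not named a specific library). The User-Agent header is set so the request looks like one sent by an ordinary browser.

```python
import requests  # third-party library: pip install requests

# Simulate a client (browser) sending an HTTP GET request.
response = requests.get(
    "https://www.baidu.com",
    headers={"User-Agent": "Mozilla/5.0"},  # present ourselves as a regular browser
    timeout=10,
)

print(response.status_code)                  # HTTP status code of the response
print(response.headers.get("Content-Type"))  # type of the returned content
print(response.text[:200])                   # first 200 characters of the body
```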

The role of web crawlers

Get the information we need from the World Wide Web

1.1.3. Composition of web pages

First, let’s look at the basic composition of a web page, which can be divided into three parts: HTML, CSS and JavaScript.

If you compare a web page to a person, HTML is the skeleton, JavaScript is the muscle, CSS is the skin, and all three are combined to form a complete web page.

HTML

HTML is a language used to describe web pages. Its full name is HyperText Markup Language.

HTML is what defines the text, buttons, images, videos, and other elements we see on web pages. Different types of elements are represented by different tags: images use the img tag, videos use the video tag, and paragraphs use the p tag. The layout between elements is usually handled by nesting them inside div tags. By arranging and nesting the various tags in different ways, you build up the skeleton of a web page.
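To see these tags the way a program sees them, here is a minimal sketch that uses only the standard library's html.parser to list the tags of a tiny, made-up page (the page content is purely illustrative):

```python
from html.parser import HTMLParser  # standard library

# A tiny illustrative page: a div lays out a paragraph, an image and a video.
PAGE = """
<div class="container">
  <p>Hello, crawler!</p>
  <img src="logo.png">
  <video src="intro.mp4"></video>
</div>
"""


class TagLister(HTMLParser):
    """Print every opening tag and its attributes, i.e. the page's skeleton."""

    def handle_starttag(self, tag, attrs):
        print(tag, dict(attrs))


TagLister().feed(PAGE)
# Prints: div, p, img and video, each with its attributes
```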

Open Baidu in Chrome, right-click and choose Inspect (or press F12) to open the developer tools, and you can see the source code of the page in the Elements tab, as shown in the figure below.

HTML tutorials can be found at www.runoob.com/html/html-t… (Chrome is recommended for development)

CSS

HTML defines the structure of a web page, but HTML alone does not make a page attractive; it is only a bare arrangement of node elements. To make a web page look good, CSS is needed.

CSS stands for Cascading Style Sheets. “Cascading” means that when an HTML page references several style files and their styles conflict, the browser resolves them in cascading order. “Style” refers to text size, color, element spacing, arrangement, and so on.

CSS is currently the standard technology for web page layout and styling; it is what makes pages look good.

The main CSS tutorial can be found at www.runoob.com/css/css-tut…

JavaScript

JavaScript, or JS for short, is a scripting language. HTML and CSS together only provide users with static information and lack interactivity.

We may see some interactive and animated effects on a web page, such as download progress bars, prompt boxes, and scrolling graphics, which are usually the result of JavaScript.

With JavaScript, the relationship between the user and the page is no longer just browsing static content: pages become real-time, dynamic, and interactive. In summary, HTML defines the content and structure of a web page, CSS describes its layout and appearance, and JavaScript defines its behavior.

JavaScript main reference tutorials can be found at www.runoob.com/js/js-tutor…

1.1.4. Robots protocol

Basic Robots Protocol

The Robots protocol, i.e. the Robots Exclusion Standard, is the web crawler exclusion standard.

What it does: a website tells web crawlers which pages may be crawled and which may not.

Format: robots.txt file in the root directory of the website

Example: Jingdong Robots protocol www.jd.com/robots.txt

You can see JD's restrictions on crawlers:

Any web crawler, regardless of source, must follow the rules below:

    User-agent: *
    Disallow: /?*
    Disallow: /pop/*.html
    Disallow: /pinpai/*.html?*

The following four web crawlers are not allowed to crawl any resources:

    User-agent: EtaoSpider
    Disallow: /
    User-agent: HuihuiSpider
    Disallow: /
    User-agent: GwdangSpider
    Disallow: /
    User-agent: WochachaSpider
    Disallow: /

Basic protocol syntax:

    #              indicates a comment
    *              represents all
    /              represents the root directory
    User-agent: *  # which crawlers the rules apply to
    Disallow: /    # directories the crawler is not allowed to access
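As a sketch of how a crawler can check these rules automatically, the standard library provides urllib.robotparser; the example below reads JD's robots.txt and asks whether a given URL may be fetched (the expected results follow from the rules quoted above):

```python
from urllib.robotparser import RobotFileParser  # standard library

rp = RobotFileParser()
rp.set_url("https://www.jd.com/robots.txt")
rp.read()  # download and parse the robots.txt file

# can_fetch(user_agent, url): may this crawler visit this URL?
print(rp.can_fetch("*", "https://www.jd.com/"))           # a path with no Disallow rule
print(rp.can_fetch("EtaoSpider", "https://www.jd.com/"))  # EtaoSpider is banned entirely
```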

Robots protocols of some other sites (note that not all sites have a Robots protocol):

Baidu: www.baidu.com/robots.txt
Sina News: news.sina.com.cn/robots.txt
Tencent: www.qq.com/robots.txt
Tencent News: news.qq.com/robots.txt
Ministry of Education: www.meo.edu.cn/robots.txt (note: no Robots protocol)

How to comply with Robots protocols

(1) Use of the Robots protocol

A web crawler should automatically or manually identify robots.txt and then crawl content accordingly.

Binding force: the Robots protocol is advisory rather than binding, so a web crawler may ignore it, but doing so carries legal risk.

(2) Understanding the Robots protocol

Crawling individual web pages, exploring the web:

  • Low traffic: may comply
  • High traffic: compliance recommended

Crawling a website or a series of websites:

  • Non-commercial and occasional: compliance recommended
  • Commercial interests: must comply

Crawling the entire web:

  • Must comply

1.2. Knowledge to learn and problems to solve for a Python web crawler

1. Basic knowledge of Python
2. Basic knowledge of web page structure
3.