When learning to write a crawler, the first task is to simulate a browser sending a request to a server. So where should we start? Do we need to construct the request ourselves? Do we need to care about how the underlying data structures are implemented? Do we need to understand HTTP, TCP, and IP-layer network communication? Do we need to know how the server parses and responds to a request?

You may not know where to start, but don't worry: the power of Python lies in its full-featured libraries that handle these requests for us. The most basic HTTP libraries include urllib, httplib2, requests, treq, and so on.

With urllib, for example, we only need to care about the URL of the request, the parameters to pass, and optional request header settings, without delving into how the request is transmitted and communicated at the lower layers. Being able to complete a request, get the response, and read the web content in just two lines of code feels quite convenient, doesn't it?
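For instance, a minimal sketch of that two-line flow might look like this (https://www.python.org is used purely as an example target):

```python
import urllib.request

# Send a GET request and read the response body -- two core lines.
response = urllib.request.urlopen('https://www.python.org')
print(response.read().decode('utf-8'))
```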

Let’s start with the basics of how to use these libraries.

In Python 2, there were two libraries for sending requests: urllib and urllib2. In Python 3, urllib2 no longer exists; everything has been unified into urllib. The official documentation is at docs.python.org/3/library/u… .

First, take a look at the urllib library, which is Python's built-in HTTP request library, meaning no additional installation is required to use it. It contains the following four modules.

  • request: The most basic HTTP request module, used to simulate sending a request. Just as you type a URL into the browser's address bar and hit Enter, you can simulate this process by passing the URL and additional parameters to the library's methods.
  • error: An exception handling module. If a request fails, we can catch the exception and then retry or take other action, ensuring the program does not terminate unexpectedly (a brief sketch follows this list).
  • parse: A utility module that provides many URL handling methods, such as splitting, parsing, and merging.
  • robotparser: Mainly used to parse a site's robots.txt file and determine which pages may and may not be crawled; in practice it is rarely needed.
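Here is a minimal sketch of the error module catching the two main exception classes; the URL and timeout value are arbitrary examples:

```python
import urllib.error
import urllib.request

try:
    # The URL below is only an example; any request can fail.
    response = urllib.request.urlopen('https://www.python.org/nonexistent', timeout=5)
except urllib.error.HTTPError as e:
    # The server answered with an error status such as 404 or 500.
    print('HTTP error:', e.code, e.reason)
except urllib.error.URLError as e:
    # The request itself failed, e.g. a DNS error or a timeout.
    print('URL error:', e.reason)
else:
    print('Status:', response.status)
```

Note that HTTPError is a subclass of URLError, which is why it is caught first.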

We will focus on the first three modules here.
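As a quick taste of the parse module, here is a small sketch of its splitting, merging, and serialization helpers (the URLs and parameters are made up for illustration):

```python
from urllib.parse import urlparse, urljoin, urlencode

# Split a URL into its components.
result = urlparse('https://www.example.com/index.html;user?id=5#comment')
print(result.scheme, result.netloc, result.path)

# Resolve a relative link against a base URL.
print(urljoin('https://www.example.com/a/b.html', 'c.html'))
# -> https://www.example.com/a/c.html

# Serialize a dict into a query string.
params = urlencode({'name': 'germey', 'age': 25})
print('https://www.example.com?' + params)
```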


This resource was first published on Cui Qingcai's personal blog, Jingmi: Python3 Web Crawler Development Practical Tutorial.

For more crawler information, please follow my personal WeChat official account: Attack Coder

Weixin.qq.com/r/5zsjOyvEZ… (QR code; long-press to recognize it automatically)