
The following article comes from Tencent Cloud; the author is Smash leak.




1. What is a crawler

A crawler, that is, a web crawler, can be thought of as a spider crawling over a network. The Internet is like a big web, and the crawler is a spider crawling around on it; whenever it encounters a resource, it grabs it. What it grabs is up to you.

Let’s say it is crawling one web page and finds a path in it, which is actually a hyperlink to another page; it can then follow that link to the next page and collect its data. In this way, the whole web is within the spider’s reach, and it can crawl it all in short order.
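To make this link-following idea concrete, here is a minimal sketch using only Python’s standard library. The start URL, the page limit, and the regular expression used to find links are all illustrative choices, not something from the original article.

```python
import re
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen

def crawl(start_url, max_pages=5):
    """Follow hyperlinks breadth-first, visiting at most max_pages pages."""
    seen, queue = set(), deque([start_url])
    while queue and len(seen) < max_pages:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", errors="ignore")
        except Exception:
            continue  # skip pages that cannot be fetched
        # Every href is a path to another page; make it absolute and queue it.
        for href in re.findall(r'href="([^"]+)"', html):
            queue.append(urljoin(url, href))
    return seen

# print(crawl("https://example.com"))  # hypothetical start URL
```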

2. The process of browsing the web

While browsing the web, users see many beautiful pictures, for example on image.baidu.com/, where a few pictures and the Baidu search box appear. What actually happens is this: after the user enters the URL, the browser finds the server host through a DNS server and sends it a request; the server parses the request and sends back HTML, JS, CSS and other files; the browser renders them, and the user sees the page with all its pictures.

Therefore, the page users see is essentially built from HTML code, and that code is what a crawler fetches. By analyzing and filtering the HTML, the crawler can extract images, text and other resources.
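As one illustration of “analyzing and filtering these HTML codes”, the sketch below uses the standard library’s html.parser to collect the src attribute of every img tag on a page; example.com here stands in for whatever real page you fetch.

```python
from html.parser import HTMLParser
from urllib.request import urlopen

class ImageCollector(HTMLParser):
    """Collects the src attribute of every <img> tag it sees."""
    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.images.append(src)

html = urlopen("https://example.com").read().decode("utf-8", errors="ignore")
collector = ImageCollector()
collector.feed(html)
print(collector.images)  # the image addresses filtered out of the HTML
```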

3. The meaning of URL

A URL, or Uniform Resource Locator, is what we commonly call a web address. It is a concise representation of the location of a resource available on the Internet and of how to access it, and it is the standard address of a resource on the Internet. Every file on the Internet has a unique URL, which encodes where the file is and what the browser should do with it.

The format of a URL consists of three parts:

① The first part is the protocol (or service scheme).

② The second part is the IP address (and sometimes port number) of the host that holds the resource.

③ The third part is the specific address of the host resource, such as directory and file name.

A crawler must have a target URL before it can fetch data; the URL is therefore the basic starting point of any crawl, and understanding it precisely is a great help when learning to write crawlers. A quick way to see the three parts is to parse a URL with the standard library, as in the sketch below.
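For example, urllib.parse.urlparse splits a URL into exactly the parts listed above (the sample URL here is made up for illustration):

```python
from urllib.parse import urlparse

parts = urlparse("http://www.example.com:8080/images/logo.png")
print(parts.scheme)  # 'http'                  -> part 1: the protocol
print(parts.netloc)  # 'www.example.com:8080'  -> part 2: the host (and port)
print(parts.path)    # '/images/logo.png'      -> part 3: the resource's location on the host
```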

4. Configuration of the environment

I used Notepad++ at first, but found its code-completion hints too weak, so I switched to PyCharm on Windows and Eclipse for Python on Linux. There are several other excellent IDEs as well; refer to the article on recommended Python IDEs to learn more. Good development tools are propellers for progress, and I hope you find the IDE that suits you.

Addendum:

What is a crawler and the basic logic of a crawler

“Crawler” is a figurative term. The Internet is likened to a web, and a crawler is a program or script that crawls along it. When it encounters a bug (a resource), it grabs or downloads it if needed; such resources are usually web pages, files and so on. The URL links inside a resource can then be followed to crawl the resources they point to.

You can also think of a crawler as simulating the way we normally browse the Web: open the page, then analyze its content to get what we want.

So this involves the HTTP transfer protocol and other related knowledge.

When we open a web page, we are basically opening a Url link, and a lot actually happens during that process.

When you open a Url link, the browser automatically sends a Request to the server behind that Url, telling it: I need to access the content at this Url, please return the data to me. The server processes the request and returns the result to the browser in a Response.

The crawler needs to simulate this process. Following the HTTP protocol, it constructs a Request, sends it to the target server (identified by a Url link), and then waits for the server’s Response.

All the relevant data is in that response, and this is the basic logic of a crawler implementation; a minimal version of the cycle is sketched below.
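As a rough sketch of this Request/Response cycle, the snippet below builds a Request with the standard library’s urllib, sends it, and reads the Response. The target URL and the User-Agent value are placeholders, not part of the original article.

```python
from urllib.request import Request, urlopen

# Construct the Request; some servers reject clients that send no User-Agent.
req = Request("https://example.com", headers={"User-Agent": "Mozilla/5.0"})

# Send it and wait for the server's Response.
with urlopen(req, timeout=10) as resp:
    print(resp.status)                                   # e.g. 200
    body = resp.read().decode("utf-8", errors="ignore")  # the returned HTML

print(body[:200])  # the data the crawler will go on to parse
```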

Those are the basics of getting started with a Python crawler.