Assuming you already meet the prerequisites above, let's build our first crawler. First we need a target page. We'll use China Weather Network: http://www.weather.com.cn/weather1d/101280101.shtml#dingzhi_first

Make sure you have the Requests and BeautifulSoup4 libraries installed. If not, open CMD (the command prompt) and type:

```
pip3 install requests
pip3 install beautifulsoup4
pip3 install lxml
```

After installation, open your editor. Don't agonize over which editor to use here; any one you are comfortable with will do.

The first step of any crawler is to fetch the entire content of the target page, that is, its HTML. So we need to write a function that retrieves the HTML of a web page.

The way I structure the whole crawler is to split each piece of functionality into its own function, and then call them all at the end (see the skeleton sketch after the list below).

So the question is, why?

There are a few things to think about when writing code as a novice:

1. Reusability: can this code be reused elsewhere?

2. Decoupling: are the code's semantics and functionality cleanly separated?

3. Readability: is it clean and concise, so that anyone reading your code can clearly follow your logic?
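To make this concrete, here is a minimal skeleton of the structure this tutorial builds toward. The name `get_html` matches the function defined below; `get_content` is my placeholder name for the parsing function described later, and the overall layout is just one reasonable way to organize a small crawler:

```python
import requests
import bs4


def get_html(url):
    """Fetch the raw HTML of a page (filled in below)."""
    ...


def get_content(url):
    """Parse the HTML and extract the weather data we want (filled in below)."""
    ...


if __name__ == '__main__':
    url = 'http://www.weather.com.cn/weather1d/101280101.shtml#dingzhi_first'
    print(get_content(url))
```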

Code display: the complete listing is attached at the end of this article. A few places in it deserve explanation. The file starts like this:

```python
"""Grab daily weather data.

Python 3.6.2
url: http://www.weather.com.cn/weather1d/101280101.shtml#dingzhi_first
"""
import requests
import bs4
```

The comments at the beginning of the file record what this Python file does, which Python version it was written for, and which URL it crawls, helping you quickly recall the file's purpose the next time you open it.

Since Requests and BeautifulSoup4 are third-party libraries, they are brought in with import statements, as shown above.

Next, let's define the function that fetches the page:

```python
def get_html(url):
```

This constructs a function called get_html that takes the URL you want to request and returns the result of the request.

Once it is defined, you call it directly:

```python
url = 'your URL here'
get_html(url)
```

Remember to document what your function does. The headers variable holds some request headers that disguise the crawler as ordinary browser traffic; you can copy them and use them directly.

Why do this basic masquerading as a browser? Because where there are crawlers, there are naturally anti-crawler measures. To guard against malicious crawlers scraping at will or attacking, some websites invest heavily in anti-crawling defenses. Masquerading as a browser is a small first step around them.
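A minimal stand-in for such headers, assuming a typical desktop User-Agent string (any current browser's UA value works the same way), might look like this:

```python
# Request headers that make the crawler look like a regular desktop
# browser. The User-Agent value here is an assumed example, not the
# exact one from the original article.
headers = {
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/58.0.3029.110 Safari/537.36'),
}
```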

Okay, let's move on to the body of the function:

```python
html_content = requests.get(url, headers=headers, timeout=30)
html_content.raise_for_status()
html_content.encoding = 'utf-8'
return html_content.text
```

The first line sends the request with Requests: it fetches the URL you passed in, attaching the request headers and a 30-second timeout.

The second line, raise_for_status(), raises an exception if the server returns an HTTP error status code; this is how we verify that we got the successful response we want.

The third line sets the decoding format. Inspecting this website shows that its encoding is UTF-8, so I simply set 'utf-8' here.

Finally, if everything is OK, we return the text of the page.
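Putting the pieces together, the whole fetch function reads like this (a sketch assembled from the snippets above; the User-Agent is the assumed example from earlier):

```python
def get_html(url):
    """Request a page and return its decoded HTML text."""
    headers = {
        'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                       'AppleWebKit/537.36 (KHTML, like Gecko) '
                       'Chrome/58.0.3029.110 Safari/537.36'),
    }
    html_content = requests.get(url, headers=headers, timeout=30)
    html_content.raise_for_status()   # fail loudly on HTTP errors
    html_content.encoding = 'utf-8'   # the site serves UTF-8
    return html_content.text
```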

### Step 2:

Once we can fetch the page, we need to look at its HTML structure.

Here's how to view the structure of a web page: press F12, or right-click a blank area and choose Inspect.

The developer tools will show you the page's raw markup. Yes, that's HTML, and our crawler's job is to extract the content we need from those tags.

Now we want to capture the weather data for the night of the 1st and the day of the 2nd:

Let's start by examining the web page's structure to find how the data is wrapped.

It is clear that their HTML nesting logic looks like this:
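Here is a simplified sketch of the relevant markup, using the class names that the parsing code below relies on; the tag choices inside each `li` are assumptions based on how the page was laid out at the time, and the real page carries more attributes and items:

```html
<!-- Simplified sketch; the class names match the soup.find calls used later. -->
<div class="t">
  <ul>
    <li>
      <h1>1日夜间</h1>
      <p class="tem">...</p>
    </li>
    <li>
      <h1>2日白天</h1>
      <p class="tem">...</p>
    </li>
  </ul>
</div>
```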

All the content we want is wrapped in `li` tags, so next we use BeautifulSoup's find methods to extract it.

Let's go ahead and build a function that grabs the content of the web page. Since we end up with two pieces of data, I declare a list called weather_list to save the results.

The code works as follows (a full sketch appears below):

I won't repeat what was said before, but do write comments. After declaring the list, we call the request function we wrapped earlier to fetch the URL we want and get the page text back, then parse that page with BeautifulSoup4 using the lxml parser.

You can use

```python
soup = bs4.BeautifulSoup(html, 'lxml')
print(soup)
```

and the entire HTML structure will appear in front of you. Then we can locate the information we want based on the tag structure:

```python
content_ul = soup.find('div', class_='t').find_all('li')
```

If you check the documentation, find_all returns a list-like structure containing every matching `li`.

Since it behaves like a list, we can iterate over it and build a dictionary for each item, using 'day' and 'temperature' as the keys.
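A minimal sketch of that loop, assuming the day label sits in each `li`'s `h1` tag and the temperature in its `p class="tem"` element (verify both against the live page in the developer tools):

```python
weather_list = []
for li in content_ul:
    weather = {}
    # Tag and class names below are assumptions based on the page
    # layout at the time; check them in DevTools before relying on them.
    weather['day'] = li.find('h1').get_text(strip=True)
    weather['temperature'] = li.find('p', class_='tem').get_text(strip=True)
    weather_list.append(weather)
print(weather_list)
```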

The final output is our list of day/temperature dictionaries.

The complete code is attached below:
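This is a runnable reconstruction assembled from the snippets above; the User-Agent string and the tag/class names inside the parsing loop are the assumed examples noted earlier, not verified against the original listing:

```python
"""Grab daily weather data.

Python 3.6.2
url: http://www.weather.com.cn/weather1d/101280101.shtml#dingzhi_first
"""
import requests
import bs4


def get_html(url):
    """Request a page and return its decoded HTML text."""
    headers = {
        'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                       'AppleWebKit/537.36 (KHTML, like Gecko) '
                       'Chrome/58.0.3029.110 Safari/537.36'),
    }
    html_content = requests.get(url, headers=headers, timeout=30)
    html_content.raise_for_status()
    html_content.encoding = 'utf-8'
    return html_content.text


def get_content(url):
    """Parse the page and collect day/temperature pairs."""
    weather_list = []
    html = get_html(url)
    soup = bs4.BeautifulSoup(html, 'lxml')
    content_ul = soup.find('div', class_='t').find_all('li')
    for li in content_ul:
        weather = {}
        # Assumed tag/class names; confirm them in DevTools.
        weather['day'] = li.find('h1').get_text(strip=True)
        weather['temperature'] = li.find('p', class_='tem').get_text(strip=True)
        weather_list.append(weather)
    return weather_list


if __name__ == '__main__':
    url = 'http://www.weather.com.cn/weather1d/101280101.shtml#dingzhi_first'
    print(get_content(url))
```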

Source: Victor 278, https://zhuanlan.zhihu.com/p/30632556