1. Analyze the structure of web pages

The previous several articles introduced crawling traditional static pages; in this post, the blogger presents a super simple little demo of crawling a dynamic web page.

Speaking of dynamic web pages, how much do you know about them?

If you are not familiar with dynamic web pages, the blogger gives links here: you can read the detailed analysis in the Baidu Encyclopedia entry on dynamic web pages, as well as an article on the difference between static and dynamic pages.

Don’t blame the blogger for not explaining it here; the blogger himself is not yet very familiar with the concept of dynamic web pages. Once he has collected his thoughts, he will write a dedicated blog post about it.

Simply put, to get the data of a static web page, you just need to request the page’s URL from the server, while the data of a dynamic web page is stored in a back-end database and loaded separately. So to get the data of a dynamic web page, we need to send the URL of the data request (the interface) to the server, not the URL of the web page itself.
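To make that concrete, here is a minimal sketch of the idea, using the weather interface that appears later in this post. (In practice the site may require browser-like headers, which the code later in this post adds.)

```python
import requests

# Requesting the page URL returns HTML scaffolding with no data in it:
page = requests.get("https://www.amap.com/")
print(page.text[:200])  # mostly empty <div> tags

# Requesting the data interface URL returns the actual data as JSON:
api = requests.get("https://www.amap.com/service/weather?adcode=410700")
print(api.json())       # the weather data the page loads dynamically
```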

🆗, let’s get to the point.

This blog post takes Amap as its example: www.amap.com/

When we open it and view the page source, we find a bunch of div tags but none of the data we need. At this point we can conclude that it is a dynamic web page, and we need to find the data interface.

Clicking on the Network tab, we can see that the page sends a lot of requests to the server. There is so much data that searching through all of it would take too long.

By clicking on XHR, we can filter out the irrelevant files and save a lot of time.

The XHR type covers requests sent through XMLHttpRequest, which can exchange data with the server in the background. This means a page can update a portion of itself without reloading the entire page. In other words, data that is requested from the database and then returned in a response shows up as type XHR.

Then we can start looking through the XHR requests one by one, and we find the following data.

Get the URL by looking at Headers

When we opened it, we found that it contained two days of weather data.

Opening it, we see the above: a JSON file. Its data is stored in the form of a dictionary, with the payload under the “data” key.

🆗, we have found the JSON data; let’s compare it against the page to see if it is what we are looking for.

By comparison, the numbers match, which means we have found the data we need.

2. Get the relevant URLs

"' the weather query the current site url:https://www.amap.com/service/cityList?version=2020101417 cities corresponding code url: https://www.amap.com/service/weather?adcode=410700 note: both the url can see from the Network to the "' 123456Copy the code

🆗, now that we have the relevant URLs, the next step is the concrete code implementation. So how do we do that?

We know that JSON data can be converted to a dictionary using response.json(), and then we can work with it like any other dictionary.
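As a quick illustration, here is a minimal sketch of what response.json() does; it is roughly equivalent to parsing the body text yourself. (The real requests in this post also send browser-like headers, which may be required, and this assumes the response carries the “data” key described above.)

```python
import json
import requests

response = requests.get("https://www.amap.com/service/weather?adcode=410700")

content = response.json()          # parses the JSON body into a Python dict
same = json.loads(response.text)   # roughly equivalent manual parsing
assert content == same

print(content["data"])             # the payload lives under the "data" key
```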

3. Code implementation

Now that we know where the data is, we start writing code.

3.1 Querying all city names and numbers

First, fetch the web page, adding headers to disguise the request as a browser when accessing the data interface, to prevent it from being identified and blocked.

url_city = "https://www.amap.com/service/cityList?version=202092419" headers = { "user-agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36" } city = [] response = requests.get(url=url_city, headers=headers) content = response.json() print(content) 12345678910Copy the code

After obtaining the data, a search shows that the numbers and names live under the cityByLetter key, so let’s grab them. (The snippet below is the body of the get_city() function shown in the complete code later.)

    if "data" in content:
        cityByLetter = content["data"]["cityByLetter"]
        for k,v in cityByLetter.items():
            city.extend(v)
    return city
12345
Copy the code
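To see why city.extend(v) works, here is a hypothetical sketch of the cityByLetter structure; the adcodes and names below are made up for illustration, but the real response groups cities by first letter in the same way.

```python
# Hypothetical shape of cityByLetter (illustrative values, not real data):
cityByLetter = {
    "A": [{"adcode": "340800", "name": "Anqing"}],
    "B": [{"adcode": "110000", "name": "Beijing"}],
}

city = []
for k, v in cityByLetter.items():   # each value v is a list of city dicts
    city.extend(v)                  # extend flattens the per-letter lists

print(city)
# [{'adcode': '340800', 'name': 'Anqing'}, {'adcode': '110000', 'name': 'Beijing'}]
```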

3.2 Querying the weather based on the number

Now that we have the numbers and names, the next step is of course to query the weather!

Let’s look at the interface

From the figure above, you can locate the maximum temperature, minimum temperature, and so on. With that, we can crawl the data.

url_weather = "https://www.amap.com/service/weather?adcode={}"

response = requests.get(url=url_weather.format(adcode), headers=headers)
content = response.json()
item["weather_name"] = content["data"]["data"][0]["forecast_data"][0]["weather_name"]
item["min_temp"] = content["data"]["data"][0]["forecast_data"][0]["min_temp"]
item["max_temp"] = content["data"]["data"][0]["forecast_data"][0]["max_temp"]
print(item)
12345678
Copy the code
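One note on the long indexing chain above: it assumes the response always contains the nested path data → data[0] → forecast_data[0]. A slightly more defensive sketch (same key names as above, taken from the observed response) pulls the forecast out once, so the path is written only once and a missing field yields None instead of a KeyError:

```python
# A more defensive variant (a sketch; key names come from the response above):
forecast = content["data"]["data"][0]["forecast_data"][0]

item["weather_name"] = forecast.get("weather_name")
item["min_temp"] = forecast.get("min_temp")
item["max_temp"] = forecast.get("max_temp")
```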

🆗, our plan has come to fruition.

4. Complete code

```python
# encoding: utf-8
'''
@author: Li Huaxin
@create: 2020-10-06 19:46
Mycsdn: https://buwenbuhuo.blog.csdn.net/
@contact: [email protected]
@software: Pycharm
@file: Amap_Weather for each city.py
@version: 1.0
'''
import requests

url_city = "https://www.amap.com/service/cityList?version=202092419"
url_weather = "https://www.amap.com/service/weather?adcode={}"
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36"
}


def get_city():
    """Query all city names and numbers."""
    city = []
    response = requests.get(url=url_city, headers=headers)
    content = response.json()
    if "data" in content:
        cityByLetter = content["data"]["cityByLetter"]
        for k, v in cityByLetter.items():
            city.extend(v)
    return city


def get_weather(adcode, name):
    """Query the weather for one city by its adcode."""
    item = {}
    item["adcode"] = str(adcode)
    item["name"] = name
    response = requests.get(url=url_weather.format(adcode), headers=headers)
    content = response.json()
    item["weather_name"] = content["data"]["data"][0]["forecast_data"][0]["weather_name"]
    item["min_temp"] = content["data"]["data"][0]["forecast_data"][0]["min_temp"]
    item["max_temp"] = content["data"]["data"][0]["forecast_data"][0]["max_temp"]
    return item


def save(item):
    """Append one record to weather.txt."""
    print(item)
    with open("./weather.txt", "a", encoding="utf-8") as file:
        file.write(",".join(item.values()))
        file.write("\n")


if __name__ == '__main__':
    city_list = get_city()
    for city in city_list:
        item = get_weather(city["adcode"], city["name"])
        save(item)
```

5. Save the results

Happy times are always short. Although I would love to keep chatting with you, this blog post is over. If it was not enough for you, don’t worry, we will see you next time!


This article is a repost; the copyright belongs to the original author. In case of infringement, please contact the editor for removal.