Source code is shared on GitHub
WeChat search: Coder StayUp
Home page: gozhuyinglong.github.io
Source: github.com/gozhuyinglo…
1. Introduction
When building a website, the national administrative region divisions are generally needed for regional data analysis.
Here we use Python to crawl the administrative region data from the authoritative source: the National Bureau of Statistics. The page to crawl is the zoning codes and urban-rural division codes for 2020 statistics.
Here is a question: why does the Bureau of Statistics only provide a web version? Wouldn't a downloadable file be more convenient for the public? Feel free to leave me a message.
2. Website analysis
Before crawling the data, you need to analyze the site to determine which crawling method to use.
2.1 Province Page
A static page whose secondary pages use relative addresses; rows are located by the tr elements with class=provincetr.
2.2 City Page
A static page whose secondary pages use relative addresses; rows are located by the tr elements with class=citytr.
2.3 District and County Page
A static page whose secondary pages use relative addresses; rows are located by the tr elements with class=countytr.
2.4 Town Page
A static page whose secondary pages use relative addresses; rows are located by the tr elements with class=towntr.
2.5 Village Page
A static page with no secondary pages; rows are located by the tr elements with class=villagetr.
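Based on these locating rules, here is a minimal sketch (not the final crawler) that fetches the province index page and lists the anchors found by the class=provincetr selector. The URL is the 2020 index page used later in this article; the GBK encoding is an assumption about how the site serves its pages.

import requests
from bs4 import BeautifulSoup

# Fetch the 2020 province index page and list the anchors inside tr.provincetr rows
url = "http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2020/index.html"
response = requests.get(url, timeout=5)
response.encoding = "GBK"  # assumed encoding, matching the crawler later in this article
soup = BeautifulSoup(response.text, "lxml")
for a in soup.select("tr.provincetr a"):
    # Each anchor's text is a province name; its href is a relative address such as "11.html"
    print(a.text, a.get("href"))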
3. Install required libraries
The above analysis shows that crawling static web pages is sufficient. Here are the libraries that need to be installed in advance: Requests, BeautifulSoup, and lxml.
3.1 Requests
Requests is a Python HTTP client library used to access network resources by URL.
Install the Requests library:
pip install requests
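A minimal Requests sketch (the URL is just the site's home page, used here for illustration):

import requests

# Send a GET request with a timeout, then inspect the response
response = requests.get("http://www.stats.gov.cn/", timeout=5)
print(response.status_code)  # 200 means the request succeeded
print(response.encoding)     # encoding guessed by Requests from the response headers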
3.2 BeautifulSoup
BeautifulSoup is a Python library for extracting data from HTML or XML files. Through a specified parser, it lets you navigate, search, and modify the document tree.
Install BeautifulSoup library:
pip install beautifulsoup4
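A minimal BeautifulSoup sketch; the HTML fragment is made up here to mirror the structure of the province page described above:

from bs4 import BeautifulSoup

# Parse a small, made-up HTML fragment and locate an element by tag and class
html = '<table><tr class="provincetr"><td><a href="11.html">Beijing</a></td></tr></table>'
soup = BeautifulSoup(html, "html.parser")  # "lxml" can be used instead once it is installed
link = soup.select_one("tr.provincetr a")
print(link.text, link.get("href"))  # -> Beijing 11.html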
3.3 LXML
lxml is a Python library for processing XML and HTML quickly and flexibly.
It supports XML Path Language (XPath) and Extensible Stylesheet Language Transformations (XSLT), and implements the common ElementTree API.
Install the LXML library:
pip install lxml
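In this article lxml is only used as the parser backend that BeautifulSoup calls (BeautifulSoup(text, "lxml")), but it can also be used on its own. A small XPath sketch over a made-up fragment:

from lxml import etree

# Parse a made-up HTML fragment and query it with XPath
doc = etree.HTML('<table><tr class="citytr"><td><a href="11/1101.html">110100000000</a></td></tr></table>')
print(doc.xpath('//tr[@class="citytr"]//a/text()'))  # -> ['110100000000']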
4. Code implementation
The crawler is divided into the following steps:
- Use the Requests library to fetch the web pages.
- Use the BeautifulSoup and lxml libraries to parse the web pages.
- Use Python's file I/O to store the data.
The output file is written to the directory containing the .py file; the file name is area-number-2020.txt.
Each output line contains the level, zone code, and name, separated by tabs for easy import into Excel or a database.
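For example, the first few output lines for Beijing would look roughly like this (tab-separated; 110000000000 is Beijing's 12-digit zone code):

1	110000000000	北京市
2	110100000000	市辖区
3	110101000000	东城区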
Here is the detailed code:
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup


# Get the page content by URL and return it as a BeautifulSoup object
def get_html(url):
    # If the page fails to open, retry indefinitely and never give up
    while True:
        try:
            # The timeout period is 1 second
            response = requests.get(url, timeout=1)
            response.encoding = "GBK"
            if response.status_code == 200:
                return BeautifulSoup(response.text, "lxml")
            else:
                continue
        except Exception:
            continue


# Get the address prefix (used to resolve relative addresses)
def get_prefix(url):
    return url[0:url.rindex("/") + 1]


# Recursively crawl the next-level page
def spider_next(url, lev):
    if lev == 2:
        spider_class = "city"
    elif lev == 3:
        spider_class = "county"
    elif lev == 4:
        spider_class = "town"
    else:
        spider_class = "village"

    for item in get_html(url).select("tr." + spider_class + "tr"):
        item_td = item.select("td")
        item_td_code = item_td[0].select_one("a")
        item_td_name = item_td[1].select_one("a")
        if item_td_code is None:
            item_href = None
            item_code = item_td[0].text
            item_name = item_td[1].text
            if lev == 5:
                item_name = item_td[2].text
        else:
            item_href = item_td_code.get("href")
            item_code = item_td_code.text
            item_name = item_td_name.text
        # Output: level, zone code, name
        content2 = str(lev) + "\t" + item_code + "\t" + item_name
        print(content2)
        f.write(content2 + "\n")
        if item_href is not None:
            spider_next(get_prefix(url) + item_href, lev + 1)


# Entry point
if __name__ == '__main__':
    # Crawl the province page
    province_url = "http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2020/index.html"
    province_list = get_html(province_url).select('tr.provincetr a')

    # Write the data to area-number-2020.txt in the current folder
    f = open("area-number-2020.txt", "w", encoding="utf-8")
    try:
        for province in province_list:
            href = province.get("href")
            province_code = href[0:2] + "0000000000"
            province_name = province.text
            # Output: level, zone code, name
            content = "1\t" + province_code + "\t" + province_name
            print(content)
            f.write(content + "\n")
            spider_next(get_prefix(province_url) + href, 2)
    finally:
        f.close()
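Once the crawl finishes, the tab-separated file is easy to load for the regional analysis mentioned in the introduction. A minimal sketch using pandas (an extra library, not required by the crawler itself):

import pandas as pd

# Load the crawler output; there is no header row, and the columns are level, code, name
df = pd.read_csv("area-number-2020.txt", sep="\t", header=None,
                 names=["level", "code", "name"], dtype={"code": str})
print(df[df["level"] == 1])  # all province-level rows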
5. Resource download
If you just need the administrative region data, it is already prepared for you; just download it from the link below.
Link: pan.baidu.com/s/18MDdkczw… Extraction code: T2EG
6. Rules that crawlers follow
Source: www.cnblogs.com/kongyijilaf…
- Abide by the robots protocol and crawl with caution.
- Limit your crawler's speed; request rates approaching a DDoS are prohibited, because crashing the server would be equivalent to a cyber attack.
- Do not force your way into pages that are behind access controls or otherwise unreachable; that is hacker behavior.
- If you crawl someone else's private data, delete it immediately to reduce the risk of legal trouble. Also, restrain yourself.