Source code shared on GitHub

WeChat search: 码农StayUp

Homepage: gozhuyinglong.github.io

Source: github.com/gozhuyinglo…

1. Introduction

When building a website, the national administrative region divisions are often needed in order to do regional data analysis.

Here we use Python to crawl the administrative region data from the authoritative source, the National Bureau of Statistics. The page to be crawled is the 2020 statistical zoning codes and urban-rural division codes.

This raises a question: why does the Bureau of Statistics only provide a web version? Wouldn't it be more convenient for the public if a downloadable file were provided as well? Feel free to leave me a message.

2. Website analysis

Before crawling data, you need to analyze the site to determine which crawling method to use.

2.1 Province Page

A static page whose secondary pages use relative addresses; rows are located by tr elements with class=provincetr.

2.2 City Page

A static page whose secondary pages use relative addresses; rows are located by tr elements with class=citytr.

2.3 District and County Page

A static page whose secondary pages use relative addresses; rows are located by tr elements with class=countytr.

2.4 Town Page

A static page whose secondary pages use relative addresses; rows are located by tr elements with class=towntr.

2.5 Village Page

A static page with no secondary pages; rows are located by tr elements with class=villagetr.
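To confirm this structure, you can fetch the province page and print the rows matched by these classes. Below is a minimal sketch; it uses the same 2020 index URL as the full code in section 4, and html.parser only for this quick check:

# Quick structure check (sketch): list the province links on the index page
import requests
from bs4 import BeautifulSoup

url = "http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2020/index.html"
response = requests.get(url, timeout=10)
response.encoding = "GBK"  # the Bureau of Statistics pages are GBK-encoded
soup = BeautifulSoup(response.text, "html.parser")

# Each province link sits inside a <tr class="provincetr"> row
for link in soup.select("tr.provincetr a"):
    print(link.text, link.get("href"))  # province name and its relative address, e.g. "11.html"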

3. Install required libraries

The analysis above shows that crawling static web pages is all that is needed. The following libraries need to be installed in advance: Requests, BeautifulSoup, and lxml.

3.1 Requests

Requests is a Python HTTP client library for accessing URL network resources.

Install the Requests library:

pip install requests
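A minimal usage sketch (the timeout value is just an example; the encoding is set explicitly because the Bureau of Statistics pages are GBK-encoded):

# Requests sketch: fetch a page with a timeout and check the status code
import requests

response = requests.get("http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2020/index.html", timeout=5)
response.encoding = "GBK"  # GBK-encoded page
if response.status_code == 200:
    print(response.text[:200])  # first 200 characters of the HTML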

3.2 BeautifulSoup

BeautifulSoup is a Python library for extracting data from HTML or XML files. Through a specified parser, it lets you navigate, search, and modify the document tree.

Install BeautifulSoup library:

pip install beautifulsoup4
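A small sketch of how it extracts data; the HTML snippet below is made up for illustration and mirrors the structure of the province page:

# BeautifulSoup sketch: parse an HTML snippet and select elements with a CSS selector
from bs4 import BeautifulSoup

html = '<table><tr class="provincetr"><td><a href="11.html">Beijing</a></td></tr></table>'
soup = BeautifulSoup(html, "html.parser")

# The same selector style is used by the crawler in section 4
for a in soup.select("tr.provincetr a"):
    print(a.text, a.get("href"))  # -> Beijing 11.html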

3.3 LXML

lxml is a Python library for processing XML and HTML quickly and flexibly.

It supports XML Path Language (XPath) and Extensible Stylesheet Language Transformation (XSLT), and implements the common ElementTree API.

Install the LXML library:

pip install lxml
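In this article lxml mainly serves as the parser passed to BeautifulSoup (the "lxml" argument in the code in section 4), but it can also be used on its own with XPath. A tiny sketch using a made-up snippet:

# lxml sketch: parse an HTML snippet and query it with XPath
from lxml import etree

html = '<table><tr class="provincetr"><td><a href="11.html">Beijing</a></td></tr></table>'
tree = etree.HTML(html)
print(tree.xpath('//tr[@class="provincetr"]//a/text()'))  # -> ['Beijing']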

4. Code implementation

The crawler is divided into the following steps:

  • Use the Requests library to fetch the web pages.
  • Parse the web pages using the BeautifulSoup and lxml libraries.
  • Use Python file I/O to store the data.

The output file is written to the directory where the .py file is located; the file name is area-number-2020.txt.

The output contains the level, zone code, and name, separated by tabs so that it can easily be imported into Excel or a database.
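For illustration, the first few output lines would look roughly like this (Beijing as an example; level, zone code, and name separated by tabs):

1	110000000000	北京市
2	110100000000	市辖区
3	110101000000	东城区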

Here is the detailed code:

# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup


# Fetch the page at the given URL and return it as a BeautifulSoup object
def get_html(url):
    # If the page fails to load, retry indefinitely without giving up
    while True:
        try:
            # Time out after 1 second
            response = requests.get(url, timeout=1)
            response.encoding = "GBK"
            if response.status_code == 200:
                return BeautifulSoup(response.text, "lxml")
            else:
                continue
        except Exception:
            continue


# Get the address prefix (used to resolve relative addresses)
def get_prefix(url):
    return url[0:url.rindex("/") + 1]


# Recursively grab the next page
def spider_next(url, lev):
    if lev == 2:
        spider_class = "city"
    elif lev == 3:
        spider_class = "county"
    elif lev == 4:
        spider_class = "town"
    else:
        spider_class = "village"

    for item in get_html(url).select("tr." + spider_class + "tr"):
        item_td = item.select("td")
        item_td_code = item_td[0].select_one("a")
        item_td_name = item_td[1].select_one("a")
        if item_td_code is None:
            item_href = None
            item_code = item_td[0].text
            item_name = item_td[1].text
            if lev == 5:
                item_name = item_td[2].text
        else:
            item_href = item_td_code.get("href")
            item_code = item_td_code.text
            item_name = item_td_name.text
        # Output: level, zone code, name
        content2 = str(lev) + "\t" + item_code + "\t" + item_name
        print(content2)
        f.write(content2 + "\n")
        if item_href is not None:
            spider_next(get_prefix(url) + item_href, lev + 1)


# entry
if __name__ == '__main__':

    # Grab the province page
    province_url = "http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2020/index.html"
    province_list = get_html(province_url).select('tr.provincetr a')

    # write data to area-number-2020.txt under current folder
    f = open("area-number-2020.txt", "w", encoding="utf-8")
    try:
        for province in province_list:
            href = province.get("href")
            province_code = href[0: 2] + "0000000000"
            province_name = province.text
            # Output: level, zone code, name
            content = "1\t" + province_code + "\t" + province_name
            print(content)
            f.write(content + "\n")
            spider_next(get_prefix(province_url) + href, 2)
    finally:
        f.close()

5. Resource download

If you only need the administrative region data, it has already been prepared for you; just download it from the link below.

Link: pan.baidu.com/s/18MDdkczw… Extraction code: T2EG

6. Rules crawlers should follow

Source: www.cnblogs.com/kongyijilaf…

  1. Abide by the Robots protocol and crawl with restraint
  2. Limit your crawler's behavior and never approach DDoS-level request rates; if the server were to crash, that would be equivalent to a cyber attack
  3. Do not forcibly crawl pages that are blocked or otherwise inaccessible; that is hacker behavior
  4. If you crawl someone's private data, delete it immediately to reduce the chance of getting into legal trouble. Also, control your desires