Web crawler:
A web crawler (also known as a web spider or web robot, and in the FOAF community more commonly called a web chaser) is a program or script that automatically fetches information from the World Wide Web according to certain rules. Other, less commonly used names include ant, automatic indexer, emulator, and worm.
The above is Baidu Baike's definition of a web crawler. Below we introduce how to use Python to write a crawler and obtain data.
The goal is to get real-time data on COVID-19. In PyCharm, create a Python file named get_data and use the requests module, the module most commonly used by crawlers.
Part I:
To obtain web page information:
import requests
url = "https://voice.baidu.com/act/newpneumonia/newpneumonia"
response = requests.get(url)
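Before parsing, it helps to confirm that the request succeeded and that the text decodes correctly. A minimal sanity check, using only standard attributes of the requests response object, might look like this:
# Optional sanity check on the response (a minimal sketch; the actual
# status code and encoding depend on the server at crawl time)
print(response.status_code)    # expect 200 on success
response.encoding = 'utf-8'    # force UTF-8 decoding if the page text looks garbled
response.raise_for_status()    # raises an HTTPError for 4xx/5xx responses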
Part II:
You can observe the characteristics of the data: it is contained in a script tag, so XPath is used to retrieve it. Import the etree module with from lxml import etree, generate an HTML object, and parse it; the XPath query returns a list, and its first item holds all of the content. Next, take the content of component. To get the domestic data, find caseList inside component.
The code is as follows:
from lxml import etree
import json

# Generate an HTML object from the page source
html = etree.HTML(response.text)
result = html.xpath('//script[@type="application/json"]/text()')
result = result[0]
# json.loads() converts a JSON string into a Python data type
result = json.loads(result)
result_in = result['component'][0]['caseList']
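If you are not sure where caseList lives, you can print the keys of the parsed JSON and drill down level by level. A small exploratory sketch (it assumes the page keeps the component/caseList/globalList layout described above):
# Explore the parsed JSON to confirm the structure before extracting
print(result.keys())                          # top-level keys of the JSON
print(result['component'][0].keys())          # expect 'caseList', 'globalList', ...
print(result['component'][0]['caseList'][0])  # one domestic record (a province dict)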
Part III:
Store the domestic data in Excel using the openpyxl module: first import openpyxl and create a workbook, then create a worksheet under the workbook, and finally name the worksheet and write the data into it.
The code is as follows:
import openpyxl

# Create a workbook
wb = openpyxl.Workbook()
# Get the active worksheet and name it
ws = wb.active
ws.title = "Domestic outbreak"
ws.append(['Province', 'Cumulative confirmed', 'Deaths', 'Cured', 'Current confirmed',
           'Cumulative confirmed increment', 'Death increment', 'Cured increment',
           'Current confirmed increment'])
"""
Meaning of the JSON fields:
area               --> province (or country)
confirmed          --> cumulative confirmed
died               --> deaths
crued              --> cured
curConfirm         --> current confirmed
confirmedRelative  --> cumulative confirmed increment
diedRelative       --> death increment
curedRelative      --> cured increment
curConfirmRelative --> current confirmed increment
"""
for each in result_in:
    temp_list = [each['area'], each['confirmed'], each['died'], each['crued'],
                 each['curConfirm'], each['confirmedRelative'], each['diedRelative'],
                 each['curedRelative'], each['curConfirmRelative']]
    # Replace empty strings with '0' so the spreadsheet has no blank cells
    for i in range(len(temp_list)):
        if temp_list[i] == '':
            temp_list[i] = '0'
    ws.append(temp_list)
wb.save('./data.xlsx')
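To verify that the worksheet was written as expected, you can read data.xlsx back with openpyxl and print a few rows. This is just a quick check, not part of the crawler itself:
import openpyxl

# Re-open the saved workbook and print the header plus the first two data rows
wb_check = openpyxl.load_workbook('./data.xlsx')
ws_check = wb_check.active    # the "Domestic outbreak" sheet
for row in ws_check.iter_rows(min_row=1, max_row=3, values_only=True):
    print(row)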
Part IV:
Store the foreign data in Excel: get the foreign data from globalList in component, and create a sheet in Excel for each continent.
The code is as follows:
data_out = result['component'][0]['globalList']
for each in data_out:
    # Each continent becomes its own worksheet
    sheet_title = each['area']
    # Create a new worksheet
    ws_out = wb.create_sheet(sheet_title)
    ws_out.append(['Country', 'Cumulative confirmed', 'Deaths', 'Cured',
                   'Current confirmed', 'Cumulative confirmed increment'])
    for country in each['subList']:
        list_temp = [country['country'], country['confirmed'], country['died'],
                     country['crued'], country['curConfirm'], country['confirmedRelative']]
        # Replace empty strings with '0'
        for i in range(len(list_temp)):
            if list_temp[i] == '':
                list_temp[i] = '0'
        ws_out.append(list_temp)
wb.save('./data.xlsx')
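Since each continent gets its own sheet, a quick way to confirm the foreign data was written is to list the sheet names of the workbook; this small sketch just prints each sheet and how many data rows it holds:
# The first sheet is the domestic one; the rest are one per continent
print(wb.sheetnames)
for name in wb.sheetnames[1:]:
    print(name, wb[name].max_row - 1, 'countries')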
The overall code is as follows:
import requests
from lxml import etree
import json
import openpyxl

url = "https://voice.baidu.com/act/newpneumonia/newpneumonia"
response = requests.get(url)
# print(response.text)

# Generate an HTML object and extract the JSON inside the script tag
html = etree.HTML(response.text)
result = html.xpath('//script[@type="application/json"]/text()')
result = result[0]
# json.loads() converts the JSON string into a Python data type
result = json.loads(result)

# Create a workbook and the worksheet for domestic data
wb = openpyxl.Workbook()
ws = wb.active
ws.title = "Domestic outbreak"
ws.append(['Province', 'Cumulative confirmed', 'Deaths', 'Cured', 'Current confirmed',
           'Cumulative confirmed increment', 'Death increment', 'Cured increment',
           'Current confirmed increment'])

result_in = result['component'][0]['caseList']
data_out = result['component'][0]['globalList']

"""
Meaning of the JSON fields:
area               --> province (or country)
confirmed          --> cumulative confirmed
died               --> deaths
crued              --> cured
curConfirm         --> current confirmed
confirmedRelative  --> cumulative confirmed increment
diedRelative       --> death increment
curedRelative      --> cured increment
curConfirmRelative --> current confirmed increment
"""
# Domestic data: one row per province
for each in result_in:
    temp_list = [each['area'], each['confirmed'], each['died'], each['crued'],
                 each['curConfirm'], each['confirmedRelative'], each['diedRelative'],
                 each['curedRelative'], each['curConfirmRelative']]
    for i in range(len(temp_list)):
        if temp_list[i] == '':
            temp_list[i] = '0'
    ws.append(temp_list)

# Foreign data: one sheet per continent, one row per country
for each in data_out:
    sheet_title = each['area']
    ws_out = wb.create_sheet(sheet_title)
    ws_out.append(['Country', 'Cumulative confirmed', 'Deaths', 'Cured',
                   'Current confirmed', 'Cumulative confirmed increment'])
    for country in each['subList']:
        list_temp = [country['country'], country['confirmed'], country['died'],
                     country['crued'], country['curConfirm'], country['confirmedRelative']]
        for i in range(len(list_temp)):
            if list_temp[i] == '':
                list_temp[i] = '0'
        ws_out.append(list_temp)

wb.save('./data.xlsx')
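In practice the request can fail or hang, so a slightly more defensive variant of the fetch step could add a User-Agent header and a timeout. This is a sketch of one possible hardening, not part of the original script; the header value is an arbitrary example:
import requests

url = "https://voice.baidu.com/act/newpneumonia/newpneumonia"
headers = {'User-Agent': 'Mozilla/5.0'}   # hypothetical example header
try:
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
except requests.RequestException as exc:
    print('Request failed:', exc)
    raise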
The results are as follows:
Domestic:
Abroad: