Preface

For an overview, see the series “Walk You Through a Data Science Mini-Project”, which is built around the comments on the article “China’s youth are leading the country into crisis.”

The code is open-sourced on GitHub at DesertsX/gulius-projects; if you are interested, feel free to star the repo in advance. (Note: some images in the Jupyter Notebook cannot be displayed, because they will be uploaded in one batch once this series is fully updated. Until then, please rely on the article text and on the output of actually running the code.)

So far we have completed data crawling, data extraction and IP lookup, anomaly detection and cleaning, and an analysis of how the number of comments changed. This article continues with the geographic information, extracting province and city data separately so that pyecharts can be used for map visualization.

Read the data

import pandas as pd
df = pd.read_csv('Sina_Finance_Comments_All_20180811_Cleaned.csv',encoding='utf-8')
df.head()

After working through the previous articles, you should already be familiar with the data:

This article only processes the 'area' column. The 'ip_loc' column is left for readers to explore, for example by extracting and analyzing the share of China Mobile, China Telecom, China Unicom, and so on.

df[['area', 'ip_loc']]

5-area.jpg
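For readers who want to try the 'ip_loc' exploration mentioned above, here is a minimal sketch. It assumes that each ip_loc string contains a carrier keyword; the keyword table, the label names, and the sample values are all hypothetical and should be adjusted to what the column really contains.

```python
from collections import Counter

# Hypothetical keyword -> label table; adjust it to the real ip_loc strings.
CARRIER_KEYWORDS = {
    "Mobile": "China Mobile",
    "Telecom": "China Telecom",
    "Unicom": "China Unicom",
}

def get_carrier(ip_loc):
    """Return the first carrier whose keyword appears in ip_loc, else 'Other'."""
    text = str(ip_loc)  # tolerate missing values such as NaN
    for keyword, label in CARRIER_KEYWORDS.items():
        if keyword in text:
            return label
    return "Other"

def carrier_share(ip_locs):
    """Map each carrier label to its share of the given ip_loc strings."""
    counts = Counter(get_carrier(loc) for loc in ip_locs)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Made-up sample values, only to show the shape of the result.
sample = ["Beijing Mobile", "Shanghai Telecom", "Guangdong Unicom", "Beijing Mobile"]
print(carrier_share(sample))
```

On the real DataFrame the same idea would be `df['ip_loc'].apply(get_carrier).value_counts(normalize=True)`.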

Area statistics

Let’s see how many distinct values the area column contains.

area_count = df.groupby('area')['area'].count().sort_values(ascending=False)
area_name = list(area_count.index)
area_values = area_count.values
print(len(area_name), len(area_values))
print(area_count)

Here is part of area_count:

Beijing                 319
Shanghai                281
Guangzhou, Guangdong    176
Chengdu, Sichuan        136
Shenzhen, Guangdong     131
Wuhan, Hubei            113
Chongqing                96
Nanjing, Jiangsu         96
Hangzhou, Zhejiang       87
Xi'an, Shaanxi           73
Fuzhou, Fujian           73
Zhejiang                 68
Suzhou, Jiangsu          64
Hefei, Anhui             52
Tianjin                  44
Jinan, Shandong          44
Xuzhou, Jiangsu          42
Wuxi, Jiangsu            42
Shenyang, Liaoning       40
Qingdao, Shandong        40
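As a quick cross-check of the tally, the same counting can be sketched with the standard library alone. The helper name count_areas and the sample list are illustrative, not part of the original project.

```python
from collections import Counter

def count_areas(areas):
    """Return (area, count) pairs sorted by count, largest first."""
    return Counter(areas).most_common()

# Made-up sample, only to show the shape of the result.
sample_areas = ["Beijing", "Shanghai", "Beijing", "Guangzhou, Guangdong", "Beijing", "Shanghai"]
for area, count in count_areas(sample_areas):
    print(area, count)
```

With the real data this could be called as `count_areas(df['area'].dropna())`; within pandas, `df['area'].value_counts()` gives the same descending counts in a single call.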

Across the 3,000-plus comments, area_name has 337 unique values. To compute statistics and build visualizations by province and city, we need an extraction method that separates the province and the city out of each area value, so that it can be applied to the area column with apply.

Readers may want to pause here and think: what would your approach be? How would you do it?

print(area_name)

All 337 pieces of geographic information are summarized as follows:

['Beijing', 'Shanghai', 'Guangzhou, Guangdong', 'Chengdu, Sichuan', 'Shenzhen, Guangdong', 'Wuhan, Hubei',
 'Chongqing', 'Nanjing, Jiangsu', 'Hangzhou, Zhejiang', "Xi'an, Shaanxi", 'Fuzhou, Fujian', 'Zhejiang',
 'Suzhou, Jiangsu', 'Hefei, Anhui', 'Tianjin', 'Jinan, Shandong', 'Xuzhou, Jiangsu', 'Wuxi, Jiangsu',
 'Shenyang, Liaoning', 'Qingdao, Shandong', 'Nanchang, Jiangxi', 'Zhengzhou, Henan', 'Hong Kong', 'Foshan, Guangdong',
 'Guangdong', 'Changsha, Hunan', 'Kunming, Yunnan', 'Haidian, Beijing', 'Taiyuan, Shanxi', 'Nanning, Guangxi',
 'Dongguan, Guangdong', 'Lanzhou, Gansu', 'Australia', 'Hohhot, Inner Mongolia', 'Kaifeng, Henan', 'Fujian',
 'Dalian, Liaoning', 'Hebei', 'Shijiazhuang, Hebei', 'Nantong, Jiangsu', 'Harbin, Heilongjiang', 'Hunan',
 'Shaoxing, Zhejiang', 'Yunnan', 'Shandong', 'Japan', 'Jilin', 'Changchun, Jilin',
 'Panjin, Liaoning', 'Ningbo, Zhejiang', 'Xiamen, Fujian', 'Henan', 'Guiyang, Guizhou', 'Jinhua, Zhejiang',
 'Guizhou', 'Jincheng, Shanxi', 'Wenzhou, Zhejiang', 'Linyi, Shandong', 'Sichuan', 'Changzhou, Jiangsu',
 'Luoyang, Henan', 'Urumqi, Xinjiang', 'Shantou, Guangdong', 'Yangzhou, Jiangsu', 'Zibo, Shandong', 'Neijiang, Sichuan',
 'Taizhou, Jiangsu', 'Quanzhou, Fujian', 'Zhongshan, Guangdong', 'Yantai, Shandong', 'England, UK', 'Jiaxing, Zhejiang',
 'Baotou, Inner Mongolia', 'Guangxi', 'Yichang, Hubei', 'Huzhou, Zhejiang', 'Jining, Shandong', 'Baoding, Hebei',
 'Haikou, Hainan', 'Huizhou, Guangdong', 'Langfang, Hebei', 'Lianyungang, Jiangsu', 'Singapore', 'Liaoning',
 'Weifang, Shandong', 'Huaibei, Anhui', 'Datong, Shanxi', 'Liuzhou, Guangxi', 'Xiangyang, Hubei', 'Taizhou, Zhejiang',
 'Mianyang, Sichuan', 'Handan, Hebei', 'Jiujiang, Jiangxi', 'Zhoukou, Henan', 'Wuhu, Anhui', 'Lishui, Zhejiang',
 'USA', 'Yinchuan, Ningxia', 'Nanyang, Henan', 'Chengde, Hebei', 'Tangshan, Hebei', 'Yancheng, Jiangsu',
 'Baoji, Shaanxi', 'Xining, Qinghai', 'Jieyang, Guangdong', 'Qinhuangdao, Hebei', 'Tanggu, Tianjin', 'Cangzhou, Hebei',
 'Changzhi, Shanxi', 'Shaoguan, Guangdong', 'Hubei', 'Shangrao, Jiangxi', 'Guilin, Guangxi', 'Yichun, Jiangxi',
 'Chaoyang, Liaoning', 'Heilongjiang', 'Changde, Hunan', 'Yingkou, Liaoning', 'Huanggang, Hubei', 'Anshan, Liaoning',
 'Zunyi, Guizhou', 'Liaocheng, Shandong', 'Shanxi', 'Anyang, Henan', "Lu'an, Anhui", 'Yuncheng, Shanxi',
 'Dezhou, Shandong', 'Dongying, Shandong', 'Hengshui, Hebei', 'Ganzhou, Jiangxi', 'California, USA', 'Fushun, Liaoning',
 'Huainan, Anhui', 'Xianyang, Shaanxi', 'Yibin, Sichuan', 'Qujing, Yunnan', 'Sanya, Hainan', 'Jingzhou, Hubei',
 'Chifeng, Inner Mongolia', 'Nanchong, Sichuan', 'Longyan, Fujian', 'New York, USA', 'Hechi, Guangxi', 'Illinois, USA',
 'Wuzhou, Guangxi', 'Qingyuan, Guangdong', 'Suqian, Jiangsu', 'Yulin, Guangxi', 'Beihai, Guangxi', 'Yangjiang, Guangdong',
 'Putian, Fujian', 'Maoming, Guangdong', 'Jiangsu', 'Zhanjiang, Guangdong', 'Changji Hui Autonomous Prefecture, Xinjiang', 'Xianning, Hubei',
 'Jiangmen, Guangdong', 'Chuzhou, Anhui', 'Inner Mongolia', 'Hainan Tibetan Autonomous Prefecture, Qinghai', 'Zigong, Sichuan', 'Republic of Korea',
 "Yan'an, Shaanxi", 'Ningxia', 'Jinzhou, Liaoning', 'UK', 'Fuyang, Anhui', 'Rizhao, Shandong',
 'Zhenjiang, Jiangsu', 'Linfen, Shanxi', 'Luliang, Shanxi', 'Jinzhong, Shanxi', 'Ningde, Fujian', 'Qiqihar, Heilongjiang',
 'France', 'Zhumadian, Henan', 'Xinxiang, Henan', 'Xuchang, Henan', 'Yiyang, Hunan', 'Luohe, Henan',
 'Hengyang, Hunan', 'Wuwei, Gansu', 'Shaoyang, Hunan', 'Zhuzhou, Hunan', 'Xiangtan, Hunan', 'Fuxin, Liaoning',
 'Huangshan, Anhui', 'Puyang, Henan', 'Chizhou, Anhui', 'Liaoyang, Liaoning', 'Weihai, Shandong', 'Nanping, Fujian',
 'Zaozhuang, Shandong', 'Hebi, Henan', 'Benxi, Liaoning', 'Xiangxi Tujia and Miao Autonomous Prefecture, Hunan', 'Pingdingshan, Henan', 'Sanmenxia, Henan',
 'Bijie, Guizhou', 'Chenzhou, Hunan', 'Yangquan, Shanxi', 'Anqing, Anhui', 'Sanming, Fujian', 'Maryland, USA',
 'Ontario, Canada', 'Xuhui, Shanghai', 'Baoshan, Yunnan', "Pu'er, Yunnan", 'Enshi Tujia and Miao Autonomous Prefecture, Hubei', 'Xingan League, Inner Mongolia',
 'Daqing, Heilongjiang', 'Quzhou, Zhejiang', 'Tongliao, Inner Mongolia', 'Ordos, Inner Mongolia', 'British Columbia, Canada', 'Zhoushan, Zhejiang',
 'the', 'Chaoyang, Beijing', 'Tongchuan, Shaanxi', 'Huangshi, Hubei', 'Deyang, Sichuan', 'Luzhou, Sichuan',
 'Hanzhong, Shaanxi', 'Ziyang, Sichuan', "Ya'an, Sichuan", 'Yueyang, Hunan', 'Wuzhong, Ningxia', 'Scotland, UK',
 "Tai'an, Shandong", 'Xinjiang', 'Zhangzhou, Fujian', 'Chongzuo, Guangxi', 'Connecticut, USA', 'Gansu',
 'Texas, USA', 'Ohio, USA', 'Zhangjiakou, Hebei', 'Florida, USA', 'Zhuhai, Guangdong', 'Sweden',
 'Yingtan, Jiangxi', 'Qinzhou, Guangxi', 'Jiangxi', 'Bayingolin Mongolian Autonomous Prefecture, Xinjiang', 'the', 'Pingxiang, Jiangxi',
 "Huai'an, Jiangsu", 'Suizhou, Hubei', "Guang'an, Sichuan", 'Oregon, USA', 'Malaysia', 'Weinan, Shaanxi',
 'Jingmen, Hubei', 'Jingdezhen, Jiangxi', 'Songyuan, Jilin', 'New Zealand', 'Guangyuan, Sichuan', 'Baishan, Jilin',
 'Shihezi, Xinjiang', 'Tonghua, Jilin', 'Wakayama Prefecture, Japan', 'Haidong, Qinghai', 'Qinghai', 'Leshan, Sichuan',
 'Dongcheng, Beijing', 'Canada', 'Alberta, Canada', 'Gifu Prefecture, Japan', 'Huangpu, Shanghai', 'Suihua, Heilongjiang',
 'Lincang, Yunnan', 'Dali Bai Autonomous Prefecture, Yunnan', 'Nujiang Lisu Autonomous Prefecture, Yunnan', 'Mudanjiang, Heilongjiang', 'Zhaotong, Yunnan', 'Baiyin, Gansu',
 'Honghe Hani and Yi Autonomous Prefecture, Yunnan', 'Iraq', 'Hainan', 'Wulanchabu, Inner Mongolia', 'Georgia, USA', 'Zhangye, Gansu',
 'Hulun Buir, Inner Mongolia', 'Yichun, Heilongjiang', 'Jixi, Heilongjiang', 'Tacheng Prefecture, Xinjiang', 'Fuzhou, Jiangxi', 'Meishan, Sichuan',
 'Louisiana, USA', 'Loudi, Hunan', 'Yili Kazak Autonomous Prefecture, Xinjiang', 'Huludao, Liaoning', 'Zhaoqing, Guangdong', 'Chaozhou, Guangdong',
 'Shangqiu, Henan', 'Laiwu, Shandong', 'Heze, Shandong', 'Dandong, Liaoning', 'Ireland', 'New Jersey, USA',
 'Xinyang, Henan', 'Ho Chi Minh City, Vietnam', 'Heyuan, Guangdong', 'Shanwei, Guangdong', 'Shuozhou, Shanxi', 'Xiaogan, Hubei',
 'Lhasa, Tibet', 'Meizhou, Guangdong', 'Philippine Simisafyan', 'Philippines', 'Australian Capital Territory, Australia', 'Yunfu, Guangdong',
 'Switzerland', 'Virginia, USA', 'Tieling, Liaoning', 'Angola', 'Dazhou, Sichuan', 'Suining, Sichuan',
 'Xingtai, Hebei', 'Huaihua, Hunan', 'Shangluo, Shaanxi', 'Pennsylvania, USA', 'Piedmont, Italy', 'Guyuan, Ningxia',
 'Tuscany, Italy', 'Italy', 'Missouri, USA', 'Baise, Guangxi', 'Dingxi, Gansu', 'Tianshui, Gansu',
 'Yongzhou, Hunan', 'Michigan, USA', 'Bengbu, Anhui', 'Tongling, Anhui', 'Jiaozuo, Henan', "Ma'anshan, Anhui",
 'Tennessee, USA']

Handling geographic information is one of the highlights of this series. Faced with data this messy, a beginner may feel as overwhelmed as Gu Liu (the author) once did. Think back to the earlier project “Crawling Zhang Jiawei’s 1.38 Million+ Zhihu Followers: Data Visualization”: after filtering Zhang Jiawei’s 1.38 million followers down to the 40,000-plus Zhihu users who themselves had 100+ followers, I set out to analyze and visualize them, and it is a long story just how messy that data turned out to be. To give you an intuitive feel, I dug out that data to show you; I am not sure how clearly it displays:

God knows how Gu Liu managed to tally that data back then!

By comparison, the data here is really good: the geographic information is real, with no user-defined or randomly filled-in values; the format is fairly uniform; and the volume is small, so even extracting provinces and cities by hand would not be impossible… (Not that I would ever do it by hand; never again in this life.)


Data processing approach

First, to restate the goal: we want to extract province and city information. Because the dataset is small, the later visualization will only cover the map of China, so overseas locations can be filtered out first. To do this, build a list named unchina that holds the overseas countries, then loop over all 337 elements of area_name: any element containing one of those country names goes into the droped list, and the remaining elements are grouped by string length and printed. Some entries may be misclassified along the way, so the results need to be checked.

Talk is cheap, show you the code.

area_len_2 = []
area_len_3 = []
area_len_4 = []
area_len_5 = []
# Overseas countries to filter out
unchina = ['UK', 'USA', 'Japan', 'Switzerland', 'France', 'Sweden', 'Vietnam', 'the', 'Italy', 'Canada', 'Philippines', 'Singapore', 'New Zealand', 'Iraq', 'Ireland', 'Angola', 'Australia', 'Republic of Korea', 'Malaysia']
droped = []
for area in area_name:
    # Collect every area that mentions an overseas country
    for unarea in unchina:
        if unarea in area:
            droped.append(area)
            break  # one match is enough; avoids duplicate entries
    # Group the rest by length (the character count of the original Chinese strings)
    if len(area)==2 and area not in droped: area_len_2.append(area)
    if len(area)==3 and area not in droped: area_len_3.append(area)
    if len(area)==4 and area not in droped: area_len_4.append(area)
    if len(area)>=5 and area not in droped: area_len_5.append(area)
print(len(droped), '\n', droped)
print(len(area_len_2), '\n', area_len_2)
print(len(area_len_3), '\n', area_len_3)
print(len(area_len_4), '\n', area_len_4)
print(len(area_len_5), '\n', area_len_5)

After this step, the data feels much tidier. The length-based groups are not used in the province extraction below, but they give a much clearer picture of how the data is composed.
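The length-based grouping can also be sketched with a single dictionary instead of four separate lists. This is an alternative sketch, not the code used in this series: the helper name group_by_length is made up, and the sample values are illustrative strings in raw Chinese form, since character counts are only meaningful on the untranslated data.

```python
from collections import defaultdict

def group_by_length(areas, excluded=()):
    """Group area strings by character count, skipping anything in `excluded`."""
    groups = defaultdict(list)
    for area in areas:
        if area not in excluded:
            groups[len(area)].append(area)
    return dict(groups)

# Illustrative values: 北京 = Beijing, 上海 = Shanghai, 黑龙江 = Heilongjiang,
# 广东广州 = Guangzhou, Guangdong, 内蒙古呼和浩特 = Hohhot, Inner Mongolia, 日本 = Japan
sample = ["北京", "上海", "黑龙江", "广东广州", "内蒙古呼和浩特", "日本"]
groups = group_by_length(sample, excluded={"日本"})
for length in sorted(groups):
    print(length, groups[length])
```

Unlike the four lists, this keeps every length in its own bucket rather than merging everything with five or more characters.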

Province summary

China has a total of 34 provincial-level administrative regions, including 23 provinces, 5 autonomous regions, 4 municipalities directly under the Central Government and 2 special administrative regions.

Copy over the full list of provincial-level regions, manually strip the suffixes of the autonomous regions and special administrative regions first, and then use code to remove the remaining suffix words and split the string.

prolist = 'Beijing City, Tianjin City, Shanghai City, Chongqing City, Hebei Province, Shanxi Province, Liaoning Province, Jilin Province, Jiangsu Province, Zhejiang Province, Anhui Province, Fujian Province, \
Jiangxi Province, Shandong Province, Henan Province, Hubei Province, Hunan Province, Guangdong Province, Hainan Province, Sichuan Province, Guizhou Province, Yunnan Province, Shaanxi Province, Gansu Province, \
Qinghai Province, Taiwan Province, Guangxi, Tibet, Ningxia, Xinjiang, Hong Kong, Macao, Inner Mongolia, Heilongjiang Province'
# Strip the "City" / "Province" suffixes, then split into a list of names
prolist = prolist.replace(' City', '').replace(' Province', '').split(', ')
print(len(prolist), prolist)

34 ['Beijing', 'Tianjin', 'Shanghai', 'Chongqing', 'Hebei', 'Shanxi', 'Liaoning', 'Jilin', 'Jiangsu', 'Zhejiang', 'Anhui', 'Fujian', 'Jiangxi', 'Shandong', 'Henan', 'Hubei', 'Hunan', 'Guangdong', 'Hainan', 'Sichuan', 'Guizhou', 'Yunnan', 'Shaanxi', 'Gansu', 'Qinghai', 'Taiwan', 'Guangxi', 'Tibet', 'Ningxia', 'Xinjiang', 'Hong Kong', 'Macao', 'Inner Mongolia', 'Heilongjiang']

Extract the provinces

Extract the corresponding province from the area column; anything outside China is uniformly labeled as overseas:

def get_pro(area):
    prolist = ['Beijing', 'Tianjin', 'Shanghai', 'Chongqing', 'Hebei', 'Shanxi', 'Liaoning', 'Jilin', 'Jiangsu', 'Zhejiang', 'Anhui', 'Fujian', 'Jiangxi', 'Shandong', 'Henan', 'Hubei', 'Hunan', 'Guangdong', 'Hainan', 'Sichuan', 'Guizhou', 'Yunnan', 'Shaanxi', 'Gansu', 'Qinghai', 'Taiwan', 'Guangxi', 'Tibet', 'Ningxia', 'Xinjiang', 'Hong Kong', 'Macao', 'Inner Mongolia', 'Heilongjiang']
    # Return the first province name found in the area string
    for pro in prolist:
        if pro in area:
            return pro
    return "Overseas"
df['pro'] = df.area.apply(get_pro)
df[['area', 'pro']]
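Because get_pro returns the first province name it finds, it is worth auditing two kinds of area values: those that match no province (and so become "Overseas") and those that match more than one, such as the Hainan Tibetan Autonomous Prefecture in Qinghai, where the result depends on list order. The sketch below is one possible check; the helper names and the shortened sample lists are illustrative.

```python
def province_matches(area, provinces):
    """Return every province name that occurs somewhere in the area string."""
    return [pro for pro in provinces if pro in area]

def audit_areas(areas, provinces):
    """Split distinct areas into those with no province match and those with several."""
    unmatched, ambiguous = [], []
    for area in dict.fromkeys(areas):  # de-duplicate while keeping order
        hits = province_matches(area, provinces)
        if not hits:
            unmatched.append(area)
        elif len(hits) > 1:
            ambiguous.append((area, hits))
    return unmatched, ambiguous

provinces = ['Beijing', 'Guangdong', 'Hainan', 'Qinghai']  # shortened list for the example
areas = ['Beijing', 'Guangzhou, Guangdong', 'Japan', 'Japan',
         'Hainan Tibetan Autonomous Prefecture, Qinghai']
unmatched, ambiguous = audit_areas(areas, provinces)
print(unmatched)   # ['Japan']
print(ambiguous)   # [('Hainan Tibetan Autonomous Prefecture, Qinghai', ['Hainan', 'Qinghai'])]
```

On the real data this could be run as `audit_areas(df['area'].dropna(), prolist)`. The unmatched part can also be inspected directly with `df.loc[df['pro'] == 'Overseas', 'area'].unique()`.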

Statistics

pro_count = df.groupby('pro')['pro'].count().sort_values(ascending=False)
pro_count

Just look at the bar chart.

pyecharts

Pyecharts custom themes

pip install echarts-themes-pypkg

Bar chart of provincial distribution

from pyecharts import Bar  # pyecharts 0.5.x API
bar = Bar("Province distribution")
bar.use_theme("macarons")  # switch theme
bar.add("Province", pro_count.index, pro_count.values, is_label_show=True, xaxis_interval=0, xaxis_rotate=-45)
bar

Provincial distribution map

from pyecharts import Map  # pyecharts 0.5.x API
mapp = Map("Distribution by Province", width=1000, height=600)
#mapp.use_theme("macarons")  # switch theme
mapp.add("", pro_count.index, pro_count.values, maptype='china', is_visualmap=True,
         visual_range=[0, 480], is_map_symbol_show=False, visual_text_color='#000', is_label_show=True)
mapp
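The map call hard-codes visual_range=[0, 480] and passes every row of pro_count, including "Overseas", which has no region on a map of China. A small stdlib sketch of one way to prepare the inputs is shown below; the helper name prepare_map_data, the rounding step, and the sample numbers are assumptions for illustration only.

```python
import math

def prepare_map_data(counts, exclude=("Overseas",), step=10):
    """Return (names, values, visual_max) for a province-level map.

    `counts` is any mapping with .items(), such as a dict or a pandas Series.
    `visual_max` is the largest value rounded up to a multiple of `step`.
    """
    items = [(name, int(value)) for name, value in counts.items() if name not in exclude]
    names = [name for name, _ in items]
    values = [value for _, value in items]
    top = max(values, default=0)
    visual_max = max(step, math.ceil(top / step) * step)
    return names, values, visual_max

# Made-up numbers, only to show the shape of the result.
sample_counts = {"Guangdong": 473, "Beijing": 352, "Overseas": 60, "Tibet": 1}
names, values, visual_max = prepare_map_data(sample_counts)
print(names, values, visual_max)
```

With the real data, `names, values, visual_max = prepare_map_data(pro_count)` could feed `mapp.add("", names, values, ..., visual_range=[0, visual_max])`, so the color scale follows the data instead of a fixed upper bound.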

Summary

Extracting provinces is relatively simple: search online for the specific list of provinces (which exposes Gu Liu as a geography novice), and once you have the list, the rest is easy. The earlier step of grouping the geographic data to see its structure more clearly and avoid later mistakes was something of a bright spot, though. (Runs away…)

The next article will cover extracting city data, calling the Baidu Maps API to look up latitude and longitude, and then using BDP to draw dynamic heat maps. Readers are strongly encouraged to try implementing this themselves. There are surely other ways to do it, and Gu Liu’s approach is still rather cumbersome, so do not be limited by my method! Details will follow in the next article. To be continued…


This series covers the full workflow, from crawling, data extraction and preparation, and anomaly detection and cleaning through to analysis and visualization. The code is open-sourced on GitHub at DesertsX/gulius-projects; if you are interested, feel free to star the repo in advance.

This series of articles: “China’s youth are leading the country into crisis.”
Walk You Through a Data Science Mini-Project (1): Data crawling
Walk You Through a Data Science Mini-Project (2): Data extraction and IP query
Walk You Through a Data Science Mini-Project (3): Data anomalies and cleaning
Walk You Through a Data Science Mini-Project (4): Changes in the number of comments