Foreword:

Python is something I am still gradually learning and mastering. This time I start from travel data, and before getting into the main text, let me first serve up the most appetizing seafood feast.

Data crawling:

These past few days, WeChat Moments has been flooded with everyone's travel footprints, and I marveled at the friends who seem to have traveled all over China. It also made me want to write something travel-related. The data comes from mafengwo, a travel guide website that is very friendly to crawlers.
PART1: Obtain the city codes

Every city, scenic spot and other entity on mafengwo has its own exclusive 5-digit code. Our first step is to obtain the codes of the cities (municipalities directly under the central government plus prefecture-level cities) for further analysis.

The two pages above are the source of our city codes: we first obtain the province codes from the destination page, and then enter each province's city list to collect the city codes. Selenium is required for the dynamic pages; the code is as follows:
# Imports assumed for all code in this post
from urllib import request
from urllib.request import urlopen
from bs4 import BeautifulSoup
from selenium import webdriver
import time
import pandas as pd

def find_cat_url(url):
    # Collect each province's destination link from the mdd page
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
    req = request.Request(url, headers=headers)
    html = urlopen(req)
    bsObj = BeautifulSoup(html.read(), "html.parser")
    bs = bsObj.find('div', attrs={'class': 'hot-list clearfix'}).find_all('dt')
    cat_url = []
    cat_name = []
    for i in range(0, len(bs)):
        for j in range(0, len(bs[i].find_all('a'))):
            cat_url.append(bs[i].find_all('a')[j].attrs['href'])
            cat_name.append(bs[i].find_all('a')[j].text)
    cat_url = ['http://www.mafengwo.cn' + cat_url[i] for i in range(0, len(cat_url))]
    return cat_url
def find_city_url(url_list):
    # Walk each province's city-list page with Selenium and collect
    # every city's name and numeric code (the data-id attribute)
    city_name_list = []
    city_url_list = []
    for i in range(0, len(url_list)):
        driver = webdriver.Chrome()
        driver.maximize_window()
        url = url_list[i].replace('travel-scenic-spot/mafengwo', 'mdd/citylist')
        driver.get(url)
        while True:
            try:
                time.sleep(2)
                bs = BeautifulSoup(driver.page_source, 'html.parser')
                # the data-type value appears translated here; on the live site
                # it may be the Chinese '目的地'
                url_set = bs.find_all('a', attrs={'data-type': 'destination'})
                city_name_list = city_name_list + [k.text.replace('\n', '').split()[0] for k in url_set]
                city_url_list = city_url_list + [k.attrs['data-id'] for k in url_set]
                # scroll down so the pager is in view, then click "next page"
                js = "var q=document.documentElement.scrollTop=800"
                driver.execute_script(js)
                time.sleep(2)
                driver.find_element_by_class_name('pg-next').click()
            except:
                # no next-page button left: last page reached
                break
        driver.close()
    return city_name_list, city_url_list
url = 'http://www.mafengwo.cn/mdd/'
url_list = find_cat_url(url)
city_name_list, city_url_list = find_city_url(url_list)
city = pd.DataFrame({'city': city_name_list, 'id': city_url_list})
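At this point city holds one row per city. As a checkpoint before the much slower per-city crawl, a minimal sketch (the deduplication step and the file name are my own choice, not from the original post):

# Hedged example: drop duplicate ids collected across pages and save a checkpoint
city = city.drop_duplicates(subset='id')
city.to_csv('mafengwo_city_ids.csv', index=False, encoding='utf-8')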
PART2: Get city information

City data are scraped from three kinds of pages:

(a) the snacks page

(b) the attractions page

(c) the tags page

We encapsulate the scraping of each city into functions and pass in the city code obtained above each time; part of the code is as follows:
def get_city_info(city_name, city_code):
    this_city_base = get_city_base(city_name, city_code)
    this_city_jd = get_city_jd(city_name, city_code)
    this_city_jd['city_name'] = city_name
    this_city_jd['total_city_yj'] = this_city_base['total_city_yj']
    try:
        this_city_food = get_city_food(city_name, city_code)
        this_city_food['city_name'] = city_name
        this_city_food['total_city_yj'] = this_city_base['total_city_yj']
    except:
        # some cities have no snack ranking page
        this_city_food = pd.DataFrame()
    return this_city_base, this_city_food, this_city_jd
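The helper get_static_url_content is used throughout but was not shown in the original post. A minimal sketch, assuming it only fetches a static page with a browser User-Agent and returns the parsed soup:

def get_static_url_content(url):
    # Assumed helper (not in the original post): fetch a static page
    # and return a BeautifulSoup object for it
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
    req = request.Request(url, headers=headers)
    html = urlopen(req)
    return BeautifulSoup(html.read(), 'html.parser')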
def get_city_base(city_name, city_code):
    # Tag counts from the city's impression page
    url = 'http://www.mafengwo.cn/xc/' + str(city_code) + '/'
    bsObj = get_static_url_content(url)
    node = bsObj.find('div', {'class': 'm-tags'}).find('div', {'class': 'bd'}).find_all('a')
    tag = [node[i].text.split()[0] for i in range(0, len(node))]
    tag_node = bsObj.find('div', {'class': 'm-tags'}).find('div', {'class': 'bd'}).find_all('em')
    tag_count = [int(k.text) for k in tag_node]
    # the first two letters of each tag's href give its category:
    # jd = attractions, cy = food, gw/yl = shopping/leisure
    par = [k.attrs['href'][1:3] for k in node]
    tag_all_count = sum([int(tag_count[i]) for i in range(0, len(tag_count))])
    tag_jd_count = sum([int(tag_count[i]) for i in range(0, len(tag_count)) if par[i] == 'jd'])
    tag_cy_count = sum([int(tag_count[i]) for i in range(0, len(tag_count)) if par[i] == 'cy'])
    tag_gw_yl_count = sum([int(tag_count[i]) for i in range(0, len(tag_count)) if par[i] in ['gw', 'yl']])
    # total number of travel notes for the city
    url = 'http://www.mafengwo.cn/yj/' + str(city_code) + '/2-0-1.html'
    bsObj = get_static_url_content(url)
    total_city_yj = int(bsObj.find('span', {'class': 'count'}).find_all('span')[1].text)
    return {'city_name': city_name, 'tag_all_count': tag_all_count, 'tag_jd_count': tag_jd_count,
            'tag_cy_count': tag_cy_count, 'tag_gw_yl_count': tag_gw_yl_count,
            'total_city_yj': total_city_yj}
def get_city_food(city_name, city_code):
    url = 'http://www.mafengwo.cn/cy/' + str(city_code) + '/gonglve.html'
    bsObj = get_static_url_content(url)
    food = [k.text for k in bsObj.find('ol', {'class': 'list-rank'}).find_all('h3')]
    food_count = [int(k.text) for k in bsObj.find('ol', {'class': 'list-rank'}).find_all('span', {'class': 'trend'})]
    return pd.DataFrame({'food': food[0:len(food_count)], 'food_count': food_count})
def get_city_jd(city_name, city_code):
    url = 'http://www.mafengwo.cn/jd/' + str(city_code) + '/gonglve.html'
    bsObj = get_static_url_content(url)
    node = bsObj.find('div', {'class': 'row-top5'}).find_all('h3')
    jd = [k.text.split('\n')[2] for k in node]
    node = bsObj.find_all('span', {'class': 'rev-total'})
    # keep only the digits of the review count (the original stripped a text suffix)
    jd_count = [int(''.join([c for c in k.text if c.isdigit()])) for k in node]
    return pd.DataFrame({'jd': jd[0:len(jd_count)], 'jd_count': jd_count})
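The post does not show the loop that ties these functions together. A minimal sketch, assuming the city frame built in PART1 and result names of my own choosing:

# Hedged sketch: crawl every city collected in PART1 and stack the results
city_base_list, city_food_list, city_jd_list = [], [], []
for _, row in city.iterrows():
    try:
        base, food, jd = get_city_info(row['city'], row['id'])
        city_base_list.append(base)
        city_food_list.append(food)
        city_jd_list.append(jd)
    except Exception:
        continue  # skip cities whose pages fail to parse
city_base = pd.DataFrame(city_base_list)
city_food = pd.concat(city_food_list, ignore_index=True)
city_jd = pd.concat(city_jd_list, ignore_index=True)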
Data analysis:

PART1: City data

First, let's look at the TOP10 cities with the most travel notes:

The travel-note counts are basically consistent with the popular cities we know from daily life. Going further, we use each city's travel-note count to draw a heat map of travel destinations across the country:
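The heat map itself appears as an image in the original post. A minimal sketch of how such a map can be drawn with pyecharts' Geo, assuming the city_base frame from the sketch above (the visual_range is an arbitrary choice):

from pyecharts import Geo

# Hedged sketch: a national heat map weighted by travel-note counts
geo = Geo("Travel destination heat map", width=1200, height=600)
attr = list(city_base['city_name'])
value = list(city_base['total_city_yj'])
geo.add("", attr, value, type='heatmap', is_visualmap=True,
        visual_range=[0, 3000], visual_text_color="#fff")
geo.render('destination_heatmap.html')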
If the footprint map you posted on WeChat matches this picture, it means mafengwo's data is consistent with yours.
Finally, let's look at everyone's impression of each city by extracting the attributes in the tags. We divide the tags into three groups: leisure, food, and attractions, and look at the most impressive cities under each group:

It seems that, for mafengwo users, Xiamen has left a deep impression: not only are there plenty of travel notes about it, but many meaningful tags can be extracted from them. Chongqing, Xi'an and Chengdu also left a deep impression on foodies. Part of the code is as follows:
from pyecharts import Bar, Grid

bar1 = Bar("Food tag ranking")
bar1.add("", city_aggregate.sort_values('cy_point', 0, False)['city_name'][0:15],
         city_aggregate.sort_values('cy_point', 0, False)['cy_point'][0:15],
         is_splitline_show=False, xaxis_rotate=30)

bar2 = Bar(title_top="30%")
bar2.add("", city_aggregate.sort_values('jd_point', 0, False)['city_name'][0:15],
         city_aggregate.sort_values('jd_point', 0, False)['jd_point'][0:15],
         legend_top="30%", is_splitline_show=False, xaxis_rotate=30)

bar3 = Bar("Leisure tag ranking", title_top="67.5%")
bar3.add("", city_aggregate.sort_values('xx_point', 0, False)['city_name'][0:15],
         city_aggregate.sort_values('xx_point', 0, False)['xx_point'][0:15],
         legend_top="67.5%", is_splitline_show=False, xaxis_rotate=30)

grid = Grid(height=800)
grid.add(bar1, grid_bottom="75%")
grid.add(bar2, grid_bottom="37.5%", grid_top="37.5%")
grid.add(bar3, grid_top="75%")
grid.render('city_tag_ranking.html')
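For reference, city_aggregate is never constructed in the post. A minimal sketch, assuming each score normalizes a city's tag counts by its travel-note count (the column names cy_point, jd_point and xx_point match the charting code above):

# Hedged sketch: tag scores per city, normalized by travel-note volume
city_aggregate = city_base.copy()
city_aggregate['cy_point'] = city_aggregate['tag_cy_count'] / city_aggregate['total_city_yj']
city_aggregate['jd_point'] = city_aggregate['tag_jd_count'] / city_aggregate['total_city_yj']
city_aggregate['xx_point'] = city_aggregate['tag_gw_yl_count'] / city_aggregate['total_city_yj']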
PART2: Scenic spot data

We extracted the number of reviews for each scenic spot and compared it with the number of travel notes for its city, obtaining an absolute and a relative review count for each spot. From these we calculated popularity and representativeness scores and ranked the TOP15 scenic spots:

Gulangyu is the most popular scenic spot among mafengwo users, while Xitang Ancient Town and Yamdrok Lake rank first in city representativeness. With a short holiday approaching, if you worry that the top-ranked spots will be overcrowded, you might instead dig toward the bottom of the list for beautiful but less-visited places.
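The scoring step itself is not shown in the post. A minimal sketch, assuming popularity (rq_point) is the absolute review count and representativeness (db_point) is the review count relative to the city's travel notes, matching the description above and the column names in the code below:

# Hedged sketch: popularity and representativeness scores for attractions
city_jd_com = city_jd.copy()
city_jd_com['rq_point'] = city_jd_com['jd_count']
city_jd_com['db_point'] = city_jd_com['jd_count'] / city_jd_com['total_city_yj']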
PART3: Snack data

Finally, let's look at the food data everyone cares about most. The processing is similar to the scenic spot data in PART2: we look at the most popular snacks and the most city-representative snacks respectively.

Unexpectedly, mafengwo users really do love Xiamen deeply: its shacha noodles rank among the most popular snacks, ahead of hot pot, roast duck and rou jia mo. In terms of city representativeness, seafood appears with high frequency, which matches my own impression. The code for PART2 and PART3 is as follows:
bar1 = Bar("Attraction popularity ranking")
bar1.add("attraction popularity score", city_jd_com.sort_values('rq_point', 0, False)['jd'][0:15],
         city_jd_com.sort_values('rq_point', 0, False)['rq_point'][0:15],
         is_splitline_show=False, xaxis_rotate=30)

bar2 = Bar(title_top="55%")
bar2.add("attraction representativeness score", city_jd_com.sort_values('db_point', 0, False)['jd'][0:15],
         city_jd_com.sort_values('db_point', 0, False)['db_point'][0:15],
         is_splitline_show=False, xaxis_rotate=30, legend_top="55%")

grid = Grid(height=800)
grid.add(bar1, grid_bottom="60%")
grid.add(bar2, grid_top="60%", grid_bottom="10%")
grid.render('attractions_rank.html')
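The snack charts follow the same pattern as the attraction charts above. A minimal sketch of the corresponding snack scores (the frame and column names are my own assumption, mirroring the attraction scores):

# Hedged sketch: snack popularity and representativeness scores
city_food_com = city_food.copy()
city_food_com['rq_point'] = city_food_com['food_count']
city_food_com['db_point'] = city_food_com['food_count'] / city_food_com['total_city_yj']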