
1. Foreword

In this column we work through a full project flow: a Python crawler scrapes the data and stores it in a database, and we then read it back from SQL for analysis and visualization.

I will also write special articles based on readers' needs. If you have ideas of your own, please visit the link below and leave suggestions for the blogger:

shimo.im/forms/pVQRX…

2. Column summary

  • Getting started: crawl Weibo hot search data with one line of code
  • Storage: save the crawled data to CSV, MySQL, and other databases
  • Analysis: read the MySQL data and perform data analysis and visualization
  • Advanced: render the visualized results to a web page (large-screen visualization)
  • Wrap-up: project summary and reflections, looking forward to your contributions

3. Getting started: crawling Weibo hot search data

3.1 Finding the data source and analyzing the page

First, search for "Weibo hot search" in your browser and you will quickly find Weibo's hot search page, at the following address:

```
https://s.weibo.com/top/summary/
```

The hot search data we're going to crawl consists of the four fields in the table below.

| Data name | English name | Data type |
| --- | --- | --- |
| Hot search rank | wb_rank | int |
| Hot search title | wb_title | str |
| Hot search heat | wb_hot | int |
| Heat label | wb_lable | str |

Without further ado: with the data visible on the page, press F12 to open the browser's developer tools (Google Chrome is recommended), click the Network tab, and refresh the page. Among the loaded requests you will find one named summary whose response is HTML code with the data inside it. Clicking Headers, you can see that the request URL is exactly the link we typed into the browser. There is no separate data interface, so the rest is simple.

(All of the source code for this article is available; the download address is at the end of the article.)

3.2 Crawling Weibo hot search with one line of code

Now let's look at the simple way to get the data: one line of code, or two at most if you count the import.

```python
import pandas as pd

# read_html grabs every <table> on the page; the first one is the hot search list
wb_hot_data = pd.read_html('https://s.weibo.com/top/summary/summary', index_col=0)[0]
```
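If you want a quick look at what came back, here is a minimal sketch (the raw column headers are whatever the page's table happens to use at crawl time):

```python
# Peek at the raw DataFrame before any cleanup
print(wb_hot_data.shape)
print(wb_hot_data.head())
```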

(When I first did this project I took a different route: crawl the page first, then parse it with XPath. While analyzing the XPath paths I suddenly realized that the tr/td elements mean the hot searches are simply a table on the page, and for crawling a page table pandas' read_html function can be used directly, as covered in detail in article 4 of my Data Analysis in Practice column. That is where the idea for the one-line crawler above came from.)

Of course, the result is still a little awkward: although it is already in DataFrame format, the hot search title and the heat value sit in the same td on the page, so they are crawled into the same column.

That's no problem, though; we can use the re module to split the title and the heat into two columns.

First, lightly process the data: reset the column names, then delete the recommended hot searches (rank shown as '•') and the pinned top hot search (rank is NaN).

```python
# Reset the table header
wb_hot_data.columns = ['wb_rank', 'wb_title_hot', 'wb_lable']
# Light cleanup: remove the recommended and pinned hot searches
wb_hot_data = wb_hot_data.drop(wb_hot_data[(wb_hot_data['wb_rank'] == '•') | pd.isna(wb_hot_data['wb_rank'])].index)
```
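As a quick sanity check (a sketch; the exact count depends on when you crawl), roughly the 50 regular hot searches should remain after the drop:

```python
# After dropping pinned/recommended rows, about 50 rows should be left
print(len(wb_hot_data))
```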

Next, use a regular expression to extract the data, splitting the hot search title and heat into two columns:

```python
import re

def reg_str(wb_title_hot='Dilieba playing bubble machine 293536'):
    # Split "title heat" into its two parts
    data = re.match(r'(.*?) (\d+)', wb_title_hot)
    if data:
        return data[1], data[2]
    else:
        return None, None

# apply the splitter to every row
wb_title_hots = wb_hot_data.apply(lambda x: reg_str(x['wb_title_hot']), axis=1)
# list comprehensions assign the title and the heat value to two new columns
wb_hot_data['wb_title'] = [wb_title_hots[i][0] for i in wb_title_hots.index]
wb_hot_data['wb_hot'] = [int(wb_title_hots[i][1]) for i in wb_title_hots.index]
```
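To see the splitter in action on its illustrative default string (the title here is just the function's sample value, not live data):

```python
# Try the splitter on the sample string
print(reg_str('Dilieba playing bubble machine 293536'))
# ('Dilieba playing bubble machine', '293536')
```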

With that, the data is basically sorted out. Overall, this approach is simpler than the one below, but if you are a crawler beginner, I suggest reading the next section carefully for the crawling and parsing method.

3.3 Crawler beginners look here: an entry-level template for crawling Weibo hot search

Let's go straight to the crawler template:

```python
# The requests library is needed first: pip install requests
import requests
# HTML text parsing library: parses text into an HTML tree so data can be extracted with XPath
from lxml import etree


def get_respones_data(wb_url='https://s.weibo.com/top/summary/'):
    '''
    Parameters:
        wb_url: string, the URL to request; default: https://s.weibo.com/top/summary/
    Returns:
        a tuple (get_data, a):
        get_data is the requested page's text (it can be parsed later with regular expressions);
        a is get_data parsed into an HTML tree, convenient for extracting data with XPath
    '''
    headers = {
        'Host': 's.weibo.com',
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36'
    }
    # requests sends the request
    get_response = requests.get(wb_url, headers=headers)
    # take the text of the response (the whole web page)
    get_data = get_response.text
    # parse the page
    a = etree.HTML(get_data)
    return get_data, a

# call the function and get the data
get_data, first_a = get_respones_data()
```
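One optional check before parsing (a sketch, not part of the original template): Weibo sometimes serves a redirect or login page to unauthenticated clients, in which case the text you get back will not contain the hot search table.

```python
# Rough sanity check: the hot search container id should appear in the raw HTML
if 'pl_top_realtimehot' not in get_data:
    print('Warning: the response does not look like the hot search page')
```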

If this method feels like too much trouble, you can skip this section; the previous section already introduced a particularly simple way to retrieve the data.

Next, let's extract the data we need from the page we crawled!

3.3.1 Hot search title

First, on the hot search page, press F12 to open developer tools and click the element picker in the panel's top-left corner (select an element...). Then click the piece of data you want to extract, such as a hot search title; the panel jumps to the corresponding HTML and returns to normal (non-picking) mode. Hover over the highlighted HTML code and choose Copy -> Copy XPath to get the XPath of that title, for example:

```
//*[@id="pl_top_realtimehot"]/table/tbody/tr[2]/td[2]/a
```

With the XPath in hand, we can call lxml's xpath function to extract the data, as in the code below. The output is a list whose elements are our hot search titles.

Note that what we want is the text inside the a tag, so we need to append /text() to the XPath. Similarly, we can also fetch attribute values from the tag when needed, such as the link behind each hot search.

```python
# Hot search title
print(first_a.xpath('//*[@id="pl_top_realtimehot"]/table/tbody/tr[2]/td[2]/a/text()'))

# Output: ['Xiaomi drops MI brand']
```
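As mentioned above, attributes work the same way as text. A minimal sketch of grabbing the link behind that same entry (prepending the site root is my assumption here, since the page uses relative URLs):

```python
# Fetch the href attribute instead of the text node
hot_link = first_a.xpath('//*[@id="pl_top_realtimehot"]/table/tbody/tr[2]/td[2]/a/@href')
if hot_link:
    # the href is relative, e.g. /weibo?q=..., so join it with the site root
    print('https://s.weibo.com' + hot_link[0])
```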

Once we can fetch one title, we naturally want all the titles on the page. The method is simple: use the steps above to copy the XPath of any other hot search title and compare the two.

As the comment in the following code shows, the only difference between the two XPaths is the index on tr, so we can simply remove the [2] after tr to match every row.

"' contrast to find general xpath path / / * [@ id =" pl_top_realtimehot "] / table/tbody/tr [2] [2] / td/a //*[@id="pl_top_realtimehot"]/table/tbody/tr[4]/td[2]/a '''
# general
wb_titles = first_a.xpath('//*[@id="pl_top_realtimehot"]/table/tbody/tr/td[2]/a/text()')
print(len(wb_titles))
print(wb_titles)
Copy the code

Looking at the output, the count doesn't match: there are only 50 hot searches on the page, so why did we get 53 items? Did we crawl some hidden data? Of course not. Weibo pins one hot search at the top (let's call it the top hot search) and inserts 1-3 recommended hot searches (let's call them that). These entries only have a title, with no rank, heat, label, or other attributes, and they need to be removed during analysis.

Using the same method, we can obtain the hot search heat, rank, and label in turn.

3.3.2 Hot search heat

```python
# Hot search heat
print(first_a.xpath('//*[@id="pl_top_realtimehot"]/table/tbody/tr[2]/td[2]/span/text()'))

# Output: ['2866232']

'''
Compare the copied XPaths to find the general one:
//*[@id="pl_top_realtimehot"]/table/tbody/tr[2]/td[2]/span
//*[@id="pl_top_realtimehot"]/table/tbody/tr[4]/td[2]/span
'''
# general XPath
wb_hot = first_a.xpath('//*[@id="pl_top_realtimehot"]/table/tbody/tr/td[2]/span/text()')
print(len(wb_hot))
print(wb_hot)
```

3.3.3 Hot search rank

```python
# Hot search rank
'''
Compare the copied XPaths to find the general one:
//*[@id="pl_top_realtimehot"]/table/tbody/tr[10]/td[1]
//*[@id="pl_top_realtimehot"]/table/tbody/tr[7]/td[1]
'''
# general XPath
wb_rank = first_a.xpath('//*[@id="pl_top_realtimehot"]/table/tbody/tr/td[1]/text()')
print(len(wb_rank))
print(wb_rank)
```

3.3.4 Heat labels

```python
# Heat label
'''
Compare the copied XPaths to find the general one:
//*[@id="pl_top_realtimehot"]/table/tbody/tr[3]/td[3]/i
//*[@id="pl_top_realtimehot"]/table/tbody/tr[7]/td[3]/i
'''
# general XPath
wb_lable = first_a.xpath('//*[@id="pl_top_realtimehot"]/table/tbody/tr/td[3]/i/text()')
print(len(wb_lable))
print(wb_lable)
```

With that, each field has been obtained, but we still need to line up the fields belonging to the same hot search. Because of the top and recommended hot searches, the lists cannot simply be joined directly; the titles list, for example, is longer than the labels list.

Of course there is a way around this. When copying XPaths during the analysis, we saw that the paths differ only in the n of tr[n], so we can loop over n and fetch all the fields of one hot search per iteration.

That raises a new question: what is n? Clearly n is not a fixed value (the number of recommended hot searches varies). Looking at the page carefully, every row, whether a regular, top, or recommended hot search, has a title, so we can use the number of titles as n. See the code below.

```python
# general XPath: every row has a title, so its count gives us n
wb_titles = first_a.xpath('//*[@id="pl_top_realtimehot"]/table/tbody/tr/td[2]/a/text()')
n = len(wb_titles)
wb_hot_data = []

# Extract the first element of an XPath result, or None if the row lacks the field
def get_element(data_list):
    if data_list:
        return data_list[0]
    else:
        return None

# loop over each row
for i in range(n):
    wb_rank = first_a.xpath('//*[@id="pl_top_realtimehot"]/table/tbody/tr[%d]/td[1]/text()' % (i+1))
    wb_titles = first_a.xpath('//*[@id="pl_top_realtimehot"]/table/tbody/tr[%d]/td[2]/a/text()' % (i+1))
    wb_hot = first_a.xpath('//*[@id="pl_top_realtimehot"]/table/tbody/tr[%d]/td[2]/span/text()' % (i+1))
    wb_lable = first_a.xpath('//*[@id="pl_top_realtimehot"]/table/tbody/tr[%d]/td[3]/i/text()' % (i+1))
    wb_hot_data.append([get_element(wb_rank), get_element(wb_titles), get_element(wb_hot), get_element(wb_lable)])

# Convert the list to DataFrame format for subsequent analysis
wb_hot_data = pd.DataFrame(wb_hot_data, columns=['wb_rank', 'wb_title', 'wb_hot', 'wb_lable'])
wb_hot_data
```
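As in section 3.2, the rows whose rank came back as None (the top and recommended entries) can be dropped before analysis. A minimal sketch:

```python
# Drop rows where the rank field is missing (top/recommended hot searches)
wb_hot_data = wb_hot_data.dropna(subset=['wb_rank'])
print(wb_hot_data.head())
```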

The above explains, relatively completely and concisely, the basic idea and workflow of a simple crawler. I hope it helps readers who are just getting started with crawling; if you found it useful, remember to give it a like.

For all of the source code of this article, you can add my WeChat, pythonbrief, to get it.

In the next installment, we will learn how to save the data to a local CSV file or a database (e.g. MySQL, MongoDB, SQLite). If there are other data storage methods you would like to learn about, leave a comment in the comments section.