Erik Marsja

Lao Qi

The book recommended in this article is *Data Preparation and Feature Engineering*, available from the [Tmall flagship store of Publishing House of Electronics Industry]


In this article, we’ll walk through how to fetch data from an HTML page using Pandas’ read_html function. First, as a simple example, we’ll use Pandas to read an HTML table from a string; then we’ll use several examples to show how to read data from Wikipedia pages.

Load data in Python

For data analysis and visualization, we usually load data from an existing file, such as a common CSV or Excel file. To read data from a CSV file, you can use Pandas’ read_csv method. For example:

import pandas as pd

df = pd.read_csv('CSVFILE.csv')

The above method is usually used to import structured data, such as CSV or JSON.

Most of the information we use in Wikipedia is in the form of HTML tables.

To get the data in these tables, we could copy and paste them into a spreadsheet and read them back with Pandas’ read_excel. That works, but here we’ll use web scraping to automate the data reading.

Preliminary knowledge

To read HTML table data with Pandas, first install Pandas. To do that, use pip: pip install pandas.

Note that pip automatically checks whether the package needs to be upgraded when executing this command, and upgrades it if necessary. In addition, we will also need the lxml or beautifulsoup4 package; the installation method is still pip: pip install lxml.
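Putting the installation steps above together, a typical setup (assuming pip is available on your system) might look like this:

```shell
# Install pandas plus an HTML parser backend for read_html
pip install --upgrade pandas lxml

# Alternatively, BeautifulSoup with html5lib also works as a parser backend
pip install beautifulsoup4 html5lib
```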

The read_html function

Using Pandas’ read_html to read data from an HTML table is simple:

pd.read_html('URL_ADDRESS_or_HTML_FILE')

In its simplest form, that is all there is to using the read_html function. Here is an example:

Example 1

In the first example, we use Pandas’ read_html function to read data from an HTML table in a string.

import pandas as pd

html = '''
<table>
  <tr><th>a</th><th>b</th><th>c</th><th>d</th></tr>
  <tr><td>1</td><td>2</td><td>3</td><td>4</td></tr>
  <tr><td>5</td><td>6</td><td>7</td><td>8</td></tr>
</table>
'''

df = pd.read_html(html)

The result is not a Pandas DataFrame object but a Python list object, which can be verified using the type() function:

type(df)
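Since read_html always returns a list of DataFrames (one per table it finds), you pick the table you want by index. A minimal sketch, wrapping the string in StringIO to stay compatible with newer Pandas versions:

```python
from io import StringIO
import pandas as pd

html = '''
<table>
  <tr><th>a</th><th>b</th><th>c</th><th>d</th></tr>
  <tr><td>1</td><td>2</td><td>3</td><td>4</td></tr>
</table>
'''

tables = pd.read_html(StringIO(html))
print(type(tables))   # a list, not a DataFrame
df = tables[0]        # the first (and here, only) table
print(type(df))       # now a pandas DataFrame
```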

Example 2

In the second example, we grab data from Wikipedia: the tabular data on the page about the family Pythonidae.

import pandas as pd

dfs = pd.read_html('https://en.wikipedia.org/wiki/Pythonidae')

Now we have a list of seven tables (check with len(dfs)). If we open the Wikipedia page, we can see that the first table is the one on the right side of the page. In this case, we are more interested in the second table:

dfs[1]
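The same one-DataFrame-per-table behavior can be seen offline. A small sketch with two hypothetical tables in one HTML string:

```python
from io import StringIO
import pandas as pd

# Two <table> elements -> read_html returns a list of two DataFrames
html = '''
<table><tr><th>x</th></tr><tr><td>1</td></tr></table>
<table><tr><th>name</th><th>family</th></tr>
<tr><td>Python regius</td><td>Pythonidae</td></tr></table>
'''

dfs = pd.read_html(StringIO(html))
print(len(dfs))   # number of tables found on the page
print(dfs[1])     # select the table you actually want by index
```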

Example 3

In the third example, we read COVID-19 data for Sweden. Here we need to pass an extra parameter to read_html, perform some data cleansing, and finally visualize the data.

Fetching the data

If you open the page, you will see a table captioned “New COVID-19 cases in Sweden by county”. We will use the match parameter with this string:

dfs = pd.read_html('https://en.wikipedia.org/wiki/2020_coronavirus_pandemic_in_Sweden',
                  match='New COVID-19 cases in Sweden by county')
dfs[0].tail()
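The match parameter keeps only tables whose text matches the given string or regular expression, which is how the call above singles out one table from a page full of them. A minimal offline sketch with made-up table contents:

```python
from io import StringIO
import pandas as pd

html = '''
<table><tr><th>col</th></tr><tr><td>irrelevant</td></tr></table>
<table><tr><th>county</th></tr>
<tr><td>New COVID-19 cases in Stockholm</td></tr></table>
'''

# Only tables containing text that matches the string are returned
dfs = pd.read_html(StringIO(html), match='New COVID-19 cases')
print(len(dfs))  # 1, even though the page has two tables
```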

With this method we get only that table from the web page; however, the bottom three rows contain notes rather than data and need to be deleted.

Delete the last few rows with Pandas’ iloc

Next, use Pandas’ iloc to delete the last three rows. Note that we use -3 as the end of the row slice, and then copy the data with copy().

df = dfs[0].iloc[:-3, :].copy()
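The iloc slice works the same on any DataFrame; a tiny self-contained sketch of dropping the last three rows:

```python
import pandas as pd

df = pd.DataFrame({'a': range(6)})

# ':-3' keeps all rows except the last three; copy() detaches
# the result from the original DataFrame to avoid chained-assignment issues
trimmed = df.iloc[:-3, :].copy()
print(trimmed['a'].tolist())
```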

Next, we will see how to change a multi-level column index into a single-level index.

Flatten the multi-level column index and delete unnecessary characters

The columns read from this Wikipedia table form a multi-level index; DataFrame.columns.get_level_values() selects a single level from it:

df.columns = df.columns.get_level_values(1)
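To see what get_level_values does, here is a small sketch that builds a two-level column header by hand (the county names are placeholders) and keeps only the second level:

```python
import pandas as pd

# Simulate the kind of two-level header read_html builds from Wikipedia
cols = pd.MultiIndex.from_tuples([('Cases', 'Stockholm'), ('Cases', 'Uppsala')])
df = pd.DataFrame([[1, 2]], columns=cols)

df.columns = df.columns.get_level_values(1)  # keep only the second level
print(list(df.columns))
```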

Finally, as you can see, after we fetch the table from the Wikipedia page with read_html, the “Date” column contains bracketed footnote markers, which we remove with the str.replace method and a regular expression:

df['Date'] = df['Date'].str.replace(r"\[.*?\]", "", regex=True)
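The non-greedy pattern \[.*?\] matches each bracketed marker separately. A quick sketch on made-up values:

```python
import pandas as pd

s = pd.Series(['1 March[note 1]', '2 March[a]'])
cleaned = s.str.replace(r"\[.*?\]", "", regex=True)  # strip bracketed footnote markers
print(cleaned.tolist())
```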

Change the index with set_index

We continue to use Pandas’ set_index method to set the date column as an index, which provides a Series object of time type for subsequent diagrams.

df['Date'] = pd.to_datetime(df['Date'])
df.set_index('Date', inplace=True)
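The two lines above convert the strings to timestamps and make them the row index. A self-contained sketch with made-up dates, showing that the result is a proper DatetimeIndex:

```python
import pandas as pd

df = pd.DataFrame({'Date': ['1 March 2020', '2 March 2020'], 'Cases': [1, 2]})

df['Date'] = pd.to_datetime(df['Date'])  # strings -> Timestamps
df.set_index('Date', inplace=True)       # DatetimeIndex enables time-series plotting
print(df.index.dtype)
```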

For later plotting purposes, we need to fill the missing values with 0 and then convert the corresponding columns to a numeric type; to do this, use the apply method. Finally, the cumsum() method gives the cumulative sum of each column.

df.fillna(0, inplace=True)
df = df.iloc[:,0:21].apply(pd.to_numeric)

df = df.cumsum()
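The fillna / to_numeric / cumsum pipeline can be seen on a toy DataFrame (column names are placeholders):

```python
import pandas as pd

df = pd.DataFrame({'A': ['1', None, '3'], 'B': [None, '2', None]})

df.fillna(0, inplace=True)    # missing values -> 0
df = df.apply(pd.to_numeric)  # strings -> numbers, column by column
df = df.cumsum()              # running totals down each column
print(df['A'].tolist())
```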

Plot the time series

In the last part, we use the data obtained with read_html to create a time-series plot. First, import matplotlib; then use the legend function to define the legend location.

%matplotlib inline
import matplotlib.pyplot as plt
f = plt.figure()

plt.title('Covid cases Sweden', color='black')
df.iloc[:,0:21].plot(ax=f.gca())

plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))

Conclusion: How do I read data from HTML and convert it to DataFrame

In this article, you learned how to read data from HTML using Pandas’ read_html function and created a time-series plot from Wikipedia data. Along the way, you also set the “Date” column as the index of the DataFrame.

Original link: www.marsja.se/how-to-use-…

Search for the WeChat official account for technical Q&A: Lao Qi classroom

Reply “Lao Qi” in the official account to view all articles, books and courses.
