This is day 19 of my participation in the November Gwen Challenge. Check out the details: The Last Gwen Challenge of 2021.

Zero. Written up front

The reference book for this series of study notes is Data Analysis in Action by Tomaz Joubas. I will share my notes from studying this book with you in a series called Data Analysis in Action from Scratch.

Pandas reads and writes CSV data

Pandas reads and writes TSV/JSON data

Pandas reads and writes Excel/XML data

Pandas can read and write data in CSV, TSV, JSON, Excel, and XML formats.

One. A summary of basic knowledge

1. Pandas' read_html function for parsing HTML pages

2. Using the read_html function to fetch page data directly

3. Basic data processing: header handling, and dropna and fillna in detail

4. A basic data-visualization analysis case

Two. Getting hands-on

1. Pandas' read_html function

Pandas parses HTML pages with the read_html function. Looking at its source code, we can see that the function takes quite a few parameters; I'll pick out a few key ones to explain.

(1) io (the most critical parameter)

The source code comments

		A URL, a file-like object, or a raw string containing HTML. Note that
        lxml only accepts the http, ftp and file url protocols. If you have a
        URL that starts with ``'https'`` you might try removing the ``'s'``.

My understanding

The data address: a URL, a file-like object, or a raw string containing HTML. Note that lxml accepts only the http, ftp, and file URL protocols. If you have a URL that starts with "https", you can try removing the "s" before passing it in.
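A minimal sketch of the io parameter in action, passing a raw string of HTML (here wrapped in StringIO, which newer pandas versions prefer for literal HTML; the table contents are made up for illustration):

```python
from io import StringIO

import pandas as pd

# a raw string of HTML is one of the accepted io types
html = StringIO("""
<table>
  <tr><th>name</th><th>wealth</th></tr>
  <tr><td>A</td><td>100</td></tr>
  <tr><td>B</td><td>60</td></tr>
</table>
""")

# read_html returns a *list* of DataFrames, one per <table> found
tables = pd.read_html(html)
df = tables[0]
print(df.shape)  # (2, 2)
```

The same call works with a URL or a file path in place of the string.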

(2) match

The source code comments

		str or compiled regular expression, optional
        The set of tables containing text matching this regex or string will be
        returned. Unless the HTML is extremely simple you will probably need to
        pass a non-empty string here. Defaults to '.+' (match any non-empty
        string). The default value will return all tables contained on a page.
        This value is converted to a regular expression so that there is
        consistent behavior between Beautiful Soup and lxml.

My understanding

A string or compiled regular expression; optional. The set of tables containing text matching this regex or string is returned. Unless the HTML is very simple, you will probably want to pass a non-empty string here. The default is ".+" (matches any non-empty string), which returns every table (every <table> tag) on the page. The value is converted to a regular expression so that behavior is consistent between Beautiful Soup and lxml.
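A small sketch of match with two toy tables on one "page": a match string narrows the result to the tables whose text contains it (the StringIO wrapping and the table contents are assumptions for illustration):

```python
from io import StringIO

import pandas as pd

html = """
<table><tr><th>rank</th><th>name</th></tr><tr><td>1</td><td>Bezos</td></tr></table>
<table><tr><th>city</th></tr><tr><td>Paris</td></tr></table>
"""

# the default match='.+' returns both tables
all_tables = pd.read_html(StringIO(html))
# a match string keeps only tables containing that text
matched = pd.read_html(StringIO(html), match="Bezos")

print(len(all_tables), len(matched))  # 2 1
```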

(3) flavor

The source code comments

		flavor : str or None, container of strings
        The parsing engine to use. 'bs4' and 'html5lib' are synonymous with
        each other, they are both there for backwards compatibility. The
        default of ``None`` tries to use ``lxml`` to parse and if that fails it
        falls back on ``bs4`` + ``html5lib``.

My understanding

The parsing engine to use. 'bs4' and 'html5lib' are synonyms for each other; both exist for backward compatibility. The default is None, which tries lxml first and, if that fails, falls back to bs4 + html5lib.

2. Basic data processing

(1) Handling column names

import re

# a regular expression that matches any run of whitespace in a string
space = re.compile(r"\s+")

def fix_string_spaces(columnsToFix):
    # converts whitespace characters in column names to underscores
    tempColumnNames = []   # holds the processed column names
    # loop through all column names
    for item in columnsToFix:
        # whitespace matched
        if space.search(item):
            # process the name, then append it to the list:
            # space.split(item) splits the name on whitespace,
            # '_'.join(...) rejoins the pieces with underscores
            tempColumnNames.append('_'.join(space.split(item)))
        else:
            # otherwise append the name unchanged
            tempColumnNames.append(item)
    return tempColumnNames

The code above is from the book. Its purpose is to process column names, converting whitespace characters in them into underscores. Think about it a little and you'll see the pattern generalizes well: handling rows with empty values, columns with empty data, and so on. It is highly reusable.
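For example, applied to a DataFrame's columns (the helper is restated in compact form here so the snippet is self-contained, and the toy column names are made up):

```python
import re

import pandas as pd

space = re.compile(r"\s+")

def fix_string_spaces(columns_to_fix):
    # replace any run of whitespace in each name with a single underscore
    return ['_'.join(space.split(item)) if space.search(item) else item
            for item in columns_to_fix]

df = pd.DataFrame({"First Name": ["A"], "Net  Worth": [60]})
df.columns = fix_string_spaces(df.columns)
print(list(df.columns))  # ['First_Name', 'Net_Worth']
```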

(2) The dropna function for handling missing data

The dropna() function filters out missing data. Analysis of the common parameters follows.

axis:

The source code comments

		axis : {0 or 'index', 1 or 'columns'}, default 0
            Determine if rows or columns which contain missing values are removed.
            * 0, or 'index' : Drop rows which contain missing values.
            * 1, or 'columns' : Drop columns which contain missing value.
            .. deprecated:: 0.23.0
               Pass tuple or list to drop on multiple axes.

My understanding

You will rarely change this. The default is 0, meaning rows containing missing values are dropped; a value of 1 means columns containing missing values are dropped.
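A quick sketch of the two axis values on a toy DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, np.nan, 3], "b": [4, 5, 6]})

rows_dropped = df.dropna()        # axis=0 (default): drop rows containing NA
cols_dropped = df.dropna(axis=1)  # axis=1: drop columns containing NA

print(rows_dropped.shape, cols_dropped.shape)  # (2, 2) (3, 1)
```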

how:

The source code comments

		how : {'any', 'all'}, default 'any'
            Determine if row or column is removed from DataFrame, when we have at least one NA or all NA.
            * 'any' : If any NA values are present, drop that row or column.
            * 'all' : If all values are NA, drop that row or column.

My understanding

The default is 'any', meaning a row or column is dropped if it contains any NA (missing) value. With 'all', it is dropped only if all of its values are NA.
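The difference between the two values, sketched on a toy DataFrame whose last row is entirely NA:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, np.nan, np.nan], "b": [4, 5, np.nan]})

any_dropped = df.dropna(how="any")  # drops every row containing an NA
all_dropped = df.dropna(how="all")  # drops only the all-NA last row

print(len(any_dropped), len(all_dropped))  # 1 2
```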

thresh:

The source code comments

		thresh : int, optional
            Require that many non-NA values.

My understanding

thresh is not the number of NAs; it is the minimum number of non-NA values required. Rows that meet the requirement are kept, and rows that do not are dropped.
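A sketch of thresh with made-up data, where each row has a different number of non-NA values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, np.nan, np.nan],
                   "b": [4, 5, np.nan],
                   "c": [7, 8, 9]})

# keep only rows with at least 2 non-NA values;
# the last row has just one, so it is dropped
kept = df.dropna(thresh=2)
print(len(kept))  # 2
```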

inplace:

The source code comments

		inplace : bool, default False
            If True, do operation inplace and return None.

My understanding

The default is False, meaning the operation is not performed on the original object; instead a modified copy is returned. If True, the operation is performed directly on the original object and None is returned.
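A sketch of the difference, on a toy Series-sized DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, np.nan]})

copy = df.dropna()             # default inplace=False: df itself is untouched
print(len(df))                 # 2

ret = df.dropna(inplace=True)  # modifies df directly and returns None
print(ret, len(df))            # None 1
```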
(3) The fillna function for handling missing data

The fillna() function fills missing data with a specified value or an interpolation method. Analysis of the common parameters follows.

value:

The source code comments

		value : scalar, dict, Series, or DataFrame
            Value to use to fill holes (e.g. 0), alternately a
            dict/Series/DataFrame of values specifying which value to use for
            each index (for a Series) or column (for a DataFrame). (values not
            in the dict/Series/DataFrame will not be filled). This value cannot
            be a list.

My understanding

In short, this is the value that replaces NA. A directly given scalar replaces every NA; a dict of {column name: replacement value} replaces the null values only in the named columns.
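The scalar and dict forms side by side, on made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, np.nan], "b": [np.nan, 4]})

all_filled = df.fillna(0)          # scalar: replace every NA with 0
col_filled = df.fillna({"a": -1})  # dict: fill NAs only in column 'a'

print(col_filled["a"].tolist())    # [1.0, -1.0]
```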

method:

The source code comments

	method : {'backfill', 'bfill', 'pad', 'ffill', None}, default None
            Method to use for filling holes in reindexed Series
            pad / ffill: propagate last valid observation forward to next valid
            backfill / bfill: use NEXT valid observation to fill gap

My understanding

The method used to fill holes in a reindexed Series. pad/ffill: scan down each column and carry the last non-null value forward into the following nulls. backfill/bfill: scan down each column and pull the next non-null value backward into the preceding nulls. Note: this parameter cannot be used together with value.
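The two directions sketched on a toy Series; `ffill()`/`bfill()` are the shortcut spellings equivalent to `fillna(method='ffill')`/`fillna(method='bfill')`, and are the preferred form in newer pandas versions:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, np.nan, np.nan, 4])

# pad / ffill: carry the last valid value forward
print(s.ffill().tolist())  # [1.0, 1.0, 1.0, 4.0]
# backfill / bfill: pull the next valid value backward
print(s.bfill().tolist())  # [1.0, 4.0, 4.0, 4.0]
```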

limit:

The source code comments

		limit : int, default None
            If method is specified, this is the maximum number of consecutive
            NaN values to forward/backward fill. In other words, if there is
            a gap with more than this number of consecutive NaNs, it will only
            be partially filled. If method is not specified, this is the
            maximum number of entries along the entire axis where NaNs will be
            filled. Must be greater than 0 if not None.

My understanding

Put simply, null values are scanned column by column, and limit caps how many of them get filled. For example, with limit=2, scanning a column from top to bottom fills only the first two null values.
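A sketch of limit combined with a forward fill, on a toy Series with a gap of three NaNs (using the `ffill` shortcut rather than `fillna(method=...)`):

```python
import numpy as np
import pandas as pd

s = pd.Series([1, np.nan, np.nan, np.nan, 5])

# forward-fill at most 2 consecutive NaNs; the third stays empty
filled = s.ffill(limit=2)
print(filled.tolist())  # [1.0, 1.0, 1.0, nan, 5.0]
```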

A quick comment: the words in the source-code comments look simple, but they are often too terse to even form full sentences. My approach was practice plus literal translation, and then working out each parameter's real meaning.

3. Hands-on data-crawling practice

Five lines of code to crawl the 2019 rich list ($6 billion+)

import pandas as pd

# loop over the 15 list pages
for i in range(15):
    # page URL
    url = "https://www.phb123.com/renwu/fuhao/shishi_%d.html" % (i + 1)
    # call read_html to parse the page; take the first (and only) table
    url_read = pd.read_html(url, header=0)[0]
    # append the data to a CSV file
    url_read.to_csv(r'rich_list.csv', mode='a', encoding='utf_8_sig', header=0, index=False)

Page data vs. crawl results: 1. Don't be fooled by how simple this looks (I happened to find a cooperative website: each page contains only one table, and the data is fairly clean). 2. Real-world websites may not be so cooperative, and neither may the data. In that case the best approach is to dig in yourself and read the page source.

4. Hands-on data-visualization analysis

Based on the data we collected above, let's do a simple data-visualization analysis report. The 2019 rich list ($6 billion+) data includes ranking, name, amount of wealth, source of wealth, and country. Having identified the data's attributes, we should think about what we can analyze. A few things come to mind: (1) How many people are there in each country on the list? Which countries have the most? (2) Which companies have the most people on the list? (3) What is the industry distribution of the people on the list?

(0) Data reading and data visualization

To read the data, we use pandas' read_csv function directly.

import pandas as pd

# path to the original data file
rpath_csv = 'rich_list.csv'
# read the data
csv_read = pd.read_csv(rpath_csv)
# each extracted column is a pandas Series object,
# which can be converted to a list for later processing
name_list = csv_read["Name"]
money_list = csv_read["Wealth ($1 billion)"]
company_list = csv_read["Source of wealth"]
country_list = csv_read["Country/region"]

For data visualization we start with the simplest option, the pyecharts module.

pip install pyecharts==0.5.11
(1) How many people are there in each country on the list? Which countries have the most?

from collections import Counter
from pyecharts import Bar

# 1. count people per country with the Counter class of the collections module
country_list = list(country_list)
dict_number = Counter(country_list)
key_list = list(dict_number.keys())
values_list = list(dict_number.values())

# 2. visualize with the Bar class of the pyecharts (0.5.x) module
bar = Bar("Bar chart of rich countries")
bar.add("Rich", key_list, values_list, is_more_utils=True, is_datazoom_show=True,
        xaxis_interval=0, xaxis_rotate=30, yaxis_rotate=30,
        mark_line=["average"], mark_point=["max", "min"])
bar.render("rich_country.html")

From the data above we can clearly see that the United States dominates the list's nationalities, and by a wide margin: of the 300 people in the data, 106 are American, roughly one-third of the total. This is easy to understand; the United States has long been a superpower, leading the world in nearly every aspect of development.

Second is China with 43, which is also remarkable. China's path here has not been easy: the People's Republic was founded in 1949, so 2019 marks its 70th year, from "studying for the rise of China" to "striving to realize the Chinese dream of a prosperous, democratic, civilized, harmonious, and beautiful modern socialist country." As a Chinese person, I am proud.

Third are Germany and Russia, with 20 people each. Germany is an industrial powerhouse and Europe's biggest economy, so its strength is obvious. Russia is the world's largest country by area, and the Soviet Union was once a global economic power; despite the post-Soviet collapse, the economy has recovered steadily in recent years under Putin.

Most of the remaining countries are European; fifth is India, whose science and technology sector is well developed.

(2) Which companies have the most people on the list?

Note that the lowest wealth on this list is $6 billion. Statistically, Mars has the most people on the list: six of the rich come from the Mars company. Next is Wal-Mart, with three people. After these two consumer companies come the technology companies: Microsoft, Facebook, and Google.

I didn't know that Snickers and Dove come from the same company, Mars. Respect. Also, Wal-Mart topped the Fortune Global 500 in 2018, so in a sense it is the strongest company in the universe. (As a kid I always thought Fudi was the most powerful supermarket; growing up I thought it was Wanda. Now I know: it's Wal-Mart!)

(3) What is the industry distribution of the people on the list? (I'll leave this one to you.)

This part is harder, because the data we collected has no direct industry column. The only link to industry is the company, so we would need to judge (or look up online) each company's category: internet company, traditional industry, and so on, from its name.

Three. A few words for you

Persistence plus hard work: results.

The idea is very complicated,

The implementation is interesting,

As long as you don’t give up,

Fame will come.

— Old Watch doggerel

See you next time. I’m a cat lover and a tech lover. If you find this article helpful, please like, comment and follow me!