Source: Python Data Analysis

### Contents:

  • DIKW model and data engineering
  • Scientific calculation tool Numpy
  • Data analysis tool Pandas
  • Pandas function applications, hierarchical indexing, and statistical calculations
  • Pandas groups and aggregates
  • Data cleaning, merging, transformation, and reconstruction

#1. Pandas function application

# apply and applymap

###1. NumPy functions can be used directly

Sample code:

import numpy as np
import pandas as pd

# NumPy ufuncs can be applied to a DataFrame directly
df = pd.DataFrame(np.random.randn(5, 4) - 1)
print(df)

print(np.abs(df))

Running results:

          0         1         2         3
0 -0.062413  0.844813 -1.853721 -1.980717
1 -0.539628 -1.975173 -0.856597 -2.612406
2 -1.277081 -1.088457  0.152189  0.530325
3 -1.356578 -1.996441  0.368822 -2.211478
4 -0.562777  0.518648 -2.007223  0.059411

          0         1         2         3
0  0.062413  0.844813  1.853721  1.980717
1  0.539628  1.975173  0.856597  2.612406
2  1.277081  1.088457  0.152189  0.530325
3  1.356578  1.996441  0.368822  2.211478
4  0.562777  0.518648  2.007223  0.059411
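
To make the point reproducible, here is a minimal sketch with fixed values instead of random ones. The names `df_demo`, `abs_df`, and `sqrt_df` are made up for this illustration; the idea is only that a NumPy ufunc applied to a DataFrame works element by element and gives back a DataFrame with the same labels.

```python
import numpy as np
import pandas as pd

# Small fixed DataFrame (hypothetical values) so the result is reproducible
df_demo = pd.DataFrame([[-1.5, 2.0], [3.0, -4.0]], columns=["x", "y"])

# A NumPy ufunc works element-wise and keeps the row and column labels
abs_df = np.abs(df_demo)
print(abs_df)

# ufuncs can be chained just like ordinary array operations
sqrt_df = np.sqrt(abs_df)
print(sqrt_df)
```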

###2. Apply functions to columns or rows using apply

Sample code:

# Apply a function to each column or row with apply
# f = lambda x : x.max()
print(df.apply(lambda x : x.max()))

Running results:

0   -0.062413
1    0.844813
2    0.368822
3    0.530325
dtype: float64

Pay attention to the axis direction. The default is axis=0, which passes each column to the function.

Sample code:

# specify the axis direction, axis=1, direction is row
print(df.apply(lambda x : x.max(), axis=1))

Running results:

0    0.844813
1   -0.539628
2    0.530325
3    0.368822
4    0.518648
dtype: float64
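
The two axis directions can be compared side by side. The following is a small sketch on made-up data (`df_demo` and the helper `min_max` are hypothetical names, not part of the original example). It also shows that when the applied function returns a Series, `apply` assembles the results into a DataFrame.

```python
import pandas as pd

df_demo = pd.DataFrame({"a": [1, 5, 3], "b": [4, 2, 6]})

# axis=0 (default): the function receives each column as a Series
col_max = df_demo.apply(lambda col: col.max())

# axis=1: the function receives each row as a Series
row_max = df_demo.apply(lambda row: row.max(), axis=1)

# Returning a Series from the function yields a DataFrame of results
def min_max(col):
    return pd.Series([col.min(), col.max()], index=["min", "max"])

summary = df_demo.apply(min_max)
print(col_max, row_max, summary, sep="\n\n")
```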

###3. Apply a function to each element with applymap

Sample code:

# Apply a function to every element with applymap
f2 = lambda x : '%.2f' % x
print(df.applymap(f2))

Running results:

       0      1      2      3
0  -0.06   0.84  -1.85  -1.98
1  -0.54  -1.98  -0.86  -2.61
2  -1.28  -1.09   0.15   0.53
3  -1.36  -2.00   0.37  -2.21
4  -0.56   0.52  -2.01   0.06
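
A note for readers on newer pandas: as far as I know, pandas 2.1 introduced `DataFrame.map` as the element-wise method and deprecated `applymap`. The sketch below assumes nothing about the installed version and simply picks whichever method exists; `df_demo`, `elementwise`, and `formatted` are illustrative names.

```python
import pandas as pd

df_demo = pd.DataFrame({"a": [1.234, -5.678], "b": [0.5, 2.0]})

# Assumption: pandas 2.1+ offers DataFrame.map for element-wise application;
# older versions only have DataFrame.applymap, so use whichever is available.
elementwise = df_demo.map if hasattr(df_demo, "map") else df_demo.applymap

# Format every element as a string with two decimal places
formatted = elementwise(lambda x: "%.2f" % x)
print(formatted)
```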

# Sorting

###1. Sort by index

sort_index()

Sorting is ascending by default; pass ascending=False for descending order.

Sample code:

# Series
s4 = pd.Series(range(10, 15), index = np.random.randint(5, size=5))
print(s4)

# Sort by index
print(s4.sort_index())  # 0, 0, 1, 3, 3

Running results:

0    10
3    11
1    12
3    13
0    14
dtype: int64

0    10
0    14
1    12
3    11
3    13
dtype: int64
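
For comparison, a short sketch of both sort directions on a Series with a fixed, made-up index (`s_demo` is a hypothetical name):

```python
import pandas as pd

s_demo = pd.Series([10, 11, 12, 13], index=[2, 0, 3, 1])

asc = s_demo.sort_index()                  # index order: 0, 1, 2, 3
desc = s_demo.sort_index(ascending=False)  # index order: 3, 2, 1, 0
print(asc)
print(desc)
```

sort_index() returns a new object and leaves `s_demo` itself unchanged.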

Note the axis orientation when working with DataFrame

Sample code:

# DataFrame
df4 = pd.DataFrame(np.random.randn(3, 5), 
                   index=np.random.randint(3, size=3),
                   columns=np.random.randint(5, size=5))
print(df4)

df4_isort = df4.sort_index(axis=1, ascending=False)
print(df4_isort) # 4, 2, 1, 1, 0

Running results:

          1         4         0         1         2
2 -0.416686 -0.161256  0.088802 -0.004294  1.164138
1 -0.671914  0.531256  0.303222 -0.509493 -0.342573
1  1.988321 -0.466987  2.787891 -1.105912  0.889082

          4         2         1         1         0
2 -0.161256  1.164138 -0.416686 -0.004294  0.088802
1  0.531256 -0.342573 -0.671914 -0.509493  0.303222
1 -0.466987  0.889082  1.988321 -1.105912  2.787891

###2. Sort by value

sort_values(by='column name')

The column name passed to by must be unique. If several columns share that name, sort_values raises an error.

Sample code:

# Sort by value
df4_vsort = df4.sort_values(by=0, ascending=False)
print(df4_vsort)

Running results:

          1         4         0         1         2
1  1.988321 -0.466987  2.787891 -1.105912  0.889082
1 -0.671914  0.531256  0.303222 -0.509493 -0.342573
2 -0.416686 -0.161256  0.088802 -0.004294  1.164138
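
The uniqueness requirement can be checked with a small sketch. The data and the names `df_demo`, `by_unique`, and `raised` are invented for the illustration; in the pandas versions I am aware of, the error raised for a duplicated label is a ValueError.

```python
import pandas as pd

# Column label 1 appears twice; label 0 is unique (hypothetical data)
df_demo = pd.DataFrame([[3, 9, 1], [1, 8, 2], [2, 7, 3]], columns=[0, 1, 1])

# Sorting by the unique label works
by_unique = df_demo.sort_values(by=0, ascending=False)
print(by_unique)

# Sorting by the duplicated label is ambiguous and raises ValueError
try:
    df_demo.sort_values(by=1)
    raised = False
except ValueError as exc:
    raised = True
    print("ValueError:", exc)
```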

# Handle missing data

Sample code:

df_data = pd.DataFrame([np.random.randn(3), [1., 2., np.nan],
                       [np.nan, 4., np.nan], [1., 2., 3.]])
print(df_data.head())

Running results:

          0         1         2
0 -0.281885 -0.786572  0.487126
1  1.000000  2.000000       NaN
2       NaN  4.000000       NaN
3  1.000000  2.000000  3.000000

###1. Check for missing values: isnull()

Sample code:

# isnull
print(df_data.isnull())

Running results:

       0      1      2
0  False  False  False
1  False  False   True
2   True  False   True
3  False  False  False
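
The boolean mask returned by isnull() is usually summarized rather than read cell by cell. A minimal sketch on made-up data (`df_demo`, `mask`, `per_column`, and `any_missing` are illustrative names):

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame([[1.0, 2.0, np.nan],
                        [np.nan, 4.0, np.nan],
                        [1.0, 2.0, 3.0]])

mask = df_demo.isnull()          # boolean DataFrame, True where a value is missing
per_column = mask.sum()          # number of missing values in each column
any_missing = mask.values.any()  # True if at least one value is missing
print(per_column)
print(any_missing)
```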

###2. Drop missing data: dropna()

dropna() discards the rows (axis=0, the default) or the columns (axis=1) that contain NaN.

Sample code:

# dropna
print(df_data.dropna())

print(df_data.dropna(axis=1))

Running results:

          0         1         2
0 -0.281885 -0.786572  0.487126
3  1.000000  2.000000  3.000000

          1
0 -0.786572
1  2.000000
2  4.000000
3  2.000000
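
Dropping every row with a single NaN can be too aggressive. As a sketch on made-up data, the `thresh` parameter of dropna() keeps rows that still have a minimum number of non-missing values; the variable names here are hypothetical.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame([[1.0, 2.0, np.nan],
                        [np.nan, 4.0, np.nan],
                        [1.0, 2.0, 3.0]])

rows_any = df_demo.dropna()             # drop every row that contains a NaN
cols_any = df_demo.dropna(axis=1)       # drop every column that contains a NaN
rows_thresh = df_demo.dropna(thresh=2)  # keep rows with at least 2 non-NaN values
print(rows_any, cols_any, rows_thresh, sep="\n\n")
```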

###3. Fill missing data: fillna()

Sample code:

# fillna
print(df_data.fillna(-100.))

Running results:

            0         1           2
0   -0.281885 -0.786572    0.487126
1    1.000000  2.000000 -100.000000
2 -100.000000  4.000000 -100.000000
3    1.000000  2.000000    3.000000
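
A constant is not the only possible fill value. The sketch below, on made-up data, fills each column's gaps with that column's mean by passing a Series to fillna(); `df_demo`, `filled_const`, and `filled_mean` are illustrative names.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 4.0, 8.0]})

# Fill every missing value with the same constant
filled_const = df_demo.fillna(0.0)

# Fill each column with its own mean (mean() skips NaN by default)
filled_mean = df_demo.fillna(df_demo.mean())
print(filled_const)
print(filled_mean)
```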

#2. Hierarchical Indexing

When creating a Series, pass a list of two sub-lists as the index argument. The first sub-list is the outer index and the second is the inner index.

Sample code:

import pandas as pd
import numpy as np

# Outer labels first, inner labels second; both lists have 12 entries
ser_obj = pd.Series(np.random.randn(12), index=[
                ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c', 'd', 'd', 'd'],
                [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]])
print(ser_obj)

Running results:

a  0    0.099174
   1   -0.310414
   2   -0.558047
b  0    1.742445
   1    1.152924
   2   -0.725332
c  0   -0.150638
   1    0.251660
   2    0.063387
d  0    1.080605
   1    0.567547
   2   -0.154148
dtype: float64

# MultiIndex index object

  • Printing the type of this Series's index shows MultiIndex.

  • Printing the index itself shows levels and labels. levels lists the distinct labels at each of the two levels; labels gives, for each position, the integer position of its label within the corresponding level.

Sample code:

print(type(ser_obj.index))
print(ser_obj.index)

Running results:

<class 'pandas.indexes.multi.MultiIndex'>
MultiIndex(levels=[['a', 'b', 'c', 'd'], [0, 1, 2]],
           labels=[[0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3], [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]])

# Select a subset

  • Data is selected by index. Because there are now two index levels, selecting by the outer index only needs the outer label.

  • To select by the inner index, pass two elements inside the brackets: the first selects the outer index and the second selects the inner index.

###1. Outer selection:

ser_obj['outer_label']

Sample code:

# select outer layer
print(ser_obj['c'])

Running results:

0   -1.362096
1    1.558091
2    0.452313
dtype: float64

###2. Inner selection:

ser_obj[:, 'inner_label']

Sample code:

# Inner selection
print(ser_obj[:, 2])

Running results:

a    0.826662
b    0.026426
c   -0.452313
d   -0.051063
dtype: float64
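
The same selections can also be written with `.loc` and `.xs`, which some readers find more explicit. This is a sketch on a small made-up Series (`ser_demo` and the result names are hypothetical), not a replacement for the bracket syntax used above.

```python
import pandas as pd

ser_demo = pd.Series([10, 20, 30, 40],
                     index=[["a", "a", "b", "b"], [0, 1, 0, 1]])

outer_sel = ser_demo.loc["b"]       # all rows under outer label 'b'
inner_sel = ser_demo.loc[:, 1]      # inner label 1 under every outer label
single = ser_demo.loc[("a", 1)]     # one value: outer 'a', inner 1
inner_xs = ser_demo.xs(1, level=1)  # cross-section on the inner level
print(outer_sel, inner_sel, single, inner_xs, sep="\n\n")
```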

Hierarchical indexes are often used for grouping operations, pivot-table generation, and so on.

# Swap the hierarchical order

###1. swaplevel()

.swaplevel() swaps inner and outer indexes.

Sample code:

print(ser_obj.swaplevel())

Running results:

0  a    0.099174
1  a   -0.310414
2  a   -0.558047
0  b    1.742445
1  b    1.152924
2  b   -0.725332
0  c   -0.150638
1  c    0.251660
2  c    0.063387
0  d    1.080605
1  d    0.567547
2  d   -0.154148
dtype: float64

###2. Swap and sort the levels: sortlevel()

  • .sortlevel() sorts the outer index first and then the inner index, ascending by default.

Sample code:

# Swap the levels, then sort
print(ser_obj.swaplevel().sortlevel())

Running results:

0  a    0.099174
   b    1.742445
   c   -0.150638
   d    1.080605
1  a   -0.310414
   b    1.152924
   c    0.251660
   d    0.567547
2  a   -0.558047
   b   -0.725332
   c    0.063387
   d   -0.154148
dtype: float64
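
One caveat for newer environments: as far as I know, `Series.sortlevel()` was deprecated and later removed from pandas, with `sort_index()` as the replacement. The sketch below assumes a recent pandas and uses made-up data and names; it swaps the levels and then sorts by the new outer level first, the new inner level second.

```python
import pandas as pd

ser_demo = pd.Series([1.0, 2.0, 3.0, 4.0],
                     index=[["a", "a", "b", "b"], [0, 1, 0, 1]])

swapped = ser_demo.swaplevel()     # the inner level becomes the outer level
sorted_ser = swapped.sort_index()  # sort by outer level, then inner level
print(swapped)
print(sorted_ser)
```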

#3. Pandas statistical calculations and descriptions

Sample code:

import numpy as np
import pandas as pd

df_obj = pd.DataFrame(np.random.randn(5, 4), columns=['a', 'b', 'c', 'd'])
print(df_obj)

Running results:

          a         b         c         d
0  1.469682  1.948965  1.373124 -0.564129
1 -1.466670 -0.494591  0.467787 -2.007771
2  1.368750  0.532142  0.487862 -1.130825
3 -0.758540 -0.479684  1.239135  1.073077
4 -0.007470  0.997034  2.669219  0.742070

### Common statistical calculations

sum, mean, max, min, …

axis=0 (the default) computes one result per column; axis=1 computes one result per row.

skipna controls whether missing values are excluded; it defaults to True.

Sample code:

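# Column-wise sum and maximum (axis=0 is the default)
print(df_obj.sum())

print(df_obj.max())

# Row-wise minimum; with skipna=False, a row containing NaN yields NaN
print(df_obj.min(axis=1, skipna=False))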

Running results:

a    0.60575
b    2.503866
c    6.237127
d   -1.887578
dtype: float64

a    1.469682
b    1.948965
c    2.669219
d    1.073077
dtype: float64

0   -0.564129
1   -2.007771
2   -1.130825
3   -0.758540
4   -0.007470
dtype: float64
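
Because the sample DataFrame above has no missing values, the effect of skipna is not visible in its output. A minimal sketch with one NaN makes the difference explicit (data and names are made up):

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})

col_sum = df_demo.sum()                             # per column, NaN skipped
row_sum = df_demo.sum(axis=1)                       # per row, NaN skipped
row_sum_strict = df_demo.sum(axis=1, skipna=False)  # NaN propagates to the result
print(col_sum, row_sum, row_sum_strict, sep="\n\n")
```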

### Common statistical description

describe() generates multiple statistics at once

Sample code:

print(df_obj.describe())

Running results:

              a         b         c         d
count  5.000000  5.000000  5.000000  5.000000
mean   0.180305  0.106488  0.244978  0.178046
std    0.641945  0.454340  1.064356  1.144416
min   -0.677175 -0.490278 -1.164928 -1.574556
25%   -0.064069 -0.182920 -0.464013 -0.089962
50%    0.231722  0.127846  0.355859  0.190482
75%    0.318854  0.463377  1.169750  0.983663
max    1.092195  0.614413  1.328220  1.380601
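
The result of describe() is itself a DataFrame, so individual statistics can be picked out by row label. A small sketch on made-up numeric data (`df_demo` and `stats` are illustrative names):

```python
import pandas as pd

df_demo = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0],
                        "b": [10.0, 20.0, 30.0, 40.0]})

stats = df_demo.describe()

# Rows are the statistics, columns are the original columns
print(stats.loc["mean"])
print(stats.loc[["min", "max"]])
```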

### Common statistical description methods: