Introduction

In data processing, Pandas uses NaN to represent data that is missing or cannot be parsed. Although every position is still represented, NaN is not a meaningful number mathematically.

This article will explain how Pandas handles NaN data.

An example of NaN

Given that missing data is represented as NaN, let’s look at a specific example:

Let’s first build a DataFrame:

In [1]: df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
   ...:                   columns=['one', 'two', 'three'])

In [2]: df['four'] = 'bar'

In [3]: df['five'] = df['one'] > 0

In [4]: df
Out[4]: 
        one       two     three four   five
a  0.469112 -0.282863 -1.509059  bar   True
c -1.135632  1.212112 -0.173215  bar  False
e  0.119209 -1.044236 -0.861849  bar   True
f -2.104569 -0.494929  1.071804  bar  False
h  0.721555 -0.706771 -1.039575  bar   True

df only contains the indexes a, c, e, f and h. Let’s reindex the data:

In [5]: df2 = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

In [6]: df2
Out[6]: 
        one       two     three four   five
a  0.469112 -0.282863 -1.509059  bar   True
b       NaN       NaN       NaN  NaN    NaN
c -1.135632  1.212112 -0.173215  bar  False
d       NaN       NaN       NaN  NaN    NaN
e  0.119209 -1.044236 -0.861849  bar   True
f -2.104569 -0.494929  1.071804  bar  False
g       NaN       NaN       NaN  NaN    NaN
h  0.721555 -0.706771 -1.039575  bar   True

Where data is missing, NaN values are created.

To detect NaN, either the isna() or notna() methods can be used.

In [7]: df2['one']
Out[7]: 
a    0.469112
b         NaN
c   -1.135632
d         NaN
e    0.119209
f   -2.104569
g         NaN
h    0.721555
Name: one, dtype: float64

In [8]: pd.isna(df2['one'])
Out[8]: 
a    False
b     True
c    False
d     True
e    False
f    False
g     True
h    False
Name: one, dtype: bool

In [9]: df2['four'].notna()
Out[9]: 
a     True
b    False
c     True
d    False
e     True
f     True
g    False
h     True
Name: four, dtype: bool

Note that in Python, None equals None:

In [11]: None == None                                                 # noqa: E711
Out[11]: True

But np.nan does not equal np.nan:

In [12]: np.nan == np.nan
Out[12]: False
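Because of this asymmetry, equality checks are unreliable for detecting missing data; pd.isna() handles both None and np.nan uniformly. A minimal sketch:

```python
import numpy as np
import pandas as pd

# np.nan is not equal to itself, so `x == np.nan` never works as a test
print(np.nan == np.nan)        # False

# pd.isna() detects both None and np.nan
print(pd.isna(None))           # True
print(pd.isna(np.nan))         # True

# On a Series, both missing markers are flagged the same way
s = pd.Series([1.0, None, np.nan])
print(pd.isna(s).tolist())     # [False, True, True]
```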

Missing values of integer type

NaN is a float by default, so a column containing NaN is upcast to float. If we want to keep integer values, we can use the nullable integer dtype:

In [14]: pd.Series([1, 2, np.nan, 4], dtype=pd.Int64Dtype())
Out[14]: 
0       1
1       2
2    <NA>
3       4
dtype: Int64
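As a sketch of the nullable integer dtype: the string alias "Int64" (capital I) is equivalent to pd.Int64Dtype(), and arithmetic on such a Series keeps the nullable dtype, with missing values propagating as &lt;NA&gt;.

```python
import numpy as np
import pandas as pd

# The nullable integer dtype can also be requested via the "Int64" string alias
s = pd.Series([1, 2, np.nan, 4], dtype="Int64")
print(s.dtype)           # Int64

# Arithmetic keeps the nullable integer dtype; <NA> propagates
print(s + 1)

# Converting back to a plain float dtype turns <NA> into NaN
print(s.astype("float64"))
```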

Missing values of datetime type

Missing values for time types are represented by NaT:

In [15]: df2 = df.copy()

In [16]: df2['timestamp'] = pd.Timestamp('20120101')

In [17]: df2
Out[17]: 
        one       two     three four   five  timestamp
a  0.469112 -0.282863 -1.509059  bar   True 2012-01-01
c -1.135632  1.212112 -0.173215  bar  False 2012-01-01
e  0.119209 -1.044236 -0.861849  bar   True 2012-01-01
f -2.104569 -0.494929  1.071804  bar  False 2012-01-01
h  0.721555 -0.706771 -1.039575  bar   True 2012-01-01

In [18]: df2.loc[['a', 'c', 'h'], ['one', 'timestamp']] = np.nan

In [19]: df2
Out[19]: 
        one       two     three four   five  timestamp
a       NaN -0.282863 -1.509059  bar   True        NaT
c       NaN  1.212112 -0.173215  bar  False        NaT
e  0.119209 -1.044236 -0.861849  bar   True 2012-01-01
f -2.104569 -0.494929  1.071804  bar  False 2012-01-01
h       NaN -0.706771 -1.039575  bar   True        NaT

In [20]: df2.dtypes.value_counts()
Out[20]: 
float64           3
datetime64[ns]    1
bool              1
object            1
dtype: int64

Conversion of None to np.nan

For numeric types, assigning None converts the value to NaN:

In [21]: s = pd.Series([1, 2, 3])

In [22]: s.loc[0] = None

In [23]: s
Out[23]: 
0    NaN
1    2.0
2    3.0
dtype: float64

For object dtype, a value assigned as None stays None:

In [24]: s = pd.Series(["a", "b", "c"])

In [25]: s.loc[0] = None

In [26]: s.loc[1] = np.nan

In [27]: s
Out[27]: 
0    None
1     NaN
2       c
dtype: object

Calculation of missing values

Arithmetic between two missing values, or between a missing value and a normal value, produces a missing value:

In [28]: a
Out[28]: 
        one       two
a       NaN -0.282863
c       NaN  1.212112
e  0.119209 -1.044236
f -2.104569 -0.494929
h -2.104569 -0.706771

In [29]: b
Out[29]: 
        one       two     three
a       NaN -0.282863 -1.509059
c       NaN  1.212112 -0.173215
e  0.119209 -1.044236 -0.861849
f -2.104569 -0.494929  1.071804
h       NaN -0.706771 -1.039575

In [30]: a + b
Out[30]: 
        one  three       two
a       NaN    NaN -0.565727
c       NaN    NaN  2.424224
e  0.238417    NaN -2.088472
f -4.209138    NaN -0.989859
h       NaN    NaN -1.413542

In descriptive statistics, however, NaN is treated as zero, i.e. skipped:

In [31]: df
Out[31]: 
        one       two     three
a       NaN -0.282863 -1.509059
c       NaN  1.212112 -0.173215
e  0.119209 -1.044236 -0.861849
f -2.104569 -0.494929  1.071804
h       NaN -0.706771 -1.039575

In [32]: df['one'].sum()
Out[32]: -1.985360507597844

In [33]: df.mean(1)
Out[33]: 
a   -0.895961
c    0.519449
e   -0.595625
f   -0.509232
h   -0.873173
dtype: float64
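The edge cases of this convention are worth noting: a reduction over nothing but NaN values still returns the identity element (0 for sum, 1 for prod), while passing skipna=False propagates NaN instead. A minimal sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

# NaN is skipped (treated as zero) in reductions by default
print(s.sum())                  # 4.0

# skipna=False propagates NaN instead
print(s.sum(skipna=False))      # nan

# An all-NaN Series still sums to 0.0, and its product is 1.0
print(pd.Series([np.nan, np.nan]).sum())   # 0.0
print(pd.Series([np.nan, np.nan]).prod())  # 1.0
```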

If cumsum or cumprod is used, NaN is skipped by default. If you do not want to skip NaN, add skipna=False:

In [34]: df.cumsum()
Out[34]: 
        one       two     three
a       NaN -0.282863 -1.509059
c       NaN  0.929249 -1.682273
e  0.119209 -0.114987 -2.544122
f -1.985361 -0.609917 -1.472318
h       NaN -1.316688 -2.511893

In [35]: df.cumsum(skipna=False)
Out[35]: 
        one       two     three
a       NaN -0.282863 -1.509059
c       NaN  0.929249 -1.682273
e       NaN -0.114987 -2.544122
f       NaN -0.609917 -1.472318
h       NaN -1.316688 -2.511893

Use fillna to fill NaN data

In data analysis, NaN data usually needs to be handled. One approach is to fill it with fillna.

Fill with a constant:

In [42]: df2
Out[42]: 
        one       two     three four   five  timestamp
a       NaN -0.282863 -1.509059  bar   True        NaT
c       NaN  1.212112 -0.173215  bar  False        NaT
e  0.119209 -1.044236 -0.861849  bar   True 2012-01-01
f -2.104569 -0.494929  1.071804  bar  False 2012-01-01
h       NaN -0.706771 -1.039575  bar   True        NaT

In [43]: df2.fillna(0)
Out[43]: 
        one       two     three four   five            timestamp
a  0.000000 -0.282863 -1.509059  bar   True                    0
c  0.000000  1.212112 -0.173215  bar  False                    0
e  0.119209 -1.044236 -0.861849  bar   True  2012-01-01 00:00:00
f -2.104569 -0.494929  1.071804  bar  False  2012-01-01 00:00:00
h  0.000000 -0.706771 -1.039575  bar   True                    0

You can also specify a fill method, such as pad:

In [45]: df
Out[45]: 
        one       two     three
a       NaN -0.282863 -1.509059
c       NaN  1.212112 -0.173215
e  0.119209 -1.044236 -0.861849
f -2.104569 -0.494929  1.071804
h       NaN -0.706771 -1.039575

In [46]: df.fillna(method='pad')
Out[46]: 
        one       two     three
a       NaN -0.282863 -1.509059
c       NaN  1.212112 -0.173215
e  0.119209 -1.044236 -0.861849
f -2.104569 -0.494929  1.071804
h -2.104569 -0.706771 -1.039575

You can limit how many consecutive rows are filled:

In [48]: df.fillna(method='pad', limit=1)

Summary of fill methods:

Method name        Description
pad / ffill        Fill values forward
bfill / backfill   Fill values backward
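The two directions can be sketched with the ffill/bfill method shortcuts, which are equivalent to fillna with the corresponding method:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0, np.nan])

# ffill / pad: carry the last valid observation forward
print(s.ffill().tolist())   # [nan, 1.0, 1.0, 1.0, 4.0, 4.0]

# bfill / backfill: pull the next valid observation backward
print(s.bfill().tolist())   # [1.0, 1.0, 4.0, 4.0, 4.0, nan]
```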

You can also fill with a Pandas object, such as the column means:

In [53]: dff
Out[53]: 
          A         B         C
0  0.271860 -0.424972  0.567020
1  0.276232 -1.087401 -0.673690
2  0.113648 -1.478427  0.524988
3       NaN  0.577046 -1.715002
4       NaN       NaN -1.157892
5 -1.344312       NaN       NaN
6 -0.109050  1.643563       NaN
7  0.357021 -0.674600       NaN
8 -0.968914 -1.294524  0.413738
9  0.276662 -0.472035 -0.013960

In [54]: dff.fillna(dff.mean())
Out[54]: 
          A         B         C
0  0.271860 -0.424972  0.567020
1  0.276232 -1.087401 -0.673690
2  0.113648 -1.478427  0.524988
3 -0.140857  0.577046 -1.715002
4 -0.140857 -0.401419 -1.157892
5 -1.344312 -0.401419 -0.293543
6 -0.109050  1.643563 -0.293543
7  0.357021 -0.674600 -0.293543
8 -0.968914 -1.294524  0.413738
9  0.276662 -0.472035 -0.013960

In [55]: dff.fillna(dff.mean()['B':'C'])
Out[55]: 
          A         B         C
0  0.271860 -0.424972  0.567020
1  0.276232 -1.087401 -0.673690
2  0.113648 -1.478427  0.524988
3       NaN  0.577046 -1.715002
4       NaN -0.401419 -1.157892
5 -1.344312 -0.401419 -0.293543
6 -0.109050  1.643563 -0.293543
7  0.357021 -0.674600 -0.293543
8 -0.968914 -1.294524  0.413738
9  0.276662 -0.472035 -0.013960

The above operation is equivalent to:

In [56]: dff.where(pd.notna(dff), dff.mean(), axis='columns')

Use dropna to delete data that contains NA

In addition to filling data with fillna, you can also use dropna to delete rows or columns that contain NA.

In [57]: df
Out[57]: 
   one       two     three
a  NaN -0.282863 -1.509059
c  NaN  1.212112 -0.173215
e  NaN  0.000000  0.000000
f  NaN  0.000000  0.000000
h  NaN -0.706771 -1.039575

In [58]: df.dropna(axis=0)
Out[58]: 
Empty DataFrame
Columns: [one, two, three]
Index: []

In [59]: df.dropna(axis=1)
Out[59]: 
        two     three
a -0.282863 -1.509059
c  1.212112 -0.173215
e  0.000000  0.000000
f  0.000000  0.000000
h -0.706771 -1.039575

In [60]: df['one'].dropna()
Out[60]: Series([], Name: one, dtype: float64)
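dropna also takes how, thresh and subset parameters that control how aggressively data is dropped. A minimal sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one':   [np.nan, 1.0, np.nan],
                   'two':   [np.nan, 2.0, 3.0],
                   'three': [np.nan, np.nan, 4.0]})

# how='all' only drops rows where every value is NaN
print(df.dropna(how='all').index.tolist())       # [1, 2]

# thresh=2 keeps rows that have at least 2 non-NaN values
print(df.dropna(thresh=2).index.tolist())        # [1, 2]

# subset restricts the NaN check to specific columns
print(df.dropna(subset=['one']).index.tolist())  # [1]
```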

Interpolation with interpolate

Interpolate () when analyzing data, we will interpolate() to interpolate data smoothly. It is simple to use:

In [61]: ts
Out[61]: 
2000-01-31    0.469112
2000-02-29         NaN
2000-03-31         NaN
2000-04-28         NaN
2000-05-31         NaN
                ...   
2007-12-31   -6.950267
2008-01-31   -7.904475
2008-02-29   -6.441779
2008-03-31   -8.184940
2008-04-30   -9.011531
Freq: BM, Length: 100, dtype: float64

In [64]: ts.interpolate()
Out[64]: 
2000-01-31    0.469112
2000-02-29    0.434469
2000-03-31    0.399826
2000-04-28    0.365184
2000-05-31    0.330541
                ...   
2007-12-31   -6.950267
2008-01-31   -7.904475
2008-02-29   -6.441779
2008-03-31   -8.184940
2008-04-30   -9.011531
Freq: BM, Length: 100, dtype: float64

The interpolate function also accepts a parameter that specifies the interpolation method, such as interpolating by time:

In [67]: ts2
Out[67]: 
2000-01-31    0.469112
2000-02-29         NaN
2002-07-31   -5.785037
2005-01-31         NaN
2008-04-30   -9.011531
dtype: float64

In [68]: ts2.interpolate()
Out[68]: 
2000-01-31    0.469112
2000-02-29   -2.657962
2002-07-31   -5.785037
2005-01-31   -7.398284
2008-04-30   -9.011531
dtype: float64

In [69]: ts2.interpolate(method='time')
Out[69]: 
2000-01-31    0.469112
2000-02-29    0.270241
2002-07-31   -5.785037
2005-01-31   -7.190866
2008-04-30   -9.011531
dtype: float64

Interpolate according to the float values of the index:

In [70]: ser
Out[70]: 
0.0      0.0
1.0      NaN
10.0    10.0
dtype: float64

In [71]: ser.interpolate()
Out[71]: 
0.0      0.0
1.0      5.0
10.0    10.0
dtype: float64

In [72]: ser.interpolate(method='values')
Out[72]: 
0.0      0.0
1.0      1.0
10.0    10.0
dtype: float64

In addition to interpolating a Series, we can also interpolate a DataFrame:

In [73]: df = pd.DataFrame({'A': [1, 2.1, np.nan, 4.7, 5.6, 6.8],
   ....:                    'B': [.25, np.nan, np.nan, 4, 12.2, 14.4]})

In [74]: df
Out[74]: 
     A      B
0  1.0   0.25
1  2.1    NaN
2  NaN    NaN
3  4.7   4.00
4  5.6  12.20
5  6.8  14.40

In [75]: df.interpolate()
Out[75]: 
     A      B
0  1.0   0.25
1  2.1   1.50
2  3.4   2.75
3  4.7   4.00
4  5.6  12.20
5  6.8  14.40

interpolate also accepts a limit parameter, which caps the number of consecutive NaN values filled:

In [95]: ser.interpolate(limit=1)
Out[95]: 
0     NaN
1     NaN
2     5.0
3     7.0
4     NaN
5     NaN
6    13.0
7    13.0
8     NaN
dtype: float64
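Combined with limit_direction, the cap can apply backward or in both directions. A sketch using a Series shaped like the one above:

```python
import numpy as np
import pandas as pd

ser = pd.Series([np.nan, np.nan, 5.0, np.nan,
                 np.nan, np.nan, 13.0, np.nan, np.nan])

# Fill at most 1 consecutive NaN, extending both forward and backward
print(ser.interpolate(limit=1, limit_direction='both').tolist())
# [nan, 5.0, 5.0, 7.0, nan, 11.0, 13.0, 13.0, nan]
```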

Replacing values with replace

replace can replace a single constant as well as a list of values:

In [102]: ser = pd.Series([0., 1., 2., 3., 4.])

In [103]: ser.replace(0, 5)
Out[103]: 
0    5.0
1    1.0
2    2.0
3    3.0
4    4.0
dtype: float64
In [104]: ser.replace([0, 1, 2, 3, 4], [4, 3, 2, 1, 0])
Out[104]: 
0    4.0
1    3.0
2    2.0
3    1.0
4    0.0
dtype: float64

You can replace specific values in a DataFrame:

In [106]: df = pd.DataFrame({'a': [0, 1, 2, 3, 4], 'b': [5, 6, 7, 8, 9]})

In [107]: df.replace({'a': 0, 'b': 5}, 100)
Out[107]: 
     a    b
0  100  100
1    1    6
2    2    7
3    3    8
4    4    9
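replace also accepts a plain dict that maps old values to new values across the whole DataFrame. A small sketch:

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 1, 2, 3, 4], 'b': [5, 6, 7, 8, 9]})

# A plain dict maps old values to new values, column by column and cell by cell
print(df.replace({0: 10, 5: 50}).values.tolist())
# [[10, 50], [1, 6], [2, 7], [3, 8], [4, 9]]
```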

You can also replace values using a fill method such as pad:

In [108]: ser.replace([1, 2, 3], method='pad')
Out[108]: 
0    0.0
1    0.0
2    0.0
3    0.0
4    4.0
dtype: float64

This article is available at www.flydean.com/07-python-p…

The most accessible explanations, the most in-depth material, and the most concise tutorials — with plenty of tricks you didn’t know, waiting for you to discover!

Welcome to follow my official account "Those things about programs" — it understands technology, and understands you even more!