Introduction
In data processing, Pandas uses NaN to represent data that is missing or could not be parsed. Although NaN lets every position hold a value, it clearly cannot take part in ordinary mathematical operations.
This article will explain how Pandas handles NaN data.
An example of NaN
Since missing data is represented as NaN, let's look at a concrete example.
Let's first build a DataFrame:
In [1]: df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
   ...:                   columns=['one', 'two', 'three'])

In [2]: df['four'] = 'bar'

In [3]: df['five'] = df['one'] > 0

In [4]: df
Out[4]:
        one       two     three four   five
a  0.469112 -0.282863 -1.509059  bar   True
c -1.135632  1.212112 -0.173215  bar  False
e  0.119209 -1.044236 -0.861849  bar   True
f -2.104569 -0.494929  1.071804  bar  False
h  0.721555 -0.706771 -1.039575  bar   True
The DataFrame only has the indexes a, c, e, f, h. Let's reindex the data:
In [5]: df2 = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

In [6]: df2
Out[6]:
        one       two     three four   five
a  0.469112 -0.282863 -1.509059  bar   True
b       NaN       NaN       NaN  NaN    NaN
c -1.135632  1.212112 -0.173215  bar  False
d       NaN       NaN       NaN  NaN    NaN
e  0.119209 -1.044236 -0.861849  bar   True
f -2.104569 -0.494929  1.071804  bar  False
g       NaN       NaN       NaN  NaN    NaN
h  0.721555 -0.706771 -1.039575  bar   True
Reindexing fills the rows that have no data with NaN.
To detect NaN, either the isna() or notna() methods can be used.
In [7]: df2['one']
Out[7]:
a    0.469112
b         NaN
c   -1.135632
d         NaN
e    0.119209
f   -2.104569
g         NaN
h    0.721555
Name: one, dtype: float64

In [8]: pd.isna(df2['one'])
Out[8]:
a    False
b     True
c    False
d     True
e    False
f    False
g     True
h    False
Name: one, dtype: bool

In [9]: df2['four'].notna()
Out[9]:
a     True
b    False
c     True
d    False
e     True
f     True
g    False
h     True
Name: four, dtype: bool
Note that in Python, None equals None:
In [11]: None == None # noqa: E711
Out[11]: True
But np.nan does not equal itself:
In [12]: np.nan == np.nan
Out[12]: False
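Because NaN never compares equal to itself, equality checks cannot locate missing values; a minimal sketch of why pd.isna() is the right tool (it detects both np.nan and None):

```python
import numpy as np
import pandas as pd

# NaN is never equal to itself, so == cannot find missing values.
print(np.nan == np.nan)  # False

# pd.isna() detects both np.nan and None (None becomes NaN in a float Series).
s = pd.Series([1.0, np.nan, None])
print(s.isna().tolist())  # [False, True, True]
```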
Missing values of integer type
NaN is a float, so a Series that contains NaN is upcast to float by default. If we want to keep integer values, we can use the nullable Int64 dtype:
In [14]: pd.Series([1, 2, np.nan, 4], dtype=pd.Int64Dtype())
Out[14]:
0 1
1 2
2 <NA>
3 4
dtype: Int64
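To see the difference the nullable dtype makes, compare the default behavior with Int64 side by side (a small sketch):

```python
import numpy as np
import pandas as pd

# With the default dtype, one NaN forces the whole Series to float64.
s_float = pd.Series([1, 2, np.nan, 4])
print(s_float.dtype)  # float64

# The nullable Int64 dtype keeps integers and marks the gap as <NA>.
s_int = pd.Series([1, 2, np.nan, 4], dtype="Int64")
print(s_int.dtype)    # Int64
print(s_int.sum())    # 7 -- the missing value is skipped
```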
Missing values of datetime type
Missing values for time types are represented by NaT:
In [15]: df2 = df.copy()

In [16]: df2['timestamp'] = pd.Timestamp('20120101')

In [17]: df2
Out[17]:
        one       two     three four   five  timestamp
a  0.469112 -0.282863 -1.509059  bar   True 2012-01-01
c -1.135632  1.212112 -0.173215  bar  False 2012-01-01
e  0.119209 -1.044236 -0.861849  bar   True 2012-01-01
f -2.104569 -0.494929  1.071804  bar  False 2012-01-01
h  0.721555 -0.706771 -1.039575  bar   True 2012-01-01

In [18]: df2.loc[['a', 'c', 'h'], ['one', 'timestamp']] = np.nan

In [19]: df2
Out[19]:
        one       two     three four   five  timestamp
a       NaN -0.282863 -1.509059  bar   True        NaT
c       NaN  1.212112 -0.173215  bar  False        NaT
e  0.119209 -1.044236 -0.861849  bar   True 2012-01-01
f -2.104569 -0.494929  1.071804  bar  False 2012-01-01
h       NaN -0.706771 -1.039575  bar   True        NaT

In [20]: df2.dtypes.value_counts()
Out[20]:
float64           3
datetime64[ns]    1
bool              1
object            1
dtype: int64
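A minimal sketch of NaT in action, showing that missing datetimes become NaT and that isna() detects them just like NaN:

```python
import pandas as pd

# Missing entries in a datetime Series become NaT, not NaN.
s = pd.to_datetime(pd.Series(["2012-01-01", None]))
print(s.iloc[1] is pd.NaT)  # True
print(s.isna().tolist())    # [False, True]
```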
Conversion of None to np.nan
For numeric types, if None is assigned, it is converted to NaN:
In [21]: s = pd.Series([1, 2, 3])

In [22]: s.loc[0] = None

In [23]: s
Out[23]:
0    NaN
1    2.0
2    3.0
dtype: float64
For an object-typed Series, a value assigned as None stays None:
In [24]: s = pd.Series(["a", "b", "c"])
In [25]: s.loc[0] = None
In [26]: s.loc[1] = np.nan
In [27]: s
Out[27]:
0 None
1 NaN
2 c
dtype: object
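Even though the object Series keeps None and np.nan as distinct values, isna() flags both as missing; a quick check mirroring the example above:

```python
import numpy as np
import pandas as pd

# Mirrors the example above: in an object Series the assigned None is
# kept as-is, but isna() still reports it (and np.nan) as missing.
s = pd.Series(["a", "b", "c"])
s.loc[0] = None
s.loc[1] = np.nan
print(s.isna().tolist())  # [True, True, False]
```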
Calculation of missing values
Mathematical operations involving missing values produce missing values:
In [28]: a
Out[28]:
        one       two
a       NaN -0.282863
c       NaN  1.212112
e  0.119209 -1.044236
f -2.104569 -0.494929
h -2.104569 -0.706771

In [29]: b
Out[29]:
        one       two     three
a       NaN -0.282863 -1.509059
c       NaN  1.212112 -0.173215
e  0.119209 -1.044236 -0.861849
f -2.104569 -0.494929  1.071804
h       NaN -0.706771 -1.039575

In [30]: a + b
Out[30]:
        one  three       two
a       NaN    NaN -0.565727
c       NaN    NaN  2.424224
e  0.238417    NaN -2.088472
f -4.209138    NaN -0.989859
h       NaN    NaN -1.413542
In descriptive statistics such as sum() and mean(), however, NaN is skipped (treated as zero):
In [31]: df
Out[31]:
        one       two     three
a       NaN -0.282863 -1.509059
c       NaN  1.212112 -0.173215
e  0.119209 -1.044236 -0.861849
f -2.104569 -0.494929  1.071804
h       NaN -0.706771 -1.039575

In [32]: df['one'].sum()
Out[32]: -1.985360507597844

In [33]: df.mean(1)
Out[33]:
a   -0.895961
c    0.519449
e   -0.595625
f   -0.509232
h   -0.873173
dtype: float64
cumsum() and cumprod() also skip NaN by default. If you want NaN to propagate instead, pass skipna=False:
In [34]: df.cumsum()
Out[34]:
        one       two     three
a       NaN -0.282863 -1.509059
c       NaN  0.929249 -1.682273
e  0.119209 -0.114987 -2.544122
f -1.985361 -0.609917 -1.472318
h       NaN -1.316688 -2.511893

In [35]: df.cumsum(skipna=False)
Out[35]:
        one       two     three
a       NaN -0.282863 -1.509059
c       NaN  0.929249 -1.682273
e       NaN -0.114987 -2.544122
f       NaN -0.609917 -1.472318
h       NaN -1.316688 -2.511893
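The skipna behavior is easy to verify on a tiny Series (a minimal sketch):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])
print(s.sum())              # 4.0 -- NaN is skipped
print(s.sum(skipna=False))  # nan -- NaN propagates
vals = s.cumsum().tolist()
print(vals)                 # [1.0, nan, 4.0] -- position kept, sum continues
```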
Use fillna to fill NaN data
In data analysis, NaN data usually needs to be processed. One approach is to fill it with fillna.
Fill with a constant:
In [42]: df2
Out[42]:
        one       two     three four   five  timestamp
a       NaN -0.282863 -1.509059  bar   True        NaT
c       NaN  1.212112 -0.173215  bar  False        NaT
e  0.119209 -1.044236 -0.861849  bar   True 2012-01-01
f -2.104569 -0.494929  1.071804  bar  False 2012-01-01
h       NaN -0.706771 -1.039575  bar   True        NaT

In [43]: df2.fillna(0)
Out[43]:
        one       two     three four   five            timestamp
a  0.000000 -0.282863 -1.509059  bar   True                    0
c  0.000000  1.212112 -0.173215  bar  False                    0
e  0.119209 -1.044236 -0.861849  bar   True  2012-01-01 00:00:00
f -2.104569 -0.494929  1.071804  bar  False  2012-01-01 00:00:00
h  0.000000 -0.706771 -1.039575  bar   True                    0
You can also specify a padding method, such as pad:
In [45]: df
Out[45]:
        one       two     three
a       NaN -0.282863 -1.509059
c       NaN  1.212112 -0.173215
e  0.119209 -1.044236 -0.861849
f -2.104569 -0.494929  1.071804
h       NaN -0.706771 -1.039575

In [46]: df.fillna(method='pad')
Out[46]:
        one       two     three
a       NaN -0.282863 -1.509059
c       NaN  1.212112 -0.173215
e  0.119209 -1.044236 -0.861849
f -2.104569 -0.494929  1.071804
h -2.104569 -0.706771 -1.039575
You can specify the number of rows to fill:
In [48]: df.fillna(method='pad', limit=1)
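A small sketch of what limit does: only the first NaN of each gap is filled. This uses ffill(), which is the shorthand for fillna(method='pad'):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])
# limit=1 fills at most one consecutive NaN per gap.
print(s.ffill(limit=1).tolist())  # [1.0, 1.0, nan, 4.0]
```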
Summary of fill methods:
Method name | Description |
---|---|
pad / ffill | Fill forward |
bfill / backfill | Fill backward |
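Both directions on one tiny Series, using the ffill()/bfill() shortcuts (a minimal sketch):

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 2.0, np.nan, 4.0])
print(s.ffill().tolist())  # [nan, 2.0, 2.0, 4.0] -- leading NaN has nothing before it
print(s.bfill().tolist())  # [2.0, 2.0, 4.0, 4.0]
```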
You can also fill with a Pandas object, such as the column means:
In [53]: dff
Out[53]:
          A         B         C
0  0.271860 -0.424972  0.567020
1  0.276232 -1.087401 -0.673690
2  0.113648 -1.478427  0.524988
3       NaN  0.577046 -1.715002
4       NaN       NaN -1.157892
5 -1.344312       NaN       NaN
6 -0.109050  1.643563       NaN
7  0.357021 -0.674600       NaN
8 -0.968914 -1.294524  0.413738
9  0.276662 -0.472035 -0.013960

In [54]: dff.fillna(dff.mean())
Out[54]:
          A         B         C
0  0.271860 -0.424972  0.567020
1  0.276232 -1.087401 -0.673690
2  0.113648 -1.478427  0.524988
3 -0.140857  0.577046 -1.715002
4 -0.140857 -0.401419 -1.157892
5 -1.344312 -0.401419 -0.293543
6 -0.109050  1.643563 -0.293543
7  0.357021 -0.674600 -0.293543
8 -0.968914 -1.294524  0.413738
9  0.276662 -0.472035 -0.013960

In [55]: dff.fillna(dff.mean()['B':'C'])
Out[55]:
          A         B         C
0  0.271860 -0.424972  0.567020
1  0.276232 -1.087401 -0.673690
2  0.113648 -1.478427  0.524988
3       NaN  0.577046 -1.715002
4       NaN -0.401419 -1.157892
5 -1.344312 -0.401419 -0.293543
6 -0.109050  1.643563 -0.293543
7  0.357021 -0.674600 -0.293543
8 -0.968914 -1.294524  0.413738
9  0.276662 -0.472035 -0.013960
The above operation is equivalent to:
In [56]: dff.where(pd.notna(dff), dff.mean(), axis='columns')
Use dropna to delete data that contains NA
In addition to filling data with fillna, you can use dropna to delete rows or columns that contain NA.
In [57]: df
Out[57]:
   one       two     three
a  NaN -0.282863 -1.509059
c  NaN  1.212112 -0.173215
e  NaN  0.000000  0.000000
f  NaN  0.000000  0.000000
h  NaN -0.706771 -1.039575

In [58]: df.dropna(axis=0)
Out[58]:
Empty DataFrame
Columns: [one, two, three]
Index: []

In [59]: df.dropna(axis=1)
Out[59]:
        two     three
a -0.282863 -1.509059
c  1.212112 -0.173215
e  0.000000  0.000000
f  0.000000  0.000000
h -0.706771 -1.039575

In [60]: df['one'].dropna()
Out[60]: Series([], Name: one, dtype: float64)
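dropna() also takes how and thresh parameters to control how aggressively rows are deleted; a quick sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"one": [np.nan, np.nan], "two": [1.0, np.nan]})
# how='all' only drops rows where every value is missing.
print(len(df.dropna(how="all")))  # 1
# thresh=2 keeps only rows with at least 2 non-NA values.
print(len(df.dropna(thresh=2)))   # 0
```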
Interpolation with interpolate
When analyzing data, we can use interpolate() to fill in missing data smoothly. It is simple to use:
In [61]: ts
Out[61]:
2000-01-31    0.469112
2000-02-29         NaN
2000-03-31         NaN
2000-04-28         NaN
2000-05-31         NaN
                ...
2007-12-31   -6.950267
2008-01-31   -7.904475
2008-02-29   -6.441779
2008-03-31   -8.184940
2008-04-30   -9.011531
Freq: BM, Length: 100, dtype: float64

In [64]: ts.interpolate()
Out[64]:
2000-01-31    0.469112
2000-02-29    0.434469
2000-03-31    0.399826
2000-04-28    0.365184
2000-05-31    0.330541
                ...
2007-12-31   -6.950267
2008-01-31   -7.904475
2008-02-29   -6.441779
2008-03-31   -8.184940
2008-04-30   -9.011531
Freq: BM, Length: 100, dtype: float64
interpolate() also accepts a method parameter that specifies the interpolation strategy, for example interpolation by time:
In [67]: ts2
Out[67]:
2000-01-31    0.469112
2000-02-29         NaN
2002-07-31   -5.785037
2005-01-31         NaN
2008-04-30   -9.011531
dtype: float64

In [68]: ts2.interpolate()
Out[68]:
2000-01-31    0.469112
2000-02-29   -2.657962
2002-07-31   -5.785037
2005-01-31   -7.398284
2008-04-30   -9.011531
dtype: float64

In [69]: ts2.interpolate(method='time')
Out[69]:
2000-01-31    0.469112
2000-02-29    0.270241
2002-07-31   -5.785037
2005-01-31   -7.190866
2008-04-30   -9.011531
dtype: float64
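The difference between linear and time interpolation is easiest to see with an unevenly spaced index (the dates here are made up for illustration):

```python
import numpy as np
import pandas as pd

idx = pd.to_datetime(["2000-01-01", "2000-01-02", "2000-01-05"])
s = pd.Series([1.0, np.nan, 4.0], index=idx)
# Linear interpolation treats the points as equally spaced: midpoint 2.5.
print(s.interpolate().tolist())               # [1.0, 2.5, 4.0]
# method='time' weights by the real gaps: 1 day out of 4 -> 1.75.
print(s.interpolate(method="time").tolist())  # [1.0, 1.75, 4.0]
```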
Or interpolate by the float values of the index:
In [70]: ser
Out[70]:
0.0      0.0
1.0      NaN
10.0    10.0
dtype: float64

In [71]: ser.interpolate()
Out[71]:
0.0      0.0
1.0      5.0
10.0    10.0
dtype: float64

In [72]: ser.interpolate(method='values')
Out[72]:
0.0      0.0
1.0      1.0
10.0    10.0
dtype: float64
In addition to a Series, we can also interpolate a DataFrame:
In [73]: df = pd.DataFrame({'A': [1, 2.1, np.nan, 4.7, 5.6, 6.8],
   ...:                     'B': [.25, np.nan, np.nan, 4, 12.2, 14.4]})

In [74]: df
Out[74]:
     A      B
0  1.0   0.25
1  2.1    NaN
2  NaN    NaN
3  4.7   4.00
4  5.6  12.20
5  6.8  14.40

In [75]: df.interpolate()
Out[75]:
     A      B
0  1.0   0.25
1  2.1   1.50
2  3.4   2.75
3  4.7   4.00
4  5.6  12.20
5  6.8  14.40
interpolate() also accepts a limit parameter, which restricts how many consecutive NaN values are filled:
In [95]: ser.interpolate(limit=1)
Out[95]:
0     NaN
1     NaN
2     5.0
3     7.0
4     NaN
5     NaN
6    13.0
7    13.0
8     NaN
dtype: float64
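The limit behavior in a self-contained sketch with clean numbers:

```python
import numpy as np
import pandas as pd

s = pd.Series([5.0, np.nan, np.nan, np.nan, 13.0])
# Full interpolation would give [5, 7, 9, 11, 13];
# limit=1 fills only the first NaN of the gap.
print(s.interpolate(limit=1).tolist())  # [5.0, 7.0, nan, nan, 13.0]
```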
Replace values with replace
Replace can replace constants as well as lists:
In [102]: ser = pd.Series([0., 1., 2., 3., 4.])
In [103]: ser.replace(0, 5)
Out[103]:
0 5.0
1 1.0
2 2.0
3 3.0
4 4.0
dtype: float64
In [104]: ser.replace([0, 1, 2, 3, 4], [4, 3, 2, 1, 0])
Out[104]:
0 4.0
1 3.0
2 2.0
3 1.0
4 0.0
dtype: float64
You can replace specific values in DF:
In [106]: df = pd.DataFrame({'a': [0, 1, 2, 3, 4], 'b': [5, 6, 7, 8, 9]})
In [107]: df.replace({'a': 0, 'b': 5}, 100)
Out[107]:
a b
0 100 100
1 1 6
2 2 7
3 3 8
4 4 9
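replace() also accepts a nested dict, which maps old values to new values per column; a quick sketch:

```python
import pandas as pd

df = pd.DataFrame({"a": [0, 1, 2], "b": [5, 6, 7]})
# Replace 0 with 100 only in column 'a', and 5 with 100 only in 'b'.
out = df.replace({"a": {0: 100}, "b": {5: 100}})
print(out["a"].tolist())  # [100, 1, 2]
print(out["b"].tolist())  # [100, 6, 7]
```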
You can also replace values using a fill method such as pad:
In [108]: ser.replace([1, 2, 3], method='pad')
Out[108]:
0    0.0
1    0.0
2    0.0
3    0.0
4    4.0
dtype: float64
This article is available at www.flydean.com/07-python-p…