Introduction
During data processing, Pandas uses NaN to represent unparsed or missing data. Although the data is still represented, NaN cannot take part in ordinary arithmetic.
This article explains how Pandas handles NaN data.
An example of NaN
As mentioned above, missing data is represented as NaN. Let's look at a concrete example, starting by building a DataFrame:
In [1]: df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
   ...:                   columns=['one', 'two', 'three'])

In [2]: df['four'] = 'bar'

In [3]: df['five'] = df['one'] > 0

In [4]: df
Out[4]:
        one       two     three four   five
a  0.469112 -0.282863 -1.509059  bar   True
c -1.135632  1.212112 -0.173215  bar  False
e  0.119209 -1.044236 -0.861849  bar   True
f -2.104569 -0.494929  1.071804  bar  False
h  0.721555 -0.706771 -1.039575  bar   True
The DataFrame above only contains the indexes a, c, e, f and h, so let's reindex the data to add the missing rows:
In [5]: df2 = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

In [6]: df2
Out[6]:
        one       two     three four   five
a  0.469112 -0.282863 -1.509059  bar   True
b       NaN       NaN       NaN  NaN    NaN
c -1.135632  1.212112 -0.173215  bar  False
d       NaN       NaN       NaN  NaN    NaN
e  0.119209 -1.044236 -0.861849  bar   True
f -2.104569 -0.494929  1.071804  bar  False
g       NaN       NaN       NaN  NaN    NaN
h  0.721555 -0.706771 -1.039575  bar   True
When data is missing, many NaNs are generated.
To check for NaN, either the isna() or notna() methods can be used.
In [7]: df2['one']
Out[7]:
a    0.469112
b         NaN
c   -1.135632
d         NaN
e    0.119209
f   -2.104569
g         NaN
h    0.721555
Name: one, dtype: float64

In [8]: pd.isna(df2['one'])
Out[8]:
a    False
b     True
c    False
d     True
e    False
f    False
g     True
h    False
Name: one, dtype: bool

In [9]: df2['four'].notna()
Out[9]:
a     True
b    False
c     True
d    False
e     True
f     True
g    False
h     True
Name: four, dtype: bool
Note that in Python, None is equal to itself:
In [11]: None == None # noqa: E711
Out[11]: True
But np.nan is not equal to itself:
In [12]: np.nan == np.nan
Out[12]: False
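Because NaN never compares equal to itself, equality checks silently fail to find missing values; always use pd.isna() instead. A minimal sketch (the variable name is illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

# Equality comparison never matches NaN, so this finds nothing
print(int((s == np.nan).sum()))   # 0

# pd.isna() is the reliable test for missing values
print(pd.isna(s).tolist())        # [False, True, False]
```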
Missing values of integer type
NaN is a float by default, but if the data is of integer type, we can use the nullable integer dtype instead:
In [14]: pd.Series([1, 2, np.nan, 4], dtype=pd.Int64Dtype())
Out[14]:
0 1
1 2
2 <NA>
3 4
dtype: Int64
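An existing float Series that picked up NaN can also be converted to the nullable integer dtype with astype; a small sketch (the values are illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan, 4.0])
print(s.dtype)             # float64

# Cast to the nullable integer dtype; NaN becomes <NA>
s2 = s.astype('Int64')
print(s2.dtype)            # Int64
print(s2.isna().tolist())  # [False, False, True, False]
```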
Missing values of datetime type
Missing values of datetime type are represented by NaT:
In [15]: df2 = df.copy()

In [16]: df2['timestamp'] = pd.Timestamp('20120101')

In [17]: df2
Out[17]:
        one       two     three four   five  timestamp
a  0.469112 -0.282863 -1.509059  bar   True 2012-01-01
c -1.135632  1.212112 -0.173215  bar  False 2012-01-01
e  0.119209 -1.044236 -0.861849  bar   True 2012-01-01
f -2.104569 -0.494929  1.071804  bar  False 2012-01-01
h  0.721555 -0.706771 -1.039575  bar   True 2012-01-01

In [18]: df2.loc[['a', 'c', 'h'], ['one', 'timestamp']] = np.nan

In [19]: df2
Out[19]:
        one       two     three four   five  timestamp
a       NaN -0.282863 -1.509059  bar   True        NaT
c       NaN  1.212112 -0.173215  bar  False        NaT
e  0.119209 -1.044236 -0.861849  bar   True 2012-01-01
f -2.104569 -0.494929  1.071804  bar  False 2012-01-01
h       NaN -0.706771 -1.039575  bar   True        NaT

In [20]: df2.dtypes.value_counts()
Out[20]:
float64           3
datetime64[ns]    1
bool              1
object            1
dtype: int64
Conversions between None and np.nan
For numeric types, assigning None converts it to the corresponding NaN value:
In [21]: s = pd.Series([1, 2, 3])

In [22]: s.loc[0] = None

In [23]: s
Out[23]:
0    NaN
1    2.0
2    3.0
dtype: float64
If the Series is of object dtype, assigning None leaves it as None:
In [24]: s = pd.Series(["a", "b", "c"])
In [25]: s.loc[0] = None
In [26]: s.loc[1] = np.nan
In [27]: s
Out[27]:
0 None
1 NaN
2 c
dtype: object
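Even though object dtype keeps None and np.nan distinct, pd.isna() treats both as missing, so detection still works the same way; a quick sketch of the example above:

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", "b", "c"])
s.loc[0] = None
s.loc[1] = np.nan

# isna() reports both None and np.nan as missing
print(s.isna().tolist())  # [True, True, False]
```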
Calculations with missing values
Arithmetic involving a missing value yields a missing value:
In [28]: a
Out[28]:
        one       two
a       NaN -0.282863
c       NaN  1.212112
e  0.119209 -1.044236
f -2.104569 -0.494929
h -2.104569 -0.706771

In [29]: b
Out[29]:
        one       two     three
a       NaN -0.282863 -1.509059
c       NaN  1.212112 -0.173215
e  0.119209 -1.044236 -0.861849
f -2.104569 -0.494929  1.071804
h       NaN -0.706771 -1.039575

In [30]: a + b
Out[30]:
        one  three       two
a       NaN    NaN -0.565727
c       NaN    NaN  2.424224
e  0.238417    NaN -2.088472
f -4.209138    NaN -0.989859
h       NaN    NaN -1.413542
But in descriptive statistics such as sum, NaN is skipped, which is equivalent to treating it as 0:
In [31]: df
Out[31]:
        one       two     three
a       NaN -0.282863 -1.509059
c       NaN  1.212112 -0.173215
e  0.119209 -1.044236 -0.861849
f -2.104569 -0.494929  1.071804
h       NaN -0.706771 -1.039575

In [32]: df['one'].sum()
Out[32]: -1.9853605075978744

In [33]: df.mean(1)
Out[33]:
a   -0.895961
c    0.519449
e   -0.595625
f   -0.509232
h   -0.873173
dtype: float64
cumsum and cumprod also skip NaN by default. If you want NaN to propagate instead, add skipna=False:
In [34]: df.cumsum()
Out[34]:
        one       two     three
a       NaN -0.282863 -1.509059
c       NaN  0.929249 -1.682273
e  0.119209 -0.114987 -2.544122
f -1.985361 -0.609917 -1.472318
h       NaN -1.316688 -2.511893

In [35]: df.cumsum(skipna=False)
Out[35]:
        one       two     three
a       NaN -0.282863 -1.509059
c       NaN  0.929249 -1.682273
e       NaN -0.114987 -2.544122
f       NaN -0.609917 -1.472318
h       NaN -1.316688 -2.511893
Filling NaN data with fillna
In data analysis, NaN data usually needs to be processed. One way to process it is to fill it with fillna.
First, fill with a constant:
In [42]: df2
Out[42]:
        one       two     three four   five  timestamp
a       NaN -0.282863 -1.509059  bar   True        NaT
c       NaN  1.212112 -0.173215  bar  False        NaT
e  0.119209 -1.044236 -0.861849  bar   True 2012-01-01
f -2.104569 -0.494929  1.071804  bar  False 2012-01-01
h       NaN -0.706771 -1.039575  bar   True        NaT

In [43]: df2.fillna(0)
Out[43]:
        one       two     three four   five            timestamp
a  0.000000 -0.282863 -1.509059  bar   True                    0
c  0.000000  1.212112 -0.173215  bar  False                    0
e  0.119209 -1.044236 -0.861849  bar   True  2012-01-01 00:00:00
f -2.104569 -0.494929  1.071804  bar  False  2012-01-01 00:00:00
h  0.000000 -0.706771 -1.039575  bar   True                    0
You can also specify the fill method, such as pad:
In [45]: df
Out[45]:
        one       two     three
a       NaN -0.282863 -1.509059
c       NaN  1.212112 -0.173215
e  0.119209 -1.044236 -0.861849
f -2.104569 -0.494929  1.071804
h       NaN -0.706771 -1.039575

In [46]: df.fillna(method='pad')
Out[46]:
        one       two     three
a       NaN -0.282863 -1.509059
c       NaN  1.212112 -0.173215
e  0.119209 -1.044236 -0.861849
f -2.104569 -0.494929  1.071804
h -2.104569 -0.706771 -1.039575
You can specify the number of rows to fill:
In [48]: df.fillna(method='pad', limit=1)
Summary of fill methods:
Method name | Description |
---|---|
pad / ffill | Fill values forward |
bfill / backfill | Fill values backward |
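The difference between the two directions can be seen on a small Series; this sketch uses the ffill() and bfill() shorthand methods, which correspond to the pad and backfill fill methods (the data is illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 2.0, np.nan, 4.0, np.nan])

# ffill propagates the last valid value forward;
# the leading NaN has nothing before it and stays NaN
forward = s.ffill()
print(forward.tolist())   # [nan, 2.0, 2.0, 4.0, 4.0]

# bfill propagates the next valid value backward;
# the trailing NaN has nothing after it and stays NaN
backward = s.bfill()
print(backward.tolist())  # [2.0, 2.0, 4.0, 4.0, nan]
```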
You can also fill with a pandas object, such as the column means:
In [53]: dff
Out[53]:
          A         B         C
0  0.271860 -0.424972  0.567020
1  0.276232 -1.087401 -0.673690
2  0.113648 -1.478427  0.524988
3       NaN  0.577046 -1.715002
4       NaN       NaN -1.157892
5 -1.344312       NaN       NaN
6 -0.109050  1.643563       NaN
7  0.357021 -0.674600       NaN
8 -0.968914 -1.294524  0.413738
9  0.276662 -0.472035 -0.013960

In [54]: dff.fillna(dff.mean())
Out[54]:
          A         B         C
0  0.271860 -0.424972  0.567020
1  0.276232 -1.087401 -0.673690
2  0.113648 -1.478427  0.524988
3 -0.140857  0.577046 -1.715002
4 -0.140857 -0.401419 -1.157892
5 -1.344312 -0.401419 -0.293543
6 -0.109050  1.643563 -0.293543
7  0.357021 -0.674600 -0.293543
8 -0.968914 -1.294524  0.413738
9  0.276662 -0.472035 -0.013960

In [55]: dff.fillna(dff.mean()['B':'C'])
Out[55]:
          A         B         C
0  0.271860 -0.424972  0.567020
1  0.276232 -1.087401 -0.673690
2  0.113648 -1.478427  0.524988
3       NaN  0.577046 -1.715002
4       NaN -0.401419 -1.157892
5 -1.344312 -0.401419 -0.293543
6 -0.109050  1.643563 -0.293543
7  0.357021 -0.674600 -0.293543
8 -0.968914 -1.294524  0.413738
9  0.276662 -0.472035 -0.013960
The above operation is equivalent to:
In [56]: dff.where(pd.notna(dff), dff.mean(), axis='columns')
Deleting data containing NA with dropna
Besides filling with fillna, you can also use dropna to delete the data containing NA.
In [57]: df
Out[57]:
        one       two     three
a       NaN -0.282863 -1.509059
c       NaN  1.212112 -0.173215
e       NaN  0.000000  0.000000
f       NaN  0.000000  0.000000
h       NaN -0.706771 -1.039575

In [58]: df.dropna(axis=0)
Out[58]:
Empty DataFrame
Columns: [one, two, three]
Index: []

In [59]: df.dropna(axis=1)
Out[59]:
        two     three
a -0.282863 -1.509059
c  1.212112 -0.173215
e  0.000000  0.000000
f  0.000000  0.000000
h -0.706771 -1.039575

In [60]: df['one'].dropna()
Out[60]: Series([], Name: one, dtype: float64)
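dropna also accepts how and thresh parameters to control how aggressively rows are removed; a minimal sketch (the data is illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [np.nan, 1.0, np.nan],
                   'two': [np.nan, 2.0, 3.0]})

# how='all' drops only rows where every value is NaN
print(df.dropna(how='all').index.tolist())  # [1, 2]

# thresh=2 keeps rows with at least 2 non-NaN values
print(df.dropna(thresh=2).index.tolist())   # [1]
```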
Interpolation
When analyzing data, we sometimes need to smooth it by interpolating. interpolate() is very simple to use:
In [61]: ts
Out[61]:
2000-01-31    0.469112
2000-02-29         NaN
2000-03-31         NaN
2000-04-28         NaN
2000-05-31         NaN
                ...
2007-12-31   -6.950267
2008-01-31   -7.904475
2008-02-29   -6.441779
2008-03-31   -8.184940
2008-04-30   -9.011531
Freq: BM, Length: 100, dtype: float64

In [64]: ts.interpolate()
Out[64]:
2000-01-31    0.469112
2000-02-29    0.434469
2000-03-31    0.399826
2000-04-28    0.365184
2000-05-31    0.330541
                ...
2007-12-31   -6.950267
2008-01-31   -7.904475
2008-02-29   -6.441779
2008-03-31   -8.184940
2008-04-30   -9.011531
Freq: BM, Length: 100, dtype: float64
Interpolation functions can also add arguments to specify how to interpolate, such as by time:
In [67]: ts2
Out[67]:
2000-01-31    0.469112
2000-02-29         NaN
2002-07-31   -5.785037
2005-01-31         NaN
2008-04-30   -9.011531
dtype: float64

In [68]: ts2.interpolate()
Out[68]:
2000-01-31    0.469112
2000-02-29   -2.657962
2002-07-31   -5.785037
2005-01-31   -7.398284
2008-04-30   -9.011531
dtype: float64

In [69]: ts2.interpolate(method='time')
Out[69]:
2000-01-31    0.469112
2000-02-29    0.270241
2002-07-31   -5.785037
2005-01-31   -7.190866
2008-04-30   -9.011531
dtype: float64
You can also interpolate by the float values of the index:
In [70]: ser
Out[70]:
0.0      0.0
1.0      NaN
10.0    10.0
dtype: float64

In [71]: ser.interpolate()
Out[71]:
0.0      0.0
1.0      5.0
10.0    10.0
dtype: float64

In [72]: ser.interpolate(method='values')
Out[72]:
0.0      0.0
1.0      1.0
10.0    10.0
dtype: float64
In addition to Series, you can also interpolate DataFrames:
In [73]: df = pd.DataFrame({'A': [1, 2.1, np.nan, 4.7, 5.6, 6.8],
   ...:                     'B': [.25, np.nan, np.nan, 4, 12.2, 14.4]})

In [74]: df
Out[74]:
     A      B
0  1.0   0.25
1  2.1    NaN
2  NaN    NaN
3  4.7   4.00
4  5.6  12.20
5  6.8  14.40

In [75]: df.interpolate()
Out[75]:
     A      B
0  1.0   0.25
1  2.1   1.50
2  3.4   2.75
3  4.7   4.00
4  5.6  12.20
5  6.8  14.40
interpolate also accepts a limit parameter that specifies the maximum number of consecutive NaNs to fill.
In [94]: ser = pd.Series([np.nan, np.nan, 5, np.nan, np.nan, np.nan, 13, np.nan, np.nan])

In [95]: ser.interpolate(limit=1)
Out[95]:
0 NaN
1 NaN
2 5.0
3 7.0
4 NaN
5 NaN
6 13.0
7 13.0
8 NaN
dtype: float64
Using replace to replace values
Replace can replace a constant or a list:
In [102]: ser = pd.Series([0., 1., 2., 3., 4.])
In [103]: ser.replace(0, 5)
Out[103]:
0 5.0
1 1.0
2 2.0
3 3.0
4 4.0
dtype: float64
In [104]: ser.replace([0, 1, 2, 3, 4], [4, 3, 2, 1, 0])
Out[104]:
0    4.0
1    3.0
2    2.0
3    1.0
4    0.0
dtype: float64
You can also replace specific values in a DataFrame:
In [106]: df = pd.DataFrame({'a': [0, 1, 2, 3, 4], 'b': [5, 6, 7, 8, 9]})
In [107]: df.replace({'a': 0, 'b': 5}, 100)
Out[107]:
a b
0 100 100
1 1 6
2 2 7
3 3 8
4 4 9
You can also replace using a fill method such as pad:
In [108]: ser.replace([1, 2, 3], method='pad')
Out[108]:
0 0.0
1 0.0
2 0.0
3 0.0
4 4.0
dtype: float64
This article is also available at http://www.flydean.com/07-python-pandas-missingdata/
The most accessible explanations, the most in-depth material, the most concise tutorials, and plenty of tips you may not know, all waiting for you to discover!
Welcome to follow my official account "程序那些事": we understand technology, and we understand you even better!