This is the fourth day of my participation in the August Text Challenge.More challenges in August
Hello, everyone, I am the talented brother.
In daily data processing, we often need to type transform some data for the convenience of subsequent processing. Because I don’t like to remember them, I search for them every time I don’t remember the specific function method, which still feels a little Fei time.
Today let’s take a look at some common conversion operations and save them for a rainy day.
Directory:
[toc]
1. Specify the data type when loading data
In general, to save trouble I just pd.dataframe (data) or pd.read_xx(filename).
Such as:
import pandas as pd
df = pd.read_excel('Data type Conversion case data.xlsx')
df
Copy the code
countries | popularity | score | Yearning degree | |
---|---|---|---|---|
0 | China | 10 | 10 | 10.0 |
1 | The United States | 6 | 5.8 | 7.0 |
2 | Japan | 2 | 1.2 | 7.0 |
3 | Germany | 8 | 6.8 | 6.0 |
4 | Britain | 7 | 6.6 | nan |
We look at the dtypes attribute
df.dtypes
Copy the code
Country Object Popularity INT64 Score FLOAT64 Desirability FLOAT64 DType: objectCopy the code
You can see that the country field is of type Object, the popularity is of type int, and the score and desirability are both of type float. In fact, for the degree of yearning we might want an int integer, and the country field is a string.
So, we can specify each field data type with the parameter dtype when loading the data.
import pandas as pd
df = pd.read_excel('Data type Conversion case data.xlsx',
dtype={
'countries':'string'.'Desirability':'Int64'
}
)
df
Copy the code
countries | popularity | score | Yearning degree | |
---|---|---|---|---|
0 | China | 10 | 10 | 10 |
1 | The United States | 6 | 5.8 | 7 |
2 | Japan | 2 | 1.2 | 7 |
3 | Germany | 8 | 6.8 | 6 |
4 | Britain | 7 | 6.6 |
Look again at the dtypes attribute
df.dtypes
Copy the code
Country String Popularity INT64 Score FLOAT64 Yearning int64 DType: objectCopy the code
Similarly, when creating DataFrame data, you can use the dtype parameter to set the data type (in this case, all fields).
df = pd.DataFrame({'A': [1.2.3.4.].'B': [1.3.5.7]
},
dtype='float32'
)
df.dtypes
Copy the code
A float32
B float32
dtype: object
Copy the code
2. Astype Converts data types
For existing data, we often use astype to convert data types, either for a Series column or for multiple columns at the same time.
In [1]: df. Popularity astype('float')
Out[1] :0 10.0
1 6.0
2 2.0
3 8.0
4 7.0Name: Popularity, dType: float64 In [2]: df.astype({'countries':'string'.'Desirability':'Int64'})
Out[2]: Country popularity rating0China10 10.0 10
1The United States6 5.8 7
2Japan2 1.2 7
3Germany8 6.8 6
4Britain7 6.6 <NA>
Copy the code
3. Pd. to_xx Converts data types
3.1.pd. to_datetime converts to the time type
- The string for the date like is converted to a date
- Timestamp conversion to date, etc
- Numeric strings are converted to dates in format
If an unconverted case is encountered, an error will be reported by default. You can set errors=’ COERce ‘to NaT for the unconverted case.
# Convert a string to a date
In [3]: s = pd.Series(['3/11/2000'.'3/12/2000'.'3/13/2000'])
In [4]: s
Out[4] :0 3/11/2000
1 3/12/2000
2 3/13/2000
dtype: object
In [5]: pd.to_datetime(s, infer_datetime_format=True)
Out[5] :0 2000- 03 -11
1 2000- 03 -12
2 2000- 03 -13
dtype: datetime64[ns]
You can also convert the timestamp to a date
In [6]: s = pd.Series([1490195805.1590195805.1690195805])
In [7]: pd.to_datetime(s, unit='s')
Out[7] :0 2017- 03 -22 15:16:45
1 2020- 05 -23 01:03:25
2 2023- 07 -24 10:50:05
dtype: datetime64[ns]
In [8]: s = pd.Series([1490195805433502912.1590195805433502912.1690195805433502912])
In [9]: pd.to_datetime(s, unit='ns')
Out[9] :0 2017- 03 -22 15:16:45.433502912
1 2020- 05 -23 01:03:25.433502912
2 2023- 07 -24 10:50:05.433502912
dtype: datetime64[ns]
# Number string is converted to date according to format
In [10]: s = pd.Series(['20200101'.'20200202'.'202003'])
In [11]: pd.to_datetime(s, format='%Y%m%d', errors='ignore')
Out[11] :0 20200101
1 20200202
2 202003
dtype: object
In [12]: pd.to_datetime(s, format='%Y%m%d', errors='coerce')
Out[12] :0 2020- 01-011 2020- 02-022 NaT
dtype: datetime64[ns]
Copy the code
It should be noted that the default start time for the date conversion of the above timestamp is 1970-01-01, which will differ by 8 hours for the domestic time. We have the following ways to process it.
In [13]: s
Out[13] :0 1490195805
1 1590195805
2 1690195805
dtype: int64
Greenwich time by default
In [14]: pd.to_datetime(s, unit='s')
Out[14] :0 2017- 03 -22 15:16:45
1 2020- 05 -23 01:03:25
2 2023- 07 -24 10:50:05
dtype: datetime64[ns]
# Add 8 hours to the start time
In [15]: pd.to_datetime(s, unit='s', origin=pd.Timestamp('the 1970-01-01 08:00:00'))
Out[15] :0 2017- 03 -22 23:16:45
1 2020- 05 -23 09:03:25
2 2023- 07 -24 18:50:05
dtype: datetime64[ns]
Add 8 hours manually
In [16]: pd.to_datetime(s, unit='s') + pd.Timedelta(days=8/24)
Out[16] :0 2017- 03 -22 23:16:45
1 2020- 05 -23 09:03:25
2 2023- 07 -24 18:50:05
dtype: datetime64[ns]
Copy the code
3.2.pd.to_numeric is converted to a numeric type
In [17]: s = pd.Series(['1.0'.'2', -3])
In [18]: pd.to_numeric(s)
Out[18] :0 1.0
1 2.0
2 -3.0
dtype: float64
In [19]: pd.to_numeric(s, downcast='signed')
Out[19] :0 1
1 2
2 -3
dtype: int8
In [20]: s = pd.Series(['apple'.'1.0'.'2', -3])
In [21]: pd.to_numeric(s, errors='ignore')
Out[21] :0 apple
1 1.0
2 2
3 -3
dtype: object
In [22]: pd.to_numeric(s, errors='coerce')
Out[22] :0 NaN
1 1.0
2 2.0
3 -3.0
dtype: float64
Copy the code
3.3.pd.to_timedelta to a time difference type
Convert numbers, time difference string like, and so on to time difference data types
In [23] :import numpy as np
In [24]: pd.to_timedelta(np.arange(5), unit='d')
Out[24]: TimedeltaIndex(['0 days'.'1 days'.'2 days'.'3 days'.'4 days'], dtype='timedelta64[ns]', freq=None)
In [25]: pd.to_timedelta('1 days 06:05:01. 00003')
Out[25]: Timedelta('1 days 06:05:01. 000030')
In [26]: pd.to_timedelta(['1 days 06:05:01. 00003'.'15.5 us'.'nan'])
Out[26]: TimedeltaIndex(['1 days 06:05:01. 000030'.'0 days 00:00:00. 000015500', NaT], dtype='timedelta64[ns]', freq=None)
Copy the code
4. Determine the data type intelligently
The convert_dtypes method can be used to do smart data type conversions, see
5. Data type filtering
See in some learning groups often have friends ask how to filter the specified data type of the field, today we also introduce.
Pandas provides a select_dtypes() function that filters by field data type to select required fields and excluded fields.
Signature:
df.select_dtypes(include=None, exclude=None) - >'DataFrame'
Docstring:
Return a subset of the DataFrame's columns based on the column dtypes.
Copy the code
There are the following data types:
Number: number, int, float
Boolean: the Boolean
Time: datetime64
Time difference: timedelta64
Category: the category
String: string
Object: the object
In [27]: df
Out[27]: Country popularity rating0China10 10.0 10.0
1The United States6 5.8 7.0
2Japan2 1.2 7.0
3Germany8 6.8 6.0
4Britain7 6.6 NaN
In [28]: df.dtypes
Out[28] : the countryobjectPopularity INT64 Score FLOAT64 Desirability float64 DType:object
In [29]: df.select_dtypes(include='int')
Out[29]: Popularity0 10
1 6
2 2
3 8
4 7
In [30]: df.select_dtypes(include='number')
Out[30]: Popularity rating Desirability0 10 10.0 10.0
1 6 5.8 7.0
2 2 1.2 7.0
3 8 6.8 6.0
4 7 6.6 NaN
In [31]: df.select_dtypes(include=['int'.'object'])
Out[31]: National popularity0China10
1The United States6
2Japan2
3Germany8
4Britain7
In [32]: df.select_dtypes(exclude=['object'])
Out[32]: Popularity rating Desirability0 10 10.0 10.0
1 6 5.8 7.0
2 2 1.2 7.0
3 8 6.8 6.0
4 7 6.6 NaN
Copy the code
The above is all the content of this time, I hope to help you, if you like, remember to click a thumbs-up, brush a look at oh!
In fact, there are more parameter methods that you can use either help or? Take a look at the demo!