The article directory

- Review the above
- DataFrame
  - DataFrame case demonstration of key attributes
    - The columns property
    - The index attribute
    - The values attribute
    - Dtypes properties
    - Size and Shape properties
- The data type of pandas
  - Converters: converting data types
  - Parse the time data type
- Pandas saves data
  - Save data to CSV
    - The experiment
  - Save data to JSON
    - The sample
  - Save data to Excel
    - Write data to multiple sheets
  - Save data to MySQL
- Use apply to process each row of data
  - Demo
Review the above
DataFrame
The DataFrame is the most important class in pandas. In this article we keep analyzing data through the DataFrame's key attributes and methods.
DataFrame case demonstration of key attributes
The CSV file still uses the test.csv file from the previous article.
The columns property
import pandas as pd
csv_file = 'test.csv'
df = pd.read_csv(csv_file)
print(df.columns)
print(type(df.columns))
Index(['a', 'b'], dtype='object')
<class 'pandas.core.indexes.base.Index'>  # a key pandas class
Printing df.columns.values gives ['a' 'b'], the names of all the columns.
Now find all columns whose name contains 'a':
for name in df.columns.values:
    if name.find('a') >= 0:
        print(name)
That's one way to do it. Is there a simpler way, say with a lambda? There is:
list(filter(lambda x: x.find('a') >= 0, df.columns.values))
# filter takes two arguments: a function, and an iterable whose items are passed to that function.
# filter returns an iterator, so wrap it in list() to get the matching names.
The index attribute
print(df.index)
print(type(df.index))
RangeIndex(start=0, stop=3, step=1)
<class 'pandas.core.indexes.range.RangeIndex'>
The index starts at 0 and stops before 3, with a step of 1: that is, rows 0, 1, 2.
The values attribute
print(df.values)
print(type(df.values))
[[ 1. 2.]
[ 3. 4.]
[nan 5.]]
<class 'numpy.ndarray'>
As you can see, values is a two-dimensional NumPy array holding all the data.
Dtypes properties
print(df.dtypes['a'])
print(type(df.dtypes))
float64
<class 'pandas.core.series.Series'>
I’m not going to say much about this, because I covered it in the last article.
Size and Shape properties
These two are self-explanatory; just look at the output:
print(df.size)
print(df.shape)
6       # the total number of elements
(3, 2)  # (rows, columns)
The data type of pandas
Just a quick recap here.
Basic data types are discrete (object/string, bool, timedelta) and continuous (int, float, datetime).
Data type conversion: the astype method, to_numeric, and to_datetime.
These were all covered in the last article, so I won't repeat them.
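As a refresher, here is a minimal sketch of the three conversion routes. The column values are made up for illustration; the errors='coerce' option is what lets bad values become NaN/NaT instead of raising.

```python
import pandas as pd

df = pd.DataFrame({'a': ['1', '2', 'x'],
                   'b': ['3.5', '4.5', '5.5'],
                   'c': ['2018-10-01', '2020-12-12', 'oops']})

# astype: strict conversion; raises if any value cannot be converted
b = df['b'].astype('float64')

# to_numeric: with errors='coerce', unparseable values become NaN
a = pd.to_numeric(df['a'], errors='coerce')

# to_datetime: with errors='coerce', unparseable values become NaT
c = pd.to_datetime(df['c'], errors='coerce')

print(a.tolist())  # [1.0, 2.0, nan]
```

Note how 'x' and 'oops' survive as NaN and NaT rather than killing the whole read.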
Converters: converting data types
As mentioned earlier, you can set a column's data type with the dtype parameter at read time, but that approach is too blunt and sometimes fails; just yesterday a column containing NaN refused to be forced to int. A converter gives you full control:
def toint(x):
    try:
        # the raw value comes in as a string; go through float first so '1.0' converts too
        return int(float(x))
    except (ValueError, TypeError):
        return 0

df = pd.read_csv(csv_file, converters={'a': toint})
Parse the time data type
For ease of operation, I have added a time column to the CSV file:
a,b,data
1,2,2018-10-1
3,4,2020-12-12
,5
df = pd.read_csv(csv_file,parse_dates=['data'])
print(df.dtypes['data'])
print(df.head())
datetime64[ns]
     a  b       data
0  1.0  2 2018-10-01
1  3.0  4 2020-12-12
2  NaN  5        NaT
Pandas saves data
Save the data to CSV
to_csv() is a DataFrame method. Its main parameters:
path_or_buf: where to save the file
sep: the delimiter, ',' by default
columns: the columns to save
The experiment
Save to CSV
df = pd.read_csv(csv_file,parse_dates=['data'])
df.to_csv('test2.csv',sep='\t')
a b data
0 1.0 2 2020-10-21
1 3.0 4 2020-10-21
2 5
Save only the specified columns and drop the index column:
df = pd.read_csv(csv_file, parse_dates=['data'])
df.to_csv('test3.csv', columns=['a'], index=False)
a
1.0
3.0
""
Specify index column
df = pd.read_csv(csv_file,parse_dates=['data'])
df.to_csv('test4.csv',index_label='ID')
# feels more like giving a name to the index column
ID,a,b,data
0,1.0,2,2020-10-21
1,3.0,4,2020-10-21
2,,5,
Save the data to JSON
The sample
I don't have the real data yet, so there is no live example here; I'll fill it in once I've run the data in a couple of days.
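As a placeholder, here is a minimal to_json sketch on a hand-made frame shaped like test.csv (the column names are assumed):

```python
import pandas as pd

df = pd.DataFrame({'a': [1.0, 3.0, None], 'b': [2, 4, 5]})

# orient controls the JSON layout; 'records' gives one dict per row
json_str = df.to_json(orient='records')
print(json_str)  # [{"a":1.0,"b":2},{"a":3.0,"b":4},{"a":null,"b":5}]
```

to_json can also take a file path as its first argument to write straight to disk, and other orient values ('columns', 'index', 'split', ...) change the layout.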
Save data to Excel
DataFrame.to_excel(excel_writer, sheet_name='Sheet1', columns=None, header=True, index=True)
The parameters need no extra explanation; they have all come up before. Let's just run it:
df = pd.read_csv(csv_file,parse_dates=['data'])
df.to_excel('text1.xlsx',sheet_name='a')
There is a problem writing this date column.
What to do? Convert it to object (string) before writing; I tried it and it works.
Also note that the target Excel file must not be open on your computer while you write to it.
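A sketch of that workaround: format the datetime column as plain strings before writing. The strftime pattern is my choice, and the to_excel call is left commented out since it needs the openpyxl package installed.

```python
import pandas as pd

df = pd.DataFrame({'a': [1.0, 3.0],
                   'data': pd.to_datetime(['2018-10-01', '2020-12-12'])})

# turn datetime64 into plain strings (object dtype) so the Excel writer gets simple text
df['data'] = df['data'].dt.strftime('%Y-%m-%d')
print(df['data'].dtype)  # object

# df.to_excel('text1.xlsx', sheet_name='a')  # now writes the dates as text
```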
Write data to multiple sheets
This requires ExcelWriter:
df = pd.read_csv(csv_file, parse_dates=['data'])
with pd.ExcelWriter('test4.xlsx') as f:
    df.to_excel(f, sheet_name='a', columns=['a'], index=True, index_label='ID')
    df.to_excel(f, sheet_name='b', columns=['b', 'data'], index=True, index_label='ID')
Please check the results for yourself.
Save data to MySQL
def to_sql(
    self,
    name,
    con,
    schema=None,
    if_exists="fail",
    index=True,
    index_label=None,
    chunksize=None,
    dtype=None,
    method=None,
)
Parameter notes (parameters already covered earlier are skipped):
schema: specifies the database schema; the default is usually fine.
if_exists: what to do if the table already exists.
    'fail': raise ValueError (the default).
    'replace': drop the table, recreate it, then insert the data.
    'append': insert the new data into the existing table; if the table has a primary key, avoid key conflicts. The DataFrame's columns map to the table's columns.
index: whether to write the DataFrame index to the database as a column. Defaults to True.
index_label: the column name used when the index is written; defaults to the index name. If the DataFrame has a multilevel index, index_label should be a sequence.
chunksize: batch size, i.e. how many rows are written at a time. By default everything is written in one go; batching is only useful when the data volume is large enough to cause a noticeable slowdown.
dtype: a dict mapping column names to data types (SQLAlchemy types, or strings for sqlite3). When creating a new table you should specify the types; otherwise the largest-capacity default is used (for example VARCHAR becomes TEXT), which wastes a lot of space. When appending to an existing table this parameter is unnecessary.
method: which kind of INSERT statement to use.
    None: one row per INSERT (the default).
    'multi': multiple rows per INSERT.
    callable: a callback insert function; pass the function itself. I haven't used it.
One tip: if you don't need the data in the database right away and the volume is large, write it to CSV first and bulk-load the CSV into the database later.
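The signature above is generic: to_sql works with any SQLAlchemy engine or DBAPI connection, not just MySQL. Here is a self-contained sketch against an in-memory SQLite database instead (the table name is made up):

```python
import sqlite3

import pandas as pd

df = pd.DataFrame({'a': [1.0, 3.0], 'b': [2, 4]})

con = sqlite3.connect(':memory:')
# if_exists='replace' drops and recreates the table on every run
df.to_sql('test_table', con, if_exists='replace', index=False)

# read it back to confirm the round trip
back = pd.read_sql('SELECT * FROM test_table', con)
print(len(back))  # 2
```

For MySQL you would build the connection with SQLAlchemy's create_engine and pass that as con; everything else stays the same.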
Use Apply to process each row of data
DataFrame.apply(func,axis)
func: the function applied to each slice of data
axis: which way to slice the data
    1: process the data row by row
    0: process the data column by column (the default)
Demo
Let's square column a.
df = pd.read_csv(csv_file, dtype={'data': object})
print(df.head())
df['aa'] = df.apply(lambda x: x.a * x.a, axis=1)  # change axis to try column-wise processing
print(df.head())
ID a b data
0 0 1.0 2 2020-10-21
1 1 3.0 4 2020-10-21
2 2 NaN 5 NaN
ID a b data aa
0 0 1.0 2 2020-10-21 1.0
1 1 3.0 4 2020-10-21 9.0
2 2 NaN 5 NaN NaN
Row-by-row processing is the more common case.
That's it for now; the next article will move on to a small hands-on example.