01. DataFrame: select data


1. Select the row name, column name, and value
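As a minimal sketch (the tiny `df` and its labels `r1`/`c1` are made-up examples, not the platform's data), row names, column names, and values can be read off any DataFrame:

```python
import pandas as pd

# A small example DataFrame (hypothetical data)
df = pd.DataFrame([[1, 2], [3, 4]], index=['r1', 'r2'], columns=['c1', 'c2'])

rows = list(df.index)      # row names
cols = list(df.columns)    # column names
vals = df.values           # the underlying 2-D array of values
```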

* * * *

2. Select data by label — df.loc[row label, column label]
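A minimal sketch of label-based selection (the example `df` below is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], index=['r1', 'r2'], columns=['c1', 'c2'])

# A single cell by row label and column label
cell = df.loc['r1', 'c2']
# A whole row by label (':' means "all columns")
row = df.loc['r2', :]
```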

* * * *

3. Select data by position — df.iloc[row index, column index]
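The positional counterpart, sketched on the same kind of toy DataFrame:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], index=['r1', 'r2'], columns=['c1', 'c2'])

cell = df.iloc[0, 1]     # row 0, column 1 (counting from 0)
row = df.iloc[1, :]      # the whole second row
```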

* * * *

4. Select data by label and position at the same time — df.ix[row, column]
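`df.ix` accepted a mix of labels and positions, but note it was deprecated in pandas 0.20 and removed in 1.0. A sketch of the modern equivalent, translating the positional part into a label explicitly (the toy `df` is made up):

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], index=['r1', 'r2'], columns=['c1', 'c2'])

# Old style (removed in pandas 1.0): df.ix['r1', 1]
# Modern equivalent: convert the position into a label first, then use .loc
cell = df.loc['r1', df.columns[1]]   # row by label, column by position
```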

* * * *

5. Select consecutive rows and columns — slices

start index : end index — this notation is called a slice and refers to the range from the start index to the end index. You can see it in practice in the examples.

If the start index is omitted, the slice starts from the beginning; if the end index is omitted, it runs to the end; if both are omitted, everything is selected.

Slicing can be used with .loc, .iloc, and .ix.

As the examples show, when indexing by position the slice does not include the end point: it is left-closed and right-open. So columns 1 through 3 are written 0:3, not 0:2.
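A sketch of both slice styles on a made-up 3×3 DataFrame, showing the end-point difference:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                  index=['r1', 'r2', 'r3'], columns=['c1', 'c2', 'c3'])

# Positional slice: left-closed, right-open, so 0:2 takes rows 0 and 1 only
by_pos = df.iloc[0:2, :]
# Label slice with .loc: BOTH endpoints are included
by_label = df.loc['r1':'r2', 'c1':'c2']
```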

* * * *

6. Select lines or columns that are not contiguous

Remark: df.loc[['2016-02-02', '2016-02-04'], :] or df.ix[[pd.Timestamp('2016-02-02'), pd.Timestamp('2016-02-04')]]
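Non-adjacent rows are picked by passing a list of labels. A minimal sketch (the dates and prices below are invented):

```python
import pandas as pd

dates = pd.to_datetime(['2016-02-01', '2016-02-02', '2016-02-03', '2016-02-04'])
df = pd.DataFrame({'price': [10.0, 11.0, 12.0, 13.0]}, index=dates)

# Pass a LIST of labels to select non-contiguous rows
picked = df.loc[[pd.Timestamp('2016-02-02'), pd.Timestamp('2016-02-04')], :]
```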

7. Easily get rows or columns

Rows can be fetched directly with slices, and columns directly with label names. Be careful not to mix the two up.
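A quick sketch of the two shortcuts (the example `df` is made up):

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=['c1', 'c2'])

rows = df[0:2]   # a slice selects ROWS
col = df['c1']   # a label selects a COLUMN
```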

* * * *

8. How to return a single column or row of a Dataframe

As above, a Series is returned instead of a DataFrame. A Series is likewise returned when a single row is fetched, e.g. df.ix[0, :].

To return a dataframe, enclose the index in brackets, as follows.
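The difference, sketched on a made-up DataFrame: a bare label yields a Series, a label wrapped in a list yields a DataFrame.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], index=['r1', 'r2'], columns=['c1', 'c2'])

as_series = df.loc['r1', :]     # a single row -> Series
as_frame = df.loc[['r1'], :]    # label wrapped in a list -> DataFrame
```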

* * * *

9. Select data by condition — df[logical condition]

Logical conditions support the & (and), | (or), and ~ (not) operators.

A common scenario for this method is modifying data conditionally:
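A sketch of both uses, filtering and conditional modification (the price data is invented):

```python
import pandas as pd

df = pd.DataFrame({'price': [10, 25, 5, 40]})

cheap = df[df['price'] < 20]            # filter rows by a condition
df.loc[df['price'] < 20, 'price'] = 0   # modify matching rows in place
```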


02. DataFrame transpose and sort


1. Transpose — df.T
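A one-line sketch on a made-up DataFrame: rows become columns and columns become rows.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3]], index=['r1'], columns=['c1', 'c2', 'c3'])
t = df.T   # transpose: shape (1, 3) becomes (3, 1)
```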

* * * *

2. Sort by row or column name — df.sort_index

df.sort_index(axis=0,ascending=True)

  • axis=0 sorts by row name; axis=1 sorts by column name
  • ascending=True sorts in ascending order; False in descending order
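The parameters above, sketched on a deliberately unordered toy DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'b': [1, 2], 'a': [3, 4]}, index=['r2', 'r1'])

by_row = df.sort_index(axis=0)                  # sort by row name
by_col = df.sort_index(axis=1)                  # sort by column name
desc = df.sort_index(axis=0, ascending=False)   # descending order
```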

* * * *

3. Sort by value — df.sort

df.sort(by=, ascending=True)

  • by= the column whose values to sort by; the default is to sort by row label
  • ascending=True sorts in ascending order; False in descending order
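Note that `df.sort(by=...)` is the old pandas API and was removed in later versions; the current equivalent is `df.sort_values`. A sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'price': [30, 10, 20]}, index=['r1', 'r2', 'r3'])

# Old pandas: df.sort(by='price'); current pandas:
sorted_df = df.sort_values(by='price', ascending=True)
```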


03. DataFrame: add or delete rows/columns


1. Get a sample DataFrame

* * * *

2. Add a column or row

* * * *

3. Drop rows or columns — df.drop

df.drop(labels, axis=0, inplace=False)

  • labels — the label name(s) of the rows or columns to drop; as the first argument, the keyword can be omitted.
  • axis=0 deletes rows; axis=1 deletes columns.
  • inplace=False returns a new DataFrame; True modifies the original DataFrame in place instead of creating a new one. The default is False.
  • By default the operation returns a new DataFrame and leaves the original unchanged, as in the examples below. To change the original, set the inplace parameter.
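The parameters above, sketched on a made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], index=['r1', 'r2'], columns=['c1', 'c2'])

no_row = df.drop('r1', axis=0)        # returns a NEW DataFrame; df unchanged
no_col = df.drop('c2', axis=1)        # drop a column
df.drop('r2', axis=0, inplace=True)   # modifies df itself, returns None
```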


04. Concatenating multiple DataFrames


1. Concatenate — pd.concat([df1, df2, …], axis=0)

  • axis=0 joins vertically; axis=1 joins horizontally.
  • You need to import the pandas module before using pd.concat.
  • Take care to align the rows and columns of the DataFrames being joined.
  • Multiple DataFrames can be concatenated at once.
  • Concatenation is forced: rows or columns with the same name are allowed and are kept after the join, as shown in the second vertical-join example.

* * * *

2. Connect horizontally
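A horizontal-join sketch with two made-up DataFrames sharing the same row index:

```python
import pandas as pd

left = pd.DataFrame({'c1': [1, 2]}, index=['r1', 'r2'])
right = pd.DataFrame({'c2': [3, 4]}, index=['r1', 'r2'])

wide = pd.concat([left, right], axis=1)   # axis=1: columns side by side
```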

* * * *

3. Connect vertically
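A vertical-join sketch (made-up data), also showing that a duplicated row label survives the join:

```python
import pandas as pd

df1 = pd.DataFrame({'c1': [1, 2]}, index=['r1', 'r2'])
df2 = pd.DataFrame({'c1': [3, 4]}, index=['r2', 'r3'])

# axis=0: stack vertically; the duplicate label 'r2' is kept, not merged
stacked = pd.concat([df1, df2], axis=0)
```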


05. Creating a DataFrame


1. Create — pd.DataFrame

pd.DataFrame(data=None, index=None, columns=None)

  • data = the data
  • index = the index, i.e. the row names (row headers)
  • columns = the column names (column headers)

To use this command, import pandas as pd is required
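A minimal sketch with all three parameters (the values and labels are invented):

```python
import pandas as pd

df = pd.DataFrame(data=[[1, 2], [3, 4]],
                  index=['r1', 'r2'],      # row names
                  columns=['c1', 'c2'])    # column names
```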


* * * *

2. Create a DataFrame from a dictionary — pd.DataFrame(dict)

The method is basically the same as above; because a dictionary's keys already serve as labels, there is no need to write the column names.
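A sketch: the dictionary keys become the column names automatically (the data is made up).

```python
import pandas as pd

# Keys of the dictionary become column names; no columns= argument needed
df = pd.DataFrame({'c1': [1, 2], 'c2': [3, 4]})
```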

* * * *

3. Easily obtain the time index of platform data

Sometimes when creating a DataFrame you need the same time index as the platform's data, but building that index by hand is difficult because of holidays and non-trading days. A simple trick is to take the time index directly from the platform's data. For example:
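The platform call that returns price data is not shown here; as a stand-in, `platform_df` below is built with `pd.bdate_range` (a hypothetical substitute). The point is simply to reuse its `.index`:

```python
import pandas as pd

# Stand-in for a DataFrame fetched from the platform (hypothetical data)
platform_df = pd.DataFrame({'close': range(5)},
                           index=pd.bdate_range('2016-02-01', periods=5))

# Reuse its time index for a new DataFrame so the dates line up exactly
mine = pd.DataFrame({'signal': [0, 1, 0, 1, 0]}, index=platform_df.index)
```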


06. DataFrame missing-value handling


1. Delete missing values — df.dropna

df.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)

  • axis=0 checks and drops by row; axis=1 by column. Omitted, it defaults to 0.
  • how='any' treats a row or column as missing if any value in it is missing; 'all' only if all values (per the axis argument) are missing. Omitted, it defaults to 'any'.
  • thresh=x, where x is an integer: a row or column (per the axis argument) with at least x non-missing values is kept; with fewer than x it counts as missing and is removed.
  • subset= label names: restricts the missing-value check to the given columns or rows (on the opposite axis); omitted, everything is checked.

These parameters are relatively complex; see the examples and their notes.
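The parameters above, sketched on a small made-up DataFrame with NaN gaps:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'c1': [1, np.nan, 3], 'c2': [4, np.nan, np.nan]})

any_drop = df.dropna(how='any')    # drop rows with ANY missing value
all_drop = df.dropna(how='all')    # drop rows where ALL values are missing
kept = df.dropna(thresh=2)         # keep rows with >= 2 non-missing values
sub = df.dropna(subset=['c1'])     # only check column 'c1' for missing values
```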

* * * *

2. Fill missing values — df.fillna

df.fillna(value=None,axis=None)

  • value = the value used to replace missing values. It can be a single value, a dictionary, a DataFrame, and so on, but not a list. See the examples for the differences.
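A sketch of the single-value and dictionary forms (made-up data):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'c1': [1.0, np.nan], 'c2': [np.nan, 4.0]})

single = df.fillna(value=0)                       # one value everywhere
per_col = df.fillna(value={'c1': -1, 'c2': -2})   # dict: a value per column
```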

* * * *

3. Check whether the data is missing — df.isnull

Why do we need this method to judge whether a value is missing?

Because nan does not equal nan (see the example below), you cannot test whether a value x is nan with a condition like x == nan.
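A sketch of the pitfall and the correct approach:

```python
import pandas as pd
import numpy as np

naive = (np.nan == np.nan)   # False! nan never compares equal to nan

df = pd.DataFrame({'c1': [1.0, np.nan]})
mask = df.isnull()           # True exactly where a value is missing
```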


07. Common statistical functions


Common statistical functions:

  • describe — summary statistics of a Series or of each DataFrame column
  • count — number of non-NA values
  • min, max — minimum and maximum values
  • idxmin, idxmax — index labels of the minimum and maximum values
  • quantile — sample quantile (0 to 1)
  • sum — sum of values
  • mean — mean of values
  • median — median (the 50% quantile)
  • mad — mean absolute deviation from the mean
  • var — sample variance
  • std — sample standard deviation
  • skew — sample skewness (third moment)
  • kurt — sample kurtosis (fourth moment)
  • cumsum — cumulative sum of values
  • cummin, cummax — cumulative minimum and maximum values
  • cumprod — cumulative product of values
  • diff — first-order differences
  • pct_change — percentage change

See the examples for the details of each function.
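A few of these, sketched on a made-up single-column DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'c1': [1, 2, 3, 4]})

total = df['c1'].sum()      # sum of values
avg = df['c1'].mean()       # mean of values
top = df['c1'].idxmax()     # index label of the maximum value
diffs = df['c1'].diff()     # first-order differences (first entry is NaN)
```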


08. Decomposing Panel data into DataFrames

\

1. Access method of panel

A Panel can be accessed in much the same way as a DataFrame. In practice, statistical work is usually done by decomposing the Panel into DataFrames, which covers most daily needs. For more on Panel operations, see: pandas.pydata.org/pandas-docs…


2. Methods for decomposing Panel data into DataFrames


09. Accessing DataFrames inside the research environment


1. Save a DataFrame as a CSV file — df.to_csv()

The files are saved in the research space. If no path is given, the default is the root directory, e.g. df.to_csv('df.csv').


2. Read a DataFrame — pd.read_csv()
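A round-trip sketch of both calls. It writes to a temporary directory rather than the research space's root directory, and `index_col=0` restores the saved row index (the `df` is made up):

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({'c1': [1, 2]}, index=['r1', 'r2'])

path = os.path.join(tempfile.gettempdir(), 'df.csv')
df.to_csv(path)                          # save as CSV
back = pd.read_csv(path, index_col=0)    # read back, first column as the index
```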


Article by JoinQuant


JoinQuant, the largest Python quantitative-trading platform in China, provides products covering the whole workflow: data, backtesting, simulation, and live trading. JoinQuant has gathered more than 150,000 quantitative-trading enthusiasts and cooperates with dozens of institutions; its public account regularly publishes practical quant content and teaches you how to write good strategies in Python.

ID: JoinQuant
