A DataFrame is a two-dimensional, tabular data type. It can be thought of as an Excel-style table made up of multiple columns, each of which can hold a different data type. A Series is a single column.
Because a DataFrame is two-dimensional, it has both a row index and a column index.
Creating a DataFrame
We can create (initialize) a DataFrame object from:
- A two-dimensional array structure (nested list, ndarray, DataFrame, etc.).
- A dictionary, where each key is a column name and each value is a one-dimensional array structure (list, ndarray, Series, etc.).
Notes:
- If no row or column index is specified, an integer index starting at 0 is generated automatically. Explicit indexes can be passed with the index and columns arguments when creating the DataFrame.
- The first or last N rows can be accessed with the head and tail methods.
import numpy as np
import pandas as pd

# Create a DataFrame from a two-dimensional data structure. No index is specified,
# so the row and column indexes are generated automatically as integers starting at 0.
array1 = np.random.rand(3, 5)
df = pd.DataFrame(array1)
print(df)
# Data with more than two dimensions raises an error.
# df_more_than2d = pd.DataFrame(np.random.rand(3, 3, 3))

Output:
          0         1         2         3         4
0  0.877072  0.941101  0.131574  0.056032  0.141660
1  0.129488  0.211658  0.786556  0.477778  0.912969
2  0.624839  0.336306  0.936274  0.581543  0.541653
# Create a DataFrame from a dictionary. Each key-value pair becomes one column:
# the key is the column label and the value supplies the column's data.
df = pd.DataFrame({"Beijing": [100, 200, 125, 112], "Tianjin": [109, 203, 123, 112], "Shanghai": [39, 90, 300, 112]})
display(df)
# Display the first N rows
display(df.head(2))
# Display the last N rows
display(df.tail(2))
# Select N rows at random
display(df.sample(2))
# Create a DataFrame with explicit row and column indexes.
df = pd.DataFrame(np.random.rand(3, 5), index=["Area 1", "Area 2", "Area 3"], columns=["Beijing", "Tianjin", "Shanghai", "Shenyang", "Guangzhou"])
display(df)
DataFrame attributes
- index: the row index
- columns: the column index
- values: the data, as an ndarray
- shape: the shape
- ndim: the number of dimensions
- dtypes: the data type of each column
Notes:
- index and columns return the row and column indexes, and values returns the data. index and columns can also be assigned to (modified).
- The index and columns objects of a DataFrame each have a name attribute that can be set.
- A DataFrame cannot hold data of more than two dimensions.
df = pd.DataFrame(np.random.rand(3, 5), index=["Area 1", "Area 2", "Area 3"], columns=["Beijing", "Tianjin", "Shanghai", "Shenyang", "Guangzhou"])
# index, columns, values
display(df.values, type(df.values))  # the ndarray holding the DataFrame's data, and its type
display(df.index)  # the row index
display(df.columns)  # the column index

Output:
array([[0.88915553, 0.09234275, 0.41773469, 0.92490647, 0.13286735],
       [0.85550017, 0.06293159, 0.75023895, 0.01887861, 0.327761  ],
       [0.13309605, 0.98347602, 0.95935583, 0.92139592, 0.48752687]])
numpy.ndarray
Index(['Area 1', 'Area 2', 'Area 3'], dtype='object')
Index(['Beijing', 'Tianjin', 'Shanghai', 'Shenyang', 'Guangzhou'], dtype='object')
# Return the shape
display(df.shape)
# Return the number of dimensions
display(df.ndim)
# Return the data type of each column
display(df.dtypes)

Output:
(3, 5)
2
Beijing      float64
Tianjin      float64
Shanghai     float64
Shenyang     float64
Guangzhou    float64
dtype: object
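The notes above also mention that index and columns can be reassigned and given names, which the listing does not show. A minimal sketch, assuming made-up labels ("r1", "a", and so on) and names ("row", "col") chosen only for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6).reshape(2, 3))

# Replace the automatically generated integer indexes with explicit labels.
df.index = ["r1", "r2"]
df.columns = ["a", "b", "c"]

# Give each index a name; the names show up in the printed table.
df.index.name = "row"
df.columns.name = "col"

print(df)
print(df.index.name, df.columns.name)
```

The new label list must have the same length as the axis it replaces, otherwise pandas raises an error.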
DataFrame operations
Suppose df is an object of type DataFrame.
Column operations
- Get a column (df[column label] is the safer form, because attribute access only works when the label is a valid Python identifier that does not clash with an existing DataFrame attribute)
  - df[column label]
  - df.column_label
- Add or modify a column: df[column label] = column data
- Delete a column
  - del df[column label]
  - df.pop(column label)
  - df.drop(column label or array of labels, axis=1)
Row operations
- Get a row
  - df.loc selects by label.
  - df.iloc selects by position.
  - df.ix is a mixed indexer: by label first, then by position if no label matches (provided the labels are not numeric). It was deprecated in pandas 0.20 and removed in pandas 1.0.
- Add a row: append (removed in pandas 2.0, where pd.concat replaces it)
- Delete rows
  - df.drop(row label or array of labels)
Mixed row and column operations:
- Get the rows first, then the columns.
- Get the columns first, then the rows.
Notes:
- The drop method can drop both rows and columns; the axis parameter selects the direction. It can modify the object in place or return the modified result.
- Access via df[label] operates on columns.
- Access via df[slice] operates on rows. A slice of labels selects by label and includes both endpoints; a slice of integers selects by position and excludes the end.
- A Boolean index operates on rows.
- An array (list) of labels operates on columns.
These rules are fragmented and easy to confuse. To summarize them another way:
- Row operations: slices and Boolean arrays
- Column operations: a single label or an array of labels
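The summary above can be checked with a small sketch. The frame and its labels ("x", "y", "z" and "a" to "d") are made up for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4),
                  index=list("xyz"), columns=list("abcd"))

col = df["a"]                # single label -> one column (Series)
cols = df[["a", "c"]]        # list of labels -> columns (DataFrame)
rows_pos = df[0:2]           # integer slice -> rows by position, end excluded
rows_lab = df["x":"y"]       # label slice -> rows by label, end included
rows_bool = df[df["a"] > 0]  # Boolean Series -> rows where the mask is True

print(rows_pos.index.tolist(), rows_lab.index.tolist(), rows_bool.index.tolist())
```

Because df[] switches between rows and columns depending on what it is given, loc and iloc are usually the clearer choice when both axes are involved.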
df = pd.DataFrame(np.random.rand(5, 5), columns=list("abcde"), index=list("hijkl"))
display(df)
# Get multiple columns. A list of labels returns a DataFrame, even if it contains only one label.
display(df[["a", "d"]])
# Add and delete columns
df["e"] = [6, 7, 8, 9, 10]
del df["e"]
df["e"] = [6, 7, 8, 9, 10]  # add column e back
display(df.pop("e"))  # pop removes column e from df and returns the removed column
display(df)

Output of df.pop("e"):
h     6
i     7
j     8
k     9
l    10
Name: e, dtype: int64
# With inplace=False, drop returns a new DataFrame and leaves df unchanged.
df2 = df.drop("h", inplace=False, axis=0)
display(df, df2)
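The drop call above removes a row and returns a copy. As a rough sketch of the other combinations described earlier (dropping columns through axis, and modifying in place), using a small made-up frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(9).reshape(3, 3),
                  index=list("hij"), columns=list("abc"))

no_row = df.drop("h", axis=0)          # drop a row, return a new DataFrame
no_cols = df.drop(["a", "b"], axis=1)  # drop two columns, return a new DataFrame
# df itself still has all three rows and columns at this point.

result = df.drop("c", axis=1, inplace=True)  # modify df in place; returns None

print(no_row.index.tolist(), no_cols.columns.tolist(), df.columns.tolist(), result)
```

Since the in-place form returns None, writing df = df.drop(..., inplace=True) would discard the frame.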
# Construct a DataFrame
np.random.seed(100)
df = pd.DataFrame(np.random.rand(5, 5), index=list("abcde"), columns=list("yuiop"))
display(df)
# loc locates from the higher dimension (rows) to the lower dimension (columns).
display(df.loc["c"]["i"])
display(df.loc["c", "i"])
display(df.loc["c"].loc["i"])
display(df.loc["c"])  # the lower dimension may be omitted, which returns the whole row
# The higher dimension cannot be omitted; pass a full slice for the rows to select a column.
display(df.loc[:, "i"])

Output:
0.18532821955007506
0.18532821955007506
0.18532821955007506
y    0.891322
u    0.209202
i    0.185328
o    0.108377
p    0.219697
Name: c, dtype: float64
a    0.424518
b    0.825853
c    0.185328
d    0.171941
e    0.817649
Name: i, dtype: float64
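# Get the column first, then the row.
display(df["i"].loc["a"])
# For assignment, use a single loc call. Chained assignment such as
# df["i"].loc["a"] = 3 may write to a temporary copy instead of df.
df.loc["a", "i"] = 3
display(df)
# A list of labels locates the columns, then loc slices the rows.
display(df[["i", "o", "p"]].loc["b":"d"])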
# Boolean indexing. With a two-dimensional Boolean array, elements at True are kept as they are and elements at False become null (NaN).
display(df > 0.5)
display(df[df > 0.5])
# With a one-dimensional Boolean Series, the rows where the value is True are selected.
display(df["i"] > 0.5)
display(df[df["i"] > 0.5])
df = pd.DataFrame(np.random.rand(5, 5), columns=list("abcde"), index=list("hijkl"))
display(df)
# DataFrame row operations
# Get a row with loc, iloc, or ix:
# loc selects by label
# iloc selects by position
# ix is a mixed indexer: by label first, then by position
display(df.loc["i"])
display(df.iloc[1])
# ix is not recommended because it is easily confused. It was deprecated in pandas 0.20
# and removed in pandas 1.0, so the two calls below only run on older versions.
# display(df.ix["i"])
# display(df.ix[1])

Output (each of the calls above returns the same row):
a    0.598843
b    0.603805
c    0.105148
d    0.381940
e    0.036476
Name: i, dtype: float64
Summary
1) To select one or more whole rows, or one or more whole columns, df[], df.loc[], and df.iloc[] can all be used; df[] is the shortest to write.
2) For region selection, use df.loc[] (or df.ix[]) when only label indexes are available, and df.iloc[] (or df.ix[]) when only integer indexes are available. df.ix[] is not recommended, because it guesses between labels and positions while df.loc[] and df.iloc[] are each unambiguous; it was deprecated in pandas 0.20 and removed in pandas 1.0.
3) To select a single cell, any of df.at[], df.iat[], df.loc[], or df.iloc[] can be used. at and loc take labels; iat and iloc take integer positions.
4) When selecting data, the type of the return value depends on what is selected:
A Series is returned for a single row with multiple columns or multiple rows with a single column; a DataFrame is returned for multiple rows and multiple columns; for a single cell (single row, single column) the return value is a scalar of a basic data type, such as str or int.
5) df[] can only select whole rows or columns, not a single cell, so its return value is always a DataFrame or Series object.
6) When the default DataFrame index (an integer index) is used, the integer positions are also the labels.
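Points 3) and 4) can be illustrated with a short sketch. The frame below is made up for the example; only loc, at, and iat are used:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(9).reshape(3, 3),
                  index=list("hij"), columns=list("abc"))

row = df.loc["h"]                    # one row, all columns -> Series
block = df.loc["h":"i", ["a", "b"]]  # several rows and columns -> DataFrame
cell_label = df.at["i", "b"]         # single cell by label -> scalar
cell_pos = df.iat[1, 1]              # the same cell by integer position
cell_loc = df.loc["i", "b"]          # loc also returns a scalar for one cell

print(type(row).__name__, type(block).__name__, cell_label, cell_pos, cell_loc)
```

at and iat accept only a single row and a single column, which is why they are the usual choice when exactly one value is wanted.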
Adding, deleting, and modifying DataFrame rows
df = pd.DataFrame(np.random.rand(5, 5), columns=list("abcde"), index=list("hijkl"))
display(df)
# Add a row
line = pd.Series([23, 33, 12., 334.22, 200], index=list("abcde"), name="p")
# Before pandas 2.0 this was df = df.append(line); append has since been removed,
# and pd.concat is used instead.
df = pd.concat([df, line.to_frame().T])
display(df)
# Delete rows
df1 = df.drop(["h", "j"])
display(df1)
# Modify a row
df.loc["k"] = pd.Series([1, 1, 1, 1, 1], index=list("abcde"))
display(df)
DataFrame structure
Every row and every column of a DataFrame is a Series object. For a row, the name attribute of the Series is the row label, and the index of the Series is the DataFrame's column index. For a column, the name attribute of the Series is the column label, and the index of the Series is the DataFrame's row index.
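A quick way to see this structure is to pull out one row and one column and inspect their name and index. The labels here are made up for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6).reshape(2, 3),
                  index=["r1", "r2"], columns=["a", "b", "c"])

row = df.loc["r1"]  # a row as a Series
col = df["b"]       # a column as a Series

print(row.name, list(row.index))  # r1 ['a', 'b', 'c']
print(col.name, list(col.index))  # b ['r1', 'r2']
```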
DataFrame arithmetic
Each row or column of a DataFrame is a Series object, so a DataFrame can be viewed, approximately, as a collection of Series arranged as rows or columns. Many of the operations that a Series supports also apply to a DataFrame.
- Transpose: df.T
- Arithmetic between DataFrames is aligned on both the row and column indexes. A null value (NaN) is produced wherever the indexes do not match. To avoid null values, use the arithmetic methods (add, sub, mul, and so on) instead of the operators, and pass the fill_value argument to specify the fill value.
- Arithmetic between a DataFrame and a Series: by default the Series index is matched against the DataFrame's column index and the operation is broadcast across the rows. The axis parameter of the DataFrame arithmetic methods selects whether to match on the row index or the column index.
df1 = pd.DataFrame(np.arange(9).reshape(3, 3))
df2 = pd.DataFrame(np.arange(9, 18).reshape(3, 3))
display(df1, df2)
display(df1 + df2)
display(df1 * df2)
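The two frames above share the same indexes, so no alignment effects are visible. The following sketch uses made-up frames whose row indexes only partly overlap, to show NaN from misalignment, fill_value, and matching a Series by row or by column:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.ones((2, 2)), index=[0, 1], columns=["a", "b"])
df2 = pd.DataFrame(np.ones((2, 2)), index=[1, 2], columns=["a", "b"])

aligned = df1 + df2                  # rows 0 and 2 exist on one side only -> NaN
filled = df1.add(df2, fill_value=0)  # the missing side is treated as 0

s_cols = pd.Series([10, 20], index=["a", "b"])
s_rows = pd.Series([100, 200], index=[0, 1])

by_columns = df1 + s_cols                # Series index matched against the columns
by_rows = df1.add(s_rows, axis="index")  # Series index matched against the row index

print(aligned, filled, by_columns, by_rows, sep="\n\n")
```

fill_value only helps where one side has a value; a position missing from both frames still comes out as NaN.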
Sorting by index
Series and DataFrame objects can be sorted by index with the sort_index method. For a DataFrame, the axis parameter selects which index to sort (row or column). The ascending parameter selects ascending or descending order.
df = pd.DataFrame(np.random.random((3, 5)), index=[3, 1, 2], columns=[1, 3, 5, 2, 4])
display(df)
# Sort by row index, in descending order
display(df.sort_index(axis=0, ascending=False))
# Sort by column index, in ascending order
display(df.sort_index(axis=1, ascending=True))
df = pd.DataFrame(np.arange(9).reshape(3, 3), index=[3, 1, 2], columns=[6, 4, 5])
display(df)
# By default, the row index is sorted in ascending order.
df1 = df.sort_index()
display(df1)
# Sort by column index
df2 = df.sort_index(axis=1)
display(df2)
# Modify in place; nothing is returned.
df.sort_index(inplace=True)
display(df)
# Sorting is ascending by default. Descending order can be requested.
df3 = df.sort_index(ascending=False, axis=1)
display(df3)
Sorting by value
Series and DataFrame objects can be sorted by value with the sort_values method.
# The index values below are assumed; the original listing was garbled at this point.
df = pd.DataFrame([[1, 3, 2], [5, 2, 4], [2, 4, 3]], index=[1, 3, 2], columns=list("cab"))
display(df)
# Sort the rows by the values in column c, in descending order
df1 = df.sort_values("c", ascending=False)
display(df1)
# Sort the columns by the values in row 1, in descending order
df2 = df.sort_values(1, axis=1, ascending=False)
display(df2)
# Sort by value
df = pd.DataFrame([[1, 3, 300], [66, 5, 100], [1, 3, 400]])
display(df)
# Sort the rows by the values in column 2
df1 = df.sort_values(2)
display(df1)
# Sort the columns by the values in row 1, in descending order
df2 = df.sort_values(1, axis=1, ascending=False)
display(df2)
The Index object
The index of a Series or DataFrame, and the columns of a DataFrame, are Index objects.
- Index objects can be indexed like arrays.
- Index objects are immutable: their elements cannot be reassigned.
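Both properties can be seen in a short sketch. The frame and labels are made up; the mutated flag is a hypothetical helper variable used only to record whether the assignment succeeded:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["x", "y"])

idx = df.index
first = idx[0]     # positional access, like an array
sliced = idx[0:1]  # slicing returns another Index

try:
    idx[0] = "z"   # element assignment is not supported
    mutated = True
except TypeError:
    mutated = False

df.index = ["p", "q"]  # replacing the whole Index is allowed

print(first, list(sliced), mutated, list(df.index))
```

Immutability applies to the elements of an Index, not to the DataFrame attribute that holds it, which is why assigning a new list to df.index works.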
DataFrame statistical methods
- mean / sum / count
- max / min
- cumsum / cumprod
- argmax / argmin (available on Series; for a DataFrame use idxmax / idxmin)
- idxmax / idxmin
- var / std
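A few of these methods applied to a small made-up frame, as a sketch of how the axis argument changes the direction of the computation:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [6, 5, 4]}, index=["x", "y", "z"])

col_means = df.mean()      # per column (axis=0 is the default)
row_sums = df.sum(axis=1)  # per row
running = df.cumsum()      # cumulative sum down each column
max_labels = df.idxmax()   # row label of the maximum in each column
spread = df.std()          # sample standard deviation (ddof=1 by default)

print(col_means, row_sums, running, max_labels, spread, sep="\n\n")
```

By default these methods skip NaN values, which matters once the frame contains the nulls produced by misaligned arithmetic.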