preface
There are four main methods we use when using DataFrame to merge operations:
- Join
- Merge
- Concat
- Append
Less familiar friends may see so many methods confused, and thus do not know how to start. In this article, we will elaborate on the similarities and differences between these methods with specific examples, so that you will no longer be unfamiliar with these four methods, and take a solid step towards data analysis.
The connection method
When it comes to merger, there is no escaping the concept of the way of merger. The joins in Pandas are similar to the left, right, inner, and outer joins in database queries.
Before we start learning about these merge functions, let’s learn about join methods.
An example of a left and right connection is as follows:
A concrete example of the inner and outer connection is shown below:
As the legend shows, we can clearly see the differences between these four types of connection. The left join is based on a column in the left table (name column in the figure). When the two tables are joined in this way, only the names in the right table and the names in the left table match (Mike, Nana) will be merged into the left table. The right connection mode is the opposite of the left. It is based on a column in the right table. When the name in the left table matches the name in the right table, it will be merged into the right table. An inner is often called an inner join, and the final result is the same column in both tables. Outer is the outer join, the result is that the data of the two tables are all fused together, you have nothing to keep, I have nothing to keep, you and I have the common data is merged into a row.
A Join.
A Join is a merge operation based on a row index. If you want to merge two dataframes based on the index of two DFS, then Join is the best choice. Let’s look at how to use the Join method:
df.join(other, on=None, how="left", lsuffix="", rsuffix="", sort=False)
Copy the code
Parameters that
-
Other can pass in a DataFrame, Series, or a list of Dataframes. The row index of other must match one of the columns in df (an exception is raised if two columns do not match.) Before performing the connection operation. If Seires is passed in, the Name attribute must be passed in.
-
On defaults to None, which uses the row index columns of DF and other to join. In this case, the index of OTHER must correspond to the column name in the passed DF.
-
How Connection mode can be left, right, inner, or outer. The default value is left.
-
Lsuffix represents the overlapping column suffix of DF. The default is an empty string. If df and other have the same column, it must be passed.
-
Rsuffix stands for the overlapping column suffix of other. The default is an empty string. If df and other have the same column, it must be passed.
-
Sort sorts the result Dataframes in lexicographical order by join keywords. If False, the order of the join keys depends on the join type (the how keyword). The default is False
Use the sample
- Simple merge (using default left connection)
df1 = pd.DataFrame({'column1': [1.2].'column2': [3.4]}, index=list('BA'))
df2 = pd.DataFrame({'column1': [5.6].'column2': [7.8]}, index=list('BC'))
""" df1: df2: column1 column2 | column1 column2 B 1 3 | B 2 7 A 2 4 | C 5 8 """
df = df1.join(df2, lsuffix='_L', rsuffix='_R') # merge from left
Output:
column1_L column2_L column1_R column2_R
B 1 3 2.0 7.0
A 2 4 NaN NaN
Copy the code
In the above example, because the two DFS have the same column, lsuffix and rsuffix parameters must be passed to change the suffix of overlapping column names (you can also pass only one overlapping suffix and keep the original column names unchanged in the other one).
- Merge with a column field:
Sort by column1 in df1
df = df1.join(df2.set_index(df2['column1']), on='column1', lsuffix='_L', rsuffix='_R', how='left')
Output:
column1_L column2_L column1_R column2_R
B 1 3 NaN NaN
A 2 4 2.0 7.0
Copy the code
The on field only specifies the merge field in DF1 and cannot be applied to DF2 (DF2 still uses index as benchmark for merge). Therefore, in order to maintain field consistency, we use set_index method to replace index column in DF2 with colum1 column
2. The Merge
Merge is a key-based Merge of two DFS. For example, if DF_A contains index, SID,name, and math, and DF_B contains index, SID, and Chinese, and DF_A and DF_B are merged based on sid, you can use Merge in this case. Join can also work by assigning the SID columns of two DFS to index columns, but Merge applies to key-by-key merging. . Here is how to use Merge:
pd.merge(left, right, how="inner", on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=True, suffixes=("_x"."_y"), copy=True, indicator=False, validate=None)
Copy the code
Parameters that
-
Left The left DataFrame object for the merge
-
Right Is the right DataFrame object for the merge
-
How Connection mode can be left, right, inner, or outer. The default is inner.
-
On specifies which key (column name) to base the merge on. The key passed in must exist in both DFS, otherwise an error is reported. If no value is passed in and both left_index and right_index are False, the same columns in both DFS are merged as a baseline.
-
Left_on The base column of left DF. This parameter is usually used because there are no identical columns in the two DFS, so you need to specify the base columns to merge on the left and right sides.
-
Right_on Reference column of DF on the right.
-
Left_index Whether to use the left row index column as the benchmark. Equivalent to passing the row index column as the base column in left_ON
-
Right_index Whether to use the row index column on the left as the benchmark. Equivalent to passing the row index column as the base column in right_ON
-
Suffixes Sets the suffixes for overlapping column names. Type string class tuple. The default suffix is _x to the left of overlap and _y to the right of overlap
-
Sort sorts the result DataFrame by the connect key. The default is True, and in many cases, setting it to False will greatly improve performance.
-
Validate The default value is none. If specified, the merge is checked to see if it is of the specified type.
- One_to_one or 1:1: check whether the merge key is unique in the left and right data sets
- One_to_many or 1:m: Checks whether the merge key is unique in the left dataset
- “Many_to_one” or “1:1” : checks whether the merge key is unique in the right dataset
- “Many_to_many” or “m:m” : allows but does not result in a check.
-
Indicator adds a column named _merge to the output DF containing information about the source of each row. For indicators that only appear in the left DF, the value is left_ONLY, and for indicators that only appear in the right DF, the value is right_only. There are both positions of two DF indicators
Use the sample
- Use Merge for simple Merge operations:
# merge from left
df = pd.merge(df1, df2, on='column1', how='left')
Output:
column1 column2_x column2_y
0 1 3 NaN
1 2 4 7.0
Select * from outer; select * from outer; select * from outer
df = pd.merge(df1, df2, on='column1', how='outer', indicator=True)
Output:
column1 column2_x column2_y _merge
0 1 3.0 NaN left_only
1 2 4.0 7.0 both
2 5 NaN 8.0 right_only
Copy the code
- If two columns have different merged field values:
Select * from outer; select * from column1; select * from column2; select * from column2
df = pd.merge(df1, df2, left_on='column1', right_on='column2', how='outer', indicator=True)
Output:
column1_x column2_x column1_y column2_y _merge
0 1.0 3.0 NaN NaN left_only
1 2.0 4.0 NaN NaN left_only
2 NaN NaN 2.0 7.0 right_only
3 NaN NaN 5.0 8.0 right_only
Copy the code
3. Concat
If you want to concatenate multiple (two or more) DF objects, and you need to merge them along axes (horizontal or vertical), you can try using the concat method.
pd.concat(objs, axis=0, join="outer", ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort: bool = False, copy=True)
Copy the code
-
Sequence or mapping of objS Series or DataFrame objects. If the dict is passed, it will be used as the keys argument unless a sorted key is passed, in which case the value will be selected (see below). Unless all objects are None, all None objects are silently deleted, in which case ValueError is raised.
-
Axis selects which axis to merge on, 0: index axis (vertical merge, similar to the append method mentioned below), 1: Columns axis (horizontal merge). The default is 0
-
Join Connection mode, which can be inner or outer. The default is outter. Unlike Join and Merge, Concat only supports inner Join and outer Join methods.
-
Ignore_index Specifies a Boolean value. Default is False. Ignores the index value of the merge axis. When axis 0 is merged, the index value of the merged row will be adjusted to 0.1.2… N the assignment. Columns will be assigned to range(0,df.shape[0]) after merging on axis 1.
-
Keys are used to build a hierarchical index (stratified by source of data) on the connection axis.
-
Verify_integrity A Boolean value that defaults to False (duplicates are allowed). Check for duplication of the result object on the selected axis.
-
Levels are used as indexes for levels within a hierarchy or index. Generally used as a multi-level index, the actual use is not commonly used. For those interested, see Pandas’ multi-level index
-
Names is used to name the generated multi-level index. The length of the name list should match the number of levels in the generated multi-level index.
-
Sort does not sort by default. If the unconnected axis is not aligned when join is outer, it is sorted. When join = inner and order is preserved, sorting will not work.
Usage examples
- Simple merge (horizontal and vertical)
df = pd.concat([df1, df2]) # vertical merge (Y axis, Y axis)
Output:
column1 column2
B 1 3
A 2 4
B 2 7
C 5 8
df = pd.concat([df1, df2], axis=1) # horizontal merge (along the X axis)
Output:
column1 column2 column1 column2
B 1.0 3.0 2.0 7.0
A 2.0 4.0 NaN NaN
C NaN NaN 5.0 8.0
df = pd.concat([df1, df2], axis=1, join='inner') # horizontal merge to display columns with the same index
Output:
column1 column2 column1 column2
B 1 3 2 7
df = pd.concat([df1, df2], ignore_index=True) The axis (index column) is reassigned after the merge
Output:
column1 column2
0 1 3
1 2 4
2 2 7
3 5 8
Copy the code
Four Append.
Of the four merge methods, the simplest to use is append, but because it is easy to use, it also supports fewer operations than the previous three big, common operation, we will first look at the parameter description:
df.append(other, ignore_index=False, verify_integrity=False, sort=False)
Copy the code
-
Other can pass in a DataFrame, Series, or a list of Dataframes (Series) (if multiple DFS (Series) are combined into DF).
-
Ignore_index Specifies a Boolean value. Default is False. The combined index value is ignored. Range (0,df.shape[0])
-
Verify_integrity A Boolean value that defaults to False (duplicates are allowed). Check for duplicates in the index columns of the merged result object.
-
Sort does not sort by default. If the unconnected axis is not aligned when join is outer, it is sorted. When join = inner and order is preserved, sorting will not work.
Usage examples
- Simple merge (horizontal and vertical)
df = df1.append(df2) # merger
Output:
column1 column2
B 1 3
A 2 4
B 2 7
C 5 8
df = df1.append([df2, df2], ignore_index=True) # merge df2 twice, reassign index
Output:
column1 column2
0 1 3
1 2 4
2 2 7
3 5 8
4 2 7
5 5 8
Copy the code
Simple summary
Join: Combines two or more DF objects based on indexes.
Merge: Merge operations based on keys (indexes or other column names). Only two DF objects can be merged.
Concat: Sort based on axes (horizontal or vertical), supporting simultaneous merging of two or more DF objects.
Append: Similar to concat, but only supports vertical extension. Two or more DF objects can be merged at the same time.
Because there are too many parameters in the four methods, the examples in this section only give the basic usage situation, as for more combinations of usages, please learn to use in your own IDE, this section of learning will come to an end, thank you for your support, you can leave a message in the comment section if you have any questions, thanks!