A good way to sort data in Python

Learning the Pandas sort methods is a great way to get started with, or practice, basic data analysis in Python. Data analysis is most commonly done with spreadsheets, SQL, or Pandas. One of the advantages of using Pandas is that it can handle large amounts of data and offers high-performance data manipulation capabilities.

This article comes from Pandas Sort: Your Guide to Sorting Python Data, by Chuan Yu.

In this tutorial, you will learn how to use .sort_values() and .sort_index(), which will enable you to sort the data in a DataFrame efficiently.

By the end of this tutorial, you will know how to:

Sort a Pandas DataFrame by the values of one or more columns
Use the ascending parameter to change the sort order
Sort a DataFrame by its index using .sort_index()
Organize missing data when sorting values
Sort a DataFrame in place using inplace set to True

To follow this tutorial, you will need a basic understanding of Pandas DataFrames and some understanding of reading data from files.

Getting to know the Pandas sort methods

As a quick reminder, a DataFrame is a data structure with labeled axes for both rows and columns. You can sort a DataFrame by row or column values as well as by row or column index.

Both rows and columns have indexes, which are numerical representations of the location of the data in the DataFrame. You can use the index location of a DataFrame to retrieve data from a specific row or column. By default, the index number starts from zero. You can also manually assign your own indexes.

Preparing the data set

In this tutorial, you will use fuel economy data compiled by the U.S. Environmental Protection Agency (EPA) for vehicles built between 1984 and 2021. The EPA fuel economy dataset is great because it contains many different types of information that you can sort on, from text to numeric data types. The dataset contains a total of 83 columns.

To continue, you need to install the Pandas Python library. The code in this tutorial is executed using Pandas 1.2.0 and Python 3.9.1.

Note: The entire fuel economy data set is approximately 18 MB. It may take a minute or two to read the entire data set into memory. Limiting the number of rows and columns helps improve performance, but it can still take a few seconds to download the data.

For analysis purposes, you will view MPG (miles per gallon) data for the vehicle by make, model, year, and other vehicle attributes. You can specify the columns to read into the DataFrame. For this tutorial, you only need a subset of the available columns.

Here is the command to read the relevant columns of the fuel economy data set into the DataFrame and display the first five rows:

>>> import pandas as pd

>>> column_subset = [
...     "id",
...     "make",
...     "model",
...     "year",
...     "cylinders",
...     "fuelType",
...     "trany",
...     "mpgData",
...     "city08",
...     "highway08"
... ]

>>> df = pd.read_csv(
...     "https://www.fueleconomy.gov/feg/epadata/vehicles.csv",
...     usecols=column_subset,
...     nrows=100
... )

>>> df.head()
   city08  cylinders fuelType  ...  mpgData            trany  year
0      19          4  Regular  ...        Y     Manual 5-spd  1985
1       9         12  Regular  ...        N     Manual 5-spd  1985
2      23          4  Regular  ...        Y     Manual 5-spd  1985
3      10          8  Regular  ...        N  Automatic 3-spd  1985
4      17          4  Premium  ...        N     Manual 5-spd  1993
[5 rows x 10 columns]

By calling .read_csv() with the dataset URL, you can load the data into a DataFrame. Narrowing down the columns results in faster load times and lower memory use. To further limit memory consumption and to get a quick feel for the data, you can use nrows to specify how many rows to load.
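The effect of usecols and nrows can be checked without downloading anything. The sketch below reads a small in-memory CSV instead of the EPA file; the CSV text and the name small_df are made up for illustration:

```python
import io

import pandas as pd

# Hypothetical stand-in for the real CSV file: four columns, three rows.
csv_text = """id,make,model,city08
1,Alfa Romeo,Spider Veloce 2000,19
2,Ferrari,Testarossa,9
3,Dodge,Charger,23
"""

# usecols keeps only the named columns; nrows stops after two data rows.
small_df = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["make", "city08"],
    nrows=2,
)

print(small_df.shape)          # (2, 2)
print(list(small_df.columns))  # ['make', 'city08']
```

The columns come back in file order, not in the order given to usecols.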

Getting familiar with .sort_values()

You use .sort_values() to sort the values in a DataFrame along either axis (columns or rows). Typically, you want to sort the rows of a DataFrame by the values of one or more columns:

The figure above shows the result of using .sort_values() to sort the rows of a DataFrame based on the values in the highway08 column. This is similar to how you would sort data in a spreadsheet using a column.

Getting familiar with .sort_index()

You use .sort_index() to sort a DataFrame by its row index or column labels. The difference from .sort_values() is that you sort the DataFrame based on its row index or column names, not on the values in those rows or columns:

The DataFrame row index is highlighted in blue in the figure above. Indexes are not considered a column, and you usually have only one row index. A row index can be thought of as a row number starting from zero.

Sort the DataFrame on a single column

To sort the DataFrame by the values in a single column, you use .sort_values(). By default, this returns a new DataFrame sorted in ascending order. It does not modify the original DataFrame.
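A minimal sketch of that default behavior, using a three-row toy DataFrame (hypothetical data, not the EPA dataset):

```python
import pandas as pd

# Hypothetical toy data for illustration only.
toy = pd.DataFrame(
    {"make": ["Dodge", "Ferrari", "Audi"], "city08": [23, 9, 17]}
)

# sort_values() returns a new, sorted DataFrame.
result = toy.sort_values("city08")

print(result["city08"].tolist())  # [9, 17, 23]
print(toy["city08"].tolist())     # [23, 9, 17], original is unchanged
```

Each row keeps its original index label, which is why the index of the result is no longer in order.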

Sort by column in ascending order

To use .sort_values(), you pass a single argument to the method containing the name of the column you want to sort by. In this example, you sort the DataFrame by the city08 column, which represents city MPG for fuel-only cars:

>>> df.sort_values("city08")
    city08  cylinders fuelType  ...  mpgData            trany  year
99       9          8  Premium  ...        N  Automatic 4-spd  1993
1        9         12  Regular  ...        N     Manual 5-spd  1985
80       9          8  Regular  ...        N  Automatic 3-spd  1985
47       9          8  Regular  ...        N  Automatic 3-spd  1985
3       10          8  Regular  ...        N  Automatic 3-spd  1985
..     ...        ...      ...  ...      ...              ...   ...
9       23          4  Regular  ...        Y  Automatic 4-spd  1993
8       23          4  Regular  ...        Y     Manual 5-spd  1993
7       23          4  Regular  ...        Y  Automatic 3-spd  1993
76      23          4  Regular  ...        Y     Manual 5-spd  1993
2       23          4  Regular  ...        Y     Manual 5-spd  1985
[100 rows x 10 columns]

This sorts your DataFrame using the values in the city08 column, showing the vehicles with the lowest MPG first. By default, .sort_values() sorts your data in ascending order. Although you didn’t specify a name for the argument you passed to .sort_values(), you actually used the by parameter, which you’ll see in the next example.

Changing sort order

Another parameter of .sort_values() is ascending. By default, .sort_values() has ascending set to True. If you want the DataFrame sorted in descending order, you can pass False to this parameter:

>>> df.sort_values(
...     by="city08",
...     ascending=False
... )
    city08  cylinders fuelType  ...  mpgData            trany  year
9       23          4  Regular  ...        Y  Automatic 4-spd  1993
2       23          4  Regular  ...        Y     Manual 5-spd  1985
7       23          4  Regular  ...        Y  Automatic 3-spd  1993
8       23          4  Regular  ...        Y     Manual 5-spd  1993
76      23          4  Regular  ...        Y     Manual 5-spd  1993
..     ...        ...      ...  ...      ...              ...   ...
58      10          8  Regular  ...        N  Automatic 3-spd  1985
80       9          8  Regular  ...        N  Automatic 3-spd  1985
1        9         12  Regular  ...        N     Manual 5-spd  1985
47       9          8  Regular  ...        N  Automatic 3-spd  1985
99       9          8  Premium  ...        N  Automatic 4-spd  1993
[100 rows x 10 columns]

By passing False to ascending, you reverse the sort order. Your DataFrame is now sorted in descending order by the average MPG measured in city conditions. The vehicles with the highest MPG are in the first rows.
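The same switch can be tried on a toy DataFrame. This is only a sketch with made-up values, chosen to be unique so the result does not depend on how ties are ordered:

```python
import pandas as pd

# Hypothetical toy data with no repeated city08 values.
toy = pd.DataFrame(
    {"make": ["Dodge", "Ferrari", "Audi"], "city08": [23, 9, 17]}
)

lowest_first = toy.sort_values(by="city08")
highest_first = toy.sort_values(by="city08", ascending=False)

print(lowest_first["make"].tolist())   # ['Ferrari', 'Audi', 'Dodge']
print(highest_first["make"].tolist())  # ['Dodge', 'Audi', 'Ferrari']
```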

Choosing a sorting algorithm

It is worth noting that Pandas allows you to choose different sorting algorithms to use with both .sort_values() and .sort_index(). The available algorithms are quicksort, mergesort, and heapsort. For more information about these different sorting algorithms, see Sorting Algorithms in Python.

The algorithm used by default when sorting on a single column is quicksort. To change this to a stable sorting algorithm, use mergesort. You can do this with the kind parameter in either .sort_values() or .sort_index(), as follows:

>>> df.sort_values(
...     by="city08",
...     ascending=False,
...     kind="mergesort"
... )
    city08  cylinders fuelType  ...  mpgData            trany  year
2       23          4  Regular  ...        Y     Manual 5-spd  1985
7       23          4  Regular  ...        Y  Automatic 3-spd  1993
8       23          4  Regular  ...        Y     Manual 5-spd  1993
9       23          4  Regular  ...        Y  Automatic 4-spd  1993
10      23          4  Regular  ...        Y     Manual 5-spd  1993
..     ...        ...      ...  ...      ...              ...   ...
69      10          8  Regular  ...        N  Automatic 3-spd  1985
1        9         12  Regular  ...        N     Manual 5-spd  1985
47       9          8  Regular  ...        N  Automatic 3-spd  1985
80       9          8  Regular  ...        N  Automatic 3-spd  1985
99       9          8  Premium  ...        N  Automatic 4-spd  1993
[100 rows x 10 columns]

Using kind, you set the sorting algorithm to mergesort. The previous output used the default quicksort algorithm. Looking at the index, you can see that the rows are in a different order. This is because quicksort is not a stable sorting algorithm, but mergesort is.

Note: In Pandas, kind is ignored when you sort on more than one column or label.

When you sort multiple records with the same key, a stable sorting algorithm preserves the original order of those records after sorting. Therefore, if you plan to perform more than one sort, you must use a stable sort algorithm.
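Stability is easiest to see on data with many ties. The following sketch uses a made-up five-row DataFrame and checks only the mergesort result, since the order of ties under quicksort is not guaranteed:

```python
import pandas as pd

# Hypothetical data: only two distinct city08 values, so ties dominate.
toy = pd.DataFrame(
    {"model": ["a", "b", "c", "d", "e"], "city08": [9, 23, 9, 23, 9]}
)

# mergesort is stable: rows with equal keys keep their original order.
stable = toy.sort_values("city08", kind="mergesort")

print(stable.index.tolist())     # [0, 2, 4, 1, 3]
print(stable["model"].tolist())  # ['a', 'c', 'e', 'b', 'd']
```

Within each group of equal city08 values, the index labels still increase, which is what a stable sort promises.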

Sort the DataFrame on multiple columns

In data analysis, it is often desirable to sort data according to the values of multiple columns. Imagine that you have a data set that contains people’s first and last names. It makes sense to sort first by last name and then by first name, so that people with the same last name are arranged alphabetically according to their first name.

In the first example, you sorted the DataFrame on a single column named city08. From an analytical point of view, MPG in city conditions is an important factor in determining how desirable a car is. In addition to MPG in city conditions, you may also want to look at MPG in highway conditions. To sort by two keys, you can pass a list of column names to by:

>>> df.sort_values(
...     by=["city08", "highway08"]
... )[["city08", "highway08"]]
    city08  highway08
80       9         10
47       9         11
99       9         13
1        9         14
58      10         11
..     ...        ...
9       23         30
10      23         30
8       23         31
76      23         31
2       23         33
[100 rows x 2 columns]

By specifying a list of the column names city08 and highway08, you sort the DataFrame on two columns using .sort_values(). The next example explains how to specify the sort order and why it’s important to pay attention to the list of column names you use.

Sort by multiple columns in ascending order

To sort the DataFrame on multiple columns, you must provide a list of column names. For example, to sort by make and model, you create the following list and pass it to .sort_values():

>>> df.sort_values(
...     by=["make", "model"]
... )[["make", "model"]]
          make               model
0   Alfa Romeo  Spider Veloce 2000
18        Audi                 100
19        Audi                 100
20         BMW                740i
21         BMW               740il
..         ...                 ...
12  Volkswagen      Golf III / GTI
13  Volkswagen           Jetta III
15  Volkswagen           Jetta III
16       Volvo                 240
17       Volvo                 240
[100 rows x 2 columns]

Now your DataFrame is sorted in ascending order by make. If two or more rows have the same make, they are sorted by model. The order in which the column names are specified in your list corresponds to how your DataFrame is sorted.

Change the column sorting order

Because you are sorting on multiple columns, you can specify the order in which the columns are sorted. If you want to change the logical sort order from the previous example, you can change the order of the column names in the list you pass to the by parameter:

>>> df.sort_values(
...     by=["model", "make"]
... )[["make", "model"]]
             make        model
18           Audi          100
19           Audi          100
16          Volvo          240
17          Volvo          240
75          Mazda          626
..            ...          ...
62           Ford  Thunderbird
63           Ford  Thunderbird
88     Oldsmobile     Toronado
42  CX Automotive        XM v6
43  CX Automotive       XM v6a
[100 rows x 2 columns]

Your DataFrame is now sorted by the model column in ascending order, and then by make where two or more rows have the same model. You can see that changing the order of the columns also changes the order in which the values are sorted.
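To see the effect of the column order in isolation, here is a small sketch on three made-up rows. The model values are strings, so they compare character by character:

```python
import pandas as pd

# Hypothetical rows; "A4" is an invented model name for the example.
toy = pd.DataFrame(
    {"make": ["Volvo", "Audi", "Audi"], "model": ["240", "A4", "100"]}
)

by_make_first = toy.sort_values(by=["make", "model"])
by_model_first = toy.sort_values(by=["model", "make"])

print(by_make_first.index.tolist())   # [2, 1, 0]
print(by_model_first.index.tolist())  # [2, 0, 1]
```

The first key decides the overall order, and later keys only break ties within it.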

Sort by multiple columns in descending order

So far, you’ve only sorted on multiple columns in ascending order. In the next example, you will sort in descending order based on the make and model columns. To sort in descending order, set ascending to False:

>>> df.sort_values(
...     by=["make", "model"],
...     ascending=False
... )[["make", "model"]]
          make               model
16       Volvo                 240
17       Volvo                 240
13  Volkswagen           Jetta III
15  Volkswagen           Jetta III
11  Volkswagen      Golf III / GTI
..         ...                 ...
21         BMW               740il
20         BMW                740i
18        Audi                 100
19        Audi                 100
0   Alfa Romeo  Spider Veloce 2000
[100 rows x 2 columns]

The values in the make column are in reverse alphabetical order, and the values in the model column are in descending order for any cars with the same make. For text data, sorting is case sensitive, which means that uppercase text appears first in ascending order and last in descending order.
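The case-sensitive behavior can be illustrated with a few made-up strings. The second call is an optional extra that is not covered in this tutorial: it assumes pandas 1.1 or later, where .sort_values() accepts a key function that is applied to the column before comparing:

```python
import pandas as pd

# Hypothetical makes with mixed capitalization.
toy = pd.DataFrame({"make": ["audi", "Volvo", "BMW"]})

# Default sort compares code points, so uppercase letters come first.
case_sensitive = toy.sort_values("make")["make"].tolist()

# Assumption: pandas >= 1.1, which provides the key parameter.
case_insensitive = toy.sort_values(
    "make", key=lambda col: col.str.lower()
)["make"].tolist()

print(case_sensitive)    # ['BMW', 'Volvo', 'audi']
print(case_insensitive)  # ['audi', 'BMW', 'Volvo']
```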

Sort by multiple columns with different sort orders

You may be wondering whether you can sort on multiple columns and have those columns use different ascending arguments. With Pandas, you can do this in a single method call. If you want to sort some columns in ascending order and others in descending order, you can pass a list of Booleans to ascending.

In this example, you sort the DataFrame by the make, model, and city08 columns, with the first two columns sorted in ascending order and city08 sorted in descending order. To do so, you pass a list of column names to by and a list of Booleans to ascending:

>>> df.sort_values(
...     by=["make", "model", "city08"],
...     ascending=[True, True, False]
... )[["make", "model", "city08"]]
          make               model  city08
0   Alfa Romeo  Spider Veloce 2000      19
18        Audi                 100      17
19        Audi                 100      17
20         BMW                740i      14
21         BMW               740il      14
..         ...                 ...     ...
11  Volkswagen      Golf III / GTI      18
15  Volkswagen           Jetta III      20
13  Volkswagen           Jetta III      18
17       Volvo                 240      19
16       Volvo                 240      18
[100 rows x 3 columns]

Now your DataFrame is sorted by make and model in ascending order, but with the city08 column in descending order. This is helpful because it groups the cars in a categorical order and shows the cars with the highest MPG first.
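A compact version of the same idea, sketched on four made-up rows with one ascending key and one descending key:

```python
import pandas as pd

# Hypothetical toy data: two makes, two MPG values each.
toy = pd.DataFrame(
    {
        "make": ["Audi", "Audi", "BMW", "BMW"],
        "city08": [17, 20, 14, 16],
    }
)

# One Boolean per column in by: make ascending, city08 descending.
mixed = toy.sort_values(by=["make", "city08"], ascending=[True, False])

print(mixed["city08"].tolist())  # [20, 17, 16, 14]
print(mixed.index.tolist())      # [1, 0, 3, 2]
```

The list passed to ascending needs one entry for each column named in by.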

Sort the DataFrame by index

Before sorting on the index, it is a good idea to know what an index represents. A DataFrame has an .index attribute, which by default is a numerical representation of its row positions. You can think of the index as the row numbers. It helps with quick row lookup and identification.

Sort by index in ascending order

You can sort a DataFrame based on its row index with .sort_index(). Sorting by column values as you did in the previous examples reorders the rows in your DataFrame, so the index becomes disorganized. This can also happen when you filter a DataFrame or when you drop or add rows.

To illustrate the use of .sort_index(), start by creating a new sorted DataFrame using .sort_values():

>>> sorted_df = df.sort_values(by=["make", "model"])
>>> sorted_df
    city08  cylinders fuelType  ...  mpgData            trany  year
0       19          4  Regular  ...        Y     Manual 5-spd  1985
18      17          6  Premium  ...        Y  Automatic 4-spd  1993
19      17          6  Premium  ...        N     Manual 5-spd  1993
20      14          8  Premium  ...        N  Automatic 5-spd  1993
21      14          8  Premium  ...        N  Automatic 5-spd  1993
..     ...        ...      ...  ...      ...              ...   ...
12      21          4  Regular  ...        Y     Manual 5-spd  1993
13      18          4  Regular  ...        N  Automatic 4-spd  1993
15      20          4  Regular  ...        N     Manual 5-spd  1993
16      18          4  Regular  ...        Y  Automatic 4-spd  1993
17      19          4  Regular  ...        Y     Manual 5-spd  1993
[100 rows x 10 columns]

You have created a DataFrame that is sorted on multiple values. Notice how the row index is in no particular order. To get your new DataFrame back to its original order, you can use .sort_index():

>>> sorted_df.sort_index()
    city08  cylinders fuelType  ...  mpgData            trany  year
0       19          4  Regular  ...        Y     Manual 5-spd  1985
1        9         12  Regular  ...        N     Manual 5-spd  1985
2       23          4  Regular  ...        Y     Manual 5-spd  1985
3       10          8  Regular  ...        N  Automatic 3-spd  1985
4       17          4  Premium  ...        N     Manual 5-spd  1993
..     ...        ...      ...  ...      ...              ...   ...
95      17          6  Regular  ...        Y  Automatic 3-spd  1993
96      17          6  Regular  ...        N  Automatic 4-spd  1993
97      15          6  Regular  ...        N  Automatic 4-spd  1993
98      15          6  Regular  ...        N     Manual 5-spd  1993
99       9          8  Premium  ...        N  Automatic 4-spd  1993
[100 rows x 10 columns]

The index is now sorted in ascending order. As in .sort_values(), the default value of ascending in .sort_index() is True, and you can switch to descending order by passing False. Sorting on the index has no effect on the data itself, because the values remain unchanged.
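The round trip from sorted values back to the original order can be sketched on a toy DataFrame (made-up data with the default integer index):

```python
import pandas as pd

# Hypothetical toy data with the default index 0, 1, 2.
toy = pd.DataFrame({"make": ["Volvo", "Audi", "BMW"]})

shuffled = toy.sort_values("make")   # index is now out of order
restored = shuffled.sort_index()     # back to 0, 1, 2
reversed_rows = shuffled.sort_index(ascending=False)

print(shuffled.index.tolist())       # [1, 2, 0]
print(restored.equals(toy))          # True
print(reversed_rows.index.tolist())  # [2, 1, 0]
```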

Sorting by index is particularly useful when you have assigned a custom index with .set_index(). If you want to set a custom index using the make and model columns, you can pass a list to .set_index():

>>> assigned_index_df = df.set_index(
...     ["make", "model"]
... )
>>> assigned_index_df
                                  city08  cylinders  ...            trany  year
make        model                                    ...
Alfa Romeo  Spider Veloce 2000        19          4  ...     Manual 5-spd  1985
Ferrari     Testarossa                 9         12  ...     Manual 5-spd  1985
Dodge       Charger                   23          4  ...     Manual 5-spd  1985
            B150/B250 Wagon 2WD       10          8  ...  Automatic 3-spd  1985
Subaru      Legacy AWD Turbo          17          4  ...     Manual 5-spd  1993
...                                  ...        ...  ...              ...   ...
Pontiac     Grand Prix                17          6  ...  Automatic 3-spd  1993
            Grand Prix                17          6  ...  Automatic 4-spd  1993
            Grand Prix                15          6  ...  Automatic 4-spd  1993
            Grand Prix                15          6  ...     Manual 5-spd  1993
Rolls-Royce Brooklands/Brklnds L       9          8  ...  Automatic 4-spd  1993
[100 rows x 8 columns]

Using this method, you replace the default integer-based row index with two axis labels. This is known as a MultiIndex or a hierarchical index. Your DataFrame is now indexed by more than one key, and you can sort it with .sort_index():

>>> assigned_index_df.sort_index()
                               city08  cylinders  ...            trany  year
make       model                                  ...
Alfa Romeo Spider Veloce 2000      19          4  ...     Manual 5-spd  1985
Audi       100                     17          6  ...  Automatic 4-spd  1993
           100                     17          6  ...     Manual 5-spd  1993
BMW        740i                    14          8  ...  Automatic 5-spd  1993
           740il                   14          8  ...  Automatic 5-spd  1993
...                               ...        ...  ...              ...   ...
Volkswagen Golf III / GTI          21          4  ...     Manual 5-spd  1993
           Jetta III               18          4  ...  Automatic 4-spd  1993
           Jetta III               20          4  ...     Manual 5-spd  1993
Volvo      240                     18          4  ...  Automatic 4-spd  1993
           240                     19          4  ...     Manual 5-spd  1993
[100 rows x 8 columns]

First, you assign a new index to the DataFrame using the make and model columns, and then you sort the index using .sort_index(). You can read more about using .set_index() in the Pandas documentation.
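The two steps can be reproduced on a toy DataFrame. The rows below are made up, and "A4" is an invented model name used only for the example:

```python
import pandas as pd

# Hypothetical rows for illustration.
toy = pd.DataFrame(
    {
        "make": ["Volvo", "Audi", "Audi"],
        "model": ["240", "A4", "100"],
        "city08": [18, 20, 17],
    }
)

# Step 1: move make and model into a two-level row index.
indexed = toy.set_index(["make", "model"])

# Step 2: sort by the index, first level first, then second level.
sorted_indexed = indexed.sort_index()

print(sorted_indexed.index.tolist())
# [('Audi', '100'), ('Audi', 'A4'), ('Volvo', '240')]
print(sorted_indexed["city08"].tolist())  # [17, 20, 18]
```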

Sort by descending index

For the next example, you will sort your DataFrame by its index in descending order. Remember from sorting your DataFrame with .sort_values() that you can reverse the sort order by setting ascending to False. This parameter also works with .sort_index(), so you can sort your DataFrame in reverse order like this:

>>> assigned_index_df.sort_index(ascending=False)
                               city08  cylinders  ...            trany  year
make       model                                  ...
Volvo      240                     18          4  ...  Automatic 4-spd  1993
           240                     19          4  ...     Manual 5-spd  1993
Volkswagen Jetta III               18          4  ...  Automatic 4-spd  1993
           Jetta III               20          4  ...     Manual 5-spd  1993
           Golf III / GTI          18          4  ...  Automatic 4-spd  1993
...                               ...        ...  ...              ...   ...
BMW        740il                   14          8  ...  Automatic 5-spd  1993
           740i                    14          8  ...  Automatic 5-spd  1993
Audi       100                     17          6  ...  Automatic 4-spd  1993
           100                     17          6  ...     Manual 5-spd  1993
Alfa Romeo Spider Veloce 2000      19          4  ...     Manual 5-spd  1985
[100 rows x 8 columns]

Your DataFrame is now sorted by its index in descending order. One difference from .sort_values() is that .sort_index() has no by parameter, because it sorts a DataFrame on the row index by default.

Explore advanced index sorting concepts

There are many situations in data analysis where you want to sort on a hierarchical index. You’ve already seen how you can use make and model in a MultiIndex. For this dataset, you could also use the id column as an index.

Setting the id column as the index may be helpful for linking related datasets. For example, the EPA’s emissions dataset also uses id to represent vehicle record IDs. This links the emissions data to the fuel economy data. Sorting the index of both datasets in DataFrames can help when you go on to use other methods such as .merge(). To learn more about combining data in Pandas, see Combining Data in Pandas With merge(), .join(), and concat().
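As a rough sketch of that idea, the example below builds two tiny made-up tables that share an id column, indexes and sorts both on id, and then joins them on the index. The column name score and all values are invented; the real emissions dataset has its own columns:

```python
import pandas as pd

# Hypothetical records; "score" is an invented column name.
fuel = pd.DataFrame({"id": [3, 1, 2], "city08": [23, 19, 9]})
emissions = pd.DataFrame({"id": [2, 3, 1], "score": [4, 7, 5]})

# Use id as the row index in both tables and sort on it.
fuel_by_id = fuel.set_index("id").sort_index()
emissions_by_id = emissions.set_index("id").sort_index()

# Join the two tables on their shared index.
combined = fuel_by_id.merge(
    emissions_by_id, left_index=True, right_index=True
)
print(combined)
```

Each row of combined holds the city08 and score values that belong to the same id.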

Sort DataFrame columns

You can also use the column labels of your DataFrame to sort row values. Using .sort_index() with the optional parameter axis set to 1 sorts the DataFrame by its column labels. The sorting algorithm is applied to the axis labels rather than to the actual data. This can help with visual inspection of the DataFrame.

Working with the DataFrame axis

When you use .sort_index() without passing any explicit arguments, it uses axis=0 as the default. The axis of a DataFrame refers to either the index (axis=0) or the columns (axis=1). You can use both axes for indexing and selecting data in a DataFrame as well as for sorting the data.

Sort using column labels

You can also use the column labels of a DataFrame as the sorting key for .sort_index(). Setting axis to 1 sorts the columns of your DataFrame based on the column labels:

>>> df.sort_index(axis=1)
    city08  cylinders fuelType  ...  mpgData            trany  year
0       19          4  Regular  ...        Y     Manual 5-spd  1985
1        9         12  Regular  ...        N     Manual 5-spd  1985
2       23          4  Regular  ...        Y     Manual 5-spd  1985
3       10          8  Regular  ...        N  Automatic 3-spd  1985
4       17          4  Premium  ...        N     Manual 5-spd  1993
..     ...        ...      ...  ...      ...              ...   ...
95      17          6  Regular  ...        Y  Automatic 3-spd  1993
96      17          6  Regular  ...        N  Automatic 4-spd  1993
97      15          6  Regular  ...        N  Automatic 4-spd  1993
98      15          6  Regular  ...        N     Manual 5-spd  1993
99       9          8  Premium  ...        N  Automatic 4-spd  1993
[100 rows x 10 columns]

DataFrame columns are sorted alphabetically from left to right. To sort the columns in descending order, use ascending=False:

>>> df.sort_index(axis=1, ascending=False)
    year            trany mpgData  ... fuelType  cylinders  city08
0   1985     Manual 5-spd       Y  ...  Regular          4      19
1   1985     Manual 5-spd       N  ...  Regular         12       9
2   1985     Manual 5-spd       Y  ...  Regular          4      23
3   1985  Automatic 3-spd       N  ...  Regular          8      10
4   1993     Manual 5-spd       N  ...  Premium          4      17
..   ...              ...     ...  ...      ...        ...     ...
95  1993  Automatic 3-spd       Y  ...  Regular          6      17
96  1993  Automatic 4-spd       N  ...  Regular          6      17
97  1993  Automatic 4-spd       N  ...  Regular          6      15
98  1993     Manual 5-spd       N  ...  Regular          6      15
99  1993  Automatic 4-spd       N  ...  Premium          8       9
[100 rows x 10 columns]

Using axis=1 in .sort_index(), you sorted the columns of your DataFrame in both ascending and descending order. This could be more useful in other datasets, such as one in which the column labels correspond to months of the year. In that case, it would make sense to arrange your data in ascending or descending order by month.
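Here is a small sketch of the month scenario. The column labels are hypothetical and are written as zero-padded year-month strings so that alphabetical order matches calendar order; month names such as "April" would sort alphabetically instead:

```python
import pandas as pd

# Hypothetical monthly columns, deliberately out of order.
toy = pd.DataFrame(
    {"2021-03": [30], "2021-01": [10], "2021-02": [20]}
)

ascending_cols = toy.sort_index(axis=1).columns.tolist()
descending_cols = toy.sort_index(axis=1, ascending=False).columns.tolist()

print(ascending_cols)   # ['2021-01', '2021-02', '2021-03']
print(descending_cols)  # ['2021-03', '2021-02', '2021-01']
```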

Working with missing data when sorting in Pandas

Real-world data often has imperfections. Although Pandas has several methods you can use to clean your data before sorting, sometimes it is useful to see which data is missing while you are sorting. You can do that with the na_position parameter.

The subset of the fuel economy data used in this tutorial has no missing values. To illustrate the use of na_position, you first need to create some missing data. The following code creates a new column based on the existing mpgData column, mapping to True where mpgData equals Y and to NaN where it does not:

>>> df["mpgData_"] = df["mpgData"].map({"Y": True})
>>> df
    city08  cylinders fuelType  ...            trany  year  mpgData_
0       19          4  Regular  ...     Manual 5-spd  1985      True
1        9         12  Regular  ...     Manual 5-spd  1985       NaN
2       23          4  Regular  ...     Manual 5-spd  1985      True
3       10          8  Regular  ...  Automatic 3-spd  1985       NaN
4       17          4  Premium  ...     Manual 5-spd  1993       NaN
..     ...        ...      ...  ...              ...   ...       ...
95      17          6  Regular  ...  Automatic 3-spd  1993      True
96      17          6  Regular  ...  Automatic 4-spd  1993       NaN
97      15          6  Regular  ...  Automatic 4-spd  1993       NaN
98      15          6  Regular  ...     Manual 5-spd  1993       NaN
99       9          8  Premium  ...  Automatic 4-spd  1993       NaN
[100 rows x 11 columns]

Now you have a new column called mpgData_ that contains both True and NaN values. You will use this column to see the effect na_position has when you use the two sort methods. To find out more about using .map(), you can read Pandas Project: Creating a Score Book Using Python and Pandas.

Understanding the na_position parameter in .sort_values()

.sort_values() takes a parameter named na_position, which helps organize the missing data in the columns you sort. If you sort columns with missing data, the row with the missing value appears at the end of the DataFrame. This happens whether you sort in ascending or descending order.

When you sort a column with missing data, your DataFrame looks like this:

>>> df.sort_values(by="mpgData_")
    city08  cylinders fuelType  ...            trany  year  mpgData_
0       19          4  Regular  ...     Manual 5-spd  1985      True
55      18          6  Regular  ...  Automatic 4-spd  1993      True
56      18          6  Regular  ...  Automatic 4-spd  1993      True
57      16          6  Premium  ...     Manual 5-spd  1993      True
59      17          6  Regular  ...  Automatic 4-spd  1993      True
..     ...        ...      ...  ...              ...   ...       ...
94      18          6  Regular  ...  Automatic 4-spd  1993       NaN
96      17          6  Regular  ...  Automatic 4-spd  1993       NaN
97      15          6  Regular  ...  Automatic 4-spd  1993       NaN
98      15          6  Regular  ...     Manual 5-spd  1993       NaN
99       9          8  Premium  ...  Automatic 4-spd  1993       NaN
[100 rows x 11 columns]

To change that behavior and have the missing data appear first in your DataFrame, set na_position to first. The na_position parameter only accepts the values last, which is the default, and first. Here is how to use na_position in .sort_values():

>>> df.sort_values(
...     by="mpgData_",
...     na_position="first"
... )
    city08  cylinders fuelType  ...            trany  year  mpgData_
1        9         12  Regular  ...     Manual 5-spd  1985       NaN
3       10          8  Regular  ...  Automatic 3-spd  1985       NaN
4       17          4  Premium  ...     Manual 5-spd  1993       NaN
5       21          4  Regular  ...  Automatic 3-spd  1993       NaN
11      18          4  Regular  ...  Automatic 4-spd  1993       NaN
..     ...        ...      ...  ...              ...   ...       ...
32      15          8  Premium  ...  Automatic 4-spd  1993      True
33      15          8  Premium  ...  Automatic 4-spd  1993      True
37      17          6  Regular  ...  Automatic 3-spd  1993      True
85      17          6  Regular  ...  Automatic 4-spd  1993      True
95      17          6  Regular  ...  Automatic 3-spd  1993      True
[100 rows x 11 columns]

Any missing data in the columns you used to sort will now show up at the top of the DataFrame. This is useful when you are first starting to analyze the data and are not sure if there are missing values.
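The placement rules can be confirmed on a three-row toy column with one missing value (made-up numbers, not taken from the EPA data):

```python
import pandas as pd

# Hypothetical column; None becomes NaN in a float column.
toy = pd.DataFrame({"mpg": [21.0, None, 9.0]})

default_asc = toy.sort_values("mpg")
default_desc = toy.sort_values("mpg", ascending=False)
nan_first = toy.sort_values("mpg", na_position="first")

print(default_asc.index.tolist())   # [2, 0, 1], NaN row last
print(default_desc.index.tolist())  # [0, 2, 1], NaN row still last
print(nan_first.index.tolist())     # [1, 2, 0], NaN row first
```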

Understanding the na_position parameter in .sort_index()

You can also use na_position in .sort_index(). A DataFrame typically does not have NaN values as part of its index, so this parameter is less useful in .sort_index(). However, it is good to know that if your DataFrame does have NaN in either the row index or a column name, you can quickly identify this using .sort_index() and na_position.

By default, this parameter is set to last, which places NaN values at the end of the sorted result. To change that behavior and have the missing data appear first in your DataFrame, set na_position to first.
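A minimal sketch of this case, assuming a made-up DataFrame whose row index is a plain (single-level) float index containing one NaN label:

```python
import pandas as pd

# Hypothetical data: the row index itself contains a NaN label.
toy = pd.DataFrame(
    {"city08": [19, 9, 23]},
    index=[2.0, float("nan"), 1.0],
)

nan_last = toy.sort_index()                      # default: NaN label last
nan_first = toy.sort_index(na_position="first")  # NaN label first

print(nan_last["city08"].tolist())   # [23, 19, 9]
print(nan_first["city08"].tolist())  # [9, 23, 19]
```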

Using sort methods to modify your DataFrame

In all the examples you’ve seen so far, both .sort_values() and .sort_index() have returned DataFrame objects when you called those methods. That is because sorting in Pandas does not work in place by default. In general, this is the most common and preferred way to analyze your data with Pandas, because it creates a new DataFrame instead of modifying the original. This allows you to preserve the state of the data as it was when you read it from the file.

However, you can modify the original DataFrame directly by specifying the optional parameter inplace with a value of True. Most Pandas methods include the inplace parameter. Below, you’ll see a few examples of using inplace=True to sort your DataFrame in place.

Using .sort_values() in place

With inplace set to True, you modify the original DataFrame, so the sort methods return None. Sort your DataFrame by the values of the city08 column as in the very first example, but with inplace set to True:

>>> df.sort_values("city08", inplace=True)

Notice how calling .sort_values() does not return a DataFrame. Here is what the original df looks like:

>>> df
    city08  cylinders fuelType  ...            trany  year  mpgData_
99       9          8  Premium  ...  Automatic 4-spd  1993       NaN
1        9         12  Regular  ...     Manual 5-spd  1985       NaN
80       9          8  Regular  ...  Automatic 3-spd  1985       NaN
47       9          8  Regular  ...  Automatic 3-spd  1985       NaN
3       10          8  Regular  ...  Automatic 3-spd  1985       NaN
..     ...        ...      ...  ...              ...   ...       ...
9       23          4  Regular  ...  Automatic 4-spd  1993      True
8       23          4  Regular  ...     Manual 5-spd  1993      True
7       23          4  Regular  ...  Automatic 3-spd  1993      True
76      23          4  Regular  ...     Manual 5-spd  1993      True
2       23          4  Regular  ...     Manual 5-spd  1985      True
[100 rows x 11 columns]

In the df object, the values are now sorted in ascending order based on the city08 column. Your original DataFrame has been modified, and the changes persist. It is generally a good idea to avoid inplace=True for analysis, because the changes to your DataFrame cannot be undone.
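The return value and the lasting change can both be checked on a toy DataFrame. The sketch below uses made-up values and also keeps a copy (named backup here purely for illustration) as one possible way to hold on to the original order:

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({"city08": [23, 9, 17]})
backup = toy.copy()  # an independent copy made before sorting

# With inplace=True, the method changes toy and returns None.
returned = toy.sort_values("city08", inplace=True)

print(returned)                    # None
print(toy["city08"].tolist())      # [9, 17, 23]
print(backup["city08"].tolist())   # [23, 9, 17]
```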

Using .sort_index() in place

The next example shows that inplace also works with .sort_index().

Since the index was created in ascending order when you read your file into a DataFrame, you can modify your df object again to get it back to its initial order. Use .sort_index() with inplace set to True to modify the DataFrame:

>>> df.sort_index(inplace=True)
>>> df
    city08  cylinders fuelType  ...            trany  year  mpgData_
0       19          4  Regular  ...     Manual 5-spd  1985      True
1        9         12  Regular  ...     Manual 5-spd  1985       NaN
2       23          4  Regular  ...     Manual 5-spd  1985      True
3       10          8  Regular  ...  Automatic 3-spd  1985       NaN
4       17          4  Premium  ...     Manual 5-spd  1993       NaN
..     ...        ...      ...  ...              ...   ...       ...
95      17          6  Regular  ...  Automatic 3-spd  1993      True
96      17          6  Regular  ...  Automatic 4-spd  1993       NaN
97      15          6  Regular  ...  Automatic 4-spd  1993       NaN
98      15          6  Regular  ...     Manual 5-spd  1993       NaN
99       9          8  Premium  ...  Automatic 4-spd  1993       NaN
[100 rows x 11 columns]

Now your DataFrame has been modified again using .sort_index(). Since your DataFrame still has its default index, sorting it in ascending order puts the data back into its original order.

If you’re familiar with Python’s sort() and sorted(), the inplace parameter available in the Pandas sort methods may feel very similar. For more information, you can see How to Use sorted() and sort() in Python.
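The parallel with plain Python lists looks roughly like this (the list values are made up): sorted() behaves like the default Pandas sort methods and returns a new object, while list.sort() behaves like inplace=True and returns None.

```python
# Hypothetical list of MPG values.
mpg = [23, 9, 17]

# sorted() returns a new list and leaves the original untouched.
new_list = sorted(mpg)
print(new_list)  # [9, 17, 23]
print(mpg)       # [23, 9, 17]

# list.sort() sorts the list itself and returns None.
returned = mpg.sort()
print(returned)  # None
print(mpg)       # [9, 17, 23]
```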

Conclusion

You now know how to use two core methods of the Pandas library: .sort_values() and .sort_index(). With this knowledge, you can perform basic data analysis with a DataFrame. Although the two methods have much in common, looking at the differences between them makes it clear which one to use for different analysis tasks.

In this tutorial, you learned how to:

Sort a Pandas DataFrame by the values of one or more columns
Use the ascending parameter to change the sort order
Sort a DataFrame by its index using .sort_index()
Organize missing data when sorting values
Sort a DataFrame in place using inplace set to True

These methods are an important part of mastering data analysis. They will help you build a strong foundation from which you can perform more advanced operations in Pandas. The Pandas documentation is a great resource if you want to see some examples of more advanced uses of the Pandas sorting method.
