Author | Amanda Iglesias Moreno    Compiled by | VK    Source | Towards Data Science

Filtering data from a data frame is one of the most common operations when cleaning data. Pandas provides a range of methods for selecting data by row and column position and label. In addition, Pandas lets you retrieve subsets of data based on column data types and filter rows using Boolean indexing.

In this article, we cover the most common operations for selecting a subset of data from a Pandas DataFrame:

  • Select a single column by label
  • Select multiple columns by label
  • Select columns by data type
  • Select a single row by label
  • Select multiple rows by label
  • Select a single row by position
  • Select multiple rows by position
  • Select both rows and columns
  • Select a scalar value
  • Select rows using Boolean selection

In addition, we will provide multiple coding examples!

The data set

In this article, we use a small data set for learning purposes. In the real world, data sets are much larger; however, the process used to filter the data remains the same.

The data frame contains information on 10 employees of a company: (1) ID number, (2) first name, (3) surname, (4) division, (5) telephone number, (6) salary, and (7) contract type.

import pandas as pd

# Employee information
id_number = ['128', '478', '257', '299', '175', '328', '099', '457', '144', '222']
name = ['Patrick', 'Amanda', 'Antonella', 'Eduard', 'John', 'Alejandra', 'Layton', 'Melanie', 'David', 'Lewis']
surname = ['Miller', 'Torres', 'Brown', 'Iglesias', 'Wright', 'Campos', 'Platt', 'Cavill', 'Lange', 'Bellow']
division = ['Sales', 'IT', 'IT', 'Sales', 'Marketing', 'Engineering', 'Engineering', 'Sales', 'Engineering', 'Sales']
salary = [30000, 54000, 80000, 79000, 15000, 18000, 30000, 35000, 45000, 30500]
telephone = ['7366578', '7366444', '7366120', '7366574', '7366113', '7366117', '7366777', '7366579', '7366441', '7366440']
type_contract = ['permanent', 'temporary', 'temporary', 'permanent', 'internship', 'internship', 'permanent', 'temporary', 'permanent', 'permanent']

# Data frame containing the employee information (the ID number is used as the index)
df_employees = pd.DataFrame({'name': name, 'surname': surname, 'division': division,
                             'salary': salary, 'telephone': telephone, 'type_contract': type_contract}, index=id_number)

df_employees

1. Select a column by label

To select a column in Pandas, we can use the . (dot) operator or the [] operator.

Select a single column by label

df[string]

The code below accesses the salary column using both methods.

# Select a column using dot notation (salary)
salary = df_employees.salary

# Select a column using square brackets (salary)
salary_2 = df_employees['salary']

# When selecting a single column, we get a Series object
print(type(salary))
# <class 'pandas.core.series.Series'>

print(type(salary_2))
# <class 'pandas.core.series.Series'>

salary

As shown above, when a single column is retrieved, the result is a Series object. To get a DataFrame object when only one column is selected, we need to pass in a list, not just a string.

# Get a Series object by passing a string to the index operator
df_employees['salary']

# Get a DataFrame object by passing a single-element list to the index operator
df_employees[['salary']]

Also, it is important to remember that we cannot use dot notation to access a column of a data frame when the column name contains spaces. If we try, we will get a syntax error.
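As a minimal sketch of this limitation, the snippet below uses a hypothetical two-column frame that is not part of the article's data set. Bracket notation handles a label containing a space. The second column, named count, is an added assumption that illustrates a related pitfall: dot notation returns the existing DataFrame.count method rather than the column.

```python
import pandas as pd

# Hypothetical frame (not the article's data): one label contains a space,
# the other collides with the built-in DataFrame.count method.
df = pd.DataFrame({'first name': ['Amanda', 'John'], 'count': [3, 5]})

# Bracket notation works for any column label
first_names = df['first name']
counts = df['count']

# Dot notation finds the method first, so df.count is NOT the column
dot_result_is_method = callable(df.count)

print(first_names.tolist())
print(counts.tolist())
print(dot_result_is_method)
```

For this reason, bracket notation is the safer default whenever column names are not plain Python identifiers.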


2. Select multiple columns by label

We can select multiple columns of a data frame by passing in a list of column names as follows.

Select multiple columns by label

df[list_of_strings]

# Select multiple columns by passing a list containing column names to the index operator
df_employees[['division', 'salary']]

As shown above, the result is a DataFrame object that contains only the columns provided in the list.

3. Select columns by data type

The pandas.DataFrame.select_dtypes(include=None, exclude=None) method can be used to select columns based on their data type. This method accepts a list or a single data type in the parameters include and exclude.

Remember, you must provide at least one of the parameters (include or exclude), and they cannot contain overlapping elements.

Select columns by data type

df.select_dtypes(include=None, exclude=None)

In the example below, we set the include parameter to np.number. Alternatively, we can obtain the same result by passing the string 'number'.

As you can see, the select_dtypes() method returns a DataFrame object that includes the data types in the include argument and excludes the data types in the exclude argument.

import numpy as np

# Select the numeric columns using a NumPy object
numeric_inputs = df_employees.select_dtypes(include=np.number)

# Check the selected columns using the .columns attribute
numeric_inputs.columns
# Index(['salary'], dtype='object')

# The method returns a DataFrame object
print(type(numeric_inputs))
# <class 'pandas.core.frame.DataFrame'>

# Select the numeric columns using a string
numeric_inputs_2 = df_employees.select_dtypes(include='number')

# Check the selected columns using the .columns attribute
numeric_inputs_2.columns
# Index(['salary'], dtype='object')

# The method returns a DataFrame object
print(type(numeric_inputs_2))
# <class 'pandas.core.frame.DataFrame'>

# Display the data frame
numeric_inputs
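The exclude parameter and the "at least one parameter" rule can be sketched in the same way. The frame below is a reduced, hypothetical variant of the employee data (the bonus_rate column is an assumption added so that there are two numeric columns).

```python
import pandas as pd

# Reduced, hypothetical variant of the employee data
df = pd.DataFrame({
    'name': ['Patrick', 'Amanda', 'Antonella'],
    'salary': [30000, 54000, 80000],
    'bonus_rate': [0.05, 0.10, 0.07],  # assumed column, not in the article
})

# include keeps the numeric columns, exclude drops them
numeric_cols = df.select_dtypes(include='number').columns.tolist()
non_numeric_cols = df.select_dtypes(exclude='number').columns.tolist()

# Calling the method without include or exclude raises a ValueError
try:
    df.select_dtypes()
    error_name = None
except ValueError as exc:
    error_name = type(exc).__name__

print(numeric_cols)      # ['salary', 'bonus_rate']
print(non_numeric_cols)  # ['name']
print(error_name)        # ValueError
```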

As stated earlier, the select_dtypes() method can accept both a string and a Numpy object as input. The table below shows the most common ways to refer to data types in Pandas.

As a reminder, we can use the pandas.DataFrame.info method or the pandas.DataFrame.dtypes attribute to check the data type of each column. The former prints a concise summary of the data frame, including the column names and their data types, while the latter returns a Series containing the data type of each column.

# A concise summary of the data frame, including column names and their data types
df_employees.info()

# Check the data type of each column
df_employees.dtypes

4. Select a single row by label

DataFrames and Series do not necessarily have numeric indexes. By default, the index consists of integers representing row positions; however, it can also contain alphanumeric strings. In our current example, the index is the ID number of the employee.

# We can use the .index attribute to check the index of the data frame
df_employees.index
# Index(['128', '478', '257', '299', '175', '328', '099', '457', '144', '222'], dtype='object')
# The index is the ID number of the employee.

To select a row by ID, we can use the .loc[] indexer, providing a string (the index label) as input.

Select a single row by label

df.loc[string]

The following code shows how to select the employee with ID 478.

# Select the employee with ID 478 using the .loc[] indexer
df_employees.loc['478']

As shown above, when a single row is selected, the .loc[] indexer returns a Series object. However, we can also get a single-row DataFrame by passing a single-element list to the .loc[] indexer, as shown below.

# Select the employee with ID 478 using the .loc[] indexer and a single-element list
df_employees.loc[['478']]
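The difference between the two return types can be checked directly. The sketch below uses a reduced copy of the employee data (three rows, two columns) and also shows what happens when a label does not exist; the label '999' is an assumption chosen because it is absent from the index.

```python
import pandas as pd

# Reduced copy of the article's employee data
df = pd.DataFrame(
    {'name': ['Patrick', 'Amanda', 'Antonella'], 'salary': [30000, 54000, 80000]},
    index=['128', '478', '257'],
)

row_series = df.loc['478']    # single label -> Series
row_frame = df.loc[['478']]   # single-element list -> DataFrame

print(type(row_series).__name__, row_series.name)  # Series 478
print(type(row_frame).__name__, row_frame.shape)   # DataFrame (1, 2)

# A label that is not in the index raises a KeyError
try:
    df.loc['999']
    missing_raises = False
except KeyError:
    missing_raises = True

print(missing_raises)  # True
```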

5. Select multiple rows by label

We can use the .loc[] indexer to select multiple rows. In addition to a single label, the indexer accepts a list or a slice of labels as input.

Select multiple rows by label

df.loc[list_of_strings]
df.loc[slice_of_strings]

Next, we get a subset of the data frame containing the employees with IDs 478 and 222, as shown below.

# Select the employees with IDs 478 and 222 using the .loc[] indexer
df_employees.loc[['478', '222']]

Note that when slicing with the .loc[] indexer, the closing label is always included, which means that the selection includes the last label.
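A short sketch of label slicing, using the first five employee IDs from the article's data set: both the start label and the stop label end up in the result.

```python
import pandas as pd

# First five employees of the article's data set (names only)
df = pd.DataFrame(
    {'name': ['Patrick', 'Amanda', 'Antonella', 'Eduard', 'John']},
    index=['128', '478', '257', '299', '175'],
)

# Slice by label: rows from '478' up to AND including '299'
sliced = df.loc['478':'299']

print(sliced.index.tolist())  # ['478', '257', '299']
```

The slice follows the order of the rows in the index, not the alphabetical or numeric order of the labels.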

6. Select a single row by position

The .iloc[] indexer is used to index a data frame by position. To select a single row using the .iloc[] indexer, we pass the row position (a single integer) to the indexer.

Select a single row by position

df.iloc[integer]

In the following code block, we select the row at position 0. In this case, the first row of the data frame is returned, because in Pandas positions start at 0.

# Select the first row of the data frame
df_employees.iloc[0]

In addition, the .iloc[] indexer supports negative integers (starting at -1) as positions relative to the end of the data frame.

# Select the last row of the data frame
df_employees.iloc[-1]

As shown above, when a single row is selected, the .iloc[] indexer returns a Series object indexed by the column names. However, just as we did with the .loc[] indexer, we can also pass a single-element list of integers to the indexer to get a DataFrame, as follows.

# Select the last row of the data frame as a DataFrame
df_employees.iloc[[-1]]

Finally, remember that an IndexError is raised when you try to access a position that is out of bounds.

# Shape of the data frame: 10 rows and 6 columns
df_employees.shape
# (10, 6)

# An IndexError is raised when attempting to access an out-of-bounds position
df_employees.iloc[10]
# IndexError
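The out-of-bounds behaviour can be handled with a regular try/except block. The sketch below uses a reduced three-row copy of the employee data and, as an additional observation that the article does not cover, shows that an out-of-range slice is simply truncated rather than raising an error.

```python
import pandas as pd

# Reduced copy of the article's employee data (3 rows)
df = pd.DataFrame(
    {'name': ['Patrick', 'Amanda', 'Antonella'], 'salary': [30000, 54000, 80000]},
    index=['128', '478', '257'],
)

# A single out-of-bounds integer raises an IndexError
try:
    df.iloc[3]
    out_of_bounds_raises = False
except IndexError:
    out_of_bounds_raises = True

# A slice that runs past the end is truncated instead
tail = df.iloc[1:10]

print(out_of_bounds_raises)  # True
print(len(tail))             # 2
```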

7. Select multiple rows by position

To extract multiple rows by position, we pass a list or a slice object to the .iloc[] indexer.

Select multiple rows by position

df.iloc[list_of_integers]
df.iloc[slice_of_integers]

The following code block shows how to select the first five rows of a data frame using a list of integers.

# Select the first 5 rows of the data frame using a list
df_employees.iloc[[0, 1, 2, 3, 4]]

Alternatively, we can get the same result using slicing notation.

# Select the first 5 rows of the data frame using a slice
df_employees.iloc[0:5]

As shown above, Python's slicing rule (half-open interval) applies to the .iloc[] indexer, which means that the start position is included but the stop position is not.
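Because .iloc[] follows ordinary Python slicing, negative bounds and steps work as well. The following sketch, on the first five employees of the data set, is an illustration of that rule rather than part of the original walkthrough.

```python
import pandas as pd

# First five employees of the article's data set (names only)
df = pd.DataFrame(
    {'name': ['Patrick', 'Amanda', 'Antonella', 'Eduard', 'John']},
    index=['128', '478', '257', '299', '175'],
)

last_two = df.iloc[-2:]      # last two rows
every_other = df.iloc[::2]   # positions 0, 2 and 4

# A slice and the equivalent list of positions return the same rows
same_result = df.iloc[0:3].equals(df.iloc[[0, 1, 2]])

print(last_two.index.tolist())     # ['299', '175']
print(every_other.index.tolist())  # ['128', '257', '175']
print(same_result)                 # True
```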

8. Select both rows and columns

So far, we have learned how to use the .loc[] and .iloc[] indexers to select rows in a data frame by label or position. However, both indexers can select not only rows, but also rows and columns at the same time.

To do this, we must provide comma-separated row and column labels/positions as follows:

Select both rows and columns

df.loc[row_labels, column_labels]
df.iloc[row_positions, column_positions]

Here, row labels and column labels can be a single string, a list of strings, or a slice of strings. Similarly, row and column positions can be single integers, lists of integers, or slices of integers.

The following examples show how to use the .loc[] and .iloc[] indexers to extract rows and columns simultaneously.

  • Select a scalar value

We select the salary of the employee whose ID is 478 as follows.

# Select the salary of the employee with ID number 478 by position
df_employees.iloc[1, 3]

# Select the salary of the employee with ID number 478 by label
df_employees.loc['478', 'salary']
# 54000

In this case, the output of both indexers is an integer.

  • Select a single row and multiple columns

We select the first name, surname, and salary of the employee with ID number 478 by passing a single value as the first argument and a list of values as the second argument, which returns a Series object.

# Select the first name, surname, and salary of the employee with ID number 478 by position
df_employees.iloc[1, [0, 1, 3]]

# Select the first name, surname, and salary of the employee with ID number 478 by label
df_employees.loc['478', ['name', 'surname', 'salary']]

  • Select disjoint rows and columns

To select multiple rows and columns, we need to pass two lists of values to both indexers. The code below shows how to extract the first name, surname, and salary of the employees with IDs 478 and 222.

# Select the first name, surname, and salary of the employees with ID numbers 478 and 222 by position
df_employees.iloc[[1, 9], [0, 1, 3]]

# Select the first name, surname, and salary of the employees with ID numbers 478 and 222 by label
df_employees.loc[['478', '222'], ['name', 'surname', 'salary']]

Unlike before, the output of both indexers is a DataFrame object.

  • Select consecutive rows and columns

We can use slicing notation to extract consecutive rows and columns of a data frame. The following code snippet shows how to select the first name, surname, and salary of the employees with IDs 128, 478, 257, and 299.

# Select the first name, surname, and salary of the employees with IDs 128, 478, 257, 299 by position
df_employees.iloc[:4, [0, 1, 3]]

# Select the first name, surname, and salary of the employees with IDs 128, 478, 257, 299 by label
df_employees.loc[:'299', ['name', 'surname', 'salary']]

As shown above, we use slicing notation only for the rows of the data frame, because the ID numbers we want to select are consecutive (positions 0 to 3), whereas the columns are not.

Keep in mind that the .loc[] indexer uses a closed interval and therefore returns both the start label and the stop label. In contrast, the .iloc[] indexer uses a half-open interval and therefore does not include the value at the stop position.
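The closed versus half-open behaviour can be verified by slicing both rows and columns with each indexer. The sketch below works on a reduced copy of the employee data (five rows, three columns) and checks that the two selections are identical.

```python
import pandas as pd

# Reduced copy of the article's employee data
df = pd.DataFrame(
    {
        'name': ['Patrick', 'Amanda', 'Antonella', 'Eduard', 'John'],
        'surname': ['Miller', 'Torres', 'Brown', 'Iglesias', 'Wright'],
        'salary': [30000, 54000, 80000, 79000, 15000],
    },
    index=['128', '478', '257', '299', '175'],
)

# Label slices are closed: '299' and 'surname' are both included
by_label = df.loc[:'299', 'name':'surname']

# Position slices are half-open: stop positions 4 and 2 are excluded
by_position = df.iloc[:4, 0:2]

print(by_label.shape)                # (4, 2)
print(by_label.equals(by_position))  # True
```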

9. Use the .at[] and .iat[] indexers to select scalar values

As mentioned above, we can select scalar values by passing two comma-separated strings/integers to the .loc[] and .iloc[] indexers. In addition, Pandas provides two optimized accessors for extracting scalar values from a data frame: .at[] and .iat[]. The former extracts a single value by label, while the latter accesses a single value by position.

Select scalar values by label and position

df.at[string, string]
df.iat[integer, integer]

The following code shows how to use the .at[] and .iat[] indexers to select the salary of the employee with ID 478 by label and by position.

# Select the salary of the employee with ID number 478 by position
df_employees.iat[1, 3]

# Select the salary of the employee with ID number 478 by label
df_employees.at['478', 'salary']
# 54000

We can use the %timeit magic function to measure the execution time of these Python statements. When timed this way, the .at[] and .iat[] indexers are much faster than the .loc[] and .iloc[] indexers.

# Execution time of the .loc[] indexer
%timeit df_employees.loc['478', 'salary']

# Execution time of the .at[] indexer
%timeit df_employees.at['478', 'salary']

# Execution time of the .iloc[] indexer
%timeit df_employees.iloc[1, 3]

# Execution time of the .iat[] indexer
%timeit df_employees.iat[1, 3]
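%timeit is only available in IPython and Jupyter. Outside those environments, a comparable measurement can be sketched with the standard-library timeit module, as below. The repetition count of 1000 is an arbitrary assumption, and the absolute numbers will vary by machine and Pandas version.

```python
import timeit

import pandas as pd

# Reduced copy of the article's employee data
df = pd.DataFrame(
    {'name': ['Patrick', 'Amanda', 'Antonella'], 'salary': [30000, 54000, 80000]},
    index=['128', '478', '257'],
)

# Time 1000 scalar lookups with each indexer (total seconds per indexer)
loc_time = timeit.timeit(lambda: df.loc['478', 'salary'], number=1000)
at_time = timeit.timeit(lambda: df.at['478', 'salary'], number=1000)

print(f'.loc[]: {loc_time:.4f} s for 1000 lookups')
print(f'.at[]:  {at_time:.4f} s for 1000 lookups')
```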

Finally, it is important to remember that the .at[] and .iat[] indexers can only be used to access a single value and will raise an error (a TypeError in the example below) when attempting to select multiple elements of a data frame.

# An exception is raised when you try to select more than one element
df_employees.at['478', ['name', 'surname', 'salary']]
# TypeError

10. Select rows using Boolean selection

So far, we have filtered the rows and columns of a data frame by label and position. Alternatively, we can use Boolean indexing to select a subset of a Pandas data frame. Boolean selection consists of selecting rows of a data frame by providing a Boolean value (True or False) for each row.

In most cases, this array of Boolean values is computed by applying a condition to the values of one or more columns, which evaluates to True or False depending on whether each value satisfies the condition. However, the Boolean array can also be created manually as a NumPy array, a list, or a Pandas Series.

The sequence of Boolean values is then placed inside square brackets [], returning the rows associated with True.
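To make the manual case concrete, the sketch below builds the same mask three ways (list, NumPy array, Pandas Series) on a reduced copy of the employee data. The mask values themselves are arbitrary and chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Reduced copy of the article's employee data
df = pd.DataFrame(
    {'name': ['Patrick', 'Amanda', 'Antonella'], 'salary': [30000, 54000, 80000]},
    index=['128', '478', '257'],
)

# One Boolean per row: keep the first and the third row
mask_list = [True, False, True]
mask_array = np.array(mask_list)
mask_series = pd.Series(mask_list, index=df.index)  # index must match the frame

from_list = df[mask_list]
from_array = df[mask_array]
from_series = df[mask_series]

print(from_list.index.tolist())  # ['128', '257']
```

All three masks select the same rows; the Series variant is aligned on the index, which is why it is built with index=df.index.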

Select rows using Boolean selection

df[sequence_of_booleans]

Boolean selection based on a single column value

The most common way to filter a data frame based on the values of a single column is to use comparison operators.

A comparison operator evaluates the relationship between two operands (a and b) and returns True or False depending on whether the condition is met. The comparison operators available in Python are == (equal), != (not equal), > (greater than), < (less than), >= (greater than or equal), and <= (less than or equal).

These comparison operators can be applied to a single column of a data frame to obtain a Series of Boolean values. For example, we use the greater-than operator to determine whether each employee's salary is greater than 45,000, as shown below.

# Employees earning more than 45,000
df_employees['salary'] > 45000

The output is a Series of Boolean values in which salaries above 45,000 are True and salaries below or equal to 45,000 are False. As you may have noticed, the Boolean Series has the same index (ID number) as the original data frame.

You can pass this Series to the index operator [] to return only the rows for which the value is True.

# Select employees who earn more than 45,000
df_employees[df_employees['salary'] > 45000]

As shown above, we get a DataFrame object that contains only the employees whose salaries are above 45,000.

Boolean selection based on multiple column values

Previously, we filtered a data frame based on a single condition. However, we can also combine multiple Boolean expressions using logical operators.

In Python, there are three logical operators: and, or, and not. However, these keywords cannot be used to combine multiple Boolean conditions in Pandas. Instead, use the operators & (and), | (or), and ~ (not).

The following code shows how to select the employees who earn more than 45,000 and have a permanent contract, using two Boolean expressions and the logical operator &.

# Select employees who earn more than 45,000 and have a permanent contract
df_employees[(df_employees['salary'] > 45000) & (df_employees['type_contract'] == 'permanent')]

In Python, comparison operators take precedence over the logical operators and, or, and not. However, this does not hold for the operators &, |, and ~ used in Pandas, which take precedence over comparison operators. Therefore, we need to wrap each Boolean expression in parentheses to avoid errors.
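The sketch below, on a reduced copy of the employee data, shows why the keyword and cannot be used (it raises a ValueError because the truth value of a whole Series is ambiguous) and how &, |, and ~ behave with parenthesized conditions.

```python
import pandas as pd

# Reduced copy of the article's employee data
df = pd.DataFrame(
    {
        'salary': [30000, 54000, 80000, 79000, 15000],
        'type_contract': ['permanent', 'temporary', 'temporary', 'permanent', 'internship'],
    },
    index=['128', '478', '257', '299', '175'],
)

high_paid = df['salary'] > 45000
permanent = df['type_contract'] == 'permanent'

# The keyword `and` tries to convert each Series to a single bool and fails
try:
    high_paid and permanent
    keyword_raises = False
except ValueError:
    keyword_raises = True

both = df[(df['salary'] > 45000) & (df['type_contract'] == 'permanent')]
either = df[(df['salary'] > 45000) | (df['type_contract'] == 'permanent')]
not_permanent = df[~(df['type_contract'] == 'permanent')]

print(keyword_raises)                # True
print(both.index.tolist())           # ['299']
print(either.index.tolist())         # ['128', '478', '257', '299']
print(not_permanent.index.tolist())  # ['478', '257', '175']
```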

Boolean selection using the Pandas method

Pandas provides a number of built-in methods that return Series of Boolean values, which are an attractive alternative to more complex Boolean expressions that combine comparison and logical operators.

  • The isin method

The pandas.Series.isin method accepts a list of values and returns True at the positions in the Series that match one of the values in the list.

This method allows us to check for the presence of one or more elements in a column without using the logical operator or (|). The following code shows how to select employees with permanent or temporary contracts using both the logical operator | and the isin method.

# Select employees with permanent or temporary contracts using the logical operator |
df_employees[(df_employees['type_contract'] == 'temporary') | (df_employees['type_contract'] == 'permanent')]

# Select employees with permanent or temporary contracts using the isin method
df_employees[df_employees['type_contract'].isin(['temporary', 'permanent'])]

As you can see, the isin method is handy for checking multiple or conditions on the same column. Plus, it's faster!

# Execution time using the logical operator |
%timeit df_employees[(df_employees['type_contract'] == 'temporary') | (df_employees['type_contract'] == 'permanent')]

# Execution time using the isin method
%timeit df_employees[df_employees['type_contract'].isin(['temporary', 'permanent'])]
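As a small extension of the isin example, the result can be negated with ~ to select the rows that are not in the list. The sketch uses the division column of the first five employees; combining isin with ~ is an illustration added here rather than something the original walkthrough covers.

```python
import pandas as pd

# First five employees of the article's data set (division only)
df = pd.DataFrame(
    {'division': ['Sales', 'IT', 'IT', 'Sales', 'Marketing']},
    index=['128', '478', '257', '299', '175'],
)

in_sales_or_it = df[df['division'].isin(['Sales', 'IT'])]
not_in_sales_or_it = df[~df['division'].isin(['Sales', 'IT'])]

# isin gives the same rows as chaining == comparisons with |
with_or = df[(df['division'] == 'Sales') | (df['division'] == 'IT')]

print(in_sales_or_it.index.tolist())      # ['128', '478', '257', '299']
print(not_in_sales_or_it.index.tolist())  # ['175']
print(in_sales_or_it.equals(with_or))     # True
```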

  • The between method

The pandas.Series.between method takes two comma-separated scalars representing the lower and upper bounds of a range of values and returns True at the positions that fall within that range.

The following code selects employees whose salary is greater than or equal to 30,000 and less than or equal to 80,000.

# Employees whose salary is greater than or equal to 30,000 and less than or equal to 80,000
df_employees[df_employees['salary'].between(30000, 80000)]

As you can see, both boundaries (30,000 and 80,000) are included. To exclude them, we pass the argument inclusive='neither' (inclusive=False in Pandas versions before 1.3), as follows.

# Employees whose salary is above 30,000 and below 80,000
df_employees[df_employees['salary'].between(30000, 80000, inclusive='neither')]
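The effect of the inclusive argument can be seen by counting the matching salaries for each option. This sketch assumes Pandas 1.3 or later, where inclusive accepts the strings 'both', 'neither', 'left', and 'right'; the salary values are the ten from the article's data set.

```python
import pandas as pd

# The ten salaries from the article's data set
salary = pd.Series([30000, 54000, 80000, 79000, 15000,
                    18000, 30000, 35000, 45000, 30500])

# Count how many salaries fall in the range for each boundary rule
counts = {
    option: int(salary.between(30000, 80000, inclusive=option).sum())
    for option in ['both', 'neither', 'left', 'right']
}

print(counts)
# {'both': 8, 'neither': 5, 'left': 7, 'right': 6}
```

'left' keeps the lower bound only (salary >= 30000 and salary < 80000), while 'right' keeps the upper bound only.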

As you may have noticed, the inclusive form of between is equivalent to writing two Boolean expressions and combining them with the logical operator &.

# Employees whose salary is greater than or equal to 30,000 and less than or equal to 80,000
df_employees[(df_employees['salary'] >= 30000) & (df_employees['salary'] <= 80000)]

  • String methods

In addition, we can use Boolean indexing with string methods, as long as they return a Series of Boolean values.

For example, the pandas.Series.str.contains method checks whether a substring is present in each element of a Series and returns a Series of Boolean values, which we can pass to the index operator to filter the data frame.

The code below shows how to select all phone numbers containing 57.

# select all phone numbers containing 57
df_employees[df_employees['telephone'].str.contains('57')]

While the contains method checks whether the substring occurs anywhere in each element of the Series, the pandas.Series.str.startswith method checks whether each string starts with the given substring. Similarly, pandas.Series.str.endswith checks whether each string ends with the given substring.

The following code shows how to select employees whose names begin with ‘A’.

# Select employees whose names begin with "A"
df_employees[df_employees['name'].str.startswith('A')]
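The endswith method and a case-insensitive contains can be sketched the same way, using the ten first names from the data set. The case=False argument is an existing option of str.contains that the walkthrough above does not use; it is included here only as an illustration.

```python
import pandas as pd

# The ten first names from the article's data set
df = pd.DataFrame({'name': ['Patrick', 'Amanda', 'Antonella', 'Eduard', 'John',
                            'Alejandra', 'Layton', 'Melanie', 'David', 'Lewis']})

# Names ending with 'd'
ends_with_d = df[df['name'].str.endswith('d')]

# Names containing 'an', ignoring upper/lower case
contains_an = df[df['name'].str.contains('an', case=False)]

print(ends_with_d['name'].tolist())
# ['Eduard', 'David']
print(contains_an['name'].tolist())
# ['Amanda', 'Antonella', 'Alejandra', 'Melanie']
```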

Summary

In this article, we learned how to select subsets of a data frame and worked through several usage examples. Now it's time to apply these techniques to cleaning your own data!

Original link: towardsdatascience.com/filtering-d…
