• Pythonic Data Cleaning With NumPy and Pandas
  • Malay Agarwal
  • The Nuggets Translation Project
  • Permanent link to this article: github.com/xitu/gold-m…
  • Translator: bambooom
  • Proofreader: luochen1992, Hopsken

Pythonic Data Cleaning With NumPy and Pandas

Data scientists spend a large amount of their time cleaning datasets and getting them into a form in which they can work. In fact, a lot of data scientists argue that the initial steps of obtaining and cleaning data constitute 80% of the job.

Therefore, whether you are just stepping into this field or planning to, the ability to deal with messy data is very important, whether that means missing values, inconsistent formatting, malformed records, or nonsensical outliers.

In this tutorial, we will use the libraries Pandas and NumPy to clean up data.

We will introduce the following:

  • Dropping unnecessary columns in a DataFrame
  • Changing the index of a DataFrame
  • Using .str() methods to clean columns
  • Using the DataFrame.applymap() function to clean the entire dataset, element by element
  • Renaming columns to a more recognizable set of labels
  • Skipping unnecessary rows in a CSV file

These are the data sets we will use:

  • BL-Flickr-Images-Book.csv – a CSV file containing information about books from the British Library
  • university_towns.txt – a text file containing the names of college towns in every US state
  • olympics.csv – a CSV file summarizing the participation of all countries in the Summer and Winter Olympics

You can download all of the datasets from Real Python’s GitHub repository in order to follow the examples below.

Note: I recommend using Jupyter Notebook for the following steps.

This tutorial assumes a basic understanding of the Pandas and NumPy libraries, including Pandas’ workhorse Series and DataFrame objects, common methods that can be applied to these objects, and familiarity with NumPy’s NaN values.

Let’s start by importing these modules!

>>> import pandas as pd
>>> import numpy as np

Dropping Unnecessary Columns in a DataFrame

You will often find that not all the categories of data in a dataset are useful to you. For example, you might have a dataset containing student information (name, grade, standard, parents’ names, and address), but you want to focus on analyzing student grades.

In this case, the address and parents’ names are not important to you. Retaining these unneeded categories will take up unnecessary space and potentially also bog down runtime.

Pandas provides a handy drop() function to remove columns or rows from a DataFrame. Let’s look at a simple example of removing columns from a DataFrame.

First, we create a DataFrame from the CSV file “BL-Flickr-Images-Book.csv”. In the examples below, we pass a relative path to pd.read_csv, meaning that all of the datasets are in a folder named Datasets in our current working directory:

>>> df = pd.read_csv('Datasets/BL-Flickr-Images-Book.csv')
>>> df.head()

    Identifier             Edition Statement      Place of Publication  \
0         206                           NaN                    London
1         216                           NaN  London; Virtue & Yorston
2         218                           NaN                    London
3         472                           NaN                    London
4         480  A new edition, revised, etc.                    London

  Date of Publication              Publisher  \
0         1879 [1878]       S. Tinsley & Co.
1                1868           Virtue & Co.
2                1869  Bradbury, Evans & Co.
3                1851          James Darling
4                1857   Wertheim & Macintosh

                                               Title     Author  \
0                  Walter Forbes. [A novel.] By A. A      A. A.
1  All for Greed. [A novel. The dedication signed...  A., A. A.
2  Love the Avenger. By the author of “All for Gr...  A., A. A.
3  Welsh Sketches, chiefly ecclesiastical, to the...  A., E. S.
4  [The World in which I live, and my place in it...  A., E. S.

                                   Contributors  Corporate Author  \
0                               FORBES, Walter.               NaN
1  BLAZE DE BURY, Marie Pauline Rose - Baroness               NaN
2  BLAZE DE BURY, Marie Pauline Rose - Baroness               NaN
3                   Appleyard, Ernest Silvanus.               NaN
4                           BROOME, John Henry.               NaN

   Corporate Contributors Former owner  Engraver Issuance type  \
0                     NaN          NaN       NaN   monographic
1                     NaN          NaN       NaN   monographic
2                     NaN          NaN       NaN   monographic
3                     NaN          NaN       NaN   monographic
4                     NaN          NaN       NaN   monographic

                                          Flickr URL  \
0  http://www.flickr.com/photos/britishlibrary/ta...
1  http://www.flickr.com/photos/britishlibrary/ta...
2  http://www.flickr.com/photos/britishlibrary/ta...
3  http://www.flickr.com/photos/britishlibrary/ta...
4  http://www.flickr.com/photos/britishlibrary/ta...

                            Shelfmarks
0    British Library HMNTS 12641.b.30.
1    British Library HMNTS 12626.cc.2.
2    British Library HMNTS 12625.dd.1.
3  British Library HMNTS 10369.bbb.15.
4     British Library HMNTS 9007.d.28.

When we look at the first five entries with the head() method, we can see that a handful of columns provide ancillary information that would be helpful to the library but isn’t very descriptive of the books themselves: Edition Statement, Corporate Author, Corporate Contributors, Former owner, Engraver, Contributors, Issuance type and Shelfmarks.

We can delete these columns as follows:

>>> to_drop = ['Edition Statement',
...            'Corporate Author',
...            'Corporate Contributors',
...            'Former owner',
...            'Engraver',
...            'Contributors',
...            'Issuance type',
...            'Shelfmarks']

>>> df.drop(to_drop, inplace=True, axis=1)

Here, we define a list that contains the names of all the columns we want to drop. Next, we call the drop() function on our object, passing the inplace parameter as True and the axis parameter as 1. Together, these tell Pandas to make the changes directly to our object and to look for the values to drop along the DataFrame’s columns.

Looking at the DataFrame again, you can see that the unwanted column has been removed:

>>> df.head()
   Identifier      Place of Publication Date of Publication  \
0         206                    London         1879 [1878]
1         216  London; Virtue & Yorston                1868
2         218                    London                1869
3         472                    London                1851
4         480                    London                1857

               Publisher                                              Title  \
0       S. Tinsley & Co.                  Walter Forbes. [A novel.] By A. A
1           Virtue & Co.  All for Greed. [A novel. The dedication signed...
2  Bradbury, Evans & Co.  Love the Avenger. By the author of “All for Gr...
3          James Darling  Welsh Sketches, chiefly ecclesiastical, to the...
4   Wertheim & Macintosh  [The World in which I live, and my place in it...

      Author                                         Flickr URL
0      A. A.  http://www.flickr.com/photos/britishlibrary/ta...
1  A., A. A.  http://www.flickr.com/photos/britishlibrary/ta...
2  A., A. A.  http://www.flickr.com/photos/britishlibrary/ta...
3  A., E. S.  http://www.flickr.com/photos/britishlibrary/ta...
4  A., E. S.  http://www.flickr.com/photos/britishlibrary/ta...

Alternatively, we could also remove the columns by passing them to the columns parameter directly, instead of separately specifying the labels to be removed and the axis Pandas should look for them on:

>>> df.drop(columns=to_drop, inplace=True)

This method is more intuitive and readable, and it’s very obvious what this step does.

If you know in advance which columns you’d like to keep, another option is to pass them to the usecols argument of pd.read_csv.
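For instance, a minimal sketch of that approach (the variable name df_slim and the particular columns kept are illustrative choices, not part of the original walkthrough):

>>> df_slim = pd.read_csv('Datasets/BL-Flickr-Images-Book.csv',
...                       usecols=['Identifier', 'Title', 'Author'])
>>> df_slim.columns
Index(['Identifier', 'Title', 'Author'], dtype='object')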

Changing the Index of a DataFrame

A Pandas Index extends the functionality of NumPy arrays to allow for more versatile slicing and labeling. In many cases, it is helpful to use a uniquely valued identifying field of the data as its index.

For example, in the dataset used in the previous section, it’s conceivable that a librarian searching for a record might enter the unique identifier for a book:

>>> df['Identifier'].is_unique
True

Let’s replace the existing index with set_index:

>>> df = df.set_index('Identifier')
>>> df.head()
                Place of Publication Date of Publication  \
206                           London         1879 [1878]
216         London; Virtue & Yorston                1868
218                           London                1869
472                           London                1851
480                           London                1857

                        Publisher  \
206              S. Tinsley & Co.
216                  Virtue & Co.
218         Bradbury, Evans & Co.
472                 James Darling
480          Wertheim & Macintosh

                                                        Title     Author  \
206                         Walter Forbes. [A novel.] By A. A      A. A.
216         All for Greed. [A novel. The dedication signed...  A., A. A.
218         Love the Avenger. By the author of “All for Gr...  A., A. A.
472         Welsh Sketches, chiefly ecclesiastical, to the...  A., E. S.
480         [The World in which I live, and my place in it...  A., E. S.

                                                   Flickr URL
206         http://www.flickr.com/photos/britishlibrary/ta...
216         http://www.flickr.com/photos/britishlibrary/ta...
218         http://www.flickr.com/photos/britishlibrary/ta...
472         http://www.flickr.com/photos/britishlibrary/ta...
480         http://www.flickr.com/photos/britishlibrary/ta...

Technical detail: Unlike primary keys in SQL, a Pandas Index doesn’t make any guarantee of being unique, although many indexing and merging operations will run faster if it is.

We can access each record in a straightforward way with loc[]. Although .loc[] may not have all that intuitive of a name, it allows us to do label-based indexing, that is, labeling a row or record without regard to its position:

>>> df.loc[206]
Place of Publication                                               London
Date of Publication                                           1879 [1878]
Publisher                                                S. Tinsley & Co.
Title                                   Walter Forbes. [A novel.] By A. A
Author                                                              A. A.
Flickr URL              http://www.flickr.com/photos/britishlibrary/ta...
Name: 206, dtype: object

In other words, 206 is the first label of the index. To access it by position, we could use df.iloc[0], which does position-based indexing.
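As a quick cross-check, here is a small sketch pulling a single cell both ways (the value comes from the record shown above; Publisher is the third column after setting the index):

>>> df.loc[206, 'Publisher']   # label-based lookup
'S. Tinsley & Co.'
>>> df.iloc[0, 2]              # the same cell by position
'S. Tinsley & Co.'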

Technical details:.loc[] is technically an instance of a class that has some special syntax that does not fully conform to most ordinary Python instance methods.

Previously, our index was a RangeIndex: integers starting from 0, analogous to Python’s built-in range. By passing a column name to set_index, we have changed the index to the values in Identifier.

You may have noticed that we reassigned the variable to the object returned by the method with df = df.set_index(...). This is because, by default, the method returns a modified copy of our object and does not make the changes directly to it. We can avoid this by setting the inplace parameter:

df.set_index('Identifier', inplace=True)

Tidying Up Fields in the Data

At this point, we have removed unnecessary columns and changed the DataFrame index to a more meaningful column. In this section, we’ll clean up specific columns into a uniform format to better understand the dataset and enforce consistency. Specifically, we will clean up the Date of Publication and Place of Publication columns.

Upon inspection, all of the data types are currently the object dtype, which is roughly analogous to str in native Python.

It encapsulates any field that can’t be neatly fit as numerical or categorical data. This makes sense, since the data we are working with is initially a bunch of messy strings:

>>> df.get_dtype_counts()
object    6
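Note: get_dtype_counts() was deprecated and later removed from Pandas (in 1.0). If it is missing from your version, the following call reports the same information (the exact output format varies slightly across versions):

>>> df.dtypes.value_counts()
object    6
dtype: int64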

One field where it makes sense to enforce a numeric value is the date of publication, so that we can do calculations down the road:

>>> df.loc[1905:, 'Date of Publication'].head(10)
Identifier
1905           1888
1929    1839, 38-54
2836        [1897?]
2854           1865
2956        1860-63
2957           1873
3017           1866
3131           1899
4598           1814
4884           1820
Name: Date of Publication, dtype: object

A book can only have one publication date, so we need to do the following:

  • Remove the extra dates in square brackets wherever present, e.g., 1879 [1878]
  • Convert date ranges to their start date, e.g., 1860-63; 1839, 38-54
  • Completely remove any dates we are not certain about and replace them with NumPy’s NaN, e.g., [1897?]
  • Convert the string nan to NumPy’s NaN value

Synthesizing these patterns, we can actually take advantage of a single regular expression to extract the publication year:

regex = r'^(\d{4})'

This regular expression is meant to find any four digits at the beginning of a string, which suffices for our case. The above is a raw string (meaning that a backslash is no longer an escape character), which is standard practice with regular expressions.

The \d represents any digit, and {4} repeats this rule four times. The ^ character matches the start of a string, and the parentheses denote a capturing group, which signals to Pandas that we want to extract that part of the regex. (We want ^ to avoid cases where [ starts off the string.)
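To sanity-check the pattern on a couple of raw values with Python’s built-in re module (a quick aside, not part of the cleaning pipeline itself):

>>> import re
>>> re.search(r'^(\d{4})', '1879 [1878]').group(1)
'1879'
>>> re.search(r'^(\d{4})', '[1897?]') is None   # no match: starts with '['
True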

Now let’s see what happens when we run the expression in the dataset:

>>> extr = df['Date of Publication'].str.extract(r'^(\d{4})', expand=False)
>>> extr.head()
Identifier
206    1879
216    1868
218    1869
472    1851
480    1857
Name: Date of Publication, dtype: object

Not familiar with regex? You can inspect the expression above at regex101.com and read more in the Python Regular Expressions HOWTO.

Technically, this column still has object dtype, but we can easily get its numeric version with pd.to_numeric:

>>> df['Date of Publication'] = pd.to_numeric(extr)
>>> df['Date of Publication'].dtype
dtype('float64')

This results in about one in every ten values being missing, which is a small price to pay for now being able to do computations on the remaining valid values:

>>> df['Date of Publication'].isnull().sum() / len(df)
0.11717147339205986

Very good! This section is complete!

Combining str Methods with NumPy to Clean Columns

In the last section, you may have noticed that we used df['Date of Publication'].str. This attribute is a way to access fast Pandas string operations that largely mimic operations on native Python strings or compiled regular expressions, such as .split(), .replace(), and .capitalize().

To clean the Place of Publication field, we can combine Pandas’ str methods with NumPy’s np.where function, which is basically a vectorized form of Excel’s IF() macro. It has the following syntax:

>>> np.where(condition, then, else)

Here, condition is either an array-like object or a boolean mask. then is the value to use if condition evaluates to True, and else is the value to use otherwise.

Essentially, .where() checks each element in the object to see whether condition evaluates to True for it, and returns an ndarray containing the then or else value, whichever applies.

It can also be used in nested if-THEN statements, allowing us to evaluate on multiple conditions:

>>> np.where(condition1, x1,
...     np.where(condition2, x2,
...         np.where(condition3, x3, ...)))
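Here is a minimal toy example of the non-nested form (the Series values are invented for illustration):

>>> s = pd.Series(['London; Virtue & Yorston', 'Oxford', 'Derby'])
>>> np.where(s.str.contains('London'), 'London', s)
array(['London', 'Oxford', 'Derby'], dtype=object)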

We’ll use these two functions to clean up the Place of Publication column because it contains strings. Here is the contents of the column:

>>> df['Place of Publication'].head(10)
Identifier
206                                  London
216                London; Virtue & Yorston
218                                  London
472                                  London
480                                  London
481                                  London
519                                  London
667     pp. 40. G. Bryan & Co: Oxford, 1898
874                                 London]
1143                                 London
Name: Place of Publication, dtype: object

We see that for some rows, the place of publication is surrounded by other unnecessary information. If we were to look at more values, we would see that this is the case only for some rows whose place of publication is ‘London’ or ‘Oxford’.

Let’s look at two specific pieces of data:

>>> df.loc[4157862]
Place of Publication                                  Newcastle-upon-Tyne
Date of Publication                                                  1867
Publisher                                                      T. Fordyce
Title                   Local Records; or, Historical Register of rema...
Author                                                        T.  Fordyce
Flickr URL              http://www.flickr.com/photos/britishlibrary/ta...
Name: 4157862, dtype: object

>>> df.loc[4159587]
Place of Publication                                  Newcastle upon Tyne
Date of Publication                                                  1834
Publisher                                                Mackenzie & Dent
Title                   An historical, topographical and descriptive v...
Author                                               E. (Eneas) Mackenzie
Flickr URL              http://www.flickr.com/photos/britishlibrary/ta...
Name: 4159587, dtype: object

These two books were published in the same place, but one has hyphens in the name of the place while the other does not.

To clean up this column at once, we can use str.contains() to obtain a Boolean mask.

We clean the column as follows:

>>> pub = df['Place of Publication']
>>> london = pub.str.contains('London')
>>> london[:5]
Identifier
206    True
216    True
218    True
472    True
480    True
Name: Place of Publication, dtype: bool

>>> oxford = pub.str.contains('Oxford')

We combine them with np.where:

>>> df['Place of Publication'] = np.where(london, 'London',
...                                       np.where(oxford, 'Oxford',
...                                                pub.str.replace('-', ' ')))

>>> df['Place of Publication'].head()
Identifier
206    London
216    London
218    London
472    London
480    London
Name: Place of Publication, dtype: object

Here, the np.where function is called in a nested structure, with condition being a Series of booleans obtained with str.contains(). The contains() method works similarly to the built-in in keyword used to find the occurrence of an entity in an iterable (or of a substring in a string).
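As a tiny illustration of that analogy (values invented for the example):

>>> 'London' in 'London; Virtue & Yorston'
True
>>> pd.Series(['London; Virtue & Yorston', 'Derby']).str.contains('London')
0     True
1    False
dtype: bool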

The replacement to be used is a string representing our desired place of publication. We also replace hyphens with spaces using str.replace() and reassign the result to the column in our DataFrame.

Although there is still a lot of dirty data in this dataset, we are only talking about these two columns for now.

Let’s revisit the first five items, which look a lot clearer than when we started:

>>> df.head()
           Place of Publication Date of Publication              Publisher  \
206                      London                1879        S. Tinsley & Co.
216                      London                1868           Virtue & Co.
218                      London                1869  Bradbury, Evans & Co.
472                      London                1851          James Darling
480                      London                1857   Wertheim & Macintosh

                                                        Title     Author  \
206                         Walter Forbes. [A novel.] By A. A      A. A.
216         All for Greed. [A novel. The dedication signed...  A., A. A.
218         Love the Avenger. By the author of “All for Gr...  A., A. A.
472         Welsh Sketches, chiefly ecclesiastical, to the...  A., E. S.
480         [The World in which I live, and my place in it...  A., E. S.

                                                   Flickr URL
206         http://www.flickr.com/photos/britishlibrary/ta...
216         http://www.flickr.com/photos/britishlibrary/ta...
218         http://www.flickr.com/photos/britishlibrary/ta...
472         http://www.flickr.com/photos/britishlibrary/ta...
480         http://www.flickr.com/photos/britishlibrary/ta...

Note: At this point, Place of Publication would be a good candidate for conversion to a Categorical dtype, because we can encode the fairly small unique set of cities with integers. (The memory usage of a Categorical is proportional to the number of categories plus the length of the data; an object dtype is a constant times the length of the data.)
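If you want to try that conversion, it is a one-liner (a sketch; the tutorial itself leaves the column as object):

>>> df['Place of Publication'] = df['Place of Publication'].astype('category')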

Cleaning the Entire Dataset Using the applymap Function

In some cases, you will find dirty data not just in a single column, but scattered throughout the entire data set.

Sometimes it is helpful to apply a customized function to each cell or element of a DataFrame. Pandas’ .applymap() method is similar to the built-in map() function, except that it applies a function to every element in the DataFrame.

Let’s look at an example where we will create a DataFrame from the “university_towns.txt” file.

$ head Datasets/university_towns.txt
Alabama[edit]
Auburn (Auburn University)[1]
Florence (University of North Alabama)
Jacksonville (Jacksonville State University)[2]
Livingston (University of West Alabama)[2]
Montevallo (University of Montevallo)[2]
Troy (Troy University)[2]
Tuscaloosa (University of Alabama, Stillman College, Shelton State)[3][4]
Tuskegee (Tuskegee University)[5]
Alaska[edit]

We see that we have periodic state names followed by the university towns in that state: StateA TownA1 TownA2 StateB TownB1 TownB2... If we look at the way the state names are written in the file, we’ll see that all of them have the “[edit]” substring in them.

We can take advantage of this pattern by creating a list of (state, city) tuples and wrapping that list in a DataFrame:

>>> university_towns = []
>>> with open('Datasets/university_towns.txt') as file:
...     for line in file:
...         if '[edit]' in line:
...             # Remember this `state` until the next is found
...             state = line
...         else:
...             # Otherwise, we have a city; keep `state` as last-seen
...             university_towns.append((state, line))

>>> university_towns[:5]
[('Alabama[edit]\n', 'Auburn (Auburn University)[1]\n'),
 ('Alabama[edit]\n', 'Florence (University of North Alabama)\n'),
 ('Alabama[edit]\n', 'Jacksonville (Jacksonville State University)[2]\n'),
 ('Alabama[edit]\n', 'Livingston (University of West Alabama)[2]\n'),
 ('Alabama[edit]\n', 'Montevallo (University of Montevallo)[2]\n')]

We can wrap this list in a DataFrame and set the columns as “State” and “RegionName”. Pandas will take each element in the list and set the left value to State and the right value to RegionName.

The generated DataFrame is as follows:

>>> towns_df = pd.DataFrame(university_towns,
...                         columns=['State', 'RegionName'])

>>> towns_df.head()
             State                                         RegionName
0  Alabama[edit]\n                    Auburn (Auburn University)[1]\n
1  Alabama[edit]\n           Florence (University of North Alabama)\n
2  Alabama[edit]\n  Jacksonville (Jacksonville State University)[2]\n
3  Alabama[edit]\n       Livingston (University of West Alabama)[2]\n
4  Alabama[edit]\n         Montevallo (University of Montevallo)[2]\n

While we could have cleaned these strings in the for loop above, Pandas makes it easy. We only need the state name and the town name and can remove everything else. While we could use Pandas’ .str() methods again here (see the sketch below), we could also use applymap() to map a Python callable to each element of the DataFrame.
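For reference, here is a sketch of that .str route (assuming a reasonably recent Pandas where str.replace accepts a regex parameter; an alternative, not the path the tutorial takes):

>>> cleaned = towns_df.copy()
>>> # Drop the '[edit]' marker, then strip the trailing newline
>>> cleaned['State'] = cleaned['State'].str.replace(r'\[edit\]', '', regex=True).str.strip()
>>> # Cut each region name at the first ' (' or '[' and trim whitespace
>>> cleaned['RegionName'] = cleaned['RegionName'].str.replace(r' \(.*|\[.*', '', regex=True).str.strip()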

We use the term element all the time, but what does it really mean? Take a look at the following DataFrame example:

        0        1
0    Mock  Dataset
1  Python   Pandas
2    Real   Python
3   NumPy    Clean

In this example, each cell (‘Mock’, ‘Dataset’, ‘Python’, ‘Pandas’, etc.) is an element. The applymap() method applies a function to each of these elements. Suppose the function is defined as:

>>> def get_citystate(item):
...     if ' (' in item:
...         return item[:item.find(' (')]
...     elif '[' in item:
...         return item[:item.find('[')]
...     else:
...         return item

Pandas’ .applymap() takes only one parameter, which is the function (callable) to apply to each element:

>>> towns_df = towns_df.applymap(get_citystate)

First, we define a Python function that takes an element from the DataFrame as its parameter. Inside the function, checks are performed to determine whether the element contains a ( or a [ or not.

The value returned by the function depends on this check. Finally, the applymap() function is called on our DataFrame object. Now our DataFrame object is much more compact.

>>> towns_df.head()
     State    RegionName
0  Alabama        Auburn
1  Alabama      Florence
2  Alabama  Jacksonville
3  Alabama    Livingston
4  Alabama    Montevallo

The applymap() method takes each element from the DataFrame, passes it to the function, and then replaces the original value with the value returned by the function. It’s that simple!

Technical detail: While it is a convenient and versatile method, .applymap() can have significant runtime for larger datasets, because it maps a Python callable to each individual element. In some cases, it can be more efficient to do vectorized operations that utilize Cython or NumPy (which, in turn, make calls in C).
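A version note: in Pandas 2.1 and later, .applymap() is deprecated and has been renamed to DataFrame.map(); if you see a deprecation warning, the call is otherwise identical:

>>> towns_df = towns_df.map(get_citystate)   # Pandas >= 2.1 spelling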

Renaming Columns and Skipping Rows

Often, the datasets you’ll work with will have either column names that are not easy to understand, or unimportant information in the first few and/or last rows, such as definitions of terms or footnotes.

In that case, we’d want to rename the columns and skip certain rows so that we can drill down to the necessary information with correct and sensible labels.

To demonstrate how to do this, let’s first take a look at the initial five rows of the “olympics.csv” dataset:

$ head -n 5 Datasets/olympics.csv
0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
,? Summer,01 !,02 !,03 !,Total,? Winter,01 !,02 !,03 !,Total,? Games,01 !,02 !,03 !,Combined total
Afghanistan (AFG),13,0,0,2,2,0,0,0,0,0,13,0,0,2,2
Algeria (ALG),12,5,2,8,15,3,0,0,0,0,15,5,2,8,15
Argentina (ARG),23,18,24,28,70,18,0,0,0,0,41,18,24,28,70

Now, we’ll read it into a Pandas DataFrame:

>>> olympics_df = pd.read_csv('Datasets/olympics.csv')
>>> olympics_df.head()
                   0         1     2     3     4      5         6     7     8  \
0                NaN  ? Summer  01 !  02 !  03 !  Total  ? Winter  01 !  02 !
1  Afghanistan (AFG)        13     0     0     2      2         0     0     0
2      Algeria (ALG)        12     5     2     8     15         3     0     0
3    Argentina (ARG)        23    18    24    28     70        18     0     0
4      Armenia (ARM)         5     1     2     9     12         6     0     0

      9     10       11    12    13    14              15
0  03 !  Total  ? Games  01 !  02 !  03 !  Combined total
1     0      0       13     0     0     2               2
2     0      0       15     5     2     8              15
3     0      0       41    18    24    28              70
4     0      0       11     1     2     9              12

This is truly messy! The columns are the string form of integers indexed at 0. The row which should have been our header (i.e. the one to be used to set the column names) is at olympics_df.iloc[0]. This happened because our CSV file starts with 0, 1, 2, ..., 15.

Also, if we were to go to the source of this dataset, we’d see that NaN above should really be something like “Country”, ? Summer is supposed to stand for “Summer Games”, 01 ! should be “Gold”, and so on.

So, we need to do the following two things:

  • Skip one row and set the header as the first (0-indexed) row
  • Rename the columns

We can skip rows and set the header while reading the CSV file by passing some parameters to the read_csv() function.

This function takes a lot of optional parameters, but in this case we only need one (header) to remove the 0th row:

>>> olympics_df = pd.read_csv('Datasets/olympics.csv', header=1)
>>> olympics_df.head()
                Unnamed: 0  ? Summer  01 !  02 !  03 !  Total  ? Winter  \
0        Afghanistan (AFG)        13     0     0     2      2         0
1            Algeria (ALG)        12     5     2     8     15         3
2          Argentina (ARG)        23    18    24    28     70        18
3            Armenia (ARM)         5     1     2     9     12         6
4  Australasia (ANZ) [ANZ]         2     3     4     5     12         0

   01 !.1  02 !.1  03 !.1  Total.1  ? Games  01 !.2  02 !.2  03 !.2  \
0       0       0       0        0       13       0       0       2
1       0       0       0        0       15       5       2       8
2       0       0       0        0       41      18      24      28
3       0       0       0        0       11       1       2       9
4       0       0       0        0        2       3       4       5

   Combined total
0               2
1              15
2              70
3              12
4              12

We now have the correct row set as the header and all unnecessary rows removed. Note how Pandas has changed the name of the column containing the names of the countries from NaN to Unnamed: 0.
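As a side note, for this particular file the same result could be achieved by skipping the junk first row and letting Pandas infer the header from the next one:

>>> olympics_df = pd.read_csv('Datasets/olympics.csv', skiprows=1)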

To rename columns, we’ll use the rename() method, which allows you to rename axes based on a mapping (in this case, a dictionary).

Let’s start by defining a dictionary that maps the current column names (as keys) to more usable ones (the dictionary’s values):

>>> new_names = {'Unnamed: 0': 'Country',
...              '? Summer': 'Summer Olympics',
...              '01 !': 'Gold',
...              '02 !': 'Silver',
...              '03 !': 'Bronze',
...              '? Winter': 'Winter Olympics',
...              '01 !.1': 'Gold.1',
...              '02 !.1': 'Silver.1',
...              '03 !.1': 'Bronze.1',
...              '? Games': '# Games',
...              '01 !.2': 'Gold.2',
...              '02 !.2': 'Silver.2',
...              '03 !.2': 'Bronze.2'}

Then call the rename() function:

>>> olympics_df.rename(columns=new_names, inplace=True)

Setting the inplace parameter to True applies the change directly to our DataFrame object. Let’s see if it works:

>>> olympics_df.head()
                   Country  Summer Olympics  Gold  Silver  Bronze  Total  \
0        Afghanistan (AFG)               13     0       0       2      2
1            Algeria (ALG)               12     5       2       8     15
2          Argentina (ARG)               23    18      24      28     70
3            Armenia (ARM)                5     1       2       9     12
4  Australasia (ANZ) [ANZ]                2     3       4       5     12

   Winter Olympics  Gold.1  Silver.1  Bronze.1  Total.1  # Games  Gold.2  \
0                0       0         0         0        0       13       0
1                3       0         0         0        0       15       5
2               18       0         0         0        0       41      18
3                6       0         0         0        0       11       1
4                0       0         0         0        0        2       3

   Silver.2  Bronze.2  Combined total
0         0         2               2
1         2         8              15
2        24        28              70
3         2         9              12
4         4         5              12

Python Data Cleaning: Recap and Resources

In this tutorial, you learned how to use the drop() function to remove unnecessary information, and how to index your dataset to make it easier to reference other items.

In addition, you learned how to clean object fields with the .str accessor and how to clean the entire dataset using the applymap() method. Finally, we explored how to skip rows in a CSV file and rename columns using the rename() method.

It is important to understand data cleansing because it is an important part of data science. You now have a basic understanding of how to clean up datasets using Pandas and NumPy.

Check out the following links to find more resources to continue your Python data science journey:

  • Pandas documentation
  • NumPy documentation
  • Python for Data Analysis by Wes McKinney, the creator of Pandas
  • Pandas Cookbook by data science trainer and consultant Ted Petrou

Every tutorial in Real Python is created by a team of developers, so it meets our high quality standards. The team members participating in this tutorial are Malay Agarwal (author) and Brad Solomon (editor).

