- Pythonic Data Cleaning With NumPy and Pandas
- Malay Agarwal
- The Nuggets Translation Project
- Permanent link to this article: github.com/xitu/gold-m…
- Translator: bambooom
- Proofreader: luochen1992, Hopsken
Pythonic Data Cleaning With NumPy and Pandas
Data scientists spend a lot of time cleaning datasets and getting them into a form they can work with. In fact, many data scientists say that 80% of the job is collecting and cleaning data.
Therefore, whether you are just entering the field or planning to enter it, the ability to deal with messy data is very important, whether that means missing values, inconsistent formatting, malformed records, or nonsensical outliers.
In this tutorial, we will use the libraries Pandas and NumPy to clean up data.
We will introduce the following:
- Dropping unnecessary columns in a DataFrame
- Changing the index of a DataFrame
- Cleaning columns with the .str() accessor
- Cleaning the entire dataset element-wise with the DataFrame.applymap() function
- Renaming columns to more recognizable labels
- Skipping unnecessary rows in a CSV file
These are the data sets we will use:
- BL-Flickr-Images-Book.csv – a CSV file containing information about books from the British Library
- university_towns.txt – a text file containing the names of college towns in every US state
- olympics.csv – a CSV file summarizing the participation of all countries in the Summer and Winter Olympics
You can download all the datasets from Real Python’s GitHub repository in order to follow the examples below.
Note: I recommend using Jupyter Notebook for the following steps.
This tutorial assumes a basic understanding of the Pandas and NumPy libraries, including Pandas’ workhorse Series and DataFrame objects, common methods that can be applied to them, and familiarity with NumPy’s NaN values.
Let’s start by importing these modules!
>>> import pandas as pd
>>> import numpy as np
Dropping Unnecessary Columns in a DataFrame
You will often find that not all categories of data in a dataset are useful to you. For example, you might have a data set that contains student information (name, grades, standards, parents’ names and addresses), but you want to focus on analyzing student grades.
In this case, the address and parents’ names are not important to you, and keeping these categories will take up unnecessary space and may slow down the run time.
Pandas provides a handy drop() function to remove columns or rows from a DataFrame. Let’s look at a simple example of removing columns from a DataFrame.
First, we create a DataFrame from the CSV file “BL-Flickr-Images-Book.csv”. In the following example, we pass a relative path to pd.read_csv; the datasets are all stored in a Datasets folder under the current working directory:
>>> df = pd.read_csv('Datasets/BL-Flickr-Images-Book.csv')
>>> df.head()
Identifier Edition Statement Place of Publication \
0 206 NaN London
1 216 NaN London; Virtue & Yorston
2 218 NaN London
3 472 NaN London
4 480 A new edition, revised, etc. London
Date of Publication Publisher \
0 1879 [1878] S. Tinsley & Co.
1 1868 Virtue & Co.
2 1869 Bradbury, Evans & Co.
3 1851 James Darling
4 1857 Wertheim & Macintosh
Title Author \
0 Walter Forbes. [A novel.] By A. A A. A.
1 All for Greed. [A novel. The dedication signed... A., A. A.
2 Love the Avenger. By the author of "All for Gr... A., A. A.
3 Welsh Sketches, chiefly ecclesiastical, to the... A., E. S.
4 [The World in which I live, and my place in it... A., E. S.
Contributors Corporate Author \
0 FORBES, Walter. NaN
1 BLAZE DE BURY, Marie Pauline Rose - Baroness NaN
2 BLAZE DE BURY, Marie Pauline Rose - Baroness NaN
3 Appleyard, Ernest Silvanus. NaN
4 BROOME, John Henry. NaN
Corporate Contributors Former owner Engraver Issuance type \
0 NaN NaN NaN monographic
1 NaN NaN NaN monographic
2 NaN NaN NaN monographic
3 NaN NaN NaN monographic
4 NaN NaN NaN monographic
Flickr URL \
0 http://www.flickr.com/photos/britishlibrary/ta...
1 http://www.flickr.com/photos/britishlibrary/ta...
2 http://www.flickr.com/photos/britishlibrary/ta...
3 http://www.flickr.com/photos/britishlibrary/ta...
4 http://www.flickr.com/photos/britishlibrary/ta...
Shelfmarks
0 British Library HMNTS 12641.b.30.
1 British Library HMNTS 12626.cc.2.
2 British Library HMNTS 12625.dd.1.
3 British Library HMNTS 10369.bbb.15.
4 British Library HMNTS 9007.d.28.
When we look at the first five entries with the head() method, we can see that several columns provide ancillary information that would be helpful to the library but isn’t very descriptive of the books themselves: Edition Statement, Corporate Author, Corporate Contributors, Former owner, Engraver, Issuance type, and Shelfmarks.
We can delete these columns as follows:
>>> to_drop = ['Edition Statement',
...            'Corporate Author',
...            'Corporate Contributors',
...            'Former owner',
...            'Engraver',
...            'Contributors',
...            'Issuance type',
...            'Shelfmarks']
>>> df.drop(to_drop, inplace=True, axis=1)
Here, we define a list containing the names of the columns we want to delete. We then call the drop() function, passing in inplace=True and axis=1. These two arguments tell Pandas to apply the changes directly to our object and to look for the values to drop on the columns axis.
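Incidentally, the same method can drop rows: axis defaults to 0, so passing index labels removes the corresponding rows instead. A quick illustrative sketch (these labels are for demonstration only, not part of this cleanup):
>>> df.drop([0, 1], axis=0).head()  # returns a copy with the rows labeled 0 and 1 removed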
Looking at the DataFrame again, you can see that the unwanted column has been removed:
>>> df.head()
Identifier Place of Publication Date of Publication \
0 206 London 1879 [1878]
1 216 London; Virtue & Yorston 1868
2 218 London 1869
3 472 London 1851
4 480 London 1857
Publisher Title \
0 S. Tinsley & Co. Walter Forbes. [A novel.] By A. A
1 Virtue & Co. All for Greed. [A novel. The dedication signed...
2 Bradbury, Evans & Co. Love the Avenger. By the author of "All for Gr...
3 James Darling Welsh Sketches, chiefly ecclesiastical, to the...
4 Wertheim & Macintosh [The World in which I live, and my place in it...
Author Flickr URL
0 A. A. http://www.flickr.com/photos/britishlibrary/ta...
1 A., A. A. http://www.flickr.com/photos/britishlibrary/ta...
2 A., A. A. http://www.flickr.com/photos/britishlibrary/ta...
3 A., E. S. http://www.flickr.com/photos/britishlibrary/ta...
4 A., E. S. http://www.flickr.com/photos/britishlibrary/ta...
Alternatively, we can delete the columns by passing them directly to the columns parameter, rather than separately specifying the labels to remove and the axis Pandas should look for them on:
>>> df.drop(columns=to_drop, inplace=True)
This method is more intuitive and readable, and it’s very obvious what this step does.
If you know in advance which columns you need to keep, another option is to pass them to the usecols parameter of pd.read_csv.
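A minimal sketch of that approach, keeping the same seven columns we ended up with above:
>>> keep = ['Identifier', 'Place of Publication', 'Date of Publication',
...         'Publisher', 'Title', 'Author', 'Flickr URL']
>>> df = pd.read_csv('Datasets/BL-Flickr-Images-Book.csv', usecols=keep)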
Changing the Index of a DataFrame
Pandas’ Index extends the functionality of NumPy arrays to allow for more versatile slicing and labeling. In many cases, it is helpful to use a uniquely identifying field of the data as its index.
For example, in the dataset used in the previous section, it is conceivable that a librarian searching for a record might enter a book’s unique identifier:
>>> df['Identifier'].is_unique
True
Let’s replace the existing index with set_index:
>>> df = df.set_index('Identifier')
>>> df.head()
Place of Publication Date of Publication \
206 London 1879 [1878]
216 London; Virtue & Yorston 1868
218 London 1869
472 London 1851
480 London 1857
Publisher \
206 S. Tinsley & Co.
216 Virtue & Co.
218 Bradbury, Evans & Co.
472 James Darling
480 Wertheim & Macintosh
Title Author \
206 Walter Forbes. [A novel.] By A. A A. A.
216 All for Greed. [A novel. The dedication signed... A., A. A.
218 Love the Avenger. By the author of "All for Gr... A., A. A.
472 Welsh Sketches, chiefly ecclesiastical, to the... A., E. S.
480 [The World in which I live, and my place in it... A., E. S.
Flickr URL
206 http://www.flickr.com/photos/britishlibrary/ta...
216 http://www.flickr.com/photos/britishlibrary/ta...
218 http://www.flickr.com/photos/britishlibrary/ta...
472 http://www.flickr.com/photos/britishlibrary/ta...
480 http://www.flickr.com/photos/britishlibrary/ta...
Technical details: Unlike primary keys in SQL, Pandas’ Index makes no guarantee of being unique, although many indexing and merging operations will run faster if it is.
We can access each record directly with loc[]. Although .loc[] may not have the most intuitive name, it allows us to do label-based indexing, i.e., to address a row or record by its label regardless of its position:
>>> df.loc[206]
Place of Publication London
Date of Publication 1879 [1878]
Publisher S. Tinsley & Co.
Title Walter Forbes. [A novel.] By A. A
Author A. A.
Flickr URL http://www.flickr.com/photos/britishlibrary/ta...
Name: 206, dtype: object
In other words, 206 is the first label of the index. To access it by position, we can use df.iloc[0], which does position-based indexing.
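Since 206 is the label of the row at position 0, the two lookups should return the same record; a quick sanity check:
>>> df.loc[206].equals(df.iloc[0])
True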
Technical details:.loc[] is technically an instance of a class that has some special syntax that does not fully conform to most ordinary Python instance methods.
At first, our index is a RangeIndex: integers starting from 0, analogous to Python’s built-in range. By passing a column name to set_index, we changed the index to the values in the Identifier column.
You may have noticed that we reassigned the variable to the value returned by the method with df = df.set_index(...). This is because, by default, the method returns a modified copy and does not make the changes directly to the original object. We can avoid this by setting the inplace parameter:
df.set_index('Identifier', inplace=True)
Tidying Up Fields in the Data
At this point, we have removed unnecessary columns and changed the DataFrame index to a more meaningful column. In this section, we’ll clean up specific columns into a uniform format to better understand the dataset and enforce consistency. Specifically, we will clean up the Date of Publication and Place of Publication columns.
Upon inspection, all of the data types are currently object dtype, which is roughly analogous to str in native Python.
It encapsulates any field that can’t be neatly fit as numeric or categorical data. This makes sense, since the data we’re working with is initially just a jumble of characters:
>>> df.get_dtype_counts()
object 6
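(Side note: get_dtype_counts() has been removed in newer versions of Pandas; if the call above fails for you, the dtypes attribute gives the same summary.)
>>> df.dtypes.value_counts()
object    6
dtype: int64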
One field where it makes sense to enforce a numeric type is the date of publication, so that we can do calculations down the road:
>>> df.loc[1905:, 'Date of Publication'].head(10)
Identifier
1905 1888
1929 1839, 38-54
2836 [1897?]
2854 1865
2956 1860-63
2957 1873
3017 1866
3131 1899
4598 1814
4884 1820
Name: Date of Publication, dtype: object
A book can only have one publication date, so we need to do the following:
- Remove the extra dates in square brackets wherever present, e.g., 1879 [1878]
- Convert date ranges to their “start date”, e.g., 1860-63; 1839, 38-54
- Completely remove dates we are not certain about and replace them with NumPy’s NaN, e.g., [1897?]
- Convert the string nan to NumPy’s NaN as well
Synthesizing these patterns, we can actually take advantage of a single regular expression to extract the publication year:
regex = r'^(\d{4})'
This regular expression is intended to find four digits at the beginning of the string, which is sufficient for our purposes. Above is a raw string (meaning that the backslash is no longer an escape character), which is standard practice for regular expressions.
\d represents any number, {4} represents the repetition of 4 times, ^ represents the beginning of the matching string, and the parentheses represent a capture group that indicates to Pandas that we want to extract this part of the regular expression. (We want ^ to avoid the case where the string starts with [.)
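Before running it against the whole column, we can try the pattern on a couple of sample values with Python’s built-in re module (a quick sketch):
>>> import re
>>> re.match(r'^(\d{4})', '1879 [1878]').group(1)
'1879'
>>> re.match(r'^(\d{4})', '[1897?]') is None
True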
Now let’s see what happens when we run the expression in the dataset:
>>> extr = df['Date of Publication'].str.extract(r'^(\d{4})', expand=False)
>>> extr.head()
Identifier
206 1879
216 1868
218 1869
472 1851
480 1857
Name: Date of Publication, dtype: object
Not familiar with regex? You can inspect this expression at regex101.com and read more in the Python Regular Expressions HOWTO.
Technically, this column still has object dtype, but we can easily get its numeric version with pd.to_numeric:
>>> df['Date of Publication'] = pd.to_numeric(extr)
>>> df['Date of Publication'].dtype
dtype('float64')
Copy the code
Doing so results in about one in ten values being missing, which is a small price to pay for being able to do computations on the remaining valid values:
>>> df['Date of Publication'].isnull().sum() / len(df)
0.11717147339205986
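If you want to eyeball the records that failed to parse before moving on, one quick sketch is to filter on the null mask (output omitted here):
>>> df[df['Date of Publication'].isnull()].head()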
Very good! This section is complete!
Combining str Methods with NumPy to Clean Columns
In the last section, you may have noticed the use of df['Date of Publication'].str. This attribute is a way to access fast string operations in Pandas that largely mimic operations on native Python strings or compiled regular expressions, such as .split(), .replace(), and .capitalize().
To clean the Place of Publication field, we can combine Pandas’ str methods with NumPy’s np.where function, which is basically a vectorized form of Excel’s IF() macro. Its syntax is as follows:
>>> np.where(condition, then, else)
Here, condition can be an array-like object or a Boolean mask, using then values if condition is True and else values otherwise.
In essence, np.where() checks each element in the object to see whether condition is True, and returns an ndarray containing the corresponding then or else value.
It can also be nested into compound if-then statements, allowing us to compute values based on multiple conditions:
>>> np.where(condition1, x1,
        np.where(condition2, x2,
            np.where(condition3, x3, ...)))
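As a minimal illustration on toy data:
>>> grades = np.array([70, 45, 90])
>>> np.where(grades >= 60, 'pass', 'fail')
array(['pass', 'fail', 'pass'], dtype='<U4')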
We’ll use these two functions to clean up the Place of Publication column because it contains strings. Here is the contents of the column:
>>> df['Place of Publication'].head(10)
Identifier
206 London
216 London; Virtue & Yorston
218 London
472 London
480 London
481 London
519 London
667 pp. 40. G. Bryan & Co: Oxford, 1898
874 London]
1143 London
Name: Place of Publication, dtype: object
We see that for some rows, the place of publication is surrounded by other, unnecessary information. If we looked at more values, we would see that this is only the case for rows whose place of publication is ‘London’ or ‘Oxford’.
Let’s look at two specific pieces of data:
>>> df.loc[4157862]
Place of Publication Newcastle-upon-Tyne
Date of Publication 1867
Publisher T. Fordyce
Title Local Records; or, Historical Register of rema...
Author T. Fordyce
Flickr URL http://www.flickr.com/photos/britishlibrary/ta...
Name: 4157862, dtype: object
>>> df.loc[4159587]
Place of Publication Newcastle upon Tyne
Date of Publication 1834
Publisher Mackenzie & Dent
Title An historical, topographical and descriptive v...
Author E. (Eneas) Mackenzie
Flickr URL http://www.flickr.com/photos/britishlibrary/ta...
Name: 4159587, dtype: object
These two books were published in the same place, but one spelling of that place contains hyphens while the other does not.
To clean this column in one pass, we can use str.contains() to get a Boolean mask.
We clear the column as follows:
>>> pub = df['Place of Publication']
>>> london = pub.str.contains('London')
>>> london[:5]
Identifier
206 True
216 True
218 True
472 True
480 True
Name: Place of Publication, dtype: bool
>>> oxford = pub.str.contains('Oxford')
We combine these masks with np.where:
>>> df['Place of Publication'] = np.where(london, 'London',
...                                       np.where(oxford, 'Oxford',
...                                                pub.str.replace('-', ' ')))
>>> df['Place of Publication'].head()
Identifier
206 London
216 London
218 London
472 London
480 London
Name: Place of Publication, dtype: object
Here, the np.where function is called in a nested structure, with condition being the Boolean Series returned by str.contains(). The contains() method works similarly to the built-in in keyword in native Python, which is used to find out whether an entity occurs in an iterable (or whether a substring occurs in a string).
The replacement values are the cleaned-up place names we want. We also use the str.replace() method to replace the hyphens with spaces, and reassign the result to the DataFrame column.
Although there is still a lot of dirty data in this dataset, we are only talking about these two columns for now.
Let’s revisit the first five items, which look a lot clearer than when we started:
>>> df.head()
Place of Publication Date of Publication Publisher \
206 London 1879 S. Tinsley & Co.
216 London 1868 Virtue & Co.
218 London 1869 Bradbury, Evans & Co.
472 London 1851 James Darling
480 London 1857 Wertheim & Macintosh
Title Author \
206 Walter Forbes. [A novel.] By A. A A. A.
216 All for Greed. [A novel. The dedication signed... A., A. A.
218 Love the Avenger. By the author of "All for Gr... A., A. A.
472 Welsh Sketches, chiefly ecclesiastical, to the... A., E. S.
480 [The World in which I live, and my place in it... A., E. S.
Flickr URL
206 http://www.flickr.com/photos/britishlibrary/ta...
216 http://www.flickr.com/photos/britishlibrary/ta...
218 http://www.flickr.com/photos/britishlibrary/ta...
472 http://www.flickr.com/photos/britishlibrary/ta...
480 http://www.flickr.com/photos/britishlibrary/ta...
Note: At this point, Place of Publication would be a good candidate for conversion to a Categorical dtype, because we can encode the fairly small unique set of cities with integers. (The memory usage of a Categorical is proportional to the number of categories plus the length of the data, while an object dtype column is a constant size times the length of the data.)
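The conversion itself would be a single line; a sketch (whether it pays off depends on your data):
>>> df['Place of Publication'] = df['Place of Publication'].astype('category')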
Cleaning the Entire Dataset Using the applymap Function
In some cases, you will find dirty data not just in a single column, but scattered throughout the entire data set.
Sometimes it is helpful to apply a custom function to each unit or element in a DataFrame. Pandas’.applymap() function is similar to the built-in map() function, except that it applies to all elements in the DataFrame.
Let’s look at an example where we will create a DataFrame from the “university_towns.txt” file.
$ head Datasets/university_towns.txt
Alabama[edit]
Auburn (Auburn University)[1]
Florence (University of North Alabama)
Jacksonville (Jacksonville State University)[2]
Livingston (University of West Alabama)[2]
Montevallo (University of Montevallo)[2]
Troy (Troy University)[2]
Tuscaloosa (University of Alabama, Stillman College, Shelton State)[3][4]
Tuskegee (Tuskegee University)[5]
Alaska[edit]
Periodically, the file contains a state name followed by the town names in that state: StateA TownA1 TownA2 StateB TownB1 TownB2... If we look at the way the state names are written, we’ll see that all of them contain the “[edit]” substring.
We can use this pattern to create a list of (state, city) tuples and put them into the DataFrame.
>>> university_towns = []
>>> with open('Datasets/university_towns.txt') as file:
...     for line in file:
...         if '[edit]' in line:
...             # Remember this `state` until the next is found
...             state = line
...         else:
...             # Otherwise, we have a city; keep `state` as last-seen
...             university_towns.append((state, line))

>>> university_towns[:5]
[('Alabama[edit]\n', 'Auburn (Auburn University)[1]\n'),
 ('Alabama[edit]\n', 'Florence (University of North Alabama)\n'),
 ('Alabama[edit]\n', 'Jacksonville (Jacksonville State University)[2]\n'),
 ('Alabama[edit]\n', 'Livingston (University of West Alabama)[2]\n'),
 ('Alabama[edit]\n', 'Montevallo (University of Montevallo)[2]\n')]
We can wrap this list in a DataFrame and set the columns to “State” and “RegionName”. Pandas takes each element of every tuple and puts the left value in the State column and the right value in the RegionName column.
The generated DataFrame is as follows:
>>> towns_df = pd.DataFrame(university_towns,
... columns=['State'.'RegionName'])
>>> towns_df.head()
State RegionName
0 Alabama[edit]\n Auburn (Auburn University)[1]\n
1 Alabama[edit]\n Florence (University of North Alabama)\n
2 Alabama[edit]\n Jacksonville (Jacksonville State University)[2]\n
3 Alabama[edit]\n Livingston (University of West Alabama)[2]\n
4 Alabama[edit]\n Montevallo (University of Montevallo)[2]\n
While we could clean these strings in the for loop above, it is more convenient to do it in Pandas. We only need the state names and the town names; everything else can be removed. Although we could use Pandas’ .str() methods again here, we can also use the applymap() method to map a Python callable to each element of the DataFrame.
We use the term element all the time, but what does it really mean? Take a look at the following DataFrame example:
0 1
0 Mock Dataset
1 Python Pandas
2 Real Python
3 NumPy Clean
In this example, each cell (‘Mock’, ‘Dataset’, ‘Python’, ‘Pandas’, etc.) is an element, so the applymap() method applies a function to each of them. Suppose that function is defined as:
>>> def get_citystate(item):
...     if '(' in item:
...         return item[:item.find('(')]
...     elif '[' in item:
...         return item[:item.find('[')]
...     else:
...         return item
Pandas’.applymap() accepts only one argument, the (callable) function that will operate on each element:
>>> towns_df = towns_df.applymap(get_citystate)
First, we define a Python function that takes an element of the DataFrame as its argument. Inside the function, a check is performed to determine whether the element contains a ( or a [.
The value returned by the function depends on this check. Finally, the applymap() function is called on our DataFrame object. Now our DataFrame object is much more compact.
>>> towns_df.head()
State RegionName
0 Alabama Auburn
1 Alabama Florence
2 Alabama Jacksonville
3 Alabama Livingston
4 Alabama Montevallo
The applymap() method takes each element from the DataFrame, passes it to the function, and then replaces the original value with the value returned by the function. It’s that simple!
Technical details: While .applymap() is a convenient and versatile method, it can have significant runtime cost on larger datasets, because it maps a Python callable to each individual element. In some cases, it is more efficient to perform operations in a vectorized way that uses Cython or NumPy (which, in turn, make calls in C).
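For instance, one possible vectorized rewrite of the get_citystate cleanup uses Pandas’ string methods with a regular expression. This is a sketch: the regex is ours rather than from the original tutorial, and raw_towns stands for a fresh, uncleaned copy of the data:
>>> raw_towns = pd.DataFrame(university_towns, columns=['State', 'RegionName'])
>>> for col in ['State', 'RegionName']:
...     raw_towns[col] = (raw_towns[col]
...                       .str.replace(r'\s*[\(\[].*', '', regex=True)  # strip '(...)' / '[...]' tails
...                       .str.strip())  # trim trailing newlines and spaces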
Column renaming and row skipping
Often, the datasets you need to work with will have unclear column names, or unimportant information in the first few rows (such as definitions of terms) or at the end (such as footnotes).
In this case, we want to rename the columns and skip some rows so that we can drill down only to the necessary information and tags that make sense.
To illustrate how we can deal with this, let’s take a look at the first five rows of the “olympics.csv” dataset:
$ head -n 5 Datasets/olympics.csv
0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
,? Summer,01 !,02 !,03 !,Total,? Winter,01 !,02 !,03 !,Total,? Games,01 !,02 !,03 !,Combined total
Afghanistan (AFG),13,0,0,2,2,0,0,0,0,0,13,0,0,2,2
Algeria (ALG),12,5,2,8,15,3,0,0,0,0,15,5,2,8,15
Argentina (ARG),23,18,24,28,70,18,0,0,0,0,41,18,24,28,70
Then, read it into Pandas’ DataFrame:
>>> olympics_df = pd.read_csv('Datasets/olympics.csv')
>>> olympics_df.head()
0 1 2 3 4 5 6 7 8 \
0 NaN ? Summer 01 ! 02 ! 03 ! Total ? Winter 01 ! 02 !
1 Afghanistan (AFG) 13 0 0 2 2 0 0 0
2 Algeria (ALG) 12 5 2 8 15 3 0 0
3 Argentina (ARG) 23 18 24 28 70 18 0 0
4 Armenia (ARM) 5 1 2 9 12 6 0 0
9 10 11 12 13 14 15
0 03 ! Total ? Games 01 ! 02 ! 03 ! Combined total
1 0 0 13 0 0 2 2
2 0 0 15 5 2 8 15
3 0 0 41 18 24 28 70
4 0 0 11 1 2 9 12
This is messy indeed! The columns are string representations of integers indexed from 0. The row that should be our header (i.e., the row to set as the column names) is at olympics_df.iloc[0], because our CSV file starts with the row 0, 1, 2, ..., 15.
Also, if we went to the source of this dataset, we’d see that NaN above should really be something like “Country”, ? Summer should stand for “Summer Games”, 01 ! should be “Gold”, and so on.
So, we need to do the following two things:
- Skip one row and set the header to the row at index 1, which holds the real column names
- Rename the columns
We can skip a line and set the header when reading the CSV file by passing some arguments to the read_csv() function.
This function takes many optional parameters, but in this case we only need one (header) to remove row 0:
>>> olympics_df = pd.read_csv('Datasets/olympics.csv', header=1)
>>> olympics_df.head()
Unnamed: 0 ? Summer 01 ! 02 ! 03 ! Total ? Winter \
0 Afghanistan (AFG) 13 0 0 2 2 0
1 Algeria (ALG) 12 5 2 8 15 3
2 Argentina (ARG) 23 18 24 28 70 18
3 Armenia (ARM) 5 1 2 9 12 6
4 Australasia (ANZ) [ANZ] 2 3 4 5 12 0
01 !.1 02 !.1 03 !.1 Total.1 ? Games 01 !.2 02 !.2 03 !.2 \
0 0 0 0 0 13 0 0 2
1 0 0 0 0 15 5 2 8
2 0 0 0 0 41 18 24 28
3 0 0 0 0 11 1 2 9
4 0 0 0 0 2 3 4 5
Combined total
0 2
1 15
2 70
3 12
4 12
We now have the correct header row, and all unnecessary rows have been removed. Note how Pandas changed the name of the column containing the country names from NaN to Unnamed: 0.
To rename columns, we’ll use the rename() method, which allows you to rename axes based on a mapping (in this case, a dictionary).
Let’s start by defining a new dictionary that maps the name of the current column as a key to a more usable name (the dictionary value).
>>> new_names = {'Unnamed: 0': 'Country',
...              '? Summer': 'Summer Olympics',
...              '01 !': 'Gold',
...              '02 !': 'Silver',
...              '03 !': 'Bronze',
...              '? Winter': 'Winter Olympics',
...              '01 !.1': 'Gold.1',
...              '02 !.1': 'Silver.1',
...              '03 !.1': 'Bronze.1',
...              '? Games': '# Games',
...              '01 !.2': 'Gold.2',
...              '02 !.2': 'Silver.2',
...              '03 !.2': 'Bronze.2'}
Then call the rename() function:
>>> olympics_df.rename(columns=new_names, inplace=True)
Setting the inplace parameter to True applies the change directly to our DataFrame object. Let’s see if it works:
>>> olympics_df.head()
Country Summer Olympics Gold Silver Bronze Total \
0 Afghanistan (AFG) 13 0 0 2 2
1 Algeria (ALG) 12 5 2 8 15
2 Argentina (ARG) 23 18 24 28 70
3 Armenia (ARM) 5 1 2 9 12
4 Australasia (ANZ) [ANZ] 2 3 4 5 12
Winter Olympics Gold.1 Silver.1 Bronze.1 Total.1 # Games Gold.2 \
0 0 0 0 0 0 13 0
1 3 0 0 0 0 15 5
2 18 0 0 0 0 41 18
3 6 0 0 0 0 11 1
4 0 0 0 0 0 2 3
Silver.2 Bronze.2 Combined total
0 0 2 2
1 2 8 15
2 24 28 70
3 2 9 12
4 4 5 12
Python Data Cleansing: Review and other resources
In this tutorial, you learned how to remove unnecessary information from a dataset with the drop() function and how to set an index so that items in the dataset can be referenced easily.
In addition, you learned how to clean object fields with the .str() accessor and how to clean the entire dataset with the applymap() function. Finally, we explored how to skip rows in a CSV file and rename columns with the rename() method.
It is important to understand data cleansing because it is an important part of data science. You now have a basic understanding of how to clean up datasets using Pandas and NumPy.
Check out the following links to find more resources to continue your Python data science journey:
- Pandas documentation
- NumPy documentation
- Python for Data Analysis by Wes McKinney, the creator of Pandas
- Pandas Cookbook by data science trainer and consultant Ted Petrou
Every tutorial in Real Python is created by a team of developers, so it meets our high quality standards. The team members participating in this tutorial are Malay Agarwal (author) and Brad Solomon (editor).
The Nuggets Translation Project is a community that translates high-quality technical articles from around the Internet, covering Android, iOS, front-end, back-end, blockchain, product, design, artificial intelligence, and more. For more high-quality translations, please follow the Nuggets Translation Project on its official Weibo and Zhihu column.