Pandas is a Python library for data manipulation and analysis. Dataquest.io has put together a tutorial on optimizing pandas’ memory usage, reducing the memory footprint of a baseball game-log dataset by nearly 90% with simple type casts.

Performance is generally not an issue when manipulating small amounts of data (under 100 MB) with pandas. With larger volumes of data (100 MB to several gigabytes), performance problems can stretch run times dramatically, and runs can fail outright for lack of memory.

While tools like Spark can handle large datasets (100 GB to several terabytes), taking full advantage of their power often requires more expensive hardware. And unlike pandas, they lack a rich feature set for high-quality data cleaning, exploration, and analysis. For medium-sized data, we are better off making fuller use of pandas rather than switching to another tool.

In this article, we’ll learn about pandas’ memory usage and how to reduce a DataFrame’s memory footprint by nearly 90% simply by choosing the right data type for each column.

Working with baseball game logs

We will work with 130 years of Major League Baseball (MLB) game data from Retrosheet: www.retrosheet.org/gamelogs/in…

The data was originally split across 127 separate CSV files, but we combined them with csvkit and added column names as the first row. If you want to download the version of the data used in this article, visit: data.world/dataquest/m…

Let’s start by importing the data and looking at the first five rows:
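A minimal sketch of that step (the file name game_logs.csv is an assumption; substitute the path of your combined CSV):

```python
import pandas as pd

# Path to the combined game-log CSV (file name is an assumption).
gl = pd.read_csv('game_logs.csv')
gl.head()
```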

We’ve summarized some of the important columns below, but if you want to see all of them, we’ve also created a data dictionary for the entire dataset: data.world/dataquest/m…

  • date – Date the game was played
  • v_name – Visiting team name
  • v_league – Visiting team league
  • h_name – Home team name
  • h_league – Home team league
  • v_score – Runs scored by the visiting team
  • h_score – Runs scored by the home team
  • v_line_score – Visiting team line score (runs scored in each inning), e.g. 010000(10)00
  • h_line_score – Home team line score (runs scored in each inning), e.g. 010000(10)0X
  • park_id – ID of the park where the game was held
  • attendance – Game attendance

We can use the DataFrame.info() method to get high-level information about the DataFrame, including its size, data types, and memory usage.

By default, pandas approximates the DataFrame’s memory usage to save time. Since we care about accuracy, we set the memory_usage parameter to 'deep' to get an exact number.
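Something like the following would print that summary (continuing with the gl DataFrame from above):

```python
# memory_usage='deep' makes pandas introspect object columns for exact sizes
gl.info(memory_usage='deep')
```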

As we can see, we have 171,907 rows and 161 columns. Pandas automatically detected the data types for us: 83 columns are numeric and 78 are objects. Object columns are used for strings or for columns with mixed data types.

To better understand how pandas can reduce memory usage, let’s look at how pandas stores data in memory.

Internal representation of a DataFrame

Internally, pandas groups columns of the same data type into blocks of values. Here is an example of how pandas stores the first 12 columns of our DataFrame.

You can see that the blocks do not retain the original column names. That is because the blocks are optimized for storing the actual values of the DataFrame. Pandas’ BlockManager class maintains the mapping between the row and column indexes and the actual blocks, acting as an API that provides access to the underlying data. Whenever we select, edit, or delete values, the DataFrame and BlockManager interfaces translate our requests into calls to the appropriate functions and methods.

The pandas.core.internals module has a specialized class for each type. Pandas uses the ObjectBlock class to represent blocks containing string columns and the FloatBlock class to represent blocks containing float columns. For blocks of integers and floats, pandas combines the columns and stores them as a NumPy ndarray. The NumPy ndarray is built around a C array, with the values stored in a contiguous block of memory. This storage scheme makes access to the values very fast.

Because each data type is stored separately, we will examine the memory usage of the different data types. First, let’s look at the average memory usage for each data type.
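One way to compute this, looping over the dtype families we care about (a sketch, not necessarily the tutorial’s exact code):

```python
for dtype in ['float', 'int', 'object']:
    selected_dtype = gl.select_dtypes(include=[dtype])
    mean_usage_b = selected_dtype.memory_usage(deep=True).mean()
    mean_usage_mb = mean_usage_b / 1024 ** 2
    print("Average memory usage for {} columns: {:03.2f} MB".format(dtype, mean_usage_mb))
```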

As you can see, 78 object columns use the largest amount of memory. We’ll talk more about that later. First let’s see if we can improve the memory usage of numeric columns.

Understanding subtypes

As we briefly mentioned earlier, pandas represents numeric values internally as NumPy ndarrays and stores them in contiguous blocks of memory. This storage model takes up less space and also gives us fast access to the values. Because pandas uses the same number of bytes for every value of a given type, and a NumPy ndarray stores the number of values it holds, pandas can quickly and accurately return the number of bytes a numeric column consumes.

Many of the types in pandas have multiple subtypes that can use fewer bytes to represent each value. The float type, for example, has the float16, float32, and float64 subtypes. The number in the type name indicates the number of bits used to represent each value, so the subtypes we just listed use 2, 4, and 8 bytes, respectively. The table below lists the subtypes of the most commonly used pandas types:

An int8 value uses 1 byte of storage and can represent 256 (2^8) distinct values. This means we can use this subtype to represent every integer from -128 to 127, including 0.

We can use the numpy.iinfo class to verify the maximum and minimum values of each integer subtype. Here’s an example:
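For example, a small illustrative snippet:

```python
import numpy as np

# Print the representable range for a few integer subtypes.
for int_type in ['uint8', 'int8', 'int16']:
    print(np.iinfo(int_type))
```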

Here we see the difference between uint (unsigned integer) and int (signed integer). Both subtypes use the same amount of storage, but the unsigned kind holds only zero and positive numbers. Unsigned integers let us handle columns that contain only non-negative values more efficiently.

Optimize numeric columns with subtypes

We can use the pd.to_numeric() function to downcast our numeric types. We’ll use DataFrame.select_dtypes() to select just the integer columns, then optimize their types and compare memory usage.
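A sketch of that step. The mem_usage helper below is our own small convenience function, not part of pandas, and downcast='unsigned' assumes the integer columns hold only non-negative values (scores, attendance, and so on):

```python
# Small helper to report memory usage in MB (our own convenience function).
def mem_usage(pandas_obj):
    if isinstance(pandas_obj, pd.DataFrame):
        usage_b = pandas_obj.memory_usage(deep=True).sum()
    else:  # assume a Series
        usage_b = pandas_obj.memory_usage(deep=True)
    return "{:03.2f} MB".format(usage_b / 1024 ** 2)

# Select the int64 columns and downcast them to the smallest unsigned subtype.
gl_int = gl.select_dtypes(include=['int64'])
converted_int = gl_int.apply(pd.to_numeric, downcast='unsigned')

print(mem_usage(gl_int))
print(mem_usage(converted_int))
```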

We can see that memory usage dropped from 7.9 MB to 1.5 MB, a decrease of more than 80%. The overall impact on our original DataFrame is small, though, because it has very few integer columns.

Let’s do the same for the floating-point columns.
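A corresponding sketch for the float columns, reusing the mem_usage helper from above:

```python
gl_float = gl.select_dtypes(include=['float64'])
converted_float = gl_float.apply(pd.to_numeric, downcast='float')

print(mem_usage(gl_float))
print(mem_usage(converted_float))
```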

We can see that every float column was converted from float64 to float32, cutting their memory usage by 50%.

Let’s create a copy of the original DataFrame, replace the original numeric columns with the optimized ones, and look at our overall memory usage now.
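For example:

```python
optimized_gl = gl.copy()

# Overwrite the original numeric columns with the downcast versions.
optimized_gl[converted_int.columns] = converted_int
optimized_gl[converted_float.columns] = converted_float

print(mem_usage(gl))
print(mem_usage(optimized_gl))
```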

Although we significantly reduced the memory usage of the numeric columns, overall memory usage fell by only 7%. Most of our gains will come from optimizing the object columns.

Before we get started, let’s look at how strings are stored in pandas compared to how numeric types are stored.

Comparison of numeric storage to string storage

The object type represents values using Python string objects, in part because NumPy has no support for missing string values. Because Python is a high-level, interpreted language, it does not offer fine-grained control over how values are stored in memory.

This limitation means strings are stored in a fragmented way, which consumes more memory and is slower to access. Each element in an object column is really a pointer containing the “address” of the actual value’s location in memory.

The figure below shows how numeric data is stored as a NumPy data type and string data is stored using Python’s built-in types.

Image source: jakevdp.github.io/blog/2014/0…

In the earlier table, you may have noticed that the memory usage of the object type is listed as variable. While each pointer takes up only 8 bytes of memory (on a 64-bit system), each actual string value takes up the same amount of memory the string would use if stored individually in Python. We can demonstrate this with the sys.getsizeof() function, looking first at individual strings and then at the items in a pandas Series.
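A quick illustration (the example strings are arbitrary):

```python
from sys import getsizeof

s1 = 'working out'
s2 = 'memory consumption for'
s3 = 'strings in python is fun!'

# Size of each string stored on its own in Python...
print(getsizeof(s1), getsizeof(s2), getsizeof(s3))

# ...and the size of the same strings as elements of a pandas Series.
obj_series = pd.Series([s1, s2, s3])
print(obj_series.apply(getsizeof))
```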

As you can see, the size of each string stored in the pandas Series is the same as its size when stored on its own in Python.

Optimize object types using Categoricals

Pandas introduced categoricals in version 0.15. The category type uses integer values under the hood to represent the values in a column, rather than the raw values, and keeps a separate mapping dictionary from those integers to the original values. This arrangement is useful whenever a column contains a limited set of values. When we convert a column to the category dtype, pandas uses the most space-efficient int subtype that can represent all the distinct values in the column.

To see why we can use this type to reduce memory usage, let’s look at the number of distinct values in each of our object columns.
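One convenient way to do this is describe(); for object columns its output includes a 'unique' row with the number of distinct values per column (a sketch):

```python
gl_obj = gl.select_dtypes(include=['object']).copy()
gl_obj.describe()
```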


At a glance, the number of unique values is quite small relative to the roughly 172,000 games in our dataset. To see what happens when we convert a column to the category type, let’s pick one object column to work with. We’ll use the second column of the dataset, day_of_week.

Looking at the table above, you can see that it contains only seven distinct values. We’ll use the .astype() method to convert it to the category type.
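A sketch of the conversion, assuming the column is named day_of_week as in the data dictionary:

```python
dow = gl_obj.day_of_week  # column name assumed from the data dictionary
print(dow.head())

dow_cat = dow.astype('category')
print(dow_cat.head())
```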

As you can see, the data still looks exactly the same, except that the type of this column has changed. Let’s see what’s going on behind the scenes.

In the code below, we use the Series.cat.codes attribute to see the integer values the category type uses to represent each value.
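For example:

```python
# The integer codes backing the first few category values.
dow_cat.head().cat.codes
```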

You can see that each distinct value has been assigned an integer code, and that the underlying data type of this column is now int8. This column has no missing values, but if it did, the category subtype would handle them by setting the code to -1.

Finally, let’s look at the memory usage before and after converting this column to category type.
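Again using the mem_usage helper defined earlier:

```python
print(mem_usage(dow))      # original object column
print(mem_usage(dow_cat))  # same data as a category column
```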

Memory usage dropped from 9.8 MB to 0.16 MB, a 98% reduction! Note that this particular column is probably a best-case scenario: about 172,000 items with only seven distinct values.

Although it sounds tempting to convert every column to this type, it’s important to understand the trade-offs. The biggest drawback is the inability to perform numerical computation: we cannot do arithmetic on a category column, or use methods like Series.min() and Series.max(), without first converting it back to a true numeric dtype.

We should stick to using the category type mainly for object columns where fewer than 50% of the values are unique. If all of the values in a column are unique, the category type will actually use more memory, because the column then stores all of the raw string values in addition to their integer codes. You can read about the limitations of the category type in the pandas documentation: pandas.pydata.org/pandas-docs…

We’ll write a loop that checks whether the number of distinct values in each object column is less than 50% of the total; if so, it converts that column to the category type, as shown below.
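A sketch of that loop, collecting the converted columns into a new DataFrame:

```python
converted_obj = pd.DataFrame()

for col in gl_obj.columns:
    num_unique_values = len(gl_obj[col].unique())
    num_total_values = len(gl_obj[col])
    if num_unique_values / num_total_values < 0.5:
        converted_obj[col] = gl_obj[col].astype('category')
    else:
        converted_obj[col] = gl_obj[col]
```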

As before, let’s compare memory usage before and after the conversion:
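Again with the mem_usage helper from earlier:

```python
print(mem_usage(gl_obj))
print(mem_usage(converted_obj))
```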

In this case, all of the object columns were converted to the category type, but that won’t be true of every dataset, so be sure to use the check above.

Memory usage for the object columns dropped from 752 MB to 52 MB, a 93% reduction. Let’s combine these with the rest of our DataFrame and see how far we’ve come from the original 861 MB.
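For example, again using the mem_usage helper:

```python
optimized_gl[converted_obj.columns] = converted_obj
print(mem_usage(optimized_gl))
```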

Wow, we’ve really made progress! There is one more optimization we can make: if you recall the table of data types given earlier, there is also a datetime type, which we can use for the first column of the dataset.

You may remember that this column started out as an integer type and has already been optimized to uint32. Converting it to datetime will therefore double its memory usage, since datetime is a 64-bit type. But the conversion is still worthwhile, because it enables proper time-series analysis.

The pandas.to_datetime() function can do the conversion for us; we use its format argument to tell it that our dates are stored as YYYYMMDD.
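A sketch of the conversion, assuming the date column is named date and holds YYYYMMDD-style integers:

```python
date = optimized_gl.date  # column name assumed; values look like 18710504
print(mem_usage(date))

optimized_gl['date'] = pd.to_datetime(date, format='%Y%m%d')
print(mem_usage(optimized_gl['date']))
```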

Select the type while reading in the data  

So far, we have explored ways to reduce the memory footprint of an existing DataFrame. By reading the DataFrame in first and then iterating through the optimizations, we learned how much memory each approach can save. But as we mentioned earlier, we often don’t have enough memory to represent all the values in a dataset to begin with. If we can’t even create the DataFrame in the first place, how can we apply these memory-saving techniques?

Fortunately, we can specify the optimal column types while reading the data in. The pandas.read_csv() function has several arguments that allow us to do this. The dtype argument accepts a dictionary with column names as keys and NumPy type objects as values.

First, we store the final type of each column in a dictionary keyed by column name, dropping the date column first, since it needs to be handled separately.
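One way to build that dictionary (a sketch):

```python
# Drop the date column, then map each remaining column name to its dtype name.
dtypes = optimized_gl.drop('date', axis=1).dtypes

dtypes_col = dtypes.index
dtypes_type = [t.name for t in dtypes.values]
column_types = dict(zip(dtypes_col, dtypes_type))
```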

Now we can use the dictionary, along with a few extra arguments for the date column, to read the data in with the correct types in just a few lines of code:
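A sketch of the optimized read. If parse_dates cannot infer the YYYYMMDD format on your pandas version, convert the column afterwards with pd.to_datetime(..., format='%Y%m%d'):

```python
read_and_optimized = pd.read_csv('game_logs.csv',
                                 dtype=column_types,
                                 parse_dates=['date'])

read_and_optimized.info(memory_usage='deep')
```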


By optimizing these columns, we managed to reduce the memory footprint in pandas from 861.6 MB to 104.28 MB, an impressive 88% reduction!

Analyzing baseball games

Now that we have optimized our data, we can perform some analysis. Let’s start by looking at the distribution of game dates.
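A sketch of one way to visualize this, assuming matplotlib is available and the columns are named date and day_of_week:

```python
import matplotlib.pyplot as plt

optimized_gl['year'] = optimized_gl['date'].dt.year

# Share of games played on each day of the week, per year.
games_per_day = optimized_gl.pivot_table(index='year', columns='day_of_week',
                                         values='date', aggfunc='count')
games_per_day = games_per_day.divide(games_per_day.sum(axis=1), axis=0)

ax = games_per_day.plot(kind='area', stacked=True)
ax.legend(loc='upper right')
ax.set_ylim(0, 1.0)
plt.show()
```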

We can see that Sunday baseball games were rare before the 1920s but became more and more frequent over the second half of the last century.

It is also clear that the day-of-week distribution has changed little over the past 50 years.

Let’s also take a look at how game length has changed over the years:
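A sketch, assuming the game-duration column is named length_minutes and reusing the year column created above:

```python
# Average game length (in minutes) per year.
game_lengths = optimized_gl.pivot_table(index='year', values='length_minutes')
game_lengths.reset_index().plot.scatter('year', 'length_minutes')
plt.show()
```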

Since the 1940s, baseball games have become longer and longer.

Summary and next steps

Having seen how pandas stores data using different types, we then used that knowledge to reduce the memory usage of a pandas DataFrame by nearly 90% with a few simple techniques:

  • Downcasting numeric columns to more efficient subtypes
  • Converting string columns to the categorical type
