This article was adapted from Dataquest and written by Josh Devlin.

Pandas is a Python library for data manipulation and analysis. Dataquest.io has put together a tutorial on optimizing pandas' memory usage, reducing the memory footprint of a baseball game-log dataset by nearly 90% with some simple type casts.
Performance is generally not an issue when working with small amounts of data (under 100 MB) in pandas. With larger volumes of data (100 MB to several gigabytes), though, performance problems can make run times much longer, or cause runs to fail entirely for lack of memory.

While tools like Spark can handle very large datasets (100 GB to several terabytes), taking full advantage of them usually requires more expensive hardware. And unlike pandas, they lack rich feature sets for high-quality data cleaning, exploration, and analysis. For medium-sized data, we're better off trying to get more out of pandas rather than switching to a different tool.

In this article, we'll learn about pandas' memory usage and how you can reduce a DataFrame's memory footprint by almost 90% simply by choosing the right data types for its columns.

Working with baseball game logs

We will work with 130 years of Major League Baseball (MLB) game data from Retrosheet: http://www.retrosheet.org/gamelogs/index.html.

The data was originally split into 127 different CSV files, but we have combined them using csvkit and added column names in the first row. If you want to download the version of the data used in this article, you can find it here: https://data.world/dataquest/mlb-game-logs.

Let’s start by importing the data and looking at the first five lines:

import pandas as pd

gl = pd.read_csv('game_logs.csv')
gl.head()
Below we've summarized some of the important columns, but if you want to see all of them, we've also created a data dictionary for the whole dataset: https://data.world/dataquest/mlb-game-logs/workspace/data-dictionary.

  • date – Date of the game
  • v_name – Visiting team name
  • v_league – Visiting team league
  • h_name – Home team name
  • h_league – Home team league
  • v_score – Visiting team score
  • h_score – Home team score
  • v_line_score – Runs scored by the visiting team in each inning, e.g. 010000(10)00
  • h_line_score – Runs scored by the home team in each inning, e.g. 010000(10)0X
  • park_id – ID of the park where the game was played
  • attendance – Game attendance
We can use the DataFrame.info() method to give us high-level information about the DataFrame, including its size, data types, and memory usage.

By default, pandas approximates the DataFrame's memory usage to save time. Since we also care about accuracy, we'll set the memory_usage parameter to 'deep' to get an exact number.

gl.info(memory_usage='deep')


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 171907 entries, 0 to 171906
Columns: 161 entries, date to acquisition_info
dtypes: float64(77), int64(6), object(78)
memory usage: 861.6 MB
As we can see, we have 171,907 rows and 161 columns. Pandas has automatically detected the data types for us: 83 columns are numeric and 78 are objects. Object columns hold strings, or cases where a column contains mixed data types.
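
If you want to see that breakdown for yourself, a quick check is to count the columns by dtype. A minimal sketch, assuming gl has been loaded as above:

# Count how many columns pandas assigned to each dtype
print(gl.dtypes.value_counts())

# Peek at a few of the columns that ended up as object
print(gl.select_dtypes(include=['object']).columns[:5])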

To better understand how we can reduce memory usage, let's look at how pandas stores data in memory.

Internal representation of a DataFrame

In pandas, columns of the same data type are organized into blocks of values. (The original article includes a diagram showing how the first twelve columns of our DataFrame are stored in blocks.)

You can see that the blocks don't retain the original column names. That's because blocks are optimized for storing the actual values in the DataFrame. Pandas' BlockManager class keeps track of the mapping between column names and the actual blocks, and acts as an API that provides access to the underlying data. Whenever we select, edit, or delete values, the DataFrame and BlockManager interfaces translate our requests into calls to the relevant functions and methods.

In the pandas.core.internals module there is a specialized class for each type. Pandas uses the ObjectBlock class to represent blocks containing string columns, and the FloatBlock class to represent blocks containing float columns. For blocks representing integers and floats, pandas combines the columns and stores them as a NumPy ndarray. The NumPy ndarray is built around a C array, with the values stored in a contiguous block of memory. This storage scheme makes accessing values very fast.
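
The block layout is a private, version-dependent implementation detail, so treat the following as an illustrative sketch rather than a supported API; it peeks at the BlockManager of a tiny, made-up DataFrame:

import pandas as pd

# Two int64 columns, one float64 column, one object column
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4],
                   'c': [1.5, 2.5], 'd': ['x', 'y']})

# Recent pandas exposes the BlockManager as _mgr, older versions as _data.
# Private attribute, shown only to illustrate that same-dtype columns
# typically end up sharing a single block.
mgr = df._mgr if hasattr(df, '_mgr') else df._data
for block in mgr.blocks:
    print(block.dtype, block.values.shape)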

Because each data type is stored separately, we will examine the memory usage of the different data types. First, let’s look at the average memory usage for each data type.

for dtype in ['float', 'int', 'object']:
    selected_dtype = gl.select_dtypes(include=[dtype])
    mean_usage_b = selected_dtype.memory_usage(deep=True).mean()
    mean_usage_mb = mean_usage_b / 1024 ** 2
    print("Average memory usage for {} columns: {:03.2f} MB".format(dtype, mean_usage_mb))


Average memory usage for float columns: 1.29 MB
Average memory usage for int columns: 1.12 MB
Average memory usage for object columns: 9.53 MB
As you can see, 78 object columns use the largest amount of memory. We’ll talk more about that later. First let’s see if we can improve the memory usage of numeric columns.

Understanding subtypes

As we briefly mentioned earlier, pandas represents numeric values internally as NumPy ndarrays and stores them in contiguous blocks of memory. This storage model consumes less space and also gives us fast access to the values. Because pandas represents each value of the same type using the same number of bytes, and a NumPy ndarray stores the number of values it holds, pandas can return the number of bytes a numeric column consumes quickly and accurately.
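
As a small illustration of that bookkeeping (a sketch with made-up data, not from the original article), the memory of a fixed-width numeric column is simply the number of values multiplied by the bytes per value:

import numpy as np
import pandas as pd

s = pd.Series(np.arange(1000), dtype='int64')
print(s.values.nbytes)                         # 1000 values * 8 bytes = 8000
print(s.memory_usage(index=False, deep=True))  # pandas reports the same 8000 bytes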

Many types in pandas have multiple subtypes that can use fewer bytes to represent each value. The float type, for example, has the float16, float32, and float64 subtypes. The number in a type's name indicates how many bits it uses to represent each value, so the subtypes we just listed use 2, 4, and 8 bytes, respectively. The table in the original article lists the subtypes of the most commonly used pandas types.
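
If you don't have the table handy, you can query the byte width of each subtype directly from NumPy. A minimal sketch:

import numpy as np

for name in ['int8', 'int16', 'int32', 'int64',
             'float16', 'float32', 'float64']:
    print(name, np.dtype(name).itemsize, 'bytes')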

An int8 value uses 1 byte of storage and can represent 256 (2^8) distinct values. This means we can use this subtype to represent every integer from -128 to 127, including 0.

We can use the numpy.iinfo class to verify the maximum and minimum values of each integer subtype. Here’s an example:

import numpy as np
int_types = ["uint8", "int8", "int16"]
for it in int_types:
    print(np.iinfo(it))


Machine parameters for uint8
---------------------------------------------------------------
min = 0
max = 255
---------------------------------------------------------------

Machine parameters for int8
---------------------------------------------------------------
min = -128
max = 127
---------------------------------------------------------------

Machine parameters for int16
---------------------------------------------------------------
min = -32768
max = 32767
---------------------------------------------------------------
Here we can see the difference between uint (unsigned integer) and int (signed integer). Both types have the same capacity, but the unsigned type can only hold zero and positive numbers. Unsigned integers let us handle columns that contain only positive values more efficiently.
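
Here is a short sketch of why that matters when downcasting with pd.to_numeric (the function we use on the real data below); the sample values are made up:

import pandas as pd

positives = pd.Series([0, 1, 255])
print(pd.to_numeric(positives, downcast='unsigned').dtype)  # uint8

mixed = pd.Series([-1, 0, 127])
print(pd.to_numeric(mixed, downcast='unsigned').dtype)      # stays int64: no unsigned subtype fits
print(pd.to_numeric(mixed, downcast='integer').dtype)       # int8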



Optimize numeric columns with subtypes

We can use the pd.to_numeric() function to downcast our numeric types. We'll use DataFrame.select_dtypes to select the integer columns, then optimize their types and compare the memory usage.

# We're going to be calculating memory usage a lot,
# so we'll create a function to save us some time!

def mem_usage(pandas_obj):
    if isinstance(pandas_obj,pd.DataFrame):
        usage_b = pandas_obj.memory_usage(deep=True).sum()
    else: # we assume if not a df it's a series
        usage_b = pandas_obj.memory_usage(deep=True)
    usage_mb = usage_b / 1024 ** 2 # convert bytes to megabytes
    return "{:03.2f} MB".format(usage_mb)

gl_int = gl.select_dtypes(include=['int'])
converted_int = gl_int.apply(pd.to_numeric,downcast='unsigned')

print(mem_usage(gl_int))
print(mem_usage(converted_int))

compare_ints = pd.concat([gl_int.dtypes,converted_int.dtypes],axis=1)
compare_ints.columns = ['before', 'after']
compare_ints.apply(pd.Series.value_counts)


7.87 MB
1.48 MB

We can see that memory usage dropped from 7.9 MB to 1.5 MB, a decrease of more than 80%. The overall impact on our original DataFrame is small, though, because it has very few integer columns.

Let’s do the same for the floating-point columns.

gl_float = gl.select_dtypes(include=['float'])
converted_float = gl_float.apply(pd.to_numeric,downcast='float')

print(mem_usage(gl_float))
print(mem_usage(converted_float))

compare_floats = pd.concat([gl_float.dtypes,converted_float.dtypes],axis=1)
compare_floats.columns = ['before', 'after']
compare_floats.apply(pd.Series.value_counts)


100.99 MB
50.49 MB

We can see that the floating-point columns changed from float64 to float32, cutting their memory usage by 50%.

Let's create a copy of the original DataFrame, replace the original columns with these optimized ones, and look at our overall memory usage now.

optimized_gl = gl.copy()

optimized_gl[converted_int.columns] = converted_int
optimized_gl[converted_float.columns] = converted_float

print(mem_usage(gl))
print(mem_usage(optimized_gl))


861.57 MB


804.69 MB
Although we significantly reduced the memory usage of the numeric columns, overall memory usage only dropped by 7%. Most of our gains will come from optimizing the object columns.

Before we do that, let's look at how strings are stored in pandas compared with how numeric types are stored.

Comparison of numeric storage to string storage

The object type represents values using Python string objects, in part because NumPy lacks support for missing string values. Because Python is a high-level, interpreted language, it doesn't give us fine-grained control over how values are stored in memory.

This limitation results in a fragmented way of storing strings, which consumes more memory and is slower to access. Each element in the Object column is actually a pointer that contains the “address” of the actual value’s location in memory.

The figure below shows how numeric data is stored as a NumPy data type and string data is stored using Python’s built-in types.

Image source: https://jakevdp.github.io/blog/2014/05/09/why-python-is-slow/

In the earlier table, you may have noticed that the memory usage of the object type is variable. While each pointer takes up 8 bytes of memory, each actual string value uses the same amount of memory the string would use if stored individually in Python. We can prove this using the sys.getsizeof() function, looking first at individual strings and then at the items in a pandas Series.

from sys import getsizeof

s1 = 'working out'
s2 = 'memory usage for'
s3 = 'strings in python is fun!'
s4 = 'strings in python is fun!'

for s in [s1, s2, s3, s4]:
    print(getsizeof(s))


60
65
74
74


obj_series = pd.Series(['working out', 'memory usage for',
                        'strings in python is fun!', 'strings in python is fun!'])
obj_series.apply(getsizeof)


0    60
1    65
2    74
3    74
dtype: int64
As you can see, the size of each string stored in a pandas Series is identical to the size of the same string stored on its own in Python.

Optimize object types using Categoricals

Pandas introduced categoricals in version 0.15. The category type uses integer values under the hood to represent the values in a column, rather than the raw values themselves. Pandas uses a separate mapping dictionary to map those integers back to the original values. This arrangement is useful whenever a column contains a limited set of values. When we convert a column to the category dtype, pandas uses the most space-efficient int subtype that can represent all of the unique values in the column.

To see why we can use this type to reduce memory usage, let's look at the number of unique values in each of our object columns.

gl_obj = gl.select_dtypes(include=['object']).copy()
gl_obj.describe()
(See the original article for the full describe() output.)

At a glance, the number of unique values is tiny compared to the roughly 172,000 games in our dataset.

To see what happens when we convert a column to the categorical type, let's pick one object column to work with. We'll use the second column of the dataset, day_of_week.

Looking at the table above, you can see that it contains only seven unique values. We'll use the .astype() method to convert it to the categorical type.

dow = gl_obj.day_of_week
print(dow.head())

dow_cat = dow.astype('category')
print(dow_cat.head())


0    Thu
1    Fri
2    Sat
3    Mon
4    Tue
Name: day_of_week, dtype: object
0    Thu
1    Fri
2    Sat
3    Mon
4    Tue
Name: day_of_week, dtype: category
Categories (7, object): [Fri, Mon, Sat, Sun, Thu, Tue, Wed]
As you can see, the data still looks exactly the same, except that the type of this column has changed. Let’s see what’s going on behind the scenes.

In the code below, we use the Series.cat.codes attribute to return the integer values the category type uses to represent each value.

dow_cat.head().cat.codes


0    4
1    0
2    2
3    1
4    5
dtype: int8
You can see that each unique value has been assigned an integer, and that the underlying data type for the column is now int8. There aren't any missing values in this column, but if there were, the category subtype would handle them by setting their code to -1.
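
Here is a quick sketch of that behaviour, using a made-up series that contains a missing value:

import pandas as pd

s = pd.Series(['Mon', None, 'Tue'], dtype='category')
print(s.cat.codes)        # the missing value is encoded as -1
print(s.cat.categories)   # only 'Mon' and 'Tue' are stored as categories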

Finally, let’s look at the memory usage before and after converting this column to category type.

print(mem_usage(dow))
print(mem_usage(dow_cat))


9.84 MB
0.16 MB
Memory usage dropped from 9.8 MB to 0.16 MB, a 98% reduction! Note that this particular column probably represents one of our best-case scenarios: a column of roughly 172,000 items with only seven unique values.

While converting every column to this type sounds appealing, it's important to be aware of the trade-offs. The biggest one is the inability to perform numerical computations. We can't do arithmetic on a category column, or use methods like Series.min() and Series.max(), without first converting the data to a true numeric dtype.
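
For example (a minimal sketch; the exact error message varies by pandas version), arithmetic on a category column raises a TypeError until we cast back to a numeric dtype:

import pandas as pd

nums = pd.Series([1, 2, 3, 3]).astype('category')

try:
    nums + 1                       # arithmetic is not defined for the category dtype
except TypeError as err:
    print('TypeError:', err)

print(nums.astype('int64').min())  # works after converting back to a numeric dtype
print(nums.astype('int64').max())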

We should stick to using the category type primarily for object columns where fewer than 50% of the values are unique. If all of the values in a column are unique, the category type will end up using more memory, because the column stores all of the raw string values in addition to their integer codes. You can read more about the limitations of the category type in the pandas documentation: http://pandas.pydata.org/pandas-docs/stable/categorical.html.

We'll write a loop that checks whether the number of unique values in each object column is less than 50% of the total; if so, it converts that column to the category type.

converted_obj = pd.DataFrame()

for col in gl_obj.columns:
    num_unique_values = len(gl_obj[col].unique())
    num_total_values = len(gl_obj[col])
    if num_unique_values / num_total_values < 0.5:
        converted_obj.loc[:, col] = gl_obj[col].astype('category')
    else:
        converted_obj.loc[:, col] = gl_obj[col]
Compare as before:

print(mem_usage(gl_obj))
print(mem_usage(converted_obj))

compare_obj = pd.concat([gl_obj.dtypes,converted_obj.dtypes],axis=1)
compare_obj.columns = ['before', 'after']
compare_obj.apply(pd.Series.value_counts)


752.72 MB
51.67 MB

In this case, all of our object columns were converted to the category type, but that won't be true for every dataset, so you should use the process above to check.

The memory usage of the object columns dropped from 752 MB to 52 MB, a 93% reduction. Let's combine this with the rest of our DataFrame and see how far we've come from the 861 MB we started with.

optimized_gl[converted_obj.columns] = converted_obj

mem_usage(optimized_gl)


'103.64 MB'
Wow, this is going really well! There is one more optimization we can make: if you remember the table of data types given earlier, you'll know there is also a datetime type, which we can use for the first column of the dataset.

date = optimized_gl.date
print(mem_usage(date))
date.head()


0.66 MB


0    18710504
1    18710505
2    18710506
3    18710508
4    18710509
Name: date, dtype: uint32
You may remember that this column started out as an integer and has already been optimized to uint32. Converting it to datetime will actually double its memory usage, because the datetime type is 64-bit. Converting it is still worthwhile, though, because it will let us do time series analysis.

The pandas.to_datetime() function can do this for us; we'll use its format argument to tell it that our dates are stored as YYYYMMDD.

optimized_gl['date'] = pd.to_datetime(date,format='%Y%m%d')

print(mem_usage(optimized_gl))
optimized_gl.date.head()


104.29 MB


0   1871-05-04
1   1871-05-05
2   1871-05-06
3   1871-05-08
4   1871-05-09
Name: date, dtype: datetime64[ns]


Selecting types while reading the data in

So far, we've explored ways to reduce the memory footprint of an existing DataFrame. By reading the DataFrame in first and then iterating on ways to save memory, we were able to see how much each optimization contributed. As we mentioned earlier, though, we often won't have enough memory to represent all of the values in a dataset to begin with. How can we apply memory-saving techniques if we can't even create the DataFrame in the first place?

Fortunately, we can specify the optimal column types while reading the data in. The pandas.read_csv() function has several parameters that let us do this. The dtype parameter accepts a dictionary with (string) column names as keys and NumPy type objects as values.

First, we'll store the final type of each column in a dictionary keyed by column name, dropping the date column first, since it needs to be handled separately.

dtypes = optimized_gl.drop('date',axis=1).dtypes

dtypes_col = dtypes.index
dtypes_type = [i.name for i in dtypes.values]

column_types = dict(zip(dtypes_col, dtypes_type))

# rather than print all 161 items, we'll
# sample 10 key/value pairs from the dict
# and print it nicely using prettyprint

preview = {key: value for key, value in list(column_types.items())[:10]}
import pprint
pp = pprint.PrettyPrinter(indent=4)
pp.pprint(preview)


{   'acquisition_info': 'category',
    'h_caught_stealing': 'float32',
    'h_player_1_name': 'category',
    'h_player_9_name': 'category',
    'v_assists': 'float32',
    'v_first_catcher_interference': 'float32',
    'v_grounded_into_double': 'float32',
    'v_player_1_id': 'category',
    'v_player_3_id': 'category',
    'v_player_5_id': 'category'}
Now we can use the dictionary, along with a few parameters for the date, to read in the data with the correct types in just a few lines of code:

read_and_optimized = pd.read_csv('game_logs.csv',dtype=column_types,parse_dates=['date'],infer_datetime_format=True)

print(mem_usage(read_and_optimized))
read_and_optimized.head()


104.28 MB
(See the original article for the full output.)

By optimizing these columns, we managed to reduce pandas’ memory footprint from 861.6MB to 104.28MB — a staggering 88% reduction!



Analyzing baseball games

Now that we've optimized our data, we can do some analysis. Let's start by looking at the distribution of game days.

optimized_gl['year'] = optimized_gl.date.dt.year
games_per_day = optimized_gl.pivot_table(index='year',columns='day_of_week',values='date',aggfunc=len)
games_per_day = games_per_day.divide(games_per_day.sum(axis=1),axis=0)

import matplotlib.pyplot as plt   # needed for plt.show() below

ax = games_per_day.plot(kind='area', stacked='true')
ax.legend(loc='upper right')
ax.set_ylim(0, 1)
plt.show()

We can see that there were few baseball games on Sunday before the 1920s, but they became more and more frequent during the second half of the last century.

It is also clear that the distribution of dates has changed little over the past 50 years.

Let’s take a look at how the length of the match has changed:

game_lengths = optimized_gl.pivot_table(index='year', values='length_minutes')
game_lengths.reset_index().plot.scatter('year'.'length_minutes')
plt.show()
Since the 1940s, baseball games have become longer and longer.

Summary and next steps

Having seen how pandas uses different data types, we then used that knowledge to reduce the memory usage of a pandas DataFrame by nearly 90% with a few simple techniques:

  • Downcasting numeric columns to more efficient subtypes
  • Converting string columns to the categorical type
If you want to keep working with larger datasets in pandas, you can take this interactive course: https://www.dataquest.io/m/163/optimizing-dataframe-memory-footprint/16/next-steps.