Author: xiaoyu

Python Data Science



Preface

When it comes to data analysis, the two languages mentioned most often are Python and SQL. Python is well suited to data analysis because it has many powerful third-party libraries to help, and pandas is one of them. The pandas documentation describes the library as follows:

“Fast, flexible, and expressive data structures designed to make working with ‘relational’ or ‘labeled’ data both easy and intuitive.”

We know pandas’ two main data structures, DataFrame and Series, and many of our operations are built on them. In practice, however, some operations can feel extremely slow. A single action that takes a few seconds may not look like much, but across an entire project many such actions add up to a very inefficient development experience. Some friends complain that pandas is simply too slow; in fact, there are tricks for using pandas well.

Pandas is built on top of the array structures of the NumPy library, and many of its operations are implemented in C (either by NumPy, or by pandas’ own extension modules, which are written in Cython and compiled to C). Thus, pandas can be very fast if used correctly.

In this article, I will describe some common pandas usage patterns, what to watch out for, and how to speed things up:

  • Advantages of using datetime data with time series
  • The most efficient way to do batch calculations
  • Saving time by storing data with HDFStore

Use Datetime data to save time

Let’s look at an example.

>>> import pandas as pd
>>> pd.__version__
'0.23.1'

# Import the dataset
>>> df = pd.read_csv('demand_profile.csv')
>>> df.head()
     date_time  energy_kwh
0  1/1/13 0:00       0.586
1  1/1/13 1:00       0.580
2  1/1/13 2:00       0.572
3  1/1/13 3:00       0.596
4  1/1/13 4:00       0.592

From the output above, everything seems fine. But pandas and NumPy have a concept of dtypes (data types). If not specified, date_time will be loaded with the generic object dtype, as the following code shows:

>>> df.dtypes
date_time      object
energy_kwh    float64
dtype: object

>>> type(df.iat[0, 0])
str

The object dtype is like a large container: it can hold not only str, but also any column that doesn’t fit neatly into a single data type. Handling dates as str values hurts efficiency considerably.

Therefore, for time series data, we need to convert the date_time column above into an array of datetime objects (pandas calls them Timestamps). Doing this in pandas is very simple:

>>> df['date_time'] = pd.to_datetime(df['date_time'])
>>> df['date_time'].dtype
datetime64[ns]

Let’s look at df now:

>>> df.head()
               date_time    energy_kwh
0    2013-01-01 00:00:00         0.586
1    2013-01-01 01:00:00         0.580
2    2013-01-01 02:00:00         0.572
3    2013-01-01 03:00:00         0.596
4    2013-01-01 04:00:00         0.592

The date_time column has been converted automatically, but that is not the end: we can speed things up further from here. How? For a fair comparison, let’s first time the conversion above using a @timeit decorator.
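Note that @timeit is not a standard-library decorator; the source article defines its own helper, which is not reproduced here. A minimal sketch of one possible implementation (the exact output format below is an assumption modeled on the printed results) might look like this:

import functools
import time

def timeit(repeat=3, number=10):
    """Minimal sketch of a timing decorator; not the article's exact helper."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            result = None
            trials = []
            for _ in range(repeat):
                start = time.perf_counter()
                for _ in range(number):
                    result = func(*args, **kwargs)
                # Record the average time per call for this trial
                trials.append((time.perf_counter() - start) / number)
            print(f'Best of {repeat} trials with {number} function calls per trial:')
            print(f'Function `{func.__name__}` ran in average of {min(trials):.3f} seconds.')
            return result
        return wrapper
    return decorator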

>>> @timeit(repeat=3, number=10)
... def convert(df, column_name):
...     return pd.to_datetime(df[column_name])

>>> df['date_time'] = convert(df, 'date_time')
Best of 3 trials with 10 function calls per trial:
Function `convert` ran in average of 1.610 seconds.

1.61 seconds looks fast, but it can actually be much faster. Let’s look at the following method.

>>> @timeit(repeat=3, number=100)
... def convert_with_format(df, column_name):
...     return pd.to_datetime(df[column_name],
...                           format='%d/%m/%y %H:%M')

Best of 3 trials with 100 function calls per trial:
Function `convert_with_format` ran in average of 0.032 seconds.

The result is 0.032 seconds, nearly 50 times faster. The reason: we set the conversion format explicitly with the format parameter. Because the datetimes in the CSV are not in ISO 8601 format, without format pandas falls back on the dateutil package and parses each string individually.

Conversely, if the raw datetimes are already in ISO 8601 format, pandas can immediately take the fastest path to parse the dates. That is why supplying the format in advance improves things so much.
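As a quick, hypothetical illustration of the difference:

# Hypothetical illustration: ISO 8601 strings take the fast parsing path
# with no hint, while other layouts benefit from an explicit format.
>>> iso = pd.Series(['2013-01-01 00:00:00', '2013-01-01 01:00:00'])
>>> pd.to_datetime(iso)                              # fast path, no format needed
>>> other = pd.Series(['1/1/13 0:00', '1/1/13 1:00'])
>>> pd.to_datetime(other, format='%d/%m/%y %H:%M')   # explicit format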

Simple loops over pandas data

Still working with the data above, we want to add a new feature column, but this new feature depends on time conditions: the electricity rate changes according to the hour of day, as follows (rates in cents per kWh, matching the ranges used in the code below):

  • Peak (17:00 to 24:00): 28 cents per kWh
  • Shoulder (7:00 to 17:00): 20 cents per kWh
  • Off-Peak (0:00 to 7:00): 12 cents per kWh

So a natural first step is to write a function that contains the logic for these time conditions:

def apply_tariff(kwh, hour):
    """Calculate the cost of electricity for a given hour."""
    if 0 <= hour < 7:
        rate = 12
    elif 7 <= hour < 17:
        rate = 20
    elif 17 <= hour < 24:
        rate = 28
    else:
        raise ValueError(f'Invalid hour: {hour}')
    return rate * kwh

Then use a for loop to iterate over the DataFrame and build the new column with apply_tariff, as follows:

>>> # Not recommended!
>>> @timeit(repeat=3, number=100)
... def apply_tariff_loop(df):
...     """Calculate costs in loop.  Modifies `df` inplace."""
...     energy_cost_list = []
...     for i in range(len(df)):
...         # Get electricity used and hour of day
...         energy_used = df.iloc[i]['energy_kwh']
...         hour = df.iloc[i]['date_time'].hour
...         energy_cost = apply_tariff(energy_used, hour)
...         energy_cost_list.append(energy_cost)
...     df['cost_cents'] = energy_cost_list
...
>>> apply_tariff_loop(df)
Best of 3 trials with 100 function calls per trial:
Function `apply_tariff_loop` ran in average of 3.152 seconds.

To many people who have learned Python, this design looks natural. However, this loop seriously hurts efficiency and is not recommended, for several reasons:

  • First, it needs to initialize a list in which the outputs will be recorded.
  • Second, it uses the opaque object range(0, len(df)) to loop, and after calling apply_tariff() it must append the result to the list used to build the new DataFrame column. It also uses df.iloc[i]['date_time'], a form of chained indexing, which often leads to unexpected results.
  • But the biggest problem with this approach is the time cost of the calculation: for 8,760 rows, this loop took about 3 seconds. Next, you’ll look at some improved iteration solutions for pandas structures.

Loop with itertuples() and iterrows()

So what are the recommended practices?

Pandas actually provides the itertuples and iterrows methods to make this kind of iteration more efficient. Both are generator methods that yield one row at a time, similar to how yield is used in frameworks like Scrapy.

itertuples produces a namedtuple for each row, with the row’s index value as the first element of the tuple. A namedtuple is a data structure from Python’s collections module that behaves like a Python tuple but whose fields can also be accessed by attribute lookup.

iterrows produces an (index, Series) pair for each row in the DataFrame.
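A quick illustration of the difference, on a hypothetical two-row frame:

>>> tiny = pd.DataFrame({'energy_kwh': [0.586, 0.580]})
>>> next(tiny.itertuples())           # a namedtuple; fields accessed as attributes
Pandas(Index=0, energy_kwh=0.586)
>>> index, row = next(tiny.iterrows())   # an (index, Series) pair
>>> row['energy_kwh']
0.586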

itertuples is usually the faster of the two, but in this example let’s use iterrows and see what it looks like:

>>> @timeit(repeat=3, number=100)
... def apply_tariff_iterrows(df):
...     energy_cost_list = []
...     for index, row in df.iterrows():
...         # Get electricity used and hour of day
...         energy_used = row['energy_kwh']
...         hour = row['date_time'].hour
...         # Append the cost to the list
...         energy_cost = apply_tariff(energy_used, hour)
...         energy_cost_list.append(energy_cost)
...     df['cost_cents'] = energy_cost_list
...
>>> apply_tariff_iterrows(df)
Best of 3 trials with 100 function calls per trial:
Function `apply_tariff_iterrows` ran in average of 0.713 seconds.

Syntax-wise, this version is more explicit, with less clutter in the row value references, so it is more readable.

In terms of time, it is nearly five times faster! But there is more room for improvement. We are still using some form of Python for loop, which means every function call is made in Python; ideally it would be made in the faster language built into pandas’ internal architecture.
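For reference, the itertuples() version, a sketch that is not timed in the source and assumes date_time is still a regular column, would read:

>>> @timeit(repeat=3, number=100)
... def apply_tariff_itertuples(df):
...     energy_cost_list = []
...     for row in df.itertuples():
...         # Columns are attributes on the namedtuple
...         energy_used = row.energy_kwh
...         hour = row.date_time.hour
...         energy_cost_list.append(apply_tariff(energy_used, hour))
...     df['cost_cents'] = energy_cost_list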

Pandas’.apply() method

We can improve this operation further by using the .apply method instead of .iterrows. The .apply method takes a callable and applies it along an axis of the DataFrame (all rows or all columns). In this example, a lambda function helps pass the two columns of data into apply_tariff():

>>> @timeit(repeat=3, number=100)
... def apply_tariff_withapply(df):
...     df['cost_cents'] = df.apply(
...         lambda row: apply_tariff(
...             kwh=row['energy_kwh'],
...             hour=row['date_time'].hour),
...         axis=1)
...
>>> apply_tariff_withapply(df)
Best of 3 trials with 100 function calls per trial:
Function `apply_tariff_withapply` ran in average of 0.272 seconds.

The advantages of the .apply syntax are obvious: fewer lines and more readable code. In this case, it takes less than half the time of the .iterrows method.

But this is still not “very fast.” One reason is that .apply() internally tries to loop over Cython iterators. In this case, however, the lambda we passed is not something that can be handled in Cython, so it is called in Python, which keeps it from being all that fast.

If you used .apply() on 10 years of hourly data, you would be looking at around 15 minutes of processing time. If this calculation is only a small part of a larger model, you really should speed it up. This is where vectorization comes in handy.

Vectorization: select data using.isin()

What is vectorization? If the calculation were not conditional on some criteria, you could apply a single price to the whole energy column in one line of code: df['energy_kwh'] * 28. That particular operation is an example of a vectorized operation, and it is the fastest way to do things in pandas.

But how do you apply conditional calculations as vectorized operations in pandas? One trick is to split the DataFrame into selections according to your conditions, then apply a vectorized operation to each selection. In the next example, you’ll see how to select rows with pandas’ .isin() method and then build the new feature above with vectorized operations. Before doing this, it is convenient to set the date_time column as the DataFrame’s index:

df.set_index('date_time', inplace=True)

@timeit(repeat=3, number=100)
def apply_tariff_isin(df):
    # Define hour-range Boolean arrays
    peak_hours = df.index.hour.isin(range(17, 24))
    shoulder_hours = df.index.hour.isin(range(7, 17))
    off_peak_hours = df.index.hour.isin(range(0, 7))

    # Apply the rates to the selections defined above
    df.loc[peak_hours, 'cost_cents'] = df.loc[peak_hours, 'energy_kwh'] * 28
    df.loc[shoulder_hours, 'cost_cents'] = df.loc[shoulder_hours, 'energy_kwh'] * 20
    df.loc[off_peak_hours, 'cost_cents'] = df.loc[off_peak_hours, 'energy_kwh'] * 12

Let’s see what happens.

>>> apply_tariff_isin(df)
Best of 3 trials with 100 function calls per trial:
Function `apply_tariff_isin` ran in average of 0.010 seconds.

To understand what just happened in the code, we need to know that the.isin() method returns an array of Boolean values, like this:

[False, False, False, ..., True, True, True]

These values identify which DataFrame indices (datetimes) fall within the specified hour range. Then, when you pass these Boolean arrays to the DataFrame’s .loc indexer, you get a DataFrame slice containing only the rows matching those hours. After that, it’s just a matter of multiplying the slice by the appropriate rate, a fast vectorized operation.
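A tiny, self-contained illustration of this selection mechanism, on a hypothetical three-row frame:

>>> tiny = pd.DataFrame({'energy_kwh': [0.5, 0.6, 0.7]},
...                     index=pd.to_datetime(['2013-01-01 03:00',
...                                           '2013-01-01 12:00',
...                                           '2013-01-01 18:00']))
>>> peak = tiny.index.hour.isin(range(17, 24))   # [False, False, True]
>>> tiny.loc[peak, 'energy_kwh'] * 28            # only the 18:00 row is priced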

How does this compare to our loop operations above? First, you may notice that apply_tariff() is no longer needed, because all the conditional logic has moved into the row selection. That dramatically reduces both the lines of code you write and the Python code that gets called.

What about processing time? 315 times faster than the non-Pythonic loop, 71 times faster than .iterrows, and 27 times faster than .apply.

Could it have been better?

In apply_tariff_isin, we still do some “manual work” by calling df.loc and df.index.hour.isin three times each. You could argue this solution does not scale if we had a more fine-grained range of time slots. Fortunately, in this case you can do things even more programmatically with pandas’ pd.cut() function:

@timeit(repeat=3, number=100)
def apply_tariff_cut(df):
    cents_per_kwh = pd.cut(x=df.index.hour,
                           bins=[0, 7, 17, 24],
                           include_lowest=True,
                           labels=[12, 20, 28]).astype(int)
    df['cost_cents'] = cents_per_kwh * df['energy_kwh']

Let’s see what happens here: pd.cut() applies a set of labels (our costs) according to which bin each hour belongs to.
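For intuition, here is what pd.cut() returns for a few sample hours (a hypothetical illustration; hours away from the bin edges are chosen so the mapping is unambiguous):

>>> hours = pd.Series([3, 10, 20])
>>> pd.cut(hours, bins=[0, 7, 17, 24],
...        include_lowest=True, labels=[12, 20, 28])
# -> categorical values [12, 20, 28]: each hour mapped to its bin's rate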

Note that the include_lowest parameter indicates whether the first interval should be left-inclusive (we want hour = 0 included in the first group). This is a fully vectorized way to get the expected result, and it is the fastest method so far:

>>> apply_tariff_cut(df)
Best of 3 trials with 100 function calls per trial:
Function `apply_tariff_cut` ran in average of 0.003 seconds.

At this point we are close to the limit, taking less than a second to process the full 10-year hourly dataset. There is one final option, however: use NumPy functions to operate on the underlying NumPy array of each DataFrame column, then integrate the results back into pandas data structures.

Use NumPy to continue the acceleration

One thing that should not be forgotten when using pandas: Series and DataFrames are built on top of the NumPy library. This gives you even more computational flexibility, because pandas works seamlessly with NumPy arrays and operations.

Next, we’ll use NumPy’s digitize() function. It is similar to pandas’ cut() in that the data is binned, but this time the bins are represented by an array of indices indicating which bin each hour belongs to. These indices are then used to index into a price array:

import numpy as np  # np is used below but was not imported earlier

@timeit(repeat=3, number=100)
def apply_tariff_digitize(df):
    prices = np.array([12, 20, 28])
    bins = np.digitize(df.index.hour.values, bins=[7, 17, 24])
    df['cost_cents'] = prices[bins] * df['energy_kwh'].values
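For intuition, np.digitize() maps each hour to a bin index, which then picks a value out of the prices array (hypothetical sample hours):

>>> np.digitize([3, 10, 20], bins=[7, 17, 24])
array([0, 1, 2])
# prices[[0, 1, 2]] then picks out 12, 20 and 28 respectively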

Like cut, this syntax is concise and easy to read. But how does it compare in terms of speed? Let’s take a look:

>>> apply_tariff_digitize(df)
Best of 3 trials with 100 function calls per trial:
Function `apply_tariff_digitize` ran in average of 0.002 seconds.

At this point, there is still a performance improvement, but it is becoming more marginal. With pandas, it helps to keep the following hierarchy in mind when doing batch calculations like the ones shown here, ranked roughly from fastest to slowest (most flexible to least flexible):

  1. Use vectorization: pandas methods and functions with no for loops.

  2. Use the .apply method with a callable.

  3. Use itertuples: iterate over DataFrame rows as namedtuples from Python’s collections module.

  4. Use iterrows: iterate over DataFrame rows as (index, Series) pairs. Although a pandas Series is a flexible data structure, building every row into a Series and then accessing it can be expensive.

  5. Use “element-by-element” loops: update one cell or row at a time with df.loc or df.iloc.

Use HDFStore to prevent reprocessing

Now that you’ve seen how to speed up data flows in pandas, let’s explore how to avoid reprocessing time altogether with pandas’ built-in HDFStore.

In general, when building complex data models, it is convenient to do some preprocessing of the data. For example, if you had 10 years of minute-frequency electricity consumption data, simply converting the dates and times to datetime could take 20 minutes, even with the format parameter specified. You really only want to do that once, not every time you run your model, test, or analysis.

One very useful thing to do here is preprocess once, then store the data in its processed form for use whenever needed. But how can you store the data in the right format without having to reprocess it? If you saved to CSV, you would simply lose the datetime objects and have to re-parse them the next time you read the file.

Pandas has a built-in solution using HDF5, a high-performance storage format designed for storing tabular arrays of data. Pandas’ HDFStore class lets you store a DataFrame in an HDF5 file so it can be accessed efficiently while preserving column dtypes and other metadata. It is a dictionary-like class, so you read and write much as you would with a Python dict.

Here is how to store our preprocessed power-consumption DataFrame df in an HDF5 file:

# Create a store backed by a file named processed_data.h5
data_store = pd.HDFStore('processed_data.h5')

# Put the DataFrame into the store under the key 'preprocessed_df'
data_store['preprocessed_df'] = df
data_store.close()

Now you can shut down your computer and take a break. When you come back, the processed data will be there when you need it, with no need to process it again. Here is how to read data back from the HDF5 file with the dtypes preserved:

# Open the data store
data_store = pd.HDFStore('processed_data.h5')

# Retrieve the DataFrame by key
preprocessed_df = data_store['preprocessed_df']
data_store.close()

A data store can hold multiple tables, with the name of each table as a key.
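For example, a store can also be used as a context manager to hold several frames under different keys (a hypothetical sketch, with made-up file and key names):

>>> df_a = pd.DataFrame({'x': [1, 2]})
>>> df_b = pd.DataFrame({'y': [3.0, 4.0]})
>>> with pd.HDFStore('example_store.h5') as store:   # hypothetical file
...     store['table_a'] = df_a
...     store['table_b'] = df_b
...     print(store.keys())
['/table_a', '/table_b']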

A note about using HDFStore with pandas: it requires PyTables >= 3.0.0, so after installing pandas, be sure to update PyTables, like this:

pip install --upgrade tables

Conclusion

If you feel that your Pandas project is not fast, flexible, simple, or intuitive, please reconsider the way you use the library.

The examples explored here are fairly simple, but they illustrate how properly applying pandas features can greatly improve both runtime speed and code readability. Here are some rules of thumb to apply the next time you work with a large dataset in pandas:

  • Try to use vectorized operations where possible, rather than approaching problems with a “for x in df” mindset. If your code is home to a lot of for loops, it may be better suited to native Python data structures, because pandas carries a lot of overhead.
  • If you have more complex operations where vectorization is simply impossible or too difficult to work out efficiently, use the .apply method.
  • If you do have to loop over your data (and it does happen), use .iterrows() or .itertuples() to improve speed and syntax.
  • Pandas has many options, and there are almost always several ways to get from A to B. Take note of this, compare how different approaches perform, and choose the route that works best in the context of your project.
  • Once a data-cleaning script is set up, you can avoid reprocessing by storing the intermediate results with HDFStore.
  • Integrating NumPy into pandas operations generally improves speed and simplifies syntax.

Reference: realpython.com/fast-flexib…

If you find it helpful, please give it a thumbs up!

Welcome to my personal account: Python Data Science