Many organizations are trying to collect and use as much data as possible to improve the way they do business, increase revenue, and increase their impact. As a result, it is becoming increasingly common for data scientists to face data sets as large as 50GB or even 500GB.

However, such data sets are not easy to use. They’re small enough to fit on the hard drive of an everyday laptop, but far too big to fit into RAM, which makes them hard to even open and examine, let alone explore or analyze.

When working with such data sets, three strategies are commonly used.

The first is to sub-sample the data, but the downside is obvious: you can miss key insights by ignoring relevant parts of the data, or worse, misinterpret what the data means.

The second strategy is to use distributed computing. This can be an effective approach in some cases, but it requires a lot of overhead to manage and maintain the cluster.

Alternatively, you can rent a powerful cloud instance that has the memory needed to process the relevant data. For example, AWS provides instances with terabytes of RAM. In this case, you still have to manage the cloud data store, wait for data to be transferred from the storage space to the instance each time an instance starts, and consider the compliance issues of storing data on the cloud, as well as the inconvenience of working on a remote computer. Not to mention the cost, which, while low at first, tends to increase later.

Vaex is a new approach to this problem. It’s a faster, safer, and more convenient way to do data science on data of almost any size, as long as the data set fits on the hard drive of your laptop, desktop, or server.

What is Vaex?

Vaex is an open source DataFrame library for visualizing, exploring, and analyzing tabular data sets as large as your hard drive, and even doing machine learning on them.

It can compute statistics such as the mean, sum, count, and standard deviation on an N-dimensional grid at over a billion (10^9) objects/rows per second. Visualization is accomplished using histograms, density maps, and 3D volume rendering, enabling interactive exploration of big data. Vaex uses a memory-mapped, zero memory-copy policy for optimal performance (no memory wasted).

To achieve all this, Vaex relies on concepts such as memory mapping, efficient out-of-core algorithms, and lazy (deferred) evaluation. All of this is packaged into a Pandas-like API, so anyone can pick it up quickly.

Data analysis of a billion taxi trips

To illustrate this concept, let’s do a simple exploratory data analysis of a data set that doesn’t fit into a typical laptop’s RAM.

This article uses the New York City (NYC) Taxi Dataset, which contains information on more than 1 billion trips made by the iconic yellow taxis between 2009 and 2015. The data can be downloaded in CSV format from the website (*www1.nyc.gov/site/tlc/ab… *). The complete analysis can be viewed separately in this Jupyter notebook (*nbviewer.jupyter.org/github/vaex… *).

Why Vaex?

Performance: processes massive amounts of tabular data, over a billion rows per second

Virtual columns: computed on the fly, without wasting memory

Memory efficient: filtering/selections/subsets without memory copies

Visualization: supported directly; a single line of code is usually sufficient

User friendly API: you only ever deal with a single DataFrame object, and tab completion plus docstrings help you along (e.g. df.mean), much like Pandas

Lean: split into multiple packages

Jupyter integration: vaex-jupyter offers interactive visualization and selection in both the Jupyter notebook and Jupyter lab

It only takes 0.052 seconds to open a 100GB dataset

The first step is to convert the data to a memory-mappable file format, such as Apache Arrow, Apache Parquet, or HDF5. Examples of how to convert CSV data to HDF5 can also be found here. Once the data is in a memory-mappable format, it can be opened instantly with Vaex (in 0.052 seconds!), even if its size on disk exceeds 100GB:
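As a rough sketch of what this looks like in practice (the file paths and chunk size here are placeholders, not from the original analysis):

```python
import vaex

# One-off conversion: read the CSV in chunks and write a matching
# HDF5 file next to it, so it never has to fit in RAM all at once.
df = vaex.from_csv('yellow_taxi_2009_2015.csv', convert=True, chunk_size=5_000_000)

# From then on, open the memory-mapped file directly; this reads only
# metadata, so it is effectively instant regardless of file size.
df = vaex.open('yellow_taxi_2009_2015.csv.hdf5')
```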

Why so fast? When a memory-mapped file is opened using Vaex, no data is actually read. Vaex only reads the metadata of the file, such as the location of the data on disk, data structure (number of rows, columns, column names, and types), file description, and so on. So what if we want to examine or interact with data? Open the dataset to generate a standard DataFrame and perform a quick check on it:
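In a notebook, that quick check is as simple as evaluating the DataFrame; a minimal sketch:

```python
# Displaying the DataFrame reads only the first and last few rows
# from disk, so it returns immediately even for a 100GB file.
df  # or df.head(5)
```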

Note how short the cell execution time is. This is because displaying a Vaex DataFrame or column requires only the first and last five rows to be read from disk. This leads us to another important point: Vaex will only iterate over the entire data set when it has to, and it tries to do so in as few passes over the data as possible.

Anyway, let’s clean this data set, starting with the extreme outliers and erroneous data entries. A good approach is to use the describe method for a high-level overview of the data, which shows the number of samples, the number of missing values, and the data type of each column. If a column is numeric, the mean, standard deviation, and minimum and maximum values are also displayed. All of these statistics are calculated in a single pass over the data.
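In code, this is a one-liner; a sketch:

```python
# Count, missing values, dtype, mean, std, min and max for every
# column, computed in a single pass over the 100GB file.
df.describe()
```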

Use the Describe method to get a high-level overview of the DataFrame. Note that the DataFrame contains 18 columns of data, although the screenshot only shows the first seven columns.

The describe method is a good illustration of Vaex’s power and efficiency: all these statistics were calculated in less than 3 minutes on my 2018 MacBook Pro (15-inch, 2.6GHz Intel Core i7, 32GB RAM). Other libraries or approaches would require distributed computing or a cloud instance with more than 100GB of RAM to perform the same computation. With Vaex, all you need is the data and a laptop with just a few gigabytes of RAM.

Looking at the describe output, it is easy to notice that the data contains some serious outliers.

Start by checking the pickup locations. The easiest way to eliminate outliers is simply to plot the pick-up and drop-off locations and visually define the NYC region we want to focus our analysis on. Since we are working with such a large data set, a histogram is the most effective visualization. Creating and displaying histograms and heatmaps with Vaex is fast, and the plots are interactive!
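A sketch of such a plot, assuming the dataset’s standard pickup_longitude/pickup_latitude column names (in newer Vaex versions the same call is exposed as df.viz.heatmap):

```python
# 2D histogram (heatmap) of pickup locations; f='log1p' makes the long
# tail of densities visible, and shape controls the grid resolution.
df.plot(df.pickup_longitude, df.pickup_latitude, f='log1p', shape=512)
```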

Once we have interactively determined the NYC region to focus on, we can simply create a filtered DataFrame:
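For example, with illustrative bounding-box values around NYC:

```python
# Approximate bounding box around New York City (illustrative values).
long_min, long_max = -74.05, -73.75
lat_min, lat_max = 40.58, 40.90

df_filtered = df[(df.pickup_longitude > long_min) & (df.pickup_longitude < long_max) &
                 (df.pickup_latitude > lat_min) & (df.pickup_latitude < lat_max) &
                 (df.dropoff_longitude > long_min) & (df.dropoff_longitude < long_max) &
                 (df.dropoff_latitude > lat_min) & (df.dropoff_latitude < lat_max)]
```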

The cool thing about the code above is that it needs a negligible amount of memory to execute! When filtering a Vaex DataFrame, no data is copied; instead, only a reference to the original object is created, with a binary mask applied to it. The mask determines which rows are selected and is used in future calculations. This saves us 100GB of RAM, whereas many of today’s standard data science tools would copy the data.

Now check the passenger_count column. The maximum number of passengers recorded in a single taxi trip is 255, which seems a little extreme. Counting the number of trips per passenger count is easy to do with the value_counts method:
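A sketch:

```python
# Frequency of each passenger_count value across ~1 billion rows.
df_filtered.passenger_count.value_counts()
```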

Using the value_counts method on a billion rows of data takes only 20 seconds

As you can see from the graph above, trips with more than six passengers are likely rare outliers or simply erroneous data entries, and there is also a large number of trips with zero passengers. Since we don’t know whether these trips are legitimate, let’s filter them out as well:
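A sketch of that filter:

```python
# Keep only trips with 1 to 6 passengers.
df_filtered = df_filtered[(df_filtered.passenger_count > 0) &
                          (df_filtered.passenger_count < 7)]
```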

Let’s do a similar exercise for the trip distance. Since this is a continuous variable, we can plot the distribution of distances traveled. Let’s draw a histogram over a more reasonable range:
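A sketch, using the older plot1d API (newer Vaex versions expose this as df.viz.histogram):

```python
# 1D histogram of trip distance, limited to a plausible 0-100 mile range.
df_filtered.plot1d(df_filtered.trip_distance, limits=[0, 100], shape=64)
```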

Histogram of trip distances in the New York taxi data

As can be seen from the figure above, the number of trips decreases as the distance increases. At a distance of about 100 miles, the distribution drops off sharply. For now, we will use this as a cutoff to eliminate extreme outliers based on trip distance:
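A sketch:

```python
# Drop trips with non-positive or implausibly long distances.
df_filtered = df_filtered[(df_filtered.trip_distance > 0) &
                          (df_filtered.trip_distance < 100)]
```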

The presence of extreme outliers in the trip distance column is also a motivation to examine trip durations and average taxi speeds. These features are not available in the dataset, but are simple to calculate:
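A sketch of those virtual columns, assuming the datetime columns follow the dataset’s tpep_pickup_datetime/tpep_dropoff_datetime naming:

```python
import numpy as np

# Virtual columns: only the expressions are stored; the values are
# computed lazily, when actually needed.
df_filtered['trip_duration_min'] = (df_filtered.tpep_dropoff_datetime -
                                    df_filtered.tpep_pickup_datetime) / np.timedelta64(1, 'm')
df_filtered['trip_speed_mph'] = df_filtered.trip_distance / (df_filtered.trip_duration_min / 60)
```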

The above block of code requires essentially no memory and no time to execute! That is because it only creates virtual columns. These columns hold just their mathematical expressions, which are evaluated only when needed. Otherwise, virtual columns behave just like any other regular column. Note that other standard libraries would require tens of gigabytes of RAM for the same operation.

Ok, let’s plot the distribution of travel time:
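A sketch (the range is illustrative):

```python
# Histogram of trip durations in minutes.
df_filtered.plot1d(df_filtered.trip_duration_min, limits=[0, 120], shape=64)
```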

A histogram of the time spent on more than a billion taxi trips in New York

As you can see from the graph above, 95% of taxi trips take less than 30 minutes to reach their destination, although some trips can take four to five hours. Can you imagine being stuck in a taxi in New York City for more than three hours? In any case, let’s keep an open mind and consider all trips that take less than 3 hours:
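A sketch:

```python
# Keep trips with a positive duration of under 3 hours.
df_filtered = df_filtered[(df_filtered.trip_duration_min > 0) &
                          (df_filtered.trip_duration_min < 180)]
```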

Now, let’s study the average taxi speed and choose a reasonable data range:
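A sketch:

```python
# Histogram of average trip speed over an illustrative range.
df_filtered.plot1d(df_filtered.trip_speed_mph, limits=[0, 100], shape=64)
```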

Distribution of average taxi speed

Based on where the distribution flattens out, we can infer that a reasonable average travel speed lies between 1 and 60 miles per hour, so we can update the filtered DataFrame:
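A sketch:

```python
# Keep trips with an average speed between 1 and 60 mph.
df_filtered = df_filtered[(df_filtered.trip_speed_mph > 1) &
                          (df_filtered.trip_speed_mph < 60)]
```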

Let’s shift the focus to taxi fares. From the output of the describe method, we can see some crazy outliers in the fare_amount, total_amount, and tip_amount columns. For starters, none of the values in these columns should be negative. Meanwhile, the figures suggest that some lucky drivers almost became millionaires from a single taxi trip. Let’s look at the distribution of these amounts within a relatively reasonable range:
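A sketch of how those three distributions might be drawn side by side (plot1d draws onto the current matplotlib axes; the range is illustrative):

```python
import matplotlib.pyplot as plt

# Plot the three fare-related distributions over a sane range.
plt.figure(figsize=(14, 4))
for i, col in enumerate(['fare_amount', 'total_amount', 'tip_amount']):
    plt.subplot(1, 3, i + 1)
    df_filtered.plot1d(df_filtered[col], limits=[0, 100], shape=64)
    plt.title(col)
plt.tight_layout()
plt.show()
```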

The distributions of fares, totals, and tips for more than a billion taxi trips in New York. Drawing these charts took only 31 seconds on a laptop!

We see that all three of these distributions have fairly long tails. Some of the values in the tails may be legitimate, while others may be erroneous data entries. In any case, let’s be conservative for now and only consider trips where fare_amount, total_amount, and tip_amount are all less than $200. We also require fare_amount and total_amount to be greater than zero.
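A sketch:

```python
# Conservative fare filter: positive fares and totals, everything under $200.
df_filtered = df_filtered[(df_filtered.fare_amount > 0) & (df_filtered.fare_amount < 200) &
                          (df_filtered.total_amount > 0) & (df_filtered.total_amount < 200) &
                          (df_filtered.tip_amount >= 0) & (df_filtered.tip_amount < 200)]
```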

Finally, after this initial cleanup of the data, let’s see how much taxi data is left to analyze:
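A sketch:

```python
# Number of trips remaining after all the cleaning steps.
len(df_filtered)
```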

Over 1.1 billion trips remain! That is a wealth of data to gain insight into what lies behind these taxi journeys.

Afterword

In addition, the full analysis also uses Vaex to examine the data from the perspective of maximizing profits for taxi drivers. In short, Vaex will help you mitigate some of the data challenges you may face.

With Vaex, you can scan through more than a billion rows of data, compute statistics, aggregate, and produce informative plots in mere seconds, all from your own laptop. It’s free and open source.

If you’re interested in exploring the data set used in this article, you can use it directly from S3 with Vaex; see the full Jupyter notebook to learn how.

Vaex official website: *vaex.io/ *

Documentation: *docs.vaex.io/ *

GitHub: *github.com/vaexio/vaex… *

PyPI: *pypi.python.org/pypi/vaex/ *