By Itamar Turner-Trauring

Imagine you have a data set so large that reading it all into memory would overflow it, but you want to use Pandas on only part of it at a time. One option is to load the data in chunks.

The problem is that even if you split the data set into chunks and load them one by one, every lookup still has to read the whole file, so it stays slow.

The solution is to build a searchable index, and SQLite makes that easy.

The first method: chunking

Consider a concrete example: you’re running for office, and you have a CSV file containing information about every voter in your city. You want to send canvassers door to door, so you need to pull out all the registered voters on a given street.

A Pandas DataFrame does have an index, but to build it you have to read all the data into memory, and the CSV file is too large for that. So the idea is to load only the records you actually care about.

That’s the first method: read the file in chunks and filter each one.

import pandas as pd

def get_voters_on_street(name):
    return pd.concat(
        df[df["street"] == name]
        for df in pd.read_csv("voters.csv", chunksize=1000)
    )

The CSV file is loaded in chunks, each chunk is filtered by street name, and the matching rows are concatenated into a single DataFrame.

Even though we only care about a small subset of the rows, we still pay the overhead of parsing every line of the CSV. For a data set of about 70,000 rows, the function above takes 574 ms on my computer. New York City had about 4.6 million registered voters in 2018; at that scale, every single street lookup would take around 30 seconds.

That’s fine if we only do it once, but if we have to run this query over and over, it’s far from optimal.
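
If you want to reproduce that kind of measurement on your own machine, here is a minimal sketch using time.perf_counter; the street name "MAIN ST" is hypothetical, and your numbers will of course differ:

import time

# Time a single street lookup with the chunked approach defined above:
start = time.perf_counter()
voters = get_voters_on_street("MAIN ST")
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"found {len(voters)} voters in {elapsed_ms:.0f} ms")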

The second method: creating an index

An index is a summary of your data, a way of saying: “if you care about this value, the records you want are over here.” In our example, we want an index on the street name, so we can quickly load just the voters who live on a given street.
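
To make the idea concrete, here is a sketch (not part of the original article) of what a street index would look like if we built it by hand, as a dict mapping each street name to the row numbers of the voters on it; "MAIN ST" is a hypothetical street name:

from collections import defaultdict

import pandas as pd

# Build an in-memory index: street name -> row numbers of voters on it.
# Assumes the same voters.csv with a 'street' column as above.
street_index = defaultdict(list)
for chunk in pd.read_csv("voters.csv", chunksize=1000):
    for row_number, street in zip(chunk.index, chunk["street"]):
        street_index[street].append(row_number)

# Looking up a street is now a single dictionary access:
rows_on_main_st = street_index["MAIN ST"]

This works, but the dict itself lives in memory, and you still have to go back to the file to fetch the actual rows.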

If you’re worried that the indexed data won’t fit in memory, a database can store it for you: PostgreSQL, MySQL, and so on. But if you don’t want to install and maintain one of those server processes, there’s SQLite.

SQLite is a full-featured relational database that works much like any other database, but it doesn’t need a server, and Python supports it out of the box. SQLite keeps everything in a single file, so instead of a CSV file you manage one SQLite database file.
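
As a quick sanity check that nothing needs installing, a minimal sketch (the file name example.sqlite is just an illustration):

import sqlite3

# sqlite3 ships with the Python standard library; no server required.
conn = sqlite3.connect("example.sqlite")  # creates a single file on disk
print(conn.execute("SELECT sqlite_version()").fetchone())
conn.close()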

Store data with SQLite

Here’s how to use SQLite with Pandas:

1. Load the data into SQLite and create indexes

An SQLite database can hold multiple tables. First, we load the data from voters.csv into SQLite and store it in a table named voters.

Next, we create an index on the street column in SQLite.

Just do the following:

import sqlite3

# Create a new database file:
db = sqlite3.connect("voters.sqlite")

# Load the CSV in chunks:
for c in pd.read_csv("voters.csv", chunksize=1000):
    # Append all rows to a new database table, which
    # we name 'voters':
    c.to_sql("voters", db, if_exists="append")
# Add an index on the 'street' column:
db.execute("CREATE INDEX street ON voters(street)") 
db.close()

Here we create only a single index, but we could also create indexes on other columns, or on multiple columns together, so that we can search the database quickly on those columns as well.
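
For example, a sketch of what that could look like; the last_name column and the composite index are hypothetical and assume such a column exists in the CSV:

import sqlite3

db = sqlite3.connect("voters.sqlite")
# Another single-column index, assuming the table has a 'last_name' column:
db.execute("CREATE INDEX IF NOT EXISTS last_name ON voters(last_name)")
# A composite index for queries that filter on street and last name together:
db.execute("CREATE INDEX IF NOT EXISTS street_last_name ON voters(street, last_name)")
db.close()

You can check whether a given query actually uses an index with SQLite’s EXPLAIN QUERY PLAN statement.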

2. Rewrite the query function

Now that all the data has been loaded into SQLite, we can retrieve voters by street.

def get_voters_for_street(street_name):
    conn = sqlite3.connect("voters.sqlite")
    q = "SELECT * FROM voters WHERE street = ?"
    values = (street_name,)
    return pd.read_sql_query(q, conn, params=values)

SQLite loads only the rows that match the query, and Pandas turns them into a DataFrame.
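
A quick usage sketch (the street name is again hypothetical):

voters_df = get_voters_for_street("MAIN ST")
print(len(voters_df), "registered voters on this street")
print(voters_df.head())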

More than 50 times faster

The query on the 70,000-row CSV file that used to take 574 ms now takes only 10 ms.

That’s more than 50 times faster, because we only load the rows we care about rather than every row in the CSV file.
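
A sketch of how such a comparison can be measured (both functions are defined above; "MAIN ST" is a hypothetical street name and the exact numbers depend on your machine):

import time

def time_call(fn, *args):
    # Return the wall-clock time of one call, in milliseconds.
    start = time.perf_counter()
    fn(*args)
    return (time.perf_counter() - start) * 1000

chunked_ms = time_call(get_voters_on_street, "MAIN ST")
indexed_ms = time_call(get_voters_for_street, "MAIN ST")
print(f"chunked CSV: {chunked_ms:.0f} ms, SQLite index: {indexed_ms:.0f} ms")
print(f"speedup: {chunked_ms / indexed_ms:.0f}x")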

The original link: pythonspeed.com/articles/in…
