- How to rewrite your SQL queries in Pandas, and more
- By Irina Truong
- The Nuggets Translation Project
- Permanent link to this article: github.com/xitu/gold-m…
- Translator: geniusq1981
- Proofread by: DAA233
Fifteen years ago, a software developer needed only a few skills to be qualified for 95 percent of available jobs. These skills included:
- Object-oriented programming
- Scripting languages
- JavaScript, and so on
- SQL
SQL is the go-to tool when you need to quickly scan some data and draw preliminary conclusions, conclusions that might lead to an analysis report or an application. This is called exploratory analysis.
Today, data comes in many shapes and forms, and is no longer a synonym for “relational database.” Your data might be a CSV file, plain text, Parquet, HDF5, or something else. This is where Pandas shines.
What is Pandas?
Pandas, the Python Data Analysis Library, is an open-source library for data analysis and processing, backed by Anaconda. It is particularly well suited to structured (tabular) data. For more information, see pandas.pydata.org/pandas-docs… .
What can you do with it?
Pandas can do everything you can do with SQL queries, and much more.
That’s great! Where do I start?
This is the daunting part for anyone used to expressing data questions as SQL statements.
SQL is a declarative programming language: en.wikipedia.org/wiki/List_o… .
With SQL, you declare what you want in statements that read almost like plain English sentences.
Pandas has a completely different syntax from SQL. In Pandas, you work with datasets and chain transformations together to reshape them the way you want.
We need a phrasebook!
The anatomy of an SQL query
An SQL query is built from a handful of important keywords. Between those keywords, you add the specifics of the data you want to see. Here is a query skeleton without the specifics:
SELECT… FROM… WHERE…
GROUP BY… HAVING…
ORDER BY…
LIMIT… OFFSET…
There are other commands, of course, but these are the most important. How do we implement these commands in Pandas?
First, we need to load some data into Pandas, since it is not already sitting in a database:
import pandas as pd
airports = pd.read_csv('data/airports.csv')
airport_freq = pd.read_csv('data/airport-frequencies.csv')
runways = pd.read_csv('data/runways.csv')
My data comes from ourairports.com/data/.
SELECT, WHERE, DISTINCT, LIMIT
Here are some SELECT statements. We truncate results with LIMIT, filter with WHERE, and remove duplicates with DISTINCT.
SQL | Pandas |
---|---|
`select * from airports` | `airports` |
`select * from airports limit 3` | `airports.head(3)` |
`select id from airports where ident = 'KLAX'` | `airports[airports.ident == 'KLAX'].id` |
`select distinct type from airports` | `airports.type.unique()` |
SELECT operations using multiple criteria
We combine multiple conditions with the & operator. If we only want a subset of the table's columns, we can select them by adding another pair of square brackets.
SQL | Pandas |
---|---|
`select * from airports where iso_region = 'US-CA' and type = 'seaplane_base'` | `airports[(airports.iso_region == 'US-CA') & (airports.type == 'seaplane_base')]` |
`select ident, name, municipality from airports where iso_region = 'US-CA' and type = 'large_airport'` | `airports[(airports.iso_region == 'US-CA') & (airports.type == 'large_airport')][['ident', 'name', 'municipality']]` |
ORDER BY
By default, Pandas sorts in ascending order. For descending order, set ascending=False.
SQL | Pandas |
---|---|
`select * from airport_freq where airport_ident = 'KLAX' order by type` | `airport_freq[airport_freq.airport_ident == 'KLAX'].sort_values('type')` |
`select * from airport_freq where airport_ident = 'KLAX' order by type desc` | `airport_freq[airport_freq.airport_ident == 'KLAX'].sort_values('type', ascending=False)` |
IN… NOT IN
We know how to filter on a single value, but what about filtering on a list of values, as with IN in SQL? In Pandas, the .isin() operator works the same way as SQL's IN. To negate the condition, use ~.
SQL | Pandas |
---|---|
`select * from airports where type in ('heliport', 'balloonport')` | `airports[airports.type.isin(['heliport', 'balloonport'])]` |
`select * from airports where type not in ('heliport', 'balloonport')` | `airports[~airports.type.isin(['heliport', 'balloonport'])]` |
GROUP BY, COUNT, ORDER BY
Grouping is simple: use the .groupby() operator. There is a subtle difference between COUNT semantics in SQL and Pandas: in Pandas, .count() returns the number of non-null/non-NaN values. To get the same result as SQL's COUNT, use .size().
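To see the difference, here is a minimal sketch with a tiny made-up frame (the airports data is not needed for this):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'type': ['a', 'a', 'b'], 'value': [1, np.nan, 3]})

df.groupby('type').count()  # counts non-null values: group 'a' shows 1 for 'value', not 2
df.groupby('type').size()   # counts rows, like SQL's COUNT(*): a -> 2, b -> 1
```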
Next, we group on multiple fields. Pandas sorts on the same list of fields by default, so there is no need for .sort_values() in the first example. If we want to sort on a different field, or use DESC instead of ASC, as in the second example, we must be explicit with .sort_values():
SQL | Pandas |
---|---|
`select iso_country, type, count(*) from airports group by iso_country, type order by iso_country, type` | `airports.groupby(['iso_country', 'type']).size()` |
`select iso_country, type, count(*) from airports group by iso_country, type order by iso_country, count(*) desc` | `airports.groupby(['iso_country', 'type']).size().to_frame('size').reset_index().sort_values(['iso_country', 'size'], ascending=[True, False])` |
Why the .to_frame() and .reset_index()? Because we want to sort by our computed field (size), that field needs to become part of the DataFrame. Grouping and sizing gives us back a different kind of object (a Series indexed by the group keys), so we use .to_frame() to convert it to a DataFrame. Then .reset_index() turns the group keys back into regular columns and renumbers the rows.
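Unrolled step by step, the chain looks roughly like this (a sketch; the intermediate names are mine):

```python
by_size = airports.groupby(['iso_country', 'type']).size()  # Series indexed by (iso_country, type)
df = by_size.to_frame('size')   # convert the Series into a DataFrame with a 'size' column
df = df.reset_index()           # group keys become regular columns again
df = df.sort_values(['iso_country', 'size'], ascending=[True, False])  # sort by the computed field
```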
HAVING
In SQL, you can additionally filter grouped data using a HAVING condition. In Pandas, you can use .filter() and give it a Python function (or a lambda) that returns True if the group should be included in the result.
SQL | Pandas |
---|---|
`select type, count(*) from airports where iso_country = 'US' group by type having count(*) > 1000 order by count(*) desc` | `airports[airports.iso_country == 'US'].groupby('type').filter(lambda g: len(g) > 1000).groupby('type').size().sort_values(ascending=False)` |
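That one-liner is dense; broken into steps it reads roughly like this (a sketch, with intermediate names of my own):

```python
us_airports = airports[airports.iso_country == 'US']                    # where iso_country = 'US'
frequent = us_airports.groupby('type').filter(lambda g: len(g) > 1000)  # having count(*) > 1000
counts = frequent.groupby('type').size()                                # select type, count(*)
counts.sort_values(ascending=False)                                     # order by count(*) desc
```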
Top N records
Suppose we did some preliminary queries and now have a dataframe named by_country that contains the number of airports per country:
In the first example below, we order by airport_count and select only the top 10 countries. The second example is more involved: we want “the next 10 after the top 10,” i.e. rows 11 to 20:
SQL | Pandas |
---|---|
`select iso_country from by_country order by airport_count desc limit 10` | `by_country.nlargest(10, columns='airport_count')` |
`select iso_country from by_country order by airport_count desc limit 10 offset 10` | `by_country.nlargest(20, columns='airport_count').tail(10)` |
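The preliminary queries themselves are not shown here; one plausible way to build by_country with the grouping tools above (the column name airport_count is taken from the examples) would be:

```python
# count airports per country and make the count a regular, sortable column
by_country = (airports.groupby('iso_country')
                      .size()
                      .to_frame('airport_count')
                      .reset_index())

top_10 = by_country.nlargest(10, columns='airport_count')            # top 10 countries
next_10 = by_country.nlargest(20, columns='airport_count').tail(10)  # countries 11 to 20
```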
Aggregate functions (MIN, MAX, MEAN)
Now, given a dataframe of runway data, calculate the minimum, maximum, mean, and median runway length:
SQL | Pandas |
---|---|
`select max(length_ft), min(length_ft), mean(length_ft), median(length_ft) from runways` | `runways.agg({'length_ft': ['min', 'max', 'mean', 'median']})` |
You will notice that with the SQL query, each statistic is a column of the result. But with Pandas aggregation, each statistic is a row:
Don't worry: simply transpose the dataframe with .T to get the statistics as columns:
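For example, a quick sketch:

```python
stats = runways.agg({'length_ft': ['min', 'max', 'mean', 'median']})  # one row per statistic
stats.T  # transposed: one column per statistic, like the SQL result
```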
JOIN
Use .merge() to join Pandas dataframes. You need to provide the columns to join on (left_on and right_on) and the join type: inner (the default), left (LEFT OUTER in SQL), right (RIGHT OUTER), or outer (FULL OUTER).
SQL | Pandas |
---|---|
`select airport_ident, type, description, frequency_mhz from airport_freq join airports on airport_freq.airport_ref = airports.id where airports.ident = 'KLAX'` | `airport_freq.merge(airports[airports.ident == 'KLAX'][['id']], left_on='airport_ref', right_on='id', how='inner')[['airport_ident', 'type', 'description', 'frequency_mhz']]` |
UNION ALL and UNION
Use pd.concat() to UNION ALL two dataframes:
SQL | Pandas |
---|---|
`select name, municipality from airports where ident = 'KLAX' union all select name, municipality from airports where ident = 'KLGB'` | `pd.concat([airports[airports.ident == 'KLAX'][['name', 'municipality']], airports[airports.ident == 'KLGB'][['name', 'municipality']]])` |
To deduplicate rows in the process (the equivalent of UNION), add .drop_duplicates().
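For example, the UNION version of the query above would look like this sketch:

```python
# UNION: concatenate the two selections, then drop duplicate rows
pd.concat([
    airports[airports.ident == 'KLAX'][['name', 'municipality']],
    airports[airports.ident == 'KLGB'][['name', 'municipality']],
]).drop_duplicates()
```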
INSERT
So far, we've been talking about filtering, but in the course of your exploratory analysis you may need to modify data as well. What do you do if you want to add some missing records?
Pandas has no method resembling an INSERT statement. Instead, you create a new dataframe containing the new records and then concatenate the two:
SQL | Pandas |
---|---|
`create table heroes (id integer, name text);` | `df1 = pd.DataFrame({'id': [1, 2], 'name': ['Harry Potter', 'Ron Weasley']})` |
`insert into heroes values (1, 'Harry Potter');` | `df2 = pd.DataFrame({'id': [3], 'name': ['Hermione Granger']})` |
`insert into heroes values (2, 'Ron Weasley');` | |
`insert into heroes values (3, 'Hermione Granger');` | `pd.concat([df1, df2]).reset_index(drop=True)` |
UPDATE
Now we need to fix some bad data in the original dataframe:
SQL | Pandas |
---|---|
`update airports set home_link = 'http://www.lawa.org/welcomelax.aspx' where ident = 'KLAX'` | `airports.loc[airports['ident'] == 'KLAX', 'home_link'] = 'http://www.lawa.org/welcomelax.aspx'` |
DELETE
The simplest (and most readable) way to “delete” rows from a Pandas dataframe is to extract a new dataframe containing the subset of rows you wish to keep. Alternatively, you can get the row indexes and drop those rows with the .drop() method:
SQL | Pandas |
---|---|
`delete from lax_freq where type = 'MISC'` | `lax_freq = lax_freq[lax_freq.type != 'MISC']` |
 | `lax_freq.drop(lax_freq[lax_freq.type == 'MISC'].index)` |
Immutability
One important thing to mention is immutability. By default, most operators applied to a Pandas dataframe return a new object. Some operators accept inplace=True, letting you keep working with the original dataframe. For example, here is how to reset an index in place:
df.reset_index(drop=True, inplace=True)
However, the .loc operator in the UPDATE example above simply locates the records to be updated, and those values are changed in place. Also, if you update all the values of a column:
df['url'] = 'http://google.com'
Or add a new computed column:
df['total_cost'] = df['price'] * df['quantity']
These all change in place.
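To summarize the contrast, a small sketch (assuming a df with a price column, as in the examples above):

```python
df2 = df.sort_values('price')          # returns a new, sorted dataframe; df is unchanged
df.sort_values('price', inplace=True)  # sorts df itself and returns None

df.loc[df['price'] > 100, 'price'] = 100  # .loc assignment modifies df in place
```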
Much more!
The nice thing about Pandas is that it is more than just a query engine. You can do much more with your data, such as:
- Output in multiple formats:
df.to_csv(...) # csv file
df.to_hdf(...) # HDF5 file
df.to_pickle(...) # serialized object
df.to_sql(...) # to SQL database
df.to_excel(...) # to Excel sheet
df.to_json(...) # to JSON string
df.to_html(...) # render as HTML table
df.to_feather(...) # binary feather-format
df.to_latex(...) # tabular environment table
df.to_stata(...) # Stata binary data files
df.to_msgpack(...) # msgpack (serialize) object
df.to_gbq(...) # to a Google BigQuery table.
df.to_string(...) # console-friendly tabular output.
df.to_clipboard(...) # clipboard that can be pasted into Excel
- Plotting:
top_10.plot(
x='iso_country',
y='airport_count',
kind='barh',
figsize=(10, 7),
title='Top 10 countries with most airports')
And you get some great-looking charts!
- Sharing:
The best medium for sharing Pandas results, plots, and related content is a Jupyter notebook (jupyter.org/). In fact, some people (like Jake Vanderplas, who is awesome) publish entire books as Jupyter notebooks: github.com/jakevdp/Pyt… .
It’s easy to create a new notebook:
pip install jupyter
jupyter notebook
Then:
- Open localhost:8888 in your browser
- Click “New” and give the notebook a name
- Query and display data
- Create a GitHub repository and add your notebooks (files with the .ipynb extension) to it.
GitHub has a great built-in viewer that renders the contents of Jupyter notebooks.
Now, you can begin your tour of Pandas!
I hope you’re now convinced that Pandas can help you do exploratory data analysis just as well as your old friend SQL, and in some cases even better. It’s time to start querying data in Pandas yourself!