Pandas is a library for data processing and analysis. Pandas is a library for data processing and analysis.

But admittedly, not all programmers can use Python, and not all Pythoners can use Pandas.

The good news is that with PandassQL, you can use SQL to manipulate the DataFrame.

# import related libraries
import numpy as np
import pandas as pd

from pandasql import sqldf, load_meat, load_births
Copy the code

basis

The main function in pandasQL is SQLDF, which takes two arguments:

  • An SQL query statement
  • A set of session/environment variables (locals()globals())

For convenience, we can define a function that we can call easily.

pysqldf = lambda sql: sqldf(sql, globals())
Copy the code

Next we import some data.

meat = load_meat()
meat.head()
Copy the code
date beef veal pork lamb_and_mutton broilers other_chicken turkey
0 1944-01-01 751.0 85.0 1280.0 89.0 NaN NaN NaN
1 1944-02-01 713.0 77.0 1169.0 72.0 NaN NaN NaN
2 1944-03-01 741.0 90.0 1128.0 75.0 NaN NaN NaN
3 1944-04-01 650.0 89.0 978.0 66.0 NaN NaN NaN
4 1944-05-01 681.0 106.0 1029.0 78.0 NaN NaN NaN
births = load_births()
births.head()
Copy the code
date births
0 1975-01-01 265775
1 1975-02-01 241045
2 1975-03-01 268849
3 1975-04-01 247455
4 1975-05-01 254545

The query

The syntax used by PandassQL is SQLite syntax. Any DataFrame will be automatically detected by PandassQL and you can query them as such.

Limit the number of

Let’s see how to limit the number of data items. So let’s get the first two.

sql = "select * from births limit 2"
pysqldf(sql)
Copy the code
date births
0 1975-01-01 00:00:00. 000000 265775
1 1975-02-01 00:00:00. 000000 241045

In addition to limiting the first N pieces of data from scratch, we can also set offsets. Let’s get the first two pieces of data from the second row.

sql = "select * from births limit 2 offset 2"
pysqldf(sql)
Copy the code
date births
0 1975-03-01 00:00:00. 000000 268849
1 1975-04-01 00:00:00. 000000 247455

Finite field

Since this is SQL, we can of course restrict the required fields in the query. We limit the acquisition of the indicated points here.

sql = "select births from births limit 2"
pysqldf(sql)
Copy the code
births
0 265775
1 241045

The sorting

Sorting functionality is also very common and PandassQL supports it perfectly. We have programmed the holocene progression from date to date.

sql = "select * from births order by date desc, births asc limit 2"
pysqldf(sql)
Copy the code
date births
0 2012-12-01 00:00:00. 000000 340995
1 2012-11-01 00:00:00. 000000 320195

Restricted query conditions

We can specify where to query data that meets the requirements. Here we select the data with Turkey not empty and date after 1974-12-31.

sql = """
select *
from meat
where turkey not null
and date > '1974-12-31'
limit 5
"""
pysqldf(sql)
Copy the code
date beef veal pork lamb_and_mutton broilers other_chicken turkey
0 1975-01-01 00:00:00. 000000 2106.0 59.0 1114.0 36.0 646.2 None 64.9
1 1975-02-01 00:00:00. 000000 1845.0 50.0 954.0 31.0 570.2 None 47.1
2 1975-03-01 00:00:00. 000000 1891.0 57.0 976.0 35.0 616.6 None 54.4
3 1975-04-01 00:00:00. 000000 1895.0 60.0 1100.0 34.0 688.3 None 68.7
4 1975-05-01 00:00:00. 000000 1849.0 59.0 934.0 31.0 690.1 None 81.9

The aggregation

Aggregation is essential for data analysis, and PandassQL certainly supports it. Here we group the points by year and sum up, mean, Max and minimum the points.

sql = """ select strftime('%Y', date) as year, sum(births), avg(births), max(births), min(births) from births group by strftime('%Y', date) limit 3 """
pysqldf(sql)
Copy the code
year sum(births) avg(births) max(births) min(births)
0 1975 3136965 261413.750000 281300 241045
1 1976 6304156 262673.166667 286496 236551
2 1979 3333279 277773.250000 302805 249898

associated

Association is also a very common operation. Here we associate the meat and moisture of the DataFrame according to the field date as the condition.

sql = """ select m.date, b.births, m.beef from meat m inner join births b on m.date = b.date order by m.date limit 5; "" "
pysqldf(sql)
Copy the code
date births beef
0 1975-01-01 00:00:00. 000000 265775 2106.0
1 1975-02-01 00:00:00. 000000 241045 1845.0
2 1975-03-01 00:00:00. 000000 268849 1891.0
3 1975-04-01 00:00:00. 000000 247455 1895.0
4 1975-05-01 00:00:00. 000000 254545 1849.0

In addition to the common functions listed above, PandassQL supports many more operations that are based on SQLite syntax, which you can explore for yourself.


To learn more about artificial intelligence, please follow our official account: AI Pai