Pandas is a library for data processing and analysis. Pandas is a library for data processing and analysis.
But admittedly, not all programmers can use Python, and not all Pythoners can use Pandas.
The good news is that with PandassQL, you can use SQL to manipulate the DataFrame.
# import related libraries
import numpy as np
import pandas as pd
from pandasql import sqldf, load_meat, load_births
Copy the code
basis
The main function in pandasQL is SQLDF, which takes two arguments:
- An SQL query statement
- A set of session/environment variables (
locals()
或globals()
)
For convenience, we can define a function that we can call easily.
pysqldf = lambda sql: sqldf(sql, globals())
Copy the code
Next we import some data.
meat = load_meat()
meat.head()
Copy the code
date | beef | veal | pork | lamb_and_mutton | broilers | other_chicken | turkey | |
---|---|---|---|---|---|---|---|---|
0 | 1944-01-01 | 751.0 | 85.0 | 1280.0 | 89.0 | NaN | NaN | NaN |
1 | 1944-02-01 | 713.0 | 77.0 | 1169.0 | 72.0 | NaN | NaN | NaN |
2 | 1944-03-01 | 741.0 | 90.0 | 1128.0 | 75.0 | NaN | NaN | NaN |
3 | 1944-04-01 | 650.0 | 89.0 | 978.0 | 66.0 | NaN | NaN | NaN |
4 | 1944-05-01 | 681.0 | 106.0 | 1029.0 | 78.0 | NaN | NaN | NaN |
births = load_births()
births.head()
Copy the code
date | births | |
---|---|---|
0 | 1975-01-01 | 265775 |
1 | 1975-02-01 | 241045 |
2 | 1975-03-01 | 268849 |
3 | 1975-04-01 | 247455 |
4 | 1975-05-01 | 254545 |
The query
The syntax used by PandassQL is SQLite syntax. Any DataFrame will be automatically detected by PandassQL and you can query them as such.
Limit the number of
Let’s see how to limit the number of data items. So let’s get the first two.
sql = "select * from births limit 2"
pysqldf(sql)
Copy the code
date | births | |
---|---|---|
0 | 1975-01-01 00:00:00. 000000 | 265775 |
1 | 1975-02-01 00:00:00. 000000 | 241045 |
In addition to limiting the first N pieces of data from scratch, we can also set offsets. Let’s get the first two pieces of data from the second row.
sql = "select * from births limit 2 offset 2"
pysqldf(sql)
Copy the code
date | births | |
---|---|---|
0 | 1975-03-01 00:00:00. 000000 | 268849 |
1 | 1975-04-01 00:00:00. 000000 | 247455 |
Finite field
Since this is SQL, we can of course restrict the required fields in the query. We limit the acquisition of the indicated points here.
sql = "select births from births limit 2"
pysqldf(sql)
Copy the code
births | |
---|---|
0 | 265775 |
1 | 241045 |
The sorting
Sorting functionality is also very common and PandassQL supports it perfectly. We have programmed the holocene progression from date to date.
sql = "select * from births order by date desc, births asc limit 2"
pysqldf(sql)
Copy the code
date | births | |
---|---|---|
0 | 2012-12-01 00:00:00. 000000 | 340995 |
1 | 2012-11-01 00:00:00. 000000 | 320195 |
Restricted query conditions
We can specify where to query data that meets the requirements. Here we select the data with Turkey not empty and date after 1974-12-31.
sql = """
select *
from meat
where turkey not null
and date > '1974-12-31'
limit 5
"""
pysqldf(sql)
Copy the code
date | beef | veal | pork | lamb_and_mutton | broilers | other_chicken | turkey | |
---|---|---|---|---|---|---|---|---|
0 | 1975-01-01 00:00:00. 000000 | 2106.0 | 59.0 | 1114.0 | 36.0 | 646.2 | None | 64.9 |
1 | 1975-02-01 00:00:00. 000000 | 1845.0 | 50.0 | 954.0 | 31.0 | 570.2 | None | 47.1 |
2 | 1975-03-01 00:00:00. 000000 | 1891.0 | 57.0 | 976.0 | 35.0 | 616.6 | None | 54.4 |
3 | 1975-04-01 00:00:00. 000000 | 1895.0 | 60.0 | 1100.0 | 34.0 | 688.3 | None | 68.7 |
4 | 1975-05-01 00:00:00. 000000 | 1849.0 | 59.0 | 934.0 | 31.0 | 690.1 | None | 81.9 |
The aggregation
Aggregation is essential for data analysis, and PandassQL certainly supports it. Here we group the points by year and sum up, mean, Max and minimum the points.
sql = """ select strftime('%Y', date) as year, sum(births), avg(births), max(births), min(births) from births group by strftime('%Y', date) limit 3 """
pysqldf(sql)
Copy the code
year | sum(births) | avg(births) | max(births) | min(births) | |
---|---|---|---|---|---|
0 | 1975 | 3136965 | 261413.750000 | 281300 | 241045 |
1 | 1976 | 6304156 | 262673.166667 | 286496 | 236551 |
2 | 1979 | 3333279 | 277773.250000 | 302805 | 249898 |
associated
Association is also a very common operation. Here we associate the meat and moisture of the DataFrame according to the field date as the condition.
sql = """ select m.date, b.births, m.beef from meat m inner join births b on m.date = b.date order by m.date limit 5; "" "
pysqldf(sql)
Copy the code
date | births | beef | |
---|---|---|---|
0 | 1975-01-01 00:00:00. 000000 | 265775 | 2106.0 |
1 | 1975-02-01 00:00:00. 000000 | 241045 | 1845.0 |
2 | 1975-03-01 00:00:00. 000000 | 268849 | 1891.0 |
3 | 1975-04-01 00:00:00. 000000 | 247455 | 1895.0 |
4 | 1975-05-01 00:00:00. 000000 | 254545 | 1849.0 |
In addition to the common functions listed above, PandassQL supports many more operations that are based on SQLite syntax, which you can explore for yourself.
To learn more about artificial intelligence, please follow our official account: AI Pai