The Python library is used to manipulate the Python library

Pandas is a library for data processing and analysis. Pandas is a library for data processing and analysis.

But admittedly, not all programmers can use Python, and not all Pythoners can use Pandas.

The good news is that with PandassQL, you can use SQL to manipulate the DataFrame.

# import related libraries
import numpy as np
import pandas as pd

from pandasql import sqldf, load_meat, load_births
Copy the code

basis

The main function in pandasQL is SQLDF, which takes two arguments:

An SQL query statement
A set of session/environment variables (locals() 或 globals())

For convenience, we can define a function that we can call easily.

pysqldf = lambda sql: sqldf(sql, globals())
Copy the code

Next we import some data.

meat = load_meat()
meat.head()
Copy the code

	date	beef	veal	pork	lamb_and_mutton	broilers	other_chicken	turkey
0	1944-01-01	751.0	85.0	1280.0	89.0	NaN	NaN	NaN
1	1944-02-01	713.0	77.0	1169.0	72.0	NaN	NaN	NaN
2	1944-03-01	741.0	90.0	1128.0	75.0	NaN	NaN	NaN
3	1944-04-01	650.0	89.0	978.0	66.0	NaN	NaN	NaN
4	1944-05-01	681.0	106.0	1029.0	78.0	NaN	NaN	NaN

births = load_births()
births.head()
Copy the code

	date	births
0	1975-01-01	265775
1	1975-02-01	241045
2	1975-03-01	268849
3	1975-04-01	247455
4	1975-05-01	254545

The query

The syntax used by PandassQL is SQLite syntax. Any DataFrame will be automatically detected by PandassQL and you can query them as such.

Limit the number of

Let’s see how to limit the number of data items. So let’s get the first two.

sql = "select * from births limit 2"
pysqldf(sql)
Copy the code

	date	births
0	1975-01-01 00:00:00. 000000	265775
1	1975-02-01 00:00:00. 000000	241045

In addition to limiting the first N pieces of data from scratch, we can also set offsets. Let’s get the first two pieces of data from the second row.

sql = "select * from births limit 2 offset 2"
pysqldf(sql)
Copy the code

	date	births
0	1975-03-01 00:00:00. 000000	268849
1	1975-04-01 00:00:00. 000000	247455

Finite field

Since this is SQL, we can of course restrict the required fields in the query. We limit the acquisition of the indicated points here.

sql = "select births from births limit 2"
pysqldf(sql)
Copy the code

	births
0	265775
1	241045

The sorting

Sorting functionality is also very common and PandassQL supports it perfectly. We have programmed the holocene progression from date to date.

sql = "select * from births order by date desc, births asc limit 2"
pysqldf(sql)
Copy the code

	date	births
0	2012-12-01 00:00:00. 000000	340995
1	2012-11-01 00:00:00. 000000	320195

Restricted query conditions

We can specify where to query data that meets the requirements. Here we select the data with Turkey not empty and date after 1974-12-31.

sql = """
select *
from meat
where turkey not null
and date > '1974-12-31'
limit 5
"""
pysqldf(sql)
Copy the code

	date	beef	veal	pork	lamb_and_mutton	broilers	other_chicken	turkey
0	1975-01-01 00:00:00. 000000	2106.0	59.0	1114.0	36.0	646.2	None	64.9
1	1975-02-01 00:00:00. 000000	1845.0	50.0	954.0	31.0	570.2	None	47.1
2	1975-03-01 00:00:00. 000000	1891.0	57.0	976.0	35.0	616.6	None	54.4
3	1975-04-01 00:00:00. 000000	1895.0	60.0	1100.0	34.0	688.3	None	68.7
4	1975-05-01 00:00:00. 000000	1849.0	59.0	934.0	31.0	690.1	None	81.9

The aggregation

Aggregation is essential for data analysis, and PandassQL certainly supports it. Here we group the points by year and sum up, mean, Max and minimum the points.

sql = """ select strftime('%Y', date) as year, sum(births), avg(births), max(births), min(births) from births group by strftime('%Y', date) limit 3 """
pysqldf(sql)
Copy the code

	year	sum(births)	avg(births)	max(births)	min(births)
0	1975	3136965	261413.750000	281300	241045
1	1976	6304156	262673.166667	286496	236551
2	1979	3333279	277773.250000	302805	249898

associated

Association is also a very common operation. Here we associate the meat and moisture of the DataFrame according to the field date as the condition.

sql = """ select m.date, b.births, m.beef from meat m inner join births b on m.date = b.date order by m.date limit 5; "" "
pysqldf(sql)
Copy the code

	date	births	beef
0	1975-01-01 00:00:00. 000000	265775	2106.0
1	1975-02-01 00:00:00. 000000	241045	1845.0
2	1975-03-01 00:00:00. 000000	268849	1891.0
3	1975-04-01 00:00:00. 000000	247455	1895.0
4	1975-05-01 00:00:00. 000000	254545	1849.0

In addition to the common functions listed above, PandassQL supports many more operations that are based on SQLite syntax, which you can explore for yourself.

To learn more about artificial intelligence, please follow our official account: AI Pai

The Python library is used to manipulate the Python library

basis

The query

Limit the number of

Finite field

The sorting

Restricted query conditions

The aggregation

associated

Related Posts

Introduction to Image manipulation

ConeAI! Intelligent background sample manufacturing machine!

Hands-on practice of a NLP – based close word finder