This article uses a historical-weather query product from the Aliyun API market as an example to walk you through how to collect, analyze, and visualize data with a Python API. Hopefully you will then be able to handle future API data collection and analysis tasks with ease.
Identical
In last week's graduate class, students presented, in groups, the second assignment of the practical session: capturing, analyzing, and visualizing data using APIs.
The topics were all over the map.
One group, for example, looked at the animated film Peppa Pig, which is said to be hugely popular at the moment.
Guess what the next group studied?
Yes, Game of Thrones, the acclaimed American TV series.
Rich and colorful themes, right?
As the teacher sitting in the audience, I should have been pleased, right?
No. I didn't know whether to laugh or cry.
More than half of the 14 groups did exactly the same thing: analyze Wikipedia page views.
Why is that?
Because I had kindly given them an example: a tutorial I wrote earlier, "How to Get Web Data for Free with R and API?".
So they all used R to analyze Wikipedia page views.
Were these students just lazy?
After listening to them, I found that many of them wanted to do something new.
They searched a number of domestic cloud markets for API products.
They automatically filtered out the overpriced APIs.
But there were still plenty of low-cost or free APIs to practice on.
The problem was that even after a long time of trying, they couldn't get anything to work.
Given the impending presentation deadline, they had no choice but to follow my tutorial and analyze Wikipedia with R.
Hence, multiple groups of work, all the same.
Presenting it, they looked embarrassed.
But I realized there was a problem.
Almost all API products in the domestic cloud markets are well documented. Many even provide ready-made calling code in a variety of programming languages.
With sample code in hand, why couldn't they get things to work?
After class, I asked the students who had questions to stay, and walked them through an actual test of an API product to find out what was causing their difficulties.
Market
What we tried was an API product they had found in the Aliyun market, which provides weather data.
It comes from the provider Showapi (易源数据), linked here.
It's a paid API: one cent (RMB 0.01) buys 100 calls.
As a homework exercise, 100 calls is enough.
The price, they said, was acceptable.
I went through the process myself.
Click the “Buy now” button.
You'll be directed to the payment page. If you are not logged in, follow the instructions to log in with your Taobao account.
After paying 1 cent, you will see the following success prompt.
After that, the system will prompt you with some very important information.
Note the fields marked red in the figure above.
This is your AppCode, the most important credential when you call the API to get data. Please click the "copy" button to save it.
Click the product name link in the image above to return to the product introduction page.
The product's API provides a variety of data-retrieval functions.
One item that students tried to use was “historical weather retrieval by ID or place name.”
Notice a few important things in this picture:
- Calling address: the basic information we need to access the API. It's like meeting a friend: you first need to know where to meet;
- Request mode: in this example, GET, one of the main ways an HTTP request transfers data;
- Request parameters: the information you provide to the API, namely the "region name" (area) or "region ID" (areaid), plus the month. Note the required format and the available time range. (A sketch of how these parameters travel appears right after this list.)
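To make the GET mechanics concrete, here is a small sketch (using the requests package we'll meet later) that assembles the same parameters into a final URL without sending anything over the network:

```python
import requests

# Build, but do not send, a GET request; PreparedRequest exposes the final URL,
# showing how the parameters become a query string appended to the calling address.
req = requests.Request('GET',
                       'https://ali-weather.showapi.com/weatherhistory',
                       params={'areaid': '101291401', 'month': '201601'})
print(req.prepare().url)
# https://ali-weather.showapi.com/weatherhistory?areaid=101291401&month=201601
```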
Scroll down to see an example request.
The default example shown is the simplest one: curl.
If you already have curl installed on your operating system (if you don't, you can download it by clicking this link), try copying the line that begins with curl into your text editor.
Something like this:
```bash
curl -i -k --get --include 'https://ali-weather.showapi.com/weatherhistory?area=%E4%B8%BD%E6%B1%9F&areaid=101291401&month=201601' -H 'Authorization:APPCODE your own AppCode'
```
Then, be sure to replace the string “your own AppCode” with your real AppCode.
Copy and paste the replaced statement into the terminal window and run it.
The running result is as follows:
Do you see the data in Chinese at the bottom of the window?
Using the API to get the data, it’s that simple.
Why program when a terminal can execute a command?
Good question!
Because we may not get all the data we need in a single call.
We have to call the API multiple times, changing the parameters each time and accumulating the data.
Executing the command manually every time would be inefficient.
That's why API providers give users detailed documentation and instructions, and even ready-made samples.
Besides the default curl, sample code is provided for several other languages:
- Java
- C#
- PHP
- Python
- Objective-C
Let's take Python as an example: click on its tab.
All you need to do is copy the entire sample code into a text editor and save it as a Python script with a ".py" extension, such as demo.py.
Again, don’t forget to replace the string “your own AppCode” with your real AppCode and save it.
On the terminal, run the following command:
```bash
python demo.py
```
If you are using Python version 2.7, you should get the correct results immediately.
So why couldn't many students get results?
I had them actually run it in front of me, and found that some had carelessly forgotten to change the AppCode.
However, most of them hit the following problem because they had installed the latest Anaconda (Python 3.6):
You might think this means the urllib2 module is not installed, and execute:

```bash
pip install urllib2
```
You may see the following error message:
You might then try dropping the version number and installing urllib instead:

```bash
pip install urllib
```
But the results are still not pretty:
Some Python developers may laugh at this: urllib2 was split into urllib.request and urllib.error in Python 3. Everyone knows that you should…
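For readers who just want the working pattern, here is a minimal sketch (my own, not the vendor's official sample) of how the old urllib2 idiom maps onto Python 3's urllib.request, assuming the same endpoint and authorization header:

```python
from urllib.request import Request, urlopen

appcode = 'Your AppCode here'  # placeholder: substitute your real AppCode
url = ('https://ali-weather.showapi.com/weatherhistory'
       '?areaid=101291401&month=201601')
# In Python 2 this was urllib2.Request / urllib2.urlopen.
request = Request(url, headers={'Authorization': 'APPCODE {}'.format(appcode)})
response = urlopen(request)
print(response.read().decode('utf-8'))
```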
Please maintain a sense of empathy.
Think about it for a moment: why should an average user need to know the differences between Python versions, let alone the workaround for this particular versioning issue?
In their view, the sample on the official site should just work. When it reports an error instead, and their usual trick of installing the missing package doesn't fix it, they panic.
Furthermore, they don’t know much about the JSON format.
JSON, though, is already a very clean, human-readable way to store data.
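If you have never seen JSON, here is a made-up fragment in roughly the shape this API returns (the field names appear later in this tutorial; the values are invented for illustration):

```json
{
  "showapi_res_body": {
    "list": [
      {"time": "20180501", "aqi": "38", "weather": "sunny"}
    ]
  }
}
```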
They want to know how to scale the problem to something they can solve.
For example, can they convert the JSON into an Excel-style data frame?
If so, they can use familiar Excel commands for data filtering, analysis, and plotting.
They also think it would be better if Python itself could do the whole process of reading, collating, analyzing, and visualizing data in one stop.
But where are the samples?
In my article "Python Programming Problems: What Do Liberal Arts Students Do?", I mentioned the importance of such examples for the average user.
Without a "gourd" to copy from, how can they "draw the ladle"?
Since the official documentation does not provide such detailed code and explanation for this example, I'll draw that gourd for you.
Below, I’ll show you step by step how to call the API in Python 3 to read, analyze, and draw data.
Environment
First, let’s look at the code environment.
As mentioned earlier, if the sample code assumes an environment different from your local one, then even if the code itself is fine, it will not execute properly.
So I have built a cloud code environment for you. (If you're interested in how this kind of runtime environment is built, please read my article "How to Run Python Code on an iPad".)
Please click on this link (t.cn/R3us4Ao) to directly enter our experimental environment.
You do not need to install any packages on your local computer. All you need is a modern browser (including Google Chrome, Firefox, Safari and Microsoft Edge). I’ve got all the dependency software ready for you.
After opening the link, you will see this page.
This interface is from Jupyter Lab.
The left column in the figure shows all the files in the working directory.
On the right is the ipynb file that we’re going to use.
Following my explanations, please run the cells one by one and observe the results carefully.
In this example, we will mainly use the following two new packages.
First is the HTTP toolkit requests, billed as "HTTP for Humans".
This tool not only matches human intuition and habits, but is also friendlier to Python 3. Its author, Kenneth Reitz, is even urging all Python 2 users to move to Python 3:
> The use of Python 3 is highly preferred over Python 2. Consider upgrading your applications and infrastructure if you find yourself still using Python 2 in production today. If you are using Python 3, congratulations — you are indeed a person of excellent taste. — Kenneth Reitz
The plotting tool we're going to use is called plotnine.
It did not originate on the Python platform; it was ported from ggplot2 on the R platform.
There are already plenty of excellent plotting packages on the Python platform, such as matplotlib, seaborn, bokeh, and plotly.
So why go to the trouble of porting ggplot2 over?
Because the author of ggplot2 is Hadley Wickham, the famous R language master.
He created ggplot2 not to provide yet another plotting tool for R, but to provide another way of thinking about plots.
ggplot2 faithfully follows and implements the "Grammar of Graphics" proposed by Leland Wilkinson, which decomposes a plot into layers rather than components.
As a result, data visualization has never been easier to learn and more powerful.
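To give you a taste of that layered grammar before we use it in earnest, here is a minimal plotnine sketch (it uses the economics dataset bundled with plotnine, purely for illustration):

```python
from plotnine import ggplot, aes, geom_line, geom_point
from plotnine.data import economics  # a small example dataset shipped with plotnine

# Layered grammar in miniature: bind data and aesthetics once,
# then stack independent layers with "+".
p = (ggplot(economics, aes(x='date', y='unemploy'))
     + geom_line()     # layer 1: a line through the points
     + geom_point())   # layer 2: the points themselves
print(p)
```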
I’ll show you how to use both packages in detail in the “Code” section below.
I suggest you first run through the tutorial as given and check the results.
If all is well, replace the data with something of interest to you.
After that, try opening a blank ipynb file and, following the tutorial and the documentation, write the code yourself and experiment with adjustments.
This will help you understand the workflow and tool usage.
Let’s look at the code.
Code
First, load the HTTP toolkit requests.

```python
import requests
```
In the next statement, "Your AppCode here" appears; please replace it with your own AppCode, otherwise the steps below will report errors.

```python
appcode = 'Your AppCode here'
```
Let's try to get the weather information for Lijiang in May.
On the API info page, there are tables for cities and codes.
It’s hidden, above the company profile.
I’ve put the url of the Excel document here (http://t.cn/R3T7e39), which you can click to download.
After downloading the Excel file and opening it, according to the table query, we know that “101291401” is the city code of Lijiang.
We write it into the areaid variable.
For the date, we chose the month this article was written: May 2018.
```python
areaid = "101291401"
month = "201805"
```
Let’s set up the information related to the API call.
According to the API information page, the address we need to visit is https://ali-weather.showapi.com/weatherhistory, and we must pass in two parameters: the areaid and month we just set.
In addition, we need to verify our identity and prove that we have paid.
Click on the blue “API Simple Authentication Call Method (APPCODE)” in the image above and you will see the following sample page.
So we need to put AppCode in the HTTP header.
Let’s write down all this information in turn.
```python
url = 'https://ali-weather.showapi.com/weatherhistory'
payload = {'areaid': areaid, 'month': month}
headers = {'Authorization': 'APPCODE {}'.format(appcode)}
```
Now, it’s time to work with the Requests package.
The syntax for requests is very concise, specifying only four things:
- The call method, "GET";
- The access address (URL);
- The payload, containing the areaid and month values;
- The HTTP header information, namely the AppCode.
```python
r = requests.get(url, params=payload, headers=headers)
```
After execution, it seems… Nothing happened!
Let’s take a look:
```python
r
```
Python tells us:
```
<Response [200]>
```
Return code 200 indicates that the access is successful.
To recap, in "How to Get Web Data for Free with R and API?", we mentioned:
> A status code starting with 2 is the best result, meaning everything went fine; if the status value starts with 4 or 5, there is a problem and you need to check for errors.
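As a defensive habit (my own addition, not part of the vendor's sample), you can have requests raise an exception whenever the status code signals trouble:

```python
# raise_for_status() throws an HTTPError for any 4xx or 5xx response,
# so a failed call cannot slip by unnoticed.
r.raise_for_status()
```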
Now that the call is successful, let’s take a look at the specific data content returned by the API interface.
Call the content property of the return value:
```python
r.content
```
The screen instantly fills up.
Many of the characters don't even display properly. What now?
It doesn’t matter. We know from the API info page that the data that’s returned is in JSON format.
That’s fine. Let’s call the json package that comes with Python.
```python
import json
```
We parse the string with the json package's loads function and store the result in content_json.

```python
content_json = json.loads(r.content)
```
Look at the content_json result:
```python
content_json
```
As you can see, the returned information is complete. The Chinese that wouldn't display properly before now shows its true face.
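As an aside, the requests package can do this parsing for us: the return value's json() method yields the same nested structure, so the explicit json.loads step is optional.

```python
content_json = r.json()  # built-in shortcut equivalent to json.loads(r.content)
```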
The next step is crucial.
We extract the data that we really care about.
We don’t need to return error codes and so on in the results.
What we want is a list of weather information for each day.
Observe that this part of the data is stored under 'list', which itself sits inside 'showapi_res_body'.
So, to select the list, we need to specify the path within it:
```python
content_json['showapi_res_body']['list']
```
The redundant information has been stripped away, leaving only the list we want.
But operating on a list is not convenient and flexible enough.
We want to turn the list into a data frame, which makes analysis and visualization much easier.
In the worst case, we could also export the data frame directly to an Excel file and take it into the familiar Excel environment to draw graphics.
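(For the record, once the data frame df exists, as it will shortly, that Excel escape hatch is one line; this sketch assumes the openpyxl package is available and uses an illustrative file name.)

```python
df.to_excel('weather.xlsx', index=False)  # hypothetical file name
```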
Load the Python data-frame toolkit pandas.

```python
import pandas as pd
```
We ask pandas to convert the list we just extracted into a data frame and store it in df.

```python
df = pd.DataFrame(content_json['showapi_res_body']['list'])
```
Check out the content:
```python
df
```
At this point, the data is displayed in a neat format and the information is clear at a glance.
We have successfully pulled one month's data for one city into a pandas data frame.
However, our analysis obviously won't be limited to a single month or a single city.
Having to repeat all of this by hand every time we add a set of data would be painful. And with this many statements, manual execution invariably means slips and errors.
So we should gather the statements we just wrote and modularize them into a function.
In this way, we just need to pass in different parameters when calling the function, such as different city names, month information, etc., to get the desired result.
Combining the statements above, we define a complete function that takes city and month information and returns the data frame.
```python
def get_df(areaid, areaname_dict, month, appcode):
    url = 'https://ali-weather.showapi.com/weatherhistory'
    payload = {'areaid': areaid, 'month': month}
    headers = {'Authorization': 'APPCODE {}'.format(appcode)}
    r = requests.get(url, params=payload, headers=headers)
    content_json = json.loads(r.content)
    df = pd.DataFrame(content_json['showapi_res_body']['list'])
    df['areaname'] = areaname_dict[areaid]
    return df
```
Notice that in addition to the statement we just used, we added an input parameter to the function, areaname_dict.
It’s a dictionary, and each item contains a city code and a corresponding city name.
Based on the city code we pass in, the function automatically adds a column to the resulting data frame with the corresponding city name.
When we get data from multiple cities, it becomes clear which city a particular row of data refers to.
On the other hand, if you were just shown the city code, you would quickly become dazzled and confused.
However, this function alone is not efficient enough.
After all, we may need to query many months and many cities. Calling the function above by hand each time would be tiring.
So, let’s write another function that will help us automate the dirty work.
```python
def get_dfs(areaname_dict, months, appcode):
    dfs = []
    for areaid in areaname_dict:
        dfs_times = []
        for month in months:
            temp_df = get_df(areaid, areaname_dict, month, appcode)
            dfs_times.append(temp_df)
        area_df = pd.concat(dfs_times)
        dfs.append(area_df)
    return dfs
```
To clarify: this function's inputs are a city code-to-name dictionary, a list of months, and our AppCode.
Its processing logic is, quite simply, a double loop.
The outer loop is responsible for traversing all required cities, and the inner loop traverses all specified time ranges.
What it returns is a list.
Each item in the list is one city's weather data frame covering the whole time range (possibly several months).
Let’s try it with a single city, a single month.
It was Lijiang in May 2018.
```python
areaname_dict = {"101291401": "Lijiang"}
months = ["201805"]
```
We pass this information to the get_dfs function.
```python
dfs = get_dfs(areaname_dict, months, appcode)
```
Check out the results:
```python
dfs
```
What is returned is a list.
Since we queried only one city, we simply take the list's first item.
```python
dfs[0]
```
This time, the data box is displayed:
The test passed. Now let's strike while the iron is hot and read out all the data for Tianjin, Shanghai, and Lijiang from the beginning of 2018 to the time of writing.
First set the city:
```python
areaname_dict = {"101030100": "Tianjin", "101020100": "Shanghai", "101291401": "Lijiang"}
```
Set the time range again:
```python
months = ["201801", "201802", "201803", "201804", "201805"]
```
Let’s execute get_dfs again.
```python
dfs = get_dfs(areaname_dict, months, appcode)
```
Look at the results this time:
```python
dfs
```
The result is still a list.
Each item in the list corresponds to a city’s weather data for the period from the beginning of 2018 to May of this writing.
Since we want to analyze the weather of several cities together, let's merge these data frames into one.
The method is pandas' built-in concat function.
It takes a list of data frames and concatenates them (by default, along the vertical axis).
```python
df = pd.concat(dfs)
```
Look at the combined data frame:
```python
df
```
Here’s the beginning:
Here’s the end:
Data for three cities across more than four months, read and integrated correctly.
So let’s try to analyze this.
First, we need to see what type each column of the data frame has:
```python
df.dtypes
```
```
aqi                object
aqiInfo            object
aqiLevel           object
max_temperature    object
min_temperature    object
time               object
weather            object
wind_direction     object
wind_power         object
areaname           object
dtype: object
```
All columns are being treated as object.
What is “object”?
In this context, you can think of it as a string type.
However, we can't treat them all as strings.
Dates, for example, should be treated as a date type; otherwise, how could we visualize a time series?
And if AQI is kept as a string, how could we compare magnitudes?
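A two-line illustration of the trap (my addition): string comparison is lexicographic, not numeric.

```python
'9' > '38'   # True: as strings, '9' sorts after '3'
9 > 38       # False: the numeric comparison we actually want
```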
So we need to convert the data type.
First convert the date column:
```python
df.time = pd.to_datetime(df.time)
```
Then convert AQI value column:
```python
df.aqi = pd.to_numeric(df.aqi)
```
Look at the data type of df at this point:
```python
df.dtypes
```
```
aqi                         int64
aqiInfo                    object
aqiLevel                   object
max_temperature            object
min_temperature            object
time               datetime64[ns]
weather                    object
wind_direction             object
wind_power                 object
areaname                   object
dtype: object
```
This time it works: date and AQI are now the types we need. The other columns stay unchanged for now.
Some because they really should be strings, such as the city name.
Others because we won't be using them for a while.
Let’s draw a simple time series comparison graph.
Load the plotting toolkit plotnine.
Note that we also load date_breaks, which lets us specify the interval between time-axis ticks.
```python
import matplotlib.pyplot as plt
%matplotlib inline
from plotnine import *
from mizani.breaks import date_breaks
```
Now for the actual plotting:
```python
(ggplot(df, aes(x='time', y='aqi', color='factor(areaname)')) +
 geom_line() +
 scale_x_datetime(breaks=date_breaks('2 weeks')) +
 xlab('date') +
 theme_matplotlib() +
 theme(axis_text_x=element_text(rotation=45, hjust=1)) +
 theme(text=element_text(family='WenQuanYi Micro Hei')))
```
We specify time series on the horizontal axis and AQI on the vertical axis, and use different colored lines to distinguish cities.
On the time axis, tick marks are placed at "2 weeks" intervals.
We relabel the horizontal axis as "date".
Because the time labels are long, the default style would stack them on top of each other; we rotate them 45 degrees to avoid the overlap and keep them readable.
For the Chinese characters in the figure to display properly, we specify a Chinese font; here we chose the open-source "WenQuanYi Micro Hei".
The data visualization results are shown in the figure below.
So, this comparison chart looks pretty good, doesn't it?
What results can you analyze from the picture?
Anyway, after reading this chart, I want to visit Lijiang.
Summary
After reading this tutorial, you should have learned the following:
- How to choose the products you are interested in on the API cloud market according to the tips;
- How to get your authentication information AppCode;
- How to test an API call with curl from the command line;
- How to call the API for data using Python 3 and the more user-friendly HTTP toolkit Requests;
- How to use JSON toolkit to parse and process the obtained string data;
- How to use pandas to convert a JSON list into a data frame;
- How to package simple Python statements that pass the test into functions that can be called over and over again for efficiency;
- How to use plotnine (a port of ggplot2) to draw a time-series line chart comparing the historical AQI trends of different cities;
- How to run the sample in a cloud environment and modify it yourself.
I hope this sample code can help you build confidence and try to collect and obtain API data by yourself, so as to contribute to your research work.
If you want to run this sample locally, rather than in the cloud, use this link (t.cn/R3usDi9) to download the full source code and the Pipenv zip used in this article.
If you know how to use Github, you are welcome to use the link (t.cn/R3usEti) to clone or fork the corresponding Github repo.
Of course, it would be nice to get a star on my repo.
Discussion
Have you tried getting data with Python and APIs before? What better packages do you use for data acquisition, processing, analysis, and visualization? What other data product markets have you used? Feel free to leave a comment, share your experience and thoughts, and let's discuss together.
If you found this useful, please give it a thumbs-up. You can also follow and pin my official WeChat account, "Nkwangshuyi".
If you're interested in data science, check out my series of tutorials indexed in the post "How to Get Started in Data Science Effectively", where there are more interesting problems and solutions.