APIs are one of the most important ways to get data from the Web. Want to learn how to use R to call APIs and extract and organize the free Web data you need? This article gives you step-by-step instructions.

Trade-offs

As the saying goes, "even a clever housewife cannot cook a meal without rice." Even if you have mastered the art of data analysis, it is frustrating to have no data to work with. The old line about drawing one's sword and looking around at a loss probably describes the situation.

There are many sources of data. Web data is one of the most abundant and relatively accessible types. Even better, a lot of Web data is free.

In this era of big data, how do you get data on the Web?

Many people use data sets that others have collated and published.

They are lucky: building on the work of others is the most efficient approach.

But not everyone is so lucky. What if you need data that happens not to have been collated and published?

In fact, such data is far more common. Should we simply turn a blind eye to it?

If crawlers came to mind, you are on the right track. A crawler can pull almost anything you can see (and even things you cannot see) on the Web. However, writing and running crawlers is costly in time, resources, technical skill, and so on. If you reach for a crawler for every Web data acquisition problem without a second thought, you are probably using a sledgehammer to crack a nut.

Between "data sets prepared by others" and "data you need to crawl yourself," there is a wide middle ground where APIs live.

What is an API?

API is short for Application Programming Interface. Picture a website whose data keeps accumulating and changing. Compiling and publishing all of it would not only cost time and storage, it would also quickly go out of date. In practice, most users need only a small slice of that data, but they often need it to be fresh. So curating everything into downloadable data sets is simply not economical.

But if the site offers no way to get the data out, it will be harassed by countless crawlers, which disrupts its normal operation. The compromise is for the site to proactively provide a channel. When you need a certain piece of data and no ready-made data set exists, you use this channel to describe exactly what you want; the site reviews the request (usually automatically and instantly) and, if it approves, immediately returns the data you asked for. Both sides are happy.

When you are looking for data in the future, it is worth checking first whether the target site offers an API, so you do not waste effort.

This GitHub project keeps a very detailed list of the most popular API resources available today. Its author is still updating it; bookmark it and read through it at your leisure.

If we know that a web site provides an API, and by looking at the documentation, we know that the data we need is there, then the question becomes — how do we get the data from the API?

Let’s use a practical example to show you how to do this.

Source

The example we will use is Wikipedia.

For an overview of the Wikipedia API, please refer to this page.

Let's say we are interested in the number of page views for a specified Wikipedia article over a given period of time.

Wikipedia provides a class of data called metrics, which includes the page view counts we care about. The introduction page of the corresponding API is here.

There is an example on that page. Suppose you wanted the number of visits to the Albert Einstein page in October 2015; you would call:

GET http://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/Albert_Einstein/daily/2015100100/2015103100

We can type this into the browser's address bar, hit Enter, and see what we get.

In the browser, we see a long string of text. You are probably wondering: what on earth is this?

Congratulations, that’s all the data we need to get. However, it uses a special data format called JSON.

JSON is currently one of the dominant formats for data interaction on the Internet. If you want to understand the meaning and usage of JSON, you can refer to this tutorial.

In the browser, we initially see only the very beginning of the data. But it already contains valuable content:

{"items":[{"project":"en.wikipedia","article":"Albert_Einstein","granularity":"daily","timestamp":"2015100100","access": "all-access","agent":"all-agents","views":18860}Copy the code

In this snippet, we see the project name (en.wikipedia), the article title (Albert_Einstein), the statistical granularity (daily), the timestamp (October 1, 2015), the access type (all), the agent type (all), and the number of views (18,860).

Scrolling to the end of the returned text, we see the following:

{"project":"en.wikipedia","article":"Albert_Einstein","granularity":"daily","timestamp":"2015103100","access":"all-acces s","agent":"all-agents","views":16380}]}Copy the code

Compared with the October 1 record, only the timestamp (October 31, 2015) and the number of views (16,380) have changed.

What we skipped in the middle is the data from October 2 to October 30. The format is identical; only the date and the view count change from record to record.

All the data you need is here; you just need to extract the relevant information and you're done. But doing it manually (for example, copying the items you need and pasting them into Excel) is obviously inefficient and error-prone. Below we show how to automate this process in the R programming environment.

Preparation

Before we can actually call the API with R, we need to do some necessary preparatory work.

First, install R.

Please go to this website to download the R base installation package.

There are many download mirrors for R. I suggest choosing the Tsinghua University mirror, which usually gives a higher download speed.

Please download the corresponding version according to your operating system platform. I’m using the macOS version.

Download the pkg file and double-click it to install.

With the base package installed, we move on to installing the integrated development environment RStudio, which makes it easy to interact with R. RStudio can be downloaded here.

Select an installation package based on your operating system. The macOS installation package is a DMG file. Double-click to open it and drag the RStudio.app icon into the Applications folder to complete the installation.

From the application directory, double-click to run RStudio.

Let’s start by running the following statement in RStudio’s Console to install the necessary packages:

install.packages("tidyverse")
install.packages("rlist")
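If package downloads are slow, you can also point install.packages at the Tsinghua University mirror mentioned earlier. This is an optional sketch; the repository URL below is the TUNA mirror's CRAN address and only needs to be set once per session:

# Optional: use the Tsinghua (TUNA) CRAN mirror for faster package downloads
options(repos = c(CRAN = "https://mirrors.tuna.tsinghua.edu.cn/CRAN/"))
install.packages("tidyverse")
install.packages("rlist")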

Once installed, select File->New from the menu and select R Notebook from the following screen.

R Notebook provides us with a template by default, along with some basic instructions.

Try clicking the Run button on the gray code chunk in the editing area on the left.

You can see the result of the drawing immediately.

Click the Preview button on the toolbar to see the result of running the entire notebook, presented as an illustrated HTML file.

Now that we are familiar with the environment, it is time to run our own code. Keep the opening description at the top of the editing area, remove the rest, and rename the file to something meaningful, such as web-data-api-with-r.

At this point, the preparations are in place. Now we get to the real work.

Operation

For our hands-on example, we use a different Wikipedia article to show that the method generalizes. The article chosen is the one we used when introducing word clouds, "Yes, Minister," a British comedy from the 1980s.

Let's first try, in the browser, to modify the parameters in the API example to get access statistics for the Yes, Minister article. As a test, we only collect data from October 1, 2017 to October 3, 2017 for now.

Compared with the sample, we only need to replace the start and end dates and the article title.

We type in the browser’s address bar:

https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/Yes_Minister/daily/2017100100/2017100300

The result is as follows:

The data comes back normally. Next, we make the same call with statements in RStudio.

Note that in the following code, the output section of the program begins with a ## mark to distinguish it from the executing code itself.

First, we need to set the time zone; otherwise we will run into errors later when processing the time data.

Sys.setenv(TZ="Asia/Shanghai")

Then we load the tidyverse package, a collection that pulls in many of the functions we will use later in one go.

library(tidyverse)

## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr

## Conflicts with tidy packages ----------------------------------------------

## filter(): dplyr, stats
## lag():    dplyr, stats

There may be some warnings here; just ignore them. They do not affect our work at all.

Based on the previous example, we define the time span for the query and specify the name of the Wiki article to look for.

Note that, unlike Python, assignment in R conventionally uses the <- operator rather than =. But R is quite forgiving: if you insist on using =, it will still be recognized without error (see the small example after the following code block).

starting <- "20171001"
ending <- "20171003"
article_title <- "Yes Minister"
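A tiny illustration of the assignment note above (not needed for anything that follows): both spellings give the same result.

starting <- "20171001"   # conventional R assignment
starting = "20171001"    # also accepted at the top level
identical(starting, "20171001")

## [1] TRUE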

Based on the set parameters, we can generate the API address for the call.

url <- paste("https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents",
             article_title,
             "daily",
             starting,
             ending,
             sep = "/")

Here we use paste, which concatenates the pieces into one string; the sep argument at the end specifies the separator placed between the pieces. Because we want the URL to look like a directory path, we use the familiar slash as the separator.
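Here is a tiny standalone example of how paste behaves with sep (purely illustrative, unrelated to our URL):

# paste joins its arguments into one string, placing sep between the pieces
paste("a", "b", "c", sep = "/")

## [1] "a/b/c"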

Let’s check that the generated URL is correct:

url

## [1] "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/Yes Minister/daily/20171001/20171003"

It looks correct. Now we need to actually execute a GET request to call the API and retrieve the data Wikipedia sends back.

To do this, we load another package, httr. It plays a role similar to the requests library in Python: like a Web browser, it can communicate with remote servers.

library(httr)

And then we start calling.

response <-GET(url, user_agent="[email protected] this is a test")

Let’s look at the result of calling the API:

response

## Response [https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/Yes Minister/daily/20171001/20171003]
##   Date: 2017-10-13 03:10
##   Status: 200
##   Content-Type: application/json; charset=utf-8
##   Size: 473 B
Copy the code

Notice the Status line: it returns 200. A status code starting with 2 is the best outcome, meaning everything is fine; if the status code starts with 4 or 5, something went wrong and you need to check for errors.
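If you prefer to check this in code rather than by eye, httr has helper functions for exactly this; a minimal sketch:

# status_code() extracts the numeric HTTP status from the response object;
# stop_for_status() raises an R error automatically for 4xx/5xx responses.
status_code(response)

## [1] 200

stop_for_status(response)  # silent when the request succeeded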

Since we fortunately had no problems, let's open the returned content and see what is inside. We know it is JSON, so we load the jsonlite package to print it in a readable format.

library(jsonlite)

##
## Attaching package: 'jsonlite'

## The following object is masked from 'package:purrr':
##
##     flatten

We then print the content that returns JSON text.

toJSON(fromJSON(content(response, as="text")), pretty = TRUE)

## {
##   "items": [
##     {
##       "project": "en.wikipedia",
##       "article": "Yes_Minister",
##       "granularity": "daily",
##       "timestamp": "2017100100",
##       "access": "all-access",
##       "agent": "all-agents",
##       "views": 654
##     },
##     {
##       "project": "en.wikipedia",
##       "article": "Yes_Minister",
##       "granularity": "daily",
##       "timestamp": "2017100200",
##       "access": "all-access",
##       "agent": "all-agents",
##       "views": 644
##     },
##     {
##       "project": "en.wikipedia",
##       "article": "Yes_Minister",
##       "granularity": "daily",
##       "timestamp": "2017100300",
##       "access": "all-access",
##       "agent": "all-agents",
##       "views": 578
##     }
##   ]
## }

As you can see, the three days of access statistics, along with the accompanying metadata, come back correctly from the server API.

Let’s store this JSON content.

result <- fromJSON(content(response, as="text"))

Check the stored contents:

result

## $items
##        project      article granularity  timestamp     access      agent
## 1 en.wikipedia Yes_Minister       daily 2017100100 all-access all-agents
## 2 en.wikipedia Yes_Minister       daily 2017100200 all-access all-agents
## 3 en.wikipedia Yes_Minister       daily 2017100300 all-access all-agents
##   views
## 1   654
## 2   644
## 3   578

Let’s see what the type of storage is after parsing:

typeof(result)

## [1] "list"

The stored result is a list. For subsequent analysis, however, we want to extract the required fields and form a data frame. That is very simple with the R package rlist.

library(rlist)

We need two of its functions: list.select, which extracts the specified fields, and list.stack, which stacks a list into a data frame.

df <- list.stack(list.select(result, timestamp, views))

Let’s look at the results:

df

##    timestamp views
## 1 2017100100   654
## 2 2017100200   644
## 3 2017100300   578

The extraction is correct: we have the date and the number of views. However, the date is not in a standard format, which will cause problems in later analysis, so we need to convert it.

The best way to handle dates and times in R is the lubridate package. Let's load it first.

library(lubridate)

##
## Attaching package: 'lubridate'

## The following object is masked from 'package:base':
##
##     date

Since the date string ends with two extra digits for the hour (both zeros here), we need the stringr package to trim them off. Then the string can be converted correctly.

library(stringr)
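Before applying the conversion to the whole column, here is what the two functions do to a single timestamp string (a standalone illustration):

# str_sub(x, 1, -3) keeps everything except the last two characters;
# ymd() then parses the remaining "YYYYMMDD" string as a date.
str_sub("2017100100", 1, -3)

## [1] "20171001"

ymd("20171001")

## [1] "2017-10-01"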

We perform the conversion by erasing the last two digits of the date string with str_sub (from the stringr package) and then turning the remaining string into a standard date with the ymd function from the lubridate package. The modified values are stored back into df$timestamp.

df$timestamp <- ymd(str_sub(df$timestamp, 1, -3))

Let’s look at the contents of df:

df

##    timestamp views
## 1 2017-10-01   654
## 2 2017-10-02   644
## 3 2017-10-03   578

So far, all the data we need is in the right format.

However, if we had to fetch view counts for each article by running these statements one by one, it would be inefficient and error-prone. Let's wrap the steps into a function that will be easier to reuse later.

While reorganizing the code into a function, we also rewrite the previous steps in "pipe" style (the %>% symbol you will see, loaded with the tidyverse), which eliminates intermediate variables and looks cleaner.
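For readers who have never seen the pipe before, here is a minimal, self-contained illustration: x %>% f(y) simply means f(x, y), so a chain of pipes reads left to right instead of inside out.

# These two lines compute exactly the same thing.
round(sqrt(10), 2)          # nested calls, read inside out

## [1] 3.16

10 %>% sqrt() %>% round(2)  # piped version, read left to right

## [1] 3.16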

get_pv <- function(article_title, starting, ending){
  url <- paste("https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents",
             article_title,
             "daily",
             starting,
             ending,
             sep = "/")
 df <- url %>%
    GET(user_agent="[email protected] this is a test") %>%
    content(as="text") %>%
    fromJSON() %>%
    list.select(timestamp, views) %>%
    list.stack() %>%
    mutate(timestamp = timestamp %>%
             str_sub(1,-3) %>%
             ymd())
  df
}

Let’s try the API data retrieval again with the newly defined function:

starting <- "20171001"
ending <- "20171003"
article_title <- "Yes Minister"
get_pv(article_title, starting, ending)

##    timestamp views
## 1 2017-10-01   654
## 2 2017-10-02   644
## 3 2017-10-03   578

The result is correct.

But if we only grab three days of data, all this trouble is hardly worth it. Let's extend the time range and try to capture the data from October 2014 through October 10, 2017.

starting <- "20141001"
ending <- "20171010"
article_title <- "Yes Minister"
df <- get_pv(article_title, starting, ending)

Let’s look at the results:

head(df)

##    timestamp views
## 1 2015-07-01   538
## 2 2015-07-02   588
## 3 2015-07-03   577
## 4 2015-07-04   473
## 5 2015-07-05   531
## 6 2015-07-06   500

Interestingly, the data does not start in 2014 but in July 2015. Is that because the Yes, Minister article was only created in July 2015? Or because the API we are calling covers only a limited time range? Or is it something else? We leave this as a question to ponder; you are welcome to share your answer and analysis with us.
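If you want to investigate, a quick first step (just a sketch using base R on the df we already have) is to inspect what actually came back:

# How many daily records did we receive, and which dates do they span?
nrow(df)
range(df$timestamp)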

Next, we plot the data with the ggplot2 package. With a single line of code, we can see how visits to the Yes, Minister article have varied over the years.

ggplot(data=df, aes(timestamp, views)) + geom_line()

For a series that first aired more than 30 years ago, its Wikipedia page still draws plenty of visitors today, which says something about its charm. Several peaks stand out clearly in this chart. Can you explain why they appear? That is another question for you to think about; the sketch below may help you locate them.
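To locate those peaks, one possible starting point (a sketch, not the only way) is to list the dates with the highest view counts:

# Sort by views in descending order and show the ten busiest days
df %>%
  arrange(desc(views)) %>%
  head(10)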

Summary

Just to recap, here are some important things we learned in this article:

  • Three common ways to obtain Web data and their application scenarios;
  • Where to find a directory of commonly used API resources and how to use it;
  • How to call the API with R and extract the data of interest from the server feedback.

Hopefully this article has given you a working grasp of the points above, and you can expand your knowledge further by following the links and tutorial resources provided here.

Discussion

Have you ever retrieved Web data through an API before? What other API-calling tools have you used besides R? How do they compare with the approach described in this article? Feel free to leave a comment, share your experience, and join the discussion.

If you liked this article, please give it a thumbs up. You can also follow my official WeChat account "Nkwangshuyi" and pin it to the top.

If you are interested in data science, check out my tutorial index post, "How to Get Started in Data Science Effectively," which contains more interesting problems and solutions.