Terms like CSV, JSON, and XML start to come up as soon as you begin working with rich and colorful open data sets. How do you efficiently read them into Python for subsequent collation and analysis? This article walks you through the process step by step, so you can do it yourself.

Demand

No matter how sophisticated an artificial intelligence algorithm is, without data it is a case of "even a clever housewife cannot cook without rice".

Data is precious, and open data is especially precious. Whether on WeChat official accounts, Weibo, or Moments, many people get excited when they hear the words "open data", "data resources", and "data link".

When you finally get your hands on that coveted data link, you’ll find that it can be downloaded in all sorts of weird formats.

The most common are the following:

  • CSV
  • XML
  • JSON

You want to complete your data alchemy by calling Python to clean and analyze them.

The first step is to learn how to read these open data formats in Python.

In this article, we will walk through how to read three common open data formats, CSV, XML, and JSON, into Python and turn them into structured data frames for your subsequent analysis.

Are you raring to go?

Data

The open data platform we choose is Quandl.

Quandl is a financial and economic data platform. It includes both expensive paid data and plenty of data that is freely available.

You need to sign up for a free account with Quandl to access its free data collection.

To register, just fill out the registration form. After registering, log in with your new account and password.

Once you’re logged in, click “Search Data” in the “Core Financial Data” section on the home page.

You’ll soon see a dizzying array of data.

Don't get too excited yet, though. If you look closely at the labels on the right side of each data set, the first page is basically all "Premium", available to paying users only.

You don’t have to flip through pages to find free open data. Click the “Free” option under Filter at the top left of the page.

This time it’s all free data.

What does this data contain? If you’re interested, take a moment to browse.

Let's use Zillow Real Estate Research, a very large real estate data set.

The Zillow real estate data all come from US cities. You can pick whichever city interests you; I chose Lexington, Kentucky.

Why not New York, why not Los Angeles?

Because I often went there on weekends while I was a visiting scholar in the United States.

The university I visited is located in a small town. There is no Chinese supermarket locally, so some common foods and spices are simply not available.

To get to a Chinese supermarket, you have to go to Lexington, the nearest “big city”.

The distance from the school to Lexington is about the same as from Tianjin to Beijing.

I didn’t have a car of my own and public transport wasn’t convenient, so I was upset at first.

Fortunately, the international students often drove to Lexington on weekends to visit the mall, and I always rode along.

After the hour-and-a-half drive into town, we would first go to an authentic Chinese restaurant for the lunch buffet and then head to the mall. They would wander around for about two hours while I found a cafe or rest area to close my eyes and listen to Luo Pang's stories with my headphones on.

When they finished shopping, we would go to the Chinese supermarket together.

I have fond memories of this “big city” with shopping malls, authentic Chinese restaurants, buses and even Chinese supermarkets.

Let’s take that as an example.

Acquisition

A search for “Lexington ky” returns quite a few results.

We selected “Zillow Home Value Index (Metro): Home Sales (SA) – Lexington, KY” and clicked to see a page for this data set.

Here’s the median home sales price in Lexington at different times.

Quandl has thoughtfully plotted the historical price information as a line chart for us. Select the "TABLE" tab and we can view the raw data.

Let’s download the data locally. There’s a Download button in the top right. We click on it.

As you can see, Quandl provides the data in four formats:

  • CSV
  • Excel
  • JSON
  • XML

Let’s skip Excel (because it’s Microsoft’s proprietary format) and just download the other three categories in turn.

We right-click each of the other data formats, select "Save link as" from the browser menu that pops up, and save it locally.

I have downloaded the relevant 3 data formats for you and stored them in a Github project. Please visit this link to download the package and unzip it.

The compressed package contains the Lexington real estate transaction information in three different formats. They hold the same data, yet the CSV file takes up the least space, followed by JSON; the format that takes up the most space is XML.
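Once your Python environment is ready (see the next section), you can verify the size difference yourself. A minimal sketch, assuming the three unzipped files sit in your working directory under the names used later in this article:

import os

# Compare how much disk space the same data takes in each format.
for name in ["ZILLOW-M550_SALES.csv", "M550_SALES.json", "M550_SALES.xml"]:
    print("{}: {} bytes".format(name, os.path.getsize(name)))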

Now that we have the data, let’s prepare the Python programming environment.

The environment

For our Python runtime environment we use Anaconda, an integrated Python distribution.

Please download the latest version of Anaconda from its website. Scroll down the page to find the download links. Depending on the system you are currently using, the site will automatically recommend a suitable version for you to download. I use macOS and download the file in PKG format.

The download page area shows Python version 3.6 on the left and 2.7 on the right. Please select version 2.7.

Double-click the downloaded PKG file and follow the prompts to install it step by step.

With Anaconda installed, we also need to make sure we have a few necessary packages installed.

Please open your "terminal" (Linux, macOS) or "command prompt" (Windows) and navigate to the directory where you unzipped the sample files you just downloaded.

Run the following command:

pip install bs4

(The json module ships with Python's standard library, so it does not need to be installed separately.)

After installation, perform the following operations:

jupyter notebook

This brings us to the Jupyter Notebook environment. Let's create a new Python 2 notebook.

This creates a blank notebook.

Click the notebook name in the upper left corner and change it to something meaningful, such as "Demo-python-read-open-data-formats".

Now that we’re done, we can start reading data in Python in different formats.

CSV

Let’s start with the simplest CSV format.

CSV is an acronym for ‘Comma Separated Values’.

Let’s go back to the root of Jupyter Notebook.

Open up our sample CSV file, ZILLOW-M550_SALES.csv.

As you can see, the first line is the table header, indicating the name of each column. After that, each line is data, with the date and the corresponding median price.

Two columns of data in each row are separated by commas.
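If you open the file in a plain text editor, the first few lines look roughly like this (the column names and values are taken from this data set; note that the downloaded file lists the newest dates first):

Date,Value
2016-06-30,64.0
2016-05-31,163.0
2016-04-30,118.0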

We can use Excel to open the CSV data and see the effect more intuitively.

As shown in the figure, when we open the CSV data with Excel, it automatically recognizes it as a table. The commas are gone, and we see two columns and many rows of data.
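As an aside, if you just want to peek at the raw rows in Python without any extra tools, the built-in csv module is enough. A minimal sketch:

import csv

# Print the header row and the first three data rows as plain Python lists.
with open("ZILLOW-M550_SALES.csv") as f:
    reader = csv.reader(f)
    for i, row in enumerate(reader):
        print(row)
        if i >= 3:
            break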

Let’s use Python to read in the CSV data file and visualize it.

First, import the Pandas toolkit. It gives us data frames and is the foundation of data analysis in Python.

import pandas as pd

Then, in order for the image to display correctly on the Jupyter Notebook, we use the following statement to allow the image to be embedded in the page.

%matplotlib inline

Let’s read in the CSV file. Pandas is the most friendly to CSV data and provides the read_csv command to read CSV data directly.

df = pd.read_csv("ZILLOW-M550_SALES.csv")

We store the CSV data in the data frame variable df. The result of reading it in is shown below.

df.head()

As you can see, both the date and the median transaction price records were read correctly.

Let's write a function to help us tidy up the data frame. It does the following:

  • Change column names to lowercase “date” and “value”;
  • Arrange data in chronological order. Place the oldest date and its value on the first line, and the latest date and its value at the end;
  • Set the dates as the index of the data frame, mainly so that the horizontal axis displays the dates correctly when we plot later.
def arrange_time_dataframe(df):
    # Rename the columns to lowercase 'date' and 'value'
    df.columns = ['date', 'value']
    # Sort the rows chronologically, oldest date first
    df.sort_values(by='date', inplace=True)
    # Use the date column as the index
    df.set_index('date', inplace=True)
    return df

Now we call this function to tidy up the data frame variable df.

df = arrange_time_dataframe(df)

Let's show the first five rows of df.

df.head()

You can see that the dates have become the index and are arranged in ascending order.
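If you want to confirm the ordering programmatically rather than by eye, a quick check (just a sketch) is to peek at both ends of the index:

print(df.index[:3])   # should show the oldest dates
print(df.index[-3:])  # should show the most recent dates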

Now let's plot. Pandas, our data frame tool, provides convenient plotting for time series.

To make the display look better, we set the width and height of the figure.

df.plot(figsize=(16, 6))

Compare our plot with the chart on Quandl's page. Do they match?
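As an aside, Pandas can also do part of this collation at read time. A sketch, assuming the header column is named "Date" as in our file (df_alt is just an illustrative name):

# Parse the dates and use them as the index in a single read_csv call
df_alt = pd.read_csv("ZILLOW-M550_SALES.csv", index_col="Date", parse_dates=True)
df_alt.sort_index(inplace=True)   # oldest dates first
df_alt.plot(figsize=(16, 6))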

JSON

JSON, short for JavaScript Object Notation, is a lightweight data interchange format. It’s a text file, just like CSV.

Let’s open the downloaded JSON file in Jupyter Notebook and examine its contents:

All the data we need is in there, so let's go back to our .ipynb notebook file and try to read the JSON data content.

First we import the json package.

import json

Open the M550_SALES.json file we downloaded and read the data into the variable data.

with open("M550_SALES.json") as f:
    data = json.load(f)

To make it more intuitive, let’s print the JSON properly indented. Here we show only the first few lines.

print(json.dumps(data, indent=2))
{
  "dataset": {
    "dataset_code": "M550_SALES",
    "column_names": [
      "Date",
      "Value"
    ],
    "newest_available_date": "2016-06-30",
    "description": "The Zillow Home Value Index is Zillow's estimate of the median market value of home sales (nsa) within the metro of Morehead City, NC. This data is calculated by Zillow Real Estate Research (www.zillow.com/research) using their database of 110 million homes.",
    "end_date": "2016-06-30",
    "data": [
      ["2016-06-30", 64.0],
      ["2016-05-31", 163.0],
      ...

As you can see, the JSON file is like one big dictionary. We pick one of its keys and get the corresponding value.
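For example, we can list the top-level keys of this "dictionary" (in this file there should be only one, "dataset"):

print(data.keys())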

We select "dataset":

data['dataset']

Here are the first few lines of the result.

{u'collapse': None,
 u'column_index': None,
 u'column_names': [u'Date', u'Value'],
 u'data': [[u'2016-06-30', 64.0],
  [u'2016-05-31', 163.0],
  [u'2016-04-30', 118.0],
  ...

The data we care about is under "data". Let's go one level deeper:

data['dataset']['data']

Or just the first few lines:

[[u'2016-06-30', 64.0], [u'2016-05-31', 163.0], [u'2016-04-30', 118.0], ...

Isn’t that what we want to read?

To keep it distinct from the CSV data, we will store it in the df1 variable after reading.

df1 = pd.DataFrame(data['dataset']['data'])

Show the first few lines:

df1.head()

The data looks fine, but the column names, the default 0 and 1, are odd.

That's okay. Didn't we just write a function for tidying things up? No matter how strange the column names are, they can be sorted out.

df1 = arrange_time_dataframe(df1)

After tidying up, we call the plotting function again on the df1 data:

df1.plot(figsize=(16, 6))

The correct plot shows that our JSON data was read successfully.
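As an aside, the JSON metadata already tells us the column names, so we could also have built the data frame with them directly. A minimal sketch (df1_alt is just an illustrative name):

# Use the column names stored in the JSON metadata when building the data frame
df1_alt = pd.DataFrame(data['dataset']['data'],
                       columns=data['dataset']['column_names'])
df1_alt.head()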

XML

XML stands for eXtensible Markup Language. It looks a bit like the HTML source code we use every day when we surf the Web, but there are differences. It was designed not to display Web pages, but to exchange data.

We open the downloaded XML file in Jupyter Notebook.

At the bottom of the page, we see the data section we are interested in, but the data is wrapped in many tags.

Let’s try using Python to extract and organize XML data.

First, we import BeautifulSoup, a tool for parsing web pages.

from bs4 import BeautifulSoup

This is a very important tool for extracting information from web pages and one of the basic skills for writing web scrapers in Python.

This article uses only a few simple BeautifulSoup commands, so even if you have never worked with it before, it doesn't matter; just follow along to get some intuition and hands-on experience first. We will study it systematically later.

For systematic study, I recommend going to the BeautifulSoup documentation page and reading through it.

If you have some difficulty reading the English document, you can look at the translated Chinese document, the address is here.

Then, we read in the downloaded XML data and store it in the variable data.

with open("M550_SALES.xml") as f:
    data = f.read()

Next, we parse the data with the lxml parser and store the result in the soup variable.

soup = BeautifulSoup(data, "lxml")

After parsing, we are ready to take advantage of Beautifulsoup’s powerful search capabilities.

Here we look at the XML file:
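Here is a simplified sketch of the portion we care about (the exact nesting in the real file may differ slightly):

<data type="array">
  <datum type="array">
    <datum type="date">2016-06-30</datum>
    <datum type="float">64.0</datum>
  </datum>
  <datum type="array">
    <datum type="date">2016-05-31</datum>
    <datum type="float">163.0</datum>
  </datum>
  ...
</data>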

As you can see, the dates and median transaction prices we care about are stored under datum tags.

The dates are marked with type "date" and the median transaction prices with type "float".

Let's start by trying to extract all the date data using BeautifulSoup's find_all function:

dates = soup.find_all('datum', type='date')

Let’s look at the first five lines of the extracted results:

dates[:5]
[<datum type="date">2016-06-30</datum>,
 <datum type="date">2016-05-31</datum>,
 <datum type="date">2016-04-30</datum>,
 <datum type="date">2016-03-31</datum>,
 <datum type="date">2016-02-29</datum>]

Good, the data was extracted correctly. The only problem is that the tags are still wrapped around the values, and we don't need them.

Let's deal with that. For each entry in the list, we extract the content using BeautifulSoup's text attribute.

dates = [item.text for item in dates]

Take a look at the results of this extraction:

dates[:5]
[u'2016-06-30', u'2016-05-31', u'2016-04-30', u'2016-03-31', u'2016-02-29']

Ok, no problem.

Here we treat the median transaction price record in the same way:

values = soup.find_all('datum', type='float')

To display the results:

values[:5]
<datum type="float">64.0</datum>, <datum type="float">163.0</datum>, <datum type="float">118.0</datum>, The < datum type = "float" > 110.0 < / datum >, < datum type = "float" > 83.0 < / datum >]Copy the code

The tags are still there this time, so they need to be removed as well.

Note that we want to store the results as floating-point numbers, so in addition to extracting the values with the text attribute, we also use the float() function to convert them.

values = [float(item.text) for item in values]

Show the first five lines:

values[:5]
[64.0, 163.0, 118.0, 110.0, 83.0]

The data is correctly converted to floating point.

We now have lists of the dates and of the median transaction prices. Let's convert them into a Pandas data frame and store it in the df2 variable.

df2 = pd.DataFrame({'dates':dates, 'values':values})

Look at the first few rows of df2:

df2.head()

Now that we have the data, let's tidy it up with our own function:

df2 = arrange_time_dataframe(df2)

Then we try to plot df2:

df2.plot(figsize=(16, 6))

The XML data was read and visualized successfully.

Summary

So far, you have tried reading CSV, JSON, and XML data into Pandas data frames and doing a basic time series visualization.

You may have the following questions:

Why bother learning difficult JSON and XML data reading methods when CSV files are so small and Pandas can read them easily?

That’s a good question!

I can think of at least two reasons.

First, the Quandl platform we used happens to provide downloads in a full range of formats, covering almost all the common ones. But that is the exception rather than the rule; most open data platforms do not offer so many formats. So when JSON or XML is the only format available, it is important to know how to read it.

Second, the extra content attached to JSON or XML is by no means meaningless. It can help you check the integrity and validity of your data, and you can even define your own semantic standards for efficient data exchange with others.
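For instance, the metadata in the JSON file we downloaded can double as a sanity check on the rows we extracted. A hedged sketch, reusing the data dictionary loaded earlier:

meta = data['dataset']

# The metadata declares the expected columns and the newest available date
print(meta['column_names'])           # ['Date', 'Value']
print(meta['newest_available_date'])  # '2016-06-30'

# Check that the newest date in the rows matches the declared metadata
newest = max(row[0] for row in meta['data'])
print(newest == meta['newest_available_date'])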

If you are interested in JSON and XML formats and want to learn them systematically, I recommend you to learn database courses on the MOOC platform Stanford Online.

Wishing you progress!

Discussion

What open data platforms have you downloaded data from? What data formats have you encountered? What tools do you use to organize and analyze them? Is there a more efficient method than the one shown here? Feel free to leave a comment and share your experience and thoughts, so we can all discuss and learn together.

If you like this article, please give it a thumbs up. You can also follow and pin my WeChat official account "Nkwangshuyi".

If you're interested in data science, check out my tutorial series index post, "How to Get Started in Data Science Effectively". There are more interesting problems and solutions there.