Python’s flexibility and ease of use make it one of the most popular programming languages, especially for data processing and machine learning. Its powerful data-processing and algorithm libraries make Python the preferred language for getting started in data science. In daily work, CSV, JSON, and XML are the dominant data formats, so in this article I’ll share quick ways to process each of the three.
CSV data
CSV is the most common way to store data; most of the datasets in Kaggle competitions are stored this way. We can use Python’s built-in csv library to read and write CSV files. Normally, we read the data into a list of lists.
Look at the code below. All of the CSV data becomes accessible once we create a reader with csv.reader(). Calling next(csvreader) reads one line from the CSV; each call advances to the next line. We can also iterate through the rows with a for loop: for row in csvreader. Make sure each row has the same number of columns, or you may run into errors when working with the list of lists.
import csv

filename = "my_data.csv"
fields = []
rows = []

# Reading the csv file
with open(filename, 'r') as csvfile:
    # Creating a csv reader object
    csvreader = csv.reader(csvfile)
    # Extracting the field names from the first row
    fields = next(csvreader)
    # Extracting each data row one by one
    for row in csvreader:
        rows.append(row)

# Printing out the first 5 rows
for row in rows[:5]:
    print(row)
Writing CSV in Python is also easy. Set the field names in a single list and the data in a list of lists. This time we’ll create a csv.writer object and use it to write our data to a file, much the same way we did when reading.
import csv

# Field names
fields = ['Name', 'Goals', 'Assists', 'Shots']

# Rows of data in the csv file
rows = [['Emily', '12', '18', '112'],
        ['Katie', '8', '24', '96'],
        ['John', '16', '9', '101'],
        ['Mike', '3', '14', '82']]

filename = "soccer.csv"

# Writing to the csv file; newline='' avoids blank lines on Windows
with open(filename, 'w+', newline='') as csvfile:
    # Creating a csv writer object
    csvwriter = csv.writer(csvfile)
    # Writing the field names
    csvwriter.writerow(fields)
    # Writing the data rows
    csvwriter.writerows(rows)
We can use pandas to convert a CSV into a list of dictionaries in a single line. Once the data is formatted as a list of dictionaries, we’ll use the dicttoxml library to convert it to XML format. We can also save it as a JSON file!
import pandas as pd
from dicttoxml import dicttoxml
import json

# Building our dataframe
data = {'Name': ['Emily', 'Katie', 'John', 'Mike'],
        'Goals': [12, 8, 16, 3],
        'Assists': [18, 24, 9, 14],
        'Shots': [112, 96, 101, 82]}

df = pd.DataFrame(data, columns=data.keys())

# Converting the dataframe to a list of dictionaries
# Then saving it to a JSON file
data_dict = df.to_dict(orient="records")
with open('output.json', "w+") as f:
    json.dump(data_dict, f, indent=4)

# Converting the list of dictionaries to XML
# Then saving it to a file
xml_data = dicttoxml(data_dict).decode()
with open("output.xml", "w+") as f:
    f.write(xml_data)
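The “one line” CSV-to-dictionaries conversion mentioned above can be sketched like this; here a StringIO buffer stands in for a file such as soccer.csv so the snippet is self-contained:

```python
import io
import pandas as pd

# A small CSV, standing in for a file on disk
csv_text = """Name,Goals,Assists,Shots
Emily,12,18,112
Katie,8,24,96
"""

# One line: read the CSV and convert it to a list of dictionaries
records = pd.read_csv(io.StringIO(csv_text)).to_dict(orient="records")

# e.g. the first row as a dictionary, with numeric columns already typed
print(records[0])
```

With a real file you would simply pass the filename to pd.read_csv instead of the StringIO buffer.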
JSON data
JSON provides a concise and easy-to-read format that maps naturally onto Python’s dictionary structure. Just like with CSV, Python has a built-in json module that makes reading and writing very easy! We read the JSON into a dictionary, and we can write dictionary-formatted data back out to a file.
import json
import pandas as pd

# Read the data from the file
# We now have a Python dictionary (here, a list of dictionaries)
with open('data.json') as f:
    data_listofdict = json.load(f)

# We can do the same thing with pandas
data_df = pd.read_json('data.json', orient='records')

# We can write a dictionary to JSON like so
# Use 'indent' and 'sort_keys' to make the JSON file look nice
with open('new_data.json', 'w+') as json_file:
    json.dump(data_listofdict, json_file, indent=4, sort_keys=True)

# And again, the same thing with pandas
data_df.to_json('new_data.json', orient='records')
As we saw earlier, once we have the data as a list of dictionaries, it can easily be converted to CSV, either through pandas or using Python’s built-in csv module. For converting to XML, the dicttoxml library can be used. The specific code is as follows:
import json
import csv

# Read the data from the file
# We now have a list of dictionaries
with open('data.json') as f:
    data_listofdict = json.load(f)

# Writing a list of dicts to CSV
keys = data_listofdict[0].keys()
with open('saved_data.csv', 'w', newline='') as output_file:
    dict_writer = csv.DictWriter(output_file, keys)
    dict_writer.writeheader()
    dict_writer.writerows(data_listofdict)
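For the XML side, if you’d rather avoid the extra dependency, the same list-of-dicts-to-XML conversion can be sketched with the standard library’s ElementTree. This is a minimal alternative to dicttoxml, assuming flat records like those above (the dicts_to_xml helper and its tag names are made up for this example):

```python
import xml.etree.ElementTree as ET

def dicts_to_xml(records, root_tag='root', item_tag='item'):
    # Build <root><item><key>value</key>...</item>...</root>
    root = ET.Element(root_tag)
    for record in records:
        item = ET.SubElement(root, item_tag)
        for key, value in record.items():
            child = ET.SubElement(item, key)
            child.text = str(value)
    return ET.tostring(root, encoding='unicode')

data_listofdict = [{'Name': 'Emily', 'Goals': 12},
                   {'Name': 'Katie', 'Goals': 8}]
xml_string = dicts_to_xml(data_listofdict)
print(xml_string)
```

Unlike dicttoxml, this sketch does not record value types in attributes, so everything round-trips as a string.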
XML data
XML is a little different from CSV and JSON. Those two are easy to read, write, and interpret because they are simple and fast; XML takes up more memory and requires more bandwidth, more storage, and longer running times to transmit and store. But XML also has some capabilities beyond JSON and CSV: you can use namespaces to build and share structural standards, it supports better inheritance, and it offers industry-standardized ways of representing data through schemas, DTDs, and so on.
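As a small illustration of the namespace point, here is how ElementTree reads a namespaced element; the XML document and the prefix mapping are invented for this example:

```python
import xml.etree.ElementTree as ET

# A tiny namespaced document, invented for demonstration
xml_text = """<players xmlns:stats="http://example.com/stats">
  <player>
    <stats:goals>12</stats:goals>
  </player>
</players>"""

root = ET.fromstring(xml_text)
# Map a prefix to the namespace URI when searching
ns = {'stats': 'http://example.com/stats'}
goals = root.find('player/stats:goals', ns)
print(goals.text)  # 12
```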
To read in XML data, we’ll use Python’s built-in xml module and its submodule ElementTree. Then we can use the xmltodict library to convert the ElementTree object into a dictionary. Once we have the dictionary, we can convert it to CSV, JSON, or a pandas DataFrame! The specific code is as follows:
import xml.etree.ElementTree as ET
import xmltodict
import json

tree = ET.parse('output.xml')
xml_data = tree.getroot()

# Converting the ElementTree object to a string, then to a dictionary
xmlstr = ET.tostring(xml_data, encoding='utf-8', method='xml')
data_dict = dict(xmltodict.parse(xmlstr))
print(data_dict)

with open('new_data_2.json', 'w+') as json_file:
    json.dump(data_dict, json_file, indent=4, sort_keys=True)
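And once the XML has been parsed, the round trip back to CSV mentioned above can be sketched directly with ElementTree and the csv module; the inline document below is invented in the shape dicttoxml produces by default (repeated &lt;item&gt; elements), so adjust the tag names for your own files:

```python
import csv
import io
import xml.etree.ElementTree as ET

# A small XML document in the shape dicttoxml produces, invented here
xml_text = """<root>
  <item><Name>Emily</Name><Goals>12</Goals></item>
  <item><Name>Katie</Name><Goals>8</Goals></item>
</root>"""

root = ET.fromstring(xml_text)
# Turn each <item> element into a dictionary of tag -> text
records = [{child.tag: child.text for child in item}
           for item in root.findall('item')]

# Write the records to CSV with DictWriter
# (an io.StringIO buffer stands in for a file on disk)
output = io.StringIO()
writer = csv.DictWriter(output, fieldnames=records[0].keys())
writer.writeheader()
writer.writerows(records)
print(output.getvalue())
```

The same records list can also be passed straight to pd.DataFrame if you want a DataFrame instead of a CSV file.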