Python data processing blog series is here!
This series is based on the book Python Data Processing, with one blog post per chapter. Where the book does not go into much detail, I'll supplement it from other sources and try to make every post comprehensive and easy to understand.
This book focuses on how to handle various types of files in Python, such as JSON, XML, CSV, Excel, and PDF. Later chapters also cover practical skills such as data cleaning, web scraping, automation, and scaling. I am a Python beginner myself and will be writing from a beginner's point of view, so the blog should be beginner friendly.
Python basics: if you're not already familiar with Python, check out my other blog post: 10 Minutes to Get Started with Python
More than 100 experienced developers have taken part in this full-stack, all-platform open source project on GitHub, which has nearly 1,000 stars. Want to learn more or join in? Project address: github.com/cachecats/c…
Preface
A file format that stores data in a way that is easy for machines to understand is often described as machine-readable. Common machine-readable formats include:
- Comma-separated Values (CSV)
- JavaScript Object Notation (JSON)
- Extensible Markup Language (XML)
These data formats are often referred to by their short names (e.g. CSV) in both spoken and written language. We will use these abbreviations.
1. CSV data
A CSV file (CSV for short) is a file in which data columns are separated by commas. The file extension is .csv.
Another data type, called tab-separated values (TSV) data, is sometimes grouped with CSV. The only difference between TSV and CSV is that the separators between data columns are tab characters instead of commas. The file extension is usually .tsv, but .csv is sometimes used as well. In essence, .tsv files serve the same purpose in Python as .csv files.
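Since TSV differs from CSV only in its separator, the same csv module can read it by passing delimiter='\t'. A minimal sketch, assuming a small made-up table held in memory rather than a real .tsv file:

```python
import csv
import io

# Hypothetical tab-separated sample; a real file would be opened with open(path, newline='')
tsv_text = "country\tyear\tvalue\nKenya\t2010\t42\nPeru\t2011\t37\n"

# delimiter="\t" tells the reader to split columns on tabs instead of commas
reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
tsv_rows = list(reader)

for row in tsv_rows:
    print(row)
```

Everything else in this section applies to TSV unchanged once the delimiter is set.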
The data source we used was downloaded from the World Health Organization (www.who.int/zh/home)…
When you go to the WHO website, click on “Health Topics”, “Data and Statistics” to find a lot of data.
Here we download the statistics for baby care and rename the file data.csv.
CSV files can be opened directly in Excel for a visual view. Opened in Excel, the file looks like this:
The next step is to simply process the data in Python.
Read CSV data as a list
Write a program to read CSV files:
import csv

# Open the data file read-only
csvfile = open('./data.csv', 'r')
reader = csv.reader(csvfile)
# Each row is returned as a list of strings
for row in reader:
    print(row)
csvfile.close()
import csv imports Python's csv module. csvfile = open('./data.csv', 'r') opens the data file in read-only mode and stores the file object in the variable csvfile. The csv.reader() function is then called and its result stored in the variable reader, and a for loop prints each row of data.
Run the program, console output:
And you can see that it’s the same as what I opened in Excel.
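The same read can also be written with a with block, which closes the file automatically once the loop finishes. This is only a sketch: it first writes a tiny made-up data.csv into a temporary directory so that it runs on its own, and it passes newline='', the setting the csv module documentation recommends when opening files.

```python
import csv
import os
import tempfile

# Create a small stand-in for data.csv (contents are invented for illustration)
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "data.csv")
with open(csv_path, "w", newline="") as f:
    f.write("Indicator,Year,Value\nExample indicator,2010,12.5\n")

# Read it back; the with block closes the file even if an error occurs
rows = []
with open(csv_path, "r", newline="") as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        rows.append(row)
        print(row)
```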
Read CSV data as a dictionary
Change the code to read the CSV as a dictionary
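import csv

csvfile = open('./data.csv', 'r')
# DictReader uses the first row as keys and returns each later row as a dictionary
reader = csv.DictReader(csvfile)
for row in reader:
    print(row)
csvfile.close()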
Console output:
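To make the difference from csv.reader concrete, here is a small sketch on invented data (the column names are assumptions, not the real headers of the WHO file). DictReader takes the first row as keys, so each later row can be read by column name:

```python
import csv
import io

# Invented sample; the first line supplies the dictionary keys
csv_text = "Country,Year,Value\nKenya,2010,42\nPeru,2011,37\n"

dict_rows = list(csv.DictReader(io.StringIO(csv_text)))

for row in dict_rows:
    # Values are always strings; convert them explicitly when numbers are needed
    print(row["Country"], int(row["Year"]), float(row["Value"]))
```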
2. JSON data
Also download the data source from the WHO website and rename it data.json. Open the JSON file with the formatting tool as follows:
Write a program to parse json
import json

# Read the JSON file as a string
json_data = open('./data.json').read()
# Decode the JSON data
data = json.loads(json_data)
# The type of data is dict
print(type(data))
# Print the data directly
print(data)
# Walk through the dictionary
for k, v in data.items():
    print(k + ':' + str(v))
Console output:
Python 3 encodes and decodes JSON data with the json module. The two functions used here are:
- json.dumps(): encodes Python data as a JSON string.
- json.loads(): decodes a JSON string into Python data.
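A quick round trip shows the two functions working as a pair. The record below is made up for illustration:

```python
import json

# Made-up record, not taken from the WHO file
record = {"country": "Kenya", "year": 2010, "values": [1.5, 2.5]}

encoded = json.dumps(record)   # Python object -> JSON string
decoded = json.loads(encoded)  # JSON string -> Python object

print(type(encoded))
print(decoded == record)
```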
During encoding and decoding, Python's built-in types and JSON types are converted to each other as follows:
Python code to JSON type conversion table:
Python | JSON |
---|---|
dict | object |
list, tuple | array |
str | string |
int, float, int- & float-derived Enums | number |
True | true |
False | false |
None | null |
JSON decoding to Python type conversion table:
JSON | Python |
---|---|
object | dict |
array | list |
string | str |
number (int) | int |
number (real) | float |
true | True |
false | False |
null | None |
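The two tables are not exact mirrors of each other, which a short round trip makes visible. In this sketch (the sample values are invented) a tuple is encoded as a JSON array and comes back as a list, while True and None pass through true and null:

```python
import json

original = {"point": (1, 2), "ratio": 0.5, "ok": True, "missing": None}

# Encoding follows the Python-to-JSON table
text = json.dumps(original)
print(text)

# Decoding follows the JSON-to-Python table, so the tuple returns as a list
restored = json.loads(text)
print(restored)
```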
3. XML data
Data in XML format is easy to read by machines as well as by humans. But for the data set in this chapter, it is much easier to preview and understand CSV and JSON files than XML files.
XML format description:
- Tag: part surrounded by < and >;
- Element: the part enclosed by a pair of tags, as in <year>2003</year>. It can be considered a node, which can have child nodes;
- Attribute: name/value pairs that may appear inside a tag. For example, title="Enemy Behind" in the example is an attribute.
The WHO data is hard to follow, so let's use simple, easy-to-understand movie data to illustrate:
<?xml version="1.0" encoding="UTF-8"?>
<collection shelf="New Arrivals">
    <movie title="Enemy Behind">
        <type>War, Thriller</type>
        <format>DVD</format>
        <year>2003</year>
        <rating>PG</rating>
        <stars>10</stars>
        <description>Talk about a US-Japan war</description>
    </movie>
    <movie title="Transformers">
        <type>Anime, Science Fiction</type>
        <format>DVD</format>
        <year>1989</year>
        <rating>R</rating>
        <stars>8</stars>
        <description>A scientific fiction</description>
    </movie>
    <movie title="Trigun">
        <type>Anime, Action</type>
        <format>DVD</format>
        <episodes>4</episodes>
        <rating>PG</rating>
        <stars>10</stars>
        <description>Vash the Stampede!</description>
    </movie>
    <movie title="Ishtar">
        <type>Comedy</type>
        <format>VHS</format>
        <rating>PG</rating>
        <stars>2</stars>
        <description>Viewable boredom</description>
    </movie>
</collection>
This data is relatively simple, with only three levels, but the principle is the same no matter how many levels the data has.
The following code parses the above XML and formats it into a dictionary and JSON format for output:
from xml.etree import ElementTree as ET
import json

tree = ET.parse('./resource/movie.xml')
root = tree.getroot()
all_data = []

for movie in root:
    # Dictionary for storing movie data
    movie_data = {}
    # Dictionary for storing the movie's properties
    attr_data = {}

    # Fetch the value of the type tag
    movie_type = movie.find('type')
    attr_data['type'] = movie_type.text

    # Fetch the value of the format tag
    movie_format = movie.find('format')
    attr_data['format'] = movie_format.text

    # Fetch the value of the year tag (not every movie has one)
    movie_year = movie.find('year')
    # Compare with None: an element without children is falsy
    if movie_year is not None:
        attr_data['year'] = movie_year.text

    # Fetch the value of the rating tag
    movie_rating = movie.find('rating')
    attr_data['rating'] = movie_rating.text

    # Fetch the value of the stars tag
    movie_stars = movie.find('stars')
    attr_data['stars'] = movie_stars.text

    # Fetch the value of the description tag
    movie_description = movie.find('description')
    attr_data['description'] = movie_description.text

    # Use the movie title as the key and the property dictionary as the value
    movie_title = movie.attrib.get('title')
    movie_data[movie_title] = attr_data

    # Save to the list
    all_data.append(movie_data)

print(all_data)
# all_data is a list object; json.dumps() converts Python objects to a JSON string
json_str = json.dumps(all_data)
print(json_str)
The comments explain each step in detail; the methods provided by ElementTree are described below.
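Because every child of a movie element is a simple tag with text, the same conversion can also be sketched more compactly by looping over the children instead of looking up each tag by name. This is an alternative illustration, not the original program. It parses a shortened copy of the movie data from a string so that it runs without the movie.xml file:

```python
import json
from xml.etree import ElementTree as ET

# Shortened copy of the movie data above, kept inline for a self-contained run
xml_text = """<collection shelf="New Arrivals">
  <movie title="Enemy Behind">
    <type>War, Thriller</type>
    <year>2003</year>
  </movie>
  <movie title="Ishtar">
    <type>Comedy</type>
    <format>VHS</format>
  </movie>
</collection>"""

root = ET.fromstring(xml_text)

all_movies = []
for movie in root:
    # Map each child tag to its text; tags a movie lacks are simply absent
    details = {child.tag: child.text for child in movie}
    all_movies.append({movie.attrib.get("title"): details})

print(json.dumps(all_movies, indent=2))
```

The trade-off is that this version copies whatever tags happen to be present, whereas the explicit version documents exactly which fields are expected.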
3.1 Three methods of parsing
ElementTree parses XML in three ways:
- Call parse(), which returns the parse tree:
  tree = ET.parse('./resource/movie.xml')
  root = tree.getroot()
- Call fromstring(), which returns the root element of the parse tree:
  data = open('./resource/movie.xml').read()
  root = ET.fromstring(data)
- Use the ElementTree(element=None, file=None) constructor of the ElementTree class:
  tree = ET.ElementTree(file="./resource/movie.xml")
  root = tree.getroot()
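The three approaches can be compared side by side. In this sketch a one-movie sample (invented for brevity) is written to a temporary file, since two of the three calls read from a path; all three should arrive at the same root element:

```python
import os
import tempfile
from xml.etree import ElementTree as ET

xml_text = '<collection shelf="New Arrivals"><movie title="Ishtar"/></collection>'

# Write the sample to a temporary file so the file-based calls have something to read
tmp_dir = tempfile.mkdtemp()
xml_path = os.path.join(tmp_dir, "movie.xml")
with open(xml_path, "w", encoding="utf-8") as f:
    f.write(xml_text)

root_from_parse = ET.parse(xml_path).getroot()             # parse() -> tree -> root
root_from_string = ET.fromstring(xml_text)                 # fromstring() -> root directly
root_from_class = ET.ElementTree(file=xml_path).getroot()  # constructor -> tree -> root

for r in (root_from_parse, root_from_string, root_from_class):
    print(r.tag, r.attrib, len(r))
```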
3.2 The Element object
class xml.etree.ElementTree.Element(tag, attrib={}, **extra)
Attributes of the Element object
- tag: the element's tag name.
- text: the text content inside the element, between its start tag and its first child or end tag.
- attrib: a dictionary of the attributes and attribute values in the tag.
- tail: the text that follows the element's end tag, up to the next tag. Like text, it can be used to hold additional data associated with an element; its value is usually a string, but may be an application-specific object.
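A tiny example helps tell text and tail apart. The snippet below is invented for illustration; the words "after year" sit after the closing </year> tag, so they belong to the tail of the year element rather than to its text:

```python
from xml.etree import ElementTree as ET

# Tiny invented snippet; "after year" is the tail of the <year> element
snippet = '<movie title="Enemy Behind"><year>2003</year>after year</movie>'
movie = ET.fromstring(snippet)
year = movie.find("year")

print(movie.tag)     # tag name of the outer element
print(movie.attrib)  # its attributes as a dictionary
print(year.text)     # text inside <year>...</year>
print(year.tail)     # text between </year> and the next tag
```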
Methods of the Element object
- clear(): Removes all child elements and all attributes, and sets the text and tail attributes to None.
- get(attribute_name, default=None): Returns the value of the named attribute, or default if the attribute is missing.
- items(): Returns the element's attributes as a sequence of (name, value) pairs.
- keys(): Returns the element's attribute names as a list.
- set(attribute_name, attribute_value): Sets an attribute in the tag to the given value.
- append(subelement): Adds subelement to the end of this element's internal list of child elements.
- extend(subelements): Appends child elements from a sequence of zero or more elements.
- find(match, namespaces=None): Finds the first child element matching match, which can be a tag name or a path. Returns an Element instance or None.
- findall(match, namespaces=None): Finds all matching child elements and returns them as a list in document order.
- findtext(match, default=None, namespaces=None): Finds the first child element matching match and returns its text content, or default if no element is found.
- getchildren(): Deprecated since Python 3.2 and removed in Python 3.9; use list(elem) or iterate over the element instead.
- getiterator(tag=None): Deprecated since Python 3.2 and removed in Python 3.9; use Element.iter() instead.
- iter(tag=None): Creates a tree iterator with the current element as the root. The iterator visits this element and all elements below it, depth first. If tag is not None or '*', only elements whose tag equals tag are returned from the iterator. If the tree structure is modified during iteration, the result is undefined.
- iterfind(match, namespaces=None): Finds all matching child elements and returns an iterable that yields them in document order.
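A few of these methods are easiest to understand in use. The sketch below works on a two-movie sample (trimmed from the data above, so the values are illustrative) and shows find(), findall(), findtext() with a default, iter(), get(), set(), and keys():

```python
from xml.etree import ElementTree as ET

xml_text = """<collection>
  <movie title="Enemy Behind"><year>2003</year><stars>10</stars></movie>
  <movie title="Ishtar"><stars>2</stars></movie>
</collection>"""
root = ET.fromstring(xml_text)

# find() returns only the first matching child
first = root.find("movie")

# findall() returns every matching child; get() reads an attribute
titles = [m.get("title") for m in root.findall("movie")]

# iter() walks the whole subtree; findtext() falls back to the default when <year> is absent
years = [m.findtext("year", default="unknown") for m in root.iter("movie")]

# set() adds or overwrites an attribute; keys() lists the attribute names
first.set("seen", "yes")
print(titles, years, first.keys())
```

Using findtext() with a default avoids the explicit None check that find() requires for optional tags such as year.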
3.3 ElementTree object
class xml.etree.ElementTree.ElementTree(element=None, file=None)
ElementTree is a wrapper class that represents a complete hierarchy of elements and adds some additional support for serialization of standard XML.
- _setroot(element): Replaces the root element. The content of the original root element is discarded.
- find(match, namespaces=None): Same as Element.find(), starting from the root element.
- findall(match, namespaces=None): Same as Element.findall(), starting from the root element.
- findtext(match, default=None, namespaces=None): Same as Element.findtext(), starting from the root element.
- getiterator(tag=None): Deprecated since Python 3.2 and removed in Python 3.9; use ElementTree.iter() instead.
- iter(tag=None): Creates a tree iterator for the root element that iterates over all elements in the tree.
- iterfind(match, namespaces=None): Same as Element.iterfind(), starting from the root element.
- parse(source, parser=None): Parses the XML text from source (a file name or file object) into this tree and returns the root element.
- write(file, encoding="us-ascii", xml_declaration=None, default_namespace=None, method="xml", *, short_empty_elements=True): Writes the element tree to a file as XML.
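As a rough illustration of write() and parse() working together, the sketch below builds a small tree in memory, writes it to a temporary file, and loads it back. The content and the file name out.xml are invented; ET.SubElement is a helper from the same module that creates a child element and attaches it to its parent.

```python
import os
import tempfile
from xml.etree import ElementTree as ET

# Build a small tree in memory (content is invented)
root = ET.Element("collection", {"shelf": "New Arrivals"})
movie = ET.SubElement(root, "movie", {"title": "Ishtar"})
ET.SubElement(movie, "stars").text = "2"
tree = ET.ElementTree(root)

# Write it to a temporary file, then load it back into a fresh ElementTree
out_path = os.path.join(tempfile.mkdtemp(), "out.xml")
tree.write(out_path, encoding="utf-8", xml_declaration=True)

reloaded = ET.ElementTree()
reloaded_root = reloaded.parse(out_path)   # parse() returns the root element
print(reloaded_root.tag, reloaded.findtext("movie/stars"))
```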
That wraps up processing data in JSON, XML, and CSV formats. Next time we'll look at how to handle Excel files, so stay tuned.
Full stack full platform open source project CodeRiver
CodeRiver is a free project collaboration platform with a vision to connect the upstream and downstream of the IT industry. Whether you are a product manager, designer, programmer, tester, or other industry personnel, you can come to CodeRiver to publish your project for free and gather like-minded team members to make your dream come true!
CodeRiver itself is also a large open source project, committed to creating full stack full platform enterprise quality open source projects. It covers React, Vue, Angular, applets, ReactNative, Android, Flutter, Java, Node and almost all the mainstream technology stacks, focusing on code quality.
So far, nearly 100 top developers have participated, and there are nearly 1,000 stars on Github. Each stack is staffed by a number of experienced gurus and two architects to guide the project architecture. No matter what language you want to learn and what skill level you are at, you can learn it here.
Help every developer grow quickly through high quality source code + blog + video.
Project address: github.com/cachecats/c…
Your encouragement is our greatest motivation. Likes and stars are very welcome ✨ ~