Preface
I ran into some minor problems in code I was working on recently, which slowed down my updates. Today I would like to share those problems with you and walk through them in a hands-on example. I hope that when you hit similar problems in the future, you will remember this article and be able to solve them.
Today I’m going to share knowledge about parsing XML files.
What is XML
XML stands for Extensible Markup Language. It is a subset of the Standard Generalized Markup Language (SGML) and is used to mark up electronic documents so that they have structure. XML is designed to transport and store data. It is a set of rules for defining semantic markup that identifies the different parts of a document. It is also a meta-markup language, that is, a syntax for defining other domain-specific, semantic, structured markup languages.
Parsing XML in Python
There are two common XML programming interfaces, DOM and SAX. They handle XML in different ways and therefore suit different scenarios.
- SAX (Simple API for XML)
The Python standard library includes a SAX parser, which uses an event-driven model to process XML files by firing events and calling user-defined callbacks as XML is parsed.
- DOM (Document Object Model)
XML data is parsed in memory into a tree, and XML is manipulated by manipulating the tree.
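To make the event-driven SAX model concrete, here is a minimal sketch using the standard library's xml.sax module. The inline sample and the MovieHandler class are illustrative assumptions, not code from this article; the handler simply records each movie's title and format as the parser fires events.

```python
import xml.sax

# A small inline sample so the sketch runs without any file on disk.
SAMPLE = """<collection shelf="New Arrivals">
<movie title="Enemy Behind"><format>DVD</format></movie>
<movie title="Ishtar"><format>VHS</format></movie>
</collection>"""


class MovieHandler(xml.sax.ContentHandler):
    """Collects (title, format) pairs as the parser fires events."""

    def __init__(self):
        super().__init__()
        self.movies = []         # finished (title, format) pairs
        self._title = None       # title of the <movie> being read
        self._in_format = False  # True while inside <format>
        self._chunks = []        # text pieces of the current <format>

    def startElement(self, name, attrs):
        if name == "movie":
            self._title = attrs.get("title")
        elif name == "format":
            self._in_format = True
            self._chunks = []

    def characters(self, content):
        # May be called several times for one text node, so accumulate.
        if self._in_format:
            self._chunks.append(content)

    def endElement(self, name):
        if name == "format":
            self._in_format = False
            self.movies.append((self._title, "".join(self._chunks)))


handler = MovieHandler()
xml.sax.parseString(SAMPLE.encode("utf-8"), handler)
print(handler.movies)
```

Nothing is kept in memory except what the handler chooses to store, which is why SAX tends to suit large files, while DOM is more convenient when you need to move around the tree.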
The XML file used in this article is movies.xml, with the following contents:
<collection shelf="New Arrivals">
<movie title="Enemy Behind">
<type>War, Thriller</type>
<format>DVD</format>
<year>2003</year>
<rating>PG</rating>
<stars>10</stars>
<description>Talk about a US-Japan war</description>
</movie>
<movie title="Transformers">
<type>Anime, Science Fiction</type>
<format>DVD</format>
<year>1989</year>
<rating>R</rating>
<stars>8</stars>
<description>A schientific fiction</description>
</movie>
<movie title="Trigun">
<type>Anime, Action</type>
<format>DVD</format>
<episodes>4</episodes>
<rating>PG</rating>
<stars>10</stars>
<description>Vash the Stampede!</description>
</movie>
<movie title="Ishtar">
<type>Comedy</type>
<format>VHS</format>
<rating>PG</rating>
<stars>2</stars>
<description>Viewable boredom</description>
</movie>
</collection>
At present, the most common approach is to parse with the DOM module.
Examples of Python parsing XML
import xml.dom.minidom

# Open the XML document with the minidom parser
DOMTree = xml.dom.minidom.parse('movies.xml')  # returns a Document object
collection = DOMTree.documentElement  # the root element
# print(collection)
if collection.hasAttribute('shelf'):
    print('Root element : %s' % collection.getAttribute('shelf'))

# Get all movies in the collection
movies = collection.getElementsByTagName('movie')  # a list of all <movie> elements
# print(movies)
for movie in movies:
    print('*******movie******')
    if movie.hasAttribute('title'):
        print('Title: %s' % movie.getAttribute('title'))
    # childNodes[0].data is the text inside the tag
    movie_type = movie.getElementsByTagName('type')[0]
    print('Type: %s' % movie_type.childNodes[0].data)
    movie_format = movie.getElementsByTagName('format')[0]
    print('format: %s' % movie_format.childNodes[0].data)
    rating = movie.getElementsByTagName('rating')[0]
    print('rating: %s' % rating.childNodes[0].data)
    description = movie.getElementsByTagName('description')[0]
    print('description: %s' % description.childNodes[0].data)
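Not every movie in movies.xml has every tag (only one has episodes, for instance), and indexing [0] on a missing tag raises an IndexError. The sketch below shows one way to guard against that. The get_text helper and the shortened inline sample are my own illustrative assumptions.

```python
from xml.dom.minidom import parseString

# Shortened sample: only the first movie has an <episodes> tag.
SAMPLE = """<collection>
<movie title="Trigun"><episodes>4</episodes><rating>PG</rating></movie>
<movie title="Ishtar"><rating>PG</rating></movie>
</collection>"""


def get_text(element, tag, default=""):
    """Return the text of the first <tag> descendant, or default if it is absent or empty."""
    nodes = element.getElementsByTagName(tag)
    if not nodes or nodes[0].firstChild is None:
        return default
    return nodes[0].firstChild.data


root = parseString(SAMPLE).documentElement
rows = [
    (movie.getAttribute("title"), get_text(movie, "episodes", default="n/a"))
    for movie in root.getElementsByTagName("movie")
]
print(rows)
```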
iQiyi bullet comments
A new series called “Tausiness” came out recently, and you have probably seen it. Today's hands-on exercise is to scrape the bullet comments (danmu) sent by its viewers, and I will share the problems I ran into while crawling them.
Analyzing the web page
Generally speaking, a video's bullet comments do not appear in the page source, so my first guess is that they are loaded asynchronously.
First open the developer tools -> Network -> XHR.
Find the bullet-comment request URL there; the only part we need from it is /54/00/7973227714515400.
iQiyi's bullet-comment address has the following form:
https://cmts.iqiyi.com/bullet/parameter1_300_parameter2.z
Parameter 1 is 54/00/7973227714515400
Parameter 2 is the segment number: 1, 2, 3, ...
iQiyi loads bullet comments in 5-minute segments and each episode lasts about 46 minutes, so one episode needs 10 segments and the links are as follows:
https://cmts.iqiyi.com/bullet/54/00/7973227714515400_300_1.z
https://cmts.iqiyi.com/bullet/54/00/7973227714515400_300_2.z
https://cmts.iqiyi.com/bullet/54/00/7973227714515400_300_3.z
.
.
.
https://cmts.iqiyi.com/bullet/54/00/7973227714515400_300_10.z
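The count of 10 links follows from rounding 46 minutes up to whole 5-minute segments. A minimal sketch of that arithmetic is below; segment_urls is a hypothetical helper, and reading the 300 in the URL as a segment length in seconds is my assumption, not something the site documents.

```python
import math

# Assumption: the "_300_" in the URL is the segment length in seconds (5 minutes).
SEGMENT_SECONDS = 300


def segment_urls(video_path, duration_seconds):
    """Build one bullet-comment URL per 5-minute segment of an episode."""
    count = math.ceil(duration_seconds / SEGMENT_SECONDS)
    return [
        f"https://cmts.iqiyi.com/bullet/{video_path}_{SEGMENT_SECONDS}_{n}.z"
        for n in range(1, count + 1)
    ]


urls = segment_urls("54/00/7973227714515400", 46 * 60)
print(len(urls))
print(urls[0])
print(urls[-1])
```

A 46-minute episode is 2760 seconds, and 2760 / 300 = 9.2, which rounds up to 10 segments.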
Decoding the data
When you paste one of the URLs above into your browser, it downloads a compressed file with the .z extension. Windows cannot open this file directly, so we first decompress it with Python.
I'll start with a brief explanation of zlib, a library for compressing and decompressing data streams.
With it, we can decompress the downloaded file.
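The code in this article calls zlib.decompress with a second argument of 15 + 32. As a small sketch of what that value does: 15 selects the largest window size, and adding 32 asks zlib to detect a zlib or gzip header automatically. The sample payload below is made up for illustration.

```python
import gzip
import zlib

text = "<danmu><entry><content>hello</content></entry></danmu>"
raw = text.encode("utf-8")

zlib_bytes = zlib.compress(raw)  # data with a zlib header
gzip_bytes = gzip.compress(raw)  # data with a gzip header

AUTO_DETECT = 15 + 32  # 15 = largest window, +32 = accept zlib or gzip headers

from_zlib = zlib.decompress(zlib_bytes, AUTO_DETECT).decode("utf-8")
from_gzip = zlib.decompress(gzip_bytes, AUTO_DETECT).decode("utf-8")
print(from_zlib == text, from_gzip == text)

# With the default wbits, gzip-wrapped data is rejected.
try:
    zlib.decompress(gzip_bytes)
    default_accepts_gzip = True
except zlib.error:
    default_accepts_gzip = False
print(default_accepts_gzip)
```

Using the auto-detecting value means the same line of code works whichever of the two wrappers the server happens to use.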
First, read the file in binary mode, then decompress it.
Take the file I just downloaded as an example.
The specific code is as follows:
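import zlib

# Read the downloaded file in binary mode
with open('7973227714515400_300_1.z', 'rb') as f:
    data = f.read()
# wbits = 15 + 32 lets zlib detect a zlib or gzip header automatically
decode = zlib.decompress(bytearray(data), 15 + 32).decode('utf-8')
print(decode)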
The running result is as follows:
You may have noticed that this data looks a lot like XML, so let's write two more lines of code and save it as an XML file.
The specific code is as follows:
import zlib

with open('7973227714515400_300_1.z', 'rb') as f:
    data = f.read()
decode = zlib.decompress(bytearray(data), 15 + 32).decode('utf-8')
# Save the decompressed text as an XML file
with open('zx-1.xml', 'w', encoding='utf-8') as f:
    f.write(decode)
The contents of the resulting XML file are as follows:
Pleasantly surprised by the result? Using the XML parsing described above, we can now get the data we want.
Extract the data
The specific code is as follows:
import xml.dom.minidom

DOMTree = xml.dom.minidom.parse('zx-1.xml')
collection = DOMTree.documentElement
# Every bullet comment lives in an <entry> element
entrys = collection.getElementsByTagName('entry')
for entry in entrys:
    # The comment text is inside the <content> tag
    content = entry.getElementsByTagName('content')[0].childNodes[0].data
    print(content)
The running result is as follows:
By now the approach to analyzing the page and getting the data should be clear.
Let's go back to the beginning and put it all together: construct the bullet-comment URLs, send a request to each one, take the binary response, decompress it and save it as an XML file, and finally extract the bullet comments from those files.
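The steps just listed can be sketched as one small pipeline. This is only an outline under stated assumptions: it skips the save-to-file step, the function names are hypothetical, and the network call is passed in as a fetch argument so the sketch can run offline with a fake response. The payload built by fake_fetch is invented; only the entry/content nesting is taken from the extraction code in this article.

```python
import zlib
from xml.dom.minidom import parseString


def build_urls(first=1, last=10):
    """URLs for segments first..last of the episode used in this article."""
    base = "https://cmts.iqiyi.com/bullet/54/00/7973227714515400_300_{}.z"
    return [base.format(n) for n in range(first, last + 1)]


def decode_segment(raw):
    """Decompress one downloaded segment into XML text."""
    return zlib.decompress(raw, 15 + 32).decode("utf-8")


def extract_comments(xml_text):
    """Pull the text out of every <entry>/<content> pair."""
    root = parseString(xml_text).documentElement
    return [
        entry.getElementsByTagName("content")[0].childNodes[0].data
        for entry in root.getElementsByTagName("entry")
    ]


def collect_comments(urls, fetch):
    """Run the pipeline; fetch is any callable that maps a URL to bytes."""
    comments = []
    for url in urls:
        comments.extend(extract_comments(decode_segment(fetch(url))))
    return comments


def fake_fetch(url):
    """Offline stand-in for an HTTP request (hypothetical payload)."""
    segment = url.rsplit("_", 1)[1]  # for example "1.z"
    xml_text = f"<danmu><entry><content>segment {segment}</content></entry></danmu>"
    return zlib.compress(xml_text.encode("utf-8"))


result = collect_comments(build_urls(1, 2), fake_fetch)
print(result)
```

For a real run, fetch could be a small function that returns requests.get(url, headers=headers).content, which is the call the article's own code uses.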
Constructing the URLs
The specific code is as follows:
# Construct the URLs
def get_urls(self):
    urls = []
    for x in range(1, 11):  # segments 1 through 10
        url = f'https://cmts.iqiyi.com/bullet/54/00/7973227714515400_300_{x}.z'
        urls.append(url)
    return urls
Save the XML file
The specific code is as follows:
import zlib

import requests

# Save the XML files
def get_xml(self):
    urls = self.get_urls()
    count = 1
    for url in urls:
        content = requests.get(url, headers=self.headers).content
        decode = zlib.decompress(bytearray(content), 15 + 32).decode('utf-8')
        # 'w' rather than 'a', so a rerun does not append a second XML document to the same file
        with open(f'../data/zx-{count}.xml', 'w', encoding='utf-8') as f:
            f.write(decode)
        count += 1
Pitfalls:
1. What we are requesting is a compressed file, so the headers should look like this:
self.headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36',
    'Accept-Encoding': 'gzip, deflate'
}
Avoid the following errors:
2. Do not use Chinese characters in the names of the saved XML files; a plain ASCII name with a hyphen works best, as shown below:
zx-1
zx-2
.
.
.
zx-10
Avoid the following errors:
After saving all the XML files, comment out the crawler code for the time being, because next we need to extract the data from the above files.
Extract the data
import xml.dom.minidom

import pandas as pd

# Extract data
def parse_data(self):
    danmus = []
    for x in range(1, 11):
        DOMTree = xml.dom.minidom.parse(f'../data/zx-{x}.xml')
        collection = DOMTree.documentElement
        entrys = collection.getElementsByTagName('entry')
        for entry in entrys:
            danmu = entry.getElementsByTagName('content')[0].childNodes[0].data
            danmus.append(danmu)
    # print(danmus)
    df = pd.DataFrame({
        'barrage': danmus
    })
    return df
Here we are simply applying the XML parsing we just learned, so extracting the bullet comments should pose no real difficulty.
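One thing the extraction code takes for granted is that every content tag contains text: childNodes[0] raises an IndexError on an empty tag. The sketch below shows a more defensive variant. The sample XML is hypothetical (only the entry/content nesting comes from the code above), and I have not checked whether empty comments actually occur in the real files.

```python
from xml.dom.minidom import parseString

# Hypothetical sample: the wrapper element and the comments are made up.
SAMPLE = """<danmu>
<entry><content>first comment</content></entry>
<entry><content></content></entry>
<entry><content>second comment</content></entry>
</danmu>"""


def extract_contents(xml_text):
    """Return the text of every non-empty <content> inside an <entry>."""
    root = parseString(xml_text).documentElement
    contents = []
    for entry in root.getElementsByTagName("entry"):
        for node in entry.getElementsByTagName("content"):
            if node.firstChild is not None:  # skip empty <content> tags
                contents.append(node.firstChild.data)
    return contents


contents = extract_contents(SAMPLE)
print(contents)
```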
Save the data
# Save data
def save_data(self):
    df = self.parse_data()
    # utf-8-sig writes a byte order mark, which helps Excel detect the encoding
    df.to_csv('../data/danmu.csv', encoding='utf-8-sig', index=False)
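The utf-8-sig encoding in save_data differs from plain utf-8 only in that it writes a three-byte byte order mark (BOM) at the start of the file, which spreadsheet programs such as Excel commonly use to recognize UTF-8. The sketch below shows this with the standard csv module rather than pandas, so it runs without extra packages; the comments and the temporary path are made up for illustration.

```python
import csv
import os
import tempfile

comments = ["太好看了", "first!"]  # made-up sample comments

path = os.path.join(tempfile.mkdtemp(), "danmu.csv")
with open(path, "w", encoding="utf-8-sig", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["barrage"])
    writer.writerows([c] for c in comments)

# The file starts with the UTF-8 byte order mark.
with open(path, "rb") as f:
    head = f.read(3)
print(head == b"\xef\xbb\xbf")

# Reading with utf-8-sig strips the BOM again.
with open(path, encoding="utf-8-sig", newline="") as f:
    rows = list(csv.reader(f))
print(rows)
```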
Word cloud of the comments
Note that this is only the first episode, and it already has more than 2,000 bullet comments, so the show is clearly quite popular.
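A word cloud is drawn from frequency counts. As a rough sketch of that underlying step, the snippet below counts how often each whole comment appears. The sample comments are made up, and a real word cloud would normally split comments into words first (Chinese text needs a tokenizer for that), which this sketch does not attempt.

```python
from collections import Counter

# Made-up sample; in practice this would be the 'barrage' column of danmu.csv.
comments = ["哈哈哈", "first", "哈哈哈", "so good", "first", "哈哈哈", "  "]

# Count identical comments, ignoring blank ones.
counts = Counter(c.strip() for c in comments if c.strip())
top = counts.most_common(2)
print(top)
```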
Finally
Nothing is accomplished overnight; that is true of life, and it is true of learning.
So what use is a three-day or seven-day crash course?
Only persistence leads to success!
Biting books says:
I typed out every word of this article with care, hoping to live up to everyone who follows me. Click “like” at the end of the article to let me know that you are working hard on your studies too.
The way ahead is long and has no ending, yet high and low I'll search with my will unbending.
I am book-learning, someone who concentrates on learning. The more you know, the more you realize you don't know. See you next time for more exciting content!