Preface

I recently ran into some minor problems in my working code, which slowed down my updates. Today I would like to share those problems with you and walk through them in a hands-on project. I hope that when you meet similar problems in the future, you will remember this article and solve them.

Today's topic is parsing XML files.

What is XML

XML stands for Extensible Markup Language. It is a subset of Standard Generalized Markup Language (SGML) and is used to mark up electronic documents so that they have structure. XML is designed to transfer and store data. It is a set of rules for defining semantic markup that divides a document into parts and identifies them. It is also a meta-markup language: a syntax for defining other domain-specific, semantic, structured markup languages.

Parsing XML in Python

Python offers two common interfaces for XML, SAX and DOM. They process XML in different ways and suit different scenarios.

  • SAX (Simple API for XML)

The Python standard library includes a SAX parser, which uses an event-driven model to process XML files by firing events and calling user-defined callbacks as XML is parsed.

  • DOM (Document Object Model)

XML data is parsed in memory into a tree, and XML is manipulated by manipulating the tree.
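
To make the event-driven model concrete, here is a minimal SAX sketch. The TitleHandler class and the inline SAMPLE document are made up for illustration; only xml.sax.ContentHandler and xml.sax.parseString come from the standard library.

```python
import xml.sax


class TitleHandler(xml.sax.ContentHandler):
    """Collects the title attribute of every <movie> start tag."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def startElement(self, name, attrs):
        # The parser calls this once for each opening tag it meets
        if name == 'movie':
            self.titles.append(attrs.get('title'))


# Hypothetical sample document, kept inline so the sketch is self-contained
SAMPLE = b'<collection><movie title="Enemy Behind"/><movie title="Ishtar"/></collection>'

handler = TitleHandler()
xml.sax.parseString(SAMPLE, handler)
print(handler.titles)
```

The parser never builds a tree here. It only reports each tag as it passes, which is why SAX suits large files, while DOM is more convenient when the whole document fits in memory.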

The XML file used in this article is movies.xml, with the following contents:

<collection shelf="New Arrivals">
<movie title="Enemy Behind">
   <type>War, Thriller</type>
   <format>DVD</format>
   <year>2003</year>
   <rating>PG</rating>
   <stars>10</stars>
   <description>Talk about a US-Japan war</description>
</movie>
<movie title="Transformers">
   <type>Anime, Science Fiction</type>
   <format>DVD</format>
   <year>1989</year>
   <rating>R</rating>
   <stars>8</stars>
   <description>A schientific fiction</description>
</movie>
   <movie title="Trigun">
   <type>Anime, Action</type>
   <format>DVD</format>
   <episodes>4</episodes>
   <rating>PG</rating>
   <stars>10</stars>
   <description>Vash the Stampede!</description>
</movie>
<movie title="Ishtar">
   <type>Comedy</type>
   <format>VHS</format>
   <rating>PG</rating>
   <stars>2</stars>
   <description>Viewable boredom</description>
</movie>
</collection>

The most common approach at the moment is to parse with the DOM module, so that is what we use here.

Example: parsing XML in Python

import xml.dom.minidom


# Open the XML document with the minidom parser
DOMTree = xml.dom.minidom.parse('movies.xml')   # returns a Document object
collection = DOMTree.documentElement    # the root element, <collection>
# print(collection)
if collection.hasAttribute('shelf'):
    print('Root element : %s' % collection.getAttribute('shelf'))


# Get all movies in the collection
movies = collection.getElementsByTagName('movie')   # list of all <movie> elements
# print(movies)
for movie in movies:
    print('*******movie******')
    if movie.hasAttribute('title'):
        print('Title: %s' % movie.getAttribute('title'))
    # Named movie_type/movie_format so the built-ins type and format are not shadowed
    movie_type = movie.getElementsByTagName('type')[0]
    print('Type: %s' % movie_type.childNodes[0].data)   # text content of the element
    movie_format = movie.getElementsByTagName('format')[0]
    print('format: %s' % movie_format.childNodes[0].data)
    rating = movie.getElementsByTagName('rating')[0]
    print('rating: %s' % rating.childNodes[0].data)
    description = movie.getElementsByTagName('description')[0]
    print('description: %s' % description.childNodes[0].data)
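
Not every movie in movies.xml has every child element (Trigun and Ishtar have no year, for example), and indexing [0] on a missing tag raises IndexError. The following is a small sketch of one way to read optional elements. The child_text helper and the shortened SAMPLE document are made up for illustration.

```python
import xml.dom.minidom

# Shortened, hypothetical version of movies.xml
SAMPLE = """<collection shelf="New Arrivals">
<movie title="Enemy Behind"><year>2003</year></movie>
<movie title="Ishtar"><format>VHS</format></movie>
</collection>"""


def child_text(node, tag, default='unknown'):
    """Return the text of the first <tag> under node, or default if it is missing or empty."""
    matches = node.getElementsByTagName(tag)
    if matches and matches[0].firstChild is not None:
        return matches[0].firstChild.data
    return default


root = xml.dom.minidom.parseString(SAMPLE).documentElement
years = {m.getAttribute('title'): child_text(m, 'year')
         for m in root.getElementsByTagName('movie')}
print(years)
```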



iQIYI bullet comments

A new show called “Tausiness” came out recently, and you have probably seen it. Today's hands-on project is to scrape the bullet comments (danmu) sent by its viewers, and I will share the problems I ran into while crawling.

Analyzing the web page

Bullet comments generally do not appear in the page's HTML source, so a reasonable first guess is that they are loaded asynchronously.

First open developer tools -> Network -> XHR

Look for a URL like the one shown above. The only part we need is /54/00/7973227714515400.

iQIYI’s bullet-comment address is built as follows:

https://cmts.iqiyi.com/bullet/{parameter 1}_300_{parameter 2}.z

Parameter 1 is 54/00/7973227714515400 (the path found above, without the leading slash)

Parameter 2 is the segment number: 1, 2, 3, ...

iQIYI loads a new batch of bullet comments every 5 minutes, which would explain the 300 (seconds) in the URL. Each episode lasts about 46 minutes, so one episode needs 10 links:

https://cmts.iqiyi.com/bullet/54/00/7973227714515400_300_1.z
https://cmts.iqiyi.com/bullet/54/00/7973227714515400_300_2.z
https://cmts.iqiyi.com/bullet/54/00/7973227714515400_300_3.z
.
.
.
https://cmts.iqiyi.com/bullet/54/00/7973227714515400_300_10.z
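
The count of 10 follows from rounding 46 minutes up to whole 5-minute segments. Here is a minimal sketch of that arithmetic. The segment_urls function is made up for illustration, and treating the 300 in the URL as the segment length in seconds is an assumption based on the 5-minute interval.

```python
import math

SEGMENT_SECONDS = 300  # assumed meaning of the "_300_" part of the URL


def segment_urls(path, episode_minutes):
    """Build one URL per 5-minute segment of an episode."""
    # 46 minutes -> 2760 seconds -> 9.2 segments, rounded up to 10
    segments = math.ceil(episode_minutes * 60 / SEGMENT_SECONDS)
    return [
        f'https://cmts.iqiyi.com/bullet/{path}_{SEGMENT_SECONDS}_{i}.z'
        for i in range(1, segments + 1)
    ]


urls = segment_urls('54/00/7973227714515400', 46)
print(len(urls))
print(urls[0])
print(urls[-1])
```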

Decoding the data

If you paste one of the URLs above into your browser, it downloads a compressed file with a .z extension. Windows cannot open this file directly, so we decompress it with Python first.

First, a brief word on zlib: it is a library for compressing and decompressing data streams, and Python ships a zlib module in its standard library.

So we can use it to decompress the downloaded file.
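
Before touching the real file, a round trip on made-up data shows what zlib does. The sample text below is hypothetical. The second argument to zlib.decompress is wbits, and 15 + 32 tells zlib to accept either a zlib or a gzip header automatically.

```python
import gzip
import zlib

# Hypothetical payload standing in for a real bullet-comment segment
text = '<danmu><entry><content>hello</content></entry></danmu>'
raw = text.encode('utf-8')

zlib_blob = zlib.compress(raw)   # zlib-wrapped stream
gzip_blob = gzip.compress(raw)   # gzip-wrapped stream

restored = []
for blob in (zlib_blob, gzip_blob):
    # wbits = 15 + 32: maximum window size, header type detected automatically
    restored.append(zlib.decompress(blob, 15 + 32).decode('utf-8'))

print(restored[0] == text, restored[1] == text)
```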

The file has to be read in binary mode first and then decompressed.

Take the file I just downloaded as an example.

The code is as follows:

import zlib


# Read the downloaded file as raw bytes
with open('7973227714515400_300_1.z', 'rb') as f:
    data = f.read()

# wbits = 15 + 32 lets zlib auto-detect a zlib or gzip header
decode = zlib.decompress(data, 15 + 32).decode('utf-8')
print(decode)

The running result is as follows:

You may have noticed that this data looks a lot like XML, so let's add two more lines of code and save it as an XML file.

The specific code is as follows:

import zlib


# Read the downloaded file as raw bytes
with open('7973227714515400_300_1.z', 'rb') as f:
    data = f.read()

# Decompress, then write the decoded text out as an XML file
decode = zlib.decompress(data, 15 + 32).decode('utf-8')
with open('zx-1.xml', 'w', encoding='utf-8') as f:
    f.write(decode)

The resulting XML file contents are as follows:

A pleasant surprise, isn't it? With the parsing approach described earlier, we can now pull out the data we want.

Extract the data

The specific code is as follows:

import xml.dom.minidom


DOMTree = xml.dom.minidom.parse('zx-1.xml')
collection = DOMTree.documentElement
entries = collection.getElementsByTagName('entry')   # every <entry> element
for entry in entries:
    # The comment text is the first child node of the <content> element
    content = entry.getElementsByTagName('content')[0].childNodes[0].data
    print(content)
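
If you do not need the intermediate file, the decompressed text can also be parsed in memory with xml.dom.minidom.parseString. This is only a sketch: the SAMPLE document is made up, and it assumes the real data uses the same entry and content tags as above. It also skips empty content elements, which would otherwise raise IndexError on childNodes[0].

```python
import xml.dom.minidom

# Hypothetical segment with one empty <content> element
SAMPLE = (
    '<danmu>'
    '<entry><content>first</content></entry>'
    '<entry><content></content></entry>'
    '<entry><content>second</content></entry>'
    '</danmu>'
)


def extract_contents(xml_text):
    """Return the text of every non-empty <content> inside an <entry>."""
    dom = xml.dom.minidom.parseString(xml_text)
    contents = []
    for entry in dom.documentElement.getElementsByTagName('entry'):
        nodes = entry.getElementsByTagName('content')
        # firstChild is None when the element has no text
        if nodes and nodes[0].firstChild is not None:
            contents.append(nodes[0].firstChild.data)
    return contents


comments = extract_contents(SAMPLE)
print(comments)
```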

The running result is as follows:

By now the page analysis and the approach to getting the data should be clear.

Let's go back to the start and put it together: construct the bullet-comment URLs, request each one, take the binary response, decompress it and save it as an XML file, and finally extract the bullet comments from the files.

Constructing the URLs

The specific code is as follows:

    # Construct the URLs
    def get_urls(self):
        urls = []
        for x in range(1, 11):   # segments 1 to 10
            url = f'https://cmts.iqiyi.com/bullet/54/00/7973227714515400_300_{x}.z'
            urls.append(url)
        return urls

Save the XML file

The specific code is as follows:

    # Save the XML files (needs `import requests` and `import zlib` at the top of the module)
    def get_xml(self):
        urls = self.get_urls()
        count = 1
        for url in urls:
            content = requests.get(url, headers=self.headers).content
            decode = zlib.decompress(content, 15 + 32).decode('utf-8')
            # 'w' rather than 'a': appending on a re-run would leave two XML documents in one file
            with open(f'../data/zx-{count}.xml', 'w', encoding='utf-8') as f:
                f.write(decode)
            count += 1
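
If a request returns something other than compressed data (an error page, for instance), zlib.decompress raises zlib.error. Here is a small sketch of a decoding helper that turns that into a clearer message. The decode_segment function and both payloads are made up for illustration.

```python
import zlib


def decode_segment(payload: bytes) -> str:
    """Decompress one downloaded segment and return it as text."""
    try:
        return zlib.decompress(payload, 15 + 32).decode('utf-8')
    except zlib.error as exc:
        # Raised when the bytes are not a zlib or gzip stream
        raise ValueError(f'payload is not zlib/gzip data: {exc}') from exc


# A valid, hypothetical payload
ok = decode_segment(zlib.compress('<danmu/>'.encode('utf-8')))

# An invalid payload, such as an HTML error page
failure = None
try:
    decode_segment(b'<html>not found</html>')
except ValueError as err:
    failure = str(err)

print(ok)
print(failure)
```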

Pitfalls:

1. What you are requesting is a compressed file, so your headers should look like this:

self.headers = {
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36',
            'Accept-Encoding': 'gzip, deflate'
        }

This avoids the following error:

2. Do not give the saved XML files Chinese names. Use an ASCII name, ideally with a hyphen before the number, as shown below:

zx-1
zx-2
.
.
.
zx-10

This avoids the following error:

Once all the XML files are saved, comment out the crawler code for now, because the next step extracts the data from these files.

Extract the data

    # Extract data (needs `import xml.dom.minidom` and `import pandas as pd`)
    def parse_data(self):
        danmus = []

        for x in range(1, 11):   # files zx-1.xml to zx-10.xml
            DOMTree = xml.dom.minidom.parse(f'../data/zx-{x}.xml')
            collection = DOMTree.documentElement
            entries = collection.getElementsByTagName('entry')
            for entry in entries:
                danmu = entry.getElementsByTagName('content')[0].childNodes[0].data
                danmus.append(danmu)
        # print(danmus)
        df = pd.DataFrame({
            'barrage': danmus
        })
        return df

Here we simply reuse the XML parsing we just learned, so extracting the bullet comments is straightforward.

Save the data

    # Save data
    def save_data(self):
        df = self.parse_data()
        # utf-8-sig writes a byte order mark so the Chinese text opens correctly in Excel
        df.to_csv('../data/danmu.csv', encoding='utf-8-sig', index=False)
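
The utf-8-sig encoding differs from plain utf-8 in one way: it writes a three-byte byte order mark at the start of the file, which spreadsheet programs such as Excel use to recognize the file as UTF-8. The sketch below shows this with the standard csv module instead of pandas. The sample comments and the temporary path are made up.

```python
import csv
import os
import tempfile

# Hypothetical comments standing in for the scraped data
comments = ['太好看了', 'hello']

path = os.path.join(tempfile.mkdtemp(), 'danmu.csv')
with open(path, 'w', encoding='utf-8-sig', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['barrage'])                 # header row
    writer.writerows([c] for c in comments)      # one comment per row

# The file starts with the UTF-8 byte order mark EF BB BF
with open(path, 'rb') as f:
    head = f.read(3)

# Reading back with utf-8-sig strips the mark again
with open(path, encoding='utf-8-sig', newline='') as f:
    rows = list(csv.reader(f))

print(head)
print(rows)
```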

Word cloud of the comments

Note that this is only the first episode, and it already has more than 2,000 bullet comments, so the show is clearly popular.
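
A word cloud is drawn from frequencies, so the step underneath it is counting. Below is a minimal sketch of that counting step only, using made-up comments. Real Chinese comments would first need word segmentation, and drawing the cloud needs a separate rendering library. Neither is shown here.

```python
from collections import Counter

# Hypothetical comments standing in for the 'barrage' column of danmu.csv
comments = ['good show', 'haha', 'good show', 'so funny', 'haha', 'good show']

# Count how often each comment appears and keep the two most frequent
frequencies = Counter(comments)
top = frequencies.most_common(2)
print(top)
```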

Finally

Nothing is accomplished overnight. That is true of life, and it is true of learning.

So what good is a three-day or seven-day crash course?

Only persistence leads to success!

Biting books says:

I put my heart into every word of this article, and I only hope to live up to everyone who follows me. Click “like” at the end of the article to let me know that you are working hard at your studies too.

The way ahead is so long without ending, yet high and low I’ll search with my will unbending.

I am book-learning, a person who concentrates on learning. The more you know, the more you realize you don’t know. See you next time for more exciting content!