With a few simple scripts, you can easily clean up documents and other large HTML files. But first you need to parse them.

As a longtime member of the Scribus documentation team, I keep abreast of the latest source code updates so I can update and complement the documentation. When I recently did a fresh checkout with Subversion on a Fedora 27 system, I was surprised at how long it took to download the documentation, which consists of HTML pages and associated images. The documentation seemed out of proportion to the project itself, and I suspected that some of it was “zombie” documentation: HTML files that are no longer in use and images that are no longer referenced from any HTML.

I decided to take this on as a small project. One approach is to look for image files that are no longer in use: if I could collect every image reference from the HTML files and then compare that list against the image files actually present, the mismatches should stand out.

This is a typical image tag:

<img src="images/edit_shapes.png" ALT="Edit examples" ALIGN=left>Copy the code

I’m interested in the part between the first set of double quotes, following src=. After looking at some possible solutions, I found a Python module called BeautifulSoup. The core of the script looks like this:

from bs4 import BeautifulSoup

soup = BeautifulSoup(all_text, 'html.parser')
match = soup.findAll("img")
if len(match) > 0:
    for m in match:
        imagelist.append(str(m))

We can use the findAll method to dig up image tags. Here’s a little bit of output:

<img src="images/pdf-form-ht3.png"/><img src="images/pdf-form-ht4.png"/><img src="images/pdf-form-ht5.png"/><img src="images/pdf-form-ht6.png"/><img align="middle" alt="GSview - Advanced Options Panel" src="images/gsadv1.png" title="GSview - Advanced Options Panel"/><img align="middle" alt="Scribus External Tools Preferences" src="images/gsadv2.png" title="Scribus External Tools Preferences"/>Copy the code

So far so good. I thought the next step would be to carve out just the src= part, but when I tried some string methods in the script, they returned errors about these being tags rather than strings. I saved the output to a file and edited it in KWrite. One of the nice things about KWrite is that you can do “find and replace” operations with regular expressions (regex), so I could, for example, replace <img with a newline followed by <img to put each tag on its own line.
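In hindsight, BeautifulSoup can also hand back just the attribute value rather than the whole tag, which would have sidestepped the tag-versus-string errors. A minimal sketch, assuming the bs4 package is installed and all_text already holds the HTML (not the route I took at the time):

from bs4 import BeautifulSoup

soup = BeautifulSoup(all_text, 'html.parser')
imagelist = []
for m in soup.findAll("img"):
    # get() returns the value of the src attribute as a plain string
    imagelist.append(m.get('src'))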

But I figured there had to be something better than hand-editing in KWrite, so I switched to regular expressions, or more specifically Python’s re module. The relevant parts of the new script look like this:

import re

match = re.findall(r'src="(.*)/>', all_text)
if len(match) > 0:
    for m in match:
        imagelist.append(m)

A small portion of its output looks like this:

images/cmcanvas.png" title="Context Menu for the document canvas" alt="Context Menu for the document canvas" /></td></tr></table><br images/eps-imp1.png" title="EPS preview in a file dialog" alt="EPS preview in a file dialog" images/eps-imp5.png" title="Colors imported from an EPS file" alt="Colors imported from an EPS file" images/eps-imp4.png" title="EPS font substitution" alt="EPS font substitution" images/eps-imp2.png" title="EPS import progress" alt="EPS import progress" images/eps-imp3.png" title="Bitmap conversion failure" alt="Bitmap conversion failure"Copy the code

At first glance, this looks similar to the output above, with the advantage that the beginning of each image tag is cut away, but there is a puzzling mix of table tags and other content. I think this comes down to the regex src="(.*)/> being greedy, meaning it doesn’t necessarily stop at the first instance of /> it encounters. I should add that I also tried src="(.*)" and it didn’t work any better. Not being a regex expert (I was just making this up as I went), my various attempts to improve it got nowhere.
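For the record, the standard fix for a greedy pattern is to make it non-greedy with .*?, which stops at the first closing quote. A quick sketch of what that would look like here (not something I pursued at the time):

import re

# the ? after .* makes the match non-greedy, so it ends at the first "
match = re.findall(r'src="(.*?)"', all_text)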

After a number of these sidetracked efforts, including a try with Perl’s HTML::Parser module, I finally thought back to some scripts I had written for Scribus that analyze text character by character and then take some action. The method I eventually came up with improves on all of these approaches and requires no regular expressions or HTML parser at all. Let’s go back to the img tag example shown earlier:

<img src="images/edit_shapes.png" ALT="Edit examples" ALIGN=left>Copy the code

I decided to zero in on src=. One way would be to wait for an occurrence of s, then check whether the next characters are r, c, and =. If so, that’s a match, and what sits between the following pair of double quotes is what I need. The problem with this approach is that it has to stay locked onto four characters in a row. One way of looking at a string representing a line of HTML text is:

for c in all_text:

But the logic gets too messy if you have to keep matching against the previous c, and the character before that, and the one before that, and the one before that.

Finally, I decided to zero in on = and use indexing into the string, so that I could easily refer to any earlier or later character. Here’s the search section:

    index = 3
    while index < linelength:
        if (all_text[index] == '='):
            if (all_text[index-3] == 's') and (all_text[index-2] == 'r') and (all_text[index-1] == 'c'):
                imagefound(all_text, imagelist, index)
                index += 1
            else:
                index += 1
        else:
            index += 1

I start the search at the fourth character (indexing starts at 0), so there’s no indexing error farther down, and, as it happens, there is no equals sign earlier than the fourth character of a line anyway. The first test is whether = appears in the string; if not, we move on. If we do see one, we check whether the three preceding characters are s, r, and c. If they all match, the function imagefound is called:

def imagefound(all_text, imagelist, index):
    end = 0
    index += 2
    newimage = ''
    while end == 0:
        if (all_text[index] != '"'):
            newimage = newimage + all_text[index]
            index += 1
        else:
            newimage = newimage + '\n'
            imagelist.append(newimage)
            end = 1
            return

We send the function the current index, which points at the =. We know the next character is a double quote, so we skip ahead two characters and start adding characters to a holding string named newimage until we reach the next double quote, at which point the match is complete. We add that string, plus a newline character (\n), to the list imagelist and return. Keep in mind that there may be more image tags in the remaining HTML string, so we land right back in the search loop.
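As an aside, Python’s built-in find and index string methods could do much the same scanning in fewer lines. Here is a rough sketch of the same extraction, assuming all_text and imagelist are already set up (not the version I settled on):

start = all_text.find('src="')
while start != -1:
    start += len('src="')               # move past the opening quote
    end = all_text.index('"', start)    # position of the closing quote
    imagelist.append(all_text[start:end] + '\n')
    start = all_text.find('src="', end)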

Here’s what our output looks like now:

images/text-frame-link.png
images/text-frame-unlink.png
images/gimpoptions1.png
images/gimpoptions3.png
images/gimpoptions2.png
images/fontpref3.png
images/font-subst.png
images/fontpref2.png
images/fontpref1.png
images/dtp-studio.png

Ah, much cleaner, and this only took a few seconds to run. I could have bumped the index ahead seven more spots to cut off the images/ part, but I like having it there to make sure I haven’t chopped off the first letter of an image filename, and it’s easy to trim away in KWrite; you don’t even need a regex. Once that’s done and the file is saved, the next step is to run another script I wrote, sortlist.py:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# sortlist.py

import os

imagelist = []
for line in open('/tmp/imagelist_parse4.txt').xreadlines():
    imagelist.append(line)

imagelist.sort()

outfile = open('/tmp/imagelist_parse4_sorted.txt', 'w')
outfile.writelines(imagelist)
outfile.close()

This reads the contents of the file, stores them as a list, sorts them, and then saves them as another file. After that, I can do the following:

ls /home/gregp/development/Scribus15x/doc/en/images/*.png > '/tmp/actual_images.txt'

I then need to run sortlist.py on that file as well, since ls sorts its results differently than Python does. I could have run a comparison script on the two files, but I preferred to do that step visually. In the end, I came up with 42 images that have no HTML reference in the documentation.
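For completeness, here is roughly what a programmatic comparison could look like, assuming both sorted lists are sitting in /tmp as described above (a hypothetical sketch; I did this step by eye instead):

import os

# file names referenced from the HTML, e.g. "images/fontpref1.png"
referenced = set(os.path.basename(line.strip())
                 for line in open('/tmp/imagelist_parse4_sorted.txt'))
# file names actually present on disk, from the ls output
actual = set(os.path.basename(line.strip())
             for line in open('/tmp/actual_images.txt'))

# anything on disk that is never referenced is a candidate for removal
for unused in sorted(actual - referenced):
    print(unused)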

Here’s my full parsing script:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# parseimg4.py

import os

def imagefound(all_text, imagelist, index):
    end = 0
    index += 2
    newimage = ''
    while end == 0:
        if (all_text[index] != '"'):
            newimage = newimage + all_text[index]
            index += 1
        else:
            newimage = newimage + '\n'
            imagelist.append(newimage)
            end = 1
            return

htmlnames = []
imagelist = []
tempstring = ''
filenames = os.listdir('/home/gregp/development/Scribus15x/doc/en/')
for name in filenames:
    if name.endswith('.html'):
        htmlnames.append(name)
#print htmlnames
for htmlfile in htmlnames:
    all_text = open('/home/gregp/development/Scribus15x/doc/en/' + htmlfile).read()
    linelength = len(all_text)
    index = 3
    while index < linelength:
        if (all_text[index] == '='):
            if (all_text[index-3] == 's') and (all_text[index-2] == 'r') and (all_text[index-1] == 'c'):
                imagefound(all_text, imagelist, index)
                index += 1
            else:
                index += 1
        else:
            index += 1

outfile = open('/tmp/imagelist_parse4.txt', 'w')
outfile.writelines(imagelist)
outfile.close()
imageno = len(imagelist)
print str(imageno) + " images were found and saved"

The script’s name is parseimg4.py, which doesn’t really reflect the number of scripts I wrote along the way, with both minor and major rewrites, plus discarding everything and starting over. Notice that I’ve hard-coded the directory and file names, but it would be easy enough to generalize the script and let the user enter that information. Also, since these were working scripts, I sent the output to the /tmp directory, so it disappears once the system is rebooted.
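If I wanted to generalize it, a first pass might be to take the documentation directory as a command-line argument and fall back to the hard-coded path otherwise. A hypothetical sketch:

import os
import sys

# use the directory given on the command line, if any
if len(sys.argv) > 1:
    docdir = sys.argv[1]
else:
    docdir = '/home/gregp/development/Scribus15x/doc/en/'

filenames = os.listdir(docdir)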

That’s not the end of the story, though, because the next question was: what about zombie HTML files? Any HTML files no longer in use might still reference images, and those images wouldn’t be caught by the method above. We have a menu.xml file that serves as the table of contents for the online manual, but I also had to consider that some files listed in the TOC might themselves reference files that aren’t in the TOC, and yes, I did find some.
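A rough sketch of how that cross-check might look, assuming each HTML file name appears literally somewhere in menu.xml (a hypothetical helper, not my exact script):

import os

docdir = '/home/gregp/development/Scribus15x/doc/en/'
toc = open(docdir + 'menu.xml').read()

# any HTML file not mentioned in the table of contents is a zombie candidate
for name in sorted(os.listdir(docdir)):
    if name.endswith('.html') and name not in toc:
        print(name + " is not listed in menu.xml")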

In the end, I can say that this was a simpler task than the image search, and it was helped along considerably by the process I had already worked out.