GNE: News web page general text extractor updated version 0.2.1, greatly improving the speed of text extraction. While developing this version, I encountered a very strange Bug that turned out to be caused by garbage collection and memory reuse. Today we’re going to look at that question.
The problem background
Let’s start with some code:
This code reads the HTML code in tests/163/9.html and retrieves the text of all tags within all tags below . It’s a little tricky to say, but let me give you an example.
<body>
<div>
<a href="/xx">hello</a>
</div>
<h2>
<a>The world</a>
</h2>
</body>
Copy the code
Get the text in the tag below the
tag, respectively, which is hello and the world.
The problem with this code, however, is that tags with nested structures are repeatedly extracted. Such as:
<body>
<div>
<h2>
<a href="/xx">hello</a>
</h2>
</div>
</body>
Copy the code
First, get the tag below the
tag where you are. However, when you get the
tag below the < H2 > tag, you still get the same
tag.
In this case, lines 15-20 are repeated twice in the code above.
In order to improve the efficiency of the code, we introduce a cache to record the analysis result of each tag. If a tag is found to have been analyzed, we directly use the cached result to avoid repeated analysis.
So the code is changed to look like this:
STR (element) on line 18 corresponds to the memory address of this node, as shown below:
This code looks fine, but when the data is actually extracted, it turns out that the results are not normal.
Schrodinger’s Element
To debug this problem, I made a few changes to the code:
You can see that the previously cached result of the same HTML tag is not the same as the newly extracted result.
So I wanted to see which element was being extracted each time, but something even weirder happened. We made a change that didn’t seem to have any effect on the code:
In Figure 4, we cache element_text_list directly. In Figure 5, we cache [element_text_list, element], and when we read, we read the elements of the list with subscript 0. In other words, the cached element is not used at all.
But then a strange thing happened. The problem disappeared! In figure 4, the same label was printed in large numbers, and the cached data did not match the extracted data! In Figure 5, none of them are printed. After this modification, the result of GNE extraction is correct.
But why does this happen? Does it have something to do with the cache results? So let’s change the element in the list to something else:
Just by changing element to the number 1, the Bug appears again.
It seemed to know that I was trying to observe it, and when I tried to observe Element in code, it worked fine. When I don’t watch it, it goes wrong. Schrodinger’s element.
The invisible hand
Up in the air quantum mechanics. This question has nothing to do with quantum mechanics. The reason for this weird situation is a mechanism that has always been running in Python, but that you often ignore — garbage collection.
Python frees up memory by cleaning up objects that are no longer used. When we execute a for loop:
for element in element_list:
a = element.xpath('//xxx')
b = element.xpath('.//text()')
c = 1 + 1
Copy the code
On the first execution of the loop, the first Element object is generated, but the object is overwritten by the new Element object on the second execution of the loop. Because there is no other place to continue using the first Element object, its reference count goes to zero, and Python’s garbage collection mechanism cleans it up. The memory it occupies will also be freed up.
But if you write it this way:
cache = []
for element in element_list:
a = element.xpath('//xxx')
b = element.xpath('.//text()')
c = 1 + 1
cache.append(element)
Copy the code
Because the list cache contains references to each Element object, the reference count for the element object generated in the first loop is not zero. The garbage collection mechanism does not collect it, and it always occupies an area of memory. This area is not used by other data. Each time through the loop, a new Element object allocates a new memory area for its data, which is equivalent to a different memory location for each element node.
In the example code, notice the line element_flag = STR (Element), which has a value similar to
, where the hexadecimal number 0x1087ba638 corresponds to the object’s memory address.
At first, I had an incorrect assumption that THE value of STR (Element) corresponds to each node in the HTML. If you execute it multiple times on the same node, the result will be the same. If you execute it multiple times on different nodes, the result will be different.
But that’s actually not true. Because if the memory area of the previous node is garbage collected, the area will be reallocated, and the new node may happen to be placed there, resulting in two different tags, both of which will print the same result when you execute STR (element). But actually their text is different.
When I use element_text_cache[element_flag] = [element_text_list, element], I don’t overwrite each other because each Element object is not recycled. So it works as expected.
To solve the problem
So, the root cause of the bug is that INSTEAD of using STR (Element) as the cache Key, I should have found something that corresponds to the HTML node. Obviously, XPath is better.
So, modify the code to change element_flag to XPath:
The problem was solved.