This is the third day of my participation in the More Text Challenge.

Investigating why Scrapy gets killed

When Scrapy crawls a site, the process gets killed by the system after running for a certain amount of time.

Why?

The crawler always dies after running for roughly the same amount of time, which rules out network or system problems. The only remaining possibility is that memory accumulates during the crawl until it overflows and the system kills the process. It is worth mentioning that when the process is killed this way, the log does not record a reason for stopping, so the cause of death is not obvious; the only hint is a "Killed" message printed to the console (if the crawler is not running in the background).

Checking the system log

egrep -i 'killed process' /var/log/syslog

The matching log entries report, among other fields:

total-vm: the total virtual memory used by the process.
anon-rss: the anonymous (non file-backed) physical memory the process occupies.
file-rss: the file-backed physical memory mapped by the process.

The OOM Killer (Out-of-Memory Killer) is a self-protection mechanism of the Linux kernel that prevents memory exhaustion from bringing down the whole system. According to oom_kill.c in the kernel source, the kernel computes an oom_score for each process from factors such as the memory the process occupies, how long it has been running, its priority, whether it runs as root, the number and memory usage of its child processes, and the user-controlled oom_adj parameter. The higher the score, the more likely the kernel is to kill that process first.
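To see how the kernel currently rates a process, its oom_score can be read from procfs; a minimal sketch (Linux only; the /proc path is the standard procfs entry, everything else is illustrative):

import os

def oom_score(pid=None):
    # Read the kernel's current OOM score for a process; a higher value means the
    # OOM killer is more likely to pick this process first.
    pid = pid or os.getpid()
    with open(f"/proc/{pid}/oom_score") as f:
        return int(f.read().strip())

print("oom_score of this process:", oom_score())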

As the syslog entries show, the process was killed because it ran out of memory.

What causes Scrapy to run out of memory?

Scrapy is an asynchronous framework, so running out of memory may sound strange. In practice, though, certain usage patterns and settings do cause memory to pile up, as the official documentation points out:

Elements that can cause memory overflow

lxml eats memory

Scrapy's parsing is built on top of lxml, and some lxml operations consume a lot of memory when parsing a document. The official documentation warns that if you keep references to parsed elements (Selectors) when writing XPath extraction code, the whole parsed tree is kept in memory, which inflates memory usage. Writing code that holds on to such references is therefore not recommended.
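A minimal sketch of the pattern being warned about, with a made-up spider, URL and XPath expressions purely for illustration: passing a Selector between callbacks via meta keeps the whole parsed tree alive, while extracting plain strings before yielding lets lxml free the tree.

import scrapy

class LeakyExampleSpider(scrapy.Spider):
    # Hypothetical spider: the name and selectors are illustrative only.
    name = "leaky_example"

    def parse(self, response):
        for row in response.xpath('//li[@class="item"]'):
            yield scrapy.Request(
                response.urljoin(row.xpath('./a/@href').get()),
                callback=self.parse_detail,
                meta={"row": row},  # keeps the whole parsed tree alive until this request finishes
            )

    def parse_detail(self, response):
        row = response.meta["row"]
        yield {"title": row.xpath('./a/text()').get()}

class LeanExampleSpider(scrapy.Spider):
    # Same crawl, but only plain strings are carried between callbacks,
    # so the parsed tree can be garbage-collected as soon as parse() returns.
    name = "lean_example"

    def parse(self, response):
        for row in response.xpath('//li[@class="item"]'):
            yield scrapy.Request(
                response.urljoin(row.xpath('./a/@href').get()),
                callback=self.parse_detail,
                cb_kwargs={"title": row.xpath('./a/text()').get()},
            )

    def parse_detail(self, response, title):
        yield {"title": title}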

How to view Scrapy memory usage

Using telnet

telnet localhost 6023
Username: scrapy
Password: printed in the Scrapy log at startup (or set via TELNETCONSOLE_PASSWORD)
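If you prefer a fixed login instead of the auto-generated password, the telnet console can be configured in settings.py; a minimal sketch using Scrapy's standard telnet settings (the password value is only an example):

# settings.py
TELNETCONSOLE_ENABLED = True           # the telnet console extension is enabled by default
TELNETCONSOLE_PORT = [6023, 6073]      # default port range; the first free port is used
TELNETCONSOLE_USERNAME = "scrapy"      # default username
TELNETCONSOLE_PASSWORD = "change-me"   # if unset, a random password is printed in the log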

Check Scrapy's live object references using prefs()

As the figure above shows, after the crawler had run for more than 2,000 seconds, 1,318 Request objects and more than 600 Response objects had accumulated and were still waiting to be processed.
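prefs() in the telnet console prints Scrapy's live-reference report; the same report can also be produced from spider or extension code via the trackref helpers, roughly like this (a sketch, not taken from the original article):

from scrapy.utils.trackref import print_live_refs, get_oldest

print_live_refs()                        # per-class counts of live Request/Response/Item/Selector objects
oldest_request = get_oldest("Request")   # the longest-lived Request still in memory, if any
if oldest_request is not None:
    print(oldest_request.url)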

Use Muppy to view memory usage

__author__ = 'Laughing Pudding'
# First, install Pympler:  pip install Pympler
from pympler import muppy
from pympler import summary

all_objects = muppy.get_objects()        # snapshot of every object the GC currently tracks
suml = summary.summarize(all_objects)    # aggregate the snapshot by object type

len(all_objects)        # number of live objects in the snapshot
summary.print_(suml)    # print per-type counts and total sizes

As the summary shows, after the problem crawler had run for 600 s, bytes objects alone had skyrocketed to more than 500 MB, which is why the crawler dies after running for a while.
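To confirm which object types keep growing, two muppy snapshots taken some time apart can be compared; a minimal sketch using Pympler's summary.get_diff:

from pympler import muppy, summary

before = summary.summarize(muppy.get_objects())
# ... let the crawler run for a while, then take a second snapshot ...
after = summary.summarize(muppy.get_objects())
summary.print_(summary.get_diff(before, after))   # per-type growth between the two snapshots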

The solution

Measures

  1. Reduce the number of references kept to parsed XPath/Selector elements (see the spider sketch earlier)
  2. Adjust the order in which requests are yielded and scheduled (see the settings sketch after this list)
  3. Reduce the number of crawlers running at the same time
  4. Check the conditions in your code for faulty logic that could trap the spider in an endless loop
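For measures 2 and 3, the usual knobs are standard Scrapy settings; a minimal sketch with example values only (they need tuning per project):

# settings.py
CONCURRENT_REQUESTS = 8                  # default is 16; fewer in-flight requests and responses
CONCURRENT_REQUESTS_PER_DOMAIN = 4
# Crawl breadth-first instead of the default depth-first order, so early requests
# and their responses do not pile up in memory:
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = "scrapy.squeues.PickleFifoDiskQueue"
SCHEDULER_MEMORY_QUEUE = "scrapy.squeues.FifoMemoryQueue"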

Results

After these changes, bytes objects had dropped to 94.55 MB at the 600 s mark.

Reference documentation

Scrapy documentation: Debugging memory leaks (https://docs.scrapy.org/en/latest/topics/leaks.html)