Make writing a habit together! This is the fifth day of my participation in the Juejin "Daily New Plan · April Update Challenge".

Practice makes perfect. That applies to technology in general, but as a beginner, a firm grasp of the technology has to come from your own daily practice. I am Dream Eraser; I hope we meet at the top someday.

Written up front

After reading the title, if you are wondering what an incremental crawler is, congratulations, you will get something out of this post. If you have no questions at all, impressive, you are already a master ~

An incremental crawler crawls content incrementally, where "increment" means an increase in quantity: after our crawler has finished with a batch of URLs, some sites will add a new batch of data on top of the original, for example Huxiu headline recommendations, chapter updates on fiction sites, and so on. Any site that updates dynamically is a good fit for incremental crawling.

This gives us a simple definition of an incremental crawler: a crawler that crawls again on top of the results of previous crawls.

Time to code

Next, we will monitor a site with a crawler and crawl it incrementally as the site updates.

The target site is the Sogou WeChat search platform: https://weixin.sogou.com/, where you can track trending news.

The core of an incremental crawler is deduplication

There are three places where deduplication can happen:

  1. Before initiating a request, determine whether the URL has already been requested
  2. After parsing the content, determine whether it has already been crawled before
  3. When storing, determine whether the content already exists in the database

The scenarios where each of the three deduplication strategies applies are also easy to determine:

  1. Check before making the request. This is the most common case, such as news and fiction updates; as long as new links appear on the page, this works (a minimal sketch follows this list)
  2. The second applies to pages where the data is refreshed in place and no new links are added
  3. The third is the last line of defense, a final check before writing to the database
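
For the first approach, a minimal sketch might look like the following (assuming a local Redis instance; the set key crawled_urls and the function name crawl_if_new are my own placeholders, not part of the code later in this post):

import redis
import requests

conn = redis.Redis(host='127.0.0.1', port=6379)
headers = {"User-Agent": "Mozilla/5.0"}

def crawl_if_new(url):
    # sadd returns 1 if the URL is new to the set, 0 if it was already there
    if conn.sadd('crawled_urls', url) == 0:
        return None  # Already requested before, skip it
    res = requests.get(url, headers=headers)
    res.encoding = 'utf-8'
    return res.text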

The deduplication technique

This post stores the data produced during the crawl (the article titles, as you will see below), and the core is the set type inside Redis.

Using Redis from Python is a well-worn topic and easy to pick back up with a little memory refresh. Note that the redis library must be installed first (pip install redis).

sadd inserts an element into a set and returns 1 on success (the element is new) or 0 on failure (the element is already in the set). We can build the incremental-crawl check on this.

import redis

def redis_conn(title):
    conn = redis.Redis(host='127.0.0.1', port=6379)
    return conn.sadd('1', title)
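As a quick sanity check, calling sadd twice with the same title shows the 1/0 return values described above (assuming a local Redis instance and the same set key '1' as in the snippet):

import redis

conn = redis.Redis(host='127.0.0.1', port=6379)
print(conn.sadd('1', 'Some headline'))  # 1: first insert succeeds
print(conn.sadd('1', 'Some headline'))  # 0: duplicate, the set already contains it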

Complete code presentation

The core of this post is understanding the incremental crawler; there is not much new in the code, so I will show it directly and add some notes in the comments. Note that the code below checks for duplicates via the title field. Strictly speaking, the title should be encoded first, since comparing raw Chinese text directly is not very professional (see the short hashing sketch after the full code).

import requests
from lxml import etree
import redis
import time

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36"
}
# The link to crawl
# https://weixin.sogou.com/

def run():
    while 1:  # Infinite loop; the interval is set inside the loop
        url = "https://weixin.sogou.com/"
        res = requests.get(url, headers=headers)
        res.encoding = 'utf-8'
        html = res.text  # Get the page content
        parse(html)
        time.sleep(300)  # 5*60 = 300s, i.e. fetch again every 5 minutes
        print("Get again...")

def parse(html):
    html_element = etree.HTML(html)
    result = html_element.xpath('//h3//a[@uigs]')
    for item in result:
        href = item.get("href")
        title = item.text
        ex = redis_conn(title)  # Check whether this item has been crawled before
        if ex == 1:
            print(f"Crawling {href}")

def redis_conn(title):
    # Redis connection
    conn = redis.Redis(host='127.0.0.1', port=6379)
    return conn.sadd('1', title)  # Add the title to the set and deduplicate by title

if __name__ == "__main__":
    run()
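As mentioned above, a cleaner variation is to hash the title before storing it, so the set holds fixed-length ASCII digests instead of raw Chinese text. This is only a sketch of that idea, not part of the original code:

import hashlib
import redis

def redis_conn(title):
    conn = redis.Redis(host='127.0.0.1', port=6379)
    # Store the md5 digest of the title instead of the title itself
    key = hashlib.md5(title.encode('utf-8')).hexdigest()
    return conn.sadd('1', key)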

The results

Finally, run the project and wait for the data to roll in. I did not grab the inner (detail) pages here; if you need them, you can add a function for that (a hypothetical sketch follows), and of course the requests and lxml libraries can both be swapped out for others.
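
If you do want the inner pages, a hypothetical helper along these lines could be called right after the dedup check in parse. It reuses the requests, etree, and headers from the complete code above, and the XPath is only a placeholder that would need to match the real detail page:

def parse_detail(href):
    # Hypothetical helper: fetch a detail page and pull out its text
    res = requests.get(href, headers=headers)
    res.encoding = 'utf-8'
    element = etree.HTML(res.text)
    # Placeholder XPath; adjust it to the actual page structure
    paragraphs = element.xpath('//p//text()')
    return "".join(paragraphs)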

Written at the end

Hopefully this post helped you get the concept of an incremental crawler. Technically this post is not difficult, so keep working at it.