Practice makes perfect. This holds for technology in general, but as a beginner you need a firm grasp of the technology, and that can only come from your own daily practice. I am Dream Eraser; I hope we meet at the top someday.
Foreword
After reading the title, if you are wondering what an incremental crawler is, congratulations, you will learn something from this post. If you already know, well done, master~
An incremental crawler crawls content incrementally. After our crawler has finished crawling a URL, some sites will add a new batch of data on top of the original, for example Huxiu's recommended headlines, chapter updates on novel sites, and so on. Any site that updates dynamically is a good fit for an incremental crawler.
This gives us a simple definition of an incremental crawler: one that crawls again on top of the results of previous crawls.
Time to code
Next, we will monitor a site with a crawler and crawl it incrementally as it is updated.
The site is the Sogou WeChat search platform: https://weixin.sogou.com/, where you can track trending news.
The core of an incremental crawler is deduplication
There are three ways to deduplicate:
- Before initiating a request, determine whether the URL has been requested before
- After parsing the content, determine whether that content has been crawled before
- When storing, determine whether the content already exists in the database
The scenarios where each of these three deduplication approaches applies are also easy to identify:
- Deciding before making the request is the most common case, for example news sites and novel updates; it works whenever new links appear on the page (see the sketch after this list)
- The second applies to pages where the data is refreshed in place but no new links are added
- The third is the last line of defense, making a final check right before writing to the database
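To make the first approach concrete, here is a minimal in-memory sketch. The fetch_if_new name and the seen_urls set are only illustrative and not part of the project code below, which uses Redis instead.
import requests

seen_urls = set()  # URLs already requested in this run, an in-memory stand-in for Redis

def fetch_if_new(url):
    # First approach: decide before sending the request
    if url in seen_urls:
        return None  # already crawled, skip it
    seen_urls.add(url)
    return requests.get(url)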
Deduplication method
In this post we store the URLs generated during crawling; the core is the set data type in Redis.
Operating Redis from Python is an age-old topic, and a little memorization is enough to use it. Note that the redis library needs to be installed first.
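If it is not installed yet, a single pip command takes care of it:
pip install redis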
sadd inserts a piece of data into a set and returns 1 if the value was added and 0 if it was already there. We can base the incremental crawl check on this.
import redis

def redis_conn(title):
    # Insert the title into the Redis set; returns 1 if it is new, 0 if it already exists
    conn = redis.Redis(host='127.0.0.1', port=6379)
    return conn.sadd('1', title)
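As a quick sanity check of that return value, assuming a local Redis is running on the default port (the key name 'demo' and the sample title are only illustrative):
import redis

conn = redis.Redis(host='127.0.0.1', port=6379)
print(conn.sadd('demo', 'first title'))  # 1 -> the value is new
print(conn.sadd('demo', 'first title'))  # 0 -> the value is already in the set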
Complete code presentation
The core of this post is understanding the incremental crawler. There is not much new knowledge in the code, so I will show it directly and add some notes in the comments. Note that the code below deduplicates on the title field; strictly speaking the title should be encoded (e.g. hashed) first, since checking directly against the raw Chinese string is not very professional.
import requests
from lxml import etree
import redis
import time
headers = {
"User-Agent": "Mozilla / 5.0 (Windows NT 6.1; Win64; X64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36"
}
# Prepare links to crawl
# https://weixin.sogou.com/
def run():
    while 1:  # infinite loop; the interval is set inside the loop
        url = "https://weixin.sogou.com/"
        res = requests.get(url, headers=headers)
        res.encoding = 'utf-8'
        html = res.text  # get the web page data
        parse(html)
        time.sleep(300)  # 5*60 = 300s, i.e. a 5-minute cycle
        print("Get again...")

def parse(html):
    html_element = etree.HTML(html)
    result = html_element.xpath('//h3//a[@uigs]')
    for item in result:
        href = item.get("href")
        title = item.text
        ex = redis_conn(title)  # check whether this item has been crawled before
        if ex == 1:
            print(f"Crawling {href}")

def redis_conn(title):
    # Redis connection
    conn = redis.Redis(host='127.0.0.1', port=6379)
    return conn.sadd('1', title)  # add the title to the set and deduplicate on it

if __name__ == "__main__":
    run()
The results
Finally, run the project and wait for the data to roll in. I did not grab the inner pages here; if you need them, you can add a function for that, as sketched below. Of course, the requests and lxml libraries can both be swapped out for alternatives.
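If you do want the inner pages, a rough sketch of such a helper could look like the following. The fetch_detail name and the '//p//text()' XPath are only placeholders and would need to be adapted to the actual article pages; it reuses the requests, etree and headers objects already defined in the code above.
def fetch_detail(href):
    # Hypothetical helper: request the article page and pull out its visible text
    res = requests.get(href, headers=headers)
    res.encoding = 'utf-8'
    element = etree.HTML(res.text)
    paragraphs = element.xpath('//p//text()')  # placeholder XPath, adjust to the real page
    return "\n".join(p.strip() for p in paragraphs if p.strip())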
Afterword
Hopefully this post helped you get the concept of an incremental crawler. Technically it is not difficult, so give it a try.