I am participating in the Mid-Autumn Festival Creative Submission contest — see the Mid-Autumn Festival Creative Submission Contest page for details.

introduction

On the last day of the event I am finally posting my Mid-Autumn Festival article. Most of the entries are front-end experts drawing animations, and I could hardly find anything interesting done on the back end, so I will share my beginner's experience of crawling a couple of sites with Python: using Scrapy to pick up some comments from V2EX, plus a bit of Google's search results pages. V2EX is a discussion site for real programmers and it has some anti-crawling mechanisms; since time was limited I did not dig into them deeply, and the goal is not whole-site data.

start

First install the packages we may need later. Don't be surprised that pillow, jieba, and wordcloud are on the list — be sure to read to the end ~

pip3 install scrapy
pip3 install pillow
pip3 install jieba
pip3 install numpy
pip3 install wordcloud

Select a directory initialization project:

scrapy startproject googlespider

You will then be prompted to enter the project folder and generate the spider, which we do with the following two commands:

cd googlespider
scrapy genspider v2ex google.com

The following directory structure will then be generated.

Apart from the Mid-Autumn festival PNG, get_word.py, and juejin.jpg — three files we will use later — everything else was generated by the commands above.
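For reference, a freshly generated Scrapy project typically has roughly this layout (the three extra files mentioned above are added by hand later):

googlespider/
    scrapy.cfg            # deployment / project configuration
    googlespider/
        __init__.py
        items.py          # item definitions
        middlewares.py    # downloader and spider middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/
            __init__.py
            v2ex.py       # the spider created by genspider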

Preliminary settings

Open settings.py

ROBOTSTXT_OBEY = True is the gentleman's agreement for crawlers: big companies' crawlers obey robots.txt, but we small developers generally don't follow this rule, so turn it off.
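A minimal settings.py tweak, assuming you are fine with ignoring robots.txt:

ROBOTSTXT_OBEY = False  # do not fetch or honour robots.txt before crawling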

Modifying Concurrency Settings

Scrapy is an asynchronous framework and is quite efficient; the default concurrency is 16. Since we are not tackling the anti-crawling mechanisms here, we set it to 1 — otherwise the requests all come back 404 or 403.

CONCURRENT_REQUESTS = 1

Turn on the default Headers

We can't be too blatant and announce outright that we are a crawler; a little disguise and cover-up is standard practice ~

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}

Turn on the download delay

To avoid the anti-crawling measures we have to settle for slower requests ~ problems arise quickly if you request too frequently.

AUTOTHROTTLE_START_DELAY = 5
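One caveat: AUTOTHROTTLE_START_DELAY only takes effect when the AutoThrottle extension itself is switched on, so the settings would look something like this:

AUTOTHROTTLE_ENABLED = True      # turn on the AutoThrottle extension
AUTOTHROTTLE_START_DELAY = 5     # initial download delay (seconds) before it adapts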

The real start

Open the generated v2ex.py and go straight to the code; the details are explained in the comments.

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http.request import Request
from googlespider.items import GooglespiderItem


class V2exSpider(scrapy.Spider):
    name = 'v2ex'
    # Limits which domains the crawler may crawl; complex crawlers easily wander off to other sites
    allowed_domains = ['google.com']
    # Auto-generated start URL; usually you can do without it
    start_urls = ['http://google.com/']
    # Google pagination offset, incremented by 10 for every results page
    page_data = 10
    headers = {
        'authority': 'www.v2ex.com',
        'cache-control': 'no-cache',
        'pragma': 'no-cache',
        'sec-ch-ua': '"Google Chrome";v="93", " Not;A Brand";v="99", "Chromium";v="93"',
        'sec-ch-ua-mobile': '?0',
        'sec-ch-ua-platform': '"Linux"',
        'upgrade-insecure-requests': '1',
        'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36',
        'accept': 'image/avif,image/webp,image/apng,image/svg+xml,image/*,*/*;q=0.8',
        'accept-language': 'zh-CN,zh;q=0.9',
        'sec-fetch-site': 'same-origin',
        'sec-fetch-mode': 'no-cors',
        'sec-fetch-user': '?1',
        'sec-fetch-dest': 'image',
        'cookie': 'PB3_SESSION="2|1:0|10:1632320639|11:PB3_SESSION|36:djJleDo0Ny4yNTQuODQuMjA2Ojk5NTgyNDE3|af262dbef778709e4964d0dec124e60e267c03ac8e01bf98e702cb85b7fd0698"; V2EX_LANG=zhcn; _ga=GA1.2.2017158286.1632320642; _gid=GA1.2.1291035321.1632320642; A2="2|1:0|10:1632321211|2:A2|48:M2RjYzkzMWQtOTAwZi00YTA4LWEyNTctZmQ2NTdiMmY4YmMy|b75429367c29090d8e1b29d8f3c84a7c5e723005928c6d5267e4137e6bc02e91"; V2EX_REFERRER="2|1:0|10:1632321220|13:V2EX_REFERRER|8:VG9ieTIz|fbb0b7efd4f44fcd4298c61978cf3a395fd7111e54d8185344e5e8d286dff7f4"; V2EX_TAB="2|1:0|10:1632327619|8:V2EX_TAB|8:dGVjaA==|90bd1f7b91355e424c24b6192cbef107cde9747006042d46320b41b0b991173d"; _gat=1',
        'referer': 'https://www.v2ex.com/t/710481',
        'origin': 'https://www.v2ex.com',
        'if-modified-since': 'Wed, 11 Aug 2021 00:32:57 GMT',
        'if-none-match': 'W/"55467a007c0429c0b04e98443edd5063d10f0b22"',
        'x-requested-with': 'XMLHttpRequest',
        'content-length': '0',
        'content-type': 'text/plain',
    }

    def start_requests(self):
        # The url is a Google search page. site: and intitle: are Google's advanced search operators:
        # site: restricts results to one site, intitle: requires the keyword to appear in the page title
        url = "https://www.google.com/search?q=site:v2ex.com/t+intitle:%E4%B8%AD%E7%A7%8B"
        # yield turns this method into a generator (an iterator), which is the key to the asynchronous
        # crawler: the request is handed to parse for downloading and parsing while this method can
        # keep producing further requests, and Scrapy schedules them all asynchronously.
        yield Request(url, callback=self.parse)

    def parse(self, response):
        # The response object is Scrapy's wrapper and offers many helper methods, e.g. selector.re below
        url_list = response.selector.re(r"https://www.v2ex.com/t/[0-9]*")
        print(url_list)
        # Hand the asynchronous task of fetching article details to the next method, then keep turning
        # the page until Google's results page contains no more article links.
        # dont_filter=True lets us request v2ex.com even though it is not in allowed_domains.
        if len(url_list) > 0:
            for i in url_list:
                yield Request(url=i, callback=self.parse_detail, dont_filter=True, headers=self.headers)
            yield Request(
                url="https://www.google.com/search?q=site:v2ex.com/t+intitle:%E4%B8%AD%E7%A7%8B&start=" + str(self.page_data),
                callback=self.parse)
            self.page_data += 10

    def parse_detail(self, response):
        # XPath for the comment contents on a V2EX topic page
        xpath_str = '//*[@class="reply_content"]/text()'
        item = GooglespiderItem()
        word_list = response.xpath(xpath_str).getall()
        if len(word_list) > 0:
            for i in word_list:
                item['word'] = i
                # Hand the crawled content to the pipeline; Scrapy schedules the item automatically
                yield item

That is the core spider code; the key points are explained below.

Disguising the headers

Headers are the first and most useful tool for getting past anti-crawling. The simplest approach is to imitate a normal browser request and let a tool generate the Python headers code automatically: in the browser's developer tools, find the request under the Network tab, then right-click → Copy → Copy as cURL.

Then find an online cURL-to-Python converter, for example: tool.lu/curl

That gives you the headers.
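The converter's output is usually a requests-based snippet roughly like the sketch below (only two of the copied header values are shown here; the full set is the long dict in v2ex.py above). Copy the headers dict it produces into the spider's headers attribute.

import requests

headers = {
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36',
    'referer': 'https://www.v2ex.com/t/710481',
    # ... plus the rest of the copied browser headers and cookies
}

response = requests.get('https://www.v2ex.com/t/710481', headers=headers)
print(response.status_code)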

Finding comment data

Install the Chrome extension XPath Helper. Right-click any comment → Inspect, copy the element's XPath, paste it into XPath Helper, and the matching result is returned automatically. It is really fast and saves a lot of debugging time.
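Another quick way to verify the XPath before wiring it into the spider is Scrapy's interactive shell — a rough sketch, assuming the topic page (taken from the headers above) is reachable without the full disguised headers:

scrapy shell "https://www.v2ex.com/t/710481"
>>> response.xpath('//*[@class="reply_content"]/text()').getall()   # should print the list of comment strings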

Download the data

In items.py, add one line under class GooglespiderItem(scrapy.Item):

word = scrapy.Field()
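For reference, the whole items.py then looks roughly like this:

import scrapy


class GooglespiderItem(scrapy.Item):
    # a single field holding one comment string per item
    word = scrapy.Field()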

With the item defined, write the item pipeline that saves the data; it is very simple, so no need to elaborate:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import csv
import codecs


class GooglespiderPipeline(object):
    def process_item(self, item, spider):
        return item


class CsvspiderPipeline(object):
    def __init__(self):
        # Open a CSV file; utf_8_sig keeps the Chinese text readable in Excel
        self.file = codecs.open('word.csv', 'w', encoding='utf_8_sig')

    def process_item(self, item, spider):
        fieldnames = ['word']
        w = csv.DictWriter(self.file, fieldnames=fieldnames)
        print(item)
        w.writerow(item)
        return item

    def close_spider(self, spider):
        # Close the file when the spider finishes
        self.file.close()

Then open settings.py and register the pipeline. 290 is the pipeline's priority; complex projects can have many pipelines and middlewares.

ITEM_PIPELINES = {
   'googlespider.pipelines.CsvspiderPipeline': 290,
}

run

Run the following command in the project root directory (where scrapy.cfg lives):

scrapy crawl v2ex

Then wait a few minutes for the crawl to finish, and a word.csv file will be generated in the project directory.

Process the data

Use jieba (word segmentation), wordcloud, PIL, and a few other modules to make a simple word-cloud image. Straight to the code — nothing is studied in depth here, it is mostly looking things up and calling APIs, haha. Life is short, I use Python ~

import csv
import jieba.analyse as analyse
from PIL import Image
import matplotlib.pyplot as plt
import wordcloud
import numpy as np

all_text = ''
with open('./word.csv') as f:
    f_csv = csv.reader(f)
    for row in f_csv:
        print(row[0])
        all_text += row[0] + '; '

image1 = Image.open('./juejin.jpg')  # open a background image to use as the word-cloud mask
MASK = np.array(image1)  # wordcloud expects the mask as a numpy array

tags = analyse.extract_tags(all_text, topK=100)  # pick the 100 most frequent keywords with jieba
text = " ".join(tags)  # join the keywords into one space-separated string, which is what this library takes

# Width, height, max words, background colour and so on can all be set here.
# Note that the font file has to match your system; mine is Ubuntu.
wc = wordcloud.WordCloud(
    font_path="/usr/share/fonts/truetype/arphic/ukai.ttc",
    max_words=2000,
    mask=MASK,
    height=400,
    width=400,
    background_color='white',
    repeat=False,
    mode='RGBA',
)

con = wc.generate(text)  # generate the word cloud

plt.imshow(con)  # display the image
plt.axis("off")
plt.show()
print(tags)
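If you also want to write the image to disk rather than only display it, wordcloud's to_file method does that — a one-line addition after the generate call (the output filename here is just a placeholder):

wc.to_file('./mid_autumn_wordcloud.png')  # hypothetical filename; saves the generated cloud as a PNG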

The background image:

The result:

conclusion

Full code: koala9527/v2ex-google-scrapy-spider. The results hold no surprises, e.g. 'happy', 'Mid-Autumn Festival', 'moon cake'. There are also a few less common words that are more interesting, such as 'girlfriend', 'study', 'sleep', haha. The real difficulty of crawling lies in beating anti-crawling measures: when the request frequency was high at the beginning, both Google and V2EX responded straight away with 404 or 403, so only a small amount of data could be fetched in a short time. To get a large amount of data you would need rotating IPs, rotating headers, or even to study the site's JS code piece by piece. Finally, thank you for reading; anyone who made it to the end is an expert — corrections and pointers for anything I got wrong are very welcome ~