How do you collect data from a large number of news websites? Crawling thousands of news sites is a huge workload for any crawler. XPath can generally be used to parse news pages, but with so many sites to cover, relying on XPath alone makes the extraction rules expensive to write and the collection slow. I also tried the readability library to parse pages automatically; it worked well at first on standard HTML pages, but the results were poor once I hit irregular or badly structured HTML.
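For reference, here is a minimal sketch of the readability approach, assuming the readability-lxml package and the requests library are installed; the URL is a placeholder:

import requests
from readability import Document

# Hypothetical example URL; any news article page would do.
html = requests.get("https://example.com/news/article.html", timeout=10).text

doc = Document(html)
print(doc.title())    # extracted article title
print(doc.summary())  # cleaned HTML of the main article body

# On well-formed pages this works out of the box; on irregular HTML the
# extracted summary is often incomplete, which is the limitation described above.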

At present, most web pages also contain irrelevant content such as navigation bars, advertisements, and copyright notices, which can be discarded since it has nothing to do with the news data being collected. After a period of research, I settled on the Python Scrapy framework combined with a crawler proxy to collect news data at scale. The downloader middleware below routes every request through a proxy tunnel:

#! -*- encoding:utf-8 -*-
import base64
import random
import sys

PY3 = sys.version_info[0] >= 3


def base64ify(bytes_or_str):
    # Under Python 3, encode str to bytes before Base64-encoding the credentials
    if PY3 and isinstance(bytes_or_str, str):
        input_bytes = bytes_or_str.encode('utf8')
    else:
        input_bytes = bytes_or_str
    output_bytes = base64.urlsafe_b64encode(input_bytes)
    if PY3:
        return output_bytes.decode('ascii')
    else:
        return output_bytes


class ProxyMiddleware(object):
    def process_request(self, request, spider):
        # Proxy server (tunnel proxy endpoint)
        proxyHost = "t.16yun.cn"
        proxyPort = "31111"

        # Proxy tunnel authentication (placeholder credentials)
        proxyUser = "username"
        proxyPass = "password"

        request.meta['proxy'] = "http://{0}:{1}".format(proxyHost, proxyPort)

        # Add the Proxy-Authorization header
        encoded_user_pass = base64ify(proxyUser + ":" + proxyPass)
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass

        # Set a tunnel header so the proxy switches the exit IP per tunnel value
        tunnel = random.randint(1, 10000)
        request.headers['Proxy-Tunnel'] = str(tunnel)
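For Scrapy to actually use this middleware, it has to be registered in the project's settings.py. A minimal sketch, assuming the class above lives in middlewares.py inside a project named news_crawler (both names are placeholders):

# settings.py -- "news_crawler" is a placeholder project name
DOWNLOADER_MIDDLEWARES = {
    # Route every request through the proxy middleware defined above
    'news_crawler.middlewares.ProxyMiddleware': 543,
}

# Optional politeness settings that help when crawling many sites
CONCURRENT_REQUESTS = 16
DOWNLOAD_DELAY = 0.5

A spider that pairs the proxy with XPath extraction might then look like the sketch below; the start URL and XPath expressions are illustrative only and would differ for every news site:

import scrapy

class NewsSpider(scrapy.Spider):
    # Hypothetical spider: follows article links from a listing page
    # and extracts the title and body text with XPath.
    name = "news"
    start_urls = ["https://example.com/news/"]

    def parse(self, response):
        for link in response.xpath('//a[contains(@href, "/news/")]/@href').getall():
            yield response.follow(link, callback=self.parse_article)

    def parse_article(self, response):
        yield {
            "url": response.url,
            "title": response.xpath("//h1/text()").get(default="").strip(),
            "body": " ".join(response.xpath("//p//text()").getall()),
        }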