With the rapid development of the Internet, information now accumulates on a massive scale. We need to obtain large amounts of data from the outside world while filtering out what is useless, and when data is valuable to us we need to capture it specifically. That is why crawler technology emerged: it lets us obtain the data we need quickly. During crawling, however, the information owners will fight back with anti-crawler measures, so we have to overcome these obstacles one by one.
I did some crawler-related work a while ago, and this post records some of the experience gained along the way.
The example code for this article is at github.com/yangtao9502…
I use the Scrapy framework for the crawler. The relevant versions in the development environment are:
Scrapy    : 1.5.1
lxml      : 4.2.5.0
libxml2   : 2.9.8
cssselect : 1.0.3
parsel    : 1.5.1
w3lib     : 1.20.0
Twisted   : 18.9.0
Python    : 3.7.1 (default, Dec 10 2018, 22:54:23) [MSC v.1915 64 bit (AMD64)]
pyOpenSSL : 18.0.0 (OpenSSL 1.1.1a 20 Nov 2018)
cryptography : 2.4.2
Platform  : Windows-10-10.0.15063-SP0
For the local development environment, I recommend installing the related packages with Anaconda; otherwise you may run into all kinds of dependency conflicts, and anyone who has suffered through those knows how quickly configuring the environment kills your interest. This article mainly uses XPath to extract page data, so before working through the examples, make sure you understand the basics of XPath.
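As a quick refresher, here is a minimal XPath example using parsel, the selector library that Scrapy uses under the hood (the sample HTML is made up purely for illustration):

from parsel import Selector

html = "<ul><li class='house-cell'><a>Room A</a></li></ul>"
sel = Selector(text=html)
# Select the text of every <a> inside an <li> whose class is "house-cell"
titles = sel.xpath("//li[@class='house-cell']/a/text()").extract()
print(titles)  # ['Room A']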
Create a Scrapy project
Create a project named ytaoCrawl with the scrapy startproject command:
scrapy startproject ytaoCrawl
Note that the project name must begin with a letter and may contain only letters, digits, and underscores. After the project is created, the following information is displayed:
The files to initialize the project are:
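For reference, the default structure generated by scrapy startproject looks roughly like this:

ytaoCrawl/
    scrapy.cfg
    ytaoCrawl/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py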
The purpose of each file:
- The spiders directory is used to store crawler files.
- The items.py file defines the objects in which the crawled data is stored.
- The middlewares.py file holds middleware processors, for example for transforming requests and responses.
- The pipelines.py file is where crawled items are processed and persisted, for example written to a database.
- The settings.py file is the configuration file where crawler-related settings are made.
- The scrapy.cfg file is the configuration file for crawler deployment.
With these default generated files in mind, the following schematic of the Scrapy structure is relatively straightforward.
This completes our scrapy crawler project.
Create the spiders
We start by creating a Python file, ytaoSpider, whose class must inherit from scrapy.Spider. Next, we take Beijing rental listings on 58.com as an example for analysis.
#! /usr/bin/python3
# -*- coding: utf-8 -*-
#
# @Author : YangTao
# @blog : https://ytao.top
#
import scrapy


class YtaoSpider(scrapy.Spider):
    # Define the crawler name
    name = "crawldemo"
    # Domains allowed to be crawled; the restriction does not apply to the links in start_urls
    allowed_domains = ["58.com"]
    # Links to start crawling from
    start_urls = [
        "https://bj.58.com/chuzu/?PGTID=0d100000-0038-e441-0c8a-adeb346199d8&ClickID=2"
    ]

    # Save the response body to a local HTML file
    def download(self, response, fName):
        with open(fName + ".html", 'wb') as f:
            f.write(response.body)

    # response is the object returned for the crawled page
    def parse(self, response):
        # Download the Beijing rental page locally to make analysis easier
        self.download(response, "Rent a House in Beijing")
Start the crawler by executing the command, specifying the crawler name:
scrapy crawl crawldemo
When we have multiple crawlers, we can use scrapy list to get all crawler names.
During development, you can also start the crawler from the editor with a main function:
from scrapy import cmdline

if __name__ == '__main__':
    name = YtaoSpider.name
    cmd = 'scrapy crawl {0}'.format(name)
    cmdline.execute(cmd.split())
The crawled page will then be downloaded and saved in the directory from which we started the crawler.
Crawling multiple pages
Above we only crawled the first page, but real data capture inevitably involves pagination. Observing the site, we see that its pagination shows the number of the last page (58.com exposes at most the first 70 pages of data), as shown in the figure.
See the paginated HTML section below.
The page number of the last page is then obtained through Xpath and regular matching.
import re

# Get the page number of the last page
def pageNum(self, response):
    # Get the HTML block for pagination
    page_ele = response.xpath("//li[@id='pager_wrap']/div[@class='pager']")
    # Use a regex to pick out the text fragments that contain page numbers
    num_eles = re.findall(r">\d+<", page_ele.extract()[0].strip())
    # Find the biggest one
    count = 0
    for num_ele in num_eles:
        num_ele = str(num_ele).replace(">", "").replace("<", "")
        num = int(num_ele)
        if num > count:
            count = num
    return count
By analyzing the rental links, we can see that the pages follow the pattern https://bj.58.com/chuzu/pn + num, where num is the page number. To fetch a different page we only need to change the page number, so the parse function can be changed to:
import logging

from scrapy import Request

# Crawl link without the page number
target_url = "https://bj.58.com/chuzu/pn"

def parse(self, response):
    print("url: ", response.url)
    num = self.pageNum(response)
    # start_urls already covers the first page, so skip it when traversing pages
    p = 1
    while p < num:
        p += 1
        try:
            # Build the link of the next page
            url = self.target_url + str(p)
            # Fetch the next page
            yield Request(url, callback=self.parse)
        except BaseException as e:
            logging.error(e)
            print("Abnormal crawl data:", url)
After execution, the printed information is shown as follows:
Because the crawler runs asynchronously, the printed data is not in order. The approach above traverses pages by reading the number of the last page, but some sites do not expose a last page number; in that case we can check whether the current page has a "next page" link and, if it does, follow that link to keep crawling.
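A minimal sketch of that approach, with an assumed XPath for the "next page" link (58.com's actual markup may differ):

def parse(self, response):
    # ... extract the items of the current page here ...
    # Assumed selector: an <a> element with class "next" pointing to the next page
    next_href = response.xpath("//a[@class='next']/@href").extract_first()
    if next_href:
        # Only follow the link when a next page actually exists
        yield Request(response.urljoin(next_href), callback=self.parse)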
Extracting the data
Here we extract the title, area, location, neighborhood, and price, so those fields need to be defined in the item first.
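A minimal sketch of that item definition in items.py might look like this (the field names are taken from the extraction code below; url and id are included because the pipeline later reads and writes them):

import scrapy

class YtaocrawlItem(scrapy.Item):
    id = scrapy.Field()        # primary key generated when storing
    title = scrapy.Field()     # listing title
    room = scrapy.Field()      # room / area description
    position = scrapy.Field()  # location
    quarters = scrapy.Field()  # neighborhood
    price = scrapy.Field()     # price with unit
    url = scrapy.Field()       # source page link

With the item defined, the extraction code is as follows.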
# Avoid index-out-of-bounds errors when extracting data with XPath
def xpath_extract(self, selector, index):
    if len(selector.extract()) > index:
        return selector.extract()[index].strip()
    return ""

def setData(self, response):
    items = []
    houses = response.xpath("//ul[@class='house-list']/li[@class='house-cell']")
    for house in houses:
        item = YtaocrawlItem()
        # title
        item["title"] = self.xpath_extract(house.xpath("div[@class='des']/h2/a/text()"), 0)
        # area
        item["room"] = self.xpath_extract(house.xpath("div[@class='des']/p[@class='room']/text()"), 0)
        # position
        item["position"] = self.xpath_extract(house.xpath("div[@class='des']/p[@class='infor']/a/text()"), 0)
        # neighborhood
        item["quarters"] = self.xpath_extract(house.xpath("div[@class='des']/p[@class='infor']/a/text()"), 1)
        money = self.xpath_extract(house.xpath("div[@class='list-li-right']/div[@class='money']/b/text()"), 0)
        unit = self.xpath_extract(house.xpath("div[@class='list-li-right']/div[@class='money']/text()"), 1)
        # price
        item["price"] = money + unit
        items.append(item)
    return items

def parse(self, response):
    items = self.setData(response)
    for item in items:
        yield item

    # Then follow with the page-turning logic above...
At this point, we can get the data we want and see the result by printing the item from parse.
Storing the data
We have captured the page data; the next step is to store it in a database. Here we use MySQL as an example; if the data volume is large, consider other storage products. First, register the pipeline in the ITEM_PIPELINES setting of settings.py so that it is enabled:
ITEM_PIPELINES = {
    # The smaller the value, the higher the priority
    'ytaoCrawl.pipelines.YtaocrawlPipeline': 300,
}
Data persistence is handled in the YtaocrawlPipeline class; the MySQL helper mysqlUtils it uses can be found on GitHub. The items yielded in YtaoSpider#parse are passed to YtaocrawlPipeline#process_item for processing.
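The mysqlUtils helper itself is not shown in this article; purely as an illustration of what such a helper might look like, here is a minimal sketch built on pymysql (connection parameters are placeholders):

import pymysql

def _conn():
    return pymysql.connect(host="127.0.0.1", user="root", password="password",
                           database="ytao", charset="utf8mb4")

def select(sql):
    conn = _conn()
    try:
        with conn.cursor() as cursor:
            cursor.execute(sql)
            return cursor.fetchall()
    finally:
        conn.close()

def insert(item, table):
    data = dict(item)
    keys = ", ".join(data.keys())
    placeholders = ", ".join(["%s"] * len(data))
    sql = "INSERT INTO {0} ({1}) VALUES ({2})".format(table, keys, placeholders)
    conn = _conn()
    try:
        with conn.cursor() as cursor:
            cursor.execute(sql, list(data.values()))
        conn.commit()
    finally:
        conn.close()

def delete_by_id(id, table):
    conn = _conn()
    try:
        with conn.cursor() as cursor:
            cursor.execute("DELETE FROM {0} WHERE id = %s".format(table), (id,))
        conn.commit()
    finally:
        conn.close()

With helpers like these in place, the pipeline itself looks as follows.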
import uuid

# select, insert and delete_by_id come from the mysqlUtils helper mentioned above
class YtaocrawlPipeline(object):

    def process_item(self, item, spider):
        table = "crawl"
        item["id"] = str(uuid.uuid1())
        # If the crawled link already exists in the database, delete the old row and save the new one
        list = select(str.format("select * from {0} WHERE url = '{1}'", table, item["url"]))
        if len(list) > 0:
            for o in list:
                delete_by_id(o[0], table)
        insert(item, table)
        return item
In the database, you can see that the data was successfully fetched and stored.
Dealing with anti-crawl mechanisms
Wherever there is demand for crawled data, there will be anti-crawler measures. Let's analyze the ones met in this case.
Font encryption
From the database screenshot above, we can see garbled characters in the data. Looking at the pattern of the garbling, we can tell that it is the digits that are encrypted.
At the same time, printing the data shows characters such as \xa0, whose code points fall outside the printable ASCII range 0x20~0x7e.
Knowing that the font is encrypted, we inspect the font-family declarations in the downloaded page and find code like the following:
The fangchan-secret font looks suspicious: it is a font generated dynamically in JS and stored in Base64. Let's decode it.
# Standalone snippet: decode the Base64 font and inspect its character map
import base64
from io import BytesIO

from fontTools.ttLib import TTFont

secret = "AAEAAAALAIAAAwAwR1NVQiCLJXoAAAE4AAAAVE9TLzL4XQjtAAABjAAAAFZjbWFwq8p/XQAAAhAAAAIuZ2x5ZuWIN0cAAARYAAADdGhlYWQXlvp9AAAA4AA AADZoaGVhCtADIwAAALwAAAAkaG10eC7qAAAAAAHkAAAALGxvY2ED7gSyAAAEQAAAABhtYXhwARgANgAAARgAAAAgbmFtZTd6VP8AAAfMAAACanBvc3QFRAY qAAAKOAAAAEUAAQAABmb+ZgAABLEAAAAABGgAAQAAAAAAAAAAAAAAAAAAAAsAAQAAAAEAAOOjpKBfDzz1AAsIAAAAAADaB9e2AAAAANoH17YAAP/mBGgGLgA AAAgAAgAAAAAAAAABAAAACwAqAAMAAAAAAAIAAAAKAAoAAAD/AAAAAAAAAAEAAAAKADAAPgACREZMVAAObGF0bgAaAAQAAAAAAAAAAQAAAAQAAAAAAAAAAQA AAAFsaWdhAAgAAAABAAAAAQAEAAQAAAABAAgAAQAGAAAAAQAAAAEERAGQAAUAAAUTBZkAAAEeBRMFmQAAA9cAZAIQAAACAAUDAAAAAAAAAAAAAAAAAAAAAAA AAAAAAFBmRWQAQJR2n6UGZv5mALgGZgGaAAAAAQAAAAAAAAAAAAAEsQAABLEAAASxAAAEsQAABLEAAASxAAAEsQAABLEAAASxAAAEsQAAAAAABQAAAAMAAAA sAAAABAAAAaYAAQAAAAAAoAADAAEAAAAsAAMACgAAAaYABAB0AAAAFAAQAAMABJR2lY+ZPJpLnjqeo59kn5Kfpf//AACUdpWPmTyaS546nqOfZJ+Sn6T//wA AAAAAAAAAAAAAAAAAAAAAAAABABQAFAAUABQAFAAUABQAFAAUAAAACAAGAAQAAgAKAAMACQABAAcABQAAAQYAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAADAAAAAAAiAAAAAAAAAAKAACUdgAAlHYAAAAIAACVjwAAlY8AAAA GAACZPAAAmTwAAAAEAACaSwAAmksAAAACAACeOgAAnjoAAAAKAACeowAAnqMAAAADAACfZAAAn2QAAAAJAACfkgAAn5IAAAABAACfpAAAn6QAAAAHAACfpQA An6UAAAAFAAAAAAAAACgAPgBmAJoAvgDoASQBOAF+AboAAgAA/+YEWQYnAAoAEgAAExAAISAREAAjIgATECEgERAhIFsBEAECAez+6/rs/v3IATkBNP7S/sE C6AGaAaX85v54/mEBigGB/ZcCcwKJAAABAAAAAAQ1Bi4ACQAAKQE1IREFNSURIQQ1/IgBW/6cAicBWqkEmGe0oPp7AAEAAAAABCYGJwAXAAApATUBPgE1NCY jIgc1NjMyFhUUAgcBFSEEGPxSAcK6fpSMz7y389Hym9j+nwLGqgHButl0hI2wx43iv5D+69b+pwQAAQAA/+YEGQYnACEAABMWMzI2NRAhIzUzIBE0ISIHNTY zMhYVEAUVHgEVFAAjIiePn8igu/5bgXsBdf7jo5CYy8bw/sqow/7T+tyHAQN7nYQBJqIBFP9uuVjPpf7QVwQSyZbR/wBSAAACAAAAAARoBg0ACgASAAABIxE jESE1ATMRMyERNDcjBgcBBGjGvv0uAq3jxv58BAQOLf4zAZL+bgGSfwP8/CACiUVaJlH9TwABAAD/5gQhBg0AGAAANxYzMjYQJiMiBxEhFSERNjMyBBUUACE iJ7GcqaDEx71bmgL6/bxXLPUBEv7a/v3Zbu5mswEppA4DE63+SgX42uH+6kAAAAACAAD/5gRbBicAFgAiAAABJiMiAgMzNjMyEhUUACMiABEQACEyFwEUFjM yNjU0JiMiBgP6eYTJ9AIFbvHJ8P7r1+z+8wFhASClXv1Qo4eAoJeLhKQFRj7+ov7R1f762eP+3AFxAVMBmgHjLfwBmdq8lKCytAAAAAABAAAAAARNBg0ABgA ACQEjASE1IQRN/aLLAkD8+gPvBcn6NwVgrQAAAwAA/+YESgYnABUAHwApAAABJDU0JDMyFhUQBRUEERQEIyIkNRAlATQmIyIGFRQXNgEEFRQWMzI2NTQBtv7 rAQTKufD+3wFT/un6zf7+AUwBnIJvaJLz+P78/uGoh4OkAy+B9avXyqD+/osEev7aweXitAEohwF7aHh9YcJlZ/7qdNhwkI9r4QAAAAACAAD/5gRGBicAFwA jAAA3FjMyEhEGJwYjIgA1NAAzMgAREAAhIicTFBYzMjY1NCYjIga5gJTQ5QICZvHD/wABGN/nAQT+sP7Xo3FxoI16pqWHfaTSSgFIAS4CAsIBDNbkASX+lf6 l/lP+MjUEHJy3p3en274AAAAAABAAxgABAAAAAAABAA8AAAABAAAAAAACAAcADwABAAAAAAADAA8AFgABAAAAAAAEAA8AJQABAAAAAAAFAAsANAABAAAAAAA GAA8APwABAAAAAAAKACsATgABAAAAAAALABMAeQADAAEECQABAB4AjAADAAEECQACAA4AqgADAAEECQADAB4AuAADAAEECQAEAB4A1gADAAEECQAFABYA9AA DAAEECQAGAB4BCgADAAEECQAKAFYBKAADAAEECQALACYBfmZhbmdjaGFuLXNlY3JldFJlZ3VsYXJmYW5nY2hhbi1zZWNyZXRmYW5nY2hhbi1zZWNyZXRWZXJ zaW9uIDEuMGZhbmdjaGFuLXNlY3JldEdlbmVyYXRlZCBieSBzdmcydHRmIGZyb20gRm9udGVsbG8gcHJvamVjdC5odHRwOi8vZm9udGVsbG8uY29tAGYAYQB uAGcAYwBoAGEAbgAtAHMAZQBjAHIAZQB0AFIAZQBnAHUAbABhAHIAZgBhAG4AZwBjAGgAYQBuAC0AcwBlAGMAcgBlAHQAZgBhAG4AZwBjAGgAYQBuAC0AcwB lAGMAcgBlAHQAVgBlAHIAcwBpAG8AbgAgADEALgAwAGYAYQBuAGcAYwBoAGEAbgAtAHMAZQBjAHIAZQB0AEcAZQBuAGUAcgBhAHQAZQBkACAAYgB5ACAAcwB 2AGcAMgB0AHQAZgAgAGYAcgBvAG0AIABGAG8AbgB0AGUAbABsAG8AIABwAHIAbwBqAGUAYwB0AC4AaAB0AHQAcAA6AC8ALwBmAG8AbgB0AGUAbABsAG8ALgB 
jAG8AbQAAAAIAAAAAAAAAFAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACwECAQMBBAEFAQYBBwEIAQkBCgELAQwAAAAAAAAAAAAAAAAAAAAA"
# Convert the Base64 text to a UTF-8 encoded bytes object
bytes = secret.encode(encoding='UTF-8')
# Base64-decode it
decodebytes = base64.decodebytes(bytes)
# Wrap the decoded bytes in BytesIO and parse the font with TTFont
font = TTFont(BytesIO(decodebytes))
# Character-to-glyph mapping of the font
font_map = font['cmap'].tables[0].ttFont.tables['cmap'].tables[0].cmap
print(font_map)
Parsing the font with TTFont from the fontTools library yields the following character mapping:
{38006:'glyph00007',
38287: 'glyph00005',
39228: 'glyph00006',
39499: 'glyph00003',
40506: 'glyph00010',
40611: 'glyph00001',
40804: 'glyph00009',
40850: 'glyph00004',
40868: 'glyph00002',
40869: 'glyph00008'
}
There are exactly ten mappings, corresponding to the digits 0~9. Looking for the rule, however, the glyph numbers run from 1 to 10, so which digit does each one correspond to? In addition, the keys of the mapping are not hexadecimal ASCII codes but plain numbers, so perhaps they are decimal code points. We verify this by converting the hexadecimal character codes found on the page to decimal and matching them against the mapping: it turns out that the numeric part of each mapping value is exactly 1 larger than the digit shown on the page. For example, the page character U+9FA4 has decimal code 40868, which maps to glyph00002, so the real digit is 2 - 1 = 1. In other words, the real value is obtained by subtracting 1 from the numeric part of the mapping value. The finished code:
import re
# base64, BytesIO and TTFont are imported as in the snippet above

def decrypt(self, response, code):
    # Extract the Base64 font embedded in the page
    secret = re.findall(r"charset=utf-8;base64,(.*?)'\]", response.text)[0]
    code = self.secretfont(code, secret)
    return code

def secretfont(self, code, secret):
    # Convert the Base64 text to a UTF-8 encoded bytes object
    bytes = secret.encode(encoding='UTF-8')
    # Base64-decode it
    decodebytes = base64.decodebytes(bytes)
    # Wrap the decoded bytes in BytesIO and parse the font with TTFont
    font = TTFont(BytesIO(decodebytes))
    # Character-to-glyph mapping of the font
    font_map = font['cmap'].tables[0].ttFont.tables['cmap'].tables[0].cmap
    chars = []
    for char in code:
        # Convert each character to its decimal code point
        decode = ord(char)
        # If the code point is a key in the mapping, the character uses the secret font
        if decode in font_map:
            # Get the mapped value
            val = font_map[decode]
            # Per the rule above, take the numeric part and subtract 1 to get the real value
            char = int(re.findall(r"\d+", val)[0]) - 1
        # Characters that are not in the mapping are kept as they are
        chars.append(char)
    return "".join(map(lambda s: str(s), chars))
Now we decrypt all the data we crawled and look at the data:
As the figure above shows, decryption solves the garbled data problem perfectly!
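The article does not show exactly where decrypt is hooked in; one possible wiring (an assumption, shown only as a sketch) is to run the extracted price text through it in setData before the item is stored:

# Inside setData, after money and unit have been extracted:
item["price"] = self.decrypt(response, money + unit)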
Captchas and IP bans
Captchas generally come in two kinds: one must be solved as soon as you enter the site, and the other appears only after frequent requests, before you are allowed to continue. For the first kind you have to crack the captcha to proceed; for the second, besides cracking the captcha, you can use a proxy to bypass the verification. A proxy can likewise be used to bypass IP bans. For example, when I crawl the site above and it suspects that I may be a crawler, it intercepts me with a captcha, as shown below:
Next we use a random User-Agent and proxy IPs to get around this. First configure settings.USER_AGENT. Do not mix desktop and mobile User-Agent values, otherwise the crawled data will be inconsistent, because the pages served to each are different:
USER_AGENT = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.10 Safari/537.36",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    # ...
]
Then set up a middleware that picks a random User-Agent for each request:
import random

class RandomUserAgentMiddleware(object):

    def __init__(self, agents):
        self.agent = agents

    @classmethod
    def from_crawler(cls, crawler):
        return cls(agents=crawler.settings.get('USER_AGENT'))

    def process_request(self, request, spider):
        # Pick a random User-Agent from the settings
        request.headers.setdefault('User-Agent', random.choice(self.agent))
Set up dynamic IP middleware
import logging
import random

# The import path below is an assumption; the middleware reads BAN_URLS from the project settings
from ytaoCrawl import settings

class ProxyIPMiddleware(object):

    def __init__(self, ip=''):
        self.ip = ip

    def process_request(self, request, spider):
        # If the current request was redirected to a captcha page, retry it through a proxy IP
        if self.ban_url(request.url):
            # Get the originally requested (pre-redirect) address
            redirect_urls = request.meta.get("redirect_urls")[0]
            # Replace the current captcha address with the original request address
            request._set_url(redirect_urls)
            # Set a dynamic proxy; here an online interface is used to generate proxies dynamically
            request.meta["proxy"] = "http://%s" % (self.proxy_ip())

    def ban_url(self, url):
        # Captcha or blocked-page markers configured in settings; the crawler re-crawls when it hits them
        dic = settings.BAN_URLS
        # Check whether the current request address is a captcha address
        for d in dic:
            if url.find(d) != -1:
                return True
        return False

    # Dynamically generated proxy in the form ip:port
    def proxy_ip(self):
        # Simulate dynamic generation of proxy addresses
        ips = [
            "127.0.0.1:8888",
            "127.0.0.1:8889",
        ]
        return random.choice(ips)

    def process_response(self, request, response, spider):
        # If the response is not successful, re-crawl
        if response.status != 200:
            logging.error("Failed response: " + str(response.status))
            return request
        return response
Finally, enable the middleware in the Settings configuration file.
DOWNLOADER_MIDDLEWARES = {
    'ytaoCrawl.middlewares.RandomUserAgentMiddleware': 500,
    'ytaoCrawl.middlewares.ProxyIPMiddleware': 501,
    'ytaoCrawl.middlewares.YtaocrawlDownloaderMiddleware': 543,
}
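The ProxyIPMiddleware above also reads a BAN_URLS list from the settings. The article does not show this setting, so the entry below is only an assumed example of what a captcha-page marker might look like:

# Assumed example: substrings that identify a captcha or blocked-page redirect
BAN_URLS = [
    "verifycode",
]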
At this point, setting up random User-agent and dynamic IP bypass is complete.
Deployment
We deploy the crawler project with Scrapyd, which lets us manage crawlers remotely: starting, stopping, fetching logs, and so on. To deploy, we first need to install scrapyd with the following command:
pip install scrapyd
After the installation is successful, you can see that the version is 1.2.1.
To deploy, we also need the scrapyd-client:
pip install scrapyd-client
Modify the scrapy.cfg file:
[settings]
default = ytaoCrawl.settings
[deploy:localytao]
url = http://localhost:6800/
project = ytaoCrawl
# deploy can be used for batch deployment
Start scrapyd:
scrapyd
For Windows, create scrapyd-deploy.bat under X:\xx\Scripts
@echo off
"X:\xx\python.exe" "X:\xx\Scripts\scrapyd-deploy"% 1% 2Copy the code
Project deployment to Scrapyd service:
scrapyd-deploy localytao -p ytaoCrawl
Remote start: curl http://localhost:6800/schedule.json -d project=ytaoCrawl -d spider=ytaoSpider
After the execution is started, you can view the crawler execution status and logs in http://localhost:6800/
In addition to launching remote calls, Scrapyd also provides a rich API:
- Crawler status query in service
curl http://localhost:6800/daemonstatus.json
- Cancel the crawler
curl http://localhost:6800/cancel.json -d project=projectName -d job=jobId
- Show project
curl http://localhost:6800/listprojects.json
- Delete the project
curl http://localhost:6800/delproject.json -d project=projectName
- Show the crawler
curl http://localhost:6800/listspiders.json?project=projectName
- Gets all version numbers of the project
curl http://localhost:6800/listversions.json?project=projectName
- Delete the project version number
curl http://localhost:6800/delversion.json -d project=projectName -d version=versionName
More details: scrapyd.readthedocs.io/en/stable/a…
Conclusion
Space in this article is limited, so the analysis is not exhaustive. Some sites are harder to crawl, but as long as we analyze them step by step we can find a way through. Also, the data your eyes see is not necessarily the data you get; for example, some sites render their HTML dynamically, and that requires extra handling. The crawler world is a lot of fun once you get into it. Finally, I hope none of you end up facing prison over a crawler: however many millions of rows you crawl, obeying the law comes first.
Personal blog: Ytao.top
My official account is Ytao