With the rapid development of the Internet, information accumulates at a massive rate. We need to obtain large amounts of data from the outside world and filter out what is useless, then specifically capture the data that is valuable to us. This is where crawler technology comes in: it lets us obtain the data we need quickly. However, during crawling, the information owners deploy anti-crawler measures, so we have to overcome those obstacles one by one.

I did some crawler-related work a while ago, so here I am recording some of the relevant experience.

The example code for this article: github.com/yangtao9502…

Here I use the Scrapy framework for the crawler. The relevant version numbers of the development environment:

Scrapy       : 1.5.1
lxml         : 4.2.5.0
libxml2      : 2.9.8
cssselect    : 1.0.3
parsel       : 1.5.1
w3lib        : 1.20.0
Twisted      : 18.9.0
Python       : 3.7.1 (default, Dec 10 2018, 22:54:23) [MSC v.1915 64 bit (AMD64)]
pyOpenSSL    : 18.0.0 (OpenSSL 1.1.1a  20 Nov 2018)
cryptography : 2.4.2
Platform     : Windows-10-10.0.15063-SP0

For the local development environment, I recommend using Anaconda to install the related packages; otherwise you may run into all kinds of dependency conflicts, and anyone who has been through that knows how quickly it kills your interest in configuring an environment. This article mainly uses XPath to extract page data, so before working through the examples, make sure you understand the basics of XPath.
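As a quick refresher, here is a minimal, self-contained sketch of XPath extraction using parsel, the selector library that Scrapy builds on; the HTML fragment below is made up purely for illustration:

from parsel import Selector

# A made-up HTML fragment used only to demonstrate the syntax
html = "<ul><li class='house-cell'><a href='/1.html'>Two-bedroom flat</a></li></ul>"
sel = Selector(text=html)

# Extract the link text and the href attribute with XPath expressions
title = sel.xpath("//li[@class='house-cell']/a/text()").extract_first()
href = sel.xpath("//li[@class='house-cell']/a/@href").extract_first()
print(title, href)  # Two-bedroom flat /1.html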

Create a Scrapy project

Use the scrapy startproject command to create a project named ytaoCrawl:

scrapy startproject ytaoCrawl

Note that the project name must begin with a letter and contain only letters, numbers, and underscores. After the creation is successful, the following information is displayed:

The files to initialize the project are:
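In text form, the layout generated by Scrapy's standard project template looks roughly like this:

ytaoCrawl/
    scrapy.cfg
    ytaoCrawl/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py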

The purpose of each file:

  • The spiders directory is used to store the crawler files.
  • The items.py file defines the objects in which crawled data is stored.
  • The middlewares.py file holds middleware processors, such as request and response transformations.
  • The pipelines.py file is where item pipelines process the crawled data, for example persisting it to a database.
  • The settings.py file is the configuration file in which some of the configuration in the crawler can be set.
  • The scrapy. CFG file is the configuration file for crawler deployment.

With these default generated files in mind, the Scrapy architecture diagram below is relatively straightforward to follow.

This completes our scrapy crawler project.

Create the spider

We start by creating a Python file with a class, YtaoSpider, which must inherit from scrapy.Spider. Next, we take Beijing rental listings on 58.com as an example for analysis.

#! /usr/bin/python3
# -*- coding: utf-8 -*-
#
# @Author : YangTao
# @blog : https://ytao.top
#
import scrapy

class YtaoSpider(scrapy.Spider):
    # Define the crawler name
    name = "crawldemo"
    # Domains allowed to be crawled (does not restrict the links in start_urls)
    allowed_domains = ["58.com"]
    # Links to start crawling from
    start_urls = [
        "https://bj.58.com/chuzu/?PGTID=0d100000-0038-e441-0c8a-adeb346199d8&ClickID=2"
    ]

    def download(self, response, fName):
        with open(fName + ".html", 'wb') as f:
            f.write(response.body)

    # response is the object returned for the crawled page
    def parse(self, response):
        # Download the Beijing rental page locally for easier analysis
        self.download(response, "Beijing rental")

Start the crawler by executing the command, specifying the crawler name:

scrapy crawl crawldemo

When we have multiple crawlers, we can run scrapy list to get all the crawler names.

You can also use a main function to start it from the editor during development:

from scrapy import cmdline

if __name__ == '__main__':
    name = YtaoSpider.name
    cmd = 'scrapy crawl {0}'.format(name)
    cmdline.execute(cmd.split())

The crawled page will then be downloaded into the directory from which we started the crawler.

Crawl multiple pages

Above we only crawled the first page, but a real crawl inevitably involves pagination. Looking at how the site paginates, we can see that it displays the number of the last page (58.com only exposes the first 70 pages of data at most), as shown in the figure.

See the paginated HTML section below.

The page number of the last page is then obtained through Xpath and regular matching.

def pageNum(self, response):
    # Get the HTML block for pagination
    page_ele = response.xpath("//li[@id='pager_wrap']/div[@class='pager']")
    # Use a regex to get the text fragments that contain page numbers
    num_eles = re.findall(r">\d+<", page_ele.extract()[0].strip())
    # Find the biggest one
    count = 0
    for num_ele in num_eles:
        num_ele = str(num_ele).replace(">", "").replace("<", "")
        num = int(num_ele)
        if num > count:
            count = num
    return count

By analyzing the rental links, you can see that different pages follow the pattern https://bj.58.com/chuzu/pn+num, where num represents the page number. To crawl a different page we only need to change the page number, so the parse function can be changed to:

# Requires at the top of the spider file: import re, import logging, and from scrapy import Request

# Crawl link without the page number
target_url = "https://bj.58.com/chuzu/pn"

def parse(self, response):
    print("url: ", response.url)
    num = self.pageNum(response)
    # The start url is already the first page, so skip page 1 when traversing pages
    p = 1
    while p < num:
        p += 1
        try:
            # Splice the next page link
            url = self.target_url + str(p)
            # Fetch the next page
            yield Request(url, callback=self.parse)
        except BaseException as e:
            logging.error(e)
            print("Abnormal crawl data:", url)

After execution, the printed information is shown as follows:

Because the crawler works asynchronously, the printed data is not in order. The approach above traverses pages by reading the number of the last page, but some websites do not expose a last-page number; in that case we can check whether the current page has a "next page" link, and if it does, follow that link and crawl it, as sketched below.
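A sketch of that next-page approach, where the XPath for the "next page" anchor is hypothetical and has to be taken from the real page HTML:

def parse(self, response):
    # ... extract the items on the current page first ...

    # Hypothetical XPath for the "next page" anchor; adjust it to the actual page structure
    next_href = response.xpath("//a[@class='next']/@href").extract_first()
    if next_href:
        # Not on the last page yet, so follow the link and parse it with the same callback
        yield Request(response.urljoin(next_href), callback=self.parse)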

Extract the data

Here we want the title, area, location, neighborhood, and price, so we first need to create those fields in the item, as sketched below, and then write the extraction code.
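A minimal sketch of what the item class in items.py might look like; the field names are assumed from how YtaocrawlItem is used later in this article, including the url and id fields touched by the pipeline:

import scrapy

class YtaocrawlItem(scrapy.Item):
    id = scrapy.Field()        # primary key, generated in the pipeline
    title = scrapy.Field()     # listing title
    room = scrapy.Field()      # room / area information
    position = scrapy.Field()  # location
    quarters = scrapy.Field()  # neighborhood
    price = scrapy.Field()     # price with its unit
    url = scrapy.Field()       # source page url

With the item defined, the extraction code is as follows: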

# Avoid index out of bounds when extracting data with XPath
def xpath_extract(self, selector, index):
    if len(selector.extract()) > index:
        return selector.extract()[index].strip()
    return ""

def setData(self, response):
    items = []
    houses = response.xpath("//ul[@class='house-list']/li[@class='house-cell']")
    for house in houses:
        item = YtaocrawlItem()
        # title
        item["title"] = self.xpath_extract(house.xpath("div[@class='des']/h2/a/text()"), 0)
        # area
        item["room"] = self.xpath_extract(house.xpath("div[@class='des']/p[@class='room']/text()"), 0)
        # position
        item["position"] = self.xpath_extract(house.xpath("div[@class='des']/p[@class='infor']/a/text()"), 0)
        # neighborhood
        item["quarters"] = self.xpath_extract(house.xpath("div[@class='des']/p[@class='infor']/a/text()"), 1)
        money = self.xpath_extract(house.xpath("div[@class='list-li-right']/div[@class='money']/b/text()"), 0)
        unit = self.xpath_extract(house.xpath("div[@class='list-li-right']/div[@class='money']/text()"), 1)
        # price
        item["price"] = money + unit
        items.append(item)
    return items

def parse(self, response):
    items = self.setData(response)
    for item in items:
        yield item

    # Then the paging logic covered above.....

At this point, we can get the data we want and see the result by printing the item from parse.

Store the data

We have now captured the page data, and the next step is to store it in a database. Here we take MySQL storage as an example; if you have a large volume of data, consider other storage products. First, register the pipeline class under ITEM_PIPELINES in settings.py so that Scrapy calls it:

ITEM_PIPELINES = {
    # The smaller the value, the higher the calling priority
    'ytaoCrawl.pipelines.YtaocrawlPipeline': 300,
}

Data persistence is handled in the YtaocrawlPipeline class; the code of the MySQL helper mysqlUtils can be viewed on GitHub. The data is passed from YtaoSpider#parse via yield into YtaocrawlPipeline#process_item for processing.

# Requires at the top of pipelines.py: import uuid, plus select, delete_by_id and insert from the mysqlUtils helper

class YtaocrawlPipeline(object):

    def process_item(self, item, spider):
        table = "crawl"
        item["id"] = str(uuid.uuid1())
        # If the crawled link already exists in the database, delete the old record and save the new one
        list = select(str.format("select * from {0} WHERE url = '{1}'", table, item["url"]))
        if len(list) > 0:
            for o in list:
                delete_by_id(o[0], table)
        insert(item, table)
        return item
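The select, delete_by_id and insert functions come from the author's mysqlUtils helper on GitHub. Purely for illustration, a simplified sketch of such a helper might look like the following; it assumes pymysql and placeholder connection parameters, and it is not the original implementation:

import pymysql

def _connect():
    # Placeholder connection parameters; replace them with your own
    return pymysql.connect(host="127.0.0.1", user="root", password="root",
                           db="ytao", charset="utf8mb4")

def select(sql):
    conn = _connect()
    try:
        with conn.cursor() as cursor:
            cursor.execute(sql)
            return cursor.fetchall()
    finally:
        conn.close()

def insert(item, table):
    # Works for dict-like objects, including Scrapy items
    data = dict(item)
    keys = ", ".join(data.keys())
    placeholders = ", ".join(["%s"] * len(data))
    sql = "insert into {0} ({1}) values ({2})".format(table, keys, placeholders)
    conn = _connect()
    try:
        with conn.cursor() as cursor:
            cursor.execute(sql, list(data.values()))
        conn.commit()
    finally:
        conn.close()

def delete_by_id(id, table):
    conn = _connect()
    try:
        with conn.cursor() as cursor:
            cursor.execute("delete from {0} where id = %s".format(table), (id,))
        conn.commit()
    finally:
        conn.close()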

In the database, you can see that the data was successfully fetched and stored.

Coping with anti-crawl mechanisms

Wherever there is demand for crawled data, there will be anti-crawler measures. Let's analyze the ones encountered in this crawler case.

Font encryption

From the database screenshot above, we can see that some of the data is garbled. Looking at the pattern of the garbled data, we can determine that it is the digits that are encrypted.

At the same time, printing the data reveals characters such as \xa0, which fall outside the printable ASCII range 0x20~0x7e.

Knowing that this points to font encryption, I checked the font-family on the downloaded page and found code like the following:

The fangchan-secret font looks suspicious: it is a font generated dynamically in JS and stored in Base64. Let's decode it:

import base64
from io import BytesIO

from fontTools.ttLib import TTFont

if __name__ == '__main__':
    secret = "AAEAAAALAIAAAwAwR1NVQiCLJXoAAAE4AAAAVE9TLzL4XQjtAAABjAAAAFZjbWFwq8p/XQAAAhAAAAIuZ2x5ZuWIN0cAAARYAAADdGhlYWQXlvp9AAAA4AA AADZoaGVhCtADIwAAALwAAAAkaG10eC7qAAAAAAHkAAAALGxvY2ED7gSyAAAEQAAAABhtYXhwARgANgAAARgAAAAgbmFtZTd6VP8AAAfMAAACanBvc3QFRAY qAAAKOAAAAEUAAQAABmb+ZgAABLEAAAAABGgAAQAAAAAAAAAAAAAAAAAAAAsAAQAAAAEAAOOjpKBfDzz1AAsIAAAAAADaB9e2AAAAANoH17YAAP/mBGgGLgA AAAgAAgAAAAAAAAABAAAACwAqAAMAAAAAAAIAAAAKAAoAAAD/AAAAAAAAAAEAAAAKADAAPgACREZMVAAObGF0bgAaAAQAAAAAAAAAAQAAAAQAAAAAAAAAAQA AAAFsaWdhAAgAAAABAAAAAQAEAAQAAAABAAgAAQAGAAAAAQAAAAEERAGQAAUAAAUTBZkAAAEeBRMFmQAAA9cAZAIQAAACAAUDAAAAAAAAAAAAAAAAAAAAAAA AAAAAAFBmRWQAQJR2n6UGZv5mALgGZgGaAAAAAQAAAAAAAAAAAAAEsQAABLEAAASxAAAEsQAABLEAAASxAAAEsQAABLEAAASxAAAEsQAAAAAABQAAAAMAAAA sAAAABAAAAaYAAQAAAAAAoAADAAEAAAAsAAMACgAAAaYABAB0AAAAFAAQAAMABJR2lY+ZPJpLnjqeo59kn5Kfpf//AACUdpWPmTyaS546nqOfZJ+Sn6T//wA AAAAAAAAAAAAAAAAAAAAAAAABABQAFAAUABQAFAAUABQAFAAUAAAACAAGAAQAAgAKAAMACQABAAcABQAAAQYAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAADAAAAAAAiAAAAAAAAAAKAACUdgAAlHYAAAAIAACVjwAAlY8AAAA GAACZPAAAmTwAAAAEAACaSwAAmksAAAACAACeOgAAnjoAAAAKAACeowAAnqMAAAADAACfZAAAn2QAAAAJAACfkgAAn5IAAAABAACfpAAAn6QAAAAHAACfpQA An6UAAAAFAAAAAAAAACgAPgBmAJoAvgDoASQBOAF+AboAAgAA/+YEWQYnAAoAEgAAExAAISAREAAjIgATECEgERAhIFsBEAECAez+6/rs/v3IATkBNP7S/sE C6AGaAaX85v54/mEBigGB/ZcCcwKJAAABAAAAAAQ1Bi4ACQAAKQE1IREFNSURIQQ1/IgBW/6cAicBWqkEmGe0oPp7AAEAAAAABCYGJwAXAAApATUBPgE1NCY jIgc1NjMyFhUUAgcBFSEEGPxSAcK6fpSMz7y389Hym9j+nwLGqgHButl0hI2wx43iv5D+69b+pwQAAQAA/+YEGQYnACEAABMWMzI2NRAhIzUzIBE0ISIHNTY zMhYVEAUVHgEVFAAjIiePn8igu/5bgXsBdf7jo5CYy8bw/sqow/7T+tyHAQN7nYQBJqIBFP9uuVjPpf7QVwQSyZbR/wBSAAACAAAAAARoBg0ACgASAAABIxE jESE1ATMRMyERNDcjBgcBBGjGvv0uAq3jxv58BAQOLf4zAZL+bgGSfwP8/CACiUVaJlH9TwABAAD/5gQhBg0AGAAANxYzMjYQJiMiBxEhFSERNjMyBBUUACE iJ7GcqaDEx71bmgL6/bxXLPUBEv7a/v3Zbu5mswEppA4DE63+SgX42uH+6kAAAAACAAD/5gRbBicAFgAiAAABJiMiAgMzNjMyEhUUACMiABEQACEyFwEUFjM yNjU0JiMiBgP6eYTJ9AIFbvHJ8P7r1+z+8wFhASClXv1Qo4eAoJeLhKQFRj7+ov7R1f762eP+3AFxAVMBmgHjLfwBmdq8lKCytAAAAAABAAAAAARNBg0ABgA ACQEjASE1IQRN/aLLAkD8+gPvBcn6NwVgrQAAAwAA/+YESgYnABUAHwApAAABJDU0JDMyFhUQBRUEERQEIyIkNRAlATQmIyIGFRQXNgEEFRQWMzI2NTQBtv7 rAQTKufD+3wFT/un6zf7+AUwBnIJvaJLz+P78/uGoh4OkAy+B9avXyqD+/osEev7aweXitAEohwF7aHh9YcJlZ/7qdNhwkI9r4QAAAAACAAD/5gRGBicAFwA jAAA3FjMyEhEGJwYjIgA1NAAzMgAREAAhIicTFBYzMjY1NCYjIga5gJTQ5QICZvHD/wABGN/nAQT+sP7Xo3FxoI16pqWHfaTSSgFIAS4CAsIBDNbkASX+lf6 l/lP+MjUEHJy3p3en274AAAAAABAAxgABAAAAAAABAA8AAAABAAAAAAACAAcADwABAAAAAAADAA8AFgABAAAAAAAEAA8AJQABAAAAAAAFAAsANAABAAAAAAA GAA8APwABAAAAAAAKACsATgABAAAAAAALABMAeQADAAEECQABAB4AjAADAAEECQACAA4AqgADAAEECQADAB4AuAADAAEECQAEAB4A1gADAAEECQAFABYA9AA DAAEECQAGAB4BCgADAAEECQAKAFYBKAADAAEECQALACYBfmZhbmdjaGFuLXNlY3JldFJlZ3VsYXJmYW5nY2hhbi1zZWNyZXRmYW5nY2hhbi1zZWNyZXRWZXJ zaW9uIDEuMGZhbmdjaGFuLXNlY3JldEdlbmVyYXRlZCBieSBzdmcydHRmIGZyb20gRm9udGVsbG8gcHJvamVjdC5odHRwOi8vZm9udGVsbG8uY29tAGYAYQB uAGcAYwBoAGEAbgAtAHMAZQBjAHIAZQB0AFIAZQBnAHUAbABhAHIAZgBhAG4AZwBjAGgAYQBuAC0AcwBlAGMAcgBlAHQAZgBhAG4AZwBjAGgAYQBuAC0AcwB lAGMAcgBlAHQAVgBlAHIAcwBpAG8AbgAgADEALgAwAGYAYQBuAGcAYwBoAGEAbgAtAHMAZQBjAHIAZQB0AEcAZQBuAGUAcgBhAHQAZQBkACAAYgB5ACAAcwB 2AGcAMgB0AHQAZgAgAGYAcgBvAG0AIABGAG8AbgB0AGUAbABsAG8AIABwAHIAbwBqAGUAYwB0AC4AaAB0AHQAcAA6AC8ALwBmAG8AbgB0AGUAbABsAG8ALgB 
jAG8AbQAAAAIAAAAAAAAAFAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACwECAQMBBAEFAQYBBwEIAQkBCgELAQwAAAAAAAAAAAAAAAAAAAAA"
    # Convert the Base64 font string into a UTF-8 encoded bytes object
    bytes = secret.encode(encoding='UTF-8')
    # Base64 decoding
    decodebytes = base64.decodebytes(bytes)
    # Initialize BytesIO with decodebytes and parse the font with TTFont
    font = TTFont(BytesIO(decodebytes))
    # font mapping
    font_map = font['cmap'].tables[0].ttFont.tables['cmap'].tables[0].cmap

    print(font_map)

Parsing the font with the fontTools library's TTFont yields the font mapping below:

{38006:'glyph00007',
	38287: 'glyph00005',
	39228: 'glyph00006',
	39499: 'glyph00003',
	40506: 'glyph00010',
	40611: 'glyph00001',
	40804: 'glyph00009',
	40850: 'glyph00004',
	40868: 'glyph00002',
	40869: 'glyph00008'
}

There are exactly ten mappings, matching the digits 0~9, but the glyph names run from 1 to 10 rather than 0 to 9, so what is the rule that links them? Also, the keys of the mapping above are not hexadecimal ASCII codes but plain numbers; could they be decimal code points? To verify this, we convert the hexadecimal codes found on the page into decimal and match them against the mapping. It turns out that the numeric part of each mapping value is exactly 1 larger than the digit actually displayed on the page (for example, glyph00001 corresponds to the digit 0), so the real value is obtained by subtracting 1 from the number in the mapping value. The finished code:

def decrypt(self, response, code):
    secret = re.findall("charset=utf-8;base64,(.*?)'\]", response.text)[0]
    code = self.secretfont(code, secret)
    return code

def secretfont(self, code, secret):
    # Convert the Base64 font string into a UTF-8 encoded bytes object
    bytes = secret.encode(encoding='UTF-8')
    # Base64 decoding
    decodebytes = base64.decodebytes(bytes)
    # Initialize BytesIO with decodebytes and parse the font with TTFont
    font = TTFont(BytesIO(decodebytes))
    # Font mapping
    font_map = font['cmap'].tables[0].ttFont.tables['cmap'].tables[0].cmap
    chars = []
    for char in code:
        # Convert each character to its decimal code point
        decode = ord(char)
        # If the code point is a key in the mapping, the character is rendered with the custom font
        if decode in font_map:
            # Get the mapped value
            val = font_map[decode]
            # Per the rule, take the numeric part and subtract 1 to get the real digit
            char = int(re.findall(r"\d+", val)[0]) - 1
        chars.append(char)
    return "".join(map(lambda s: str(s), chars))

Now we decrypt all the data we crawled and look at the data:

As the figure above shows, decryption completely solves the garbled data problem!
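For reference, the decrypt method can be hooked into the extraction step from earlier; a one-line sketch, using the price as an example of a field rendered with the custom font:

# Inside setData, run encrypted fields through the font decryption, e.g. the price
item["price"] = self.decrypt(response, money + unit)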

Captchas and IP bans

Captchas generally come in two kinds: one that must be entered before you can access the site at all, and one that appears only after frequent requests, before further requests are allowed. For the first kind you have to crack the captcha to continue; for the second, besides cracking the captcha, you can use a proxy to bypass the check. A proxy can also be used to bypass the anti-crawl measure of banning IP addresses. For example, while crawling the site above with the same approach, once it suspects I may be a crawler it intercepts the request with a captcha, as shown below:

Next, we use a random User-Agent and proxy IPs for the bypass. First set settings.USER_AGENT; do not mix desktop and mobile User-Agents, otherwise the crawled data will be inconsistent because the two kinds of pages differ:

USER_AGENT = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.10 Safari/537.36",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    # ...
]

Set up a random User-Agent middleware for requests:

import random

class RandomUserAgentMiddleware(object):
    def __init__(self, agents):
        self.agent = agents

    @classmethod
    def from_crawler(cls, crawler):
        return cls(agents=crawler.settings.get('USER_AGENT'))

    def process_request(self, request, spider):
        # Pick a random User-Agent from the settings
        request.headers.setdefault('User-Agent', random.choice(self.agent))

Set up dynamic IP middleware

# Assumes random, logging and the project settings module are imported at the top of middlewares.py

class ProxyIPMiddleware(object):
    def __init__(self, ip=''):
        self.ip = ip

    def process_request(self, request, spider):
        # If the current request was redirected to a captcha page, re-request it through a proxy IP
        if self.ban_url(request.url):
            # Get the original address before the redirect
            redirect_urls = request.meta.get("redirect_urls")[0]
            # Change the current (captcha) url back to the original request address
            request._set_url(redirect_urls)
            # Set a dynamic proxy; in practice an online service generates proxies on demand
            request.meta["proxy"] = "http://%s" % (self.proxy_ip())

    def ban_url(self, url):
        # Captcha or block page links configured in settings; the crawler re-crawls when it hits them
        dic = settings.BAN_URLS
        # Check whether the current request address is a captcha address
        for d in dic:
            if url.find(d) != -1:
                return True
        return False

    # Dynamically generated proxy ip:port
    def proxy_ip(self):
        # Simulate dynamic generation of proxy addresses
        ips = [
            "127.0.0.1:8888",
            "127.0.0.1:8889",
        ]
        return random.choice(ips)

    def process_response(self, request, response, spider):
        # If the response is not successful, re-crawl
        if response.status != 200:
            logging.error("Failed response: " + str(response.status))
            return request
        return response
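The middleware above reads BAN_URLS from the settings; an example of what that setting might look like, where the path fragment is illustrative and should match whatever verification page the target site redirects to:

# settings.py
# Link fragments that identify a captcha / block page
BAN_URLS = [
    "verifycode",
]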

Finally, enable the middleware in the Settings configuration file.

DOWNLOADER_MIDDLEWARES = {
   'ytaoCrawl.middlewares.RandomUserAgentMiddleware': 500,
   'ytaoCrawl.middlewares.ProxyIPMiddleware': 501,
   'ytaoCrawl.middlewares.YtaocrawlDownloaderMiddleware': 543,
}

At this point, setting up random User-agent and dynamic IP bypass is complete.

Deployment

We deploy the crawler project with Scrapyd, which lets us manage crawlers remotely: starting, stopping, fetching logs, and so on. To deploy, first install scrapyd:

pip install scrapyd

After the installation is successful, you can see that the version is 1.2.1.

To deploy the project we also need scrapyd-client:

pip install scrapyd-client

Modify the scrapy.cfg file:

[settings]
default = ytaoCrawl.settings

[deploy:localytao]
url = http://localhost:6800/
project = ytaoCrawl

# Multiple deploy sections can be configured for batch deployment

Start scrapyd:

scrapyd

For Windows, create scrapyd-deploy.bat under X:\xx\Scripts

@echo off
"X:\xx\python.exe" "X:\xx\Scripts\scrapyd-deploy"% 1% 2Copy the code

Project deployment to Scrapyd service:

scrapyd-deploy localytao -p ytaoCrawl

Remote start: curl http://localhost:6800/schedule.json -d project=ytaoCrawl -d spider=ytaoSpider

Once started, you can view the crawler's execution status and logs at http://localhost:6800/.

In addition to launching remote calls, Scrapyd also provides a rich API:

  • Crawler status of the service: curl http://localhost:6800/daemonstatus.json
  • Cancel a crawler: curl http://localhost:6800/cancel.json -d project=projectName -d job=jobId
  • List projects: curl http://localhost:6800/listprojects.json
  • Delete a project: curl http://localhost:6800/delproject.json -d project=projectName
  • List a project's spiders: curl http://localhost:6800/listspiders.json?project=projectName
  • Get all version numbers of a project: curl http://localhost:6800/listversions.json?project=projectName
  • Delete a project version: curl http://localhost:6800/delversion.json -d project=projectName -d version=versionName

More details: scrapyd.readthedocs.io/en/stable/a…

Conclusion

This article is limited in length and the analysis is not exhaustive. Some websites are harder to crawl, but as long as we analyze them one by one we can find a way through. Also, the data your eyes see is not necessarily the data you get; for example, some sites render their HTML dynamically, which requires extra processing. Once you get into the world of crawlers it is a lot of fun. One last reminder: no matter how many millions of records you crawl, put obeying the law first and stay out of jail.


Personal blog: Ytao.top

My official account is Ytao