Hi, I’m Lex. The Lex who likes to bully Superman.

Areas of expertise: Python development, network security and penetration testing, Windows domain controllers and Exchange architecture

Today’s focus: a step-by-step analysis of Amazon’s anti-crawler mechanism, and how to get past it

Here’s what happened

Amazon is the world’s largest shopping platform.

Its product information, user reviews, and other data are among the richest available.

Today I’ll walk you through Amazon’s anti-crawler mechanism, hands-on,

and show you how to crawl product listings, reviews, and other useful information.

The anti-crawler mechanism

However, if we want to use a crawler to fetch that data,

we run into a problem: big e-commerce platforms like Amazon, Taobao, and JD

all have well-developed anti-crawler mechanisms in place to protect their data.

So let’s probe Amazon’s anti-crawler mechanism first.

We tested it with several different Python crawler modules,

and in the end we got past the anti-crawler mechanism successfully.

1. The urllib module

The code is as follows:

# -*- coding:utf-8 -*-
import urllib.request
import urllib.error

try:
    req = urllib.request.urlopen('https://www.amazon.com')
    print(req.code)
except urllib.error.HTTPError as e:
    print(e.code)  # urlopen raises HTTPError for error statuses such as 503

The status code is 503.

Analysis: Amazon takes your request, identifies it as a crawler, and rejects it.

In the spirit of scientific rigor, let’s run the same test against Baidu, which countless people use every day.
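The original doesn’t show the code for this comparison, so here is a minimal sketch of the same urllib call, just pointed at Baidu:

# -*- coding:utf-8 -*-
import urllib.request

# Same request as before, but against Baidu instead of Amazon
req = urllib.request.urlopen('https://www.baidu.com')
print(req.code)  # expect 200 if Baidu serves the request normally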

Return result: status code 200

Analysis: Normal access

That means the request from the urllib module was recognized by Amazon as a crawler, and Amazon refused to provide service.

2. The requests module

First, direct crawler access with requests, no extra headers.


The code is as follows:

import requests
url='https://www.amazon.com/KAVU-Rope-Bag-Denim-Size/product-reviews/xxxxxx'
r = requests.get(url)
print(r.status_code)

The status code is 503.

Analysis: Amazon also rejected the requests module’s request,

identified it as a crawler, and refused to provide service.

So we add request headers to requests:

the Cookie, the User-Agent, and other related information.


The code is as follows:

import requests

url = 'https://www.amazon.com/KAVU-Rope-Bag-Denim-Size/product-reviews/xxxxxxx'
web_header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:88.0) Gecko/20100101 Firefox/88.0',
    'Accept': '*/*',
    'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive',
    'Cookie': 'your Cookie value',
    'TE': 'Trailers'
}
r = requests.get(url, headers=web_header)
print(r.status_code)

The status code is 200

Analysis: the return status code is 200, which is normal. Now this is starting to feel like a proper crawler.

3. Check the returned page

We got a status code of 200 using requests plus cookies,

so at least Amazon’s servers are serving us properly.

Let’s write the crawled page to a file and open it in a browser.
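A minimal sketch, reusing the r response from the requests call above (the output path here is just an example):

# Save the response body so we can inspect it in a browser
with open('E:/amazon_page.html', 'w', encoding='utf-8') as fw:
    fw.write(r.text)
# Open E:/amazon_page.html in a browser to see what Amazon actually sent back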

Well, damn… the return status is normal, but what comes back is an anti-bot CAPTCHA page.

It was blocked by Amazon.

4. The Selenium automation module

Install the Selenium module:

pip install selenium

Import Selenium in the code and set the driver parameters:

import os
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Run Chrome headless
options = Options()
options.add_argument('--headless')

# Configure the Selenium driver for Chrome
chromedriver = "C:/Users/pacer/AppData/Local/Google/Chrome/Application/chromedriver.exe"
os.environ["webdriver.chrome.driver"] = chromedriver
browser = webdriver.Chrome(chromedriver, chrome_options=options)

Test access

url = "https://www.amazon.com"
print(url)
browser.get(url)  # load the page in the headless browser

The status code is 200

The return status code is 200 and access is normal. Now let’s look at the page information we crawled.

Save the page source locally:

# Write the crawled page source to a local file
fw = open('E:/amzon.html', 'w', encoding='utf-8')
fw.write(str(browser.page_source))
browser.close()
fw.close()

Open the local file we crawled and take a look.

We’ve made it past the anti-crawler mechanism and onto Amazon’s home page.
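If you want to verify programmatically that the saved page is the real home page rather than the robot-check page, here is a rough heuristic (the 'captcha' marker is an assumption, not something Amazon documents):

# Rough check: Amazon's robot-check page usually references a captcha image
with open('E:/amzon.html', 'r', encoding='utf-8') as f:
    page = f.read().lower()
print('robot-check page again' if 'captcha' in page else 'looks like the real home page')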

Wrapping up

With the Selenium module, we successfully get past

Amazon’s anti-crawler mechanism.

Next up: how to crawl hundreds of thousands of Amazon product listings and reviews.

【 Questions? Please leave a message ~~~ 】