This is the second day of my participation in the August More Text Challenge. For details, see: August More Text Challenge.

1 Introduction

When crawling data, you will run into sites with anti-scraping measures (F12 disabled, a web Debugger trap, obfuscated JS), such as the following:

1. Viewing the page source is blocked

2. Web Debugger

The site blocks viewing the page directly, but you can press F12 first and then visit the website. However, the page then hits a web Debugger breakpoint.

After a lot of searching on Baidu, I found that debugging can be turned off in the browser.

So you can click the blue "Deactivate breakpoints" button in DevTools to close it.

3. Obfuscated JS

Inspecting the page shows that the data is loaded asynchronously. Looking at the request in the Network panel, the JS turns out to be obfuscated and unreadable.

Does this stop us from collecting the data? Obviously not (haha).

As the saying goes, where there are measures, there are countermeasures.

Today we’ll show you how to solve these problems and crawl data in Python.

2 Solving these anti-scraping measures with Python

1. Why Selenium

The data is loaded asynchronously, and the asynchronous request is also obfuscated.

I first considered capturing packets, but unfortunately the asynchronous request could not be recovered that way.

Therefore, Selenium is used to crawl the data (new problems come up later, but they are all solved).

2. Selenium preparation

In order to use Selenium in Python, you need to do some preparatory work

Install the Selenium library

Install the Selenium library with the following command:

pip install selenium

Download chromedriver.exe

Check your browser version (Chrome in this case)

Download chromedriver.exe from the following address:

chromedriver.storage.googleapis.com/index.html

Download the version that matches your browser (I chose 89 here)

Configure chromedriver.exe

Copy the downloaded file to the Python installation directory

The Python installation path can be viewed with the following Python code:

import sys  
print(sys.path)
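As a quick sanity check (my own sketch, not from the original post), you can ask Python whether the chromedriver executable is now discoverable on the search path:

```python
# Sketch: check whether chromedriver can be found on the PATH.
# The executable names are assumptions ("chromedriver.exe" on Windows,
# "chromedriver" on Linux/macOS).
import shutil

path = shutil.which("chromedriver") or shutil.which("chromedriver.exe")
print(path or "chromedriver not found - copy it next to python.exe or add it to PATH")
```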

3. Requesting data with Selenium

"" import the Selenium library """  
from selenium import webdriver  
driver = webdriver.Chrome()  
""" chromedriver.ex is not copied to the Python path, so you need to write """  
Webdriver.exe (executable_path="chromedriver.exe ")
driver.get('https://www.aqistudy.cn/historydata/daydata.php?city= Beijing')  
Copy the code

Here are the results:

No data was found. The reason is that the website detected the automated browser and opened the Debugger, so the asynchronous data was never loaded.
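To confirm this in code rather than from a screenshot, here is a minimal sketch that reuses the driver from the snippet above; it assumes the measurements are rendered into an HTML <table>, which is only a guess about the page structure:

```python
# Sketch: wait up to 10 seconds for table rows to appear.
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

try:
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "table tr"))
    )
    print("Data table loaded")
except TimeoutException:
    print("No data loaded - the site blocked the automated browser")
```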

So that is what is happening.

Now you need to do something else (close the Debugger)

4. Attach Selenium to a manually launched Chrome

Set it up as follows:

Find the path to chrome.exe

In CMD (or a terminal), go to that path

Launch Chrome with remote debugging enabled

chrome.exe --remote-debugging-port=9222 --user-data-dir="C:\selenum\AutomationProfile"

The command above starts Chrome in remote debugging mode:

The IP is the local IP (127.0.0.1)

The port is 9222

Once started, Chrome opens automatically and waits for the code to attach.
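Before running the Selenium code, you can optionally confirm that Chrome really is listening on that port by querying its DevTools endpoint; this sketch uses the requests library, which is not part of the original post:

```python
# Sketch: verify the remote debugging endpoint is reachable.
import requests

info = requests.get("http://127.0.0.1:9222/json/version", timeout=5).json()
print(info.get("Browser"))  # prints something like "Chrome/89.0...."
```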

Write the code

from selenium import webdriver  

option = webdriver.ChromeOptions()  
# Attach to the Chrome instance listening on the remote debugging port  
option.add_experimental_option('debuggerAddress', '127.0.0.1:9222')  
driver = webdriver.Chrome(executable_path="C:/Users/Administrator/Anaconda3/envs/lyc36/chromedriver.exe", chrome_options=option)  
driver.get('https://www.aqistudy.cn/historydata/daydata.php?city=北京')  # 北京 = Beijing

The waiting browser now loads the data automatically, and the problem is solved!
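Once the data has loaded, one possible way to pull it out is to parse driver.page_source; this is only a sketch, assuming the data sits in an HTML <table> and that pandas (with lxml) is installed:

```python
# Sketch: extract the loaded table(s) from the rendered page.
import pandas as pd

tables = pd.read_html(driver.page_source)  # one DataFrame per <table> on the page
if tables:
    print(tables[0].head())
```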

Take a look at the GIF below

3 Summary

1. Solved crawling pages where F12 and viewing the source are blocked.

2. Solved the web Debugger anti-scraping trap.

3. Attached Selenium to a debugging-mode Chrome to simulate real browser requests.

4. This article summarizes several anti-scraping situations and is worth bookmarking!

Finally: original writing is not easy, so please give it a like!