Selenium and Puppeteer have dozens of features that can be detected by websites
Source: no code has been heard
Many people like to write crawlers with Selenium or Puppeteer (Pyppeteer), simulating a browser in the belief that they can crawl whatever data they want without being detected by the site.
In fact, a browser launched by Selenium has dozens of features that a website can detect through JavaScript, and a browser launched by Puppeteer exposes a number of detectable features as well.
If you don’t believe me, let’s do an experiment. First, open the following website in a normal browser: bot.sannysoft.com/. The page looks like this:
It’s a long page that you have to scroll through, and it’s mostly green.
Next, use Selenium to launch Chrome in headed mode and open the same page to see what it looks like:
Right at the top, the WebDriver item is highlighted in red, which means the site has successfully detected that you are using a simulated browser.
As you scroll down, every item shown in red is a feature the site was able to detect.
On the left is the normal browser and on the right is the simulated browser.
Compare them one by one and you’ll see a lot of differences.
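To make the headed experiment concrete, here is a minimal sketch in Python, assuming the `selenium` package and a matching chromedriver are installed. The names `PROBE_JS`, `collect_signals` and `probe_with_selenium` are made up for illustration, the `https://` scheme on the URL is an assumption, and the probe reads only a handful of well-known properties, not the full set of checks the test page performs.

```python
# JavaScript that any page could run to read a few automation signals.
# The selection of properties is illustrative, not the test page's own logic.
PROBE_JS = """
return {
    webdriver: navigator.webdriver === true,
    userAgent: navigator.userAgent,
    pluginCount: navigator.plugins.length,
    languages: Array.from(navigator.languages || [])
};
"""


def collect_signals(driver):
    """Run the probe in the current page and return the result as a dict."""
    return driver.execute_script(PROBE_JS)


def probe_with_selenium(url="https://bot.sannysoft.com/"):
    """Open `url` in headed Chrome and return what a page script can see."""
    # Imported lazily so the helper above can be used without Selenium installed.
    from selenium import webdriver

    driver = webdriver.Chrome()  # headed mode is the default
    try:
        driver.get(url)
        return collect_signals(driver)
    finally:
        driver.quit()
```

Calling `probe_with_selenium()` should return a dict in which `webdriver` is true for a stock Selenium session, which is the same signal the red WebDriver item reports.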
And that is only headed mode. Let’s look at headless mode:
Here is the screenshot. Don’t freak out:
With all these features exposed, you can’t hide anything. A website that wants to find you can do so very easily.
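How easy is it for a site to act on these features? The sketch below is a hypothetical server-side check in Python, not the logic of any real site. It assumes the page's JavaScript has already reported a fingerprint as a dict with the keys `webdriver`, `userAgent`, `pluginCount` and `languages`. Those key names and the four rules are assumptions chosen for illustration.

```python
def suspicious_signals(fp):
    """Return the names of the signals in `fp` that look automated."""
    hits = []
    if fp.get("webdriver"):
        # navigator.webdriver is true in a WebDriver-controlled browser.
        hits.append("webdriver")
    if "HeadlessChrome" in fp.get("userAgent", ""):
        # Headless Chrome announces itself in its default user agent.
        hits.append("headless-user-agent")
    if fp.get("pluginCount") == 0:
        hits.append("no-plugins")
    if "languages" in fp and not fp["languages"]:
        hits.append("no-languages")
    return hits


# Illustrative fingerprints; the values are invented, not captured data.
normal = {
    "webdriver": False,
    "userAgent": "Mozilla/5.0 (X11; Linux x86_64) Chrome/100.0",
    "pluginCount": 3,
    "languages": ["en-US", "en"],
}
headless = {
    "webdriver": True,
    "userAgent": "Mozilla/5.0 (X11; Linux x86_64) HeadlessChrome/100.0",
    "pluginCount": 0,
    "languages": [],
}
```

A single hit is often enough for a site to block or flag a visitor, so hiding one feature while leaving the others exposed does not help much.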
Since Selenium won’t do, how about Puppeteer or Pyppeteer?
Let’s run the same experiment with Pyppeteer, starting directly in headless mode and taking a screenshot to see what it looks like:
Isn’t that just the same as Selenium?
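For readers who want to reproduce the Pyppeteer screenshot, here is a minimal sketch, assuming the `pyppeteer` package is installed (it downloads its own Chromium on first launch). The function name `screenshot_headless`, the output file name and the `https://` scheme are assumptions made for illustration.

```python
import asyncio


async def screenshot_headless(url="https://bot.sannysoft.com/",
                              path="pyppeteer_headless.png"):
    """Open `url` in headless Chromium and save a full-page screenshot."""
    # Imported lazily so this module loads even without pyppeteer installed.
    from pyppeteer import launch

    browser = await launch(headless=True)
    try:
        page = await browser.newPage()
        await page.goto(url)
        # fullPage captures the whole scrollable page, not just the viewport.
        await page.screenshot({"path": path, "fullPage": True})
    finally:
        await browser.close()
    return path


# To run the experiment:
# asyncio.run(screenshot_headless())
```

The saved image can then be compared with the one taken from a normal browser, item by item.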
So, do you still have the nerve to write crawlers with these two tools?
Crawling small websites with no security awareness is one thing, but going after companies that have strong security and legal teams is asking for trouble!
However, it is not impossible to run Selenium and Puppeteer in headed mode on Linux. That will be covered in the next article.