Abstract: Besides the ordinary websites that can be crawled without logging in, a crawler often has to deal with websites that require a login first, such as Douban, Zhihu and, in the case of this article, IT Juzi (itjuzi.com). These websites fall into two groups: those that need only an account and password, and those that also require entering or clicking a verification code. This article uses IT Juzi, which needs only an account and password, as an example and introduces three commonly used methods of simulated login.
- POST request method: find the login URL in the network requests, fill in the request body parameters, and then log in with a POST request. This is relatively troublesome.
- Adding Cookies: log in manually first, add the Cookies obtained to the Headers, and then request the target page directly with GET. This is the most convenient.
- Selenium simulated login: Selenium replaces the manual steps and enters the account and password automatically. This is simple but slow.
Next, let’s implement each of the three methods in code.
1. Target page
Here is the page where we want to get the content:
radar.itjuzi.com/investevent
You need to log in before you can see the data on this page. The login interface is as follows:
As you can see, you only need to enter your account and password to log in. You don’t need to enter a verification code. Let’s use a test account and password to simulate login.
2. Log in with a POST request
First, we need to find the URL of the POST request.
There are two ways to view the request. One is to use the browser's developer tools, and the other is to use the Fiddler software.
Let’s talk about the first method.
Enter your account and password on the login screen, open the developer tools, clear the request list, and then click the login button. A large number of requests will appear. Which one is the login POST request? This takes a bit of experience: since we are logging in, try the request whose name contains the word “login.” Here that is the fourth request. The Headers panel on the right shows its URL and confirms that the request method is POST, so this is the URL we want.
Next, scroll down to Form Data. It contains several parameters, including identity and password, which are the account and password entered at login. These are the parameters the POST request has to carry.
Constructing the parameters is simple. All that remains is to send the login request with the requests.post method, after which the content can be crawled.
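To make the Form Data concrete, the sketch below shows how a dictionary of login fields becomes the URL-encoded body carried by the POST request. It uses only the standard library; build_form_body is a hypothetical helper name, and the account and password are placeholders rather than a real account. Passing the same dictionary through the data argument of requests.post produces a body of this form.

```python
from urllib.parse import urlencode


def build_form_body(identity, password):
    """Return the URL-encoded form body for the two login fields."""
    # Field names follow the Form Data seen in the developer tools.
    form = {'identity': identity, 'password': password}
    return urlencode(form)


# Placeholder credentials (assumption), not a real account.
body = build_form_body('user@example.com', 'secret')
print(body)
```

Special characters such as the @ in an email address are percent-encoded, which is why the raw body in a capture tool can look different from what was typed into the form.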
Next, let’s capture the POST request with Fiddler.
If you’re not familiar with Fiddler or don’t have it installed on your computer, read up on it and install it first.
Fiddler is an HTTP proxy that sits between the client and the server and is one of the most commonly used tools for capturing HTTP traffic. It records all HTTP requests between the client and the server, and lets you analyze request data, set breakpoints, debug web applications, modify request data, and even modify the data the server returns for a specific HTTP request. It is a powerful and useful tool for web debugging.
Download Fiddler
www.telerik.com/download/fi…
Tutorials:
zhuanlan.zhihu.com/p/37374178
www.hangge.com/blog/cache/…
Next, we intercept the login request using Fiddler.
When you click the login button, the left pane of Fiddler's main window shows a large number of captured requests. The URL of the 15th request contains “login”, so it is most likely the login POST request. Click that request, then click “Inspectors” and “Headers” on the right: the URL of the POST request is the same as the one obtained with the first method.
Next, switch to the WebForms tab on the right to see the request body. It is also consistent with the first method.
With the URL and request body parameters in hand, you can now simulate the login with the requests.post method.
The code is as follows:
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36',
}
# Form fields carried by the login POST request
data = {
    'identity': '[email protected]',
    'password': 'test2018',
}
url = 'https://www.itjuzi.com/user/login?redirect=&flag=&radar_coupon='
# A Session keeps the cookies from the login response for later requests
session = requests.Session()
session.post(url, headers=headers, data=data)
# After login, we need to get the content of another page
response = session.get('http://radar.itjuzi.com/investevent', headers=headers)
print(response.status_code)
print(response.text)
Use the session.post method to submit the login request, and then use the session.get method to request the target page and output the HTML code. As you can see, the web content was successfully captured.
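The second request is treated as logged in because a requests.Session stores the cookies it receives and attaches them to later requests. The offline sketch below illustrates this under stated assumptions: the cookie name, value, and the example.com domain are made up, and the cookie is set by hand to stand in for what a real login response would set. No request is actually sent.

```python
import requests

session = requests.Session()

# Stand-in for the cookie a server would set after a successful login.
# Name, value, and domain are illustrative assumptions.
session.cookies.set('sessionid', 'abc123', domain='example.com', path='/')

# Build the next request through the session without sending it.
request = requests.Request('GET', 'http://example.com/investevent')
prepared = session.prepare_request(request)

# The session has attached the stored cookie to the outgoing request.
cookie_header = prepared.headers.get('Cookie')
print(cookie_header)
```

A plain requests.get call made outside the session would not carry this cookie, which is why both the login and the page request go through the same session object.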
Here is the second method.
3. Obtain Cookies and request the page directly
With the method above, we had to dig through the network requests to find the POST URL and parameters, which is rather troublesome. Instead, we can log in through the browser first, copy the Cookie, add it to the Headers, and then request the target page directly with GET. The process is much simpler.
The code is as follows:
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36',
    # Paste the Cookie value copied from the logged-in browser session
    'Cookie': 'your cookies',
}
session = requests.Session()
# No login POST is needed; the Cookie header already identifies the session
response = session.get('http://radar.itjuzi.com/investevent', headers=headers)
print(response.status_code)
print(response.text)
As you can see, once the cookies are added there is no need for a POST request. A GET request to the target page is enough, and the web content is retrieved successfully just the same.
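The Cookie value copied from the browser is a single string of name=value pairs separated by semicolons. As an alternative to pasting it into the headers, it could be split into a dictionary and passed through the cookies argument of session.get. The following is a minimal sketch using the standard library; parse_cookie_string is a hypothetical helper, and the cookie names and values are made up.

```python
from http.cookies import SimpleCookie


def parse_cookie_string(raw):
    """Convert a raw Cookie header value such as 'a=1; b=2' into a dict."""
    jar = SimpleCookie()
    jar.load(raw)
    return {name: morsel.value for name, morsel in jar.items()}


# Made-up cookie string in the format shown in the browser's request headers.
cookies = parse_cookie_string('sessionid=abc123; token=xyz789')
print(cookies)
# The dict could then be used as:
# session.get(url, headers=headers, cookies=cookies)
```

Keeping the cookies in a dictionary makes it easier to inspect or replace a single value when the session expires.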
The third method is described below.
4. Selenium simulated login
This approach is straightforward. Selenium enters the account and password and logs in automatically, in place of doing it by hand.
If you’re not familiar with Selenium, its use is covered in detail in a previous article:
www.makcyun.top/web_scrapin…
The code is as follows:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

browser = webdriver.Chrome()
browser.maximize_window()  # Maximize window
wait = WebDriverWait(browser, 10)  # Wait up to 10s for elements to load

def login():
    browser.get('https://www.itjuzi.com/user/login')
    # Locate the account box and enter the account
    input = wait.until(EC.presence_of_element_located(
        (By.XPATH, '//*[@id="create_account_email"]')))
    input.send_keys('[email protected]')
    # Locate the password box and enter the password
    input = wait.until(EC.presence_of_element_located(
        (By.XPATH, '//*[@id="create_account_password"]')))
    input.send_keys('test2018')
    submit = wait.until(EC.element_to_be_clickable(
        (By.XPATH, '//*[@id="login_btn"]')))
    submit.click()  # Click the login button
    get_page_index()

def get_page_index():
    browser.get('http://radar.itjuzi.com/investevent')
    try:
        print(browser.page_source)  # Output the source code of the web page
    except Exception as e:
        print(str(e))

login()
Here, we first locate the account box in the web page with the XPath '//*[@id="create_account_email"]', and then enter the account using the input.send_keys method. Similarly, we locate the password box and enter the password. Finally, we locate the login button with '//*[@id="login_btn"]' and click it. As you can see, the web content is retrieved successfully with this method as well.
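Because driving a browser is slow, one option is to log in with Selenium once and hand the resulting cookies to requests for the actual crawling. Selenium's browser.get_cookies() returns a list of dictionaries, each with name and value keys among others. The sketch below converts such a list into a plain dictionary; selenium_cookies_to_dict is a hypothetical helper, and the sample list stands in for real browser output.

```python
def selenium_cookies_to_dict(cookie_list):
    """Keep only the name/value pair of each cookie."""
    return {cookie['name']: cookie['value'] for cookie in cookie_list}


# Sample data shaped like the return value of browser.get_cookies();
# the names, values, and domain are made up.
sample = [
    {'name': 'sessionid', 'value': 'abc123', 'domain': 'example.com', 'path': '/'},
    {'name': 'token', 'value': 'xyz789', 'domain': 'example.com', 'path': '/'},
]
cookies = selenium_cookies_to_dict(sample)
print(cookies)
# After a real login this would be:
# cookies = selenium_cookies_to_dict(browser.get_cookies())
```

The resulting dictionary can be passed through the cookies argument of a requests call, so the browser is needed only for the login step.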
These are several ways to simulate logging in to a website that requires it. Once logged in, you can begin crawling the content you need.
5. Conclusion
- This article implemented three methods of simulated login. The second is recommended as the first choice: obtain the Cookies and then request the target page directly with GET.
- The website in this article needs only an account and password to log in. There are no encryption parameters to obtain, such as authenticity_token, and no verification code to enter, so the methods are simple. Many other sites, however, require handling encryption parameters, verification codes and similar issues when simulating login. More on that later.